Day 2 Program (November 11, Tuesday)

9:00am Trans-Glasso: A Transfer Learning Approach to Precision Matrix Estimation
Mladen Kolar (MBZUAI)
Many real-world systems—ranging from gene regulatory interactions in biology to financial asset dependencies—can be represented by networks, whose edges correspond to conditional relationships among variables. These relationships are succinctly captured by the precision matrix of a multivariate distribution. Estimating the precision matrix is thus fundamental to uncovering the underlying network structure. However, this task can be challenging when the available data for the target domain are limited, undermining accurate inference. In this talk, I will present Trans-Glasso, a novel two-step transfer learning framework for precision matrix estimation that leverages data from source studies to improve estimates in the target study. First, Trans-Glasso identifies shared and unique features across studies via a multi-task learning objective. Then, it refines these initial estimates through differential network estimation to account for structural differences between the target and source precision matrices. Assuming that most entries of the target precision matrix are shared with at least one source matrix, we derive non-asymptotic error bounds and show that Trans-Glasso achieves minimax optimality under certain conditions. Through extensive simulations, Trans-Glasso demonstrates improved performance over standard methods, especially in small-sample settings. Applications to gene regulatory networks across multiple brain tissues and protein networks in various cancer subtypes confirm its practical effectiveness in biological contexts, where understanding network structures can provide insights into disease mechanisms and potential interventions. Beyond biology, these techniques are broadly applicable wherever precision matrix estimation and network inference play a crucial role, including neuroscience, finance, and social science. This is joint work with Boxin Zhao and Cong Ma.
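To make the two-step idea concrete, here is a minimal Python sketch of transfer learning for precision matrices, using a pooled graphical-lasso fit as a stand-in for the multi-task step and entrywise soft-thresholding of the difference as a stand-in for the differential-network refinement. It illustrates the general recipe only, not the authors' Trans-Glasso implementation; the function names and tuning parameters are illustrative.

```python
# Illustrative two-step transfer-learning sketch for precision matrices.
# NOT the authors' Trans-Glasso code.
import numpy as np
from sklearn.covariance import GraphicalLasso

def soft_threshold(A, tau):
    """Entrywise soft-thresholding; sparsifies the differential network."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def transfer_precision(X_sources, X_target, alpha=0.05, tau=0.1):
    # Step 1 (stand-in for the multi-task step): borrow strength from the
    # larger source studies by fitting a sparse precision matrix on their
    # pooled samples.
    omega_src = GraphicalLasso(alpha=alpha).fit(np.vstack(X_sources)).precision_

    # Step 2 (stand-in for differential network estimation): fit on the
    # small target sample and keep only the entries that differ markedly
    # from the source estimate.
    omega_tgt = GraphicalLasso(alpha=alpha).fit(X_target).precision_
    delta = soft_threshold(omega_tgt - omega_src, tau)

    # Shared structure comes from the sources; target-specific structure
    # enters only through the sparse correction delta.
    return omega_src + delta

rng = np.random.default_rng(0)
X_sources = [rng.standard_normal((500, 10)) for _ in range(3)]  # large sources
X_target = rng.standard_normal((40, 10))                        # small target
omega_hat = transfer_precision(X_sources, X_target)
```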
9:30am Generating diverse protein conformations through AlphaFold
Samuel Kou (Harvard)
The introduction of AlphaFold has revolutionized the task of protein structure prediction from a given sequence of amino acids; the groundbreaking contribution of AlphaFold was recognized by the 2024 Nobel Prize in Chemistry. As a deep-learning-based method, AlphaFold was trained on the publicly available Protein Data Bank (PDB), a database of known protein structures. An inherent limitation of AlphaFold is that its prediction gives only a static structure, whereas in reality the structures of proteins are dynamic and can change in response to their environment or binding partners, with significant biological consequences. In this talk, we focus on enhancing and diversifying protein structure prediction with AlphaFold. Through a principled iterative statistical sampling framework, we significantly expand AlphaFold's capabilities, enabling it to explore a broader conformational space. The key methodology involves modifying the multiple sequence alignment (MSA) and template inputs to encourage AlphaFold to explore different conformations, thereby increasing structural diversity. This is achieved in particular through an iterative sequential sampling approach, which incorporates protein residue co-evolutionary information into the structure prediction, broadening the conformational possibilities that AlphaFold can investigate. We will illustrate the capabilities of the statistical sampling approach through examples.
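The MSA-modification idea can be sketched in a few lines of Python: subsample MSA rows across rounds so that each prediction is driven by a different slice of the co-evolutionary signal. In the sketch below, predict_structure is a hypothetical placeholder for an AlphaFold inference call, and the round count and subsample depth are illustrative, not the talk's actual protocol.

```python
import random

def predict_structure(msa_subset):
    """Hypothetical placeholder: in practice this would build AlphaFold input
    features from msa_subset and run the model to get predicted coordinates."""
    raise NotImplementedError

def iterative_msa_sampling(msa, n_rounds=20, depth=16, seed=0):
    """Subsample MSA rows each round (always keeping the query sequence) so
    that different subsets of co-evolutionary signal drive each prediction."""
    rng = random.Random(seed)
    query, rest = msa[0], list(msa[1:])
    structures = []
    for _ in range(n_rounds):
        subset = [query] + rng.sample(rest, min(depth, len(rest)))
        structures.append(predict_structure(subset))  # one conformation per round
    return structures
```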
10:00am TBA
Eric Moulines (MBZUAI)
TBA
10:30am Coffee Break
11:00am Knowledge Graph Embedding with Electronic Health Records
Junwei Lu (Harvard)
With the increasing adoption of electronic health records (EHR), large EHR databases have become another rich data source for translational clinical research. We propose to infer the conditional dependency structure among EHR features via a latent graphical block model (LGBM). The LGBM has a two-layer structure: the first layer provides semantic embedding vector (SEV) representations for the EHR features, and the second overlays a graphical block model on the latent SEVs. The block structure of the graphical model also allows us to cluster synonymous EHR features. We propose to learn the LGBM efficiently, in both the statistical and computational sense, based on the empirical pointwise mutual information matrix. We establish the statistical rates of the proposed estimators and show perfect recovery of the block structure.
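As a rough illustration of the embedding layer, the sketch below builds an empirical pointwise mutual information (PMI) matrix from EHR feature co-occurrence counts and factorizes it to obtain semantic embedding vectors. The graphical block model fitted on top of the SEVs is not shown, and the counts, dimensions, and function names are illustrative rather than the authors' procedure.

```python
import numpy as np

def pmi_embeddings(cooccur, dim=10, eps=1e-8):
    """cooccur[i, j]: number of patients (or visits) in which EHR features i
    and j co-occur. Returns rank-`dim` embeddings of the empirical PMI matrix."""
    p_ij = cooccur / cooccur.sum()            # joint frequencies
    p_i = p_ij.sum(axis=1, keepdims=True)     # marginal frequencies
    pmi = np.log((p_ij + eps) / (p_i @ p_i.T + eps))
    # A truncated SVD of the symmetrized PMI matrix yields the SEVs.
    U, s, _ = np.linalg.svd((pmi + pmi.T) / 2)
    return U[:, :dim] * np.sqrt(s[:dim])

rng = np.random.default_rng(1)
C = rng.integers(1, 50, size=(30, 30)).astype(float)
emb = pmi_embeddings(C + C.T, dim=5)          # one vector per EHR feature
```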
11:30am Making Algorithms Robust to Structured Noise, and Beyond
Qiang Sun (MBZUAI)
Real-world data often conceal meaningful signals beneath both random and structured noise. Structured noise arises in many forms, from batch effects in biomedical studies to background variation in image classification. Surprisingly, algorithms that encourage diversity or uniformity in their learned representations often generalize better out of context. To understand this phenomenon, we study linear representation learning with two views, comparing classical and contrastive methods, with and without a uniformity constraint. The classical, non-contrastive algorithms break down under structured noise. Contrastive learning with an alignment-only loss performs well when background variation is mild but fails under strong structured noise. In contrast, contrastive learning that enforces a uniformity constraint remains robust regardless of noise strength. Empirical results confirm these insights. Going one step further, we discuss how to design algorithms that are robust to random noise and to nonstationary dynamics, such as shifting market trends in algorithmic trading.
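For readers unfamiliar with the two losses being contrasted, here is a short PyTorch sketch of the standard alignment and uniformity losses (in the style of Wang & Isola, 2020). It illustrates the distinction discussed in the talk, not the speaker's exact linear two-view setup; the batch size, dimension, and noise level are arbitrary.

```python
import torch
import torch.nn.functional as F

def align_loss(x, y, alpha=2):
    """Alignment: paired views of the same sample should map close together.
    x, y are L2-normalized embeddings of the two views."""
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    """Uniformity: log of the average pairwise Gaussian potential; lower
    values mean embeddings are spread more uniformly on the hypersphere."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

x = F.normalize(torch.randn(128, 32), dim=1)            # view 1
y = F.normalize(x + 0.1 * torch.randn(128, 32), dim=1)  # view 2 (perturbed)
loss_align_only = align_loss(x, y)                       # alignment-only objective
loss_with_unif = align_loss(x, y) + uniform_loss(x) + uniform_loss(y)
```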
12:00pm Lunch
2:00pm Personalized medicine based on deep human phenotyping
Eran Segal (MBZUAI)
TBA
2:30pm Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models
Preslav Nakov (MBZUAI)
As large language models increasingly shape knowledge, communication, and creativity, it is imperative that we make them open, language-specific, safe, factual, and specialized. First, we will argue for the need for fully transparent open-source large language models (LLMs), and we will describe the efforts of MBZUAI's Institute on Foundation Models (IFM) towards that, based on the LLM360 initiative. Second, we will argue for the need for language-specific LLMs, and we will share our experience from building Jais, the world's leading open Arabic-centric foundation and instruction-tuned large language model, Nanda, our open-weights Hindi LLM, Sherkala, our open-weights Kazakh LLM, and other models. Third, we will argue for the need for safe LLMs, and we will present Do-Not-Answer, a dataset for evaluating the guardrails of LLMs, which is at the core of the safety mechanisms of our LLMs. Fourth, we will argue for the need for factual LLMs and discuss the factuality challenges that LLMs pose. We will then present some recent tools for addressing these challenges developed at MBZUAI: (i) OpenFactCheck, a framework for fact-checking LLM output, for building customized fact-checking systems, and for benchmarking LLMs for factuality; (ii) LM-Polygraph, a tool for predicting an LLM's uncertainty in its output using cheap and fast uncertainty quantification techniques; and (iii) LLM-DetectAIve, a tool for machine-generated text detection. Finally, we will argue for the need for specialized models, and we will present some other LLMs currently being developed at MBZUAI's IFM.
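As a generic illustration of the kind of cheap, fast uncertainty quantification such tools build on (and explicitly not the LM-Polygraph API), one of the simplest sequence-level scores is the mean token negative log-likelihood of the model's own output; everything in the sketch, including the example numbers, is illustrative.

```python
import math

def mean_nll(token_logprobs):
    """Mean negative log-likelihood of the generated tokens; higher values
    indicate the model was less confident in its own output."""
    return -sum(token_logprobs) / len(token_logprobs)

def perplexity(token_logprobs):
    """The same confidence signal expressed on the perplexity scale."""
    return math.exp(mean_nll(token_logprobs))

# Illustrative log-probabilities for the tokens of one sampled answer.
logps = [-0.2, -1.5, -0.1, -2.3, -0.4]
print(f"mean NLL = {mean_nll(logps):.3f}, perplexity = {perplexity(logps):.2f}")
```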
3:00pm Coffee Break
3:30pm-5:30pm Panel Discussion