Program (November 10, Monday)

9:10 am Welcome and Opening Remarks
Eric Xing (MBZUAI)
9:30 am From Clinic to Discovery: Advancing Health Innovation Through Better Use of EHRs
Tianxi Cai (Harvard)
The digital transformation of healthcare has created an extraordinary opportunity to accelerate discovery. With the widespread adoption of electronic health records (EHRs), we now have access to vast, real-world clinical data that can reveal how diseases develop, how patients respond to treatments, and how care can be improved. Yet, this promise comes with a challenge: EHR data are inherently messy—noisy, incomplete, and fragmented across institutions—making it difficult to draw reliable conclusions. In this talk, I will show how we can turn this challenge into an opportunity by developing intelligent, collaborative approaches that learn from data across health systems while protecting patient privacy. By combining statistical rigor, machine learning, and biomedical knowledge, we can uncover hidden patterns in EHR data and translate them into actionable insights for precision medicine. Using examples from Mass General Brigham and the Veterans Health Administration, I will illustrate how these approaches move us closer to a true learning health system—one where every clinical encounter contributes to discovery and better health for all.
10:00 am Causal representation learning and causal generative AI
Kun Zhang (MBZUAI)
Causality is a fundamental notion in science, engineering, and even in machine learning. Uncovering the causal process behind observed data can naturally help answer 'why' and 'how' questions, inform optimal decisions, and achieve adaptive prediction. In many scenarios, observed variables (such as image pixels and questionnaire results) are often reflections of the underlying causal variables, instead of causal variables themselves. Causal representation learning aims to reveal the underlying hidden causal variables and their relations. In this talk, we show how the modularity property of causal systems makes it possible to recover the underlying causal representations from observational data with identifiability guarantees: under appropriate assumptions, the learned representations are consistent with the underlying causal process. We demonstrate how identifiable causal representation learning can naturally benefit generative AI, with image generation, image editing, and text generation as particular examples.
10:30 am Coffee Break
11:00 am Causal Effect Measures Beyond the Mean
Jin Tian (MBZUAI)
Causal effect measures are fundamental to understanding interventions and play key roles in causal explanation, decision-making, and responsibility attribution. Traditional measures, such as the Average Causal Effect (ACE), summarize causal relationships through averages but often obscure important heterogeneity in effects across individuals or subpopulations. In this talk, we introduce a set of new causal effect measures designed to capture and interpret causal heterogeneity. After reviewing standard measures and the Probabilities of Causation framework, we introduce new metrics that extend these ideas to continuous treatments and outcomes, propose characterizing the distribution of causal effects through its moments (variance, skewness, kurtosis), and present new measures for decision-making under multiple actions. Finally, we outline identification and bounding results for these measures under common causal assumptions, and illustrate their use in real-world applications.
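To make "beyond the mean" concrete, here is a toy simulation (illustrative only, not from the talk) in which the individual effect differs across a latent subgroup, so the variance and skewness of the effect distribution carry information the ACE alone hides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural model with heterogeneous effects: each unit has its own
# treatment effect tau_i, so the causal effect is a distribution.
u = rng.normal(size=n)                  # latent trait driving heterogeneity
y0 = u + rng.normal(scale=0.5, size=n)  # potential outcome under control
tau = 1.0 + 2.0 * (u > 1.0)             # effect is larger in a subgroup
y1 = y0 + tau                           # potential outcome under treatment

effects = y1 - y0                       # equals tau here by construction
ace = effects.mean()                                 # Average Causal Effect
var = effects.var()                                  # spread of effects
skew = ((effects - ace) ** 3).mean() / var ** 1.5    # asymmetry of effects

print(f"ACE={ace:.3f}, Var={var:.3f}, Skew={skew:.3f}")
```

The ACE (about 1.3 here) would suggest a uniform moderate benefit, while the positive skewness reveals that a minority of units receive a much larger effect.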
11:30 am Invariance and causality pursuit from heterogeneous environments
Yihong Gu (Harvard)
Pursuing causality from data is a fundamental problem in scientific discovery, treatment intervention, and transfer learning. In this talk, we introduce a novel algorithmic method for addressing nonparametric invariance and causality learning in regression models across multiple environments, where the joint distribution of response variables and covariates varies, but the conditional expectations of outcome given an unknown set of quasi-causal variables are invariant. The challenge of finding such an unknown set of quasi-causal or invariant variables is compounded by the presence of endogenous variables that have heterogeneous effects across different environments. The proposed Focused Adversarial Invariant Regularization (FAIR) framework utilizes an innovative minimax optimization approach that drives regression models toward prediction-invariant solutions through adversarial testing. Leveraging the representation power of neural networks, FAIR neural networks (FAIR-NN) are introduced for causality pursuit. It is shown that FAIR-NN can find the invariant variables and quasi-causal variables under a minimal identification condition and that the resulting procedure is adaptive to low-dimensional composition structures in a non-asymptotic analysis. Under a structural causal model, variables identified by FAIR-NN represent pragmatic causality and provably align with exact causal mechanisms under conditions of sufficient heterogeneity. Computationally, FAIR-NN employs a novel Gumbel approximation with decreased temperature and a stochastic gradient descent ascent algorithm. Finally, we also discuss the intrinsic computational hardness in theory.
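FAIR itself trains neural networks against an adversarial, Gumbel-relaxed minimax objective; as a loose linear stand-in for the underlying invariance principle, the toy sketch below brute-forces candidate variable sets across two simulated environments and keeps the set whose per-environment regression coefficients agree (all variable names and data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_env(n, shift):
    # X1 is causal (invariant mechanism); X2 is endogenous, with an
    # environment-dependent relationship to Y.
    x1 = rng.normal(size=n)
    y = 2.0 * x1 + rng.normal(scale=0.5, size=n)
    x2 = shift * y + rng.normal(scale=0.5, size=n)
    return np.column_stack([x1, x2]), y

envs = [simulate_env(2000, s) for s in (0.5, 2.0)]

def invariance_score(subset):
    # Fit OLS per environment on the candidate subset; an invariant
    # (quasi-causal) set should yield the same coefficients everywhere.
    betas = []
    for X, y in envs:
        beta, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
        betas.append(beta)
    return max(np.linalg.norm(b - betas[0]) for b in betas)

subsets = [(0,), (1,), (0, 1)]
best = min(subsets, key=invariance_score)
print("selected invariant set:", best)
```

Regressing on the endogenous X2 (or on both variables) gives coefficients that change with the environment, so only the causal set {X1} passes the invariance test; FAIR replaces this exhaustive search with a single adversarial optimization over neural predictors.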
12:00 pm Building a Multi-Modal Atlas of Immune Dysregulation in Type 2 Diabetes
Yulia Medvedeva (MBZUAI)
Type 2 diabetes mellitus (T2DM) represents a major global health challenge, a trend clearly reflected in Russia's nearly 4.9 million recorded cases by 2024. The burden of this disease, however, is not evenly distributed across the country's diverse population of approximately 200 distinct ancestry groups. Our research has directly quantified these disparities, revealing significant ancestry-specific patterns. We identified, for instance, a "Chechen paradox," where Chechen individuals exhibit higher BMI levels yet a lower prevalence of T2DM, suggesting the presence of protective genetic factors. In contrast, the Yakut population displays a metabolic and genetic profile indicative of β-cell dysfunction, similar to East Asian populations. Furthermore, we found that healthy Yakuts have high levels of HDL ("good" cholesterol), while those with T2DM exhibit elevated LDL ("bad" cholesterol)—a lipid profile linked to a distinct genetic signature that underscores the important role of lipid metabolism in T2DM onset among Yakuts. These findings highlight a critical gap in our understanding: the molecular mechanisms driving these population-specific differences remain unknown. To address this gap, we generated targeted single-cell RNA sequencing (scRNA-seq) datasets for carefully selected cohorts. The combined genetic/sc-omics strategy allows us to bridge the gap between genetic association and biological function with high precision. By applying sophisticated bioinformatic analyses to both pre-existing data and our new ancestrally focused scRNA-seq data, we identified cell-type-specific variations and transcriptomic pathways that drive T2DM risk and protection across different populations.
12:30 pm Lunch
2:00 pm Predictive Patient Analytics and Precision Therapeutics from Multi-Omics Data
Natasa Przulj (MBZUAI)
Large amounts of multi-omics data are increasingly becoming available. They provide complementary information about cells, tissues and diseases. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, and re-purpose known and discover new drugs to personalize medical treatment. This is nontrivial because of the computational intractability of many underlying problems on large interconnected data (networks, or graphs), necessitating the development of new algorithms for finding approximate solutions (heuristics). We develop versatile artificial intelligence (AI) frameworks for multi-omics data fusion, constrained by state-of-the-art network science methods, to address key challenges in precision medicine and pharmacology from time-series, multi-omics data, including patient-derived single-cell data: to better stratify patients, predict new biomarkers and targets, and re-purpose existing and discover new drugs. We apply these to different types of cancer, COVID-19, Parkinson’s and other diseases. Our new methods stem from graph-regularized non-negative matrix tri-factorization (NMTF), a machine learning (ML) technique for dimensionality reduction, inference, fusion and co-clustering of heterogeneous datasets, coupled with novel graphlet-based network science algorithms. We utilize our new frameworks to improve the understanding of the molecular organization of life and of diseases from the embedding spaces of omics data. We also utilize local network topology to correct for the topological information missed by the random walks used in many ML methods, and to enable embedding of multi-omics networks into more linearly separable spaces, allowing for their explainable and sustainable mining. The aim is to develop an overarching framework encompassing all multi-omics data towards consumer-facing precision medicine products.
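As a minimal illustration of the NMTF core (without the graph regularization or graphlet machinery the talk describes), the sketch below runs the standard multiplicative updates for A ≈ G1 S G2ᵀ on toy data; all dimensions and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy non-negative data matrix, e.g. genes x patients.
A = rng.random((60, 40))

k1, k2 = 5, 4   # assumed numbers of row and column clusters
G1 = rng.random((60, k1))
S = rng.random((k1, k2))
G2 = rng.random((40, k2))

eps = 1e-9  # guard against division by zero
for _ in range(200):
    # Multiplicative updates for min ||A - G1 S G2^T||_F^2 subject to
    # non-negativity; each update keeps all factors non-negative.
    G1 *= (A @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
    S *= (G1.T @ A @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
    G2 *= (A.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)

err = np.linalg.norm(A - G1 @ S @ G2.T) / np.linalg.norm(A)
print(f"relative reconstruction error: {err:.3f}")
```

The row factor G1 co-clusters rows (e.g. genes), G2 co-clusters columns (e.g. patients), and the small middle matrix S encodes how the two clusterings interact, which is what makes tri-factorization suitable for fusing heterogeneous datasets.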
2:30 pm Empower Whole-Genome Statistical Analysis with Synthetic Data Generated by Generative ML/AI
Xihong Lin (Harvard)
Scalable and robust statistical methods empowered by synthetic data generated from generative AI offer unprecedented potential for trustworthy whole-genome analysis. In this talk, I will discuss robust and powerful approaches for genome-wide association studies (GWAS) that leverage synthetic phenotype data generated by generative ML/AI models such as diffusion models and transformers, while ensuring valid statistical inference even when the generative models are misspecified. I will illustrate key ideas using GWAS analyses from the UK Biobank and highlight connections with Prediction-Powered Inference (PPI). This work demonstrates how integrating statistics with generative AI through synthetic data can advance trustworthy scientific discovery.
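The debiasing idea behind PPI-style estimators can be sketched in a few lines: use a large model-generated sample for power and a small labeled sample as a rectifier, so the estimate stays valid even when the generator is biased. A toy mean-estimation sketch (all quantities simulated; not the UK Biobank analysis itself):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ground truth: phenotype Y with mean 1.0. The "generative model" output f
# is deliberately biased to mimic a misspecified generator.
n_lab, n_unlab = 500, 50_000
y_lab = rng.normal(loc=1.0, size=n_lab)
f_lab = 0.8 * y_lab + 0.5 + rng.normal(scale=0.3, size=n_lab)
f_unlab = (0.8 * rng.normal(loc=1.0, size=n_unlab) + 0.5
           + rng.normal(scale=0.3, size=n_unlab))

# Prediction-powered estimate of E[Y]: large-sample mean of model outputs,
# corrected by the labeled-sample rectifier mean(Y - f).
theta_ppi = f_unlab.mean() + (y_lab - f_lab).mean()

# Naive use of synthetic/model outputs alone inherits the generator's bias.
theta_naive = f_unlab.mean()
print(f"PPI: {theta_ppi:.3f}, naive: {theta_naive:.3f}")
```

Here the naive estimate lands near the generator's biased mean (about 1.3), while the rectified estimate recovers the true mean (1.0); the same decomposition extends to regression coefficients in GWAS-style analyses.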
3:00 pm Towards AI-Driven Digital Organism: A System of Multiscale Foundation Models for Biology
Le Song (MBZUAI)
Biology lies at the core of vital fields such as medicine, pharmacy, public health, longevity, agriculture, environmental protection, and clean energy. What will be the foundational AI models for biology? What data can be used to build them? How exactly should they be built? In this talk, I will discuss an engineering-viable approach to these challenges: designing an AI-Driven Digital Organism (AIDO), a system of integrated multiscale foundation models, in a modular, connectable, and holistic fashion that reflects biological scales, connectedness, and complexity. A system like AIDO opens up a safe, affordable, and high-throughput alternative platform for predicting, simulating, and programming biology at all levels, from molecules to cells to individuals. AIDO is poised to trigger a new wave of better-guided wet-lab experimentation and better-informed first-principle reasoning, which can eventually help us better decode and improve life.
3:30 pm Coffee Break
4:00 pm Modern Nonlinear Embedding Methods Unpacked: Empowering Biological Discoveries with Statistical Insights
Rong Ma (Harvard)
Learning and representing low-dimensional structures from noisy, high-dimensional data is a cornerstone of modern biomedical data science. Stochastic neighbor embedding algorithms, a family of nonlinear dimensionality reduction and data visualization methods, with t-SNE and UMAP as two leading examples, have become especially influential in recent years, particularly in single-cell analysis. Yet despite their popularity, these methods remain subject to points of debate, including limited theoretical understanding, ambiguous interpretations, and sensitivity to tuning parameters. In this talk, I will present our recent efforts to decipher and improve these nonlinear embedding approaches. Our key results include a rigorous theoretical framework that uncovers the intrinsic mechanisms, large-sample limits, and fundamental principles underlying these algorithms; a set of theory-informed practical guidelines for their principled use in trustworthy biological discovery; and a collection of new algorithms that address current limitations and improve performance in areas such as bias reduction and stability. Throughout the talk, I will highlight how these advances not only deepen our statistical understanding but also open new avenues for biological insight.
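One concrete piece of this machinery is the perplexity calibration at the heart of stochastic neighbor embeddings: each point's bandwidth is set by binary search so that its conditional neighbor distribution has entropy log(perplexity). A minimal NumPy sketch on toy data (illustrative, not the speaker's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))  # toy high-dimensional data

def conditional_affinities(X, perplexity=10.0, tol=1e-5):
    # For each point i, binary-search beta_i = 1 / (2 sigma_i^2) so the
    # conditional distribution p_{j|i} has entropy log(perplexity).
    n = X.shape[0]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    P = np.zeros((n, n))
    target = np.log(perplexity)
    for i in range(n):
        lo, hi = 1e-6, 50.0
        for _ in range(100):
            beta = (lo + hi) / 2
            p = np.exp(-beta * D[i])
            p[i] = 0.0          # a point is not its own neighbor
            p /= p.sum()
            ent = -(p[p > 0] * np.log(p[p > 0])).sum()
            if abs(ent - target) < tol:
                break
            # Larger beta -> more peaked distribution -> lower entropy.
            lo, hi = (beta, hi) if ent > target else (lo, beta)
        P[i] = p
    return P

P = conditional_affinities(X, perplexity=10.0)
ent = -(P[0][P[0] > 0] * np.log(P[0][P[0] > 0])).sum()
print(f"row-0 entropy: {ent:.3f}  target: {np.log(10.0):.3f}")
```

This per-point adaptivity is precisely why the perplexity parameter so strongly shapes t-SNE and UMAP output, and why principled guidelines for choosing it matter for trustworthy biological interpretation.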
4:30 pm Agentic Reasoning Models for Earth Observation
Salman Khan (MBZUAI)
Earth Observation (EO) satellites are continuously capturing vast streams of multimodal data, offering an unprecedented view of our planet’s dynamics at both global scale and fine resolution. Turning this raw information into actionable knowledge for addressing urgent challenges, such as climate extremes, disaster management, and food security, requires moving beyond pattern recognition toward AI systems that can reason, adapt, and interact with scientific tools. In this talk, I will present recent advances in agentic AI for EO, highlighting how reasoning models can connect satellite observations with domain knowledge and physical simulators. This talk will introduce foundation models such as TerraFM and EarthDial, and show how simple agentic pipelines can help integrate multi-sensor imagery, reanalysis products, and hydrological or agricultural models to provide trustworthy and actionable insights. We will discuss the emerging role of tool-augmented reasoning and agentic workflows, and a forward-looking vision where EO AI serves as a scientific collaborator for climate resilience and sustainability.
5:00 pm Robustifying Generative AI for Human Genetics
Ahmad Abdel-Azim (Harvard)
Generative AI has transformed content generation, and its promise for biomedicine is compelling. Genomic sequencing remains costly and scarce relative to phenotyping, and privacy restrictions further limit access. We introduce a scalable pipeline for the generation of high-fidelity genomes, building on the latent diffusion model architecture (e.g., Stable Diffusion). Genotypes are efficiently embedded in a regularized, informative latent space where a conditional diffusion model is subsequently trained; this enables the generation of high-fidelity genetic data, with fidelity assessed by genomics-specific diagnostics. Practically, sampling from the learned genomic distribution provides “data on demand”: millions of high-fidelity genomes can be generated to power variant discovery, fine-mapping, and other analyses, facilitating more rapid discovery. Nevertheless, the challenge is not only the generative modeling of genomes, but how to use the generated data without corrupting downstream inference procedures and inducing bias. We introduce a synthetic augmentation framework that is robust to misspecification of the data-generating models. Our proposed framework is guaranteed to be more efficient than standard estimation procedures based solely on the observed data, even when the generators are imperfect. Applied in simulations and the UK Biobank, the approach delivers higher-powered genetic analyses. More broadly, we establish a general framework for robust inference with generative modeling, extensible to other modalities, which can accelerate discovery while reducing data-collection costs and privacy risks.