Program (November 10, Monday)

9:10 am Welcome and Opening remarks
Eric Xing (MBZUAI)
9:30 am TBA
Tianxi Cai (Harvard)
10:00 am Causal representation learning and causal generative AI
Kun Zhang (MBZUAI)
Causality is a fundamental notion in science, engineering, and even in machine learning. Uncovering the causal process behind observed data can naturally help answer 'why' and 'how' questions, inform optimal decisions, and achieve adaptive prediction. In many scenarios, observed variables (such as image pixels and questionnaire results) are often reflections of the underlying causal variables, instead of causal variables themselves. Causal representation learning aims to reveal the underlying hidden causal variables and their relations. In this talk, we show how the modularity property of causal systems makes it possible to recover the underlying causal representations from observational data with identifiability guarantees: under appropriate assumptions, the learned representations are consistent with the underlying causal process. We demonstrate how identifiable causal representation learning can naturally benefit generative AI, with image generation, image editing, and text generation as particular examples.
10:30 am Coffee Break
11:00 am Causal Effect Measures Beyond the Mean
Jin Tian (MBZUAI)
Causal effect measures are fundamental to understanding interventions and play key roles in causal explanation, decision-making, and responsibility attribution. Traditional measures, such as the Average Causal Effect (ACE), summarize causal relationships through averages but often obscure important heterogeneity in effects across individuals or subpopulations. In this talk, we introduce a set of new causal effect measures designed to capture and interpret causal heterogeneity. After reviewing standard measures and the Probabilities of Causation framework, we introduce new metrics that extend these ideas to continuous treatments and outcomes, propose characterizing the distribution of causal effects through its moments (variance, skewness, kurtosis), and present new measures for decision-making under multiple actions. Finally, we outline identification and bounding results for these measures under common causal assumptions, and illustrate their use in real-world applications.
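The moment-based view of causal heterogeneity can be illustrated with a toy oracle simulation (hypothetical data, not the identification or bounding results discussed in the talk): because both potential outcomes are simulated, the distribution of individual effects, and hence its moments, is directly observable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural model with heterogeneous effects:
# Y(0) = U,  Y(1) = U + 1 + 2*V, so the individual effect is 1 + 2*V.
U = rng.normal(0.0, 1.0, n)
V = rng.normal(0.0, 1.0, n)          # unobserved effect modifier
effect = (U + 1.0 + 2.0 * V) - U     # individual causal effect Y(1) - Y(0)

ace = effect.mean()                                  # average causal effect
var = effect.var()                                   # effect heterogeneity (variance)
skew = ((effect - ace) ** 3).mean() / var ** 1.5     # skewness of the effect distribution
print(round(ace, 2), round(var, 2), round(skew, 2))
```

Here the ACE of roughly 1 hides substantial heterogeneity (variance near 4): averaging alone would miss that many individuals are harmed while others benefit.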
11:30 am Invariance and causality pursuit from heterogeneous environments
Yihong Gu (Harvard)
Pursuing causality from data is a fundamental problem in scientific discovery, treatment intervention, and transfer learning. In this talk, we introduce a novel algorithmic method for addressing nonparametric invariance and causality learning in regression models across multiple environments, where the joint distribution of response variables and covariates varies, but the conditional expectations of outcome given an unknown set of quasi-causal variables are invariant. The challenge of finding such an unknown set of quasi-causal or invariant variables is compounded by the presence of endogenous variables that have heterogeneous effects across different environments. The proposed Focused Adversarial Invariant Regularization (FAIR) framework utilizes an innovative minimax optimization approach that drives regression models toward prediction-invariant solutions through adversarial testing. Leveraging the representation power of neural networks, FAIR neural networks (FAIR-NN) are introduced for causality pursuit. It is shown that FAIR-NN can find the invariant variables and quasi-causal variables under a minimal identification condition and that the resulting procedure is adaptive to low-dimensional composition structures in a non-asymptotic analysis. Under a structural causal model, variables identified by FAIR-NN represent pragmatic causality and provably align with exact causal mechanisms under conditions of sufficient heterogeneity. Computationally, FAIR-NN employs a novel Gumbel approximation with decreased temperature and a stochastic gradient descent ascent algorithm. Finally, we also discuss the intrinsic computational hardness in theory.
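The invariance criterion underlying this line of work can be sketched on a toy linear SCM. This is not the FAIR-NN algorithm (which uses adversarial minimax training of neural networks); it is a brute-force illustration of the same principle, with all data, mechanisms, and thresholds invented for the example: a variable set is retained only if its regression coefficients agree across environments.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def make_env(n, x1_mean, x2_noise):
    """Toy SCM: X1 -> Y -> X2, with the Y -> X2 mechanism varying by environment."""
    x1 = rng.normal(x1_mean, 1.0, n)
    y = 2.0 * x1 + rng.normal(0.0, 0.5, n)
    x2 = y + rng.normal(0.0, x2_noise, n)  # endogenous, heterogeneous across envs
    return np.column_stack([x1, x2]), y

envs = [make_env(2000, 0.0, 0.3), make_env(2000, 1.0, 1.5)]

def ols(X, y):
    A = np.column_stack([np.ones(len(y)), X])  # intercept + selected covariates
    return np.linalg.lstsq(A, y, rcond=None)[0]

# A subset S is declared invariant if per-environment regression
# coefficients (including the intercept) agree across environments.
invariant = []
for k in range(3):
    for S in map(list, itertools.combinations(range(2), k)):
        c0, c1 = (ols(X_e[:, S], y_e) for X_e, y_e in envs)
        if np.abs(c0 - c1).max() < 0.1:
            invariant.append(S)

print(invariant)  # only {X1} survives as the invariant / quasi-causal set
```

The endogenous X2 predicts Y well within each environment but with environment-dependent coefficients, so the invariance test discards it; only the true cause X1 yields a prediction-invariant regression.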
12:00 pm TBA
Yulia Medvedeva (MBZUAI)
TBA
12:30 pm Lunch
2:00 pm Predictive Patient Analytics and Precision Therapeutics from Multi-Omics Data
Natasa Przulj (MBZUAI)
Large amounts of multi-omics data are increasingly becoming available. They provide complementary information about cells, tissues and diseases. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, and re-purpose known and discover new drugs to personalize medical treatment. This is nontrivial because of the computational intractability of many underlying problems on large interconnected data (networks, or graphs), necessitating the development of new algorithms for finding approximate solutions (heuristics). We develop versatile artificial intelligence (AI) frameworks for multi-omics data fusion, constrained by state-of-the-art network science methods, to address key challenges in precision medicine and pharmacology from time-series, multi-omics data, including patient-derived single-cell data: to better stratify patients, predict new biomarkers and targets, and re-purpose existing and discover new drugs. We apply these to different types of cancer, Covid-19, Parkinson’s and other diseases. Our new methods stem from graph-regularized non-negative matrix tri-factorization (NMTF), a machine learning (ML) technique for dimensionality reduction, inference, fusion and co-clustering of heterogeneous datasets, coupled with novel graphlet-based network science algorithms. We utilize our new frameworks to improve the understanding of the molecular organization of life and of diseases from the embedding spaces of omics data. We also utilize local network topology to correct for the topological information missed by the random walks used in many ML methods, and to enable embedding of multi-omics networks into more linearly separable spaces, allowing for their explainable and sustainable mining. The aim is to develop an overarching framework encompassing all multi-omics data towards consumer-facing precision medicine products.
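The NMTF core of these frameworks can be sketched in a few lines. This is a plain, unregularized tri-factorization fit with standard multiplicative updates on invented toy data; the graph regularization and graphlet-based components of the talk are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmtf(R, k1, k2, iters=500, eps=1e-9):
    """Non-negative matrix tri-factorization R ~ G1 @ S @ G2.T,
    fit with multiplicative updates (no graph regularization here)."""
    n, m = R.shape
    G1, S, G2 = rng.random((n, k1)), rng.random((k1, k2)), rng.random((m, k2))
    for _ in range(iters):
        G1 *= (R @ G2 @ S.T) / (G1 @ (S @ G2.T @ G2 @ S.T) + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ (S.T @ G1.T @ G1 @ S) + eps)
        S  *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
    return G1, S, G2

# Toy relation matrix with two row-clusters and two column-clusters
R = np.block([[np.ones((5, 4)), np.zeros((5, 6))],
              [np.zeros((5, 4)), np.ones((5, 6))]])
G1, S, G2 = nmtf(R, k1=2, k2=2)
rel_err = np.linalg.norm(R - G1 @ S @ G2.T) / np.linalg.norm(R)
print(round(rel_err, 3))
```

The rows of G1 and G2 cluster the two entity types simultaneously, while the small middle factor S summarizes how the row and column clusters relate, which is what makes tri-factorization a co-clustering and data-fusion tool.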
2:30 pm TBA
Xihong Lin (Harvard)
3:00 pm TBA
Le Song (MBZUAI)
3:30 pm Coffee Break
4:00 pm Modern Nonlinear Embedding Methods Unpacked: Empowering Biological Discoveries with Statistical Insights
Rong Ma (Harvard)
Learning and representing low-dimensional structures from noisy, high-dimensional data is a cornerstone of modern biomedical data science. Stochastic neighbor embedding algorithms, a family of nonlinear dimensionality reduction and data visualization methods, with t-SNE and UMAP as two leading examples, have become especially influential in recent years, particularly in single-cell analysis. Yet despite their popularity, these methods remain subject to points of debate, including limited theoretical understanding, ambiguous interpretations, and sensitivity to tuning parameters. In this talk, I will present our recent efforts to decipher and improve these nonlinear embedding approaches. Our key results include a rigorous theoretical framework that uncovers the intrinsic mechanisms, large-sample limits, and fundamental principles underlying these algorithms; a set of theory-informed practical guidelines for their principled use in trustworthy biological discovery; and a collection of new algorithms that address current limitations and improve performance in areas such as bias reduction and stability. Throughout the talk, I will highlight how these advances not only deepen our statistical understanding but also open new avenues for biological insight.
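One concrete ingredient of these algorithms, the perplexity calibration of t-SNE's input affinities, can be sketched directly. This bare-bones illustration (not a full t-SNE implementation, and run on invented random data) shows how the tuning parameter "perplexity" fixes, per point, the entropy of a Gaussian neighbor distribution via binary search.

```python
import numpy as np

def tsne_affinities(X, perplexity=30.0, tol=1e-5, max_iter=64):
    """t-SNE's input similarities p_{j|i}: for each point i, binary-search the
    Gaussian precision beta_i = 1/(2*sigma_i^2) so that the conditional
    distribution over neighbors has entropy log2(perplexity) bits."""
    n = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    target = np.log2(perplexity)
    P = np.zeros((n, n))
    for i in range(n):
        d = np.delete(D[i], i)
        beta, lo, hi = 1.0, 0.0, np.inf
        for _ in range(max_iter):
            p = np.exp(-(d - d.min()) * beta)  # shift for numerical stability
            p /= p.sum()
            h = -np.sum(p * np.log2(p + 1e-12))  # entropy in bits
            if abs(h - target) < tol:
                break
            if h > target:   # too flat: shrink sigma (raise beta)
                lo, beta = beta, beta * 2 if hi == np.inf else (beta + hi) / 2
            else:            # too peaked: widen sigma (lower beta)
                hi, beta = beta, (lo + beta) / 2
        P[i, np.arange(n) != i] = p
    return P

X = np.random.default_rng(0).normal(size=(100, 5))
P = tsne_affinities(X, perplexity=15.0)
perp = 2 ** (-np.sum(np.where(P > 0, P * np.log2(np.where(P > 0, P, 1.0)), 0.0), axis=1))
print(perp.min(), perp.max())  # every row calibrated to perplexity ~15
```

Because each point gets its own bandwidth, the effective neighborhood size is held constant across dense and sparse regions, which is both a strength of these methods and one source of their sensitivity to the perplexity parameter.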
4:30 pm TBA
Salman Khan (MBZUAI)
TBA
5:00 pm Robustifying Generative AI for Human Genetics
Ahmad Abdel-Azim (Harvard)
Generative AI has transformed content generation, and its promise for biomedicine is compelling. Genomic sequencing remains costly and scarce relative to phenotyping, and privacy restrictions further limit access. We introduce a scalable pipeline for the generation of high-fidelity genomes, building on the latent diffusion model architecture (e.g., Stable Diffusion). Genotypes are efficiently embedded in a regularized, informative latent space where a conditional diffusion model is subsequently trained; this enables the generation of high-fidelity genetic data, where fidelity is assessed with genomics-specific diagnostics. Practically, sampling from the learned genomic distribution provides “data on demand”: millions of high-fidelity genomes can be generated to power variant discovery, fine-mapping, and other analyses, hence facilitating more rapid discovery. Nevertheless, the challenge is not only the generative modeling of genomes, but how to use the generated data without corrupting downstream inference procedures and inducing bias. We introduce a synthetic augmentation framework that is robust to misspecification of the data-generative models. Our proposed framework is guaranteed to be more efficient than standard estimation procedures based solely on the observed data, even when the generators are imperfect. Applied in simulations and the UK Biobank, the approach delivers higher-powered genetic analyses. More broadly, we establish a general framework for robust inference with generative modeling, extensible to other modalities, which can accelerate discovery while reducing data-collection costs and privacy risks.