Table of Links
2. Background
2.1 Amortized Stochastic Variational Bayesian GPLVM
2.2 Encoding Domain Knowledge through Kernels
3. Our Model: Pre-Processing and Likelihood
4.1 Each Component is Crucial to Model Performance
4.3 Consistency of Latent Space with Biological Factors
4. Conclusion, Acknowledgement, and References
ABSTRACT
Dimensionality reduction is crucial for analyzing large-scale single-cell RNA-seq data. Gaussian Process Latent Variable Models (GPLVMs) offer an interpretable dimensionality reduction method, but current scalable models lack effectiveness in clustering cell types. We introduce an improved model, the amortized stochastic variational Bayesian GPLVM (BGPLVM), tailored for single-cell RNA-seq with specialized encoder, kernel, and likelihood designs. This model matches the performance of the leading single-cell variational inference (scVI) approach on synthetic and real-world COVID datasets and effectively incorporates cell-cycle and batch information to reveal more interpretable latent structures as we demonstrate on an innate immunity dataset.
1 INTRODUCTION
Single-cell transcriptomics sequencing (scRNA-seq) has enabled the study of gene expression at the individual cell level. This high-resolution analysis has helped discover new cell types and cell states, reveal developmental lineages, and identify cell type-specific gene expression profiles (Montoro et al., 2018; Plasschaert et al., 2018; Luecken & Theis, 2019). This high resolution, however, comes at a cost. scRNA-seq data are often extremely sparse and prone to various sources of technical and biological noise, such as sequencing depth, batch effects, and cell-cycle phases (Svensson et al., 2018; Tanay & Regev, 2017; Luecken & Theis, 2019; Hie et al., 2020). Various dimensionality reduction techniques have been developed to leverage intrinsic structures in the data (Heimberg et al., 2016) and map it to a lower-dimensional latent space. These methods facilitate downstream tasks like clustering and visualization while avoiding the curse of dimensionality. Our work emphasizes probabilistic dimensionality reduction methods, which, by providing explicit probabilistic models for the data, allow for more interpretable models and uncertainty measures in the learned latent space.
In particular, we study a class of latent variable models known as Gaussian Process Latent Variable Models (GPLVMs) (Lawrence, 2004), which have recently been applied to scRNA-seq data (Campbell & Yau, 2015; Buettner et al., 2015; Ahmed et al., 2019; Verma & Engelhardt, 2020; Lalchand et al., 2022a). These models, which use Gaussian processes (GPs) to define nonlinear mappings from the latent space to the data space, can incorporate prior information through the GP kernel function, motivating their use in single-cell transcriptomics to model known or approximated covariate random effects, such as batch IDs and cell-cycle phases. This approach is made scalable via mini-batching; however, the resulting Bayesian GPLVM model (BGPLVM) struggles to learn informative latent spaces for certain datasets (Lalchand et al., 2022a).
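To make the kernel-based encoding of covariates concrete, the following is a minimal sketch (not the authors' implementation) of how a GPLVM covariance can combine a smooth kernel over latent coordinates with a batch-indicator kernel. The specific kernel forms and their product combination are illustrative assumptions; the paper's actual kernel design is described in its Section 2.2.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel over latent coordinates."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def batch_kernel(b1, b2):
    """Indicator kernel: covariance is shared only within a batch."""
    return (b1[:, None] == b2[None, :]).astype(float)

def combined_kernel(X1, b1, X2, b2, **rbf_params):
    # Product of the latent-space kernel and the batch kernel, so the
    # batch covariate modulates the latent covariance structure.
    return rbf_kernel(X1, X2, **rbf_params) * batch_kernel(b1, b2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # latent coordinates for 5 cells
b = np.array([0, 0, 1, 1, 1])        # batch ID per cell
K = combined_kernel(X, b, X, b)       # 5x5 covariance matrix
```

Under this product form, cells from different batches have zero prior covariance, while within-batch covariance decays smoothly with latent distance; richer choices (e.g. additive combinations or learned coregionalization matrices) interpolate between fully shared and fully separate batch structure.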
In this work, we present an amortized BGPLVM better fit to scRNA-seq data by leveraging design choices made in a leading probabilistic dimensionality reduction method called single cell variational inference (scVI) (Lopez et al., 2018). While scVI has seen impressive performance in a variety of downstream tasks, it does not easily allow for interpretable incorporation of prior domain knowledge.
In Sections 2 and 3, we describe this model, providing a concise background on BGPLVMs and highlighting the model modifications. Section 4 then discusses (1) an ablation study on a synthetic dataset demonstrating each component's contribution to the model's performance; (2) performance comparable to scVI on both the synthetic dataset and a real-world COVID-19 dataset (Stephenson et al., 2021); and (3) promising results for interpretably incorporating prior domain knowledge about cell-cycle phases in an innate immunity dataset (Kumasaka et al., 2021). Our work shines a light on key considerations in developing a scalable, interpretable, and informative probabilistic dimensionality reduction method for scRNA-seq data.
This paper is available on arxiv under CC BY-SA 4.0 DEED license.
Authors:
(1) Sarah Zhao, Department of Statistics, Stanford University, (smxzhao@stanford.edu);
(2) Aditya Ravuri, Department of Computer Science, University of Cambridge (ar847@cam.ac.uk);
(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard (vidrl@mit.edu);
(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge (ndl21@cam.ac.uk).