Ablation Study: BGPLVM Component Contributions to scRNA-seq Performance

by Amortize, May 21st, 2025

Too Long; Didn't Read

Discover how each component of our modified BGPLVM model, including pre-processing, likelihood, kernel, and encoder, significantly impacts performance on synthetic scRNA-seq data.


Abstract and 1. Introduction

2. Background

2.1 Amortized Stochastic Variational Bayesian GPLVM

2.2 Encoding Domain Knowledge through Kernels

3. Our Model and 3.1 Pre-Processing and Likelihood

3.2 Encoder

4. Results and Discussion and 4.1 Each Component is Crucial to Modified Model Performance

4.2 Modified Model Achieves Significant Improvements over Standard Bayesian GPLVM and is Comparable to SCVI

4.3 Consistency of Latent Space with Biological Factors

5. Conclusion, Acknowledgement, and References

A. Baseline Models

B. Experiment Details

C. Latent Space Metrics

D. Detailed Metrics

4 RESULTS AND DISCUSSION

We present results for three experiments on a simulated dataset and two real-world datasets, which are detailed in Appendix B.1. Full experiment details and results with latent space metrics are also presented in Appendices B and D.

4.1 EACH COMPONENT IS CRUCIAL TO MODIFIED MODEL PERFORMANCE

To better understand how each component affects our model's performance, we conducted an ablation study on a synthetic scRNA-seq dataset distributed according to a true negative binomial likelihood, simulated with Splatter (Zappia et al., 2017). In particular, we reverted each component to a more standard BGPLVM component to evaluate its importance to the model's overall performance. The results of this experiment on the simulated dataset are detailed in Figure 2. Changing the pre-processing step and likelihood to match a Gaussian distribution, as is done in standard GPLVMs, completely removes any perceivable cell-type separation and results in separated batches (Fig. 2(b)). These observations support our hypothesis that the likelihoods were misaligned with the underlying distribution, at least for the simulated single-cell dataset.
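To make the likelihood mismatch concrete, here is a minimal sketch, assuming PyTorch, that scores the same raw counts under a negative binomial likelihood and under a Gaussian likelihood on log-transformed values. The names `mu` and `theta` and the shared-dispersion parameterization are illustrative assumptions, not the paper's exact specification.

```python
import torch
from torch.distributions import NegativeBinomial, Normal

counts = torch.tensor([0., 3., 17., 1.])   # raw UMI counts for one gene (illustrative)
mu = torch.tensor([0.5, 2.8, 15.0, 1.2])   # decoded mean expression (illustrative)
theta = torch.tensor(2.0)                  # inverse-dispersion, shared here for simplicity

# torch's NB(total_count, logits) has mean total_count * exp(logits),
# so logits = log(mu / theta) gives the desired mean mu.
nb = NegativeBinomial(total_count=theta, logits=(mu / theta).log())
nb_ll = nb.log_prob(counts).sum()

# Standard-BGPLVM-style alternative: Gaussian likelihood on log-normalized
# values, which mismatches the discrete, overdispersed counting process.
gauss = Normal(loc=torch.log1p(mu), scale=1.0)
gauss_ll = gauss.log_prob(torch.log1p(counts)).sum()

print(f"NB log-lik: {nb_ll:.2f}, Gaussian log-lik: {gauss_ll:.2f}")
```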


If the SE-ARD+Linear kernel is changed to a fully linear kernel (detailed in Appendix B.3.1), the batches separate and the cell types begin to mix, though they remain slightly distinguishable within the separated batches (Fig. 2(c)). These changes may be attributed to the fact that the linear kernel is not expressive enough to capture the cell-type information, while the nonlinearity of the SE-ARD+Linear kernel permits extra flexibility.
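This kernel ablation can be sketched in a few lines. The snippet below, an assumed illustration in GPyTorch rather than the paper's exact configuration (which is given in Appendix B.3.1), builds the SE-ARD plus linear kernel alongside its linear-only ablation.

```python
import torch
import gpytorch
from gpytorch.kernels import RBFKernel, LinearKernel, ScaleKernel

latent_dim = 10  # illustrative latent dimensionality

# Full model: SE-ARD (one lengthscale per latent dimension) plus a linear kernel.
se_ard_plus_linear = (
    ScaleKernel(RBFKernel(ard_num_dims=latent_dim)) + ScaleKernel(LinearKernel())
)

# Ablated variant: linear kernel only, lacking the nonlinear flexibility
# needed to separate cell types.
linear_only = ScaleKernel(LinearKernel())

X = torch.randn(5, latent_dim)
K = se_ard_plus_linear(X).to_dense()  # use .evaluate() on older GPyTorch
print(K.shape)  # torch.Size([5, 5]) covariance matrix
```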


In this reverse ablation study, the encoder has the least impact on the latent space representation, as evidenced by the clear separation of cell types and well-mixed batches in Fig. 2(d). This behavior can be attributed to the encoder playing a smaller role in defining the generative model: it primarily functions as a means of regularization for the mapping from the data space to the latent space.
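For intuition about the encoder's role, the sketch below shows a minimal amortized inference network mapping expression profiles to the mean and log-variance of a Gaussian variational posterior over latents. The architecture, layer sizes, and names are assumptions for illustration, not the paper's exact Section 3.2 design.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps per-cell expression profiles to variational parameters (illustrative)."""
    def __init__(self, n_genes: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)

    def forward(self, y: torch.Tensor):
        h = self.net(y)
        # Variational posterior q(x | y) = N(mean, diag(exp(log_var)))
        return self.mean(h), self.log_var(h)

enc = Encoder(n_genes=2000, latent_dim=10)
mu, log_var = enc(torch.randn(32, 2000))  # a batch of 32 cells
```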


Figure 2: Ablation study with the simulated dataset on the proposed BGPLVM model where we change one component at a time (labeled in subfigures) and visualize the resulting UMAPs. The top row is colored by cell-type and the bottom row by batch.


This paper is available on arxiv under CC BY-SA 4.0 DEED license.

Authors:

(1) Sarah Zhao, Department of Statistics, Stanford University (smxzhao@stanford.edu);

(2) Aditya Ravuri, Department of Computer Science, University of Cambridge (ar847@cam.ac.uk);

(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard (vidrl@mit.edu);

(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge (ndl21@cam.ac.uk).

