Ablation Study: BGPLVM Component Contributions to scRNA-seq Performance

Written by amortize | Published 2025/05/21
Tech Story Tags: ablation-study | bgplvm | scrna-seq | encoder-impact | synthetic-data | gaussian-process | deep-learning | cell-type-separation

TL;DR: Discover how each component of our modified BGPLVM model, including pre-processing, likelihood, kernel, and encoder, significantly impacts performance on synthetic scRNA-seq data.

Table of Links

Abstract and 1. Introduction

2. Background

2.1 Amortized Stochastic Variational Bayesian GPLVM

2.2 Encoding Domain Knowledge through Kernels

3. Our Model and 3.1 Pre-Processing and Likelihood

3.2 Encoder

4. Results and Discussion and 4.1 Each Component is Crucial to Modified Model Performance

4.2 Modified Model achieves Significant Improvements over Standard Bayesian GPLVM and is Comparable to SCVI

4.3 Consistency of Latent Space with Biological Factors

5. Conclusion, Acknowledgement, and References

A. Baseline Models

B. Experiment Details

C. Latent Space Metrics

D. Detailed Metrics

4 RESULTS AND DISCUSSION

We present results for three experiments on a simulated dataset and two real-world datasets, which are detailed in Appendix B.1. Full experiment details and results with latent space metrics are presented in Appendices B and D.

4.1 EACH COMPONENT IS CRUCIAL TO MODIFIED MODEL PERFORMANCE

To better understand how each component affects our model's performance, we conducted an ablation study on a synthetic scRNA-seq dataset simulated by Splatter (Zappia et al., 2017), whose counts follow a true negative binomial likelihood. In particular, we reverted each component to a more standard BGPLVM component to evaluate its importance to the model's overall performance. The results of this experiment on the simulated dataset are detailed in Figure 2. Changing the pre-processing step and likelihood to match a Gaussian distribution, as is done in standard GPLVMs, completely removes any perceivable cell-type separation and results in separated batches (Fig. 2(b)). These observations support our hypothesis that the Gaussian likelihood is misaligned with the underlying count distribution, at least for the simulated single-cell dataset.
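To make this ablation concrete, the sketch below simulates negative-binomial counts (a simplified stand-in for Splatter, which we do not reproduce here) and contrasts the two modeling choices: log-normalizing the counts for a Gaussian likelihood versus keeping raw counts for a negative binomial likelihood. All names, sizes, and parameter values are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the likelihood ablation: simulate counts with a true
# negative binomial (NB) noise model, then prepare the two model inputs.
# This is NOT the Splatter simulator; it only mimics its NB assumption.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 500, 200  # assumed sizes, for illustration only

# Per-cell, per-gene NB means and a shared dispersion (both assumed)
mu = np.exp(rng.normal(0.0, 1.0, size=(n_cells, n_genes)))
theta = 2.0

# NB sampling via the Gamma-Poisson mixture
lam = rng.gamma(shape=theta, scale=mu / theta)
counts = rng.poisson(lam)

# Standard-BGPLVM ablation: log-normalize and treat the data as Gaussian,
# which is misaligned with the true NB noise model
gaussian_input = np.log1p(counts)

# Modified model: keep raw counts and use an NB likelihood downstream
nb_input = counts
```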

If the SE-ARD+Linear kernel is changed to a fully linear kernel (detailed in Appendix B.3.1), the batches separate while the cell types begin to mix but remain slightly distinguishable, albeit within the separated batches (Fig. 2(c)). These changes may be attributed to the fact that a linear kernel is not expressive enough to capture cell-type information, while the nonlinearity of the SE-ARD+Linear kernel permits extra flexibility.
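As a rough illustration of this kernel ablation, here is how the two covariance functions might be constructed in GPyTorch. The kernel names match the paper, but the latent dimensionality and the library choice are our assumptions; the exact settings are given in Appendix B.3.1.

```python
# Sketch of the two kernels compared in the ablation, using GPyTorch.
import gpytorch

latent_dim = 10  # assumed latent dimensionality

# Modified model: SE-ARD + Linear, nonlinear with per-dimension lengthscales
se_ard_plus_linear = (
    gpytorch.kernels.ScaleKernel(
        gpytorch.kernels.RBFKernel(ard_num_dims=latent_dim)
    )
    + gpytorch.kernels.LinearKernel()
)

# Ablated variant: a purely linear kernel, which lacks the flexibility
# to capture nonlinear cell-type structure
linear_only = gpytorch.kernels.LinearKernel()
```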

In this reverse ablation study, the encoder has the least impact on the latent space representation, as evidenced by the clear separation of cell types and the well-mixed batches in Fig. 2(d). This behavior can be attributed to the encoder playing a smaller role in defining the generative model: it primarily serves to regularize the mapping from the data space to the latent space.
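The following hypothetical PyTorch sketch contrasts the two inference schemes: an amortized encoder that ties all cells together through shared weights (the regularization described above), versus the non-amortized alternative with free per-cell variational parameters. Layer widths and sizes are illustrative assumptions.

```python
# Sketch of the amortization being ablated: an encoder maps each cell's
# expression profile to variational parameters, sharing weights across cells.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_genes: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)

    def forward(self, y: torch.Tensor):
        h = self.net(y)
        # Variational mean and variance for each cell's latent coordinate
        return self.mu(h), self.log_var(h).exp()

# Ablated (non-amortized) alternative: one free parameter pair per cell,
# with no shared mapping from data space to latent space
n_cells, latent_dim = 500, 10
free_mu = nn.Parameter(torch.zeros(n_cells, latent_dim))
free_var = nn.Parameter(torch.ones(n_cells, latent_dim))
```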

This paper is available on arxiv under CC BY-SA 4.0 DEED license.

Authors:

(1) Sarah Zhao, Department of Statistics, Stanford University, (smxzhao@stanford.edu);

(2) Aditya Ravuri, Department of Computer Science, University of Cambridge (ar847@cam.ac.uk);

(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard (vidrl@mit.edu);

(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge (ndl21@cam.ac.uk).

