Table of Links
2. Background
2.1 Amortized Stochastic Variational Bayesian GPLVM
2.2 Encoding Domain Knowledge through Kernels
3. Our Model and 3.1 Pre-Processing and Likelihood
4. Results and Discussion and 4.1 Each Component is Crucial to Model Performance
4.3 Consistency of Latent Space with Biological Factors
5. Conclusion, Acknowledgement, and References
B EXPERIMENT DETAIL
B.1 DATA
We evaluate these models with three datasets: (1) a simulated dataset generated with the single-cell simulation framework Splatter (Zappia et al., 2017), (2) a COVID-19 dataset (Stephenson et al., 2021), and (3) an innate immunity dataset (Kumasaka et al., 2021).
Simulated Data. As the focus of our work is to dissect the assumptions made in modelling single-cell data, we build our model on a synthetic scRNA-seq dataset generated with the Splat model from the Splatter simulation framework (Zappia et al., 2017). The counts follow a negative binomial distribution arising from a hierarchical Gamma-Poisson model, with parameters estimated from the dataset of Kotliar et al. (2019). The data are simulated with seven cell types and two batches, with 10,000 cells per batch and 10,000 genes. We then remove cells with fewer than 200 total gene expression counts and genes expressed in three or fewer cells, resulting in a synthetic dataset of 16,016 cells and 8,819 genes.
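This filtering can be reproduced with standard Scanpy calls; the snippet below is a minimal sketch that assumes the simulated counts have already been loaded into an AnnData object (the file name and variable names are illustrative, not from the paper).

```python
import scanpy as sc

# Load the simulated counts; the path is a placeholder.
adata = sc.read_h5ad("splatter_simulation.h5ad")

# Remove cells with fewer than 200 total gene expression counts.
sc.pp.filter_cells(adata, min_counts=200)

# Remove genes expressed in three or fewer cells (keep genes detected in >= 4 cells).
sc.pp.filter_genes(adata, min_cells=4)

print(adata.shape)  # roughly (16016, 8819) for the simulation described above
```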
COVID-19 Data. The COVID-19 dataset (Stephenson et al., 2021) is a real-world dataset comprising gene expression counts from peripheral blood mononuclear cells. It includes samples from 107 patients with varying degrees of COVID-19 severity, as well as samples from 23 healthy individuals. There are three main sampling locations (Sanger, Cambridge, and Newcastle), and each cell is additionally annotated with a sample ID (143 in total), where each sample ID encodes its sampling location. The dataset considers 18 cell types. For this project, we take a subsample of 100,000 cells and the 5,000 most variable genes as determined by Scanpy (Wolf et al., 2018).
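The gene selection and subsampling described above map onto standard Scanpy operations; the sketch below assumes raw counts in an AnnData object, and the highly-variable-gene flavour shown is an assumption rather than the authors' stated choice.

```python
import scanpy as sc

# Load the COVID-19 PBMC counts; the path is a placeholder.
adata = sc.read_h5ad("stephenson_covid19_pbmc.h5ad")

# Keep the 5,000 most variable genes (flavour is an assumption).
sc.pp.highly_variable_genes(adata, n_top_genes=5000, flavor="seurat_v3", subset=True)

# Subsample to 100,000 cells.
sc.pp.subsample(adata, n_obs=100_000, random_state=0)
```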
Innate Immunity Data. The innate immunity dataset of Kumasaka et al. (2021) comprises 22,188 primary dermal fibroblasts from 68 donors, which were either left unstimulated (control) or exposed to one of two stimulants mimicking the innate immune response: (1) the dsRNA analogue Poly(I:C), triggering primary antiviral and inflammatory responses, and (2) IFN-beta, triggering the secondary antiviral response. For this dataset we use a total of 4,999 genes and 7 latent dimensions (including cell-cycle latents).
B.2 EXPERIMENTAL SET-UP
For each experiment, we train the model with batch size 300, learning rate 0.05, and three different seeds: 0, 42, and 123. We train for 50 epochs on the synthetic dataset and 15 epochs on the COVID-19 dataset, which is sufficient for convergence in each case. The latent space dimension is set to Q = 10 for all models. For evaluation, we use seed 1 for all UMAP visualizations, and the latent-space metrics are reported as the mean and standard deviation (to two decimal places) over the three training runs for each model. We use the CSD3 high-performance computing cluster for model training.
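As a compact sketch of this set-up (the training routine below is a hypothetical placeholder; only the hyperparameter values come from the text):

```python
import numpy as np
import torch

CONFIG = {
    "batch_size": 300,
    "learning_rate": 0.05,
    "latent_dim": 10,                            # Q = 10 for all models
    "epochs": {"synthetic": 50, "covid19": 15},
    "seeds": [0, 42, 123],                       # three training runs per model
}

def run_experiment(dataset_name, build_and_train):
    """Train one model per seed and aggregate the latent-space metrics.
    `build_and_train` stands in for the actual BGPLVM training routine,
    which is not shown in this excerpt."""
    metrics = []
    for seed in CONFIG["seeds"]:
        torch.manual_seed(seed)
        np.random.seed(seed)
        metrics.append(build_and_train(
            dataset_name,
            batch_size=CONFIG["batch_size"],
            lr=CONFIG["learning_rate"],
            latent_dim=CONFIG["latent_dim"],
            epochs=CONFIG["epochs"][dataset_name],
        ))
    metrics = np.asarray(metrics)
    # Reported as mean and standard deviation over the three runs.
    return metrics.mean(axis=0), metrics.std(axis=0)
```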
B.3 EXTRA MODIFICATIONS AND EXPERIMENTS
B.3.1 LINEAR KERNEL
For the ablation study, we also consider a linear kernel that models the augmented latent space information. The corresponding augmented GP combines a linear mean with this linear kernel.
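As a minimal sketch (not the authors' exact formulation), assuming the augmented latent $\tilde{\mathbf{z}}_n = [\mathbf{z}_n, \mathbf{x}_n]$ concatenates the learned latent $\mathbf{z}_n$ with the batch covariate $\mathbf{x}_n$, such a GP could be written as

$$
f_d(\tilde{\mathbf{z}}_n) \sim \mathcal{GP}\big(\mathbf{x}_n^{\top}\boldsymbol{\beta}_d,\; k(\tilde{\mathbf{z}}_n, \tilde{\mathbf{z}}_m)\big),
\qquad
k(\tilde{\mathbf{z}}_n, \tilde{\mathbf{z}}_m) = \sigma^2\, \tilde{\mathbf{z}}_n^{\top} \tilde{\mathbf{z}}_m,
$$

where $\boldsymbol{\beta}_d$ are per-gene linear mean weights and $\sigma^2$ is the kernel variance.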
B.3.2 LIKELIHOODS
In our ablation studies, more complex likelihoods (for example, a negative binomial likelihood in which the library size of each cell is learned) were observed to perform poorly, while simplifications such as the approximate Poisson likelihood led to improved performance (see Fig. 6). This phenomenon could be explained by an identifiability issue in the model. The extra parameters make these likelihoods more flexible, but they may also absorb pertinent cell-type information that would otherwise be captured by the latent variables. When the library size parameter is learned, the model may be biased towards high-count cells, potentially disregarding the rest of the data and attributing latent space factors to technical noise rather than relevant biological differences. By constraining the likelihood to a slightly misspecified but simpler model, we may be encouraging the BGPLVM to fit the zeros and smaller count values especially well.
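To make the comparison concrete, the snippet below sketches the two likelihood choices discussed above using PyTorch distributions; the parameterisation and link functions are our assumptions, not the paper's implementation.

```python
import torch
from torch.distributions import Poisson, NegativeBinomial

def poisson_loglik(counts, f, library_size):
    """Approximate Poisson likelihood: the rate is scaled by an observed
    (fixed) per-cell library size rather than a learned parameter."""
    rate = library_size.unsqueeze(-1) * torch.nn.functional.softplus(f)
    return Poisson(rate).log_prob(counts).sum(-1)

def negbinom_loglik(counts, f, log_library, log_dispersion):
    """Negative binomial likelihood with a learned per-cell library size
    (log_library) and per-gene dispersion (log_dispersion)."""
    mean = torch.exp(log_library).unsqueeze(-1) * torch.nn.functional.softplus(f)
    dispersion = torch.exp(log_dispersion)      # total_count parameter
    logits = torch.log(mean) - log_dispersion   # chosen so the NB mean equals `mean`
    return NegativeBinomial(total_count=dispersion, logits=logits).log_prob(counts).sum(-1)
```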
This paper is available on arXiv under the CC BY-SA 4.0 DEED license.
Authors:
(1) Sarah Zhao, Department of Statistics, Stanford University (smxzhao@stanford.edu);
(2) Aditya Ravuri, Department of Computer Science, University of Cambridge (ar847@cam.ac.uk);
(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard (vidrl@mit.edu);
(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge (ndl21@cam.ac.uk).