Correcting batch effects in single-cell RNA-seq data

Aaron T. L. Lun1 and Michael D. Morgan2

1Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom
2Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

Contents

Table of Contents
Processing the different datasets
Top

1 Introduction

Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.

Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for valid downstream analysis. However, existing methods such as removeBatchEffect() (Ritchie et al. 2015) assume that the composition of cell populations are either known or the same across batches. This workflow describes the application of an alternative strategy for batch correction based on the detection of mutual nearest neighbours (MNNs) (Haghverdi et al. 2018). The mnnCorreect() approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches. We demonstrate its use on three human pancreas scRNA-seq datasets from different groups and using different protocols.

Table of Contents
Feature selection across batches
Introduction
Top
CEL-seq2, GSE85241

2 Processing the different datasets

2.1 CEL-seq, GSE81076

2.1.1 Loading in the data

This dataset was generated by Grun et al. (2016) using the CEL-seq protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above. We first read the table into memory.

gse81076.df <- read.table("GSE81076_D2_3_7_10_17.txt.gz", sep='\t', 
    header=TRUE, stringsAsFactors=FALSE, row.names=1)
dim(gse81076.df)
## [1] 20148  1728

Unfortunately, the data and metadata are all mixed together in this file. As a result, we need to manually extract the metadata from the column names.

donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse81076.df))
table(donor.names)
## donor.names
##    D101    D102  D10631     D17   D1713 D172444      D2      D3     D71     D72 
##      96      96      96     288      96      96      96     480      96      96 
##     D73     D74 
##      96      96
plate.id <- sub("^D[0-9]+(.*)_.*", "\\1", colnames(gse81076.df))
table(plate.id)
## plate.id
##      All1 All2  en1  en2  en3  en4   ex TGFB 
##  864   96   96   96   96   96   96  192   96

Another irritating feature of this dataset is that gene symbols were supplied, rather than stable identifiers such as Ensembl. We convert all row names to Ensembl identifiers, removing NA or duplicated entries (with the exception of spike-in transcripts).

gene.symb <- gsub("__chr.*$", "", rownames(gse81076.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE  TRUE 
## 20064    84
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]

keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse81076.df <- gse81076.df[keep,]
rownames(gse81076.df) <- gene.ids[keep]
summary(keep)
##    Mode   FALSE    TRUE 
## logical    2252   17896

We create a SingleCellExperiment object to store the counts and metadata together. This reduces the risk of book-keeping errors in later steps of the analysis. Note that we re-identify the spike-in rows, as the previous indices would have changed after the subsetting.

library(SingleCellExperiment)
sce.gse81076 <- SingleCellExperiment(list(counts=as.matrix(gse81076.df)),
    colData=DataFrame(Donor=donor.names, Plate=plate.id),
    rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse81076, "ERCC") <- grepl("^ERCC-", rownames(gse81076.df)) 
sce.gse81076  
## class: SingleCellExperiment 
## dim: 17896 1728 
## metadata(0):
## assays(1): counts
## rownames(17896): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
##   ENSG00000036549
## rowData names(1): Symbol
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC

Table of Contents
Identifying highly variable genes
Loading in the data
CEL-seq, GSE81076

2.1.2 Quality control and normalization

We compute quality control (QC) metrics for each cell (McCarthy et al. 2017) and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.

library(scater)
sce.gse81076 <- calculateQCMetrics(sce.gse81076, compact=TRUE)
QC <- sce.gse81076$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), 
    HighSpike=sum(high.spike, na.rm=TRUE))
##   LowLib LowNgenes HighSpike
## 1     54       130       389

Cells with extreme values for these QC metrics are presumed to be of low quality and are removed. A more thorough analysis would examine the distributions of these QC metrics beforehand, but we will skip that step for brevity here.

discard <- low.lib | low.genes | high.spike
sce.gse81076 <- sce.gse81076[,!discard]
summary(discard)
##    Mode   FALSE    TRUE 
## logical    1291     437

We compute size factors for the endogenous genes using the deconvolution method (Lun, Bach, and Marioni 2016). This is done with pre-clustering through quickCluster() to avoid pooling together very different cells.

library(scran)
clusters <- quickCluster(sce.gse81076, min.mean=0.1)
table(clusters)
## clusters
##   1   2   3   4 
## 414 357 267 253
sce.gse81076 <- computeSumFactors(sce.gse81076, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse81076))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003044 0.447400 0.791070 1.000000 1.316058 8.761424

We also compute size factors for the spike-in transcripts (Lun et al. 2017). Recall that we set general.use=FALSE to ensure that the spike-in size factors are only applied to the spike-in transcripts.

sce.gse81076 <- computeSpikeFactors(sce.gse81076, general.use=FALSE)
summary(sizeFactors(sce.gse81076, "ERCC"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01042 0.57800 0.88697 1.00000 1.27728 7.44245

We then compute normalized log-expression values for use in downstream analyses.

sce.gse81076 <- normalize(sce.gse81076)

Table of Contents
Quality control and normalization
CEL-seq, GSE81076

2.1.3 Identifying highly variable genes

We identify highly variable genes (HVGs) using trendVar() and decomposeVar(), using the variances of spike-in transcripts to model technical noise. We set block= to ensure that uninteresting differences between plates or donors do not inflate the variance. The small discrepancy in the fitted trend in Figure 1 is caused by the fact that the trend is fitted robustly to the block-wise variances of the spike-ins, while the variances shown are averaged across blocks and not robust to outliers.

block <- paste0(sce.gse81076$Plate, "_", sce.gse81076$Donor)
fit <- trendVar(sce.gse81076, block=block, parametric=TRUE) 
dec <- decomposeVar(sce.gse81076, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression", 
    ylab="Variance of log-expression", pch=16)
OBis.spike <- isSpike(sce.gse81076)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
Variance of normalized log-expression values for each gene in the GSE81076 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

Figure 1: Variance of normalized log-expression values for each gene in the GSE81076 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We order genes by decreasing biological component, revealing some usual suspects such as insulin and glucagon. We will be using this information later when performing feature selection prior to running mnnCorrect().

dec.gse81076 <- dec
dec.gse81076$Symbol <- rowData(sce.gse81076)$Symbol
dec.gse81076 <- dec.gse81076[order(dec.gse81076$bio, decreasing=TRUE),]
head(dec.gse81076)
## DataFrame with 6 rows and 7 columns
##                             mean            total              bio
##                        <numeric>        <numeric>        <numeric>
## ENSG00000254647 2.82477500862721 6.24939686243935 5.80546348932804
## ENSG00000115263 4.00392228927663 5.56617240435477 5.33299468869364
## ENSG00000118271 3.65232925066363 5.56170090164767  5.2587542601016
## ENSG00000115386 4.24621561998282  5.3682782819134 5.13334914024437
## ENSG00000164266 3.03739368175512 5.39637250473505 4.98985386321786
## ENSG00000172023  2.4600444995601 5.25637953645412 4.86267149419166
##                              tech   p.value       FDR      Symbol
##                         <numeric> <numeric> <numeric> <character>
## ENSG00000254647 0.443933373111314         0         0         INS
## ENSG00000115263 0.233177715661127         0         0         GCG
## ENSG00000118271 0.302946641546067         0         0         TTR
## ENSG00000115386 0.234929141669028         0         0       REG1A
## ENSG00000164266 0.406518641517194         0         0      SPINK1
## ENSG00000172023 0.393708042262459         0         0       REG1B

Table of Contents
Smart-seq2, E-MTAB-5061
CEL-seq, GSE81076
Processing the different datasets
Quality control and normalization

2.2 CEL-seq2, GSE85241

2.2.1 Loading in the data

This dataset was generated by Muraro et al. (2016) using the CEL-seq2 protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above. We first read the table into memory.

gse85241.df <- read.table("GSE85241_cellsystems_dataset_4donors_updated.csv", 
    sep='\t', h=TRUE, row.names=1, stringsAsFactors=FALSE)
dim(gse85241.df)
## [1] 19140  3072

We extract the metadata from the column names.

donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse85241.df))
table(donor.names)
## donor.names
## D28 D29 D30 D31 
## 768 768 768 768
plate.id <- sub("^D[0-9]+\\.([0-9]+)_.*", "\\1", colnames(gse85241.df))
table(plate.id)
## plate.id
##   1   2   3   4   5   6   7   8 
## 384 384 384 384 384 384 384 384

Yet again, gene symbols were supplied instead of Ensembl or Entrez identifiers. We convert all row names to Ensembl identifiers, removing NA or duplicated entries (with the exception of spike-in transcripts).

gene.symb <- gsub("__chr.*$", "", rownames(gse85241.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE  TRUE 
## 19059    81
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]

keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse85241.df <- gse85241.df[keep,]
rownames(gse85241.df) <- gene.ids[keep]
summary(keep)
##    Mode   FALSE    TRUE 
## logical    2097   17043

We create a SingleCellExperiment object to store the counts and metadata together.

sce.gse85241 <- SingleCellExperiment(list(counts=as.matrix(gse85241.df)),
    colData=DataFrame(Donor=donor.names, Plate=plate.id),
    rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse85241, "ERCC") <- grepl("^ERCC-", rownames(gse85241.df)) 
sce.gse85241  
## class: SingleCellExperiment 
## dim: 17043 3072 
## metadata(0):
## assays(1): counts
## rownames(17043): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
##   ENSG00000036549
## rowData names(1): Symbol
## colnames(3072): D28.1_1 D28.1_2 ... D30.8_95 D30.8_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC

Table of Contents
Identifying highly variable genes
Loading in the data
CEL-seq2, GSE85241

2.2.2 Quality control and normalization

We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.

sce.gse85241 <- calculateQCMetrics(sce.gse85241, compact=TRUE)
QC <- sce.gse85241$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), 
    HighSpike=sum(high.spike, na.rm=TRUE))
##   LowLib LowNgenes HighSpike
## 1    577       669       696

Low-quality cells are defined as those with extreme values for these QC metrics and are removed.

discard <- low.lib | low.genes | high.spike
sce.gse85241 <- sce.gse85241[,!discard]
summary(discard)
##    Mode   FALSE    TRUE 
## logical    2346     726

We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values.

clusters <- quickCluster(sce.gse85241, min.mean=0.1, method="igraph")
table(clusters)
## clusters
##   1   2   3   4 
## 371 828 536 611
sce.gse85241 <- computeSumFactors(sce.gse85241, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse85241))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.07892  0.54127  0.82761  1.00000  1.22184 13.98424
sce.gse85241 <- computeSpikeFactors(sce.gse85241, general.use=FALSE)
summary(sizeFactors(sce.gse85241, "ERCC"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09295 0.61309 0.88902 1.00000 1.27519 4.04643
sce.gse85241 <- normalize(sce.gse85241)

Table of Contents
Quality control and normalization
CEL-seq2, GSE85241

2.2.3 Identifying highly variable genes

We fit a trend to the spike-in variances as previously described, allowing us to model the technical noise for each gene (Figure 2). Again, we set block= to ensure that uninteresting differences between plates or donors do not inflate the variance.

block <- paste0(sce.gse85241$Plate, "_", sce.gse85241$Donor)
fit <- trendVar(sce.gse85241, block=block, parametric=TRUE) 
dec <- decomposeVar(sce.gse85241, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression", 
    ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse85241)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
Variance of normalized log-expression values for each gene in the GSE85241 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

Figure 2: Variance of normalized log-expression values for each gene in the GSE85241 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We order genes by decreasing biological component, as described above.

dec.gse85241 <- dec
dec.gse85241$Symbol <- rowData(sce.gse85241)$Symbol
dec.gse85241 <- dec.gse85241[order(dec.gse85241$bio, decreasing=TRUE),]
head(dec.gse85241)
## DataFrame with 6 rows and 7 columns
##                             mean            total              bio
##                        <numeric>        <numeric>        <numeric>
## ENSG00000115263 7.66267378242402 6.59681396806777 6.56800161897542
## ENSG00000089199 4.62064563864715 6.42527492432335 6.30366585829772
## ENSG00000169903 3.01379686443808  6.5465999616288 6.19244340917136
## ENSG00000254647 2.00214393757549 6.44633299225628 5.89008000997448
## ENSG00000118271  7.3354941164374 5.75428324335117 5.72283471882918
## ENSG00000171951  4.1895920246965 5.57157012090427 5.40710655269578
##                               tech   p.value       FDR      Symbol
##                          <numeric> <numeric> <numeric> <character>
## ENSG00000115263 0.0288123490923479         0         0         GCG
## ENSG00000089199   0.12160906602564         0         0        CHGB
## ENSG00000169903   0.35415655245744         0         0      TM4SF4
## ENSG00000254647  0.556252982281801         0         0         INS
## ENSG00000118271 0.0314485245219835         0         0         TTR
## ENSG00000171951  0.164463568208491         0         0        SCG2

Table of Contents
CEL-seq2, GSE85241
Processing the different datasets
Quality control and normalization

2.3 Smart-seq2, E-MTAB-5061

2.3.1 Loading in the data

This dataset was generated by Segerstolpe et al. (2016) using the Smart-seq2 protocol with ERCC spike-ins. Count tables were obtained from ArrayExpress using the accession number above. We first read the table into memory, though this requires some effort as the file is even more unconventionally formatted than the two examples above.

unzip("E-MTAB-5061.processed.1.zip")

# Figuring out the number of libraries (-1 for the '#sample').
header <- read.table("pancreas_refseq_rpkms_counts_3514sc.txt", 
    nrow=1, sep="\t", comment.char="", stringsAsFactors=FALSE)
ncells <- ncol(header) - 1L

# Loading only the gene names and the counts.
col.types <- vector("list", ncells*2 + 2)
col.types[1:2] <- "character"
col.types[2+ncells + seq_len(ncells)] <- "integer"
e5601.df <- read.table("pancreas_refseq_rpkms_counts_3514sc.txt", 
    sep="\t", colClasses=col.types)

# Disentangling the gene names and the counts.
gene.data <- e5601.df[,1:2]
e5601.df <- e5601.df[,-(1:2)]
colnames(e5601.df) <- as.character(header[1,-1])
dim(e5601.df)
## [1] 26271  3514

The gene metadata does contains unique GenBank identifiers, but these are transcript-level and concatenated together for each gene. Instead of trying to pull them apart, we perform the symbol-to-Ensembl conversion that was done for the previous datasets.

is.spike <- grepl("^ERCC-", gene.data[,2])
table(is.spike)
## is.spike
## FALSE  TRUE 
## 26179    92
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.data[,1], keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.data[is.spike,2]

keep <- !is.na(gene.ids) & !duplicated(gene.ids)
e5601.df <- e5601.df[keep,]
rownames(e5601.df) <- gene.ids[keep]
summary(keep)
##    Mode   FALSE    TRUE 
## logical    3736   22535

At least the metadata is stored in a separate file, which makes it fairly straightforward to parse. We match the rows to the column names to ensure that the metadata and count table are in the same order.

metadata <- read.table("E-MTAB-5061.sdrf.txt", header=TRUE, 
    sep="\t", check.names=FALSE, stringsAsFactors=FALSE)
m <- match(colnames(e5601.df), metadata[["Assay Name"]])
stopifnot(all(!is.na(m)))
metadata <- metadata[m,]
donor.id <- metadata[["Characteristics[individual]"]]
table(donor.id)
## donor.id
##           AZ    HP1502401 HP1504101T2D    HP1504901    HP1506401    HP1507101 
##           96          352          383          383          383          383 
## HP1508501T2D    HP1509101 HP1525301T2D HP1526901T2D 
##          383          383          384          384

We create a SingleCellExperiment object to store the counts and metadata together.

sce.e5601 <- SingleCellExperiment(list(counts=as.matrix(e5601.df)),
    colData=DataFrame(Donor=donor.id),
    rowData=DataFrame(Symbol=gene.data[keep,1]))
isSpike(sce.e5601, "ERCC") <- grepl("^ERCC-", rownames(e5601.df)) 
sce.e5601  
## class: SingleCellExperiment 
## dim: 22535 3514 
## metadata(0):
## assays(1): counts
## rownames(22535): ENSG00000118473 ENSG00000142920 ... ERCC-00061
##   ERCC-00048
## rowData names(1): Symbol
## colnames(3514): HP1502401_N13 HP1502401_D14 ... HP1526901T2D_O11
##   HP1526901T2D_A8
## colData names(1): Donor
## reducedDimNames(0):
## spikeNames(1): ERCC

Table of Contents
Identifying highly variable genes
Loading in the data
Smart-seq2, E-MTAB-5061

2.3.2 Quality control and normalization

We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content. We also identify cells with near-zero spike-in counts, which are not compatible with spike-in normalization and modelling of technical noise.

sce.e5601 <- calculateQCMetrics(sce.e5601, compact=TRUE)
QC <- sce.e5601$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3) 
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
low.spike <- isOutlier(QC$feature_control_ERCC$log10_total_counts, type="lower", nmad=2)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), 
    HighSpike=sum(high.spike, na.rm=TRUE), LowSpike=sum(low.spike))
##   LowLib LowNgenes HighSpike LowSpike
## 1    162       572       904      359

Low-quality cells are defined as those with extreme values for these QC metrics and are removed.

discard <- low.lib | low.genes | high.spike | low.spike
sce.e5601 <- sce.e5601[,!discard]
summary(discard)
##    Mode   FALSE    TRUE 
## logical    2285    1229

We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values. Recall that the Smart-seq2 protocol generates read count data, so we use a more stringent filter of min.mean=1 to remove low-abundance genes.

clusters <- quickCluster(sce.e5601, min.mean=1, method="igraph")
table(clusters)
## clusters
##   1   2   3   4   5   6 
## 315 308 460 482 282 438
sce.e5601 <- computeSumFactors(sce.e5601, min.mean=1, clusters=clusters)
summary(sizeFactors(sce.e5601))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.004916  0.377375  0.699866  1.000000  1.315723 12.740545
sce.e5601 <- computeSpikeFactors(sce.e5601, general.use=FALSE)
summary(sizeFactors(sce.e5601, "ERCC"))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01433  0.19416  0.42867  1.00000  1.16641 11.23978
sce.e5601 <- normalize(sce.e5601)

Table of Contents
Quality control and normalization
Smart-seq2, E-MTAB-5061

2.3.3 Identifying highly variable genes

We identify highly variable genes (HVGs) using trendVar() and decomposeVar(). Here, we need to fit the mean-variance trend separately to each donor, as the donor-to-donor variation in the mean-variance trend is more pronounced than that in the UMI datasets. This yields one fit for each donor in Figure 3.

donors <- sort(unique(sce.e5601$Donor))
is.spike <- isSpike(sce.e5601)
par(mfrow=c(ceiling(length(donors)/2), 2), 
    mar=c(4.1, 4.1, 2.1, 0.1))
collected <- list()
for (x in unique(sce.e5601$Donor)) {
    current <- sce.e5601[,sce.e5601$Donor==x]
    if (ncol(current)<2L) { next }
    current <- normalize(current)
    fit <- trendVar(current, parametric=TRUE) 
    dec <- decomposeVar(current, fit)
    plot(dec$mean, dec$total, xlab="Mean log-expression",
        ylab="Variance of log-expression", pch=16, main=x)
    points(fit$mean, fit$var, col="red", pch=16)
    curve(fit$trend(x), col="dodgerblue", add=TRUE)
    collected[[x]] <- dec
}
Variance of normalized log-expression values for each gene in the E-MTAB-5601 dataset, plotted against the mean log-expression. Each plot corresponds to a donor, where the blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

Figure 3: Variance of normalized log-expression values for each gene in the E-MTAB-5601 dataset, plotted against the mean log-expression. Each plot corresponds to a donor, where the blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We combine statistics across donors to consolidate them into a single set of results for this study. We then order genes by decreasing biological component.

dec.e5601 <- do.call(combineVar, collected)
dec.e5601$Symbol <- rowData(sce.e5601)$Symbol
dec.e5601 <- dec.e5601[order(dec.e5601$bio, decreasing=TRUE),]
head(dec.e5601)
## DataFrame with 6 rows and 7 columns
##                             mean            total              bio
##                        <numeric>        <numeric>        <numeric>
## ENSG00000115263 9.79785008603535 24.9554533902531 24.7436081289625
## ENSG00000118271 10.3624735164025 19.1597271291327 19.0677809547018
## ENSG00000089199 8.78894397314284 17.3693554890887  17.107324046159
## ENSG00000169903 7.11678634372552 15.2550468907374 14.6597366416469
## ENSG00000166922 7.90580761095785 15.0817216478007 14.6353279278984
## ENSG00000254647 4.94929481818886 17.8290056901903 14.5719504555495
##                               tech   p.value       FDR      Symbol
##                          <numeric> <numeric> <numeric> <character>
## ENSG00000115263  0.211845261290598         0         0         GCG
## ENSG00000118271 0.0919461744308925         0         0         TTR
## ENSG00000089199  0.262031442929708         0         0        CHGB
## ENSG00000169903  0.595310249090534         0         0      TM4SF4
## ENSG00000166922  0.446393719902237         0         0        SCG5
## ENSG00000254647   3.25705523464086         0         0         INS

Table of Contents
Performing MNN-based correction
Processing the different datasets
Top

3 Feature selection across batches

To obtain a single set of features for batch selection, we take the top 1000 genes with the largest biological components from each batch. The intersection of this set across batches is defined as our feature set for MNN correction.

top.e5601 <- rownames(dec.e5601)[seq_len(1000)]
top.gse85241 <- rownames(dec.gse85241)[seq_len(1000)]
top.gse81076 <- rownames(dec.gse81076)[seq_len(1000)]
chosen <- Reduce(intersect, list(top.e5601, top.gse85241, top.gse81076))

# Adding some gene symbols for interpretation.
symb <- mapIds(org.Hs.eg.db, keys=chosen, keytype="ENSEMBL", column="SYMBOL")
DataFrame(ID=chosen, Symbol=symb)
## DataFrame with 334 rows and 2 columns
##                  ID      Symbol
##         <character> <character>
## 1   ENSG00000115263         GCG
## 2   ENSG00000118271         TTR
## 3   ENSG00000089199        CHGB
## 4   ENSG00000169903      TM4SF4
## 5   ENSG00000166922        SCG5
## ...             ...         ...
## 330 ENSG00000114315        HES1
## 331 ENSG00000107372      ZFAND5
## 332 ENSG00000087086         FTL
## 333 ENSG00000170348      TMED10
## 334 ENSG00000251562      MALAT1

The use of an intersection is a rather conservative strategy as it requires the same gene to be highly variable in all batches. This may not be possible for marker genes of cell types that are not present in all batches. An alternative approach is to use combineVar() to compute the average biological component across batches for each gene. The feature set can then be defined as the top \(X\) genes with the largest biological components.

# Identifying genes that are annotated in all batches.
in.all <- Reduce(intersect, list(rownames(dec.e5601), 
    rownames(dec.gse85241), rownames(dec.gse81076)))

# Setting weighted=FALSE so each batch contributes equally.
combined <- combineVar(dec.e5601[in.all,], dec.gse85241[in.all,],
    dec.gse81076[in.all,], weighted=FALSE)
chosen2 <- rownames(combined)[head(order(combined$bio, decreasing=TRUE), 1000)]

Despite its conservativeness, we will use the intersection here to focus on the obvious biological differences between pancreas cell types. This reduces the number of genes that drive the weaker uninteresting differences between donors within each batch. These will not be corrected by applying mnnCorrect() to the batches, and will complicate the interpretation of the plots below.

Comments from Aaron:

Table of Contents
Examining the effect of correction
Feature selection across batches
Top

4 Performing MNN-based correction

Consider a cell \(a\) in batch \(A\), and identify the cells in batch \(B\) that are nearest neighbours to \(a\) in the expression space defined by the selected features. Repeat this for a cell \(b\) in batch \(B\), identifying its nearest neighbours in \(A\). Mutual nearest neighbours are pairs of cells from different batches that belong in each other’s set of nearest neighbours. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which can yield batch-corrected values.

We apply the mnnCorrect() function to the three batches to remove the batch effect, using the genes in chosen. This involves correcting their expression values so that all cells are comparable in the coordinate system of the first batch. The function returns a set of matrices containing corrected expression values, which we can use in downstream analyses.

original <- list(logcounts(sce.e5601)[chosen,],
                 logcounts(sce.gse81076)[chosen,],
                 logcounts(sce.gse85241)[chosen,])
corrected <- do.call(mnnCorrect, c(original, list(k=20, sigma=0.1)))
str(corrected$corrected)
## List of 3
##  $ : num [1:334, 1:2285] 0.13 0.14 0.124 0 0.116 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:334] "ENSG00000115263" "ENSG00000118271" "ENSG00000089199" "ENSG00000169903" ...
##   .. ..$ : chr [1:2285] "HP1502401_J13" "HP1502401_H13" "HP1502401_J14" "HP1502401_B14" ...
##  $ : num [1:334, 1:1291] -0.01607 -0.00611 0.01178 0.00703 0.01182 ...
##  $ : num [1:334, 1:2346] 0.147 0.141 0.135 0.112 0.113 ...

The function also reports the MNN pairs that were identified in each successive batch. This may be useful for trouble-shooting, e.g., to check whether cells independently assigned to the same cell type are correctly identified as MNN pairs.

corrected$pairs
## [[1]]
## DataFrame with 0 rows and 3 columns
## 
## [[2]]
## DataFrame with 5286 rows and 3 columns
##      current.cell other.cell other.batch
##         <integer>      <Rle>       <Rle>
## 1             410          5           1
## 2             618          5           1
## 3             816          5           1
## 4             780          5           1
## 5             400          5           1
## ...           ...        ...         ...
## 5282         1067       2283           1
## 5283         1093       2283           1
## 5284          488       2283           1
## 5285         1000       2283           1
## 5286          854       2283           1
## 
## [[3]]
## DataFrame with 7923 rows and 3 columns
##      current.cell other.cell other.batch
##         <integer>      <Rle>       <Rle>
## 1              69          5           1
## 2            2319          5           1
## 3            1266          5           1
## 4             280          5           1
## 5            1789          5           1
## ...           ...        ...         ...
## 7919          998       3574           2
## 7920         1283       3574           2
## 7921         1254       3574           2
## 7922         1047       3574           2
## 7923           53       3575           2

Comments from Aaron:

Table of Contents
Using the corrected values in downstream analyses
Performing MNN-based correction
Top

5 Examining the effect of correction

We create a new SingleCellExperiment object containing these corrected counts for each cell, along with information regarding the batch of origin.

omat <- do.call(cbind, original)
mat <- do.call(cbind, corrected$corrected)
colnames(mat) <- NULL
sce <- SingleCellExperiment(list(original=omat, corrected=mat))
colData(sce)$Batch <- rep(c("e5601", "gse81076", "gse85241"),
                          lapply(corrected$corrected, ncol))
sce
## class: SingleCellExperiment 
## dim: 334 5922 
## metadata(0):
## assays(2): original corrected
## rownames(334): ENSG00000115263 ENSG00000118271 ... ENSG00000170348
##   ENSG00000251562
## rowData names(0):
## colnames(5922): HP1502401_J13 HP1502401_H13 ... D30.8_93 D30.8_94
## colData names(1): Batch
## reducedDimNames(0):
## spikeNames(0):

We examine the batch correction with some t-SNE plots. Figure~4 demonstrates how the cells separate by batch of origin in the uncorrected data. After correction, more intermingling between batches is observed, consistent with the removal of batch effects. Note that the E-MTAB-5601 dataset still displays some separation, which is probably due to the fact that the other batches are UMI datasets.

osce <- runTSNE(sce, exprs_values="original", rand_seed=100)
ot <- plotTSNE(osce, colour_by="Batch") + ggtitle("Original")
csce <- runTSNE(sce, exprs_values="corrected", rand_seed=100)
ct <- plotTSNE(csce, colour_by="Batch") + ggtitle("Corrected")
multiplot(ot, ct, cols=2)
t-SNE plots of the pancreas datasets, before and after MNN correction. Each point represents a cell and is coloured by the batch of origin.

Figure 4: t-SNE plots of the pancreas datasets, before and after MNN correction. Each point represents a cell and is coloured by the batch of origin.

We colour by the expression of marker genes for known pancreas cell types to determine whether the correction is biologically sensible. Cells in the same visual cluster express the same marker genes (Figure 5), indicating that the correction maintains separation of cell types.

ct.gcg <- plotTSNE(csce, by_exprs_values="corrected", 
    colour_by="ENSG00000115263") + ggtitle("Alpha cells (GCG)")
ct.ins <- plotTSNE(csce, by_exprs_values="corrected", 
    colour_by="ENSG00000254647") + ggtitle("Beta cells (INS)")
ct.sst <- plotTSNE(csce, by_exprs_values="corrected", 
    colour_by="ENSG00000157005") + ggtitle("Delta cells (SST)")
ct.ppy <- plotTSNE(csce, by_exprs_values="corrected", 
    colour_by="ENSG00000108849") + ggtitle("PP cells (PPY)")
multiplot(ct.gcg, ct.ins, ct.sst, ct.ppy, cols=2)
t-SNE plots after MNN correction, where each point represents a cell and is coloured by its corrected expression of key marker genes for known cell types in the pancreas.

Figure 5: t-SNE plots after MNN correction, where each point represents a cell and is coloured by its corrected expression of key marker genes for known cell types in the pancreas.

Table of Contents
Examining the effect of correction
Top

6 Using the corrected values in downstream analyses

MNN correction places all cells from all batches within the same coordinate system. This means that the corrected values can be freely used to define distances between cells for dimensionality reduction or clustering. However, the correction does not preserve the mean-variance relationship. As such, we do not recommend using the corrected values for studying heterogeneity.

MNN-corrected values are generally not suitable for differential expression (DE) analyses, for several reasons:

In scRNA-seq studies, most DE analyses will be performed between clusters or along trajectories. We recommend using the corrected values for clustering or trajectory reconstruction, and switching back to the original log-expression values (or counts) for the DE analysis. The batch of origin can then be modelled as a blocking factor, which is possible with most packages for DE analysis such as edgeR or limma.

Finally, the MNN-corrected values can be used for further correction with mnnCorrect(). This is useful in nested experimental designs involving multiple batches within each of multiple studies, much like the pancreas datasets described above. In such cases, it is reasonable to first perform the correction across batches within the same study (where there should be more similar cell types, and thus more MNN pairs). The batch-corrected values for each study can then be supplied to mnnCorrect() to remove batch effects between studies.

Comments from Aaron:

Table of Contents
Using the corrected values in downstream analyses
Top

References

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol., April.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7):e47.

Segerstolpe, A., A. Palasantza, P. Eliasson, E. M. Andersson, A. C. Andreasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metab. 24 (4):593–607.