decontX {celda}R Documentation

Contamination estimation with decontX

Description

Identifies contamination from factors such as ambient RNA in single cell genomic datasets.

Usage

## S4 method for signature 'SingleCellExperiment'
decontX(
  x,
  assayName = "counts",
  z = NULL,
  batch = NULL,
  maxIter = 500,
  delta = 10,
  convergence = 0.001,
  iterLogLik = 10,
  varGenes = 5000,
  dbscanEps = 1,
  L = 50,
  seed = 12345,
  logfile = NULL,
  verbose = TRUE
)

## S4 method for signature 'ANY'
decontX(
  x,
  z = NULL,
  batch = NULL,
  maxIter = 500,
  delta = 10,
  convergence = 0.001,
  iterLogLik = 10,
  varGenes = 5000,
  dbscanEps = 1,
  L = 50,
  seed = 12345,
  logfile = NULL,
  verbose = TRUE
)

Arguments

x

A numeric matrix of counts or a SingleCellExperiment with the matrix located in the assay slot under assayName. Cells in each batch will be subsetted and converted to a sparse matrix of class dgCMatrix from package Matrix before analysis.

assayName

Character. Name of the assay to use if x is a SingleCellExperiment.

z

Numeric or character vector. Cell cluster labels. If NULL, Celda will be used to reduce the dimensionality of the dataset to 'L' modules, 'umap' from the 'uwot' package will be used to further reduce the dataset to 2 dimenions and the 'dbscan' function from the 'dbscan' package will be used to identify clusters of broad cell types. Default NULL.

batch

Numeric or character vector. Batch labels for cells. If batch labels are supplied, DecontX is run on cells from each batch separately. Cells run in different channels or assays should be considered different batches. Default NULL.

maxIter

Integer. Maximum iterations of the EM algorithm. Default 500.

delta

Numeric. Symmetric Dirichlet concentration parameter to initialize theta. Default 10.

convergence

Numeric. The EM algorithm will be stopped if the maximum difference in the contamination estimates between the previous and current iterations is less than this. Default 0.001.

iterLogLik

Integer. Calculate log likelihood every 'iterLogLik' iteration. Default 10.

varGenes

Integer. The number of variable genes to use in Celda clustering. Variability is calcualted using modelGeneVar function from the 'scran' package. Used only when z is not provided. Default 5000.

dbscanEps

Numeric. The clustering resolution parameter used in 'dbscan' to estimate broad cell clusters. Used only when z is not provided. Default 1.

L

Integer. Number of modules for Celda clustering. Used to reduce the dimensionality of the dataset before applying UMAP and dbscan. Used only when z is not provided. Default 50.

seed

Integer. Passed to with_seed. For reproducibility, a default value of 12345 is used. If NULL, no calls to with_seed are made.

logfile

Character. Messages will be redirected to a file named 'logfile'. If NULL, messages will be printed to stdout. Default NULL.

verbose

Logical. Whether to print log messages. Default TRUE.

Value

If x is a matrix-like object, a list will be returned with the following items:

decontXcounts:

The decontaminated matrix. Values obtained from the variational inference procedure may be non-integer. However, integer counts can be obtained by rounding, e.g. round(decontXcounts).

contamination:

Percentage of contamination in each cell.

estimates:

List of estimated parameters for each batch. If z was not supplied, then the UMAP coordinates used to generated cell cluster labels will also be stored here.

z:

Cell population/cluster labels used for analysis.

runParams:

List of arguments used in the function call.

If x is a SingleCellExperiment, then the decontaminated counts will be stored as an assay and can be accessed with decontXcounts(x). The contamination values and cluster labels will be stored in colData(x). estimates and runParams will be stored in metadata(x)$decontX. If z was not supplied, then the UMAPs used to generated cell cluster labels will be stored in reducedDims slot in x

Examples

s <- simulateContaminatedMatrix()
result <- decontX(s$observedCounts, s$z)
contamination <- colSums(s$observedCounts - s$nativeCounts) /
  colSums(s$observedCounts)
plot(contamination, result$contamination)

[Package celda version 1.3.4 Index]