Using the R package DropletUtils

Author: 菜鸟跑半年 | Published 2020-07-16 22:50


bioconductor-DropletUtils
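Tutorial: Utilities for handling droplet-based single-cell RNA-seq data
In droplet-based single-cell sequencing, usually only data from droplets that contain exactly one cell are kept. The R package DropletUtils targets the 10X Genomics platform and distinguishes empty droplets (which contain only ambient RNA from the surrounding solution) from cell-containing droplets by comparing each droplet's observed expression profile with the ambient profile.
The main functions of DropletUtils are: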

Reading the UMI count matrix from the 10X Genomics platform
Reading the molecule information file (molecule_info.h5) produced by CellRanger
Downsampling the UMI count matrix or the raw reads
Identifying empty droplets and doublets
Removing barcode swapping effects from the Illumina 4000 sequencer

Installing and loading the package

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DropletUtils")
library(DropletUtils)

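Reading 10X Genomics data

Reading the UMI count matrix

CellRanger writes the filtered sparse matrix under the outs/ directory. CellRanger version 3 uses filtered_feature_bc_matrix/, containing barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz; CellRanger version 2 uses filtered_gene_bc_matrices/, containing barcodes.tsv, genes.tsv, and matrix.mtx.
This tutorial uses simulated data: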

# To generate the files.
example(write10xCounts, echo=FALSE) 
dir.name <- tmpdir
list.files(dir.name)
## [1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"

The read10xCounts function reads the CellRanger output and returns a SingleCellExperiment object:

sce <- read10xCounts(dir.name)
sce
## class: SingleCellExperiment 
## dim: 100 10 
## metadata(1): Samples
## assays(1): counts
## rownames(100): ENSG00001 ENSG00002 ... ENSG000099 ENSG0000100
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## altExpNames(0):

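The loaded expression matrix is sparse: it is a dgCMatrix object from the Matrix package. This class stores only the non-zero counts, which saves memory, and it is widely used for single-cell sequencing data with many dropouts.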

class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
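
To make the memory saving concrete, here is a small sketch that is not part of the original tutorial. The matrix dimensions and the number of non-zero entries are made up for illustration, and only the Matrix package is used.

```r
library(Matrix)

set.seed(1)
# A 1000 x 100 matrix in which only 500 of the 100,000 entries are non-zero.
dense <- matrix(0, nrow = 1000, ncol = 100)
nonzero.idx <- sample(length(dense), 500)
dense[nonzero.idx] <- rpois(500, lambda = 3) + 1  # +1 guarantees non-zero values

# Convert to compressed sparse column storage (a dgCMatrix).
sparse <- Matrix(dense, sparse = TRUE)

class(sparse)
length(sparse@x)   # number of values actually stored

# Memory footprint in bytes of the two representations.
c(dense  = as.numeric(object.size(dense)),
  sparse = as.numeric(object.size(sparse)))
```

The sparse object should be far smaller because the zeros are implied rather than stored.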

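Several samples can also be loaded at once by passing a character vector of directories to read10xCounts. The result is a single SingleCellExperiment object in which the per-sample expression matrices are combined by column, which requires all samples to have the same set of genes.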

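Reading the molecule information file

The molecule information file (molecule_info.h5) produced by CellRanger records, for each molecule, the UMI sequence, the cell barcode, the gene it was assigned to, and the number of reads covering it.
This tutorial uses a simulated example file: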

set.seed(1000)
mol.info.file <- DropletUtils:::sim10xMolInfo(tempfile())
mol.info.file
## [1] "/tmp/Rtmpg4CWdT/file3a62437f43e8.1.h5"

The read10xMolInfo function reads it into R:

mol.info <- read10xMolInfo(mol.info.file)
mol.info
## $data
## DataFrame with 9532 rows and 5 columns
##             cell       umi gem_group      gene     reads
##      <character> <integer> <integer> <integer> <integer>
## 1           TGTT     80506         1        18        12
## 2           CAAT    722585         1        20        17
## 3           AGGG    233634         1         4        17
## 4           TCCC    516870         1        10        13
## 5           ATAG    887407         1         6         8
## ...          ...       ...       ...       ...       ...
## 9528        TACT   1043995         1         9        18
## 9529        GCTG    907401         1        20         7
## 9530        ATTA    255710         1        13         7
## 9531        GCAC    672962         1        20        12
## 9532        TGAA    482852         1         1        10
## 
## $genes
##  [1] "ENSG1"  "ENSG2"  "ENSG3"  "ENSG4"  "ENSG5"  "ENSG6"  "ENSG7"  "ENSG8" 
##  [9] "ENSG9"  "ENSG10" "ENSG11" "ENSG12" "ENSG13" "ENSG14" "ENSG15" "ENSG16"
## [17] "ENSG17" "ENSG18" "ENSG19" "ENSG20"

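The information in the molecule information file is useful for quality control; for example, checking sequencing saturation requires the read counts.
read10xMolInfo guesses the length of the barcode sequence automatically, and the guess is correct in most cases. If the length is known, it can be supplied through the barcode.length argument.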
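
As an illustration of the saturation check mentioned above, the sketch below uses a hypothetical per-molecule table with the same kind of columns as mol.info$data. The definition used here (one minus unique molecules over total reads) is one common convention and is an assumption, not something the original tutorial specifies.

```r
# Hypothetical per-molecule table; each row is one unique molecule (UMI).
mol <- data.frame(
  cell  = c("TGTT", "TGTT", "CAAT", "CAAT", "AGGG"),
  gene  = c(18L, 20L, 4L, 10L, 6L),
  reads = c(12L, 17L, 17L, 13L, 8L)
)

n.umis  <- nrow(mol)        # number of unique molecules
n.reads <- sum(mol$reads)   # number of reads supporting them

# Fraction of reads that re-sequence an already observed molecule.
saturation <- 1 - n.umis / n.reads
saturation

# The same quantity computed separately for each cell barcode.
per.cell <- 1 - tapply(mol$reads, mol$cell, length) /
                tapply(mol$reads, mol$cell, sum)
per.cell
```

A value close to 1 means that additional sequencing would mostly produce duplicate reads rather than new molecules.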

Downsampling across batches

downsampling the UMI count matrix

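When batches were sequenced to very different depths, downsampling the deepest batches to match the coverage of the shallowest one helps prevent technical noise from making cells cluster by batch in downstream analysis. The downsampleMatrix function does this.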

set.seed(100)
new.counts <- downsampleMatrix(counts(sce), prop=0.5)
library(Matrix)
colSums(counts(sce))
##  [1] 508 524 490 518 464 468 484 544 519 473
colSums(new.counts)
##  [1] 254 262 245 259 232 234 242 272 260 237

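The code above halves the total count of every cell.
The prop argument is chosen by the user and depends on the number of batches and on which batch has the lowest coverage. It can also be a vector of cell-specific proportions.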
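
The sketch below shows one way to pick prop for several batches, followed by a toy version of count downsampling. The batch depths and the helper downsample_cell are hypothetical and written for illustration only; this is not the implementation used by downsampleMatrix.

```r
# Hypothetical total UMI counts per batch.
batch.depth <- c(batchA = 120000, batchB = 60000, batchC = 90000)

# Proportion to retain in each batch so that all match the shallowest one.
prop <- min(batch.depth) / batch.depth
prop

# Toy downsampling of one cell: sample its molecules without replacement.
downsample_cell <- function(x, prop) {
  molecules <- rep(seq_along(x), times = x)   # one entry per UMI, labelled by gene
  n.keep <- round(prop * length(molecules))
  kept <- molecules[sample.int(length(molecules), n.keep)]
  tabulate(kept, nbins = length(x))           # back to per-gene counts
}

set.seed(100)
cell <- c(10L, 0L, 3L, 25L, 2L)               # counts for five genes
down <- downsample_cell(cell, prop[["batchA"]])
rbind(original = cell, downsampled = down)
```

Sampling without replacement fixes the downsampled total at the requested fraction of the original, which is the behaviour seen in the column sums above.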

downsampling the raw reads

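Downsampling the reads is more appropriate because it accounts for differences in sequencing depth per cell. This is done by applying the downsampleReads function to the molecule information file, which contains the read counts.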

set.seed(100)
no.sampling <- downsampleReads(mol.info.file, prop=1)
sum(no.sampling)
## [1] 9532
with.sampling <- downsampleReads(mol.info.file, prop=0.5)
sum(with.sampling)
## [1] 9502

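The code above downsamples the reads to 50% of the original coverage. downsampleReads returns a UMI count matrix, so if sequencing saturation is high the final total count does not drop by much. Users who want similar total counts after downsampling should use the downsampleMatrix function from the previous section instead.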
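
The following toy simulation illustrates why halving the reads barely changes the UMI count when saturation is high. The read depths are made up, and independent thinning of each read is an assumption chosen for simplicity; it is not claimed to be the exact sampling scheme of downsampleReads.

```r
set.seed(100)
# Hypothetical data: 1000 molecules, each covered by several reads.
reads <- rpois(1000, lambda = 10) + 1

# Keep each read independently with probability 0.5.
kept.reads <- rbinom(length(reads), size = reads, prob = 0.5)

# A molecule still contributes one UMI count if at least one read survives.
c(reads.before = sum(reads), reads.after = sum(kept.reads))
c(umis.before = length(reads), umis.after = sum(kept.reads > 0))
```

The number of reads drops by roughly half, while almost every molecule keeps at least one read and therefore remains in the count matrix.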

Computing barcode ranks

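A useful diagnostic for droplet-based data is the barcode rank plot, which shows the (log-)total UMI count of each barcode on the y-axis against its (log-)rank on the x-axis. It is effectively a transposed empirical cumulative density plot with log-transformed axes, and it shows how total counts are distributed across barcodes.
Simulate an expression matrix: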

set.seed(0)
my.counts <- DropletUtils:::simCounts()

Compute the barcode ranks and plot them:

br.out <- barcodeRanks(my.counts)

# Making a plot.
plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
o <- order(br.out$rank)
lines(br.out$rank[o], br.out$fitted[o], col="red")

abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"), 
    legend=c("knee", "inflection"))

barcode rank plot

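The knee and inflection points of the curve mark the transition in the total count distribution, reflecting the difference between empty droplets with little RNA and cell-containing droplets with much more RNA. The code in the next section separates the two more rigorously.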
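
To show what goes into such a plot, the sketch below computes totals and ranks by hand on a made-up count matrix. The simulation parameters are arbitrary, and the tie handling is an assumption; it does not reproduce the curve fitting or the knee and inflection detection done by barcodeRanks.

```r
set.seed(0)
# Hypothetical matrix: 50 genes x 300 barcodes. The first 30 barcodes are
# "cells" with far more RNA than the remaining 270 empty droplets.
lambda <- rep(c(20, 0.2), times = c(30, 270))
counts <- sapply(lambda, function(l) rpois(50, l))

totals <- colSums(counts)
# Rank 1 is the largest library; tied barcodes share their average rank.
ranks <- rank(-totals, ties.method = "average")

# Barcodes ordered from the largest to the smallest library.
ord <- order(ranks)
head(data.frame(barcode = ord, rank = ranks[ord], total = totals[ord]))
```

Plotting ranks against totals with both axes on a log scale gives the barcode rank plot; the simulated cells occupy the top 30 ranks, followed by a sharp drop to the empty droplets.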

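Detecting empty droplets

Empty droplets can contain RNA from the surrounding solution, so their counts are not zero. The emptyDrops function distinguishes empty droplets from cell-containing droplets by testing whether each barcode's expression profile deviates significantly from the ambient profile.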

set.seed(100)
e.out <- emptyDrops(my.counts)
e.out
## DataFrame with 11100 rows and 5 columns
##           Total   LogProb    PValue   Limited        FDR
##       <integer> <numeric> <numeric> <logical>  <numeric>
## 1             2        NA        NA        NA         NA
## 2             9        NA        NA        NA         NA
## 3            20        NA        NA        NA         NA
## 4            20        NA        NA        NA         NA
## 5             1        NA        NA        NA         NA
## ...         ...       ...       ...       ...        ...
## 11096       215  -246.428 9.999e-05      TRUE 0.00014427
## 11097       201  -250.234 9.999e-05      TRUE 0.00014427
## 11098       247  -275.905 9.999e-05      TRUE 0.00014427
## 11099       191  -228.763 9.999e-05      TRUE 0.00014427
## 11100       198  -233.043 9.999e-05      TRUE 0.00014427

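Droplets with an FDR below a chosen threshold (for example 0.01) deviate significantly from the ambient profile and can be treated as cell-containing droplets. In addition, so that cells whose profiles closely resemble the ambient solution are not discarded, droplets with large total counts are retained automatically by setting their p-values to zero.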

is.cell <- e.out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)
## [1] 902
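
The NA values in the output come from barcodes with very low totals, which are not tested, so the logical vector is.cell also contains NA. The sketch below, using made-up FDR values and a toy count matrix, shows why which() is a convenient way to subset in that situation.

```r
# Hypothetical FDR values; NA marks barcodes that were not tested.
fdr <- c(NA, 0.5, 0.0001, NA, 0.009, 0.2)
is.cell <- fdr <= 0.01

is.cell                      # the comparison propagates NA
sum(is.cell, na.rm = TRUE)   # count only the barcodes called as cells

# which() drops NA, so it is safe to use for column subsetting.
cell.idx <- which(is.cell)
cell.idx

# Toy count matrix with one column per barcode.
counts <- matrix(1:12, nrow = 2,
                 dimnames = list(c("g1", "g2"), paste0("bc", 1:6)))
counts[, cell.idx, drop = FALSE]
```

Indexing directly with a logical vector that contains NA would either fail or produce unwanted columns, depending on the object type.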

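The p-values are computed by a permutation (Monte Carlo) test, so a seed should be set with set.seed for reproducibility. The Limited column of e.out is a logical flag indicating whether the p-value could be lowered by running more iterations. If any entries have an FDR above the chosen threshold while Limited==TRUE, the niters argument of emptyDrops should be increased.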

table(Limited=e.out$Limited, Significant=is.cell)
##        Significant
## Limited FALSE TRUE
##   FALSE   398  802
##   TRUE      0  100
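
The idea behind the Limited flag can be sketched with a generic Monte Carlo test. The example below assumes 10000 iterations and a standard normal null distribution, both chosen only for illustration; it is not the statistic used by emptyDrops.

```r
niters <- 10000   # assumed number of Monte Carlo iterations

# Smallest attainable p-value: no simulated statistic reaches the observed one.
min.p <- (0 + 1) / (niters + 1)
signif(min.p, 4)

# Toy Monte Carlo test against a standard normal null.
set.seed(100)
observed  <- 6
simulated <- rnorm(niters)
n.exceed  <- sum(simulated >= observed)

p.value <- (n.exceed + 1) / (niters + 1)
limited <- n.exceed == 0   # TRUE: more iterations could lower the p-value

p.value
limited
```

With 10000 iterations the floor is about 9.999e-05, the same value that appears in the PValue column for the rows flagged as Limited in the output above.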

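Diagnostic plot: the total count against the negative log-probability.
Cell-containing droplets should have a large negative log-probability or a large total count (for the total count threshold, see the knee point found by barcodeRanks in the previous section).
The plot below is based on simulated data, so the separation is exaggerated: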

plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
    xlab="Total UMI count", ylab="-Log Probability")

Total count against the negative log-probability (red: cell-containing droplets; black: empty droplets)

Demultiplexing hashed libraries

The hashedDrops() function demultiplexes single-cell cell hashing experiments.
First simulate a hash tag oligo (HTO) count matrix that includes doublets and empty droplets.

set.seed(10000)

# Simulating empty droplets:
nbarcodes <- 1000
nhto <- 10
y <- matrix(rpois(nbarcodes*nhto, 20), nrow=nhto)

# Simulating cells:
ncells <- 100
true.sample <- sample(nhto, ncells, replace=TRUE)
y[cbind(true.sample, seq_len(ncells))] <- 1000

# Simulating doublets:
ndoub <- ncells/10
next.sample <- (true.sample[1:ndoub]  + 1) %% nrow(y)
next.sample[next.sample==0] <- nrow(y)
y[cbind(next.sample, seq_len(ndoub))] <- 500

First identify the cell-containing droplets from the HTO count matrix. emptyDrops() needs its lower= argument set to match the sequencing depth of the HTO library.

hto.calls <- emptyDrops(y, lower=500)
has.cell <- hto.calls$FDR <= 0.001
summary(has.cell)
##    Mode    TRUE    NA's 
## logical     100     900

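Each barcode's library is assigned to a sample of origin according to its most abundant HTO. Confidence in the assignment is quantified by the log-fold change between the most abundant and the second most abundant HTO (the LogFC column of demux below). hashedDrops() automatically corrects for differences in ambient HTO levels.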

demux <- hashedDrops(y[,which(has.cell)], 
    ambient=metadata(hto.calls)$ambient)
demux
## DataFrame with 100 rows and 7 columns
##         Total      Best    Second     LogFC    LogFC2   Doublet Confident
##     <numeric> <integer> <integer> <numeric> <numeric> <logical> <logical>
## 1        1657         4         5  0.999462   4.60496      TRUE     FALSE
## 2        1635         8         9  0.999492   4.84165      TRUE     FALSE
## 3        1669         6         7  0.999473   4.45073      TRUE     FALSE
## 4        1674         6         7  0.999491   4.49983      TRUE     FALSE
## 5        1645         3         4  1.000292   4.74602      TRUE     FALSE
## ...       ...       ...       ...       ...       ...       ...       ...
## 96       1167         3         1   5.31708  0.427468     FALSE      TRUE
## 97       1158         3         1   5.26081  0.526363     FALSE      TRUE
## 98       1179         4         9   5.00121  0.604380     FALSE      TRUE
## 99       1187         2         5   5.37410  0.196833     FALSE      TRUE
## 100      1177         5         8   5.15739  0.464633     FALSE      TRUE
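
The Best, Second, and LogFC columns can be illustrated with a simplified calculation on made-up HTO counts. The pseudo-count and the helper summarise_droplet are hypothetical, and the sketch omits the ambient correction and normalization that hashedDrops() applies, so its numbers are not expected to match the package output exactly.

```r
# Hypothetical HTO counts: 4 tags (rows) x 3 droplets (columns).
hto <- cbind(
  singlet = c(1000, 18, 22, 25),
  doublet = c(21, 1000, 500, 19),
  unclear = c(30, 45, 28, 33)
)

pseudo <- 1  # arbitrary pseudo-count to avoid division by zero

# For one droplet: index of the top two tags and their log2 ratio.
summarise_droplet <- function(x) {
  o <- order(x, decreasing = TRUE)
  c(Best   = o[1],
    Second = o[2],
    LogFC  = log2((x[o[1]] + pseudo) / (x[o[2]] + pseudo)))
}

res <- t(apply(hto, 2, summarise_droplet))
res
```

A droplet dominated by one tag has a large LogFC, while a droplet with two abundant tags or with no clearly dominant tag has a small one.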

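The sample of origin of each cell can then be determined. The package authors provide the Confident field, which marks droplets that are confident singlets, meaning that (i) the droplet is not a doublet and (ii) the log-fold change between the most abundant and second most abundant HTO is not small. (My understanding: the Confident column of demux is defined from the Doublet and LogFC columns. A value of TRUE marks a confident singlet, and for those rows the Best column gives the sample of origin, i.e. demux$Best[demux$Confident].)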

table(demux$Best[demux$Confident])
## 
##  1  2  3  4  5  6  7  8  9 10 
## 10 15  9  7 12  8  6  6 10  6

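Doublets are also identified from the log-fold change between the second most abundant HTO and the ambient HTO level (the LogFC2 column of demux above). The larger LogFC2 is, the less likely it is that the second HTO comes from ambient contamination, which supports the presence of a doublet.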

colors <- ifelse(demux$Confident, "black",
    ifelse(demux$Doublet, "red", "grey"))
plot(demux$LogFC, demux$LogFC2, col=colors,
    xlab="Log-fold change between best and second HTO",
    ylab="Log-fold change between second HTO and ambient")

Black: confident singlets; red: doublets; grey: neither a confident singlet nor a doublet

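Removing swapping effects

Removing barcode swapping between samples

Barcode swapping often occurs when multiplexed samples are sequenced on the Illumina 4000 sequencer. Molecules from one sample are mislabelled with the sample barcode of another and are therefore assigned to the wrong sample during demultiplexing. Fortunately, droplet-based single-cell data allow this effect to be removed: it is extremely unlikely that independent RNA molecules share exactly the same combination of cell barcode and UMI, so molecules with an identical combination in several samples most likely arise from barcode swapping.
The swappedDrops function uses the molecule information files of 10X samples multiplexed in the same run to identify and remove swapped molecules, producing "cleaned" UMI count matrices.
First simulate molecule information files for multiplexed 10X samples: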

set.seed(1000)
mult.mol.info <- DropletUtils:::sim10xMolInfo(tempfile(), nsamples=3)
mult.mol.info
## [1] "/tmp/Rtmpg4CWdT/file3a627557651c.1.h5"
## [2] "/tmp/Rtmpg4CWdT/file3a627557651c.2.h5"
## [3] "/tmp/Rtmpg4CWdT/file3a627557651c.3.h5"

Then remove the barcode swapping effect with swappedDrops:

s.out <- swappedDrops(mult.mol.info, min.frac=0.9)
length(s.out$cleaned)
## [1] 3
class(s.out$cleaned[[1]])
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
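
The logic of the min.frac threshold can be sketched on a small made-up table of molecules. The data frame and its column names are hypothetical, and grouping by cell barcode, UMI, and gene is an assumption made for this illustration; the sketch is not the package's implementation.

```r
# Hypothetical molecules observed across three multiplexed samples.
mols <- data.frame(
  sample = c("A", "B", "C", "A", "B", "A"),
  cell   = c("ACGT", "ACGT", "ACGT", "TTGA", "TTGA", "GGCA"),
  umi    = c(101, 101, 101, 7, 7, 55),
  gene   = c(3, 3, 3, 9, 9, 1),
  reads  = c(95, 3, 2, 10, 10, 20)
)

min.frac <- 0.9

# Molecules sharing the same cell barcode, UMI and gene across samples
# are treated as copies of one original molecule.
key <- paste(mols$cell, mols$umi, mols$gene, sep = "_")

# Fraction of that molecule's reads found in each sample.
mols$frac <- mols$reads / ave(mols$reads, key, FUN = sum)

# Keep a molecule only in a sample that holds at least min.frac of its reads.
mols$keep <- mols$frac >= min.frac
mols

cleaned <- mols[mols$keep, c("sample", "cell", "umi", "gene")]
cleaned
```

In this toy table the first molecule is kept only in sample A, the second is dropped from both samples because neither dominates, and the third is untouched because it appears in a single sample.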

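Chimeric reads within cells

Chimeric molecules arise during library preparation, when an incomplete PCR product from one cDNA molecule hybridizes to another molecule through shared sequence (such as the poly-A tail in 3' protocols) and is extended. This produces an amplicon in which the UMI and cell barcode come from one transcript while the gene sequence comes from another, which amounts to reads swapping between genes.
Applying chimericDrops() to a molecule information file yields a cleaned count matrix with the chimeric reads removed: among different transcripts that share the same UMI within the same cell, only the one with the most reads is kept.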

out <- chimericDrops(mult.mol.info[1])
class(out)
## [1] "list"
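
The rule described above, keeping the best-supported transcript for each UMI within a cell, can be sketched on a made-up molecule table. The data frame is hypothetical and the sketch ignores any additional thresholds the real function applies; see ?chimericDrops for its actual arguments.

```r
# Hypothetical molecules; the first two share a cell barcode and a UMI
# but map to different genes, which suggests a chimeric artefact.
mols <- data.frame(
  cell  = c("ACGT", "ACGT", "ACGT", "TTGA"),
  umi   = c(101, 101, 202, 101),
  gene  = c(3, 8, 5, 3),
  reads = c(40, 2, 15, 9)
)

# Group molecules by cell barcode and UMI.
key <- paste(mols$cell, mols$umi, sep = "_")

# Within each group, keep only the gene supported by the most reads.
mols$keep <- mols$reads == ave(mols$reads, key, FUN = max)
mols[mols$keep, ]
```

Note that this simplified rule would keep both molecules when two genes tie on read count, which a stricter procedure would treat differently.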

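Other uses

As the counterpart of read10xCounts, the write10xCounts function in DropletUtils converts a sparse UMI count matrix back into CellRanger-style output (three files, or HDF5 format).
See ?write10xCounts for arguments and usage; the version argument selects whether to write the CellRanger version 3 or version 2 format.
For example, to export the expression matrix of a Seurat object as barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz: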

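library(DropletUtils)    # load the package
# Export barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz from the Seurat object to the output folder.
# The output folder must not already exist unless overwrite = TRUE is passed.
write10xCounts("output/", seurat_obj[["RNA"]]@counts, version = "3")
R.utils::gunzip("output/features.tsv.gz")   # decompress the gz file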
