单细胞细胞注释工具之CHETAH

作者: Kevin_Hhui | 来源:发表于2021-08-05 16:40 被阅读0次

单细胞细胞注释工具之CHETAH
单细胞注释工具
单细胞分析之细胞注释工具Garnett
单细胞转录组细胞类型注释分析合集
CHETAH: sc-RNA-seq 细胞类型定义工具
使用CHETAH包进行单细胞类型注释分析
sc-ATAC-seq细胞类型注释策略
单细胞分析之细胞注释-2：Garnett
GSEA也能做细胞类型鉴定？
单细胞辅助注释工具-SingleR

算法原理解析

（A）准备或构建一个参考单细胞数据集（reference scRNA-seq dataset），参考数据集完全独立于需要分类（训练）的数据集，接着利用Spearman correlation and average linkage构建分类树（classification tree）。这幅图里假定了有四种细胞类型，及它们的表达矩阵。
（C）针对前面层次聚类的结果（两个分支“branch”，4个叶子“leaf”），先找出最具差异性的基因集；寻找的方法是：循环计算选定branch下的leaf矩阵中基因的均值（这里记为M1），同时忽视同一个branch下的其他叶子，对于其他branch的leaf则将其视为整体计算均值（M2）。接着计算FC，即M1/M2，从而找出这个branch下最显著的基因集（最能将该branch与其他branch区分开来的基因集），最后选取FC排名前200个基因作为候选基因集。
（D）计算一个表达评分（profile score），这个评分的目的是展现出某个需训练的数据（j）与候选基因集及其他branch的相似性，个人理解是为了更准确推断这个j数据更像我们的候选基因集，而与其他branch相似度不高。该评分的计算方法：分别将输入数据j与候选基因集、其他整体branch做相关性分析（看源代码可以发现两次相关性分析使用的基因是一致的），接着将相关性转换为累积密度分布，最后二者相减就得出这个评分。
（E）计算置信分数（confidence score），这里的置信分数主要是为了控制标签注释的继续和停止。通过前一步的循环计算，我们可以得到每个branch下每片leaf的一个表达评分，将某个branch下最高评分的leaf作为候选标签，其他branch的表达评分则取均值，接着再二者相减，得到的结果与自己设定的阈值做比较（默认的阈值是0.1），如果该结果比阈值大，则将候选标签分配给这些细胞j；否则，该leaf将不会被分配（当该leaf为树的顶部节点），或被分为“中间类型”（当leaf下面仍有子叶--subleaf）。

algorithm overview

代码实操(参考官网教程：https://github.com/jdekanter/CHETAH)

you will only need:

your input data
a reference dataset, annotated with cell types
Both as a SingleCellExperiment

安装

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CHETAH")

input_counts: t-SNE1 coordinates in input_tsne
ref_counts: celltypes vector ref_ct

构建reference dataset,包内置的参考数据集：https://figshare.com/s/aaf026376912366f81b6

## To prepare the data from the package's internal data, run:
celltypes_hn <- headneck_ref$celltypes
counts_hn <- assay(headneck_ref)
counts_melanoma <- assay(input_mel)
tsne_melanoma <- reducedDim(input_mel)

因为参考的数据里面可能不同细胞类型的数目存在差异，作者认为细胞数目在10-200之间是比较好，计算量不会太大，所以有一步选择细胞的过程。我原始数据中有一类细胞有1000多，如果不选的话，会导致构建分类树的时候，结果不准确。

构建开始前对细胞进行抽样

cell_selection <- unlist(lapply(unique(ref$celltypes), function(type) {
    type_cells <- which(ref$celltypes == type)
    if (length(type_cells) > 200) {
        type_cells[sample(length(type_cells), 200)]
    } else type_cells
}))
ref_new <- ref[ ,cell_selection]

Step 1: normalization

#方法一，自己写函数标准化
assay(headneck_ref, "counts") <- apply(assay(headneck_ref, "counts"), 2, function(column) log2((column/sum(column) * 10000) + 1))  #官网这写着是10^5，但是为了跟seurat统一，我改为10^4

#方法二，利用seurat的内置函数标准化
data <- NormalizeData(object = headneck_ref, normalization.method = "LogNormalize", scale.factor = 1e4)

Step 2: discaring of house-keeping genes

We therefore delete these genes, using the “ribosomal” list from here

ribo <- read.table("~/ribosomal.txt", header = FALSE, sep = '\t')
headneck_ref <- headneck_ref[!rownames(headneck_ref) %in% ribo[,1], ]

Step 3: Reference QC

CorrelateReference

CorrelateReference(ref_cells = headneck_ref)

ClassifyReference

ClassifyReference(ref_cells = headneck_ref)

Step 4: Running CHETAH

## For the reference we define a "counts" assay and "celltypes" metadata
headneck_ref <- SingleCellExperiment(assays = list(counts = counts_hn),
                                     colData = DataFrame(celltypes = celltypes_hn))

## For the input we define a "counts" assay and "TSNE" reduced dimensions
input_mel <- SingleCellExperiment(assays = list(counts = counts_melanoma),
                                  reducedDims = SimpleList(TSNE = tsne_melanoma))

input_mel <- CHETAHclassifier(input = input_mel,
                              ref_cells = headneck_ref)

PlotCHETAH(input = input_mel)

Please note that each type “NodeX” corresponds to the node with number X and that the type “Unassigned” corresponds to the node 0