Celaref | 单细胞测序细胞类型注释工具

作者: 生信宝典 | 来源:发表于2019-11-29 22:08 被阅读0次

单细胞转录组细胞类型注释分析合集
Celaref | 单细胞测序细胞类型注释工具
使用Celaref包进行单细胞类型注释分析（一）：Celaref
sc-ATAC-seq细胞类型注释策略
celaref ||单细胞细胞类型定义工具
单细胞入门【1】：单细胞测序方法该如何选择？
单细胞多组学新品预告 | 单细胞ATAC+单细胞转录联合检测
芯周刊·第23期 | 单细胞+空间转录组测序：揭示人类输尿管中的
Cancer Discovery ：胃癌单细胞图谱
XenofilteR：PDx 数据分析软件

我导再也不用担心我不认识marker啦

我们在进行单细胞测序的时候，通常情况下是通过高变genes来辨别细胞类型（于是一大堆不认识的），除了免疫细胞可能容易分析出来，其他的细胞我是两眼一抹黑。虽然具有single cell marker基因库，但在判断细胞亚型的时候有时并不是那么solid，有一些原则上比较特异的gene不知道为什么在cell marker上出现在不同的细胞上，比如S100A8,S100A9，在cell marker基因库中多为中性粒细胞的marker，但结合生理学意义，其实在单细胞中由于中性粒细胞本身含量较少且较为脆弱等原因导致捕获中性粒细胞是极为困难的。此时如果贸然下结论为中性粒细胞其实不利于后期的分析。

image

单细胞分群后，怎么找到Marker基因定义每一类群？

celaref R包通过与已知细胞类型的参考数据集的相似度进行比较。输入每个细胞中每个基因的reads counts数（gene-cell matrix）和每个细胞所属的簇（cluster）信息，和每个查询组中最明显富集的基因的参考样本比较，通过排名来匹配细胞类型。

Workflow

下面是典型的celaref工作流程，根据与参考数据集的转录组相似性来表征查询数据集的细胞簇。

image.gif

image

数据及其格式

gene-cell matrix
image
每个细胞所属的cluster信息
image

输入数据之后利用MAST包进行基因表达差异分析，每个基因被指定为从0（最富集）到1（最不存在）的rescaled rank。

比较查询数据和参考数据

得到每个查询细胞簇的Up基因列表 — 在该簇中具有显著更高表达的基因。在每个参考细胞簇的基因排名中查找这些基因，比较并绘制相似性。

输出结果

通常，查询数据中的每个细胞簇都针对参考数据（X轴）中的所有内容绘制。刻度线表示up基因，并且中位基因（middle generank）显示为粗条。顶部附近的偏差分布表示各组的相似性 - 基本上相同的基因代表它们各自样品中的cluster。也就是说查询组聚类分析后的代表基因如果和reference具有一定的相似性，则可以通过其相同的基因代表与其对应，也就是细胞类型的对应。

image

R语言 - 箱线图（小提琴图、抖动图、区域散点图）

为clusters分配标签

image

通过上图第二列可以发现其实可能存在两种以上的标签，这时候就需要进行综合判断了。

实例分析

# Installing BiocManager if necessary:     
#通过BiocManager进行R包加载
# install.packages("BiocManager")
BiocManager::install("celaref")

我们以系统包中的dataset为例进行演示。

假设有一个新的scRNAseq数据集（查询），其聚类已经聚集成4组：组1—4。但是我们还不知道哪个组对应于哪种细胞类型。

现在有一个相同组织类型的旧数据集（参考），其他人已经确定了细胞类型：Weird subtype, Exciting, Mystery cell typeand Dunno。

library(celaref)
# Paths to data files.
counts_filepath.query    <- system.file("extdata", "sim_query_counts.tab",    package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref      <- system.file("extdata", "sim_ref_counts.tab",      package = "celaref")
cell_info_filepath.ref   <- system.file("extdata", "sim_ref_cell_info.tab",   package = "celaref")

# 加载数据
toy_ref_se   <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)

head(de_table.demo_ref) # 参考数据格式

image

 head(de_table.toy_query ) # 查询数据格式

image

# 过滤
toy_ref_se     <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se   <- trim_small_groups_and_low_expression_genes(toy_query_se)

# 分别得到各自的差异表达基因
de_table.toy_ref   <- contrast_each_group_to_the_rest(toy_ref_se,    dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se,  dataset_name="query")

image

以上两步的运行均需要较长的时间。

# 展示查询组top基因在参考组中的分布
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)

image

# 获得各个group的标签
make_ref_similarity_names(de_table.toy_query, de_table.toy_ref)

image

通过以上方法我们基本可以辨别细胞类型，其中我们可以看到部分group并没有给出细胞类型（如group1,2在reciprocal_matches中没有找到匹配类型），有的可能会出现多个类型，需要进一步判断。

我们再以10X数据为例进行演示|PBMCs - 10X vs Microarray Reference

10X genomic****s有几个数据集可供从他们的网站下载，比如pbmc4k datasets（https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k），其中包含来自健康个体的外周血细胞(PBMC)数据，下面示例是筛选后的文件Gene/cell matrix(filtered)和聚类文件Clustering analysis。

查询数据：10X query dataset

library(celaref) # 首先设置路径并且加载数据，我们以kmeans_7_clusters 的聚类结果来进行细胞定义
datasets_dir <- "~/celaref_extra_vignette_data/datasets"

dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(
   dataset_path   = file.path(datasets_dir,'10X_pbmc4k'),
   dataset_genome = "GRCh38",
   clustering_set = "kmeans_7_clusters",
   id_to_use      = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7) # 进行部分数据筛选并转换格式
？trim_small_groups_and_low_expression_genes # 查询函数帮助

image

de_table.10X_pbmc4k_k7   <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7) ##进行差异比较

参考数据：microarray dataset

查询数据有了，那么参考数据（reference）呢？

Watkins 等人在2009年的一篇文献已经发表过合适的 PBMC reference (a HaemAtlas)，这里使用的数据是从haemosphere网站（Graaf et al.2016，https://www.haemosphere.org/datasets/show）下载得到。

image

因为这是微阵列数据，所以还需要一些处理步骤：

Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)

this_dataset_dir<-file.path(datasets_dir,  'haemosphere_datasets','watkins')
norm_expression_file<-file.path(this_dataset_dir,"watkins_expression.txt")
samples_file<- file.path(this_dataset_dir, "watkins_samples.txt")
norm_expression_table.full<-ead.table(norm_expression_file, sep="\t", header=TRUE,quote="",comment.char="",row.names=1,check.names=FALSE)
samples_table<- read_tsv(samples_file, col_types = cols())
samples_table$description<- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays

该数据集还包括其他组织，但作为PBMC数据的参考，我们只想考虑外周血样本。

samples_table <- samples_table[samples_table$tissue == "Peripheral Blood",]

image

这就是大致的流程，最难的部分是数据格式。微阵列表达值应使用经过标准化，对数转换并具有相同ID号的数据作为query dataset。从haemosphere网站能得到标准化的数据 — 但仍需要匹配ID。

该数据来自Illumina HumanWG-6 v2 Expression BeadChips，并在探针水平上给出表达。需要将这些探针转换为gene symbol以匹配PBMC数据。


library("tidyverse")
library("illuminaHumanv2.db")
probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))

# Get mappings - non NA
probe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID")
probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])
# no multimapping probes
genes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?
multimap_probes <- names(genes_per_probe)[genes_per_probe  > 1]
probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ]

convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){

    the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)]
    colnames(the_probes_table) <- c("old_id", "new_id")

    # Before DE, just pick the top expresed probe to represent the gene
    # Not perfect, but this is a ranking-based analysis.
    # hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway.
    probe_expression_levels <- rowSums(expression_table)
    the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)]

    the_genes_table <-  the_probes_table %>%
        group_by(new_id) %>%
        top_n(1, avgexpr)

    expression_table <- expression_table[the_genes_table$old_id,]
    rownames(expression_table) <- the_genes_table$new_id

    return(expression_table)
}

# Just the most highly expressed probe foreach gene.
norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full,
                                                            probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")

# 现在读取数据并用contrast_each_group_to_the_rest_for_norm_ma_with_limma进行对比。

de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma(
                 norm_expression_table = norm_expression_table.genes,
                 sample_sheet_table    = samples_table,
                 dataset_name          = "Watkins2009PBMCs",
                 extra_factor_name     = 'description',
                 sample_name           = "sampleId",
                 group_name            = 'celltype')</pre>

将10X查询数据与参考数据进行比较

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7,de_table.ref=de_table.Watkins2009PBMCs)

image

# 提供分组标签
label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)

image

这么热的单细胞，中科院的算法开发博士带你真正玩转这项平均每个月都有多篇高IF文章的技术。（点击蓝色字体了解详细）

Reference

Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
https://bioconductor.org/packages/release/bioc/vignettes/celaref/inst/doc/celaref_doco.html

单细胞转录组细胞类型注释分析合集
SingleR 使用SingleR包进行单细胞类型注释分析 Celaref 使用Celaref包进行单细胞类型注释...
Celaref | 单细胞测序细胞类型注释工具
我导再也不用担心我不认识marker啦我们在进行单细胞测序的时候，通常情况下是通过高变genes来辨别细胞类型（...
使用Celaref包进行单细胞类型注释分析（一）：Celaref
Celaref包简介单细胞RNA-seq测序（scRNA-seq）技术可以解析组织或样品内的细胞异质性，并查看特...
sc-ATAC-seq细胞类型注释策略
解释任何单细胞测序数据的起点都是对给定数据集中的细胞簇进行注释。由于缺乏专门设计的工具以及在单细胞ATAC-seq...
celaref ||单细胞细胞类型定义工具
Sarah Williams (2019). celaref: Single-cell RNAseq cell c...
单细胞入门【1】：单细胞测序方法该如何选择？
单细胞转录组测序（scRNA-seq）已成为鉴定和表征细胞类型、状态、谱系和网络的核心工具。Get单细胞技术也成为...
单细胞多组学新品预告 | 单细胞ATAC+单细胞转录联合检测
随着单细胞测序技术的发展，单细胞转录组测序以及单细胞表观基因组测序，极大的提高了研究者们从异质性样本中分析细胞类型...
芯周刊·第23期 | 单细胞+空间转录组测序：揭示人类输尿管中的
01 单细胞+空间转录组测序：揭示人类输尿管中的细胞类型和信号网络本研究使用单细胞RNA测序、空间转录组测序和免...
Cancer Discovery ：胃癌单细胞图谱
单细胞测序技术能够揭示单个细胞的基因表达状态，反映细胞间的异质性，发现新的稀有细胞类型。当前单细胞测序技术的一个限...
XenofilteR：PDx 数据分析软件
单细胞测序主要解决的是细胞异质性问题，因此常用的样本类型为组织样本，直接采用细胞系做单细胞测序的研究比较少。对于临...