An attempt at predicting tumor cells in scRNA-seq data with inferCNV and machine learning

Author: QXPLUS | Published 2021-05-21 18:20

Data

1. Data source

Colorectal cancer single-cell data downloaded from GEO, accession GSE132465 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132465)

2. Data description

The dataset contains 23 primary colorectal cancer samples and 10 matched normal mucosa samples from 23 patients, sequenced with single-cell 3' mRNA sequencing, for a total of 63,689 cells after filtering. The GEO description reads: "Single-cell 3' mRNA sequencing data were obtained from 23 patients in 23 primary colorectal cancer and 10 matched normal mucosa. To remove low-quality cells, we applied filtering criteria using nUMI, nGene and percent of mitochondrial genes. 67,296 cells have passed the criteria. To eliminate cells of an ambiguous identity, we combined the Seurat multiCCA and RCA pipelines for initial clustering and cell type identification, and removed discordant cells from 2 methods. After defining the global cell type, cells with a number of genes exceeding the outliers were removed to eliminate potential doublets. 63,689 cells have passed the final criteria."

Here we directly use the matrix that the authors already processed with Seurat. (To start from scratch instead, download the raw data and use Cell Ranger + Seurat for processing, cell QC filtering, dimensionality reduction, clustering, and cell type identification. I put those results in a separate post: https://www.jianshu.com/p/bcb384c8c884)

3. Download and inspect the data

nohup axel -n 2 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132465/suppl/GSE132465_GEO_processed_CRC_10X_raw_UMI_count_matrix.txt.gz &

gunzip *.gz
less -S GSE132465_GEO_processed_CRC_10X_raw_UMI_count_matrix.txt
less -S GSE132465_GEO_processed_CRC_10X_cell_annotation.txt

raw_UMI_count.png

metadata.png

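It turns out that the data have already been processed, with the matrix and the metadata extracted, which saves the tedious steps of single-cell processing, dimensionality reduction, cell type identification, and grouping. Conveniently, the annotation.txt file already provides the two columns we need: Class (Tumor/Normal) and Cell_type (the population of interest here is Epithelial cells).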

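Identifying tumor cells in tumor samples with inferCNV

1. inferCNV

The inferCNV algorithm can help distinguish malignant from non-malignant cells. Using a set of "normal" cells as the reference, it analyzes how gene expression intensity changes across positions of the tumor genome and displays the relative expression of genes along each chromosome as a heatmap. Relative to normal cells, regions of the tumor genome tend to be over-expressed or under-expressed.
Principle: after the RNA reads are mapped to the reference genome, the expression of the cell population to be predicted is compared with that of the population the user designates as normal, in order to infer copy number variation (CNV) and judge whether the cells are malignant. Normal cells are assumed to be diploid (copy number 2), and cells that deviate from this are considered malignant. In inferCNV's output, the given normal cells (reference) act as the baseline for the cells from the tumor sample (observation): a value of 1 means no CNV, a value below 1 means a loss, and a value above 1 means a gain. The further a value deviates from 1, the stronger the CNV and the more likely the cell is malignant.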
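
The principle can be illustrated with a toy example in base R. This is only a sketch on simulated counts, not inferCNV's actual implementation; the window of 25 genes, the simulated 1.5x gain, and all variable names are assumptions made for illustration.

```r
# Toy illustration of the idea only; NOT inferCNV's actual implementation.
set.seed(1)
n_genes <- 200

# Simulated counts: genes (already ordered by genomic position) x cells
ref <- matrix(rpois(n_genes * 20, lambda = 5), nrow = n_genes)  # reference cells
obs <- matrix(rpois(n_genes * 30, lambda = 5), nrow = n_genes)  # observation cells

# Simulate a gain: genes 101-150 are expressed about 1.5x higher in the observations
obs[101:150, ] <- matrix(rpois(50 * 30, lambda = 7.5), nrow = 50)

log_ref <- log2(ref + 1)
log_obs <- log2(obs + 1)

# Centre every gene on its mean in the reference cells
centered <- log_obs - rowMeans(log_ref)

# Moving average over neighbouring genes, applied to each cell separately
smooth_cell <- function(x, k = 25) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}
smoothed <- apply(centered, 2, smooth_cell)

# Back-transform so that "no change" sits near 1
rel <- 2^smoothed

gain_mean    <- mean(rel[113:138, ], na.rm = TRUE)  # windows fully inside the gain
neutral_mean <- mean(rel[20:80, ], na.rm = TRUE)    # windows in an unchanged region
```

In this simulation `neutral_mean` stays close to 1 while `gain_mean` is clearly above it, which mirrors how the values in the inferCNV heatmap are read.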

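2. Preparing the inferCNV input files

2.1 Three files are required

gene_order_file: gene_order_file.tsv
Genome annotation file with the gene positions (if you use a company pipeline these are usually already prepared, so you only need to build your own expression matrix and grouping file)
raw_counts_matrix: gene_count.tsv
Gene-by-cell expression matrix
In a Seurat object it is stored in project@assays$RNA@counts
annotations_file: gene_group.tsv
Grouping information matching the expression matrix
Two columns: the first is the barcode, matching the gene-barcode matrix; the second is the group info (for example Cell_type)
In a Seurat object it is stored in project@meta.data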
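
To make the three layouts concrete, here is a minimal sketch that writes toy versions of the files to a temporary directory and reads two of them back. The gene names, barcodes, coordinates, and the "Tumor" group label are made up; the file names follow the ones used in this post.

```r
# Toy versions of the three inferCNV input files (all names and values are made up)
out_dir <- tempdir()

# 1) Raw counts: genes in rows, cell barcodes in columns
counts <- matrix(c(0, 3, 1,
                   2, 0, 5,
                   4, 1, 0),
                 nrow = 3,
                 dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                                 c("cell_1", "cell_2", "cell_3")))
write.table(counts, file.path(out_dir, "gene_count.tsv"),
            sep = "\t", quote = FALSE, col.names = NA)

# 2) Annotations: barcode and group, tab-separated, no header
anno <- data.frame(barcode = colnames(counts),
                   group   = c("Control", "Tumor", "Tumor"))
write.table(anno, file.path(out_dir, "gene_group.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# 3) Gene order: gene, chromosome, start, end, tab-separated, no header
gene_order <- data.frame(gene  = rownames(counts),
                         chr   = c("chr1", "chr1", "chr2"),
                         start = c(1000L, 5000L, 2000L),
                         end   = c(2000L, 6000L, 3000L))
write.table(gene_order, file.path(out_dir, "gene_order_file.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the matrix and the annotations back to check that barcodes line up
counts_back <- as.matrix(read.table(file.path(out_dir, "gene_count.tsv"),
                                    sep = "\t", header = TRUE, row.names = 1,
                                    check.names = FALSE))
anno_back <- read.table(file.path(out_dir, "gene_group.tsv"),
                        sep = "\t", header = FALSE, stringsAsFactors = FALSE)
```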

2.2 Data processing

Using the colorectal cancer dataset GSE132465 as an example:

library(data.table)
datadir = "GSE132465_GEO_processed_CRC_10X_raw_UMI_count_matrix.txt"
groupfile = "GSE132465_GEO_processed_CRC_10X_cell_annotation.txt"
data<-fread(datadir,sep = "\t", header = T,check.names = F)
group<-fread(groupfile,sep = "\t", header = T,check.names = F)

-- Tip: for large files, use fread() from the data.table package instead of read.table(); it is much faster.

Check how the cell types are distributed across the sample classes

> table(group$Class,group$Cell_type)
         B cells Epithelial cells Mast cells Myeloids Stromal cells T cells
  Normal    5208             1070        184      369          3197    6376
  Tumor     3938            17469          3     6400          2736   16739

Here we select the Epithelial cells for the inferCNV analysis, with the Epithelial cells from the Normal samples serving as the Control group.

> table(subset(group, Cell_type == "Epithelial cells",select = "Class"))
Normal  Tumor 
  1070  17469
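
The post does not show how gene_count.tsv and gene_group.tsv were derived from these tables. The sketch below is one possible way to do it, shown on toy data. The helper name `make_infercnv_inputs`, the barcode column name `Index`, and the "Tumor" label for the non-control group are assumptions; only `Class`, `Cell_type`, and the "Control" group name come from the text. It also assumes the count table has gene names as row names and barcodes as column names.

```r
# Hypothetical helper (not from the original post): keep epithelial cells only,
# label Normal epithelial cells as "Control", and write the two inferCNV inputs.
# `barcode_col` is an assumed column name for the cell barcodes.
make_infercnv_inputs <- function(counts, anno, barcode_col = "Index",
                                 out_dir = tempdir()) {
  anno <- as.data.frame(anno)  # also accepts a data.table read with fread()
  epi <- anno[anno$Cell_type == "Epithelial cells", , drop = FALSE]
  epi$group <- ifelse(epi$Class == "Normal", "Control", "Tumor")
  epi <- epi[epi[[barcode_col]] %in% colnames(counts), , drop = FALSE]

  # Column order of the matrix follows the order of the annotation rows
  sub_counts <- counts[, epi[[barcode_col]], drop = FALSE]
  groups <- epi[, c(barcode_col, "group")]

  write.table(sub_counts, file.path(out_dir, "gene_count.tsv"),
              sep = "\t", quote = FALSE, col.names = NA)
  write.table(groups, file.path(out_dir, "gene_group.tsv"),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
  invisible(list(counts = sub_counts, groups = groups))
}

# Toy data standing in for the GEO annotation table and count matrix
toy_anno <- data.frame(
  Index     = paste0("cell_", 1:5),
  Class     = c("Normal", "Tumor", "Tumor", "Normal", "Tumor"),
  Cell_type = c("Epithelial cells", "Epithelial cells", "T cells",
                "B cells", "Epithelial cells"),
  stringsAsFactors = FALSE
)
toy_counts <- matrix(1:10, nrow = 2,
                     dimnames = list(c("GENE_A", "GENE_B"), toy_anno$Index))

res <- make_infercnv_inputs(toy_counts, toy_anno)
```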

3. Predicting benign versus malignant cells with inferCNV

inferCNV.pipline

3.1 infercnv::run

library(infercnv)
infercnv_obj <- CreateInfercnvObject(
  raw_counts_matrix="gene_count.tsv",  ## gene - barcode matrix
  gene_order_file="/GRCh38_99/gene_order_file.tsv",
  annotations_file="gene_group.tsv",   ## metadata
  ref_group_names=c("Control"),        ## name of the control (reference) group
  delim = "\t",
  max_cells_per_group = NULL,
  min_max_counts_per_cell = c(100, +Inf),
  chr_exclude = c("chrX", "chrY", "chrM") ## exclude genes on X, Y and MT, because these are not present at a diploid copy number
)

infercnv_obj.png

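## cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
infercnv_obj <- infercnv::run(infercnv_obj,
                 cutoff=0.1, 
                 out_dir="./inferCNV_output", 
                 cluster_by_groups=TRUE, 
                 num_threads=6,          ## multithreading
                 denoise=TRUE,           ## if computing resources are limited, denoise and HMM can both be set to FALSE
                 HMM=TRUE)

Then comes a long wait for the results ... 1 to 2 days.

Still waiting ... (For reference: denoise and HMM were both TRUE, the thread count was set to 12, and there were about 18,500 cells in total.)

ps -ef | grep infer    # check whether the infer* process is still running
ps -ef | grep <PID>    # check which command a given process ID is running

Sadly, after two days the run still had not finished, despite all that waiting ...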

error.png

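3.2 Analyzing the results

The most important goal now is to use the heatmap, or rather the inferCNV result files, to pick out the malignant cells among the epithelial cells.

infercnv.preliminary.png : preliminary inferCNV view (before denoising or HMM prediction)
infercnv.png : final heatmap produced by inferCNV after denoising
infercnv.references.txt : matrix for the normal (reference) cells
infercnv.observations.txt : matrix for the tumor (observation) cells
infercnv.observation_groupings.txt : group assignments of the tumor cells after clustering
infercnv.observations_dendrogram.txt : tree in Newick format showing the hierarchical relationships among cells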
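
Because the values in these matrices are centred on 1, one simple way to summarise how strongly a cell deviates is the mean squared deviation from 1 across genes. This score is an assumption made for illustration, not something inferCNV reports or that the original analysis used, and `cnv_score` is a hypothetical helper. The toy matrix below stands in for the contents of infercnv.observations.txt (genes in rows, cells in columns).

```r
# Hypothetical per-cell summary: mean squared deviation from 1 across genes
cnv_score <- function(mat) colMeans((as.matrix(mat) - 1)^2)

set.seed(42)
n_genes <- 100

# Three "quiet" cells hovering around 1, and three cells with a simulated gain
quiet   <- matrix(rnorm(n_genes * 3, mean = 1, sd = 0.02), nrow = n_genes)
altered <- matrix(rnorm(n_genes * 3, mean = 1, sd = 0.02), nrow = n_genes)
altered[1:40, ] <- altered[1:40, ] + 0.2  # gain over the first 40 genes

toy <- cbind(quiet, altered)
colnames(toy) <- c(paste0("quiet_", 1:3), paste0("altered_", 1:3))

scores <- cnv_score(toy)
print(round(scores, 4))
```

Cells with larger scores deviate more from the reference baseline, which matches the reading of the heatmap described above.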

Since I could not get my own run to finish, and my goal was to try analyzing the results, I used someone else's results to explore instead.

Because the data are someone else's, let's just take a quick look at their structure.

library(data.table)
datadir = "plasma_count.tsv"
metadata = "plasma_group.tsv"
## fread() has no row.names argument, so use data.frame() to set the row names
data<-fread(datadir,sep = "\t", header = T,check.names = F)
data<-data.frame(data,check.names = F, row.names = 1)

metadata<-fread(metadata,sep = "\t", header = T,check.names = F)
metadata<-data.frame(metadata,check.names = F, row.names = 1)

> data[1:5,1:5]
           1_AACAACCAGAAGCGTT 1_AACACACCATGAAGGC 1_AACCTTTTCTGGACTA
AL627309.1                  0                  0                  0
AL627309.3                  0                  0                  0
AL627309.4                  0                  0                  0
AL669831.5                  0                  0                  0
FAM87B                      0                  0                  0
           1_AAGCGAGCAAGGATGC 1_AATGACCGTGTCACAT
AL627309.1                  0                  0
AL627309.3                  0                  0
AL627309.4                  0                  0
AL669831.5                  0                  0
FAM87B                      0                  0

> table(metadata)
metadata
  10   HC 
4997   98

For this dataset, inferCNV was run with HC as the reference and 10 as the observation group.
Analysis of the inferCNV results:

Read the inferCNV clustering information

library(gridExtra)
library(grid)
require(dendextend)
require(ggthemes)
library(tidyverse)
library(Seurat)
library(infercnv)
library(miscTools)
library(phylogram)

## read the inferCNV clustering information
infercnv.dend <- read.dendrogram(file = "CNV_cluster10/infercnv.observations_dendrogram.txt")

Cut the tree to regroup the clustering result (k is the number of clusters requested)

> infercnv.labels <- cutree(infercnv.dend, k = 2, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2 
1037 3960 
> infercnv.labels <- cutree(infercnv.dend, k = 3, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3 
1037 1532 2428 
> infercnv.labels <- cutree(infercnv.dend, k = 4, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4 
1037 1532  203 2225 
> infercnv.labels <- cutree(infercnv.dend, k = 5, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5 
 497  540 1532  203 2225 
> infercnv.labels <- cutree(infercnv.dend, k = 6, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6 
 497  540 1532  203  176 2049 
> infercnv.labels <- cutree(infercnv.dend, k = 7, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6    7 
 497  540  768  764  203  176 2049 
> infercnv.labels <- cutree(infercnv.dend, k = 8, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6    7    8 
 497  186  354  768  764  203  176 2049 
> infercnv.labels <- cutree(infercnv.dend, k = 9, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6    7    8    9 
 497  186  354  768  764  203  176  439 1610 
> infercnv.labels <- cutree(infercnv.dend, k = 10, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6    7    8    9   10 
 497  186  354  768  764  203  176  439  463 1147 
> infercnv.labels <- cutree(infercnv.dend, k = 11, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6    7    8    9   10   11 
 497  186  354  768  180  584  203  176  439  463 1147 
> infercnv.labels <- cutree(infercnv.dend, k = 12, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6    7    8    9   10   11   12 
 497  186  354  768  180  584  203  176  184  255  463 1147 
> infercnv.labels <- cutree(infercnv.dend, k = 13, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
   1    2    3    4    5    6    7    8    9   10   11   12   13 
 497  186  178  176  768  180  584  203  176  184  255  463 1147 
> infercnv.labels <- cutree(infercnv.dend, k = 14, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
  1   2   3   4   5   6   7   8   9  10  11  12  13  14 
497 186 178 176 768 180 584 203 176 184 255 463 512 635 
> infercnv.labels <- cutree(infercnv.dend, k = 15, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
497 186 178 176 249 519 180 584 203 176 184 255 463 512 635 
> infercnv.labels <- cutree(infercnv.dend, k = 16, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
497 186 178 176 249 519 180 237 347 203 176 184 255 463 512 635 
> infercnv.labels <- cutree(infercnv.dend, k = 17, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
497 186 178 176 249 519 180 237 142 205 203 176 184 255 463 512 635 
> infercnv.labels <- cutree(infercnv.dend, k = 18, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
497 186 178 176 249 519 180 237 142 205 203 176 184 255 463 512 281 354 
> infercnv.labels <- cutree(infercnv.dend, k = 19, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
497 186 178 176 249 519 180 237 142 205 191  12 176 184 255 463 512 281 354 
> infercnv.labels <- cutree(infercnv.dend, k = 20, order_clusters_as_data = FALSE)
> table(infercnv.labels)
infercnv.labels
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
497 186 178 176 249 103 416 180 237 142 205 191  12 176 184 255 463 512 281 354

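Looking at these results, my reasoning was that the non-tumor cells in a tumor sample should stand out clearly, so the size of their cluster should not change much as k varies (ideally not at all). I therefore counted which cluster size stays the same across a range of k values (4 to 15 in the code below) and treated the cluster with that size as the non-tumor cluster.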

freq_count <- c()
for (i in 4:15){
    infercnv.labels <- cutree(infercnv.dend, k = i, order_clusters_as_data = FALSE)
    label_freqs<-data.frame(table(infercnv.labels))
    write.csv(label_freqs, 
        file = paste0("infercnvlabels-",i,".csv"),
        row.names = FALSE)
    ## collect the distinct cluster sizes seen at this k
    freq_count <- c(freq_count, unique(label_freqs$Freq))
}

## for every cluster size, count at how many values of k it appears
freq_count <- data.frame(table(freq_count))
maxfreq<-max(freq_count$Freq)
## size of the most stable cluster, taken to be the non-tumor cluster
## (table() returns a factor, so convert it back to a number; [1] guards against ties)
cltnum <- as.numeric(as.character(subset(freq_count,Freq == maxfreq,select = "freq_count")[,1]))[1]

select_K<-NULL
notumor_label<-NULL
for (i in 5:15){
    infercnv.labels <- cutree(infercnv.dend, k = i, order_clusters_as_data = FALSE)
    label_freqs<-data.frame(table(infercnv.labels))
    hit <- label_freqs$Freq == cltnum
    if (any(hit)){
        ## first k at which a cluster of that size exists, and the label of that cluster
        notumor_label <- as.integer(as.character(label_freqs[hit,"infercnv.labels"]))[1]
        cat(paste0("k = ",i," ----------- notumor_label is  ",notumor_label,"  -----------\n"))
        select_K<-i
        break
    }
}
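
The same heuristic can be tried on simulated data with base R only. This sketch swaps in `hclust()` and `stats::cutree()` for the dendextend call on the inferCNV dendrogram used above, and the three simulated groups (one tight group standing in for non-tumor cells, two loose groups standing in for heterogeneous tumor cells) are assumptions made for illustration.

```r
set.seed(7)

# One tight group (stand-in for non-tumor cells) and two loose groups
# (stand-in for heterogeneous tumor cells), in two dimensions
tight  <- matrix(rnorm(30 * 2, mean = 0, sd = 0.05), ncol = 2)
loose1 <- cbind(rnorm(50, mean = 20, sd = 1), rnorm(50, mean = 0,  sd = 1))
loose2 <- cbind(rnorm(70, mean = 0,  sd = 1), rnorm(70, mean = 20, sd = 1))
cells  <- rbind(tight, loose1, loose2)

hc <- hclust(dist(cells), method = "complete")

# For each k, record the distinct cluster sizes; then count how often each size recurs
k_range <- 3:8
sizes_per_k <- lapply(k_range, function(k) {
  unique(as.vector(table(cutree(hc, k = k))))
})
size_counts <- table(unlist(sizes_per_k))
stable_size <- as.numeric(names(size_counts)[which.max(size_counts)])

# Find the cluster with that size at the smallest k and list its cells
labels_k3    <- cutree(hc, k = 3)
stable_label <- as.integer(names(which(table(labels_k3) == stable_size)))
stable_cells <- which(labels_k3 == stable_label)
```

Here the tight group is never split while the loose groups keep being subdivided, so its size is the one that recurs most often. On real data the most frequent size is not guaranteed to be the non-tumor cluster, so the result is worth checking against the heatmap.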

Plot the dendrogram with a color bar for the clusters

## plot the cluster of tumor
infercnv.labels <- cutree(infercnv.dend, k = select_K, order_clusters_as_data = FALSE)
table(infercnv.labels)

## recode in a single step so that one assignment cannot overwrite the other:
## 1 = tumor, 2 = non-tumor
infercnv.labels <- ifelse(infercnv.labels == notumor_label, 2, 1)
# Color labels
the_bars <- as.data.frame(tableau_color_pal("Tableau 20")(20)[infercnv.labels])
colnames(the_bars) <- "inferCNV_tree"
the_bars$inferCNV_tree <- as.character(the_bars$inferCNV_tree)

pdf("inferCNV_tree.pdf", width=14,height=5)
## draw the dendrogram without leaf labels, then add the color bar underneath
infercnv.dend %>% set("labels",rep("", nobs(infercnv.dend)) )  %>% plot(main="inferCNV dendrogram")
colored_bars(colors = as.data.frame(the_bars), dend = infercnv.dend, sort_by_labels_order = FALSE, add = T, y_scale=50 , y_shift = 0)
dev.off()

infercnv.png
Viewed together with the inferCNV heatmap

notumor.png
Next, reproduce the heatmap

infercnv_obj<-readRDS("CNV_cluster10/preliminary.infercnv_obj")
# extract the processed expression matrix
expr <- infercnv_obj@expr.data

# extract the column positions of the normal (reference) cells
normal_loc <- infercnv_obj@reference_grouped_cell_indices
normal_loc <- normal_loc$HC
# extract the column positions of the tumor (observation) cells
tumor_loc <- infercnv_obj@observation_grouped_cell_indices
tumor_loc <- tumor_loc$`10`

gn <- rownames(expr)
geneFile <- read.table('/home/database/GRCh38_99/gene_order_file.tsv')
# inferCNV drops some genes during processing, so the matrix has fewer genes than the gene order file.
# Subset geneFile to the remaining genes, in the same order as the matrix rows:
length(geneFile$V1);length(gn) 
sub_geneFile <- geneFile[match(gn, geneFile$V1), ]
# split the matrix using the positions stored in normal_loc and tumor_loc,
# appending the chromosome of each gene as an extra (integer-coded) column
norm_expr <- expr[,c(normal_loc)]
norm_expr <- cbind(norm_expr, as.factor(sub_geneFile$V2))
# same for the tumor cells
tumor_expr <- expr[,c(tumor_loc)]
tumor_expr <- cbind(tumor_expr,as.factor(sub_geneFile$V2))

library(ComplexHeatmap)
library("RColorBrewer") # color palettes
# taken from infercnv::plot_cnv
get_group_color_palette <- function () {
  return(colorRampPalette(RColorBrewer::brewer.pal(12, "Set3")))
}

# chromosome of each gene as a factor, in order of appearance; the column appended by cbind()
# only holds integer codes, so levels() on it would return NULL
chr_cluster <- factor(sub_geneFile$V2, levels = unique(sub_geneFile$V2))
color <- get_group_color_palette()(length(levels(chr_cluster)))

# colored blocks for the column titles
top_color <- HeatmapAnnotation(
              cluster = anno_block(gp = gpar(fill = color), # fill colors
              labels = levels(chr_cluster), 
              labels_gp = gpar(cex = 0.9, col = "black")))

# cut at the k selected earlier (select_K), not at the cluster size (cltnum)
infercnv.labels <- cutree(infercnv.dend, k = select_K, order_clusters_as_data = FALSE)
# tumor
n <- t(tumor_expr[,-ncol(tumor_expr)])
# normal
m <- t(norm_expr[,-ncol(norm_expr)])
infercnv.labels = infercnv.labels[match(rownames(n), names(infercnv.labels))]

# recode in a single step so that one assignment cannot overwrite the other
infercnv.labels <- ifelse(infercnv.labels == notumor_label, "notumor", "tumor")

left_color = rowAnnotation(Class = infercnv.labels,
    col = list(Class=c("tumor"= "#FFFFB3"  , "notumor"="#8DD3C7")))

observation_height<-10
reference_height<-(length(normal_loc)/length(tumor_loc))*10

pdf("test2-heatmap.pdf",width = 20,height = 20)
ht_normal = Heatmap(as.matrix(m),
                   cluster_rows = T,
                   cluster_columns = F,
                   show_column_names = F,
                   show_row_names = F,
                   column_split = chr_cluster,
                   row_title = "References (Cells)",
                   row_title_side = c("right"),
                   row_title_rot = 90,
                   row_title_gp = gpar(fontsize = 25),
                   column_title = NULL, 
                   heatmap_legend_param = list(
                     title = "Modified Expression",
                     title_position = "leftcenter-rot", # legend title position
                     title_gp = gpar(fontsize = 20),    # legend title font size
                     at=c(0.4,1.6),                     # legend range
                     legend_height = unit(6, "cm")),    # legend height
                   width = 20, height = reference_height) 

ht_tumor = Heatmap(as.matrix(n),
                   cluster_rows = T,
                   cluster_columns = F,
                   show_column_names = F,
                   show_row_names = F,
                   column_split = chr_cluster,
                   show_heatmap_legend=F,
                   top_annotation = top_color,
                   left_annotation = left_color,
                   row_title = "Observations (Cells)",
                   row_title_side = c("right"),
                   row_title_rot = 90,
                   row_title_gp = gpar(fontsize = 25),
                   column_title = "Genomic Region",
                   column_title_side = c("bottom"),
                   column_title_gp = gpar(fontsize = 25),
                   width = 20, height = observation_height,
                   heatmap_height = 15)

# stack the two heatmaps vertically
draw(ht_normal %v% ht_tumor)
dev.off()

In this way, the non-tumor cells were picked out of the tumor sample.

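An attempt to predict non-tumor cells with machine learning

Another way to detect non-tumor cells automatically is machine learning, which is what I am good at.
Idea:
Use the reference cells as normal cells and the cells with very high CNV variation as tumor cells, then predict which of the remaining epithelial cells in the tumor sample are tumor cells and which are non-tumor cells.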
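
The post does not say which model or features were used. As one possible reading of the idea, the sketch below fits a logistic regression with base R `glm()` on a single simulated per-cell CNV score. The data, the score distributions, the "top half of the scores" rule for picking confident tumor cells, and the 0.5 cutoff are all assumptions made for illustration.

```r
set.seed(123)

# Simulated per-cell CNV scores (higher = stronger deviation from the reference)
ref_score <- rnorm(100, mean = 0.03, sd = 0.015)  # reference (normal) cells
obs_truth <- rep(c("tumor", "notumor"), times = c(200, 100))
obs_score <- ifelse(obs_truth == "tumor",
                    rnorm(300, mean = 0.06, sd = 0.02),
                    rnorm(300, mean = 0.03, sd = 0.015))

# Confident positives: observation cells in the top half of the score distribution
is_high <- obs_score > median(obs_score)

# Training set: reference cells (label 0) and high-CNV observation cells (label 1)
train <- data.frame(
  score = c(ref_score, obs_score[is_high]),
  label = c(rep(0, length(ref_score)), rep(1, sum(is_high)))
)
fit <- glm(label ~ score, data = train, family = binomial())

# Predict the remaining, ambiguous observation cells
rest <- data.frame(score = obs_score[!is_high])
rest$p_tumor <- predict(fit, newdata = rest, type = "response")
rest$call <- ifelse(rest$p_tumor > 0.5, "tumor", "notumor")
table(rest$call)
```

With real data the feature would more plausibly be the per-gene or per-region inferCNV values rather than one summary score, and the labels of the training cells are themselves uncertain, so any such model needs to be checked against the dendrogram and heatmap.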

ml.pipline

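This was a failed attempt: it turned out to be worse than the previous method. Perhaps a model that can take global information into account is needed?
To be studied in more depth later ...

Later on I plan to study the internal scripts of the inferCNV package carefully.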
