参考:公众号:单细胞天地 (如有侵,立删~)
CV值
变异系数又称离散系数或相对偏差(我们肯定都听过标准偏差,也就是sd值,它描述了数据值偏离算术平均值的程度),这个相对偏差描述的是标准偏差与平均值之比,即:cv=sd/mean*100% 。
为何不用sd而用cv值呢?
sd值,它和均值mean、方差var一样,都是对一维数据进行的分析,需要数据满足两个条件:中部、单峰。也就是说数据集只存在一个峰值,并且这个峰值大致位于数据集的中部。另外当比较两组数据集的离散程度大小时,即使它们各自满足"中部单峰"的条件,如果出现两组数据测量尺度差别太大或数据量纲存在差异的话,直接用标准差就不合适
变异系数就可以解决这个问题,它利用原始数据标准差和原始数据平均值的比值来各自消除尺度与量纲的差异。
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load(file = '../input_rpkm.Rdata')
# 就会得到RPKM的过滤表达矩阵dat和细胞信息metadata(包括clust分群、细胞板信息、检测到的基因数量)
## 载入第0步准备好的表达矩阵,变量 dat是log2CPM后的表达量矩阵。及细胞的一些属性(hclust分群,plate批次,检测到的基因数量)
dat[1:4,1:4]
#SS2_15_0048_A3 SS2_15_0048_A6 SS2_15_0048_A5 #SS2_15_0048_A4
#0610007P14Rik 0 0 74.95064 69.326130
#0610009B22Rik 0 0 0.00000 0.000000
#0610009L18Rik 0 0 0.00000 0.000000
#0610009O20Rik 0 0 2.19008 3.314831
head(metadata)
#g plate n_g all
#SS2_15_0048_A3 1 0048 3065 all
#SS2_15_0048_A6 2 0048 3036 all
#SS2_15_0048_A5 1 0048 3742 all
#SS2_15_0048_A4 3 0048 5012 all
#SS2_15_0048_A1 1 0048 5126 all
#SS2_15_0048_A2 3 0048 5692 all
exprSet=dat
计算mean、sd值
mean_per_gene <- apply(exprSet, 1, mean, na.rm = TRUE) #对表达矩阵每行求均值
sd_per_gene <- apply(exprSet, 1, sd, na.rm = TRUE) #对表达矩阵每行求标准差
构建数据框,计算cv值
# 构造一个数据框来存放结果。
cv_per_gene <- data.frame(mean = mean_per_gene,
sd = sd_per_gene,
cv = sd_per_gene/mean_per_gene)
rownames(cv_per_gene) <- rownames(exprSet)
head(cv_per_gene)
#mean sd cv
#0610007P14Rik 25.634318 48.60264 1.895999
#0610009B22Rik 27.203266 64.41918 2.368068
#0610009L18Rik 3.864324 19.61355 5.075544
#0610009O20Rik 10.808251 26.01667 2.407112
#0610010F05Rik 8.194137 21.06394 2.570612
#0610010K14Rik 31.812982 52.29101 1.643701
with(cv_per_gene,plot(log10(mean),log10(cv)))
with(cv_per_gene,plot(log10(mean),log10(cv^2)))
cv_per_gene$log10cv2=log10(cv_per_gene$cv^2)
cv_per_gene$log10mean=log10(cv_per_gene$mean)
library(ggpubr)
cv_per_gene=cv_per_gene[cv_per_gene$log10mean < 4, ]
cv_per_gene=cv_per_gene[cv_per_gene$log10mean > 0, ]
ggscatter(cv_per_gene, x = 'log10mean', y = 'log10cv2',
color = "black", shape = 16, size = 1, # Points color, shape and size
xlab = 'log10(mean)RPKM', ylab = "log10(cv^2)",
add = "loess", #添加拟合曲线
add.params = list(color = "red",fill = "lightgray"),
cor.coeff.args = list(method = "spearman"),
label.x = 3,label.sep = "\n",
dot.size = 2,
ylim=c(-0.5, 3),
xlim=c(0,4)
)
ggscatter(cv_per_gene[-grep("ERCC",rownames(cv_per_gene)),], x = 'log10mean', y = 'log10cv2',
color = "black", shape = 16, size = 1, # Points color, shape and size
xlab = 'log10(mean RPKM)', ylab = "log10(CV^2)",
mean.point=T,
cor.coeff.args = list(method = "spearman"),
label.x = 3,label.sep = "\n",
dot.size = 2,
ylim=c(-0.5, 2.5),
xlim=c(0,4),
# 这里ggp参数的意思就是:增加一个ggplot图层
ggp = ggscatter(cv_per_gene[grep("ERCC",rownames(cv_per_gene)),], x = 'log10mean', y = 'log10cv2',
color = "red", shape = 16, size = 1, # Points color, shape and size
xlab = 'log10(mean RPKM)', ylab = "log10(CV^2)",
add = "loess", #添加拟合曲线
mean.point=T,
add.params = list(color = "red",fill = "lightgray"),
cor.coeff.args = list(method = "pearson"),
label.x = 3,label.sep = "\n",
dot.size = 2,
)
)
聚类算法之PCA与tSNE
几个常用函数的转置t(transpose),傻傻分不清?: 计算距离介绍过dist()函数,它是按行为操作对象,而聚类是要对样本聚类,因此要先将我们平时见到的表达矩阵(行为基因,列为样本)转置;同样PCA也是对行/样本进行操作,也是需要先转置;另外归一化的scale()函数虽然是对列进行操作,但它的对象是基因,因此也需要转置
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load(file = '../input_rpkm.Rdata')
dat[1:4,1:4]
head(metadata,3)
#所有数据的聚类分组信息
group_list=metadata$g
#批次信息
plate=metadata$plate
## 下面是画PCA的必须操作,需要看不同做PCA的包的说明书。
dat_back=dat
dat=dat_back
dat=t(dat)
dat=as.data.frame(dat)
dat=cbind(dat,plate ) #cbind根据列进行合并,即叠加所有列 #矩阵添加批次信息
dat[1:4,1:4]
table(dat$plate)
#0048 0049
#384 384
library("FactoMineR")
library("factoextra")
# The variable plate (index = ) is removed
# before PCA analysis
dat.pca <- PCA(dat[,-ncol(dat)], graph = FALSE)
fviz_pca_ind(dat.pca,#repel =T,
geom.ind = "point", # show points only (nbut not "text")
col.ind = dat$plate, # color by groups
#palette = c("#00AFBB", "#E7B800"),
addEllipses = TRUE, # 添加晕环
legend.title = "Groups"
)
ggsave('all_cells_PCA_by_plate.png')
## 保存你的画布的图到本地图片文件。
可以看到两个批次之间分不开,说明没有批次效应
pca
统计细胞检测的基因数量
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load(file = '../input_rpkm.Rdata')
# 这里检测每个样本中有多少基因是表达的,count值以1为标准,rpkm值可以用0为标准
# n_g = apply(a,2,function(x) sum(x>0)) #统计每个样本有表达的有多少行(基因)
dat[1:4,1:4]
df=metadata
head(df)
#g plate n_g all
#SS2_15_0048_A3 1 0048 3065 all
#SS2_15_0048_A6 2 0048 3036 all
#SS2_15_0048_A5 1 0048 3742 all
#SS2_15_0048_A4 3 0048 5012 all
#SS2_15_0048_A1 1 0048 5126 all
#SS2_15_0048_A2 3 0048 5692 all
library(ggpubr)
ggviolin(df, x = "all", y = "n_g", fill = "all",
add = "boxplot", add.params = list(fill = "white"))
19.48.png
不同批次间比较
ggviolin(df, x = "plate", y = "n_g", fill = "plate",
#palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "boxplot", add.params = list(fill = "white"))
19.51.png
hclust分组间比较
ggviolin(df, x = "g", y = "n_g", fill = "g",
add = "boxplot", add.params = list(fill = "white")) + stat_compare_means()
19.53.png
可以看到差异极显著
任意比较
# 只需要之前构建一个比较组合
my_comparisons <- list( c("1", "2"), c("2", "3"), c("3", "4") )
ggviolin(df, x = "g", y = "n_g", fill = "g",
add = "boxplot", add.params = list(fill = "white")) + stat_compare_means(comparisons = my_comparisons)+ # Add pairwise comparisons p-value
stat_compare_means(label.y = 50) # Add global p-value
19.56.png
小tip:如果说可视化分群结果,发现群组间基因数量差异太大,就要考虑技术差异问题,因为由于生物学导致几千个基因关闭的可能性不是很大,可以换一种聚类算法试一试目前单细胞也有很多采用dbscan算法进行的聚类分析.
细胞周期判断
加载矩阵
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load(file = '../input.Rdata')
a[1:4,1:4]
head(df)
## 载入第0步准备好的表达矩阵,及细胞的一些属性(hclust分群,plate批次,检测到的基因数量)
# 注意 变量a是原始的counts矩阵,变量 dat是logCPM后的表达量矩阵。
group_list=df$g
plate=df$plate
table(plate)
#0048 0049
#384 384
创建sce对象
#BiocManager::install("scran")
library(scran)
sce <- SingleCellExperiment(list(counts=dat))
> sce
class: SingleCellExperiment
dim: 12198 768
metadata(0):
assays(1): counts
rownames(12198): 0610007P14Rik 0610009B22Rik ... ERCC-00170
ERCC-00171
rowData names(0):
colnames(768): SS2_15_0048_A3 SS2_15_0048_A6 ... SS2_15_0049_P22
SS2_15_0049_P24
colData names(0):
reducedDimNames(0):
altExpNames(0):
主要使用cyclone函数
cyclone函数主要需要三个元素:一个是sce单细胞对象,一个是pairs参数,还有就是gene.names参数
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds",
package="scran"))
第三个参数:gene.names,cyclone函数需要使用ensembl基因名
library(org.Mm.eg.db)
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce),
keytype="SYMBOL", column="ENSEMBL")
head(ensembl)
0610007P14Rik 0610009B22Rik 0610009L18Rik
NA "ENSMUSG00000007777" "ENSMUSG00000043644"
0610009O20Rik 0610010F05Rik 0610010K14Rik
NA "ENSMUSG00000042208" "ENSMUSG00000020831"
三者齐全,可以进行细胞周期计算:
assigned <- cyclone(sce, pairs=mm.pairs, gene.names=ensembl)
str(assigned)
> str(assigned)
List of 3
$ phases : chr [1:768] "G1" "G1" "G1" "G1" ...
$ scores :'data.frame': 768 obs. of 3 variables:
..$ G1 : num [1:768] 0.999 0.998 1 1 1 1 0.997 0.919 1 0.994 ...
..$ S : num [1:768] 0.143 0.003 0.062 0.009 0.331 0.01 0.01 0.013 0.029 0.005 ...
..$ G2M: num [1:768] 0.001 0.011 0.01 0.001 0 0 0.056 0.202 0 0.035 ...
$ normalized.scores:'data.frame': 768 obs. of 3 variables:
..$ G1 : num [1:768] 0.874 0.986 0.933 0.99 0.751 ...
..$ S : num [1:768] 0.12511 0.00296 0.05784 0.00891 0.24869 ...
..$ G2M: num [1:768] 0.000875 0.01087 0.009328 0.00099 0 ...
对assigned进行操作
head(assigned$scores)
>head(assigned$scores)
G1 S G2M
1 0.999 0.143 0.001
2 0.998 0.003 0.011
3 1.000 0.062 0.010
4 1.000 0.009 0.001
5 1.000 0.331 0.000
6 1.000 0.010 0.000
table(assigned$phases)
# G1 G2M S
#722 36 10
# 作图(利用score和phases这两个元素)
draw=cbind(assigned$score,assigned$phases)
attach(draw) #attach的目的就是现在加载,之后直接引用即可
library(scatterplot3d)
scatterplot3d(G1, S, G2M, angle=20,
color = rainbow(3)[as.numeric(as.factor(assigned$phases))],
grid=TRUE, box=FALSE)
detach(draw)
21.13.png
热图
library(pheatmap)
cg=names(tail(sort(apply(dat,1,sd)),100))
n=t(scale(t(dat[cg,])))# 矩阵归一化
# pheatmap(n,show_colnames =F,show_rownames = F)
library(pheatmap)
df$cellcycle=assigned$phases
ac=df
rownames(ac)=colnames(n)
pheatmap(n,show_colnames =F,show_rownames = F,
annotation_col=ac
)
21.16.png
网友评论