SO HOT!
1. 安装一些R包
1.1 安装
1.1.1 ALL
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("ALL")
1.1.2 pasilla
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("pasilla")
1.1.3 airway
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("airway")
1.1.4 limma
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("limma")
1.1.5 DESeq2
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
> BiocManager::install("DESeq2")
1.1.6 clusterProfiler
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("clusterProfiler")
1.1.7 reshape2
> install.packages('reshape2')
1.1.8 ggplot2
> install.packages('ggplot2')
1.2 运行
用 library()
检查已安装的包是否能够运行
e.g.
> library(CLL)
# 载入需要的程辑包:affy
# 载入需要的程辑包:BiocGenerics
# 载入需要的程辑包:parallel
#
# 载入程辑包:‘BiocGenerics’
#
# The following objects are masked from ‘package:parallel’:
#
# clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
# clusterExport, clusterMap, parApply, parCapply, parLapply,
# parLapplyLB, parRapply, parSapply, parSapplyLB
#
# The following objects are masked from ‘package:stats’:
#
# IQR, mad, sd, var, xtabs
#
# The following objects are masked from ‘package:base’:
#
# anyDuplicated, append, as.data.frame, basename, cbind, colnames,
# dirname, do.call, duplicated, eval, evalq, Filter, Find, get,
# grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match,
# mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
# rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
# tapply, union, unique, unsplit, which, which.max, which.min
#
# 载入需要的程辑包:Biobase
# Welcome to Bioconductor
#
# Vignettes contain introductory material; view with
# 'browseVignettes()'. To cite Bioconductor, see
# 'citation("Biobase")', and for packages 'citation("pkgname")'.
若试图运行一个没有安装的包:
> available.packages()
# Package Version
# A3 "A3" "1.0.0"
> library(A3)
# Error in library(A3) : 不存在叫‘A3’这个名字的程辑包
2. 了解ExpressionSet对象
比如CLL包里面就有data(sCLLex) ,找到它包含的元素,提取其表达矩阵(使用exprs函数),查看其大小。
提取表达矩阵:
> library('CLL')
> data("sCLLex")
> exprSet <- exprs(sCLLex)
查看大小:
> dim(exprSet)
# [1] 12625 22
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
3. 了解 str,head,help函数,作用于 第二步提取到的表达矩阵
3.1 str()
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object. It calls args for (non-primitive) function objects.
用于查看矩阵的大小(列数、行数),查看列名、行名的示例。
3.2 head()
> head(exprSet)
# CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL CLL16.CEL
# 1000_at 5.743132 6.219412 5.523328 5.340477 5.229904 4.920686
# 1001_at 2.285143 2.291229 2.287986 2.295313 2.662170 2.278040
# 1002_f_at 3.309294 3.318466 3.354423 3.327130 3.365113 3.568353
# 1003_s_at 1.085264 1.117288 1.084010 1.103217 1.074243 1.073097
# 1004_at 7.544884 7.671801 7.474025 7.152482 6.902932 7.368660
# 1005_at 5.083793 7.610593 7.631311 6.518594 5.059087 4.855161
# ...
Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.
用于查看对象的前几行,默认为前6行,可通过
n = xL
调整。
3.3 help()
> help("exprs")
R Documentation
help is the primary interface to the help systems.
用于调出 R Documentation 对某个函数的说明页面,?fuction
是 help()
的简略写法
4. 安装并了解 hgu95av2.db 包
看看 ls(“package:hgu95av2.db”) 后显示的哪些变量?
安装:
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("hgu95av2.db")
> library(hgu95av2.db)
> ls("package:hgu95av2.db")
# [1] "hgu95av2" "hgu95av2.db" "hgu95av2_dbconn"
# [4] "hgu95av2_dbfile" "hgu95av2_dbInfo" "hgu95av2_dbschema"
# [7] "hgu95av2ACCNUM" "hgu95av2ALIAS2PROBE" "hgu95av2CHR"
# [10] "hgu95av2CHRLENGTHS" "hgu95av2CHRLOC" "hgu95av2CHRLOCEND"
# [13] "hgu95av2ENSEMBL" "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"
# [16] "hgu95av2ENZYME" "hgu95av2ENZYME2PROBE" "hgu95av2GENENAME"
# [19] "hgu95av2GO" "hgu95av2GO2ALLPROBES" "hgu95av2GO2PROBE"
# [22] "hgu95av2MAP" "hgu95av2MAPCOUNTS" "hgu95av2OMIM"
# [25] "hgu95av2ORGANISM" "hgu95av2ORGPKG" "hgu95av2PATH"
# [28] "hgu95av2PATH2PROBE" "hgu95av2PFAM" "hgu95av2PMID"
# [31] "hgu95av2PMID2PROBE" "hgu95av2PROSITE" "hgu95av2REFSEQ"
# [34] "hgu95av2SYMBOL" "hgu95av2UNIGENE" "hgu95av2UNIPROT"
> as.list(hgu95av2GO[1]
# $`1000_at`$`GO:0097110`
# $`1000_at`$`GO:0097110`$GOID
# [1] "GO:0097110"
# $`1000_at`$`GO:0097110`$Evidence
# [1] "IEA"
# $`1000_at`$`GO:0097110`$Ontology
# [1] "MF"
1000_at
为探针,分别显示GOID, Evidence(基因注释证据代码), Ontology(基因本体论中的三个分类,BP = Biological Process; CC = Cellular Component; MF = Molecular Function).
> as.list(hgu95av2UNIGENE[1])
# $`1000_at`
# [1] "Hs.861"
"Hs.861"为NCBI UniGene数据库的ID.
提取其中的数据:
> get(featureNames(sCLLex)[1],hgu95av2GO)[1]
# $`GO:0000165`
# $`GO:0000165`$GOID
# [1] "GO:0000165"
# $`GO:0000165`$Evidence
# [1] "NAS"
# $`GO:0000165`$Ontology
# [1] "BP"
5. 理解 head(toTable(hgu95av2SYMBOL)) 的用法
找到TP53基因对应的探针ID
> head(toTable(hgu95av2SYMBOL))
# probe_id symbol
# 1 1000_at MAPK3
# 2 1001_at TIE1
# 3 1002_f_at CYP2C19
# 4 1003_s_at CXCR5
# 5 1004_at CXCR5
# 6 1005_at DUSP1
找到TP53基因对应的探针ID:
> ids <- toTable(hgu95av2SYMBOL)
> grep("TP53",ids$symbol)
# [1] 732 884 966 997 1420 2675 3322 4120 4304 5475 5526 10084
> ids[grep("TP53",ids$symbol),]
# probe_id symbol
# 732 1711_at TP53BP1
# 884 1860_at TP53BP2
# 966 1939_at TP53
# 997 1974_s_at TP53
# 1420 31618_at TP53
# 2675 33025_at TP53TG5
# 3322 33749_at TP53TG1
# 4120 34629_at TP53I11
# 4304 34822_at TP53BP2
# 5475 36079_at TP53I3
# 5526 36136_at TP53I11
# 10084 40986_s_at TP53BP1
6. 理解探针与基因的对应关系
总共多少个基因,基因最多对应多少个探针,是哪些基因,是不是因为这些基因很长,所以在其上面设计多个探针呢?
> summary(hgu95av2SYMBOL)
# SYMBOL map for chip hgu95av2 (object of class "ProbeAnnDbBimap")
# |
# | Lkeyname: probe_id (Ltablename: probes)
# | Lkeys: "1000_at", "1001_at", ... (total=12625/mapped=11460)
# |
# | Rkeyname: symbol (Rtablename: gene_info)
# | Rkeys: "A1BG", "A2M", ... (total=61468/mapped=8585)
# |
# | direction: L --> R
共有61468个基因,对应上8585个。
> table(ids$symbol) %>% sort() %>% tail()
# YME1L1 GAPDH INPP4A MYB PTGER3 STAT1
# 7 8 8 8 8 8
列出 ids 的 'symbol' 列及每个元素出现的个数,排序,显示最后6个;
得到对应最多探针的基因。
基因长度与设计在上面的探针数量无关。
7. 第二步提取到的表达矩阵,找到那些不在 hgu95av2.db 包收录的对应着SYMBOL的探针
> mapped_probes <- mappedkeys(hgu95av2SYMBOL)
> exprSetpb <- row.names(exprSet)
> length(grep("FALSE",(exprSetpb %in% mapped_probes)))
# [1] 1165
> grep("FALSE",(exprSetpb %in% mapped_probes))
# [1] 8 52 98 99 100 109 115 117 128 135 157 168
# [13] 169 171 174 195 197 199 203 219 225 282 283 292
# ...
> head(exprSetpb[index])
# [1] "1007_s_at" "1047_s_at" "1089_i_at" "108_g_at" "1090_f_at" "1099_s_at"
不确定对不对....
8. 过滤表达矩阵
删除那1165个没有对应基因名字的探针。
> e = exprSet[rownames(exprSet) %in% mapped_probes,]
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
# ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
> str(e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
# ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
9. 整合表达矩阵
多个探针对应一个基因的情况下,只保留在所有样本里面平均表达量最大的那个探针。
> maxid = by(e,ids$symbol,function(x) rownames(x)[which.max(rowMeans(x))])
# 按照 ids$symbol 将 e 分组,取行平均值最大的行 的 行名
# 但,为什么要用 symbol 分组呢 😂
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> maxid = by(e,ids$probe_id,function(x) rownames(x)[which.max(rowMeans(x))])
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> str(uni_e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
# ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
# emmmmm 结局是一样的,ok我懂了
10. 把过滤后的表达矩阵更改行名为基因的symbol
因为这个时候探针和基因是一对一关系了。
> rownames(uni_e) = ids[match(rownames(uni_e),ids$probe_id),2]
# 将ids的第2列(symbol)对应到 uni_e 的行名
> library(reshape2)
> m_e = melt(uni_e)
> colnames(m_e) = c('symbol','sample','value')
reshape2_melt()
分组:
> pd = pData(sCLLex)
> Disease = pd[,2]
> table(Disease)
# Disease
# progres. stable
# 14 8
> m_e$group = rep(Disease,each=nrow(uni_e))
11. 对第10步得到的表达矩阵进行探索
先画第一个样本的所有基因的表达量的boxplot,hist,density,然后画所有样本的这些图
> suppressPackageStartupMessages(library(ggplot2))
> ggplot(m_e,aes(x=sample,y=value,fill=group))+geom_boxplot()
ggplot_box
> ggplot(m_e,aes(value,fill=group)) + geom_histogram(bins = 200)+facet_wrap(~sample, nrow = 4)
ggplot_facetwrap
> ggplot(m_e,aes(value,col=group)) + geom_density()
[图片上传失败...(image-e00840-1559916805026)]
12. 理解统计学指标mean,median,max,min,sd,var,mad
并计算出每个基因在所有样本的这些统计学指标,最后按照mad值排序,取top 50 mad值的基因,得到列表。(注意:这个题目出的并不合规,请仔细看。)
12.1 mean
算术平均值
> meanlist=apply(uni_e, 1, mean)
> head(meanlist)
# UTF1 EVX1 ADGRB1 SYN1 CHRNB4 ARR3
# -0.14880683 -0.13705268 0.05817044 0.07589677 0.08317654 0.20826165
12.2 median
中位数
> mdlist=apply(uni_e, 1, median)
> head(mdlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 5.368993 2.333562 3.378740 7.402258 5.902370 3.320628
12.3 max
最大值
> maxlist=apply(uni_e, 1, max)
> head(maxlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 6.251678 2.662170 3.798335 8.802729 9.919125 3.574332
12.4 min
最小值
> minlist=apply(uni_e, 1, min)
> head(minlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 4.826131 2.222276 3.247276 6.456285 4.097580 3.213493
12.5 sd
标准差
> sdlist=apply(uni_e, 1, sd)
> head(sdlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 0.36652641 0.09046121 0.12801488 0.58908623 1.73368583 0.10074804
12.6 var
方差
> varlist=apply(uni_e, 1, var)
> head(varlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 0.13434161 0.00818323 0.01638781 0.34702258 3.00566656 0.01015017
12.7 mad
绝对中位差Median Absolute Deviation
> madlist=apply(uni_e, 1, mad)
> head(madlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 0.22779776 0.06516670 0.08294023 0.38497146 1.43854685 0.09476130
top 50 mad值的基因:
> tail(sort(madlist),50)
# PFN2 TNFSF9 ARHGAP44 P2RY14 THEMIS2 LPL ANXA4 DUSP6
# 1.481294 1.485155 1.488032 1.505107 1.507287 1.532212 1.533327 1.551320
# DUSP5 H1FX FLNA CLEC2B TSPYL2 ZNF266 S100A9 NR4A2
# 1.553782 1.557412 1.574436 1.578055 1.582430 1.587748 1.608285 1.612875
# TGFBI ARF6 APBB2 VCAN RBM38 CAPG PLXNC1 RGS2
# 1.643149 1.654548 1.674443 1.681098 1.703638 1.713747 1.718297 1.770151
# RNASE6 VAMP5 CYBB GNLY CCL3 OAS1 TRIB2 ZNF804A
# 1.775430 1.791017 1.811973 1.814364 1.862311 1.883509 1.937294 1.986163
# IGH PCDH9 VIPR1 COBLL1 GUSBP11 S100A8 HBB LHFPL2
# 2.090866 2.144223 2.171912 2.179666 2.204212 2.220970 2.267136 2.317045
# FCN1 ZAP70 IGLC1 LGALS1 FOS SLAMF1 TCF7 DMD
# 2.371590 2.579046 2.590895 2.600604 2.938078 2.944105 2.993352 3.071541
# IGF2BP3 FAM30A
# 3.234011 3.982191
13. 根据第12步骤得到top 50 mad值的基因列表来取表达矩阵的子集
并且热图可视化子表达矩阵。试试看其它5种热图的包的不同效果。
取子集:
> top50gene=tail(sort(madlist),50)
> top50gene=as.data.frame(top50gene)
> top50genelist=rownames(top50gene)
> top50matrix=uni_e[top50genelist,]
> ct=scale(top50matrix,center=T,scale=F) ## 中心化
> nmlztop50matrix=scale(ct,center=T,scale=T) ## 标准化
> pheatmap::pheatmap(nmlztop50matrix)
pheatmap
> heatmap(nmlztop50matrix)
heatmap
> ggplot(ml_nmtop50,aes(sample,symbol))+
+ geom_tile(aes(fill=value),colour = "white")+
+ scale_fill_gradient(low = "white",high = "steelblue")
ggplot_heatmap
> library(gplots)
> heatmap.2(nmlztop50matrix,key = T,symkey = FALSE,density.info="none",trace="none")
gplots_heatmap.2
> library(lattice)
> library(latticeExtra)
> x <- t(as.matrix(scale(nmlztop50matrix)))
> dd.row <- as.dendrogram(hclust(dist(x)))
> row.ord <- order.dendrogram(dd.row)
> levelplot(t(nmlztop50matrix),aspect = "fill",xlab="sample",ylab="symbol",colorkey = list(space = "left"),legend=list(right=list(fun=dendrogramGrob,args=list(x=dd.row,rod=row.ord,side='right',size=5))))
> levelplot(t(nmlztop50matrix),aspect =
+ "fill",xlab="sample",ylab="symbol",colorkey =
+ list(space = "left"),legend=
+ list(right=list(fun=dendrogramGrob,args=
+ list(x=dd.row,rod=row.ord,side='right',size=5))))
lattice_levelplot
emmm PDF示例里的代码跑出来top50matirx里是很多基因的重复,so这是我琢磨出来的结果
14. 取不同统计学指标mean,median,max,mean,sd,var,mad的各top50基因列表
使用UpSetR包来看他们之间的overlap情况。
各列表:
> e_mean=tail(sort(meanlist),50)
> e_md=tail(sort(mdlist),50)
> e_max=tail(sort(maxlist),50)
> e_min=tail(sort(minlist),50)
> e_sd=tail(sort(sdlist),50)
> e_var=tail(sort(varlist),50)
> e_mad=tail(sort(madlist),50)
> suppressPackageStartupMessages(library("UpSetR"))
> library(magrittr)
> e_all = data.frame(all,
+ e_mean=ifelse(all %in% names(e_mean),1,0),
+ e_md=ifelse(all %in% names(e_md),1,0),
+ e_max=ifelse(all %in% names(e_max),1,0),
+ e_min=ifelse(all %in% names(e_min),1,0),
+ e_sd=ifelse(all %in% names(e_sd),1,0),
+ e_var=ifelse(all %in% names(e_var),1,0),
+ e_mad=ifelse(all %in% names(e_mad),1,0)
+ )
> upset(e_all,nsets = 7,sets.bar.color = "#BD1111")
UpSetR_magrittr
15. 在第二步的基础上面提取CLL包里面的data(sCLLex) 数据对象的样本的表型数据。
> pd=pData(sCLLex)
> grouplist=as.character(pd[,2])
> table(grouplist)
# grouplist
# progres. stable
# 14 8
16. 对所有样本的表达矩阵进行聚类并且绘图
然后添加样本的临床表型数据信息(更改样本名)
> clust = t(exprSet)
> rownames(clust) = colnames(exprSet)
> View(clust)
> View(exprSet)
> clust_dist = dist(clust,method = "euclidean")
> hc = hclust(clust_dist,"ward.D")
> suppressPackageStartupMessages(library(factoextra))
> fviz_dend(hc, k = 4 ,cex = 0.5,k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),color_labels_by_k = TRUE, rect = TRUE)
hclust_fivzdend
17.对所有样本的表达矩阵进行PCA分析并且绘图,添加表型信息
> pca_data <- prcomp(t(exprSet),scale=TRUE)
> pcx <- data.frame(pca_data$x)
> pcr <- cbind(samples=rownames(pcx),grouplist, pcx)
> ggplot(pcr, aes(PC1, PC2))+geom_point(size=5, aes(color=grouplist)) +
+ geom_text(aes(label=samples),hjust=-0.1, vjust=-0.3)
procmp_ggplot
18. 根据表达矩阵及样本分组信息进行批量T检验,得到检验结果表格
分组,求p值
> gl = as.factor(grouplist)
> group1 = which(grouplist == levels(gl)[1])
> group2 = which(grouplist == levels(gl)[2])
# e = exprSet[rownames(exprSet) %in% mapped_probes,]
> et1 = e[,group1]
> et2 = e[,group2]
> data_t = cbind(et1,et2)
> pvals = apply(e, 1, function(x){
+ t.test(as.numeric(x)~grouplist)$p.value
+ }) ## p值
> p.adj = p.adjust(pvals, method = "BH") ## 校正后的p值
求log2FC, 做批量t检验
> data_mean_c = rowMeans(et1) ## control
> data_mean_t = rowMeans(et2) ## treatment
> log2FC = data_mean_t-data_mean_c
> DEG_t.test = cbind(data_mean_c, data_mean_t, log2FC, pvals, p.adj)
> DEG_t.test=DEG_t.test[order(DEG_t.test[,4]),] ##排序
> DEG_t.test=as.data.frame(DEG_t.test)
> head(DEG_t.test)
# data_mean_c data_mean_t log2FC pvals p.adj
# 36129_at 7.875615 8.791753 0.9161377 1.629755e-05 0.1867699
# 37676_at 6.622749 7.965007 1.3422581 4.058944e-05 0.2211373
# 33791_at 7.616197 5.786041 -1.8301554 6.965416e-05 0.2211373
# 39967_at 4.456446 2.152471 -2.3039752 8.993339e-05 0.2211373
# 34594_at 5.988866 7.058738 1.0698718 9.648226e-05 0.2211373
# 32198_at 4.157971 3.407405 -0.7505660 2.454557e-04 0.3192169
19. 使用limma包对表达矩阵及样本分组信息进行差异分析,得到差异分析表格
重点看logFC和P值,画火山图(就是logFC和-log10(P值)的散点图)。
> suppressPackageStartupMessages(library(limma))
> design1=model.matrix(~factor(grouplist)) ## 设计矩阵
> colnames(design1)=levels(factor(grouplist))
> rownames(design1)=colnames(exprSet)
> fit1 = lmFit(exprSet,design1)
> fit1=eBayes(fit1)
> options(digits = 3)
> mtx1 = topTable(fit1,coef=2,adjust='BH',n=Inf)
> DEG_mtx1 = na.omit(mtx1) ##去除缺失值
> head(DEG_mtx1)
# logFC AveExpr t P.Value adj.P.Val B
# 39400_at 1.028 5.62 5.84 8.34e-06 0.0334 3.23
# 36131_at -0.989 9.95 -5.77 9.67e-06 0.0334 3.12
# 33791_at -1.830 6.95 -5.74 1.05e-05 0.0334 3.05
# 1303_at 1.384 4.46 5.73 1.06e-05 0.0334 3.04
# 36122_at -0.780 7.26 -5.14 4.21e-05 0.1062 1.93
# 36939_at -2.547 6.92 -5.04 5.36e-05 0.1128 1.74
画图:
> DEG=DEG_mtx1
> logFC_cutoff <- with(DEG,mean(abs(logFC)) + 2*sd(abs(logFC)) )
> DEG$result = as.factor(ifelse(DEG$P.Value < 0.05 & abs(DEG$logFC) > logFC_cutoff, ## p.Value<0.05,|logFC|>1
+ ifelse(DEG$logFC > logFC_cutoff ,'UP','DOWN'),'NOT'))
> title <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3), #round保留小数位数
+
+ '\nThe number of up gene is ',nrow(DEG[DEG$result =='UP',]) ,
+
+ '\nThe number of down gene is ',nrow(DEG[DEG$result =='DOWN',])
+ )
> ggplot(data=DEG, aes(x=logFC, y=-log10(P.Value), color=result)) +
+ geom_point(alpha=0.4, size=1.75) +
+ theme_set(theme_set(theme_bw(base_size=20)))+
+ xlab("log2 fold change") + ylab("-log10 p-value") +
+ ggtitle( title ) + theme(plot.title = element_text(size=15,hjust = 0.5))+
+ scale_colour_manual(values = c('blue','black','red'))
ggplot_volcano
20.对T检验结果的P值和limma包差异分析的P值画散点图,看看哪些基因相差很大?
> DEG_t.test = DEG_t.test[rownames(DEG_mtx1),]
> colnames(DEG_t.test)
# [1] "data_mean_c" "data_mean_t" "log2FC" "pvals"
# [5] "p.adj"
> colnames(DEG_mtx1)
# [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val"
# [6] "B"
画图:
> plot(DEG_t.test[,4],DEG_mtx1[,4])
> plot(-log10(DEG_t.test[,4]),-log10(DEG_mtx1[,4]))
ggplot_dot
ggplot_dot_lg
最后,向大家隆重推荐生信技能树的一系列干货!
- 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
网友评论