生信人的20个R语言习题--by me

作者: 尘世中一个迷途小书僮 | 来源:发表于2020-04-10 17:46 被阅读0次

生信人的20个R语言习题--by me
生信人的20个R语言习题及其答案-土豆学习笔记
R语言作业-20题
Day2-R
2019-04-19
盘一盘生信技能树R语言小作业（高级）
[目录|待刷的生信题整合]来自于生信技能树和生信菜鸟团的待刷的生
【生信人的20个R语言习题】R语言学习Q&A
R语言视频笔记
R语言之生信⑦Cox比例风险模型(单因素)

看到生信技能树博客上发布的20道生信人的R语言练习题（http://www.bio-info-trainee.com/3409.html），也打算做来练练手。果然，知识的盲区总会在应用中浮现。话不多说，就先附上我的答案，以供大伙参考。

P.S. 这20道R语言题偏向于RNA芯片表达数据的分析，有芯片分析流程经验的同学做起来会比较顺手

安装一些R包：
```
数据包： ALL, CLL, pasilla, airway 
软件包：limma，DESeq2，clusterProfiler 
工具包：reshape2
绘图包：ggplot2
```
不同领域的R包使用频率不一样，在生物信息学领域，尤其需要掌握bioconductor系列包。

options(repos = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/")
BiocManager::install(c('CLL','ALL','pasilla','airway',
                       'limma', 'DESeq2','clusterProfiler',
                       'reshape2','ggplot2'))

国内的朋友可以尝试换国内的镜像再去下包，会比较快点。

了解ExpressionSet对象，比如CLL包里面就有data(sCLLex) ，找到它包含的元素，提取其表达矩阵(使用exprs函数)，查看其大小
1. 参考：http://www.bio-info-trainee.com/bioconductor_China/software/limma.html
2. 参考：https://github.com/bioconductor-china/basic/blob/master/ExpressionSet.md

library(CLL)
data("sCLLex")
sCLLex
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12625 features, 22 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: CLL11.CEL CLL12.CEL ... CLL9.CEL (22 total)
## varLabels: SampleID Disease
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2

exset <- exprs(sCLLex)
dim(exset)
##  12625    22

CLL中提供的数据sCLLex 是由Affymetrix hug95av2 芯片测得的 Chronic Lymphocytic Leukemia（CLL）样本数据，其中包括22位患者的12,625个探针，患者被分为stable和progress两个状态

# 一般芯片分析函数，可能名字会因不同芯片处理包的不同而变化
exprs(): extracting expression matrix from microarray data
pData(): get meat information from data, like samples grouping and sampleid

了解 str,head,help函数，作用于第二步提取到的表达矩阵

str(exset)
##  num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
##   ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
head(exset)
##           CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL CLL16.CEL CLL17.CEL
## 1000_at    5.743132  6.219412  5.523328  5.340477  5.229904  4.920686  5.325348
## 1001_at    2.285143  2.291229  2.287986  2.295313  2.662170  2.278040  2.350796
## 1002_f_at  3.309294  3.318466  3.354423  3.327130  3.365113  3.568353  3.502440
## 1003_s_at  1.085264  1.117288  1.084010  1.103217  1.074243  1.073097  1.091264
## 1004_at    7.544884  7.671801  7.474025  7.152482  6.902932  7.368660  6.456285
## 1005_at    5.083793  7.610593  7.631311  6.518594  5.059087  4.855161  5.176975
##           CLL18.CEL CLL19.CEL CLL20.CEL CLL21.CEL CLL22.CEL CLL23.CEL CLL24.CEL
## 1000_at    4.826131  5.212387  5.285830  5.581859  6.251678  5.480752  5.216033
## 1001_at    2.325163  2.432635  2.256547  2.348389  2.263849  2.264434  2.344079
## 1002_f_at  3.394410  3.617099  3.279726  3.391734  3.306811  3.341444  3.798335
## 1003_s_at  1.076470  1.132558  1.096870  1.138386  1.061184  1.046188  1.082169
## 1004_at    6.824862  7.304803  8.757756  6.695295  7.372185  7.616749  6.839187
## 1005_at    4.874563  4.097580  9.250585  7.657463  7.683677  8.700667  5.546996
##           CLL2.CEL CLL3.CEL CLL4.CEL CLL5.CEL CLL6.CEL CLL7.CEL CLL8.CEL
## 1000_at   5.966942 5.397508 5.281720 5.414718 5.460626 5.897821 5.253883
## 1001_at   2.350073 2.406846 2.341961 2.372928 2.356978 2.222276 2.254772
## 1002_f_at 3.427736 3.453564 3.412944 3.411922 3.396466 3.247276 3.255148
## 1003_s_at 1.501367 1.191339 1.139510 1.153548 1.135671 1.074631 1.166247
## 1004_at   7.545673 8.802729 7.307752 7.491507 8.063202 7.014543 8.019108
## 1005_at   9.611601 5.732182 6.483191 6.072558 9.919125 5.149411 4.989604
##           CLL9.CEL
## 1000_at   5.214155
## 1001_at   2.358544
## 1002_f_at 3.365746
## 1003_s_at 1.151184
## 1004_at   7.432331
## 1005_at   5.339848
class(exset)
## [1] "matrix"

安装并了解 hgu95av2.db 包,看看 ls("package:hgu95av2.db") 后显示的那些变量

BiocManager::install('hgu95av2.db')
library(hgu95av2.db)
ls(hgu95av2.db)
##  "conn"        "packageName"

hgu95av2.db 提供了hgu95av2芯片平台的详细信息，包括探针号与基因SYMBOL对应关系

理解 head(toTable(hgu95av2SYMBOL)) 的用法，找到 TP53 基因对应的探针ID

# toTable() get the probe to symbol information
head(toTable(hgu95av2SYMBOL))
##    probe_id  symbol
## 1   1000_at   MAPK3
## 2   1001_at    TIE1
## 3 1002_f_at CYP2C19
## 4 1003_s_at   CXCR5
## 5   1004_at   CXCR5
## 6   1005_at   DUSP1
id1 <- toTable(hgu95av2SYMBOL)
dim(id1)
## [1] 11460     2
id1[id1$symbol=='TP53',]
##       probe_id symbol
## 966    1939_at   TP53
## 997  1974_s_at   TP53
## 1420  31618_at   TP53

理解探针与基因的对应关系，总共多少个基因，基因最多对应多少个探针，是哪些基因，是不是因为这些基因很长，所以在其上面设计多个探针呢？

library(ggplot2)
library(dplyr)
length(unique(id1$symbol)) #探针数目
## [1] 8585

idfreq <- as.data.frame(table(id1$symbol)) #基因与探针对应频次
str(idfreq)
## 'data.frame':    8585 obs. of  2 variables:
##  $ Var1: Factor w/ 8585 levels "AADAC","AAK1",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Freq: int  1 5 1 1 1 1 1 1 1 3 ...

sum(idfreq$Freq)
## [1] 11460

table(idfreq$Freq) %>% plot()

symbol frequency--plot

ggplot(idfreq) + geom_histogram(aes(x=idfreq$Freq), binwidth = 0.5)

symbol frequency--ggplot

第二步提取到的表达矩阵是12625个探针在22个样本的表达量矩阵，找到那些不在 hgu95av2.db 包收录的对应着SYMBOL的探针。
过滤表达矩阵，删除那1165个没有对应基因名字的探针。

#7-8
table(row.names(exset) %in% id1$probe_id)
## 
## FALSE  TRUE 
##  1165 11460
probe_noid <- exset[which(!(rownames(exset) %in% id1$probe_id)), ]
head(row.names(probe_noid))
## [1] "1007_s_at" "1047_s_at" "1089_i_at" "108_g_at"  "1090_f_at" "1099_s_at"
es_keep <- exset[which(rownames(exset) %in% id1$probe_id), ]
table(row.names(es_keep) %in% id1$probe_id)
## 
##  TRUE 
## 11460

整合表达矩阵，多个探针对应一个基因的情况下，只保留在所有样本里面平均表达量最大的那个探针。
把过滤后的表达矩阵更改行名为基因的symbol，因为这个时候探针和基因是一对一关系了。

#9-10
id1 <- id1[row.names(es_keep) %in% id1$probe_id,]
tmp <- by(es_keep,id1$symbol, 
          FUN = function(x) row.names(x)[which.max(rowMeans(x))])
es_keep <- es_keep[row.names(es_keep) %in% tmp,]
dim(es_keep)
## [1] 8585   22
es_keep <- as.data.frame(es_keep)
es_keep$probe_id <- row.names(es_keep)
df1 <- merge(es_keep, id1, by='probe_id')
row.names(df1) <- df1[,'symbol']
es_keep <- df1[,-c(1,24)]
es_keep[1:5,1:3]
##         CLL11.CEL CLL12.CEL CLL13.CEL
## MAPK3    5.743132  6.219412  5.523328
## TIE1     2.285143  2.291229  2.287986
## CYP2C19  3.309294  3.318466  3.354423
## CXCR5    7.544884  7.671801  7.474025
## DUSP1    5.083793  7.610593  7.631311

7-10题这个过程是在对表达矩阵进行ID转换，将探针ID转化为基因SYMBOL
如果是对高通量测序数据进行ID转换的话，需要借助其物种的id库（可以使用clusterProfiler中的bitr函数进行转换）

对第10步得到的表达矩阵进行探索，先画第一个样本的所有基因的表达量的boxplot,hist,density ，然后画所有样本的这些图
1. 参考：http://bio-info-trainee.com/tmp/basic_visualization_for_expression_matrix.html
2. 理解ggplot2的绘图语法，数据和图形元素的映射关系

R自带绘图工具版本

以下画的图都是所有样本的图

boxplot(es_keep)

hist(as.matrix(es_keep), 
     main = 'All genes expresion among all samples',
     xlab = 'expression level')

par(mfrow=c(4,6))
for (i in 1:dim(es_keep)[2]) {
  as.matrix(es_keep[,i]) %>% density() %>% plot(main=colnames(es_keep)[i])
}

density plot for each samples

ggplot版本

library(reshape2)

# 构建适合于ggplot作图的dataframe
exprset_keep <- melt(as.matrix(es_keep))
exprset_keep$group <- rep(group, each=nrow(es_keep))
colnames(exprset_keep) <- c('probe','sample','value','group')
head(exprset_keep)
ggplot(exprset_keep, aes(x=sample, y=value, fill=group)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, vjust = 0.5))
ggplot(exprset_keep, aes(x=sample, y=value, fill=group)) + geom_violin()
ggplot(exprset_keep,aes(value,fill=group))+
  geom_histogram(bins = 200)+
  facet_wrap(~sample, nrow = 4)
ggplot(exprset_keep,aes(value,col=group))+
  geom_density()+facet_wrap(~sample, nrow = 4)

histogram

density plot

理解统计学指标mean,median,max,min,sd,var,mad并计算出每个基因在所有样本的这些统计学指标，最后按照mad值排序，取top 50 mad值的基因，得到列表。

genes_mean <- apply(es_keep,1,mean)
genes_median <- apply(es_keep, 1, median)
genes_max <- apply(es_keep,1,max)
genes_min <- apply(es_keep,1,min)
genes_sd <- apply(es_keep, 1,sd)
genes_var <- genes_sd^2
genes_mad <- apply(es_keep,1,mad)
genes_mad <- sort(genes_mad,decreasing = T)
genes_top50mad <- genes_mad[1:50]
genes_top50mad
##   FAM30A  IGF2BP3      DMD     TCF7   SLAMF1      FOS   LGALS1    IGLC1 
## 3.982191 3.234011 3.071541 2.993352 2.944105 2.938078 2.600604 2.590895 
##    ZAP70     FCN1   LHFPL2      HBB   S100A8  GUSBP11   COBLL1    VIPR1 
## 2.579046 2.371590 2.317045 2.267136 2.220970 2.204212 2.179666 2.171912 
##    PCDH9      IGH  ZNF804A    TRIB2     OAS1     CCL3     GNLY     CYBB 
## 2.144223 2.090866 1.986163 1.937294 1.883509 1.862311 1.814364 1.811973 
##    VAMP5   RNASE6     RGS2   PLXNC1     CAPG    RBM38     VCAN    APBB2 
## 1.791017 1.775430 1.770151 1.718297 1.713747 1.703638 1.681098 1.674443 
##     ARF6    TGFBI    NR4A2   S100A9   ZNF266   TSPYL2   CLEC2B     FLNA 
## 1.654548 1.643149 1.612875 1.608285 1.587748 1.582430 1.578055 1.574436 
##     H1FX    DUSP5    DUSP6    ANXA4      LPL  THEMIS2   P2RY14 ARHGAP44 
## 1.557412 1.553782 1.551320 1.533327 1.532212 1.507287 1.505107 1.488032 
##   TNFSF9     PFN2 
## 1.485155 1.481294

根据第12步骤得到top 50 mad值的基因列表来取表达矩阵的子集，并且热图可视化子表达矩阵。试试看其它5种热图的包的不同效果。

esmad50 <- es_keep[which(rownames(es_keep) %in% names(genes_top50mad)),]
pheatmap::pheatmap(esmad50)

取不同统计学指标mean,median,max,mean,sd,var,mad的各top50基因列表，使用UpSetR包来看他们之间的overlap情况。

g_ls <- list(genes_mad,genes_max,genes_mean,genes_min,genes_median,
             genes_sd,genes_var)
names(g_ls) <- c("genes_mad","genes_max","genes_mean",
                 "genes_min","genes_median",
                 "genes_sd","genes_var")
g_ls <- lapply(g_ls, function(x) sort(x,decreasing = T))
lapply(g_ls,head)
## $genes_mad
##   FAM30A  IGF2BP3      DMD     TCF7   SLAMF1      FOS 
## 3.982191 3.234011 3.071541 2.993352 2.944105 2.938078 
## 
## $genes_max
##    IGLC1    RPS27     RPL9     IGHM   RPL37A   EEF1A1 
## 14.89571 14.57313 14.21913 14.10483 14.03078 13.88174 
## 
## $genes_mean
##    RPS27     RPL9   RPL37A    RPS23      B2M   RPS15A 
## 14.10560 13.68386 13.66803 13.45981 13.38206 13.31317 
## 
## $genes_min
##    RPS27    RPS23     RPL9      B2M    RPL32    RPL41 
## 13.72338 13.08543 13.06800 13.02396 12.90171 12.85905 
## 
## $genes_median
##    RPS27   RPL37A     RPL9    RPS23   RPS15A      B2M 
## 14.07602 13.73950 13.70093 13.45210 13.40403 13.38796 
## 
## $genes_sd
##   RPS4Y1    DDX3Y    PCDH9   FAM30A      HBB    IGLC1 
## 3.228614 3.224221 2.878758 2.798808 2.645901 2.614670 
## 
## $genes_var
##    RPS4Y1     DDX3Y     PCDH9    FAM30A       HBB     IGLC1 
## 10.423948 10.395604  8.287249  7.833328  7.000793  6.836498

g_ls <- lapply(g_ls, function(x) head(x,n=50))

genes_all <- unlist(g_ls) %>% names() %>% 
  gsub(pattern = "^.*\\.",replacement = "") %>% unique()

dat1 <- data.frame(genes_all=genes_all,
               genes_mean=ifelse(genes_all %in%  names(g_ls$genes_mean) ,1,0),
               genes_median=ifelse(genes_all %in%  names(g_ls$genes_median) ,1,0),
               genes_max=ifelse(genes_all %in%  names(g_ls$genes_max) ,1,0),
               genes_min=ifelse(genes_all %in%  names(g_ls$genes_min) ,1,0),
               genes_sd=ifelse(genes_all %in%  names(g_ls$genes_sd) ,1,0),
               genes_var=ifelse(genes_all %in%  names(g_ls$genes_var) ,1,0),
               genes_mad=ifelse(genes_all %in%  names(g_ls$genes_mad) ,1,0)
)
UpSetR::upset(dat,nsets = 7)

upsetR

UpsetR是一个可视化交集的R包，在多数据取交集的情况下，可视化的结果会比韦恩图看起来更舒服。
其画出的图包括几个部分：
一，右上的bar图：每一个bar对应变量中的一组集合，bar的数字表示该集合的数量大小，例如图中最左侧的第一个bar，表示genes_mad的这个集合有24个元素。
二，右下的点线图：实心点表示该变量在这一集合中存在，点与点之间的存在连线即表明变量间存在交集。
三，左下的bar：表示各变量的大小，这里所有变量大小均为50

UpsetR的结果需要一定学习后才能理解，但考虑其丰富的信息和更直观的结果，这个学习成本是值得付出的。

在第二步的基础上面提取CLL包里面的data(sCLLex) 数据对象的样本的表型数据。
对所有样本的表达矩阵进行聚类并且绘图，然后添加样本的临床表型数据信息(更改样本名)
对所有样本的表达矩阵进行PCA分析并且绘图，同样要添加表型信息。

#15-17
scmeata <- pData(sCLLex)
str(scmeata)
## 'data.frame':    22 obs. of  2 variables:
##  $ SampleID: 'AsIs' chr  "CLL11" "CLL12" "CLL13" "CLL14" ...
##  $ Disease : Factor w/ 2 levels "progres.","stable": 1 2 1 1 1 1 2 2 1 2 ...

colnames(es_keep) <- gsub(".CEL","",colnames(es_keep))
for (i in 1:length(scmeata$SampleID)) {
  if (colnames(es_keep)[i] %in% scmeata$SampleID[i]) {
    colnames(es_keep)[i]=paste(colnames(es_keep)[i], scmeata$Disease[i], sep = ".")
  }
}

t(es_keep) %>% dist() %>% hclust() %>% plot()

Clustering

library(factoextra)
library(FactoMineR)

group <- c(as.character(scmeata$Disease))
es.pca <- PCA(t(es_keep),graph = F)

fviz_pca_ind(es.pca,
             geom.ind = "point",
             col.ind = group, 
             legend.title = "Groups")

PCA

PCA分析用了两个包，FactoMineR进行PCA分析，factoextra进行可视化

引用CSDN上的内容（https://blog.csdn.net/g863402758/article/details/53409704）作为补充

R中作为主成分分析最主要的函数是princomp()函数
princomp()主成分分析可以从相关阵或者从协方差阵做主成分分析
summary()提取主成分信息
loadings()显示主成分分析或因子分析中载荷的内容
predict()预测主成分的值
screeplot()画出主成分的碎石图
biplot()画出数据关于主成分的散点图和原坐标在主成分下的方向

从hcluster和PCA的结果来看，stable和progress两种指标对患者的分类效果不太好，不能很清晰地将患者分开。

根据表达矩阵及样本分组信息进行批量T检验，得到检验结果表格

group_ls <- as.factor(group)
p_val <-  apply(es_keep, 1, function(x){
  t.test(as.numeric(x)~group_ls)$p.value
})
head(p_val)
##     MAPK3      TIE1   CYP2C19     CXCR5     DUSP1     MMP10 
## 0.5355478 0.4370106 0.6064461 0.6829745 0.4663820 0.7039005

p.adj <- p.adjust(p_val, method = 'BH')
expr_stable <- es_keep[,grep('.stable',colnames(es_keep))]
expr_progress <- es_keep[,grep('.progres',colnames(es_keep))]
avg_1 <- rowMeans(expr_stable)
avg_2 <- rowMeans(expr_progress)
log2FC <- avg_2 - avg_1
DEGobj <- cbind(avg_1,avg_2,log2FC,p_val,p.adj)
DEGobj <- DEGobj[order(DEGobj[,"p.adj"]), ] %>% as.data.frame()
head(DEGobj)
## avg_1    avg_2     log2FC        p_val     p.adj
## SGSM2  8.791753 7.875615 -0.9161377 1.629755e-05 0.1399145
## DLEU1  5.786041 7.616197  1.8301554 6.965416e-05 0.1656600
## USP6NL 7.058738 5.988866 -1.0698718 9.648226e-05 0.1656600
## PDE8A  7.965007 6.622749 -1.3422581 4.058944e-05 0.1656600
## LDOC1  2.152471 4.456446  2.3039752 8.993339e-05 0.1656600
## COMMD4 3.407405 4.157971  0.7505660 2.454557e-04 0.2586989

这里进行的其实就是差异表达分析，分组后进行组间的基因表达量t-test。要注意的是我们更应该关注adjusted p-value，因为单纯的p-value会陷入multiple test problem

其实，一般的差异分析也就干了这么一点事，算个p-value，adj.p-val，foldchange。完全可以把上面的代码封装成函数来用，写一个差异表达分析的函数也没有很难。

使用limma包对表达矩阵及样本分组信息进行差异分析，得到差异分析表格，重点看logFC和P值，画个火山图(就是logFC和-log10(P值)的散点图。)。

library(limma)

# supress intercept
design <- model.matrix(~0+group_ls)
colnames(design) <- levels(group_ls)
rownames(design) <- colnames(es_keep)
design
##                progres. stable
## CLL11.progres.        1      0
## CLL12.stable          0      1
## CLL13.progres.        1      0
## CLL14.progres.        1      0
## CLL15.progres.        1      0
## CLL16.progres.        1      0
## CLL17.stable          0      1
## CLL18.stable          0      1
## CLL19.progres.        1      0
## CLL20.stable          0      1
## CLL21.progres.        1      0
## CLL22.stable          0      1
## CLL23.progres.        1      0
## CLL24.stable          0      1
## CLL2.stable           0      1
## CLL3.progres.         1      0
## CLL4.progres.         1      0
## CLL5.progres.         1      0
## CLL6.progres.         1      0
## CLL7.progres.         1      0
## CLL8.progres.         1      0
## CLL9.stable           0      1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group_ls
## [1] "contr.treatment"

limma包进行差异表达分析最重要的就是把Expression matrix和design matrix创建出来。

# con_matrix 指明组间的比较关系
con_matrix <- makeContrasts("progres.-stable",levels = design)
con_matrix
##           Contrasts
## Levels     progres.-stable
##   progres.               1
##   stable                -1

fit <- lmFit(es_keep,design)
fit2 <- contrasts.fit(fit,con_matrix)
fit2 <- eBayes(fit2)
out <- topTable(fit2,coef = 1,n=Inf)
nrDEG <- na.omit(out)
head(nrDEG)
##              logFC  AveExpr         t      P.Value  adj.P.Val        B
## TBC1D2B -1.0284628 5.620700 -5.837398 8.240961e-06 0.02236713 3.351813
## CLIC1    0.9888221 9.954273  5.772843 9.560006e-06 0.02236713 3.230775
## DLEU1    1.8301554 6.950685  5.740883 1.029092e-05 0.02236713 3.170615
## SH3BP2  -1.3835699 4.463438 -5.735418 1.042149e-05 0.02236713 3.160313
## GPM6A    2.5471980 6.915045  5.043180 5.268833e-05 0.08731397 1.821657
## YTHDC2  -0.5187135 7.602354 -4.873724 7.881207e-05 0.08731397 1.485027

nrDEG$threshold <- as.factor(ifelse(nrDEG$P.Value < 0.05 & abs(nrDEG$logFC) >=log2(2),ifelse(nrDEG$logFC> log2(2) ,'Up','Down'),'Not'))
plot1 <- ggplot(data=nrDEG, aes(x=logFC, y =-log10(P.Value), colour=threshold,fill=threshold)) +
  scale_color_manual(values=c("blue", "grey","red"))+
  geom_point(alpha=0.4, size=1.2) +
  theme_bw(base_size = 12, base_family = "Times") +
  theme(legend.position="right",
        panel.grid=element_blank(),
        legend.title = element_blank(),
        legend.text= element_text(face="bold", color="black",family = "Times", size=8),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(face="bold", color="black", size=12),
        axis.text.y = element_text(face="bold",  color="black", size=12),
        axis.title.x = element_text(face="bold", color="black", size=12),
        axis.title.y = element_text(face="bold",color="black", size=12)) +
  labs( x="log2 (Fold Change)",y="-log10 (p-value)")
plot1

这里选择的cutoff是p-value < 0.05和 |fold change| >2。如果用adj.p-val的话就没几个基因符合了。

还需要注意的是limma包差异分析的两个关键函数：lmFit(), eBayes()

对T检验结果的P值和limma包差异分析的P值画散点图，看看哪些基因相差很大。

colnames(DEGobj)
colnames(nrDEG)
table(rownames(DEGobj) %in% rownames(nrDEG))
## 
## TRUE 
## 8585
DEGobj <- DEGobj[order(rownames(DEGobj)),]
nrDEG <- nrDEG[order(rownames(nrDEG)),]
plot(DEGobj[,4], nrDEG[,4]) #using p-value to plot

p-val.

plot(DEGobj[,5], nrDEG[,5]) #using adjusted p-val. to plot

adj.p-val.

这里差异较大可能是因为p-value矫正方式不同导致的。

plot(-log10(DEGobj[,4]),-log10(nrDEG[,4])) #using -log10p-val. to plot

-log10p-val.

以上就是“生信人的20个R语言习题”的答案与见解了，上面许多的题目如果没有芯片数据处理相关经验的话，做起来会比较吃力。但如果看过生信技能新关于GEO数据库挖掘方面的视频的话就会豁然开朗了。

完。

参考资料：

生信人的20个R语言习题：http://www.bio-info-trainee.com/3409.html

Jimmy的答案：https://github.com/jmzeng1314/5years/blob/master/learn-R/tasks/3-r-20-codes.R#L184)

R语言 PCA(主成分分析） : https://blog.csdn.net/g863402758/article/details/53409704

https://www.jianshu.com/p/788010093c90

生信人的20个R语言习题--by me
看到生信技能树博客上发布的20道生信人的R语言练习题（http://www.bio-info-trainee.co...
生信人的20个R语言习题及其答案-土豆学习笔记
生信人的20个R语言习题 http://www.bio-info-trainee.com/3409.html生信...
R语言作业-20题
生信人的20个R语言习题 1 安装R包 2 了解ExpressionSet对象 CLL包里面就有data(sCLL...
Day2-R
sthda R作图代码大全生信人20个R语言练习题一份精美答案提纲： R数据结构 tidyverse, tid...
2019-04-19
R语言初级练习题-上生信技能树线下培训课，R语言初级练习题作答记录 1.PNG 根据返回结果打开电脑目录，可以看...
盘一盘生信技能树R语言小作业（高级）
生信人的20个R语言习题 1.安装一些R包： 2.了解ExpressionSet对象，比如CLL包里面就有data...
[目录|待刷的生信题整合]来自于生信技能树和生信菜鸟团的待刷的生
1.生信人的linux考试2.生信人的20个R语言习题3.VCF格式文件的shell小练习4.fasta和fast...
【生信人的20个R语言习题】R语言学习Q&A
Q1：生信人的20个R语言习题中：数据包： ALL, CLL, pasilla, airway；安装失败咋整？相...
R语言视频笔记
【生信技能树】生信人应该这样学R语言 01.介绍R语言及Rstudio 1.了解并安装R，Rstudio以及R包 ...
R语言之生信⑦Cox比例风险模型(单因素)
目录 R语言之生信①差异基因分析1 R语言之生信②差异基因分析2 R语言之生信③差异基因分析3 R语言之生信④TC...