表达矩阵的处理
后续分析所用到的数据,均为FPKM标准化后的表达矩阵。
从流程上对WGCNA进行解读
1)当对芯片数据或者RNA-Seq等数据完成分析之后,我们可以得到一张表达矩阵
一行为一个样本,一列为一个gene。
2)读取表达矩阵之后,对其进行adjacency矩阵的计算
adjacency矩阵,基于gene之间相关性的矩阵。每一个单元代表了2个gene之间的关联性(similarity)。
adjacency矩阵的构建,涉及到软阈值的选择,即构建一个幂函数,选择一个指数(power),强化强相关性,弱化弱相关性。
「adjacency矩阵的拓展」
adjacency矩阵有2种计算方式,
1)unsigned:
2)signed:
第一种情况,认为具有高负相关的gene之间,是有联系的。
第二种情况,认为具有高负相关的gene之间,是没有联系的。
举个例子,
假设cor = -1,β=2,unsigned情况下,计算得到的adjacency为1,即gene之间高度关联。signed情况下,计算得到的adjacency为0,即gene之间无关联。
3)TOM矩阵的构建
引入一个指标,即topological overlaps的计算,用于定义该gene是否有多个高关联性的gene(<font color='yellow'>用于gene module的构建</font>)
「TOM矩阵的拓展」
但是针对unsigned类型的adjacency矩阵,可能会出现无法判断几个基因之间的关联关系。
比如现有i,j,k 3个gene,计算得到它们之间的关联关系为,
1)i和j之间为正相关;2)i和k之间为正相关;3)j和k之间为负相关。
这就矛盾了,说明上面的关联关系可能受到了一些噪音的影响等等。
<u>此时,引入signed TOM</u>,其相较于unsigned TOM能够更好的解决gene之间的冲突情况。
Note:需要注意的是,当使用signed adjacency进行分析时,不会出现上述矛盾。
同时,这些矛盾作者都已经考虑到了,但还是有一定区别,
TOMsimilarity()
中,默认设置为TOMType = "unsigned"
TOMsimilarityFromExpr()
中,默认设置为TOMType = "signed"
4)TOM dissimilarity矩阵的构建
TOM dissimilarity矩阵代表了某个gene与其他gene之间的距离。
用上述矩阵进行聚类分析,得到gene module。
5)初步构建gene module
以TOM dissimilarity矩阵作为输入,进行聚类分析。
代码如下,
# Turn data expression into topological overlap matrix
power=sft$powerEstimate # 使用sft$powerEstimate调用预估出的软阈值
TOM = TOMsimilarityFromExpr(datExpr, power = power)
dissTOM = 1-TOM
# Plot gene tree
geneTree = hclust(as.dist(dissTOM), method = "average"); # 用于后续cutreeDynamic(),对gene tree进行裁剪
# pdf(file = "3-gene_cluster.pdf", width = 12, height = 9);
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04);
# dev.off()
6)使用cutreeDynamic()
进行聚类分析的优化
使用<font color='yellow'>pairwise eigengene</font>,进一步计算得到eigengene dissimilarity,用于聚类分析,筛选指标,最终合并gene module。
Note:eigengene为gene表达模式的指标(PCA降维得到的第一个主成分)
cutreeDynamic()
代码如下,
cutreeDynamic
labels2colors
plotDendroAndColors
# Module identification using dynamic tree cut
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM, deepSplit = 2,
pamRespectsDendro = FALSE, minClusterSize = 30);
table(dynamicMods)
length(table(dynamicMods))
# Convert numeric labels into colors
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
# pdf(file = "4-module_tree.pdf", width = 8, height = 6);
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",dendroLabels = FALSE,
hang = 0.03,addGuide = TRUE, guideHang = 0.05,main = "Gene dendrogram and module colors")
# dev.off()
绘制图如下,
gene module合并代码如下 -> 基于cutreeDynamic()
的分析结果进行聚类分析的gene module的合并
-
mergeCloseModules
:合并gene module plotDendroAndColors
# Merge close modules
MEDissThres=0.25
abline(h=MEDissThres, col = "red")
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
mergedColors = merge$colors
mergedMEs = merge$newMEs
# Plot merged module tree
# pdf(file = "5-merged_Module_Tree.pdf", width = 12, height = 9)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"), dendroLabels = FALSE,
hang = 0.03, addGuide = TRUE, guideHang = 0.05)
# dev.off()
# merge$oldMEs,为数据框,行为样本,列为对应的gene module,其中的数值代表了它们之间的关联度
write.table(merge$oldMEs,file="oldMEs.txt");
write.table(merge$newMEs,file="newMEs.txt");
绘制图如下,可以看到有一些gene module被合并了,
7)将gene module与表型特征相联系
使用标准化的eigengene进行计算。
代码如下,
-
moduleEigengenes
,计算每个gene module的特征值 moduleTraitCor = cor(MEs, datTraits, use = "p")
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
# Define numbers of genes and samples
nGenes = ncol(datExpr);
nSamples = nrow(datExpr);
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, mergedColors)$eigengenes
MEs = orderMEs(MEs0)
# r$> MEs[1:5, 1:5]
# MElightcyan1 MEdarkolivegreen MEgreenyellow MEcyan MEwhite
# LH38_Z1_1 0.3093871 -0.01665288 0.1583588 -0.017809267 -0.01305706
# LH38_Z1_2 0.3318247 -0.01679628 0.1629585 -0.010450719 0.04613664
# LH38_Z1_3 0.3087032 -0.01762908 0.1527085 -0.018456470 -0.03661408
# LH38_Z2_1 0.3050361 -0.01431782 0.1589569 -0.020199995 0.06161998
# LH38_Z2_2 0.3422982 -0.01466196 0.1722949 -0.006663298 0.08192540
# 一些数据处理部分
# Read microbial data as traits
bac_traits = read.table("traits_file/b_order_234.txt", header = T, sep = "\t")
rownames(bac_traits) = bac_traits[, 1]
bac_traits = bac_traits[, -1]
# r$> bac_traits[1:5, 1:5]
# Pseudomonadales Enterobacteriales Xanthomonadales Burkholderiales Verrucomicrobiales
# LH38_Z1_1 0.02120943 0.006338742 0.07261663 0.05920385 0.02674949
# LH38_Z1_2 0.04192444 0.009089757 0.06880071 0.05583164 0.02156440
# LH38_Z1_3 0.01393256 0.004525862 0.06961207 0.05100152 0.02189402
# LH39_Z1_1 0.11288033 0.013045132 0.07138692 0.04186106 0.01743154
# LH39_Z1_2 0.01214503 0.003410243 0.08236562 0.03733519 0.01953600
rownames(MEs) = paste(substr(rownames(MEs), 1, nchar(rownames(MEs))-1), rep(c("1", "2", "3"), 60), sep = "")
# sample names should be consistent in eigen genes and traits !!!!
bac_traits = bac_traits[match(rownames(MEs), rownames(bac_traits)), ]
table(rownames(MEs) == rownames(bac_traits))
# Calculate pearson correlation coefficients between module eigen-genes and traits
moduleTraitCor = cor(MEs, bac_traits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);
write.table(moduleTraitCor,file="moduleTrait_correlation.txt");
write.table(moduleTraitPvalue,file="moduleTrait_pValue.txt");
结果图可视化代码如下,
sizeGrWindow(10,6)
# Will display correlations and their p-values
# 合并
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
# pdf("module-traits-bacteria-order.pdf", width = 100, height = 30)
par(mar = c(15, 12, 5, 5));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(bac_traits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix, # 矩阵单元格上需要显示的信息
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
# dev.off()
结果图如下,
Note:右边的图例,代表相关性程度
8)Hub gene的鉴定
2种方法,
- 与目标module有强关联性的gene(gene significance)
- 使用module membership来鉴定关键gene,
一般先绘制出以module membership为x,gene significance为y的散点图。
代码如下,
# 以相关性最高的模块为例
which(moduleTraitCor == max(moduleTraitCor, na.rm = TRUE), arr.ind = TRUE)
# row col
# MEskyblue3 25 30
Nitrosomonadales <- as.data.frame(bac_traits[, 30])
names(Nitrosomonadales) = "Nitrosomonadales"
modNames = substring(names(MEs), 3) # 去除ME前缀
# 计算module membership
# 使用的是WGCNA自带cor函数,使用皮尔逊计算相关性
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
geneModuleMembership[1:5, 1:5]
# MElightcyan1 MEdarkolivegreen MEgreenyellow MEcyan MEwhite
# Zm00001d001763 -0.045547074 -0.10334193 -0.20030129 0.22671189 0.1501121833
# Zm00001d001766 0.004752275 -0.01548396 -0.11130522 -0.22346048 0.0178421845
# Zm00001d001770 -0.286230379 -0.23340743 -0.39930773 0.22791746 0.0925346831
# Zm00001d001774 0.029505956 -0.06714112 -0.06323784 -0.34302212 0.1677206174
# Zm00001d001775 -0.029767437 -0.07642270 -0.13732818 -0.07876773 -0.0008678361
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
MMPvalue[1:5, 1:5]
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
# 计算gene significance
geneTraitSignificance = as.data.frame(cor(datExpr, Nitrosomonadales, use = "p"));
# head(geneTraitSignificance)
# nrow(geneTraitSignificance)
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(Nitrosomonadales), sep="");
names(GSPvalue) = paste("p.GS.", names(Nitrosomonadales), sep="");
module = "skyblue3"
column = match(module, modNames); # match(x, y),找到y中x的索引位置
moduleGenes = mergedColors==module;
# length(mergedColors)
# nrow(geneModuleMembership)
# nrow(geneTraitSignificance)
# table(mergedColors)
table(moduleGenes)
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for Nitrosomonadales",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
散点图结果如下,可以看到module membership值越高的gene,其gene significance也越高。
其他
绘制gene module eigengenes与samples之间的热图
library("pheatmap")
# Heatmap of old module eigen-genes and samples
pdf(file="oldMEs.pdf",heigh=80,width=20)
row.names(merge$oldMEs)=names(data0) # oldMEs,是一个矩阵
pheatmap(merge$oldMEs,cluster_col=T,cluster_row=T,show_rownames=T,show_colnames=T,fontsize=6)
dev.off()
# Heatmap of new module eigen-genes and samples
# pdf(file="newMEs.pdf",heigh=60,width=20)
row.names(merge$newMEs)=names(data0)
pheatmap(merge$newMEs,cluster_col=T,cluster_row=T,show_rownames=T,show_colnames=T,fontsize=6)
# dev.off()
参考资料
[1] https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/
[2] https://github.com/PengYuMaize/Yu2021NaturePlants
[3] 跟着Nature Plants学数据分析:R语言WGCNA分析完整示例
[4] https://www.youtube.com/watch?v=BzYfg1lO3jw
[5] https://www.biostars.org/p/288153/
[6] https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/
网友评论