知识分享| 转录组个性化分析(6）——WGCNA

作者: 百易汇能 | 来源:发表于2023-03-12 15:25 被阅读0次

生信菜鸟团-专题学习目录
转录组WGCNA分析
单细胞转录组数据分析的时候可以加上wgcna-转自生信技能树
10X空间转录组之WGCNA的运用
转录组专题-WGCNA分析
10X空间转录组分析之转录组信息 & 空间位置的联合分析（SED
转录组专题：WGCNA
转录组研究好帮手-WGCNA分析
1.转录组 2.高通量测序 3.转录组数据分析 4.差异表达基因
转录组分析-WGCNA

近年来，很多SCI高分文章中都使用了WGCNA，那么WGCNA是什么，它可以应用于哪些研究方向，又如何进行WGCNA分析，其分析结果如何看？现在就带着这些问题，跟着小易一起学习探讨吧！

1 WGCNA介绍

WGCNA（weighted gene co-expression network analysis，权重基因共表达网络分析）是一种分析多个样本基因表达模式的分析方法，可将表达模式相似的基因进行聚类，并分析模块与特定性状或表型之间的关联关系，因此在疾病以及其他性状与基因关联分析等方面的研究中被广泛应用。

WGCNA与趋势分析都为基因共表达分析方法，与趋势分析相比，WGCNA 有以下优势：

a. 聚类方法：使用权重共表达策略（无尺度分布），更加符合生物学现象；

b. 基因间的关系：能呈现基因间的相互作用关系，且能找出处于调控网络中心的 hub 基因；

c. 样本数：适合于大样本量，且越多越好。而趋势分析 5 个点以上就会使结果非常复杂，准确性下降，且最多只能分析 8 个点；

d. 与表型的关联：可以利用模块特征值和 hub 基因与特定性状、表型进行关联分析，更准确地分析生物学问题。

2 数据要求

由于WGCNA是基于相关系数的表达调控网络分析方法。当样本数过低的时候，相关系数的计算是不可靠的，得到的调控网络价值不大。所以，我们推荐的样本数如下：

a. 当独立样本数≥8（非重复样本）时，可以考虑基于Pearson相关系数的WGCNA共表达网络的方法（效果看实际情况而定）；

b. 当样本数≥15（可以包含生物学重复）时，WGCNA方法会有更好的效果。

c. 当样品数＜8时，不建议进行该项分析。

d. 该方法对于不同材料或不同组织进行分析更有意义，对于不同时间点处理相同样品意义不大。

3 分析代码

a. wgcna01_1.R，wgcna01_2.R是用来构建网络和相关图，适用于性状关联和无性状关联的WGCNA网络构建。因为这两步很费时，就拆开了并行跑。

用法：Rscript wgcna01_1.R expre.xls;

b. wgcna02.R 是可视化基因网络图。

Rscript wgcna02.R trait.txt(optional)

代码：wgcna01_1

# Description: To perform co‐expression analysis based on WGCNA Rpackage

# Version: 1.0, mics.com, 2018‐10‐09

# Usage: Rscript wgcna.R rpkm.xls outdir

args <‐ commandArgs(TRUE);

exp_data=args[1]

# Part 1. Basic Analysis

# 1. Data input, cleaning and pre‐processing

# 1.1 Loading expression data

# Load the package

library(WGCNA);

# The following setting is important, do not omit.

options(stringsAsFactors = FALSE);

#Read in the data set

data = read.table(exp_data, head=T, row.names=1);

multiExpr = t(data);

# 1.2 Checking data for excessive missing values

gsg = goodSamplesGenes(multiExpr, verbose = 3);

#

if (!gsg$allOK)

{

# Remove the offending genes and samples from the data:

multiExpr = multiExpr[gsg$goodSamples, gsg$goodGenes]

}

nGenes = ncol(multiExpr)

nSamples = nrow(multiExpr)

if(nGenes > 30000)

{

library(genefilter)

data = read.table(exp_data, head=T, row.names=1)

f1 <‐ pOverA(0.8,0.5)

f2 <‐ function(x) (IQR(x) > 0.5)

ff <‐ filterfun(f1, f2)

selected <‐ genefilter(data, ff)

sum(selected)

esetSub <‐ data[selected, ]

multiExpr <‐ as.data.frame(t(esetSub))

nGenes = ncol(multiExpr)

nSamples = nrow(multiExpr)

}

save(multiExpr, file = "01‐dataInput_Expr.RData")

####Plot 1 samples cluster

sampleTree = hclust(dist(multiExpr), method = "average");

sizeGrWindow(12,9)

pdf(file = "sampleClustering.pdf", width = 12, height = 9);

par(cex = 0.6);

par(mar = c(0,4,2,0))

plot(sampleTree, main = "Sample clustering to detect outliers",

sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)

dev.off()

#########################################remove outlier samples(depend

on samples totally)

# Plot a line to show the cut

#abline(h = 15, col = "red");

# Determine cluster under the line

#clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)

#table(clust)

# clust 1 contains the samples we want to keep.

#keepSamples = (clust==1)

#multiExpr = multiExpr[keepSamples, ]

#nGenes = ncol(multiExpr)

#nSamples = nrow(multiExpr)

# 2. Step‐by‐step network construction and module detection

#=====================================================================

=========

# 2.1 Choosing the soft‐thresholding power: analysis of network

# Choose a set of soft‐thresholding powers

powers = c(c(1:10), seq(from = 12, to=30, by=2))

# Call the network topology analysis function

sft = pickSoftThreshold(multiExpr, powerVector = powers, verbose = 5 )

save(sft, file = "01‐sftpower.RData")

####Plot 2 selected power

pdf(file ="Network_power.pdf", width = 9, height = 5);

par(mfrow = c(1,2));

cex1 = 0.9;

# Scale‐free topology fit index as a function of the soft‐thresholdingpower

plot(sft$fitIndices[,1], ‐sign(sft$fitIndices[,3])*sft$fitIndices[,2],

xlab="Soft Threshold (power)",ylab="Scale Free Topology ModelFit,signed R^2",type="n",

main = paste("Scale independence"));

text(sft$fitIndices[,1], ‐sign(sft$fitIndices[,3])*sft$fitIndices[,2],

labels=powers,cex=cex1,col="red");

abline(h=0.85,col="red")

# Mean connectivity as a function of the soft‐thresholding power

plot(sft$fitIndices[,1], sft$fitIndices[,5],

xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",

main = paste("Mean connectivity"))

text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers,

cex=cex1,col="red")

dev.off()

# Select power

softPower <‐ sft$powerEstimate

####Plot 3 检验选定的β值下记忆网络是否逼近 scale free

# 基因多的时候使用下面的代码:

k <‐ softConnectivity(datE=multiExpr,power=softPower)

pdf(file = "Check_Scalefree.pdf", width = 10, height = 5);

par(mfrow=c(1,2))

hist(k)

scaleFreePlot(k,main="Check Scale free topology\n")

#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐

‐‐‐‐‐‐‐‐‐

# 2.2 One‐step network construction and module detection

#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐

‐‐‐‐‐‐‐‐‐

#2.2.1 获得临近矩阵,根据β值获得临近矩阵和拓扑矩阵，：

softPower <‐ sft$powerEstimate

adjacency = adjacency(multiExpr, power = softPower);

# 将临近矩阵转为 Tom 矩阵

TOM = TOMsimilarity(adjacency);

# 计算基因之间的相异度

dissTOM = 1‐TOM

save(TOM, file = "TOMsimilarity_adjacency.RData")

#2.2.2a Clustering using TOM

geneTree = hclust(as.dist(dissTOM),method="average");

minModuleSize = 30

dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,

minClusterSize = minModuleSize);

table(dynamicMods)

dynamicColors = labels2colors(dynamicMods)

table(dynamicColors)

#Plot 4 plot the dendrogram and colors underneath

pdf(file = "GeneDendrogramColors.pdf", width = 12, height = 9);

sizeGrWindow(12,9)

plotDendroAndColors(geneTree, dynamicColors,

"Dynamic Tree Cut",

dendroLabels = FALSE, hang = 0.03,

addGuide = TRUE, guideHang = 0.05,

main = "Gene dendrogram and module colors")

dev.off()

#2.2.2b Merging of modules whose expression profiles are very similar

MEList = moduleEigengenes(multiExpr, colors = dynamicColors)

MEs = MEList$eigengenes

MEDiss = 1‐cor(MEs);

METree = hclust(as.dist(MEDiss), method = "average");

#Plot 5 merge the dendrogram and colors underneath for similar module

pdf(file = "Clustering_similar_module.pdf", width = 12, height = 9);

sizeGrWindow(12, 9)

plot(METree, main = "Clustering of similar module eigengenes",

xlab = "", sub = "")

MEDissThres = 0.25

abline(h=MEDissThres, col = "red")

dev.off()

merge = mergeCloseModules(multiExpr, dynamicColors, cutHeight =

MEDissThres, verbose = 3)

mergedColors = merge$colors;

mergedMEs = merge$newMEs;

#Plot 6 plot the gene dendrogram again, with the original and merged

module colors underneath

pdf(file = "Merged_GeneDendrogramColors.pdf", width = 12, height = 9)

sizeGrWindow(12, 9)

plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors), c("Dynamic Tree Cut", "Merged dynamic"),

dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)

dev.off()

moduleColors = mergedColors

colorOrder = c("grey", standardColors(50));

moduleLabels = match(moduleColors, colorOrder)‐1;

MEs = mergedMEs;

save(MEs, moduleLabels, moduleColors, geneTree, file = "02‐networkConstruction‐stepByStep.RData")

代码：wgcna01_2

args <‐ commandArgs(TRUE);

exp_data=args[1]

# Part 1.1 Tom Network visualization

# 1. Data input, cleaning and pre‐processing

# 1.1 Loading expression data

# Load the package

library(WGCNA);

options(stringsAsFactors = FALSE)

enableWGCNAThreads()

# The following setting is important, do not omit.

options(stringsAsFactors = FALSE);

#Read in the data set

data = read.table(exp_data, head=T, row.names=1);

multiExpr = t(data)

# 1.2 Checking data for excessive missing values

gsg = goodSamplesGenes(multiExpr, verbose = 3);

#

if (!gsg$allOK)

{

# Remove the offending genes and samples from the data:

multiExpr = multiExpr[gsg$goodSamples, gsg$goodGenes]

}

nGenes = ncol(multiExpr)

nSamples = nrow(multiExpr)

if(nGenes > 30000)

{

library(genefilter)

data = read.table(exp_data, head=T, row.names=1)

f1 <‐ pOverA(0.8,0.5)

f2 <‐ function(x) (IQR(x) > 0.5)

ff <‐ filterfun(f1, f2)

selected <‐ genefilter(data, ff)

sum(selected)

esetSub <‐ data[selected, ]

multiExpr <‐ as.data.frame(t(esetSub))

nGenes = ncol(multiExpr)

nSamples = nrow(multiExpr)

}

# 1.3 Choosing the soft‐thresholding power: analysis of network

topology

# Choose a set of soft‐thresholding powers

powers = c(c(1:10), seq(from = 12, to=30, by=2))

# Call the network topology analysis function

sft = pickSoftThreshold(multiExpr, powerVector = powers, verbose = 5 )

softPower <‐ sft$powerEstimate

# 1.4 Visualizing the gene network

# Calculate topological overlap anew: this could be done more

efficiently by saving the TOM

TOM = TOMsimilarityFromExpr(multiExpr, power = softPower);

save(TOM, file = "TOMsimilarityFromExpr_multiExpr.power.RData")

dissTOM = 1‐TOM

nSelect = 500

# For reproducibility, we set the random seed

set.seed(10);

select = sample(nGenes, size = nSelect);

selectTOM = dissTOM[select, select];

selectTree = hclust(as.dist(selectTOM), method = "average")

selectColors = moduleColors[select];

#Plot 1 Visualizing the gene network in all modules for Tom matrix

pdf(file="Tom_GeneNetworkHeatmap.pdf", width = 9, height = 9);

sizeGrWindow(9,9)

plotDiss = selectTOM^7; diag(plotDiss) = NA;

TOMplot(plotDiss, selectTree, selectColors, main = "Gene Network

heatmap plot")

dev.off()

代码：wgcna02

library(WGCNA)

options(stringsAsFactors = FALSE)

enableWGCNAThreads()

lname1= load("01‐dataInput_Expr.RData")

lname2=load("01‐sftpower.RData")

lname3=load("02‐networkConstruction‐stepByStep.RData")

args <‐ commandArgs(TRUE);

trait_data=args[1]

#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐

‐‐‐‐‐‐‐‐‐

# 3.2 Visualizing the network of module eigengenes

#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐

‐‐‐‐‐‐‐‐‐

# Recalculate module eigengenes

MEs = moduleEigengenes(multiExpr, moduleColors)$eigengenes

MET = orderMEs(MEs)

#Plot 3.2.a plot the relationships among the module eigengenes

pdf(file="Module_Eigengene_network.pdf", width = 9, height = 9);

par(mfrow=c(1,2))

plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro =

c(0,4,2,0),plotHeatmaps = FALSE)

par(cex = 1.0)

plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap =

c(3,4,2,2),

plotDendrograms = FALSE, xLabelsAngle = 90)

dev.off()

#Plot 3.2.b plot MEs corelation between different module

library(PerformanceAnalytics)

Matrix_MEs <‐ as.matrix(MEs)

pdf(file="Correlation_modules.pdf",width = 9, height = 9)

chart.Correlation(Matrix_MEs,histogram = TRUE,pch=19)

dev.off()

#=====================================================================

=========

# 4.Find all genes for each module and Exporting to Cytoscape

#=====================================================================

=========

# Recalculate topological overlap if needed

# Select modules

Allmodules = unique(moduleColors);

multiExpr <‐ as.data.frame(multiExpr)

probes = names(multiExpr) ###"probes means gene names"

# Select module probes

for (i in Allmodules){

module=i

inModule = is.finite(match(moduleColors, module));

modProbes = probes[inModule];

#4.1 Write all gene from each modules

write.table(modProbes,paste("allgenefrom_module",module,".txt",

sep=""),quote = FALSE,row.names = F,col.names = FALSE,sep="\t")

#4.2 Select the corresponding Topological Overlap for Cytoscape

input files

modTOM = TOM[inModule, inModule];

dimnames(modTOM) = list(modProbes, modProbes)

# Export the network into edge and node list files Cytoscape can

read

cyt = exportNetworkToCytoscape(modTOM,

edgeFile = paste("CytoscapeInput‐edges‐", module, ".txt",sep=""),

nodeFile = paste("CytoscapeInput‐nodes‐", module, ".txt",sep=""),

weighted = TRUE,

threshold = 0.02,

nodeNames = modProbes,

altNodeNames = modProbes,

nodeAttr = moduleColors[inModule])

}

#=====================================================================

=========

# 5. Find the hub gene for all modules

#=====================================================================

=========

hubs <‐ chooseTopHubInEachModule(multiExpr, moduleColors,

omitColors = "grey",

power = 14,

type = "unsigned")

H <‐ as.matrix(hubs)

write.table(H,"hubGenes_foreachModule.txt",quote = FALSE,row.names =

T,col.names = FALSE,sep="\t")

# Part 6. External Analysis

if(file.exists(trait_data)){

#=====================================================================

=========

# 6.1 Loading trait data

#=====================================================================

=========

traitData = read.table(trait_data);

# Form a data frame analogous to expression data that will hold the

clinical traits.

nGenes =ncol(multiExpr);

nSamples = nrow(multiExpr);

sampleName=rownames(multiExpr)

traitRows = match(sampleName,traitData$Sample) #colum name for all

sample must be called "Sample"

datTraits = traitData[traitRows, ‐1];

rownames(datTraits) = traitData[traitRows, 1];

save(datTraits,file="01‐dataInput_Trait.RData")

collectGarbage();

# 6.2 Quantifying module–trait associations

MEs0 = moduleEigengenes(multiExpr, moduleColors)$eigengenes

MEs = orderMEs(MEs0)

moduleTraitCor = cor(MEs, datTraits, use = "p");

moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);

####Plot1 Module−trait relationships in heatmap plot

sizeGrWindow(10,6)

pdf(file = "Module_trait_relationships.pdf", width = 12, height = 9);

#Will display correlations and their p‐values

textMatrix = paste(signif(moduleTraitCor, 2), "\n(",

signif(moduleTraitPvalue, 1), ")", sep ="");

dim(textMatrix) = dim(moduleTraitCor)

par(mar = c(6, 8.5, 3, 3));

#Display the correlation values within a heatmap plot

labeledHeatmap(Matrix = moduleTraitCor,

xLabels = names(datTraits),

yLabels = names(MEs),

ySymbols = names(MEs),

colorLabels = FALSE,

colors = greenWhiteRed(50),

textMatrix = textMatrix,

setStdMargins = FALSE,

cex.text = 0.5,

zlim = c(‐1,1),

main = paste("Module‐trait relationships"))

dev.off()

}

4 结果说明

4.1 GeneDendrogramColors.pdf：基因进化树(基因共表达模块分析)

横坐标：基因共表达模块分析的结果展示，不同的颜色代表不同的基因模块，灰色属于未知模块的基因

纵坐标：基因间的不相似性系数，进化树的每一个树枝代表一个基因

4.2 NetworkHeatmap.pdf :共表达模块基因相关系数热图

图上和图左是基因系统发育树，每一个树枝代表一个基因，不同颜色亮带表示不同的module，每一个亮点表示每个基因与其他基因的相关性强弱，每个点的颜色越深（白→黄→红）代表行和列对应的两个基因间的连通性越强。P值采用student's t test计算所得，P值越小代表基因与模块相关性的显著性越强。

4.3 EigengeneDendrogramHeatmap.pdf :共表达模块的聚类图和热图

聚类图：对每个基因模块的特征向量基因进行聚类,

热图：我们将所有模块两两间的相关性进行分析，并绘制热图，颜色越红，相关性越高

4.4 ModuleSampleRelation.pdf : 性状相关特征模块分析（此分析需结合生理数据，如没有生理数据则没有此结果）

每一列代表一个性状，每一行代表一个基因模块。每个格子里的数字代表模块与性状的相关性，该数值越接近1，表示模块与性状正相关性越强；越接近-1，表示模块与性状负相关性越强。括号里的数字代表显著性P value，该数值越小，表示显著性越强。P值采用student's t test计算所得，P值越小代表性状与模块相关性的显著性越强。

生信菜鸟团-专题学习目录
生信菜鸟团-专题学习目录原文链接转录组专题转录组专题-可变剪接转录组专题-融合基因转录组专题-WGCNA分析...
转录组WGCNA分析
介绍这个包之前，先要搞清楚这个包能干啥。（部分内容摘抄自学术咖） Q1：WGCNA能干嘛？A1：能够将表达模式相似...
单细胞转录组数据分析的时候可以加上wgcna-转自生信技能树
单细胞转录组数据分析的时候可以加上wgcna 生信技能树 WGCNA分析大家都不陌生了，我在生信技能树多次写教程分...
10X空间转录组之WGCNA的运用
hello，新的一周，周二了，今天来分享一下空间转录组WGCNA的用法，关于WGCNA已经分享了很多了，大家可以参...
转录组专题-WGCNA分析
生信菜鸟团https://mp.weixin.qq.com/s?__biz=MzUzMTEwODk0Ng==&mi...
10X空间转录组分析之转录组信息 & 空间位置的联合分析（SED
hello，大家好，今天给大家继续分享10X空间转录组的分析，其实之前分享的内容已经多次强调过空间转录组一定要转录...
转录组专题：WGCNA
要注意的就一个是软阈值那块几个参数。
转录组研究好帮手-WGCNA分析
问：我做了疾病发展过程六个阶段的转录组研究，可是文章投稿编辑说我的研究太单薄？除了差异分析，我还能不能用转录组的数...
1.转录组 2.高通量测序 3.转录组数据分析 4.差异表达基因
1.转录组2.高通量测序3.转录组数据分析4.差异表达基因分析5.趋势性上调和下调基因分析6.基因集功能富集分析....
转录组分析-WGCNA
1 数据输入过程中出现的报错：datExpr must contain numeric data. 这个报错是因为...