1. Hierarchical clustering

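After reading in the expression matrix, applying basic filtering, and using goodSamplesGenes() to confirm that there are not too many missing values, we cluster the samples so that outlier samples can be removed.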

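# Cluster the samples (as opposed to clustering genes, which comes later) to check for obvious outliers
sampleTree = hclust(dist(datExpr0), method = "average")
sizeGrWindow(12,9)
par(cex = 0.6)
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
     cex.axis = 1.5, cex.main = 2)
# Optional guide line at the intended cut height (call it as a separate statement after plot())
# abline(h = 80, col = "red")
# Closes the graphics window; skip this line to keep the plot on screen
dev.off()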

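The plot shows that none of the 31 samples in this run is an outlier, so the cutHeight argument of cutreeStatic() in the next step is set to 80, which here should leave every sample in the retained cluster. To check the value visually, call abline(h = 80, col = "red") right after plot() to draw a guide line.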
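One way to decide where the guide line goes is to look at the heights at which branches merge. The sketch below uses made-up data and base R only. The names toyExpr, toyTree, mergeHeights, and candidateCut are illustrative and not part of the original analysis, and the midpoint rule is just one possible heuristic.

```r
# Toy data (hypothetical): 6 samples x 50 genes, with sample S6 shifted far away
set.seed(1)
toyExpr <- matrix(rnorm(6 * 50), nrow = 6,
                  dimnames = list(paste0("S", 1:6), paste0("g", 1:50)))
toyExpr["S6", ] <- toyExpr["S6", ] + 10

toyTree <- hclust(dist(toyExpr), method = "average")

# Merge heights in increasing order; a large jump at the end hints at an outlier
mergeHeights <- sort(toyTree$height)
print(round(mergeHeights, 1))

# A candidate cut height: midway between the two largest merges
candidateCut <- mean(tail(mergeHeights, 2))
print(candidateCut)
```

If the heights show no clear jump, as in the 31-sample data set described here, any cut above the last merge keeps all samples together.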

2. Remove outlier samples

# Determine cluster under the line
clust = cutreeStatic(sampleTree, cutHeight = 80, minSize = 10)
table(clust)
# clust == 1 contains the samples we want to keep
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
# datExpr now holds the expression data ready for network analysis

The parameters to adjust for your own data in this step are cutHeight and minSize in cutreeStatic(); minSize is the minimum number of samples a branch must contain to count as a cluster.
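To make the roles of the two parameters concrete, here is a rough base-R approximation of a constant-height cut with a minimum cluster size. It is a sketch built on toy data, not the actual implementation of cutreeStatic(). The helper cutAtHeight and all toy* names are hypothetical.

```r
# Toy data (hypothetical): 12 ordinary samples plus 1 shifted sample
set.seed(2)
toyExpr <- matrix(rnorm(13 * 50), nrow = 13,
                  dimnames = list(paste0("S", 1:13), paste0("g", 1:50)))
toyExpr["S13", ] <- toyExpr["S13", ] + 10
toyTree <- hclust(dist(toyExpr), method = "average")

# Hypothetical helper: cut at a fixed height, keep only branches with at least
# minSize samples, label them 1, 2, ... by decreasing size, and use 0 otherwise
cutAtHeight <- function(tree, cutHeight, minSize) {
  raw <- cutree(tree, h = cutHeight)
  sizes <- sort(table(raw), decreasing = TRUE)
  kept <- names(sizes)[sizes >= minSize]
  labels <- match(as.character(raw), kept)
  labels[is.na(labels)] <- 0L
  names(labels) <- names(raw)
  labels
}

toyClust <- cutAtHeight(toyTree, cutHeight = 40, minSize = 10)
print(table(toyClust))

# Samples labelled 1 form the large cluster that would be kept
keepToy <- toyClust == 1
```

Lowering cutHeight splits the tree into more branches, while raising minSize discards more of the small ones. Both therefore control how many samples survive the filter.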

3. Merge clinical information and plot

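Format requirements for the clinical sample information:
- Rows are samples; columns are clinical variables
- All values must be numeric: either 0/1 or continuous variables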
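If the raw clinical table contains text categories, they need to be encoded as numbers first. A minimal sketch, assuming a hypothetical table with a two-level status column and a continuous age column (neither column name comes from the original data):

```r
# Hypothetical clinical table: one row per sample, mixed column types
toyClinical <- data.frame(
  status = c("tumor", "normal", "tumor", "normal"),
  age    = c(61, 48, 55, 70),
  row.names = paste0("S", 1:4),
  stringsAsFactors = FALSE
)

# Encode the two-level column as 0/1 and leave the continuous column unchanged
toyTraits <- data.frame(
  tumor = as.numeric(toyClinical$status == "tumor"),
  age   = toyClinical$age,
  row.names = rownames(toyClinical)
)

# Every column should now be numeric
print(sapply(toyTraits, is.numeric))
```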

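# Read the clinical file
# Assumption: sample IDs end up as row names (read.table does this automatically when the
# header has one fewer field than the data rows); if they are in the first column, add row.names = 1
datTraits = read.table("datTraits.txt", sep = "\t", header = TRUE, check.names = FALSE)
datTraits[1:4,1:4]

# Reorder the clinical table so that its rows line up with the samples in datExpr
datTraits <- datTraits[match(rownames(datExpr),rownames(datTraits)),]
# Should print TRUE if the sample names now agree
identical(rownames(datTraits),rownames(datExpr))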
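The alignment step can be tried on a small example. The sketch below uses made-up sample names and adds two safeguards that are suggestions rather than part of the original workflow: a check for unmatched samples, and drop = FALSE so that a table with a single trait column stays a data frame.

```r
# Hypothetical data: expression samples in the order S3, S1, S2;
# the trait table lists S1..S4 in a different order
toyExpr <- matrix(1:6, nrow = 3,
                  dimnames = list(c("S3", "S1", "S2"), c("g1", "g2")))
toyTraits <- data.frame(age = c(50, 60, 70, 80),
                        row.names = c("S1", "S2", "S3", "S4"))

idx <- match(rownames(toyExpr), rownames(toyTraits))
# An NA here would mean an expression sample has no clinical record
stopifnot(!anyNA(idx))

# drop = FALSE keeps a data frame even when there is only one trait column
toyAligned <- toyTraits[idx, , drop = FALSE]
print(identical(rownames(toyAligned), rownames(toyExpr)))
```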

# Re-cluster samples
sampleTree2 = hclust(dist(datExpr), method = "average")
# Convert traits to colors: white = low value, red = high value, grey = missing entry
# Continuous variables give a gradient; 0/1 variables give only white or red
traitColors = numbers2colors(datTraits, signed = FALSE)
# Plot the sample dendrogram and the colors underneath.
# Open the PNG device before plotting so that the figure is written to the file
png("Step1-Sample dendrogram and trait heatmap.png",width = 800,height = 600)
plotDendroAndColors(sampleTree2, traitColors,
                    groupLabels = names(datTraits),
                    main = "Sample dendrogram and trait heatmap")
dev.off()

# Finally, convert the expression matrix to a data.frame to simplify the next step
datExpr=as.data.frame(datExpr)
save(datExpr, datTraits, file = "WGCNA-01-dataInput.RData")