基因表达聚类分析之初探SOM

作者: 生信宝典 | 来源:发表于2019-10-28 20:05 被阅读0次

SOM基因表达聚类分析初探
基因表达聚类分析之初探SOM
基因表达谱的SOM聚类分析识别功能基因集
[NAR|2019] WebSestalt 基因列表的富集分析
层次聚类分析案例（三）
医学生物信息学：基因芯片数据挖掘分析表达差异基因
基因共表达聚类分析及可视化
转录组中的基因表达模式聚类分析
使用SOM尝试跑单细胞数据
分析基因和miRNA与疾病表型的相关性（WGCNA）

之前的培训有老师提出要做SOM分析，最后卡在code plot只能出segment plot却出不来line plot。查了下，没看到解决方案。今天看了下源码，设置了一个参数，得到趋势图。也顺便学习了SOM分析的整个过程，整理下来，以备以后用到。

更多聚类相关见：

基因共表达聚类分析和可视化

聚类分析factoextra

获取pheatmap聚类后和标准化后的结果

SOM分析基本理论

SOM (Self-Organizing Feature Map,自组织特征图)是基于神经网络方式的数据矩阵和可视化方式。与其它类型的中心点聚类算法如K-means等相似，SOM也是找到一组中心点 (又称为codebook vector)，然后根据最相似原则把数据集的每个对象映射到对应的中心点。在神经网络术语中，每个神经元对应于一个中心点。

与K-means类似，数据集中的每个对象每次处理一个，判断最近的中心点，然后更新中心点。与K-means不同的是，SOM中中心点之间存在拓扑形状顺序，在更新一个中心点的同时，邻近的中心点也会随着更新，直到达到设定的阈值或中心点不再有显著变化。最终获得一系列的中心点 (codes)隐式地定义多个簇，与这个中心点最近的对象归为同一个簇。

SOM强调簇中心点之间的邻近关系，相邻的簇之间相关性更强，更有利于解释结果，常用于可视化网络数据或基因表达数据。

Even though SOM is similar to K-means, there is a fundamental difference. Centroids used in SOM have a predetermined topographic ordering relationship. During the training process, SOM uses each data point to update the closest centroid and centroids that are nearby in the topographic ordering. In this way, SOM produces an ordered set of centroids for any given data set. In other words, the centroids that are close to each other in the SOM grid are more closely related to each other than to the centroids that are farther away. Because of this constraint, the centroids of a two-dimensional SOM can be viewed as lying on a two-dimensional surface that tries to fit the n-dimensional data as well as possible. The SOM centroids can also be thought of as the result of a nonlinear regression with respect to the data points. At a high level, clustering using the SOM technique consists of the steps described in Algorithm below:

1: Initialize the centroids.
2: repeat
3: Select the next object.
4: Determine the closest centroid to the object.
5: Update this centroid and the centroids that are close, i.e., in a specified neighborhood.
6: until The centroids don't change much or a threshold is exceeded.
7: Assign each object to its closest centroid and return the centroids and clusters.</pre>

SOM分析实战

下面是R中用kohonen包进行基因表达数据的SOM分析。

加载或安装包

### LOAD LIBRARIES - install with:
#install.packages(c("kohonen")
library(kohonen)

读入数据并进行标准化

data <- read.table("ehbio_trans.Count_matrix.xls", row.names=1, header=T, sep="\t")

# now train the SOM using the Kohonen method
# 标准化数据
data_train_matrix <- as.matrix(t(scale(t(data))))
names(data_train_matrix) <- names(data)

head(data_train_matrix)

untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011 trt_N61311
ENSG00000223972    1.6201852    -0.5400617    -0.5400617    -0.5400617 -0.5400617
ENSG00000227232   -1.0711639     1.0274429     0.6776751     0.8525590 -1.2460478
ENSG00000278267   -1.6476479     1.3480756     0.1497862     0.7489309 -0.4493585
ENSG00000237613    2.4748737    -0.3535534    -0.3535534    -0.3535534 -0.3535534
ENSG00000238009   -0.3535534    -0.3535534    -0.3535534    -0.3535534  2.4748737
ENSG00000268903   -0.7020086     0.9025825    -0.7020086    -0.7020086 -0.7020086
trt_N052611 trt_N080611 trt_N061011
ENSG00000223972   1.6201852  -0.5400617  -0.5400617
ENSG00000227232  -1.2460478   0.5027912   0.5027912
ENSG00000278267   0.7489309   0.1497862  -1.0485032
ENSG00000237613  -0.3535534  -0.3535534  -0.3535534
ENSG00000238009  -0.3535534  -0.3535534  -0.3535534
ENSG00000268903   0.9025825  -0.7020086   1.7048781

训练SOM模型

# 定义网络的大小和形状  
som_grid <- somgrid(xdim = 10, ydim=10, topo="hexagonal")  
# Train the SOM model!
som_model <- supersom(data_train_matrix, grid=som_grid, keep.data = TRUE)

可视化SOM结果

# Plot of the training progress - how the node distances have stabilised over time.
# 展示训练过程，距离随着迭代减少的趋势，判断迭代是否足够；最后趋于平稳比较好
plot(som_model, type = "changes")</pre>

image.png

计量每个SOM中心点包含的基因的数目

## custom palette as per kohonen package (not compulsory)
coolBlueHotRed <- function(n, alpha = 0.7) {
  rainbow(n, end=4/6, alpha=alpha)[n:1]
}

# shows the number of objects mapped to the individual units.
# Empty units are depicted in gray.
plot(som_model, type = "counts", main="Node Counts", palette.name=coolBlueHotRed)

image

计量SOM中心点的内敛性和质量

# map quality
# shows the mean distance of objects mapped to a unit to
# the codebook vector of that unit.
# The smaller the distances, the better the objects are
# represented by the codebook vectors.
plot(som_model, type = "quality", main="Node Quality/Distance", palette.name=coolBlueHotRed)

image

邻居距离-查看潜在边界点


# 颜色越深表示与周边点差别越大，越是分界点
# neighbour distances
# shows the sum of the distances to all immediate neighbours.
# This kind of visualization is also known as a U-matrix plot.
# Units near a class boundary can be expected to have higher average distances to their neighbours.
# Only available for the "som" and "supersom" maps, for the moment.
plot(som_model, type="dist.neighbours", main = "SOM neighbour distances", palette.name=grey.colors)

image

查看SOM中心点的变化趋势

#code spread
plot(som_model, type = "codes", codeRendering="lines")

image

获取每个SOM中心点相关的基因


table(som_model$unit.classif)

# 只显示一部分
  1   2   3   4   5   6
197 172 434 187 582 249
 95  96  97  98  99 100
168 919 226 419 193 241
# code是从左至右，从下至上进行编号的
som_model_code_class = data.frame(name=rownames(data_train_matrix), code_class=som_model$unit.classif)
head(som_model_code_class)

            name code_class
1 ENSG00000223972         81
2 ENSG00000227232         37
3 ENSG00000278267         93
4 ENSG00000237613         51
5 ENSG00000238009         11
6 ENSG00000268903          4

SOM结果进一步聚类

# 选择合适的聚类数目
# show the WCSS metric for kmeans for different clustering sizes.
# Can be used as a "rough" indicator of the ideal number of clusters
mydata <- as.matrix(as.data.frame(som_model$codes))
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
par(mar=c(5.1,4.1,4.1,2.1))
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)")

image


# Form clusters on grid
## use hierarchical clustering to cluster the codebook vectors
som_cluster <- cutree(hclust(dist(mydata)), 6)
# Colour palette definition
cluster_palette <- function(x, alpha = 0.6) {
  n = length(unique(x)) * 2
  rainbow(n, start=2/6, end=6/6, alpha=alpha)[seq(n,0,-2)]
}

cluster_palette_init = cluster_palette(som_cluster)
bgcol = cluster_palette_init[som_cluster]

#show the same plot with the codes instead of just colours
plot(som_model, type="codes", bgcol = bgcol, main = "Clusters", codeRendering="lines")
add.cluster.boundaries(som_model, som_cluster)

image

有一些类的模式不太明显，以后再看怎么优化。

SOM获取基因所在的新类

som_model_code_class_cluster = som_model_code_class
som_model_code_class_cluster$cluster = som_cluster[som_model_code_class$code_class]
head(som_model_code_class_cluster)

            name code_class cluster
1 ENSG00000223972         81       2
2 ENSG00000227232         37       8
3 ENSG00000278267         93       8
4 ENSG00000237613         51       7
5 ENSG00000238009         11       4
6 ENSG00000268903          4       3

映射某个属性到SOM图


# 此处选择一个样本作为示例，可以关联很多信息，
# 比如基因通路，只要在矩阵后增加新的属性就可以。
color_by_var = names(data_train_matrix)[1]
color_by = data_train_matrix[,color_by_var]
unit_colors <- aggregate(color_by, by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)
plot(som_model, type = "property", property=unit_colors[,2], main=color_by_var, palette.name=coolBlueHotRed)

image

SOM基因表达聚类分析初探
上周的暑期生信黑马培训有老师提出要做SOM分析，最后卡在code plot只能出segment plot却出不来l...
基因表达聚类分析之初探SOM
之前的培训有老师提出要做SOM分析，最后卡在code plot只能出segment plot却出不来line pl...
基因表达谱的SOM聚类分析识别功能基因集
前不久小编接触了这样一个分析，给定基因表达矩阵后，通过自组织映射（Self-organizing map，SOM）...
[NAR|2019] WebSestalt 基因列表的富集分析
我们经常在进行差异表达分析之后，对于差异表达的基因，下一步是要对这些基因进行聚类分析的，通过聚类分析来对这些基因进...
层次聚类分析案例（三）
之前的笔记：聚类介绍：点这里层次聚类分析案例（一）层次聚类分析案例（二）案例三：基因聚类获取全基因组表达数据的...
医学生物信息学：基因芯片数据挖掘分析表达差异基因
基因芯片数据挖掘分析表达差异基因基因芯片数据分析（一）：芯片数据初探基因芯片数据分析（二）：读取芯片数据基因...
基因共表达聚类分析及可视化
欢迎关注天下博客：http://blog.genesino.com/2017/11/gene-cluster/共表...
转录组中的基因表达模式聚类分析
欢迎关注”生信修炼手册”! 实验设计对于转录组数据的分析是非常重要的，对于常规的case/control实验设计，...
使用SOM尝试跑单细胞数据
修改一下上面错误的结论通过如下代码的更细致的观察发现在处理单细胞的数据中，使用SOM方法找到的模式表达基因更倾向于...
分析基因和miRNA与疾病表型的相关性（WGCNA）
流程图： 1.分别下载某个疾病基因和miRNA的芯片数据2.分别进行聚类分析（WCGNA）得到基因以及miRNA...

基因表达聚类分析之初探SOM

SOM分析基本理论

SOM分析实战

加载或安装包

读入数据并进行标准化

训练SOM模型

可视化SOM结果

计量每个SOM中心点包含的基因的数目

计量SOM中心点的内敛性和质量

邻居距离-查看潜在边界点

查看SOM中心点的变化趋势

获取每个SOM中心点相关的基因

SOM结果进一步聚类

SOM获取基因所在的新类

映射某个属性到SOM图

相关文章

SOM基因表达聚类分析初探

基因表达聚类分析之初探SOM

基因表达谱的SOM聚类分析识别功能基因集

[NAR|2019] WebSestalt 基因列表的富集分析

层次聚类分析案例（三）

医学生物信息学：基因芯片数据挖掘分析表达差异基因

基因共表达聚类分析及可视化

转录组中的基因表达模式聚类分析

使用SOM尝试跑单细胞数据

分析基因和miRNA与疾病表型的相关性（WGCNA）

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

收入即学习