美文网首页pcoa图
R:肠型Enterotypes

R:肠型Enterotypes

作者: 胡童远 | 来源:发表于2020-08-04 21:03 被阅读0次

    导读

    2011年,肠型(Enterotypes)的概念首次在《自然》杂志上由Arumugam等【1】提出,该研究发现可以将人类肠道微生物组分成稳定的3种类型,因为这3种类型不受年龄、性别、体重以及地域限制,表现出较高的稳定性,与血型具有很高的相似性,所以将其定义为肠型。后来的有的研究称发现2种肠型,也有称发现4种,这引起了国际上对于肠型概念的广泛讨论与争议。最新的《自然∙微生物学》杂志报道了学界对于肠型的最新认识,调解了之前关于肠型的一些争论【2】。

    方法

    目前流行的肠型计算方法有2种,一种是基于样品间的Jensen-Shannon距离,利用围绕中心点划分算法(PAM)进行聚类,最佳分类数目通过Calinski-Harabasz (CH)指数确定【3】;另一种则是直接基于物种丰度,利用狄利克雷多项混合模型(DMM)进行肠型分类【4】。

    参考:人体肠道宏基因组生物信息分析方法 [J].微生物学报, 2018
    Nature 2011 肠型分析教程:https://enterotype.embl.de/enterotypes.html#
    Reference-Based肠型分析:http://enterotypes.org/
    三大肠型ET_B、ET_P、ET_F分别代表Bacteroides(拟杆菌属), Prevotella(普氏菌属)和Firmicutes(厚壁菌门) 肠型。
    Github:https://github.com/tapj/biotyper

    一、举例

    来自:Quantitative microbiome profiling disentangles inflammation- and bile duct obstruction-associated microbiota alterations across PSC/IBD diagnoses.nature microbiology. 2019
    方法:DirichletMultinomial.
    描述:Enterotyping (or community typing) based on the DMM approach was performed in R as described previously. Enterotyping was performed on a combined genus-level abundance matrix including PSC/IBD and mHC samples compiled with 1,054 samples originating from the FGFP.

    来自:Impact of early events and lifestyle on the gut microbiota and metabolic phenotypes in young school-age children. micorbiome. 2019
    方法:DMM, PAM-JSD, PAM-BC
    描述:Genus-level enterotypes analysis was performed according to the Dirichlet multinomial mixtures (DMM) and partitioning around medoid (PAM)-based clustering protocols using Jensen-Shannon divergence (PAM-JSD) and Bray-Curtis (PAM-BC)

    二、函数:JSD计算距离、PAM聚类

    思路:
    1 丰度值为0的加上0.000001(1e-6),避免分子分母出现0
    2 新建matrix,样品数作行数,样品数作列数
    3 双重for循环计算样品间JSD到matrix
    4 样品名为行列名,as.dist去重复,attr设置dist属性
    5 确定cluster数,用PAM对JSD样品matirx进行聚类

    library(ade4)
    library(cluster)
    library(clusterSim)
    
    ## 1. 菌属-样品相对丰度表
    data=read.table("input.txt", header=T, row.names=1, dec=".", sep="\t")
    # dec:标志小数点符号,有些国家的小数点是逗号
    data=data[-1,]
    
    ## 写函数
    ## JSD计算样品距离,PAM聚类样品,CH指数估计聚类数,比较Silhouette系数评估聚类质量。
    dist.JSD <- function(inMatrix, pseudocount=0.000001, ...)
    {
      ## 函数:JSD计算样品距离
      KLD <- function(x,y) sum(x *log(x/y))
      JSD <- function(x,y) sqrt(0.5 * KLD(x, (x+y)/2) + 0.5 * KLD(y, (x+y)/2))
    
      matrixColSize <- length(colnames(inMatrix))
      matrixRowSize <- length(rownames(inMatrix))
      colnames <- colnames(inMatrix)
      resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
      
      inMatrix = apply(inMatrix, 1:2, function(x) ifelse (x==0, pseudocount, x))
      
      for(i in 1:matrixColSize)
      {
        for(j in 1:matrixColSize)
        { 
          resultsMatrix[i, j] = JSD(as.vector(inMatrix[, i]), as.vector(inMatrix[, j]))
        }
      }
          
      colnames -> colnames(resultsMatrix) -> rownames(resultsMatrix)
      as.dist(resultsMatrix) -> resultsMatrix
      # 两两重复矩阵去重,as.dist
      attr(resultsMatrix, "method") <- "dist"
      return(resultsMatrix)
    }
    
    pam.clustering = function(x, k)
    { 
      ## 函数:PAM聚类样品
      # x is a distance matrix and k the number of clusters
      require(cluster)
      cluster = as.vector(pam(as.dist(x), k, diss=TRUE)$clustering)
      return(cluster)
    }
    

    三、计算CH确定最佳Cluster

    思路:
    1 CH指数作为评估cluster数好坏的指标
    2 尝试不同的cluster数,挨个进行PAM聚类,用index.G1计算CH指数
    3 CH指数绘图观察最佳cluster数,which(, arr.ind=T)通过获取最大CH数的位置得到该数

    ## 2. 选择CH指数最大的K值作为最佳聚类数
    data.dist = dist.JSD(data)
                       
    # data.cluster=pam.clustering(data.dist, k=3) # k=3 为例 
    # nclusters = index.G1(t(data), data.cluster, d = data.dist, centrotypes = "medoids") # 查看CH指数
    
    nclusters = NULL
    
    for(k in 1:20)
    { 
      if(k==1)
      {
        nclusters[k] = NA 
      }
      else
      {
        data.cluster_temp = pam.clustering(data.dist, k)
        nclusters[k] = index.G1(t(data), data.cluster_temp,  d = data.dist, centrotypes = "medoids")
      }
    }
    
    plot(nclusters, type="h", xlab="k clusters", ylab="CH index", main="Optimal number of clusters") # 查看K与CH值得关系
    

    如图,CH最大时的cluster数为3,最佳。

    nclusters[1] = 0
    k_best = which(nclusters == max(nclusters), arr.ind = TRUE)
    

    三、对JSD矩阵进行正式PAM聚类,silhouette评估聚类质量

    ## 3. PAM根据JSD距离对样品聚类(分成K个组)
    data.cluster=pam.clustering(data.dist, k = k_best)
    
    ## 4. silhouette评估聚类质量,-1=<S(i)=<1,越接近1越好
    mean(silhouette(data.cluster, data.dist)[, k_best])
    

    四、可视化

    between-class analysis (BCA) 和principal coordinates analysis (PCoA)是两种常用的肠型可视化方法。PCOA更常用生态学研究,推荐使用PCOA对JSD matrix进行可视化。

    1 基于BCA

    ## plot 1
    obs.pca=dudi.pca(data.frame(t(data)), scannf=F, nf=10)
    obs.bet=bca(obs.pca, fac=as.factor(data.cluster), scannf=F, nf=k-1) 
    
    s.class(obs.bet$ls, fac=as.factor(data.cluster), grid=F,sub="Between-class analysis", col=c(1,2,3))
    

    2 基于PCoA

    #plot 2
    obs.pcoa = dudi.pco(data.dist, scannf = F, nf = 3)
    
    s.class(obs.pcoa$li, fac = as.factor(data.cluster), grid = F, sub = "Principal coordiante analysis", col=c(1,2,3))
    

    五、一个案例

    分析思路:
    1 利用菌属样品表对样品进行JSD-PAM聚类
    2 PCOA得到样品坐标,envfit多元回归分析细菌对聚类的贡献度
    3 挑选细菌作可视化

    1 输入数据
    菌属丰度表


    2 删除存在重复的菌属
    # 删除存在重复的菌属
    genus = data.frame(table(input[,1]))
    delete = as.character(genus[genus$Freq != 1,]$Var1)
    input2 = input[!(input$Taxonomy%in%delete),]
    rownames(input2) = input2$Taxonomy
    input2 = input2[, -1]
    

    3 排序选择,这里没选

    # 选用丰度前100的菌属
    input2$sum = apply(input2, 1, sum)
    input2 = input2[order(input2$sum, decreasing=T),]
    input2 = input2[, -ncol(input2)]
    

    4 获取样品举例矩阵

    inMatrix = input2
    #dist.JSD <- function(inMatrix, pseudocount=0.000001, ...)
    #{
      ## 函数:JSD计算样品距离
      pseudocount=0.000001
      KLD <- function(x,y) sum(x *log(x/y))
      JSD <- function(x,y) sqrt(0.5 * KLD(x, (x+y)/2) + 0.5 * KLD(y, (x+y)/2))
    
      matrixColSize <- length(colnames(inMatrix))
      matrixRowSize <- length(rownames(inMatrix))
      colnames <- colnames(inMatrix)
      resultsMatrix <- matrix(0, matrixColSize, matrixColSize)
      
      inMatrix = apply(inMatrix, 1:2, function(x) ifelse (x==0, pseudocount, x))
      
      for(i in 1:matrixColSize)
      {
        for(j in 1:matrixColSize)
        { 
          resultsMatrix[i, j] = JSD(as.vector(inMatrix[, i]), as.vector(inMatrix[, j]))
        }
      }
          
      colnames -> colnames(resultsMatrix) -> rownames(resultsMatrix)
      as.dist(resultsMatrix) -> resultsMatrix
      # 两两重复矩阵去重,as.dist
      attr(resultsMatrix, "method") <- "dist"
      # class, comment, dim, dimnames, names, row.names and tsp
      #return(resultsMatrix)
    #}
    data.dist = resultsMatrix
    

    5 聚类,获取最佳聚类数

    pam.clustering = function(x, k)
    { 
      ## 函数:PAM聚类样品
      # x is a distance matrix and k the number of clusters
      require(cluster)
      cluster = as.vector(pam(as.dist(x), k, diss=TRUE)$clustering)
      return(cluster)
    }
                       
    #data.dist = dist.JSD(inMatrix)
                      
    # data.cluster = pam.clustering(data.dist, k=3) # k=3 为例 
    # nclusters = index.G1(t(input2), data.cluster, d = data.dist, centrotypes = "medoids")
    # 查看CH指数
    nclusters = NULL
    
    for(k in 1:10)
    { 
      if(k==1)
      {
        nclusters[k] = NA
      }
      else
      {
        data.cluster_temp = pam.clustering(data.dist, k)
        nclusters[k] = index.G1(t(inMatrix), data.cluster_temp,  d = data.dist, centrotypes = "medoids")
      }
    }
    
    plot(nclusters, type="h", xlab="k clusters", ylab="CH index", main="Optimal number of clusters")  
                       
    ## 3. PAM根据JSD距离对样品聚类(分成K个组)
    nclusters[1] = 0
    k_best = which(nclusters == max(nclusters), arr.ind = TRUE)
    data.cluster = pam.clustering(data.dist, k = k_best)
    
    ## 4. silhouette评估聚类质量,-1=<S(i)=<1,越接近1越好
    silhouette(data.cluster, data.dist)[, k_best]      
    mean(silhouette(data.cluster, data.dist)[, k_best])
    

    6 PCOA获取坐标,envfit计算细菌相关度,

    obs.pcoa = dudi.pco(data.dist, scannf=F, nf=2)
    
    data = inMatrix
    env = data.frame(t(data))
    
    ord = obs.pcoa$li
    fit = envfit(ord, env, perm=999)
    p = data.frame(fit$vectors$pvals)
    p$label = rownames(p)
    p = p[order(p$fit.vectors.pvals, decreasing=F),]
    #head(fit$vectors)
    #fit$arrows
    #fit$r
    #fit$pvals
    #sign_site = which(p[,1]>0.001, arr.ind=T)[1] - 1
    

    7 根据p值选择细菌(方案一)

    sign_site = which(p[,1]>0.001, arr.ind=T)[1] - 1
    env = data.frame(t(inMatrix[rownames(inMatrix)%in%p$label[1:sign_site],]))
    fit = envfit(ord, env, perm=999)
    
    tmp = data.frame(fit$vectors$arrows)
    color=c()
    for(i in 1:nrow(tmp))
    {
        if(tmp[i,1] < 0 ) color=c(color, "red")
        else if(tmp[i,1] > 0) color=c(color, "blue")
    }
    
    genus = data.frame(head(fit$vectors)$arrows)
    #lab = rbind(ord, genus)
    sample = ord
    for(i in 1:nrow(ord))
    {
        if(ord[i,2] < 0) sample[i,2] = ord[i,2] - 0.01
            else if(ord[i,2] > 0) sample[i,2] = ord[i,2] + 0.01 
                else if(ord[i,2] == 0) sample[i,2] = ord[i,2] + 0.01
    }
    

    8 可视化

    pdf("test1.pdf", width=25, height=21)
    
    par(family = "serif")
    #s.class(ord, fac=as.factor(data.cluster), grid=F,sub="Principal coordiante analysis", csub=3, pch=21, col=c("red","blue"), cpoint=1, clabel=2, cstar=1)
    s.class(ord, fac=as.factor(data.cluster), grid=F, sub="Principal coordiante analysis", csub=3, pch=21, cpoint=2, clabel=2, cstar=1, col=c("Firebrick3", "DeepSkyBlue2"))
    #plot(fit, font=3, col=color, cex=3)
    text(x = sample$A1, y = sample$A2, labels = rownames(sample), cex=2)
    text(x = genus[,1]/4, y = genus[,2]/4, labels = rownames(genus), cex=5, col="black", font=4)
    #text(x = lab$A1, y = lab$A2, labels = rownames(lab), cex=2)
    dev.off()
    

    9 筛选p值前六细菌(方案二)

    # 二、选择靶标细菌(前六)
    env = data.frame(t(inMatrix[rownames(inMatrix)%in%p$label[c(1:6)],]))
    fit = envfit(ord, env, perm=999)
    
    # 箭头配色
    # fit$vectors$arrows
    tmp = data.frame(fit$vectors$arrows)
    color=c()
    for(i in 1:nrow(tmp))
    {
        if(tmp[i,1] < 0 ) color=c(color, "red")
        else if(tmp[i,1] > 0) color=c(color, "blue")
    }
    
    # 画图保存
    pdf("test-arrow.pdf", width=25, height=21)
    
    par(family = "serif")
    #s.class(ord, fac=as.factor(data.cluster), grid=F,sub="Principal coordiante analysis", csub=3, pch=21, col=c("red","blue"), cpoint=1, clabel=2, cstar=1)
    s.class(ord, fac=as.factor(data.cluster), grid=F, sub="Principal coordiante analysis", csub=3, pch=21, cpoint=2, clabel=2, cstar=1, col=c("Firebrick3", "DeepSkyBlue2"))
    plot(fit, font=3, col=color, cex=3)
    
    dev.off()
    

    10 选择靶标细菌(方案三)

    # 二、选择靶标细菌
    env = data.frame(t(inMatrix[rownames(inMatrix)%in%p$label[c(1,2,5)],]))
    fit = envfit(ord, env, perm=999)
    
    # 给每个样本点添加label,防止覆盖 + 0.01
    genus = data.frame(head(fit$vectors)$arrows)
    lab = rbind(ord, genus)
    sample = ord
    for(i in 1:nrow(ord))
    {
        if(ord[i,2] < 0) sample[i,2] = ord[i,2] - 0.01
            else if(ord[i,2] > 0) sample[i,2] = ord[i,2] + 0.01 
                else if(ord[i,2] == 0) sample[i,2] = ord[i,2] + 0.01
    }
    # https://cran.r-project.org/web/packages/ade4/ade4.pdf
    pdf("test1.pdf", width=25, height=21)
    
    par(family = "serif")
    #s.class(ord, fac=as.factor(data.cluster), grid=F,sub="Principal coordiante analysis", csub=3, pch=21, col=c("red","blue"), cpoint=1, clabel=2, cstar=1)
    s.class(ord, fac=as.factor(data.cluster), grid=F, sub="Principal coordiante analysis", csub=3, pch=21, cpoint=2, clabel=2, cstar=1, col=c("Firebrick3", "DeepSkyBlue2"))
    #plot(fit, font=3, col=color, cex=3)
    text(x = sample$A1, y = sample$A2, labels = rownames(sample), cex=2)
    text(x = genus[,1]/4, y = genus[,2]/4, labels = rownames(genus), cex=5, col="black", font=4)
    #text(x = lab$A1, y = lab$A2, labels = rownames(lab), cex=2)
    dev.off()
    
    图片.png

    参考:
    【1】Enterotypes of the human gut microbiome. Nature, 2011
    【2】人体肠道宏基因组生物信息分析方法 [J].微生物学报, 2018
    【3】Linking long-term dietary patterns with gut microbial enterotypes. Science, 2011
    【4】Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS One, 2012
    【5】Statin drugs might boost healthy gut microbes. Nature News. 2020

    如何鉴定肠型
    排序图(beta多样性)添加各种辅助线-无非是让不明显的规律看得到
    ggplot2添加散点图文字标记
    https://cran.r-project.org/web/packages/ade4/ade4.pdf

    2020.10.26更新

    相关文章

      网友评论

        本文标题:R:肠型Enterotypes

        本文链接:https://www.haomeiwen.com/subject/qzzhrktx.html