美文网首页RNA-seq收入即学习转录组数据分析
拿到基因表达矩阵之后的那点事(三)

拿到基因表达矩阵之后的那点事(三)

作者: 凯凯何_Boy | 来源:发表于2020-08-14 09:09 被阅读0次

    之前的流程我们已经通过三种常用的方法对样品之间做了差异分析,接下来我们就以最流行的DEseq2包分析的结果接着进行分析,可视化~

    大家还记得在上篇差异分析关于Deseq2的结果中,我们得到了一个dds的文件和resdata的数据框结,dds是DEseq数据对象。

    class(dds)
    [1] "DESeqDataSet"
    attr(,"package")
    [1] "DESeq2"  
    

    分析之前,我们先做几个简单但是又比较重要的图,

    P值分布图

    为什么要画这个图呢?大家可以可以看下生信宝典公众号的一篇帖子算出的P-value看上去像齐天大圣变的庙? ,观察原始p值的分布是重要的,我们期待去看到在较低的p值处有个峰且P值高于0.1时处于均匀分布。否则,多重检验不会进行且结果也是没有意义的

    �library(ggplot2)
    ggplot(resdata, aes(x = pvalue)) +
    geom_histogram(bins = 100)
    
    图片.png

    MA图

    An MA plot is useful to observe if the data normalization worked well . MA plot is a scatterplot where x axis denotes the average of normalized counts across samples and the y axis denotes the log fold change in the given contrast. Most points are expected to be on the horizontal 0 line (most genes are expected to be not differentially expressed).

    library(DESeq2)
    DESeq2::plotMA(object = dds, ylim = c(-5, 5))
    
    图片.png

    RLE plot

    A similar plot to the MA plot is the RLE (Relative Log Expression) plot that is useful in finding out if the data at hand needs normalization. Sometimes, even the datasets normalized using the explained methods above may need further normalization due to unforeseen sources of variation that might stem from the library preparation, the person who carries out the experiment, the date of sequencing, the temperature changes in the laboratory at the time of library preparation, and so on and so fort. RLE plot is a quick diagnostic that can be applied on the raw or normalized count matrices to see if further processing is required.
    需要安装一个EDASeq的包

    #BiocManager::install('EDASeq')
    library(EDASeq)
    par(mfrow = c(1, 2))
    plotRLE(counts(dds), outline=FALSE, ylim=c(-4, 4),
            col=as.numeric(coldata$condition),
            main = 'Raw Counts')
    plotRLE(DESeq2::counts(dds, normalized = TRUE),
            outline=FALSE, ylim=c(-4, 4),
            col = as.numeric(coldata$condition),
            main = 'Normalized Counts')
    
    图片.png

    通过上图可以看出我们的数据需要进行标准化~~~

    箱线图

    Deseq2内置了一个plotCounts的函数可以帮助我们快速通过一个箱线图去比较两组某个基因的差异,returnData函数可以直接返回数据框,也方便我们去ggplot作图

    #针对单个基因提取数据框 画箱线图
    d <- plotCounts(dds, gene=which.min(res$padj),intgroup='condition', returnData=TRUE)
    boxplot(count~condition,data = d,col = '#f8766d', ylab = 'expression ',main= 'P值最小的gene')
    
    图片.png

    内置方法画PCA

    PCA数据画图时候要将数据框标准化一下DEseq2内置的俩种标准化方法

    #数据集大的时候用
    VSD <- vst(dds,blind = FALSE)
    AS <- assay(VSD)
    
    #小数据集 用log2转化 
    rld <- rlogTransformation(dds, blind=FALSE) 
    exprMatrix_rlog <- assay(rld) 
    
    #boxplot 查看log后的基因分布
    cols <- rainbow(6)
    boxplot(exprMatrix_rlog,col = cols,main="expression value",las=2)
    
    #PCA 注意PCA要使用log数据
    plotPCA(rld,intgroup="condition", ntop=nrow(counts(dds,normalized =TRUE)))
    
    图片.png

    解释变量bar图

    内置图太简单,这里我们可以通过ggplot进行可视化PCA

    pca <- prcomp(t(assay(rld)))
    percentVar <- pca$sdev^2/sum(pca$sdev^2)
    pca.res = pca[["x"]] %>%
      as.data.frame()
    pca.res = cbind(pca.res, coldata$condition)
    colnames(pca.res)[7] = 'group'
    pca.var = pca$sdev^2 %>%
      as.data.frame()
    pca.var$var = round(pca.var$. / sum(pca.var) * 100, 2) # 计算各主成分所占百分比
    pca.var$pc = colnames(pca.res)[1:(ncol(pca.res)-1)]
    library(ggplot2)
    library(ggsci)
    
    # 绘制bar图看每个主成分的解释量
    ggplot(pca.var, aes(reorder(pc,-var), var, fill = pc,group = 1)) +
      geom_bar(stat = 'identity')+
      scale_fill_d3()+
      geom_line(linetype="dashed", color="blue", size=1.2)+
      geom_point(col = 'red')+
      scale_y_continuous(expand = c(0,0),limits = c(0,100)) +
      geom_text(aes(label = var),hjust = .2)+
      theme_bw() +
      labs(x = '主成分',
           y = '主成分解释量(%)')
    
    图片.png

    绘制PCA图

    p = ggplot(pca.res, aes(PC1, PC2, color = group, shape = group))+ 
      # 选择X轴Y轴并映射颜色和形状
      geom_point(size = 3)+ # 画散点图并设置大小
      geom_hline(yintercept = 0,linetype="dashed") + # 添加横线
      geom_vline(xintercept = 0,linetype="dashed") + # 添加竖线
      scale_color_igv()+ # 设置颜色,此处为Integrative Genomics Viewer配色
      theme_bw() + # 加上边框
      stat_ellipse(level = 0.95,show.legend = F)+ # 添加置信椭圆
      # 自动提取主成分解释度进行绘图
      labs(x = paste('PC1(', pca.var$var[1],'%)', sep = ''),
           y = paste('PC2(', pca.var$var[2],'%)', sep = '')) +
      theme(legend.position = c(0.85,0.85),
            legend.background = element_blank()) # 设置图例位置,此处为相对位置
    p
    
    图片.png

    相关性热图

    这里可以基于样品间的相关性画出热图,也可以基于样品之间的距离作图

    library(gplots)
    #相关性
    a <- cor(as.matrix(assay(rld)))
    #也可以进行距离热图
    b <- as.matrix(dist(t(assay(rld))))
    cols <- c("dodgerblue3","firebrick3")[condition]
    heatmap.2(b,symm = TRUE,
    col = colorRampPalette(c("darkblue","white"))(100),
    labCol = colnames(a),labRow = colnames(a),
    disfun = function(c)as.dist(1 - c),trace = "none",
    Colv = TRUE,cexRow = 0.9,cexCol = 0.9,key.title = NA,font = 2,
    RowSideColors = cols,ColSideColors = cols)
    
    图片.png

    OK,接下来呢,我们结合样品基因表达量去做一些可视化,比如画出这俩组所有差异基因的热图,但是有时候差异基因几千个,不想要那么多,我们就可以根据方差去排序,选择前100个变化较大的基因去做热图,除了热图,点图其实也是不错的选择,用点去代表基因量的多少,再映射上颜色更加美观~~

    一. 首先关于差异基因画个热图

    #关于差异基因画个热图
    diff_gene <- subset(resdata, padj < 0.05 & (log2FoldChange > 1 | log2FoldChange < -1))
    colnames(diff_gene)[1] <- 'GeneID'
    diff_data <- as.matrix(exprMatrix_rlog[diff_gene$GeneID, ]) 
    # colorRampPalette(c("navy", "white", "firebrick3"))(50)
    hmcol <- rev(colorRampPalette(RColorBrewer::brewer.pal(9, "YlOrRd"))(255))
    ann_colors=list(condition=c(C="#377EB8",Tg="#E41A1C")) #设置颜色
    coldata <- coldata%>%tibble::column_to_rownames('Sample')
    pheatmap::pheatmap(diff_data, scale="row", color = hmcol,show_rownames = F,fontsize_row=2,annotation_col = coldata, annotation_colors=ann_colors,cutree_cols = 2)
    
    图片.png

    我们还可以根据方差去画排名前多少的基因的热图,方法道理都一样,挑选排名前100的基因代码

    #根据基因在不同的样本中表达变化的差异程度mad值对数据排序,差异越大的基因排位越前。
    normalized_counts_mad <- apply(exprMatrix_rlog, 1, mad)
    normalized_counts <- exprMatrix_rlog[order(normalized_counts_mad, decreasing=T), ][1:100,]
    pheatmap::pheatmap(normalized_counts)
    

    二. 画点图
    热图感觉是越密集越美观,但是点图我建议最好还是只针对20个左右去画,点太多时候有大有小会显的很拥挤,代码如下:

    1. 先构造数据框 ,我们将上调和下调最显著的前10个给标记出来
    resdata$threshold = factor(ifelse(resdata$padj < 0.05 & abs(resdata$log2FoldChange) >= 1, ifelse(resdata$log2FoldChange>= 1 ,'Up','Down'),'NoSignifi'),levels=c('Up','Down','NoSignifi'))
    up <- subset(resdata, threshold == 'Up')
    up <- up[order(up$padj), ][1:10, ]
    down <- subset(resdata, threshold == 'Down')
    down <- down[order(down$padj), ][1:10, ]
    select_data1 <- rbind(up,down)
    colnames(select_data1)[1] <- 'GeneID'
    plot_data <- exprMatrix_rlog[select_data1$GeneID, ]
    
    1. 作图
      第一种方法:
    library(ggpubr)
    my_cols <- c("#0D0887FF", "#6A00A8FF", "#B12A90FF",
                 "#E16462FF", "#FCA636FF", "#F0F921FF")
    p1 <- ggballoonplot(plot_data, fill = "value")+
      scale_fill_gradientn(colors = my_cols)
    p1
    
    图片.png

    第二种方法:
    可以通过ggplot2画气泡图,将“宽型”数据框转成“长型”数据框

    library(reshape2)
    sele <-melt(plot_data,variable.name = 'Sample')
    colnames(sele) <- c('Gene','Sample','value')
    coldata <- coldata%>%tibble::rownames_to_column('Sample')
    sele <- merge(sele,coldata,by = 'Sample')
    #指定纵轴标签顺序
    
    sele$Gene<-factor(sele$Gene,levels = rev(unique(sele$Gene)),ordered = TRUE)
    
    #绘制正方形“点”,带描边
    p2<-ggplot(sele, aes(Sample,Gene))+geom_point(aes(size=value,color = value))+theme_minimal()+
      scale_color_gradientn(colors = rainbow(7))+theme_bw()+
      theme(axis.text.y = element_text(angle = 0,hjust = 1,vjust = 1,color = rep(c('#8DA1CB','#FD8D62'),c(10,10))))#根据上下调映射y轴颜色
      #scale_color_gradient(low = 'blue',high = 'red')+ylab('')
    p2
    #或者是分组的形式
    p3<-ggplot(sele, aes(Sample,Gene))+geom_point(aes(size=value,color = condition))+theme_bw()+theme(axis.text.y = element_text(angle = 0,hjust = 1,vjust = 1,color = rep(c('blue','red'),c(10,10))))+scale_color_manual(values = c('#8DA1CB','#FD8D62'))
    p3
    
    ggarrange(p2,p3,labels = c('A','B'),ncol = 2)
    
    图片.png

    图形选择看个人喜好,当然其实还有很多种可视化方法去展示一些基因的差异,比如经典的bar柱状图,蜂群图,金字塔图等等,大家可以自行去探索,学习优秀的代码,玩转你的数据~~

    相关文章

      网友评论

        本文标题:拿到基因表达矩阵之后的那点事(三)

        本文链接:https://www.haomeiwen.com/subject/dsxbdktx.html