RNA-seq分析简洁版

作者: Y大宽 | 来源:发表于2018-08-08 14:34 被阅读36次
    • 前面RNA-seq分析:从软件安装到富集分析部分已经把转录组全部流程走完了一遍,这次利用RNA-seq(2)-2:下载数据中下载的肝癌数据进行分析,不在赘述细节,所以有看细节的还是请去这里
    • 这部分主要是代码和部分结果图,但进行了部分修正,比如KEGG 可视化部分,用了前clusterprofiler部分的结果,所以这部分不包括Gage包。
    • 所以从质量比对开始进行
    -----------------------------------------------------------------

    3 sra到fastq格式转换并进行质量控制

    3.1 数据解压:用samtools中的fastq-dump将sra格式转为fastq格式
    • 备注:用时5个小时
    for ((i=2;i<=5;i++));do fastq-dump --gzip --split-3 -A SRR31621$i.sra -O .;done
    
    3.2 用fastqc进行质量控制
    #将所有的数据进行质控,得到zip的压缩文件和html文件
    fastqc -o .  *.fastq.gz
    
    3.3 3 质控结果解读(为保持紧凑,只放一张图)
    SRR316214.png

    4 下载参考基因组及基因注释

    RNA-seq(4):下载参考基因组及基因注释部分已经下载

    5 序列比对:Hisat2

    5.1 开始比对:用hisat2,得到SAM文件(5个小时)
    • 我的fastq文件在/mnt/f/rna_seq/data
    • 我的reference在/mnt/f/rna_seq/data/reference
    • 我的index在/mnt/f/rna_seq/data/reference/index/hg19
    • 比对后得到的bam文件会存放在/mnt/f/rna_seq/aligned
     source ~/miniconda3/bin/activate
    (base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/data$ cd ..
    (base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$
    for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done
    
    (base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$ for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done
    Time loading forward index: 00:00:05
    Time loading reference: 00:00:01
    Multiseed full-index search: 02:11:39
    101339131 reads; of these:
      101339131 (100.00%) were paired; of these:
        7840870 (7.74%) aligned concordantly 0 times
        87196406 (86.04%) aligned concordantly exactly 1 time
        6301855 (6.22%) aligned concordantly >1 times
        ----
        7840870 pairs aligned concordantly 0 times; of these:
          333400 (4.25%) aligned discordantly 1 time
        ----
        7507470 pairs aligned 0 times concordantly or discordantly; of these:
          15014940 mates make up the pairs; of these:
            8711709 (58.02%) aligned 0 times
            5441734 (36.24%) aligned exactly 1 time
            861497 (5.74%) aligned >1 times
    95.70% overall alignment rate
    
    5.2 SAM文件转换为BAM文件,并对bam文件进行sort,最后建立索引:SAMtools
    Sam 文件转换为bam格式(用时1.5hrs)
    (base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$ for ((i=2;i<=5;i++));do samtools view -S /mnt/f/rna_seq/SRR31621${i}.sam -b > SRR31621${i}.bam;done
    
    对bam文件进行排序,默认染色体位置(用时1.5hrs)
    for ((i=2;i<=5;i++));do samtools sort SRR31621${i}.bam
    -o SRR31621${i}_sorted.bam;done
    
    建立索引(用时40mins)
    for ((i=2;i<=5;i++));do samtools index SRR31621${i}_sorted.bam;done
    

    6 reads计数,合并矩阵并进行注释

    6.1 bam文件按reads name排序(用时1h)
    for ((i=2;i<=5;i++));do samtools sort -n SRR31621${i}.b
    am -o SRR31621${i}_nsorted.bam;done
    
    6.2 2 reads计数,得到表达矩阵htseq-count(用时很久很久,久到不忍写,8hrs)

    注释文件如果已经解压,则不再需要重复

    cd ../data/matrix
    gunzip /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.annotation.gt.gz && rm -rf gencode.v19.annotation.gtf.gz
    for ((i=2;i<=5;i++));do htseq-count -r name -f bam /mnt/f/rna_seq/aligned/SRR31621${i}_nsorted.bam /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.an
    notation.gtf > SRR31621${i}.count; done
    
    ls -al *.count
    -rwxrwxrwx 1 root root 1197426 Aug  7 16:25 SRR316212.count
    -rwxrwxrwx 1 root root 1186189 Aug  7 17:16 SRR316213.count
    -rwxrwxrwx 1 root root 1200305 Aug  7 22:51 SRR316214.count
    -rwxrwxrwx 1 root root 1187596 Aug  7 23:32 SRR316215.count
    
    6.3 3 合并表达矩阵并进行基因名注释(R中进行)

    Tumor:SRR316214,SRR316215
    Adjacent Normal Liver:SRR316212,SRR316213

    setwd("F:/rna_seq/data/matrix")
    options(stringsAsFactors = FALSE)
    control1<-read.table("SRR316212.count",sep = "\t",col.names = c("gene_id","control1"))
    control2<-read.table("SRR316213.count",sep = "\t",col.names = c("gene_id","control2"))
    treat1<-read.table("SRR316214.count",sep = "\t",col.names = c("gene_id","treat1"))
    treat2<-read.table("SRR316215.count",sep = "\t",col.names = c("gene_id","treat2"))
    raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
    head(raw_count)
    raw_count_filt <- raw_count[-1:-5,]
    ENSEMBL <- gsub("\\.\\d*", "", raw_count_filt$gene_id) 
    row.names(raw_count_filt) <- ENSEMBL
    head(raw_count_filt)
    
    > head(raw_count_filt)
                               gene_id control1 control2 treat2.x treat2.y
    ENSG00000000003 ENSG00000000003.10     4004      781      756      756
    ENSG00000000005  ENSG00000000005.5        1        0        0        0
    ENSG00000000419  ENSG00000000419.8      776      140      165      165
    ENSG00000000457  ENSG00000000457.9      624      144      240      240
    ENSG00000000460 ENSG00000000460.12      260       52      105      105
    ENSG00000000938  ENSG00000000938.8      161       59       16       16
    

    7 DEseq2筛选差异表达基因并注释

    这次我换了Annotation包进行注释

    7.1 载入数据(countData和colData)

    # 这一步很关键,要明白condition这里是因子,不是样本名称;小鼠数据有对照组和处理组,各两个重复
    condition <- factor(c(rep("control",2),rep("treat",2)), levels = c("control","treat"))
    condition
    colData <- data.frame(row.names=colnames(mycounts), condition)
    
    > colData
             condition
    control1   control
    control2   control
    treat1       treat
    treat2       treat
    

    7.2 构建dds对象,开始DESeq流程

    dds <- DESeqDataSetFromMatrix(countData=mycounts, 
                                  colData=colData, 
                                  design= ~ condition)
    dds = DESeq(dds)
    

    7.3 总体结果查看

    接下来,我们要查看treat versus control的总体结果,并根据p-value进行重新排序。利用summary命令统计显示一共多少个genes上调和下调(FDR0.1)

    res = results(dds, contrast=c("condition", "control", "treat"))
    res = res[order(res$pvalue),]
    head(res)
    summary(res)
    write.csv(res,file="All_results.csv")
    table(res$padj<0.01)
    
    > summary(res)
    out of 30981 with nonzero total read count
    adjusted p-value < 0.1
    LFC > 0 (up)     : 3928, 13% 
    LFC < 0 (down)   : 3283, 11% 
    outliers [1]     : 0, 0% 
    low counts [2]   : 10317, 33% 
    (mean count < 2)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results
    > table(res$padj<0.01)
    FALSE  TRUE 
    16192  4472
    

    可见,上调基因和下调的数目都很多,padj<0.01的有4472,即使padj<0.001的也有3089个。

    7.4 提取差异表达genes(DEGs)

    先看下padj(p值经过多重校验校正后的值)小于0.01,表达倍数取以2为对数后大于1或者小于-1的差异表达基因。代码如下

    diff_gene_deseq2 <-subset(res, padj < 0.01 & abs(log2FoldChange) > 1)
    dim(diff_gene_deseq2)
    head(diff_gene_deseq2)
    write.csv(diff_gene_deseq2,file= "HCC_DEG_0.05_2.csv")
    
    > dim(diff_gene_deseq2)
    [1] 3780    6
    > head(diff_gene_deseq2)
    log2 fold change (MLE): condition control vs treat 
    Wald test p-value: condition control vs treat 
    DataFrame with 6 rows and 6 columns
                     baseMean log2FoldChange     lfcSE      stat        pvalue          padj
                    <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>
    ENSG00000130649 78059.342       7.190959 0.2110024  34.07998 1.460506e-254 3.017990e-250
    ENSG00000268925  5666.055       7.534595 0.2573299  29.27991 1.869559e-188 1.931628e-184
    ENSG00000105697  8791.483       6.611356 0.2338215  28.27523 6.970959e-176 4.801597e-172
    ENSG00000167244  8988.099       6.767261 0.2582403  26.20528 2.313301e-151 1.195051e-147
    ENSG00000162366  4037.960      -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
    ENSG00000169715  3803.597       6.238596 0.2403373  25.95767 1.489813e-148 5.130914e-145
    

    把padj<0.001,|log2FC|>1有2938个

    8 探索分析结果:Data visulization

    #MA plot
    plotMA(res,ylim=c(-2,2))
    topGene <- rownames(res)[which.min(res$padj)]
    with(res[topGene, ], {
      points(baseMean, log2FoldChange, col="dodgerblue", cex=6, lwd=2)
      text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
    })
    
    #shirnked
    res_order<-res[order(row.names(res)),]
    res = res_order
    res.shrink <- lfcShrink(dds, contrast = c("condition","treat","control"), res=res)
    plotMA(res.shrink, ylim = c(-5,5))
    topGene <- rownames(res)[which.min(res$padj)]
    with(res[topGene, ], {
      points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
      text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
    })
    
    #identify
    idx <- identify(res$baseMean, res$log2FoldChange)
    rownames(res)[idx]
    #plotcounts
    plotCounts(dds, gene=which.min(res$padj), intgroup="condition", returnData=TRUE)
    plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=FALSE)
    #boxplot
    plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE) %>% 
      ggplot(aes(condition, count)) + geom_boxplot(aes(fill=condition)) + scale_y_log10() + ggtitle("CYP2E1")
    #pointplot
    d <- plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE)
    ggplot(d, aes(x=condition, y=count)) + 
      geom_point(aes(color= condition),size= 4, position=position_jitter(w=0.5,h=0)) + 
      scale_y_log10(breaks=c(25,100,400))+ ggtitle("CYP2E1")
    
    ##3最小padj
    d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", 
                    returnData=TRUE)
    ggplot(d, aes(x=condition, y=count)) + 
      geom_point(position=position_jitter(w=0.1,h=0)) + 
      scale_y_log10(breaks=c(25,100,400))
    
    #PCA
    vsdata <- vst(dds, blind=FALSE)
    plotPCA(vsdata, intgroup="condition")
    #beatifule pca plot
    pcaData <- plotPCA(vsdata, intgroup=c("condition"), returnData=TRUE)
    percentVar <- round(100 * attr(pcaData, "percentVar"))
    ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
      geom_point(size=3) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
      coord_fixed()
    
    library("pheatmap")
    select<-order(rowMeans(counts(dds, normalized = TRUE)),
                  decreasing = TRUE)[1:20]
    df <- as.data.frame(colData(dds)[,c("condition","sizeFactor")])
    # this gives log2(n + 1)
    ntd <- normTransform(dds)
    pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
             cluster_cols=FALSE, annotation_col=df)
    #vsd
    vsd <- vst(dds, blind=FALSE)
    pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
             cluster_cols=FALSE, annotation_col=df)
    ##sample to sample heatmap
    sampleDists <- dist(t(assay(vsd)))
    library("RColorBrewer")
    sampleDistMatrix <- as.matrix(sampleDists)
    rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
    colnames(sampleDistMatrix) <- NULL
    colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
    pheatmap(sampleDistMatrix,
             clustering_distance_rows=sampleDists,
             clustering_distance_cols=sampleDists,
             col=colors)
    

    这部分主要结果部分的一些可视化,具体参考详细的部分即可,只放一张图


    Rplot.jpeg

    9 DEGs的富集分析(功能注释)ClusterProfiler包

    ##enrichment analysis using clusterprofiler package created by yuguangchuang
    library(clusterProfiler)
    library(DOSE)
    library(stringr)
    library(org.Hs.eg.db)
    #get the ENTREZID for the next analysis
    sig.gene= diff_gene_deseq2
    head(sig.gene)
    gene<-rownames(sig.gene)
    head(gene)
    gene.df<-bitr(gene, fromType = "ENSEMBL", 
                  toType = c("SYMBOL","ENTREZID"),
                  OrgDb = org.Hs.eg.db)
    
    head(gene.df)
    
    > head(gene.df)
              ENSEMBL   SYMBOL ENTREZID
    1 ENSG00000130649   CYP2E1     1571
    3 ENSG00000105697     HAMP    57817
    4 ENSG00000167244     IGF2     3481
    5 ENSG00000162366 PDZK1IP1    10158
    6 ENSG00000169715     MT1E     4493
    7 ENSG00000074276    CDHR2    54825
    

    GO enrichment

    #Go enrichment
    ego_cc<-enrichGO(gene       = gene.df$ENSEMBL,
                     OrgDb      = org.Hs.eg.db,
                     keyType    = 'ENSEMBL',
                     ont        = "CC",
                     pAdjustMethod = "BH",
                     pvalueCutoff = 0.01,
                     qvalueCutoff = 0.05)
    ego_bp<-enrichGO(gene       = gene.df$ENSEMBL,
                     OrgDb      = org.Hs.eg.db,
                     keyType    = 'ENSEMBL',
                     ont        = "BP",
                     pAdjustMethod = "BH",
                     pvalueCutoff = 0.01,
                     qvalueCutoff = 0.05)
    barplot(ego_bp,showCategory = 18,title="The GO_BP enrichment analysis of all DEGs ")+ 
      scale_size(range=c(2, 12))+
      scale_x_discrete(labels=function(ego_bp) str_wrap(ego_bp,width = 25))
    

    KEGG enrichment

    library(stringr)
    kk<-enrichKEGG(gene      =gene.df$ENTREZID,
                   organism = 'hsa',
                   pvalueCutoff = 0.05)
    
    > kk[1:30]
                   ID                                          Description GeneRatio  BgRatio
    hsa04610 hsa04610                  Complement and coagulation cascades   38/1407  79/7431
    hsa04933 hsa04933 AGE-RAGE signaling pathway in diabetic complications   40/1407  99/7431
    hsa00071 hsa00071                               Fatty acid degradation   23/1407  44/7431
    hsa04974 hsa04974                     Protein digestion and absorption   35/1407  90/7431
    hsa05146 hsa05146                                           Amoebiasis   36/1407  96/7431
    hsa04978 hsa04978                                   Mineral absorption   23/1407  51/7431
    hsa04115 hsa04115                                p53 signaling pathway   29/1407  72/7431
    hsa05144 hsa05144                                              Malaria   22/1407  49/7431
    hsa00982 hsa00982                    Drug metabolism - cytochrome P450   28/1407  72/7431
    hsa00980 hsa00980         Metabolism of xenobiotics by cytochrome P450   29/1407  76/7431
    hsa03320 hsa03320                               PPAR signaling pathway   28/1407  74/7431
    hsa05204 hsa05204                              Chemical carcinogenesis   30/1407  82/7431
    
    mykegg<-barplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
      scale_size(range=c(2, 12))+
      scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))
     dotplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
      scale_size(range=c(2, 12))+
      scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))
    

    用cowplot拼起来

    library(cowplot)
    plot_grid(mygobp,mykegg,labels = c("A","B"),align = "h")
    
    Rplot01.jpeg

    10 KEGG信号通路的可视化(pathview包)

    备注,前面详细部分用gageData重新进行了富集分析,这部分直接调用ClusterProfiler包的富集结果
    • 选取enrichKEGG结果pvalue排名第一的hsa4610:Complement and coagulation cascades和排名第7的hsa04115:p53 signaling pathway
    library("pathview")
    foldchanges = sig.gene$log2FoldChange
    names(foldchanges)= gene.df$ENTREZID
    head(foldchanges)
    pathview(gene.data = foldchanges, pathway.id = "hsa04610", species="hsa")
    

    ------------------------------------------------------------------------------

    hsa4610:Complement and coagulation cascades
    hsa04610.pathview.png

    ------------------------------------------------------------------------------

    hsa04115:p53 signaling pathway

    hsa04115.pathview.png

    11 把counts结果,clusterprofiler部分的symbol name 和DEGs全部合并到一个表格

    备注,这部分的主要问题是没用可以merge用的ID,先看下
    > head(mycounts)
                    control1 control2 treat1 treat2
    ENSG00000000003     4004      781   4229    756
    ENSG00000000005        1        0      0      0
    ENSG00000000419      776      140   1180    165
    ENSG00000000457      624      144   1271    240
    ENSG00000000460      260       52    610    105
    ENSG00000000938      161       59     57     16
    
    > head(gene.df)
              ENSEMBL   SYMBOL ENTREZID
    1 ENSG00000130649   CYP2E1     1571
    3 ENSG00000105697     HAMP    57817
    4 ENSG00000167244     IGF2     3481
    5 ENSG00000162366 PDZK1IP1    10158
    6 ENSG00000169715     MT1E     4493
    7 ENSG00000074276    CDHR2    54825
    
    > head(sig.gene)
    log2 fold change (MLE): condition control vs treat 
    Wald test p-value: condition control vs treat 
    DataFrame with 6 rows and 6 columns
                      ENSEMBL log2FoldChange     lfcSE      stat        pvalue          padj
                    <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>
    ENSG00000130649 78059.342       7.190959 0.2110024  34.07998 1.460506e-254 3.017990e-250
    ENSG00000268925  5666.055       7.534595 0.2573299  29.27991 1.869559e-188 1.931628e-184
    ENSG00000105697  8791.483       6.611356 0.2338215  28.27523 6.970959e-176 4.801597e-172
    ENSG00000167244  8988.099       6.767261 0.2582403  26.20528 2.313301e-151 1.195051e-147
    ENSG00000162366  4037.960      -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
    ENSG00000169715  3803.597       6.238596 0.2403373  25.95767 1.489813e-148 5.130914e-145
    

    所以执行以下代码:

    head(sig.gene)
    head(gene.df)
    ENSEMBL<-rownames(sig.gene)
    sig.gene<-cbind(ENSEMBL, sig.gene)
    colnames(sig.gene)[1]<-c("ENSEMBL")
    
    head(mycounts)
    ENSEMBL<-rownames(mycounts)
    mycounts<-cbind(ENSEMBL, mycounts)
    colnames(mycounts)[1]<-c("ENSEMBL")
    
    merge1<-merge(sig.gene,mycounts,by="ENSEMBL")
    merge2<-merge(merge1,gene.df, by="ENSEMBL")
    head(merge2)
    write.csv(DEG_symbole,file="hcc_DEGs_last_results.csv")
    

    结果如下:

    > head(merge2)
    DataFrame with 6 rows and 13 columns
              ENSEMBL   baseMean log2FoldChange     lfcSE      stat       pvalue         padj  control1  control2    treat1    treat2
          <character>  <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric> <integer> <integer> <integer> <integer>
    1 ENSG00000000938   70.51956       2.169409 0.4610591  4.705274 2.535248e-06 2.632581e-05       161        59        57        16
    2 ENSG00000001460   77.93853      -1.496365 0.4247974 -3.522537 4.274371e-04 2.517116e-03        68        18       268        68
    3 ENSG00000001561  382.90350      -1.466940 0.3397462 -4.317751 1.576269e-05 1.347065e-04       416        69      1780       227
    4 ENSG00000001626   70.41775       4.713865 0.5812437  8.109965 5.063455e-16 1.944819e-14       226        61        16         1
    5 ENSG00000001630  165.86393      -2.283850 0.4094980 -5.577194 2.444286e-08 3.665365e-07        84        28       442       207
    6 ENSG00000002549 1683.37557       1.519578 0.2216429  6.855972 7.082930e-12 1.708432e-10      4166      1104      2171       481
           SYMBOL    ENTREZID
      <character> <character>
    1         FGR        2268
    2       STPG1       90529
    3       ENPP4       22875
    4        CFTR        1080
    5     CYP51A1        1595
    6        LAP3       51056
    

    至此,这部分结果可以用来做其他很多下游分析了。

    相关文章

      网友评论

      本文标题:RNA-seq分析简洁版

      本文链接:https://www.haomeiwen.com/subject/ioqwvftx.html