美文网首页RNA-seq转录组数据分析生物信息学:研一
mRNA-seq学习(一):[有参]基因差异表达分析和GO注释

mRNA-seq学习(一):[有参]基因差异表达分析和GO注释

作者: TOP生物信息 | 来源:发表于2019-03-10 21:01 被阅读13次

    数据来源:Cytosolic acetyl-CoA promotes histone acetylation
    predominantly at H3K27 in Arabidopsis;GSE79524

    我只试做了转录组分析那一部分。简单概括就是为了评估乙酰化对基因表达的影响,对野生型和突变体都做了转录组分析(基因差异表达分析和GO注释)。

    原文方法,我换成了自己较熟悉的几个工具

    1. 数据获取及质控

    #1.脚本查看
    $ cat dir_6.txt 
    SRR3286802
    SRR3286803
    SRR3286804
    SRR3286805
    SRR3286806
    SRR3286807
    
    $ cat 1.sh 
    dir=$(cut -f1 dir_6.txt)
    for sample in $dir
    do
    prefetch $sample
    done
    
    $ cat 2.sh 
    for i in `seq 2 7`
    do
    fastq-dump --gzip --split-3 --defline-qual '+' --defline-seq '@$ac-$si-$ri' ~/ncbi/public/sra/SRR328680${i}.sra -O /ifs1/Grp3/huangsiyuan/learn_rnaseq/mRNA-seq/data/ &
    done
    
    #2.下载及解压
    sh 1.sh
    sh 2.sh
    
    #解压之后是这样的,可以看出是双端测序
    $ ls
    1.sh       SRR3286802_1.fastq.gz  SRR3286803_2.fastq.gz  SRR3286805_1.fastq.gz  SRR3286806_2.fastq.gz
    2.sh       SRR3286802_2.fastq.gz  SRR3286804_1.fastq.gz  SRR3286805_2.fastq.gz  SRR3286807_1.fastq.gz
    dir_6.txt  SRR3286803_1.fastq.gz  SRR3286804_2.fastq.gz  SRR3286806_1.fastq.gz  SRR3286807_2.fastq.gz
    
    #3.质量检测:fastqc,multiqc
    ls *fastq.gz | while read id
    do
    fastqc ${id} &
    done
    multiqc *fastqc*
    
    #4.过滤接头序列及低质量reads片段
    ##就这组数据来说,质量检测的结果表明数据质量很好,因此这里省略的过滤这一步。
    

    2. 下载gff/gtf注释文件并提取出感兴趣的基因/转录本区间

    一个基因可能对应不同的转录本,不同的转录本可能对应不同的功能。
    以拟南芥的gff注释文件为例:

    #提取基因
    $ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="gene") print $0 }' | cut -d ";" -f1 | head -n 5
    1   araport11   gene    3631    5899    .   +   .   ID=gene:AT1G01010
    1   araport11   gene    6788    9130    .   -   .   ID=gene:AT1G01020
    1   araport11   gene    11649   13714   .   -   .   ID=gene:AT1G01030
    1   araport11   gene    23121   31227   .   +   .   ID=gene:AT1G01040
    1   araport11   gene    31170   33171   .   -   .   ID=gene:AT1G01050
    
    #提取转录本
    $ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="mRNA") print $0 }' | cut -d ";" -f1 | head -n 10
    1   araport11   mRNA    3631    5899    .   +   .   ID=transcript:AT1G01010.1
    1   araport11   mRNA    6788    8737    .   -   .   ID=transcript:AT1G01020.6
    1   araport11   mRNA    6788    8737    .   -   .   ID=transcript:AT1G01020.2
    1   araport11   mRNA    6788    9130    .   -   .   ID=transcript:AT1G01020.3
    1   araport11   mRNA    6788    9130    .   -   .   ID=transcript:AT1G01020.5
    1   araport11   mRNA    6788    9130    .   -   .   ID=transcript:AT1G01020.4
    1   araport11   mRNA    6788    9130    .   -   .   ID=transcript:AT1G01020.1
    1   araport11   mRNA    11649   13714   .   -   .   ID=transcript:AT1G01030.1
    1   araport11   mRNA    11649   13714   .   -   .   ID=transcript:AT1G01030.2
    1   araport11   mRNA    23121   31227   .   +   .   ID=transcript:AT1G01040.1
    

    从ID可以看出AT1G01020基因有6个转录本。这篇文献比较的是基因层面的表达差异,所以这里我提取的是基因gff,算上线粒体和叶绿体上的基因一共27655个。

    $ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="gene") print $0 }' > gene27655.gff
    

    3. 将reads比对到参考基因组

    3.1 为参考基因组建立索引
    nohup ~/mysoft/hisat2-2.1.0/hisat2-build Arabidopsis_thaliana.TAIR10.dna.toplevel.fa Arabidopsis_thaliana &
    
    3.2 比对
    $ cat 3.sh 
    for i in `seq 2 7`
    do
    hisat2 -x ~/learn_rnaseq/mRNA-seq/ref/Arabidopsis_thaliana -p 8 \
    -1 ~/learn_rnaseq/mRNA-seq/data/SRR328680${i}_1.fastq.gz \
    -2 ~/learn_rnaseq/mRNA-seq/data/SRR328680${i}_2.fastq.gz \
    -S ~/learn_rnaseq/mRNA-seq/map/SRR328680${i}.sam
    done
    
    $ nohup sh 3.sh &
    

    从报告文件来看比对率都挺高的,97%以上。

    3.3 sam转bam, 并排序
    $ cat 1.sh 
    for i in `seq 2 7`
    do
    /ifs1/Software/biosoft/samtools-1.9/samtools view -Sb SRR328680${i}.sam > SRR328680${i}.bam
    java -jar /ifs1/Software/biosoft/picard/picard.jar SortSam I=SRR328680${i}.bam O=SRR328680${i}.s.bam SO=coordinate
    /ifs1/Software/biosoft/samtools-1.9/samtools index SRR328680${i}.s.bam
    done
    $ nohup  sh 1.sh &
    

    4. 计算表达量

    Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
    nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 -a ~/learn_rnaseq/mRNA-seq/ref/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/test ~/learn_rnaseq/mRNA-seq/map/SRR328680*.s.bam &
    #上面这一步运行完之后会多出test.summary,test这两个文件
    nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 -p -a ~/learn_rnaseq/mRNA-seq/ref/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/all ~/learn_rnaseq/mRNA-seq/map/SRR328680*.s.bam &
    #上面这行只多了-p选项,为了看看在双端测序中统计fragments和reads有什么区别。运行完了多出all.summary,all两个文件。
    #从all和text两个矩阵文件中,看不出什么明显的差别,但是加-p之后,运行时间长了很多。
    

    将all文件中的这7列提取出来即为表达矩阵。

    grep "^AT" all | cut -f1,7,8,9,10,11,12 > 6sample_count_matrix.txt
    

    5. 差异表达分析

    6sample_count_matrix.txt

    使用DESeq2 包。

    a <- as.matrix(read.table("6sample_count_matrix.txt",sep="\t",header = T,row.names=1))
    condition <- factor(c(rep("WT",3),rep("acc1.5",3)), levels = c("WT","acc1.5"))
    colData <- data.frame(row.names=colnames(a), condition)
    #查看一下
    > colData
            condition
    WT1            WT
    WT2            WT
    WT3            WT
    acc1.51    acc1.5
    acc1.52    acc1.5
    acc1.53    acc1.5
    
    #确保
    > all(rownames(colData) == colnames(a))
    [1] TRUE
    
    dds <- DESeqDataSetFromMatrix(a, colData, design= ~ condition)
    dds <- DESeq(dds)
    #运行结束会报告
    estimating size factors
    estimating dispersions
    gene-wise dispersion estimates
    mean-dispersion relationship
    final dispersion estimates
    fitting model and testing
    
    #得到初步结果并查看
    res <-  results(dds, contrast=c("condition", "acc1.5", "WT"))
    ##log2(fold change)也是一个衡量差异显著性的指标
    resLFC <- lfcShrink(dds, coef="condition_acc1.5_vs_WT", type="apeglm")
    resLFC
    ##按照p值由小到大排列
    resOrdered <- res[order(res$pvalue),]
    resOrdered
    ##矫正p值小于0.001的有多少个差异基因
    sum(res$padj < 0.001, na.rm=TRUE)
    ##画个MA-plot看看大体趋势
    plotMA(res, ylim=c(-2,2))
    plotMA(resLFC, ylim=c(-2,2))
    
    #按照自定义的阈值提取差异基因并导出
    diff_gene_deseq2 <-subset(res, padj < 0.001 & abs(log2FoldChange) > 1)
    write.csv(diff_gene_deseq2,file= "DEG_acc1.5_vs_WT.csv")
    

    6. 简单的GO注释

    首选clusterProfiler包。

    library(”clusterProfiler“)
    library("org.At.tair.db")
    e <- read.table("DEG_acc1.5_vs_WT.csv", header = T,sep = ",")
    f <- e[,1]
    #转换ID并将ENTREZID编号作为后面的识别ID
    ids <- bitr(f, fromType="TAIR", toType=c("SYMBOL","ENTREZID"), OrgDb="org.At.tair.db")
    f <- ids[,3]
    
    按照GO系统给基因分类
    ggo <- groupGO(gene     = f,
                   OrgDb    = org.At.tair.db,
                   ont      = "CC",
                   level    = 3,
                   readable = TRUE)
    > head(ggo)
                       ID                    Description Count GeneRatio
    GO:0005886 GO:0005886                plasma membrane   237   237/718
    GO:0005628 GO:0005628              prospore membrane     0     0/718
    GO:0005789 GO:0005789 endoplasmic reticulum membrane     3     3/718
    GO:0019867 GO:0019867                 outer membrane     6     6/718
    GO:0031090 GO:0031090             organelle membrane    63    63/718
    GO:0034357 GO:0034357        photosynthetic membrane    23    23/718
    
    over-representation test
    ego2 <- enrichGO(gene         = ids$TAIR,
                     OrgDb         = org.At.tair.db,
                     keyType       = 'TAIR',
                     ont           = "CC",
                     pAdjustMethod = "BH",
                     pvalueCutoff  = 0.01,
                     qvalueCutoff  = 0.05,
                     readable      = TRUE)
    
    > head(ego2)
                       ID          Description GeneRatio   BgRatio       pvalue    p.adjust      qvalue
    GO:0009505 GO:0009505 plant-type cell wall    20/654 274/24998 3.989815e-05 0.002907596 0.002713546
    GO:0048046 GO:0048046             apoplast    22/654 328/24998 5.995043e-05 0.002907596 0.002713546
                                                                                                                                                 geneID
    GO:0009505         ATBXL2/ATPMEPCRA/LRX1/AtWAK1/ATPME2/GLL22/LRX2/SS3/ATPAP1/ATC4H/CASP1/BGLU15/RHS12/BGAL1/ATGRP-5/ATPCB/EARLI1/ATPGIP2/FLA13/PME5
    GO:0048046 iPGAM1/ATDHAR1/AOAT1/ANN1/ATPEPC1/GDPDL1/CLE12/AGT/ENO2/GOX1/ATPCB/AtBG2/CRK9/CORI3/PMDH2/BETA/UDG4/ATSBT4.12/LTP3/AtPRX71/ATBXL4/ANNAT2
               Count
    GO:0009505    20
    GO:0048046    22
    
    GO基因富集分析
    geneList = 2^e[,3]
    names(geneList)=as.character(ids[,3])
    geneList = sort(geneList, decreasing = TRUE)
    
    ego3 <- gseGO(geneList     = geneList,
                  OrgDb        = org.At.tair.db,
                  ont          = "CC",
                  nPerm        = 1000,
                  minGSSize    = 100,
                  maxGSSize    = 500,
                  pvalueCutoff = 0.05,
                  verbose      = FALSE)
    head(ego3)
    
    可视化
    barplot(ggo, drop=TRUE, showCategory=12)
    dotplot(ego2)
    

    自己的分析略显简单,没有过多的细化,后面随着学习的深入再来补充。


    reference

    Statistical analysis and visualization of functional profiles for genes and gene clusters:
    http://www.bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
    Analyzing RNA-seq data with DESeq2:http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

    相关文章

      网友评论

        本文标题:mRNA-seq学习(一):[有参]基因差异表达分析和GO注释

        本文链接:https://www.haomeiwen.com/subject/bcljpqtx.html