美文网首页RNA-seq生物信息学与算法NGS分析流程
原创10000+生信教程大神给你的RNA实战视频演练

原创10000+生信教程大神给你的RNA实战视频演练

作者: 因地制宜的生信达人 | 来源:发表于2018-08-17 21:56 被阅读138次

    准备工作

    1. 安装conda

    推荐使用偷懒方法,比如安装miniconda软件,下载地址:https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/ 这样就可以使用它安装绝大部分其它软件。

    但是在中国大陆的小伙伴,需要更改镜像源配置

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
    conda config --set show_channel_urls yes
    

    2. 安装软件

    为了避免污染linux工作环境,推荐在conda中创建各个流程的安装环境,比如:

    conda create -n rna python=2 #创建名为rna的软件安装环境
    conda info --envs #查看当前conda环境
    source activate rna #激活conda的rna环境
    

    先读文献调研,得到转录组分析需要用到的软件列表;

    • 质控
    • fastqc , multiqc, trimmomatic, cutadapt ,trim-galore
    • 比对
    • star, hisat2, bowtie2, tophat, bwa, subread
    • 计数
    • htseq, bedtools, deeptools, salmon

    如果你对一个软件不了解的话,那么安装之前在https://bioconda.github.io/recipes.html,检索该软件包是否存在,或者使用conda search packagename进行检索。

    但是我帮你确定好了下面的软件安装代码是可行的!!!

    conda install -y sra-tools
    conda install -y trimmomatic
    conda install -y cutadapt multiqc 
    conda install -y trim-galore
    conda install -y star hisat2 bowtie2
    conda install -y subread tophat htseq bedtools deeptools
    conda install -y salmon
    source deactivate #注销当前的rna环境
    

    转录组流程

    step1: sra2fastq

    下载SRA数据

    新建一个名为SRR_Acc_List.txt的文档,将SRR号码保存在文档内,一个号码占据一行。文件可以在我的GitHub下载获取:https://github.com/jmzeng1314/GEO/blob/master/airway/SRR_Acc_List.txt

    • prefetch下载数据
    wkd=/home/jmzeng/project/airway/ #设置工作目录
    source activate rna
    cat SRR_Acc_List.txt | while read id; do (prefetch  ${id} &);done
    ps -ef | grep prefetch | awk '{print $2}' | while read id; do kill ${id}; done #在内地下载速度很慢,所以我杀掉这些下载进程
    
    • 或者直接使用我已经下载好的sra数据
    mkdir $wkd/raw 
    cd $wkd/raw 
    ls /public/project/RNA/airway/sra/*  | while read id; do ( nohup fastq-dump --gzip --split-3 -O ./ ${id} & ); done ## 批量转换sra到fq格式。
    source deactivate 
    
    • 得到的SRA数据如下
    /public/project/RNA/airway/sra/
    ├── [1.6G]  SRR1039508.sra
    ├── [1.4G]  SRR1039509.sra
    ├── [1.6G]  SRR1039510.sra
    ├── [1.5G]  SRR1039511.sra
    ├── [2.0G]  SRR1039512.sra
    ├── [2.2G]  SRR1039513.sra
    ├── [3.0G]  SRR1039514.sra
    ├── [1.9G]  SRR1039515.sra
    ├── [2.1G]  SRR1039516.sra
    ├── [2.6G]  SRR1039517.sra
    ├── [2.3G]  SRR1039518.sra
    ├── [2.0G]  SRR1039519.sra
    ├── [2.1G]  SRR1039520.sra
    ├── [2.4G]  SRR1039521.sra
    ├── [2.0G]  SRR1039522.sra
    └── [2.2G]  SRR1039523.sra
    
    • sra格式转fastq格式

    格式转还用到的软件是fastq-dump

    for i in $wkd/*sra
    do
            echo $i
            nohup fastq-dump --split-3 --skip-technical --clip --gzip $i & ## 批量转换
    done
    
    • 得到fastq数据如下

    原始数据是双端测序结果,fastq-dump配合--split-3参数,一个样本被拆分成两个fastq文件

    ├── [1.3G]  SRR1039508_1.fastq.gz
    ├── [1.3G]  SRR1039508_2.fastq.gz
    ├── [1.2G]  SRR1039509_1.fastq.gz
    ├── [1.2G]  SRR1039509_2.fastq.gz
    ├── [1.3G]  SRR1039510_1.fastq.gz
    ├── [1.3G]  SRR1039510_2.fastq.gz
    ├── [1.2G]  SRR1039511_1.fastq.gz
    ├── [1.2G]  SRR1039511_2.fastq.gz
    ├── [1.6G]  SRR1039512_1.fastq.gz
    ├── [1.6G]  SRR1039512_2.fastq.gz
    ├── [950M]  SRR1039513_1.fastq.gz
    ├── [952M]  SRR1039513_2.fastq.gz
    ├── [2.4G]  SRR1039514_1.fastq.gz
    ......
    ├── [1.5G]  SRR1039522_1.fastq.gz
    ├── [1.5G]  SRR1039522_2.fastq.gz
    ├── [1.8G]  SRR1039523_1.fastq.gz
    └── [1.8G]  SRR1039523_2.fastq.gz
    

    step2: check quality of sequence reads

    fastqc生成质控报告,multiqc将各个样本的质控报告整合为一个。

    ls *gz | xargs fastqc -t 10
    multiqc ./ 
    
    • 得到结果如下
    ├── [4.0K]  multiqc_data
    │   ├── [2.1M]  multiqc_data.json
    │   ├── [6.8K]  multiqc_fastqc.txt
    │   ├── [2.2K]  multiqc_general_stats.txt
    │   ├── [ 16K]  multiqc.log
    │   └── [3.4K]  multiqc_sources.txt
    ├── [1.5M]  multiqc_report.html
    ├── [236K]  SRR1039508_1_fastqc.html
    ├── [279K]  SRR1039508_1_fastqc.zip
    ├── [238K]  SRR1039508_2_fastqc.html
    ├── [286K]  SRR1039508_2_fastqc.zip
    ├── [236K]  SRR1039510_1_fastqc.html
    ├── [278K]  SRR1039510_1_fastqc.zip
    ├── [241K]  SRR1039510_2_fastqc.html
    ├── [292K]  SRR1039510_2_fastqc.zip
    ......
    ├── [220K]  SRR1039522_fastqc.zip
    ├── [234K]  SRR1039523_1_fastqc.html
    ├── [273K]  SRR1039523_1_fastqc.zip
    ├── [232K]  SRR1039523_2_fastqc.html
    └── [274K]  SRR1039523_2_fastqc.zip
    
    

    每个id_fastqc.html都是一个质量报告,multiqc_report.html是所有样本的整合报告

    step3: filter the bad quality reads and remove adaptors.

    • 运行如下代码,得到名为config的文件,包含两列数据
    mkdir $wkd/clean 
    cd $wkd/clean 
    ls /home/jmzeng/project/airway/raw/*_1.fastq.gz >1
    ls /home/jmzeng/project/airway/raw/*_2.fastq.gz >2
    paste 1 2  > config
    
    • 打开文件 qc.sh ,并且写入如下内容

    trim_galore,用于去除低质量和接头数据

    source activate rna
    bin_trim_galore=trim_galore
    dir='/home/jmzeng/project/airway/clean'
    cat $1 |while read id
    do
            arr=(${id})
            fq1=${arr[0]}
            fq2=${arr[1]} 
    nohup $bin_trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $dir  $fq1 $fq2 & 
    done 
    source deactivate 
    
    • 运行qc.sh
    bash qc.sh config #config是传递进去的参数
    
    • 结果显示如下
    ├── [2.9K]  SRR1039508_1.fastq.gz_trimming_report.txt
    ├── [1.2G]  SRR1039508_1_val_1.fq.gz
    ├── [3.1K]  SRR1039508_2.fastq.gz_trimming_report.txt
    ├── [1.2G]  SRR1039508_2_val_2.fq.gz
    ├── [2.9K]  SRR1039509_1.fastq.gz_trimming_report.txt
    ......
    ├── [2.9K]  SRR1039522_1.fastq.gz_trimming_report.txt
    ├── [1.4G]  SRR1039522_1_val_1.fq.gz
    ├── [3.1K]  SRR1039522_2.fastq.gz_trimming_report.txt
    ├── [1.4G]  SRR1039522_2_val_2.fq.gz
    ├── [2.9K]  SRR1039523_1.fastq.gz_trimming_report.txt
    ├── [1.7G]  SRR1039523_1_val_1.fq.gz
    ├── [3.1K]  SRR1039523_2.fastq.gz_trimming_report.txt
    └── [1.7G]  SRR1039523_2_val_2.fq.gz
    

    step4: alignment

    star, hisat2, bowtie2, tophat, bwa, subread都是可以用于比到的软件

    • 先运行一个样本,测试一下
    mkdir $wkd/test 
    cd $wkd/test 
    source activate rna
    ls $wkd/clean/*gz |while read id;do (zcat ${id}|head -1000>  $(basename ${id} ".gz"));done
    id=SRR1039508
    hisat2 -p 10 -x /public/reference/index/hisat/hg38/genome -1 ${id}_1_val_1.fq   -2 ${id}_2_val_2.fq  -S ${id}.hisat.sam
    subjunc -T 5  -i /public/reference/index/subread/hg38 -r ${id}_1_val_1.fq -R ${id}_2_val_2.fq -o ${id}.subjunc.sam  
    bowtie2 -p 10 -x /public/reference/index/bowtie/hg38  -1 ${id}_1_val_1.fq   -2 ${id}_2_val_2.fq  -S ${id}.bowtie.sam
    bwa mem -t 5 -M  /public/reference/index/bwa/hg38   ${id}_1_val_1.fq   ${id}_2_val_2.fq > ${id}.bwa.sam
    
    • 批量比对代码
    cd $wkd/clean 
    ls *gz|cut -d"_" -f 1 |sort -u |while read id;do
    ls -lh ${id}_1_val_1.fq.gz   ${id}_2_val_2.fq.gz 
    hisat2 -p 10 -x /public/reference/index/hisat/hg38/genome -1 ${id}_1_val_1.fq.gz   -2 ${id}_2_val_2.fq.gz  -S ${id}.hisat.sam
    subjunc -T 5  -i /public/reference/index/subread/hg38 -r ${id}_1_val_1.fq.gz -R ${id}_2_val_2.fq.gz -o ${id}.subjunc.sam
    bowtie2 -p 10 -x /public/reference/index/bowtie/hg38  -1 ${id}_1_val_1.fq.gz   -2 ${id}_2_val_2.fq.gz  -S ${id}.bowtie.sam
    bwa mem -t 5 -M  /public/reference/index/bwa/hg38   ${id}_1_val_1.fq.gz   ${id}_2_val_2.fq.gz > ${id}.bwa.sam
    done 
    
    
    
    • sam文件转bam
    ls *.sam|while read id ;do (samtools sort -O bam -@ 5  -o $(basename ${id} ".sam").bam   ${id});done
    rm *.sam 
    
    • 为bam文件建立索引
    ls *.bam |xargs -i samtools index {}
    
    • reads的比对情况统计
    ls *.bam |xargs -i samtools flagstat -@ 10  {}  >
    ls *.bam |while read id ;do ( nohup samtools flagstat -@ 1 $id >  $(basename ${id} ".bam").flagstat & );done
    source deactivate 
    
    • 最终结果显示如下
    ├── [1.8G]  SRR1039508.bowite2.bam
    ├── [2.9M]  SRR1039508.bowite2.bam.bai
    ├── [ 444]  SRR1039508.bowite2.flagstat
    ├── [ 10G]  SRR1039508.bowite2.sam
    ├── [1.7G]  SRR1039509.bowite2.bam
    ......
    ├── [2.0G]  SRR1039521.bowite2.bam
    ├── [2.9M]  SRR1039521.bowite2.bam.bai
    ├── [ 444]  SRR1039521.bowite2.flagstat
    ├── [ 10G]  SRR1039521.bowite2.sam
    ├── [2.3G]  SRR1039522.bowite2.bam
    ├── [3.0M]  SRR1039522.bowite2.bam.bai
    ├── [ 444]  SRR1039522.bowite2.flagstat
    ├── [ 12G]  SRR1039522.bowite2.sam
    ├── [2.5G]  SRR1039523.bowite2.bam
    ├── [3.0M]  SRR1039523.bowite2.bam.bai
    ├── [ 444]  SRR1039523.bowite2.flagstat
    └── [ 14G]  SRR1039523.bowite2.sam
    

    step5: counts

    mkdir $wkd/align 
    cd $wkd/align 
    source activate rna
    for fn in {508..523}
    do
    featureCounts -T 5 -p -t exon -g gene_id  -a /public/reference/gtf/gencode/gencode.v25.annotation.gtf.gz -o counts.txt SRR1039$fn.bam
    done
    source deactivate 
    
    • 得到的文件如下
          1 # Program:featureCounts v1.6.1; Command:"featureCounts" "-T" "5" "-p" "-t" "exon" "-g" "gene_id" "-a" "/public/reference/gtf/gencode/ge
          2 Geneid  Chr     Start   End     Strand  Length  /home/llwu/RNA/airway/2.align/bowite2/SRR1039523.bowite2.bam
          3 ENSG00000223972.5       chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1    11869;12010;12179;12613;12613;12975;13221;13221;13453   12227;1
          4 ENSG00000227232.5       chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1  14404;15005;15796;16607;16858;17233;17606;17915;18268;2
          5 ENSG00000278267.1       chr1    17369   17436   -       68      9
          6 ENSG00000243485.4       chr1;chr1;chr1;chr1;chr1;chr1   29554;30267;30366;30564;30976;30976     30039;30667;30503;30667;31097;31109
          7 ENSG00000237613.2       chr1;chr1;chr1;chr1;chr1        34554;35245;35277;35721;35721   35174;35481;35481;36073;36081   -;-;-;-;-
          8 ENSG00000268020.3       chr1    52473   53312   +       840     0
          9 ENSG00000240361.1       chr1    62948   63887   +       940     0
         10 ENSG00000186092.4       chr1    69091   70008   +       918     0
    

    step5: DEG

    • 差异分析之前需要首先对转录组上游分析得到的文件 all.id.txt 进行一定程度的检查,代码如:
    rm(list = ls())
    options(stringsAsFactors = F)
    a=read.table('all.id.txt',header = T)
    tmp=a[1:14,1:7]
    meta=a[,1:6]
    exprSet=a[,7:ncol(a)]
    colnames(exprSet)
    a2=exprSet[,'SRR1039516.hisat.bam']
    
    library(airway)
    data(airway)
    exprSet=assay(airway)
    colnames(exprSet)
    a1=exprSet[,'SRR1039516']
    group_list=colData(airway)[,3]
    
    a2=data.frame(id=meta[,1],a2=a2)
    a1=data.frame(id=names(a1),a1=as.numeric(a1))
    library(stringr)
    a2$id <- str_split(a2$id,'\\.',simplify = T)[,1]
    tmp=merge(a1,a2,by='id')
    png('tmp.png')
    plot(tmp[,2:3])
    dev.off()
    
    library(corrplot)
    png('cor.png')
    corrplot(cor(log2(exprSet+1)))
    dev.off()
    
    library(pheatmap)
    png('heatmap.png')
    m=cor(log2(exprSet+1))
    pheatmap(scale(cor(log2(exprSet+1))))
    dev.off()
    
    img

    相关文章

      网友评论

      • 王诗翔:在counts这一步gtf="/public/reference/gtf/gencode/gencode.v25.annotation.gtf.gz" 后featureCounts运行报错导入不了gtf文件,很奇怪。我最后发现不用变量替换直接在选项a后面填入该路径又可以~
      • 谢俊飞:jimmy下次做转录组的视频,可以拿点非模式生物为例子吗?每次都学人啊,小鼠啊,很多操作放在非模式物种上都没法用。我是做昆虫的,在做基因注释什么的,完全不适用

      本文标题:原创10000+生信教程大神给你的RNA实战视频演练

      本文链接:https://www.haomeiwen.com/subject/tuxxiftx.html