生信技能树线上课程:RNA-seq

作者: 泥人吴 | 来源:发表于2019-06-05 00:12 被阅读12次

    写在前面

    • 很荣幸参加生信技能树全国巡讲-第一站-重庆站的RNA-seq线下培训课程。
    • 以下为线下课结合线上课所做的笔记,希望能帮到你。

    一些小练习题

    R语言的练习题
    初级10 个题目,尽量根据参考代码理解及完成:http://www.bio-info-trainee.com/3793.html  
    中级要求是:http://www.bio-info-trainee.com/3750.html
    高级要求是完成20题: http://www.bio-info-trainee.com/3415.html
    
    LINUX的练习题:
    最低要求是完成我的 linux 20题 http://www.bio-info-trainee.com/2900.html   
    其次完成生物信息学数据格式的习题(blast/blat/fa-fq/sam-bam/vcf/bed/gtf-gff),收集这些格式的说明书。
    fasta和fastq格式文件的shell小练习 http://www.bio-info-trainee.com/3575.html 
    sam和bam格式文件的shell小练习 http://www.bio-info-trainee.com/3578.html
    VCF格式文件的shell小练习 http://www.bio-info-trainee.com/3577.html 
    
    RNA-seq只有小考核 http://www.bio-info-trainee.com/3920.html
    


    下面开始RNA-seq流程

    1.软件下载

    1.1下载Miniconda3

    # 下载Miniconda3
    $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    ## 选择miniconda2/3都可以,可以通过创建另一个版本Python环境。
    
    # 安装Miniconda3
    $ bash Miniconda3-latest-Linux-x86_64.sh 
    $ source ~/.bashrc
    (base) vip39 13:30:42 ~ #表示激活miniconda
    
    # 配置镜像,清华镜像挂掉后悬着这个吧
    conda config --add channels bioconda #生信软件
    conda config --add channels conda-forge
    # 查看一下配置的镜像
    $ vim ~/.condarc
    channels:
      - conda-forge
      - bioconda
      - defaults
    show_channel_urls: true
    
    # 创建命名为rna的python2的环境
    $ conda create -n rna python=2 
    Collecting package metadata: /
    
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    
    # 激活/进入conda的rna环境,避免每次用-n rna
    $ source activate rna
    (rna) vip39 13:47:00 ~
    

    1.2安装软件

    jimmy老师给出的安装列表
    • 数据资源下载,参考基因组及参考转录组
      gtf ,genome,fa
    • 质控,需要fastqc及multiqc
      trimmomatic,cutadapt,trim-galore
    • 对比
      star,HISAT2,TOPHAT2, bowtie,bwa,subread
    • 计数
      featureCounts,htseq-counts,bedtools
    • nomalization 归一化,差异分析等
      DEseq2,edgeR,limma
    $ conda install -y fastqc trim-galore star bwa hisat2 bowtie2 subread tophat htseq bedtools deeptools salmon cutadapt multiqc sra-tools
    $ conda install -c bioconda trimmomatic 
    

    2.下载数据

    #得到_List.txt后,我们直接用循环下载数据
    $ cat SRR_Acc_List.txt | while read id; do (prefetch ${id});done
    ##上面程序挂为后台运行的方式为:
    cat SRR_Acc_List.txt | while read id; do (prefetch $id &);done
    ## 杀死上面这个循坏
    ps -ef |grep prefecth awk '{print $2}'|while read id ;do kill $ id ;done
    
    # 下载得到数据,这里我们下载得到4个数据
    $ tree
    .
    ├── SRR1039508.sra
    ├── SRR1039509.sra
    ├── SRR1039510.sra
    └── SRR1039514.sra
    
    # 3. SRA数据转成fastq.gz格式
    ## 使用循环来实现
    $ mkdir 1.fastq_qc
    $ cd 1.fastq_qc
    $ ls ~/ncbi/public/sra/*|while read id ;do (nohup fastq-dump --gzip --split-3 -O ./ $id &);done
    ## 你也可以手动实现转换
    fastq-dump --gzip --split-3 -O ~/ ./SRR1039508.sra
    

    3.质控

    3.1了解数据大小、质量

    $ ls *gz |xargs fastqc -t 10 
    ## multiqc整合结果
    $ multiqc ./
    ...
    [INFO   ]          fastqc : Found 8 reports #找到8个,对应上我上面的4个
    [INFO   ]         multiqc : Report      : multiqc_report.html #将这个结果报告下载下来查看。
    [INFO   ]         multiqc : MultiQC complete
    
    • 查看multiqc_report.html


      multiqc_report.html

    3.2过滤数据

    • 去除接头
    • 去除两端低质量碱基(-q 25)
    • 最大允许错误率(默认-e 0.1)
    • 去除<36的reads(--length 36)
    • 切除index的overlap>3的碱基
    • reads去除以对为单位(--paired)
    ##过滤数据
    ###step3:filter the bad quality reads and remove adaptors。
    $ cd 1.fastq_qc/
    $ ls ~/1.fastq_qc/*_1.fastq.gz
    /home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz
    $ ls ~/1.fastq_qc/*_1.fastq.gz >1
    $ ls ~/1.fastq_qc/*_2.fastq.gz >2
    $ paste 1 2
    $ paste 1 2 >config
    $ cat config 
    /home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039508_2.fastq.gz
    /home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039509_2.fastq.gz
    /home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039510_2.fastq.gz
    /home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039514_2.fastq.gz
    ## 建立好config后,接下来就可以进行批量质控了
    ## 建立批量处理脚本
    $ cat >qc.sh
    source activate rna
    bin_trim_galore=trim_galore
    mkdir filter-data
    dir='/home/vip51/filter-data'
    cat config |while read id
    do
        arr=($id)
        fq1=${arr[0]}
        fq2=${arr[1]}
    nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir $fq1 $fq2 &
    done
    source deactivate
    ## 拿到config和qc.sh后
    $ bash qc.sh config
    

    4比对-HISAT2 mapping

    4.1参考基因组hg38及构建索引

    #hisat软件建立索引文件
    cd ~/reference
    mkdir -p index/hisat && cd index/hisat
    nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz  &
    nohup tar zxvf grcm38.tar.gz &
    ## 解压后
    (rna) vip51 00:19:08 ~/reference/index/hisat/hg38
    $ ls 
    genome.1.ht2  genome.2.ht2  genome.3.ht2  genome.4.ht2  genome.5.ht2  genome.6.ht2  genome.7.ht2  genome.8.ht2  make_hg38.sh
    # 这里表示构建索引成功,简单的说,就是去官网下载index,然后解压即可成功
    
    • 其实构建索引的时候,我花了很多时间,就在于~/reference/index/hisat/hg38/genome这个的理解不到位,所以跑去Google-构建索引失败才发现问题.
    hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log
    

    4.2 比对

    • bam查看方法
    samtools view -h smaple.bam
    samtools view smaple.bam
    
    ## 将文件名为*gq.gz改为*fq,head-1000取的前1000行,这一步非必要的
    $ ls ~/filter-data/*gz|while read id ;do (zcat $id|head -1000> $(basename $id ".gz"));done
    $ (rna) vip51 20:30:12 ~/filter-data
    $ ls -lh
    total 12G
    -rw-rw-r-- 1 vip51 vip51 2.9K Jun  2 15:29 SRR1039508_1.fastq.gz_trimming_report.txt
    -rw-rw-r-- 1 vip51 vip51  64K Jun  6 16:53 SRR1039508_1_val_1.fq #取的前1000行,所以数据很小。
    -rw-rw-r-- 1 vip51 vip51 1.3G Jun  2 15:44 SRR1039508_1_val_1.fq.gz
    ## hisat2 比对
    id=SRR1039508
    hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log
    
    ## hisat批量比对
    $ ls *gz|cut -d "_" -f 1|sort -u
    SRR1039508
    SRR1039509
    SRR1039510
    SRR1039514
    ## 建立一个hisat.sh
    
    $ cat > hisat.sh
    cd filter-data/
    ls *gz|cut -d "_" -f 1|sort -u |while read id ;do
    cd ~/4.mapping/
    ls -lh ${id}_1_val_1.fq.gz  ${id}_2_val_2.fq.gz
    hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq.gz -2 ~/filter-data/${id}_2_val_2.fq.gz -S ${id}.hisat.sam 
    done
    ###批量对比
    $ source hisat.sh
    (rna) vip51 23:15:04 ~/4.mapping
    $ ls -lh
    total 56G
    -rw-rw-r-- 1 vip51 vip51 13G Jun  7 22:44 SRR1039508.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 12G Jun  7 22:51 SRR1039509.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 13G Jun  7 22:58 SRR1039510.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 21G Jun  7 23:08 SRR1039514.hisat.sam
    ##上面看出.sam文件还是很大的
    ## 如果跑的是这个
    # hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 
    ## 上面对对应的就是head -1000,数据会很小
    
    ###.sam转化为.bam
    $ ls *.sam |while read id;do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done
    (rna) vip51 23:34:31 ~/4.mapping
    $ ls -lh
    total 66G
    -rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:21 SRR1039508.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:44 SRR1039508.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 1.9G Jun  7 23:24 SRR1039509.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  12G Jun  7 22:51 SRR1039509.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:28 SRR1039510.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:58 SRR1039510.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 3.7G Jun  7 23:34 SRR1039514.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  21G Jun  7 23:08 SRR1039514.hisat.sam
    # 可以看出.bam文件要小得多
    

    4.3 比对策略统计比对效率

    ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat );done
    
    (rna) vip51 23:42:28 ~/4.mapping
    $ ls -lh
    total 66G
    -rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:21 SRR1039508.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  448 Jun  7 23:38 SRR1039508.hisat.flagstat
    -rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:44 SRR1039508.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 1.9G Jun  7 23:24 SRR1039509.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  448 Jun  7 23:38 SRR1039509.hisat.flagstat
    -rw-rw-r-- 1 vip51 vip51  12G Jun  7 22:51 SRR1039509.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:28 SRR1039510.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  449 Jun  7 23:39 SRR1039510.hisat.flagstat
    -rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:58 SRR1039510.hisat.sam
    -rw-rw-r-- 1 vip51 vip51 3.7G Jun  7 23:34 SRR1039514.hisat.bam
    -rw-rw-r-- 1 vip51 vip51  449 Jun  7 23:39 SRR1039514.hisat.flagstat
    -rw-rw-r-- 1 vip51 vip51  21G Jun  7 23:08 SRR1039514.hisat.sam
    
    $ ls *.bam|xargs -i samtools index {} #构建索引,拿到IGV里面看
    ### 
    (rna) vip51 23:43:56 ~/4.mapping
    $ mkdir stat
    $ mv *flagstat stat/
    $ cd stat/
    $ ls
    SRR1039508.hisat.flagstat  SRR1039509.hisat.flagstat  SRR1039510.hisat.flagstat  SRR1039514.hisat.flagstat
    
    (rna) vip51 00:24:55 ~/4.mapping/stat
    $ wc -l *
      13 SRR1039508.hisat.flagstat
      13 SRR1039509.hisat.flagstat
      13 SRR1039510.hisat.flagstat
      13 SRR1039514.hisat.flagstat
      52 total
    (rna) vip51 00:25:41 ~/4.mapping/stat
    $ cat *|awk '{print $1}' |paste - - - - - - - - - - - - - 
    50094593    5350127 0   0   49304507    44744466    22372233    22372233    43122616    43495718    458662  121216  89837
    45068638    4229594 0   0   44198454    40839044    20419522    20419522    39169816    39531396    437464  121706  92949
    49310701    5020389 0   0   48454263    44290312    22145156    22145156    42601074    42978162    455712  143714  111364
    82290684    6889352 0   0   81370594    75401332    37700666    37700666    73172032    73858618    622624  395008  343293
    $ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
    $ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
    in total (QC-passed reads 
    secondary
    supplementary
    duplicates
    mapped (98.42% : N/A)
    paired in sequencing
    read1
    read2
    properly paired (96.38% : N/A)
    with itself and mate mapped
    singletons (1.03% : N/A)
    with mate mapped to a different chr
    with mate mapped to a different chr (mapQ>=5)
    # 如下表
    
    hisat比对结果
    • multiqc ./hisat.flagstat*
    ## 上面的SRR1039508.hisat.flagstat  SRR1039509.hisat.flagstat  SRR1039510.hisat.flagstat  SRR1039514.hisat.flagstat 也可以如下multiqc生成网页版报告
    $ multiqc ./
    
    (rna) vip51 00:43:27 ~/4.mapping/stat
    $ ls 
    multiqc_data multiqc_report.html
    
    • 查看multiqc_report.html


      multiqc_report.html

    5.差异分析-featureCounts

    5.1featureCounts 需要两个输入文件:

    • 比对产生的BAM/ SAM文件(上面我们已经的到)
    $ ls *bam
    SRR1039508.hisat.bam  SRR1039509.hisat.bam  SRR1039510.hisat.bam  SRR1039514.hisat.bam
    
    • 区间注释文件,这个文件需要自己创建
    $ cd ~/reference/gtf/
    ## 下载最新的
    $ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz
    
    • 去Gencode网站下载,这里我们下载最新的,下载GTF文件。
      Release 30
    • 解压缩,得到Load annotation file gencode.v30.annotation.gtf
    $ gunzip gencode.v30.annotation.gtf.gz
    ## 得到
    $ ls
    gencode.v30.annotation.gtf  human.gene.positions  nohup.out
    

    6.2 featureounts

    $ cd ~/6.featureCounts/
    $ featureCounts -T 5 -p -t exon -g gene_id -a ~/reference/gtf/gencode.v30.annotation.gtf -o ~/all.id.txt ~/4.mapping/*.bam
    ## 得到如下:
    
            ==========     _____ _    _ ____  _____  ______          _____  
            =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
              =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
                ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
                  ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
            ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
          v1.6.4
    
    //========================== featureCounts setting ===========================\\
    ||                                                                            ||
    ||             Input files : 4 BAM files                                      ||
    ||                           P SRR1039508.hisat.bam                           ||
    ||                           P SRR1039509.hisat.bam                           ||
    ||                           P SRR1039510.hisat.bam                           ||
    ||                           P SRR1039514.hisat.bam                           ||
    ||                                                                            ||
    ||             Output file : all.id.txt                                       ||
    ||                 Summary : all.id.txt.summary                               ||
    ||              Annotation : gencode.v30.annotation.gtf (GTF)                 ||
    ||      Dir for temp files : /home/vip51                                      ||
    ||                                                                            ||
    ||                 Threads : 5                                                ||
    ||                   Level : meta-feature level                               ||
    ||              Paired-end : yes                                              ||
    ||      Multimapping reads : not counted                                      ||
    || Multi-overlapping reads : not counted                                      ||
    ||   Min overlapping bases : 1                                                ||
    ||                                                                            ||
    ||          Chimeric reads : counted                                          ||
    ||        Both ends mapped : not required                                     ||
    ||                                                                            ||
    \\============================================================================//
    
    //================================= Running ==================================\\
    ||                                                                            ||
    || Load annotation file gencode.v30.annotation.gtf ...                        ||
    ||    Features : 1279686                                                      ||
    ||    Meta-features : 58870                                                   ||
    ||    Chromosomes/contigs : 25                                                ||
    ||                                                                            ||
    || Process BAM file SRR1039508.hisat.bam...                                   ||
    ||    Paired-end reads are included.                                          ||
    ||    Assign alignments (paired-end) to features...                           ||
    ||                                                                            ||
    ||    WARNING: reads from the same pair were found not adjacent to each       ||
    ||             other in the input (due to read sorting by location or         ||
    ||             reporting of multi-mapping read pairs).                        ||
    ||                                                                            ||
    ||    Pairing up the read pairs.                                              ||
    ||                                                                            ||
    ||    Total alignments : 25134856                                             ||
    ||    Successfully assigned alignments : 18588891 (74.0%)                     ||
    ||    Running time : 0.41 minutes                                             ||
    ||                                                                            ||
    || Process BAM file SRR1039509.hisat.bam...                                   ||
    ||    Paired-end reads are included.                                          ||
    ||    Assign alignments (paired-end) to features...                           ||
    ||                                                                            ||
    ||    WARNING: reads from the same pair were found not adjacent to each       ||
    ||             other in the input (due to read sorting by location or         ||
    ||             reporting of multi-mapping read pairs).                        ||
    ||                                                                            ||
    ||    Pairing up the read pairs.                                              ||
    ||                                                                            ||
    ||    Total alignments : 22612196                                             ||
    ||    Successfully assigned alignments : 17040035 (75.4%)                     ||
    ||    Running time : 0.36 minutes                                             ||
    ||                                                                            ||
    || Process BAM file SRR1039510.hisat.bam...                                   ||
    ||    Paired-end reads are included.                                          ||
    ||    Assign alignments (paired-end) to features...                           ||
    ||                                                                            ||
    ||    WARNING: reads from the same pair were found not adjacent to each       ||
    ||             other in the input (due to read sorting by location or         ||
    ||             reporting of multi-mapping read pairs).                        ||
    ||                                                                            ||
    ||    Pairing up the read pairs.                                              ||
    ||                                                                            ||
    ||    Total alignments : 24741721                                             ||
    ||    Successfully assigned alignments : 18349007 (74.2%)                     ||
    ||    Running time : 0.32 minutes                                             ||
    ||                                                                            ||
    || Process BAM file SRR1039514.hisat.bam...                                   ||
    ||    Paired-end reads are included.                                          ||
    ||    Assign alignments (paired-end) to features...                           ||
    ||                                                                            ||
    ||    WARNING: reads from the same pair were found not adjacent to each       ||
    ||             other in the input (due to read sorting by location or         ||
    ||             reporting of multi-mapping read pairs).                        ||
    ||                                                                            ||
    ||    Pairing up the read pairs.                                              ||
    ||                                                                            ||
    ||    Total alignments : 41252925                                             ||
    ||    Successfully assigned alignments : 31979144 (77.5%)                     ||
    ||    Running time : 0.54 minutes                                             ||
    ||                                                                            ||
    ||                                                                            ||
    || Summary of counting results can be found in file "/home/vip51/all.id.txt.  ||
    || summary"                                                                   ||
    ||                                                                            ||
    \\============================================================================//
    
    
    • 得到下面两个结果
    # all.id.txt         all.id.txt.summary
    ## 我们对all.id.txt.summary 进行multiqc一下,看看Counts情况
    $ multiqc all.id.txt.summary 
    ...
    [INFO   ]         multiqc : Report      : multiqc_report_1.html
    [INFO   ]         multiqc : Data        : multiqc_data_1
    [INFO   ]         multiqc : MultiQC complete
    
    • 查看上面得的到的multiqc_report_1.html


      General Statistics
      featureCounts


    拿到all.id.txt,就可以到R中进行下游分析了

    相关文章

      网友评论

        本文标题:生信技能树线上课程:RNA-seq

        本文链接:https://www.haomeiwen.com/subject/axaeuqtx.html