生信技能树线上课程：RNA-seq

作者: 泥人吴 | 来源:发表于2019-06-05 00:12 被阅读12次

生信技能树线上课程：RNA-seq
StatQuest机器学习
2019-04-19
[生信技能树线下]Linux课堂回顾
RNA-seq小考核----生信技能树
RNA-seq，生信技能树
生信技能树，RNA-seq
数据挖掘班开年寄语
fastq处理笔记
Linux笔记

写在前面

很荣幸参加生信技能树全国巡讲-第一站-重庆站的RNA-seq线下培训课程。

以下为线下课结合线上课所做的笔记，希望能帮到你。

一些小练习题

R语言的练习题
初级10 个题目，尽量根据参考代码理解及完成：http://www.bio-info-trainee.com/3793.html  
中级要求是：http://www.bio-info-trainee.com/3750.html
高级要求是完成20题： http://www.bio-info-trainee.com/3415.html

LINUX的练习题：
最低要求是完成我的 linux 20题 http://www.bio-info-trainee.com/2900.html   
其次完成生物信息学数据格式的习题(blast/blat/fa-fq/sam-bam/vcf/bed/gtf-gff)，收集这些格式的说明书。
fasta和fastq格式文件的shell小练习 http://www.bio-info-trainee.com/3575.html 
sam和bam格式文件的shell小练习 http://www.bio-info-trainee.com/3578.html
VCF格式文件的shell小练习 http://www.bio-info-trainee.com/3577.html 

RNA-seq只有小考核 http://www.bio-info-trainee.com/3920.html

下面开始RNA-seq流程

1.软件下载

1.1下载Miniconda3

清华镜像关闭后，只有选择其它下载：
[官网]：https://docs.conda.io/en/latest/miniconda.html
参考教程: conda的安装与使用

# 下载Miniconda3
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
## 选择miniconda2/3都可以，可以通过创建另一个版本Python环境。

# 安装Miniconda3
$ bash Miniconda3-latest-Linux-x86_64.sh 
$ source ~/.bashrc
(base) vip39 13:30:42 ~ #表示激活miniconda

# 配置镜像，清华镜像挂掉后悬着这个吧
conda config --add channels bioconda #生信软件
conda config --add channels conda-forge
# 查看一下配置的镜像
$ vim ~/.condarc
channels:
  - conda-forge
  - bioconda
  - defaults
show_channel_urls: true

# 创建命名为rna的python2的环境
$ conda create -n rna python=2 
Collecting package metadata: /

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

# 激活/进入conda的rna环境，避免每次用-n rna
$ source activate rna
(rna) vip39 13:47:00 ~

1.2安装软件

jimmy老师给出的安装列表

数据资源下载，参考基因组及参考转录组
gtf ，genome，fa
质控，需要fastqc及multiqc
trimmomatic，cutadapt，trim-galore
对比
star,HISAT2,TOPHAT2, bowtie,bwa,subread
计数
featureCounts，htseq-counts,bedtools
nomalization 归一化，差异分析等
DEseq2，edgeR，limma

$ conda install -y fastqc trim-galore star bwa hisat2 bowtie2 subread tophat htseq bedtools deeptools salmon cutadapt multiqc sra-tools
$ conda install -c bioconda trimmomatic

清华镜像挂掉后，就多用google，或者https://anaconda.org/bioconda/

2.下载数据

找到SRA数据 [可修改SRP号]
找到SRA数据集，找到_List.txt

#得到_List.txt后，我们直接用循环下载数据
$ cat SRR_Acc_List.txt | while read id; do (prefetch ${id});done
##上面程序挂为后台运行的方式为：
cat SRR_Acc_List.txt | while read id; do (prefetch $id &);done
## 杀死上面这个循坏
ps -ef |grep prefecth awk '{print $2}'|while read id ;do kill $ id ;done

# 下载得到数据，这里我们下载得到4个数据
$ tree
.
├── SRR1039508.sra
├── SRR1039509.sra
├── SRR1039510.sra
└── SRR1039514.sra

# 3. SRA数据转成fastq.gz格式
## 使用循环来实现
$ mkdir 1.fastq_qc
$ cd 1.fastq_qc
$ ls ~/ncbi/public/sra/*|while read id ;do (nohup fastq-dump --gzip --split-3 -O ./ $id &);done
## 你也可以手动实现转换
fastq-dump --gzip --split-3 -O ~/ ./SRR1039508.sra

3.质控

3.1了解数据大小、质量

$ ls *gz |xargs fastqc -t 10 
## multiqc整合结果
$ multiqc ./
...
[INFO   ]          fastqc : Found 8 reports #找到8个，对应上我上面的4个
[INFO   ]         multiqc : Report      : multiqc_report.html #将这个结果报告下载下来查看。
[INFO   ]         multiqc : MultiQC complete

查看multiqc_report.html

multiqc_report.html

3.2过滤数据

去除接头
去除两端低质量碱基(-q 25)
最大允许错误率(默认-e 0.1)
去除<36的reads(--length 36)
切除index的overlap>3的碱基
reads去除以对为单位(--paired)

##过滤数据
###step3:filter the bad quality reads and remove adaptors。
$ cd 1.fastq_qc/
$ ls ~/1.fastq_qc/*_1.fastq.gz
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz
$ ls ~/1.fastq_qc/*_1.fastq.gz >1
$ ls ~/1.fastq_qc/*_2.fastq.gz >2
$ paste 1 2
$ paste 1 2 >config
$ cat config 
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039508_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039509_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039510_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039514_2.fastq.gz
## 建立好config后，接下来就可以进行批量质控了
## 建立批量处理脚本
$ cat >qc.sh
source activate rna
bin_trim_galore=trim_galore
mkdir filter-data
dir='/home/vip51/filter-data'
cat config |while read id
do
    arr=($id)
    fq1=${arr[0]}
    fq2=${arr[1]}
nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir $fq1 $fq2 &
done
source deactivate
## 拿到config和qc.sh后
$ bash qc.sh config

4比对-HISAT2 mapping

4.1参考基因组hg38及构建索引

#hisat软件建立索引文件
cd ~/reference
mkdir -p index/hisat && cd index/hisat
nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz  &
nohup tar zxvf grcm38.tar.gz &
## 解压后
(rna) vip51 00:19:08 ~/reference/index/hisat/hg38
$ ls 
genome.1.ht2  genome.2.ht2  genome.3.ht2  genome.4.ht2  genome.5.ht2  genome.6.ht2  genome.7.ht2  genome.8.ht2  make_hg38.sh
# 这里表示构建索引成功，简单的说，就是去官网下载index，然后解压即可成功

其实构建索引的时候，我花了很多时间，就在于~/reference/index/hisat/hg38/genome这个的理解不到位，所以跑去Google-构建索引失败才发现问题.

hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log

4.2 比对

bam查看方法

samtools view -h smaple.bam
samtools view smaple.bam

## 将文件名为*gq.gz改为*fq,head-1000取的前1000行，这一步非必要的
$ ls ~/filter-data/*gz|while read id ;do (zcat $id|head -1000> $(basename $id ".gz"));done
$ (rna) vip51 20:30:12 ~/filter-data
$ ls -lh
total 12G
-rw-rw-r-- 1 vip51 vip51 2.9K Jun  2 15:29 SRR1039508_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 vip51 vip51  64K Jun  6 16:53 SRR1039508_1_val_1.fq #取的前1000行，所以数据很小。
-rw-rw-r-- 1 vip51 vip51 1.3G Jun  2 15:44 SRR1039508_1_val_1.fq.gz
## hisat2 比对
id=SRR1039508
hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log

## hisat批量比对
$ ls *gz|cut -d "_" -f 1|sort -u
SRR1039508
SRR1039509
SRR1039510
SRR1039514
## 建立一个hisat.sh

$ cat > hisat.sh
cd filter-data/
ls *gz|cut -d "_" -f 1|sort -u |while read id ;do
cd ~/4.mapping/
ls -lh ${id}_1_val_1.fq.gz  ${id}_2_val_2.fq.gz
hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq.gz -2 ~/filter-data/${id}_2_val_2.fq.gz -S ${id}.hisat.sam 
done
###批量对比
$ source hisat.sh
(rna) vip51 23:15:04 ~/4.mapping
$ ls -lh
total 56G
-rw-rw-r-- 1 vip51 vip51 13G Jun  7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 12G Jun  7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 13G Jun  7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 21G Jun  7 23:08 SRR1039514.hisat.sam
##上面看出.sam文件还是很大的
## 如果跑的是这个
# hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 
## 上面对对应的就是head -1000，数据会很小

###.sam转化为.bam
$ ls *.sam |while read id;do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done
(rna) vip51 23:34:31 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun  7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51  12G Jun  7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun  7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51  21G Jun  7 23:08 SRR1039514.hisat.sam
# 可以看出.bam文件要小得多

4.3 比对策略统计比对效率

ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat );done

(rna) vip51 23:42:28 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51  448 Jun  7 23:38 SRR1039508.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun  7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51  448 Jun  7 23:38 SRR1039509.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  12G Jun  7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51  449 Jun  7 23:39 SRR1039510.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun  7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51  449 Jun  7 23:39 SRR1039514.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  21G Jun  7 23:08 SRR1039514.hisat.sam

$ ls *.bam|xargs -i samtools index {} #构建索引，拿到IGV里面看
### 
(rna) vip51 23:43:56 ~/4.mapping
$ mkdir stat
$ mv *flagstat stat/
$ cd stat/
$ ls
SRR1039508.hisat.flagstat  SRR1039509.hisat.flagstat  SRR1039510.hisat.flagstat  SRR1039514.hisat.flagstat

(rna) vip51 00:24:55 ~/4.mapping/stat
$ wc -l *
  13 SRR1039508.hisat.flagstat
  13 SRR1039509.hisat.flagstat
  13 SRR1039510.hisat.flagstat
  13 SRR1039514.hisat.flagstat
  52 total
(rna) vip51 00:25:41 ~/4.mapping/stat
$ cat *|awk '{print $1}' |paste - - - - - - - - - - - - - 
50094593    5350127 0   0   49304507    44744466    22372233    22372233    43122616    43495718    458662  121216  89837
45068638    4229594 0   0   44198454    40839044    20419522    20419522    39169816    39531396    437464  121706  92949
49310701    5020389 0   0   48454263    44290312    22145156    22145156    42601074    42978162    455712  143714  111364
82290684    6889352 0   0   81370594    75401332    37700666    37700666    73172032    73858618    622624  395008  343293
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
in total (QC-passed reads 
secondary
supplementary
duplicates
mapped (98.42% : N/A)
paired in sequencing
read1
read2
properly paired (96.38% : N/A)
with itself and mate mapped
singletons (1.03% : N/A)
with mate mapped to a different chr
with mate mapped to a different chr (mapQ>=5)
# 如下表

hisat比对结果

multiqc ./hisat.flagstat*

## 上面的SRR1039508.hisat.flagstat  SRR1039509.hisat.flagstat  SRR1039510.hisat.flagstat  SRR1039514.hisat.flagstat 也可以如下multiqc生成网页版报告
$ multiqc ./

(rna) vip51 00:43:27 ~/4.mapping/stat
$ ls 
multiqc_data multiqc_report.html

查看multiqc_report.html

multiqc_report.html

5.差异分析-featureCounts

5.1featureCounts 需要两个输入文件：

比对产生的BAM/ SAM文件（上面我们已经的到)

$ ls *bam
SRR1039508.hisat.bam  SRR1039509.hisat.bam  SRR1039510.hisat.bam  SRR1039514.hisat.bam

区间注释文件，这个文件需要自己创建

$ cd ~/reference/gtf/
## 下载最新的
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz

去Gencode网站下载,这里我们下载最新的，下载GTF文件。
Release 30
解压缩,得到Load annotation file gencode.v30.annotation.gtf

$ gunzip gencode.v30.annotation.gtf.gz
## 得到
$ ls
gencode.v30.annotation.gtf  human.gene.positions  nohup.out

6.2 featureounts

$ cd ~/6.featureCounts/
$ featureCounts -T 5 -p -t exon -g gene_id -a ~/reference/gtf/gencode.v30.annotation.gtf -o ~/all.id.txt ~/4.mapping/*.bam
## 得到如下：

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 4 BAM files                                      ||
||                           P SRR1039508.hisat.bam                           ||
||                           P SRR1039509.hisat.bam                           ||
||                           P SRR1039510.hisat.bam                           ||
||                           P SRR1039514.hisat.bam                           ||
||                                                                            ||
||             Output file : all.id.txt                                       ||
||                 Summary : all.id.txt.summary                               ||
||              Annotation : gencode.v30.annotation.gtf (GTF)                 ||
||      Dir for temp files : /home/vip51                                      ||
||                                                                            ||
||                 Threads : 5                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file gencode.v30.annotation.gtf ...                        ||
||    Features : 1279686                                                      ||
||    Meta-features : 58870                                                   ||
||    Chromosomes/contigs : 25                                                ||
||                                                                            ||
|| Process BAM file SRR1039508.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 25134856                                             ||
||    Successfully assigned alignments : 18588891 (74.0%)                     ||
||    Running time : 0.41 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR1039509.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 22612196                                             ||
||    Successfully assigned alignments : 17040035 (75.4%)                     ||
||    Running time : 0.36 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR1039510.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 24741721                                             ||
||    Successfully assigned alignments : 18349007 (74.2%)                     ||
||    Running time : 0.32 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR1039514.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 41252925                                             ||
||    Successfully assigned alignments : 31979144 (77.5%)                     ||
||    Running time : 0.54 minutes                                             ||
||                                                                            ||
||                                                                            ||
|| Summary of counting results can be found in file "/home/vip51/all.id.txt.  ||
|| summary"                                                                   ||
||                                                                            ||
\\============================================================================//

得到下面两个结果

# all.id.txt         all.id.txt.summary
## 我们对all.id.txt.summary 进行multiqc一下，看看Counts情况
$ multiqc all.id.txt.summary 
...
[INFO   ]         multiqc : Report      : multiqc_report_1.html
[INFO   ]         multiqc : Data        : multiqc_data_1
[INFO   ]         multiqc : MultiQC complete