简介
Hisat能够将RNA-seq reads 比对到参考基因组上,并且发现转录本的剪接位点,具有运行速度快,占用计算机内存资源少的特点。StringTie将这些比对结果组合成完整的转录本,并估计所有基因和转录本的表达水平(Pertea et al. 2015)。Ballgown从StringTie结果中提取转录本和表达水平,并应用严格的统计方法来确定它们的差异表达。总的来说,上述三个软件联合的RNA-seq 分析流程是比较 完善的,如下图(Pertea et al. 2016)。
Hisat2, Stringtie and Ballgown workflow测试数据
包含12个样本的RNA-seq序列共2.0G,样本分别来自GBR(英国人来自英格兰)和YRI(约鲁巴人来自尼日利亚伊巴丹)。(Pertea et al. 2016)
ftp下载链接: ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
fastp质控
一般fastp 可以在质控后直接生成质控前,质控后数据的变化情况,此处依然沿用fastqc + multiqc + fastp的质控路线。
参数:
-
RawFastQC:Raw reads 的 FastQC 输出路径
-
RequiredCPU:线程数(int)
# fastqc
fastqc -o RawFastQC -t RequiredCPU sampleID_1.fq.gz sampleID_2.fq.gz
# multiqc - summary report
multiqc RawFastQC
# fastp - Quality control
fastp -i sampleID_1.fq.gz -I sampleID_2.fq.gz \
-o CleanReads/sampleID_1.fq.gz \
-O CleanReads/sampleID_2.fq.gz \
-w RequiredCPU \
-j FastpDir/sampleID.json -h FastpDir/sampleID.html
由于每个样本都生成了一个fastp.json 文件,我们可以批量解析质控结果,从而在整体水平上评估质控效果。
#!/usr/bin/python
"""
@ Usage: python fastp_parse.py input_file input_file
@ input_file:the file of the fastp.jsons
@ Wu 2020.12.30
"""
import sys, os
import json
import pandas as pd
input_file = sys.argv[1]
input_file = sys.argv[2]
fastp_list = os.listdir(input_file)
result_dict = {}
merge_result_dict = {}
for i in fastp_list:
if i.endswith('.json'):
with open(input_file + i, 'r') as f:
result_dict[i] = json.load(f)
key = i.replace(".json", "")
key1 = key + '+before_filtering'
key2 = key + '+after_filtering'
merge_result_dict[key1] = {'total_reads' : result_dict[i]["summary"]['before_filtering']['total_reads'],
"total_bases": result_dict[i]["summary"]['before_filtering']['total_bases'],
'q20_rate' : result_dict[i]["summary"]['before_filtering']['q20_rate'],
'q30_rate' : result_dict[i]["summary"]['before_filtering']['q30_rate']
}
merge_result_dict[key2] = {'total_reads' : result_dict[i]["summary"]['after_filtering']['total_reads'],
"total_bases": result_dict[i]["summary"]['after_filtering']['total_bases'],
'q20_rate' : result_dict[i]["summary"]['after_filtering']['q20_rate'],
'q30_rate' : result_dict[i]["summary"]['after_filtering']['q30_rate']
}
df = pd.DataFrame(merge_result_dict).T
df.to_csv(output_file, sep="\t", index= True,header=True)
Hisat Mapping
对于一些模式生物或者常见生物,其Hisat index可能已经构建好了,可以直接去下载即可。
Url page: https://ccb.jhu.edu/software/hisat2/index.shtml, ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/
对于上述链接中不存在Hisat index的情况,则需要我们自己进行构建;值得注意的是,不同于bwa index,Hisat index 可以为参考基因组提供注释信息以便获得更高的效率和更好的跨外显子边界和剪接位点 mapping 。
Data:1. reference genome(reference.fa);2. clean RNAseq reads 3. ref_annotation.gff
# extract splice-site and exon information
hisat2/bin/extract_splice_sites.py ref_annotation.gff >ref.ss
hisat2/bin/extract_exons.py ref_annotation.gff >ref.exon
# building the index
hisat2-build -p 8 --ss ref.ss --exon ref.exon reference.fa index\rna_ref_index(out_file_name)
# mapping the NGS reads
hisat2 -p 8 --dta -x index/rna_ref_index -1 CleanReads/sampleID_1.fq.gz \
-2 CleanReads/sampleID_2.fq.gz -S sampleID.sam
# sortting BAM
samtools view -S -b sampleID.sam | samtools sort -@ 10 -o sampleID_sorted.bam
Mapping quality
对于mapping quality,可以通过IGV和samtools tview 进行可视化查看,亦可以通过samtools flagstat 获得,并提取所有样本比对信息作图比较。
# samtools flagstat
samtools flagstat sampleID_sorted.bam > flagstat/sampleID_sorted.bam
Transcriptome Assembly
在这里有两种模式可以选择,一种是 ‘reference only’ 模式,该模式运行速度更快、更简单,而且它会避免数据受到其他库的污染,但可能无法发现新的转录本;另一种是预测模式,在该模式下转录本将跨过不同的样本数据去进行比较,可能会找到新的转录本,但在不了解自己样本数据的前提下最好不用;
参考:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
# reference only
stringtie --rf -p 8 -G ref_annotation.gff -e -B \
-o ballgown/sampleID/sampleID.gtf \
-A ballgown/sampleID/sampleID_gene_abundances.tsv \
sampleID_sorted.bam
表达矩阵:t_data.ctab,gene_abundances.tsv
ballgown
有表达矩阵数据后,除了可以分别读入、合并和分析外,亦可以用ballgow 包进行全局性的数据管理和分析。
library(ballgown)
library(tidyverse)
# load the sample information
pheno_data <- read.table("RNAseq/phenodata.txt", header = TRUE, sep ='\t')
# create a ballgown object
ballgown_obj <- ballgown(dataDir = "RNAseq/ballgown",
# regular expression identifying the sample files
# "" means all FILES
samplePattern = "",
pData = pheno_data)
# getting the gene, transcript, exon, and intron expression levels
# by using gexpr(), texpr(), eexpr(), and iexpr()
gene_data <- gexpr(ballgown_obj)
# Load all attributes including gene name
trans_data <- texpr(ballgown_obj, 'all')
Reference
Pertea, Mihaela, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from Rna-Seq Reads.” Nature Biotechnology 33 (3): 290–95.
Pertea, Mihaela, Daehwan Kim, Geo M Pertea, Jeffrey T Leek, and Steven L Salzberg. 2016. “Transcript-Level Expression Analysis of Rna-Seq Experiments with Hisat, Stringtie and Ballgown.” Nature Protocols 11 (9): 1650.
Stringtie:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
ballgown:https://www.bioconductor.org/packages/release/bioc/html/ballgown.html
网友评论