2. Hisat2, Stringtie and Ballgow

作者: 吴十三和小可爱的札记 | 来源:发表于2021-04-02 19:24 被阅读0次

2. Hisat2, Stringtie and Ballgow
实现Stingtie与DEseq2的无缝连接
转录组分析所需软件
ERROR: dependency ‘XML’ is not a
HISAT2,StringTie，Ballgown处理转录组数据
基于转录组预测基因预测
RNA-seq用hisat2、stringtie、DESeq2分
基因结构注释（3）：转录组预测
转录组分析（三）Hisat2+StringTie+Ballgow
hisat2、stringtie、ballgown分析RNA-s

简介

Hisat能够将RNA-seq reads 比对到参考基因组上，并且发现转录本的剪接位点，具有运行速度快，占用计算机内存资源少的特点。StringTie将这些比对结果组合成完整的转录本，并估计所有基因和转录本的表达水平(Pertea et al. 2015)。Ballgown从StringTie结果中提取转录本和表达水平，并应用严格的统计方法来确定它们的差异表达。总的来说，上述三个软件联合的RNA-seq 分析流程是比较完善的，如下图(Pertea et al. 2016)。

Hisat2, Stringtie and Ballgown workflow

测试数据

包含12个样本的RNA-seq序列共2.0G，样本分别来自GBR(英国人来自英格兰)和YRI(约鲁巴人来自尼日利亚伊巴丹)。(Pertea et al. 2016)

ftp下载链接： ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

fastp质控

一般fastp 可以在质控后直接生成质控前，质控后数据的变化情况，此处依然沿用fastqc + multiqc + fastp的质控路线。

参数：

RawFastQC：Raw reads 的 FastQC 输出路径
RequiredCPU：线程数(int)

# fastqc
fastqc -o RawFastQC -t RequiredCPU sampleID_1.fq.gz sampleID_2.fq.gz

# multiqc - summary report
multiqc RawFastQC

# fastp - Quality control
fastp -i sampleID_1.fq.gz -I sampleID_2.fq.gz \
 -o CleanReads/sampleID_1.fq.gz \
 -O CleanReads/sampleID_2.fq.gz \
 -w RequiredCPU \
 -j FastpDir/sampleID.json -h FastpDir/sampleID.html

由于每个样本都生成了一个fastp.json 文件，我们可以批量解析质控结果，从而在整体水平上评估质控效果。

#!/usr/bin/python
"""
@ Usage: python fastp_parse.py input_file input_file
@ input_file：the file of the fastp.jsons
@ Wu 2020.12.30
"""
import sys, os
import json
import pandas as pd

input_file = sys.argv[1]
input_file = sys.argv[2]

fastp_list = os.listdir(input_file)
result_dict = {}
merge_result_dict = {}
for i in fastp_list:
 if i.endswith('.json'):
 with open(input_file + i, 'r') as f:

 result_dict[i] = json.load(f)

key = i.replace(".json", "")
key1 = key + '+before_filtering'
key2 = key + '+after_filtering'

merge_result_dict[key1] = {'total_reads' : result_dict[i]["summary"]['before_filtering']['total_reads'],
 "total_bases": result_dict[i]["summary"]['before_filtering']['total_bases'],
 'q20_rate' : result_dict[i]["summary"]['before_filtering']['q20_rate'],
 'q30_rate' : result_dict[i]["summary"]['before_filtering']['q30_rate']
}

merge_result_dict[key2] = {'total_reads' : result_dict[i]["summary"]['after_filtering']['total_reads'],
 "total_bases": result_dict[i]["summary"]['after_filtering']['total_bases'],
 'q20_rate' : result_dict[i]["summary"]['after_filtering']['q20_rate'],
 'q30_rate' : result_dict[i]["summary"]['after_filtering']['q30_rate']
}


df = pd.DataFrame(merge_result_dict).T
df.to_csv(output_file, sep="\t", index= True,header=True)

Hisat Mapping

对于一些模式生物或者常见生物，其Hisat index可能已经构建好了，可以直接去下载即可。
Url page: https://ccb.jhu.edu/software/hisat2/index.shtml, ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/

对于上述链接中不存在Hisat index的情况，则需要我们自己进行构建；值得注意的是，不同于bwa index，Hisat index 可以为参考基因组提供注释信息以便获得更高的效率和更好的跨外显子边界和剪接位点 mapping 。

Data：1. reference genome(reference.fa)；2. clean RNAseq reads 3. ref_annotation.gff

# extract splice-site and exon information
hisat2/bin/extract_splice_sites.py ref_annotation.gff >ref.ss
hisat2/bin/extract_exons.py ref_annotation.gff >ref.exon

# building the index
hisat2-build -p 8 --ss ref.ss --exon ref.exon reference.fa index\rna_ref_index(out_file_name)

# mapping the NGS reads
hisat2 -p 8 --dta -x index/rna_ref_index -1 CleanReads/sampleID_1.fq.gz \
-2 CleanReads/sampleID_2.fq.gz -S sampleID.sam

# sortting BAM 
samtools view -S -b sampleID.sam | samtools sort -@ 10 -o sampleID_sorted.bam

Mapping quality

对于mapping quality，可以通过IGV和samtools tview 进行可视化查看，亦可以通过samtools flagstat 获得，并提取所有样本比对信息作图比较。

# samtools flagstat
samtools flagstat sampleID_sorted.bam > flagstat/sampleID_sorted.bam

Transcriptome Assembly

在这里有两种模式可以选择，一种是 ‘reference only’ 模式，该模式运行速度更快、更简单，而且它会避免数据受到其他库的污染，但可能无法发现新的转录本；另一种是预测模式，在该模式下转录本将跨过不同的样本数据去进行比较，可能会找到新的转录本，但在不了解自己样本数据的前提下最好不用；

参考：https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

# reference only
stringtie --rf -p 8 -G ref_annotation.gff -e -B \
-o ballgown/sampleID/sampleID.gtf \
-A ballgown/sampleID/sampleID_gene_abundances.tsv \
sampleID_sorted.bam

表达矩阵：t_data.ctab，gene_abundances.tsv

ballgown

有表达矩阵数据后，除了可以分别读入、合并和分析外，亦可以用ballgow 包进行全局性的数据管理和分析。

library(ballgown)
library(tidyverse)

# load the sample information
pheno_data <- read.table("RNAseq/phenodata.txt", header = TRUE, sep ='\t')

# create a ballgown object
ballgown_obj <- ballgown(dataDir = "RNAseq/ballgown",
# regular expression identifying the sample files
# "" means all FILES
 samplePattern = "", 
 pData = pheno_data)

# getting the gene, transcript, exon, and intron expression levels 
# by using gexpr(), texpr(), eexpr(), and iexpr()
gene_data <- gexpr(ballgown_obj)

# Load all attributes including gene name
trans_data <-  texpr(ballgown_obj, 'all')

Reference

Pertea, Mihaela, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from Rna-Seq Reads.” Nature Biotechnology 33 (3): 290–95.

Pertea, Mihaela, Daehwan Kim, Geo M Pertea, Jeffrey T Leek, and Steven L Salzberg. 2016. “Transcript-Level Expression Analysis of Rna-Seq Experiments with Hisat, Stringtie and Ballgown.” Nature Protocols 11 (9): 1650.

Stringtie：https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

ballgown：https://www.bioconductor.org/packages/release/bioc/html/ballgown.html

网友评论

本文标题：2. Hisat2, Stringtie and Ballgow

本文链接：https://www.haomeiwen.com/subject/uknohltx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！