美文网首页走进转录组重点关注
2. Hisat2, Stringtie and Ballgow

2. Hisat2, Stringtie and Ballgow

作者: 吴十三和小可爱的札记 | 来源:发表于2021-04-02 19:24 被阅读0次

    简介

    Hisat能够将RNA-seq reads 比对到参考基因组上,并且发现转录本的剪接位点,具有运行速度快,占用计算机内存资源少的特点。StringTie将这些比对结果组合成完整的转录本,并估计所有基因和转录本的表达水平(Pertea et al. 2015)。Ballgown从StringTie结果中提取转录本和表达水平,并应用严格的统计方法来确定它们的差异表达。总的来说,上述三个软件联合的RNA-seq 分析流程是比较 完善的,如下图(Pertea et al. 2016)。

    Hisat2, Stringtie and Ballgown workflow

    测试数据

    包含12个样本的RNA-seq序列共2.0G,样本分别来自GBR(英国人来自英格兰)和YRI(约鲁巴人来自尼日利亚伊巴丹)。(Pertea et al. 2016)

    ftp下载链接: ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

    fastp质控

    一般fastp 可以在质控后直接生成质控前,质控后数据的变化情况,此处依然沿用fastqc + multiqc + fastp的质控路线。

    参数:

    • RawFastQC:Raw reads 的 FastQC 输出路径

    • RequiredCPU:线程数(int)

    # fastqc
    fastqc -o RawFastQC -t RequiredCPU sampleID_1.fq.gz sampleID_2.fq.gz
    
    # multiqc - summary report
    multiqc RawFastQC
    
    # fastp - Quality control
    fastp -i sampleID_1.fq.gz -I sampleID_2.fq.gz \
     -o CleanReads/sampleID_1.fq.gz \
     -O CleanReads/sampleID_2.fq.gz \
     -w RequiredCPU \
     -j FastpDir/sampleID.json -h FastpDir/sampleID.html
    

    由于每个样本都生成了一个fastp.json 文件,我们可以批量解析质控结果,从而在整体水平上评估质控效果。

    #!/usr/bin/python
    """
    @ Usage: python fastp_parse.py input_file input_file
    @ input_file:the file of the fastp.jsons
    @ Wu 2020.12.30
    """
    import sys, os
    import json
    import pandas as pd
    
    input_file = sys.argv[1]
    input_file = sys.argv[2]
    
    fastp_list = os.listdir(input_file)
    result_dict = {}
    merge_result_dict = {}
    for i in fastp_list:
     if i.endswith('.json'):
     with open(input_file + i, 'r') as f:
    
     result_dict[i] = json.load(f)
    
    key = i.replace(".json", "")
    key1 = key + '+before_filtering'
    key2 = key + '+after_filtering'
    
    merge_result_dict[key1] = {'total_reads' : result_dict[i]["summary"]['before_filtering']['total_reads'],
     "total_bases": result_dict[i]["summary"]['before_filtering']['total_bases'],
     'q20_rate' : result_dict[i]["summary"]['before_filtering']['q20_rate'],
     'q30_rate' : result_dict[i]["summary"]['before_filtering']['q30_rate']
    }
    
    merge_result_dict[key2] = {'total_reads' : result_dict[i]["summary"]['after_filtering']['total_reads'],
     "total_bases": result_dict[i]["summary"]['after_filtering']['total_bases'],
     'q20_rate' : result_dict[i]["summary"]['after_filtering']['q20_rate'],
     'q30_rate' : result_dict[i]["summary"]['after_filtering']['q30_rate']
    }
    
    
    df = pd.DataFrame(merge_result_dict).T
    df.to_csv(output_file, sep="\t", index= True,header=True)
    

    Hisat Mapping

    对于一些模式生物或者常见生物,其Hisat index可能已经构建好了,可以直接去下载即可。
    Url page: https://ccb.jhu.edu/software/hisat2/index.shtml, ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/

    对于上述链接中不存在Hisat index的情况,则需要我们自己进行构建;值得注意的是,不同于bwa index,Hisat index 可以为参考基因组提供注释信息以便获得更高的效率和更好的跨外显子边界和剪接位点 mapping 。

    Data:1. reference genome(reference.fa);2. clean RNAseq reads 3. ref_annotation.gff

    # extract splice-site and exon information
    hisat2/bin/extract_splice_sites.py ref_annotation.gff >ref.ss
    hisat2/bin/extract_exons.py ref_annotation.gff >ref.exon
    
    # building the index
    hisat2-build -p 8 --ss ref.ss --exon ref.exon reference.fa index\rna_ref_index(out_file_name)
    
    # mapping the NGS reads
    hisat2 -p 8 --dta -x index/rna_ref_index -1 CleanReads/sampleID_1.fq.gz \
    -2 CleanReads/sampleID_2.fq.gz -S sampleID.sam
    
    # sortting BAM 
    samtools view -S -b sampleID.sam | samtools sort -@ 10 -o sampleID_sorted.bam
    

    Mapping quality

    对于mapping quality,可以通过IGV和samtools tview 进行可视化查看,亦可以通过samtools flagstat 获得,并提取所有样本比对信息作图比较。

    # samtools flagstat
    samtools flagstat sampleID_sorted.bam > flagstat/sampleID_sorted.bam
    

    Transcriptome Assembly

    在这里有两种模式可以选择,一种是 ‘reference only’ 模式,该模式运行速度更快、更简单,而且它会避免数据受到其他库的污染,但可能无法发现新的转录本;另一种是预测模式,在该模式下转录本将跨过不同的样本数据去进行比较,可能会找到新的转录本,但在不了解自己样本数据的前提下最好不用;

    参考:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

    # reference only
    stringtie --rf -p 8 -G ref_annotation.gff -e -B \
    -o ballgown/sampleID/sampleID.gtf \
    -A ballgown/sampleID/sampleID_gene_abundances.tsv \
    sampleID_sorted.bam
    

    表达矩阵:t_data.ctab,gene_abundances.tsv

    ballgown

    有表达矩阵数据后,除了可以分别读入、合并和分析外,亦可以用ballgow 包进行全局性的数据管理和分析。

    library(ballgown)
    library(tidyverse)
    
    # load the sample information
    pheno_data <- read.table("RNAseq/phenodata.txt", header = TRUE, sep ='\t')
    
    # create a ballgown object
    ballgown_obj <- ballgown(dataDir = "RNAseq/ballgown",
    # regular expression identifying the sample files
    # "" means all FILES
     samplePattern = "", 
     pData = pheno_data)
    
    # getting the gene, transcript, exon, and intron expression levels 
    # by using gexpr(), texpr(), eexpr(), and iexpr()
    gene_data <- gexpr(ballgown_obj)
    
    # Load all attributes including gene name
    trans_data <-  texpr(ballgown_obj, 'all')
    

    Reference

    Pertea, Mihaela, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from Rna-Seq Reads.” Nature Biotechnology 33 (3): 290–95.

    Pertea, Mihaela, Daehwan Kim, Geo M Pertea, Jeffrey T Leek, and Steven L Salzberg. 2016. “Transcript-Level Expression Analysis of Rna-Seq Experiments with Hisat, Stringtie and Ballgown.” Nature Protocols 11 (9): 1650.

    Stringtie:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

    ballgown:https://www.bioconductor.org/packages/release/bioc/html/ballgown.html

    相关文章

      网友评论

        本文标题:2. Hisat2, Stringtie and Ballgow

        本文链接:https://www.haomeiwen.com/subject/uknohltx.html