美文网首页基因组生物信息学生信分析流程
一个高杂合真菌基因组组装脚本(改代码版)

一个高杂合真菌基因组组装脚本(改代码版)

作者: IMC小达人 | 来源:发表于2019-10-31 11:02 被阅读0次

    前言及背景

    什么是基因组de novo测序?其是对某一物种进行高通量测序,利用高性能计算平台和生物信息学方法,在不依赖于参考基因组的情况下进行组装,从而绘制该物种的全基因组序列图谱。针对基因组的特性,基因组常被分为两类:普通(简单)基因组和复杂基因组。简单基因组指单倍体,纯合二倍体或者杂合度<0.5%,且重复序列含量<50%,GC含量为35%到65%之间的二倍体。复杂基因组则指杂合率>0.5%,重复序列含量>50%,GC含量处于异常的范围(GC含量<35%或者GC含量>=65%的二倍体,多倍体。诺禾致源对二倍体复杂基因组进一步细分为微杂合基因组(0.5%<杂合率<=0.8%、高杂合基因组(杂合率>0.8%)以及高重复基因组(重复序列比例>50%)。复杂基因组的组装一直以来都是一个让科研工作者为之头疼的问题。科研工作者也为解决这个问题一直努力着。随着三代测序平台的更迭,测序subreads的长度不断的增长,以及光学图谱技术的出现(Hic,10xGenomic等)。重复序列导致的组装困难逐步被缓解。但高杂合这一特性任一直困扰着科研工作者。

    为了解决杂合组装,各式各样的方案不断被提出。总体思路有下:1.设计实验方案获取单倍型(例如种间杂种;案例:Sequencing a Juglans regia × J. microcarpa hybrid yields high-quality genome assemblies of parental species;2.设计适合于杂合基因组组装的软件,优化组装的算法;3.组装之后,再用去杂合软件去除杂合序列。

    第一个思路很清晰,获取单倍型再测序,就解决了高杂合这个难点,可有时单倍体的获取真不是一般的难,局限性很大。

    基于第二个思路,已有很多软件开发。有NOVOheter, Plantanus, MSR-CA,Plantanus-allee(Plantanus的升级版,支持Hic,10xGenomics等数据)等。此外还有一些软件有支持高杂合组装的模块,如Canu,SPAdes, Falon等。但实测的经验来看,效果都不是很好。

    第三种思路的核心就是,居于相似性cut,目前接触过的有Redundans,Haplomerger2,Purge_haplogs。Purge_haplogs除了考虑相似性,还有一大特性,就是通过分析比对read的覆盖度决定谁去谁留。此外,Haplomerger2和Purge_haplogs还支持重复序列部分的屏蔽。

    今年我接手到了一个基因组大小为86M,杂合率约2.4%,重复序列率约为20%左右的真菌。测序策略为pacbio seq I + PE 150 。 本来打算再做一个Hic,但咨询一些测序公司之后,均表示真菌没有太多成功经验,且投入与产出可能不对等。故打消测Hic的念头。

    基于已有的数据,我先采取了canu (单倍型组装;canu多倍体模式,"batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"组装),MECAT2,MaSuRCA,Falon,flye,wtdbg2 等组装软件进行了组装。接着使用 Purge_haplogs 、Redundans和Haplomerger2进行去杂合,然后再使用 FinisherSC进行基因组升级,最后使用nextpolish 进行 polish。经过测试,canu,MECAT2(经过nextpolish 进行 polish之后), MaSuRCA的三个软件的表现相对较好。去杂软件Purge_haplogs的适用性优于Redundans和Haplomerger2,去杂前后BUSCO评估基本不变。由于涉及到文章。故暂时只能先提供最优的脚本,等文章出了之后会进一步深入。

    组装案例

    
    #测序数据Bam to Fastq or Fasta
    
    samtools fastq -0 012m.subreads.fq -@ 32 subreads.bam
    
    #Assessment of genome size and heterozygosity(raw PE reads)
    
    mkdir genomescope
    
    cd genomescope
    
    ln -s ../012m_L1_?.fq ./
    
    jellyfish count -C -m 21 -s 1000000000 -t 32 *.fq -o reads.jf
    
    jellyfish histo -t 32 reads.jf > reads.histo
    
    Rscript /opt/biosoft/genomescope/genomescope.R reads.histo 21 150 output
    
    # 二代数据质控
    
    mkdir Trimmomatic
    
    cd Trimmomatic
    
    java -jar /opt/biosoft/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 32 -phred33 ../012m_L1_1.fq ../012m_L1_2.fq 012m_Trimmomatic.1.fq 012m_Trimmomatic.unpaired.1.fq 012m_Trimmomatic.2.fq 012m_Trimmomatic.unpaired.2.fq ILLUMINACLIP:/opt/biosoft/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:75 TOPHRED33
    
    cd ../
    
    # correct Illumina reads | Assessment of genome size and heterozygosity
    
    mkdir Finderrors
    
    source ~/.bash.pacbio
    
    # ErrorCorrectReads.pl 为 ALLPATHS-LG 的一个perl 程序
    
    ErrorCorrectReads.pl PHRED_ENCODING=33 READS_OUT=012m FILL_FRAGMENTS=0 KEEP_KMER_SPECTRA=1 PAIRED_READS_A_IN=../fastuniq/012m.fastuniq.1.fastq PAIRED_READS_B_IN=../fastuniq/012m.fastuniq.2.fastq PLOIDY=2 PAIRED_SEP=251 PAIRED_STDEV=48
    
    # ErrorCorrectReads.pl也可以对基因组特性进行评估,但针对我的菌株来说,同已发布的邻近菌株比较,GenomeScope的结果要准确一些。
    
    # kmer plot
    
    cd 012m.fastq.kspec
    
    perl -p -i -e 's/\@fns = \(\"frag_reads_filt.25mer.kspec\", \"frag_reads_edit.24mer.kspec\", \"frag_reads_corr.25mer.kspec\"\);\n//' /opt/biosoft/ALLPATHS-LG/bin/KmerSpectrumPlot.pl
    
    KmerSpectrumPlot.pl SPECTRA=1 FREQ_MAX=255
    
    perl -p -i -e 's/\@fns = \(\"frag_reads_filt.25mer.kspec\", \"frag_reads_edit.24mer.kspec\", \"frag_reads_corr.25mer.kspec\"\);\n//' /opt/biosoft/ALLPATHS-LG/bin/KmerSpectrumPlot.pl
    
    convert kmer_spectrum.cumulative_frac.log.lin.eps kmer_spectrum.cumulative_frac.log.lin.png
    
    convert kmer_spectrum.distinct.lin.lin.eps kmer_spectrum.distinct.lin.lin.png
    
    convert kmer_spectrum.distinct.log.log.eps kmer_spectrum.distinct.log.log.png
    
    cd ../
    
    # Using LoRDEC to modify PacBio Reads
    
    mkdir LoRDEC
    
    cd LoRDEC
    
    lordec-correct -2 ../Finderrors/012m.paired.A.fastq ../Finderrors/012m.paired.B.fastq -i ../012m.subreads.fq -k 19 -o pacbio.LoRDEC.corrected.fasta -s 3 -T 32 &> lordec-correct.log
    
    seqkit seq -u pacbio.LoRDEC.corrected.fasta > pacbio.corrected.fasta**
    
    cd ../
    
    ###Genome assembly###
    
    mkdir MaSuRCA
    
    cd MaSuRCA
    
    echo '# example configuration file
    
    DATA
    
    #Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
    
    #if single-end, do not specify <reverse_reads>
    
    #MUST HAVE Illumina paired end reads to use MaSuRCA
    
    PE= pe 251 48 /home/bioinfo/data/012m/012m_L1_1.fq  /home/bioinfo/data/012m/012m_L1_2.fq
    
    #Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
    
    #JUMP= sh 3600 200 /FULL_PATH/short_1.fastq  /FULL_PATH/short_2.fastq
    
    #pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
    
    #if you have both types of reads supply them both as NANOPORE type
    
    PACBIO=/home/bioinfo/data/012m/LoRDEC/pacbio.corrected.fasta
    
    #NANOPORE=/FULL_PATH/nanopore.fa
    
    #Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
    
    #OTHER=/FULL_PATH/file.frg
    
    #synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data
    
    #REFERENCE=/FULL_PATH/nanopore.fa
    
    END
    
    PARAMETERS
    
    #PLEASE READ all comments to essential parameters below, and set the parameters according to your project
    
    #set this to 1 if your Illumina jumping library reads are shorter than 100bp
    
    EXTEND_JUMP_READS=0
    
    #this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
    
    GRAPH_KMER_SIZE = auto
    
    #set this to 1 for all Illumina-only assemblies
    
    #set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
    
    USE_LINKING_MATES = 0
    
    #specifies whether to run the assembly on the grid
    
    USE_GRID=0
    
    #specifies grid engine to use SGE or SLURM
    
    GRID_ENGINE=SGE
    
    #specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY
    
    GRID_QUEUE=all.q
    
    #batch size in the amount of long read sequence for each batch on the grid
    
    GRID_BATCH_SIZE=500000000
    
    #use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
    
    #can increase this to 30 or 35 if your reads are short (N50<7000bp)
    
    LHE_COVERAGE=25
    
    #set to 0 (default) to do two passes of mega-reads for slower, but higher quality assembly, otherwise set to 1
    
    MEGA_READS_ONE_PASS=0
    
    #this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
    
    #LIMIT_JUMP_COVERAGE = 300
    
    #these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.
    
    #CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
    
    CA_PARAMETERS =  cgwErrorRate=0.15
    
    #CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina  or long read data
    
    CLOSE_GAPS=1
    
    #auto-detected number of cpus to use, set this to the number of CPUs/threads per node you will be using
    
    NUM_THREADS = 32
    
    #this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20**
    
    JF_SIZE = 1740000000
    
    #ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data
    
    SOAP_ASSEMBLY=0
    
    #Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY.  Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.  A lot faster than CABOG, at the expense of some contiguity. Works well even when MEGA_READS_ONE_PASS is set to 1.  DO NOT use if you have less than 15x coverage by long reads.
    
    FLYE_ASSEMBLY=0
    
    END ' > config.txt
    
    /opt/biosoft/MaSuRCA-3.3.4/bin/masurca config.txt
    
    ./assemble.sh
    
    mkdir purge_haplogs
    
    cd purge_haplogs
    
    minimap2 -t 32 -ax map-pb ../final.genome.scf.fasta /home/bioinfo/data/012m/canu/012m.correctedReads.fasta.gz | samtools view -hF 256 - | samtools sort -@ 32 -m 2G -o aligned.bam
    
    purge_haplotigs  hist  -b aligned.bam  -g ../final.genome.scf.fasta -t 20
    
    purge_haplotigs contigcov -i aligned.bam.gencov -o coverage_stats.csv -l 18 -m 76 -h 134
    
    purge_haplotigs purge -g ../final.genome.scf.fasta -c coverage_stats.csv -b aligned.bam -t 4 -a 50
    
    mkdir finisherSC
    
    cd finisherSC
    
    ln -s ../curated.fasta ./contigs.fasta
    
    ln -s ~/data/012m/canu/012m.correctedReads.fasta ./raw_reads.fasta
    
    #这一步不要设置多线程,不然可能报错,使用mummer4的速度要远高于mummet3
    
    python /opt/biosoft/finishingTool/finisherSC.py ./ /opt/biosoft/mummer4/bin/
    
    mkdir NextPolish
    
    cd NextPolish
    
    ls ~/data/012m/Finderrors/012m.paired.?.fastq > sgs.fofn
    
    echo '/home/bioinfo/data/012m/canu/012m.correctedReads.fasta' > lgs.fofn
    
    echo '[General]
    
    job_type = local
    
    job_prefix = nextPolish
    
    task = default
    
    rewrite = yes
    
    rerun = 10
    
    parallel_jobs = 5
    
    multithread_jobs = 6
    
    genome = ../improved3.fasta
    
    genome_size = auto
    
    workdir = ./01_rundir
    
    polish_options = -p {multithread_jobs}
    
    [sgs_option]
    
    sgs_fofn = ./sgs.fofn
    
    sgs_options = -max_depth 100 -bwa
    
    [lgs_option]
    
    lgs_fofn = ./lgs.fofn
    
    lgs_options = -min_read_len 10k -max_read_len 150k -max_depth 60
    
    lgs_minimap2_options = -x map-pb
    
    [polish_options]
    
    -ploidy 2 ' > run.cfg
    
    /opt/biosoft/NextPolish/nextPolish run.cfg
    
    cat ./01_rundir/01.kmer_count/*polish.ref.sh.work/polish_genome*/genome.nextpolish.part*.fasta > genome.nextpolish.fasta
    
    

    参考

    动植物基因组de novo常见问题

    杂基因组测序技术研究进展

    NextPolish

    Sequencing a Juglans regia × J. microcarpa hybrid yields high-quality genome assemblies of parental species

    LoRDEC 利用二代数据纠错PacBio 数据

    genomescope

    MaSuRCA

    purge_haplogs

    相关文章

      网友评论

        本文标题:一个高杂合真菌基因组组装脚本(改代码版)

        本文链接:https://www.haomeiwen.com/subject/hoizvctx.html