肠道菌群:序列分类之kraken

作者: 基因的生物信息学分析 | 来源:发表于2019-08-02 22:05 被阅读13次

    细菌基因组测序完,想看看样本有没有被其他的菌污染?
    人的转录组测序完,想快速看看人、微生物的序列的比例?
    元/宏基因组测序完,想快速获得样本中物种的丰度信息?

    REFERENCE

    Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.

    Kraken 1

    Kraken 1是2014年Wood DE在Genome Biology发表的宏基因组序列分类软件,能够快速对宏基因样品中的reads进行分类。Kraken在序列比对环节基于精确k-mer匹配和精简数据库的方法,采取精确匹配,其核心是Kraken有一种特殊数据库,用以预先计算序列中包含的特殊的Kmer序列。 image 下面是来自kraken官网关于各分类器的测评结果: image.png

    Kraken速度很快,精度较低,适用于做微生物检测的预处理。通过一些实际的数据测试发现:与Metaphlan2相比,Kraken速度较快,获得的物种数目较多,相对应的假阳性率也较高。

    Usage: kraken [options] <filename(s)>
    
    Options:
      --db NAME               Name for Kraken DB
                              (default: none)
      --threads NUM           Number of threads (default: 1)
      --fasta-input           Input is FASTA format
      --fastq-input           Input is FASTQ format
      --fastq-output          Output in FASTQ format
      --gzip-compressed       Input is gzip compressed
      --bzip2-compressed      Input is bzip2 compressed
      --quick                 Quick operation (use first hit or hits)
      --min-hits NUM          In quick op., number of hits req'd for classification
                              NOTE: this is ignored if --quick is not specified
      --unclassified-out FILENAME
                              Print unclassified sequences to filename
      --classified-out FILENAME
                              Print classified sequences to filename
      --out-fmt FORMAT        Format for [un]classified sequence output. supported
                              options are: {legacy, paired, interleaved}
      --output FILENAME       Print output to filename (default: stdout); "-" will
                              suppress normal output
      --only-classified-output
                              Print no Kraken output for unclassified sequences
      --preload               Loads DB into memory before classification
      --paired                The two filenames provided are paired-end reads
      --check-names           Ensure each pair of reads have names that agree
                              with each other; ignored if --paired is not specified
      --help                  Print this message
      --version               Print version information
    
    If none of the *-input or *-compressed flags are specified, and the
    file is a regular file, automatic format detection is attempted.
    
    $ kraken --threads 40 --db minikraken_20171013_4GB --preload --
    paired --fastq-input --gzip-compressed  ${B}_1.fastq.gz ${B}_2.fastq.gz | kraken-report --db minikraken_20171013_4GB > "$B"_kraken.tab
    
    $ less -S  "$B"_kraken.tab
    50.30  27375076        27375076        U       0       unclassified
     49.70  27047187        13633   -       1       root
     49.35  26857009        348     -       131567    cellular organisms
     49.35  26855582        105532  D       2           Bacteria
     33.57  18270135        0       -       1783270       FCB group
     33.57  18269977        5       -       68336           Bacteroidetes/Chlorobi group
     33.57  18269761        107058  P       976               Bacteroidetes
     33.34  18144208        69      C       200643              Bacteroidia
     33.34  18144105        961382  O       171549                Bacteroidales
     30.47  16580771        0       F       815                     Bacteroidaceae
     30.47  16580771        2380340 G       816                       Bacteroides
     22.89  12454849        9623614 S       821                         Bacteroides vulgatus
      5.20  2831235 2831235 -       435590                        Bacteroides vulgatus ATCC 8482
      0.73  396849  0       S       357276                      Bacteroides dorei
      0.73  396849  396849  -       997877                        Bacteroides dorei CL03T12C01
      0.60  326487  326471  S       28116                       Bacteroides ovatus
      0.00  16      16      -       1379690                       Bacteroides ovatus V975
      0.42  227354  141795  S       818                         Bacteroides thetaiotaomicron
      0.16  85559   85559   -       226186                        Bacteroides thetaiotaomicron VPI-5482
      0.38  206113  206113  S       1796613                     Bacteroides caecimuris
      0.33  180286  180286  S       47678                       Bacteroides caccae
      0.29  156976  68554   S       817                         Bacteroides fragilis
      0.10  55847   55847   -       862962                        Bacteroides fragilis 638R
      0.03  17859   17859   -       295405                        Bacteroides fragilis YCH46
      0.03  14716   14716   -       272559                        Bacteroides fragilis NCTC 9343
    

    如上是Kraken的结果,可以看出它没有估算出物种的丰度。

    Bracken

    这时可以使用另一款软件Bracken (Bayesian Reestimation of Abundance with KrakEN),它是一种从元基因组数据中计算物种丰度的高度准确的统计方法。

    $ bracken -h
    Usage: bracken -d MY_DB -i INPUT -o OUTPUT -r READ_LEN -l LEVEL -t THRESHOLD
      MY_DB          location of Kraken database
      INPUT          Kraken REPORT file to use for abundance estimation
      OUTPUT         file name for Bracken default output
      READ_LEN       read length to get all classifications for (default: 100)
      LEVEL          level to estimate abundance at [options: D,P,C,O,F,G,S] (default: S)
      THRESHOLD      number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)
    $ bracken -d minikraken_20171013_4GB/ -i $B\_kraken.tab -t 10 -o $B.out
    #获得如下结果:
    name    taxonomy_id     taxonomy_lvl    kraken_assigned_reads   added_reads     new_est_reads   fraction_total_reads
    Uncultured phage WW-nAnB strain 2       1449896 S       388     113     501     0.00007
    Uncultured phage WW-nAnB strain 3       1449897 S       349     131     480     0.00007
    Aureimonas sp. AU20     1349819 S       108     39      147     0.00002
    Phaeobacter piscinae    1580596 S       21      4       25      0.00000
    Sinorhizobium sp. CCBAU 05631   794846  S       25      12      37      0.00001
    Mucilaginibacter sp. BJC16-A31  1234841 S       31      0       31      0.00000
    Arcanobacterium phocae  131112  S       34      0       34      0.00000
    Kineococcus radiotolerans       131568  S       82      3       85      0.00001
    Actinomyces radingae    131110  S       314     5       319     0.00005
    Sediminicola sp. YIK13  1453352 S       19      0       19      0.00000
    Methylotenera mobilis   359408  S       42      1       43      0.00001
    Stenotrophomonas rhizophila     216778  S       76      52      128     0.00002
    Acholeplasma oculi      35623   S       12      0       12      0.00000
    Dictyoglomus turgidum   513050  S       12      0       12      0.00000
    Chelatococcus sp.
    

    NOTE: Kraken 2 is the newest version of Kraken (See Kraken 2's Webpage for details). Kraken 1 will continue to be available via the Kraken 1 Github page, but it is no longer being supported.

    Kraken 2

    与Kraken 1相比,Kraken 2有了很大的改进:
    1.更快速的构建数据库
    2.数据库的占用存储空间更少

    1. 更快的分类速度
      还能支持 16S Databases包括Greengenes, SILVA,和 RDP.
    
    $ kraken2
    Need to specify input filenames!
    Usage: kraken2 [options] <filename(s)>
    
    Options:
      --db NAME               Name for Kraken 2 DB
                              (default: none)
      --threads NUM           Number of threads (default: 1)
      --quick                 Quick operation (use first hit or hits)
      --unclassified-out FILENAME
                              Print unclassified sequences to filename
      --classified-out FILENAME
                              Print classified sequences to filename
      --output FILENAME       Print output to filename (default: stdout); "-" will
                              suppress normal output
      --confidence FLOAT      Confidence score threshold (default: 0.0); must be
                              in [0, 1].
      --minimum-base-quality NUM
                              Minimum base quality used in classification (def: 0,
                              only effective with FASTQ input).
      --report FILENAME       Print a report with aggregrate counts/clade to file
      --use-mpa-style         With --report, format report output like Kraken 1's
                              kraken-mpa-report
      --report-zero-counts    With --report, report counts for ALL taxa, even if
                              counts are zero
      --memory-mapping        Avoids loading database into RAM
      --paired                The filenames provided have paired-end reads
      --use-names             Print scientific names instead of just taxids
      --gzip-compressed       Input files are compressed with gzip
      --bzip2-compressed      Input files are compressed with bzip2
      --help                  Print this message
      --version               Print version information
    
    If none of the *-compressed flags are specified, and the filename provided
    is a regular file, automatic format detection is attempted.
    
    $ kraken2\
        --db minikraken2_v2_8GB_201904_UPDATE/ \
        --threads 20 \
        --report report \
     --gzip-compressed   --paired \
    ${B}_1.fastq.gz ${B}_2.fastq.gz
    
    

    感谢您的阅读,欢迎点赞、评论和转发!!

    扫描或长按下方二维码,即可关注公众号: 基因的生物信息学分析

    image

    相关阅读

    细菌基因组:结核杆菌测序耐药位点分析

    一文搞定细菌基因组De Novo测序分析

    肠道菌群:16S测序分析流程解读

    肠道菌群:宏基因组测序分析流程解读(上)

    相关文章

      网友评论

        本文标题:肠道菌群:序列分类之kraken

        本文链接:https://www.haomeiwen.com/subject/pdoxdctx.html