Installing ART and Basic Usage

Author: 少年英雄小猪熊 | Published 2018-11-17 19:27


    Installation environment

    Ubuntu 18.10
    ART-bin-MountRainier-2016.06.05-Linux64


    Installation steps

    1. Download the package from https://www.niehs.nih.gov/research/resources/software/biostatistics/art/.
    2. Move the package to /opt: sudo mv artbinmountrainier2016.06.05linux64.tgz -t /opt
    Then run:

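    cd /opt
    # /opt is normally root-owned, so extracting there needs sudo
    sudo tar xzvf artbinmountrainier2016.06.05linux64.tgz
    # add the extracted directory to PATH for future shells
    echo "export PATH=\$PATH:/opt/art_bin_MountRainier" >> ~/.bashrc
    source ~/.bashrc
    # with no arguments, art_illumina prints its usage message
    art_illumina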
    

    Installation is now complete.
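As a quick sanity check after the steps above, the sketch below tests whether art_illumina can be found on PATH. The helper name have_command is made up for this illustration and is not part of ART.

```shell
# Report whether a command can be found on PATH.
# have_command is a hypothetical helper, not something ART provides.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command art_illumina; then
    echo "art_illumina found at: $(command -v art_illumina)"
else
    echo "art_illumina is not on PATH; check the export line in ~/.bashrc"
fi
```

If the second message appears, opening a new terminal or re-running source ~/.bashrc is usually the first thing to try.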


    Introduction

    Set of Simulation Tools

    ART is a set of simulation tools to generate synthetic next-generation sequencing reads. ART simulates sequencing reads by mimicking real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing data. ART can also simulate reads using user own read error model or quality profiles. ART supports simulation of single-end, paired-end/mate-pair reads of three major commercial next-generation sequencing platforms: Illumina's Solexa, Roche's 454 and Applied Biosystems' SOLiD. ART can be used to test or benchmark a variety of method or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, SNP and structure variation discovery. ART was used as a primary tool for the simulation study of the 1000 Genomes Project . ART is implemented in C++ with optimized algorithms and is highly efficient in read simulation. ART outputs reads in the FASTQ format, and alignments in the ALN format. ART can also generate alignments in the SAM alignment or UCSC BED file format. ART can be used together with genome variants simulators (e.g. VarSim) for evaluating variant calling tools or methods.

    Sometimes we need a data simulator to generate synthetic data for testing software. Below I introduce ART, one of the more popular tools for this.

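    The software was published in Bioinformatics in 2012 and had been cited 476 times at the time of writing. ART can simulate single-end and paired-end/mate-pair reads for the three major next-generation sequencing platforms: Illumina's Solexa, Roche's 454 and Applied Biosystems' SOLiD. It can also be used to test or benchmark tools for read alignment, de novo assembly, SNP calling and similar analyses, so it covers a broad range of uses. ART served as a primary simulation tool in the 1000 Genomes Project. It is written in C++ and ships with bundled Perl scripts; its algorithms are optimized and it is highly efficient, although it does not currently support multithreading. Output formats include FASTQ for reads and ALN and SAM for alignments, and a bundled script converts ALN files to BED.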

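    ART runs on Linux and macOS, and builds for Windows are also available. The latest Linux release can be downloaded from the official site (https://www.niehs.nih.gov/research/resources/software/biostatistics/art/); after unpacking it with tar, the bundled programs can be used directly.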

    Since Illumina sequencing data is the most commonly used today, the rest of this post covers ART's Illumina program. art_illumina is run as follows:

    ./art_illumina -ss HS25 -i ./testSeq.fa -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam

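    ./art_illumina is the program to run

    -i the input reference genome

    -o the output; paired_end_com is the prefix of the output files

    -p produce paired-end data; if the value given to -m is >= 2000, ART automatically switches to mate-pair simulation

    -m the mean fragment size for paired-end simulation

    -s the standard deviation of the fragment size given by -m

    -f the fold coverage of the output data, 10X here

    -l 150 sets the read length to 150 bp

    -sam also write a SAM file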

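    -ef additionally writes an error-free SAM file alongside the regular one; its reads keep the same alignment positions but contain no sequencing errors. Whether to add it depends on your needs.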

    -ss the name of the Illumina sequencing system whose built-in profile is used for simulation. Each Illumina platform has a fixed identifier, listed below; HS25 is currently one of the more common choices.

    GA1 - GenomeAnalyzer I (36bp,44bp)

    GA2 - GenomeAnalyzer II (50bp, 75bp)

    HS10 - HiSeq 1000 (100bp)

    HS20 - HiSeq 2000 (100bp)

    HS25 - HiSeq 2500 (125bp, 150bp)

    HSXn - HiSeqX PCR free (150bp)

    HSXt - HiSeqX TruSeq (150bp)

    MinS - MiniSeq TruSeq (50bp)

    MSv1 - MiSeq v1 (250bp)

    MSv3 - MiSeq v3 (250bp)

    NS50 - NextSeq500 v2 (75bp)
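To get a feel for how -f and -l interact, the sketch below estimates how many read pairs a paired-end run would produce. It assumes that fold coverage is total sequenced bases divided by reference length and that both mates of a pair count toward coverage; ART's exact internal rounding may differ. The function name estimate_read_pairs and the 1,000,000 bp reference are hypothetical.

```shell
# Rough estimate of the number of read pairs in a paired-end simulation.
# Assumption (not taken from ART's source code):
#   coverage = pairs * 2 * read_len / genome_len
# estimate_read_pairs is a hypothetical helper name.
estimate_read_pairs() {
    # $1 = reference length in bp, $2 = fold coverage, $3 = read length in bp
    echo $(( $1 * $2 / (2 * $3) ))
}

# Hypothetical 1,000,000 bp reference at 10X with 150 bp reads
estimate_read_pairs 1000000 10 150   # prints 33333
```

Shell arithmetic is integer-only, so the result is rounded down; that is good enough for estimating output size before a run.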


    A quick trial

    The input is sample.fa, a 20.4 MB file of simulated data.

    art_illumina -ss HS25 -i ./sample.fa -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
    

    The console output is:

        ====================ART====================
                 ART_Illumina (2008-2016)          
              Q Version 2.5.8 (June 7, 2016)       
         Contact: Weichun Huang <whduke@gmail.com> 
        -------------------------------------------
    
                      Paired-end sequencing simulation
    
    Total CPU time used: 0.65
    
    The random seed for the run: 1542454628
    
    Parameters used during run
        Read Length:    150
        Genome masking 'N' cutoff frequency:    1 in 150
        Fold Coverage:            10X
        Mean Fragment Length:     500
        Standard Deviation:       10
        Profile Type:             Combined
        ID Tag:                   
    
    Quality Profile(s)
        First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 
        First Read:   HiSeq 2500 Length 150 R2 (built-in profile) 
    
    Output files
    
      FASTQ Sequence Files:
         the 1st reads: ./paired_end_com1.fq
         the 2nd reads: ./paired_end_com2.fq
    
      ALN Alignment Files:
         the 1st reads: ./paired_end_com1.aln
         the 2nd reads: ./paired_end_com2.aln
    
      SAM Alignment File:
        ./paired_end_com.sam
    
    

    The output files are as follows:

    [Image: art_illumina_result.png]
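One simple check on the generated files is to count the reads in each FASTQ file; for a paired-end run the two counts should match. This is a minimal sketch that assumes the standard four lines per FASTQ record. The helper name count_fastq_reads is made up for this illustration.

```shell
# Count records in a FASTQ file, assuming exactly 4 lines per record.
# count_fastq_reads is a hypothetical helper, not part of ART.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Report counts only for files that actually exist in the current directory
for fq in ./paired_end_com1.fq ./paired_end_com2.fq; do
    if [ -f "$fq" ]; then
        echo "$fq: $(count_fastq_reads "$fq") reads"
    fi
done
```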

    Reference: http://www.dxy.cn/bbs/topic/38801765

