Installation environment
Ubuntu 18.10
ART-bin-MountRainier-2016.06.05-Linux64
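Installation steps
1. Download the package from https://www.niehs.nih.gov/research/resources/software/biostatistics/art/.
2. Move the package to /opt, then extract it and add the extracted directory to PATH:
sudo mv artbinmountrainier2016.06.05linux64.tgz -t /opt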
cd /opt
sudo tar xzvf artbinmountrainier2016.06.05linux64.tgz
echo "export PATH=\$PATH:/opt/art_bin_MountRainier" >> ~/.bashrc
source ~/.bashrc
art_illumina
If art_illumina prints its usage message, the installation is complete.
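To confirm that the PATH change took effect in a new shell, the check can be scripted. This is a minimal sketch; `check_tool` is a hypothetical helper name used for illustration and is not part of ART.

```shell
# check_tool NAME: succeed if NAME resolves to a command on PATH.
# Hypothetical helper for illustration; not part of ART.
check_tool() {
    command -v "$1" >/dev/null 2>&1
}

if check_tool art_illumina; then
    echo "art_illumina found at: $(command -v art_illumina)"
else
    echo "art_illumina not found; re-check the PATH line in ~/.bashrc"
fi
```

If the tool is not found, open a new terminal or run source ~/.bashrc again before retrying.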
Introduction
Set of Simulation Tools
ART is a set of simulation tools to generate synthetic next-generation sequencing reads. ART simulates sequencing reads by mimicking the real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing data. ART can also simulate reads using a user's own read error model or quality profiles. ART supports simulation of single-end, paired-end/mate-pair reads of three major commercial next-generation sequencing platforms: Illumina's Solexa, Roche's 454 and Applied Biosystems' SOLiD. ART can be used to test or benchmark a variety of methods or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, SNP and structural variation discovery. ART was used as a primary tool for the simulation study of the 1000 Genomes Project. ART is implemented in C++ with optimized algorithms and is highly efficient in read simulation. ART outputs reads in the FASTQ format, and alignments in the ALN format. ART can also generate alignments in the SAM alignment or UCSC BED file format. ART can be used together with genome variant simulators (e.g. VarSim) for evaluating variant calling tools or methods.
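Sometimes we need a simulation tool to generate test data for evaluating software. Here I introduce ART, a fairly popular read simulator.
ART was published in Bioinformatics in 2012 and had been cited 476 times at the time of writing. It can simulate single-end and paired-end/mate-pair reads for the three major next-generation sequencing platforms: Illumina's Solexa, Roche's 454 and Applied Biosystems' SOLiD. The simulated reads can be used to test or benchmark tools for read alignment, de novo assembly, SNP calling and similar analyses, so it covers a wide range of use cases. ART was used as a primary simulation tool in the 1000 Genomes Project. It is written in C++ and ships with some Perl scripts; its algorithms are optimized and it is very efficient, but it does not currently support multithreading. Output formats include FASTQ for reads and ALN and SAM for alignments, and a bundled script converts ALN to BED.
ART runs on Linux and macOS, and builds for Windows are also available. The official page (https://www.niehs.nih.gov/research/resources/software/biostatistics/art/) links to the latest Linux release; after downloading, extract it with tar and the bundled programs can be used directly.
Since Illumina data is the most commonly used today, the usage notes here cover ART's Illumina program. art_illumina is run as follows: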
./art_illumina -ss HS25 -i ./testSeq.fa -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
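./art_illumina is the program to run
-i the input reference genome (FASTA)
-o the output; paired_end_com is the prefix of the output files
-p simulate paired-end reads; if the value given to -m is >= 2000, the run automatically switches to mate-pair
-m the mean fragment size for paired-end simulation
-s the standard deviation of the fragment size given by -m
-f the fold coverage of the output data, 10X here
-l the read length; -l 150 produces 150 bp reads
-sam also generate a SAM alignment file
-ef also generate an error-free SAM file alongside the regular one; whether to add it depends on your needs
-ss the name of the Illumina sequencing system whose built-in profile is used for simulation. Each Illumina platform has a fixed identifier, listed below; HS25 is currently the most common.
GA1 - GenomeAnalyzer I (36bp, 44bp)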
GA2 - GenomeAnalyzer II (50bp, 75bp)
HS10 - HiSeq 1000 (100bp)
HS20 - HiSeq 2000 (100bp)
HS25 - HiSeq 2500 (125bp, 150bp)
HSXn - HiSeqX PCR free (150bp)
HSXt - HiSeqX TruSeq (150bp)
MinS - MiniSeq TruSeq (50bp)
MSv1 - MiSeq v1 (250bp)
MSv3 - MiSeq v3 (250bp)
NS50 - NextSeq500 v2 (75bp)
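Because each built-in profile is listed with specific read lengths, it can help to check the -l value against the chosen -ss name before starting a long run. The sketch below assumes that -l should not exceed the longest length listed above for the profile; the helper names `max_read_len` and `check_read_len` are hypothetical and not part of ART, and the numbers are copied from the list above.

```shell
# max_read_len PROFILE: print the longest read length listed above for a
# built-in profile name; return 1 for a name that is not in the list.
# Hypothetical helper for illustration; not part of ART.
max_read_len() {
    case "$1" in
        GA1)            echo 44 ;;
        MinS)           echo 50 ;;
        GA2|NS50)       echo 75 ;;
        HS10|HS20)      echo 100 ;;
        HS25|HSXn|HSXt) echo 150 ;;
        MSv1|MSv3)      echo 250 ;;
        *)              return 1 ;;
    esac
}

# check_read_len PROFILE LENGTH: succeed if LENGTH is within the longest
# length listed for PROFILE (assumption: longer reads are not supported).
check_read_len() {
    _max=$(max_read_len "$1") || return 1
    [ "$2" -le "$_max" ]
}

if check_read_len HS25 150; then
    echo "HS25 with -l 150 is within the listed range"
fi
```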
Quick trial
The input is sample.fa, a 20.4 MB simulated data file.
art_illumina -ss HS25 -i ./sample.fa -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
The console output is:
====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.8 (June 7, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------
Paired-end sequencing simulation
Total CPU time used: 0.65
The random seed for the run: 1542454628
Parameters used during run
Read Length: 150
Genome masking 'N' cutoff frequency: 1 in 150
Fold Coverage: 10X
Mean Fragment Length: 500
Standard Deviation: 10
Profile Type: Combined
ID Tag:
Quality Profile(s)
First Read: HiSeq 2500 Length 150 R1 (built-in profile)
First Read: HiSeq 2500 Length 150 R2 (built-in profile)
Output files
FASTQ Sequence Files:
the 1st reads: ./paired_end_com1.fq
the 2nd reads: ./paired_end_com2.fq
ALN Alignment Files:
the 1st reads: ./paired_end_com1.aln
the 2nd reads: ./paired_end_com2.aln
SAM Alignment File:
./paired_end_com.sam
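The read length and fold coverage reported above give a rough idea of how many read pairs to expect, using coverage ≈ (2 × pairs × read length) / reference length. The sketch below is only an approximation under that assumption: the count ART actually produces may differ, for example because of how it handles each sequence and masked 'N' regions. `count_bases` and `expected_pairs` are hypothetical helper names.

```shell
# count_bases FASTA: count sequence characters, skipping '>' header lines.
# Hypothetical helper for illustration; not part of ART.
count_bases() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# expected_pairs REF_BASES FOLD READ_LEN: approximate number of read pairs,
# using integer arithmetic: REF_BASES * FOLD / (2 * READ_LEN).
expected_pairs() {
    echo $(( $1 * $2 / (2 * $3) ))
}

# Example with the settings used above (10X, 150 bp), if sample.fa exists.
if [ -f ./sample.fa ]; then
    bases=$(count_bases ./sample.fa)
    echo "approximate read pairs: $(expected_pairs "$bases" 10 150)"
fi
```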
The output files are as follows:
art_illumina_result.png
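A quick sanity check on the paired-end output is that both FASTQ files contain the same number of reads. The sketch below assumes each FASTQ record occupies exactly four lines; `count_fastq_reads` is a hypothetical helper name, and the file names are the ones reported in the console output above.

```shell
# count_fastq_reads FILE: number of records, assuming 4 lines per record.
# Hypothetical helper for illustration; not part of ART.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Compare the two mate files if they are present in the current directory.
if [ -f ./paired_end_com1.fq ] && [ -f ./paired_end_com2.fq ]; then
    r1=$(count_fastq_reads ./paired_end_com1.fq)
    r2=$(count_fastq_reads ./paired_end_com2.fq)
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "Mismatch: $r1 first reads vs $r2 second reads"
    fi
fi
```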