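I've been reading some algorithm papers lately and want to share them so we can learn together. I hope they prove useful, both to me and to you.
Today let's talk about DIAMOND, a seriously impressive algorithmic tool. As of today it has been cited more than 5,000 times (Figure 1).
[Figure 1]
The first version of DIAMOND was published in Nature Methods (Q1, impact factor 28.547) as a Brief Communication (Figure 2).
[Figure 2]
All right, with the journal covered, let's start translating...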
Title:
Fast and sensitive protein alignment using DIAMOND
Abstract:
The alignment of sequencing reads against a protein reference database is a major computational bottleneck in metagenomics and data-intensive evolutionary projects. Although recent tools offer improved performance over the gold standard BLASTX, they exhibit only a modest speedup or low sensitivity. We introduce DIAMOND, an open-source algorithm based on double indexing that is 20,000 times faster than BLASTX on short reads and has a similar degree of sensitivity.
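In short: aligning sequencing reads against a protein reference database is a major computational bottleneck in metagenomics and data-intensive evolutionary projects. Earlier tools were either only modestly faster than the gold standard BLASTX or much less sensitive, whereas DIAMOND, an open-source algorithm built on double indexing, is about 20,000 times faster than BLASTX on short reads at similar sensitivity.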
In metagenomics studies, millions of sequence reads are analyzed to determine the functional or taxonomic content of microbial samples from the environment [1]. An important computational step is to determine which genes are present, usually by aligning translated DNA sequences against a reference database of protein sequences such as the NCBI nonredundant (NCBI-nr) database [2] or KEGG [3]. BLASTX [4] has long been considered the gold standard tool for this owing to its high sensitivity. However, BLASTX is much too slow for routine application in a high-throughput context.
In other words, metagenomics studies analyze millions of reads to work out the functional and taxonomic content of environmental microbial samples. A key step is finding out which genes are present, usually by aligning translated DNA against a protein reference database such as NCBI-nr or KEGG. BLASTX has long been the gold standard thanks to its high sensitivity, but it is far too slow for routine high-throughput use.
A number of faster approaches have been proposed, such as BLAT [5] and USEARCH [6]. Notably, RAPSearch2 (ref. 7) improves speed by a factor of up to 100 over BLASTX while maintaining a similar level of sensitivity. However, as the size and number of samples continue to grow, even faster methods are required. PAUDA [8] provides a 10,000-fold increase in speed over BLASTX but has very low sensitivity on the level of individual alignments, reporting only 2–3% of all BLASTX alignments.
That is, faster tools such as BLAT and USEARCH already exist, and RAPSearch2 is up to 100 times faster than BLASTX at similar sensitivity, but growing sample sizes and counts demand more. PAUDA is 10,000 times faster than BLASTX, yet at the level of individual alignments it reports only 2–3% of all BLASTX alignments.
Here we present DIAMOND (double index alignment of next-generation sequencing data), an open-source program that is ideally suited for replacing BLASTX in a high-throughput setting (http://ab.inf.uni-tuebingen.de/software/diamond and Supplementary Software). When targeting significant alignments against the NCBI-nr database with an expected value below 10⁻³, DIAMOND aligns short sequence reads at approximately 20,000 times the speed of BLASTX and has a similar level of sensitivity. Like BLASTX, DIAMOND is an ‘all mapper’ that attempts to determine exhaustively all significant alignments for a given query.
So DIAMOND is an open-source program meant to replace BLASTX in high-throughput settings. For significant alignments against NCBI-nr (expected value below 10⁻³), it aligns short reads about 20,000 times faster than BLASTX at similar sensitivity. Like BLASTX, it is an "all mapper": it tries to find exhaustively every significant alignment for a given query.
Most sequence comparison programs, including BLASTX, follow the seed-and-extend paradigm. In this two-phase approach, users search first for matches of seeds (short stretches of the query sequence) in the reference database, and this is followed by an ‘extend’ phase that aims to compute a full alignment.
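Put differently, most sequence comparison programs, BLASTX included, follow the seed-and-extend paradigm: first search the reference database for matches to seeds (short stretches of the query sequence), then run an "extend" phase that tries to compute a full alignment.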
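To make the two phases concrete, here is a minimal Python sketch of seed-and-extend. It is a toy under stated assumptions, not how BLASTX or DIAMOND is implemented: the function names are hypothetical, seeds are plain consecutive k-mers, and the extension is ungapped and requires exact matches, whereas real tools score extensions with a substitution matrix and allow gaps.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def extend(query, reference, q_pos, r_pos, k):
    """Grow an exact seed hit left and right while the residues keep matching.

    Simplification (assumption): ungapped, exact-match extension only.
    Returns (query start, reference start, length) of the extended match.
    """
    left = 0
    while (q_pos - left > 0 and r_pos - left > 0
           and query[q_pos - left - 1] == reference[r_pos - left - 1]):
        left += 1
    right = k
    while (q_pos + right < len(query) and r_pos + right < len(reference)
           and query[q_pos + right] == reference[r_pos + right]):
        right += 1
    return q_pos - left, r_pos - left, left + right


def seed_and_extend(query, reference, k=3):
    """Seed phase: look up each query k-mer; extend phase: grow each hit."""
    index = build_index(reference, k)
    hits = set()
    for q_pos in range(len(query) - k + 1):
        for r_pos in index.get(query[q_pos:q_pos + k], []):
            hits.add(extend(query, reference, q_pos, r_pos, k))
    return sorted(hits)


if __name__ == "__main__":
    print(seed_and_extend("AAGIVG", "MKVLAAGIVGLLLAQ"))
```

Several overlapping seeds can extend to the same region, which is why the hits are collected in a set before being returned.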
Sequence comparison programs typically precompute an index that holds all seed locations in the reference sequences. A file of queries is then linearly scanned, and the seeds of a given query are matched to seeds in the reference sequences by random-access lookups in the index. In contrast, DIAMOND uses double indexing, an approach that determines the list of all seeds and their locations in both the query and reference sequences. The two lists are sorted lexicographically and traversed together in a linear manner to determine all matching seeds and their corresponding locations. Double indexing takes advantage of the cache hierarchy by increasing data locality, thus reducing the demands on main memory bandwidth.
In other words, the usual design precomputes an index of all seed locations in the reference, scans the query file linearly, and matches each query's seeds through random-access lookups in that index. DIAMOND instead builds seed lists with locations for both the queries and the references, sorts both lexicographically, and walks them together in one linear pass to find all matching seeds. This double indexing improves data locality, so it exploits the cache hierarchy and reduces the demand on main memory bandwidth.
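The sort-then-traverse idea can be sketched in a few lines of Python. This is only an illustration of the principle as the paragraph describes it, assuming plain consecutive k-mers as seeds and hypothetical function names; DIAMOND's actual C++ data structures are not shown in the text.

```python
def seed_list(sequences, k):
    """Return a lexicographically sorted list of (seed, sequence id, offset)."""
    seeds = [
        (seq[pos:pos + k], seq_id, pos)
        for seq_id, seq in enumerate(sequences)
        for pos in range(len(seq) - k + 1)
    ]
    seeds.sort()
    return seeds


def matching_seeds(queries, references, k=3):
    """Find all seed matches by walking two sorted seed lists in parallel."""
    q_seeds = seed_list(queries, k)
    r_seeds = seed_list(references, k)
    matches = []
    i = j = 0
    while i < len(q_seeds) and j < len(r_seeds):
        q_key, r_key = q_seeds[i][0], r_seeds[j][0]
        if q_key < r_key:
            i += 1
        elif q_key > r_key:
            j += 1
        else:
            # Both lists hold a run of entries with the same seed:
            # find where each run ends, then pair up every location.
            i_end = i
            while i_end < len(q_seeds) and q_seeds[i_end][0] == q_key:
                i_end += 1
            j_end = j
            while j_end < len(r_seeds) and r_seeds[j_end][0] == q_key:
                j_end += 1
            for _, q_id, q_pos in q_seeds[i:i_end]:
                for _, r_id, r_pos in r_seeds[j:j_end]:
                    matches.append((q_key, (q_id, q_pos), (r_id, r_pos)))
            i, j = i_end, j_end
    return matches


if __name__ == "__main__":
    for match in matching_seeds(["AAGIV"], ["KAAGL", "GIVAA"]):
        print(match)
```

Both lists are read strictly front to back, so memory is accessed sequentially instead of jumping around an index, which is the data-locality benefit the paper points to.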
Most sequence comparison programs, including BLASTX and RAPSearch2, use single consecutive seeds, which need to be short (length 3–6 amino acids) to ensure sensitivity. To increase speed without losing sensitivity, DIAMOND uses spaced seeds, that is, longer seeds in which only a subset of positions are used [9,10]. The number and exact layout of those positions are called the weight and shape of the spaced seed, respectively. To achieve high sensitivity, DIAMOND uses a set of four carefully chosen shapes [11] of length 15–24 and weight 12 by default. The most sensitive version of DIAMOND uses 16 shapes of weight 9. In addition, DIAMOND uses a reduced amino acid alphabet of size 11 to enhance sensitivity [12]. A simple exact match criterion determines which seeds are passed on to the extension phase, in which a Smith-Waterman alignment [13] is computed.
In other words, tools like BLASTX and RAPSearch2 use single consecutive seeds, which must stay short (3–6 amino acids) to remain sensitive. DIAMOND instead uses spaced seeds: longer seeds in which only some positions count. The number of counted positions is the seed's weight, and their layout is its shape. By default DIAMOND uses four carefully chosen shapes of length 15–24 and weight 12; its most sensitive mode uses 16 shapes of weight 9. It also works on a reduced amino acid alphabet of size 11 to boost sensitivity. Seeds that match exactly are passed to the extension phase, where a Smith-Waterman alignment is computed. (I don't fully understand this part. From here on I will mostly skip what I don't understand to save time and only present what I want readers to see.)
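A small Python sketch may help with the terms weight, shape and reduced alphabet. Everything specific here is an assumption for illustration: the 11-class residue grouping is one plausible grouping of similar amino acids and should not be read as a confirmed copy of DIAMOND's table, the 4-position shape is a toy (the paper's shapes have length 15–24), and the function names are hypothetical.

```python
# Assumed, illustrative grouping of the 20 amino acids into 11 classes.
GROUPS = ["KREDQN", "C", "G", "H", "ILV", "M", "F", "Y", "W", "P", "STA"]

# Represent each class by its first letter, e.g. R, E, D, Q, N all become K.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq)


def weight(shape):
    """Weight of a spaced seed = number of positions that are compared."""
    return shape.count("1")


def spaced_seed(seq, pos, shape):
    """Extract the seed at `pos`: keep residues under '1', skip those under '0'.

    Returns None when the shape does not fit inside the sequence.
    """
    if pos + len(shape) > len(seq):
        return None
    window = seq[pos:pos + len(shape)]
    return "".join(REDUCE[aa] for aa, bit in zip(window, shape) if bit == "1")


if __name__ == "__main__":
    shape = "1101"  # toy shape: length 4, weight 3
    print(weight(shape))
    print(spaced_seed("KLMA", 0, shape), spaced_seed("RIWS", 0, shape))
```

With this toy setup, `KLMA` and `RIWS` yield the same seed even though no residue is identical: the third position is ignored by the shape, and the other three fall into the same reduced classes. That is how a longer seed can still tolerate differences between related proteins.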
In a recent metagenomic study of 12 permafrost samples [14], a BLASTX comparison of 176 million high-quality DNA reads against the KEGG reference database [3] was reported to require 800,000 CPU hours at a supercomputing center [15]. When we used DIAMOND with its default settings, the analysis of all 246 million reads took 2.3 h on a single workstation, producing a total of 568.9 million alignments on 43 million reads.
To put numbers on it: in a recent metagenomic study of 12 permafrost samples, running BLASTX on 176 million high-quality DNA reads against KEGG reportedly took 800,000 CPU hours at a supercomputing center. DIAMOND with default settings processed all 246 million reads in 2.3 hours on a single workstation and produced 568.9 million alignments covering 43 million reads.
To systematically compare the performance of DIAMOND (version 0.4.7) with BLASTX (version 2.2.28+) and RAPSearch2 (version 2.18), we downloaded publicly available metagenome data sets produced with Illumina (permafrost [14] and Human Microbiome Project [16] data), Ion Torrent (ERP004234), 454 Titanium (SRR1298978 and SRR1298979) and Sanger [17] sequencing technologies as well as open reading frames (ORFs) predicted from a microbial assembly [18]. We used all three programs to align all data sets against NCBI-nr (version May 2013) on a single workstation using 48 cores. We ran both DIAMOND and RAPSearch2 using a fast setting (−fast) and a sensitive setting (−sensitive).
That is, the benchmark compares DIAMOND 0.4.7, BLASTX 2.2.28+ and RAPSearch2 2.18 on public metagenome data sets from Illumina (permafrost and Human Microbiome Project), Ion Torrent, 454 Titanium and Sanger sequencing, plus ORFs predicted from a microbial assembly. All three programs aligned every data set against NCBI-nr (May 2013) on one 48-core workstation, and DIAMOND and RAPSearch2 were each run in a fast and a sensitive setting.
DIAMOND-fast uses four seed shapes and runs around 20,000 times faster than BLASTX. DIAMOND-sensitive uses 16 seed shapes and runs about 2,000 times faster than BLASTX, aligning 99% of reads that have a BLASTX alignment and obtaining over 92% of all BLASTX alignments, on Illumina reads. DIAMOND-fast was 40 times faster than RAPSearch2-fast, with greater sensitivity, and was up to 500 times faster than RAPSearch2-sensitive, with similar sensitivity. DIAMOND-fast and DIAMOND-sensitive consistently outperformed RAPSearch2-fast and RAPSearch2-sensitive, respectively, on all data sets. The peak memory usage of DIAMOND-default was 100 GB, but the program can be configured to require less than 32 GB of memory, at the expense of a 30% reduction in speed.
The results in brief: DIAMOND-fast (four seed shapes) is around 20,000 times faster than BLASTX. DIAMOND-sensitive (16 seed shapes) is about 2,000 times faster and, on Illumina reads, aligns 99% of the reads that BLASTX aligns and recovers over 92% of all BLASTX alignments. DIAMOND-fast is 40 times faster than RAPSearch2-fast with greater sensitivity, and up to 500 times faster than RAPSearch2-sensitive with similar sensitivity. In each mode DIAMOND beat the corresponding RAPSearch2 mode on every data set. Peak memory with default settings was 100 GB, but DIAMOND can be configured to stay under 32 GB at the cost of a 30% drop in speed.
Whereas the functional analysis of metagenome data sets has in the past been restricted by the requirement for supercomputing services, researchers can now use DIAMOND to perform functional analysis routinely on all their data sets. An analysis that takes 1 month with BLASTX takes only a few minutes with DIAMOND. Functional analysis of all 35 billion Illumina reads generated by the Human Microbiome Project, the largest published metagenome data set to date, would take about 2 weeks on a single server with DIAMOND.
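In other words, functional analysis of metagenome data sets used to be limited by the need for supercomputing services (a weakness my own tool CyDotian shares), whereas researchers can now run it routinely on all their data with DIAMOND. An analysis that takes a month with BLASTX takes only a few minutes with DIAMOND, and all 35 billion Illumina reads from the Human Microbiome Project, the largest published metagenome data set to date, would take about 2 weeks on a single server.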
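As a quick sanity check on these figures, the short Python calculation below converts the numbers quoted from the paper into throughput. The only inputs are the values stated above; treating "about 2 weeks" as exactly 14 days is my assumption, and the helper name is hypothetical. Note that CPU hours and wall-clock hours are different units, so the BLASTX figure is shown separately and not turned into a speedup ratio.

```python
def reads_per_second(reads, wall_clock_hours):
    """Average throughput given a read count and a wall-clock duration."""
    return reads / (wall_clock_hours * 3600)


# DIAMOND on the permafrost data: 246 million reads in 2.3 h on one workstation.
permafrost_rate = reads_per_second(246e6, 2.3)

# DIAMOND on the Human Microbiome Project estimate: 35 billion reads in
# about 2 weeks on one server (assumption: exactly 14 days of 24 h).
hmp_rate = reads_per_second(35e9, 14 * 24)

# BLASTX on the permafrost data: 176 million reads in 800,000 CPU hours.
blastx_reads_per_cpu_hour = 176e6 / 800_000

if __name__ == "__main__":
    print(f"DIAMOND, permafrost: {permafrost_rate:,.0f} reads/s")
    print(f"DIAMOND, HMP estimate: {hmp_rate:,.0f} reads/s")
    print(f"BLASTX, permafrost: {blastx_reads_per_cpu_hour:,.0f} reads per CPU hour")
```

The two DIAMOND figures land close to each other, at roughly 29,000 to 30,000 reads per second, so the 2-week estimate for the Human Microbiome Project is consistent with the measured permafrost run.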
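The paper continues with the ONLINE METHODS section, which gives examples showing that DIAMOND's results are better than those of existing tools, using Human Microbiome data. It also includes a PCoA analysis and a cost-effectiveness comparison covering memory, price and so on.
I won't translate that part, because I don't fully understand the methods myself. One important point: the paper states that DIAMOND is written in C++, and when I downloaded the supplementary code I saw that shell scripts are used alongside it. Many well-known algorithmic tools have a C/C++ core. MCScanX, for example, is written in C++ and combined with Java, shell and other languages for calling and analysis. Nice! The core algorithm of my own CyDotian is implemented in C, with Python/shell on top to drive the analysis. I believe the main reason is speed; after all, that is the road I took myself...
Summary: the paper benchmarks DIAMOND against the gold standard BLASTX and the newer, faster RAPSearch2, and DIAMOND comes out best. Nice!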
valuable references
Kent, W.J. BLAT—The BLAST-Like Alignment Tool, Genome Res. 12, 656–664 (2002).
Edgar, R.C. Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26, 2460–2461 (2010).
Huson, D.H. & Xie, C. A poor man’s BLASTX—high-throughput metagenomic protein database search using PAUDA, Bioinformatics 30, 38–39 (2014).
Zhao, Y., Tang, H. & Ye, Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data, Bioinformatics 28, 125–126 (2012).
Next time I'll continue with the second version of DIAMOND. Stay tuned!