Literature Reading 3.5 | RAviz: Detecting False-positive Alignments in Repetitive Genomic Regions


Author: 龙star180 | Published: 2022-08-15 20:04

Journal

Horticulture Research (impact factor 7.291, Q1)

Another excellent piece of work from 许东 (Xu Dong).

RAviz: A Visualization Tool for Detecting False-positive Alignments in Repetitive Genomic Regions

Running head: Visualizing false-positive alignments in repeats

Dear Editor,

For any species, a high-quality reference genome is the basis for almost every kind of genomic analysis. For decades, however, the reference sequences of important eukaryotic genomes remained incomplete because repetitive genomic regions were missing, including tandem repeats such as centromeres, telomeres and ribosomal DNA, and interspersed repeats such as transposons and segmental duplications. Incomplete reference genomes not only cause analytical errors such as false-positive variant calls but also impede the study of repeat-related diseases such as cancer and infertility. Fortunately, PacBio high-fidelity (HiFi) long reads and ONT ultra-long (UL) reads, with their respective advantages in accuracy and length, now make it possible to solve the repeat assembly problem, and complete (also called T2T, short for telomere-to-telomere) or near-complete reference genomes have recently been built for important eukaryotic species such as human, Arabidopsis, rice and tomato.

Because no assembler can yet generate a complete reference genome fully automatically, these T2T projects all require a large amount of manual curation. This manual work focuses on producing continuous and correct sequences in repetitive regions during assembly steps such as contig assembly, scaffolding, polishing and gap filling. However, because repeat copies are highly similar, sequences from different copies may be aligned to each other by mistake, which leads to mis-assemblies in these steps. To filter out these false-positive alignments and ensure that repeats are assembled correctly, copy-specific features such as SNPs and structural variations (SVs) have to be used. More specifically, an alignment in which the copy-specific features of the two sequences do not match should be identified as a false positive. Because SNP and SV detection is difficult and computationally expensive, in practice people use rare k-mers (subsequences of k nucleotides that appear only a small number of times in the whole genome) as markers in place of these copy-specific features, and alignments whose rare k-mers do not match should be removed. This strategy has been widely used by T2T-related automatic tools such as CentroFlye [8] and ABruijn [9], and in the manual curation of the human T2T assembly. However, no visualization tool has been available that shows how rare k-mers match within alignments (existing alignment viewers such as IGV [10] show alignments without k-mer matching), so carrying out this strategy manually is extremely tedious and time-consuming. This is why other T2T or near-T2T projects either skip this step or use a rougher strategy to filter false-positive alignments, which leads to lower correctness than the human T2T assembly. Here we developed an efficient alignment visualization tool called RAviz to meet this need. With the alignments and the corresponding k-mer matching profiles clearly visualized by RAviz, it is much easier and faster for users to decide which alignments are false and to remove them in T2T assembly projects.
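The rare k-mer idea can be illustrated with a small Python sketch. This is not RAviz's code: the function names, the `max_count` threshold and the toy sequences are all hypothetical, and the sketch ignores reverse complements and sequencing errors for brevity. It counts k-mers across the reference, keeps those in a target region that occur at most `max_count` times genome-wide, and reports what fraction of them also appear in the aligned query.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_genome_kmers(sequences, k):
    """Count how often each k-mer occurs across all reference sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return counts


def rare_kmer_support(query, target_region, genome_counts, k, max_count=1):
    """Fraction of the target region's rare k-mers that also occur in query.

    Returns None when the region has no rare k-mers, so nothing can be decided.
    """
    query_kmers = set(kmers(query, k))
    rare = [km for km in kmers(target_region, k)
            if genome_counts[km] <= max_count]
    if not rare:
        return None
    return sum(km in query_kmers for km in rare) / len(rare)


# Toy data (hypothetical): two repeat copies that differ by a single base.
copy_a = "ACGTACGGTTCA"
copy_b = "ACGTACGCTTCA"
genome_counts = count_genome_kmers([copy_a, copy_b], k=4)

read = copy_a  # an error-free read that truly comes from copy A
support_a = rare_kmer_support(read, copy_a, genome_counts, k=4)
support_b = rare_kmer_support(read, copy_b, genome_counts, k=4)
print(support_a, support_b)
```

In this toy case every rare k-mer of copy A is found in the read and none of copy B's are, so an alignment of the read to copy B would be the one to discard.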

RAviz is open-source software for Windows and macOS that visualizes alignments between any types of DNA sequences, such as reads, contigs, scaffolds or reference genomes, in three modes (Figure 1A). First, in the global mode, all alignments across the whole genome are drawn on one or more consecutive pages to give users an overview (Figure 1C). Second, two sequences can be specified by ID, and the alignments between them are shown. Third, a single sequence can be specified, and all alignments involving it are shown (Figure 1B). In each mode, users can filter the displayed alignments by two mapping scores: MAPQ (a widely used score for the quality and confidence of each alignment) and KMAPQ (a score defined by RAviz for the same purpose, based on the matching of rare k-mers). To inspect a specific region of an alignment, users can zoom in by holding the “Ctrl” key and scrolling the mouse wheel, and pan by dragging with the mouse.
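The three viewing modes and the MAPQ filter can be thought of as one selection step over alignment records. The sketch below is only an illustration under that assumption; `Alignment` and `select_alignments` are hypothetical names, not part of RAviz. The ctg1981 records use the MAPQ value reported in the paper, while `ctg_other` is invented for the example.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    query: str
    target: str
    mapq: int


def select_alignments(alignments, ids=(), min_mapq=0):
    """Pick the alignments to draw.

    ids=()          -> global mode: every alignment
    ids=(a,)        -> every alignment involving sequence a
    ids=(a, b)      -> only alignments between a and b
    Alignments with MAPQ below min_mapq are always dropped.
    """
    if len(ids) > 2:
        raise ValueError("specify at most two sequence IDs")
    wanted = set(ids)
    return [
        aln for aln in alignments
        if aln.mapq >= min_mapq and wanted <= {aln.query, aln.target}
    ]


alignments = [
    Alignment("ctg1981", "Chr9", 60),
    Alignment("ctg1981", "Chr2", 60),
    Alignment("ctg_other", "Chr2", 12),  # hypothetical low-confidence hit
]

global_view = select_alignments(alignments)
confident_view = select_alignments(alignments, min_mapq=60)
single_view = select_alignments(alignments, ids=("ctg1981",))
pair_view = select_alignments(alignments, ids=("ctg1981", "Chr2"))
```

The subset test `wanted <= {query, target}` covers all three modes at once: the empty set matches everything, one ID matches any alignment that touches it, and two IDs match only that pair.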

Figure 1

Figure 1. The interface and examples of RAviz. A, The interface of RAviz: 1) the parameter module, in which the drawing parameters are entered; 2) the data module, which shows the original data related to the drawing as a table; 3) the PAF drawing module, for specifying the input PAF file and related parameters; 4) the rare k-mer drawing module, for specifying the self-defined input file produced by the pre-processing program and related parameters. B, An example showing all alignments related to a specified sequence. C, An example showing alignments globally. D, Two examples of visualization with alignment areas only (shown in grey); the two sequences in the left alignment are from the same strand of the genome, while those in the right alignment are from different strands. E, An example of how RAviz helps detect false-positive alignments in repetitive regions; the two alignments also illustrate visualization with both the alignment area and the rare k-mer matching profile (each pair of matching rare k-mers is connected by a red line); in both alignments, the two sequences are from different strands.

RAviz accepts two types of input. Standard PAF files (PAF is a sequence alignment format that can be generated by minimap2 [11]) can be loaded directly if users only need to visualize the alignment areas without showing k-mers (Figure 1D). Otherwise, the RAviz pre-processing program (https://github.com/xianjia10/kmer-map.git) must first be run on the PAF files and FASTA files (FASTA is a DNA sequence format) to generate a self-defined input file containing all the required information, such as alignment areas, k-mer positions and k-mer matches; with this input file, both the alignment area and the matching of rare k-mers can be shown for each alignment (Figure 1E). The pre-processing program needs to be run on a Linux system, because the alignment and k-mer files are large in most situations.
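For readers unfamiliar with PAF, the sketch below parses the 12 mandatory tab-separated columns of one PAF line. It is a minimal illustration rather than RAviz's own loader; `PafRecord` and `parse_paf_line` are hypothetical names, and the example line is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PafRecord:
    query_name: str
    query_length: int
    query_start: int      # 0-based, inclusive
    query_end: int        # 0-based, exclusive
    strand: str           # "+" same strand, "-" opposite strand
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    residue_matches: int
    block_length: int
    mapq: int


def parse_paf_line(line):
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
    )


# A made-up PAF line, followed by one optional tag that the parser skips.
example_line = (
    "contig_1\t5000\t100\t2900\t-\tchr_A\t100000\t40000\t42800"
    "\t2750\t2810\t60\ttp:A:P\n"
)
record = parse_paf_line(example_line)
aligned_target_span = record.target_end - record.target_start
approx_identity = record.residue_matches / record.block_length
```

The start and end coordinates of the query and target, together with the strand, are all that is needed to draw the alignment area; the rare k-mer matches require the separate pre-processed file described above.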

Figure 1E shows a real example in which RAviz helps detect a false-positive alignment in a repetitive region. In this example, minimap2 aligned a tomato contig called ctg1981 to both copies (Chr9:33,008,416-33,011,210 and Chr2:13,008,040-13,010,920) of an interspersed repeat in the tomato reference genome. Because both alignments have a high mapping score (MAPQ=60), it is difficult to decide from MAPQ alone which copy is correct for T2T assembly. With the rare k-mer matching profile visualized by RAviz, however, it is easy to see that the alignment to chromosome 9 is supported by a large number of rare k-mers, while the alignment to chromosome 2 is a typical false positive and should be removed.

RAviz was implemented in Python 3.8 with PyQt5. The implementation has to be efficient in both time and memory, given the huge numbers of alignments and rare k-mers that exist for almost any eukaryotic genome. For example, the tomato genome (a homozygous diploid genome of ~775 M nucleotides) contains about 10.9 G unique k-mers (k-mers that appear only once in the whole genome), and there are 70,871 alignments between the corresponding HiFi contigs (not to mention reads) and the reference genome. To improve memory efficiency, the pre-processing program of RAviz stores k-mer positions instead of k-mer sequences. To improve time efficiency, instead of searching the input file sequentially, RAviz builds two index files that map reference and query IDs to the corresponding line numbers in the file, so that the alignment information for a given reference or query can be found very quickly. Generating a 900 Mb input file takes RAviz 0.7 hours with 18 CPUs.
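The indexing idea can be sketched as follows. The paper describes index files that map IDs to line numbers; this sketch makes a different, assumed choice and records byte offsets in an in-memory dictionary, so that `seek` can jump straight to each record. The function names and the three-column demo file are hypothetical and do not reflect RAviz's real file layout.

```python
import os
import tempfile
from collections import defaultdict


def build_offset_index(path, id_column):
    """Map each ID in the given column to the byte offsets of its lines."""
    index = defaultdict(list)
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            key = line.rstrip(b"\n").split(b"\t")[id_column].decode()
            index[key].append(offset)
    return dict(index)


def fetch_lines(path, offsets):
    """Read only the lines that start at the given byte offsets."""
    lines = []
    with open(path, "rb") as handle:
        for offset in offsets:
            handle.seek(offset)
            lines.append(handle.readline().decode().rstrip("\n"))
    return lines


# Hypothetical three-column demo file: query ID, reference ID, MAPQ.
rows = ["contig_1\tchr_A\t60", "contig_2\tchr_A\t12", "contig_1\tchr_B\t60"]
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write(("\n".join(rows) + "\n").encode())
    demo_path = tmp.name

query_index = build_offset_index(demo_path, id_column=0)
reference_index = build_offset_index(demo_path, id_column=1)
hits_for_contig_1 = fetch_lines(demo_path, query_index["contig_1"])
hits_for_chr_A = fetch_lines(demo_path, reference_index["chr_A"])
os.remove(demo_path)
```

Building the index costs one sequential pass over the file, after which each lookup reads only the matching records instead of scanning the whole input.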

To conclude, RAviz efficiently visualizes sequence alignments in repetitive genomic regions together with their rare k-mer matching profiles, and thus helps users detect and remove false-positive alignments and generate high-quality assemblies in T2T reference genome projects. In the next version of RAviz, we will add interactive functions and develop a Linux version.

Availability of data and materials

RAviz and the testing data can be freely downloaded at https://github.com/xianjia10/RAviz.git. The manual and videos can be found in the tutorial subfolder under the installation directory of RAviz.

Impressive. It is rare to see desktop software published in a top Q1 journal. If only I could publish a paper like this myself...
