期刊
Nucleic Acids Research (19.160/Q1)
REPuter: the manifold applications of repeat analysis on a genomic scale (2001)
REPuter:重复分析在基因组规模上的多种应用

ABSTRACT
The repetitive structure of genomic DNA holds many secrets to be discovered. A systematic study of repetitive DNA on a genomic or inter-genomic scale requires extensive algorithmic support. The REPuter program described herein was designed to serve as a fundamental tool in such studies. Efficient and complete detection of various types of repeats is provided together with an evaluation of significance and interactive visualization. This article circumscribes the wide scope of repeat analysis using applications in five different areas of sequence analysis: checking fragment assemblies, searching for low copy repeats, finding unique sequences, comparing gene structures and mapping of cDNA/ EST sequences.
基因组 DNA 的重复结构蕴藏着许多有待发现的秘密。 在基因组或基因组间规模上对重复 DNA 进行系统研究需要广泛的算法支持。 此处描述的 REPuter 程序旨在用作此类研究中的基本工具。 提供了对各种类型重复的有效和完整的检测以及重要性评估和交互式可视化。 本文使用在序列分析的五个不同领域中的应用程序来限制重复分析的广泛范围:检查片段组装、搜索低拷贝重复、寻找独特的序列、比较基因结构和 cDNA/EST 序列的映射。
INTRODUCTION
One of the most striking features of DNA is the extent to which it consists of repeated substrings. This is particularly true of eukaryotes. For example, it is estimated that families of reiterated sequences account for >50% of the human genome (1). The presence of palindromic (i.e. reverse complemented) repeats hints to the formation of hairpin structures that may provide some structural or replicational mechanism (2). Furthermore, some repeats have been shown to affect bacterial virulence by acting as the molecular basis of a mechanism used to successfully colonize and infect different human individuals (3). This makes repeats an interesting research topic, and indeed there is a vast literature on repetitive structures and their hypothesized functional and evolutionary role.
DNA 最显著的特征之一是它由重复子串组成的程度。 真核生物尤其如此。 例如,据估计,重复序列家族占人类基因组的 50% 以上 (1)。 回文(即反向互补)重复的存在暗示了发夹结构的形成,这可能提供一些结构或复制机制 (2)。 此外,一些重复已被证明通过充当用于成功定殖和感染不同人类个体的机制的分子基础来影响细菌毒力 (3)。 这使得重复成为一个有趣的研究课题,确实有大量关于重复结构及其假设的功能和进化作用的文献。
The manifold applications of repeat analysis
重复分析的多种应用
The obvious task of a computer program for automatic repeat analysis is to find in, for example, a complete genome, all repeats above a given level of significance. Such analysis alone can give interesting insights into the structure of the genome like an abnormal distribution of repetitive elements or recent duplication events. Usually, the repeats will then be further investigated by other methods, typically starting with a database search for any known characteristics of the newly detected repeat sequence.
用于自动重复分析的计算机程序的明显任务是在,例如,在完整的基因组中,找到高于给定显著性水平的所有重复——(CyDotian亦是如此)。 仅此类分析就可以对基因组的结构提供有趣的见解,例如重复元件的异常分布或最近的重复事件。 通常,随后将通过其他方法进一步研究重复,典型地是从数据库搜索新挖掘到的重复序列的任何已知特性开始——(例如,把挖掘到的大片段重复序列提取出来然后放到NCBI去search,看看有什么功能,我擦,王老师就是这么做的啊,牛掰。。。我为什么这么菜...)。
Besides presenting a program for this traditional type of repeat analysis, the emphasis of this article is on the fact that an efficient and complete computation of repeats is a powerful algorithmic technique in a variety of tasks normally not subsumed under repeat analysis. In fact, of the five applications that we present in Applications, only the one dealing with low copy repeats related to human malformations reports on the traditional kind of repeat analysis. The other applications that we consider go far beyond this: checking sequence assemblies; finding unique sequences as a preprocessing step of PCR primer or DNA oligo chip design; comparing gene structures in order to verify gene intron/exon predictions; and mapping ESTs to genomic sequence.
除了为这种传统类型的重复分析提供一个程序外,本文的重点是这样一个事实,即有效和完整的重复计算是一种强大的算法技术,适用于通常不属于重复分析的各种任务。 事实上,在我们在应用程序中展示的五种应用程序中,只有一个处理与人类畸形相关的低拷贝重复报告了传统的重复分析。 我们考虑的其他应用程序远不止于此:检查序列程序集; 寻找独特的序列作为 PCR 引物或 DNA 寡核苷酸芯片设计的预处理步骤; 比较基因结构以验证基因内含子/外显子预测; 并将 EST 映射到基因组序列。
This versatility (yet to be described) results from the fact that a repeat is a mathematically simple object—a substring w of a sequence S occurring twice in S (allowing a certain amount of error). This has two consequences: a substring w that is not a repeat in S is unique in S; a common substring of sequences S1 and S2 (also allowing a certain amount of error) is a repeat of the concatenated sequence S1S2. With these two simple observations it is immediately clear that repeat analysis can be used for the various types of sequence analysis tasks mentioned above, once it has been rigorously implemented so as to be applicable to whole genome sequences.
这种多功能性(尚待描述)源于重复是一个数学上简单的对象这一事实 - 序列 S 的子串 w 在 S 中出现两次(允许一定量的错误)。 这有两个后果:在 S 中不是重复的子串 w 在 S 中是唯一的; 序列 S1 和 S2 的公共子串(也允许一定量的错误)是连接序列 S1S2 的重复。 通过这两个简单的观察,可以立即清楚地看出,重复分析可以用于上述各种类型的序列分析任务,只要它被严格实施以适用于全基因组序列。
The challenge of repeat analysis on a genomic scale
A tool for the systematic study of the repetitive structure of sequences as large as complete genomes must satisfy the following criteria. (i) Efficiency: to analyze complete genomes, up to 3–4 billion bp, the space and time used by the algorithm must scale practically linear with the sequence length. (ii) Flexibility and significance: a biologically realistic model must recognize not only exact, but also degenerate, not only direct, but also palindromic repeats. To determine the significance of a repeat, a statistical assessment is mandatory. (iii) Interactive visualization: the large amount of data generated requires an overview of the whole input sequence, but the user must also be able to zoom in on details of particular repetitive regions. (iv) Compositionality: as repeat finding is considered to be a basic step in genome structure analysis, the program must provide a simple interface to enable composition with other advanced analysis tools.
用于系统研究与完整基因组一样大的序列重复结构的工具必须满足以下标准。 (i) 效率:要分析高达 3-40 亿 bp 的完整基因组,算法使用的空间和时间必须与序列长度几乎成线性关系。 (ii) 灵活性和重要性:一个生物学上现实的模型必须不仅能识别完全的,而且能识别退化的,不仅能识别正向的,而且能识别回文的重复——(CyDotian都有,MUMmer只能识别完全的)。 为了确定重复的重要性,必须进行统计评估。 (iii) 交互式可视化:生成的大量数据需要对整个输入序列进行概览,但用户还必须能够放大特定重复区域的细节。 (iv) 组合性:由于重复发现被认为是基因组结构分析的基本步骤,因此程序必须提供一个简单的界面,以便与其他高级分析工具进行组合。
The REPuter program described herein satisfies these requirements in the following way. The search engine REPfind of REPuter uses an efficient and compact implementation of suffix trees in order to locate exact repeats in linear space and time. It has been estimated in Kurtz (4) that this time-critical task can be done in linear time for sequences up to the size of the human genome. These exact repeats are used as seeds from which significant degenerate repeats are constructed allowing for mismatches, insertions and deletions. Yet, our program is not heuristic: it guarantees to find all degenerate repeats according to the parameters specified by the user—minimum length and maximum number of errors. The output is sorted by significance scores (E-values). Besides degenerate direct repeats, REPfind is capable of detecting degenerate palindromic repeats. One of the basic innovations of REPfind is the ability to process very large DNA sequences very fast. For example, REPfind computes all 18 391 maximal degenerate repeats with a length of at least 700 bp with at most 10 errors in the complete genome of Drosophila melanogaster at a speed of 100 000 bases per second. More running time and space results are reported in Kurtz et al. (5) and in Applications.
此处描述的 REPuter 程序通过以下方式满足这些要求。 REPuter 的搜索引擎 REPfind 使用后缀树的高效且紧凑的实现,以便在线性空间和时间中定位完全的重复——(跟MUMmer一样啊)。 Kurtz (4) 中估计,对于高达人类基因组大小的序列,这个时间关键任务可以在线性时间内完成。这些完全重复被用作种子,从中构建显著的退化重复,允许错配、插入和缺失。然而,我们的程序不是启发式的:它保证根据用户指定的参数(最小长度和最大错误数)找到所有退化重复——(将最大错误数也作为一个参数了,其实就是identity的变形)。输出按显著性分数(E 值)排序。除了退化正向重复,REPfind 还能够检测退化回文重复。 REPfind 的一项基本创新是能够非常快速地处理非常大的 DNA 序列。例如,REPfind 以每秒 100 000 个碱基的速度计算所有 18 391 个最大退化重复,长度至少为 700 bp,在黑腹果蝇的完整基因组中最多有 10 个错误。 Kurtz 等人报告了更多的运行时间和空间结果。 (5) 和在应用程序中。
The output of the search engine REPfind is displayed in the form of a repeat graph by the interactive visualization program REPvis. This feature is described in detail in Interactive Visualization.
搜索引擎 REPfind 的输出由交互式可视化程序 REPvis 以重复图的形式显示。 此功能在交互式可视化中有详细描述。
The stand-alone version of REPuter described here is available free of charge for non-commercial purposes. An online version of REPuter providing some basic functionality can be used on the Bielefeld Bioinformatics web server (http://bibiserv.techfak.uni-bielefeld.de/reputer/). There are currently more than 50 laboratories using the stand-alone version, and more than 340 submissions to the online version per month.
此处描述的独立版本的 REPuter 可免费用于非商业目的。 可以在比勒费尔德生物信息学网络服务器 (http://bibiserv.techfak.uni-bielefeld.de/reputer/) 上使用提供一些基本功能的 REPuter 在线版本。 目前有 50 多家实验室使用单机版,每月在线版提交超过 340 份。
Related work
There are many papers describing algorithms for finding repeats in a string [see Kurtz et al. (5) for an overview]. Here we concentrate on available software tools.
有许多论文描述了在字符串中查找重复的算法 [参见 Kurtz 等人。 (5) 概述]。 在这里,我们专注于可用的软件工具。
Two types of approaches to locate repeats in biological sequences can be distinguished. Methods of the first type, whose most prominent example is RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html;
可以区分两种定位生物序列中重复序列的方法。 第一种方法,其最突出的例子是RepeatMasker(http://ftp.genome.washington.edu/RM/RepeatMasker.html;
A.F.A.Smit and P.Green, unpublished), define a ‘repeat’ as a substring that is known to occur very often in a genome. Such substrings tend to confuse sequence analysis programs, and hence they are masked to avoid spurious results. Repeat masking programs use a dictionary of known repeat sequences and perform an exact or approximate string matching of the given sequence against all the dictionary entries.
A.F.A.Smit 和 P.Green,未发表),将“重复”定义为已知在基因组中经常出现的子串。 这样的子串往往会混淆序列分析程序,因此它们被屏蔽以避免虚假结果。 重复屏蔽程序使用已知重复序列的字典,并针对所有字典条目执行给定序列的精确或近似字符串匹配。
Programs of the other type, like REPuter, try to find repeats without prior knowledge, just from the nucleic acid sequence. The simplest approach to such a priori repeat detection is to look at a dot plot of the sequence against itself. Here, repeats show up as diagonal lines. Such dot plots can, for example, be produced by the programs Dotter [http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html; (6)] or Large Dot Plots (http://alces.med.umn.edu/rawdot.html; Virtual Genome Center, unpublished).
其他类型的程序,如 REPuter,试图在没有先验知识的情况下,仅从核酸序列中找到重复。 这种先验重复检测的最简单方法是查看序列自身的点图。 在这里,重复显示为对角线。 例如,此类点图可以由程序 Dotter [http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html; (6)] 或大点图(http://alces.med.umn.edu/rawdot.html;虚拟基因组中心,未发表)。
For the algorithmic detection of repeats several methods have been suggested, and any suite of biosequence analysis programs contains one or more repeat finding tools, like repeat in the GCG package [http://www.accelrys.com/products/gcg_wisconsin_package/; (7)] or etandem, equicktandem, einverted and palindrome in the EMBOSS package [http://www.hgmp.mrc.ac.uk/Software/EMBOSS/; (8)]. However, the algorithms behind these programs are heuristic and not very well described in the literature or the program documentation.
对于重复的算法检测,已经提出了几种方法,并且任何一套生物序列分析程序都包含一个或多个重复查找工具,例如 GCG 包中的重复 [http://www.accelrys.com/products/gcg_wisconsin_package/; (7)] 或 Etandem、equicktandem、einverted 和 palindrome 在 EMBOSS 包中 [http://www.hgmp.mrc.ac.uk/Software/EMBOSS/; (8)]。 然而,这些程序背后的算法是启发式的,在文献或程序文档中没有很好地描述。
Several stand-alone programs also exist for finding repeats. Dst (http://alces.med.umn.edu/newdst.html; Virtual Genome Center, unpublished) is a heuristic method based on filtration of fixed-length oligonucleotides. REPRO [http://mathbio.nimr.mrc.ac.uk/~rgeorge/repro/; (9)] finds repeats in protein sequences. The method is based on computation of non-overlapping local alignments, and hence uses time proportional to the square of the sequence length. This naturally restricts the search to short sequences. OligoRep (http://wwwmgs.bionet.nsc.ru/mgs/programs/oligorep/; unpublished) is in spirit similar to our method, but is restricted to short sequences (a maximal length of 1000 bp is recommended). A step further in repeat analysis is to cluster the repeats according to their position, for example, as described in Volfovsky et al. (10).
还存在几个用于查找重复的独立程序。 Dst(http://alces.med.umn.edu/newdst.html;Virtual Genome Center,未发表)是一种基于固定长度寡核苷酸过滤的启发式方法。 REPRO [http://mathbio.nimr.mrc.ac.uk/~rgeorge/repro/; (9)] 发现蛋白质序列中的重复。 该方法基于非重叠局部比对的计算,因此使用与序列长度的平方成正比的时间。 这自然将搜索限制为短序列。 OligoRep(http://wwwmgs.bionet.nsc.ru/mgs/programs/oligorep/;未发布)在本质上与我们的方法相似,但仅限于短序列(建议最大长度为 1000 bp)。 重复分析的另一个步骤是根据它们的位置对重复进行聚类,例如,如 Volfovsky 等人所述。 (10)。
A very important subtype of repeats are tandem repeats where the two (or more) copies of the repeat immediately follow each other in the DNA sequence. Programs specialized in finding tandem repeats include Tandem Repeat Finder [http://c3.biomath.mssm.edu/trf.html; (11)] and Tandyman (http://www.stdgen.lanl.gov/tandyman/index.html; unpublished).
一个非常重要的重复亚型是串联重复,其中重复的两个(或更多)拷贝在 DNA 序列中紧跟在彼此之后。 专门寻找串联重复的程序包括 Tandem Repeat Finder [http://c3.biomath.mssm.edu/trf.html; (11)] 和 Tandyman (http://www.stdgen.lanl.gov/tandyman/index.html;未发表)。
In all this work either the methods are restricted to small input or they do not implement the full model of degenerate repeats. REPuter provides the first solution to efficient and exhaustive repeat analysis of complete genomes. Thus, it is the only tool useful for applications beyond repeat analysis considered here.
在所有这些工作中,这些方法要么仅限于小输入,要么没有实现退化重复的完整模型。 REPuter 为完整基因组的高效和详尽重复分析提供了第一个解决方案。 因此,除了此处考虑的重复分析之外,它是唯一对应用有用的工具。
MODELS AND ALGORITHMS
Those readers who are less interested in technical details, but more in the applications of the REPuter program, can safely skip this section upon first reading.
那些对技术细节不太感兴趣,但对 REPuter 程序的应用更感兴趣的读者,在初读时可以放心地跳过本节。
Basic notions
Let S be a string of length |S| = n over an alphabet Σ. S[i] denotes the ith character of S, for i ∈[1, n]. S–1 denotes the reverse of S. For i ≤ j, S[i, j] denotes the substring of S starting with the ith and ending with the jth character of S. Substring S[i, j] is denoted by the pair of positions (i, j). The length of the substring (i, j) is l(i, j) = j – i + 1. To refer to the characters to the left and right of every character in S without worrying about the first and last character, we define S[0] and S[n + 1] to be two distinct characters not occurring anywhere else in S.
令 S 为长度为 |S| 的字符串 = n 在字母 Σ 上。 S[i] 表示 S 的第 i 个字符,因为 i ∈[1, n]。 S–1 表示 S 的倒数。对于 i ≤ j,S[i,j] 表示 S 的子串以 S 的第 i 个字符开始并以 S 的第 j 个字符结束。子串 S[i,j] 由对表示 位置 (i, j)。 子串 (i, j) 的长度是 l(i, j) = j – i + 1。为了引用 S 中每个字符左右的字符而不关心第一个和最后一个字符,我们定义 S[0] 和 S[n + 1] 是两个不同的字符,不会出现在 S 的其他任何地方。
A pair of positions (i1, j1), i1 ≤ j1, contains a pair (i2, j2), i2 ≤ j2, if and only if i1 ≤ i2 and j2 ≤ j1. A pair (p1, p2) of substrings (i.e. a pair of pairs of positions) contains a pair (p3, p4) of substrings if and only if p1 contains p3, and p2 contains p4.
一对位置 (i1, j1),i1 ≤ j1,包含一对 (i2, j2),i2 ≤ j2,当且仅当 i1 ≤ i2 且 j2 ≤ j1。 当且仅当 p1 包含 p3 且 p2 包含 p4 时,一对 (p1, p2) 子串(即一对位置)包含一对 (p3, p4) 子串。
A pair of substrings R = ((i1, j1),(i2, j2)) is an exact repeat if and only if (i1, j1) ≠ (i2, j2) and S[i1, j1] = S[i2, j2]. The length of R is l(R) = j1 – i1 + 1 = j2 – i2 + 1. An exact repeat R = ((i1, j1),(i2, j2)) is called ‘maximal’ if and only if S[i1 – 1] ≠ S[i2 – 1] and S[j1 + 1] ≠ S[j2 + 1] [see Gusfield (12)].
一对子串 R = ((i1, j1),(i2, j2)) 当且仅当 (i1, j1) ≠ (i2, j2) 且 S[i1, j1] = S[i2, j2]。 R 的长度是 l(R) = j1 – i1 + 1 = j2 – i2 + 1。精确重复 R = ((i1, j1),(i2, j2)) 当且仅当 S [i1 – 1] ≠ S[i2 – 1] 和 S[j1 + 1] ≠ S[j2 + 1] [见 Gusfield (12)]。
If S is a DNA sequence, then we distinguish between two kinds of biologically interesting repeats. The repeats defined above are called direct (or forward) repeats. A pair of substrings P = ((i1, j1),(i2, j2)) is a ‘palindromic’ (or ‘reverse complemented’) repeat if and only if S[i1, j1] = S[i2, j2], where w denotes the reverse complement of a DNA sequence w.
如果 S 是一个 DNA 序列,那么我们就可以区分两种生物学上有趣的重复序列。 上面定义的重复称为正向(或前向)重复。 当且仅当 S[i1, j1] = S[i2, j2] 时,一对子串 P = ((i1, j1),(i2, j2)) 是“回文”(或“反向补码”)重复, 其中 w 表示 DNA 序列 w 的反向互补。
The Hamming distance of two equal-length strings S1 and S2, denoted by dH(S1, S2), is the number of positions where S1 and S2 differ.
两个相等长度的字符串 S1 和 S2 的汉明距离,用 dH(S1, S2) 表示,是 S1 和 S2 不同的位置数。
There are three kinds of edit operations: deletions, insertions and mismatches of single characters. These are collectively referred to as differences. The edit distance or Levenshtein distance of S1 and S2, denoted by dE(S1, S2), is the minimum number of edit operations needed to transform S1 into S2.
编辑操作分为三种:单个字符的删除、插入和不匹配。 这些统称为差异。 S1 和 S2 的编辑距离或 Levenshtein 距离,用 dE(S1, S2) 表示,是将 S1 转换为 S2 所需的最小编辑操作数。
Finding exact repeats
The standard approach to compute maximal exact repeats is based on hashing methods. These usually tabulate for each DNA sequence w of a fixed length, say r, the positions P(w) in S, where w occurs. For each w and i, j ∈ P(w), i < j, ((i, i + r – 1),(j, j + r – 1)) is an exact repeat of length r. To find maximal repeats of length at least l, each exact repeat has to be extended to the left and to the right to check if it is embedded in a maximal exact repeat of the required length. The extension is done by pairwise character comparisons, and thus the running time does not only depend on the number of maximal repeats, but also on their lengths. In practice, r is between 10 and 13, a restriction imposed by the hashing techniques. On the other hand, l is usually of length at least 20. Hence there are many more repeats to be extended than maximal repeats to be reported. To exemplify this, we have calculated the corresponding numbers for the genome of the Escherichia coli K12 strain (13). This genome is of length 4 639 221. There are 4 105 188 repeats of length 12 to be extended, but only 7799 maximal repeats of length at least 20. Thus, the ratio of maximal repeats against tested candidates is 0.0019.
计算最大精确重复的标准方法是基于哈希方法。对于每个固定长度的 DNA 序列 w,例如 r,这些通常将 S 中出现 w 的位置 P(w) 制成表格。对于每个 w 和 i,j ∈ P(w), i < j, ((i, i + r – 1),(j, j + r – 1)) 是长度为 r 的精确重复。为了找到长度至少为 l 的最大重复,每个精确重复必须向左和向右扩展,以检查它是否嵌入到所需长度的最大精确重复中。扩展是通过成对字符比较完成的,因此运行时间不仅取决于最大重复次数,还取决于它们的长度。实际上,r 介于 10 和 13 之间,这是哈希技术施加的限制。另一方面,l 通常长度至少为 20。因此,要扩展的重复次数比要报告的最大重复次数多得多。为了举例说明这一点,我们计算了大肠杆菌 K12 菌株基因组的相应数字 (13)。该基因组的长度为 4 639 221。有 4 105 188 个长度为 12 的重复要扩展,但只有 7799 个最大重复长度至少为 20。因此,最大重复与测试候选者的比率为 0.0019。
Using the suffix tree for S we do not have to filter out a small number of maximal repeats from many candidates. Instead, we directly compute maximal exact repeats (of arbitrary length), using the algorithm of Gusfield (12). This algorithm runs in O (n + z) time, where z is the number of maximal repeats. Thus, the running time is optimal and does not depend on the lengths of the repeats. It has independently been shown how to practically construct suffix trees for genomic-size sequences (4,14). The space efficient implementation techniques developed by Kurtz (4), and an efficient implementation of the algorithm of Gusfield (12), were the basis of the first REPuter program for finding exact repeats (15). This subtask of our new algorithms is not discussed further.
使用 S 的后缀树,我们不必从许多候选者中过滤掉少量的最大重复。 相反,我们使用 Gusfield (12) 的算法直接计算最大精确重复(任意长度)。 该算法在 O (n + z) 时间内运行,其中 z 是最大重复次数。 因此,运行时间是最佳的,并且不依赖于重复的长度。 它已经独立地展示了如何实际构建基因组大小序列的后缀树 (4,14)。 Kurtz (4) 开发的空间高效实现技术和 Gusfield (12) 算法的有效实现是第一个 REPuter 查找精确重复的程序 (15) 的基础。 我们新算法的这个子任务不再进一步讨论。
We will present algorithms for finding degenerate repeats based on two different distance models: the Hamming distance model and the edit distance model. In the following we assume that an error threshold k ≥ 0 and a length threshold l > 0 are given.
我们将介绍基于两种不同距离模型的退化重复查找算法:汉明距离模型和编辑距离模型。 下面我们假设给出了一个错误阈值 k ≥ 0 和一个长度阈值 l > 0。
The mismatches repeat problem
k-mismatch repeats are based on the notion of Hamming distance. A pair of equal-length substrings R = ((i1, j1),(i2, j2)) is a k-mismatch repeat if and only if (i1, j1) ≠ (i2, j2) and dH(S[i1, j1], S[i2, j2]) = k. The length of R is l(R) = j1 – i1 + 1 = j2 – i2 + 1. A k-mismatch repeat is maximal covering if it is not contained in any other k-mismatch repeat.
k-错配重复基于汉明距离的概念。 一对等长子串 R = ((i1, j1),(i2, j2)) 当且仅当 (i1, j1) ≠ (i2, j2) 且 dH(S[i1, j1], S[i2, j2]) = k。 R 的长度为 l(R) = j1 – i1 + 1 = j2 – i2 + 1。如果 k 错配重复不包含在任何其他 k 错配重复中,则它是最大覆盖。
The ‘mismatches repeat problem’ is to enumerate all maximal covering k-mismatch repeats of length at least l that occur in S. Our algorithm maximal mismatch repeats (MMR) for solving this problem is based on the following observation. If R = ((i1, j1),(i2, j2)) is a k-mismatch repeat, then the k mismatches divide S[i1, j1] and S[i2, j2] into maximal exact repeats w0, w1, w2,…, wk. The exact repeats w0 and wk occurring at the borders of the strings are maximal because R is maximal covering; the others are obviously maximal. Now maxi∈[0, k] |wi| is a minimum if the mismatching character pairs are equally distributed over R, yielding a pattern as shown in Figure 1.
“错配重复问题”是枚举 S 中出现的长度至少为 l 的所有最大覆盖 k 错配重复。我们用于解决此问题的算法最大错配重复 (MMR) 基于以下观察。 如果 R = ((i1, j1),(i2, j2)) 是一个 k 错配重复,则 k 个错配将 S[i1, j1] 和 S[i2, j2] 划分为最大精确重复 w0, w1, w2 ,……,周。 出现在字符串边界的精确重复 w0 和 wk 是最大的,因为 R 是最大覆盖; 其他显然是最大的。 现在 maxi∈[0, k] |wi| 如果不匹配的字符对在 R 上均匀分布,则为最小值,从而产生如图 1 所示的模式。

Figure 1. k = 3 mismatching characters (denoted by filled circles) distributed equally over a repeat of length l = 11, yielding a minimum seed size of

图 1. k = 3 个不匹配字符(用实心圆圈表示)平均分布在长度为 l = 11 的重复中,产生的最小种子大小为


显然,对于这样的均等分布,最长 wi 的长度≥ 。 因此,每个长度为 l 的最大覆盖 k 个错配重复 R 包含一个最大精确重复长度≥ ,称为种子。
Algorithm MMR. Compute all seeds and test for each seed whether it can be extended to a k-mismatch repeat. More precisely, for each seed ((i1, j1),(i2, j2)) tables Hleft and Hright of size k + 1 are computed such that for each q ∈ [0, k]: Hright(q) is the maximum number p such that dH(S[j1 + 1, j1 + p], S[j2 + 1, j2 + p]) = q, i.e. S[j1 + 1, j1 + p] and S[j2 + 1, j2 + p] have an alignment with q mismatches. Analogously, Hleft(q) is the maximum number p such that dH(S[i1 – p, i1 – 1], S[i2 – p, i2 – 1]) = q. Then, for each q ∈ [0, k], check whether j1 – i1 + 1 + Hleft(q) + Hright(k – q) ≥ l holds. If so, output the maximal covering k-mismatch repeat ((i1 – Hleft(q), j1 + Hright(k – q)),(i2 – Hleft(q), j2 + Hright(k – q)).
算法 MMR。 计算所有种子并测试每个种子是否可以扩展到 k 错配重复。 更准确地说,对于每个种子 ((i1, j1),(i2, j2)),计算大小为 k + 1 的表 Hleft 和 Hright,使得对于每个 q ∈ [0, k]: Hright(q) 是最大数 p 使得 dH(S[j1 + 1, j1 + p], S[j2 + 1, j2 + p]) = q,即 S[j1 + 1, j1 + p] 和 S[j2 + 1, j2 + p] 与 q 个不匹配对齐。 类似地,Hleft(q) 是最大数 p 使得 dH(S[i1 – p, i1 – 1], S[i2 – p, i2 – 1]) = q。 然后,对于每个 q ∈ [0, k],检查 j1 – i1 + 1 + Hleft(q) + Hright(k – q) ≥ l 是否成立。 如果是,则输出最大覆盖k-mismatch repeat ((i1 – Hleft(q), j1 + Hright(k – q)),(i2 – Hleft(q), j2 + Hright(k – q))。
It is easy to prove that algorithm MMR correctly solves the mismatches repeat problem.
很容易证明MMR算法正确地解决了错配重复问题。
Table Hright can be computed in O(k) time by using a suffix tree that allows to determine the length of the longest common prefix of two substrings of S in constant time. Since we construct the suffix tree of S anyway, this imposes virtually no overhead. Of course, the same approach can be applied to Hleft [For details on this technique see Harel and Tarjan (16) and Schieber and Vishkin (17)].
表 Hright 可以通过使用后缀树在 O(k) 时间内计算,该后缀树允许在恒定时间内确定 S 的两个子串的最长公共前缀的长度。 由于我们无论如何都构建了 S 的后缀树,因此这几乎不会产生任何开销。 当然,同样的方法可以应用于 Hleft [有关该技术的详细信息,请参见 Harel 和 Tarjan (16) 以及 Schieber 和 Vishkin (17)]。
The overall time efficiency of algorithm MMR can be assessed as follows. The preprocessing phase of computing the suffix tree and locating the seeds takes O(n) time. For a given seed, the extension phase of algorithm MMR takes O(k) time as shown above, yielding an overall time efficiency of O(n + zk) where z is the number of seeds. The number of seeds z can be estimated by E where | Σ| is the size of the alphabet Σ and s = (see below).
算法 MMR 的整体时间效率可以评估如下。 计算后缀树和定位种子的预处理阶段需要 O(n) 时间。 对于给定的种子,算法 MMR 的扩展阶段需要 O(k) 时间,如上所示,产生 O(n + zk) 的整体时间效率,其中 z 是种子的数量。 种子数 z 可以通过 E 来估计,其中 | Σ| 是字母 Σ 和 s = 的大小(见下文)。
Algorithm MMR detects a maximal k-mismatch repeat more than once if it contains more than one seed. This can be avoided by stopping the computation of table Hleft as soon as another seed is detected. This ensures that for a given seed the algorithm will output only those maximal k-mismatch repeats in which this particular seed is the leftmost.
如果 MMR 算法包含多个种子,则算法 MMR 会多次检测到最大 k 错配重复。 这可以通过在检测到另一个种子时立即停止表 Hleft 的计算来避免。 这确保对于给定的种子,该算法将仅输出该特定种子位于最左边的那些最大 k 错配重复。
The differences repeat problem
We now extend our technique to allow for insertions and deletions. A pair R = ((i1, j1),(i2, j2)) of substrings is a k-differences repeat if and only if (i1, j1) ≠ (i2, j2) and dE(S[i1, j1], S[i2, j2]) = k. The ‘length’ of R is l(R) = min(j1 – i1 + 1, j2 – i2 + 1). A k-differences repeat is ‘maximal covering’ if it is not contained in any other k-differences repeat.
我们现在扩展我们的技术以允许插入和删除。 一对 R = ((i1, j1),(i2, j2)) 子串是 k 次差重复当且仅当 (i1, j1) ≠ (i2, j2) 和 dE(S[i1, j1], S[i2, j2]) = k。 R 的“长度”为 l(R) = min(j1 – i1 + 1, j2 – i2 + 1)。 如果 k-differences 重复不包含在任何其他 k-differences 重复中,则它是“最大覆盖”。
The differences repeat problem is to enumerate all maximal covering k-differences repeats of length at least l. Our algorithm maximal differences repeats (MDR) (for solving this problem also crucially depends on the fact that every maximal covering k-differences repeat R of length l contains a maximal exact repeat of length ≥ , called a seed.
差异重复问题是枚举所有长度至少为 l 的最大覆盖 k 差异重复。 我们的算法最大差异重复(MDR)(为了解决这个问题,关键还取决于这样一个事实:每个长度为 l 的最大覆盖 k 差异重复 R 包含长度≥ 的最大精确重复,称为种子。
Algorithm MDR. Compute all seeds and try to extend these to k-differences repeats as shown in Figure 2. To be more precise, for every seed ((i1, j1),(i2, j2)) compute tables Eleft and E right defined as follows: E right (q) is the set of all pairs (xr, yr) ∈ [1, m] × [1, n] such that dE(S[j1 + 1, j1 + xr], S[j2 + 1, j2 + yr]) = q, i.e. S[j1 + 1, j1 + xr] and S[j2 + 1, j2 +yr] have an alignment with q differences. Analogously, Eleft(q) is the set of all pairs (xl, yl) ∈ [1, m] × [1, n] such that dE(S[i1 – xl, i1 – 1]–1, S[i2 – yl, i2 – 1]–1) = dE(S[i1 – xl, i1 – 1], S[i2 – yl, i2 – 1]) = q. For each q ∈ [0, k], for each pair (xl, yl) ∈ E left(q), and each (xr, yr) ∈ Eright(k – q): if j1 – i1 + 1 + xl + xr ≥ l and j 2 – i2 + 1 + yl + yr ≥ l, then output the maximal covering k-differences repeats ((i1 – xl, j1 + xr), (i2 – yl, j2 + yr)).
算法 MDR。 计算所有种子并尝试将它们扩展到 k 差异重复,如图 2 所示。更准确地说,对于每个种子 ((i1, j1),(i2, j2)),计算表 Eleft 和 E right 定义如下: E right (q) 是所有对 (xr, yr) ∈ [1, m] × [1, n] 的集合,使得 dE(S[j1 + 1, j1 + xr], S[j2 + 1, j2 + yr]) = q,即 S[j1 + 1, j1 + xr] 和 S[j2 + 1, j2 +yr] 有 q 个差异的对齐。 类似地,Eleft(q) 是所有对 (xl, yl) ∈ [1, m] × [1, n] 的集合,使得 dE(S[i1 – xl, i1 – 1]–1, S[i2 – yl, i2 – 1]–1) = dE(S[i1 – xl, i1 – 1], S[i2 – yl, i2 – 1]) = q。 对于每个 q ∈ [0, k],对于每个对 (xl, yl) ∈ E left(q),并且每个 (xr, yr) ∈ Eright(k – q):如果 j1 – i1 + 1 + xl + xr ≥ l 且 j 2 – i2 + 1 + yl + yr ≥ l,则输出最大覆盖 k 差重复 ((i1 – xl, j1 + xr), (i2 – yl, j2 + yr))。

It is not difficult to show that algorithm MDR correctly solves the differences repeat problem.
不难证明,算法 MDR 正确地解决了差异重复问题。
One could of course use a standard dynamic programming (DP) algorithm [for example see Wagner and Fischer (18)] to extend seeds in O(n2) time. However, there are faster methods: using the algorithm of Ukkonen (19), it is possible to compute tables Eleft and Eright in O(kn) time by computing only front(k) of the DP-matrix, see Figure 2. A combination of this algorithm with the longest common prefix technique yields an O(k2) time method to compute tables E left and Eright. Lastly, for each q ∈ [0, k] the algorithm tests O(k2) combinations of the values in Eleft (q) and Eright (k – q). Thus, the extension phase of algorithm MDR takes time O(k3) per seed. As for algorithm MMR, we conclude that the overall time efficiency of algorithm MDR is O(n + zk3), where z is the number of seeds.
当然可以使用标准动态规划 (DP) 算法 [例如参见 Wagner 和 Fischer (18)] 在 O(n2) 时间内扩展种子。 但是,有更快的方法:使用 Ukkonen (19) 的算法,可以通过仅计算 DP 矩阵的 front(k) 在 O(kn) 时间内计算表 Eleft 和 Eright,参见图 2。 使用最长公共前缀技术的这种算法产生了一个 O(k2) 时间方法来计算表 E left 和 Eright。 最后,对于每个 q ∈ [0, k],该算法测试 Eleft (q) 和 Eright (k – q) 中的值的 O(k2) 组合。 因此,算法 MDR 的扩展阶段每个种子需要时间 O(k3)。 至于算法 MMR,我们得出结论,算法 MDR 的整体时间效率为 O(n + zk3),其中 z 是种子数。
By restricting to left-most seeds, algorithm MDR can be improved in a similar way to algorithm MMR.
通过限制到最左边的种子,可以以类似于算法 MMR 的方式改进算法 MDR。
A different approach to search for degenerate repeats would be to initially search for inexact seeds and then to extend these with less errors. However, this approach suffers from the fact that there is no efficient algorithm for finding all inexact seeds, even if the number of errors is very small.
搜索退化重复的另一种方法是首先搜索不精确的种子,然后以更少的错误扩展这些种子。 然而,这种方法的缺点是没有有效的算法来找到所有不精确的种子,即使错误的数量非常少。
Significance of repeats
In order to assess the significance of a repeat found by our methods, we compute its E-value, i.e. the number of repeats of the same length or longer and with the same number of errors or fewer that one would expect to find in a random DNA of the same length.
为了评估通过我们的方法发现的重复的显著性,我们计算了它的 E 值,即,在相同长度的随机 DNA 中,相同长度或更长且错误数相同或更少的重复次数。
As a model of random DNA the uniform Bernoulli model is used, where each base A, C, G and T has the same probability p = 1/4 of occurrence. (In general, p = 1/|Σ| for an alphabet Σ of size |Σ|). The E-value of the number of maximal exact repeats of length ≥ l can be computed according to the following formula [for details see Kurtz et al. (5)]:
作为随机 DNA 的模型,使用统一的伯努利模型,其中每个碱基 A、C、G 和 T 具有相同的概率 p = 1/4。 (通常,对于大小为 |Σ| 的字母表 Σ,p = 1/|Σ|)。 长度≥ l 的最大精确重复数的 E 值可以根据以下公式计算 [详见 Kurtz et al. (5)]:


E-values for k-mismatch repeats can be computed in a similar way. The probability that two independent sequences S1 and S2, both of length l, have a Hamming distance of exactly k under the uniform Bernoulli model is
k-错配重复的 E 值可以以类似的方式计算。 在均匀伯努利模型下,长度均为 l 的两个独立序列 S1 和 S2 的汉明距离恰好为 k 的概率为

Using this formula, E[# of maximal ≤ k-mismatch repeats of length ≥ l] can be approximated by

The exact solution can be found in Kurtz et al. (5).
确切的解决方案可以在 Kurtz 等人中找到。 (5)。
One can generalize this result for the non-uniform Bernoulli model with fixed probabilities pa for each a ∈ Σ. This is achieved by replacing p by the probability p* that the bases at two randomly chosen positions of S are the same,
对于每个 a ∈ Σ 具有固定概率 pa 的非均匀伯努利模型,可以推广这一结果。 这是通过将 p 替换为 S 的两个随机选择位置的碱基相同的概率 p* 来实现的,

This, however, is only an approximation to the exact solution because the different probabilities for self-overlapping repeats are ignored.
然而,这只是精确解的近似值,因为自重叠重复的不同概率被忽略了。
In the case of the edit distance no analytic solution is known for Pr[dE(S1, S2) = k]. For this reason we use the procedure of Kurtz and Myers (20), which estimates the probability of the event Ak(P) that an arbitrary (not necessarily random) string P matches the prefix of a random string with edit distance k. This procedure is an unbiased estimator which gives good results in a matter of 1000 samples even for patterns of small probability. To obtain an estimation Pr[dE(S1, S2) = k], we precomputed a table E. Here E(l, k) is the average of the estimation of the probability of the event Ak(P). The estimation is delivered by running the above procedure with 1000 samples for 100 random patterns P, each of length l. The variance of the 100 estimations obtained for each l and k is very small, and so we argue that E(l, k) gives a good approximation for Pr[dE(S1, S2) = k] where l = max(|S1|, |S2|). Hence E[# of maximal ≤ k-differences repeats of length ≥ l] can be approximated by
在编辑距离的情况下,不知道 Pr[dE(S1, S2) = k] 的解析解。 出于这个原因,我们使用 Kurtz 和 Myers (20) 的过程,它估计任意(不一定是随机的)字符串 P 与编辑距离为 k 的随机字符串的前缀匹配的事件 Ak(P) 的概率。 这个过程是一个无偏估计器,即使对于小概率模式,它也可以在 1000 个样本中给出良好的结果。 为了获得估计值 Pr[dE(S1, S2) = k],我们预先计算了一个表 E。这里 E(l, k) 是事件概率 Ak(P) 的估计值的平均值。 通过对 100 个随机模式 P(每个长度为 l)运行 1000 个样本来运行上述过程来提供估计。 每个 l 和 k 获得的 100 个估计的方差非常小,因此我们认为 E(l, k) 给出了 Pr[dE(S1, S2) = k] 的一个很好的近似值,其中 l = max(|S1 |, |S2|)。 因此,E[# of maximum ≤ k-differences repeats of length ≥ l] 可以近似为

INTERACTIVE VISUALIZATION
REPvis, the visualization component of the REPuter program, provides an easy to use interface for examining repeat structures computed by REPfind. The program is designed to be used by the biologist, thus putting the data in the hands of those who can best interpret it.
REPuter 程序的可视化组件 REPvis 提供了一个易于使用的界面,用于检查 REPfind 计算的重复结构。 该程序旨在供生物学家使用,从而将数据交到最能解释它的人手中。
A typical mode of use is as follows. The visualization comes up showing a single colored line, depicting either the longest or the most significant repeat. The first step is to obtain an impression of the overall number and distribution of repeats. By shifting a slider, we let further repeats rise on the screen, in the order of decreasing length or significance, which is coded in a 10-color scale (see Figs 3, 5, 6, 8 and 9). Since black is used as the color for the shortest/least significant repeats, we may go down all the way: if we hit the noise level (see Figs 6 and 7) the more significant repeats still shine up in colors before a black background of noise.
典型的使用方式如下。 可视化显示一条彩色线,描绘最长或最重要的重复。 第一步是获得对重复总数和分布的印象。 通过移动滑块,我们让进一步的重复在屏幕上按长度或重要性递减的顺序上升,以 10 色标度编码(见图 3、5、6、8 和 9)。 由于黑色被用作最短/最不重要重复的颜色,我们可能会一直走下去:如果我们达到噪音水平(见图 6 和 7),在黑色的噪音背景之前,更重要的重复仍然以颜色发光。

Figure 3. Assembly checking of human chromosome 22. REPvis display of exact direct and palindromic repeats with a minimum length of 300 bp. The chromosome (32 484 231 bp) consists of 11 concatenated contigs separated by vertical white lines (the separators are specified as an extra annotation and displayed by REPvis). The color code for repeat length indicates that all other repeats are dwarfed by one long, exact repeat (light purple) of 190 014 bp. The computation time for this repeat structure is 8 min.
图 3. 人类 22 号染色体的装配检查。REPvis 显示精确的正向和回文重复,最小长度为 300 bp。 染色体(32 484 231 bp)由 11 个连接的 contigs 组成,由垂直的白线分隔(分隔符被指定为额外的注释并由 REPvis 显示)。 重复长度的颜色代码表示所有其他重复都比一个 190 014 bp 的长而精确的重复(浅紫色)相形见绌。 此重复结构的计算时间为 8 分钟。

Figure 5. Low copy repeats. The repeat graph of the selected 3 Mb from human chromosome 22 showing an unusual repeat region (see left part of Fig. 3). The direct and palindromic repeats of minimum length 100 bp are shown. In essence, the net-like pattern reveals a low copy repeat, with some direct repeats, and others reversed. Comparing this with the repeat pattern observed in (23), one observes a correspondence to the 3 Mb (TDR) of chromosome 22q11.2, responsible for the DiGeorge/Velo-cardio-facial syndrome. The computation time for this repeat structure is 30 s.
图 5. 低拷贝重复。 从人类 22 号染色体中选择的 3 Mb 的重复图显示了一个不寻常的重复区域(参见图 3 的左侧部分)。 显示了最小长度为 100 bp 的正向和回文重复。 从本质上讲,网状模式揭示了低复制重复,一些正向重复,而另一些则相反。 将此与在 (23) 中观察到的重复模式进行比较,可以观察到与染色体 22q11.2 的 3 Mb (TDR) 的对应关系,这是造成 DiGeorge/Velo-cardio-facial 综合征的原因。 这种重复结构的计算时间为 30 秒。

Figure 6. Unique sequences. Direct and palindromic repeats of contig 4 of human chromosome 21 with a minimum length of 20 bp and up to 2 errors. LINEs and SINEs were masked. Clearly, the noise level of random repeats is reached, generating a black background. The computation time for this repeat structure is 16 s. This example is continued in Figure 7.
图 6. 独特的序列。 人类第 21 号染色体 contig 4 的正向和回文重复,最小长度为 20 bp,最多 2 个错误。 LINE 和 SINE 被屏蔽。 显然,达到了随机重复的噪声水平,产生黑色背景。 这种重复结构的计算时间为 16 秒。 此示例在图 7 中继续。

Figure 7. Unique sequences (continued). Zooming into the repeat graph of Figure 6, we find that there are still regions free from repeats. The area between the two vertical white lines represents 5300 bp. It provides unique candidates for primer design.
图 7. 独特的序列(续)。 放大图 6 的重复图,我们发现仍然有没有重复的区域。 两条垂直白线之间的区域代表 5300 bp。 它为引物设计提供了独特的候选者。

Figure 8. Gene structure comparison. The concatenated 5′-UTR from mouse and human pellino1 gene is considered. Both sequences are separated by the vertical white line. Since the pellino gene is expressed in opposed strands in mouse and human, we restrict to palindromic repeats. The repeat graph shows palindromic repeats of 35 bp minimum length with at most two errors. The Genscan prediction is shown as an annotation of the repeat graph, each exon being represented by a colored bar. By analyzing only the inter-sequence matches, we can identify similarities and conserved regions between mouse and human with respect to the pellino gene. The purple line suggests that a common exon is missed by Genscan. The computation time for this repeat structure is 3 s.
图 8. 基因结构比较。 考虑了来自小鼠和人类 pellino1 基因的串联 5'-UTR。 两个序列由垂直白线分隔。 由于 pellino 基因在小鼠和人类的相反链中表达,我们限制为回文重复。 重复图显示了 35 bp 最小长度的回文重复,最多有两个错误。 Genscan 预测显示为重复图的注释,每个外显子由彩色条表示。 通过仅分析序列间匹配,我们可以确定小鼠和人类之间关于 pellino 基因的相似性和保守区域。 紫色线表明 Genscan 遗漏了一个常见的外显子。 这种重复结构的计算时间为 3 秒。

Figure 9. cDNA/EST mapping. Looking for a mouse gene similar to the human Kiaa0903 gene. There are known homologous regions between human chromosome 2 holding the Kiaa0903 gene and mouse chromosome 11. The human cDNA Kiaa0903 sequence was concatenated with a 104 270 bp contig from the mouse region. Both sequences are separated by a vertical white line. The repeat graph shows that the exons of the Kiaa0903 gene are separated by very long introns. The 3′- and 5′-ends of the Kiaa0903 gene are either missing in the contig or are not well conserved. The computation time for this repeat structure is 3 s.
图 9. cDNA/EST 作图。 寻找与人类 Kiaa0903 基因相似的小鼠基因。 在含有 Kiaa0903 基因的人类染色体 2 和小鼠染色体 11 之间存在已知的同源区域。人类 cDNA Kiaa0903 序列与来自小鼠区域的 104 270 bp 重叠群连接。 两个序列由一条垂直的白线分隔。 重复图显示 Kiaa0903 基因的外显子被很长的内含子隔开。 Kiaa0903 基因的 3'- 和 5'- 末端要么在重叠群中缺失,要么没有得到很好的保守。 这种重复结构的计算时间为 3 秒。
During the overview, we may catch interest in particular repeats or repeat-rich regions. A mouse click brings up the inspection window (see Figs 4 and 7). Here we can zoom in or out on a region by left or right clicking the mouse. Selecting a position on the strand symbol prints the information corresponding to this sequence position in a browser box below. There, a single repeat can be selected to view the alignment of the two instances of the repeat, or to submit the corresponding nucleotide sequence for further investigation of biological significance to a FASTA or BLAST database search.
在概述期间,我们可能会对特定的重复或重复丰富的区域感兴趣。 单击鼠标会打开检查窗口(参见图 4 和图 7)。 在这里,我们可以通过左键或右键单击鼠标来放大或缩小一个区域。 选择链符号上的位置会在下面的浏览器框中打印与该序列位置相对应的信息。 在那里,可以选择单个重复以查看重复的两个实例的比对,或将相应的核苷酸序列提交给 FASTA 或 BLAST 数据库搜索以进一步研究生物学意义。

Figure 4. Assembly checking of human chromosome 22 (continued). Enlarged view of contig 7 and contig 8, with attention to the 190 014 bp repeat, shown in light purple. Obviously, the entire contig 8 has also incorrectly been assembled at the beginning of contig 7. This error is rectified in the current release of human chromosome 22.
图 4. 人类 22 号染色体的组装检查(续)。 contig 7 和 contig 8 的放大图,注意 190 014 bp 重复,以浅紫色显示。 显然,整个 contig 8 也被错误地组装在 contig 7 的开头。这个错误在当前发布的人类 22 号染色体中得到了纠正。
The repeat graph as displayed by Repvis can be annotated with additional symbols or lines. For example, the user can display a predicted gene structure by specifying colored arcs and their position in the genome. These will be shown together with the repeat graph, e.g. to generate or verify hypotheses about the correspondence between the particular repeat structure and an intron/exon structure (also see Applications).
Repvis 显示的重复图可以用其他符号或线条进行注释。 例如,用户可以通过指定彩色弧线及其在基因组中的位置来显示预测的基因结构。 这些将与重复图一起显示,例如 生成或验证关于特定重复结构和内含子/外显子结构之间对应关系的假设(另见应用程序)。
APPLICATIONS
The basic application of REPuter is, of course, to reveal the repetitive structure of large chromosomes or genomes. Our recent experience shows that the use of REPuter extends far beyond the original purpose. This is illustrated by applications in five different sequence analysis tasks. The running time of REPfind for each application is given in the figure legends. It refers to a 400 MHz SUN computer with the Solaris 2.5.1. operating system.
REPuter 的基本应用当然是揭示大染色体或基因组的重复结构。 我们最近的经验表明,REputer 的使用远远超出了最初的目的。 这可以通过五种不同序列分析任务中的应用来说明。 每个应用程序的 REPfind 运行时间在图例中给出。 它指的是装有 Solaris 2.5.1 的 400 MHz SUN 计算机操作系统。
Assembly checking
The task of assembly programs is to arrange sequence reads in the proper order and orientation to obtain complete BAC sequences and contigs. However, assembly programs are not perfect and the assembly is often erroneous, possibly resulting in a poor annotation of the resulting sequence.
组装程序的任务是以正确的顺序和方向排列序列读取以获得完整的 BAC 序列和重叠群。 然而,汇编程序并不完美,而且汇编经常是错误的,可能导致结果序列的注释很差。
An assembled sequence can be checked by applying REPfind to it. Very long repeats may be due to overlapping regions between BAC sequences or contigs. This may hint to assembly errors. If repeats are palindromic, one of the repeat instances may have been assembled in the wrong orientation.
可以通过对其应用 REPfind 来检查组装的序列。 非常长的重复可能是由于 BAC 序列或重叠群之间的重叠区域。 这可能暗示装配错误。 如果重复是回文,则重复实例之一可能以错误的方向组装。
This procedure was applied to the 11 concatenated contigs of human chromosome 22. The contigs were obtained from GenBank on March 22, 2001 (accession nos NT_011516.3, NT_011517.2, NT_011519.4, NT_025937.1, NT_011520.5, NT_011521.1, NT_011523.4, NT_011524.2, NT_011525.3, NT_019197.2, NT_011526.3). The REPvis repeat graph (Fig. 3) indicates an overlapping region of unexpected length. Zooming into the repeat graph (Fig. 4) reveals that the complete contig 8 has already been assembled in the beginning of contig 7. This error has been corrected in the current version of human chromosome 22.
该程序应用于人类 22 号染色体的 11 个串联重叠群。重叠群于 2001 年 3 月 22 日从 GenBank 获得(登录号 NT_011516.3、NT_011517.2、NT_011519.4、NT_025937.1、NT_011520.5、NT_011521。 1、NT_011523.4、NT_011524.2、NT_011525.3、NT_019197.2、NT_011526.3)。 REPvis 重复图(图 3)表示一个意外长度的重叠区域。 放大重复图(图 4)显示完整的 contig 8 已在 contig 7 的开头组装。此错误已在当前版本的人类 22 号染色体中得到纠正。
Low copy repeats related to human malformations
Human malformations and syndromes are often associated with the deletion or duplication of specific chromosomal regions. It has been observed that phenotypes related to deletions are more severe than those related to duplications (21). Chromosomal regions rich in gene content may be directly associated with several malignant diseases caused by microdeletions in this part of the genome. The haploinsufficiency for at least some of the genes in the deleted region may be responsible for direct effects on specific developmental processes.
人类畸形和综合征通常与特定染色体区域的缺失或重复有关。 据观察,与缺失相关的表型比与重复相关的表型更严重 (21)。 富含基因的染色体区域可能与这部分基因组微缺失引起的几种恶性疾病直接相关。 缺失区域中至少一些基因的单倍体不足可能是对特定发育过程的直接影响的原因。
One well known aberration associated with such chromosomal rearrangements is the DiGeorge/Velo-cardio-facial syndrome, localized at 22q11.2 (22,23). Most of the rearrangements observed in this region refer to large deletions in a 3 Mb region, causing various anomalies, including mental retardation. Some specific low copy repeat elements are identified flanking this typical deleted region (TDR). This suggests that they function as breakpoints leading to homologous recombination, explaining the genomic instability of this chromosome region.
与此类染色体重排相关的一种众所周知的异常是 DiGeorge/Velo-cardio-facial 综合征,位于 22q11.2 (22,23)。 在该区域观察到的大多数重排指的是 3 Mb 区域中的大量缺失,导致各种异常,包括智力迟钝。 在这个典型的删除区域 (TDR) 侧翼识别出一些特定的低拷贝重复元件。 这表明它们充当导致同源重组的断点,解释了该染色体区域的基因组不稳定性。
Low copy repeats form a characteristic net-like pattern in the visualization provided by REPuter. At a glance, this pattern indicates the number of repeats, their relative positions and the regions potentially deleted.
低拷贝重复在 REPuter 提供的可视化中形成了一个典型的网状模式。 乍一看,这个模式表明了重复的数量、它们的相对位置和可能被删除的区域。
We used Repfind to locate direct and palindromic repeats for human chromosome 22 (Fig. 3). In the left part of the repeat graph, an agglomeration of repeats is observed covering a range of ∼3 Mb containing many repeated blocks (Fig. 5). The pattern involves one essential element that was repeated four times, sometimes directly, sometimes reversed. A comparison with experimental results of Shaikh et al. (23) suggests that the structure corresponds to the TDR responsible for the DiGeorge/Velo-cardio-facial syndrome.
我们使用 Repfind 来定位人类 22 号染色体的正向和回文重复(图 3)。 在重复图的左侧,观察到重复的聚集,覆盖范围约为 3 Mb,包含许多重复块(图 5)。 该xinag模式涉及一个基本元件,该元件重复了四次,有时是正向的,有时是反向的。 与 Shaikh 等人的实验结果的比较。 (23) 表明该结构对应于导致 DiGeorge/Velo-cardio-facial 综合征的 TDR。
Unique sequences
Hybridization techniques are very effective tools in molecular biology. The research pursued spans a wide range including genotyping, pre-natal diagnostics and microarray technology.
杂交技术是分子生物学中非常有效的工具。 所从事的研究范围广泛,包括基因分型、产前诊断和微阵列技术。
Hybridization techniques make use of nucleic acid probes to detect complementary nucleic acid targets present in biological fluids or tissues. Their success essentially depends on the specificity of the probe. Targeting the probes to non-unique sequences may result in cross-hybridization generating false positives. Finding unique sequences is the mathematical complement of finding repeats. Since REPuter solves the repeat finding problem in a non-heuristic way, it can also be used to solve the complementary problem. Assume the probes should have length l and be unique up to k errors. Determining and discarding all repeats of minimum length l and a maximum of k errors, the remaining sequence fragments are guaranteed to be unique everywhere. Hence, probes can be designed from them according to other experimental criteria.
杂交技术利用核酸探针来检测存在于生物体液或组织中的互补核酸靶标。 他们的成功基本上取决于探针的特异性。 将探针靶向非独特序列可能会导致交叉杂交产生假阳性。 寻找独特的序列是寻找重复的数学补充。 由于 REPuter 以非启发式的方式解决了重复查找问题,因此它也可以用于解决互补问题。 假设探针应该有长度 l 并且是唯一的,最多 k 个错误。 确定并丢弃所有最小长度 l 和最大 k 错误的重复,保证剩余的序列片段在任何地方都是唯一的。 因此,可以根据其他实验标准从它们中设计探针。
Our example application is the screening of BAC libraries to get clones used as probes for fluorescence in situ hybridization in contig 4 of human chromosome 21 (GenBank accession no. NT_003534). As this screening is done by PCR, our strategy was to filter out all direct and palindromic repeats with a minimum length of 20 bp, allowing 2 errors (Figs 6 and 7). This guarantees the absence of mispriming (in contig 4) for all primers chosen from the unique sequences. After this preprocessing step, primers can be designed separately for each region of interest, using standard software.
我们的示例应用是筛选 BAC 文库,以获得用作人类染色体 21 的 contig 4 中荧光原位杂交探针的克隆(GenBank 登录号 NT_003534)。 由于该筛选是通过 PCR 完成的,我们的策略是过滤掉所有最小长度为 20 bp 的正向和回文重复,允许出现 2 个错误(图 6 和 7)。 这保证了从独特序列中选择的所有引物不存在错误引物(在 contig 4 中)。 在此预处理步骤之后,可以使用标准软件为每个感兴趣的区域单独设计引物。
Gene structure comparison
Comparative genomics is one of the major reasons for sequencing whole genomes of different organisms. Throughout evolution, vital genes and regulatory regions have been conserved to guarantee basic functions, maintaining similarities across species. Since the mouse is a well known model organism for studies of human biology and medicine, the access to its genome allows researchers to make important discoveries in the regulation of human genes based on common structures and mechanisms.
比较基因组学是对不同生物的全基因组进行测序的主要原因之一。 在整个进化过程中,重要的基因和调控区域一直被保存下来,以保证基本功能,保持物种间的相似性。 由于小鼠是人类生物学和医学研究中众所周知的模式生物,因此对其基因组的访问使研究人员能够在基于共同结构和机制的人类基因调控方面做出重要发现。
REPuter enables us to compare two or more sequences by concatenating them, and then searching for repeats. Controlling the allowed error rate in repeats, we can consider many grades of similarity. In this way, REPuter provides a simple plausibility check for gene structures predicted by other software tools.
REPuter 使我们能够通过连接两个或多个序列,然后搜索重复来比较它们。 控制重复中允许的错误率,我们可以考虑许多等级的相似性。 通过这种方式,REPuter 为其他软件工具预测的基因结构提供了简单的合理性检查。
In our application example, we consider the 5′-UTR region from the Mus musculus pellino1 gene (GenBank accession no. AC091421) and the complete pellino1 gene of Homo sapiens (GenBank accession no. NT_005326.1; Fig. 8). The gene structure predicted by the program Genscan [version 1.0, with default options; (24)] is shown with the repeat graph. Two of the predicted mouse exons coincide with matches in the human genomic sequence. However, there is yet another 98% identity match between both species, 101 bp long (purple). This region was not predicted as an exon by Genscan, although the strong conservation and a match with a cDNA (GenBank accession nos AF302503 for Mouse musculus and AF302505 for Homo sapiens) suggest that it is indeed an exon.
在我们的应用示例中,我们考虑来自 Mus musculus pellino1 基因(GenBank 登录号 AC091421)和智人的完整 pellino1 基因(GenBank 登录号 NT_005326.1;图 8)的 5'-UTR 区域。 Genscan程序预测的基因结构[版本1.0,带有默认选项; (24)] 与重复图一起显示。 两个预测的小鼠外显子与人类基因组序列中的匹配一致。 然而,两个物种之间还有 98% 的同一性匹配,101 bp 长(紫色)。 Genscan 没有将该区域预测为外显子,尽管强保守性和与 cDNA 的匹配(小鼠肌肉的 GenBank 登录号 AF302503 和智人的 AF302505)表明它确实是一个外显子。
cDNA/EST mapping
Expressed sequence tags (ESTs) are markers that provide a rapid and reliable method for gene discovery as well as a resource to analyze the expression of known and unknown genes (25).
表达序列标签 (EST) 是为基因发现提供快速可靠的方法以及分析已知和未知基因表达的资源的标记 (25)。
Given a collection of ESTs and a genomic sequence, one wants to find local similarities between them. This can be done by concatenating the genome with all ESTs and searching for repeats.
给定一组 EST 和一个基因组序列,人们想要找到它们之间的局部相似性。 这可以通过将基因组与所有 EST 连接并搜索重复来完成。
We applied this strategy to look for a mouse gene similar to the human Kiaa0903 gene. There are known homologous regions between human chromosome 2 holding the Kiaa0903 gene and mouse chromosome 11. The human cDNA Kiaa0903 sequence was concatenated with a contig from the mouse region (GenBank accession no. AC091423; Fig. 9). The repeat graph shows that the exons of the Kiaa0903 gene are separated by very long introns. The 3′- and 5′-ends of the Kiaa0903 gene are either missing in the contig or are not well conserved.
我们应用这种策略来寻找与人类 Kiaa0903 基因相似的小鼠基因。 在含有 Kiaa0903 基因的人染色体 2 和小鼠染色体 11 之间存在已知的同源区域。人 cDNA Kiaa0903 序列与来自小鼠区域的重叠群连接(GenBank 登录号 AC091423;图 9)。 重复图显示 Kiaa0903 基因的外显子被很长的内含子隔开。 Kiaa0903 基因的 3'- 和 5'- 末端要么在重叠群中缺失,要么没有得到很好的保守。
CONCLUSIONS
We have given five examples for applications of automatic repeat analysis. On a small scale, all of these analyses can be done by ad-hoc combinations of traditional tools. However, all-against-all comparisons break down when data size approaches that of a small eukaryotic genome or a human chromosome. Heuristic search methods are inadequate when the presence or absence of repetitive elements must be determined with certainty. The virtue of REPuter is to combine linear time efficiency with exhaustive analysis. Thus, a single tool can be utilized for all analysis tasks which are directed to, or can be formulated as, tasks of repetitive structure analysis.
我们给出了自动重复分析应用的五个示例。 在小范围内,所有这些分析都可以通过传统工具的临时组合来完成。 然而,当数据大小接近小型真核基因组或人类染色体时,所有对比的比较就会失效。 当必须确定是否存在重复元素时,启发式搜索方法是不合适的。 REPuter 的优点是将线性时间效率与详尽分析相结合。 因此,单一工具可用于所有针对或可被表述为重复结构分析任务的分析任务。
ACKNOWLEDGEMENTS
S.K. was partially supported by grant KU 1257/1 from the Deutsche Forschungsgemeinschaft. J.V.C. was supported by the DFG-Graduiertenkolleg 635 Bioinformatik.
SK 得到了德意志研究基金会的 KU 1257/1 赠款的部分支持。 合资公司 得到了 DFG 研究培训组 635 生物信息学的支持。
网友评论