Algorithm Literature Reading 9: ColinearScan (the inspiration for modern colinearity algorithms)

Author: 龙star180 | Published: 2022-06-12 00:08


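ColinearScan is a colinearity-detection program that Xiyin Wang (王希胤), a leading figure and pioneer in colinearity, polyploidization and bioinformatics research, developed during his PhD at Peking University. Later colinearity tools such as MCScan, MCScanX and WGDI can arguably all be seen as descendants of the ColinearScan algorithm.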

Statistical inference of chromosomal homology based on gene colinearity and applications to Arabidopsis and rice (2006)

基于基因共线性的染色体同源性统计推断及其在拟南芥和水稻中的应用

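Frankly, I find it a pity that such an impressive algorithm only appeared in BMC Bioinformatics. (Nothing against the journal, but it currently sits in tier 3 of the Chinese Academy of Sciences journal ranking; an algorithm this good arguably deserved a tier-1 top journal, like MCScan.)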

Abstract

Background: The identification of chromosomal homology will shed light on such mysteries of genome evolution as DNA duplication, rearrangement and loss. Several approaches have been developed to detect chromosomal homology based on gene synteny or colinearity. However, the previously reported implementations lack statistical inferences which are essential to reveal actual homologies.

背景：染色体同源性的鉴定将揭示DNA复制、重排和丢失等基因组进化的奥秘。 已经开发了几种基于基因同线性或共线性检测染色体同源性的方法。然而，先前报道的实现缺乏对揭示实际同源性至关重要的统计推断。

Results: In this study, we present a statistical approach to detect homologous chromosomal segments based on gene colinearity. We implement this approach in a software package ColinearScan to detect putative colinear regions using a dynamic programming algorithm. Statistical models are proposed to estimate proper parameter values and evaluate the significance of putative homologous regions. Statistical inference, high computational efficiency and flexibility of input data type are three key features of our approach.

结果：在这项研究中，我们提出了一种基于基因共线性检测同源染色体片段的统计方法。我们在软件包 ColinearScan 中实现了这种方法，以使用动态规划算法检测假定的共线区域。提出了统计模型来估计适当的参数值并评估假定同源区域的重要性。 统计推断、高计算效率和输入数据类型的灵活性是我们方法的三个关键特征。

Conclusion: We apply ColinearScan to the Arabidopsis and rice genomes to detect duplicated regions within each species and homologous fragments between these two species. We find many more homologous chromosomal segments in the rice genome than previously reported. We also find many small colinear segments between rice and Arabidopsis genomes.

结论：我们将 ColinearScan 应用于拟南芥和水稻基因组，以检测每个物种内的重复区域和这两个物种之间的同源片段。我们在水稻基因组中发现了比以前报道的更多的同源染色体片段。我们还在水稻和拟南芥基因组之间发现了许多小的共线片段。

Background

Exploration of homology between chromosomes facilitates the understanding of the structure, function and evolution of genomes. Extensive synteny and colinearity have been detected between chromosomes in different species of cereals [1], mammals [2] and yeasts [3] providing a deep insight into the evolution of ancient chromosomes. Between chromosomes of the same species, large-scale homologous segments exist caused by whole genome or segmental duplication [4-9]. It has been reported that nearly 80% of the Arabidopsis thaliana genome and 45–60% of the rice genome are in large duplicated regions [10-12].

探索染色体之间的同源性有助于理解基因组的结构、功能和进化。已在不同种类的谷物 [1]、哺乳动物 [2] 和酵母 [3] 的染色体之间检测到广泛的同线性和共线性，从而深入了解古代染色体的进化。同一物种的染色体之间存在由全基因组或片段重复引起的大规模同源片段[4-9]。 据报道，近 80% 的拟南芥基因组和 45-60% 的水稻基因组位于大型重复区域 [10-12]。

Special care should be taken to reveal chromosomal homology due to numerous genomic changes such as chromosomal rearrangements, gene inversions and gene loss [13-15]. Many approaches have been developed to identify chromosomal homologues [16] based on genetic maps [17], sequence alignment [18,19], gene synteny [10] and gene colinearity [20-23]. By detecting the density and order of homologous gene pairs between chromosomes, colinearity approach can reveal reliable homologous regions and requires less computational resources. This approach also enables us to develop reasonable statistical tests to evaluate the significance of predicted homologous regions.

由于许多基因组变化（例如染色体重排、基因倒位和基因丢失），应特别注意揭示染色体同源性 [13-15]。基于遗传图谱 [17]、序列比对 [18,19]、基因同线性 [10] 和基因共线性 [20-23]，已经开发了许多方法来识别染色体同源物 [16]。通过检测染色体之间同源基因对的密度和顺序，共线性方法可以揭示可靠的同源区域并且需要较少的计算资源。这种方法还使我们能够开发合理的统计测试来评估预测的同源区域的重要性。

The typical implementations of the colinearity strategy are ADHoRe [20], FISH [24] and DiagHunter [25]. The implementations of these approaches have limitations in some aspects, though they have been widely adopted. Firstly, the gap size between neighboring homologous genes, which is essential to define and detect true colinearity, needs further evaluation [12,20-23,26]. Secondly, statistical tests to assess predicted homologous regions are mainly based on a prerequisite of balanced gene loss rates between homologous regions. Finally, compositional and structural differences, especially gene density and repetition in genome-wide and local chromosomal regions, have not been fully addressed.

共线性策略的典型实现是 ADHoRe [20]、FISH [24] 和 DiagHunter [25]。这些方法的实现在某些方面存在局限性，尽管它们已被广泛采用。首先，相邻同源基因之间的间隙大小对于定义和检测真正的共线性至关重要，需要进一步评估 [12,20 23,26]。其次，评估预测同源区域的统计测试主要基于同源区域之间平衡的基因丢失率的先决条件。最后，组成和结构差异，特别是全基因组和局部染色体区域的基因密度和重复，尚未得到充分解决。

Here we describe a new colinearity approach characterized by improved statistical inference, flexibility and computational efficiency. Firstly, the selection of parameter values is theoretically explored, especially that of the gap length between neighboring genes. Secondly, the statistical test has been substantially strengthened with a mathematical deduction to evaluate the significance of the predicted homologous regions. Finally, the compositional and structural heterogeneity of chromosomes has been considered.

在这里，我们描述了一种新的共线性方法，其特点是改进了统计推断、灵活性和计算效率。首先，从理论上探讨了参数值的选择，尤其是相邻基因之间的间隙长度。其次，通过数学推导来评估预测的同源区域的显着性，大大加强了统计检验。最后，考虑了染色体的组成和结构异质性。

Using a dynamic programming algorithm, we developed ColinearScan and scanned the Arabidopsis and rice genomes to detect duplicated regions in each species and homologous chromosomal regions between these two species. We found 75.0% of Arabidopsis genes and 76.2% of rice genes were in duplicated regions. Moreover, we identified homologous fragments between these two species, in 32.9% of Arabidopsis and 16.8% of rice. Nearly all homologous segments were shorter than 0.6 Mb, indicating massive chromosomal rearrangements after the monocot-dicot divergence [27].

使用动态规划算法，我们开发了 ColinearScan 并扫描了拟南芥和水稻基因组，以检测每个物种中的重复区域以及这两个物种之间的同源染色体区域。结果表明，75.0%的拟南芥基因和76.2%的水稻基因存在重复区域。在拟南芥和水稻中分别有32.9%和16.8%的同源片段。几乎所有的同源片段都小于0.6 Mb，表明单子叶双子叶分化[27]后发生了大量的染色体重排。

Results

Algorithm

Gene homology matrix

The first step in our colinearity approach is the construction of the gene homology matrix. To find homologous gene pairs between two chromosomes denoted as A and B, protein sequences encoded by genetically or physically positioned genes are used to perform an all-against-all BLAST search [28]. A gene homology matrix (GHM, denoted as H) is then constructed using the homology information from the BLAST results. Chromosomes A and B, represented by the positioned genes, are arranged along H horizontally and vertically (Fig. 1A). A cell of H is filled with "1" if the corresponding genes on chromosome A and chromosome B are homologous, otherwise with "0". Tandem and other repetitive genes are widely distributed in chromosomes, showing up as many "1"s in horizontal or vertical straight lines in the dot matrix map (Fig. 2) and causing problems in revealing true homology. Therefore, we used a general approach, masking the genes appearing more than 10 times in both chromosomes.

共线性方法的第一步是构建基因同源矩阵。为了找到A和B两条染色体之间的同源基因对，利用基因或物理位置基因编码的蛋白质序列进行all-against-all BLAST搜索[28]。然后利用BLAST结果中的同源性信息构建基因同源性矩阵(GHM，记为H)。A、B染色体以定位基因表示，沿H水平和垂直排列(图1A)。如果A染色体和B染色体上对应的基因是同源的，H单元格内为“1”，否则为“0”。串联基因和其他重复性基因广泛分布在染色体中，在点阵图中，水平或垂直的直线上有许多“1”(图2)，这给揭示真正的同源性带来了问题。因此，我们采用了一种通用的方法，将在两条染色体中出现超过10次的基因隐藏起来。
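The construction described above can be sketched in a few lines of Python. This is only an illustration, not ColinearScan's own code (the paper lists C++ and Perl as its languages). The function name `build_ghm`, the `hits` input format (pairs of gene identifiers that already passed a BLASTP cutoff) and the reading of "appearing more than 10 times" as "having more than `max_copies` homologous hits in the matrix" are all assumptions made for the example.

```python
from collections import Counter


def build_ghm(genes_a, genes_b, hits, max_copies=10):
    """Build a 0/1 gene homology matrix H (hypothetical helper).

    genes_a: gene IDs in chromosomal order on chromosome A (columns).
    genes_b: gene IDs in chromosomal order on chromosome B (rows).
    hits:    iterable of (gene_on_A, gene_on_B) homologous pairs.
    """
    col = {g: j for j, g in enumerate(genes_a)}  # chromosome A runs horizontally
    row = {g: i for i, g in enumerate(genes_b)}  # chromosome B runs vertically
    pairs = {(a, b) for a, b in hits if a in col and b in row}

    # Count homologous partners per gene so repetitive genes can be masked.
    counts = Counter()
    for a, b in pairs:
        counts[("A", a)] += 1
        counts[("B", b)] += 1

    H = [[0] * len(genes_a) for _ in genes_b]
    for a, b in pairs:
        # Assumed masking rule: drop any pair involving a highly repeated gene.
        if counts[("A", a)] > max_copies or counts[("B", b)] > max_copies:
            continue
        H[row[b]][col[a]] = 1
    return H


if __name__ == "__main__":
    hits = [("A1", "B1"), ("A3", "B3"),
            ("A2", "B1"), ("A2", "B2"), ("A2", "B3")]  # A2 is repetitive
    for line in build_ghm(["A1", "A2", "A3"], ["B1", "B2", "B3"], hits, max_copies=2):
        print(line)
```

Masking whole genes rather than single cells is what removes the horizontal and vertical lines that repetitive genes draw in the dot map.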

Figure 1

A modified Smith-Waterman algorithm to locate colinearity. (A) A simplified gene homology matrix (GHM, denoted as H). Genes A1, A2, ... , A18 on chromosome A are arranged horizontally, and genes B1, B2, ... , B14 on chromosome B are arranged vertically. Each cell of the matrix is filled with "1" or "0" based on the homology information from BLASTP search, e.g., gene A1 and gene B1 are homologous, and gene A2 and B2 are non-homologous. (B) A modified dynamic programming procedure. A scoring matrix S is constructed recursively based on H, with mg set to 2 genes apart. The distance criterion demands that neighboring genes in colinearity are no more than 2 genes apart. Pointers are shown by dark or grey arrow lines. Two collinear paths containing 9 and 5 genes are shown by dark arrow lines reflecting the same colinear relationship between the corresponding chromosomal regions.

一种改进的 Smith-Waterman 算法来定位共线性。 (A) 简化的基因同源矩阵 (GHM, 表示为 H)。 A染色体上的基因A1、A2、...、A18水平排列，B染色体上的基因B1、B2、...、B14垂直排列。根据 BLASTP 搜索的同源性信息，矩阵的每个单元格用“1”或“0”填充，例如基因 A1 和基因 B1 是同源的，而基因 A2 和 B2 是非同源的。 (B) 修改后的动态规划程序。评分矩阵 S 基于 H 递归构建，mg 设置为相隔 2 个基因。距离标准要求共线性的相邻基因相距不超过 2 个基因。指针由深色或灰色箭头线表示。包含 9 和 5 个基因的两条共线路径用黑色箭头线表示，反映了相应染色体区域之间的相同共线关系。

Figure 2

Examples of dot maps. (A) A dot map between rice chromosomes 2 and 4. Each dot in the map reflects a homologous gene pair with BLASTP score > 100. The dots are not distributed uniformly in the map. The map is also featured by many horizontal and vertical lines formed by repetitive genes. (B) A dot map between the same chromosomes as (A) with repetitive genes filtered. (C) A dot map of rice chromosome 1 against itself. Self-matching dots form a solid diagonal line. (D) A dot map with self-matching and repetitive genes filtered. A diagonal line reflecting the neighboring homologues can still be seen.

点图示例。 (A) 水稻 2 号和 4 号染色体之间的点图。图中的每个点都反映了 BLASTP 得分 > 100 的同源基因对。这些点在图中分布不均匀。该图还具有由重复基因形成的许多水平和垂直线的特征。 (B) 与 (A) 相同的染色体之间的点图，其中重复基因被过滤。 (C) 水稻 1 号染色体与其自身的点图。自匹配点形成实心对角线。 (D) 过滤自匹配和重复基因的点图。 仍然可以看到反映相邻同系物的对角线。

Dynamic programming algorithm

To reveal the homologous genes in colinearity between two chromosomes, we implemented a dynamic programming approach based on the well-known Smith-Waterman algorithm [29]. Using this approach, we can discover the longest putative sister regions represented by several proximal points of colinear homologous gene pairs in nearly diagonal orientations. The points may not be in close proximity due to large-scale gene loss, insertion and translocation. The extent of the proximity of the points is essential to reveal and evaluate the colinear sister segments. Lines formed by the points corresponding to the true colinear segments are nearly parallel to either the main diagonal or, because of DNA segmental inversion, the anti-diagonal. Homologous genes in colinear segments should all have the same or inverse transcriptional directions if no single gene inversions occur. We scan H in two directions, starting from the upper-left and upper-right. Here, we describe the procedure starting from the upper-left, which also applies in the other direction. Transcriptional orientations of the genes are recorded but not used when performing the colinearity search.

为了揭示两条染色体之间共线性的同源基因，我们实现了一种基于著名的 Smith-Waterman 算法 [29] 的动态规划方法。使用这种方法，我们可以发现最长的假定姐妹区域，这些区域由几乎对角方向的共线同源基因对的几个近点表示。由于大规模基因丢失、插入和易位，这些点可能不会很接近。点的接近程度对于揭示和评估共线姊妹段至关重要。由于DNA节段倒位，由对应于真正共线段的点形成的线几乎平行于主对角线或对角线。如果没有单个基因，共线段中的同源基因应该具有相同或反向的转录方向发生反转。我们从左上角和右上角开始从两个方向扫描 H。在这里，我们描述从左上角开始的过程，这也适用于另一个方向。记录基因的转录方向，但在执行共线性搜索时不使用。

To reveal the colinearity represented by the proximal points in H, we introduce a parameter mg (the maximum gap length) between two neighboring points. Then we define another matrix S (the scoring matrix) with the same size as H (Fig. 1B). A cell in matrix S represents the extension of a colinearity path, i.e., the value of each cell is the number of collinear gene pairs in the path accumulated from its starting point. The path extends and the value of the cell increases by 1 if there is a "1" in lower-right neighborhood, and both vertical and horizontal distances are less than mg. Initially, S is identical to H. We rebuild the matrix S recursively using a dynamic programming procedure:

为了揭示H中相邻点所代表的共线性，我们引入了一个参数mg(最大间隙长度)。然后我们定义另一个矩阵S(即得分矩阵)，其大小与H相同(图1B)。矩阵S中的一个单元格表示共线路径的扩展，即每个单元格的值为该路径从起始点开始累计的共线基因对数。如果右下邻域有“1”，则路径扩展，细胞值增加1，且垂直和水平距离均小于mg。一开始，S和h是相同的，我们用一个动态规划过程递归地重建矩阵S:
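The displayed recurrence is missing from this post (it was presumably an image in the original). The form below is a reconstruction that agrees with the definitions around it; it is not a quotation from the paper. It uses "distance no more than mg", following the Figure 1 caption, whereas the running text says "less than mg". When no admissible predecessor exists, the cell keeps its initial value H(i, j) and no pointer is created.

```latex
S(i,j) = H(i,j)\,\Bigl(1 + \max\Bigl\{0,\ \max_{\substack{a<i,\ b<j \\ \mathrm{dist}(p(i,j),\,p(a,b)) \le mg}} S(a,b)\Bigr\}\Bigr),
\qquad
\mathrm{pre}(i,j) = \operatorname*{arg\,max}_{\substack{a<i,\ b<j \\ \mathrm{dist}(p(i,j),\,p(a,b)) \le mg}} S(a,b)
```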

where S(i, j) is the score computed, H(i, j) is the homology information, pre(i, j) is the cell leading to the maximum score at the cell p(i, j), and a pointer (denoted by dark or gray arrow lines in Fig. 1B) is created from the cell p(i, j) to pre(i, j), dist(p(i, j), p(a, b)) is the distance between the cells p(i, j) and p(a, b). Eventually, the maximum score in S corresponds to the longest putative collinear segments. The longest colinearity path formed by dots of the homologous gene pairs is revealed by a trace-back procedure according to the pointers created. After the homologous genes in putative colinearity are recorded, we mask these putative colinear segments by setting H(i, j) to 0, rebuild the matrix S, and scan for other putative colinear segments till no sister regions containing more colinear genes than a threshold r could be found.

其中S(i, j)是计算出的分数，H(i, j)是同源信息，pre(i, j)是导致单元格p(i, j)的最大分数，一个指针（图中用深色或灰色箭头线表示。1B）从单元格p（i，j）到pre（i，j）被创建，dist（p（i，j），p（a，b））是单元格p（i，j）和p（a，b）之间的距离。最后,S中的最大分数对应最长的假定共线段。根据建立的指针进行回溯，揭示了同源基因对的点所形成的最长共线路径。记录了假定共线的同源基因后，我们通过设H(i, j)为0来掩盖这些假定共线片段，重建矩阵S，并扫描其他假定共线片段，直到没有发现包含超过阈值r的共线基因的姐妹区域。
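The scan, trace-back and masking loop described above can be sketched as follows. This is a simplified illustration and not the ColinearScan source. The function names are hypothetical, the distance is measured in matrix indices (gene counts) with "no more than `mg`" as the criterion, and the anti-diagonal scan is done by mirroring the columns of H, which is one convenient way to realize "starting from the upper-right".

```python
def longest_colinear_path(H, mg=2):
    """Return the longest chain of 1-cells as [(i, j), ...], upper-left first."""
    n_rows = len(H)
    n_cols = len(H[0]) if H else 0
    S = [row[:] for row in H]  # initially S is identical to H
    pre = [[None] * n_cols for _ in range(n_rows)]
    best, best_cell = 0, None
    for i in range(n_rows):
        for j in range(n_cols):
            if not H[i][j]:
                continue
            # Look for the best predecessor within mg rows and mg columns.
            for a in range(max(0, i - mg), i):
                for b in range(max(0, j - mg), j):
                    if S[a][b] + 1 > S[i][j]:
                        S[i][j] = S[a][b] + 1
                        pre[i][j] = (a, b)  # pointer used by the trace-back
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    path, cell = [], best_cell
    while cell is not None:  # follow the pointers back to the start
        path.append(cell)
        cell = pre[cell[0]][cell[1]]
    return path[::-1]


def find_colinear_segments(H, mg=2, r=3):
    """Repeatedly extract the longest path, mask it, and rescan."""
    H = [row[:] for row in H]  # work on a copy; the caller's matrix is untouched
    segments = []
    while True:
        path = longest_colinear_path(H, mg)
        if not path or len(path) <= r:  # keep only paths with more than r genes
            break
        segments.append(path)
        for i, j in path:
            H[i][j] = 0  # mask the recorded segment
    return segments


def find_in_both_directions(H, mg=2, r=3):
    """Scan from the upper-left, then from the upper-right via mirrored columns."""
    n_cols = len(H[0]) if H else 0
    forward = find_colinear_segments(H, mg, r)
    mirrored = [row[::-1] for row in H]
    backward = [[(i, n_cols - 1 - j) for i, j in path]
                for path in find_colinear_segments(mirrored, mg, r)]
    return forward, backward
```

Rebuilding S from scratch after every masking step is the simplest reading of the text; it costs roughly O(rows × cols × mg²) per pass, so a practical implementation would probably work on a sparse list of points rather than a dense matrix.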

----------------

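The rest of the paper describes each parameter and metric of the algorithm in detail and applies the method to the rice and Arabidopsis genomes (or proteomes), which leads to the findings stated in the abstract. Interested readers can download the paper and read it in full; I do not translate it any further here.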

----------------

Discussion

Identification of the duplicated segments, especially their distribution pattern in a genome, is essential for further inference on when and how the duplication or species divergence occurred, and whether or not recurrent duplication events happened. The selection of parameter values, in particular the maximum gap length between neighboring genes, is critical to detect chromosomal homology. However, the selection of the maximum gap length in previously reported studies was mainly empirical, which might fail to detect authentic duplicated segments [20,22,23]. Far fewer and shorter duplicated segments are discovered when a smaller gap length is adopted, as in the case of rice [20], whereas more and longer duplicated regions can be found if a larger gap length is adopted. Moreover, different gap lengths should be used in different genomes such as Arabidopsis and rice, since the density of colinear genes varies due to DNA loss and insertion. By considering the difference in gene density, especially the density of homologous genes in different genomes, we determined the maximum gap length based on statistical analysis. For example, when the duplicated regions in Arabidopsis and rice were detected, the maximum gap lengths were estimated to be 116 Kb and 334 Kb, respectively.

识别复制的片段，特别是它们在基因组中的分布模式，对于进一步推断复制或物种分化的时间和方式，以及是否发生了重复性的复制事件至关重要。参数值的选择，特别是相邻基因之间的最大间隙长度，对检测染色体同源性至关重要。然而，在以往的研究中，对最大间隙长度的选择主要是经验性的，这可能无法检测到真实的重复片段[20,22,23]。当采用较小的间隙长度时，许多较少和较短的重复片段被发现，如在水稻的例子中[20]，而如果采用较大的间隙长度，可以发现更多和更长的重复区域。此外，在不同的基因组中，如拟南芥和水稻，应采用不同的间隙长度，因为由于DNA的丢失和插入，共线基因的密度也不同。通过考虑基因密度的差异，特别是不同基因组中同源基因的密度，我们在统计分析的基础上确定了最大间隙长度。例如，当检测到拟南芥和水稻中的重复区域时，最大缺口长度估计分别为116 Kb和334 Kb。

The input data of our approach can be any type of marker, such as sequences or genetic markers. Various measurements can be used to represent the distance between markers, such as physical or genetic distances, or gene numbers. In most previous studies, the significance of the predicted colinear regions was evaluated by a permutation test, which is rather time-consuming [20,23]. We estimate the significance of the predicted colinear segments through statistical inference, which has the advantage over the permutation test in terms of computational efficiency. It takes only 2 minutes to calculate the epv used to evaluate their significance on a personal computer (AMD AthlonXP 2000+, 512 MB RAM), while running a permutation test takes several hours on the same machine. Massive gene duplications and translocations in proximal chromosomal regions lead to many colinearity shadows, decreasing the computational efficiency. We include a neighborhood masking procedure in ColinearScan to remove colinearity shadows, which dramatically improves the efficiency of detecting duplicated segments in the rice genome.

我们方法的输入数据可以是任何类型的遗传标记，例如序列、遗传标记。可以使用各种测量值来表示标记之间的距离，例如物理或遗传距离或基因数。在大多数先前的研究中，预测的共线区域的重要性是通过置换检验来评估的，这相当耗时 [20,23]。我们通过统计推断估计预测的共线段的重要性。统计推断在计算效率方面优于置换检验。计算 epv 以评估它们在个人计算机（AMD AthlonXP 2000+，512 MB RAM）上的重要性只需要 2 分钟，而在同一台计算机上运行置换测试需要几个小时。其近端区域的大量基因重复和易位会导致许多共线性阴影，从而降低计算效率。我们在 ColinearScan 中包含一个邻域掩蔽程序，以消除我们算法中的共线性阴影，这极大地提高了检测水稻基因组中重复片段的效率。
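To see why a permutation test is expensive, here is a minimal sketch of the generic idea. It is neither ADHoRe's exact procedure nor ColinearScan's epv calculation, which this post does not reproduce. The function names, the use of gene indices as distances and the choice of shuffling column order are assumptions for illustration. Every permutation reruns the chaining step, so the cost grows linearly with the number of permutations.

```python
import random


def longest_chain(points, mg):
    """Length of the longest chain of points with row and column steps in 1..mg."""
    best = {}
    for i, j in sorted(set(points)):  # predecessors always come earlier in this order
        best[(i, j)] = 1 + max(
            (score for (a, b), score in best.items()
             if 0 < i - a <= mg and 0 < j - b <= mg),
            default=0,
        )
    return max(best.values(), default=0)


def permutation_p_value(points, n_cols, mg, n_perm=1000, seed=0):
    """Estimate how often shuffled gene order yields a chain as long as observed."""
    rng = random.Random(seed)
    observed = longest_chain(points, mg)
    cols = list(range(n_cols))
    at_least_as_long = 0
    for _ in range(n_perm):
        rng.shuffle(cols)  # randomize gene order on one chromosome
        shuffled = [(i, cols[j]) for i, j in points]
        if longest_chain(shuffled, mg) >= observed:
            at_least_as_long += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_long + 1) / (n_perm + 1)


if __name__ == "__main__":
    diagonal = [(k, k) for k in range(8)]
    print(permutation_p_value(diagonal, n_cols=50, mg=2, n_perm=200))
```

The smallest p-value such a test can report is 1 / (n_perm + 1), so resolving very significant segments needs many permutations, which is the cost an analytical estimate avoids.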

ADHoRe adopts linear regression analysis to infer duplicated chromosomal segments [20,21]. The underlying assumption is that gene loss rates have been balanced between sister segments, resulting in a straight line in the dot map. The colinear homologues in a chromosomal segment might be interspersed with individual genes that have no homologues at the corresponding position in the sister segment. At the very beginning of the divergence of the sister segments, there should be one-to-one gene homology. Thereafter, massive gene deletions, translocations and chromosomal rearrangements occur and the initial pattern eventually becomes obscured [25]. Homologues in conserved order would appear in a straight line in the dot map if gene deletion or insertion had been balanced in different regions of the sister segments, and otherwise in a curvy line. Wang et al. [15] explore the gene loss rates in the sister segments in rice and find that nearly straight lines are obtained for some sister segments, e.g., in chromosomes 11 and 12, and in chromosomes 2 and 4. However, curvy lines are also found for some sister segments, e.g., in chromosomes 1 and 5, and in chromosomes 8 and 9. A linearity assumption might therefore fail to detect true duplicated segments. In FISH, Calabrese et al. [24] also adopt a colinearity strategy and develop a different statistical approach to evaluate the extension of collinear points, referred to as clumps in the GHM. However, the value of the key parameter p, reflecting the probability that a point occurs in the neighborhood of the former point, is artificially defined, and the maximal gap is deduced from p in their approach. DiagHunter [25] adopts a colinearity method similar to our approach, but the maximal length of the path is predefined. The program keeps extending the current path until it reaches the maximal length threshold or no further neighboring points can be found.

ADHoRe 采用线性回归分析来推断重复的染色体片段 [20,21]。基本假设是姐妹片段之间的基因丢失率已经平衡，从而在点图中形成一条直线。染色体片段中的共线同源物可能散布在其姐妹片段中相应位置没有同源物的单个基因中。在姐妹节段分歧的最开始，应该有一对一的基因同源性。此后，大量基因缺失、易位和染色体重排发生，最初的模式最终变得模糊[25]。如果基因缺失或插入在姐妹片段的不同区域平衡，则具有保守顺序的同源物将出现在点图中的直线上，否则呈曲线状。王等人。 [15] 研究了水稻姐妹片段的基因丢失率，发现一些姐妹片段获得了几乎直线，例如，在 11 号和 12 号染色体中，以及在 2 号和 4 号染色体中。在一些姐妹片段中也发现了曲线，例如，在 1 号和 5 号染色体中，以及在 8 号和 9 号染色体中。线性假设可能无法检测到真正的重复片段。在 FISH 中，Calabrese 等人。 [24] 还采用了共线性策略并开发了一种不同的统计方法来评估共线点的扩展，在 GHM 中称为丛。然而，反映一个点出现在前一个点附近的概率的关键参数 p 的值是人为定义的，并且在他们的方法中从 p 推导出最大间隙。 DiagHunter [25] 采用与我们的方法类似的共线性方法，并且预定义了路径的最大长度。程序停止扩展当前路径，直到达到最大长度阈值，或者找不到其他相邻点。

Polyploidy has been supposed to be prevalent in plants. Recently, genome-wide studies further suggest the ubiquity of polyploidy, even in genomes which have not been considered to undergo genomic duplication [35]. The small genome of Arabidopsis has been reported to have undergone at least one round of duplication by different groups [12,18]. Here using a different method, we discover that 75.0% of the Arabidopsis genome sequences are in duplicated regions and a significant portion of sequences have multiple copies. The previous studies in the rice genome have been focused on the large obvious duplicated segments, produced by the relatively recent duplication events [15,36]. Here, we detect 76.2% of rice sequences in duplicated regions, and 42.9% have multiple copies.

多倍体在植物中普遍存在。最近，全基因组研究进一步表明多倍体的普遍存在，甚至在尚未被认为发生基因组复制的基因组[35]中也存在多倍体。据报道，拟南芥的小基因组在不同的类群中至少经历了一轮重复[12,18]。在这里，我们使用不同的方法发现，拟南芥75.0%的基因组序列处于重复区域，并且有很大一部分序列具有多重拷贝。以往对水稻基因组的研究主要集中在较近期的重复事件产生的明显的大重复片段上[15,36]。在这里，我们检测到76.2%的水稻序列处于重复区域，42.9%的序列有多个拷贝。

The possibility of constructing the monocot-dicot comparative genetic map has been discussed [37] based on the comparison of Arabidopsis and rice sequences [38,39]. However, a comprehensive detection of homologous regions between these two genomes has not been available. Based on gene colinearity, we detected homologous regions between Arabidopsis and rice, accounting for 32.9% and 16.9% of each genome. All homologous segments were shorter than 0.6 Mb in length, indicating the massive genome rearrangements in both genomes after the monocot-dicot divergence. Though the short homologous segments make it difficult to construct the comparative genetic map between monocot and dicot, the homologues in colinearity found in this study may provide clues for further work in comparative genomics.


基于拟南芥和水稻序列的比较[38,39]，已经讨论了构建单子叶植物-双子叶植物比较遗传图谱的可能性[37]。 然而，这两个基因组之间的同源区域的全面检测尚未获得。 基于基因共线性，我们检测到拟南芥和水稻之间的同源区域，分别占每个基因组的32.9%和16.9%。所有同源片段的长度都短于 0.6 Mb，表明在单子叶植物-双子叶植物分化后两个基因组中的大量基因组重排。虽然短的同源片段使构建单子叶植物和双子叶植物之间的比较遗传图谱变得困难，但本研究中发现的共线性同源物可能为比较基因组学的进一步工作提供线索。

Conclusion

We develop an algorithm to detect homologous chromosomal segments with conserved gene order, and we propose a statistical approach to estimate parameters and evaluate the significance of potential homology. We apply this approach to rice and Arabidopsis with high efficiency to detect potential colinear regions and evaluate their significance. We find many more homologous chromosomal segments in rice genomes than previously reported, which consolidated the inference that a polyploidy had occurred in the common ancestor of grasses. We also find many small colinear segments between rice and Arabidopsis genomes, providing clues to the evolutionary history of monocots and dicots.

我们开发了一种算法来检测具有保守基因顺序的同源染色体片段，并提出了一种统计方法来估计参数和评估潜在同源性的意义。我们将该方法应用于水稻和拟南芥，高效地检测潜在共线区域并评估其显著性。我们在水稻基因组中发现了比以往报道更多的同源染色体片段，这巩固了禾本科植物的共同祖先存在多倍体的推断。我们还发现了水稻和拟南芥基因组之间的许多共线段，为单子叶植物和双子叶植物的进化历史提供了线索。

Authors' contributions

作者的贡献

XW and XS developed the algorithm and the statistical models under the supervision of JL. XW, XS and ZL implemented the programs in rice and Arabidopsis, QZ and SG contributed to this work on plant biology and evolution, ZL and LK developed the online web server, WT provided technical support to the project. All the authors contributed to the refinement of the manuscript drafted by XW.

XW和XS在JL的监督下开发了算法和统计模型。XW、XS和ZL实施了水稻和拟南芥的项目，QZ和SG参与了植物生物学和进化的工作，ZL和LK开发了在线网络服务器，WT为项目提供了技术支持。所有的作者都对XW起草的手稿做出了改进。

Availability and requirements

• Project name: ColinearScan

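• Project home page: http://colinear.cbi.pku.edu.cn (This site is in fact no longer reachable. Many sites have gone the same way, agriGO for example, which is a real pity: it was a convenient plant GO analysis site developed by Prof. Zhen Su's group at China Agricultural University. If it were still accessible I would have liked to write a post on agriGO v2.0. If any reader knows a working address for agriGO v2.0, please let me know.)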

• Operating systems: Linux, Unix

• Programming languages: C++, PERL

• Other requirements: Standard C++ Library, BioPerl and other PERL modules including Getopt::Long and Pod::Usage

• License: GPL


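Summary: this is an excellent algorithm paper that only made it into a tier-3 journal, while many toolkits with no original algorithm were published in tier-1 top journals. I feel the ColinearScan algorithm deserved a better venue.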
