- Identifying the boundaries of each smudge;
- Filtering smudges;
- Estimating the monoploid (haploid) coverage.
First, the two-dimensional space is divided into bins, and the number of k-mer pairs in each bin is counted.
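The binning step can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: it assumes the two axes are the relative minor coverage B/(A+B) and the sum of coverages A+B, and the function name `bin_kmer_pairs` and the parameter `n_bins` are hypothetical.

```python
from collections import Counter

def bin_kmer_pairs(pairs, n_bins=40):
    """Count k-mer pairs per 2D bin.

    pairs: iterable of (cov_a, cov_b) coverages, both > 0.
    Axes (an assumption): x = minor / (a + b), y = a + b.
    """
    pairs = list(pairs)
    max_total = max(a + b for a, b in pairs)
    counts = Counter()
    for a, b in pairs:
        total = a + b
        minor = min(a, b) / total              # lies in (0, 0.5]
        # Map each axis onto n_bins bins; clamp the upper edge into the last bin.
        x = min(int(minor / 0.5 * n_bins), n_bins - 1)
        y = min(int(total / max_total * n_bins), n_bins - 1)
        counts[(x, y)] += 1
    return counts
```

The result maps each occupied bin `(x, y)` to the number of k-mer pairs that fall into it; empty bins are simply absent.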
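Next, the center of each smudge is chosen as a bin that is a local maximum in terms of the number of k-mer pairs. The k-mer pairs in all other bins are aggregated into the nearest bin that has been designated as a smudge center.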
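One simple way to picture this step is shown below. It is a sketch under stated assumptions: a bin counts as a local maximum if it holds strictly more pairs than each of its eight neighbours, and "nearest" means Euclidean distance in bin indices. The names `find_centers` and `aggregate` are hypothetical, and the real implementation may define neighbourhoods and ties differently.

```python
def find_centers(counts):
    """Return bins whose count strictly exceeds all 8 neighbouring bins."""
    centers = []
    for (x, y), c in counts.items():
        neighbours = [
            counts.get((x + dx, y + dy), 0)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        ]
        if all(c > n for n in neighbours):
            centers.append((x, y))
    return centers

def aggregate(counts, centers):
    """Add every bin's k-mer pairs to its nearest smudge center."""
    if not centers:
        return {}
    smudges = {c: 0 for c in centers}
    for (x, y), n in counts.items():
        nearest = min(centers, key=lambda c: (c[0] - x) ** 2 + (c[1] - y) ** 2)
        smudges[nearest] += n
    return smudges
```

With strict comparison, a flat plateau of equal bins yields no center; a production implementation would need an explicit tie-breaking rule.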
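Once the boundaries of each smudge are determined, smudges that contain less than 0.5% of the k-mer pairs in the dataset are filtered out, because these usually represent repetitive structures of the genome and are often misplaced due to the small number of k-mers representing them.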
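The filter itself is a one-line threshold. In this sketch the function name `filter_smudges` is hypothetical, and the fraction is assumed to be taken over the total number of k-mer pairs across all smudges before filtering.

```python
def filter_smudges(smudges, min_fraction=0.005):
    """Keep smudges holding at least min_fraction of all k-mer pairs.

    smudges: dict mapping a smudge identifier to its k-mer pair count.
    """
    total = sum(smudges.values())
    if total == 0:
        return {}
    return {s: n for s, n in smudges.items() if n / total >= min_fraction}
```

For instance, in a dataset of 1,000 k-mer pairs a smudge with 4 pairs (0.4%) is dropped, while one with 10 pairs (1%) is kept.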
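For the first estimate of monoploid coverage, we compute the coverage of each smudge and then take the overall coverage as the weighted average over these smudges, where each weight is the number of k-mer pairs within the smudge.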
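The weighted average described here can be written in a few lines. The function name `overall_coverage` and the input layout (a list of `(coverage_estimate, n_kmer_pairs)` tuples) are assumptions made for illustration.

```python
def overall_coverage(smudges):
    """Weighted mean of per-smudge coverage estimates.

    smudges: list of (coverage_estimate, n_kmer_pairs) tuples;
    the k-mer pair count of each smudge is used as its weight.
    """
    total_pairs = sum(n for _, n in smudges)
    if total_pairs == 0:
        raise ValueError("no k-mer pairs to average over")
    return sum(cov * n for cov, n in smudges) / total_pairs
```

As an example, two smudges with estimates 20.0 and 22.0 and weights 300 and 100 give (20.0 × 300 + 22.0 × 100) / 400 = 20.5, so the larger smudge dominates the estimate.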
To compute the coverage of an individual smudge, we first label it according to its assumed structure. For example, of all the smudges with a relative minor coverage near 0.5, the one with the lowest sum of coverages is assumed to be AB, and the others are labeled using the AB smudge as a reference. This process is repeated for every relative minor coverage among the identified smudges until all smudges are labeled.
Finally, the estimate of monoploid coverage for an individual smudge is given by its sum of coverages divided by the number of k-mers that make up its labeled structure. For example, the estimate for an AAB smudge would be (CovA+CovB)/3, since three k-mers make up the AAB structure.
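In code, the per-smudge estimate reduces to a single division. The sketch below assumes the label is a string such as "AB" or "AAB" whose length equals the number of k-mers in the structure; the function name `smudge_coverage` is hypothetical.

```python
def smudge_coverage(label, sum_of_coverages):
    """Monoploid coverage estimate for one labeled smudge.

    label: structure string such as "AB", "AAB", "AABB".
    sum_of_coverages: CovA + CovB at the smudge.
    """
    if not label:
        raise ValueError("label must not be empty")
    return sum_of_coverages / len(label)
```

An AAB smudge with CovA + CovB = 63 yields 63 / 3 = 21, and an AB smudge with a sum of 42 yields the same 21, which is what one would expect if both smudges come from the same genome.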