[Python Exercise] Counting and Filtering K-mers for All DNA Sequences in a Directory

Author: 生物信息与育种 | Published: 2020-05-01 22:58

Background

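The k-mer is a concept that comes up constantly in genome assembly algorithms. Put simply, a k-mer is a nucleotide sequence of length k. A read of length m can be split into m-k+1 overlapping k-mers. The choice of k-mer length and count threshold directly affects the quality of the assembly.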
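
As a small illustration of the m-k+1 relationship, the sketch below splits one made-up read into its overlapping k-mers. The function name `list_kmers` and the example read are hypothetical and chosen only for this demonstration.

```python
def list_kmers(read, k):
    """Return every overlapping substring of length k in read."""
    # There are len(read) - k + 1 valid start positions; if k is
    # longer than the read, the range is empty and no k-mers result.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ATGCGT"            # hypothetical read, m = 6
kmers = list_kmers(read, 3)  # k = 3

print(kmers)       # ['ATG', 'TGC', 'GCG', 'CGT']
print(len(kmers))  # 4, which equals m - k + 1 = 6 - 3 + 1
```

A read shorter than k simply yields an empty list, so short lines do not need special handling.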

De novo assembly workflow: raw data -> data filtering -> error correction -> k-mer analysis -> de novo assembly.

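Sequencing strategy for assembly: based on the genome size and the specifics of the project, choose an approximate value of k, the amount of data needed to build contigs, and the number of libraries to construct. For plant genomes a large k (>31) is usually considered, while for animal genomes k is typically around 27; adjust according to the genome at hand. K-mer analysis should be run once the short-insert library data reaches roughly 20X coverage. If the k-mer analysis looks normal, continue sequencing until the final target data volume is reached.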
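
To make the 20X figure concrete, here is a rough sketch of the usual depth arithmetic (depth = number of reads x read length / genome size). The genome size and read length below are invented for illustration and do not come from the original post; both helper names are hypothetical.

```python
import math


def sequencing_depth(num_reads, read_length, genome_size):
    """Average coverage depth: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Smallest number of reads that reaches target_depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical project: 500 Mb genome, 150 bp reads, 20X target.
genome_size = 500_000_000
read_length = 150

n = reads_needed(20, read_length, genome_size)
print(n)  # 66666667
print(sequencing_depth(n, read_length, genome_size) >= 20)  # True
```

This ignores reads lost to filtering and error correction, so a real project would plan for some margin above the computed count.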

Code

import os
import sys

# convert command line arguments to variables
kmer_size = int(sys.argv[1])
count_cutoff = int(sys.argv[2])

# define the function to split dna into overlapping k-mers
def split_dna(dna, kmer_size):
    kmers = []
    # a sequence of length m has m - kmer_size + 1 start positions
    for start in range(len(dna) - kmer_size + 1):
        kmers.append(dna[start:start + kmer_size])
    return kmers

# create an empty dictionary to hold the counts
kmer_counts = {}

# process each file with the right name
for file_name in os.listdir("."):
    if file_name.endswith(".dna"):
        # the with-block closes the file even if an error occurs
        with open(file_name) as dna_file:

            # process each DNA sequence in a file (one sequence per line)
            for line in dna_file:
                # strip() also removes "\r" from Windows line endings
                dna = line.strip()

                # increase the count for each k-mer that we find
                for kmer in split_dna(dna, kmer_size):
                    kmer_counts[kmer] = kmer_counts.get(kmer, 0) + 1

# print k-mers whose counts are above the cutoff
for kmer, count in kmer_counts.items():
    if count > count_cutoff:
        print(kmer + " : " + str(count))
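
The counting and filtering steps above can also be sketched with `collections.Counter` from the standard library, which replaces the manual `get(kmer, 0) + 1` bookkeeping. This is an alternative formulation under the same assumption of one sequence per line, not the method used in the original script; `count_kmers`, `filter_kmers`, and the two example sequences are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, kmer_size):
    """Count every overlapping k-mer across an iterable of sequences."""
    counts = Counter()
    for dna in sequences:
        # Counter.update adds one to the tally of each item it is given.
        counts.update(
            dna[i:i + kmer_size] for i in range(len(dna) - kmer_size + 1)
        )
    return counts


def filter_kmers(counts, count_cutoff):
    """Keep only k-mers seen strictly more than count_cutoff times."""
    return {kmer: n for kmer, n in counts.items() if n > count_cutoff}


# Made-up input standing in for the lines of a .dna file.
sequences = ["ATGATG", "TGATGA"]
counts = count_kmers(sequences, 3)
frequent = filter_kmers(counts, 2)

for kmer, n in sorted(frequent.items()):
    print(kmer + " : " + str(n))
# ATG : 3
# TGA : 3
```

Taking the sequences as an argument instead of reading files inside the loop keeps the counting logic easy to test on small inputs.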

Ref: https://www.cnblogs.com/leezx/p/5577600.html
