2019-10-25

Author: 树懒吃糖_ | Published 2019-10-25 10:10

    File Processing

    Since my day-to-day work often involves batch processing of large numbers of files, I am collecting the scripts I have used here.

    The main topics covered:
    1. File searching; splitting file contents and extracting fields; merging fields extracted from multiple files; counting field values.
    Python's collections.Counter is particularly convenient for counting and is recommended. I have not tested how fast it is on very large datasets; that would make a good future test.
    (P.S. Counting tens of thousands of records is no problem at all; I have used it for that.)
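    As a minimal sketch of the Counter usage recommended above (the word list here is made up for illustration):

```python
from collections import Counter

# Counter accepts any iterable and tallies occurrences of each item.
words = ["chr1", "chr2", "chr1", "chr1", "chrX"]
counts = Counter(words)

# most_common(n) returns (item, count) pairs sorted by count, descending;
# ties keep first-seen order on Python 3.7+.
print(counts.most_common(2))  # [('chr1', 3), ('chr2', 1)]
```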

    Since my day-to-day work is on Linux, these scripts were written to run on a server; if you run them on Windows, some small details may need adjusting. Python can also raise encoding errors when reading files; see other posts for details.
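    For the encoding issues mentioned above, one common workaround is to pass an explicit encoding to open(). This is a sketch, not from the original post; read_lines is a helper name introduced here, and the right encoding depends on your files:

```python
# errors='replace' substitutes undecodable bytes with U+FFFD instead of
# raising UnicodeDecodeError, which is often acceptable for log-style files.
def read_lines(path, encoding='utf-8'):
    with open(path, encoding=encoding, errors='replace') as fh:
        return fh.readlines()
```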

    import os
    def search_file(base, tag, outpath):
        '''
        Args: base, root directory to search; tag, suffix that target
        filenames end with; outpath, output file for the matched paths.
        '''
        with open(outpath, 'w') as outfile:
            for dirbase, dirnames, filenames in os.walk(base):
                for filename in filenames:
                    if filename.endswith(tag):    # or filename.startswith(tag) to match by prefix
                        path = os.path.join(dirbase, filename)
                        outfile.write(path + '\n')
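    The same os.walk pattern can also return the matches as a list instead of writing them to a file. A sketch; find_files is a name introduced here, not from the original script:

```python
import os

def find_files(base, suffix):
    """Collect full paths of files under base whose names end with suffix."""
    hits = []
    for dirbase, _dirnames, filenames in os.walk(base):
        for filename in filenames:
            if filename.endswith(suffix):
                hits.append(os.path.join(dirbase, filename))
    return hits
```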
    
    def build_merge_scripts(sample_root):
        '''Generate a samtools merge script per sample plus a driver script.
        Renamed from search_file to avoid clashing with the function above;
        paths() is assumed to return the list of sample directories.'''
        p_list = paths(sample_root)
        with open(r'./merge.sh', 'w') as sh:   # driver script
            sh.write('echo start' + '\n')
            for base in p_list:
                cur_path = os.path.join(base, '01.readfilter')
                sampleid = os.path.split(base)[-1]
                bam_list = list()
                for dirbase, dirnames, filenames in os.walk(cur_path):
                    for filename in filenames:
                        if filename.endswith('bwa.srt.bam'):
                            bam_list.append(os.path.join(dirbase, filename))

                if len(bam_list) > 1:
                    name = os.path.join(r'/OLD_LIB/home/bam', sampleid + '.merge.sh')
                    outbam = '/OLD_LIB/home/works/bam/' + sampleid + '.merge.bam'
                    cmd = 'samtools merge -@ 10 ' + outbam + '  ' + '  '.join(bam_list)
                    index = 'samtools index -@ 5 ' + outbam + ' ' + outbam + '.bai'
                    with open(name, 'w') as sh_merge:
                        sh_merge.write(cmd + '\n')
                        sh_merge.write(index + '\n')
                    sh.write('sh ' + name + '\n')
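    For a hypothetical sample S01 with two lane-level BAMs, the generated per-sample script would contain two commands like these (sample name and BAM filenames are illustrative):

```shell
samtools merge -@ 10 /OLD_LIB/home/works/bam/S01.merge.bam  lane1.bwa.srt.bam  lane2.bwa.srt.bam
samtools index -@ 5 /OLD_LIB/home/works/bam/S01.merge.bam /OLD_LIB/home/works/bam/S01.merge.bam.bai
```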
    
    import sys
    def file_split(path):
        '''Args: path, the file to process (a VCF-like, tab-separated file).'''
        if not os.path.exists(path):
            print('file not exist!!  \n Please recheck')
            sys.exit(1)
        diction = dict()     # all first-seen variants
        diction1 = dict()    # variants with allele frequency > 0.2
        with open(path) as file:
            for line in file:
                if not line.startswith('#'):
                    aa = line.strip().split('\t')
                    # VCF columns: CHROM POS ID REF ALT ... FORMAT SAMPLE
                    chrom, pos, _, ref, alt = aa[0:5]
                    key = chrom + '\t' + pos + '\t' + ref + '\t' + alt
                    names = aa[8].split(':')
                    vals = aa[9].split(':')
                    infs = dict(zip(names, vals))
                    ad = infs['AD']    # alt-supporting read depth
                    rd = infs['RD']    # ref-supporting read depth
                    dp = int(ad) + int(rd)
                    freq = round(int(ad) / dp, 4)
                    if key not in diction:
                        diction[key] = str(dp) + '\t' + str(ad) + '\t' + str(freq)
                        if freq > 0.2:
                            diction1[key] = str(dp) + '\t' + str(ad) + '\t' + str(freq)
        return diction, diction1
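    The FORMAT/sample parsing step above, which zips column 9's keys with column 10's values, works like this on a made-up VarScan-style pair (all values are illustrative):

```python
# Pair a VCF FORMAT string with its sample string to get a field dict.
names = 'GT:GQ:DP:RD:AD'.split(':')
vals = '0/1:40:120:90:30'.split(':')
infs = dict(zip(names, vals))

rd, ad = int(infs['RD']), int(infs['AD'])
dp = rd + ad                      # total depth = ref reads + alt reads
freq = round(ad / dp, 4)          # alt allele frequency
print(dp, freq)  # 120 0.25
```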
    
    from collections import Counter
    def count_words(path):
        '''The template file has one word per line; if a line holds
        multiple columns, split it into single words first.'''
        words = list()
        with open(path) as file:
            for line in file:
                words.append(line.strip())
        return Counter(words).most_common()
    

    END

    Link: https://www.haomeiwen.com/subject/lanjvctx.html