美文网首页三代组装_注释_11.18.23
python: 整理Unicycler基因组组装结果

python: 整理Unicycler基因组组装结果

作者: 胡童远 | 来源:发表于2020-07-01 19:39 被阅读0次

    导读

    将二代三代Unicycler混合组装结果进行整理:抽提染色体/质粒序列,统计GC含量,统计序列长度,结构。

    一、Unicycler结果

    Unicycler二代、三代混合组装、矫错、抛光

    -rw-rw-r--  1 cheng WST 2933634 1月   7 18:06 001_best_spades_graph.gfa
    -rw-rw-r--  1 cheng WST 2909841 1月   7 18:06 002_overlaps_removed.gfa
    -rw-rw-r--  1 cheng WST 2920506 1月   7 18:17 003_bridges_applied.gfa
    -rw-rw-r--  1 cheng WST 2906329 1月   7 18:17 004_final_clean.gfa
    -rw-rw-r--  1 cheng WST 2906330 1月   7 19:32 005_polished.gfa
    -rw-rw-r--  1 cheng WST 2945006 1月   7 19:32 assembly.fasta
    -rw-rw-r--  1 cheng WST 2906330 1月   7 19:32 assembly.gfa
    -rw-rw-r--  1 cheng WST   17815 1月   7 19:32 unicycler.log
    
    • 查看文件格式
    head assembly.fasta
    
    >1 length=1838724 depth=1.00x
    ATTACTTCTTCAGCAAAGTTTTCTTCTTTCTTTTCGATACCTTCGCCAACTTCATAACGAACAAAAGATT
    TTACCTTACCATTCTTAGAAGCAACATATTTTGCAACTGTTAAATCAGGATCCTTAACAAAGTCTTGATC
    ATCCAATGCAACTTCTGCAAAGAACTTGTTCAAGCGACCAGCAACCATTTTTTCAATGATTTTTTCTGGC
    TTACCTTCATTCTTAGCTTCTTCTGTAAGAACTTCCTTTTCATGTGCAACTTCAGCTTCTGGAACTTCAT
    

    二、去换行符

    python & shell:去除fasta文件的换行符

    • 格式

    三、抽提染色体/质粒序列,统计GC含量,统计序列长度,结构

    思路:
    1 将组装结果fasta读到字典
    2 抽提最长的染色体,并统计碱基数、GC含量
    3 给质粒建个结果文件夹
    4 抽提环状序列(质粒),并统计碱基数、GC含量
    5 格式化输出

    • 代码
    #!/usr/bin/env python3
    import re, os, sys
    ms, infile, outfile_genome, outfilefold_plasmid, out_table = sys.argv
    with open(infile) as f:
        Dict = {}
        for line in f:
            line = line.strip()
            if line[0] == ">":
                key = line
                Dict[key] = []
            else:
                Dict[key].append(line)
                
        i = 0
        with open(out_table, 'w') as table:
            with open(outfile_genome, 'w') as genome:
                table.write("\tBase(bp)\tGC%\tStructure\n")
                os.makedirs(outfilefold_plasmid)
                for key, value in Dict.items():
                    if key[0:9] == ">1 length":
                        seq = ''.join(value)   
                        genome.write("{}\n{}\n".format(key, seq))
                        # write table
                        length = len(seq)
                        gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
                        print("\033[32m{}\t{}\033[0m".format(key, length))
                        table.write("Chromosome\t{}\t{}\tChromosome\n".format(length, gc_percent))
                for key, value in Dict.items():
                    if "circular=true" in key:
                        seq = ''.join(value)
                        print("\033[32m{}\t{}\033[0m".format(key, len(seq)))
                        i += 1
                        out_name = outfilefold_plasmid + "/plasmid" + str(i) + ".fasta"
                        with open(out_name, 'w') as plasmid:
                            plasmid.write("{}\n{}\n".format(key, seq))
                            # write table
                            length = len(seq)
                            gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
                            ID = "Plasmid" + str(i)
                            table.write("{}\t{}\t{}\tCircle\n".format(ID, length, gc_percent))
    

    四、结果

    .
    ├── assembly_enter.fasta
    ├── assembly_genome.fasta
    ├── out_table_html.txt
    ├── out_table.txt
    └── plasmid
        ├── plasmid1.fasta
        ├── plasmid2.fasta
        ├── plasmid3.fasta
        └── plasmid4.fasta
    
    1 directory, 8 files
    
    cat out_table.txt 
    
        Base(bp)    GC% Structure
    Chromosome  1838724 33.12   Chromosome
    Plasmid1    21933   33.70   Circle
    Plasmid2    204654  31.99   Circle
    Plasmid3    2579    30.59   Circle
    Plasmid4    38837   33.80   Circle
    

    相关文章

      网友评论

        本文标题:python: 整理Unicycler基因组组装结果

        本文链接:https://www.haomeiwen.com/subject/ejbhqktx.html