导读

将二代三代Unicycler混合组装结果进行整理：抽提染色体/质粒序列，统计GC含量，统计序列长度，结构。

一、Unicycler结果

-rw-rw-r--  1 cheng WST 2933634 1月   7 18:06 001_best_spades_graph.gfa
-rw-rw-r--  1 cheng WST 2909841 1月   7 18:06 002_overlaps_removed.gfa
-rw-rw-r--  1 cheng WST 2920506 1月   7 18:17 003_bridges_applied.gfa
-rw-rw-r--  1 cheng WST 2906329 1月   7 18:17 004_final_clean.gfa
-rw-rw-r--  1 cheng WST 2906330 1月   7 19:32 005_polished.gfa
-rw-rw-r--  1 cheng WST 2945006 1月   7 19:32 assembly.fasta
-rw-rw-r--  1 cheng WST 2906330 1月   7 19:32 assembly.gfa
-rw-rw-r--  1 cheng WST   17815 1月   7 19:32 unicycler.log

查看文件格式

head assembly.fasta

>1 length=1838724 depth=1.00x
ATTACTTCTTCAGCAAAGTTTTCTTCTTTCTTTTCGATACCTTCGCCAACTTCATAACGAACAAAAGATT
TTACCTTACCATTCTTAGAAGCAACATATTTTGCAACTGTTAAATCAGGATCCTTAACAAAGTCTTGATC
ATCCAATGCAACTTCTGCAAAGAACTTGTTCAAGCGACCAGCAACCATTTTTTCAATGATTTTTTCTGGC
TTACCTTCATTCTTAGCTTCTTCTGTAAGAACTTCCTTTTCATGTGCAACTTCAGCTTCTGGAACTTCAT

二、去换行符

python & shell：去除fasta文件的换行符

格式

三、抽提染色体/质粒序列，统计GC含量，统计序列长度，结构

思路：
1 将组装结果fasta读到字典
2 抽提最长的染色体，并统计碱基数、GC含量
3 给质粒建个结果文件夹
4 抽提环状序列（质粒），并统计碱基数、GC含量
5 格式化输出

代码

#!/usr/bin/env python3
import re, os, sys
ms, infile, outfile_genome, outfilefold_plasmid, out_table = sys.argv
with open(infile) as f:
    Dict = {}
    for line in f:
        line = line.strip()
        if line[0] == ">":
            key = line
            Dict[key] = []
        else:
            Dict[key].append(line)
            
    i = 0
    with open(out_table, 'w') as table:
        with open(outfile_genome, 'w') as genome:
            table.write("\tBase(bp)\tGC%\tStructure\n")
            os.makedirs(outfilefold_plasmid)
            for key, value in Dict.items():
                if key[0:9] == ">1 length":
                    seq = ''.join(value)   
                    genome.write("{}\n{}\n".format(key, seq))
                    # write table
                    length = len(seq)
                    gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
                    print("\033[32m{}\t{}\033[0m".format(key, length))
                    table.write("Chromosome\t{}\t{}\tChromosome\n".format(length, gc_percent))
            for key, value in Dict.items():
                if "circular=true" in key:
                    seq = ''.join(value)
                    print("\033[32m{}\t{}\033[0m".format(key, len(seq)))
                    i += 1
                    out_name = outfilefold_plasmid + "/plasmid" + str(i) + ".fasta"
                    with open(out_name, 'w') as plasmid:
                        plasmid.write("{}\n{}\n".format(key, seq))
                        # write table
                        length = len(seq)
                        gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
                        ID = "Plasmid" + str(i)
                        table.write("{}\t{}\t{}\tCircle\n".format(ID, length, gc_percent))

四、结果

.
├── assembly_enter.fasta
├── assembly_genome.fasta
├── out_table_html.txt
├── out_table.txt
└── plasmid
    ├── plasmid1.fasta
    ├── plasmid2.fasta
    ├── plasmid3.fasta
    └── plasmid4.fasta

1 directory, 8 files

cat out_table.txt 

    Base(bp)    GC% Structure
Chromosome  1838724 33.12   Chromosome
Plasmid1    21933   33.70   Circle
Plasmid2    204654  31.99   Circle
Plasmid3    2579    30.59   Circle
Plasmid4    38837   33.80   Circle