导读
将二代三代Unicycler混合组装结果进行整理:抽提染色体/质粒序列,统计GC含量,统计序列长度,结构。
一、Unicycler结果
-rw-rw-r-- 1 cheng WST 2933634 1月 7 18:06 001_best_spades_graph.gfa
-rw-rw-r-- 1 cheng WST 2909841 1月 7 18:06 002_overlaps_removed.gfa
-rw-rw-r-- 1 cheng WST 2920506 1月 7 18:17 003_bridges_applied.gfa
-rw-rw-r-- 1 cheng WST 2906329 1月 7 18:17 004_final_clean.gfa
-rw-rw-r-- 1 cheng WST 2906330 1月 7 19:32 005_polished.gfa
-rw-rw-r-- 1 cheng WST 2945006 1月 7 19:32 assembly.fasta
-rw-rw-r-- 1 cheng WST 2906330 1月 7 19:32 assembly.gfa
-rw-rw-r-- 1 cheng WST 17815 1月 7 19:32 unicycler.log
- 查看文件格式
head assembly.fasta
>1 length=1838724 depth=1.00x
ATTACTTCTTCAGCAAAGTTTTCTTCTTTCTTTTCGATACCTTCGCCAACTTCATAACGAACAAAAGATT
TTACCTTACCATTCTTAGAAGCAACATATTTTGCAACTGTTAAATCAGGATCCTTAACAAAGTCTTGATC
ATCCAATGCAACTTCTGCAAAGAACTTGTTCAAGCGACCAGCAACCATTTTTTCAATGATTTTTTCTGGC
TTACCTTCATTCTTAGCTTCTTCTGTAAGAACTTCCTTTTCATGTGCAACTTCAGCTTCTGGAACTTCAT
二、去换行符
- 格式
![](https://img.haomeiwen.com/i19404765/090e67d56324b0e7.png)
三、抽提染色体/质粒序列,统计GC含量,统计序列长度,结构
思路:
1 将组装结果fasta读到字典
2 抽提最长的染色体,并统计碱基数、GC含量
3 给质粒建个结果文件夹
4 抽提环状序列(质粒),并统计碱基数、GC含量
5 格式化输出
- 代码
#!/usr/bin/env python3
import re, os, sys
ms, infile, outfile_genome, outfilefold_plasmid, out_table = sys.argv
with open(infile) as f:
Dict = {}
for line in f:
line = line.strip()
if line[0] == ">":
key = line
Dict[key] = []
else:
Dict[key].append(line)
i = 0
with open(out_table, 'w') as table:
with open(outfile_genome, 'w') as genome:
table.write("\tBase(bp)\tGC%\tStructure\n")
os.makedirs(outfilefold_plasmid)
for key, value in Dict.items():
if key[0:9] == ">1 length":
seq = ''.join(value)
genome.write("{}\n{}\n".format(key, seq))
# write table
length = len(seq)
gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
print("\033[32m{}\t{}\033[0m".format(key, length))
table.write("Chromosome\t{}\t{}\tChromosome\n".format(length, gc_percent))
for key, value in Dict.items():
if "circular=true" in key:
seq = ''.join(value)
print("\033[32m{}\t{}\033[0m".format(key, len(seq)))
i += 1
out_name = outfilefold_plasmid + "/plasmid" + str(i) + ".fasta"
with open(out_name, 'w') as plasmid:
plasmid.write("{}\n{}\n".format(key, seq))
# write table
length = len(seq)
gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
ID = "Plasmid" + str(i)
table.write("{}\t{}\t{}\tCircle\n".format(ID, length, gc_percent))
四、结果
.
├── assembly_enter.fasta
├── assembly_genome.fasta
├── out_table_html.txt
├── out_table.txt
└── plasmid
├── plasmid1.fasta
├── plasmid2.fasta
├── plasmid3.fasta
└── plasmid4.fasta
1 directory, 8 files
cat out_table.txt
Base(bp) GC% Structure
Chromosome 1838724 33.12 Chromosome
Plasmid1 21933 33.70 Circle
Plasmid2 204654 31.99 Circle
Plasmid3 2579 30.59 Circle
Plasmid4 38837 33.80 Circle
网友评论