How can i get contig file from the scaffold file (scaffold was generated from CLC). Is there any converter or programme? Ex : This is my scaffold : ACTGTGCATNNNNNNACGCTGCA and I want the contig file from scaffold like : Contig1 - ACTGTGCA and Contig2-ACGCTGCA
方法1:
perl 脚本
方法2:
try with FASTA/Q toolkit SeqKit and shell commands.
cat seqs.fa \
| seqkit fx2tab \
| cut -f 2 \
| sed -r 's/n+/\n/gi' \
| cat -n \
| seqkit tab2fx \
| seqkit replace -p "(.+)" -r "Contig{nr}"
cat seqs.fa \ # read file
| seqkit fx2tab \ # convert FASTA to tabular format: scaffold1 ACTGTGCATNNNNNNACGCTGCA
| cut -f 2 \ # select the 2nd column : ACTGTGCATNNNNNNACGCTGCA
| sed -r 's/n+/\n/gi' \ # replace the Ns with '\n' : ACTGTGCAT
# : ACGCTGCA
| cat -n \ # output row number : 1 ACTGTGCAT
# I just want a 2-column format : 2 ACGCTGCA
| seqkit tab2fx \ # convert tabular to FASTA : >1
# : ACTGTGCAT
| seqkit replace -p "(.+)" -r "Contig{nr}" # renname sequence header, {nr} means row number
方法3 python2.7
#!/usr/bin/python2.7
from Bio import SeqIO
import getopt,sys,re
def usage():
print "Usage: python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>"
try:
options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')
except getopt.GetoptError as err:
print str(err)
usage()
sys.exit()
for opt, arg in options:
if opt in ('-i'):
input_file=arg
if opt in ('-h'):
usage()
sys.exit()
elif opt in ('-o'):
output_file=arg
out=open(output_file, 'w')
sequence = ''.join([str(record.seq).strip() for record in SeqIO.parse(input_file, "fasta")])
m=re.sub('[nN]+','\n',sequence).split('\n')
for i in range(1,len(m)):
out.write('>contig_'+str(i)+'\n')
out.write(m[i]+'\n')
资料来源于 https://www.biostars.org/p/211400/
https://www.jianshu.com/p/02d0ed5b16bd
网友评论