对于蛋白的fa文件,由于是每个转录本对应一个蛋白序列。由于后期注释需要以基因为单位,因此需要对fa文件进行处理,计算每一个转录本的蛋白序列长度,然后提取最长的蛋白序列,该序列即为该基因的蛋白序列。具体代码(关键实现代码是if 判断语句)
如下:
open FA, "$ARGV[0]";
$/=">";
<FA>;
while(<FA>){
chomp;
my($id,$seq)=(split /\n/,$_,2)[0,1];
$seq=~s/\n//g;
($ID1,$ID2)=split /\./,$id;
#print"$ID1\n$seq\n";
$length=length($seq);
if(exists $hash{$ID1}){
if($length > length($hash{$ID1})) {
$hash{$ID1}=$seq;
}
else{
next;
}
}
else{
$hash{$ID1}=$seq;
}
}
foreach (keys %hash){
#$len=length($hash{$_});
print">$_\n$hash{$_}\n";
}
运行代码:
perl longgest_protein_ID.pl SWO.v3.0.protein.fa > longgest_protein.fa
其中SWO.v3.0.protein.fa如下:
head -20 SWO.v3.0.protein.fa
>Cs_ont_5g000040.3
MTEFMDLEAQDGVRMPWNVIPGTKQEASNCVVPVSAIYTPIKAFPVNNNSMPILPYAPLRCRTCRSILNPFSIVDFAAKI
WICPFCFQRNHFPPHYASITDDNLPAELFPQYTTIEYEPPGPGEKSSVPPVFMFVVDTCIIEEEMSFLKSALSQAIDLLP
DNSLVGLITFGTLVQVHELGGFGQIIPKTYVFKGSKDVSKDQLLEQLNFFIKKPKPSTGVIAGVRDGLSSDTIARFLVPA
FDCEFTLNSVLEELQKDPWPVPPDQRSTRCTGTALSIAASLLGACVPGSGARILAFVGGPSTEGPAAIVSKNLSEPIRSH
KDLDKDSAPHYHKAVKFYDALSKQLVHQGHVLDLFACALDQVGVAELKVAVEKTGGLVVLSDSFGHAVFKDSVRRVFHSG
DYDLGLSSNGIFEINCSKDIKVQGIIGPCASLEKKGPLCSDAVVGQGNTSAWKMCGLDKATSLCLVFEIVKKEIPDATLQ
STNNQFYFQFLTYYQHNCGQMRLRVTTLSRRWVAGPGSVQDLIAGFDQEAAAVVMARLVSFKMEIEAEFDPIRWLDKALI
HMCSRFGDYQKDSPSSFSLSPRFSIFPQFMFHLRRSQFVQVFNNSPDETAYFRMILNRENVTNSVVMIQPSLISYSFHSG
PEPALLDVAAIAADRILLLDSYFTVVIFHGATIAQWRKAGYHNQPEHQAFAQLLQAPHDDAGVIMKERFPVPRLVICDQH
GSQVGSYHNVQQMGGSLWMIVVCYLLSPGKDILGHSPMDVCT*
>Cs_ont_5g044840.1
MESLALPCLNSFTSSRASFHKEQSNKLLAKARYLNSSKPASLRLTPYYIAPSFNLGGIRMQAGGEEYELKQMRDMAAAKK
RWDALIRDGKVKVLTPREAGYAVQLSSKTLLDVRPSTERKKAWIKGSIWIPIFDIDDTFDAGSLPQKVTNFVMGGWWSGV
PTLSYNKQFLSKVEEKLPKDTDLIVACQKGLRSLAACELLYNAGYRNLFWVQGGLEAAEEEDLVREGPQPLKFAGIGGLS
EFLGWTGQQRAQAAREGWGYRLLYSARLVGVFLAADALIIGAQQVGRYIQDIRSH*
>Cs_ont_4g025730.1
MSVSMRDLDSAFQGAGQKAGIEIWRIENFKPVLVPKSSHGKFFTGDSYVILKTTASKSGALRHDIHYWLGKDTSQDEAGT
AAIKTVELDAALGGRAVQYREVQGHETEKFLSYFKPCIIPQEGGIASGFKRAEAEEHKIRLFVCRGKHVIHVKEVPFSRS
SLNHDDIFILDTQSKIFQFNGSNSSIQERAKALEVVQYIKDTYHDGKCEVAVVEDGKLMADAEAGEFWGFFGGFAPLPRK
网友评论