美文网首页便捷工具
提取最长蛋白序列(转录本一样适用)

提取最长蛋白序列(转录本一样适用)

作者: 余绕 | 来源:发表于2023-01-04 17:14 被阅读0次
    对于蛋白的fa文件,由于是每个转录本对应一个蛋白序列。由于后期注释需要以基因为单位,因此需要对fa文件进行处理,计算每一个转录本的蛋白序列长度,然后提取最长的蛋白序列,该序列即为该基因的蛋白序列。具体代码(关键实现代码是if 判断语句)如下:
    open FA, "$ARGV[0]";
    $/=">";
    <FA>;
    while(<FA>){
            chomp;
                my($id,$seq)=(split /\n/,$_,2)[0,1];
                $seq=~s/\n//g;
                       ($ID1,$ID2)=split /\./,$id;   
               #print"$ID1\n$seq\n"; 
                      $length=length($seq);
             if(exists  $hash{$ID1}){
    
                if($length > length($hash{$ID1})) {
                 $hash{$ID1}=$seq;           
                  }  
                   else{
                         next;
                      }
              }
              else{
              
                   $hash{$ID1}=$seq;
              }
          }
    
    foreach (keys %hash){
    
    #$len=length($hash{$_});
    
    print">$_\n$hash{$_}\n";
    
    }     
    

    运行代码:

    perl longgest_protein_ID.pl  SWO.v3.0.protein.fa > longgest_protein.fa
    

    其中SWO.v3.0.protein.fa如下:

    head -20  SWO.v3.0.protein.fa  
    >Cs_ont_5g000040.3
    MTEFMDLEAQDGVRMPWNVIPGTKQEASNCVVPVSAIYTPIKAFPVNNNSMPILPYAPLRCRTCRSILNPFSIVDFAAKI
    WICPFCFQRNHFPPHYASITDDNLPAELFPQYTTIEYEPPGPGEKSSVPPVFMFVVDTCIIEEEMSFLKSALSQAIDLLP
    DNSLVGLITFGTLVQVHELGGFGQIIPKTYVFKGSKDVSKDQLLEQLNFFIKKPKPSTGVIAGVRDGLSSDTIARFLVPA
    FDCEFTLNSVLEELQKDPWPVPPDQRSTRCTGTALSIAASLLGACVPGSGARILAFVGGPSTEGPAAIVSKNLSEPIRSH
    KDLDKDSAPHYHKAVKFYDALSKQLVHQGHVLDLFACALDQVGVAELKVAVEKTGGLVVLSDSFGHAVFKDSVRRVFHSG
    DYDLGLSSNGIFEINCSKDIKVQGIIGPCASLEKKGPLCSDAVVGQGNTSAWKMCGLDKATSLCLVFEIVKKEIPDATLQ
    STNNQFYFQFLTYYQHNCGQMRLRVTTLSRRWVAGPGSVQDLIAGFDQEAAAVVMARLVSFKMEIEAEFDPIRWLDKALI
    HMCSRFGDYQKDSPSSFSLSPRFSIFPQFMFHLRRSQFVQVFNNSPDETAYFRMILNRENVTNSVVMIQPSLISYSFHSG
    PEPALLDVAAIAADRILLLDSYFTVVIFHGATIAQWRKAGYHNQPEHQAFAQLLQAPHDDAGVIMKERFPVPRLVICDQH
    GSQVGSYHNVQQMGGSLWMIVVCYLLSPGKDILGHSPMDVCT*
    >Cs_ont_5g044840.1
    MESLALPCLNSFTSSRASFHKEQSNKLLAKARYLNSSKPASLRLTPYYIAPSFNLGGIRMQAGGEEYELKQMRDMAAAKK
    RWDALIRDGKVKVLTPREAGYAVQLSSKTLLDVRPSTERKKAWIKGSIWIPIFDIDDTFDAGSLPQKVTNFVMGGWWSGV
    PTLSYNKQFLSKVEEKLPKDTDLIVACQKGLRSLAACELLYNAGYRNLFWVQGGLEAAEEEDLVREGPQPLKFAGIGGLS
    EFLGWTGQQRAQAAREGWGYRLLYSARLVGVFLAADALIIGAQQVGRYIQDIRSH*
    >Cs_ont_4g025730.1
    MSVSMRDLDSAFQGAGQKAGIEIWRIENFKPVLVPKSSHGKFFTGDSYVILKTTASKSGALRHDIHYWLGKDTSQDEAGT
    AAIKTVELDAALGGRAVQYREVQGHETEKFLSYFKPCIIPQEGGIASGFKRAEAEEHKIRLFVCRGKHVIHVKEVPFSRS
    SLNHDDIFILDTQSKIFQFNGSNSSIQERAKALEVVQYIKDTYHDGKCEVAVVEDGKLMADAEAGEFWGFFGGFAPLPRK
    
    

    相关文章

      网友评论

        本文标题:提取最长蛋白序列(转录本一样适用)

        本文链接:https://www.haomeiwen.com/subject/uhzrcdtx.html