细菌基因组下载

作者: 涤生生 | 来源:发表于2019-01-07 21:47 被阅读39次

    最近需要尽可能的下载细菌基因组,ncbi ftp站点(ftp://ftp.ncbi.nlm.nih.gov/genomes)里面"all"这个目录可以下载,不过"all"目录还有古细菌之类的哈。如果只想取Refseq或genbank数据库其一,可以通过以下方式:

    #下载Refseq/genbank数据库中的细菌基因组数据
    #截止20190106,有140681条基因组数据
    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
    ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
    #文件第12列是拼接状态(Complete Genome/Scaffold/contig),第20列是ftp路径,下面是下载组装完整的基因组信息
    awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
    awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gbff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
    cut -d / -f 11 ftpfilepaths | paste - ftpfilepaths  | while read a b;do echo "wget -c -nd -r -np -k -L -p -nd -P genbank $b && gzip -d genbank/$a";done >run.download.sh
    #脚本只下载了gbff文件,当然,如果想要gff,fna,faa通过小脚本转就可以了,或者加上“-A faa.gz,fna.gz,gff.gz,gpff.gz”
    

    真的超级占存储空间,我只下载gbff文件,需要其他格式的文件再简单格式转化就可以了。同时,下载一批立马处理一批。
    Refseq和genbank的区别
    Refseq下载的是GCF开头的文件,而genbank下载的是GCA开头的文件。

    具体看这里:https://zhuanlan.zhihu.com/p/20749737

    GenBank是核苷酸数据库,RefSeq是基因数据库 ,具体哪个更全?互为补充,以前做分析的时候发现有些细菌基因组在GenBank里面没有在RefSeq有,还有可能两个数据库里面某个基因组只有一个fna文件没有注释,这种情况如果遇到自己动手用prokka十分钟就能解决。


    生信小脚本收藏癖,有时遇到比较实用的脚本会存下来放在自己环境变量bin里面,积累会使以后的分析变得更加高效

    附上两个网上看到好用的脚本

    1.GenBank转faa格式

    #!/usr/bin/env perl
    # This program reads gbk files and extracts all amino acid sequences from the
    # /translation fields into a FASTA file. The FASTA header contains a sequential
    # number followed by the taxon id, which is extracted from the
    # /db_xref="taxon:<ID>" field. Only letters in the 20 letter amino acid
    # alphabet are retained in the FASTA file.
    #
    # author: Peter Menzel
    #
    # This file is part of Kaiju, Copyright 2015-2017 Peter Menzel and Anders Krogh
    # Kaiju is licensed under the GPLv3, see the file LICENSE.
    #
    
    use strict;
    use warnings;
    use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;
    
    my $t = 0;
    my $taxid;
    my $protein_id;
    
    if(!defined $ARGV[1]) { die "Usage: $0 infile.gbk outfile.faa"; }
    open(OUT,">",$ARGV[1]) or die "Could not open file $ARGV[1] for writing.";
    my $in_fh = new IO::Uncompress::AnyUncompress $ARGV[0] or die "Opening input file failed: $AnyUncompressError\n";
    
    while(<$in_fh>) {
        chomp;
        if(m,/db_xref="taxon:(\d+)",) {
            $taxid = $1;
        }
        elsif(m,/protein_id="([^"]+)",) {
            $protein_id=$1;
        }
        elsif(m,\s+/translation="([^"]+)",)  {  
            if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
            print OUT ">$protein_id\_$taxid\n";
            my $seq = $1;
            $seq =~ tr/BZ/DE/;  # a.a. alphabet specifies `B' matches `N' or `D', and `Z' matches `Q' or `E.', here we use substitution with higher score
            $seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
            print OUT "$seq\n";
        }
        elsif(m,\s+/translation="([^"]+)$,) {
            if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
            print OUT ">$protein_id\_$taxid\n";
            $t = 1;
            my $seq = $1;
            $seq =~ tr/BZ/DE/;
            $seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
            print OUT "$seq\n";
        }
        elsif($t) {
            if(m,",) { 
                tr/BZ/DE/;
                s/[^ARNDCQEGHILKMFPSTWYV]//gi;
                print OUT $_,"\n";   
                $t = 0;
            }
            else { 
                tr/BZ/DE/;
                s/[^ARNDCQEGHILKMFPSTWYV]//gi;
                print OUT $_,"\n";   
            }
        }
    }
    close($in_fh);
    close(OUT);
    

    2.Genbank转fna格式

    #!/usr/bin/env python
    
    import sys
    from Bio import SeqIO
    
    if len(sys.argv) < 3:
        print('USAGE: gbk2fna GBK FNA')
        sys.exit(65)
    
    SeqIO.write(SeqIO.parse(sys.argv[1], 'genbank'), sys.argv[2], 'fasta')
    

    相关文章

      网友评论

        本文标题:细菌基因组下载

        本文链接:https://www.haomeiwen.com/subject/broprqtx.html