BLAST-The learning notes of the 

Author: Hypdoctor | Published 2017-12-03 19:55

    Basic Local Alignment Search Tool (BLAST)

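    An algorithm for comparing the primary structure of biological sequences, such as the amino acid sequences of different proteins or the DNA sequences of different genes. Given a database containing a set of sequences, BLAST lets researchers search it for sequences identical or similar to a sequence of interest. For example, when a previously unknown gene is discovered in a non-human animal, researchers typically run a BLAST search against the human genome to check, by sequence similarity, whether humans carry a similar gene. The BLAST algorithm and the program that implements it were developed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman, in work centered at the U.S. National Center for Biotechnology Information (NCBI). (from Wikipedia)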

    A suite of tools

    [Figure: blast-table.png]

    The key concepts of BLAST

    - Searches may take place in nucleotide space, in protein space, or in translated space, where nucleotides are translated into proteins.
    - Searches may implement search "strategies": optimizations for a particular task. Different search strategies will return different alignments.
    - Searches use alignments that rely on scoring matrices.
    - Searches may be customized with many additional parameters. BLAST has many subtle functions that most users never need.
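To make the scoring idea concrete, here is a minimal sketch that scores an ungapped nucleotide alignment with a flat match reward and mismatch penalty. The function name `score_ungapped` and the default values (+1 and -2) are illustrative assumptions, not BLAST's internal code; real searches also handle gaps and, for proteins, use substitution matrices such as BLOSUM62.

```bash
# Hypothetical helper: score two equal-length sequences position by position.
# $1, $2 = sequences; $3 = match reward (default 1); $4 = mismatch penalty (default -2)
score_ungapped() {
    awk -v a="$1" -v b="$2" -v reward="${3:-1}" -v penalty="${4:--2}" 'BEGIN {
        if (length(a) != length(b)) {
            print "sequences must have equal length" > "/dev/stderr"
            exit 1
        }
        score = 0
        for (i = 1; i <= length(a); i++)
            score += (substr(a, i, 1) == substr(b, i, 1)) ? reward : penalty
        print score
    }'
}

score_ungapped ACGT ACGA        # three matches, one mismatch: prints 1
score_ungapped ACGT ACGT 2 -3   # four matches at reward 2: prints 8
```

Changing the reward and penalty changes which alignments score best, which is one reason different search strategies return different alignments.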

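    Basic steps for using BLAST

    1. Build a BLAST database with makeblastdb.
    2. Choose the appropriate tool: blastn, blastp, blastx, and so on.
    3. Run the tool and, where needed, format the output.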
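The choice in step 2 follows from the type of the query and the type of the database. The sketch below encodes that mapping as a small shell function; `pick_blast_tool` and its argument names are hypothetical and only illustrate the decision, they are not part of the BLAST package.

```bash
# Hypothetical helper: print the BLAST program for a query/database pair.
# $1 = query type (nucl | prot), $2 = database type (nucl | prot)
# $3 = "translated" to compare two nucleotide sequences in protein space
pick_blast_tool() {
    case "$1:$2:${3:-}" in
        nucl:nucl:)           echo blastn ;;   # nucleotide query, nucleotide database
        prot:prot:)           echo blastp ;;   # protein query, protein database
        nucl:prot:*)          echo blastx ;;   # translated query, protein database
        prot:nucl:*)          echo tblastn ;;  # protein query, translated database
        nucl:nucl:translated) echo tblastx ;;  # both sides translated
        *) echo "unsupported combination: $*" >&2; return 1 ;;
    esac
}

pick_blast_tool nucl prot              # prints blastx
pick_blast_tool nucl nucl translated   # prints tblastx
```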

    Build a blast database

    # Create the database directory
    mkdir -p ~/refs/ebola
    # Fetch the Ebola virus nucleotide sequence
    efetch -db nucleotide -id KM233118 -format fasta > ~/refs/ebola/KM233118.fa
    

    Build the Ebola nucleotide database with the makeblastdb command
    makeblastdb -help | more

    USAGE
      makeblastdb [-h] [-help] [-in input_file] [-input_type type]
        -dbtype molecule_type [-title database_title] [-parse_seqids]
        [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
        [-mask_desc mask_algo_descriptions] [-gi_mask]
        [-gi_mask_name gi_based_mask_names] [-out database_name]
        [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
        [-taxid_map TaxIDMapFile] [-version]
    DESCRIPTION
       Application to create BLAST databases, version 2.7.1+
    REQUIRED ARGUMENTS
     -dbtype <String, `nucl', `prot'>
       Molecule type of target db
    OPTIONAL ARGUMENTS
     -h
       Print USAGE and DESCRIPTION;  ignore all other parameters
     -help
       Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
     -version
       Print version number;  ignore other arguments
     *** Input options
     -in <File_In>
       Input file/database name
       Default = `-'
     -input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>
       Type of the data specified in input_file
       Default = `fasta'
     *** Configuration options
     -title <String>
       Title for BLAST database
       Default = input file name provided to -in argument
     -parse_seqids
       Option to parse seqid for FASTA input if set, for all other input types
       seqids are parsed automatically
     -hash_index
       Create index of sequence hash values.
     *** Sequence masking options
     -mask_data <String>
       Comma-separated list of input files containing masking data as produced by
       NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
     -mask_id <String>
       Comma-separated list of strings to uniquely identify the masking algorithm
        * Requires:  mask_data
        * Incompatible with:  gi_mask
     -mask_desc <String>
       Comma-separated list of free form strings to describe the masking algorithm
       details
        * Requires:  mask_id
     -gi_mask
       Create GI indexed masking data.
        * Requires:  parse_seqids
        * Incompatible with:  mask_id
     -gi_mask_name <String>
       Comma-separated list of masking data output files.
        * Requires:  mask_data, gi_mask
     *** Output options
     -out <String>
       Name of BLAST database to be created
       Default = input file name provided to -in argumentRequired if multiple
       file(s)/database(s) are provided as input
     -max_file_sz <String>
       Maximum file size for BLAST database files
       Default = `1GB'
     -logfile <File_Out>
       File to which the program log should be redirected
     *** Taxonomy options
     -taxid <Integer, >=0>
       Taxonomy ID to assign to all sequences
        * Incompatible with:  taxid_map
     -taxid_map <File_In>
       Text file mapping sequence IDs to taxonomy IDs.
       Format:<SequenceId> <TaxonomyId><newline>
        * Requires:  parse_seqids
        * Incompatible with:  taxid
    
    # Build the Ebola nucleotide database
    makeblastdb -in ~/refs/ebola/KM233118.fa -dbtype nucl -out ~/refs/ebola/KM233118
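Once the database exists, a quick way to check it is to search it with a piece of the same genome, which should come back as a hit against itself. The sketch below assumes a single-record FASTA file; the helper names `make_query` and `self_hit_search` are hypothetical, while `blastn -db`, `-query`, and `-outfmt 6` (tab-separated output) are standard BLAST+ options.

```bash
# Hypothetical helper: print a query made of the first N sequence lines of a FASTA file.
make_query() {
    local fasta="$1" n="${2:-2}"
    echo ">query"
    grep -v '^>' "$fasta" | head -n "$n"
}

# Hypothetical helper: search the database with a fragment of its own source sequence.
self_hit_search() {
    local fasta="$1" db="$2" query
    query=$(mktemp) || return 1
    make_query "$fasta" 2 > "$query"
    # Columns of -outfmt 6: qseqid sseqid pident length mismatch gapopen
    #                       qstart qend sstart send evalue bitscore
    blastn -db "$db" -query "$query" -outfmt 6
    rm -f "$query"
}
```

With the files created above, the check would be run as `self_hit_search ~/refs/ebola/KM233118.fa ~/refs/ebola/KM233118`.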
    

    Build a protein database for PRJNA257197

    # Download all protein sequences for PRJNA257197 as a FASTA file
    mkdir -p index
    esearch -db protein -query PRJNA257197 | efetch -format fasta > index/all-proteins.fa
    # Build the protein database
    makeblastdb -in index/all-proteins.fa -dbtype prot -out index/all -parse_seqids
    # List the database contents, printing each entry's accession ("%a")
    blastdbcmd -db index/all -entry 'all' -outfmt "%a" | head
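A simple sanity check after building a database is to compare the number of records in the FASTA file with the number of entries the database reports. This is a minimal sketch; `count_fasta_records` and `check_db_complete` are hypothetical helper names, and the `blastdbcmd` call is the same one used above.

```bash
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: compare the FASTA record count with the database entry count.
check_db_complete() {
    local fasta="$1" db="$2" expected actual
    expected=$(count_fasta_records "$fasta")
    actual=$(blastdbcmd -db "$db" -entry all -outfmt "%a" | wc -l | tr -d ' ')
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $actual sequences in $db"
    else
        echo "MISMATCH: FASTA has $expected records, database has $actual" >&2
        return 1
    fi
}
```

For the protein database above, the call would be `check_db_complete index/all-proteins.fa index/all`.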
    

    Downloading BLAST databases

    NCBI provides downloadable databases for many species and for nearly all known sequences on its website.

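    # Create a directory for the downloaded databases
    mkdir -p ~/refs/refseq
    cd ~/refs/refseq
    # The BLAST package ships update_blastdb.pl, which downloads databases prebuilt by NCBI
    # List all available databases
    update_blastdb.pl --showall | more
    # Download the 16S microbial database
    update_blastdb.pl 16SMicrobial --decompress
    # Download the taxonomy database
    update_blastdb.pl taxdb --decompress
    # Add the database path to the BLASTDB environment variable; taxonomy lookups also require this (on macOS)
    echo 'export BLASTDB=$BLASTDB:$HOME/refs/refseq/' >> ~/.bash_profile
    source ~/.bash_profile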
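Because BLASTDB is a colon-separated list of directories, re-sourcing the profile can add the same directory more than once. The sketch below only appends the directory when it is missing; `blastdb_has_dir` is a hypothetical helper name, and the directory is assumed to be the `~/refs/refseq` used above.

```bash
# Hypothetical helper: succeed if directory $1 is already listed in BLASTDB.
# A trailing slash on either side is ignored.
blastdb_has_dir() {
    case ":${BLASTDB:-}:" in
        *":${1%/}:"*|*":${1%/}/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

refseq_dir="$HOME/refs/refseq"   # assumed location of the downloaded databases
if ! blastdb_has_dir "$refseq_dir"; then
    # ${BLASTDB:+...} avoids a leading colon when BLASTDB was empty or unset
    export BLASTDB="${BLASTDB:+$BLASTDB:}$refseq_dir"
fi
```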
    (To be continued)
