美文网首页
NCBI远程批量blast获取物种信息

NCBI远程批量blast获取物种信息

作者: kkkkkkang | 来源:发表于2020-09-24 19:41 被阅读0次

本地化BLAST当然可行,但是占内存很多,同时,想获得物种名称,需要在设置好nr/nt数据库后把taxdb.tar.gz解压出来的taxdb.btdtaxdb.bti两个文件放到$BLASTDB
参考:如何在blast结果中输出物种名称https://segmentfault.com/a/1190000012055972

啊这。。不想这么麻烦。所以我就想直接-remote参数获得这些我想要的

曲线救国:

  1. blastn 被query的序列文件可以包含多个fasta序列
 blastn -query seq_1.fa -remote -db nr -outfmt "6 qseqid qlen sseqid sgi slen pident length mismatch gapopen qstart qend sstart send  evalue bitscore staxid ssciname" -out seq_1_out.txt

这样结果ssciname列依然是NA。。。

  1. 一般把第一个比对结果作为我们的结果,所以把相同的去除: python脚本如下
#!/usr/bin/env python3
#-*-coding : utf-8 -*-
import argparse
parser = argparse.ArgumentParser(description = "\nThis python script is used to extrac the first line among all the line with same start", add_help = False, usage = "\n python3 derep.py -i [input] -o [output] \n python3 seq_split.py --input [input] --output [output]")
required = parser.add_argument_group("Required options")
optional = parser.add_argument_group("Optional options")
required.add_argument("-i", "--input", metavar = "[input]", help = "input file", required = True)
required.add_argument("-o", "--output", metavar = "[output.fasta]", help = "output file", required = True)
optional.add_argument("-h", "--help", action = "help", help = "help.info")
args = parser.parse_args()
#read input file
output_dict = {}
with open (args.input, "r") as fi:
        for line in fi:
                start = line.split('\t')[0]
                if start not in output_dict.keys():
                        output_dict[start] = line
                else:
                        continue
with open (args.output, "w") as fo:
        for k,v in output_dict.items():
                print(k, file = fo)
                print(v, file = fo)
  1. 回到正题,这里已经有了GI accession number 如JX430817.1。假如需要所有样品的,可以简单写个脚本提取一下。
    想要获取物种组成当然可以直接去NCBI的Nucleotide那里搜
    想要批量的话得用一个R包taxize
    安装devtools::install_github("ropensci/taxize")
#基本用法
library(taxize)
genbank2uid(id = "JX430817.1")
No ENTREZ API key provided
 Get one via taxize::use_entrez()
See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
[[1]]
[1] "1237042"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] FALSE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1237042"
attr(,"name")
[1] "Streptomyces sp. YRA037 16S ribosomal RNA gene, partial sequence"
最后一行就是我们需要的,想批量可以用循环解决
> li  <- c("JX430824.1", "JX430817.1")
> for (i in li) {
+     print(genbank2uid(id = i))
+ }
No ENTREZ API key provided
 Get one via taxize::use_entrez()
See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
[[1]]
[1] "1237045"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] FALSE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1237045"
attr(,"name")
[1] "Streptomyces sp. YRA064 16S ribosomal RNA gene, partial sequence"

No ENTREZ API key provided
 Get one via taxize::use_entrez()
See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
[[1]]
[1] "1237042"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] FALSE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1237042"
attr(,"name")
[1] "Streptomyces sp. YRA037 16S ribosomal RNA gene, partial sequence"

虽然看起来好像搞复杂了,以后但真是批量的话,可能用得到,记录一下

相关文章

网友评论

      本文标题:NCBI远程批量blast获取物种信息

      本文链接:https://www.haomeiwen.com/subject/efktuktx.html