本地化
BLAST
当然可行,但是占内存很多,同时,想获得物种名称,需要在设置好nr/nt数据库后把taxdb.tar.gz
解压出来的taxdb.btd
和taxdb.bti
两个文件放到$BLASTDB
中
参考:如何在blast结果中输出物种名称https://segmentfault.com/a/1190000012055972
啊这。。不想这么麻烦。所以我就想直接-remote
参数获得这些我想要的
曲线救国:
- blastn 被query的序列文件可以包含多个fasta序列
blastn -query seq_1.fa -remote -db nr -outfmt "6 qseqid qlen sseqid sgi slen pident length mismatch gapopen qstart qend sstart send evalue bitscore staxid ssciname" -out seq_1_out.txt
这样结果ssciname列依然是NA。。。
- 一般把第一个比对结果作为我们的结果,所以把相同的去除: python脚本如下
#!/usr/bin/env python3
#-*-coding : utf-8 -*-
import argparse
parser = argparse.ArgumentParser(description = "\nThis python script is used to extrac the first line among all the line with same start", add_help = False, usage = "\n python3 derep.py -i [input] -o [output] \n python3 seq_split.py --input [input] --output [output]")
required = parser.add_argument_group("Required options")
optional = parser.add_argument_group("Optional options")
required.add_argument("-i", "--input", metavar = "[input]", help = "input file", required = True)
required.add_argument("-o", "--output", metavar = "[output.fasta]", help = "output file", required = True)
optional.add_argument("-h", "--help", action = "help", help = "help.info")
args = parser.parse_args()
#read input file
output_dict = {}
with open (args.input, "r") as fi:
for line in fi:
start = line.split('\t')[0]
if start not in output_dict.keys():
output_dict[start] = line
else:
continue
with open (args.output, "w") as fo:
for k,v in output_dict.items():
print(k, file = fo)
print(v, file = fo)
- 回到正题,这里已经有了GI accession number 如
JX430817.1
。假如需要所有样品的,可以简单写个脚本提取一下。
想要获取物种组成当然可以直接去NCBI的Nucleotide那里搜
想要批量的话得用一个R包taxize
安装devtools::install_github("ropensci/taxize")
#基本用法
library(taxize)
genbank2uid(id = "JX430817.1")
No ENTREZ API key provided
Get one via taxize::use_entrez()
See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
[[1]]
[1] "1237042"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] FALSE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1237042"
attr(,"name")
[1] "Streptomyces sp. YRA037 16S ribosomal RNA gene, partial sequence"
最后一行就是我们需要的,想批量可以用循环解决
> li <- c("JX430824.1", "JX430817.1")
> for (i in li) {
+ print(genbank2uid(id = i))
+ }
No ENTREZ API key provided
Get one via taxize::use_entrez()
See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
[[1]]
[1] "1237045"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] FALSE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1237045"
attr(,"name")
[1] "Streptomyces sp. YRA064 16S ribosomal RNA gene, partial sequence"
No ENTREZ API key provided
Get one via taxize::use_entrez()
See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
[[1]]
[1] "1237042"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] FALSE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1237042"
attr(,"name")
[1] "Streptomyces sp. YRA037 16S ribosomal RNA gene, partial sequence"
虽然看起来好像搞复杂了,以后但真是批量的话,可能用得到,记录一下
网友评论