NCBI我想大家是再熟悉不过的了,有时候我们需要按照自己的想法来下载一些序列在比对一下,网上也有很多点来点去的教程,但是面对数据量大一点的序列他们往往不太灵。
比如我们想下物种Phytophthora 的ITS1序列,在NCBI上输入Phytophthora ITS1 就可以了。然后把Search details里面的内容填到脚本相应的内容里面就好啦。
下载序列其实是一道填空题。
#!/usr/bin/perl
#use strict;
#######################################################
# Application 3: Retrieving large datasets
#
# Goal: Download all chimpanzee mRNA sequences in FASTA format (>50,000 sequences).
#
# Solution: First use ESearch to retrieve the GI numbers for these sequences and post them on the History server, then use multiple EFetch calls to retrieve the data in batches of 500.
#
# Input: $query – chimpanzee[orgn]+AND+biomol+mrna[prop]
#
# Output: A file named "chimp.fna" containing FASTA data.
# #######################################################
use LWP::Simple;
my ($query,$base,$url,$output,$web,$key,$count,$retmax);
#$query = 'chimpanzee[orgn]+AND+biomol+mrna[prop]';
$query = 'Phytophthora[Organism] AND ITS1[All Fields]';
#assemble the esearch URL
$base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
$url = $base."esearch.fcgi?db=nucleotide&term=$query&usehistory=y";
print "$url","\n";
#post the esearch URL
$output = get($url);
die "Couldn't get it!" unless defined $output;
#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);
#open output file for writing
open(OUT, ">Phytophthora_its.fna") || die "Can't open file!\n";
#retrieve data in batches of 500
$retmax = 500;
for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
my $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
$efetch_url .= "&query_key=$key&retstart=$retstart";
$efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
my $efetch_out = get($efetch_url);
print OUT "$efetch_out";
}
close OUT;
这是之前在下载数据的时候遇到一个脚本,觉得挺方便的,在这里分享给大家,但是不知道作者谁了,该作者保留所有权利!
网友评论