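Requirements

Download the FASTA sequences of all genomic DNA for Candida auris from NCBI.

Searching the nucleotide database directly for candida auris returns both DNA and mRNA records, so the mRNA records need to be filtered out.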


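Procedure

Method 1: Download directly from the web page

The simplest method is to download directly from the web page. On the results page, click Send To > Gene Features > FASTA Nucleotide to download all the sequences.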


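The drawback of this method is that when there are very many sequences, the web download can stop partway through. The browser reports the download as complete, but the file may contain only part of the sequences. After downloading, check that everything was actually retrieved.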
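One way to run that check is to count the FASTA header lines (lines starting with ">") in the downloaded file and compare the number with the result count shown on the web page. Below is a minimal sketch in base R. The helper name count_fasta_records is hypothetical, and the temporary file only stands in for the real download.

```r
# Count the records in a FASTA file by counting header lines (lines starting with ">").
# count_fasta_records is a hypothetical helper name used for illustration.
count_fasta_records <- function(path) {
  lines <- readLines(path, warn = FALSE)
  sum(startsWith(lines, ">"))
}

# A small temporary file stands in for the downloaded FASTA file
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 example", "ACGT", "", ">seq2 example", "GGCC"), tmp)
n_records <- count_fasta_records(tmp)
print(n_records)
```

If the number is lower than the count reported on the web page, the download was probably cut short. Note that readLines loads the whole file into memory, which may be slow for very large downloads.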

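Method 2: Download with code

NCBI provides many API tools for downloading from its databases. Here we use the R package rentrez. Other languages have corresponding tools as well.

One thing to watch for with the code approach: because we applied a filter when searching on the web page, the search in code must apply the same filter. Otherwise the results will differ.

NCBI provides a filter option for restricting results. In rentrez, we set the filter through the FILT field in the query.

To see which fields can be used in a query expression, use entrez_db_searchable.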

library(rentrez)
### List the searchable fields of the nucleotide database
entrez_db_searchable("nucleotide")

## Searchable fields for database 'nuccore'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to each sequence 
##   FILT    Limits the records 
##   WORD    Free text associated with record 
##   TITL    Words in definition line 
##   KYWD    Nonstandardized terms provided by submitter 
##   AUTH    Author(s) of publication 
##   JOUR    Journal abbreviation of publication 
##   VOL     Volume number of publication 
##   ISS     Issue number of publication 
##   PAGE    Page number(s) of publication 
##   ORGN    Scientific and common names of organism, and all higher levels of taxonomy 
##   ACCN    Accession number of sequence 
##   PACC    Does not include retired secondary accessions 
##   GENE    Name of gene associated with sequence 
##   PROT    Name of protein associated with sequence 
##   ECNO    EC number for enzyme or CAS registry number 
##   PDAT    Date sequence added to GenBank 
##   MDAT    Date of last update 
##   SUBS    CAS chemical name or MEDLINE Substance Name 
##   PROP    Classification by source qualifiers and molecule type 
##   SQID    String identifier for sequence 
##   GPRJ    BioProject 
##   SLEN    Length of sequence 
##   FKEY    Feature annotated on sequence 
##   PORG    Scientific and common names of primary organism, and all higher levels of taxonomy 
##   COMP    Component accessions for an assembly 
##   ASSM    Assembly 
##   DIV     Division 
##   STRN    Strain 
##   ISOL    Isolate 
##   CULT    Cultivar 
##   BRD     Breed 
##   BIOS    BioSample

As shown above, FILT can be used for filtering, so we use it in the search. The available filter values can be looked up under Advanced on the NCBI website.

CA <- entrez_search(db = "nucleotide", term = "candida auris AND (genomic DNA[FILT])", retmax = 10000)
### Check the number of search results
CA$count

## [1] 1630
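
Before fetching, it may be worth confirming that the search returned every id, because the ids element holds at most retmax entries even when count is larger. The sketch below assumes only that the search result behaves like a list with count and ids elements, as CA does. ids_complete is a hypothetical helper name, and the mock lists stand in for a real search result.

```r
# ids_complete is a hypothetical helper: TRUE when the number of returned ids
# equals the total hit count, i.e. retmax was large enough.
ids_complete <- function(search) {
  length(search$ids) == as.integer(search$count)
}

# Mock objects with the same shape as a search result (no network access needed)
full      <- list(count = 3, ids = c("101", "102", "103"))
truncated <- list(count = 5, ids = c("101", "102", "103"))

full_ok      <- ids_complete(full)       # all ids were returned
truncated_ok <- ids_complete(truncated)  # retmax was too small
```

With the real object this would be ids_complete(CA). Printing CA$QueryTranslation also shows how NCBI interpreted the search term, which helps when the count differs from the web page.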

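As shown above, the count matches the web search. Next we use entrez_fetch to fetch the FASTA sequence for each id. If too many records are requested from NCBI through the API in one call, the server closes the connection, so we fetch the sequences in batches with a loop.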

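id <- CA$ids
n <- 200  # number of records fetched per request
CAgenome <- character(0)
for (i in seq(1, length(id), by = n)) {
  # Clamp the end index so the last batch does not run past the end of id
  j <- min(i + n - 1, length(id))
  res <- entrez_fetch(db = "nucleotide", id = id[i:j], rettype = "fasta")
  CAgenome <- c(CAgenome, res)
}
# Write the batches back to back, without inserting separators between them
cat(CAgenome, file = "CAgenome.fa", sep = "")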

This gives us all of the FASTA sequences.
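
The batching step can also be written without index arithmetic by splitting the id vector into groups up front. This is a minimal sketch in base R. chunk_ids is a hypothetical helper name, and the generated ids are placeholders rather than real NCBI ids.

```r
# chunk_ids is a hypothetical helper: split a vector of ids into batches of
# at most n elements each. The last batch holds whatever is left over.
chunk_ids <- function(ids, n) {
  split(ids, ceiling(seq_along(ids) / n))
}

# Placeholder ids standing in for CA$ids (1630 hits, 200 per request)
fake_ids <- as.character(seq_len(1630))
batches <- chunk_ids(fake_ids, 200)

n_batches <- length(batches)             # number of requests needed
last_size <- length(batches[[n_batches]])  # size of the final, partial batch
```

Each element of batches could then be passed as the id argument of entrez_fetch inside a loop. Because no batch contains indices past the end of the vector, there are no NA ids to guard against.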