在R中进行蛋白序列与Pfam数据库的批量比对

作者: 董八七 | 来源:发表于2018-09-04 22:01 被阅读19次

Pfam是蛋白数据库，可以在线比对，但每次只能比对单条序列，非常繁琐。R包bio3d能够实现在R中进行蛋白序列批量与Pfam数据库的比对。
在安装这个包时遇到了一些问题。从R-cran中安装后，会报错Error in resurl["uuid", 1] : incorrect number of dimensions。改从GitHub安装，R3.5.0报错Rtool不兼容（电脑里的是Rtool35），重新安装Rtool也不能解决。安装R3.5.1反而是正常的。感谢该包的开发和维护人员。

用到的主函数是hmmer。
函数bio3d::read.fasta可以直接读取fasta文件，单条或多条序列都可以，但当多条序列时，得到的列表文件中序列是以矩阵形式呈现的，也就是难以单条提取进行比对，而这也是比对函数hmmer要求的，就是不能同时或循环比对list中的所有序列。我尝试用seqinr::read.fasta读进去后再转成bio3d要求的格式时没有成功。最后只能用比较笨的方法，先把fasta文件中的多条序列拆成单个文件，然后再逐条读取并比对，就可以了，但这始终是低效的方法。
代码如下：

library(bio3d)
# devtools::install_bitbucket('Grantlab/bio3d', subdir = 'ver_devel/bio3d/')
library(tidyverse)

# read fasta
seq <- readLines("D:/Lk.WOX.pep.fa")
rsult <- NULL
i <- 0
for (i in (i + 1):(length(seq)/2)) {
  # get seq nam
  seq_nam <- seq[i * 2 - 1] %>% str_remove(">")
  # get indiv seq & save into 'temp.txt'
  seq_temp <- seq[c(i * 2 - 1, i * 2)] %>% write(file = "temp.txt")
  # this fun (tryCatch) can keep the code going although some errors encountered
  # due to diverse reasons eg null returned after alignment
  tryCatch({
    seq_temp <- bio3d::read.fasta("temp.txt") %>% hmmer(type = "hmmscan", db = "pfam")
    # add seq id to the df returned after alignment
    seq_temp$hit.tbl$Larix_id <- seq_nam
    # collect the results into 'rsult'
    rsult <- rbind(rsult, seq_temp$hit.tbl)
  }, error = function(e) {
    cat("#---数据库不存在相应记录！\n")
  })
  cat("#---", seq_nam, "---", i, "\n")
}
result <- rsult
# mv the seq id to the 1st col
result <- result[c(length(names(result)), 1:(length(names(result)) - 1))]
write.csv(result, file = "D:/Lk.WOX.Pfam.csv", row.names = F, quote = F)

> sessionInfo()
R version 3.5.1 Patched (2018-08-31 r75231)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shiny_1.1.0       htmlwidgets_1.2.1 wordcloud2_0.2.1  stringi_1.2.4     quanteda_1.3.4   
 [6] koRpus_0.10-2     data.table_1.11.4 forcats_0.3.0     stringr_1.3.1     dplyr_0.7.6      
[11] purrr_0.2.5       readr_1.1.1       tidyr_0.8.1       tibble_1.4.2      ggplot2_3.0.0    
[16] tidyverse_1.2.1   dword_18.05-17   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18        lubridate_1.7.4     lattice_0.20-35     assertthat_0.2.0   
 [5] digest_0.6.15       mime_0.5            R6_2.2.2            cellranger_1.1.0   
 [9] plyr_1.8.4          backports_1.1.2     httr_1.3.1          pillar_1.3.0       
[13] rlang_0.2.1         lazyeval_0.2.1      readxl_1.1.0        rstudioapi_0.7     
[17] miniUI_0.1.1.1      Matrix_1.2-14       devtools_1.13.6     addinexamples_0.1.0
[21] munsell_0.5.0       broom_0.5.0         compiler_3.5.1      spacyr_0.9.91      
[25] httpuv_1.4.5        modelr_0.1.2        pkgconfig_2.0.1     htmltools_0.3.6    
[29] tidyselect_0.2.4    crayon_1.3.4        withr_2.1.2         later_0.7.3        
[33] grid_3.5.1          xtable_1.8-2        nlme_3.1-137        jsonlite_1.5       
[37] gtable_0.2.0        magrittr_1.5        formatR_1.5         scales_0.5.0       
[41] RcppParallel_4.4.1  cli_1.0.0           promises_1.0.1      bindrcpp_0.2.2     
[45] xml2_1.2.0          stopwords_0.9.0     fastmatch_1.1-0     AAremedy_0.0.0.9600
[49] tools_3.5.1         glue_1.3.0          hms_0.4.2           yaml_2.2.0         
[53] colorspace_1.3-2    rvest_0.3.2         memoise_1.1.0       knitr_1.20         
[57] bindr_0.1.1         haven_1.1.2

网友评论

本文标题：在R中进行蛋白序列与Pfam数据库的批量比对

本文链接：https://www.haomeiwen.com/subject/gzgnwftx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

在R中进行蛋白序列与Pfam数据库的批量比对

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

R - tips

基因组