Pfam是蛋白数据库,可以在线比对,但每次只能比对单条序列,非常繁琐。
R包bio3d
能够实现在R
中进行蛋白序列批量与Pfam数据库的比对。
在安装这个包时遇到了一些问题。从R-cran
中安装后,会报错Error in resurl["uuid", 1] : incorrect number of dimensions
。改从GitHub
安装,R3.5.0报错Rtool不兼容(电脑里的是Rtool35),重新安装Rtool也不能解决。安装R3.5.1反而是正常的。感谢该包的开发和维护人员。
用到的主函数是hmmer
。
函数bio3d::read.fasta
可以直接读取fasta
文件,单条或多条序列都可以,但当多条序列时,得到的列表文件中序列是以矩阵形式呈现的,也就是难以单条提取进行比对,而这也是比对函数hmmer
要求的,就是不能同时或循环比对list中的所有序列。我尝试用seqinr::read.fasta
读进去后再转成bio3d
要求的格式时没有成功。最后只能用比较笨的方法,先把fasta
文件中的多条序列拆成单个文件,然后再逐条读取并比对,就可以了,但这始终是低效的方法。
代码如下:
library(bio3d)
# devtools::install_bitbucket('Grantlab/bio3d', subdir = 'ver_devel/bio3d/')
library(tidyverse)
# read fasta
seq <- readLines("D:/Lk.WOX.pep.fa")
rsult <- NULL
i <- 0
for (i in (i + 1):(length(seq)/2)) {
# get seq nam
seq_nam <- seq[i * 2 - 1] %>% str_remove(">")
# get indiv seq & save into 'temp.txt'
seq_temp <- seq[c(i * 2 - 1, i * 2)] %>% write(file = "temp.txt")
# this fun (tryCatch) can keep the code going although some errors encountered
# due to diverse reasons eg null returned after alignment
tryCatch({
seq_temp <- bio3d::read.fasta("temp.txt") %>% hmmer(type = "hmmscan", db = "pfam")
# add seq id to the df returned after alignment
seq_temp$hit.tbl$Larix_id <- seq_nam
# collect the results into 'rsult'
rsult <- rbind(rsult, seq_temp$hit.tbl)
}, error = function(e) {
cat("#---数据库不存在相应记录!\n")
})
cat("#---", seq_nam, "---", i, "\n")
}
result <- rsult
# mv the seq id to the 1st col
result <- result[c(length(names(result)), 1:(length(names(result)) - 1))]
write.csv(result, file = "D:/Lk.WOX.Pfam.csv", row.names = F, quote = F)
> sessionInfo()
R version 3.5.1 Patched (2018-08-31 r75231)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] shiny_1.1.0 htmlwidgets_1.2.1 wordcloud2_0.2.1 stringi_1.2.4 quanteda_1.3.4
[6] koRpus_0.10-2 data.table_1.11.4 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6
[11] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0
[16] tidyverse_1.2.1 dword_18.05-17
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 lubridate_1.7.4 lattice_0.20-35 assertthat_0.2.0
[5] digest_0.6.15 mime_0.5 R6_2.2.2 cellranger_1.1.0
[9] plyr_1.8.4 backports_1.1.2 httr_1.3.1 pillar_1.3.0
[13] rlang_0.2.1 lazyeval_0.2.1 readxl_1.1.0 rstudioapi_0.7
[17] miniUI_0.1.1.1 Matrix_1.2-14 devtools_1.13.6 addinexamples_0.1.0
[21] munsell_0.5.0 broom_0.5.0 compiler_3.5.1 spacyr_0.9.91
[25] httpuv_1.4.5 modelr_0.1.2 pkgconfig_2.0.1 htmltools_0.3.6
[29] tidyselect_0.2.4 crayon_1.3.4 withr_2.1.2 later_0.7.3
[33] grid_3.5.1 xtable_1.8-2 nlme_3.1-137 jsonlite_1.5
[37] gtable_0.2.0 magrittr_1.5 formatR_1.5 scales_0.5.0
[41] RcppParallel_4.4.1 cli_1.0.0 promises_1.0.1 bindrcpp_0.2.2
[45] xml2_1.2.0 stopwords_0.9.0 fastmatch_1.1-0 AAremedy_0.0.0.9600
[49] tools_3.5.1 glue_1.3.0 hms_0.4.2 yaml_2.2.0
[53] colorspace_1.3-2 rvest_0.3.2 memoise_1.1.0 knitr_1.20
[57] bindr_0.1.1 haven_1.1.2
网友评论