美文网首页R - tips基因组
在R中进行蛋白序列与Pfam数据库的批量比对

在R中进行蛋白序列与Pfam数据库的批量比对

作者: 董八七 | 来源:发表于2018-09-04 22:01 被阅读19次

    Pfam是蛋白数据库,可以在线比对,但每次只能比对单条序列,非常繁琐。R包bio3d能够实现在R中进行蛋白序列批量与Pfam数据库的比对。
    在安装这个包时遇到了一些问题。从R-cran中安装后,会报错Error in resurl["uuid", 1] : incorrect number of dimensions。改从GitHub安装,R3.5.0报错Rtool不兼容(电脑里的是Rtool35),重新安装Rtool也不能解决。安装R3.5.1反而是正常的。感谢该包的开发和维护人员。

    用到的主函数是hmmer
    函数bio3d::read.fasta可以直接读取fasta文件,单条或多条序列都可以,但当多条序列时,得到的列表文件中序列是以矩阵形式呈现的,也就是难以单条提取进行比对,而这也是比对函数hmmer要求的,就是不能同时或循环比对list中的所有序列。我尝试用seqinr::read.fasta读进去后再转成bio3d要求的格式时没有成功。最后只能用比较笨的方法,先把fasta文件中的多条序列拆成单个文件,然后再逐条读取并比对,就可以了,但这始终是低效的方法。
    代码如下:

    library(bio3d)
    # devtools::install_bitbucket('Grantlab/bio3d', subdir = 'ver_devel/bio3d/')
    library(tidyverse)
    
    # read fasta
    seq <- readLines("D:/Lk.WOX.pep.fa")
    rsult <- NULL
    i <- 0
    for (i in (i + 1):(length(seq)/2)) {
      # get seq nam
      seq_nam <- seq[i * 2 - 1] %>% str_remove(">")
      # get indiv seq & save into 'temp.txt'
      seq_temp <- seq[c(i * 2 - 1, i * 2)] %>% write(file = "temp.txt")
      # this fun (tryCatch) can keep the code going although some errors encountered
      # due to diverse reasons eg null returned after alignment
      tryCatch({
        seq_temp <- bio3d::read.fasta("temp.txt") %>% hmmer(type = "hmmscan", db = "pfam")
        # add seq id to the df returned after alignment
        seq_temp$hit.tbl$Larix_id <- seq_nam
        # collect the results into 'rsult'
        rsult <- rbind(rsult, seq_temp$hit.tbl)
      }, error = function(e) {
        cat("#---数据库不存在相应记录!\n")
      })
      cat("#---", seq_nam, "---", i, "\n")
    }
    result <- rsult
    # mv the seq id to the 1st col
    result <- result[c(length(names(result)), 1:(length(names(result)) - 1))]
    write.csv(result, file = "D:/Lk.WOX.Pfam.csv", row.names = F, quote = F)
    
    > sessionInfo()
    R version 3.5.1 Patched (2018-08-31 r75231)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    
    Matrix products: default
    
    locale:
    [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
    [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
    [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
    [4] LC_NUMERIC=C                                                   
    [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
     [1] shiny_1.1.0       htmlwidgets_1.2.1 wordcloud2_0.2.1  stringi_1.2.4     quanteda_1.3.4   
     [6] koRpus_0.10-2     data.table_1.11.4 forcats_0.3.0     stringr_1.3.1     dplyr_0.7.6      
    [11] purrr_0.2.5       readr_1.1.1       tidyr_0.8.1       tibble_1.4.2      ggplot2_3.0.0    
    [16] tidyverse_1.2.1   dword_18.05-17   
    
    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.18        lubridate_1.7.4     lattice_0.20-35     assertthat_0.2.0   
     [5] digest_0.6.15       mime_0.5            R6_2.2.2            cellranger_1.1.0   
     [9] plyr_1.8.4          backports_1.1.2     httr_1.3.1          pillar_1.3.0       
    [13] rlang_0.2.1         lazyeval_0.2.1      readxl_1.1.0        rstudioapi_0.7     
    [17] miniUI_0.1.1.1      Matrix_1.2-14       devtools_1.13.6     addinexamples_0.1.0
    [21] munsell_0.5.0       broom_0.5.0         compiler_3.5.1      spacyr_0.9.91      
    [25] httpuv_1.4.5        modelr_0.1.2        pkgconfig_2.0.1     htmltools_0.3.6    
    [29] tidyselect_0.2.4    crayon_1.3.4        withr_2.1.2         later_0.7.3        
    [33] grid_3.5.1          xtable_1.8-2        nlme_3.1-137        jsonlite_1.5       
    [37] gtable_0.2.0        magrittr_1.5        formatR_1.5         scales_0.5.0       
    [41] RcppParallel_4.4.1  cli_1.0.0           promises_1.0.1      bindrcpp_0.2.2     
    [45] xml2_1.2.0          stopwords_0.9.0     fastmatch_1.1-0     AAremedy_0.0.0.9600
    [49] tools_3.5.1         glue_1.3.0          hms_0.4.2           yaml_2.2.0         
    [53] colorspace_1.3-2    rvest_0.3.2         memoise_1.1.0       knitr_1.20         
    [57] bindr_0.1.1         haven_1.1.2 
    

    相关文章

      网友评论

        本文标题:在R中进行蛋白序列与Pfam数据库的批量比对

        本文链接:https://www.haomeiwen.com/subject/gzgnwftx.html