[R|Databases] A Summary of m6A Databases (Part 1)

Author: drlee_fc74 | Published 2019-10-27 21:36

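Requirements

As more and more databases become available online, we can often use them to make early predictions about a research topic. But how do we find the relevant databases in the first place?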

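Design approach

A new database is usually accompanied by a published article that introduces it. We can therefore search PubMed for these articles and then extract the database information from them.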

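Data processing

Building the search expression

All online databases are served as web pages, and the article describing a database will almost certainly include its URL. So we only need to search for the characteristic keywords of a URL together with the topic keyword we care about. A database article also usually mentions the URL in its abstract, so restricting the search to the abstract makes it more precise. To search for m6A-related databases, the search expression is therefore: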

term <- "(https|http|ftp|file[Title/Abstract]) AND m6A[Title/Abstract]"

Here https|http|ftp|file stands for the fixed prefixes that URLs begin with, and m6A is the topic whose databases we want to find.
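As a hedged sketch, the same kind of expression can be assembled from its parts so that the topic keyword is easy to swap. The helper name `build_term` is hypothetical, and joining the URL prefixes with an explicit `OR` is an assumption made here (PubMed documents `AND`, `OR`, and `NOT` as its Boolean operators). The expression above uses `|`, and the results shown in this post come from that original form, so the two queries may not return identical hits.

```r
# Hypothetical helper: build a PubMed query that requires a URL-like token
# and a topic keyword in the title or abstract.
build_term <- function(keyword,
                       prefixes = c("https", "http", "ftp", "file"),
                       field = "[Title/Abstract]") {
  # Tag every prefix with the field and join them with an explicit OR (assumption)
  url_part <- paste0(prefixes, field, collapse = " OR ")
  paste0("(", url_part, ") AND ", keyword, field)
}

term_or <- build_term("m6A")
print(term_or)
```

The returned string can be passed as the `term` argument of `entrez_search()` in the same way as the hand-written expression.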

Running the search

We use rentrez to fetch the information on m6A-related databases:

library(rentrez)
## Run the search expression and return at most 100 results
pmid <- entrez_search(db = "pubmed", term = term, retmax = 100)
## Fetch the summary record of each article found
pub_sum <- entrez_summary(db = "pubmed", id = pmid$ids)
## Extract the fields we want: PubMed ID, publication date, article title, journal name and citation count
extract_pub <- extract_from_esummary(pub_sum, 
                                     c("uid", # PubMed ID
                                       "pubdate", # publication date
                                       "title", # article title
                                       "source", # abbreviated journal name
                                       "pmcrefcount" # citation count
                                       ))
## Convert to a data.frame
result <- data.frame(t(extract_pub)) 
head(result)

##               uid     pubdate
## 31624092 31624092 2019 Oct 17
## 31510685 31510685 2019 Jul 15
## 31504168 31504168 2019 Aug 28
## 31147718 31147718  2019 Jul 2
## 30993345 30993345 2019 Apr 23
## 30598068 30598068 2018 Dec 31
##                                                                                                                                     title
## 31624092                       Direct RNA sequencing enables m6A detection in endogenous transcript isoforms at base specific resolution.
## 31510685                               FunDMDeep-m6A: identification and prioritization of functional differential m6A methylation genes.
## 31504168                                                  CROSSalive: a web server for predicting the in vivo structure of RNA molecules.
## 31147718                                                           RNAmod: an integrated system for the annotation of mRNA modifications.
## 30993345 WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach.
## 30598068                                     DeepM6ASeq: prediction and characterization of m6A-containing sequences using deep learning.
##                      source pmcrefcount
## 31624092                RNA            
## 31510685     Bioinformatics            
## 31504168     Bioinformatics            
## 31147718  Nucleic Acids Res            
## 30993345  Nucleic Acids Res           3
## 30598068 BMC Bioinformatics           2
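The blank pmcrefcount cells above are records for which no count came back. A minimal sketch of turning that column into numbers and ranking by it follows. It assumes the column holds plain character values, which may not be exactly how `result` stores them, and the small `demo` table is only a few rows copied from the output above.

```r
# A few rows copied from the output above; empty strings mean no count was returned
demo <- data.frame(
  uid = c("31624092", "31147718", "30993345", "30598068"),
  source = c("RNA", "Nucleic Acids Res", "Nucleic Acids Res", "BMC Bioinformatics"),
  pmcrefcount = c("", "", "3", "2"),
  stringsAsFactors = FALSE
)

# Convert the counts to integers; empty strings become NA
cites <- demo$pmcrefcount
cites[cites == ""] <- NA
demo$pmcrefcount <- as.integer(cites)

# Sort by citation count, highest first, keeping uncounted entries at the end
demo_sorted <- demo[order(demo$pmcrefcount, decreasing = TRUE, na.last = TRUE), ]
print(demo_sorted)
```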

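Further refinement

Looking at the results, we find that even though the URL keywords already restrict the search, many database articles have titles that begin with "database name:". Filtering on this may drop some databases that do not follow the convention, but it is one way to narrow the list.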

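## Keep only the articles whose title contains ":"
NewDatIndex <- stringr::str_detect(result$title, ":")
result1 <- result[NewDatIndex,]
### Replace the article title with the database name (the text before the first ":")
result1$title <-  stringr::str_extract(result1$title, ".+?(?=:)")
result1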

##               uid     pubdate                title             source
## 31510685 31510685 2019 Jul 15        FunDMDeep-m6A     Bioinformatics
## 31504168 31504168 2019 Aug 28           CROSSalive     Bioinformatics
## 31147718 31147718  2019 Jul 2               RNAmod  Nucleic Acids Res
## 30993345 30993345 2019 Apr 23              WHISTLE  Nucleic Acids Res
## 30598068 30598068 2018 Dec 31           DeepM6ASeq BMC Bioinformatics
## 30416381 30416381        2018                BERMP     Int J Biol Sci
## 30201554 30201554 2018 Nov 15     iRNA(m6A)-PseDNC       Anal Biochem
## 29880878 29880878    2018 Aug Publisher Correction       Nat Neurosci
## 29850798 29850798  2018 Nov 1                  PEA     Bioinformatics
## 29340952 29340952    2018 Feb             RFAthM6A     Plant Mol Biol
## 29126312 29126312  2018 Jan 4          MeT-DB V2.0  Nucleic Acids Res
## 29040692 29040692  2018 Jan 4          RMBase v2.0  Nucleic Acids Res
## 29036329 29036329  2018 Jan 4               m6AVar  Nucleic Acids Res
## 28724534 28724534    2017 Oct            m6aViewer                RNA
## 27723837 27723837        2016           RNAMethPre           PLoS One
## 26896799 26896799  2016 Jun 2                SRAMP  Nucleic Acids Res
##          pmcrefcount
## 31510685            
## 31504168            
## 31147718            
## 30993345           3
## 30598068           2
## 30416381           1
## 30201554           9
## 29880878           1
## 29850798           2
## 29340952           5
## 29126312          18
## 29040692          19
## 29036329           9
## 28724534           3
## 27723837           6
## 26896799          41

PS:
Looking at the results, one entry is named Publisher Correction, which does not appear to be an m6A database, so this filtering step still needs refinement.
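One possible refinement, sketched under assumptions: drop extracted names that are really notice types rather than database names. The list of notice prefixes in `notice_pattern` is an assumption chosen for illustration, not something taken from PubMed, and `db_names` holds just a few names copied from the table above.

```r
# Names as extracted before the first colon (copied from the table above)
db_names <- c("RNAmod", "WHISTLE", "Publisher Correction", "SRAMP")

# Assumed list of notice-type prefixes that are not database names
notice_pattern <- "^(publisher correction|author correction|correction|erratum|corrigendum)$"

# Flag the notices and keep everything else
is_notice <- grepl(notice_pattern, db_names, ignore.case = TRUE)
kept <- db_names[!is_notice]
print(kept)
```

Applied to the real table, the same logical vector could be computed from `result1$title` and used to subset the rows, for example `result1[!is_notice, ]`.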

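Extracting database URLs from the abstracts

Database articles generally give the database URL in the abstract, so we can extract the URL from the abstract text.
PS: the regular expression for matching URLs is taken from code that someone else has already put together.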

DatUrl <- sapply(result1$uid, function(x){
    pub_ab <- entrez_fetch(db = "pubmed", id = x, rettype = "abstract")
    stringr::str_extract(pub_ab, "(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;()]+[-A-Za-z0-9+&@#/%=~_|]")
})

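PS: sometimes an abstract contains more than one URL, but they mostly point to the same site, so we keep only one of them (str_extract returns the first match).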
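If you would rather inspect every URL in an abstract before choosing one, a sketch like the following collects all matches and removes duplicates. It uses base R instead of stringr so it runs without extra packages, and the abstract text is made up for illustration (it is not a real PubMed record). The URL pattern is the same one used above.

```r
# Made-up abstract text for illustration only (not from a real record)
abstract <- paste(
  "The web server is available at http://example.org/tool.",
  "Source code is hosted at https://example.org/tool/code",
  "and mirrored at http://example.org/tool."
)

# Same URL pattern as in the code above
url_pattern <- "(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;()]+[-A-Za-z0-9+&@#/%=~_|]"

# gregexpr finds every match; regmatches pulls the matched substrings out
all_urls <- regmatches(abstract, gregexpr(url_pattern, abstract, perl = TRUE))[[1]]

# Drop repeated URLs while keeping the order of first appearance
unique_urls <- unique(all_urls)
print(unique_urls)
```

Because the pattern's last character class excludes the period, the sentence-ending "." is not captured as part of the URL.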

Combining the results

result1$URL <- DatUrl
result1

##               uid     pubdate                title             source
## 31510685 31510685 2019 Jul 15        FunDMDeep-m6A     Bioinformatics
## 31504168 31504168 2019 Aug 28           CROSSalive     Bioinformatics
## 31147718 31147718  2019 Jul 2               RNAmod  Nucleic Acids Res
## 30993345 30993345 2019 Apr 23              WHISTLE  Nucleic Acids Res
## 30598068 30598068 2018 Dec 31           DeepM6ASeq BMC Bioinformatics
## 30416381 30416381        2018                BERMP     Int J Biol Sci
## 30201554 30201554 2018 Nov 15     iRNA(m6A)-PseDNC       Anal Biochem
## 29880878 29880878    2018 Aug Publisher Correction       Nat Neurosci
## 29850798 29850798  2018 Nov 1                  PEA     Bioinformatics
## 29340952 29340952    2018 Feb             RFAthM6A     Plant Mol Biol
## 29126312 29126312  2018 Jan 4          MeT-DB V2.0  Nucleic Acids Res
## 29040692 29040692  2018 Jan 4          RMBase v2.0  Nucleic Acids Res
## 29036329 29036329  2018 Jan 4               m6AVar  Nucleic Acids Res
## 28724534 28724534    2017 Oct            m6aViewer                RNA
## 27723837 27723837        2016           RNAMethPre           PLoS One
## 26896799 26896799  2016 Jun 2                SRAMP  Nucleic Acids Res
##          pmcrefcount
## 31510685            
## 31504168            
## 31147718            
## 30993345           3
## 30598068           2
## 30416381           1
## 30201554           9
## 29880878           1
## 29850798           2
## 29340952           5
## 29126312          18
## 29040692          19
## 29036329           9
## 28724534           3
## 27723837           6
## 26896799          41
##                                                                               URL
## 31510685                               https://github.com/NWPU-903PR/DMDeepm6A1.0
## 31504168                http://service.tartaglialab.com/new_submission/crossalive
## 31147718                                      https://bioinformatics.sc.cn/RNAmod
## 30993345                                      http://whistle-epitranscriptome.com
## 30598068                                   https://github.com/rreybeyb/DeepM6ASeq
## 30416381                                           http://www.bioinfogo.org/bermp
## 30201554                          http://lin-group.cn/server/iRNA(m6A)-PseDNC.php
## 29880878 https://www.southernbiotech.com/?catno=4030-05&type=Polyclonal#&panel1-1
## 29850798                                       https://hub.docker.com/r/malab/pea
## 29340952                               https://github.com/nongdaxiaofeng/RFAthM6A
## 29126312                                      http://compgenomics.utsa.edu/MeTDB/
## 29040692                                           http://rna.sysu.edu.cn/rmbase/
## 29036329                                                 http://m6avar.renlab.org
## 28724534                                              http://dna2.leeds.ac.uk/m6a
## 27723837                     http://bioinfo.tsinghua.edu.cn/RNAMethPre/index.html
## 26896799                                              http://www.cuilab.cn/sramp/

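At this point we have 16 candidate m6A-related databases. The next post summarizes these 16 databases.

For details on each database, see: [m6A|Databases] A Summary of m6A Databases (Part 2)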
