对于非模式物种无法直接通过现成的OrgDb来进行通路富集,所以需要通过AnnotationForge构建
利用eggnog来进行注释,连接基因与GO、KEGG通路
1.eggNOG-mapper:区分旁系同源基因和直系同源基因
①序列比对
②推测直系同源基因
③功能注释
(1) 软件安装
eggnog-mapper5.0是基于Python3.7版本
创建一个Python3.7的小环境+直接安装eggnog-mapper
conda create -n python3 python=3.7.8 -y
conda activate python3
conda install -c bioconda eggnog-mapper -y
(2) 数据下载
我是选择直接下载eggNOG数据库在本地,再上传到服务器后解压
下载链接:http://eggnog5.embl.de/download/


在安装eggnog-mapper目录下创建文件夹data,将eggnog.db.gz和eggnog.proteins.dmnd.gz先上传后解压
gunzip eggnog.db.gz
gunzip eggnog.proteins.dmnd.gz
(3) 软件运行
运行eggnog-mapper(我一般写个脚本直接挂后台)
vim eggnog.sh
emapper.py -i ~.fa -o ~ -m diamond --cpu 10
nohup sh eggnog.sh >eggnog.log &
-i:输入的蛋白组序列文件,请自行下载你物种的pep序列文件
-o:输出文件的前缀,后缀会自动补充为.emapper.annotations
-m:同源性比对搜索。其实基于diamond的比对,当序列大于1000条时,是不用指定数据库的
--cpu:使用的线程数
(4) 处理结果文件
.hits
.annotations
.seed_orthologs
(主要是在shell中去掉开头的#)
sed '/^# /d' ~.annotations.tsv | sed 's/#//' > emapper.annotations.tsv
2. 在R中构建OrgDb
(1) 加载R包
library(tidyverse)
library(stringr)
library(KEGGREST)
library(AnnotationForge)
library(clusterProfiler)
library(dplyr)
library(jsonlite)
library(purrr)
library(RCurl)
library(job)
(2) 读取数据
因为文件比较大,用read.table会读不进来
emapper <- rio::import('emapper.annotations.tsv')
#将空值替换为NA,方便后续使用na.omit()函数提出没有注释到的行
emapper[emapper==""]<-NA
(3) 提取GO信息
#提取query列,Preferred_named列,GO列
gene_info <- emapper %>% dplyr::select(GID = query, GENENAME = Preferred_name) %>% na.omit()
gos <- emapper %>% dplyr::select(query, GOs) %>% na.omit()
#构建一个空的数据框为后面填充数据
gene2go = data.frame(GID = character(),
GO = character(),
EVIDENCE = character())
#填充gene2go数据框(这一步时间比较长,所以挂后台)
job::job({for (row in 1:nrow(gos)) {
the_gid <- gos[row, "query"][[1]]
the_gos <- str_split(gos[row,"GOs"], ",", simplify = FALSE)[[1]]
df_temp <- data_frame(GID = rep(the_gid, length(the_gos)),
GO = the_gos,
EVIDENCE = rep("IEA", length(the_gos)))
gene2go <- rbind(gene2go, df_temp)}})
#将“-”替换为NA,然后使用na.omit()删除含有NA的行
gene2go$GO[gene2go$GO=="-"]<-NA
gene2go<-na.omit(gene2go)
(4) 提取KEGG信息
#将emapper中query列,KEGG_ko列提取出来
gene2ko <- emapper %>% dplyr::select(GID = query, Ko = KEGG_ko) %>% na.omit()
#将gene2ko的Ko列中"ko:"删除,不然后面找pathway会报错
gene2ko$Ko <- gsub("ko:","",gene2ko$Ko)
(5) 下载KO的json文件到当前目录下
下载地址:https://www.genome.jp/kegg-bin/get_htext?ko00001
(6) 定义函数
update_kegg <- function(json = "ko00001.json") {
pathway2name <- tibble(Pathway = character(), Name = character())
ko2pathway <- tibble(Ko = character(), Pathway = character())
kegg <- fromJSON(json)
for (a in seq_along(kegg[["children"]][["children"]])) {
A <- kegg[["children"]][["name"]][[a]]
for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
B <- kegg[["children"]][["children"]][[a]][["name"]][[b]]
for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
kos <- str_match(kos_info, "K[0-9]*")[,1]
ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))}}}
save(pathway2name, ko2pathway, file = "kegg_info.RData")}
# 调用函数后在本地创建kegg_info.RData文件
update_kegg()
# 载入kegg_info.RData文件
load(file = "kegg_info.RData")
根据gene2ko中的ko信息将gene2ko中的Pathway列提取出来
gene2pathway <- gene2ko %>% left_join(ko2pathway, by = "Ko") %>% dplyr::select(GID, Pathway) %>% na.omit()
(7) 获取物种信息
查询网址:https://www.ncbi.nlm.nih.gov/taxonomy
比如我做的草鱼

tax_id = "7959"
genus = "Ctenopharyngodon"
species = "idella"
去重
gene2go <- unique(gene2go)
gene2go <- gene2go[!duplicated(gene2go),]
gene2ko <- gene2ko[!duplicated(gene2ko),]
gene2pathway <- gene2pathway[!duplicated(gene2pathway),]
(8) 构建OrgDb(注意:邮箱(作者和维护者)中不能包括“_”)
makeOrgPackage(gene_info=gene_info,
go=gene2go,
ko=gene2ko,
maintainer='Yuri <@qq.com>',
author='Yuri <@qq.com>',
pathway=gene2pathway,
version="0.0.2",
outputDir = "~/database/",
tax_id=tax_id,
genus=genus,
species=species,
goTable="go")
(9)导入OrgDb库
install.packages("./org.Cidella.eg.db", repos=NULL)
library(org.Cidella.eg.db)
3. 一些总结
①在linux中配置python环境
我的服务器是python2,保存在usr/bin中,但是eggNOG-mapper用的是python3, 我一度查找怎么切换为python3,其实直接单独创建小环境就可以了。
②Python的一些报错
eggNOG-mapper主要依赖python这样一个环境,导致会有一些报错,我的错误内容就是缺乏一些依赖的软件,直接pip install ~
就可以了
③在构建OrgDb时,作者和维护者的格式'Yuri <邮箱>',不能包含‘_’
4.参考文献
https://www.jianshu.com/p/f2e4dbaae719
https://www.jieandze1314.com/post/cnposts/208/
网友评论