A simple tutorial on gene annotation with InterProScan

Author: 卖萌哥 | Published: 2018-08-01 15:10

Official website:

http://www.ebi.ac.uk/interpro/download.html

User manual on GitHub:

https://github.com/ebi-pf-team/interproscan/wiki

1. Download, extract, install

Download:

nohup wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.28-67.0/interproscan-5.28-67.0-64-bit.tar.gz &

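The archive is about 11 GB, so it is best to run the download in the background with nohup. That way a dropped terminal session does not kill the transfer halfway through and force you to start over.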
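
With an archive this large it is worth checking that the file arrived intact before extracting it. The sketch below assumes that EBI publishes a matching .md5 file next to the archive (the URL in the comment is the download link above with .md5 appended, which is an assumption for this release) and that md5sum is installed. verify_archive is a made-up helper name, not part of InterProScan.

```shell
# Sketch: check a downloaded archive against the .md5 file stored beside it.
# verify_archive is a hypothetical helper name.
verify_archive() {
    archive="$1"
    checksum_file="${archive}.md5"
    [ -f "$archive" ] || { echo "missing archive: $archive" >&2; return 2; }
    [ -f "$checksum_file" ] || { echo "missing checksum: $checksum_file" >&2; return 2; }
    # md5sum -c reads "<digest>  <filename>" lines and re-hashes the named file,
    # so it has to run from the directory that holds the archive.
    ( cd "$(dirname "$archive")" && md5sum -c "$(basename "$checksum_file")" )
}

# Typical use once the download has finished (commented out so that sourcing
# this sketch never touches the network; the .md5 URL is an assumption):
# wget -c ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.28-67.0/interproscan-5.28-67.0-64-bit.tar.gz.md5
# verify_archive interproscan-5.28-67.0-64-bit.tar.gz && echo "archive is intact"
```

wget -c resumes a partial download instead of starting again, which covers the case where the transfer itself was interrupted.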

Extract:

tar -pxvzf interproscan-5.28-67.0-*-bit.tar.gz

The p option here means:

p = preserve the file permissions

Installing the PANTHER data

The PANTHER library has to be installed separately.

Download and extract

cd [InterProScan5 home]/data/
nohup wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-12.0.tar.gz &
#run tar only after the background download has finished
tar -pxvzf panther-data-12.0.tar.gz

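The pre-calculated match lookup web service holds matches for more than 30 million protein sequences, including every UniProtKB sequence.
InterProScan 5 queries this service to speed up local runs.
This is a feature of this version. To use it, the machine needs internet access to http://www.ebi.ac.uk.
If a firewall blocks that site, you can either install a local copy of the InterProScan 5 lookup service (https://code.google.com/p/interproscan/wiki/LocalLookupService) or turn the feature off.
To turn it off, add -dp on the command line, or edit interproscan.properties
and comment out the following line by putting a # in front of it: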
 precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
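
If you would rather not edit the properties file by hand, the change can be scripted. This is a minimal sketch: disable_lookup is a made-up helper name, and it assumes the property appears at the start of a line exactly as shown above. A .bak copy of the original file is kept.

```shell
# Sketch: comment out the lookup service URL in interproscan.properties.
# disable_lookup is a hypothetical helper name.
disable_lookup() {
    props="$1"
    # Prefix the uncommented property with '#'. Lines that already start with
    # '#' do not match, so running this twice does not add a second '#'.
    sed -i.bak 's|^[[:space:]]*\(precalculated\.match\.lookup\.service\.url=\)|#\1|' "$props"
}

# Example (path is a placeholder):
# disable_lookup /path/to/interproscan/interproscan.properties
```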

Basic usage:

./interproscan.sh -i /path/to/sequences.fasta -o /san/ -goterms -iprlookup -pa -f xml

You can also test the installation with the bundled example:

./interproscan.sh -i test_proteins.fasta -f tsv

Options:

-appl / --applications application_name (optional)

By default, all available analyses are run.

You can also select a single database:

./interproscan.sh -appl Pfam -i /path/to/sequences.fasta

You can select several databases, optionally with a version:

./interproscan.sh -appl Pfam-31.0 -appl PRINTS-42.0 -i /path/to/sequences.fasta

Or use a single -appl followed by a comma-separated list of databases:

./interproscan.sh -appl CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,ProDom,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM -i /path/to/sequences.fasta

All available databases:

Included Analyses

This distribution of InterProScan includes:

CDD
COILS
Gene3D
HAMAP
MOBIDB
PANTHER
Pfam
PIRSF
PRINTS
ProDom
PROSITE (Profiles and Patterns)
SFLD
SMART (unlicensed components only by default - this analysis has simplified post-processing that includes an E-value filter, however you should not expect it to give the same match output as the fully licensed version of SMART)
SUPERFAMILY
TIGRFAMs

The following analyses are available in InterProScan 5 but require a licence:

Phobius (licensed software)
SignalP
SMART (licensed components)
TMHMM

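So it is usually better not to set -appl at all: running every analysis gives the most complete annotation, although it makes downstream processing heavier.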

-i / --fasta sequence_file

The input must be a FASTA file. Both nucleotide and protein sequences are accepted, but protein is recommended because protein files are smaller.

By default, protein input produces TSV, XML and GFF3 files, and nucleotide input produces GFF3 and XML files.

-iprlookup,--iprlookup (optional)

-goterms,--goterms (optional)

-goterms turns on GO annotation. The two options are normally used together, because GO annotation depends on -iprlookup.

-b / --output-file-base file_name (optional)

Optionally, you can supply a path and base name (excluding a file extension) for the results file as follows:

./interproscan.sh -i /path/to/sequences.fasta -b /path/to/output_file

The appropriate file extension will be added to each output file, depending upon the format(s) requested. (It is therefore recommended that you do not include a file extension yourself.)

Note that using this option will not overwrite existing files. If a file with the required name exists at the path specified, the provided file name will have 'underscore_number' appended in front of the file extension.

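I did not fully follow this, so the original text is pasted above for reference. As I read it, you supply a base name without an extension and InterProScan adds the extension for each requested format. Existing files are not overwritten.

-o sets an explicit output file name. It cannot be combined with -b or -d, and if you set it you must also set -f.

-f sets the output format(s). Supported formats are TSV, XML, GFF3, HTML and SVG. The default for protein input is TSV, XML and GFF3. Nucleotide input used to be limited to GFF3 and XML, but all formats now work.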

 ./interproscan.sh -f XML -f HTML -i /path/to/sequences.fasta -b /path/to/output_file

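Differences between the output formats: https://code.google.com/p/interproscan/wiki/OutputFormats

-dp disables the precalculated match lookup service, which is on by default. The service uses the MD5 of each submitted sequence to check quickly whether it has already been annotated. If it has, the stored result is returned directly, which saves time.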
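
To make the MD5 idea concrete: the digest is taken over the sequence, not the FASTA header, so the same protein under two different names maps to the same key. The sketch below illustrates that with md5sum. seq_md5 is a made-up helper name, and stripping whitespace and upper-casing before hashing are assumptions about the normalisation, so the value may not reproduce InterProScan's own digest exactly.

```shell
# Sketch: MD5 of the residues in a single-record FASTA file.
# seq_md5 is a hypothetical helper name. The header line and line breaks are
# ignored; upper-casing is an assumed normalisation step.
seq_md5() {
    grep -v '^>' "$1" \
        | tr -d ' \t\r\n' \
        | tr '[:lower:]' '[:upper:]' \
        | md5sum \
        | cut -d' ' -f1
}

# Example (file name is a placeholder):
# seq_md5 one_protein.fasta
```

In the TSV output, the second column holds the sequence MD5 digest that InterProScan computed for each protein.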

-pa / --pathways (optional)
Option that provides mappings from matches to pathway information, which is based on the matched manually curated InterPro entries. (IMPLIES -iprlookup option). The different pathways databases that I5 provides cross links to are:

KEGG
MetaCyc
Reactome

Turns on pathway annotation where it is available.

-t / --seqtype (optional)
InterProScan 5 supports analysis of both protein and nucleic acid sequences (DNA/RNA). Your input sequences are interpreted as protein sequences by default. If you like to scan nucleotide sequences you must set the -t option:

./interproscan.sh -t n -i /path/to/sequences.fasta

If the input is nucleotide sequence (DNA or RNA), you must set -t n. The default is protein.

-dra/ --disable-residue-annot (optional)

Optionally, you can prevent InterProScan from calculating the residue level annotations and displaying in the output where available. If you don't require this information then disabling the feature will improve performance and result in smaller output files.

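This skips some calculations, giving a faster run and smaller output files.

For more information, see the first reference below.

Some notes

InterProScan input can be nucleotide or protein, but the command differs slightly (nucleotide input needs -t n).

The input must be properly formatted FASTA. Sequences must not contain * or other non-residue characters, and gene names must not be empty.

Putting the above together: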
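
Protein files exported from gene prediction tools often end each sequence with a * for the stop codon, so a small cleaning step before the run avoids a failed job. This is a minimal sketch using awk. clean_fasta is a made-up helper name that reads FASTA on standard input and writes the cleaned file to standard output. It removes every *, including internal ones, so check your gene models first if you expect in-frame stops.

```shell
# Sketch: strip '*' from sequence lines and refuse headers with no name.
# clean_fasta is a hypothetical helper name.
clean_fasta() {
    awk '
        /^>/ {
            # A header with nothing after ">" has no usable name: report and stop.
            if ($0 ~ /^>[[:space:]]*$/) {
                print "empty sequence name at line " NR > "/dev/stderr"
                bad = 1
                exit
            }
            print
            next
        }
        {
            gsub(/\*/, "")               # drop stop-codon asterisks
            if (length($0) > 0) print    # skip lines that end up empty
        }
        END { exit bad + 0 }
    '
}

# Example (file names are placeholders):
# clean_fasta < raw_proteins.fasta > sequences.pep
```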

./interproscan.sh -i /path/to/sequences.pep -iprlookup -goterms -f html -f tsv -dp -pa -dra -b /path/to/output_file


#-appl PRINTS is optional here
interproscan.sh -appl Pfam -appl PRINTS -appl SMART -appl PANTHER -i Porphyra_umbilicalis_pep.fasta -f tsv -o Porphyra_umbilicalis_pep.fasta.ipr -goterms -T temp -iprlookup

nohup ./interproscan.sh -appl Pfam -appl SMART -appl PANTHER -i 160614_klebsormidium_v1.1_AA.fasta.fasta -f tsv -o kfl.tsv -goterms -T temp -iprlookup -dp &

#The TSV output can be opened directly in Excel.
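
For anything beyond eyeballing the table, the TSV is also easy to process on the command line. The sketch below pulls out one protein/GO pair per line. It relies on the TSV column layout, where column 1 is the protein accession and column 14 holds the GO terms separated by |, and it assumes the run used -goterms. go_per_protein is a made-up helper name. Rows with no GO column, or with - in it, are skipped.

```shell
# Sketch: list unique "<protein>\t<GO id>" pairs from an InterProScan TSV file.
# go_per_protein is a hypothetical helper name.
go_per_protein() {
    awk -F'\t' '
        NF >= 14 && $14 != "" && $14 != "-" {
            n = split($14, terms, "|")    # column 14: GO terms joined by "|"
            for (i = 1; i <= n; i++) {
                key = $1 "\t" terms[i]
                if (!(key in seen)) {     # the same pair can come from several signatures
                    seen[key] = 1
                    print key
                }
            }
        }
    ' "$1"
}

# Example, using the output name from the command above:
# go_per_protein kfl.tsv > kfl.go.tsv
```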