使用EDTA评估基因组LAI值

作者: wo_monic | 来源:发表于2023-07-13 15:43 被阅读0次

EDTA的功能设计是用来de novo注释全基因组的TE序列

EDTA的分析流程

安装EDTA EDTA github

git clone https://github.com/oushujun/EDTA.git
cd EDTA
conda env create -f EDTA.yml
conda activate EDTA
perl EDTA.pl

EDTA的使用

需要提前准备的文件

必须文件

genome.fa (注意染色体的名称长度不能超过13个字符，而且不能有特殊字符，主要是LTR鉴定工具里有一个限制染色体字符长度)

可选文件

genome.gff3
genome.cds.fa
提前准备好的100%准确的TE库

前期准备工作

确保genome.fa序列长度小于13
准备cds文件
从gff3文件生成运行所需的bed格式文件

程序运行的参数

--genome        基因组fa文件，必须文件
  --species [Rice|Maize|others] 物种名称，Default: others
  --step [all|filter|final|anno]    运行EDTA的哪些步骤
                 all: run the entire pipeline (default)
                 filter: start from raw TEs to the end.
                 final: start from filtered TEs to finalizing the run.
                 anno: perform whole-genome annotation/analysis after TE library construction.
  --overwrite [0|1]     是否覆盖之前的分析，默认是不覆盖。可以指定为1覆盖之前的的分析 If previous results are found, decide to overwrite (1, rerun) or not (0, default).
  --cds [File]      基因组的cds序列文件
  --curatedlib [file]   Provided a curated library to keep consistant naming and classification for known TEs.
            All TEs in this file will be trusted 100%, so please ONLY provide MANUALLY CURATED ones here.
             This option is not mandatory. It's totally OK if no file is provided (default).
  --sensitive [0|1]     Use RepeatModeler to identify remaining TEs (1) or not (0, default).
             This step is very slow and MAY help to recover some TEs.
  --anno [0|1]  Perform (1) or not perform (0, default) whole-genome TE annotation after TE library construction.
  --rmout [File]    Provide your own homology-based TE annotation instead of using the EDTA library for masking.
        File is in RepeatMasker .out format. This file will be merged with the structural-based TE annotation. (--anno 1 required).
        Default: use the EDTA library for annotation.
  --evaluate [0|1]  Evaluate (1) classification consistency of the TE annotation. (--anno 1 required). Default: 0.
         This step is slow and does not affect the annotation result.
  --exclude [File]  Exclude regions (bed format) from TE masking in the MAKER.masked output. Default: undef. (--anno 1 required).
  --u [float]   用于估算LTR插入时间的核苷酸突变突变率的数值 Neutral mutation rate to calculate the age of intact LTR elements.
         Intact LTR age is found in this file: *EDTA_raw/LTR/*.pass.list. Default: 1.3e-8 (per bp per year, from rice).
  --threads|-t  [int]   Number of theads to run this script (default: 4)
-sensitive 是否使用Repeatmodeler

运行EDTA

runEDTA.bash内容如下:

export PERL5LIB=/
#激活conda环境(shell脚本里，激活conda比较麻烦，需要先source)
condapath=`conda info | grep 'base environment'|cut -d : -f2|cut -d " " -f2`
source ${condapath}/etc/profile.d/conda.sh
conda deactivate
conda activate EDTA
perl ~/soft/EDTA/EDTA.pl --genome Yucatanense.genome.chr.fa --cds se.genome.cds.fa --anno 1 --evaluate 1 --overwrite 0 --threads 24 --u 2.6e-9 > EDTA.out

因为依赖环境是安装在conda的EDTA环境里，所以使用的时候需要提前激活。

输出结果解读

最终输出的结果是$genome.mod.EDTA.TElib.fa
这个文件是基因组的非重复TE库。

从EDTA的输出结果获取LAI值
EDTA2LAI.bash的脚本如下：

##从EDTA的输出结果计算LAI
genome="se.genome.chr.fa" #这个是用于EDTA计算的基因组文件的名称
EDTA_path="/share/home/chai/soft/EDTA" #EDTA的安装路径
LAI="/share/home/chai/software/LTR_retriever/LTR_retriever/LAI" #LAI的绝对路径
threads=16

for genome in ${genome}; do perl ${EDTA_path}/util/gff2bed.pl $genome.mod.EDTA.TEanno.gff3 structural > $genome.mod.EDTA.TEanno.struc.bed ; done
for genome in ${genome}; do grep LTR $genome.mod.EDTA.TEanno.struc.bed|grep struc|awk '{print $1":"$2".."$3"\t"$7}' > $genome.mod.EDTA.TEanno.LTR.pass.list ; done
for genome in ${genome}; do perl -nle 'my ($chr, $s, $e, $anno, $dir, $supfam)=(split)[0,1,2,3,8,12]; print "10000 0.001 0.001 0.001 $chr $s $e NA $dir $anno $supfam"' $genome.mod.EDTA.TEanno.struc.bed > $genome.out.EDTA.TEanno.out ; done
for genome in ${genome}; do ${LAI} -genome $genome -intact $genome.mod.EDTA.TEanno.LTR.pass.list -all $genome.out.EDTA.TEanno.out -q -t ${threads} ; done

主要是修改前面的软件的路径和基因组的文件名即可。
$genome.mod.EDTA.TEanno.gff3这个文件是需要指定--anno 1参数，才能生成。

报错处理

第1次报错 `Error: The RMblast engine is not installed in RepeatMasker!`

运行RepeatMasker --help报错如下
ListUtil.c: loadable library and perl binaries are mismatched (got handshake key 0xeb00080, needed 0xde00080)
原因是我在自己的环境变量里配置了自己安装的perl.使用env|grep PERL能看到PERL5LIB=/自己的路径，所以就出现这个问题。解决办法就是，在你运行脚本前，先运行export PERL5LIB=/，这样就可以使用了。把这句命令添加到你运行EDTA最前的脚本里即可。或者是直接修改你自己的环境变量文件~/.bashrc的export PERL5LIB=/。但是我不会这么做，因为环境变量里的配置，会容易引发其他软件无法使用。

第2次报错，`which: no grf-main`

检查发现， TIR-Learnner2.5.sh第26行grfp=$(dirnamewhich grf-main)这里错误。

conda activate EDTA
which grf-main
~/miniconda3/envs/EDTA/bin/grf-main

发现在EDTA的conda环境里能找到grf-main，但是该脚本里却不会自动在conda环境里找，所以就手动修改第26行的路径为对应的真实的绝对路径
修改后是grfp="/share/home/chaim/miniconda3/envs/EDTA/bin"

第3次报错

ModuleNotFoundError: No module named 'tensorflow'
需要安装tensorflow模块，这个是机器学习的模块。
/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'
还会出现上面的warning信息，这是因为numpy版本不兼容造成的。
要么降低numpy版本，要么通过方法2修改dtypes.py文件。
np.dtype([("quint8", np.uint8, 1)])修改为np.dtype([("quint8", np.uint8, (1,))])
所有的np.uint8, 1都修改为np.uint8, (1,).这个出现了多次，都需要修改。

网友评论

本文标题：使用EDTA评估基因组LAI值

本文链接：https://www.haomeiwen.com/subject/agkmudtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

使用EDTA评估基因组LAI值

安装EDTA EDTA github

EDTA的使用

必须文件

可选文件

前期准备工作

运行EDTA

输出结果解读

报错处理

第1次报错 `Error: The RMblast engine is not installed in RepeatMasker!`

第2次报错，`which: no grf-main`

第3次报错

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

使用EDTA评估基因组LAI值

安装EDTA EDTA github

EDTA的使用

必须文件

可选文件

前期准备工作

运行EDTA

输出结果解读

报错处理

第1次报错 Error: The RMblast engine is not installed in RepeatMasker!

第2次报错，which: no grf-main

第3次报错

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

第1次报错 `Error: The RMblast engine is not installed in RepeatMasker!`

第2次报错，`which: no grf-main`