LTR_FINDER | prediction of full-

Author: 生信师姐 | Published: 2020-04-01 15:35

I. Introduction

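Long terminal repeats (LTRs): a retroviral genome carries one long terminal repeat at each end (the 5' LTR and the 3' LTR). LTRs do not encode proteins, but they contain regulatory elements such as promoters and enhancers. An LTR from the viral genome can be transferred to a position next to a cellular proto-oncogene, where its strong promoter and enhancer activate the proto-oncogene and transform a normal cell into a cancer cell.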

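[Figure: structure of an LTR retrotransposon; image not available]

In the figure, TSD denotes target site duplications, and the red triangles denote LTR motifs. Panel A shows an intact LTR structure; a, b, and c are the analysis targets of LTR_retriever.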

Annotation of LTR retrotransposons relies primarily on de novo approaches due to their highly diverse terminal repeats.

II. Software usage

Given DNA sequences, it predicts locations and structure of full-length LTR retrotransposons accurately by considering common structural features.

ab initio LTR retrotransposon finding.

Analysis of many LTR element sequences over nearly 20 years has revealed structural features (signals) common to these elements, including long terminal repeats (LTRs), target site repeats (TSRs), primer binding sites (PBSs), the polypurine tract (PPT), and the TG...CA box, as well as sites of reverse transcriptase (RT), integrase (IN), and RNaseH (RH). These results have made ab initio computer discovery of LTR elements possible.

Step 1: use LTR_FINDER to find the LTR sequences in the genome.

~/opt/biosoft/LTR_Finder/source/ltr_finder  \
  -D 20000 -d 1000 \
  -L 700 -l 100 \
  -p 20 -C -M 0.9 Athaliana.fa >Athaliana.finder.scn

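-D: maximum distance between the 5' and 3' LTRs; -d: minimum distance between the 5' and 3' LTRs;
-L: maximum length of the 5' and 3' LTR sequences; -l: minimum length of the 5' and 3' LTR sequences;
-p: minimum length of an exact-match pair;
-C: detect centromeres (written "centriole" in the source) and delete highly repetitive regions;
-M: minimum LTR similarity.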

Step 2: run LTR_retriever to identify LTR-RTs from the LTR_FINDER output and build a non-redundant LTR-RT library that can be used for genome annotation.

~/opt/biosoft/LTR_retriever/LTR_retriever -threads 4 -genome Athaliana.fa -infinder Athaliana.finder.scn

Here, -infinder indicates that the input comes from LTR_FINDER. This step calls RepeatMasker, which requires sequence IDs to be no longer than 50 characters.
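Because of the 50-character limit on sequence IDs, it can help to check the genome file before running LTR_retriever. The following is a minimal sketch, not part of either tool: the function name `long_fasta_ids` is hypothetical, and it assumes the sequence ID is the header text up to the first whitespace.

```shell
# Hypothetical helper: list FASTA sequence IDs longer than a limit (default 50).
# Assumption: the ID is the header text between ">" and the first whitespace.
long_fasta_ids() {
    fasta=$1
    limit=${2:-50}
    awk -v limit="$limit" '
        /^>/ {
            id = substr($1, 2)               # drop the leading ">"
            if (length(id) > limit) print id # report IDs over the limit
        }
    ' "$fasta"
}

# Usage (prints nothing when every ID fits):
# long_fasta_ids Athaliana.fa 50
```

If the function prints any IDs, shorten them before running the pipeline.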

III. LTR_FINDER_parallel

Parallelizing LTR_FINDER enables rapid identification of long terminal repeat retrotransposons.

We hypothesized that complete sequences of highly complex genomes may contain a large number of complicated nested structures that exponentially increase the search space. To break down these complicated sequence structures, we split chromosomal sequences into relatively short segments (1 Mb) and executed LTR_FINDER in parallel. We expect the time complexity of LTR_FINDER_parallel to be O(n). For highly complicated regions (e.g., centromeres), one segment could take a rather long time (i.e., hours). To avoid extended operation time in such regions, we used a timeout scheme (300 s) to control the longest time a child process can run. On timeout, the 1 Mb segment is further split into 50 Kb segments to salvage LTR candidates. After all segments are processed, the regional coordinates of LTR candidates are converted back to genome-level coordinates for the convenience of downstream analyses.
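The final coordinate conversion can be illustrated with a small sketch. This is not the code used by LTR_FINDER_parallel; the function name `to_genome_coord` is hypothetical, and it assumes non-overlapping segments of equal size that are numbered from 0, with 1-based positions inside each segment.

```shell
# Hypothetical helper, not taken from LTR_FINDER_parallel itself.
# Assumes non-overlapping, equally sized segments numbered from 0 and
# 1-based coordinates within each segment.
to_genome_coord() {
    seg_index=$1   # 0-based index of the segment on its chromosome
    seg_size=$2    # segment length in bp, e.g. 1000000
    local_pos=$3   # 1-based position inside the segment
    echo $(( seg_index * seg_size + local_pos ))
}

# An LTR candidate at 2500..9800 in the third 1 Mb segment (index 2):
start=$(to_genome_coord 2 1000000 2500)
end=$(to_genome_coord 2 1000000 9800)
echo "$start $end"
```

Under these assumptions, a segment that was re-split into 50 Kb pieces would need the same offset applied twice: once for the 50 Kb piece within the 1 Mb segment, and once for the 1 Mb segment within the chromosome.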

Usage: perl LTR_FINDER_parallel -seq [file] -size [int] -threads [int]  
Options:
    -seq    [file]  Specify the sequence file.
    -size   [int]   Specify the size you want to split the genome sequence.
            Please make it large enough to avoid spliting too many LTR elements. Default 5000000 (bp).               
    -time   [int]   Specify the maximum time to run a subregion (a thread).
            This helps to skip simple repeat regions that take a substantial of time to run. Default: 1500 (seconds).
            Suggestion: 300 for -size 1000000. Increase -time when -size increased.  
    -try1   [0|1]   If a region requires more time than the specified -time (timeout), decide:  
                0, discard the entire region.
                1, further split to 50 Kb regions to salvage LTR candidates (default);
    -harvest_out    Output LTRharvest format if specified. Default: output LTR_FINDER table format.
    -next           Only summarize the results for previous jobs without rerunning LTR_FINDER (for -v).
    -verbose|-v     Retain LTR_FINDER outputs for each sequence piece.
    -finder [file]  The path to the program LTR_FINDER (default v1.0.7, included in this package).
    -threads|-t     [int]   Indicate how many CPU/threads you want to run LTR_FINDER.
    -check_dependencies Check if dependencies are fullfiled and quit
    -help|-h        Display this help information.

1. Input

Genome file in multi-FASTA format.

2. Output

GFF3, LTRharvest (STDOUT), or LTR_FINDER (-w 2) formats of predicted LTR candidates.

3. Parameter setting for LTR_FINDER

Currently there are no parameter settings for LTR_FINDER in this parallel version. I have chosen the "best" parameters for you. Please refer to LTR_FINDER for details of these parameters.

-w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85

If you want to use other parameters in LTR_FINDER_parallel, please edit the file LTR_FINDER_parallel line 9 to change the preset parameters.

Based on our previous study [1], we applied the optimized parameters for LTR_FINDER (-w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85), which identify long terminal repeats ranging from 100 to 7000 bp with identity ≥85% and interval regions from 1 to 15 Kb. The output of LTR_FINDER_parallel is convertible to the popular LTRharvest format, which is compatible with the high-accuracy post-processing filter LTR_retriever.
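As a rough sketch of how the two tools could be chained, the commands below use only options that appear in this article (-seq, -threads, -harvest_out, -genome, -inharvest). The genome name, the thread count, and especially the output file name `genome.fa.finder.combine.scn` are assumptions; check what your version of LTR_FINDER_parallel actually writes. The `run` wrapper is a hypothetical convenience that only prints the commands unless DRY_RUN is set to 0.

```shell
# Placeholders: adjust the genome path, thread count, and tool locations.
genome=genome.fa
threads=10
# Assumption: the combined LTRharvest-format result is written to this file;
# verify the name your version of LTR_FINDER_parallel actually produces.
harvest_out="$genome.finder.combine.scn"

DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Step 1: predict LTR candidates in LTRharvest format.
run perl LTR_FINDER_parallel -seq "$genome" -threads "$threads" -harvest_out
# Step 2: filter the candidates with LTR_retriever.
run LTR_retriever -genome "$genome" -inharvest "$harvest_out" -threads "$threads"
```

Printing the commands first makes it easy to confirm paths and file names before committing to a long run.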

4. Performance benchmark

Genome	Arabidopsis	Rice	Maize	Wheat
Version	TAIR10	MSU7	AGPv4	CS1.0
Size	119.7 Mb	374.5 Mb	2134.4 Mb	14547.3 Mb
Original memory (1 CPU*)	0.37 Gbyte	0.55 Gbyte	5.00 Gbyte	11.88 Gbyte
Parallel memory (36 CPUs*)	0.10 Gbyte	0.12 Gbyte	0.82 Gbyte	17.67 Gbyte
Original time (1 CPU)	0.58 h	2.1 h	448.5 h	10169.3 h
Parallel time (36 CPUs)	6.4 min	2.6 min	10.3 min	71.8 min
Speed up	5.4 x	48.5 x	2,613 x	8,498 x
Number of LTR candidates (1 CPU)	226	2,851	60,165	231,043
Number of LTR candidates (36 CPUs)	226	2,834	59,658	237,352
% difference of candidate #	0.00%	0.60%	0.84%	-2.73%

*Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
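The derived rows of the table can be recomputed from the raw values. The sketch below assumes that "Speed up" is the single-CPU time divided by the parallel time, and that the percent difference is taken relative to the single-CPU candidate count; the function names are illustrative only.

```shell
# speedup HOURS MINUTES: single-CPU time (hours) divided by parallel time (minutes)
speedup() {
    awk -v h="$1" -v m="$2" 'BEGIN { printf "%.1f\n", h * 60 / m }'
}

# pct_diff A B: percent difference of B relative to A
pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (a - b) / a * 100 }'
}

speedup 0.58 6.4        # Arabidopsis: 0.58 h versus 6.4 min
pct_diff 2851 2834      # rice: 2,851 versus 2,834 candidates
```

These two calls print 5.4 and 0.60, matching the Arabidopsis speed-up and the rice candidate difference in the table.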

5. FAQs and best practices

（1）How to generate output files for
A: You can use the -harvest_out parameter to generate LTRharvest-format output, then feed it to LTR_retriever using -inharvest. If you have more than one LTRharvest output, simply cat them together.

（2）How to prepare the genome file?
A: It's highly recommended to use short and simple sequence names. For example, use letters, numbers, and _ to generate unique names shorter than 15 characters. This will make your downstream analyses much easier. If you have complicated sequence names and encounter errors, you may want to simplify them and try again.

（3）Do I really need to modify the -size, -time, and -try1 parameters?
A: Not really. Unless you are 100% sure what you are doing, keep the defaults; these parameters are optimized for the best performance in general.
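For the sequence-name advice in question (2), one possible approach is to rename every record to a short generated name and keep a table of the original headers. This is only a sketch: the function name `simplify_fasta_ids`, the `seqN` naming scheme, and the tab-separated mapping file are all assumptions, not something the tools provide.

```shell
# Hypothetical helper: rename FASTA records to seq1, seq2, ... and record
# the new-to-old mapping so results can be traced back afterwards.
simplify_fasta_ids() {
    in_fa=$1; out_fa=$2; map_tsv=$3
    awk -v map="$map_tsv" '
        /^>/ {
            n++
            new = "seq" n
            print new "\t" substr($0, 2) > map   # new name, original header
            print ">" new
            next
        }
        { print }                                # sequence lines pass through
    ' "$in_fa" > "$out_fa"
}

# Usage:
# simplify_fasta_ids genome.fa genome.renamed.fa genome.names.tsv
```

Generated names contain only letters and digits and stay well under 15 characters, and the mapping file lets you restore the original names in downstream results.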

6. Issues

Currently I am using a non-overlapping scheme to cut the original sequence, so some LTR elements could be broken at segment boundaries. So far the side effect is minimal (< 1% loss) compared to the performance boost (up to 8,500X faster). I don't plan to update it to a sliding-window scheme. You are welcome to improve it and submit a merge request.

References:
https://github.com/xzhub/ltr_finder
https://gitee.com/xdkong/LTR_FINDER_parallel
https://www.cnblogs.com/bio-mary/p/12187157.html

Ou S, Jiang N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mob DNA 2019;10(1):48.
