Literature Reading 4.1: Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

Author: 龙star180 | Published: 2022-09-01 20:12


Journal/Source (Has RepeatMasker itself never been formally published, or did I just fail to find the paper?)

Current Protocols in Bioinformatics (not SCI-indexed)

Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences


RepeatMasker is a popular software tool widely used in computational genomics to identify, classify, and mask repetitive elements, including low-complexity sequences and interspersed repeats. RepeatMasker searches for repetitive sequence by aligning the input genome sequence against a library of known repeats, such as Repbase. Here, we describe two Basic Protocols that provide detailed guidelines on how to use RepeatMasker, either via the Web interface or command-line Unix/Linux system, to analyze repetitive elements in genomic sequences. Sequence comparisons in RepeatMasker are usually performed by the alignment program cross_match, which requires significant processing time for larger sequences. An Alternate Protocol describes how to reduce the processing time using an alternative alignment program, such as WU-BLAST. Further, the advantages, limitations, and known bugs of the software are discussed. Finally, guidelines for understanding the results are provided.

Keywords: RepeatMasker, genome annotation, repetitive elements, repeat library, cross_match, WU-BLAST, RECON

INTRODUCTION

RepeatMasker (developed by A.F.A. Smit, R. Hubley, and P. Green; see http://www.repeatmasker.org/) was designed to identify and annotate repetitive elements in nucleotide sequences and mask them for further analysis. The repetitive elements, including low-complexity DNA sequences and interspersed repeats, are annotated and replaced by Ns, Xs, or lowercase letters (see below for options) in the corresponding positions of the DNA sequence. The new addition to the RepeatMasker package is a program that also identifies repetitive elements within protein sequences. Here, we focus on utilizing RepeatMasker to identify repetitive elements in genomic sequences. To run RepeatMasker, one needs to select the repeat library files, which contain repetitive elements consensus sequences. Currently, Repbase Update (Jurka, 2001; Jurka et al., 2005; http://www.girinst.org/) is the largest commercially available repeat library (free for academic use) and covers a number of organisms including human, rodent, zebrafish, Drosophila, and Arabidopsis thaliana. Library files for organisms that do not have Repbase Update library files can be generated ab initio using RECON (Bao and Eddy, 2002; http://selab.janelia.org/recon.html) or RepeatScout (http://bix.ucsd.edu/repeatscout/; Price et al., 2005). The newest version of RECON, v. 1.06, was released recently and is available from the RepeatModeler package at http://www.repeatmasker.org/RepeatModeler.html. Sequence comparisons in RepeatMasker are usually carried out by the program cross_match, developed by Phil Green (http://www.phrap.org/consed/consed.html#howToGet). One can also use WU-BLAST (http://info.cchmc.org/help/wublast.html; see Alternate Protocol) to replace cross_match for fast processing.
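To make the three masking styles mentioned above concrete (Ns, Xs, or lowercase letters), here is a minimal Python sketch. It is not RepeatMasker's own code: the function name `mask_sequence`, the toy sequence, and the 0-based half-open interval convention are all assumptions chosen for illustration, and the repeat positions are simply given rather than detected.

```python
def mask_sequence(seq, intervals, mode="N"):
    """Return seq with every (start, end) interval masked.

    Intervals are 0-based and half-open (an assumption of this sketch).
    mode "N" or "X" overwrites bases with that letter;
    mode "lower" keeps the bases but lowercases them (soft masking).
    """
    if mode not in ("N", "X", "lower"):
        raise ValueError(f"unknown mode: {mode!r}")
    chars = list(seq)
    for start, end in intervals:
        # Clamp the interval to the sequence so bad coordinates cannot crash.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if mode == "lower" else mode
    return "".join(chars)


# Toy input: pretend positions 2..5 were reported as a repeat.
toy_sequence = "ACGTACGTACGT"
toy_repeats = [(2, 6)]
for mode in ("N", "X", "lower"):
    print(mode, mask_sequence(toy_sequence, toy_repeats, mode))
```

Overwriting with N or X discards the underlying bases, whereas lowercase masking keeps them readable for downstream tools that understand soft-masked input.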

USING RepeatMasker VIA THE WEB INTERFACE

RepeatMasker may be accessed through the Web at http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker. Unlike the command-line version of RepeatMasker (see Basic Protocol 2), Web RepeatMasker has a nucleotide sequence size limit of 100 kb. The attempt to analyze a sequence larger than 100 kb fails (whereupon a prompt is displayed in a message window, shown in Fig. 4.10.1). Sequences shorter than 100 kb are readily analyzed using the Web RepeatMasker, with the time needed for processing correlating with the length of the sequence. For faster service outside North America, there are RepeatMasker mirror sites in Germany, Israel, and Australia.


On the other hand, if one routinely submits large sequences for analysis, it may be better to download the command-line version and run RepeatMasker locally (see Basic Protocol 2). Importantly, if the query sequence exceeds the 100-kb limit, the only choice is to download RepeatMasker and run it locally.

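The choice between the Web interface and a local installation comes down to sequence length, so it can be checked before uploading anything. The sketch below is only an illustration under two assumptions that the text does not settle: "100 kb" is read as 100,000 bases, and the limit is applied to the total length of all records in the submission. The names `WEB_LIMIT_BP`, `fasta_lengths`, and `fits_web_limit` are hypothetical.

```python
WEB_LIMIT_BP = 100_000  # assumption: "100 kb" interpreted as 100,000 bases


def fasta_lengths(text):
    """Map each FASTA record name to the length of its sequence."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else f"record_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def fits_web_limit(text, limit=WEB_LIMIT_BP):
    """True if the summed sequence length is within the assumed Web limit."""
    return sum(fasta_lengths(text).values()) <= limit


example = ">seq1 toy record\nACGTACGT\nACGT\n>seq2\nGGCC\n"
print(fasta_lengths(example))
print("Web interface is an option:", fits_web_limit(example))
```

If the check fails, the text above leaves only one route: download RepeatMasker and run it locally.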

Necessary Resources

Hardware

Any Internet-connected computer

Software

Web browser: e.g., Mozilla Firefox or Internet Explorer

Files

A FASTA file (APPENDIX 1B) or a collection of FASTA files can be processed via the Web interface. Note that the size limit is 100 kb for RepeatMasker via Web. The example file used in this protocol is a 22,539-bp human genomic DNA sequence from the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway). The coordinate is chr10:62743355-62765893.

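The stated length of the example sequence can be checked directly from its coordinates. The sketch below reads the region string as 1-based with both ends included, which is the reading that reproduces the 22,539 bp quoted above; the helper names `parse_region` and `region_length` are hypothetical.

```python
import re


def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    chrom = match.group(1)
    # Browsers often display thousands separators, so strip commas.
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def region_length(region):
    """Length in bp, treating both coordinates as inclusive."""
    _, start, end = parse_region(region)
    return end - start + 1


print(region_length("chr10:62743355-62765893"))  # 22539
```

At 22,539 bp the example is well under the 100 kb Web limit, which is why it can be used for the Web protocol.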

1. Point the Web browser to http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker. Load the FASTA sequence file (maximum 100 kb) by entering the sequence name or browsing the file. Alternatively, paste the FASTA sequence (maximum 100 kb) into the indicated text field.


RepeatMasker will return an error message if the input sequence contains non-DNA symbols or if the sequence is too long.


2. Select a format for results from the two radio buttons next to “return format”: “html” or “tar file.”


If “html” is selected, the results will be written as an html file. If “tar file” is selected, the results will be packed into an archive using the Unix “tar” protocol. For the example here, select “html.”

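Step 1 notes that the server rejects input containing non-DNA symbols. A quick local scan can catch such characters before submission. The text does not say exactly which letters the server accepts, so the alphabet below (the four bases, N, and the IUPAC ambiguity codes, in either case) is an assumption, as is the helper name `find_invalid_symbols`.

```python
# Assumption: A, C, G, T, N plus IUPAC ambiguity codes count as DNA symbols.
ALLOWED_SYMBOLS = set("ACGTNRYSWKMBDHV")


def find_invalid_symbols(text, allowed=ALLOWED_SYMBOLS):
    """Return (line_number, symbol) pairs for unexpected characters.

    Header lines (starting with '>') are skipped; line numbers are 1-based.
    """
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">"):
            continue
        for symbol in line.strip():
            if symbol.upper() not in allowed:
                problems.append((line_number, symbol))
    return problems


sample = ">toy\nACGTNNacgt\nAC-GU\n"
for line_number, symbol in find_invalid_symbols(sample):
    print(f"line {line_number}: unexpected symbol {symbol!r}")
```

A clean scan under this assumed alphabet does not guarantee acceptance by the server; it only removes the most common causes of the error message, such as gap characters or RNA letters.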

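...

The rest of the content is available on my WeChat official account. For some reason Jianshu again refused to publish it, perhaps because the images are too large or the text is too long.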
