Literature Reading 4.1: Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

Author: 龙star180 | Published 2022-09-01 20:12

    Journal/Source (Has RepeatMasker itself never been formally published, or did I just fail to find it?)

    Current Protocols in Bioinformatics (not SCI-indexed)

    Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences


    RepeatMasker is a popular software tool widely used in computational genomics to identify, classify, and mask repetitive elements, including low-complexity sequences and interspersed repeats. RepeatMasker searches for repetitive sequence by aligning the input genome sequence against a library of known repeats, such as Repbase. Here, we describe two Basic Protocols that provide detailed guidelines on how to use RepeatMasker, either via the Web interface or on a command-line Unix/Linux system, to analyze repetitive elements in genomic sequences. Sequence comparisons in RepeatMasker are usually performed by the alignment program cross_match, which requires significant processing time for larger sequences. An Alternate Protocol describes how to reduce the processing time using an alternative alignment program, such as WU-BLAST. Further, the advantages, limitations, and known bugs of the software are discussed. Finally, guidelines for understanding the results are provided.


    Keywords: RepeatMasker, genome annotation, repetitive elements, repeat library, cross_match, WU-BLAST, RECON


    INTRODUCTION

    RepeatMasker (developed by A.F.A. Smit, R. Hubley, and P. Green; see http://www.repeatmasker.org/) was designed to identify and annotate repetitive elements in nucleotide sequences and mask them for further analysis. The repetitive elements, including low-complexity DNA sequences and interspersed repeats, are annotated and replaced by Ns, Xs, or lowercase letters (see below for options) in the corresponding positions of the DNA sequence. A new addition to the RepeatMasker package is a program that also identifies repetitive elements within protein sequences. Here, we focus on utilizing RepeatMasker to identify repetitive elements in genomic sequences. To run RepeatMasker, one needs to select the repeat library files, which contain consensus sequences of repetitive elements. Currently, Repbase Update (Jurka, 2001; Jurka et al., 2005; http://www.girinst.org/) is the largest commercially available repeat library (free for academic use) and covers a number of organisms including human, rodent, zebrafish, Drosophila, and Arabidopsis thaliana. Library files for organisms that do not have Repbase Update library files can be generated ab initio using RECON (Bao and Eddy, 2002; http://selab.janelia.org/recon.html) or RepeatScout (http://bix.ucsd.edu/repeatscout/; Price et al., 2005). The newest version of RECON, v. 1.06, was released recently and is available from the RepeatModeler package at http://www.repeatmasker.org/RepeatModeler.html. Sequence comparisons in RepeatMasker are usually carried out by the program cross_match, developed by Phil Green (http://www.phrap.org/consed/consed.html#howToGet). One can also use WU-BLAST (http://info.cchmc.org/help/wublast.html; see Alternate Protocol) in place of cross_match for faster processing.

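
    To make the three masking styles mentioned above concrete (Ns, Xs, or lowercase letters), here is a minimal Python sketch. It is not RepeatMasker's own code: the function name `mask_sequence`, the 1-based inclusive interval convention, and the example coordinates are all assumptions made for illustration.

```python
def mask_sequence(seq, repeats, mode="N"):
    """Mask repeat intervals in a DNA sequence.

    repeats: list of (start, end) pairs, assumed 1-based and inclusive.
    mode "N" or "X" replaces repeat bases with that letter;
    mode "lower" keeps the bases but writes them in lowercase.
    """
    if mode not in ("N", "X", "lower"):
        raise ValueError("mode must be 'N', 'X', or 'lower'")
    bases = list(seq.upper())
    for start, end in repeats:
        # Convert the 1-based inclusive interval to 0-based indices,
        # clamping to the bounds of the sequence.
        for i in range(max(start - 1, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if mode == "lower" else mode
    return "".join(bases)


example = "ACGTACGTACGTACGT"
hits = [(3, 6), (11, 12)]  # hypothetical repeat coordinates
print(mask_sequence(example, hits, "N"))      # ACNNNNGTACNNACGT
print(mask_sequence(example, hits, "lower"))  # ACgtacGTACgtACGT
```

    Lowercase masking keeps the underlying bases readable, whereas N or X masking discards them; which is preferable depends on the downstream tool.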

    USING RepeatMasker VIA THE WEB INTERFACE

    RepeatMasker may be accessed through the Web at http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker. Unlike the command-line version of RepeatMasker (see Basic Protocol 2), Web RepeatMasker has a nucleotide sequence size limit of 100 kb. An attempt to analyze a sequence larger than 100 kb fails (whereupon a prompt is displayed in a message window, shown in Fig. 4.10.1). Sequences shorter than 100 kb are readily analyzed using the Web RepeatMasker, with the time needed for processing correlating with the length of the sequence. For faster service outside North America, there are RepeatMasker mirror sites in Germany, Israel, and Australia.


    On the other hand, if one routinely submits large sequences for analysis, it may be better to download the command-line version and run RepeatMasker locally (see Basic Protocol 2). Importantly, if the query sequence exceeds the 100-kb limit, the only choice is to download RepeatMasker and run it locally.

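
    Before submitting to the Web server, it can help to check the input locally. The sketch below flags sequences over the size limit and, because step 1 of the protocol notes that non-DNA symbols are also rejected, symbols outside the nucleotide alphabet. Several points are assumptions rather than documented server behaviour: "100 kb" is read as 100,000 bases, the limit is applied per sequence, and the accepted alphabet is taken to be the IUPAC nucleotide codes. The function names are hypothetical.

```python
IUPAC_DNA = set("ACGTRYSWKMBDHVN")  # assumption: IUPAC nucleotide codes are accepted
WEB_LIMIT_BP = 100_000              # assumption: "100 kb" read as 100,000 bases


def read_fasta(text):
    """Return {header: sequence} parsed from FASTA-formatted text."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def check_for_web(text, limit=WEB_LIMIT_BP):
    """Return a list of problems that would likely make a Web submission fail."""
    problems = []
    for name, seq in read_fasta(text).items():
        bad = sorted(set(seq.upper()) - IUPAC_DNA)
        if bad:
            problems.append(f"{name}: non-DNA symbols {''.join(bad)}")
        if len(seq) > limit:
            problems.append(f"{name}: {len(seq)} bp exceeds the {limit} bp limit")
    return problems
```

    An empty list suggests the file is a candidate for the Web interface; any size problem means running RepeatMasker locally, as described above.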

    Necessary Resources

    Hardware

    Any Internet-connected computer

    Software

    Web browser: e.g., Mozilla Firefox or Internet Explorer

    Files

    A FASTA file (APPENDIX 1B) or a collection of FASTA files can be processed via the Web interface. Note that the size limit is 100 kb for RepeatMasker via the Web. The example file used in this protocol is a 22,539-bp human genomic DNA sequence from the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway). Its coordinates are chr10:62743355-62765893.

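
    As a quick sanity check on the example region, its length can be recomputed from the coordinates. This sketch assumes the coordinates are 1-based and inclusive (as positions are displayed in the UCSC Genome Browser), so the length is end - start + 1; the helper name `region_length` is hypothetical.

```python
import re


def region_length(region):
    """Parse 'chrom:start-end' and return (chrom, start, end, length).

    Assumes 1-based, inclusive coordinates, so length = end - start + 1.
    """
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", region.replace(",", ""))
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if end < start:
        raise ValueError("end coordinate is smaller than start coordinate")
    return chrom, start, end, end - start + 1


chrom, start, end, length = region_length("chr10:62743355-62765893")
print(chrom, length)  # chr10 22539
```

    The result, 22,539 bp, matches the length stated for the example file and is well under the 100 kb Web limit.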

    1. Point the Web browser to http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker. Load the FASTA sequence file (maximum 100 kb) by entering the sequence name or browsing the file. Alternatively, paste the FASTA sequence (maximum 100 kb) into the indicated text field. 


    RepeatMasker will return an error message if the input sequence contains non-DNA symbols or if the sequence is too long.


    2. Select a format for results from the two radio buttons next to “return format”: “html” or “tar file.” 


    If “html” is selected, the results will be written as an html file. If “tar file” is selected, the results will be packed into an archive with the Unix “tar” utility. For the example here, select “html.”

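
    If the “tar file” format is chosen instead, the downloaded archive can be inspected without unpacking it. The following is a minimal sketch using Python's standard `tarfile` module; the source does not say what the archive is called or which files it contains, so the function names are hypothetical and the code lists whatever it finds rather than assuming file names.

```python
import tarfile


def list_archive(path):
    """Return (name, size in bytes) for every regular file in a tar archive.

    Mode "r:*" lets tarfile detect compression (none, gzip, bz2, xz) itself,
    since the source does not say whether the archive is compressed.
    """
    with tarfile.open(path, mode="r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def read_member(path, member_name):
    """Return the raw bytes of one file stored inside the archive."""
    with tarfile.open(path, mode="r:*") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        return handle.read()
```

    For example, calling `list_archive` on the downloaded file shows which result files were returned, and `read_member` then reads one of them by name.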

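    ...

    The rest of this post is available on my WeChat official account. For some reason Jianshu would not let me publish it again, perhaps because the images are too large, or because the text is too long?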
