[r<-Development] Bioconductor Package Development (Part 2): Development Guidelines

Author: 王诗翔 | Published: 2019-04-23 14:56

https://www.bioconductor.org/developers/package-guidelines/

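Introduction

The Bioconductor project promotes high-quality, well-documented, and interoperable software. The guidelines in this article help achieve that goal; they are not meant to put an undue burden on package authors, and authors who have difficulty satisfying a guideline should seek advice on the bioc-devel mailing list.

When developing Bioconductor packages, package maintainers are urged to follow these guidelines as closely as possible.

For general instructions on creating packages, see the Writing R Extensions manual that ships with R, or the R web site.

Remember that these are the minimum requirements for package acceptance; the package will still be subject to the other guidelines below and to a formal technical review by a Bioconductor team member.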

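General Package Development

Bioconductor and R Versions

Package developers should always use the devel version of Bioconductor when developing and testing packages.

Depending on the R release cycle, using Bioconductor devel may require the devel version of R; see Bioconductor Package Development (Part 1): Setting Up the Development Environment.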
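As a quick sanity check before development starts, the sketch below reports whether the running R is a development build and, if BiocManager happens to be installed, which Bioconductor version it targets. The helper name is_r_devel is made up for this illustration, and the BiocManager call is wrapped in tryCatch() on the assumption that it may need network access.

```r
# Report whether an R build is a development build.
# R.version$status is "" for a release and
# "Under development (unstable)" for R-devel.
is_r_devel <- function(status = R.version$status) {
    grepl("development", status, ignore.case = TRUE)
}

message("R ", as.character(getRversion()), " is devel: ", is_r_devel())

# BiocManager is optional here; the guard keeps the sketch runnable without it.
if (requireNamespace("BiocManager", quietly = TRUE)) {
    tryCatch(
        message("Bioconductor version: ", as.character(BiocManager::version())),
        error = function(e) message("Could not determine Bioconductor version")
    )
    # To switch an installation to Bioconductor devel (not run here):
    # BiocManager::install(version = "devel")
}
```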

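Correctness, Space and Time

  1. At a minimum, Bioconductor packages must pass R CMD build (or R CMD INSTALL --build) and pass R CMD check with no errors and no warnings using a recent version of R. Developers should also address all notes.

  2. Packages must also pass R CMD BiocCheck with no errors and no warnings. The BiocCheck package contains a collection of tests for Bioconductor best practices. Developers must address any issues that arise.

  3. Because not all file systems are case-sensitive, do not use file names that differ only in case.

  4. The source package produced by R CMD build should be smaller than 5MB.

  5. The package must complete R CMD check --no-build-vignettes within 10 minutes. The --no-build-vignettes option ensures that vignettes are built only once.

  6. Vignette and man page examples must not use more than 3GB of memory, because R cannot allocate more than that on 32-bit Windows.

  7. For software packages, individual files must be smaller than 5MB.

  8. The raw package directory should not contain any unnecessary files, system files, or hidden files such as .DS_Store, .project, .git, cache files, log files, *.Rproj, *.so, etc.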
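Items 3 and 7 of the list above can be checked locally before submitting. The following is a minimal sketch in base R, not part of BiocCheck: the function names are invented for this example, and the 5MB limit is assumed to mean 5 * 1024^2 bytes.

```r
# Return the paths that collide once case is ignored.
find_case_clashes <- function(paths) {
    lower <- tolower(paths)
    paths[lower %in% lower[duplicated(lower)]]
}

# Return the files under pkg_dir that are larger than max_bytes.
find_large_files <- function(pkg_dir, max_bytes = 5 * 1024^2) {
    files <- list.files(pkg_dir, recursive = TRUE, all.files = TRUE)
    sizes <- file.size(file.path(pkg_dir, files))
    files[!is.na(sizes) & sizes > max_bytes]
}

# Case clashes are detected on the path strings alone.
find_case_clashes(c("R/utils.R", "R/Utils.R", "R/plot.R"))

# Build a throwaway package-like directory to demonstrate the size check.
demo_dir <- file.path(tempdir(), "demoPkg")
dir.create(file.path(demo_dir, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines("x <- 1", file.path(demo_dir, "R", "small.R"))
find_large_files(demo_dir)                 # nothing above the 5MB limit
find_large_files(demo_dir, max_bytes = 1)  # tiny limit, for illustration only
```

Working on path strings for the case check keeps the function usable on case-insensitive file systems, where two such files could not coexist on disk in the first place.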

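R CMD check Environment

(This simply sets the options used by check; if you are interested, download the file from GitHub and add the environment variable.)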

It is possible to activate or deactivate a number of options in R CMD build and R CMD check. Options can be set as individual environment variables or they can be listed in a file. Descriptions of all the different options available can be found here. Bioconductor has chosen to customize some of these options for incoming submissions during R CMD check. The file of utilized flags can be downloaded from GitHub. The file can either be placed in a default directory as directed here or can be set through the environment variable R_CHECK_ENVIRON with a command similar to

export R_CHECK_ENVIRON=<path to downloaded file>
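When the check is launched from an R session rather than a shell, the same variable can be set with Sys.setenv(), since child processes inherit the session environment. The helper below is a hypothetical convenience wrapper, and the path in the usage comment is a placeholder for wherever the flags file was saved.

```r
# Point R_CHECK_ENVIRON at a downloaded flags file, if that file exists.
use_check_environ <- function(path) {
    if (!file.exists(path)) {
        warning("flags file not found: ", path)
        return(invisible(FALSE))
    }
    # Store the absolute path so the setting survives directory changes.
    Sys.setenv(R_CHECK_ENVIRON = normalizePath(path))
    invisible(TRUE)
}

# Usage (placeholder path, not run):
# use_check_environ("<path to downloaded file>")
# Sys.getenv("R_CHECK_ENVIRON")
```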

DESCRIPTION

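The DESCRIPTION file must be properly formatted. The following sections cover some important notes on DESCRIPTION fields and associated files.

Developing with the devtools and usethis packages is recommended; they can generate the template described below automatically. The fields are easy to understand, and anything unclear can simply be copied from an existing package.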

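  1. “Package:” field: the package name. It must match the GitHub repository name and is case-sensitive. The name must not belong to a package that already exists on CRAN or Bioconductor. The following code can be used to check whether a package already exists:

     # Install BiocManager first if it is not already available.
     if (!requireNamespace("BiocManager", quietly = TRUE))
         install.packages("BiocManager")
     # A warning that "MyPackage" is not available suggests the name is free.
     BiocManager::install("MyPackage")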
    
    
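  2. “Title:” field: a brief descriptive title.

  3. “Version:” field: all Bioconductor packages use an x.y.z version scheme; see Version Numbering. A package submitted to Bioconductor for the first time must set its pre-release version to 0.99.0. The rules are:

    • x is usually 0 for packages that have not yet been released.
    • y is even for packages in release and odd for packages in devel.
    • z is incremented with every change.
  4. “Description:” field: one or more sentences describing what the package does.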

  5. “Authors@R or Author/Maintainer:” fields: Use either the Authors@R field (option 1) or the Author: and Maintainer: fields (option 2), not both. A maintainer designation (cre for Authors@R) is required with an actively maintained email. This email will be used for contact regarding any issues that arise with your package in the future. We prefer the Authors@R format giving all the authors with appropriate roles. For persons with an ORCID identifier (see ORCiD for more information), provide the identifier via an element named “ORCID” in the comment argument of person(). Example: person("Lori", "Shepherd", email = "Lori.Shepherd@roswellpark.org", role = c("cre", "aut"), comment = c(ORCID = "0000-0002-5910-4010")). Option 1 is recommended because it is simpler; taking the field from any released package and adapting it is enough.

There can be only one maintainer!
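The x.y.z rules from the “Version:” item above can be expressed in a few lines of R. This is only an illustrative sketch; describe_bioc_version is a made-up helper, not a Bioconductor function.

```r
# Classify a Bioconductor-style x.y.z version string.
describe_bioc_version <- function(version) {
    # package_version() validates the string; unclass() exposes the integers.
    parts <- unclass(package_version(version))[[1]]
    stopifnot(length(parts) == 3)
    list(
        prerelease = parts[1] == 0,                            # x is 0 before first release
        branch = if (parts[2] %% 2 == 0) "release" else "devel" # y even: release, odd: devel
    )
}

describe_bioc_version("0.99.0")  # the version required for a first submission
describe_bioc_version("1.2.3")
```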

  1. “License:” field: should preferably refer to a standard license (see wikipedia) using one of R’s standard specifications. Be specific about any version that applies (e.g., GPL-2). Core Bioconductor packages are typically licensed under Artistic-2.0. To specify a non-standard license, include a file named LICENSE in your package (containing the full terms of your license) and use the string “file LICENSE” (without the double quotes) in this “License:” field. The package should contain only code that can be redistributed according to the package license. Be aware of the licensing agreements for packages you are depending on in your package. Not all packages are open source even if they are publicly available.

  2. “LazyData:” field: For packages that include data, we recommend not including LazyData: TRUE. This rarely proves to be a good thing. In our experience it only slows down the loading of packages with large data.

  3. “Depends/Imports/Suggests/Enhances:” fields:

    • All packages must be available via Bioconductor or CRAN; users and the automated build system have no way to install packages from other sources.

    • Reuse, rather than re-implement or duplicate, well-tested functionality from other packages. Make use of appropriate existing packages (e.g., biomaRt, AnnotationDbi, Biostrings) and classes (e.g., SummarizedExperiment, GRanges, Rle, DNAStringSet), and avoid duplication of functionality available in other Bioconductor packages. See Common Bioconductor Methods and Classes. Bioconductor Reviewers are very strict on this point! New packages should be interoperable with existing Bioconductor classes and not reimplement functionality especially with regards to importing/reading data.

    • A package can be listed only once between Depends/Imports/Suggests/Enhances. Determine placement of package based on the following guidelines:

      • Imports: is for packages that provide functions, methods, or classes that are used inside your package name space. Most packages are listed here.
      • Depends: is for packages that provide essential functionality for users of your package, e.g., the GenomicRanges package is listed in the Depends: field of GenomicAlignments. It is unusual for more than three packages to be listed as ‘Depends:’.
      • Suggests: is for packages used in vignettes or examples, or in conditional code.
      • Enhances: is for packages such as Rmpi or parallel that enhance the performance of your package, but are not strictly needed for its functionality.
    • It is seldom necessary to specify R or specific versions as dependencies, since the Bioconductor release strategy and standard installation instructions guarantee these constraints. Repositories mirrored outside Bioconductor should include branches for each Bioconductor release, and may find it useful to fully specify versions to enforce constraints otherwise guaranteed by Bioconductor installation practices.

  4. “SystemRequirements:” field: This field is for listing any external software which is required, but not automatically installed by the normal package installation process. If the installation process is non-trivial, a top-level README file should be included to document the process.

  5. “biocViews:” field: REQUIRED! Specify at least two biocViews categories. Multiple terms are encouraged but terms must come from the same package type (Software, AnnotationData, ExperimentData, Workflow).

  6. “BugReports:” field: It is encouraged to include the relevant links to Github for reporting Issues.

  7. “URL:” field: This field directs users to source code repositories, additional help resources, etc; details are provided in “Writing R Extensions”, RShowDoc("R-exts").

  8. “Video:” field: This field displays links to instructional videos.

  9. “Collate:” field: This may be necessary to order class and method definitions appropriately during package installation.
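To see how the fields above fit together, the sketch below writes a small DESCRIPTION file with write.dcf() and reads it back with read.dcf(). Every value is a placeholder chosen for illustration (package name, author, URL, and the two biocViews terms), not a recommendation for any real package.

```r
# check.names = FALSE keeps the "Authors@R" column name intact.
desc <- data.frame(
    Package = "MyPackage",
    Title = "A Brief Descriptive Title",
    Version = "0.99.0",
    `Authors@R` = paste0(
        'person("First", "Last", email = "first.last@example.org", ',
        'role = c("aut", "cre"))'
    ),
    Description = "One or more sentences describing what the package does.",
    License = "Artistic-2.0",
    Imports = "methods, stats",
    Suggests = "knitr, testthat",
    biocViews = "Software, Sequencing",
    BugReports = "https://github.com/<user>/MyPackage/issues",
    check.names = FALSE,
    stringsAsFactors = FALSE
)

desc_file <- file.path(tempdir(), "DESCRIPTION")
write.dcf(desc, file = desc_file)

# Read the file back and confirm that at least two biocViews terms are present.
fields <- read.dcf(desc_file)
views <- strsplit(unname(fields[, "biocViews"]), ",\\s*")[[1]]
length(views) >= 2
```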

NAMESPACE

The NAMESPACE file can be generated automatically with devtools::document().

A NAMESPACE file defines the functions, classes, and methods that are imported into the name space, and exported for users. Bioconductor reviewers will be looking for:

  1. Exported functions should use camel case or underscores and should not include “.”, which indicates S3 dispatch.

  2. Generally importFrom() is encouraged over importing an entire package, however if there are many functions from a single package, import() is okay.

  3. Exporting all functions with exportPattern("^[[:alpha:]]+") is strongly discouraged.
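With roxygen2 (which devtools::document() runs), these NAMESPACE entries come from tags in the comment block above each function. The sketch below shows a hypothetical function, medianWidth, whose tags would produce importFrom(stats, median) and export(medianWidth), followed by a small made-up helper that flags candidate export names containing a dot.

```r
#' Median width of a set of ranges (hypothetical example function)
#'
#' @param widths A numeric vector of widths.
#' @return A single numeric value.
#' @importFrom stats median
#' @export
medianWidth <- function(widths) {
    stopifnot(is.numeric(widths))
    median(widths)
}

# Flag candidate export names that contain a dot.
has_dot <- function(names) grepl(".", names, fixed = TRUE)

medianWidth(c(1, 2, 10))
has_dot(c("medianWidth", "median_width", "median.width"))
```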

NEWS

This file can follow the template format below.

A NEWS file should be included to keep track of changes to the code from one version to the next. It can be a top level file or in the inst/ directory. Specifics on formatting can be found on the help page for ?news. Bioconductor uses the NEWS file to create the semi-annual release announcement. It must include list elements and cannot be a plain text file. An example format:

Changes in version 0.99.0 (2018-05-15)
+ Submitted to Bioconductor

Changes in version 1.1.1 (2018-06-15)
+ Fixed bug. Begin indexing from 1 instead of 2
+ Made the following significant changes
  o added a subsetting method
  o added a new field to database

After you install your package, the following can be run to see if the NEWS is properly formatted:

utils::news(package="<name of your package>")

The output should look similar to the following

Changes in version 1.1.1 (2018-06-15):

    o   Fixed bug. Begin indexing from 1 instead of 2

    o   Made the following significant changes
    o added a subsetting method
    o added a new field to database

Changes in version 0.99.0 (2018-05-15):

    o   Submitted to Bioconductor

If you get something like the following there are formatting ERRORS that need to be corrected:

Version: 0.99.0
Date: 2018-05-15
Text: Submitted to Bioconductor

Version: 1.1.1
Date: 2018-06-15
Text: Fixed bug. Begin indexing from 1 instead of 2

Version: 1.1.1
Date: 2018-06-15
Text: Made the following significant changes o added a subsetting
    method o added a new field to database

CITATION

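Citation information can be placed in the package startup message or in the package documentation; to set it up separately as described below, the CITATION file of a published package can be used as a model.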

Appropriate citations must be included in help pages (e.g., in the see also section) and vignettes; this aspect of documentation is no different from any scientific endeavor. The file inst/CITATION can be used to specify how a package is to be cited.

Whether or not a CITATION file is present, an automatically-generated citation will appear on the package landing page on the Bioconductor web site. For optimal formatting of author names (if a CITATION file is not present), specify the package author and maintainer using the Authors@R field as described in Writing R Extensions.
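An inst/CITATION file is ordinary R code that builds citation entries, typically with bibentry() from the utils package. The entry below is a sketch with placeholder values only (title, author, year and note are all invented); a real file would describe the actual package or publication.

```r
library(utils)

# A placeholder citation entry of the kind an inst/CITATION file would contain.
pkg_citation <- bibentry(
    bibtype = "Manual",
    title = "MyPackage: A Brief Descriptive Title",
    author = person("First", "Last"),
    year = "2019",
    note = "R package version 0.99.0"
)

print(pkg_citation, style = "text")  # human-readable form
toBibtex(pkg_citation)               # BibTeX form
```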

(I will tidy up the rest when I have time.)

Including Data

An excellent practice is to develop a software package, and to provide or use an existing experiment data package, annotation data or data in the ExperimentHub or AnnotationHub to give a comprehensive illustration of the methods in the software package.

If existing data is not available or applicable, or a new smaller dataset is needed for examples in the package, data can be included either as a separate data package (for larger amounts of data) or within the package (for smaller datasets).

Additional Experiment Data Package

Experiment data packages contain data specific to a particular analysis or experiment. They often accompany a software package for use in the examples and vignettes and in general are not updated regularly. If you need a general subset of data for workflows or examples, first check the AnnotationHub resource for available files (e.g., BAM, FASTA, BigWig, etc.). Bioconductor encourages creating an experiment data package that utilizes ExperimentHub or AnnotationHub (see Creating an Experiment Hub Package or Creating an Annotation Hub Package), but a traditional package that encapsulates the data is also okay. See the Package Submission page for submitting related packages.

Adding Data to Existing Package

Bioconductor strongly encourages the use of existing datasets, but if none is available, data can be included directly in the package for use in the examples found in man pages, vignettes, and tests of your package. This is a good reference by Hadley Wickham concerning data. As mentioned above, however, Bioconductor does not encourage using LazyData: TRUE despite its recommendation in that article. Some key points are summarized below.

Exported Data and the data/ Directory

Data in data/ is exported to the user and readily available. It is made available in an R session through the use of data(). It will require documentation concerning its creation and source information. It is most often a .RData file created with save(), but other types are acceptable as well; see ?data. Please remember to compress the data.
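A minimal sketch of creating a compressed .RData file follows. It writes to a temporary directory so that it runs anywhere; in a real package the file would go under data/. The object name example_counts and the choice of "xz" compression are assumptions for the example.

```r
# A small placeholder dataset.
example_counts <- data.frame(
    gene = c("geneA", "geneB"),
    count = c(10L, 25L),
    stringsAsFactors = FALSE
)

# In a package this path would be data/example_counts.RData.
out_file <- file.path(tempdir(), "example_counts.RData")
save(example_counts, file = out_file, compress = "xz")

# Report size, compression type, and format version of the saved file.
tools::checkRdaFiles(out_file)

# Round trip: load into a fresh environment and compare with the original.
env <- new.env()
load(out_file, envir = env)
identical(env$example_counts, example_counts)
```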

Raw Data and the inst/extdata/ Directory

It is often desirable to show a workflow which involves parsing or loading of raw files. Bioconductor recommends finding existing raw data already provided in another package or the hubs; however, if this is not applicable, raw data files should be included in inst/extdata. Files of this type are often accessed using system.file(). Bioconductor requires documentation on these files in an inst/script/ directory.
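The sketch below shows how system.file() resolves a file inside an installed package. The stats package and its DESCRIPTION file stand in for a real package and data file here, purely so that the example runs on any R installation; with real raw data the call would look like system.file("extdata", "<file>", package = "<package>").

```r
# Locate a file shipped with an installed package.
path <- system.file("DESCRIPTION", package = "stats")
nzchar(path)  # a non-empty string means the file was found

# Read something from the located file.
read.dcf(path, fields = "Package")

# For a file that does not exist, system.file() returns "" by default.
missing_path <- system.file("extdata", "no-such-file.txt", package = "stats")
identical(missing_path, "")
```

Passing mustWork = TRUE to system.file() turns the silent empty string into an error, which is often preferable in examples and tests.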

Internal Data

Rarely, a package may require parsed data that is used internally but should not be exported to the user. An R/sysdata.rda file is often the best place to include this type of data.

Package Documentation

Package documentation is important for users to understand how to work with your code. Bioconductor requires a vignette with executable code that demonstrates how to use the package to accomplish a task, man pages for all exported functions with runnable examples, well documented data structures (especially if not a pre-existing class), and well documented datasets for data in data/ and in inst/extdata/. References to the methods used, as well as to other similar or related projects and packages, are also expected. If data structures differ from similar packages, Bioconductor reviewers will expect some justification as to why. Keep in mind it is always possible to extend existing classes.

Vignettes

A vignette demonstrates how to accomplish non-trivial tasks embodying the core functionality of your package. There are two common types of vignettes. A Sweave vignette is an .Rnw file that contains LaTeX and chunks of R code. Each R code chunk starts with a line <<>>= and ends with @. Each chunk is evaluated during R CMD build, prior to LaTeX compilation to a PDF document. An R markdown vignette is similar to a Sweave vignette, but uses markdown instead of LaTeX for structuring text sections and results in HTML output. The knitr package can process most Sweave and all R markdown vignettes, producing pleasing output. Refer to Writing package vignettes for technical details. See the BiocStyle package for a convenient way to use common macros and a standard style.

A vignette provides reproducibility: the vignette produces the same results as copying the corresponding commands into an R session. It is therefore essential that the vignette embed executed R code. Shortcuts (e.g., using a LaTeX verbatim environment, or using the Sweave eval=FALSE flag, or equivalent tricks in markdown) undermine the benefit of vignettes and are generally not allowed; exceptions can be made with proper justification and are at the Bioconductor reviewers' discretion.

All packages are required to have at least one vignette. Vignettes go in the vignettes directory of the package. Vignettes are often used as stand-alone documents, so best practices are to include an informative title, the primary author of the vignette, the last modified date of the vignette, and a link to the package landing page. We encourage the use of BiocStyle for formatting.

Some best coding practices for Bioconductor vignettes are as follows:

  1. Add an “Introduction” section that serves as an abstract to introduce the objective, models, unique functions, key points, etc that distinguish the package from other packages of similar type.

  2. Add an “Installation” section that shows users how to download and load the package from Bioconductor.

  3. If appropriate, we strongly encourage a table of contents.

  4. Non-trivial executable code is a must!!! Static vignettes are not acceptable.

  5. Include a section with the sessionInfo().

  6. Only the vignette file (.Rnw or .Rmd) and any necessary static images should be in the vignette directory. No intermediate files should be present.

  7. Remember to include any relevant references to methods.
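The checklist above can be turned into a starting skeleton for an R markdown vignette. The sketch below writes one to a temporary file; the package name, author, and section titles are placeholders, and the YAML header (BiocStyle::html_document, the knitr::rmarkdown vignette engine) reflects a common setup rather than the only valid one. The installation chunk is marked eval = FALSE on the assumption that installing a package while building its own vignette is not wanted; the chunks that demonstrate the package should be evaluated.

```r
# Build the chunk fence at run time so the skeleton can contain code chunks.
fence <- strrep("`", 3)

skeleton <- c(
    "---",
    "title: \"Introduction to MyPackage\"",
    "author: \"First Last\"",
    "date: \"`r Sys.Date()`\"",
    "output: BiocStyle::html_document",
    "vignette: >",
    "  %\\VignetteIndexEntry{Introduction to MyPackage}",
    "  %\\VignetteEngine{knitr::rmarkdown}",
    "  %\\VignetteEncoding{UTF-8}",
    "---",
    "",
    "# Introduction",
    "",
    "# Installation",
    "",
    paste0(fence, "{r install, eval = FALSE}"),
    "BiocManager::install(\"MyPackage\")",
    fence,
    "",
    "# Session info",
    "",
    paste0(fence, "{r session-info}"),
    "sessionInfo()",
    fence
)

# In a package this file would live in the vignettes/ directory.
vignette_file <- file.path(tempdir(), "MyPackage.Rmd")
writeLines(skeleton, vignette_file)
```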

‘man’ Pages

All exported functions and classes will have a man page. Bioconductor also encourages having a package man page with an overview of the package and links to the main functions. Data man pages must include source information and data structure information. Man pages describing new classes must be very detailed on the structure and what type of information is stored. All man pages should have runnable examples. See the Writing R Extensions section on man pages for detailed instruction or format information for documenting a package, functions, classes, and data sets. All help pages should be comprehensive.

inst/script/

The scripts in this directory can vary. Most importantly, if data was included in inst/extdata/, a related script must be present in this directory documenting very clearly how the data was generated. It should include source URLs and any important information regarding filtering or processing. It can be executable code, pseudocode, or a text description. A user should be able to download the sources and roughly reproduce the file or object that is present as data.

Unit Tests

Unit tests are highly recommended. We find them indispensable for both package development and maintenance. Two of the main frameworks for testing are RUnit and testthat. Examples and explanations are provided here.
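As an illustration of the testthat style, here is a small sketch. The function gcFraction is invented for this example; in a package, the test block would normally live in a file under tests/testthat/. The requireNamespace() guard only exists so that the snippet also runs where testthat is not installed.

```r
# Hypothetical function under test: fraction of G/C bases in one sequence.
gcFraction <- function(seq) {
    stopifnot(is.character(seq), length(seq) == 1L)
    bases <- strsplit(toupper(seq), "")[[1]]
    mean(bases %in% c("G", "C"))
}

if (requireNamespace("testthat", quietly = TRUE)) {
    testthat::test_that("gcFraction counts G and C regardless of case", {
        testthat::expect_equal(gcFraction("GGCC"), 1)
        testthat::expect_equal(gcFraction("atgc"), 0.5)
        # Non-character input should be rejected by the argument check.
        testthat::expect_error(gcFraction(1))
    })
}
```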

R Code and Best Practices

Everyone has their own coding style and formats. There are however some best practice guidelines that Bioconductor will look for (see coding style). There are also some other key points:

  1. Only contain code that can be distributed under the license specified.

  2. Many common coding and syntax issues are flagged in R CMD check and R CMD BiocCheck (see the R CMD check cheatsheet and BiocCheck vignette). Some of the more prominent offenders:

    • Use vapply() instead of sapply() and use the various apply functions instead of for loops.
    • Use seq_len() or seq_along() instead of 1:...
    • Use TRUE/FALSE instead of T/F
    • Avoid class()== and class()!=; use is() instead
    • Use system2() instead of system()
    • Do not use set.seed in any internal R code.
    • No browser() calls should be in code
    • Avoid the use of <<-.
    • Avoid use of direct slot access with @ or slot(). Accessor methods should be created and utilized
  3. Some additional formatting and syntax guidelines

    • Use <- instead of = for assignment
    • Function names should be camel case or utilize the underscore _ and not have a dot . which indicates S3 dispatch.
    • Use dev.new() to start a graphics device if necessary. Avoid using x11() or X11(), since they can only be called on machines that have access to an X server.
  4. Avoid re-implementing functionality or classes. Make use of appropriate existing packages (e.g., biomaRt, AnnotationDbi, Biostrings, GenomicRanges) and classes (e.g., SummarizedExperiment, AnnotatedDataFrame, GRanges, DNAStringSet) to avoid duplication of functionality available in other Bioconductor packages. See also Common Bioconductor Methods and Classes. This encourages interoperability and simplifies your own package development. If new representation is needed, see the Essential S4 interface section of Robust and Efficient Code.

  5. Avoid large chunks of repeated code. If code is being repeated this is generally a good indication a helper function could be implemented.

  6. Excessively long functions should also be avoided. Write small functions. It’s best if each function has only one job that it needs to do, and it’s also best if it does that job in as few lines of code as possible. If you find yourself writing great big functions that run on for more than a screen, you should probably take a moment to split them up into smaller helper functions. Smaller functions are easier to read, debug, and reuse.

  7. Argument names to functions should be descriptive and well documented. Arguments should generally have default values. Check arguments against a validity check.

  8. Vectorize! Many R operations are performed on the whole object, not just the elements of the object (e.g., sum(x), not x[1] + x[2] + x[3] + ...). In particular, relatively few situations require an explicit for loop. See the Vectorize section of Robust and Efficient Code for additional detail.

  9. Follow guiding principles on Querying Web Resources if applicable.

  10. For parallel implementation please use BiocParallel. See also the Parallel Recommendations section of Robust and Efficient Code.
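Several of the points in the list above (vapply() over sapply(), seq_along() over 1:..., is() over class() ==, argument checks, and vectorization) are easiest to see side by side. The following is a small sketch in base R plus the methods package; lengthsOf is a made-up function name.

```r
library(methods)

# vapply() fixes the type and length of each result; is() replaces class() ==.
lengthsOf <- function(x) {
    if (!is(x, "list")) stop("'x' must be a list")
    vapply(x, length, integer(1))
}

x <- list(a = 1:3, b = character(0), c = letters)
lengthsOf(x)

# seq_along() yields an empty sequence for empty input, so the loop body
# never runs; 1:length(empty) would instead iterate over c(1, 0).
empty <- integer(0)
iterations <- 0L
for (i in seq_along(empty)) iterations <- iterations + 1L
iterations

# Vectorized operations act on the whole object at once.
values <- c(2, 4, 6)
sum(values)
```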

C or Fortran code

If the package contains C or Fortran code, it should adhere to the standards and methods described in the System and foreign language interfaces section of the Writing R Extensions manual. In particular:

  • Use internal R functions, e.g., R_alloc and random number generators, over system supplied ones.

  • Use C function registration (See the Registering native routines).

  • Use R_CheckUserInterrupt in C level loops when there is a chance that they may not terminate for certain parameter settings or when their run time exceeds 10 seconds with typical parameter settings, and the method is intended for interactive use.

  • Make judicious use of Makevars and Makefile within a package. These are often not required at all (See the Configure and cleanup).

  • During package development, enable all warnings and disable optimizations. If you plan to use a debugger, tell the compiler to include debugging symbols. The easiest way to enforce these is to create a user-level Makevars file (in the user’s home directory, in a sub-directory called ‘.R’). See examples below for flags for common toolchains. Consult the Writing R Extensions manual for details about Makevars files.

    • Example for gcc/g++:

      CFLAGS=-Wall -Wextra -pedantic -O0 -ggdb
      CXXFLAGS=-Wall -Wextra -pedantic -O0 -ggdb
      FFLAGS=-Wall -Wextra -pedantic -O0 -ggdb

    • Example for clang/clang++:

      CFLAGS=-Weverything -O0 -g
      CXXFLAGS=-Weverything -O0 -g
      FFLAGS=-Wall -Wextra -pedantic -O0 -g

Third-party code

Use of external libraries whose functionality is redundant with libraries already supported is strongly discouraged. In cases where the external library is complex the author may need to supply pre-built binary versions for some platforms.

By including third-party code a package maintainer assumes responsibility for maintenance of that code. Part of the maintenance responsibility includes keeping the code up to date as bug fixes and updates are released for the mainline third-party project.

For guidance on including code from some specific third-party sources, see the external code sources section of the C++ Best Practices guide.

The .gitignore File

Bioconductor requires a git repository for submission. There are certain system files that should not be git tracked and are unacceptable to include. These files can remain on a local system but should be excluded from the git repository which is possible by including a .gitignore file.

The following are files that are checked by Bioconductor and flagged as unacceptable:

hidden_file_ext = (
    ".renviron", ".rprofile", ".rproj", ".rproj.user", ".rhistory",
    ".rapp.history", ".o", ".sl", ".so", ".dylib", ".a", ".dll",
    ".def", ".ds_store", "unsrturl.bst", ".log", ".aux", ".backups",
    ".cproject", ".directory", ".dropbox", ".exrc", ".gdb.history",
    ".gitattributes", ".gitmodules", ".hgtags", ".project", ".seed",
    ".settings", ".tm_properties"
)
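A repository can be screened for such files before submission. The sketch below is not the Bioconductor check itself: it uses only a subset of the list above, and it assumes that matching is done on the lower-cased end of each file name. The function name find_flagged is invented for the example.

```r
# A subset of the flagged names/extensions listed above.
flagged_ext <- c(".rproj", ".rhistory", ".o", ".so", ".dll",
                 ".ds_store", ".log", ".aux")

# Return the paths whose lower-cased file name ends with a flagged suffix.
find_flagged <- function(paths, ext = flagged_ext) {
    base <- tolower(basename(paths))
    hit <- vapply(base, function(b) any(endsWith(b, ext)), logical(1),
                  USE.NAMES = FALSE)
    paths[hit]
}

find_flagged(c("R/foo.R", "src/foo.o", ".DS_Store",
               "MyPackage.Rproj", "vignettes/intro.log"))
```

Any path this returns is a candidate for a .gitignore entry.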

Conclusion

The following exercise How to Build Bioconductor Package with RStudio may also be helpful.

Remember that every Bioconductor package goes through a formal review process and may still receive technical feedback from the assigned Bioconductor Team Reviewer. An overview of the submission process may be found here and a package may be submitted to the new package tracker.

  1. The Bioconductor team member assigned to review the package during the submission process will expect all ERRORS, WARNINGS, and NOTES to be addressed. If there are any remaining, a justification of why they are not corrected will be expected.

  2. This is true for Software Packages. Experiment Data, Annotation, and Workflow packages are allowed additional space and check time.
