Common Workflow Language (Part 4)

Author: 生信师姐 | Published 2020-07-21 12:26

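16. File Formats

Questions

How can I mark the required file format for input files?
How can I mark the produced file format for output files?

Objective

Learn how to explicitly specify a format for File objects.

Tools and workflows can take File types as input and produce them as output. We recommend specifying the format for File types. This documents for others how to use your tool, and it allows some simple type checking when you create parameter files.

For file formats, we recommend referencing an existing ontology (such as EDAM in the example), referencing a local ontology for your institution, or, during quick development before you share your tool with others, not adding a file format at all. You can browse the existing file format listings of IANA and EDAM.

Note: as added value, cwltool can do some basic reasoning based on file formats and warn you when there is an obvious mismatch.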

metadata_example.cwl

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool

label: An example tool demonstrating metadata.

inputs:
  aligned_sequences:
    type: File
    label: Aligned sequences in BAM format
    format: edam:format_2572
    inputBinding:
      position: 1

baseCommand: [ wc, -l ]

stdout: output.txt

outputs:
  report:
    type: stdout
    format: edam:format_1964
    label: A text file that contains a line count

$namespaces:
  edam: http://edamontology.org/
$schemas:
  - http://edamontology.org/EDAM_1.18.owl

This CWL description is equivalent to the following command line:

wc -l /path/to/aligned_sequences.ext > output.txt
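
To make the mapping from description to command line concrete, here is a minimal Python sketch. It is not how cwltool is implemented; it only mimics the part of command-line construction that this example relies on (baseCommand first, then inputs ordered by inputBinding.position, then the stdout redirect). The function name build_command and the dictionary layout are assumptions made for illustration, and only File inputs are handled.

```python
import shlex

def build_command(tool, job):
    """Assemble a shell-style command line from a simplified tool description.

    Assumption: every bound input is a File whose value is given by "path".
    """
    argv = list(tool["baseCommand"])
    # Only inputs that carry an inputBinding appear on the command line.
    bound = [
        (spec["inputBinding"].get("position", 0), name)
        for name, spec in tool["inputs"].items()
        if "inputBinding" in spec
    ]
    # Lower positions come first; ties are broken by input name.
    for _position, name in sorted(bound):
        argv.append(job[name]["path"])
    command = " ".join(shlex.quote(part) for part in argv)
    if "stdout" in tool:
        command += " > " + shlex.quote(tool["stdout"])
    return command

tool = {
    "baseCommand": ["wc", "-l"],
    "inputs": {
        "aligned_sequences": {"type": "File", "inputBinding": {"position": 1}},
    },
    "stdout": "output.txt",
}
job = {"aligned_sequences": {"class": "File", "path": "file-formats.bam"}}

print(build_command(tool, job))  # wc -l file-formats.bam > output.txt
```

A real runner also stages the input file into a temporary directory, which is why the path in an actual run differs from the one in the parameter file.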

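Sample parameter file

Below is a sample parameter file for the example above. We encourage checking in working examples of parameter files for your tool. This lets others start using your tool quickly from a "known good" parameterization.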

sample.yml

aligned_sequences:
    class: File
    format: http://edamontology.org/format_2572
    path: file-formats.bam

Note: To follow the example below, you need to download the example input file, file-formats.bam. The file is available from https://github.com/common-workflow-language/user_guide/raw/gh-pages/_includes/cwl/16-file-formats/file-formats.bam and can be downloaded e.g. via wget:

wget https://github.com/common-workflow-language/user_guide/raw/gh-pages/_includes/cwl/16-file-formats/file-formats.bam

$ cwltool metadata_example.cwl sample.yml
/usr/local/bin/cwltool 1.0.20161114152756
Resolved 'metadata_example.cwl' to 'file:///media/large_volume/testing/cwl_tutorial2/metadata_example.cwl'
[job metadata_example.cwl] /tmp/tmpNWyAd6$ /bin/sh \
    -c \
    'wc' '-l' '/tmp/tmpBf6m9u/stge293ac74-3d42-45c9-b506-dd35ea3e6eea/file-formats.bam' > /tmp/tmpNWyAd6/output.txt
Final process status is success
{
  "report": {
    "format": "http://edamontology.org/format_1964",
    "checksum": "sha1$49dc5004959ba9f1d07b8c00da9c46dd802cbe79",
    "basename": "output.txt",
    "location": "file:///media/large_volume/testing/cwl_tutorial2/output.txt",
    "path": "/media/large_volume/testing/cwl_tutorial2/output.txt",
    "class": "File",
    "size": 80
  }
}

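Summary

You can document the expected format of input and output Files.
Once your tool is mature, we recommend specifying formats by referencing existing ontologies such as EDAM.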

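17. Metadata and Authorship

Question: How do I make it easier for people to cite my tool descriptions?

Objective: Learn how to add authorship information and other metadata to a CWL description.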

Implementation extensions not required for correct execution (for example, fields related to GUI presentation) and metadata about the tool or workflow itself (for example, authorship for use in citations) may be provided as additional fields on any object.

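Such extension fields (for example format: edam:format_2572) can use any of the namespace prefixes listed in the $namespaces section of the document (for example edam: http://edamontology.org/), as described in the Schema Salad specification. Once a namespace prefix has been added, it can be used anywhere in the document, as shown below. Otherwise the full URL must be used: format: http://edamontology.org/format_2572.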
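
The prefix mechanism can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Schema Salad implementation: the helper names expand_prefix and same_format are hypothetical, and formats are compared by exact URL only, without any ontology-based reasoning.

```python
def expand_prefix(value, namespaces):
    """Expand 'prefix:name' to a full URL when the prefix is declared.

    Values whose prefix is not in `namespaces` (including full URLs such as
    'http://...') are returned unchanged.
    """
    prefix, sep, local = value.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return value

def same_format(declared, supplied, namespaces):
    """Return True when both values expand to the same URL."""
    return expand_prefix(declared, namespaces) == expand_prefix(supplied, namespaces)

# The $namespaces section of the example documents, as a Python dict.
namespaces = {
    "edam": "http://edamontology.org/",
    "s": "https://schema.org/",
}

print(expand_prefix("edam:format_2572", namespaces))
# http://edamontology.org/format_2572
print(same_format("edam:format_2572", "http://edamontology.org/format_2572", namespaces))
# True
print(same_format("edam:format_2572", "edam:format_1964", namespaces))
# False
```

This is why the tool description can say edam:format_2572 while the parameter file spells out http://edamontology.org/format_2572: both name the same format.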

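For all developers, we recommend the following minimal metadata for tools and workflows. This example includes metadata that allows others to cite your tool.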

metadata_example2.cwl

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool

label: An example tool demonstrating metadata.
doc: Note that this is an example and the metadata is not necessarily consistent.

inputs:
  aligned_sequences:
    type: File
    label: Aligned sequences in BAM format
    format: edam:format_2572
    inputBinding:
      position: 1

baseCommand: [ wc, -l ]

stdout: output.txt

outputs:
  report:
    type: stdout
    format: edam:format_1964
    label: A text file that contains a line count

s:author:
  - class: s:Person
    s:identifier: https://orcid.org/0000-0002-6130-1021
    s:email: mailto:dyuen@oicr.on.ca
    s:name: Denis Yuen

s:contributor:
  - class: s:Person
    s:identifier: http://orcid.org/0000-0002-7681-6415
    s:email: mailto:briandoconnor@gmail.com
    s:name: Brian O'Connor

s:citation: https://dx.doi.org/10.6084/m9.figshare.3115156.v2
s:codeRepository: https://github.com/common-workflow-language/common-workflow-language
s:dateCreated: "2016-12-13"
s:license: https://spdx.org/licenses/Apache-2.0 

$namespaces:
  s: https://schema.org/
  edam: http://edamontology.org/

$schemas:
 - https://schema.org/version/latest/schema.rdf
 - http://edamontology.org/EDAM_1.18.owl

This CWL description is equivalent to the following command line:

wc -l /path/to/aligned_sequences.ext > output.txt

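Extended example

For those who are highly motivated, it is also possible to annotate a tool with a much larger amount of metadata. This example includes EDAM ontology tags as keywords (which allow related tools to be grouped), hints about the hardware requirements for using the tool, and a few more metadata fields.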

metadata_example3.cwl

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool

label: An example tool demonstrating metadata.
doc: Note that this is an example and the metadata is not necessarily consistent.

hints:
  ResourceRequirement:
    coresMin: 4

inputs:
  aligned_sequences:
    type: File
    label: Aligned sequences in BAM format
    format: edam:format_2572
    inputBinding:
      position: 1

baseCommand: [ wc, -l ]

stdout: output.txt

outputs:
  report:
    type: stdout
    format: edam:format_1964
    label: A text file that contains a line count

s:author:
  - class: s:Person
    s:identifier: https://orcid.org/0000-0002-6130-1021
    s:email: mailto:dyuen@oicr.on.ca
    s:name: Denis Yuen

s:contributor:
  - class: s:Person
    s:identifier: http://orcid.org/0000-0002-7681-6415
    s:email: mailto:briandoconnor@gmail.com
    s:name: Brian O'Connor

s:citation: https://dx.doi.org/10.6084/m9.figshare.3115156.v2
s:codeRepository: https://github.com/common-workflow-language/common-workflow-language
s:dateCreated: "2016-12-13"
s:license: https://spdx.org/licenses/Apache-2.0 

s:keywords: edam:topic_0091 , edam:topic_0622
s:programmingLanguage: C

$namespaces:
 s: https://schema.org/
 edam: http://edamontology.org/

$schemas:
 - https://schema.org/version/latest/schema.rdf
 - http://edamontology.org/EDAM_1.18.owl

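Summary

Metadata can be provided in CWL descriptions.
Developers should provide a minimal amount of authorship information to encourage correct citation.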

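18. Custom Types

Question: How do I create custom types and import them into a CWL description?

Sometimes you may want to write your own custom types for use and reuse in CWL descriptions. Custom types reduce redundancy between multiple descriptions that all use the same type, and they allow additional customisation and configuration of a tool or analysis without modifying the CWL description directly.

The example below is a CWL description of the biom convert format tool, which converts a standard biom table file to HDF5 format.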

custom-types.cwl

#!/usr/bin/env cwl-runner 
cwlVersion: v1.0
class: CommandLineTool

requirements:
  InlineJavascriptRequirement: {}
  ResourceRequirement:
    coresMax: 1
    ramMin: 100  # just a default, could be lowered
  SchemaDefRequirement:
    types:
      - $import: biom-convert-table.yaml

hints:
  DockerRequirement:
    dockerPull: 'quay.io/biocontainers/biom-format:2.1.6--py27_0'
  SoftwareRequirement:
    packages:
      biom-format:
        specs: [ "https://doi.org/10.1186/2047-217X-1-7" ]
        version: [ "2.1.6" ]

inputs:
  biom:
    type: File
    format: edam:format_3746  # BIOM
    inputBinding:
      prefix: --input-fp
  table_type:
    type: biom-convert-table.yaml#table_type
    inputBinding:
      prefix: --table-type

  header_key:
    type: string?
    doc: |
      The observation metadata to include from the input BIOM table file when
      creating a tsv table file. By default no observation metadata will be
      included.
    inputBinding:
      prefix: --header-key

baseCommand: [ biom, convert ]

arguments:
  - valueFrom: $(inputs.biom.nameroot).hdf5  
    prefix: --output-fp
  - --to-hdf5

outputs:
  result:
    type: File
    outputBinding: { glob: "$(inputs.biom.nameroot)*" }

$namespaces:
  edam: http://edamontology.org/
  s: https://schema.org/

$schemas:
  - http://edamontology.org/EDAM_1.16.owl
  - https://schema.org/version/latest/schema.rdf

s:license: https://spdx.org/licenses/Apache-2.0
s:copyrightHolder: "EMBL - European Bioinformatics Institute"

custom-types.yml

biom:
    class: File
    format: http://edamontology.org/format_3746
    path: rich_sparse_otu_table.biom
table_type: OTU table

Note: to run the example below, you need to download the example input file rich_sparse_otu_table.biom, for example via wget:

wget https://raw.githubusercontent.com/common-workflow-language/user_guide/gh-pages/_includes/cwl/19-custom-types/rich_sparse_otu_table.biom

At inputs:table_type (line 29 of custom-types.cwl above), the list of allowable table options used in the table conversion is imported as a custom object:

inputs:
  biom:
    type: File
    format: edam:format_3746  # BIOM
    inputBinding:
      prefix: --input-fp
  table_type:
    type: biom-convert-table.yaml#table_type
    inputBinding:
      prefix: --table-type

The reference to a custom type combines the name of the file in which the object is defined (biom-convert-table.yaml) with the name of the object in that file that defines the custom type (table_type), joined by a # character.

In this case the symbols array from the imported biom-convert-table.yaml file defines the allowable table options. For example, in custom-types.yml we pass OTU table as an input, which tells the tool to create an OTU table in HDF5 format.

The contents of the YAML file describing the custom type (biom-convert-table.yaml) are shown below:

type: enum
name: table_type
label: The type of the table to produce
symbols:
  - OTU table
  - Pathway table
  - Function table
  - Ortholog table
  - Gene table
  - Metabolite table
  - Taxon table
  - Table
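
As a rough illustration of how such a reference and enum fit together, the following Python sketch splits a file#name type reference and checks a value against the enum's symbols. It is a simplified stand-in for what a CWL runner does during validation; the function names split_type_reference and check_enum_value are hypothetical, and the enum is copied by hand into a dictionary rather than loaded from the YAML file.

```python
def split_type_reference(reference):
    """Split 'file#name' into (file, name)."""
    file_name, sep, type_name = reference.partition("#")
    if not sep or not file_name or not type_name:
        raise ValueError(f"not a file#name type reference: {reference!r}")
    return file_name, type_name

def check_enum_value(value, enum_def):
    """Return value if it is one of the enum's symbols, else raise ValueError."""
    if value not in enum_def["symbols"]:
        allowed = ", ".join(enum_def["symbols"])
        raise ValueError(f"{value!r} is not one of: {allowed}")
    return value

# Contents of biom-convert-table.yaml, copied here as a Python dict.
table_type = {
    "type": "enum",
    "name": "table_type",
    "symbols": [
        "OTU table", "Pathway table", "Function table", "Ortholog table",
        "Gene table", "Metabolite table", "Taxon table", "Table",
    ],
}

print(split_type_reference("biom-convert-table.yaml#table_type"))
# ('biom-convert-table.yaml', 'table_type')
print(check_enum_value("OTU table", table_type))
# OTU table
```

A value that is not listed in symbols is rejected before the tool runs, which is the practical benefit of declaring the option list as an enum instead of a plain string.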

To use a custom type in a CWL description, it must be imported. Imports are declared under requirements:SchemaDefRequirement, as in the custom-types.cwl example:

requirements:
  InlineJavascriptRequirement: {}
  ResourceRequirement:
    coresMax: 1
    ramMin: 100
  SchemaDefRequirement:
    types:
      - $import: biom-convert-table.yaml

Note that the author of this CWL description has also included a ResourceRequirement, specifying the minimum amount of RAM (ramMin) and a limit on the number of cores (coresMax) for the tool, as well as details of the version of the software that the description was written for and other useful metadata. These features are discussed further in other chapters of this user guide.

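Summary

You can create your own custom types to load into descriptions.
These custom types allow the user to configure a tool without modifying the tool description directly.
Custom types are described in separate YAML files and imported as needed.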

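19. Specifying Software Requirements

Question: How do I specify requirements/dependencies for a job?

Objectives

Learn how to write software requirement descriptions.
Learn how to use SciCrunch to retrieve a unique identifier for the required tool/version.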

Tool descriptions are often written for a specific version of a piece of software. To make it easier for others to use your descriptions, you can include a SoftwareRequirement field in the hints section. This also helps to avoid confusion about which version of a tool the description was written for.

cwlVersion: v1.0
class: CommandLineTool

label: "InterProScan: protein sequence classifier"

doc: |
      Version 5.21-60 can be downloaded here:
      https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload

      Documentation on how to run InterProScan 5 can be found here:
      https://github.com/ebi-pf-team/interproscan/wiki/HowToRun

requirements:
  ResourceRequirement:
    ramMin: 10240
    coresMin: 3
  SchemaDefRequirement:
    types:
      - $import: InterProScan-apps.yml

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829" ]
        version: [ "5.21-60" ]

inputs:
  proteinFile:
    type: File
    inputBinding:
      prefix: --input
  applications:
    type: InterProScan-apps.yml#apps[]?
    inputBinding:
      itemSeparator: ','
      prefix: --applications

baseCommand: interproscan.sh

arguments:
 - valueFrom: $(inputs.proteinFile.nameroot).i5_annotations
   prefix: --outfile
 - valueFrom: TSV
   prefix: --formats
 - --disable-precalc
 - --goterms
 - --pathways
 - valueFrom: $(runtime.tmpdir)
   prefix: --tempdir

outputs:
  i5Annotations:
    type: File
    format: iana:text/tab-separated-values
    outputBinding:
      glob: $(inputs.proteinFile.nameroot).i5_annotations

$namespaces:
 iana: https://www.iana.org/assignments/media-types/
 s: https://schema.org/
$schemas:
 - https://schema.org/version/latest/schema.rdf

s:license: https://spdx.org/licenses/Apache-2.0
s:copyrightHolder: "EMBL - European Bioinformatics Institute"

In this example, the software requirement being described is InterProScan version 5.21-60.

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829" ]
        version: [ "5.21-60" ]

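Depending on your CWL runner, these hints may be used to check that the required software is installed and available before the job is run. To enable these checks with the reference implementation, use the dependency resolvers configuration.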

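As well as a version number, a unique resource identifier (URI) for the tool is given in the form of an RRID. Resources with RRIDs can be looked up in the SciCrunch registry, which provides a portal for finding, tracking, and referring to scientific resources consistently. If you want to specify a tool as a SoftwareRequirement, search for the tool on SciCrunch and use the RRID it has been assigned in the registry. (To add a tool to SciCrunch, follow the SciCrunch tutorial.) You can use this RRID to refer to the tool (via identifiers.org) in the specs field of your requirement description. Other good choices, in order of preference, are the DOI of the main tool citation and the URL of the tool.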
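
The step from an RRID to the specs entry used in the example can be sketched in Python as follows. The URL prefix is taken from the example above; the regular expression is only a loose shape check assumed for illustration, not an official RRID grammar, and the helper names rrid_spec_url and software_requirement are hypothetical.

```python
import re

# Loose shape check: "RRID:", a registry prefix, an underscore, then an ID.
# This pattern is an assumption for illustration, not an official grammar.
RRID_SHAPE = re.compile(r"^RRID:[A-Za-z]+_[A-Za-z0-9]+$")

def rrid_spec_url(rrid):
    """Build the identifiers.org URL that goes into the specs field."""
    if not RRID_SHAPE.match(rrid):
        raise ValueError(f"does not look like an RRID: {rrid!r}")
    return "https://identifiers.org/rrid/" + rrid

def software_requirement(package, rrid, version):
    """Return a hints-style dictionary for a single package."""
    return {
        "SoftwareRequirement": {
            "packages": {
                package: {
                    "specs": [rrid_spec_url(rrid)],
                    "version": [version],
                }
            }
        }
    }

hints = software_requirement("interproscan", "RRID:SCR_005829", "5.21-60")
print(hints["SoftwareRequirement"]["packages"]["interproscan"]["specs"][0])
# https://identifiers.org/rrid/RRID:SCR_005829
```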

Summary

Software requirements should be specified under hints:SoftwareRequirement.

20. Writing Workflows

Question: How do I connect tools together into a workflow?

This workflow extracts a Java source file from a tar file and then compiles it.

1st-workflow.cwl

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow
inputs:
  tarball: File
  name_of_file_to_extract: string

outputs:
  compiled_class:
    type: File
    outputSource: compile/classfile

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: tarball
      extractfile: name_of_file_to_extract
    out: [extracted_file]

  compile:
    run: arguments.cwl
    in:
      src: untar/extracted_file
    out: [classfile]

[Figure: visualization of 1st-workflow.cwl]

The inputs for a run are described with a YAML or JSON object in a separate file:

1st-workflow-job.yml

tarball:
  class: File
  path: hello.tar
name_of_file_to_extract: Hello.java

$ echo "public class Hello {}" > Hello.java && tar -cvf hello.tar Hello.java
$ cwl-runner 1st-workflow.cwl 1st-workflow-job.yml
[job untar] /tmp/tmp94qFiM$ tar --extract --file /home/example/hello.tar Hello.java
[step untar] completion status is success
[job compile] /tmp/tmpu1iaKL$ docker run -i --volume=/tmp/tmp94qFiM/Hello.java:/var/lib/cwl/job301600808_tmp94qFiM/Hello.java:ro --volume=/tmp/tmpu1iaKL:/var/spool/cwl:rw --volume=/tmp/tmpfZnNdR:/tmp:rw --workdir=/var/spool/cwl --read-only=true --net=none --user=1001 --rm --env=TMPDIR=/tmp java:7 javac -d /var/spool/cwl /var/lib/cwl/job301600808_tmp94qFiM/Hello.java
[step compile] completion status is success
[workflow 1st-workflow.cwl] outdir is /home/example
Final process status is success
{
  "compiled_class": {
    "location": "/home/example/Hello.class",
    "checksum": "sha1$e68df795c0686e9aa1a1195536bd900f5f417b18",
    "class": "File",
    "size": 416
  }
}

Let's break it down:

cwlVersion: v1.0
class: Workflow

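The cwlVersion field indicates the version of the CWL specification used by the document.
The class field indicates that this document describes a workflow.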

inputs:
  tarball: File
  name_of_file_to_extract: string

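The inputs section describes the inputs of the workflow. It is a list of input parameters, where each parameter consists of an identifier and a data type. These parameters can be used as sources for the inputs of specific workflow steps.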

outputs:
  compiled_class:
    type: File
    outputSource: compile/classfile

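The outputs section describes the outputs of the workflow. It is a list of output parameters, where each parameter consists of an identifier and a data type. The outputSource connects the output parameter classfile of the compile step to the workflow output parameter compiled_class.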

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: tarball
      extractfile: name_of_file_to_extract
    out: [extracted_file]

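The steps section describes the actual steps of the workflow. In this example, the first step extracts a file from a tar file, and the second step compiles the file from the first step using the Java compiler. Workflow steps are not necessarily run in the order they are listed; instead, the order is determined by the dependencies between steps (expressed with source). In addition, workflow steps that do not depend on one another may run in parallel.

The first step, untar, runs tar-param.cwl (described earlier in the Parameter References chapter). This tool has two input parameters, tarfile and extractfile, and one output parameter, extracted_file.

The in section of the workflow step connects these two input parameters to fields in the inputs section of the workflow by means of source. This means that when the workflow step is executed, the values assigned to tarball and name_of_file_to_extract will be used for the parameters tarfile and extractfile in order to run the tool.

The out section of the workflow step lists the output parameters that are expected from the tool.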

  compile:
    run: arguments.cwl
    in:
      src: untar/extracted_file
    out: [classfile]

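The second step, compile, depends on the result of the first step: it connects its input parameter src to the output parameter of untar using untar/extracted_file. The output of this step, classfile, is connected to the outputs section of the workflow, described above.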
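
The rule that execution order follows the connections between steps, not the listing order, can be illustrated with a short Python sketch. It is a simplification and not how any particular CWL runner schedules work: it assumes that a source containing a slash (step/output) refers to another step and that any other source is a workflow input. The function names are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The steps of 1st-workflow.cwl as a Python dict, deliberately listed
# in the "wrong" order to show that listing order does not matter.
steps = {
    "compile": {
        "in": {"src": "untar/extracted_file"},
        "out": ["classfile"],
    },
    "untar": {
        "in": {"tarfile": "tarball", "extractfile": "name_of_file_to_extract"},
        "out": ["extracted_file"],
    },
}

def step_dependencies(steps):
    """Map each step to the set of steps whose outputs it consumes."""
    deps = {}
    for name, step in steps.items():
        deps[name] = set()
        for source in step["in"].values():
            # "untar/extracted_file" refers to output extracted_file of step untar;
            # a source without a slash is treated as a workflow-level input.
            if "/" in source:
                upstream, _output = source.split("/", 1)
                deps[name].add(upstream)
    return deps

def execution_order(steps):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(step_dependencies(steps)).static_order())

print(execution_order(steps))  # ['untar', 'compile']
```

Steps with no path between them in this dependency graph have no ordering constraint, which is what allows a runner to execute them in parallel.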

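Summary

Each step in a workflow must have its own CWL description.
The top-level inputs and outputs of the workflow are described in the inputs and outputs fields respectively.
The steps are specified under steps.
Execution order is determined by the connections between steps.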
