SJM: An HPC-SGE Job Submission and Management Tool

Author: 不二小张 | Published 2021-12-13 15:07

    Job Submission in an HPC-SGE Environment

      HPC refers to high-performance computing clusters, as opposed to a single server or node: a distributed architecture spanning nodes or clusters delivers the high compute power, large storage, and dynamic scaling that demanding workloads require.
      Each node exists relatively independently within the cluster, so on a multi-node cluster you need an appropriate submission tool to manage both the jobs and the cluster.
      On an SGE cluster, the usual submission tool is qsub; for details see the blog post 集群任务管理系统SGE的简明教程. A typical invocation is shown below.
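      
      For reference, a minimal qsub submission might look like the following (the queue name and resource values are illustrative, chosen to match the examples later in this post):
      # run job.sh from the current directory on queue test.q,
      # requesting 1 GB of virtual memory and one slot
      qsub -cwd -q test.q -l vf=1G,p=1 job.sh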

    SJM: A Job Orchestration and Status Monitoring Tool for HPC-SGE

    1. SJM overview
      SJM (Simple Job Manager) is developed and maintained by the StanfordBioinformatics team. It is a program for managing a group of related jobs running on a compute cluster, one layer of abstraction above sge-qsub, with the following features:
      • It provides a convenient way to specify dependencies between jobs and the resource requirements of each job (e.g. memory, CPU cores).
      • It monitors the status of the jobs so that you know when the whole group has finished. If any job fails (e.g. because a compute node crashes), SJM lets you resume without rerunning the jobs that already completed successfully.
      • Finally, SJM provides a portable way to submit jobs to different schedulers, such as Sun Grid Engine or Platform LSF.
    2. Installation
      # Note: Boost (http://www.boost.org) must be installed before building
      git clone https://github.com/StanfordBioinformatics/SJM.git
      cd SJM
      ./configure
      make
      sudo make install
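      
      As a quick sanity check after installing (assuming the default /usr/local prefix used by ./configure), confirm the binary is on your PATH:
      # should print the install location, e.g. /usr/local/bin/sjm
      which sjm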
      
    3. Job file attributes
      An example: each job opens with job_begin and closes with job_end; the job name is given by name, the memory request by memory, the run-time limit by time, the command by cmd, and the submission queue by queue:
      job_begin
          name jobA
          queue test.q
          time 1h
          memory 500m
          cmd echo "hello from job jobA"
      job_end
      
      Alternatively, the same job can be written in the following form, passing the scheduler options directly:
      job_begin
          name jobA
          sched_options -cwd -l vf=1G,p=1 -q test.q
          cmd echo "hello from job jobA"
      job_end
      
      A job that runs multiple commands can be written as follows, wrapping them in cmd_begin and cmd_end:
      job_begin
          name jobB
          time 2d
          memory 1G
          queue test.q
          cmd_begin
              /home/lacroute/project/jobB_prolog.sh;
              /home/lacroute/project/jobB.sh;
              /home/lacroute/project/jobB_epilog.sh
          cmd_end
      job_end
      
      Ordering between jobs is expressed with the order attribute, as shown below:
      order jobA before jobB
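      
      Multiple order lines combine into a dependency graph. For example (jobC here is a hypothetical third job), the following makes jobB and jobC both wait for jobA and then run in parallel:
      order jobA before jobB
      order jobA before jobC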
    4. Writing and submitting an SJM job
      Writing the job: suppose a pipeline has two parts, A and B, where B may only run after A has finished successfully. Reusable code:
      job_begin
          name jobA
          time 4h
          memory 3G
          queue standard
          project sequencing
          cmd /home/lacroute/project/jobA.sh
      job_end
      job_begin
          name jobB
          time 2d
          memory 1G
          queue extended
          cmd_begin
              /home/lacroute/project/jobB_prolog.sh;
              /home/lacroute/project/jobB.sh;
              /home/lacroute/project/jobB_epilog.sh
          cmd_end
      job_end
      order jobA before jobB
      log_dir /home/lacroute/project/log
      
      Submitting the job: save the code above in a file named test.job, then submit it as follows:
      export PATH=<directory containing sjm>:$PATH
      # submit in the background
      sjm test.job
      # submit in the foreground (interactive)
      sjm --interactive --log test.job.status.log test.job
      
    5. Monitoring an SJM workflow
      After submission (in background mode), sjm produces three additional files. The *.status file records the state of every job and is updated as the run progresses; each job is in one of the states below. Check the log file output promptly.
      # files
      test.job.status
      test.job.status.bak
      test.job.status.log
      # states
      waiting  pending, not yet submitted
      running  currently executing
      failed   run failed
      done     completed successfully
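      
      When a run fails, jobs already marked done keep that state in the *.status file. My understanding of SJM's recovery mechanism (worth verifying against its documentation) is that resubmitting the status file reruns only the unfinished jobs:
      # resume the workflow, skipping jobs already marked done
      sjm test.job.status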
      
    6. An idea: template-based pipeline tooling on the sjm job abstraction
      We could define a template convention in which individual modules are written as Makefiles and module orchestration is handled by sjm, giving pipeline modules high reusability and availability; a sketch follows.
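      A minimal sketch of the idea (the file align.mk and its align target are hypothetical): each sjm job simply invokes a self-contained Makefile module, so the module can be reused across pipelines while sjm handles ordering and monitoring:
      job_begin
          name align
          memory 4G
          queue test.q
          # the module's inputs, outputs, and tool logic live in the Makefile
          cmd make -f /path/to/modules/align.mk align
      job_end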

    Other HPC Workflow Submission Tools

    1. Argo, for Kubernetes clusters
      Argo Workflows is an open-source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. It can be used as an orchestration tool on hybrid-cloud and other cluster environments, and it supports both imperative orchestration and declarative automation.
      For building workflows with Argo, see the blog post ARGO-工作流部署与管理工具; a submission looks like the example below.
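      With the argo CLI, a workflow manifest is submitted like this (workflow.yaml is a placeholder for your own manifest):
      # submit the workflow and stream its progress until it finishes
      argo submit --watch workflow.yaml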
    2. WDL, via Cromwell
      WDL is a workflow definition language used mainly in bioinformatics. It relies on the Cromwell engine to submit jobs; on an SGE cluster the backend still ultimately submits via qsub. A configuration file for a single (local) node looks like this:
      # Cromwell HTTP server settings
      include required(classpath("application"))
      webservice {
      #port = 8100
      interface = 0.0.0.0
      binding-timeout = 5s
      instance.name = "reference"
      }
      
      #database {
      #  profile = "slick.jdbc.MySQLProfile$"
      ##  db {
      #    driver = "com.mysql.cj.jdbc.Driver"
      #    url = "jdbc:mysql://host/cromwell?rewriteBatchedStatements=true"
      #    user = "user"
      #    password = "pass"
      #    connectionTimeout = 5000
      #  }
      #}
      
      call-caching {
      enabled = true
      
      # In a multi-user environment this should be false so unauthorized users don't invalidate results for authorized users.
      invalidate-bad-cache-results = true
      
      }
      
      docker {
      hash-lookup {
          # Set this to match your available quota against the Google Container Engine API
          #gcr-api-queries-per-100-seconds = 1000
      
          # Time in minutes before an entry expires from the docker hashes cache and needs to be fetched again
          #cache-entry-ttl = "20 minutes"
      
          # Maximum number of elements to be kept in the cache. If the limit is reached, old elements will be removed from the cache
          #cache-size = 200
      
          # How should docker hashes be looked up. Possible values are "local" and "remote"
          # "local": Lookup hashes on the local docker daemon using the cli
          # "remote": Lookup hashes on docker hub and gcr
          #method = "remote"
      }
      }
      
      backend {
      default = Local
      
      providers {
          Local {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {
                  run-in-background = true
                  runtime-attributes = """
                  String? docker
                  String? docker_user
                  String? sge_mount
                  """
                  submit = "${job_shell} ${script}"
                  submit-docker = """
                  # make sure there is no preexisting Docker CID file
                  rm -f ${docker_cid}
                  # run as in the original configuration without --rm flag (will remove later)
                  docker run \
                      --cidfile ${docker_cid} \
                      -i \
                      ${"--user " + docker_user} \
                      --entrypoint ${job_shell} \
                      -v ${cwd}:${docker_cwd}:delegated \
                      ${sge_mount} \
                      ${docker} ${docker_script}
      
                  # get the return code (working even if the container was detached)
                  rc=$(docker wait `cat ${docker_cid}`)
      
                  # remove the container after waiting
                  docker rm `cat ${docker_cid}`
      
                  # return exit code
                  exit $rc
                  """
              kill-docker = "docker kill `cat ${docker_cid}`"
              root = "cromwell-executions"
              filesystems {
      
                  # For SFS backends, the "local" configuration specifies how files are handled.
                  local {
      
                      # Try to hard link (ln), then soft-link (ln -s), and if both fail, then copy the files.
                      localization: [
                      "hard-link", "soft-link", "copy"
                      ]
      
                      # Call caching strategies
                      caching {
                      # When copying a cached result, what type of file duplication should occur.
                      # For more information check: https://cromwell.readthedocs.io/en/stable/backends/HPC/#shared-filesystem
                      duplication-strategy: [
                          "hard-link", "soft-link", "copy"
                      ]
      
                      # Strategy to determine if a file has been used before.
                      # For extended explanation and alternative strategies check: https://cromwell.readthedocs.io/en/stable/Configuring/#call-caching
                      hashing-strategy: "md5"
      
                      # When true, will check if a sibling file with the same name and the .md5 extension exists, and if it does, use the content of this file as a hash.
                      # If false or the md5 does not exist, will proceed with the above-defined hashing strategy.
                      check-sibling-md5: false
                      }
                  }
                  }
      
                  # The defaults for runtime attributes if not provided.
                  default-runtime-attributes {
                  failOnStderr: false
                  continueOnReturnCode: 0
                  }
              }
          }
      }
      }
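      
      With a configuration like the one above saved as, say, local.conf (the file names here are placeholders), a WDL workflow is run by pointing Cromwell at it:
      # run a workflow using the custom backend configuration
      java -Dconfig.file=local.conf -jar cromwell.jar run workflow.wdl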
      
      
    3. Nextflow
      Pipelines are predefined: you can download a template and run it directly, but the degree of customization is limited, and special needs may require running commands manually. For details see 生信流程大全-基于nextflow的nf-core; a typical launch is shown below.
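      For instance, an nf-core pipeline can usually be launched in one line (the pipeline name and profiles are illustrative):
      # fetch and run the nf-core RNA-seq pipeline with its bundled test profile
      nextflow run nf-core/rnaseq -profile test,docker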
    4. Snakemake
      Somewhat similar to WDL: the pipeline DAG is defined through the inputs and outputs of its modules (rules). On SGE it can likewise hand each rule to qsub, as sketched below.
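      A sketch of cluster execution (the qsub options are illustrative, and newer Snakemake releases replace --cluster with executor plugins):
      # run up to 20 concurrent jobs, submitting each rule via qsub
      snakemake --jobs 20 --cluster "qsub -cwd -V"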
