SJM: An HPC-SGE Job Submission and Management Tool

Author: 不二小张 | Published 2021-12-13 15:07

    Job Submission in an HPC-SGE Environment

      HPC refers to high-performance computing clusters, as opposed to a single server or node: a distributed architecture spanning nodes or clusters delivers the high compute power, large storage, and dynamic scaling that demanding workloads require.
      Each node exists relatively independently within the cluster, so on a multi-node cluster you need an appropriate submission tool to manage both the jobs and the cluster.
      On an SGE cluster, the usual submission tool is qsub; for details see the blog post 集群任务管理系统SGE的简明教程. A typical invocation is shown below.
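      
      For reference, a minimal qsub submission might look like the following (the queue name and resource values are illustrative, chosen to match the examples later in this post):
      # run job.sh from the current directory on queue test.q,
      # requesting 1 GB of virtual memory and one slot
      qsub -cwd -q test.q -l vf=1G,p=1 job.sh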

    SJM: A Job Orchestration and Status Monitoring Tool for HPC-SGE

    1. SJM overview
      SJM (Simple Job Manager) is developed and maintained by the StanfordBioinformatics team. It is a program for managing a group of related jobs running on a compute cluster, one layer of abstraction above sge-qsub, with the following features:
      • It provides a convenient way to specify dependencies between jobs and the resource requirements of each job (e.g. memory, CPU cores).
      • It monitors the status of the jobs so that you know when the whole group has finished. If any job fails (e.g. because a compute node crashes), SJM lets you resume without rerunning the jobs that already completed successfully.
      • Finally, SJM provides a portable way to submit jobs to different schedulers, such as Sun Grid Engine or Platform LSF.
    2. Installation
      # Note: Boost (http://www.boost.org) must be installed before building
      git clone https://github.com/StanfordBioinformatics/SJM.git
      cd SJM
      ./configure
      make
      sudo make install
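      
      As a quick sanity check after installing (assuming the default /usr/local prefix used by ./configure), confirm the binary is on your PATH:
      # should print the install location, e.g. /usr/local/bin/sjm
      which sjm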
      
    3. Job file attributes
      An example: each job opens with job_begin and closes with job_end; the job name is given by name, the memory request by memory, the run-time limit by time, the command by cmd, and the submission queue by queue:
      job_begin
          name jobA
          queue test.q
          time 1h
          memory 500m
          cmd echo "hello from job jobA"
      job_end
      
      Alternatively, the same job can be written in the following form, passing the scheduler options directly:
      job_begin
          name jobA
          sched_options -cwd -l vf=1G,p=1 -q test.q
          cmd echo "hello from job jobA"
      job_end
      
      A job that runs multiple commands can be written as follows, wrapping them in cmd_begin and cmd_end:
      job_begin
          name jobB
          time 2d
          memory 1G
          queue test.q
          cmd_begin
              /home/lacroute/project/jobB_prolog.sh;
              /home/lacroute/project/jobB.sh;
              /home/lacroute/project/jobB_epilog.sh
          cmd_end
      job_end
      
      Ordering between jobs is expressed with the order attribute, as shown below:
      order jobA before jobB
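      
      Multiple order lines combine into a dependency graph. For example (jobC here is a hypothetical third job), the following makes jobB and jobC both wait for jobA and then run in parallel:
      order jobA before jobB
      order jobA before jobC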
    4. Writing and submitting an SJM job
      Writing the job: suppose a pipeline has two parts, A and B, where B may only run after A has finished successfully. Reusable code:
      job_begin
          name jobA
          time 4h
          memory 3G
          queue standard
          project sequencing
          cmd /home/lacroute/project/jobA.sh
      job_end
      job_begin
          name jobB
          time 2d
          memory 1G
          queue extended
          cmd_begin
              /home/lacroute/project/jobB_prolog.sh;
              /home/lacroute/project/jobB.sh;
              /home/lacroute/project/jobB_epilog.sh
          cmd_end
      job_end
      order jobA before jobB
      log_dir /home/lacroute/project/log
      
      Submitting the job: save the code above in a file named test.job, then submit it as follows:
      export PATH=<directory containing sjm>:$PATH
      # submit in the background
      sjm test.job
      # submit in the foreground (interactive)
      sjm --interactive --log test.job.status.log test.job
      
    5. Monitoring an SJM workflow
      After submission (in background mode), sjm produces three additional files. The *.status file records the state of every job and is updated as the run progresses; each job is in one of the states below. Check the log file output promptly.
      # files
      test.job.status
      test.job.status.bak
      test.job.status.log
      # states
      waiting  pending, not yet submitted
      running  currently executing
      failed   run failed
      done     completed successfully
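      
      When a run fails, jobs already marked done keep that state in the *.status file. My understanding of SJM's recovery mechanism (worth verifying against its documentation) is that resubmitting the status file reruns only the unfinished jobs:
      # resume the workflow, skipping jobs already marked done
      sjm test.job.status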
      
    6. An idea: template-based pipeline tooling on the sjm job abstraction
      We could define a template convention in which individual modules are written as Makefiles and module orchestration is handled by sjm, giving pipeline modules high reusability and availability; a sketch follows.
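      A minimal sketch of the idea (the file align.mk and its align target are hypothetical): each sjm job simply invokes a self-contained Makefile module, so the module can be reused across pipelines while sjm handles ordering and monitoring:
      job_begin
          name align
          memory 4G
          queue test.q
          # the module's inputs, outputs, and tool logic live in the Makefile
          cmd make -f /path/to/modules/align.mk align
      job_end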

    Other HPC Workflow Submission Tools

    1. Argo, for Kubernetes clusters
      Argo Workflows is an open-source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. It can be used as an orchestration tool on hybrid-cloud and other cluster environments, and it supports both imperative orchestration and declarative automation.
      For building workflows with Argo, see the blog post ARGO-工作流部署与管理工具; a submission looks like the example below.
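      With the argo CLI, a workflow manifest is submitted like this (workflow.yaml is a placeholder for your own manifest):
      # submit the workflow and stream its progress until it finishes
      argo submit --watch workflow.yaml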
    2. WDL, via Cromwell
      WDL is a workflow definition language used mainly in bioinformatics. It relies on the Cromwell engine to submit jobs; on an SGE cluster the backend still ultimately submits via qsub. A configuration file for a single (local) node looks like this:
      # Cromwell HTTP server settings
      include required(classpath("application"))
      webservice {
      #port = 8100
      interface = 0.0.0.0
      binding-timeout = 5s
      instance.name = "reference"
      }
      
      #database {
      #  profile = "slick.jdbc.MySQLProfile$"
      ##  db {
      #    driver = "com.mysql.cj.jdbc.Driver"
      #    url = "jdbc:mysql://host/cromwell?rewriteBatchedStatements=true"
      #    user = "user"
      #    password = "pass"
      #    connectionTimeout = 5000
      #  }
      #}
      
      call-caching {
      enabled = true
      
      # In a multi-user environment this should be false so unauthorized users don't invalidate results for authorized users.
      invalidate-bad-cache-results = true
      
      }
      
      docker {
      hash-lookup {
          # Set this to match your available quota against the Google Container Engine API
          #gcr-api-queries-per-100-seconds = 1000
      
          # Time in minutes before an entry expires from the docker hashes cache and needs to be fetched again
          #cache-entry-ttl = "20 minutes"
      
          # Maximum number of elements to be kept in the cache. If the limit is reached, old elements will be removed from the cache
          #cache-size = 200
      
          # How should docker hashes be looked up. Possible values are "local" and "remote"
          # "local": Lookup hashes on the local docker daemon using the cli
          # "remote": Lookup hashes on docker hub and gcr
          #method = "remote"
      }
      }
      
      backend {
      default = Local
      
      providers {
          Local {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {
                  run-in-background = true
                  runtime-attributes = """
                  String? docker
                  String? docker_user
                  String? sge_mount
                  """
                  submit = "${job_shell} ${script}"
                  submit-docker = """
                  # make sure there is no preexisting Docker CID file
                  rm -f ${docker_cid}
                  # run as in the original configuration without --rm flag (will remove later)
                  docker run \
                      --cidfile ${docker_cid} \
                      -i \
                      ${"--user " + docker_user} \
                      --entrypoint ${job_shell} \
                      -v ${cwd}:${docker_cwd}:delegated \
                      ${sge_mount} \
                      ${docker} ${docker_script}
      
                  # get the return code (working even if the container was detached)
                  rc=$(docker wait `cat ${docker_cid}`)
      
                  # remove the container after waiting
                  docker rm `cat ${docker_cid}`
      
                  # return exit code
                  exit $rc
                  """
              kill-docker = "docker kill `cat ${docker_cid}`"
              root = "cromwell-executions"
              filesystems {
      
                  # For SFS backends, the "local" configuration specifies how files are handled.
                  local {
      
                      # Try to hard link (ln), then soft-link (ln -s), and if both fail, then copy the files.
                      localization: [
                      "hard-link", "soft-link", "copy"
                      ]
      
                      # Call caching strategies
                      caching {
                      # When copying a cached result, what type of file duplication should occur.
                      # For more information check: https://cromwell.readthedocs.io/en/stable/backends/HPC/#shared-filesystem
                      duplication-strategy: [
                          "hard-link", "soft-link", "copy"
                      ]
      
                      # Strategy to determine if a file has been used before.
                      # For extended explanation and alternative strategies check: https://cromwell.readthedocs.io/en/stable/Configuring/#call-caching
                      hashing-strategy: "md5"
      
                      # When true, will check if a sibling file with the same name and the .md5 extension exists, and if it does, use the content of this file as a hash.
                      # If false or the md5 does not exist, will proceed with the above-defined hashing strategy.
                      check-sibling-md5: false
                      }
                  }
                  }
      
                  # The defaults for runtime attributes if not provided.
                  default-runtime-attributes {
                  failOnStderr: false
                  continueOnReturnCode: 0
                  }
              }
          }
      }
      }
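      
      With a configuration like the one above saved as, say, local.conf (the file names here are placeholders), a WDL workflow is run by pointing Cromwell at it:
      # run a workflow using the custom backend configuration
      java -Dconfig.file=local.conf -jar cromwell.jar run workflow.wdl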
      
      
    3. Nextflow
      Pipelines are predefined: you can download a template and run it directly, but the degree of customization is limited, and special needs may require running commands manually. For details see 生信流程大全-基于nextflow的nf-core; a typical launch is shown below.
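      For instance, an nf-core pipeline can usually be launched in one line (the pipeline name and profiles are illustrative):
      # fetch and run the nf-core RNA-seq pipeline with its bundled test profile
      nextflow run nf-core/rnaseq -profile test,docker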
    4. Snakemake
      Somewhat similar to WDL: the pipeline DAG is defined through the inputs and outputs of its modules (rules). On SGE it can likewise hand each rule to qsub, as sketched below.
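      A sketch of cluster execution (the qsub options are illustrative, and newer Snakemake releases replace --cluster with executor plugins):
      # run up to 20 concurrent jobs, submitting each rule via qsub
      snakemake --jobs 20 --cluster "qsub -cwd -V"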
