[Solved] k8s CronJob.spec.failedJobsHistoryLimit not taking effect


Author: 王小奕 | Published 2020-11-03 10:35

    Tags

    kubernetes, CronJob, pod

    Background

    As the YAML below shows, .spec.failedJobsHistoryLimit is clearly set to 1, yet seven Pods in the Error state were left behind:

    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: mycronjob
      namespace: prod
      labels:
        task: processor
    spec:
      failedJobsHistoryLimit: 1
      successfulJobsHistoryLimit: 3
    ……
    
    kubectl get pod -n prod -l task=processor
    NAME                      READY   STATUS   RESTARTS   AGE
    mycronjob-16043364027mpp   0/1     Error    0          9h
    mycronjob-16043364098q8q   0/1     Error    0          9h
    mycronjob-160433640hc2ch   0/1     Error    0          9h
    mycronjob-160433640nrdqb   0/1     Error    0          9h
    mycronjob-160433640r49cq   0/1     Error    0          8h
    mycronjob-160433640tnfvw   0/1     Error    0          9h
    mycronjob-160433640vhdsc   0/1     Error    0          9h
    

    So here is the question: why does CronJob.spec.successfulJobsHistoryLimit take effect while CronJob.spec.failedJobsHistoryLimit apparently does not?

    Analysis

    Before we can understand this problem, we first need to be clear about what a CronJob actually does.
    The official description:

    A CronJob creates Jobs on a repeating schedule.

    One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.

    From this definition it is clear that a CronJob manages Jobs, and it is the Job that actually creates Pods. So to find out why CronJob.spec.failedJobsHistoryLimit seems ineffective, we need to look at the configuration of the Jobs the CronJob creates on schedule.
    Run:

    kubectl get job -n prod -l task=processor -o yaml
    

    This returns:

    apiVersion: v1
    items:
    - apiVersion: batch/v1
      kind: Job
      metadata:
        labels:
          task: processor
        name: processor-1604336400
        namespace: prod
        ownerReferences:
        - apiVersion: batch/v1beta1
          blockOwnerDeletion: true
          controller: true
          kind: CronJob
          name: processor
      spec:
        backoffLimit: 6
        completions: 1
        parallelism: 1
      status:
        conditions:
        - message: Job has reached the specified backoff limit
          reason: BackoffLimitExceeded
          type: Failed
    

    Note the spec.backoffLimit field. The official explanation is:

    There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.

    In other words, while a Job is running, if a Pod it created fails, the Job controller will by default recreate the Pod up to 6 times before marking the Job as failed. If we do not want that many retries, we can change .spec.backoffLimit.
    By now it should be clear where the "extra" Error Pods come from.
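    To cap the failed Pods at the source, set backoffLimit inside the CronJob's jobTemplate. A minimal sketch based on the CronJob above (backoffLimit: 0 is an illustrative choice, not from the original manifest: it means no retries, so each failed Job leaves at most one Error Pod):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 0   # no retries: at most one failed Pod per failed Job
      template:
        # ... pod template unchanged ...
```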

    Summary

    The CronJob creates Jobs, and per our configuration it retains a history of at most 1 failed and 3 successful Jobs. But when a Job counts as failed is governed by Job.spec.backoffLimit: until that limit is reached, the Job controller keeps recreating failed Pods. So CronJob.spec.failedJobsHistoryLimit only limits the number of retained Jobs (which you can check with kubectl get job -n prod -l task=processor); to limit the final number of failed Pods, you also have to control Job.spec.backoffLimit.
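    The bound on retained Error Pods follows from two numbers: failedJobsHistoryLimit caps how many failed Jobs are kept, and each failed Job leaves up to backoffLimit + 1 failed Pods (the first attempt plus backoffLimit retries, which matches the 7 Error Pods observed above under the default backoffLimit of 6). A small sketch of this arithmetic (the function name is ours, not part of any Kubernetes API):

```python
def max_error_pods(failed_jobs_history_limit: int, backoff_limit: int) -> int:
    """Upper bound on Error Pods a CronJob retains, assuming each failed
    Job keeps one Pod per attempt: the initial run plus backoff_limit
    retries."""
    pods_per_failed_job = backoff_limit + 1
    return failed_jobs_history_limit * pods_per_failed_job

# The case observed in this article: 1 retained failed Job, default
# backoffLimit of 6 -> 7 Error Pods.
print(max_error_pods(1, 6))  # prints 7
```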

    References

    Running Automated Tasks with a CronJob
    Jobs
    Pod Lifecycle

    Food for thought

    If CronJob.spec.failedJobsHistoryLimit is set to 2 and Job.spec.backoffLimit to 5, how many Pods in the Error state will be retained at most?
