[K8s Featured] CKA - How to Troubleshoot Application Failures


Author: 熊本极客 | Published 2022-03-18 11:44

    1. Checking the Application Status

    #Find Pods that are not fully running (READY column shows 0/1)
    $kubectl get pod |grep 0/1
    deployment-flink-jobmanager-7bc59d769-brqzd          0/1     Evicted   0          5m
    
    #Describe a healthy Pod; the key information is in the Events section
    $kubectl describe pod deployment-flink-jobmanager-57b59994f8-4lqw6
    Name:         deployment-flink-jobmanager-57b59994f8-4lqw6
    Namespace:    default
    Priority:     0
    Node:         node-3/192.168.0.248
    Start Time:   Wed, 16 Mar 2022 01:46:11 +0000
    Labels:       app=flink
                  component=jobmanager
                  pod-template-hash=57b59994f8
    Annotations:  metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
    Status:       Running
    IP:           10.244.2.251
    IPs:
      IP:           10.244.2.251
    Controlled By:  ReplicaSet/deployment-flink-jobmanager-57b59994f8
    Containers:
      jobmanager:
        Container ID:  docker://c90f6cc947e20cddd8b72e99411dc58697f319f10653e1c12aa3dfda3a9a518e
        Image:         192.168.0.60:5000/test/flink:2022.0221.1542.00
        Image ID:      docker-pullable://192.168.0.60:5000/test/flink@sha256:2f31389c4b5ac444ed03e174b2a0fe9c5e23469b0fe4dc31149dd29cb87a2c81
        Ports:         8123/TCP, 8124/TCP, 8091/TCP
        Host Ports:    0/TCP, 0/TCP, 0/TCP
        Command:
          /opt/flink/scripts/start.sh
        Args:
          jobmanager
          $(POD_IP)
        State:          Running
          Started:      Wed, 16 Mar 2022 01:46:12 +0000
        Ready:          True
        Restart Count:  0
        Limits:
          cpu:     500m
          memory:  1Gi
        Requests:
          cpu:      500m
          memory:   1Gi
        Liveness:   tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
        Readiness:  tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
        Environment:
          POD_IP:                  (v1:status.podIP)
          POD_NAME:               deployment-flink-jobmanager-57b59994f8-4lqw6 (v1:metadata.name)
          JVM_ARGS:               -Xms1024m -Xmx4096m -XX:MetaspaceSize=256M
        Mounts:
          /opt/flink/conf from flink-config-volume (rw)
          /opt/flink/log from flink-jobmanager-log-dir (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from serviceaccount-test-token-8k8z2 (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             True
      ContainersReady   True
      PodScheduled      True
    Volumes:
      flink-config-volume:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      flink-config
        Optional:  false
      flink-jobmanager-log-dir:
        Type:          HostPath (bare host directory volume)
        Path:          /opt/container/flink/jobmanager/logs
        HostPathType:
      serviceaccount-test-token-8k8z2:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  serviceaccount-test-token-8k8z2
        Optional:    false
    QoS Class:       Guaranteed
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                     node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type    Reason     Age   From               Message
      ----    ------     ----  ----               -------
      Normal  Scheduled  39s   default-scheduler  Successfully assigned default/deployment-flink-jobmanager-57b59994f8-4lqw6 to node-3
      Normal  Pulling    65s   kubelet            Pulling image "192.168.0.60:5000/test/flink:2022.0221.1542.00"
      Normal  Pulled     65s   kubelet            Successfully pulled image "192.168.0.60:5000/test/flink:2022.0221.1542.00" in 113.582103ms
      Normal  Created    65s   kubelet            Created container jobmanager
      Normal  Started    65s   kubelet            Started container jobmanager
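
    Besides grepping the READY column, a field selector lists Pods whose phase is not Running across all namespaces (a minimal sketch; note that Succeeded/Completed Pods also match):

    #List Pods that are not in the Running phase across all namespaces
    $kubectl get pods --all-namespaces --field-selector=status.phase!=Running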
    

    2. Handling the Pending State

    2.1 Viewing Fault Details of a Pending Pod

    #List Pods in the Pending state
    $kubectl get pod |grep Pending
    deployment-flink-jobmanager-7c879b9649-2tmj9           0/1     Pending       0          58s
    
    #Check the Events of the abnormal Pod
    $kubectl describe pod deployment-flink-jobmanager-7c879b9649-2tmj9
    Name:           deployment-flink-jobmanager-7c879b9649-2tmj9
    Namespace:      default
    Priority:       0
    Node:           <none>
    Labels:         app=flink
                    component=jobmanager
                    pod-template-hash=7c879b9649
    Annotations:    metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
    Status:         Pending
    IP:
    IPs:            <none>
    Controlled By:  ReplicaSet/deployment-flink-jobmanager-7c879b9649
    Containers:
      jobmanager:
        Image:       192.168.0.60:5000/test/flink:2022.0221.1542.00
    ...(omitted)...
    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  103s  default-scheduler  0/28 nodes are available: 25 node(s) didn't match Pod's node affinity, 3 Insufficient memory.
      Warning  FailedScheduling  103s  default-scheduler  0/28 nodes are available: 25 node(s) didn't match Pod's node affinity, 3 Insufficient memory.
    

    2.2 Common Causes of the Pending State

    Insufficient resources: the cluster, or the Nodes matching the Pod's label selector, lack CPU or memory. In the example above, the nodes eligible for Pod deployment-flink-jobmanager-7c879b9649-2tmj9 do not have enough memory; reduce the Pod's memory request, add memory to the Nodes, or add new Nodes (and label them accordingly). See the compute resources documentation for how to adjust Pod resources. A quick check is sketched below.
    hostPort in use: if a Pod binds a hostPort, only a limited number of nodes can run it. In most cases hostPort is unnecessary; expose the Pod through a Service object instead.
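
    As a follow-up to the two causes above (a minimal sketch; node-3 is taken from the earlier example), the following commands help confirm whether it is a resource or a labeling problem:

    #Show allocatable resources and current requests/limits on a candidate node
    $kubectl describe node node-3 | grep -A 8 "Allocated resources"

    #Show node labels to verify nodeSelector/affinity matching
    $kubectl get nodes --show-labels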

    3. Handling the ContainerCreating or Waiting State

    3.1 Viewing Fault Details of a Waiting Pod

    #List Pods in the ContainerCreating or Waiting state
    $kubectl get pod |grep ContainerCreating
    deployment-flink-jobmanager-7bc59d769-7xqz7            0/1     ContainerCreating   0          37s
    
    #Check the Events of a Pod in the ContainerCreating or Waiting state
    $kubectl describe pod deployment-flink-jobmanager-7bc59d769-7xqz7
    Name:           deployment-flink-jobmanager-7bc59d769-7xqz7
    Namespace:      default
    Priority:       0
    Node:           node-3/192.168.0.248
    Start Time:     Thu, 17 Mar 2022 07:27:51 +0000
    Labels:         app=flink
                    component=jobmanager
                    pod-template-hash=7bc59d769
    Annotations:    metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
    Status:         Pending
    IP:
    IPs:            <none>
    Controlled By:  ReplicaSet/deployment-flink-jobmanager-7bc59d769
    Containers:
      jobmanager:
        Container ID:
        Image:         192.168.0.60:5000/test/flink:2022.0221.1542.00
        Image ID:
        Ports:         8123/TCP, 8124/TCP, 8091/TCP
        Host Ports:    0/TCP, 0/TCP, 0/TCP
        Command:
          /opt/flink/scripts/start.sh
        Args:
          jobmanager
          $(POD_IP)
        State:          Waiting
          Reason:       ContainerCreating
        Ready:          False
        Restart Count:  0
        Limits:
          cpu:     500m
          memory:  1Gi
        Requests:
          cpu:      500m
          memory:   1Gi
        Liveness:   tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
        Readiness:  tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
    ...(omitted)...
    Events:
      Type     Reason       Age                 From               Message
      ----     ------       ----                ----               -------
      Normal   Scheduled    2m40s               default-scheduler  Successfully assigned default/deployment-flink-jobmanager-7bc59d769-7xqz7 to node-3
      Warning  FailedMount  66s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[flink-config-volume], unattached volumes=[flink-config-volume rtacomposer-volume flink-jobmanager-log-dir serviceaccount-token-8k8z2]: timed out waiting for the condition
      Warning  FailedMount  61s (x9 over 3m8s)  kubelet            MountVolume.SetUp failed for volume "flink-config-volume" : configmap "flink-config" not found
    

    3.2 Common Causes of the ContainerCreating or Waiting State

    Volume mount failure
    For example, mounting a local disk, ConfigMap, or Secret fails; a check for the missing-ConfigMap case is sketched below.
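
    Tying this to the FailedMount event in section 3.1 (a minimal sketch; the ConfigMap name comes from that example and the source path is illustrative), verify that the referenced ConfigMap exists and recreate it if it is missing:

    #Check whether the ConfigMap referenced by flink-config-volume exists
    $kubectl get configmap flink-config -n default

    #If it is missing, recreate it from the original configuration files (path is illustrative)
    $kubectl create configmap flink-config --from-file=/path/to/flink-conf/ -n default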
    Disk full
    Starting a Pod calls the CRI interface to create containers. When creating a container, the container runtime usually creates directories and files for it under its data directory; if the disk holding that data directory is full, creation fails with an error such as:

    Events:
      Type     Reason                  Age                  From                   Message
      ----     ------                  ----                 ----                   -------
      Warning  FailedCreatePodSandBox  2m (x4307 over 16h)  kubelet, 10.179.80.31  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "apigateway-6dc48bf8b6-l8xrw": Error response from daemon: mkdir /var/lib/docker/aufs/mnt/1f09d6c1c9f24e8daaea5bf33a4230de7dbc758e3b22785e8ee21e3e3d921214-init: no space left on device
    

    Pod limit too small or wrong unit
    If the limit is set so low that the sandbox cannot start, the Pod also ends up in this state. A common cause is a wrong memory limit unit, for example mistakenly giving the memory limit a lowercase m suffix the way CPU values are written; for memory this suffix does not mean mebibytes, so the resulting limit is effectively only a few bytes. Use Mi or M instead. A correction sketch follows the error message below.

    to start sandbox container for pod ... Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"signal: killed\"": unknown
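
    If this is the cause, correcting the unit on the workload is enough (a hedged sketch using kubectl set resources; the values are illustrative):

    #Fix the memory limit unit on the Deployment (e.g. 1Gi instead of 1000m)
    $kubectl set resources deployment deployment-flink-jobmanager --limits=cpu=500m,memory=1Gi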
    

    CNI network error
    A CNI network error usually calls for checking the network plugin's configuration and running state. If the plugin is misconfigured or not running, typical symptoms are that the Pod network cannot be set up or a Pod IP cannot be allocated. Some checks are sketched below.
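
    A hedged sketch for investigating CNI problems (the plugin names and the log filter are assumptions; adjust them to the CNI actually in use):

    #Check that the CNI plugin Pods are healthy (names depend on the CNI in use)
    $kubectl get pods -n kube-system -o wide | grep -E 'flannel|calico|cilium'

    #On the affected node, inspect the kubelet logs for CNI errors
    $journalctl -u kubelet | grep -i cni | tail -n 20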

    4. Handling Image Error States

    4.1 Viewing Fault Details of Image Errors

    #List Pods with image errors
    $kubectl get pod |grep 0/1
    deployment-flink-jobmanager-7bc59d769-586rv         0/1     ImagePullBackOff 0          65s
    
    #Check the Events of the Pod with image errors
    $kubectl describe pod deployment-flink-jobmanager-7bc59d769-586rv
    Name:           deployment-flink-jobmanager-7bc59d769-586rv
    Namespace:      default
    Priority:       0
    Node:           node-3/192.168.0.248
    Labels:         app=flink
                    component=jobmanager
                    pod-template-hash=7bc59d769
    Annotations:    metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
    Status:         Pending
    IP:
    IPs:            <none>
    Controlled By:  ReplicaSet/deployment-flink-jobmanager-7bc59d769
    Containers:
      jobmanager:
        Image:       192.168.0.60:5000/test/flink:2022.0221.1542.00_x86
    ...(omitted)...
    Events:
      Type     Reason     Age                 From               Message
      ----     ------     ----                ----               -------
      Normal   Scheduled  72s                 default-scheduler  Successfully assigned default/deployment-flink-jobmanager-7bc59d769-586rv to node-3
      Normal   Pulling    57s (x3 over 100s)  kubelet            Pulling image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86"
      Warning  Failed     57s (x3 over 100s)  kubelet            Failed to pull image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86": rpc error: code = Unknown desc = Error response from daemon: manifest for 192.168.0.60:5000/test/flink:2022.0221.1542.00_x86 not found
      Warning  Failed     57s (x3 over 100s)  kubelet            Error: ErrImagePull
      Normal   BackOff    31s (x4 over 99s)   kubelet            Back-off pulling image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86"
      Warning  Failed     31s (x4 over 99s)   kubelet            Error: ImagePullBackOff
    

    4.2 Common Causes of Image Error States

    Private registry address not added to insecure-registries

    Taking Docker as an example: first log in to the Node where the Pod is scheduled, then edit the daemon.json file with vi /etc/docker/daemon.json, add the insecure-registries field with the local private registry 192.168.0.60:5000, and finally reload dockerd for the change to take effect.

    {
      "registry-mirrors": ["https://r9xxm8z8.mirror.aliyuncs.com","https://registry.docker-cn.com"],
      "insecure-registries":["192.168.0.60:5000"],
      "default-ulimits": {
                    "nofile": {
                            "Name": "nofile",
                            "Hard": 1000000,
                            "Soft": 1000000
                    }
            },
        "log-driver":"json-file",
        "log-opts": {"max-size":"10m", "max-file":"5"}
    }
    
    #Reload dockerd to apply the configuration
    $sudo systemctl enable docker
    $sudo systemctl daemon-reload
    $sudo systemctl restart docker
    

    If the registry is served over HTTPS with a self-signed certificate, the Node needs the CA certificate added
    Place the registry's CA certificate at /etc/docker/certs.d/<address>/ca.crt, for example /etc/docker/certs.d/registry.access.test.com/ca.crt. A minimal file-layout sketch follows.
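
    A minimal sketch of the file layout, using the registry address from this article (the certificate's source location is illustrative):

    #The directory name must match the registry address, including the port
    $sudo mkdir -p /etc/docker/certs.d/192.168.0.60:5000
    $sudo cp ./ca.crt /etc/docker/certs.d/192.168.0.60:5000/ca.crt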

    Private registry authentication failure
    If the registry requires authentication but the Pod has no imagePullSecrets configured, or the configured Secret does not exist or is wrong, authentication fails. See the article on how to generate and use imagePullSecrets in k8s; a minimal sketch follows.
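
    A minimal sketch of creating and wiring an image pull Secret (the Secret name and credentials are illustrative):

    #Create a docker-registry Secret for the private registry
    $kubectl create secret docker-registry my-registry-secret --docker-server=192.168.0.60:5000 --docker-username=<user> --docker-password=<password>

    #Reference it in the Pod spec (spec.imagePullSecrets) of the workload:
    #  imagePullSecrets:
    #  - name: my-registry-secret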

    Corrupted image
    If the image file is corrupted, it cannot be used even after it is pulled; rebuild the image and push it again.

    Image pull timeout
    If too many Pods start on a node at the same time, image downloads can queue up. When Pods earlier in the queue need to download large images, the downloads take a long time and Pods queued behind them report image pull timeouts. The following kubelet flags control whether images are pulled serially and at what rate.

    --serialize-image-pulls    Default: true
    --registry-qps int32       Default: 5
    

    Image does not exist
    Describing the Pod with kubectl describe pod deployment-flink-jobmanager-7bc59d769-586rv shows the event:

    Events:
       ....
         Warning  Failed     57s (x3 over 100s)  kubelet            Failed to pull image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86": rpc error: code = Unknown desc = Error response from daemon: manifest for 192.168.0.60:5000/test/flink:2022.0221.1542.00_x86 not found
    

    5. Handling the Crashing State

    A Pod in the CrashLoopBackOff state did start, but then exited abnormally while running. As long as the Pod's restartPolicy is not Never, it will be restarted, so its restart count is usually greater than 0. You can therefore look at the container process's exit status to narrow down the problem, as sketched below.
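
    A minimal sketch of inspecting the exit status and the logs of the crashed instance (the Pod name is illustrative):

    #Show the last terminated state, exit code and reason of the crashing container
    $kubectl describe pod <pod-name> | grep -A 10 "Last State"

    #Fetch the logs of the previous (crashed) container instance
    $kubectl logs <pod-name> --previous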

    Common Causes of Crashing

    The container process exits on its own
    Container OOM (a check is sketched below)
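
    For the OOM case, a hedged check (the Pod name is illustrative); the termination reason is OOMKilled when the container exceeded its memory limit:

    #Prints OOMKilled if the last container instance was killed for exceeding its memory limit
    $kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'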

    6. Running but Not Working Properly

    If a Pod does not behave as expected, there may well be a mistake in the Pod description (for example mypod.yaml) that was silently ignored when the Pod was created. Typically, incorrect nesting of sections or misspelled field names in the Pod definition cause the affected content to be ignored. For example, if command is misspelled as commnd, the Pod is still created but does not run the command line you expect.

    Validate the deployed YAML with --validate
    First delete the running Pod, then recreate it with --validate, for example kubectl apply --validate -f mypod.yaml. If command was misspelled as commnd, you will see an error message like the following:

    I0805 10:43:25.129850   46757 schema.go:126] unknown field: commnd
    I0805 10:43:25.129973   46757 schema.go:129] this may be a false alarm, see https://github.com/kubernetes/kubernetes/issues/6842
    pods/mypod
    

    Manually compare the local YAML with the one in the cluster
    First export the YAML from the cluster, for example kubectl get pods/mypod -o yaml > mypod-on-k8s.yaml, then compare the two YAML files with a tool such as Beyond Compare, or with plain diff as sketched below.
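
    If a dedicated diff tool is not available, plain diff works as well (a minimal sketch; the file names follow the example above):

    #Export the live object and compare it with the local manifest
    $kubectl get pods/mypod -o yaml > mypod-on-k8s.yaml
    $diff mypod.yaml mypod-on-k8s.yaml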
