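How do you debug a failing pod in Kubernetes? Start by listing every pod that is not in the Running state:
kubectl get pods --all-namespaces | grep -iv Running
The most common error status is CrashLoopBackOff. It means a container in the pod starts, crashes, is restarted by Kubernetes, and then crashes again, with Kubernetes waiting longer before each new attempt.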
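The grep pipeline also hides any pod whose name happens to contain "running", and it still lists pods that finished normally with status Completed. A minimal sketch of a stricter filter follows. The helper name unhealthy_pods is made up for illustration, and it assumes the six-column table (NAMESPACE NAME READY STATUS RESTARTS AGE) that kubectl prints with --all-namespaces.

```shell
#!/bin/sh
# Print pods whose STATUS column is neither Running nor Completed.
# Reads the table printed by `kubectl get pods --all-namespaces` on stdin;
# with that flag STATUS is the fourth whitespace-separated field.
unhealthy_pods() {
  awk 'NR == 1 { next }   # skip the header row
       $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods --all-namespaces | unhealthy_pods
```

Matching on the STATUS field alone avoids false matches on pod names, at the cost of depending on the column layout.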
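Possible causes of a pod crash
- The image cannot be pulled: the image name is wrong, the image is missing, or the pull secret is wrong or missing;
- The application fails at runtime, for example because an environment variable, ConfigMap, or Secret is missing;
- The liveness probe fails;
- Resource consumption (memory, CPU) is too high, or the resource limits are too strict;
- A PersistentVolume (PV) was not created or could not be mounted;
- The container image was not updated.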
In most cases, kubectl logs ... or kubectl describe ... with the appropriate arguments gives you some information about the failure. Run kubectl logs --help to see how each option is used.
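As a rough sketch of how these two commands are typically combined, the helper below runs describe first and then fetches the logs of both the current and the previous container instance. The name inspect_pod and the 50-line tail are arbitrary choices for illustration; the subcommands and the -n, --tail and --previous flags are standard kubectl options.

```shell
#!/bin/sh
# First-look diagnostics for one pod.
# $1 = pod name, $2 = namespace (defaults to "default").
inspect_pod() {
  pod=$1
  ns=${2:-default}
  # Events, container states and the last termination reason.
  kubectl describe pod "$pod" -n "$ns"
  # Output of the current container instance.
  kubectl logs "$pod" -n "$ns" --tail=50
  # Output of the previous (crashed) instance; this fails if there is none,
  # so the error is ignored.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
}

# Usage (not executed here):
#   inspect_pod termination-demo default
```

The --previous log is often the useful one for a pod in CrashLoopBackOff, because the current instance may not have written anything yet.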
Note: even when a pod is in the Running state, a high count in the RESTARTS column suggests the pod may have an underlying problem.
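One way to spot such pods is to print the per-container restart counts and filter on a threshold. The sketch below is an assumption-laden example: the helper name frequent_restarters and the default threshold of 3 are invented here, and the input is expected to come from the custom-columns query shown in the usage comment.

```shell
#!/bin/sh
# Print pods in which some container restarted at least $1 times (default 3).
# stdin: one line per pod, "namespace name counts", where counts is a
# comma-separated list with one restart count per container.
frequent_restarters() {
  awk -v min="${1:-3}" '{
    n = split($3, counts, ",")
    max = 0
    # Non-numeric entries such as "<none>" evaluate to 0.
    for (i = 1; i <= n; i++) if (counts[i] + 0 > max) max = counts[i] + 0
    if (max >= min) print $1 "/" $2 " restarts=" max
  }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount' \
#     | frequent_restarters 5
```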
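A wrong image name prevents the Pod from running
Run kubectl describe pod <your-pod> -n <your-namespace> to get more information.
The Events section shows the error message Failed to pull image... together with Reason: Failed. The Pod status is ErrImagePull at first and becomes ImagePullBackOff once Kubernetes starts backing off between pull attempts.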
Create a Pod whose image name is misspelled (debiann instead of debian):
apiVersion: v1
kind: Pod
metadata:
  name: termination-demo
spec:
  containers:
  - name: termination-demo-container
    image: debiann
    command: ["/bin/sh"]
    args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]
# kubectl get pods
NAME               READY   STATUS         RESTARTS   AGE
termination-demo   0/1     ErrImagePull   0          4s
# kubectl describe pods termination-demo
...
Events:
  Type     Reason     Age                From                     Message
  ----     ------     ----               ----                     -------
  Normal   Scheduled  72s                default-scheduler        Successfully assigned default/termination-demo to 172.16.219.186
  Normal   Pulling    31s (x3 over 71s)  kubelet, 172.16.219.186  pulling image "debiann"
  Warning  Failed     30s (x3 over 70s)  kubelet, 172.16.219.186  Failed to pull image "debiann": rpc error: code = Unknown desc = Error response from daemon: pull access denied for debiann, repository does not exist or may require 'docker login'
  Warning  Failed     30s (x3 over 70s)  kubelet, 172.16.219.186  Error: ErrImagePull
  Normal   BackOff    6s (x4 over 69s)   kubelet, 172.16.219.186  Back-off pulling image "debiann"
  Warning  Failed     6s (x4 over 69s)   kubelet, 172.16.219.186  Error: ImagePullBackOff
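Once the events point to a bad image name, one option is to correct the image on the existing Pod; another is to delete the Pod and recreate it from a fixed manifest. The sketch below covers the first route and also shows how to list only the Warning events for a pod. The helper names pod_warnings and set_pod_image are hypothetical; the field selectors and kubectl set image are standard.

```shell
#!/bin/sh
# List only the Warning events recorded for one pod.
# $1 = pod name, $2 = namespace (defaults to "default").
pod_warnings() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1,type=Warning"
}

# Point one container of an existing pod at a corrected image.
# $1 = pod, $2 = container name, $3 = image, $4 = namespace (optional).
set_pod_image() {
  kubectl set image "pod/$1" "$2=$3" -n "${4:-default}"
}

# Usage for the example above (not executed here):
#   pod_warnings termination-demo
#   set_pod_image termination-demo termination-demo-container debian
```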
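A missing ConfigMap or Secret
Create a Deployment whose container reads an environment variable, MYFILE, that nothing defines: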
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
# kubectl get pods
NAME                                READY   STATUS             RESTARTS   AGE
termination-demo-6654b86785-vf9bx   0/1     CrashLoopBackOff   2          41s
# kubectl describe pods termination-demo-6654b86785-vf9bx
......
Events:
  Type     Reason     Age                From                     Message
  ----     ------     ----               ----                     -------
  Normal   Scheduled  69s                default-scheduler        Successfully assigned default/termination-demo-6654b86785-vf9bx to 172.16.219.186
  Normal   Pulling    16s (x4 over 68s)  kubelet, 172.16.219.186  pulling image "debian"
  Normal   Pulled     15s (x4 over 63s)  kubelet, 172.16.219.186  Successfully pulled image "debian"
  Normal   Created    14s (x4 over 62s)  kubelet, 172.16.219.186  Created container
  Normal   Started    14s (x4 over 62s)  kubelet, 172.16.219.186  Started container
  Warning  BackOff    1s (x8 over 59s)   kubelet, 172.16.219.186  Back-off restarting failed container
# kubectl logs termination-demo-6654b86785-vf9bx
/bin/sh: 1: cannot open : No such file
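The message is terse because MYFILE is unset: the redirection target expands to an empty string, and the shell tries to open a file with no name. The sketch below reproduces this locally, with no cluster involved, and shows one possible guard based on the POSIX ${VAR:?message} expansion, which makes the shell name the missing variable before sed runs. The function names are illustrative, and the exact error wording depends on which shell /bin/sh is on your machine.

```shell
#!/bin/sh
# Run the container's command with MYFILE unset, as in the failing Pod.
# The redirection fails, so the command exits with a non-zero status.
without_guard() {
  (unset MYFILE; sh -c 'sed "s/foo/bar/" < $MYFILE')
}

# Same command, but abort first with an explicit message if MYFILE is
# unset or empty.
with_guard() {
  (unset MYFILE; sh -c ': "${MYFILE:?is not set; define it in a ConfigMap}"; sed "s/foo/bar/" < "$MYFILE"')
}

# Usage (not executed here):
#   without_guard   # fails with a message that never mentions MYFILE
#   with_guard      # fails with a message that names MYFILE
```

With a guard of this kind in the container command, kubectl logs would show the name of the missing variable instead of an empty filename.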
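Nothing in the log names the real cause: the Pod is missing a ConfigMap that should provide MYFILE. Create the ConfigMap and load it into the container with envFrom: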
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        envFrom:
        - configMapRef:
            name: app-env
# kubectl apply -f configmap.yaml
configmap/app-env created
deployment.apps/termination-demo configured
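After the ConfigMap is added, the Pod status is still CrashLoopBackOff. The container exits as soon as sed finishes, so this is not a long-running service, and Kubernetes keeps restarting it. To keep the Pod running, replace the command with a script that loops forever: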
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
  SLEEP: "5"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        # args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        args: ["-c", "while true; do sleep $SLEEP; echo sleeping; done;"]
        envFrom:
        - configMapRef:
            name: app-env
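To tell a container that crashed from one that simply finished its work, it can help to read the last termination state recorded in the pod status. The following is a minimal sketch; the helper name last_exit is hypothetical, and it only inspects the first container in the pod.

```shell
#!/bin/sh
# Print the reason and exit code of the last terminated instance of the
# first container in a pod.
# $1 = pod name, $2 = namespace (defaults to "default").
last_exit() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" exit="}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Usage (not executed here):
#   last_exit termination-demo-6654b86785-vf9bx
```

For the sed example, a clean exit should show up as an exit code of 0, which confirms that the restarts come from the command finishing rather than from an error.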
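Resource limits
When you define a pod you can specify the resources, such as memory and CPU, that the application may use. If you set nothing, the defaults are CPU: 0m (in milli-CPU) and RAM: 0Gi, which means nothing is reserved for the container and the only ceiling is what the node itself can provide.
When your application needs more, Kubernetes works between two settings, requests and limits. A request is the amount of a resource guaranteed to the container; a limit is the maximum the container is allowed to use. The relationship is 0 <= requests <= limits. For both settings you have to take into account the total resources the available nodes can provide.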
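CPU quantities can be written either in whole or fractional cores ("6", "0.5") or in millicores ("600m"), which makes the requests <= limits relationship easy to misread. The sketch below normalises both forms to millicores and checks the ordering. The helper names are invented for illustration and the script does not talk to a cluster.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity ("600m", "6", "0.5") to millicores.
to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;                                          # already millicores
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;; # cores to millicores
  esac
}

# Succeed when the request does not exceed the limit.
# $1 = CPU request, $2 = CPU limit.
request_fits_limit() {
  [ "$(to_millicores "$1")" -le "$(to_millicores "$2")" ]
}

# Usage:
#   to_millicores 0.5          # prints 500
#   request_fits_limit 600m 1  # succeeds: 600 millicores <= 1000 millicores
```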
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]
        resources:
          requests:
            # 6 full cores: more CPU than either node can offer, so the Pod cannot be scheduled
            cpu: "6"
$ kubectl describe po termination-demo-fdb7bb7d9-mzvfw
Name:           termination-demo-fdb7bb7d9-mzvfw
Namespace:      default
...
Containers:
  termination-demo-container:
    Image:      debian
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      sleep 10 && echo Sleep expired > /dev/termination-log
    Requests:
      cpu:        6
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-t549m (ro)
Conditions:
  Type           Status
  PodScheduled   False
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  9s (x7 over 40s)  default-scheduler  0/2 nodes are available: 2 Insufficient cpu.
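When the scheduler reports Insufficient cpu, the next step is usually to compare the request with what the nodes can actually offer. The sketch below shows two ways to look at that. The helper names are hypothetical, and the 8 lines of context passed to grep are an arbitrary guess at the size of the "Allocated resources" section.

```shell
#!/bin/sh
# Show how much CPU and memory each node can allocate to pods in total.
node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}

# Show how much of that is already requested by the pods on each node.
node_requests() {
  kubectl describe nodes | grep -A 8 'Allocated resources'
}

# Usage (not executed here):
#   node_allocatable
#   node_requests
```

A pod stays unschedulable as long as its request is larger than the allocatable amount minus the existing requests on every node.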
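The image was not updated
Suppose you add a fix to your application, rebuild the image, and push it to the registry, but after you redeploy, the container still does not reach Running. The likely explanation lies in the image pull policy configured in Kubernetes.
If you did not change the image tag (and the tag is not latest), the default policy, IfNotPresent, tells Kubernetes to use the image already cached on the node.
The best practice is to avoid the latest tag and to give the image a new tag whenever anything in it changes.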
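A possible way to apply this practice is to roll the Deployment to an explicitly tagged image and wait for the rollout to finish. In the sketch below the helper name roll_out_image, the 120-second timeout, and the registry path in the usage comment are assumptions made for illustration.

```shell
#!/bin/sh
# Switch one container of a Deployment to a new image tag and wait for the
# rollout to complete.
# $1 = deployment, $2 = container name, $3 = image:tag, $4 = namespace (optional).
roll_out_image() {
  deploy=$1
  container=$2
  image=$3
  ns=${4:-default}
  kubectl set image "deployment/$deploy" "$container=$image" -n "$ns" &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}

# Usage with a hypothetical registry and tag (not executed here):
#   roll_out_image termination-demo termination-demo-container registry.example.com/app:1.0.1
```

Because the tag changes, the new image is not in the node cache under that name, so it is pulled even with the IfNotPresent policy.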