Today, while trying to deploy TensorFlow training jobs onto K8s GPU machines, I found that some instances failed to start, reporting the following error:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
The log message shows that the libcuda.so.1 library is missing.
Searching for the library with find, I saw that every healthy instance had the following three files, while none of the broken instances had any of them:
[root@w1-6d975b49c5-sd5zn tf_nn_model]# find /usr -name "libcuda*"
/usr/lib64/libcuda.so.418.43
/usr/lib64/libcuda.so
/usr/lib64/libcuda.so.1
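Besides find, a quick sanity check is to ask the dynamic linker whether it can actually resolve libcuda inside the container (just a supplementary check, not from the original troubleshooting steps):
# On a healthy instance this lists libcuda.so.1 and its path;
# on a broken instance it prints nothing.
ldconfig -p | grep libcuda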
My K8s cluster exposes GPUs through the k8s-device-plugin, so my guess was that the library could not be found because the k8s-device-plugin had failed to start properly.
So I went to check the k8s-device-plugin startup logs in K8s.
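One way to pull those logs is with kubectl; the namespace and pod name below are only examples, adjust them to wherever the plugin is deployed in your cluster:
# List the device-plugin pods (commonly a DaemonSet in kube-system)
kubectl get pods -n kube-system | grep nvidia-device-plugin
# Show the startup log of the pod on the affected node
kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx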
Here is the log from a normal startup:
2019/11/07 08:23:44 Loading NVML
2019/11/07 08:23:46 Fetching devices.
2019/11/07 08:23:46 Starting FS watcher.
2019/11/07 08:23:46 Starting OS watcher.
2019/11/07 08:23:46 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2019/11/07 08:23:46 Registered device plugin with Kubelet
And here is the log from a failed startup:
2019/11/07 08:13:09 Loading NVML
2019/11/07 08:13:09 Failed to initialize NVML: could not load NVML library.
2019/11/07 08:13:09 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2019/11/07 08:13:09 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/11/07 08:13:09 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
From the error message we can infer that the plugin failed to start because nvidia was not set as the default Docker runtime.
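You can confirm this on the affected node by checking what Docker currently reports as its default runtime (a minimal check, assuming Docker is the container runtime in use):
# On a misconfigured node this typically prints "Default Runtime: runc"
docker info | grep -i 'default runtime'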
Solution: edit /etc/docker/daemon.json and add "default-runtime": "nvidia":
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
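After changing daemon.json, Docker has to be restarted for the new default runtime to take effect, and the failed device-plugin pod needs to be recreated. A rough verification flow might look like this (the CUDA image tag and pod name are only examples):
# Reload Docker so the new default runtime takes effect
systemctl restart docker
# Verify that a plain container now sees the driver libraries
docker run --rm nvidia/cuda:10.0-base nvidia-smi
# Delete the failed device-plugin pod so the DaemonSet recreates it
kubectl delete pod -n kube-system nvidia-device-plugin-daemonset-xxxxx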