Today, while trying to deploy TensorFlow training jobs onto K8s GPU machines, I found that some instances failed to start, reporting the following error:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
The log message shows that the libcuda.so.1 library is missing.
Searching for the library with find, I saw that every healthy instance had the following three files, while none of the broken instances had any of them:
[root@w1-6d975b49c5-sd5zn tf_nn_model]# find /usr -name "libcuda*"
/usr/lib64/libcuda.so.418.43
/usr/lib64/libcuda.so
/usr/lib64/libcuda.so.1
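Besides find, a quick sanity check is to ask the dynamic linker whether it can actually resolve libcuda inside the container (just a supplementary check, not from the original troubleshooting steps):
# On a healthy instance this lists libcuda.so.1 and its path;
# on a broken instance it prints nothing.
ldconfig -p | grep libcuda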
My K8s cluster exposes GPUs through the k8s-device-plugin, so my guess was that the library could not be found because the k8s-device-plugin had failed to start properly.
So I went to check the k8s-device-plugin startup logs in K8s.
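One way to pull those logs is with kubectl; the namespace and pod name below are only examples, adjust them to wherever the plugin is deployed in your cluster:
# List the device-plugin pods (commonly a DaemonSet in kube-system)
kubectl get pods -n kube-system | grep nvidia-device-plugin
# Show the startup log of the pod on the affected node
kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx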
Here is the log from a normal startup:
2019/11/07 08:23:44 Loading NVML
2019/11/07 08:23:46 Fetching devices.
2019/11/07 08:23:46 Starting FS watcher.
2019/11/07 08:23:46 Starting OS watcher.
2019/11/07 08:23:46 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2019/11/07 08:23:46 Registered device plugin with Kubelet
And here is the log from a failed startup:
2019/11/07 08:13:09 Loading NVML
2019/11/07 08:13:09 Failed to initialize NVML: could not load NVML library.
2019/11/07 08:13:09 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2019/11/07 08:13:09 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/11/07 08:13:09 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
From the error message we can infer that the plugin failed to start because nvidia was not set as the default Docker runtime.
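You can confirm this on the affected node by checking what Docker currently reports as its default runtime (a minimal check, assuming Docker is the container runtime in use):
# On a misconfigured node this typically prints "Default Runtime: runc"
docker info | grep -i 'default runtime'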
Solution: edit /etc/docker/daemon.json and add "default-runtime": "nvidia":
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
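After changing daemon.json, Docker has to be restarted for the new default runtime to take effect, and the failed device-plugin pod needs to be recreated. A rough verification flow might look like this (the CUDA image tag and pod name are only examples):
# Reload Docker so the new default runtime takes effect
systemctl restart docker
# Verify that a plain container now sees the driver libraries
docker run --rm nvidia/cuda:10.0-base nvidia-smi
# Delete the failed device-plugin pod so the DaemonSet recreates it
kubectl delete pod -n kube-system nvidia-device-plugin-daemonset-xxxxx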