TensorFlow on K8s cannot find libcuda.so.1


Author: 王勇1024 | Published: 2019-11-07 17:13

    Today, while deploying a TensorFlow training job to K8s GPU machines, I found that some instances failed to start and reported the following error:

    ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
    
    Failed to load the native TensorFlow runtime.
    

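    The log shows that the libcuda.so.1 library is missing.
    Searching for the library showed that every healthy instance contains the three files below, while none of the failing instances contain any of them.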

    [root@w1-6d975b49c5-sd5zn tf_nn_model]# find /usr -name "libcuda*"
    /usr/lib64/libcuda.so.418.43
    /usr/lib64/libcuda.so
    /usr/lib64/libcuda.so.1
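
    The same check can be run from Python inside a container, which is handy when comparing many instances. The following is only a minimal sketch: the helper names find_libs and can_load are hypothetical (not from TensorFlow or any NVIDIA tool), and find_libs only roughly mirrors the find command above, since it matches file names only.

```python
import ctypes
import fnmatch
import os


def find_libs(root="/usr", pattern="libcuda*"):
    """Return sorted paths of files under root whose name matches pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def can_load(libname="libcuda.so.1"):
    """Return True if the dynamic loader can open libname, else False."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False
```

    Calling find_libs() and can_load() in a failing instance would be expected to return an empty list and False, matching the ImportError above.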
    

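    My K8s environment exposes GPUs through k8s-device-plugin, so I suspected that k8s-device-plugin had failed to start properly and that this was why the library could not be found.
    I therefore checked the k8s-device-plugin startup logs in K8s.
    This is the log from a normal startup: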

    2019/11/07 08:23:44 Loading NVML
    2019/11/07 08:23:46 Fetching devices.
    2019/11/07 08:23:46 Starting FS watcher.
    2019/11/07 08:23:46 Starting OS watcher.
    2019/11/07 08:23:46 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2019/11/07 08:23:46 Registered device plugin with Kubelet
    

    This is the log from a failed startup:

    2019/11/07 08:13:09 Loading NVML
    2019/11/07 08:13:09 Failed to initialize NVML: could not load NVML library.
    2019/11/07 08:13:09 If this is a GPU node, did you set the docker default runtime to `nvidia`?
    2019/11/07 08:13:09 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    2019/11/07 08:13:09 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
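
    When there are many GPU nodes, the two log patterns above can be told apart automatically. The sketch below is an assumption-laden illustration: plugin_status is a hypothetical helper, and it relies only on the exact message strings that appear in the two logs shown here, which may differ in other k8s-device-plugin versions.

```python
def plugin_status(log_text):
    """Classify a k8s-device-plugin startup log by the messages seen above.

    Returns "nvml_failed", "registered", or "unknown".
    """
    lines = log_text.splitlines()
    # Check for the failure message first: it is the decisive signal.
    if any("Failed to initialize NVML" in line for line in lines):
        return "nvml_failed"
    if any("Registered device plugin with Kubelet" in line for line in lines):
        return "registered"
    return "unknown"
```

    The log text itself would have to be collected separately, for example by saving the plugin pod's log output to a file.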
    

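    The error log suggests that the plugin failed to start because nvidia had not been set as the default Docker runtime.
    Solution: edit /etc/docker/daemon.json to add "default-runtime": "nvidia" as in the following configuration, then have the Docker daemon pick up the change (for example, by restarting it):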

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
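
    If daemon.json already contains other settings, it is safer to merge the new keys than to overwrite the file. The following is a minimal sketch of that merge; with_nvidia_default is a hypothetical helper, and the runtime entry it adds is simply the one from the configuration above.

```python
import copy


def with_nvidia_default(config):
    """Return a copy of a daemon.json dict with nvidia as the default runtime.

    Existing keys are preserved; an existing "nvidia" runtime entry is kept.
    """
    updated = copy.deepcopy(config)
    updated["default-runtime"] = "nvidia"
    runtimes = updated.setdefault("runtimes", {})
    # Only add the runtime definition if the file does not have one yet.
    runtimes.setdefault(
        "nvidia",
        {"path": "nvidia-container-runtime", "runtimeArgs": []},
    )
    return updated
```

    A typical use would be to load /etc/docker/daemon.json with json.load, pass the result through with_nvidia_default, and write it back with json.dump(..., indent=4), keeping a backup of the original file first.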
    
