TensorFlow GPU and Building from Source

Author: SpikeKing | Published 2018-06-30 17:00

    In deep learning, a server's GPUs can dramatically speed up algorithm execution. Different TensorFlow releases are built against different default GPU library versions, which can make a release incompatible with a given server; in that case, TensorFlow must be recompiled from source to match the server's GPU environment.

    Welcome to follow me on GitHub: https://github.com/SpikeKing

    GPU

    Checking the GPU

    Inspect the server's GPU environment so that the correct versions can be chosen during compilation. CUDA is NVIDIA's parallel computing platform and programming model for GPUs; most GPU runtimes depend on CUDA support.

    Export the CUDA environment variables; the exact CUDA version installed can be found under /usr/local.

    export PATH=/usr/local/cuda-8.0/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
    

    Check the CUDA version with the nvcc command; here the CUDA version is 8.0.61:

    nvcc  --version
    
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2016 NVIDIA Corporation
    Built on Tue_Jan_10_13:22:03_CST_2017
    Cuda compilation tools, release 8.0, V8.0.61
    

    Alternatively, read the version file; again, the CUDA version is 8.0.61:

    cat /usr/local/cuda/version.txt
    
    CUDA Version 8.0.61
    

    Check the cuDNN version; here the cuDNN version is 6.0.21:

    cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
    
    #define CUDNN_MAJOR      6
    #define CUDNN_MINOR      0
    #define CUDNN_PATCHLEVEL 21
    --
    #define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
    
    #include "driver_types.h"
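
    The CUDNN_VERSION macro at the bottom of that header packs the three version fields into a single integer. The same arithmetic, sketched in Python:

```python
# cuDNN packs (major, minor, patchlevel) into one integer,
# mirroring the CUDNN_VERSION macro shown in cudnn.h above.
def cudnn_version(major, minor, patchlevel):
    return major * 1000 + minor * 100 + patchlevel

print(cudnn_version(6, 0, 21))  # -> 6021
```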
    

    Check the number and model of the GPUs; this server has 4:

    nvidia-smi
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
    | 23%   20C    P0    54W / 250W |      0MiB / 12189MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
    | 23%   21C    P0    54W / 250W |      0MiB / 12189MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  TITAN X (Pascal)    Off  | 0000:83:00.0     Off |                  N/A |
    | 23%   21C    P0    55W / 250W |      0MiB / 12189MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  TITAN X (Pascal)    Off  | 0000:84:00.0     Off |                  N/A |
    |  0%   21C    P0    51W / 250W |      0MiB / 12189MiB |      2%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
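
    As a quick sanity check, the GPU count can be recovered by parsing the table. A minimal Python sketch over a trimmed sample of the output above:

```python
import re

# Count GPUs by matching the device rows of the nvidia-smi table
# (a trimmed sample of the output shown above).
sample = """\
|   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |
|   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |
|   2  TITAN X (Pascal)    Off  | 0000:83:00.0     Off |
|   3  TITAN X (Pascal)    Off  | 0000:84:00.0     Off |
"""
# Device rows start with "|", a GPU index, then the device name.
gpu_rows = re.findall(r"^\|\s+\d+\s+(.+?)\s{2,}", sample, re.MULTILINE)
print(len(gpu_rows))  # -> 4
```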
    

    Compiling TensorFlow

    See the official TensorFlow build documentation.

    Bazel

    Bazel is a tool for building and testing software. On an Ubuntu server, Bazel can be installed with the apt tool; the steps below follow the official installation guide.

    Install JDK 8:

    sudo apt-get install openjdk-8-jdk
    

    Add the Bazel distribution URI as a package source:

    echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
    curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
    

    Install Bazel:

    sudo apt-get install bazel
    

    Slightly different from the official docs: the apt-get update step is skipped here, because storage.googleapis.com may be unreachable.

    Check the Bazel version:

    bazel version
    

    The output shows that Bazel is installed successfully:

    Build label: 0.15.0
    Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
    Build time: Tue Jun 26 12:10:19 2018 (1530015019)
    Build timestamp: 1530015019
    Build timestamp as int: 1530015019
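
    If a build script needs to gate on a minimum Bazel version, the Build label line can be parsed into a comparable tuple. A small sketch (the helper name is hypothetical):

```python
# Parse `bazel version` output into a comparable version tuple.
def parse_bazel_version(output):
    for line in output.splitlines():
        if line.startswith("Build label:"):
            label = line.split(":", 1)[1].strip()
            return tuple(int(p) for p in label.split("."))
    raise ValueError("no 'Build label' line found")

print(parse_bazel_version("Build label: 0.15.0"))  # -> (0, 15, 0)
```

    Tuples compare element-wise in Python, so `parse_bazel_version(out) >= (0, 15, 0)` is a valid minimum-version check.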
    

    Importing CUDA

    Unlike the official docs, there is no need to install libcupti or cuda-command-line-tools separately; they are already included in the CUDA folder, so exporting the CUDA paths is enough.

    export PATH=/usr/local/cuda-8.0/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
    

    Building the source

    Download the TensorFlow source:

    git clone https://github.com/tensorflow/tensorflow
    

    Configure the build, enabling GPU support.

    ./configure
    
    Please specify the location of python. [Default is /data2/wcl1/tensorflow/venv/bin/python] ## choose the Python version, 2 or 3
    
    ## answer N/n to the rest
    
    Do you wish to build TensorFlow with CUDA support? [y/N]: y  # choose the GPU build: y
    CUDA support will be enabled for TensorFlow.
    
    Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 8.0  # CUDA version, matching the server: 8.0
    
    Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 6.0.21  # cuDNN version, matching the server: 6.0.21
    
    ## answer N or accept the default for the rest
    
    Do you want to use clang as CUDA compiler? [y/N]: N  ## choose nvcc
    nvcc will be used as CUDA compiler.
    
    ## answer N or accept the default for the rest
    
    Configuration finished
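
    The configure script can also be driven non-interactively through environment variables. The variable names below (TF_NEED_CUDA, TF_CUDA_VERSION, TF_CUDNN_VERSION, CUDA_TOOLKIT_PATH) are assumptions based on the configure script of this era; verify them against the configure script in your checkout before relying on them:

```python
import os

# Sketch: pre-answer the configure prompts via environment variables.
# NOTE: the variable names are assumptions; check your checkout's
# configure script before use.
answers = {
    "TF_NEED_CUDA": "1",            # "y" to the CUDA support prompt
    "TF_CUDA_VERSION": "8.0",       # CUDA version, matching the server
    "TF_CUDNN_VERSION": "6.0.21",   # cuDNN version, matching the server
    "CUDA_TOOLKIT_PATH": "/usr/local/cuda-8.0",
}
env = dict(os.environ, **answers)   # merged env to pass to ./configure
print(env["TF_CUDA_VERSION"])  # -> 8.0
```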
    

    Build the GPU package:

    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
    

    libdevice error

    Cannot find libdevice.10.bc under /usr/local/cuda-8.0
    

    If this occurs, copy /usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_50.10.bc to libdevice.10.bc, and also copy it into /usr/local/cuda-8.0/:

    cd /usr/local/cuda-8.0/nvvm/libdevice/
    sudo cp libdevice.compute_50.10.bc libdevice.10.bc
    sudo cp libdevice.compute_50.10.bc /usr/local/cuda-8.0/libdevice.10.bc
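
    The same workaround sketched in Python, demonstrated in a temporary directory rather than /usr/local (on a real server, run the copies with sudo as above):

```python
import os
import shutil
import tempfile

# Simulate the CUDA directory layout in a temp dir.
cuda = tempfile.mkdtemp()
libdir = os.path.join(cuda, "nvvm", "libdevice")
os.makedirs(libdir)
src = os.path.join(libdir, "libdevice.compute_50.10.bc")
open(src, "wb").close()  # stand-in for the real bitcode file

# Copy the compute_50 bitcode to the name the build expects,
# both next to the original and at the CUDA root.
shutil.copy(src, os.path.join(libdir, "libdevice.10.bc"))
shutil.copy(src, os.path.join(cuda, "libdevice.10.bc"))
print(os.path.exists(os.path.join(cuda, "libdevice.10.bc")))  # -> True
```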
    

    The Bazel build takes a long time; be patient. It runs 9,945 steps in sequence.

    Convert the build output into a pip-installable whl package, which is placed in the /tmp/tensorflow_pkg folder by default:

    bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    

    Install the package with pip; this is TensorFlow 1.9 with GPU support:

    pip install /tmp/tensorflow_pkg/tensorflow-1.9.0rc0-cp27-cp27mu-linux_x86_64.whl -i https://pypi.douban.com/simple
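
    The wheel filename encodes the build: distribution name, version, Python tag, ABI tag, and platform. A small sketch parsing the package built above:

```python
# Wheel filenames follow {dist}-{version}-{python tag}-{abi tag}-{platform}.whl
name = "tensorflow-1.9.0rc0-cp27-cp27mu-linux_x86_64.whl"
dist, version, py_tag, abi_tag, platform = name[:-len(".whl")].split("-", 4)
print(version, py_tag, platform)  # -> 1.9.0rc0 cp27 linux_x86_64
```

    Here cp27/cp27mu shows the wheel targets CPython 2.7 with the wide-unicode ABI, so it must be installed with a matching interpreter.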
    

    To check the installation, leave the TensorFlow source folder, enter a Python shell, and run:

    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    sess = tf.Session()
    print(sess.run(hello))
    

    A platform error occurs if the Python shell is started inside the tensorflow source folder, because import then picks up the local tensorflow package:

    No module named tensorflow.python.platform
    

    If so, leave the tensorflow folder, re-enter the Python shell, and import the tensorflow package again.

    Check whether the GPUs are available:

    from tensorflow.python.client import device_lib
    local_device_protos = device_lib.list_local_devices()
    print("all: %s" % [x.name for x in local_device_protos])
    
    ## output
    all: [u'/device:CPU:0', u'/device:GPU:0', u'/device:GPU:1', u'/device:GPU:2', u'/device:GPU:3']
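
    To keep only the GPU entries, filter the device names; sketched here on the sample list above rather than a live session:

```python
# Device names from the device_lib output above.
names = ['/device:CPU:0', '/device:GPU:0', '/device:GPU:1',
         '/device:GPU:2', '/device:GPU:3']
gpus = [n for n in names if n.startswith('/device:GPU')]
print(len(gpus))  # -> 4
```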
    

    Note: after compiling TensorFlow, nvidia-smi sometimes hangs (gets stuck); it returned to normal after a server reboot. The cause is unclear; possibly GPU resources were not fully released before being loaded a second time, triggering the problem.

    OK, that's all! Enjoy it!

        Original link: https://www.haomeiwen.com/subject/ctztuftx.html