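Today I officially started working with deep learning, and setting up a 100,000-yuan GPU server turned out to be no joke.
Without further ado: any setup guide that doesn't state its environment is cheating, so here is mine.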
System: Ubuntu 16.04 64-bit
GPU: NVIDIA GeForce GTX 1080 Ti
Driver: nvidia 384.130
Software versions: CUDA 9.0 + cuDNN 7.1.3
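1. Installing the NVIDIA driver
Broadly, there are three ways to install the graphics driver:
- Download the driver package from the NVIDIA website and install it (http://www.nvidia.cn/Download/index.aspx?lang=cn)
- Install from a PPA (see this blog post: http://blog.csdn.net/qiusuoxiaozi/article/details/70195689)
- Install the driver bundled with CUDA (https://developer.nvidia.com/cuda-downloads)
Each of the three can fail in its own way; I tried the second and third myself (and wept). Whichever you choose, the most important thing is to know which driver version you should install. You can find out like this: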
apt-cache search nvidia
The output lists the available driver packages; on my machine the newest one is nvidia-384. Once you know the driver version, the rest is easy.
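Picking the newest package out of a long listing by eye is error-prone, so here is a minimal sketch that does it in Python. The sample text is hypothetical (it is not the author's actual output), and the function name is made up for illustration.

```python
import re

def newest_driver_package(apt_output):
    """Return the highest-numbered 'nvidia-NNN' package name, or None."""
    versions = set()
    for line in apt_output.splitlines():
        # Match "nvidia-384 - ..." but not "nvidia-384-dev" or "nvidia-settings".
        match = re.match(r"nvidia-(\d+)\s", line)
        if match:
            versions.add(int(match.group(1)))
    return "nvidia-%d" % max(versions) if versions else None

# Hypothetical sample of `apt-cache search nvidia` output.
sample = """\
nvidia-340 - NVIDIA binary driver - version 340.104
nvidia-375 - Transitional package for nvidia-384
nvidia-384 - NVIDIA binary driver - version 384.130
nvidia-384-dev - NVIDIA binary Xorg driver development files
nvidia-settings - Tool for configuring the NVIDIA graphics driver
"""
print(newest_driver_package(sample))
```

On a real machine you would pass in the captured output of the search command instead of the sample string.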
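Disabling nouveau
nouveau is the open-source driver for NVIDIA cards that ships with Ubuntu. It has to be disabled first, otherwise it conflicts with NVIDIA's own driver.
Open the configuration file for editing:
/etc/modprobe.d/blacklist.conf
Append this as the last line:
blacklist nouveau
This disables the nouveau driver; there is no need to revert it later.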
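If you script this edit, it helps to make it idempotent so that running it twice does not add the line twice. The sketch below works on the file contents as a string; the function name and the sample contents are assumptions for illustration, and writing the result back to /etc/modprobe.d/blacklist.conf (as root) is left to you.

```python
def ensure_blacklisted(conf_text, module="nouveau"):
    """Return conf_text with 'blacklist <module>' present exactly once."""
    entry = "blacklist %s" % module
    for line in conf_text.splitlines():
        # Drop trailing comments and normalise whitespace before comparing.
        if " ".join(line.split("#")[0].split()) == entry:
            return conf_text  # already present, leave the text untouched
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + entry + "\n"

# Hypothetical existing file contents.
original = "# existing entries\nblacklist evbug\n"
updated = ensure_blacklisted(original)
print(updated)
```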
Then run:
sudo update-initramfs -u
Next, reboot with the reboot command. After the reboot, run:
lsmod | grep nouveau
If there is no output, nouveau has been disabled.
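The same check can be written in Python when it is part of a larger setup script. This is a sketch under the assumption that you capture the output of lsmod yourself; the sample below is hypothetical. Unlike grep, it only looks at the first column, so a module that merely lists nouveau under "Used by" does not count.

```python
def module_loaded(lsmod_output, module="nouveau"):
    """True if `module` appears in the first column of `lsmod` output."""
    for line in lsmod_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0] == module:
            return True
    return False

# Hypothetical output from a machine where nouveau is still loaded.
sample = """\
Module                  Size  Used by
nouveau              1716224  1
ttm                   106496  1 nouveau
video                  45056  1 nouveau
"""
print(module_loaded(sample))
```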
Stopping the X server
The desktop service has to be shut down before the driver can be installed. Run:
sudo /etc/init.d/lightdm stop
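Installing the driver
Download the Linux driver from the NVIDIA website; mine is NVIDIA-Linux-x86_64-384.130.run
Note that the version to download follows the nvidia version number found earlier.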
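A quick sanity check that the downloaded installer belongs to the driver series found earlier can be sketched as follows. It assumes the installer keeps the NVIDIA-Linux-x86_64-<version>.run naming used in this post; both function names are made up for illustration.

```python
import re

def installer_version(filename):
    """Extract '384.130' from 'NVIDIA-Linux-x86_64-384.130.run'."""
    match = re.search(r"NVIDIA-Linux-x86_64-(\d+(?:\.\d+)+)\.run$", filename)
    if match is None:
        raise ValueError("unrecognised installer name: %r" % filename)
    return match.group(1)

def matches_series(filename, package):
    """True if the installer's major version equals the package series."""
    major = installer_version(filename).split(".")[0]
    return package == "nvidia-" + major

print(matches_series("NVIDIA-Linux-x86_64-384.130.run", "nvidia-384"))
```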
Switching to the console
Press Ctrl+Alt+F1 to switch to a text console, then start installing the driver.
Make the driver .run file executable:
sudo chmod a+x NVIDIA-Linux-x86_64-384.130.run
Install. Be sure to pass the --no-opengl-files option, otherwise you can end up stuck in a graphical login loop:
sudo ./NVIDIA-Linux-x86_64-384.130.run --no-opengl-files
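Once the installation finishes, restart the X service:
sudo /etc/init.d/lightdm start
Then reboot the machine with reboot and run nvidia-smi to test:
nvidia-smi
Heh, eight 1080 Ti cards. This box is loaded.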
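For a scripted check, nvidia-smi can print machine-readable output with `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`. The sketch below parses that format; the sample string is constructed from the values mentioned in this post rather than captured from the machine, and the function name is an assumption.

```python
def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so the name field is kept whole.
        name, driver = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append({"name": name, "driver": driver})
    return gpus

# Constructed sample: eight identical cards on one driver version.
sample = "GeForce GTX 1080 Ti, 384.130\n" * 8
gpus = parse_gpu_query(sample)
drivers = sorted(set(gpu["driver"] for gpu in gpus))
print("%d GPUs, driver versions: %s" % (len(gpus), ", ".join(drivers)))
```

If the count is lower than the number of cards physically installed, or more than one driver version shows up, something went wrong during the driver installation.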
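2. Installing CUDA
With the driver in place, CUDA can be installed.
This is another pitfall: you cannot download just any version. The previous step installed the nvidia-384 driver, so CUDA 9.0 is the version to download (newer CUDA releases need a newer driver): https://developer.nvidia.com/cuda-90-download-archive
Be sure to download the .run file. It asks whether you want the NVIDIA driver bundled with CUDA, and we simply decline.
I installed from cuda_9.0.176_384.81_linux.run.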
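Important: answer n when asked to install the driver, and answer n when asked whether to install OpenGL. Answer y to everything else, and reboot once the installer finishes:
sudo sh cuda_9.0.176_384.81_linux.run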
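The runfile name carries two version numbers. The sketch below assumes the second one (384.81 here) is the driver bundled with that CUDA release and treats it as the minimum driver you should already have installed; that reading of the filename is an assumption, so check NVIDIA's release notes if in doubt. All function names are made up for illustration.

```python
import re

def version_tuple(text):
    """'384.130' -> (384, 130), so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def bundled_versions(runfile):
    """Return (cuda_version, driver_version) parsed from the runfile name."""
    match = re.match(r"cuda_([\d.]+)_([\d.]+)_linux\.run$", runfile)
    if match is None:
        raise ValueError("unrecognised runfile name: %r" % runfile)
    return match.group(1), match.group(2)

def driver_new_enough(installed_driver, runfile):
    """True if the installed driver is at least the bundled driver version."""
    _, bundled = bundled_versions(runfile)
    return version_tuple(installed_driver) >= version_tuple(bundled)

runfile = "cuda_9.0.176_384.81_linux.run"
print(bundled_versions(runfile))
print(driver_new_enough("384.130", runfile))
```

Comparing tuples of integers matters here: as plain strings, "384.130" would sort before "384.81".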
Setting the CUDA environment variables
sudo vim /etc/profile
Append at the end of the file:
export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH="/usr/local/cuda-9.0/lib64/:/usr/local/cuda/lib64/":$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-9.0
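To make the effect of these exports concrete, here is a small sketch that computes the resulting variables from an existing environment. It is an illustration, not part of the setup itself; the function name is made up, and it assumes a Linux-style ":" separator. It also avoids leaving an empty entry in LD_LIBRARY_PATH when the variable was previously unset, since an empty entry is treated by the dynamic linker as the current directory.

```python
SEP = ":"  # path separator on Linux

def cuda_environment(cuda_home, env):
    """Return a copy of env with CUDA's bin and lib64 directories prepended."""
    updated = dict(env)
    updated["CUDA_HOME"] = cuda_home

    def prepend(name, directory):
        parts = [p for p in updated.get(name, "").split(SEP) if p]
        if directory not in parts:  # keep the operation idempotent
            parts.insert(0, directory)
        updated[name] = SEP.join(parts)

    prepend("PATH", cuda_home + "/bin")
    prepend("LD_LIBRARY_PATH", cuda_home + "/lib64")
    return updated

env = cuda_environment("/usr/local/cuda-9.0", {"PATH": "/usr/bin:/bin"})
for key in ("CUDA_HOME", "PATH", "LD_LIBRARY_PATH"):
    print("%s=%s" % (key, env[key]))
```

After sourcing /etc/profile (or logging in again), `nvcc --version` should be found on PATH if the directories are right.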
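3. Installing cuDNN
Download cuDNN (https://developer.nvidia.com/rdp/cudnn-archive); this requires an NVIDIA Developer account. Pick the build that matches your versions; mine is cuDNN 7.1.3 for CUDA 9.0.
After downloading, just extract the archive and copy the files into place: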
tar -zxvf cudnn-9.0-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
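To confirm which cuDNN version actually landed in /usr/local/cuda/include, you can read the version macros from cudnn.h. The sketch below parses the header text; the excerpt is a hand-written sample of the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL defines for the version used in this post, and the function name is an assumption.

```python
import re

def cudnn_version(header_text):
    """Return 'major.minor.patch' from the CUDNN_* defines in cudnn.h."""
    found = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        pattern = r"^\s*#define\s+CUDNN_%s\s+(\d+)" % name
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            raise ValueError("CUDNN_%s not found" % name)
        found[name] = int(match.group(1))
    return "%d.%d.%d" % (found["MAJOR"], found["MINOR"], found["PATCHLEVEL"])

# Hand-written excerpt standing in for /usr/local/cuda/include/cudnn.h.
sample = """\
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 3
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(cudnn_version(sample))
```

On the real machine, read the header with `open("/usr/local/cuda/include/cudnn.h").read()` and pass the text in.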
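4. Installing TensorFlow
On Ubuntu, TensorFlow can be installed directly with pip.
Notes:
1. Switch the pip index to the douban mirror first, otherwise the download is too slow.
2. Install tensorflow-gpu rather than tensorflow, because the latter runs on the CPU only.
sudo pip install tensorflow-gpu
This step rarely goes wrong. Once it finishes, start python to test: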
import tensorflow as tf
hello= tf.constant('hello world')
sess=tf.Session()
The output is shown below; the test succeeded:
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello= tf.constant('hello world')
>>> sess=tf.Session()
2018-04-23 12:12:39.122320: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-23 12:12:39.706554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:04:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:40.129654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:05:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:40.548703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:08:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:40.988355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:09:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:41.413895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 4 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:83:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:41.846651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 5 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:84:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:42.286805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 6 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:87:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:42.719848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 7 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:88:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
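With eight cards the startup log gets long, so a small parser can confirm that every GPU was found. This is a sketch that relies on the "Found device N with properties:" lines shown above, which are specific to this TensorFlow version; the sample is the first two entries of that log, and the function name is made up.

```python
import re

def found_devices(log_text):
    """Return [(index, name), ...] for each 'Found device' entry in the log."""
    devices = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = re.search(r"Found device (\d+) with properties:", line)
        if match and i + 1 < len(lines):
            # The device name is on the line after the 'Found device' line.
            name = re.match(r"name: (.+?) major:", lines[i + 1].strip())
            devices.append((int(match.group(1)), name.group(1) if name else None))
    return devices

# First two entries copied from the log above.
sample = """\
2018-04-23 12:12:39.706554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:04:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
2018-04-23 12:12:40.129654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:05:00.0
totalMemory: 10.91GiB freeMemory: 396.38MiB
"""
for index, name in found_devices(sample):
    print("%d: %s" % (index, name))
```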
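Other issues:
- After the setup succeeded, I ran into a problem where the GPU frequently crashed the server while TensorFlow was running.
The system crash log showed:
GPU has fallen off the bus
- Fix (from the article "GPU has fallen off the bus"):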
Put the NVIDIA driver in persistence mode. You need to set your GPU in persistence mode. From the man page:
A flag that indicates whether persistence mode is enabled for the GPU. Value is either “Enabled” or “Disabled”. When persistence mode is enabled the NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi, exist. This minimizes the driver load latency associated with running dependent apps, such as CUDA programs. For all CUDA-capable products. Linux only.
Edit the /etc/rc.local file and add the following line before the exit 0 statement:
/usr/bin/nvidia-smi -pm 1
Save and close the file. The above line ensures that your GPU is set to persistence mode as soon as it boots into the system.
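If you apply this change from a provisioning script, the edit can be sketched as a pure function over the contents of rc.local. It inserts the command before the last `exit 0` line and does nothing if the command is already there. The function name and the sample contents are assumptions; writing the result back to /etc/rc.local (as root) is left to you.

```python
def add_before_exit(rc_local_text, command="/usr/bin/nvidia-smi -pm 1"):
    """Insert `command` before the final 'exit 0' line, if not already present."""
    lines = rc_local_text.splitlines()
    if command in (line.strip() for line in lines):
        return rc_local_text  # already configured
    exits = [i for i, line in enumerate(lines) if line.strip() == "exit 0"]
    if exits:
        lines.insert(exits[-1], command)
    else:
        lines.append(command)  # no 'exit 0' found, append at the end
    return "\n".join(lines) + "\n"

# Hypothetical minimal rc.local.
original = "#!/bin/sh -e\n# rc.local\nexit 0\n"
updated = add_before_exit(original)
print(updated)
```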