SqueezeNet
Test subject: SqueezeNet - 256 - cifar10
Test variants: CPU threads set to 1, 4, and 10; data augmentation option added
1 GPU utilization
data:image/s3,"s3://crabby-images/42466/42466e074425c14a3878f11fe2dbb84cf7926ed4" alt=""
data:image/s3,"s3://crabby-images/0ee03/0ee031b870d6b69eb4eacd56ddf339968dde6307" alt=""
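GPU utilization is highest with 10 CPU threads and lowest with 1 thread.
In terms of time, 1 thread trains slowest, followed by 4 threads, then 4 threads plus five augmentations, and 10 threads is fastest.
In summary, with more CPU threads the data is prepared ahead of the GPU, so training is faster.
2 GPU memory allocation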
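The effect of thread count can be sketched with a rough pipeline model, assuming data preparation and GPU compute overlap perfectly. The function name `epoch_time` and all timing values are hypothetical, chosen only for illustration; 196 batches assumes the 50,000 CIFAR-10 training images are split into batches of 256.

```python
def epoch_time(num_batches, prep_time, gpu_time, workers):
    """Estimate epoch time (seconds) for an idealized, fully overlapped input pipeline.

    prep_time: CPU seconds one thread needs to prepare one batch (hypothetical)
    gpu_time:  GPU seconds needed to train on one batch (hypothetical)
    """
    # With `workers` threads preparing batches in parallel, a new batch
    # becomes available every prep_time / workers seconds on average.
    supply_interval = prep_time / workers
    # The slower of the two stages sets the steady-state pace.
    return num_batches * max(supply_interval, gpu_time)


for workers in (1, 4, 10):
    t = epoch_time(num_batches=196, prep_time=0.08, gpu_time=0.01, workers=workers)
    print(f"{workers:>2} threads: {t:.2f} s per epoch")
```

In this model, adding threads helps only while data preparation is the slower stage; once the GPU becomes the bottleneck, extra threads no longer shorten the epoch.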
data:image/s3,"s3://crabby-images/48509/48509a2811225fa5da8d66d8027c4c82a2e71673" alt=""
3 CPU utilization
data:image/s3,"s3://crabby-images/2ecd0/2ecd0f99824e6bd548111f1e869eb3db2c2f72e9" alt=""
4 Share of GPU time spent on read/write operations
data:image/s3,"s3://crabby-images/36542/36542e92710a243275626554bee434fe1e79c95d" alt=""
data:image/s3,"s3://crabby-images/76c07/76c0733abb50fb2ca11f7e954e7a3e4e363ffed3" alt=""
AlexNet
Test subject: AlexNet - ImageNet
Test variants: batch size set to 64, 512, and 1024
data:image/s3,"s3://crabby-images/e0260/e02607032f00aab577ca2c1ab121174373bcb72c" alt=""
1 GPU utilization
data:image/s3,"s3://crabby-images/5c1aa/5c1aa8ba67911192559b1ce2e8dd9080d06b6c17" alt=""
2 GPU memory usage
data:image/s3,"s3://crabby-images/e5dff/e5dff2067c2056bcd9359d1825ae364a8a482759" alt=""
3 Share of read/write during GPU execution
data:image/s3,"s3://crabby-images/abe87/abe8737171739bbb6ab45f3b54fafe6586d0ae0a" alt=""
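This metric is the fraction of the sample period during which the GPU was reading or writing. We want it as low as possible: a lower value means most operations are spent on computation.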
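One way to collect such a reading, assuming the numbers come from `nvidia-smi` (the original charts may come from a different tool), is to query the `utilization.memory` field, which reports the share of the sample period spent reading or writing device memory. The helper names `parse_sample` and `sample_gpu` and the sample reading are hypothetical.

```python
import subprocess

QUERY = "utilization.gpu,utilization.memory,memory.used,memory.total"


def parse_sample(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    gpu_util, mem_util, mem_used, mem_total = (float(x) for x in line.split(","))
    return {
        "gpu_util_pct": gpu_util,                       # time a kernel was running
        "mem_rw_pct": mem_util,                         # time memory was read/written
        "mem_alloc_pct": 100.0 * mem_used / mem_total,  # share of memory allocated
    }


def sample_gpu():
    """Take one reading per GPU (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.strip().splitlines()]


# Parsing a hypothetical reading; no GPU is needed for this part.
print(parse_sample("98, 71, 10530, 11019"))
```

Calling `sample_gpu()` in a loop with a fixed sleep interval would give a time series comparable to the charts in this post.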
data:image/s3,"s3://crabby-images/33be3/33be3f517af6e43c35082352fda549393adda6a1" alt=""
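Based on the values in the red boxes, subtracting one from the other shows that with AlexNet only about 25% of operations go to computation.
4 CPU utilization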
data:image/s3,"s3://crabby-images/49136/4913630f1af18c63510a7a319726c024c607746d" alt=""
VGG16 - ImageNet
Test subject: VGG - ImageNet
Test variants: batch size set to 16, 32, and 64
data:image/s3,"s3://crabby-images/a35c9/a35c9a111d4b35b87fe2d512350b2f57030ef41d" alt=""
1 GPU utilization
data:image/s3,"s3://crabby-images/177a7/177a798d4533b6ccb2af1b69e17d0bbaac6af056" alt=""
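VGG16 essentially runs at full utilization. The downward spikes mark the end of an epoch, when GPU memory is released. (Here I ran the batch size 64 case for 5 epochs and the others for 3, because the runs take too long.)
It is also visible that the larger the batch size, the faster the training.
2 GPU memory allocation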
data:image/s3,"s3://crabby-images/8c3da/8c3dac52fa9f0cf0efa9c8b47944614c229d231a" alt=""
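GPU memory allocation falls from 95% to 48% as the batch size goes from 64 to 16. There is something interesting here: at batch size 16 the GPU utilization is already at 100%, yet increasing the batch size still speeds up training. This means the cores were not fully occupied. The 100% reading follows from how GPU utilization is defined, as shown in the figure below (https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation): GPU utilization has nothing to do with the number of cores in use. As long as one or more cores are busy, the GPU counts as in use at that moment. The percentage arises because, within one sample period, every probe is a record point, and the numerator is incremented whenever the GPU is in use at that point. Combined with the analysis above, the conclusion is that training does not necessarily use all of the GPU's cores, so there is room for further optimization.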
data:image/s3,"s3://crabby-images/9d1fc/9d1fc5672152394c85c725b6843f8015ae36af8c" alt=""
data:image/s3,"s3://crabby-images/f0ead/f0eadc66c4a7a54c3556b9e1905e46206bb29ad7" alt=""
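The gap between reported utilization and actual core usage can be illustrated with a small sketch of the definition described above. The core count and the per-probe numbers are hypothetical values, not measurements from these runs.

```python
def reported_utilization(kernel_active):
    """Percentage of sample points at which at least one kernel was running."""
    return 100.0 * sum(kernel_active) / len(kernel_active)


def mean_core_occupancy(busy_cores, total_cores):
    """Average percentage of cores doing work across the same sample points."""
    return 100.0 * sum(busy_cores) / (total_cores * len(busy_cores))


TOTAL_CORES = 1000                      # hypothetical core count
busy_cores = [300, 280, 310, 290, 320]  # hypothetical busy cores at each probe

# A probe counts toward utilization as soon as any core is busy.
kernel_active = [n > 0 for n in busy_cores]

print(reported_utilization(kernel_active))           # every probe saw a running kernel
print(mean_core_occupancy(busy_cores, TOTAL_CORES))  # far fewer cores were actually busy
```

Under these assumed numbers the reported utilization is at its maximum even though most cores are idle at every probe, which is consistent with a larger batch size still being able to speed up training.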
3 Share of read/write during GPU execution
data:image/s3,"s3://crabby-images/9e959/9e959b473ed01f0b8804f77ab12cb4c975f61b51" alt=""
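This is still very high: again only about 25% of operations are computation. There is an interesting finding here, though: vgg16 accounts for 75%, while the others account for only 66%. This suggests that a small batch size helps computational efficiency, because data does not have to be fetched as often.
VGG16 - Cifar10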
data:image/s3,"s3://crabby-images/82a36/82a3605b03648a9a48d3fcad6c6a2ae902a79c14" alt=""
Test subject: VGG - Cifar10
Test variants: batch size set to 64, 512, 1024, and 2048
As before, the batch size 64 case was trained for 5 epochs and the others for 3.
1 GPU utilization
data:image/s3,"s3://crabby-images/6fa63/6fa63c83df5fcc9bd3e3253bfb2f7dc7155059bc" alt=""
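Training time is short, so the periodic pattern is not obvious. GPU utilization is low when the batch size is small and reaches 93% as the batch size increases.
2 GPU memory allocation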
data:image/s3,"s3://crabby-images/83028/83028686ba6deb85474e246ca04445d8fbab2720" alt=""
3 Share of read/write during GPU execution
data:image/s3,"s3://crabby-images/e50f2/e50f2411f7c64878abc97dd3719d6f615d85d036" alt=""
VGG16 - Cifar10 vs. ImageNet
data:image/s3,"s3://crabby-images/834f7/834f7688840231e6f3d2b9d35587580afc93843a" alt=""
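The two runs compared have the same memory allocation, namely Cifar512 and ImageNet16.
1 GPU utilization
The utilization of Cifar512 is lower than that of ImageNet16.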
data:image/s3,"s3://crabby-images/f72b4/f72b4e5641cb24477ec904410fbf9ad7fee28f14" alt=""
ResNet18 - Cifar10
data:image/s3,"s3://crabby-images/870ef/870ef442502b8697069992ac075c80b07cabfbbd" alt=""
1 GPU utilization
data:image/s3,"s3://crabby-images/96ca5/96ca59271f2ded23cb297d15190794832b881f7c" alt=""
MobileNet - Cifar10
data:image/s3,"s3://crabby-images/366e4/366e4b81dbd284ee1bfdd17714ab03d93e5edcd3" alt=""
Multiple models - Cifar10 - 64
data:image/s3,"s3://crabby-images/f0baa/f0baa0b99bdfaca10dc6b5e20b400fd431ce98b7" alt=""