The question: suppose there are 100,000 mutually independent matrix operations. Can the task be split into 4 parts of 25,000 each, with the CPU creating 4 processes that each call the GPU to do their share of the work? That is, in the ideal picture, the CPU splits the task into 4 parts and each subtask uses one GPU thread for its computation.
Why this is not recommended: the GPU is already "automatically parallel" and needs little outside intervention. Whether or not you create extra processes with multiprocessing, if the core of the computation is still the GPU, splitting the work across CPU processes gains essentially nothing: multiprocessing is a CPU-side concept, and creating processes without giving the CPU any of the actual computation only looks sophisticated on the surface. In practice it does not help at all, and the process start-up overhead even wastes time.
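To make the "the GPU already parallelizes internally" point concrete, here is a minimal sketch that is not part of the original post, assuming for illustration that the 100,000 independent operations are 4x4 matrix multiplications: stacking them into one array and issuing a single batched call lets the GPU spread the work over its cores with no help from multiprocessing.
import cupy as cp
# 100,000 independent 4x4 matrix products, expressed as one batched call.
a = cp.random.random((100000, 4, 4))
b = cp.random.random((100000, 4, 4))
c = cp.matmul(a, b)                    # one launch covers all 100,000 products
cp.cuda.Stream.null.synchronize()      # CuPy kernels are asynchronous; wait for completion
print(c.shape)                         # (100000, 4, 4)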
The comparison below measures the difference with and without multiprocessing; the computation runs on the server's GPU:
- Single process, with all matrix computation done on the GPU:
import time
import cupy as cp
import numpy as np
matrix = cp.ones((1024, 512, 4, 4))
num = int(input('Total number of iterations: '))
sole_time1 = time.time()                  # start timing after the input has been read
for x in range(num):
    matrix = cp.add(matrix, matrix)
cp.cuda.Stream.null.synchronize()         # CuPy launches kernels asynchronously; wait for them to finish
sole_time2 = time.time()
sole_total = sole_time2 - sole_time1
print('Computed {} times, single process took: {}'.format(num, sole_total))
- Multiprocess, with the total task split across 4 processes; the computation is still done on the server's GPU:
import time, os
import cupy as cp
import numpy as np
import multiprocessing as mp

def fun(matrix, num):
    pid = os.getpid()
    print('Process {} started!'.format(pid))
    for y in range(num):
        x = cp.add(matrix, matrix)
    cp.cuda.Stream.null.synchronize()     # wait for this process's queued kernels before it exits

if __name__ == '__main__':
    mp.set_start_method('spawn')
    matrix = cp.ones((1024, 512, 4, 4))
    # Create 4 processes:
    # 100000 iterations in total, 25000 per process
    num = int(input('Total number of iterations: '))
    ps = []
    multips_time1 = time.time()
    for x in range(4):
        print('Creating a process!')
        p = mp.Process(target=fun, args=(matrix, int(num / 4)))
        ps.append(p)
    # Start the processes:
    for x in ps:
        x.start()
    # Parent waits for the children:
    for x in ps:
        x.join()
    multips_time2 = time.time()
    multips_total = multips_time2 - multips_time1
    print('Computed {} times, multiprocess took: {}'.format(num, multips_total))
With the total task set to 100,000 iterations, the two versions take:
Computed 100000 times, multiprocess took: 77.40965700149536
Computed 100000 times, single process took: 77.3151204586029
In actual tests, with very large workloads the multiprocess version does seem to come out slightly ahead, but only by a negligible amount, nowhere near an order of magnitude.
However, a server is usually shared: when several people are running jobs on the same GPU, the multiprocess version very easily fails with an out-of-GPU-memory error and the job dies outright. This is made worse by the fact that, with the spawn start method, every child process initializes its own CUDA context and holds its own copy of the matrix in device memory, so memory use grows with the number of processes. A typical failure looks like this:
File "/home/gaoboyu/mag3d/cupytry1/cupyprocess1.py", line 10, in fun
x = cp.add(matrix,matrix)
File "cupy/core/_kernel.pyx", line 831, in cupy.core._kernel.ufunc.__call__
File "cupy/core/_kernel.pyx", line 339, in cupy.core._kernel._get_out_args
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 518, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1106, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 934, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 949, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 697, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 67108864 bytes (total 134217728 bytes)
This error mostly shows up during the server's peak usage hours; late at night even very large workloads run without complaint. My guess is that either the GPU memory reaches its limit, or the number of GPU threads does. Since the timing difference between the two versions is tiny anyway, the single-process version is far safer than the multiprocess one.
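If the failure really is the memory pool hitting the device's limit, one hedged mitigation (not from the original post, and requiring a reasonably recent CuPy for set_limit) is to cap each process's memory pool and to check free device memory before launching work:
import cupy as cp
# Cap this process's CuPy memory pool at 2 GiB (an arbitrary illustrative value),
# so it fails early and predictably instead of grabbing whatever is left on a shared GPU.
cp.get_default_memory_pool().set_limit(size=2 * 1024**3)
# How much device memory is actually free right now?
free_bytes, total_bytes = cp.cuda.Device().mem_info
print('free: {:.0f} MiB / total: {:.0f} MiB'.format(free_bytes / 2**20, total_bytes / 2**20))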
Addendum: CuPy does its computation through CUDA. When using multiprocessing on the server to run CUDA-based GPU computation from several processes, the following line must be added at the start of the main block, as in this example:
if __name__ == '__main__':
    mp.set_start_method('spawn')    # this line is required!
    matrix = cp.ones((1024, 512, 4, 4))
Otherwise the run fails on the server with the error below. The reason is that with the default fork start method on Linux, the child inherits the parent's already-initialized CUDA runtime state, which cannot be reused after a fork, hence the initialization error:
File "cupyprocess1.py", line 10, in fun
x = cp.add(matrix,matrix)
File "cupy/core/_kernel.pyx", line 810, in cupy.core._kernel.ufunc.__call__
File "cupy/cuda/device.pyx", line 25, in cupy.cuda.device.get_device_id
File "cupy/cuda/runtime.pyx", line 173, in cupy.cuda.runtime.getDevice
File "cupy/cuda/runtime.pyx", line 145, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error
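A hedged alternative sketch (not from the original post): if each child creates its own array, the parent never has to initialize CUDA at all and nothing GPU-resident needs to be pickled across processes; mp.get_context('spawn') gives the same start-method guarantee without the module-level set_start_method call. The worker name below is illustrative.
import multiprocessing as mp

def worker(shape, num):
    import cupy as cp                      # imported in the child, so CUDA is initialized per process
    matrix = cp.ones(shape)                # each child allocates its own copy on the GPU
    for _ in range(num):
        matrix = cp.add(matrix, matrix)
    cp.cuda.Stream.null.synchronize()      # wait for the queued kernels before the child exits

if __name__ == '__main__':
    ctx = mp.get_context('spawn')          # spawn context instead of set_start_method('spawn')
    procs = [ctx.Process(target=worker, args=((1024, 512, 4, 4), 25000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()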
One more test: replacing every cp above with np, the same 100,000-iteration task takes more than 500 seconds even with 4 processes. This again shows that when heavy matrix computation is needed, CuPy is very much worth using in place of NumPy.
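A hedged single-call comparison (not the author's 500-second multiprocess run) that illustrates the same gap on one elementwise operation; the synchronize call matters because CuPy launches kernels asynchronously:
import time
import numpy as np
import cupy as cp

a_cpu = np.ones((1024, 512, 4, 4))
a_gpu = cp.ones((1024, 512, 4, 4))
cp.add(a_gpu, a_gpu)                       # warm-up: the first call may include kernel compilation
cp.cuda.Stream.null.synchronize()

t0 = time.time()
for _ in range(100):
    a_cpu = np.add(a_cpu, a_cpu)
t1 = time.time()

for _ in range(100):
    a_gpu = cp.add(a_gpu, a_gpu)
cp.cuda.Stream.null.synchronize()          # wait for all queued GPU kernels
t2 = time.time()

print('numpy: {:.3f} s   cupy: {:.3f} s'.format(t1 - t0, t2 - t1))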
Finally, the CuPy code in this post is not written well, because it relies on a Python loop: the loop itself runs on the CPU, so every iteration goes back to the CPU to launch another small kernel, which is exactly the kind of mixed CPU/GPU programming (frequent switching between the two) that should be avoided; see the sketch below.
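A hedged sketch of how that overhead could be reduced in this toy case (assuming the per-iteration work really is just doubling, as it is here): reuse the output buffer so nothing new is allocated per iteration, or collapse the whole loop into a single kernel when the result has a closed form.
import cupy as cp

matrix = cp.ones((1024, 512, 4, 4))

# Option 1: keep the Python loop but write into the same buffer each time,
# so every iteration is exactly one kernel launch and no new allocation.
for _ in range(1000):
    cp.add(matrix, matrix, out=matrix)

# Option 2: doubling num times is just multiplying by 2**num, so the whole
# loop collapses into a single kernel (only valid while 2**num fits in float64).
result = cp.ones((1024, 512, 4, 4)) * (2.0 ** 1000)

cp.cuda.Stream.null.synchronize()          # wait for the asynchronous kernels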