python multi-thread & multi-proc

Author: db24cc | Published 2018-01-10 20:59

    python odd & ends

    multi-thread vs multi-process in py

    Postscript

    python odd & ends

    Python is an interpreted language. By analogy with Java, where the language standard has several real implementations (HotSpot, JRockit), the most common Python interpreter is CPython. Other implementations include IronPython (Python running on .NET), Jython (Python running on the Java Virtual Machine), PyPy (a fast Python implementation with a JIT compiler), and Stackless Python (a branch of CPython supporting microthreads).
    The analysis below is all based on CPython.


    multi-thread vs multi-process

    Here is a good answer I came across: Multiprocessing vs Threading Python

    Here are some pros/cons I came up with.
    Multiprocessing
    Pros:

    1. Separate memory space
    2. Code is usually straightforward
    3. Takes advantage of multiple CPUs & cores
    4. Avoids GIL limitations for cPython
    5. Eliminates most needs for synchronization primitives unless you use shared memory (instead, it's more of a communication model for IPC)
    6. Child processes are interruptible/killable
    7. Python 'multiprocessing' module includes useful abstractions with an interface much like 'threading.Thread'
    8. A must with cPython for CPU-bound processing

    Cons:

    1. IPC a little more complicated with more overhead (communication model vs. shared memory/objects)
    2. Larger memory footprint

    Threading
    Pros:

    1. Lightweight - low memory footprint
    2. Shared memory - makes access to state from another context easier
    3. Allows you to easily make responsive UIs
    4. cPython C extension modules that properly release the GIL will run in parallel
    5. Great option for I/O-bound applications

    Cons:

    1. cPython - subject to the GIL
    2. Not interruptible/killable
    3. If not following a command queue/message pump model (using the Queue module), then manual use of synchronization primitives becomes a necessity (decisions are needed for the granularity of locking)
    4. Code is usually harder to understand and to get right - the potential for race conditions increases dramatically

    The pros and cons of multi-process and multi-thread are listed above. Two questions are worth verifying:
    1. In a multi-threaded environment, what exactly does the GIL do?
    2. How should we choose between multi-process and multi-thread for different workloads?

    We can get a glimpse through experiments:

    What does the GIL do in a multi-threaded environment?

    In Java or C++, code like the following would end up with a wrong (smaller) final count, because concurrent, unsynchronized updates to a shared variable (plus cache-coherence effects) lose increments.

    from threading import Thread
    
    counter = 0
    num_threads = 16
    
    
    def increase_atomic_test():
        global counter
        for i in range(10000):
            counter += 1
    
    
    threads = []
    for th in range(num_threads):
        threads.append(Thread(target=increase_atomic_test, args=(), name='increase_atomic_test_' + str(th)))
    
    for th in range(num_threads):
        threads[th].start()
    
    for th in range(num_threads):
        threads[th].join()
    
    print('counter = %s' % counter)
    

    Running it gives:

    /usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/db24/work_src/bianlifeng/test/test_atomic.py
    counter = 160000

    16 threads, each incrementing 10,000 times, and the final result is correct. Preliminary conclusion: only one thread actually executes Python bytecode at any given moment.
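    The GIL does not make counter += 1 atomic, though: the statement compiles to several bytecode instructions, and the interpreter may switch threads between them, so an update can in principle be lost. A quick sketch using the stdlib dis module makes the read-modify-write split visible:

```python
import dis

counter = 0


def increment():
    global counter
    counter += 1  # one statement, but several bytecode instructions


# Print the bytecode: the increment is a LOAD / add / STORE sequence,
# and a thread switch can happen between any two instructions.
dis.dis(increment)

opnames = [ins.opname for ins in dis.get_instructions(increment)]
print('STORE_GLOBAL' in opnames)  # -> True: the store is a separate instruction
```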

    The GIL is an internal implementation detail of CPython: it is CPython that defines this lock, and for Jython it may not be an issue at all. So reads and writes of shared variables should still be protected with something like an RLock:

    def RLock(*args, **kwargs):
        """Factory function that returns a new reentrant lock.

        A reentrant lock must be released by the thread that acquired it. Once a
        thread has acquired a reentrant lock, the same thread may acquire it
        again without blocking; the thread must release it once for each time it
        has acquired it.
        """

    That way, neither a CPython upgrade that changes the GIL nor a switch to another Python implementation will break the program.

    How should we choose between multi-process and multi-thread for different workloads?

    Let's look at a more complex case:
    a CPU-bound task unit, task_unit.cc

    int work_run_(){
        int s = 0;
        for(int i = 0; i < 10000; ++i){
            for(int j = 0; j < 10000; ++j){
                for(int z = 0; z < 2; ++z)
                    s += 1;
            }
        }
        return s;
    }
    extern "C" {
        int work_run(){ return work_run_();}
    }
    

    A CPU-bound task unit, test_unit.py, with the same amount of computation as task_unit.cc (the C++ file is compiled into libtask_unit.so, e.g. with g++ -shared -fPIC task_unit.cc -o libtask_unit.so):

    import queue
    import time
    from ctypes import cdll
    
    
    # def work_unit_cpp(v1, v2, _flann, _surf):
    # _, des1 = _surf.detectAndCompute(v1, None)
    # _, des2 = _surf.detectAndCompute(v2, None)
    # matches = _flann.knnMatch(des1, des2, k=2)
    # return sum(1 for x in matches if x[0].distance < 0.5 * x[1].distance) > 3
    # time.sleep(0.1)
    
    
    def work_unit_cpp():
        lib = cdll.LoadLibrary('libtask_unit.so')
        lib.work_run()
    
    
    def work_unit_py():
        x = 0
        for i in range(10000):
            for j in range(10000):
                for z in range(2):
                    x += 1
        return x
    
    
    def work_unit_q(q, task_type):
    #    surf = cv2.xfeatures2d.SIFT_create(600)
    
    #    FLANN_INDEX_KDTREE = 0
    #    index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
    #    search_params = dict(checks=50)
    #    flann = cv2.FlannBasedMatcher(index_params, search_params)
        while not q.empty():
            try:
                v2 = q.get(block=False, timeout=None)
                q.task_done()
                if task_type == 'cpp':
                    work_unit_cpp()
                else:
                    work_unit_py()
            except queue.Empty:
                return
        return
    

    The driver code:

    # import cv2
    import sys
    import argparse
    from datetime import datetime
    from datetime import timedelta
    import queue
    from threading import Thread
    import multiprocessing as mp
    from multiprocessing import JoinableQueue
    from test_unit import work_unit_cpp, work_unit_py, work_unit_q
    from multiprocessing import Queue as MPQueue
    import time
    
    NUMBER_OF_TARGET = 32
    NUMBER_OF_THREADS = 8
    NUMBER_OF_PROCESS = 8
    
    
    def parse_arg(args):
        parser = argparse.ArgumentParser()
        parser.add_argument('--run_type', type=str, choices=['single', 'mt', 'mp'], help='single for within thread, '
                                                                                         'mt for multiple thread, '
                                                                                         'mp for multi-process',
                            default='single')
        parser.add_argument('--task_type', type=str, choices=['cpp', 'py'], help='cpp for task run in cpp '
                                                                                 'py for task run in python',
                            default='cpp')
        return parser.parse_args(args)
    
    
    def test_one_thread(task_type):
        print('test_one_thread %s' % task_type)
        for i in range(NUMBER_OF_TARGET):
            if task_type == 'cpp':
                work_unit_cpp()
            else:
                work_unit_py()
    
    
    def test_multi_thread(task_type):
        print('test_multi_thread %s' % task_type)
        q = queue.Queue(NUMBER_OF_TARGET)
        for i in range(NUMBER_OF_TARGET):
            q.put(i)
        ths = []
        for i in range(NUMBER_OF_THREADS):
            ths.append(Thread(target=work_unit_q, args=(q, task_type,), name=str(i)))
        for i in range(NUMBER_OF_THREADS):
            ths[i].start()
        for i in range(NUMBER_OF_THREADS):
            ths[i].join()
    
    
    def test_multi_process(task_type):
        print('test_multi_process %s' % task_type)
        q = JoinableQueue(NUMBER_OF_TARGET)
        for i in range(NUMBER_OF_TARGET):
            q.put(i)
        processes = []
        for i in range(NUMBER_OF_PROCESS):
            processes.append(mp.Process(target=work_unit_q, args=(q, task_type,)))
        for process in processes:
            process.start()
        for process in processes:
            process.join()
        q.close()
    
    
    if __name__ == '__main__':
        start = datetime.now()
        arg = parse_arg(sys.argv[1:])
        if arg.run_type == 'single':
            test_one_thread(arg.task_type)
        elif arg.run_type == 'mt':
            test_multi_thread(arg.task_type)
        else:
            test_multi_process(arg.task_type)
        print('time:%s' % (datetime.now() - start).total_seconds())
    

    There are two parameters: run_type selects single-thread, multi-thread, or multi-process; task_type selects whether the task body runs in C/C++ or in Python.
    The C++ task originally extracted SURF feature points with OpenCV and computed similarity, but OpenCV misbehaved under multi-process, so the task here is a plain CPU-bound loop with logically equivalent C++ and Python versions.
    Test results:

    time python3 test_multi_process_thread.py --run_type=mp --task_type=cpp
    test_multi_process cpp
    time:3.51
    real 0m3.822s
    user 0m14.324s
    sys 0m2.932s

    time python3 test_multi_process_thread.py --run_type=mt --task_type=cpp
    test_multi_thread cpp
    time:2.135229
    real 0m2.455s
    user 0m16.528s
    sys 0m1.624s

    time python3 test_multi_process_thread.py --run_type=single --task_type=cpp
    test_one_thread cpp
    time:14.562856
    real 0m14.810s
    user 0m15.136s
    sys 0m2.704s

    time python3 test_multi_process_thread.py --run_type=mp --task_type=py
    test_multi_process py
    time:170.000028
    real 2m50.302s
    user 21m46.504s
    sys 0m2.176s

    time python3 test_multi_process_thread.py --run_type=single --task_type=py
    test_one_thread py
    time:1146.867732
    real 19m7.136s
    user 19m7.336s
    sys 0m2.856s

    time python3 test_multi_process_thread.py --run_type=mt --task_type=py
    test_multi_thread py
    time:1810.804411
    real 30m11.120s
    user 30m31.556s
    sys 0m28.404s

    From the results:

    1. For the same computation and the same run mode, C++ beats Python.
    2. When the task body is C++, multi-thread is slightly better than multi-process and far better than serial. Threads have less startup and communication overhead than processes, and both achieve CPU-level task parallelism (ctypes releases the GIL around the foreign call).
    3. When the task body is Python, multi-process is the most efficient because it sidesteps the GIL; serial is in the middle; multi-thread is the slowest because the threads contend for the GIL, so using threads here actually hurts.
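    As a side note, the multi-process variant can also be written more compactly with multiprocessing.Pool instead of managing Process objects and a JoinableQueue by hand. This is a sketch with a deliberately scaled-down task, not the code used in the measurements above:

```python
from multiprocessing import Pool


def work(_):
    # scaled-down pure-Python CPU task (the full test uses 10000 x 10000 x 2)
    x = 0
    for i in range(1000):
        for j in range(100):
            x += 1
    return x


if __name__ == '__main__':
    # each worker process has its own interpreter and its own GIL,
    # so pure-Python CPU work runs in parallel across the pool
    with Pool(4) as pool:
        results = pool.map(work, range(8))
    print(sum(results))  # 8 tasks x 100000 increments = 800000
```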

    Postscript

    1. Programs should not rely on interpreter implementation details. In a multi-threaded environment, access shared variables either through a queue-style mechanism or under something like an RLock; then neither an interpreter upgrade nor the GIL being temporarily released during a C/C++ call can leave the state inconsistent.

    2. Python's strengths are that it is easy to write and easy to glue to other libraries. Because Python variables are dynamic, every access creates or reads objects on the heap and cannot make good use of registers, so CPU-bound computation should be pushed down into C or C++ and run under a multi-thread model to maximize throughput.

    3. Although calling into C/C++ releases the GIL, locking inside the C/C++ code can still limit throughput, so you still need to understand the blocking behavior of the modules you depend on.

    4. If the computation itself runs in Python, be careful with the multi-thread model; consider the multi-process model to improve throughput.

    5. Given these characteristics, Python is best used as the connector of a program rather than the executor: implement the building blocks in an efficient language and compose them quickly in Python, getting both iteration speed and throughput.

    For example, in TensorFlow the graph definition changes quickly while executing a defined graph is generic, so the graph is defined in Python and the actual execution runs in C++, minimizing GIL contention and playing to each language's strengths.


          Original link: https://www.haomeiwen.com/subject/frtgnxtx.html