python multi-thread & multi-proc

Author: db24cc | Published 2018-01-10 20:59

    python odd & ends

    multi-thread vs multi-process in py

    Postscript

    python odd & ends

    Python is an interpreted language. By analogy with Java, where the language standard has several real implementations (HotSpot, JRockit), the most common Python interpreter is CPython. Other implementations include IronPython (Python running on .NET), Jython (Python running on the Java Virtual Machine), PyPy (a fast Python implementation with a JIT compiler), and Stackless Python (a branch of CPython supporting microthreads).
    The analysis below is all based on CPython.


    multi-thread vs multi-process

    Here is a good answer I came across: Multiprocessing vs Threading Python

    Here are some pros/cons I came up with.
    Multiprocessing
    Pros:

    1. Separate memory space
    2. Code is usually straightforward
    3. Takes advantage of multiple CPUs & cores
    4. Avoids GIL limitations for cPython
    5. Eliminates most needs for synchronization primitives unless you use shared memory (instead, it's more of a communication model for IPC)
    6. Child processes are interruptible/killable
    7. Python 'multiprocessing' module includes useful abstractions with an interface much like 'threading.Thread'
    8. A must with cPython for CPU-bound processing

    Cons:

    1. IPC a little more complicated with more overhead (communication model vs. shared memory/objects)
    2. Larger memory footprint

    Threading
    Pros:

    1. Lightweight - low memory footprint
    2. Shared memory - makes access to state from another context easier
    3. Allows you to easily make responsive UIs
    4. cPython C extension modules that properly release the GIL will run in parallel
    5. Great option for I/O-bound applications

    Cons:

    1. cPython - subject to the GIL
    2. Not interruptible/killable
    3. If not following a command queue/message pump model (using the Queue module), then manual use of synchronization primitives becomes a necessity (decisions are needed for the granularity of locking)
    4. Code is usually harder to understand and to get right - the potential for race conditions increases dramatically

    The pros and cons of multi-process and multi-thread are listed above. Two questions are worth verifying:
    1. In a multi-threaded environment, what exactly does the GIL do?
    2. How should we choose between multi-process and multi-thread for different workloads?

    We can get a glimpse through experiments:

    What does the GIL do in a multi-threaded environment?

    In Java or C++, code like the following would end up with a wrong (smaller) final count, because concurrent, unsynchronized updates to a shared variable (plus cache-coherence effects) lose increments.

    from threading import Thread
    
    counter = 0
    num_threads = 16
    
    
    def increase_atomic_test():
        global counter
        for i in range(10000):
            counter += 1
    
    
    threads = []
    for th in range(num_threads):
        threads.append(Thread(target=increase_atomic_test, args=(), name='increase_atomic_test_' + str(th)))
    
    for th in range(num_threads):
        threads[th].start()
    
    for th in range(num_threads):
        threads[th].join()
    
    print('counter = %s' % counter)
    

    Running it gives:

    /usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/db24/work_src/bianlifeng/test/test_atomic.py
    counter = 160000

    16 threads, each incrementing 10,000 times, and the final result is correct. Preliminary conclusion: only one thread actually executes Python bytecode at any given moment.
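    The GIL does not make counter += 1 atomic, though: the statement compiles to several bytecode instructions, and the interpreter may switch threads between them, so an update can in principle be lost. A quick sketch using the stdlib dis module makes the read-modify-write split visible:

```python
import dis

counter = 0


def increment():
    global counter
    counter += 1  # one statement, but several bytecode instructions


# Print the bytecode: the increment is a LOAD / add / STORE sequence,
# and a thread switch can happen between any two instructions.
dis.dis(increment)

opnames = [ins.opname for ins in dis.get_instructions(increment)]
print('STORE_GLOBAL' in opnames)  # -> True: the store is a separate instruction
```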

    The GIL is an internal implementation detail of CPython: it is CPython that defines this lock, and for Jython it may not be an issue at all. So reads and writes of shared variables should still be protected with something like an RLock:

    def RLock(*args, **kwargs):
        """Factory function that returns a new reentrant lock.

        A reentrant lock must be released by the thread that acquired it. Once a
        thread has acquired a reentrant lock, the same thread may acquire it
        again without blocking; the thread must release it once for each time it
        has acquired it.
        """

    That way, neither a CPython upgrade that changes the GIL nor a switch to another Python implementation will break the program.

    How should we choose between multi-process and multi-thread for different workloads?

    Let's look at a more complex case:
    a CPU-bound task unit, task_unit.cc

    int work_run_(){
        int s = 0;
        for(int i = 0; i < 10000; ++i){
            for(int j = 0; j < 10000; ++j){
                for(int z = 0; z < 2; ++z)
                    s += 1;
            }
        }
        return s;
    }
    extern "C" {
        int work_run(){ return work_run_();}
    }
    

    A CPU-bound task unit, test_unit.py, with the same amount of computation as task_unit.cc (the C++ file is compiled into libtask_unit.so, e.g. with g++ -shared -fPIC task_unit.cc -o libtask_unit.so):

    import queue
    import time
    from ctypes import cdll
    
    
    # def work_unit_cpp(v1, v2, _flann, _surf):
    # _, des1 = _surf.detectAndCompute(v1, None)
    # _, des2 = _surf.detectAndCompute(v2, None)
    # matches = _flann.knnMatch(des1, des2, k=2)
    # return sum(1 for x in matches if x[0].distance < 0.5 * x[1].distance) > 3
    # time.sleep(0.1)
    
    
    def work_unit_cpp():
        lib = cdll.LoadLibrary('libtask_unit.so')
        lib.work_run()
    
    
    def work_unit_py():
        x = 0
        for i in range(10000):
            for j in range(10000):
                for z in range(2):
                    x += 1
        return x
    
    
    def work_unit_q(q, task_type):
    #    surf = cv2.xfeatures2d.SIFT_create(600)
    
    #    FLANN_INDEX_KDTREE = 0
    #    index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
    #    search_params = dict(checks=50)
    #    flann = cv2.FlannBasedMatcher(index_params, search_params)
        while not q.empty():
            try:
                v2 = q.get(block=False, timeout=None)
                q.task_done()
                if task_type == 'cpp':
                    work_unit_cpp()
                else:
                    work_unit_py()
            except queue.Empty:
                return
        return
    

    The driver code:

    # import cv2
    import sys
    import argparse
    from datetime import datetime
    from datetime import timedelta
    import queue
    from threading import Thread
    import multiprocessing as mp
    from multiprocessing import JoinableQueue
    from test_unit import work_unit_cpp, work_unit_py, work_unit_q
    from multiprocessing import Queue as MPQueue
    import time
    
    NUMBER_OF_TARGET = 32
    NUMBER_OF_THREADS = 8
    NUMBER_OF_PROCESS = 8
    
    
    def parse_arg(args):
        parser = argparse.ArgumentParser()
        parser.add_argument('--run_type', type=str, choices=['single', 'mt', 'mp'], help='single for within thread, '
                                                                                         'mt for multiple thread, '
                                                                                         'mp for multi-process',
                            default='single')
        parser.add_argument('--task_type', type=str, choices=['cpp', 'py'], help='cpp for task run in cpp '
                                                                                 'py for task run in python',
                            default='cpp')
        return parser.parse_args(args)
    
    
    def test_one_thread(task_type):
        print('test_one_thread %s' % task_type)
        for i in range(NUMBER_OF_TARGET):
            if task_type == 'cpp':
                work_unit_cpp()
            else:
                work_unit_py()
    
    
    def test_multi_thread(task_type):
        print('test_multi_thread %s' % task_type)
        q = queue.Queue(NUMBER_OF_TARGET)
        for i in range(NUMBER_OF_TARGET):
            q.put(i)
        ths = []
        for i in range(NUMBER_OF_THREADS):
            ths.append(Thread(target=work_unit_q, args=(q, task_type,), name=str(i)))
        for i in range(NUMBER_OF_THREADS):
            ths[i].start()
        for i in range(NUMBER_OF_THREADS):
            ths[i].join()
    
    
    def test_multi_process(task_type):
        print('test_multi_process %s' % task_type)
        q = JoinableQueue(NUMBER_OF_TARGET)
        for i in range(NUMBER_OF_TARGET):
            q.put(i)
        processes = []
        for i in range(NUMBER_OF_PROCESS):
            processes.append(mp.Process(target=work_unit_q, args=(q, task_type,)))
        for process in processes:
            process.start()
        for process in processes:
            process.join()
        q.close()
    
    
    if __name__ == '__main__':
        start = datetime.now()
        arg = parse_arg(sys.argv[1:])
        if arg.run_type == 'single':
            test_one_thread(arg.task_type)
        elif arg.run_type == 'mt':
            test_multi_thread(arg.task_type)
        else:
            test_multi_process(arg.task_type)
        print('time:%s' % (datetime.now() - start).total_seconds())
    

    There are two parameters: run_type selects single-thread, multi-thread, or multi-process; task_type selects whether the task body runs in C/C++ or in Python.
    The C++ task originally extracted SURF feature points with OpenCV and computed similarity, but OpenCV misbehaved under multi-process, so the task here is a plain CPU-bound loop with logically equivalent C++ and Python versions.
    Test results:

    time python3 test_multi_process_thread.py --run_type=mp --task_type=cpp
    test_multi_process cpp
    time:3.51
    real 0m3.822s
    user 0m14.324s
    sys 0m2.932s

    time python3 test_multi_process_thread.py --run_type=mt --task_type=cpp
    test_multi_thread cpp
    time:2.135229
    real 0m2.455s
    user 0m16.528s
    sys 0m1.624s

    time python3 test_multi_process_thread.py --run_type=single --task_type=cpp
    test_one_thread cpp
    time:14.562856
    real 0m14.810s
    user 0m15.136s
    sys 0m2.704s

    time python3 test_multi_process_thread.py --run_type=mp --task_type=py
    test_multi_process py
    time:170.000028
    real 2m50.302s
    user 21m46.504s
    sys 0m2.176s

    time python3 test_multi_process_thread.py --run_type=single --task_type=py
    test_one_thread py
    time:1146.867732
    real 19m7.136s
    user 19m7.336s
    sys 0m2.856s

    time python3 test_multi_process_thread.py --run_type=mt --task_type=py
    test_multi_thread py
    time:1810.804411
    real 30m11.120s
    user 30m31.556s
    sys 0m28.404s

    From the results:

    1. For the same computation and the same run mode, C++ beats Python.
    2. When the task body is C++, multi-thread is slightly better than multi-process and far better than serial. Threads have less startup and communication overhead than processes, and both achieve CPU-level task parallelism (ctypes releases the GIL around the foreign call).
    3. When the task body is Python, multi-process is the most efficient because it sidesteps the GIL; serial is in the middle; multi-thread is the slowest because the threads contend for the GIL, so using threads here actually hurts.
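    As a side note, the multi-process variant can also be written more compactly with multiprocessing.Pool instead of managing Process objects and a JoinableQueue by hand. This is a sketch with a deliberately scaled-down task, not the code used in the measurements above:

```python
from multiprocessing import Pool


def work(_):
    # scaled-down pure-Python CPU task (the full test uses 10000 x 10000 x 2)
    x = 0
    for i in range(1000):
        for j in range(100):
            x += 1
    return x


if __name__ == '__main__':
    # each worker process has its own interpreter and its own GIL,
    # so pure-Python CPU work runs in parallel across the pool
    with Pool(4) as pool:
        results = pool.map(work, range(8))
    print(sum(results))  # 8 tasks x 100000 increments = 800000
```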

    Postscript

    1. Programs should not rely on interpreter implementation details. In a multi-threaded environment, access shared variables either through a queue-style mechanism or under something like an RLock; then neither an interpreter upgrade nor the GIL being temporarily released during a C/C++ call can leave the state inconsistent.

    2. Python's strengths are that it is easy to write and easy to glue to other libraries. Because Python variables are dynamic, every access creates or reads objects on the heap and cannot make good use of registers, so CPU-bound computation should be pushed down into C or C++ and run under a multi-thread model to maximize throughput.

    3. Although calling into C/C++ releases the GIL, locking inside the C/C++ code can still limit throughput, so you still need to understand the blocking behavior of the modules you depend on.

    4. If the computation itself runs in Python, be careful with the multi-thread model; consider the multi-process model to improve throughput.

    5. Given these characteristics, Python is best used as the connector of a program rather than the executor: implement the building blocks in an efficient language and compose them quickly in Python, getting both iteration speed and throughput.

    For example, in TensorFlow the graph definition changes quickly while executing a defined graph is generic, so the graph is defined in Python and the actual execution runs in C++, minimizing GIL contention and playing to each language's strengths.


          Original link: https://www.haomeiwen.com/subject/frtgnxtx.html