I've recently been optimizing the performance of Django database operations. Because of Python's GIL, I introduced multiprocessing, and along the way I ran into a very strange lost-database-connection problem that made the program abort:
errcode is 2013, errmsg is 'Lost connection to MySQL server during query'
After a long investigation, it became clear that the error is tied to the process fork mechanism: a forked child process gets a complete copy of the parent process's address space.
The original program logic was:
1. The main process queries the data set of pending tasks
2. It then creates a process pool and hands the task data set over to the pool
In other words:
1. The main process establishes a database connection
2. When the child processes are forked, the database connection is forked along with them (roughly speaking, the connection resource is shared by reference; I'm not sure that description is strictly accurate)
3. Once the main process finishes its work and releases the database connection, any database operation in a child process then fails with a lost connection; the sketch below shows how the connection is inherited
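A quick way to see this inheritance is to check Django's connection wrapper in both the parent and a forked child. The sketch below is purely diagnostic and assumes a configured Django project; it uses os.fork directly instead of a Pool, and show_inherited_connection is just a throwaway name for illustration:

import os
from django.db import connection

def show_inherited_connection():
    # Issue any statement so the parent actually opens a connection.
    connection.cursor()
    print 'parent holds a connection:', connection.connection is not None
    pid = os.fork()
    if pid == 0:
        # Child: the connection wrapper was copied by fork, so it already holds
        # a "live" MySQLdb connection sharing the parent's TCP socket to MySQL.
        print 'child inherited a connection:', connection.connection is not None
        os._exit(0)
    os.waitpid(pid, 0)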
Solutions
1. Release the database connection in the main process before creating the process pool (in my tests this did not solve the problem; see the sketch after this list)
2. Do no database operations in the main process and let the child processes grab tasks themselves
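For reference, option 1 boils down to dropping every open ORM connection right before the pool is created, so that each process reconnects lazily on its next query. This is only a sketch, assuming Django 1.8+ (connections.close_all()) and reusing process_lost and data_list from the reproduction code below; as noted, it did not fix the problem in my tests:

from django.db import connections
from multiprocessing import Pool

def run_after_closing_connections(data_list):
    # Close all ORM connections in the parent before forking; every process
    # then opens its own fresh connection on the next query it runs.
    connections.close_all()
    pool = Pool(5)
    pool.map(process_lost, data_list)
    pool.close()
    pool.join()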
Sample code
Reproducing the problem
Query the database in the main process, then create a process pool and run database operations in the child processes:
import time
from multiprocessing import Pool

# ReviewRetraceKey is a project-specific Django model; import it from your own app.

def process_lost(key):
    # Runs in a child process: every query below goes through the ORM
    # connection inherited from the parent via fork.
    i = 0
    while True:
        if i > 5:
            return
        time.sleep(2)
        keys = ReviewRetraceKey.objects.filter(job_id__in=[6631860103316242436]).order_by('-create_time')
        print keys
        i += 1

def lost_connection():
    '''
    Lost connection to MySQL server during query
    :return:
    '''
    # The main process opens the database connection here ...
    keys = ReviewRetraceKey.objects.filter(job_id__in=[6631860103316242436]).order_by('-create_time')
    data_list = list()
    for i in range(5):
        data_list.extend(keys)
    print len(data_list)
    # ... and the forked workers inherit that same connection.
    pool = Pool(5)
    pool.map(process_lost, data_list)
    pool.close()
    pool.join()

lost_connection()
The output shows the "Lost connection to MySQL server" error:
Traceback (most recent call last):
File "test_data_model.py", line 77, in <module>
lost_connection()
File "test_data_model.py", line 38, in lost_connection
pool.map(process_lost, data_list)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
_mysql_exceptions.OperationalError: errcode is 2013, errmsg is 'Lost connection to MySQL server during query'
Improved approach
The main process only creates the process pool; the child processes grab and handle the tasks themselves:
import time
from multiprocessing import Pool, Lock, Manager

# ReviewRetraceKey is a project-specific Django model; import it from your own app.

lock = Lock()
manager = Manager()
map = manager.dict()  # shared dict: job index -> index of the worker that claimed it

def process_un_lost(inx):
    job = None
    try:
        # Simulate job claiming; the lock serializes the competition between workers.
        lock.acquire()
        for j in range(5):
            if j in map:
                print j, 'in map continue', inx
            else:
                map[j] = inx
                job = j
                break
    finally:
        lock.release()
    print 'process = ', inx, 'job = ', job
    i = 0
    while True:
        if i > 5:
            return
        time.sleep(2)
        # The first query in this child opens a fresh connection, because the
        # parent never touched the database before forking.
        keys = ReviewRetraceKey.objects.filter(job_id__in=[6631860103316242436]).order_by('-create_time')
        print inx, keys
        i += 1

def no_lost_connection():
    pool = Pool(5)
    pool.map(process_un_lost, [i for i in range(6, 11)])
    pool.close()
    pool.join()

no_lost_connection()
print map
The program runs normally:
{0: 6, 1: 7, 2: 8, 3: 9, 4: 10}
Summary
1. I don't know much about low-level mechanisms such as process fork
2. CPython's GIL makes concurrent programming in Python considerably more complex than in Java or Go
3. ORM frameworks simplify the code, but in some scenarios the underlying native tools are easier to keep under control (a sketch follows below)
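To illustrate point 3: a worker can bypass the inherited ORM connection entirely by opening its own MySQLdb connection after the fork. The host, credentials, and table name below are placeholders, so treat this as a sketch of the "native" approach rather than what the project actually uses:

import MySQLdb

def query_job_keys(job_id):
    # A brand-new connection created inside the child process; nothing is
    # inherited from the parent across fork.
    conn = MySQLdb.connect(host='127.0.0.1', user='app', passwd='secret', db='review')
    try:
        cursor = conn.cursor()
        cursor.execute(
            'SELECT * FROM review_retrace_key WHERE job_id = %s ORDER BY create_time DESC',
            (job_id,))
        return cursor.fetchall()
    finally:
        conn.close()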