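pyspider code structure
The pyspider source is easiest to read in terms of the following modules:
- The utility classes in libs, for example the commonly used BaseHandler.
- processor, scheduler and resultdb. Of these, resultdb is the easiest to follow.
- The webui part, such as how the web layer loads scripts from the database.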
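Some debugging tips
pyspider starts several threads in run.py. Debugging multiple threads at once is awkward, so it is better to go through the modules one at a time, following the code structure in the previous section.
Take the simplest one, the resultdb module: it consists of a single file, result_worker.py, and once you understand that file you understand the module.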
    import time
    import json
    import logging
    from six.moves import queue as Queue

    logger = logging.getLogger("result")


    class ResultWorker(object):
        """
        do with result
        override this if needed.
        """

        def __init__(self, resultdb, inqueue):
            self.resultdb = resultdb
            self.inqueue = inqueue
            self._quit = False

        def on_result(self, task, result):
            '''Called every result'''
            if not result:
                return
            if 'taskid' in task and 'project' in task and 'url' in task:
                logger.info('result %s:%s %s -> %.30r' % (
                    task['project'], task['taskid'], task['url'], result))
                return self.resultdb.save(
                    project=task['project'],
                    taskid=task['taskid'],
                    url=task['url'],
                    result=result
                )
            else:
                logger.warning('result UNKNOW -> %.30r' % result)
                return

        def quit(self):
            self._quit = True

        def run(self):
            '''Run loop'''
            logger.info("result_worker starting...")

            while not self._quit:
                try:
                    # Wait up to one second for the next (task, result) pair.
                    task, result = self.inqueue.get(timeout=1)
                    self.on_result(task, result)
                except Queue.Empty as e:
                    continue
                except KeyboardInterrupt:
                    break
                except AssertionError as e:
                    logger.error(e)
                    continue
                except Exception as e:
                    logger.exception(e)
                    continue

            logger.info("result_worker exiting...")


    class OneResultWorker(ResultWorker):
        '''Result Worker for one mode, write results to stdout'''

        def on_result(self, task, result):
            '''Called every result'''
            if not result:
                return
            if 'taskid' in task and 'project' in task and 'url' in task:
                logger.info('result %s:%s %s -> %.30r' % (
                    task['project'], task['taskid'], task['url'], result))
                print(json.dumps({
                    'taskid': task['taskid'],
                    'project': task['project'],
                    'url': task['url'],
                    'result': result,
                    'updatetime': time.time()
                }))
            else:
                logger.warning('result UNKNOW -> %.30r' % result)
                return
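You can skip OneResultWorker. Functionally it does the same job as the class above, except that it prints each result to stdout as JSON instead of saving it to resultdb; it is presumably the code that gets called when pyspider is started in standalone ("one") mode. By the same reasoning, much of the code whose name starts with "one" can be skimmed over.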
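One way to apply the "one module at a time" tip is to drive the result worker by hand, away from the other threads. The sketch below is only an illustration under some assumptions: `FakeResultDB`, `MiniResultWorker` and `debug_one_result` are hypothetical names that are not part of pyspider, and `MiniResultWorker` is a trimmed stand-in for the ResultWorker listing above (logging and most exception handling removed) so that the example runs without pyspider installed. With pyspider available, the real class could probably be swapped in for the stand-in.

```python
import queue
import threading


class FakeResultDB:
    """Hypothetical in-memory stand-in for a real resultdb backend."""

    def __init__(self):
        self.saved = []
        self.got_one = threading.Event()

    def save(self, project, taskid, url, result):
        # Same keyword arguments that on_result passes to resultdb.save.
        self.saved.append((project, taskid, url, result))
        self.got_one.set()


class MiniResultWorker:
    """Trimmed stand-in for ResultWorker so the sketch runs without pyspider."""

    def __init__(self, resultdb, inqueue):
        self.resultdb = resultdb
        self.inqueue = inqueue
        self._quit = False

    def on_result(self, task, result):
        # Skip empty results and tasks missing the identifying keys.
        if not result:
            return
        if 'taskid' in task and 'project' in task and 'url' in task:
            return self.resultdb.save(project=task['project'],
                                      taskid=task['taskid'],
                                      url=task['url'],
                                      result=result)

    def quit(self):
        self._quit = True

    def run(self):
        # Same shape as the original loop: poll the queue until quit() is called.
        while not self._quit:
            try:
                task, result = self.inqueue.get(timeout=1)
            except queue.Empty:
                continue
            self.on_result(task, result)


def debug_one_result(task, result):
    """Push one (task, result) pair through the worker loop; return what was saved."""
    db = FakeResultDB()
    inqueue = queue.Queue()
    worker = MiniResultWorker(db, inqueue)
    thread = threading.Thread(target=worker.run)
    thread.start()
    inqueue.put((task, result))
    db.got_one.wait(timeout=5)  # block until save() has been called, or give up
    worker.quit()
    thread.join()
    return db.saved


if __name__ == '__main__':
    task = {'project': 'demo', 'taskid': 't1', 'url': 'http://example.com/'}
    print(debug_one_result(task, {'title': 'hello'}))
```

Because only one worker thread is running, a breakpoint set inside `on_result` is hit without interference from the scheduler, fetcher or processor threads. Calling `on_result` directly, without starting the thread at all, is simpler still when only the save logic is of interest.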