Python in Practice - Section 7: Data Scraping with a Multiprocess Crawler

Author: 辉叔不太萌 | Published 2016-11-12 22:35

Notes

The relationship between processes and threads

In Python, the multiprocessing module can be used to run work in multiple processes.

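from multiprocessing import Pool

# Let Pool choose the number of worker processes (defaults to the CPU count)
pool = Pool()
# Or set the number of worker processes explicitly
pool = Pool(processes=6)

# Call the function once per item in the list, spread across the workers
# (function_name and para_list are placeholders for your own function and argument list)
pool.map(function_name, para_list)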

Declaring the "main function" entry point:
```
if __name__ == '__main__':
```
For an explanation of what this means, see: https://zhuanlan.zhihu.com/p/21297237
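To show how the pool and the entry-point guard fit together, here is a minimal sketch. The names `crawl_page` and `crawl_all` and the example.com URLs are assumptions made for illustration, and the worker only builds a record instead of downloading anything.

```python
from multiprocessing import Pool


def crawl_page(url):
    """Hypothetical worker: stands in for downloading and parsing one page."""
    # A real crawler would request the URL here; this sketch only builds a record.
    return {'url': url, 'length': len(url)}


def crawl_all(urls, processes=None):
    """Run crawl_page over every URL using a pool of worker processes."""
    # processes=None lets Pool use the number of CPUs reported by the OS.
    with Pool(processes=processes) as pool:
        # map() blocks until all results are ready and keeps the input order.
        return pool.map(crawl_page, urls)


if __name__ == '__main__':
    # The guard stops worker processes from re-running this block
    # when they import the module.
    page_urls = ['http://example.com/page/{}'.format(i) for i in range(1, 5)]
    for record in crawl_all(page_urls, processes=2):
        print(record)
```

The worker is defined at module level so that the pool can hand it to the worker processes, and everything that starts the pool sits under the guard.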

Homework

Approach
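- When a url is stored in the database, add a flag to its record that says whether the url has been crawled.
- When the url is crawled and processed, update the value of that flag.
- When a crawl is interrupted and started again, query only the urls that have not been crawled yet and continue from there.

A code sketch follows: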

import pymongo

client = pymongo.MongoClient('localhost', 27017)
test1 = client['test1']
table1 = test1['table1']

# When storing a url, add a flag recording whether it has been crawled (flag=False)
table1.insert_one({'url': 'urlxxxxxxxx', 'flag': False})

# When processing, select only the urls that have not been crawled yet
pending = table1.find({'flag': False})
for row in pending:
    # Crawl and process the url here
    print(row['_id'])

    # After crawling, mark the url as crawled (flag=True)
    # update_one replaces the deprecated Collection.update
    table1.update_one({'_id': row['_id']}, {'$set': {'flag': True}})
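The resume behaviour can be tried without a running MongoDB. The sketch below swaps the collection for a plain in-memory dict, so it is only a stand-in for the approach described above; `add_url`, `pending_urls`, `crawl_pending` and the `limit` parameter are hypothetical names introduced for this example.

```python
def add_url(store, url):
    """Record a url with flag=False, unless it is already stored."""
    store.setdefault(url, {'url': url, 'flag': False})


def pending_urls(store):
    """Return the urls whose flag is still False, in insertion order."""
    return [rec['url'] for rec in store.values() if not rec['flag']]


def crawl_pending(store, limit=None):
    """Process pending urls and mark each as crawled; stop after `limit` urls."""
    done = []
    for url in pending_urls(store)[:limit]:
        # A real crawler would download and parse the page here.
        store[url]['flag'] = True
        done.append(url)
    return done


store = {}  # stand-in for the MongoDB collection
for u in ['url-a', 'url-b', 'url-c']:
    add_url(store, u)

first_run = crawl_pending(store, limit=2)  # simulate a run interrupted after 2 urls
second_run = crawl_pending(store)          # restart: only the remaining url is picked up
print(first_run, second_run)
```

Because the flag is flipped right after each url is processed, a restart never repeats finished work; at worst it repeats the single url that was in progress when the crawl stopped.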

Article link: https://www.haomeiwen.com/subject/dygqpttx.html
