-
Python libraries
- requests
- urllib
- BeautifulSoup
- time
-
Goal
- Collect 100 captcha images from the Douban home page
-
Time
- 1h
-
Problem
- Slow down the crawl (the implementation pauses 1 s between requests)
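
The script in this post slows itself down with a fixed `time.sleep(1)`. As a possible variation, the sketch below adds a small random jitter to each pause. The names `jittered_delay` and `throttled` and the 0.5 s jitter are assumptions made for illustration and are not part of the original script.

```python
import random
import time


def jittered_delay(base_seconds=1.0, jitter_seconds=0.5, rng=random):
    """Return a wait time in [base_seconds, base_seconds + jitter_seconds]."""
    return base_seconds + rng.uniform(0.0, jitter_seconds)


def throttled(items, base_seconds=1.0, jitter_seconds=0.5, sleep=time.sleep):
    """Yield items one by one, pausing before every item except the first."""
    for index, item in enumerate(items):
        if index > 0:
            # The sleep function is injectable so the pacing can be tested
            # without actually waiting.
            sleep(jittered_delay(base_seconds, jitter_seconds))
        yield item
```

With this helper, a loop such as `for n in throttled(range(1, 101)):` would wait roughly 1 to 1.5 seconds between iterations, so no separate `time.sleep` call is needed in the loop body.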
-
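Pseudocode
- Open the Douban home page
- Use Chrome to locate the captcha on the page
- Analyze the HTML structure
- Extract the captcha link
- Save the link, download the captcha, and name the file
- Refresh the Douban home page
- Repeat the steps above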
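
The "analyze the HTML structure" and "extract the captcha link" steps can be tried offline. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The id `captcha_image` comes from the script in this post; the class name, the function name, and the sample markup are hypothetical, and the real page may be structured differently.

```python
from html.parser import HTMLParser


class CaptchaImageFinder(HTMLParser):
    """Records the src of the first <img> whose id matches target_id."""

    def __init__(self, target_id="captcha_image"):
        super().__init__()
        self.target_id = target_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "img" and self.src is None:
            attr_map = dict(attrs)
            if attr_map.get("id") == self.target_id:
                self.src = attr_map.get("src")


def extract_captcha_url(html, target_id="captcha_image"):
    """Return the captcha image URL found in html, or None if it is absent."""
    finder = CaptchaImageFinder(target_id)
    finder.feed(html)
    return finder.src


# Hypothetical markup for illustration only; the real page may differ.
sample_html = (
    '<form><img id="captcha_image" '
    'src="https://example.com/captcha?id=abc"></form>'
)
print(extract_captcha_url(sample_html))
```

Returning `None` when the image is missing lets the calling loop skip that page instead of failing.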
-
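Implementation code

    # -*- coding: utf-8 -*-
    """
    Created on Sat Nov 4 14:45:13 2017

    @author: Howin

    Scrape Douban captchas
    """
    import time
    import urllib.request  # import the submodule explicitly; a bare "import urllib" does not guarantee it

    import requests
    from bs4 import BeautifulSoup

    # Browser-like User-Agent sent with every request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

    for n in range(1, 101):
        # Fetch the home page again on every iteration to get a fresh captcha
        url = 'https://www.douban.com'
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')

        # Locate the captcha <img> by its id; find() returns None if the page has no such tag
        captcha_img = soup.find('img', id='captcha_image')
        if captcha_img is not None:
            captcha_url = captcha_img['src']
            # Number the saved files captcha_1.jpg ... captcha_100.jpg
            captcha_name = 'D:/Dairly/Code/PY/爬取豆瓣/爬取数据/豆瓣验证码/captcha_{}.jpg'.format(n)
            urllib.request.urlretrieve(captcha_url, captcha_name)

        # Pause between requests to keep the crawl rate low
        time.sleep(1)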
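
The pseudocode lists "save the link" as a step, but the script only downloads and names the files. One way to record the links as well is sketched below: each captcha URL is appended to a CSV log next to its file path. The helper names, the CSV layout, and the column names are assumptions for illustration and do not come from the original script.

```python
import csv
import os


def captcha_filename(directory, n):
    """Build the numbered file path, e.g. <directory>/captcha_7.jpg."""
    return os.path.join(directory, "captcha_{}.jpg".format(n))


def append_link(log_path, n, captcha_url, file_path):
    """Append one row (index, source URL, saved file path) to a CSV log."""
    is_new = not os.path.exists(log_path)
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header row only when the log file is first created.
            writer.writerow(["n", "captcha_url", "file_path"])
        writer.writerow([n, captcha_url, file_path])
```

Inside the loop, `append_link(log_path, n, captcha_url, captcha_name)` could be called right after the download, where `log_path` is any writable location chosen by the reader.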