1. Installation
pip install Scrapy
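# On Python 3, installing Twisted may fail with "Microsoft Visual C++ 14.0 is required". Download Visual C++ Build Tools from the link given in the error message.
# If you see "no module named win32api":
# download the matching package from http://sourceforge.net/projects/pywin32/files/ (on current Python versions, pip install pywin32 also works).
# To install it into a virtual environment: switch to the virtual environment directory and run easy_install "xxx.exe".
# A virtual environment created by PyCharm may hit a permission error during installation. Create the virtual environment from the command line instead, then attach it in PyCharm by pointing to its python.exe.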
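Creating the virtual environment outside PyCharm can also be done from Python with the standard-library venv module. This is a minimal sketch; the helper name create_env and the directory name are assumptions for illustration, not part of the original notes.

```python
import os
import sys
import venv


def create_env(env_dir, with_pip=True):
    """Create a virtual environment and return the expected interpreter path."""
    venv.create(env_dir, with_pip=with_pip)
    # Windows places the interpreter in Scripts\, POSIX systems in bin/
    if sys.platform == "win32":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")
```

Calling create_env("scrapy_env") returns the interpreter path, which is the file to select when attaching the environment in PyCharm.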
2. Debug mode: scrapy shell
scrapy shell "http://xxx.xxx.com"
# If you get a 403, pass a User-Agent on the command line: scrapy shell "url" -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
# To set a default UA, edit site-packages\scrapy\settings\default_settings.py: USER_AGENT = "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0". Note that this file is overwritten whenever Scrapy is upgraded or reinstalled.
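As an alternative to editing the installed package, USER_AGENT can be set in the project's own settings.py, where it applies to that project only. A sketch, reusing the Chrome UA string from the command above:

```python
# settings.py of the Scrapy project (fragment)
# Same UA string as the command-line example, split across lines for readability.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
)
```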
3. Running in an IDE
Create a new Python file in the project root directory.
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'qiushibaike'])
# The third list element is the spider's name
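A slightly more defensive version of this launcher is sketched below. It switches to the script's own directory first, so the crawl starts from the project root regardless of the IDE's working directory, and it can pass setting overrides through the -s option. The helper names build_crawl_argv and run_spider are assumptions for illustration.

```python
import os


def build_crawl_argv(spider_name, overrides=None):
    """Build the argument list for 'scrapy crawl', with optional -s KEY=VALUE pairs."""
    argv = ["scrapy", "crawl", spider_name]
    for key, value in (overrides or {}).items():
        argv += ["-s", f"{key}={value}"]
    return argv


def run_spider(spider_name, overrides=None):
    """Run a spider from the directory containing this file (assumed project root)."""
    # Imported here so the helper above can be used without Scrapy installed
    from scrapy.cmdline import execute

    os.chdir(os.path.dirname(os.path.abspath(__file__)))
    execute(build_crawl_argv(spider_name, overrides))
```

For example, run_spider("qiushibaike", {"DOWNLOAD_DELAY": 0.25}) is equivalent to running scrapy crawl qiushibaike -s DOWNLOAD_DELAY=0.25 from the project root.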
4. Configuring settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
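# With these settings enabled, Scrapy reads pages that were already requested from the local cache instead of downloading them again (HTTPCACHE_EXPIRATION_SECS = 0 means cached responses never expire)
# Also in settings.py: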
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
# Sets the delay between the spider's requests
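Note that the actual wait is not fixed: with Scrapy's default RANDOMIZE_DOWNLOAD_DELAY = True, each wait is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The arithmetic can be sketched as follows; delay_range is a hypothetical helper, not a Scrapy function.

```python
def delay_range(download_delay):
    """Return the (min, max) wait in seconds when RANDOMIZE_DOWNLOAD_DELAY is enabled."""
    return (0.5 * download_delay, 1.5 * download_delay)


low, high = delay_range(0.25)  # bounds for DOWNLOAD_DELAY = 0.25
```

For DOWNLOAD_DELAY = 0.25 the wait falls between 0.125 and 0.375 seconds. Setting RANDOMIZE_DOWNLOAD_DELAY = False in settings.py makes the delay constant.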
5. "HTTP status code is not handled or not allowed" error
Add HTTPERROR_ALLOWED_CODES = [503] to settings.py
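Once 503 is allowed, those responses reach the spider callback like any other, so the callback has to check the status itself. A minimal sketch, assuming only that the response object exposes status and url as Scrapy's Response does; handle_response and the item fields are made up for illustration.

```python
def handle_response(response):
    """Yield one item per response, flagging 503s instead of parsing them."""
    if response.status == 503:
        # The server refused or throttled the request; do not parse the body
        yield {"url": response.url, "status": response.status, "blocked": True}
        return
    yield {"url": response.url, "status": response.status, "blocked": False}
```

Inside a spider, the parse method could delegate with yield from handle_response(response).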