Project (bot) name
BOT_NAME = 'qidianwang'
Modules where Scrapy looks for spiders, and where new spiders are created
SPIDER_MODULES = ['qidianwang.spiders']
NEWSPIDER_MODULE = 'qidianwang.spiders'
Crawl responsibly by identifying yourself (and your website) on the user-agent
Set the User-Agent string (can be used to imitate a browser)
USER_AGENT = 'qidianwang (+http://www.yourdomain.com)'
Obey robots.txt rules
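Whether to obey robots.txt. The settings.py generated by scrapy startproject sets this to True (obey); it is turned off here.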
ROBOTSTXT_OBEY = False
Configure maximum concurrent requests performed by Scrapy (default: 16)
Maximum number of concurrent requests Scrapy will issue (default: 16)
CONCURRENT_REQUESTS = 32
Configure a delay for requests for the same website (default: 0)
See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
See also autothrottle settings and docs
Download delay in seconds (default: 0)
DOWNLOAD_DELAY = 0
The download delay setting will honor only one of:
Maximum number of concurrent requests allowed per domain (default: 8)
CONCURRENT_REQUESTS_PER_DOMAIN = 16
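Maximum number of concurrent requests allowed per IP (default: 0, meaning disabled)
1. When non-zero, CONCURRENT_REQUESTS_PER_IP takes precedence and CONCURRENT_REQUESTS_PER_DOMAIN is ignored.
2. When non-zero, DOWNLOAD_DELAY is enforced per IP rather than per domain.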
CONCURRENT_REQUESTS_PER_IP = 16
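The precedence rule above can be sketched in plain Python. This is only an illustration of the rule as described in the comments, not Scrapy's actual downloader code; the function name `effective_limit` and the example domain and IP are assumptions made up for this sketch.

```python
def effective_limit(settings, domain, ip):
    """Return (slot_key, max_concurrency) for a request.

    Hypothetical helper: mirrors the documented rule that a non-zero
    CONCURRENT_REQUESTS_PER_IP replaces the per-domain limit.
    """
    per_ip = settings.get("CONCURRENT_REQUESTS_PER_IP", 0)
    if per_ip:
        # Per-IP limiting is active: requests are grouped by IP address.
        return ip, per_ip
    # Otherwise requests are grouped by domain (documented default: 8).
    return domain, settings.get("CONCURRENT_REQUESTS_PER_DOMAIN", 8)


# Example values only; the address is from a documentation-reserved range.
per_ip_on = {"CONCURRENT_REQUESTS_PER_DOMAIN": 16, "CONCURRENT_REQUESTS_PER_IP": 4}
per_ip_off = {"CONCURRENT_REQUESTS_PER_DOMAIN": 16, "CONCURRENT_REQUESTS_PER_IP": 0}

print(effective_limit(per_ip_on, "www.qidian.com", "203.0.113.7"))
print(effective_limit(per_ip_off, "www.qidian.com", "203.0.113.7"))
```

With both settings at 16, as in this file, the per-domain value has no effect because the per-IP limit is non-zero.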
Disable cookies (enabled by default)
Whether to send and store cookies (default: True, cookies enabled)
COOKIES_ENABLED = False
COOKIES_DEBUG defaults to False (cookies are not logged); when True, cookies sent in requests and received in responses are logged
COOKIES_DEBUG = True
Disable Telnet Console (enabled by default)
The telnet console is an extension that lets you inspect the state of the running crawler over telnet; it is enabled by default (True)
TELNETCONSOLE_ENABLED = False
Override the default request headers:
Request header settings
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
Enable or disable spider middlewares
See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
Spider middlewares
SPIDER_MIDDLEWARES = {
'qidianwang.middlewares.QidianwangSpiderMiddleware': 543,
}
Enable or disable downloader middlewares
See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
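Downloader middlewares: custom downloader middlewares must be activated here. A lower number places the middleware closer to the engine, so its process_request() runs earlier and its process_response() runs later.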
DOWNLOADER_MIDDLEWARES = {
'qidianwang.middlewares.QidianUserAgentDownloadmiddlerware': 543,
# 'qidianwang.middlewares.QidianProxyDownloadMiddlerware':544,
# 'qidianwang.middlewares.SeleniumDownlaodMiddlerware':543,
}
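To make the effect of the numbers concrete, here is a small sketch of how the call order follows from a DOWNLOADER_MIDDLEWARES dict. It is a simplified model, not Scrapy's own code: `call_order` is a hypothetical helper, the proxy middleware (commented out in the settings above) is enabled here purely for illustration, and the `None` entry shows the documented way to switch a middleware off.

```python
def call_order(middlewares):
    """Return (process_request order, process_response order).

    Hypothetical helper modelling the documented behaviour: entries set
    to None are disabled, requests pass through middlewares in ascending
    number order, and responses pass back through in the reverse order.
    """
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    request_order = sorted(enabled, key=enabled.get)
    response_order = list(reversed(request_order))
    return request_order, response_order


example = {
    'qidianwang.middlewares.QidianUserAgentDownloadmiddlerware': 543,
    'qidianwang.middlewares.QidianProxyDownloadMiddlerware': 544,
    # Setting a built-in middleware to None disables it.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

requests_first, responses_first = call_order(example)
print(requests_first)
print(responses_first)
```

In this model the User-Agent middleware (543) sees each request before the proxy middleware (544), and sees each response after it.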
Enable or disable extensions
See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS: register extensions here (setting an entry to None disables that extension)
EXTENSIONS = {
'scrapy.extensions.telnet.TelnetConsole': None,
}
Configure item pipelines
See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
Activate item pipelines; the lower the number, the earlier the pipeline runs
ITEM_PIPELINES = {
'qidianwang.pipelines.QidianwangPipeline': 300,
'scrapy_redis.pipelines.RedisPipeline': 400,
}
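For readers who have not written a pipeline yet, the following is a minimal sketch of the kind of class that ITEM_PIPELINES points at. The source does not show what QidianwangPipeline actually does, so `StripWhitespacePipeline` and its behaviour are assumptions for illustration only; the `process_item(self, item, spider)` method is the hook Scrapy calls for every item.

```python
class StripWhitespacePipeline:
    """Hypothetical pipeline: trims string fields, then passes the item on."""

    def process_item(self, item, spider):
        # Iterate over a copy of the pairs so the item can be updated safely.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        # Returning the item hands it to the next pipeline in number order.
        return item


# Stand-alone usage with a plain dict in place of a scraped item.
pipeline = StripWhitespacePipeline()
cleaned = pipeline.process_item({"title": "  Example title \n", "chapters": 12}, spider=None)
print(cleaned)
```

Registered at 300, a class like this would process each item before scrapy_redis.pipelines.RedisPipeline at 400 receives it.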