There are two main ways to throttle a crawl
The first combines DOWNLOAD_DELAY with either CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 0.75
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0
- When CONCURRENT_REQUESTS_PER_IP is non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored
- With RANDOMIZE_DOWNLOAD_DELAY = True, the delay actually used is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY
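The randomization rule can be sketched in plain Python. This is only an illustration of the 0.5x to 1.5x range, not Scrapy's own code, and the function name effective_delay is hypothetical.

```python
import random


def effective_delay(download_delay, randomize=True):
    """Return the wait before the next request to the same site.

    Hypothetical helper for illustration; not part of Scrapy's API.
    """
    if randomize:
        # Pick a value uniformly between 0.5x and 1.5x the configured delay.
        return random.uniform(0.5 * download_delay, 1.5 * download_delay)
    # Without randomization the configured delay is used as-is.
    return download_delay


if __name__ == "__main__":
    # With DOWNLOAD_DELAY = 0.75 each wait falls between 0.375 and 1.125 seconds.
    for _ in range(3):
        print(round(effective_delay(0.75), 3))
```

The average wait stays at DOWNLOAD_DELAY, but the spacing between requests is less regular and so looks less like a bot.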
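Concurrency limits
- CONCURRENT_REQUESTS: maximum number of requests the downloader performs concurrently, across all sites
- CONCURRENT_REQUESTS_PER_IP: maximum number of concurrent requests to a single IP
- CONCURRENT_REQUESTS_PER_DOMAIN: maximum number of concurrent requests to a single domain; ignored when CONCURRENT_REQUESTS_PER_IP is non-zero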
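The precedence between the two per-site limits can be written as a small function. This is a sketch under the assumption that the per-IP value simply replaces the per-domain value when it is non-zero; slot_concurrency is a hypothetical name, not a Scrapy function.

```python
def slot_concurrency(per_domain, per_ip):
    """Return (grouping, limit) for the per-site concurrency cap.

    Hypothetical helper for illustration; not part of Scrapy's API.
    """
    if per_ip:
        # A non-zero per-IP limit takes over; the per-domain limit is ignored.
        return ("ip", per_ip)
    # Otherwise requests are grouped and capped per domain.
    return ("domain", per_domain)


if __name__ == "__main__":
    print(slot_concurrency(per_domain=8, per_ip=0))
    print(slot_concurrency(per_domain=8, per_ip=1))
```

Either way, the total number of requests in flight is still bounded by CONCURRENT_REQUESTS.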
The second uses the AutoThrottle extension
The detailed algorithm is the _adjust_delay method in scrapy.extensions.throttle
DOWNLOAD_DELAY = 12
CONCURRENT_REQUESTS_PER_IP = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = True
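- At startup the delay is AUTOTHROTTLE_START_DELAY, so previous_delay = AUTOTHROTTLE_START_DELAY (raised to DOWNLOAD_DELAY when that is larger)
- After each response, latency is the time the response took to arrive, and the target delay is target_delay = latency / AUTOTHROTTLE_TARGET_CONCURRENCY
- next_delay = max(target_delay, (previous_delay + target_delay) / 2)
- A non-200 response is never allowed to lower the delay
- The delay is clamped so that it is never below DOWNLOAD_DELAY or above AUTOTHROTTLE_MAX_DELAY
- AutoThrottle adapts the delay to how quickly the server responds. It does not conflict with DOWNLOAD_DELAY: DOWNLOAD_DELAY is the floor for the adjusted delay, so with the settings above the delay stays between 12 and 60 seconds.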
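A minimal sketch of one adjustment step, based on a reading of _adjust_delay in scrapy.extensions.throttle. The real method updates a downloader slot in place; the standalone function and its parameter names here are illustrative assumptions, with min_delay standing for DOWNLOAD_DELAY and max_delay for AUTOTHROTTLE_MAX_DELAY.

```python
def adjust_delay(previous_delay, latency, status,
                 target_concurrency=1.0, min_delay=12.0, max_delay=60.0):
    """Return the next download delay after one response.

    Illustrative sketch; not Scrapy's API.
    """
    # Delay that would keep `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay toward the target.
    new_delay = (previous_delay + target_delay) / 2.0
    # If the target is larger than that average, jump straight to the target.
    new_delay = max(target_delay, new_delay)
    # Clamp into [min_delay, max_delay].
    new_delay = min(max(min_delay, new_delay), max_delay)
    # A non-200 response may not lower the delay.
    if status != 200 and new_delay <= previous_delay:
        return previous_delay
    return new_delay


if __name__ == "__main__":
    delay = 1.0
    # Feed a few made-up latencies (seconds) with no minimum delay configured.
    for latency, status in [(0.4, 200), (0.6, 200), (3.0, 200), (0.2, 404)]:
        delay = adjust_delay(delay, latency, status, min_delay=0.0)
        print(latency, status, delay)
```

With min_delay left at 12, a fast server never pulls the delay below 12 seconds, which is why a large DOWNLOAD_DELAY effectively pins AutoThrottle to that value.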