1. First, create the project and the spider file from cmd. Here I named the spider ipip:
scrapy startproject <project_name>
scrapy genspider <spider_name> <domain>
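For this walkthrough the concrete commands would look like the following (the project name IPPOOLTest01 is inferred from the middleware path used in step 7, so adjust it if yours differs):

scrapy startproject IPPOOLTest01
cd IPPOOLTest01
scrapy genspider ipip baidu.com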
2. Configure settings.py: turn off robots.txt compliance and set a USER_AGENT.
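In settings.py that amounts to two lines. A minimal sketch; the User-Agent string below is only an example, and any common browser UA will do:

# settings.py
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'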
3. Now write a short snippet in the ipip spider file that saves the page showing our local IP first:
# -*- coding: utf-8 -*-
import scrapy


class IpipSpider(scrapy.Spider):
    name = 'ipip'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?ie=utf-8&wd=ip']

    def parse(self, response):
        # Save the page so we can open it and see which IP Baidu detects
        with open("ip.html", 'w', encoding='utf-8') as fp:
            fp.write(response.text)
4. Give it a run with scrapy crawl ipip, then open the generated ip.html to check the IP Baidu sees.
5. Scrape a handful of proxy IPs from a free proxy-list site. Then define a field in settings.py to hold the proxies we collected:
IPPOOL = [
    {"ip": "113.16.160.101:8118"},
    {"ip": "119.29.119.64:8080"},
    {"ip": "202.112.237.102:3128"},
    {"ip": "119.31.210.170:7777"},
    {"ip": "183.129.207.83:10800"},
    {"ip": "183.129.207.73:14823"}
]
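Free proxies die quickly, so it can help to check which pool entries are still alive before wiring them into Scrapy. A minimal standalone sketch, assuming the third-party requests library and httpbin.org/ip as an echo endpoint (neither is part of the original tutorial):

import requests

# A couple of entries copied from IPPOOL above; extend with the rest as needed
POOL = [
    {"ip": "113.16.160.101:8118"},
    {"ip": "119.29.119.64:8080"},
]

for entry in POOL:
    addr = "http://" + entry["ip"]
    try:
        # httpbin echoes back the IP it sees; a live proxy answers with its own IP
        resp = requests.get("http://httpbin.org/ip",
                            proxies={"http": addr, "https": addr},
                            timeout=5)
        print(entry["ip"], "alive:", resp.json())
    except requests.RequestException:
        print(entry["ip"], "dead or too slow")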
6. Configure the middleware file, middlewares.py:
# Import IPPOOL from the settings file
from .settings import IPPOOL
# Import the built-in HttpProxyMiddleware (the old scrapy.contrib path has been removed)
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
import random


# A proxy middleware class inheriting from the HttpProxyMiddleware imported above
class IPPOOLS(HttpProxyMiddleware):
    def __init__(self, ip=''):
        self.ip = ip

    # Override the request-processing hook
    def process_request(self, request, spider):
        # Pick a random IP from the proxy pool
        current_ip = random.choice(IPPOOL)
        print("Current proxy IP:", current_ip['ip'])
        # Point the request object's proxy server at the chosen IP
        request.meta["proxy"] = 'https://' + current_ip["ip"]
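Note that process_request is overridden wholesale here, so inheriting from HttpProxyMiddleware is mostly nominal; subclassing a plain object would behave the same. Also, whether the proxy URL should start with http:// or https:// depends on what the proxy itself supports, so if requests hang, try the other scheme.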
7. In settings.py, enable the downloader middleware and update the code:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    'IPPOOLTest01.middlewares.IPPOOLS': 125
}
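Scrapy calls process_request on downloader middlewares in ascending order of these numbers, so the custom IPPOOLS (125) runs after the built-in HttpProxyMiddleware (123) and its proxy choice wins. The IPPOOLTest01 prefix is the project name; change it to match your own project.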
8. Run it again with scrapy crawl ipip. Each request should now print the proxy IP it goes through.