Implementing POST Pagination Requests in Scrapy
The Scrapy framework sends GET requests by default. To send POST requests, you need to override Scrapy's start_requests method.
# Understanding the return value of start_requests
def start_requests(self):
    url = ""
    data = {}
    headers = {}
    yield scrapy.FormRequest(url=url,             # the POST target URL
                             formdata=data,       # the POST payload, a dict
                             headers=headers,     # custom headers; these can also be set in settings.py
                             callback=self.parse  # the callback function
                             )
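FormRequest encodes formdata as application/x-www-form-urlencoded, which is what most form-driven pages expect. If an endpoint takes a raw JSON body instead, a plain scrapy.Request with method='POST' works as well; a minimal sketch, with url and data as placeholders:

import json
import scrapy

def start_requests(self):
    url = ""   # placeholder endpoint
    data = {}  # placeholder payload
    # Send the payload as a JSON body rather than as form fields
    yield scrapy.Request(url=url,
                         method='POST',
                         body=json.dumps(data),
                         headers={'Content-Type': 'application/json'},
                         callback=self.parse)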
To implement POST pagination in Scrapy, only the start_requests method needs to be overridden; the spider source differs slightly from the GET version, while everything else (settings, items, pipelines, etc.) is handled exactly as with GET requests.
To create the project, see Scrapy Series Part 2: Creating a Scrapy Project.
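If you want to scaffold the same layout from scratch, the standard Scrapy commands below generate the ktgg project and the fyktgg spider used here (both names are taken from the source code that follows):

scrapy startproject ktgg
cd ktgg
scrapy genspider fyktgg hshfy.sh.cn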
The reference source code is as follows:
# -*- coding: utf-8 -*-
import re
import scrapy
from ktgg.items import KtggItem

# Target site:
# http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search.jsp?zd=splc

class FyktggSpider(scrapy.Spider):
    name = 'fyktgg'
    allowed_domains = ['hshfy.sh.cn']
    start_urls = ['http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search_content.jsp']

    # Scrapy sends GET requests by default; to send POST requests,
    # override start_requests (the method name is fixed).
    def start_requests(self):
        data = {"yzm": "WFi4",
                "ft": "",
                "ktrqks": "2021-03-13",
                "ktrqjs": "2021-04-13",
                "spc": "",
                "yg": "",
                "bg": "",
                "ah": "",
                "pagesnum": "3"}
        yield scrapy.FormRequest(url=self.start_urls[0], formdata=data, callback=self.parse)

        # Using a proxy IP:
        # ip = str(json.dumps(IpProxy.getRandomIP())).replace('"', '')
        # proxies = {
        #     'http': 'http://' + str(ip),
        #     'https': 'https://' + str(ip),
        # }
        # yield scrapy.FormRequest(url=self.start_urls[0], formdata=data, callback=self.parse, meta={'proxies': proxies})
    # Pagination and parsing
    def parse(self, response):
        # Parse the data on the current page
        now_page = response.xpath('//span[@class="current"]/text()').extract()[0].strip()
        print("Crawling page {}:".format(now_page))
        trs = response.xpath('//table[@id="report"]/tbody/tr')[1:]
        for tr in trs:
            # Create a KtggItem instance
            item = KtggItem()
            item['fy'] = tr.xpath('./td[1]/font/text()').extract()[0].strip()
            item['ft'] = tr.xpath('./td[2]/font/text()').extract()[0].strip()
            item['ktrq'] = tr.xpath('./td[3]/text()').extract()[0].strip()
            item['ah'] = tr.xpath('./td[4]/text()').extract()[0].strip()
            item['ay'] = tr.xpath('./td[5]/text()').extract()[0].strip()
            item['cbbm'] = tr.xpath('./td[6]/div/text()').extract()[0].strip()
            item['spz'] = tr.xpath('./td[7]/div/text()').extract()[0].strip()
            item['yg'] = tr.xpath('./td[8]/text()').extract()[0].strip()
            item['bg'] = tr.xpath('./td[9]/text()').extract()[0].strip()
            # Hand the item off to the pipeline (pipelines.py)
            yield item

        # Crawl the next page
        next_page = re.findall(r"\d+", response.xpath('//div[@class="meneame"]/div/a[12]/@href').extract()[0].strip())[0]
        if next_page:
            data = {"yzm": "WFi4",
                    "ft": "",
                    "ktrqks": "2021-03-13",
                    "ktrqjs": "2021-04-13",
                    "spc": "",
                    "yg": "",
                    "bg": "",
                    "ah": "",
                    "pagesnum": "{}".format(next_page)}
            yield scrapy.FormRequest(url=self.start_urls[0], formdata=data, callback=self.parse)
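Note that the next-page lookup above calls extract()[0] directly, so it raises IndexError once the "next" link disappears on the last page. A more defensive sketch of the same tail of parse(), using .get(), which returns None when the XPath matches nothing:

        # Defensive variant of the next-page step (sketch)
        href = response.xpath('//div[@class="meneame"]/div/a[12]/@href').get()
        if href:  # .get() returns None when the "next" link is absent
            match = re.search(r"\d+", href)
            if match:
                data = {"yzm": "WFi4", "ft": "", "ktrqks": "2021-03-13",
                        "ktrqjs": "2021-04-13", "spc": "", "yg": "",
                        "bg": "", "ah": "", "pagesnum": match.group()}
                yield scrapy.FormRequest(url=self.start_urls[0],
                                         formdata=data, callback=self.parse)

The spider is then run from the project root with scrapy crawl fyktgg.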
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class KtggItem(scrapy.Item):
    # define the fields for your item here like:
    fy = scrapy.Field()    # court
    ft = scrapy.Field()    # courtroom
    ktrq = scrapy.Field()  # hearing date
    ah = scrapy.Field()    # case number
    ay = scrapy.Field()    # cause of action
    cbbm = scrapy.Field()  # handling department
    spz = scrapy.Field()   # presiding judge
    yg = scrapy.Field()    # plaintiff
    bg = scrapy.Field()    # defendant
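Pipelines are wired up exactly as in the GET case (see Part 3 of the series). For completeness, a minimal hypothetical pipeline that writes each item as one JSON line; the class and the ktgg.json file name are assumptions, and it must be enabled under ITEM_PIPELINES in settings.py:

# pipelines.py (minimal sketch, not from the original post)
import json

class KtggPipeline:
    def open_spider(self, spider):
        # ktgg.json is a hypothetical output file name
        self.file = open('ktgg.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Serialize the item as one JSON object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()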
Series links
Scrapy Series Part 1: The Scrapy Architecture
https://www.jianshu.com/p/39b326f9cad6
Scrapy Series Part 2: Creating a Scrapy Project
https://www.jianshu.com/p/00d99a9628b0
Scrapy Series Part 3: Data Processing, Persistence, and Common Problems
https://www.jianshu.com/p/8824623b551c
Scrapy Series Part 4: Storing Data in a Database (MongoDB, MySQL)
https://www.jianshu.com/p/573ca74c2277
Reference:
Simple POST requests with Scrapy