Scrapy Crawler Framework in Practice: Scraping the Sunshine Platform

Author: 猛犸象和剑齿虎 | Published 2019-06-02 10:10



Target site: the Sunshine government affairs platform (阳光政务平台).

http://wz.sun0769.com/html/top/report.shtml


Analyze the pagination URL pattern

http://wz.sun0769.com/index.php/question/report?page=30 (page 2)
http://wz.sun0769.com/index.php/question/report?page=60 (page 3)
http://wz.sun0769.com/index.php/question/report?page=90 (page 4)
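
The URLs above suggest that the page parameter grows by 30 for each listing page. A minimal sketch of that rule follows, assuming the first page corresponds to page=0; the names BASE_URL, PAGE_STEP and page_url are hypothetical helpers introduced for illustration and are not part of the project.

```python
BASE_URL = "http://wz.sun0769.com/index.php/question/report?page="
PAGE_STEP = 30  # the page parameter increases by 30 per listing page


def page_url(page_number):
    """Return the listing URL for a 1-based page number (assumes page 1 is page=0)."""
    if page_number < 1:
        raise ValueError("page_number must be >= 1")
    return BASE_URL + str((page_number - 1) * PAGE_STEP)


if __name__ == "__main__":
    # Print the URLs for the first four listing pages
    for n in range(1, 5):
        print(n, page_url(n))
```

The spider later in this article applies the same idea by adding 30 to an offset after each listing page.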

Create the project

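Create a new folder.
Open the folder, hold Shift and right-click, then choose "Open command window here".
In the terminal that appears, enter: scrapy startproject sunspider to create the project.
Open the generated project folder and, in that directory, choose "Open command window here" again.
In the terminal that appears, enter: scrapy genspider sun sun.com to create the spider file.
Close cmd and drag the whole folder into PyCharm.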

Spider code

In items.py, define the target, that is, the content we want to scrape.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class SunspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # Post title
    title = scrapy.Field()
    # Post content
    content = scrapy.Field()
    # Post URL
    url = scrapy.Field()

Write the spider code in sun.py. The first problem to solve is pagination. Inspect the page structure to work out the XPath expressions.


# -*- coding: utf-8 -*-
import scrapy
from sunspider.items import SunspiderItem

#http://wz.sun0769.com/index.php/question/report?page=0
class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    url='http://wz.sun0769.com/index.php/question/questionType?type=4&page='
    offset=0
    start_urls = [url+str(offset)]

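    # Collect the URL of every post on a listing page
    def parse(self, response):
        # Extract the list of post links.
        # Scrapy responses have built-in XPath support, so the expression can be used directly.
        links=response.xpath("//div[@class='greyframe']/table//td/a[@class='news14']/@href").extract()

        # Request each post and handle the response with parse_item
        for link in links:
            yield scrapy.Request(link,callback=self.parse_item)
        # Follow the next listing page automatically.
        # offset is a class attribute, so it is read through self (there is no global named offset).
        if self.offset<=150:
            self.offset+=30
            # Request the next listing page
            yield scrapy.Request(self.url+str(self.offset),callback=self.parse)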

    # Scrape the content of a single post
    def parse_item(self,response):
        item=SunspiderItem()
        # url, title, content
        item['url']=response.url
        # Sample markup from a post page:
        # <span class="niae2_top">提问：大岭山中兴路两边停车位被私自占用</span>
        # <td class="txt16_3">&nbsp;&nbsp;&nbsp;&nbsp;我在两个月前曾经在这里投诉过，大岭山中兴路两边的公共停车位被路边商铺私自占用，他们用雪糕筒，电动自行车等摆放在停车位里，据为己用。当时大岭山城管局的回复是已经处理，以后会加强管理。但是最近我去到中兴路那边，不仅情况没有改善，反而更多的停车位被占用，城管局难道就是这样管理的？</td>
        item['title']=response.xpath("//span[@class='niae2_top']/text()").extract()[0]  # first element of the list
        item['content']=''.join(response.xpath("//td[@class='txt16_3']/text()").extract())
        yield item
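
The sample markup for the content cell starts with several &nbsp; entities, so the joined text fragments may begin with non-breaking spaces and contain stray line breaks. The sketch below shows one possible way to tidy such fragments before storing them. The helper name clean_text is hypothetical, and the article's spider does not perform this step; treat it as an optional refinement.

```python
def clean_text(fragments):
    """Join text fragments and collapse runs of whitespace into single spaces.

    str.split() with no arguments splits on any Unicode whitespace,
    which includes the non-breaking space (\xa0) produced by &nbsp;.
    """
    joined = "".join(fragments)
    return " ".join(joined.split())


if __name__ == "__main__":
    # Fragments shaped like the output of .extract() on a content cell
    sample = ["\xa0\xa0\xa0\xa0first part", "\r\n  second part "]
    print(clean_text(sample))
```

In the spider this could be applied to the list returned by extract() in place of the plain ''.join(...) call.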

The items returned by the spider are handed to the pipeline for processing.
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SunspiderPipeline(object):
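    def __init__(self):
        # Open in append mode; utf-8 reduces the chance of garbled characters
        self.filename=open('sun.txt','a',encoding='utf-8')
    def process_item(self, item, spider):
        # Build the text to write; the item must be converted to a string first
        content=str(item)+'\n\n'
        self.filename.write(content)
        return item
    def close_spider(self,spider):
        # Scrapy calls close_spider when the spider finishes; close the file here
        self.filename.close()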
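
The pipeline above writes str(item) to a text file, which is easy to read but awkward to parse later. As a hedged alternative, the sketch below writes one JSON object per line instead. The class name SunJsonLinesPipeline, the path parameter and the file name sun.jsonl are assumptions made for illustration; open_spider and close_spider are the hook method names Scrapy calls on a pipeline when a spider starts and finishes.

```python
import json


class SunJsonLinesPipeline(object):
    """Hypothetical pipeline that stores each item as one JSON line."""

    def __init__(self, path='sun.jsonl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open(self.path, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for Scrapy items and plain dicts alike;
        # ensure_ascii=False keeps Chinese text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Close the output file when the spider finishes
        self.file.close()
```

To use a pipeline like this, it would need its own entry in ITEM_PIPELINES in settings.py, in the same way as SunspiderPipeline.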

Finally, run the code. Before doing so, remember to uncomment the pipeline section (ITEM_PIPELINES) in settings.py.
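
For reference, the pipeline section of settings.py should look roughly like the excerpt below once it is uncommented. This assumes the project and class names used in this article (sunspider and SunspiderPipeline); 300 is the priority value that the generated template uses, and lower numbers run earlier.

```python
# settings.py (excerpt): enable the pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'sunspider.pipelines.SunspiderPipeline': 300,
}
```

With the pipeline enabled, the spider can be started from the project directory with the command scrapy crawl sun, where sun is the name set in the spider class.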