[CP_15] Python Crawler Framework 02: Crawling Consultation Questions with Scrapy

Author: Fighting_001 | Published 2019-04-21 16:23

Contents

I. Sending POST requests with the Scrapy framework
    1. Sending a POST request in Scrapy
    2. Adding request headers
II. Case study: crawling consultation questions from a platform
    1. Create the project
    2. Generate the spider script inside the project
    3. Define the target fields in items
    4. Write the spider script: rpPlatform.py
    5. Post-process items and save data in pipelines
    6. Configure the project settings file settings.py
    7. Run the spider rpPlatform.py
    8. Use main.py as a shortcut to launch the crawl command

I. Sending POST requests with the Scrapy framework

1. Sending a POST request in Scrapy

Create the project: scrapy startproject youdaoTranslate
Generate the spider script inside the project: scrapy genspider ydTranslate "fanyi.youdao.com"

ydTranslate.py

# -*- coding: utf-8 -*-
import scrapy

class YdtranslateSpider(scrapy.Spider):
    name = 'ydTranslate'
    allowed_domains = ['fanyi.youdao.com']

    def start_requests(self):
        url="http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

        # Enqueue a POST request that carries the form data
        yield scrapy.FormRequest(
            url=url,
            formdata={
                "i":"测试",
                "from":"AUTO",
                "to":"AUTO",
                "smartresult":"dict",
                "client":"fanyideskweb",
                "salt":"15556827153720",
                "sign":"067431debbc1e4c7666f6f4b1e204747",
                "ts":"1555682715372",
                "bv":"e2a78ed30c66e16a857c5b6486a1d326",
                "doctype":"json",
                "version":"2.1",
                "keyfrom":"fanyi.web",
                "action":"FY_BY_CLICKBUTTION"
            },
            callback=self.parse # callback that handles the response
        )

    def parse(self, response):
        print("------------------")
        print(response.body)

Run the spider: scrapy crawl ydTranslate
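
Because the form data requests doctype=json, the response body is a JSON string. A minimal sketch of decoding it in parse() (this assumes import json at the top of ydTranslate.py; the "translateResult" key is taken from Youdao's typical response layout and is an assumption, not something this project verifies):

    def parse(self, response):
        data=json.loads(response.text)  # decode the JSON body returned by the API
        # "translateResult" is assumed from the usual Youdao response structure
        print(data.get("translateResult"))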

2. Adding request headers

ydTranslate.py

# -*- coding: utf-8 -*-
import scrapy
import random

class YdtranslateSpider(scrapy.Spider):
    name = 'ydTranslate'
    allowed_domains = ['fanyi.youdao.com']

    def start_requests(self):
        url="http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

        # Pick one User-Agent at random from several candidates
        agent1="Mozilla/5.0 (Windows NT 6.1; rv:65.0) Gecko/20100101 Firefox/65.0"
        agent2="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
        agent3="Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
        agent4="MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
        agent5="Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36"
        agent6="Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044207 Mobile Safari/537.36 MicroMessenger/6.7.3.1340(0x26070332) NetType/4G Language/zh_CN Process/tools"
        ls=[agent1,agent2,agent3,agent4,agent5,agent6]
        agent=random.choice(ls)
        # Build the request header dict
        header={"User-Agent":agent}

        # Enqueue a POST request that carries the form data
        yield scrapy.FormRequest(
            url=url,
            headers=header,
            formdata={
                "i":"测试",
                "from":"AUTO",
                "to":"AUTO",
                "smartresult":"dict",
                "client":"fanyideskweb",
                "salt":"15556827153720",
                "sign":"067431debbc1e4c7666f6f4b1e204747",
                "ts":"1555682715372",
                "bv":"e2a78ed30c66e16a857c5b6486a1d326",
                "doctype":"json",
                "version":"2.1",
                "keyfrom":"fanyi.web",
                "action":"FY_BY_CLICKBUTTION"
            },
            callback=self.parse # callback that handles the response
        )

    def parse(self, response):
        print("------------------")
        print(response.body)
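
As an alternative to choosing a User-Agent inside start_requests, the same rotation can be moved into a downloader middleware so every outgoing request gets a random header automatically. A minimal sketch (the class name RandomUserAgentMiddleware and its location in middlewares.py are illustrative, not part of this project):

# middlewares.py (illustrative sketch)
import random

class RandomUserAgentMiddleware(object):
    user_agents=[
        "Mozilla/5.0 (Windows NT 6.1; rv:65.0) Gecko/20100101 Firefox/65.0",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    ]

    def process_request(self, request, spider):
        # assign a random User-Agent to every request before it is sent
        request.headers["User-Agent"]=random.choice(self.user_agents)

If used, it would be enabled in settings.py through DOWNLOADER_MIDDLEWARES, e.g. {'youdaoTranslate.middlewares.RandomUserAgentMiddleware': 543}.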

II. Case study: crawling consultation questions from a platform

Goal: use the Scrapy framework to crawl the title, content, and URL of the consultation questions on a consultation platform over a given range of pages.

Page 1: http://wz.sun0769.com/index.php/question/huiyin?page=0
Page 2: http://wz.sun0769.com/index.php/question/huiyin?page=30
Page 3: http://wz.sun0769.com/index.php/question/huiyin?page=60

The page parameter grows by 30 per listing page, i.e. page N corresponds to page=(N-1)*30; the spider below paginates with the same step.

1. Create the project

scrapy startproject replyPlatform

2. Generate the spider script inside the project

scrapy genspider rpPlatform "wz.sun0769.com"

3. Define the target fields in items

Target fields for this crawl: title, content, url
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class ReplyplatformItem(scrapy.Item):
    url=scrapy.Field()  # URL of each post
    title=scrapy.Field()    # title of each post
    content=scrapy.Field()  # content of each post

4. Write the spider script: rpPlatform.py

rpPlatform.py

# -*- coding: utf-8 -*-
import scrapy
from replyPlatform.items import ReplyplatformItem

class RpplatformSpider(scrapy.Spider):
    name = 'rpPlatform'
    allowed_domains = ['wz.sun0769.com']
    url="http://wz.sun0769.com/index.php/question/huiyin?page="
    num=0
    start_urls=[url+str(num)]

    # Collect the URL of each post on the listing page
    def parse(self, response):
        # Extract each post's href attribute into a list
        links=response.xpath('//div[@class="newsHead clearfix"]/table//td/a[@class="news14"]/@href').extract()
        # Request each post and let parse_item handle the response
        for link in links:
            yield scrapy.Request(link,callback=self.parse_item)

        # Automatic pagination: request page=30, 60, ..., 180, then stop
        if self.num<=150:
            self.num+=30
            # Request the next listing page
            yield scrapy.Request(self.url+str(self.num),callback=self.parse)

    # Scrape the content of each post
    def parse_item(self,response):
        item=ReplyplatformItem()    # create a new item instance
        item["url"]=response.url
        item["title"]=response.xpath('//span[@class="niae2_top"]/text()').extract()[0]
        item["content"]="".join(response.xpath('//td[@class="txt16_3"]/text()').extract())
        yield item

5. Post-process items and save data in pipelines

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ReplyplatformPipeline(object):

    def __init__(self):
        self.filename=open("reply.txt","a",encoding="utf-8")

    def process_item(self, item, spider):
        # Format each returned item and append it to the target file
        result=str(item)+"\n\n"
        self.filename.write(result)
        return item

    def close_spider(self,spider):
        self.filename.close()
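
str(item) stores the Item's repr string; if machine-readable output is preferred, writing one JSON object per line is a small change. A sketch of that variant (the ReplyplatformJsonPipeline name and reply.jl filename are illustrative; if used, it would need its own entry in ITEM_PIPELINES):

import json

class ReplyplatformJsonPipeline(object):

    def __init__(self):
        self.file=open("reply.jl","a",encoding="utf-8")

    def process_item(self, item, spider):
        # write each item as one JSON object per line (JSON Lines format)
        self.file.write(json.dumps(dict(item),ensure_ascii=False)+"\n")
        return item

    def close_spider(self,spider):
        self.file.close()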

6. Configure the project settings file settings.py

settings.py
(1) Comment out the robots.txt compliance setting:

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

(2) Enable the item pipeline and set its priority:
When multiple pipelines coexist, a lower number means higher priority; the default value is 300.

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'replyPlatform.pipelines.ReplyplatformPipeline': 300,
}
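
Optionally, the crawl speed can also be throttled here; DOWNLOAD_DELAY is a standard Scrapy setting, although this project does not set it:

# Optional: wait 1 second between requests to reduce load on the target site
DOWNLOAD_DELAY = 1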

7. Run the spider rpPlatform.py

Change into the directory: \scrapyProject\replyPlatform
Run the command: scrapy crawl rpPlatform

After the run completes, the specified txt file is generated in that directory, containing the scraped data.
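
As a side note, Scrapy's built-in feed exports can also dump the items to a file without a custom pipeline (an optional alternative, not used in this project):

scrapy crawl rpPlatform -o reply.json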

8. Use main.py as a shortcut to launch the crawl command

Create a new main.py file in the project directory (\scrapyProject\replyPlatform) as a shortcut for launching the crawl command.
main.py

from scrapy import cmdline

cmd="scrapy crawl rpPlatform"   # 需要执行的爬虫cmd命令
cmdline.execute(cmd.split())    # 执行命令;默认以空格分割切片
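
An equivalent way to start the crawl from a script is Scrapy's CrawlerProcess API; a sketch assuming the default module path that genspider creates (replyPlatform/spiders/rpPlatform.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from replyPlatform.spiders.rpPlatform import RpplatformSpider

process=CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl(RpplatformSpider)                 # schedule the spider
process.start()                                 # block until the crawl finishes

Either version is launched with python main.py from the project directory.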
