昨天凌晨2点醒来后,看了下向右奔跑的文章,准备用 Scrapy 做一次跨页面的数据爬取,以简书“七日热门”数据为例。
1 items.py代码
from scrapy.item import Item,Field
class SevendayItem(Item):
    """Container for one JianShu seven-day-trending article record."""
    article_url = Field()  # article link, collected on the listing page
    author = Field()       # author name
    article = Field()      # article title
    date = Field()         # publish date
    word = Field()         # word count
    view = Field()         # view count
    comment = Field()      # comment count
    like = Field()         # like count
    gain = Field()         # reward (tip) count
可以看出,我要爬取的数据不在一个页面,这时候就需要跨页面爬取了。
2 新建sevendayspider.py
import json
import re
import sys
from urllib.parse import urljoin

sys.path.append("..")  # make the parent package importable when run directly

import requests
import scrapy
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from sevenday.items import SevendayItem
class sevenday(CrawlSpider):
    """Crawl JianShu's seven-day trending list, follow each article link,
    and emit a SevendayItem per article (author, title, date, word count,
    views, comments, likes, reward count)."""
    name = 'sevenday'
    start_urls = ['http://www.jianshu.com/trending/weekly']

    def parse(self, response):
        """Parse one trending-list page: request every article detail page
        and queue the remaining paginated list pages."""
        selector = Selector(response)
        infos = selector.xpath('//ul[@class="note-list"]/li')
        for info in infos:
            article_url_part = info.xpath('div/a/@href').extract()[0]
            # urljoin avoids the double slash that naive concatenation
            # ('http://www.jianshu.com/' + '/p/...') produced when the
            # href already starts with '/'.
            article_url = urljoin(response.url, article_url_part)
            yield Request(article_url, meta={'article_url': article_url},
                          callback=self.parse_item)
        # Scrapy's duplicate filter drops already-seen pages, so yielding
        # these from every parse() call is redundant but harmless.
        urls = ['http://www.jianshu.com/trending/weekly?page={}'.format(i)
                for i in range(1, 11)]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse_item(self, response):
        """Parse one article page and yield a populated SevendayItem."""
        item = SevendayItem()
        item['article_url'] = response.meta['article_url']
        selector = Selector(response)
        # Decode the body once instead of once per regex.
        body = response.body.decode('utf-8')
        item['author'] = selector.xpath('//span[@class="name"]/a/text()').extract()[0]
        item['article'] = selector.xpath('//h1[@class="title"]/text()').extract()[0]
        item['date'] = selector.xpath('//span[@class="publish-time"]/text()').extract()[0]
        item['word'] = selector.xpath('//span[@class="wordage"]/text()').extract()[0]
        # Counters are embedded in inline JSON rather than in the markup.
        item['view'] = re.findall(r'"views_count":(.*?),', body, re.S)[0]
        item['comment'] = re.findall(r'"comments_count":(.*?)}', body, re.S)[0]
        item['like'] = re.findall(r'"likes_count":(.*?),', body, re.S)[0]
        # 'note_id' instead of 'id' to avoid shadowing the builtin.
        note_id = re.findall(r'{"id":(.*?),', body, re.S)[0]
        gain_url = 'http://www.jianshu.com/notes/{}/rewards?count=20'.format(note_id)
        # NOTE(review): requests.get blocks Scrapy's event loop; chaining a
        # scrapy Request with the partial item in meta would be preferable.
        wb_data = requests.get(gain_url)
        json_data = json.loads(wb_data.text)
        item['gain'] = json_data['rewards_count']
        yield item
对照原文章和我的代码就能看懂,这里就不再赘述了。
网友评论