[Novel Scraping Series, Part 1] Liaozhai Zhiyi (Strange Tales from a Chinese Studio)

Author: 松龄学编程 | Published 2020-05-05 16:36

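Liaozhai Zhiyi (Strange Tales from a Chinese Studio) is the classic collection of fox-spirit and ghost stories by the Qing dynasty writer Pu Songling, 494 stories in all. A while back I bought an abridged edition with only 45 of them. Every one is a classic, and the more I read, the more hooked I got. The book has been praised as writing of ghosts and demons a cut above the rest, with satire of poverty and greed that cuts to the bone, so I wanted to read the complete text. I looked on Taobao, where most listings are collector's editions. Those tend to be thick and heavily packaged, not something you can read anywhere, anytime. I happen to have a printer at home, so I can print the book myself. After some searching on Baidu I found a Chinese classical literature site with a large catalogue and clean, classic typesetting, so that is where I will scrape from.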

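Since I plan to scrape other books later, Jin Ping Mei and the like, I decided to use the Scrapy framework. It is extensible and well structured, and it keeps the code clear and easy to maintain.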

Page analysis

Opening the Firefox page inspector shows that there are 496 pages to scrape, and each page's URL is simply the story's sequence number.
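
The URL pattern can be sketched in a few lines of Python. The base URL is the one the spider in this post uses; `build_urls` is a hypothetical helper name introduced only for illustration.

```python
BASE_URL = 'http://www.zggdwx.com/liaozhai'


def build_urls(first=1, last=496):
    """Return the page URLs for story numbers first..last, inclusive."""
    return [f'{BASE_URL}/{page}.html' for page in range(first, last + 1)]


urls = build_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first story page
print(urls[-1])    # last story page
```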

The title and content we want sit in the h1 under the div with class info, and in the p elements under the div with class content, respectively.
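
To try these two lookups without running a crawl, here is a minimal sketch using only the standard library. The sample markup is hypothetical and simplified (the real pages are certainly richer), and `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset, so it stands in for Scrapy's selectors rather than replacing them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mirrors the structure described above.
SAMPLE_HTML = """
<html><body>
  <div class="info"><h1>Sample title</h1></div>
  <div class="content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def extract_title_and_paragraphs(markup):
    """Return (title, paragraphs) as two lists of strings."""
    root = ET.fromstring(markup.strip())
    # ElementTree paths are relative to an element, hence the leading './/'.
    title = [h1.text for h1 in root.findall(".//div[@class='info']/h1")]
    paragraphs = [p.text for p in root.findall(".//div[@class='content']/p")]
    return title, paragraphs


title, paragraphs = extract_title_and_paragraphs(SAMPLE_HTML)
print(title)
print(paragraphs)
```

Both results are lists, which matches what the XPath queries in a Scrapy callback return when all matches are extracted.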

[Screenshot: the Chinese classical literature website]

[Screenshot: Liaozhai Zhiyi]

Requirements

Scrape the text of the stories and put it into a single txt file. Layout and styling are out of scope for now.

Implementation

We will keep this novel series in a project folder called hoho. Create the hoho project:

scrapy startproject hoho
cd hoho
scrapy genspider liaozhai www.zggdwx.com

The project is initialized. Now for the liaozhai spider.

# -*- coding: utf-8 -*-
import scrapy

from hoho.items import LiaozhaiItem


class LiaozhaiSpider(scrapy.Spider):
    name = 'liaozhai'
    # allowed_domains takes bare domain names, not URLs with a scheme.
    allowed_domains = ['www.zggdwx.com']

    def start_requests(self):
        base_url = 'http://www.zggdwx.com/liaozhai'

        # Pages are numbered 1..496, one story per page.
        for page in range(1, 497):
            url = f'{base_url}/{page}.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Both queries return lists of strings; the pipeline joins them.
        title = response.xpath('//div[@class="info"]/h1/text()').getall()
        paragraphs = response.xpath('//div[@class="content"]/p/text()').getall()
        yield LiaozhaiItem(title=title, content=paragraphs)

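A quick walkthrough:

The start_requests method generates the URLs to request.

The parse method extracts the title and content of each story.

That completes the spider for Liaozhai Zhiyi. A few other files need configuring as well.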

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class LiaozhaiItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


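class HohoPipeline(object):

    def process_item(self, item, spider):
        # Placeholder path; replace it with a real directory.
        base_dir = '/path to save/fruits'
        filename = base_dir + '/liaozhai.txt'
        # Append mode, with explicit UTF-8 so the Chinese text is written
        # correctly whatever the platform's default encoding is.
        with open(filename, 'a', encoding='utf-8') as f:
            f.write('\n'.join(item['title']) + '\n')
            f.write('\n'.join(item['content']) + '\n\n')
        return item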
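
To see what the pipeline writes for each item, the same formatting can be tried outside Scrapy. The sketch below is a stand-alone illustration: `append_story` is a hypothetical helper, not part of the project, the explicit UTF-8 encoding is an assumption, and the titles and paragraphs are placeholder data written to a temporary directory.

```python
import os
import tempfile


def append_story(filename, title, content):
    """Append one story: title lines, then paragraphs, then a blank line."""
    with open(filename, 'a', encoding='utf-8') as f:
        f.write('\n'.join(title) + '\n')
        f.write('\n'.join(content) + '\n\n')


# Demo with placeholder data; nothing is written outside the temp directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'liaozhai.txt')
    append_story(path, ['Title A'], ['Paragraph 1', 'Paragraph 2'])
    append_story(path, ['Title B'], ['Paragraph 3'])
    with open(path, encoding='utf-8') as f:
        written = f.read()

print(written)
```

Because the file is opened in append mode, stories end up in the order the items arrive, and running the crawl a second time adds a second copy to the same file.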

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'hoho.pipelines.HohoPipeline': 300,
}

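A quick walkthrough:

The item defines only the two fields needed here, title and content.

The pipeline appends each story to the txt file.

A 3-second download delay eases the load on the server.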
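
As a rough check of what the 3-second delay costs, the waiting time can be estimated from the page count. This back-of-the-envelope sketch assumes requests go out one at a time and ignores the download time itself. Scrapy also randomizes the delay between 0.5x and 1.5x by default (the `RANDOMIZE_DOWNLOAD_DELAY` setting), so treat the figure as an average rather than a guarantee.

```python
PAGES = 496           # pages to fetch, from the page analysis
DOWNLOAD_DELAY = 3    # seconds, as set in settings.py

total_seconds = PAGES * DOWNLOAD_DELAY
minutes, seconds = divmod(total_seconds, 60)
print(f'roughly {minutes} min {seconds} s of waiting')
```

So a full crawl should take on the order of 25 minutes, which seems a fair price for being polite to the site.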

Here is the result:

[Screenshot: scraping result]

I can't wait to start reading. The printer is already humming...

Article link: https://www.haomeiwen.com/subject/iibmghtx.html