Scraping novel chapters with Scrapy and saving them as text files

Author: whong736 | Published 2018-12-09 10:15 | Read 59 times

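This walkthrough scrapes the novel site Quanshuwang (全书网), http://www.quanshuwang.com/
It saves novels from the Fantasy & Magic (玄幻魔法) category as text files, extracting each novel's name, its chapter names, and the chapter contents.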

Target result:

Scraped novels

Scraped chapters of a novel

Approach:

1. Analyze the web pages

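2. URL pattern:

http://www.quanshuwang.com/list/1_1.html  # page 1
http://www.quanshuwang.com/list/1_2.html  # page 2
http://www.quanshuwang.com/list/1_3.html  # page 3

The pattern is:
http://www.quanshuwang.com/list/X_Y.html
X is the novel category; for example, 1 is Fantasy & Magic (玄幻魔法) and 2 is Martial Arts & Cultivation (武侠修真).
Y is the page number.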
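
The pattern above can be turned into a small helper that generates list-page URLs. This is only a minimal sketch: the function name list_page_urls and its parameters are illustrative and are not part of the original project.

```python
def list_page_urls(category, first_page, last_page):
    """Build list-page URLs following the X_Y.html pattern described above."""
    base = 'http://www.quanshuwang.com/list/{category}_{page}.html'
    return [base.format(category=category, page=page)
            for page in range(first_page, last_page + 1)]


# Category 1 (Fantasy & Magic), pages 1 through 3
for url in list_page_urls(1, 1, 3):
    print(url)
```

The returned list could be used as the start URLs of a spider instead of writing each page out by hand.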

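3. Page hierarchy

Level 1: category list page

http://www.quanshuwang.com/list/1_1.html

Level 2: the page for a single novel

http://www.quanshuwang.com/book_167173.html  # novel page URL

From the novel page, follow the link to the level-3 chapter list. How you reach the chapter list differs from one novel site to another.

http://www.quanshuwang.com/book/167/167173

Level 3: the novel's chapter list page

http://www.quanshuwang.com/book/167/167173

Level 4: chapter content page

http://www.quanshuwang.com/book/167/167173/48194124.html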
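
To make the four levels concrete, the sketch below maps a URL to its level with regular expressions. The patterns are inferred only from the example URLs above, so the site may use forms they do not cover; the names LEVEL_PATTERNS and page_level are hypothetical helpers, not part of the original project.

```python
import re

# Patterns inferred from the example URLs above (an assumption, not a site spec).
LEVEL_PATTERNS = [
    ('category_list', re.compile(r'/list/\d+_\d+\.html$')),
    ('book_page', re.compile(r'/book_\d+\.html$')),
    ('chapter_content', re.compile(r'/book/\d+/\d+/\d+\.html$')),
    ('chapter_list', re.compile(r'/book/\d+/\d+/?$')),
]


def page_level(url):
    """Return the name of the page level a URL belongs to, or None."""
    for level, pattern in LEVEL_PATTERNS:
        if pattern.search(url):
            return level
    return None


print(page_level('http://www.quanshuwang.com/book/167/167173'))
```

A check like this can help when debugging which callback a given URL should be routed to.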

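1. Create the project

scrapy startproject Fiction

cd Fiction

2. Create the spider file

scrapy genspider -t basic novel quanshuwang.com

3. Decide which fields to scrape: the novel name, the chapter name, and the chapter content. Other fields can be added later. Start by writing the item: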

# -*- coding: utf-8 -*-
import scrapy


class FictionItem(scrapy.Item):
    # Novel name
    name = scrapy.Field()
    # Chapter name
    chapter_name = scrapy.Field()
    # Chapter content
    chapter_content = scrapy.Field()

4. Write the spider file

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.http import Request

from Fiction.items import FictionItem


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['quanshuwang.com']
    # Pages 2 and 3 of the Fantasy & Magic category on Quanshuwang
    start_urls = [
        'http://www.quanshuwang.com/list/1_2.html',
        'http://www.quanshuwang.com/list/1_3.html',
    ]

    # Level 1: collect the URL of every book on a list page
    def parse(self, response):
        book_urls = response.xpath('//li/a[@class="l mr10"]/@href').extract()
        for book_url in book_urls:
            yield Request(response.urljoin(book_url), callback=self.parse_read)

    # Level 2: follow the "read now" button to the chapter list
    def parse_read(self, response):
        read_url = response.xpath('//a[@class="reader"]/@href').extract_first()
        if read_url:  # skip book pages that have no reader link
            yield Request(response.urljoin(read_url), callback=self.parse_chapter)

    # Level 3: collect the URL of every chapter
    def parse_chapter(self, response):
        chapter_urls = response.xpath('//div[@class="clearfix dirconone"]/li/a/@href').extract()
        for chapter_url in chapter_urls:
            yield Request(response.urljoin(chapter_url), callback=self.parse_content)

    # Level 4: extract the novel name, chapter name, and chapter content
    def parse_content(self, response):
        # Novel name
        name = response.xpath('//div[@class="main-index"]/a[@class="article_title"]/text()').extract_first()
        # Chapter name
        chapter_name = response.xpath('//strong[@class="l jieqi_title"]/text()').extract_first()
        # Chapter content sits between two script tags, so match it in the raw HTML
        chapter_content_reg = r'style5\(\);</script>(.*?)<script type="text/javascript">'
        matches = re.findall(chapter_content_reg, response.text, re.S)
        if not matches:
            return  # unexpected page layout, nothing to extract
        # Strip the indentation padding and the <br /> tags
        chapter_content = matches[0].replace('    ', '').replace('<br />', '')

        item = FictionItem()
        item['name'] = name
        item['chapter_name'] = chapter_name
        item['chapter_content'] = chapter_content
        yield item
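
The regular-expression step in parse_content can be tried outside Scrapy. The sketch below runs the same pattern against a made-up HTML fragment; sample_html (including the style6() call) is a hypothetical stand-in shaped like the markup the pattern expects, not a copy of a real page, and extract_chapter_text is an illustrative helper name.

```python
import re

# Hypothetical fragment shaped like the markup the spider's regex expects.
sample_html = (
    '<div id="content"><script type="text/javascript">style5();</script>'
    '    First paragraph.<br />\n'
    '    Second paragraph.<br />\n'
    '<script type="text/javascript">style6();</script></div>'
)

# Same pattern as in the spider; re.S lets "." match newlines.
CONTENT_RE = re.compile(
    r'style5\(\);</script>(.*?)<script type="text/javascript">', re.S)


def extract_chapter_text(html):
    """Return the cleaned chapter text, or None if the markers are missing."""
    match = CONTENT_RE.search(html)
    if match is None:
        return None
    raw = match.group(1)
    # Drop the four-space indentation and the <br /> tags, as the spider does.
    return raw.replace('    ', '').replace('<br />', '').strip()


print(extract_chapter_text(sample_html))
```

Returning None when the markers are missing avoids the IndexError that indexing an empty findall result would raise.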

5. Write the pipeline file to save each chapter as a text file

# -*- coding: utf-8 -*-
import os


class FictionPipeline(object):
    # Replace /Users/vincentwen/MyCode/Scrapy with a directory on your own machine
    base_path = '/Users/vincentwen/MyCode/Scrapy'

    def process_item(self, item, spider):
        # One folder per novel
        target_path = os.path.join(self.base_path, str(item['name']))
        os.makedirs(target_path, exist_ok=True)

        # One text file per chapter, named after the chapter
        filename_path = os.path.join(target_path, str(item['chapter_name']) + '.txt')
        with open(filename_path, 'w', encoding='utf-8') as f:
            # The with block closes the file, so no explicit close() is needed
            f.write(item['chapter_content'] + "\n")
        return item
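
Because the pipeline uses the novel name and chapter name directly as folder and file names, a name containing a character such as "/" or "?" could make the write fail on some systems. The sketch below shows one possible way to guard against that; safe_name and chapter_path are hypothetical helpers that are not in the original project, and the set of replaced characters is an assumption.

```python
import os
import re

# Characters that are path separators or invalid in Windows file names.
_UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_name(text, fallback='untitled'):
    """Replace unsafe characters with '_' and fall back if nothing is left."""
    cleaned = _UNSAFE_CHARS.sub('_', str(text or '')).strip()
    return cleaned or fallback


def chapter_path(base_dir, novel_name, chapter_name):
    """Build base_dir/<novel>/<chapter>.txt from sanitized names."""
    return os.path.join(base_dir, safe_name(novel_name),
                        safe_name(chapter_name) + '.txt')


print(chapter_path('output', 'My Novel', 'Chapter 1: Start?'))
```

The fallback also covers items whose name was not found on the page and is therefore None.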

6. Edit the settings file

ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

ITEM_PIPELINES = {
   'Fiction.pipelines.FictionPipeline': 300,
}
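
Since the spider requests every chapter of every book on the list pages, it may be worth slowing the crawl down. The fragment below is a sketch of optional additions to the settings file; the setting names are standard Scrapy settings, but the values are illustrative assumptions and do not come from the original project.

```python
# Optional additions to settings.py (illustrative values, not from the original project)
DOWNLOAD_DELAY = 1.0                 # wait about one second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallel requests to the site
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server load
LOG_LEVEL = 'INFO'                   # less verbose than the default DEBUG
```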


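7. Run the spider and check the result. Run it with logging enabled first; once it works, switch to no-log mode (scrapy crawl novel --nolog).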

scrapy crawl novel


8. Source code:
https://github.com/wzw5566/Fiction

Article link: https://www.haomeiwen.com/subject/tipchqtx.html
