[Novel Scraping Series, Part 4] 如果没有明天 (If There Is No Tomorrow)

Author: 松龄学编程 | Published: 2020-05-10 11:59

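The TV series 《我是余欢水》 is one of the most popular shows right now. Before and after falling ill, its protagonist Yu Huanshui (余欢水) stumbles through one tragicomic episode after another. The plot is surprising enough to make you slap the table, yet true enough that you recognize yourself in it; watching Yu Huanshui, most viewers will see at least a little of themselves. The story, the characters, and the cinematography all feel close to home. The series is adapted from the novel 《如果没有明天》 (If There Is No Tomorrow) by Yu Geng (余耕), whose inventive plotting is remarkable. The prose is playful and teasing, as if the author were sitting across from you shooting the breeze. Please respect copyright and support the licensed edition if you want to read the whole book. A preview is available on the 言情花园 website, which gives a taste of the writing.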

To deal with the site's anti-scraping measures, I am sticking with selenium and extending the existing app project.

Page analysis

Open the Firefox page inspector.

The chapter links match the XPath [/html/body/div[3]/div/dl/dd/a].

The title and content we want are in [h1] and in [the div whose id is content], respectively.

/html/body/div[3]/div/dl/dd/a

title:h1,content:div[@id="content"]
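
The two locators can be tried offline before pointing a browser at the site. The sketch below is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree, which supports a limited XPath subset and needs well-formed markup, so the absolute path is rewritten relative to the html root. SAMPLE_INDEX, SAMPLE_CHAPTER, chapter_links and chapter_fields are made up for this example and are not the real pages or part of the app project.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the structure described above.
SAMPLE_INDEX = """<html><body>
<div>header</div>
<div>navigation</div>
<div><div><dl>
<dd><a href="1.html">Chapter 1</a></dd>
<dd><a href="2.html">Chapter 2</a></dd>
</dl></div></div>
</body></html>"""

SAMPLE_CHAPTER = """<html><body>
<h1>Chapter 1</h1>
<div id="content">First paragraph of the chapter.</div>
</body></html>"""


def chapter_links(index_html):
    """Return (text, href) pairs for every chapter link on the index page."""
    root = ET.fromstring(index_html)  # root is the <html> element
    # ElementTree paths are relative to root, so /html/body/... becomes ./body/...
    anchors = root.findall("./body/div[3]/div/dl/dd/a")
    return [(a.text, a.get("href")) for a in anchors]


def chapter_fields(chapter_html):
    """Return the title (h1) and content (div with id="content") of a chapter page."""
    root = ET.fromstring(chapter_html)
    title = root.find(".//h1").text
    content = root.find(".//div[@id='content']").text
    return {"title": title, "content": content}
```

Real pages are rarely well-formed XML, which is one reason to let the browser (via selenium) evaluate the XPath instead, as the crawler in this post does.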

Requirements analysis

Save the novel 《如果没有明天》 to a txt file.
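
The post does not show how the txt file is written, so the following is a minimal sketch of one possible way, assuming the crawler produces a list of dicts with 'title' and 'content' keys (the shape parse() returns in this post). The names format_chapters and save_chapters_to_txt are hypothetical and not part of the app project.

```python
from pathlib import Path


def format_chapters(chapters):
    """Join chapters into one string: title, blank line, content, blank line."""
    lines = []
    for chapter in chapters:
        lines.append(chapter["title"])
        lines.append("")
        lines.append(chapter["content"])
        lines.append("")
    return "\n".join(lines)


def save_chapters_to_txt(chapters, path):
    """Write all chapters to a single UTF-8 text file; return the chapter count."""
    Path(path).write_text(format_chapters(chapters), encoding="utf-8")
    return len(chapters)
```

Writing with an explicit encoding="utf-8" avoids depending on the platform default, which matters for Chinese text.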

Implementation

Open the app project and add the file yuhuanshui_crawler.py.

tree
.
├── config.py
├── crawler.py
├── crawler_manager.py
├── crawlerlogger.py
├── guichuideng_crawler.py
├── yuhuanshui_crawler.py
├── issue_builder.py
└── util.py

With the project set up, write the crawler:

# -*- coding: utf-8 -*-

from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from crawler import NovelCrawler
from crawlerlogger import CrawlerLogger

import time

class YuhuanshuiCrawler(NovelCrawler):

    BASEURL = 'https://www.yqhy.org/read/158/158813/'

    def setupLocator(self):
        self.locator = (By.CLASS_NAME, 'article-list')
        self.logger.info(f'locator: {self.locator}')

    def setupURL(self):
        self.urls = [YuhuanshuiCrawler.BASEURL]

    def setupLogger(self):
        self.logger = CrawlerLogger(__name__).logger

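    def parse(self):
        driver = self.driver
        logger = self.logger
        chapter_xpath = '/html/body/div[3]/div/dl/dd/a'
        # find_elements(By.XPATH, ...) replaces find_elements_by_xpath,
        # which was deprecated in Selenium 4 and removed in later 4.x releases
        links = driver.find_elements(By.XPATH, chapter_xpath)
        data = []
        count = len(links)
        logger.info(f"Opened the chapter list, length: {count}")
        for index in range(count):
            # Go back to the chapter list (self.fetch presumably loads the URL and waits
            # for the locator) and look the links up again for this iteration
            self.fetch(YuhuanshuiCrawler.BASEURL, self.locator)
            links = driver.find_elements(By.XPATH, chapter_xpath)
            logger.info(f"Opening chapter [{index + 1}]")
            links[index].click()

            # Wait up to 20 s, polling every 0.5 s, for the chapter body to be present
            locator = (By.ID, 'content')
            WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located(locator))
            logger.info("Passed the page-load wait")
            title = driver.find_element(By.TAG_NAME, 'h1').text
            content = driver.find_element(By.XPATH, '//div[@id="content"]').text

            data.append({'title': title, 'content': content})
            # 'intervel' keeps the original spelling; it is presumably set by the parent class
            time.sleep(self.intervel)

        return data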

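A brief analysis:

The crawlers are organized by inheritance: each concrete subclass handles only what is specific to its site, so each subclass stays small. Logic and structure that are the same everywhere are extracted into the parent class. New requirements start from the parent class, which saves work, and updates to shared logic happen in one place. If shared logic changes so that it is no longer common, push it down into the specific subclasses. When a new subclass turns out to share logic with an existing subclass, merge that logic and pull it up into the parent class.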
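
The parent class NovelCrawler is not shown in this post, so the following is only a sketch of the general pattern (shared workflow in the parent, site-specific hooks in the subclass), not the project's actual code. BaseCrawlerSketch, NovelASketch, the snake_case hook names, the fetch stand-in, and the example.com URL are all hypothetical.

```python
from abc import ABC, abstractmethod


class BaseCrawlerSketch(ABC):
    """Hypothetical parent class: owns the workflow shared by every crawler."""

    def __init__(self):
        # The setup order is defined once here; subclasses only fill in the hooks.
        self.setup_urls()
        self.setup_locator()

    @abstractmethod
    def setup_urls(self):
        """Set self.urls to the list of start URLs."""

    @abstractmethod
    def setup_locator(self):
        """Set self.locator to the element that signals the page is ready."""

    @abstractmethod
    def parse(self, page):
        """Turn one fetched page into a list of result dicts."""

    def fetch(self, url):
        """Stand-in for loading a page; a real crawler would drive a browser here."""
        return f"<page {url}>"

    def run(self):
        """Shared loop: fetch every start URL and collect what parse() returns."""
        results = []
        for url in self.urls:
            results.extend(self.parse(self.fetch(url)))
        return results


class NovelASketch(BaseCrawlerSketch):
    """Hypothetical subclass: contains only the site-specific parts."""

    def setup_urls(self):
        self.urls = ["https://example.com/novel-a/"]

    def setup_locator(self):
        self.locator = ("class name", "article-list")

    def parse(self, page):
        return [{"title": "A", "source": page}]
```

Calling NovelASketch().run() exercises the shared loop with the subclass's hooks. Adding a crawler for another site then means writing one more small subclass, and a change to run() reaches every crawler at once.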

Errors encountered

'FirefoxWebElement' object is not subscriptable

'list' object has no attribute 'find_elements_by_tag_name'

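Cause analysis:

find_element returns only the first matching WebElement, which cannot be indexed like a list.

find_elements returns a list; take an item by index first, then call the WebElement lookup methods on that item.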
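
Both mistakes can be reproduced without a browser. The sketch below uses FakeElement, a hypothetical stand-in written for this example (it is not a Selenium class and its lookup methods take only a tag name), to show that the same two kinds of error appear whenever a single object is indexed or a list is asked for an element method.

```python
class FakeElement:
    """Hypothetical stand-in for a WebElement (not part of Selenium)."""

    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

    def find_element(self, tag):
        """Return the first matching child, like a single-element lookup."""
        for child in self.children:
            if child.tag == tag:
                return child
        raise LookupError(f"no <{tag}> child")

    def find_elements(self, tag):
        """Return a list of all matching children (possibly empty)."""
        return [child for child in self.children if child.tag == tag]


def describe_error(action):
    """Run action() and return the error it raises as text, or 'ok'."""
    try:
        action()
    except (TypeError, AttributeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return "ok"


dl = FakeElement("dl", [
    FakeElement("dd", [FakeElement("a")]),
    FakeElement("dd", [FakeElement("a")]),
])

# Mistake 1: indexing the single element returned by find_element.
error_1 = describe_error(lambda: dl.find_element("dd")[0])

# Mistake 2: calling an element method on the list returned by find_elements.
error_2 = describe_error(lambda: dl.find_elements("dd").find_element("a"))

# Correct: index into the list first, then search inside that one element.
first_link = dl.find_elements("dd")[0].find_element("a")
```

Here error_1 is a TypeError about an object that is not subscriptable and error_2 is an AttributeError on 'list', matching the shape of the two messages quoted above.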

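Here is the result:

如果没有明天 (If There Is No Tomorrow)

Life is unpredictable: you may have fifty years left, or only five days. If there were no tomorrow, how would you spend today? Cherish your time!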

Article link: https://www.haomeiwen.com/subject/uzndnhtx.html