[Python] Scraping Novels from Qidian with lxml

Author: 群体遗传学 | Published: 2020-05-08 17:16

A note up front: life is short, so I use Python.

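This post is a short personal note recording some of what I have learned about web scraping. The script here scrapes all of the novels listed on the front page of Qidian (起点小说网). It is written in Python and uses the lxml and requests libraries.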

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time     : 2018/12/7 21:11
# @Author   : Guocc
# @File     : GetStory.py

import os
from lxml import etree
import requests

class WebStoryGet(object):          # Scraper for a novel listing page
    def __init__(self, story_url):
        self.url = story_url

    def start_request(self):            # Entry point: fetch the listing page and visit every book
        response = requests.get(self.url)   # Request the listing page
        html = etree.HTML(response.content.decode())        # Decode the body as UTF-8 and parse it
        Bigtitle_list = html.xpath('//div[@class="book-mid-info"]/h4/a/text()') # Extract the book titles with XPath
        Bighref_list = html.xpath('//div[@class="book-mid-info"]/h4/a/@href') # Extract the book URLs (without the "https:" scheme)
        for Bigtitle, Bighref in zip(Bigtitle_list, Bighref_list):
            self.section_get(Bigtitle, Bighref)

    def section_get(self, Bigtitle, Bighref):   # Fetch a book's catalog and save every chapter
        section_content = requests.get("https:"+Bighref+"#Catalog")
        section_html = etree.HTML(section_content.content.decode())
        section_title_list = section_html.xpath('//div[@class="volume"]/ul/li/a/text()')
        section_href_list = section_html.xpath('//div[@class="volume"]/ul/li/a/@href')
        for section_title, section_href in zip(section_title_list, section_href_list):
            print(Bigtitle+" "+section_title+" download started.\n")
            story = self.content_get(section_href)
            os.makedirs(Bigtitle, exist_ok=True)    # One directory per book

            with open(os.path.join(Bigtitle, section_title), "w", encoding="utf-8") as f:      # Save the chapter text
                f.write(story)

    def content_get(self, section_href):     # Download the text of a single chapter
        story_response = requests.get("https:"+section_href)
        story_html = etree.HTML(story_response.content.decode())
        story = story_html.xpath('//div[@class="read-content j_readContent"]/p/text()')
        return "\n".join(story)

if __name__ == '__main__':
    story_url = "https://www.qidian.com/all"
    Story = WebStoryGet(story_url)
    Story.start_request()
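
The script builds absolute URLs by prepending "https:" to each href, and it uses chapter titles directly as file names. Below is a minimal sketch of a more defensive variant. It assumes the hrefs are protocol-relative (as the string concatenation in the script implies) and that titles may contain characters that are not valid in file names. The helpers `absolute_url` and `safe_filename` are hypothetical and not part of the original script, and the sample href and title are made up.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.qidian.com/all"  # listing page used by the script


def absolute_url(href, base=BASE_URL):
    """Resolve a relative or protocol-relative href against the base URL."""
    return urljoin(base, href)


def safe_filename(title, suffix=".txt"):
    """Replace characters that common file systems reject in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + suffix


# The href and the chapter title below are made-up illustrations.
print(absolute_url("//book.qidian.com/info/123"))
print(safe_filename("Chapter 1: A/B?"))
```

With helpers like these, `section_get` could write each chapter to `os.path.join(Bigtitle, safe_filename(section_title))` rather than to a path built from the raw title.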

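The WebStoryGet script above is the complete source for scraping Qidian. To scrape a different site, change story_url. If that site keeps its novel titles and links under different HTML tags, the XPath expressions must also be updated so that the titles and URLs can still be extracted. Other static pages can be handled the same way.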
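
One way to make that adaptation easier is to move the URL and the XPath expressions into a small configuration object. The following is a minimal sketch under stated assumptions, not the author's code: `SiteConfig` and `extract_books` are hypothetical names, and the sample HTML is a made-up fragment that only mimics the structure the script's XPath expressions expect.

```python
from dataclasses import dataclass

from lxml import etree


@dataclass
class SiteConfig:
    """Per-site settings: the listing URL plus the XPath for titles and links."""
    story_url: str
    title_xpath: str
    href_xpath: str


def extract_books(page_source, config):
    """Return (title, href) pairs found in a listing page."""
    html = etree.HTML(page_source)
    titles = html.xpath(config.title_xpath)
    hrefs = html.xpath(config.href_xpath)
    return list(zip(titles, hrefs))


# Values copied from the script above.
QIDIAN = SiteConfig(
    story_url="https://www.qidian.com/all",
    title_xpath='//div[@class="book-mid-info"]/h4/a/text()',
    href_xpath='//div[@class="book-mid-info"]/h4/a/@href',
)

# Made-up fragment for illustration only; the real page may differ.
SAMPLE = """
<div class="book-mid-info"><h4><a href="//book.example.com/info/1">Book One</a></h4></div>
<div class="book-mid-info"><h4><a href="//book.example.com/info/2">Book Two</a></h4></div>
"""

print(extract_books(SAMPLE, QIDIAN))
```

Supporting another static site would then mean creating a second `SiteConfig` with that site's URL and XPath expressions, leaving the extraction logic unchanged.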

Article link: https://www.haomeiwen.com/subject/aefqnhtx.html
