A long time ago I posted some notes from when I first learned web scraping by crawling the Doupo Cangqiong novel. Then, because of a job change, I didn't touch scraping for quite a while. Now I want to pick the skill back up and relearn it from scratch.
This time the scrape starts directly from the book's index page:
http://book.doupoxs.com/doupocangqiong/
The main idea is:
1- Scrape the hyperlinks of all chapters from the index page, keeping them in chapter order.
2- Fetch each chapter's title and body text.
3- Save the scraped content to a text file.
The code is as follows:
import requests
from bs4 import BeautifulSoup
import time
import random
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
def get_chapter_urls(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # Method 1: collect the chapter links from the chapter-list box
    chapters = soup.find('div', class_='xsbox clearfix').find_all('a')
    urls = [a.get('href') for a in chapters]
    # Method 2: collect all <a> tags, keep only chapter links, then sort by chapter number
    ## chapters = soup.find_all('a')
    ## urls = [a.get('href') for a in chapters if a.get('href').startswith('/doupocangqiong/') and a.get('href').split('/')[-1].split('.')[0].isdigit()]
    ## urls.sort(key=lambda x: int(x.split('/')[-1].split('.')[0]))  # sort in chapter order
    return ['http://book.doupoxs.com' + u for u in urls]
def get_chapter_content(url):
    time.sleep(random.randint(1, 3))  # random 1-3 second delay between requests
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.find('div', class_='entry-tit').text  # chapter title
    content = soup.find('div', class_='m-post').text.replace('\xa0'*8, '\n\n')  # chapter body; 8 non-breaking spaces mark a paragraph break
    return title, content
def save_to_file(title, content, filename):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n\n')
def main(url, filename):
    chapter_urls = get_chapter_urls(url)
    for chapter_url in chapter_urls:
        title, content = get_chapter_content(chapter_url)
        save_to_file(title, content, filename)
if __name__ == "__main__":
    main('http://book.doupoxs.com/doupocangqiong/', 'doupo.txt')
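On a long run the site can throttle requests or time out part-way through. As a rough, hypothetical variation (not part of the original script), the main loop could retry each chapter a few times and print progress, reusing the functions defined above; requests.RequestException covers the usual network errors:

def main_with_retries(url, filename, max_retries=3):
    # Same flow as main(), but each chapter is retried a few times
    # so a single timeout does not abort the whole run.
    chapter_urls = get_chapter_urls(url)
    total = len(chapter_urls)
    for i, chapter_url in enumerate(chapter_urls, start=1):
        for attempt in range(max_retries):
            try:
                title, content = get_chapter_content(chapter_url)
                save_to_file(title, content, filename)
                print(f'[{i}/{total}] saved: {title}')
                break
            except requests.RequestException as exc:
                print(f'attempt {attempt + 1} failed for {chapter_url}: {exc}')
                time.sleep(5)  # back off briefly before retrying

Calling main_with_retries('http://book.doupoxs.com/doupocangqiong/', 'doupo.txt') in the __main__ block would be a drop-in replacement for main().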
The final scraped novel is shown in the screenshot below:
[Screenshot: 2023-08-01_12-01-21.png]