Python Web Scraping in Practice: Scraping the Doupo Cangqiong Novel Again

Author: libdream | Published 2023-07-31 12:02

    A long time ago I posted some notes from when I was first learning web scraping and scraped the Doupo Cangqiong (斗破苍穹) novel. A job change later kept me away from scraping for quite a while, so now I want to pick the skill back up and relearn it from scratch.

    This time the scrape starts directly from the index page:
    http://book.doupoxs.com/doupocangqiong/
    The overall approach is:
    1. Scrape the hyperlinks for all chapters from the index page, arranged in chapter order.
    2. Fetch each chapter's title and body text.
    3. Save the scraped content to a text file.

    The code is as follows:

    import requests
    from bs4 import BeautifulSoup
    import time
    import random
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    
    def get_chapter_urls(url):
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        # Method 1: grab every chapter link inside the chapter-list box
        chapters = soup.find('div', class_='xsbox clearfix').find_all('a')
        urls = [a.get('href') for a in chapters]
        # Method 2: grab every link on the page, keep only chapter links, then sort
    ##    chapters = soup.find_all('a')
    ##    urls = [a.get('href') for a in chapters if a.get('href').startswith('/doupocangqiong/') and a.get('href').split('/')[-1].split('.')[0].isdigit()]
    ##    urls.sort(key=lambda x: int(x.split('/')[-1].split('.')[0]))  # sort by chapter number
        return ['http://book.doupoxs.com' + u for u in urls]
    
    def get_chapter_content(url):
        time.sleep(random.randint(1, 3))  # random 1-3 second delay between requests
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'lxml')
        title = soup.find('div', class_='entry-tit').text  # chapter title
        # Chapter body; runs of eight non-breaking spaces mark paragraph breaks
        content = soup.find('div', class_='m-post').text.replace('\xa0' * 8, '\n\n')
        return title, content
    
    def save_to_file(title, content, filename):
        # Append each chapter so the chapters accumulate in order in one file
        with open(filename, 'a', encoding='utf-8') as f:
            f.write(title + '\n')
            f.write(content + '\n\n')
    
    def main(url, filename):
        chapter_urls = get_chapter_urls(url)
        for chapter_url in chapter_urls:
            title, content = get_chapter_content(chapter_url)
            save_to_file(title, content, filename)
    
    if __name__ == "__main__":
        main('http://book.doupoxs.com/doupocangqiong/', 'doupo.txt')
    
    
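    Before running the full download, it can help to sanity-check the link extraction on its own. The short snippet below is my own addition rather than part of the original script; it assumes the functions above are already defined and simply prints how many chapter links were found and the first few URLs:

    # My addition: confirm the chapter list looks right before scraping everything
    chapter_urls = get_chapter_urls('http://book.doupoxs.com/doupocangqiong/')
    print('Found', len(chapter_urls), 'chapter links')
    print(chapter_urls[:3])  # the first few absolute chapter URLs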

    The scraped novel ends up looking like this:


    [Screenshot: 2023-08-01_12-01-21.png]
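
    One thing the script does not handle is the occasional failed request during a long run. The sketch below is my own assumption about how the scraper could be hardened, not something from the original post; it reuses the headers dict defined above and retries a request a few times before giving up:

    import time
    import requests
    
    def get_with_retry(url, retries=3, delay=5):
        # My addition (not in the original script): retry a failed request a few
        # times with a short pause, and re-raise if it still fails.
        for attempt in range(retries):
            try:
                response = requests.get(url, headers=headers, timeout=10)
                response.raise_for_status()
                return response
            except requests.RequestException:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)

    The two requests.get(...) calls in get_chapter_urls and get_chapter_content could then be swapped for get_with_retry(...).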
