Python crawler: scraping novels from 笔趣阁

By hello_spider | Published 2018-05-03 19:31 | Read 327 times

1. Environment

Python 3.6
Python official site: www.python.org
Libraries used: re, time, random, requests
Installing requests: https://jingyan.baidu.com/article/86f4a73ea7766e37d7526979.html
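
Only requests is a third-party package and needs a separate install (for example with pip install requests); re, time and random ship with Python. As a quick sanity check of the environment:

    # Environment check: only requests has to be installed separately
    # (e.g. `pip install requests`); re, time and random are standard library.
    import re
    import time
    import random
    import requests

    print('requests version:', requests.__version__)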

2. Approach

Let's first visit the site once the way a normal reader would.



Open the browser's developer tools (I use Chrome); pressing F12 opens them.



From that search results page we extract the URL of the novel's detail page.


The detail page gives us the URL of every chapter; following a chapter URL opens the page that holds that chapter's text, and a regular expression then pulls the text out. A small sketch of that extraction step follows below.
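
As a minimal sketch of the extraction step, here is the idea applied to a made-up HTML snippet; the regular expression is the same one used against the real chapter list in section 4:

    import re

    # Made-up example of what a chapter list looks like in the page source;
    # the pattern below is the one used in section 4.
    html = '''
    <dd><a href="https://www.example.com/novel/1.html">Chapter 1</a></dd>
    <dd><a href="https://www.example.com/novel/2.html">Chapter 2</a></dd>
    '''

    # Each match is a (chapter_url, chapter_title) pair
    chapter_links = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', html)
    for chapter_url, chapter_title in chapter_links:
        print(chapter_title, '->', chapter_url)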

3. Analyzing the page structure

The real URL behind the novel search (screenshot)

The URL of the novel's detail page (screenshot)

The URL of each chapter of the novel (screenshot)
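
The screenshots above show that the search form ends up requesting a search.php endpoint with a searchkey parameter; this is the same URL the code in section 4 builds by hand. As a minimal sketch of reproducing that request (the keyword here is only a placeholder, and the site may have changed since the article was written):

    import requests

    # The search endpoint captured in the developer tools (same as in section 4);
    # '某小说' is a placeholder keyword.
    search_url = 'https://www.biquge5200.com/modules/article/search.php'
    resp = requests.get(search_url, params={'searchkey': '某小说'})

    print(resp.url)          # the "real" search URL, with the keyword percent-encoded
    print(resp.status_code)  # 200 if the request went through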

4. Code implementation

    import requests
    import re
    import time
    import random
    
    
    def download(book_name):
        # Search for the novel and return (novel_url, novel_name) on an exact title match
        search_real_url = 'https://www.biquge5200.com/modules/article/search.php?searchkey=' + book_name
        novel_list = []
        try:
            novel_source = requests.get(search_real_url).text
            reg1 = r'<td class="odd"><a href="(.*?)">(.*?)</a></td>.*?<td class="odd">(.*?)</td>'
            # All search results (each tuple: novel URL, title, author name)
            novel_list = re.findall(reg1, novel_source, re.S)
            # Check whether the search returned any results
            if len(novel_list) == 0:
                print('The novel you are looking for was not found; please check the name and try again.')
        except Exception as e:
            print(e)
        for novel_url, novel_name, novel_author in novel_list:
            if novel_name == book_name:
                print('About to download: %s  Author: %s' % (novel_name, novel_author))
                return novel_url, novel_name
        # No exact match among the search results
        return None
    
    
    def get_chapter(url):
        # Fetch the chapter list page and return (chapter_url, chapter_name) pairs
        chapter_list = []
        try:
            # Source code of the chapter list page
            chapter_page_source = requests.get(url).text
            reg2 = r'<dd><a href="(.*?)">(.*?)</a></dd>'
            chapter_list = re.findall(reg2, chapter_page_source)
        except Exception as e:
            print(e)
        return chapter_list
    
    
    def get_content(chapter_list, novel_name):
        # Download every chapter and append it to "<novel_name>.txt"
        count = 0
        length = len(chapter_list)
        for chapter_url, chapter_name in chapter_list:
            try:
                # Wait 1-2 seconds between requests to avoid hammering the server
                time.sleep(1 + random.random())
                content_source = requests.get(chapter_url).text
                reg = r'<div id="content">(.*?)</div>'
                content = re.findall(reg, content_source, re.S)[0]
                # Strip the HTML tags and whitespace left in the chapter text
                content = content.replace('<br/>', '').replace(' ', '').replace('<p>', '').replace('</p>', '')
                count += 1
                with open(novel_name + '.txt', 'a', encoding='utf-8') as f:
                    f.write(chapter_name + '\n' * 2 + content + '\n' * 2)
                    print('Writing: ' + chapter_name)
                    print('Progress: %0.2f%%' % (count / length * 100))
            except Exception as e:
                print(e)
    
    
    if __name__ == '__main__':
        book_name = input('Enter the name of the novel you want to download (make sure it is spelled exactly): ')
        result = download(book_name)
        # download() returns None when nothing matches the name exactly
        if result:
            novel_url, novel_name = result
            chapter_list = get_chapter(novel_url)
            get_content(chapter_list, novel_name)
    
    ## The above code is for learning purposes only
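
To run it, save the script (the filename is up to you) and start it with Python 3; it asks for the novel's name, then appends each chapter to a file named "<novel name>.txt" in the current directory, pausing one to two seconds between chapter requests.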
