Scraping a Novel (Step 2) with Python

Author: 肥宅_Sean | Published 2018-01-13 11:14 | Read 265 times

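This assumes you have already installed the bs4 and requests libraries.
The novel was picked at random, so don't read anything into the choice (it is here for learning purposes only).
Everything is written in Python 3. Plenty of crawler tutorials online use Python 2, but relatively few use Python 3.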

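The page being scraped is https://www.qu.la/book/12763/10664294.html
The goal is to get the chapter title, which means finding its exact position in the HTML.
If you have not read Step 1 yet, take a look at it first via the link below.
Click to view Step 1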

Step 2: Get the chapter title

Key code

title = soup.find(attrs={"class": "bookname"})
title = title.find('h1').text

There are a few other small changes, but overall the only real addition is these two lines, which fetch the chapter title.
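The title lookup can be tried without touching the network. The sketch below runs the same two `find` calls against a small made-up HTML fragment. `SAMPLE_HTML` and the `extract_title` helper are assumptions for illustration, not part of the original script, and the fragment only imitates the `bookname` / `h1` layout assumed above. The helper also returns `None` instead of raising when the expected tags are missing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout this article assumes:
# a container with class "bookname" holding the chapter title in an <h1>.
SAMPLE_HTML = """
<div class="bookname">
  <h1> Chapter 1: The Beginning </h1>
</div>
<div id="content">Body text</div>
"""


def extract_title(html):
    """Return the chapter title, or None if the expected tags are missing."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find(attrs={"class": "bookname"})
    if box is None:
        return None
    h1 = box.find("h1")
    if h1 is None:
        return None
    # strip=True trims the surrounding whitespace left by the page markup
    return h1.get_text(strip=True)


print(extract_title(SAMPLE_HTML))
print(extract_title("<p>no title here</p>"))
```

If the site changes its markup, the `None` result makes the failure easy to spot, whereas chaining `.find('h1').text` directly would raise an `AttributeError`.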

The full code is as follows:

import requests
from bs4 import BeautifulSoup
url = "https://www.qu.la/book/12763/10664294.html"
req = requests.get(url)
req.encoding = 'utf-8'
soup = BeautifulSoup(req.text, 'html.parser')
content = soup.find(id='content')
title = soup.find(attrs={"class": "bookname"})
title = title.find('h1').text

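# encoding='utf-8' avoids encode errors when the platform default (e.g. GBK on Windows) cannot represent a character
with open('E:/Code/Python/Project/txtGet/1.txt', 'w', encoding='utf-8') as f:
    string = content.text.replace('\u3000', '').replace('\t', '').replace('\n', '').replace('\r', '').replace('『', '“')\
        .replace('』', '”')  # strip layout characters and normalise the quote marks
    string = string.split('\xa0')  # split into paragraphs on non-breaking spaces
    string = list(filter(lambda x: x, string))  # drop the empty pieces left by the split
    for i in range(len(string)):
        string[i] = '    ' + string[i]  # indent each paragraph
        if "本站重要通知" in string[i]:  # cut off the site notice ("Important site notice") appended at the end
            t = string[i].index('本站重要通知')
            string[i] = string[i][:t]
    string = '\n'.join(string)
    string = title + '\n' + string
    print(string)
    f.write(string)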
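The cleanup steps in the script can also be pulled into a small function so they can be checked on a short string before running against the live page. This is only a sketch: `clean_chapter_text` and the `sample` string are made up for illustration. It also differs from the script in one respect, which is a deliberate variation: it cuts everything from the footer marker onward in a single step, rather than truncating only the paragraph that contains the marker.

```python
def clean_chapter_text(raw, footer_marker="本站重要通知"):
    """Turn raw chapter text into indented paragraphs, one per line."""
    # Cut everything from the site footer notice onward, if it is present.
    cut = raw.find(footer_marker)
    if cut != -1:
        raw = raw[:cut]
    # Drop layout characters and normalise the quote marks.
    for old, new in (("\u3000", ""), ("\t", ""), ("\n", ""), ("\r", ""),
                     ("『", "“"), ("』", "”")):
        raw = raw.replace(old, new)
    # Assumption: paragraphs are separated by runs of non-breaking spaces.
    paragraphs = [p for p in raw.split("\xa0") if p]
    return "\n".join("    " + p for p in paragraphs)


# Made-up sample imitating the text found in the content element.
sample = ("\xa0\xa0\xa0\xa0First paragraph.\n"
          "\xa0\xa0\xa0\xa0『Hi』 she said.\xa0\xa0本站重要通知: ads")
print(clean_chapter_text(sample))
```

With a function like this, the file-writing part of the script would reduce to writing `title + '\n' + clean_chapter_text(content.text)`.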

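Follow-up articles:
Scraping a Novel (Step 3) with Python
Scraping a Novel (Step 4) with Python
Scraping a Novel (Step 5) with Python
Scraping a Novel (Step 6) with Python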
