python3--Simple Crawler--Novel Site

Author: w_dll | Published: 2020-02-14 10:58

1 Environment

CentOS 7
Python 3

2 Install the required dependencies

pip install requests
pip install beautifulsoup4
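
On a machine with more than one Python installed, pip may belong to a different interpreter than python3, so it can be worth confirming that both packages are visible before running the crawler. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and not part of the original script.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # beautifulsoup4 is imported under the module name bs4
    missing = missing_modules(["requests", "bs4"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the same python3 that will run the crawler. If a name is reported as missing, installing with python3 -m pip install instead of plain pip targets that interpreter.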

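3 Crawler script

Works with 笔趣阁 (Biquge) as of 2020-02-14.
Usage:

python3 get_novel.py <URL of the novel's index page on 笔趣阁> > all.txt &

When the crawl finishes, the novel is saved to all.txt in the current directory.
The script source follows: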

[root@xxwdll novel]# cat get_novel.py
import re
import sys

import bs4
import requests

# Index page of the novel, passed as the first command-line argument
b_url = sys.argv[1]
r = requests.get(b_url)
content = bs4.BeautifulSoup(r.content.decode("utf-8"), features="html.parser")

# Collect the chapter file names (for example 12345.html) linked from id="list"
pages = set()
for chapter_list in content.find_all(id="list"):
    for link in chapter_list.find_all("a", href=True):
        page = re.search(r"[0-9]+\.html$", link["href"])
        if page:
            pages.add(page.group(0))

# Sort by numeric ID; a plain string sort would put 10.html before 9.html
for p in sorted(pages, key=lambda name: int(name.split(".")[0])):
    # Chapter pages sit directly under the index URL
    this_url = b_url.rstrip("/") + "/" + p
    print(this_url)
    r = requests.get(this_url)
    content = bs4.BeautifulSoup(r.content.decode("utf-8"), features="html.parser")
    article = content.find(id="content")
    print(content.title.string)
    # Join the text fragments with newlines, which drops <br/>, <p> and other tags
    text = article.get_text("\n", strip=True) if article else ""
    print(text)
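
Two parsing details decide whether the output is readable: chapter file names have to be ordered by their numeric ID, and tags have to be removed without swallowing the text between them. The sketch below illustrates both on inline sample markup, so it runs without network access. The sample HTML and the helper names chapter_pages and strip_tags are assumptions made for illustration; it uses only the standard re module rather than BeautifulSoup, and the real site's markup may differ.

```python
import re

# Hypothetical sample markup, loosely modelled on a chapter index and a chapter page.
SAMPLE_INDEX = """
<div id="list">
  <dd><a href="/book/1/10.html">Chapter 3</a></dd>
  <dd><a href="/book/1/8.html">Chapter 1</a></dd>
  <dd><a href="/book/1/9.html">Chapter 2</a></dd>
  <dd><a href="/book/1/9.html">Chapter 2</a></dd>
</div>
"""
SAMPLE_CHAPTER = '<div id="content">First line.<br/>Second <b>bold</b> line.</div>'


def chapter_pages(index_html):
    """Return unique chapter file names, ordered by their numeric ID."""
    names = set(re.findall(r'href="[^"]*?([0-9]+\.html)"', index_html))
    return sorted(names, key=lambda name: int(name.split(".")[0]))


def strip_tags(fragment):
    """Turn <br/> into newlines, then drop every remaining tag."""
    text = re.sub(r"<br\s*/?>", "\n", fragment)
    # [^>]+ stops at the first closing bracket, so text between tags survives
    return re.sub(r"<[^>]+>", "", text)


if __name__ == "__main__":
    print(chapter_pages(SAMPLE_INDEX))
    print(strip_tags(SAMPLE_CHAPTER))
```

A plain string sort of the same names would place 10.html before 8.html, and a greedy pattern such as <.*> would delete "bold line." from the second line of the sample, because it matches from the first < to the last > on that line.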

Article link: https://www.haomeiwen.com/subject/mrldfhtx.html
