[Case Study: Web Scraper] Novel Downloader

Author: X_Ran_0a11 | Source: published 2019-06-27 16:27

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import requests
import re

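# Download the novel's table-of-contents page
url = 'http://www.liudatxt.com/so/10818/'
# Send an HTTP GET request for the page
response = requests.get(url)
# Decode the page using the encoding detected from its content,
# instead of re-encoding text that was already decoded with the wrong charset
response.encoding = response.apparent_encoding
html = response.text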
# Collect each chapter's information (URL and title)
dl = re.findall(r'<h3>第一卷 诗成惊鬼神</h3>.*?</ul>', html, re.S)
# findall needs a string, so pass dl[0] rather than the list itself
chapter_url = re.findall(r'/so/.*?html', dl[0])
# The parentheses form a capture group, so findall returns only the title text
# and no extra slicing step is needed to strip the surrounding '">' and '<'
chapter_title = re.findall(r'">(.*?)<', dl[0])
chapter_info = list(zip(chapter_url, chapter_title))

# Create a file to hold the novel's text
fb = open('%s.txt' % '儒道至圣', 'w', encoding='utf-8')
# Loop over the chapters and download each one
for chapter in chapter_info:
    chapter_url, chapter_title = chapter
    chapter_url = 'http://www.liudatxt.com%s' % chapter_url
    # Download the chapter page and decode it with the detected encoding
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = chapter_response.apparent_encoding
    chapter_html = chapter_response.text
    # Extract the chapter body; search the page source with Ctrl+F first to
    # confirm that the prefix and suffix markers each occur only once
    chapter_content = re.findall(r'<div id="content">.*?<script type="text/javascript">read_bot', chapter_html, re.S)[0]
    # Clean the data. str.replace is the simple approach; for a thorough
    # cleanup, substituting with re.sub works better
    chapter_content = chapter_content.replace('<div id="content">', '')
    chapter_content = chapter_content.replace('&nbsp;', '')
    chapter_content = chapter_content.replace('<br/>', '')
    chapter_content = chapter_content.replace('</div>', '')
    chapter_content = chapter_content.replace('<script type="text/javascript">read_bot', '')
    # Persist the data (write it to the file)
    fb.write(chapter_title)
    fb.write(chapter_content)
    fb.write('\n')
    print(chapter_url)
# Close the file so everything buffered is flushed to disk
fb.close()
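
The comments in the script mention two refinements: letting findall pick out fields with parentheses, and cleaning the chapter text with re.sub rather than a chain of replace calls. The sketch below tries both on a small inline sample, so it runs without touching the network. The sample markup and the helper names parse_chapters and clean_content are assumptions made for illustration. The real site's HTML may differ, and the patterns would need adjusting to match it.

```python
import html
import re

# Hypothetical markup, loosely modeled on the patterns the script matches;
# the real site's HTML may differ.
SAMPLE_INDEX = (
    '<h3>Volume One</h3><ul>'
    '<li><a href="/so/10818/1.html">Chapter 1</a></li>'
    '<li><a href="/so/10818/2.html">Chapter 2</a></li>'
    '</ul>'
)
SAMPLE_CHAPTER = (
    '<div id="content">&nbsp;&nbsp;First line.<br/>\n'
    '&nbsp;&nbsp;Second line.<br />\n</div>'
)


def parse_chapters(index_html):
    """Return (url, title) pairs using two capture groups in one pattern."""
    return re.findall(r'<a href="(/so/[^"]*?\.html)">(.*?)</a>', index_html)


def clean_content(chapter_html):
    """Strip tags with re.sub, then decode entities such as &nbsp;."""
    text = re.sub(r'<br\s*/?>', '\n', chapter_html)   # line breaks -> newlines
    text = re.sub(r'<[^>]+>', '', text)               # drop remaining tags
    text = html.unescape(text).replace('\xa0', ' ')   # &nbsp; -> plain space
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)  # skip empty lines


chapters = parse_chapters(SAMPLE_INDEX)
cleaned = clean_content(SAMPLE_CHAPTER)
print(chapters)
print(cleaned)
```

With two groups in the pattern, findall returns a list of tuples, so the URL and title stay paired without a separate zip step. Turning the br tags into newlines instead of deleting them also keeps the paragraph breaks in the saved text.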