Scraping a novel from a website with Python and BeautifulSoup

Author: 青铜搬砖工 | Published: 2018-04-17 22:29 | Read 59 times


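Many novels can no longer be downloaded, so there is no way to save them to a phone. I therefore wanted to scrape the text of a novel into a txt file, so there is something to read when I am stuck in the bathroom with no signal.
Back to the point: to scrape an entire novel, the following problems have to be solved first: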

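Each chapter lives at its own URL, so to scrape chapters one after another we need to find out how the chapter URLs relate to each other.
Once that relationship is known, how do we get the URL of the next chapter from the current one?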

As an example, we will scrape a novel hosted on m.qidian.com.

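Analyzing the URL
The URL of chapter 4 is: https://m.qidian.com/book/1003782761/320278263
The URL of chapter 5 is: https://m.qidian.com/book/1003782761/321319656
The URL of chapter 6 is: https://m.qidian.com/book/1003782761/322050277
...
These examples show a pattern: https://m.qidian.com/book/1003782761/ never changes and only the last path segment varies. From the information so far, however, there is no visible rule for how that last segment is generated.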
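
The observation above can be checked with a few lines of Python. This is only a small sketch using the three chapter URLs quoted in this post; it splits each URL at its last slash to separate the fixed part from the part that changes.

```python
# The three chapter URLs listed above
urls = [
    "https://m.qidian.com/book/1003782761/320278263",
    "https://m.qidian.com/book/1003782761/321319656",
    "https://m.qidian.com/book/1003782761/322050277",
]

# Split each URL at the last "/" into a prefix and the final path segment
parts = [url.rsplit("/", 1) for url in urls]
prefixes = {prefix for prefix, _ in parts}
last_segments = [segment for _, segment in parts]

print(prefixes)       # a single shared prefix
print(last_segments)  # the segment that changes from chapter to chapter
```

The prefix is identical for all three chapters, while the last segments do not increase by a constant step, which is why they cannot simply be guessed.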
Let's look at the requests the page sends and see what we can find.

The Request URL contains bookId=1003782761&chapterId=320278263, so in the chapter URL above 1003782761 is the bookid and 320278263 is the chapterid.
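
As a rough illustration, the two ids can be read out of such a Request URL with the standard library. The sketch below assumes the getChapterInfo endpoint that appears in the script later in this post, and leaves out the _csrfToken parameter for brevity.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL as seen in the browser (the _csrfToken parameter is omitted here)
request_url = ("https://m.qidian.com/majax/chapter/getChapterInfo"
               "?bookId=1003782761&chapterId=320278263")

# parse_qs maps every query parameter name to a list of its values
params = parse_qs(urlsplit(request_url).query)
book_id = params["bookId"][0]
chapter_id = params["chapterId"][0]

print(book_id, chapter_id)
```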
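We now know that a chapter URL of this novel has the form:
https://m.qidian.com/book/bookid/chapterid
As long as we can obtain the correct chapterid, we can keep fetching the novel chapter after chapter.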
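
A minimal sketch of this URL pattern in Python. The helper name chapter_url is made up for this illustration and does not come from any library.

```python
def chapter_url(book_id, chapter_id):
    """Build a chapter URL from a bookid and a chapterid (hypothetical helper)."""
    return "https://m.qidian.com/book/{}/{}".format(book_id, chapter_id)


# Chapter 4 of the example novel
print(chapter_url("1003782761", "320278263"))
```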
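Since a request was sent to the server, let's see what the server returns (taking chapter 4 as the example).
The response is a JSON-formatted string.
On inspection, its next field is the chapterid of the next chapter (chapter 5 is indeed 321319656).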
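
To make this concrete, here is a small sketch of pulling next out of such a response. The sample string is a hypothetical, heavily trimmed response body: only the key path data, chapterInfo, next is taken from the script later in this post. The real response carries many more fields, and whether next arrives as a number or a string is an assumption here.

```python
import json

# Hypothetical, trimmed-down response body for chapter 4 (not a captured response)
sample_response = '{"data": {"chapterInfo": {"next": 321319656}}}'

# Convert the JSON string into Python dicts and walk down to the next chapterid
info = json.loads(sample_response)
next_id = info["data"]["chapterInfo"]["next"]

print(next_id)
```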


With the URL problem solved, what remains is to scrape the content with BeautifulSoup. The code is as follows:

# -*- coding: utf-8 -*-
import json

import requests
from bs4 import BeautifulSoup

BOOK_ID = "1003782761"
# _csrfToken was copied from the Request URL seen in the browser
CSRF_TOKEN = "w7RePr18qXzxByPdIn0h7iQtII0AC4z8oPMIXioz"
# URL of the page that holds the chapter text
CONTENT_URL = "https://m.qidian.com/book/{}/{}"
# URL used to look up the id of the next chapter (taken from the Request URL)
INFO_URL = "https://m.qidian.com/majax/chapter/getChapterInfo?_csrfToken={}&bookId={}&chapterId={}"


def xiaoshuo(chapter_id, f):
    """Write one chapter to the open file f and return the next chapterid."""
    wb_data = requests.get(INFO_URL.format(CSRF_TOKEN, BOOK_ID, chapter_id))
    wb_content = requests.get(CONTENT_URL.format(BOOK_ID, chapter_id))
    soup_content = BeautifulSoup(wb_content.text, 'lxml')
    title = soup_content.select("#chapterContent > section > h3")
    article = soup_content.select("#chapterContent > section > p")
    print(title[0].get_text())
    # Add a newline after the title for readability
    f.write(title[0].get_text() + "\n")
    s_article = ""
    for p in article:
        s_article += p.get_text()
        # Put each paragraph on its own line for readability
        f.write(p.get_text() + "\n")
    print(s_article)

    # Parse the JSON string returned by the server
    json_data = json.loads(wb_data.text)
    next_chapter_id = json_data["data"]["chapterInfo"]["next"]
    print(next_chapter_id)
    return next_chapter_id


# Example: fetch the five chapters that follow chapter 4
chapter_id = "321319656"
# Open the output file as UTF-8 so the Chinese text is written without encoding errors
with open(r'C:\Users\Administrator\Desktop\test.txt', 'w', encoding='utf-8') as f:
    for i in range(5):
        chapter_id = xiaoshuo(chapter_id, f)
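
The loop above relies on each response naming the chapter that follows. The sketch below isolates that "follow the next id" idea without any network access. The function follow_chapters and the fake_next table are made up for this illustration, and stopping on a missing next value is an assumption, since this post does not show what the server returns for the last chapter.

```python
def follow_chapters(start_id, get_next_id, limit):
    """Collect up to `limit` chapter ids by repeatedly asking for the next id."""
    ids = []
    current = start_id
    # Assumption: a missing (None) next id means there is no further chapter
    while current and len(ids) < limit:
        ids.append(current)
        current = get_next_id(current)
    return ids


# Stand-in for the network lookup, built from the chapter ids quoted in this post
fake_next = {320278263: 321319656, 321319656: 322050277}

print(follow_chapters(320278263, fake_next.get, limit=5))
# prints [320278263, 321319656, 322050277]
```

In a real run, get_next_id would be a function that requests the chapter info URL and returns the next field, as the script does.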

After the script runs, test.txt contains: