I've been learning a bit of Python, so I wrote this as practice.

First, take a look at the URL of the novel we want to scrape, 风大's new book:
http://www.biquge.com.tw/16_16209/
Import the requests_html library:
from requests_html import HTMLSession
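requests_html is a third-party package; if it isn't installed yet, running pip install requests-html should pull it in.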
Start by fetching the HTML of this page:
def initData(url):
    # Fetch a page and return its parsed HTML, or None on failure.
    try:
        if url is None:
            return
        session = HTMLSession()
        content = session.get(url)
        html = content.html
        return html
    except Exception:
        return
Now take a look at the page source.

The hierarchy is roughly wrapper -> div.box_con -> list -> dd.
The list element holds the book's table of contents.

Next, we traverse the table of contents and pull each chapter's URL out of the dd elements:
def getData(html):
    dic = {}
    # Grab the dd elements that make up the table of contents.
    content = html.find('#wrapper')[0].find('div.box_con')[1].find('#list')[0].find('dd')
    for item in content:
        dic["name"] = item.text
        # The chapter URL is in the href attribute of the a tag.
        dic["url"] = item.find('a')[0].attrs["href"]
        getBookHtml(dic)
Now we have the URL for each chapter.
Opening one of those URLs shows the chapter body.

You can see that each line of the novel's body corresponds to content under a br tag inside #content on the page.
So we extract the content of all the br tags under #content:
def getBookContent(html):
    # Each br element under #content holds one line of the chapter.
    print(html)
    dic = {}
    content = html.find('#content', first=True).find('br')
    for line in content:
        dic["line"] = line.text
        print(dic)
        writeToFile(line.text)
Then write that content to a txt file:
def writeToFile(text):
    # Append one line of text to the output file.
    with open('我是至尊.txt', 'a', encoding='utf-8') as f:
        f.write(text + "\n")
That completes the scraper.
The overall flow is: traverse the table of contents -> get each chapter's URL -> fetch every line of the chapter body -> write it to a local file.
Here is the full code:
from requests_html import HTMLSession

url = "http://www.biquge.com.tw/16_16209/"
host = "http://www.biquge.com.tw"

def initData(url):
    # Fetch a page and return its parsed HTML, or None on failure.
    try:
        if url is None:
            return
        session = HTMLSession()
        content = session.get(url)
        html = content.html
        return html
    except Exception:
        return

def getData(html):
    # Walk the table of contents and handle each chapter in turn.
    dic = {}
    content = html.find('#wrapper')[0].find('div.box_con')[1].find('#list')[0].find('dd')
    for item in content:
        dic["name"] = item.text
        dic["url"] = item.find('a')[0].attrs["href"]
        # print(dic)
        getBookHtml(dic)

def getBookHtml(dic):
    # The href in the table of contents is relative, so prepend the host.
    url = dic["url"]
    session = HTMLSession()
    content = session.get(host + url)
    html = content.html
    getBookContent(html)

def getBookContent(html):
    # Each br element under #content holds one line of the chapter.
    print(html)
    dic = {}
    content = html.find('#content', first=True).find('br')
    for line in content:
        dic["line"] = line.text
        print(dic)
        writeToFile(line.text)

def start():
    html = initData(url)
    getData(html)

def writeToFile(text):
    # Append one line of text to the output file.
    with open('我是至尊.txt', 'a', encoding='utf-8') as f:
        f.write(text + "\n")

start()

Since I'm not very familiar with HTML and only know Python superficially, this code is fairly slow and inflexible; once I've had time to look into multithreading I'll optimize it.
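As one possible direction for that optimization, here is a minimal sketch (not from the original code) of fetching chapters concurrently with a thread pool. It assumes the chapter URLs have already been collected into a list, reuses writeToFile from above, and fetch_chapter / crawl_concurrently are hypothetical helper names:

from concurrent.futures import ThreadPoolExecutor
from requests_html import HTMLSession

host = "http://www.biquge.com.tw"

def fetch_chapter(url):
    # Fetch one chapter page and join its lines into a single string.
    session = HTMLSession()
    html = session.get(host + url).html
    lines = html.find('#content', first=True).find('br')
    return "\n".join(line.text for line in lines)

def crawl_concurrently(chapter_urls, workers=8):
    # pool.map keeps results in the same order as chapter_urls,
    # so the chapters still land in the file in reading order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chapter_text in pool.map(fetch_chapter, chapter_urls):
            writeToFile(chapter_text)

The network requests overlap across threads while the file writes stay sequential, so the output order is preserved.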
If anything is wrong, please point it out!