Finally finished the code that scrapes every chapter's href, so there's no need to hunt for patterns in the chapter numbers anymore. Just run it and the novel downloads straight into a txt file.
I had learned a bit of xpath before and tried it for a while, but I'm not fluent with it, so I went with plain for loops instead, five of them in total.
import requests
from bs4 import BeautifulSoup as bf

url = 'https://www.soxscc.com/MangHuangJi/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}

# fetch the table-of-contents page
html = requests.get(url, headers=headers)
soup = bf(html.text, 'lxml')

# the chapter list sits in <div class="novel_list" id="novel4451">
t1 = soup.find('div', class_='novel_list', id='novel4451')
t2 = t1.findAll('dl')

output = "{}\n{}\n\n\n\n\n"  # chapter title, then body, then blank lines between chapters
for dl in t2:
    t3 = dl.findAll('dd')
    for dd in t3:
        t4 = dd.findAll('a')
        for a in t4:
            t5 = a.get('href')                    # relative link to one chapter
            url1 = 'https://www.soxscc.com' + t5  # make it absolute
            res = requests.get(url1, headers=headers)
            page = bf(res.text, 'lxml')
            title = page.find('h1').string
            contental = page.findAll('div', class_='content')
            for div in contental:
                contents = div.get_text()
                outputs = output.format(title, contents)
                # open once per chapter instead of once per character
                with open('biquge.txt', 'a', encoding='utf-8') as f:
                    for ch in outputs:  # write the chapter out character by character
                        f.write(ch)
It really does download continuously now; it's at almost seven hundred chapters already.
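One tweak worth considering (my addition, not part of the original script): pause a little between chapter requests so several hundred fetches in a row don't hit the site too hard. A minimal sketch:

import time

# right after requests.get(url1, headers=headers) in the innermost loop:
time.sleep(0.5)  # assumed half-second pause between chapters; adjust to taste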

Scraping this didn't take anything fancy, just requests and BeautifulSoup. The rest is finding tags with find and findAll, looping over the results with for, and then with open to write everything into the txt file. xpath and regular expressions would probably be more convenient to use, but even without them this kind of problem can still be solved.
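For comparison, here is what the tag hunting could look like with xpath through lxml. This is just a sketch assuming the same page structure (the novel4451 div and the dl/dd/a nesting); I haven't run it against the site:

import requests
from lxml import etree

url = 'https://www.soxscc.com/MangHuangJi/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}

tree = etree.HTML(requests.get(url, headers=headers).text)
# one xpath expression stands in for the three nested for loops over dl, dd and a
hrefs = tree.xpath('//div[@id="novel4451"]//dd/a/@href')

Each entry in hrefs would then be joined with https://www.soxscc.com and fetched the same way as above.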