I came across a crawler challenge game site that's quite fun; click to visit.
Opening the home page, the content is very simple, as shown:
![](https://img.haomeiwen.com/i690168/b5b78b5d089fc7d0.png)
The page tells you to append the number 14901 to the URL. Doing as instructed, visit http://www.heibanke.com/lesson/crawler_ex00/14901,
which returns the following:
![](https://img.haomeiwen.com/i690168/c97b36ca95f52cb5.png)
At this point the first level is clear: keep appending the number returned by the current page to the URL until you reach the final page, which has no new number. It's simple: extract the number, append it to the URL, request the new page, and repeat until the level is cleared.
Analyzing the HTML structure, the only element we need is the h3 tag:
![](https://img.haomeiwen.com/i690168/dedd7c7787d49629.png)
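The extraction step above boils down to pulling the digits out of the h3 text. A minimal sketch, using a made-up sample string standing in for the real page's h3 content:

```python
import re

# Hypothetical copy of the h3 text from the page (the real wording is similar)
text = "你需要在网址后输入数字 14901"
# Strip every character that is not a digit, leaving only the number
num = re.sub(r"\D", "", text)
print(num)  # → 14901
```

The same one-line `re.sub` is what the full crawler below uses after parsing the h3 element with BeautifulSoup.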
Crawler code:
```python
from urllib import request
from bs4 import BeautifulSoup
import re


def get_page(url):
    """Fetch a page with a browser-like User-Agent and return it as text."""
    print('get url %s' % url)
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Connection': 'keep-alive'
    }
    req = request.Request(url, headers=headers)
    page = request.urlopen(req).read()
    return page.decode('utf-8')


count = 1
numResult = ''
# Keep requesting until the page no longer contains a number,
# which means the level has been cleared
while True:
    print('Request #%d' % count)
    url = "http://www.heibanke.com/lesson/crawler_ex00/" + numResult
    result = get_page(url)
    soup = BeautifulSoup(result, "html.parser")
    # The hint is in the first h3 element
    text = soup.find_all("h3")[0].text
    # Strip everything that is not a digit
    numResult = re.sub(r"\D", "", text)
    if not numResult:
        break
    print('number: %s' % numResult)
    count += 1
print('Level cleared, url: %s' % url)
```
Result:
![](https://img.haomeiwen.com/i690168/50fd72749db2bbc6.png)
Visiting http://www.heibanke.com/lesson/crawler_ex00/30366/ confirms it: level cleared.
![](https://img.haomeiwen.com/i690168/dd8a1dfd5dc6e321.png)