Key points
- The steps of a web crawler
- requests
- parsel
- Parsing data with XPath
The four steps of a crawler:
1. Get the page address (the target URL)
2. Send the request
3. Parse the data
4. Save it locally
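The four steps above can be sketched as a small pipeline. This is only an illustration: `run_crawler` and its `fetch`, `parse`, and `save` parameters are hypothetical names, not part of the original code, and the stand-in functions at the bottom replace real network calls so the sketch runs offline.

```python
from typing import Callable, Iterable


def run_crawler(
    start_url: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], Iterable[str]],
    save: Callable[[str], None],
) -> int:
    """Run steps 2 to 4 for one target URL and return how many items were saved."""
    html = fetch(start_url)       # step 2: send the request
    count = 0
    for item in parse(html):      # step 3: parse the data
        save(item)                # step 4: save locally
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in functions (assumptions) so the sketch needs no network access.
    saved = []
    total = run_crawler(
        "https://hdqwalls.com/latest-wallpapers/page/1",  # step 1: target URL
        fetch=lambda url: "first,second",
        parse=lambda html: html.split(","),
        save=saved.append,
    )
    print(total, saved)
```

In a real crawler, `fetch` would wrap `requests.get`, `parse` would wrap a `parsel.Selector` query, and `save` would write to disk.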
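Analyzing the site
The site serves static data, so we only need to work out the pattern of its URLs: listing pages live at https://hdqwalls.com/latest-wallpapers/page/N, where N is the page number.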
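As a minimal sketch of that URL pattern, the listing-page addresses can be generated up front. The helper name `build_page_urls` and the constant `BASE_URL` are assumptions made for this example; only the URL pattern itself comes from the code in this article.

```python
from typing import List

# Listing-page pattern used in this article; {page} is the page number.
BASE_URL = "https://hdqwalls.com/latest-wallpapers/page/{page}"


def build_page_urls(first: int, last: int) -> List[str]:
    """Return the listing-page URLs for pages first..last, both inclusive."""
    return [BASE_URL.format(page=page) for page in range(first, last + 1)]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```

Making the last page inclusive avoids the off-by-one confusion that `range` tends to cause, since `range` itself excludes its end value.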
Code implementation
Import modules
import requests
import parsel
Requesting the data
url = 'https://hdqwalls.com/latest-wallpapers/page/1'
# url = 'https://hdqwalls.com'
# Request headers: make the crawler look like a browser client when it sends requests to the server
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}
requ = requests.get(url=url, headers=headers).text  # HTML source of the listing page
Parsing the data
sel = parsel.Selector(requ) # <Selector xpath=None data='<html lang="en">\n<head>\n<script src="...'>
# Links to each wallpaper's detail page (relative paths)
pic_html = sel.xpath('//body/div/div[3]/div/a[1]/@href').getall()
for html in pic_html:
    # Use a separate name so the list being iterated is not overwritten
    detail_url = 'https://hdqwalls.com' + html
    requ2 = requests.get(url=detail_url, headers=headers).text
    sel2 = parsel.Selector(requ2)
    title = sel2.xpath('//body/header/div/div/h1/text()').get().strip()
    href = sel2.xpath('//body/div/div[2]/div/div/div/a/@href').get()
    # Request the image itself as binary data
    requ3 = requests.get(url=href, headers=headers).content
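The loop above builds each detail-page address by concatenating the site root with the extracted `href`, which works when every link starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also leaves absolute links untouched. The helper name `absolute_url` and the sample paths are assumptions for illustration, not values taken from the site.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://hdqwalls.com"


def absolute_url(href: str) -> str:
    """Resolve a link found on the site against the site root."""
    return urljoin(SITE_ROOT, href)


if __name__ == "__main__":
    # Hypothetical hrefs: one site-relative path and one link that is already absolute.
    print(absolute_url("/wallpaper/example-wallpaper"))
    print(absolute_url("https://images.example.com/example.jpg"))
```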
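Saving the data
    # Still inside the for loop; the 壁纸 ("wallpapers") folder must already exist
    with open('壁纸\\' + title + '.jpg', mode='wb') as fp:
        fp.write(requ3)
    print(title, 'download complete')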
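The save step assumes the output folder exists and that the page title is a valid file name. The following is a minimal sketch of a more forgiving version, assuming we want to create the folder on demand and replace characters that Windows rejects in file names. `safe_image_path` and `save_image` are hypothetical helpers, not part of the original code.

```python
import re
from pathlib import Path


def safe_image_path(title: str, folder: str = "壁纸") -> Path:
    """Build a .jpg path under folder, replacing characters Windows rejects in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return Path(folder) / f"{cleaned or 'untitled'}.jpg"


def save_image(data: bytes, title: str, folder: str = "壁纸") -> Path:
    """Write the image bytes to disk, creating the folder if it is missing."""
    path = safe_image_path(title, folder)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

Inside the loop, `save_image(requ3, title)` could stand in for the `with open(...)` block. Using `pathlib` also avoids hard-coding the Windows `\\` separator.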
Complete code with pagination added
import requests
import parsel
# Request headers: make the crawler look like a browser client when it sends requests to the server
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}
for page in range(1, 6):  # range excludes the end value, so this covers pages 1 to 5
    url = f'https://hdqwalls.com/latest-wallpapers/page/{page}'
    # url = 'https://hdqwalls.com'
    # .text is the HTML source; the response object itself shows <Response [200]> on success
    requ = requests.get(url=url, headers=headers).text
    sel = parsel.Selector(requ)  # <Selector xpath=None data='<html lang="en">\n<head>\n<script src="...'>
    pic_html = sel.xpath('//body/div/div[3]/div/a[1]/@href').getall()
    for html in pic_html:
        # Use a separate name so the list being iterated is not overwritten
        detail_url = 'https://hdqwalls.com' + html
        requ2 = requests.get(url=detail_url, headers=headers).text
        sel2 = parsel.Selector(requ2)
        title = sel2.xpath('//body/header/div/div/h1/text()').get().strip()
        href = sel2.xpath('//body/div/div[2]/div/div/div/a/@href').get()
        # Request the image itself as binary data
        requ3 = requests.get(url=href, headers=headers).content
        # The 壁纸 ("wallpapers") folder must already exist
        with open('壁纸\\' + title + '.jpg', mode='wb') as fp:
            fp.write(requ3)
        print(title, 'download complete')
    print(f'----------------------Page {page} download complete----------------------')