Starting today I'll take you through web scraping from the ground up. If you're interested, keep an eye out for my posts!
http and https: the HyperText Transfer Protocol (https is http over an encrypted connection)
https is somewhat more secure than http.
When we open Baidu in a browser, the address is usually https://www.baidu.com
Writing http://www.baidu.com:80 behaves the same as http://www.baidu.com, because when no port is given the browser falls back to the protocol's default port:
http: 80
https: 443
An example with an explicit port:
http://gy123456.cn:8088/admin/
protocol://hostname:port/path
protocol: http or https
hostname: the domain name
port: the port number
path: the route (the path on the server)
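These components can also be pulled apart programmatically; here is a minimal sketch using Python's standard urllib.parse on the example URL above:
from urllib.parse import urlsplit

parts = urlsplit("http://gy123456.cn:8088/admin/")
print(parts.scheme)    # http -- the protocol
print(parts.hostname)  # gy123456.cn
print(parts.port)      # 8088 (None when the URL relies on the default port)
print(parts.path)      # /admin/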
02_页面分析、HTTP原理和响应/01_requests之请求.py:
import requests
"""
请求库:
它本身存在的意义是做网络测试的,但是之后被发展成为了一个爬虫工具之一
"""
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
# response = requests.get(url="http://httpbin.org/get") # get请求
# response = requests.get(url="http://httpbin.org/get?id=Lonelyroots&id2=123") # 第一种方式get请求传参
response = requests.get(url="http://httpbin.org/get?id=Lonelyroots&id2=123",params={'kw':'百度'},headers=headers) # 第二种方式get请求传参
# params参数 只会在get方法中使用,因为使用之后,url = http://httpbin.org/get?id=Lonelyroots&id2=123&kw=百度
# outputs:一个参数kw为\u767e\u5ea6:Unicode编码转中文,即百度
# response = requests.post(url="http://httpbin.org/post?id=Lonelyroots&id2=123") # post请求,一般不用?传参
# response = requests.post(url="http://httpbin.org/post",data={'kw':'百度'}) # post请求,一般不用?传参,用data传参
# headers={
# "content-type": "application/x-www-form-urlencoded"
# }
# response = requests.put(url="http://httpbin.org/put",headers=headers) # put请求:用于更新资源
# response = requests.delete(url="http://httpbin.org/delete") # delete请求,通常用于删除实例
# response = requests.head(url="http://httpbin.org/head") # 返回请求头
# print(response.headers)
# response = requests.patch("http://httpbin.org/patch") # patch请求(提交修改部分数据)比如设置邮件能见度,这个接口用来设置邮件是公共可见的还是私有的
# print(response)
print(response.text)
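Since httpbin.org simply echoes the request back as JSON, it is easy to verify how params gets merged into the query string. A minimal sketch against the same endpoint (note that requests percent-encodes the Chinese value in the final URL):
import requests

response = requests.get(url="http://httpbin.org/get", params={'id': 'Lonelyroots', 'kw': '百度'})
print(response.url)             # http://httpbin.org/get?id=Lonelyroots&kw=%E7%99%BE%E5%BA%A6
print(response.json()['args'])  # {'id': 'Lonelyroots', 'kw': '百度'} -- decoded back from the echo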
02_页面分析、HTTP原理和响应/02_requests之响应.py:
import requests
url = "https://www.baidu.com/"
# Anti-scraping measure: the site inspects the UA, and if it spots a crawler UA it refuses to serve content, so we disguise ourselves
# The default crawler UA is "python-requests/2.27.1"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
response = requests.get(url=url, headers=headers, allow_redirects=False)  # allow_redirects controls whether redirects are followed; with False, response.url is the pre-redirect URL
# print(response.text)  # response body as a string, for reading text (common)
# print(response.content)  # response body as raw bytes, for images, video, audio (common)
# print(type(response.content))  # <class 'bytes'>
# print(response.status_code)  # status code
# print(response.headers)  # response headers
# print(response.request)  # the PreparedRequest object
# print(response.request.headers)  # request headers
# print(response.request.url)  # the URL that was requested
# print(response.url)  # the URL of the response, e.g. the new URL after a redirect (common)
print(response.cookies)
# output:
# <RequestsCookieJar[<Cookie BAIDUID=5E91A0E2ABA4BCC3B6D3F4D7B2EA1E66:FG=1 for .baidu.com/>]>
# print(response.cookies['BAIDUID'])  # index cookies by key, e.g. AAB9E16204105B5ED6DB6123E67832BD:FG=1
print(requests.utils.dict_from_cookiejar(response.cookies))
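To see what allow_redirects actually changes, httpbin.org provides a redirect endpoint; a minimal sketch comparing both settings:
import requests

# With redirects disabled we get the raw 302 and the target in the Location header
response = requests.get("http://httpbin.org/redirect/1", allow_redirects=False)
print(response.status_code)          # 302
print(response.headers['Location'])  # /get

# With redirects enabled (the default), requests follows the hop for us
response = requests.get("http://httpbin.org/redirect/1")
print(response.status_code)  # 200
print(response.url)          # http://httpbin.org/get -- the post-redirect URL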
02_页面分析、HTTP原理和响应/03_新浪.py:
import requests
url = "https://www.sina.com.cn/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# print(response.text)
# print(response.encoding)  # ISO-8859-1 (guessed from the response headers)
# print(response.apparent_encoding)  # utf-8 (detected from the body bytes)
# Fix the garbled text by switching to the detected encoding
response.encoding = response.apparent_encoding
print(response.text)
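The same fix can be done by hand: response.text is just response.content decoded with response.encoding, which requests guesses from the Content-Type header (hence the ISO-8859-1 default), while apparent_encoding is detected from the body bytes themselves. A minimal sketch of the manual route:
import requests

response = requests.get("https://www.sina.com.cn/", headers={"User-Agent": "Mozilla/5.0"})
# Decode the raw bytes with the detected encoding instead of the header-based guess;
# errors="replace" keeps the decode from crashing on the odd bad byte
html = response.content.decode(response.apparent_encoding, errors="replace")
print(html[:200])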
02_页面分析、HTTP原理和响应/04_百度关键字.py:
import requests
# Ajax lets a page refresh only part of itself without reloading the whole page, e.g. search suggestions that update as you type, before you actually submit the search
# This is the suggestion URL captured (from the browser's Network panel) while typing the "pycharm" keyword
# When you locate the route by capturing the keyword request like this, there is no need to fake the UA
url = "https://www.baidu.com/sugrec?pre=1&p=3&ie=utf-8&json=1&prod=pc&from=pc_web&sugsid=35104,35488,34584,35490," \
      "35872,35949,35955,35316,26350,35941&wd=requests&req=2&csor=7&cb=jQuery110205986892274942364_1645516722140&_" \
      "=1645516722141"
response = requests.get(url=url)
print(response.text)
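Because the captured URL carries a cb callback parameter, the body comes back as JSONP, i.e. a JSON payload wrapped in a jQuery... function call. A minimal sketch of stripping that wrapper (the trimmed-down URL and the 'g' field name are assumptions based on observed sugrec responses, not a documented API):
import json
import requests

url = "https://www.baidu.com/sugrec?prod=pc&wd=requests&cb=jQuery_demo"  # hypothetical trimmed URL
response = requests.get(url=url)
text = response.text  # looks like jQuery_demo({...})
payload = text[text.find('(') + 1 : text.rfind(')')]  # keep only the JSON between the outermost parentheses
data = json.loads(payload)
print(data.get('g', []))  # 'g' appears to hold the suggestion list (assumption)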
02_页面分析、HTTP原理和响应/05_百度贴吧.py:
import requests
import os
"""
"https://tieba.baidu.com/f?kw=宝马&pn=0" # 这里的kw,如果遇到乱码,可以去使用url的Unicode解码
"https://tieba.baidu.com/f?kw=宝马&pn=50"
"https://tieba.baidu.com/f?kw=宝马&pn=100"
"""
kw = input("请输入关键字:\t")
os_path = os.getcwd()+'/html/'+kw # os.getcwd():当前路径位置:F:\learning_records\U1 2021.9.26\Python\06、爬虫开发\02_页面分析、HTTP原理和响应
# 如果文件夹不存在,则创建文件夹
if not os.path.exists(os_path):
os.mkdir(os_path)
# 打印3页内容
for page in range(0,101,50):
url = f'https://tieba.baidu.com/f?kw={kw}&pn={page}'
response = requests.get(url=url)
with open(f'html/{kw}/{kw}_{int(page/50)}.html','w',encoding='utf-8') as f:
f.write(response.text)
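The same loop can be written with a params dict instead of hand-building the query string, plus a short delay between requests so we don't hammer the server. A minimal sketch (the fixed keyword is just for illustration):
import os
import time
import requests

kw = "宝马"  # example keyword from the docstring above
os.makedirs(f"html/{kw}", exist_ok=True)  # creates missing parents, no error if already there
for page in range(0, 101, 50):  # pn = 0, 50, 100 -> the first three pages
    response = requests.get("https://tieba.baidu.com/f", params={"kw": kw, "pn": page})
    with open(f"html/{kw}/{kw}_{page // 50}.html", "w", encoding="utf-8") as f:
        f.write(response.text)
    time.sleep(1)  # be polite: throttle between pages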
That's it for this article! I hope you'll keep supporting the Python series! Six months and I'll have you comfortable with Python; message me privately if you have questions about this article! New articles go up every day, so follow along if you like them! A fellow traveller keeping you company as you learn Python; however busy things get, the updates will keep coming. Let's keep at it together!
Editor: Lonelyroots