Python Web Scraping 4: POST Submission and Cookie Handling

Author: sunhaiyu | Published: 2017-07-17 20:08


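POST Submission

Until now every page was fetched with GET. Some pages can only be accessed after logging in, and that means submitting parameters. GET can carry parameters too, for example https://www.some-web-site.com?param1=username&param2=age, but the values then sit in the URL, where they end up in browser history and server logs. For something sensitive like a login it is better to submit a form, which means a POST request. With the requests library this is very simple: just pass the data argument.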
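To make the difference between the two styles concrete, here is a minimal sketch that builds both requests locally with requests.Request(...).prepare() and sends nothing over the network. The /login path and the field values are placeholders chosen for illustration, not a real endpoint.

```python
import requests

# Hypothetical endpoint, used only to build requests locally; nothing is sent.
url = 'https://www.some-web-site.com/login'
fields = {'param1': 'username', 'param2': 'age'}

# params= is encoded into the query string of the URL.
get_req = requests.Request('GET', url, params=fields).prepare()

# data= is form-encoded into the request body instead.
post_req = requests.Request('POST', url, data=fields).prepare()

print(get_req.url)                       # parameters are visible in the URL
print(post_req.url)                      # the URL stays clean
print(post_req.body)                     # parameters travel in the body
print(post_req.headers['Content-Type'])  # form encoding declared by requests
```

The same dictionary ends up in two different places, which is why data is the argument to use when mimicking an HTML form with method="post".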

Form submission example: this page has a form.

<form action="processing.php" method="post">
First name: <input name="firstname" type="text"><br>
Last name: <input name="lastname" type="text"><br>
<input id="submit" type="submit" value="Submit">
</form>

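The method attribute shows that the form is submitted with POST. The action attribute shows that the submission actually goes to processing.php, which handles the form. So that is the page we should request, passing the form parameters to it.

When passing parameters through the data argument of requests, make sure the keys match the name attributes of the input tags. Here they are firstname and lastname.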

import requests

url = 'https://pythonscraping.com/pages/files/processing.php'
params = {'firstname': 'Sun', 'lastname': 'Haiyu'}

r = requests.post(url, data=params, allow_redirects=False)
print(r.text)

Hello there, Sun Haiyu!
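Reading the action, method, and input names off a form by eye works for small pages. As a rough sketch of doing the same thing in code, the standard-library html.parser can collect them, and urljoin can resolve the relative action against the page address. The FormInspector class is a helper invented for this example, and page_url is an assumed location for the form page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormInspector(HTMLParser):
    """Collect the action, method, and named inputs of a single form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.action = attrs.get('action')
            # HTML forms default to GET when no method is given.
            self.method = (attrs.get('method') or 'get').lower()
        elif tag == 'input' and 'name' in attrs:
            # Inputs without a name (such as the submit button here) are skipped.
            self.field_names.append(attrs['name'])


html = '''<form action="processing.php" method="post">
First name: <input name="firstname" type="text"><br>
Last name: <input name="lastname" type="text"><br>
<input id="submit" type="submit" value="Submit">
</form>'''

# Assumed address of the page that contains the form.
page_url = 'https://pythonscraping.com/pages/files/form.html'

inspector = FormInspector()
inspector.feed(html)

# Resolve the relative action against the page address.
post_url = urljoin(page_url, inspector.action)
print(inspector.method, post_url, inspector.field_names)
```

The resolved URL and the field names are exactly what the requests.post call needs as its url and as the keys of data.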

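Uploading Files

File uploads are almost never needed in a scraper, but the basic usage is worth knowing. The files argument of requests makes it easy.

This page lets you upload an image. It is again a form.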

<form action="processing2.php" enctype="multipart/form-data" method="post">
  Submit a jpg, png, or gif: <input name="uploadFile" type="file"><br>
  <input type="submit" value="Upload File">
</form>

As in the previous example, the page we actually need to request is processing2.php, and the method is still POST. The name of the parameter is uploadFile.

import requests

url = 'https://pythonscraping.com/pages/files/processing2.php'
# Open the file in binary mode and close it once the upload has been sent
with open('abc.PNG', 'rb') as f:
    r = requests.post(url, files={'uploadFile': f})
print(r.text)

Sorry, there was an error uploading your file.

The code itself is fine: uploading through the browser gives the same result. The URL provided in the book is probably at fault.
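Since the server's reply does not confirm much here, one way to check what the files argument produces is to prepare the request locally and inspect it without sending it. This is only a sketch: the bytes in fake_image are made up and stand in for a real image file.

```python
import io

import requests

url = 'https://pythonscraping.com/pages/files/processing2.php'

# In-memory stand-in for an image file; the bytes are made up.
fake_image = io.BytesIO(b'not really a PNG')

# Each entry may be a (filename, file object, content type) tuple.
files = {'uploadFile': ('abc.PNG', fake_image, 'image/png')}

# prepare() builds the request locally; nothing is sent.
prepared = requests.Request('POST', url, files=files).prepare()

content_type = prepared.headers['Content-Type']
print(content_type)         # multipart/form-data with a generated boundary
print(prepared.body[:200])  # start of the multipart body (bytes)
```

The body carries the field name uploadFile and the filename, matching the enctype="multipart/form-data" attribute of the form.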

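Handling Logins and Cookies

A cookie carries the state information that tracks whether a user is logged in. Once the site has authenticated our login, it stores a cookie in the browser containing a server-generated token, the login's validity period, and state-tracking information. When the validity period runs out, the login state is cleared, and pages that require a login can no longer be accessed. So the approach is still to log in first and then obtain the cookie.

Here is a login page: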

<form action="welcome.php" method="post">
Username (use anything!): <input name="username" type="text"><br>
Password (try "password"): <input name="password" type="password"><br>
<input type="submit" value="Login">
</form>

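As the form shows, logging in takes us to welcome.php. Enter a username and a password (any username works, but the password must be password).

After a successful login, the profile page can be requested with GET.

Note that if you call requests.get('https://pythonscraping.com/pages/cookies/profile.php') directly, the server has no idea that we are "already logged in", so it refuses to return the content. If we pass along the cookie obtained from the successful login, it tells the server that we are logged in and asks to see profile.php; the server sees the token and agrees.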

import requests
url = 'https://pythonscraping.com/pages/cookies/welcome.php'

params = {'username': 'Ryan', 'password': 'password'}

r = requests.post(url, data=params)

print(r.cookies.get_dict())
res = requests.get('https://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)
print(res.text)

{'loggedin': '1', 'username': 'Ryan'}
Hey Ryan! Looks like you're still logged into the site!
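To see what the cookies argument changes, the two GET requests can be prepared locally and compared. This is a minimal sketch that sends nothing; the cookie values are copied from the output printed above rather than obtained from a live login.

```python
import requests

profile_url = 'https://pythonscraping.com/pages/cookies/profile.php'

# Cookie values copied from the login output shown earlier.
cookies = {'loggedin': '1', 'username': 'Ryan'}

# prepare() builds each request locally; nothing is sent.
with_cookies = requests.Request('GET', profile_url, cookies=cookies).prepare()
without_cookies = requests.Request('GET', profile_url).prepare()

print(with_cookies.headers.get('Cookie'))     # cookie pairs joined by '; '
print(without_cookies.headers.get('Cookie'))  # None: no Cookie header at all
```

The only difference between the two requests is the Cookie header, and that header is all the server has to go on when deciding whether we are logged in.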

Session

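For simple cases this approach works fine. But if the site is more complex and frequently adjusts its cookies behind the scenes, use the Session object of requests instead. It keeps track of session information across requests, such as cookies, headers, and even information about the HTTP protocol in use.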

import requests

session = requests.Session()

params = {'username':'admin', 'password': 'password'}
s = session.post('https://pythonscraping.com/pages/cookies/welcome.php', data=params)
print(s.cookies.get_dict())
print('Go to profile page')
# Unlike the previous example, no cookie is passed in here
s = session.get('https://pythonscraping.com/pages/cookies/profile.php')
print(s.text)

{'loggedin': '1', 'username': 'admin'}
Go to profile page
Hey admin! Looks like you're still logged into the site!
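The bookkeeping a Session does can be observed without contacting the site. In the sketch below the cookies are set by hand as a stand-in for what a successful login response would store, and the User-Agent value is a made-up example; session.prepare_request then shows what the session would attach to the next request.

```python
import requests

session = requests.Session()

# Stand-in for what a successful login response would store in the jar.
session.cookies.set('loggedin', '1')
session.cookies.set('username', 'admin')

# Headers set on the session apply to every request it prepares.
session.headers.update({'User-Agent': 'my-scraper/0.1'})  # hypothetical value

request = requests.Request(
    'GET', 'https://pythonscraping.com/pages/cookies/profile.php')

# prepare_request merges session state into the request; nothing is sent.
prepared = session.prepare_request(request)

print(prepared.headers['Cookie'])
print(prepared.headers['User-Agent'])
```

Nothing about cookies appears in the Request itself; the session supplies them, which is why the session.get call in the example above needs no cookies argument.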

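Other Authentication Methods

There are other ways to authenticate, such as HTTP basic access authentication. For this, use the auth argument of requests.

This page requires a username and password to log in: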

import requests

url = 'https://pythonscraping.com/pages/auth/login.php'

res = requests.get(url, auth=('sun', '123456'))
print(res.text)

<p>Hello sun.</p><p>You entered 123456 as your password.</p>

Pass auth a two-element tuple containing the username and the password, and the login succeeds.
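As a rough illustration of what the auth tuple turns into, the request can be prepared locally and its Authorization header decoded. Nothing is sent in this sketch; it only inspects the header that requests builds for basic authentication.

```python
import base64

import requests

url = 'https://pythonscraping.com/pages/auth/login.php'

# prepare() builds the request locally; nothing is sent.
prepared = requests.Request('GET', url, auth=('sun', '123456')).prepare()
header = prepared.headers['Authorization']
print(header)

# The header is the word "Basic" followed by base64 of "username:password".
scheme, token = header.split(' ', 1)
decoded = base64.b64decode(token).decode('utf-8')
print(decoded)
```

Because base64 is an encoding and not encryption, the credentials are readable by anyone who sees the request, so basic authentication should only be used over HTTPS.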

by @sunhaiyu

2017.7.17
