Python Web Scraping: Handling Logins | Day 08

Author: 你好我是森林 | Published: 2018-04-08 21:01

User: 你好我是森林
Date: 2018-04-08
Mark: Web Scraping with Python (《Python网络数据采集》)

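Articles in the web scraping series

Python Web Scraping: Building a Scraper
Python Web Scraping: HTML Parsing
Python Web Scraping: Starting to Crawl
Python Web Scraping: Using APIs
Python Web Scraping: Storing Data
Python Web Scraping: Reading Files
Python Web Scraping: Cleaning Data
Python Web Scraping: Processing Natural Language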

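Handling logins when scraping

If the site we want to scrape only shows the data we need after we log in, we have to deal with the login step first.

The principle behind logging in is simple: the client sends data to the server, and the server validates it. The data can be sent in several ways, such as GET or POST. An HTML form is essentially a way for a user to submit a POST request in a shape the server can understand and process.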
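To make the difference between the two methods concrete, here is a minimal sketch that builds a GET and a POST request with the same fields, using only the standard library and without sending anything over the network. The endpoint and field values are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical login fields and endpoint, for illustration only.
fields = {"username": "demo", "password": "secret"}
endpoint = "http://example.com/login"

# GET: the fields travel in the URL's query string.
get_request = Request(endpoint + "?" + urlencode(fields), method="GET")

# POST: the same fields travel in the request body,
# which is how an HTML form with method="post" sends them.
post_body = urlencode(fields).encode("ascii")
post_request = Request(endpoint, data=post_body, method="POST")

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

With GET the credentials end up in the URL, which is one reason login forms normally use POST.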

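The Python Requests library

Besides urllib in Python's standard library, third-party libraries such as Requests are available. Requests is good at handling complex HTTP requests, cookies, headers (both request and response headers), and similar details.

Project page: https://github.com/kennethreitz/requests/

Installation is simple: install it with pip (pip install requests), or download the source and install from that.

Source: https://github.com/kennethreitz/requests/tarball/master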

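Submitting forms

Forms are usually written in HTML, and most sites submit data this way. For example:

<form method="post" action="processing.php">
Nickname: <input type="text" name="nickname"><br>
Username: <input type="text" name="username"><br>
<input type="submit" value="Submit">
</form>

Submitting such a form from Python with the Requests library is straightforward. The dictionary keys must match the name attributes of the form's input fields.

import requests

# Keys correspond to the name attributes of the form's <input> fields.
params = {'nickname': 'Ryan', 'username': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)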
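To find out which keys a form expects, the name attributes can also be collected programmatically instead of being read from the page source by hand. The following is a minimal sketch using the standard library's html.parser; the form markup and the FieldNameCollector class are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class FieldNameCollector(HTMLParser):
    """Collects the name attribute of every <input> tag it sees."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs for the tag.
        if tag == "input":
            name = dict(attrs).get("name")
            if name is not None:
                self.names.append(name)


# Hypothetical form markup; a real page would be downloaded first.
form_html = """
<form method="post" action="login.php">
Email: <input type="text" name="email"><br>
Remember me: <input type="checkbox" name="remember" value="yes"><br>
<input type="submit" value="Log in">
</form>
"""

collector = FieldNameCollector()
collector.feed(form_html)
print(collector.names)
```

The submit button has no name attribute, so it is skipped; only fields that would actually be sent with the form are listed.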

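Radio buttons, checkboxes, and other inputs

However complex a form's fields look, only two things matter: the field name and its value. The field name is easy to find by looking for the name attribute in the page source. The value can be harder to determine, since it may be generated by JavaScript just before the form is submitted.

Both can be confirmed by capturing packets or by inspecting the request in the browser's network panel. For example, a request to:

https://chensenlin.cn?c=hello&m=senlin

corresponds in Python to the parameters:

{'c': 'hello', 'm': 'senlin'}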

The figure below shows where to find this information in the browser:

[image]
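The mapping from a query string to a Python dictionary can be checked with the standard library's urllib.parse. This is a small sketch that only parses the example URL from this section and makes no network request.

```python
from urllib.parse import urlsplit, parse_qsl

url = "https://chensenlin.cn?c=hello&m=senlin"

# Take the part after "?" and split it into (key, value) pairs.
query = urlsplit(url).query
params = dict(parse_qsl(query))

print(query)
print(params)
```

Note that dict() keeps only the last value when a key is repeated in the query string; in that case the list of pairs returned by parse_qsl is the safer form to work with.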

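Submitting files or images

To upload a file through an HTML form, the form needs the attribute enctype="multipart/form-data", which declares a file upload, and the input's type must be file.

<form action="uploadFile.php" method="post" enctype="multipart/form-data">
Upload file: <input type="file" name="filename">
Submit: <input type="submit" value="Upload">
</form>

The Requests library handles this kind of form in much the same way:

import requests

# The key matches the name attribute of the file input; open the file in binary mode.
with open('../files/Python-logo.png', 'rb') as image:
    files = {'filename': image}
    r = requests.post("https://chensenlin.cn?c=filename&m=upload", files=files)

print(r.text)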

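Handling logins and cookies

Most sites use cookies to track whether a user is logged in. Once a site has verified your login credentials, it stores them in a cookie in your browser, which usually contains a server-generated token, an expiry time, and state-tracking information. The site treats this cookie as proof of authentication, and your browser presents it to the server on every page you visit.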
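To illustrate what such a cookie can look like, here is a minimal sketch that parses a Set-Cookie header value with the standard library's http.cookies module. The cookie name, token, and attributes are hypothetical; real sites choose their own.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value, as a server might send after login.
set_cookie_value = "session_token=abc123; Max-Age=3600; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(set_cookie_value)

morsel = cookie["session_token"]
print("token:", morsel.value)
print("lifetime in seconds:", morsel["max-age"])
print("path:", morsel["path"])

# On later requests the client sends back only the name=value pair
# in a Cookie header; attributes such as Max-Age stay on the client.
cookie_header = "; ".join(f"{name}={m.value}" for name, m in cookie.items())
print("Cookie:", cookie_header)
```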

Following this logic, tracking cookies with the Requests library takes only a few lines:

import requests

params = {'username': 'demochen', 'password': 'password'}

# Log in; the server sets a cookie on the response.
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
# Send the login cookies along with the next request.
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)

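Requests also provides a Session object, which makes this more convenient: a session keeps track of cookies across requests automatically, so they do not have to be passed along by hand. Otherwise it works much like the cookie example, so I will not go into detail here; the Requests documentation covers the rest.

import requests

# The Session stores cookies from each response and sends them on later requests.
session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
# No cookies argument is needed; the session supplies them.
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)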

Note that some sites log users in with HTTP basic access authentication instead. The Requests library has an auth module dedicated to handling HTTP authentication:

import requests
from requests.auth import HTTPBasicAuth

# Credentials are sent in the Authorization header of the request.
auth = HTTPBasicAuth('ryan', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)
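To show what HTTP basic authentication puts on the wire, the sketch below builds the Authorization header value by hand with the standard library. It illustrates the header format only and is not how Requests implements it internally; basic_auth_header is a hypothetical helper name.

```python
import base64


def basic_auth_header(username, password):
    """Builds the Authorization header value for HTTP basic authentication."""
    # The credentials are joined with a colon and Base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header_value = basic_auth_header("ryan", "password")
print("Authorization:", header_value)

# Base64 is an encoding, not encryption: anyone can reverse it.
decoded = base64.b64decode(header_value.split(" ", 1)[1])
print(decoded)
```

Because the credentials can be decoded this easily, basic authentication should only be used over HTTPS.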

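Today's post covered the basics of the Requests library; its documentation makes it easy to pick up the rest. Thanks for reading. If you would like to learn more about Python, feel free to follow me, and if this post helped you, likes and comments are welcome.

Original post: https://chensenlin.cn/posts/64604/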

