requests Quick Start

Author: ThomasYoungK | Source: published 2019-02-08 13:23, read 70 times

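requests is a very handy HTTP library, but my understanding of it used to be patchy. I put together these preliminary notes to deepen that understanding, and I hope they help others too.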

Status codes
GET requests
request headers
request body
Inspecting the sent request
Authentication
SSL and HTTPS
Session
cookies
Redirect history
timeout and max-retries
- Timeout
- Max Retries

Status codes

>>> response = requests.get('https://api.github.com')
>>> response.status_code
200

A response object also has a boolean value, and it is truthy over a whole range of status codes:

if response:  # True if the status code is below 400, False otherwise.
    print('Success!')
else:
    print('An error has occurred.')

Alternatively, call response.raise_for_status(), which raises an exception when the request failed:

# If the response was successful, no Exception will be raised
response.raise_for_status()
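
The two checks above can be tried offline. The sketch below is only an illustration: it builds bare requests.Response objects by hand and sets nothing but status_code (real responses come from requests.get), and the helper name check is hypothetical.

```python
import requests
from requests.exceptions import HTTPError


def check(status_code):
    """Return (truthiness, error message or None) for a hand-built response."""
    response = requests.Response()      # built by hand, no network involved
    response.status_code = status_code
    try:
        response.raise_for_status()     # raises HTTPError for 4xx and 5xx
        error = None
    except HTTPError as exc:
        error = str(exc)
    return bool(response), error


ok_result = check(200)
redirect_result = check(301)
not_found_result = check(404)
print(ok_result, redirect_result, not_found_result)
```

Note that a 301 is still truthy: the boolean value only says "not an error", not "200 OK".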

GET requests

response.content is of type bytes:

>>> response = requests.get('https://api.github.com')
>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","notifications_url":"https://api.github.com/notifications","organization_repositories_url":"https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}","organization_url":"https://api.github.com/orgs/{org}","public_gists_url":"https://api.github.com/gists/public","rate_limit_url":"https://api.github.com/rate_limit","repository_url":"https://api.github.com/repos/{owner}/{repo}","repository_search_url":"https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}","current_user_repositories_url":"https://api.github.com/user/repos{?type,page,per_page,sort}","starred_url":"https://api.github.com/user/starred{/owner}{/repo}","starred_gists_url":"https://api.github.com/gists/starred","team_url":"https://api.github.com/teams","user_url":"https://api.github.com/users/{user}","user_organizations_url":"https://api.github.com/user/orgs","user_repositories_url":"https://api.github.com/users/{user}/repos{?type,page,per_page,sort}","user_search_url":"https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"}'

response.text automatically decodes the body into a str; internally it uses the information in the response headers, or falls back to chardet.detect, to guess the encoding:

>>> response.text
'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","notifications_url":"https://api.github.com/notifications","organization_repositories_url":"https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}","organization_url":"https://api.github.com/orgs/{org}","public_gists_url":"https://api.github.com/gists/public","rate_limit_url":"https://api.github.com/rate_limit","repository_url":"https://api.github.com/repos/{owner}/{repo}","repository_search_url":"https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}","current_user_repositories_url":"https://api.github.com/user/repos{?type,page,per_page,sort}","starred_url":"https://api.github.com/user/starred{/owner}{/repo}","starred_gists_url":"https://api.github.com/gists/starred","team_url":"https://api.github.com/teams","user_url":"https://api.github.com/users/{user}","user_organizations_url":"https://api.github.com/user/orgs","user_repositories_url":"https://api.github.com/users/{user}/repos{?type,page,per_page,sort}","user_search_url":"https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"}'

You can also set the encoding yourself:

>>> response.encoding = 'utf-8' # Optional: requests infers this internally
>>> response.text
'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","notifications_url":"https://api.github.com/notifications","organization_repositories_url":"https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}","organization_url":"https://api.github.com/orgs/{org}","public_gists_url":"https://api.github.com/gists/public","rate_limit_url":"https://api.github.com/rate_limit","repository_url":"https://api.github.com/repos/{owner}/{repo}","repository_search_url":"https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}","current_user_repositories_url":"https://api.github.com/user/repos{?type,page,per_page,sort}","starred_url":"https://api.github.com/user/starred{/owner}{/repo}","starred_gists_url":"https://api.github.com/gists/starred","team_url":"https://api.github.com/teams","user_url":"https://api.github.com/users/{user}","user_organizations_url":"https://api.github.com/user/orgs","user_repositories_url":"https://api.github.com/users/{user}/repos{?type,page,per_page,sort}","user_search_url":"https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"}'

response.json() deserializes the body automatically; it is a shortcut for json.loads(response.text):

>>> response.json()
{'current_user_url': 'https://api.github.com/user', 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}', 'authorizations_url': 'https://api.github.com/authorizations', 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}', 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}', 'emails_url': 'https://api.github.com/user/emails', 'emojis_url': 'https://api.github.com/emojis', 'events_url': 'https://api.github.com/events', 'feeds_url': 'https://api.github.com/feeds', 'followers_url': 'https://api.github.com/user/followers', 'following_url': 'https://api.github.com/user/following{/target}', 'gists_url': 'https://api.github.com/gists{/gist_id}', 'hub_url': 'https://api.github.com/hub', 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}', 'issues_url': 'https://api.github.com/issues', 'keys_url': 'https://api.github.com/user/keys', 'notifications_url': 'https://api.github.com/notifications', 'organization_repositories_url': 'https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}', 'organization_url': 'https://api.github.com/orgs/{org}', 'public_gists_url': 'https://api.github.com/gists/public', 'rate_limit_url': 'https://api.github.com/rate_limit', 'repository_url': 'https://api.github.com/repos/{owner}/{repo}', 'repository_search_url': 'https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}', 'current_user_repositories_url': 'https://api.github.com/user/repos{?type,page,per_page,sort}', 'starred_url': 'https://api.github.com/user/starred{/owner}{/repo}', 'starred_gists_url': 'https://api.github.com/gists/starred', 'team_url': 'https://api.github.com/teams', 'user_url': 'https://api.github.com/users/{user}', 'user_organizations_url': 'https://api.github.com/user/orgs', 'user_repositories_url': 'https://api.github.com/users/{user}/repos{?type,page,per_page,sort}', 'user_search_url': 'https://api.github.com/search/users?q={query}{&page,per_page,sort,order}'}
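
How content, text and json() relate can be sketched with the standard library alone. The payload below is made up, and the sketch only approximates what requests does internally (it skips the encoding detection entirely).

```python
import json

# A made-up UTF-8 payload standing in for response.content (bytes).
content = '{"name": "requests", "place": "café"}'.encode('utf-8')

# response.text is roughly content decoded with response.encoding.
text = content.decode('utf-8')

# Decoding with the wrong encoding does not fail here, it just garbles the
# non-ASCII characters, which is why setting response.encoding can matter.
wrong_text = content.decode('latin-1')

# response.json() is roughly json.loads applied to the decoded body.
data = json.loads(text)
print(type(content), type(text), data['place'])
```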

The params argument of get can be given as a dict:

response = requests.get(
    'https://api.github.com/search/repositories',
    params={'q': 'requests+language:python'},
)
print(response.request.url)  
# Output
# https://api.github.com/search/repositories?q=requests%2Blanguage%3Apython

A list of tuples is equivalent:

response = requests.get(
    'https://api.github.com/search/repositories',
    params=[('q', 'requests+language:python')],
)
print(response.request.url)  
# Output
# https://api.github.com/search/repositories?q=requests%2Blanguage%3Apython

Both approaches above URL-encode the parameters automatically. To skip the encoding, pass bytes instead:

response = requests.get(
    'https://api.github.com/search/repositories',
    params=b'q=requests+language:python',
)
print(response.request.url)
# Output
# https://api.github.com/search/repositories?q=requests+language:python
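
Why the choice matters can be seen from the receiving side. The sketch below uses urllib.parse from the standard library as a stand-in (requests has its own encoding path, so this is an approximation), and it assumes the server decodes the query string the way HTML forms are decoded.

```python
from urllib.parse import parse_qs, urlencode

raw_value = 'requests+language:python'

# What dict/tuple params amount to: reserved characters get percent-encoded.
encoded_query = urlencode({'q': raw_value})

# A server decoding the encoded query recovers the original value.
decoded_from_encoded = parse_qs(encoded_query)

# Sent as-is (the bytes variant), '+' is read as a space under form decoding.
decoded_from_raw = parse_qs('q=' + raw_value)

print(encoded_query)
print(decoded_from_encoded)
print(decoded_from_raw)
```

Which of the two the server should see depends on the API: one form searches for a literal plus sign, the other for two space-separated terms.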

request headers

You can pass a headers argument when making a request:

response = requests.get(
    'https://api.github.com/search/repositories',
    params={'q': 'requests+language:python'},
    headers={'Accept': 'application/vnd.github.v3.text-match+json'},
)

request body

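When the payload is passed via data, it can be a dict or a list of tuples, and the request's Content-Type header is application/x-www-form-urlencoded.
When it is passed via json, the request's Content-Type header is application/json.
The code and output below make this easy to see: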

response = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(response.json().get('form'))
print(response.request.headers['Content-Type'])
print(response.request.body)

print()
response = requests.post('https://httpbin.org/post', json={'key':' value'})
json_response = response.json()
print(type(json_response['data']))
print(response.request.headers['Content-Type'])
print(response.request.body)

print()
response = requests.post('https://httpbin.org/post', data=[('key', 'value')])
print(response.json().get('form'))
print(response.request.headers['Content-Type'])
print(response.request.body)

Output

{'key': 'value'}
application/x-www-form-urlencoded
key=value

<class 'str'>
application/json
b'{"key": " value"}'

{'key': 'value'}
application/x-www-form-urlencoded
key=value

Inspecting the sent request

The request is a PreparedRequest object; response.request gives you the request that produced a given response:

response = requests.post('https://httpbin.org/post', json={'key': 'value'})
print(type(response.request.body))
print(response.request.body)
print(response.request.headers)

print()
response = requests.post('https://httpbin.org/post', data={'key': 'value'}, cookies={'xx': 'yy', 'zz': 'aa'})
print(type(response.request))
print(type(response.request.body))
print(response.request.body)
print(response.request.headers)

Output

<class 'bytes'>
b'{"key": "value"}'
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '16', 'Content-Type': 'application/json'}

<class 'requests.models.PreparedRequest'>
<class 'str'>
key=value
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'xx=yy; zz=aa', 'Content-Length': '9', 'Content-Type': 'application/x-www-form-urlencoded'}
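
A PreparedRequest can also be built and inspected without sending anything, which is handy for checking a body or header before it goes out. The sketch below assumes that preparing a bare requests.Request is close enough to what requests.post does; the default headers shown above (User-Agent and so on) are added by the session and will be missing here.

```python
import requests

url = 'https://httpbin.org/post'

# prepare() builds the PreparedRequest without sending anything.
form_request = requests.Request('POST', url, data={'key': 'value'}).prepare()
json_request = requests.Request('POST', url, json={'key': 'value'}).prepare()

for prepared in (form_request, json_request):
    print(prepared.method, prepared.url)
    print(prepared.headers['Content-Type'], repr(prepared.body))
```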

Authentication

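Some services require authentication. requests provides three authentication classes (HTTPBasicAuth, HTTPProxyAuth, HTTPDigestAuth), and you can also define your own.

HTTPBasicAuth works by joining the username and password with a colon, base64-encoding the result, and putting it in the request headers; see https://en.wikipedia.org/wiki/Basic_access_authentication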

from base64 import b64encode
print(b64encode(b':'.join((b'username', b'password'))))

Just pass the auth argument when making the request; a tuple is treated as HTTPBasicAuth by default:

from getpass import getpass
# https://en.wikipedia.org/wiki/Basic_access_authentication
response = requests.get('https://api.github.com/user', auth=('username', getpass()))
print(response.status_code)
print(response.request.headers)

response = requests.get('https://api.github.com/user')
print(response.request.headers)
print(response.status_code)
print(response.text)

Output

200
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Basic bWluaXaflrMjAxjp5adfUwNzgdfyMzasfdMA=='}
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
401
{"message":"Requires authentication","documentation_url":"https://developer.github.com/v3/users/#get-the-authenticated-user"}
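
To connect the base64 snippet with the Authorization header in the output, the sketch below builds the header by hand and compares it with what requests produces for the same credentials. Nothing is sent (the request is only prepared), and the credentials are placeholders.

```python
from base64 import b64decode, b64encode

import requests

username, password = 'username', 'password'   # placeholder credentials

# Build the header value by hand: "Basic " + base64("username:password").
token = b64encode(f'{username}:{password}'.encode('latin1')).decode('ascii')
manual_header = f'Basic {token}'

# Let requests build it (prepared offline, nothing is sent).
prepared = requests.Request(
    'GET', 'https://api.github.com/user', auth=(username, password)
).prepare()
requests_header = prepared.headers['Authorization']

# base64 is an encoding, not encryption: anyone can reverse it.
recovered = b64decode(token).decode('latin1')
print(manual_header, requests_header, recovered)
```

Because the header is trivially reversible, Basic auth is only reasonable over HTTPS.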

You can also specify the auth class explicitly, with the same effect:

from requests.auth import HTTPBasicAuth

response = requests.get(
    'https://api.github.com/user',
    auth=HTTPBasicAuth('username', getpass()),
)

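You can also define a custom authentication scheme: subclass AuthBase and implement __call__:

"""
A custom Auth simply customises the authentication information in request.headers.
"""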
import requests
from requests.auth import AuthBase

# AuthBase has 3 subclasses: HTTPBasicAuth, HTTPProxyAuth, HTTPDigestAuth. Search Google or Wikipedia for their definitions.
class TokenAuth(AuthBase):
    """Implements a custom authentication scheme."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        """Attach an API token to a custom auth header."""
        r.headers['X-TokenAuth'] = f'{self.token}'  # Python 3.6+
        return r


response = requests.get('https://httpbin.org/get', auth=TokenAuth('12345abcde-token'))
print(response.request.headers['X-TokenAuth'])

# Output
# 12345abcde-token

SSL and HTTPS

HTTPS = HTTP + SSL. When requesting an https URL, requests verifies the SSL certificate by default. To skip verification, set verify=False:

>>> requests.get('https://api.github.com', verify=False)
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
<Response [200]>

Session

Official documentation: http://docs.python-requests.org/en/master/user/advanced/#session-objects

A session serves two purposes:

- Persist parameters across requests.
- Reuse connections. When your app makes a connection to a server using a Session, it keeps that connection around in a connection pool. When your app wants to connect to the same server again, it will reuse a connection from the pool rather than establishing a new one.

Session-level dicts are persisted by the session; method-level dicts are not, and method-level values override session-level headers (http://docs.python-requests.org/en/master/user/advanced/#session-objects). The code below shows both levels:

s = requests.Session()
s.auth = ('user', 'pass')  # session-level: persisted by the session
s.headers.update({'x-test': 'true'})  # session-level: persisted by the session
s.get('https://httpbin.org/headers', auth=('yangkai', 'pass'))  # method-level: not persisted
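
The persist/override behaviour can be checked without a network connection. This is a minimal sketch that uses Session.prepare_request to merge session state into a request without sending it; the header names x-test and x-test2 are just examples.

```python
import requests

s = requests.Session()
s.headers.update({'x-test': 'true'})             # session-level: persisted

url = 'https://httpbin.org/headers'

# prepare_request() merges session state into a request without sending it.
first = s.prepare_request(
    requests.Request('GET', url, headers={'x-test2': 'true'})  # method-level
)
second = s.prepare_request(requests.Request('GET', url))
override = s.prepare_request(
    requests.Request('GET', url, headers={'x-test': 'false'})
)

print(first.headers.get('x-test'), first.headers.get('x-test2'))
print(second.headers.get('x-test'), second.headers.get('x-test2'))
print(override.headers.get('x-test'))
```

The second request no longer carries x-test2, and the override affects only the request it was passed to.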

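I ran some experiments; the code is in the gist session_test.py.

Note that a session is not thread-safe (this was raised in a requests issue), so multiple threads must not share one session: the operating system keeps switching between threads, and sharing could corrupt the session's internal state. Create one session per thread instead (code within a single thread runs sequentially, so there is nothing to worry about there). Below is one multithreaded implementation, using threading.local():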

import concurrent.futures
import requests
import threading
import time


thread_local = threading.local()


def get_session():
    if not getattr(thread_local, "session", None):
        thread_local.session = requests.Session()
    return thread_local.session


def download_site(url):
    session = get_session()
    with session.get(url) as response:
        print(f"Read {len(response.content)} from {url}")


def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_site, sites)


if __name__ == "__main__":
    sites = [
        "http://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 80
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} in {duration} seconds")
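
That threading.local() really hands each thread its own session can be verified offline. The sketch below reuses the get_session pattern from above; the worker function and the seen dict are hypothetical additions for the check, and no request is made.

```python
import threading

import requests

thread_local = threading.local()


def get_session():
    # Same lazy-initialisation pattern as above.
    if not getattr(thread_local, 'session', None):
        thread_local.session = requests.Session()
    return thread_local.session


seen = {}   # thread name -> (first session, second session)


def worker():
    # Two calls in the same thread should return the same object.
    seen[threading.current_thread().name] = (get_session(), get_session())


threads = [threading.Thread(target=worker, name=f'worker-{i}') for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

same_within_thread = all(a is b for a, b in seen.values())
distinct_sessions = len({id(a) for a, _ in seen.values()})
print(same_within_thread, distinct_sessions)
```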

cookies

Official documentation:
http://docs.python-requests.org/en/master/api/#api-cookies
http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#cookie

Setting cookies at the session level

with requests.Session() as s:
    jar = requests.cookies.RequestsCookieJar()
    jar.set('a', 'b')
    jar.set('x', 'y')
    s.cookies = jar

Setting cookies at the method level

with requests.Session() as s:
    cookies = dict(cookies_are='working')
    r = s.get("http://httpbin.org/cookies", cookies=cookies)

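Inspecting cookies on the request (checking the headers is recommended)

print(response.request._cookies)   # a RequestsCookieJar; it does not guarantee the request really carried these cookies, as the leading `_` hints
# or
print(response.request.headers['Cookie'])   # the headers that were actually sent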

Inspecting the session-level cookies that were just set

print(s.cookies)
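
How session-level and method-level cookies combine can be sketched offline as well. The example assumes Session.prepare_request mirrors what s.get does when building the Cookie header; the cookie names are taken from the snippets above.

```python
import requests

s = requests.Session()
s.cookies.set('a', 'b')                       # session-level cookie

prepared = s.prepare_request(
    requests.Request(
        'GET',
        'http://httpbin.org/cookies',
        cookies={'cookies_are': 'working'},   # method-level cookie
    )
)

# The Cookie header is what would actually go on the wire.
sent = set(prepared.headers['Cookie'].split('; '))
# The method-level cookie was merged into this request only.
persisted = sorted(s.cookies.keys())
print(sent, persisted)
```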

Inspecting Set-Cookie in the response

Use response.cookies:

with requests.Session() as s:
    r = s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
    print(r.history)
    print(r.history[1].headers)
    print(r.history[1].cookies)   # per the source code, this comes from the value of the response's `Set-Cookie` header; it is a RequestsCookieJar

See my gist: cookies_test.py

Redirect history

The redirect history is available as response.history, a list of Response objects; the response you get back is the one from the final hop.

import requests

url = 'http://httpbin.org/cookies/set/sessioncookie/123456789'

with requests.Session() as s:
    response = s.get(url)
    #: A list of :class:`Response <Response>` objects from
    #: the history of the Request. Any redirect responses will end
    #: up here. The list is sorted from the oldest to the most recent request.
    print(response.history)
    for resp in response.history:
        print(resp, resp.request.url, resp.headers)
    print(response, response.request.url)

The output shows three requests, i.e. two redirects:
http://httpbin.org/cookies/set/sessioncookie/123456789 ->
https://httpbin.org/cookies/set/sessioncookie/123456789 ->
https://httpbin.org/cookies

[<Response [301]>, <Response [302]>]
<Response [301]> http://httpbin.org/cookies/set/sessioncookie/123456789 {'Connection': 'close', 'Cache-Control': 'max-age:86400', 'Date': 'Friday, 08-Feb-19 12:49:25 CST', 'Expires': 'Sat, 09 Feb 2019 12:49:25 GMT', 'Keep-Alive': 'timeout=38', 'Location': 'https://httpbin.org/cookies/set/sessioncookie/123456789', 'Content-Length': '0'}
<Response [302]> https://httpbin.org/cookies/set/sessioncookie/123456789 {'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Fri, 08 Feb 2019 04:49:26 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '223', 'Location': '/cookies', 'Set-Cookie': 'sessioncookie=123456789; Secure; Path=/', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}
<Response [200]> https://httpbin.org/cookies

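A related concept is redirect versus forward:
A redirect is client-side behaviour; the browser issues multiple requests.
A forward is server-side behaviour; the browser issues only one request.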

What we have here is a redirect, and the browser does issue three requests (listed from bottom to top):

[Image: 重定向.png]
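
A small helper can turn response.history into the chain of URLs listed above. The sketch below is offline: it hand-builds Response objects with the URLs and status codes from the output (real ones come from s.get), and the names make_response and redirect_chain are hypothetical.

```python
import requests


def make_response(status_code, url):
    """Hand-build a Response for illustration; real ones come from requests."""
    response = requests.Response()
    response.status_code = status_code
    response.url = url
    return response


def redirect_chain(response):
    """List every URL visited, oldest first, ending with the final URL."""
    return [r.url for r in response.history] + [response.url]


final = make_response(200, 'https://httpbin.org/cookies')
final.history = [
    make_response(301, 'http://httpbin.org/cookies/set/sessioncookie/123456789'),
    make_response(302, 'https://httpbin.org/cookies/set/sessioncookie/123456789'),
]

chain = redirect_chain(final)
redirect_count = len(final.history)
print(chain, redirect_count)
```

Passing allow_redirects=False to requests.get stops at the first response instead of following the chain.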

timeout and max-retries

Timeout

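A request spends time first connecting and then reading. requests has no timeout by default; the timeout parameter sets one. It accepts either a single number, which is applied to both the connect timeout and the read timeout, or a tuple that sets the two separately. When a timeout is hit, an exception is raised. (This is my understanding; I can't guarantee it is correct.)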

You can also pass a tuple to timeout with the first element being a connect timeout (the time it allows for the client to establish a connection to the server), and the second being a read timeout (the time it will wait on a response once your client has established a connection).

import requests
from requests.exceptions import Timeout, ConnectionError

try:
    # connect timeout
    response = requests.get('https://api.github.com', timeout=(0.1, 5))
except ConnectionError as e:
    print(e, type(e))

try:
    # read timeout
    response = requests.get('https://api.github.com', timeout=(1, 0.1))
except Timeout as e:
    print(e, type(e))

try:
    # single value for both phases; here the connect phase already times out
    response = requests.get('https://api.github.com', timeout=0.1)
except Exception as e:
    print(e, type(e))

try:
    # single value for both phases; the connection succeeds but the read times out
    response = requests.get('https://api.github.com', timeout=0.3)
except Exception as e:
    print(e, type(e))
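
Why the first block can catch a connect timeout with ConnectionError while the second needs Timeout follows from the exception hierarchy in requests.exceptions, which can be checked without a network:

```python
from requests.exceptions import (
    ConnectionError,
    ConnectTimeout,
    ReadTimeout,
    Timeout,
)

# A connect timeout is both a ConnectionError and a Timeout ...
connect_is_connection_error = issubclass(ConnectTimeout, ConnectionError)
connect_is_timeout = issubclass(ConnectTimeout, Timeout)

# ... while a read timeout is a Timeout but not a ConnectionError.
read_is_timeout = issubclass(ReadTimeout, Timeout)
read_is_connection_error = issubclass(ReadTimeout, ConnectionError)

print(connect_is_connection_error, connect_is_timeout)
print(read_is_timeout, read_is_connection_error)
```

Catching Timeout therefore covers both phases, whereas catching ConnectionError alone would miss a read timeout.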

Max Retries

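requests does not retry on failure by default. A Transport Adapter can set the number of retries (according to the requests documentation, an integer max_retries covers only failed DNS lookups, socket connections and connection timeouts). The code below retries at most 3 times: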

import requests
from requests.adapters import HTTPAdapter
from requests.exceptions import ConnectionError

github_adapter = HTTPAdapter(max_retries=3)

session = requests.Session()

# Use `github_adapter` for all requests to endpoints that start with this URL
session.mount('https://api.github.com', github_adapter)

try:
    session.get('https://api.github.com')
except ConnectionError as ce:
    print(ce)
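
HTTPAdapter also accepts a urllib3 Retry object in place of the integer, which allows back-off and retrying on selected status codes. The values below are illustrative assumptions, not recommendations, and the sketch only checks the configuration; it does not send a request.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative values, not recommendations.
retry_policy = Retry(
    total=3,                                 # at most 3 retries
    backoff_factor=0.5,                      # sleep between attempts grows
    status_forcelist=[500, 502, 503, 504],   # also retry on these statuses
)
adapter = HTTPAdapter(max_retries=retry_policy)

session = requests.Session()
session.mount('https://api.github.com', adapter)

# The longest matching prefix decides which adapter handles a URL.
github_adapter = session.get_adapter('https://api.github.com/user')
other_adapter = session.get_adapter('https://httpbin.org/get')
print(github_adapter is adapter, other_adapter is adapter)
```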

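In fact, I'd suggest not relying on timeout or an adapter for timeout and retry handling; there is a very good library for the job: tenacity.

import requests
from tenacity import RetryError, retry, stop_after_attempt, stop_after_delay

# Keep retrying until one of the following happens: 1. the call succeeds without raising, 2. 3 seconds have elapsed, 3. 4 attempts have been made;
# if it still has not succeeded by then, RetryError is raised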
@retry(stop=(stop_after_delay(3) | stop_after_attempt(4)))
def get_github_api():
    with requests.Session() as session:
        return session.get('https://api.github.com')


if __name__ == '__main__':
    try:
        response = get_github_api()
        print(response)
    except RetryError as ce:
        print(ce)
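
To make the stop conditions concrete, here is a hand-rolled loop with the same two limits. It is a standard-library illustration of the idea only, not how tenacity is implemented; call_with_retry and flaky are hypothetical names, and flaky simulates a request that fails twice before succeeding.

```python
import time


def call_with_retry(func, max_attempts=4, max_seconds=3.0):
    """Call func until it succeeds, attempts run out, or time runs out."""
    start = time.monotonic()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:          # a real caller would narrow this
            last_error = exc
        if time.monotonic() - start >= max_seconds:
            break
    raise RuntimeError(f'gave up after {attempt} attempts') from last_error


calls = {'count': 0}


def flaky():
    """Stand-in for a request: fails twice, then succeeds."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated failure')
    return 'ok'


result = call_with_retry(flaky)
print(result, calls['count'])
```

As in the tenacity version, the time limit is only checked between attempts, so a single call that hangs is not interrupted by it.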
