Getting Started with Python Web Scraping: requests

Author: 小铮冲冲冲 | Published: 2020-12-13 01:36


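What Is a Web Crawler

A web crawler (or simply "crawler") is a program or script that automatically fetches information from the internet according to a set of rules.
A search engine is, at its core, a crawler. It crawls pages across the web and stores them; when we search, it looks through its stored pages for the results we need and displays them.
As machine learning and artificial intelligence advance, data grows more important and the volume required keeps increasing. Crawlers let us collect data at that scale, which makes them a common starting point for this kind of work.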

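How a Crawler Works

Step 1: Fetch the data. The crawler sends a request to the server at the URL we provide and receives the data in response.
Step 2: Process the data. It processes what it fetched and extracts the part we need.
Step 3: Store the data. It saves the processed data for later use and analysis.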
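
The three steps above can be sketched as three small functions. This is only an illustrative sketch, not a complete crawler: the function names (fetch, process, store, crawl) are made up for this example, and "processing" is assumed to mean pulling out the page's <title> text with a regular expression.

```python
import re
from pathlib import Path


def fetch(url):
    """Step 1: request the page and return its body as text."""
    # Third-party library; imported here so steps 2 and 3 can run offline.
    import requests

    res = requests.get(url, timeout=10)
    res.raise_for_status()  # raise an exception on 4xx/5xx responses
    return res.text


def process(html):
    """Step 2: keep only the part we need (here, the <title> text)."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def store(text, path):
    """Step 3: save the processed result for later use."""
    Path(path).write_text(text, encoding="utf-8")


def crawl(url, path):
    """Run the three steps in order."""
    store(process(fetch(url)), path)
```

A call such as crawl('https://wpblog.x0y1.com', 'title.txt') would then run all three steps; the output file name is arbitrary.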

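The requests.get() Method

requests is the third-party library most commonly used in crawlers to send requests.
Let's start with the first step, fetching data, and look at an example: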

import requests  # import the requests module

res = requests.get('https://wpblog.x0y1.com')  # send the request
print(res)
# Output: <Response [200]>

We call requests.get('URL') to send a request to the site and store the return value in the variable res for later use. res is a Response object, and the 200 that follows is the status code.

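The Response Object

Earlier we fetched the page with requests.get() and stored it in res as a Response object. How do we inspect its contents?
Commonly used attributes of a Response object:

res.status_code: the HTTP status code of the response
res.text: the response body as a string
res.content: the response body as bytes
res.encoding: the encoding used to decode the response body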

res.status_code

import requests

res = requests.get('https://wpblog.x0y1.com')
print(res.status_code)
# Output: 200

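The 200 here is the response status code, which means the request succeeded. A failed request returns a different code, and each code has its own meaning. Common status codes are shown below:

[Image: 响应状态码.png (table of common response status codes)]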
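
Since the status code table is only available as an image, here is a small sketch that looks up the standard meaning of a code using Python's built-in http.HTTPStatus. The helper name explain_status is hypothetical, and the grouping into categories follows the usual 1xx to 5xx ranges rather than anything taken from the original table.

```python
from http import HTTPStatus


def explain_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # not a registered status code
        phrase = "Unknown"

    if 100 <= code < 200:
        kind = "informational"
    elif 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 404, 500):
    print(explain_status(code))
# 200 OK (success)
# 301 Moved Permanently (redirect)
# 404 Not Found (client error)
# 500 Internal Server Error (server error)
```

In a real crawler you would pass res.status_code to such a helper, or call res.raise_for_status() to turn 4xx and 5xx responses into exceptions.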

res.text

res.text returns the server's response body as a string, that is, as text. Let's go straight to some code:

import requests

res = requests.get('https://wpblog.x0y1.com')
print(res.text)

Running the code above prints the following (the output is long, so part of it is omitted):

<!DOCTYPE html>
<html lang="zh-CN" class="no-js no-svg">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<script>(function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documentElement);</script>
<title>&#8211; 爬虫示例站点</title>
<link rel="alternate" type="application/rss+xml" title="扇贝编程 &raquo; Feed" href="https://wpblog.x0y1.com/?feed=rss2" />
<link rel="alternate" type="application/rss+xml" title="扇贝编程 &raquo; 评论Feed" href="https://wpblog.x0y1.com/?feed=comments-rss2" />
<script type="text/javascript">
    window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/12.0.0-1\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/12.0.0-1\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/wpblog.x0y1.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=5.2.3"}};
    !function(a,b,c){function d(a,b){var c=String.fromCharCode;l.clearRect(0,0,k.width,k.height),l.fillText(c.apply(this,a),0,0);var d=k.toDataURL();l.clearRect(0,0,k.width,k.height),l.fillText(c.apply(this,b),0,0);var e=k.toDataURL();return d===e}function e(a){var b;if(!l||!l.fillText)return!1;switch(l.textBaseline="top",l.font="600 32px Arial",a){case"flag":return!(b=d([55356,56826,55356,56819],[55356,56826,8203,55356,56819]))&&(b=d([55356,57332,56128,56423,56128,56418,56128,56421,56128,56430,56128,56423,56128,56447],[55356,57332,8203,56128,56423,8203,56128,56418,8203,56128,56421,8203,56128,56430,8203,56128,56423,8203,56128,56447]),!b);case"emoji":return b=d([55357,56424,55356,57342,8205,55358,56605,8205,55357,56424,55356,57340],[55357,56424,55356,57342,8203,55358,56605,8203,55357,56424,55356,57340]),!b}return!1}function f(a){var c=b.createElement("script");c.src=a,c.defer=c.type="text/javascript",b.getElementsByTagName("head")[0].appendChild(c)}var g,h,i,j,k=b.createElement("canvas"),l=k.getContext&&k.getContext("2d");for(j=Array("flag","emoji"),c.supports={everything:!0,everythingExceptFlag:!0},i=0;i<j.length;i++)c.supports[j[i]]=e(j[i]),c.supports.everything=c.supports.everything&&c.supports[j[i]],"flag"!==j[i]&&(c.supports.everythingExceptFlag=c.supports.everythingExceptFlag&&c.supports[j[i]]);c.supports.everythingExceptFlag=c.supports.everythingExceptFlag&&!c.supports.flag,c.DOMReady=!1,c.readyCallback=function(){c.DOMReady=!0},c.supports.everything||(h=function(){c.readyCallback()},b.addEventListener?(b.addEventListener("DOMContentLoaded",h,!1),a.addEventListener("load",h,!1)):(a.attachEvent("onload",h),b.attachEvent("onreadystatechange",function(){"complete"===b.readyState&&c.readyCallback()})),g=c.source||{},g.concatemoji?f(g.concatemoji):g.wpemoji&&g.twemoji&&(f(g.twemoji),f(g.wpemoji)))}(window,document,window._wpemojiSettings);
</script>

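Next, let's use a crawler to download a short story, Kong Yiji (孔乙己), from https://apiv3.shanbay.com/codetime/articles/mnvdu. This URL returns the story as plain text, so the page source and the content are identical.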

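import requests

# Fetch the text of Kong Yiji
res = requests.get('https://apiv3.shanbay.com/codetime/articles/mnvdu')
# Open a text file named 孔乙己.txt in write mode
# (encoding='utf-8' avoids errors on platforms whose default encoding differs)
with open('孔乙己.txt', 'w', encoding='utf-8') as file:
    # Write the response body, as a string, to the file
    file.write(res.text)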

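open() is a built-in Python function that opens a file and returns a file object.
Its first argument is the file name and its second is the mode. The mode defaults to r (read), which is read-only: you can read the contents but not modify them.
Other common modes are w (write, which overwrites the file from the start), b (binary, combined with another mode as in rb or wb), and a (append, which writes at the end of the file instead of overwriting it).

Tips: in w and a modes, open() creates the file for you if it does not exist.

Note that after opening a file with open() and finishing with it, you must call the file object's close() method to close it. So reading and writing a file usually looks like this: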
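
The difference between the w and a modes is easiest to see by running them against the same file. The sketch below uses a temporary directory and a file name (demo.txt) chosen only for this illustration.

```python
import os
import tempfile

# A path inside a fresh temporary directory, so the file does not exist yet.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w", encoding="utf-8") as f:  # 'w' creates the file
    f.write("first\n")

with open(path, "a", encoding="utf-8") as f:  # 'a' adds to the end
    f.write("second\n")

with open(path, encoding="utf-8") as f:  # mode defaults to 'r'
    after_append = f.read()

with open(path, "w", encoding="utf-8") as f:  # 'w' again discards old content
    f.write("third\n")

with open(path, encoding="utf-8") as f:
    after_rewrite = f.read()

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'
```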

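# Read a file
file = open('文本.txt')  # the mode defaults to r and can be omitted
print(file.read())  # call read() to read the file contents
file.close()  # close the file

# Write a file
file = open('文本.txt', 'w')  # write mode
file.write('编程')  # call write() to write the content
file.close()  # close the file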

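To avoid forgetting close(), which can leave resources tied up or lose file contents, the with ... as ... syntax is recommended: it closes the file for you automatically at the end of the block.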

# Plain style
file = open('文本.txt', 'w')  # write mode
file.write('编程')  # call write() to write the content
file.close()  # close the file

# Using with ... as ...
with open('文本.txt', 'w') as file:
    file.write('编程')

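After getting the response, we open a file named 孔乙己.txt in write mode and call write() to write the response body, as a string, into it. That completes the download. Any text content can be downloaded the same way: just write res.text to a file.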

res.content

Besides text, crawlers can also download images, audio, video, and so on. Here is an example that downloads an image:

import requests

# Fetch the image data
res = requests.get('https://assets.baydn.com/baydn/public/codetime/xiaobei/info.jpg')
# Open a file named info.jpg in binary write mode
with open('info.jpg', 'wb') as file:
    # Write the response body, as bytes, to the file
    file.write(res.content)

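So the difference between res.text and res.content is this: res.text is for fetching and downloading text, while res.content is for fetching and downloading binary content such as images, audio, and video.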
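
The relationship between the two attributes can be shown without any network access: res.content holds raw bytes, and res.text is those bytes decoded into a string. The sketch below imitates that with plain Python values; the variable names raw and text are stand-ins for res.content and res.text, and UTF-8 is assumed as the encoding.

```python
# Bytes as they might arrive from a server (what res.content would hold).
raw = "孔乙己".encode("utf-8")

# Decoding with the right codec gives the text (what res.text would hold
# when res.encoding is 'utf-8').
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))    # bytes 9
print(type(text).__name__, len(text))  # str 3
```

This is also why the image example opens the file in wb mode: a file opened in binary mode accepts bytes, while one opened in text mode accepts str.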

res.encoding

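Encoding is the process of converting information from one form or format into another; common text encodings include ASCII, GBK, and UTF-8. Decoding data with a different encoding from the one it was written in produces garbled text.
res.encoding is the encoding requests will use to decode the fetched data. The requests library guesses it from the HTTP response headers, sets res.encoding to that guess, and uses it to decode the body when you access res.text.
When the guess is wrong, which shows up as garbled text, we need to set res.encoding to the correct encoding ourselves.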

import requests

res = requests.get('https://www.baidu.com')
print(res.text)

# Output
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç�¾åº¦ä¸�ä¸�ï¼�ä½ å°±ç�¥é��</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç�¾åº¦ä¸�ä¸� class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ�°é�»</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å�°å�¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§�é¢�</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å�§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç�»å½�</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç�»å½�</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ�´å¤�äº§å��</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å
³äº�ç�¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç�¨ç�¾åº¦å��å¿
è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ��è§�å��é¦�</a>&nbsp;äº¬ICPè¯�030173å�·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

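In the output above we can see garbled text such as æ�´å¤�äº§å, which is Chinese text that was decoded with the wrong encoding. Let's see which encoding the requests library guessed: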

import requests

res = requests.get('https://www.baidu.com')
print(res.encoding)
# Output: ISO-8859-1

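As we can see, requests wrongly guessed ISO-8859-1. This is its fallback for text responses whose Content-Type header does not declare a charset. Sites in China generally use UTF-8, GBK, or GB2312.

The correct encoding for this site is actually UTF-8, so we set the encoding to UTF-8 manually and the content displays correctly.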

import requests

res = requests.get('https://www.baidu.com')
res.encoding = 'utf-8'
print(res.text)
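
To see why a wrong encoding produces this kind of garbled text rather than an error, the mistake can be reproduced offline with plain strings. This is a sketch under the assumption that the server sent UTF-8 bytes and they were decoded as ISO-8859-1; the sample string is chosen only for illustration.

```python
original = "百度一下"

# What the server sends: UTF-8 bytes.
raw = original.encode("utf-8")

# ISO-8859-1 maps every possible byte to some character, so decoding with
# it never fails; the mistake shows up as garbled text instead of an error.
garbled = raw.decode("iso-8859-1")

# Because that mapping is lossless, the damage can be reversed by going
# back to bytes and decoding with the correct codec.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(garbled))  # 4 12
print(repaired == original)         # True
```

requests also provides res.apparent_encoding, a guess based on the bytes of the body rather than the headers. Assigning it with res.encoding = res.apparent_encoding is a common alternative when the correct encoding is not known in advance, though that guess can be wrong too.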
