大师兄's Python Learning Notes (21): Web Crawlers (Part 2)
I. About Crawlers (Spiders)
1. Prerequisites beyond Python
1) Front-end web knowledge
- HTML
- CSS
- JavaScript
- AJAX
2) Network communication knowledge
- URLs
- the HTTP protocol
3) Content processing
- re
- XPath
- XML
2. What is a crawler?
- A crawler is a program or script that simulates human behavior and, following certain rules, automatically collects data and information from websites on the Internet.
3. Types of crawlers
- General-purpose crawlers: their main goal is to download web pages to a local store, building a mirror backup of Internet content.
- Focused crawlers: they filter content while fetching, trying to grab only the pages relevant to a specific need.
4. Basic steps of a crawler
Download a page >> extract information >> save the data >> follow links to other pages and repeat
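To make these steps concrete, here is a minimal sketch of the whole loop. It is an illustration only: start_url is whatever page you want to crawl, the regular expressions are deliberately crude, and result.txt is just an illustrative output file.
>>> import requests, re
>>> def crawl(start_url, max_pages=3):
...     # download page >> extract information >> save data >> follow links and repeat
...     seen, queue = set(), [start_url]
...     while queue and len(seen) < max_pages:
...         url = queue.pop(0)
...         if url in seen:
...             continue
...         seen.add(url)
...         html = requests.get(url).text                          # 1. download the page
...         titles = re.findall(r'<title>(.*?)</title>', html)     # 2. extract information
...         with open('result.txt', 'a', encoding='utf-8') as f:   # 3. save the data
...             f.write('\n'.join(titles) + '\n')
...         queue += re.findall(r'href="(https?://[^"]*)"', html)  # 4. queue links to other pages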
II. Downloading Web Pages
1. The urllib library
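This note does not expand on urllib, which has its own coverage earlier in this series; purely for comparison with Requests below, a minimal standard-library sketch of the same kind of GET request:
>>> from urllib import request
>>> res = request.urlopen("https://www.baidu.com/")
>>> print(res.status)                     # HTTP status code
200
>>> html = res.read().decode('utf-8')     # the body comes back as bytes and must be decoded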
2. The Requests library
2.1 About the Requests library
- Requests is a simple, easy-to-use HTTP library.
- It inherits all of urllib's features and is implemented on top of urllib3.
- It is more convenient than urllib and fully covers typical HTTP testing needs.
2.2 Common methods
1) requests.get(url, params=None, **kwargs)
- Requests the url with GET and returns a response object.
- params, headers, and similar arguments can be passed as dicts (see the sketch after the example below).
>>> import requests
>>> url = "https://www.baidu.com/"
>>> res = requests.get(url)
>>> print(f'Status code: {res.status_code}')
>>> print(f'Content: {res.text}')
Status code: 200
Content: <!DOCTYPE html><!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css>... (long HTML output truncated)
- Note: the Chinese text in this raw output prints garbled because requests guesses the wrong encoding for this page; setting res.encoding = 'utf-8' before reading res.text fixes it.
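The params argument from the signature above is not shown in the basic example, so here is a minimal sketch; www.httpbin.org is used because it simply echoes the request back:
>>> import requests
>>> params = {'name': 'test', 'page': '1'}
>>> res = requests.get("https://www.httpbin.org/get", params=params)
>>> print(res.url)    # the dict is URL-encoded into the query string
https://www.httpbin.org/get?name=test&page=1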
2) requests.post(url, data=None, json=None, **kwargs)
- Requests the url with POST and returns a response object (a JSON-body variant is sketched after the example below).
>>> import requests
>>> url = "https://www.httpbin.org/post"
>>> data = {'name': 'test'}
>>> res = requests.post(url, data=data)
>>> print(f'Status code: {res.status_code}')
>>> print(f'Content: {res.text}')
Status code: 200
Content: {
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "test"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "9",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "www.httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5f056eb5-b711fdfcc85ef0a2d6e6b8fc"
  },
  "json": null,
  "origin": "122.115.236.202",
  "url": "https://www.httpbin.org/post"
}
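The json parameter in the signature above sends a JSON body instead of a form body; a minimal sketch against the same httpbin service:
>>> import requests
>>> res = requests.post("https://www.httpbin.org/post", json={'name': 'test'})
>>> print(res.json()['json'])    # httpbin echoes the parsed JSON body back
{'name': 'test'}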
3) Cookies
- Cookies can be read directly from the response object.
>>> import requests
>>> url = "https://www.baidu.com"
>>> res = requests.get(url)
>>> cookies = res.cookies
>>> for k, v in cookies.items():
...     print(f'{k}={v}')
BDORZ=27315
- Cookies can also be added to the request headers to keep a login session alive (an alternative via the cookies parameter is sketched after the example below).
>>> import requests
>>> url = "https://www.baidu.com"
>>> headers = {
...     'Cookie': 'BIDUPSID=8DA5516860A041C8C2682A9F6CE8310A; PSTM=1488431109; HMACCOUNT=F82546873707B865; BAIDUID=2E7ED6E71EC32E4E738D8FBA51115B51:FG=1; H_WISE_SIDS=147417_146789_143879_148320_147087_141744_147887_148194_148209_147279_146536_148001_148823_147848_147762_147828_147639_148754_147897_148524_149194_127969_149061_147239_147350_142420_146653_147024_146732_138425_131423_144659_142209_147527_145597_126063_107311_147304_146339_148029_147212_143507_144966_145607_148071_139882_146786_148345_147547_146056_145395_148869_110085; MCITY=-%3A; H_PS_PSSID=; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDSFRCVID=hW-OJeC62R6WKOTr3f6CbPEHwe5B58TTH6aoDIGUqq8sj7AJuNMnEG0PoM8g0Ku-S2-BogKK0mOTH6KF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJAq_Dt-tC83jb7G2tu_KPk8hx6054CX2C8sVUP2BhcqEIL40ljIDUkVKGb-blcaLgT7Lh5VylRPqxbSj4Qo-RFkjU6z0b5C22on5MK-Qh5nhMJSb67JDMPF-GoKhlby523ion6vQpP-OpQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0Djb-Datttjn2aIOt0Tr25RrjeJrmq4bohjP3jaO9BtQO-DOxoM7xynrKhpcOy45mQfkWbtRi-qKeQgnk2p523-Tao-Ooj4-WQlKNWGo30x-jLTny3l3ebxAVDPP9QtnJyUnQbPnnBT5i3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRu_CFhJKI2MIKCePbShnLOqlO-2tJ-ajPX3b7EfMnnsl7_bJ7KhUbyBn3v2JDe5jbt3fJNWP3qOpvC36bxQhFTQqOxhRov-20q3KnIQqrAMJnHQT3m5-4_QUOtyKryMnb4Wb3cWKJJ8UbSjxRPBTD02-nBat-OQ6npaJ5nJq5nhMJmb67JDMr0eGLeqT_JtJ-s06rtKRTffjrnhPF32J8PXP6-3bbu2GnIoP3K-DOtMRQP0Mo15PLU-J5eLp37JD6y--Ox-hcBEn0lhjOk5fCIh4oxJpOdMnbMopvaHx8KKqovbURvD-ug3-AqBM5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-j5JIEoC8ytC-KbKvNq45HMt00qxby26n3Q-j9aJ5nJDoCMx3oXpOKDP5BLU5T0xvf5RTG_CnmQpP-HJ7eyfJlKJobKUCf3xTtBejXKl0MLn7Ybb0xyn_V0TjDLxnMBMPjamOnaIQc3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDcnK4-XjjbXDHQP; delPer=0; PSINO=1; HMVT=6bcd52f51e9b3dce32bec4a3997715ac|1594190119|; BDRCVFR[tFA6N9pQGI3]=mk3SLVN4HKm',
...     'Host': 'hm.baidu.com',
...     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
... }
>>> res = requests.get(url, headers=headers)
>>> print(res.cookies)
<RequestsCookieJar[<Cookie H_PS_PSSID=32190_1431_31326_32140_31253_32046_32231_32260 for .baidu.com/>, <Cookie BDSVRTM=14 for hm.baidu.com/>, <Cookie BD_HOME=1 for hm.baidu.com/>]>
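Assembling the Cookie header by hand works, but requests also accepts a cookies parameter directly; a minimal sketch reusing the BDORZ value from the first cookies example:
>>> import requests
>>> cookies = {'BDORZ': '27315'}    # value borrowed from the example above
>>> res = requests.get("https://www.httpbin.org/cookies", cookies=cookies)
>>> print(res.text)
{
  "cookies": {
    "BDORZ": "27315"
  }
}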
4) Session
- A Session keeps state across requests on the client side; it is not the same concept as a server-side session.
>>> import requests
>>> url = "https://httpbin.org"
>>> s = requests.Session()
>>> s.get(url + '/cookies/set/name/test')
>>> r = s.get(url + '/cookies')
>>> print(r.text)    # the cookies set by the previous request are still there
{
  "cookies": {
    "name": "test"
  }
}
5) SSL certificate verification
- Controls whether the server certificate is verified.
>>> import requests
>>> url = "https://www.baidu.com"
>>> res = requests.get(url, verify=False)    # skip certificate verification
D:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.baidu.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
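If verification is disabled on purpose, the InsecureRequestWarning shown above can be silenced through urllib3; this is an optional extra step, not part of the original example:
>>> import requests, urllib3
>>> urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
>>> res = requests.get("https://www.baidu.com", verify=False)    # no warning this time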
- Specifying a local certificate.
>>> import requests
>>> url = "https://www.baidu.com"
>>> res = requests.get(url, cert=('server.crt', 'key'))    # supply a local certificate and key
6) Setting proxies
- Proxies are configured through the proxies parameter.
>>> import requests
>>> url = "https://www.baidu.com"
>>> proxies = {
...     "http": "http://host:port",      # your proxy server address
...     "https": "https://host:port"
... }
>>> res = requests.get(url, proxies=proxies)
- SOCKS proxies are also supported (this needs the optional dependency, installed with pip install requests[socks]).
>>> import requests
>>> url = "https://www.baidu.com"
>>> proxies = {
...     "http": "socks5://0.0.0.0:10005",    # your proxy server address
...     "https": "socks5://0.0.0.0:10006"
... }
>>> res = requests.get(url, proxies=proxies)
7) Setting timeouts
- Use the timeout parameter to set a timeout.
- To set the connect and read timeouts separately, pass a tuple.
>>> import requests
>>> url = "https://www.bbaidu.com"
>>> try:
...     res = requests.get(url, timeout=(3, 5))
... except Exception as e:
...     print(e)
HTTPSConnectionPool(host='www.bbaidu.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x000001D9C0F72248>, 'Connection to www.bbaidu.com timed out. (connect timeout=3)'))
8) Authentication
- Use the auth parameter to set up authentication.
- The auth parameter is a tuple containing the username and password.
- Under the hood it is validated by the HTTPBasicAuth class (the explicit form is sketched after the example below).
>>> import requests
>>> url = "https://www.baidu.com"
>>> username = 'youruser'
>>> password = 'yourpassword'
>>> res = requests.get(url, auth=(username, password))
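Since the tuple is shorthand for HTTPBasicAuth, the explicit equivalent looks like this:
>>> import requests
>>> from requests.auth import HTTPBasicAuth
>>> url = "https://www.baidu.com"
>>> res = requests.get(url, auth=HTTPBasicAuth('youruser', 'yourpassword'))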
9) Using Prepared Requests
- A Prepared Request packages the request parameters into a standalone Request object, which can be inspected or tweaked before it is sent.
>>> from requests import Request, Session
>>> url1 = "https://www.baidu.com"
>>> url2 = "https://httpbin.org"
>>> data = {'name': 'test'}
>>> headers = {
...     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
... }
>>> session = Session()
>>> req1 = Request('GET', url1, headers=headers)
>>> req2 = Request('GET', url2, data=data, headers=headers)
>>> prepared1 = session.prepare_request(req1)
>>> prepared2 = session.prepare_request(req2)
>>> res1 = session.send(prepared1)
>>> res2 = session.send(prepared2)
>>> print(f"res1:{res1.url}")
>>> print(f"res2:{res2.url}")
res1:https://www.baidu.com/
res2:https://httpbin.org/
III. Extracting Information
1. Using regular expressions
- See 大师兄's Python Learning Notes (7): the re package and regular expressions.
- Example: extracting the top 10 entries from Douban's movie Top 250.
1) Fetching the first page
>>> import requests
>>> def get_page(url):
...     # fetch the page content
...     headers = {
...         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
...     }
...     res = requests.get(url=url, headers=headers)
...     if res.status_code == 200:
...         return res.text
...     else:
...         return None
>>> def main():
...     # entry point
...     url = 'https://movie.douban.com/top250'
...     page_data = get_page(url)
...     if page_data:
...         print(len(page_data))    # the content is long, so only its length is printed
>>> if __name__ == '__main__':
...     main()
63460
2) Studying the HTML source to write the regular expression
- The HTML for the ranking section:
<ol class="grid_view">
  <li>
    <div class="item">
      <div class="pic">
        <em class="">1</em>
        <a href="https://movie.douban.com/subject/1292052/">
          <img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" class="">
        </a>
      </div>
      <div class="info">
        <div class="hd">
          <a href="https://movie.douban.com/subject/1292052/" class="">
            <span class="title">肖申克的救赎</span>
            <span class="title"> / The Shawshank Redemption</span>
            <span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
          </a>
          <span class="playable">[可播放]</span>
        </div>
        <div class="bd">
          <p class="">
            导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
            1994 / 美国 / 犯罪 剧情
          </p>
          <div class="star">
            <span class="rating5-t"></span>
            <span class="rating_num" property="v:average">9.7</span>
            <span property="v:best" content="10.0"></span>
            <span>2072827人评价</span>
          </div>
          <p class="quote">
            <span class="inq">希望让人自由。</span>
          </p>
        </div>
      </div>
    </div>
  </li>
  ... ...
- The regular expression derived from the HTML:
>>> pattern = re.compile(
...     r'<em class="">(.*?)</em>\n.*?<a href="(.*?)">[\s\S]*?class="title">(.*?)</span>'
...     r'[\s\S]*?导演: (.*?) .*?主演: (.*?)<br>[\s\S]*?<span>(.*?人评价)</span>'
... )
3) Extracting the page content with the regular expression
>>> import requests, re
>>> def sort_data(func):
...     def deco(*args, **kwargs):
...         # parse the downloaded page and yield one record per movie
...         data = func(*args, **kwargs)
...         pattern = re.compile(
...             r'<em class="">(.*?)</em>\n.*?<a href="(.*?)">[\s\S]*?class="title">(.*?)</span>'
...             r'[\s\S]*?导演: (.*?) .*?主演: (.*?)<br>[\s\S]*?<span>(.*?人评价)</span>'
...         )
...         items = re.findall(pattern, data)
...         for item in items:
...             yield {
...                 'index': item[0],
...                 'link': item[1],
...                 'name': item[2],
...                 'director': item[3],
...                 'actors': item[4],
...                 'post': item[5]
...             }
...     return deco
>>> @sort_data
... def get_page(url):
...     # fetch the page content
...     headers = {
...         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
...     }
...     res = requests.get(url=url, headers=headers)
...     if res.status_code == 200:
...         return res.text
...     else:
...         return None
>>> def show_result(data):
...     # print the first 10 records from the generator
...     for i in range(10):
...         print(next(data))
>>> def main():
...     # entry point
...     url = 'https://movie.douban.com/top250'
...     page_data = get_page(url)
...     show_result(page_data)
>>> if __name__ == '__main__':
...     main()
{'index': '1', 'link': 'https://movie.douban.com/subject/1292052/', 'name': '肖申克的救赎', 'director': '弗兰克·德拉邦特 Frank Darabont', 'actors': '蒂姆·罗宾斯 Tim Robbins /...', 'post': '2072827人评价'}
{'index': '2', 'link': 'https://movie.douban.com/subject/1291546/', 'name': '霸王别姬', 'director': '陈凯歌 Kaige Chen', 'actors': '张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...', 'post': '1536626人评价'}
{'index': '3', 'link': 'https://movie.douban.com/subject/1292720/', 'name': '阿甘正传', 'director': '罗伯特·泽米吉斯 Robert Zemeckis', 'actors': '汤姆·汉克斯 Tom Hanks / ...', 'post': '1566647人评价'}
{'index': '4', 'link': 'https://movie.douban.com/subject/1295644/', 'name': '这个杀手不太冷', 'director': '吕克·贝松 Luc Besson', 'actors': '让·雷诺 Jean Reno / 娜塔莉·波特曼 ...', 'post': '1757433人评价'}
{'index': '5', 'link': 'https://movie.douban.com/subject/1292063/', 'name': '美丽人生', 'director': '罗伯托·贝尼尼 Roberto Benigni', 'actors': '罗伯托·贝尼尼 Roberto Beni...', 'post': '982086人评价'}
{'index': '6', 'link': 'https://movie.douban.com/subject/1292722/', 'name': '泰坦尼克号', 'director': '詹姆斯·卡梅隆 James Cameron', 'actors': '莱昂纳多·迪卡普里奥 Leonardo...', 'post': '1519400人评价'}
{'index': '7', 'link': 'https://movie.douban.com/subject/1291561/', 'name': '千与千寻', 'director': '宫崎骏 Hayao Miyazaki', 'actors': '柊瑠美 Rumi Hîragi / 入野自由 Miy...', 'post': '1627730人评价'}
{'index': '8', 'link': 'https://movie.douban.com/subject/1295124/', 'name': '辛德勒的名单', 'director': '史蒂文·斯皮尔伯格 Steven Spielberg', 'actors': '连姆·尼森 Liam Neeson...', 'post': '798211人评价'}
{'index': '9', 'link': 'https://movie.douban.com/subject/3541415/', 'name': '盗梦空间', 'director': '克里斯托弗·诺兰 Christopher Nolan', 'actors': '莱昂纳多·迪卡普里奥 Le...', 'post': '1496572人评价'}
{'index': '10', 'link': 'https://movie.douban.com/subject/3011091/', 'name': '忠犬八公的故事', 'director': '莱塞·霍尔斯道姆 Lasse Hallström', 'actors': '理查·基尔 Richard Ger...', 'post': '1041023人评价'}
Author: 大师兄 (superkmi)