大师兄's Python Study Notes (20): Web Crawlers (1)

Author: superkmi | Published 2020-07-13 15:13

    Previous: 大师兄's Python Study Notes (19): Python with XML and JSON
    Next: 大师兄's Python Study Notes (21): Web Crawlers (2)

    I. About Crawlers (Spiders)

    1. Prerequisites beyond Python

    1) Front-end web knowledge

    • HTML
    • CSS
    • JavaScript
    • AJAX

    2) Networking knowledge

    • URLs
    • the HTTP protocol

    3) Content processing

    • re (regular expressions)
    • XPath
    • XML

    2. What is a crawler?

    • A crawler is a program or script that simulates human browsing behavior to automatically collect data and information from websites, following a defined set of rules.

    3. Types of crawlers

    • General-purpose crawlers: mainly download web pages to local storage, building a mirror backup of internet content.
    • Focused crawlers: filter content while fetching, trying to collect only pages relevant to a specific need.

    4. Basic steps of a crawler

    Download a page >> extract information >> save the data >> follow links to other pages and repeat (a skeleton of this loop is sketched below)
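
    A minimal sketch of that loop, assuming requests (introduced in the next chapter) plus hypothetical extract(), save(), and parse_links() helpers:
    >>>import requests

    >>>def crawl(start_url, max_pages=10):
    >>>    # simplified crawl loop: download, extract, save, follow
    >>>    seen, queue = set(), [start_url]
    >>>    while queue and len(seen) < max_pages:
    >>>        url = queue.pop(0)
    >>>        if url in seen:
    >>>            continue
    >>>        seen.add(url)
    >>>        html = requests.get(url).text   # 1. download the page
    >>>        save(extract(html))             # 2./3. extract and save (hypothetical helpers)
    >>>        queue.extend(parse_links(html)) # 4. follow links and repeat (hypothetical helper)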

    II. Downloading Pages

    1. The urllib library

    • urllib is the HTTP client built into the Python standard library; these notes focus on requests (below), which is more convenient.
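    A minimal sketch of fetching a page with the standard library's urllib.request:
    >>>from urllib import request

    >>>url = "https://www.baidu.com/"
    >>>with request.urlopen(url) as res:
    >>>    print(f'Status code: {res.status}')
    >>>    print(f'Content length: {len(res.read())}')
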
    2. The Requests library

    2.1 About the Requests library

    • requests is a simple, easy-to-use HTTP library.
    • It inherits all of urllib's features and is implemented on top of urllib3 under the hood.
    • It is more convenient than urllib and fully meets typical HTTP testing needs.
    2.2 Common methods

    1) requests.get(url, params=None, **kwargs)

    • Sends a GET request to url and returns a Response object.
    • params, headers, and similar options can be passed as dicts (a params sketch follows the example below).
    >>>import requests
    
    >>>url = "https://www.baidu.com/"
    >>>res = requests.get(url)
    
    >>>print(f'Status code: {res.status_code}')
    >>>print(f'Content: {res.text}')
    Status code: 200
    Content: <!DOCTYPE html><!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer>...(remaining HTML omitted; the Chinese text in this response prints garbled because requests guessed the wrong encoding, and setting res.encoding = 'utf-8' before reading res.text fixes that)
    

    2) requests.post(url, data=None, json=None, **kwargs)

    • Sends a POST request to url and returns a Response object.
    >>>import requests
    
    >>>url = "https://www.httpbin.org/post"
    >>>data = {'name':'test'}
    >>>res = requests.post(url,data=data)
    
    >>>print(f'Status code: {res.status_code}')
    >>>print(f'Content: {res.text}')
    Status code: 200
    Content: {
     "args": {}, 
     "data": "", 
     "files": {}, 
     "form": {
       "name": "test"
     }, 
     "headers": {
       "Accept": "*/*", 
       "Accept-Encoding": "gzip, deflate", 
       "Content-Length": "9", 
       "Content-Type": "application/x-www-form-urlencoded", 
       "Host": "www.httpbin.org", 
       "User-Agent": "python-requests/2.22.0", 
       "X-Amzn-Trace-Id": "Root=1-5f056eb5-b711fdfcc85ef0a2d6e6b8fc"
     }, 
     "json": null, 
     "origin": "122.115.236.202", 
     "url": "https://www.httpbin.org/post"
    }
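
    • The json parameter from the signature above sends a dict as a JSON request body instead of form data; httpbin echoes the parsed body back under the "json" key:
    >>>import requests

    >>>res = requests.post("https://www.httpbin.org/post", json={'name': 'test'})
    >>>print(res.json()['json'])
    {'name': 'test'}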
    
    

    3) Cookies

    • Cookies can be read directly from the Response object.
    >>>import requests
    
    >>>url = "https://www.baidu.com"
    >>>res = requests.get(url)
    >>>cookies = res.cookies
    >>>for k,v in cookies.items():
    >>>    print(f'{k}={v}')
    BDORZ=27315
    
    • Cookies can also be added to the request headers to maintain a logged-in state.
    >>>import requests
    
    >>>url = "https://www.baidu.com"
    >>>headers = {
    >>>    'Cookie':'BIDUPSID=8DA5516860A041C8C2682A9F6CE8310A; PSTM=1488431109; HMACCOUNT=F82546873707B865; BAIDUID=2E7ED6E71EC32E4E738D8FBA51115B51:FG=1; H_WISE_SIDS=147417_146789_143879_148320_147087_141744_147887_148194_148209_147279_146536_148001_148823_147848_147762_147828_147639_148754_147897_148524_149194_127969_149061_147239_147350_142420_146653_147024_146732_138425_131423_144659_142209_147527_145597_126063_107311_147304_146339_148029_147212_143507_144966_145607_148071_139882_146786_148345_147547_146056_145395_148869_110085; MCITY=-%3A; H_PS_PSSID=; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDSFRCVID=hW-OJeC62R6WKOTr3f6CbPEHwe5B58TTH6aoDIGUqq8sj7AJuNMnEG0PoM8g0Ku-S2-BogKK0mOTH6KF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJAq_Dt-tC83jb7G2tu_KPk8hx6054CX2C8sVUP2BhcqEIL40ljIDUkVKGb-blcaLgT7Lh5VylRPqxbSj4Qo-RFkjU6z0b5C22on5MK-Qh5nhMJSb67JDMPF-GoKhlby523ion6vQpP-OpQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0Djb-Datttjn2aIOt0Tr25RrjeJrmq4bohjP3jaO9BtQO-DOxoM7xynrKhpcOy45mQfkWbtRi-qKeQgnk2p523-Tao-Ooj4-WQlKNWGo30x-jLTny3l3ebxAVDPP9QtnJyUnQbPnnBT5i3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRu_CFhJKI2MIKCePbShnLOqlO-2tJ-ajPX3b7EfMnnsl7_bJ7KhUbyBn3v2JDe5jbt3fJNWP3qOpvC36bxQhFTQqOxhRov-20q3KnIQqrAMJnHQT3m5-4_QUOtyKryMnb4Wb3cWKJJ8UbSjxRPBTD02-nBat-OQ6npaJ5nJq5nhMJmb67JDMr0eGLeqT_JtJ-s06rtKRTffjrnhPF32J8PXP6-3bbu2GnIoP3K-DOtMRQP0Mo15PLU-J5eLp37JD6y--Ox-hcBEn0lhjOk5fCIh4oxJpOdMnbMopvaHx8KKqovbURvD-ug3-AqBM5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-j5JIEoC8ytC-KbKvNq45HMt00qxby26n3Q-j9aJ5nJDoCMx3oXpOKDP5BLU5T0xvf5RTG_CnmQpP-HJ7eyfJlKJobKUCf3xTtBejXKl0MLn7Ybb0xyn_V0TjDLxnMBMPjamOnaIQc3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDcnK4-XjjbXDHQP; delPer=0; PSINO=1; HMVT=6bcd52f51e9b3dce32bec4a3997715ac|1594190119|; BDRCVFR[tFA6N9pQGI3]=mk3SLVN4HKm',
    >>>    'Host':'hm.baidu.com',
    >>>    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    >>>}
    >>>res = requests.get(url,headers=headers)
    >>>print(res.cookies)
    <RequestsCookieJar[<Cookie H_PS_PSSID=32190_1431_31326_32140_31253_32046_32231_32260 for .baidu.com/>, <Cookie BDSVRTM=14 for hm.baidu.com/>, <Cookie BD_HOME=1 for hm.baidu.com/>]>
    

    4) Session

    • Maintains state (such as cookies) across requests; not the same concept as a server-side session.
    >>>import requests
    
    >>>url = "https://httpbin.org"
    >>>s = requests.Session()
    >>>s.get(url+'/cookies/set/name/test')
    >>>r = s.get(url+'/cookies') 
    >>>print(r.text) # returns the cookie set by the previous request in this session
    {
     "cookies": {
       "name": "test"
     }
    }
    

    5) SSL certificate verification

    • The verify flag controls whether the server's certificate is verified.
    >>>import requests
    
    >>>url = "https://www.baidu.com"
    >>>res = requests.get(url,verify=False) # skip certificate verification
    D:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.baidu.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
     InsecureRequestWarning,
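
    • To silence that warning after accepting the risk, disable it through urllib3 (a sketch; urllib3 is a dependency of requests, so it is already installed):
    >>>import requests, urllib3

    >>>urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    >>>res = requests.get("https://www.baidu.com", verify=False) # no InsecureRequestWarning now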
    
    • A local client certificate can be specified.
    >>>import requests
    
    >>>url = "https://www.baidu.com"
    >>>res = requests.get(url,cert=('server.crt','key')) # supply a local certificate and key
    

    6) Setting proxies

    • Use the proxies parameter to route requests through a proxy.
    >>>import requests
    
    >>>url = "https://www.baidu.com"
    >>>proxies = {
    >>>    "http":"http://host:port", # your proxy server's address
    >>>    "https":"https://host:port"}
    >>>res = requests.get(url,proxies=proxies)
    
    • SOCKS proxies are also supported (this requires the optional extra: pip install requests[socks]).
    >>>import requests
    
    >>>url = "https://www.baidu.com"
    >>>proxies = {
    >>>    "http":"socks5://0.0.0.0:10005", # your proxy server's address
    >>>    "https":"socks5://0.0.0.0:10006"
    >>>}
    >>>res = requests.get(url,proxies=proxies)
    

    7) Setting timeouts

    • Use the timeout parameter to set a timeout in seconds.
    • To set the connect and read timeouts separately, pass a tuple.
    >>>import requests
    
    >>>url = "https://www.bbaidu.com" # intentionally misspelled host, so the connection times out
    >>>try:
    >>>    res = requests.get(url,timeout = (3,5))
    >>>except Exception as e:
    >>>    print(e)
    HTTPSConnectionPool(host='www.bbaidu.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x000001D9C0F72248>, 'Connection to www.bbaidu.com timed out. (connect timeout=3)'))
    

    8) Authentication

    • Use the auth parameter for HTTP basic authentication.
    • auth takes a tuple of (username, password).
    • Under the hood, verification is handled by the HTTPBasicAuth class (see the equivalent sketch after this example).
    >>>import requests
    
    >>>url = "https://www.baidu.com"
    >>>username = 'youruser'
    >>>password = 'yourpassword'
    
    >>>res = requests.get(url,auth=(username,password))
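
    • The tuple form above is shorthand for constructing HTTPBasicAuth explicitly; the two calls below are equivalent:
    >>>import requests
    >>>from requests.auth import HTTPBasicAuth

    >>>res1 = requests.get("https://www.baidu.com", auth=('youruser','yourpassword'))
    >>>res2 = requests.get("https://www.baidu.com", auth=HTTPBasicAuth('youruser','yourpassword'))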
    

    9) Using Prepared Requests

    • A Prepared Request packages the request parameters into a standalone Request object, which a Session can then prepare and send.
    >>>from requests import Request,Session
    
    >>>url1 = "https://www.baidu.com"
    >>>url2 = "https://httpbin.org"
    >>>data = {
    >>>    'name':'test'
    >>>}
    >>>headers = {
    >>>    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    >>>}
    
    >>>session = Session()
    >>>req1 = Request('GET',url1,headers=headers)
    >>>req2 = Request('GET',url2,data=data,headers=headers)
    >>>prepared1 = session.prepare_request(req1)
    >>>prepared2 = session.prepare_request(req2)
    >>>res1 = session.send(prepared1)
    >>>res2 = session.send(prepared2)
    >>>print(f"res1:{res1.url}")
    >>>print(f"res2:{res2.url}")
    res1:https://www.baidu.com/
    res2:https://httpbin.org/
    

    III. Extracting Information

    1. Using regular expressions

    1) Fetch the page

    >>>import requests
    
    >>>def get_page(url):
    >>>    # fetch the page content
    >>>    headers = {
    >>>        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    >>>    }
    >>>    res = requests.get(url=url,headers=headers)
    >>>    if res.status_code == 200:
    >>>        return res.text
    >>>    else:
    >>>        return None
    
    >>>def main():
    >>>    # entry point
    >>>    url = 'https://movie.douban.com/top250'
    >>>    page_data = get_page(url)
    >>>    if page_data:
    >>>        print(len(page_data)) # the content is long, so print only its length
    
    >>>if __name__ == '__main__':
    >>>    main()
    63460
    

    2) Study the HTML source and write the regex

    • The HTML for one entry in the ranking:
    <ol class="grid_view">
           <li>
               <div class="item">
                   <div class="pic">
                       <em class="">1</em>
                       <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" class="">
                       </a>
                   </div>
                   <div class="info">
                       <div class="hd">
                           <a href="https://movie.douban.com/subject/1292052/" class="">
                               <span class="title">肖申克的救赎</span>
                                        <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                                   <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
                           </a>
    
    
                               <span class="playable">[可播放]</span>
                       </div>
                       <div class="bd">
                           <p class="">
                               导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                               1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                           </p>
    
                           
                           <div class="star">
                                   <span class="rating5-t"></span>
                                   <span class="rating_num" property="v:average">9.7</span>
                                   <span property="v:best" content="10.0"></span>
                                   <span>2072827人评价</span>
                           </div>
    
                               <p class="quote">
                                   <span class="inq">希望让人自由。</span>
                               </p>
                       </div>
                   </div>
               </div>
           </li>
    ... ... 
    
    • The regular expression derived from that HTML:
    >>> pattern = re.compile(
    >>>            r'<em class="">(.*?)</em>\n.*?<a href="(.*?)">[\s\S]*?class="title">(.*?)</span>[\s\S]*?导演: (.*?)&nbsp.*?主演: (.*?)<br>[\s\S]*?<span>(.*?人评价)</span>'
    >>>        )
    

    3) Scrape the page content with the regex

    >>>import requests,re
    
    >>>def sort_data(func):
    >>>    def deco(*args,**kargs):
    >>>        # parse the fetched page into records
    >>>        data = func(*args,**kargs)
    >>>        pattern = re.compile(
    >>>            r'<em class="">(.*?)</em>\n.*?<a href="(.*?)">[\s\S]*?class="title">(.*?)</span>[\s\S]*?导演: (.*?)&nbsp.*?主演: (.*?)<br>[\s\S]*?<span>(.*?人评价)</span>'
    >>>        )
    >>>        items = re.findall(pattern,data)
    >>>        for item in items:
    >>>            yield {
    >>>                'index':item[0],
    >>>                'link':item[1],
    >>>                'name':item[2],
    >>>                'director':item[3],
    >>>                'actors':item[4],
    >>>                'post':item[5]
    >>>            }
    >>>    return deco
    
    >>>@sort_data
    >>>def get_page(url):
    >>>    # fetch the page content
    >>>    headers = {
    >>>        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    >>>    }
    >>>    res = requests.get(url=url,headers=headers)
    >>>    if res.status_code == 200:
    >>>        return res.text
    >>>    else:
    >>>        return None
    
    >>>def show_result(data):
    >>>    # print the first 10 records
    >>>    for i in range(10):
    >>>        print(next(data))
    
    >>>def main():
    >>>    # entry point
    >>>    url = 'https://movie.douban.com/top250'
    >>>    page_data = get_page(url)
    >>>    show_result(page_data)
    
    >>>if __name__ == '__main__':
    >>>    main()
    {'index': '1', 'link': 'https://movie.douban.com/subject/1292052/', 'name': '肖申克的救赎', 'director': '弗兰克·德拉邦特 Frank Darabont', 'actors': '蒂姆·罗宾斯 Tim Robbins /...', 'post': '2072827人评价'}
    {'index': '2', 'link': 'https://movie.douban.com/subject/1291546/', 'name': '霸王别姬', 'director': '陈凯歌 Kaige Chen', 'actors': '张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...', 'post': '1536626人评价'}
    {'index': '3', 'link': 'https://movie.douban.com/subject/1292720/', 'name': '阿甘正传', 'director': '罗伯特·泽米吉斯 Robert Zemeckis', 'actors': '汤姆·汉克斯 Tom Hanks / ...', 'post': '1566647人评价'}
    {'index': '4', 'link': 'https://movie.douban.com/subject/1295644/', 'name': '这个杀手不太冷', 'director': '吕克·贝松 Luc Besson', 'actors': '让·雷诺 Jean Reno / 娜塔莉·波特曼 ...', 'post': '1757433人评价'}
    {'index': '5', 'link': 'https://movie.douban.com/subject/1292063/', 'name': '美丽人生', 'director': '罗伯托·贝尼尼 Roberto Benigni', 'actors': '罗伯托·贝尼尼 Roberto Beni...', 'post': '982086人评价'}
    {'index': '6', 'link': 'https://movie.douban.com/subject/1292722/', 'name': '泰坦尼克号', 'director': '詹姆斯·卡梅隆 James Cameron', 'actors': '莱昂纳多·迪卡普里奥 Leonardo...', 'post': '1519400人评价'}
    {'index': '7', 'link': 'https://movie.douban.com/subject/1291561/', 'name': '千与千寻', 'director': '宫崎骏 Hayao Miyazaki', 'actors': '柊瑠美 Rumi Hîragi / 入野自由 Miy...', 'post': '1627730人评价'}
    {'index': '8', 'link': 'https://movie.douban.com/subject/1295124/', 'name': '辛德勒的名单', 'director': '史蒂文·斯皮尔伯格 Steven Spielberg', 'actors': '连姆·尼森 Liam Neeson...', 'post': '798211人评价'}
    {'index': '9', 'link': 'https://movie.douban.com/subject/3541415/', 'name': '盗梦空间', 'director': '克里斯托弗·诺兰 Christopher Nolan', 'actors': '莱昂纳多·迪卡普里奥 Le...', 'post': '1496572人评价'}
    {'index': '10', 'link': 'https://movie.douban.com/subject/3011091/', 'name': '忠犬八公的故事', 'director': '莱塞·霍尔斯道姆 Lasse Hallström', 'actors': '理查·基尔 Richard Ger...', 'post': '1041023人评价'}
    
