The most complete Python crawler tutorial ever!

Author: c067527d47c2 | Published 2019-05-05 15:06
    0X01 Required libraries

    1.urllib

    import urllib.request
    urllib.request.urlopen("http://www.baidu.com")
    

    2.re

    3.requests

    4.selenium

    This library works together with browser drivers to crawl dynamically rendered pages.

    (1)chromedriver

    To use it, first download chromedriver.exe and put it in the same directory as chrome.exe (the default installation path), then add that directory to PATH.

    import selenium
    from selenium import webdriver
    
    driver = webdriver.Chrome()
    driver.get("http://www.baidu.com")
    driver.page_source
    

    The only drawback of this approach is that a browser window pops up, which we may not want, so we can run in headless mode to hide the web UI (in practice, create an Options object and set the headless flag with its add_argument method).

    import os
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.chrome.options import Options
    import time
    
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    
    base_url = "http://www.baidu.com/"
    #directory where chromedriver.exe is placed
    driver = webdriver.Chrome(executable_path=(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'), chrome_options=chrome_options)
    
    driver.get(base_url + "/")
    
    start_time=time.time()
    print('this is start_time ',start_time)
    
    driver.find_element_by_id("kw").send_keys("selenium webdriver")
    driver.find_element_by_id("su").click()
    driver.save_screenshot('screen.png')
    
    driver.close()
    
    end_time=time.time()
    print('this is end_time ',end_time)
    

    (2)phantomJS

    This is another way to run without a browser UI. PhantomJS is no longer maintained and can behave in mysterious ways, but it is still worth a brief introduction.

    As with chromedriver, first download phantomJS and put it on PATH so we can call it later.

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get("http://www.baidu.com")
    print(driver.page_source)
    

    5.lxml

    This library is what we will use for XPath-based parsing.
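
    A minimal sketch of how lxml is typically used with XPath (the HTML fragment here is made up for illustration):

    from lxml import etree

    # parse a small HTML fragment and query it with XPath expressions
    html = etree.HTML('<div><a href="/1.html">first</a><a href="/2.html">second</a></div>')
    print(html.xpath('//a/@href'))   # ['/1.html', '/2.html']
    print(html.xpath('//a/text()'))  # ['first', 'second']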

    6.beautifulsoup

    When installing with pip, note that the package name is beautifulsoup4 (the fourth major version); since we will use it with the lxml parser, install lxml first.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup('<html></html>','lxml')
    

    7.pyquery

    Like BeautifulSoup, this is a page-parsing library, but its syntax is somewhat simpler (it is modeled on jQuery).

    from pyquery import PyQuery as pq
    
    page = pq('<html>hello world</html>')
    result = page('html').text()
    print(result)
    

    8.pymysql

    This library lets Python work with MySQL.

    import pymysql
    
    conn = pymysql.connect(host='localhost',user='root',password='root',port=3306,db='test')
    cursor = conn.cursor()
    result = cursor.execute('select * from user where id = 1')
    print(cursor.fetchone())
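
    The example above only reads data; for writes, remember that pymysql needs an explicit commit. A minimal sketch (the table and columns are assumed to exist):

    import pymysql

    conn = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='test')
    cursor = conn.cursor()
    # parameterized insert; commit() is required for the write to take effect
    cursor.execute('insert into user (id, name) values (%s, %s)', (2, 'Tom'))
    conn.commit()
    conn.close()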
    

    9.pymongo

    import pymongo

    client = pymongo.MongoClient('localhost')
    db = client['newtestdb']
    db['table'].insert_one({'name': 'Bob'})
    print(db['table'].find_one({'name': 'Bob'}))
    

    10.redis

    import redis
    
    r = redis.Redis('localhost',6379)
    r.set("name","Bob")
    r.get('name')
    

    11.flask

    Flask may come in handy later when working with proxies.

    from flask import Flask

    app = Flask(__name__)

    @app.route('/')
    def hello():
        return "hello world"

    if __name__ == '__main__':
        app.run(debug=True)
    

    12.django

    Django may be used later for maintaining distributed crawlers.

    13.jupyter

    A browser-based notebook.

    0X02 Basics

    1.Basic principles of crawlers

    (1)What is a crawler

    A crawler is an automated tool that requests web pages and extracts data from them.

    (2)The basic crawling workflow

    1.Send a request:

    Use an HTTP library to send a request to the target site, i.e. a Request (optionally carrying extra header information), then wait for the server's response.

    2.Get the response content

    If the server responds normally, we get a Response whose body is the content of the page we want; it may be HTML, JSON, binary data (images, video), and so on.

    3.Parse the content

    HTML can be parsed with regular expressions or a page-parsing library; JSON can be converted into a JSON object and parsed; binary data can be saved or processed further.

    4.Save the data

    The data can be saved in many forms: plain text, a database, or files in a specific format.

    (3)Basic elements of a request

    1.Request method

    2.Request URL

    3.Request headers

    4.Request body (POST only)

    (4)Basic elements of a response

    1.Status code

    2.Response headers

    3.Response body

    (5)Example code:

    1.Requesting page data

    import requests
    
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    res = requests.get("http://www.baidu.com",headers=headers)
    
    print(res.status_code)
    print(res.headers)
    print(res.text)
    

    Here we used res.text, the text form; if the response is binary data (such as an image), we should use res.content instead.

    2.Requesting binary data

    import requests
    
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    res = requests.get("https://ss2.bdstatic.com/lfoZeXSm1A5BphGlnYG/icon/95486.png",headers=headers)
    print(res.content)
    
    with open(r'E:\桌面\1.png','wb') as f:
        f.write(res.content)
    

    (6)Parsing approaches

    1.Direct processing

    2.Converting to a JSON object

    3.Regular expression matching

    4.BeautifulSoup

    5.PyQuery

    6.XPath

    (7)Why the response differs from what we see in the browser

    When our script makes a request (just one request), it gets the rawest form of the page source. That source references many remote JS and CSS resources that our script cannot execute, whereas a browser requests those remote resources as well and uses the data it loads to further build and render the page. The page we see in the browser is therefore the result of many requests plus rendering, so it is bound to differ from the page returned by our single request.

    (8)How to handle pages rendered by JS

    In essence, every solution simulates the browser's loading and rendering and then returns the rendered page.

    1.Analyze the Ajax requests (see the sketch after this list)

    2.selenium + webdriver (recommended)

    3.splash

    4.PyV8, Ghost.py
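
    For approach 1, the idea is to find the JSON interface the page calls in the background (e.g. via the browser's Network panel) and request it directly. A minimal sketch, using httpbin.org as a stand-in for a real Ajax endpoint:

    import requests

    # pretend this URL is the JSON interface spotted in the Network panel
    res = requests.get('http://httpbin.org/get', params={'page': 1})
    data = res.json()      # Ajax responses are usually JSON
    print(data['args'])    # pull out the fields we care about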

    (9)How to store the data

    1.Plain text (see the sketch after this list)

    2.Relational databases

    3.Non-relational databases

    4.Binary files
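
    As a minimal sketch of option 1, crawled records can simply be written out as JSON lines in a plain text file (the data here is made up for illustration):

    import json

    items = [{'name': 'Tom', 'price': 59}]  # made-up crawl results
    with open('result.txt', 'a', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')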

    0X03 The urllib library

    1.What is urllib

    urllib is Python's built-in request library.

    urllib.request -> the request module

    urllib.error -> the exception handling module

    urllib.parse -> the URL parsing module

    urllib.robotparser -> the robots.txt parsing module (see the sketch below)
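
    The first three modules are used throughout this section; as for urllib.robotparser, a minimal sketch of checking robots.txt looks like this:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('http://www.baidu.com/robots.txt')
    rp.read()
    # can_fetch() reports whether the given user agent may crawl the URL
    print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))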

    2.Basic usage of urllib

    (1)Call signature

    urllib.request.urlopen(url,data,timeout...)
    

    (2)Example 1: a GET request

    import urllib.request
    
    res = urllib.request.urlopen("http://www.baidu.com")
    print(res.read().decode('utf-8'))
    

    (3)Example 2: a POST request

    import urllib.request
    import urllib.parse
    from pprint import pprint
    
    data = bytes(urllib.parse.urlencode({'world':'hello'}),encoding = 'utf8')
    res = urllib.request.urlopen('https://httpbin.org/post',data = data)
    pprint(res.read().decode('utf-8'))
    

    (4)Example 3: setting a timeout

    import urllib.request
    
    res = urllib.request.urlopen("http://httpbin.org/get",timeout = 1)
    print(res.read().decode('utf-8'))
    

    (5)Example: getting the status code, response headers, and response body

    import urllib.request
    
    res = urllib.request.urlopen("http://httpbin.org/get")
    print(res.status)
    print(res.getheaders())
    print(res.getheader('Server'))
    #read() returns bytes, so we use decode('utf-8') to turn it into a string
    print(res.read().decode('utf-8'))
    

    (6)The Request object

    from urllib import request,parse
    from pprint import pprint
    
    url = "https://httpbin.org/post"
    headers = {
        'User-Agent':'hello world',
        'Host':'httpbin.org'
    }
    dict = {
        'name':'Tom',
    }
    
    data = bytes(parse.urlencode(dict),encoding='utf8')
    req = request.Request(url=url,data=data,headers=headers,method='POST')
    res = request.urlopen(req)
    pprint(res.read().decode('utf-8'))
    

    3.Advanced usage of urllib

    (1)Proxies

    import urllib.request
    
    proxy_handler = urllib.request.ProxyHandler({
        'http':'http://127.0.0.1:9743'
    })
    
    opener = urllib.request.build_opener(proxy_handler)
    res = opener.open('https://www.taobao.com')
    print(res.read())
    

    (2)Cookies

    1.Getting cookies

    import http.cookiejar
    import urllib.request
    
    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    for item in cookie:
        print(item.name+"="+item.value)
    

    2.Saving cookies to a text file

    Format 1:

    import http.cookiejar, urllib.request
    filename = "cookie.txt"
    cookie = http.cookiejar.MozillaCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)
    

    Format 2:

    import http.cookiejar, urllib.request
    filename = 'cookie.txt'
    cookie = http.cookiejar.LWPCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)
    

    3.Using cookies from a file

    import http.cookiejar, urllib.request
    cookie = http.cookiejar.LWPCookieJar()
    cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    print(response.read().decode('utf-8'))
    

    (3)Exception handling

    1.Example 1: URLError

    from urllib import request
    from urllib import error
    
    try:
        request.urlopen("http://httpbin.org/xss")
    except error.URLError as e:
        print(e.reason)
    

    2.Example 2: HTTPError

    from urllib import request, error
    
    try:
        response = request.urlopen('http://httpbin.org/xss')
    except error.HTTPError as e:
        print(e.reason, e.code, e.headers, sep='\n')
    except error.URLError as e:
        print(e.reason)
    else:
        print('Request Successfully')
    

    3.Example 3: checking the exception type

    import socket
    import urllib.request
    import urllib.error
    
    try:
        response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
    except urllib.error.URLError as e:
        print(type(e.reason))
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
    

    (4)URL parsing utilities

    1.urlparse

    from urllib.parse import urlparse
    
    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
    print(type(result), result)
    

    2.urlunparse

    from urllib.parse import urlunparse
    
    data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
    print(urlunparse(data))
    

    3.urljoin

    from urllib.parse import urljoin
    
    print(urljoin('http://www.baidu.com', 'FAQ.html'))
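    # an extra illustration: a fully qualified second argument overrides the base
    print(urljoin('http://www.baidu.com', 'https://example.com/FAQ.html'))
    # a relative path is resolved against the base URL
    print(urljoin('http://www.baidu.com/about.html', 'FAQ.html'))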
    

    4.urlencode

    from urllib.parse import urlencode
    
    params = {
        'name': 'germey',
        'age': 22
    }
    base_url = 'http://www.baidu.com?'
    url = base_url + urlencode(params)
    print(url)
    

    0X04 The requests library

    1.What is requests

    requests is built on top of urllib3; it smooths out urllib's rather clunky API, so things like setting cookies or a proxy take only a few simple lines, which is very convenient.

    2.Basic usage of requests

    (1)Getting response information

    import requests
    
    res = requests.get("http://www.baidu.com")
    print(res.status_code)
    print(res.text)
    print(res.cookies)
    

    (2)The various request methods

    import requests
    
    requests.get("http://httpbin.org/get")
    requests.post("http://httpbin.org/post")
    requests.put("http://httpbin.org/put")
    requests.head("http://httpbin.org/get")
    requests.delete("http://httpbin.org/delete")
    requests.options("http://httpbin.org/get")
    

    (3)GET requests with parameters

    import requests
    
    params = {
        'id':1,
        'user':'Tom',
        'pass':'123456'
    }
    
    res = requests.get('http://httpbin.org/get',params = params )
    print(res.text)
    

    (4)Parsing JSON

    import requests
    
    res = requests.get("http://httpbin.org/get")
    print(res.json())
    

    (5)Getting binary data

    import requests
    
    response = requests.get("https://github.com/favicon.ico")
    with open('favicon.ico', 'wb') as f:
        f.write(response.content)
    

    (6)Adding headers

    import requests
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    response = requests.get("https://www.zhihu.com/explore", headers=headers)
    print(response.text)
    

    (7)POST requests

    import requests
    
    data = {
        'id':1,
        'user':'Tom',
        'pass':'123456',
    }
    
    res = requests.post('http://httpbin.org/post',data=data)
    print(res.text)
    

    (8)Response attributes

    import requests
    
    data = {
        'id':1,
        'user':'Tom',
        'pass':'123456',
    }
    
    res = requests.post('http://httpbin.org/post',data=data)
    print(res.text)
    print(res.status_code)
    print(res.headers)
    print(res.cookies)
    print(res.history)
    print(res.url)
    

    (9)Response status codes

    Each status code maps to one or more names in requests.codes, so we can simply compare against the name.

    100: ('continue',),
    101: ('switching_protocols',),
    102: ('processing',),
    103: ('checkpoint',),
    122: ('uri_too_long', 'request_uri_too_long'),
    200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
    201: ('created',),
    202: ('accepted',),
    203: ('non_authoritative_info', 'non_authoritative_information'),
    204: ('no_content',),
    205: ('reset_content', 'reset'),
    206: ('partial_content', 'partial'),
    207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
    208: ('already_reported',),
    226: ('im_used',),
    
    # Redirection.
    300: ('multiple_choices',),
    301: ('moved_permanently', 'moved', '\\o-'),
    302: ('found',),
    303: ('see_other', 'other'),
    304: ('not_modified',),
    305: ('use_proxy',),
    306: ('switch_proxy',),
    307: ('temporary_redirect', 'temporary_moved', 'temporary'),
    308: ('permanent_redirect',
          'resume_incomplete', 'resume',), # These 2 to be removed in 3.0
    
    # Client Error.
    400: ('bad_request', 'bad'),
    401: ('unauthorized',),
    402: ('payment_required', 'payment'),
    403: ('forbidden',),
    404: ('not_found', '-o-'),
    405: ('method_not_allowed', 'not_allowed'),
    406: ('not_acceptable',),
    407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
    408: ('request_timeout', 'timeout'),
    409: ('conflict',),
    410: ('gone',),
    411: ('length_required',),
    412: ('precondition_failed', 'precondition'),
    413: ('request_entity_too_large',),
    414: ('request_uri_too_large',),
    415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
    416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
    417: ('expectation_failed',),
    418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
    421: ('misdirected_request',),
    422: ('unprocessable_entity', 'unprocessable'),
    423: ('locked',),
    424: ('failed_dependency', 'dependency'),
    425: ('unordered_collection', 'unordered'),
    426: ('upgrade_required', 'upgrade'),
    428: ('precondition_required', 'precondition'),
    429: ('too_many_requests', 'too_many'),
    431: ('header_fields_too_large', 'fields_too_large'),
    444: ('no_response', 'none'),
    449: ('retry_with', 'retry'),
    450: ('blocked_by_windows_parental_controls', 'parental_controls'),
    451: ('unavailable_for_legal_reasons', 'legal_reasons'),
    499: ('client_closed_request',),
    
    # Server Error.
    500: ('internal_server_error', 'server_error', '/o\\', '✗'),
    501: ('not_implemented',),
    502: ('bad_gateway',),
    503: ('service_unavailable', 'unavailable'),
    504: ('gateway_timeout',),
    505: ('http_version_not_supported', 'http_version'),
    506: ('variant_also_negotiates',),
    507: ('insufficient_storage',),
    509: ('bandwidth_limit_exceeded', 'bandwidth'),
    510: ('not_extended',),
    511: ('network_authentication_required', 'network_auth', 'network_authentication'),
    

    Example code:

    import requests
    
    response = requests.get('http://www.jianshu.com/hello.html')
    exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')
    

    3.Advanced usage of requests

    (1)File upload

    import requests
    
    files = {'file':open('E:\\1.png','rb')}
    
    res= requests.post('http://httpbin.org/post',files=files)
    print(res.text)
    

    (2)Getting cookies

    import requests
    
    res = requests.get("http://www.baidu.com")
    for key,value in res.cookies.items():
        print(key + "=" + value)
    

    (3)Session persistence

    This usage is very important: it is indispensable when simulating logins and also comes up constantly when writing CTF scripts, so it deserves a slightly more detailed explanation.

    One thing to be clear about when using requests.get: every requests.get call is effectively a freshly opened browser, so a cookie set in one requests.get is not carried over to the next request. Consider the example below.

    Example code:

    import requests
    
    #set a cookie here
    requests.get('http://httpbin.org/cookies/set/number/123456789')
    #make another request to see whether the cookie we just set is carried along
    res = requests.get('http://httpbin.org/cookies')
    print(res.text)
    

    Result:

    {
      "cookies": {}
    }
    

    As analyzed above, the cookie set in the first request did not take effect in the second one. What can we do? The Session() object solves exactly this problem.

    Example code:

    import requests
    
    s = requests.Session()
    s.get('http://httpbin.org/cookies/set/number/123456789')
    res = s.get('http://httpbin.org/cookies')
    print(res.text)
    

    Result:

    {
      "cookies": {
        "number": "123456789"
      }
    }
    

    (4)Certificate verification

    When visiting an https site, the browser first validates the site's certificate; if it is not issued by a trusted authority, a warning page is shown instead of the site. For a crawler this becomes an exception, so if we want the crawler to ignore certificate problems and keep going, we have to configure it.

    1.Ignoring certificate verification

    import requests
    
    response = requests.get('https://www.heimidy.cc/',verify=False)
    print(response.status_code)
    

    This still produces a warning, which we can silence by importing urllib3 and calling its disable_warnings method.

    import requests
    from requests.packages import urllib3
    
    urllib3.disable_warnings()
    response = requests.get('https://www.heimidy.cc/',verify=False)
    print(response.status_code)
    

    2.Verifying with a manually specified local certificate

    import requests
    
    response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
    print(response.status_code)
    

    (5)Proxy settings

    Besides the usual http and https proxies, we can also use SOCKS proxies, which requires pip-installing the requests[socks] extra.

    import requests
    
    proxies = {
        "http":"http://127.0.0.1:1080",
        "https":"https://127.0.0.1:1080"
    }
    
    res = requests.get("https://www.google.com",proxies=proxies)
    print(res.status_code)
    

    One open question: using a SOCKS proxy to reach Google failed for me and raised an error.

    Example code:

    import requests
    
    proxies = {
        "http":"socks5://127.0.0.1:1080",
        "https":"socks5://127.0.0.1:1080"
    }
    
    res = requests.get("https://www.google.com",proxies=proxies,verify=False)
    print(res.status_code)
    

    Result:

    SSLError: SOCKSHTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
    

    I tried a few approaches without success; this is left for future investigation.

    (6)Timeout settings

    import requests
    from requests.exceptions import ReadTimeout
    try:
        response = requests.get("http://httpbin.org/get", timeout = 0.5)
        print(response.status_code)
    except ReadTimeout:
        print('Timeout')
    

    (7)Basic authentication

    Example 1:

    import requests
    from requests.auth import HTTPBasicAuth
    
    r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
    print(r.status_code)
    

    Example 2:

    import requests
    
    r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
    print(r.status_code)
    

    (8)Exception handling

    import requests
    from requests.exceptions import ReadTimeout, ConnectionError, RequestException
    try:
        response = requests.get("http://httpbin.org/get", timeout = 0.5)
        print(response.status_code)
    except ReadTimeout:
        print('Timeout')
    except ConnectionError:
        print('Connection error')
    except RequestException:
        print('Error')
    

    0X05 Regular expressions

    1.What are regular expressions

    A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a pattern string that expresses a filtering rule to apply to other strings. In Python this is provided by the re library.

    2.Common matching patterns

    (image in the original post: a table of common regex matching patterns)
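
    As a quick stand-in for that table, here is a small snippet exercising the patterns used most often in this section (outputs shown in the comments):

    import re

    s = 'Hello 123 world_2019'
    print(re.findall(r'\d+', s))   # \d matches digits          -> ['123', '2019']
    print(re.findall(r'\w+', s))   # \w matches word characters -> ['Hello', '123', 'world_2019']
    print(re.findall(r'\s', s))    # \s matches whitespace      -> [' ', ' ']
    print(re.match(r'^Hello.*\d$', s).group())  # ^ and $ anchor, . is any character, * repeats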

    3.re.match

    re.match(pattern, string, flags=0)
    

    (1)Plain matching

    The span() method returns the range of the match, and group() returns the matched text.

    Example code:

    import re
    
    content = 'Hello 123 4567 World_This is a Regex Demo'
    res = re.match('^\w{5}\s\d{3}\s\d{4}\s\w{10}.*Demo$',content)
    print(res.span())
    print(res.group())
    

    (2)Generic matching

    import re
    
    content = 'Hello 123 4567 World_This is a Regex Demo'
    res = re.match('^Hello.*Demo$',content)
    print(res.span())
    print(res.group())
    

    (3)Capturing specific content

    To extract a specific part of the match, wrap it in parentheses to form a capture group.

    import re
    
    content = 'Hello 1234567 World_This is a Regex Demo'
    res = re.match('^Hello\s(\d+)\s.*Demo$',content)
    print(res.span(1))
    print(res.group(1))
    

    (4)Greedy and non-greedy matching

    Greedy matching means that .* consumes as many characters as possible; consider the following example.

    Example code:

    import re
    
    content = 'Hello 1234567 World_This is a Regex Demo'
    res = re.match('^He.*(\d+).*Demo$',content)
    print(res.span(1))
    print(res.group(1))
    

    Result:

    (12, 13)
    7
    

    We meant to match 1234567, but we only captured 7, because the greedy .* swallowed 123456 first. To solve this, add ? after .* to make it non-greedy.

    Example code:

    import re
    
    content = 'Hello 1234567 World_This is a Regex Demo'
    res = re.match('^He.*?(\d+).*Demo$',content)
    print(res.span(1))
    print(res.group(1))
    

    Result:

    (6, 13)
    1234567
    

    (5)Match flags

    Flags control details of matching, such as whether it is case-sensitive and whether newlines can be matched.

    Example code:

    import re
    
    content = '''Hello 1234567 World_This is 
    
    a Regex Demo'''
    
    res = re.match('^He.*?(\d+).*Demo$',content,re.S)
    print(res.span(1))
    print(res.group(1))
    

    Result:

    (6, 13)
    1234567
    

    Notice that .* normally cannot match a newline, but with the re.S flag it matches as expected.

    (6)Escaping special characters

    If the string to be matched contains characters that are special in regular expressions, they must be escaped.

    import re
    
    content = 'price is $5.00'
    res = re.match('price is \$5\.00', content)
    print(res.group())
    

    4.re.search

    re.match has the drawback that it only matches from the beginning of the string: if the pattern does not match the very first characters, nothing in the middle can be matched. So we have another weapon, re.search, which scans the whole string and returns the first successful match.

    Example code:

    import re
    
    content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
    res = re.search('Hello.*?(\d+).*?Demo', content)
    print(res.group(1))
    

    Result:

    1234567

    Because this behavior greatly reduces the difficulty of writing regexes, prefer search over match whenever search will do.

    Matching exercise:

    Example code:

    import re
    
    html = '''<div id="songs-list">
        <h2 class="title">经典老歌</h2>
        <p class="introduction">
            经典老歌列表
        </p>
        <ul id="list" class="list-group">
            <li data-view="2">一路上有你</li>
            <li data-view="7">
                <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
            </li>
            <li data-view="4" class="active">
                <a href="/3.mp3" singer="齐秦">往事随风</a>
            </li>
            <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
            <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
            <li data-view="5">
                <a href="/6.mp3" singer="邓丽君"><i class="fa fa-user"></i>但愿人长久</a>
            </li>
        </ul>
    </div>'''
    
    res = re.search('<li.*?/2\.mp3.*?singer="(.*?)">(.*?)</a>',html,re.S)
    print(res.group(1),res.group(2))
    

    Result:

    任贤齐 沧海一声笑
    

    5.re.findall

    Unlike the previous two, re.findall searches the whole string and returns every matching substring as a list.

    Matching exercise 1:

    Example code:

    import re
    
    html = '''<div id="songs-list">
        <h2 class="title">经典老歌</h2>
        <p class="introduction">
            经典老歌列表
        </p>
        <ul id="list" class="list-group">
            <li data-view="2">一路上有你</li>
            <li data-view="7">
                <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
            </li>
            <li data-view="4" class="active">
                <a href="/3.mp3" singer="齐秦">往事随风</a>
            </li>
            <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
            <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
            <li data-view="5">
                <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
            </li>
        </ul>
    </div>'''
    
    res = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>',html,re.S)
    #print(res)
    for i in res:
        print(i[0],i[1],i[2])
    

    Result:

    [('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'beyond', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '但愿人长久')]
    /2.mp3 任贤齐 沧海一声笑
    /3.mp3 齐秦 往事随风
    /4.mp3 beyond 光辉岁月
    /5.mp3 陈慧琳 记事本
    /6.mp3 邓丽君 但愿人长久
    

    Matching exercise 2:

    Example code:

    import re
    
    html = '''<div id="songs-list">
        <h2 class="title">经典老歌</h2>
        <p class="introduction">
            经典老歌列表
        </p>
        <ul id="list" class="list-group">
            <li data-view="2">一路上有你</li>
            <li data-view="7">
                <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
            </li>
            <li data-view="4" class="active">
                <a href="/3.mp3" singer="齐秦">往事随风</a>
            </li>
            <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
            <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
            <li data-view="5">
                <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
            </li>
        </ul>
    </div>'''
    
    res = re.findall('<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>',html,re.S)
    for i in res:
        print(i[0],i[1],i[2])
    

    Result:

    一路上有你 
    <a href="/2.mp3" singer="任贤齐"> 沧海一声笑 </a>
    <a href="/3.mp3" singer="齐秦"> 往事随风 </a>
    <a href="/4.mp3" singer="beyond"> 光辉岁月 </a>
    <a href="/5.mp3" singer="陈慧琳"> 记事本 </a>
    <a href="/6.mp3" singer="邓丽君"> 但愿人长久 </a>
    

    6.re.sub

    This method replaces every matching substring and returns the resulting string.

    Example 1:

    import re
    
    content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
    res = re.sub('\d+','K0rz3n',content)
    print(res)
    

    Result:

    Extra stings Hello K0rz3n World_This is a Regex Demo Extra stings
    

    Sometimes the replacement needs to keep the original matched text; for that we use a backreference.

    Example 2:

    import re
    
    content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
    content = re.sub('(\d+)', '\\1 8910', content)
    print(content)
    

    Result:

    Extra stings Hello 1234567 8910 World_This is a Regex Demo Extra stings
    

    7.re.compile

    This method compiles a regular expression into a pattern object, which makes it easy to reuse later.

    Example code:

    import re
    
    content = '''Hello 1234567 World_This
    is a Regex Demo'''
    pattern = re.compile('Hello.*Demo', re.S)
    res = re.match(pattern, content)
    print(res.group(0))
    

    8.Practice: crawling Douban Books

    import requests
    import re
    content = requests.get('http://book.douban.com/').text
    pattern = re.compile('<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>', re.S)
    results = re.findall(pattern, content)
    for result in results:
        url, name, author, date = result
        author = re.sub('\s', '', author)
        date = re.sub('\s', '', date)
        print(url, name, author, date)
    

    0X06 BeautifulSoup

    1.What is BeautifulSoup

    It is a convenient page-parsing library that extracts information from pages without our having to write regular expressions.

    2.Parsers commonly used with it

    (image in the original post: a table of the parsers BeautifulSoup can work with)
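
    In short, BeautifulSoup can be paired with several parsers; a minimal sketch (html5lib needs a separate pip install):

    from bs4 import BeautifulSoup

    html = '<html><body><p>hello</p></body></html>'
    print(BeautifulSoup(html, 'html.parser').p.string)  # Python's built-in parser, no extra dependency
    print(BeautifulSoup(html, 'lxml').p.string)         # lxml parser, fast; used throughout this tutorial
    print(BeautifulSoup(html, 'html5lib').p.string)     # html5lib, parses the way a browser does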

    3.Basic usage

    Example code:

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.prettify())
    print(soup.title.string)
    

    Result:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    The Dormouse's story
    

    4.Tag selectors

    (1)Selecting elements

    Select with soup.<tag> (a dot followed by the tag name); if several tags match, only the first one is returned.

    Example code:

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.head)
    print(soup.title)
    print(soup.p)
    

    Result:

    <head><title>The Dormouse's story</title></head>
    <title>The Dormouse's story</title>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    

    (2)Getting attributes

    Example code:

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.attrs['name'])
    print(soup.p['name'])
    

    Result:

    dromouse
    dromouse
    

    (3)Getting content

    Example code:

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p clss="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.string)
    

    Result:

    The Dormouse's story
    

    (4)Nested selection

    Example code:

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title.string)
    

    Result:

    The Dormouse's story
    

    (5)Getting child and descendant nodes

    1.contents

    contents returns the tag's child nodes as a list.

    Example code:

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)
    

    Result:

    ['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
    

    2.children

    children returns the child nodes as an iterator.

    Example code:

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)
    

    Result:

    <list_iterator object at 0x1064f7dd8>
    0 
                Once upon a time there were three little sisters; and their names were
    
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 
    
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4  
                and
    
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 
                and they lived at the bottom of a well.
    

    3.descendants

    descendants returns the descendant nodes; unlike children above, it also walks down into grandchildren and deeper.

    Example code:

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)
    

    Result:

    <generator object descendants at 0x10650e678>
    0 
                Once upon a time there were three little sisters; and their names were
    
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 
    
    3 <span>Elsie</span>
    4 Elsie
    5 
    
    6 
    
    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9  
                and
    
    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12 
                and they lived at the bottom of a well.
    

    (6)Parent and ancestor nodes

    1.parent

    Example code:

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)
    

    Result:

    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
     and they lived at the bottom of a well.
    </p>
    

    2.parents

    parents yields all the ancestor nodes (the example wraps it in list() to print them).

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.parents)))
    

    (7)Sibling nodes

    Example code:

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.next_siblings)))
    print(list(enumerate(soup.a.previous_siblings)))
    

    Result:

    [(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
    [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]
    

    5.Standard selectors

    The tag selectors above are fast, but what they select is fairly coarse and rarely enough for real-world needs, so we need the more powerful selectors below.

    find_all( name , attrs , recursive , text , **kwargs )
    

    Documents can be searched by tag name, attributes, or text.

    (1) name

    Example 1:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    soup.find_all('ul')
    

    Result:

    [<ul class="list" id="list-1">
     <li class="element">Foo</li>
     <li class="element">Bar</li>
     <li class="element">Jay</li>
     </ul>, <ul class="list list-small" id="list-2">
     <li class="element">Foo</li>
     <li class="element">Bar</li>
     </ul>]
    

    To reach the tags nested further inside, we can call find_all() again on each ul tag we just obtained.

    Example 2:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    for i in soup.find_all('ul'):
        print(i.find_all('li'))
    

    Result:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    

    (2)attrs

    Pass the attribute key/value pairs you want to match and the elements are located.

    Example 1:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.find_all(attrs={'id':'list-1'}))
    print(soup.find_all(attrs={'name':'elements'}))
    

    Result:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    

    Alternatively, if that feels cumbersome, we can simply pass attribute=value keyword arguments to do the same thing.

    Example 2:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.find_all(id = 'list-1'))
    print(soup.find_all(class_ = 'panel-heading'))
    

    Result:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    

    Note:

    class is a Python keyword, so we cannot use class directly as the argument name without causing a conflict; we write class_ instead.

    (3)text

    Example code:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))
    

    Result:

    ['Foo', 'Foo']
    

    (4)Other

    find( name , attrs , recursive , text , **kwargs )

    find_all() returns all matching elements, while find() returns a single element (see the sketch after this list).

    find_parents() find_parent()

    find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

    find_next_siblings() find_next_sibling()

    find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.

    find_previous_siblings() find_previous_sibling()

    find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.

    find_all_next() find_next()

    find_all_next() returns all matching nodes after the current node; find_next() returns the first such node.

    find_all_previous() and find_previous()

    find_all_previous() returns all matching nodes before the current node; find_previous() returns the first such node.
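
    A minimal sketch of find() on the same kind of markup used above:

    html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # find() returns only the first match (or None), unlike find_all()
    print(soup.find('li'))                           # <li class="element">Foo</li>
    print(soup.find('li', class_='element').string)  # Foo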

    6.CSS selectors

    (1)Basic usage

    Pass a CSS selector directly to select() to make a selection.

    Example 1:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    print(soup.select('.panel-heading'))
    print(soup.select('#list-1'))
    print(soup.select('li'))
    

    Result:

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    

    Example 2:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul.select('li'))
    

    Result:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    

    (2)Getting attributes

    Example code:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])
    

    Result:

    list-1
    list-1
    list-2
    list-2
    

    (3)Getting content

    Example code:

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())
    

    Result:

    Foo
    Bar
    Jay
    Foo
    Bar
    

    0X07 PyQuery

    PyQuery is another fairly powerful page-parsing library. Its syntax is carried over directly from jQuery, so it is an excellent choice for developers already familiar with jQuery.

    1.Initialization

    (1)Initializing from a string

    Example code:

    html = '''
    <div>
        <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
    '''
    from pyquery import PyQuery as pq
    
    doc = pq(html)
    print(doc('li'))
    

    Result:

    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    

    (2)Initializing from a URL

    Example code:

    from pyquery import PyQuery as pq
    
    doc = pq(url='http://www.baidu.com')
    print(doc('head'))
    

    Result:

    <head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>
    

    (3)Initializing from a file

    Example code:

    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    print(doc('li'))
    

    2.Basic CSS selectors

    Example code:

    html = '''
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('#container .list li'))
    

    Result:

    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    

    3.Finding elements

    (1)Child elements

    Example 1:

    html = '''
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
    '''
    
    from pyquery import PyQuery as pq
    
    doc = pq(html)
    li = doc('.list').find('li')
    print(li)
    

    Result:

    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    

    Besides the find method, we can also use the children method.

    Example 2:

    html = '''
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
    '''
    
    from pyquery import PyQuery as pq
    
    doc = pq(html)
    items = doc('.list')
    lis = items.children('.active')
    print(lis)
    

    Result:

    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    

    (2)Parent elements

    Example 1:

    html = '''
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
    '''
    
    from pyquery import PyQuery as pq
    
    doc = pq(html)
    items = doc('.list')
    container = items.parent()
    print(container)
    

    Result:

    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
    

    parent() returns the direct parent node, while parents() returns all the ancestor nodes.

    Example 2:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    parents = items.parents('.wrap')
    print(parents)
    

    Result:

    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    

    (3)Sibling nodes

    Example code:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.list .item-0.active')
    print(li.siblings('.active'))
    

    Result:

    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    

    4.Iteration

    Example code:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    
    from pyquery import PyQuery as pq
    
    doc = pq(html)
    lis = doc('li').items()
    for i in lis:
        print(i)
    

    Result:

    <li class="item-0">first item</li>
    
    <li class="item-1"><a href="link2.html">second item</a></li>
    
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    

    5.Getting information

    (1)Getting attributes

    Example code:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    
    from pyquery import PyQuery as pq
    
    doc = pq(html)
    a = doc('.list .item-0.active a')
    print(a.attr.href)
    print(a.attr('href'))
    

    Result:

    link3.html
    link3.html
    

    (2)Getting text

    Example code:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a.text())
    

    Result:

    third item
    

    (3)Getting HTML

    Example code:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li.html())
    

    Result:

    <a href="link3.html"><span class="bold">third item</span></a>
    

    6.DOM manipulation

    (1)addClass, removeClass

    Example code:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.removeClass('active')
    print(li)
    li.addClass('active')
    print(li)
    

    Result:

    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
    
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    

    (2)attr, css

    Example code:

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                 <li class="item-0">first item</li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a></li>
             </ul>
         </div>
     </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.attr('name','link')
    print(li)
    li.css('font-size','14px')
    print(li)
    

    Result:

    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    
    <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
    
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
    

    (3)remove

    Example code:

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
     </div>
    '''
    
    from pyquery import PyQuery as pq
    
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    wrap.find('p').remove()
    print(wrap.text())
    

    Result:

    Hello, World
    This is a paragraph.
    Hello, World
    

    (4)Other
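
    pyquery also mirrors most of jQuery's other DOM methods. As a minimal sketch (assuming the jQuery-style append()), appending content to a selected node:

    html = '<div class="wrap"><p>Hello</p></div>'
    from pyquery import PyQuery as pq

    doc = pq(html)
    p = doc('p')
    p.append('<span> World</span>')  # insert content at the end of the node, jQuery-style
    print(p)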

    Source: https://www.haomeiwen.com/subject/bfqhoqtx.html