Python in Action - Lesson 4: How to Fetch Dynamic Data from a Page

Author: 辉叔不太萌 | Published: 2016-11-01 22:48

    Notes

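    • Watch the network traffic while the page loads its dynamic data, find the pattern in the Requests that load more data, then construct matching Requests yourself to get the Responses.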
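
As a minimal sketch of the idea in the note above: once the browser's network panel shows that each "load more" Request differs only in a page number, the Request URLs can be generated directly. The `page` query parameter is taken from the URL used in the assignment code; the helper name `build_page_urls` is hypothetical and exists only for illustration.

```python
def build_page_urls(base_url, first_page, last_page, param="page"):
    """Return one URL per page number, from first_page to last_page inclusive."""
    return [
        "{}?{}={}".format(base_url, param, page)
        for page in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    # Print the URLs that would be requested for pages 1 to 3.
    for url in build_page_urls("http://weheartit.com/inspirations/taylorswift", 1, 3):
        print(url)
```

Each generated URL can then be passed to a function that fetches and parses a single page, so that changing the page range is the only edit needed to crawl more data.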

    Assignment

    • Code:
    import os
    import urllib.request

    import requests
    from bs4 import BeautifulSoup

    # Only page 1 is requested here; widen the range to fetch more pages.
    urls = ['http://weheartit.com/inspirations/taylorswift?page={}'.format(i) for i in range(1, 2)]
    # Proxies tried during the exercise (currently unused):
    # proxies = {"http": "122.96.59.99:3128"}
    # proxies = {"http": "121.69.29.162:8118"}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
    }
    base_path = 'F:\\workspace-python\\hw_02\\img_dl'


    def download_img(img_url):
        # Name the file after the image id in the URL,
        # e.g. .../images/201685162/superthumb.jpg -> 201685162.jpg
        file_name = img_url.split("/")[-2] + "." + img_url.split(".")[-1]
        target = os.path.join(base_path, file_name)

        print('%s ==> %s' % (img_url, target))
        # The download itself is disabled because it fails with WinError 10013
        # (see the end of this post):
        # urllib.request.urlretrieve(img_url, target)


    def process_dynamic_page(url):
        # Fetch one page of results; print the status code and stop on failure.
        web_data = requests.get(url, headers=headers, timeout=10)
        if web_data.status_code != 200:
            print(web_data.status_code)
            return

        soup = BeautifulSoup(web_data.text, 'lxml')

        # Every thumbnail on the page is an <img class="entry-thumbnail">.
        images = soup.select('div > div > div > a > img[class="entry-thumbnail"]')
        for image in images:
            img_url = image.get('src')
            download_img(img_url)


    for url in urls:
        process_dynamic_page(url)
    
    
    • Output (partial):
    "D:\Program Files\Python35\python.exe" F:/workspace-python/hw_02/hw_04.py
    http://data.whicdn.com/images/201685162/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\201685162.jpg
    http://data.whicdn.com/images/261819708/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\261819708.jpg
    http://data.whicdn.com/images/262877209/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\262877209.jpg
    http://data.whicdn.com/images/225569474/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\225569474.jpg
    http://data.whicdn.com/images/264736360/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264736360.jpg
    http://data.whicdn.com/images/262204064/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\262204064.jpg
    http://data.whicdn.com/images/254688840/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\254688840.jpg
    http://data.whicdn.com/images/258279435/superthumb.png ==> F:\workspace-python\hw_02\img_dl\258279435.png
    http://data.whicdn.com/images/261497975/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\261497975.jpg
    http://data.whicdn.com/images/264710374/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264710374.jpg
    http://data.whicdn.com/images/264713023/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264713023.jpg
    http://data.whicdn.com/images/264706335/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264706335.jpg
    http://data.whicdn.com/images/264721633/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264721633.jpg
    http://data.whicdn.com/images/264721658/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264721658.jpg
    http://data.whicdn.com/images/264721683/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264721683.jpg
    http://data.whicdn.com/images/206651826/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\206651826.jpg
    http://data.whicdn.com/images/264711782/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264711782.jpg
    http://data.whicdn.com/images/264715635/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264715635.jpg
    http://data.whicdn.com/images/264710414/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264710414.jpg
    http://data.whicdn.com/images/264697940/superthumb.png ==> F:\workspace-python\hw_02\img_dl\264697940.png
    http://data.whicdn.com/images/264697906/superthumb.gif ==> F:\workspace-python\hw_02\img_dl\264697906.gif
    http://data.whicdn.com/images/264705727/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264705727.jpg
    http://data.whicdn.com/images/264703283/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264703283.jpg
    http://data.whicdn.com/images/264703286/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264703286.jpg
    http://data.whicdn.com/images/261104252/superthumb.gif ==> F:\workspace-python\hw_02\img_dl\261104252.gif
    http://data.whicdn.com/images/264695862/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264695862.jpg
    http://data.whicdn.com/images/264695929/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264695929.jpg
    http://data.whicdn.com/images/264695960/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264695960.jpg
    http://data.whicdn.com/images/173728739/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\173728739.jpg
    http://data.whicdn.com/images/197006986/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\197006986.jpg
    http://data.whicdn.com/images/264674428/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264674428.jpg
    http://data.whicdn.com/images/264579949/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264579949.jpg
    http://data.whicdn.com/images/264631087/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264631087.jpg
    http://data.whicdn.com/images/264644105/superthumb.png ==> F:\workspace-python\hw_02\img_dl\264644105.png
    http://data.whicdn.com/images/264628123/superthumb.png ==> F:\workspace-python\hw_02\img_dl\264628123.png
    http://data.whicdn.com/images/264634842/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264634842.jpg
    http://data.whicdn.com/images/259844486/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\259844486.jpg
    
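    • Open issue:
    • Downloading the images fails with “urllib.error.URLError: <urlopen error [WinError 10013] 以一种访问权限不允许的方式做了一个访问套接字的尝试。>” (in English: "An attempt was made to access a socket in a way forbidden by its access permissions"). See the discussion thread: http://study.163.com/forum/detail/1002726062.htm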
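
The sketch below shows one way the disabled download step could be written so that a failure on one image is reported and skipped instead of aborting the run. It is not a verified fix for WinError 10013, which may come from the local environment rather than from the code. It only adds a User-Agent header, a timeout, and error handling around the standard-library `urllib.request.urlopen`. The names `DEFAULT_HEADERS` and `file_name_from_url` and the `target_dir` parameter are assumptions introduced for illustration.

```python
import os
import shutil
import urllib.error
import urllib.request

# Hypothetical default header for illustration; adjust to your environment.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def file_name_from_url(img_url):
    """Build '<image id>.<extension>' the same way the assignment code does."""
    return img_url.split("/")[-2] + "." + img_url.split(".")[-1]


def download_img(img_url, target_dir, timeout=10):
    """Save img_url into target_dir; return the saved path, or None on failure."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, file_name_from_url(img_url))
    request = urllib.request.Request(img_url, headers=DEFAULT_HEADERS)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response, \
                open(target, "wb") as out_file:
            # Stream the response body to disk.
            shutil.copyfileobj(response, out_file)
    except (urllib.error.URLError, OSError) as exc:
        # Report the failure and let the caller continue with the next image.
        print("failed: %s (%s)" % (img_url, exc))
        return None
    return target
```

With this version, a call such as `download_img(img_url, 'F:\\workspace-python\\hw_02\\img_dl')` returns the saved path on success and `None` on failure, so the loop over images can carry on either way.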
