Python Hands-on Plan study notes: week1_4 — scraping Taylor Swift photos

By luckywoo | Published 2016-06-29 21:08

    Day 3 of learning web scraping: downloading Taylor Swift photos.
    The code is as follows:

    #!/usr/bin/env python
    # coding:utf-8
    __author__ = 'lucky'
    from bs4 import BeautifulSoup
    import requests
    import urllib.request
    import time

    # 20 pages of the Taylor Swift inspiration feed
    urls = ['http://weheartit.com/inspirations/taylorswift?scrolling=true&page={}'.format(number) for number in range(1, 21)]
    
    header = { 'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',   
               'Cookie':'locale=zh-cn; __whiAnonymousID=cedf556d59434a78a518b36279b59bd4; auth=no; _session=06742c2ee80e676adfa76366d2b522ed; _ga=GA1.2.1879005139.1467165244; _weheartit_anonymous_session=%7B%22page_views%22%3A1%2C%22search_count%22%3A0%2C%22last_searches%22%3A%5B%5D%2C%22last_page_view_at%22%3A1467202282156%7D'}
    
    img_links = []

    def get_links(url, data=None):
        wb_data = requests.get(url, headers=header)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        imgs = soup.select('body > div > div > div > a > img')
        if data is None:
            for img in imgs:
                img_links.append(img.get('src'))
    
    for url in urls:
        get_links(url)  # collect the image links on this page
        time.sleep(2)   # pause between requests to go easy on the server
        print('OK')
    
    i = 0  # images are saved as i.jpg
    folder_path = '/Users/lucky/Life/pic/'
    
    for img in img_links:
        urllib.request.urlretrieve(img, folder_path + str(i) + '.jpg')
        i += 1
        print('Done')
    
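The link-extraction step can be tried in isolation. The snippet below runs the same `select()` call against a small inline HTML fragment that only imitates the nesting the selector targets (the real weheartit markup may differ; the example URLs are made up):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page structure the CSS selector expects.
html = """
<body>
  <div><div><div>
    <a href="/entry/1"><img src="http://example.com/a.jpg"></a>
    <a href="/entry/2"><img src="http://example.com/b.jpg"></a>
  </div></div></div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
# '>' is the child combinator: each tag must be a direct child of the previous one.
links = [img.get("src") for img in soup.select("body > div > div > div > a > img")]
print(links)
```

Because `>` matches only direct children, the selector silently returns nothing if the site inserts an extra wrapper `<div>`, which is the most common way such scrapers break.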

    The images retrieved:

    pictures.png

    A single photo:

    2.jpg

    Summary:

    1. Got more practice wrapping the scraping logic in a function.
    2. Added a request header, including cookies, to masquerade as a browser when crawling the pages.
    3. Used urllib.request, specifically urllib.request.urlretrieve(), to download the images; under the hood this function writes the file with open(filename, 'wb').
    4. Got a better grasp of locating page elements with CSS selectors and of using the Chrome developer tools.
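One weakness of the download loop above is that a single bad link raises an exception and aborts the whole run. A sketch of a more forgiving version is below; the function name `download_images` is my own, and the demo uses a local `file://` URL as a stand-in for a real image link so it runs offline:

```python
import os
import tempfile
import urllib.request

def download_images(links, folder_path):
    """Save each URL in links to folder_path as <index>.jpg,
    skipping links that fail instead of aborting the run."""
    os.makedirs(folder_path, exist_ok=True)
    saved = []
    for i, link in enumerate(links):
        dest = os.path.join(folder_path, str(i) + ".jpg")
        try:
            urllib.request.urlretrieve(link, dest)
            saved.append(dest)
        except Exception as exc:
            print("skipped", link, "->", exc)
    return saved

# Demo: a temporary local file plays the role of a remote image.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "fake.jpg")
with open(src, "wb") as f:
    f.write(b"\xff\xd8fake-jpeg-bytes")

paths = download_images(["file://" + src], os.path.join(tmp, "out"))
print(paths)
```

`urlretrieve` accepts `file://` URLs because it goes through `urlopen`, which is what makes this offline demo possible; with real `http://` links the code path is identical.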

    Original link: https://www.haomeiwen.com/subject/umyxjttx.html