Notes on Writing a Python Crawler

Author: Wythe | Published 2017-05-18 18:39

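While building the Discover feature for my app WallSquare, I needed to download a few dozen photos. Downloading them one at a time was too slow, so I decided to write a crawler.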

Step one: analyze the page structure

Open https://unsplash.com/explore, view the page source, and look at how the tags are structured.

[Image: UnsplashTag.png]

The structure looks roughly like this:

[Image: PhotoTag.png]

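What we need from each item is the image URL and the image name.

The background image is set in the inline style of a div with the class _1mlK1, and the image name sits inside it in an h2 with the classes _3iawX _1WCyJ _3myVE.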

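With that, it was time to write code. Since I know a bit of Python syntax, I decided to write it in Python.

After some searching, I settled on urllib for networking and BeautifulSoup for HTML parsing. The code in this post targets Python 2: it uses urllib2 alongside urllib, the print statement, and the BeautifulSoup 3 import style.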

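Parse the page and get the image tags

```python
# Fetch the page and parse it with BeautifulSoup
html = urllib2.urlopen('https://unsplash.com/explore').read()
# print html
soup = BeautifulSoup(html)

# Every item we want is a div with the class _1mlK1
itemArray = soup.findAll('div', attrs={"class": "_1mlK1"})
```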

itemArray now holds all the image tags. Loop over it and pull the URL and the name out of each one.
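For readers who do not have BeautifulSoup installed, the same traversal can be sketched with the html.parser module from the Python 3 standard library. This is only an illustration under assumptions: the fragment `SAMPLE_HTML` is made up to imitate the structure described in this post (the real page markup may differ), and `CardParser` is a hypothetical helper that is not part of the original script.

```python
from html.parser import HTMLParser

# Hypothetical fragment imitating the structure described in this post;
# the real page markup may differ.
SAMPLE_HTML = """
<div class="_1mlK1" style="background-image: url(&quot;https://example.com/a.jpg&quot;)">
  <h2 class="_3iawX _1WCyJ _3myVE">Nature</h2>
</div>
<div class="other">ignored</div>
<div class="_1mlK1" style="background-image: url(&quot;https://example.com/b.jpg&quot;)">
  <h2 class="_3iawX _1WCyJ _3myVE">City Life</h2>
</div>
"""


class CardParser(HTMLParser):
    """Collect the style and title of every div with class _1mlK1."""

    def __init__(self):
        super().__init__()
        self.cards = []          # one dict per matching div
        self._in_title = False   # True while inside a title h2

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "_1mlK1" in classes:
            self.cards.append({"style": attrs.get("style") or "", "title": ""})
        elif tag == "h2" and self.cards and "_3iawX" in classes:
            # Simplification: the h2 is attributed to the most recent card
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.cards[-1]["title"] += data.strip()


parser = CardParser()
parser.feed(SAMPLE_HTML)
for card in parser.cards:
    print(card["title"], "->", card["style"])
```

The parser decodes `&quot;` in attribute values, so each collected style string contains a quoted `url("...")` value, which is the form the regular-expression step expects.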

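Get the URL

```python
# Read the inline style that carries the background image
style = item["style"]
# Use a regular expression to cut out the address inside url(...)
url = re.findall(r'url\((.*?)\)', style)[0]
# Drop the double quotes that surround the address
imageURL = url.strip('"')
```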
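The regular-expression step can be tried on its own. The sketch below is written for Python 3 and wraps it in a hypothetical helper, `extract_background_url`, that returns None when the style has no `url(...)` instead of failing on the `[0]` index. The style strings are made up for illustration.

```python
import re

# Non-greedy match of whatever sits between "url(" and the next ")"
URL_PATTERN = re.compile(r'url\((.*?)\)')


def extract_background_url(style):
    """Return the address inside url(...) of an inline style, or None."""
    match = URL_PATTERN.search(style)
    if match is None:
        return None
    # The value may be wrapped in double quotes, single quotes, or nothing
    return match.group(1).strip('"\'')


# Made-up style strings for illustration
print(extract_background_url('background-image: url("https://example.com/a.jpg")'))
print(extract_background_url("background-image:url('https://example.com/b.jpg')"))
print(extract_background_url('color: red'))
```

Because the pattern stops at the first closing parenthesis, an address that itself contains `)` would be cut short. That is acceptable for a small script like this one, but worth knowing.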

Get the name

```python
# The collection name is the text of the h2 inside the item
titleDiv = item.findAll('h2', attrs={"class": "_3iawX _1WCyJ _3myVE"})[0]
imageName = titleDiv.text
```

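Finally, download and save

```python
# Download the photo and store it under the collection name.
# The UnsplashExplore directory must already exist; urlretrieve does not create it.
filesavepath = './UnsplashExplore/%s.jpg' % imageName
urllib.urlretrieve(imageURL, filesavepath)
```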
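Two things can go wrong when saving: the target directory may not exist yet, and a name taken from the page may contain characters that are not valid in a file name. The following Python 3 sketch shows one way to handle both. `build_save_path` is a hypothetical helper, and the set of replaced characters is an assumption chosen to be safe on common file systems.

```python
import os
import re
import tempfile


def build_save_path(directory, name, extension=".jpg"):
    """Create `directory` if needed and return a safe file path for `name`."""
    # Replace characters that are invalid in file names on common systems
    safe_name = re.sub(r'[\\/:*?"<>|]', "_", name).strip() or "untitled"
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, safe_name + extension)


# Demonstrate with a throwaway directory instead of ./UnsplashExplore
base = tempfile.mkdtemp()
path = build_save_path(os.path.join(base, "UnsplashExplore"), "Black/White: Street")
print(path)
```

In Python 3 the counterpart of the download call is `urllib.request.urlretrieve(imageURL, path)`, which could be given the path returned here.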

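That is the whole process. It is a fairly simple crawler, after all, and it does not deal with login, cookies, or anti-crawling measures. Next I would like to crawl my NetEase Cloud Music account, so that I can switch to a new account while keeping my saved playlists.

The full code is attached below.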

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script: uses urllib2, the print statement, and BeautifulSoup 3.

import urllib2
import urllib
import os
import re
from BeautifulSoup import BeautifulSoup


def downloadImageFromUnsplashExplore():
    html = urllib2.urlopen('https://unsplash.com/explore').read()
    # print html
    soup = BeautifulSoup(html)

    # Every item we want is a div with the class _1mlK1
    itemArray = soup.findAll('div', attrs={"class": "_1mlK1"})

    # urlretrieve does not create missing directories, so make sure it exists
    saveDir = './UnsplashExplore'
    if not os.path.isdir(saveDir):
        os.makedirs(saveDir)

    for item in itemArray:

        # Get the background image URL from the inline style
        style = item["style"]
        url = re.findall(r'url\((.*?)\)', style)[0]
        imageURL = url.strip('"')

        # Get the collection name
        titleDiv = item.findAll('h2', attrs={"class": "_3iawX _1WCyJ _3myVE"})[0]
        imageName = titleDiv.text

        # Download and save the photo
        filesavepath = '%s/%s.jpg' % (saveDir, imageName)

        print imageURL
        print imageName
        print filesavepath
        urllib.urlretrieve(imageURL, filesavepath)


if __name__ == '__main__':
    downloadImageFromUnsplashExplore()
```

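Thanks for reading.

I'm Wythe, an iOS developer who is also curious about other technologies. My public account, WytheTalk, looks at the world from a programmer's point of view. It is mostly technical sharing, with some opinions on what is happening around the internet. You are welcome to follow it.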

[Image: WytheTalk.jpg]