Python Crawler | Scraping a Simple Static Website

Author: YocnZhao | Published 2019-05-23 11:17

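I have been learning Python recently, and crawling is one of its most common uses, so I wrote a couple of simple scraping scripts. They are shared for reference and study only; if anything here causes offense, please contact me.
You will need basic Python and a little HTML. For Python basics, see the Runoob (菜鸟教程) tutorial. The Python version on my machine is Python 2.7.10.
We start with a Zhihu page: https://www.zhihu.com/question/35005800/answer/61498512
The goal is to pull the useful image URLs out of that page and download the images.
The script uses three libraries: urllib2, urllib, and BeautifulSoup.
urllib2 makes the network request, urllib downloads the files, and BeautifulSoup parses and navigates the tags.
A short introduction to crawlers, urllib2, and bs4 is here: https://www.runoob.com/w3cnote/python-spider-intro.html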

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2, urllib, bs4


def test():
    url = "https://www.zhihu.com/question/35005800/answer/61498512"
    response1 = urllib2.urlopen(url)
    html = response1.read()

    soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
    # Print the prettified page source
    print soup.prettify()
    # Find every <img> tag
    all_img = soup.find_all("img")
    # Collect the URLs of the images to download
    img_list = []
    for img in all_img:
        # If the tag has a data-original attribute, its value is the image's http URL; store it in img_list
        if "data-original" in img.attrs:
            img_list.append(img.attrs["data-original"])

    for img in img_list:
        print "Downloading", img
        # https://pic4.zhimg.com/49ce58f2c038c709968a804384747d15_r.jpg -> 49ce58f2c038c709968a804384747d15_r.jpg
        local_path = "/Users/y/PythonWorkSpace/" + img[img.rindex("/") + 1:]
        # Download the image and save it to local_path
        urllib.urlretrieve(img, local_path)

    return


if __name__ == "__main__":
    test()

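That covers scraping the simplest kind of page. The real work is picking the data we want out of a crowd of markup: it may be tucked away somewhere, and other elements may get in the way.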
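The first script gets each local file name by slicing the URL after its last "/". As a rough alternative, the same step can be sketched with standard-library URL parsing, which also drops a query string that plain slicing would keep. The helper name `filename_from_url` is hypothetical, and the try/except import is only there so the snippet runs on both Python 2.7 and Python 3.

```python
import posixpath

try:
    from urlparse import urlparse  # Python 2.7
except ImportError:
    from urllib.parse import urlparse  # Python 3


def filename_from_url(url):
    """Return the last path segment of url, ignoring any query string."""
    path = urlparse(url).path
    return posixpath.basename(path)


name = filename_from_url("https://pic4.zhimg.com/49ce58f2c038c709968a804384747d15_r.jpg")
# name == "49ce58f2c038c709968a804384747d15_r.jpg"
```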
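Here is a second crawler, this time for a site of model photo galleries.
Search Baidu and pick any such site. Most of them are paginated: one photo set usually has dozens of images, too many for a single page, so we have to click "next page" again and again to see them all.
For example, this link: https://www.lsm.me/thread-20000-1-1.html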

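After clicking "next page" the pattern is easy to spot: the link becomes https://www.lsm.me/thread-20000-2-1.html. So we only need to find the first and last values of the numbers in the URL to walk through the whole site.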
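The URL pattern just described can be sketched as a small generator. The function name `thread_page_urls` and its parameters are hypothetical; the bounds are whatever first and last values you find by browsing the site.

```python
def thread_page_urls(first_thread, last_thread, pages_per_thread):
    """Yield the page URLs for every thread id in [first_thread, last_thread]."""
    template = "https://www.lsm.me/thread-%d-%d-1.html"
    for thread_id in range(first_thread, last_thread + 1):
        for page in range(1, pages_per_thread + 1):
            yield template % (thread_id, page)


urls = list(thread_page_urls(20000, 20001, 2))
# urls[0] == "https://www.lsm.me/thread-20000-1-1.html"
# urls[1] == "https://www.lsm.me/thread-20000-2-1.html"
```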
Then we look for the <img> tags in each page.
Next we take the value of each tag's src attribute, keep it, and download the image to local disk.
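The step of reading each tag's src can be tried out on a literal HTML string before touching the network. The sketch below swaps in the standard library's HTMLParser in place of BeautifulSoup so that it runs without extra packages; the class name `ImgSrcCollector`, the sample HTML, and the `min_length` threshold are all made up for the example.

```python
try:
    from HTMLParser import HTMLParser  # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag whose src is at least min_length long."""

    def __init__(self, min_length=0):
        HTMLParser.__init__(self)
        self.min_length = min_length
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip tags with no src, and short URLs such as logos or icons
        if src and len(src) >= self.min_length:
            self.sources.append(src)


html = (
    '<div><img src="/logo.png">'
    '<img alt="no source">'
    '<img src="https://example.com/photos/0001_large.jpg"/></div>'
)
collector = ImgSrcCollector(min_length=20)
collector.feed(html)
collector.close()
# collector.sources == ["https://example.com/photos/0001_large.jpg"]
```

Filtering by URL length is a crude heuristic; it works only as long as the unwanted images happen to have shorter URLs than the wanted ones.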

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2, urllib, bs4, os, time


class Pic:
    def __init__(self, url, desc, path):
        self.url = url
        self.desc = desc
        self.path = path


locol = "/Users/y/PythonWorkSpace/LSM/"


def test():
    # 20000-20700
    start_index = 20000
    end_index = 20100
    for i in range(start_index, end_index):
        download_pic(i)
    return


sample = "https://i.gzjxfw.com:116/k/1178/T/XiuRen/1319/1319_001_bz2_1200_1800.jpg"


def download_pic(index):
    total_index = 0
    for i in range(1, 6):
        url = "https://www.lsm.me/thread-%d-%d-1.html" % (index, i)
        print url
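        # Add headers to imitate a browser: fill them in as completely as needed, or set only the User-Agent in simple cases
        request = urllib2.Request(url)  # Request accepts url, data and headers; with no data to send, add headers afterwards with add_header as done here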
        request.add_header("User-Agent",
                           "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
        request.add_header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7,fr;q=0.6")
        request.add_header("Accept",
                           "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")

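        # urllib2.urlopen raises HTTPError for 4xx/5xx responses instead of returning them, so catch the error to avoid breaking the crawl.
        # To investigate failures later, consider storing the URL and the failure code in a database.
        try:
            response = urllib2.urlopen(request)
        except urllib2.URLError as e:
            print "request failed:", url, e
            return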
        html = response.read()

        soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
        # print soup.prettify()
        all_img = soup.find_all("img")
        pic_list = []
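        # Local directory for this thread's images
        current_dir = locol + str(index) + "/"
        for img in all_img:
            # Running counter used as the file name; other text scraped from the page would work too
            total_index += 1
            # The src attribute holds the image's actual URL; tags without one are skipped
            src = img.get("src")
            if src is None:
                continue
            # This length check keeps only the images we want, since the page also has small images such as the site logo
            if len(src) >= len(sample):
                name = str(total_index)
                locol_path = current_dir + name + ".jpg"
                pic = Pic(src, name, locol_path)
                print pic.url + " " + pic.path + " " + pic.desc
                pic_list.append(pic)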

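            # Another way to filter the images: the img tags we want all have a grandparent (the parent's parent) whose class is adw, which could serve as the test, though it is probably less efficient.
            # It also seemed to have a bug, so I did not use it in the end.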
            # parent = img.parent;
            # pparent = parent.parent;
            # attrs = pparent.attrs
            #
            # if not attrs is None and attrs.has_key("class") and attrs["class"][0].encode('utf-8') == 'adw':
            #     for i in range(len(pparent.contents)):
            #         total_index += 1
            #         name = str(total_index)
            #         curr = parent.contents[i]
            #         locol_path = locol + "" + str(index) + "/" + name + ".jpg"
            #         pic = Pic(curr.attrs["src"], name, locol_path)
            #         print pic.url + " " + pic.path + " " + pic.desc
            #         pic_list.append(pic)

        for pic in pic_list:
            # makedirs also creates missing parent directories
            if not os.path.exists(current_dir):
                os.makedirs(current_dir)
            print "-------------Downloading---------------", pic.url, pic.path
            urllib.urlretrieve(pic.url, pic.path)

        time.sleep(3)
    return


if __name__ == "__main__":
    test()
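A comment in download_pic suggests recording failed links and their status codes in a database for later investigation. One possible way to do that is sketched below with the standard-library sqlite3 module. The table name `failed_requests`, its columns, and the helpers `open_failure_log` and `record_failure` are hypothetical and not part of the original script.

```python
import sqlite3


def open_failure_log(db_path):
    """Open (or create) the database and make sure the log table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failed_requests ("
        "url TEXT NOT NULL, code INTEGER, "
        "logged_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_failure(conn, url, code):
    """Store one failed URL together with its HTTP status code (or None)."""
    conn.execute(
        "INSERT INTO failed_requests (url, code) VALUES (?, ?)", (url, code)
    )
    conn.commit()


conn = open_failure_log(":memory:")  # pass a file path to keep the log between runs
record_failure(conn, "https://www.lsm.me/thread-20000-1-1.html", 404)
rows = conn.execute("SELECT url, code FROM failed_requests").fetchall()
# rows holds one (url, code) tuple per recorded failure
```

In the crawler, `record_failure` would be called wherever a request fails, passing the status code when one is available.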