A Simple Web Crawler with the Python requests Module

Author: 安心远 | Published 2017-01-16 12:02 | Read 95 times

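A crawler is a program that fetches the content of web pages, or the resources on them. Because each page's structure and logic may differ, the way to get at its resources differs too, so in practice a crawler is written for one specific website.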

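Below is the source code of a simple crawler. It is simple in the sense that no login is required: once the page source has been read, the resources can be downloaded.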

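# Python 3
import os
import re

import requests

def Spider(url, i=0):
    # i is the running file index; passing it on keeps later pages
    # from overwriting the files saved from earlier pages
    head = 'http://www.xxx.com'  # site root, prepended to the relative next-page link
    # .text is a decoded str, which is what the str regex patterns below need
    r = requests.get(url).text
    # collect the src of every image whose class is "mb10"
    pic_url = re.findall('class="mb10" src="(.*?)"', r, re.S)
    os.makedirs('pic', exist_ok=True)  # the output folder must exist before writing
    for each in pic_url:
        # drop the '@...' suffix that this site appends to some image URLs
        if '@' in each:
            each = each[0:each.find('@')]
        print(each)
        pic = requests.get(each)
        with open('pic/' + str(i) + '.jpg', 'wb') as fp:
            fp.write(pic.content)
        i += 1

    # find the "next page" link; stop when there is none
    nextPage = re.findall("<a href='(.*?)' btnmode='true' hideFocus class='pageNext'>", r)
    if len(nextPage) <= 0:
        return
    nextPage = nextPage[0]
    print(nextPage)
    if nextPage.strip() != '':
        nextPage = head + nextPage
    else:
        return
    Spider(nextPage, i)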

Analysis:

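(1): The requests module fetches the source of the page at url. Under Python 3, .text gives the decoded str that the regular expressions in the next step expect (.content would give raw bytes).

r = requests.get(url).text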
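Why the type of r matters can be checked without touching the network. The sketch below assumes Python 3 and uses a made-up one-line page (html_bytes is a hypothetical sample, not real site output) to stand in for a response body.

```python
import re

# Hypothetical raw response body, as bytes (what .content would return).
html_bytes = b'<img class="mb10" src="http://www.xxx.com/a.jpg">'
pattern = 'class="mb10" src="(.*?)"'  # a str pattern, as in the crawler

# A str pattern cannot be applied to bytes in Python 3.
try:
    re.findall(pattern, html_bytes)
    mixed_types_work = True
except TypeError:
    mixed_types_work = False

# Decoding first (roughly what .text does for us) makes the match succeed.
html_text = html_bytes.decode('utf-8')
urls = re.findall(pattern, html_text)
print(mixed_types_work, urls)
```

The decode step here hard-codes UTF-8 for the sample; with requests, .text picks the encoding from the response instead.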

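(2): The re module finds the images by matching on their class with a regular expression

pic_url= re.findall('class="mb10" src="(.*?)"', r, re.S)

The regular expression locates the tags by their class. Every site's markup is different, so the pattern has to be written for the specific page.

The result is a list of src values, that is, the list of resource URLs.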
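To see what this pattern returns, here is a small offline sketch. The sample_html string is invented for illustration; only the pattern itself comes from the crawler. It also shows why the lazy (.*?) is used rather than a greedy (.*).

```python
import re

# Invented markup: two images with class "mb10" and one with another class.
sample_html = '''
<img class="mb10" src="http://www.xxx.com/img/1.jpg@300w">
<img class="other" src="http://www.xxx.com/img/skip.jpg">
<img class="mb10" src="http://www.xxx.com/img/2.jpg">
'''

# Lazy match: each capture stops at the first closing quote.
lazy = re.findall('class="mb10" src="(.*?)"', sample_html, re.S)

# Greedy match with re.S: the capture runs on to the last quote in the text,
# swallowing everything in between into a single result.
greedy = re.findall('class="mb10" src="(.*)"', sample_html, re.S)

print(lazy)
print(len(greedy))
```

The lazy version yields one URL per matching tag, while the greedy one collapses the whole page into a single useless match.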

(3): Iterate over the src list, adjust each URL according to the page's own rules, then read the content and write it to a local file

for each in pic_url:
    if '@' in each:
        each = each[0:each.find('@')]
    print(each)
    pic = requests.get(each)
    with open('pic/' + str(i) + '.jpg', 'wb') as fp:
        fp.write(pic.content)
    i += 1
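The two pieces of this loop that do not need the network, trimming the URL and writing the file, can be pulled out and tried on their own. This is only a sketch: clean_url and save_image are hypothetical helper names, and the bytes written are placeholder data rather than a real image.

```python
import os
import tempfile

def clean_url(src):
    """Return src without anything from the first '@' onward."""
    return src.split('@', 1)[0]

def save_image(data, index, folder='pic'):
    """Write data to <folder>/<index>.jpg, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, str(index) + '.jpg')
    with open(path, 'wb') as fp:
        fp.write(data)
    return path

# Offline demo: write placeholder bytes into a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b'fake image bytes', 0, folder=tmp)
    saved_name = os.path.basename(saved)
    with open(saved, 'rb') as fp:
        saved_data = fp.read()

print(clean_url('http://www.xxx.com/img/1.jpg@300w'), saved_name)
```

In the crawler, data would be pic.content from requests.get(each).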

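(4): nextPage turns the pages automatically: the next-page link is extracted from the source, head is prepended to it, and Spider calls itself on the resulting URL until no such link is found.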
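The same paging logic can also be written as a loop instead of recursion, which avoids Python's recursion limit on sites with very many pages. The sketch below is one possible variant, not the author's code: crawl_pages, its fetch parameter, the max_pages cap and the fake_site dictionary are all assumptions made so the example runs offline.

```python
import re

# Next-page pattern taken from the crawler above.
NEXT_LINK = re.compile(r"<a href='(.*?)' btnmode='true' hideFocus class='pageNext'>")

def crawl_pages(start_url, fetch, head='http://www.xxx.com', max_pages=50):
    """Follow next-page links from start_url and return the URLs visited.

    fetch is any callable that maps a URL to its page source (str).
    """
    visited = []
    url = start_url
    # Stop on a repeated URL or after max_pages, so a link cycle cannot loop forever.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        html = fetch(url)
        found = NEXT_LINK.findall(html)
        if not found or not found[0].strip():
            break  # no next-page link: this was the last page
        url = head + found[0]
    return visited

# Invented three-page site used in place of real HTTP requests.
fake_site = {
    'http://www.xxx.com/list_1.html': "<a href='/list_2.html' btnmode='true' hideFocus class='pageNext'>",
    'http://www.xxx.com/list_2.html': "<a href='/list_3.html' btnmode='true' hideFocus class='pageNext'>",
    'http://www.xxx.com/list_3.html': "<p>last page</p>",
}
pages = crawl_pages('http://www.xxx.com/list_1.html', fake_site.get)
print(pages)
```

Against a real site, fetch could be something like lambda u: requests.get(u, timeout=10).text, with the image download step run on each page's source inside the loop.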