A Small Python Crawler Exercise

Author: 朱晓飞 | Published 2016-12-03 11:14

Web Page Crawling

Crawling by following links

Starting from an entry page, the crawler extracts every link it finds. It supports proxies, a configurable crawl depth, link de-duplication, and so on; concurrency has not been handled yet.

The code is as follows:

import urlparse
import urllib2
import re
import Queue

# Page download: fetch a URL, optionally through a proxy, retrying on 5xx errors
def page_download(url, num_retry=2, user_agent='zhxfei', proxy=None):
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))

    try:
        html = opener.open(request).read()        # try to download the page
    except urllib2.URLError as e:                 # download failed
        print 'Download error!', e.reason
        html = None
        if num_retry > 0:                         # retry while retries remain
            if hasattr(e, 'code') and 500 <= e.code <= 600:   # only retry on server errors
                return page_download(url, num_retry - 1, user_agent, proxy)

    if html is None:
        print '%s Download failed' % url
    else:
        print '%s has Download' % url

    return html

# Use a regular expression to extract the links in a page
def get_links_by_html(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)   
    return webpage_regex.findall(html)

# Check whether a crawled link and the entry page belong to the same site
def same_site(url1,url2):
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc

def link_crawler(seed_url, link_regex, max_depth=-1):
    crawl_link_queue = Queue.deque([seed_url])
    seen = {seed_url: 0}        # seen maps each discovered URL to its crawl depth
    depth = 0

    while crawl_link_queue:
        url = crawl_link_queue.pop()
        depth = seen.get(url)
        if depth > max_depth:
            continue
        html = page_download(url)
        if html is None:                            # skip pages that failed to download
            continue

        links = []
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))

        for link in links:
            if link not in seen:
                seen[link] = depth + 1              # record the new link one level deeper
                if same_site(link, seed_url):       # only follow links on the same site
                    crawl_link_queue.append(link)

    print '----All Done----', len(seen)
    return seen


if __name__ == '__main__':
    all_links = link_crawler('http://www.zhxfei.com',r'/.*',max_depth=1) 

Run result:

http://www.zhxfei.com/archives has Download
http://www.zhxfei.com/2016/08/04/lvs/ has Download
...
...
http://www.zhxfei.com/2016/07/22/app-store-审核-IPv6-Olny/#more has Download
http://www.zhxfei.com/archives has Download
http://www.zhxfei.com/2016/07/22/HDFS/#comments has Download
----All Done----
Crawling by sitemap

A sitemap is essentially a map of a website. Related to it is robots.txt; both usually sit in the site root and are provided specifically for spiders, so that the site can be indexed by search engines in a friendlier way, and they define crawling rules for well-behaved crawlers.

So we can also play it this way: take the URLs out of the XML file and crawl the site directly from them. This is the most convenient approach (although the site owner may not necessarily want us to do this).
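For reference, a sitemap is just an XML file that lists URLs inside <loc> tags; a minimal, illustrative fragment (not the site's real file) looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>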

#!/usr/bin/env python
# _*_encoding:utf-8 _*_

# description: crawl a site by loading the URLs listed in its sitemap

import re
from download import page_download

def load_crawler(url):
    # download the sitemap itself
    sitemap = page_download(url)

    # pull every URL out of the <loc> tags
    links = re.findall('<loc>(.*?)</loc>', sitemap)

    for link in links:
        page_download(link)

    print 'All links have been downloaded'

load_crawler('http://example.webscraping.com/sitemap.xml')
Summary

OK, the crawler can now fetch web pages, but it does not actually do anything with them yet; it only downloads them. So we still need to process the data, that is, extract the information we want from the pages.

Data Extraction

Extracting with lxml

Three methods are commonly used to extract information from a web page:

  • Regular expressions (the re module). This is the fastest solution, and by default it caches search patterns (the cache can be cleared with re.purge()); it is also the most complex one (unless you are already an old hand).
  • BeautifulSoup. This is the most user-friendly choice because it is very simple to work with, but it is slow on large amounts of data, so it is generally not recommended when crawling many pages.
  • lxml. This is the middle ground and is also fairly easy to use; it is what we choose here to process the crawled pages.

lxml can be used in two ways: XPath and cssselect, both of which are fairly simple. XPath, similar in spirit to BeautifulSoup's find and find_all, matches patterns and describes the position of DOM nodes and data with a chained, path-like structure, while cssselect uses jQuery-style selectors for matching, which is friendlier to people with a front-end background. A short sketch below shows the same value extracted with each approach.
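To make the difference concrete, here is a minimal sketch (using a small made-up HTML fragment, not the real page) that pulls the same value out with a regular expression, with XPath, and with cssselect:

import re
import lxml.html

html = '<table><tr id="places_area__row"><td class="w2p_fw">244,820 square kilometres</td></tr></table>'

# 1. regular expression: fastest, but brittle if the markup changes
print re.search('<td class="w2p_fw">(.*?)</td>', html).group(1)

# 2. lxml + XPath: describe the path to the node
tree = lxml.html.fromstring(html)
print tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')[0]

# 3. lxml + cssselect: jQuery-style selector (needs the cssselect package installed)
print tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()

All three lines print the same string; they only differ in how the node is located.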

Let's try it with a demo first. The page we are about to scrape is http://example.webscraping.com/places/view/United-Kingdom-239

The page contains a <table>; everything we want is stored in that table in the body. You can use the browser's developer tools to inspect the elements, or use Firebug (a Firefox extension) to look at the DOM structure.
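The row we are after looks roughly like this in the page source (a simplified, illustrative fragment; the real markup has more attributes):

<tr id="places_area__row">
  <td class="w2p_fl">Area: </td>
  <td class="w2p_fw">244,820 square kilometres</td>
</tr>

The selector tr#places_area__row > td.w2p_fw used in the demo below targets exactly the second cell.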

import lxml.html   # lxml's cssselect() also needs the cssselect package to be installed
from download import page_download

example_url = 'http://example.webscraping.com/places/view/United-Kingdom-239'

def demo():
    html = page_download(example_url, num_retry=2)

    result = lxml.html.fromstring(html)        # parse the HTML into an element tree
    print type(result)
    # select the <td> that holds the area value with a CSS selector
    td = result.cssselect('tr#places_area__row > td.w2p_fw')
    print type(td)
    print len(td)
    css_element = td[0]
    print type(css_element)
    print css_element.text_content()           # the text content of the node

demo()

Execution result:

http://example.webscraping.com/places/view/United-Kingdom-239 has Download
<class 'lxml.html.HtmlElement'>
<type 'list'>
1
<class 'lxml.html.HtmlElement'>
244,820 square kilometres

As you can see, using the cssselect selector we got back a list of length 1; the length of the list obviously depends on the selector pattern I defined. Every item in the list is an HtmlElement, which has a text_content method that returns the text of that node. With that, we have the data we wanted.

Callback handling

Next we can add a callback function to the crawler above, so that a small operation is performed every time a page is downloaded.
Obviously we should modify the link_crawler function and pass a reference to the callback in as a parameter, so that different pages can be handled by different callbacks, for example:

def link_crawler(seed_url,link_regex,max_depth=-1,scrape_callback=None):
...

    html = page_download(url)   # same as before
    if scrape_callback:
        scrape_callback(url,html)    
    links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x)) # same as before
...
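For illustration, the callback does not have to write a CSV; a minimal sketch of a different callback (not part of the original code) could simply print each page's title:

import lxml.html

def print_title(url, html):
    # a trivial callback: print the URL together with the page title
    tree = lxml.html.fromstring(html)
    print url, '->', tree.findtext('.//title')

# used like: link_crawler('http://example.webscraping.com', '/(index|view)', max_depth=2, scrape_callback=print_title)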

Next we write the callback. Since Python's object-oriented features are quite powerful, we use a callback class here. Because we need to call instances of the callback class, we override its __call__ method so that, when the instance is called, the scraped data is saved in CSV format, which can be opened as a spreadsheet in WPS or Excel. Of course you could also write it to a database instead; we will get to that later.

import csv
import re
import lxml.html

class ScrapeCallback():

    def __init__(self):
        self.writer = csv.writer(open('contries.csv', 'w+'))
        self.rows_name = ('area','population','iso','country','capital','tld','currency_code','currency_name','phone','postal_code_format','postal_code_regex','languages','neighbours')
        self.writer.writerow(self.rows_name)        # write the header row first

    def __call__(self, url, html):
        if re.search('/view/', url):                # only country detail pages carry data
            tree = lxml.html.fromstring(html)
            rows = []
            for row in self.rows_name:
                # each field lives in a row whose id follows the places_<field>__row pattern
                rows.append(tree.cssselect('#places_{}__row > td.w2p_fw'.format(row))[0].text_content())

            self.writer.writerow(rows)

As you can see, the callback class involves three members:

self.rows_name holds the names of the fields we want to scrape
self.writer acts like a file handle for the CSV output
self.writer.writerow is the method that writes one row of data into the CSV table

Good, with this our data is persisted to disk.

Now change link_crawler's definition to: def link_crawler(seed_url,link_regex,max_depth=-1,scrape_callback=ScrapeCallback()):

Run it and look at the result:

zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ python crawler.py 
http://example.webscraping.com has Download
http://example.webscraping.com/index/1 has Download      # /index does not match the /view pattern in __call__, so no data is extracted
http://example.webscraping.com/index/2 has Download
http://example.webscraping.com/index/0 has Download
http://example.webscraping.com/view/Barbados-20 has Download
http://example.webscraping.com/view/Bangladesh-19 has Download
http://example.webscraping.com/view/Bahrain-18 has Download
...
...
http://example.webscraping.com/view/Albania-3 has Download
http://example.webscraping.com/view/Aland-Islands-2 has Download
http://example.webscraping.com/view/Afghanistan-1 has Download
----All Done---- 35

zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ ls
contries.csv  crawler.py

Open this CSV and you can see that the data has all been saved:
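If you just want a quick look without opening a spreadsheet program, a small sketch like this (assuming contries.csv sits in the current directory) prints the header and the first data row:

import csv

with open('contries.csv') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print row              # each row is a list of the 13 fields
        if i == 1:             # the header plus one data row is enough for a sanity check
            break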


The complete code is here:

#!/usr/bin/env python
# _*_encoding:utf-8 _*_

import urlparse
import urllib2
import re
import Queue
import lxml.html
import csv

class ScrapeCallback():

    def __init__(self):
        self.writer = csv.writer(open('contries.csv', 'w+'))
        self.rows_name = ('area','population','iso','country','capital','tld','currency_code','currency_name','phone','postal_code_format','postal_code_regex','languages','neighbours')
        self.writer.writerow(self.rows_name)        # write the header row first

    def __call__(self, url, html):
        if re.search('/view/', url):                # only country detail pages carry data
            tree = lxml.html.fromstring(html)
            rows = []
            for row in self.rows_name:
                rows.append(tree.cssselect('#places_{}__row > td.w2p_fw'.format(row))[0].text_content())

            self.writer.writerow(rows)

def page_download(url, num_retry=2, user_agent='zhxfei', proxy=None):
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))

    try:
        html = opener.open(request).read()        # try to download the page
    except urllib2.URLError as e:                 # download failed
        print 'Download error!', e.reason
        html = None
        if num_retry > 0:                         # retry while retries remain
            if hasattr(e, 'code') and 500 <= e.code <= 600:   # only retry on server errors
                return page_download(url, num_retry - 1, user_agent, proxy)

    if html is None:
        print '%s Download failed' % url
    else:
        print '%s has Download' % url

    return html

def same_site(url1, url2):
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc

def get_links_by_html(html):
    # regular expression that captures the href value of every <a> tag
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(seed_url, link_regex, max_depth=-1, scrape_callback=ScrapeCallback()):
    crawl_link_queue = Queue.deque([seed_url])
    # seen maps each discovered URL to its crawl depth, e.g. initially {seed_url: 0}
    seen = {seed_url: 0}
    depth = 0

    while crawl_link_queue:
        url = crawl_link_queue.pop()
        depth = seen.get(url)
        if depth > max_depth:
            continue
        html = page_download(url)
        if html is None:                        # skip pages that failed to download
            continue
        if scrape_callback:
            scrape_callback(url, html)          # hand the page to the callback for data extraction

        links = []
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))

        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                if same_site(link, seed_url):
                    crawl_link_queue.append(link)

    print '----All Done----', len(seen)

    return seen


if __name__ == '__main__':
    all_links = link_crawler('http://example.webscraping.com', '/(index|view)', max_depth=2)
