A Small Python Crawler Exercise

Author: 朱晓飞 | Published 2016-12-03 11:14

    Fetching web pages

    By following links

    Starting from an entry page, the crawler extracts all the links it finds. It supports proxies, a configurable crawl depth, and link deduplication; concurrency has not been implemented yet.

    The code is as follows:

    import urlparse
    import urllib2
    import re
    import Queue
    
    # download a single page, with optional retry and proxy support
    def page_download(url,num_retry=2,user_agent='zhxfei',proxy=None):
        #print 'downloading ' , url
        headers = {'User-agent':user_agent}
        request = urllib2.Request(url,headers = headers)
        opener = urllib2.build_opener()
        if proxy:
            # urlparse is the module here, so we need urlparse.urlparse()
            proxy_params = {urlparse.urlparse(url).scheme:proxy}
            opener.add_handler(urllib2.ProxyHandler(proxy_params))
    
        try:
            html = opener.open(request).read()      # use the opener so the proxy handler takes effect
        except urllib2.URLError as e:
            print 'Download error!' , e.reason
            html = None
            if num_retry > 0:                       # retry on 5xx server errors
                if hasattr(e, 'code') and 500 <= e.code <= 600:
                    return page_download(url,num_retry-1,user_agent,proxy)
    
        if html is None:
            print '%s Download failed' % url
        else:
            print '%s has Download' % url
        
        return html
    
    # use a regular expression to extract the links from a page
    def get_links_by_html(html):
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)   
        return webpage_regex.findall(html)
    
    # check whether the crawled link and the entry page belong to the same site
    def same_site(url1,url2):
        return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc
    
    def link_crawler(seed_url,link_regex,max_depth=-1):
        crawl_link_queue = Queue.deque([seed_url])
        seen = {seed_url:0}         # seen maps each discovered URL to its crawl depth
        depth = 0
        
        while crawl_link_queue:
            url = crawl_link_queue.pop()
            depth = seen.get(url)
            if max_depth >= 0 and depth > max_depth:    # max_depth=-1 means no depth limit
                continue
            links = []
            html = page_download(url)
            if html is None:                            # skip pages that failed to download
                continue
            
            links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))
    
            for link in links:
                if link not in seen:
                    seen[link]= depth + 1
                    if same_site(link, seed_url):
                        crawl_link_queue.append(link)
    
            #print seen.values()
        print '----All Done----' , len(seen)
        return seen
    
    
    if __name__ == '__main__':
        all_links = link_crawler('http://www.zhxfei.com',r'/.*',max_depth=1) 
    

    Output:

    http://www.zhxfei.com/archives has Download
    http://www.zhxfei.com/2016/08/04/lvs/ has Download
    ...
    ...
    http://www.zhxfei.com/2016/07/22/app-store-审核-IPv6-Olny/#more has Download
    http://www.zhxfei.com/archives has Download
    http://www.zhxfei.com/2016/07/22/HDFS/#comments has Download
    ----All Done----
    
    By sitemap

    A sitemap is essentially a map of a website; related to it is robots.txt. Both normally sit in the site's root directory and are provided specifically for spiders, so that the site gets indexed by search engines more smoothly, and they define the crawling rules that well-behaved crawlers are expected to follow.
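
    As a brief aside (a sketch added here, not part of the original write-up), Python 2's standard-library robotparser module can check robots.txt before a URL is fetched; the robots.txt location below is just the conventional one, and 'zhxfei' is the user agent used by page_download above:

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('http://example.webscraping.com/robots.txt')
    rp.read()      # download and parse robots.txt
    # True if this user agent is allowed to fetch the page
    print rp.can_fetch('zhxfei', 'http://example.webscraping.com/places/view/United-Kingdom-239')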

    So we can also play it this way: pull the URLs out of the XML file and crawl the site directly from them. This is the most convenient approach (even though the site owner may not necessarily want us to do it).

    #!/usr/bin/env python
    # _*_encoding:utf-8 _*_
    
    # description: this module crawls a site by following its sitemap
    
    import re
    from download import page_download
    
    def load_crawler(url):
        # download the sitemap itself
        sitemap = page_download(url)
        
        # pull every URL out of the <loc> entries
        links = re.findall('<loc>(.*?)</loc>',sitemap)
        
        for link in links:
            page_download(link)
        
        print 'All links have been downloaded'
    #     print links
        
    load_crawler('http://example.webscraping.com/sitemap.xml')
    
    Summary

    Good: the crawler can now fetch web pages, but it does not actually do anything with them yet; it only downloads them. So we still need to process the data, that is, extract the information we want from each page.

    Data extraction

    Extracting with lxml

    There are three common ways to extract information from a fetched page:

    • Regular expressions with the re module. This is the fastest approach, and by default compiled patterns are cached (the cache can be cleared with re.purge()), but it is also the most fragile and complicated one (unless you are already an old hand).
    • BeautifulSoup. This is the most user-friendly choice because it is very easy to work with, but it is slow on large amounts of data, so it is generally not recommended when scraping many pages.
    • lxml. This is the middle ground: fast and still fairly simple to use. It is what we will use to process the downloaded pages; a quick comparison of the three approaches is sketched right after this list.
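
    A minimal sketch comparing the three approaches on the same data (the HTML fragment below is a simplified, hand-written approximation of one table row from the example site, not code from the article):

    import re
    import lxml.html
    from bs4 import BeautifulSoup
    
    html = ('<table><tr id="places_area__row">'
            '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
    
    # 1. regular expression: fastest, but tightly coupled to the exact markup
    print re.search('<td class="w2p_fw">(.*?)</td>', html).group(1)
    
    # 2. BeautifulSoup: the friendliest API, but the slowest option
    soup = BeautifulSoup(html, 'html.parser')
    print soup.find('td', attrs={'class': 'w2p_fw'}).text
    
    # 3. lxml: fast and still easy to use
    tree = lxml.html.fromstring(html)
    print tree.cssselect('td.w2p_fw')[0].text_content()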

    lxml can be queried in two ways: XPath and cssselect, and both are fairly simple to use. XPath, much like bs's find and find_all, matches a pattern that describes the position of the data in the DOM with a chained, path-like structure, while cssselect simply uses jQuery-style selectors for matching, which is friendlier to people with a front-end background.
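
    For instance (a rough sketch reusing the hand-written fragment from the sketch above, not code from the article), the two query styles can reach the same node:

    import lxml.html
    
    tree = lxml.html.fromstring(
        '<table><tr id="places_area__row">'
        '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
    
    # cssselect: jQuery-style CSS selector
    print tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()
    # XPath: a path expression describing the same position in the DOM
    print tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0].text_content()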

    Let's try a demo first. The page we are going to scrape is http://example.webscraping.com/places/view/United-Kingdom-239

    The page contains a <table>, and all the information we want lives in that table inside the body. You can use your browser's developer tools to inspect the elements, or use Firebug (a Firefox add-on) to examine the DOM structure.

    import lxml.html
    import cssselect    # not used directly, but the package must be installed for .cssselect() to work
    from download import page_download
    
    example_url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
    
    def demo():
        html = page_download(example_url, num_retry=2)
    
        result = lxml.html.fromstring(html)
        print type(result)
        td = result.cssselect('tr#places_area__row > td.w2p_fw')
        print type(td)
        print len(td)
        css_element = td[0]
        print type(css_element)
        print css_element.text_content()
    
    if __name__ == '__main__':
        demo()
    

    Output:

    http://example.webscraping.com/places/view/United-Kingdom-239 has Download
    <class 'lxml.html.HtmlElement'>
    <type 'list'>
    1
    <class 'lxml.html.HtmlElement'>
    244,820 square kilometres
    

    As you can see, the cssselect query returned a list of length 1 (the length obviously depends on the selector pattern we wrote). Every item in the list is an HtmlElement, which has a text_content method that returns the text of that node, and with that we have the data we wanted.

    Callback handling

    Next we can add a callback to the crawler above, so that a little extra work is done every time a page is downloaded.
    Obviously we should modify the link_crawler function and pass a reference to the callback as a parameter, so that different pages can be given different callback handling, for example:

    def link_crawler(seed_url,link_regex,max_depth=-1,scrape_callback=None):
    ...
    
        html = page_download(url)   # same as before
        if scrape_callback:
            scrape_callback(url,html)
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x)) # same as before
    ...
    

    Now we write the callback itself. Since Python's object model is quite flexible, we use a callback class here. Because we need to call instances of that class, we override its __call__ method so that each call saves the extracted data in CSV format, which spreadsheet programs such as WPS can open. You could of course write the data to a database instead; more on that later.

    import csv
    import re
    import lxml.html
    
    class ScrapeCallback():
        
        def __init__(self):
            self.writer = csv.writer(open('contries.csv','w+'))
            self.rows_name = ('area','population','iso','country','capital','tld','currency_code','currency_name','phone','postal_code_format','postal_code_regex','languages','neighbours')
            self.writer.writerow(self.rows_name)
            
        def __call__(self,url,html):
            # only country detail pages (URLs containing /view/) carry the table we want
            if re.search('/view/', url):
                tree = lxml.html.fromstring(html)
                rows = []
                for row in self.rows_name:
                    rows.append(tree.cssselect('#places_{}__row > td.w2p_fw'.format(row))[0].text_content())
        
                self.writer.writerow(rows)
    

    The callback class has three notable members:

    self.rows_name holds the names of the fields we want to scrape
    self.writer is the csv writer object, which acts a bit like a file handle
    self.writer.writerow is the method that writes one row of data into the CSV file

    With that, our data is persisted to disk.

    Then change link_crawler's definition to: def link_crawler(seed_url,link_regex,max_depth=-1,scrape_callback=ScrapeCallback()):
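
    As an aside (a sketch that is not in the original article), scrape_callback also accepts a plain function, so other per-page handling is easy to plug in; print_title below is a hypothetical example:

    import lxml.html
    
    # hypothetical alternative callback: print the <title> of every downloaded page
    def print_title(url, html):
        tree = lxml.html.fromstring(html)
        print url, '->', tree.findtext('.//title')
    
    # assuming the link_crawler defined above lives in crawler.py:
    # from crawler import link_crawler
    # link_crawler('http://example.webscraping.com', '/(index|view)', max_depth=1,
    #              scrape_callback=print_title)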

    Run it and check the output:

    zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ python crawler.py 
    http://example.webscraping.com has Download
    http://example.webscraping.com/index/1 has Download      # /index does not match the /view check in __call__, so no data is extracted
    http://example.webscraping.com/index/2 has Download
    http://example.webscraping.com/index/0 has Download
    http://example.webscraping.com/view/Barbados-20 has Download
    http://example.webscraping.com/view/Bangladesh-19 has Download
    http://example.webscraping.com/view/Bahrain-18 has Download
    ...
    ...
    http://example.webscraping.com/view/Albania-3 has Download
    http://example.webscraping.com/view/Aland-Islands-2 has Download
    http://example.webscraping.com/view/Afghanistan-1 has Download
    ----All Done---- 35
    
    zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ ls
    contries.csv  crawler.py
    

    Open the CSV and you can see that all the data has been saved:
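
    As a quick sanity check (a sketch added here, not part of the original article), the file can be read back with the csv module; the column names are the ones defined in rows_name:

    import csv
    
    # read the scraped data back and print the 'area' column of the first few countries
    with open('contries.csv') as f:
        reader = csv.reader(f)
        header = next(reader)               # the first row holds the field names
        for row in list(reader)[:3]:
            print dict(zip(header, row))['area']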


    The complete code:

    #!/usr/bin/env python
    # _*_encoding:utf-8 _*_
    
    import urlparse
    import urllib2
    import re
    import time
    import Queue
    import lxml.html
    import csv
    
    class ScrapeCallback():
        
        def __init__(self):
            self.writer = csv.writer(open('contries.csv','w+'))
            self.rows_name = ('area','population','iso','country','capital','tld','currency_code','currency_name','phone','postal_code_format','postal_code_regex','languages','neighbours')
            self.writer.writerow(self.rows_name)
            
        def __call__(self,url,html):
            # only country detail pages (URLs containing /view/) carry the table we want
            if re.search('/view/', url):
                tree = lxml.html.fromstring(html)            
                rows = []
                for row in self.rows_name:
                    rows.append(tree.cssselect('#places_{}__row > td.w2p_fw'.format(row))[0].text_content())
        
                self.writer.writerow(rows)
    
    def page_download(url,num_retry=2,user_agent='zhxfei',proxy=None):
        #print 'downloading ' , url
        headers = {'User-agent':user_agent}
        request = urllib2.Request(url,headers = headers)
        opener = urllib2.build_opener()
        if proxy:
            # urlparse is the module here, so we need urlparse.urlparse()
            proxy_params = {urlparse.urlparse(url).scheme:proxy}
            opener.add_handler(urllib2.ProxyHandler(proxy_params))
    
        try:
            html = opener.open(request).read()      # use the opener so the proxy handler takes effect
        except urllib2.URLError as e:
            print 'Download error!' , e.reason
            html = None
            if num_retry > 0:                       # retry on 5xx server errors
                if hasattr(e, 'code') and 500 <= e.code <= 600:
                    return page_download(url,num_retry-1,user_agent,proxy)
    
        if html is None:
            print '%s Download failed' % url
        else:
            print '%s has Download' % url
        
        return html
    
    def same_site(url1,url2):
        return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc
    
    def get_links_by_html(html):
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)   # matches the href attribute of <a> tags
        return webpage_regex.findall(html)
    
    def link_crawler(seed_url,link_regex,max_depth=-1,scrape_callback=ScrapeCallback()):
        crawl_link_queue = Queue.deque([seed_url])
        # seen maps each discovered URL to its depth, e.g. initially {seed_url: 0}
        seen = {seed_url:0}
        depth = 0
        
        while crawl_link_queue:
            url = crawl_link_queue.pop()
            depth = seen.get(url)
            if max_depth >= 0 and depth > max_depth:    # max_depth=-1 means no depth limit
                continue
            links = []
            html = page_download(url)
            if html is None:                            # skip pages that failed to download
                continue
            if scrape_callback:
                scrape_callback(url,html)               # hand the page to the callback for data extraction
            
            links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))
    
            for link in links:
                if link not in seen:
                    seen[link]= depth + 1
                    if same_site(link, seed_url):
                        crawl_link_queue.append(link)
    
            #print seen.values()
        print '----All Done----' , len(seen)
    
        return seen
    
    
    if __name__ == '__main__':
        all_links = link_crawler('http://example.webscraping.com', '/(index|view)',max_depth=2)
    
    
