Advanced Python Web Scraping

By 小OO明 | Published 2018-09-11 16:15

    Identify the site to crawl

    http://blog.jobbole.com/all-posts/

    We'll use the Jobbole (伯乐在线) site and crawl every article on it.

    Install Scrapy

    Activate the virtualenv created in the basics article: workon py3scrapy

    pip install -i https://pypi.douban.com/simple/ scrapy

    The install fails with an error that Twisted cannot be installed.

    Go to the unofficial Python binaries site: https://www.lfd.uci.edu/~gohlke/pythonlibs/
    and download the Twisted wheel that matches your Python version (Firefox can trigger the download).

    pip install <local path to the downloaded Twisted .whl>

    Run pip install -i https://pypi.douban.com/simple/ scrapy again; this time Scrapy installs successfully.

    Building the first spider with Scrapy

    scrapy startproject jobbole  # create the project skeleton

    scrapy genspider bole blog.jobbole.com  # generate a spider from the default template

    To debug Scrapy from PyCharm, create a main.py in the project root (the same directory as scrapy.cfg) to act as a command-line entry point.

    from scrapy.cmdline import execute
    import sys
    import os

    # Put the project root on sys.path so Scrapy can locate the project,
    # then run the "bole" spider exactly as `scrapy crawl bole` would on the command line.
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(["scrapy", "crawl", "bole"])
    

    On Windows, running this file fails with a complaint about a missing win32api module.

    Install win32api:

    pip install pypiwin32

    Now running main.py starts the spider and it can be debugged from the IDE.

    Before running, check settings.py and set the following, so that Scrapy does not skip URLs disallowed by the site's robots.txt:

    ROBOTSTXT_OBEY = False
    

    Debugging with the Scrapy shell

    workon py3scrapy

    cd into the directory that contains scrapy.cfg, then run:

    scrapy shell http://blog.jobbole.com/114159/

    CSS selectors

    Scrapy supports CSS selectors as well as XPath; crawling tools without CSS selector support leave you with only XPath.

    response.css('.entry-header h1::text').extract_first()

    Compared with XPath, CSS selectors are more concise and come naturally to anyone with front-end experience, so they are the recommended first choice.

    # -*- coding: utf-8 -*-
    import scrapy
    import re


    class BoleSpider(scrapy.Spider):
        name = 'bole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/114159/']

        def parse(self, response):
            # Extract the article fields with CSS selectors
            title = response.css('.entry-header h1::text').extract_first()
            create_date = response.css('.entry-meta-hide-on-mobile::text').extract_first().strip().replace('·', '').strip()
            praise_num = response.css('h10::text').extract_first()
            fav_num = response.css('.bookmark-btn::text').extract_first()
            comment_num = response.css('a[href="#article-comment"] span::text').extract_first()
            # The bookmark and comment texts embed the count in a longer string; pull out the number
            match_re = re.match(r".*?(\d+).*", fav_num)
            if match_re:
                fav_num = match_re.group(1)
            else:
                fav_num = 0
            match_re = re.match(r".*?(\d+).*", comment_num)
            if match_re:
                comment_num = match_re.group(1)
            else:
                comment_num = 0
            content = response.css('.entry').extract_first()
            tag_list = response.css('.entry-meta-hide-on-mobile a::text').extract()
            # Drop entries such as "2 评论" that are comment counts, not real tags
            tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
            tag_list = ",".join(tag_list)
            pass  # breakpoint target for inspecting the extracted values
    

    Following list-page links into the detail pages

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from scrapy.http import Request
    from urllib import parse


    class BoleSpider(scrapy.Spider):
        name = 'bole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/all-posts/']

        def parse(self, response):
            # Loop over every post on the list page
            post_nodes = response.css('.post.floated-thumb')
            for post_node in post_nodes:
                image_url = post_node.css('img::attr(src)').extract_first()
                post_url = post_node.css('.archive-title::attr(href)').extract_first()
                yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail, meta={"front_image_url": image_url})
            # Follow the link to the next list page
            next_url = response.css('.next.page-numbers::attr("href")').extract_first()
            if next_url:
                yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

        def parse_detail(self, response):
            title = response.css('.entry-header h1::text').extract_first()
            create_date = response.css('.entry-meta-hide-on-mobile::text').extract_first().strip().replace('·', '').strip()
            praise_num = response.css('h10::text').extract_first()
            fav_num = response.css('.bookmark-btn::text').extract_first()
            comment_num = response.css('a[href="#article-comment"] span::text').extract_first()
            match_re = re.match(r".*?(\d+).*", fav_num)
            if match_re:
                fav_num = match_re.group(1)
            else:
                fav_num = 0
            match_re = re.match(r".*?(\d+).*", comment_num)
            if match_re:
                comment_num = match_re.group(1)
            else:
                comment_num = 0
            content = response.css('.entry').extract_first()
            tag_list = response.css('.entry-meta-hide-on-mobile a::text').extract()
            tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
            tag_list = ",".join(tag_list)
            # The cover image URL passed along from the list page
            front_image_url = response.meta.get('front_image_url')
            pass  # breakpoint target for inspecting the extracted values
    

    Storing the data in Item Fields

    items.py

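    The original illustrates this step with a screenshot. A minimal sketch of what items.py likely contains, using the field names referenced later by the MySQL pipeline (the class name JobboleItem matches the ItemLoader example further down):

    import scrapy


    class JobboleItem(scrapy.Item):
        # one Field per value extracted in parse_detail
        title = scrapy.Field()
        create_date = scrapy.Field()
        url = scrapy.Field()
        url_object_id = scrapy.Field()
        content = scrapy.Field()
        praise_num = scrapy.Field()
        fav_num = scrapy.Field()
        comment_num = scrapy.Field()
        tag_list = scrapy.Field()
        front_image_url = scrapy.Field()
        front_image_path = scrapy.Field()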

    bole.py

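    Also shown as a screenshot in the original; a sketch of the excerpt that changes in bole.py, assuming the JobboleItem above: the fields are filled at the end of parse_detail and the item is yielded back to the engine.

    # at the top of bole.py
    from jobbole.items import JobboleItem

    # at the end of parse_detail(), after the extraction code shown above
    article_item = JobboleItem()
    article_item['title'] = title
    article_item['create_date'] = create_date
    article_item['url'] = response.url
    article_item['praise_num'] = praise_num
    article_item['fav_num'] = fav_num
    article_item['comment_num'] = comment_num
    article_item['content'] = content
    article_item['tag_list'] = tag_list
    article_item['front_image_url'] = front_image_url
    yield article_item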

    Passing Field data to the pipelines

    settings.py

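    The screenshot shows the item pipeline being enabled; the settings.py change is along these lines (the priority value 300 is the one that also appears in the images section below):

    ITEM_PIPELINES = {
        'jobbole.pipelines.JobbolePipeline': 300,
    }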

    Set a breakpoint in pipelines.py and you can watch the item arrive in the pipeline.

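    The pipeline generated by scrapy startproject is enough for this check; set a breakpoint inside process_item and run main.py:

    class JobbolePipeline(object):
        def process_item(self, item, spider):
            # breakpoint here: `item` is the JobboleItem yielded by the spider
            return item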

    Downloading images

    Configure Scrapy's built-in images pipeline in settings.py:

    import os

    ITEM_PIPELINES = {
        'jobbole.pipelines.JobbolePipeline': 300,
        'scrapy.pipelines.images.ImagesPipeline': 1,
    }
    # the item field holding the image URLs, and the directory downloads are saved to
    IMAGES_URLS_FIELD = "front_image_url"
    project_dir = os.path.abspath(os.path.dirname(__file__))
    IMAGES_STORE = os.path.join(project_dir, 'images')
    

    Install Pillow with pip (the images pipeline depends on it):

    pip install -i https://pypi.douban.com/simple/ pillow


    The image URL field has to hold a list rather than a single string.
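    A sketch of the one-line change in parse_detail, reusing the assignment from the bole.py sketch above:

    article_item['front_image_url'] = [front_image_url]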

    With that, the crawler can download the article cover images.

    Getting the saved file path

    Define a custom pipeline that subclasses scrapy.pipelines.images.ImagesPipeline

    and override its item_completed method to record where each image was saved.


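    The original shows this as a screenshot. A sketch of the subclass in pipelines.py (the class name ArticleImagePipeline is illustrative); item_completed receives a list of (success, file_info) tuples, and file_info['path'] is the saved location relative to IMAGES_STORE:

    from scrapy.pipelines.images import ImagesPipeline


    class ArticleImagePipeline(ImagesPipeline):
        def item_completed(self, results, item, info):
            # record the saved path of the successfully downloaded cover image
            if "front_image_url" in item:
                for ok, value in results:
                    if ok:
                        item["front_image_path"] = value["path"]
            return item

    To make this subclass the one that runs, replace scrapy.pipelines.images.ImagesPipeline with jobbole.pipelines.ArticleImagePipeline in ITEM_PIPELINES.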

    Saving the data to MySQL

    Install the MySQL driver:

    pip install mysqlclient

    This may fail to build; if so, download the mysqlclient wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/ instead.

    MySQL pipeline (asynchronous inserts)

    # Asynchronous MySQL inserts via Twisted's adbapi
    import MySQLdb
    import MySQLdb.cursors
    from twisted.enterprise import adbapi


    class MysqlTwistedPipeline(object):
        def __init__(self, dbpool):
            self.dbpool = dbpool

        @classmethod
        def from_settings(cls, settings):
            # Read the connection parameters from settings.py
            dbparms = dict(
                host=settings['MYSQL_HOST'],
                db=settings['MYSQL_DBNAME'],
                user=settings['MYSQL_USER'],
                passwd=settings['MYSQL_PASSWD'],
                charset='utf8',
                cursorclass=MySQLdb.cursors.DictCursor,
                use_unicode=True
            )

            dbpool = adbapi.ConnectionPool('MySQLdb', **dbparms)
            return cls(dbpool)

        def process_item(self, item, spider):
            # Use Twisted to turn the MySQL insert into an asynchronous operation
            query = self.dbpool.runInteraction(self.do_insert, item)
            # Handle exceptions raised by the asynchronous insert
            query.addErrback(self.handle_error)
            return item

        def handle_error(self, failure):
            # Log failures from the asynchronous insert
            print(failure)

        def do_insert(self, cursor, item):
            insert_sql = """
                          INSERT INTO article_spider(title,time,url,url_object_id,content,vote_number,collect_number,comment_number,tags,front_image_url,front_image_path)
                          VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                      """
            cursor.execute(insert_sql, (
                item['title'], item['create_date'], item['url'], item['url_object_id'], item['content'], item['praise_num'],
                item['fav_num'], item['comment_num'], item['tag_list'], item['front_image_url'], item['front_image_path']))
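
    For from_settings to find the connection parameters, they have to be defined in settings.py; a sketch with placeholder host, database name, and credentials:

    MYSQL_HOST = "127.0.0.1"
    MYSQL_DBNAME = "article_spider"
    MYSQL_USER = "root"
    MYSQL_PASSWD = "root"

    Then add 'jobbole.pipelines.MysqlTwistedPipeline' to ITEM_PIPELINES with a priority of your choice so the pipeline actually runs.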
    

    ItemLoader

    Import: from scrapy.loader import ItemLoader

    Instantiate: item_loader = ItemLoader(item=JobboleItem(), response=response)

    Key methods:

    item_loader.add_css()

    item_loader.add_xpath()

    item_loader.add_value()

    Example:

    item_loader.add_css('title', '.entry-header h1::text')

    Call load_item() to build the populated item:

    article_item = item_loader.load_item()
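    Putting it together, a sketch of parse_detail rewritten with ItemLoader; the selectors and field names are the same ones used in the earlier spider:

    from scrapy.loader import ItemLoader
    from jobbole.items import JobboleItem

    # inside BoleSpider
    def parse_detail(self, response):
        item_loader = ItemLoader(item=JobboleItem(), response=response)
        item_loader.add_css('title', '.entry-header h1::text')
        item_loader.add_css('create_date', '.entry-meta-hide-on-mobile::text')
        item_loader.add_css('praise_num', 'h10::text')
        item_loader.add_css('fav_num', '.bookmark-btn::text')
        item_loader.add_css('comment_num', 'a[href="#article-comment"] span::text')
        item_loader.add_css('content', '.entry')
        item_loader.add_css('tag_list', '.entry-meta-hide-on-mobile a::text')
        item_loader.add_value('url', response.url)
        item_loader.add_value('front_image_url', [response.meta.get('front_image_url', '')])
        article_item = item_loader.load_item()
        yield article_item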

    Running under the debugger shows that every field produced by the ItemLoader is a list, and none of the values have been cleaned yet.


    Import the processors that Scrapy provides for ItemLoader:
    from scrapy.loader.processors import MapCompose, TakeFirst

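    The screenshot shows processors being attached to Fields in items.py; a sketch of that pattern (the date_convert helper and its date format are illustrative, not from the original):

    import datetime
    import scrapy
    from scrapy.loader.processors import MapCompose, TakeFirst


    def date_convert(value):
        # Illustrative input processor: turn the raw date string into a date object
        try:
            return datetime.datetime.strptime(value.strip(), "%Y/%m/%d").date()
        except ValueError:
            return datetime.datetime.now().date()


    class JobboleItem(scrapy.Item):
        title = scrapy.Field(
            output_processor=TakeFirst()  # keep only the first element of the list
        )
        create_date = scrapy.Field(
            input_processor=MapCompose(date_convert),  # run each raw value through date_convert
            output_processor=TakeFirst()
        )
        # ... remaining fields as before ...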

    Written this way, though, every Field needs its own output_processor = TakeFirst(), which quickly gets repetitive.

    Instead, define a custom loader that subclasses ItemLoader:

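    A sketch of that custom loader (the class name ArticleItemLoader is illustrative):

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst


    class ArticleItemLoader(ItemLoader):
        # every field keeps only the first extracted value by default,
        # so output_processor=TakeFirst() no longer needs repeating per Field
        default_output_processor = TakeFirst()

    In the spider, instantiate ArticleItemLoader in place of ItemLoader; nothing else changes.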

    With that, the ItemLoader setup is complete.
