Getting Started with Scrapy

Author: 那个人_37d7 | Published: 2018-09-28 17:49 | Views: 0

    Source: 天涯明月笙's notes on an imooc (慕课) course

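    Preparation

    • OS: Windows 7

    • Install MySQL
      Tips:

      • In the installer, choose the "Server only" setup type
      • If an installer screen shows no Next button, drive it from the keyboard
        Keys:
        b = back, n = next, x = execute, f = finish, c = cancel
      • Finish the installer, go to the installation directory, initialize the data directory with mysqld --initialize, then enter the MySQL shell with 'mysql -uroot -p'
      • Start the service with net start mysql. If Windows reports that the service name is invalid,
        open cmd in the mysql/bin directory and run mysqld --install, then start the MySQL service from the Services panel in Control Panel. It may take a few attempts.
    • Install PyCharm
      Activate PyCharm's paid (Professional) mode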

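    Crawling every article on Jobbole (伯乐在线)

    Installing modules

    scrapy, pymysql, pillow, pypiwin32

    • pymysql writes the scraped data into the database
    • Scrapy's built-in ImagesPipeline requires pillow
    • On Windows, running scrapy crawl jobbole after creating the spider fails until pypiwin32 is installed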

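    Project structure

    • items: the fields that hold the parsed data
      Declares the field names and their input/output processors
    • pipelines: the item pipelines, which persist the parsed data
      Covers image storage, JSON file export and database storage
    • settings: all crawler settings
      Covers whether to obey robots.txt (ROBOTSTXT_OBEY), the download delay, the image storage directory, the log file location, and which pipelines are enabled and in what order
    • spiders: the spiders themselves
      The main crawling logic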

    Basic commands

    # Create the crawler project
    scrapy startproject jobbole_article 
    # Change into the spiders directory and generate a spider
    scrapy genspider jobbole blog.jobbole.com
    # Run the spider
    scrapy crawl jobbole
    

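    The final directory layout. The images folder does not exist yet right after the commands above.

    [Figure: Jobbole crawler directory layout (伯乐在线爬虫目录.png)]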

    jobbole.py

    # -*- coding: utf-8 -*-
    import scrapy
    from urllib import parse
    from jobbole_article.items import ArticleItemLoader, JobboleArticleItem
    from scrapy.http import Request
    
    
    class JobboleSpider(scrapy.Spider):
        name = 'jobbole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/all-posts']
    
        @staticmethod
        def add_num(value):
            return value if value else [0]
    
        def parse_deatail(self, response):
            response_url = response.url
            front_image_url = response.meta.get('front_image_url', '')
            item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
            item_loader.add_xpath('title', "//div[@class='entry-header']/h1/text()")
            item_loader.add_value('url', response_url)
            item_loader.add_value('url_object_id', response_url)
            item_loader.add_value('front_image_url', front_image_url)
            item_loader.add_xpath('content', "//div[@class='entry']//text()")
            # span_loader = loader.nested_path('//span[@class='href-style'])
            # Upvotes
            item_loader.add_xpath('praise_nums', "//span[contains(@class,'vote-post-up')]/h10/text()", self.add_num)
            # Comments
            item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)
            # Bookmarks
            item_loader.add_xpath('fav_nums', "//span[contains(@class, 'bookmark-btn')]/text()", self.add_num)
            item_loader.add_xpath('tags', "//p[@class='entry-meta-hide-on-mobile']/a[not(@href='#article-comment')]/text()")
            return item_loader.load_item()
    
        def parse(self, response):
            post_nodes = response.xpath("//div[@class='post floated-thumb']")
            for post_node in post_nodes:
                post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
                img_url = post_node.xpath(".//img/@src").extract_first("")
                yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
                              callback=self.parse_deatail)
            next_url = response.xpath('//a[@class="next page-numbers"]/@href').extract_first('')
            if next_url:
                yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
    

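    Modules

    • from urllib import parse
      Used mainly to complete partial URLs
    url = parse.urljoin('http://blog.jobbole.com/', '10000')
    # url is now the joined 'http://blog.jobbole.com/10000'; if the second argument is already a full URL, it is returned as-is
    
    • from jobbole_article.items import ArticleItemLoader, JobboleArticleItem
      Classes defined in items.py
    • from scrapy.http import Request
      • Builds a Scrapy request for a URL that should be followed.
      • meta is a dict used to pass extra values from the Request to its response; read them with response.meta.get()
      • callback is the parse function that is called once the response has been downloaded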
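    For example, to get the article content from http://blog.jobbole.com/all-posts/, build a request for the URL marked by the arrow in the screenshot below. Once the page has been downloaded, parse_detail (spelled parse_deatail in the code) is called to process it. That callback can read img_url, the value stored under the key front_image_url in the Request's meta.
    The corresponding code:

    yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
                              callback=self.parse_deatail)
    
    [Figure: Request.png]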

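    The JobboleSpider class

    • The class inherits from scrapy.Spider; see the Scrapy documentation for its other attributes
    @staticmethod
    def add_num(value):
    

    This can be ignored for now.
    It is a static method of the class, passed as an extra processor in the call below. Its main job is to return a default value when the XPath for a field extracts nothing.

    item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)
    
    • The custom method parse_detail (spelled parse_deatail in the code)
    1. Purpose: parses an article detail page such as http://blog.jobbole.com/114420/, extracts the field values and returns the populated item
    2. Variables
      response_url is the URL of the response, e.g. http://blog.jobbole.com/114420/
      front_image_url is the cover image URL taken from the list page http://blog.jobbole.com/all-posts
      item_loader is the loader instance that populates the item. Its common methods are add_xpath and add_value. Note that a loaded value such as item['title'] is a list unless an output processor changes that
      • add_xpath
        Extracts values from the response with XPath. The first argument, e.g. 'title', is the item key (field name), the second is the XPath expression, and the third is a processor
      • add_value
        Assigns the given value directly
      • load_item
        Populates and returns the item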
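    • JobboleSpider's built-in parse method
    1. Purpose: like parse_detail, it parses a response; the difference is that parse is the callback Scrapy uses by default.
    2. response.xpath
      Evaluates an XPath expression and returns a SelectorList. Call extract() for the list of all matched strings, or extract_first() for the first one
    • XPath rules
    1. Some rules
      • Expressions can be chained much like URLs, but the second expression must start with a dot (.)
    post_nodes = response.xpath("//div[@class='post floated-thumb']")
    for post_node in post_nodes:
        #  .//a[@title]/@href
        post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
        img_url = post_node.xpath(".//img/@src").extract_first("")
    
    - Elements whose attribute does not have a given value: "//div[not(@class='xx')]"
    - Elements whose attribute contains a given value: "//div[contains(@class, 'xx')]"
    - @href extracts the value of the href attribute; text() extracts the text inside an element
    - // selects descendants at any depth; / selects direct children only
    
    2. Debugging
      Selectors can be tested in the browser, but there they have to be written as CSS selectors


      [Figure: CSS selector in the browser (css规则浏览器.png)]

      Or test with the scrapy shell command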

    scrapy shell http://blog.jobbole.com/all-posts
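    # Then enter an expression to inspect what it returns
    response.xpath("...").extract()
    # fetch(url) replaces the downloaded response
    fetch('http://blog.jobbole.com/10000')
    

    Alternatively, set a breakpoint and run the spider under the PyCharm debugger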

    items.py

    import scrapy
    import re
    import hashlib
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose
    
    
    def get_md5(value):
        # hashlib needs bytes, so encode str values as UTF-8 first
        if isinstance(value, str):
            value = value.encode(encoding='utf-8')
        m = hashlib.md5()
        m.update(value)
        return m.hexdigest()
    
    
    def get_num(value):
        # Return the first run of digits in value as an int, or 0 if there is none
        if value:
            # (\d+) is greedy; the lazy (\d+?) used originally captured only one digit
            num = re.match(r".*?(\d+)", value)
            try:
                return int(num.group(1))
            except (AttributeError, TypeError):
                return 0
        else:
            return 0
    
    # Redundant
    def return_num(value):
        # return value[0] if value else 0
        if value:
            return value
        else:
            return "1"
    
    
    class JobboleArticleItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        url = scrapy.Field()
        url_object_id = scrapy.Field(
            input_processor=MapCompose(get_md5)
        )
        front_image_url = scrapy.Field(
            output_processor=Identity()
        )
        front_image_path = scrapy.Field()
        content = scrapy.Field(
            output_processor=Join()
        )
        praise_nums = scrapy.Field(
            input_processor=MapCompose(get_num),
            # output_processor=MapCompose(return_num)
        )
        fav_nums = scrapy.Field(
            input_processor=MapCompose(get_num),
            # output_processor=MapCompose(return_num)
            # input_processor=Compose(get_num, stop_on_none=False)
        )
        comment_nums = scrapy.Field(
            input_processor=MapCompose(get_num),
            # output_processor=MapCompose(return_num)
            # input_processor=Compose(get_num, stop_on_none=False)
        )
        tags = scrapy.Field(
            output_processor=Join()
        )
    
        def get_insert_sql(self):
            insert_sql = """
                    insert into jobbole(title, url, url_object_id, front_image_url, front_image_path,praise_nums, fav_nums, 
                    comment_nums, tags, content)
                    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                    """
            params = (
                self['title'], self['url'], self['url_object_id'], self['front_image_url'], self['front_image_path'],
                self['praise_nums'], self['fav_nums'], self['comment_nums'], self['tags'], self['content']
            )
            return insert_sql, params
    
    
    class ArticleItemLoader(ItemLoader):
        default_output_processor = TakeFirst()
    

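    Modules

    • import re
      Regular expression module
    # match() only matches at the start of the string
    num = re.match(r".*(\d)", 'xx')
    # If the match above fails, num is None and the next line raises AttributeError
    num = num.group(1)
    # Separately, int([]) raises TypeError
    
    • import hashlib
      Converts a string into its MD5 hex digest. The string must be encoded first (UTF-8 here), because the values Scrapy extracts are Unicode str objects
    • from scrapy.loader import ItemLoader
      Scrapy's ItemLoader, subclassed here to define the custom ArticleItemLoader.
    • from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose
      A set of processor classes that ship with Scrapy (recent releases expose them as itemloaders.processors). TakeFirst returns the first non-empty value of the list. MapCompose takes several functions, runs every value of the list through a function, collects the results into a new list and feeds that list to the next function. Join concatenates the list into one string. Identity leaves the values untouched. Compose also takes several functions but, unlike MapCompose, passes the whole list to each function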
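
As a quick check of the hashlib note above, here is a small standalone sketch. It mirrors the get_md5 helper from items.py but also accepts bytes; the sample URL is taken from the traceback later in these notes, and the name url_id is only illustrative.

```python
import hashlib


def get_md5(value):
    """Return the hex MD5 digest of a str or bytes value."""
    if isinstance(value, str):
        # hashlib only accepts bytes, so encode text first.
        value = value.encode("utf-8")
    return hashlib.md5(value).hexdigest()


url_id = get_md5("http://blog.jobbole.com/92328/")
print(len(url_id))  # an MD5 hex digest always has 32 characters
```

A 32-character digest fits comfortably in a varchar(50) primary key column and gives each URL a fixed-length identifier.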

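    The get_num function

    Once add_xpath() in jobbole.py is given add_num, the if check becomes redundant.
    Catching TypeError looked redundant too, but add_num passes the integer 0 for empty results and re.match raises TypeError on a non-string, so the simplified version keeps it and moves the match inside the try block.

    def get_num(value):
        try:
            return int(re.match(r".*?(\d+)", value).group(1))
        except (AttributeError, TypeError):
            return 0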
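
The difference between the lazy group (\d+?) and the greedy group (\d+) is easy to miss, so here is a small sketch of it. The helper first_number is hypothetical (it is not part of the original project), and the sample string is made up for illustration.

```python
import re


def first_number(text, default=0):
    """Return the first run of digits in text as an int, or default if none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


sample = " 12 收藏"

# The lazy quantifier stops after a single digit ...
lazy = re.match(r".*?(\d+?)", sample).group(1)
# ... while the greedy quantifier takes the whole run of digits.
greedy = re.match(r".*?(\d+)", sample).group(1)

print(lazy, greedy, first_number(sample), first_number(" 收藏"))
```

With the lazy pattern, a count such as 12 would be stored as 1, so the greedy form is the safer choice when extracting counters.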
    

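    The JobboleArticleItem class

    Defines the item's fields and their input and output processors. The two kinds of processor run at different times.
    One open question: the documentation says the input processor runs as soon as data is extracted, while the output processor runs once the whole list has been collected. If Compose is used as an output processor, it also handles a whole list, which seemed contradictory. (According to the Scrapy documentation, both processors receive a list: the input processor gets the values extracted by one add_xpath/add_value call, and the output processor gets everything collected for the field when load_item() is called.)
    Note that setting output_processor in scrapy.Field() overrides default_output_processor for that field.
    Also, the functions in MapCompose() never see an empty input: if the extracted list is empty, they are not called at all.
    The Scrapy source shows a for loop that calls the function on each value of the list: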

                for v in values:
                    next_values += arg_to_iter(func(v))
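
To make the difference between the two processors concrete, here is a simplified pure-Python sketch. map_compose_sketch and compose_sketch are hypothetical stand-ins written for this note, not Scrapy's actual classes; they only imitate the loop quoted above.

```python
def map_compose_sketch(*functions):
    """Hypothetical stand-in for MapCompose: apply each function per value."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # None results are dropped
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


def compose_sketch(*functions):
    """Hypothetical stand-in for Compose: pass the whole value to each function."""
    def process(value):
        for func in functions:
            value = func(value)
        return value
    return process


calls = []


def to_int(text):
    calls.append(text)  # record every call to show when the function runs
    return int(text)


per_value = map_compose_sketch(str.strip, to_int)
print(per_value([" 1 ", "22"]))   # [1, 22]
print(per_value([]))              # [] and to_int is never called
print(compose_sketch(len)([]))    # 0: the function still receives the empty list
```

This is the reason an empty extraction bypasses MapCompose(get_num) entirely, while a function passed directly to add_xpath (such as add_num) still gets a chance to supply a default.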
    
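    • The get_insert_sql method
      Returns the SQL statement and parameters for writing to MySQL; it is used in pipelines.py

    The ArticleItemLoader class

    Gives every item field a default output processor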

    pipelines.py

    import pymysql
    from twisted.enterprise import adbapi
    from scrapy.pipelines.images import ImagesPipeline
    
    
    class JobboleArticlePipeline(object):
        def process_item(self, item, spider):
            return item
    
    
    class JobboleMysqlPipeline(object):
    
        def __init__(self, dbpool):
            self.dbpool = dbpool
    
        @classmethod
        def from_settings(cls, settings):
            params = dict(
                host=settings['MYSQL_HOST'],
                db=settings['MYSQL_DBNAME'],
                user=settings['MYSQL_USER'],
                passwd=settings['MYSQL_PASSWORD'],
                charset='utf8',
                cursorclass=pymysql.cursors.DictCursor,
                use_unicode=True
            )
            dbpool = adbapi.ConnectionPool('pymysql', **params)
            return cls(dbpool)
    
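        def process_item(self, item, spider):
            # Run the insert in a thread from the connection pool
            query = self.dbpool.runInteraction(self.do_insert, item)
            query.addErrback(self.handle_error, item, spider)
            # Return the item so that any later pipeline still receives it
            return item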
    
        def do_insert(self, cursor, item):
            insert_sql, params = item.get_insert_sql()
            cursor.execute(insert_sql, params)
    
        def handle_error(self, failure, item, spider):
            print(failure)
    
    
    class ArticleImagePipeline(ImagesPipeline):
    
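        def item_completed(self, results, item, info):
            # front_image_url may be empty, so check before using the results
            if 'front_image_url' in item:
                # results is a list of (success, info) tuples; info is a dict
                # with a 'path' key only when the download succeeded
                image_file_path = ''
                for ok, value in results:
                    if ok:
                        image_file_path = value['path']
                item['front_image_path'] = image_file_path
            return item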
    

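    Modules

    • import pymysql
      Connects to the database and writes to it
    import pymysql
    # Connect to MySQL
    db = pymysql.connect(host='localhost', user='root', password='123456', database='jobbole')
    # Get a cursor with cursor()
    cursor = db.cursor()
    # SQL insert statement (column and value are placeholders)
    insert_sql = "insert into jobbole (<column>) values ('<value>')"
    # Run the insert
    try:
        cursor.execute(insert_sql)
        # Commit the transaction
        db.commit()
    except pymysql.MySQLError:
        # Roll back on error (rollback is a connection method, not a cursor method)
        db.rollback()
    # Close the connection
    db.close()
    
    • from twisted.enterprise import adbapi
      Asynchronous database access. The author does not fully understand it yet and treats it as a pattern to memorize for now
    • from scrapy.pipelines.images import ImagesPipeline
      Scrapy's image storage pipeline; the pillow module has to be installed separately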
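    The JobboleArticlePipeline class

    The pipeline class generated with the project

    The JobboleMysqlPipeline class: custom asynchronous writes to MySQL

    • settings holds the values defined in settings.py
    • Creating the asynchronous connection pool
    dbpool = adbapi.ConnectionPool('pymysql', **params)
    
    • Creating the instance
    # Calls __init__(dbpool) and returns the instance
    return cls(dbpool)
    
    • process_item, the method a pipeline uses to handle each item
    # Runs the insert asynchronously in a thread from the pool.
    # No db.commit() is needed: runInteraction commits when the function returns and rolls back if it raises
    query = self.dbpool.runInteraction(self.do_insert, item)
    
    # addErrback registers handle_error to be called if the insert fails.
    # Open question from the author: does process_item not need to return the item? (Scrapy expects it to, so that later pipelines receive the item.)
    query.addErrback(self.handle_error, item, spider)
    
    • do_insert
      Open question from the author: does the cursor argument come from the ConnectionPool? (Yes: runInteraction passes a cursor-like transaction object as the first argument.)

    ArticleImagePipeline

    • item_completed
      Arguments: results, item, info
      Mainly records front_image_path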

    settings.py

    General

    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 1

    MySQL settings

    MYSQL_HOST = '127.0.0.1'
    MYSQL_USER = 'root'
    MYSQL_DBNAME = 'jobbole'
    MYSQL_PASSWORD = '123456'
    

    Enabling pipelines and setting their order

    Lower numbers run first. The entries refer to the pipelines written in pipelines.py

    ITEM_PIPELINES = {
    # 'jobbole_article.pipelines.JobboleArticlePipeline': 300,
        'jobbole_article.pipelines.ArticleImagePipeline': 1,
        'jobbole_article.pipelines.JobboleMysqlPipeline': 2,
    }
    

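    Image storage directory

    import os
    # Item field that holds the image download URLs
    IMAGES_URLS_FIELD = 'front_image_url'
    # Parent directory for the images: the directory that contains settings.py (__file__ is the path of settings.py)
    # abspath returns an absolute path, dirname the parent directory
    image_dir = os.path.abspath(os.path.dirname(__file__))
    # Folder the images are stored in
    IMAGES_STORE = os.path.join(image_dir, 'images')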
    

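    MySQL commands used

    # List databases
    show databases;
    # List tables
    show tables;
    # Create the database
    create database jobbole;
    # Switch to it
    use jobbole;
    # Create the table. The original statement omitted the table name;
    # front_image_path and comment_nums are added here on the assumption that
    # the table must match the columns written by get_insert_sql
    create table jobbole(
    title varchar(200) not null,
    url varchar(300) not null,
    url_object_id varchar(50) primary key not null,
    front_image_url varchar(200),
    front_image_path varchar(200),
    praise_nums int(11) not null,
    fav_nums int(11) not null,
    comment_nums int(11) not null,
    tags varchar(200),
    content longtext not null
    );
    # Show the database character set
    show variables like 'character_set_database';
    # Show the first row of the table
    select * from jobbole limit 1;
    # Count the rows in the table
    select count(title) from jobbole;
    # Show the size of the table
    use information_schema;
    select concat(round(sum(DATA_LENGTH/1024/1024),2),'MB') as data from TABLES where table_schema='jobbole' and table_name='jobbole';
    # Delete all rows
    truncate table jobbole;
    # Drop a column
    alter table <tablename> drop column <column_name>;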
    

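    Problems

    • The first run stopped after a little over 1,300 articles; the cause is unclear
    • There are noticeably fewer cover images than rows: over 9,000 database records but only about 6,000 images
    • An empty cover image URL raises an error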
    'fav_nums': 2,
     'front_image_url': [''],
     'praise_nums': 2,
     'tags': '职场 产品经理 程序员 职场',
     'title': '程序员眼里的 PM 有两种:有脑子的和没脑子的。后者占 90%',
     'url': 'http://blog.jobbole.com/92328/',
     'url_object_id': 'f74aa62b6a79fcf8f294173ab52f4459'}
    Traceback (most recent call last):
      File "g:\py3env\bole2\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
        requests = arg_to_iter(self.get_media_requests(item, info))
      File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in get_media_requests
        return [Request(x) for x in item.get(self.images_urls_field, [])]
      File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in <listcomp>
        return [Request(x) for x in item.get(self.images_urls_field, [])]
      File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
        self._set_url(url)
      File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
        raise ValueError('Missing scheme in request url: %s' % self._url)
    ValueError: Missing scheme in request url:
    
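    • Articles that contain emoji trigger an encoding error. (A likely cause: MySQL's utf8 charset stores at most 3 bytes per character, and emoji need utf8mb4.)
    • No log file was configured for the crawl
    • When the XPath passed to add_xpath() extracts an empty list, the MapCompose() input/output processor has no effect.
      The workaround is to pass an extra processor as an argument to add_xpath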
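    Summary

    • If you can follow it, the English documentation is better than the machine-translated Chinese one
    • When something is unclear, read the source code
    • Write logs to a file
    • If a part is hard to understand on its own, come back to it after going through the whole thing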

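    Logging in to Zhihu with Selenium and crawling Q&A

    Encoding issues

    Converting between str and bytes in Python

    In Scrapy, response.body is bytes, so it has to be decoded to str before being written to a text file

    text = 'abc'
    # errors can be 'strict' (the default) or 'ignore', among others
    byt = text.encode(encoding='utf8', errors='strict')
    # bytes -> str
    text = byt.decode(encoding='utf8', errors='ignore')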
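
The effect of the errors argument shows up as soon as the bytes are not valid UTF-8. This is a minimal sketch with made-up sample text; the byte string is deliberately cut in the middle of a multi-byte character.

```python
text = "scrapy入门"
data = text.encode("utf-8")          # str -> bytes
restored = data.decode("utf-8")      # bytes -> str

# Cut the byte string in the middle of the last (3-byte) character.
broken = data[:-1]

try:
    broken.decode("utf-8")           # errors="strict" is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops the bytes that cannot be decoded.
ignored = broken.decode("utf-8", errors="ignore")

print(strict_failed, ignored)
```

Note that 'ignore' loses data without warning, so it is best kept for cases like page dumps where a few missing characters do not matter.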
    

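    Encoding to watch when writing bytes to a file

    On Chinese-locale Windows, open() defaults to gbk for new text files, so Python tries to encode the downloaded text as gbk when writing it, which often fails. Specify the encoding when opening the file.

    with open('c:test.txt', 'w', encoding='utf8') as f:
        f.write(response.body.decode('utf8', errors='ignore'))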
    

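    Decoding a base64-encoded image

    from PIL import Image
    from io import BytesIO
    import base64
    # Image source string, presumably a data URI ("data:image/...;base64,....")
    img_src = ".."
    # Drop the prefix before the comma
    img_src = img_src.split(',')[1]
    # Decode the base64 text back into raw image bytes (b64decode, not b64encode)
    img_bytes = base64.b64decode(img_src)
    img = Image.open(BytesIO(img_bytes))
    img.show()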
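
The split-and-decode step can be checked without Pillow. The sketch below builds a small data URI from dummy bytes (the first 8 bytes of the PNG signature, chosen only as sample data) and decodes it again; the variable names are illustrative.

```python
import base64

# Dummy payload: the 8-byte PNG file signature.
payload = b"\x89PNG\r\n\x1a\n"
data_uri = "data:image/png;base64," + base64.b64encode(payload).decode("ascii")

# Split once at the first comma: prefix on the left, base64 text on the right.
header, encoded = data_uri.split(",", 1)
decoded = base64.b64decode(encoded)

print(header)
print(decoded == payload)
```

The decoded bytes are what would be wrapped in BytesIO and handed to an image library.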
    

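    Crawler tips

    Building a response by hand

    from scrapy.http import HtmlResponse
    # Read the saved page as text, then hand it to HtmlResponse as UTF-8 bytes
    with open("example.html", encoding="utf-8") as f:
        body = f.read()
    response = HtmlResponse(url='http://example.com', body=body.encode('utf-8'))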
    

    Joining and following URLs

    def parse(self, response):
        yield {}
        for url in response.xpath("...").extract():
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse)
        # Shorter: response.follow needs neither extract() in the loop nor response.urljoin.
        # To post-process a URL first, call url.extract() on the selector.
        for url in response.xpath("..."):
            yield response.follow(url, callback=self.parse)
    

    Crawler logging

    See the logging page of the Scrapy documentation.
    Scrapy's log levels are the same as Python's: debug, info, warning, error, critical
    Every Spider has a built-in logger attribute

    class ZhihuSpider(scrapy.Spider):
        def func(self):
            self.logger.warning('this is a log')
    

    Outside a Spider class you can use

    import logging
    logging.warning('this is a log')
    # A separate named logger also works
    logger = logging.getLogger('mycustomlogger')
    logger.warning('this is a log')
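
To see how the level filter behaves, here is a standalone sketch with the standard logging module. It writes to an in-memory stream instead of a file so the result can be inspected; the format string and the use of StringIO are choices made for this example only.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("mycustomlogger")
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the example's output out of the root logger

logger.info("dropped: below WARNING")
logger.warning("this is a log")

output = stream.getvalue()
print(output, end="")
```

Only the warning reaches the handler, which is the same filtering that LOG_LEVEL applies to a crawl's output.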
    

    The log level shown on the command line and the log file can also be set in settings.py

    LOG_FILE = 'dir'
    LOG_LEVEL = logging.WARNING
    # Command line
    --logfile FILE
    --loglevel LEVEL
    

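    Regex: excluding strings with a lookahead

    Note that a lookahead such as (?=...) is zero-width: it consumes no characters

    s = 'sda'
    re.match('s(?=d)$', s) # Fails: the lookahead consumes nothing, so $ is tested right after 's'
    # Negative lookahead: does not match when 's' is followed by 'da'
    re.match('s(?!da)', s)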
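
The Zhihu spider later in these notes relies on a negative lookahead to skip question log pages. The sketch below isolates that pattern; the wrapper function question_id is hypothetical, the dot in zhihu.com is escaped here as a small tightening, and the sample URLs are made up apart from the one quoted in the spider's comment.

```python
import re

# Question URLs, excluding those that continue with "log" after the id.
QUESTION_URL = re.compile(r".*zhihu\.com/question/(\d+)(/|$)(?!log)")


def question_id(url):
    """Return the numeric question id, or None if the URL should be skipped."""
    match = QUESTION_URL.match(url)
    return int(match.group(1)) if match else None


print(question_id("https://www.zhihu.com/question/13413413"))
print(question_id("https://www.zhihu.com/question/13413413/answer/99"))
print(question_id("https://www.zhihu.com/question/13413413/log"))

# Zero-width behaviour from the snippet above:
print(re.match("s(?=d)", "sda").group())   # the lookahead adds nothing to the match
print(re.match("s(?=d)$", "sda"))          # $ is checked right after "s", so no match
```

Because the lookahead is zero-width, it only vetoes the match; the captured group and the matched text are unaffected.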
    

    Using Selenium

    See the Selenium Python API documentation

    Download a browser driver

    Chrome version 6.0; with the latest version a "missing arguments granttype" error appears

    Selenium methods

    from selenium import webdriver
    driver = webdriver.Chrome(executable_path="path/to/the/driver")
    # driver.page_source holds the page source
    # Waiting in Selenium
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as ec
    # 10 is the timeout in seconds; until() takes a function that receives the driver and returns a truthy or falsy value
    # (find_element_by_xpath is the Selenium 3 API; Selenium 4 uses find_element(By.XPATH, ...))
    element = WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
                "//div[@class='SignContainer-switch']/span"))
    # Same idea; ec provides Selenium's ready-made wait conditions
    WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
                (By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))
    

    Full crawler code

    settings.py

    # -*- coding: utf-8 -*-
    import logging
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'zhihu'
    
    SPIDER_MODULES = ['zhihu.spiders']
    NEWSPIDER_MODULE = 'zhihu.spiders'
    SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
    MYSQL_HOST = '127.0.0.1'
    MYSQL_DBNAME = 'zhihuSpider'
    MYSQL_USER = 'root'
    MYSQL_PASSWORD = '123456'
    
    LOG_LEVEL = logging.WARNING
    # Raw string, so that \b in the path is not read as a backspace escape
    LOG_FILE = r'G:\py3env\bole2\zhihu\zhihu\zhihu_spider.log'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    COOKIES_ENABLED = True
    
    # Override the default request headers:
    # Required (no trailing comma, which would turn the value into a tuple)
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0"
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
      'zhihu.pipelines.ZhihuPipeline': 300,
    }
    

    zhihu_login.py

    # -*- coding: utf-8 -*-
    import scrapy
    from zhihu.items import ZhihuQuestionItem, ZhihuAnswerItem, ZhihuItem
    import re
    import json
    import datetime
    from selenium import webdriver
    # Lets plain text be parsed with selectors
    #from scrapy.selector import Selector
    # Usage: Selector(text=driver.page_source).css().extract()
    # For opening base64-encoded images
    #import base64
    #from io import BytesIO, StringIO
    import logging
    # Modules for Selenium's explicit waits
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as ec
    
    
    class ZhihuLoginSpider(scrapy.Spider):
    
        name = 'zhihu_login'
        allowed_domains = ['www.zhihu.com']
        # Initial URL used by start_requests
        start_urls = ['https://www.zhihu.com/signup?next=%2F']
        # API for fetching the answers to a question
        start_answer_url = ["https://www.zhihu.com/api/v4/questions/{0}/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset={2}&limit={1}&sort_by=default"]
    
        def start_requests(self):
            driver = webdriver.Chrome(executable_path='C:/Users/Administrator/Desktop/chromedriver.exe')
            # Open the sign-up page (start_urls is a class attribute, so it needs self.)
            driver.get(self.start_urls[0])
            # Wait up to 10 seconds for the login switch to appear
            element = WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
                "//div[@class='SignContainer-switch']/span"))
            # Click to switch to the login form
            element.click()
            # Wait until the switch shows the text "注册" (sign up) after the click
            WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
                (By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))
            # Type the account name and password
            driver.find_element_by_css_selector("div.SignFlow-account input").send_keys("your account")
            driver.find_element_by_css_selector("div.SignFlow-password input").send_keys("your password")
            driver.find_element_by_css_selector("button.SignFlow-submitButton").click()
            # Wait for an element of the logged-in page to finish loading
            WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
                "//div[@class='GlobalWrite-navTitle']"))
            # Collect the cookies
            Cookies = driver.get_cookies()
            cookie_dict = {}
            for cookie in Cookies:
                cookie_dict[cookie['name']] = cookie['value']
            # Close the browser window
            driver.close()
            return [scrapy.Request('https://www.zhihu.com/', cookies=cookie_dict, callback=self.parse)]
    
        def parse(self, response):
            # Collect every link on the page
            all_urls = response.css("a::attr(href)").extract()
            for url in all_urls:
                # Skip URLs such as https://www.zhihu.com/question/13413413/log
                match_obj = re.match(r'.*zhihu.com/question/(\d+)(/|$)(?!log)', url)
                if match_obj:
                    yield scrapy.Request(response.urljoin(url), callback=self.parse_question)
                else:
                    yield scrapy.Request(response.urljoin(url), callback=self.parse)
    
        def parse_question(self, response):
            # Only the current page layout, which contains QuestionHeader-title, is parsed.
            # Returning early keeps question_id and question_item from being used undefined.
            if "QuestionHeader-title" not in response.text:
                return
            match_obj = re.match(r".*zhihu.com/question/(\d+)(/|$)", response.url)
            self.logger.warning('Parse function called on {}'.format(response.url))
            if not match_obj:
                return
            self.logger.warning('zhihu id is {}'.format(match_obj.group(1)))
            question_id = int(match_obj.group(1))
            item_loader = ZhihuItem(item=ZhihuQuestionItem(), response=response)
            # "::text" without a leading space selects only the element's own text nodes
            item_loader.add_css("title", "h1.QuestionHeader-title::text")
            item_loader.add_css("content", ".QuestionHeader-detail ::text")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            # The answer count needs a different CSS selector depending on whether
            # "view all answers" was clicked, so both selectors are added
            item_loader.add_css("answer_num", "h4.List-headerText span ::text")
            item_loader.add_css("answer_num", "a.QuestionMainAction::text")
            item_loader.add_css("comments_num", "div.QuestionHeader-Comment button::text")
            item_loader.add_css("watch_user_num", "strong.NumberBoard-itemValue::text")
            item_loader.add_css("topics", ".QuestionHeader-topics ::text")
            item_loader.add_value("crawl_time", datetime.datetime.now())
            question_item = item_loader.load_item()
            """Unused
            else:
                match_obj = re.match(".*zhihu.com/question/(\d+)(/|$)", response.url)
                if match_obj:
                    question_id = int(match_obj.group(1))
                    item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
                    item_loader.add_css("title",
                                    "//*[id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")
                    item_loader.add_css("content", ".QuestionHeader-detail")
                    item_loader.add_value("url", response.url)
                    item_loader.add_value("zhihu_id", question_id)
                    item_loader.add_css("answer_num", "#zh-question-answer-num::text")
                    item_loader.add_css("comment_num", "#zh-question-meta-wrap a[name='addcomment']::text")
                    item_loader.add_css("watch_user_num", "//*[@id='zh-question-side-header-wrap']/text()|"
                                                      "//*[@class='zh-question-followers-sidebar]/div/a/strong/text()")
                    item_loader.add_css("topics", ".zm-tag-editor-labels a::text")
                    question_item = item_loader.load_item()
            """
            # format(*args, **kwargs)
            # print("{1}{程度}{0}".format("开心", "今天", 程度="很"))
            # prints: 今天很开心
            yield scrapy.Request(self.start_answer_url[0].format(question_id, 20, 0),
                                 callback=self.parse_answer)
            yield question_item
    
        def parse_answer(self, response):
            # The API returns a JSON string; parse it into a dict
            ans_json = json.loads(response.text)
            is_end = ans_json["paging"]['is_end']
            next_url = ans_json["paging"]["next"]
            for answer in ans_json["data"]:
                # Assigning to the item directly is simple, but processors cannot be used this way
                answer_item =ZhihuAnswerItem()
                answer_item["zhihu_id"] = answer["id"]
                answer_item["url"] = answer["url"]
                answer_item["question_id"] = answer["question"]["id"]
                answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
                answer_item["content"] = answer["content"] if "content" in answer else None
                answer_item["parise_num"] = answer["voteup_count"]
                answer_item["comments_num"] = answer["comment_count"]
                answer_item["create_time"] = answer["created_time"]
                answer_item["update_time"] = answer["updated_time"]
                answer_item["crawl_time"] = datetime.datetime.now()
                yield answer_item
            if not is_end:
                yield scrapy.Request(next_url,  callback=self.parse_answer)
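
The answer API URL above is filled in with positional str.format fields, and the indices do not appear in order ({0} is the question id, {2} the offset, {1} the limit). The sketch below uses a shortened, made-up template with the same index layout, plus the keyword example from the spider's comment.

```python
# Shortened stand-in for start_answer_url, keeping the same field indices.
template = "questions/{0}/answers?offset={2}&limit={1}"

# Arguments are matched by position: 0 -> question id, 1 -> limit, 2 -> offset.
url = template.format(12345, 20, 0)

# Positional and keyword fields can be mixed.
sentence = "{1}{程度}{0}".format("开心", "今天", 程度="很")

print(url)
print(sentence)
```

So format(question_id, 20, 0) requests the first 20 answers starting at offset 0, regardless of where the placeholders sit in the URL.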
    

    The screenshots below show how to find the API in the web page

    [Figures: three screenshots, each named 图片.png]

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import pymysql
    from twisted.enterprise import adbapi
    
    
    class ZhihuPipeline(object):
    
        def __init__(self, dbpool):
            self.dbpool = dbpool
        
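        def process_item(self, item, spider):
            # Run the insert in a thread from the connection pool
            query = self.dbpool.runInteraction(self.do_insert_sql, item)
            query.addErrback(self.handle_error, item, spider)
            # Return the item so that any later pipeline still receives it
            return item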
    
        def do_insert_sql(self, cursor, item):
            insert_sql, params = item.get_insert_sql()
            cursor.execute(insert_sql, params)
    
        def handle_error(self, failure, item, spider):
            print(failure)
    
        @classmethod
        def from_settings(cls, settings):
            params = dict(
                host=settings['MYSQL_HOST'],
                db=settings['MYSQL_DBNAME'],
                user=settings['MYSQL_USER'],
                passwd=settings['MYSQL_PASSWORD'],
                charset='utf8',
                cursorclass=pymysql.cursors.DictCursor,
                use_unicode=True,
            )
            dbpool = adbapi.ConnectionPool("pymysql", **params)
            return cls(dbpool)
    

    items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    import logging
    import datetime
    import re
    import scrapy
    from scrapy.loader.processors import TakeFirst, Join, Compose, MapCompose
    from scrapy.loader import ItemLoader
    
    
    # Extract the number from the follower, answer and comment count texts
    def extract_num(value):
        # Log the incoming list
        logging.warning('this is function extract_num value:{}'.format(value))
        for val in value:
            if val is not None:
                # Strip the thousands separators (commas) from the number
                val = ''.join(val.split(','))
                match_obj = re.match(r".*?(\d+)", val)
                if match_obj:
                    logging.warning('this is one of value:{}'.format(match_obj.group(1)))
                    return int(match_obj.group(1))
                break
    
    
    # Subclass ItemLoader to set a default output processor
    class ZhihuItem(ItemLoader):
        # Take the first element of the list
        default_output_processor = TakeFirst()
    
    class ZhihuQuestionItem(scrapy.Item):
    
        topics = scrapy.Field(
                # Join the topics with commas
                output_processor=Join(',')
                )
        url = scrapy.Field()
        title = scrapy.Field()
        content = scrapy.Field()
        answer_num = scrapy.Field(
                # Extract the number
                output_processor=Compose(extract_num)
                )
        comments_num = scrapy.Field(
                output_processor=Compose(extract_num)
                )
        # Number of followers
        watch_user_num = scrapy.Field(
                output_processor=Compose(extract_num)
                )
        zhihu_id = scrapy.Field()
        crawl_time = scrapy.Field()
    
        def get_insert_sql(self):
            # on duplicate key update col_name=value(col_name)
            insert_sql = """
                insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, comments_num,
                watch_user_num,  crawl_time
                )
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                on duplicate key update content=values(content), answer_num=values(answer_num), comments_num=values(
                comments_num), watch_user_num=values(watch_user_num)
                """
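            # Assigning through an attribute (the commented-out line below) fails with:
            # [Failure instance: Traceback: <class 'AttributeError'>: Use item['crawl_time'] = '2018-10-29 19:16:24' to set field value
            # self.crawl_time = datetime.datetime.now()
            # get() covers keys that were never set
            # The value returned by datetime.datetime.now() can be inserted into the database as-is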
            params = (self.get('zhihu_id'), self.get('topics','null'), self.get('url'), self.get('title'), self.get('content','null'), self.get('answer_num',0), self.get('comments_num',0),
                      self.get('watch_user_num',0),  self.get('crawl_time'))
            return insert_sql, params
    
    
    class ZhihuAnswerItem(scrapy.Item):
    
        zhihu_id = scrapy.Field()
        url = scrapy.Field()
        question_id = scrapy.Field()
        author_id = scrapy.Field()
        content = scrapy.Field()
        # Upvotes
        parise_num = scrapy.Field()
        comments_num = scrapy.Field()
        # Creation time
        create_time = scrapy.Field()
        update_time = scrapy.Field()
        crawl_time = scrapy.Field()
    
        def get_insert_sql(self):
            insert_sql = """
                insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
                create_time, update_time, crawl_time
                )
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                on duplicate key update content=values(content), comments_num=values(comments_num), parise_num=values(
                parise_num), update_time=values(update_time)
                """
            # fromtimestamp converts a Unix timestamp into a datetime object
            params = (
                self.get("zhihu_id"), self.get("url"), self.get("question_id"), self.get("author_id"), self.get("content"), self.get("parise_num", 0),
                self.get("comments_num", 0), datetime.datetime.fromtimestamp(self.get("create_time")), datetime.datetime.fromtimestamp(self.get("update_time")), self.get("crawl_time"),
            )
            return insert_sql, params
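
The timestamp conversion used in get_insert_sql can be tried on its own. This sketch pins the timezone to UTC so the result is reproducible, which is an assumption of the example: the spider code above calls fromtimestamp without a timezone and therefore uses the machine's local time. The helper name timestamp_to_sql is hypothetical; the format string is SQL_DATETIME_FORMAT from settings.py.

```python
import datetime

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def timestamp_to_sql(timestamp):
    """Format a Unix timestamp (in seconds) as a UTC datetime string."""
    moment = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
    return moment.strftime(SQL_DATETIME_FORMAT)


print(timestamp_to_sql(0))
print(timestamp_to_sql(86400))
```

The API's created_time and updated_time fields are treated as second-resolution timestamps of this kind in the item code above.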
    
    

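    Summary

    1. Some problems keep coming back
    2. Write the program step by step, and remember to add comments
    3. Note down useful links as you go
    4. Zhihu login methods that do not use Selenium have all stopped working; if you find one that still works, please let the author know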

          Article link: https://www.haomeiwen.com/subject/wemuoftx.html