Using and Modifying the scrapyd API

Author: 瓜T_T | Published 2019-07-05 17:26

    Installation

    The server:
    pip install scrapyd
    The command-line tool:
    python3 -m pip install scrapyd-client
    The Python client package:
    python3 -m pip install python-scrapyd-api
    Find the Python binary path and make scrapyd directly executable (e.g. symlink it onto the PATH).


    Start the service:

    scrapyd

    On startup, scrapyd reads the default configuration file shipped inside the Python package.
    max_proc
    Maximum number of concurrent Scrapy processes. The default is 0, meaning no fixed cap: the effective limit is the number of CPUs * max_proc_per_cpu.
    max_proc_per_cpu
    Maximum number of Scrapy processes per CPU.
    bind_address
    Bind address. Change it to 0.0.0.0 to make the service reachable from other machines (open the firewall port as well).
    http_port
    Listening port.
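    Put together, a minimal configuration overriding these options might look like the following sketch (the values are illustrative; all other scrapyd defaults are omitted):

    ```ini
    [scrapyd]
    bind_address     = 0.0.0.0
    http_port        = 6800
    max_proc         = 0
    max_proc_per_cpu = 4
    ```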
    After starting scrapyd, open the corresponding address in a browser and you will see its web interface.
    The examples shown on that page are incomplete; you can edit website.py in the project and add:
    <p><code>curl http://localhost:6800/schedule.json -d project=default -d spider=somespider</code></p>
    <p><code> curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444 </code></p>
    <p><code> curl http://localhost:6800/listprojects.json </code></p>
    <p><code> curl http://localhost:6800/listversions.json?project=myproject </code></p>
    <p><code> curl http://localhost:6800/listspiders.json?project=myproject </code></p>
    <p><code> curl http://localhost:6800/listjobs.json?project=myproject </code></p>
    <p><code> curl http://localhost:6800/delproject.json -d project=myproject </code></p>
    <p><code> curl http://localhost:6800/delversion.json -d project=myproject -d version=r99 </code></p>
    

    With the change in place, the page shows the full set of example commands.
    At this point the scrapyd service is up and running. Next we deploy a Sina news spider to show how a crawler project is uploaded to scrapyd.
    cd sinanews to enter the project's top-level directory. Among the files there, scrapy.cfg is the configuration file that deploy reads.
    The url parameter can keep its default for a local scrapyd; project is the project name, which identifies the project inside scrapyd.
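    For reference, a scrapy.cfg with a named deploy target might look like this sketch (the target name abc and the project name sinanews are simply this project's values, matching the deploy command used later):

    ```ini
    [settings]
    default = sinanews.settings

    [deploy:abc]
    url = http://localhost:6800/
    project = sinanews
    ```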
    Upload with the scrapyd-deploy command. Installing scrapyd-client puts the executable under the Python bin directory, so it needs the same kind of symlink:
    ln -s /usr/local/bin/python3/bin/scrapyd-deploy /usr/bin/scrapyd-deploy
    scrapyd-deploy -l
    lists the deploy targets defined in scrapy.cfg along with their projects; the target name is the part after the colon in a [deploy:target] section header.
    Upload with:
    scrapyd-deploy <target> -p <project> --version <version>
    scrapyd-deploy abc -p sinanews --version 1
    The version argument may be omitted, in which case one is generated automatically. A successful deploy returns a JSON response confirming the upload.
    Back in the web interface, the project you just uploaded now appears.
    Let's try the API.
    List projects
    [root@localhost sinanews]# curl http://localhost:6800/listprojects.json
    {"node_name": "localhost.localdomain", "status": "ok", "projects": ["sinanews"]}
    

    List versions

    [root@localhost sinanews]# curl http://localhost:6800/listversions.json?project=sinanews
    {"node_name": "localhost.localdomain", "status": "ok", "versions": ["1"]}
    

    Deploy a version 2, then list the versions again

    [root@localhost sinanews]# curl http://localhost:6800/listversions.json?project=sinanews
    {"node_name": "localhost.localdomain", "status": "ok", "versions": ["1", "2"]}
    

    List spiders

    [root@localhost sinanews]# curl http://localhost:6800/listspiders.json?project=sinanews
    {"node_name": "localhost.localdomain", "status": "ok", "spiders": ["sina"]}
    

    Run a spider

    [root@localhost sinanews]# curl http://localhost:6800/schedule.json -d project=sinanews -d spider=sina
    {"node_name": "localhost.localdomain", "status": "ok", "jobid": "2157910a9ef811e995c020040fe78714"}
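    The same schedule call can be issued from plain Python. A minimal sketch using only the standard library; the helper name schedule_request is hypothetical, not part of scrapyd:

    ```python
    def schedule_request(base_url, project, spider, **spider_args):
        """Build the (url, form-data) pair for scrapyd's schedule.json endpoint."""
        url = base_url.rstrip('/') + '/schedule.json'
        data = {'project': project, 'spider': spider}
        data.update(spider_args)  # extra -d key=value pairs are passed to the spider
        return url, data

    url, data = schedule_request('http://localhost:6800', 'sinanews', 'sina')
    # urllib.request.urlopen(url, urllib.parse.urlencode(data).encode())
    # would then perform the same POST that the curl command above does.
    ```

    cancel.json, delproject.json, and delversion.json all take the same kind of POST form data, so the pattern carries over.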
    

    Cancel a job

    [root@localhost sinanews]# curl http://localhost:6800/cancel.json -d project=sinanews -d job=2157910a9ef811e995c020040fe78714
    {"node_name": "localhost.localdomain", "status": "ok", "prevstate": null}
    

    Delete a project

    curl http://localhost:6800/delproject.json -d project=myproject
    

    Delete a specific version

    curl http://localhost:6800/delversion.json -d project=myproject -d version=r99
    

    Log files
    Logs are stored as <logs dir>/<project>/<spider>/<job id>.log; how many are kept per spider is set in the configuration file (the jobs_to_keep option).
    Eggs
    Uploaded project code is packaged into an egg file, stored as <eggs dir>/<project>/<version>.egg.
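    The two on-disk layouts described above can be sketched as path helpers (the function names are hypothetical; the directory names come from scrapyd's logs_dir and eggs_dir settings):

    ```python
    import os

    def job_log_path(logs_dir, project, spider, job_id):
        # <logs dir>/<project>/<spider>/<job id>.log
        return os.path.join(logs_dir, project, spider, job_id + '.log')

    def egg_path(eggs_dir, project, version):
        # <eggs dir>/<project>/<version>.egg
        return os.path.join(eggs_dir, project, version + '.egg')
    ```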

    Using python-scrapyd-api

    Scheduling

    from scrapyd_api import ScrapydAPI
    scrapyd = ScrapydAPI('http://localhost:6800')
    scrapyd.schedule('sinanews', 'sina')
    

    Modifying the source to make cancel easier to use

    # scrapyd/webservice.py
    class SpiderId(WsResource):

        def render_POST(self, txrequest):
            args = native_stringify_dict(copy(txrequest.args), keys_only=False)
            project = args['project'][0]
            spider = args['spider'][0]
            spiders = self.root.launcher.processes.values()
            # Jobs currently running for this project/spider
            running = [(s.job, s.start_time.isoformat(' '))
                       for s in spiders if (s.project == project and s.spider == spider)]
            # queue = self.root.poller.queues[project]
            # pending = [(x["_job"],) for x in queue.list() if x["name"] == spider]
            # Finished jobs for this project/spider
            finished = [(s.job, s.start_time.isoformat(' '))
                        for s in self.root.launcher.finished
                        if (s.project == project and s.spider == spider)]
            alist = running + finished
            if len(alist) == 0:
                return {"node_name": self.root.nodename, "status": "error",
                        "message": 'no such project or spider'}
            last_id = max(alist, key=lambda a: a[0])
            return {"node_name": self.root.nodename, "status": "ok", 'id': last_id[0]}
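    Note that the class above picks the "latest" job with max(alist, key=lambda a: a[0]), i.e. by job id. Since scrapyd job ids are random hex strings, a variant that compares the start_time field (the second tuple element) is a sketch of a more predictable choice; the sample tuples below are made up but match the (job, start_time.isoformat(' ')) shape used above:

    ```python
    jobs = [
        ('f0a1', '2019-07-05 17:26:00'),  # (job id, start time)
        ('0b2c', '2019-07-05 18:00:00'),
    ]
    # ISO-formatted timestamps sort correctly as strings,
    # so max by the second element yields the most recently started job.
    latest = max(jobs, key=lambda j: j[1])
    ```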
    
    # scrapyd/default_scrapyd.conf -- register the new endpoint in the [services] section
    spiderid.json     = scrapyd.webservice.SpiderId
    
    # scrapyd/website.py
    <p><code> curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444 </code></p>
    <p><code> curl http://localhost:6800/listprojects.json </code></p>
    <p><b><code> curl http://localhost:6800/spiderid.json -d project=myproject -d spider=spider </code></b></p>
    <p><code> curl http://localhost:6800/listversions.json?project=myproject </code></p>
    <p><code> curl http://localhost:6800/listspiders.json?project=myproject </code></p><p><code>  curl http://localhost:6800/listjobs.json?project=myproject </code></p>
    <p><code> curl http://localhost:6800/delproject.json -d project=myproject </code></p>
    <p><code> curl http://localhost:6800/delversion.json -d project=myproject -d version=r99 </code></p>
    

    Changes to python-scrapyd-api

    # constants.py
    SPIDERID_ENDPOINT = 'spiderid'
    DEFAULT_ENDPOINTS = {
        ADD_VERSION_ENDPOINT: '/addversion.json',
        CANCEL_ENDPOINT: '/cancel.json',
        DELETE_PROJECT_ENDPOINT: '/delproject.json',
        DELETE_VERSION_ENDPOINT: '/delversion.json',
        LIST_JOBS_ENDPOINT: '/listjobs.json',
        LIST_PROJECTS_ENDPOINT: '/listprojects.json',
        LIST_SPIDERS_ENDPOINT: '/listspiders.json',
        LIST_VERSIONS_ENDPOINT: '/listversions.json',
        SCHEDULE_ENDPOINT: '/schedule.json',
        SPIDERID_ENDPOINT: '/spiderid.json',
    }
    

    # wrapper.py
    def spiderid(self, project, spider):
        """
        Return the job id of the most recent run of the given spider.
        """
        url = self._build_url(constants.SPIDERID_ENDPOINT)
        params = {'project': project, 'spider': spider}
        json = self.client.post(url, data=params)
        return json['id']
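    An offline sketch of how the new method is meant to be used together with cancel(): the stub classes below stand in for a running scrapyd and for the real wrapper, so the canned job id and the _build_url shape are illustrative only.

    ```python
    class StubClient:
        def post(self, url, data=None):
            # Shape of a successful spiderid.json response
            return {'node_name': 'localhost', 'status': 'ok',
                    'id': '2157910a9ef811e995c020040fe78714'}

    class ScrapydAPIStub:
        def __init__(self):
            self.client = StubClient()

        def _build_url(self, endpoint):
            return 'http://localhost:6800/' + endpoint + '.json'

        def spiderid(self, project, spider):
            # Same logic as the wrapper method above
            url = self._build_url('spiderid')
            params = {'project': project, 'spider': spider}
            json = self.client.post(url, data=params)
            return json['id']

    job = ScrapydAPIStub().spiderid('sinanews', 'sina')
    # With a real ScrapydAPI instance the point of the change is:
    #   job = scrapyd.spiderid('sinanews', 'sina')
    #   scrapyd.cancel('sinanews', job)
    ```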
    

    Original article: https://www.haomeiwen.com/subject/omtshctx.html