Pyspider Parameters

Author: 岸与海 | Published 2019-01-10 11:38
    url:

    the url or url list to be crawled.

    callback:

    the method to parse the response. default: __call__

    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)
    
    
    age:

    the period of validity of the task, in seconds. The page is regarded as not modified during this period. default: -1 (never recrawl)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        ...
    
    priority:

    the priority of the task to be scheduled, higher is better. default: 0

    def index_page(self):
        self.crawl('http://www.example.org/page2.html', callback=self.index_page)
        self.crawl('http://www.example.org/233.html', callback=self.detail_page,
                   priority=1)
    
    exetime:

    the execution time of the task as a unix timestamp. default: 0 (immediately)

    import time
    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   exetime=time.time()+30*60)
    
    retries:

    number of retries when the fetch fails. default: 3
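
    For illustration, a minimal sketch that raises the retry count for an unreliable page; the URL is a placeholder:

    def on_start(self):
        self.crawl('http://www.example.org/flaky-page', callback=self.callback,
                   retries=5)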

    itag:

    a marker from the frontier page to reveal potential modification of the task. It is compared with its last value, and the task is recrawled when it changes. default: None

    def index_page(self, response):
        for item in response.doc('.item').items():
            self.crawl(item.find('a').attr.url, callback=self.detail_page,
                       itag=item.find('.update-time').text())
    
    auto_recrawl:

    when enabled, the task will be recrawled every age interval. default: False

    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   age=5*60*60, auto_recrawl=True)
    
    method:

    HTTP method to use. default: GET
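
    For illustration, a sketch that issues a HEAD request instead of the default GET; the URL and callback are placeholders:

    def on_start(self):
        self.crawl('http://httpbin.org/get', callback=self.callback,
                   method='HEAD')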

    params:

    dictionary of URL parameters to append to the URL.

    def on_start(self):
        self.crawl('http://httpbin.org/get', callback=self.callback,
                   params={'a': 123, 'b': 'c'})
        self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
    
    data:

    the body to attach to the request. If a dictionary is provided, form-encoding will take place.

    def on_start(self):
        self.crawl('http://httpbin.org/post', callback=self.callback,
                   method='POST', data={'a': 123, 'b': 'c'})
    
    files:

    dictionary of {field: {filename: 'content'}} files to multipart upload.
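
    A minimal sketch of a multipart upload following the {field: {filename: 'content'}} shape described above; the endpoint and field name are illustrative:

    def on_start(self):
        self.crawl('http://httpbin.org/post', callback=self.callback,
                   method='POST',
                   files={'report': {'report.txt': 'file content here'}})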

    user_agent:

    the User-Agent of the request

    headers:

    dictionary of headers to send.

    cookies:

    dictionary of cookies to attach to this request.
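
    A sketch combining user_agent, headers, and cookies on one request; the values shown are illustrative:

    def on_start(self):
        self.crawl('http://httpbin.org/get', callback=self.callback,
                   user_agent='Mozilla/5.0 (compatible; pyspider)',
                   headers={'Accept-Language': 'en-US'},
                   cookies={'session_id': 'abc123'})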

    connect_timeout:

    timeout for initial connection in seconds. default: 20

    timeout:

    maximum time in seconds to fetch the page. default: 120
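
    For illustration, a sketch that tightens both timeouts; the values are arbitrary and the URL is a placeholder:

    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   connect_timeout=10, timeout=60)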

    allow_redirects:

    follow 30x redirects. default: True

    validate_cert:

    For HTTPS requests, validate the server’s certificate? default: True
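
    A sketch that disables redirect following and certificate validation, e.g. for a self-signed test server; the URL is a placeholder:

    def on_start(self):
        self.crawl('https://self-signed.example.org/', callback=self.callback,
                   allow_redirects=False, validate_cert=False)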

    proxy

    proxy server in the form username:password@hostname:port to use; only http proxy is supported currently.

    class Handler(BaseHandler):
        crawl_config = {
            'proxy': 'localhost:8080'
        }
    
    etag

    use the HTTP Etag mechanism to skip processing if the content of the page has not changed. default: True

    last_modified

    use the HTTP Last-Modified header mechanism to skip processing if the content of the page has not changed. default: True
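
    A sketch that forces a full refetch by turning both cache-validation mechanisms off; the URL is a placeholder:

    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   etag=False, last_modified=False)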

    fetch_type

    set to js to enable the JavaScript fetcher. default: None

    js_script

    JavaScript to run before or after the page is loaded; it should be wrapped in a function, e.g. function() { document.write("binux"); }.

    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   fetch_type='js', js_script='''
                   function() {
                       window.scrollTo(0,document.body.scrollHeight);
                       return 123;
                   }
                   ''')
    
    js_run_at

    run JavaScript specified via js_script at document-start or document-end. default: document-end

    js_viewport_width/js_viewport_height

    set the size of the viewport for the JavaScript fetcher's layout process.

    load_images

    load images when the JavaScript fetcher is enabled. default: False
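
    A sketch combining the JavaScript-fetcher options above; the viewport size is arbitrary and the URL is a placeholder:

    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   fetch_type='js', js_run_at='document-start',
                   js_viewport_width=1024, js_viewport_height=768,
                   load_images=True)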

    save

    an object passed to the callback method; it can be accessed via response.save.

    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   save={'a': 123})
    
    def callback(self, response):
        return response.save['a']
    
    taskid

    unique id to identify the task; by default it is the MD5 checksum of the URL, and it can be overridden by the method def get_taskid(self, task)

    import json
    from pyspider.libs.utils import md5string
    def get_taskid(self, task):
        return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))
    
    force_update

    force an update of the task params even if the task is in ACTIVE status.

    cancel

    cancel a task; should be used with force_update to cancel an active task. To cancel an auto_recrawl task, you should set auto_recrawl=False as well.
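
    A sketch of cancelling a previously scheduled auto_recrawl task, per the note above; the URL is a placeholder:

    def on_start(self):
        self.crawl('http://www.example.org/', callback=self.callback,
                   force_update=True, cancel=True, auto_recrawl=False)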

    cURL command

    self.crawl(curl_command)
    cURL is a command line tool for making HTTP requests. A cURL command can easily be obtained from the Chrome DevTools > Network panel: right-click the request and choose "Copy as cURL".
    You can pass a cURL command as the first argument of self.crawl. It will parse the command and make the HTTP request just like curl does.
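
    For illustration, a sketch of passing a copied cURL command directly to self.crawl; the command shown is a made-up example:

    def on_start(self):
        self.crawl("curl 'http://httpbin.org/get' -H 'Accept: text/html'",
                   callback=self.callback)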

    @config(**kwargs)

    default parameters of self.crawl when the decorated method is used as the callback. For example:

    @config(age=15*60)
    def index_page(self, response):
        self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
        self.crawl('http://www.example.org/product-233', callback=self.detail_page)
    
    @config(age=10*24*60*60)
    def detail_page(self, response):
        return {...}
    
    Handler.crawl_config = {}

    default parameters of self.crawl for the whole project.
    The parameters in crawl_config for the scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are applied when the task is created; the parameters for the fetcher and processor are applied when the task is executed.
    You can use this mechanism to change the fetch config (e.g. cookies) afterwards, as sketched below.
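
    A minimal sketch of a project-wide crawl_config that sets fetcher options; the header, cookie, and timeout values are illustrative:

    class Handler(BaseHandler):
        crawl_config = {
            'headers': {'Accept-Language': 'en-US'},
            'cookies': {'session_id': 'abc123'},
            'timeout': 60
        }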
