Intro

Author: 方方块 | Published: 2017-07-15 04:36

    (Optional) Create a virtual environment

    Python 3 is preferred:
    mkvirtualenv --python=/usr/bin/python3 python3

    Run pip --version to confirm that pip belongs to the Python 3 install.
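
The same check can be made from inside the interpreter. A minimal sketch using only the standard library; the helper name is_python3 is made up for this illustration:

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


# Path of the interpreter in use; inside the virtualenv it should
# point into the environment's own directory.
print(sys.executable)
print(is_python3())
```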

    Steps

    • scrapy startproject name
    • scrapy genspider botname url

    Set ROBOTSTXT_OBEY = True in settings.py so the spider only crawls pages that robots.txt permits, which keeps it a good web citizen

    • inside the project folder, run scrapy crawl botname
    • test selectors in the shell
    • run scrapy crawl botname -o xx.json (or xx.csv) to save the results to a file

    Using the shell to debug and test

    scrapy shell

    • check that a URL can be fetched - fetch(url)
    • check the HTML that Scrapy received - view(response) opens it in a browser

    Alternative xpath testing tool
    http://www.freeformatter.com/xpath-tester.html

    XPath docs

    XPath queries run against the response through a selector.

    A selector, as the name says, selects parts of the HTML content:
    from scrapy.selector import Selector
    Since this is such a common operation, response.selector.xpath() is shortened to response.xpath()

    Extra
    CSS selectors can also be used, through response.css(), but XPath is the more fundamental option: Scrapy translates CSS selectors to XPath internally

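    //name or //* - selects every <name> element, or every element, anywhere in the document
    text() - the text content of an element, returned as a unicode string
    '(//name)[1]' - the first match; XPath counts from 1, so this equals response.xpath('//name')[0] in Python, use either. Without the parentheses, '//name[1]' selects every <name> that is the first one under its parent
    . - starts the path at the current node; use it (or just omit the leading //) when calling .xpath() on a nested selector instead of on the response
    @ - selects an attribute, e.g. //a/@href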

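    If an element has an itemprop attribute, prefer it over class when extracting: itemprop names the data itself, while class names usually serve styling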

    Tool to get XPath expressions quickly (XPath Helper extension for Chrome):

    https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl
