Intro

Author: 方方块 | Published: 2017-07-15 04:36

(Optional) Create virtual environment

Python 3 is preferred:
mkvirtualenv --python=/usr/bin/python3 python3

Run pip --version to confirm that the environment uses Python 3.
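
The same check can be made from inside Python. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the original notes.

```python
import sys

# True when the running interpreter is Python 3
is_python3 = sys.version_info.major == 3

# True inside a virtual environment: venv and recent virtualenv releases set
# sys.base_prefix, while older virtualenv releases set sys.real_prefix
in_virtualenv = (
    hasattr(sys, "real_prefix")
    or getattr(sys, "base_prefix", sys.prefix) != sys.prefix
)

print(f"python3={is_python3} virtualenv={in_virtualenv} ({sys.executable})")
```

If the printed interpreter path does not point into the virtual environment, the environment is probably not activated.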

Steps

  • scrapy startproject name
  • scrapy genspider botname url

Set ROBOTSTXT_OBEY = True in settings.py so the spider only crawls pages that robots.txt permits and stays a good web citizen.
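
The robots.txt setting lives in the settings.py that scrapy startproject generates. The fragment below is a sketch of the relevant lines for a project called name; DOWNLOAD_DELAY is not mentioned in these notes and is included only as an assumed, optional courtesy setting.

```python
# settings.py (fragment) for a project created with `scrapy startproject name`

BOT_NAME = "name"

# Respect robots.txt: requests to disallowed URLs are skipped
ROBOTSTXT_OBEY = True

# Assumption, not from the original notes: wait about one second
# between requests so the crawl stays light on the target site
DOWNLOAD_DELAY = 1
```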

  • inside the project folder, run scrapy crawl botname
  • test selectors in the shell
  • run scrapy crawl botname -o xx.json (or xx.csv) to export the results and inspect them
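
To look at the exported results from Python rather than in an editor, something like the following could work. It is a sketch that assumes the crawl was exported with -o to a .json or .csv file; load_items is a hypothetical helper name, not a Scrapy function.

```python
import csv
import json
from pathlib import Path


def load_items(path):
    """Load items exported by `scrapy crawl botname -o <path>` (.json or .csv)."""
    path = Path(path)
    if path.suffix == ".json":
        # The JSON export is a single array with one object per item
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if path.suffix == ".csv":
        # The CSV export has a header row followed by one row per item
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported export format: {path.suffix}")
```

For example, load_items("xx.json") returns a list of dictionaries, one per scraped item, which makes it easy to check how many items came back and whether any fields are empty.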

shell to debug and test

scrapy shell

  • check that the URL can be fetched - fetch(url)
  • check the HTML that was received - view(response) opens it in a browser

Alternative xpath testing tool
http://www.freeformatter.com/xpath-tester.html

Xpath docs

XPath queries run on a selector built from the response.

A selector, as the name suggests, selects parts of the HTML content:
from scrapy.selector import Selector
Since this is such a common operation, response.selector.xpath() is shortened to response.xpath().

Extra
CSS selectors can also be used via response.css(), but Scrapy translates them to XPath internally, so XPath is the more fundamental option.

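//name or //* - selects every instance of the tag name, or every element, anywhere in the document
text() - the text content of the element, returned as a Unicode string
'(//name)[1]' - the first match in the document (XPath counts from 1), equivalent to taking [0] of the Python result list for '//name'; use either. Note that '//name[1]' without parentheses selects the first name child of every parent
. - a leading dot makes the path relative to the current selector instead of the whole response, which matters when querying a nested selector; omitting the leading // also works
@ - selects an attribute, e.g. //a/@href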
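
The indexing and relative-path rules are easy to get wrong, so here is a small sketch of them. It deliberately uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath, so no text() or @attr steps) instead of Scrapy's Selector, so that it runs without Scrapy installed; the markup is made up for illustration. In Scrapy the same paths would be passed to response.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, kept well-formed so the stdlib XML parser accepts it
DOC = """
<html><body>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</body></html>
"""
root = ET.fromstring(DOC)

# .//li : every <li> anywhere below the root
all_items = [li.text for li in root.findall(".//li")]

# .//li[1] : the first <li> of EACH parent, not the first one in the document
first_per_parent = [li.text for li in root.findall(".//li[1]")]

# Index 0 of the Python result list is the first match overall
first_overall = root.findall(".//li")[0].text

# A path starting with "." is evaluated relative to the node it is called on
second_list = root.findall(".//ul")[1]
nested = [li.text for li in second_list.findall("./li")]

print(all_items, first_per_parent, first_overall, nested)
```

The point of the comparison is that a positional predicate counts within each parent, while indexing the Python list counts across the whole result.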

If an element has an itemprop attribute, select on it rather than on class.
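
A sketch of selecting on itemprop follows. As an assumption for portability it uses the standard library's xml.etree.ElementTree rather than Scrapy, and the markup and field names are invented; in Scrapy the equivalent query would look like response.xpath("//*[@itemprop='name']/text()").

```python
import xml.etree.ElementTree as ET

# Hypothetical product markup: the class values are presentational,
# while itemprop names what each element actually holds
DOC = """
<div class="col-md-4 product">
  <span class="txt-lg bold" itemprop="name">Widget</span>
  <span class="txt-lg" itemprop="price">9.99</span>
</div>
"""
root = ET.fromstring(DOC)

# [@itemprop='...'] matches any element carrying that attribute value
name = root.find(".//*[@itemprop='name']").text
price = root.find(".//*[@itemprop='price']").text

print(name, price)
```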

Tool to get an XPath quickly (the XPath Helper extension for Chrome):

https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl

