Installation Environment
macOS Environment
Install the C compiler toolchain first:
xcode-select --install
Install Scrapy
pip3 install Scrapy
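To confirm the installation succeeded, you can print the installed version (the version command is also listed in the built-in help shown later):
scrapy version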
Creating a Project
scrapy startproject xxx (project name)
scrapy startproject firstProject
New Scrapy project 'firstProject', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
/Users/baxiang/Documents/Python/Scrapy/firstProject
You can start your first spider with:
cd firstProject
scrapy genspider example example.com
Project Structure
.
├── firstProject
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files
items.py
Defines the data fields to be scraped and processed later.
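As a sketch, an Item for a news article might declare its fields like this (the field names are illustrative, not part of the generated template):
import scrapy


class NewsItem(scrapy.Item):
    # hypothetical fields for a scraped news article
    title = scrapy.Field()
    url = scrapy.Field()
    publish_date = scrapy.Field()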
settings.py
Configuration file for the Scrapy project (concurrency, delays, pipelines, etc.).
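A few commonly adjusted settings, shown with illustrative values (the pipeline class name should match whatever your generated pipelines.py contains):
BOT_NAME = 'firstProject'
ROBOTSTXT_OBEY = True       # respect robots.txt
DOWNLOAD_DELAY = 1          # seconds to wait between requests to the same site
ITEM_PIPELINES = {
    'firstProject.pipelines.FirstprojectPipeline': 300,
}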
pipelines.py
Holds the post-processing logic for scraped items, such as cleaning, validation, or saving to a database.
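A minimal sketch of a pipeline that simply passes items through (a real one might clean fields or write to a database); enable it via ITEM_PIPELINES as shown above:
class FirstprojectPipeline(object):
    def process_item(self, item, spider):
        # inspect, clean, or store the item here, then return it
        return item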
Common Commands
$ scrapy -h
Scrapy 1.5.0 - project: firstProject
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
Creating a Spider
Create the spider project:
scrapy startproject Toscrape
Create the spider file:
scrapy genspider news www.163.com
The generated news.py contains:
import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['www.163.com']
    start_urls = ['http://www.163.com/']

    def parse(self, response):
        pass
- name: the unique identifier of a spider within the project; a single project can contain multiple spiders.
- allowed_domains: the domains the spider is allowed to crawl.
- start_urls: the URL(s) the spider starts crawling from.
- parse: the callback invoked to parse a page once the engine has finished downloading it (see the sketch below).
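As a rough sketch of what a filled-in parse might look like, assuming we only want the text and href of links on the 163.com front page (the CSS selectors and dictionary keys are illustrative, not from the original post):
import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['www.163.com']
    start_urls = ['http://www.163.com/']

    def parse(self, response):
        # illustrative: yield the text and URL of every link on the page
        for link in response.css('a'):
            yield {
                'title': link.css('::text').extract_first(),
                'url': link.css('::attr(href)').extract_first(),
            }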