Crawler Techniques (2): Using the Scrapy Framework

Author: 袁梦祥941115 | Published: 2018-06-12 21:06

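1. Creating a custom spider

scrapy startproject zhihurb

Directory structure

scrapy.cfg: the project's configuration file (rarely edited).
zhihurb/: the project's Python module. Your code goes here.
zhihurb/items.py: the project's item definitions.
zhihurb/pipelines.py: the project's item pipelines.
zhihurb/settings.py: the project's settings.
zhihurb/spiders/: the directory that holds the spider code.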

Common settings.py options:
LOG_LEVEL = 'ERROR'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1  # delay between downloads, in seconds
DEFAULT_REQUEST_HEADERS = {}  # overrides the default request headers
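
The empty dict above is only a placeholder. A filled-in version might look like the following sketch; the header values are illustrative assumptions, not values from the original project.

```python
# settings.py (fragment). The header values are assumptions chosen for
# illustration; replace them with whatever the target site expects.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}
```

These headers are sent with every request unless a request sets its own.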

2. XPath selectors

To test an expression, run scrapy shell [URL to test] on the command line, then evaluate it in the shell, for example:
response.xpath('//div[@class="tqtongji2"]/ul[position()>1]/li[1]/a/text()').extract()

extract()
Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.

extract_first()
Return the first matched node as a string, or None if nothing matched.

re(regex)
Apply the given regex and return a list of unicode strings with the matches.
re_first() does the same but returns only the first match.

Getting all the text inside a node
node = response.xpath('//div[@class="content"]')[0]
article = node.xpath('string(.)').extract_first()

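3. Pausing and resuming a crawl

scrapy crawl article -s JOBDIR=crawls/article
After interrupting the crawl (press Ctrl-C once and let it shut down), run the same command again to resume from where it stopped.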

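4. Exporting data

To simply save the scraped items to a file, pass the -o option:
scrapy crawl heartsong -o index.xml
Supported formats include csv, json, and xml.

For more complex processing, put the logic in pipelines.py. Enable (uncomment) the ITEM_PIPELINES setting in settings.py and set it to the pipeline class you want to use; changing this entry switches the implementation.

ITEM_PIPELINES = {
   'music163.pipelines.MongoPipeline': 300,
}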
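
The MongoPipeline class itself is not shown in this article. To illustrate the shape of a pipeline without requiring a database, here is a minimal sketch that writes items to a JSON Lines file instead. The class name JsonLinesPipeline and the default file name items.jl are hypothetical.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path='items.jl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        # Return the item so that later pipelines still receive it.
        return item
```

To use it in the zhihurb project, the ITEM_PIPELINES key would be 'zhihurb.pipelines.JsonLinesPipeline'. The number (300 here) sets the order in which pipelines run: lower values run first.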

Demo

from scrapy import Spider, Request
from zhihurb.items import ZhihurbItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = ['https://daily.zhihu.com/']

    def parse(self, response):
        # Collect the article links on the index page.
        urls = response.xpath('//div[@class="box"]/a/@href').extract()
        for url in urls:
            # Turn relative links into absolute URLs.
            url = response.urljoin(url)
            print(url)
            yield Request(url, callback=self.parse_url)

    def parse_url(self, response):
        # Extract the title and the full article text, then store them in an item.
        name = response.xpath('//h1[@class="headline-title"]/text()').extract_first()
        node = response.xpath('//div[@class="content"]')[0]
        article = node.xpath('string(.)').extract_first()
        item = ZhihurbItem()
        item['name'] = name
        item['article'] = article

        # Hand the item to the pipelines / feed exporter.
        yield item

Article link: https://www.haomeiwen.com/subject/hftqdftx.html
