1. Getting Started with the Scrapy Framework

Author: 思绪太重_飘不动 | Published 2019-06-10 19:10


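Using the Scrapy framework

1. Creating a crawler project

1. Create the Scrapy project:
    scrapy startproject project_name    (project_name is the name of the project)

2. Create a spider:
    cd project_name    (switch into the project directory first, then create the spider)
    scrapy genspider spider_name spider.com    (spider_name is the spider's name; spider.com is the domain of the site to crawl)

3. Write the crawling code in spider_name.py

4. Run the spider:
    Option 1: cd into the folder that contains the spider and run scrapy runspider spider_name.py
    Option 2: scrapy crawl spider_name
    Option 3: create a start.py containing the following code: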
    import scrapy.cmdline

    # Run a scrapy command from Python
    def main():
        # Start the spider with log output
        # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
        # scrapy.cmdline.execute("scrapy crawl movie".split())
        # Start the spider without log output
        scrapy.cmdline.execute("scrapy crawl movie --nolog".split())


    if __name__ == '__main__':
        main()

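2. Extracting text in the spider file

print(type(response))   # show the type of the response object
print(response.text)    # show the body as a string
print(response.body)    # show the body as raw bytes
extract(): returns the text of every matched node as a list of strings
extract_first(): returns the text of the first matched node only

1. Scrapy ships with XPath support, so XPath is the usual way to parse a page.
2. Prefer extract_first() when you want a single value: if nothing matches it returns None instead of raising an error, whereas indexing extract()[0] raises IndexError.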

3. Example: crawling movies from a US TV series site

Crawl the latest movies listed at url = 'https://www.meijutt.com/new100.html'

The scraped data is: { movie title: name, category: mjjp, TV network: mjtv, last update time: data_time }

4. The full code

# 1. start.py, created by hand in the project directory
import scrapy.cmdline


# Run a scrapy command from Python
def main():
    # Start the spider with log output
    # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
    # scrapy.cmdline.execute("scrapy crawl movie".split())
    # Start the spider without log output
    scrapy.cmdline.execute("scrapy crawl movie --nolog".split())
    # Save the results to a JSON file from the command line
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.json --nolog".split())
    # Save the results to an XML file from the command line
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.xml --nolog".split())
    # Save the results to a CSV file from the command line
    # scrapy.cmdline.execute("scrapy crawl movie -o movie.csv --nolog".split())


if __name__ == '__main__':
    main()
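
The commented-out variants in start.py differ only in the argument list they pass to scrapy.cmdline.execute. One possible way to avoid editing strings by hand is a small helper that builds that list. The helper name build_crawl_argv and its parameters are hypothetical, introduced only for this sketch; the -o and --nolog flags are the ones used above.

```python
def build_crawl_argv(spider_name, output_file=None, show_log=True):
    """Build the argument list for scrapy.cmdline.execute (hypothetical helper)."""
    argv = ['scrapy', 'crawl', spider_name]
    if output_file:
        # -o writes the scraped items to a file; the format follows the extension.
        argv += ['-o', output_file]
    if not show_log:
        # --nolog suppresses Scrapy's log output.
        argv.append('--nolog')
    return argv


argv = build_crawl_argv('movie', output_file='movie.json', show_log=False)
print(argv)
```

The resulting list can be passed as scrapy.cmdline.execute(argv) in place of the split() strings.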

# 2. movie.py (the spider file you create)
# -*- coding: utf-8 -*-
import scrapy
from ..items import MeijuItem


# Inherits from the base class scrapy.Spider
class MovieSpider(scrapy.Spider):
    name = 'movie'  # spider name (the one passed to scrapy crawl)
    allowed_domains = ['www.meijutt.com']   # domains the spider is allowed to crawl
    start_urls = ['https://www.meijutt.com/new100.html']    # list of URLs to start crawling from

    # parse() is where the data is extracted
    # response: the server's response, which contains the data we want
    def parse(self, response):
        movie_list = response.xpath('//ul[@class="top-list  fn-clear"]/li')
        for movie in movie_list:
            name = movie.xpath('./h5/a/text()').extract_first()
            mjjp = movie.xpath('./span[@class="mjjq"]/text()').extract_first()
            mjtv = movie.xpath('./span[@class="mjtv"]/text()').extract_first()
            data_time = movie.xpath('./div[@class="lasted-time new100time fn-right"]/text()').extract_first()
            # print(name, mjjp, mjtv, data_time)

            item = MeijuItem()
            item['name'] = name
            item['mjjp'] = mjjp
            item['mjtv'] = mjtv
            item['data_time'] = data_time

            # yield hands the item over to the pipelines in pipelines.py
            yield item

# 3. items.py (defines the model for the scraped data)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# Define the model, similar to a Model in Django
class MeijuItem(scrapy.Item):
    # Fields of the scraped content
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # title
    mjjp = scrapy.Field()  # category
    mjtv = scrapy.Field()  # TV network
    data_time = scrapy.Field()  # last update time

# 4. pipelines.py (pipelines handle storing the data)
# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


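# Item pipeline: handles storing the scraped content
# Note: a pipeline does not deduplicate items automatically; any duplicate check has to be written explicitly
class MeijuPipeline(object):
    def __init__(self):
        pass

    # Called when the spider opens. open_spider is a hook name Scrapy looks for; add it yourself when you need it
    def open_spider(self, spider):
        print('Crawl started......')
        self.fp = open('movie.txt', 'a', encoding='utf-8')

    # Handles each incoming item; called once per item
    # item: each item yielded by parse() in the spider file
    # spider: the spider object
    def process_item(self, item, spider):
        string = str((item['name'], item['mjjp'], item['mjtv'], item['data_time'])) + '\n'
        self.fp.write(string)
        self.fp.flush()
        return item

    # Called when the spider closes. close_spider is a hook name Scrapy looks for; add it yourself when you need it
    def close_spider(self, spider):
        print('Crawl finished......')
        self.fp.close()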

# 5. settings.py (most settings are commented out by default; remove the comment marker to enable one)
# -*- coding: utf-8 -*-

# Configuration file for the crawler
# Scrapy settings for meiju project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Name of the project
BOT_NAME = 'meiju'

# Where Scrapy looks for spider files
SPIDER_MODULES = ['meiju.spiders']
# Where newly generated spiders are placed
NEWSPIDER_MODULE = 'meiju.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the User-Agent; commented out by default
#USER_AGENT = 'meiju (+http://www.yourdomain.com)'

# Obey robots.txt rules
# The generated project obeys robots.txt; set this to False to ignore it
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# Limits on concurrent requests per domain / per IP (the values below are the template's commented-out examples)
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# Spider middlewares; disabled by default
#SPIDER_MIDDLEWARES = {
#    'meiju.middlewares.MeijuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middlewares; disabled by default
#DOWNLOADER_MIDDLEWARES = {
#    'meiju.middlewares.MeijuDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Item pipelines; commented out by default, so this block must be uncommented to use a pipeline
ITEM_PIPELINES = {
   'meiju.pipelines.MeijuPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
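
The integer next to each entry in ITEM_PIPELINES decides the order: items pass through the pipelines from the lowest number to the highest, and the values are conventionally kept in the 0-1000 range. A small sketch of that ordering, in plain Python. The second entry, 'meiju.pipelines.DedupPipeline', and the helper pipeline_order are hypothetical and only there to have something to sort.

```python
# Hypothetical ITEM_PIPELINES value; the DedupPipeline entry is made up for illustration.
ITEM_PIPELINES = {
    'meiju.pipelines.MeijuPipeline': 300,
    'meiju.pipelines.DedupPipeline': 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths in the order items would pass through them (lowest number first)."""
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

With these numbers the duplicate filter would run before the pipeline that writes to movie.txt.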
