scrapy 爬虫
目标把gank上的图片趴下来
镇楼图
2018-07-30.jpg 2018-08-16.jpg 2018-09-19.jpg

// 初始化项目
scrapy startproject demo
修改items对象
import scrapy
import os
import requests
class GankItem(scrapy.Item):
    """Item for one gank.io daily page: title, image URL and page URL."""
    # define the fields for your item here like:
    name = scrapy.Field()
    imageurl = scrapy.Field()
    url = scrapy.Field()

    def canParse(self):
        """Return True when both the title and the image URL were scraped."""
        return self['name'] != '' and self['imageurl'] != ''

    def downLoad(self):
        """Download the image to <year>-<month>-<day>.<suffix> in the CWD.

        Skips files that already exist. Raises requests.HTTPError on a
        non-2xx response and requests.Timeout after 30 seconds.
        """
        filename = 'file'
        segments = self['url'].split('/')
        if len(segments) > 3:
            # the last three path segments of the page URL are year/month/day
            filename = '-'.join(segments[-3:])
        suffix = 'jpg'
        parts = self['imageurl'].split('.')
        if len(parts) >= 2:
            suffix = parts[-1]
        path = filename + '.' + suffix
        if not os.path.exists(path):
            print(path)
            # Fetch BEFORE opening the file: previously a failed request left
            # an empty file behind, and the exists() guard above would then
            # block every future retry for this image.
            r = requests.get(self['imageurl'], timeout=30)
            r.raise_for_status()
            with open(path, 'wb') as fp:
                fp.write(r.content)
pipelines（管道）
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class GankPipeline(object):
    """Pipeline that downloads the image for every fully-parsed item."""

    def process_item(self, item, spider):
        # Only items that scraped both a title and an image URL are usable.
        if item.canParse():
            item.downLoad()
        # Scrapy contract: process_item must return the item (or raise
        # DropItem) so that any later pipelines in ITEM_PIPELINES still
        # receive it; the original returned None and silently broke them.
        return item
新建 gank spider
import scrapy
from demo.spiders.gank import GankItem
class GankSpider(scrapy.Spider):
    """Crawl gank.io daily pages starting at 2018-10-22, yielding one
    GankItem per page and following the previous-day link recursively.

    Note: the original declared ``class GankSpider(scrapy.Spider, count=1)``;
    the unexpected ``count`` class keyword raises TypeError at class-creation
    time under Python 3, so it has been removed.
    """
    name = "gank"
    allowed_domains = ["gank.io"]
    start_urls = ["https://gank.io/2018/10/22"]

    def parse(self, response):
        item = GankItem()
        item['url'] = response.url
        # extract_first(default='') instead of extract()[0]: pages missing a
        # title or image no longer raise IndexError — the pipeline's
        # canParse() check already drops such empty items.
        item['name'] = response.xpath(
            '//div[@class="container content"]/h1/text()'
        ).extract_first(default='')
        item['imageurl'] = response.xpath(
            '//div[@class="container content"]'
            '/div[@class="outlink"]//p/img/@src'
        ).extract_first(default='')
        yield item
        # Follow the link to the previous day's page, if one exists.
        prev_href = response.xpath(
            '//div[@class="container content"]/div[@class="row"]'
            '/div[@class="six columns"]/p[@style="text-align: right"]'
            '/a/@href'
        ).extract_first()
        if prev_href:
            newurl = "https://gank.io" + prev_href
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)
修改 settings.py，打开 ITEM_PIPELINES 配置（注意项目名是 demo，不是 gank）：
ITEM_PIPELINES = {
    'demo.pipelines.GankPipeline': 300,
}
就跑起来了（spider 的 name 是 gank）：
scrapy crawl gank
年轻人注意身体
原文链接:https://blog.csdn.net/qq_22329521/article/details/83446096
网友评论