Scraping Job Listings from 58同城 (58.com) with the Scrapy Framework (Part 1)

Author: www_Dicky | Published: 2016-07-14 11:01


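Background: for a practical-training project at school, we were asked to scrape job-listing websites and analyze the data. Here I share how I scraped job postings from 58同城 and what I ended up with.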

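Preparation:

1. Set up the environment: first install everything Scrapy needs. I won't go into detail here. There are many tutorials online (search Baidu); some are incomplete or inaccurate, so read several and get the environment working first. I did the installation on Windows 7.

2. Once the environment is ready, learn how the Scrapy framework is structured and how a crawl flows through it. There are plenty of introductions online, so I won't repeat them either. To quote Baidu Baike: Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.

There is also a Chinese-language site about Scrapy that is worth studying. For this project I only learned the first few topics it covers.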

Writing the code:

1. Create a new project from cmd:

scrapy startproject tc    (tc, short for 58同城, is the project name)

2. Write the items class for the project:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TcItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()    # job title
    Cpname = scrapy.Field()  # company name
    pay = scrapy.Field()     # salary
    edu = scrapy.Field()     # education requirement
    num = scrapy.Field()     # number of openings
    year = scrapy.Field()    # years of experience required
    FL = scrapy.Field()      # benefits

These are the fields I defined for the data I want to scrape.

3. In the spiders directory, create tc_spider.py. Its code is as follows:

# -*- coding: utf-8 -*-

import scrapy

from tc.items import TcItem


class TcSpider(scrapy.Spider):
    name = 'tc'
    allowed_domains = ['jn.58.com']

    # Page 1 of the listing.
    start_urls = [
        "http://jn.58.com/tech/pn1/?utm_source=market&spm=b-31580022738699-me-f-824.bdpz_biaoti&PGTID=0d303655-0010-915b-ca53-cb17de8b2ef6&ClickID=3"
    ]

    # The listing URL is split around the page number so that
    # pages 2 to 76 can be generated and appended to start_urls.
    theurl = "http://jn.58.com/tech/pn"
    theurl2 = "/?utm_source=market&spm=b-31580022738699-me-f-824.bdpz_biaoti&PGTID=0d303655-0010-915b-ca53-cb17de8b2ef6&ClickID=3"
    for i in range(75):
        n = i + 2
        the_url = theurl + str(n) + theurl2
        start_urls.append(the_url)

    def parse(self, response):
        # Listing page: collect the link to each job-detail page
        # and request it, handing the response to parse_item.
        hrefs = response.xpath("//*[@id='infolist']/dl/dt/a/@href").extract()
        for href in hrefs:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

        # Pagination
        # next_page = response.xpath("//*[@class='nextMsg']/a/@href")
        # if next_page:
        #     url = response.urljoin(next_page[0].extract())
        #     yield scrapy.Request(url, self.parse)

    def parse_item(self, response):
        # Detail page: fill one TcItem. Each .extract() call returns a list of strings.
        item = TcItem()
        item['name'] = response.xpath("//*[@class='headConLeft']/h1/text()").extract()
        item['Cpname'] = response.xpath("//*[@class='company']/a/text()").extract()
        item['pay'] = response.xpath("//*[@class='salaNum']/strong/text()").extract()
        item['edu'] = response.xpath("//*[@class='xq']/ul/li[1]/div[2]/text()").extract()
        item['num'] = response.xpath("//*[@class='xq']/ul/li[2]/div[1]/text()").extract()
        item['year'] = response.xpath("//*[@class='xq']/ul/li[2]/div[2]/text()").extract()
        item['FL'] = response.xpath("//*[@class='cbSum']/span/text()").extract()
        yield item
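
One thing to keep in mind when analyzing the results: .extract() returns a list of strings, so every field of a scraped item holds a list rather than a single value. Below is a minimal sketch of how those lists could be flattened into plain strings before analysis. It uses only the standard library. The helper names clean_field and clean_item and the sample values are hypothetical, not part of the original project, and it assumes the extracted text may carry stray whitespace.

```python
def clean_field(values):
    """Join the text fragments from one .extract() call into a single stripped string."""
    parts = [v.strip() for v in values if v and v.strip()]
    return " ".join(parts)


def clean_item(raw):
    """Return a plain dict with every list-valued field collapsed to a string."""
    return {key: clean_field(values) for key, values in raw.items()}


# Hypothetical sample shaped like one item from parse_item; not real scraped data.
sample = {
    "name": ["\n  Java Developer  "],
    "Cpname": ["Example Co."],
    "pay": ["5000-8000"],
    "FL": ["Insurance", " Paid leave "],
    "edu": [],  # an XPath that matched nothing yields an empty list
}

cleaned = clean_item(sample)
print(cleaned)
```

A field whose XPath matched nothing becomes an empty string here, which makes missing values easy to spot and filter out later.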