Scraper - (Old) 51JOB Job Data

Author: 花讽院_和狆 | Published 2020-03-12 10:26

    51JOB's data is much easier to work with than BOSS直聘's. As before, start by defining the item in items.py:

    import scrapy
    
    
    class PositionViewItem(scrapy.Item):
        # define the fields for your item here like:
        
        name :scrapy.Field = scrapy.Field()  # job title
        salary :scrapy.Field = scrapy.Field()  # salary
        education :scrapy.Field = scrapy.Field()  # education requirement
        experience :scrapy.Field = scrapy.Field()  # experience requirement
        jobjd :scrapy.Field = scrapy.Field()  # job ID
        district :scrapy.Field = scrapy.Field()  # district
        category :scrapy.Field = scrapy.Field()  # industry category
        scale :scrapy.Field = scrapy.Field()  # company size
        corporation :scrapy.Field = scrapy.Field()  # company name
        url :scrapy.Field = scrapy.Field()  # job URL
        createtime :scrapy.Field = scrapy.Field()  # publish date
        posistiondemand :scrapy.Field = scrapy.Field()  # job responsibilities
        cortype :scrapy.Field = scrapy.Field()  # company type
    

    Then, as before, use the URL of a nationwide search for data analysis positions as the start URL, and remember to fake a request header:

        name :str = 'job51Analysis'
        url :str = 'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
    
        headers :Dict = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
            'Referer': 'https://mkt.51job.com/tg/sem/pz_2018.html?from=baidupz'
        }
    
        def start_requests(self) -> Request:
            yield Request(self.url, headers=self.headers)
    

    Just pass the headers defined above into the Request as a parameter.
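
    Incidentally, the keyword segment in that long start URL is simply the search term '数据分析' URL-encoded twice. A standalone check (my own illustration, not part of the original spider):

    from urllib.parse import quote

    # 51JOB double-encodes the search keyword inside the URL path
    keyword = quote(quote('数据分析'))
    print(keyword)  # -> %25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590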

    The parsing here just uses the default callback parse (being lazy, I didn't define a custom one since this is only for temporary use):

        def parse(self, response):
            if response.status == 200:
                # one <div class="el"> node per job posting in the search results
                PositionInfos :selector.SelectorList = response.selector.xpath(r'//div[@class="el"]')
    

    How do we extract the information for a single job? First use XPath to select each job node into a list, then run a second, relative selection against each element of that list.

                for positioninfo in PositionInfos:  # iterate over the extracted SelectorList
                    pvi = PositionViewItem()
                    pvi['name'] :str = ''.join(positioninfo.xpath(r'p[@class="t1 "]/span/a/text()').extract()).strip()
                    pvi['salary'] :str = ''.join(positioninfo.xpath(r'span[@class="t4"]/text()').extract())
                    pvi['createtime'] :str = ''.join(positioninfo.xpath(r'span[@class="t5"]/text()').extract())
                    pvi['district'] :str = ''.join(positioninfo.xpath(r'span[@class="t3"]/text()').extract())
                    pvi['corporation'] :str = ''.join(positioninfo.xpath(r'span[@class="t2"]/a/text()').extract()).strip()
                    pvi['url'] :str = ''.join(positioninfo.xpath(r'p[@class="t1 "]/span/a/@href').extract())
    

    A single search result on 51JOB doesn't show the full job information; you have to click through to a second-level page. So here we grab the URL of the job detail page for the next stage of processing:

                    # follow the second-level (job detail) URL
                    if len(pvi['url']) > 0:
                        request :Request = Request(pvi['url'], callback=self.positiondetailparse, headers=self.headers)
                        request.meta['positionViewItem'] = pvi
                        yield request
    

    The code above uses a custom callback to handle the second-level page, and sets the meta attribute on the Request, which is used to pass data along with the request (meta travels inside the Request object within Scrapy rather than being sent to the server, but it is still best not to put anything too large in it). That way, positiondetailparse can retrieve the item instance that was passed over.

        def positiondetailparse(self, response) -> PositionViewItem:
            if response.status == 200:
                pvi :PositionViewItem = response.meta['positionViewItem']
                pvi['posistiondemand'] :str = ''.join(response.selector.xpath(r'//div[@class="bmsg job_msg inbox"]//p/text()').extract()).strip()
                pvi['cortype'] :str = ''.join(response.selector.xpath(r'//div[@class="com_tag"]/p[@class="at"][1]/@title').extract()).strip()  # note: XPath indexing starts at 1
                pvi['scale'] :str = ''.join(response.selector.xpath(r'//div[@class="com_tag"]/p[@class="at"][2]/@title').extract()).strip()
                pvi['category'] :str = ''.join(response.selector.xpath(r'//div[@class="com_tag"]/p[@class="at"][3]/@title').extract())
                pvi['education'] :str = ''.join(response.selector.xpath(r'//p[@class="msg ltype"]/text()[3]').extract()).strip()
                yield pvi
    

    When parsing the detail page, keep in mind that element indexes in XPath selectors start at 1, not 0.
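
    A quick standalone illustration (the markup below is made up just for this example):

    from scrapy.selector import Selector

    # made-up markup imitating the com_tag block on a job detail page;
    # [1] selects the FIRST matching <p>, unlike Python lists, which start at 0
    html = '<div class="com_tag"><p class="at" title="民营公司"></p><p class="at" title="150-500人"></p></div>'
    sel = Selector(text=html)
    print(sel.xpath(r'//div[@class="com_tag"]/p[@class="at"][1]/@title').get())  # -> 民营公司
    print(sel.xpath(r'//div[@class="com_tag"]/p[@class="at"][2]/@title').get())  # -> 150-500人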

    Once all the fields are filled in, yield the pvi to the pipeline, which handles processing and storage.

    After the individual jobs are scraped, we naturally want the next page as well. Back in parse:

                nexturl = ''.join(response.selector.xpath(r'//li[@class="bk"][2]/a/@href').extract())
                print(nexturl)
                if nexturl:
                    # nexturl = urljoin(self.url, ''.join(nexturl))
                    print(nexturl)
                    yield Request(nexturl, headers=self.headers)
    

    If no callback argument is given, the default parse method is called, which is exactly what we need to parse the next page.
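
    In other words, the yield above behaves the same as writing the callback out explicitly (equivalent form, shown only for clarity):

                yield Request(nexturl, callback=self.parse, headers=self.headers)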

    Finally, add the item-handling code to pipelines.py; here I chose to store the data in a CSV file.

    import os
    import csv
    
    class LearningPipeline(object):
    
        def __init__(self):
            # open the CSV in append mode; newline='' avoids extra blank rows on Windows
            self.file = open('51job.csv', 'a+', encoding='utf-8', newline='')
            self.writer = csv.writer(self.file, dialect='excel')
    
        def process_item(self, item, spider):
            # called once for every item the spider yields; skip rows without a job title
            if item['name']:
                self.writer.writerow([item['name'], item['salary'], item['district'], item['createtime'], item['education'],
                item['posistiondemand'], item['corporation'], item['cortype'], item['scale'], item['category']])
            return item
    
        def close_spider(self, spider):
            # runs when the spider closes; just close the file
            self.file.close()
    

    The __init__ method opens the file, and process_item is the default item-handling method; it is called once for every item that gets returned.

    close_spider runs when the spider shuts down; closing the file there is all that's needed.
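
    Also remember that the pipeline only runs if it is enabled in settings.py. The module path below is an assumption based on the LearningPipeline class name, so adjust it to your own project's package:

    # settings.py -- 'learning.pipelines' is a guess at the project package name
    ITEM_PIPELINES = {
        'learning.pipelines.LearningPipeline': 300,
    }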

    Note that the resulting CSV shows garbled Chinese when opened directly in Excel. I opened the file in Notepad and re-saved a copy in the local ANSI encoding, and the garbled text went away.
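
    An alternative worth noting (just a sketch, not what the pipeline above does) is to write the file with encoding='utf-8-sig', which prepends a BOM so that Excel detects UTF-8 on its own:

    import csv

    # minimal sketch: 'utf-8-sig' writes a UTF-8 BOM, which Excel uses to detect the
    # encoding, so Chinese text displays without mojibake; the filename is just for the demo
    with open('51job_demo.csv', 'w', encoding='utf-8-sig', newline='') as f:
        csv.writer(f, dialect='excel').writerow(['名称', '薪资', '地区'])  # sample header row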

    That's it. Now the spider can be run. It's only a very basic, simple scraper; I'm keeping it here as a reminder for myself.
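
    To run it, call the spider by the name defined earlier, from inside the Scrapy project directory:

    scrapy crawl job51Analysis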
