Python Crawler: Some Thoughts on Scraping Lagou Job Listings (Part 1)

Author: 飞飞飞段啊 | Published 2019-12-08 22:45

    Lagou loads all of its job listings via Ajax. Open Chrome DevTools, switch to the Network tab and filter by XHR, and you can see the request that contains all the job data we want.


    Figure 1

    The request is a POST. Looking at its parameters, there are query-string parameters (Query) and form data (Formdata). The query string carries the city name plus one other parameter that never changes, so we can ignore it;
    in the form data, first means "is this the first page": True on the first page and False otherwise, which you can verify by clicking to the next page; pn is the page number, and kd is the job search keyword (a minimal request sketch follows the figures below);


    Figure 2
    Figure 3
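
    To make those parameters concrete, here is a minimal sketch of the request (the URL, query string and form fields come from the capture above; the headers are deliberately left out, and in practice you need the fuller set shown in the code further down, otherwise Lagou rejects the request):

        import requests

        # Query string: the city name plus one fixed parameter we can ignore
        page_url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'

        # Form data: first = "is this the first page", pn = page number, kd = search keyword
        form_data = {'first': True, 'pn': 1, 'kd': '爬虫'}

        response = requests.post(page_url, data=form_data, timeout=10)
        print(response.status_code)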

    Click to the next page and you will see an extra sid parameter. Searching for sid with Ctrl+F turns up nothing, so it is not generated by JS obfuscation; searching for the parameter's value instead shows that it appears in the response returned by the first-page request


    Figure 4
    Clicking the search result shows that the value is in the first page's response
    Figure 5
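
    In other words, the showId field returned in the first page's JSON is what gets sent as sid from page 2 onward. A minimal sketch of that hand-off (first_page_json here stands for the parsed JSON of the first-page response):

        # showId from the first page's response becomes sid for pages 2, 3, ...
        sid = first_page_json['content']['showId']
        next_page_form = {'first': False, 'pn': 2, 'kd': '爬虫', 'sid': sid}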

    That completes the analysis of Lagou's requests
    Now let's put together some code and test it

    import csv

    import requests


    class HtmlDownloader:
        def __init__(self):
            self.kd = {'kd': '爬虫'}  # search keyword ("爬虫" = crawler)
            self.filename = '拉勾职位_爬虫.csv'  # output CSV file
            self.headers = {
                'Accept': 'application/json, text/javascript, */*; q=0.01',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
                'Connection': 'keep-alive',
                'Cookie': '_ga=GA1.2.233679007.1573439286; user_trace_token=20191111102804-e3c3fe93-042a-11ea-a62c-5254005c3644; LGUID=20191111102804-e3c4028b-042a-11ea-a62c-5254005c3644; gate_login_token=73e156ba52977761089f3d37583de42e8dcd1ea2506698c2; LG_HAS_LOGIN=1; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; privacyPolicyPopup=false; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216e584b77cf4b3-014da43e02f0f4-7711a3e-1049088-16e584b77d12c1%22%2C%22%24device_id%22%3A%2216e584b77cf4b3-014da43e02f0f4-7711a3e-1049088-16e584b77d12c1%22%7D; hasDeliver=138; JSESSIONID=ABAAABAAAGGABCB7680CBFEC7E039473761CB87660C6AEC; WEBTJ-ID=20191208193410-16ee549d1f9235-0af59637342299-2393f61-1049088-16ee549d1fa53b; _putrc=83E18668C00BF1E9; _gid=GA1.2.1239443746.1575804851; login=true; unick=%E9%BB%84%E5%AD%90%E8%89%AF; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1575209025,1575596847,1575623707,1575804851; index_location_city=%E5%B9%BF%E5%B7%9E; X_MIDDLE_TOKEN=35ce96b8be0ed94c0c4b6f6f0aa959c6; _gat=1; LGSID=20191208214605-14c59413-19c1-11ea-abcb-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%2Fp-city_213%3F%26cl%3Dfalse%26fromSearch%3Dtrue; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%2Fp-city_213%3F%26cl%3Dfalse%26fromSearch%3Dtrue%26labelWords%3D%26suginput%3D; LG_LOGIN_USER_ID=5b475f3bae04f1c78d6e85d9eba4a64b8214963fe963fcbe; TG-TRACK-CODE=index_search; X_HTTP_TOKEN=280ce513cf0023da718218575117ddab9d2d21bc93; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1575812837; LGRID=20191208214657-33873159-19c1-11ea-abcb-525400f775ce; SEARCH_ID=a8cdcc911b3343deb4204374bf7a3ace',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Host': 'www.lagou.com',
                'Origin': 'https://www.lagou.com',
                'Referer': 'https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput='
            }
            self.headers.update(get_ua())  # get_ua(): external helper (not shown in this post) returning a random User-Agent dict
    
        def get_index(self, url, **kwargs):
            # kwargs carries first / pn / (optional) sid; merge in the search keyword
            kwargs.update(self.kd)
            response = requests.post(url=url, headers=self.headers, data=kwargs, timeout=None)
            if response.status_code == 200:
                return response.json()
            return None
    
        def parse(self, response):
            results = response['content']['positionResult']['result']
            for result in results:
                positionId = result['positionId']
                companyId = result['companyId']
                companyFullName = result['companyFullName']
                companyLabelList = result['companyLabelList']
                companySize = result['companySize']
                createTime = result['createTime']
                district = result['district']
                education = result['education']
                financeStage = result['financeStage']  # funding stage
                industryField = result['industryField']  # company's line of business, e.g. advertising/marketing
                positionAdvantage = result['positionAdvantage']  # perks / benefits
                positionName = result['positionName']
                salary = result['salary']
                workYear = result['workYear']
                yield {
                    '职位id': positionId,
                    '公司id': companyId,
                    '职位名称': positionName,
                    '薪水': salary,
                    '工作年限': workYear,
                    '公司全名': companyFullName,
                    '公司标签': companyLabelList,
                    '公司规模': companySize,
                    '职位创建时间': createTime,
                    '所在城市区域': district,
                    '学历要求': education,
                    '融资情况': financeStage,
                    '公司业务性质': industryField,
                    '福利': positionAdvantage
                }
    
        def save(self, data):
            with open(self.filename, 'a', encoding='utf_8_sig', newline='') as f:
                header = ['职位id', '公司id', '职位名称', '薪水',
                          '工作年限', '公司全名', '公司标签', '公司规模',
                          '职位创建时间', '所在城市区域', '学历要求', '融资情况', '公司业务性质', '福利'
                          ]
                writer = csv.DictWriter(f, fieldnames=header)
                writer.writerow(data)
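
        def create_file(self):
            # Not in the original post: run() calls create_file(), which presumably creates
            # a fresh CSV and writes the header row that save() then appends rows under.
            header = ['职位id', '公司id', '职位名称', '薪水',
                      '工作年限', '公司全名', '公司标签', '公司规模',
                      '职位创建时间', '所在城市区域', '学历要求', '融资情况', '公司业务性质', '福利'
                      ]
            with open(self.filename, 'w', encoding='utf_8_sig', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=header)
                writer.writeheader()
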
        def start(self, url):
            print('正在请求第一页数据')
            response = self.get_index(url, first=True, pn=1)
            if response:
                sid = response['content']['showId']
                for item in self.parse(response):
                    self.save(item)
                print('第一页数据写入完毕')
                return sid
    
        def run(self, url):
            self.create_file()
            sid = self.start(url)
            for i in range(2, 4):
                print(f'开始请求第{i}页数据')
                response = self.get_index(url, first=False, pn=i, sid=sid)
                if response:
                    for item in self.parse(response):
                        self.save(item)
                    print(f'存入第{i}页数据完毕')
    
    
    if __name__ == '__main__':
        page_url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
        html_downloader = HtmlDownloader()
        # html_downloader.get_index(page_url, first=True, pn=1)
        html_downloader.run(page_url)
    
    

    Note: for Lagou, the headers should include the later fields such as Host and Referer; make them as complete as possible. In my tests, sending only the fields above the Cookie key in the code made the API return false with a "please try again later" message; once the later fields were added the data came through. The code still needs improvement, e.g. making the search keyword and city name configurable. This post only sketches a basic approach to pulling data from Lagou, so please go easy on me (✪ω✪)
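
    A simple way to tell the two cases apart, mirroring the check used in the updated get_index further below: a successful payload carries the job list under a content key, while the anti-crawler rejection does not.

        def looks_like_job_data(data):
            # Successful responses contain data['content'];
            # the "too frequent, try again later" rejection does not.
            return isinstance(data, dict) and 'content' in data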

    Update: lastly, a fix for the "operation too frequent" response that comes back while scraping, which I also stumbled on while searching around. Lagou uses cookies to decide whether a request comes from a crawler, and the cookies change on every refresh, so you have to keep a logged-in session. The solution is to use requests' Session class.
    Here is a rough sketch:

        def get_index(self, url, **kwargs):
            # Hit the search-result page first so the session picks up fresh cookies
            session = requests.Session()
            session.get(self.index_url, headers=self.headers)
            cookies = session.cookies
            time.sleep(random.randint(3, 5))  # requires: import random, time
            response = requests.post(url=url, headers=self.headers, data=kwargs, cookies=cookies)
            if response.status_code == 200:
                data = response.json()
                if 'content' in data:
                    return data
                print(data)
                # print('too many requests, try again later')
            return None
    

    The self.index_url in the code above is the URL of the request made after searching for the keyword.
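
    The post does not show the actual value of self.index_url; judging by the Referer header used earlier, it is presumably the keyword search-result page, roughly like this (an assumption, not taken from the original code):

        # Assumed value for self.index_url, derived from the Referer header above
        self.index_url = 'https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput='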

    All of the captures above require logging in to your own Lagou account first and then inspecting the requests.
