Scraping Python Job Listings in Shanghai from Lagou and Storing Them in MongoDB

Author: Treehl | Published 2017-12-09 18:26

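Yesterday I set out to scrape Python job listings from Lagou. My usual approach, bs4 + requests, came back with no data, which was a real downer! After some searching I learned that Lagou loads its listings with Ajax, so looking for them in the HTML with bs4 finds nothing. Today I'm writing up what I learned, partly to consolidate it for myself.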

Analyzing the page

Go to the Lagou site and open the browser's developer tools.

First, let's send a request with requests and save the response as an HTML file so we can inspect the data.

import requests
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2995.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2986.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.0 Safari/537.36'
]

headers = {
    'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(user_agents)
}

url = 'https://www.lagou.com/jobs/list_Python?px=default&city=%E4%B8%8A%E6%B5%B7#filterBox'
r = requests.get(url, headers=headers)
result = r.text
# print(result)
# Save the response to lagou.html
with open('lagou.html', 'w', encoding='utf-8') as f:
    f.write(result)

Run the code and open lagou.html: the job listings are not in it.
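
Why the listings are missing can be illustrated offline. The sketch below parses a made-up HTML shell with the standard library's html.parser. The markup, the job-list id, and the script name are all hypothetical and are not Lagou's real page; the point is only that a container filled in later by JavaScript is empty in the raw response, so an HTML parser has nothing to find.

```python
from html.parser import HTMLParser

# Hypothetical static HTML, standing in for what requests.get() returns:
# the list container exists, but JavaScript would fill it in later via Ajax.
STATIC_HTML = """
<html><body>
  <ul id="job-list"></ul>
  <script src="/load-jobs.js"></script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts the <li> start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.items = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.items += 1


def count_list_items(html):
    """Return how many <li> elements appear in the given HTML text."""
    parser = ListItemCounter()
    parser.feed(html)
    return parser.items


# No <li> elements exist in the raw HTML, so the count is zero
print(count_list_items(STATIC_HTML))
```

The same thing happens with bs4: it only sees the text that the server sent, not what the browser builds afterwards.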


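Next, look at the Network tab in Chrome's developer tools and filter by type XHR. Find the positionAjax.json request (its name contains the telltale keywords Ajax and json), then click Preview.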


Expand the nodes in order (content, then positionResult, then result) and you reach the data we want.

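Now let's try writing it. Note that this request is a POST: send the form data along with it, and adjust the request headers accordingly.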

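# The Ajax endpoint (the same URL used in the pagination section below),
# not the HTML page requested earlier
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'

# Form data sent with the POST request
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}

r = requests.post(url, headers=headers, data=data).json()
# The job listings sit under content -> positionResult -> result
positions = r['content']['positionResult']['result']
print(positions)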

Run it, and the returned data is exactly what we want!
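
The chained lookup r['content']['positionResult']['result'] raises KeyError whenever the response has a different shape, for example if the site answers with an error message instead of listings. Here is a slightly more defensive sketch. extract_positions is a hypothetical helper name, and both sample payloads are made up to mirror the nesting shown above; they are not real Lagou responses.

```python
def extract_positions(payload):
    """Return the list under content -> positionResult -> result, or []."""
    node = payload
    for key in ('content', 'positionResult', 'result'):
        # Stop early if the expected level is missing or is not a dict
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Made-up payloads for illustration only
ok = {'content': {'positionResult': {'result': [{'title': 'Python Developer'}]}}}
blocked = {'success': False, 'msg': 'too many requests'}

print(extract_positions(ok))
print(extract_positions(blocked))
```

Returning an empty list keeps the calling loop simple: a page with no usable data just contributes nothing.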


Pagination

The form contains a pn parameter, which is the page number. Try jumping between pages in the browser and watch how the data changes.


import time

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'

for i in range(1, 17):
    # pn is the page number; the other fields stay the same
    data = {
        'first': 'true',
        'pn': i,
        'kd': 'Python'
    }

    r = requests.post(url, headers=headers, data=data)
    # Pause between requests so we don't hit the server too quickly
    time.sleep(3)
    print(i, r.url)

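That requests all 16 pages and prints the URL of each request. Since pn is sent in the POST form rather than in the query string, the URL is identical for every page; only the form data changes.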
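
The pagination loop can also be wrapped in a generator. This is only a sketch, not the code from the repository: build_form and iter_pages are hypothetical names, and the HTTP call is passed in as a function so the loop can be tried without touching the network.

```python
import time


def build_form(page, keyword='Python'):
    """Form data for one results page, mirroring the dict used above."""
    return {'first': 'true', 'pn': page, 'kd': keyword}


def iter_pages(post, first_page=1, last_page=16, delay=3.0):
    """Yield (page, response) pairs, pausing between requests.

    `post` is any callable that takes the form dict and returns a response,
    e.g. lambda form: requests.post(url, headers=headers, data=form).json()
    """
    for page in range(first_page, last_page + 1):
        yield page, post(build_form(page))
        if page < last_page:
            time.sleep(delay)


# Dry run with a stand-in for the real POST call
def fake_post(form):
    return {'echo': form['pn']}


for page, response in iter_pages(fake_post, last_page=3, delay=0):
    print(page, response)
```

Passing the request function in keeps the paging logic separate from the network code, which makes it easy to swap in the real requests.post call once the dry run looks right.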


That's the overall approach to scraping Lagou. The complete code is on GitHub; feel free to take a look, and if you find it useful, please give it a star!
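
Storing the results in MongoDB, the last step named in the title, could look roughly like this. It is a minimal sketch under assumptions, not necessarily what the GitHub code does: save_positions is a hypothetical helper, and the database and collection names in the comment are placeholders. The helper accepts any object that offers insert_many(), which pymongo's Collection does.

```python
def save_positions(collection, positions):
    """Insert scraped position dicts into a MongoDB collection.

    `collection` is expected to behave like pymongo's Collection, i.e. to
    provide insert_many(). Returns the number of documents passed in.
    """
    # Copy each dict: pymongo adds an _id key to the documents it inserts
    docs = [dict(p) for p in positions]
    if not docs:
        return 0  # insert_many() rejects an empty list
    collection.insert_many(docs)
    return len(docs)


# Typical wiring (assumes pymongo is installed and MongoDB runs locally;
# the database and collection names are placeholders):
#
#   from pymongo import MongoClient
#   client = MongoClient('localhost', 27017)
#   save_positions(client['lagou']['positions'], positions)
```

Calling it once per page inside the pagination loop stores each batch as soon as it arrives.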

Finally, here is a screenshot of the scraped data.
[Image: screenshot of the scraped data (upload failed in the original post)]

Feel free to visit Treehl's blog
GitHub
Jianshu
Crawler collection


Link to this article: https://www.haomeiwen.com/subject/uerpixtx.html