Job listings on Lagou are loaded via Ajax. Open Chrome DevTools, switch to the Network tab, and filter by XHR; you can spot the request that carries all the job data we want.
Figure 1
It is a POST request. Looking at its parameters, there are Query String parameters and Form Data. Of the query parameters, one is the city name and the other never changes, so we can ignore it.
In the Form Data, first indicates whether this is the first page: True for page one, False otherwise (click through to the next page to verify); pn is the page number, and kd is the job keyword to search for.
Figure 2
Figure 3
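To double-check the analysis before writing the full crawler, here is a minimal sketch of replaying that XHR with requests. The headers here are deliberately abbreviated placeholders; as noted at the end of the post, the real request needs the full header set (including your logged-in Cookie):

import requests

# Minimal replay of the XHR found above; header values are placeholders.
url = ('https://www.lagou.com/jobs/positionAjax.json'
       '?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false')
headers = {
    'Referer': 'https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB',
    'User-Agent': 'Mozilla/5.0',  # placeholder UA
}
data = {'first': True, 'pn': 1, 'kd': '爬虫'}  # page 1, keyword "爬虫"
resp = requests.post(url, headers=headers, data=data, timeout=10)
print(resp.json())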
Click to the next page and you will see an extra sid parameter. Pressing Ctrl+F to search for sid turns up nothing in the scripts, so it is not generated by JS encryption. Searching for the parameter's value instead shows it comes back in the response to the first-page request.
Figure 4
Clicking through the search result confirms that it lives in the first page's response.
Figure 5
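In code, that just means reading showId out of the first page's JSON and echoing it back as the sid form field on every later page. Continuing the sketch above:

# content -> showId in the first-page response is the sid that
# pages 2+ must send back as a form field.
first_page = resp.json()
sid = first_page['content']['showId']
data = {'first': False, 'pn': 2, 'kd': '爬虫', 'sid': sid}
page2 = requests.post(url, headers=headers, data=data, timeout=10)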
That completes the analysis of Lagou's requests.
Now for some code to test it out:
import csv
import requests


def get_ua():
    # Minimal stand-in for the UA helper the class relies on:
    # returns a User-Agent header dict (plug in your own rotation logic).
    return {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/78.0.3904.108 Safari/537.36'}


class HtmlDownloader:
    def __init__(self):
        self.kd = {'kd': '爬虫'}  # search keyword
        self.filename = '拉勾职位_爬虫.csv'
        self.header = ['职位id', '公司id', '职位名称', '薪水',
                       '工作年限', '公司全名', '公司标签', '公司规模',
                       '职位创建时间', '所在城市区域', '学历要求',
                       '融资情况', '公司业务性质', '福利']
        self.headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': '_ga=GA1.2.233679007.1573439286; user_trace_token=20191111102804-e3c3fe93-042a-11ea-a62c-5254005c3644; LGUID=20191111102804-e3c4028b-042a-11ea-a62c-5254005c3644; gate_login_token=73e156ba52977761089f3d37583de42e8dcd1ea2506698c2; LG_HAS_LOGIN=1; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; privacyPolicyPopup=false; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216e584b77cf4b3-014da43e02f0f4-7711a3e-1049088-16e584b77d12c1%22%2C%22%24device_id%22%3A%2216e584b77cf4b3-014da43e02f0f4-7711a3e-1049088-16e584b77d12c1%22%7D; hasDeliver=138; JSESSIONID=ABAAABAAAGGABCB7680CBFEC7E039473761CB87660C6AEC; WEBTJ-ID=20191208193410-16ee549d1f9235-0af59637342299-2393f61-1049088-16ee549d1fa53b; _putrc=83E18668C00BF1E9; _gid=GA1.2.1239443746.1575804851; login=true; unick=%E9%BB%84%E5%AD%90%E8%89%AF; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1575209025,1575596847,1575623707,1575804851; index_location_city=%E5%B9%BF%E5%B7%9E; X_MIDDLE_TOKEN=35ce96b8be0ed94c0c4b6f6f0aa959c6; _gat=1; LGSID=20191208214605-14c59413-19c1-11ea-abcb-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%2Fp-city_213%3F%26cl%3Dfalse%26fromSearch%3Dtrue; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%2Fp-city_213%3F%26cl%3Dfalse%26fromSearch%3Dtrue%26labelWords%3D%26suginput%3D; LG_LOGIN_USER_ID=5b475f3bae04f1c78d6e85d9eba4a64b8214963fe963fcbe; TG-TRACK-CODE=index_search; X_HTTP_TOKEN=280ce513cf0023da718218575117ddab9d2d21bc93; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1575812837; LGRID=20191208214657-33873159-19c1-11ea-abcb-525400f775ce; SEARCH_ID=a8cdcc911b3343deb4204374bf7a3ace',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput='
        }
        self.headers.update(get_ua())

    def get_index(self, url, **kwargs):
        kwargs.update(self.kd)
        response = requests.post(url=url, headers=self.headers, data=kwargs, timeout=None)
        if response.status_code == 200:
            return response.json()
        return None

    def parse(self, response):
        results = response['content']['positionResult']['result']
        for result in results:
            positionId = result['positionId']
            companyId = result['companyId']
            companyFullName = result['companyFullName']
            companyLabelList = result['companyLabelList']
            companySize = result['companySize']
            createTime = result['createTime']
            district = result['district']
            education = result['education']
            financeStage = result['financeStage']  # funding stage
            industryField = result['industryField']  # line of business, e.g. advertising/marketing
            positionAdvantage = result['positionAdvantage']  # perks
            positionName = result['positionName']
            salary = result['salary']
            workYear = result['workYear']
            yield {
                '职位id': positionId,
                '公司id': companyId,
                '职位名称': positionName,
                '薪水': salary,
                '工作年限': workYear,
                '公司全名': companyFullName,
                '公司标签': companyLabelList,
                '公司规模': companySize,
                '职位创建时间': createTime,
                '所在城市区域': district,
                '学历要求': education,
                '融资情况': financeStage,
                '公司业务性质': industryField,
                '福利': positionAdvantage
            }

    def create_file(self):
        # Write the CSV header once; save() then appends rows below it.
        with open(self.filename, 'w', encoding='utf_8_sig', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=self.header)
            writer.writeheader()

    def save(self, data):
        with open(self.filename, 'a', encoding='utf_8_sig', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=self.header)
            writer.writerow(data)

    def start(self, url):
        print('Requesting page 1...')
        response = self.get_index(url, first=True, pn=1)
        if response:
            sid = response['content']['showId']  # needed as sid on later pages
            for item in self.parse(response):
                self.save(item)
            print('Page 1 written.')
            return sid

    def run(self, url):
        self.create_file()
        sid = self.start(url)
        for i in range(2, 4):  # pages 2-3; widen the range for more
            print(f'Requesting page {i}...')
            response = self.get_index(url, first=False, pn=i, sid=sid)
            if response:
                for item in self.parse(response):
                    self.save(item)
                print(f'Page {i} saved.')


if __name__ == '__main__':
    page_url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    html_downloader = HtmlDownloader()
    # html_downloader.get_index(page_url, first=True, pn=1)
    html_downloader.run(page_url)
Note: the Lagou headers must include the trailing fields like Host, Origin, and Referer; keep them as complete as you can. In my tests, sending only the fields above the Cookie key got back success: false with a "请稍后再试" (please try again later) message, while adding the remaining fields returned data. The code still has room to improve, e.g. taking the search keyword and city name as parameters; this post is just a sketch of the basic approach to pulling data from Lagou, so go easy on me (✪ω✪)
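As a rough sketch of that extension, reusing the class as written (the keyword '数据分析' and the city '上海' below are just example inputs):

from urllib.parse import quote

# Hypothetical extension: build the endpoint per city instead of
# hard-coding Guangzhou, and swap in any search keyword.
def build_url(city):
    return ('https://www.lagou.com/jobs/positionAjax.json'
            f'?city={quote(city)}&needAddtionalResult=false')

downloader = HtmlDownloader()
downloader.kd = {'kd': '数据分析'}            # example keyword
downloader.filename = '拉勾职位_数据分析.csv'  # keep the filename in sync
downloader.run(build_url('上海'))             # example city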
Update: finally, a word on handling the "operation too frequent" responses that come back while scraping, which I also discovered by searching around. Lagou decides whether you are a crawler based on your cookies, and the cookies change on every refresh, so you need to keep a logged-in state. The fix is to use requests' Session class.
Here is a rough pseudocode sketch:
import random
import time

def get_index(self, url, **kwargs):
    # Hit the search page first so the session picks up fresh cookies,
    # then reuse those cookies on the Ajax POST.
    session = requests.Session()
    session.get(self.index_url, headers=self.headers)
    cookies = session.cookies
    time.sleep(random.randint(3, 5))  # random pause between the two requests
    response = requests.post(url=url, headers=self.headers, data=kwargs, cookies=cookies)
    data = response.json()
    if response.status_code == 200 and 'content' in data:
        return data
    else:
        print(data)
        # print('Too frequent, try again later')
        return None
self.index_url in the code above is the URL of the results page you land on after searching for the keyword (not the Ajax endpoint itself).
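A small variant worth trying, since a Session persists cookies across requests automatically: create the session once and send the POSTs through it too, instead of copying the cookie jar each time. A sketch under that assumption:

import random
import time

# One long-lived Session on the class: cookies picked up from the
# search page ride along on every later POST automatically.
def get_index(self, url, **kwargs):
    if not hasattr(self, 'session'):
        self.session = requests.Session()
        self.session.headers.update(self.headers)
        self.session.get(self.index_url)  # prime the cookies once
    time.sleep(random.randint(3, 5))
    response = self.session.post(url, data=kwargs)
    data = response.json()
    if response.status_code == 200 and 'content' in data:
        return data
    print(data)
    return None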
All of the captures above were done after first logging in to your own Lagou account.