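A designer friend of mine wants to change jobs but isn't sure what to do next. To help, I scraped ZCOOL (站酷, zcool) for the qualifications and job descriptions of graphic designer positions in [Beijing] with a [10-15K] salary, then ran word segmentation on the text to see which words come up most often. She can plan her résumé around those high-frequency words and also see where her own gaps are.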
Libraries used
requests, BeautifulSoup (bs4), jieba
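Approach
This site is much easier to scrape than Zhaopin (智联招聘). First build a list of page numbers, then use requests to fetch each listing page and collect the URL of every job posting. Next, visit each of those URLs, parse out the job description, and append it to one string. Then segment that string with jieba, drop the words whose part-of-speech tags are not wanted, count word frequencies with collections, and finally save the result to a local file.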
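The filtering and counting step described above can be tried offline before touching the network. The following is only a minimal sketch: `SAMPLE_PAIRS` is made-up data standing in for the (word, flag) pairs that jieba.posseg would produce, and `top_words` is a hypothetical helper name, not something from the script.

```python
from collections import Counter

# Made-up (word, part-of-speech flag) pairs standing in for jieba.posseg output.
SAMPLE_PAIRS = [
    ("设计", "n"), ("负责", "v"), ("的", "u"), ("设计", "n"),
    ("以上", "f"), ("经验", "n"), ("熟练", "a"), ("设计", "n"), ("经验", "n"),
]

# Same part-of-speech tags the script excludes, stored as a set for fast lookup.
STOP_FLAGS = {"b", "c", "d", "f", "df", "m", "mq", "p", "r", "rr", "s", "t", "u", "z"}


def top_words(pairs, stop_flags=STOP_FLAGS, min_len=2, limit=100):
    """Keep words of at least min_len characters whose flag is not excluded,
    then return the `limit` most common ones as (word, count) tuples."""
    words = [word for word, flag in pairs
             if len(word) >= min_len and flag not in stop_flags]
    return Counter(words).most_common(limit)


result = top_words(SAMPLE_PAIRS)
print(result)
```

In this sample, the one-character particle and the word tagged `f` are dropped, and the remaining words are ranked by how often they appear.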
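Product perspective
I originally scraped Zhaopin for my friend, but most of the positions there are not specialized design roles. ZCOOL is more specialized, so its results are probably a more useful reference for her. The downside is that there is rather little data.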
Code
import requests
from bs4 import BeautifulSoup
import jieba.posseg as psg
from collections import Counter
# Page numbers of the first four listing pages, as strings
pages = [str(i + 1) for i in range(4)]
# Request headers used when fetching URLs
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
urls = []
# Collect the job-detail URLs from every listing page
for page in pages:
    url = 'https://www.zcool.com.cn/job/searchpost.do?keys=%E5%B9%B3%E9%9D%A2&p=' + page + '&from=&search_cityid=47&search_districtid=0&search_experienceid=0&search_diplomaid=-1&search_stageid=0&search_industryid=0&search_workstatus=0&search_salaryid=8&orderflag='
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Append each URL found to the urls list, without its query string
    for x in soup.find_all('a', class_='f-18 c-282828 text-overflow'):
        link = x.get('href')
        if link:
            link_without_args = link.split('?')[0]
            urls.append(link_without_args)
print(urls)
# Scrape the job description and qualifications from each URL
descs = ''
num = 1
for link in urls:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    desc_list = soup.find_all('div', class_='l30 f-14 c-666 mt-20')
    for i in desc_list:
        desc = i.get_text().strip()
        descs = descs + desc
    # Progress counter: number of postings processed so far
    print(num)
    num += 1
print('Finished!')
# Segment with jieba, keep words of two or more characters, and exclude certain part-of-speech tags
result_words_with_attr = [(x.word, x.flag) for x in psg.cut(descs) if len(x.word) >= 2]
stop_attr = ['b','c','d','f','df','m','mq','p','r','rr','s','t','u','z']
words = [x[0] for x in result_words_with_attr if x[1] not in stop_attr]
# Count the 100 most frequent words
words_seq = Counter(words).most_common(100)
print(words_seq)
# Save the results to a local file, one "word:count" pair per line
with open('zcool_seq.txt', mode='w', encoding='utf-8') as f:
    for i, j in words_seq:
        f.write(i + ':' + str(j) + '\n')
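One possible tidy-up, offered as a sketch rather than as part of the script above: the long listing URL can be assembled from a parameter dictionary with the standard library's `urllib.parse`, and the query string of each job link can be stripped with `urlsplit`. The helper names `build_list_url` and `strip_query` are hypothetical, and the comments about what `search_cityid=47` and `search_salaryid=8` mean are assumptions inferred from the search described at the top (Beijing, 10-15K).

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

BASE_URL = 'https://www.zcool.com.cn/job/searchpost.do'


def build_list_url(page):
    """Build the listing-page URL for a given page number (hypothetical helper)."""
    params = {
        'keys': '平面',            # search keyword, percent-encoded by urlencode
        'p': page,
        'from': '',
        'search_cityid': 47,      # assumed to mean Beijing
        'search_districtid': 0,
        'search_experienceid': 0,
        'search_diplomaid': -1,
        'search_stageid': 0,
        'search_industryid': 0,
        'search_workstatus': 0,
        'search_salaryid': 8,     # assumed to mean the 10-15K salary band
        'orderflag': '',
    }
    return BASE_URL + '?' + urlencode(params)


def strip_query(link):
    """Return the link without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(link)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


print(build_list_url(1))
print(strip_query('https://example.com/job/1.html?from=list'))
```

With these helpers, the listing loop could call `build_list_url(page)` for each page number and pass every `href` through `strip_query`, which keeps the search parameters readable and easy to change for another city or salary band.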