Background: scrape the technology stack from each Lagou (lagou.com) job posting and write it back into the previously scraped CSV file as a new column.
The implementation works as follows:
- Use pandas' read_csv() to read the PositionId column from the CSV file, then visit the matching detail page for each id
- Set headers, cookies, time.sleep, and try-except to get around the site's anti-scraping measures
- Fetch the page source and parse it with BeautifulSoup's html.parser
- Clean the extracted text with regular expressions
- Tokenize the text with jieba.lcut() from the jieba library and collect the words into a list
- Store each tech stack in the list list_position_detail first (to_csv() rewrites the whole file and only the last write survives, so every row has to be added in a single pass)
- Then move the tech stacks into a dict, whose keys decide which row each value lands in; a dict has no append(), so the pairs are appended to a list first and then merged into one dict (see the sketch after this list)
- Use pandas' to_csv() to write the new column
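
Before the full script, here is a minimal sketch of the list → dict → new-column flow from the last three steps. The sample posting text and the column name `technology` are made up for illustration; id 11072 is the one used for debugging in the full script:

```python
# Minimal sketch of the clean -> tokenize -> list of dicts -> new-column flow.
import re
import jieba
import pandas as pd

# One fake posting: {positionId: job-description text}
sample = {11072: '熟悉Java、Spring,了解MySQL'}

# 1) Append one {positionId: techstack} dict per posting to a plain list
rows = []
for pid, text in sample.items():
    no_cn = re.sub('[\u4e00-\u9fa5]|\n', '', text)  # strip Chinese characters
    words = [re.sub('[^a-zA-Z]', '', w) for w in jieba.lcut(no_cn)]
    rows.append({pid: str([w for w in words if w])})

# 2) Merge the list of dicts into one dict (a dict itself has no append())
tech_by_id = {}
for d in rows:
    tech_by_id.update(d)

# 3) map() drops each value into the row whose positionId matches the key
df = pd.DataFrame({'positionId': [11072]})
df['technology'] = df['positionId'].map(tech_by_id)
print(df)
```

The full script below applies this same pattern to every positionId read from the CSV.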
The new column ends up holding the information that was circled in red in the accompanying screenshot (image not preserved). The full script:

```python
import requests
from bs4 import BeautifulSoup
import time
import random
import pandas as pd
import jieba
import re
# Random delays: a short one between requests, a longer one after a failure
time1 = random.uniform(0.05, 0.1)
time2 = random.uniform(0.1, 0.2)
def position_detail(position_id):
    """Fetch one job detail page and return its cleaned tech-stack tokens."""
    try:
        print(position_id)  # debug: show which posting is being fetched
        time.sleep(time1)
        headers = {
            # 'User-Agent': 'Mozilla/5.0 (Windows NT xx; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/xx.x.xxxx.xxx Safari/xxx.xx',
            'Host': 'www.lagou.com',
            'Referer': 'https://www.lagou.com/jobs/list_Java/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=',
            'X-Anit-Forge-Code': '0',
            'X-Anit-Forge-Token': 'None',
            'X-Requested-With': 'XMLHttpRequest'
        }
        # Warm up a session on the list page first so it picks up valid
        # cookies, then reuse them for the detail page (anti-scraping measure)
        session = requests.Session()
        session.get(
            'https://www.lagou.com/jobs/list_Java/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=',
            headers=headers)
        cookies = session.cookies
        url = 'https://www.lagou.com/jobs/%s.html' % position_id
        result = requests.get(url, headers=headers, cookies=cookies)
        soup = BeautifulSoup(result.content, 'html.parser', from_encoding='utf-8')
        position_content = soup.find(class_="job_detail")
        # Raw, uncleaned text of the job description
        text_1 = position_content.text
        # Strip Chinese characters and newlines, then tokenize
        k = re.sub('[\u4e00-\u9fa5]|\n', '', text_1)
        k_1 = jieba.lcut(k)
        # Keep only the alphabetic tokens, i.e. the technology names
        a = []
        for i in k_1:
            s = re.sub('[^a-zA-Z]', '', i)
            if s != '':
                a.append(s)
        print(str(a))
        # Return the cleaned tech stack
        return a
        # --- Alternatives kept for reference (unreachable after the return) ---
        # Method 1: dump the list to a file instead of returning it
        # (word-count usage per https://www.jb51.net/article/173492.htm):
        # with open('technology_1.csv', 'a', encoding='UTF-8') as file:
        #     file.write(str(a) + '\n')
        #
        # Method 2: keyword matching against a whitelist of known technologies
        # (completed as a standalone sketch after this listing):
        # text_list = jieba.lcut(text_1)
        # technology = ['javascript', 'jquery']
        # list_technology = [tech for tech in text_list if tech in technology]
        # print(list_technology)
    except Exception:
        # On failure, back off a little longer and log the id for a retry
        time.sleep(time2)
        print(position_id)
        return None
def read(file):
    # Direct alternative for debugging a single posting:
    # for x in [11072]:
    #     position_detail(x)
    # Read only the positionId column. (The original skiprows=1 also threw
    # away the first data row together with the header, losing one posting.)
    data = pd.read_csv(file, usecols=[0])
    # Collect every result in a list first: to_csv() rewrites the whole file,
    # so the new column has to be built in a single pass
    list_position_detail = []
    for i in data.iloc[:, 0]:
        x = int(i)  # plain int, so the key matches the positionId column
        obj = {x: str(position_detail(x))}
        list_position_detail.append(obj)
    # A dict has no append(), so the {id: stack} pairs were appended to a
    # list above; merge them into one dict, whose keys pick the target rows
    type_dict = {}
    for i in list_position_detail:
        type_dict.update(i)
    print(type_dict)
    # Add the new column via map() on positionId and write the file back
    df = pd.read_csv(file)
    df['technology'] = df['positionId'].map(type_dict)
    df.to_csv(file, index=False, header=True)

read(r'D:\pycharm\coderush\拉勾网爬取\32-2.csv')
```
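
The commented-out "method 2" above can also be completed into a standalone helper. A minimal sketch, assuming a hand-maintained whitelist; the list below is only a stand-in, and `match_technology` is a hypothetical name:

```python
# "Method 2" from the comments above as a runnable sketch: keep only tokens
# that appear in a whitelist of known technologies. The whitelist and the
# helper name are stand-ins, not part of the original script.
import jieba

def match_technology(text, technology=('javascript', 'jquery', 'java', 'mysql')):
    # Compare case-insensitively so 'Java' still matches 'java'
    return [t for t in jieba.lcut(text) if t.lower() in technology]

print(match_technology('熟悉Java开发,会用jQuery'))  # expected: ['Java', 'jQuery']
```

Compared with the regex cleaning used in the main script, the whitelist only keeps terms you explicitly listed, at the cost of having to maintain that list.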