1. Dataset introduction:
This dataset contains the features shown in the sample below; the goal is to use these features to decide whether to hire a candidate.
Code:
# Quick look at the dataset
import pandas as pd
import getpass

username = getpass.getuser()
Hire_data = pd.read_csv('/Users/{0}/Documents/MLCourse/PastHires.csv'.format(username))
Hire_data.head()
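For reference, you can confirm the column layout directly. The column meanings listed below are inferred from the feature-mapping code in section 3.2; verify them against your copy of the file:
# Inferred column order (assumption): years of experience, employed?,
# previous employers, level of education, top-tier school, interned, hired
print(Hire_data.columns.tolist())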
(Figure: dataset sample, the output of Hire_data.head())
2. Install the required packages. Please install Anaconda3 and Python 3 in advance; for reference, see: Anaconda installation.
2.1 Install the pydotplus package (used for visualizing the decision tree)
Installation: in a Terminal, run conda install pydotplus
2.2 Install PySpark (used for running machine learning jobs on Spark)
Installation: in a Terminal, run pip install pyspark
2.3 Install Pandas
Installation: in a Terminal, run pip install pandas
2.4 Install NumPy
Installation: in a Terminal, run pip install numpy
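To confirm that everything installed correctly, here is a quick import check (a minimal sketch; the printed versions will vary with your environment):
# Verify the installed packages by importing them and printing their versions
import pydotplus
import pyspark
import pandas
import numpy
print(pyspark.__version__, pandas.__version__, numpy.__version__)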
3. Build a decision tree model and predict the outcome
3.1 Label and features
The Hired column is used as the label; all other columns are used as features.
3.2 Implementation:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark import SparkConf, SparkContext
from numpy import array
import pandas as pd
import getpass
username = getpass.getuser()
# Initialize a Spark Context:
conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree")
sc = SparkContext(conf = conf)
# The functions binary and mapEducation convert non-numeric values to numbers for easier processing
# The function createLabeledPoints returns the label column together with the required feature columns
def binary(YN):
    if (YN == 'Y'):
        return 1
    else:
        return 0

def mapEducation(degree):
    if (degree == 'BS'):
        return 1
    elif (degree == 'MS'):
        return 2
    elif (degree == 'PhD'):
        return 3
    else:
        return 0
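# Quick sanity check of the two helpers (illustrative addition, not in the
# original walkthrough): any unrecognized value falls through to 0
assert binary('Y') == 1 and binary('N') == 0
assert mapEducation('MS') == 2 and mapEducation('High School') == 0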
# Convert a list of raw fields from our CSV file to a
# LabeledPoint that MLLib can use. All data must be numerical...
def createLabeledPoints(fields):
    yearsExperience = int(fields[0])
    employed = binary(fields[1])
    previousEmployers = int(fields[2])
    educationLevel = mapEducation(fields[3])
    topTier = binary(fields[4])
    interned = binary(fields[5])
    hired = binary(fields[6])
    return LabeledPoint(hired, array([yearsExperience, employed,
                                      previousEmployers, educationLevel, topTier, interned]))
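# Example (hypothetical row, for illustration only): a candidate with 10 years
# of experience, currently employed, 4 previous employers, a BS degree, no
# top-tier school, and no internship, who was hired:
#   createLabeledPoints(['10', 'Y', '4', 'BS', 'N', 'N', 'Y'])
# returns LabeledPoint(1.0, [10.0, 1.0, 4.0, 1.0, 0.0, 0.0])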
# Reading the local file with SparkContext creates an RDD
rawData = sc.textFile('/Users/{0}/Documents/MLCourse/PastHires.csv'.format(username))
# Grab the header row with the field names (first() returns a plain string, not an RDD)
header = rawData.first()
# Filter out the header row; the result is still an RDD
rawData = rawData.filter(lambda x: x != header)
# Split each line on commas into separate fields
csvData = rawData.map(lambda x: x.split(","))
# Build the training set
trainingData = csvData.map(createLabeledPoints)
# Build a test example; each number corresponds to a feature column:
# [yearsExperience, employed, previousEmployers, educationLevel, topTier, interned]
testCandidates = [ array([10, 1, 3, 1, 0, 0]) ]
# Turn the test data into an RDD; if RDDs are unfamiliar, see the Spark RDD documentation
testData = sc.parallelize(testCandidates)
# Train a decision tree on the training set. categoricalFeaturesInfo tells MLlib
# which features are categorical and how many categories each one has:
#   1: 2 -> feature index 1 (employed)       has 2 categories (Y/N)
#   3: 4 -> feature index 3 (educationLevel) has 4 categories (none/BS/MS/PhD)
#   4: 2 -> feature index 4 (topTier)        has 2 categories (Y/N)
#   5: 2 -> feature index 5 (interned)       has 2 categories (Y/N)
model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={1:2, 3:4, 4:2, 5:2},
                                     impurity='gini', maxDepth=5, maxBins=32)
# Feed the test data into the decision tree model
predictions = model.predict(testData)
# Print the final prediction
print('Hire prediction:')
results = predictions.collect()
for result in results:
    print(result)
The predicted result is Hire.
# Print how the decision tree works internally
print('Learned classification tree model:')
print(model.toDebugString())
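As an optional sanity check (not part of the original walkthrough), you can also measure how well the model fits its own training data, using the standard MLlib evaluation pattern:
# Sketch: accuracy on the training set (assumes model and trainingData from above)
trainPredictions = model.predict(trainingData.map(lambda p: p.features))
labelsAndPredictions = trainingData.map(lambda p: p.label).zip(trainPredictions)
accuracy = labelsAndPredictions.filter(lambda lp: lp[0] == lp[1]).count() / float(trainingData.count())
print('Training accuracy: {0}'.format(accuracy))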
The decision tree analysis diagram can be found in the course at https://www.udemy.com/course/data-science-and-machine-learning-with-python-hands-on/
If you repost this article, please credit the source. Finally, I hope this summary helps you. Happy learning :)