Running a Decision Tree Classifier with Spark MLlib (with Code)


Author: 夜空中最亮的星Simon | Published 2020-07-19 12:01

    1. Dataset overview:

    This dataset contains the features shown below, which are used to decide whether to hire a candidate.

    Code:

    # Quick look at the dataset
    import getpass
    import pandas as pd

    username = getpass.getuser()
    Hire_data = pd.read_csv('/Users/{0}/Documents/MLCourse/PastHires.csv'.format(username))
    Hire_data.head()

    Sample of the dataset (shown as an image in the original post)
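    Since the sample table does not reproduce here, a few illustrative rows in the same shape can be sketched with pandas. Note these values are hypothetical, and the column names are assumptions inferred from the field-mapping code later in the post, not read from the actual file:

    ```python
    import pandas as pd

    # Hypothetical rows mirroring the structure of PastHires.csv;
    # column names are inferred from the parsing code further below
    sample = pd.DataFrame(
        [[10, 'Y', 4, 'BS',  'N', 'N', 'Y'],
         [0,  'N', 0, 'PhD', 'Y', 'Y', 'Y'],
         [2,  'N', 1, 'MS',  'N', 'N', 'N']],
        columns=['Years Experience', 'Employed?', 'Previous employers',
                 'Level of Education', 'Top-tier school', 'Interned', 'Hired'])
    print(sample)
    ```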

    2. Install the required packages (please install Anaconda3 with Python 3 first; see: Anaconda installation guide)

    2.1 Install pydotplus (used for visualizing the decision tree)

    How to install: in a Terminal, run conda install pydotplus

    2.2 Install PySpark (used for running machine learning jobs on Spark)

    How to install: in a Terminal, run pip install pyspark

    2.3 Install Pandas

    How to install: in a Terminal, run pip install pandas

    2.4 Install NumPy

    How to install: in a Terminal, run pip install numpy


    3. Build a decision tree model and predict

    3.1 Label and features

    The Hired column is the label; all other columns are the features.

    3.2 Implementation:

    from pyspark.mllib.regression import LabeledPoint

    from pyspark.mllib.tree import DecisionTree

    from pyspark import SparkConf, SparkContext

    from numpy import array

    import pandas as pd

    import getpass

    username = getpass.getuser()

    # Initialize a Spark Context:

    conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree")

    sc = SparkContext(conf = conf)

    # binary and mapEducation convert non-numeric values to numbers so MLlib can use them

    # createLabeledPoints returns a LabeledPoint holding the label and the feature columns

    def binary(YN):

        if (YN == 'Y'):

            return 1

        else:

            return 0

    def mapEducation(degree):

        if (degree == 'BS'):

            return 1

        elif (degree =='MS'):

            return 2

        elif (degree == 'PhD'):

            return 3

        else:

            return 0
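    As a side note, the same mapping can be written more compactly as a dict lookup; this sketch behaves identically to the mapEducation above:

    ```python
    def map_education(degree):
        # BS -> 1, MS -> 2, PhD -> 3, anything else -> 0
        return {'BS': 1, 'MS': 2, 'PhD': 3}.get(degree, 0)

    print(map_education('PhD'), map_education('high school'))  # 3 0
    ```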

    # Convert a list of raw fields from our CSV file to a

    # LabeledPoint that MLLib can use. All data must be numerical...

    def createLabeledPoints(fields):

        yearsExperience = int(fields[0])

        employed = binary(fields[1])

        previousEmployers = int(fields[2])

        educationLevel = mapEducation(fields[3])

        topTier = binary(fields[4])

        interned = binary(fields[5])

        hired = binary(fields[6])

        return LabeledPoint(hired, array([yearsExperience, employed,
            previousEmployers, educationLevel, topTier, interned]))
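    To sanity-check the conversion logic without starting Spark, the same mapping can be applied to one hypothetical CSV row in plain Python (a tuple of label and feature list stands in for the LabeledPoint):

    ```python
    def binary(YN):
        return 1 if YN == 'Y' else 0

    def mapEducation(degree):
        return {'BS': 1, 'MS': 2, 'PhD': 3}.get(degree, 0)

    # A hypothetical row: 10 years experience, currently employed,
    # 3 previous employers, BS degree, not top-tier, no internship, hired
    fields = '10,Y,3,BS,N,N,Y'.split(',')
    label = binary(fields[6])
    features = [int(fields[0]), binary(fields[1]), int(fields[2]),
                mapEducation(fields[3]), binary(fields[4]), binary(fields[5])]
    print(label, features)  # 1 [10, 1, 3, 1, 0, 0]
    ```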

    # SparkContext reads the local file and creates an RDD

    rawData = sc.textFile('/Users/{0}/Documents/MLCourse/PastHires.csv'.format(username))

    # Grab the header line (the field names); first() returns a plain string

    header = rawData.first()

    # Keep every line except the header; the result is an RDD of data rows

    rawData = rawData.filter(lambda x:x != header)

    # Split each line on commas into fields

    csvData = rawData.map(lambda x: x.split(","))

    # Build the training set

    trainingData = csvData.map(createLabeledPoints)

    # Build a test candidate; the numbers correspond to the feature columns in order

    testCandidates = [ array([10, 1, 3, 1, 0, 0])]

    # Turn the test data into an RDD (see the Spark RDD documentation if RDDs are unfamiliar)

    testData = sc.parallelize(testCandidates)

    # Train the decision tree on the training set. categoricalFeaturesInfo maps a
    # categorical feature's index to its number of categories: 1:2 means field[1]
    # (the Employed? column) takes 2 values, since it contains only Y and N; 3:4
    # means field[3] (education) takes 4 values; 4:2 and 5:2 cover the remaining
    # two Y/N columns in the same way.

    model = DecisionTree.trainClassifier(trainingData, numClasses=2,

                                        categoricalFeaturesInfo={1:2, 3:4, 4:2, 5:2},

                                        impurity='gini', maxDepth=5, maxBins=32)
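    The impurity='gini' argument selects Gini impurity as the split criterion: for class proportions p_i at a node it equals 1 - Σ p_i², so a pure node scores 0 and an evenly mixed two-class node scores 0.5. A minimal illustration:

    ```python
    def gini(counts):
        # Gini impurity from raw class counts: 1 - sum of squared proportions
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    print(gini([10, 0]))  # 0.0 (pure node)
    print(gini([5, 5]))   # 0.5 (evenly mixed)
    ```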

    # Feed the test data to the decision tree model

    predictions = model.predict(testData)

    # Print the final prediction

    print('Hire prediction:')

    results = predictions.collect()

    for result in results:

        print(result)

    The predicted result is Hire.

    # Print the learned internal structure of the decision tree

    print('Learned classification tree model:')

    print(model.toDebugString())

    Decision tree structure output (shown as an image in the original post)

    Reference: https://www.udemy.com/course/data-science-and-machine-learning-with-python-hands-on/

    If you repost this article, please cite the source. I hope this summary helps you. Happy learning :)


    Original link: https://www.haomeiwen.com/subject/fmjlkktx.html