美文网首页
Learning Spark [9] - MLlib库 - Tr

Learning Spark [9] - MLlib库 - Tr

作者: 屹然1ran | 来源:发表于2021-02-19 10:02 被阅读0次

    基于树模型(Tree-Based Models)

    基于树模型,比如决策树,梯度提升树,随机森林等,相对比回归模型,是较为好解释的(Interpret)

    决策树(Decision Tree)

    决策树模型是由一系列的if-then-else规则构成,用于解决分类或回归问题。


    决策树模型:是否接受offer
    # decision tree
    from pyspark.ml.regression import DecisionTreeRegressor
    from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer
    
    df = DecisionTreeRegressor(labelCol = 'price')
    
    # filter for just numeric columns
    numericCols = [field for (field, dataType) in trainDF.dtypes
                   if ((dataType == 'double') & (field != 'price'))]
    
    # combine output of StringIndexer defined above and numeric columns
    # dummy variable 
    categoricalCols = [field for (field, dataTypes) in trainDF.dtypes if dataTypes == 'string']
    indexOutputCols = [x + 'Index' for x in categoricalCols]
    assembleInputs = indexOutputCols + numericCols
    
    vecAssembler = VectorAssembler(inputCols = assembleInputs, outputCol = 'features')
    
    stringIndexer = StringIndexer(inputCols=categoricalCols,                               
                                  outputCols=indexOutputCols,                               
                                  handleInvalid = 'skip')
    
    stages = [stringIndexer, vecAssembler, df]
    pipeline = Pipeline(stages = stages)
    
    df.setMaxBins(40)
    pipelineModel = pipeline.fit(trainDF)
    
    dtModel = pipelineModel.stages[-1]
    print(dtModel.toDebugString)
    
    DecisionTreeRegressionModel: uid=DecisionTreeRegressor_901986020659, depth=5, numNodes=47, numFeatures=33
      If (feature 12 <= 2.5)
       If (feature 12 <= 1.5)
        If (feature 5 in {1.0,2.0})
         If (feature 4 in {0.0,1.0,3.0,5.0,9.0,10.0,11.0,13.0,14.0,16.0,18.0,24.0})
          If (feature 3 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,23.0,24.0,25.0,26.0,27.0,28.0,29.0,30.0,31.0,32.0,33.0,34.0})
           Predict: 104.23992784125075
          Else (feature 3 not in
    ...
    

    相关文章

      网友评论

          本文标题:Learning Spark [9] - MLlib库 - Tr

          本文链接:https://www.haomeiwen.com/subject/azygxltx.html