pyspark 决策树使用（分类和回归）

作者: 米斯特芳 | 来源:发表于2021-08-05 14:05 被阅读0次

pyspark 决策树使用（分类和回归）
机器学习——决策树
统计机器学习-决策树
决策树
机器学习算法——决策树1（基础）
对于树模型的一些见解
决策树的理解与应用
机器学习入门（十一）：决策树——既能分类又能回归的模型
python数据分析和数据实战笔记
随机森林和决策树(DecisionTree & RandomFo

分类

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("DecisionTreeClassificationExample")\
    .getOrCreate()

data = spark.read.format("libsvm").load("sample_libsvm_data.txt")# 特征用稀疏向量表示

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
# 将字符串标签（不是则先转为字符串）编码为整数索引（频率从高到低的标签转为0,1,...），逆操作：IndexToString
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
# 根据maxCategories自动识别分类特征（如果某个特征不重复的值小于等于maxCategories，则该特征转为分类特征）
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)


# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

回归

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("DecisionTreeRegressionExample")\
    .getOrCreate()

data = spark.read.format("libsvm").load("sample_libsvm_data.txt")
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)# 回归就不用对标签索引了

(trainingData, testData) = data.randomSplit([0.7, 0.3])

dt = DecisionTreeRegressor(labelCol='label',featuresCol="indexedFeatures")

# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, dt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

treeModel = model.stages[1]
# summary only
print(treeModel)