PySpark Cross-Validation

Author: 米斯特芳 | Published: 2021-08-01 16:05

Not much to explain here, so straight to the code (see the comments).

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CrossValidatorExample")\
    .getOrCreate()

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

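# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
# Tokenizer lowercases the text, then splits it on whitespace.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# HashingTF encodes each document as a sparse vector of length numFeatures;
# the vector entries are term counts, so they sum to the number of tokens in the document.
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)  # logistic regression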
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
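# A power of two is recommended for numFeatures so that hashed terms are spread evenly over the columns.
# Each addGrid call adds one parameter axis to the search space.
# (Comments cannot follow a backslash line continuation, so they are placed here.)
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()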

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel applies the best model found during the search.
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
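
To make the "3 x 2 = 6 parameter settings" and numFolds=2 from the comments above concrete, here is a minimal plain-Python sketch of the bookkeeping behind a grid search with k-fold cross-validation. It does not use Spark and is not Spark's implementation: the helper names build_param_grid and k_fold_indices are hypothetical, and the round-robin fold assignment is an assumption made for illustration (CrossValidator assigns rows to folds randomly).

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into one dict per combination (Cartesian product)."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, validation_idx) pairs so each row is validated exactly once.

    Rows are assigned to folds round-robin; this is an illustrative assumption,
    not the random assignment that Spark performs.
    """
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Same search space as the example: 3 values x 2 values.
param_grid = build_param_grid({
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
})

# 12 training rows and 2 folds, matching the example above.
splits = list(k_fold_indices(n_rows=12, num_folds=2))

# During the search, one model is fitted per (parameter setting, fold) pair.
fits_during_search = len(param_grid) * len(splits)

print(len(param_grid))       # 6
print(fits_during_search)    # 12
```

With 6 settings and 2 folds the search fits 12 models; the best setting is then refitted once on the full training data. In the PySpark example above, the fitted cvModel exposes the averaged metric per setting as cvModel.avgMetrics (in the same order as paramGrid) and the refitted pipeline as cvModel.bestModel.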

Article link: https://www.haomeiwen.com/subject/jwfrvltx.html
