pyspark Cross-Validation

Author: 米斯特芳 | Published 2021-08-01 16:05

Not much to say; here is the code, with comments.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CrossValidatorExample")\
    .getOrCreate()

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
# Lowercase the text, then split on whitespace.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# Encode each document as a sparse vector of length numFeatures;
# the vector's elements sum to the document's token count.
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)  # logistic regression
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
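For intuition, here is the hashing trick that HashingTF relies on, sketched in plain Python. Spark uses MurmurHash3 internally; Python's built-in `hash()` stands in for it here, so the bucket indices will differ from Spark's, but the idea is the same.

```python
def hashing_tf(words, num_features=16):
    """Map a list of tokens to a count vector of length num_features."""
    vec = [0] * num_features
    for w in words:
        # Hash each token into one of num_features buckets and count it.
        vec[hash(w) % num_features] += 1
    return vec

vec = hashing_tf("a b c d e spark".split())
# As noted above, the vector's elements sum to the document's token count.
assert sum(vec) == 6
```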

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
# Powers of 2 are recommended for numFeatures so the hash buckets spread evenly.
# Each addGrid call adds one dimension to the parameter search space.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
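The grid built above is just the Cartesian product of the listed values, so 3 numFeatures choices times 2 regParam choices give 6 candidate settings. A rough sketch of what `build()` produces, using plain dicts in place of pyspark ParamMaps:

```python
from itertools import product

num_features = [10, 100, 1000]
reg_param = [0.1, 0.01]

# One dict per parameter combination, analogous to one ParamMap each.
param_grid = [{"numFeatures": n, "regParam": r}
              for n, r in product(num_features, reg_param)]

assert len(param_grid) == 6  # 3 x 2 settings for CrossValidator to try
```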

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
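What `fit()` does under the hood can be sketched in plain Python: split the rows into numFolds folds, hold each fold out in turn, train on the rest, and average the evaluator's metric for each parameter setting. The split logic alone, with hypothetical integer rows standing in for the DataFrame:

```python
def k_fold_splits(rows, num_folds):
    """Yield (train, held_out) pairs, one per fold."""
    # Deal rows round-robin into num_folds folds.
    folds = [rows[i::num_folds] for i in range(num_folds)]
    for i in range(num_folds):
        held_out = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        yield train, held_out

rows = list(range(10))
splits = list(k_fold_splits(rows, 2))
# Every row is held out exactly once across the folds.
assert sorted(r for _, held in splits for r in held) == rows
```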

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)



Article link: https://www.haomeiwen.com/subject/jwfrvltx.html