14. Spark Learning (Python Version): Building a Machine Learning Pipeline


Author: 马淑 | Published 2018-09-02 16:50
    Preparation

    In Spark 2.0 and later, the pyspark shell automatically creates a SparkSession object named spark. When you need to create one by hand, a SparkSession can be obtained through its builder, as in the following snippet:
    spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
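    In the interactive pyspark shell this step is unnecessary, because a session named spark already exists. Either way, a quick sanity check is to print the session's Spark version:
    print(spark.version)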
    pyspark.ml depends on the numpy package. The python3 that ships with Ubuntu does not include numpy, so install it with the following command:
    sudo pip3 install numpy
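    To confirm the installation succeeded, import numpy from the same python3 interpreter:
    python3 -c "import numpy; print(numpy.__version__)"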

    Implementation

    Write the program in vim by entering the following code:
    mashu@mashu-Inspiron-5458:/usr/local/spark/python_code/ML$ vim ml_pipelines.py

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer
    
    # Create SparkSession
    spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
    
    # Prepare training documents from a list of (id, text, label) tuples.
    training = spark.createDataFrame([
            (0, "a b c d e spark", 1.0),
            (1, "b d", 0.0),
            (2, "spark f g h", 1.0),
            (3, "hadoop mapreduce", 1.0)],
            ["id", "text", "label"])
    
    # Define the pipeline stages: tokenizer, hashingTF and lr
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.001)
    
    # Chain the three stages into a pipeline
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    
    # Generate a pipeline model
    model = pipeline.fit(training)
    
    # Prepare test documents
    test = spark.createDataFrame([
            (4, "spark i j k"),
            (5, "l m n"),
            (6, "spark hadoop spark"),
            (7, "apache hadoop")],
            ["id","text"])
    
    # Predict
    prediction = model.transform(test)
    selected = prediction.select("id", "text", "probability", "prediction")
    for row in selected.collect():
        rid, text, prob, prediction = row
        print("(%d,%s)-->prob=%s,prediction=%f" % (rid, text, str(prob), prediction))
    

    Save and quit vim, then run the program. The output is:

    mashu@mashu-Inspiron-5458:/usr/local/spark/python_code/ML$ python ml_pipelines.py
    
    (some warning messages omitted)
    
    (4,spark i j k)-->prob=[0.08271837966143641,0.9172816203385635],prediction=1.000000
    (5,l m n)-->prob=[0.26514807025650067,0.7348519297434992],prediction=1.000000
    (6,spark hadoop spark)-->prob=[0.001869208706242662,0.9981307912937574],prediction=1.000000
    (7,apache hadoop)-->prob=[0.029108488540555488,0.9708915114594445],prediction=1.000000
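    Each stage of the pipeline can also be applied on its own, which is handy for debugging. A minimal sketch (not part of the original script) that runs just the tokenizer and hashingTF stages on the training data and shows the intermediate columns they add:

    tokenized = tokenizer.transform(training)    # adds the "words" column of token lists
    featurized = hashingTF.transform(tokenized)  # adds the sparse term-frequency "features" column
    featurized.select("words", "features").show(truncate=False)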
    

    Because the training set here is tiny, the predictions are rough; with more training data to learn from, prediction accuracy would improve significantly.
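    If the fitted model is worth keeping, it can be saved to disk and reloaded later. A minimal sketch, where the path ml_pipeline_model is an arbitrary example:

    from pyspark.ml import PipelineModel
    # Persist the fitted pipeline model, overwriting any previous save at this path
    model.write().overwrite().save("ml_pipeline_model")
    # Load it back and reuse it for prediction
    same_model = PipelineModel.load("ml_pipeline_model")
    same_model.transform(test).select("id", "prediction").show()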
