准备
Spark2.0以上版本的pyspark创建一个名为spark的SparkSession对象,当需要手工创建时,SparkSession可以由其伴生对象的builder()方法创建出来,如下代码段所示:
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
pyspark.ml依赖numpy包,Ubuntu 自带python3是没有numpy的,执行如下命令安装:
sudo pip install numpy
实现过程
在vim中编写程序,输入以下代码:
mashu@mashu-Inspiron-5458:/usr/local/spark/python_code/ML$ vim ml_pipelines.py
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
# Create SparkSession
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
(0,"a b c d e soark", 1.0),
(1,"b d",0.0),
(2,"spark f g h",1.0),
(3,"hadoop mapreduce",1.0)],
["id","text","label"])
# Define pipeline stage including tokenizer, hashingTF and lr
tokenizer = Tokenizer(inputCol="text",outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(),outputCol="features")
lr = LogisticRegression(maxIter=10,regParam=0.001)
# Build pipeline
pipeline = Pipeline(stages=[tokenizer,hashingTF,lr])
# Generate a pipeline model
model = pipeline.fit(training)
# Prepare test documents
test = spark.createDataFrame([
(4, "spark i j k"),
(5, "l m n"),
(6, "spark hadoop spark"),
(7, "apache hadoop")],
["id","text"])
# Predict
prediction = model.transform(test)
selected = prediction.select("id","text","probability","prediction")
for row in selected.collect():
rid, text, prob, prediction = row
print("(%d,%s)-->prob=%s,prediction=%f" % (rid,text,str(prob),prediction))
保存退出vim,运行程序,输出结果:
mashu@mashu-Inspiron-5458:/usr/local/spark/python_code/ML$ python ml_pipelines.py
(省略部分Warning信息)
(4,spark i j k)-->prob=[0.08271837966143641,0.9172816203385635],prediction=1.000000
(5,l m n)-->prob=[0.26514807025650067,0.7348519297434992],prediction=1.000000
(6,spark hadoop spark)-->prob=[0.001869208706242662,0.9981307912937574],prediction=1.000000
(7,apache hadoop)-->prob=[0.029108488540555488,0.9708915114594445],prediction=1.000000
由于训练数据集较少,如果有更多的测试数据进行学习,预测的准确率将会有显著提升。
网友评论