Key concepts in Pipelines
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.
- DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
- Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
- Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
- Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
- Parameter: All Transformers and Estimators now share a common API for specifying parameters.
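The relationship between these concepts can be sketched in plain Python without Spark. This is only an analogy, not the MLlib API: the class names (Tokenizer, KeywordEstimator, KeywordModel) and the list-of-dicts stand-in for a DataFrame are assumptions made for illustration. It shows the contract described above: a Transformer exposes transform, an Estimator exposes fit and returns a Transformer, and fitting a Pipeline runs the stages in order, fitting each Estimator on the data produced by the stages before it.

```python
# Plain-Python analogy of the Pipelines concepts. None of these classes are
# Spark's; a list of dicts stands in for a DataFrame (an assumption).
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, object]   # stand-in for one DataFrame row
Dataset = List[Row]       # stand-in for a DataFrame


class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, data: Dataset) -> Dataset:
        return [{**row, "words": str(row["text"]).lower().split()} for row in data]


@dataclass
class KeywordModel:
    """Transformer produced by fitting KeywordEstimator."""
    keywords: set

    def transform(self, data: Dataset) -> Dataset:
        # Predict 1.0 when the row contains at least one learned keyword.
        return [
            {**row, "prediction": 1.0 if self.keywords & set(row["words"]) else 0.0}
            for row in data
        ]


class KeywordEstimator:
    """Estimator: learns the words that occur only in positive examples."""

    def fit(self, data: Dataset) -> KeywordModel:
        positive, negative = set(), set()
        for row in data:
            target = positive if row["label"] == 1.0 else negative
            target.update(row["words"])
        return KeywordModel(positive - negative)


class PipelineModel:
    """Transformer holding the fitted stages; applies them in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data: Dataset) -> Dataset:
        for stage in self.stages:
            data = stage.transform(data)
        return data


class Pipeline:
    """Estimator that chains Transformers and Estimators."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data: Dataset) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it to get a Transformer
                stage = stage.fit(data)
            fitted.append(stage)
            data = stage.transform(data)   # feed the output to the next stage
        return PipelineModel(fitted)


training = [
    {"text": "spark is fast", "label": 1.0},
    {"text": "hadoop is slow", "label": 0.0},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(training)
predictions = model.transform([{"text": "I like Spark"}, {"text": "slow job"}])
print([row["prediction"] for row in predictions])
```

The fitted result is itself a Transformer, which is why the same object can be applied to new data without refitting any stage.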
For a custom Transformer, see
https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml
For serializing a custom Transformer, see
https://stackoverflow.com/questions/41399399/serialize-a-custom-transformer-using-python-to-be-used-within-a-pyspark-ml-pipel
For a custom Estimator, see
https://stackoverflow.com/questions/37270446/how-to-create-a-custom-estimator-in-pyspark