Data Bucketing in PySpark with Bucketizer

Author: 米斯特芳 | Published 2021-07-23 15:48
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Bucketizer
    spark = SparkSession\
        .builder\
        .appName("BucketizerExample")\
        .getOrCreate()
    splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
    data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
    dataFrame = spark.createDataFrame(data, ["features"])
    # splits: the bucket boundaries; outputCol: the name of the bucketed feature column
    bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures")
    # Transform original data into its bucket index.
    bucketedData = bucketizer.transform(dataFrame)
    print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1))
    bucketedData.show()
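
With these splits the four buckets are [-inf, -0.5), [-0.5, 0.0), [0.0, 0.5), and [0.5, inf]; each interval is closed on the left and open on the right, except the last, which also includes its upper bound. The show() call above should therefore print something like:

    Bucketizer output with 4 buckets
    +--------+----------------+
    |features|bucketedFeatures|
    +--------+----------------+
    |  -999.9|             0.0|
    |    -0.5|             1.0|
    |    -0.3|             1.0|
    |     0.0|             2.0|
    |     0.2|             2.0|
    |   999.9|             3.0|
    +--------+----------------+

By default Bucketizer raises an error for values that fall outside the splits (possible when the splits do not span -inf to inf); this is controlled by its handleInvalid parameter ("error", "skip", or "keep"). A minimal sketch, not from the original post, reusing dataFrame from above:

    # With handleInvalid="keep", out-of-range values go into an extra
    # bucket (index 2 here, one past the last regular bucket)
    bucketizer2 = Bucketizer(splits=[-0.5, 0.0, 0.5],
                             inputCol="features", outputCol="bucketedFeatures",
                             handleInvalid="keep")
    bucketizer2.transform(dataFrame).show()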
    
