Implementing the bisecting k-means algorithm with PySpark

Author: 米斯特芳 | Published 2021-07-27 17:34

    bisecting k-means

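    A variant of KMeans built on repeated bisection. All points start in a single cluster, which is split into two clusters so as to minimize the sum of squared errors (SSE). Every divisible cluster is then split into two in turn. If splitting all divisible clusters in one iteration would produce more than K clusters, the clusters with more samples take priority, so that no more than K clusters are produced.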
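The splitting procedure described above can be sketched in plain Python. This is only an illustrative sketch, not Spark's implementation: the helper names (`sq_dist`, `mean`, `two_means`, `bisecting_kmeans`) and the sample points are made up for this note, and the sketch simplifies the priority rule by always splitting the largest divisible cluster next.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def two_means(points, rng, iterations=20):
    """Split one cluster into two halves with plain 2-means."""
    centers = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            idx = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            halves[idx].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split, e.g. all points identical
        centers = [mean(halves[0]), mean(halves[1])]
    return halves

def bisecting_kmeans(points, k, seed=1):
    """Hypothetical helper: bisect the largest divisible cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)  # larger clusters get priority
        for i, cluster in enumerate(clusters):
            if len(cluster) < 2:
                continue
            left, right = two_means(cluster, rng)
            if left and right:
                clusters[i:i + 1] = [left, right]
                break
        else:
            break  # no cluster could be split further
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1)]
clusters = bisecting_kmeans(points, k=3)
for cluster in clusters:
    print(mean(cluster), cluster)
```

Because each split only looks at one cluster, a point never moves to a cluster outside the branch it was assigned to earlier.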

    Differences from KMeans

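    KMeans is very sensitive to the choice of initial centers and may converge to a local optimum, whereas bisecting KMeans is much less affected by this. Neither algorithm is suitable for non-spherical clusters. Bisecting KMeans is a poorer fit when K is large, because each split is carried out only within its own subcluster, so points cannot be reassigned across branches once they have been separated.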
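The sensitivity of plain KMeans to its starting centers can be seen on a tiny one-dimensional data set. The sketch below is illustrative only: `lloyd_1d` is a hypothetical helper written for this note (it is not part of PySpark), and the data and starting centers are made up.

```python
def lloyd_1d(points, centers, iterations=100):
    """Plain Lloyd's k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            groups[nearest].append(p)
        # Recompute centers; an empty group keeps its old center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break  # converged
        centers = updated
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse

# Three obvious groups: around 0.5, 10.5 and 20.5.
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

good_centers, good_sse = lloyd_1d(points, [0.0, 10.0, 20.0])
bad_centers, bad_sse = lloyd_1d(points, [0.0, 1.0, 15.0])

print(good_centers, good_sse)  # [0.5, 10.5, 20.5] 1.5
print(bad_centers, bad_sse)    # [0.0, 1.0, 15.5] 101.0
```

Both runs converge, but the second start wastes two centers on the first group and stays stuck at a much higher SSE, which is the local-optimum behaviour described above.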

    Other clustering algorithms

    Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA), Streaming k-means

    from pyspark.ml.clustering import BisectingKMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.sql import SparkSession
    
    spark = SparkSession\
        .builder\
        .appName("BisectingKMeansExample")\
        .getOrCreate()
    # libsvm format: on each line the first field is the label, followed by
    # space-separated index:value pairs, e.g. label 1:first_feature 2:second_feature ...
    dataset = spark.read.format("libsvm").load("sample_kmeans_data.txt")  # read using the libsvm data source
    
    # Trains a bisecting k-means model.
    bkm = BisectingKMeans().setK(2).setSeed(1)
    model = bkm.fit(dataset)
    
    # Make predictions
    predictions = model.transform(dataset)
    
    # Evaluate clustering by computing Silhouette score
    evaluator = ClusteringEvaluator()
    
    silhouette = evaluator.evaluate(predictions)
    print("Silhouette with squared euclidean distance = " + str(silhouette))
    
    # Shows the result.
    print("Cluster Centers: ")
    centers = model.clusterCenters()
    for center in centers:
        print(center)
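The libsvm line format mentioned in the loading comment above (a label followed by `index:value` pairs) can be illustrated with a small parser. This is only a sketch for understanding the format: `parse_libsvm_line` is a hypothetical helper, the sample line is made up, and Spark's own libsvm reader does this work internally.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)  # indices are 1-based in the file
    return label, features

# Made-up sample line with three features.
label, features = parse_libsvm_line("0 1:0.1 2:0.1 3:0.1")
print(label, features)
```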
    
    
