Clustering Algorithms in Practice: the DBSCAN Algorithm

Author: 敬子v | Published: 2019-10-10 08:07

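    DBSCAN is a density-based clustering algorithm:

    1. The number of clusters does not have to be specified in advance.
    2. The number of clusters is not fixed; it is determined by the data.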

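    Data points are divided into three types:

    1. Core point: a point that has at least MinPts points within radius Eps of it.
    2. Border point: a point that has fewer than MinPts points within radius Eps, but lies in the neighborhood of a core point.
    3. Noise point: a point that is neither a core point nor a border point.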
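
    The three definitions above can be made concrete with a small sketch. It is only an illustration: the function name classify_points, the toy data, and the choice to count a point as its own neighbor (as sklearn's min_samples does) are assumptions, not part of the original article.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise' (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; fine for small data sets.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dist <= eps                      # every point is its own neighbor
    is_core = neighbors.sum(axis=1) >= min_pts   # assumption: count includes the point itself
    # Border: not core, but within eps of at least one core point.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Made-up 1-D points: a dense group, one point on its edge, one far away.
points = np.array([[0.0], [0.1], [0.2], [0.45], [5.0]])
kinds = classify_points(points, eps=0.3, min_pts=3)
print(list(kinds))
```

    With these toy values the first three points are core points, the point at 0.45 is a border point (it is within Eps of the core point at 0.2 but has only two neighbors), and the point at 5.0 is noise.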



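    The idea of the algorithm is to find the core points and border points, discard the noise points, and take what remains as the clustering result.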

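    The steps of the DBSCAN algorithm:

    1. Label every point as a core point, a border point, or a noise point.
    2. Remove the noise points.
    3. Add an edge between every pair of core points that lie within Eps of each other. (Manhattan distance is used here; for the one-dimensional features clustered below it coincides with the Euclidean distance that sklearn.cluster.DBSCAN uses by default.)
    4. Each group of connected core points forms one cluster.
    5. Assign each border point to the cluster of a core point associated with it (that is, a core point whose Eps-radius contains it).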
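
    The five steps can be followed almost literally in code. The sketch below is a teaching aid under stated assumptions, not the sklearn implementation: dbscan_sketch and the toy data are hypothetical, the neighbor count includes the point itself, and a border point is simply given the cluster of the first core neighbor found.

```python
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Return one label per row of X; -1 marks noise (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dist <= eps
    is_core = neighbors.sum(axis=1) >= min_pts       # step 1: find core points
    labels = np.full(len(X), -1)                     # step 2: noise keeps the label -1
    cluster = 0
    for start in np.flatnonzero(is_core):
        if labels[start] != -1:
            continue
        # Steps 3-4: flood-fill over core points joined by edges of length <= eps.
        labels[start] = cluster
        stack = [start]
        while stack:
            p = stack.pop()
            for q in np.flatnonzero(neighbors[p] & is_core):
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    # Step 5: attach each border point to the cluster of one core neighbor.
    for p in np.flatnonzero(~is_core):
        core_nbrs = np.flatnonzero(neighbors[p] & is_core)
        if core_nbrs.size:
            labels[p] = labels[core_nbrs[0]]
    return labels

# Made-up 1-D data: two dense groups, one border point (0.45), one outlier (5.0).
X = np.array([[0.0], [0.1], [0.2], [0.45], [5.0], [9.0], [9.1], [9.2]])
labels = dbscan_sketch(X, eps=0.3, min_pts=3)
print(labels)
```

    The pairwise distance matrix keeps the sketch short but costs O(n²) memory, so it is only suitable for small examples.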

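    Analyzing the results:

    When clustering with DBSCAN, the quantities usually examined are the cluster labels, the noise ratio, the number of clusters, and the silhouette coefficient.
    The silhouette coefficient of a sample $i$ is, in its standard definition,

    $$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

    where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the mean distance from $i$ to the points of the nearest other cluster. Values lie in $[-1, 1]$; values closer to 1 indicate better-separated clusters.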
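
    As a rough illustration of these quantities, the sketch below computes the labels, noise ratio, cluster count, and silhouette coefficient on a tiny made-up data set. The data and parameter values are assumptions chosen only so that the result is easy to follow.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Toy 1-D data: two tight groups and one isolated point (made up for illustration).
X = np.array([[1.0], [1.1], [1.2], [5.0], [5.1], [5.2], [20.0]])
labels = DBSCAN(eps=0.3, min_samples=3).fit(X).labels_

noise_ratio = np.mean(labels == -1)                          # fraction of points labeled -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # distinct labels, noise excluded
# All points are scored here, so the noise label -1 is treated as one more group.
score = metrics.silhouette_score(X, labels)

print("labels:", labels)
print("noise ratio: {:.2%}".format(noise_ratio))
print("clusters:", n_clusters)
print("silhouette:", score)
```

    Whether noise points should be included when computing the silhouette coefficient is a judgment call; excluding them with a mask such as labels != -1 is a common alternative.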

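    Application to a concrete case:

    About the data:

    The data are campus-network logs from a university: 290 records of students' network usage. Each record includes the user ID, the device's MAC address, the IP address, the session start time, the session end time, the online duration, the network plan, and so on. The goal is to use these data to analyze students' usage patterns.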
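
    To show how one record maps to the features used for clustering, here is a minimal parsing sketch. The field positions (2 for the MAC address, 4 for the start time, 6 for the online duration) are taken from the article's own code; the helper name parse_record and the sample line are hypothetical and do not come from the real data file.

```python
def parse_record(line):
    """Extract (mac, start_hour, online_time) from one comma-separated record."""
    fields = line.strip().split(",")
    mac = fields[2]
    # The start time looks like "YYYY-MM-DD HH:MM:SS"; keep only the hour.
    start_hour = int(fields[4].split(" ")[1].split(":")[0])
    online_time = int(fields[6])
    return mac, start_hour, online_time

# Hypothetical record, invented for illustration only.
sample = "rec001,U0001,AABBCCDDEEFF,10.0.0.1,2014-07-20 22:44:18,2014-07-20 23:10:16,1558,plan-A"
print(parse_record(sample))
```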

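    Goal of the experiment:

    Use DBSCAN clustering to analyze patterns in when students go online and how long they stay online.
    Technical approach: sklearn.cluster.DBSCAN. Code example: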

    # -*- coding: utf-8 -*-
    # 1. Import the required modules and extract the data
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn.cluster as skc
    from sklearn import metrics

    onlinetimes = []
    mac2id = dict()
    with open("TestData.txt", encoding='utf-8') as f:
        for line in f:
            fields = line.split(",")
            mac = fields[2]
            onlinetime = int(fields[6])  # must be converted to an integer
            starttime = int(fields[4].split(" ")[1].split(":")[0])  # hour of the start time
            if mac not in mac2id:
                mac2id[mac] = len(onlinetimes)  # index the new entry is about to occupy
                onlinetimes.append((starttime, onlinetime))  # append a (start hour, duration) tuple
            else:
                # MAC seen before: overwrite its entry with the latest record
                onlinetimes[mac2id[mac]] = (starttime, onlinetime)

    '''
    2. Clustering on start time
    '''
    real_x = np.array(onlinetimes).reshape((-1, 2))  # N rows, 2 columns
    # print(real_x)
    X = real_x[:, 0:1]
    db = skc.DBSCAN(eps=0.01, min_samples=20).fit(X)  # fit the model
    labels = db.labels_
    print("labels:")
    print(labels)
    ratio = len(labels[labels[:] == -1]) / len(labels)
    print("Noise ratio: {:.2%}".format(ratio))
    n_cluster = len(set(labels)) - (1 if -1 in labels else 0)  # distinct labels, excluding noise (-1)
    print("Number of clusters: %s" % n_cluster)
    print("Silhouette coefficient: %s" % metrics.silhouette_score(X, labels))
    for i in range(n_cluster):
        print("Cluster %s:" % i)
        print(list(X[labels == i].flatten()))
    plt.hist(X, 24)  # histogram of start hours, using 24 bins
    plt.show()
    

    Output:

    labels:
    [ 0 -1  0  1 -1  1  0  1  2 -1  1  0  1  1  3 -1 -1  3 -1  1  1 -1  1  3
      4 -1  1  1  2  0  2  2 -1  0  1  0  0  0  1  3 -1  0  1  1  0  0  2 -1
      1  3  1 -1  3 -1  3  0  1  1  2  3  3 -1 -1 -1  0  1  2  1 -1  3  1  1
      2  3  0  1 -1  2  0  0  3  2  0  1 -1  1  3 -1  4  2 -1 -1  0 -1  3 -1
      0  2  1 -1 -1  2  1  1  2  0  2  1  1  3  3  0  1  2  0  1  0 -1  1  1
      3 -1  2  1  3  1  1  1  2 -1  5 -1  1  3 -1  0  1  0  0  1 -1 -1 -1  2
      2  0  1  1  3  0  0  0  1  4  4 -1 -1 -1 -1  4 -1  4  4 -1  4 -1  1  2
      2  3  0  1  0 -1  1  0  0  1 -1 -1  0  2  1  0  2 -1  1  1 -1 -1  0  1
      1 -1  3  1  1 -1  1  1  0  0 -1  0 -1  0  0  2 -1  1 -1  1  0 -1  2  1
      3  1  1 -1  1  0  0 -1  0  0  3  2  0  0  5 -1  3  2 -1  5  4  4  4 -1
      5  5 -1  4  0  4  4  4  5  4  4  5  5  0  5  4 -1  4  5  5  5  1  5  5
      0  5  4  4 -1  4  4  5  4  0  5  4 -1  0  5  5  5 -1  4  5  5  5  5  4
      4]
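    Noise ratio: 0 (the original run printed the ratio rounded to zero decimal places)
    Number of clusters: 6
    Silhouette coefficient: 0.7104124919280866
    Cluster 0: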
    [22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22]
    Cluster 1:
    [23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23]
    Cluster 2:
    [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
    Cluster 3:
    [21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]
    Cluster 4:
    [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
    Cluster 5:
    [7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
    

    '''
    3. Clustering on online duration
    '''
    Y = np.log(1 + real_x[:, 1:])  # log transform makes the distribution more compact and easier to cluster
    dby = skc.DBSCAN(eps=0.14, min_samples=8).fit(Y)
    labels_y = dby.labels_
    print("labels:")
    print(labels_y)
    ratio = len(labels_y[labels_y[:] == -1]) / len(labels_y)
    print("ratio:{:.2%}".format(ratio))
    n_clustery = len(set(labels_y)) - (1 if -1 in labels_y else 0)
    print("n_clustery:%s" % n_clustery)
    print("Silhouette coefficient: %s" % metrics.silhouette_score(Y, labels_y))
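
    The effect of the log transform can be seen on a few numbers. The values below are illustrative only and are not taken from the data set; the point is that durations spanning two orders of magnitude end up within a factor of two of each other, which makes a single eps value workable.

```python
import numpy as np

durations = np.array([300, 4600, 32000])   # illustrative values, not from the data set
compressed = np.log(1 + durations)          # same transform as applied to the durations

print(compressed)
print("raw max/min ratio:", durations.max() / durations.min())
print("log max/min ratio:", compressed.max() / compressed.min())
```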

    Compute the sample count, standard deviation, and mean of each cluster:

    for i in range(n_clustery):
        print("n_clustery", i, ":")
        count = len(Y[labels_y == i])
        mean = np.mean(real_x[labels_y == i][:, 1])
        std = np.std(real_x[labels_y == i][:, 1])
        print("Sample count: %d" % count)
        print("Standard deviation: %d" % std)
        print("Mean: %d" % mean)
    plt.subplot(122)
    plt.subplots_adjust(wspace=0.3)  # horizontal spacing between subplots; use hspace for vertical spacing
    x = np.linspace(0, len(labels_y), len(labels_y))  # evenly spaced x positions, one per sample
    plt.plot(x, real_x[:, 1])
    plt.show()

    Output:
    

    labels:
    [ 0 1 0 2 1 0 0 0 0 1 -1 0 -1 -1 0 1 1 0 1 0 0 1 0 0
    1 1 -1 -1 0 0 0 0 1 0 -1 0 0 0 0 0 1 0 0 2 0 0 0 1
    0 0 -1 1 0 1 0 0 -1 0 0 0 0 1 1 1 0 0 0 -1 1 0 0 0
    0 0 0 0 1 -1 0 0 0 0 0 0 1 -1 0 1 1 0 1 1 0 1 0 1
    0 0 -1 1 1 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 2 0 1 0 -1
    0 1 0 0 0 2 -1 -1 0 1 1 1 -1 0 1 0 0 0 0 0 1 1 0 0
    0 0 2 -1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 -1 0
    0 0 0 2 0 1 2 0 0 -1 1 1 0 0 0 0 0 1 -1 -1 -1 1 0 0
    0 1 0 -1 -1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 -1 0 1 -1 -1
    0 0 0 1 2 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 2 -1 -1 0
    0 0 -1 -1 1 -1 2 -1 0 0 0 -1 0 1 0 -1 0 -1 0 0 0 1 -1 0
    1 0 -1 -1 1 -1 0 -1 -1 1 2 0 1 1 0 2 0 0 2 0 2 0 0 0
    -1]
    ratio:15.22%
    n_clustery:3
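    Silhouette coefficient: 0.2928419027876771
    n_clustery 0 :
    Sample count: 169
    Standard deviation: 3732
    Mean: 4643
    n_clustery 1 :
    Sample count: 60
    Standard deviation: 13109
    Mean: 32109
    n_clustery 2 :
    Sample count: 16
    Standard deviation: 46
    Mean: 317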

    ![image.png](https://img.haomeiwen.com/i10216152/d72edbf33aea478f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
    
    
        
    
    

    Article link: https://www.haomeiwen.com/subject/fvxgpctx.html