Applying Clustering Algorithms: The DBSCAN Algorithm

Author: 敬子v | Published 2019-10-10 08:07

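DBSCAN is a density-based clustering algorithm:

1. You do not need to specify the number of clusters before clustering.
2. The number of clusters is not fixed; it is determined by the data.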

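Data points are divided into three types:

1. Core point: a point with at least MinPts points within radius Eps.
2. Border point: a point with fewer than MinPts points within radius Eps that nevertheless lies in the neighborhood of a core point.
3. Noise point: a point that is neither a core point nor a border point.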
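
The three definitions can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `classify_points` and the sample data are made up, distances are Euclidean, and a point counts as a member of its own neighborhood (the convention scikit-learn uses for `min_samples`).

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'.

    Assumption: a point's Eps-neighborhood includes the point itself, and a
    core point needs at least min_pts points in that neighborhood.
    """
    # Indices of all points within eps of each point (including itself).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]

    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("border")  # not dense itself, but next to a core point
        else:
            kinds.append("noise")   # neither core nor border
    return kinds

# Hypothetical 1-D data: a dense group, one point on its edge, one far away.
points = [(1.0,), (1.1,), (1.2,), (1.3,), (1.6,), (5.0,)]
kinds = classify_points(points, eps=0.35, min_pts=3)
print(kinds)
```

Here the point at 1.6 has only two points in its neighborhood, but one of them (1.3) is a core point, so it becomes a border point; the point at 5.0 has no core neighbor and is noise.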



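The idea of the algorithm is to find the core points and border points, discard the noise points, and obtain the clusters.

Steps of the DBSCAN algorithm:

1. Label every point as a core point, a border point, or a noise point.
2. Remove the noise points.
3. Add an edge between every pair of core points that lie within Eps of each other. (The distance used here is Manhattan distance; note that sklearn.cluster.DBSCAN defaults to Euclidean distance, which is identical for the one-dimensional data clustered below.)
4. Each group of connected core points forms one cluster.
5. Assign each border point to the cluster of a core point it is associated with, that is, a core point whose Eps radius contains it.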
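
The five steps can be written out as a small, unoptimized sketch. It is not how sklearn.cluster.DBSCAN is implemented. The function name `dbscan_sketch` and the sample data are hypothetical, distances are Euclidean via `math.dist`, a point counts toward its own neighborhood, and a border point that touches several clusters simply goes to the first core neighbor found.

```python
from math import dist

def dbscan_sketch(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # Step 1: find the core points (the neighborhood includes the point itself).
    core = [len(neighbors[i]) >= min_pts for i in range(n)]

    # Step 2: every point starts as noise (-1); noise never joins a cluster.
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if not core[start] or labels[start] != -1:
            continue
        # Steps 3-4: follow edges between core points that are within eps;
        # each connected group of core points becomes one cluster.
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if core[j] and labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1

    # Step 5: attach each border point to the cluster of a core neighbor.
    for i in range(n):
        if not core[i]:
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels

# Hypothetical 1-D data: two dense groups, one border point, one outlier.
points = [(0.0,), (0.1,), (0.2,), (0.45,), (3.0,), (3.1,), (3.2,), (9.0,)]
labels = dbscan_sketch(points, eps=0.3, min_pts=3)
print(labels)
```

The point at 0.45 is not dense enough to be a core point, but it lies within Eps of the core point at 0.2 and therefore joins cluster 0; the point at 9.0 keeps the label -1.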

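Analyzing the results:

When clustering with DBSCAN, the main quantities to examine are the cluster labels, the noise ratio, the number of clusters, and the silhouette coefficient.
The silhouette coefficient of a sample i is

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the mean distance from i to the other samples in its own cluster and b(i) is the mean distance from i to the samples of the nearest other cluster. It ranges from -1 to 1, and values closer to 1 indicate better-separated clusters.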
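
To make the silhouette coefficient concrete, here is a minimal hand-rolled version. For each sample it takes a, the mean distance to the other members of its own cluster, and b, the smallest mean distance to the members of any other cluster, and scores the sample as (b - a) / max(a, b); the overall score is the mean over all samples. The function name `silhouette_score_sketch` and the data are hypothetical, and Euclidean distance is assumed.

```python
from math import dist

def silhouette_score_sketch(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: two tight, well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
score = silhouette_score_sketch(points, labels)
print(round(score, 4))
```

Note that metrics.silhouette_score takes the labels as given, so when DBSCAN labels are passed in directly, the noise points (label -1) are scored as if they formed one more cluster.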

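Application to a concrete case:

About the data:

The data are log records from a university campus network: 290 records of students' campus network usage. Each record includes the user ID, the device MAC address, the IP address, the session start time, the session end time, the online duration, the campus network plan, and so on. The goal is to use these data to analyze students' online patterns.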
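
As a rough sketch of how such records turn into clustering input, the snippet below extracts the start hour and the online duration from comma-separated lines and keeps one pair per MAC address. The record layout is an assumption taken from the field indices used in this article's code (field 2 is the MAC address, field 4 the start time, field 6 the duration); the sample lines and the helper name `parse_record` are made up.

```python
def parse_record(line):
    """Return (mac, start_hour, online_duration) from one comma-separated record.

    Assumption: field 2 is the MAC address, field 4 the start time written as
    "YYYY-MM-DD HH:MM:SS", and field 6 the online duration as an integer.
    """
    fields = line.strip().split(",")
    mac = fields[2]
    start_hour = int(fields[4].split(" ")[1].split(":")[0])
    online_duration = int(fields[6])
    return mac, start_hour, online_duration

# Hypothetical records; the second and third share a MAC address.
lines = [
    "1,U0001,AA-BB-CC-00-00-01,10.0.0.1,2014-07-20 22:44:18,2014-07-20 23:10:16,1558,plan-a",
    "2,U0002,AA-BB-CC-00-00-02,10.0.0.2,2014-07-21 07:30:00,2014-07-21 07:40:00,600,plan-b",
    "3,U0002,AA-BB-CC-00-00-02,10.0.0.2,2014-07-21 08:05:00,2014-07-21 08:15:00,600,plan-b",
]

latest = {}
for line in lines:
    mac, hour, duration = parse_record(line)
    latest[mac] = (hour, duration)  # a later record for the same MAC replaces the earlier one

onlinetimes = list(latest.values())
print(onlinetimes)
```

Keying the dictionary by MAC address yields one (start hour, duration) pair per device, which is the shape the clustering code expects before it is reshaped into two columns.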

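Goal of the experiment:

Use DBSCAN clustering to analyze the patterns in students' start times and online durations.
Technical approach: sklearn.cluster.DBSCAN. Example on the data: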

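# -*- coding: utf-8 -*-
# 1. Import the required modules and extract the data
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster as skc
from sklearn import metrics

onlinetimes = []
mac2id = dict()
with open("TestData.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.split(",")
        mac = fields[2]
        onlinetime = int(fields[6])  # must be converted to an integer
        starttime = int(fields[4].split(" ")[1].split(":")[0])  # hour of the start time
        if mac not in mac2id:
            mac2id[mac] = len(onlinetimes)  # index the new entry is about to occupy
            onlinetimes.append((starttime, onlinetime))  # append a tuple, hence the double parentheses
        else:
            # a later record for the same MAC replaces the earlier one
            onlinetimes[mac2id[mac]] = (starttime, onlinetime)
'''
2. Clustering on start time
'''
real_x = np.array(onlinetimes).reshape((-1, 2))  # N rows, 2 columns
# print(real_x)
X = real_x[:, 0:1]
db = skc.DBSCAN(eps=0.01, min_samples=20).fit(X)  # fit the model
labels = db.labels_
print("labels:")
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)
# "%2.f" means width 2 with zero decimals, so any ratio below 0.5 prints as " 0";
# use "{:.2%}".format(ratio) to see the percentage instead
print("Noise ratio:%2.f" % ratio)
n_cluster = len(set(labels)) - (1 if -1 in labels else 0)  # distinct labels, not counting noise (-1)
print("Number of clusters:%s" % n_cluster)
print("Silhouette coefficient:%s" % metrics.silhouette_score(X, labels))
for i in range(n_cluster):
    print("Cluster:n_cluster%s" % i)
    print(list(X[labels == i].flatten()))
plt.hist(X, 24)  # histogram of start hours over the 24 hours of the day
plt.show()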

Output:

labels:
[ 0 -1  0  1 -1  1  0  1  2 -1  1  0  1  1  3 -1 -1  3 -1  1  1 -1  1  3
  4 -1  1  1  2  0  2  2 -1  0  1  0  0  0  1  3 -1  0  1  1  0  0  2 -1
  1  3  1 -1  3 -1  3  0  1  1  2  3  3 -1 -1 -1  0  1  2  1 -1  3  1  1
  2  3  0  1 -1  2  0  0  3  2  0  1 -1  1  3 -1  4  2 -1 -1  0 -1  3 -1
  0  2  1 -1 -1  2  1  1  2  0  2  1  1  3  3  0  1  2  0  1  0 -1  1  1
  3 -1  2  1  3  1  1  1  2 -1  5 -1  1  3 -1  0  1  0  0  1 -1 -1 -1  2
  2  0  1  1  3  0  0  0  1  4  4 -1 -1 -1 -1  4 -1  4  4 -1  4 -1  1  2
  2  3  0  1  0 -1  1  0  0  1 -1 -1  0  2  1  0  2 -1  1  1 -1 -1  0  1
  1 -1  3  1  1 -1  1  1  0  0 -1  0 -1  0  0  2 -1  1 -1  1  0 -1  2  1
  3  1  1 -1  1  0  0 -1  0  0  3  2  0  0  5 -1  3  2 -1  5  4  4  4 -1
  5  5 -1  4  0  4  4  4  5  4  4  5  5  0  5  4 -1  4  5  5  5  1  5  5
  0  5  4  4 -1  4  4  5  4  0  5  4 -1  0  5  5  5 -1  4  5  5  5  5  4
  4]
Noise ratio: 0
Number of clusters:6
Silhouette coefficient:0.7104124919280866
Cluster:n_cluster0
[22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22]
Cluster:n_cluster1
[23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23]
Cluster:n_cluster2
[20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
Cluster:n_cluster3
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]
Cluster:n_cluster4
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
Cluster:n_cluster5
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
[Figure: histogram of start hours over the 24 hours of the day]
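
The noise ratio in this output reads 0 even though the label array clearly contains many -1 entries. A small sketch shows why: the format string `"%2.f"` asks for width 2 and zero decimal places, so any ratio below 0.5 is rounded down to 0. The labels below are the first ten values of the label array printed above; the variable names are made up for the illustration.

```python
labels = [0, -1, 0, 1, -1, 1, 0, 1, 2, -1]  # first ten labels from the output above

# Share of points labelled -1 (noise).
noise_ratio = labels.count(-1) / len(labels)
# Number of distinct labels, not counting the noise label.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

rounded = "%2.f" % noise_ratio           # width 2, zero decimals
percent = "{:.2%}".format(noise_ratio)   # percentage with two decimals

print(repr(rounded), percent, n_clusters)
```

For these ten labels the ratio is 0.3, which `"%2.f"` renders as `' 0'` and `"{:.2%}"` renders as `30.00%`. The percentage format is the one the duration clustering below uses.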

'''
3. Clustering on online duration
'''
Y = np.log(1 + real_x[:, 1:])  # the log transform makes the distribution more compact and easier to cluster
dby = skc.DBSCAN(eps=0.14, min_samples=8).fit(Y)
labels_y = dby.labels_
print("labels:")
print(labels_y)
ratio = len(labels_y[labels_y[:] == -1]) / len(labels_y)
print("ratio:{:.2%}".format(ratio))
n_clustery = len(set(labels_y)) - (1 if -1 in labels_y else 0)
print("n_clustery:%s" % n_clustery)
print("Silhouette coefficient%s" % metrics.silhouette_score(Y, labels_y))

Compute the sample count, standard deviation, and mean of each cluster:

for i in range(n_clustery):
    print("n_clustery", i, ":")
    count = len(Y[labels_y == i])
    mean = np.mean(real_x[labels_y == i][:, 1])  # statistics use the raw (untransformed) durations
    std = np.std(real_x[labels_y == i][:, 1])
    print("Sample count:%d" % count)
    print("Standard deviation:%d" % std)
    print("Mean:%d" % mean)
plt.subplot(122)
plt.subplots_adjust(wspace=0.3)  # horizontal spacing between subplots; use hspace for vertical
x = np.linspace(0, len(labels_y), len(labels_y))  # evenly spaced x positions, one per sample
plt.plot(x, real_x[:, 1])
plt.show()

Output:

labels:
[ 0 1 0 2 1 0 0 0 0 1 -1 0 -1 -1 0 1 1 0 1 0 0 1 0 0
1 1 -1 -1 0 0 0 0 1 0 -1 0 0 0 0 0 1 0 0 2 0 0 0 1
0 0 -1 1 0 1 0 0 -1 0 0 0 0 1 1 1 0 0 0 -1 1 0 0 0
0 0 0 0 1 -1 0 0 0 0 0 0 1 -1 0 1 1 0 1 1 0 1 0 1
0 0 -1 1 1 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 2 0 1 0 -1
0 1 0 0 0 2 -1 -1 0 1 1 1 -1 0 1 0 0 0 0 0 1 1 0 0
0 0 2 -1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 -1 0
0 0 0 2 0 1 2 0 0 -1 1 1 0 0 0 0 0 1 -1 -1 -1 1 0 0
0 1 0 -1 -1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 -1 0 1 -1 -1
0 0 0 1 2 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 2 -1 -1 0
0 0 -1 -1 1 -1 2 -1 0 0 0 -1 0 1 0 -1 0 -1 0 0 0 1 -1 0
1 0 -1 -1 1 -1 0 -1 -1 1 2 0 1 1 0 2 0 0 2 0 2 0 0 0
-1]
ratio:15.22%
n_clustery:3
Silhouette coefficient0.2928419027876771
n_clustery 0 :
Sample count:169
Standard deviation:3732
Mean:4643
n_clustery 1 :
Sample count:60
Standard deviation:13109
Mean:32109
n_clustery 2 :
Sample count:16
Standard deviation:46
Mean:317
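
The duration clustering runs on log(1 + duration) rather than on the raw values. The sketch below, which uses only the three cluster means reported above, illustrates one way to read that choice: the raw means differ by a factor of about one hundred, while their logarithms sit within a few units of each other, and an Eps of 0.14 in log space corresponds to a ratio between durations rather than an absolute difference. The variable names are made up for the illustration.

```python
from math import exp, log1p

durations = [317, 4643, 32109]  # the three cluster means reported above
logged = [log1p(d) for d in durations]  # log1p(d) is log(1 + d)

raw_spread = max(durations) / min(durations)  # ratio between largest and smallest mean
log_spread = max(logged) - min(logged)        # the same gap after the transform

# In log space, two durations d1 <= d2 are within eps of each other
# when (1 + d2) / (1 + d1) <= exp(eps).
eps = 0.14
factor = exp(eps)

print([round(v, 2) for v in logged])
print(round(raw_spread, 1), round(log_spread, 2), round(factor, 3))
```

With eps=0.14 the factor is roughly 1.15, so two durations count as neighbors when they differ by at most about 15 percent, whether they are short or long.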

![image.png](https://img.haomeiwen.com/i10216152/d72edbf33aea478f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


    


Article link: https://www.haomeiwen.com/subject/fvxgpctx.html