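DBSCAN is a density-based clustering algorithm:
1. The number of clusters does not need to be specified before clustering.
2. The number of clusters is not fixed; it is determined by the data.
It divides the data points into three types:
1. Core point: a point that has at least MinPts points within radius Eps.
2. Border point: a point that has fewer than MinPts points within radius Eps but lies in the neighborhood of a core point.
3. Noise point: a point that is neither a core point nor a border point.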
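The three definitions can be made concrete with a minimal sketch in plain Python. The function names, the sample points, and the values of eps and min_pts are illustration choices, not part of the original post, and the sketch assumes that a point counts as a member of its own neighborhood.

```python
# Sketch: label 1-D points as core / border / noise.
# The points, eps and min_pts below are made-up illustration values.

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

def classify_points(points, eps, min_pts):
    """Return 'core', 'border' or 'noise' for every point."""
    neighbours = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            kinds.append("border")   # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds

points = [1.0, 1.2, 1.4, 1.8, 5.0]
kinds = classify_points(points, eps=0.5, min_pts=3)
for p, kind in zip(points, kinds):
    print(p, kind)
```

Here 1.8 has only two points in its neighborhood but sits within Eps of the core point 1.4, so it is a border point, while 5.0 has no core point nearby and is noise.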
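The idea of the algorithm is to find the core points and border points, discard the noise points, and take what remains as the result.
The DBSCAN algorithm in detail:
1. Label every point as a core point, a border point, or a noise point;
2. Remove the noise points;
3. Add an edge between every pair of core points that lie within Eps of each other (Manhattan distance is used here; note that sklearn.cluster.DBSCAN defaults to Euclidean distance, which gives the same result on one-dimensional data);
4. Each group of connected core points forms a cluster;
5. Assign each border point to the cluster of a core point it is associated with (that is, a core point whose Eps radius contains it).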
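The five steps can be followed in a short from-scratch sketch. This is an illustration in plain Python for one-dimensional data, not the implementation used by sklearn; the function name dbscan_1d and the sample start hours are assumptions made for the example.

```python
# Sketch of the five steps on 1-D data.
# dbscan_1d, the sample hours, eps and min_pts are illustration choices.

def dbscan_1d(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    neighbours = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
                  for i in range(n)]
    # Step 1: find the core points (the count includes the point itself).
    is_core = [len(nb) >= min_pts for nb in neighbours]
    # Step 2: every point starts as noise (-1) and stays so unless claimed below.
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Steps 3 and 4: follow the edges between core points within eps of
        # each other; each connected group of core points is one cluster.
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbours[i]:
                if is_core[j] and labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    # Step 5: attach each border point to the cluster of a nearby core point.
    for i in range(n):
        if not is_core[i]:
            for j in neighbours[i]:
                if is_core[j]:
                    labels[i] = labels[j]
                    break
    return labels

hours = [7, 7, 7, 8, 8, 8, 15, 22, 22, 22]
labels = dbscan_1d(hours, eps=0.01, min_pts=3)
print(labels)
```

With such a small eps on integer hours, only identical values are neighbors, so each frequently occurring hour becomes its own cluster and the lone value 15 is left as noise. A border point within reach of core points from two clusters is assigned to whichever is found first, so its label can depend on the order of the data.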
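Analysing the results:
When clustering with DBSCAN, the main quantities to examine are the cluster labels, the noise ratio, the number of clusters, and the silhouette coefficient.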
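The silhouette coefficient of a sample i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points of its own cluster and b(i) is the mean distance from i to the points of the nearest other cluster. It lies between -1 and 1, and values closer to 1 indicate better separated clusters. metrics.silhouette_score returns the mean of s(i) over all samples.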
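To see what the silhouette coefficient measures, it can be computed by hand. The sketch below assumes the standard definition: for each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). The function name and the sample values are made up for illustration, and at least two distinct labels are required.

```python
def silhouette_samples_1d(values, labels):
    """Silhouette value (b - a) / max(a, b) for every 1-D sample."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with a single sample
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of v to itself is 0, so it does not affect the sum)
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(abs(v - u) for u in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

values = [1.0, 2.0, 10.0, 11.0]   # illustration data
labels = [0, 0, 1, 1]
scores = silhouette_samples_1d(values, labels)
mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)
```

The two groups are far apart relative to their width, so every sample scores close to 1 and so does the mean.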
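Application to a concrete case:
Data description:
The data are campus-network logs from a university: 290 records of students' campus-network usage. Each record includes the user ID, the device MAC address, the IP address, the time the session started, the time it ended, the session duration, the campus-network plan, and so on. The existing data are used to analyse students' online patterns.
Purpose of the experiment:
Use DBSCAN clustering to analyse the patterns in when students go online and how long they stay online.
Technical approach: sklearn.cluster.DBSCAN
Data example: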
# -*- coding: utf-8 -*-
# 1. Import the required modules and extract the data
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster as skc
from sklearn import metrics

onlinetimes = []   # one (start hour, online duration) tuple per device
mac2id = dict()    # MAC address -> index into onlinetimes
with open("TestData.txt", encoding='utf-8') as f:
    for line in f:
        fields = line.split(",")
        mac = fields[2]
        onlinetime = int(fields[6])  # must be converted to an integer
        starttime = int(fields[4].split(" ")[1].split(":")[0])  # hour of the start time
        if mac not in mac2id:
            mac2id[mac] = len(onlinetimes)  # index the new entry is about to receive
            onlinetimes.append((starttime, onlinetime))  # double parentheses: append a tuple
        else:
            onlinetimes[mac2id[mac]] = (starttime, onlinetime)  # keep the latest record for this MAC
'''
2. Cluster the start times
'''
real_x = np.array(onlinetimes).reshape((-1, 2))  # shape the data into N rows and 2 columns
# print(real_x)
X = real_x[:, 0:1]
db = skc.DBSCAN(eps=0.01, min_samples=20).fit(X)  # fit the model
labels = db.labels_
print("labels:")
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)
print("Noise ratio: {:.2%}".format(ratio))
n_cluster = len(set(labels)) - (1 if -1 in labels else 0)  # distinct labels, not counting the noise label -1
print("Number of clusters: %s" % n_cluster)
print("Silhouette coefficient: %s" % metrics.silhouette_score(X, labels))
for i in range(n_cluster):
    print("Cluster: n_cluster%s" % i)
    print(list(X[labels == i].flatten()))
plt.hist(X, 24)  # number of occurrences of each start hour over the 24 hours of the day
plt.show()
Output:
labels:
[ 0 -1 0 1 -1 1 0 1 2 -1 1 0 1 1 3 -1 -1 3 -1 1 1 -1 1 3
4 -1 1 1 2 0 2 2 -1 0 1 0 0 0 1 3 -1 0 1 1 0 0 2 -1
1 3 1 -1 3 -1 3 0 1 1 2 3 3 -1 -1 -1 0 1 2 1 -1 3 1 1
2 3 0 1 -1 2 0 0 3 2 0 1 -1 1 3 -1 4 2 -1 -1 0 -1 3 -1
0 2 1 -1 -1 2 1 1 2 0 2 1 1 3 3 0 1 2 0 1 0 -1 1 1
3 -1 2 1 3 1 1 1 2 -1 5 -1 1 3 -1 0 1 0 0 1 -1 -1 -1 2
2 0 1 1 3 0 0 0 1 4 4 -1 -1 -1 -1 4 -1 4 4 -1 4 -1 1 2
2 3 0 1 0 -1 1 0 0 1 -1 -1 0 2 1 0 2 -1 1 1 -1 -1 0 1
1 -1 3 1 1 -1 1 1 0 0 -1 0 -1 0 0 2 -1 1 -1 1 0 -1 2 1
3 1 1 -1 1 0 0 -1 0 0 3 2 0 0 5 -1 3 2 -1 5 4 4 4 -1
5 5 -1 4 0 4 4 4 5 4 4 5 5 0 5 4 -1 4 5 5 5 1 5 5
0 5 4 4 -1 4 4 5 4 0 5 4 -1 0 5 5 5 -1 4 5 5 5 5 4
4]
Noise ratio: 0 (rounded to a whole number in this run; 64 of the 289 labels above are -1, about 22.15%)
Number of clusters: 6
Silhouette coefficient: 0.7104124919280866
Cluster: n_cluster0
[22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22]
Cluster: n_cluster1
[23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23]
Cluster: n_cluster2
[20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
Cluster: n_cluster3
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]
Cluster: n_cluster4
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
Cluster: n_cluster5
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
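The noise ratio, the number of clusters, and the cluster sizes can all be read off the label array. The following is a small sketch using collections.Counter; the helper name summarise_labels is hypothetical, and the input is just the first row (24 values) of the labels printed above, used for illustration only.

```python
from collections import Counter

def summarise_labels(labels):
    """Noise ratio, number of clusters and cluster sizes from DBSCAN labels."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)            # -1 marks noise, not a cluster
    return {
        "n_clusters": len(counts),
        "noise_ratio": noise / len(labels),
        "sizes": dict(sorted(counts.items())),
    }

# First row of the label array printed above (illustration only).
first_row = [0, -1, 0, 1, -1, 1, 0, 1, 2, -1, 1, 0,
             1, 1, 3, -1, -1, 3, -1, 1, 1, -1, 1, 3]
summary = summarise_labels(first_row)
print("Number of clusters:", summary["n_clusters"])
print("Noise ratio: {:.2%}".format(summary["noise_ratio"]))
print("Cluster sizes:", summary["sizes"])
```

Removing the -1 entry before counting plays the same role as the `(1 if -1 in labels else 0)` correction in the script.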
'''
3. Cluster the online durations
'''
Y = np.log(1 + real_x[:, 1:])  # the log transform makes the data distribution more compact and easier to analyse
dby = skc.DBSCAN(eps=0.14, min_samples=8).fit(Y)
labels_y = dby.labels_
print("labels:")
print(labels_y)
ratio = len(labels_y[labels_y[:] == -1]) / len(labels_y)
print("ratio:{:.2%}".format(ratio))
n_clustery = len(set(labels_y)) - (1 if -1 in labels_y else 0)
print("n_clustery:%s" % n_clustery)
print("Silhouette coefficient %s" % metrics.silhouette_score(Y, labels_y))
# Compute the sample count, standard deviation and mean of each cluster
for i in range(n_clustery):
    print("n_clustery", i, ":")
    count = len(Y[labels_y == i])
    mean = np.mean(real_x[labels_y == i][:, 1])
    std = np.std(real_x[labels_y == i][:, 1])
    print("Sample count: %d" % count)
    print("Standard deviation: %d" % std)
    print("Mean: %d" % mean)
plt.subplot(122)
plt.subplots_adjust(wspace=0.3)  # horizontal spacing between subplots; use hspace for vertical spacing
x = np.linspace(0, len(labels_y), len(labels_y))  # evenly spaced numbers over the given interval
plt.plot(x, real_x[:, 1])
plt.show()
Output:
labels:
[ 0 1 0 2 1 0 0 0 0 1 -1 0 -1 -1 0 1 1 0 1 0 0 1 0 0
1 1 -1 -1 0 0 0 0 1 0 -1 0 0 0 0 0 1 0 0 2 0 0 0 1
0 0 -1 1 0 1 0 0 -1 0 0 0 0 1 1 1 0 0 0 -1 1 0 0 0
0 0 0 0 1 -1 0 0 0 0 0 0 1 -1 0 1 1 0 1 1 0 1 0 1
0 0 -1 1 1 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 2 0 1 0 -1
0 1 0 0 0 2 -1 -1 0 1 1 1 -1 0 1 0 0 0 0 0 1 1 0 0
0 0 2 -1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 -1 0
0 0 0 2 0 1 2 0 0 -1 1 1 0 0 0 0 0 1 -1 -1 -1 1 0 0
0 1 0 -1 -1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 -1 0 1 -1 -1
0 0 0 1 2 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 2 -1 -1 0
0 0 -1 -1 1 -1 2 -1 0 0 0 -1 0 1 0 -1 0 -1 0 0 0 1 -1 0
1 0 -1 -1 1 -1 0 -1 -1 1 2 0 1 1 0 2 0 0 2 0 2 0 0 0
-1]
ratio:15.22%
n_clustery:3
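Silhouette coefficient 0.2928419027876771
n_clustery 0 :
Sample count: 169
Standard deviation: 3732
Mean: 4643
n_clustery 1 :
Sample count: 60
Standard deviation: 13109
Mean: 32109
n_clustery 2 :
Sample count: 16
Standard deviation: 46
Mean: 317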
![image.png](https://img.haomeiwen.com/i10216152/d72edbf33aea478f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
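The effect of the log transform on the online durations can be checked with the three cluster means from the output above. This is only a rough illustration (the log of a mean is not the mean of the logs), and the variable names are made up for the example.

```python
import math

# Mean online durations of the three clusters, taken from the output above.
cluster_means = {2: 317, 0: 4643, 1: 32109}

log_means = {c: math.log(1 + m) for c, m in cluster_means.items()}
for c, m in cluster_means.items():
    print("cluster", c, "raw:", m, "log(1 + raw):", round(log_means[c], 2))

raw_span = max(cluster_means.values()) - min(cluster_means.values())
log_span = max(log_means.values()) - min(log_means.values())
print("raw span:", raw_span)
print("log span:", round(log_span, 2))

# On the log scale a distance of eps corresponds to a ratio on the raw scale:
# two durations are within eps=0.14 when (1 + d1) / (1 + d2) is at most exp(0.14).
ratio_for_eps = math.exp(0.14)
print("ratio for eps=0.14:", round(ratio_for_eps, 3))
```

On the raw scale the means differ by tens of thousands, while on the log scale they are only a few units apart, so a single eps can work for both short and long durations. An eps of 0.14 on the log scale means two durations are neighbors when they differ by at most roughly 15 percent.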