We perform clustering because we believe the underlying cluster labels are meaningful, will lead to a more efficient description of our data, and will help us choose better actions.
K-means Clustering
The K-means algorithm puts N data points in an I-dimensional space into K clusters. Each cluster is parameterized by a vector m(k) called its mean.
Each data point is denoted by x(n), which consists of I components.
A distance between data points must be defined, for example the squared Euclidean distance d(x, y) = (1/2) sum_i (x_i - y_i)^2.
The algorithm has just two steps:
- assignment step: Each data point x(n) is assigned to the nearest mean.
- update step: The means are adjusted to match the sample means of the data points that they are responsible for.
After repeated iterations of these two steps, the algorithm is guaranteed to converge (this can be proved). Convergence is reached when the means no longer move after an update.
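To make the two steps concrete, here is a minimal NumPy sketch of hard K-means; the function and variable names (kmeans, X, means) are mine, not from the text, and it simply iterates assignment and update until the means stop moving.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard K-means: X is an (N, I) array of N points in I dimensions."""
    rng = np.random.default_rng(seed)
    # Initialize the K means m(k) as randomly chosen data points.
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each x(n) is assigned to the nearest mean
        # (using the squared Euclidean distance defined above).
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = d.argmin(axis=1)
        # Update step: each mean becomes the sample mean of the points
        # it is responsible for (unchanged if it has no points).
        new_means = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else means[k]
            for k in range(K)
        ])
        # Converged when the means no longer move.
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```

Called as, say, kmeans(X, K=3) on an (N, 2) array, it returns the K final means and each point's hard cluster label.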
Drawback of K-means: it is a 'hard' algorithm. 'Hard' means that it assigns each data point to exactly one cluster, and all data points in a cluster count equally when the mean is updated. Arguably, points on the borderline between two or more clusters should have less of a vote in the update step.
Soft K-means Clustering
These shortcomings of the 'hard' K-means algorithm give rise to the soft K-means algorithm: each data point is given a soft degree of responsibility for every cluster, rather than a single hard assignment, and each mean is updated as the responsibility-weighted average of all the points.
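A minimal sketch of one soft K-means iteration, assuming the common formulation with a stiffness parameter beta (beta is not mentioned in the text above; the name soft_kmeans_step is mine):

```python
import numpy as np

def soft_kmeans_step(X, means, beta):
    """One iteration of soft K-means; beta is the assumed stiffness parameter."""
    # Responsibilities: a softmax over (negative) distances, so a borderline
    # point splits its vote between nearby clusters instead of giving one
    # cluster a full vote.
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    r = np.exp(-beta * d)
    r /= r.sum(axis=1, keepdims=True)
    # Update: each mean is the responsibility-weighted average of all points.
    new_means = (r.T @ X) / r.sum(axis=0)[:, None]
    return new_means, r
```

Large beta makes the responsibilities nearly one-hot and recovers hard K-means; small beta spreads each point's vote more evenly across clusters.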
This algorithm still has some flaws; it can hopefully be improved by using a maximum-likelihood approach.