Data Mining: Instance-Based Classifiers - Nearest Neighbor (KNN)

Author: Cache_wood | Published 2022-04-13 00:25


Instance-Based Classifiers

Examples:

  • Rote-learner

    Memorizes the entire training set and classifies a record only if its attributes match one of the training examples exactly.

  • Nearest neighbor

    Uses the k "closest" points (nearest neighbors) to perform classification.

Nearest Neighbor Classifiers

Requires three things

  • The set of stored records.
  • Distance Metric to compute distance between records.
  • The value of k, the number of nearest neighbors to retrieve.

To classify an unknown record

  • Compute its distance to the training records.
  • Identify the k nearest neighbors.
  • Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
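The three steps above can be sketched in plain Python (the function name `knn_classify` and the toy data are illustrative, not from the original notes):

```python
from collections import Counter
import math

def knn_classify(query, records, labels, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    # 1. Compute the Euclidean distance from the query to every training record.
    dists = [(math.dist(query, r), y) for r, y in zip(records, labels)]
    # 2. Identify the k nearest neighbors.
    neighbors = sorted(dists, key=lambda t: t[0])[:k]
    # 3. Take the majority vote of their class labels.
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
y = ["A", "A", "B", "B"]
print(knn_classify((1.1, 0.9), X, y, k=3))  # prints A
```

Note that nothing is precomputed: all distance work happens at classification time, which is what makes the method "lazy".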

Definition of Nearest Neighbor

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Nearest Neighbor Classification

Compute the distance between two points:

Euclidean distance: d(p,q) = \sqrt{\sum_i (p_i-q_i)^2}

Determine the class from nearest neighbor list

  • Take the majority vote of class labels among the k nearest neighbors.
  • Weight the vote according to distance, e.g., w = 1/d^2.
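Distance-weighted voting can be sketched as follows (a hypothetical helper; the small epsilon guarding against a zero distance is an implementation choice, not part of the original notes):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest points.
    Each neighbor votes with weight w = 1/d^2, so closer points count more."""
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d ** 2 + 1e-12)  # epsilon guards against d == 0
    return max(scores, key=scores.get)

# One very close "A" neighbor outvotes two distant "B" neighbors.
print(weighted_vote([(0.1, "A"), (2.0, "B"), (2.5, "B")]))  # prints A
```

With a plain majority vote the same neighbor list would return "B"; weighting by 1/d^2 lets proximity override headcount.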

Choosing the value of k:

  • If k is too small, sensitive to noise points.
  • If k is too large, neighborhood may include points from other classes.

Scaling issues

  • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.

Problem with Euclidean measure:

  • High dimensional data: curse of dimensionality
  • Can produce counter-intuitive results.
  • Solution: Normalize the vectors to unit length
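The two remedies above, attribute scaling and unit-length normalization, can be sketched as follows (min-max scaling is one common choice; the helper names are illustrative):

```python
import math

def min_max_scale(records):
    """Rescale each attribute to [0, 1] so no single attribute
    dominates the distance computation."""
    cols = list(zip(*records))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(r, lo, hi)) for r in records]

def unit_length(vec):
    """Normalize a vector to unit Euclidean length, one fix for the
    counter-intuitive behavior of Euclidean distance in high dimensions."""
    norm = math.sqrt(sum(v * v for v in vec))
    return tuple(v / norm for v in vec) if norm else vec

# Income in dollars would swamp age in the raw Euclidean distance.
data = [(25, 30_000.0), (45, 90_000.0), (35, 60_000.0)]
print(min_max_scale(data))      # both attributes mapped into [0, 1]
print(unit_length((3.0, 4.0)))  # (0.6, 0.8)
```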

k-NN classifiers are lazy learners

  • They do not build a model explicitly, unlike eager learners such as decision tree induction and rule-based systems.
  • Classifying unknown records is relatively expensive.

Example: PEBLS

PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)

  • Works with both continuous and nominal features.

    For nominal features, the distance between two values V_1 and V_2 is derived from their class distributions (the Modified Value Difference Metric), where n_{ji} is the number of training records with value V_j that belong to class i:

    d(V_1,V_2) = \sum_i \left|\frac{n_{1i}}{n_1}-\frac{n_{2i}}{n_2}\right|

  • Each record is assigned a weight factor.

  • Number of nearest neighbors: k = 1.

Distance between record X and record Y:
\Delta (X,Y) = w_Xw_Y\sum_{i=1}^d d(X_i,Y_i)^2
where w_X = (number of times X is used for prediction) / (number of times X predicts correctly).

w_X \approx 1 if X makes accurate predictions most of the time.

w_X > 1 if X is not reliable for making predictions.
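Under these definitions, the nominal-value distance and the record weight might be sketched as follows (toy data and helper names are assumptions for illustration, not from PEBLS itself):

```python
from collections import Counter

def mvdm(values, labels, v1, v2):
    """Nominal-feature distance d(V1,V2) = sum_i |n1i/n1 - n2i/n2|,
    where n_ji counts training records with value Vj in class i."""
    c1 = Counter(y for v, y in zip(values, labels) if v == v1)
    c2 = Counter(y for v, y in zip(values, labels) if v == v2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    return sum(abs(c1[c] / n1 - c2[c] / n2) for c in set(labels))

def record_weight(times_used, times_correct):
    """w_X = (times X used for prediction) / (times X predicted correctly):
    close to 1 for reliable records, greater than 1 for unreliable ones."""
    return times_used / times_correct

# Toy nominal attribute (marital status) with class labels.
status = ["Single", "Married", "Single", "Married", "Divorced", "Single"]
cls    = ["Yes",    "No",      "Yes",    "No",      "Yes",      "No"]
print(mvdm(status, cls, "Single", "Married"))   # unlike class mixes -> large distance
print(mvdm(status, cls, "Single", "Divorced"))  # similar class mixes -> small distance
print(record_weight(10, 5))                     # unreliable record: w_X = 2.0
```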
