1. Spectral Clustering
(a) Spectral Clustering on synthetic data
(a.1) change threshold
neighbor = 20, threshold = 0.1

- Almost all the points are assigned to the blue Cluster 2 (red < blue).
neighbor = 20, threshold = 0.2

- Enlarging the threshold to 0.2, the inner circle is labeled as Cluster 1, and the outer ring contains more red Cluster 1 points than blue Cluster 2 points (red > blue).
- We should reduce the threshold.
- This result looks much like the k-means result.
neighbor = 20, threshold = 0.15

- A better result.
- The right threshold depends largely on experience. If we cannot visualize the data directly, we may get poor results.
Conclusions:
- If we can't decide on the best threshold, we can instead use a "fully connected" graph, so that every edge weight is larger than 0. The Gaussian (RBF) kernel is usually used to set the edge weights in that case.
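As a minimal numpy sketch of the two graph-construction choices discussed above (the helper names and the sigma value are my own, not from the assignment code):

```python
import numpy as np

def epsilon_affinity(X, threshold):
    """Threshold graph: weight 1 iff pairwise distance <= threshold."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = (D <= threshold).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def rbf_affinity(X, sigma=1.0):
    """Fully connected graph: Gaussian (RBF) kernel, every weight > 0."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
W_eps = epsilon_affinity(X, 0.15)   # may leave some points isolated
W_rbf = rbf_affinity(X, sigma=1.0)  # every off-diagonal weight is positive
```

The fully connected graph removes the hard threshold choice, at the cost of a dense weight matrix; the kernel width sigma still has to be tuned, but small changes degrade the result more gracefully than a hard cutoff.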

(a.2) compare with K-means

- K-means assumes each cluster is spherical in the vector space, so it can handle spherically distributed clusters but fails on the data from problem (a).
- Spectral Clustering is based on graph theory; it can handle the data from problem (a) because it does not restrict the shape of the clusters.
- Both Spectral Clustering and K-means require choosing K in advance.
- Spectral Clustering transforms the original data into a new low-dimensional vector space by fusing the original features, which makes it more stable than K-means.
- If the graph built by Spectral Clustering is not sparse, i.e. the weight matrix is dense, the algorithm will be quite slow.
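The contrast can be sketched with scikit-learn on a two-circles toy set similar to the data in problem (a) (the dataset parameters and metric choice here are my own assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Concentric circles: non-spherical clusters that break K-means
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

ari_km = adjusted_rand_score(y, km_labels)  # near 0: splits circles in half
ari_sc = adjusted_rand_score(y, sc_labels)  # near 1: separates the rings
```

K-means cuts the plane into two half-spaces, mixing the rings, while the k-nearest-neighbor graph lets Spectral Clustering follow the ring shapes.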
(b) Spectral Clustering on real-world data
- when k = 3
| Algorithm | Accuracy | Mutual info |
|---|---|---|
| K-means | 0.572 | -0.365 |
| Spectral Clustering | 0.816 | 0.371 |

- when k = 5 (the largest k that constructW supports)
| Algorithm | Accuracy | Mutual info |
|---|---|---|
| K-means | 0.535 | 0.017 |
| Spectral Clustering | 0.671 | 0.661 |
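For reference, a common way to compute these two metrics: clustering accuracy needs a best matching between predicted and true labels (Hungarian algorithm), and mutual information is available in scikit-learn. The helper name below is my own, not part of the assignment code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one matching of cluster labels."""
    labels_t, labels_p = np.unique(y_true), np.unique(y_pred)
    # Confusion matrix: rows = predicted clusters, cols = true classes
    C = np.zeros((len(labels_p), len(labels_t)), dtype=int)
    for i, p in enumerate(labels_p):
        for j, t in enumerate(labels_t):
            C[i, j] = np.sum((y_pred == p) & (y_true == t))
    row, col = linear_sum_assignment(-C)  # maximize matched counts
    return C[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])  # same partition, permuted labels
acc = clustering_accuracy(y_true, y_pred)            # 1.0
nmi = normalized_mutual_info_score(y_true, y_pred)   # 1.0
```

Both measures are invariant to label permutation, so a cluster named "2" matching class "0" still counts as correct.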

2. Principal Component Analysis
(a) hack CAPTCHA system
Below are the 5 original and rotated CAPTCHA samples.





- Using the coordinates of the text points as features is a clever choice.
- The eigenvector with the largest eigenvalue is parallel to the long side of the text part, because that projection direction has the largest variance.
- As the principal components are orthogonal, the eigenvector with the second-largest eigenvalue is perpendicular to the long side. This property can be used to calculate the rotation angle.
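A small numpy sketch of recovering the rotation angle from the principal direction of point coordinates; the synthetic "text strip" below stands in for the real CAPTCHA pixels, and the 30-degree rotation is my own test value:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "text strip": long in x, short in y, then rotated by 30 degrees
pts = np.column_stack([rng.uniform(-10, 10, 500), rng.uniform(-1, 1, 500)])
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
pts = pts @ R.T

# PCA on the 2-D coordinates: eigenvector with the largest eigenvalue
cov = np.cov(pts, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
v1 = vecs[:, np.argmax(vals)]                 # parallel to the long side
angle = np.degrees(np.arctan2(v1[1], v1[0]))  # ~30 (mod 180)
```

The direction is only defined up to a sign, so the recovered angle is determined modulo 180 degrees; rotating back by `-angle` straightens the text either way.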
(b) apply PCA on face image
(b.i) Eigenface
- The 40 eigenfaces with the largest eigenvalues

- The 40 eigenfaces with the smallest eigenvalues (just for a look)

(b.ii) Dimensionality reduction test error
- KNN, k = 1

- KNN, k = 3

- KNN, k = 5

Conclusion:
- The larger k is, the larger the test error. This follows from the nature of KNN: k = 1 tends to overfit and is more precise on the training data, and it works well here probably because the training and test data are quite similar.
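A sketch of this PCA + KNN pipeline; since the face data from the assignment isn't included here, scikit-learn's `load_digits` stands in, and the split and component count are my own choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: the assignment's face images aren't available here
X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

pca = PCA(n_components=40).fit(Xtr)  # keep the 40 largest components
errors = {}
for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(pca.transform(Xtr), ytr)
    errors[k] = 1.0 - knn.score(pca.transform(Xte), yte)
```

Note that PCA is fit on the training split only, then applied to both splits, so no test information leaks into the projection.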
(b.iii) Visualized face

Conclusion:
- When k is smaller, the dimensionality is reduced more and the reconstructed image tends to be smoother. In contrast, a larger k means more detailed information is retained, so the reconstructed image tends to be sharper.
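The smooth-versus-sharp observation corresponds to the reconstruction error shrinking as k grows. A sketch on a stand-in dataset (`load_digits` again; the k values are my own choices):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

recon_err = {}
for k in (5, 20, 40):
    pca = PCA(n_components=k).fit(X)
    # Project down to k components, then map back to pixel space
    X_rec = pca.inverse_transform(pca.transform(X))
    recon_err[k] = float(np.mean((X - X_rec) ** 2))
# Fewer components -> larger error (smoother, blurrier reconstruction)
```

The discarded components carry the high-frequency detail, which is why aggressive reduction blurs the faces.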
(b.iv) Apply LDA on this dataset
- KNN, k = 1

- KNN, k = 5

- KNN, k = 10

Conclusion:
- LDA has a smaller test error than PCA.
- LDA is supervised: the LDA function takes both the training features and the ground-truth labels as parameters, so it pays more attention to the class labels.
- Though both LDA and PCA project data to a lower dimension, PCA seeks the direction of largest variance while LDA seeks the largest separation between the different classes.
- After dimensionality reduction, the PCA feature dimension can be as large as the original feature dimension, while the LDA feature dimension is at most the number of classes minus one.
- LDA can suffer from overfitting too, but with supervision it is more precise.
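The comparison can be sketched with scikit-learn on the same stand-in dataset (`load_digits`, 10 classes, so LDA allows at most 9 components; the split and k are my own choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # 10 classes -> LDA dim <= 9
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

pca = PCA(n_components=9).fit(Xtr)                          # unsupervised
lda = LinearDiscriminantAnalysis(n_components=9).fit(Xtr, ytr)  # supervised

knn = KNeighborsClassifier(n_neighbors=5)
acc_pca = knn.fit(pca.transform(Xtr), ytr).score(pca.transform(Xte), yte)
acc_lda = knn.fit(lda.transform(Xtr), ytr).score(lda.transform(Xte), yte)
```

Both are linear projections to 9 dimensions, but LDA uses the labels to choose directions that separate the classes rather than directions of raw variance.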