Dimensionality Reduction
Motivation I: Data compression
Reduce data from 2D to 1D: when two features are highly redundant (e.g., length in cm and length in inches), project each example $x^{(i)} \in \mathbb{R}^2$ onto a line, so a single number $z^{(i)} \in \mathbb{R}$ suffices.
Reduce data from 3D to 2D: project each example $x^{(i)} \in \mathbb{R}^3$ onto a plane, obtaining $z^{(i)} \in \mathbb{R}^2$.
Motivation II: Data Visualization
Principal Component Analysis (PCA) problem formulation
Reduce from 2D to 1D: Find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
Reduce from nD to kD: Find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.
PCA is not linear regression. Linear regression minimizes the vertical distances from the points to the fitted line (errors in predicting $y$); PCA has no distinguished variable $y$ and instead minimizes the orthogonal (shortest) distances from the points to the projection direction.
Principal Component Analysis algorithm
Data preprocessing
Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
Preprocessing (feature scaling/mean normalization):
Compute the mean of each feature, $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$.
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
If different features are on different scales (e.g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale features to have a comparable range of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$, where $s_j$ is the standard deviation (or range) of feature $j$.
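A minimal Octave sketch of this preprocessing step, assuming the examples are stored as the rows of an m-by-n matrix X (variable names here are illustrative):

mu = mean(X);             % 1-by-n row vector of feature means
s = std(X);               % 1-by-n row vector of feature standard deviations
X_norm = (X - mu) ./ s;   % mean-normalize and scale every feature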
Principal Component Analysis (PCA) algorithm
Reduce data from nD to kD
Compute "covariance matrix":
Compute "eigenvectors" of matrix :
[U,S,V] = svd(Sigma)
$U$ will be an $n \times n$ matrix whose columns are the vectors $u^{(1)}, \dots, u^{(n)}$. Take the first $k$ columns of $U$ to form $U_{\text{reduce}} \in \mathbb{R}^{n \times k}$; the compressed representation of an example is then $z = U_{\text{reduce}}^{T} x \in \mathbb{R}^{k}$.
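Putting the steps together, a minimal Octave sketch, assuming X_norm is the preprocessed m-by-n data matrix and k has been chosen (Ureduce and Z are illustrative names):

m = size(X_norm, 1);
Sigma = (1/m) * (X_norm' * X_norm);   % n-by-n covariance matrix
[U, S, V] = svd(Sigma);               % columns of U are the eigenvectors
Ureduce = U(:, 1:k);                  % keep the first k columns, n-by-k
Z = X_norm * Ureduce;                 % m-by-k: row i is (Ureduce' * x^(i))'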
Choosing the number of principal components
Choosing k
Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{\text{approx}}^{(i)} \|^2$
Total variation in the data: $\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2$
Typically, choose $k$ to be the smallest value so that
$$\frac{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{\text{approx}}^{(i)} \|^2}{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2} \le 0.01,$$
i.e., "99% of variance is retained".
[U,S,V] = svd(Sigma)
For a given $k$, check whether $1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01$,
or equivalently $\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$, where $S_{ii}$ are the diagonal entries of $S$.
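This makes the search cheap: call svd once and scan the diagonal of S. A minimal Octave sketch, assuming Sigma was computed as above:

[U, S, V] = svd(Sigma);
sv = diag(S);                      % singular values S_11, ..., S_nn
retained = cumsum(sv) / sum(sv);   % variance retained for k = 1, ..., n
k = find(retained >= 0.99, 1);     % smallest k retaining 99% of variance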
Reconstruction from compressed representation
Given $z = U_{\text{reduce}}^{T} x$, an approximation to the original example is recovered as $x_{\text{approx}} = U_{\text{reduce}} \, z \approx x$.
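In the matrix form of the earlier sketch (rows as examples, names illustrative):

X_approx = Z * Ureduce';   % m-by-n approximation of X_norm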
Advice for applying PCA
Supervised learning speedup
- Extract inputs: drop the labels from $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$ to get the unlabeled dataset $x^{(1)}, \dots, x^{(m)}$ (e.g., in $\mathbb{R}^{10000}$)
- PCA: map to $z^{(1)}, \dots, z^{(m)}$ (e.g., in $\mathbb{R}^{1000}$) and form the new training set $(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})$
Note: the mapping $x^{(i)} \to z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{\text{cv}}^{(i)}$ and $x_{\text{test}}^{(i)}$ in the cross validation and test sets, as in the sketch below.
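A minimal Octave sketch of reusing the training-set mapping, assuming mu, s, and Ureduce were all computed from X_train only (matrix names are illustrative):

Z_train = ((X_train - mu) ./ s) * Ureduce;   % defines the mapping
Z_cv    = ((X_cv    - mu) ./ s) * Ureduce;   % reuse; do NOT refit on CV data
Z_test  = ((X_test  - mu) ./ s) * Ureduce;   % reuse; do NOT refit on test data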
Applications of PCA
- Compression
- Reduce memory/disk needed to store data
- Speed up learning algorithm
- Visualization
Bad use of PCA: to prevent overfitting. Using $z^{(i)}$ instead of $x^{(i)}$ gives fewer features, which might work OK, but it is a poor way to address overfitting: PCA discards information without looking at the labels $y$. Use regularization instead.
PCA is sometimes used where it shouldn't be
Design of ML system
Before using PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$; only if that doesn't do what you want should you implement PCA.