Machine learning
- AI (Artificial Intelligence): machine learning handles the learning part of artificial intelligence
- Pattern Recognition: recognizing and classifying objects based on their features
- Data Mining: extracting new information and knowledge from data
Pattern recognition tasks: Sequence labeling: give each input value a label, e.g., if the input is a word, decide its part of speech from the context
Parsing: analyze the grammatical structure of an input natural-language sentence
Classification: the categories are known in advance; assign each input to one of them
Clustering: create the categories from the features of the inputs
Approach: Multiple features: place the decision boundary in feature space where the probability of error is minimized.
Bayesian Decision Theory
Design classifiers to recommend decisions that minimize some total expected "risk".
posterior = likelihood * prior / evidence: P(wj|x) = P(x|wj) * P(wj) / P(x)
use prior only
if P(w1) > P(w2), decide w1; otherwise decide w2
general theory
- use more than one feature
- allow more than two categories
- allow actions other than classifying the input to one of the possible categories (e.g., rejection)
- employ a more general error function (i.e., "risk" function) by associating a "cost" ("loss" function) with each error (i.e., wrong action)
loss function
find the action that minimizes the conditional risk R(ai|x) = sum_j lambda(ai|wj) P(wj|x)
zero-one loss function: lambda(ai|wj) = 0 when i = j and 1 otherwise ===> R(ai|x) = 1 - P(wi|x), so minimizing the risk means choosing the class with the largest posterior
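As a small illustration of the rule above, the sketch below computes the posteriors from a prior and likelihood for two classes and picks the action with minimum conditional risk under zero-one loss. The numbers are made up for the example.

```python
import numpy as np

# Hypothetical numbers for two classes w1, w2 and one observation x.
prior = np.array([0.6, 0.4])        # P(w1), P(w2)
likelihood = np.array([0.3, 0.8])   # P(x|w1), P(x|w2)

evidence = np.sum(likelihood * prior)        # P(x)
posterior = likelihood * prior / evidence    # P(wj|x)
risk = 1.0 - posterior                       # conditional risk under zero-one loss
print(posterior, "decide w%d" % (np.argmin(risk) + 1))  # same as argmax posterior
```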
discriminant function
a useful way of representing pattern classifiers: discriminant functions gi(x); assign x to class wi when gi(x) is the largest
Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
Naïve Bayes
classifier assumes that all features are conditionally independent
In a high-dimensional feature space it is difficult to estimate the joint probability, so Naïve Bayes assumes that all features are conditionally independent.
Naïve Bayes: the conditional independence assumption
– Training is very easy and fast; it only requires considering each attribute in each class separately
– Testing is straightforward; it only requires looking up tables or computing conditional probabilities with the estimated distributions
• A popular generative model
– Performance competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
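A minimal Gaussian Naïve Bayes sketch to make the conditional independence assumption concrete: each feature is modeled per class with a one-dimensional Gaussian, and the class log-posteriors are the sum of per-feature log-likelihoods plus the log prior. The class name and the small smoothing constant are illustrative choices, not from the notes.

```python
import numpy as np

class GaussianNaiveBayes:
    """Each feature modeled independently per class with a 1-D Gaussian."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.vars = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # log P(x|c): sum of per-feature Gaussian log-densities (independence assumption)
        diff = X[:, None, :] - self.means[None, :, :]
        log_lik = -0.5 * (np.log(2 * np.pi * self.vars)[None] + diff ** 2 / self.vars[None]).sum(axis=2)
        log_post = log_lik + np.log(self.priors)     # unnormalized log posterior
        return self.classes[np.argmax(log_post, axis=1)]
```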
ROC curve
Receiver Operating Characteristic (ROC) Curve
ROC curves help us evaluate system performance at different thresholds and can be used to compare tests/procedures.
An ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied; in machine learning it is used to assess classifiers and compare tests.
AUC
Area under ROC curve (AUC)
Overall measure of test performance
Comparisons between two tests based on differences between (estimated) AUC
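A small sketch of how the ROC curve and its AUC can be computed from classifier scores by sweeping the threshold; the scores and labels are dummy values for illustration (and assume no tied scores).

```python
import numpy as np

def roc_points(scores, labels):
    """FPR/TPR pairs obtained by sweeping the decision threshold over the scores."""
    order = np.argsort(-scores)                       # sort samples by decreasing score
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()            # true positive rate at each threshold
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # false positive rate at each threshold
    return fpr, tpr

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
fpr, tpr = roc_points(scores, labels)
auc = np.trapz(tpr, fpr)               # area under the ROC curve (trapezoidal rule)
print(auc)
```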
Decision boundary
is a hypersurface that partitions the underlying vector space into two sets, one for each class.
KNN
The k-nearest neighbors algorithm is a non-parametric method used for classification and regression.
Pros and cons:
Cons: 1. Slower prediction with larger datasets.
2. Many irrelevant features may cause problems.
3. Results tend to be poor in high-dimensional cases.
4. Gives no insight into the patterns underlying the data.
5. All k nearest neighbors have the same influence on the prediction, although closer neighbors should (perhaps) have a higher influence.
Pros: 1. Easy to use.
2. Flexible: a suitable distance measure can be chosen to fit the data.
3. For complex non-linear problems it can model the data better than basic linear models.
4. A good data set gives good predictive accuracy.
5. No learning algorithm is needed: we do not explicitly learn a model from the training data; the data is the model.
Practical notes: 1. Consider sample weights and feature weights.
2. It is a good idea to choose an odd value for K.
To avoid overfitting: choose a suitable K based on the scale of your data set.
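A minimal k-NN classifier sketch matching the description above: all k neighbors vote with equal weight (a distance-weighted vote would address the last of the cons listed). The function name and default k are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the class of x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]              # all k neighbours weighted equally
```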
Covariance matrix
Variance: How much a random variable varies around the expected value
Covariance measures the strength of the linear relationship between two random variables.
For an N-dimensional random variable, the elements of the matrix are the covariances of every pair of components; this matrix is called the covariance matrix.
Covariance becomes more positive for each pair of values which differ from their mean in the same direction.
Covariance becomes more negative with each pair of values which differ from their mean in opposite directions.
Example applications: iris data, eigenfaces
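A quick sketch of estimating a covariance matrix from samples (dummy data): center each component on its mean and average the outer products.

```python
import numpy as np

X = np.random.randn(100, 3)          # 100 samples of a 3-dimensional random variable (dummy data)
Xc = X - X.mean(axis=0)              # center every component on its mean
cov = Xc.T @ Xc / (len(X) - 1)       # sample covariance matrix, shape (3, 3)
# equivalent built-in: np.cov(X, rowvar=False)
```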
PCA
Principal component analysis (PCA) uses the variance in the data as the structure preservation criterion.
PCA tries to preserve as much of the original variance of the data when projected to a lower-dimensional space.
Principal component analysis is a useful method for dimensionality reduction (projecting a high-dimensional space onto a lower-dimensional space).
In data preprocessing it treats the direction of largest variance in the raw data as the main feature; this direction is given by the eigenvector of the covariance matrix with the largest eigenvalue.
pros:
PCA is a non-parametric method, so the same raw input always produces the same output.
cons:
PCA is a linear method for dimensionality reduction, so it cannot capture non-linear structure in the raw data.
Sometimes projecting onto the maximum-variance direction makes two groups within the same dataset overlap, so a projection direction should be chosen that keeps the groups separated.
The principal features lie in the directions of largest variance, in other words the eigenvectors with the largest eigenvalues.
The basis matrix is composed of these eigenvectors; it is obtained by diagonalizing the covariance matrix.
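A minimal PCA sketch following the description above: diagonalize the covariance matrix, take the eigenvectors with the largest eigenvalues as the basis matrix, and project the centered data onto them. The function name and number of components are illustrative.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto the directions of largest variance (top covariance eigenvectors)."""
    Xc = X - X.mean(axis=0)                           # center the data
    cov = np.cov(Xc, rowvar=False)                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigh: symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest-variance directions
    basis = eigvecs[:, order]                         # basis matrix of principal eigenvectors
    return Xc @ basis                                 # data in the lower-dimensional space
```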
SOM
a type of ANN that is trained using unsupervised learning to produce a low-dimensional, discrete representation of the input space.
a clustering (vector quantization) method that combines topology preservation with good data visualization possibilities
SOM is a lattice of competitive neuronal units for clustering (vector quantization) and topology preservation.
Topology preservation means that input patterns close in the input space are mapped to units close on the SOM lattice, and units close on the SOM lattice are close in the input space.
The training algorithm of the SOM is based on unsupervised learning, which can be either iterative or batch based.
The SOM can be used for data visualization, clustering (or classification), estimation and a variety of other purposes.
batch training
the gradient is computed for the entire input set and the map is updated toward the estimated optimum for the set.
Batch learning is the opposite of sequential learning: the neuron weights are updated only after all inputs have been seen. It gives a more accurate estimate of the gradient and converges to a local minimum faster.
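A rough sketch of the iterative (sequential) SOM training loop described above: for each input, find the best-matching unit and pull it and its lattice neighbours toward the input, with a decaying learning rate and a shrinking neighbourhood. The grid size and schedules are illustrative choices, not from the notes.

```python
import numpy as np

def train_som(X, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0):
    h, w = grid
    rng = np.random.default_rng(0)
    weights = rng.random((h, w, X.shape[1]))                   # lattice of neuron weight vectors
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                        # decaying learning rate
        sigma = sigma0 * (1 - epoch / epochs) + 0.5            # shrinking neighbourhood radius
        for x in X:
            dists = np.linalg.norm(weights - x, axis=2)        # distance of x to every unit
            bmu = np.unravel_index(np.argmin(dists), (h, w))   # best-matching unit on the lattice
            d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)   # squared lattice distance to the BMU
            nbh = np.exp(-d2 / (2 * sigma ** 2))               # topological neighbourhood function
            weights += lr * nbh[..., None] * (x - weights)     # pull units toward the input
    return weights
```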
U-matrix
The U-matrix shows the distances between neighboring units in the SOM component planes; high values in the U-matrix mean large distances and indicate cluster borders, thus visualizing the cluster structure of the map.
ANN
Computational models inspired by the human brain:
-Massively parallel, distributed system, made up of simple processing units (neurons)
-Synaptic connection strengths among neurons are used to store the acquired knowledge.
-Knowledge is acquired by the network from its environment through a learning process
Many types of models (supervised, unsupervised) for different tasks (classification, regression, clustering, visualization)
supervised
trained to produce desired outputs in response to sample inputs
Unsupervised
trained by letting the network adjust itself to inputs to find relationships (e.g. clusters) within data. Require only data, no labeled samples
delta rule
Delta learning rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network.
differentiate the error to get the error derivatives with respect to the weights ===> update the weights
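A small sketch of the delta rule for a single linear neuron: compute the output, differentiate the squared error with respect to the weights, and move the weights against the gradient. The data and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 3))                    # inputs (dummy data)
t = X @ np.array([1.0, -2.0, 0.5])         # targets from a hypothetical true weight vector
w = np.zeros(3)                            # weights to learn
eta = 0.1                                  # learning rate

for epoch in range(100):
    for x_i, t_i in zip(X, t):             # sequential (on-line) updates
        y_i = w @ x_i                      # neuron output
        w += eta * (t_i - y_i) * x_i       # delta rule: w <- w + eta * error * input
```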
Batch vs. Sequential Learning
batch
*More accurate estimate of gradient
*Converges to local minimum faster
sequential
*Simpler to program
*May escape from local minima (change order of presentation)
Both ways need many epochs (passes through the whole dataset)
Activation functions
a sigmoid activation function is used in the hidden units, and sigmoid or linear activation functions are used in the output units
Sigmoid
Tanh
Rectified Linear Unit (ReLU)
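The three activation functions listed above, written out as plain NumPy functions for reference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes values to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)         # keeps positive values, zeros out negatives
```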
ANN summary
• Perceptron and linear regression optimize the same target function
• In both cases we compute the gradient (vector of partial derivatives)
• In the case of linear regression, we set the gradient to zero and solve for vector w. As the solution we have a closed formula for w such that the target function obtains the global minimum.
• In the case of perceptron, we iteratively go in the direction of the minimum by going in the direction of minus the gradient.
MLP
The complexity of the MLP can vary from a simple parametric model to a complex nonlinear regression model by varying the number of layers and the number of units in each layer.
Understand model complexity
Model complexity reflects how many models we can choose from in a training setting. If we use methods such as weight decay to reduce model complexity, we get better performance in avoiding over-fitting.
Because training is a non-linear optimization, it usually stops at a local minimum.
Backpropagation
In general, backpropagation uses the error at the output layer to correct the weights of the neurons in the MLP layer by layer, back to the inputs. It uses the chain rule so that the gradient calculation involves only local terms.
It is used to calculate the gradient that is needed to update the weights of the network. It is commonly used in DNNs.
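A compact sketch of backpropagation for a one-hidden-layer MLP with sigmoid hidden units, a linear output and mean squared error, showing how the output error is pushed back through the chain rule using only local terms. Layer sizes, data and learning rate are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.random((64, 4))                            # inputs (dummy data)
t = rng.random((64, 1))                            # targets (dummy data)
W1, b1 = 0.1 * rng.standard_normal((4, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((8, 1)), np.zeros(1)
eta = 0.1

for epoch in range(200):
    h = sigmoid(X @ W1 + b1)                       # forward pass: hidden activations
    y = h @ W2 + b2                                # forward pass: linear output
    grad_y = (y - t) / len(X)                      # dE/dy for mean squared error
    grad_W2, grad_b2 = h.T @ grad_y, grad_y.sum(axis=0)
    grad_h = grad_y @ W2.T                         # error propagated back to the hidden layer
    grad_a = grad_h * h * (1 - h)                  # through the sigmoid derivative (local term)
    grad_W1, grad_b1 = X.T @ grad_a, grad_a.sum(axis=0)
    W2 -= eta * grad_W2; b2 -= eta * grad_b2       # gradient descent updates
    W1 -= eta * grad_W1; b1 -= eta * grad_b1
```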
Practical considerations
-Preprocessing of training data
-Initialization of weights
-Choosing the learning rate
-Batch or on-line learning
-Choosing the transfer functions
Normalize data
The raw data may have different units (for example, kilometers vs. meters) and different ranges; if we do not normalize the data, the training result is poor or even meaningless.
Normalization makes sure each feature gets a fair chance.
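Two common ways to normalize features, sketched below: z-score standardization and min-max rescaling to a given range (e.g., to match the bounds of the activation function). The small epsilon guarding against zero ranges is an implementation detail, not from the notes.

```python
import numpy as np

def zscore(X):
    """Standardize each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def minmax(X, lo=0.0, hi=1.0):
    """Rescale each feature to the range [lo, hi]."""
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    return lo + (X - X.min(axis=0)) / span * (hi - lo)
```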
Overfitting
Overfitting means that the model works well on the training data but performs poorly on data it has not seen before.
Ways to reduce overfitting:
Get more data (data augmentation);
Cross validation;
Weight decay;
Dropout;
Early stopping (a regularization technique: the idea is not to give the network time to overfit; keep saving the parameters while performance on the held-out set improves, continue until the point where performance starts to decline, then stop training);
Bayesian methods;
Adding noise;
In CNNs, pooling also helps to avoid overfitting.
Underfitting
Add new features (feature augmentation);
Try non-linear models, for example DNNs.
Vector norm
A vector norm maps a vector to a non-negative scalar that reflects the "length" (scale) of the vector.
Feature Extraction
either from the raw data or from other features ==> derived features
Feature Selection
The curse of dimensionality: once the feature dimension exceeds a certain limit, classifier performance decreases as the dimensionality grows (and higher dimensionality also increases the time needed to train the model). This degradation is usually caused by irrelevant and redundant features among the high-dimensional features, so the main purpose of feature selection is to remove the irrelevant and redundant features:
3 basic approaches
Filter methods
Wrapper methods
Embedded methods
Feature Generation
features: numerical values passed to the classifier.
Given the raw input data:
How to best describe this data with numerical features?
What values can we extract or generate from the raw input?
Problem specific solutions.
Common/general approaches: PCA.
Our focus: generating features from images.
Domain specific: DFT, DCT/DST, convolution.
Important in any pattern recognition task.
DFT
Computers can only deal with discrete, finite-length signals, so after sampling we obtain a discrete signal and use the DFT to transform a discrete, finite-length signal in the time domain into a discrete, finite-length signal in the frequency domain. This lets the computer work in both the time domain and the frequency domain.
The DFT (Discrete Fourier Transform) transforms a discrete, finite-length signal (or image) in the time or spatial domain into a discrete, finite-length signal in the frequency domain. It lets the computer work in both domains, and it is sometimes more efficient to analyze data in the frequency domain.
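A quick sketch of the idea: sample a signal, take its DFT (here via NumPy's FFT implementation), and read off the frequency content. The signal and sampling rate are made up for illustration.

```python
import numpy as np

fs = 100                                      # sampling frequency in Hz
t = np.arange(0, 1, 1 / fs)                   # one second of samples (discrete time domain)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

X = np.fft.fft(x)                             # discrete, finite-length frequency-domain signal
freqs = np.fft.fftfreq(len(x), d=1 / fs)      # frequency of each DFT bin
# the magnitude spectrum |X| peaks at 5 Hz and 20 Hz, the two components of the signal
```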
Template Matching
Assume a set of reference patterns are available.
Match an image to one of these patterns.
Typical use cases: Speech recognition. Robotic vision. Motion estimation in video coding. Image database retrieval.
Measure the similarity of two patterns (reference and test).
For 1D signals: edit distance. See the algorithms course.
For images: cross-correlation and deformable templates.
Another approach: PCA. ===> Project into lower dimensions.
If the rotated images are correlated ======> their projections can match.
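A naive sketch of template matching for images via normalized cross-correlation: slide the template over the image and keep the position with the highest correlation score. This brute-force version is only for illustration; practical systems use faster, more robust implementations.

```python
import numpy as np

def match_template(image, template):
    """Return the top-left position where the template correlates best with the image."""
    ih, iw = image.shape
    th, tw = template.shape
    tz = template - template.mean()
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            patch = image[r:r + th, c:c + tw]
            pz = patch - patch.mean()
            denom = np.sqrt((pz ** 2).sum() * (tz ** 2).sum()) + 1e-12
            score = (pz * tz).sum() / denom        # normalized cross-correlation in [-1, 1]
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```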
Deep learning
Deep learning is a subset of machine learning; it uses multiple processing layers to extract features from lower layers to higher layers in a hierarchical architecture.
Why deep learning
To address the challenge of generalizing to new examples when working with high-dimensional data.
Traditional machine learning: hand-crafted features
Deep learning can learn features directly from data without the need for manual feature extraction.
Why GPUs good for deep learning.
1) GPUs have many more resources and faster bandwidth to memory
2) Deep learning computations fit well with GPU architecture.
DNN
A DNN can have hundreds of hidden layers to extract better feature representations.
The term "deep" refers to the number of layers in the network: the more layers, the deeper the network.
Popular DNN models
Supervised learning
each sample is a pair consisting of an input x and a desired target label y.
Goal: predict the target label of unseen inputs
Unsupervised learning
Challenge: In a lot of real-world use cases, even small-scale data collection can be extremely expensive or sometimes near-impossible (e.g. in medical imaging).
Autoencoder
- Encoder compresses the input into a hidden representation.
- Decoder reconstructs the input from the hidden representation.
Goal of training: minimize the difference between the input vector x and the reconstruction vector z, which is called the reconstruction error (the loss function).
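A minimal autoencoder sketch in PyTorch (assumed available): the encoder compresses the input, the decoder reconstructs it, and training minimizes the reconstruction error between the input x and the reconstruction z. Layer sizes, learning rate and the dummy batch are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())      # compress input to a hidden code
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())   # reconstruct input from the code
model = nn.Sequential(encoder, decoder)

criterion = nn.MSELoss()                                     # reconstruction error
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)                                      # dummy batch of inputs
for step in range(100):
    z = model(x)                                             # reconstruction of x
    loss = criterion(z, x)                                   # compare reconstruction with input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```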
SAE
Stacked/deep AutoEncoder (SAE) is constructed by extending the encoder and decoder of an autoencoder with multiple hidden layers.
SDAE
Stacked Denoising AutoEncoder
Idea: adding noise to the input forces the autoencoder to learn more robust features.
Phase 1: unsupervised pre-training
Phase 2: supervised fine-tuning
Improve Performance
DATA
Get more data
Invent more data: generate new data by creating randomly modified versions of existing data ======> data augmentation or data generation
Rescale the data: to the bounds based on the activation functions.
Transform the data: scaling / attribute decompositions / attribute aggregations
Feature selection
Hyper-parameters Tuning
Activation Functions
Optimization algorithm
Loss function
Weight Initialization
Batches and Epochs: try a batch size equal to the training data size (batch learning) / try a batch size of one (online learning).
Network Topology: larger networks have a greater representational capability / more layers offer more opportunity for hierarchical re-composition of abstract features learned from the data
Learning Rate: try an adaptive learning rate
DNN training
Three common ways for DNN training
1.Purely supervised
-Initialize parameters randomly
-Train in supervised mode (typically with standard backpropagation procedures and a stochastic gradient descent algorithm with mean squared error as the loss function)
-Used in most practical systems for speech and image
2.Unsupervised, layerwise+ supervised classifier on top
-Train each layer unsupervised, one after the other
-Train a supervised classifier on top, keeping the other layers fixed
-Good when very few labeled samples are available
3.Unsupervised, layerwise+ global supervised fine-tuning
-Train each layer unsupervised, one after the other
-Add a classifier layer, and retrain the whole thing supervised
-Good when label set is poor
CNN
CNNs take advantage of the fact that the input consists of images. Ordinary NNs do not scale well to image data.
The convolution is applied using a convolution filter to produce a feature (activation) map.
Pooling layer
Max pooling
the most common downsampling operation. Max pooling is done by applying a max filter to (usually) non-overlapping subregions of the initial representation.
Average pooling
consider the average output of a rectangular neighborhood (possibly weighted by the distance from the central pixel).
Max pooling extracts the most prominent features, such as edges, whereas average pooling extracts features more smoothly.
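A small sketch of non-overlapping max and average pooling on a 2-D feature map, showing how each downsampling variant summarizes a region. The pool size is illustrative.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample a 2-D feature map with non-overlapping size x size regions."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % size, :w - w % size]     # crop to a multiple of the pool size
    blocks = cropped.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))                      # strongest response per region
    return blocks.mean(axis=(1, 3))                         # smooth average response per region
```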
FC layer
At the end of the network is a fully connected (FC) layer. This layer basically takes an input volume (whatever the output is of the conv or ReLU or pool layer preceding it) and outputs an N-dimensional vector, where N is the number of classes.
CNN application
Applications: classification (is it a cat or a dog); detection (is there a cat in the image); segmentation (which part of the image is the cat's outline, separating it from the background)
RNN
Recurrent neural networks are used to process sequences of interrelated information and make predictions from them.
Other
Deep learning:
Improve performance:
Data: 1. more data. 2. bigger model. 3. more computation.
Hyper-parameter tuning: 1. optimize the algorithm. 2. pay attention to loss function. 3. weight initialization
Three common ways for DNN training:
Purely supervised: Used in most practical systems for speech and image.
Unsupervised, layer wise + supervised classifier on top:
Good when very few labeled samples are available
Unsupervised, layer wise + global supervised fine-tuning.
Good when label set is poor.