machine & deep learning study notes

Author: ariali9 | Published 2018-05-14 02:35

    Machine learning

    1. AI (Artificial Intelligence): machine learning covers the "learning" part of AI.

    2. Pattern Recognition: recognize and classify objects according to their features.

    3. Data Mining: analyze data to extract new information and knowledge.

    Pattern recognition tasks: Sequence labeling: assign a label to each input value; e.g. if the input is a word, decide its part of speech from the context.

    Parsing: analyze the grammatical structure of natural-language input.

    Classification: the classes are known in advance; assign each input to one of them.

    Clustering: create classes based on the features of the inputs.

    Approach: multiple features: place the decision boundaries in feature space where the probability of error is minimized.

    Bayesian Decision Theory

    Design classifiers to recommend decisions that minimize some total expected ”risk”.

    Posterior = likelihood × prior / evidence: P(wj|x) = P(x|wj) P(wj) / P(x)

    Decision using the prior alone:

    if P(w1) > P(w2), decide w1
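
    A tiny numeric sketch of Bayes' rule; the priors and likelihoods below are made-up numbers, not from the notes:

```python
# Bayes' rule with made-up numbers for two classes w1 and w2.
prior = {"w1": 0.7, "w2": 0.3}                  # P(wj)
likelihood = {"w1": 0.2, "w2": 0.6}             # P(x | wj) for one observed x
evidence = sum(likelihood[w] * prior[w] for w in prior)   # P(x)

posterior = {w: likelihood[w] * prior[w] / evidence for w in prior}
decision = max(posterior, key=posterior.get)    # pick the class with the largest posterior
print(posterior, decision)
```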

    general theory

    1. use more than one feature

    2. allow more than two categories

    3. Allow actions other than classifying the input to one of the possible categories (e.g., rejection).

    4. Employ a more general error function (i.e., “risk” function) by associating a “cost” (“loss” function) with each error (i.e., wrong action)

    loss function


    Find the action with the minimum conditional risk R(ai|x).

    Zero-one loss function: lambda(ai|wj) = 0 when i = j and 1 otherwise ===> R(ai|x) = 1 - P(wi|x)
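
    In place of the missing figure, the standard conditional-risk formula from Bayesian decision theory, added here for reference:

```latex
R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x),
\qquad
\lambda(\alpha_i \mid \omega_j) =
\begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}
\;\Longrightarrow\; R(\alpha_i \mid x) = 1 - P(\omega_i \mid x)
```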

    discriminant function

    A useful way of representing pattern classifiers: one discriminant function gi(x) per class; x is assigned to the class with the largest gi(x).

    Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.

    Naïve Bayes

    The Naïve Bayes classifier assumes that all features are conditionally independent given the class. In a high-dimensional feature space it is difficult to estimate the joint probability directly, so Naïve Bayes assumes that all features are conditionally independent.

    Naïve Bayes: the conditional independence assumption

    – Training is very easy and fast; it only requires considering each attribute in each class separately

    – Testing is straightforward; just look up tables or compute conditional probabilities with the estimated distributions

    • A popular generative model

    – Performance competitive with most state-of-the-art classifiers, even when the independence assumption is violated

    – Many successful applications, e.g., spam mail filtering

    – A good candidate as a base learner in ensemble learning
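
    A minimal sketch with scikit-learn's GaussianNB on the iris data; the dataset and split are illustrative choices, not from the notes:

```python
# Minimal Naive Bayes sketch: GaussianNB treats each feature as conditionally
# independent and Gaussian within each class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)   # "training" = per-class mean/variance estimates
print("test accuracy:", clf.score(X_test, y_test))
```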

    ROC curve

    Receiver Operating Characteristic (ROC) Curve

    ROC curves can help us evaluate system performance at different thresholds and can be used to compare tests/procedures.

    An ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied; in machine learning it is used to assess classifiers and compare tests.

    AUC

    Area under ROC curve (AUC)

    Overall measure of test performance

    Comparisons between two tests based on differences between (estimated) AUC
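
    A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the labels and scores below are made up:

```python
# ROC curve and AUC for a binary classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                  # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])    # classifier scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, y_score))
```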

    Decision boundary

    is a hypersurface that partitions the underlying vector space into two sets, one for each class.

    KNN

    The k-nearest neighbors algorithm is a non-parametric method used for classification and regression.

    Pros and cons:

    Cons: 1. Slower prediction with larger datasets.

    2. Many irrelevant features may cause problems.

    3. The results tend to be poor in high-dimensional cases.

    4. Gives no insight about the patterns underlying the data.

    5. All k nearest neighbors have the same influence on the prediction.

    6. Closer nearest neighbors should (perhaps) have a higher influence on the prediction.

    Pros: 1. Easy to use.

    2. Good flexibility: a suitable distance measure can be chosen for the data.

    3. In complex non-linear problems, it can model better than basic linear models.

    4. A good data set gives good predictive accuracy.

    5. No learning algorithm needed. We do not explicitly learn a model out of the training data; the data is the model.

    Practical notes: 1. Consider sample weights and feature weights.

    2. Choosing an odd value for k is a good idea.

    To avoid overfitting: choose a suitable k based on the size of your data set.
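
    A minimal k-NN sketch with scikit-learn; the dataset, k = 5 and distance weighting are illustrative choices:

```python
# k-nearest-neighbours classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# an odd k helps avoid ties; distance weighting gives closer neighbours
# a larger influence on the prediction
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```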

    Covariance matrix

    Variance: How much a random variable varies around the expected value

    Covariance measures the strength of the linear relationship between two random variables.

    For an N-dimensional random variable, the matrix whose elements are the covariances of every pair of components is called the covariance matrix.

    Covariance becomes more positive for each pair of values which differ from their mean in the same direction.

    Covariance becomes more negative with each pair of values which differ from their mean in opposite directions.

    Application examples: iris data, eigenfaces
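
    A small NumPy sketch of computing the covariance matrix of the iris features (illustrative):

```python
# Covariance matrix of the iris features.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                 # shape (150, 4): one row per sample
cov = np.cov(X, rowvar=False)        # 4x4 matrix; entry (i, j) = Cov(feature_i, feature_j)
print(cov)
```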

    PCA

    Principal component analysis (PCA) uses the variance in the data as the structure preservation criterion.

    PCA tries to preserve as much of the original variance of the data as possible when it is projected to a lower-dimensional space.

    Principal component analysis is a useful method for dimensionality reduction (projecting a high-dimensional space onto a lower-dimensional space).

    In data preprocessing, it treats the direction of largest variance in the raw data as the main feature, and the eigenvector of the covariance matrix with the largest eigenvalue gives that direction of largest variance.

    pros:

    PCA is a non-parametric method, which means the same raw input always produces the same output.

    cons:

    PCA is a linear dimensionality-reduction method; it cannot handle non-linear structure in the raw data.

    Sometimes the maximum-variance projection makes two groups belonging to the same dataset overlap, so a projection direction that keeps the groups from overlapping should be chosen.

    The main feature lies in the direction of largest variance, in other words the direction of the eigenvector with the largest eigenvalue.

    The basis matrix is composed of the eigenvectors, and we can obtain it by diagonalizing the covariance matrix.
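
    A minimal NumPy sketch of PCA via diagonalization of the covariance matrix; the random placeholder data and the choice of two components are illustrative:

```python
# PCA by diagonalizing the covariance matrix.
import numpy as np

X = np.random.randn(200, 5)                    # placeholder data: 200 samples x 5 features
Xc = X - X.mean(axis=0)                        # center the data
cov = np.cov(Xc, rowvar=False)                 # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvectors = principal directions
order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
W = eigvecs[:, order[:2]]                      # basis matrix: top-2 principal directions
X_proj = Xc @ W                                # project to the 2-D subspace
print(X_proj.shape)                            # (200, 2)
```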

    SOM

    a type of ANN that is trained using unsupervised learning to produce a low-dimensional, discrete representation of the input space.

    a clustering (vector quantization) method that combines topology preservation with good data-visualization possibilities

    The SOM is a lattice of competitive neural units used for clustering (vector quantization) and topology preservation.

    topology preservation means that input patterns close in the input space are mapped to units close on the SOM lattice and units close on the SOM are close in the input space.

    The training algorithm of the SOM is based on unsupervised learning, which can be either iterative or batch based.

    The SOM can be used for data visualization, clustering (or classification), estimation and a variety of other purposes.
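
    A minimal NumPy sketch of sequential SOM training; the lattice size, learning-rate schedule and neighbourhood schedule are illustrative choices:

```python
# Sequential SOM training (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))                    # 500 input vectors, 3-D
rows, cols, dim = 10, 10, 3
weights = rng.random((rows, cols, dim))        # lattice of neuron weight vectors

# grid coordinates of each unit, used by the neighbourhood function
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_epochs = 20
for epoch in range(n_epochs):
    lr = 0.5 * (1 - epoch / n_epochs)                  # decaying learning rate
    sigma = max(1.0, 5.0 * (1 - epoch / n_epochs))     # decaying neighbourhood radius
    for x in data:
        # best-matching unit (BMU): unit whose weight vector is closest to x
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), (rows, cols))
        # Gaussian neighbourhood on the lattice, centred at the BMU
        d_grid = np.linalg.norm(grid - np.array(bmu), axis=-1)
        h = np.exp(-(d_grid ** 2) / (2 * sigma ** 2))
        # move every unit towards x, weighted by neighbourhood and learning rate
        weights += lr * h[..., None] * (x - weights)
```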

    batch training

    the gradient is computed for the entire input set and the map is updated toward the estimated optimum for the set.

    Batch learning is the opposite of sequential learning: the neuron weights are updated only after all inputs have been seen. It gives a more accurate estimate of the gradient and converges to a local minimum faster.

    U-matrix

    The U-matrix shows the distances between neighboring units of the SOM; a high value in the U-matrix means a large distance, i.e. a cluster border, so the U-matrix visualizes the cluster structure of the map.

    ANN

    Computational models inspired by the human brain:

    -Massively parallel, distributed system, made up of simple processing units (neurons)

    -Synaptic connection strengths among neurons are used to store the acquired knowledge.

    -Knowledge is acquired by the network from its environment through a learning process

    Many types of models (supervised, unsupervised) for different tasks (classification, regression, clustering, visualization)

    supervised

    trained to produce desired outputs in response to sample inputs

    Unsupervised

    trained by letting the network adjust itself to inputs to find relationships (e.g. clusters) within data. Requires only data, no labeled samples

    delta rule

    Delta learning rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network.

    Differentiate the error to get its derivatives with respect to the weights ===> update the weights
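
    A small sketch of the delta rule for a single linear neuron; the data and learning rate are made up:

```python
# Delta rule for one linear neuron, trained by batch gradient descent.
import numpy as np

X = np.random.randn(100, 3)                    # inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(100)    # noisy targets

w = np.zeros(3)
lr = 0.05
for epoch in range(200):
    y_hat = X @ w                              # neuron output
    error = y_hat - y
    grad = X.T @ error / len(X)                # derivative of mean squared error w.r.t. w
    w -= lr * grad                             # delta rule: step against the gradient
print(w)
```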

    Batch vs. Sequential Learning

    batch

    *More accurate estimate of gradient

    *Converges to local minimum faster

    sequential

    *Simpler to program

    *May escape from local minima (change order of presentation)

    Both ways need many epochs (passes through the whole dataset)

    Activation functions

    A sigmoid activation function is typically used in the hidden units, and sigmoid or linear activation functions are used in the output units

    Sigmoid

    Tanh

    Rectified Linear Unit(ReLU)
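
    NumPy definitions of these three activation functions, for reference:

```python
# Common activation functions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes input to (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes input to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # zero for negative input, identity otherwise
```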

    ANN summary

    • Perceptron and linear regression optimize the same target function

    • In both cases we compute the gradient (vector of partial derivatives)

    • In the case of linear regression, we set the gradient to zero and solve for vector w. As the solution we have a closed formula for w such that the target function obtains the global minimum.

    • In the case of perceptron, we iteratively go in the direction of the minimum by going in the direction of minus the gradient.

    MLP

    The complexity of the MLP can vary from a simple parametric model to a complex nonlinear regression model by varying the number of layers and the number of units in each layer.

    Understand model complexity

    Model complexity describes how large the space of candidate models is for a given training task. If we use methods such as weight decay to reduce model complexity, we are better able to avoid over-fitting.

    Because training is a non-linear optimization, it usually stops at a local minimum.

    Backpropagation

    In general, backpropagation uses the error at the output layer to correct the weights of the neurons in an MLP layer by layer, working back towards the inputs. It uses the chain rule so that each gradient calculation only involves local terms.

    Backpropagation is used to calculate the gradient needed to update the weights of the network. It is commonly used in DNNs.
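
    A minimal NumPy sketch of backpropagation in a one-hidden-layer MLP with sigmoid units and squared error; the layer sizes, learning rate and toy data are illustrative:

```python
# Backpropagation in a tiny one-hidden-layer MLP.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, :1] + X[:, 1:] > 1.0).astype(float)        # toy binary target, shape (200, 1)

W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)    # input -> hidden
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)    # hidden -> output
lr = 0.5

for epoch in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the output error layer by layer (chain rule)
    d_out = (out - y) * out * (1 - out)               # dLoss/d(pre-activation), squared error
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent weight updates
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)
```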

    Practical considerations

    -Preprocessing of training data

    -Initialization of weights

    -Choosing the learning rate

    -Batch or on-line learning

    -Choosing the transfer functions

    Normalize data

    Raw data may have different units (for example, kilometers vs. meters) and different ranges. If we do not normalize the data, the training result is poor or even meaningless.

    Normalization makes sure each feature gets a fair chance.
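
    A small sketch of z-score normalization with NumPy; the numbers are made up:

```python
# z-score normalization so each feature contributes on a comparable scale.
import numpy as np

X = np.array([[1200.0, 0.5],          # features with very different ranges
              [ 800.0, 0.9],
              [1500.0, 0.1]])
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance per feature
print(X_norm)
```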

    Overfitting

    Overfitting means that the model works well on the training data but performs poorly on data it has not seen before.

    Ways to reduce overfitting:

    More data (data augmentation);

    Cross validation;

    Weight decay;

    Dropout;

    Early stopping (a regularization technique: the idea is not to give the network time to overfit; we save the parameters whenever performance on the held-out set improves, keep training until performance starts to drop, and then stop; see the sketch after this list);

    Bayesian methods;

    Noise;

    In CNNs, pooling also helps to avoid overfitting.
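
    A runnable sketch of the early-stopping idea on a toy linear model; the model, patience value and data are made up, and with a real network only the training and validation steps would change:

```python
# Early stopping on a tiny least-squares model trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)   # noisy targets
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(5)
best_w, best_val, patience, bad = w.copy(), np.inf, 10, 0
for epoch in range(1000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(X_tr)      # one gradient-descent step
    w -= 0.05 * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)       # held-out loss
    if val_loss < best_val:                            # still improving: keep these weights
        best_val, best_w, bad = val_loss, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:                            # no improvement for `patience` epochs: stop
            break
w = best_w
```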

    Underfitting

    Add new features (feature augmentation);

    Try non-linear models, for example DNNs.

    Vector norm

    A vector norm maps a vector to a non-negative scalar that measures its "length"; it reflects how much the vector is scaled.

    Feature Extraction

    either from the raw data or from other features ==> derived features

    Feature Selection

    The so-called curse of dimensionality: once the feature dimensionality passes a certain limit, classifier performance actually decreases as the dimensionality increases (and the higher the dimensionality, the larger the cost of training the model). The degradation is usually caused by irrelevant and redundant features among the high-dimensional features, so the main purpose of feature selection is to remove those irrelevant and redundant features:

    3 basic approaches

    Filter methods

    Wrapper methods

    Embedded methods

    Feature Generation

    features: numerical values passed to the classifier.

    Given the raw input data:

    How to best describe this data with numerical features?

    What values can we extract or generate from the raw input?

    Problem specific solutions.

    Common/general approaches: PCA.

    Our focus: generating features from images.

    Domain specific: DFT, DCT/DST, convolution.

    Important in any pattern recognition task.

    DFT

    Computers can only handle discrete, finite-length signals, so after sampling we use the DFT to transform a discrete, finite-length signal in the time domain into a discrete, finite-length signal in the frequency domain. This lets a computer work in both the time domain and the frequency domain.

    DFT stands for Discrete Fourier Transform. It transforms a discrete, finite-length signal (or image) in the time or spatial domain into a discrete, finite-length signal (image) in the frequency domain, so a computer can compute in both domains; sometimes it is more efficient to analyze data in the frequency domain.
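
    A small NumPy sketch of the DFT of a sampled signal; the sampling rate and frequencies are made up:

```python
# DFT of a sampled signal via NumPy's FFT.
import numpy as np

fs = 100                                       # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)                    # 1 second of samples
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)   # 5 Hz + 20 Hz components

X = np.fft.fft(x)                              # discrete, finite-length spectrum
freqs = np.fft.fftfreq(len(x), d=1 / fs)       # frequency axis for each DFT bin
# the magnitude |X| peaks at +/-5 Hz and +/-20 Hz
```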

    Template Matching

    Assume a set of reference patterns are available.

    Match an image to one of these patterns.

    Typical use cases: Speech recognition. Robotic vision. Motion estimation in video coding. Image database retrieval.

    Measure the similarity of two patterns (reference and test).

    For 1D signals: edit distance. See the algorithms course.

    For images: cross-correlation and deformable templates.

    Another approach: PCA. ===> Project into lower dimensions.

    If the rotated images are correlated ===> their projections can match.

    Deep learning

    Deep learning is a subset of machine learning; it uses multiple processing layers in a hierarchical architecture to extract features from lower layers to higher layers.

    Why deep learning

    To address the challenge of generalizing to new examples when working with high-dimensional data.

    Traditional machine learning: hand-crafted features

    Deep learning can learn features directly from data without the need for manual feature extraction.

    Why GPUs good for deep learning.

    1) GPUs have many more resources and higher memory bandwidth

    2) Deep learning computations fit the GPU architecture well.

    DNN

    A DNN can have hundreds of hidden layers to extract better feature representations.

    The term "deep" refers to the number of layers in the network: the more layers, the deeper the network.

    Popular DNN models

    Supervised learning

    each sample is a pair consisting of an input x and a desired target label y.

    Goal: predict the target label of unseen inputs

    Unsupervised learning

    Challenge: In a lot of real-world use cases, even small-scale data collection can be extremely expensive or sometimes near-impossible (e.g. in medical imaging).

    Autoencoder

    1. Encoder compresses the input into a hidden representation.

    2. Decoder reconstructs the input from the hidden representation

    Goal of training: minimize the difference between the input vector x and the reconstruction vector z; this difference is the reconstruction error (the loss function).
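
    A minimal autoencoder sketch in PyTorch; the layer sizes, learning rate and random input batch are illustrative, not from the notes:

```python
# Autoencoder: encoder compresses, decoder reconstructs, loss = reconstruction error.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())     # compress input to a hidden code
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())  # reconstruct the input from the code
model = nn.Sequential(encoder, decoder)

loss_fn = nn.MSELoss()                                      # reconstruction error
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)                                     # a batch of (flattened) inputs
for step in range(100):
    z = model(x)                                            # reconstruction
    loss = loss_fn(z, x)                                    # difference between input and reconstruction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```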

    SAE

    Stacked/deep AutoEncoder (SAE) is constructed by extending the encoder and decoder of autoencoder with multiple hidden layers.

    SDAE

    Stacked Denoising AutoEncoder

    Idea: adding noise to the input forces the autoencoder to learn more robust features.

    Phase 1: unsupervised pre-training

    Phase 2: supervised fine-tuning

    Improve Performance

    DATA

    Get more data

    Invent more data: generate new data by creating randomly modified versions of existing data ===> data augmentation or data generation

    Rescale the data: to the bounds based on the activation functions.

    Transform the data: scaling / attribute decompositions / attribute aggregations

    Feature selection

    Hyper-parameters Tuning

    Activation Functions

    Optimization algorithm

    Loss function

    Weight Initialization

    Batches and Epochs: try a batch size equal to the training-data size (batch learning) / try a batch size of one (online learning).

    Network Topology: larger networks have a greater representational capability / more layers offer more opportunity for hierarchical re-composition of abstract features learned from the data

    Learning Rate: adaptive learning rate

    DNN training

    Three common ways for DNN training

    1.Purely supervised

    -Initialize parameters randomly

    -Train in supervised mode (typically with standard backpropagation and a stochastic gradient descent algorithm, with mean squared error as the loss function)

    -Used in most practical systems for speech and image

    2.Unsupervised, layerwise+ supervised classifier on top

    -Train each layer unsupervised, one after the other

    -Train a supervised classifier on top, keeping the other layers fixed

    -Good when very few labeled samples are available

    3.Unsupervised, layerwise+ global supervised fine-tuning

    -Train each layer unsupervised, one after the other

    -Add a classifier layer, and retrain the whole thing supervised

    -Good when label set is poor

    CNN

    CNNs take advantage of the fact that the input consists of images. Ordinary NNs do not scale well to image data.

    The convolution is applied using a convolution filter to produce a feature (activation ) map.
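
    A small sketch of applying one convolution filter to an image with SciPy; the random image and the simple edge kernel are illustrative choices:

```python
# Applying a convolution filter to an image to get a feature (activation) map.
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)                 # placeholder grayscale image
edge_filter = np.array([[-1, 0, 1],            # simple vertical-edge kernel
                        [-1, 0, 1],
                        [-1, 0, 1]])
feature_map = convolve2d(image, edge_filter, mode="valid")   # 26x26 activation map
print(feature_map.shape)
```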

    Pooling layer

    Max pooling

    the most common downsampling operation. Max pooling is done by applying a max filter to (usually) non-overlapping subregions of the initial representation.

    Average pooling

    consider the average output of a rectangular neighborhood (possibly weighted by the distance from the central pixel).

    Max pooling extracts the most prominent features such as edges, whereas average pooling extracts features more smoothly.
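
    A small NumPy sketch of 2x2 max pooling over non-overlapping regions; the feature map is random placeholder data:

```python
# 2x2 max pooling over non-overlapping regions.
import numpy as np

feature_map = np.random.rand(26, 26)
h, w = feature_map.shape
# crop to an even size, then take the max of each 2x2 block
cropped = feature_map[: h // 2 * 2, : w // 2 * 2]
pooled = cropped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))   # shape (13, 13)
print(pooled.shape)
```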

    FC layer

    At the end of the network is a fully connected (FC) layer. This layer takes an input volume (whatever the output of the preceding conv, ReLU, or pool layer is) and outputs an N-dimensional vector, where N is the number of classes.

    CNN application

    Classification (is it a cat or a dog); detection (is there a cat in the image); segmentation (which part of the image belongs to the cat, separating it from the background).

    RNN

    Recurrent neural network: used to process a sequence of interrelated information and make predictions.

    Other

    Deep learning:

    Improve performance:

    ​ Data: 1. more data. 2. bigger model. 3. more computation.

    ​ Hyper-parameter tuning: 1. optimize the algorithm. 2. pay attention to loss function. 3. weight initialization

    Three common ways for DNN training:

    Purely supervised: Used in most practical systems for speech and image.

    Unsupervised, layer wise + supervised classifier on top:

    Good when very few labeled samples are available

    Unsupervised, layer wise + global supervised fine-tuning.

    Good when label set is poor.
