Logistic Regression MaxEnt

Author: 美环花子若野 | Published: 2019-01-07 17:46


From: https://www.quora.com/What-is-the-difference-between-Logistic-Regression-and-Max-entropy-classifiers

The short answer is: there is no real difference between a MaxEnt model and a logistic regression. They are both log-linear models.

And now, the long answer:

Logistic regression is a probabilistic model for the binary (two-class) case. MaxEnt generalizes the same principle to the multinomial (multi-class) case.

In both models, we want a conditional probability:

$$p(y \mid \mathbf{x})$$

where $y$ is the target class and $\mathbf{x}$ is a vector of features.

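In logistic regression, $y$ follows a Bernoulli distribution (a binomial distribution with a single trial). Thus, we can write the following probability mass function:

$$p(y \mid \mathbf{x}) = \begin{cases} h_\theta(\mathbf{x}) & \text{if } y = 1 \\ 1 - h_\theta(\mathbf{x}) & \text{if } y = 0 \end{cases}$$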

where $\theta$ is the vector of parameters and $h_\theta(\mathbf{x})$ is the hypothesis:

$$h_\theta(\mathbf{x}) = \frac{1}{1 + \exp(-\theta^\top \mathbf{x})}$$

The probability mass function can be rewritten as follows:

$$p(y \mid \mathbf{x}) = h_\theta(\mathbf{x})^{y} \left(1 - h_\theta(\mathbf{x})\right)^{1 - y}$$

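We estimate the parameters by maximizing the log-likelihood over $N$ observations:

$$\ell(\theta) = \sum_{i=1}^{N} \left[ y_i \log h_\theta(\mathbf{x}_i) + (1 - y_i) \log\left(1 - h_\theta(\mathbf{x}_i)\right) \right]$$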

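And the partial derivative for a given parameter $\theta_j$ is (with $x_{ij}$ the $j$-th feature of observation $i$):

$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{N} \left( y_i - h_\theta(\mathbf{x}_i) \right) x_{ij}$$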
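
As a rough illustration, the logistic regression log-likelihood and its gradient, $\sum_i (y_i - h_\theta(\mathbf{x}_i)) x_{ij}$, can be sketched in plain Python. The toy data, the starting `theta`, and all function names below are made up for this example. The finite-difference loop is only there as an independent check of the gradient formula.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta . x)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def log_likelihood(theta, X, y):
    """Sum over observations of y*log(h) + (1 - y)*log(1 - h)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total

def gradient(theta, X, y):
    """Partial derivatives: sum_i (y_i - h_theta(x_i)) * x_ij."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        residual = yi - hypothesis(theta, xi)
        for j, xij in enumerate(xi):
            grad[j] += residual * xij
    return grad

# Toy data (made up for illustration); the first feature is a constant bias term.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
theta = [0.1, -0.3]

analytic = gradient(theta, X, y)

# Central finite differences of the log-likelihood, one parameter at a time.
eps = 1e-6
numeric = []
for j in range(len(theta)):
    up = list(theta)
    up[j] += eps
    down = list(theta)
    down[j] -= eps
    numeric.append((log_likelihood(up, X, y) - log_likelihood(down, X, y)) / (2 * eps))

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places. Moving `theta` a small step along `analytic` (gradient ascent) increases the log-likelihood.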

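The MaxEnt model uses the same principle, but with a multinomial (categorical) distribution. Thus, we can write the following probability mass function for $C$ classes:

$$p(y \mid \mathbf{x}) = \begin{cases} h_{\theta_1}(\mathbf{x}) & \text{if } y = 1 \\ \quad\vdots \\ h_{\theta_C}(\mathbf{x}) & \text{if } y = C \end{cases}$$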

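Here, we have one vector of parameters $\theta_c$ and one hypothesis per class. Each hypothesis is a softmax function:

$$h_{\theta_c}(\mathbf{x}) = \frac{\exp(\theta_c^\top \mathbf{x})}{\sum_{k=1}^{C} \exp(\theta_k^\top \mathbf{x})}$$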

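The probability mass function can be rewritten as follows, where $\mathbb{1}\{y = c\}$ is 1 if $y = c$ and 0 otherwise:

$$p(y \mid \mathbf{x}) = \prod_{c=1}^{C} h_{\theta_c}(\mathbf{x})^{\mathbb{1}\{y = c\}}$$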

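As with logistic regression, the parameters $\theta$ are estimated by maximizing the log-likelihood:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}\{y_i = c\} \log h_{\theta_c}(\mathbf{x}_i)$$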

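And its corresponding partial derivative, for the $j$-th parameter of class $c$, is:

$$\frac{\partial \ell(\theta)}{\partial \theta_{cj}} = \sum_{i=1}^{N} \left( \mathbb{1}\{y_i = c\} - h_{\theta_c}(\mathbf{x}_i) \right) x_{ij}$$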

We can see that the MaxEnt model is a generalization of logistic regression to $C$ classes. In both cases, we compare the observed expectations (from the training set) to the model expectations (computed with the parameters), and the gradient is the difference between these two terms.
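
Both claims can be checked numerically with a small sketch. It assumes the usual softmax parameterization with one parameter vector per class. The data, parameter values, and function names are invented for the example, and classes are indexed from 0 rather than 1. With two classes and the first class's parameters fixed at zero, the softmax probability of the second class should coincide with the logistic sigmoid.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def class_probs(thetas, x):
    """h_{theta_c}(x) for every class c; thetas holds one parameter vector per class."""
    return softmax([sum(t * xj for t, xj in zip(theta_c, x)) for theta_c in thetas])

def gradient(thetas, X, y):
    """grad[c][j] = sum_i (1{y_i = c} - h_{theta_c}(x_i)) * x_ij."""
    grad = [[0.0] * len(thetas[0]) for _ in thetas]
    for xi, yi in zip(X, y):
        probs = class_probs(thetas, xi)
        for c, p in enumerate(probs):
            observed = 1.0 if yi == c else 0.0  # observed indicator for class c
            for j, xij in enumerate(xi):
                grad[c][j] += (observed - p) * xij  # observed minus model expectation
    return grad

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two-class check: class 0 has zero parameters, class 1 has the vector theta.
theta = [0.1, -0.3]
x = [1.0, 0.5]
p_softmax = class_probs([[0.0, 0.0], theta], x)[1]
p_logistic = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
print(p_softmax, p_logistic)

# Three-class toy data (made up for illustration).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [0, 2, 1]
thetas = [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.4]]
grad = gradient(thetas, X, y)
print(grad)
```

Because both the indicators and the softmax probabilities sum to one over the classes, the gradient entries for any feature `j` sum to zero across classes, which is a quick sanity check on an implementation.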

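The name log-linear model is sometimes used because, on a log scale, these models are linear. This is most apparent for the MaxEnt model:

$$\log p(y = c \mid \mathbf{x}) = \theta_c^\top \mathbf{x} - \log \sum_{k=1}^{C} \exp(\theta_k^\top \mathbf{x})$$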

The first term is linear in the feature space. The second term is the normalization constant: for fixed parameters it depends only on $\mathbf{x}$, not on the class. You can find a trick to compute this constant efficiently here: http://lingpipe-blog.com/2009/06...
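
The link above is truncated, so which trick it describes cannot be confirmed here. One widely used option for the log of the normalization constant is the log-sum-exp shift, which subtracts the largest score before exponentiating. A minimal sketch, with hypothetical parameter values and function names:

```python
import math

def log_sum_exp(scores):
    """log(sum(exp(s))) computed by factoring out the largest score."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_probs(thetas, x):
    """log p(y = c | x) = theta_c . x - log sum_k exp(theta_k . x)."""
    scores = [sum(t * xj for t, xj in zip(theta_c, x)) for theta_c in thetas]
    log_z = log_sum_exp(scores)  # shared by all classes
    return [s - log_z for s in scores]

# Scores this large overflow a direct math.exp call; the shifted version does not.
big = [1000.0, 1001.0, 999.0]
stable = log_sum_exp(big)

# Hypothetical parameters (one vector per class) and one feature vector.
thetas = [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.4]]
lp = log_probs(thetas, [1.0, 0.5])
total = sum(math.exp(v) for v in lp)  # probabilities should sum to 1

print(stable, lp, total)
```

The shift leaves the result unchanged mathematically, since $\log \sum_k e^{s_k} = m + \log \sum_k e^{s_k - m}$ for any $m$. Choosing $m$ as the maximum score keeps every exponent at or below zero.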