These notes cover the first course of the Natural Language Processing Specialization on Coursera, offered by DeepLearning.ai.
Table of Contents
- Sentiment Analysis with Logistic Regression
- Sentiment Analysis with Naïve Bayes
- Vector Space Models
- Machine Translation and Document Search
Sentiment Analysis with Logistic Regression
Feature Extraction
To use machine learning algorithms to make predictions, you first need to transform your text data into numerical data. This process is called feature extraction.
Given a tweet, or some text, you can represent it as a vector of dimension $V$, where $V$ corresponds to your vocabulary size. If you had the tweet "I am happy because I am learning NLP", then you would put a 1 in the corresponding index for every word in the tweet, and a 0 otherwise.

You can see there are many 0s in this vector; such a vector is called a sparse representation.
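As a rough sketch (the tiny vocabulary below is made up for illustration), the sparse vector for that tweet could be built like this:

```python
# Toy example: build a sparse (mostly-zero) vector of vocabulary size V
vocabulary = ["i", "am", "happy", "because", "learning", "nlp", "sad", "not"]  # hypothetical tiny vocabulary
tweet = "I am happy because I am learning NLP"

tokens = tweet.lower().split()
vector = [1 if word in tokens else 0 for word in vocabulary]
print(vector)  # [1, 1, 1, 1, 1, 1, 0, 0]
```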
Problems with sparse representations:
- Large training time
- Large prediction time
Instead, given a word, you can keep track of the number of times it shows up in the positive class and the number of times it shows up in the negative class. Using both of those counts, you can extract features and feed them into your logistic regression classifier.
Positive and negative counts: for each word, count how often it appears in positive tweets and how often it appears in negative tweets (a word-frequency table per class).
After you get the word frequencies, you can use them to extract useful features for sentiment analysis.
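A minimal sketch of this feature extraction, assuming a hypothetical `freqs` dictionary keyed by `(word, class)` pairs that was built from a labeled corpus:

```python
# Hypothetical frequency dictionary: (word, sentiment) -> count in that class
freqs = {
    ("happy", 1): 3, ("happy", 0): 0,
    ("sad", 1): 0,   ("sad", 0): 4,
    ("am", 1): 3,    ("am", 0): 3,
}

def extract_features(tokens, freqs):
    # Feature vector: [bias, sum of positive counts, sum of negative counts]
    pos = sum(freqs.get((w, 1), 0) for w in tokens)
    neg = sum(freqs.get((w, 0), 0) for w in tokens)
    return [1, pos, neg]

print(extract_features(["i", "am", "happy"], freqs))  # [1, 6, 3]
```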

Preprocessing
When preprocessing, you have to perform the following:
- Eliminate handles and URLs.
- Tokenize the string into words.
- Remove stop words like "and, is, a, on, etc."
- Stemming: convert every word to its stem. For example, dancer, dancing, and danced all become "danc". You can use the Porter stemmer for this.
- Convert all your words to lowercase.
Example:
"@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!"
Preprocessed tweet:
[tun, great, ai, model]
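A sketch of such a pipeline using NLTK (it assumes the `stopwords` corpus has been downloaded via `nltk.download("stopwords")`; the regular expressions for handles and URLs are illustrative, not necessarily the course's exact ones):

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def preprocess_tweet(tweet):
    # Remove URLs and @handles
    tweet = re.sub(r"https?://\S+", "", tweet)
    tweet = re.sub(r"@\w+", "", tweet)
    # Tokenize and lowercase
    tokens = TweetTokenizer(preserve_case=False).tokenize(tweet)
    # Remove stop words and punctuation, then stem
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    return [stemmer.stem(t) for t in tokens
            if t not in stop_words and t not in string.punctuation]

print(preprocess_tweet("@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!"))
# should produce something close to the preprocessed tweet shown above
```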
Logistic Regression
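Logistic regression applies the sigmoid function to the dot product of the feature vector $x$ and a weight vector $\theta$, and predicts positive sentiment when the output is above 0.5. A minimal prediction sketch (the weights and features below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights theta = [bias, w_pos, w_neg]
theta = np.array([0.0, 0.005, -0.005])
x = np.array([1.0, 3020.0, 61.0])  # made-up feature vector [1, sum_pos, sum_neg]

prediction = sigmoid(np.dot(x, theta))
print(prediction > 0.5)  # True -> positive sentiment
```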
Sentiment Analysis with Naïve Bayes
Bayes' Rule
Conditional probabilities help us reduce the sample search space. Bayes' rule lets us express one conditional probability in terms of the other: $P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$.
Naïve Bayes
To build a classifier, we start by computing conditional probabilities from the following table of word counts per class:

This allows us to compute the following table of probabilities:

Once you have the probabilities, you can compute the likelihood score as follows: $\prod_{i=1}^{m} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$, where $w_1, \dots, w_m$ are the words in the tweet.
For example,

A score greater than 1 indicates that the class is positive, otherwise it is negative.
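A tiny sketch of this score with made-up conditional probabilities:

```python
# Hypothetical P(word | pos) and P(word | neg) values
p_pos = {"happy": 0.14, "learning": 0.10, "sad": 0.05}
p_neg = {"happy": 0.05, "learning": 0.10, "sad": 0.15}

tweet = ["happy", "learning"]
score = 1.0
for w in tweet:
    score *= p_pos[w] / p_neg[w]

print(score, "-> positive" if score > 1 else "-> negative")  # 2.8 -> positive
```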
Laplacian Smoothing
We usually compute the probability of a word given a class as follows:
$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class})}{N_{\text{class}}}, \quad \text{class} \in \{\text{pos}, \text{neg}\}$
However, if a word does not appear in training, it automatically gets a probability of 0. To fix this, we add smoothing as follows:
$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_{\text{class}} + V}$
$N_{\text{class}}$: frequency of all words in the class
$V$: number of unique words in the vocabulary
Note that we added a 1 in the numerator, and since there are $V$ words to normalize, we add $V$ in the denominator. This method is called Laplacian smoothing.
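A minimal sketch of the smoothed estimate, assuming a small hypothetical `freqs` dictionary:

```python
# Hypothetical (word, class) -> count dictionary
freqs = {("happy", 1): 3, ("sad", 0): 4, ("am", 1): 3, ("am", 0): 3}
vocab = {w for (w, _) in freqs}   # unique words
V = len(vocab)                    # vocabulary size

def smoothed_prob(word, cls, freqs, V):
    # Laplacian smoothing: (freq(word, class) + 1) / (N_class + V)
    n_class = sum(count for (w, c), count in freqs.items() if c == cls)
    return (freqs.get((word, cls), 0) + 1) / (n_class + V)

print(smoothed_prob("sad", 1, freqs, V))  # nonzero even though "sad" never appears as positive
```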
Log Likelihood
Log likelihoods are logarithms of the probability we calculated before. They are more convenient to work with and they appear throughout deep learning and NLP.
In order to compute the log likelihood, we need the ratio $\text{ratio}(w) = \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$ for each word and use it to compute a score that will allow us to decide whether a tweet is positive or negative. The higher the ratio, the more positive the word is.
As $m$ (the number of words in the tweet) gets larger, the product $\frac{P(\text{pos})}{P(\text{neg})} \prod_{i=1}^{m} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$ can underflow numerically, so we introduce the log, which gives you the following equation:
$\log \frac{P(\text{pos})}{P(\text{neg})} + \sum_{i=1}^{m} \log \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$
The first component is called the log prior and the second component is the log likelihood.
We further introduce $\lambda(w)$ as follows: $\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$.
For example,

Once you have computed the $\lambda$ dictionary, it becomes straightforward to do inference:

As you can see above, since the score is greater than 0, we classify the document as positive. If we had gotten a negative number, we would have classified it as negative.
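A sketch of this inference step, assuming a hypothetical `lambda_dict` of per-word log ratios and a `logprior`:

```python
# Hypothetical lambda(word) = log(P(word|pos) / P(word|neg)) values and log prior
lambda_dict = {"happy": 1.1, "because": 0.0, "sad": -1.2}
logprior = 0.0  # balanced corpus

def naive_bayes_predict(tokens, logprior, lambda_dict):
    # Words unseen in training contribute 0 (lambda defaults to 0)
    return logprior + sum(lambda_dict.get(w, 0.0) for w in tokens)

score = naive_bayes_predict(["i", "am", "happy", "because", "i", "am", "learning"], logprior, lambda_dict)
print(score, "-> positive" if score > 0 else "-> negative")  # 1.1 -> positive
```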
Training naive Bayes
- Get or annotate a dataset with positive and negative tweets.
- Preprocess the tweets.
- Compute the word frequencies $\text{freq}(w, \text{class})$.
- Get $P(w \mid \text{pos})$ and $P(w \mid \text{neg})$ using Laplacian smoothing.
- Get $\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$ for every word.
- Compute the log prior $\log \frac{P(\text{pos})}{P(\text{neg})}$, which is equal to $\log \frac{D_{\text{pos}}}{D_{\text{neg}}}$, where $D_{\text{pos}}$ and $D_{\text{neg}}$ correspond to the number of positive and negative documents respectively.
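Putting these steps together, a simplified training sketch, assuming `tweets` is a list of token lists and `labels` holds 1 for positive and 0 for negative:

```python
import math
from collections import defaultdict

def train_naive_bayes(tweets, labels):
    # Word frequencies per class
    freqs = defaultdict(int)
    for tokens, y in zip(tweets, labels):
        for w in tokens:
            freqs[(w, y)] += 1

    vocab = {w for (w, _) in freqs}
    V = len(vocab)
    n_pos = sum(c for (w, y), c in freqs.items() if y == 1)
    n_neg = sum(c for (w, y), c in freqs.items() if y == 0)

    # Smoothed conditional probabilities and lambda(word)
    lambda_dict = {}
    for w in vocab:
        p_w_pos = (freqs[(w, 1)] + 1) / (n_pos + V)
        p_w_neg = (freqs[(w, 0)] + 1) / (n_neg + V)
        lambda_dict[w] = math.log(p_w_pos / p_w_neg)

    # Log prior from document counts
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    logprior = math.log(d_pos / d_neg)

    return logprior, lambda_dict

tweets = [["happi", "learn", "nlp"], ["sad", "learn"]]
labels = [1, 0]
print(train_naive_bayes(tweets, labels))
```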
Applications of Naive Bayes
- Author identification
- Spam filtering
- Information retrieval
- Word disambiguation
This method is usually used as a simple baseline. It is also really fast.
Error Analysis
There are several mistakes that could cause you to misclassify an example or a tweet.
- Removing punctuation: stripping an emoticon such as ":(" can remove the token that carried the sentiment.
- Removing words: removing a word such as "not" during preprocessing can flip the meaning of the tweet.
- Word order: the same words in a different order (e.g., "I am happy because I did not go" vs. "I am not happy because I did go") can express opposite sentiments.
- Adversarial attacks: these include sarcasm, irony, and euphemisms.
Vector Space Models

- Represent words and documents as vectors
- Representation that captures relative meaning
Word by Word and Word by Doc
- Word by Word Design
- Make a co-occurrence matrix and extract vector representations for the words in your corpus (see the sketch after this list).
- Find relationships between words and vectors, also known as their similarity.

- Word by Document Design: count how many times each word appears in documents belonging to each category, so each word gets a vector of counts over document categories.

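A toy sketch of a word-by-word co-occurrence count (the corpus and window size are made up):

```python
from collections import defaultdict

corpus = ["i", "like", "simple", "data", "i", "prefer", "simple", "raw", "data"]
k = 2  # co-occurrence window size (hypothetical)

cooccurrence = defaultdict(int)
for i, w in enumerate(corpus):
    for j in range(max(0, i - k), min(len(corpus), i + k + 1)):
        if i != j:
            cooccurrence[(w, corpus[j])] += 1

print(cooccurrence[("data", "simple")])  # how often "simple" appears within k words of "data"
```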
- Vector Space
Once you've constructed the representations for multiple sets of documents or words, you'll get your vector space.

You'll make comparisons between vector representations using the cosine similarity and the Euclidean distance in order to get the angle and distance between them.
Euclidean Distance
$d(\vec{v}, \vec{w}) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}$
This is the formula you would use to get the Euclidean distance between vector representations in an $n$-dimensional vector space. It is also known as the norm of the difference between the vectors you are comparing.
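For example, with NumPy:

```python
import numpy as np

v = np.array([1, 6, 8])
w = np.array([0, 4, 6])

# Euclidean distance = norm of the difference vector
d = np.linalg.norm(v - w)
print(d)  # 3.0
```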
Cosine Similarity
$\cos(\beta) = \frac{\vec{v} \cdot \vec{w}}{\lVert\vec{v}\rVert\,\lVert\vec{w}\rVert}$
The main advantage of cosine similarity over the Euclidean distance is that it isn't biased by the size difference between the representations.

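A minimal NumPy sketch:

```python
import numpy as np

def cosine_similarity(v, w):
    # cos(beta) = (v . w) / (||v|| * ||w||)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0])
w = np.array([2.0, 4.0])   # same direction, different magnitude
print(cosine_similarity(v, w))  # 1.0 despite the size difference
```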
Manipulating Words in Vector Spaces

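This part of the course manipulates word embeddings with vector arithmetic to answer analogy questions such as finding a country's capital. A toy sketch with made-up 2-dimensional embeddings:

```python
import numpy as np

# Made-up 2-D word embeddings, only for illustration
embeddings = {
    "usa": np.array([5.0, 6.0]),
    "washington": np.array([10.0, 5.0]),
    "russia": np.array([5.5, 5.0]),
    "moscow": np.array([10.5, 4.0]),
}

# washington - usa suggests a "capital of" direction;
# apply the same offset to russia to predict its capital
predicted = embeddings["russia"] + (embeddings["washington"] - embeddings["usa"])
closest = min(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - predicted))
print(closest)  # moscow
```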
PCA
PCA consists of projecting your vectors into a lower-dimensional space while trying to retain as much information as possible.
PCA algorithm:
- Mean-normalize the data.
- Get the covariance matrix $\Sigma$.
- Perform SVD: $U, S, V = \text{SVD}(\Sigma)$, where $U$ contains the eigenvectors stacked column-wise and $S$ has the eigenvalues on the diagonal.
- Dot product to project the data onto the first $k$ eigenvectors: $X' = X\,U[:, :k]$.
- Percentage of retained variance: $\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{j=1}^{d} S_{jj}}$.
As an important side note, your eigenvectors and eigenvalues should be organized according to the eigenvalues in descending order. This condition will ensure that you retain as much information as possible from your original embedding.
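A sketch of these steps with NumPy on toy 2-dimensional data, projecting onto one component:

```python
import numpy as np

np.random.seed(1)
X = np.random.rand(10, 2) @ np.array([[2.0, 1.0], [1.0, 2.0]])  # toy data: 10 samples x 2 features

# 1. Mean-normalize the data
X_demeaned = X - X.mean(axis=0)

# 2. Covariance matrix
cov = np.cov(X_demeaned, rowvar=False)

# 3. SVD: U holds eigenvectors column-wise, S the eigenvalues (descending)
U, S, _ = np.linalg.svd(cov)

# 4. Project onto the first k components
k = 1
X_reduced = X_demeaned @ U[:, :k]

# 5. Percentage of retained variance
print(S[:k].sum() / S.sum())
```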
Machine Translation and Document Search
Transforming word vectors
Overview of Translation
Translate English words to French words (e.g., "cat" to "chat").
Try to find a matrix $R$ so that $X R \approx Y$, where $X$ is the matrix of English word vectors and $Y$ is the matrix of French word vectors (rows aligned so that row $i$ of $X$ and row $i$ of $Y$ are translations of each other).
Solving for $R$: minimize the loss $\text{Loss} = \frac{1}{m}\lVert X R - Y \rVert_F^2$ by gradient descent, using the gradient $g = \frac{2}{m} X^\top (X R - Y)$ and the update $R \leftarrow R - \alpha g$.
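A sketch of this gradient-descent procedure on random toy matrices (the learning rate and number of steps are arbitrary):

```python
import numpy as np

np.random.seed(0)
m, d = 100, 5
X = np.random.rand(m, d)        # toy "English" word vectors
R_true = np.random.rand(d, d)
Y = X @ R_true                  # toy "French" word vectors, perfectly aligned

R = np.random.rand(d, d)        # transformation matrix to learn
alpha = 0.1                     # arbitrary learning rate

for step in range(2000):
    diff = X @ R - Y
    gradient = (2 / m) * X.T @ diff   # gradient of (1/m) * ||XR - Y||_F^2
    R -= alpha * gradient

print(np.linalg.norm(X @ R - Y))  # should be close to 0 after training
```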
K-nearest neighbors
Notice that the transformed word vector (an English embedding multiplied by the $R$ matrix) lands in the French word vector space, but it is not necessarily identical to any of the actual French word vectors. You need to search through the French word vectors to find the French word whose vector is most similar to the one you created from the transformation.
To accelerate the search, you may search only a subset of the words. When you think about organizing subsets of a dataset efficiently, you may think about placing your data into buckets. And if you think about buckets, then you'll definitely want to think about hash tables.
Hash tables


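A basic, non-locality-sensitive hash table sketch, just to illustrate the bucket idea (the values and hash function are toy choices):

```python
# Simple hash table: bucket = value modulo number of buckets
def basic_hash_table(values, n_buckets):
    buckets = {i: [] for i in range(n_buckets)}
    for v in values:
        buckets[int(v) % n_buckets].append(v)
    return buckets

print(basic_hash_table([100, 10, 97, 17, 14], n_buckets=10))
# e.g. 100 and 10 share bucket 0, 97 and 17 share bucket 7, 14 goes to bucket 4
```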
Ideally, you want to have a hash function that puts similar word vectors in the same bucket.
To do this you'll need to use what's called locality sensitive hashing. Locality is another word for location, and sensitive is another word for caring. So locality sensitive hashing is a hashing method that cares deeply about assigning items to buckets based on where they're located in vector space.
Locality sensitive hashing
To start thinking about locality sensitive hashing, let's first assume that you're using word vectors with just two dimensions. Let's say you want to find a way to know that these blue dots are somehow close to each other, and that these gray dots are also related to each other.

First, divide the space using these dashed lines, which I'll call planes. It looks like the planes can help us bucket the vectors into subsets based on their location.

In two-dimensional space, a "plane" would be this magenta line, and it represents all the possible vectors that would be sitting on that plane.
You can define a plane with a single vector. This magenta vector is perpendicular to the plane, and it's called the normal vector to that plane.

Let's compute the dot product of the normal vector $P$ with other vectors.

Whether the dot product is positive or negative tells you whether a vector such as $V_1$ or $V_2$ is on one side of the plane or the other.
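A small sketch of this check (the normal vector and the test vectors are made up):

```python
import numpy as np

P = np.array([1, 1])       # normal vector defining the plane
v1 = np.array([1, 2])      # lies on the positive side
v2 = np.array([-4, -2])    # lies on the negative side

print(np.sign(np.dot(P, v1)))  # 1
print(np.sign(np.dot(P, v2)))  # -1
```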
Multiple planes
In order to divide your vector space into manageable regions, you'll want to use more than one plane.

So how do you get a single hash value?
First, compute the dot product of each plane's normal vector with the vector $v$, and determine the sign of each result.
Then set the per-plane hash value $h_i$ to 1 if the sign is positive (or zero), and to 0 if the sign is negative.
Finally, combine all the per-plane hash values using this formula: $\text{hash} = \sum_{i=0}^{n-1} 2^i \times h_i$.
Approximate nearest neighbors
So far, we've seen that a few planes, such as these three, can divide the vector space into regions. But are these planes the best way to divide up the vector space? What if, instead, you divided the vector space like this?

In fact, you can't know for sure which set of planes is the best way to divide up the vector space, so why not create multiple sets of random planes, so that you can divide the vector space into multiple, independent hash tables.
By using multiple sets of random planes for locality-sensitive hashing, you have a more robust way of searching the vector space for a set of vectors that are possible candidates to be nearest neighbors. This is called approximate nearest neighbors because you're not searching the entire vector space, but just a subset of it.
```python
import numpy as np

num_dimensions = 2
num_planes = 3

# Each row of this matrix is the normal vector of one random plane
random_planes_matrix = np.random.normal(size=(num_planes, num_dimensions))

v = np.array([[2, 2]])

def side_of_plane_matrix(P, v):
    # Dot product of every plane's normal vector with v, keeping only the sign
    dot_product = np.dot(P, v.T)
    sign_of_dot_product = np.sign(dot_product)
    return sign_of_dot_product

sides_of_planes = side_of_plane_matrix(random_planes_matrix, v)
```
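Continuing the sketch above, the per-plane signs can be combined into the single hash value described earlier (the helper name `hash_multi_plane` is mine, not necessarily the course's):

```python
def hash_multi_plane(P, v):
    # Combine per-plane signs into one bucket id: hash = sum_i 2^i * h_i
    signs = side_of_plane_matrix(P, v).flatten()
    hash_value = 0
    for i, s in enumerate(signs):
        h_i = 1 if s >= 0 else 0
        hash_value += (2 ** i) * h_i
    return hash_value

print(hash_multi_plane(random_planes_matrix, v))  # an integer between 0 and 2**num_planes - 1
```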