
1. NLP with Classification and Vector Spaces

Author: Kevin不会创作 | Published 2020-12-01 02:05

These notes are about the first course of the Natural Language Processing Specialization on Coursera, which is offered by DeepLearning.ai.

Table of Contents

  • Sentiment Analysis with Logistic Regression
  • Sentiment Analysis with Naïve Bayes
  • Vector Space Models
  • Machine Translation and Document Search

Sentiment Analysis with Logistic Regression

Feature Extraction

In order to use machine learning algorithms to make predictions, you first need to transform your text data into numerical data. This process is called feature extraction.

Given a tweet, or some text, you can represent it as a vector of dimension V, where V corresponds to your vocabulary size. If you had the tweet "I am happy because I am learning NLP", then you would put a 1 in the corresponding index for any word in the tweet, and a 0 otherwise.

[Figure: Feature Extraction]

You can see that there are many 0s in this vector, which is why it is called a sparse representation.

  • Problems with sparse representations

    • Large training time
    • Large prediction time

Instead, given a word, you can keep track of the number of times it shows up in the positive class, and likewise the number of times it shows up in the negative class. Using both of those counts, you can then extract features and feed those features into your logistic regression classifier.

  • Positive and negative counts

    [Figure: Tweets word frequency]

After you get the frequency of words, you can use it to extract useful features for sentiment analysis.

[Figure: Feature Extraction]
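
As a rough sketch of this idea (the helper names and the freqs dictionary keyed by (word, label) pairs are illustrative assumptions, not the course's exact code), the feature vector for a tweet is a bias term plus the summed positive and negative frequencies of its words:

import numpy as np

def build_freqs(tweets, labels):
    # Map each (word, label) pair to the number of times it occurs in the corpus.
    freqs = {}
    for tweet, label in zip(tweets, labels):
        for word in tweet:  # tweets are assumed to be lists of preprocessed tokens
            freqs[(word, label)] = freqs.get((word, label), 0) + 1
    return freqs

def extract_features(tweet, freqs):
    # Feature vector: [bias, sum of positive counts, sum of negative counts].
    x = np.zeros(3)
    x[0] = 1.0
    for word in tweet:
        x[1] += freqs.get((word, 1.0), 0)  # counts in the positive class
        x[2] += freqs.get((word, 0.0), 0)  # counts in the negative class
    return x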

Preprocessing

When preprocessing, you have to perform the following:

  1. Eliminate handles and URLs.
  2. Tokenize the string into words.
  3. Remove stop words like "and, is, a, on, etc."
  4. Stemming, i.e., convert every word to its stem. For example, dancer, dancing, and danced all become danc. You can use the Porter stemmer to take care of this.
  5. Convert all your words to lower case.
  • Example

    "@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!"

    Preprocessed tweet: [tun, great, ai, model]
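
A minimal preprocessing sketch using NLTK (assuming the stopwords corpus has already been downloaded with nltk.download('stopwords'); the exact output tokens may differ slightly from the slide):

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    # 1. Eliminate handles and URLs.
    tweet = re.sub(r'@\w+', '', tweet)
    tweet = re.sub(r'https?://\S+', '', tweet)
    # 2. Tokenize the string into words (lowercasing at the same time).
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tokens = tokenizer.tokenize(tweet)
    # 3.-5. Remove stop words and punctuation, then stem each remaining word.
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    return [stemmer.stem(t) for t in tokens
            if t not in stop_words and t not in string.punctuation]

print(process_tweet("@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!"))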

Logistic Regression

See Logistic Regression

Sentiment Analysis with Naïve Bayes

Bayes' Rule

Conditional probabilities help us reduce the sample search space.

P(X|Y)=\frac{P(X\cap{Y})}{P(Y)}=\frac{P(Y|X)P(X)}{P(Y)}

Naïve Bayes

To build a classifier, we will first start by creating conditional probabilities given the following table:

[Figure: Word frequency]

This allows us to compute the following table of probabilities:

[Figure: Conditional probability]

Once you have the probabilities, you can compute the likelihood score as follows

score=\prod_{i=1}^{m}\frac{P(w_i|pos)}{P(w_i|neg)}

For example,

[Figure: Likelihood score]

A score greater than 1 indicates that the class is positive, otherwise it is negative.
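
As a tiny worked sketch (the probability ratios below are made-up, purely illustrative numbers):

# Hypothetical P(w|pos) / P(w|neg) ratios for the words of a tweet.
ratios = {'i': 1.0, 'am': 1.0, 'happy': 3.3, 'today': 1.0}

score = 1.0
for word in ['i', 'am', 'happy', 'today']:
    score *= ratios[word]

print(score)      # 3.3
print(score > 1)  # True -> classify the tweet as positive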

Laplacian Smoothing

We usually compute the probability of a word given a class as follows:

P(w_i|class)=\frac{freq(w_i,class)}{N_{class}}\ \ \ \ class\in\{Positive,Negative\}

However, if a word does not appear in the training set, it automatically gets a probability of 0. To fix this, we add smoothing as follows:

P(w_i|class)=\frac{freq(w_i,class)+1}{N_{class}+V}

N_{class}: frequency of all words in class
V: number of unique words in vocabulary

Note that we added a 1 in the numerator, and since there are V words to normalize, we add V in the denominator. This method is called Laplacian Smoothing.
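
A minimal sketch of these smoothed probabilities (the toy freqs dictionary mapping (word, label) to counts and the helper name are illustrative):

# Toy freqs dictionary: (word, label) -> count, with label 1.0 = positive, 0.0 = negative.
freqs = {('happi', 1.0): 2, ('sad', 0.0): 2, ('learn', 1.0): 1, ('learn', 0.0): 1}

def smoothed_prob(word, label, freqs, n_class, vocab_size):
    # P(w|class) = (freq(w, class) + 1) / (N_class + V)
    return (freqs.get((word, label), 0) + 1) / (n_class + vocab_size)

# N_class: frequency of all words in the class; V: number of unique words in the vocabulary.
vocab = {word for (word, label) in freqs}
V = len(vocab)
N_pos = sum(count for (word, label), count in freqs.items() if label == 1.0)
N_neg = sum(count for (word, label), count in freqs.items() if label == 0.0)

print(smoothed_prob('happi', 1.0, freqs, N_pos, V))  # (2 + 1) / (3 + 3) = 0.5
print(smoothed_prob('happi', 0.0, freqs, N_neg, V))  # (0 + 1) / (3 + 3) ~= 0.17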

Log Likelihood

Log likelihoods are logarithms of the probability we calculated before. They are more convenient to work with and they appear throughout deep learning and NLP.

In order to compute the log likelihood, we need to get the ratios and use them to compute a score that will allow us to decide whether a tweet is positive or negative. The higher the ratio, the more positive the word is.

score=\frac{P(pos)}{{P}(neg)}\prod_{i=1}^{m}\frac{P(w_i|pos)}{P(w_i|neg)}

As m gets larger, we can get numerical underflow issues, so we introduce the log, which gives you the following equation

log(\frac{P(pos)}{P(neg)}\prod_{i=1}^{m}\frac{P(w_i|pos)}{P(w_i|neg)})=log\frac{P(pos)}{P(neg)}+\sum_{i=1}^{m}log\frac{P(w_i|pos)}{P(w_i|neg)}

The first component is called the log prior and the second component is the log likelihood.

We further introduce \lambda as follows

\lambda(w)=log\frac{P(w|pos)}{P(w|neg)}

For example,

[Figure: Lambda]

Once you have computed the \lambda dictionary, it becomes straightforward to do inference:

[Figure: Example]

As you can see above, since 3.3 > 0, we will classify the document to be positive. If we got a negative number we would have classified it to the negative class.
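
A minimal inference sketch, assuming the log prior and the \lambda (log-likelihood) dictionary have already been computed as described above (the values passed in the example are illustrative):

def naive_bayes_predict(tweet_words, logprior, loglikelihood):
    # Sum the log prior and the lambda of every word seen in training;
    # words that are not in the dictionary contribute 0.
    score = logprior
    for word in tweet_words:
        score += loglikelihood.get(word, 0.0)
    return score  # > 0 -> positive, < 0 -> negative

print(naive_bayes_predict(['i', 'am', 'happi'], 0.0, {'happi': 2.2, 'sad': -2.2}))  # 2.2 > 0 -> positive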

Training naive Bayes

  1. Get or annotate a dataset with positive and negative tweets.

  2. Preprocess the tweets.

  3. Compute word frequency.

  4. Get P(w|pos) and P(w|neg)

  5. Get \lambda(w)

  6. Compute log\frac{P(pos)}{P(neg)}, which is equal to log\frac{D_{pos}}{D_{neg}}, where D_{pos} and D_{neg} correspond to the number of positive and negative documents respectively.
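
Putting the six steps above together, here is a minimal training sketch (assuming the tweets are already preprocessed into token lists and the labels are 1.0/0.0; this is an illustrative sketch, not the course's exact implementation):

import numpy as np

def train_naive_bayes(tweets, labels):
    # tweets: lists of preprocessed tokens; labels: 1.0 (positive) or 0.0 (negative).
    freqs = {}
    for tweet, label in zip(tweets, labels):            # step 3: word frequencies
        for word in tweet:
            freqs[(word, label)] = freqs.get((word, label), 0) + 1
    vocab = {word for (word, label) in freqs}
    V = len(vocab)
    N_pos = sum(c for (w, l), c in freqs.items() if l == 1.0)
    N_neg = sum(c for (w, l), c in freqs.items() if l == 0.0)
    D_pos = sum(1 for l in labels if l == 1.0)          # number of positive documents
    D_neg = len(labels) - D_pos                         # number of negative documents
    logprior = np.log(D_pos) - np.log(D_neg)            # step 6: log prior
    loglikelihood = {}
    for word in vocab:                                  # steps 4 and 5: smoothed log ratios
        p_pos = (freqs.get((word, 1.0), 0) + 1) / (N_pos + V)
        p_neg = (freqs.get((word, 0.0), 0) + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_pos / p_neg)
    return logprior, loglikelihood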

Applications of Naive Bayes

  • Author identification
  • Spam filtering
  • Information retrieval
  • Word disambiguation

This method is usually used as a simple baseline. It is also really fast.

Error Analysis

There are several mistakes that could cause you to misclassify an example or a tweet.

  • Removing punctuation (e.g., stripping ":(" can hide a negative sentiment)
  • Removing words (e.g., removing "not" from "not good" flips the sentiment)
  • Word order
  • Adversarial attacks

    These include sarcasm, irony, and euphemisms.

Vector Space Models

Vector space model
  • Represent words and documents as vectors
  • Representation that captures relative meaning

Word by Word and Word by Doc

  • Word by Word Design

    • Make a co-occurrence matrix (counting how often words appear within a distance k of each other) and extract vector representations for the words in your corpus.
    • Find relationships between words and vectors, also known as their similarity.

    [Figure: Word by Word]
  • Word by Document Design

    • Count the number of times each word appears in the documents of each category.

    [Figure: Word by Document]
  • Vector Space

Once you've constructed the representations for multiple sets of documents or words, you'll get your vector space.

[Figure: Vector Space]

You'll make comparisons between vector representations using the cosine similarity and the Euclidean distance in order to get the angle and distance between them.

Euclidean Distance

d(\vec{v},\vec{w})=\sqrt{\sum_{i=1}^{n}(v_i-w_i)^2}\longrightarrow\Vert\vec{v}-\vec{w}\Vert_2

This is the formula that you would use to get the Euclidean distance between vector representations on an n-dimensional vector space. This formula is also known as the norm of the difference between the vectors that you are comparing.
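
A quick NumPy sketch (the example vectors are arbitrary):

import numpy as np

v = np.array([1, 6, 8])
w = np.array([0, 4, 6])
d = np.linalg.norm(v - w)  # Euclidean distance = norm of the difference
print(d)  # 3.0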

Cosine Similarity

The main advantage of cosine similarity over the Euclidean distance is that it isn't biased by the size difference between the representations.

cos(\beta)=\frac{\vec{v}\cdot\vec{w}}{\Vert\vec{v}\Vert\Vert\vec{w}\Vert}

[Figure: Cosine Similarity]
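
And a matching sketch for cosine similarity (again with arbitrary example vectors):

import numpy as np

def cosine_similarity(v, w):
    # cos(beta) = (v . w) / (||v|| * ||w||)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine_similarity(np.array([1, 0]), np.array([0, 1])))  # 0.0 -> orthogonal vectors
print(cosine_similarity(np.array([1, 1]), np.array([2, 2])))  # ~1.0 -> same direction, different magnitude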

Manipulating Words in Vector Spaces

[Figure: Manipulating word vectors]

By adding and subtracting word vectors you can capture relationships between words; for example, the vector difference between a country and its capital city is roughly the same across countries, so known pairs can be used to predict unknown ones.

PCA

  • PCA consists of projecting your vectors onto a lower-dimensional space while trying to retain as much information as possible.

  • PCA algorithm

    1. Mean normalize data
      x_i=\frac{x_i-\mu_{x_i}}{\sigma_{x_i}}

    2. Get covariance matrix M

    3. Perform SVD
      SVD(M)=USV^T

    where U contains the eigenvectors stacked column-wise and S has the eigenvalues on the diagonal.

    4. Dot product to project data
      X^{'}=XU[:, 0:2]

    5. Percentage of retained variance
      \frac{\sum_{i=0}^{1}S_{ii}}{\sum_{j=0}^{d}S_{jj}}

    As an important side note, your eigenvectors and eigenvalues should be organized according to the eigenvalues in descending order. This condition will ensure that you retain as much information as possible from your original embedding.
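
A minimal NumPy sketch of these steps (X is assumed to be a matrix with one word embedding per row; two components are kept for visualization):

import numpy as np

def compute_pca(X, n_components=2):
    # 1. Mean-normalize the data, feature by feature.
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features.
    cov = np.cov(X_norm, rowvar=False)
    # 3. SVD of the covariance matrix: columns of U are the eigenvectors,
    #    S holds the eigenvalues in descending order.
    U, S, _ = np.linalg.svd(cov)
    # 4. Project the data onto the first n_components eigenvectors.
    X_proj = X_norm.dot(U[:, :n_components])
    # 5. Percentage of retained variance.
    retained = S[:n_components].sum() / S.sum()
    return X_proj, retained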

Machine Translation and Document Search

Transforming word vectors

  • Overview of Translation

    English words to French words (e.g., "cat" to "chat")

    Try to find R so that

    XR\approx{Y}

    where X is the matrix of English word vectors and Y is the matrix of French word vectors.

  • Solving for R

    Loss=\frac{1}{m}\Vert{XR}-Y\Vert_F^2

    g=\frac{d}{dR}Loss=\frac{2}{m}(X^T(XR-Y))

    where m is the number of training word pairs (rows of X).

    R=R-\alpha{g}
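
A minimal gradient-descent sketch for learning R, assuming X and Y are row-aligned matrices of English and French embeddings (the function name and hyperparameters are illustrative):

import numpy as np

def align_embeddings(X, Y, steps=100, learning_rate=0.003, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]                            # number of word pairs
    R = rng.random((X.shape[1], Y.shape[1]))  # random initialization of R
    for _ in range(steps):
        diff = X.dot(R) - Y
        loss = np.sum(diff ** 2) / m          # (1/m) * squared Frobenius norm; track to monitor convergence
        gradient = (2 / m) * X.T.dot(diff)    # d(Loss)/dR
        R -= learning_rate * gradient         # gradient-descent update
    return R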

K-nearest neighbors

Notice that the transformed word vector (the English embedding multiplied by the matrix R) lands in the French word vector space, but it is not necessarily identical to any of the actual French word vectors. You need to search through the French word vectors to find the one most similar to the vector produced by the transformation.

To accelerate the search process, you may just search for a subset of words. When you think about organizing subsets of a dataset efficiently, you may think about placing your data into buckets. If you think about buckets, then you'll definitely want to think about hash tables.
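
A brute-force version of that search, ranking candidates by cosine similarity (the hashing approach below only scans one bucket instead of all candidates):

import numpy as np

def nearest_neighbors(v, candidates, k=1):
    # Rank every candidate row vector by cosine similarity to v
    # and return the indices of the k most similar ones.
    similarities = [np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c))
                    for c in candidates]
    return np.argsort(similarities)[-k:][::-1]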

Hash tables

[Figures: Hash table 1, Hash table 2]

Ideally, you want to have a hash function that puts similar word vectors in the same bucket.

To do this you'll need to use what's called locality sensitive hashing. Locality is another word for location, and sensitive is another word for caring. So locality sensitive hashing is a hashing method that cares very deeply about assigning items to buckets based on where they're located in vector space.

Locality sensitive hashing

To start thinking about locality sensitive hashing, let's first assume that you're using word vectors with just two dimensions. Let's say you want to find a way to know that these blue dots are somehow close to each other, and that these gray dots are also related to each other.

[Figure: LSH]

First, divide the space using these dashed lines, which I'll call planes. It looks like the planes can help us bucket the vectors into subsets based on their location.

[Figure: LSH]

In two-dimensional space, a plane is represented by this magenta line, and it actually stands for all the possible vectors that sit on that plane.

You can define a plane with a single vector. This magenta vector is perpendicular to the plane, and it's called the normal vector to that plane.

[Figure: Plane]

Let's compute the dot product of P with other vectors.

[Figure: Plane]

Whether the dot product is positive or negative tells you whether a vector such as V1 or V2 lies on one side of the plane or the other.

Multiple planes

In order to divide your vector space into manageable regions, you'll want to use more than one plane.

[Figure: Multiple planes]

So how do you get a single hash value?

First, compute the dot product between each plane's normal vector P_i and the vector V, and determine the sign of each result.

Then each hash value h_i is set to 1 if the sign is non-negative, and 0 if it is negative.

sign_i\geq{0}\to h_i=1

sign_i<{0}\to h_i=0

Finally, combine all of the hash values using this formula, where H is the number of planes:

hash=\sum_{i=0}^{H-1}2^i{h_i}
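
A minimal sketch of this hash computation, assuming planes is a matrix whose rows are the normal vectors P_i:

import numpy as np

def hash_value_of_vector(v, planes):
    # Sign of the dot product with each plane's normal vector -> bits h_i -> integer hash.
    dots = np.dot(planes, v)            # one dot product per plane
    h = (dots >= 0).astype(int)         # h_i = 1 if sign >= 0 else 0
    return int(sum(2 ** i * h_i for i, h_i in enumerate(h)))

planes = np.random.normal(size=(3, 2))  # 3 random planes in a 2-D space
print(hash_value_of_vector(np.array([2.0, 2.0]), planes))  # a bucket id between 0 and 7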

Approximate nearest neighbors

So far, we've seen that a few planes, such as these three, can divide the vector space into regions. But are these planes the best way to divide up the vector space? What if, instead, you divided the vector space like this?

[Figure: Multiple planes]

In fact, you can't know for sure which set of planes is the best way to divide up the vector space, so why not create multiple sets of random planes, so that you can divide up the vector space into multiple, independent hash tables.

By using multiple sets of random planes for locality-sensitive hashing, you have a more robust way of searching the vector space for a set of vectors that are possible candidates to be nearest neighbors. This is called approximate nearest neighbors because you're not searching the entire vector space, but just a subset of it.

import numpy as np

num_dimensions = 2   # dimensionality of the vector space
num_planes = 3       # number of random planes in one set

# Each row of this matrix is the normal vector of one random plane.
random_planes_matrix = np.random.normal(size=(num_planes, num_dimensions))
v = np.array([[2, 2]])

def side_of_plane_matrix(P, v):
    # Dot product of each plane's normal vector with v; the sign of each entry
    # tells you on which side of that plane the vector lies.
    dot_product = np.dot(P, v.T)
    sign_of_dot_product = np.sign(dot_product)
    return sign_of_dot_product

sides_of_planes = side_of_plane_matrix(random_planes_matrix, v)
