These notes cover the first course of the Natural Language Processing Specialization on Coursera, offered by DeepLearning.ai.
Table of Contents
- Sentiment Analysis with Logistic Regression
- Sentiment Analysis with Naïve Bayes
- Vector Space Models
- Machine Translation and Document Search
Sentiment Analysis with Logistic Regression
Feature Extraction
To use machine learning algorithms to make predictions, you first need to transform your text data into numerical data. This process is called feature extraction.
Given a tweet, or some text, you can represent it as a vector of dimension $V$, where $V$ corresponds to your vocabulary size. If you had the tweet "I am happy because I am learning NLP", then you would put a 1 in the corresponding index for every word in the tweet, and a 0 otherwise.

You can see there are many 0s in this vector; such a vector is called a sparse representation.
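As a rough sketch (the tiny vocabulary below is made up for illustration), the sparse vector for that tweet could be built like this:

```python
# Toy example: build a sparse (mostly-zero) vector of vocabulary size V
vocabulary = ["i", "am", "happy", "because", "learning", "nlp", "sad", "not"]  # hypothetical tiny vocabulary
tweet = "I am happy because I am learning NLP"

tokens = tweet.lower().split()
vector = [1 if word in tokens else 0 for word in vocabulary]
print(vector)  # [1, 1, 1, 1, 1, 1, 0, 0]
```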
Problems with sparse representations:
- Large training time
- Large prediction time
Instead, given a word, you can keep track of the number of times it shows up in the positive class and the number of times it shows up in the negative class. Using both of those counts, you can extract features and feed them into your logistic regression classifier.
Positive and negative counts: for each word, count how often it appears in positive tweets and how often it appears in negative tweets (a word-frequency table per class).
After you get the word frequencies, you can use them to extract useful features for sentiment analysis.
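A minimal sketch of this feature extraction, assuming a hypothetical `freqs` dictionary keyed by `(word, class)` pairs that was built from a labeled corpus:

```python
# Hypothetical frequency dictionary: (word, sentiment) -> count in that class
freqs = {
    ("happy", 1): 3, ("happy", 0): 0,
    ("sad", 1): 0,   ("sad", 0): 4,
    ("am", 1): 3,    ("am", 0): 3,
}

def extract_features(tokens, freqs):
    # Feature vector: [bias, sum of positive counts, sum of negative counts]
    pos = sum(freqs.get((w, 1), 0) for w in tokens)
    neg = sum(freqs.get((w, 0), 0) for w in tokens)
    return [1, pos, neg]

print(extract_features(["i", "am", "happy"], freqs))  # [1, 6, 3]
```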

Preprocessing
When preprocessing, you have to perform the following:
- Eliminate handles and URLs.
- Tokenize the string into words.
- Remove stop words like "and, is, a, on, etc."
- Stemming: convert every word to its stem. For example, dancer, dancing, and danced all become "danc". You can use the Porter stemmer for this.
- Convert all your words to lowercase.
Example:
"@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!"
Preprocessed tweet:
[tun, great, ai, model]
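A sketch of such a pipeline using NLTK (it assumes the `stopwords` corpus has been downloaded via `nltk.download("stopwords")`; the regular expressions for handles and URLs are illustrative, not necessarily the course's exact ones):

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def preprocess_tweet(tweet):
    # Remove URLs and @handles
    tweet = re.sub(r"https?://\S+", "", tweet)
    tweet = re.sub(r"@\w+", "", tweet)
    # Tokenize and lowercase
    tokens = TweetTokenizer(preserve_case=False).tokenize(tweet)
    # Remove stop words and punctuation, then stem
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    return [stemmer.stem(t) for t in tokens
            if t not in stop_words and t not in string.punctuation]

print(preprocess_tweet("@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!"))
# should produce something close to the preprocessed tweet shown above
```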
Logistic Regression
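Logistic regression applies the sigmoid function to the dot product of the feature vector $x$ and a weight vector $\theta$, and predicts positive sentiment when the output is above 0.5. A minimal prediction sketch (the weights and features below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights theta = [bias, w_pos, w_neg]
theta = np.array([0.0, 0.005, -0.005])
x = np.array([1.0, 3020.0, 61.0])  # made-up feature vector [1, sum_pos, sum_neg]

prediction = sigmoid(np.dot(x, theta))
print(prediction > 0.5)  # True -> positive sentiment
```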
Sentiment Analysis with Naïve Bayes
Bayes' Rule
Conditional probabilities help us reduce the sample search space. Bayes' rule lets us express one conditional probability in terms of the other: $P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$.
Naïve Bayes
To build a classifier, we start by computing conditional probabilities from the following table of word counts per class:

This allows us to compute the following table of probabilities:

Once you have the probabilities, you can compute the likelihood score as follows: $\prod_{i=1}^{m} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$, where $w_1, \dots, w_m$ are the words in the tweet.
For example,

A score greater than 1 indicates that the class is positive, otherwise it is negative.
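A tiny sketch of this score with made-up conditional probabilities:

```python
# Hypothetical P(word | pos) and P(word | neg) values
p_pos = {"happy": 0.14, "learning": 0.10, "sad": 0.05}
p_neg = {"happy": 0.05, "learning": 0.10, "sad": 0.15}

tweet = ["happy", "learning"]
score = 1.0
for w in tweet:
    score *= p_pos[w] / p_neg[w]

print(score, "-> positive" if score > 1 else "-> negative")  # 2.8 -> positive
```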
Laplacian Smoothing
We usually compute the probability of a word given a class as follows:
$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class})}{N_{\text{class}}}, \quad \text{class} \in \{\text{pos}, \text{neg}\}$
However, if a word does not appear in training, it automatically gets a probability of 0. To fix this, we add smoothing as follows:
$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_{\text{class}} + V}$
$N_{\text{class}}$: frequency of all words in the class
$V$: number of unique words in the vocabulary
Note that we added a 1 in the numerator, and since there are $V$ words to normalize, we add $V$ in the denominator. This method is called Laplacian smoothing.
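A minimal sketch of the smoothed estimate, assuming a small hypothetical `freqs` dictionary:

```python
# Hypothetical (word, class) -> count dictionary
freqs = {("happy", 1): 3, ("sad", 0): 4, ("am", 1): 3, ("am", 0): 3}
vocab = {w for (w, _) in freqs}   # unique words
V = len(vocab)                    # vocabulary size

def smoothed_prob(word, cls, freqs, V):
    # Laplacian smoothing: (freq(word, class) + 1) / (N_class + V)
    n_class = sum(count for (w, c), count in freqs.items() if c == cls)
    return (freqs.get((word, cls), 0) + 1) / (n_class + V)

print(smoothed_prob("sad", 1, freqs, V))  # nonzero even though "sad" never appears as positive
```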
Log Likelihood
Log likelihoods are logarithms of the probability we calculated before. They are more convenient to work with and they appear throughout deep learning and NLP.
In order to compute the log likelihood, we need the ratio $\text{ratio}(w) = \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$ for each word and use it to compute a score that will allow us to decide whether a tweet is positive or negative. The higher the ratio, the more positive the word is.
As $m$ (the number of words in the tweet) gets larger, the product $\frac{P(\text{pos})}{P(\text{neg})} \prod_{i=1}^{m} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$ can underflow numerically, so we introduce the log, which gives you the following equation:
$\log \frac{P(\text{pos})}{P(\text{neg})} + \sum_{i=1}^{m} \log \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$
The first component is called the log prior and the second component is the log likelihood.
We further introduce $\lambda(w)$ as follows: $\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$.
For example,

Once you have computed the $\lambda$ dictionary, it becomes straightforward to do inference:

As you can see above, since the score is greater than 0, we classify the document as positive. If we had gotten a negative number, we would have classified it as negative.
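A sketch of this inference step, assuming a hypothetical `lambda_dict` of per-word log ratios and a `logprior`:

```python
# Hypothetical lambda(word) = log(P(word|pos) / P(word|neg)) values and log prior
lambda_dict = {"happy": 1.1, "because": 0.0, "sad": -1.2}
logprior = 0.0  # balanced corpus

def naive_bayes_predict(tokens, logprior, lambda_dict):
    # Words unseen in training contribute 0 (lambda defaults to 0)
    return logprior + sum(lambda_dict.get(w, 0.0) for w in tokens)

score = naive_bayes_predict(["i", "am", "happy", "because", "i", "am", "learning"], logprior, lambda_dict)
print(score, "-> positive" if score > 0 else "-> negative")  # 1.1 -> positive
```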
Training naive Bayes
- Get or annotate a dataset with positive and negative tweets.
- Preprocess the tweets.
- Compute the word frequencies $\text{freq}(w, \text{class})$.
- Get $P(w \mid \text{pos})$ and $P(w \mid \text{neg})$ using Laplacian smoothing.
- Get $\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$ for every word.
- Compute the log prior $\log \frac{P(\text{pos})}{P(\text{neg})}$, which is equal to $\log \frac{D_{\text{pos}}}{D_{\text{neg}}}$, where $D_{\text{pos}}$ and $D_{\text{neg}}$ correspond to the number of positive and negative documents respectively.
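Putting these steps together, a simplified training sketch, assuming `tweets` is a list of token lists and `labels` holds 1 for positive and 0 for negative:

```python
import math
from collections import defaultdict

def train_naive_bayes(tweets, labels):
    # Word frequencies per class
    freqs = defaultdict(int)
    for tokens, y in zip(tweets, labels):
        for w in tokens:
            freqs[(w, y)] += 1

    vocab = {w for (w, _) in freqs}
    V = len(vocab)
    n_pos = sum(c for (w, y), c in freqs.items() if y == 1)
    n_neg = sum(c for (w, y), c in freqs.items() if y == 0)

    # Smoothed conditional probabilities and lambda(word)
    lambda_dict = {}
    for w in vocab:
        p_w_pos = (freqs[(w, 1)] + 1) / (n_pos + V)
        p_w_neg = (freqs[(w, 0)] + 1) / (n_neg + V)
        lambda_dict[w] = math.log(p_w_pos / p_w_neg)

    # Log prior from document counts
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    logprior = math.log(d_pos / d_neg)

    return logprior, lambda_dict

tweets = [["happi", "learn", "nlp"], ["sad", "learn"]]
labels = [1, 0]
print(train_naive_bayes(tweets, labels))
```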
Applications of Naive Bayes
- Author identification
- Spam filtering
- Information retrieval
- Word disambiguation
This method is usually used as a simple baseline. It is also really fast.
Error Analysis
There are several mistakes that could cause you to misclassify an example or a tweet.
- Removing punctuation: stripping an emoticon such as ":(" can remove the token that carried the sentiment.
- Removing words: removing a word such as "not" during preprocessing can flip the meaning of the tweet.
- Word order: the same words in a different order (e.g., "I am happy because I did not go" vs. "I am not happy because I did go") can express opposite sentiments.
- Adversarial attacks: these include sarcasm, irony, and euphemisms.
Vector Space Models

- Represent words and documents as vectors
- Representation that captures relative meaning
Word by Word and Word by Doc
- Word by Word Design
- Make a co-occurrence matrix and extract vector representations for the words in your corpus (see the sketch after this list).
- Find relationships between words and vectors, also known as their similarity.

- Word by Document Design: count how many times each word appears in documents belonging to each category, so each word gets a vector of counts over document categories.

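A toy sketch of a word-by-word co-occurrence count (the corpus and window size are made up):

```python
from collections import defaultdict

corpus = ["i", "like", "simple", "data", "i", "prefer", "simple", "raw", "data"]
k = 2  # co-occurrence window size (hypothetical)

cooccurrence = defaultdict(int)
for i, w in enumerate(corpus):
    for j in range(max(0, i - k), min(len(corpus), i + k + 1)):
        if i != j:
            cooccurrence[(w, corpus[j])] += 1

print(cooccurrence[("data", "simple")])  # how often "simple" appears within k words of "data"
```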
- Vector Space
Once you've constructed the representations for multiple sets of documents or words, you'll get your vector space.

You'll make comparisons between vector representations using the cosine similarity and the Euclidean distance in order to get the angle and distance between them.
Euclidean Distance
$d(\vec{v}, \vec{w}) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}$
This is the formula you would use to get the Euclidean distance between vector representations in an $n$-dimensional vector space. It is also known as the norm of the difference between the vectors you are comparing.
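For example, with NumPy:

```python
import numpy as np

v = np.array([1, 6, 8])
w = np.array([0, 4, 6])

# Euclidean distance = norm of the difference vector
d = np.linalg.norm(v - w)
print(d)  # 3.0
```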
Cosine Similarity
$\cos(\beta) = \frac{\vec{v} \cdot \vec{w}}{\lVert\vec{v}\rVert\,\lVert\vec{w}\rVert}$
The main advantage of cosine similarity over the Euclidean distance is that it isn't biased by the size difference between the representations.

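A minimal NumPy sketch:

```python
import numpy as np

def cosine_similarity(v, w):
    # cos(beta) = (v . w) / (||v|| * ||w||)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0])
w = np.array([2.0, 4.0])   # same direction, different magnitude
print(cosine_similarity(v, w))  # 1.0 despite the size difference
```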
Manipulating Words in Vector Spaces

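This part of the course manipulates word embeddings with vector arithmetic to answer analogy questions such as finding a country's capital. A toy sketch with made-up 2-dimensional embeddings:

```python
import numpy as np

# Made-up 2-D word embeddings, only for illustration
embeddings = {
    "usa": np.array([5.0, 6.0]),
    "washington": np.array([10.0, 5.0]),
    "russia": np.array([5.5, 5.0]),
    "moscow": np.array([10.5, 4.0]),
}

# washington - usa suggests a "capital of" direction;
# apply the same offset to russia to predict its capital
predicted = embeddings["russia"] + (embeddings["washington"] - embeddings["usa"])
closest = min(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - predicted))
print(closest)  # moscow
```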
PCA
PCA consists of projecting your vectors into a lower-dimensional space while trying to retain as much information as possible.
PCA algorithm:
- Mean-normalize the data.
- Get the covariance matrix $\Sigma$.
- Perform SVD: $U, S, V = \text{SVD}(\Sigma)$, where $U$ contains the eigenvectors stacked column-wise and $S$ has the eigenvalues on the diagonal.
- Dot product to project the data onto the first $k$ eigenvectors: $X' = X\,U[:, :k]$.
- Percentage of retained variance: $\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{j=1}^{d} S_{jj}}$.
As an important side note, your eigenvectors and eigenvalues should be organized according to the eigenvalues in descending order. This condition will ensure that you retain as much information as possible from your original embedding.
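A sketch of these steps with NumPy on toy 2-dimensional data, projecting onto one component:

```python
import numpy as np

np.random.seed(1)
X = np.random.rand(10, 2) @ np.array([[2.0, 1.0], [1.0, 2.0]])  # toy data: 10 samples x 2 features

# 1. Mean-normalize the data
X_demeaned = X - X.mean(axis=0)

# 2. Covariance matrix
cov = np.cov(X_demeaned, rowvar=False)

# 3. SVD: U holds eigenvectors column-wise, S the eigenvalues (descending)
U, S, _ = np.linalg.svd(cov)

# 4. Project onto the first k components
k = 1
X_reduced = X_demeaned @ U[:, :k]

# 5. Percentage of retained variance
print(S[:k].sum() / S.sum())
```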
Machine Translation and Document Search
Transforming word vectors
Overview of Translation
Translate English words to French words (e.g., "cat" to "chat").
Try to find a matrix $R$ so that $X R \approx Y$, where $X$ is the matrix of English word vectors and $Y$ is the matrix of French word vectors (rows aligned so that row $i$ of $X$ and row $i$ of $Y$ are translations of each other).
Solving for $R$: minimize the loss $\text{Loss} = \frac{1}{m}\lVert X R - Y \rVert_F^2$ by gradient descent, using the gradient $g = \frac{2}{m} X^\top (X R - Y)$ and the update $R \leftarrow R - \alpha g$.
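A sketch of this gradient-descent procedure on random toy matrices (the learning rate and number of steps are arbitrary):

```python
import numpy as np

np.random.seed(0)
m, d = 100, 5
X = np.random.rand(m, d)        # toy "English" word vectors
R_true = np.random.rand(d, d)
Y = X @ R_true                  # toy "French" word vectors, perfectly aligned

R = np.random.rand(d, d)        # transformation matrix to learn
alpha = 0.1                     # arbitrary learning rate

for step in range(2000):
    diff = X @ R - Y
    gradient = (2 / m) * X.T @ diff   # gradient of (1/m) * ||XR - Y||_F^2
    R -= alpha * gradient

print(np.linalg.norm(X @ R - Y))  # should be close to 0 after training
```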
K-nearest neighbors
Notice that the transformed word vector (an English embedding multiplied by the $R$ matrix) lands in the French word vector space, but it is not necessarily identical to any of the actual French word vectors. You need to search through the French word vectors to find the French word whose vector is most similar to the one you created from the transformation.
To accelerate the search, you may search only a subset of the words. When you think about organizing subsets of a dataset efficiently, you may think about placing your data into buckets. And if you think about buckets, then you'll definitely want to think about hash tables.
Hash tables


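A basic, non-locality-sensitive hash table sketch, just to illustrate the bucket idea (the values and hash function are toy choices):

```python
# Simple hash table: bucket = value modulo number of buckets
def basic_hash_table(values, n_buckets):
    buckets = {i: [] for i in range(n_buckets)}
    for v in values:
        buckets[int(v) % n_buckets].append(v)
    return buckets

print(basic_hash_table([100, 10, 97, 17, 14], n_buckets=10))
# e.g. 100 and 10 share bucket 0, 97 and 17 share bucket 7, 14 goes to bucket 4
```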
Ideally, you want to have a hash function that puts similar word vectors in the same bucket.
To do this you'll need to use what's called locality sensitive hashing. Locality is another word for location, and sensitive is another word for caring. So locality sensitive hashing is a hashing method that cares deeply about assigning items to buckets based on where they're located in vector space.
Locality sensitive hashing
To start thinking about locality sensitive hashing, let's first assume that you're using word vectors with just two dimensions. Let's say you want to find a way to know that these blue dots are somehow close to each other, and that these gray dots are also related to each other.

First, divide the space using these dashed lines, which I'll call planes. It looks like the planes can help us bucket the vectors into subsets based on their location.

In two-dimensional space, a "plane" would be this magenta line, and it represents all the possible vectors that would be sitting on that plane.
You can define a plane with a single vector. This magenta vector is perpendicular to the plane, and it's called the normal vector to that plane.

Let's compute the dot product of the normal vector $P$ with other vectors.

Whether the dot product is positive or negative tells you whether a vector such as $V_1$ or $V_2$ is on one side of the plane or the other.
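A small sketch of this check (the normal vector and the test vectors are made up):

```python
import numpy as np

P = np.array([1, 1])       # normal vector defining the plane
v1 = np.array([1, 2])      # lies on the positive side
v2 = np.array([-4, -2])    # lies on the negative side

print(np.sign(np.dot(P, v1)))  # 1
print(np.sign(np.dot(P, v2)))  # -1
```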
Multiple planes
In order to divide your vector space into manageable regions, you'll want to use more than one plane.

So how do you get a single hash value?
First, compute the dot product of each plane's normal vector with the vector $v$, and determine the sign of each result.
Then set the per-plane hash value $h_i$ to 1 if the sign is positive (or zero), and to 0 if the sign is negative.
Finally, combine all the per-plane hash values using this formula: $\text{hash} = \sum_{i=0}^{n-1} 2^i \times h_i$.
Approximate nearest neighbors
So far, we've seen that a few planes, such as these three, can divide the vector space into regions. But are these planes the best way to divide up the vector space? What if, instead, you divided the vector space like this?

In fact, you can't know for sure which set of planes is the best way to divide up the vector space, so why not create multiple sets of random planes, so that you can divide the vector space into multiple, independent hash tables.
By using multiple sets of random planes for locality-sensitive hashing, you have a more robust way of searching the vector space for a set of vectors that are possible candidates to be nearest neighbors. This is called approximate nearest neighbors because you're not searching the entire vector space, but just a subset of it.
```python
import numpy as np

num_dimensions = 2
num_planes = 3

# Each row of this matrix is the normal vector of one random plane
random_planes_matrix = np.random.normal(size=(num_planes, num_dimensions))

v = np.array([[2, 2]])

def side_of_plane_matrix(P, v):
    # Dot product of every plane's normal vector with v, keeping only the sign
    dot_product = np.dot(P, v.T)
    sign_of_dot_product = np.sign(dot_product)
    return sign_of_dot_product

sides_of_planes = side_of_plane_matrix(random_planes_matrix, v)
```
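Continuing the sketch above, the per-plane signs can be combined into the single hash value described earlier (the helper name `hash_multi_plane` is mine, not necessarily the course's):

```python
def hash_multi_plane(P, v):
    # Combine per-plane signs into one bucket id: hash = sum_i 2^i * h_i
    signs = side_of_plane_matrix(P, v).flatten()
    hash_value = 0
    for i, s in enumerate(signs):
        h_i = 1 if s >= 0 else 0
        hash_value += (2 ** i) * h_i
    return hash_value

print(hash_multi_plane(random_planes_matrix, v))  # an integer between 0 and 2**num_planes - 1
```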