Word Vectors
Tags: CS 224N
What came before?
We had WordNet, which is essentially a giant dictionary that contains synonym sets and hypernyms ("is a" relationships).
This didn’t work very well because word meaning depends on context; whether two words count as synonyms is dynamically bound to the contexts they appear in. Furthermore, new words and new meanings of words appear all the time.
We also tried treating words as discrete symbols, i.e. like one-hot vectors. The problem is that two similar words were just as orthogonal as two very different words. Can we make a vector that somehow encodes the semantics of a word?
We arrive at the field of distributional semantics, which rests on the assumption that a word’s meaning is shaped by its surrounding contexts. Or, to put it another way, two words are more similar if they are used in similar contexts.
Word2Vec
The concept of Word2Vec is actually really simple. For each position in the corpus of text, take the center word and maximize the probability of seeing the surrounding words. This represents a sort of Markov random field where all the nodes are interconnected.
There are two model variants. The first is our version, called skip-gram, which predicts the context words given the center word, i.e. $P(o \mid c)$. We can also do continuous bag of words (CBOW), where you predict the center word from its context, i.e. $P(c \mid o)$.
Concretely, we arrive at this (negative) log probability loss function:

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t)$$
Computing probabilities
For every word $w$, we have an embedding $v_w$, which is used when it is a center word, and an embedding $u_w$, which is used when it is a context word. We use two embeddings because, as we will see, it keeps the derivatives linear and much easier to work with. When we actually use the embeddings afterwards, we average $u_w$ and $v_w$ together.
We compute the probability of a context word $o$ given a center word $c$ by computing

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

which is akin to finding the dot product between $v_c$ and all the $u_w$, and then taking the softmax (hold onto this thought: we will look at vectorization later).
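As a concrete illustration, here is a minimal NumPy sketch of this softmax probability, assuming the context embeddings are stacked in a matrix `U` (one row per vocabulary word) and `v_c` is the center word’s vector; the names are illustrative, not from the notes.

```python
import numpy as np

def context_probs(U, v_c):
    """Softmax over dot products: P(o | c) for every word o in the vocab.

    U   : (|V|, d) matrix of context ("outside") embeddings u_w
    v_c : (d,)     center-word embedding
    """
    scores = U @ v_c                      # dot product of v_c with every u_w
    scores -= scores.max()                # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize into a distribution

# Toy usage: 5 words, 3-dimensional embeddings
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
v_c = rng.normal(size=3)
print(context_probs(U, v_c))  # sums to 1
```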
Computing $\frac{\partial}{\partial v_c} \log P(o \mid c)$ (non-vectorized)
Computation
There’s a difficulty of notation, so let’s just focus on the thing inside the nested summation: the log probability. We can expand and simplify:

$$\log P(o \mid c) = u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c)$$
And now, we want to take the derivative of $\log P(o \mid c)$ with respect to $v_c$, because we can use this derivative later to formulate the real derivative of the loss. We know that

$$\frac{\partial}{\partial v_c}\, u_o^\top v_c = u_o$$
and the second part actually follows a pretty standard form:

$$\frac{\partial}{\partial v_c} \log \sum_{w \in V} \exp(u_w^\top v_c) = \frac{1}{\sum_{w \in V} \exp(u_w^\top v_c)} \sum_{x \in V} \exp(u_x^\top v_c)\, u_x$$
and we can distribute the denominator and get

$$\sum_{x \in V} \frac{\exp(u_x^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}\, u_x$$
and as usual, we recognize that the inner term is $P(x \mid c)$, and we recognize again that this whole thing is an expectation:

$$\sum_{x \in V} P(x \mid c)\, u_x = \mathbb{E}_{x \sim P(\cdot \mid c)}[u_x]$$
So, to write it all out, we have

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{x \in V} P(x \mid c)\, u_x = u_o - \mathbb{E}_{x \sim P(\cdot \mid c)}[u_x]$$
or, in layman’s terms, observed - average. This brushes at a deeper truth. The direction you want to go is the direction that you are already going, minus the noise. If the noise pulls in the same direction, then you should go less readily, lest you blend into the crowd. So it’s inherently pushing the embeddings apart.
Now, if you want to take the derivative of the original loss function, it’s just the above quantity enclosed in the nested sums. We will show a better, vectorized way of doing this later.
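To make the observed-minus-expected form concrete, here is a small NumPy sketch (names and shapes are my assumptions, matching the snippet above) that computes the gradient with respect to $v_c$ and checks it against a numerical derivative.

```python
import numpy as np

def grad_v_c(U, v_c, o):
    """Gradient of log P(o | c) w.r.t. v_c: u_o minus the expected context vector."""
    scores = U @ v_c
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # P(x | c) for every x
    return U[o] - probs @ U               # observed u_o minus E_{x ~ P(.|c)}[u_x]

# Check against a numerical derivative of log P(o | c)
rng = np.random.default_rng(1)
U, v_c, o, eps = rng.normal(size=(5, 3)), rng.normal(size=3), 2, 1e-6

def log_prob(v):
    s = U @ v
    return s[o] - np.log(np.exp(s - s.max()).sum()) - s.max()

numeric = np.array([(log_prob(v_c + eps * e) - log_prob(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_v_c(U, v_c, o), numeric, atol=1e-5))  # True
```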
Negative sampling and skip-gram
So computing $P(o \mid c)$ is OK for a small vocab, but the denominator, a sum over the entire vocabulary, is concerning. Can we reframe this problem as a binary classification? Yes we can!
We just want a binary logistic regression that differentiates a true center-context word pair from a “noise” pair (the center word with a random word). Concretely, we use the logistic regression loss:

$$J_{\text{neg-sample}}(u_o, v_c) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_{w_k}^\top v_c), \qquad w_k \sim P_n(w)$$
But what is $P_n(w)$? Well, we can start with the unigram distribution $U(w)$, the natural distribution of words (found by tallying up the data). It turns out that if we use $U(w)^{3/4}/Z$, we “soften” the distribution and get better performance.
This yields a pretty sparse gradient, which allows you to update only the relevant rows of your embedding matrices.
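A minimal sketch of the negative-sampling loss for a single (center, context) pair, continuing the same assumed `U`, `V` matrix layout as above; the sampler and names are illustrative, not from the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(U, V, c, o, neg_ids):
    """Negative-sampling loss for one true pair (c, o) and K sampled noise words.

    U : (|V|, d) context embeddings, V : (|V|, d) center embeddings
    c : center word id, o : true context word id, neg_ids : K noise word ids
    """
    v_c = V[c]
    true_term = -np.log(sigmoid(U[o] @ v_c))                 # pull the true pair together
    noise_term = -np.log(sigmoid(-U[neg_ids] @ v_c)).sum()   # push noise pairs apart
    return true_term + noise_term

# Noise words drawn from the "softened" unigram distribution U(w)^(3/4) / Z
counts = np.array([50, 30, 10, 5, 5], dtype=float)
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

rng = np.random.default_rng(2)
U, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
neg_ids = rng.choice(5, size=3, p=p_noise)
print(neg_sample_loss(U, V, c=0, o=1, neg_ids=neg_ids))
```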
Co-occurrence counting
While this stuff all makes sense, why can’t we just count the co-occurrences and make an embedding that way? We can make such a matrix $X$.
In other words, $X_{ij}$ is the number of times word $j$ appeared within a certain window of word $i$.
You can then reduce dimensionality through SVD. Often, this doesn’t work very well on raw counts. There are some tricks, including taking log frequencies, capping counts at a ceiling of around 100, and ignoring function words that would otherwise saturate the co-occurrence matrix. You can also use ramped windows that weight closer words more heavily than farther ones.
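As a sketch of the count-then-SVD approach (the toy corpus, window size, and log-count trick are illustrative assumptions):

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Build the co-occurrence matrix X with a symmetric window of size 2
window = 2
X = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            X[idx[w], idx[corpus[j]]] += 1

# Log-scale the counts (one of the tricks above), then reduce with SVD
U_svd, S, Vt = np.linalg.svd(np.log1p(X))
k = 3
embeddings = U_svd[:, :k] * S[:k]   # each row is a k-dimensional word vector
print(embeddings.shape)             # (|V|, 3)
```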
Emergent Semantic Patterns
You can have a certain linear vector mean something. For example, you can have a pluralization vector that turns most nouns into their plural forms.
GloVe
GloVe tries to exploit these emergent semantic patterns to a great degree. It uses a log-bilinear model that encourages linearity in the embedding space.
Concretely, it expresses

$$w_i \cdot w_j = \log P(i \mid j)$$

which means that vector differences become

$$w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}$$
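For reference (this is the published GloVe objective rather than something derived in these notes), the model fits the log co-occurrence counts with a weighted least-squares loss:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $f$ caps the influence of very frequent word pairs.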
Evaluating word vectors
Intrinsic
These are things that can be done to the word vectors directly without any downstream task. Examples include vector analogies. You try making a vector that moves between a semantic analogy, like man → woman yields a gender vector. Applying it to “king” should yield “queen”.
You can also see if the similarity between words correlates with how humans see the words.
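A minimal sketch of the analogy test, assuming a dict `vectors` mapping words to NumPy arrays (hypothetical data, not from the notes):

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Return the word closest (by cosine similarity) to b - a + c, e.g. king - man + woman."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):            # exclude the query words themselves
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. analogy(vectors, "man", "king", "woman") should ideally return "queen"
```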
Extrinsic
You can try the word vectors on a downstream task. A good example is named entity recognition, which tries to assign a category of person, organization, or location to each word.
To do this, you just take a window centered around the target word. You compute the embeddings for all the words in the window, concatenate them, and feed that low-dimensional vector as an input to a classifier, which has a softmax output.
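A sketch of such a window classifier in NumPy (the window size, embedding table `E`, and softmax-regression weights are assumptions for illustration):

```python
import numpy as np

def window_features(E, token_ids, center, window=2, pad_id=0):
    """Concatenate the embeddings of the words in a window around `center`."""
    ids = [token_ids[i] if 0 <= i < len(token_ids) else pad_id
           for i in range(center - window, center + window + 1)]
    return E[ids].reshape(-1)            # (2*window + 1) * d features

def predict(E, W, b, token_ids, center):
    """Softmax classifier over entity categories (e.g. person / org / location / other)."""
    x = window_features(E, token_ids, center)
    scores = W @ x + b
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

rng = np.random.default_rng(3)
E = rng.normal(size=(100, 8))            # toy embedding table: 100 words, d = 8
W, b = rng.normal(size=(4, 5 * 8)), np.zeros(4)
print(predict(E, W, b, token_ids=[5, 17, 42, 3, 9], center=2))
```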
Word ambiguity
Words can take on a ton of different meanings in different contexts. How can one vector express all of that?
Well, you can try to just add more vectors! This is one thing that can work, but it’s impractical. It adds more moving parts. How many more vectors? etc.
It actually turns out that in a good word embedding, the embedding is a superposition of the different semantic meanings. You can extract the relevant meaning using the surrounding context. And because there is a degree of sparseness to the vector, the different components are actually recoverable.