Lecture 2: Word Vectors 2 and Word Senses
Optimization: Gradient Descent
Gradient Descent:
- $J(\theta)$ is a function of all windows in the corpus
- $\nabla_{\theta} J(\theta)$ is expensive to compute
Stochastic gradient descent:
- in the simplest case, sample a single window
- compute the gradient for that one window and update the parameters
Mini-batch gradient descent:
- The data is divided into several batches, and the parameters are updated once per batch, so the examples in a batch jointly determine the direction of the gradient
- gives less noisy gradient estimates, because we average over a batch rather than using a single example
- parallelizes well on GPUs, so training runs faster
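For reference, the update rules behind these three variants can be written out explicitly (a standard formulation, not specific to these notes; $\alpha$ is the learning rate, $J_t$ the loss on a single sampled window $t$, and $B$ a sampled mini-batch):

$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta) \qquad \text{(full-batch gradient descent)}$$

$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J_t(\theta) \qquad \text{(SGD, one window)}$$

$$\theta^{new} = \theta^{old} - \alpha \frac{1}{|B|} \sum_{t \in B} \nabla_\theta J_t(\theta) \qquad \text{(mini-batch)}$$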
Stochastic gradients with word vectors
$\nabla_{\theta} J(\theta)$ is very sparse, so we should only update the word vectors that actually appear in the window
Solution:
- either we need sparse matrix update operations that touch only certain rows of the embedding matrices U and V,
- or we need to keep around a hash for word vectors
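A minimal sketch of such a sparse row update (hypothetical function and variable names; it assumes the gradients have already been computed only for the rows of U and V corresponding to words in the sampled window):

```python
import numpy as np

def sparse_sgd_update(U, V, window_word_ids, grad_U_rows, grad_V_rows, lr=0.05):
    """Update only the embedding rows for words that appeared in the window.

    U, V            : [vocab_size, d] outside / center embedding matrices
    window_word_ids : indices of the words in the sampled window
    grad_U_rows     : [len(window_word_ids), d] gradients w.r.t. those rows of U
    grad_V_rows     : same shape, gradients w.r.t. those rows of V
    """
    for k, widx in enumerate(window_word_ids):
        U[widx] -= lr * grad_U_rows[k]   # touch only rows that actually appear
        V[widx] -= lr * grad_V_rows[k]
    return U, V
```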
About GloVe
Why not capture co-occurrence counts directly?
Two options for building the counts:
- Window: use a window around each word, like word2vec, to capture both syntactic and semantic information
- Word-document: a word-document co-occurrence matrix gives general topics, leading to "Latent Semantic Analysis"
Example: using a window of length 1 to build the co-occurrence matrix from these sentences:
- I like deep learning.
- I like NLP.
- I enjoy flying.
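A minimal sketch (illustrative code, not from the lecture) that builds this window-1 co-occurrence matrix:

```python
from collections import defaultdict

sentences = [
    "I like deep learning .".split(),
    "I like NLP .".split(),
    "I enjoy flying .".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# X[i][j] = number of times word j appears within a window of 1 around word i
X = defaultdict(lambda: defaultdict(int))
for s in sentences:
    for pos, w in enumerate(s):
        for ctx_pos in (pos - 1, pos + 1):          # symmetric window of size 1
            if 0 <= ctx_pos < len(s):
                X[idx[w]][idx[s[ctx_pos]]] += 1

# e.g. X[idx["I"]][idx["like"]] == 2, since "I like" occurs in two sentences
```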
Problems:
- Increase in size with vocabulary
- High dimensional
- Subsequent classification models have sparsity issues
Method: Dimensionality Reduction on Co-occurrence Matrix
Singular Value Decomposition (SVD) factorizes X into $U \Sigma V^T$, where U and V have orthonormal columns and $\Sigma$ is a diagonal matrix of singular values.
To reduce dimensionality, throw away the smallest singular values, together with the corresponding columns of U and V.
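A minimal sketch of this reduction with numpy (assuming X is a dense co-occurrence matrix like the one built above and k is the target dimensionality; taking rows of $U_k \Sigma_k$ as word vectors is one common choice, not the only one):

```python
import numpy as np

def reduce_with_svd(X, k):
    """Keep only the top-k singular values/vectors of the co-occurrence matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
    # Singular values come back in descending order: throw away the smallest
    # ones and the corresponding columns of U / rows of Vt.
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    word_vectors = U_k * s_k        # rows of U scaled by the kept singular values
    return word_vectors, (U_k, s_k, Vt_k)
```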
What is the essence of dimensionality reduction, and what is the criterion?
Find lower-dimensional representations that explain most of the variance in the high-dimensional data.
Encoding meaning in vector differences
Summary: the learning of word vectors should be related to the ratios of conditional co-occurrence probabilities of words.
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.
That is, a single co-occurrence probability being large or small is not meaningful by itself; it is the ratio of co-occurrence probabilities that indicates a meaning component.
How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
- Log-bilinear model: $w_i \cdot w_j = \log P(i|j)$
- with vector differences: $w_x \cdot (w_a - w_b) = \log \frac{P(x|a)}{P(x|b)}$
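Spelling out how the log-bilinear model gives this property (a one-line derivation using the model equation above):

$$w_x \cdot (w_a - w_b) = w_x \cdot w_a - w_x \cdot w_b = \log P(x|a) - \log P(x|b) = \log \frac{P(x|a)}{P(x|b)}$$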
Co-occurrence Matrix
- $X$: word-word co-occurrence matrix
- $X_{ij}$: number of times word j occurs in the context of word i
- $X_i = \sum_k X_{ik}$: the number of times any word k appears in the context of word i
- $P_{ij} = P(j|i) = \dfrac{X_{ij}}{X_i}$: the probability of word j appearing in the context of word i
Least Squares Objective
For the skip-gram model, the probability that word j appears in the context of word i is, as in word2vec,
$$Q_{ij} = \frac{\exp(u_j^T v_i)}{\sum_{w=1}^{W} \exp(u_w^T v_i)}$$
and the implied global loss is the cross-entropy loss:
$$J = -\sum_{i \in corpus} \sum_{j \in context(i)} \log Q_{ij} = -\sum_{i=1}^{W} \sum_{j=1}^{W} X_{ij} \log Q_{ij}$$
A drawback of the cross-entropy loss is that it requires $Q_{ij}$ to be properly normalized, which involves an expensive sum over the whole vocabulary. Instead, use a least-squares objective in which the normalization factors are discarded:
$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^2$$
where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(u_j^T v_i)$ are the unnormalized distributions. However, this brings another problem: $X_{ij}$ often takes on large values, which makes the optimization difficult.
An effective change is to minimize the squared error of the logarithms:
$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\log \hat{P}_{ij} - \log \hat{Q}_{ij}\right)^2 = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(u_j^T v_i - \log X_{ij}\right)^2$$
Finally, the weighting factor $X_i$ is not guaranteed to be optimal, so replace it with a more general weighting function $f$ that depends on the co-occurrence count of the word pair:
$$J = \sum_{i=1}^{W} \sum_{j=1}^{W} f(X_{ij}) \left(u_j^T v_i - \log X_{ij}\right)^2$$
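A sketch of this final objective in code (a minimal illustration; the clipped power-law form of f with $x_{max}$ and $\alpha$ follows the GloVe paper, the bias terms of the full GloVe model are omitted to match the derivation above, and variable names are illustrative):

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe-style weighting function: grows with the count, then saturates at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(U, V, X):
    """Weighted least-squares loss over the nonzero co-occurrence counts.

    U, V : [vocab_size, d] context / center word vectors
    X    : [vocab_size, vocab_size] co-occurrence counts
    """
    i_idx, j_idx = np.nonzero(X)                 # train only on nonzero entries
    x_ij = X[i_idx, j_idx]
    dot = np.sum(U[j_idx] * V[i_idx], axis=1)    # u_j^T v_i for each pair
    return np.sum(f_weight(x_ij) * (dot - np.log(x_ij)) ** 2)
```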
Comparison with LSA and Word2Vec
LSA vs GloVe
- LSA, based on the co-occurrence matrix, can capture word similarities, but it does poorly on tasks such as word analogy.
- GloVe can be described as an efficient matrix-factorization-style algorithm over the same co-occurrence statistics that LSA uses
Word2Vec vs GloVe
- Word2Vec is a direct-prediction method; GloVe is count-based (the core difference lies in the method)
- Word2Vec uses a window to predict the context from the local corpus; GloVe uses windows to build a co-occurrence matrix from global corpus statistics
- Word2Vec's loss is the cross-entropy; GloVe's loss is a weighted least-squares error
Conclusion
GloVe combines the advantages of matrix factorization and shallow window methods, making full use of both global statistics and the local context window.
By training only on the nonzero elements of the word-word co-occurrence matrix, it produces a vector space with meaningful substructure. In comparable settings, GloVe can achieve better results faster than word2vec.
What is “cross entropy” loss/error
The concept of "entropy" comes from information theory; it expresses the expected amount of information of a distribution:
$$H(p) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
KL Divergence
Given a random variable x with two different probability distributions P(x) and Q(x), KL divergence measures the difference between the two distributions.
In the context of machine learning, $D_{KL}(P\|Q)$ is often called the information gain achieved if P is used instead of Q.
In deep learning, P represents the true distribution, while Q is the predicted distribution.
Intuitively, describing the samples with P is ideal, while describing them with Q requires extra "information gain" to achieve the same result. Through repeated training, Q approaches P and no longer needs this extra "information gain".
KL Divergence formula:
$$D_{KL}(p\|q) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)} = \sum_{i=1}^{n} p(x_i) \log p(x_i) - \sum_{i=1}^{n} p(x_i) \log q(x_i) \tag{2.2}$$
Cross Entropy
From $(2.2)$, the first half of the formula is the negative entropy of p, $-H(p)$, which does not change, so the remaining part is the cross entropy:
$$H(p, q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i)$$
Since we essentially want to measure the difference between labels and predictions, KL divergence is the natural choice; but because $-H(p)$ does not change, we can use the cross entropy directly as the loss.
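A quick numerical check of the relationship $H(p, q) = H(p) + D_{KL}(p\|q)$, with made-up distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (e.g. labels)
q = np.array([0.5, 0.3, 0.2])   # "predicted" distribution

entropy       = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
kl_divergence =  np.sum(p * np.log(p / q))    # D_KL(p || q)

# cross entropy equals entropy plus KL divergence (up to floating-point error)
assert np.isclose(cross_entropy, entropy + kl_divergence)
```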
To do
- Derivation of GloVe algorithm