Lecture 2: Word Vectors 2 and Word Senses
Optimization: Gradient Descent
Gradient Descent:
- $J(\theta)$ is a function of all windows in the corpus
- $\nabla_{\theta} J(\theta)$ is expensive to compute
Stochastic gradient descent:
- in the simplest case, sample a single window
- compute the gradient for that one window and update the parameters
Mini-batch gradient descent:
- The data is divided into several batches, and the parameters are updated once per batch, so the examples in a batch jointly determine the direction of the gradient
- gives less noisy gradient estimates, because we average over a batch rather than using a single example
- parallelizes well on GPUs, so training runs faster
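For reference, the update rules behind these three variants can be written out explicitly (a standard formulation, not specific to these notes; $\alpha$ is the learning rate, $J_t$ the loss on a single sampled window $t$, and $B$ a sampled mini-batch):

$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta) \qquad \text{(full-batch gradient descent)}$$

$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J_t(\theta) \qquad \text{(SGD, one window)}$$

$$\theta^{new} = \theta^{old} - \alpha \frac{1}{|B|} \sum_{t \in B} \nabla_\theta J_t(\theta) \qquad \text{(mini-batch)}$$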
Stochastic gradients with word vectors
$\nabla_{\theta} J(\theta)$ is very sparse, so we should only update the word vectors that actually appear in the window
Solution:
- either we need sparse matrix update operations that touch only certain rows of the embedding matrices U and V,
- or we need to keep around a hash for word vectors
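A minimal sketch of such a sparse row update (hypothetical function and variable names; it assumes the gradients have already been computed only for the rows of U and V corresponding to words in the sampled window):

```python
import numpy as np

def sparse_sgd_update(U, V, window_word_ids, grad_U_rows, grad_V_rows, lr=0.05):
    """Update only the embedding rows for words that appeared in the window.

    U, V            : [vocab_size, d] outside / center embedding matrices
    window_word_ids : indices of the words in the sampled window
    grad_U_rows     : [len(window_word_ids), d] gradients w.r.t. those rows of U
    grad_V_rows     : same shape, gradients w.r.t. those rows of V
    """
    for k, widx in enumerate(window_word_ids):
        U[widx] -= lr * grad_U_rows[k]   # touch only rows that actually appear
        V[widx] -= lr * grad_V_rows[k]
    return U, V
```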
About GloVe
Why not capture co-occurrence counts directly?
Two options for building the counts:
- Window: use a window around each word, like word2vec, to capture both syntactic and semantic information
- Word-document: a word-document co-occurrence matrix gives general topics, leading to "Latent Semantic Analysis"
Example: using a window of length 1 to build the co-occurrence matrix from these sentences:
- I like deep learning.
- I like NLP.
- I enjoy flying.
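A minimal sketch (illustrative code, not from the lecture) that builds this window-1 co-occurrence matrix:

```python
from collections import defaultdict

sentences = [
    "I like deep learning .".split(),
    "I like NLP .".split(),
    "I enjoy flying .".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# X[i][j] = number of times word j appears within a window of 1 around word i
X = defaultdict(lambda: defaultdict(int))
for s in sentences:
    for pos, w in enumerate(s):
        for ctx_pos in (pos - 1, pos + 1):          # symmetric window of size 1
            if 0 <= ctx_pos < len(s):
                X[idx[w]][idx[s[ctx_pos]]] += 1

# e.g. X[idx["I"]][idx["like"]] == 2, since "I like" occurs in two sentences
```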
Problems:
- Increase in size with vocabulary
- High dimensional
- Subsequent classification models have sparsity issues
Method: Dimensionality Reduction on Co-occurrence Matrix
Singular Value Decomposition (SVD) factorizes X into $U \Sigma V^T$, where U and V have orthonormal columns and $\Sigma$ is a diagonal matrix of singular values.
To reduce dimensionality, throw away the smallest singular values, together with the corresponding columns of U and V.
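A minimal sketch of this reduction with numpy (assuming X is a dense co-occurrence matrix like the one built above and k is the target dimensionality; taking rows of $U_k \Sigma_k$ as word vectors is one common choice, not the only one):

```python
import numpy as np

def reduce_with_svd(X, k):
    """Keep only the top-k singular values/vectors of the co-occurrence matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
    # Singular values come back in descending order: throw away the smallest
    # ones and the corresponding columns of U / rows of Vt.
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    word_vectors = U_k * s_k        # rows of U scaled by the kept singular values
    return word_vectors, (U_k, s_k, Vt_k)
```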
What is the essence of dimensionality reduction, and what is the criterion?
Find lower-dimensional representations that explain most of the variance in the high-dimensional data.
Encoding meaning in vector differences
Summary: the learning of word vectors should be related to the ratios of conditional co-occurrence probabilities of words.
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.
That is, a single co-occurrence probability being large or small is not meaningful by itself; it is the ratio of co-occurrence probabilities that indicates a meaning component.
How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
- Log-bilinear model: $w_i \cdot w_j = \log P(i|j)$
- with vector differences: $w_x \cdot (w_a - w_b) = \log \frac{P(x|a)}{P(x|b)}$
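Spelling out how the log-bilinear model gives this property (a one-line derivation using the model equation above):

$$w_x \cdot (w_a - w_b) = w_x \cdot w_a - w_x \cdot w_b = \log P(x|a) - \log P(x|b) = \log \frac{P(x|a)}{P(x|b)}$$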
Co-occurrence Matrix
- $X$: word-word co-occurrence matrix
- $X_{ij}$: number of times word j occurs in the context of word i
- $X_i = \sum_k X_{ik}$: the number of times any word k appears in the context of word i
- $P_{ij} = P(j|i) = \dfrac{X_{ij}}{X_i}$: the probability of word j appearing in the context of word i
Least Squares Objective
For the skip-gram model, the probability that word j appears in the context of word i is, as in word2vec,
$$Q_{ij} = \frac{\exp(u_j^T v_i)}{\sum_{w=1}^{W} \exp(u_w^T v_i)}$$
and the implied global loss is the cross-entropy loss:
$$J = -\sum_{i \in corpus} \sum_{j \in context(i)} \log Q_{ij} = -\sum_{i=1}^{W} \sum_{j=1}^{W} X_{ij} \log Q_{ij}$$
A drawback of the cross-entropy loss is that it requires $Q_{ij}$ to be properly normalized, which involves an expensive sum over the whole vocabulary. Instead, use a least-squares objective in which the normalization factors are discarded:
$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^2$$
where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(u_j^T v_i)$ are the unnormalized distributions. However, this brings another problem: $X_{ij}$ often takes on large values, which makes the optimization difficult.
An effective change is to minimize the squared error of the logarithms:
$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\log \hat{P}_{ij} - \log \hat{Q}_{ij}\right)^2 = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(u_j^T v_i - \log X_{ij}\right)^2$$
Finally, the weighting factor $X_i$ is not guaranteed to be optimal, so replace it with a more general weighting function $f$ that depends on the co-occurrence count of the word pair:
$$J = \sum_{i=1}^{W} \sum_{j=1}^{W} f(X_{ij}) \left(u_j^T v_i - \log X_{ij}\right)^2$$
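A sketch of this final objective in code (a minimal illustration; the clipped power-law form of f with $x_{max}$ and $\alpha$ follows the GloVe paper, the bias terms of the full GloVe model are omitted to match the derivation above, and variable names are illustrative):

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe-style weighting function: grows with the count, then saturates at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(U, V, X):
    """Weighted least-squares loss over the nonzero co-occurrence counts.

    U, V : [vocab_size, d] context / center word vectors
    X    : [vocab_size, vocab_size] co-occurrence counts
    """
    i_idx, j_idx = np.nonzero(X)                 # train only on nonzero entries
    x_ij = X[i_idx, j_idx]
    dot = np.sum(U[j_idx] * V[i_idx], axis=1)    # u_j^T v_i for each pair
    return np.sum(f_weight(x_ij) * (dot - np.log(x_ij)) ** 2)
```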
Comparison with LSA and Word2Vec
LSA vs GloVe
- LSA, based on the co-occurrence matrix, can capture word similarities, but it does poorly on tasks such as word analogy.
- GloVe can be described as an efficient matrix-factorization-style algorithm over the same co-occurrence statistics that LSA uses
Word2Vec vs GloVe
- Word2Vec is a direct-prediction method; GloVe is count-based (the core difference lies in the method)
- Word2Vec uses a window to predict the context from the local corpus; GloVe uses windows to build a co-occurrence matrix from global corpus statistics
- Word2Vec's loss is the cross-entropy; GloVe's loss is a weighted least-squares error
Conclusion
GloVe combines the advantages of matrix factorization and shallow window methods, making full use of both global statistics and the local context window.
By training only on the nonzero elements of the word-word co-occurrence matrix, it produces a vector space with meaningful substructure. In comparable settings, GloVe can achieve better results faster than word2vec.
What is “cross entropy” loss/error
The concept of "entropy" comes from information theory; it expresses the expected amount of information of a distribution:
$$H(p) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
KL Divergence
Given a random variable x with two different probability distributions P(x) and Q(x), KL divergence measures the difference between the two distributions.
In the context of machine learning, $D_{KL}(P\|Q)$ is often called the information gain achieved if P is used instead of Q.
In deep learning, P represents the true distribution, while Q is the predicted distribution.
Intuitively, describing the samples with P is ideal, while describing them with Q requires extra "information gain" to achieve the same result. Through repeated training, Q approaches P and no longer needs this extra "information gain".
KL Divergence formula:
$$D_{KL}(p\|q) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)} = \sum_{i=1}^{n} p(x_i) \log p(x_i) - \sum_{i=1}^{n} p(x_i) \log q(x_i) \tag{2.2}$$
Cross Entropy
From $(2.2)$, the first half of the formula is the negative entropy of p, $-H(p)$, which does not change, so the remaining part is the cross entropy:
$$H(p, q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i)$$
Since we essentially want to measure the difference between labels and predictions, KL divergence is the natural choice; but because $-H(p)$ does not change, we can use the cross entropy directly as the loss.
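A quick numerical check of the relationship $H(p, q) = H(p) + D_{KL}(p\|q)$, with made-up distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (e.g. labels)
q = np.array([0.5, 0.3, 0.2])   # "predicted" distribution

entropy       = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
kl_divergence =  np.sum(p * np.log(p / q))    # D_KL(p || q)

# cross entropy equals entropy plus KL divergence (up to floating-point error)
assert np.isclose(cross_entropy, entropy + kl_divergence)
```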
To do
- Derivation of GloVe algorithm