Class Notes
Word Representation#
How do we represent words in NLP models?
N-gram#
Definition: An n-gram is a sequence of n elements taken consecutively from a given text or speech sequence; in natural language processing these elements are usually words. The n-gram model is a probabilistic language model used to estimate the probability of a text sequence. It is based on the assumption that the probability of a word depends only on the n-1 words that precede it. For example, in a bigram model (n=2), the sentence "I love natural language processing" would be broken down into bigrams such as "I love", "love natural", "natural language", and "language processing". The model learns the probability of each bigram from a corpus, e.g., how often the combination "love natural" occurs, and uses these probabilities to score text or to generate language.
The first formula is the joint probability of a sequence of words $w_1, w_2, \dots, w_n$. Under the bigram assumption it factorizes as

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}),$$

i.e., the probability of the entire sequence is the product of each word's probability conditioned on the previous word. This decomposition follows from the chain rule combined with the Markov assumption.

The second formula estimates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ with add-α smoothing:

$$P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i) + \alpha}{\text{Count}(w_{i-1}) + \alpha\,|V|}.$$

The components of the formula are explained as follows:
- $\text{Count}(w_{i-1}, w_i)$: the count of the word pair $(w_{i-1}, w_i)$ in the training data.
- $\text{Count}(w_{i-1})$: the count of the word $w_{i-1}$ in the training data.
- $|V|$: the size of the vocabulary, i.e., the number of distinct words in the vocabulary.
- $\alpha$: a smoothing parameter used to avoid the zero-probability problem. It is a value between 0 and 1, used to assign non-zero probability to unseen word pairs.
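To make the smoothed bigram estimate concrete, here is a minimal Python sketch; the toy corpus, the α value, and the function names are illustrative assumptions rather than anything from the notes:

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Collect unigram and bigram counts from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, alpha=0.1):
    """Add-alpha smoothed estimate of P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + alpha) / (unigrams[w_prev] + alpha * vocab_size)

tokens = "i love natural language processing and i love language models".split()
unigrams, bigrams = train_bigram_counts(tokens)
V = len(unigrams)
print(bigram_prob("i", "love", unigrams, bigrams, V))       # seen bigram: relatively high
print(bigram_prob("love", "models", unigrams, bigrams, V))  # unseen bigram: small but non-zero
```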
Specific Details#
Definition: An n-gram is a chunk of n consecutive words.
• unigrams: “the”, “students”, “opened”, ”their”
• bigrams: “the students”, “students opened”, “opened their”
• trigrams: “the students opened”, “students opened their”
• four-grams: “the students opened their”
Concept: Collect the frequencies of different n-grams and use this data to predict the next word.
- First we make a Markov assumption: the next word $w_{t+1}$ depends only on the preceding n-1 words
- How do we obtain the probabilities of these n-grams and (n-1)-grams? Answer: By counting them in some large text corpora!
Example#
Suppose we are learning a 4-gram language model. We discard all context except the last 3 words, so for the prefix "the students opened their" we estimate

$$P(w \mid \text{students opened their}) = \frac{\text{Count}(\text{students opened their } w)}{\text{Count}(\text{students opened their})}.$$
Sparsity Problems#
The n-gram model is a method used in natural language processing to predict text sequences, based on the assumption that the occurrence of a word depends only on the n-1 preceding words. However, when the value of n is large, the sparsity problem becomes more severe, as there may be many unseen n-gram combinations.
- Sparsity Problem 1:
Problem:
If a specific word or phrase (like “students opened their w”) has never appeared in the training data, then according to the n-gram model, the probability of this word or phrase will be 0. This leads to the model being unable to assign any probability to these unseen words or phrases.
Partial Solution:
Smoothing: To address this issue, a small value δ can be added to the count of each word. This way, even if a word or phrase has never appeared in the data, its probability will not be 0. Smoothing techniques can ensure that all words have a non-zero probability, thus avoiding the zero probability situation.
- Sparsity Problem 2:
Problem:
If a longer n-gram (like “students opened their”) has never appeared in the data, the model will be unable to calculate the probability of any word following this phrase (like w).
Partial Solution:
Backoff: In this case, one can back off to a shorter n-gram (like “opened their”) to estimate the probability. This lets the model fall back on shorter n-grams when it encounters unseen longer ones (a minimal backoff sketch appears after the notes below).
Notes:
Impact of Increasing n: Increasing n (the length of the n-gram) exacerbates the sparsity problem. Typically, we cannot let n exceed 5, as the number of unseen n-gram combinations increases dramatically with n, leading to more sparsity issues.
Through these methods, the n-gram language model can mitigate sparsity issues to some extent, improving the model's generalization ability and prediction accuracy. However, these methods also have their limitations, such as smoothing potentially introducing some noise, and backoff possibly losing some contextual information. Therefore, it is crucial to choose an appropriate n value and smoothing technique in practical applications.
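As a rough illustration of the backoff idea, here is a minimal sketch of the "stupid backoff" scoring scheme (Brants et al.), which produces scores rather than properly normalized probabilities; the 0.4 discount, the toy corpus, and the function names are assumptions:

```python
from collections import Counter

def stupid_backoff(context, word, counts, discount=0.4):
    """Score ~P(word | context): if the full n-gram was never seen,
    back off to a shorter context and apply a discount."""
    if not context:
        total_unigrams = sum(c for k, c in counts.items() if len(k) == 1)
        return counts[(word,)] / total_unigrams
    if counts[context + (word,)] > 0:
        return counts[context + (word,)] / counts[context]
    return discount * stupid_backoff(context[1:], word, counts, discount)

# counts over 1-, 2-, and 3-grams from a toy corpus
tokens = "the students opened their books the students opened their minds".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*(tokens[i:] for i in range(n))))

print(stupid_backoff(("students", "opened"), "their", counts))    # seen trigram
print(stupid_backoff(("professors", "opened"), "their", counts))  # backs off to ("opened",)
```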
Storage Problems#
Storage Requirements: The n-gram model needs to store the counts of all n-grams observed in the training corpus. This means the size of the model is proportional to the number of different n-grams in the training data.
Factors Affecting Model Size:
- Increasing n: As the length n of the n-gram increases, the model needs to store more n-gram counts, as the number of combinations of longer n-grams increases dramatically.
- Increasing Corpus: An increase in the training corpus also increases the size of the model, as more text means more n-gram combinations.
Solutions and Challenges:
- Storage Optimization: Since the storage requirements of the n-gram model grow with n and with the size of the corpus, effective storage optimization techniques, such as compression and hash tables, are needed to reduce storage space.
- Model Simplification: The model can be simplified by limiting the length of n-grams, using more efficient data structures or algorithms to reduce storage requirements.
- Sparsity Issues: As n increases, sparsity issues (i.e., many n-grams that have never appeared in the training data) become more severe, necessitating the use of smoothing techniques to address them.
- Alternative Models: Consider using more advanced models, such as neural network models (like Transformers), which are typically more compact and can learn more complex language patterns with fewer parameters.
Naive Bayes#
Naive Bayes is a simple probabilistic classifier based on Bayes' theorem, which assumes independence between features. In text classification tasks, the Naive Bayes model can be used to estimate the probability $P(w_i \mid c_j)$ of a word $w_i$ occurring given a category $c_j$.
The formula is as follows:

$$P(w_i \mid c_j) = \frac{\text{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} \text{Count}(w, c_j) + \alpha\,|V|}$$

Formula Explanation:
- $P(w_i \mid c_j)$: the probability of word $w_i$ occurring given category $c_j$. This is the estimate the model produces.
- $\text{Count}(w_i, c_j)$: the count of word $w_i$ in category $c_j$.
- $\sum_{w \in V} \text{Count}(w, c_j)$: the total count of all words in category $c_j$. Here, V is the vocabulary, representing all possible words.
- $\alpha$: a smoothing parameter used to handle data sparsity and avoid zero probabilities. It is a value between 0 and 1.
- $|V|$: the size of the vocabulary, i.e., the number of distinct words in the vocabulary.
Smoothing Techniques:
In the Naive Bayes model, smoothing is likewise used to handle data sparsity. In the formula this is achieved by adding α to the numerator and α|V| to the denominator:
- Adding α to the numerator: $\text{Count}(w_i, c_j) + \alpha$, so that even if word $w_i$ never appears in category $c_j$, its estimated probability is not zero but a small value determined by α.
- Adding α|V| to the denominator: $\sum_{w \in V} \text{Count}(w, c_j) + \alpha|V|$, which keeps the estimates normalized, so the leftover probability mass is spread evenly over words that never appear in category $c_j$.
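A minimal sketch of the smoothed estimate above, assuming documents are given as (word list, class label) pairs; the variable and function names are illustrative:

```python
from collections import Counter, defaultdict

def train_naive_bayes_word_probs(docs, alpha=1.0):
    """docs: list of (list_of_words, class_label).
    Returns a function computing P(w | c) with add-alpha smoothing."""
    word_counts = defaultdict(Counter)   # class -> Counter of word counts
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)

    def prob(word, label):
        total = sum(word_counts[label].values())
        return (word_counts[label][word] + alpha) / (total + alpha * len(vocab))

    return prob

docs = [("great movie great acting".split(), "pos"),
        ("terrible movie boring plot".split(), "neg")]
p = train_naive_bayes_word_probs(docs)
print(p("great", "pos"))   # high for words frequent in the class
print(p("great", "neg"))   # non-zero thanks to smoothing
```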
Why Focus on Semantics in NLP Models?#
If we use words: a feature is a word identity (= string)
For example, if the previous word is ‘terrible’, it needs to be exactly the same ‘terrible’ in both the test set and training set.
But if we can convert semantics into vectors:
- previous word was vector [35, 22, 17, …]
- Now in the test set we might see a similar vector [34, 21, 14, …]
- We can generalize to similar but unseen words!!!
- In traditional NLP, we treat words as discrete symbols represented by one-hot vectors, where the vector dimension equals the number of words in the vocabulary. This representation provides no natural notion of similarity between words.
- Distributional Semantics: The meaning of a word is given by the words that frequently appear nearby.
- When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window). We use the many contexts of w to build up a representation of w.
- We can represent a word's context using vectors!
What do words mean?#
- Synonyms: couch/sofa, car/automobile, filbert/hazelnut
- Antonyms: dark/light, rise/fall, up/down
- Some words are not synonyms, but they share some meaning elements, such as: cat/dog, car/bicycle, cow/horse
- Some words are not similar, but they are related: coffee/cup, house/door, chef/menu
The big idea: model of meaning focusing on similarity
Similar words are “nearby in vector space”
Word Embedding Process#
Goal: represent words as short (50-300 dimensional) & dense (real-valued) vectors!
- Count-based approaches:
A long history: count-based methods have been in use since the 1990s.
Co-occurrence Matrix: Construct a sparse word-word co-occurrence matrix (often reweighted with PPMI, Positive Pointwise Mutual Information) that records how often different words appear together in the text.
SVD Decomposition: Use Singular Value Decomposition (SVD) to factorize the co-occurrence matrix and obtain low-dimensional vector representations of words.
- Prediction-based approaches:
Machine Learning Problem: Frame the word embedding problem as a machine learning problem by predicting words in the context to learn the representations of words.
Word2vec: Proposed by Mikolov et al. in 2013, Word2vec learns word vectors by predicting context words given a word or predicting a center word given context words.
GloVe: Proposed by Pennington et al. in 2014, GloVe (Global Vectors for Word Representation) utilizes global word-word co-occurrence information to learn word vectors.
Word embeddings: the learning problem#
Learn vectors from text to represent words.
Input:
A large text corpus and vocabulary V.
Vector dimension d (e.g., 300 dimensions).
Output:
A function $f: V \rightarrow \mathbb{R}^d$ that maps each word in the vocabulary to a d-dimensional real-valued vector.
Learning Process:
The learning process of word embeddings typically involves optimizing an objective function that measures the model's performance on prediction tasks (such as predicting words in context).
Through training, the learned word vectors can capture relationships between words, such as synonyms, antonyms, and categories of words.
Basic Properties:
- Similar words have similar vectors: the nearest neighbors of a word w* can be found as $\operatorname{argmax}_{w \in V} \cos(e(w), e(w^*))$.
- The relationship between “man” and “woman” parallels the relationship between “king” and “queen”. In the embedding space these relationships look alike, i.e., $v_{\text{man}} - v_{\text{woman}} \approx v_{\text{king}} - v_{\text{queen}}$: the vector from “man” to “woman” is similar to the vector from “king” to “queen”.
- Verb tense: pairs such as “walk”/“walked” and “swim”/“swam” show a similar pattern, i.e., $v_{\text{walk}} - v_{\text{walked}} \approx v_{\text{swim}} - v_{\text{swam}}$.
- Country-Capital: pairs such as “France”/“Paris” and “Italy”/“Rome” behave the same way, i.e., $v_{\text{Paris}} - v_{\text{France}} \approx v_{\text{Rome}} - v_{\text{Italy}}$.
- Solving analogy problems: find the analogous word by computing vector differences and cosine similarity. The specific steps are as follows (see the sketch after this list):
Define the analogy: given a relationship a : a* :: b : b*, where a, a*, and b are known words and b* is the analogy word to be found.
Compute the target vector: calculate $e(a^*) - e(a) + e(b)$, where $e(w)$ denotes the vector representation of word w.
Find the most similar word: find the word b* in the vocabulary V with the highest cosine similarity to the target vector, i.e., $b^* = \operatorname{argmax}_{w \in V} \cos\big(e(w),\, e(a^*) - e(a) + e(b)\big)$.
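A minimal sketch of the analogy procedure with toy 2-D vectors; the embedding values are made up purely for illustration:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(a, a_star, b, emb):
    """Return b* = argmax_w cos(e(w), e(a*) - e(a) + e(b)), excluding the query words."""
    target = emb[a_star] - emb[a] + emb[b]
    candidates = {w: cosine(vec, target) for w, vec in emb.items() if w not in {a, a_star, b}}
    return max(candidates, key=candidates.get)

# toy embeddings (not learned, just chosen so the arithmetic works out)
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}
print(solve_analogy("man", "woman", "king", emb))  # -> "queen"
```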
This image illustrates the process of learning language models (LMs) through neural networks, specifically how the concept of word embeddings is introduced. The model described in the image is the Neural Probabilistic Language Model, proposed by Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin in 2003.
Element explanations in the image:
- Input Layer (index for $w_{t-n+1}, \dots, w_{t-2}, w_{t-1}$):
These are the indices of the previous n-1 words, representing the context. Each index is mapped to a vector representation through a lookup table (table look-up), i.e., a word embedding.
- Word Embedding Layer ($C(w_{t-n+1}), \dots, C(w_{t-2}), C(w_{t-1})$):
Each word's index is mapped to a vector through the lookup table C, and these vectors are shared parameters across words, representing the word embeddings.
- Hidden Layer (tanh):
The word embedding vectors are concatenated and passed through a nonlinear activation function (tanh). This step is the most computationally intensive part of the model.
- Output Layer (softmax):
The output of the hidden layer is transformed into a probability distribution through the softmax function, representing the probability of each possible next word given the context.
- Output (t-th output = $P(w_t = i \mid \text{context})$):
The final output is the probability that the t-th word is a specific word i, given the context.
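To make the architecture concrete, here is a rough numpy sketch of one forward pass of a fixed-window neural language model in the spirit of Bengio et al.; all shapes, initializations, and names are illustrative assumptions, and details of the original model (such as its optional direct input-to-output connections) are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_context, hidden = 10, 8, 3, 16   # vocab size, embedding dim, window, hidden size

C = rng.normal(size=(V, d))               # shared word embedding table
W = rng.normal(size=(n_context * d, hidden))
b1 = np.zeros(hidden)
U = rng.normal(size=(hidden, V))          # output layer weights
b2 = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(context_ids):
    """context_ids: indices of the previous n_context words."""
    x = C[context_ids].reshape(-1)        # look up and concatenate embeddings
    h = np.tanh(x @ W + b1)               # hidden layer
    return softmax(h @ U + b2)            # probability over the vocabulary

probs = next_word_distribution([4, 7, 1])
print(probs.shape, probs.sum())           # (10,) 1.0
```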
Word2vec#
Skip-gram#
The goal of the Skip-gram model is to use each word to predict other words in its context.
Assumption:
We have a large text corpus
Key Idea:
Use each word to predict other words in its context. This is a classification problem because the model needs to select the correct context word from the vocabulary.
Context:
The context is defined as a fixed-size window of size 2m (in the example in the image, m=2). This means that for each center word, the model considers m words before and after it as context.
Probability Calculation:
Given the center word a, the model needs to calculate the probability P(b∣a) of other words b becoming context words.
This probability distribution P(⋅∣a) is defined over the vocabulary and satisfies $\sum_{w \in V} P(w \mid a) = 1$, i.e., the probabilities of all possible context words sum to 1.
The image shows a center word "into" with a context window of size 2, i.e., two words before and two words after, namely "problems," "turning," "banking," and "crises."
The model needs to learn how to predict these context words based on the center word.
Principle of the Skip-gram Model:
Goal: For each center word, the model's objective is to maximize the probability of its context words.
Loss Function: Typically, the cross-entropy loss function is used to train the model, minimizing the difference between the predicted probability distribution and the actual context word distribution.
Optimization: Adjust model parameters through gradient descent or other optimization algorithms to minimize the loss function.
This image further explains the training process of the Skip-gram model, showing how to convert text data into a format that the model can process and illustrating the training objective of the model.
Context Window:
The image shows a fixed window size of 2, meaning that for each center word (marked in red in the image), the model considers two words before and after as context.
Probability Calculation:
For each center word, the model needs to calculate the probabilities of its context words. For example, given the center word "into," the model needs to calculate the probabilities of "problems," "turning," "banking," and "crises" becoming context words.
Training Data Conversion:
The right side of the image shows how to convert the original text data into the format required for model training. For example, for the center word "into," the model generates training samples like (into, problems), (into, turning), (into, banking), (into, crises), etc.
Training Objective:
The model's training objective is to find a set of parameters that can maximize the probabilities of context words. In other words, the word vectors the model tries to learn should best predict the context words for a given center word.
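A minimal sketch of the data conversion just described, assuming a window of size m on each side and a plain list of tokens; names are illustrative:

```python
def skipgram_pairs(tokens, m=2):
    """Yield (center, context) training pairs with a window of size m on each side."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "problems turning into banking crises as".split()
print([pair for pair in skipgram_pairs(sentence) if pair[0] == "into"])
# [('into', 'problems'), ('into', 'turning'), ('into', 'banking'), ('into', 'crises')]
```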
Objective Function:
How do we define $P(b \mid a)$?
This is achieved by using word vectors and the softmax function.
Two Sets of Vectors:
For each word in the vocabulary V, use two sets of vectors:
- $v_a \in \mathbb{R}^d$: the vector of word a when it is the center word, for all $a \in V$.
- $u_b \in \mathbb{R}^d$: the vector of word b when it is a context word, for all $b \in V$.
Inner Product:
Use the inner product $u_b^{\top} v_a$ to measure how likely the center word a is to appear with the context word b.
Softmax Function:
Use the softmax function to convert the inner products into a probability distribution, by normalizing the exponential of the inner product by the sum of exponentials over all possible context words:

$$P(b \mid a) = \frac{\exp(u_b^{\top} v_a)}{\sum_{w \in V} \exp(u_w^{\top} v_a)}$$

Probability Distribution:
$P(\cdot \mid w_t)$ is a probability distribution over the vocabulary V, representing the probability of each possible context word given the center word $w_t$.
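A minimal numpy sketch of the softmax probability $P(b \mid a)$ defined above, using random vectors as stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                       # toy vocabulary size and embedding dimension
v = rng.normal(size=(V, d))       # center-word vectors v_a
u = rng.normal(size=(V, d))       # context-word vectors u_b

def p_context_given_center(a):
    """Return the distribution P(. | a) over all words in the vocabulary."""
    scores = u @ v[a]             # inner products u_b . v_a for every b
    scores -= scores.max()        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = p_context_given_center(a=2)
print(probs, probs.sum())         # a valid distribution: sums to 1
```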
vs Multinomial Logistic Regression#
- Multinomial Logistic Regression:
Formula:
Multinomial logistic regression is used for multi-class problems, and its formula is:

$$P(y = c \mid x) = \frac{\exp(w_c \cdot x + b_c)}{\sum_{c'=1}^{m} \exp(w_{c'} \cdot x + b_{c'})}$$

where y is the class label, c is one of the classes, x is the input feature vector, $w_c$ and $b_c$ are the weight vector and bias term for class c, and m is the total number of classes.
Explanation:
The numerator is the exponential of the inner product of the input feature vector x with the weight vector $w_c$ for class c, plus the bias term $b_c$.
The denominator is the sum of such exponentials over all classes, used for normalization, ensuring that the class probabilities sum to 1.
- Skip-gram Model:
Formula:
The probability calculation in the Skip-gram model is:

$$P(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}}^{\top} v_{w_t})}{\sum_{w \in V} \exp(u_w^{\top} v_{w_t})}$$

where $w_t$ is the center word, $w_{t+j}$ is a context word, $v_{w_t}$ and $u_{w_{t+j}}$ are the vector representations of the center word and context word, and V is the vocabulary.
Explanation:
The numerator is the exponential of the inner product of the center word vector and the context word vector.
The denominator is the sum of the exponentials of the center word's inner products with all words in the vocabulary, used for normalization.
- Comparison:
Essentially a |V|-way classification problem: the Skip-gram model can be viewed as a multi-class classification problem, where |V| is the size of the vocabulary.
Fixing $v_{w_t}$: if the center word vector is fixed, the problem reduces to a multinomial logistic regression problem.
Non-convex optimization: because the vectors for both the center words and the context words must be learned simultaneously, the training objective is non-convex, so the optimization may have multiple local optima.
Practice#
The answer is (b).
Each word has two d-dimensional vectors, so the total number of parameters is 2 × |V| × d.
Question: Why does each word need two vectors instead of one?
Answer: Because a word is unlikely to appear in its own context window. For example, given the word "dog," P(dog∣dog) should be low. If we used a single set of vectors, making this probability low would force the model to minimize $v_{\text{dog}} \cdot v_{\text{dog}} = \lVert v_{\text{dog}} \rVert^2$, i.e., to shrink the word's own vector, which hurts the quality of the embeddings.
Question: Which set of vectors is used as word embeddings?
Answer: This is an empirical question. Typically only one set (commonly the center-word vectors) is used as the word embeddings, but you can also concatenate the two sets.
Skip-gram with Negative Sampling (SGNS) and Other Variants#
Problem Description:
In the traditional Skip-gram model, each time a (center word, context word) pair (t, c) is processed, the softmax normalization runs over the entire vocabulary, so the context vectors of all words in V are involved in the update. This is computationally expensive.
Negative Sampling Method:
Negative sampling does not consider all words in the vocabulary but instead randomly samples K negative samples (usually K is between 5 and 20). This means we randomly select K words from the vocabulary as negative samples instead of using all words.
Softmax and Negative Sampling Formulas:
Softmax: The original Skip-gram model uses the softmax function to calculate probabilities:

$$P(c \mid t) = \frac{\exp(u_c^{\top} v_t)}{\sum_{w \in V} \exp(u_w^{\top} v_t)}$$

Negative Sampling: Negative sampling replaces the softmax with a simpler binary objective based on

$$P(y = 1 \mid t, c) = \sigma(u_c^{\top} v_t)$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, used to convert the inner product into a probability.
Key Idea:
Transform the original ∣V∣-way classification problem (where ∣V∣ is the size of the vocabulary) into a set of binary classification tasks.
Each time a pair of words (t,c) is obtained, the model predicts whether (t,c) is a positive sample pair, while (t,c′) is a negative sample pair, where c′ is randomly selected from a small sampling set.
Positive and Negative Samples:
Positive Sample: For example, for the center word "apricot" and the context word "tablespoon," this is a positive sample pair.
Negative Sample: For example, for the center word "apricot" and a randomly selected word "aardvark," this is a negative sample pair.
Loss Function:
The negative-sampling loss for a training pair (t, c) is defined as:

$$L = -\log \sigma(u_c^{\top} v_t) - \sum_{i=1}^{K} \mathbb{E}_{c_i \sim P(w)} \log \sigma(-u_{c_i}^{\top} v_t)$$

where $\sigma$ is the sigmoid function, K is the number of negative samples, and P(w) is the sampling distribution based on word frequency.
Probability Calculation:
The probability $P(y = 1 \mid t, c)$ that context word c occurs with center word t is computed as $\sigma(u_c^{\top} v_t)$.
The probability $P(y = 0 \mid t, c')$ that negative sample c' does not occur with t is computed as $\sigma(-u_{c'}^{\top} v_t) = 1 - \sigma(u_{c'}^{\top} v_t)$.
Optimization:
Similar to binary logistic regression, but the center word vector $v_t$ and the context word vectors $u_c$ (and the sampled $u_{c'}$) must be optimized simultaneously.
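A minimal numpy sketch of the negative-sampling loss for one (t, c) pair with K sampled negatives; the random vectors and names are simplifying assumptions (word2vec actually draws negatives from a unigram distribution raised to the 3/4 power):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_t, u_c, u_negs):
    """Negative-sampling loss for one positive pair and K negative context vectors.
    v_t: center vector (d,), u_c: positive context vector (d,), u_negs: (K, d)."""
    positive_term = -np.log(sigmoid(u_c @ v_t))
    negative_term = -np.sum(np.log(sigmoid(-(u_negs @ v_t))))
    return positive_term + negative_term

rng = np.random.default_rng(0)
d, K = 8, 5
v_t, u_c = rng.normal(size=d), rng.normal(size=d)
u_negs = rng.normal(size=(K, d))
print(sgns_loss(v_t, u_c, u_negs))
```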
Practice#
d.
- The vector $v_t$ of the center word t (dimension d).
- The vector $u_c$ of the positive context word c (dimension d).
- The vectors $u_{c'}$ of the K negative sample words (each of dimension d).
Continuous Bag of Words (CBOW)#
CBOW is the mirror image of Skip-gram: instead of predicting the context words from the center word, it predicts the center word from the combined (summed or averaged) vectors of the surrounding context words.
GloVe: Global Vectors#
This image introduces the GloVe (Global Vectors for Word Representation) model, an algorithm for generating word embeddings. Unlike window-based methods such as Skip-gram and CBOW, GloVe learns word vectors from global co-occurrence statistics, directly utilizing the co-occurrence matrix of the entire corpus.
Key Idea:
Directly use the co-occurrence counts of words to approximate the dot product between word vectors ($w_i^{\top} \tilde{w}_j \approx \log X_{ij}$).
Global Co-occurrence Statistics:
The model uses global co-occurrence statistics $X_{ij}$, the number of times words i and j appear together in the corpus.
Loss Function $J$:
The loss function for GloVe is defined as:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $f(X_{ij})$ is a weighting function used to adjust the influence of low-frequency word pairs; $w_i$ and $\tilde{w}_j$ are the vector representations of words i and j, respectively; $b_i$ and $\tilde{b}_j$ are bias terms; and $X_{ij}$ is the co-occurrence count of words i and j.
Training Speed and Scalability:
The GloVe model trains faster and can scale to very large corpora.
Weighting Function $f(X_{ij})$:
The graph in the lower right corner of the image shows the shape of the weighting function $f$: it increases with the co-occurrence count up to a cutoff and is capped at 1 beyond it, which down-weights rare co-occurring word pairs while preventing very frequent pairs from dominating the loss.
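A small sketch of the weighting function and of one term of the GloVe loss, using the defaults reported in the GloVe paper (x_max = 100, exponent 0.75); the rest of the names and values are illustrative:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): down-weights rare pairs, caps the weight of frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_j, b_i, b_j, x_ij):
    """One term f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 of the GloVe loss."""
    diff = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2

rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)
print(glove_weight(5.0), glove_weight(500.0))   # ~0.106 and 1.0
print(glove_term(w_i, w_j, 0.0, 0.0, x_ij=12.0))
```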
Advantages of GloVe:
Global Information: GloVe utilizes the co-occurrence information of the entire corpus, allowing it to capture broader semantic relationships.
Training Efficiency: Due to its matrix decomposition form, GloVe is more efficient in training compared to window-based methods.
Scalability: GloVe can handle very large corpora, making it perform well on large-scale datasets.
FastText#
This image introduces subword embeddings in the FastText model, an improved word embedding method that captures finer-grained semantic information by breaking words down into subwords (n-grams).
Subword Embeddings:
The FastText model is similar to the Skip-gram model, but it breaks words down into n-grams (subwords), where n ranges from 3 to 6.
This method can capture semantic information within words; for example, the word “where” can be broken down into subwords “wh,” “her,” “ere,” etc.
Example:
The image provides an example of breaking down the word “where”:
3-grams: <wh, whe, her, ere, re>
4-grams: <whe, wher, here, ere>
5-grams: <wher, where, here>
6-grams: <where, where>
Replacement Operation:
When calculating the inner product of the center word and context word vectors, the FastText model replaces the center word's vector with the sum of its subword vectors:

$$u_b^{\top} v_a \;\rightarrow\; \sum_{g \in \text{n-grams}(a)} u_b^{\top} z_g$$

where g is a subword of word a, $z_g$ is the vector of subword g, and n-grams(a) denotes the set of subwords of word a (including the word itself).
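A minimal sketch of extracting the character n-grams used by FastText, with `<` and `>` marking word boundaries; the function name and the inclusion of the whole word as an extra token follow the description above and are illustrative:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with < and > marking word boundaries."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)          # also keep the whole word as a special token
    return sorted(grams)

print(char_ngrams("where"))
# includes '<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', ..., '<where>'
```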
Advantages of the FastText Model:
Capturing Internal Structure:
By breaking words down into subwords, FastText can capture internal structural information of words, which is very helpful for understanding word semantics.
Handling Rare and Unknown Words:
Subword embeddings can better handle rare and unknown words because even if a word has not appeared in the training data, its subwords may have.
Improving Generalization Ability:
Subword embeddings give the model better generalization ability when facing new words, as it can use known subword information to infer the semantics of new words.
Pre-trained Usable Word Embeddings#
- word2vec: https://code.google.com/archive/p/word2vec/
- GloVe: https://nlp.stanford.edu/projects/glove
- FastText: https://fasttext.cc/
To Contextualized Word Vectors Using LMs#
This image illustrates the structure of the ELMo (Embeddings from Language Models) model, a deep learning model used to generate word embeddings. ELMo was proposed by Matthew E. Peters et al. in 2018, and its paper "Deep Contextualized Word Representations" details the principles and implementation of the model.
Element explanations in the image:
- Input Layer ($E_1, E_2, \dots, E_N$):
These are the input embeddings of the words, i.e., the context-independent token representations fed into the network (in ELMo they are computed from the characters of each word).
- Bidirectional LSTM Layer:
The image shows two layers of bidirectional LSTMs (Long Short-Term Memory), each consisting of a forward and a backward LSTM. Each LSTM unit processes sequential data and can capture long-distance dependencies between words.
Bidirectional LSTMs can simultaneously consider the contextual information of words from both directions, thus better understanding the contextual meaning of words.
- Additional Notes on the Bidirectional LSTM Layer:
- Bidirectional Long Short-Term Memory (Bi-LSTM) is a special type of recurrent neural network (RNN) that processes sequential data through two LSTM layers, one layer processing data in the forward direction (from the beginning to the end of the sequence) and the other layer processing data in the backward direction (from the end to the beginning of the sequence). This structure allows the network to consider the contextual information of each element in the sequence from both directions.
- Structure: In a bidirectional LSTM, for each time step t in the sequence, two LSTM units are at work:
Forward LSTM: starts from the first element of the sequence and processes it in the forward direction until the last element. At each time step t it only sees information from the beginning of the sequence up to the current time step.
Backward LSTM: starts from the last element of the sequence and processes it in the backward direction until the first element. At each time step t it only sees information from the end of the sequence back to the current time step.
- Information Flow: At each time step t, both the forward and backward LSTMs produce a hidden state. These two hidden states contain contextual information about the element at that position in the sequence, one coming from the front of the sequence and the other from the back.
- Output: The output of the bidirectional LSTM can be combined in several different ways:
- Concatenation: Concatenate the outputs of the forward and backward LSTMs at each time step to form a longer vector. This method retains the bidirectional contextual information for each position in the sequence.
- Summation: Sum the output vectors of the forward and backward LSTMs at each time step. This method merges the bidirectional information but may lose some details.
- Averaging: Average the output vectors of the forward and backward LSTMs at each time step. This method also merges the bidirectional information but may reduce the model's sensitivity to certain directional information.
- Separate Use: In some cases, the outputs of the forward and backward LSTMs may be used separately, especially when different parts of the model require information from different directions.
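A tiny numpy sketch of the combination options listed above, assuming we already have the per-timestep hidden states of the forward and backward LSTMs; shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, h = 5, 4                                  # sequence length, hidden size per direction
forward_states = rng.normal(size=(T, h))     # forward LSTM hidden states
backward_states = rng.normal(size=(T, h))    # backward LSTM hidden states

concatenated = np.concatenate([forward_states, backward_states], axis=-1)  # (T, 2h)
summed = forward_states + backward_states                                  # (T, h)
averaged = (forward_states + backward_states) / 2                          # (T, h)

print(concatenated.shape, summed.shape, averaged.shape)
```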
- Output Layer ($T_1, T_2, \dots, T_N$):
These are the representations of the words after processing by the LSTMs. Each word's representation is a weighted sum of its outputs from the different LSTM layers.
Principle:
- Contextualizing Word Embeddings: Traditional word embeddings (like Word2Vec or GloVe) are static and do not consider the context of words. ELMo generates contextualized word embeddings, meaning the same word can have different representations in different contexts.
- Capturing Long-Distance Dependencies: LSTMs are particularly suited for handling sequential data and can capture long-distance dependencies between words. This is crucial for understanding complex structures in language (like syntax and semantics).
- Bidirectional Information Flow: By considering the contextual information of words from both directions, ELMo can understand the meaning of words more comprehensively. This is important for handling ambiguous words and understanding context.
Evaluating Word Vectors#
- Extrinsic Evaluation
- Let’s embed these word vectors into real NLP systems and see if they can improve performance; this may take a long time, but it remains the most important evaluation metric.
- Intrinsic Evaluation
- Evaluate specific/intermediate sub-tasks
- Quick computation
- It is unclear whether this actually helps downstream tasks.
Vocabulary Assumption: Assume there exists a fixed vocabulary built from the training set containing tens of thousands of words. All new words encountered during testing will be mapped to a single "UNK" (unknown word).
Vocabulary Mapping Example:
Common Words: words such as "hat" and "learn" are mapped to their own indices in the vocabulary.
Variants, Misspellings, New Terms: "taaaaaasty" (variant), "laern" (misspelling), and "Transformerify" (novel word) are all mapped to the single "UNK" index.
Limitations of the Finite Vocabulary Assumption: In many languages, a finite vocabulary makes even less sense, because rich morphology or word structure produces a very large number of word types, each of which occurs only a few times.
Language Models#
Narrow Sense#
A probabilistic model that assigns a probability to every finite sequence (grammatical or not).
GPT-3 still acts in this way, but the model is implemented as a very large neural network with 175 billion parameters!
Broad Sense#
The image details three main architectures of pre-trained language models: decoder-only models, encoder-only models, and encoder-decoder models, along with their typical applications.
- Decoder-only Models:
Representative Models: GPT-x models (like GPT-2, GPT-3). These models are primarily used for generation tasks, such as text generation and question answering. They typically generate text autoregressively, from left to right.
- Encoder-only Models:
Representative Models: BERT, RoBERTa, ELECTRA. These models process input text through an encoder to generate representations of the text but do not perform text generation. They are mainly used for understanding tasks, such as text classification and named entity recognition. The BERT model uses masked language modeling (Masked LM) and next sentence prediction (NSP) as pre-training objectives to learn contextual representations of words.
- Encoder-Decoder Models:
Representative Models: T5, BART. These models combine encoders and decoders, capable of handling both generation and understanding tasks. The encoder generates text representations, and the decoder generates output text based on these representations. This structure allows the model to handle tasks like translation and summarization.
Explanation of Examples in the Image:
- BERT:
BERT uses masked language modeling (Masked LM) and next sentence prediction (NSP) as pre-training objectives. The image shows how BERT processes two masked sentences (Masked Sentence A and Masked Sentence B) and an unlabeled sentence pair (Unlabeled Sentence A and B Pair).
- T5:
T5 is an encoder-decoder model that uses a different pre-training objective. The image shows T5's applications in different tasks, including translation (translating English to German), summarization (summarizing text), and text evaluation (judging the acceptability of text).
- Principle:
Masked Language Modeling (Mask LM): In BERT, some words in the input text are randomly replaced with a special [MASK] token, and the model needs to predict these masked words. This method allows the model to learn contextual representations of words.
Next Sentence Prediction (NSP): BERT also uses the NSP task to learn relationships between sentences. The model needs to predict whether two input sentences are continuous text.
Encoder-Decoder Structure: In T5 and BART, the encoder first processes the input text to generate representations. Then, the decoder generates output text based on these representations. This structure allows the model to handle both generation and understanding tasks.
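A rough sketch of how masked-LM training inputs can be constructed; BERT masks about 15% of tokens (and further splits the masked positions 80/10/10 between [MASK], a random token, and the original token, a detail omitted here for brevity), and the function name and toy sentence are illustrative:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Return (masked_tokens, labels): labels keep the original token at masked
    positions and None elsewhere, so the model is only scored on the masked slots."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

random.seed(42)
print(mask_tokens("the students opened their books".split()))
```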
Building Neural Language Models#
- (Approximately) the first neural language model: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model
- Improvements of Fixed-Window Neural Language Models over n-gram models:
- No Sparsity Issues:
Neural language models do not suffer from sparsity issues because they do not need to count each specific n-gram; instead they predict the next word from learned word vectors and context.
- No Need to Store All Observed n-grams:
Neural models do not need to store all n-grams observed in the corpus along with their counts, which reduces storage requirements.
- Remaining Problems:
- Fixed Window Too Small:
The size of the fixed window limits the range of context the model can consider.
- Enlarging the Window Increases the Number of Parameters:
If one tries to enlarge the window to include more contextual information, the number of parameters in the model (the size of the weight matrix W) also grows, which may lead to overfitting and increased computational cost.
- The Window Is Never Big Enough:
No matter how large the window is, there will always be some long-distance dependencies that cannot be captured.
- Input Processing Lacks Symmetry:
In fixed-window models, words at different positions in the window are multiplied by different parts of the weight matrix, so positions are not processed symmetrically.
- Solution:
Recurrent Neural Networks (RNNs):
We need a neural network architecture that can handle inputs of arbitrary length. RNNs are one solution because they process sequential data through recurrent connections, regardless of the sequence length.
More On Word Vectors#
This image illustrates the workflow of MorphTE (a method for injecting morphology into tensor embeddings), derived from the paper "MorphTE: Injecting Morphology in Tensorized Embeddings" (NeurIPS 2022), authored by Guobing Gan, Peng Zhang, and others. Below is a detailed explanation of the content in the image:
- Left Side: Vocabulary:
Displays a vocabulary containing multiple words, such as "kindness," "unkindly," and "unfeelingly." These words serve as inputs for the subsequent processing steps.
- Middle Left: Morpheme Segmentation:
Each word in the vocabulary is segmented into morphemes. For example, "kindness" is segmented into "kind" and "ness," "unkindly" into "un," "kind," and "ly," and "unfeelingly" into "un," "feel," "ing," and "ly." The segmented morphemes are arranged in a matrix with |V| rows (the vocabulary size) and n columns (the number of morphemes per word).
- Middle Right: Indexing:
The segmented morphemes are indexed, mapping each morpheme to a unique identifier. The indices are used for the subsequent embedding operations.
- Right Side: Morpheme Embedding Matrices:
Two morpheme embedding matrices process the left and right parts of the morphemes, respectively. These matrices convert morpheme indices into low-dimensional vector representations.
- Far Right: Word Embedding Matrix:
The results of the morpheme embedding matrices are combined (shown in the image as an addition-like operation) to produce the final word embedding vectors, which encode both the semantic and the morphological information of the words.
The symbols and parameters in the image are explained as follows:
- n: the number of morphemes in a word (the morpheme order).
- The dimensionality of the morpheme vectors.
- |V|: the size of the word vocabulary.
- The size of the morpheme vocabulary.
Overall, this image provides a detailed illustration of how MorphTE converts words into vector representations that include morphological information through morpheme segmentation, indexing, and embedding operations.
Training Word Vectors#
How to train?
Practice#
c
Compute Gradients for Word2vec#
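The gradient computation itself is shown on slides that are not reproduced in these notes. As a reference, the standard derivation for the softmax Skip-gram loss, in the same v (center) / u (context) notation used above, is sketched here:

```latex
% Loss for one (center, context) pair (t, c):
L(t, c) = -\log P(c \mid t)
        = -u_c^{\top} v_t + \log \sum_{w \in V} \exp\big(u_w^{\top} v_t\big)

% Gradient w.r.t. the center vector v_t:
\frac{\partial L}{\partial v_t}
  = -u_c + \sum_{w \in V} \frac{\exp(u_w^{\top} v_t)}{\sum_{k \in V} \exp(u_k^{\top} v_t)}\, u_w
  = -u_c + \sum_{w \in V} P(w \mid t)\, u_w

% Gradient w.r.t. a context vector u_w (with y_w = 1 if w = c, else 0):
\frac{\partial L}{\partial u_w} = \big(P(w \mid t) - y_w\big)\, v_t
```

These gradients correspond to the update rules listed in the Overall Algorithm section below.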
Overall Algorithm#
This image illustrates an overall algorithm primarily used for tasks related to word embeddings, with the following detailed explanation:
Input Section#
- Text Corpus: The text data source that the algorithm processes.
- Embedding Size d: The size of the embedding dimension, which determines the dimensionality of the final representation vector for each word.
- Vocabulary V: The vocabulary containing all possible words.
- Context Size m: The size of the context window, used to define the range of context considered in the text.
Initialization Section#
For each word i in the vocabulary V, randomly initialize two vectors $u_i$ and $v_i$.
Training Section#
Iterate through the training corpus; for each training instance (t, c) (where t is the target/center word and c is a context word):
- Update the target word vector $v_t$:
- The formula is $v_t \leftarrow v_t - \eta\, \frac{\partial L}{\partial v_t}$, where $\frac{\partial L}{\partial v_t} = -u_c + \sum_{w \in V} P(w \mid t)\, u_w$. Here, $\eta$ is the learning rate, controlling the step size of each update.
- Update the context word vectors $u_w$:
- For each word w in the vocabulary V, the formula is $u_w \leftarrow u_w - \eta\, (P(w \mid t) - y_w)\, v_t$.
- When $w = c$ (i.e., w is the current context word), $y_w = 1$; when $w \neq c$, $y_w = 0$. $P(w \mid t)$ represents the probability of word w appearing given the target word t, computed with the softmax.
The right side also shows an example of converting training data into a specific format, such as (into, problems), reflecting the combination of target and context words.
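Putting the pieces together, here is a compact numpy sketch of the overall algorithm with the full-softmax updates above; the toy corpus, learning rate, and epoch count are arbitrary assumptions, and real implementations use negative sampling instead of the full softmax for speed:

```python
import numpy as np

corpus = "problems turning into banking crises as usual".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, d, m, eta = len(vocab), 10, 2, 0.05          # vocab size, dim, window, learning rate

rng = np.random.default_rng(0)
v = rng.normal(scale=0.1, size=(V, d))          # center-word vectors
u = rng.normal(scale=0.1, size=(V, d))          # context-word vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(50):
    for i, center in enumerate(corpus):
        t = word2id[center]
        for j in range(max(0, i - m), min(len(corpus), i + m + 1)):
            if j == i:
                continue
            c = word2id[corpus[j]]
            p = softmax(u @ v[t])               # P(. | t) over the vocabulary
            y = np.zeros(V); y[c] = 1.0
            grad_v = -u[c] + p @ u              # dL/dv_t = -u_c + sum_w P(w|t) u_w
            grad_u = np.outer(p - y, v[t])      # dL/du_w = (P(w|t) - y_w) v_t
            v[t] -= eta * grad_v
            u -= eta * grad_u

print(softmax(u @ v[word2id["into"]]))          # learned distribution P(. | "into")
```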