Machine Learning FAQ
Do GPT (and other LLM) base models produce word embeddings similar to Word2Vec/GloVe?
Here are the popular embeddings available:
Embedding   Source        Total Words     Dim   Vocabulary   Description
Word2Vec    Google News   100 Billion     300   3 M          3 M words/phrases
Word2Vec    Stanford      1.6 Billion     300   1.6 M        Common Crawl dataset
Word2Vec    Wikipedia     4.3 Billion     300   1.1 M
GloVe       Stanford      840 B tokens    300   2.2 M        Common Crawl dataset; also in 100/200 dims
GloVe       Wikipedia     4.3 B tokens    300   0.4 M
GloVe       Twitter       2 B tweets      200   1.2 M
Each word is represented by a vector of, say, 300 dimensions, i.e. an array of 300 float32 or float64 numbers, so each word takes about 1.2 KB or 2.4 KB.
Embedding total size:
Dim   Vocabulary   Bytes per Float   Total (float32)   Total (float64)   Calculation
300   1 M          4 / 8             1.2 GB            2.4 GB            300 x 4 bytes x 1 M = 1.2 GB
300   2 M          4 / 8             2.4 GB            4.8 GB            300 x 4 bytes x 2 M = 2.4 GB
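A quick back-of-the-envelope check of these totals (a minimal sketch; the helper function name is just for illustration):

def embedding_size_gb(vocab_size, dims, bytes_per_float):
    # memory for a (vocab_size x dims) embedding matrix, in decimal GB as in the table
    return vocab_size * dims * bytes_per_float / 1e9

print(embedding_size_gb(1_000_000, 300, 4))  # 1.2  (float32)
print(embedding_size_gb(1_000_000, 300, 8))  # 2.4  (float64)
print(embedding_size_gb(2_000_000, 300, 4))  # 2.4
print(embedding_size_gb(2_000_000, 300, 8))  # 4.8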
Popular Applications of Word Embeddings:
Popular Libraries and Frameworks Using the Word2Vec and GloVe Word Embeddings:
Word2vec, introduced by Mikolov et al. at Google, creates vector representations (embeddings) of words by learning from their context in large text corpora.
Vocabulary sizes usually range from 50,000 to 1 million words; the original Google News model used 3 million.
There are two main architectures in word2vec: the Skip-gram model and the Continuous Bag of Words (CBOW) model.
Both models use a rolling window of N words around a position in the text, and neither takes the order of words within the window into account.
Core Idea: Words that appear in similar contexts have similar meanings, for example:
The cat sat on the mat.
The dog sat on the mat.
The words "cat" and "dog" share a similar context suggesting semantic similarity.
It uses shallow neural network.
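As a rough sketch of how this looks in practice, here is a minimal gensim example on the two toy sentences above (assumes gensim >= 4.0, where the size parameter is called vector_size; with such a tiny corpus the similarity value itself is not meaningful, it only shows the API shape):

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "mat"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=300, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)              # (300,) -- one dense vector per word
print(model.wv.similarity("cat", "dog"))  # cosine similarity between the two vectors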
Word2Vec learns word embeddings using one of two key architectures, Skip-Gram or CBOW:
- CBOW predicts a target word based on its surrounding context words.
- Skip-gram predicts the surrounding context words based on a target word.
CBOW:
Input: A fixed-size window of context words (e.g., ["The", "sat", "on", "the"]).
Output: The probability of a target word (e.g., "cat") given the context.
Objective: Maximize the probability of predicting the target word from the context,
i.e. maximize P(target word | context words).
Question: In supervised learning we label the expected output with 1 or 0, so it is easy to compute the error. In this iterative process, what is the error function? Answer: the one-hot vector of the actual target word serves as the label, and the error is the cross-entropy between that one-hot label and the softmax output over the vocabulary (see the sketch after the diagram below).
Skip-Gram:
Input: A single target word (e.g., "cat").
Output: The probabilities of the surrounding context words (e.g., "The", "sat", "on", "the").
Objective: Maximize the probability of predicting the context words given the target word,
i.e. maximize P(context words | target word).
.
.  Input Word (One-Hot Encoded) ---> Hidden Layer (Projection) ---> SoftMax Layer ---> Target Word    (CBOW)
.            1 x 10000                      1 x 300                                    Context Words  (Skip-Gram)
.
.  Word Embedding = the 300 hidden-layer weights learned for each word
.
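A minimal numpy sketch of the output-layer computation in the diagram (the weights are random and the sizes of a 10,000-word vocabulary and a 300-dimensional hidden layer are illustrative); it shows the cross-entropy loss mentioned in the answer above:

import numpy as np

vocab_size, hidden_dim = 10_000, 300
rng = np.random.default_rng(0)

hidden = rng.normal(size=(1, hidden_dim))           # 1 x 300 projection of the input word(s)
W_out = rng.normal(size=(hidden_dim, vocab_size))   # hidden -> output weights

logits = hidden @ W_out                             # 1 x 10,000 scores, one per vocabulary word
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()                             # softmax over the whole vocabulary

target_index = 42                                   # index of the true target word (one-hot label)
loss = -np.log(probs[0, target_index])              # cross-entropy against the one-hot label
print(loss)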
To make computation efficient for large vocabularies:
Negative Sampling: Instead of updating the weights for all words in the vocabulary, it updates them only for the true word and a few negative (randomly sampled) words.
Hierarchical Softmax: Uses a binary tree representation of the vocabulary to reduce the cost of computing the softmax.
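A rough sketch of the negative-sampling objective (all vectors are random placeholders; k, the word indices, and the scale are illustrative):

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 300, 5                 # k = number of negative samples

in_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))    # input-side (target) vectors
out_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))   # output-side (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target, context = 10, 20                            # a real co-occurring (target, context) pair
negatives = rng.integers(0, vocab_size, size=k)     # k randomly drawn "negative" words

pos = sigmoid(in_vectors[target] @ out_vectors[context])
neg = sigmoid(-in_vectors[target] @ out_vectors[negatives].T)

# Only k + 1 output vectors (not all 10,000) are involved in this update.
loss = -np.log(pos) - np.log(neg).sum()
print(loss)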
Output: Word Embeddings
Once trained, the dense vectors in the hidden layer (embedding matrix) serve as the word embeddings. Words with similar meanings are closer in this high-dimensional space.
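A sketch of looking up such pretrained vectors with gensim's downloader (assuming the "glove-wiki-gigaword-100" model name from the gensim-data catalog; the first call downloads a large file):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")       # KeyedVectors: ~400K words, 100 dims

print(vectors["cat"].shape)                         # (100,)
print(vectors.most_similar("cat", topn=3))          # nearest words in the embedding space
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))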
Skip-Gram:
Takes a word as input and predicts its context words (surrounding words):
.
.  Single Word -------> N surrounding words on each side   (typically 5 < N < 10 per side)
.
For example, given the word "cat" it might try to predict nearby words like "the", "sits", "on", "mat" Better for rare words and larger contexts
Continuous Bag of Words (CBOW):
Takes context words as input and predicts the target word. For example, given "the", "sits", "on", "mat", it tries to predict "cat". CBOW trains faster and works better for frequent words.
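A minimal sketch of the training pairs both architectures are built from, using a window of 2 words on each side of an illustrative sentence:

sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    print(f"CBOW:      {context} -> '{target}'")
    print(f"Skip-gram: '{target}' -> {context}")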
The training process works like this:
First, words are converted to one-hot encoded vectors (a vector of zeros with a single 1 at the word's index). The neural network has three layers:
- Input layer (the one-hot encoded word or context)
- Hidden layer (its size determines the embedding dimension, typically 100-300 neurons)
- Output layer (predicts a probability distribution over all words)
During training:
- The network learns the weights between the input and hidden layer (the embedding matrix) and between the hidden and output layer.
- It uses negative sampling or hierarchical softmax to make training efficient.
After training:
- The weights between the input and hidden layer become the word embeddings.
- Each word's embedding captures semantic relationships based on context.
- Similar words end up with similar vectors in the embedding space.
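Putting these steps together, here is a toy numpy sketch of a skip-gram training loop with a full softmax (the corpus, sizes, learning rate, and epoch count are illustrative; real implementations replace the full softmax with negative sampling or hierarchical softmax):

import numpy as np

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "mat"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 10, 2, 0.05

# Skip-gram training pairs: (target index, context index)
pairs = [(idx[s[i]], idx[s[j]])
         for s in corpus for i in range(len(s))
         for j in range(max(0, i - window), min(len(s), i + window + 1)) if j != i]

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))     # input -> hidden weights = the embedding matrix
W_out = rng.normal(scale=0.1, size=(D, V))    # hidden -> output weights

for _ in range(200):
    for t, c in pairs:
        h = W_in[t]                            # hidden layer = the row of W_in for the target word
        logits = h @ W_out
        e = np.exp(logits - logits.max())
        probs = e / e.sum()                    # softmax over the vocabulary
        grad = probs.copy()
        grad[c] -= 1.0                         # gradient of cross-entropy w.r.t. the logits
        grad_h = W_out @ grad                  # gradient w.r.t. the hidden vector
        W_out -= lr * np.outer(h, grad)
        W_in[t] -= lr * grad_h

# After training, the rows of W_in are the word embeddings.
cat, dog = W_in[idx["cat"]], W_in[idx["dog"]]
print(np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog)))  # cosine similarity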
# Example: fitting a multiple linear regression with scikit-learn.
import pandas as pd
import numpy as np
from sklearn import linear_model

# Sample dataset with four predictor columns and a production target column.
file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
df = pd.read_csv(file)

features = ['Por', 'Brittle', 'Perm', 'TOC']
target = 'Prod'

# df[features].values is already a 2-D array of shape (n_samples, 4),
# so this reshape is redundant and kept only for clarity.
X = df[features].values.reshape(-1, len(features))
y = df[target].values

ols = linear_model.LinearRegression()
model = ols.fit(X, y)

model.coef_
# array([244.60011793, 31.58801063, 86.87367291, 325.19354135])
model.intercept_
# -1616.4561900851832