Machine Learning FAQ

Additional Pointers

Word Embeddings: GPT vs Word2Vec vs GloVe

Do GPT (and other LLM) base models produce word embeddings similar to Word2Vec/GloVe ?

  • Unlike Word2Vec, GPT embeddings are context-specific: the same word gets a different vector depending on the surrounding text.
  • GPT uses subword token embeddings (e.g., "tokenize" may be split into "token" + "ize").
  • GPT's embeddings are specific to each layer: earlier layers capture more local, syntactic context, while later layers capture more global, semantic context.
  • GPT embeddings can be extracted from the model and used as features for standard NLP tasks such as classification and sentiment analysis (see the sketch after this list).
  • GPT's vocabulary (roughly 50K subword tokens for GPT-2/GPT-3) is much smaller than Word2Vec's (up to 3 million words/phrases).
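
A minimal sketch of extracting contextual token embeddings with the Hugging Face transformers library; the gpt2 checkpoint and the mean-pooling step are illustrative choices, not the only way to do it:

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last-layer hidden states: one contextual vector per subword token.
token_embeddings = outputs.last_hidden_state        # shape (1, num_tokens, 768)

# A common trick for a sentence-level feature: mean-pool over tokens.
sentence_embedding = token_embeddings.mean(dim=1)   # shape (1, 768)
print(token_embeddings.shape, sentence_embedding.shape)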

Here are the popular embeddings available:

Embedding  Source        Corpus Size   Dim  Vocabulary  Description

Word2Vec   Google News   100 B words   300  3 M         3 M words/phrases
Word2Vec   Stanford      1.6 B words   300  1.6 M       Common Crawl dataset
Word2Vec   Wikipedia     4.3 B words   300  1.1 M

GloVe      Stanford      840 B tokens  300  2.2 M       Common Crawl dataset; also 100/200 dimensions
GloVe      Wikipedia     4.3 B tokens  300  0.4 M
GloVe      Twitter       2 B tweets    200  1.2 M
  • Each word is represented by, say, a 300-dimensional vector, i.e., an array of 300 float32 or float64 numbers. That is about 1.2 KB or 2.4 KB per word.

  • Embedding total size:

    Dimension  Vocabulary  Float32/64   Total (32)  Total (64)  Description
    300        1 M         4/8 bytes    1.2 GB      2.4 GB      300 x 4 bytes x 1 M = 1.2 GB
    300        2 M         4/8 bytes    2.4 GB      4.8 GB
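
The same arithmetic as a quick Python check:

dim = 300
vocab = 1_000_000
print(dim * vocab * 4 / 1e9)   # float32: 1.2 GB
print(dim * vocab * 8 / 1e9)   # float64: 2.4 GB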
    

Popular Applications of Word Embeddings:

  • Semantic Understanding for Search engines.
  • Text Classification, Sentiment Analysis, Spam Detection, Summarization.
  • Named Entity Recognition (NER): identify names, locations, organizations, and dates; a foundation that helps many other tasks.
  • Language modeling to predict the next word, and analogy reasoning (e.g., King : Queen :: Man : Woman); improves question answering. A small analogy example follows below.
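
A minimal gensim sketch of the analogy idea; it assumes the pretrained word2vec-google-news-300 vectors, which gensim downloads on first use (about 1.6 GB):

import gensim.downloader as api

# Pretrained Word2Vec vectors trained on the Google News corpus.
wv = api.load("word2vec-google-news-300")

# Analogy: king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Semantic similarity between two words.
print(wv.similarity("cat", "dog"))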

Popular Libraries and Frameworks using Word Embeddings (Word2Vec and GloVe):

  • Gensim: Python library for training Word2Vec models and document similarity analysis.
  • spaCy: modern Python library for NLP (example below).
  • TensorFlow includes Word2Vec/GloVe implementations.
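
A small spaCy sketch, assuming the en_core_web_md model (which ships with static word vectors) has been installed via python -m spacy download en_core_web_md:

import spacy

# en_core_web_md includes static word vectors, comparable in spirit to Word2Vec/GloVe.
nlp = spacy.load("en_core_web_md")

doc1 = nlp("The cat sat on the mat.")
doc2 = nlp("The dog sat on the mat.")

# Document similarity based on averaged word vectors.
print(doc1.similarity(doc2))

# Word-level similarity.
print(nlp("cat")[0].similarity(nlp("dog")[0]))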

MLM - Masked Language Modeling

  • Mask some input tokens and train the model to guess them.
  • As a pre-training objective for a base model, it learns language structure and semantics quickly.
  • It uses both the left (backward) and right (forward) context words during training.
  • Alternate training objectives are CLM (Causal Language Modeling), where the goal is to predict the next token (e.g. GPT), next sentence prediction, etc.
  • Both BERT and GPT are LLMs in the sense that they are trained on large corpora.
  • MLM-based models are easier to fine-tune for sentiment analysis, classification, etc. For example, it is easier to fine-tune a BERT-based model for sentiment analysis than to take a GPT model and train it for the same task (though that is possible). A small fill-mask example follows below.
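
A minimal sketch of MLM in action, using the Hugging Face transformers fill-mask pipeline with a pretrained BERT checkpoint:

from transformers import pipeline

# BERT was pre-trained with masked language modeling: it predicts the token
# behind [MASK] using context on both sides.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))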

What is Deep Learning, Activation Function

How does Word2Vec Work ?

  • Word2vec, introduced by Mikolov et al. at Google, creates vector representations (embeddings) of words by learning from their context in large text corpora.

  • The vocabulary is usually between 50,000 and 1 million words. The initial Google implementation used 3 million.

  • There are two main architectures in word2vec: the Skip-gram model and Continuous Bag of Words (CBOW).

  • Both models use a rolling window of N words and ignore the order of words within the window.

  • Core Idea: Words that appear in similar contexts have similar meanings, for example:

    The cat sat on the mat.
    The dog sat on the mat.
    
  • The words "cat" and "dog" share a similar context suggesting semantic similarity.

  • It uses a shallow neural network (a single hidden layer).

  • Word2Vec uses one of two key architectures, CBOW or Skip-gram, to learn word embeddings (a gensim sketch follows this list):

    • CBOW: predicts a target word based on its surrounding context.
    • Skip-gram: predicts the surrounding context words based on a target word.
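
A small gensim sketch showing how the two architectures are selected via the sg flag; the toy corpus is made up for illustration:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (made up for illustration).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "mat"],
]

# sg=0 -> CBOW (context words predict the target word)
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

# sg=1 -> Skip-gram (target word predicts the context words)
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                     # (100,)
print(skipgram.wv.most_similar("cat", topn=2))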

Continuous Bag of Words (CBOW) (Context -> Word)

Input: A fixed-size window of context words (e.g., ["The", "sat", "on", "the"]).
Output: The probability of a target word (e.g., "cat") given the context.
Objective: Maximize the probability of predicting the target word from the context.
           i.e. maximize P(target word | context words)

Question: In supervised learning, we label the expected output with 1 or 0, so it is easy to compute the error. In this iterative process, what is the error function? Answer: the one-hot vector of the true target word serves as the label, and the loss is the cross-entropy between the softmax output and that one-hot vector (with negative sampling, a logistic loss over the positive word and a few sampled negatives is used instead).

Skip-gram (Word -> Context):

Input: A single target word (e.g., "cat").
Output: The probabilities of the surrounding context words (e.g., "The", "sat", "on", "the").
Objective: Maximize the probability of predicting the context words given the target word.
           i.e. maximize P(context words | target word)

Model Components

.
.                                       Projection                             (CBOW)
.     Input Word (One-Hot Encoded) --- Hidden Layer -----  SoftMax Layer ---> Target Word 
.        1 x 10000                       1 x 300                              Context Words
.                                                                             (Skip-Gram)
.     Word Embedding = 300 Weights for Each Word
.
  • A softmax layer predicts probabilities over the vocabulary.
  • In CBOW, the output is the target word. In Skip-gram, the output is the context words.
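
A stripped-down numpy sketch of the forward pass in this diagram, using toy sizes; the real model also trains both weight matrices with gradient descent:

import numpy as np

V, D = 10000, 300                        # vocabulary size, embedding dimension
W_in = np.random.randn(V, D) * 0.01      # input -> hidden weights (the word embeddings)
W_out = np.random.randn(D, V) * 0.01     # hidden -> output weights

def forward(word_index):
    # Multiplying a one-hot vector by W_in is just a row lookup: the "projection".
    hidden = W_in[word_index]            # (300,)
    scores = hidden @ W_out              # (10000,)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()           # softmax over the vocabulary

probs = forward(42)
print(probs.shape, probs.sum())          # (10000,) 1.0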

Training Objective

  • Word2Vec uses a likelihood-based objective function.
  • For CBOW, it maximizes P(Target Word | Context Words)
  • For Skip-gram, it maximizes P(Context Words | Target Word)

To make computation efficient for large vocabularies:

Negative Sampling: Instead of updating weights for all words in the vocabulary, it updates only the true context word and a few negative (random) samples (see the sketch below).

Hierarchical Softmax: Uses a binary tree representation of the vocabulary to reduce the cost of computing the softmax.
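
A rough numpy sketch of the negative-sampling objective (one positive context word plus k random negatives), under the usual sigmoid-of-dot-product formulation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    # Positive pair: push sigmoid(target . context) toward 1.
    loss = -np.log(sigmoid(target_vec @ context_vec))
    # Negative samples: push sigmoid(target . negative) toward 0.
    for neg in negative_vecs:
        loss -= np.log(sigmoid(-(target_vec @ neg)))
    return loss

rng = np.random.default_rng(0)
d = 300
print(negative_sampling_loss(rng.normal(size=d) * 0.01,
                             rng.normal(size=d) * 0.01,
                             rng.normal(size=(5, d)) * 0.01))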

Output: Word Embeddings

Once trained, the dense vectors in the hidden layer (embedding matrix) serve as the word embeddings. Words with similar meanings are closer in this high-dimensional space.

Skip-gram model

  • Takes a word as input and predicts its context words (surrounding words):

    .
    .      Single-Word  ------->  N surrounding words each side.   
    .                                   5 < N < 10 (Per side)
    .
    

For example, given the word "cat", it might try to predict nearby words like "the", "sits", "on", "mat". Skip-gram works better for rare words and larger contexts.

Continuous Bag of Words (CBOW):

Takes context words as input and predicts the target word. For example, given "the", "sits", "on", "mat", it tries to predict "cat". CBOW trains faster and works better for frequent words.

The training process works like this:

First, words are converted to one-hot encoded vectors (a vector of zeros with a single 1 at the word's index). The neural network has three layers:

  • Input layer (one-hot encoded word or context)
  • Hidden layer (size determines embedding dimension, typically 100-300 neurons)
  • Output layer (predicts probability distribution over all words)

During training:

  • The network learns weights between the input and hidden layer (the embedding matrix), and between the hidden and output layer.
  • It uses negative sampling or hierarchical softmax to make training efficient.

After training:

  • The weights between the input and hidden layer become the word embeddings.
  • Each word's embedding captures semantic relationships based on context.
  • Similar words end up with similar vectors in the embedding space.

BERT

Best Approach For Sentiment Analysis

Who Produces the Attention mask ?

  • The attention weights are produced by the model itself: they are learned during pre-training and adjusted during fine-tuning.
  • The user does not mark any words as "important".
  • In transformer-based models, we have multi-head attention:
    • Each attention head produces, for every token, a weight vector whose length equals the number of tokens in the input sentence.
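
A brief sketch of inspecting the learned attention weights with transformers, assuming the bert-base-uncased checkpoint:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One tensor per layer, shaped (batch, num_heads, seq_len, seq_len):
# each row is a learned weight distribution over all tokens in the sentence.
print(len(outputs.attentions), outputs.attentions[0].shape)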

What is Entropy ?

  • It is a measure of uncertainty.
  • With sufficient information, outcomes are more predictable: low entropy.
  • With a lack of information, outcomes are less predictable: high entropy.
  • A high-entropy image has more random pixel values -- typically a picture that has not been smoothed or compressed.
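
A tiny sketch of Shannon entropy, H = -sum(p * log2(p)), over a probability distribution:

import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]                 # ignore zero-probability outcomes
    return -np.sum(probs * np.log2(probs))

print(entropy([0.25, 0.25, 0.25, 0.25]))     # 2.0 bits: maximally uncertain over 4 outcomes
print(entropy([0.97, 0.01, 0.01, 0.01]))     # ~0.24 bits: highly predictable, low entropy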

Hidden (Projection) Layer

  • In NLP, the hidden layer is also called the projection layer.
  • In the word2vec example, the one-hot encoded input vector is projected onto the hidden layer.

Linear Regression Example

import pandas as pd
import numpy as np
from sklearn import linear_model

file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
df = pd.read_csv(file)

features = ['Por', 'Brittle', 'Perm', 'TOC']
target = 'Prod'

# reshape is redundant here: df[features].values is already 2-D with shape (n_samples, 4).
X = df[features].values.reshape(-1, len(features))
y = df[target].values

# Fit ordinary least squares regression: Prod ~ Por + Brittle + Perm + TOC
ols = linear_model.LinearRegression()
model = ols.fit(X, y)

model.coef_
# array([244.60011793,  31.58801063,  86.87367291, 325.19354135])

model.intercept_
# -1616.4561900851832
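
As a quick follow-up (continuing from the fitted model above), predictions and the R^2 of the fit can be checked like this:

# Predict production for the first few samples and report the R^2 score.
print(model.predict(X[:3]))
print(model.score(X, y))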