ML Glossary

Synopsis

Terminology

Sequence Models

  • Model where order of input elements matters.
  • E.g.: RNN, LSTM, GRU, Transformers, CNN (1D/temporal convolutions), HMM

RNN - Recurrent Neural Network

  • Processes sequential data using a hidden state
  • Struggles with long-term dependencies due to the vanishing-gradient problem
  • RNN-based models: RNN, LSTM, GRU, Bidirectional RNN, RNN-T (RNN Transducer, used in speech recognition)
  • Uses a dynamic hidden state that is updated at every time step
  • h_t = tanh(W_xh · x_t + W_hh · h_(t-1) + b) (see the sketch below)
  • Vanilla RNNs have been in use since the 1990s.
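  • A minimal NumPy sketch of the recurrence above (weight names and sizes are illustrative, not tied to any library):

      import numpy as np

      def rnn_step(x_t, h_prev, W_xh, W_hh, b):
          # One vanilla-RNN step: h_t = tanh(W_xh·x_t + W_hh·h_(t-1) + b)
          return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

      rng = np.random.default_rng(0)
      W_xh = rng.normal(size=(4, 3))          # input size 3 -> hidden size 4
      W_hh = rng.normal(size=(4, 4))
      b = np.zeros(4)

      h = np.zeros(4)                         # initial hidden state
      for x_t in rng.normal(size=(5, 3)):     # a sequence of 5 input vectors
          h = rnn_step(x_t, h, W_xh, W_hh, b)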

LSTM - Long Short-Term Memory

  • Captures long-term dependencies much better than a vanilla RNN
  • Uses an input gate, forget gate, and output gate to control a separate cell state (see the sketch below)
  • Introduced in 1997; became very popular after Google used LSTM-based models for translation around 2014.
  • Around 2014, attention was also combined with RNN/LSTM models.
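  • An illustrative NumPy sketch of one LSTM step with the three gates and the cell state (the per-gate weight dictionaries are a teaching device; real implementations fuse them into larger matrices):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_step(x_t, h_prev, c_prev, W, U, b):
          # Per-gate weights: 'i' input, 'f' forget, 'o' output, 'g' candidate
          i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
          f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
          o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
          g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate values
          c = f * c_prev + i * g      # new cell state (long-term memory)
          h = o * np.tanh(c)          # new hidden state (short-term output)
          return h, c

      n_in, n_h = 3, 4
      rng = np.random.default_rng(0)
      W = {k: rng.normal(size=(n_h, n_in)) for k in "ifog"}
      U = {k: rng.normal(size=(n_h, n_h)) for k in "ifog"}
      b = {k: np.zeros(n_h) for k in "ifog"}
      h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, U, b)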

Gated Recurrent Units (GRUs)

  • GRUs are a simplified version of LSTMs with fewer parameters.
  • Use an "update gate" and a "reset gate" instead of the three LSTM gates and separate cell state

Transformers

  • Revolutionary sequence model built around self-attention (see the sketch below).
  • Handles sequential data but processes all positions in parallel (no recurrence).
  • Uses positional embeddings to encode token order.
  • Long contexts are handled with techniques such as sliding-window attention, summarizing after chunks, and sparse attention (attending only to selected tokens).
  • Stateless: any "carry-forward state" across chunks must be passed along explicitly.
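  • A minimal NumPy sketch of single-head scaled dot-product self-attention, the core Transformer operation (projection matrices and sizes are made up for illustration):

      import numpy as np

      def softmax(z, axis=-1):
          z = z - z.max(axis=axis, keepdims=True)    # subtract max for stability
          e = np.exp(z)
          return e / e.sum(axis=axis, keepdims=True)

      def self_attention(X, W_q, W_k, W_v):
          # X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)
          Q, K, V = X @ W_q, X @ W_k, X @ W_v
          scores = Q @ K.T / np.sqrt(K.shape[-1])    # (seq_len, seq_len)
          return softmax(scores) @ V                 # weighted sum of value vectors

      rng = np.random.default_rng(0)
      X = rng.normal(size=(6, 8))                    # 6 tokens, d_model = 8
      W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
      out = self_attention(X, W_q, W_k, W_v)         # shape (6, 8)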

Hidden Markov Models (HMMs)

  • The designer must define the hidden (internal) states and observable symbols using domain knowledge.
  • Transition and emission probabilities are learned from labeled data (see the sketch below)
  • Highly interpretable.
  • Used in speech recognition, bioinformatics, etc.
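  • A small NumPy sketch of an HMM plus the forward algorithm for scoring an observation sequence; the states, probabilities, and observations below are invented purely for illustration:

      import numpy as np

      # Hidden states: 0 = "Rainy", 1 = "Sunny"; observations: 0 = "walk", 1 = "shop"
      start = np.array([0.6, 0.4])       # initial state probabilities
      trans = np.array([[0.7, 0.3],      # P(next state | current state)
                        [0.4, 0.6]])
      emit = np.array([[0.1, 0.9],       # P(observation | state)
                       [0.6, 0.4]])

      def forward(obs):
          # Forward algorithm: likelihood of the observation sequence under the HMM
          alpha = start * emit[:, obs[0]]
          for o in obs[1:]:
              alpha = (alpha @ trans) * emit[:, o]
          return alpha.sum()

      print(forward([0, 1, 1]))          # P(walk, shop, shop)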

ML Parameters vs Hyper-Parameters

  • Parameters: Change as part of training (e.g. Weights)

  • Hyperparameters: Manually configured settings (see the toy example after this list), such as:

    • Learning rate, Batch Size, Total iterations, Epochs
    • Number of Layers
    • Optimizer Algorithms
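  • A toy gradient-descent loop showing the distinction: learning_rate and epochs are hyperparameters chosen by hand, while w and b are parameters updated by training (all values are arbitrary):

      import numpy as np

      learning_rate = 0.1    # hyperparameter: set manually before training
      epochs = 200           # hyperparameter

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=100)
      y = 2 * x + 1 + 0.01 * rng.normal(size=100)   # data from y = 2x + 1 + noise

      w, b = 0.0, 0.0        # parameters: learned during training
      for _ in range(epochs):
          y_hat = w * x + b
          grad_w = 2 * np.mean((y_hat - y) * x)     # d(MSE)/dw
          grad_b = 2 * np.mean(y_hat - y)           # d(MSE)/db
          w -= learning_rate * grad_w
          b -= learning_rate * grad_b

      print(w, b)            # should approach 2 and 1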

Hypothesis

  • Assumption about the relationship between inputs and outputs, such as:

    • Linear Regression: the output is a linear function of the inputs
    • Logistic Regression: the response can be expressed as a sigmoid function of the inputs
    • SVM: the output classes can be separated by a hyperplane in the input space
  • Refined during training.

ReLU (Rectified Linear Unit):

  • x > 0 ? x : 0
  • Fast to compute and very popular.
  • Different activation functions perform better for different models.

Leaky ReLU

  • x > 0 ? x : α·x (α small and positive, commonly 0.01)
  • Addresses the "dying ReLU" problem of inactive neurons when x < 0

ELU (Exponential Linear Unit)

  • x > 0 ? x : α (e^x - 1) (commonly α = 1)
  • For negative inputs the output converges exponentially to -α

Tanh (Hyperbolic Tangent)

  • tanh(x) = sinh(x) / cosh(x) = (e^x - e^-x) / (e^x + e^-x)
  • Output maps from -1 to 1

Sigmoid

  • sigma(x) = 1 / (1+e^-x) = e^x / (1+e^x)
  • Maps to (0, 1)
  • sigma(0) = 0.5
  • sigma(-5) = 0.0067
  • sigma(5) = 0.9933
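  • For reference, the activation functions above (ReLU, Leaky ReLU, ELU, tanh, sigmoid) written as plain NumPy functions:

      import numpy as np

      def relu(x):
          return np.maximum(0.0, x)

      def leaky_relu(x, alpha=0.01):
          # Small slope alpha for x < 0 keeps neurons from "dying"
          return np.where(x > 0, x, alpha * x)

      def elu(x, alpha=1.0):
          # Converges to -alpha for large negative x
          return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

      def tanh(x):
          return np.tanh(x)

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      print(sigmoid(0.0), sigmoid(5.0))   # 0.5, ~0.9933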

GLU (Gated Linear Unit)

  • GLU(x) = σ(xW + b) ⊙ (xV + c)
  • It is a small learned network of its own (unlike a fixed activation such as ReLU)
  • Acts more like a hidden layer with two weight matrices.
  • x = input; W, V = weight matrices; b, c = biases
  • The sigmoid σ(xW + b) acts as a gate on the linear projection xV + c.
  • Variants of GLU exist where the sigmoid gate is replaced by Swish (SwiGLU), GELU (GeGLU), or ReLU (ReGLU)
  • SwiGLU has been found to be the most effective in many modern LLMs.

Swish/SiLU (Sigmoid Linear Unit)

  • Swish(x) = x · sigmoid(β·x) (commonly β = 1)
  • Response range: approximately (-0.28, ∞)
  • For x < 0, the output dips to about -0.28 (near x ≈ -1.28) and then converges to 0 as x → -∞
  • Smooth but non-monotonic: the gradient changes sign around the dip, which optimizers must handle.

GELU (Gaussian Error Linear Unit)

  • The shape is similar to Swish.
  • Can be approximated as x · sigmoid(1.702·x) (see the sketch below)
  • Used in many popular models such as GPT-3, BERT, RoBERTa, etc.
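  • A short NumPy sketch of Swish and the sigmoid-based GELU approximation mentioned above (the exact GELU uses the Gaussian CDF instead):

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def swish(x, beta=1.0):
          # Swish/SiLU: x * sigmoid(beta * x)
          return x * sigmoid(beta * x)

      def gelu_approx(x):
          # Sigmoid approximation of GELU: x * sigmoid(1.702 * x)
          return x * sigmoid(1.702 * x)

      x = np.array([-3.0, -1.28, 0.0, 2.0])
      print(swish(x))          # dips to about -0.28 near x = -1.28
      print(gelu_approx(x))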

SwiGLU (Variant of GLU - Gated Linear Unit)

  • SwiGLU(x) = Swish(xW) ⊙ (xV) (W, V: learned weight matrices; ⊙: element-wise product); see the sketch below
  • Used by Llama, DeepSeek, and other modern LLMs.
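  • An illustrative NumPy sketch of GLU and its SwiGLU variant (weight shapes are hypothetical; biases are often dropped in SwiGLU, as done here):

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def glu(x, W, b, V, c):
          # GLU(x) = sigmoid(xW + b) ⊙ (xV + c)
          return sigmoid(x @ W + b) * (x @ V + c)

      def swiglu(x, W, V):
          # SwiGLU(x) = Swish(xW) ⊙ (xV), with Swish(z) = z * sigmoid(z)
          z = x @ W
          return (z * sigmoid(z)) * (x @ V)

      rng = np.random.default_rng(0)
      x = rng.normal(size=(2, 8))        # 2 tokens, d_model = 8
      W, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
      b, c = np.zeros(16), np.zeros(16)
      print(glu(x, W, b, V, c).shape, swiglu(x, W, V).shape)   # (2, 16) (2, 16)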

Softmax

  • Maps a vector of real values to a probability distribution
  • (z_1, z_2, ..., z_k) --> e^(z_i) / Σ_j e^(z_j), for i = 1 to k
  • i.e. softmax applies exp function to each element and normalizes it by dividing by sum.
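  • A numerically stable softmax in NumPy (subtracting the maximum does not change the result but avoids overflow):

      import numpy as np

      def softmax(z):
          z = z - np.max(z)     # shift by the max for numerical stability
          e = np.exp(z)
          return e / e.sum()

      print(softmax(np.array([1.0, 2.0, 3.0])))   # elements sum to 1.0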

Seq2seq

  • Refers to any model that transforms one sequence into another
  • Applications: translation, conversational agents, summarization
  • The source sequence can be viewed as an encoded version of the destination sequence.
  • Encoder-decoder RNNs and Transformer models are seq2seq models.

FFN - Feed-Forward Network aka MLP

  • aka MLP - MultiLayer Perceptron
  • Used in Transformer models after the self-attention sub-layer in each block
  • One FFN block = Linear Layer 1 -> GLU-style activation -> Linear Layer 2 (see the sketch below)
  • Linear Layer 1 expands the input dimension d_model (e.g. 1024) to d_ff (e.g. 4096)
  • Linear Layer 2 projects d_ff back down to d_model.
  • A GLU-style activation behaves like a layer and itself uses two weight matrices.
  • Unlike an RNN, it has no memory: each position is processed independently
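  • A NumPy sketch of one Transformer FFN block, shown here with a GELU-style activation for simplicity (a GLU-style activation would use two weight matrices in place of W1, as described above); all weights are random placeholders:

      import numpy as np

      def gelu_approx(x):
          # GELU via the sigmoid approximation: x * sigmoid(1.702 * x)
          return x / (1.0 + np.exp(-1.702 * x))

      def ffn_block(x, W1, b1, W2, b2):
          # Linear layer 1 expands d_model -> d_ff; layer 2 projects back to d_model
          return gelu_approx(x @ W1 + b1) @ W2 + b2

      d_model, d_ff = 1024, 4096
      rng = np.random.default_rng(0)
      W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
      W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

      x = rng.normal(size=(4, d_model))        # 4 token positions
      out = ffn_block(x, W1, b1, W2, b2)       # (4, 1024); each position independent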

BERT - Bidirectional Encoder Representations from Transformers

  • 2018: Early, highly influential Transformer-based language model from Google. Open weights.
  • Revolutionised NLU (Natural Language Understanding)
  • Bidirectional: both the left and right context of each word is used
  • Encoder Only Model.
  • Used MLM (Masked Language Modeling) as the pretraining objective (see the example below)
  • Ideal for text classification, sentiment analysis, NER (Named Entity Recognition), and also for Q&A.
  • Not a good fit for text-to-text tasks such as translation (use T5 instead)
  • Easier to train for NLU tasks.
  • Parameters: BERT-Base (110M), Large (340M)
  • Trained on Book Corpus (800M words), English Wikipedia (2.5B words)
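  • If the Hugging Face transformers library is installed, a masked-language-modeling query against a pretrained BERT checkpoint looks roughly like this (the sentence is arbitrary and the exact output format depends on the library version):

      from transformers import pipeline

      # Downloads the pretrained bert-base-uncased checkpoint on first use
      fill_mask = pipeline("fill-mask", model="bert-base-uncased")

      # BERT predicts the [MASK] token from both left and right context
      for candidate in fill_mask("The capital of France is [MASK]."):
          print(candidate["token_str"], candidate["score"])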

BERT Variants

  • DistilBERT (2019 HuggingFace 66M)
  • TinyBERT (2019 Huawei 15-50M)
  • ALBERT ("A Lite BERT") (2019 Google 12M)
  • RoBERTa (Robustly Optimized BERT) (2019 Facebook) Trained on roughly 70B words, vs 3.3B words for the original BERT.
  • MobileBERT (2020 Google 25M)
  • ELECTRA (2020 Stanford University 14M-110M) Great Performance
  • DeBERTa (2020 Microsoft 50-100M)

GPT-2: Generative Pretrained Transformer 2 - 2019

  • 2019: GPT-2 released by OpenAI, a year after BERT
  • Large-scale unsupervised pre-training using Transformers.
  • Could generate coherent long passages.
  • Better than BERT for generative tasks.

GPT-3 : 2020

  • Powerful 175B-parameter generative Transformer model by OpenAI. Released in 2020.

BART: Bidirectional and Auto-Regressive Transformer (Facebook 2019)

  • Combines the strengths of bidirectional encoding (like BERT) and auto-regressive decoding (like GPT)

  • Encoder-Decoder Transformer Architecture. Open Weight.

  • BART deliberately applies noise to its input (masking, deletion, etc.) during pre-training to make it more robust at inference time. This denoising objective is the main variation from the standard Transformer setup.

  • BART Large (406M)

  • Newer models such as T5, mBART, and GPT-4 generally outperform BART

T5: Text-to-Text Transfer Transformer (Google 2019)

  • Encoder-Decoder Transformer Model. Open Weight.
  • Used in Translation, Summarization, etc
  • Sizes (2019): small, base, large, 3B (XL), 11B (XXL)
  • Google Imagen uses T5-XXL as text encoder

NLP - Natural Language Processing

SOTA - State of the Art

Auto-encoder

  • An autoencoder is a type of NN that learns to compress (encode) and then reconstruct (decode) data.

  • "Auto" refers to self-reconstruction: both the encoder and decoder are trained without labels (unsupervised/self-supervised).

  • Several variations exist:

    • Denoising autoencoders, e.g. BART
    • Variational autoencoder: adds a probabilistic constraint on the latent space for generative modeling.
    • Seq2seq autoencoder
    • Convolutional autoencoder: uses convolutional layers for image data.
  • Not all encoder-decoder models are autoencoders; e.g. a machine-translation model does not aim to reconstruct its input, it translates it (see the sketch below)
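  • A minimal NumPy sketch of the encode/reconstruct idea (a single linear encoder and decoder with made-up sizes and no training loop; a real autoencoder would learn these weights by minimizing the reconstruction error):

      import numpy as np

      rng = np.random.default_rng(0)
      d_in, d_latent = 32, 8                    # compress 32 dimensions down to 8

      W_enc = rng.normal(size=(d_in, d_latent)) * 0.1
      W_dec = rng.normal(size=(d_latent, d_in)) * 0.1

      def encode(x):
          return np.tanh(x @ W_enc)             # latent code (compressed representation)

      def decode(z):
          return z @ W_dec                      # reconstruction of the input

      x = rng.normal(size=(5, d_in))            # 5 samples
      x_hat = decode(encode(x))
      reconstruction_error = np.mean((x - x_hat) ** 2)   # training would minimize this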

ViTs - Vision Transformers

  • Introduced in 2020
  • Rival and in many cases surpass CNNs on vision tasks

MultiModal Models

  • Image and Text MultiModal models were released in 2021
  • E.g. CLIP (Contrastive Language-Image Pre-training), DALL-E (OpenAI)
  • Imagen, Parti (Google)
  • Stable Diffusion (Stability AI) (moved to a Transformer-based diffusion backbone with Stable Diffusion 3, announced Feb 2024)
  • GauGAN, NVIDIA Canvas by NVIDIA
  • These models combine Transformers with vision models for text-to-image generation

Attention Mechanism

  • First applied in RNN, then later with Transformers
  • 2014 : Bahdanau et al. introduced the attention mechanism.
  • Allowed the RNN decoder to attend to the whole source input sequence, instead of relying only on a single hidden state.
  • 2015 : Luong et al. refined attention mechanisms with:
    • global (soft) attention and
    • local (hard) attention
  • Self-attention is the heart of the Transformer model introduced in 2017

ONNX Runtime

  • Cross-platform machine-learning model accelerator
  • Can integrate hardware-specific libraries.
  • ONNX Runtime can be used with models from PyTorch, Tensorflow/Keras, TFLite, scikit-learn, and other frameworks.
  • e.g. you can export a BERT-derived model to ONNX and run it CPU-optimized in a desktop application at inference time (see the sketch below)
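  • Assuming a model has already been exported to ONNX (the file name, input shape, and dummy data below are hypothetical), CPU inference with the onnxruntime Python package looks roughly like this:

      import numpy as np
      import onnxruntime as ort

      # Load the exported model; restrict execution to the CPU provider
      session = ort.InferenceSession("bert_classifier.onnx",
                                     providers=["CPUExecutionProvider"])

      # Input/output names come from the exported graph
      input_name = session.get_inputs()[0].name
      dummy_input = np.random.randint(0, 30522, size=(1, 128), dtype=np.int64)

      outputs = session.run(None, {input_name: dummy_input})
      print(outputs[0].shape)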