ML Glossary

Synopsis

Terminology

Sequence Models

  • Model where order of input elements matters.
  • E.g.: RNN, LSTM, GRU, Transformers, CNN (1D/temporal convolutions), HMM

RNN - Recurrent Neural Network

  • Processes sequential data using a hidden state
  • Struggles with long-term dependencies due to the vanishing-gradient problem
  • RNN-based models: RNN, LSTM, GRU, Bidirectional RNN, RNN-T (RNN Transducer, used in speech recognition)
  • Uses a dynamic hidden state that is updated at every time step
  • h_t = tanh(W_xh · x_t + W_hh · h_(t-1) + b) (see the sketch below)
  • Vanilla RNNs have been in use since the 1990s.
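  • A minimal NumPy sketch of the recurrence above (weight names and sizes are illustrative, not tied to any library):

      import numpy as np

      def rnn_step(x_t, h_prev, W_xh, W_hh, b):
          # One vanilla-RNN step: h_t = tanh(W_xh·x_t + W_hh·h_(t-1) + b)
          return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

      rng = np.random.default_rng(0)
      W_xh = rng.normal(size=(4, 3))          # input size 3 -> hidden size 4
      W_hh = rng.normal(size=(4, 4))
      b = np.zeros(4)

      h = np.zeros(4)                         # initial hidden state
      for x_t in rng.normal(size=(5, 3)):     # a sequence of 5 input vectors
          h = rnn_step(x_t, h, W_xh, W_hh, b)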

LSTM - Long Short-Term Memory

  • Captures long-term dependencies much better than a vanilla RNN
  • Uses an input gate, forget gate, and output gate to control a separate cell state (see the sketch below)
  • Introduced in 1997; became very popular after Google used LSTM-based models for translation around 2014.
  • Around 2014, attention was also combined with RNN/LSTM models.
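  • An illustrative NumPy sketch of one LSTM step with the three gates and the cell state (the per-gate weight dictionaries are a teaching device; real implementations fuse them into larger matrices):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_step(x_t, h_prev, c_prev, W, U, b):
          # Per-gate weights: 'i' input, 'f' forget, 'o' output, 'g' candidate
          i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
          f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
          o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
          g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate values
          c = f * c_prev + i * g      # new cell state (long-term memory)
          h = o * np.tanh(c)          # new hidden state (short-term output)
          return h, c

      n_in, n_h = 3, 4
      rng = np.random.default_rng(0)
      W = {k: rng.normal(size=(n_h, n_in)) for k in "ifog"}
      U = {k: rng.normal(size=(n_h, n_h)) for k in "ifog"}
      b = {k: np.zeros(n_h) for k in "ifog"}
      h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, U, b)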

Gated Recurrent Units (GRUs)

  • GRUs are a simplified version of LSTMs with fewer parameters.
  • Use an "update gate" and a "reset gate" instead of the three LSTM gates and separate cell state

Transformers

  • Revolutionary sequence model built around self-attention (see the sketch below).
  • Handles sequential data but processes all positions in parallel (no recurrence).
  • Uses positional embeddings to encode token order.
  • Long contexts are handled with techniques such as sliding-window attention, summarizing after chunks, and sparse attention (attending only to selected tokens).
  • Stateless: any "carry-forward state" across chunks must be passed along explicitly.
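  • A minimal NumPy sketch of single-head scaled dot-product self-attention, the core Transformer operation (projection matrices and sizes are made up for illustration):

      import numpy as np

      def softmax(z, axis=-1):
          z = z - z.max(axis=axis, keepdims=True)    # subtract max for stability
          e = np.exp(z)
          return e / e.sum(axis=axis, keepdims=True)

      def self_attention(X, W_q, W_k, W_v):
          # X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)
          Q, K, V = X @ W_q, X @ W_k, X @ W_v
          scores = Q @ K.T / np.sqrt(K.shape[-1])    # (seq_len, seq_len)
          return softmax(scores) @ V                 # weighted sum of value vectors

      rng = np.random.default_rng(0)
      X = rng.normal(size=(6, 8))                    # 6 tokens, d_model = 8
      W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
      out = self_attention(X, W_q, W_k, W_v)         # shape (6, 8)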

Hidden Markov Models (HMMs)

  • The designer must define the hidden (internal) states and observable symbols using domain knowledge.
  • Transition and emission probabilities are learned from labeled data (see the sketch below)
  • Highly interpretable.
  • Used in speech recognition, bioinformatics, etc.
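  • A small NumPy sketch of an HMM plus the forward algorithm for scoring an observation sequence; the states, probabilities, and observations below are invented purely for illustration:

      import numpy as np

      # Hidden states: 0 = "Rainy", 1 = "Sunny"; observations: 0 = "walk", 1 = "shop"
      start = np.array([0.6, 0.4])       # initial state probabilities
      trans = np.array([[0.7, 0.3],      # P(next state | current state)
                        [0.4, 0.6]])
      emit = np.array([[0.1, 0.9],       # P(observation | state)
                       [0.6, 0.4]])

      def forward(obs):
          # Forward algorithm: likelihood of the observation sequence under the HMM
          alpha = start * emit[:, obs[0]]
          for o in obs[1:]:
              alpha = (alpha @ trans) * emit[:, o]
          return alpha.sum()

      print(forward([0, 1, 1]))          # P(walk, shop, shop)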

ML Parameters vs Hyper-Parameters

  • Parameters: Change as part of training (e.g. Weights)

  • Hyperparameters: Manually configured settings (see the toy example after this list), such as:

    • Learning rate, Batch Size, Total iterations, Epochs
    • Number of Layers
    • Optimizer Algorithms
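  • A toy gradient-descent loop showing the distinction: learning_rate and epochs are hyperparameters chosen by hand, while w and b are parameters updated by training (all values are arbitrary):

      import numpy as np

      learning_rate = 0.1    # hyperparameter: set manually before training
      epochs = 200           # hyperparameter

      rng = np.random.default_rng(0)
      x = rng.uniform(-1, 1, size=100)
      y = 2 * x + 1 + 0.01 * rng.normal(size=100)   # data from y = 2x + 1 + noise

      w, b = 0.0, 0.0        # parameters: learned during training
      for _ in range(epochs):
          y_hat = w * x + b
          grad_w = 2 * np.mean((y_hat - y) * x)     # d(MSE)/dw
          grad_b = 2 * np.mean(y_hat - y)           # d(MSE)/db
          w -= learning_rate * grad_w
          b -= learning_rate * grad_b

      print(w, b)            # should approach 2 and 1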

Hypothesis

  • Assumption about the relationship between inputs and outputs, such as:

    • Linear Regression: the output is a linear function of the inputs
    • Logistic Regression: the response can be expressed as a sigmoid function of the inputs
    • SVM: the output classes can be separated by a hyperplane in the input space
  • Refined during training.

ReLU (Rectified Linear Unit):

  • x > 0 ? x : 0
  • Fast to compute and very popular.
  • Different activation functions perform better for different models.

Leaky ReLU

  • x > 0 ? x : α·x (α small and positive, commonly 0.01)
  • Addresses the "dying ReLU" problem of inactive neurons when x < 0

ELU (Exponential Linear Unit)

  • x > 0 ? x : α (e^x - 1) (commonly α = 1)
  • For negative inputs the output converges exponentially to -α

Tanh (Hyperbolic Tangent)

  • tanh(x) = sinh(x) / cosh(x) = (e^x - e^-x) / (e^x + e^-x)
  • Output maps from -1 to 1

Sigmoid

  • sigma(x) = 1 / (1+e^-x) = e^x / (1+e^x)
  • Maps to (0, 1)
  • sigma(0) = 0.5
  • sigma(-5) = 0.0067
  • sigma(5) = 0.9933
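  • For reference, the activation functions above (ReLU, Leaky ReLU, ELU, tanh, sigmoid) written as plain NumPy functions:

      import numpy as np

      def relu(x):
          return np.maximum(0.0, x)

      def leaky_relu(x, alpha=0.01):
          # Small slope alpha for x < 0 keeps neurons from "dying"
          return np.where(x > 0, x, alpha * x)

      def elu(x, alpha=1.0):
          # Converges to -alpha for large negative x
          return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

      def tanh(x):
          return np.tanh(x)

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      print(sigmoid(0.0), sigmoid(5.0))   # 0.5, ~0.9933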

GLU (Gated Linear Unit)

  • GLU(x) = σ(xW + b) ⊙ (xV + c)
  • It is a small learned network of its own (unlike a fixed activation such as ReLU)
  • Acts more like a hidden layer with two weight matrices.
  • x = input; W, V = weight matrices; b, c = biases
  • The sigmoid σ(xW + b) acts as a gate on the linear projection xV + c.
  • Variants of GLU exist where the sigmoid gate is replaced by Swish (SwiGLU), GELU (GeGLU), or ReLU (ReGLU)
  • SwiGLU has been found to be the most effective in many modern LLMs.

Swish/SiLU (Sigmoid Linear Unit)

  • Swish(x) = x · sigmoid(β·x) (commonly β = 1)
  • Response range: approximately (-0.28, ∞)
  • For x < 0, the output dips to about -0.28 (near x ≈ -1.28) and then converges to 0 as x → -∞
  • Smooth but non-monotonic: the gradient changes sign around the dip, which optimizers must handle.

GELU (Gaussian Error Linear Unit)

  • The shape is similar to Swish.
  • Can be approximated as x · sigmoid(1.702·x) (see the sketch below)
  • Used in many popular models such as GPT-3, BERT, RoBERTa, etc.
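  • A short NumPy sketch of Swish and the sigmoid-based GELU approximation mentioned above (the exact GELU uses the Gaussian CDF instead):

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def swish(x, beta=1.0):
          # Swish/SiLU: x * sigmoid(beta * x)
          return x * sigmoid(beta * x)

      def gelu_approx(x):
          # Sigmoid approximation of GELU: x * sigmoid(1.702 * x)
          return x * sigmoid(1.702 * x)

      x = np.array([-3.0, -1.28, 0.0, 2.0])
      print(swish(x))          # dips to about -0.28 near x = -1.28
      print(gelu_approx(x))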

SwiGLU (Variant of GLU - Gated Linear Unit)

  • SwiGLU(x) = Swish(xW) ⊙ (xV) (W, V: learned weight matrices; ⊙: element-wise product); see the sketch below
  • Used by Llama, DeepSeek, and other modern LLMs.
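  • An illustrative NumPy sketch of GLU and its SwiGLU variant (weight shapes are hypothetical; biases are often dropped in SwiGLU, as done here):

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def glu(x, W, b, V, c):
          # GLU(x) = sigmoid(xW + b) ⊙ (xV + c)
          return sigmoid(x @ W + b) * (x @ V + c)

      def swiglu(x, W, V):
          # SwiGLU(x) = Swish(xW) ⊙ (xV), with Swish(z) = z * sigmoid(z)
          z = x @ W
          return (z * sigmoid(z)) * (x @ V)

      rng = np.random.default_rng(0)
      x = rng.normal(size=(2, 8))        # 2 tokens, d_model = 8
      W, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
      b, c = np.zeros(16), np.zeros(16)
      print(glu(x, W, b, V, c).shape, swiglu(x, W, V).shape)   # (2, 16) (2, 16)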

Softmax

  • Maps a vector of real values to a probability distribution
  • (z_1, z_2, ..., z_k) --> e^(z_i) / Σ_j e^(z_j), for i = 1 to k
  • i.e. softmax applies exp function to each element and normalizes it by dividing by sum.
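  • A numerically stable softmax in NumPy (subtracting the maximum does not change the result but avoids overflow):

      import numpy as np

      def softmax(z):
          z = z - np.max(z)     # shift by the max for numerical stability
          e = np.exp(z)
          return e / e.sum()

      print(softmax(np.array([1.0, 2.0, 3.0])))   # elements sum to 1.0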

Seq2seq

  • Refers to any model that transforms one sequence into another
  • Applications: translation, conversational agents, summarization
  • The source sequence can be viewed as an encoded version of the destination sequence.
  • Encoder-decoder RNNs and Transformer models are seq2seq models.

FFN - Feed-Forward Network aka MLP

  • aka MLP - MultiLayer Perceptron
  • Used in Transformer models after the self-attention sub-layer in each block
  • One FFN block = Linear Layer 1 -> GLU-style activation -> Linear Layer 2 (see the sketch below)
  • Linear Layer 1 expands the input dimension d_model (e.g. 1024) to d_ff (e.g. 4096)
  • Linear Layer 2 projects d_ff back down to d_model.
  • A GLU-style activation behaves like a layer and itself uses two weight matrices.
  • Unlike an RNN, it has no memory: each position is processed independently
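  • A NumPy sketch of one Transformer FFN block, shown here with a GELU-style activation for simplicity (a GLU-style activation would use two weight matrices in place of W1, as described above); all weights are random placeholders:

      import numpy as np

      def gelu_approx(x):
          # GELU via the sigmoid approximation: x * sigmoid(1.702 * x)
          return x / (1.0 + np.exp(-1.702 * x))

      def ffn_block(x, W1, b1, W2, b2):
          # Linear layer 1 expands d_model -> d_ff; layer 2 projects back to d_model
          return gelu_approx(x @ W1 + b1) @ W2 + b2

      d_model, d_ff = 1024, 4096
      rng = np.random.default_rng(0)
      W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
      W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

      x = rng.normal(size=(4, d_model))        # 4 token positions
      out = ffn_block(x, W1, b1, W2, b2)       # (4, 1024); each position independent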

BERT - Bidirectional Encoder Representations from Transformers

  • 2018: Early, highly influential Transformer-based language model from Google. Open weights.
  • Revolutionised NLU (Natural Language Understanding)
  • Bidirectional: both the left and right context of each word is used
  • Encoder Only Model.
  • Used MLM (Masked Language Modeling) as the pretraining objective (see the example below)
  • Ideal for text classification, sentiment analysis, NER (Named Entity Recognition), and also for Q&A.
  • Not a good fit for text-to-text tasks such as translation (use T5 instead)
  • Easier to train for NLU tasks.
  • Parameters: BERT-Base (110M), Large (340M)
  • Trained on Book Corpus (800M words), English Wikipedia (2.5B words)
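  • If the Hugging Face transformers library is installed, a masked-language-modeling query against a pretrained BERT checkpoint looks roughly like this (the sentence is arbitrary and the exact output format depends on the library version):

      from transformers import pipeline

      # Downloads the pretrained bert-base-uncased checkpoint on first use
      fill_mask = pipeline("fill-mask", model="bert-base-uncased")

      # BERT predicts the [MASK] token from both left and right context
      for candidate in fill_mask("The capital of France is [MASK]."):
          print(candidate["token_str"], candidate["score"])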

BERT Variants

  • DistilBERT (2019 HuggingFace 66M)
  • TinyBERT (2019 Huawei 15-50M)
  • ALBERT ("A Lite BERT") (2019 Google 12M)
  • RoBERTa (Robustly Optimized BERT) (2019 Facebook) Trained on roughly 70B words, vs 3.3B words for the original BERT.
  • MobileBERT (2020 Google 25M)
  • ELECTRA (2020 Stanford University 14M-110M) Great Performance
  • DeBERTa (2020 Microsoft 50-100M)

GPT-2: Generative Pretrained Transformer 2 - 2019

  • 2019: GPT-2 released by OpenAI, a year after BERT
  • Large-scale unsupervised pre-training using Transformers.
  • Could generate coherent long passages.
  • Better than BERT for generative tasks.

GPT-3 : 2020

  • Powerful 175B-parameter generative Transformer model by OpenAI. Released in 2020.

BART: Bidirectional and Auto-Regressive Transformer (Facebook 2019)

  • Combines the strengths of bidirectional encoding (like BERT) and auto-regressive decoding (like GPT)

  • Encoder-Decoder Transformer Architecture. Open Weight.

  • BART deliberately applies noise to its input (masking, deletion, etc.) during pre-training to make it more robust at inference time. This denoising objective is the main variation from the standard Transformer setup.

  • BART Large (406M)

  • Newer models such as T5, mBART, and GPT-4 generally outperform BART

T5: Text-to-Text Transfer Transformer (Google 2019)

  • Encoder-Decoder Transformer Model. Open Weight.
  • Used in Translation, Summarization, etc
  • Sizes (2019): small, base, large, 3B (XL), 11B (XXL)
  • Google Imagen uses T5-XXL as text encoder

NLP - Natural Language Processing

SOTA - State of the Art

Auto-encoder

  • An autoencoder is a type of NN that learns to compress (encode) and then reconstruct (decode) data.

  • "Auto" refers to self-reconstruction: both the encoder and decoder are trained without labels (unsupervised/self-supervised).

  • Several variations exist:

    • Denoising autoencoders, e.g. BART
    • Variational autoencoder: adds a probabilistic constraint on the latent space for generative modeling.
    • Seq2seq autoencoder
    • Convolutional autoencoder: uses convolutional layers for image data.
  • Not all encoder-decoder models are autoencoders; e.g. a machine-translation model does not aim to reconstruct its input, it translates it (see the sketch below)
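  • A minimal NumPy sketch of the encode/reconstruct idea (a single linear encoder and decoder with made-up sizes and no training loop; a real autoencoder would learn these weights by minimizing the reconstruction error):

      import numpy as np

      rng = np.random.default_rng(0)
      d_in, d_latent = 32, 8                    # compress 32 dimensions down to 8

      W_enc = rng.normal(size=(d_in, d_latent)) * 0.1
      W_dec = rng.normal(size=(d_latent, d_in)) * 0.1

      def encode(x):
          return np.tanh(x @ W_enc)             # latent code (compressed representation)

      def decode(z):
          return z @ W_dec                      # reconstruction of the input

      x = rng.normal(size=(5, d_in))            # 5 samples
      x_hat = decode(encode(x))
      reconstruction_error = np.mean((x - x_hat) ** 2)   # training would minimize this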

ViTs - Vision Transformers

  • Introduced in 2020
  • Rival and in many cases surpass CNNs on vision tasks

MultiModal Models

  • Image and Text MultiModal models were released in 2021
  • E.g. CLIP (Contrastive Language-Image Pre-training), DALL-E (OpenAI)
  • Imagen, Parti (Google)
  • Stable Diffusion (Stability AI) (moved to a Transformer-based diffusion backbone with Stable Diffusion 3, announced Feb 2024)
  • GauGAN, NVIDIA Canvas by NVIDIA
  • These models combine Transformers with vision models for text-to-image generation

Attention Mechanism

  • First applied in RNN, then later with Transformers
  • 2014 : Bahdanau et al. introduced the attention mechanism.
  • Allowed the RNN decoder to attend to the whole source input sequence, instead of relying only on a single hidden state.
  • 2015 : Luong et al. refined attention mechanisms with:
    • global (soft) attention and
    • local (hard) attention
  • Self-attention is the heart of the Transformer model introduced in 2017

ONNX Runtime

  • Cross-platform machine-learning model accelerator
  • Can integrate hardware-specific libraries.
  • ONNX Runtime can be used with models from PyTorch, Tensorflow/Keras, TFLite, scikit-learn, and other frameworks.
  • e.g. you can export a BERT-derived model to ONNX and run it CPU-optimized in a desktop application at inference time (see the sketch below)
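  • Assuming a model has already been exported to ONNX (the file name, input shape, and dummy data below are hypothetical), CPU inference with the onnxruntime Python package looks roughly like this:

      import numpy as np
      import onnxruntime as ort

      # Load the exported model; restrict execution to the CPU provider
      session = ort.InferenceSession("bert_classifier.onnx",
                                     providers=["CPUExecutionProvider"])

      # Input/output names come from the exported graph
      input_name = session.get_inputs()[0].name
      dummy_input = np.random.randint(0, 30522, size=(1, 128), dtype=np.int64)

      outputs = session.run(None, {input_name: dummy_input})
      print(outputs[0].shape)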