Transformers - Attention is All You Need
Embedding Positional-Embedding
Self-Attention O(N*N)
NLP: RNN LSTM Transformers
Historical Links:
Before Transformers, RNNs and LSTMs were used for translation and most such tasks.
The Transformer architecture is better than RNNs in the following aspects:
Better at remembering large context
Parallelizable - So Faster
"The animal did not cross the street because it was tired."
Does "it" refer to "animal" or "street"? Self-attention lets the model attend from "it" back to "animal" (rather than "street") and resolve the reference.
Self-attention has per-layer complexity O(N² · d),
where N = sequence length (a sentence is rarely more than ~70 words),
whereas a recurrent (RNN) layer costs O(N · d²), where d ≈ 1000 is the representation dimension.
Since N is usually much smaller than d, self-attention is cheaper per layer (see the rough comparison below), and it is parallel across positions.
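A rough back-of-the-envelope comparison using the numbers above (illustrative only, counting multiply-accumulates per layer):

    # Rough per-layer cost comparison; N and d taken from the notes above.
    N = 70     # sequence length (words in a sentence)
    d = 1000   # representation dimension

    self_attention_ops = N * N * d   # O(N^2 * d): every token attends to every token
    recurrent_ops = N * d * d        # O(N * d^2): one d x d recurrence step per token

    print(f"self-attention: ~{self_attention_ops:,} ops")   # ~4,900,000
    print(f"RNN (recurrent): ~{recurrent_ops:,} ops")       # ~70,000,000
    # Self-attention is also parallel across all N positions; an RNN must run step by step.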
Self attention mechanism
Note: Decoder-only models are easier to train; they are even emerging as an option for translation as well.
Encoder + Decoder Model (Translation):

    Input Text ──▶ [Encoder + Layers] ──────┐
    "Sky is blue"    Embeddings             │ Cross-Attn
                                ┌───────────┘
                                ▼
    <start> Vaanam ──▶ [Decoder + Layers] ──▶ Logits ──▶ Next token
                        Embeddings

    Decoder loops: "Vaanam", "Vaanam is", "Vaanam is Neelam"
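A minimal sketch of that decoder loop (greedy decoding). The `encoder`, `decoder`, `bos_id` and `eos_id` names are assumptions for illustration, not a specific library API:

    import torch

    # Illustrative greedy decoding loop for an encoder-decoder translation model.
    def greedy_translate(encoder, decoder, src_ids, bos_id, eos_id, max_len=50):
        memory = encoder(src_ids)                  # encoder output, computed once and reused
        out_ids = [bos_id]                         # start with <start>
        for _ in range(max_len):
            tgt = torch.tensor([out_ids])          # tokens so far: "Vaanam", "Vaanam is", ...
            logits = decoder(tgt, memory)          # decoder cross-attends to the encoder memory
            next_id = int(logits[0, -1].argmax())  # greedily pick the most likely next token
            out_ids.append(next_id)
            if next_id == eos_id:                  # stop at end-of-sequence
                break
        return out_ids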
Assume: V = Total Vocabulary = 50,000
L = Sequence Length = 128 Tokens
d_model = Embedding Dimension = 1024
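As a quick sanity check on these numbers: the token embedding matrix E alone is V × d_model = 50,000 × 1,024 ≈ 51.2M parameters, and (as shown at the bottom of the diagram) the same matrix is reused as Eᵀ for the final projection back to vocabulary logits.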
Input Tokens: [t₁, t₂, ..., t_L]
▼
╔════════════════════════════════════════════════════╗
║ Token Embedding Lookup (E) ║
║ Embedding matrix: [V, d_model] ║
║ Input: [L] (token IDs) ║
║ Output: [L, d_model] ║
╚════════════════════════════════════════════════════╝
▼
╔════════════════════════════════════════════════════╗
║ Add Positional Encoding (P) ║
║ Position matrix: [L, d_model] ║
║ Output: [L, d_model] ║
╚════════════════════════════════════════════════════╝
▼
╔════════════════════════════════════════════════════╗
║ Transformer Block × N (e.g., N=12) ║
║ ┌────────────────────────────────────────────────┐ ║
║ │ Layer Norm │ ║
║ │ Self-Attention │ ║
║ │ Q = X · W_Q → [L, d_model] │ ║
║ │ K = X · W_K → [L, d_model] │ ║
║ │ V = X · W_V → [L, d_model] │ ║
║ │ (W_Q, W_K, W_V: [d_model, d_model] each) │ ║
║ │ Split into heads → [L, n_heads, d_k] │ ║
║ │ Attention(Q,K,V) → [L, d_model] │ ║
║ │ + Residual │ ║
║ └────────────────────────────────────────────────┘ ║
║ ┌────────────────────────────────────────────────┐ ║
║ │ Layer Norm │ ║
║ │ MLP (FFN) │ ║
║ │ Linear: [d_model → d_ff] │ ║
║ │ Activation (e.g., GELU) │ ║
║ │ Linear: [d_ff → d_model] │ ║
║ │ + Residual │ ║
║ └────────────────────────────────────────────────┘ ║
╚════════════════════════════════════════════════════╝
▼
╔════════════════════════════════════════════════════╗
║ Final Layer Output: [L, d_model] ║
╚════════════════════════════════════════════════════╝
▼
╔════════════════════════════════════════════════════╗
║ Linear Projection to Vocabulary Logits ║
║ Multiply by Eᵀ: [d_model, V] ║
║ Output: [L, V] ║
╚════════════════════════════════════════════════════╝
▼
Softmax → Probabilities over vocabulary
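A compact PyTorch sketch of this pipeline (token embedding + learned positional embedding + stacked pre-norm Transformer blocks + tied Eᵀ output projection). The block internals are delegated to nn.TransformerEncoderLayer for brevity, so treat it as an approximation of the diagram rather than a reference implementation:

    import torch
    import torch.nn as nn

    V, L, d_model, n_heads, d_ff = 50_000, 128, 1024, 16, 4096
    n_layers = 2   # the diagram uses N=12; 2 keeps this sketch light

    class TinyTransformerLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.tok_emb = nn.Embedding(V, d_model)        # E: [V, d_model]
            self.pos_emb = nn.Embedding(L, d_model)        # P: [L, d_model] (learned absolute)
            block = nn.TransformerEncoderLayer(
                d_model, n_heads, d_ff, activation="gelu",
                batch_first=True, norm_first=True)         # LayerNorm before attention / MLP
            self.blocks = nn.TransformerEncoder(block, n_layers)
            self.ln_f = nn.LayerNorm(d_model)

        def forward(self, ids):                            # ids: [batch, L] of token IDs
            pos = torch.arange(ids.size(1), device=ids.device)
            x = self.tok_emb(ids) + self.pos_emb(pos)      # [batch, L, d_model]
            mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
            x = self.blocks(x, mask=mask)                  # Transformer Block × n_layers
            x = self.ln_f(x)
            return x @ self.tok_emb.weight.T               # multiply by Eᵀ → [batch, L, V]

    logits = TinyTransformerLM()(torch.randint(0, V, (1, L)))
    print(logits.shape)   # torch.Size([1, 128, 50000]); softmax over the last dim gives probabilities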
There are 3 important artifacts per input token: the query (q), key (k), and value (v) vectors.
Their dimensions are typically smaller than the input token vector's dimension.
Embedding length d_model = 512 (per token vector)
query vector length = 64 (typically smaller)
query weight matrix Qw = 512 x 64
q = x1 · Qw = [q1, q2, ... q64]
Note: q contains 64 new derived (composite) properties of the input vector.
k = x1 · Kw = [k1, k2, ... k64]
v = x1 · Vw = [v1, v2, ... v64]
Attention score = q · k (dot product):
q1 · k1 = word1's attention score on word1.
q1 · k2 = word1's attention score on word2.
Normalized attention score = softmax( q · k / sqrt(64) ), where 64 = dim(q).
e.g. softmax for word1 on word1 = 0.88; word1 on word2 = 0.12.
This means word2 contributes to word1's position with weight 0.12.
Say the softmax scores for word1 over all N input words are (s1, s2, ... sN)  # they add up to 1.
v represents the essence of the word itself in a smaller dimension.
z1 = transformed value = s1·v1 + s2·v2 + ... + sN·vN, where vj is word j's value vector.
(A weighted average over the value vectors, using the attention scores as weights.)
Note: we get z1, z2, ... zN, where N = total number of input words.
Note that each z is an enriched, context-aware vector compared to the raw input token vector.
At the end of the self-attention block there is a residual connection,
which means: add the original input X to Z and send the result to the next FFN block.
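The same walkthrough in a few lines of illustrative NumPy (single attention head, d_model = 512, d_k = 64; random matrices stand in for the learned Qw, Kw, Vw). One detail the notes skip: Z is projected back to d_model (via an output matrix, Wo below) before the residual addition, otherwise the shapes would not match:

    import numpy as np

    n, d_model, d_k = 5, 512, 64                   # a 5-word sentence as an example
    X  = np.random.randn(n, d_model)               # input token vectors x1..xn
    Qw = np.random.randn(d_model, d_k)             # learned in practice; random here
    Kw = np.random.randn(d_model, d_k)
    Vw = np.random.randn(d_model, d_k)
    Wo = np.random.randn(d_k, d_model)             # output projection back to d_model

    Q, K, V = X @ Qw, X @ Kw, X @ Vw               # [n, 64] each
    scores = Q @ K.T / np.sqrt(d_k)                # q_i · k_j / sqrt(64): word i's score on word j
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row (s1..sN) sums to 1
    Z = weights @ V                                # z_i = s_i1·v_1 + ... + s_iN·v_N
    out = X + Z @ Wo                               # residual: add the original input, then on to the FFN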
Cross-attention is used in seq2seq models such as translation, which are encoder-decoder based.
The output of the encoder is projected into (K, V) (i.e. Keys, Values) using yet another pair of learned weight matrices (Kw, Vw). These are separate from the (Q, K, V) projections used in self-attention.
The encoder's (K, V) are passed to the decoder. In cross-attention, the decoder projects its own hidden states into queries Q and multiplies them with the encoder's K to get attention scores that indicate which encoder tokens to focus on; those scores then weight the encoder's V.
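A sketch of cross-attention under the same conventions (random matrices stand in for the learned weights): K and V are projections of the encoder output, Q is a projection of the decoder's hidden states:

    import numpy as np

    src_len, tgt_len, d_model, d_k = 6, 4, 512, 64
    enc_out = np.random.randn(src_len, d_model)    # encoder output, one vector per source token
    dec_h   = np.random.randn(tgt_len, d_model)    # decoder hidden states (after masked self-attention)

    Wq = np.random.randn(d_model, d_k)             # cross-attention's own learned projections
    Wk = np.random.randn(d_model, d_k)
    Wv = np.random.randn(d_model, d_k)

    Q = dec_h   @ Wq                               # queries come from the decoder side
    K = enc_out @ Wk                               # keys and values come from the encoder output
    V = enc_out @ Wv

    scores = Q @ K.T / np.sqrt(d_k)                # [tgt_len, src_len]: which source tokens to focus on
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    context = weights @ V                          # [tgt_len, d_k] context gathered from the encoder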
The positional embedding vector is typically added to the word embedding before it is passed to self-attention.
Adding the positional embedding does not create ambiguity (by overlapping with the semantic word-embedding space): since both are jointly learned in a high-dimensional space, the positional component has been observed to be mostly orthogonal to the word-embedding directions.
Positional embedding could be one of the following:
Absolute position based: most common historically; e.g. the original Transformer (sinusoidal) and BERT/GPT-2 (learned); cannot handle a context longer than the trained max length; simplest to implement and proven to work well (see the sketch after this list).
Rotary Positional Embedding (RoPE): e.g. Llama, Mistral; uses a rotation matrix to encode relative position directly into the Q and K vectors as part of self-attention; attention tends to decay for far-apart tokens; computed dynamically rather than learned; extends better to large contexts.
Relative positional embedding: e.g. T5; uses learned relative-position biases instead of absolute positions.
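A sketch of the sinusoidal absolute positional encoding from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the resulting matrix is simply added to the word embeddings:

    import numpy as np

    def sinusoidal_positional_encoding(L, d_model):
        pos = np.arange(L)[:, None]                    # positions 0..L-1, shape [L, 1]
        i = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
        angles = pos / np.power(10000, i / d_model)    # [L, d_model/2]
        P = np.zeros((L, d_model))
        P[:, 0::2] = np.sin(angles)                    # even dims get sin
        P[:, 1::2] = np.cos(angles)                    # odd dims get cos
        return P

    P = sinusoidal_positional_encoding(128, 1024)      # [L, d_model]
    # x = token_embeddings + P   # added element-wise before the first self-attention block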
NMT (Neural Machine Translation)
Often uses separate vocabulary spaces for the source (encoder) and target (decoder) languages.
The word embeddings for words with the same meaning are entirely different in the source and target languages.
In a multilingual model (e.g. mT5), words with similar meanings share similar embeddings.
PyTorch is best suited for research, experimentation, and initial training.
TensorFlow is best suited for large-scale training and deployment.
Hugging Face is best suited for research, experimentation, and fine-tuning. It supports both PyTorch and TensorFlow backends.
Most open-source models like Llama, DeepSeek, and Mistral use PyTorch for the initial training of the model itself. For fine-tuning by others, the Hugging Face APIs provide better abstraction and easy access to the source models.
fairseq (by Facebook/Meta) is a sequence-modeling toolkit. It supports large-scale training, fine-tuning, and mixed precision (FP16).
PyTorch Lightning is a thin wrapper around PyTorch that provides better abstraction and reduces boilerplate code.
DeepSpeed (by Microsoft) is a library used to train massive, GPT-3-scale models. It is based on PyTorch and is open source.
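As a concrete example of the Hugging Face abstraction mentioned above, loading a pretrained translation model takes a few lines (the checkpoint name below is just one publicly available English-French model, used here as a stand-in for whichever source model you need):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-fr"      # example public checkpoint (assumption)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)   # PyTorch weights by default

    inputs = tokenizer("Sky is blue", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))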