Transformers - Attention is All You Need

Synopsis


   Embedding Positional-Embedding
   Self-Attention  O(N*N)

   NLP:     RNN    LSTM     Transformers

Transformers Architecture

Why Transformers?

  • Before Transformers, RNNs and LSTMs were used for translation and most similar tasks.

  • The Transformer architecture is better than RNNs in the following aspects:

    • Better at remembering large context

    • Parallelizable - So Faster

      e.g. "The animal did not cross the street because it was tired." Does "it" refer to the animal or the street? Resolving this requires long-range context.

    • Self-attention is O(N*N) in the sequence length N (e.g. N ≈ 70 words in a sentence), whereas each RNN step is O(d*d) in the hidden dimension d ≈ 1000.

    • Self attention mechanism

Types of Transformer Models

  • Encoder-Decoder Models. (e.g. Most Translation/Summarization Models)
  • Decoder-Only Models. (e.g. Q&A Models)

Note: Decoder-only models are easier to train; they are even emerging as an option for translation as well.

Encoder + Decoder Model Overview


  Encoder + Decoder Model (Translation):

  Input Text         ──▶ [Encoder Embeddings + Layers] ──┐
  "Sky is blue"                                          │
                                              Cross-Attn │
                                                         ▼
  <start> Vaanam ... ──▶ [Decoder Embeddings + Layers] ──▶ Logits ──▶ Next token

  Decoder loops over its own output: "Vaanam", "Vaanam is", "Vaanam is Neelam"

  • Self-attention is applied within each layer, across the tokens of the same sequence.
  • Cross-attention is applied between the encoder output (context) and the decoder layers. (A sketch of the decoding loop is shown below.)
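
  A sketch of the decoder loop above as greedy decoding. The functions
  `encode` and `decode_step` and the special token IDs are hypothetical
  placeholders standing in for a real model, not an actual API:

    # Greedy decoding for an encoder-decoder translation model (sketch).
    def translate(src_tokens, encode, decode_step, start_id, end_id, max_len=50):
        enc_out = encode(src_tokens)            # run the encoder once (context)
        out = [start_id]                        # decoder starts from <start>
        for _ in range(max_len):
            # decoder self-attends over `out` and cross-attends over enc_out
            logits = decode_step(out, enc_out)  # scores over the target vocabulary
            next_id = max(range(len(logits)), key=lambda i: logits[i])
            if next_id == end_id:
                break
            out.append(next_id)                 # "Vaanam" -> "Vaanam is" -> ...
        return out[1:]                          # drop the <start> token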

Decoder-Only Model Overview


   Assume: V = Total Vocabulary = 50,000
           L = Sequence Length = 128 Tokens
           d_model = Embedding Dimension = 1024

   Input Tokens: [t₁, t₂, ..., t_L]
             ▼
╔════════════════════════════════════════════════════╗
║            Token Embedding Lookup (E)              ║
║   Embedding matrix: [V, d_model]                   ║
║   Input: [L] (token IDs)                           ║
║   Output: [L, d_model]                             ║
╚════════════════════════════════════════════════════╝
             ▼
╔════════════════════════════════════════════════════╗
║            Add Positional Encoding (P)             ║
║   Position matrix: [L, d_model]                    ║
║   Output: [L, d_model]                             ║
╚════════════════════════════════════════════════════╝
             ▼
╔════════════════════════════════════════════════════╗
║        Transformer Block × N (e.g., N=12)          ║
║ ┌────────────────────────────────────────────────┐ ║
║ │                Layer Norm                      │ ║
║ │                Self-Attention                  │ ║
║ │   Q = X · W_Q → [L, d_model]                   │ ║
║ │   K = X · W_K → [L, d_model]                   │ ║
║ │   V = X · W_V → [L, d_model]                   │ ║
║ │   (W_Q, W_K, W_V each: [d_model, d_model])     │ ║
║ │   Split into heads → [L, n_heads, d_k]         │ ║
║ │   Attention(Q,K,V) → [L, d_model]              │ ║
║ │                + Residual                      │ ║
║ └────────────────────────────────────────────────┘ ║
║ ┌────────────────────────────────────────────────┐ ║
║ │                Layer Norm                      │ ║
║ │                MLP (FFN)                       │ ║
║ │   Linear: [d_model → d_ff]                     │ ║
║ │   Activation (e.g., GELU)                      │ ║
║ │   Linear: [d_ff → d_model]                     │ ║
║ │                + Residual                      │ ║
║ └────────────────────────────────────────────────┘ ║
╚════════════════════════════════════════════════════╝
             ▼
╔════════════════════════════════════════════════════╗
║       Final Layer Output: [L, d_model]             ║
╚════════════════════════════════════════════════════╝
             ▼
╔════════════════════════════════════════════════════╗
║       Linear Projection to Vocabulary Logits       ║
║   Multiply by Eᵀ: [d_model, V]                     ║
║   Output: [L, V]                                   ║
╚════════════════════════════════════════════════════╝
             ▼
     Softmax → Probabilities over vocabulary
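
  A shape-level sketch of the forward pass above using PyTorch building blocks.
  A TransformerEncoderLayer with a causal mask is used here to stand in for a
  decoder-only block; the head count, layer count, and d_ff are illustrative
  assumptions matching the diagram, not any specific model:

    import torch
    import torch.nn as nn

    V, L, d_model, n_heads, n_layers, d_ff = 50_000, 128, 1024, 16, 12, 4096

    embed  = nn.Embedding(V, d_model)                 # E: [V, d_model]
    pos    = nn.Embedding(L, d_model)                 # learned absolute positions P: [L, d_model]
    layer  = nn.TransformerEncoderLayer(d_model, n_heads, d_ff,
                                        activation="gelu", norm_first=True,
                                        batch_first=True)
    blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    tokens = torch.randint(0, V, (1, L))              # [batch=1, L] token IDs
    x = embed(tokens) + pos(torch.arange(L))          # [1, L, d_model]

    causal = nn.Transformer.generate_square_subsequent_mask(L)   # hide future tokens
    h = blocks(x, mask=causal)                        # [1, L, d_model]

    logits = h @ embed.weight.T                       # multiply by Eᵀ -> [1, L, V]
    probs  = logits.softmax(dim=-1)                   # probabilities over vocabulary
    print(logits.shape)                               # torch.Size([1, 128, 50000])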


Self Attention

Attention Goals

  • Find relative importance between words (input vectors)
  • Enrich input vectors using relative importance
  • The importance of x1 for x2 is not the same as the importance of x2 for x1 (attention is not symmetric); it depends on which word is querying which.

Self Attention Mechanism

  • There are 3 important artifacts:

    • query (represents what is being searched for; analogous to e.g. SELECT ... WHERE record[key1].value = value1)
    • key (represents the identifier of each word)
    • value (represents the content that gets retrieved)
  • The q, k, v dimensions are typically smaller than the input token vector dimension.

   Embedding length d_model = 512 (per token vector)
   query/key/value vector length d_k = 64 (typically smaller)
   query weight matrix Qw = 512 x 64 (Kw and Vw have the same shape)

   q(1) = x1 . Qw = 64-dim query vector for word1
   Note: q(1) contains 64 new derived (composite) properties of the input vector.

   k(1) = x1 . Kw = 64-dim key vector for word1

   v(1) = x1 . Vw = 64-dim value vector for word1

   Attention score = q . k
           q(1) . k(1) = word1's attention score on word1.
           q(1) . k(2) = word1's attention score on word2.

   Normalized attention score = softmax( q.k / sqrt(64) ), where 64 = d_k
           e.g. softmax for word1 on word1 = 0.88; word1 on word2 = 0.12
                means word2 contributes 0.12 of the context at word1's position.
   Say the softmax scores for word1 are (s1, s2, ... sn),
   one per input word (n = total input words). They add up to 1.

   v represents the essence of the word itself in a smaller dimension.

   z1 = Transformed Value = s1*v(1) + s2*v(2) + .... + sn*v(n)
        (weighted average of the value vectors, weighted by the attention scores)

   Note: we get z1, z2, ... zn where n = total input words.

   Note that z is an enriched context vector compared to the input token vector.

   At the end of the self-attention block there is a residual connection:
   the original input X is added to Z and the result is sent to the next FFN block.
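
  A minimal NumPy sketch of the walkthrough above (sizes and variable
  names are illustrative assumptions, not tied to any specific model):

    import numpy as np

    d_model, d_k, n_words = 512, 64, 5          # toy sizes from the walkthrough
    rng = np.random.default_rng(0)

    X  = rng.normal(size=(n_words, d_model))    # input token vectors x1..xn
    Qw = rng.normal(size=(d_model, d_k))        # learned projection matrices
    Kw = rng.normal(size=(d_model, d_k))
    Vw = rng.normal(size=(d_model, d_k))

    Q, K, V = X @ Qw, X @ Kw, X @ Vw            # [n_words, d_k] each

    scores = Q @ K.T / np.sqrt(d_k)             # [n_words, n_words] attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the words

    Z = weights @ V                             # [n_words, d_k] enriched vectors z1..zn
    print(Z.shape)                              # (5, 64)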
 

Multi-Head Attention Mechanism

  • Multiple attention heads are used instead of a single one.
  • E.g. 8 attention heads, each with its own (Qw, Kw, Vw), each producing 64-dim q/k/v from a 512-dim input vector.
  • Attention heads do not divide the input vector between themselves; each head independently attends over the full input vector.
  • Random initialization of the heads mostly helps them capture complementary patterns rather than being redundant. Techniques such as orthogonal weight initialization exist, but as of now they are not used in popular models.
  • Final Z = concat(z1, z2, ..., z8) is the enriched context vector (in practice followed by an output projection back to d_model); see the sketch after this list.
  • The enriched context vector feeds into the FFN and/or into the decoder for cross-attention.
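
  A minimal multi-head sketch under the same toy assumptions (8 heads of
  64 dims each, concatenated back to 512; the output projection Wo is the
  standard detail mentioned above):

    import numpy as np

    d_model, n_heads, n_words = 512, 8, 5
    d_k = d_model // n_heads                       # 64 dims per head
    rng = np.random.default_rng(0)

    X  = rng.normal(size=(n_words, d_model))       # input token vectors
    Qw = rng.normal(size=(n_heads, d_model, d_k))  # one (Qw, Kw, Vw) per head
    Kw = rng.normal(size=(n_heads, d_model, d_k))
    Vw = rng.normal(size=(n_heads, d_model, d_k))
    Wo = rng.normal(size=(d_model, d_model))       # output projection

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    heads = []
    for h in range(n_heads):                       # each head attends independently
        Q, K, V = X @ Qw[h], X @ Kw[h], X @ Vw[h]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

    Z = np.concatenate(heads, axis=-1) @ Wo        # [n_words, d_model]
    print(Z.shape)                                 # (5, 512)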

Cross Attention

  • It is used in seq2seq models, such as translation models, which are encoder-decoder based.

  • The output of the encoder is projected into (K, V) (i.e. keys, values) using another pair of learned weight matrices (Kw, Vw). These are separate from the (Q, K, V) projections used in self-attention.

  • The encoder's (K, V) are passed on to the decoder. The decoder projects its own states into Q and multiplies them with the encoder's K to get attention scores indicating which encoder tokens to focus on; the scores then weight the encoder's V (a minimal sketch follows below).
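
  A minimal cross-attention sketch (toy shapes assumed; K and V come from
  the encoder output, Q comes from the decoder states):

    import numpy as np

    d_model, d_k = 512, 64
    n_src, n_tgt = 6, 3                           # encoder tokens vs decoder tokens
    rng = np.random.default_rng(0)

    enc_out = rng.normal(size=(n_src, d_model))   # encoder final hidden states
    dec_h   = rng.normal(size=(n_tgt, d_model))   # decoder hidden states

    Qw = rng.normal(size=(d_model, d_k))          # decoder-side query projection
    Kw = rng.normal(size=(d_model, d_k))          # encoder-side key projection
    Vw = rng.normal(size=(d_model, d_k))          # encoder-side value projection

    Q = dec_h   @ Qw                              # [n_tgt, d_k]
    K = enc_out @ Kw                              # [n_src, d_k]
    V = enc_out @ Vw                              # [n_src, d_k]

    scores = Q @ K.T / np.sqrt(d_k)               # [n_tgt, n_src]: which source tokens matter
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    context = weights @ V                         # [n_tgt, d_k] per decoder position
    print(context.shape)                          # (3, 64)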

Positional Embedding

  • The positional embedding vector is typically added to the word embedding before it is passed on to self-attention.

  • Adding the positional embedding does not create ambiguity (i.e. it does not blur the semantic word-embedding space): since it is jointly learned in a high-dimensional space, it has been observed to end up mostly orthogonal to the word-embedding directions.

  • Positional embedding could be one of the following:

    • Absolute position based: most common and simplest to implement, proven to work well; e.g. learned absolute positions in GPT-2 and BERT. Cannot support a context longer than the trained maximum length. (A sketch of the original sinusoidal absolute encoding follows this list.)

    • Rotary Positional Embedding (RoPE); e.g. Llama, Mistral. Uses rotation matrices to encode relative position into the Q and K vectors, applied as part of self-attention; attention scores decay for tokens that are farther apart; computed dynamically rather than learned; supports long contexts.

    • Relative positional embedding; e.g. T5. Uses relative position biases instead of absolute positions; learned during training.
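
  For reference, a sketch of the sinusoidal absolute positional encoding from
  the original "Attention is All You Need" paper; learned absolute variants
  simply replace this fixed table with a trainable [max_len, d_model] matrix:

    import numpy as np

    def sinusoidal_positions(max_len, d_model):
        # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)
        pos = np.arange(max_len)[:, None]              # [max_len, 1]
        i   = np.arange(0, d_model, 2)[None, :]        # even dimension indices
        angles = pos / np.power(10000.0, i / d_model)  # [max_len, d_model/2]
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe                                      # added to the token embeddings

    pe = sinusoidal_positions(max_len=128, d_model=1024)
    print(pe.shape)                                    # (128, 1024)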

Vocabulary for Translation Model

  • NMT (Neural Machine Translation Task)

  • Uses separate vocabulary spaces for source (encoder) and target (decoder)

  • The word embeddings for words with the same meaning are entirely different in the source language and the target language.

  • In multilingual models (e.g. mT5), words with similar meanings share similar word embeddings.

Frameworks & Libraries

  • PyTorch is best suited for research, experimentation, and initial training.

  • TensorFlow is best suited for large-scale training and deployment.

  • Hugging Face (Transformers) is best suited for research, experimentation, and fine-tuning. It supports both PyTorch and TensorFlow backends.

  • Most open-source models such as Llama, DeepSeek, and Mistral use PyTorch for the initial training of the model itself. For fine-tuning by others, the Hugging Face APIs provide better abstraction and easy access to the source models (a minimal inference example follows below).

  • fairseq (by Facebook/Meta) is a sequence-modeling toolkit. It supports large-scale training, fine-tuning, and mixed precision (FP16).

  • PyTorch Lightning is a thin wrapper around PyTorch that provides better abstraction and reduces boilerplate code.

  • DeepSpeed (by Microsoft) is an open-source library, based on PyTorch, used for training massive models at GPT-3 scale.
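
  A minimal example of translation inference through the Hugging Face pipeline
  API (t5-small is just an illustrative small encoder-decoder checkpoint that
  supports English-to-German):

    # pip install transformers torch sentencepiece
    from transformers import pipeline

    # t5-small is an encoder-decoder model; the pipeline adds the task prefix.
    translator = pipeline("translation_en_to_de", model="t5-small")

    result = translator("The sky is blue.", max_length=40)
    print(result[0]["translation_text"])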