LLM Notes

  • Notes on LLMs from the Stanford CS229 (Machine Learning) lecture.

  • See https://www.youtube.com/watch?v=9vM4p9NN0Ts

  • HELM: Holistic Evaluation of Language Models (an NLP benchmark). Sample ratings:

    Model              Mean win rate
    GPT-4 0613         0.962
    GPT-4 Turbo        0.834
    Palmyra X V3 72B   0.821
    Palmyra X V2 33B   0.783
    Yi 34B             0.772
    
  • A good site for comparing the performance of various models: https://artificialanalysis.ai/

  • See also: the Hugging Face Open LLM Leaderboard.

  • Transformers generally outperform LSTMs for language modeling (see the comparison below).

Transformers vs LSTM

  • Sequential vs. Parallel Processing: LSTMs (based on RNNs) process sequences sequentially, while Transformers process all positions in parallel.
  • Recurrence vs. Self-Attention: LSTMs rely on recurrence and gates, whereas Transformers use self-attention mechanisms.
  • Local vs. Global Dependencies: LSTMs focus on local dependencies, Transformers on global.
  • LSTMs can still do well on temporal relationships (e.g., stock-trading time-series data); see the sketch below for the sequential-vs-parallel contrast.
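
  • A minimal PyTorch sketch of the contrast above (shapes and sizes are illustrative): nn.LSTM must compute each timestep after the previous one (internally sequential), while multi-head self-attention scores all positions against each other in one parallel matrix multiply.

    import torch
    import torch.nn as nn

    batch, seq_len, dim = 2, 8, 32
    x = torch.randn(batch, seq_len, dim)

    # LSTM: the whole sequence goes in with one call, but internally each
    # timestep depends on the previous hidden state, so work is sequential.
    lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
    lstm_out, (h_n, c_n) = lstm(x)          # (batch, seq_len, dim)

    # Self-attention: every position attends to every other position in one
    # batched matrix multiplication, so all timesteps are processed in parallel.
    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
    attn_out, attn_weights = attn(x, x, x)  # (batch, seq_len, dim)

    print(lstm_out.shape, attn_out.shape)   # both torch.Size([2, 8, 32])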

  • A well-chosen activation function helps training converge faster.
  • More data is better.
  • Better model is better.
  • The Chinchilla paper describes, for a given compute budget (total training FLOPs), how many parameters the model should have and how many tokens it should be trained on; it found that, for compute-optimal training, model size and number of training tokens should be scaled roughly in equal proportion (on the order of 20 tokens per parameter).
  • Smaller model costs less for inference too.
  • Post-training is needed to build an AI assistant; a plain (base) LLM just continues text and won't reliably answer questions.
  • SFT (supervised fine-tuning) is one mechanism of post-training.
  • Synthetic data generation for SFT: use an LLM to generate synthetic data on which it can then be trained.
  • SFT tends to copy the specific style of the human labelers (which is subjective).
  • SFT is typically done on a small dataset (on the order of 3,000 to 20,000 example question-answer pairs). If the fine-tuning data introduces knowledge the LLM does not already have, you are effectively teaching it to make up (hallucinate) plausible-sounding answers; to avoid that, the answers should be grounded in facts the LLM already knows.
  • RL: reinforcement learning. ChatGPT used it (via RLHF), which was a major breakthrough compared to GPT-3.
  • Mixed precision: weights are stored in 32-bit floats; computations (after conversion) use 16-bit floats, with activations in bf16 and gradients in 16 bits. See the mixed-precision sketch after this list.
  • BPE (byte-pair encoding) gradually builds up from single-character tokens to multi-character tokens by merging frequent pairs, so "tokenize" is recognized as "token" + "ize".
  • After BPE, vector embeddings are used to map tokens (subwords or words) to dense vectors. This enables models to capture token meanings.
  • Types of vector embeddings used: word2vec, GloVe, FastText, and Transformer self-attention-based embeddings.
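
  • A minimal mixed-precision sketch of the point above, assuming PyTorch and hardware with bfloat16 support: the fp32 master weights stay in full precision while the forward/backward math runs under autocast in bf16. (In this simple setup PyTorch keeps the .grad tensors in fp32; large-scale training stacks often keep 16-bit gradients alongside an fp32 master copy.) The model and data are placeholders.

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(512, 512).to(device)        # weights stored in fp32
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 512, device=device)
    target = torch.randn(8, 512, device=device)

    # Matmuls and activations inside this block run in bfloat16.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), target)

    loss.backward()        # gradients land in the fp32 parameter .grad tensors
    optimizer.step()       # fp32 master weights are updated in full precision
    optimizer.zero_grad()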

Byte-Pair Encoding (BPE)

  • Tokenization technique to split words into subwords.
  • Reduces vocabulary size, handling out-of-vocabulary (OOV) words.
  • Represents words as sequences of subwords.
  • The word "tokenize" is recognized as two subwords: "token" + "ize" (see the tokenizer sketch below).
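
  • A small sketch using the Hugging Face transformers library with the GPT-2 byte-level BPE tokenizer to show the subword split; the exact split depends on the tokenizer's learned merge rules, so "token" + "ize" is the expected but not guaranteed output.

    # pip install transformers
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")   # GPT-2 uses byte-level BPE

    print(tok.tokenize("tokenize"))      # e.g. ['token', 'ize']
    print(tok.tokenize("tokenization"))  # e.g. ['token', 'ization']
    print(tok.encode("tokenize"))        # the corresponding integer token IDs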

Vector Embeddings:

  • Map tokens (subwords or words) to dense vectors.

  • Capture semantic relationships, context, and nuances.

  • Enable models to understand token meanings.

  • Vector embeddings transform subwords into dense vectors (see the embedding-lookup sketch after this list).

  • LLM processes vectorized subwords to learn contextual representations.

  • Embeddings are typically learned as part of the model's own training process; the details depend on the model and training setup.

  • Types of Vector Embeddings:

    • Word2Vec (W2V)
    • GloVe
    • FastText
    • Transformers' self-attention-based embeddings (the de facto standard now, e.g., for GPT-3)
  • Vector Embeddings with BPE:

    • Captures nuances and context.
    • Reduced dimensionality: Efficient processing.
    • Better generalization: Handles OOV words.
  • LLM Architectures using BPE and Vector Embeddings:

    • Transformers (e.g., BERT, RoBERTa)
    • XLNet
    • DistilBERT
    • T5 (Text-to-Text Transfer Transformer)
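
  • A minimal PyTorch sketch of the embedding lookup described in this list (vocabulary size, dimensions, and token IDs are illustrative): an embedding table maps integer token IDs to dense vectors, and the table is learned along with the rest of the model.

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 50_000, 768          # illustrative sizes
    embedding = nn.Embedding(vocab_size, embed_dim)

    token_ids = torch.tensor([[31373, 995]])     # a tiny batch of token IDs
    vectors = embedding(token_ids)               # shape: (1, 2, 768)

    # The embedding table is just a learnable weight matrix; gradients flow
    # into it during training like any other parameter.
    print(vectors.shape, embedding.weight.requires_grad)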

Implementation

  • Pre-trained language models (e.g., BERT) provide pre-trained embeddings.
  • Custom embeddings can be learned during training.
  • Popular libraries: Hugging Face Transformers, PyTorch, TensorFlow.
  • ChatGPT (GPT-3/4): Transformer-based architecture:
    • Modified version of the original Transformer.
    • Token embeddings: Learned during training, using a combination of the following:
    • Subword embeddings: Based on byte-level BPE tokenization (not WordPiece, which BERT uses).
    • Positional embeddings: To preserve sequence information (see the positional-embedding sketch after this list).
    • LayerNorm and embedding normalization: To stabilize and normalize embeddings.
  • Meta AI (LLaMA): Transformer-based architecture:
    • Modified version of the original Transformer.
    • Token embeddings, using a combination of the following:
    • Subword embeddings: Based on SentencePiece tokenization.
    • Positional information: Encoded with rotary positional embeddings (RoPE) rather than learned absolute positions.
    • RoPE is a form of relative position encoding, which better captures relative sequence relationships.
  • Other Advanced Embeddings:
    • Rotary Positional Embeddings (RoPE): Used in some transformer variants.
    • ALiBi positional biases: Used in some transformer variants.
    • Learnable (absolute) positional embeddings: Used in some transformer variants.
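
  • A small sketch of learned (absolute) positional embeddings in a GPT-2/3-style setup, assuming PyTorch with illustrative sizes: each position index gets its own learned vector, added to the token embedding so the model can distinguish positions. Rotary (RoPE) and ALiBi variants instead modify the attention computation rather than adding a vector here.

    import torch
    import torch.nn as nn

    vocab_size, max_len, dim = 50_000, 1024, 768
    tok_emb = nn.Embedding(vocab_size, dim)      # token embedding table
    pos_emb = nn.Embedding(max_len, dim)         # learned positional embeddings

    token_ids = torch.randint(0, vocab_size, (2, 16))   # (batch, seq_len)
    positions = torch.arange(token_ids.size(1))         # 0, 1, ..., 15

    # Input to the first transformer block: token embedding + position embedding.
    x = tok_emb(token_ids) + pos_emb(positions)         # (2, 16, 768)
    print(x.shape)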

Training Objectives

  • Masked Language Modeling (MLM): Used in BERT and RoBERTa; the model learns to predict masked input tokens.
  • Next Sentence Prediction (NSP): Used in BERT.
  • Autoregressive Language Modeling: Used in GPT-3/4. Also known as the causal language modeling (CLM) objective: predict the next token from left context (no [MASK] tokens; causal attention masking is used instead). See the sketch after this list.
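
  • A sketch contrasting how the two objectives build training targets, using made-up token IDs: MLM replaces some inputs with a [MASK] ID and predicts only those positions, while CLM shifts the sequence so the label at each position is the next token.

    import torch

    ids = torch.tensor([101, 2023, 2003, 1037, 7953, 102])  # made-up token IDs
    MASK_ID, IGNORE = 103, -100   # -100 is PyTorch's default ignore_index

    # MLM (BERT-style): mask a position; the label is the original token there,
    # and every other position is ignored by the loss.
    mlm_input = ids.clone()
    mlm_input[3] = MASK_ID
    mlm_labels = torch.full_like(ids, IGNORE)
    mlm_labels[3] = ids[3]

    # CLM (GPT-style): inputs are tokens 0..n-1, labels are tokens 1..n,
    # i.e. predict the next token at every position under causal attention.
    clm_input, clm_labels = ids[:-1], ids[1:]

    print(mlm_input.tolist(), mlm_labels.tolist())
    print(clm_input.tolist(), clm_labels.tolist())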

BERT vs RoBERTa vs ChatGPT

  • BERT (Bidirectional Encoder Representations from Transformers)
  • RoBERTa (a derivative of BERT developed by Facebook AI)
  • All three are Transformer-based models; the Transformer architecture was introduced in the 2017 "Attention Is All You Need" paper.
  • BERT and RoBERTa are designed primarily for understanding tasks and are not inherently generative.
  • BERT uses a masked language modeling (MLM) objective where some input tokens are masked, and the model learns to predict them.
  • GPT models, including ChatGPT, use a causal language modeling (CLM) objective, predicting the next token in a sequence without masking.
  • BERT and RoBERTa are designed as bidirectional models. They attend to context on both the left and the right -- suited for tasks like text classification, question answering, and named entity recognition (see the pipeline sketch after this list).
  • GPT (Generative Pre-trained Transformer), including ChatGPT, is a unidirectional or autoregressive model, focusing primarily on predicting the next word in a sequence based on the prior context, which is essential for text generation.
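
  • A short sketch using Hugging Face pipelines to show the practical difference: BERT fills in a masked token using context from both sides, while GPT-2 continues the text left to right. The checkpoints are the standard public ones; exact outputs will vary.

    # pip install transformers
    from transformers import pipeline

    # BERT: bidirectional encoder trained with masked language modeling.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("The capital of France is [MASK].")[0]["token_str"])  # e.g. 'paris'

    # GPT-2: unidirectional (causal) decoder trained to predict the next token.
    generate = pipeline("text-generation", model="gpt2")
    print(generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])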

Key Benefits of LLM

  • Improved contextual understanding: Through advanced embeddings.
  • Better handling of long-range dependencies: Through transformer architecture.
  • Enhanced transfer learning: Through pre-training on large datasets.
  • These advanced embeddings enable models like ChatGPT and Meta AI to better understand and generate human-like language.

Implementation Libraries

  • Hugging Face Transformers library provides implementations for many models.
  • PyTorch and TensorFlow provide built-in support for transformer architectures (see the sketch below).
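
  • As a sketch of the built-in support mentioned above, PyTorch ships stock transformer building blocks (sizes here are illustrative):

    import torch
    import torch.nn as nn

    # A stack of standard encoder layers (self-attention + feed-forward),
    # provided out of the box by PyTorch.
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=4)

    x = torch.randn(2, 10, 256)   # (batch, seq_len, d_model)
    print(encoder(x).shape)       # torch.Size([2, 10, 256])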

Decoder-Only Transformer Architecture

  • A decoder-only Transformer architecture refers to a Transformer model that uses only the decoder block from the original Transformer design introduced by Vaswani et al. in Attention is All You Need (2017).
  • This architecture is commonly used in tasks like text generation, language modeling, and autoregressive sequence prediction.

How It Works:

  • Input Representation: The input tokens are fed directly into the decoder.
  • Self-Attention Mechanism: The model uses self-attention with masking, allowing it to only attend to previously generated tokens or tokens earlier in the sequence.
  • Output Prediction: The model generates text one token at a time, using previously predicted tokens as context (see the causal-mask sketch below).
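
  • A minimal sketch of the causal masking step, assuming PyTorch (sizes are illustrative): each position may attend only to itself and earlier positions, which is what lets a decoder-only model generate one token at a time.

    import torch
    import torch.nn as nn

    dim, heads, seq_len = 64, 4, 6
    x = torch.randn(1, seq_len, dim)

    # Causal mask: True above the diagonal means "this (future) position
    # may not be attended to".
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                             diagonal=1)

    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
    out, weights = attn(x, x, x, attn_mask=causal_mask)

    # Each row of the attention weights sums to 1 and is zero for future positions.
    print(weights[0])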

Why Only the Decoder?:

  • In the original Transformer design, the encoder processes the input sequence, while the decoder generates outputs based on both the encoder's output and previous tokens.
  • However, in a decoder-only model:
    • No Encoder: Model generates text based solely on its input sequence.
    • Causal Masking: The self-attention layer uses a causal mask. Prevents the model from seeing future tokens.

Applications of Decoder-Only Models:

  • Language Models (LMs): GPT models (e.g., GPT-3, GPT-4) are prime examples.
  • Text Completion: Used for autocomplete features in IDEs.
  • Text Summarization & Dialogue Systems: Supports conversation and summarization tasks by predicting text sequentially.

Key Advantages:

  • Efficient for Generation: Since there's no encoder, the model focuses on generation tasks, reducing model complexity for such tasks.
  • Scalable: It scales well for large language models.

Examples of Decoder-Only Models:

  • GPT-3/GPT-4: Focused on text generation tasks.
  • Bloom & OPT Models: Open-source language models following a decoder-only architecture.

Comparison with Other Architectures:

----------------------------------------------------------------------------------------------------
Aspect          Encoder-Only                Decoder-Only                Encoder-Decoder
----------------------------------------------------------------------------------------------------
Use Case        Text classification         Text generation             Translation, summarization
Examples        BERT, RoBERTa               GPT, Bloom, OPT             T5, BART, mT5
Attention Mask  Bidirectional (sees all)    Causal (sees past only)     Bidir. encoder, causal decoder
----------------------------------------------------------------------------------------------------

GPT to ChatGPT Process

Pre-training

  • Creating the foundational (base) LLM is called pre-training.
  • Self-supervised learning (often described as unsupervised) on raw text.
  • The goal is to predict the next word/token.

Fine-Tuning

  • Supervised
  • Train using example conversations -- domain- or task-specific.
  • Train using examples to follow instructions. (Instruction tuning)

Reinforcement Learning from Human Feedback (RLHF)

  • Human labelers rank multiple model outputs for a given prompt.
  • These rankings are used to train a reward model (see the ranking-loss sketch below).
  • The reward model itself is trained with supervised learning on the rankings; the LLM is then optimized against it with reinforcement learning (e.g., PPO).
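
  • A sketch of the reward-model step above, using the usual pairwise (Bradley-Terry-style) ranking loss; the tiny linear "reward model" and the pooled embeddings are illustrative placeholders, not any lab's actual implementation. The model is trained so the response the labelers preferred gets the larger scalar reward.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy reward model: maps a pooled response representation to a scalar score.
    reward_model = nn.Linear(768, 1)

    # Pretend these are pooled embeddings of two responses to the same prompt,
    # where labelers preferred `chosen` over `rejected`.
    chosen = torch.randn(4, 768)
    rejected = torch.randn(4, 768)

    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)

    # Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    loss.backward()
    print(float(loss))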