LLM Notes
Notes from the Stanford CS229 (Machine Learning) lecture notes on LLMs.
HELM - Holistic Evaluation of Language Models (an NLP benchmark). Ratings:
-----------------------------------
Model              Mean win rate
-----------------------------------
GPT-4 0613         0.962
GPT-4 Turbo        0.834
Palmyra X V3 72B   0.821
Palmyra X V2 33B   0.783
Yi 34B             0.772
-----------------------------------
Good site for comparing the performance of various models: https://artificialanalysis.ai/
See also: the Hugging Face Open LLM Leaderboard.
Transformers outperform LSTMs for language modeling: self-attention captures long-range dependencies directly, and training parallelizes across sequence positions instead of proceeding step by step.
Vector Embeddings:
- Map tokens (subwords or words) to dense vectors.
- Capture semantic relationships, context, and nuances, enabling the model to understand token meanings.
- The LLM processes the vectorized subwords to learn contextual representations.
- Embeddings are learned as part of the training process itself; exactly how depends on the model architecture.
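A minimal sketch of an embedding layer, assuming PyTorch (the vocabulary size, dimensions, and token ids below are made up for illustration):

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 50_000   # e.g. number of BPE subwords (illustrative)
    EMBED_DIM = 768       # dimensionality of each dense vector (illustrative)

    # One row of trainable parameters per token id.
    embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

    token_ids = torch.tensor([[15, 2047, 318]])   # a batch of 1 sequence of 3 subword ids
    vectors = embedding(token_ids)                # shape: (1, 3, 768)
    print(vectors.shape)

    # The rows of `embedding.weight` receive gradients during training,
    # so the vectors are learned jointly with the rest of the model.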
Types of Vector Embeddings:
- Word2Vec (W2V) (see the sketch after this list)
- GloVe
- FastText
- Transformers' self-attention-based embeddings (the de facto standard now: GPT-3, etc.)
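For the first, static type above, a minimal Word2Vec sketch, assuming the gensim library (the toy corpus is made up):

    from gensim.models import Word2Vec

    # Toy corpus: each sentence is a list of tokens (illustrative only).
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]

    # Train static (non-contextual) embeddings: one fixed vector per word.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

    print(model.wv["cat"].shape)          # (50,) dense vector
    print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity

These vectors are static: "cat" gets the same vector in every context, unlike transformer self-attention embeddings, which are contextual.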
Vector Embeddings with BPE:
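BPE (Byte-Pair Encoding) builds the subword vocabulary by repeatedly merging the most frequent adjacent symbol pair; each resulting subword id then indexes a row of the embedding matrix. A minimal sketch of the merge loop, in pure Python (the toy corpus and the number of merges are made up for illustration):

    from collections import Counter

    def get_pair_counts(vocab):
        """Count adjacent symbol pairs across the corpus vocabulary."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the pair with its merged symbol."""
        new_vocab = {}
        for word, freq in vocab.items():
            symbols = word.split()
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[" ".join(merged)] = freq
        return new_vocab

    # Toy corpus: words split into characters, with an end-of-word marker </w>.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}

    for step in range(5):                    # 5 merges, for illustration
        pairs = get_pair_counts(vocab)
        best = max(pairs, key=pairs.get)     # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        print(step, best)

Real tokenizers (e.g. GPT-2's byte-level BPE) learn tens of thousands of merges and store them as a merge table that is applied at tokenization time.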
LLM Architectures using BPE and Vector Embeddings:
How It Works: input text is BPE-tokenized into subword ids, the ids are mapped to dense vectors by the embedding layer, and a stack of causally masked self-attention layers turns them into a prediction of the next token; generation repeats this one token at a time.
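A minimal sketch of that pipeline, assuming PyTorch (the sizes, names, and single attention layer are illustrative; real models stack many such layers and add positional information, MLP blocks, and layer norm):

    import torch
    import torch.nn as nn

    # Toy sizes, for illustration only.
    VOCAB, DIM, HEADS = 1000, 64, 4

    class TinyDecoder(nn.Module):
        """Minimal decoder-only LM: embed -> causal self-attention -> next-token logits."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)    # subword id -> dense vector
            self.attn = nn.MultiheadAttention(DIM, HEADS, batch_first=True)
            self.lm_head = nn.Linear(DIM, VOCAB)     # vector -> logits over the vocabulary

        def forward(self, ids):
            x = self.embed(ids)
            T = ids.size(1)
            # Causal mask: position i may only attend to positions <= i ("sees past only").
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
            x, _ = self.attn(x, x, x, attn_mask=mask)
            return self.lm_head(x)                   # next-token logits at each position

    model = TinyDecoder()
    ids = torch.randint(0, VOCAB, (1, 8))            # a batch of 8 subword ids
    print(model(ids).shape)                          # torch.Size([1, 8, 1000])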
Why Only the Decoder?: next-token prediction only ever conditions on the tokens seen so far, so a causal decoder is sufficient; there is no separate input sequence that would need a bidirectional encoder, and a single autoregressive objective serves both pretraining and generation.
Applications of Decoder-Only Models: open-ended text generation, chat/dialogue, code completion, and (via prompting) tasks such as summarization and question answering.
Key Advantages: one simple training objective (predict the next token), an architecture that scales well, and natural support for in-context learning via prompting.
Examples of Decoder-Only Models: GPT (GPT-2/3/4), Bloom, OPT (see the comparison table below).
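A quick generation sketch with one of these models, assuming the Hugging Face transformers library is installed and can download the GPT-2 weights:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")      # GPT-2's byte-level BPE tokenizer
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The transformer architecture", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False)  # greedy decoding
    print(tok.decode(out[0]))

Because GPT-2 uses a byte-level BPE tokenizer, this one example exercises the whole BPE -> embedding -> causal decoding pipeline described above.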
Comparison with Other Architectures:
--------------------------------------------------------------------------------------------------------
Aspect           Encoder-Only               Decoder-Only               Encoder-Decoder
--------------------------------------------------------------------------------------------------------
Use Case         Text classification        Text generation            Translation, summarization
Examples         BERT, RoBERTa              GPT, Bloom, OPT            T5, BART, mT5
Attention Mask   Bidirectional (sees all)   Causal (sees past only)    Bidirectional enc. + causal dec.
--------------------------------------------------------------------------------------------------------