LLM Model Quantization Guide
Topics covered in this note:
- Quantization: llama.cpp, GGUF, INT8, FP8, 4-bit; PTQ (post-training quantization) vs. QAT (quantization-aware training)
- Distillation: teacher-student, logit matching, temperature; TinyBERT, TinyLlama
- Logits-based distillation (learning from classification models) vs. sequence-level distillation (learning from summarization models)
- Logit = log-odds = log(p/(1-p)); comes from logistic regression / GLM classification
- Softmax converts logits to probabilities; argmax picks only the maximum (and loses the rest of the distribution)
- BERT as an understanding (encoder) model vs. generative models: classification, NER (Named Entity Recognition), Q&A, summarization, MLM (Masked Language Model), text generation, translation, semantic search; roughly 250 MB-1.3 GB
- Pretrained BERT (bert-base-uncased): general-purpose, trained on a large corpus
- BERT encoder-decoder (bert2bert): fine-tuned summarization model; Dialog BERT: fine-tuned chatbot
- GPT (Generative Pretrained Transformer): autoregressive (AR) model that predicts the next word, vs. MLM; GPT-3 (175B parameters) is roughly 1,600x bigger than BERT Base (110M)
User: Thava Alagu (thavamuni@gmail.com)
Created: 3/26/2025 18:31
Quantizing a Large Language Model (LLM) is the process of reducing the precision of its numerical computations to make it smaller and more efficient, while maintaining reasonable accuracy. This helps in reducing memory usage, improving inference speed, and enabling deployment on resource-constrained devices like edge hardware.
Post-Training Quantization (PTQ)
- Applies quantization after training; no retraining is needed.
- Converts weights and/or activations from float32 (FP32) or float16 (FP16) to int8 (INT8) or lower.
- Example: 8-bit (INT8) quantization reduces memory size by ~4x relative to FP32.
Common methods (beyond basic PTQ):
Quantization-Aware Training (QAT)
- Simulates quantization during training so the model learns to compensate for the precision loss; usually more accurate than PTQ at very low bit widths.
4-bit and Lower Quantization (GPTQ, AWQ, etc.)
- GPTQ (post-training quantization for GPT-style models): weight-only quantization that minimizes layer-wise reconstruction error.
- AWQ (Activation-aware Weight Quantization): protects the weights most important to the activations while quantizing the rest.
- bitsandbytes (bnb-int4/bnb-int8): 8-bit and 4-bit loading integrated with Hugging Face Transformers.
Using Hugging Face Transformers + bitsandbytes
# Requires the bitsandbytes and accelerate packages to be installed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)  # weights loaded in 8-bit
tokenizer = AutoTokenizer.from_pretrained(model_id)
text = "What is quantization in LLMs?"
inputs = tokenizer(text, return_tensors="pt").to("cuda")
output = model.generate(**inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))
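For 4-bit loading, newer transformers releases use a BitsAndBytesConfig instead of the load_in_8bit flag. A minimal sketch, assuming recent transformers and bitsandbytes versions and access to the same gated Llama 2 checkpoint:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 weight quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 for speed
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",       # gated repo; requires an accepted license and HF token
    device_map="auto",
    quantization_config=bnb_config,
)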
These examples use bitsandbytes for efficient inference.
Using GPTQ for 4-bit Quantization
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized("model_name", device="cuda:0", use_triton=False)
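A minimal usage sketch for the snippet above, assuming "model_name" is a placeholder for an existing GPTQ-quantized checkpoint that ships with its tokenizer:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = "model_name"  # placeholder for a GPTQ-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0", use_triton=False)

inputs = tokenizer("What is quantization in LLMs?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))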
Using torch.quantization (For PyTorch Models)
import torch
from torch.quantization import quantize_dynamic
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
# Dynamically quantize nn.Linear weights to INT8; activations are quantized on the fly at inference
quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
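A quick way to see the effect, continuing from the snippet above (a rough sketch; on-disk sizes vary by model and PyTorch version):

import os

torch.save(model.state_dict(), "bert_fp32.pt")
torch.save(quantized_model.state_dict(), "bert_dynamic_int8.pt")
print("FP32 checkpoint:", os.path.getsize("bert_fp32.pt") / 1e6, "MB")
print("INT8 checkpoint:", os.path.getsize("bert_dynamic_int8.pt") / 1e6, "MB")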
Using llama.cpp for CPU Inference (GGUF Format)
Convert models with llama.cpp to run LLMs efficiently on CPUs.
Uses GGUF format with 4-bit, 5-bit, and 8-bit quantization.
Example:
./quantize model_f32.gguf model_q4_0.gguf q4_0   # newer llama.cpp builds name this binary llama-quantize
./main -m model_q4_0.gguf -p "What is quantization?"   # newer builds: llama-cli
Would you like help quantizing a specific model for your use case?
Model distillation (or Knowledge Distillation) is a technique used to compress a large, high-performing model (Teacher) into a smaller, faster model (Student) while preserving most of its performance.
This technique is commonly used in deep learning, especially in LLMs (Large Language Models), vision models, and speech models, to improve efficiency without significant loss in accuracy.
Model distillation follows a Teacher-Student training framework:
Train a Large Model (Teacher Model) – A big, accurate model is trained first (e.g., GPT-4, Llama 2-70B).
Train a Smaller Model (Student Model) – A lightweight model (e.g., DistilBERT, TinyLlama) is trained to mimic the Teacher.
Use Soft Targets Instead of Hard Labels – The Student learns from the Teacher’s logits (soft probabilities) instead of just correct/incorrect labels.
Optimize with a Custom Loss Function – The loss typically combines a distillation loss (e.g., KL divergence between the student's and teacher's softened probabilities) with the standard cross-entropy loss on the ground-truth labels.
Fine-Tune the Student Model – The smaller model is further optimized to improve accuracy.
Logit Matching (Soft Targets)
Instead of learning from ground-truth labels, the Student learns from the Teacher’s probability distribution.
Helps capture uncertainty and nuances that hard labels miss.
Example:
Teacher: 80% "Dog", 15% "Wolf", 5% "Cat"
Student: 78% "Dog", 17% "Wolf", 5% "Cat"
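For intuition, a tiny sketch that measures how close the two distributions above are, using the KL divergence that distillation minimizes (the numbers are the illustrative ones from this example):

import torch
import torch.nn.functional as F

teacher = torch.tensor([0.80, 0.15, 0.05])  # Dog, Wolf, Cat
student = torch.tensor([0.78, 0.17, 0.05])

# KL(teacher || student): the first argument must be log-probabilities
kl = F.kl_div(student.log(), teacher, reduction="sum")
print("KL divergence:", kl.item())  # small value -> student closely matches the teacher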
Feature-Based Distillation – The student also matches the teacher's intermediate representations (hidden states, attention maps), as in TinyBERT.
Response-Based Distillation – The student matches only the teacher's final outputs (logits or generated text).
Self-Distillation – A model is distilled into a same-sized copy (or its own earlier layers), using its own predictions as soft targets.
Example: Using distilbert (a pre-trained distilled model)

from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
training_args = TrainingArguments(output_dir="./results", per_device_train_batch_size=8)
# A tokenized train_dataset must also be passed for training to actually run, e.g. Trainer(..., train_dataset=dataset)
trainer = Trainer(model=model, args=training_args)
trainer.train()
DistilBERT is a distilled version of BERT, trained using knowledge distillation. A custom distillation loss (logit matching with temperature) looks like this:

import torch
import torch.nn as nn
class DistillationLoss(nn.Module):
    def __init__(self, temperature=2.0):
        super().__init__()
        self.temperature = temperature
        self.kl_loss = nn.KLDivLoss(reduction="batchmean")

    def forward(self, student_logits, teacher_logits):
        teacher_probs = torch.nn.functional.softmax(teacher_logits / self.temperature, dim=1)
        student_probs = torch.nn.functional.log_softmax(student_logits / self.temperature, dim=1)
        return self.kl_loss(student_probs, teacher_probs)
loss_fn = DistillationLoss(temperature=2.0)
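A quick sanity check of the loss defined above, using random logits just to show the call signature:

# Illustrative only: a batch of 4 examples with 2 classes each
teacher_logits = torch.randn(4, 2)
student_logits = torch.randn(4, 2)
print("Distillation loss:", loss_fn(student_logits, teacher_logits).item())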
Using transformers for Distillation

from transformers import AutoModelForSequenceClassification
teacher_model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased")
student_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Train student using teacher outputs instead of hard labels.
# input_ids here is tokenized text, e.g. tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():  # the teacher is frozen
    teacher_logits = teacher_model(input_ids).logits
student_logits = student_model(input_ids).logits
loss = loss_fn(student_logits, teacher_logits)  # the DistillationLoss defined above
Would you like help distilling a specific LLM for your use case? 🚀
what is this example bert model capable of? what does the term logit mean and where does it come from? is it related to logistic regression?
The BERT model (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based model capable of handling various Natural Language Processing (NLP) tasks, such as:
The distilled version of BERT (e.g., DistilBERT, TinyBERT) provides a smaller, faster alternative with comparable accuracy.
A logit is the raw, unnormalized output from a machine learning model before applying an activation function like softmax or sigmoid.
In a classification model like BERT for text classification, the last layer produces logits, which are then converted into probabilities.
Example:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load pre-trained BERT model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "BERT is a powerful NLP model!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# Logits are raw scores before softmax
logits = outputs.logits
print(logits)
Example output (logits):
tensor([[ 2.54, -1.23]]) # Two classes: positive (2.54) & negative (-1.23)
The higher logit value (2.54) corresponds to the positive class.
To convert logits into probabilities, apply softmax:
probabilities = torch.nn.functional.softmax(logits, dim=1)
print(probabilities)
Output:
tensor([[0.95, 0.05]]) # 95% positive, 5% negative
Yes! Logits originate from logistic regression and are directly related to log-odds.
In logistic regression, we predict probabilities using the sigmoid function, which maps any real number to a range between 0 and 1:

p = sigmoid(z) = 1 / (1 + e^(-z))

Where z = w·x + b is the linear score. Inverting the sigmoid gives the logit function, which represents the log-odds:

logit(p) = log(p / (1 - p)) = z
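A quick numeric check of this round trip (plain Python, no ML libraries needed):

import math

p = 0.8
z = math.log(p / (1 - p))        # logit / log-odds ~= 1.386
p_back = 1 / (1 + math.exp(-z))  # sigmoid inverts it back to 0.8
print(z, p_back)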
✔ BERT is capable of text classification, question answering, NER, and more.
✔ Logits are raw scores produced before converting them into probabilities.
✔ Logits originate from logistic regression and represent log-odds.
✔ In multi-class classification, logits are converted to probabilities using softmax instead of sigmoid.
Would you like a deeper dive into how BERT’s attention mechanism works? 🚀
How come the GPT model from OpenAI became so much more popular than BERT (from Google)? How big is the BERT model?
While BERT (by Google, 2018) and GPT (by OpenAI, 2018+) are both transformer-based models, GPT has gained significantly more popularity due to several key reasons:
Feature | BERT | GPT |
---|---|---|
Training Approach | Masked Language Model (MLM) – Predicts missing words | Autoregressive (AR) Model – Predicts next word |
Use Case | Text classification, search ranking, NLP tasks | Chatbots, text generation, creative writing |
Input Processing | Processes entire input at once | Generates output word by word |
Interactivity | More useful for short, structured tasks | More natural for dialogues, storytelling |
Why does this matter? GPT's autoregressive, word-by-word generation made it a natural fit for conversational products like ChatGPT that users interact with directly, while BERT mostly powers behind-the-scenes tasks such as search ranking and classification.
BERT has multiple versions with different sizes:
Model | Parameters | Size |
---|---|---|
BERT Base | 110M | 400MB |
BERT Large | 340M | 1.3GB |
DistilBERT | 66M | 250MB |
Yes! BERT is still widely used for:
✔ Google Search Ranking (since 2019)
✔ Sentiment Analysis
✔ Chatbots (fine-tuned versions like DialogBERT)
✔ Named Entity Recognition (NER)
✔ Summarization (via variants like BART, T5)
However, GPT models dominate chatbots, content creation, and general AI assistants. 🚀
Would you like help fine-tuning BERT or exploring newer hybrid models like T5, BART, or LLaMA?
I still do not understand the basic approach of distillation process. Especially this line: teacher_logits = teacher_model(input_ids).logits ; Where does the input_ids come from ? In order to distill, do you need the original training data or do you ask questions to the teacher model and get the answers and use that answers to train the student model ?
Great questions! Let me clarify the distillation process step by step and answer your doubts in detail.
The key idea behind knowledge distillation is that instead of training a small model (student) from scratch using the original dataset, we use a large model (teacher) to generate "soft labels" (logits) for the student model to learn from.
Where does input_ids come from?

The input_ids represent the tokenized text that is passed into the model.
Example:
Given a sentence:
"The weather is nice today."
It is converted into token IDs using a tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "The weather is nice today."
inputs = tokenizer(text, return_tensors="pt") # 'pt' means PyTorch tensor
print(inputs)
Output:
{'input_ids': tensor([[101, 1996, 4633, 2003, 3835, 2651, 1012, 102]])}
This input_ids tensor is what gets passed into the model. Then, when we do:

teacher_logits = teacher_model(input_ids).logits

- input_ids = the tokenized text (input to the model).
- teacher_model(input_ids) = runs the text through the teacher model.
- .logits = returns the raw output scores before applying softmax.

Do you need the original training data to distill? Not necessarily! There are two approaches:
If you have the original dataset, you can pass the same inputs through both the teacher and student models.
Example:
input_ids = tokenizer(["Some text here"], return_tensors="pt").input_ids
teacher_logits = teacher_model(input_ids).logits # Teacher's outputs
student_logits = student_model(input_ids).logits # Student's outputs
loss = distillation_loss(student_logits, teacher_logits)  # Train student (e.g., the KL-divergence DistillationLoss defined earlier)
The student model tries to match the teacher’s logits using a loss function (KL Divergence).
If the original dataset is not available, you can generate inputs by querying the teacher model.
This is called zero-shot or self-supervised distillation.
Example:
queries = ["What is knowledge distillation?", "How does BERT work?"]
input_ids = tokenizer(queries, return_tensors="pt").input_ids
teacher_answers = teacher_model.generate(input_ids)  # Generate answers (requires a generative teacher model)
student_model.train_on(teacher_answers)  # Fine-tune the student (pseudocode: in practice, train with a standard loss on the teacher's outputs)
Here, we ask the teacher model questions and use its responses to train the student.
This is useful when we don’t have labeled training data.
The student doesn’t just learn the final label; it learns how confident the teacher is about different possible answers.
Example:
If the task is sentiment classification, and we input:
"This movie was amazing!"
The teacher model logits might look like this (before applying softmax):
[2.5, -1.2] # (Positive sentiment: 2.5, Negative sentiment: -1.2)
Instead of just learning the hard label "Positive", the student learns the teacher's relative confidence (after softmax, roughly 98% positive vs. 2% negative, and softer still with temperature scaling).
This helps the student learn better decision boundaries.
Approach | Uses Original Data? | How is Student Trained? |
---|---|---|
Supervised Distillation | ✅ Yes | Use same dataset, get teacher logits, train student |
Self-Supervised Distillation | ❌ No | Generate new queries, get teacher’s responses, train student |
Zero-Shot Distillation | ❌ No | Query teacher with random text, learn from teacher’s outputs |
Tell me if these statements are accurate and correct them as necessary: Unless the distillation process uses the original training dataset (i.e. Supervised Distillation), there is no easy way to evaluate the quality and performance of the student model since we don't know the entire capabilities of the teacher model. Typically in the student model we reduce the total number of parameters and do not change the size (total number of bits) of each parameter (but Quantization does). A single teacher (BERT) model is capable of doing different things like summarization, sentiment analysis, classification, semantic search, etc. The training methods for each of these capabilities would be slightly different. So the distillation also follows the same pattern of using different ways of distillation?
Your statements are mostly accurate but need a few refinements. Below is the corrected version with explanations:
Statement 1: ❌ Incorrect (Partially True) → Needs Clarification
✅ Corrected:
"Unless the distillation process uses the original training dataset (i.e., Supervised Distillation), evaluating the quality and performance of the student model is more challenging because we lack labeled ground truth data. However, we can still estimate performance by comparing the student’s outputs with the teacher’s on a held-out dataset or by using external benchmarks."
Why?
Even without the original dataset, we can evaluate the student model by:
- Comparing the student's outputs to the teacher's on a held-out or synthetic dataset (e.g., agreement rate, KL divergence).
- Running the student on external benchmarks for the tasks it is meant to cover.
So, while evaluation is harder, it is not impossible.
Statement 2: ✅ Mostly Correct, Needs Refinement
✅ Corrected:
"Typically, the student model reduces the total number of parameters but does not change the precision (total number of bits per parameter). However, quantization is a separate technique that reduces model size by lowering the precision of parameters (e.g., from 32-bit floating point to 8-bit integers)."
Why?
- Distillation reduces the number of parameters but typically keeps the same numeric precision per parameter (e.g., FP16/FP32).
- Quantization keeps the parameter count but lowers the precision per parameter (e.g., FP32 → INT8).
- The two are independent and can be combined: a distilled model can also be quantized.
Statement 3: ✅ Correct!
✅ Your statement is correct:
"A single teacher (BERT) model is capable of performing different tasks like summarization, sentiment analysis, classification, and semantic search, depending on how it is fine-tuned."
Why?
Pretrained BERT is a general-purpose model trained with Masked Language Modeling (MLM).
To perform specific tasks, we fine-tune BERT on different datasets:
- Sentiment analysis → labeled reviews (e.g., IMDB)
- Question answering → SQuAD
- Named entity recognition → CoNLL-2003
- Semantic search → contrastive training (sentence-transformers)
Statement 4: ✅ Correct!
✅ Your statement is correct with minor rewording:
"Since different tasks (e.g., summarization, classification, sentiment analysis) require different training methods, the distillation process also follows different approaches depending on the task."
Why?
Distillation is task-dependent because the way the teacher model is trained affects how knowledge is transferred to the student.
Some examples:
- Classification (e.g., sentiment analysis): logits-based distillation with a KL-divergence loss on the soft class probabilities.
- Summarization / generation: sequence-level distillation, where the student is trained with cross-entropy on the teacher's generated outputs.
Conclusion:
1️⃣ Unless the distillation process uses the original training dataset (i.e., Supervised Distillation), evaluating the quality and performance of the student model is more challenging, but it can still be done using proxy methods or external benchmarks.
2️⃣ Typically, the student model reduces the total number of parameters but does not change the precision (total number of bits per parameter). Quantization, on the other hand, reduces parameter precision to make the model smaller.
3️⃣ A single teacher (BERT) model can perform different tasks like summarization, sentiment analysis, classification, and semantic search, depending on how it is fine-tuned.
4️⃣ Since different tasks (e.g., summarization, classification, sentiment analysis) require different training methods, the distillation process also follows different approaches depending on the task.
🚀 Let me know if you need more clarification! Would you like an example of distilling BERT for a specific task (e.g., classification vs. summarization)?
yes.
Let's go through a practical example of distilling BERT for two different tasks: (1) sentiment analysis (classification) and (2) summarization (sequence generation).

Task 1: Sentiment Analysis (Logits-Based Distillation)
The teacher is a BERT model (bert-base-uncased) already fine-tuned for sentiment analysis on the IMDB dataset.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load teacher model (fine-tuned BERT for sentiment analysis)
teacher_model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Example sentence
text = "I absolutely loved this movie. It was fantastic!"
inputs = tokenizer(text, return_tensors="pt")
# Get teacher logits
with torch.no_grad():
teacher_outputs = teacher_model(**inputs)
teacher_logits = teacher_outputs.logits
print("Teacher Logits:", teacher_logits)
👆 Logits-based distillation: The student model will learn from these logits rather than just final labels (0/1).
We use KL-Divergence Loss to train the student model to mimic the teacher’s logits.
import torch.nn.functional as F
# Load student model (smaller version of BERT)
from transformers import DistilBertForSequenceClassification
student_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# Forward pass on student model.
# DistilBERT does not take token_type_ids, so pass only the fields it accepts.
student_outputs = student_model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
student_logits = student_outputs.logits
# Compute KL Divergence loss
temperature = 2.0 # Higher temperature makes logits smoother
teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_probs = F.log_softmax(student_logits / temperature, dim=-1)
loss = F.kl_div(student_probs, teacher_probs, reduction="batchmean") * (temperature**2)
print("Distillation Loss:", loss.item())
👉 Key points:
- The same temperature is applied to both teacher and student logits to soften the distributions.
- The KL-divergence loss pushes the student's distribution toward the teacher's.
- Multiplying by temperature² keeps the gradient magnitudes comparable to the unscaled loss.
Task 2: Summarization (Sequence-Level Distillation)
This approach is different because summarization is a sequence generation task rather than classification. We use a summarization model like BART as the teacher (instead of standard BERT).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load BART teacher model for summarization, along with its own tokenizer
teacher_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

text = "The Amazon rainforest is the largest tropical rainforest in the world. It produces 20% of Earth's oxygen."
inputs = tokenizer(text, return_tensors="pt")

# Generate summary from teacher model
with torch.no_grad():
    summary_ids = teacher_model.generate(**inputs, max_length=30)
teacher_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Teacher Summary:", teacher_summary)
We use sequence-level loss (cross-entropy) instead of KL-divergence.
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load student model (T5-small) and its own tokenizer
student_tokenizer = AutoTokenizer.from_pretrained("t5-small")
student_model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 expects a task prefix, and the source text must be re-tokenized with the student's tokenizer
student_inputs = student_tokenizer("summarize: " + text, return_tensors="pt")

# Generate summary from student model
student_summary_ids = student_model.generate(**student_inputs, max_length=30)
student_summary = student_tokenizer.decode(student_summary_ids[0], skip_special_tokens=True)
print("Student Summary:", student_summary)
👆 Sequence loss (Cross-Entropy Loss) is used to match teacher and student outputs at the token level.
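To actually compute that sequence-level loss, one common sketch is to treat the teacher's summary as the target for the student; passing labels to a Hugging Face seq2seq model makes it return the token-level cross-entropy loss. This continues the snippets above, reusing text, teacher_summary, student_tokenizer, and student_model:

import torch

student_inputs = student_tokenizer("summarize: " + text, return_tensors="pt")
labels = student_tokenizer(teacher_summary, return_tensors="pt").input_ids

# With labels supplied, the model computes token-level cross-entropy internally
outputs = student_model(**student_inputs, labels=labels)
loss = outputs.loss
loss.backward()
print("Sequence-level distillation loss:", loss.item())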
Task | Teacher Model | Student Model | Loss Function | Distillation Type |
---|---|---|---|---|
Sentiment Analysis | BERT (textattack/bert-imdb) | DistilBERT | KL Divergence Loss | Logits-Based Distillation |
Summarization | BART (facebook/bart-large-cnn) | T5-small | Cross-Entropy Loss | Sequence-Level Distillation |
Would you like a full training loop for either of these examples? 🚀
Are there separate BERT models for tasks i.e. for Sentiment Analysis or Summarization and such ? There are also Mixture of Experts models where single model is capable of various different tasks -- how does training and inference work for MOE models ?
Yes and No. There is a single base BERT model, but different versions exist based on how they are fine-tuned for specific tasks.
The original BERT (bert-base-uncased) is a general-purpose model trained on a large corpus (Wikipedia + BooksCorpus). Task-specific versions are fine-tuned from it:

Task | Fine-Tuned BERT Model | Differences |
---|---|---|
Sentiment Analysis | textattack/bert-base-uncased-imdb | Fine-tuned on IMDB movie reviews |
Summarization | bert2bert (BERT Encoder-Decoder) | Fine-tuned on CNN/DailyMail |
Question Answering | bert-large-uncased-whole-word-masking-finetuned-squad | Fine-tuned on SQuAD dataset |
Named Entity Recognition (NER) | dbmdz/bert-large-cased-finetuned-conll03-english | Fine-tuned on CoNLL-2003 dataset |
Semantic Search | sentence-transformers/all-MiniLM-L6-v2 | Fine-tuned with contrastive learning |
So, while there is one base BERT, different fine-tuned versions exist for different tasks.
Unlike fine-tuned BERT models (where each model specializes in a single task), MoE models can dynamically specialize in multiple tasks.
How training and inference work in an MoE model:

1. Expert Networks – The model contains many parallel sub-networks ("experts"), typically feed-forward blocks inside each transformer layer.
2. Routing Mechanism – A small gating (router) network scores the experts for each input token. Instead of sending input through all experts, the router chooses a few experts based on the input (for example, the top 1-2 experts per token).
3. Sparse Activation – Only the selected experts run, so each token activates a small fraction of the total parameters, which keeps inference cost low even though the overall model is very large.
4. Loss Calculation – Training combines the normal task loss with an auxiliary load-balancing loss that encourages the router to spread tokens evenly across experts.

🚀 Example: models such as Switch Transformer and Mixtral follow this pattern; all experts and the router are trained jointly, end to end.
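A minimal, illustrative sketch of top-k expert routing in PyTorch (not taken from any specific MoE implementation; real systems add load-balancing losses and expert parallelism):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts layer (not a production implementation)."""
    def __init__(self, d_model=64, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate_logits = self.router(x)           # score each expert per token
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # only the chosen experts run (sparse activation)
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(8, 64)      # 8 token embeddings
print(layer(tokens).shape)       # torch.Size([8, 64])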
Feature | Fine-Tuned BERT | Mixture of Experts (MoE) |
---|---|---|
Model Type | One model per task | One model, many experts |
Training | Fine-tuned separately for each task | All experts are trained together |
Inference | Runs entire model | Only activates a few experts |
Efficiency | Fixed compute per task | More efficient (activates fewer parameters) |
Scalability | Limited (requires multiple models for multiple tasks) | Scales to thousands of tasks |
Would you like an example of how MoE models are implemented in code? 🚀
Temperature scaling is a technique used in knowledge distillation, softmax functions, and model calibration. It controls the "softness" of probability distributions by adjusting the logits before applying the softmax function.
In knowledge distillation, a student model learns from the teacher's soft probabilities instead of hard labels. The temperature (T) is used to smooth the teacher's output distribution:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

Where:
- z_i is the logit for class i,
- T is the temperature (T = 1 recovers the standard softmax).
🔹 Higher T → Smoother probabilities
🔹 Lower T → Sharper probabilities (closer to one-hot labels)

High Temperature (e.g., T = 2.0)
- The softened distribution also shows the student how the teacher ranks the non-top classes.

Low Temperature (e.g., T = 1.0 or T < 1.0)
- The distribution stays sharp, close to hard labels, so the student gets little information beyond the top class.
✅ Best Practice:
- Use a higher temperature during distillation (T = 2.0 to 5.0).
- Use T = 1.0 for inference to get normal probabilities.

🔹 Example:
- Without temperature scaling (T = 1.0) → Model says: "I'm 99% sure this is a cat."
- With temperature scaling (T = 2.0) → Model says: "I'm 80% sure this is a cat, but 15% chance it's a dog."

✅ Best Practice:
- Train the model normally (T = 1.0), then apply post-training temperature scaling to calibrate the model.

In GPT and other generative models, temperature is used to control randomness during inference.
Low Temperature (T < 1.0) → more focused, deterministic output.
High Temperature (T > 1.0) → more diverse, creative (and sometimes less coherent) output.

✅ Best Practice: use a low temperature for factual or precise tasks and a higher temperature for creative generation (see the table below).
Use Case | Effect of High Temperature (T > 1.0 ) |
Effect of Low Temperature (T < 1.0 ) |
---|---|---|
Knowledge Distillation | Smoother probabilities, better learning | Harder for student to learn |
Model Calibration | Fixes overconfidence in predictions | Makes model more confident |
LLMs (GPT, Chatbots) | More creative/random responses | More deterministic responses |
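For concreteness, a small sketch of how temperature reshapes a softmax, using the example logits [2.5, -1.2] from the sentiment example earlier:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.5, -1.2])

for T in [0.5, 1.0, 2.0, 5.0]:
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {probs.tolist()}")
# Lower T sharpens the distribution toward the top class; higher T flattens it toward uniform.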
Would you like a code example showing how temperature affects softmax outputs? 🚀
You said: In GPT and other generative models, temperature is used to control randomness during inference. How does it work ? Does it randomly choose among top few potential words while generating vs selecting the best possible next word when T=0 ? This aspect of temperature does not bear any significance during training ?
Yes, you're absolutely right! In GPT and other generative models, temperature affects the probability distribution of the next token (word) during inference.
🔹 When generating text, the model predicts the next word/token based on probabilities.
🔹 Temperature scales these probabilities before making a choice.
GPT uses a softmax function to convert logits (raw scores) into probabilities. Temperature (T) scales these logits before the softmax is applied:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

Where:
- z_i is the logit for token i,
- T is the temperature.
🔥 Lower Temperature (T → 0) → More Deterministic
- Makes the model focus on the most probable word.
- Almost always selects the top-ranked token (argmax behavior).
- Example: with a low T (0.2-0.5), the model answers "Paris" almost 100% of the time.

🎭 Higher Temperature (T > 1) → More Randomness
- Softens the probability distribution, making lower-ranked tokens more likely to be chosen.
- The model may choose from a wider range of words.
- Example: with a high T (1.5), the model might continue with "ventured", "fought", "discovered", or "escaped".

🚀 Extreme Cases:
- T = 0 (greedy decoding) → always the single most likely token, fully deterministic.
- Very large T → the distribution approaches uniform, so the output becomes close to random.
🔹 No, temperature is only used during inference.
🔹 During training, the model learns to predict probabilities of words based on the dataset.
🔹 The logits are not scaled with temperature because the goal is to make the model learn accurate likelihoods rather than adjust randomness.
✅ Temperature is only applied at inference time when we need to control diversity and creativity in text generation.
Instead of just using temperature, other sampling techniques are often used together:
- Top-k Sampling: keep only the k most probable tokens before sampling. With k=50, only the top 50 words are considered.
- Top-p (Nucleus) Sampling: keep the smallest set of tokens whose cumulative probability reaches p. With p=0.9, it selects the fewest words whose probabilities sum to 90%.

✅ Best Practice: combine temperature with top-k or top-p sampling to balance diversity and coherence.
Temperature (T) | Behavior | Use Case |
---|---|---|
T = 0 | Always picks the most probable token (greedy) | Fact-based answers, deterministic outputs |
T = 0.2 - 0.5 | Slightly more variation but still precise | Coding, summarization, search |
T = 0.7 - 1.0 | Balanced creativity vs. coherence | Chatbots, storytelling, casual conversation |
T = 1.5 - 2.0 | Very creative but sometimes incoherent | Poetry, humor, brainstorming |
I can show you how temperature affects text generation in Python using GPT-based models like OpenAI's API or Hugging Face Transformers!
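For example, a rough sketch with Hugging Face Transformers, using the small gpt2 checkpoint purely for illustration (the prompt is made up to echo the example above; outputs will vary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The knight rode into the forest and", return_tensors="pt")

for temperature in [0.2, 0.7, 1.5]:
    torch.manual_seed(0)  # same seed so only the temperature differs
    out = model.generate(
        **inputs,
        do_sample=True,            # sampling is required for temperature to have an effect
        temperature=temperature,
        top_k=50,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"T={temperature}:", tokenizer.decode(out[0], skip_special_tokens=True))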