Summary
Tags: #RAG #embedding #LLM
The following is a ChatGPT conversation about LLMs, embeddings, and RAG.
It highlights the basic concepts clearly.
I also had a chat with Gemini about building a custom semantic search tool over my own notes; it is attached at the end. Gemini suggested solutions such as AnythingLLM (tool), Obsidian search plugins, LlamaIndex (framework), LangChain (framework), and Sentence Transformers with FAISS (libraries).
User: Thava Alagu (thavamuni@gmail.com)
Created: 18/03/2025, 13:04:08
ChatGPT
How is RAG capability used by popular AI coding assistants like Cursor, Codeium, Tabnine, Sourcegraph Cody, and others?
Retrieval-Augmented Generation (RAG) is a critical capability in modern AI-powered coding assistants like Cursor, Codeium, Tabnine, Sourcegraph Cody, and others. These assistants leverage RAG to enhance code completion, provide contextual suggestions, and answer complex coding questions more effectively. Here’s how RAG is used in these tools:
RAG is a game-changer for AI coding assistants, allowing them to move beyond generic LLM-generated code to context-aware, project-specific suggestions. Popular tools like Cursor, Codeium, Tabnine, and Sourcegraph Cody leverage retrieval techniques to enhance code completions, debugging, documentation assistance, and project-wide understanding.
Would you like a deep dive into how specific tools implement RAG with examples? 🚀
How are embeddings relevant and used when applying RAG for coding assistance?
Embeddings play a critical role in implementing Retrieval-Augmented Generation (RAG) for AI-powered coding assistants like Cursor, Codeium, Tabnine, Sourcegraph Cody, and others. They allow AI to retrieve relevant code snippets, documentation, and past usage patterns efficiently before generating responses.
These embeddings are indexed and stored, allowing the AI to retrieve relevant information before generating suggestions.
Example: if your codebase defines a hook named useDebounce and you ask how to "debounce an input", embeddings match the intent even though that exact phrase never appears in the code.
✅ Semantic Search for Code & Documentation – Retrieves relevant snippets even when phrasing differs.
✅ Multi-File Code Understanding – Helps AI assistants analyze code across multiple files and modules.
✅ Faster Query Resolution – Quickly fetches relevant functions, classes, and past implementations.
✅ Project-Specific Context – Ensures AI completions are aligned with your project’s unique structure.
✅ Better Code Suggestions & Refactoring – Embeddings help improve accuracy and consistency of AI-generated code.
Embeddings bridge the gap between raw LLM capabilities and real-world, project-specific code generation. By converting code, documentation, and queries into vector representations, AI coding assistants can retrieve and apply relevant context before generating responses.
Would you like a detailed example of embeddings in action (e.g., a Python script using FAISS or Pinecone)? 🚀
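A tiny illustration of this kind of semantic matching: a minimal sketch using the open-source sentence-transformers library (model name and snippets are illustrative; no vector database involved):

```python
# Minimal semantic search over code snippets with sentence-transformers.
# The model name and snippets below are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

snippets = [
    "def use_debounce(value, delay): ...  # debounce hook for text inputs",
    "def fetch_user(user_id): ...  # REST call to /users/{id}",
]
query = "how do I debounce an input?"

snippet_vecs = model.encode(snippets, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks the debounce snippet highest even though the
# query's wording never appears verbatim in the code.
scores = util.cos_sim(query_vec, snippet_vecs)[0]
best = int(scores.argmax())
print(snippets[best], float(scores[best]))
```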
So, embeddings are used mainly for semantic search to construct the proper context. Do popular AI chat tools like ChatGPT, claude.ai, and deepseek.com use RAG by default to enhance our prompt text with additional context (if we don't provide additional documentation or attachments)? Are embeddings irrelevant when we don't use RAG, or are they an integral part of all LLM chat activity one way or another? Does the LLM itself operate fundamentally on embeddings or not?
Yes, embeddings are fundamental to how all Large Language Models (LLMs) operate, but their role differs depending on whether RAG is being used or not. Let’s break this down step by step.
Do ChatGPT, Claude, or DeepSeek use RAG by default?
✅ No, not by default—unless they are explicitly provided with external knowledge sources like documentation, files, or APIs.
👉 So, embeddings for retrieval are NOT always used unless RAG is enabled.
Are embeddings still relevant without RAG?
🔹 Yes, but in a different way! Even when RAG is not used, embeddings are still a core part of how LLMs generate responses.
👉 So, embeddings are ALWAYS used by an LLM internally, whether RAG is involved or not.
Does the LLM itself operate on embeddings?
✅ Absolutely—LLMs fundamentally operate using embeddings! Even without retrieval, embeddings drive everything an LLM does.
| Aspect | LLM Internal Embeddings | RAG-Based Embeddings |
|---|---|---|
| Purpose | Helps LLM process, predict, and generate text | Helps retrieve external context |
| When used? | Always, in every model response | Only if retrieval (RAG) is enabled |
| Example usage | Understanding sentence meaning, predicting next words | Finding relevant documentation, previous code snippets |
| Storage | Inside the model (learned during training) | External vector database (FAISS, Pinecone, Weaviate) |
👉 So, embeddings are core to all LLMs, but RAG-based embeddings are an additional layer used for retrieval.
1️⃣ All LLMs (ChatGPT, Claude, DeepSeek, etc.) always use embeddings internally to process and generate text.
2️⃣ RAG (Retrieval-Augmented Generation) is separate and adds retrieval-based embeddings when external knowledge is required.
3️⃣ If you don’t provide extra documentation or attachments, these tools typically do not use RAG—they rely only on their built-in training.
4️⃣ Even without RAG, embeddings are at the core of how LLMs work.
Would you like an example of embeddings in action with a simple Python script? 🚀
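A minimal sketch of those internal embeddings, using the small open GPT-2 model via Hugging Face transformers (the model choice is illustrative; no retrieval is involved):

```python
# The "internal" embeddings every LLM uses: each token of the prompt is
# mapped to a learned vector before any generation happens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("Embeddings drive everything an LLM does", return_tensors="pt")
with torch.no_grad():
    token_vectors = model.get_input_embeddings()(inputs["input_ids"])

# One 768-dimensional vector per token (GPT-2's hidden size).
print(token_vectors.shape)  # torch.Size([1, num_tokens, 768])
```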
If I am implementing an AI code assistance tool, I need to be aware of these. The embedding mechanism of the LLM is internal to the LLM, while my tool retrieves additional context using my own embeddings. Is this additional context—say, API documentation—a plain list of words, or is it in embedding format? Should my AI code assistance tool pass this additional context to the LLM as plain text or as embeddings? Are the different embedding mechanisms used by LLMs and agents expected to be compatible, or are they independent?
You're correct that an LLM has its own internal embeddings, while your AI code assistance tool will use its own retrieval embeddings to fetch relevant context (e.g., API documentation, previous code, etc.). However, the way you pass this additional context to the LLM is crucial.
✅ Pass it as a list of words (text), NOT as embeddings.
LLMs do not accept embeddings directly as input. Instead, they process raw text and internally convert it into embeddings before generating responses.
🔹 Example: If a developer asks:
💬 “How do I use the Stripe API to create a customer?”
Your tool:
1. Uses vector search to find relevant API documentation.
2. Retrieves a snippet like:

   {
     "method": "POST",
     "endpoint": "/v1/customers",
     "params": ["email", "name", "description"],
     "example": "curl https://api.stripe.com/v1/customers -d email='user@example.com' -d name='John Doe'"
   }

3. Passes this as raw text in the LLM prompt:

   Here is the relevant API documentation for Stripe's customer creation:
   - Endpoint: POST /v1/customers
   - Required Params: email, name, description
   - Example usage: curl https://api.stripe.com/v1/customers -d email='user@example.com' -d name='John Doe'

   Based on this, generate a Node.js function to create a customer using the Stripe SDK.
✅ The LLM will process this natural language input and generate the appropriate function.
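A minimal sketch of this pattern, assuming the openai Python client (v1+) and an illustrative model name; the retrieved snippet is hard-coded to stand in for whatever your vector search returns:

```python
# Retrieved documentation is injected into the prompt as plain text;
# the LLM never receives our retrieval embeddings directly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in for whatever your vector search returned, decoded back to text.
retrieved_doc = (
    "Endpoint: POST /v1/customers\n"
    "Required params: email, name, description\n"
    "Example: curl https://api.stripe.com/v1/customers -d email='user@example.com'"
)

prompt = (
    "Here is the relevant API documentation for Stripe's customer creation:\n"
    f"{retrieved_doc}\n\n"
    "Based on this, generate a Node.js function to create a customer "
    "using the Stripe SDK."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)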
Are the LLM's internal embeddings and your tool's retrieval embeddings compatible?
🔹 They are independent and serve different purposes, but they are complementary.
| Aspect | LLM Internal Embeddings | Your AI Tool's Embeddings (RAG) |
|---|---|---|
| Purpose | Helps the model process & generate responses | Helps retrieve relevant external data |
| Where used? | Inside the LLM during inference | In your retrieval pipeline |
| Format | Stored in the model's parameters | Stored in a vector database |
| How used? | Transforms input text into vector space for prediction | Matches query embeddings with stored knowledge |
| Compatibility | Not directly usable by external systems | You control how your embeddings are stored & retrieved |
You just retrieve relevant context using your embeddings and pass it as raw text to the LLM.
✔ Use embeddings for retrieval, but always pass text to the LLM.
✔ Optimize context injection – You don’t need to pass an entire API document, just the most relevant parts.
✔ Format the retrieved context clearly – Use bullet points, code blocks, or structured formats.
✔ Fine-tune context length – LLMs have token limits, so pass only what’s necessary.
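For the token-limit point, a small sketch using tiktoken to keep retrieved context within a budget (the budget and model name are illustrative assumptions):

```python
# Trim retrieved context to a token budget before building the prompt.
import tiktoken

def truncate_to_budget(text: str, budget: int = 1500, model: str = "gpt-4") -> str:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return enc.decode(tokens[:budget]) if len(tokens) > budget else text

retrieved_context = "... long API documentation fetched by vector search ..."
prompt_context = truncate_to_budget(retrieved_context)
print(len(prompt_context))
```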
Would you like a sample Python implementation for embedding-based retrieval using FAISS or Pinecone? 🚀
So RAG-based embeddings involve the raw embedding mechanism used to index domain-specific knowledge (like the entire code base of my repository) and the tricks and algorithms to extract relevant information from that domain source. What are the popular choices today for the raw embedding technology, and for the algorithms that extract relevant information using these embeddings?
Yes, you’ve got it right! RAG-based embeddings involve two key aspects:
1️⃣ Raw embedding technology – Converts your domain-specific knowledge (e.g., API docs, codebase, previous chat history) into high-dimensional vectors and stores them in a vector database.
2️⃣ Retrieval algorithms – Efficiently extract the most relevant information from the stored embeddings based on a user query.
These methods generate embeddings for indexing domain knowledge:
| Embedding Model | Provider | Strengths | Limitations |
|---|---|---|---|
| text-embedding-ada-002 | OpenAI | Best for general-purpose text/code retrieval | API calls add latency & cost |
| text-bison / code-bison | Google | Good for Google Cloud-native solutions | Limited fine-tuning |
| embed-multilingual-v3 | Cohere | Multilingual embedding support | API-dependent |
| Hugging Face models (sentence-transformers, BGE, E5) | Open-source | No API cost, flexible tuning | Needs self-hosting |
| Facebook (Meta) FAISS | Meta | Optimized for large-scale retrieval | Requires a separate model for embedding |
| Milvus / Weaviate custom embeddings | Open-source | Good for hybrid search | Higher setup complexity |
🔹 For codebases?
- code-bert, code-t5, starcoder embeddings → specially trained for code search & retrieval.
- text-embedding-ada-002 → surprisingly good for code search too!

Once embeddings are stored, you need a way to query and retrieve the most relevant pieces of knowledge. The key techniques include:
| Algorithm | How It Works | Used In |
|---|---|---|
| K-Nearest Neighbors (KNN) | Finds the K most similar embeddings to the query | FAISS, Pinecone, Weaviate |
| Approximate Nearest Neighbors (ANN) | Faster, scalable version of KNN for big datasets | FAISS, HNSWlib, Milvus |
| Hierarchical Navigable Small World (HNSW) | Graph-based search; efficient for large-scale retrieval | FAISS, Milvus, Weaviate |
| Product Quantization (PQ) | Compresses embeddings for memory-efficient retrieval | FAISS, ScaNN |
| Hybrid Search (Vector + Keyword) | Mixes semantic (vector) search with traditional keyword search | Weaviate, Vespa |
| Re-ranking (Cross-Encoder) | Re-ranks the top-N retrieved documents using another model | ColBERT, BGE Re-ranker |
🔹 For codebases?
- Re-ranking models (e.g. cohere-rerank or colbert) → improve search quality after the initial vector retrieval.

Here's how you might structure an AI code assistant retrieval system:
1️⃣ Embedding model – text-embedding-ada-002 (general) or codebert (code-specific).
2️⃣ Vector database – FAISS (self-hosted) or Pinecone/Weaviate (managed).
3️⃣ Retrieval algorithm – HNSW (fast search) + BM25 (keyword matching).
4️⃣ Re-ranking (optional) – cohere-rerank or a cross-encoder for better results.
👉 Would you like a sample pipeline in Python that embeds and retrieves code snippets? 🚀
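A compact sketch of that 4-step pipeline, assuming the openai client and faiss-cpu packages; the chunks, model names, and query are illustrative, and re-ranking is left as a comment:

```python
# 4-step retrieval pipeline: embed chunks, index them, search, (re-rank).
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# 1) Embedding model: embed a couple of toy code chunks.
chunks = ["def create_customer(email): ...", "def delete_customer(customer_id): ..."]
vectors = embed(chunks)

# 2) Vector database + 3) retrieval algorithm: FAISS HNSW index.
index = faiss.IndexHNSWFlat(vectors.shape[1], 32)
index.add(vectors)

_, ids = index.search(embed(["how do I create a customer?"]), 2)

# 4) Optional: re-rank ids[0] with a cross-encoder before prompting the LLM.
print([chunks[i] for i in ids[0]])
```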
In this context of creating embeddings for an entire code base, is an embedding applied to every word of the code, or to chunks of code? You mentioned OpenAI provides text-embedding-ada-002 to create embeddings for our data—do they provide an API for this, and if so, is it usually cheaper than LLM chat completion requests in terms of cost per million tokens? Can the embeddings be built incrementally if I encounter a large repository with new modules? A vector database built on embeddings should settle on one fixed embedding model and cannot mix different embeddings in order to be useful, right? Do vector databases like FAISS expect you to create the embeddings yourself, or do they provide an integrated embedding service and pass the API cost on to you?
✅ Embeddings are created for chunks of code, not individual words.
Since single-word embeddings lose context, AI code assistants typically embed entire functions, classes, or logical code blocks instead of words.
🔹 How chunking works in codebases:
For large repositories, an intelligent chunking strategy ensures that embeddings capture full function signatures, dependencies, and comments.
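A minimal sketch of function-level chunking for Python sources using the standard-library ast module (the file name is illustrative; multi-language tools typically use parsers such as tree-sitter instead):

```python
# Chunk a Python source file into function- and class-level blocks so each
# embedding keeps the full signature, docstring, and body together.
import ast

def chunk_python_source(source: str) -> list[str]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks

source = open("example_module.py").read()  # hypothetical file
for chunk in chunk_python_source(source):
    print("---")
    print(chunk.splitlines()[0])  # first line of each chunk (the signature)
```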
✅ Yes, OpenAI provides an API for embeddings, specifically text-embedding-ada-002.
Yes! Embeddings are much cheaper than chat completion requests: per million tokens, embedding calls (text-embedding-ada-002) cost only a small fraction of chat completions (e.g. gpt-4-turbo).
🔹 Key takeaway:
Embedding large codebases is much cheaper than querying an LLM for every request.
✅ Yes, embeddings can be built incrementally.
For large codebases, an incremental embedding pipeline avoids re-processing everything from scratch. A common strategy is to detect changed files (e.g. by content hash) and re-embed only those.
🔹 Example: Using a Hash-Based Strategy
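A minimal sketch of such a hash-based incremental pipeline; the source directory, hash file, and the embed_and_upsert helper are illustrative placeholders:

```python
# Re-embed only files whose content hash changed since the last run.
import hashlib
import json
import pathlib

HASH_FILE = pathlib.Path(".embed_hashes.json")        # illustrative location
hashes = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}

for path in pathlib.Path("src").rglob("*.py"):         # illustrative source tree
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if hashes.get(str(path)) == digest:
        continue                                       # unchanged: keep old vectors
    # Changed or new file: re-chunk, re-embed, and upsert into the vector DB.
    # embed_and_upsert(path)                           # hypothetical helper
    hashes[str(path)] = digest

HASH_FILE.write_text(json.dumps(hashes, indent=2))
```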
🚫 No, a vector database should use one consistent embedding model.
Different models (e.g. text-embedding-ada-002 vs. code-bert) produce embeddings in different vector spaces, so their vectors cannot be meaningfully compared or mixed in one index.
✅ If you need multiple types of embeddings, store them in separate indexes within the same database.
✅ FAISS does NOT create embeddings—it only stores and searches them.
🔹 Comparison:
| Database | Creates Embeddings? | Stores & Searches? |
|---|---|---|
| FAISS | ❌ No | ✅ Yes |
| Pinecone | ❌ No | ✅ Yes |
| Weaviate | ✅ Yes (optional) | ✅ Yes |
| Milvus | ❌ No | ✅ Yes |
👉 If you want full control, use FAISS + your own embedding model.
👉 If you want managed embedding generation, use Weaviate or Pinecone.
✅ Embeddings apply to chunks of code, not words.
✅ OpenAI provides an embedding API, and it’s much cheaper than LLM chat requests.
✅ You can incrementally embed new code in large repositories.
✅ A vector database should use a single embedding model to ensure compatibility.
✅ FAISS stores embeddings but does not create them—you must generate embeddings externally.
🚀 Want a sample Python pipeline for embedding and retrieving code snippets using FAISS?
Some open-source coding assistants let you provide your own LLM model (and API key) for code assistance. Do any of them also directly expose the choice of embedding model and vector database provider, and let you pay for those directly as well? If I use FAISS embeddings, do I create a separate API key with Facebook and pay for the usage?
✅ Yes, some open-source coding assistants allow you to provide your own:
- LLM model and API key
- Embedding model (e.g. text-embedding-ada-002, sentence-transformers, Cohere, Hugging Face models)
- Vector database
🔹 Examples of AI coding assistants that allow this level of customization:
| Coding Assistant | Custom LLM? | Custom Embedding Model? | Custom Vector DB? |
|---|---|---|---|
| Sourcegraph Cody | ✅ Yes (OpenAI, local models) | ⚠️ Partial (not fully exposed yet) | ❌ No (uses built-in embeddings) |
| Codeium | ⚠️ Partial (uses its own model) | ❌ No | ❌ No |
| Tabnine | ❌ No | ❌ No | ❌ No |
| Continue.dev | ✅ Yes (self-hosted Llama, OpenAI, Anthropic) | ✅ Yes | ✅ Yes (Pinecone, FAISS, Chroma, Weaviate) |
| OpenDevin | ✅ Yes (self-hosted LLMs) | ✅ Yes | ✅ Yes |
| AnythingLLM | ✅ Yes | ✅ Yes | ✅ Yes |
🚀 Best Choice for Full Flexibility?
👉 Continue.dev, OpenDevin, and AnythingLLM allow you to set both a custom embedding model and vector database while letting you pay for them separately.
🚫 No, FAISS is free and does not require an API key or payments. It is an open-source library from Meta that you run yourself; you only pay for the embedding model you pair it with, e.g. text-embedding-ada-002 or sentence-transformers.
🔹 FAISS vs. cloud vector databases (Pinecone, Weaviate, Milvus):
| Vector DB | API Key Required? | Paid? | Managed? |
|---|---|---|---|
| FAISS | ❌ No | ❌ No | ❌ No (self-hosted) |
| Pinecone | ✅ Yes | ✅ Yes | ✅ Yes |
| Weaviate | ✅ Yes (optional) | ✅ Yes | ✅ Yes |
| Milvus | ✅ Yes (optional) | ✅ Yes | ✅ Yes |
🔹 If you want a free, local solution → FAISS
🔹 If you want a managed, scalable solution → Pinecone, Weaviate, or Milvus
Would you like a quick guide on setting up FAISS with OpenAI embeddings? 🚀
Gemini Chat
I would like to develop a custom semantic search tool on top of my own notes.
Gemini's suggestions:
Ready-to-use tools:
- AnythingLLM
- Obsidian with plugins: if you use Obsidian and install the right plugins, it works as a ready-to-use tool.
- Open WebUI
Frameworks:
- LlamaIndex
- LangChain
Libraries:
- Sentence Transformers
- FAISS
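As a starting point for such a tool, a minimal sketch combining Sentence Transformers and FAISS over a folder of markdown notes (the notes path and model name are illustrative assumptions):

```python
# Semantic search over a folder of markdown notes: free and fully local.
import pathlib

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative model

notes = sorted(pathlib.Path("~/notes").expanduser().rglob("*.md"))
texts = [p.read_text(encoding="utf-8") for p in notes]

# Normalized embeddings -> inner product behaves like cosine similarity.
vectors = model.encode(texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

query = model.encode(["what did I write about RAG and embeddings?"],
                     normalize_embeddings=True)
_, ids = index.search(np.asarray(query, dtype="float32"), 3)
print([notes[i].name for i in ids[0]])
```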