Summary
Tags: #RAG #embedding #LLM
The following is a ChatGPT conversation about LLMs, embeddings, and RAG.
It highlights the basic concepts clearly.
I also had a chat with Gemini about building a custom semantic search tool over my own notes; it is attached at the end. Gemini suggested solutions such as AnythingLLM (tool), Obsidian search plugins, LlamaIndex (framework), LangChain (framework), and Sentence Transformers with FAISS (libraries).
User: Thava Alagu (thavamuni@gmail.com)
Created: 18/03/2025, 13:04:08
ChatGPT
How is RAG capability used by popular AI coding assistants like Cursor, Codeium, Tabnine, Sourcegraph Cody, and others?
Retrieval-Augmented Generation (RAG) is a critical capability in modern AI-powered coding assistants like Cursor, Codeium, Tabnine, Sourcegraph Cody, and others. These assistants leverage RAG to enhance code completion, provide contextual suggestions, and answer complex coding questions more effectively. Here’s how RAG is used in these tools:
RAG is a game-changer for AI coding assistants, allowing them to move beyond generic LLM-generated code to context-aware, project-specific suggestions. Popular tools like Cursor, Codeium, Tabnine, and Sourcegraph Cody leverage retrieval techniques to enhance code completions, debugging, documentation assistance, and project-wide understanding.
Would you like a deep dive into how specific tools implement RAG with examples? 🚀
How are embeddings relevant and used when applying RAG for coding assistance?
Embeddings play a critical role in implementing Retrieval-Augmented Generation (RAG) for AI-powered coding assistants like Cursor, Codeium, Tabnine, Sourcegraph Cody, and others. They allow AI to retrieve relevant code snippets, documentation, and past usage patterns efficiently before generating responses.
These embeddings are indexed and stored, allowing the AI to retrieve relevant information before generating suggestions.
Example: if your codebase defines a hook named useDebounce and you ask how to "debounce an input", embeddings match the intent even though that exact phrase never appears in the code.
✅ Semantic Search for Code & Documentation – Retrieves relevant snippets even when phrasing differs.
✅ Multi-File Code Understanding – Helps AI assistants analyze code across multiple files and modules.
✅ Faster Query Resolution – Quickly fetches relevant functions, classes, and past implementations.
✅ Project-Specific Context – Ensures AI completions are aligned with your project’s unique structure.
✅ Better Code Suggestions & Refactoring – Embeddings help improve accuracy and consistency of AI-generated code.
Embeddings bridge the gap between raw LLM capabilities and real-world, project-specific code generation. By converting code, documentation, and queries into vector representations, AI coding assistants can retrieve and apply relevant context before generating responses.
Would you like a detailed example of embeddings in action (e.g., a Python script using FAISS or Pinecone)? 🚀
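A tiny illustration of this kind of semantic matching: a minimal sketch using the open-source sentence-transformers library (model name and snippets are illustrative; no vector database involved):

```python
# Minimal semantic search over code snippets with sentence-transformers.
# The model name and snippets below are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

snippets = [
    "def use_debounce(value, delay): ...  # debounce hook for text inputs",
    "def fetch_user(user_id): ...  # REST call to /users/{id}",
]
query = "how do I debounce an input?"

snippet_vecs = model.encode(snippets, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks the debounce snippet highest even though the
# query's wording never appears verbatim in the code.
scores = util.cos_sim(query_vec, snippet_vecs)[0]
best = int(scores.argmax())
print(snippets[best], float(scores[best]))
```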
So, embeddings are used mainly for semantic search to construct the proper context. Do popular AI chat tools like ChatGPT, claude.ai, and deepseek.com use RAG by default to enhance our prompt text with additional context (if we don't provide additional documentation or attachments)? Are embeddings irrelevant when we don't use RAG, or are they an integral part of all LLM chat activity one way or another? Does the LLM itself operate fundamentally on embeddings or not?
Yes, embeddings are fundamental to how all Large Language Models (LLMs) operate, but their role differs depending on whether RAG is being used or not. Let’s break this down step by step.
Do ChatGPT, Claude, or DeepSeek use RAG by default?
✅ No, not by default—unless they are explicitly provided with external knowledge sources like documentation, files, or APIs.
👉 So, embeddings for retrieval are NOT always used unless RAG is enabled.
Are embeddings still relevant without RAG?
🔹 Yes, but in a different way! Even when RAG is not used, embeddings are still a core part of how LLMs generate responses.
👉 So, embeddings are ALWAYS used by an LLM internally, whether RAG is involved or not.
Does the LLM itself operate on embeddings?
✅ Absolutely—LLMs fundamentally operate using embeddings! Even without retrieval, embeddings drive everything an LLM does.
| Aspect | LLM Internal Embeddings | RAG-Based Embeddings |
|---|---|---|
| Purpose | Helps LLM process, predict, and generate text | Helps retrieve external context |
| When used? | Always, in every model response | Only if retrieval (RAG) is enabled |
| Example usage | Understanding sentence meaning, predicting next words | Finding relevant documentation, previous code snippets |
| Storage | Inside the model (learned during training) | External vector database (FAISS, Pinecone, Weaviate) |
👉 So, embeddings are core to all LLMs, but RAG-based embeddings are an additional layer used for retrieval.
1️⃣ All LLMs (ChatGPT, Claude, DeepSeek, etc.) always use embeddings internally to process and generate text.
2️⃣ RAG (Retrieval-Augmented Generation) is separate and adds retrieval-based embeddings when external knowledge is required.
3️⃣ If you don’t provide extra documentation or attachments, these tools typically do not use RAG—they rely only on their built-in training.
4️⃣ Even without RAG, embeddings are at the core of how LLMs work.
Would you like an example of embeddings in action with a simple Python script? 🚀
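A minimal sketch of those internal embeddings, using the small open GPT-2 model via Hugging Face transformers (the model choice is illustrative; no retrieval is involved):

```python
# The "internal" embeddings every LLM uses: each token of the prompt is
# mapped to a learned vector before any generation happens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("Embeddings drive everything an LLM does", return_tensors="pt")
with torch.no_grad():
    token_vectors = model.get_input_embeddings()(inputs["input_ids"])

# One 768-dimensional vector per token (GPT-2's hidden size).
print(token_vectors.shape)  # torch.Size([1, num_tokens, 768])
```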
If I am implementing an AI code assistance tool, I need to be aware of these. The embedding mechanism of the LLM is internal to the LLM, while my tool retrieves additional context using my own embeddings. Is this additional context—say, API documentation—a plain list of words, or is it in embedding format? Should my AI code assistance tool pass this additional context to the LLM as plain text or as embeddings? Are the different embedding mechanisms used by LLMs and agents expected to be compatible, or are they independent?
You're correct that an LLM has its own internal embeddings, while your AI code assistance tool will use its own retrieval embeddings to fetch relevant context (e.g., API documentation, previous code, etc.). However, the way you pass this additional context to the LLM is crucial.
✅ Pass it as a list of words (text), NOT as embeddings.
LLMs do not accept embeddings directly as input. Instead, they process raw text and internally convert it into embeddings before generating responses.
🔹 Example: If a developer asks:
💬 “How do I use the Stripe API to create a customer?”
Your tool:
1. Uses vector search to find relevant API documentation.
2. Retrieves a snippet like:

   {
     "method": "POST",
     "endpoint": "/v1/customers",
     "params": ["email", "name", "description"],
     "example": "curl https://api.stripe.com/v1/customers -d email='user@example.com' -d name='John Doe'"
   }

3. Passes this as raw text in the LLM prompt:

   Here is the relevant API documentation for Stripe's customer creation:
   - Endpoint: POST /v1/customers
   - Required Params: email, name, description
   - Example usage: curl https://api.stripe.com/v1/customers -d email='user@example.com' -d name='John Doe'

   Based on this, generate a Node.js function to create a customer using the Stripe SDK.
✅ The LLM will process this natural language input and generate the appropriate function.
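A minimal sketch of this pattern, assuming the openai Python client (v1+) and an illustrative model name; the retrieved snippet is hard-coded to stand in for whatever your vector search returns:

```python
# Retrieved documentation is injected into the prompt as plain text;
# the LLM never receives our retrieval embeddings directly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in for whatever your vector search returned, decoded back to text.
retrieved_doc = (
    "Endpoint: POST /v1/customers\n"
    "Required params: email, name, description\n"
    "Example: curl https://api.stripe.com/v1/customers -d email='user@example.com'"
)

prompt = (
    "Here is the relevant API documentation for Stripe's customer creation:\n"
    f"{retrieved_doc}\n\n"
    "Based on this, generate a Node.js function to create a customer "
    "using the Stripe SDK."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)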
Are the LLM's internal embeddings and your tool's retrieval embeddings compatible?
🔹 They are independent and serve different purposes, but they are complementary.
| Aspect | LLM Internal Embeddings | Your AI Tool's Embeddings (RAG) |
|---|---|---|
| Purpose | Helps the model process & generate responses | Helps retrieve relevant external data |
| Where used? | Inside the LLM during inference | In your retrieval pipeline |
| Format | Stored in the model's parameters | Stored in a vector database |
| How used? | Transforms input text into vector space for prediction | Matches query embeddings with stored knowledge |
| Compatibility | Not directly usable by external systems | You control how your embeddings are stored & retrieved |
You just retrieve relevant context using your embeddings and pass it as raw text to the LLM.
✔ Use embeddings for retrieval, but always pass text to the LLM.
✔ Optimize context injection – You don’t need to pass an entire API document, just the most relevant parts.
✔ Format the retrieved context clearly – Use bullet points, code blocks, or structured formats.
✔ Fine-tune context length – LLMs have token limits, so pass only what’s necessary.
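For the token-limit point, a small sketch using tiktoken to keep retrieved context within a budget (the budget and model name are illustrative assumptions):

```python
# Trim retrieved context to a token budget before building the prompt.
import tiktoken

def truncate_to_budget(text: str, budget: int = 1500, model: str = "gpt-4") -> str:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return enc.decode(tokens[:budget]) if len(tokens) > budget else text

retrieved_context = "... long API documentation fetched by vector search ..."
prompt_context = truncate_to_budget(retrieved_context)
print(len(prompt_context))
```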
Would you like a sample Python implementation for embedding-based retrieval using FAISS or Pinecone? 🚀
So RAG-based embeddings involve the raw embedding mechanism used to index domain-specific knowledge (like the entire code base of my repository) and the tricks and algorithms to extract relevant information from that domain source. What are the popular choices today for the raw embedding technology, and for the algorithms that extract relevant information using these embeddings?
Yes, you’ve got it right! RAG-based embeddings involve two key aspects:
1️⃣ Raw embedding technology – Converts your domain-specific knowledge (e.g., API docs, codebase, previous chat history) into high-dimensional vectors and stores them in a vector database.
2️⃣ Retrieval algorithms – Efficiently extract the most relevant information from the stored embeddings based on a user query.
These methods generate embeddings for indexing domain knowledge:
| Embedding Model | Provider | Strengths | Limitations |
|---|---|---|---|
| text-embedding-ada-002 | OpenAI | Best for general-purpose text/code retrieval | API calls add latency & cost |
| text-bison / code-bison | Google | Good for Google Cloud-native solutions | Limited fine-tuning |
| embed-multilingual-v3 | Cohere | Multilingual embedding support | API-dependent |
| Hugging Face models (sentence-transformers, BGE, E5) | Open-source | No API cost, flexible tuning | Needs self-hosting |
| Facebook (Meta) FAISS | Meta | Optimized for large-scale retrieval | Requires a separate model for embedding |
| Milvus / Weaviate custom embeddings | Open-source | Good for hybrid search | Higher setup complexity |
🔹 For codebases?
- code-bert, code-t5, starcoder embeddings → specially trained for code search & retrieval.
- text-embedding-ada-002 → surprisingly good for code search too!

Once embeddings are stored, you need a way to query and retrieve the most relevant pieces of knowledge. The key techniques include:
| Algorithm | How It Works | Used In |
|---|---|---|
| K-Nearest Neighbors (KNN) | Finds the K most similar embeddings to the query | FAISS, Pinecone, Weaviate |
| Approximate Nearest Neighbors (ANN) | Faster, scalable version of KNN for big datasets | FAISS, HNSWlib, Milvus |
| Hierarchical Navigable Small World (HNSW) | Graph-based search; efficient for large-scale retrieval | FAISS, Milvus, Weaviate |
| Product Quantization (PQ) | Compresses embeddings for memory-efficient retrieval | FAISS, ScaNN |
| Hybrid Search (Vector + Keyword) | Mixes semantic (vector) search with traditional keyword search | Weaviate, Vespa |
| Re-ranking (Cross-Encoder) | Re-ranks the top-N retrieved documents using another model | ColBERT, BGE Re-ranker |
🔹 For codebases?
- Re-ranking models (e.g. cohere-rerank or colbert) → improve search quality after the initial vector retrieval.

Here's how you might structure an AI code assistant retrieval system:
1️⃣ Embedding model – text-embedding-ada-002 (general) or codebert (code-specific).
2️⃣ Vector database – FAISS (self-hosted) or Pinecone/Weaviate (managed).
3️⃣ Retrieval algorithm – HNSW (fast search) + BM25 (keyword matching).
4️⃣ Re-ranking (optional) – cohere-rerank or a cross-encoder for better results.
👉 Would you like a sample pipeline in Python that embeds and retrieves code snippets? 🚀
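A compact sketch of that 4-step pipeline, assuming the openai client and faiss-cpu packages; the chunks, model names, and query are illustrative, and re-ranking is left as a comment:

```python
# 4-step retrieval pipeline: embed chunks, index them, search, (re-rank).
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# 1) Embedding model: embed a couple of toy code chunks.
chunks = ["def create_customer(email): ...", "def delete_customer(customer_id): ..."]
vectors = embed(chunks)

# 2) Vector database + 3) retrieval algorithm: FAISS HNSW index.
index = faiss.IndexHNSWFlat(vectors.shape[1], 32)
index.add(vectors)

_, ids = index.search(embed(["how do I create a customer?"]), 2)

# 4) Optional: re-rank ids[0] with a cross-encoder before prompting the LLM.
print([chunks[i] for i in ids[0]])
```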
In this context of creating embeddings for an entire code base, is an embedding applied to every word of the code, or to chunks of code? You mentioned OpenAI provides text-embedding-ada-002 to create embeddings for our data—do they provide an API for this, and if so, is it usually cheaper than LLM chat completion requests in terms of cost per million tokens? Can the embeddings be built incrementally if I encounter a large repository with new modules? A vector database built on embeddings should settle on one fixed embedding model and cannot mix different embeddings in order to be useful, right? Do vector databases like FAISS expect you to create the embeddings yourself, or do they provide an integrated embedding service and pass the API cost on to you?
✅ Embeddings are created for chunks of code, not individual words.
Since single-word embeddings lose context, AI code assistants typically embed entire functions, classes, or logical code blocks instead of words.
🔹 How chunking works in codebases:
For large repositories, an intelligent chunking strategy ensures that embeddings capture full function signatures, dependencies, and comments.
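A minimal sketch of function-level chunking for Python sources using the standard-library ast module (the file name is illustrative; multi-language tools typically use parsers such as tree-sitter instead):

```python
# Chunk a Python source file into function- and class-level blocks so each
# embedding keeps the full signature, docstring, and body together.
import ast

def chunk_python_source(source: str) -> list[str]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks

source = open("example_module.py").read()  # hypothetical file
for chunk in chunk_python_source(source):
    print("---")
    print(chunk.splitlines()[0])  # first line of each chunk (the signature)
```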
✅ Yes, OpenAI provides an API for embeddings, specifically text-embedding-ada-002.
Yes! Embeddings are much cheaper than chat completion requests: per million tokens, embedding calls (text-embedding-ada-002) cost only a small fraction of chat completions (e.g. gpt-4-turbo).
🔹 Key takeaway:
Embedding large codebases is much cheaper than querying an LLM for every request.
✅ Yes, embeddings can be built incrementally.
For large codebases, an incremental embedding pipeline avoids re-processing everything from scratch. A common strategy is to detect changed files (e.g. by content hash) and re-embed only those.
🔹 Example: Using a Hash-Based Strategy
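A minimal sketch of such a hash-based incremental pipeline; the source directory, hash file, and the embed_and_upsert helper are illustrative placeholders:

```python
# Re-embed only files whose content hash changed since the last run.
import hashlib
import json
import pathlib

HASH_FILE = pathlib.Path(".embed_hashes.json")        # illustrative location
hashes = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}

for path in pathlib.Path("src").rglob("*.py"):         # illustrative source tree
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if hashes.get(str(path)) == digest:
        continue                                       # unchanged: keep old vectors
    # Changed or new file: re-chunk, re-embed, and upsert into the vector DB.
    # embed_and_upsert(path)                           # hypothetical helper
    hashes[str(path)] = digest

HASH_FILE.write_text(json.dumps(hashes, indent=2))
```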
🚫 No, a vector database should use one consistent embedding model.
Different models (e.g. text-embedding-ada-002 vs. code-bert) produce embeddings in different vector spaces, so their vectors cannot be meaningfully compared or mixed in one index.
✅ If you need multiple types of embeddings, store them in separate indexes within the same database.
✅ FAISS does NOT create embeddings—it only stores and searches them.
🔹 Comparison:
| Database | Creates Embeddings? | Stores & Searches? |
|---|---|---|
| FAISS | ❌ No | ✅ Yes |
| Pinecone | ❌ No | ✅ Yes |
| Weaviate | ✅ Yes (optional) | ✅ Yes |
| Milvus | ❌ No | ✅ Yes |
👉 If you want full control, use FAISS + your own embedding model.
👉 If you want managed embedding generation, use Weaviate or Pinecone.
✅ Embeddings apply to chunks of code, not words.
✅ OpenAI provides an embedding API, and it’s much cheaper than LLM chat requests.
✅ You can incrementally embed new code in large repositories.
✅ A vector database should use a single embedding model to ensure compatibility.
✅ FAISS stores embeddings but does not create them—you must generate embeddings externally.
🚀 Want a sample Python pipeline for embedding and retrieving code snippets using FAISS?
Some open-source coding assistants let you provide your own LLM model (and API key) for code assistance. Do any of them also directly expose the choice of embedding model and vector database provider, and let you pay for those directly as well? If I use FAISS embeddings, do I create a separate API key with Facebook and pay for the usage?
✅ Yes, some open-source coding assistants allow you to provide your own:
- LLM model and API key
- Embedding model (e.g. text-embedding-ada-002, sentence-transformers, Cohere, Hugging Face models)
- Vector database
🔹 Examples of AI coding assistants that allow this level of customization:
| Coding Assistant | Custom LLM? | Custom Embedding Model? | Custom Vector DB? |
|---|---|---|---|
| Sourcegraph Cody | ✅ Yes (OpenAI, local models) | ⚠️ Partial (not fully exposed yet) | ❌ No (uses built-in embeddings) |
| Codeium | ⚠️ Partial (uses its own model) | ❌ No | ❌ No |
| Tabnine | ❌ No | ❌ No | ❌ No |
| Continue.dev | ✅ Yes (self-hosted Llama, OpenAI, Anthropic) | ✅ Yes | ✅ Yes (Pinecone, FAISS, Chroma, Weaviate) |
| OpenDevin | ✅ Yes (self-hosted LLMs) | ✅ Yes | ✅ Yes |
| AnythingLLM | ✅ Yes | ✅ Yes | ✅ Yes |
🚀 Best Choice for Full Flexibility?
👉 Continue.dev, OpenDevin, and AnythingLLM allow you to set both a custom embedding model and vector database while letting you pay for them separately.
🚫 No, FAISS is free and does not require an API key or payments. It is an open-source library from Meta that you run yourself; you only pay for the embedding model you pair it with, e.g. text-embedding-ada-002 or sentence-transformers.
🔹 FAISS vs. cloud vector databases (Pinecone, Weaviate, Milvus):
| Vector DB | API Key Required? | Paid? | Managed? |
|---|---|---|---|
| FAISS | ❌ No | ❌ No | ❌ No (self-hosted) |
| Pinecone | ✅ Yes | ✅ Yes | ✅ Yes |
| Weaviate | ✅ Yes (optional) | ✅ Yes | ✅ Yes |
| Milvus | ✅ Yes (optional) | ✅ Yes | ✅ Yes |
🔹 If you want a free, local solution → FAISS
🔹 If you want a managed, scalable solution → Pinecone, Weaviate, or Milvus
Would you like a quick guide on setting up FAISS with OpenAI embeddings? 🚀
Gemini Chat
I would like to develop a custom semantic search tool on top of my own notes.
Gemini's suggestions:
Ready-to-use tools:
- AnythingLLM
- Obsidian with plugins: if you use Obsidian and install the right plugins, it works as a ready-to-use tool.
- Open WebUI
Frameworks:
- LlamaIndex
- LangChain
Libraries:
- Sentence Transformers
- FAISS
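As a starting point for such a tool, a minimal sketch combining Sentence Transformers and FAISS over a folder of markdown notes (the notes path and model name are illustrative assumptions):

```python
# Semantic search over a folder of markdown notes: free and fully local.
import pathlib

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative model

notes = sorted(pathlib.Path("~/notes").expanduser().rglob("*.md"))
texts = [p.read_text(encoding="utf-8") for p in notes]

# Normalized embeddings -> inner product behaves like cosine similarity.
vectors = model.encode(texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

query = model.encode(["what did I write about RAG and embeddings?"],
                     normalize_embeddings=True)
_, ids = index.search(np.asarray(query, dtype="float32"), 3)
print([notes[i].name for i in ids[0]])
```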