Vector Database
In a basic RAG setup, the vector database holds a central place. A vector database is a specialized database designed to index and query vector embeddings. Its core capability is to efficiently answer questions like "which stored vectors are most similar to this query vector?", usually via approximate nearest neighbor (ANN) search algorithms such as HNSW, IVF, or PQ. In the RAG stack, the vector database enables semantic search over documents.
Here are some key features and considerations to keep in mind when selecting a vector database:
- Storage and Indexing: The database stores each embedding along with an identifier or payload (typically the chunk text or a reference to it). To speed up similarity search in high dimensions, vector databases use specialized index structures. A popular one is HNSW (Hierarchical Navigable Small World graphs), which organizes vectors in a layered graph for roughly logarithmic search time. Another is IVF (inverted file index), where vectors are coarsely clustered and only the most promising clusters are scanned, often combined with product quantization (PQ) to compress vectors. The choice of index trades off recall against speed and memory. Many vector databases let you tune these parameters, for example setting up an HNSW index with M=16 and ef=200 for high recall (see the sketch after this list).
- Similarity Metric: This matters because the similarity metric determines how context is retrieved. Most systems use cosine similarity or Euclidean distance on normalized vectors to measure closeness. Cosine similarity is common for text embeddings, since the direction of a vector matters more than its magnitude. The vector database computes this efficiently over millions of vectors and typically returns the top-k closest vectors for a query.
- Scalability: Vector databases are built to handle large volumes of data. They can index millions or billions of vectors, often scaling horizontally across nodes. Examples include Pinecone, Weaviate, Milvus, FAISS (the Facebook AI Similarity Search library used under the hood by many systems), and Elasticsearch (which offers vector search extensions). When scaling, consider sharding (splitting the index) and replication (for fault tolerance), much as you would with traditional databases. A well-optimized vector database can answer searches in a few milliseconds even over millions of vectors.
- Metadata filtering: A powerful feature is the ability to filter results by metadata before or after the similarity search. For instance, if the user query is about product "XYZ" and each vector chunk carries a product metadata field, the system can ask the vector database for the nearest neighbors among those with product=XYZ, so irrelevant categories are excluded. Many vector databases, such as Pinecone and Weaviate, support metadata filtering as part of the query (a minimal sketch follows this list).
- Hybrid search and Re-ranking: Vector database results are often combined with other strategies. A hybrid search might take the top 50 vectors by similarity and then apply a second-stage ranking with a cross-encoder (a model that scores relevance given the query and chunk text together, a more precise but more expensive step; see the re-ranking sketch below). You can also combine keyword constraints with semantic search, for example only considering chunks that contain a certain keyword and are semantically similar. These approaches improve precision when raw vector similarity is not sufficient on its own. Vector databases can also be combined with knowledge graphs in a GraphRAG setting.
- Limitations: Unlike a graph or relational database, a vector database does not inherently understand structured relationships or hierarchy, which is a significant limitation. It is a "flat" store of points in space, so it cannot answer complex queries like "give me all documents that mention X before Y" without additional mechanisms. Vector search can also surface results that are semantically odd, due to the curse of dimensionality or because the query vector falls between clusters (though this is mitigated by the quality of the embedding model). Another limitation is the lack of explainability: it is not always obvious why a certain result was deemed similar, beyond "the numbers were close." This is where knowledge graphs have an edge in explanation.
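To make the indexing and similarity settings above concrete, here is a minimal sketch using the open-source hnswlib library, with the example parameters M=16 and ef=200 and cosine as the similarity metric. The dimensionality and the random vectors are placeholders standing in for real chunk embeddings.

```python
import numpy as np
import hnswlib

dim = 384              # embedding dimensionality (depends on the embedding model)
num_elements = 10_000  # toy corpus size; real corpora are usually much larger

# Random vectors stand in for real chunk embeddings in this sketch.
vectors = np.random.rand(num_elements, dim).astype(np.float32)
ids = np.arange(num_elements)

# Build an HNSW index with cosine similarity, M=16 and ef_construction=200,
# matching the example parameters mentioned above.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=200)
index.add_items(vectors, ids)

# ef controls the search-time recall/speed trade-off; higher means better recall.
index.set_ef(200)

# Query: find the 5 stored vectors closest to a query embedding.
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```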
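Metadata filtering is exposed differently by each engine (Pinecone, Weaviate, Qdrant and others each have their own filter syntax), so rather than tie the example to one API, here is a self-contained brute-force sketch of the idea: pre-filter candidates on a hypothetical product field, then rank the remainder by cosine similarity.

```python
import numpy as np

# Toy corpus: each chunk has an embedding plus metadata. The "product" field is
# the hypothetical metadata key from the XYZ example above.
chunks = [
    {"text": "XYZ setup guide ...",   "product": "XYZ", "vec": np.random.rand(384)},
    {"text": "XYZ release notes ...", "product": "XYZ", "vec": np.random.rand(384)},
    {"text": "ABC user manual ...",   "product": "ABC", "vec": np.random.rand(384)},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filtered_search(query_vec, product, k=2):
    # Pre-filter: keep only chunks whose metadata matches, then rank by cosine.
    candidates = [c for c in chunks if c["product"] == product]
    candidates.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return candidates[:k]

query_vec = np.random.rand(384)
for hit in filtered_search(query_vec, product="XYZ"):
    print(hit["text"])
```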
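And here is a sketch of second-stage re-ranking with a cross-encoder, using the sentence-transformers library. The candidate chunks would normally come from the first-stage vector search; the specific checkpoint named here is just one publicly available cross-encoder, chosen for illustration.

```python
import numpy as np
from sentence_transformers import CrossEncoder

# Candidates returned by the first-stage vector search (e.g. top 50 by similarity).
query = "What are the symptoms of diabetes?"
candidate_chunks = [
    "Common symptoms of diabetes include increased thirst and frequent urination.",
    "Diabetes is managed through diet, exercise, and medication.",
    "Our clinic is open Monday to Friday from 9am to 5pm.",
]

# Any public cross-encoder checkpoint could be used here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# The cross-encoder scores each (query, chunk) pair jointly, which is slower but
# more precise than comparing standalone embeddings.
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])

# Keep the highest-scoring chunks for the LLM context.
order = np.argsort(scores)[::-1]
reranked = [candidate_chunks[i] for i in order]
print(reranked[0])
```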
Practically, the vector database works like this in our stack: when the orchestration layer receives a user query, it embeds the query (using the same embedding model that was used for the documents). It then asks the vector database for the top 5-10 most similar chunks. The vector database returns those chunks (with their original text and metadata), and they are fed into the LLM as context for answer generation. A minimal sketch of this flow follows below.
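The following sketch shows that flow end to end. The embed, vector_db, and llm components are stubbed placeholders, since the actual embedding model, vector database client, and LLM in the stack are not specified here; only the shape of the orchestration is meant to carry over.

```python
# Placeholder components: in a real stack these would be the embedding model,
# the vector database client, and the LLM client. They are stubbed so the
# sketch runs end to end.
def embed(text: str) -> list[float]:
    return [float(len(text))]  # stand-in for a real embedding model

class FakeVectorDB:
    def search(self, query_vec, top_k=5):
        return [{"text": "Common symptoms of diabetes include increased thirst."}]

class FakeLLM:
    def generate(self, prompt: str) -> str:
        return "(model answer based on the retrieved context)"

vector_db, llm = FakeVectorDB(), FakeLLM()

def answer(user_query: str, top_k: int = 5) -> str:
    # 1. Embed the query with the same model used for the documents.
    query_vec = embed(user_query)
    # 2. Ask the vector database for the top-k most similar chunks.
    hits = vector_db.search(query_vec, top_k=top_k)
    # 3. Assemble the retrieved chunk texts into the prompt as context.
    context = "\n\n".join(hit["text"] for hit in hits)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
    )
    # 4. The LLM composes the final answer from the retrieved context.
    return llm.generate(prompt)

print(answer("What are the symptoms of diabetes?"))
```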
To illustrate, imagine a user asks: "What are the symptoms of diabetes?" The query embedding goes into the vector database of medical articles. The database might return chunks from several documents: one from a medical guideline listing diabetes symptoms, another from a patient FAQ, and a couple more whose vectors happened to be close to the query vector. The orchestration layer then passes those pieces of text to the LLM so it can compose a comprehensive answer.
In summary, the vector database is the semantic search engine of the stack. It makes the knowledge base accessible by meaning rather than exact keywords, and it provides the speed and scale to retrieve relevant knowledge in real time, even over large corpora, so the LLM can answer questions accurately.