Embedding Creation
Whether you’re prototyping a lightweight, “just-get-it-working” RAG or engineering a high-precision GraphRAG, one fact is constant: you must create embeddings.
- Basic RAG: For a non-critical use case, you can embed only your text chunks and call it a day. The query is embedded, compared against those chunk vectors, and the most similar passages are returned.
- GraphRAG: When precision matters, you first build a knowledge graph. Once that structure exists, you embed both the graph's nodes and the underlying text chunks. Doing so enables two complementary retrieval routes: (a) semantic chunk retrieval, which is basic RAG; (b) graph-aware retrieval, where the query is embedded into a vector, matched against the closest graph nodes, and those nodes' relationships are traversed to surface the connected chunks. A minimal sketch of both routes follows below.
In short, embeddings sit at the heart of every RAG variant. GraphRAG simply extends the idea by mapping the graph’s semantics into the same vector space, unlocking richer, relationship-driven recall alongside vanilla similarity search.
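To make the two routes concrete, here is a minimal sketch assuming the sentence-transformers library, precomputed chunk and node embeddings, and an in-memory adjacency-list graph. The helper names (basic_rag_retrieve, graph_rag_retrieve, node_to_chunks) are illustrative, not any specific framework's API:

```python
# A minimal sketch of the two retrieval routes, assuming chunk_vectors and
# node_vectors were precomputed with model.encode(..., convert_to_tensor=True)
# and that `graph` is a simple {node: [neighbour, ...]} adjacency dict.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model could stand in here

def basic_rag_retrieve(query, chunk_texts, chunk_vectors, top_k=3):
    """Route (a): embed the query and return the most similar chunks."""
    q_vec = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, chunk_vectors, top_k=top_k)[0]
    return [chunk_texts[h["corpus_id"]] for h in hits]

def graph_rag_retrieve(query, node_names, node_vectors, graph, node_to_chunks, top_k=3):
    """Route (b): embed the query, find the closest graph nodes, traverse their
    relationships, and surface the chunks connected to those nodes."""
    q_vec = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, node_vectors, top_k=top_k)[0]
    seed_nodes = [node_names[h["corpus_id"]] for h in hits]
    expanded = set(seed_nodes)
    for node in seed_nodes:                      # one-hop traversal over the graph
        expanded.update(graph.get(node, []))
    chunks = []
    for node in expanded:                        # collect the chunks linked to those nodes
        chunks.extend(node_to_chunks.get(node, []))
    return chunks
```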
Let's dive into Embedding Creation now.
Embedding Creation is the process of converting each chunk of text into a high-dimensional numeric vector that captures its semantic meaning. These vectors (embeddings) populate the vector database and enable similarity search, meaning the system can find which pieces of text are "closest" in meaning to a given query or to each other. Note that "closest" does not mean identical; it is a measure of similarity, not equality. This is effectively the cornerstone of Retrieval-Augmented Generation: when a user asks a question, that question is also embedded and matched against the database of vectors, and the closest context is retrieved and fed back to the LLM to produce the answer.
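As a quick illustration of what an embedding actually is, the following sketch (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors) turns a single example chunk into a vector:

```python
# A minimal illustration of turning one text chunk into an embedding vector,
# assuming the sentence-transformers library; the chunk text is just an example.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose model

chunk = "Employees may carry over up to five unused vacation days per year."
vector = model.encode(chunk)

print(vector.shape)  # (384,) -- one 384-dimensional vector for this chunk
print(vector[:5])    # the first few floats; the full vector encodes the chunk's meaning
```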
Key activities in embedding creation include:
- Embedding Model Selection: Choosing an embedding model is crucial. Many models are available; some are general-purpose (like OpenAI's text-embedding-ada-002 for text), while others are domain-specific (like SciBERT for scientific text). Some are large and powerful, others smaller but faster. There is no one-size-fits-all choice, so you need to weigh trade-offs such as accuracy vs. speed vs. cost. For instance, a smaller Small Language Model (SLM) or a distilled model might serve best if latency is critical and the domain is narrow, whereas a large model might yield more nuanced embeddings if maximum accuracy is needed.
- Embedding Dimensionality: Different models output vectors of different dimensions (common sizes are 384, 768, and 1024). Higher-dimensional embeddings can capture more information, but they also consume more storage and might require more data to avoid sparsity issues. There is a balance to strike: too few dimensions may not differentiate texts well, while too many drive up computational cost. By deciding on the embedding model, you inherently decide the dimension (for example, ADA-002 produces 1536-dimensional vectors).
- Generating the Embeddings: This is where the action happens. It typically involves running each text chunk through the embedding model's API or library. If you use a cloud API like OpenAI's, this is a batch process of sending text and receiving vectors. If you use a local model (for example, SentenceTransformers or Hugging Face models), it involves loading the model on a machine, possibly with GPU acceleration, and encoding all texts. It is often the longest part of pipeline execution time, especially when thousands of chunks exist. To speed it up, you can use techniques like batching and parallelization; a generation sketch follows after this list.
- Storing Embeddings with References: Each vector embedding is stored in the vector database alongside an identifier or payload that links back to the original chunk (often including the chunk text or a chunk ID plus its metadata). For example, the vector for chunk "A" might be stored with an entry that also includes {doc: "Policy.doc", page: 3, chunk_index: 5}. This way, when a similarity search returns a vector, you know which chunk (and thus which document section) it came from; a storage sketch also follows after this list.
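Here is a sketch of the generation step, assuming a local sentence-transformers model; the example chunk texts are placeholders for the output of your chunking step:

```python
# A sketch of batched embedding generation with a local model, assuming the
# sentence-transformers library; replace the example chunks with your own.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # uses a GPU automatically if one is available

chunk_texts = [
    "Employees may carry over up to five unused vacation days per year.",
    "Carry-over requests must be approved by a manager before December 1.",
    "Unused sick days do not carry over to the next calendar year.",
]

# encode() batches internally; batch_size and show_progress_bar matter when
# there are thousands of chunks instead of three.
embeddings = model.encode(
    chunk_texts,
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True,  # unit-length vectors make cosine similarity a plain dot product
)

print(embeddings.shape)  # (3, 384) for this model: one vector per chunk
```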
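And here is a storage sketch using Qdrant's Python client as one example of a vector database (an in-memory instance, purely for demonstration); the collection name and payload fields are illustrative and mirror the Policy.doc example above:

```python
# A sketch of storing an embedding together with a payload that references the
# source chunk, using an in-memory Qdrant instance for demonstration.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_text = "Unused vacation days expire at the end of the calendar year."
vector = model.encode(chunk_text, normalize_embeddings=True)

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# The payload carries the reference back to the source document and chunk.
client.upsert(
    collection_name="chunks",
    points=[
        PointStruct(
            id=1,
            vector=vector.tolist(),
            payload={"doc": "Policy.doc", "page": 3, "chunk_index": 5, "text": chunk_text},
        )
    ],
)

# A similarity search returns the payload with each hit, so you always know
# which chunk (and which document section) a match came from.
hits = client.search(
    collection_name="chunks",
    query_vector=model.encode("When do vacation days expire?").tolist(),
    limit=1,
)
print(hits[0].payload)
```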
With embeddings in place, the system essentially has a semantic vector index of all its content. Unlike the keyword index a search engine uses, this semantic index can retrieve relevant information even if the query uses different words than the source text. For instance, the query "How to reset my password?" could match a chunk that says "Steps to change your login credentials" because the embedding model places "reset password" close to "change credentials" in the vector space, and therefore treats them as similar. A small demonstration of this follows below.
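A quick sanity check of that claim, assuming the sentence-transformers library (exact scores will vary by model):

```python
# A small check of the semantic-match example above, assuming the
# sentence-transformers library; similarity scores vary by model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How to reset my password?"
chunk = "Steps to change your login credentials"
unrelated = "Quarterly revenue grew by four percent."

q, c, u = model.encode([query, chunk, unrelated], convert_to_tensor=True)
print(util.cos_sim(q, c).item())  # relatively high: no shared keywords, but similar meaning
print(util.cos_sim(q, u).item())  # noticeably lower
```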