Chunking

After cleaning text, the next step is to split it into smaller pieces called chunks, which are sized for embedding and querying. Chunking is the process of breaking large blocks of text into smaller, manageable, and semantically meaningful units. It is critical because storing an entire document as a single vector is ineffective: chunking improves retrieval granularity and ensures that retrieved content fits within the LLM's context window.

Chunks can be small, medium, or large depending on the use case. The challenge is to split in a way that preserves context and meaning as much as possible. Although chunks are created primarily for basic RAG, they also help contextualise entity extraction during Knowledge Graph creation for GraphRAG.

Key considerations for chunking include:

  • Chunk Size: Choosing an appropriate chunk length is important. Chunks might be defined in terms of characters or tokens (for example, 500 tokens or 1,000 characters per chunk). If chunks are too large, they might overflow model limits and also dilute relevance (a big chunk might contain multiple topics, making it harder to retrieve exactly the right one). On the other hand, if chunks are too small, important context can be lost and the model might not have enough information to be useful. A common practice is to aim for chunks of a few hundred tokens, which is often a few paragraphs of text. Medium-sized chunks are widely regarded as a good default for RAG because they offer a reasonable balance between context and recall.

  • Chunk Overlap: To mitigate the issue of splitting context, many pipelines use overlapping chunks. For instance, when sliding a window over the text, one might take 1,000 characters with a 200-character overlap into the next chunk. This way, if a crucial sentence falls on a boundary, it will still appear in full in at least one of the chunks. Overlap helps maintain continuity but increases redundancy, which is generally acceptable; embeddings of overlapping chunks will simply be similar. A minimal sliding-window sketch is shown after this list.

  • Natural Boundaries: Ideally, splits should happen at logical boundaries such as paragraph breaks or sentence ends rather than arbitrarily in the middle of a sentence. Techniques like the Recursive Character Text Splitter (as used in LangChain) attempt this: they first try to split by sections or paragraphs and then, if chunks are still too large, fall back to finer separators such as sentences or words until chunks are below the size limit. This preserves readability and coherence (see the LangChain example after this list).

  • No Loss of Knowledge: The aim is to ensure that no important fact or answer spans across two chunks without overlap. If a user's question pertains to a piece of text that was split, the overlapping strategy or careful boundary selection should ensure that at least one chunk contains the full context needed to answer. It often requires trial and error to find the right chunk size and overlap for a given dataset and embedding model.

  • Chunk Metadata: When splitting, each chunk inherits metadata from its source document (such as the document ID and title) and is often augmented with location info (page number, section number). For example, after chunking, one chunk’s metadata might read source: Employee_Handbook.pdf, page: 12. This is critical for later stages to trace answers back to their sources, and also for grouping results by document in a UI. The second sketch after this list shows one way to attach such metadata to chunks.
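To make the size and overlap trade-offs above concrete, here is a minimal sketch of a character-based sliding-window chunker. It is purely illustrative: the `chunk_size` and `chunk_overlap` values and the break-on-whitespace heuristic are assumptions, not part of any particular library.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, preferring whitespace boundaries.

    Illustrative sketch: sizes are in characters, and the boundary search
    is a simple heuristic rather than a production-grade splitter.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Prefer to cut at the last whitespace before the hard limit,
        # so we avoid splitting mid-word (a crude "natural boundary").
        if end < len(text):
            boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step forward, keeping `chunk_overlap` characters of shared context.
        start = max(end - chunk_overlap, start + 1)
    return chunks


# Example: 1,000-character chunks with a 200-character overlap.
document = "Company Security Policies. Passwords must be at least 12 characters... " * 50
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk), chunk[:60])
```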
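In practice, many pipelines delegate splitting to a library. The sketch below uses LangChain's RecursiveCharacterTextSplitter (assuming the langchain-text-splitters package is installed); the file name, page number, and size parameters are illustrative.

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # target chunk length in characters
    chunk_overlap=200,    # shared context between adjacent chunks
    separators=["\n\n", "\n", " ", ""],  # paragraphs first, then finer splits
)

page_text = open("employee_handbook_p12.txt").read()  # illustrative input

# create_documents attaches metadata to every chunk it produces,
# so each chunk can later be traced back to its source and page.
docs = splitter.create_documents(
    [page_text],
    metadatas=[{"source": "Employee_Handbook.pdf", "page": 12}],
)

for doc in docs:
    print(doc.metadata, doc.page_content[:60])
```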

Why chunking matters: The vector database or retrieval system will be fetching whole chunks in response to queries. If chunking is done well, each chunk is a self-contained, topical unit of information that can be returned as a relevant piece to the LLM. If done poorly (too large or misaligned chunks), the system might retrieve a lot of irrelevant text along with the relevant parts, which can confuse the LLM or waste its context window (and potentially cause context poisoning, where irrelevant info degrades the answer quality).

For example, consider a 20-page document on “Company Security Policies.” Instead of embedding the entire 20 pages as one vector (which would be far too large), we chunk it into, say, 40 chunks of half a page each. Each chunk captures a specific policy or subtopic. A query about “password length requirements” should then retrieve the chunk about the password policy, not the entire policy document. Well-chosen chunking enables that precise retrieval.
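As a rough illustration of how this plays out at query time, the sketch below scores chunks against a query by cosine similarity. The embed() function here is only a stand-in for whatever embedding model the pipeline uses, and the chunk texts are invented for this example.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real pipeline would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

chunks = [
    "Passwords must be at least 12 characters and rotated every 90 days.",
    "Visitors must sign in at reception and wear a badge at all times.",
    "Laptops must use full-disk encryption and a screen lock.",
]

query = "password length requirements"

chunk_vecs = np.stack([embed(c) for c in chunks])
query_vec = embed(query)

# Cosine similarity between the query and every chunk. With the placeholder
# embeddings the scores are arbitrary; with a real embedding model the
# password-policy chunk would rank highest for this query.
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)

best = int(np.argmax(scores))
print("Best chunk:", chunks[best])
```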

In summary, chunking turns documents into bite-sized, context-rich pieces that are fit for LLM consumption. It balances completeness of context with specificity, ensuring the system can later pinpoint exactly the pieces of information needed to answer a question.