Data Ingestion and Preprocessing

By definition, Retrieval Augmented Generation (RAG) means that you'll be retrieving information. For retrieval to work well, you first need to ingest, process, and index your raw data into a searchable form. The first step, then, is preparing your proprietary knowledge (documents, PDFs, etc.) for retrieval; this is the foundation of the Stack's knowledge base.

The goal is to turn messy, unstructured, or disparate data into clean, enriched, and segmented text that downstream components (like embedding models or knowledge graphs) can consume.

Preprocessing typically involves four activities:

  1. Document Loading and Extraction: Connecting to data sources and extracting content. This involves using specialized connectors and parsers for each data format (e.g. PDF files, Word docs, HTML pages, databases) to load text into the system.

  2. Text Cleaning: Processing the raw text to remove noise or sensitive information. This could mean stripping out HTML tags, removing boilerplate (headers, footers), normalizing whitespace, fixing encoding issues, and redacting personal or confidential data.

  3. Data Enrichment: Enhancing the data with additional context or metadata. For example, tagging documents with their source, date, author, or detected topics. Enrichment can also include detecting and labeling entities (people, places, etc.) or expanding acronyms, which provides valuable context for later retrieval.

  4. Chunking: Splitting documents into smaller chunks (segments) that are easier for language models to handle. Large texts are divided into sections of a few hundred words each, often with some overlapping text for context continuity. This is crucial because it improves semantic search and ensures each chunk fits within model token limits.
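As one concrete illustration of the cleaning step above, here is a minimal sketch in Python using only the standard library. The function name and the choice of redacting email addresses are illustrative assumptions, not part of any particular library:

```python
import re

def clean_text(raw: str) -> str:
    """Minimal cleaning pass: strip HTML tags, redact email addresses
    (as one example of sensitive data), and normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)                          # strip HTML tags
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)   # redact emails
    text = re.sub(r"\s+", " ", text).strip()                     # normalize whitespace
    return text

print(clean_text("<p>Contact  us at   team@example.com</p>"))
# → Contact us at [EMAIL]
```

A production pipeline would typically go further (boilerplate removal, encoding fixes, broader PII redaction), but the shape is the same: a sequence of deterministic passes over the raw text.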
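The chunking step can likewise be sketched with a simple word-based splitter. The chunk size and overlap values below are illustrative defaults, not recommendations; real systems often split on sentence or section boundaries instead:

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size word chunks, with overlapping words
    between consecutive chunks for context continuity."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # advance by less than chunk_size to overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks

# Ten words, chunks of 4 with 2-word overlap → 4 chunks:
text = " ".join(f"w{i}" for i in range(10))
print(chunk_words(text, chunk_size=4, overlap=2))
```

Each chunk repeats the trailing words of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk.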

By the end of preprocessing, the system has a repository of clean, well-structured chunks of text, each optionally annotated with metadata; this repository is the input to the next stages. Though often overlooked, good preprocessing is vital: done well, it gives the rest of the Stack high-quality data to rely on, leading to more accurate retrieval and responses. Done poorly, it results in poor retrieval and incorrect information downstream.
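To make the end state concrete, a single preprocessed chunk might look like the record below. The field names and values are purely illustrative, not a standard schema:

```python
# A possible shape for one preprocessed chunk, ready for embedding/indexing.
# All names and values here are hypothetical examples.
chunk_record = {
    "id": "annual-report-2024-chunk-003",
    "text": "Revenue grew 12% year over year, driven by ...",
    "metadata": {
        "source": "annual_report.pdf",   # where the text came from
        "author": "Finance Team",        # enrichment: detected/assigned author
        "date": "2024-03-01",            # document date
        "topics": ["revenue", "forecast"],  # enrichment: detected topics
    },
}

print(chunk_record["metadata"]["source"])
# → annual_report.pdf
```

Downstream components then embed the `text` field, while the `metadata` fields support filtering and attribution at retrieval time.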