
Embedding Model

The Embedding Model represents the specific algorithm or neural network used to generate text embeddings. It deserves its own discussion because the quality of the embedding model determines how well semantic meaning is captured.

  • What is it: An embedding model is typically a transformer-based model (such as BERT, RoBERTa, or a specialized model like InstructorXL) that has been trained (or fine-tuned) to output a vector for a given text such that similar texts have closer vectors (see the first sketch after this list). Unlike the main LLM used for generation, an embedding model is usually smaller and optimized for representational tasks rather than text generation.

  • What is the difference between an LLM and an Embedding Model: The main LLM (like GPT-4) can sometimes be used to get embeddings via certain API calls, but more often you would use a dedicated embedding model that is more efficient. For example, OpenAI provides a separate endpoint for text-embedding-ada-002, which is much cheaper and faster per call than using GPT-4 for embeddings. Small Language Models can also serve as embedding models: a fine-tuned MiniLM or MPNet model from Hugging Face, for instance, produces compact vectors (384 or 768 dimensions) very quickly.

  • How to choose a Model:

    • Domain Specificity: If your documents are very domain-specific (say, legal contracts), an embedding model fine-tuned on legal text might capture nuances (like recognizing which clauses are similar to a “force majeure” clause) better than a general model.

    • Dimensions: As noted earlier, the embedding dimension is a design parameter; popular choices are 384, 768, 1024, and 1536. Higher dimensionality might improve recall, but not always significantly, and it can be overkill for simple domains.

    • Sparse vs Dense: Traditional embedding vectors are dense. There are also sparse representations (such as BM25-style keyword scoring or learned sparse models) that incorporate exact keyword matching. Some advanced pipelines use hybrid retrieval, combining dense and sparse signals (a rough sketch of hybrid scoring follows this list). For simplicity, one might pick a dense model and augment it with keyword search if needed.

    • Cost vs Speed: If using an external API, the cost per 1,000 embedding calls matters. If self-hosting, the model’s size dictates memory use and inference speed; a 100M-parameter model is much faster than a 6B-parameter one. So there is a trade-off: for real-time updates or very large corpora, a smaller but slightly less accurate model might be chosen to save time and money.

    • Small Language Models (SLMs): The term SLM refers to models with fewer parameters, which are cheaper and faster to customize and run. Training or fine-tuning a small model to produce embeddings tailored to your data can give the best of both worlds: relevance and efficiency. Many enterprises consider running their own small embedding models when data privacy is a concern (so they cannot call an external API) or when API costs are high.

  • Fine-tuning: Some teams fine-tune embedding models on their own corpus or on a task (like similarity of question-answer pairs); a minimal fine-tuning sketch appears at the end of this section. This can yield improved performance for specific retrieval tasks, but it requires expertise and labeled data (or at least some unsupervised fine-tuning, such as training the model to reconstruct parts of documents).
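
To make the “What is it” point concrete, here is a minimal sketch using the sentence-transformers library. The model name and example sentences are illustrative choices, not recommendations; any embedding model with an encode-style interface behaves the same way.

```python
# Minimal sketch: encode texts and compare them with cosine similarity.
# Assumes the sentence-transformers package; model and sentences are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model, 384-dim output

sentences = [
    "The contract was terminated due to a force majeure event.",
    "An unforeseeable circumstance released both parties from the agreement.",
    "The quarterly revenue grew by twelve percent.",
]

# Encode each sentence into a dense vector; similar meanings should land close together.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)

# Cosine similarity between the first sentence and the other two.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # expect the legal paraphrase to score higher than the revenue sentence
```

A hosted API (such as OpenAI’s embeddings endpoint) could be swapped in for the local model; the downstream similarity arithmetic stays the same.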

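The “Sparse vs Dense” point can be illustrated with a hedged sketch of hybrid scoring: BM25 keyword scores blended with dense cosine scores. The corpus, query, 0.7/0.3 weights, and the choice of the rank_bm25 library are all assumptions for illustration; production systems typically normalize the two score scales or use reciprocal rank fusion rather than a raw weighted sum.

```python
# Hybrid retrieval sketch: blend sparse (BM25) and dense (embedding) scores.
# Assumes the rank_bm25 and sentence-transformers packages; all data is illustrative.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Force majeure clauses excuse performance after unforeseeable events.",
    "The indemnification section allocates liability between the parties.",
    "Payment is due within thirty days of the invoice date.",
]
query = "Which clause covers unforeseeable events?"

# Sparse (keyword) scores via BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense scores via an embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
dense_scores = util.cos_sim(query_vec, doc_vecs)[0].tolist()

# Blend the two signals; the 0.7/0.3 weights are assumptions to tune per corpus.
hybrid = [0.7 * d + 0.3 * s for d, s in zip(dense_scores, sparse_scores)]
best = max(range(len(corpus)), key=lambda i: hybrid[i])
print(corpus[best])
```
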
To sum up, the Embedding Model is the engine that translates text to math. A well-chosen model will place related pieces of information near each other in vector space, making the vector database effective. A poor choice could mean the retrieval misses relevant information or pulls in a lot of irrelevant material. Selecting and possibly customizing the embedding model is therefore a key design decision, often informed by experimentation and benchmarking on the kinds of queries expected.

In practice, a popular starting point is to use a proven model (for example, OpenAI’s text-embedding-ada-002, Cohere’s embedding model, or a SentenceTransformers model like all-MiniLM-L6-v2 if an open-source option is needed) and see whether it meets your needs. If gaps are observed (like certain synonyms not being recognized, or domain terms not clustering together), one might then explore domain-specific models or fine-tuning.
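
If fine-tuning is explored, the training utilities in sentence-transformers are one common route. The sketch below is a rough outline under the assumption that you have (or can mine) question-answer pairs from your own data and that the classic fit API is available; the two pairs shown are placeholders, and a real run would need far more examples.

```python
# Rough sketch: fine-tune an embedding model on question-answer pairs.
# Assumes sentence-transformers (classic fit API); data and output path are placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each example pairs a query with a passage that should embed nearby.
train_examples = [
    InputExample(texts=["What is a force majeure clause?",
                        "A force majeure clause excuses performance after unforeseeable events."]),
    InputExample(texts=["When is payment due?",
                        "Payment is due within thirty days of the invoice date."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# MultipleNegativesRankingLoss treats the other answers in a batch as negatives,
# so no explicitly labeled negative examples are required.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embedder")  # hypothetical output path
```

After training, the saved model can be loaded with SentenceTransformer("my-domain-embedder") and evaluated against the baseline on a held-out set of representative queries before replacing the original model in the pipeline.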