Retrieval-augmented generation (RAG) is a modern technique for working with large language models (LLMs) over vast volumes of data. Instead of sending all potentially relevant data to an LLM, the RAG method filters data entries by relevance to the user prompt before making LLM requests. Filtering is usually done with vector search, with an optional reranking step to improve accuracy.
This guide will explain typical RAG use cases and focus on implementing vector search. It will explain what vector embeddings are, how to generate them, how to measure similarity between vectors, and how to combine these steps into a working solution.
Use Cases
Let’s take a look at the traditional way to query an LLM:
- The user writes a prompt.
- The app generates a system prompt with instructions and relevant data to complete the task. The system prompt is built without knowledge of the contents of the user prompt.
- The system prompt + user prompt are sent to an LLM.
- The LLM responds with text data (usually JSON).
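To make this flow concrete, here is a rough sketch of a traditional call using the OpenAI chat completions API; the model name, instructions, and data below are placeholders rather than part of any specific application:

from openai import OpenAI

client = OpenAI()

# Placeholder: in a real app this would be all potentially relevant data
relevant_data = "…"

system_prompt = (
    "Answer the user's question using only the data below. "
    "Respond with JSON.\n\n" + relevant_data
)
user_prompt = "What is the capital of France?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(response.choices[0].message.content)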
This approach works in simple cases. However, when the system prompt grows, the following issues appear:
- Costs increase. If we send a lot of data into the system prompt, we must pay for all the input tokens it consumes.
- Decreased accuracy. Past a certain threshold, models start to hallucinate and struggle to distinguish between data entries.
- Context window overflow. With even more data, models will “forget” instructions and first data entries.
Imagine we are building an application that lets the user search the contents of uploaded books semantically. A naive implementation would put the instructions and the contents of all the books into the system prompt so the LLM can process a request. Depending on the model, this might stop working after just several pages.
A better way to solve the problem would be to implement semantic search to find only several pages or paragraphs relevant to the user prompt. This way, the application will work with almost arbitrary amounts of data, with acceptable costs and accuracy.
Vector Embeddings
In the RAG method, semantic search is implemented with vector comparison. For each data entry, we generate a vector that represents the meaning of its text as the model understands it. Such vectors are called vector embeddings, or sometimes just embeddings. If we also generate a vector for the search query, we can find the most relevant entries by comparing it with the vectors generated for each data item.
Vector embeddings are produced by embedding models. They accept a string (or an array of strings) and produce a fixed-length vector for each input. Some models allow the vector length to be configured: shorter vectors consume less memory and are faster to compare, but are also less accurate.
Let’s try to generate vector embeddings with OpenAI models to understand how it works in practice. The official OpenAI library has a dedicated API to generate embeddings. At a minimum, we need to provide the input string and the model's name:
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    # Request an embedding vector for the given text
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding
Currently, OpenAI provides three embedding models: text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002. text-embedding-3-small and text-embedding-3-large are the latest-generation models, with the small version being slightly less accurate and several times cheaper than the large one. text-embedding-3-small produces vectors of length 1536.
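As a quick sanity check, we can inspect the length of a generated vector. The snippet below is illustrative; the optional dimensions parameter, supported by the text-embedding-3 models, is one way to request shorter vectors:

# Illustrative check: text-embedding-3-small returns 1536-dimensional vectors
vector = get_embedding("The quick brown fox jumps over the lazy dog")
print(len(vector))  # 1536

# text-embedding-3 models also accept an optional `dimensions` argument
# that trades some accuracy for smaller vectors
short_vector = client.embeddings.create(
    input="The quick brown fox jumps over the lazy dog",
    model="text-embedding-3-small",
    dimensions=256,
).data[0].embedding
print(len(short_vector))  # 256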
If we want to implement a semantic search for our example with books, we must first generate embeddings for each paragraph. Let’s assume they are stored in a JSON file with the following structure:
[
    {"book_id": 1, "paragraph": "Lorem ipsum dolor sit amet…"},
    {"book_id": 1, "paragraph": "Duis aute irure dolor…"}
]
To generate embeddings, we need to load this file, call get_embedding for each entry, and save the result back as JSON:
import pandas as pd

df = pd.read_json('paragraphs.json')

# Generate an embedding for each paragraph and store it in a new column
df['embedding'] = df.apply(
    lambda row: get_embedding(row['paragraph']),
    axis=1
)

df.to_json('paragraphs_with_embeddings.json', orient='records')
After this, we can start implementing vector search.
Vector Similarity
Vector search is based on the concept of vector similarity, which measures how “close” two vectors are to each other. There are several similarity measures, but cosine similarity is used in most RAG cases. It is defined as the cosine of the angle between two vectors in a multidimensional space.
To illustrate the concept, let’s take a look at different vectors in 2-dimensional space:
- When the angle is close to 0, the cosine will approach 1, meaning the vectors are similar.
- If the angle is close to π/2, the cosine will approach 0; the vectors are orthogonal and not related to each other.
- If the angle is close to π, the cosine will approach -1; the vectors are opposite.
The same approach works in multidimensional space, where the formula is as follows:

cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))

With this information, we can implement a function that computes cosine similarity between two vector embeddings:
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
The function returns values close to 1 for closely related vectors, close to 0 for unrelated vectors, and close to -1 for opposite vectors.
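To get a feel for the output, here is an illustrative check with simple 2-dimensional vectors (the numbers are made up for demonstration):

# Illustrative 2-dimensional examples
print(cosine_similarity([1, 0], [2, 0.1]))   # ~0.999: nearly the same direction
print(cosine_similarity([1, 0], [0, 1]))     # 0.0: orthogonal, unrelated
print(cosine_similarity([1, 0], [-1, 0]))    # -1.0: opposite direction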
Vector Search
Vector search is implemented in several steps:
- Vector embeddings are generated for all input data. This needs to happen only once, not for every search request.
- A vector embedding needs to be generated for the search query.
- A vector similarity measure must be computed between the search query vector and vectors for the data being queried.
- Data should be sorted by the measure value. If cosine similarity is used, data can be sorted by its value in descending order.
- The top K entries are the most relevant to the user query.
We can follow this approach in our example with books. First of all, we need to load the data with pre-generated embeddings once. Then we can call the embedding model to generate a vector for the user query, compute cosine similarity between the user query vector and the vector of each paragraph, and sort the data by this measure. The top K records will contain the paragraphs most relevant to the user query:
df = pd.read_json('paragraphs_with_embeddings.json')

def search_books(user_query, limit):
    # Embed the query, score every paragraph, and return the top matches
    embedding = get_embedding(user_query)
    df['similarity'] = df.apply(
        lambda x: cosine_similarity(x['embedding'], embedding),
        axis=1
    )
    return df.sort_values(by='similarity', ascending=False).head(limit)
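A hypothetical call might look like this; the query string and limit are made up for illustration:

# Hypothetical usage: print the top 3 matching paragraphs
results = search_books("a storm at sea", limit=3)
for _, row in results.iterrows():
    print(row['book_id'], round(row['similarity'], 3), row['paragraph'][:80])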
Conclusion
The RAG method is used to retrieve context-specific information to augment LLM calls, as an alternative to providing all information at once. The classic implementation of RAG is based on vector search with an optional reranking step.
This guide walks through implementing the first RAG step, vector search, including vector embedding generation and similarity computation. We used cosine similarity, one of the most popular measures in RAG, and implemented a simple application with semantic search over the contents of books.
In the following article, we will discuss the issues of using vector search alone and how they can be resolved by adding a reranking step.