Retrieval augmented generation with OpenAI and OpenSearch®

Use OpenSearch® as a vector database to generate responses to user queries using AI

Retrieval augmented generation (RAG) is an AI technique that retrieves facts from an external knowledge base to provide large language models (LLMs) with accurate, up-to-date context, helping them craft better replies and reducing the risk of hallucination.

This tutorial guides you through an example of using Aiven for OpenSearch® as a backend vector database for OpenAI embeddings, and shows how to perform text, semantic, or mixed search, which can serve as the basis for a RAG system. We'll use a set of Wikipedia articles as the base knowledge to influence the replies of a chatbot.

Diagram: overall flow, including Aiven for OpenSearch and OpenAI

Why use OpenSearch as a backend vector database?

OpenSearch is a widely adopted open source search and analytics engine. It allows you to store, query, and transform documents in a variety of shapes, and it provides fast and scalable functionality for both exact and fuzzy text search. Using OpenSearch as a vector database enables you to mix and match semantic and text search queries on top of a performant and scalable engine.
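
To make this mixing concrete, here is a minimal sketch of one possible shape for a mixed query: a bool query combining a full-text match clause with an approximate k-NN clause. The field names (text, content_vector) match the index built later in this tutorial, and question_vector is a placeholder for a query embedding. Depending on your OpenSearch version, you may prefer the dedicated hybrid query type with a normalization search pipeline instead.

# Placeholder: in practice this would be an embedding of the user's query
question_vector = [0.0] * 1536

# One possible shape for a mixed text + vector query: a bool query whose
# "should" clauses blend lexical relevance (match) with semantic
# relevance (approximate k-NN)
hybrid_query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"text": "pizza toppings"}},
                {"knn": {"content_vector": {"vector": question_vector, "k": 5}}}
            ]
        }
    }
}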

Prerequisites

Before you begin, make sure you have the following:

  1. An Aiven Account. You can create an account and start a free trial with Aiven by navigating to the signup page and creating a user.
  2. An Aiven for OpenSearch service. You can spin up an Aiven for OpenSearch service in minutes in the Aiven Console with the following steps:
    • Click on Create service
    • Select OpenSearch
    • Choose the Cloud Provider and Region
    • Select the Service plan (the hobbyist plan is enough for the notebook)
    • Provide the Service name
    • Click on Create service
  3. The OpenSearch Connection String. The connection string is visible as Service URI in the Aiven for OpenSearch service overview page.
  4. Your OpenAI API key
  5. Python and pip installed locally.

Installing dependencies

The tutorial requires the following Python packages:

  • openai
  • pandas
  • wget
  • python-dotenv
  • opensearch-py

You can install the above packages with:

pip install openai pandas wget python-dotenv opensearch-py

OpenAI key settings

We'll use OpenAI to create embeddings starting from a set of documents, so we need an API key. You can get one from the OpenAI API Key page after logging in.

To avoid leaking the OpenAI key, you can store it as an environment variable named OPENAI_API_KEY.

Info

For more information on how to perform the same task across other operating systems, refer to Best Practices for API Key Safety.

To store the information safely, create a .env file in the same folder as the notebook and add the following line, replacing <INSERT_YOUR_API_KEY_HERE> with your OpenAI API key.

OPENAI_API_KEY=<INSERT_YOUR_API_KEY_HERE>

Connect to Aiven for OpenSearch

Once the Aiven for OpenSearch service is in the RUNNING state, we can retrieve the connection string from the Aiven for OpenSearch service page.

Copy the Service URI parameter and store it in the same .env file created above, replacing the https://USER:PASSWORD@HOST:PORT string with the Service URI:

OPENSEARCH_URI=https://USER:PASSWORD@HOST:PORT

We can now connect to Aiven for OpenSearch by adding the following code in Python:

import os
from opensearchpy import OpenSearch
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

connection_string = os.getenv("OPENSEARCH_URI")

# Create the client with SSL/TLS enabled
client = OpenSearch(connection_string, use_ssl=True, timeout=100)

The code above reads the OpenSearch connection string from the .env file (os.getenv("OPENSEARCH_URI")) and creates a client connection using SSL with a timeout of 100 seconds.
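
As a quick sanity check (not part of the original flow), you can ask the cluster for its basic information; opensearch-py exposes this via the client's info() method:

# Verify the connection by retrieving cluster information
print(client.info())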

Download the dataset

In theory we could use any dataset for this purpose, and you are more than welcome to bring your own. However, for simplicity's sake and to avoid calculating embeddings over a huge set of documents, we'll use a dataset of Wikipedia articles with pre-calculated OpenAI embeddings. We can download the file and unzip it with:

import wget
import zipfile

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")

Let's load the file in a pandas dataframe and check its content with:

import pandas as pd

wikipedia_dataframe = pd.read_csv("data/vector_database_wikipedia_articles_embedded.csv")
wikipedia_dataframe.head()

The file contains:

  • id: a unique Wikipedia article identifier
  • url: the Wikipedia article URL
  • title: the title of the Wikipedia page
  • text: the text of the article
  • title_vector and content_vector: the embeddings calculated on the title and the content of the Wikipedia article, respectively
  • vector_id: the id of the vector
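
The vector columns are stored as JSON strings in the CSV, so they need to be parsed before use (the tutorial does this at ingestion time with json.loads). As an optional sanity check, you can parse one vector and confirm its dimension, which should be 1536 for text-embedding-ada-002:

import json

# Parse one embedding from its CSV string representation and check its size
sample_vector = json.loads(wikipedia_dataframe["content_vector"][0])
print(len(sample_vector))  # expected: 1536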

Define the OpenSearch mapping to store the OpenAI embeddings

To properly store and query all the fields included in the dataset, we need to define OpenSearch index settings and a mapping optimized for storing the information, including the embeddings. For this purpose we can define the settings and the mapping via:

index_settings = {
    "index": {
        "knn": True,
        "knn.algo_param.ef_search": 100
    }
}

index_mapping = {
    "properties": {
        "title_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "content_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "text": {"type": "text"},
        "title": {"type": "text"},
        "url": {"type": "keyword"},
        "vector_id": {"type": "long"}
    }
}

The code above:

  • Defines an index with k-NN search enabled. k-nearest neighbors (k-NN) search scans a vector space (generated by embeddings) to retrieve the k closest vectors. You can read more in the OpenSearch k-NN documentation.
  • Defines a mapping with:
    • title_vector and content_vector of type knn_vector with dimension 1536 (vectors with 1536 entries)
    • text, containing the article text, as a text field
    • title, containing the article title, as a text field
    • url, containing the article URL, as a keyword field
    • vector_id, containing the id of the vector, as a long field

With the settings and mappings defined, we can now create the openai_wikipedia_index index in Aiven for OpenSearch with:

index_name = "openai_wikipedia_index"
client.indices.create(
    index=index_name,
    body={"settings": index_settings, "mappings": index_mapping}
)
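
If you want to confirm the index was created with the intended mapping (an optional check, not in the original flow), you can read it back:

# Retrieve the mapping to verify the index was created as expected
print(client.indices.get_mapping(index=index_name))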

Load data into OpenSearch

With the index created, the next step is to parse the pandas dataframe and load the data into OpenSearch using the Bulk API. The following function generates the bulk actions for a set of rows in the dataframe:

import json

def dataframe_to_bulk_actions(df):
    # Yield one bulk indexing action per dataframe row, parsing the
    # embedding columns from their JSON string representation
    for index, row in df.iterrows():
        yield {
            "_index": index_name,
            "_id": row['id'],
            "_source": {
                'url': row["url"],
                'title': row["title"],
                'text': row["text"],
                'title_vector': json.loads(row["title_vector"]),
                'content_vector': json.loads(row["content_vector"]),
                'vector_id': row["vector_id"]
            }
        }

To speed up ingestion, we can load the data in batches of 200 rows.

from opensearchpy import helpers

start = 0
end = len(wikipedia_dataframe)
batch_size = 200

for batch_start in range(start, end, batch_size):
    batch_end = min(batch_start + batch_size, end)
    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]
    actions = dataframe_to_bulk_actions(batch_dataframe)
    helpers.bulk(client, actions)
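
To verify that ingestion completed (an optional check, not part of the original flow), refresh the index and count the documents; the count should match the number of rows in the dataframe:

# Make all indexed documents searchable, then count them
client.indices.refresh(index=index_name)
print(client.count(index=index_name)["count"])  # should equal len(wikipedia_dataframe)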

Once all the documents are loaded, we can try a query to retrieve the documents containing Pizza:

res = client.search(index=index_name, body={
    "_source": {
        "excludes": ["title_vector", "content_vector"]
    },
    "query": {
        "match": {
            "text": {
                "query": "Pizza"
            }
        }
    }
})
print(res["hits"]["hits"][0]["_source"]["text"])

The result is the Wikipedia article talking about Pizza:

Pizza is an Italian food that was created in Italy (The Naples area). It is made with different toppings. Some of the most common toppings are cheese, sausages, pepperoni, vegetables, tomatoes, spices and herbs and basil. These toppings are added over a piece of bread covered with sauce. The sauce is most often tomato-based, but butter-based sauces are used, too. The piece of bread is usually called a "pizza crust". Almost any kind of topping can be put over a pizza. The toppings used are different in different parts of the world. Pizza comes from Italy from Neapolitan cuisine. However, it has become popular in many parts of the world. History The origin of the word Pizza is uncertain. The food was invented in Naples about 200 years ago. It is the name for a special type of flatbread, made with special dough. The pizza enjoyed a second birth as it was taken to the United States in the late 19th century. ...

Encode chatbot questions with the OpenAI text-embedding-ada-002 model

To perform a semantic search, we need to encode the question with the same embedding model used to encode the documents at index time. In this example, that's the text-embedding-ada-002 model.

from openai import OpenAI

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-ada-002"

# Define the client
openaiclient = OpenAI(
    # This is the default and can be omitted
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Define the question
question = 'is Pineapple a good ingredient for Pizza?'

# Create the embedding
question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)

Run semantic search queries with OpenSearch

With the embedding calculated, we can now run semantic searches against the OpenSearch index to retrieve the necessary context for the retrieval-augmented generation. We use knn as the query type and search against the content_vector field.

response = client.search(
    index=index_name,
    body={
        "size": 15,
        "query": {
            "knn": {
                "content_vector": {
                    "vector": question_embedding.data[0].embedding,
                    "k": 3
                }
            }
        }
    }
)

for result in response["hits"]["hits"]:
    print("Id:" + str(result['_id']))
    print("Score: " + str(result["_score"]))
    print("Title: " + str(result["_source"]["title"]))
    print("Text: " + result["_source"]["text"][0:100])

The result is the list of articles ranked by score:

Id:13967
Score: 13.94602
Title: Pizza
Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp
Id:90918
Score: 13.754393
Title: Pizza Hut
Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and
Id:66079
Score: 13.634726
Title: Pizza Pizza
Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo
Id:85932
Score: 11.388243
Title: Margarita
Text: Margarita may mean: The margarita, a cocktail made with tequila and triple sec Margarita Island, a
Id:13968
Score: 10.576359
Title: Pepperoni
Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi
Id:87088
Score: 9.424156
Title: Margherita of Savoy
...

Use OpenAI Chat Completions API to generate a RAG reply

The step above retrieves content semantically similar to the question. Now let's use the OpenAI Chat Completions API to generate a retrieval-augmented reply based on the retrieved information.

# Retrieve the text of the first result in the above dataset
top_hit_summary = response['hits']['hits'][0]['_source']['text']

# Craft a reply
response = openaiclient.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",
         "content": "Answer the following question: " + question
                    + " by using the following text: " + top_hit_summary}
    ]
)

choices = response.choices

for choice in choices:
    print(choice.message.content)

The result will be similar to the following:

Pineapple is a contentious ingredient for pizza, and it is commonly used at Pizza Pizza Limited (PPL), a Canadian fast-food restaurant with locations throughout Ontario. The menu includes a variety of toppings such as pepperoni, pineapples, mushrooms, and other non-exotic produce. Pizza Pizza has been a staple in the area for over 30 years, with over 500 locations in Ontario and expanding across the nation, including recent openings in Montreal and British Columbia.
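
To reuse this flow for other questions, the steps above can be stitched into a single helper. The sketch below is a minimal example, not part of the original tutorial: rag_reply is a hypothetical name, and it reuses the client, index_name, openaiclient, and EMBEDDING_MODEL objects defined earlier.

def rag_reply(question, k=3):
    # Embed the question with the same model used at index time
    question_embedding = openaiclient.embeddings.create(
        input=question, model=EMBEDDING_MODEL
    )
    # Retrieve the k most semantically similar articles
    search_response = client.search(
        index=index_name,
        body={
            "size": k,
            "query": {
                "knn": {
                    "content_vector": {
                        "vector": question_embedding.data[0].embedding,
                        "k": k
                    }
                }
            }
        }
    )
    # Use the top hit as context for the chat completion
    context = search_response["hits"]["hits"][0]["_source"]["text"]
    completion = openaiclient.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user",
             "content": "Answer the following question: " + question
                        + " by using the following text: " + context}
        ]
    )
    return completion.choices[0].message.content

print(rag_reply("is Pineapple a good ingredient for Pizza?"))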

Conclusion

OpenSearch is a powerful tool providing both text and vector search capabilities. Used alongside the OpenAI APIs, it allows you to craft personalized AI applications that augment the context via semantic search and return AI-generated responses to queries. A logical next step would be to pair OpenSearch with another storage system to, for example, store the responses that your customers find useful and train the model further. Building an end-to-end system including a database like PostgreSQL® and a streaming integration with Apache Kafka® could combine the resiliency of a relational database with the hybrid search capability of OpenSearch, with data feeding in near real time.

You can try Aiven for OpenSearch, or any of the other open source tools on the Aiven platform, by signing up for a free trial.