Retrieval augmented generation with OpenAI and OpenSearch®

Use OpenSearch® as a vector database to generate responses to user queries using AI

Retrieval augmented generation (RAG) is an AI technique that retrieves facts from an external knowledge base to provide large language models (LLMs) with accurate, up-to-date context, helping them craft better replies and reducing the risk of hallucination.

This tutorial guides you through an example of using Aiven for OpenSearch® as a backend vector database for OpenAI embeddings, and shows how to perform text, semantic, or mixed search, which can serve as the basis for a RAG system. We'll use a set of Wikipedia articles as the base knowledge to influence the replies of a chatbot.

Diagram: overall flow, including Aiven for OpenSearch and OpenAI

Why use OpenSearch as a backend vector database?

OpenSearch is a widely adopted open source search and analytics engine. It allows you to store, query, and transform documents in a variety of shapes, and it provides fast and scalable functionality for both exact and fuzzy text search. Using OpenSearch as a vector database enables you to mix and match semantic and text search queries on top of a performant and scalable engine.
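
To make this mixing concrete, here is a minimal sketch of one possible shape for a mixed query: a bool query combining a full-text match clause with an approximate k-NN clause. The field names (text, content_vector) match the index built later in this tutorial, and question_vector is a placeholder for a query embedding. Depending on your OpenSearch version, you may prefer the dedicated hybrid query type with a normalization search pipeline instead.

# Placeholder: in practice this would be an embedding of the user's query
question_vector = [0.0] * 1536

# One possible shape for a mixed text + vector query: a bool query whose
# "should" clauses blend lexical relevance (match) with semantic
# relevance (approximate k-NN)
hybrid_query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"text": "pizza toppings"}},
                {"knn": {"content_vector": {"vector": question_vector, "k": 5}}}
            ]
        }
    }
}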

Prerequisites

Before you begin, make sure you have the following:

  1. An Aiven Account. You can create an account and start a free trial with Aiven by navigating to the signup page and creating a user.
  2. An Aiven for OpenSearch service. You can spin up an Aiven for OpenSearch service in minutes in the Aiven Console with the following steps:
    • Click on Create service
    • Select OpenSearch
    • Choose the Cloud Provider and Region
    • Select the Service plan (the hobbyist plan is enough for the notebook)
    • Provide the Service name
    • Click on Create service
  3. The OpenSearch Connection String. The connection string is visible as Service URI in the Aiven for OpenSearch service overview page.
  4. Your OpenAI API key
  5. Python and pip installed locally.

Installing dependencies

The tutorial requires the following Python packages:

  • openai
  • pandas
  • wget
  • python-dotenv
  • opensearch-py

You can install the above packages with:

pip install openai pandas wget python-dotenv opensearch-py

OpenAI key settings

We'll use OpenAI to create embeddings starting from a set of documents, so we need an API key. You can get one from the OpenAI API Key page after logging in.

To avoid leaking the OpenAI key, you can store it as an environment variable named OPENAI_API_KEY.

Info

For more information on how to perform the same task across other operating systems, refer to Best Practices for API Key Safety.

To store the information safely, create a .env file in the same folder as the notebook and add the following line, replacing <INSERT_YOUR_API_KEY_HERE> with your OpenAI API key.

OPENAI_API_KEY=<INSERT_YOUR_API_KEY_HERE>

Connect to Aiven for OpenSearch

Once the Aiven for OpenSearch service is in the RUNNING state, we can retrieve the connection string from the Aiven for OpenSearch service page.

Copy the Service URI parameter and store it in the same .env file created above, replacing the https://USER:PASSWORD@HOST:PORT string with the Service URI:

OPENSEARCH_URI=https://USER:PASSWORD@HOST:PORT

We can now connect to Aiven for OpenSearch by adding the following code in Python:

import os
from opensearchpy import OpenSearch
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

connection_string = os.getenv("OPENSEARCH_URI")

# Create the client with SSL/TLS enabled
client = OpenSearch(connection_string, use_ssl=True, timeout=100)

The code above reads the OpenSearch connection string from the .env file (os.getenv("OPENSEARCH_URI")) and creates a client connection using SSL with a timeout of 100 seconds.
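
As a quick sanity check (not part of the original flow), you can ask the cluster for its basic information; opensearch-py exposes this via the client's info() method:

# Verify the connection by retrieving cluster information
print(client.info())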

Download the dataset

In theory we could use any dataset for this purpose, and you are more than welcome to bring your own. However, for simplicity's sake and to avoid calculating embeddings over a huge set of documents, we'll use a dataset of Wikipedia articles with pre-calculated OpenAI embeddings. We can download the file and unzip it with:

import wget
import zipfile

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")

Let's load the file in a pandas dataframe and check its content with:

import pandas as pd

wikipedia_dataframe = pd.read_csv("data/vector_database_wikipedia_articles_embedded.csv")
wikipedia_dataframe.head()

The file contains:

  • id: a unique Wikipedia article identifier
  • url: the Wikipedia article URL
  • title: the title of the Wikipedia page
  • text: the text of the article
  • title_vector and content_vector: the embeddings calculated on the title and the content of the Wikipedia article, respectively
  • vector_id: the id of the vector
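
The vector columns are stored as JSON strings in the CSV, so they need to be parsed before use (the tutorial does this at ingestion time with json.loads). As an optional sanity check, you can parse one vector and confirm its dimension, which should be 1536 for text-embedding-ada-002:

import json

# Parse one embedding from its CSV string representation and check its size
sample_vector = json.loads(wikipedia_dataframe["content_vector"][0])
print(len(sample_vector))  # expected: 1536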

Define the OpenSearch mapping to store the OpenAI embeddings

To properly store and query all the fields included in the dataset, we need to define OpenSearch index settings and a mapping optimized for storing the information, including the embeddings. For this purpose we can define the settings and the mapping via:

index_settings = {
    "index": {
        "knn": True,
        "knn.algo_param.ef_search": 100
    }
}

index_mapping = {
    "properties": {
        "title_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "content_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "text": {"type": "text"},
        "title": {"type": "text"},
        "url": {"type": "keyword"},
        "vector_id": {"type": "long"}
    }
}

The code above:

  • Defines an index with k-NN search enabled. k-nearest neighbors (k-NN) search scans a vector space (generated by embeddings) to retrieve the k closest vectors. You can read more in the OpenSearch k-NN documentation.
  • Defines a mapping with:
    • title_vector and content_vector of type knn_vector with dimension 1536 (vectors with 1536 entries)
    • text, containing the article text, as a text field
    • title, containing the article title, as a text field
    • url, containing the article URL, as a keyword field
    • vector_id, containing the id of the vector, as a long field

With the settings and mappings defined, we can now create the openai_wikipedia_index index in Aiven for OpenSearch with:

index_name = "openai_wikipedia_index"
client.indices.create(
    index=index_name,
    body={"settings": index_settings, "mappings": index_mapping}
)
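
If you want to confirm the index was created with the intended mapping (an optional check, not in the original flow), you can read it back:

# Retrieve the mapping to verify the index was created as expected
print(client.indices.get_mapping(index=index_name))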

Load data into OpenSearch

With the index created, the next step is to parse the pandas dataframe and load the data into OpenSearch using the Bulk API. The following function generates the bulk actions for a set of rows in the dataframe:

import json

def dataframe_to_bulk_actions(df):
    # Yield one bulk indexing action per dataframe row, parsing the
    # embedding columns from their JSON string representation
    for index, row in df.iterrows():
        yield {
            "_index": index_name,
            "_id": row['id'],
            "_source": {
                'url': row["url"],
                'title': row["title"],
                'text': row["text"],
                'title_vector': json.loads(row["title_vector"]),
                'content_vector': json.loads(row["content_vector"]),
                'vector_id': row["vector_id"]
            }
        }

To speed up ingestion, we can load the data in batches of 200 rows.

from opensearchpy import helpers

start = 0
end = len(wikipedia_dataframe)
batch_size = 200

for batch_start in range(start, end, batch_size):
    batch_end = min(batch_start + batch_size, end)
    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]
    actions = dataframe_to_bulk_actions(batch_dataframe)
    helpers.bulk(client, actions)
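
To verify that ingestion completed (an optional check, not part of the original flow), refresh the index and count the documents; the count should match the number of rows in the dataframe:

# Make all indexed documents searchable, then count them
client.indices.refresh(index=index_name)
print(client.count(index=index_name)["count"])  # should equal len(wikipedia_dataframe)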

Once all the documents are loaded, we can try a query to retrieve the documents containing Pizza:

res = client.search(index=index_name, body={
    "_source": {
        "excludes": ["title_vector", "content_vector"]
    },
    "query": {
        "match": {
            "text": {
                "query": "Pizza"
            }
        }
    }
})
print(res["hits"]["hits"][0]["_source"]["text"])

The result is the Wikipedia article talking about Pizza:

Pizza is an Italian food that was created in Italy (The Naples area). It is made with different toppings. Some of the most common toppings are cheese, sausages, pepperoni, vegetables, tomatoes, spices and herbs and basil. These toppings are added over a piece of bread covered with sauce. The sauce is most often tomato-based, but butter-based sauces are used, too. The piece of bread is usually called a "pizza crust". Almost any kind of topping can be put over a pizza. The toppings used are different in different parts of the world. Pizza comes from Italy from Neapolitan cuisine. However, it has become popular in many parts of the world. History The origin of the word Pizza is uncertain. The food was invented in Naples about 200 years ago. It is the name for a special type of flatbread, made with special dough. The pizza enjoyed a second birth as it was taken to the United States in the late 19th century. ...

Encode chatbot questions with the OpenAI text-embedding-ada-002 model

To perform a semantic search, we need to encode the question with the same embedding model used to encode the documents at index time. In this example, that's the text-embedding-ada-002 model.

from openai import OpenAI

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-ada-002"

# Define the client
openaiclient = OpenAI(
    # This is the default and can be omitted
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Define the question
question = 'is Pineapple a good ingredient for Pizza?'

# Create the embedding
question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)

Run semantic search queries with OpenSearch

With the embedding calculated, we can now run semantic searches against the OpenSearch index to retrieve the necessary context for the retrieval-augmented generation. We use knn as the query type and search against the content_vector field.

response = client.search(
    index=index_name,
    body={
        "size": 15,
        "query": {
            "knn": {
                "content_vector": {
                    "vector": question_embedding.data[0].embedding,
                    "k": 3
                }
            }
        }
    }
)

for result in response["hits"]["hits"]:
    print("Id:" + str(result['_id']))
    print("Score: " + str(result["_score"]))
    print("Title: " + str(result["_source"]["title"]))
    print("Text: " + result["_source"]["text"][0:100])

The result is the list of articles ranked by score:

Id:13967
Score: 13.94602
Title: Pizza
Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp
Id:90918
Score: 13.754393
Title: Pizza Hut
Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and
Id:66079
Score: 13.634726
Title: Pizza Pizza
Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo
Id:85932
Score: 11.388243
Title: Margarita
Text: Margarita may mean: The margarita, a cocktail made with tequila and triple sec Margarita Island, a
Id:13968
Score: 10.576359
Title: Pepperoni
Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi
Id:87088
Score: 9.424156
Title: Margherita of Savoy
...

Use OpenAI Chat Completions API to generate a RAG reply

The step above retrieves content semantically similar to the question. Now let's use the OpenAI Chat Completions API to generate a retrieval-augmented reply based on the retrieved information.

# Retrieve the text of the first result in the above dataset
top_hit_summary = response['hits']['hits'][0]['_source']['text']

# Craft a reply
response = openaiclient.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",
         "content": "Answer the following question: " + question
                    + " by using the following text: " + top_hit_summary}
    ]
)

choices = response.choices

for choice in choices:
    print(choice.message.content)

The result will be similar to the following:

Pineapple is a contentious ingredient for pizza, and it is commonly used at Pizza Pizza Limited (PPL), a Canadian fast-food restaurant with locations throughout Ontario. The menu includes a variety of toppings such as pepperoni, pineapples, mushrooms, and other non-exotic produce. Pizza Pizza has been a staple in the area for over 30 years, with over 500 locations in Ontario and expanding across the nation, including recent openings in Montreal and British Columbia.
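
To reuse this flow for other questions, the steps above can be stitched into a single helper. The sketch below is a minimal example, not part of the original tutorial: rag_reply is a hypothetical name, and it reuses the client, index_name, openaiclient, and EMBEDDING_MODEL objects defined earlier.

def rag_reply(question, k=3):
    # Embed the question with the same model used at index time
    question_embedding = openaiclient.embeddings.create(
        input=question, model=EMBEDDING_MODEL
    )
    # Retrieve the k most semantically similar articles
    search_response = client.search(
        index=index_name,
        body={
            "size": k,
            "query": {
                "knn": {
                    "content_vector": {
                        "vector": question_embedding.data[0].embedding,
                        "k": k
                    }
                }
            }
        }
    )
    # Use the top hit as context for the chat completion
    context = search_response["hits"]["hits"][0]["_source"]["text"]
    completion = openaiclient.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user",
             "content": "Answer the following question: " + question
                        + " by using the following text: " + context}
        ]
    )
    return completion.choices[0].message.content

print(rag_reply("is Pineapple a good ingredient for Pizza?"))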

Conclusion

OpenSearch is a powerful tool providing both text and vector search capabilities. Used alongside the OpenAI APIs, it allows you to craft personalized AI applications that augment the context via semantic search and return AI-generated responses to queries. A logical next step would be to pair OpenSearch with another storage system to, for example, store the responses that your customers find useful and train the model further. Building an end-to-end system including a database like PostgreSQL® and a streaming integration with Apache Kafka® could combine the resiliency of a relational database with the hybrid search capability of OpenSearch, with data feeding in near real time.

You can try Aiven for OpenSearch, or any of the other open source tools on the Aiven platform, by signing up for a free trial.