When text meets image: a guide to OpenSearch® for multimodal search

Learn how to use Aiven for OpenSearch® to search a collection of images using plain text descriptions!

Do you recall the frustration of looking through an endless sea of images, desperately seeking that one photograph buried deep within your digital library? Perhaps it was a snapshot from a past vacation, a friend's birthday, or a sunrise on a freezing winter morning. Wouldn't it be great if you could just describe the photo and find it easily? Well, if you want to know how to do that, you're in the right place! Keep reading to learn how to connect text to image search using OpenAI's CLIP model and the OpenSearch® search engine.

In this tutorial we'll use a multimodal vision and language model called CLIP. What makes this model special is that it works with both images and text interchangeably, producing embeddings for input data that is either a text snippet or an image. These embeddings, also called "vectors", are arrays of numeric values that define the location of an object in a multidimensional space. The model calculates this location by looking at various features of the object.

CLIP works perfectly for our use case: on one side, we can convert each photo into a vector; on the other, we can do the same with a search text description, producing another vector. Once we have both, we can compare the image vectors to the search-phrase vector and find the photo closest to the description.
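To make the comparison concrete, here is a minimal sketch of how two such vectors can be compared, using NumPy and made-up three-dimensional values (real CLIP embeddings have 512 dimensions). For normalized vectors, the dot product equals the cosine similarity, which is the intuition behind the nearest-neighbor search we'll run later:

import numpy as np

# Hypothetical, made-up embeddings; real CLIP vectors have 512 dimensions
photo_vector = np.array([0.12, -0.45, 0.88])
text_vector = np.array([0.10, -0.40, 0.91])

# Normalize the vectors so each has length 1
photo_vector /= np.linalg.norm(photo_vector)
text_vector /= np.linalg.norm(text_vector)

# The dot product of normalized vectors is their cosine similarity:
# values close to 1 mean the photo and the text describe similar things
similarity = np.dot(photo_vector, text_vector)
print(f"Similarity: {similarity:.3f}")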

To accomplish this, we'll need:

  • Photos to experiment with — I'll be using a dataset from Unsplash.
  • The CLIP (Contrastive Language-Image Pre-Training) model to generate vectors.
  • OpenSearch to store the vectors, compare them, and find the nearest ones using the k-nearest neighbors (k-NN) algorithm.

Here is the plan:

  1. Download photos.
  2. Process each photo using the CLIP model and retrieve the generated vectors.
  3. Store the vectors and corresponding photo URLs in OpenSearch.
  4. Process a search phrase using the CLIP model.
  5. Retrieve a single vector.
  6. Send a similarity search request to OpenSearch and retrieve the closest photo.

Step-by-step plan for this tutorial

If you're unsure where to begin, don't worry; we'll navigate through each step together!

GitHub repository with complete code

For easy access to all the code that we'll discuss, find the complete implementation in the GitHub repository. I recommend cloning the repository and running the steps in a Jupyter notebook.

Set up the Python environment

In this tutorial, we'll use Python notebooks. If you haven't already set up your Python environment, now is the time to do so. First, install Python by following the steps at www.python.org. Next, create a Python environment. This can be done in different ways; below is how I did it using Conda, but your setup steps might differ.

conda create --name photo_env
conda activate photo_env
conda install pip
conda install pandas
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=photo_env

Now, when running a notebook, you can select photo_env as its environment.

Download the archive with image references

Unsplash offers a great collection of high-quality images that we can use for this project. There are two versions of their dataset; we'll use the smaller one, called "Lite", which contains approximately 25,000 pictures. To obtain the dataset, follow these steps:

  1. Visit the Download section of the Unsplash Datasets repository.
  2. Download the archive provided in the repository.

The archive itself doesn't contain images; it contains only their URLs. The URLs are in the file photos.tsv000, which holds a record for each photo, including its URL information:

A look into photos TSV file
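If you'd like to inspect the file yourself, a quick way is to load a few rows with pandas. This peek is optional and assumes photos.tsv000 is in your working directory; photo_image_url is the column we'll rely on later:

import pandas as pd

# Peek at the first few records to see the available columns
df = pd.read_csv('photos.tsv000', sep='\t', nrows=3)
print(df.columns.tolist())
print(df['photo_image_url'].head())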

Download all images

To store and later display images by URL, we need to establish a connection between each downloaded image and its unique identifier. While photos.tsv000 contains both the image id and the image URL, the URL can't easily be derived from the id.

It's tempting to use the image id as the filename for downloaded images, since it's short and contains no special characters. However, this approach would require searching through photos.tsv000 every time we need to look up an image URL by its id, which is quite inefficient.

To simplify the process and ensure that we only need to read photos.tsv000 once during the image loading phase, we'll embed the URL into the filename of each downloaded photo. For example, here is how we can achieve it for the first row from photos.tsv000:

bygTaBey1Xk https://unsplash.com/photos/bygTaBey1Xk https://images.unsplash.com/uploads/1413387620228d142bee4/23eceb86 2014-10-15 15:40:40.111061 t 4635 3070 1.51 jaspervandermeij Jasper van der Meij PENTAX RICOH IMAGING GR 100 14.0 18.3 8 1708356 19085 sea and rock cliff with grasses under cloudy sky Neist Point 57.428386927437906 -6.7830279999999998 30.348905999999999 LcE{wnIVRixt~WR+NGjbxukCWBWB

For the photo above, the URL is https://images.unsplash.com/uploads/1413387620228d142bee4/23eceb86. However, because filenames can't contain the forward slash /, we need a small trick: to embed the URL into the filename, we replace each / with a different character, such as !. The filename thus becomes https:!!images.unsplash.com!uploads!1413387620228d142bee4!23eceb86. Later, when we need the URL, we replace each ! back with /.
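If it helps to see the round trip in isolation, here is a small pair of helper functions (hypothetical names, not part of the notebook) that capture the idea:

def url_to_filename(url: str) -> str:
    # A filename can't contain "/", so swap it for "!"
    return url.replace("/", "!")

def filename_to_url(filename: str) -> str:
    # Reverse the substitution to recover the original URL
    return filename.replace("!", "/")

# Round trip for the example row above
url = "https://images.unsplash.com/uploads/1413387620228d142bee4/23eceb86"
assert filename_to_url(url_to_filename(url)) == url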

To implement this process, move photos.tsv000 to your project directory and run 1-load-images.ipynb, which contains the code to embed the URLs into filenames and download the images:

import csv
import os
import requests
from urllib.parse import urlparse

# Create a folder to store the images if it doesn't exist
folder_name = 'photos'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# Read the TSV file and extract photo URLs
with open('photos.tsv000', newline='', encoding='utf-8') as tsvfile:
    reader = csv.DictReader(tsvfile, delimiter='\t')
    count = 0  # Counter for downloaded images

    for row in reader:
        photo_image_url = row['photo_image_url']
        # Replace "/" in the URL with "!" to be used as a file name
        # that we can reconstruct into the URL later
        filename = photo_image_url.replace("/", "!")

        try:
            # Download the image
            photo_image_url = f"{photo_image_url}?w=640"
            response = requests.get(photo_image_url)
            if response.status_code == 200:
                # Save the image to the folder
                count += 1
                with open(os.path.join(folder_name, filename), 'wb') as f:
                    f.write(response.content)
                print(f"Downloaded image {count}: {filename}")
            else:
                print(f"Failed to download image {filename}")
        except Exception as e:
            print(f"Error downloading image {filename}: {str(e)}")

print("All images downloaded successfully.")

Downloading ~25,000 images takes a while; you can track the progress in the notebook output.
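If you'd rather test the rest of the pipeline before committing to the full download, one option (not part of the notebook) is to limit the run to a small sample, for example with itertools.islice:

import csv
import os
import requests
from itertools import islice

folder_name = 'photos'
os.makedirs(folder_name, exist_ok=True)

with open('photos.tsv000', newline='', encoding='utf-8') as tsvfile:
    reader = csv.DictReader(tsvfile, delimiter='\t')
    # islice limits this test run to the first 100 photos
    for row in islice(reader, 100):
        url = row['photo_image_url']
        filename = url.replace("/", "!")
        response = requests.get(f"{url}?w=640")
        if response.status_code == 200:
            with open(os.path.join(folder_name, filename), 'wb') as f:
                f.write(response.content)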

Prepare OpenSearch

We'll use OpenSearch both as storage and as a search engine that can work with vectors.

Start by creating an Aiven for OpenSearch® service by signing up for an Aiven account and accessing the Aiven Console.

To connect to the OpenSearch cluster you'll need its Service URI. Find this value on the landing page of the service.

To avoid hard-coding credentials in the code, we'll use python-dotenv. This means all credentials go into an environment file that is excluded from the repository. Create a .env file and add the Service URI there (see .env.example):

SERVICE_URI='add your OpenSearch service URI here'

You'll also need the Python dependencies for OpenSearch and dotenv; install them:

pip install opensearch-py
pip install python-dotenv

Now you're ready to run the code from the second notebook. Open 2-prepare-opensearch.ipynb and run the code to create an index with the name "photos":

import os
from dotenv import load_dotenv
load_dotenv()
SERVICE_URI = os.getenv("SERVICE_URI")

from opensearchpy import OpenSearch
opensearch = OpenSearch(SERVICE_URI, use_ssl=True)

index_name = 'photos'
index_body = {
    'settings': {
        'index': {
            "knn": True
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 512
            }
        }
    }
}

opensearch.indices.create(index_name, body=index_body)

OpenSearch can work with dynamic schemas, so explicitly creating an index isn't usually mandatory. However, when dealing with vectors and enabling k-nearest neighbors (KNN) functionality, it's essential to define the mappings explicitly. This ensures that OpenSearch knows which field contains the embeddings and how many dimensions they have.
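To double-check that the index was created as expected, you can ask OpenSearch for the stored mapping. This optional sanity check isn't in the notebook and reuses the opensearch client and index_name defined above:

# Optional check: confirm the embedding field is mapped as a knn_vector
mapping = opensearch.indices.get_mapping(index=index_name)
print(mapping[index_name]['mappings']['properties']['embedding'])
# Expected output resembles: {'type': 'knn_vector', 'dimension': 512}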

Process images

If you followed the steps above, by now you should have a photos folder containing the Unsplash images. Time to process each of those images, get a vector, and store the data in OpenSearch. To create the vectors we'll use the CLIP model.

You'll need to install several libraries to work with CLIP (see detailed instructions in the CLIP GitHub repo):

pip install git+https://github.com/openai/CLIP.git
pip install ftfy regex tqdm
pip install Pillow
pip install torch

Go to 3-process-and-upload.ipynb and do all the steps.

Load the model:

import os
import clip
import torch
from PIL import Image

# Load the open CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

Create a function that computes the feature vectors for a batch of images. Here we rely on the torch library: first the images are preprocessed for the available device, then the model computes a feature vector for each image in the batch. Finally, we normalize the vectors so they all have a consistent scale:

def compute_clip_features(photos_batch):
    # Load all the photos from the files
    photos = [Image.open(photo_file) for photo_file in photos_batch]

    # Preprocess all photos
    photos_preprocessed = torch.stack([preprocess(photo) for photo in photos]).to(device)

    with torch.no_grad():
        # Encode the photos batch to compute the feature vectors and normalize them
        photos_features = model.encode_image(photos_preprocessed)
        photos_features /= photos_features.norm(dim=-1, keepdim=True)

    # Transfer the feature vectors back to the CPU and convert to numpy
    return photos_features.cpu().numpy()

Create a function to index embeddings into OpenSearch:

import os
from dotenv import load_dotenv
load_dotenv()
SERVICE_URI = os.getenv("SERVICE_URI")

from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk

opensearch = OpenSearch(SERVICE_URI, use_ssl=True)

def index_embeddings_to_opensearch(data):
    actions = []
    for d in data:
        action = {
            "_index": "photos",  # Update with your index name
            "_source": {
                "image_url": d['image_url'],
                "embedding": d['embedding'].tolist()
            }
        }
        actions.append(action)
    success, _ = bulk(opensearch, actions, index="photos")
    print(f"Indexed {success} embeddings to OpenSearch")

Iterate over images and process them in batches:

import os

# Path to the directory containing photos
image_dir = "photos"

# Batch size for processing images and indexing embeddings
batch_size = 100

# List to store embeddings
data = []

# Process images in batches
image_files = os.listdir(image_dir)
for i in range(0, len(image_files), batch_size):
    batch_files = image_files[i:i+batch_size]
    batch_file_paths = [os.path.join(image_dir, file) for file in batch_files]

    # Compute embeddings for the batch of images
    batch_embeddings = compute_clip_features(batch_file_paths)

    # Create data dictionary for indexing
    for file_path, embedding in zip(batch_file_paths, batch_embeddings):
        image_url = os.path.basename(file_path).replace("!", "/")
        data.append({'image_url': image_url, 'embedding': embedding})

    # Check if we have enough data to index
    if len(data) >= batch_size:
        index_embeddings_to_opensearch(data)
        data = []

# Index any remaining data
if len(data) > 0:
    index_embeddings_to_opensearch(data)

print("All embeddings indexed successfully.")
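Once the loop has finished, you can verify how many documents ended up in the index. This quick check isn't in the notebook and reuses the opensearch client created earlier:

# How many embeddings made it into the "photos" index?
response = opensearch.count(index="photos")
print(f"Documents in index: {response['count']}")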

The image data is now processed and stored in OpenSearch. Time to create the functionality to search for an image! Find all the steps in the notebook 4-run-vector-search.

Load the open CLIP model and create a function to generate an embedding for a single text:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def get_single_embedding(text):
    with torch.no_grad():
        # Encode the text to compute the feature vector and normalize it
        text_input = clip.tokenize([text]).to(device)
        text_features = model.encode_text(text_input)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # Return the feature vector
    return text_features.cpu().numpy()[0]

Create a function to perform KNN search in OpenSearch:

import os
from dotenv import load_dotenv
load_dotenv()
SERVICE_URI = os.getenv("SERVICE_URI")
index_name = "photos"  # Update with your index name

from opensearchpy import OpenSearch
opensearch = OpenSearch(SERVICE_URI, use_ssl=True)

def knn_search(text):
    vector = get_single_embedding(text)

    body = {
        "query": {
            "knn": {
                "embedding": {
                    "vector": vector.tolist(),  # Convert to list
                    "k": 2  # Number of nearest neighbors to retrieve
                }
            }
        }
    }

    # Perform search
    result = opensearch.search(index=index_name, body=body)
    return result

Create a function to display an image as a result:

from IPython.display import display, Image
import pandas as pd

# Read the TSV file into a DataFrame
file_path = 'photos.tsv000'
df = pd.read_csv(file_path, sep='\t')

def display_image_by_id(result):
    # Check if hits are present in the result
    if 'hits' in result and 'hits' in result['hits']:
        hits = result['hits']['hits']
        # Extract image_url from the first hit
        if hits:
            image_url = hits[0]['_source']['image_url']
            image_url = f"{image_url}?w=640"
            # Display the image
            display(Image(url=image_url))
        else:
            print("No hits found in the result.")
    else:
        print("Invalid result format or no hits found.")

Finally, run the example:

text_input = "dog at home"  # Provide your text input here
result = knn_search(text_input)
display_image_by_id(result)
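If you'd like to see more than the single best match, you can also print the returned hits with their similarity scores. This small addition reuses the result returned by knn_search above:

# Inspect all returned hits with their scores (k is set to 2 in knn_search)
for hit in result['hits']['hits']:
    print(f"score={hit['_score']:.4f}  url={hit['_source']['image_url']}")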

Running an example with the phrase "dog at home"

What's next

By following this tutorial, you've learned how to use multimodal search with OpenSearch. If you're interested in related topics, check out these articles: