Workshop: Searching for images with vector search - OpenSearch and CLIP model

Welcome to our workshop!

Thank you so much for taking part in our hands-on workshop: Searching for images with vector search, OpenSearch® and CLIP

We hope you will join us LIVE at the appointed time, but what follows are written instructions for anyone who would like to jump ahead, or for newer folks who hit bumps along the way during the workshop and want to catch up.

If you want to have a look beforehand, here is the GitHub repo.

A tutorial following similar steps is also available in Aiven’s Developer Center at When text meets image: a guide to OpenSearch® for multimodal search.

Pre-requisites

To get the most out of your workshop experience, we recommend doing the following ahead of time:

  1. Get signed in to Aiven Console
  2. Set up a GitHub Codespace

Get signed in to Aiven Console

  1. Head to Aiven Console

  2. If you’ve used Aiven before, go ahead and log in. You’re done with this step!

  3. If not, sign up through whichever method you’d like!

  4. If you chose email, you’ll need to click the link sent to you to validate your email address

  5. Once logged in, you’ll be asked to enter some additional information. Choose either Personal or Business, whichever applies to you, and specify the Name of your first project on Aiven.

     (The picture shows a project named kafka-python-workshop - it’s probably best not to use that name!)

  6. Once at your project’s Services screen, you can close your browser; we’ll do this part during the workshop.

Spoiler alert: we’re going to create an OpenSearch service here. :wink:

Set up a GitHub Codespace

GitHub Codespaces offer an entire development environment running in the cloud, accessible from your browser, including Visual Studio Code and a Terminal. We’ll use this tool to eliminate issues with individual machines during the workshop, and ensure we’re all using the same versions of all the things so commands work correctly.

As mentioned before, we have all this nifty code ready to go for you in our workshop GitHub repo.

  1. From the repo’s README.md file, click the “Open in GitHub Codespaces” button to begin.

  2. Leave the settings at their default options and click Create codespace.

     (Note: you shouldn’t need to worry about being charged as a result of this workshop; the core hours and storage required should fit well within GitHub’s monthly free allowance.)

  3. Once completed, you should see an interface that looks something like the following:

  4. If you ever lose this window and need to get back to it, you can view your full list of GitHub Codespaces at https://github.com/codespaces

NOTE: By default, codespaces end up with doofy auto-generated names like “probable space orbit” … feel free to click the “3-dots” menu > Rename and give yours something a bit more intelligible.

FAQs / Troubleshooting

I get errors when trying to connect to OpenSearch, what gives?

Connecting securely to OpenSearch requires putting the connection information in a place that the code can find it (this is covered in the workshop as well, but restating for completeness):

  1. From GitHub Codespaces, rename the .env-example file to .env
  2. From your OpenSearch Service Overview page in Aiven Console, copy and paste the Service URI as the value after SERVICE_URI=
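Once the .env file is in place, the notebooks read SERVICE_URI from the environment and pass it to the OpenSearch client. As a rough sanity-check sketch (the URI below is a made-up placeholder, not a real service; the actual connection in the workshop code uses the opensearch-py client):

```python
import os
from urllib.parse import urlparse

# The Service URI from Aiven Console has the shape:
#   https://<user>:<password>@<host>:<port>
service_uri = os.environ.get(
    "SERVICE_URI",
    "https://avnadmin:secret@my-opensearch.aivencloud.com:12345",
)

# Check the URI parses into the pieces the OpenSearch client needs;
# if any of these come back empty, the value in .env is likely wrong.
parts = urlparse(service_uri)
print(parts.username, parts.hostname, parts.port)
```

If this prints `None` for the username, host, or port, double-check that you copied the whole Service URI (including the credentials before the `@`).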

I get an error trying to run clip.load to load the model

We’ve seen this a couple of times. There’s some more discussion below, but the simplest thing to do is to re-run the Jupyter cell that failed; it should succeed the second time. After that, the downloaded model is cached locally.

I’m currently troubleshooting an issue we saw in the workshop, where one participant found the call model, preprocess = clip.load("ViT-B/32", device=device) in the second notebook hanging. I’ve now seen this issue myself, which should help with diagnosis.

Our best guess is that this is a network issue (that is, the server providing the model file was busy).

Simplest solution: re-run the Jupyter cell
The simplest solution appears to be to just re-run the Jupyter cell that is failing. This typically succeeds the second time it tries to download the CLIP model. After that, the model file is cached locally, so further runs should be OK.
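If re-running the cell by hand gets tedious, the retry can be scripted. This is just a sketch (the load_with_retries helper is ours, not part of the clip package); in the notebook it would wrap the clip.load call:

```python
import time

def load_with_retries(loader, attempts=3, delay=2):
    """Call loader(), retrying a few times on transient failures
    (e.g. a busy download server) before giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return loader()
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error

# In the notebook this would wrap the CLIP call, e.g.:
# model, preprocess = load_with_retries(
#     lambda: clip.load("ViT-B/32", device=device))
```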

Alternative: download the model separately
An alternative is to make a local copy of the model and use that.

First use curl to download the model:

mkdir -p models
curl --fail https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt --output models/ViT-B-32.pt

This has the advantage of making it obvious if the download fails due to network problems, and easy to try again.
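If you want to confirm the download is intact, the long hex string in the URL is the file’s expected SHA-256 digest (clip.load performs the same check itself when it downloads the model). A small stdlib check, assuming the models/ViT-B-32.pt path from above:

```python
import hashlib
from pathlib import Path

# Expected digest, taken from the download URL above
EXPECTED_SHA256 = (
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af"
)

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# After downloading, this should be True:
# sha256_of("models/ViT-B-32.pt") == EXPECTED_SHA256
```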

Then replace the original loading code:

model, preprocess = clip.load("ViT-B/32", device=device)

with

from pathlib import Path

LOCAL_MODEL = Path('./models/ViT-B-32.pt').absolute()
MODEL_NAME = 'ViT-B/32'
model, preprocess = clip.load(MODEL_NAME, device=device, download_root=str(LOCAL_MODEL.parent))

Remember to do this at the start of both 2-process-and-upload.ipynb and 3-run-vector-search.ipynb.