Sample dataset

Databases are more fun with data. To get you started on your OpenSearch® journey, we picked an open dataset of recipes that you can try out yourself.

Epicurious recipes

A dataset from Kaggle with recipes, ratings, and nutrition information from Epicurious.

Let's take a look at a sample recipe document:

{
  "title": "A very nice Vegan dish",
  "desc": "A beautiful description of the recipe",
  "date": "2015-05-01T04:00:00.000Z",
  "categories": [
    "Vegan",
    "Tree Nut Free",
    "Soy Free",
    "No Sugar Added"
  ],
  "ingredients": [
    "list",
    "of",
    "ingredients"
  ],
  "directions": [
    "list",
    "of",
    "steps",
    "to prepare the dish"
  ],
  "calories": 32.0,
  "fat": 1.0,
  "protein": 1.0,
  "rating": 5.0,
  "sodium": 959.0
}

Load the data with Python

  1. Download and unzip the full_format_recipes.json file from the dataset in your current directory.

  2. Install the Python dependencies:

    pip install opensearch-py==1.0.0
  3. In this step you will create the script that reads the data file you downloaded and puts the records into the OpenSearch service. Create a file named epicurious_recipes_import.py, and add the following code; you will need to edit it to add the connection details for your OpenSearch service.

    Find the SERVICE_URI on Aiven's dashboard.

    import json
    from opensearchpy import helpers, OpenSearch


    SERVICE_URI = 'YOUR_SERVICE_URI_HERE'
    INDEX_NAME = 'epicurious-recipes'

    os_client = OpenSearch(hosts=SERVICE_URI, use_ssl=True)


    def load_data():
        """Yield one bulk index action per recipe in the data file."""
        with open('full_format_recipes.json', 'r') as f:
            data = json.load(f)
        for recipe in data:
            yield {'_index': INDEX_NAME, '_source': recipe}

    The OpenSearch Python client offers a helper called bulk(), which lets us send multiple documents in one API call.

    helpers.bulk(os_client, load_data())
  4. Run the script with the following command, and wait for it to complete:

    python epicurious_recipes_import.py
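To make the "multiple documents in one API call" idea more concrete, here is a minimal sketch of the newline-delimited body that the bulk endpoint accepts: one action line followed by one document line, repeated. The function name build_bulk_body and the two sample recipes are made up for illustration. When you use helpers.bulk() you do not need to build this body yourself.

```python
import json


def build_bulk_body(recipes, index_name):
    """Serialize recipes into newline-delimited JSON for the bulk endpoint.

    Hypothetical helper for illustration only; helpers.bulk() takes care
    of serialization when you use the client.
    """
    lines = []
    for recipe in recipes:
        # Action and metadata line: index the next document into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(recipe))
    # Every line, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)


# Two made-up recipes standing in for the real data file.
sample = [
    {"title": "A very nice Vegan dish", "rating": 5.0},
    {"title": "Another dish", "rating": 4.0},
]
body = build_bulk_body(sample, "epicurious-recipes")
print(body)
```

Each recipe contributes two lines to the body, so the two sample recipes produce four lines in total.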

Get data mapping with Python

When no data structure is specified, as is the case in Load the data with Python, OpenSearch uses dynamic mapping to detect the fields and their types automatically. To check the mapping definition of your data, use the client's get_mapping function as shown:

from pprint import pprint

INDEX_NAME = 'epicurious-recipes'
mapping_data = os_client.indices.get_mapping(INDEX_NAME)

# Find index doc_type
doc_type = list(mapping_data[INDEX_NAME]["mappings"].keys())[0]

schema = mapping_data[INDEX_NAME]["mappings"][doc_type]
fields = list(schema.keys())
pprint(fields)
pprint(schema)

The first call prints the list of fields:

['calories',
'categories',
'date',
'desc',
'directions',
'fat',
'ingredients',
'protein',
'rating',
'sodium',
'title']

The second call prints the mapping, with each field and its type:

{'calories': {'type': 'float'},
 'categories': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
                'type': 'text'},
 'date': {'type': 'date'},
 'desc': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
          'type': 'text'},
 'directions': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
                'type': 'text'},
 'fat': {'type': 'float'},
 'ingredients': {'fields': {'keyword': {'ignore_above': 256,
                                        'type': 'keyword'}},
                 'type': 'text'},
 'protein': {'type': 'float'},
 'rating': {'type': 'float'},
 'sodium': {'type': 'float'},
 'title': {'fields': {'keyword': {'ignore_above': 256, 'type': 'keyword'}},
           'type': 'text'}}

Read more about OpenSearch mapping in the official OpenSearch documentation.
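If the full mapping is more detail than you need, a small helper can reduce it to a field-to-type summary and list the text fields that also have a keyword sub-field. This is only a sketch: summarize_mapping is a hypothetical name, and the properties dictionary is a shortened copy of the mapping printed above rather than a live response.

```python
def summarize_mapping(properties):
    """Return ({field: type}, [fields that have a 'keyword' sub-field]).

    Hypothetical helper for illustration; 'properties' is expected to look
    like the schema dictionary printed above.
    """
    types = {}
    keyword_fields = []
    for field, definition in sorted(properties.items()):
        types[field] = definition.get("type")
        if "keyword" in definition.get("fields", {}):
            keyword_fields.append(field)
    return types, keyword_fields


# Shortened copy of the mapping shown above.
properties = {
    "calories": {"type": "float"},
    "date": {"type": "date"},
    "title": {
        "type": "text",
        "fields": {"keyword": {"ignore_above": 256, "type": "keyword"}},
    },
}

types, keyword_fields = summarize_mapping(properties)
print(types)
print(keyword_fields)
```

With a live service, you could pass the schema variable from the snippet above instead of the hand-written dictionary.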

Load the data with NodeJS

To load data with NodeJS, we'll use the OpenSearch JavaScript client.

Download full_format_recipes.json, unzip it, and put it into the project folder.

You can index documents either one by one or with a bulk operation. Because we have a file containing a long list of recipes, we'll use a bulk operation. The bulk endpoint expects the request as a flat list in which each action is followed by its optional document:

  • Action and metadata
  • Optional document
  • Action and metadata
  • Optional document
  • and so on

To produce this format, use flatMap to create a flat list of such pairs, instructing OpenSearch to index the documents.

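// `client` (an OpenSearch JavaScript client instance) and `indexName`
// are assumed to be defined elsewhere in the project.
const recipes = require("./full_format_recipes.json");
module.exports.recipes = recipes;

/**
 * Indexing data from json file with recipes.
 */
module.exports.indexData = () => {
  console.log(`Ingesting data: ${recipes.length} recipes`);
  // Pair each document with an index action, as the bulk endpoint expects.
  const body = recipes.flatMap((doc) => [
    { index: { _index: indexName } },
    doc,
  ]);

  client.bulk({ refresh: true, body }, (error, result) => {
    if (error) {
      console.error(error);
    } else {
      console.log(result.body);
    }
  });
};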

Run this method to load the data and wait until it finishes. We're ingesting over 20k recipes, so it can take 10-15 seconds.

Get data mapping with NodeJS

We didn't specify any particular structure for the recipes data when we uploaded it. Although we could have set an explicit mapping beforehand, we opted to let OpenSearch derive the structure from the data using dynamic mapping. To see the mapping definitions, use the getMapping method and provide the index name as a parameter.

/**
 * Retrieving mapping for the index.
 */
module.exports.getMapping = () => {
  console.log(`Retrieving mapping for the index with name ${indexName}`);

  client.indices.getMapping({ index: indexName }, (error, result) => {
    if (error) {
      console.error(error);
    } else {
      // The response is keyed by index name.
      console.log(result.body[indexName].mappings.properties);
    }
  });
};

You should be able to see the following structure:

{
  calories: { type: 'long' },
  categories: { type: 'text', fields: { keyword: [Object] } },
  date: { type: 'date' },
  desc: { type: 'text', fields: { keyword: [Object] } },
  directions: { type: 'text', fields: { keyword: [Object] } },
  fat: { type: 'long' },
  ingredients: { type: 'text', fields: { keyword: [Object] } },
  protein: { type: 'long' },
  rating: { type: 'float' },
  sodium: { type: 'long' },
  title: { type: 'text', fields: { keyword: [Object] } }
}

These are the fields you can play with. You can find information on dynamic mapping types in the documentation.

Sample queries with HTTP client

With the data in place, we can start trying some queries against your OpenSearch service. Since it has a simple HTTP interface, you can use your favorite HTTP client. In these examples, we will use httpie because it's one of our favorites.

First, export the SERVICE_URI variable, set to your OpenSearch service URI followed by the index name used in the previous script:

export SERVICE_URI="YOUR_SERVICE_URI_HERE/epicurious-recipes"
  1. Execute a basic search for the word vegan across all documents and fields:

    http "$SERVICE_URI/_search?q=vegan"
  2. Search for vegan in the desc or title fields only:

    http POST "$SERVICE_URI/_search" <<< '
    {
      "query": {
        "multi_match": {
          "query": "vegan",
          "fields": ["desc", "title"]
        }
      }
    }
    '
  3. Search for recipes published in 2013:

    http POST "$SERVICE_URI/_search" <<< '
    {
      "query": {
        "range": {
          "date": {
            "gte": "2013-01-01",
            "lte": "2013-12-31"
          }
        }
      }
    }
    '
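The same request bodies can also be built as Python dictionaries, which is handy if you prefer to query from the client used earlier. The sketch below only constructs and prints the bodies; the helper names multi_match_query and year_range_query are made up for illustration.

```python
import json


def multi_match_query(text, fields):
    """Build a multi_match query body searching 'text' in the given fields."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


def year_range_query(field, year):
    """Build a range query body covering one calendar year on a date field."""
    return {
        "query": {
            "range": {
                field: {"gte": f"{year}-01-01", "lte": f"{year}-12-31"}
            }
        }
    }


vegan_query = multi_match_query("vegan", ["desc", "title"])
recipes_2013 = year_range_query("date", 2013)

print(json.dumps(vegan_query, indent=2))
print(json.dumps(recipes_2013, indent=2))

# With a configured client (an assumption, not run here), a body could be
# sent with, for example:
# os_client.search(index="epicurious-recipes", body=vegan_query)
```

Building the bodies in functions keeps the field names and year as parameters, so you can reuse the same shape for other searches.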