An Introduction to Elasticsearch

August 12, 2019
By John Hammink

Imagine you were tasked with developing a data pipeline that was agnostic to the sort of data it stored, and open-ended about the kinds of searches and analytics that could be run on it. It needed to be fast, with as few constraints as possible on format, data type or record size.

data continuum illustration with focus on unstructured data

You’ve already looked at the data-structure continuum and realized you’re at the far right end of it: your data is unstructured, or as close as possible to free text. However, you need to be able to include several additional data types, in a variable format, in a non-flat hierarchy, within each record. In other words, you need maximum flexibility.

You also need a distributed database that can scale, handle replication sensibly by default, be configurable and offer millisecond latency on read and write queries. So what do you choose?

In this piece, we’ll take an updated look at Elasticsearch: we’ll examine index and search while considering query types, analytics and auto-completion features. We’ll dig into the power and flexibility that Elasticsearch can provide, and review some industry use cases.

What it is…

Used widely amongst Aiven’s customers to handle the variability of logs, Elasticsearch owes its usefulness to this: advanced, distributed, scalable full-text storage and search. Built as an abstraction layer with a fully functional REST API atop the Apache Lucene search library, Elasticsearch is a fully featured, open source, scalable, low-latency document store and search engine that supports near-real-time analytics.

Of course, logs are just the beginning. Elasticsearch was built to drive information-discovery tasks based on text and data in almost unlimited formats. Let’s look at how it works in a bit more detail.

Elasticsearch can perform lightning-fast searches because it searches an index (an inverted index that maps each term to the documents containing it) rather than scanning the stored text directly, and it supports a number of manipulations and analytics cases right out of the box. How does it do this?
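To make the idea of searching an index rather than the raw text more tangible, here is a minimal sketch of an inverted index in Python. It is a toy, not how Lucene is implemented: the tokenizer, the sample log lines and the function names are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up log lines, used only to exercise the sketch.
docs = {
    1: "Disk usage warning on node A",
    2: "Node B restarted after a disk failure",
    3: "User login succeeded",
}
index = build_inverted_index(docs)
print(sorted(search(index, "disk node")))  # [1, 2]
```

The lookup cost depends on the number of query terms and matching documents, not on the total amount of stored text, which is the intuition behind the speed.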

How data is organized in Elasticsearch

Let’s look at how Elasticsearch organizes its data by comparing related concepts:

RDBMS        Elasticsearch
Databases    Indices/Indexes
Tables       Types
Rows         Documents
Columns      Fields


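A cluster can consist of one or many Elasticsearch nodes (running instances of Elasticsearch, typically one per machine). These can hold one to many indices, with one to many types. Types contain many documents, typically with more than one field. Note that indices created in Elasticsearch 6.0 or later can contain only a single type, and types are deprecated as of 7.0.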

Documents are immutable — but replaceable — JSON objects containing data. While analogous to records or rows in a traditional database, the data in a document can be hierarchical, not strictly flat like columns and rows.

Within documents, data is serialized as keys (names) and values. The values are the stored data associated with those keys and may include strings, all manner of numeric types, dates, booleans, ranges, binary blobs, geo types, IP addresses, a host of specialized data types, and arrays. A value can even be an array of other values, a hash, or a hierarchically nested object.
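As a rough illustration of such a hierarchical document, the sketch below builds one as a Python dictionary and serializes it to JSON. All field names and values are invented for the example, and the dotted-path helper is a hypothetical convenience that mimics how nested fields are usually addressed (for instance address.location.lat).

```python
import json

# A made-up document mixing several value types and nesting levels.
document = {
    "name": "Simon Belmont",
    "likes_ES": True,
    "shoe_size": 43,
    "joined": "2019-08-12",
    "client_ip": "192.0.2.10",
    "competence": [
        {"database": "Elasticsearch", "expertise": "novice"},
        {"database": "PostgreSQL", "expertise": "beginner"},
    ],
    "address": {"city": "Helsinki", "location": {"lat": 60.17, "lon": 24.94}},
}


def get_path(doc, dotted_path):
    """Follow a dotted path such as 'address.location.lat' into nested dicts."""
    value = doc
    for key in dotted_path.split("."):
        value = value[key]
    return value


body = json.dumps(document, indent=2)  # the JSON text that would be sent
print(body)
print(get_path(document, "address.location.lat"))  # 60.17
```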

Indexing and mapping

Documents are indexed (verb): they are stored in an index (noun) via an index API, to be retrieved later for querying and analysis. Indexing is the process by which documents are ingested into, and organised in, Elasticsearch.

Here’s how a basic indexing operation looks, where {id} refers to the unique identifier of the document:

PUT /{index}/{type}/{id}
{
  "field": "value",
  ...
}
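For readers who prefer to see this from a client's point of view, here is a minimal sketch that builds the same kind of request using only the Python standard library. The base URL (a local, unsecured node on port 9200) and the index, type and id values are assumptions; the request is constructed but deliberately not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured node


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index, type and id, chosen only for illustration.
request = build_index_request("people", "developers", 1, {"name": "Simon Belmont"})
print(request.get_method(), request.full_url)

# To send it against a running cluster, something like this should work:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```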

Elasticsearch can auto-generate a schema during an index operation if no schema is given. However, rely on this option sparingly: the generic types it assigns are not optimal for queries.
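To give a feel for why auto-generated types end up generic, here is a simplified sketch of type inference from JSON values. It only approximates the behaviour: the real dynamic mapping also performs date detection on strings and adds sub-fields, which this sketch skips, and the function names are hypothetical.

```python
def infer_field_type(value):
    """Guess a field type from a JSON value (simplified approximation)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return infer_field_type(item)
    return None  # nulls and empty arrays give nothing to infer from


def infer_mapping(document):
    """Build a mapping-like dict, skipping fields whose type cannot be inferred."""
    properties = {}
    for field, value in document.items():
        field_type = infer_field_type(value)
        if field_type is not None:
            properties[field] = {"type": field_type}
    return {"properties": properties}


sample = {"name": "Simon", "shoe_size": 43, "likes_ES": True, "nickname": None}
print(infer_mapping(sample))
```

Note how a phone number or an identifier would be inferred as free text here, even though an exact-match type would serve queries better. That is the kind of choice an explicit mapping lets you make.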

Mapping is the process by which a document schema is explicitly defined, and is how Elasticsearch organizes, types, and stores the fields that make up a document. You can create the index, type and mapping in a single request:

PUT /{index}/
{
    "mappings": {
        "developers": {
            "properties": {
                "name": {
                    "type": "text"
                },
                "address": {
                    "type": "text"
                },
                "phone": {
                    "type": "keyword"
                },
                "favorite_number": {
                    "type": "float"
                },
                "shoe_size": {
                    "type": "integer"
                },
                "likes_ES": {
                    "type": "boolean"
                }
            }
        }
    }
}

Document examples

Simple document example:

{
   "name": "Simon Belmont"
}

More complex example, with two documents that each contain a nested array of objects:

[
  {
    "name": "Simon Belmont",
    "competence": [
      {
        "database": "Elasticsearch",
        "expertise": "novice"
      },
      {
        "database": "PostgreSQL",
        "expertise": "beginner"
      }
    ]
  },
  {
    "name": "Azalin Rex",
    "competence": [
      {
        "database": "Cassandra",
        "expertise": "expert"
      }
    ]
  }
]

Once data is indexed, Elasticsearch offers tremendous flexibility in how it can be searched. In fact, this flexibility is what Elasticsearch is known for.

Search queries use a REST API, often combined with a search body, although there is a simple API which works without the search body.

Example basic match query using the SearchLite API:

GET /{index}/{type}/_search?q={search term}

Example search, with full search body:

POST /{index}/{type}/_search
{
   "query": {
       "multi_match" : {
           "query" : "{search term}",
           "fields" : ["{field 1}", "{field 2}", "{field 3}"]
       }
   }
}

The multi_match query lets you search multiple fields for the search term. Elasticsearch’s Query DSL is extensive and flexible, and works down to the keyword/token level.
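Both forms of search shown above can be assembled programmatically. The following sketch, using only the Python standard library, builds the path for the simple form and the body for the full form. The index, type and field names are placeholders, and the helper names are hypothetical.

```python
import json
from urllib.parse import quote


def lite_search_path(index, doc_type, term):
    """Path for the simple query-string form; the term must be URL-encoded."""
    return f"/{index}/{doc_type}/_search?q={quote(term)}"


def multi_match_body(term, fields):
    """Request body for the full form, searching several fields at once."""
    return {"query": {"multi_match": {"query": term, "fields": list(fields)}}}


print(lite_search_path("people", "developers", "Simon Belmont"))
# /people/developers/_search?q=Simon%20Belmont
print(json.dumps(multi_match_body("Simon Belmont", ["name", "address"]), indent=2))
```

Building the body as a dictionary and serializing it at the last moment avoids hand-written JSON mistakes such as missing commas or duplicated fields.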

Query types

Elasticsearch query types also include full-text queries; term-level queries; relevance-ranking queries; term-completion queries (where word-completion suggestions appear after a user starts typing); joining queries (less expensive than SQL-style joins); regular expression and partial-regular expression queries; geo queries (essentially geolocation data queries); specialized queries (a rather motley crew of “none of the above” query types); and even span queries, which control for word order and position and are well suited to legal, contractual, and patent documents (but can be used on any text, anywhere).

To add to the pile and drive home the flexibility of Elasticsearch, compound queries are possible. These wrap other query types, including mashups of more than one of the above queries, to calculate items such as scores, matches, relevance ranking of results, and boolean truth values.
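As one possible illustration of a compound query, the sketch below assembles a bool query body that wraps a match clause, a boosted match clause and a range filter. The field names and values are invented, and the helper function is a hypothetical convenience, not part of any client library.

```python
import json


def bool_query(must=(), should=(), filter_=(), must_not=()):
    """Wrap other query clauses in a bool query, omitting empty sections."""
    clauses = {
        "must": list(must),
        "should": list(should),
        "filter": list(filter_),
        "must_not": list(must_not),
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


query = bool_query(
    # Documents must match this full-text clause.
    must=[{"match": {"name": "Simon"}}],
    # Matching this clause is optional but raises the score (boosted).
    should=[{"match": {"competence.database": {"query": "Elasticsearch", "boost": 2}}}],
    # Filters restrict the result set without affecting the score.
    filter_=[{"range": {"shoe_size": {"gte": 40, "lte": 46}}}],
)
print(json.dumps(query, indent=2))
```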

Let’s take a look at even more queries and analyses that Elasticsearch provides:

  • Boosting queries: boost the score of matches in a given field, to make the resulting documents more prominent in the results.
  • Fuzzy queries: look for words approximately similar to a query string. They use the Levenshtein distance to count the single-character changes between the query string and a retrieved term, ranking the most similar terms first.
  • Wildcard queries: match a pattern instead of a complete word or phrase, using ? to denote any single character and * for zero or more characters.
  • Regexp queries: match a regular expression instead of a complete word or phrase.
  • Range queries: search for items within a range.
  • Filtered bool queries: filter search results by secondary or tertiary criteria.
  • Field value factor scores: factor in a specific field’s value when determining a relevance score.
  • Decay functions: deprioritize documents progressively as a specific field’s value moves away from a specified value.
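Two of the items above lend themselves to a short worked example. The sketch below implements a plain Levenshtein distance with a toy fuzzy lookup, and a Gaussian decay curve. Both are simplified stand-ins written for illustration: the term list, the max_edits default and the function names are assumptions, and the engine's own implementation differs in its details.

```python
import math


def levenshtein(a, b):
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_search(terms, query, max_edits=2):
    """Return the terms within max_edits of the query, most similar first."""
    scored = sorted((levenshtein(query, term), term) for term in terms)
    return [term for distance, term in scored if distance <= max_edits]


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Score of 1.0 at the origin, falling to `decay` at distance `scale`."""
    distance = max(0.0, abs(value - origin) - offset)
    sigma_squared = -(scale ** 2) / (2 * math.log(decay))
    return math.exp(-(distance ** 2) / (2 * sigma_squared))


terms = ["elastic", "plastic", "elasticity", "lucene"]
print(fuzzy_search(terms, "elastik"))  # ['elastic', 'plastic']
print(gauss_decay(10, origin=10, scale=5))  # 1.0
```

In the decay function, a document whose field value sits exactly `scale` away from the origin receives the score `decay`, so the two parameters together set how quickly relevance falls off.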

This list isn’t nearly exhaustive; there are many more functions and possibilities, and more being added all the time! Stay tuned - we’ll cover searches, analytics and other Elasticsearch functionality hands-on in future posts!

How flexible is Elasticsearch?

Once you realize that many query types can be strung together into one (as a query string query, for example), you begin to see just how flexible a search engine that doubles as a datastore, like Elasticsearch, can be.

There are also some bundled features aimed at the end-user experience. Search-as-you-type functionality completes your search string with suggested results as you type, and auto-completion, based on the completion suggester in Elasticsearch, is optimized for speed.
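The core idea behind prefix-based completion can be sketched in a few lines. This toy version keeps the terms in a sorted list and uses binary search to find the first candidate. It is a stand-in chosen for brevity, not the data structure the completion suggester actually uses, and the class name and sample terms are hypothetical.

```python
import bisect


class PrefixCompleter:
    """Suggest stored terms that start with what the user has typed so far."""

    def __init__(self, terms):
        # Normalise to lowercase and keep the terms sorted for binary search.
        self._terms = sorted({term.lower() for term in terms})

    def suggest(self, prefix, limit=5):
        prefix = prefix.lower()
        # All terms sharing the prefix sit in one contiguous run in the list.
        start = bisect.bisect_left(self._terms, prefix)
        suggestions = []
        for term in self._terms[start:]:
            if not term.startswith(prefix) or len(suggestions) >= limit:
                break
            suggestions.append(term)
        return suggestions


completer = PrefixCompleter(["Elasticsearch", "elastic", "Kibana", "Logstash", "Lucene"])
print(completer.suggest("el"))  # ['elastic', 'elasticsearch']
print(completer.suggest("l"))   # ['logstash', 'lucene']
```

Because the work per keystroke is a binary search plus a short scan, lookups stay fast even as the term list grows, which is the property that makes as-you-type suggestions feel instant.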

Elasticsearch’s real power lies in how it leverages these flexible search, analytics, and end-user-facing capabilities with its storage mechanism and architecture.

Elasticsearch, as a distributed data store, is subject to the trade-offs described by the CAP theorem: the user can tune the balance between consistency of data across partitions, availability of the data in each partition, and the partition tolerance of the index.

With great working defaults right out of the box, and a managed solution like Aiven Elasticsearch, you’ll probably not need to tune anything. Elasticsearch scales horizontally: it works right out of the box on a single node but can grow to as many nodes as your data requires.

The power of all this is that, with Elasticsearch, you can:

  1. Store data — across all supported types — without needing to flatten it to conform to a rigid RDBMS schema.
  2. Shard and replicate your data across partitions and nodes as you’d do with any distributed database, to balance load and provide fault-tolerance.
  3. Scale your datastore accordingly: seamlessly add and integrate new nodes as you need to resize your cluster.
  4. Index flexibly: on as little as a single word, token, combination of words, wildcards or even regular expressions — or a combination of all of the above.
  5. Combine many types of full-text searches, analytical queries, fuzzy searches and boosted queries into a single, laser-focused, intuitive inquiry. This combination of power and flexibility is rarely found in traditional ACID databases.
  6. Do all of the above at speed, with latency typically in the millisecond range.
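The sharding and replication in points 2 and 3 can be pictured with a small sketch. It routes a document to a primary shard by hashing a routing key, and spreads each shard's copies over different nodes. This is a toy model under stated assumptions: crc32 stands in for the real hash function, the round-robin placement is invented for illustration, and the node and document names are made up.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    """Pick a primary shard by hashing the routing key (crc32 as a stand-in)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def place_copies(num_primary_shards, num_replicas, nodes):
    """Toy allocation: put each shard's primary and replicas on distinct nodes."""
    copies_per_shard = num_replicas + 1
    if copies_per_shard > len(nodes):
        raise ValueError("need at least one node per shard copy")
    return {
        shard: [nodes[(shard + copy) % len(nodes)] for copy in range(copies_per_shard)]
        for shard in range(num_primary_shards)
    }


nodes = ["node-a", "node-b", "node-c"]
placement = place_copies(num_primary_shards=3, num_replicas=1, nodes=nodes)
shard = shard_for("developer-42", 3)
print(f"document 'developer-42' -> shard {shard}, stored on {placement[shard]}")
```

Keeping every copy of a shard on a different node is what allows the cluster to lose a node without losing data, and hashing the routing key is what spreads the load evenly across shards.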

There’s power in elasticity

Elasticsearch can be integrated with any number of other technologies, data stores and messaging services, making it a flexible part of any data infrastructure. It can work with Hadoop, almost any SQL RDBMS, and many NoSQL solutions. Its Java APIs can also be used to index data from databases like MongoDB.

One of the most popular uses of Elasticsearch is as a component in the ELK stack, a combination of Elasticsearch, Logstash and Kibana widely used for system monitoring and metrics visualization.

While many Aiven customers use Elasticsearch for log handling and processing, there is still a wide range of use cases to be tapped. Consider some of the following:

Uber sends an event every few seconds to Elasticsearch and uses that information to understand riders and drivers and serve them better. Wikipedia, Facebook, Etsy and many more use search-as-you-type and autocompletion; Wikimedia supports billions of real-time searches daily. Verizon indexes billions of documents for real-time logging and analytics within their call center.

Off of terra firma, the Mars Rover has been sending 30k messages @ 100k documents, 4 times daily into Elasticsearch; and we’ve heard the Jet Propulsion Laboratory has been doing “some very interesting things” on Elasticsearch.

The Aiven Advantage

Aiven Elasticsearch is a managed and hosted solution for your document and full-text search needs, available across the world on Google Cloud Platform, Amazon Web Services, Microsoft Azure, DigitalOcean, Packet and UpCloud.

Many of our customers use Elasticsearch with other Aiven services such as Kafka, PostgreSQL and Redis. All of these solutions are managed by Aiven, so you don’t need to worry about backend operations related to management such as upgrades, updates, backup, scaling, or replication.

In this piece, we’ve learned about Elasticsearch: looked at what it is, examined index and search, and considered query types, analytics and auto-completion features. We’ve gained some insight into the flexibility that Elasticsearch provides, and finally read up on some use cases from the industry before thinking about what Elasticsearch can do for you.

Your turn…

Want to give it a try? Check it out with our no commitment, 30-day trial, or find out more on the product page!

