Aiven Blog

Different ways to Search Text in PostgreSQL

A guide to choosing between iLIKE, full-text search, trigrams, and vector embeddings for your data

Jay Miller


Jay is a Staff Developer Advocate at Aiven. Jay has served as a keynote speaker and an avid member of the Python Community. When away from the keyboard, Jay can often be found cheering on their favorite baseball team. 

When it comes to text search, PostgreSQL offers a surprisingly rich set of tools. Initially, text search capabilities were quite basic, often relying on the LIKE operator. This is inefficient for large amounts of text and lacks the nuance that comes with language.

A major breakthrough came with the introduction of the built-in tsquery and tsvector data types, along with the associated functions, as part of PostgreSQL's core distribution. But tsquery is not the only option.

I like to think of this like seeing a doctor. You can go to a family doctor and they can help you with most things. For specific results, it’s better to see a specialist who has a better understanding of the particular issue.

Search is the same way. At the end of the day, you get results, but what you put into your search affects what you get back. That said, some search methods work better than others depending on your data.

Let’s explore a small dataset of 25 doctor-themed TV shows and see how each approach handles different search challenges:

  • Searching for an exact answer (doctor)
  • Searching for a word fragment (Doc)
  • Capitalization (Doctor)
  • Singular/Plural (Doctors)
  • Phrase Matching (time travelling doctor)
  • Similar Text Matching (physician)

I've made a GitHub repo that you can walk through alongside the code samples in this post: https://github.com/aiven-labs/pg-text-search-comparisons

This post was inspired by our Elephant in the Room Episode 1 on YouTube, where I learned a great deal about these techniques. This guide aims to summarize and detail how to implement these methods effectively.

iLIKE shouldn’t always be avoided

Using LIKE/iLIKE in your queries may seem laughable in 2025, but in reality, it’s the lightest and easiest way to retrieve consistent data.

iLIKE is the case-insensitive version of LIKE. You should use iLIKE unless casing matters.

iLIKE is particularly useful for finding specific shows when you need a simple, fast, and exact match that ignores capitalization. For example, if you are looking for "Grey's Anatomy", using iLIKE 'Grey''s Anatomy' or even iLIKE '%Grey''s%' is highly efficient.

select show_name from tv_shows where show_name iLIKE '%Grey''s%';

   show_name
----------------
 Grey's Anatomy
(1 row)

While testing this, a funny thing happened that actually shows the problem with iLIKE. I made a mistake and wrote Gray's Anatomy instead of Grey's Anatomy, and there were no results.

SELECT show_name FROM tv_shows WHERE show_name iLIKE 'Gray''s Anatomy';

 show_name
-----------
(0 rows)

You can also chain iLIKEs with OR.

SELECT show_name, description FROM tv_shows
WHERE show_name iLIKE '%Doc%' OR description iLIKE '%Dr%'
LIMIT 3;

   show_name    | description
----------------+--------------------------------------------------------------
 Grey's Anatomy | A medical drama following the personal and professional lives of the physicians and staff of Grey Sloan Memorial Hospital.
 ER             | A fast-paced medical drama depicting the daily operations, professional relationships, and personal crises of the staff at Cook County General Hospital.
 House M.D.     | Follows the brilliant but eccentric diagnostician Dr. Gregory House and his team as they solve complex medical mysteries.
(3 rows)

So if you are confident in your knowledge of the data and your query, you can use iLIKE pretty reliably.

Support | Search Type                                | Notes
--------+--------------------------------------------+---------------------------------------------------------------------------
✅      | Searching for an exact answer (doctor)     |
✅      | Searching for a word fragment (Doc)        |
✅      | Capitalization (DOCTOR)                    |
❌      | Misspelling                                |
⚠️      | Singular/Plural                            | If you use % and shorten the phrase to catch the plural with the wildcard
❌      | Shortening (Dr.)                           |
⚠️      | Phrase Matching                            | ONLY if you get the exact phrase
❌      | Similar Text Matching (physician)          |

TSQuery and TSVector

The core of PostgreSQL's native Full-Text Search (FTS) relies on the tsvector and tsquery data types. A tsvector is created by parsing a document, converting words to dictionary-normalized root forms, and discarding common stop words (and, or, in, etc.). A tsquery represents the user's search phrase. The tsquery is similarly processed and may include boolean operators (AND, OR). The search compares the roots of the query and the vector.
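
You can see this normalization directly in psql. This is a minimal sketch; the exact lexemes and positions depend on the text search configuration you use:

-- Stop words are dropped and the remaining words are reduced to root forms
SELECT to_tsvector('english', 'The doctors are travelling through time');
-- roughly: 'doctor':2 'time':6 'travel':4

-- The query side is normalized the same way
SELECT to_tsquery('english', 'doctors');
-- roughly: 'doctor'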

Note that you are searching the tsvector, not the content itself. This means that information needs to be generated and possibly stored. To improve performance you can create a Generalized Inverted Index (GIN) on that data. This means more data is stored than if you used iLIKE.

-- Add a generated tsvector column that combines show_name and description for full-text search
ALTER TABLE tv_shows
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
    to_tsvector('english',
        coalesce(show_name, '') || ' ' || coalesce(description, '')
    )
) STORED;

-- Create a GIN index for fast full-text search queries
CREATE INDEX idx_tv_shows_search ON tv_shows USING GIN(search_vector);

As you can see, you can combine multiple fields into the vector, and your index will depend on that. You can also configure things like the language (for stemming and lexeme definitions) and the weight of each field (for instance, a match in the show's name could be more valuable to you than a match in the description). You can also generate this information on the fly, though that raises obvious performance questions.
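
Field weighting uses setweight. Here's a sketch of what a weighted generated column could look like, assuming the show name (weight 'A') should count for more than the description (weight 'B'); the column name is just an example:

ALTER TABLE tv_shows
ADD COLUMN weighted_search_vector tsvector
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(show_name, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(description, '')), 'B')
) STORED;

ts_rank will then score a match in the show name higher than the same match in the description.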

Let’s pause and talk about similarity.

When you run a query, you are not getting results as a boolean (exists or not); instead, you are performing a distance query. We’ve talked about this in more detail in our Image recognition with Python, OpenCV, OpenAI CLIP and pgvector guide. But basically, your distance is how close your query is to the data, plus any weights that you use to manipulate the score.

SELECT show_name,
       ts_rank(search_vector, query) AS relevance,
       ts_headline('english', description, query, 'MaxWords=20, MinWords=10') AS excerpt
FROM tv_shows, to_tsquery('english', 'relationships') query
WHERE search_vector @@ query
ORDER BY relevance DESC;

      show_name      | relevance  | excerpt
---------------------+------------+-------------------------------------
 ER                  | 0.06079271 | <b>relationships</b>, and personal crises of the staff at Cook County
 The Good Doctor     | 0.06079271 | with autism and savant syndrome, navigating his career and <b>relationships</b>.
 The Resident        | 0.06079271 | <b>relationships</b> of the physicians at Chastain Park Memorial Hospital as they
 Doogie Howser, M.D. | 0.06079271 | Howser, as he navigates his medical career and personal <b>relationships</b>.

Each of these searches found the term (we used ts_headline to mark the result with <b>RESULT</b>). If we change our query to “personal relationships”, what happens?

SELECT show_name,
       ts_rank(search_vector, query) AS relevance,
       ts_headline('english', description, query, 'MaxWords=20, MinWords=10') AS excerpt
FROM tv_shows, to_tsquery('english', 'personal\ relationships') query
WHERE search_vector @@ query
ORDER BY relevance DESC;

      show_name      | relevance  | excerpt
---------------------+------------+------------------------------------------------------
 Doogie Howser, M.D. | 0.09910322 | Howser, as he navigates his medical career and <b>personal</b> <b>relationships</b>.
(1 row)

But we just had so many results… what happened? The query above is only looking for the phrase “personal relationships”. There are a few ways to make it look for each individual word. The most common is to use the tsquery search operators & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY).

SELECT show_name,
       ts_rank(search_vector, query) AS relevance,
       ts_headline('english', description, query, 'MaxWords=20, MinWords=10') AS excerpt
FROM tv_shows, to_tsquery('english', 'personal & relationships') query
WHERE search_vector @@ query
ORDER BY relevance DESC;

      show_name      |  relevance  | excerpt
---------------------+-------------+------------------------------------------
 Doogie Howser, M.D. | 0.09910322  | Howser, as he navigates his medical career and <b>personal</b> <b>relationships</b>.
 ER                  | 0.098500855 | <b>relationships</b>, and <b>personal</b> crises of the staff at Cook County
 The Resident        | 0.008163527 | <b>relationships</b> of the physicians at Chastain Park Memorial Hospital as they navigate complex cases and <b>personal</b>
(3 rows)

Full text search seems to be most beneficial when you know how you want searches to happen.

Support | Feature                                      | Notes
--------+----------------------------------------------+----------------------------------------------------------
✅      | Searching for an exact answer (doctor)       |
⚠️      | Searching for a word fragment (Doc)          | Needs to_tsquery('Doc:*')
✅      | Capitalization (DOCTOR)                      |
❌      | Misspelling (Dotcor)                         |
⚠️      | Singular/Plural (Doctors)                    | Only works in catching "Doctors" with the query "Doctor"
❌      | Shortening (Dr.)                             |
⚠️      | Phrase Matching ("time travelling doctor")   | ONLY if you get the exact phrase
❌      | Similar Text Matching (physician)            |
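
As the word-fragment row notes, prefix matching needs an explicit :* in the query. A minimal sketch:

SELECT show_name
FROM tv_shows
WHERE search_vector @@ to_tsquery('english', 'doc:*');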

Trigram search

Trigram search handles similarity similarly to tsquery/tsvector, but it measures the distance between trigrams of the content instead. A trigram is a sequence of three consecutive characters taken from a string (the string is lowercased and padded with leading and trailing spaces for more flexibility). For example, the word "Doctor" produces the trigrams: '  d', ' do', 'doc', 'oct', 'cto', 'tor', 'or '.

By comparing the set of trigrams between a search query and the text in your database, PostgreSQL can calculate a similarity score. This is enabled by the pg_trgm extension.

CREATE EXTENSION pg_trgm;

similarity(text, text) returns a score between 0 and 1 (like Full Text Search). This score indicates the degree of similarity between two text strings, calculated based on the number of shared trigrams. A higher score signifies a greater degree of similarity between the two input strings, making it an excellent feature for tasks like finding approximate matches or detecting near-duplicates.
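
For example, comparing our intentional misspelling against the correct spelling scores roughly 0.4, because the two strings only share some of their trigrams:

SELECT similarity('doctor', 'docter');
-- roughly 0.4: the strings share 4 of their 10 distinct trigrams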

There are two basic matching operators for pg_trgm: % and %>. The percent sign is a whole-string match: it compares the entire content string against your search term. This is great if you are trying to match a field in its entirety, but it doesn't work well when trying to find a section within the content. The second is a word-level match, which looks for matches against one or more words inside the content. For full-text-style search you will most likely need the word-level operator.

Each of these searches scores distance like the others, but pg_trgm can also filter out results that fall below a set threshold. You can adjust the thresholds with both pg_trgm.similarity_threshold and pg_trgm.word_similarity_threshold. The lower the number, the looser the match will be.
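
Here is a sketch of both operators in action, along with a trigram GIN index that keeps them fast (the threshold value and index name are just examples):

-- Loosen the word-level threshold (the default is 0.6)
SET pg_trgm.word_similarity_threshold = 0.4;

-- Whole-string match against the show name
SELECT show_name FROM tv_shows WHERE show_name % 'grays anatomy';

-- Word-level match somewhere inside the description
SELECT show_name FROM tv_shows WHERE description %> 'docter';

-- A trigram GIN index supports both operators
CREATE INDEX idx_tv_shows_description_trgm
ON tv_shows USING GIN (description gin_trgm_ops);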

When I was messing with the thresholds, I was able to get the search term “Doctor” to find “Dr.”. That said, when I searched for “Docter” I also got matches for many other words that were not “doctor” or doctor-related.

strict_word_similarity (with its matching pg_trgm.strict_word_similarity_threshold) applies a stricter scoring rule that could help reduce false matches.

Below I compared word_similarity with strict_word_similarity to see how “Docter” (an intentional misspelling) scored under each, keeping any word that scored above 0.2.

WITH search AS (SELECT 'docter' AS term)
SELECT word,
       word_similarity(word, search.term) AS word_sim,
       strict_word_similarity(word, search.term) AS strict_word_sim
FROM tv_shows, search,
     regexp_split_to_table(COALESCE(description, ''), '\s+') AS word
WHERE word_similarity(word, search.term) > 0.2
   OR strict_word_similarity(word, search.term) > 0.2
ORDER BY strict_word_sim DESC;

  word   |  word_sim  | strict_word_sim
---------+------------+-----------------
 doctor. | 0.5714286  | 0.4
 doctor, | 0.5714286  | 0.4
 doctor  | 0.5714286  | 0.4
 doctor  | 0.5714286  | 0.4
 Doctor, | 0.5714286  | 0.4
 doctors | 0.5        | 0.36363637
 doctors | 0.5        | 0.36363637
 after   | 0.33333334 | 0.18181819
 Center  | 0.2857143  | 0.16666667
 Doogie  | 0.2857143  | 0.16666667
 Dr.     | 0.33333334 | 0.11111111
 Dr.     | 0.33333334 | 0.11111111
 Dr.     | 0.33333334 | 0.11111111
 Dr.     | 0.33333334 | 0.11111111
 Dr.     | 0.33333334 | 0.11111111
 her     | 0.25       | 0.1
 her     | 0.25       | 0.1
 her     | 0.25       | 0.1
 deal    | 0.2        | 0.09090909

The trigram approach offers excellent handling of misspellings and is good for finding similar words or fragments without the need for wildcards. It’s simple to implement using the pg_trgm extension. However, its main drawback is semantic weakness: it treats all words equally and lacks the ability to apply language rules like stemming or to ignore stop words, which can reduce the relevance of results in longer documents.

That said, it’s generally a simple upgrade from our tsvector-based search, but it also requires more thought about how users are searching your data.

Support | Challenge                                    | Notes
--------+----------------------------------------------+---------------------------------------------------------------------------
✅      | Searching for an exact answer (doctor)       |
✅      | Searching for a word fragment (Doc)          |
✅      | Capitalization (DOCTOR)                      |
✅      | Misspelling (Dotcor)                         | This is its strength.
✅      | Singular/Plural (Doctors)                    | Yes, due to high character overlap
❌      | Shortening (Dr.)                             | Weak, low character overlap with "Doctor"
⚠️      | Phrase Matching ("time travelling doctor")   | Yes, but scores are based on the whole phrase similarity.
⚠️      | Similar Text Matching (physician)            | Can return matches, but often requires a lower similarity threshold, which adds too much noise

Vector search

Vector search (or semantic search) is a monumental shift in how we approach text retrieval. Instead of matching keywords or character sequences, vector search attempts to understand the meaning or intent of the query.

This is achieved by converting text into high-dimensional numerical arrays called embeddings (vectors). Text with similar meaning will have vectors that are numerically "close" to each other in the vector space.

PostgreSQL can perform vector search with the pgvector extension.

Just like with tsvector search you are trying to calculate the distance between the vectors of your content and your query. Unlike tsvector search, there isn’t a way to calculate the vectors inside of PostgreSQL. This means that you will need to use an outside service to generate the vector embeddings. This can be done with code, one of the LLM APIs or a tool like n8n.

Let’s enable the pgvector extension and alter our table to store our embeddings. We will be using the nomic-embed-text model, which produces an embedding of 768 floating point numbers.

Most importantly, the embedding model must be the same for your content and your query.

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE tv_shows ADD COLUMN embedding vector(768);

Let’s explore the code below using Ollama installed locally:

#!/usr/bin/env python3
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "psycopg[binary]>=3.0.0",
#     "ollama>=0.1.0",
#     "typer>=0.9.0",
# ]
# ///
"""Generate embeddings for TV shows using Ollama and store them in PostgreSQL."""

import os
from typing import Optional

import ollama
import psycopg
import typer

app = typer.Typer()

MODEL_NAME = "nomic-embed-text"


def get_db_connection() -> psycopg.Connection:
    """Connect to PostgreSQL; adjust the connection string for your environment."""
    return psycopg.connect(os.environ["DATABASE_URL"])


def generate_embedding(text: str, model_name: str = MODEL_NAME) -> Optional[list[float]]:
    """Generate an embedding for the given text using Ollama."""
    response = ollama.embed(model=model_name, input=text)
    return response.get("embeddings", [[]])[0]


@app.command()
def generate(model_name: str = MODEL_NAME):
    """Generate embeddings for TV shows and store them in the database."""
    conn = get_db_connection()

    with conn.cursor() as cur:
        # Fetch all TV shows that don't have embeddings yet
        cur.execute(
            """
            SELECT id, show_name, description
            FROM tv_shows
            WHERE embedding IS NULL
            ORDER BY id
            """
        )
        shows = cur.fetchall()

        for show_id, title, description in shows:
            # Create the text to embed: TITLE DESCRIPTION
            text_to_embed = f"{title} {description or ''}"

            # Generate the embedding
            embedding = generate_embedding(text_to_embed, model_name)

            if embedding:
                # pgvector accepts the '[x,y,z]' text format, so cast the parameter to vector
                embedding_text = "[" + ",".join(str(x) for x in embedding) + "]"
                cur.execute(
                    "UPDATE tv_shows SET embedding = %s::vector WHERE id = %s",
                    (embedding_text, show_id),
                )

    # Commit all changes
    conn.commit()

We should now be able to see our vectors:

SELECT show_name, embedding FROM tv_shows LIMIT 1;

   show_name    | embedding
----------------+--------------------------------------------------------
 Grey's Anatomy | [-0.010193545,0.045789447,-0.16749908,...]
(1 row)

Now we can search against our content.

@app.command()
def search(
    query: str = typer.Argument(..., help="Search query to find similar TV shows"),
    limit: int = typer.Option(5, "--limit", "-l", help="Number of results to return"),
    rank: float = 0.95,
    model_name: str = MODEL_NAME,
):
    """Search TV shows using vector similarity."""
    # Generate an embedding for the search query with the same model used for the content
    embedding = generate_embedding(query, model_name)
    embedding_text = "[" + ",".join(str(x) for x in embedding) + "]"

    conn = get_db_connection()

    with conn.cursor() as cur:
        # Search for similar shows using vector similarity
        # The 'rank' hides results that are too far from the query
        cur.execute(
            """
            SELECT id, show_name, description,
                   embedding <-> %s::vector AS distance
            FROM tv_shows
            WHERE embedding IS NOT NULL
              AND (embedding <-> %s::vector) < %s
            ORDER BY distance
            LIMIT %s
            """,
            (embedding_text, embedding_text, rank, limit),
        )
        results = cur.fetchall()

    if not results:
        print("No results found")
        return

    print(f"\n📺 Top {len(results)} similar shows for: '{query}'\n")

    for idx, (show_id, title, description, distance) in enumerate(results, 1):
        similarity = 1 - (distance / 2)  # Normalize distance to a rough similarity
        print(f"{idx}. {title} (similarity: {similarity:.2%})")
        if description:
            print(f"   {description[:100]}...")


if __name__ == "__main__":
    app()

Searching for content like “Doctor”, “DOCTOR”, “Doc”, and “Dr.” seems to give similarly positive results. Misspellings were hit or miss but would work with some minor tweaks to the ranking. What was fantastic was that you could search for some abstract ideas… I could get results for The Good Doctor, a show about a talented physician diagnosed with autism spectrum disorder, using the query “mental disorder” and even “adhd”.

Those results are due to the nature of searching by embeddings, and that has some strange side effects. For example, with the rank set to 0.98, searching for “cook” brings up Doc Martin. I’m not sure why, but “cook” likely pulls close enough to something in the description. Perhaps it’s Doc’s name.

“A British comedy-drama following Dr. Martin Ellingham, a retired neurosurgeon who moves to a small coastal village in Devon and becomes the local doctor.”

Vector search creates better results when the query is structured like the content, so “A time traveling doctor who goes on adventures” is going to have better results than “time travel”.

There is much to learn when it comes to pgvector. It requires the generation of all these vector embeddings, and the search itself is not the fastest solution. One strategy is to create indexes that increase performance. Out of the box, pgvector supports HNSW and IVFFlat indexes. You can also add pgvectorscale, which promises better compression and performance with DiskANN-style indexing.
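
For example, adding an HNSW index on our embedding column is a one-liner. This is a sketch; the index name is arbitrary, and the operator class should match the distance operator you query with (vector_l2_ops for <->):

CREATE INDEX idx_tv_shows_embedding
ON tv_shows USING hnsw (embedding vector_l2_ops);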

The ability to search for intent rather than just keywords makes this the most powerful method for answering complex or conceptual questions. It also can surprise you with results that are not an obvious match but are close based on the context of the content.

Support | Challenge                                    | Notes
--------+----------------------------------------------+---------------------------------------------------------------------------
✅      | Searching for an exact answer (doctor)       |
⚠️      | Searching for a word fragment (Doc)          | The context is more important than matching strings
✅      | Capitalization (DOCTOR)                      |
⚠️      | Misspelling (Dotcor)                         | Minor and common misspellings are often caught
✅      | Singular/Plural (Doctors)                    |
✅      | Shortening (Dr.)                             | Yes, the model understands the semantic link between "Dr." and "Doctor."
✅      | Phrase Matching ("time travelling doctor")   | Yes, understands the meaning of the phrase.
✅      | Similar Text Matching (physician)            | EXCELLENT. The strength of vector search. A search for 'physician' would return all 'Doctor' shows.

The Final Results

The tables may paint a clear picture, but we didn’t talk about storage and performance. That is because there are several factors that can make any of these the correct answer. In fact, here is a full infographic that includes those considerations.

Looking back at our Elephant in the Room experiment, we never got a clear answer as to which solution would work best. It only showed us that searching for the same thing with different search types and settings gave us an array of different results. This is why it’s important to understand how your users will search for your information. This will lead you to optimize for a better result.

There is one last option, which is hybrid search: a combination of two or more of these search types. While this is an option, I believe this is where we start posing the question: are we using the right tool for the job? Aiven for OpenSearch is a dedicated search solution that can perform all of these search types and more (even at the same time) out of the box, with little configuration needed.

I don’t believe the answer is to reach for another service immediately. You can do things like partition your data to make your search windows smaller. You can also collect information on how your customers are querying data and then choose your search type based on your findings. But if you have optimized your Aiven for PostgreSQL instance and you’re still trying to eke out more performance, using a dedicated system may be the correct answer. But for all else, you hopefully have a better understanding of how these search types work and when they should be implemented.

Check out the repo and follow along with a free Aiven for PostgreSQL database.

