Picking a baby name with PostgreSQL® and FuzzyStrMatch

Choosing the name of your new child is really serious, so as a data professional you will naturally want to reach for PostgreSQL®. Find out more below.

Naming variables is one of the hardest problems in Computer Science, but when the variable is a new life, the challenge can rise quite significantly. Family names, personal preferences, celebrities, vetoes, word sound, it all goes into the decision mix! Even more, the possible pool of names is huge - and we all know that the traditional "let's browse for names in a book" approach doesn't scale anymore with the ever-increasing set of options.

Help is needed, and this blog post shows how PostgreSQL® can help us refine our search and get us closer to a final selection.

All choices, decisions, and names used in this example are pure imagination and don't represent any kind of recommendation!

Get the dataset

To start with, we need a dataset of valid names. This post uses this collection of baby names from data.world, for which you need a valid account to access the data.

Once you have created the account, you will be able to download the dataset in a babynames-clean.csv file on your local machine.

Create a PostgreSQL service and load the data

You can run PostgreSQL locally or, if you prefer, create an instance using Aiven for PostgreSQL® with the Aiven CLI command, as shown below:

avn service create demo-pg \ --service-type pg \ --cloud google-europe-west3 \ --plan hobbyist

The command starts creation of a PostgreSQL service (--service-type pg) on the Google Europe West 3 cloud region with the minimal hobbyist plan, which is enough for our needs. The service takes a couple of minutes to start, and we can wait for it to be up with:

avn service wait demo-pg

Once the service is in RUNNING state, we can connect to it with the following command that calls psql under the hood:

avn service cli demo-pg

After connecting we need to create a table, since babynames-clean.csv doesn't contain any header:

create table names (name text, gender text);

Now it's time to copy the data:

\copy names from 'babynames-clean.csv' csv

Now the data is loaded in the names table and we can start the research of our imaginary baby's name.

Initial scanning and filtering of the data

Okay, time to choose a name! Is it going to be a boy or a girl? If you already have that information, it's going to cut out half of the options, since the query below shows that we have a more or less equal amount of male and female names.

select gender, count(*) nr_names from names group by gender;

The result shows slightly more boys' names than girls'.

gender | nr_names --------+---------- girl | 3345 boy | 3437 (2 rows)

What if we don't know the gender yet, and we have almost seven thousand names to pick from? Scanning all of them one by one is not an option. We need to apply some filters.
Do you have a long surname? It might be nice to save the poor kid from the stress of writing out a long first name as well. Let's filter for names with fewer than 6 characters.

select count(*) from names where length(name) < 6;

We still have lots of results (2632), so some more pruning is needed ... I personally like names ending with vowels - can we add this constraint? Yes we can, with the help of regular expressions:

select count(*) from names where length(name) < 6 and name similar to '%(a|e|i|o|u|y)';

With the help of regular expressions we can take it even further, like specifying 1 l or 2 consecutive ls:

select * from names where length(name) < 6 and name similar to '%(l){1,2}%(a|e|i|o|u|y)' limit 10;

Regular expressions can only help so much, and we still have too many names in our list. We need some more ways to exclude options.

Avoid similar names

We all have names we never want to hear again. Maybe they remind us of someone we don't want to think about, or they're just taken by somebody else we hear about way too often. It's easy to avoid these names, but we can use PostgreSQL's FuzzyStrMatch to go a step further and eliminate anything even vaguely similar. We can install the extension with:

create extension fuzzystrmatch;

The extension brings us several functions. The first one, named Levenstein difference, helps us understand how many characters make two strings differ. Let's say, for example, we don't want anything that is written too similar to Allie.

select *, levenshtein(name, 'Allie') diff from names where length(name) < 6 and name similar to '%(l){1,2}%(a|e|i|o|u|y)' order by diff limit 10;

Returns

name | gender | diff -------+--------+------ Allie | boy | 0 Altie | girl | 1 Alfie | boy | 1 Arlie | boy | 1 Alvie | boy | 1 Alcie | girl | 1 Algie | boy | 1 Ellie | boy | 1 Ollie | boy | 1 Alvia | boy | 2 (10 rows)

We can see that the Levenshtein difference between:

  • Allie and Allie is 0 since they are the same string
  • Allie and Altie is 1 (the t)
  • Allie and Alfie is 1 (the f)
  • Allie and Alvia is 2 (the v and a)

We can remove all names with a Levenshtein difference of less than 3, since this helps us avoid names written too similarly to Allie.

select count(*) from names where length(name) < 6 and name similar to '%(l){1,2}%(a|e|i|o|u|y)' and levenshtein(name, 'Allie') > 3;

We still have 235 rows, but the difference in the number of characters is just the beginning. We don't even want to hear a sound similar to Allie! Luckily the FuzzyStrMatch extension brings us another function, named soundex that provides the Soundex code of a string. Let's try with Allie itself:

select soundex('Allie');

The result is A400 which, by itself, is not very useful. But FuzzyStrMatch also offers difference which, despite the name, provides a Soundex similarity score between 0 (two strings sound different) to 4 (two strings sound the same). If we try with our dataset:

select *, soundex(name) sdx, difference(name, 'Allie') sdx_score from names where length(name) < 6 and name similar to '%(l){1,2}%(a|e|i|o|u|y)' and levenshtein(name, 'Allie') > 3 order by sdx_score desc limit 5;

This shows that Clay, Milo, Riley, Wiley and Rolla sound a bit too close to our Allie.

name | gender | sdx | sdx_score -------+--------+------+----------- Clay | boy | C400 | 3 Milo | boy | M400 | 3 Riley | boy | R400 | 3 Wiley | boy | W400 | 3 Rolla | boy | R400 | 3 (5 rows)

We want to go for an entirely different sound, so let's remove anything with a Soundex score greater than 2.

select count(*) from names where length(name) < 6 and name similar to '%(l){1,2}%(a|e|i|o|u|y)' and levenshtein(name, 'Allie') > 3 and difference(name, 'Allie') <= 2;

The final touch: the metaphone

We still have 115 rows, a few too many to choose from. The final step in the selection could be based on a name of which we like the sound, and therefore we'd like to explore similar-sounding ones. Let's take Marlo as an example. To retrieve names that sound similar we can either use the Soundex function again, or we could evolve our approach and use the Metaphone or double metaphone for higher accuracy.

select dmetaphone('Marlo') dmet;

The double metaphone function returns the MRL string. We can use that, together with the Levenshtein difference, to find all strings which vary from it by only one character.

select count(*) from names where length(name) < 6 and name similar to '%(l){1,2}%(a|e|i|o|u|y)' and levenshtein(name, 'Allie') > 3 and difference(name, 'Allie') <= 2 and levenshtein(dmetaphone(name), dmetaphone('Marlo')) <=1;

Now we have to parse only 26 rows, and frankly, this is where the hard choices begin. Still, with the help of PostgreSQL, some regular expressions and the FuzzyStrMatch extension, we were able to prune a lot of our possible dataset.

Which name is best? The final selection is, of course, up to you.

Ready-to-use functions for string similarity

String similarity is not an easy topic, but luckily the FuzzyStrMatch extension provides a few functions like Levenshtein, Soundex and Metaphone that can help us solve some common problems when comparing texts.

Some further resources:

  • FuzzyStrMatch: to understand functions and additional parameters
  • Pg similarity: another PostgreSQL extension providing a different set of text similarity functions
  • Pattern matching: to review the details of regular expression usage