Jun 23, 2021

The pursuit of happiness with Python and PostgreSQL®

The world and the data in it are seldom perfect. Read to find out how to clean up the data in your PostgreSQL® database so that it's usable for analysis.

francesco-tisiot — Francesco Tisiot
|RSS Feed
Head of Developer Experience at Aiven

Creating information is just like any creative process. It starts with finding the materials (in this case, datasources) and continues with cleaning, joining and wrangling datasets. By the time you're finished you've built something beautiful and gained new insights to share - and then you start all over again.

One of the most widely-used languages in the creation of information is Python, loved by data scientists, engineers and analysts alike for its great ecosystem of existing libraries for data wrangling.

This blog post explores three such libraries: pandas, ddlgenerator and psycopg2. These each enable us to clean a dataset and push it to a PostgreSQL database where the data can later be queried and exposed to a huge variety of company figures.

Today, the objective is simple: we will explore a happiness dataset and try to find out where in the world we should move to have joyful life!

Here's the high-level process:

Get the dataset
Clean the data
Create a PostgreSQL instance
Push the data to PostgreSQL
Query the Data in PostgreSQL

0. Get the dataset

As mentioned before, begin by finding a dataset. Kaggle is a website widely used in the data science community, providing datasets used for challenges, competitions or learning. They have a nice dataset about World Happiness which only requires a login to be downloaded. It contains 5 CSV files, one per year, listing the happiness ranking of various countries together with some other indicators.

Let's create a folder named happiness and a subfolder named data for storing the files:

mkdir -p happiness/data
cd happiness

Next, download the data from Kaggle and place the 5 CSVs in the happiness/data folder.

1. Clean the data

As happens frequently, source datasets are not always tidy and clean. This is also true for our happiness dataset. When browsing the files we can see that 2015.csv and 2016.csv share the same format of column names and data. But the 2017.csv has the Region column missing, some others named and ordered differently (Generosity and Trust are swapped). The same applies to the files 2018.csv and 2019.csv where we can't find the confidence intervals anymore. Moreover, the year is only contained in the file name, and not in a column.

Before starting our cleaning efforts, it's good practice to create a separate virtual environment every time we start a new Python project. This ensures our default Python environment is not overloaded with possibly conflicting libraries. We can create a virtual environment and enable it like this:

python -m venv python_pg
source python_pg/bin/activate

The files need some work in order to produce a consistent view of the data, so let's put on our Python superhero mask. Start by installing the pandas library, which will help us with data manipulation and analysis, by executing the following code in your terminal:

pip install pandas

Now we can create a file named prepare.py in our happiness folder and start including some Python code in it. We'll start by importing the library installed earlier, together with os and glob which should be installed by default and enable the interaction with the operative system and pathname pattern expansion respectively.

import pandas as pd
import glob, os

Next, define the column names that we're going to use in our target DataFrame named all_years_df to store the dataset once cleaned. Add the following lines to the prepare.py file:

column_names = ['survey_yr', 'country', 'overall_rank',
                'score', 'gdp', 'generosity', 'freedom',
                'social_support', 'life_exp', 'gov_trust']

all_years_df = pd.DataFrame()

Now it's time to bring order to chaos and parse all the CSV files in the data folder. Add the following code to the prepare.py file:

files = glob.glob('data/*.csv')

for fp in files:
    # Handling file names as new column
    file_year = os.path.basename(fp).split('.')[0]
    file_content = pd.read_csv(fp).assign(Year=int(file_year)-2000)

    # Picking the right column for each file based on the year
    uniformed_columns = find_columns(file_content, file_year)

    # Conforming column names and appending the parsed year data to all_years_df
    uniformed_columns.columns = column_names
    all_years_df = pd.concat([all_years_df, uniformed_columns])

all_years_df.to_csv('results.csv', index=False)

The glob function returns the list of files in the happiness/data folder, that we loop over. For each file, we read the filename with os.path.basename, split the name on the dot (.) and take the first part, extracting only the year information. The file_content line uses pandas' read_csv function to load the file data and populates the new column Year (using the assign function) with the last two digits of the year (int(file_year)-2000).

Next we call the find_columns function, which we'll define soon, to select the correct columns from the source file depending on the year.

After that, we consolidate the column names in the uniformed_columns DataFrame containing only one year of data, and then append it to the all_years_df DataFrame that will contain the whole dataset at the end of the process. Lastly we store the end result in a CSV file named results.csv in the happiness folder.

Picking the correct columns

The missing part is the find_columns function definition where we consolidate the column names and order. As mentioned above, depending on the year, we need to select the correct columns from the source file. Add the following code to the prepare.py file, just after the import statements, to perform the trick on various years:

def find_columns(file_content, file_year):
    if file_year in ['2015', '2016']:
        uniformed_columns = file_content[[
            'Year', 'Country', 'Happiness Rank', 'Happiness Score',
            'Economy (GDP per Capita)', 'Generosity', 'Freedom', 'Family',
            'Health (Life Expectancy)', 'Trust (Government Corruption)'
            ]]

    elif file_year in ['2017']:
        uniformed_columns = file_content[[
            'Year', 'Country', 'Happiness.Rank', 'Happiness.Score',
            'Economy..GDP.per.Capita.', 'Generosity', 'Freedom', 'Family',
            'Health..Life.Expectancy.', 'Trust..Government.Corruption.'
            ]]

    else:
        uniformed_columns = file_content[[
            'Year', 'Country or region', 'Overall rank', 'Score',
            'GDP per capita', 'Generosity', 'Freedom to make life choices',
            'Social support', 'Healthy life expectancy',
            'Perceptions of corruption'
            ]]

    return uniformed_columns

The function picks the correct column for each year and returns the resulting DataFrame.

It's now time to check that the pipeline is working. With the terminal positioned on the happiness folder let's execute the following:

python prepare.py

Now check the file results.csv in the happiness folder to verify that the code worked. The results should look like this:

2. Create a PostgreSQL instance

The second step is to store our data in a PostgreSQL table. But before storing it, we'll need a PG Database; let's create one with Aiven CLI in our terminal

avn service create demo-pg    \
  -t pg                       \
  --cloud google-europe-west3 \
  -p hobbyist

The above command creates a PostgreSQL instance named demo-pg on google-europe-west3 using a hobbyist plan. You can review all PostgreSQL plans on our pricing page. To connect to it, we'll need the service uri containing information like host, port, and the default avnadmin user's password. We can retrieve it with:

avn service get demo-pg --format '{service_uri}'

Now we have a couple of minutes to relax while we wait for the instance to be ready after the following call in the terminal

avn service wait demo-pg

3. Push the data to PostgreSQL

Once the service is running, go back to Python and prepare the data push.

We'll use the ddlgenerator library to automatically generate the SQL statement to create the PostgreSQL table, and psycopg2 to interact with the database. We can install both with the following line in the terminal:

pip install psycopg2 ddlgenerator

Now create a new push.py file, add the import, and connect to the database statements. Replace the <SERVICE_URI_OUTPUT> parameter in the following code with the output of the avn service get command above:

import psycopg2

conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')

The connection is established. We can now create the table and push the data. ddlgenerator provides a nice functionality to create both table DDLs and insert statements from a CSV file. We can append the following code to the push.py file:

from ddlgenerator.ddlgenerator import Table

table = Table('results.csv', table_name='HAPPINESS')
sql = table.sql('postgresql', inserts=True)

After importing the Table class from ddlgenerator, the code creates an instance parsing the results.csv file, with a target table name of HAPPINESS. The last line creates the CREATE table definition SQL for PostgreSQL, including the insert statements for each line (inserts=True). If we print the SQL statement it should look something like the following:

DROP TABLE IF EXISTS happiness;

CREATE TABLE happiness (
	survey_yr INTEGER NOT NULL,
	country VARCHAR(24) NOT NULL,
	overall_rank INTEGER NOT NULL,
	score DECIMAL(15, 14) NOT NULL,
	gdp DECIMAL(17, 16) NOT NULL,
	generosity DECIMAL(16, 16) NOT NULL,
	freedom DECIMAL(16, 16) NOT NULL,
	social_support DECIMAL(16, 15) NOT NULL,
	life_exp DECIMAL(17, 16) NOT NULL,
	gov_trust DECIMAL(16, 16)
);

INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Finland', 1, 7.769, 1.34, 0.153, 0.596, 1.587, 0.986, 0.393);
INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Denmark', 2, 7.6, 1.383, 0.252, 0.592, 1.573, 0.996, 0.41);
...

We can now execute the sql statement against the database by adding the following code to the push.py file

cur = conn.cursor()
cur.execute(sql)
conn.commit()
cur.close()
conn.close()

We created a cursor, executed the sql statement, committed the change to the database and closed both cursor and connection. It's now time to run the code. With the same terminal on the happiness folder, run the following:

python push.py

Congrats, the data is now loaded in PostgreSQL.

4. Query the data in PostgreSQL

Let's verify that the data is correctly populated in the PostgreSQL happiness table. Create a new Python file query.py with the following content (as before, edit the connection service uri):

import psycopg2

conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')

cur = conn.cursor()
cur.execute('''
SELECT SURVEY_YR, COUNTRY, OVERALL_RANK
from HAPPINESS
WHERE OVERALL_RANK <= 3
ORDER BY SURVEY_YR, OVERALL_RANK
''')

print(cur.fetchall())
cur.close()
conn.close()

As per the push.py file, we are creating a connection to PostgreSQL, then executing a query to check which countries were in the top three positions over the various years. The result should be similar to:

[
(15, 'Switzerland', 1),   (15, 'Iceland', 2),       (15, 'Denmark', 3),
(16, 'Denmark', 1),       (16, 'Switzerland', 2),   (16, 'Iceland', 3),
(17, 'Norway', 1),        (17, 'Denmark', 2),       (17, 'Iceland', 3),
(18, 'Finland', 1),       (18, 'Norway', 2),        (18, 'Denmark', 3),
(19, 'Finland', 1),       (19, 'Denmark', 2),       (19, 'Norway', 3)
]

It looks like Northern European countries are a great place to live! Take particular notice of how Aiven's home country, Finland, has been at the top of the table in the last two years.

The data, now available in a PostgreSQL relational table, can be queried and exposed by all major analytics and reporting tools, making it accessible to a wider audience.

Wrapping up

Finding the right dataset is just the beginning: cleaning, wrangling and publishing the data creates information to help people in their decision-making. Python is a great fit in this area and widely adopted in the data community. The pandas, ddlgenerator and psycopg2 libraries mentioned in the blog enable an easy data handling and push to PostgreSQL. This makes for a solid base for data discovery or corporate reporting.

Some more info:

Aiven for PostgreSQL: all the info about Aiven's offering
Aiven PostgreSQL supported extensions: browse the extensions we support
pandas, ddlgenerator and psycopg2: documentation about the libraries used

Not using Aiven services yet? Sign up now for your free trial at https://console.aiven.io/signup!

In the meantime, make sure you follow our changelog and blog RSS feeds or our LinkedIn and Twitter accounts to stay up-to-date with product and feature-related news.

The pursuit of happiness with Python and PostgreSQL®

0. Get the dataset

1. Clean the data

Picking the correct columns

2. Create a PostgreSQL instance

3. Push the data to PostgreSQL

4. Query the data in PostgreSQL

Wrapping up

Further reading

Stay updated with Aiven

Related resources

Solve a DVD rental mystery with pgweb

Aiven's journey to enable a better Apache Flink^® experience

Enabling high availability for Klaw with NGINX

The pursuit of happiness with Python and PostgreSQL®

0. Get the dataset

1. Clean the data

Picking the correct columns

2. Create a PostgreSQL instance

3. Push the data to PostgreSQL

4. Query the data in PostgreSQL

Wrapping up

Further reading

Stay updated with Aiven

Related resources

Solve a DVD rental mystery with pgweb

Aiven's journey to enable a better Apache Flink® experience

Enabling high availability for Klaw with NGINX

Aiven's journey to enable a better Apache Flink^® experience