Develop and leverage AI models in ClickHouse®

Learn how to train AI models and perform live scoring with a set of SQL statements and Aiven for ClickHouse®

ClickHouse® is a well-known analytical database, but one of its lesser-known features is the ability to train AI models and perform live scoring with a set of SQL statements. In this tutorial we'll showcase how, starting from a taxi dataset, we can train a linear regression model and use it to predict the future cost of a ride.

Prerequisites

Start with Aiven for ClickHouse

We can create an Aiven for ClickHouse service with the following process:

  1. Access the Aiven Console

  2. Create an Aiven for ClickHouse® service and specify:

    • The cloud provider and region
    • The service plan, defining the size of the service. We can use the free tier for the purpose of this tutorial
    • The additional storage disk size
    • The service name
  3. Click on Create

Wait a couple of minutes until the service is in RUNNING state.

Load the data

For this tutorial, we'll use a dataset containing New York taxi trip data from Kaggle. Navigate to the Kaggle website, download the file and unzip it. The folder contains a set of files with the naming pattern yellow_tripdata_YYYY-MM.csv. Each one includes a month of taxi rides in the years 2019 and 2020.

To load it in Aiven for ClickHouse, connect to the database with the clickhouse-client Docker image:

docker run -it \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure
You can find the USERNAME, PASSWORD, HOST and PORT information in the Aiven Console, under the Aiven for ClickHouse service overview page in the ClickHouse Native tab.

Create a taxi_trips_train table for our training dataset:

CREATE TABLE taxi_trips_train (
    VendorID UInt32,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count UInt32,
    trip_distance Float32,
    RatecodeID UInt32,
    store_and_fwd_flag String,
    PULocationID UInt32,
    DOLocationID UInt32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    congestion_surcharge Float32
)
ENGINE = MergeTree
PRIMARY KEY (tpep_pickup_datetime, tpep_dropoff_datetime);

Create another table, named taxi_trips_test, for our testing dataset:

CREATE TABLE taxi_trips_test (
    VendorID UInt32,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count UInt32,
    trip_distance Float32,
    RatecodeID UInt32,
    store_and_fwd_flag String,
    PULocationID UInt32,
    DOLocationID UInt32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    congestion_surcharge Float32
)
ENGINE = MergeTree
PRIMARY KEY (tpep_pickup_datetime, tpep_dropoff_datetime);

Then exit from the Docker container:

exit

Next, we need to clean our data source a little. Remove the Windows end of line symbols from yellow_tripdata_2019-01.csv, representing our training dataset, with:

tr -d '\015' <yellow_tripdata_2019-01.csv >yellow_tripdata_2019-01-polished.csv
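If tr isn't available (for example on Windows), a small Python sketch can perform the same cleanup; the helper name below is hypothetical:

```python
def strip_carriage_returns(src_path, dst_path):
    """Remove carriage returns (\\r, octal 015), mirroring: tr -d '\\015' <src >dst."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream in 1 MiB chunks so large monthly CSV files don't need to fit in memory.
        for chunk in iter(lambda: src.read(1 << 20), b""):
            dst.write(chunk.replace(b"\r", b""))

# Example (uncomment to run against the dataset files):
# strip_carriage_returns("yellow_tripdata_2019-01.csv",
#                        "yellow_tripdata_2019-01-polished.csv")
```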

Then load the data from the yellow_tripdata_2019-01-polished.csv file to the taxi_trips_train table with:

cat yellow_tripdata_2019-01-polished.csv | docker run -i \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure \
  --max_insert_block_size=100000 \
  --query="INSERT INTO taxi_trips_train FORMAT CSVWithNames"

Load the yellow_tripdata_2019-02.csv data into the taxi_trips_test table; this represents our testing dataset:

tr -d '\015' <yellow_tripdata_2019-02.csv >yellow_tripdata_2019-02-polished.csv

cat yellow_tripdata_2019-02-polished.csv | docker run -i \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure \
  --max_insert_block_size=100000 \
  --query="INSERT INTO taxi_trips_test FORMAT CSVWithNames"

Train a stochastic linear regression model

Once the data is uploaded, we can focus on the main task of training an AI model. Checking the dataset, we notice a column called fare_amount, representing the fare as calculated by the meter based on time and distance. In the next section we'll build a model to predict fare_amount based on other columns in the dataset, using the stochastic linear regression function built into ClickHouse.

To train the model, connect to ClickHouse again:

docker run -it \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure

Then execute the following SQL to train the stochastic linear regression model:

CREATE TABLE taxi_trips_model ENGINE = Memory AS
SELECT stochasticLinearRegressionState(0.001, 0.1, 15, 'Adam')(
    fare_amount,
    passenger_count,
    trip_distance,
    RatecodeID,
    PULocationID,
    DOLocationID
) AS state
FROM taxi_trips_train;

The above statement trains a stochastic linear regression model with the following parameters:

  • taxi_trips_model as name
  • 0.001 as learning rate
  • 0.1 as l2 regularization coefficient
  • 15 as mini-batch size
  • Adam as method for updating weights
  • fare_amount as target column
  • passenger_count, trip_distance, RatecodeID, PULocationID, and DOLocationID as training parameters
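For intuition about what those parameters do, here's a minimal Python sketch of mini-batch stochastic gradient descent for linear regression with L2 regularization. This is an illustrative toy, not ClickHouse's actual implementation (which additionally supports weight-update methods such as Adam; the sketch uses plain gradient descent):

```python
import random

def sgd_linear_regression(X, y, lr=0.001, l2=0.1, batch_size=15, epochs=100):
    """Toy analogue of stochasticLinearRegressionState(lr, l2, batch_size, ...):
    fit y ~ w.x + b by mini-batch gradient descent on squared error."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        idx = list(range(len(X)))
        random.shuffle(idx)                     # stochastic: visit rows in random order
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:                     # accumulate gradient over the mini-batch
                pred = sum(wj * xj for wj, xj in zip(w, X[i])) + b
                err = pred - y[i]
                for j in range(n_features):
                    grad_w[j] += err * X[i][j]
                grad_b += err
            m = len(batch)
            for j in range(n_features):         # step: learning rate * (gradient + L2 penalty)
                w[j] -= lr * (grad_w[j] / m + l2 * w[j])
            b -= lr * grad_b / m
    return w, b
```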

Test the stochastic linear regression model

After building the model, let's test it against our taxi_trips_test dataset. We can compare the actual fare_amount in our test dataset with the prediction from the taxi_trips_model with:

WITH (SELECT state FROM taxi_trips_model) AS model
SELECT
    rowNumberInAllBlocks(),
    fare_amount,
    evalMLMethod(
        model,
        passenger_count,
        trip_distance,
        RatecodeID,
        PULocationID,
        DOLocationID
    ) AS prediction
FROM taxi_trips_test
LIMIT 10;

The above SQL:

  • Retrieves the state from the taxi_trips_model model
  • Includes the row number with the rowNumberInAllBlocks() function
  • Selects the fare_amount from the test dataset
  • Includes the prediction from the model using the evalMLMethod function, passing the same columns used for training, but coming from the test dataset

The results show predictions not too far from the actual data:

┌─rowNumberInAllBlocks()─┬─fare_amount─┬─────────prediction─┐
│                      0 │          52 │  57.29841872226803 │
│                      1 │        42.5 │   43.8984970367608 │
│                      2 │           7 │  7.425320721767511 │
│                      3 │          10 │ 11.119674442058924 │
│                      4 │        11.5 │ 13.151729718574753 │
│                      5 │           7 │  7.861355191000587 │
│                      6 │         9.5 │ 10.534773043722428 │
│                      7 │           9 │ 10.270891195885282 │
│                      8 │         5.5 │  6.191941850198845 │
│                      9 │          16 │ 17.262316387594158 │
└────────────────────────┴─────────────┴────────────────────┘

Measure the stochastic linear regression model performance

How does the model perform? In the AI/ML field there are a set of different KPIs to understand the quality of a model. We are going to use the mean absolute error (MAE) in this example. The MAE calculates the average difference between the prediction and the actual value in the testing dataset. We can calculate the overall MAE with:

WITH
    (SELECT state FROM taxi_trips_model) AS model,
    pred AS (
        SELECT
            rowNumberInAllBlocks() AS rownum,
            fare_amount,
            evalMLMethod(
                model,
                passenger_count,
                trip_distance,
                RatecodeID,
                PULocationID,
                DOLocationID
            ) AS prediction
        FROM taxi_trips_test
    )
SELECT
    avg(abs(fare_amount - prediction)) AS MAE,
    avg(fare_amount) AS avg_fare_amount,
    avg(abs(fare_amount - prediction)) * 100.0 / avg(fare_amount) AS pct
FROM pred
WHERE fare_amount > 0 AND fare_amount <= 500;

The above query calculates:

  • The MAE as the average of the absolute difference between the actual fare_amount and the prediction
  • The average fare_amount
  • The weight of the MAE on the average fare_amount

The results say that our model is off by about 2.03 dollars per trip; the error is around 15.97%. Not bad for a single SQL query.

┌────────────────MAE─┬────avg_fare_amount─┬────────────────pct─┐
│  2.035759828017413 │ 12.746705737557932 │ 15.970870199184755 │
└────────────────────┴────────────────────┴────────────────────┘
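As a sanity check, the MAE arithmetic is easy to reproduce outside the database. Here's a Python sketch using a handful of hypothetical (actual, predicted) fare pairs, not the full dataset:

```python
# Hypothetical (actual fare, predicted fare) pairs for illustration only.
pairs = [(52.0, 57.30), (42.5, 43.90), (7.0, 7.43), (10.0, 11.12), (11.5, 13.15)]

# MAE: average absolute difference between actual and predicted values.
mae = sum(abs(actual - pred) for actual, pred in pairs) / len(pairs)
avg_fare = sum(actual for actual, _ in pairs) / len(pairs)
pct = mae * 100.0 / avg_fare  # MAE expressed as a percentage of the average fare
print(mae, avg_fare, pct)
```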

Improve the model with feature engineering in ClickHouse

Can we improve the performance? Yes, with some feature engineering! Feature engineering is the practice of creating new parameters to make the model more accurate. You can derive these parameters from existing ones, enrich them with external sources, or aggregate them from various fields. All of this is native to ClickHouse, which allows us to explore the data and reshape it as needed.

The first step is to check the data quality; let's investigate the presence of outliers in the fare_amount column by bucketing the data into $10 buckets:

SELECT
    CAST(fare_amount / 10 AS int) * 10 AS bucket,
    count(*)
FROM taxi_trips_train
GROUP BY bucket
ORDER BY bucket;

Notice that there are many trips with a negative fare_amount, as well as amounts greater than $500. These are probably mistakes in the data collection, and outliers we might want to take into account when creating the model.
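The $10 bucketing above is plain integer arithmetic. A small Python sketch (with made-up fare values) shows where different fares land; note that int() truncates toward zero, which is also how ClickHouse casts a float to an integer:

```python
def fare_bucket(fare_amount):
    # Same expression as CAST(fare_amount / 10 AS int) * 10 in the query.
    return int(fare_amount / 10) * 10

# Hypothetical fares: a negative outlier, typical rides, and a large outlier.
for fare in [-14.5, 7.0, 52.0, 499.0, 623.0]:
    print(fare, "->", fare_bucket(fare))
```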

Let's create a new model that:

  • Excludes the outliers identified above, with a cost greater than $500
  • Predicts the absolute value of fare_amount, avoiding the negative numbers
  • Includes, as additional parameters:
    • The pickup hour
    • The pickup day of the week
    • The time from pickup to dropoff in buckets of 10 minutes
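The derived features above can be sketched in Python to see what the model will receive; the timestamps below are made up for illustration:

```python
from datetime import datetime

pickup = datetime(2019, 1, 18, 8, 42)   # hypothetical pickup time
dropoff = datetime(2019, 1, 18, 9, 17)  # hypothetical dropoff time

day_of_week = pickup.isoweekday()  # 1 = Monday, matching ClickHouse's toDayOfWeek
hour = pickup.hour                 # matching toHour(tpep_pickup_datetime)
# Like age('minute', pickup, dropoff) / 10: trip duration in 10-minute buckets.
duration_bucket = (dropoff - pickup).seconds // 60 / 10

print(day_of_week, hour, duration_bucket)
```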

First, if you already created a taxi_trips_model_enhanced table in a previous run, drop it:

DROP TABLE IF EXISTS taxi_trips_model_enhanced;

Then re-create it with our new model:

CREATE TABLE taxi_trips_model_enhanced ENGINE = Memory AS
SELECT stochasticLinearRegressionState(0.001, 0.1, 15, 'Adam')(
    abs(fare_amount),
    toDayOfWeek(tpep_pickup_datetime),
    toHour(tpep_pickup_datetime),
    age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10,
    passenger_count,
    trip_distance,
    RatecodeID,
    PULocationID,
    DOLocationID
) AS state
FROM taxi_trips_train
WHERE fare_amount <= 500;

The pickup hour and day of the week are added to the new taxi_trips_model_enhanced model by the toDayOfWeek(tpep_pickup_datetime) and toHour(tpep_pickup_datetime) columns. The calculation age('minute', tpep_pickup_datetime, tpep_dropoff_datetime)/10 provides the time in minutes between pickup and dropoff in buckets of 10 minutes.

We can test the new model with:

WITH
    (SELECT state FROM taxi_trips_model_enhanced) AS model,
    pred AS (
        SELECT
            rowNumberInAllBlocks() AS rownum,
            fare_amount,
            evalMLMethod(
                model,
                toDayOfWeek(tpep_pickup_datetime),
                toHour(tpep_pickup_datetime),
                age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10,
                passenger_count,
                trip_distance,
                RatecodeID,
                PULocationID,
                DOLocationID
            ) AS prediction
        FROM taxi_trips_test
    )
SELECT
    avg(abs(abs(fare_amount) - prediction)) AS MAE,
    avg(abs(fare_amount)) AS avg_fare_amount,
    avg(abs(abs(fare_amount) - prediction)) * 100.0 / avg(fare_amount) AS pct
FROM pred
WHERE abs(fare_amount) <= 500;

The result shows an improvement of more than one percentage point compared to the previous model.

┌────────────────MAE─┬────avg_fare_amount─┬────────────────pct─┐
│ 1.8768253472744798 │ 12.658898150895748 │ 14.854519179740457 │
└────────────────────┴────────────────────┴────────────────────┘

You can further refine the model parameters to raise its precision. This tutorial demonstrated how, with just SQL statements, you can train and test a stochastic linear regression model entirely within ClickHouse.

Conclusion

Adding artificial intelligence to your application doesn't always require external tools. Moving data out to such tools can bypass the security barriers you set up in your database, exposing you to the risk of accidental data access or the use of stale data. Modern databases like ClickHouse expose functionality that allows you to implement model training and prediction directly where your data resides, keeping your data secure and up to date.