Develop and leverage AI models in ClickHouse®
Learn how to train AI models and perform live scoring with a set of SQL statements and Aiven for ClickHouse®
ClickHouse® is a well-known analytical database, but one of its lesser-known features is the ability to train AI models and perform live scoring with a set of SQL statements. In this tutorial we'll showcase how, starting from a taxi dataset, we can train a linear regression model and use it to predict the cost of a ride.
We can create an Aiven for ClickHouse® service with the following process:
Access the Aiven Console
Create an Aiven for ClickHouse® service, specifying the cloud provider, region, service plan, and service name
Click on Create
Wait a couple of minutes until the service is in the RUNNING state.
For this tutorial, we'll use a dataset containing New York taxi trip data from Kaggle. Navigate to the Kaggle website, download the dataset archive, and unzip it. The folder contains a set of files with the naming pattern yellow_tripdata_YYYY-MM.csv; each one includes a month of taxi rides from 2019 or 2020.
To load it into Aiven for ClickHouse, connect to the database with the clickhouse-client Docker image:
docker run -it \
    --rm clickhouse/clickhouse-server clickhouse-client \
    --user USERNAME \
    --password PASSWORD \
    --host HOST \
    --port PORT \
    --secure
You can find the USERNAME, PASSWORD, HOST and PORT information in the Aiven Console, on the Aiven for ClickHouse service overview page, in the ClickHouse Native tab.

Create a taxi_trips_train table for our training dataset:
CREATE TABLE taxi_trips_train (
    VendorID UInt32,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count UInt32,
    trip_distance Float32,
    RatecodeID UInt32,
    store_and_fwd_flag String,
    PULocationID UInt32,
    DOLocationID UInt32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    congestion_surcharge Float32
)
ENGINE = MergeTree
PRIMARY KEY (tpep_pickup_datetime, tpep_dropoff_datetime);
Create another table, named taxi_trips_test, for our testing dataset:
CREATE TABLE taxi_trips_test (
    VendorID UInt32,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count UInt32,
    trip_distance Float32,
    RatecodeID UInt32,
    store_and_fwd_flag String,
    PULocationID UInt32,
    DOLocationID UInt32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    congestion_surcharge Float32
)
ENGINE = MergeTree
PRIMARY KEY (tpep_pickup_datetime, tpep_dropoff_datetime);
Then exit the Docker container:
exit
Next, we need to clean our data source a little. Remove the Windows end-of-line symbols from yellow_tripdata_2019-01.csv, which represents our training dataset, with:
tr -d '\015' <yellow_tripdata_2019-01.csv >yellow_tripdata_2019-01-polished.csv
Then load the data from the yellow_tripdata_2019-01-polished.csv file to the taxi_trips_train table with:
cat yellow_tripdata_2019-01-polished.csv | docker run -i \
    --rm clickhouse/clickhouse-server clickhouse-client \
    --user USERNAME \
    --password PASSWORD \
    --host HOST \
    --port PORT \
    --secure \
    --max_insert_block_size=100000 \
    --query="INSERT INTO taxi_trips_train FORMAT CSVWithNames"
Similarly, clean the yellow_tripdata_2019-02.csv file and load it into the taxi_trips_test table; this represents our testing dataset:
tr -d '\015' <yellow_tripdata_2019-02.csv >yellow_tripdata_2019-02-polished.csv
cat yellow_tripdata_2019-02-polished.csv | docker run -i \
    --rm clickhouse/clickhouse-server clickhouse-client \
    --user USERNAME \
    --password PASSWORD \
    --host HOST \
    --port PORT \
    --secure \
    --max_insert_block_size=100000 \
    --query="INSERT INTO taxi_trips_test FORMAT CSVWithNames"
Once the data is uploaded, we can focus on the main task: training an AI model. Looking at the dataset, we notice a column called fare_amount, representing the fare calculated by the meter based on time and distance. In the next section we'll build a model to predict fare_amount based on other columns in the dataset, using the stochastic linear regression function built into ClickHouse.
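Before training, it can help to quickly profile the target column. For example, a query along these lines summarizes fare_amount in the training table:

SELECT
    count(*) AS trips,
    min(fare_amount) AS min_fare,
    avg(fare_amount) AS avg_fare,
    max(fare_amount) AS max_fare
FROM taxi_trips_train;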
To train the model, connect to ClickHouse again:
docker run -it \
    --rm clickhouse/clickhouse-server clickhouse-client \
    --user USERNAME \
    --password PASSWORD \
    --host HOST \
    --port PORT \
    --secure
Then execute the following SQL to train the stochastic linear regression model:
CREATE TABLE taxi_trips_model ENGINE = Memory AS
SELECT
    stochasticLinearRegressionState(0.001, 0.1, 15, 'Adam')(
        fare_amount,
        passenger_count,
        trip_distance,
        RatecodeID,
        PULocationID,
        DOLocationID
    ) AS state
FROM taxi_trips_train;
The above statement trains a stochastic linear regression model with the following parameters:
taxi_trips_model as the model name
0.001 as the learning rate
0.1 as the L2 regularization coefficient
15 as the mini-batch size
Adam as the method for updating the weights
fare_amount as the target column
passenger_count, trip_distance, RatecodeID, PULocationID, and DOLocationID as the training parameters

After building the model, let's test it against our taxi_trips_test dataset. We can compare the actual fare_amount in our test dataset with the prediction from the taxi_trips_model with:
WITH (SELECT state FROM taxi_trips_model) AS model
SELECT
    rowNumberInAllBlocks(),
    fare_amount,
    evalMLMethod(
        model,
        passenger_count,
        trip_distance,
        RatecodeID,
        PULocationID,
        DOLocationID
    ) AS prediction
FROM taxi_trips_test
LIMIT 10;
The above SQL:
retrieves the trained state from the taxi_trips_model model
numbers the output rows with the rowNumberInAllBlocks() function
selects the actual fare_amount from the test dataset
calculates the prediction with the evalMLMethod function, passing the same columns used for training, but coming from the test dataset

The results showcase predictions not too far from the actual data:
┌─rowNumberInAllBlocks()─┬─fare_amount─┬─────────prediction─┐
│                      0 │          52 │  57.29841872226803 │
│                      1 │        42.5 │   43.8984970367608 │
│                      2 │           7 │  7.425320721767511 │
│                      3 │          10 │ 11.119674442058924 │
│                      4 │        11.5 │ 13.151729718574753 │
│                      5 │           7 │  7.861355191000587 │
│                      6 │         9.5 │ 10.534773043722428 │
│                      7 │           9 │ 10.270891195885282 │
│                      8 │         5.5 │  6.191941850198845 │
│                      9 │          16 │ 17.262316387594158 │
└────────────────────────┴─────────────┴────────────────────┘
How does the model perform? In the AI/ML field there is a set of KPIs used to assess the quality of a model; in this example we'll use the mean absolute error (MAE). The MAE is the average absolute difference between the prediction and the actual value in the testing dataset. We can calculate the overall MAE with:
WITH
    (SELECT state FROM taxi_trips_model) AS model,
    pred AS (
        SELECT
            rowNumberInAllBlocks() AS rownum,
            fare_amount,
            evalMLMethod(
                model,
                passenger_count,
                trip_distance,
                RatecodeID,
                PULocationID,
                DOLocationID
            ) AS prediction
        FROM taxi_trips_test
    )
SELECT
    avg(abs(fare_amount - prediction)) AS MAE,
    avg(fare_amount) AS avg_fare_amount,
    avg(abs(fare_amount - prediction)) * 100.0 / avg(fare_amount) AS pct
FROM pred
WHERE fare_amount > 0 AND fare_amount <= 500;
The above query calculates:
the MAE, as the average of the absolute difference between the actual fare_amount and the prediction
the average fare_amount
the percentage of the MAE over the average fare_amount

The results say that our model is off by 2.03 dollars per trip, an error of around 15.97%. Not bad for a single SQL query.
┌───────────────MAE─┬────avg_fare_amount─┬────────────────pct─┐
│ 2.035759828017413 │ 12.746705737557932 │ 15.970870199184755 │
└───────────────────┴────────────────────┴────────────────────┘
Can we improve the performance? Yes, with some feature engineering! Feature engineering is the practice of creating new input parameters to make the model more accurate. You can derive these parameters from existing columns, enrich them with external sources, or aggregate them from various fields. All of these operations are native to ClickHouse, which allows us to explore the data and reshape it as needed.
The first step is to check the data quality; let's investigate the presence of outliers in the fare_amount column by bucketing the data into 10$ buckets:
SELECT
    cast(fare_amount / 10 AS INT) * 10,
    count(*)
FROM taxi_trips_train
GROUP BY cast(fare_amount / 10 AS INT) * 10
ORDER BY 1;
Notice that there are a lot of trips with a negative fare_amount, as well as amounts greater than 500$. These are probably mistakes in the data collection, and outliers we might want to take into account when creating the model.
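To quantify how many rows are affected, you could count the outliers directly, for example with a query along these lines:

SELECT
    countIf(fare_amount < 0) AS negative_fares,
    countIf(fare_amount > 500) AS very_high_fares,
    count(*) AS total_trips
FROM taxi_trips_train;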
Let's create a new model that:
uses the absolute value of the fare_amount, avoiding the negative numbers
excludes trips with a fare_amount greater than 500$
adds the pickup day of week and hour
includes the trip duration in buckets of 10 minutes

First, drop the taxi_trips_model_enhanced table, in case it already exists from a previous run:
DROP TABLE IF EXISTS taxi_trips_model_enhanced;
Then re-create it with our new model:
CREATE TABLE taxi_trips_model_enhanced ENGINE = Memory AS
SELECT
    stochasticLinearRegressionState(0.001, 0.1, 15, 'Adam')(
        abs(fare_amount),
        toDayOfWeek(tpep_pickup_datetime),
        toHour(tpep_pickup_datetime),
        age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10,
        passenger_count,
        trip_distance,
        RatecodeID,
        PULocationID,
        DOLocationID
    ) AS state
FROM taxi_trips_train
WHERE fare_amount <= 500;
The pickup hour and day of the week are added to the new taxi_trips_model_enhanced model via the toDayOfWeek(tpep_pickup_datetime) and toHour(tpep_pickup_datetime) expressions. The expression age('minute', tpep_pickup_datetime, tpep_dropoff_datetime)/10 provides the trip duration, from pickup to dropoff, in buckets of 10 minutes.
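To get a feel for these derived features before training, you could evaluate the same expressions on a handful of rows, for example:

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    toDayOfWeek(tpep_pickup_datetime) AS pickup_day_of_week,
    toHour(tpep_pickup_datetime) AS pickup_hour,
    age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10 AS duration_bucket
FROM taxi_trips_train
LIMIT 5;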
We can test the new model with:
WITH
    (SELECT state FROM taxi_trips_model_enhanced) AS model,
    pred AS (
        SELECT
            rowNumberInAllBlocks() AS rownum,
            fare_amount,
            evalMLMethod(
                model,
                toDayOfWeek(tpep_pickup_datetime),
                toHour(tpep_pickup_datetime),
                age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10,
                passenger_count,
                trip_distance,
                RatecodeID,
                PULocationID,
                DOLocationID
            ) AS prediction
        FROM taxi_trips_test
    )
SELECT
    avg(abs(abs(fare_amount) - prediction)) AS MAE,
    avg(abs(fare_amount)) AS avg_fare_amount,
    avg(abs(abs(fare_amount) - prediction)) * 100.0 / avg(fare_amount) AS pct
FROM pred
WHERE abs(fare_amount) <= 500;
The result shows an improvement of more than one percentage point compared to the previous model.
┌─────────────────MAE─┬────avg_fare_amount─┬────────────────pct─┐
│ 1.8768253472744798 │ 12.658898150895748 │ 14.854519179740457 │
└─────────────────────┴────────────────────┴────────────────────┘
You can further refine the model parameters and features to improve its precision. This tutorial demonstrates how, with just SQL statements, you can train and test a stochastic linear regression model entirely within ClickHouse.
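The same evalMLMethod call can also be reused for live scoring of new rides as they arrive. As a minimal sketch, assuming a hypothetical taxi_trips_new table with the same schema as taxi_trips_train:

-- Score newly arrived rides with the enhanced model.
-- taxi_trips_new is a hypothetical table with the same columns as taxi_trips_train.
WITH (SELECT state FROM taxi_trips_model_enhanced) AS model
SELECT
    tpep_pickup_datetime,
    evalMLMethod(
        model,
        toDayOfWeek(tpep_pickup_datetime),
        toHour(tpep_pickup_datetime),
        age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10,
        passenger_count,
        trip_distance,
        RatecodeID,
        PULocationID,
        DOLocationID
    ) AS predicted_fare
FROM taxi_trips_new;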
Adding artificial intelligence to your application doesn't always require external tools. Moving data out of the database removes the security barriers you set up around it, exposing you to the risk of accidental data access or of working with stale data. Modern databases like ClickHouse expose interesting functionality that allows you to implement model training and prediction directly where your data resides, enabling you to keep your data secure and your workloads performant.