Many companies are still using a general-purpose database for every project. Usually it's the easiest way to get up and running quickly. If you already have a lot of expertise in that database, why would you use something else?
The answer is performance.
Generalist RDBMs are built to be pretty good at everything -- that's what a generalist is. But today, we have more and more database technologies created for specific workloads and use cases. They are designed from the ground up to be extremely good at one thing. And when it comes to query performance, a general RDBM doesn't stand a chance.
I'm going to demonstrate this with our Aiven for ClickHouse database. And not just with some standard benchmarking tools - I'll test drive it with some real world data, something that can answer real questions that businesses might be interested in.
The dataset
I went hunting for interesting (and free) datasets that I could play around with.
Quick disclaimer: This is not going to be a deep dive into optimization of either Postgres or ClickHouse. Yes, you could use partitioning, secondary indexes, or all sorts of table or database settings to improve performance. This is intended to provide a baseline, because BOTH services can be optimized.
What I found is a treasure trove of data scraped from YouTube by the Archive Team (archive.org), featuring metadata for 4.5 billion videos as they were in December 2021. The reason this dataset exists is because YouTube announced they were going to no longer display the number of dislikes on videos. So this is the last point where we can see this information that they stopped sharing.
For my own sanity, and to save time and space, I removed the video description and the metadata fields while loading the data. Both can have a lot of special characters, or new lines, and would frankly not be efficient to store in a data warehouse. Something like OpenSearch would be a better home, maybe for future exploration.
The questions
The next step was to come up with some questions I wanted to explore. This is what I chose:
How do people react to live content, as opposed to uploaded videos? Since live content is by necessity less polished, do people like (or dislike) it at a higher ratio compared to pre-recorded videos?
Which channels have a high view count to subscriber ratio? Are these generally driven by a few videos that have been extremely popular, but have not converted viewers into subscribers?
What videos have gone most viral relative to the usual view count of their uploader?
How has the use of subtitles changed over time? Has automation led to them being available on more videos?
How does turning comments off affect the like / dislike ratio?
What is the most watched / liked Unlisted video?
With these questions, I started to dig in.
Exploring the Data
I chose two plans with the same price:
Aiven for ClickHouse - Business-16
Aiven for Postgres - Business-32
I gave both some extra disk space. The first thing I noticed once I got the data loaded is that ClickHouse is much better on disk compression. The ClickHouse table was about a third the size of the Postgres table.
Loading code...
Here are the table definitions I used.
*We’ve updated our plans based on our experience running ClickHouse for the last few years, so the plan used in this example is now different than described.
ClickHouse
Loading code...
Postgres
Loading code...
Both of these tables are pretty basic, and we could definitly improve the performance of queries by adding secondary indexes, or by changing some settings. We could also make some changes to compress the data to use even less space. However, we're still in the exploratory phase. We can focus on optimization once we know what kind of queries we'll be running.
Live vs uploaded content
By definition, live content will be a little more raw and unpolished than uploaded videos, which are often extensively edited and composed by selecting from numerous takes. How do viewers feel about that? Lots of experts in influencer and content creator culture put heavy emphasis on the value of authenticity. Does live content feel more authentic? What about the interactive nature of live performances, where viewers can feel involved in the video (and the community) by chatting with other viewers and the creator - does that help drive likes as well?
ClickHouse query
Loading code...
Postgres query
Loading code...
Runtime
Loading code...
Results
Loading code...
Our databases agree that live content in general gets more likes and dislikes per viewer, almost double the rate in both cases. Oddly the standard deviation for likes is higher for live content, but lower for dislikes.
Many Views, Few Subscribers
I was curious to find out what kind of channels have a very high number of views, as opposed to how many people are actually subscribed to the channel. What sort of videos do those channels create? Is it a single video that has gone viral? Or is it that to make it to the top of the list, you need a few videos with many views?
ClickHouse query
Loading code...
Postgres query
Loading code...
Runtime
Loading code...
Results
Loading code...
This is a little disappointing: ranks 1, 2, and 5 are what look like ads for low effort mobile games, and 3 and 4 only have unlisted videos.
Let's see if we can filter out the ads and get more interesting results by bumping up to a minimum of 1000 subscribers, and filter to only count listed videos:
Loading code...
Hmm, again ranks 1, 2, and 5 are similar, mostly reposts from Likee, and all accounts have a few very popular videos. The rank 3 account has been terminated, so nothing to report there. Rank 4 is once again mobile app ads, this time showing for an app that can age or de-age your selfies.
Since digital advertising tends to have low conversion rates, perhaps it's not surprising that this section is dominated by ads instead of viral videos like I expected.
Subtitles over the years
With advances in speech recognition, it’s easier than ever to create subtitles for videos. Are people taking advantage of this? They added auto-captioning in late 2009, was there a jump around that time?
ClickHouse query
Loading code...
Postgres query
Loading code...
Runtime
Loading code...
Results
Loading code...
People certainly have been making increasing use of subtitles for their videos. The largest jump was in 2020, which I had to look into for curiosity's sake. Apparently at that, time YouTube was removing their community captions feature, which allowed you to upload captions for other people's video. This prompted a very successful campaign to have creators add captions to their videos for hard of hearing and deaf viewers.
Comments and controversial videos
Comments are a mixed blessing. They’re an important sign of engagement for the algorithms that rule the internet, and can be a valuable source of feedback for creators. But they can also be vector for harassment, much more impactful than a simple “Dislike”. Do creators turn comments off sometimes for their more controversial videos?
ClickHouse query
Loading code...
Postgres Query
Loading code...
Runtime
Loading code...
Results
Loading code...
Videos with comments disabled on average have a much worse Like ratio, which is even more pronounced on videos with over ten thousand views. This would suggest that yes, people do disable comments on unpopular videos, especially ones with many views.
The other interesting thing about this data is the way the ratio of likes to dislike spikes in the 100k - million views range, both for videos with comments and without. This suggests to me that in that range, the YouTube algorithm is doing a good job predicting who will like a video. But as views increase, the odds that someone who dislikes the content of a video will be exposed to it increase.
So I asked a follow-up question: when commenting is disabled, are people more likely to like or dislike to express their feelings about a video?
ClickHouse query
Loading code...
Postgres Query
Loading code...
Runtime
Loading code...
Results
Loading code...
No, indeed not. Videos with comments enabled have on average more interactions (likes and dislikes) than videos with similar view counts and comments disabled.
Unlisted videos
YouTube officially claims to have over 800 million videos, but our dataset shows over 4 billion records. Are those all unlisted? Our analysis has focused mostly on listed videos. So what's the most viewed video that is unlisted?
ClickHouse query
Loading code...
Postgres Query
Loading code...
Runtime
Loading code...
Results
Loading code...
Boring! It's an Instagram ad with 2 billion views. After analyzing videos with a high view count to sub ratio, maybe this is what I should've expected. There's got to be a way to separate viral videos from ads! We'll just have to keep digging.
Top 1% of views
ClickHouse also has some intertesting statistical functions built in. As I'm interested in views again, let's get a sense of how views are distributed among videos. Before you look, try to guess how many views you need to be in the top 1% of videos.
ClickHouse query
Loading code...
Yes, postgres can do similar functions, but for this specific example needs multiple steps, and would require an order of magnitude more resources. For the ClickHouse quantile function, it uses sampling to calculate approximations
Runtime
42 seconds for ClickHouse
Results
Loading code...
It's pretty cool to see the numbers laid out like that. I did not expect the top 1% to be so high, given how many videos there are on the platform.
Conclusion
On average, the Aiven for ClickHouse database performed 5-10 times better than Aiven for PostgreSQL. And in one specific case, question #2, the performance was 20x better! Postgres is a great general purpose database, but it struggles to keep up with ClickHouse on these large datasets, especially column aggregations. It makes perfect sense, ClickHouse has been designed for exactly that purpose. If we were doing a performance test with a lot of updates and deletes, it would be ClickHouse fighting for its life.
Every database has an intended use. From the very beginning of its development, the developers make certain trade-offs, and you want to go with the grain, not against it.
If you only take one thing away from our little experiment, it should be this: choosing the right tool for your use case saves you time, both in development and execution, so you can focus on getting answers to the big questions.
To get the latest news about Aiven and our services, plus a bit of extra around all things open source, subscribe to our monthly newsletter! Daily news about Aiven is available on our LinkedIn and Twitter feeds.
If you just want to find out about our service updates, follow our changelog.
SELECTCEILING(log10(view_count +1))as Views_OOM,AVG(CASEWHEN is_comments_enabled THENNULLELSE like_count + dislike_count
END)as comments_disabled_interactions,AVG(CASEWHEN is_comments_enabled
THEN like_count + dislike_count
ELSENULLEND)as comments_enabled_interactions
FROM videos
GROUPBY Views_OOM
ORDERBY Views_OOM;-- A few small changes because PG doesn't like Log(0), and stores boolean as T/F instead of 1/0