On March 6 we hosted the Amsterdam Open Source Data Infrastructure meetup, with a talk by Gaby Schilders (EDB). Please find his slides here and the transcript of the discussion that followed his talk below.
Audience member: Many years back we had to build an application that was quite data intensive. I was working as a consultant and the client said let's use some persistence layer, and I said let's try just plain SQL over JDBC. And performance was INAUDIBLE (50%? 15%?). Now of course it was less flexible, but the application was not looking for flexibility but for performance.
Gaby: And that is exactly the point: the flexibility that you gain from some solutions has an impact on performance. Anyone else run into database problems recently?
Piet: I saw the list of databases, and I have experience with Oracle, Postgres, Yugabyte and Cockroach. My Postgres scripts are PL/pgSQL, and then only Postgres itself, EDB and Yugabyte eat the stored procedures. And I don't know how many of you use stored procedures - I can hold a long discourse on why they are extremely efficient, especially in a monolith. My big challenge is that all of a sudden my dear scripts and benchmarks break, because Cockroach et al. don't eat, don't interpret stored procedures. From your slide with cloud-native database providers, how many do stored procedures?
Gaby: Neon, BigAnimal, Aurora and Aiven. That's all.
Piet: So my investment in PL/pgSQL will remain with those, and I'll keep telling my developers that it is very efficient to build a Postgres function and do stuff at the database level.
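For readers who haven't used them: a stored procedure or function keeps the processing next to the data, so a multi-row operation costs one round trip instead of one per row. A minimal PL/pgSQL sketch, with a hypothetical schema:

```sql
-- Hypothetical schema: settle all open orders for a customer
-- server-side, in a single network round trip.
CREATE OR REPLACE FUNCTION settle_orders(p_customer_id bigint)
RETURNS integer
LANGUAGE plpgsql
AS $$
DECLARE
  n integer;
BEGIN
  UPDATE orders
     SET status = 'settled'
   WHERE customer_id = p_customer_id
     AND status = 'open';
  GET DIAGNOSTICS n = ROW_COUNT;  -- rows touched
  RETURN n;
END;
$$;

-- Any client calls it as one statement:
SELECT settle_orders(42);
```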
Gaby: Stored procedures are processing near your data, which has its advantages and disadvantages. And your scripts no longer working are a lesson to the younger generation represented here: you too will build up technical debt. Nothing is as permanent as the temporary; I've seen so many examples of people building solutions that worked on the platform of the day, thinking they'd come around to fixing it later.
Piet: The code can be available on GitHub but you can’t easily read the code, you need a layer on top of the code…
Gaby: AI can now explain code to you…
Audience member: But the context in which the code was written, that is something AI cannot explain to you.
Gaby: That, I agree with. But you'll have some semblance of how it works, instead of having to manually go over lines and lines of badly spaced code.
Audience member: AI can help you read code, but has anyone checked that what AI says about the code is correct?
Gaby: I understand your scepticism, but if you asked the coder next to you to do it for you, would you get the right answer? Trust but verify, in either case.
I talk a bunch about theory, but what have you run into working with databases?
Piet: Laziness and chattiness. (laughter)
Audience member: For me, a lot of the databases are targeted at largely static data. You're not doing very much with it. When you have write-heavy and read-heavy workloads at roughly the same time, a lot of the optimizations you hope will kick in, like vacuuming and index rebalancing, diminish the performance of reading data back. We've used things like ClickHouse, which is not a standard database in any sense, but there write and read-only work pretty effectively.
Gaby: When you have many writes, it's always an issue, especially if you can't handle them with one node. With one node, many of your problems can be optimized away.
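One concrete handle on the vacuuming point: Postgres exposes autovacuum settings per table, so a write-heavy table can be vacuumed in smaller, more frequent passes that compete less with concurrent reads. A sketch with a hypothetical table name; the right thresholds depend entirely on the workload:

```sql
-- Hypothetical write-heavy table: trigger autovacuum at roughly
-- 2% dead tuples instead of the 20% default, so each pass stays small.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.02
);
```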
Audience member: My problem with databases is that the benchmarks always look at relatively static data. But the interaction of writing data and then reading it back, relatively quickly, and on a bunch of nodes - that might have a lot to do with application architecture, but the database does struggle.
Gaby: The answer to that is making sure you only persist the data that you need to persist. And you can set the consistency level in, for instance, Postgres on a per-query - not per-session - level. So if you have data that needs to be immediately consistent, you change it for that query. More importantly, in your application you have to think about what data you're going to persist in the first place. In my experience a lot of shit goes into databases that doesn't need to be there, and that massively impacts the amount of writes. Hoarders store everything. In terms of disk space that might be ok, but not if we're at the same time expecting performance from our databases. We're limited by physics.
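Gaby is most likely referring to knobs such as synchronous_commit, which Postgres lets you change for a single transaction rather than for the whole session: commits stop waiting for the WAL flush (and, with synchronous replicas, for their acknowledgement). A minimal sketch, with a hypothetical table:

```sql
-- Relax commit durability for this transaction only;
-- the session default stays untouched.
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO clickstream_events (user_id, payload)  -- hypothetical table
VALUES (42, '{"page": "/home"}');
COMMIT;
```

The inverse works too: leave the default relaxed and escalate to synchronous_commit = on only for the writes that genuinely must be durable before the commit returns.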
Audience member: Minor provocation: don't you feel we're continuously reinventing the wheel? We started in the fifties with some kind of database, then came the relational database management systems, and now we do NoSQL databases. What's the difference?
Gaby: We have seen that pattern with client-server, where compute went from server to client, and back again. We’re not forgetting the past, what is changing is the physical reality. Physical boundaries and opportunities guide our decisions.
Audience member: It’s true, we’re seeing a lot of use cases we didn’t see 10-20 years ago, (INAUDIBLE) new data paradigms.
Gaby (who fortunately did hear the question): We did have unstructured data back then, those were called files (laughter)
Audience member: Yes, but to do analytics on…
Gaby: Yes, exactly, the use cases have changed! Yet the counterpoint is that this has happened because the technology allows for it. Although take the example of vectors: that is also something you can do in Postgres, and it is basically nothing different from an array. The data type is merely an array of numbers.
Audience member: But it’s done with extensions, to adjust for those workloads.
Gaby: Yes, exactly, but that extensibility is why I think Postgres is the future; it's the Linux of databases.
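For the curious: the vector support alluded to above comes from the pgvector extension, and the data type does behave much like a fixed-length array of numbers. A minimal sketch, assuming pgvector is installed on the server; the table and values are made up:

```sql
-- Requires the pgvector extension on the server.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)  -- effectively an array of three floats
);

INSERT INTO items (embedding)
VALUES ('[1, 2, 3]'), ('[4, 5, 6]');

-- Nearest neighbour by Euclidean distance:
SELECT id
FROM items
ORDER BY embedding <-> '[2, 3, 4]'
LIMIT 1;
```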
Audience member: Most IT departments would like to standardize their database use on just the one, but today's use cases don't fit a one-size-fits-all solution. You have to use multiple data formats to cover all your use cases.
Gaby: Partly agree. Postgres can do most workloads in most situations, most of the time. But I've heard stories from y'all tonight that indicate you might have different needs at times. Judging by the myriad of databases that exist today, you would think we need all those different solutions, but I wish companies would look at what they're already using and whether it might not actually (be made to) fit.