An introduction to PostgreSQL

PostgreSQL is a flexible, open-source object-relational database management system with features meant for everything from single machines to data warehouses. We look at its ins and outs, and how to get it as a managed service from Aiven.

20 April 2020
John Hammink, Developer Advocate at Aiven

Truth: if your data needs to be stored and later accessed, you’ll need some sort of database. If that data is transactional and requires ACID guarantees, then a Relational Database Management System (RDBMS) is probably your best bet.

In fact, it's quite difficult for most of us to imagine what computing and data persistence were like before RDBMSs and their open-source implementations; the transactions they handle are ubiquitous and completely taken for granted today.

Depending on the type, structure, data model and intended use case of your data, different data stores have appeared on the scene, each best suited to different needs (and different times).

However, the RDBMS has remained fairly constant for quite some time, with PostgreSQL as an open-source leader.

So how did we get here? In this post, we’ll examine PostgreSQL: its history, a few of the many things it does, how it does them, why it does them that way, and what’s unique about PostgreSQL that has made it a go-to data store today. We’ll also consider the advantages of a managed and hosted PostgreSQL solution, and finally look at how to get started.

  1. The RDBMS - from past to present
  2. What is PostgreSQL and what does it do?
  3. How PostgreSQL does what it does
  4. The power of the PostgreSQL approach
  5. The PostgreSQL ecosystem
  6. Aiven for PostgreSQL

The RDBMS - from past to present

Before PostgreSQL, and relational databases as we know them today, what data storage looked like depends very much on how far back you're willing to go. Before about 1960, a patchwork of tools and techniques was developed to enable data access, ensure quality and manage control of data.

That year, the US Department of Defense (DoD) established the Conference on Data Systems Languages (CODASYL) to nail down guidelines for data access and programming languages, establishing COBOL in the process. At the same time, magnetic tape drives, which required serial access, started to be replaced by magnetic disk drives, which could access data randomly.

Along the way, CODASYL embraced the Integrated Data Store (IDS), which took advantage of the new random-access storage, and eventually came up with the network database model to describe objects and their relationships. Early versions of Data Definition Language (DDL) and Data Manipulation Language (DML) were derived and refined to define database schemas and query/update data respectively.

These early systems from CODASYL (and later IBM) started out by processing only a single record at a time. But what if one could use the language to specify single operations performed across the entire data set? Enter E.F. Codd.

Instead of placing the burden on users to find information, the DBMS should, in his view, recognize simple commands to find what was needed. In a landmark paper, Codd specified a new way to organize and access data en masse, incorporating the following principles:

  1. data access should be independent of hardware and storage implementation; and
  2. there should be an automatic way to navigate data, e.g. a high-level, non-procedural language for doing so.

At the same time IBM was working on their RDBMS, System R, a few Berkeley researchers started work on Ingres (Interactive Graphics Retrieval System). This system was eventually updated to include user feedback, rewrites for code maintainability and new features.

This development, too, pointed the way ahead but left some gaps: without system-defined abstract data types and user-defined operators, for example, users could not easily process specific data objects in their own way; the language also offered little means of expressing relationships or inheriting attributes and procedures.

Around 1982, one of those Berkeley researchers, Michael Stonebraker, left Berkeley to commercialize Ingres; on returning, he began work on its successor, POSTGRES (post-Ingres), adding the features needed to make the system commercially viable. To model data semantically and effectively, it would include:

  • data types (to define objects to be acted upon)
  • user-defined operators (to define language syntax and semantics)
  • user-defined procedures (rules)
  • attribute and procedure inheritance

This enabled the following:

  • complex objects;
  • a shared object hierarchy for an object-oriented programming language;
  • new semantic constructs;
  • changes to the data model;
  • changes to the syntax of the query language;
  • support for primary keys;
  • support for inheritance of data;
  • support for inheritance of procedures;
  • attributes that reference tuples in other relations; and
  • aggregations from diverse data sources, among other features.

Starting to recognize the SQL in all of this? Suddenly, you could do a whole lot more.


What is PostgreSQL and what does it do?

PostgreSQL is a flexible, open-source object-relational database management system built to handle a range of workloads, from single machines to data warehouses to web services with many concurrent users. PostgreSQL uses and extends SQL (thus the name), and is broadly extensible to a range of use cases well beyond transactional data.

As the default database for macOS Server, and available on almost every major operating system (including the BSDs, Windows and Linux), PostgreSQL enjoys support from a global development group of companies and individual contributors. But what does it do?

PostgreSQL is a relational database. As such, it stores relations: sets of tuples representing entities (such as documents and people) and relationships (such as authorship). Relations hold fixed-type attributes representing entity properties (such as a title) along with a primary key. Attribute types can be either atomic (such as integer, floating point, or boolean) or structured (such as an array or a procedure).

[Image: the parts of a PostgreSQL table]
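As a minimal sketch of those terms in DDL form (using a hypothetical documents relation, not one used elsewhere in this post):

CREATE TABLE documents (
  id        INT PRIMARY KEY,   /* primary key */
  title     VARCHAR(100),      /* atomic attribute */
  in_print  BOOLEAN,           /* atomic attribute */
  authors   TEXT[]             /* structured attribute: an array */
);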

PostgreSQL supports transactions with ACID properties, meaning every transaction must satisfy four guarantees:

Atomicity: transactions are treated as complete units; a transaction either completely succeeds or completely fails, and in the case of failure the database state is left unchanged.

Consistency: between transactions, a database can only exist in a valid state; all data written to the database must adhere to existing constraints, triggers, cascades and any combination of these.

Isolation: a function of concurrency control, ensuring that data is not corrupted by concurrent transactions, which behave as if they had run sequentially.

Durability: a transaction remains committed even if the system fails; typically, completed transactions are recorded in a write-ahead log.
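Atomicity is easy to see from psql. In this minimal sketch (with a hypothetical accounts table), both updates take effect together at COMMIT, while a ROLLBACK would leave the database exactly as it was:

CREATE TABLE accounts (id INT PRIMARY KEY, balance NUMERIC NOT NULL);

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  /* or ROLLBACK; to discard both updates */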

There's a great breakdown here of most (but not all) of the types of features that PostgreSQL supports, so we'll look at just a few of the main ones.

In addition to SQL support, PostgreSQL features automatically updatable views: simple views accept INSERT, UPDATE and DELETE statements, which PostgreSQL applies to the underlying table. A materialized view, by contrast, is a data object containing a given query's results: it's possible to query a materialized view just as you would a database table directly, and to refresh its contents on demand.
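A quick sketch of both, using a hypothetical people table:

CREATE TABLE people (id INT PRIMARY KEY, name TEXT, city TEXT);

CREATE VIEW berliners AS
  SELECT id, name FROM people WHERE city = 'Berlin';
UPDATE berliners SET name = 'Anna Schmidt' WHERE id = 1;  /* applied to the people table */

CREATE MATERIALIZED VIEW people_per_city AS
  SELECT city, count(*) AS residents FROM people GROUP BY city;
REFRESH MATERIALIZED VIEW people_per_city;  /* recompute the stored results */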

Triggers execute code in response to transactional or other changes to a table or view: when a new item is added to an inventory table, corresponding rows could be added automatically to each of the prices, reviews, and orders tables.
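A minimal sketch of that inventory example, with hypothetical tables and columns (EXECUTE FUNCTION requires PostgreSQL 11+; older versions use EXECUTE PROCEDURE):

CREATE TABLE inventory (id INT PRIMARY KEY, name TEXT);
CREATE TABLE prices (item_id INT REFERENCES inventory (id), price NUMERIC);

CREATE FUNCTION add_price_row() RETURNS trigger AS $$
BEGIN
  INSERT INTO prices (item_id, price) VALUES (NEW.id, NULL);  /* placeholder price row */
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER inventory_insert
  AFTER INSERT ON inventory
  FOR EACH ROW EXECUTE FUNCTION add_price_row();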

Foreign keys enforce referential integrity between parent and child tables: a value in a child table column must exist in the referenced parent table. For example, the employee_name column of a managers table might be required to match an existing row in an employees table.
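A sketch of that constraint with hypothetical tables; the final INSERT fails unless the referenced employee already exists:

CREATE TABLE employees (
  employee_name VARCHAR(50) PRIMARY KEY
);

CREATE TABLE managers (
  employee_name VARCHAR(50) REFERENCES employees (employee_name),
  department    VARCHAR(50)
);

INSERT INTO managers VALUES ('Bubba Smith', 'Acquisitions');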

Stored procedures, maintained in the data dictionary of a database, may combine several SQL statements to give applications access to commonly-used data-validation, access-control, or other routines.
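A minimal sketch, using PostgreSQL 11+ CREATE PROCEDURE syntax and a hypothetical books table with an available column:

CREATE PROCEDURE check_in_book(book_id INT)
LANGUAGE plpgsql
AS $$
BEGIN
  UPDATE books SET available = true WHERE id = book_id;
  IF NOT FOUND THEN
    RAISE EXCEPTION 'unknown book %', book_id;
  END IF;
END;
$$;

CALL check_in_book(42);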

PostgreSQL offers a means for publish/subscribe logical replication, essentially allowing a PostgreSQL instance to serve as an event cache between applications and event consumers. This also enables change data capture, meaning that database changes can be captured and replayed elsewhere, a capability essential for disaster recovery.
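A sketch of the publish/subscribe mechanics, with hypothetical names and connection string (logical replication also requires wal_level = logical on the publisher, and matching table definitions on the subscriber):

/* on the publishing server */
CREATE PUBLICATION library_pub FOR TABLE borrowers, books;

/* on the subscribing server */
CREATE SUBSCRIPTION library_sub
  CONNECTION 'host=publisher.example.com dbname=library user=replicator'
  PUBLICATION library_pub;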

And that’s just the beginning. The PostgreSQL home page lists features including supported data types, more in-depth data integrity features, features supporting concurrency, disaster recovery features, security features, extensibility and text-search, including international character support.


How PostgreSQL does what it does

Core PostgreSQL mandates that data be fully structured, adhering to a specific schema; records added to a database that do not comply are rejected. This places PostgreSQL historically at the far left of a data-structure continuum ranging from fully structured to relatively unstructured data:

[Image: PostgreSQL on the continuum from structured to unstructured data]

NOTE: PostgreSQL has since evolved to also support JSON/JSONB, XML and other common formats for data ingress/egress.

Because PostgreSQL is an open-source database management system, users can continually evolve their databases, switching languages along the way, without needing to recompile. And while SQL conformance is an ongoing effort, new features are constantly being added. Let's look at how objects are created and manipulated in PostgreSQL.

As SQL is quite expansive, we’ll only touch on some basics here. psql is an interactive terminal for working with PostgreSQL.

DDL - data definition language

Data definition language (DDL) is used to define data structures, and includes the following commands:

  • Data type declaration
  • CREATE - creates a new database, table, index or procedure.
  • ALTER - modifies an existing database object.
  • DROP - deletes an existing database, table, index or procedure.
  • TRUNCATE TABLE - empties a table for reuse.

For example:

CREATE DATABASE LIBRARY;

CREATE TABLE Borrowers (
  id INT PRIMARY KEY,
  first_name VARCHAR(20) not null,
  last_name VARCHAR(50) not null,
  birth_date DATE not null);

ALTER TABLE Borrowers ADD COLUMN years_member INT NOT NULL CHECK (years_member > 0);

CREATE TABLE books (
  id INT PRIMARY KEY,
  title VARCHAR(50),
  author VARCHAR(50),
  publication_date DATE not null);

/* after populating the 'books' table */

TRUNCATE TABLE books;

/* removes all data from the 'books' table */

DML - data manipulation language

On the other hand, data manipulation language is used to work with tables and their contents as in:

  • INSERT - inserts new rows in a table.
  • UPDATE - updates existing rows in the table.
  • DELETE - deletes rows from a table.
  • SELECT - the main method for querying data from the database.

[Image: a SELECT statement on a PostgreSQL table]

For example:

INSERT INTO Borrowers(id,first_name,last_name,birth_date,years_member) VALUES(1,'Riley','Jimbo','1964-05-02',12);
INSERT INTO Borrowers(id,first_name,last_name,birth_date,years_member) VALUES(2,'Smith','Bubba','1973-09-21',5);
INSERT INTO Borrowers(id,first_name,last_name,birth_date,years_member) VALUES(3,'Eberhardt','Jocelyn','1930-12-22',71);

UPDATE Borrowers SET years_member = 1 WHERE id = 1;
DELETE FROM Borrowers WHERE years_member > 70;

SELECT * FROM Borrowers WHERE years_member > 5;

Compact syntax for bulk insert

Naturally, there is an easier way to insert multiple rows at once, using a single command:

INSERT INTO Borrowers (id,first_name,last_name,birth_date,years_member) VALUES
 (4, 'Evelyn', 'Snarfington', '1959-08-03', 8),
 (5, 'Robert', 'Jones', '2000-01-23', 6),
 (6, 'Joseph', 'Chu', '1987-11-07', 10);

NOTE: There are actually now many other ways to achieve the same thing, including automating large data imports programmatically, and omitting columns for which no data exists.
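One such route is bulk loading with COPY (or psql's client-side \copy); a sketch with a hypothetical file path:

COPY Borrowers (id, first_name, last_name, birth_date, years_member)
  FROM '/tmp/borrowers.csv' WITH (FORMAT csv, HEADER true);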

Querying data

Most querying on PostgreSQL (as in SQL itself) happens via the SELECT statement, as in the following examples:
SELECT * FROM Borrowers;

Queries can be refined in a variety of ways, for example limiting results to a condition:
SELECT * FROM Borrowers where years_member > 3;

Limiting results to a specific results count:
SELECT * FROM Borrowers LIMIT 5;

Querying only certain fields:
SELECT id, last_name, years_member FROM Borrowers;

Sorting query results by certain fields:
SELECT * FROM Borrowers ORDER BY years_member DESC;

and so on.

NOTE: There are also Data Control Language (DCL) and Transaction Control Language (TCL) commands to consider, but they are out of the scope of this introduction.

The power of the PostgreSQL approach

The real beauty of building an open-source database modularly from the ground up lies in the features and performance that have been built in as PostgreSQL has evolved. Here are just a few of these features, along with how they look in practice:

Text search

PostgreSQL can support the sort of SQL queries that look for column values based on some textual contents combined with wildcards, for example:

SELECT * FROM books WHERE title LIKE 'Zen%';

could return records containing book titles like Zen and the Art of Motorcycle Maintenance and Zen Flesh, Zen Bones. Those wildcards can also be positioned at the start of the search string: the same query with the search string '%Zen%' could return the results above, in addition to titles where 'Zen' isn't the initial word in the string, like The Three Pillars of Zen and The Essential Teachings of Zen Master Hakuin.

Beyond pattern matching, PostgreSQL supports full-text search through the tsvector data type: the to_tsvector function reduces a document to a list of normalized tokens (a tsvector), and to_tsquery is used to query the tsvector for occurrences of certain words or phrases.
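A sketch against the books table from earlier: find titles containing both 'zen' and 'motorcycle', regardless of word order:

SELECT title FROM books
  WHERE to_tsvector('english', title) @@ to_tsquery('english', 'zen & motorcycle');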

NOTE: These advanced text search features echo similar features in Elasticsearch, blurring the lines between the two systems somewhat.

JSON support

PostgreSQL enjoys excellent JSON support. After importing your data from a .json file, you can use the json and jsonb data types and their associated functions and operators.
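A minimal sketch using jsonb, with a hypothetical table and payload:

CREATE TABLE events (id SERIAL PRIMARY KEY, payload jsonb NOT NULL);

INSERT INTO events (payload)
  VALUES ('{"borrower": "Riley", "action": "checkout", "book_id": 42}');

SELECT payload->>'borrower' AS borrower
  FROM events
  WHERE payload @> '{"action": "checkout"}';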

Concurrency

PostgreSQL is one of the best-known examples of an open-source database that supports concurrent ACID transactions. To achieve this, PostgreSQL uses locks to serialize conflicting changes: when concurrent changes are made to the same object, the first transaction proceeds (having taken the lock); the second pauses until the first commits or aborts.

Here is an example. Imagine opening two terminal windows. In a psql session, you’d try:

CREATE TABLE records (
  header text NOT NULL primary key,
  info json NOT NULL
);

BEGIN;
ALTER TABLE records ADD COLUMN recent_update timestamp;

In another psql session, you’d try to run:

SELECT * FROM records;

NOTE: nothing happens until the first transaction completes with COMMIT (or ROLLBACK).

Reliability and disaster recovery

PostgreSQL writes changes to a write-ahead log (WAL) before applying them to the data files on disk. This enables PostgreSQL to recover from crashes by replaying the writes in the log.

The full_page_writes parameter in postgresql.conf can be toggled off if the underlying storage reliably prevents partial page writes.
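A sketch of inspecting and changing that setting from psql (only do so if your storage genuinely rules out partial page writes):

SHOW full_page_writes;
ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();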

Individual WAL records are protected by a CRC-32 checksum that validates record contents when they are replayed, although a few items bypass these checks, including data pages and certain internal data structures.

Security

PostgreSQL handles a range of security issues in a variety of ways. To name a few: by default, client connections over TCP/IP are limited by the listen_addresses setting (the server otherwise listens on the local Unix socket), and connections can be allowed or rejected per host, database and user via the pg_hba.conf file in the data directory (PGDATA).

These are precisely the sorts of things you’d never need to worry about with a hosted, managed solution.

Superuser privileges gate many sensitive operations in PostgreSQL. For example, the database files themselves are readable only by the account that runs the PostgreSQL server. Elevated privileges are also needed to copy table data to and from the filesystem, create publications that publish data automatically, or create subscriptions, which makes them important for logical replication.

In fact, each PostgreSQL user has a username and password; users may be assigned to groups (roles), and access to tables, or even individual rows, can be granted or restricted based on these group privileges.
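A sketch of group-based privileges and row-level security, with hypothetical role names and the Borrowers table from earlier:

CREATE ROLE librarians;
CREATE USER riley WITH PASSWORD 'change-me' IN ROLE librarians;
GRANT SELECT, UPDATE ON Borrowers TO librarians;

ALTER TABLE Borrowers ENABLE ROW LEVEL SECURITY;
CREATE POLICY members_only ON Borrowers
  FOR SELECT TO librarians USING (years_member > 0);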

PostGraphile, a GraphQL API layer built on PostgreSQL, is one example of a tool that leans on this row-level security.

PostgreSQL also offers a range of encryption options: encryption for specific columns, for data partitions, for traffic across the network, and so on. Server spoofing can be prevented by, for example, keeping the Unix socket file in a directory writable only by a trusted user, so that a rogue process cannot substitute its own socket.
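Column-level encryption is typically done with the pgcrypto extension; a sketch with a hypothetical table (key management is left to the application):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE card_numbers (borrower_id INT PRIMARY KEY, card bytea);

INSERT INTO card_numbers
  VALUES (1, pgp_sym_encrypt('4242-4242-4242-4242', 'a-long-passphrase'));

SELECT pgp_sym_decrypt(card, 'a-long-passphrase') FROM card_numbers;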

Extensibility

PostgreSQL is catalog-driven: system catalogues (data dictionaries) hold information about the database itself, including its tables, columns, data types, functions and access methods. These catalogues are tables in their own right, user-queryable and modifiable.
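For example, a minimal sketch that lists a table's columns and types straight from the pg_attribute catalogue:

SELECT attname, atttypid::regtype AS data_type
  FROM pg_attribute
  WHERE attrelid = 'borrowers'::regclass AND attnum > 0 AND NOT attisdropped;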

As users can modify databases directly, as well as add dynamically-loadable code changes or user-defined functions (UDFs), many choose PostgreSQL for rapidly prototyping applications and data structures.

Extensibility is critical for many organizations with needs specific to a given industry or line of business. When traditional database systems fall short and business users are not themselves database experts, it lets the database be adapted without interrupting availability or modifying the core database engine.

Users want seamless solutions, and with PostgreSQL you can create your own data types, functions, aggregates, operators, operator classes (for index access), foreign data wrappers, and so on, providing the tools and add-ons for domain-specific solutions.
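A small sketch of user-defined pieces, with hypothetical names: a custom enum type and a SQL function that uses it:

CREATE TYPE borrower_status AS ENUM ('active', 'lapsed', 'banned');

CREATE FUNCTION years_to_status(years INT) RETURNS borrower_status AS $$
  SELECT CASE WHEN years > 0 THEN 'active' ELSE 'lapsed' END::borrower_status;
$$ LANGUAGE SQL IMMUTABLE;

SELECT years_to_status(12);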

Sometimes extensions become complete products in their own right. One concrete example is PostGIS, which extends PostgreSQL with geographic data types and location queries. Similarly, TimescaleDB is PostgreSQL optimized for time-series data.
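Provided such an extension is installed on the server, enabling it per database is a one-liner; a sketch:

CREATE EXTENSION IF NOT EXISTS postgis;
SELECT PostGIS_Version();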

Tool support

Developers these days can enjoy a number of tools to interact with the database which go way beyond psql. pgAdmin supports PostgreSQL across the board and is a long-time favorite of DBAs and developers. DBeaver, while not native to PostgreSQL, does many of the same things. Navicat is a cloud-compatible tool that also works with MySQL and MongoDB.

DataGrip, from JetBrains, is a popular integrated development environment (IDE) that works cross-platform and with a range of databases. usql is a universal command-line interface that works with many databases natively. Finally, OmniDB is a complete collaborative workspace for PostgreSQL and other databases.

In addition to core PostgreSQL itself, Aiven developers have contributed to a number of PostgreSQL extensions, including pghoard (for backup and restore), pglookout (for replication monitoring and as a failover daemon), pgmemcache (a set of PostgreSQL user-defined functions that provide an interface to memcached), and a range of connectors.

What does all of this mean?

PostgreSQL is not just a database; it's the sum of the features, functions and ecosystem comprising and surrounding it. With such a modifiable approach at its core, PostgreSQL can handle yet-to-be-defined data types, manage larger data sets than ever, run parallel queries at scale, support complex configurations with ease, and even improve and enhance monitoring capabilities, spawning new uses and derivatives. Let's look briefly at the PostgreSQL ecosystem.


The PostgreSQL ecosystem — use cases and users

Today, in addition to enterprise users, a number of third-party vendors, consultancies and independent groups derive products from PostgreSQL, consult on it, support it, and offer auxiliary enhancements for it. There are also integrations with artificial intelligence (AI) and machine learning (ML), many cloud versions, hardware-optimized versions, and so forth.

The ecosystem is truly massive. By far, the biggest industry sector using PostgreSQL is the computer software industry, followed by information technology and services. Other industries include hospital and health care, higher education, financial services, telecommunication, education, marketing and advertising. Government agencies, including the FBI, are also users.

According to the db-engines ranking, PostgreSQL is the fourth most popular database management system (and most popular open source DBMS). In fact, Aiven has a number of case studies with customers using hosted and managed PostgreSQL services.


Aiven for PostgreSQL

As a fully managed and hosted solution for object-relational database management, Aiven for PostgreSQL is the fastest PostgreSQL available from any provider, offering high availability with automatic updates and backups. Aiven supports various extensions, such as PostGIS and TimescaleDB, and provides publish/subscribe logical replication.

More and more, our customers are using Aiven for PostgreSQL services as an integrated part of a system that also leverages other services, including Aiven for Kafka, Elasticsearch or Redis.

Getting started

Aiven for PostgreSQL is easy to set up, either on the Aiven Console:

[Image: creating an Aiven for PostgreSQL service in the Aiven Console]

straight from the command line, or using our REST API.

Wrapping up

In this article, we’ve looked at PostgreSQL: where it came from, what it is, what it does, and how it does it. We’ve examined Data Definition Language (DDL), Data Manipulation Language (DML) and the many tools and compelling features that come with this open-source database. Finally, after looking at use cases, users and PostgreSQL’s industry position, we’ve considered the advantages of a hosted, managed solution, and looked at a few ways to get started.

Want to try it out? Check out our no commitment, 30-day trial, and read about Aiven for PostgreSQL. You can also subscribe to our blog and changelog RSS feeds, or follow us on Twitter and LinkedIn.


References (others linked throughout the article):
[1] Funding a Revolution: Government Support for Computing Research. National Academy Press, 1999.
[2] A Relational Model of Data for Large Shared Data Banks. E.F. Codd, IBM, 1970.
[3] The POSTGRES Data Model.

