Tombstones in Apache Cassandra®
Apache Cassandra manages deletion of data via a mechanism called tombstones.
Because Cassandra is a distributed system, it cannot delete data immediately in
the same way as a traditional relational database. At a high level, when a row
is deleted, Cassandra does not remove it right away but marks it as a
tombstone row. The row is actually removed later, as part of regularly
scheduled maintenance called compaction. How long a tombstone must be kept
before compaction can remove it is controlled by the table-level setting
gc_grace_seconds. Any tombstone older than this setting will be removed
completely during compaction (with some caveats; more details in the Cassandra
documentation on compaction).
The default value for this setting is 864000 seconds (10 days).
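The timing rule above can be sketched as simple date arithmetic. This is a minimal illustration, not Cassandra's implementation: the helper names earliest_purge_time and is_purgeable are hypothetical, and the sketch ignores the caveats mentioned above, so a tombstone that passes this check is only eligible for removal at the next compaction, not guaranteed to be gone.

```python
from datetime import datetime, timedelta, timezone

# Default from the text above: 864000 seconds (10 days).
DEFAULT_GC_GRACE_SECONDS = 864000


def earliest_purge_time(deleted_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Earliest moment a tombstone written at deleted_at may be purged."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def is_purgeable(deleted_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once gc_grace_seconds has fully elapsed since the delete."""
    return now >= earliest_purge_time(deleted_at, gc_grace_seconds)


deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(earliest_purge_time(deleted_at))  # 2024-01-11 00:00:00+00:00
print(is_purgeable(deleted_at, datetime(2024, 1, 5, tzinfo=timezone.utc)))  # False
```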
Tombstone tradeoffs
If your system holds a very large number of tombstone rows, it can behave unexpectedly, because rows that appear deleted are in fact still on disk with a tombstone marker. If your workload includes a lot of data deletion, it is useful to be aware of the tradeoffs. Tombstones are periodically processed by garbage collection, which can affect cluster stability. The two main things affected are read performance and disk usage.
Tombstones and read performance
If read queries have to scan large numbers of tombstones, query performance can degrade significantly. In particularly bad cases, the query can even time out. A few types of queries are more likely to be affected; they all involve scanning all or a large part of a table.
- Full table scans like SELECT * from inventory.items
- Any query that requires adding ALLOW FILTERING
- Range queries, which are queries with WHERE item_cost > threshold or similar
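As a rough aid for reviewing queries against the three patterns above, the following sketch flags CQL text that matches them. It is a naive string heuristic under stated assumptions, not a CQL parser: the function name tombstone_scan_risks is hypothetical, and a comparison character inside a string literal would produce a false positive.

```python
import re


def tombstone_scan_risks(cql):
    """Return the reasons a CQL SELECT matches the risky patterns listed above."""
    # Normalise case and whitespace so the substring checks are predictable.
    text = " ".join(cql.upper().split())
    reasons = []
    if text.startswith("SELECT") and " WHERE " not in f" {text} ":
        reasons.append("no WHERE clause (full table scan)")
    if "ALLOW FILTERING" in text:
        reasons.append("uses ALLOW FILTERING")
    # Only look for range operators after the WHERE keyword.
    where_clause = text.partition(" WHERE ")[2]
    if re.search(r"[<>]", where_clause):
        reasons.append("range predicate in WHERE")
    return reasons


print(tombstone_scan_risks("SELECT * from inventory.items"))
print(tombstone_scan_risks("SELECT * FROM inventory.items WHERE item_id = 42"))
```

An empty list only means that none of these patterns matched; it does not prove that the query avoids tombstones.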
Tombstones and disk usage
If you are rapidly filling up your cluster with data at the same time as you are doing a lot of deletions, you will reach size limits sooner. This is because the tombstoned data is not actually deleted and still takes up space on disk.
Identify when tombstones affect a query
If you suspect that tombstones are causing problems in your cluster, you can
check the logs. By default, if a query encounters over 1000 tombstones
(configured by tombstone_warn_threshold, see the documentation), it will
generate a log entry. The entry will be in the format
Read <X> live rows and <Y> tombstone cells for query <query> [...] (see tombstone_warn_threshold)
If it encounters over 100 000 tombstones (configured by
tombstone_failure_threshold), the query will be aborted with a
TombstoneOverwhelmingException (or just time out). To investigate a query that
is encountering tombstones, the easiest way is to connect with a cqlsh session
and run TRACING ON followed by the query of interest. You can view values of
Cassandra settings with
SELECT * FROM system_views.settings WHERE name = '<setting name>';
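To find the worst offenders in a log file, the warning format quoted above can be matched with a regular expression. This is a sketch built only on that quoted format: the function name parse_tombstone_warning and the sample line are hypothetical, and real log lines carry extra fields (timestamp, thread, level) that vary by version, which is why the pattern searches within the line rather than matching it whole.

```python
import re

# Pattern derived from the format quoted in the text; the trailing
# "(see tombstone_warn_threshold)" hint is optional and kept out of the query.
TOMBSTONE_LINE = re.compile(
    r"Read (?P<live>\d+) live rows and (?P<tombstones>\d+) tombstone cells "
    r"for query (?P<query>.+?)(?: \(see tombstone_warn_threshold\))?$"
)


def parse_tombstone_warning(line):
    """Return counts and query text for a tombstone warning, else None."""
    match = TOMBSTONE_LINE.search(line.strip())
    if match is None:
        return None
    return {
        "live_rows": int(match.group("live")),
        "tombstone_cells": int(match.group("tombstones")),
        "query": match.group("query"),
    }


# Hypothetical sample line, for illustration only.
sample = (
    "WARN Read 12 live rows and 4500 tombstone cells for query "
    "SELECT * FROM inventory.items LIMIT 100 (see tombstone_warn_threshold)"
)
print(parse_tombstone_warning(sample))
```

Sorting the parsed entries by tombstone_cells gives a shortlist of queries worth running under TRACING ON.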
Tombstone best practice
Designing your data models and query strategies to account for the expected tombstones for your particular application can really help to get the best from Apache Cassandra. We've put together a list of strategies to help mitigate the effects that can sometimes be observed.
- Review your data model and compaction strategy and consider implementing table-level time-to-live (TTL) or using TimeWindowCompactionStrategy (TWCS) as the compaction strategy if appropriate for your workload.
- Avoid queries that end up running on all partitions in a table, such as queries with no WHERE clause, or queries that need ALLOW FILTERING.
- Update your queries so that they scan over fewer tombstone rows. For range queries, this might mean investigating if you can use a narrower range, or use a different approach to the query.
- If you are planning to delete all the data in a table, you can truncate the table to avoid creating tombstones.
- Allow tombstone deletion to happen automatically as part of regular operations rather than forcing the deletes. Once more time than gc_grace_seconds has elapsed since the deletion and a compaction runs, the data with tombstone markers will be removed from disk.