Stop hitting the database for things that aren't there: Use a Valkey Bloom Filter

Every time your app asks the database 'does this row exist?' about something that isn't there, it pays the full cost of a database query just to get back 'no'.

What is Valkey Bloom?

Valkey Bloom ships as a plugin included and enabled with Aiven for Valkey on version 9.0 and higher. Valkey Bloom filters give you a way to cut most database miss queries by sending them first to an in-memory lookup.

A bloom filter is a small, probabilistic data structure designed to answer one question: "Have I seen this item before?" It provides two potential answers: Absolutely Not, and Probably.

You may think a guaranteed Yes or No would be better, but "Probably" is very fast, and the lookups you really care about are the "Absolutely Not" ones, which would otherwise tie up database connections for nothing.

In the setup described here, lookups pass through three layers: the bloom filter, a cache, and PostgreSQL. Each layer catches what the previous one missed. The bloom filter rejects lookups for items that don't exist without a database query, the cache serves repeat hits for popular items, and PostgreSQL only sees queries that have a real chance of returning data. The database is hit only a fraction of the time, plus for writes such as inventory adjustments and transactions.

Setting Up A Bloom Filter

Reserve a filter with a target false-positive rate and expected capacity. For 10 million active item numbers at a 0.1% false-positive rate:

BF.RESERVE bf:items 0.001 10000000

Seems like a lot. It’s actually about 25MB of space.

used_memory_human:25.98M
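
To sanity-check numbers like this before reserving a filter, the standard sizing formula for an optimally configured bloom filter, m = -n ln(p) / (ln 2)^2 bits, gives a rough estimate. The sketch below is only that textbook formula, not a statement of how Valkey Bloom allocates memory internally, and the function name is made up for illustration. Note that used_memory_human reports the whole server's memory, so it will be larger than the filter alone.

```python
import math


def bloom_size_bytes(capacity: int, error_rate: float) -> int:
    """Estimate the size of an optimally sized bloom filter, in bytes.

    Textbook formula: bits = -n * ln(p) / (ln 2)^2.
    The real allocation in Valkey Bloom may differ from this estimate.
    """
    bits = -capacity * math.log(error_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


if __name__ == "__main__":
    for capacity in (10_000_000, 50_000_000):
        size = bloom_size_bytes(capacity, 0.001)
        print(f"{capacity:>11,} items at 0.1% -> {size / 1_000_000:.2f} MB")
```

For 10 million items at 0.1% this comes out at about 18 MB for the filter itself, and for 50 million items it lands within a few hundred bytes of the Size value in the BF.INFO output shown later in this article.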

Populating from PostgreSQL

You can now populate the bloom filter from the data in your PostgreSQL database. As rows stream in, insert the item numbers in batches of 1,000 per call. These are the commands that actually run in Valkey:

BF.MADD bf:items 0000000000 0000000001 0000000002 ... 0000000999
BF.MADD bf:items 0000001000 0000001001 0000001002 ... 0000001999
...

The cleanest way to do this is a script that pulls from PostgreSQL in batches and streams them into Valkey.

import os

import psycopg
import valkey

# Connection settings come from the environment so no secrets live in the script.
PG_CONN = os.environ["AIVEN_FOR_PG_CONNECTION_STRING"]
VALKEY_HOST = os.environ["AIVEN_FOR_VALKEY_HOST"]
VALKEY_PORT = int(os.environ["AIVEN_FOR_VALKEY_PORT"])
VALKEY_PASSWORD = os.environ["AIVEN_FOR_VALKEY_PASSWORD"]

FILTER_KEY = "bf:items"
CAPACITY = 10_000_000  # expected number of items
ERROR_RATE = 0.001     # 0.1% false-positive rate
PAGE_SIZE = 1000       # item numbers per BF.MADD call


def seed_filter():
    valkey_conn = valkey.Valkey(
        host=VALKEY_HOST,
        port=VALKEY_PORT,
        password=VALKEY_PASSWORD,
        ssl=True,
        ssl_cert_reqs="required",
    )

    # Create the empty filter. BF.RESERVE returns an error if the key already exists.
    valkey_conn.execute_command("BF.RESERVE", FILTER_KEY, ERROR_RATE, CAPACITY)

    with psycopg.connect(PG_CONN) as pg:
        with pg.cursor() as cur:
            # Keyset pagination: remember the last item number seen and fetch
            # the next page after it. Assumes item_number is a text column.
            last_seen = ""
            while True:
                cur.execute("""
                    SELECT item_number
                    FROM catalog_items
                    WHERE active = true
                      AND deprecated_at IS NULL
                      AND item_number > %s
                    ORDER BY item_number
                    LIMIT %s
                """, (last_seen, PAGE_SIZE))

                rows = cur.fetchall()
                if not rows:
                    break

                # One BF.MADD call per page of item numbers.
                valkey_conn.execute_command("BF.MADD", FILTER_KEY, *[r[0] for r in rows])
                last_seen = rows[-1][0]


if __name__ == "__main__":
    seed_filter()


Bloom Filter in Action

The application's read path goes through all three layers. We run BF.EXISTS to check an item number.

BF.EXISTS bf:items 2813308004

If this returns 0, the item is definitely not in the catalog. The application returns "not found" immediately.

If it returns 1, the item probably exists, and we proceed to the cache:

GET item:2813308004

On a cache hit, we're done. The information is returned and the process can continue without reaching the database.

On a cache miss, we fall through to PostgreSQL:

SELECT * FROM catalog_items WHERE item_number = '2813308004';

If PostgreSQL returns nothing, we just hit a bloom filter false positive. With our 0.1% setting, this happens for about 1 in every 1,000 lookups of items that aren't in the catalog. That's roughly 999 unnecessary database queries avoided for every false positive that gets through.

Lastly, when PostgreSQL does return a row, we cache it so that further requests within the time-to-live (TTL) are served by Valkey, saving more queries against the database.

SET item:2813308004 "<json payload>" EX 300

That EX 300 gives the cache entry a 5-minute TTL, which is plenty of time to absorb a burst of repeat lookups for the same item.
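
Put together, the read path above could look something like the following in application code. This is a minimal sketch rather than a definitive implementation: lookup_item and fetch_from_db are hypothetical names, kv is assumed to be a valkey-py client (or anything exposing execute_command, get, and set), and fetch_from_db stands in for whatever runs the SELECT and returns a dict or None.

```python
import json
from typing import Callable, Optional

FILTER_KEY = "bf:items"
CACHE_TTL_SECONDS = 300  # same 5-minute TTL as the EX 300 above


def lookup_item(
    kv,
    fetch_from_db: Callable[[str], Optional[dict]],
    item_number: str,
) -> Optional[dict]:
    """Return the item as a dict, or None if it does not exist."""
    # Layer 1: bloom filter. A 0 reply means "definitely not in the catalog".
    if not kv.execute_command("BF.EXISTS", FILTER_KEY, item_number):
        return None

    # Layer 2: cache. A hit means we never touch the database.
    cache_key = f"item:{item_number}"
    cached = kv.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Layer 3: PostgreSQL (hypothetical callable supplied by the caller).
    row = fetch_from_db(item_number)
    if row is None:
        return None  # the bloom filter gave a false positive

    # Cache the result so repeat lookups within the TTL stay in Valkey.
    kv.set(cache_key, json.dumps(row), ex=CACHE_TTL_SECONDS)
    return row
```

Passing the client and the database function in as arguments keeps the function easy to exercise with stand-ins, without a live Valkey or PostgreSQL instance.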

Repopulating the data

Data changes, and we don't want to keep the same bloom filter forever. Items that have been removed from the catalog stay in the filter, so over time more and more lookups for items that no longer exist get through to the database.

The cleanest pattern is a periodic rebuild from PostgreSQL into a fresh filter, followed by an atomic swap. The script is very similar to our initial loading script, but it loads the data into a temporary filter (taking up another 25MB of space while the rebuild runs).

"""
We'd move the connection configuration to a config.py and
import it into both scripts
"""

import psycopg

from config import (
    CAPACITY,
    ERROR_RATE,
    PAGE_SIZE,
    PG_CONN,
    get_valkey_conn,
)

FILTER_KEY = "bf:items"
REBUILD_KEY = "bf:items:rebuild"


def rebuild_filter():
    valkey_conn = get_valkey_conn()

    # Start clean: make sure no stale rebuild key exists
    valkey_conn.delete(REBUILD_KEY)
    valkey_conn.execute_command("BF.RESERVE", REBUILD_KEY, ERROR_RATE, CAPACITY)

    with psycopg.connect(PG_CONN) as pg:
        with pg.cursor() as cur:
            # Keyset pagination, same as in the initial loading script
            last_seen = ""
            while True:
                cur.execute("""
                    SELECT item_number
                    FROM catalog_items
                    WHERE active = true
                      AND deprecated_at IS NULL
                      AND item_number > %s
                    ORDER BY item_number
                    LIMIT %s
                """, (last_seen, PAGE_SIZE))

                rows = cur.fetchall()
                if not rows:
                    break

                valkey_conn.execute_command("BF.MADD", REBUILD_KEY, *[r[0] for r in rows])
                last_seen = rows[-1][0]

    # Atomic swap: the live filter is replaced in one step, no gap in coverage
    valkey_conn.rename(REBUILD_KEY, FILTER_KEY)


if __name__ == "__main__":
    rebuild_filter()
 


This loads fresh data into a new filter, bf:items:rebuild, and then overwrites the old filter with RENAME.

Before You Ship

Bloom filters are great but you can definitely mess things up. The issues below are the ones that actually bite people.

Bloom filters aren’t automatic.

Nothing keeps the filter in sync with PostgreSQL for you. Items added to PostgreSQL won't appear in the bloom filter until you put them there, so make sure that you refresh your filters on a regular basis.

Bloom filters don't support deletion. Once an item is added, it's there until you rebuild the filter from scratch. This is why the rebuild should be a script that runs on a schedule rather than a command you run by hand now and then.

Solution: If your data is changing too quickly for bloom filters to stay accurate, opt for a regular caching strategy with a time-to-live (TTL) that matches how often the data changes.
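
For catalogs where new items only trickle in between rebuilds, one complementary option is to add each new item number to the filter on the write path. This is a different technique from the scheduled rebuild described in this article, shown only as a sketch: register_new_item and insert_row are hypothetical names, and kv is assumed to be a valkey-py client. BF.ADD is the single-item counterpart of the BF.MADD command used earlier.

```python
FILTER_KEY = "bf:items"


def register_new_item(kv, insert_row, item_number: str, payload: dict) -> None:
    """Insert a new catalog item and make it visible to the bloom filter.

    insert_row is a hypothetical callable that writes the row to PostgreSQL.
    """
    # Write to the database first, so the filter never says "probably"
    # for an item that failed to be stored.
    insert_row(item_number, payload)

    # Then add the item number to the live filter.
    kv.execute_command("BF.ADD", FILTER_KEY, item_number)
```

Between the insert and the BF.ADD call there is a brief moment where the new item still looks "not found" to readers. Deletions are not handled here at all, so the periodic rebuild is still needed.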

BF.RESERVE is important.

BF.RESERVE commits you to a capacity and error rate. Push past the reserved capacity and the false-positive rate climbs.

Solution: Include plenty of headroom in your reserved space. The default memory limit for a bloom filter is 128MB, which can hold 50 million items in a filter. You can also monitor your bloom filter with BF.INFO.

BF.INFO bf:items
 1) Capacity
 2) (integer) 50000000
 3) Size
 4) (integer) 89860187 # approx 90MB
 5) Number of filters
 6) (integer) 1
 7) Number of items inserted
 8) (integer) 10000000
 9) Error rate
10) "0.001" # 0.1% false-positive rate
11) Expansion rate
12) (integer) 2
13) Tightening ratio
14) "0.5"
15) Max scaled capacity
16) (integer) 50000000

This returns capacity, size, number of items inserted, and expansion rate — useful for alerting before you blow past the reserved capacity.
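
A monitoring job can turn that reply into a fill ratio to alert on. The sketch below assumes the reply arrives as a flat list of alternating field names and values, as in the output above, possibly as bytes depending on client settings. The helper names and the 80% threshold are made up for illustration.

```python
def parse_bf_info(reply) -> dict:
    """Turn a BF.INFO reply into a dict keyed by field name."""
    def text(value):
        return value.decode() if isinstance(value, bytes) else value

    if isinstance(reply, dict):  # some client/protocol settings return a mapping
        return {text(k): text(v) for k, v in reply.items()}
    # Flat list: [name1, value1, name2, value2, ...]
    return {text(reply[i]): text(reply[i + 1]) for i in range(0, len(reply), 2)}


def fill_ratio(info: dict) -> float:
    """Fraction of the reserved capacity that is already used."""
    return int(info["Number of items inserted"]) / int(info["Capacity"])


def needs_attention(info: dict, threshold: float = 0.8) -> bool:
    """True once the filter is fuller than the (arbitrary) threshold."""
    return fill_ratio(info) >= threshold


# Usage with a valkey-py client (not executed here):
#   info = parse_bf_info(valkey_conn.execute_command("BF.INFO", "bf:items"))
#   if needs_attention(info): ...
```

With the numbers shown above (10 million items inserted into a filter reserved for 50 million), the fill ratio is 0.2, well below the example threshold.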

Atomic swap is non-negotiable.

Don't DEL the live filter and re-populate it. There will be a window where every lookup returns "definitely not" and your application treats every ID as invalid. Depending on how long the rebuild takes, that could be a serious problem.

Solution: Always build into a temp key and RENAME to swap.

UPDATE: When using RENAME in a cluster, both bf:items and bf:items:rebuild must land on the same hash slot or you'll get a CROSSSLOT error. Without hash tags, they very likely won't.

The fix is to use bf:{items} and bf:{items}:rebuild.
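
To see why the hash tag helps, here is a small sketch of the cluster key-to-slot mapping: CRC16 (XMODEM variant) of the key modulo 16384, where only the text inside the first non-empty {...} is hashed if one is present. It reimplements the published algorithm for illustration. On a real cluster, CLUSTER KEYSLOT gives the authoritative answer.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: str) -> int:
    """Return the cluster hash slot (0..16383) for a key."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" hashes the whole key.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc16_xmodem(data) % 16384


if __name__ == "__main__":
    for key in ("bf:items", "bf:items:rebuild", "bf:{items}", "bf:{items}:rebuild"):
        print(f"{key:<22} slot {key_hash_slot(key)}")
```

Both tagged keys hash only the string items, so they always share a slot and RENAME between them is allowed.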

Persistence has caveats.

Bloom filters persist with the normal Valkey snapshot (RDB) and append-only file (AOF) settings, but a cold start on a fresh node means an empty filter until the reload completes (that includes replicas).

Solution: Make sure your application can cope with the bloom filter being empty or missing, so you don't have downtime when you need to drop the filter or rebuild it from scratch.
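
One way to do that is to fail open: if the filter key is missing, skip the filter and let the cache and PostgreSQL answer instead of rejecting everything. The sketch below assumes a valkey-py client and a hypothetical might_exist helper. It only detects a missing key, not a filter that exists but is still being populated, which would need a separate signal such as a flag key set once loading finishes.

```python
FILTER_KEY = "bf:items"


def might_exist(kv, item_number: str) -> bool:
    """Return False only when a present filter says "definitely not".

    If the filter key does not exist (cold start, dropped filter), return
    True so the caller falls through to the cache and the database.
    """
    # EXISTS returns how many of the given keys exist (0 or 1 here).
    if not kv.exists(FILTER_KEY):
        return True  # fail open: no filter means no opinion

    return bool(kv.execute_command("BF.EXISTS", FILTER_KEY, item_number))
```

This costs an extra round trip per lookup. The two commands could be sent together in a pipeline if that matters for your latency budget.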

Wrapping Up

Bloom filters exist to keep wasted requests away from your database. There are other use cases like deduplication and fraud/spam detection. If you find yourself querying your database against mostly static information, a bloom filter is the right tool.
