Using PostgreSQL Anonymizer to safely share data with LLMs
Learn how PostgreSQL Anonymizer helps safely share data with LLMs by masking PII, using both static and dynamic masking strategies for secure development.
Jay is a Staff Developer Advocate at Aiven. Jay has served as a keynote speaker and is an avid member of the Python community. When away from the keyboard, Jay can often be found cheering on their favorite baseball team.
Large language models (LLMs) promise new performance gains for development teams by sharing information and context across your team. But how is that information stored?
AI companies cache and persist information, meaning queries and their results are stored in a database somewhere. I was able to retrieve prompts across multiple devices, meaning the information was not confined to my local machine. AI companies talk about how data is secured, but as we saw with DeepSeek, they are not immune to database setup mistakes that lead to data leaks. There is also the issue of retained history and context, which more and more AI tools offer as a feature. That information can be retrieved by a bad actor using zero-click exploits. So as we speed up our own development using LLMs, we need to ensure that we do it in a way that protects the information of our users.
Meet PostgreSQL Anonymizer, the data masking extension now available by default with Aiven for PostgreSQL®. PostgreSQL Anonymizer lets you declare and mask columns containing personally identifiable information (PII).
PostgreSQL Anonymizer supports the principle of least privilege (POLP), allowing even finer-grained access to information. LLMs and other users only interact with anonymized information, which minimizes exposure of sensitive data. Everything we explain here about how LLMs query masked data applies to any user under the same roles.
This post will show you two strategies you can use to work with anonymized data when using LLMs and MCP.
Our Data Model
Here is an example B2B ecommerce platform schema. All of the data is synthetic.
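The full schema isn't reproduced here. Based on the tables and columns referenced later in this post, a minimal sketch might look like the following (column types, and anything not mentioned elsewhere in the post, are assumptions):

```sql
-- Hypothetical minimal version of the schema used in this post
CREATE TABLE customers (
    id         SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name  VARCHAR(50),
    email      VARCHAR(255),
    company    VARCHAR(100),
    city       VARCHAR(100),
    state      VARCHAR(100),
    country    VARCHAR(100),
    industry   VARCHAR(50)       -- a user-defined enum in the real schema
);

CREATE TABLE orders (
    order_id     SERIAL PRIMARY KEY,
    customer_id  INTEGER REFERENCES customers (id),
    order_date   DATE,
    status       VARCHAR(20),    -- e.g. 'completed'
    total_amount NUMERIC(10, 2)
);

-- order_items, products, and product_categories round out the platform
```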
You're asked to help with a marketing campaign that segments customers within designated service regions and industries and finds sales trends across those segments. Our customer schema is full of personal information that an ecommerce company would need to fulfill orders. Using this information is critical to the request but creates opportunities for data breaches and data privacy violations.
⚠️ Don't put your company at risk by sharing PII
This is a reminder that models are non-deterministic by default. This means a model with access to your database may collect information in different and unexpected ways. Unfortunately, Model Context Protocol (MCP) does not help with this. Many of the MCP servers I've tested start with the broadest possible searches: first learning the schema and all the tables, then retrieving a sample of data to verify the format of the content.
Before we talk more about PostgreSQL Anonymizer, make sure you protect your customers' information:

- Consider the principle of least privilege (POLP) and limit all users' access to only the views and tables they absolutely need.
- Consider column-level security and row-level security where applicable.
- Consider hashing and salting sensitive data.
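To illustrate that last point, here's a minimal application-side sketch in Python. The helper name and the HMAC-based approach are my own illustration, not part of PostgreSQL Anonymizer: a keyed, salted hash keeps tokens consistent across tables (so joins still work) while remaining irreversible without the secret.

```python
import hashlib
import hmac
import os

def pseudonymize(value: str, salt: bytes) -> str:
    """Return a salted, one-way token for a sensitive value.

    HMAC-SHA256 keyed with the salt is deterministic for the same
    input and salt, but not reversible without the salt.
    """
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()

salt = os.urandom(32)  # generate once; store in a secrets manager, never with the data
token = pseudonymize("jay@companymail.com", salt)

# Same input + same salt -> same token, so joins across tables still work
assert token == pseudonymize("jay@companymail.com", salt)
# Different input -> different token
assert token != pseudonymize("jane@companymail.com", salt)
```

The key design point is keeping the salt out of whatever dataset you share; with a plain unsalted hash, common values like emails can be recovered by brute force.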
What happens when we ask AI with unfettered access
Let's see what happens when you give the LLM read access to the entire database. Using Claude 4.0 Pro with crystaldba/postgres-mcp in restricted mode (meaning it cannot write to the database, only read), I got a wonderful response that identified primary and secondary segmentation criteria as well as customer value metrics that should be tracked. When I asked it to compare its plan with the current database schema, it pulled the data and created a series of queries that connected to our database and viewed the entire database schema.
```sql
-- from the AI response
-- ==========================================
-- 1. DATA EXPLORATION & VALIDATION
-- ==========================================

-- Check current data distribution
SELECT 'Total Customers' AS metric, COUNT(*) AS count FROM customers
UNION ALL
SELECT 'Total Orders' AS metric, COUNT(*) AS count FROM orders
UNION ALL
SELECT 'Completed Orders' AS metric, COUNT(*) AS count FROM orders WHERE status = 'completed'
UNION ALL
SELECT 'Total Order Items' AS metric, COUNT(*) AS count FROM order_items;
...
```
As I mentioned earlier, this shows that LLMs will consider pathways outside of what is expected and will query tables you may not have anticipated.
A few lines later in the response we get a request for a lot of PII. The actual query isn't important, but notice the amount of PII it attempts to grab.
```sql
-- cutoff due to length
CREATE OR REPLACE VIEW high_value_customer_segments AS
WITH customer_purchase_metrics AS (
    SELECT
        c.customer_id,
        c.company,     -- PII when connected to an individual
        c.first_name,  -- PII
        c.last_name,   -- PII
        c.email,       -- PII
        c.state,       -- PII
        c.country,     -- PII
        c.industry,
        c.city,        -- PII
        COUNT(o.order_id) AS total_orders,
        SUM(o.total_amount) AS total_purchases,
        AVG(o.total_amount) AS avg_order_value,
        MAX(o.order_date) AS last_purchase_date,
        MIN(o.order_date) AS first_purchase_date,
```
This isn't a problem in itself; it's necessary. It's the foundation of the task we're trying to accomplish. The problem is that if this request had permission to run, it would have created this view, which, when queried, would display all of the customer information.
All this to say, if it's in the memory or data chain of an LLM, you have no control over how it can be retrieved later. The best approach is to never upload it to begin with.
Fast Fix - Fork, Transform and Share
PostgreSQL Anonymizer has the ability to permanently randomize data through static masking. Depending on the size of the database (and its contents), we can fork the database and anonymize our PII before passing it to the LLM.
Let's imagine our database has the following customer information:
```sql
SELECT first_name, last_name, email FROM customers;

 id | first_name | last_name |        email
----+------------+-----------+----------------------
  1 | Jay        | Miller    | jay@companymail.com
  2 | Jane       | Doe       | jane@companymail.com
```
Step 1: Let's fork our existing database. In the Aiven console, go to the service overview page for the PostgreSQL instance you want to work with. In the Backups and forking section, select the three dots and then Fork service.
Another way to do this is with the Aiven CLI.
```bash
avn service create <NEW SERVICE NAME> --project <PROJECT_NAME> --cloud <CLOUD_NAME> -t pg --plan business-4 -c service_to_fork_from=<FORKED SERVICE NAME>
```
Step 2: We need to connect to the forked database as an admin and enable the PostgreSQL Anonymizer extension.
Connect using your favorite PostgreSQL client (if you have psql installed, you can use the Aiven CLI again).
```bash
avn service cli <SERVICE NAME>
```
⚠️ NOTE: Static anonymization is a PERMANENT change. Do not run the following on your production database.
```sql
CREATE EXTENSION IF NOT EXISTS anon CASCADE; -- enable the PostgreSQL Anonymizer extension
SELECT anon.init();                          -- initialize the extension
```
This will copy the datasets used to generate anonymized values.
Step 3: Create SECURITY LABELS for the columns that should be anonymized.
```sql
-- RUN for each column you want to be anonymized
SECURITY LABEL FOR anon ON COLUMN customers.first_name
IS 'MASKED WITH FUNCTION anon.dummy_first_name()';

SECURITY LABEL FOR anon ON COLUMN customers.last_name
IS 'MASKED WITH FUNCTION anon.dummy_last_name()';

SECURITY LABEL FOR anon ON COLUMN customers.email
IS 'MASKED WITH FUNCTION anon.dummy_safe_email()';
```
The functions anon.dummy_first_name() and the others pull from the datasets we loaded with anon.init(). There are several common patterns to choose from; you can see the full list of functions in the documentation.
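As an illustration, here are two other masking styles from the extension's function library, applied to hypothetical columns (customers.phone and customers.birth_date are not part of this post's schema):

```sql
-- keep the first 2 and last 2 characters of a phone number, pad the middle
SECURITY LABEL FOR anon ON COLUMN customers.phone
IS 'MASKED WITH FUNCTION anon.partial(phone, 2, ''*****'', 2)';

-- replace a birth date with a random date
SECURITY LABEL FOR anon ON COLUMN customers.birth_date
IS 'MASKED WITH FUNCTION anon.random_date()';
```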
Step 4: Apply the changes. We could apply this at the table level with
```sql
SELECT anon.anonymize_column('customers', 'email'); -- anonymize the 'email' column
SELECT anon.anonymize_table('customers');           -- anonymize the 'customers' table
```
Or we can apply all the changes we've made with
```sql
SELECT anon.anonymize_database(); -- this will apply all of the masking rules
```
Now when we run the same query we ran at the beginning, we get something like this:
```sql
SELECT first_name, last_name, email FROM customers;

 id | first_name | last_name |         email
----+------------+-----------+-----------------------
  1 | Jason      | Thompson  | axk124erd@example.com  -- originally Jay
  2 | Sam        | Adams     | s3jdk3iui@example.net  -- originally Jane
```
If we do this to all our data we can then give this information to the LLM and trust that no customer data will be released.
Do we need to make another database?
No, we don't. PostgreSQL Anonymizer also supports dynamic masking. This masks data for specified roles but does not permanently change the rows. This means that when your LLM makes a query, the results are anonymized.
Let’s apply this to our marketing project.
Step 1: Create a view that removes the PII the LLM doesn't need.
```sql
-- Create a restricted view for LLM access
-- This view joins companies, customers, and orders with minimal data exposure
CREATE VIEW llm_customer_data AS
SELECT
    -- Customer Information
    -- city, state, and country could lead to segmentation based on location
    cust.id AS customer_id,
    cust.email AS customer_email, -- so that we can check against other data
    cust.city AS customer_city,   -- no longer PII once the email is anonymized
    cust.state AS customer_state,
    cust.country AS customer_country,
    cust.industry AS customer_industry
FROM customers cust;
```
This reduces the amount of masking needed. Since masking runs with every query, we want to optimize the process as much as possible.
Step 2: Apply the policy to the role that we made for the LLM.
```sql
-- Create the account for your LLM
CREATE ROLE llmuser WITH LOGIN PASSWORD '<PASSWORD FOR LLM>';

-- restrict access to only the views and tables we want to grant
GRANT CONNECT ON DATABASE defaultdb TO llmuser;
GRANT USAGE ON SCHEMA public TO llmuser;
GRANT SELECT ON public.llm_customer_data TO llmuser;
GRANT SELECT ON public.orders TO llmuser;
GRANT SELECT ON public.order_items TO llmuser;
GRANT SELECT ON public.product_categories TO llmuser;
GRANT SELECT ON public.products TO llmuser;
```
Now when the model uses MCP to learn more about the database it will only find the tables and views that it has access to.
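One way to confirm this yourself (a sketch, run from an admin session) is to switch into the role and ask the catalog what it can see; information_schema only reports objects the current role has privileges on:

```sql
SET ROLE llmuser;
-- lists only the tables and views llmuser can access
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';
RESET ROLE;
```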
This is useful but limits the LLM's ability to learn and work outside the lines. The customer email, which is PII, is still there. That's intentional: since this is an email marketing campaign, we'd want to use it later with privileged accounts.
⚠️ NOTE: You should still not give AI access to any production database. Use with caution.
Step 3: Start the extension and the dynamic masking engine.
```sql
-- same instructions as before
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

-- start the dynamic masking
ALTER DATABASE defaultdb SET anon.transparent_dynamic_masking TO true;
```
The SET anon.transparent_dynamic_masking TO true; starts the dynamic masking engine. Now, as before, we create our security labels and apply the label to the user account.
```sql
-- run for each column we want to mask
SECURITY LABEL FOR anon ON COLUMN llm_customer_data.customer_email -- mask the view, not the original column
IS 'MASKED WITH FUNCTION anon.dummy_safe_email()';

-- apply the label to llmuser
SECURITY LABEL FOR anon ON ROLE llmuser IS 'MASKED';
```
Notice we applied the mask to the view and not the original column. If you apply this to the original column, the data will not change in the referenced view. You have to apply the mask to what will be called in the query.
Now that role (the one we'll give to LLMs and MCP servers) starts its exploration. Interestingly enough, it learned about the relationships to the customers table, got confused, and assumed that we hadn't created the table yet (HOW DARE YOU NOT GIVE IT ACCESS 😂). After I mentioned the view, it immediately began building customer segmentation by trying to get all the information.
```text
Ah, it's called llm_customer_data! Let me examine that view:

E: get_object_details
Request
{
  "object_name": "llm_customer_data",
  "object_type": "view",
  "schema_name": "public"
}
Response
{'basic': {'schema': 'public', 'name': 'llm_customer_data', 'type': 'view'}, 'columns': [{'column': 'customer_id', 'data_type': 'integer', 'is_nullable': 'YES', 'default': None}, {'column': 'customer_email', 'data_type': 'character varying', 'is_nullable': 'YES', 'default': None}, {'column': 'customer_city', 'data_type': 'character varying', 'is_nullable': 'YES', 'default': None}, {'column': 'customer_state', 'data_type': 'character varying', 'is_nullable': 'YES', 'default': None}, {'column': 'customer_country', 'data_type': 'character varying', 'is_nullable': 'YES', 'default': None}, {'column': 'customer_industry', 'data_type': 'USER-DEFINED', 'is_nullable': 'YES', 'default': None}], 'constraints': [], 'indexes': []}

Perfect! Now let me examine the actual customer data available through this view:

E: execute_sql
Request
{
  "sql": "SELECT * FROM llm_customer_data LIMIT 10;"
}
Response
[{'customer_id': 0, 'customer_email': 'hermann2369@wunschcrona.com', 'customer_city': 'West Kendall', 'customer_state': 'Utah', 'customer_country': 'United States', 'customer_industry': 'Technology'},
...
```
It then proceeded to generate a campaign highlighting top-performing segments and the states they were in. Here's a sample.
```text
Tier 1: Premium Tech Markets
Target: Florida, South Carolina, Illinois Tech customers
Budget Allocation: 35% ($350K)
Customer LTV: $6,000-$8,000
Campaign: "Innovation Leaders"
Channels: LinkedIn, Google Ads, Tech conferences
Messaging: "Cutting-edge solutions for industry leaders"
Offers: Volume discounts, early access to new products

Tier 2: Healthcare Powerhouses
Target: Iowa, Oklahoma, Arizona Healthcare customers
Budget Allocation: 30% ($300K)
Customer LTV: $4,700-$6,300
Campaign: "Trusted Healthcare Partners"
Channels: Healthcare publications, medical conferences, direct mail
Messaging: "Reliable solutions for critical operations"
Offers: Extended warranties, compliance support
```

Reminder: all of these values were made up using synthetic data. Do not use them for your B2B marketing campaigns.
When I ask it for more information about the customer at ID 0:
```text
Customer ID: 0 Profile (Updated)

Customer Information:
Email: genevieve@example.com (Note: email changed from the previous query)
Location: West Kendall, Utah, United States
Industry: Technology
```
Use PostgreSQL Anonymizer with your LLMs and your employees
This introduction to PostgreSQL Anonymizer shows how you can shield PII from LLMs. These tips also help you follow the principle of least privilege. There are other use cases for the anonymizer, such as anonymizing data to train your own models, as well as other masking strategies that dump masked data to other formats and data types. Lastly, there are many different masking functions for all kinds of information. You can learn more in the PostgreSQL Anonymizer documentation.