PostgreSQL Anonymizer 2.0 - Generating Fake Data

October 15th, 2024

After several months of development, version 2.0 of PostgreSQL Anonymizer has entered beta. We are taking this opportunity to launch a series of articles previewing its new capabilities!

For this first technical overview, let’s see how to generate fake data (also known as “synthetic data”).

Photo Credit Markus Spiske


Why is it important?

PostgreSQL Anonymizer 2.0 offers a wide range of functions to generate fake but realistic data. These functions are useful for writing masking rules and replacing sensitive data with data that “looks real.”

But that’s not the only benefit of these functions. In fact, they can also be used from the very first steps of a new project: when designing a data model, it is essential to “populate” the tables with data so as not to start development “empty-handed.”

Consider a new application that needs to rely on a classic customer table:

CREATE TABLE public.customer (
    id          INT PRIMARY KEY,
    firstname   TEXT,
    lastname    TEXT,
    email       TEXT
);

We want to insert 2000 people into this table, but how can we do it since the application doesn’t exist yet?

Why is it complicated?

Obviously, inserting John Doe 2000 times into the table is not really an option! The challenge is to produce realistic and context-appropriate data.

For example, if the application is intended for adult users, we want the birth dates to be between 1950 and 2006.

This is where PostgreSQL Anonymizer comes into play with its wide range of randomization functions and fake data generators.

INSERT INTO customer
SELECT
    -- i*100 plus a random offset in 0..99: each i gets its own range, so ids never collide
    i*100+pg_catalog.random(0,99),
    anon.dummy_first_name(),
    anon.dummy_last_name(),
    anon.dummy_free_email()
FROM generate_series(1,2000) i;

Note: The random(x,y) function is one of the new features of PostgreSQL 17! For earlier versions, the anon.random_int_between(x,y) function is an equivalent alternative.
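For readers on PostgreSQL 16 or earlier, here is a minimal sketch of the same insert using the alternative named in the note. It assumes the anon extension is installed and that the customer table is still empty; only the id expression changes.

```sql
-- Same insert as above, for PostgreSQL versions without random(x,y)
INSERT INTO customer
SELECT
    -- anon.random_int_between replaces pg_catalog.random(0,99)
    i*100 + anon.random_int_between(0,99),
    anon.dummy_first_name(),
    anon.dummy_last_name(),
    anon.dummy_free_email()
FROM generate_series(1,2000) i;
```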

In total, PostgreSQL Anonymizer provides more than 70 fake data generation functions! They all have the dummy_ prefix. The complete list is available in the Advanced Faking section of the documentation:

https://postgresql-anonymizer.readthedocs.io/en/latest/masking_functions/#advanced-faking

After the INSERT above, the table looks like this:

SELECT * FROM customer LIMIT 2;
 id  | firstname | lastname |        email
-----+-----------+----------+----------------------
 143 | Eloisa    | Beer     | mavis_ab@hotmail.com
 200 | Braden    | Hagenes  | ariel_nam@yahoo.com
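The birth date constraint mentioned earlier (adult users born between 1950 and 2006) can be handled with the randomization functions. This is a sketch only: it assumes anon.random_date_between accepts these two bounds as shown, and the birthdate alias is purely illustrative since the customer table has no such column.

```sql
-- Generate a few random birth dates between 1950 and 2006
SELECT
    anon.random_date_between('1950-01-01'::DATE, '2006-12-31'::DATE)::DATE AS birthdate
FROM generate_series(1,5);
```

The final cast to DATE drops the time part, which is usually not wanted for a birth date.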

Let’s now imagine that the application needs to store phone numbers and is intended for the French market. We can use the anon.dummy_phone_number() function:

SELECT anon.dummy_phone_number();
 dummy_phone_number
-------------------
 648-881-1114

This works, but the format does not match that of French phone numbers.

To obtain fake values adapted to the local context, we can simply add the _locale suffix to the function name and pass the desired locale as an argument, here fr_FR for a French number:

ALTER TABLE customer ADD COLUMN phone TEXT;
UPDATE customer 
  SET phone = anon.dummy_phone_number_locale('fr_FR');

Et voilà ! :)

SELECT * FROM customer LIMIT 2;
   id   | firstname | lastname |        email         |     phone
--------+-----------+----------+----------------------+----------------
 115076 | Zelda     | Robel    | afton_eos@gmail.com  | 05 38 16 52 85
 123886 | Kamille   | Ernser   | kiera_ut@hotmail.com | 04 24 18 60 76

As of today, seven locales are partially or fully supported: ar_SA, en_US, fr_FR, ja_JP, pt_BR, zh_CN, zh_TW.

As you can see, generating high-quality fake data is no easy task! Rather than developing this feature from scratch in PostgreSQL Anonymizer, we chose to rely on the fake-rs library, which supports numerous locales and a wide range of categories.
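As noted at the start of this article, the dummy_ functions are also meant to be used in masking rules. Here is a minimal sketch, assuming the anon extension is installed and initialized in the database and reusing the customer table from above. The choice of columns to mask is only an example.

```sql
-- Declare masking rules: each label names the generator that replaces the real value
SECURITY LABEL FOR anon ON COLUMN customer.firstname
  IS 'MASKED WITH FUNCTION anon.dummy_first_name()';

SECURITY LABEL FOR anon ON COLUMN customer.lastname
  IS 'MASKED WITH FUNCTION anon.dummy_last_name()';
```

The rules only declare what should be masked. How they are applied (static, dynamic, or through anonymous dumps) depends on the masking strategy described in the extension's documentation.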

The PostgreSQL Faker project is now end-of-life

If you have already used PostgreSQL Anonymizer, you might also be aware of the PostgreSQL Faker extension, which previously addressed the need for fake data. For various reasons, we have decided to freeze this project and integrate this functionality directly into PostgreSQL Anonymizer.

The PostgreSQL Faker project is now deprecated, and all its users are encouraged to gradually migrate to PostgreSQL Anonymizer 2.0.

Your turn!

All these new fake data generators are still in the beta phase! Feel free to test them and share your feedback and use cases. How do you generate fake data to populate your tables? Does this new feature meet your needs?

For any requests or ideas, open a ticket via the link below and let’s discuss!

https://gitlab.com/dalibo/postgresql_anonymizer/-/issues