Dalibo Labs just released yesterday the version 0.5 of PostgreSQL Anonymizer, an extension that hides sensitive information inside a PostgreSQL database.
The extension already has a large panel of anonymization strategies: randomization, faking, partial destruction, shuffling, noise addition, etc.
For this new released, I worked on a different approach called Generalization and the k-anonymity property.
The idea behing Generalization is quite simple: To avoid giving a sensitive information, we can replace the data with a broader, less accurate value.
For instance, instead of saying “Bob is 28 years old”, you can say “Bob is between 20 and 30 years old”. This is something we all do unconsciously when we don’t want to reveal a precise information. Typically when someone ask “Where do you live ?”, we often respond with a vague answer (“I live in Paris” or even “I live in France”) instead of giving our exact address.
What is interesting with this approach is that the data remains true while avoiding the risk of re-identification.
So Generalization is an instinctive and simple way to hide personnal
human beings. But how would that apply to a database management system like
PostgreSQL ? After all the primary purpose of a database is to store accurate
information. If you have
birthdate column in a SQL table, you can’t
practically insert data saying “sometime in the nineties”, right ? 😀
Well… it turns out that PostgreSQL can handle generalization very easily with the RANGE data types, a very powefull way to store and manipulate a set of values contained between a lower and an upper bound.
Example: Blurring medical data
Here’s a basic table containing sensible health data:
# SELECT * FROM patient; ssn | firstname | zipcode | birth | disease -------------+-----------+---------+------------+--------------- 253-51-6170 | Alice | 47012 | 1989-12-29 | Heart Disease 091-20-0543 | Bob | 42678 | 1979-03-22 | Allergy 565-94-1926 | Caroline | 42678 | 1971-07-22 | Heart Disease 510-56-7882 | Eleanor | 47909 | 1989-12-15 | Acne 098-24-5548 | David | 47905 | 1997-03-04 | Flu 118-49-5228 | Jean | 47511 | 1993-09-14 | Flu 263-50-7396 | Tim | 47900 | 1981-02-25 | Heart Disease 109-99-6362 | Bernard | 47168 | 1992-01-03 | Asthma 287-17-2794 | Sophie | 42020 | 1972-07-14 | Asthma 409-28-2014 | Arnold | 47000 | 1999-11-20 | Diabetes (10 rows)
We want the anonymized data to remain true because it will be used for statistics. We can build a generalized view upon this table with 2 mains actions :
First, we need to get rid of primary keys and direct identifiers. Here we will simply remove the
ssnfield and destroy the content of the
Second, we will reduce the accuracy of the indirect identifiers (
birth) using 2 generalization functions from the PostgreSQL Anonymizer extension :
Here we go:
CREATE MATERIALIZED VIEW generalized_patient AS SELECT 'REDACTED'::TEXT AS firstname, anon.generalize_int4range(zipcode,1000) AS zipcode, anon.generalize_daterange(birth,'decade') AS birth, disease FROM patient;
Notice that the generalization functions take 2 arguments: the original value as the first parameter and a second parameter for the length of each step of the range.
For numeric values, the step is an integer. Here the
is generalized in steps of 1000. For datetime values, the possible steps
are: microseconds, milliseconds, second, minute, hour, day, week, month,
year, decade, century and millennium. Here the
birth column is generalized
This will give us this less accurate view of the data:
# SELECT * FROM generalized_patient; firstname | zipcode | birth | disease -----------+---------------+-------------------------+--------------- REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Allergy REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Acne REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Asthma REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Asthma REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Diabetes (10 rows)
The nice thing with RANGE datatypes is that they are very easy to use for statistics. For instance we can answer like this to the question “How many people born before 1990 have had flu ?”:
SELECT count(*) FROM generalized_patient WHERE disease = 'Flu' AND upper(birth) < '1990-01-01';
As you can see, if you generalize the data with wider steps, the data will
be less accurate. For instance, using
millennium in the
anon.generalize_daterange() function would completely destroy the data.
On the hand, using
week would be too precise and there would be risk of
So the main question that comes with Generalization is : “Is this dataset generalized enough ?” and this is where we meet the concept of k-anonymity.
k-anonymity is an industry-standard term used to describe a property of an
anonymized dataset. The k-anonymity principle states that within a
given dataset, any anonymized individual cannot be distinguished from at
k-1 other individuals. In other words, k-anonymity might be described
as a “hiding in the crowd” guarantee. A low value of
k indicates there’s
a risk of re-identification using linkage with other data sources.
With PostgreSQL Anonymizer, you can evaluate the
k factor of a generalized
table in 2 steps :
- First declare the columns that are [indirect idenfiers] ( also known as “quasi identifers”) like this:
SECURITY LABEL FOR anon ON COLUMN patient.zipcode IS 'INDIRECT IDENTIFIER'; SECURITY LABEL FOR anon ON COLUMN patient.birth IS 'INDIRECT IDENTIFIER';
- Once the indirect identifiers are declared, just use the
In the example above, the
k factor is 3, which means that for any
value of (
birth), you will find at least 3 individuals in the
generalized_patient table. If
k = 1, some people in your data set may
be reidentified using external data or inference techniques.
Different anonymization strategies for different purposes
You may have already spotted 2 main limitations of Generalization strategy :
A/ This technique works great low-dimensional datasets (i.e. tables with a small
number of columns). But if a datasets contains a lot of indirect identifiers
k factor will drop
B/ The generalized view and the original table don’t have the same data model. In particular, the numeric and datetime value will be replaced by RANGE. Since Generalization breaks the data model, it is not possible to use Dynamic Masking, Anonymous Dumps and In-Place Anonymization with this strategy. In other words: Generalization cannot be used for CI tests, for application development or as training data.
This is the demonstration that there is no “1-size-fits-all” approach for anonymization ! In fact, in some case you may have to build several anonymization workflows for a given dataset, because different data operators will have different needs and constraints.
This is all for today ! If GDPR and Anonymization are a concern to you, please take a look at the PostgreSQL Anonymizer extension and send us feedback !
We love feedback 😀
The project is open source and available here:
blog comments powered by Disqus