As a result of the EU GDPR, you’ll have come across phrases such as ‘profiling’ and ‘privacy by design.’ But the regulation has also thrown words such as ‘anonymisation’ and ‘pseudonymisation’ into the spotlight. You may know these words better as ‘anonymous data’ or ‘pseudonymous data,’ but what do they actually mean? And how and when are they useful?
Here we look at what data anonymisation and pseudonymisation actually entail, techniques to employ them, and their uses and risks.
Anonymisation and pseudonymisation: What’s the difference?
First things first, these are two distinct terms. As you’ll see, the GDPR even categorises them differently.
– Anonymisation destroys any way of identifying the data subject. It is irreversible.
– Pseudonymisation substitutes the identity of the data subject, meaning you need additional information to re-identify the data subject. It is reversible.
Bear with me for a moment while I use an example. On one desk, you have four books written by ‘Anon.’ You don’t know if the same author wrote all four books, or if two, three or four people wrote them.
On another desk, you have four books written by George Orwell. You know that George Orwell wrote all four books, even if you don’t know that George Orwell was actually Eric Arthur Blair. Blair was writing under a pseudonym, whereas the other authors were anonymous.
The view from the EU
The GDPR distinguishes between anonymised and pseudonymous data. Recital 26 of the GDPR describes anonymised data as “personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
Pseudonymous data always allows for some form of re-identification, no matter how unlikely or indirect. You can re-identify it because the process is reversible. The GDPR therefore considers it to be personal data.
However, you cannot (in theory, at least) re-identify anonymous data. If you can guarantee you have irreversibly anonymised personal data, the GDPR no longer classifies it as personal data.
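To make the distinction concrete, here is a minimal Python sketch built on the book example above. The function names (`pseudonymise`, `anonymise`), the record layout and the `P-` prefix are all assumptions made for illustration; they do not come from the GDPR or from any library. The point to notice is that pseudonymisation leaves behind a lookup table (the ‘additional information’), whereas anonymisation keeps nothing that could reverse it.

```python
import secrets

def pseudonymise(records, key_table):
    """Swap each name for a random pseudonym.

    The same name always maps to the same pseudonym, so records can
    still be linked. key_table is the 'additional information' needed
    to reverse the substitution and should be stored separately.
    """
    result = []
    for record in records:
        name = record["name"]
        if name not in key_table:
            key_table[name] = "P-" + secrets.token_hex(4)
        result.append({"author": key_table[name], "title": record["title"]})
    return result

def anonymise(records):
    """Drop the identity altogether; no lookup table is kept.

    Note: a real dataset would also need its indirect identifiers
    handled before it could reasonably be called anonymous.
    """
    return [{"author": "Anon", "title": record["title"]} for record in records]

books = [
    {"name": "Eric Arthur Blair", "title": "Animal Farm"},
    {"name": "Eric Arthur Blair", "title": "Nineteen Eighty-Four"},
]

key_table = {}
pseudonymous_books = pseudonymise(books, key_table)
anonymous_books = anonymise(books)

# Anyone holding the key table can reverse the pseudonymisation.
reverse_table = {pseudonym: name for name, pseudonym in key_table.items()}
print(reverse_table[pseudonymous_books[0]["author"]])  # Eric Arthur Blair
```

With the pseudonymous output you can still tell that one author wrote both books; with the anonymous output you cannot.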
So how can you anonymise or pseudonymise data?
Directory replacement involves modifying individuals’ names within your data, but maintaining consistency between values such as ‘postcode’ and ‘city.’
Scrambling can be reversible, and involves mixing letters. For example, ‘Cruise’ could become ‘Irecus’.
Data blurring approximates data values to render their meaning obsolete and/or make it impossible to identify individuals.
Masking hides sections of data with random characters or other data.
Data encryption translates data into another form, so that only those with access to a decryption key, or password, can read it. (The messaging app WhatsApp, for instance, uses end-to-end encryption. The sender and intended receiver each have unique keys to access any given message sent between them.)
Data encryption is useful in storing different indirect identifiers separately – a key part of any pseudonymisation technique.
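The first four techniques above are simple enough to sketch in a few lines of Python. This is an illustration only, under stated assumptions: the function names, the record fields, the ten-year age bands and the choice to leave four characters visible are all hypothetical, and none of these toy functions is strong enough to protect real personal data on its own.

```python
import random

def directory_replace(record, directory, rng):
    """Directory replacement: swap the name for one drawn from a
    substitute list, leaving postcode and city untouched so they
    stay consistent with each other."""
    replaced = dict(record)
    replaced["name"] = rng.choice(directory)
    return replaced

def scramble(value, rng):
    """Scrambling: shuffle the letters of a value.
    A shuffle can occasionally return the original order."""
    letters = list(value)
    rng.shuffle(letters)
    return "".join(letters)

def blur_age(age, width=10):
    """Blurring: replace an exact age with a band, e.g. 37 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def mask(value, visible=4, fill="*"):
    """Masking: hide everything except the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return fill * hidden + value[hidden:]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
record = {"name": "Cruise", "postcode": "AB12 3CD", "city": "Aberdeen"}

print(directory_replace(record, ["Smith", "Jones"], rng))
print(scramble("Cruise", rng))
print(blur_age(37))              # 30-39
print(mask("4111111111111111"))  # ************1111
```

Encryption is left out here on purpose: it is better handled by a vetted cryptography library than by hand-written code.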
What are direct and indirect identifiers?
You may at times find you need to conceal certain ‘identifiers’ within datasets. At this point, it’s important to distinguish between direct and indirect identifiers. The International Organization for Standardization defines direct identifiers as “data that can be used to identify a person without additional information or with cross-linking through other information that is in the public domain.”
In other words, direct identifiers correspond directly to a person’s identity. They include family names, first names, maiden names and aliases; postal addresses and telephone numbers; and IDs, including social security numbers, bank account details and credit card numbers.
Identifiers such as these can apply to any person, alive or dead. This includes their dependents, ancestors, descendants and other related persons.
In contrast, indirect identifiers are data that do not identify an individual in isolation. They may, however, reveal individual identities if you combine them with additional information. These include information such as gender, date of birth, and postcode.
The risk of re-identification
Research by Latanya Sweeney found that 87 per cent of the US population could be uniquely identified from just their gender, date of birth and five-digit ZIP code.
Because a combination of indirect identifiers like these can be enough to pinpoint a person, it’s prudent to ask the following three questions when dealing with an anonymised dataset:
- Are you able to single out an individual?
- Are you able to link records relating to an individual?
- Can you infer information concerning an individual?
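The first of these questions lends itself to a rough automated check. The sketch below, in Python, counts how many records share each combination of indirect identifiers and flags any record whose combination is unique. The field names and sample records are invented for illustration, and passing this check does not show that a dataset is safe: it says nothing about linking or inference.

```python
from collections import Counter

# Hypothetical field names for the indirect identifiers in a dataset.
QUASI_IDENTIFIERS = ("gender", "date_of_birth", "postcode")

def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of identifiers."""
    return Counter(tuple(record[key] for key in keys) for record in records)

def singled_out(records, keys=QUASI_IDENTIFIERS):
    """Return the records whose combination of identifiers is unique."""
    sizes = group_sizes(records, keys)
    return [
        record for record in records
        if sizes[tuple(record[key] for key in keys)] == 1
    ]

records = [
    {"gender": "F", "date_of_birth": "1980-02-14", "postcode": "AB12", "diagnosis": "asthma"},
    {"gender": "F", "date_of_birth": "1980-02-14", "postcode": "AB12", "diagnosis": "diabetes"},
    {"gender": "M", "date_of_birth": "1975-07-01", "postcode": "EH1", "diagnosis": "asthma"},
]

at_risk = singled_out(records)
print(len(at_risk))  # 1
```

The smallest group size is what the k-anonymity literature calls k; a value of 1, as here, means at least one person can be singled out.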
To reduce the risk of re-identification of pseudonymous data, controllers should have appropriate technical measures in place, such as encryption, hashing or tokenisation. They should also put in place organisational measures, such as policies, agreements and privacy by design, to keep pseudonymous data separate from its identification key.
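As one example of a technical measure, here is a minimal sketch of keyed hashing using Python’s standard `hmac` module. The helper names (`make_key`, `pseudonym`) and the decision to shorten the digest to 16 hexadecimal characters are assumptions made for readability, not a recommendation. The secret key plays the part of the identification key: it should live somewhere separate from the pseudonymous data.

```python
import hashlib
import hmac
import secrets

def make_key():
    """Generate a random 32-byte secret key. Store it apart from the data."""
    return secrets.token_bytes(32)

def pseudonym(identifier, key):
    """Derive a stable pseudonym from an identifier with HMAC-SHA256.

    The digest is shortened to 16 hex characters purely to keep the
    example readable.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

key = make_key()
first = pseudonym("alice@example.com", key)
second = pseudonym("alice@example.com", key)
print(first == second)  # True: same identifier and key, same pseudonym
```

Anyone who holds the key can recompute the pseudonym for a known identifier and so link a record back to a person, which is why data treated this way should still be regarded as pseudonymous rather than anonymous.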
The GDPR and pseudonymous data
The GDPR looks favourably upon pseudonymisation. Recital 29 emphasises the aim “to create incentives to apply pseudonymisation when processing personal data.”
What’s more, Recital 78 and Article 25 list pseudonymisation as a way to demonstrate compliance with requirements such as privacy by design. Pseudonymising personal data is an opportunity to achieve GDPR compliance and to make further use of the data you collect.
Anonymisation and pseudonymisation: Are they actually important?
There’s no silver bullet when it comes to data security. Despite any measures you put in place, pseudonymous data can still be re-identified, precisely because pseudonymisation is a reversible process. Nor is data anonymisation a failsafe option.
AOL, Netflix and the New York City Taxi and Limousine Commission all released anonymised datasets to the public. Subsequently, external actors were able to identify individuals in each dataset; Thelma Arnold, identified from AOL’s search logs, is the best-known example. However, implemented well, both pseudonymisation and anonymisation have their uses.
Organisations commonly employ pseudonymisation when using barcode scanners at events and exhibitions. Each barcode represents a number, which in turn refers to an attendee. You can, therefore, look up information on each delegate (for example, if they have arrived) without having to reveal who they are.
More broadly, as an international company, you can leverage pseudonymisation to utilise relevant data for marketing purposes across borders. Applying pseudonyms to sections of data enables you to share that (pseudonymous) data with another region, while storing data subjects’ full information at source.
Anonymisation is more commonly used with highly sensitive data, such as medical and financial records. The Australian government, for example, published anonymised Medicare data last year. In this case, however, researchers in Melbourne were able to re-identify individuals from the data released. There was simply too much information available in the dataset to prevent inference, and therefore re-identification. The researchers highlighted the importance of not publishing data at the level of the individual. Instead, those releasing the data should have employed data blurring techniques to protect the identities of the data subjects.
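One way to avoid publishing at the level of the individual is to release only grouped counts and to hold back any group that is too small. The Python sketch below shows the idea. It is not the method used or proposed in the Australian case; the field names, the ten-year age bands and the threshold of five are arbitrary assumptions chosen for illustration.

```python
from collections import Counter

def age_band(age, width=10):
    """Blur an exact age into a band, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def aggregate(records, min_count=5):
    """Count records per (age band, region) and drop small groups.

    Groups with fewer than min_count records are suppressed, because
    a very small group can still point to an individual.
    """
    counts = Counter((age_band(r["age"]), r["region"]) for r in records)
    return {group: n for group, n in counts.items() if n >= min_count}

patients = [{"age": 30 + i, "region": "VIC"} for i in range(5)]
patients.append({"age": 71, "region": "NSW"})

print(aggregate(patients))  # {('30-39', 'VIC'): 5}
```

The single 71-year-old in NSW disappears from the published table, which is the intended effect: a count of one is an individual record in all but name.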
To conclude, anonymous and pseudonymous data both have important roles to play within organisations. However, it is crucial to be aware of the risks they carry with them, and to manage those risks responsibly.