A Few Home Truths About Anonymised Personal Information.

The following fascinating article arrived a few days ago.

"Anonymized" data really isn't—and here's why not

Companies continue to store and sometimes release vast databases of "anonymized" information about users. But, as Netflix, AOL, and the State of Massachusetts have learned, "anonymized" data can often be cracked in surprising ways, revealing the hidden secrets each of us is assembling in online "databases of ruin."

By Nate Anderson | Last updated September 8, 2009 6:25 AM CT

The Massachusetts Group Insurance Commission had a bright idea back in the mid-1990s—it decided to release "anonymized" data on state employees that showed every single hospital visit. The goal was to help researchers, and the state spent time removing all obvious identifiers such as name, address, and Social Security number. But a graduate student in computer science saw a chance to make a point about the limits of anonymization.

Latanya Sweeney requested a copy of the data and went to work on her "reidentification" quest. It didn't prove difficult. Law professor Paul Ohm describes Sweeney's work:

At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.

Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
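
To make the mechanics concrete, here is a minimal sketch in Python of this kind of linkage attack. Every record layout, name, and diagnosis below is fabricated for illustration; real attacks run over far larger datasets, but the join on the quasi-identifier triple is the same idea.

  from collections import defaultdict

  # Fabricated, simplified record layouts -- not the actual GIC or voter-roll schemas.
  hospital_records = [
      # (ZIP, birth date, sex, diagnosis): "anonymized" because name/address/SSN are gone
      ("02138", "1945-07-31", "M", "example diagnosis"),
      ("02139", "1967-02-14", "F", "example diagnosis"),
  ]
  voter_rolls = [
      # (name, ZIP, birth date, sex): publicly purchasable, identities included
      ("A. Voter", "02138", "1945-07-31", "M"),
      ("B. Voter", "02139", "1980-05-01", "F"),
  ]

  # Index voters by the quasi-identifier triple (ZIP, birth date, sex).
  voters_by_triple = defaultdict(list)
  for name, zip_code, birth_date, sex in voter_rolls:
      voters_by_triple[(zip_code, birth_date, sex)].append(name)

  # A hospital record is re-identified when exactly one voter shares its triple.
  for zip_code, birth_date, sex, diagnosis in hospital_records:
      matches = voters_by_triple[(zip_code, birth_date, sex)]
      if len(matches) == 1:
          print(f"Re-identified {matches[0]}: {diagnosis}")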

Such work by computer scientists over the last fifteen years has shown a serious flaw in the basic idea behind "personal information": almost all information can be "personal" when combined with enough other relevant bits of data.

That's the claim advanced by Ohm in his lengthy new paper on "the surprising failure of anonymization." As increasing amounts of information on all of us are collected and disseminated online, scrubbing data just isn't enough to keep our individual "databases of ruin" out of the hands of the police, political enemies, nosy neighbors, friends, and spies.

If that doesn't sound scary, just think about your own secrets, large and small—those films you watched, those items you searched for, those pills you took, those forum posts you made. The power of reidentification brings them closer to public exposure every day. So, in a world where the PII concept is dying, how should we start thinking about data privacy and security?

Don't ruin me

For almost every person on earth, there is at least one fact stored in a computer database that an adversary could use to blackmail, discriminate against, or harass them, or to steal their identity. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm.

Examples of anonymization failures aren't hard to find.

When AOL researchers released a massive dataset of search queries, they first "anonymized" the data by scrubbing user IDs and IP addresses. When Netflix made a huge database of movie ratings available for study, it spent time doing the same thing. Despite scrubbing the obviously identifiable information from the data, computer scientists were able to identify individual users in both datasets. (The researchers who de-anonymized the Netflix data then moved on to Twitter users.)
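
Both attacks follow the same background-knowledge pattern: score every "anonymous" record against what is already publicly known about a target. A toy sketch of that idea (this is not the actual Narayanan-Shmatikov algorithm, and all usernames and ratings are fabricated):

  # Toy background-knowledge matching; all usernames and ratings are fabricated.
  anonymized_ratings = {
      "user_17": {"Movie A": 5, "Movie B": 1, "Movie C": 4},
      "user_42": {"Movie A": 2, "Movie D": 5},
  }

  # What an attacker might already know about a target, e.g. from public reviews.
  known_about_target = {"Movie A": 5, "Movie C": 4}

  def overlap_score(profile, background):
      # Count the (movie, rating) pairs a profile shares with the background knowledge.
      return sum(1 for movie, rating in background.items()
                 if profile.get(movie) == rating)

  best_match = max(anonymized_ratings,
                   key=lambda u: overlap_score(anonymized_ratings[u], known_about_target))
  print(best_match)  # user_17: sparse, distinctive histories make matches near-unique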

More examples and discussion here:

http://arstechnica.com/tech-policy/news/2009/09/your-secrets-live-online-in-databases-of-ruin.ars

Now this is not an easy area at all, as the difficulties cited above show.

In Australia, Standards Australia's IT-14 committee is on the case with a project described as follows:

Project 9002

“This activity is being developed based upon international activities and consideration of the needs for Australianisation of the processes for de-identification. It includes processes and requirements for ensuring the privacy of personal information, particularly to support secondary data use and reporting. This is an international activity to which Australia has actively contributed. This is a joint activity of several working groups.”

This information is found here:

http://www.e-health.standards.org.au/drafts.asp?area=projects&recid=128

This work is based on the recently published ISO Technical Specification ISO/TS 25237:2008.

Here is the introduction to the release.

"Pseudonymization" – new ISO specification supports privacy protection in health informatics

10/3/09:

A new ISO technical specification will help to reconcile the increasing use in healthcare of electronic processing of patient data with increasing patient expectations for privacy protection.

In the healthcare sector, concerns about protecting private data are an overriding consideration and such concerns are intensifying with the continuing progress in the use of information and communication technology (ICT) tools and solutions to improve health services.

ISO/TS 25237:2008, Health informatics – Pseudonymisation, contains principles and requirements for privacy protection using pseudonymisation services for the protection of personal health information in databases.

Pseudonymisation (from pseudonym) allows for the removal of an association with a data subject. It differs from anonymisation (anonymous) in that it allows for data to be linked to the same person across multiple data records or information systems without revealing the identity of the person.

The technique is recognised as an important method for privacy protection of personal health information. It can be performed with or without the possibility of re-identifying the subject of the data (reversible or irreversible pseudonymisation).
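
As a loose illustration of the property described above (not necessarily how the specification itself realises it), a keyed hash gives linkability without identity: the same identifier always yields the same pseudonym, while re-identification requires the key or a protected lookup table. A minimal sketch under those assumptions:

  import hmac
  import hashlib

  # Assumption for illustration: a secret key held only by the pseudonymisation service.
  SECRET_KEY = b"example-key-held-by-the-trusted-service"

  def pseudonymize(patient_id: str) -> str:
      # Same input -> same pseudonym, so one person's records link across systems.
      return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

  # Reversible pseudonymisation keeps a protected table for authorised re-identification;
  # irreversible pseudonymisation stores no such table (and may discard the key).
  reidentification_table = {}

  def pseudonymize_reversible(patient_id: str) -> str:
      pseudonym = pseudonymize(patient_id)
      reidentification_table[pseudonym] = patient_id
      return pseudonym

  # Two records for the same patient carry the same pseudonym, but no identity.
  assert pseudonymize("patient-12345") == pseudonymize("patient-12345")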

ISO/TS 25237:2008 is applicable to organisations that make a claim of trustworthiness for operations engaged in pseudonymisation services, which may be national or trans-border.

It will serve as a general guide for implementers, as well as for quality assurance purposes, assisting users to determine their trust in the services provided. Application areas include, but are not limited to:

  • Research, or other secondary use of clinical data
  • Clinical trials and post-marketing surveillance
  • Public health monitoring and assessment
  • Confidential patient-safety reporting (e.g. adverse drug effects)
  • Comparative quality indicator reporting
  • Peer review
  • Consumer groups.

ISO/TS 25237:2008 was developed by ISO technical committee ISO/TC 215, Health informatics. It provides a conceptual model of the problem areas, requirements for trustworthy practices, and specifications to support the planning and implementation of pseudonymisation services. More precisely, it:

  • Defines a basic concept for pseudonymisation
  • Gives an overview of different use cases for pseudonymisation that can be both reversible and irreversible
  • Defines a basic methodology for pseudonymisation services including organisational as well as technical aspects
  • Gives a guide to risk assessment for re-identification (a toy illustration of this idea follows the list)
  • Specifies a policy framework and minimal requirements for trustworthy practice for the operations of a pseudonymisation service
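
The release only names a guide to re-identification risk assessment; one widely used way to think about such risk (an assumption here, not necessarily the standard's methodology) is to measure equivalence classes, i.e. how many records share each quasi-identifier combination, the idea behind k-anonymity. A toy sketch with fabricated data, as flagged in the list above:

  from collections import Counter

  # Fabricated quasi-identifier tuples: (ZIP prefix, birth year, sex).
  records = [
      ("021", 1945, "M"), ("021", 1945, "M"),
      ("021", 1967, "F"), ("029", 1980, "F"),
  ]

  # Equivalence class size = number of records sharing a quasi-identifier tuple.
  class_sizes = Counter(records)

  K = 2  # assumed policy threshold: classes smaller than K are high-risk
  at_risk = sum(size for size in class_sizes.values() if size < K)
  print(f"{at_risk / len(records):.0%} of records fall below k={K}")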

The full release is here:

http://www.iso.org/iso/pressrelease.htm?refid=Ref1209

The scope of the issues raised, and the article that stimulated this post, make it vital that this standard be worked through, approved, and applied in Australia sooner rather than later!

David.
