On "Personally Identifiable Information"

"Personally identifiable information" is a legal term used in two related but distinct contexts. The first context is a series of breach-disclosure laws enacted in recent years in response to security breaches involving customer data that could enable identity theft.

California Senate Bill 1386 [13] is a representative example. It defines "personal information" as follows:

[An] individual's first name or first initial and last name in combination with any one or more of the following data elements, when either the name or the data elements are not encrypted:

Two points are worthy of note. First, the spirit of the terminology is to capture the types of information that are commonly used for authenticating an individual. This reflects the bill's intent to deter identity theft. Consequently, data such as email addresses and telephone numbers do not fall under the scope of this law. Second, it is the personal information itself that is sensitive, rather than the fact that it is possible to associate sensitive information with an identity.

The second context in which the term "personally identifiable information" appears is privacy law. In the United States, the Privacy Act of 1974 [84] regulates the collection of personal information by government agencies, but there is no overarching law regulating private entities. At least three bills that would have created one, all introduced in 2005, failed to pass: the Privacy Act of 2005 [88], the Consumer Privacy Protection Act of 2005 [86], and the Online Privacy Protection Act of 2005 [87]. However, laws covering specific types of data do exist, such as the Video Privacy Protection Act (VPPA) [83] and the Health Insurance Portability and Accountability Act (HIPAA).

The language from the HIPAA Privacy Rule [85] is representative:

Individually identifiable health information is information

  1. That identifies the individual; or
  2. With respect to which there is a reasonable basis to believe the information can be used to identify the individual.

The spirit of the law clearly encompasses deductive disclosure, and the term "reasonable basis" leaves the defining line open to interpretation by case law. We are not aware of any court decisions that define identifiability.

Individual U.S. states do have privacy protection laws that apply to any operator, such as California's Online Privacy Protection Act of 2003 [14]. Some countries other than the United States have similar generic laws, such as Canada's Personal Information Protection and Electronic Documents Act (PIPEDA) [65]. The European Union is notorious for the broad scope and strict enforcement of its privacy laws. The EU privacy directive defines "personal data" as follows [26]:

any information relating to an identified or identifiable natural person [...]; an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.

It is clear from the above that privacy law, unlike breach-disclosure law, generally interprets personally identifiable information broadly: any data that can be linked back to an individual is covered, so syntactic anonymization (removing names and other explicit identifiers) does not take data outside its scope. This distinction appears to be almost universally lost on companies that collect and share personal information, as illustrated by the following Senate Committee testimony by Chris Kelly, Chief Privacy Officer of Facebook [42]:

The critical distinction that we embrace in our policies and practices, and that we want our users to understand, is between the use of personal information for advertisements in personally-identifiable form, and the use, dissemination, or sharing of information with advertisers in non-personally-identifiable form. Ad targeting that shares or sells personal information to advertisers (name, email, other contact oriented information) without user control is fundamentally different from targeting that only gives advertisers the ability to present their ads based on aggregate data.

Finally, it is important to understand that the term "personally identifiable information" has no particular technical meaning. Algorithms that can identify a user in an anonymized dataset are agnostic to the semantics of the data elements. While some data elements may be uniquely identifying on their own, any element can be identifying in combination with others. The feasibility of such re-identification has been amply demonstrated by the AOL privacy fiasco [10], de-anonymization of the Netflix Prize dataset [61], and the work presented in this paper. It is regrettable that the mistaken dichotomy between personally identifying and non-personally identifying attributes has crept into the technical literature in phrases such as "quasi-identifier."
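
The point that any element can be identifying in combination with others can be illustrated with a small sketch. The following Python fragment is not the de-anonymization algorithm of this paper; it is only a minimal uniqueness count over hypothetical toy records. The records, the attribute names (zip, birth_year, sex), and the helper functions unique_fraction and uniqueness_by_subset are all assumptions made up for this illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records (an assumption for illustration only): no names
# and no account numbers, i.e. nothing a breach-disclosure statute lists.
RECORDS = [
    {"zip": "78701", "birth_year": 1980, "sex": "F"},
    {"zip": "78701", "birth_year": 1980, "sex": "M"},
    {"zip": "78701", "birth_year": 1975, "sex": "F"},
    {"zip": "78705", "birth_year": 1980, "sex": "F"},
    {"zip": "78705", "birth_year": 1975, "sex": "M"},
    {"zip": "78705", "birth_year": 1975, "sex": "F"},
]


def unique_fraction(records, attributes):
    """Fraction of records whose values on `attributes` match no other record."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


def uniqueness_by_subset(records):
    """Map every non-empty subset of attribute names to its unique fraction."""
    attrs = sorted(records[0])
    return {
        subset: unique_fraction(records, subset)
        for size in range(1, len(attrs) + 1)
        for subset in combinations(attrs, size)
    }


if __name__ == "__main__":
    for subset, fraction in uniqueness_by_subset(RECORDS).items():
        print(subset, round(fraction, 2))
```

In this toy dataset no single attribute singles out any record, each pair of attributes isolates a third of the records, and all three attributes together isolate every record. The computation never consults what the attributes mean, which is the sense in which such algorithms are agnostic to the semantics of the data elements.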

Arvind Narayanan 2009-03-19