Our primary goal for participating in the Challenge was to raise attention to the ever-present possibility of de-anonymization in such contests. More broadly one might ask whether contests should discourage the use of outside information in general—trading off the benefits of leveraging outside information to genuinely improve the outcomes of data mining, and preventing cheating—and what technical or policy means could be used to limit such information.

In addition to undermining the aim of the contest, which was to advance the state-of-the-art in machine learning, de-anonymization also has privacy implications (although these did not arise in this contest). For these reasons, it is important to think about how to prevent de-anonymization in future contests. One factor that greatly affects both risks—in opposite directions—is whether the underlying data is already publicly available. If it is, then there is likely no privacy risk; however, it furnishes a ready source of high-quality data to game the contest. On the other hand, if the data is not public, then it can only be exploited by cross-referencing it through de-anonymization with another dataset of the same type (or with some of the same attributes) that is public.10 While this presents a significant privacy risk, imperfect correlation between the two datasets makes it unclear whether there is a threat to the fidelity of the contest.

Focusing on the threat of de-anonymization to the contest, incorporating a restriction on the use of external data in the rules is a necessary step, but arguably insufficient, since less scrupulous contestants may attempt to apply de-anonymization anyway. We can see two possible ways to combat this threat. The first is the approach taken by the Netflix Prize—require teams qualifying for prizes to produce source code and human-readable descriptions of algorithms. The Netflix Prize stipulated that the verification process would “ensure that the provided algorithm description and source code could reasonably have generated the prediction sets submitted.”

The loophole in this approach is the possibility of overfitting. While source-code verification would undoubtedly catch a contestant who achieved their results using de-anonymization alone, the more realistic threat is that of de-anonymization being used to bridge a small gap. In this scenario, a machine learning algorithm would be trained on the test set, the correct results having been obtained via de-anonymization. Since successful ML solutions are composites of numerous algorithms, and consequently have a huge number of parameters, it should be possible to conceal a significant amount of overfitting in this manner.

Another approach is to attempt to prevent the possibility of de-anonymization. This is a daunting task: a long line of research has demonstrated the feasibility of de-anonymization on various datasets [25; 20; 12; 16; 23; 24; 13; 1] and theoretical evidence exists for the impossibility of de-anonymization of high-dimensional data [3].

Several papers have claimed to provide k-anonymity, or variants thereof, for graphs [15; 19; 9; 28]. These have only been evaluated against graphs with a small number of nodes and small average degree. Further, they only consider the threat of adversaries with restricted types of auxiliary information, which in particular does not include global information such as another graph that is structurally related to the target graph. For these reasons we do not believe these techniques are effective ways to anonymize social networking graphs.

Finally, differential privacy offers a potential method to bypass the need for anonymization [11]. McSherry and Mironov show how the Netflix Prize contest could have been run by releasing a differentially private dataset (primarily, a sanitized item-item covariance matrix) instead of raw user data [21]. The drawbacks of this approach are that it would restrict the class of machine learning algorithms, and may require a shift from the “release-and-forget” model to an online privacy-preserving computation model, especially due to the difficulty of generating synthetic datasets that preserve differential privacy [26]. Nevertheless, differential privacy remains essentially the only mathematically sound methodology for provably resisting de-anonymization.


... public.10
For example, Narayanan and Shmatikov were able to re-identify anonymized records in the Netflix Prize dataset with user accounts on IMDb via their public movie reviews [23].
Arvind Narayanan 2011-03-09