State of the Union

The attacks described in this paper target anonymized, sanitized versions of social networks, using partial auxiliary information about a subset of their members. To show that both anonymized networks and auxiliary information are widely available, we survey real-world examples of social-network data sharing, most of which involve releasing more information than needed for our attack.

Academic and government data-mining. Social networks used for published data-mining research include the mobile-phone call graphs of, respectively, 7 million [63], 3 million [60], and 2.5 million [48] customers, as well as the land-line phone graph of 2.1 million Hungarian users [47]. Corporations like AT&T, whose own database of 1.9 trillion phone calls goes back decades [39], have in-house research facilities, but smaller operators must share their graphs with external researchers. Phone-call networks are also commonly used to detect illicit activity such as calling fraud [90] and for national security purposes, such as identifying the command-and-control structures of terrorist cells by their idiosyncratic sub-network topologies [39]. A number of companies sell data-mining solutions to governments for this purpose [75].

Sociologists, epidemiologists, and health-care professionals collect data about geographic, friendship, family, and sexual networks to study disease propagation and risk. For example, the Add Health dataset includes the sexual-relationship network of almost 1,000 students of an anonymous Midwestern high school as part of a detailed survey on adolescent health [2]. While the Add Health project takes a relatively enlightened stance on privacy [1], this graph has been published in an anonymized form [11].

For online social networks, the data can be collected by crawling either via an API, or "screen-scraping" (e.g., Mislove et al. crawled Flickr, YouTube, LiveJournal, and Orkut [58]; anonymized graphs are available by request only). We stress that even when obtained from public websites, this kind of information--if publicly released--still presents privacy risks because it helps attackers who lack resources for massive crawls. In some online networks, such as LiveJournal and the Experience Project, user profiles and relationship data are public, but many users maintain pseudonymous profiles. From the attacker's perspective, this is the same as publishing the anonymized network.

Advertising. With the emergence of concrete evidence that social-network data makes commerce much more profitable [70; 78], network operators are increasingly sharing their graphs with advertising partners to enable better social targeting of advertisements. For example, Facebook explicitly says that users' profiles may be shared for the purpose of personalizing advertisements and promotions, as long as the individual is not explicitly identified [27]. Both Facebook and MySpace allow advertisers to use friends' profile data for ad targeting [20]. Social-network-driven advertising has been pursued by many startups [24; 59] and even Google [71], typically relying on anonymity to prevent privacy breaches [5; 25; 62].

Third-party applications. The number of third-party applications on Facebook alone is in the tens of thousands and rapidly growing [72]. The data from multiple applications can be aggregated and used for targeted advertising (e.g., as done by SocialMedia [69]). As the notion of social networking as a feature rather than destination takes hold [4], many other networks are trying to attract application developers; on the Ning platform, which claims over 275,000 networks, each network can be considered a third-party application. The data given to third-party applications is usually not anonymized, even though most applications would be able to function on anonymized profiles [28].

Third-party applications have a poor track record of respecting privacy policies. For example, a security hole in a Facebook application developed by Slide, Inc. "exposed the birthdays, gender, and relationship status of strangers, including Facebook executives, [and] the wife of Google co-founder Larry Page" [57]. WidgetLaboratory, one of the most popular developers for the Ning platform, was banned permanently after "gathering credentials from users and otherwise creating havoc on Ning networks" [6]. Therefore, it is important to understand what a malicious third-party application can learn about members of a social network, even if it obtains the data in an anonymized form.

Aggregation. Aggregation of information from multiple social networks, facilitated by projects such as OpenID [64], DataPortability [21], the "social graph" project [29], and various microformats [56], potentially presents a greater threat to individual privacy than one-time data releases. Existing aggregators include FriendFeed, MyBlogLog, Jaiku (recently acquired by Google), and Plaxo; the latter even provides an open-source "social graph crawler" [67]. Aggregated networks are an excellent source of auxiliary information for our attacks.

Other data-release scenarios. WellNet is a health-care co-ordination service which enables employers to monitor the social network in real time in order to track employees' medical and pharmacy activity [55]. The data is anonymized.

In "friend-to-friend networking," a peer-to-peer file-sharing network is overlaid on social links [68] in order to defeat censor nodes such as the RIAA. Nodes are pseudonymous and communication is encrypted. Since traffic is typically not anonymized at the network level, the logs that can be obtained, for example, by subpoenaing the ISP are essentially anonymized social-network graphs.

Finally, consider photographs published online without identifying information. The accuracy of face recognition can be improved substantially by exploiting the fact that users who appear together in photographs are likely to be neighbors in the social network [76]. Since most online photographs appear in a social-network context, they effectively represent an anonymized graph, and techniques developed in this paper can help in large-scale facial re-identification.

Arvind Narayanan 2009-03-19