Related Work

Privacy properties. A social network consists of nodes, edges, and information associated with each node and edge. The existence of an edge between two nodes can be sensitive: for instance, in a sexual-relationship network with gender information attached to nodes [11] it can reveal sexual orientation. Edge privacy was considered in [44; 7]. In most online social networks, however, edges are public by default, and few users change the default settings [34].

While the mere presence of an edge may not be sensitive, edge attributes may reveal more information (e.g., a single phone call vs. a pattern of calls indicative of a business or romantic relationship). For example, phone-call patterns of the disgraced NBA referee Tom Donaghy have been used in the investigation [91]. In online networks such as LiveJournal, there is much variability in the semantics of edge relationships [30].

The attributes attached to nodes, such as the user's interests, are usually far more sensitive. Social Security numbers can be predicted from Facebook profiles with higher accuracy than random guessing [34]; see [17] for other privacy breaches based on profile data. Even implicit attributes such as node degree can be highly sensitive, e.g., in a sexual network [11]. Existing defenses focus on names and other identifiers, but basic de-anonymization only reveals that someone belongs to the network, which is hardly sensitive. As we show in the rest of this paper, however, it can be used as a vehicle for more serious attacks on privacy, including disclosure of sensitive attributes.

De-anonymization attacks. Backstrom et al. present two active attacks on edge privacy in anonymized social networks [7]. These active attacks fundamentally assume that the adversary is able to modify the network prior to its release: "an adversary chooses an arbitrary set of users whose privacy it wishes to violate, creates a small number of new user accounts with edges to these targeted users, and creates a pattern of links among the new accounts with the goal of making it stand out in the anonymized graph structure." Both attacks involve creating $O(\log N)$ new "sybil" nodes ($N$ is the total number of nodes), whose outgoing edges help re-identify quadratically as many existing nodes.

Active attacks are difficult to stage on a large scale. First, they are restricted to online social networks (OSNs); creating thousands of fake nodes in a phone-call or real-life network is prohibitively expensive or impossible. Even in OSNs, many operators (e.g., Facebook) check the uniqueness of email addresses and deploy other methods for verifying accuracy of supplied information, making creation of a large number of dummy nodes relatively difficult.

Second, the attacker has little control over the edges incoming to the nodes he creates. Because most legitimate users will have no reason to link back to the sybil nodes, a subgraph with no incoming edges but many outgoing edges will stand out. As we show below, this may enable the network operator to recognize that the network has been compromised by a sybil attack. There are also other techniques for identifying sybil attacks in social networks [93], including methods for spammer detection deployed by OSNs that allow unidirectional edges [73].

We carried out an experiment to verify the claim that identification of subgraphs consisting primarily of sybil nodes is difficult in real-world social networks. The data for this experiment was the graph of LiveJournal obtained from Mislove et al. [58], crawled in late 2006. It is a directed graph with 5.3 million nodes and 77 million edges. Except for the time of the crawl, this graph is similar to that used in [7].

The cut-based attack of [7] creates 7-node subgraphs containing a Hamiltonian path. In contrast to the observation in [7] that every possible 7-node subgraph containing a Hamiltonian path occurs in the LiveJournal graph, there are no subgraphs in the LiveJournal graph that have these two properties and, furthermore, do not have any incoming edges. We conclude that active attacks are easy to detect if real users never link back to sybil nodes. More sophisticated sybil-detection techniques may work as long as only a small percentage of real users link back to sybil nodes.

The third limitation of active attacks is the fact that many OSNs require a link to be mutual before the information is made available in any form. Therefore, assuming that real users do not link back to dummy users, the links from fake nodes to real ones do not show up in the network.

We conclude that large-scale active attacks requiring creation of tens of thousands of sybil nodes are unlikely to be feasible. Active attacks can still be useful in identifying or creating a small set of "seeds" to serve as a starting point for large-scale, passive privacy breaches. We develop such an attack in Section 5.2.

Backstrom et al. also describe passive attacks, in which a small coalition of users discover their location in the anonymized graph by utilizing the knowledge of the network structure around them. This attack is realistic, but again, only works on a small scale: the colluding users can only compromise the privacy of some of the users who are already their friends.

By contrast, our attack does not require creation of a large number of sybil nodes, and--as shown by our experiments on real-world online social networks--can be successfully deployed on a very large scale.

Defenses. Existing privacy protection mechanisms for social networks are only effective against very restricted adversaries and have been evaluated on small, simulated networks whose characteristics are different from real social networks. For example, Zheleva and Getoor give several strategies for preventing link re-identification [94], but the model ignores auxiliary information that may be available to the attacker.

An unusual attempt to prevent network operators from capitalizing on user-provided data appears in [35]. It involves scrambling the profiles when they are sent to the server and client-side unscrambling when a friend's profile is viewed. Building and running such a system involves constant reverse-engineering of communication between the client and the server. Further, all of a user's friends need to use the system, flatly contradicting the claim of incremental deployability. A similar idea appears in [52], with a more sound architecture based on a server-side Facebook application. Both approaches severely cripple social-network functionality because almost any non-trivial action other than viewing another user's profile or messages requires the server to manipulate the data in a way which is not possible under encryption.

Anonymity is a popular approach to protecting privacy. Felt and Evans propose a system where applications see randomized tokens representing users instead of actual identifiers [28]. Frikken and Golle show how to compute an anonymous graph from pieces held by different participants in order to perform privacy-preserving social-network analysis [31]. Kerschbaum and Schaad additionally enable participants to track their position in the anonymous graph [43].

Several papers proposed variants of $k$-anonymity for social networks. For example, Hay et al. require nodes to be automorphically equivalent [38], i.e., there must exist automorphisms of the graph that map each of $k$ nodes to one another. This is an extremely strong structural requirement, which is achieved only against severely restricted adversaries: in one model, the attacker only has information about degree sequences around his target node; in another, partial knowledge of the structure in the vicinity of the target. The technique appears to work only if the average degree is low, ruling out most online social networks.

Liu and Terzi consider node re-identification assuming that the adversary's auxiliary information consists only of node degrees [51]. There is no clear motivation for this restriction. Campan and Truta propose metrics for the information loss caused by edge addition and deletion and apply $k$-anonymity to node attributes as well as neighborhood structure [15]. Zhou and Pei assume that the adversary knows the exact $1$-neighborhood of the target node [95]. The anonymization algorithm attempts to make this 1-neighborhood isomorphic to $k-1$ other 1-neighborhoods via edge addition. The experiments are performed on an undirected network with average degree 4 (an order of magnitude lower than that in real social networks) and already require increasing the number of edges by 6%. The number of edges to be added and the computational effort are likely to rise sharply with the average degree.

The fundamental problem with $k$-anonymity is that it is a syntactic property which may not provide any privacy even when satisfied (e.g., if all $k$ isomorphic neighborhoods have the same value of some sensitive attributes). Crucially, all of these defenses impose arbitrary restrictions on the information available to the adversary and make arbitrary assumptions about the properties of the social network.

We argue that the auxiliary information which is likely to be available to the attacker is global in nature (e.g., another social network with partially overlapping membership) and not restricted to the neighborhood of a single node. In the rest of this paper, we show how this information, even if very noisy, can be used for large-scale re-identification. Existing models fail to capture self-reinforcing, feedback-based attacks, in which re-identification of some nodes provides the attacker with more auxiliary information, which is then used for further re-identification. Development of a model for such attacks is our primary contribution.

Arvind Narayanan 2009-03-19