Privacy properties. A social network consists of nodes, edges, and information associated with each node and edge. The existence of an edge between two nodes can be sensitive: for instance, in a sexual-relationship network with gender information attached to nodes [11] it can reveal sexual orientation. Edge privacy was considered in [44; 7]. In most online social networks, however, edges are public by default, and few users change the default settings [34].
While the mere presence of an edge may not be sensitive, edge attributes may reveal more information (e.g., a single phone call vs. a pattern of calls indicative of a business or romantic relationship). For example, phone-call patterns of the disgraced NBA referee Tom Donaghy have been used in the investigation [91]. In online networks such as LiveJournal, there is much variability in the semantics of edge relationships [30].
The attributes attached to nodes, such as the user's interests, are usually far more sensitive. Social Security numbers can be predicted from Facebook profiles with higher accuracy than random guessing [34]; see [17] for other privacy breaches based on profile data. Even implicit attributes such as node degree can be highly sensitive, e.g., in a sexual network [11]. Existing defenses focus on names and other identifiers, but basic de-anonymization only reveals that someone belongs to the network, which is hardly sensitive. As we show in the rest of this paper, however, it can be used as a vehicle for more serious attacks on privacy, including disclosure of sensitive attributes.
De-anonymization attacks.
Backstrom et al. present two active attacks on edge
privacy in anonymized social networks [7]. These active attacks
fundamentally assume that the adversary is able to modify the network
prior to its release: "an adversary chooses an arbitrary set of users
whose privacy it wishes to violate, creates a small number of new user
accounts with edges to these targeted users, and creates a pattern of
links among the new accounts with the goal of making it stand out in the
anonymized graph structure." Both attacks involve creating
new "sybil" nodes (
is the total number of nodes), whose outgoing
edges help re-identify quadratically as many existing nodes.
Active attacks are difficult to stage on a large scale. First, they are restricted to online social networks (OSNs); creating thousands of fake nodes in a phone-call or real-life network is prohibitively expensive or impossible. Even in OSNs, many operators (e.g., Facebook) check the uniqueness of email addresses and deploy other methods for verifying accuracy of supplied information, making creation of a large number of dummy nodes relatively difficult.
Second, the attacker has little control over the edges incoming to the nodes he creates. Because most legitimate users will have no reason to link back to the sybil nodes, a subgraph with no incoming edges but many outgoing edges will stand out. As we show below, this may enable the network operator to recognize that the network has been compromised by a sybil attack. There are also other techniques for identifying sybil attacks in social networks [93], including methods for spammer detection deployed by OSNs that allow unidirectional edges [73].
We carried out an experiment to verify the claim that identification of subgraphs consisting primarily of sybil nodes is difficult in real-world social networks. The data for this experiment was the graph of LiveJournal obtained from Mislove et al. [58], crawled in late 2006. It is a directed graph with 5.3 million nodes and 77 million edges. Except for the time of the crawl, this graph is similar to that used in [7].
The cut-based attack of [7] creates 7-node subgraphs containing a Hamiltonian path. In contrast to the observation in [7] that every possible 7-node subgraph containing a Hamiltonian path occurs in the LiveJournal graph, there are no subgraphs in the LiveJournal graph that have these two properties and, furthermore, do not have any incoming edges. We conclude that active attacks are easy to detect if real users never link back to sybil nodes. More sophisticated sybil-detection techniques may work as long as only a small percentage of real users link back to sybil nodes.
The third limitation of active attacks is the fact that many OSNs require a link to be mutual before the information is made available in any form. Therefore, assuming that real users do not link back to dummy users, the links from fake nodes to real ones do not show up in the network.
We conclude that large-scale active attacks requiring creation of tens of thousands of sybil nodes are unlikely to be feasible. Active attacks can still be useful in identifying or creating a small set of "seeds" to serve as a starting point for large-scale, passive privacy breaches. We develop such an attack in Section 5.2.
Backstrom et al. also describe passive attacks, in which a small coalition of users discover their location in the anonymized graph by utilizing the knowledge of the network structure around them. This attack is realistic, but again, only works on a small scale: the colluding users can only compromise the privacy of some of the users who are already their friends.
By contrast, our attack does not require creation of a large number of sybil nodes, and--as shown by our experiments on real-world online social networks--can be successfully deployed on a very large scale.
Defenses.
Existing privacy protection mechanisms for social networks
are only effective against very restricted adversaries and
have been evaluated on small, simulated networks whose
characteristics are different from real social networks.
For example, Zheleva and Getoor give
several strategies for preventing link re-identification [94],
but the model ignores auxiliary information that may be available to
the attacker.
An unusual attempt to prevent network operators from capitalizing on user-provided data appears in [35]. It involves scrambling the profiles when they are sent to the server and client-side unscrambling when a friend's profile is viewed. Building and running such a system involves constant reverse-engineering of communication between the client and the server. Further, all of a user's friends need to use the system, flatly contradicting the claim of incremental deployability. A similar idea appears in [52], with a more sound architecture based on a server-side Facebook application. Both approaches severely cripple social-network functionality because almost any non-trivial action other than viewing another user's profile or messages requires the server to manipulate the data in a way which is not possible under encryption.
Anonymity is a popular approach to protecting privacy. Felt and Evans propose a system where applications see randomized tokens representing users instead of actual identifiers [28]. Frikken and Golle show how to compute an anonymous graph from pieces held by different participants in order to perform privacy-preserving social-network analysis [31]. Kerschbaum and Schaad additionally enable participants to track their position in the anonymous graph [43].
Several papers proposed variants of
-anonymity for social networks.
For example, Hay et al. require nodes to be automorphically
equivalent [38], i.e., there must exist automorphisms of
the graph that map each of
nodes to one another.
This is an extremely strong structural requirement, which
is achieved only against severely restricted adversaries: in one model,
the attacker only has information about degree sequences around his target
node; in another, partial knowledge of the structure in the vicinity of
the target. The technique appears to work only if the average
degree is low, ruling out most online social networks.
Liu and Terzi consider node re-identification assuming that
the adversary's auxiliary information consists only of node
degrees [51]. There is no clear motivation for this restriction.
Campan and Truta propose metrics for the information loss caused by
edge addition and deletion and apply
-anonymity to node attributes
as well as neighborhood structure [15]. Zhou and Pei assume
that the adversary knows the exact
-neighborhood of the target
node [95]. The anonymization algorithm attempts to make this
1-neighborhood isomorphic to
other 1-neighborhoods via edge
addition. The experiments are performed on an undirected network
with average degree 4 (an order of magnitude lower than that in real
social networks) and already require increasing the number
of edges by 6%. The number of edges to be added and the computational
effort are likely to rise sharply with the average degree.
The fundamental problem with
-anonymity is that it is a syntactic
property which may not provide any privacy even when satisfied (e.g.,
if all
isomorphic neighborhoods have the same value of some sensitive
attributes). Crucially, all of these defenses impose arbitrary
restrictions on the information available to the adversary and make
arbitrary assumptions about the properties of the social network.
We argue that the auxiliary information which is likely to be available to the attacker is global in nature (e.g., another social network with partially overlapping membership) and not restricted to the neighborhood of a single node. In the rest of this paper, we show how this information, even if very noisy, can be used for large-scale re-identification. Existing models fail to capture self-reinforcing, feedback-based attacks, in which re-identification of some nodes provides the attacker with more auxiliary information, which is then used for further re-identification. Development of a model for such attacks is our primary contribution.
Arvind Narayanan 2009-03-19