A social network consists of (1) a directed graph , and (2) a set of attributes for each node in (for instance, name, telephone number, etc.) and a set of attributes for each edge in (for instance, type of relationship). The model is agnostic as to whether attributes accurately reflect real-world identities or not (see Appendix C). We treat attributes as atomic values from a discrete domain; this is important for our formal definition of privacy breach (Definition 3 below). Real-valued attributes must be discretized. Where specified, we will also represent edges as attributes in taking values in .
In addition to the explicit attributes, some privacy policies may be concerned with implicit attributes, i.e., properties of a node or an edge that are based purely on the graph structure. For example, node degree can be a sensitive implicit attribute. Implicit attributes may be leaked without disclosing any explicit attributes. For example, if the adversary re-identifies a subset of nodes in an anonymized graph, none of which are adjacent, he learns the degrees of these nodes without breaking edge privacy. Which implicit attributes should be protected depends on the specific network.
Our model of the data release process focuses on what types of data are released and how the data is sanitized (if at all), and abstracts away from the procedural distinctions such as whether the data is available in bulk or obtained by crawling the network. As discussed in Section 2, social-network data are routinely released to advertisers, application developers, and researchers. Advertisers are often given access to the entire graph in a (presumably) anonymized form and a limited number of relevant attributes for each node. Application developers, in current practice, get access to a subgraph via user opt-in and most or all of the attributes within this subgraph. This typically includes the identifying attributes, even if they are not essential for the application's functionality . Researchers may receive the entire graph or a subgraph (up to the discretion of the network owner) and a limited set of non-identifying attributes.
"Anonymization" is modeled by publishing only a subset of attributes. Unlike naïve approaches such as -anonymity, we do not distinguish identifying and non-identifying attributes (any attribute can be identifying if it happens to be known to the adversary as part of his auxiliary information). Suppressed attributes are not limited to the demographic quasi-identifiers a priori; we simply assume that the published attributes by themselves are insufficient for re-identification. In Section 4.4, we explain the (indirect) connection between preventing node re-identification and intuitive "privacy." In terms of entropy, most of the information in the released graph resides in the edges, and this is what our de-anonymization algorithm will exploit.
The data release process may involve perturbation or sanitization that changes the graph structure in some way to make re-identification attacks harder. As we argued in Section 3, deterministic methods that attempt to make different nodes look identical do not work on realistic networks. Other defenses are based on injecting random noise into the graph structure. The most promising one is link prediction , which produces plausible fake edges by exploiting the fact that edges in social-network graphs have a high clustering coefficient. (We stress that link prediction is far beyond the existing sanitization techniques, which mostly rely on simple removal of identifiers.) The experiments in Section 6.2 show that our algorithm is robust to injected noise, whether resulting from link prediction or not. In Appendix E.4, we discuss how to measure the amount of noise introduced by perturbation.
We model the data sanitization and release process as follows. First, select a subset of nodes, , and subsets of node and edge attributes to be released. Second, compute the induced subgraph on . For simplicity, we do not model more complex criteria for releasing edge, e.g., based on edge attributes. Third, remove some edges and add fake edges. Release = , i.e., a sanitized subset of nodes and edges with the corresponding attributes.
As described in Section 2, network owners release anonymized and possibly sanitized network graphs to commercial partners and academic researchers. Therefore, we take it for granted that the attacker will have access to such data. The main question we answer in the rest of this paper is: can sensitive information about specific individuals be extracted from anonymized social-network graphs?
Attack scenarios. Attackers fall into different categories depending on their capabilities and goals. The strongest adversary is a government-level agency interested in global surveillance. Such an adversary can be assumed to already have access to a large auxiliary network (see below). His objective is large-scale collection of detailed information about as many individuals as possible. This involves aggregating the anonymous network with by recognizing nodes that correspond to the same individuals.
Another attack scenario involves abusive marketing. A commercial enterprise, especially one specializing in behavioral ad targeting [81; 92], can easily obtain an anonymized social-network graph from the network operator for advertising purposes. As described in Sections 1 and 2, anonymity is often misinterpreted as privacy. If an unethical company were able to de-anonymize the graph using publicly available data, it could engage in abusive marketing aimed at specific individuals. Phishing and spamming also gain from social-network de-anonymization. Using detailed information about the victim gleaned from his or her de-anonymized social-network profile, a phisher or a spammer will be able to craft a highly individualized, believable message (cf. ).
Yet another category of attacks involves targeted de-anonymization of specific individuals by stalkers, investigators, nosy colleagues, employers, or neighbors. In this scenario, the attacker has detailed contextual information about a single individual, which may include some of her attributes, a few of her social relationships, membership in other networks, and so on. The objective is to use this information to recognize the victim's node in the anonymized network and to learn sensitive information about her, including all of her social relationships in that network.
Modeling the attacker. We assume that in addition to the anonymized, sanitized target network , the attacker also has access to a different network whose membership partially overlaps with . The assumption that the attacker possesses such an auxiliary network is very realistic. First, it may be possible to extract directly from : for example, parts of some online networks can be automatically crawled, or a malicious third-party application can provide information about the subgraph of users who installed it. Second, the attacker may collude with an operator of a different network whose membership overlaps with . Third, the attacker may take advantage of several ongoing aggregation projects (see Section 2). The intent of these projects is benign, but they facilitate the creation of a global auxiliary network combining bits and pieces of public information about individuals and their relationships from multiple sources. Fourth, government-level aggregators, such as intelligence and law enforcement agencies, can collect data via surveillance and court-authorized searches. Depending on the type of the attacker, the nodes of his auxiliary network may be a subset, a superset, or overlap with those of the target network.
We emphasize that even with access to a substantial auxiliary network , de-anonymizing the target network is a highly non-trivial task. First, the overlap between the two networks may not be large. For the entities who are members of both and , some social relationships may be preserved, i.e., if two nodes are connected in , the corresponding nodes in are also connected with a non-negligible probability, but many of the relationships in each network are unique to that network. Even if the same entity belongs to both networks, it is not immediately clear how to recognize that a certain anonymous node from corresponds to the same entity as a given node from . Therefore, easy availability of auxiliary information does not directly imply that anonymized social networks are vulnerable to privacy breaches.
Our formal model of the attacker includes both aggregate auxiliary information (large-scale information from other data sources and social networks whose membership overlaps with the target network) and individual auxiliary information (identifiable details about a small number of individuals from the target network and possibly relationships between them). In the model, we consider edge relationship to be a binary attribute in and all edge attributes to be defined over instead of . If , then .
Aggregate auxiliary information. It is essential that the attacker's auxiliary information may include relationships between entities. Therefore, we model as a graph and a set of probability distributions and , one for each attribute of every node in and each attribute of every edge in . These distributions represent the adversary's (imperfect) knowledge of the corresponding attribute value. For example, the adversary may be 80% certain that an edge between two nodes is a "friendship" and 20% that it is a mere "contact." Since we treat edges themselves as attributes, this also captures the attacker's uncertain knowledge about the existence of individual edges. This model works well in practice, although it does not capture some types of auxiliary information, such as "node is connected to either node , or node ."
For an attribute of a node (respectively, attribute of an edge ), we represent by (resp., ) the attacker's prior probability distribution (i.e., distribution given by his auxiliary information) of the attribute's value. The set (resp., ) can be thought of as a union of (resp., ) over all attributes and nodes (resp., edges).
Aggregate auxiliary information is used in the the "propagation" stage of our de-anonymization algorithm (Section 5).
Individual auxiliary information (information about seeds). We also assume that the attacker possesses detailed information about a very small2 number of members of the target network . We assume that the attacker can determine if these members are also present in his auxiliary network (e.g., by matching usernames and other contextual information). The privacy question is whether this information about a handful of members of can be used, in combination with , to learn sensitive information about other members of .
It is not difficult to collect such data about a small number of nodes. If the attacker is already a user of , he knows all details about his own node and its neighbors [44; 76]. Some networks permit manual access to profiles even if large-scale crawling is restricted (e.g., Facebook allows viewing of information about "friends" of any member by default.) Some users may make their details public even in networks that keep them private by default. The attacker may even pay a handful of users for information about themselves and their friends , or learn it from compromised computers or stolen mobile phones. For example, the stored log of phone calls provides auxiliary information for de-anonymizing the phone-call graph. With an active attack (e.g., ), the attacker may create fake nodes and edges in with features that will be easy to recognize in the anonymized version of , such as a clique or an almost-clique. Since large-scale active attacks are unlikely to be feasible (see Section 3), we restrict their role to collecting individual auxiliary information as a precursor to the main, passive attack.
Individual auxiliary information is used in the the "seed identification" stage of our de-anonymization algorithm (Section 5).
In this paper, we take an "operational" approach to social-network privacy by focusing solely on node re-identification. First, it is unclear how to give a meaningful definition of social-network privacy that does not make some assumptions about the attacker's strategy and yet yields meaningful results on real-world data. Second, all currently known privacy-breaching and privacy-protection algorithms focus on node re-identification. Even edge inference, in order to be considered a meaningful privacy breach, must include learning some identifying information about the endpoints and thus implies node re-identification. Third, while anonymity is by no means sufficient for privacy3, it is clearly necessary. A re-identification algorithm that breaks anonymity is thus guaranteed to violate any reasonable definition of privacy, as long as there are any sensitive attributes at all attached to the nodes, since the algorithm re-labels the sensitive attributes with identifying information.
We define ground truth to be a mapping between the nodes of the attacker's auxiliary network and the nodes of the target network. Intuitively, a pair of nodes are mapped to each other if they belong to the same "entity" (see Appendix C). If takes the special value , then there is no mapping for node (e.g., if was not released as part of ). Further, need not map every node in . This is important because the overlap between and may be relatively small. We do assume that the mapping is 1-1, i.e., an entity has at most one node in each network, as discussed in Appendix C.
Node re-identification or re-labeling refers to finding a mapping between a node in and a node in . Intuitively, is a labeled graph and is unlabeled. Node re-identification succeeds on a node if , and fails otherwise. The latter includes the case that and vice versa. Informally, re-identification is recognizing correctly that a given node in the anonymized network belongs to the same entity as a node in the attacker's auxiliary network.
We give such an algorithm in Section 5. Observe that the algorithm outputs, for each node in , a set of candidate nodes in and a probability distribution over those nodes reflecting the attacker's imperfect knowledge of the re-identification mapping.
We now define the class of adversaries who attempt to breach privacy via re-identification. After constructing the mapping, the adversary updates his knowledge of the attributes of using the attribute values in . Specifically, he can use the probability distribution over the candidate nodes to derive a distribution over the attribute values associated with these nodes. His success is measured by the precision of his posterior knowledge of the attributes.
Because the auxiliary graph need not be a subgraph of the target graph, the mapping may not be complete, and the mapping adversary's posterior knowledge of an attribute value is only defined for nodes that have actually been mapped to nodes in the target graph, at least one of which has a non-null value for this attribute. Formally, is defined if there is a non-zero number of nodes such that and . Edge attributes are treated similarly.
The probability of a given node having a particular attribute value can be computed in other ways, e.g., by looking only at the most likely mapping. This does not make a significant difference in practice.
(a) for some attribute such that , where , or
(b) for some attribute such that , where .
Definition 3 should be viewed as a meta-definition or a template, and must be carefully adapted to each instance of the re-identification attack and each concrete attribute. This involves subjective judgment. For example, did a privacy breach occur if the the attacker's confidence increased for some attributes and decreased for others? Learning common-sense knowledge from the sanitized network (for example, that all nodes have fewer than neighbors) does not intuitively constitute a privacy breach, even though it satisfies Definition 3 for the "node degree" attribute. Such common-sense knowledge must be included in the attacker's . Then learning it from the sanitized graph does not constitute a privacy breach.
, imagine the network consisting of augmented with singleton nodes. Re-identification fails on the singletons because there is no edge information associated with them, and, therefore, the naïve metric returns half the value on as it does on . Intuitively, however, the presence of singletons should not affect the performance of any de-anonymization algorithm.
This is not merely hypothetical. In many online networks, the majority of nodes show little or no observable activity after account creation. Restricting one's attention to the giant connected component does not solve the problem, either, because extraneous nodes with degree 1 instead of 0 would have essentially the same (false) impact on naïvely measured performance.
Instead, we assign a weight to each affected node in proportion to its importance in the network. Importance is a subjective notion, but can be approximated by node centrality, which is a well-studied concept in sociology that only recently came to the attention of computer scientists [40; 19; 54; 3; 45].
There are three groups of centrality measures: local, eigenvalue-based and distance-based. Local methods such as degree centrality consider only the neighbors of the node. Eigenvalue methods also consider the centrality of each neighbor, resulting in a convergent recursive computation. Distance-based measures consider path lengths from a node to different points in the network. A well-known eigenvalue-based measure was proposed by Bonacich in , while  presents a textbook treatment of centrality.
We find that the decision to use a centrality measure at all, as opposed to a naïve metric such as the raw fraction of nodes de-anonymized, is much more important than the actual choice of the measure. We therefore use the simplest possible measure, degree centrality, where each node is weighted in proportion to its degree. In a directed graph, we use the sum of in-degree and out-degree.
There is an additional methodological issue. For a mapped pair of nodes, should we use the centrality score from the target graph or the auxiliary graph? It is helpful to go back to the pathological example that we used to demonstrate the inadequacy of fraction-based metrics. If either of the nodes in the mapped pair is a singleton, then the de-anonymization algorithm clearly has no hope of finding that pair. Therefore, we compute the centrality in both graphs and take the minimum of the two. We believe that this formulation captures most closely the spirit of the main question we are answering in this paper: "what proportion of entities that are active in a social network and for which non-trivial auxiliary information is available can be re-identified?"
Given a probabilistic mapping , we say that a (concrete) mapping is sampled from if for each , is sampled according to .
The error rate is the probability that maps a node to any node other than :
The probability is taken over the inherent randomness of the de-anonymization algorithm as well as the sampling of from . Note that the error rate includes the possibility that and .
The above measure only gives a lower bound on privacy breach because privacy can be violated without complete de-anonymization. Therefore, if the goal is to protect privacy, it is not enough to show that this measure is low. It is also necessary to show that Definition 3 is not satisfied. Observe, for example, that simply creating copies of the graph technically prevents de-anonymization and even satisfies naïve syntactic definitions such as -anonymity, while completely violating any reasonable definition of privacy.
In the other direction, however, breaking Definition 4 for a large fraction of nodes--as our algorithm of Section 5 does--is sufficient to break privacy via Definition 3, as long some trivial conditions are met: at least one private attribute is released as part of , and the adversary possesses little or no auxiliary information about this attribute.