A social network $S$ consists of (1) a directed graph $G = (V, E)$, and
(2) a set of attributes $\mathcal{X}$ for each node in $V$ (for instance, name,
telephone number, etc.) and a set of attributes $\mathcal{Y}$ for each
edge in $E$ (for instance, type of relationship). The model is agnostic
as to whether attributes accurately reflect real-world identities or not
(see Appendix C).
We treat attributes as atomic values from a discrete domain;
this is important for our formal definition of privacy breach
(Definition 3 below).
Real-valued attributes must be discretized.
Where specified, we will also represent edges
as attributes in $\mathcal{Y}$ taking values in $\{0, 1\}$.
In addition to the explicit attributes, some privacy policies may be concerned with implicit attributes, i.e., properties of a node or an edge that are based purely on the graph structure. For example, node degree can be a sensitive implicit attribute. Implicit attributes may be leaked without disclosing any explicit attributes. For example, if the adversary re-identifies a subset of nodes in an anonymized graph, none of which are adjacent, he learns the degrees of these nodes without breaking edge privacy. Which implicit attributes should be protected depends on the specific network.
Our model of the data release process focuses on what types of data are released and how the data is sanitized (if at all), and abstracts away from the procedural distinctions such as whether the data is available in bulk or obtained by crawling the network. As discussed in Section 2, social-network data are routinely released to advertisers, application developers, and researchers. Advertisers are often given access to the entire graph in a (presumably) anonymized form and a limited number of relevant attributes for each node. Application developers, in current practice, get access to a subgraph via user opt-in and most or all of the attributes within this subgraph. This typically includes the identifying attributes, even if they are not essential for the application's functionality [28]. Researchers may receive the entire graph or a subgraph (up to the discretion of the network owner) and a limited set of non-identifying attributes.
"Anonymization" is modeled by publishing only a subset of attributes.
Unlike naïve approaches such as $k$-anonymity, we do not distinguish
identifying and non-identifying attributes (any attribute can be
identifying if it happens to be known to the adversary as part of his
auxiliary information). Suppressed attributes are not limited to the
demographic quasi-identifiers a priori; we simply assume that the
published attributes by themselves are insufficient for re-identification.
In Section 4.4, we explain the (indirect)
connection between preventing node re-identification and intuitive
"privacy." In terms of entropy, most of the information in the
released graph resides in the edges, and this is what our de-anonymization
algorithm will exploit.
The data release process may involve perturbation or sanitization that changes the graph structure in some way to make re-identification attacks harder. As we argued in Section 3, deterministic methods that attempt to make different nodes look identical do not work on realistic networks. Other defenses are based on injecting random noise into the graph structure. The most promising one is link prediction [50], which produces plausible fake edges by exploiting the fact that edges in social-network graphs have a high clustering coefficient. (We stress that link prediction is far beyond the existing sanitization techniques, which mostly rely on simple removal of identifiers.) The experiments in Section 6.2 show that our algorithm is robust to injected noise, whether resulting from link prediction or not. In Appendix E.4, we discuss how to measure the amount of noise introduced by perturbation.
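We model the data sanitization and release process as follows. First,
select a subset of nodes, $V_{san} \subset V$, and subsets
$\mathcal{X}_{san} \subseteq \mathcal{X}$, $\mathcal{Y}_{san} \subseteq \mathcal{Y}$
of node and edge attributes to be released. Second, compute the induced
subgraph on $V_{san}$.
For simplicity, we do not model more complex criteria for releasing edges,
e.g., based on edge attributes. Third, remove some edges and add
fake edges. Release
$S_{san} = (V_{san}, E_{san}, \{X(v)\ \forall v \in V_{san}, X \in \mathcal{X}_{san}\}, \{Y(e)\ \forall e \in E_{san}, Y \in \mathcal{Y}_{san}\})$,
i.e., a sanitized
subset of nodes and edges with the corresponding attributes.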
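The three release steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sanitize_and_release`, its parameters, and the dictionary-based graph representation are hypothetical, and fake edges are drawn uniformly at random rather than by link prediction or any particular sanitization scheme.

```python
import random

def sanitize_and_release(nodes, edges, node_attrs, edge_attrs,
                         keep_nodes, keep_node_attrs, keep_edge_attrs,
                         remove_frac=0.1, n_fake=0, seed=0):
    """Illustrative release process; all names here are assumptions."""
    rng = random.Random(seed)

    # Step 1: the released node subset and attribute subsets are chosen by the caller.
    v_san = set(keep_nodes) & set(nodes)

    # Step 2: induced subgraph on the released nodes (edges are directed pairs).
    e_san = {(u, v) for (u, v) in edges if u in v_san and v in v_san}

    # Step 3a: remove a fraction of the true edges.
    n_remove = int(remove_frac * len(e_san))
    removed = set(rng.sample(sorted(e_san), n_remove))
    e_san -= removed

    # Step 3b: add fake edges, here chosen uniformly at random (an assumption).
    candidates = sorted((u, v) for u in v_san for v in v_san
                        if u != v and (u, v) not in e_san and (u, v) not in removed)
    e_san |= set(rng.sample(candidates, min(n_fake, len(candidates))))

    # Publish only the selected attributes; fake edges carry no attributes.
    x_san = {v: {a: val for a, val in node_attrs.get(v, {}).items()
                 if a in keep_node_attrs} for v in v_san}
    y_san = {e: {a: val for a, val in edge_attrs.get(e, {}).items()
                 if a in keep_edge_attrs} for e in e_san}
    return v_san, e_san, x_san, y_san
```

Calling it with `remove_frac=0.0` and `n_fake=0` yields plain attribute suppression on the induced subgraph, which corresponds to the simplest form of "anonymization" discussed above.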
As described in Section 2, network owners release anonymized and possibly sanitized network graphs to commercial partners and academic researchers. Therefore, we take it for granted that the attacker will have access to such data. The main question we answer in the rest of this paper is: can sensitive information about specific individuals be extracted from anonymized social-network graphs?
Attack scenarios.
Attackers fall into different categories depending on their capabilities
and goals. The strongest adversary is a government-level
agency interested in global surveillance. Such an adversary
can be assumed to already have access to a large auxiliary network
$S_{aux}$ (see below). His objective is large-scale collection
of detailed information about as many individuals as possible. This
involves aggregating the anonymous network $S_{san}$ with $S_{aux}$
by recognizing nodes that correspond to the same individuals.
Another attack scenario involves abusive marketing. A commercial enterprise, especially one specializing in behavioral ad targeting [81; 92], can easily obtain an anonymized social-network graph from the network operator for advertising purposes. As described in Sections 1 and 2, anonymity is often misinterpreted as privacy. If an unethical company were able to de-anonymize the graph using publicly available data, it could engage in abusive marketing aimed at specific individuals. Phishing and spamming also gain from social-network de-anonymization. Using detailed information about the victim gleaned from his or her de-anonymized social-network profile, a phisher or a spammer will be able to craft a highly individualized, believable message (cf. [41]).
Yet another category of attacks involves targeted de-anonymization of specific individuals by stalkers, investigators, nosy colleagues, employers, or neighbors. In this scenario, the attacker has detailed contextual information about a single individual, which may include some of her attributes, a few of her social relationships, membership in other networks, and so on. The objective is to use this information to recognize the victim's node in the anonymized network and to learn sensitive information about her, including all of her social relationships in that network.
Modeling the attacker.
We assume that in addition to the anonymized, sanitized target network
$S_{san}$, the attacker also has access to a different network $S_{aux}$
whose membership partially overlaps with $S$.
The assumption that the attacker possesses such an auxiliary network
is very realistic. First, it may be possible to extract $S_{aux}$
directly from $S$: for example, parts of some online networks can be
automatically crawled, or a
malicious third-party application can provide information about the
subgraph of users who installed it. Second, the attacker may collude
with an operator of a different network whose membership overlaps
with $S$. Third, the attacker may take advantage of several ongoing
aggregation projects (see Section 2). The intent of
these projects is benign, but they facilitate the creation
of a global auxiliary network combining bits and pieces of public
information about individuals and their relationships from multiple
sources. Fourth, government-level aggregators, such as intelligence
and law enforcement agencies, can collect data via
surveillance and court-authorized searches. Depending on
the type of the attacker, the nodes of his auxiliary network
may be a subset, a superset, or overlap with those of the target network.
We emphasize that even with access to a substantial auxiliary network
$S_{aux}$, de-anonymizing the target network $S_{san}$ is a highly
non-trivial task. First, the overlap between the two networks may
not be large. For the entities who are members of both $S_{aux}$
and $S$, some social relationships may be preserved, i.e., if two nodes are
connected in $S_{aux}$, the corresponding nodes in $S$ are also connected
with a non-negligible probability, but many of the relationships in each
network are unique to that network. Even if the same entity belongs
to both networks, it is not immediately clear how to recognize
that a certain anonymous node from $S_{san}$ corresponds to the same
entity as a given node from $S_{aux}$. Therefore, easy availability
of auxiliary information does not directly imply that anonymized social
networks are vulnerable to privacy breaches.
Our formal model of the attacker includes both aggregate auxiliary
information (large-scale information from other data sources and
social networks whose membership overlaps with the target network) and
individual auxiliary information (identifiable details about a small
number of individuals from the target network and possibly relationships
between them). In the model, we consider edge relationship to be a binary
attribute in $\mathcal{Y}$ and all edge attributes $Y \in \mathcal{Y}$
to be defined over $V^2$ instead of $E$.
If $(u, v) \notin E$, then $Y[u, v] = \bot$ for all $Y \in \mathcal{Y}$.
Aggregate auxiliary information.
It is essential that the attacker's auxiliary
information may include relationships between entities. Therefore,
we model $S_{aux}$ as a graph $G_{aux} = (V_{aux}, E_{aux})$
and a set of probability distributions $\mathsf{Aux}_X$ and $\mathsf{Aux}_Y$,
one for each attribute of every node in $V_{aux}$ and each
attribute of every edge in $E_{aux}$. These distributions represent
the adversary's (imperfect) knowledge of the corresponding attribute
value. For example, the adversary may be 80% certain that
an edge between two nodes is a "friendship" and 20% that it
is a mere "contact." Since we treat edges themselves as attributes,
this also captures the attacker's uncertain knowledge about
the existence of individual edges. This model works well in practice,
although it does not capture some types of auxiliary information,
such as "node $v_1$ is connected to either node $v_2$, or node $v_3$."
For an attribute $X$ of a node $v$ (respectively, attribute $Y$
of an edge $e$), we represent by $\mathsf{Aux}[X, v]$
(resp., $\mathsf{Aux}[Y, e]$) the attacker's
prior probability distribution (i.e., distribution given by his
auxiliary information) of the attribute's value. The set $\mathsf{Aux}_X$
(resp., $\mathsf{Aux}_Y$) can be thought of as a union of $\mathsf{Aux}[X, v]$
(resp., $\mathsf{Aux}[Y, e]$) over all attributes and nodes
(resp., edges).
Aggregate auxiliary information is used in the "propagation" stage of our de-anonymization algorithm (Section 5).
Individual auxiliary information (information about seeds).
We also assume that the attacker possesses detailed information about a very
small² number of members of the target network $S$. We assume
that the attacker can determine if these members are
also present in his auxiliary network $S_{aux}$ (e.g., by
matching usernames and other contextual
information). The privacy question is whether this information about a
handful of members of $S$ can be used, in combination with $S_{aux}$,
to learn sensitive information about other members of $S$.
It is not difficult to collect such data about a small number of nodes.
If the attacker is already a user of $S$, he knows all details about his
own node and its neighbors [44; 76]. Some networks permit
manual access to profiles even if large-scale crawling is restricted
(e.g., Facebook allows viewing of information about "friends" of
any member by default). Some users may make their details public
even in networks that keep them private by default. The attacker
may even pay a handful of users for information about themselves and
their friends [49], or learn it from compromised computers
or stolen mobile phones. For example, the stored log of phone calls
provides auxiliary information for de-anonymizing the phone-call graph.
With an active attack (e.g., [7]), the attacker may create fake
nodes and edges in $S$ with features that will be easy to recognize in
the anonymized version of $S$, such as a clique or an almost-clique.
Since large-scale active attacks are unlikely to be feasible (see
Section 3), we restrict their role to collecting individual
auxiliary information as a precursor to the main, passive attack.
Individual auxiliary information is used in the "seed identification" stage of our de-anonymization algorithm (Section 5).
The notion of what should be considered private varies from network
to network and even from individual to individual within the network.
To keep our model independent of the semantics of a particular network,
we treat the privacy policy as a syntactic, exogenous labeling
that specifies for every node attribute, edge, and edge attribute
whether it should be public or private. Formally, it is a function
$\mathsf{PP}\colon \mathcal{X} \cup \mathcal{Y} \times E \rightarrow \{\mathit{pub}, \mathit{priv}\}$.
In Appendix D, we discuss the challenges of rigorously
defining privacy policies.
In this paper, we take an "operational" approach to social-network privacy by focusing solely on node re-identification. First, it is unclear how to give a meaningful definition of social-network privacy that does not make some assumptions about the attacker's strategy and yet yields meaningful results on real-world data. Second, all currently known privacy-breaching and privacy-protection algorithms focus on node re-identification. Even edge inference, in order to be considered a meaningful privacy breach, must include learning some identifying information about the endpoints and thus implies node re-identification. Third, while anonymity is by no means sufficient for privacy³, it is clearly necessary. A re-identification algorithm that breaks anonymity is thus guaranteed to violate any reasonable definition of privacy, as long as there are any sensitive attributes at all attached to the nodes, since the algorithm re-labels the sensitive attributes with identifying information.
We define ground truth to be a mapping $\mu_G$ between the nodes
$V_{aux}$ of the attacker's auxiliary network and the nodes $V_{san}$ of the
target network. Intuitively, a pair of nodes are mapped to each other
if they belong to the same "entity" (see Appendix C).
If $\mu_G(v)$ takes the special value $\bot$, then there is no mapping
for node $v$ (e.g., if $v$ was not released as part of $V_{san}$).
Further, $\mu_G$ need not map every node in $V_{san}$. This is important
because the overlap between $V$ and $V_{aux}$ may be relatively small.
We do assume that the mapping is 1-1, i.e., an entity has at most one
node in each network, as discussed in Appendix C.
Node re-identification or re-labeling refers to finding a mapping $\mu$
between a node in $V_{aux}$ and a node in $V_{san}$. Intuitively,
$G_{aux}$ is a labeled graph and $G_{san}$ is unlabeled.
Node re-identification succeeds on a node $v_{aux} \in V_{aux}$
if $\mu(v_{aux}) = \mu_G(v_{aux})$, and fails otherwise. The latter includes the case
that $\mu(v_{aux}) = \bot$, $\mu_G(v_{aux}) \neq \bot$ and vice versa. Informally,
re-identification is recognizing correctly that a given
node in the anonymized network belongs to the same entity as a node in
the attacker's auxiliary network.
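Definition 1 (Re-identification algorithm). A node re-identification
algorithm takes as input $S_{san}$ and $S_{aux}$
and produces a probabilistic mapping
$\tilde{\mu}\colon V_{aux} \times (V_{san} \cup \{\bot\}) \rightarrow [0, 1]$, where
$\tilde{\mu}(v_{aux}, v_{san})$ is
the probability that $v_{aux}$ maps to $v_{san}$.
We give such an algorithm in
Section 5. Observe that the algorithm outputs, for
each node in $V_{aux}$, a set of candidate nodes in $V_{san}$ and
a probability distribution over those nodes reflecting the attacker's
imperfect knowledge of the re-identification mapping.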
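We now define the class of adversaries who attempt to breach
privacy via re-identification. After constructing the mapping,
the adversary updates his knowledge of the attributes of $S_{aux}$
using the attribute values in $S_{san}$. Specifically, he
can use the probability distribution over the candidate nodes to derive
a distribution over the attribute values associated with these nodes.
His success is measured by the precision of his posterior
knowledge of the attributes.
Definition 2 (Mapping adversary). A mapping adversary corresponding to
a probabilistic mapping $\tilde{\mu}$ outputs a probability
distribution calculated as follows:
$$\mathsf{Adv}[X, v_{aux}, x] = \frac{\sum_{v \in V_{san},\ X[v] = x} \tilde{\mu}(v_{aux}, v)}{\sum_{v \in V_{san},\ X[v] \neq \bot} \tilde{\mu}(v_{aux}, v)}$$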
Because the auxiliary graph need not be a subgraph of the target
graph, the mapping may not be complete, and the mapping adversary's
posterior knowledge $\mathsf{Adv}[X, v_{aux}, x]$
of an attribute value is only defined for
nodes $v_{aux}$
that have actually been mapped to nodes in the target
graph, at least one of which has a non-null value for this attribute.
Formally, $\mathsf{Adv}[X, v_{aux}, x]$
is defined if there is a non-zero number of nodes $v \in V_{san}$ such
that $\tilde{\mu}(v_{aux}, v) > 0$ and $X[v] \neq \bot$. Edge
attributes are treated similarly.
The probability of a given node having a particular attribute value can be computed in other ways, e.g., by looking only at the most likely mapping. This does not make a significant difference in practice.
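As a rough illustration of how a mapping adversary turns a probabilistic mapping into posterior knowledge, the sketch below normalizes the mapping probabilities over candidate target nodes that carry a non-null value for the attribute. The function name `adversary_posterior` and the dictionary layout are assumptions made for this example, with `None` standing in for a null attribute value; it covers node attributes only.

```python
def adversary_posterior(mu_tilde, x_san, v_aux):
    """Posterior distribution over one attribute of auxiliary node v_aux.

    mu_tilde: dict mapping aux node -> {target node: probability}
    x_san:    dict mapping target node -> attribute value (None if null)
    Returns a dict {value: probability}, or None when the posterior is
    undefined (no candidate with positive probability has a non-null value).
    """
    weights = {}
    total = 0.0
    for v, p in mu_tilde.get(v_aux, {}).items():
        value = x_san.get(v)
        if p > 0 and value is not None:
            weights[value] = weights.get(value, 0.0) + p
            total += p
    if total == 0:
        return None
    return {value: w / total for value, w in weights.items()}
```

Restricting the candidate set to the single most likely target node, as mentioned above, would amount to replacing the loop with a lookup of the highest-probability entry.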
We say that privacy of $v_{san}$ is compromised if, for some attribute
$X$ which takes value $x$ in $S_{san}$ and is designated
as "private" by the privacy policy, the adversary's belief that
$X[v_{aux}] = x$ increases by more than $\delta$, which is a pre-specified
privacy parameter. For simplicity, we assume that the privacy policy
is global, i.e., the attribute is either public, or private for
all nodes (respectively, edges). More granular policies are
discussed in Appendix D.
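Definition 3 (Privacy breach). For nodes $u_{aux}, v_{aux} \in V_{aux}$, let
$\mu_G(u_{aux}) = u_{san}$ and $\mu_G(v_{aux}) = v_{san}$.
We say that the privacy of $v_{san}$ is breached w.r.t. adversary $\mathsf{Adv}$
and privacy parameter $\delta$ if
(a) for some attribute $X$ such that $\mathsf{PP}[X] = \mathit{priv}$,
$\mathsf{Adv}[X, v_{aux}, x] - \mathsf{Aux}[X, v_{aux}, x] > \delta$,
where $x = X[v_{aux}]$, or
(b) for some attribute $Y$ such that $\mathsf{PP}[Y] = \mathit{priv}$,
$\mathsf{Adv}[Y, u_{aux}, v_{aux}, y] - \mathsf{Aux}[Y, u_{aux}, v_{aux}, y] > \delta$,
where $y = Y[u_{aux}, v_{aux}]$.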
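The breach condition can be read operationally: for each private attribute, compare the adversary's posterior probability of the true value with his prior, and flag a breach when the gain exceeds the privacy parameter. The sketch below does this for node attributes only. The function name `breached_attributes`, the string labels `"priv"` and `"pub"`, and the dictionary layout are assumptions made for illustration.

```python
def breached_attributes(adv, aux, true_values, policy, delta):
    """Return the private attributes whose belief gain exceeds delta.

    adv, aux:    dict attribute -> {value: probability} (posterior, prior)
    true_values: dict attribute -> true value for the node in question
    policy:      dict attribute -> "priv" or "pub" (global policy)
    """
    breached = []
    for attr, x in true_values.items():
        if policy.get(attr) != "priv":
            continue  # public attributes cannot be breached
        posterior = adv.get(attr)
        if posterior is None:
            continue  # posterior undefined for this attribute
        prior = aux.get(attr, {})
        gain = posterior.get(x, 0.0) - prior.get(x, 0.0)
        if gain > delta:
            breached.append(attr)
    return breached
```

Common-sense knowledge is handled by the prior: if the attacker already assigns high probability to the true value, the gain is small and no breach is reported.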
Definition 3 should be viewed as a meta-definition
or a template, and must be carefully adapted to each instance of the
re-identification attack and each concrete attribute. This involves
subjective judgment. For example, did a privacy breach occur if
the attacker's confidence increased for some attributes and decreased
for others? Learning common-sense knowledge from the sanitized network
(for example, that all nodes have fewer than
neighbors)
does not intuitively constitute a privacy breach, even though it satisfies
Definition 3 for the "node degree" attribute. Such
common-sense knowledge must be included in the attacker's
auxiliary information $\mathsf{Aux}$. Then
learning it from the sanitized graph does not constitute a privacy breach.
To see why a naïve metric such as the raw fraction of re-identified
nodes is inadequate, imagine the network $S'$
consisting of $S$ augmented
with $|V|$ singleton nodes. Re-identification
fails on the singletons because there is no edge information associated
with them, and, therefore, the naïve metric returns
half the value on $S'$ as it does on $S$. Intuitively, however, the
presence of singletons should not affect the performance of any
de-anonymization algorithm.
This is not merely hypothetical. In many online networks, the majority of nodes show little or no observable activity after account creation. Restricting one's attention to the giant connected component does not solve the problem, either, because extraneous nodes with degree 1 instead of 0 would have essentially the same (false) impact on naïvely measured performance.
Instead, we assign a weight to each affected node in proportion to its importance in the network. Importance is a subjective notion, but can be approximated by node centrality, which is a well-studied concept in sociology that only recently came to the attention of computer scientists [40; 19; 54; 3; 45].
There are three groups of centrality measures: local, eigenvalue-based and distance-based. Local methods such as degree centrality consider only the neighbors of the node. Eigenvalue methods also consider the centrality of each neighbor, resulting in a convergent recursive computation. Distance-based measures consider path lengths from a node to different points in the network. A well-known eigenvalue-based measure was proposed by Bonacich in [12], while [37] presents a textbook treatment of centrality.
We find that the decision to use a centrality measure at all, as opposed to a naïve metric such as the raw fraction of nodes de-anonymized, is much more important than the actual choice of the measure. We therefore use the simplest possible measure, degree centrality, where each node is weighted in proportion to its degree. In a directed graph, we use the sum of in-degree and out-degree.
There is an additional methodological issue. For a mapped pair of nodes, should we use the centrality score from the target graph or the auxiliary graph? It is helpful to go back to the pathological example that we used to demonstrate the inadequacy of fraction-based metrics. If either of the nodes in the mapped pair is a singleton, then the de-anonymization algorithm clearly has no hope of finding that pair. Therefore, we compute the centrality in both graphs and take the minimum of the two. We believe that this formulation captures most closely the spirit of the main question we are answering in this paper: "what proportion of entities that are active in a social network and for which non-trivial auxiliary information is available can be re-identified?"
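A minimal sketch of this weighting, assuming graphs are given as lists of directed edge pairs and the ground truth as a dictionary from auxiliary nodes to target nodes (`None` when there is no counterpart). The helper names `degree_centrality` and `pair_weights` are hypothetical.

```python
from collections import Counter

def degree_centrality(edges):
    """Sum of in-degree and out-degree for each node of a directed graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1  # outgoing edge of u
        degree[v] += 1  # incoming edge of v
    return degree

def pair_weights(aux_edges, san_edges, ground_truth):
    """Weight of each mapped pair: the minimum of its two degree centralities."""
    d_aux = degree_centrality(aux_edges)
    d_san = degree_centrality(san_edges)
    return {v_aux: min(d_aux[v_aux], d_san[v_san])
            for v_aux, v_san in ground_truth.items()
            if v_san is not None}
```

With this choice, a pair in which either node is a singleton receives weight zero, so the pathological singleton example no longer changes the measured performance.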
Given a probabilistic mapping $\tilde{\mu}$, we say that a (concrete)
mapping $\mu$ is sampled from $\tilde{\mu}$ if for each $u$,
$\mu(u)$ is sampled according to $\tilde{\mu}(u, \cdot)$.
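Sampling a concrete mapping from a probabilistic one amounts to an independent draw per node. The sketch below assumes the probabilistic mapping is stored as a dictionary from each node to a distribution over candidates, with `None` standing in for "no match"; the name `sample_mapping` is hypothetical.

```python
import random

def sample_mapping(mu_tilde, rng=None):
    """Draw one concrete mapping: each node independently picks a candidate."""
    rng = rng or random.Random()
    mu = {}
    for u, dist in mu_tilde.items():
        targets = list(dist)
        probs = [dist[t] for t in targets]
        # random.choices draws proportionally to the given weights.
        mu[u] = rng.choices(targets, weights=probs, k=1)[0]
    return mu
```

Note that independent draws do not enforce the 1-1 property: two nodes may be sampled onto the same candidate.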
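Definition 4 (Success of de-anonymization). Let
$V_{mapped} = \{v \in V_{aux} : \mu_G(v) \neq \bot\}$.
The
success rate of a de-anonymization algorithm outputting a
probabilistic mapping $\tilde{\mu}$, w.r.t. a centrality measure
$\nu$, is the probability that $\mu$ sampled from $\tilde{\mu}$
maps a node $v$ to $\mu_G(v)$ if $v$ is selected according to $\nu$:
$$\frac{\sum_{v \in V_{mapped}} \Pr[\mu(v) = \mu_G(v)]\, \nu(v)}{\sum_{v \in V_{mapped}} \nu(v)}$$
The error rate is the probability that $\mu$
maps a node $v$ to any node other than $\mu_G(v)$:
$$\frac{\sum_{v \in V_{aux}} \Pr[\mu(v) \neq \bot \wedge \mu(v) \neq \mu_G(v)]\, \nu(v)}{\sum_{v \in V_{aux}} \nu(v)}$$
The probability is taken over the inherent randomness of the de-anonymization
algorithm as well as the sampling of $\mu$ from $\tilde{\mu}$. Note that
the error rate includes the possibility that $\mu_G(v) = \bot$
and $\mu(v) \neq \bot$.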
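Because each node is sampled independently, both rates can be computed directly from the probabilistic mapping without drawing samples. The sketch below is one possible reading under stated assumptions: the success rate is normalized over nodes that have a true counterpart, the error rate over all auxiliary nodes listed in the ground truth (so that a spurious match for a node with no counterpart counts as an error), and the caller supplies a centrality weight for every node. The function name and data layout are hypothetical.

```python
def success_and_error_rates(mu_tilde, ground_truth, weight):
    """Centrality-weighted success and error rates.

    mu_tilde:     dict aux node -> {candidate (or None): probability}
    ground_truth: dict aux node -> true target node, or None if absent
    weight:       dict aux node -> centrality weight (assumed given)
    """
    succ_num = succ_den = err_num = err_den = 0.0
    for v, truth in ground_truth.items():
        w = weight.get(v, 0.0)
        dist = mu_tilde.get(v, {})
        # Error: v is mapped to an actual node that is not its true counterpart.
        err_num += w * sum(p for t, p in dist.items()
                           if t is not None and t != truth)
        err_den += w
        if truth is not None:
            succ_num += w * dist.get(truth, 0.0)
            succ_den += w
    success = succ_num / succ_den if succ_den else None
    error = err_num / err_den if err_den else None
    return success, error
```

Probability mass assigned to `None` for a node that does have a counterpart counts toward neither rate: the algorithm declined to answer rather than answering incorrectly.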
The above measure only gives a lower bound on privacy breach
because privacy can be violated without complete de-anonymization.
Therefore, if the goal is to protect privacy, it is not enough
to show that this measure is low. It is also necessary to show that
Definition 3 is not satisfied. Observe, for example,
that simply creating $k$ copies of the graph technically prevents
de-anonymization and even satisfies naïve syntactic definitions
such as $k$-anonymity, while completely violating any reasonable
definition of privacy.
In the other direction, however, breaking Definition 4 for
a large fraction of nodes--as our algorithm of Section 5
does--is sufficient to break privacy via Definition 3,
as long as some trivial conditions are met: at least one private attribute
is released as part of $\mathcal{X}_{san}$, and the adversary possesses little
or no auxiliary information about this attribute.
. For example,
in our experiments, we find that between 30 and 150 seeds are sufficient
for networks with
to
members.
³ For example, suppose that the attacker can map a node in $V_{aux}$
to a small set of nodes in $V_{san}$ which all have the same
value for some sensitive attribute. Anonymity is preserved (he does not
know which of the nodes corresponds to the target node), yet he still
learns the value of his target's sensitive attribute.