Our re-identification algorithm runs in two stages. First, the attacker identifies a small number of "seed" nodes which are present both in the anonymous target graph and the attacker's auxiliary graph, and maps them to each other. The main, propagation stage is a self-reinforcing process in which the seed mapping is extended to new nodes using only the topology of the network, and the new mapping is fed back to the algorithm. The eventual result is a large mapping between subgraphs of the auxiliary and target networks which re-identifies all mapped nodes in the latter.
We assume that the attacker's individual auxiliary information (see
Section 4.3) consists of a clique of
nodes which are
present both in the auxiliary and the target graphs. It is sufficient
to know the degree of each of these nodes and the number
of common neighbors for each pair of nodes.
The seed-finding algorithm takes as inputs (1) the target graph, (2)
seed nodes in the auxiliary graph,
(3)
node-degree values, (4)
pairs
of common-neighbor counts, and (5) error parameter
.
The algorithm searches the target graph for a unique
-clique with
matching (within a factor of
)
node degrees and common-neighbor counts. If found, the algorithm
maps the nodes in the clique to the corresponding nodes in the auxiliary graph;
otherwise, failure is reported.
While this brute-force search is exponential in
, in practice this
turns out not to be a problem. First, if the degree is bounded by
, then the complexity is
. Second, the running time is
heavily input-dependent, and the inputs with high running time
turn out to produce a large number of matches. Terminating the
algorithm as soon as more than one match is found greatly decreases the
running time.
The propagation algorithm takes as input two graphs
and
and a partial "seed" mapping
between
the two. It outputs a mapping
. One may consider
probabilistic mappings, but we found it simpler to focus on deterministic
1-1 mappings
.
Intuitively, the algorithm finds new mappings using the topological
structure of the network and the feedback from previously constructed
mappings. It is robust to mild modifications of the topology such
as those introduced by sanitization. At each iteration, the algorithm
starts with the accumulated list of mapped pairs between
and
.
It picks an arbitrary unmapped node
in
and computes a score
for each unmapped node
in
, equal to the number of
neighbors of
that have been mapped to neighbors of
. If the
strength of the match (see below) is above a threshold, the mapping
between
and
is added to the list, and the next iteration starts.
There are a few additional details and heuristics that we describe below.
Eccentricity.
Eccentricity is a heuristic defined in [61] in the context of
de-anonymizing databases.
It measures how much an item in a set
"stands out" from the rest, and
is defined as
and
denote the highest and
second highest values, respectively, and
denotes the standard
deviation.
Our algorithm measures the eccentricity of the set of mapping scores
(between a single node in
and each unmapped node in
)
and rejects the match if the eccentricity score is below a threshold.
Edge directionality.
Recall that we are dealing with directed graphs. To compute the
mapping score between a pair of nodes
and
, the algorithm computes
two scores-the first based only on the incoming edges of
and
,
and the second based only on the outgoing edges. These
scores are then summed.
Node degrees. The mapping scores as described above are biased in favor of nodes with high degrees. To compensate for this bias, the score of each node is divided by the square root of its degree. The resemblance to cosine similarity4 is not superficial: the rationale is the same.
Revisiting nodes. At the early stages of the algorithm, there are few mappings to work with, and therefore the algorithm makes more errors. As the algorithm progresses, the number of mapped nodes increases and the error rate goes down. Thus the need to revisit already mapped nodes: the mapping computed when revisiting a node may be different because of the new mappings that have become available.
Reverse match.
The algorithm is completely agnostic about the semantics of the two
graphs. It does not matter whether
is the target graph and
is the auxiliary graph, or vice versa. Each time a node
maps to
, the mapping scores are computed with the input graphs switched.
If
gets mapped back to
, the mapping is retained;
otherwise, it is rejected.
The following pseudocode describes the algorithm in detail. theta is a parameter that controls the tradeoff between the yield and the accuracy.
function propagationStep(lgraph, rgraph, mapping)
for lnode in lgraph.nodes:
scores[lnode] = matchScores(lgraph, rgraph, mapping, lnode)
if eccentricity(scores[lnode]) < theta: continue
rnode = (pick node from rgraph.nodes where
scores[lnode][node] = max(scores[lnode]))
scores[rnode] = matchScores(rgraph, lgraph, invert(mapping), rnode)
if eccentricity(scores[rnode]) < theta: continue
reverse_match = (pick node from lgraph.nodes where
scores[rnode][node] = max(scores[rnode]))
if reverse_match != lnode:
continue
mapping[lnode] = rnode
function matchScores(lgraph, rgraph, mapping, lnode)
initialize scores = [0 for rnode in rgraph.nodes]
for (lnbr, lnode) in lgraph.edges:
if lnbr not in mapping: continue
rnbr = mapping[lnbr]
for (rnbr, rnode) in rgraph.edges:
if rnode in mapping.image: continue
scores[rnode] += 1 / rnode.in_degree ^ 0.5
for (lnode, lnbr) in lgraph.edges:
if lnbr not in mapping: continue
rnbr = mapping[lnbr]
for (rnode, rnbr) in rgraph.edges:
if rnode in mapping.image: continue
scores[rnode] += 1 / rnode.out_degree ^ 0.5
return scores
function eccentricity(items)
return (max(items) - max2(items)) / std_dev(items)
until convergence do:
propagationStep(lgraph, rgraph, seed_mapping)
Complexity.
Ignoring revisiting nodes and reverse matches, the
complexity of the algorithm is
, where
is a
bound on the degree of the nodes in
. To see this, let
be the partial mapping computed at
any stage of the algorithm. For each
and each
adjacent
to
such that
, the algorithm examines each
of the neighbors of
, giving an upper bound of
.
Assuming that a node is revisited only if the number of already-mapped
neighbors of the node has increased by at least 1, we get a bound of
, where
is a
bound on the degree of the nodes in
.
Finally, taking reverse mappings into account, we get
.
is defined when neither is empty:
.