- Abstract
- Introduction
- State of the Union
- Related Work
- Model and Definitions

- De-anonymization

- Experiments

- Conclusion
- Bibliography
- Glossary
- On "Personally Identifiable Information"
- "Identity" in social networks
- Challenges of defining privacy
- Measuring the effect of perturbation
- Notes on data acquisition

De-anonymization

Our re-identification algorithm runs in two stages. First, the attacker identifies a small number of "seed" nodes which are present both in the anonymous target graph and the attacker's auxiliary graph, and maps them to each other. The main, propagation stage is a self-reinforcing process in which the seed mapping is extended to new nodes using only the topology of the network, and the new mapping is fed back to the algorithm. The eventual result is a large mapping between subgraphs of the auxiliary and target networks which re-identifies all mapped nodes in the latter.

Seed identification

We assume that the attacker's individual auxiliary information (see Section 4.3) consists of a clique of k nodes which are present both in the auxiliary and the target graphs. It is sufficient to know the degree of each of these nodes and the number of common neighbors for each pair of nodes.

The seed-finding algorithm takes as inputs (1) the target graph, (2) k seed nodes in the auxiliary graph, (3) k node-degree values, (4) k(k-1)/2 pairs of common-neighbor counts, and (5) an error parameter ε. The algorithm searches the target graph for a unique k-clique with matching (within a factor of 1 ± ε) node degrees and common-neighbor counts. If such a clique is found, the algorithm maps its nodes to the corresponding nodes in the auxiliary graph; otherwise, failure is reported.

While this brute-force search is exponential in k, in practice this turns out not to be a problem. First, if the degree is bounded by d, then the complexity is O(n·d^(k-1)). Second, the running time is heavily input-dependent, and the inputs with high running time turn out to produce a large number of matches. Terminating the algorithm as soon as more than one match is found greatly decreases the running time.
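As an illustration, the clique search described above can be sketched in Python. The adjacency-set graph representation, the function name `find_seed`, and the exact form of the "within a factor of 1 ± ε" test are our own illustrative choices, not part of the original specification.

```python
from itertools import combinations, permutations

def find_seed(target_adj, aux_deg, aux_common, eps):
    """Brute-force search for a unique k-clique in the target graph whose
    node degrees and pairwise common-neighbor counts match the attacker's
    auxiliary clique within a factor of (1 + eps).

    target_adj: node -> set of neighbors (undirected view of the target)
    aux_deg:    list of k degrees, one per auxiliary clique node
    aux_common: {(i, j): common-neighbor count} for auxiliary pairs i < j
    Returns the matched node tuple, or None on failure / non-uniqueness.
    """
    k = len(aux_deg)

    def close(a, b):
        # "matching within a factor of (1 + eps)"
        return b / (1 + eps) <= a <= b * (1 + eps)

    # Prune: keep only nodes whose degree matches some auxiliary degree.
    candidates = [n for n in target_adj
                  if any(close(len(target_adj[n]), d) for d in aux_deg)]
    matches = []
    for combo in combinations(candidates, k):
        if not all(v in target_adj[u] for u, v in combinations(combo, 2)):
            continue                      # not a clique
        for perm in permutations(combo):  # try each assignment to clique slots
            if not all(close(len(target_adj[perm[i]]), aux_deg[i])
                       for i in range(k)):
                continue
            if all(close(len(target_adj[perm[i]] & target_adj[perm[j]]),
                         aux_common[(i, j)])
                   for i in range(k) for j in range(i + 1, k)):
                matches.append(perm)
                break
        if len(matches) > 1:
            return None  # not unique: terminate early, as in the text
    return matches[0] if matches else None

# Toy example: triangle a-b-c plus pendant edges giving degrees 4, 3, 2.
adj = {"a": {"b", "c", "d", "e"}, "b": {"a", "c", "f"}, "c": {"a", "b"},
       "d": {"a"}, "e": {"a"}, "f": {"b"}}
seed = find_seed(adj, [4, 3, 2], {(0, 1): 1, (0, 2): 1, (1, 2): 1}, eps=0.1)
```

The degree-based pruning and the early exit on a second match are what make the exponential search tolerable in practice, as the text notes.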

Propagation

The propagation algorithm takes as input two graphs G1 = (V1, E1) and G2 = (V2, E2) and a partial "seed" mapping μS between the two. It outputs a mapping μ. One may consider probabilistic mappings, but we found it simpler to focus on deterministic 1-1 mappings μ: V1 → V2.

Intuitively, the algorithm finds new mappings using the topological structure of the network and the feedback from previously constructed mappings. It is robust to mild modifications of the topology such as those introduced by sanitization. At each iteration, the algorithm starts with the accumulated list of mapped pairs between V1 and V2. It picks an arbitrary unmapped node u in V1 and computes a score for each unmapped node v in V2, equal to the number of neighbors of u that have been mapped to neighbors of v. If the strength of the match (see below) is above a threshold, the mapping between u and v is added to the list, and the next iteration starts. There are a few additional details and heuristics that we describe below.

Eccentricity.
Eccentricity is a heuristic defined in [61] in the context of
de-anonymizing databases.
It measures how much an item in a set "stands out" from the rest, and
is defined as

    ecc(X) = (max1(X) - max2(X)) / σ(X)

where max1 and max2 denote the highest and second highest values in X, respectively, and σ denotes the standard deviation.

Our algorithm measures the eccentricity of the set of mapping scores (between a single node v in V1 and each unmapped node in V2) and rejects the match if the eccentricity score is below a threshold.
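The effect of the formula can be seen in a short Python sketch. The use of the population standard deviation (`pstdev`) is an assumption; the text does not specify which variant is intended.

```python
from statistics import pstdev

def eccentricity(scores):
    """(highest - second highest) / standard deviation of the scores."""
    top, second = sorted(scores, reverse=True)[:2]
    sigma = pstdev(scores)
    return (top - second) / sigma if sigma > 0 else 0.0

# A clear standout yields a high eccentricity; near-ties yield a low one.
clear_winner = eccentricity([10, 2, 1, 1])
near_tie = eccentricity([10, 9, 1, 1])
```

Thresholding on this quantity is what lets the algorithm abstain when several candidate nodes are roughly equally plausible.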

Edge directionality. Recall that we are dealing with directed graphs. To compute the mapping score between a pair of nodes u and v, the algorithm computes two scores: the first based only on the incoming edges of u and v, and the second based only on the outgoing edges. These scores are then summed.

Node degrees.
The mapping scores as described above are biased in favor of nodes
with high degrees. To compensate for this bias, the score of each
node is divided by the square root of its degree. The
resemblance to cosine similarity^{4} is not superficial: the rationale is
the same.
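For reference, the set form of cosine similarity that motivates this normalization (defined in the footnote) can be computed directly; the function name is ours:

```python
import math

def cosine_sim(x, y):
    """Cosine similarity between two non-empty sets: |X ∩ Y| / sqrt(|X|·|Y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y))

# A hub shares the same raw neighbors but is penalized by its size:
hub_score = cosine_sim(set(range(100)), {0, 1, 2})
small_score = cosine_sim({0, 1, 2, 3}, {0, 1, 2})
```

Dividing each node's score by the square root of its degree has the same effect: a high-degree node accumulates many raw votes but each is discounted.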

Revisiting nodes. At the early stages of the algorithm, there are few mappings to work with, and therefore the algorithm makes more errors. As the algorithm progresses, the number of mapped nodes increases and the error rate goes down. Thus the need to revisit already mapped nodes: the mapping computed when revisiting a node may be different because of the new mappings that have become available.

Reverse match. The algorithm is completely agnostic about the semantics of the two graphs. It does not matter whether G1 is the target graph and G2 is the auxiliary graph, or vice versa. Each time a node u maps to v, the mapping scores are computed with the input graphs switched. If v gets mapped back to u, the mapping is retained; otherwise, it is rejected.

The following pseudocode describes the algorithm in detail. `theta` is a parameter that controls the tradeoff between the yield and the accuracy.

```
function propagationStep(lgraph, rgraph, mapping)
    for lnode in lgraph.nodes:
        scores[lnode] = matchScores(lgraph, rgraph, mapping, lnode)
        if eccentricity(scores[lnode]) < theta: continue
        rnode = (pick node from rgraph.nodes where
                 scores[lnode][node] = max(scores[lnode]))

        scores[rnode] = matchScores(rgraph, lgraph, invert(mapping), rnode)
        if eccentricity(scores[rnode]) < theta: continue
        reverse_match = (pick node from lgraph.nodes where
                         scores[rnode][node] = max(scores[rnode]))
        if reverse_match != lnode:
            continue

        mapping[lnode] = rnode

function matchScores(lgraph, rgraph, mapping, lnode)
    initialize scores = [0 for rnode in rgraph.nodes]

    for (lnbr, lnode) in lgraph.edges:
        if lnbr not in mapping: continue
        rnbr = mapping[lnbr]
        for (rnbr, rnode) in rgraph.edges:
            if rnode in mapping.image: continue
            scores[rnode] += 1 / rnode.in_degree ^ 0.5

    for (lnode, lnbr) in lgraph.edges:
        if lnbr not in mapping: continue
        rnbr = mapping[lnbr]
        for (rnode, rnbr) in rgraph.edges:
            if rnode in mapping.image: continue
            scores[rnode] += 1 / rnode.out_degree ^ 0.5

    return scores

function eccentricity(items)
    return (max(items) - max2(items)) / std_dev(items)

until convergence do:
    propagationStep(lgraph, rgraph, seed_mapping)
```
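The pseudocode can be turned into a small runnable sketch. This Python version follows the same structure; the graph encoding, the handling of a single remaining candidate in `eccentricity`, and the tie-breaking in `max` are our own assumptions, since the pseudocode leaves them unspecified.

```python
from math import sqrt
from statistics import pstdev

def build_graph(edges):
    """Directed graph stored as 'in' and 'out' adjacency sets."""
    out, inn = {}, {}
    for a, b in edges:
        for n in (a, b):
            out.setdefault(n, set())
            inn.setdefault(n, set())
        out[a].add(b)
        inn[b].add(a)
    return {"out": out, "in": inn}

def eccentricity(vals):
    vals = list(vals)
    if len(vals) < 2:       # assumption: a lone candidate trivially stands out
        return float("inf")
    top, second = sorted(vals, reverse=True)[:2]
    sigma = pstdev(vals)
    return (top - second) / sigma if sigma > 0 else 0.0

def match_scores(lg, rg, mapping, lnode):
    """Score each unmapped right-node against lnode, splitting by edge
    direction and normalizing each vote by 1/sqrt(degree), as in matchScores."""
    image = set(mapping.values())
    scores = {r: 0.0 for r in rg["out"] if r not in image}
    for lnbr in lg["in"][lnode]:           # incoming edges of lnode
        if lnbr in mapping:
            for r in rg["out"][mapping[lnbr]]:
                if r not in image:
                    scores[r] += 1 / sqrt(len(rg["in"][r]))
    for lnbr in lg["out"][lnode]:          # outgoing edges of lnode
        if lnbr in mapping:
            for r in rg["in"][mapping[lnbr]]:
                if r not in image:
                    scores[r] += 1 / sqrt(len(rg["out"][r]))
    return scores

def propagation_step(lg, rg, mapping, theta):
    changed = False
    for lnode in list(lg["out"]):
        if lnode in mapping:
            continue
        scores = match_scores(lg, rg, mapping, lnode)
        if not scores or max(scores.values()) == 0:
            continue                       # no topological evidence yet
        if eccentricity(scores.values()) < theta:
            continue
        rnode = max(scores, key=scores.get)
        # reverse match: swap the graphs and invert the mapping
        rev = match_scores(rg, lg, {v: k for k, v in mapping.items()}, rnode)
        if not rev or eccentricity(rev.values()) < theta:
            continue
        if max(rev, key=rev.get) != lnode:
            continue
        mapping[lnode] = rnode
        changed = True
    return changed

# Toy run: two copies of the same 4-node directed graph, seeded with 2 nodes.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 1)]
lg = build_graph(edges)
rg = build_graph([("abcd"[a - 1], "abcd"[b - 1]) for a, b in edges])
mapping = {1: "a", 2: "b"}
while propagation_step(lg, rg, mapping, theta=0.5):
    pass
```

On this toy input the seed {1: 'a', 2: 'b'} propagates to all four nodes; real inputs, of course, are noisy, which is exactly what the eccentricity and reverse-match checks are there to handle.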

Complexity.
Ignoring revisiting nodes and reverse matches, the complexity of the algorithm is O(|E1| d2), where d2 is a bound on the degree of the nodes in V2. To see this, let μ_part be the partial mapping computed at any stage of the algorithm. For each u in V1 and each v adjacent to u such that μ_part(v) is defined, the algorithm examines each of the neighbors of μ_part(v), giving an upper bound of |E1| d2.

Assuming that a node is revisited only if the number of its already-mapped neighbors has increased by at least 1, we get a bound of O(|E1| d1 d2), where d1 is a bound on the degree of the nodes in V1. Finally, taking reverse mappings into account, we get O((|E1| + |E2|) d1 d2).

^{4} - The *cosine similarity measure* between two sets X and Y is defined when neither is empty: cos(X, Y) = |X ∩ Y| / √(|X| · |Y|).