Our re-identification algorithm runs in two stages. First, the attacker identifies a small number of "seed" nodes which are present both in the anonymous target graph and the attacker's auxiliary graph, and maps them to each other. The main, propagation stage is a self-reinforcing process in which the seed mapping is extended to new nodes using only the topology of the network, and the new mapping is fed back to the algorithm. The eventual result is a large mapping between subgraphs of the auxiliary and target networks which re-identifies all mapped nodes in the latter.

Seed identification

While algorithms for seed identification are not our primary technical contribution, they are a key step in enabling our overall algorithm to succeed. Here we describe one possible seed identification algorithm. The attacks in [7] can also be considered seed identification algorithms. We briefly discuss alternatives at the end of Section 6.1.

We assume that the attacker's individual auxiliary information (see Section 4.3) consists of a clique of $k$ nodes which are present both in the auxiliary and the target graphs. It is sufficient to know the degree of each of these nodes and the number of common neighbors for each pair of nodes.

The seed-finding algorithm takes as inputs (1) the target graph, (2) $k$ seed nodes in the auxiliary graph, (3) $k$ node-degree values, (4) $k \choose 2$ pairs of common-neighbor counts, and (5) error parameter $\epsilon$. The algorithm searches the target graph for a unique $k$-clique with matching (within a factor of $1\pm\epsilon$) node degrees and common-neighbor counts. If found, the algorithm maps the nodes in the clique to the corresponding nodes in the auxiliary graph; otherwise, failure is reported.

While this brute-force search is exponential in $k$, in practice this turns out not to be a problem. First, if the degree is bounded by $d$, then the complexity is $O(nd^{k-1})$. Second, the running time is heavily input-dependent, and the inputs with high running time turn out to produce a large number of matches. Terminating the algorithm as soon as more than one match is found greatly decreases the running time.


The propagation algorithm takes as input two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ and a partial "seed" mapping $\mu_S$ between the two. It outputs a mapping $\mu$. One may consider probabilistic mappings, but we found it simpler to focus on deterministic 1-1 mappings $\mu \colon V_1 \rightarrow V_2$.

Intuitively, the algorithm finds new mappings using the topological structure of the network and the feedback from previously constructed mappings. It is robust to mild modifications of the topology such as those introduced by sanitization. At each iteration, the algorithm starts with the accumulated list of mapped pairs between $V_1$ and $V_2$. It picks an arbitrary unmapped node $u$ in $V_1$ and computes a score for each unmapped node $v$ in $V_2$, equal to the number of neighbors of $u$ that have been mapped to neighbors of $v$. If the strength of the match (see below) is above a threshold, the mapping between $u$ and $v$ is added to the list, and the next iteration starts. There are a few additional details and heuristics that we describe below.

Eccentricity. Eccentricity is a heuristic defined in [61] in the context of de-anonymizing databases. It measures how much an item in a set $X$ "stands out" from the rest, and is defined as

\begin{displaymath}\frac{\textsf{max}(X) - \textsf{max}_2({X})}{\sigma({X})} \end{displaymath}

where $\textsf{max}$ and $\textsf{max}_2$ denote the highest and second highest values, respectively, and $\sigma$ denotes the standard deviation.

Our algorithm measures the eccentricity of the set of mapping scores (between a single node in $v_1$ and each unmapped node in $v_2$) and rejects the match if the eccentricity score is below a threshold.

Edge directionality. Recall that we are dealing with directed graphs. To compute the mapping score between a pair of nodes $u$ and $v$, the algorithm computes two scores-the first based only on the incoming edges of $u$ and $v$, and the second based only on the outgoing edges. These scores are then summed.

Node degrees. The mapping scores as described above are biased in favor of nodes with high degrees. To compensate for this bias, the score of each node is divided by the square root of its degree. The resemblance to cosine similarity4 is not superficial: the rationale is the same.

Revisiting nodes. At the early stages of the algorithm, there are few mappings to work with, and therefore the algorithm makes more errors. As the algorithm progresses, the number of mapped nodes increases and the error rate goes down. Thus the need to revisit already mapped nodes: the mapping computed when revisiting a node may be different because of the new mappings that have become available.

Reverse match. The algorithm is completely agnostic about the semantics of the two graphs. It does not matter whether $G_1$ is the target graph and $G_2$ is the auxiliary graph, or vice versa. Each time a node $u$ maps to $v$, the mapping scores are computed with the input graphs switched. If $v$ gets mapped back to $u$, the mapping is retained; otherwise, it is rejected.

The following pseudocode describes the algorithm in detail. theta is a parameter that controls the tradeoff between the yield and the accuracy.

function propagationStep(lgraph, rgraph, mapping)

  for lnode in lgraph.nodes:
    scores[lnode] = matchScores(lgraph, rgraph, mapping, lnode)
    if eccentricity(scores[lnode]) < theta: continue
    rnode = (pick node from rgraph.nodes where 
          scores[lnode][node] = max(scores[lnode]))

    scores[rnode] = matchScores(rgraph, lgraph, invert(mapping), rnode)
    if eccentricity(scores[rnode]) < theta: continue
    reverse_match = (pick node from lgraph.nodes where 
          scores[rnode][node] = max(scores[rnode]))
    if reverse_match != lnode:

    mapping[lnode] = rnode

function matchScores(lgraph, rgraph, mapping, lnode)

  initialize scores = [0 for rnode in rgraph.nodes]

  for (lnbr, lnode) in lgraph.edges:
    if lnbr not in mapping: continue
    rnbr = mapping[lnbr]
    for (rnbr, rnode) in rgraph.edges:
      if rnode in mapping.image: continue
      scores[rnode] += 1 / rnode.in_degree ^ 0.5

  for (lnode, lnbr) in lgraph.edges:
    if lnbr not in mapping: continue
    rnbr = mapping[lnbr]
    for (rnode, rnbr) in rgraph.edges:
      if rnode in mapping.image: continue
      scores[rnode] += 1 / rnode.out_degree ^ 0.5

  return scores

function eccentricity(items)

  return (max(items) - max2(items)) / std_dev(items)

until convergence do:
  propagationStep(lgraph, rgraph, seed_mapping)

Complexity. Ignoring revisiting nodes and reverse matches, the complexity of the algorithm is $O(\vert E_1\vert d_2)$, where $d_2$ is a bound on the degree of the nodes in $V_2$. To see this, let $\mu_{part}$ be the partial mapping computed at any stage of the algorithm. For each $u \in V_1$ and each $v$ adjacent to $u$ such that $v \in \textsf{domain}(\mu_{part})$, the algorithm examines each of the neighbors of $\mu_{part}(v)$, giving an upper bound of $\vert E_1\vert d_2$.

Assuming that a node is revisited only if the number of already-mapped neighbors of the node has increased by at least 1, we get a bound of $O(\vert E_1\vert d_1d_2)$, where $d_1$ is a bound on the degree of the nodes in $V_1$. Finally, taking reverse mappings into account, we get $O((\vert E_1\vert+\vert E_2\vert)d_1d_2)$.


... similarity4
The cosine similarity measure between two sets $X \mbox { and } Y$ is defined when neither is empty: $\textsf{cos}(X, Y) = \frac{\vert X \cap Y\vert}{\sqrt{\vert X\vert \vert Y\vert}}$.
Arvind Narayanan 2009-03-19