De-anonymization

De-anonymization of social networks has been well studied in the security and privacy community. We summarize the literature in Section VI. Our presentation here follows Narayanan and Shmatikov [24] closely.

Abstractly, de-anonymization can be formalized as follows: there is a graph $ G$ from which two graphs $ G_T$ (for “target”) and $ G_A$ (for “auxiliary”) are derived via some stochastic process. There is thus a natural notion of (partial) correspondence between the nodes of $ G_T$ and $ G_A$; the goal of de-anonymization is to recover this correspondence.

The stochastic process involved could be as simple as the deletion of random edges. At the other extreme, if we are considering two entirely different online social networks, say Facebook and LinkedIn, then $ G$ is the underlying social graph of human relationships, and $ G_T$ and $ G_A$ are generated according to the processes by which users join the two networks, processes that have no simple algorithmic description.
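To make the simplest case concrete, the following sketch derives two noisy views of a common graph by independent random edge deletion; the dictionary-of-sets representation and the deletion probability are purely illustrative.

    import random

    def derive_subgraph(G, p_drop, seed=None):
        """Derive a noisy copy of G (a dict: node -> set of out-neighbors)
        by independently deleting each edge with probability p_drop."""
        rng = random.Random(seed)
        return {u: {v for v in nbrs if rng.random() > p_drop}
                for u, nbrs in G.items()}

    # G_T ("target") and G_A ("auxiliary") are two independent noisy views of G.
    G = {1: {2, 3}, 2: {3}, 3: {1}}
    G_T = derive_subgraph(G, p_drop=0.2, seed=0)
    G_A = derive_subgraph(G, p_drop=0.2, seed=1)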

In the current instance, $ G$ is the Flickr graph at the time of the Kaggle crawl. The Kaggle graph $ G_T$ was generated by crawling, followed by the sampling process described in Section II. Our graph $ G_A$ was “generated” from $ G$ by the natural evolution of the Flickr graph since the time of the Kaggle crawl until the time of our crawl, followed by crawling. Each of these steps introduces noise and/or bias into the graphs.

Metrics. We use the metrics accuracy and coverage to measure the performance of a de-anonymization algorithm. Accuracy is defined as the fraction of nodes correctly de-anonymized among all de-anonymized nodes. Coverage is defined as the fraction of nodes de-anonymized. Specifically, we are concerned with coverage over the Kaggle test set. Therefore in the sequel, unless explicitly noted otherwise, coverage means the fraction of nodes de-anonymized among the nodes that appeared in the Kaggle test set. We also adopt a similar notion of edge coverage and edge accuracy for the Kaggle test set, where the former is defined as the fraction of edges in the Kaggle test set that have been de-anonymized, and the latter is defined as the fraction of edges that are correctly de-anonymized among all edges that are de-anonymized in the Kaggle test set. An edge is de-anonymized iff both nodes incident to the edge are de-anonymized.
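The following sketch shows how these four metrics might be computed, assuming the predicted mapping and the (possibly incomplete) ground-truth mapping are available as Python dictionaries; the names predicted, truth, test_nodes, and test_edges are illustrative.

    def node_metrics(predicted, truth, test_nodes):
        """predicted and truth map Kaggle node IDs to Flickr identities;
        test_nodes is the list of nodes appearing in the Kaggle test set."""
        mapped = [v for v in test_nodes if v in predicted]
        graded = [v for v in mapped if v in truth]   # nodes with known ground truth
        correct = [v for v in graded if predicted[v] == truth[v]]
        coverage = len(mapped) / len(test_nodes)
        accuracy = len(correct) / len(graded) if graded else 0.0
        return coverage, accuracy

    def edge_metrics(predicted, truth, test_edges):
        """test_edges is the list of (u, v) pairs from the Kaggle test set; an
        edge is de-anonymized iff both endpoints are de-anonymized, and counts
        as correct here when both endpoints are mapped correctly."""
        deanon = [(u, v) for u, v in test_edges
                  if u in predicted and v in predicted]
        graded = [(u, v) for u, v in deanon if u in truth and v in truth]
        correct = [(u, v) for u, v in graded
                   if predicted[u] == truth[u] and predicted[v] == truth[v]]
        coverage = len(deanon) / len(test_edges)
        accuracy = len(correct) / len(graded) if graded else 0.0
        return coverage, accuracy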

Overview of our de-anonymization algorithm. Our basic approach to de-anonymization is described in [24]. Broadly, there are two steps: “seed identification” and “propagation.” In the former step we somehow de-anonymize a small number of nodes, and in the latter step we use these as “anchors” to propagate the de-anonymization to more and more nodes. In this step the algorithm feeds on its own output.

We note some general features of this approach. There is some similarity with the “seed-and-extend” paradigm for sequence alignment in bioinformatics [5]. A very small number of seeds is sufficient to get the algorithm underway; the number depends on whether or not the seeds are “hubs” (high-degree nodes) and on the degree of similarity between the two graphs.

Propagation is somewhat reminiscent of the spread of epidemics. As this analogy suggests, if there are too few seeds, then propagation dies out; if there are sufficiently many, it reaches a large fraction of the nodes. Note, however, that while an epidemic is transmitted from one node to another, here multiple already-mapped pairs of nodes are involved in implicating each new mapping.


Seed identification

As mentioned earlier, seed identification is a bootstrapping step with the goal of de-anonymizing a small subset of nodes in the Kaggle graph.

The seed-identification technique in [24] is based on pattern search, specifically, identifying small cliques based on degrees and common-neighbor counts. We propose a new approach to seed identification based on formulating it as a combinatorial optimization problem. Compared to [24], we believe our approach is more robust to graph evolution, noise, and information loss due to the crawler's sampling process.

Our motivation for developing this new approach stems primarily from the fact that during the contest, it was unclear whether the clique search technique would be feasible, since we did not know how much the graph had evolved between the two crawls, and how similar the node degrees in the two graphs were to each other. Although we determined after the contest that the existing technique would also have worked in this setting, our method may be of independent interest, especially in contexts where noise and information loss are more prominent.

Figure 5: Correspondence between Kaggle and Flickr high in-degree nodes.

Search space reduction. Since both graphs contain millions of nodes, our algorithm needs to effectively reduce the search space to be viable. Essentially, we would like to find a small subset of nodes $ K$ in the Kaggle graph, and a small subset of nodes $ F$ in the Flickr graph, such that a significant fraction of nodes in $ K$ correspond to nodes in $ F$.

The key observation is that the nodes with high in-degree in both graphs roughly correspond to each other (whereas only about $ \frac{1}{15}$ of nodes with high out-degree are present in the Kaggle graph). Figure 5 shows the correspondence: of the top 30 nodes in the Kaggle graph and the top 30 nodes in the Flickr graph, there are 27 correspondences; 60 in the top 80 nodes and 84 in the top 120.

The fact that a rough correspondence exists can be surmised without the benefit of hindsight. If we assume that the crawled nodes in the Kaggle graph are a random sample of all nodes, ignoring the changes in the graph between the two crawls, we can expect that the in-degrees of corresponding nodes in the two graphs will be roughly in the same proportion, up to sampling error. Even though the Kaggle nodes are not a uniformly random sample, the sampling turns out to be “random enough” for the correspondence to hold.
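Under these assumptions, the reduction step amounts to ranking nodes by in-degree in each graph and taking the top few as the candidate sets $ K$ and $ F$; a minimal sketch (the edge-list inputs are illustrative):

    from collections import Counter

    def top_in_degree(edges, k):
        """edges: iterable of directed (u, v) pairs meaning u links to v;
        returns the k nodes with the highest in-degree, best first."""
        indeg = Counter(v for _, v in edges)
        return [v for v, _ in indeg.most_common(k)]

    # Candidate seed sets in the two graphs (edge lists are illustrative names):
    # K = top_in_degree(kaggle_edges, 30)
    # F = top_in_degree(flickr_edges, 30)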

Graph matching. We now have a small subset of nodes $ K$ from the Kaggle graph, and a small subset of nodes $ F$ from the Flickr graph, with a large fraction of nodes in $ K$ corresponding to nodes in $ F$. Our next step is to find this correspondence.

First we need a measure of the quality of a candidate mapping, so that we can optimize that measure over all possible mappings. If all the edges among the top $ k$ nodes in the Kaggle graph were available, we could look for the mapping that minimizes mismatches. By mismatch we mean the existence of an edge between a pair of nodes in one graph, but not between the images of that pair of nodes in the other graph. However, since only a (small) subset of the Kaggle nodes are crawled, most of the edge information is unavailable.

It turns out that even with a partial crawl, we can adopt a similar strategy, and in fact do even better than simply looking at edges. The trick is to look at the cosine similarity of the sets of in-neighbors of a pair of nodes. A rationale similar to the one above explains why the cosines between corresponding node pairs in the two graphs should be roughly equal: if the Kaggle nodes were a random sample, they would indeed be equal up to sampling error.
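Concretely, treating each in-neighbor set as a 0/1 indicator vector, the cosine score of a node pair reduces to the following (a minimal sketch):

    from math import sqrt

    def set_cosine(a, b):
        """Cosine similarity between two sets viewed as 0/1 indicator vectors."""
        if not a or not b:
            return 0.0
        return len(a & b) / sqrt(len(a) * len(b))

    # The score of an unordered node pair (u, v) within one graph is
    # set_cosine(in_neighbors[u], in_neighbors[v]).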

Figure 6: Relationship between similarity scores of corresponding node pairs.

Figure 6 shows the relationship between similarity scores of corresponding node pairs. Among the top 40 nodes in the Kaggle and Flickr graphs, there are 32 correspondences; these two sets of 32 nodes give $ {32 \choose 2} = 496$ pairs of corresponding unordered node pairs. For each node pair we have an in-neighbor cosine similarity score; thus, the graph shows 496 pairs of scores $ (x_i,y_i)$.

The slope is less than 1, i.e., node pairs in the Kaggle graph have slightly higher scores. This is because the crawled nodes in the Kaggle graph are more biased towards high-degree nodes.

We treat the cosine score of a node pair as the weight of an (undirected) “edge” between those two nodes, regardless of whether or not any actual edges exist between those nodes. We look for a mapping under which the weights of corresponding edges are as close to each other as possible. This is the problem of weighted graph matching. As mentioned earlier, this approach treats seed identification as a global optimization problem rather than a pattern search.
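One natural way to write the objective is

$ \min_{\sigma:\, K \to F} \; \sum_{\{u,v\} \subseteq K} \left| w_K(u,v) - w_F(\sigma(u),\sigma(v)) \right| $

where $ w_K$ and $ w_F$ denote the cosine weights in the two graphs and $ \sigma$ ranges over injective mappings from $ K$ to $ F$; the absolute difference is just one reasonable choice of penalty, not necessarily the one we used.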

During the contest, we found a seed mapping of 10 nodes among the top 20 nodes (there are in fact 18 corresponding pairs of nodes among the top 20s) by visual inspection of the matrix of cosines. This formed our seed set, and was sufficient to kick off propagation.

However, there is an automated, robust, and scalable approach to the weighted graph matching problem: state-space search metaheuristics. In Section IV we show that simulated annealing can easily handle inputs of size up to 100 with low false-positive and false-negative rates.
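The following is a rough sketch of such a simulated-annealing search, assuming for simplicity that $ |K| = |F|$ and that the pairwise cosine weights have been precomputed into dictionaries keyed by sorted node pairs; the names, cooling schedule, and swap-only move set are illustrative rather than a description of our implementation.

    import math
    import random

    def anneal_match(K, F, w_K, w_F, steps=20000, t0=1.0, t1=1e-3, seed=0):
        """Search for a bijection sigma: K -> F minimizing the total mismatch
        between pairwise cosine weights. w_K and w_F map a sorted node pair to
        its cosine weight; missing pairs count as 0. Recomputing the full cost
        at every step keeps the sketch short; a real implementation would
        update the cost incrementally after each swap."""
        rng = random.Random(seed)
        K = list(K)
        sigma = dict(zip(K, rng.sample(list(F), len(K))))  # random initial bijection

        def pair(a, b):
            return (a, b) if a <= b else (b, a)

        def cost(s):
            return sum(abs(w_K.get(pair(u, v), 0.0) -
                           w_F.get(pair(s[u], s[v]), 0.0))
                       for i, u in enumerate(K) for v in K[i + 1:])

        cur = cost(sigma)
        for step in range(steps):
            t = t0 * (t1 / t0) ** (step / steps)         # geometric cooling
            u, v = rng.sample(K, 2)
            sigma[u], sigma[v] = sigma[v], sigma[u]       # propose: swap two images
            new = cost(sigma)
            if new <= cur or rng.random() < math.exp((cur - new) / t):
                cur = new                                 # accept the move
            else:
                sigma[u], sigma[v] = sigma[v], sigma[u]   # reject: undo the swap
        return sigma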

Propagation

Our propagation algorithm is adapted from [24]. In fact, our implementation was simpler; we did not need the more complex heuristics because the two graphs were substantially similar to each other, as noted in Section II. However, the fact that neither graph was fully available made the algorithm somewhat trickier.

As the algorithm progresses, it maintains a (partial) mapping between nodes in the Kaggle graph and nodes in the true Flickr graph. We iteratively try to extend the mapping as follows: pick an arbitrary as-yet-unmapped node in the Kaggle graph, find the “most similar” node in the Flickr graph, and if they are “sufficiently similar,” they get mapped to each other.

At a high level, similarity between a Kaggle node and a Flickr node is defined as cosine similarity between the already-mapped neighbors of the Kaggle node and the already-mapped neighbors of the Flickr node (nodes mapped to each other are treated as identical for the purpose of cosine comparison).

Figure 7: De-anonymization propagation algorithm (simplified).

In Figure 7, the blue nodes have already been mapped. The similarity between $ A$ and $ B$ is $ \frac{2}{\sqrt{3}\cdot\sqrt{3}} = \frac{2}{3}$. Whether or not edges exist between $ A$ and $ A'$ or $ B$ and $ B'$ is irrelevant.

Edge directionality combined with the fact that not all nodes have been crawled introduces some subtleties. Essentially, when comparing a pair of nodes, we ignore the out-edges of either unless both have been crawled. The full algorithm is listed as Algorithm 1.


Algorithm 1: De-anonymization propagation algorithm (its input and notation include the out- and in-neighborhoods of a node $ v$).
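As an illustration of how one propagation pass can be implemented, here is a simplified sketch (not the contest code): it uses only in-neighborhoods, ignores the out-edge subtlety, never revisits already-mapped nodes, and applies a single similarity threshold; all names and the threshold value are illustrative.

    from math import sqrt

    def propagate_once(unmapped_K, candidates_F, in_nbrs_K, in_nbrs_F,
                       mapping, theta=0.5):
        """Try to extend `mapping` (Kaggle node -> Flickr node) by one pass over
        the unmapped Kaggle nodes. Similarity is the cosine between the
        already-mapped in-neighbors of the two nodes, with mapped Kaggle
        neighbors translated into Flickr identities first."""
        taken = set(mapping.values())

        def similarity(a, b):
            A = {mapping[x] for x in in_nbrs_K[a] if x in mapping}
            B = {y for y in in_nbrs_F[b] if y in taken}
            return len(A & B) / sqrt(len(A) * len(B)) if A and B else 0.0

        for a in list(unmapped_K):
            # A practical implementation would restrict candidates to Flickr
            # nodes adjacent to the images of a's already-mapped neighbors.
            best_score, best = max(((similarity(a, b), b)
                                    for b in candidates_F if b not in taken),
                                   key=lambda t: t[0], default=(0.0, None))
            if best is not None and best_score >= theta:
                mapping[a] = best
                taken.add(best)
        return mapping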

There are two reasons why the similarity between a node and its image may not be $ 100\%$: the contest graph is slightly different from our newer crawled graph, and the mapping itself might have inaccuracies. The latter effect is minimal, because the algorithm occasionally revisits already-mapped nodes and corrects errors in the light of new data.

The propagation algorithm was run in two stages. In the first stage, we de-anonymize high-degree nodes with a high confidence level. In the second stage, we focus on the nodes in the test set (that have not yet been de-anonymized) and relax the confidence threshold, and even allow multiple candidates per node. Mappings found in this stage do not feed back into the similarity computation, due to lower confidence.

The following heuristic criteria define “sufficiently similar” in stage 1 of propagation:

The parameters $ k$, $ \theta$, and $ \delta$ allow us to trade off accuracy for coverage. In the second stage of propagation, we set $ k=3$, eliminate the last criterion, and if there are multiple matches with a similarity score of $ 0.5$ or more, we report the best 3 as candidates.
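A small sketch of the stage-2 candidate selection just described, with the 0.5 threshold and the limit of three candidates taken from the text (the data layout is illustrative):

    def stage2_candidates(scores, threshold=0.5, max_candidates=3):
        """scores: list of (similarity, flickr_node) pairs for one Kaggle test
        node. Keep matches scoring at least the threshold and report up to
        three of them, best first."""
        good = sorted([s for s in scores if s[0] >= threshold],
                      key=lambda t: t[0], reverse=True)
        return [node for _, node in good[:max_candidates]]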

Results

The algorithm was implemented in Python and run on a commodity laptop. Propagation was run in two stages. In the first stage, we attempted to de-anonymize the high-degree nodes, specifically, about 34,000 nodes sorted by out-degree and about 100,000 nodes sorted by in-degree. There was, of course, some overlap between the two sets. Mappings were produced for about 120,000 nodes. In the second stage, we attempted to de-anonymize the nodes in the test set that hadn't already been mapped. The first stage took a few hours and the second stage about 20 minutes.

Of the 120,000 mappings in stage 1, 99.3% were correct.6 At the end of the second stage, mappings were produced for about 14,000 nodes out of about 17,600 in the test set. Thus, the coverage was 79.7%. About 7.5% of nodes had multiple candidates. Of these, the top match was accurate in the overwhelming majority of cases. Considering only the top match, the overall accuracy was 97.8%. Of the 8,960 edges, the coverage was 64.7% (roughly the node coverage squared), and the accuracy was 95.2%.



Footnotes

6.
The “ground truth” mapping between obfuscated node IDs and real identities provided by Kaggle is not complete—it covers only 36,547 of the 37,692 nodes with outgoing edges, and 17,130 of the 17,594 nodes in the test set. The fractions reported in this paragraph are measured with respect to the available ground truth.