We used data from three large online social networks in our experiments. The first graph is the "follow" relationships on the Twitter microblogging service, which we crawled in late 2007. The second graph is the "contact" relationships on Flickr, a photo-sharing service, which we crawled in late 2007/early 2008. Both services have APIs that expose a mandatory username field, and optional fields name and location. The latter is represented as free-form text. The final graph is the "friend" relationships on the LiveJournal blogging service; we obtained it from the authors of . The parameters of the three graphs are summarized below. In computing the average degree, the degree of a node is counted as the sum of its in- and out-degrees. Further details about the crawling process can be found in Appendix J.
To demonstrate feasibility of seed identification, we ran the algorithm of Section 5.1 with the LiveJournal graph as its target. Recall from Section 4.3 that the auxiliary information needed to create seed mappings comes from the users of the target network. Therefore, we can evaluate feasibility of seed identification simply by measuring how much auxiliary information is needed to identify a unique node in the target graph. We emphasize that our main de-anonymization algorithm needs only a handful of such nodes.
For simplicity, we assume that the attacker only has access to the undirected graph, where an edge is included only if it is symmetrical in the original graph. This underestimates the re-identification rate, because the attacker would have more information if directionality of edges were considered.
We synthetically generate auxiliary information for seed identification starting from randomly sampled cliques. To sample a clique of size , we start from a random node and, at every stage, randomly pick a node which is adjacent to all the nodes picked so far. If there is no such node, we start over.
This method does not sample uniformly from all the cliques in the graph; the distribution of selected nodes is much more equitable. If we sample a -clique uniformly, it is susceptible to anomalies in the graph that make the result meaningless. If the graph has a large clique, or even a large dense subgraph, then almost every -clique sampled will belong to this large clique or subgraph.
Given a clique (specifically, a -clique), we assume that the attacker knows the degrees of these 4 nodes as well as the number of common neighbors of each of the 6 pairs. The auxiliary information may be imprecise, and the search algorithm treats a -clique in the target graph as a match as long as each degree and common-neighbor count matches within a factor of , where is the error parameter (intuitively, the higher the error, the noisier the auxiliary information and the lower the re-identification rate). Figure 1 shows how re-identification rate decreases with noise. Recall that we allow at most one match, and so the attacker never makes an error as long as his assumptions about the imprecision of his auxiliary information are correct.
This experiment establishes that seed identification is feasible in practice. If anything, it underestimates how easy this is to do in the real world, where the attacker can use auxiliary information other than degrees and common-neighbor counts. Searching based on the structure of the target users' graph neighborhoods allows re-identification with just two or even a single node, although this is algorithmically complex.
The most remarkable feature of our propagation algorithm is that it achieves "viral," self-reinforcing, large-scale re-identification regardless of the number of seeds, as long as the latter is above a (low) threshold. To study this behavior, we carried out an experiments on pairs of subgraphs, over 100,000 nodes each, of a real-world social network. In each experiment, one of the subgraphs was used as the auxiliary information, the other as the target. The graphs were artificially perturbed by adding different levels of noise to achieve various degrees of edge overlap.
Perturbation strategy. Given a real network graph , our goal is to sample subsets of such that and have an overlap of . Overlap is measured in terms of the Jaccard Coefficient, which is defined for two sets if one of them is non-empty: . Thus, if each of two sets shares half its members with the other, the overlap is . We simply partition randomly into three subsets of size , respectively, and set and .
We use one subgraph as the auxiliary information and the other as the anonymous target graph. As mentioned in Section 2, we believe that introducing noise via edge deletions and additions is the only realistic method of perturbing the edges. Our goal is to simulate the effect of perturbation on the target graph as follows (Procedure A):
The best way to add edges is to use link prediction, which will result in plausible fake edges. Instead of choosing a specific link prediction algorithm, we perform the following (Procedure B):
It should be clear that Procedure B produces more plausible edges than even the best concrete link prediction algorithm. If the link prediction algorithm is perfect, i.e., if the edge additions accomplish the reverse of random edge deletion, then the two procedures are more or less equivalent ( in Procedure A corresponds to in Procedure B; and in Procedure A correspond to the two perturbed copies in Procedure B). If the link prediction is not perfect, then Procedure B is better in the sense that it leads to more realistic noise, and thus makes the task of our de-anonymization algorithm harder.
This leaves the question of what fraction of edges to remove to get an edge overlap of . The fraction of common edges is , while the fraction of edges left in at least one of the copies is , giving , which yields as the only valid solution. Note that the edge overlap is calculated for the subgraphs formed by the overlapping nodes. The overlap between and is much lower.
Results. We investigated the impact that the number of seeds has on the ability of the propagation algorithm to achieve large-scale re-identification, and also its robustness to perturbation.
Figure 3 shows that the selection of seeds determines whether propagation step dies out or not (cf. phase transition ), but whenever large-scale propagation has been achieved, the re-identification rate stays remarkably constant. We find that when the algorithm dies out, it re-identifies no more than a few dozen nodes correctly.
We performed a further experiment to study the phase transition better. A run is classified as successful if it re-identifies at least 1,000 nodes. Figure 4 shows the resulting probabilities of large-scale propagation. The phase transition is somewhat less sharp than might appear from Figure 3, although the window is almost completely in the range [15,45].
It must be noted that the number of seeds required to trigger propagation depends heavily on the parameters of the graph and the algorithm used for seed selection. We therefore caution against reading too much into the numbers. What this experiment shows is that a phase transition does happen and that it is strongly dependent on the number of seeds. Therefore, the adversary can collect seed mappings incrementally until he has enough mappings to carry out large-scale re-identification.
Figure 2 shows that imprecision of the auxiliary information decreases the percentage of nodes re-identified, but cannot prevent large-scale re-identification.
Ground truth. To verify our results, we had to determine the ground truth, i.e., the true mapping between the two graphs. We produced ground-truth mappings based on exact matches in either the username, or name field. Once a match is found, we compute a score based on a variety of heuristics on all three fields (username, name and location). If the score is too low, we reject the match as spurious.
This resulted in around 27,000 mappings, which we will call . Since these mappings were computed with a completely different information than used by the de-anonymization algorithm, errors in the ground truth can only degrade the reported performance of our de-anonymization algorithm. We picked a random sample of the mappings and verified by human inspection that the error rate is well under 5%.
Of course, some of those who use both Flickr and Twitter may use completely different usernames and names on the two services and are thus not included in our ground-truth mappings. This has no effect on the reported performance of our algorithm. When it does recognize two nodes as belonging to the same user, it is rarely wrong, and, furthermore, it can successfully re-identify thousands of users.
It is possible that our algorithm has a better performance on the nodes where the ground truth is known than on other nodes. For example, users who acquire distinctive usernames on both websites might be habitual early adopters of web services. Thus, the numbers below must be interpreted with caution.
Our seed mapping consisted of 150 pairs of nodes selected randomly from , with the constraint that the degree of each mapped node in the auxiliary graph is at least 80. More opportunistic seed selection can lower the number of seeds required.
The accuracy of our algorithm on (weighted by centrality--see Section E.1) is summarized below: