In this section we describe the competition dataset and our Flickr dataset crawled for the purposes of de-anonymization.

Social network challenge dataset

The challenge graph had 1,133,547 nodes and 7,237,983 edges. An additional 8,960 edges formed the test set; of these, equal numbers were true edges (withheld from the training set) and false edges. Test edges were incident on training set nodes only, making link prediction possible. Of the 1.1m nodes, only 37,689 had outgoing edges. In the sequel, we will call these “crawled” nodes. In order to generate a running leaderboard, a random 20% of the test set was held out to form a validation set on which entry AUCs were computed. The final results were calculated on the entire test set.

User identities were obfuscated in the challenge dataset: nodes were assigned random numeric IDs. However, during the course of the challenge, it was revealed on the contest forum that the data came from crawling Flickr—nodes correspond to users of the photo sharing site and edges correspond to directed friendship relationships. After completion, we learned that the competition crawl took place between 21-28 June 2010 [22]. It was also revealed that the crawl was made in several iterations by initially downloading random nodes, and subsequently sampling uniformly from the inbound neighbors of these nodes; that the true test edges were selected uniformly at random from the crawled set of edges incident to nodes with degree two or more; and that the false test edges were rejection sampled to connect a random crawled node to a random reached node, provided the edge did not already exist.

Figures 1 and 2 display the in- and out-degree distributions of the competition training graph, respectively. The heavy tailed distributions suggest that a small number of nodes have very large degrees. These nodes form the seed set for the de-anonymization process below.

Figure 1: Kaggle challenge graph: in-degree distribution.
Image kaggle-indeg-dist

Figure 2: Kaggle challenge graph: out-degree distribution.
Image kaggle-outdeg-dist

Flickr crawl dataset

Between mid-December 2010 and mid-January 2011, we crawled the Flickr social graph in order to de-anonymize the Kaggle dataset. The crawler was written in Python, using the Curl library. A total of 2,038,803 nodes were crawled; these are the nodes that have outgoing edges. The resulting graph comprises 163,579,517 directed edges on 9,124,801 nodes—significantly larger than the competition dataset. Of these 9.1m nodes, 7,775,972 have incoming edges.

While unhelpful for de-anonymization and link prediction, we recorded the true identities of the nodes we crawled. Similarly the organizers provided us with the unobfuscated node IDs for the challenge graph after the contest. Together these identities allowed us to compare the two partial snapshots of the Flickr graph after the completion of the contest.

It is only meaningful to compare edges originating from nodes that have been crawled in both graphs. There were 6,658,755 such edges in the challenge graph, 7,041,554 edges in our Flickr graph and 6,545,560 of these edges were in common. Thus 113,195 edges had been deleted since the Kaggle crawl and 495,994 edges had been added. The cosine similarity between the two sets of edges is 95.6%.5

Figure 3: Distribution of ratios of out-degree of Kaggle nodes to corresponding Flickr nodes. (Top 700 nodes by out-degree.)
Image outdegree-ratio

Figure 4: Distribution of ratios of in-degree of Kaggle nodes to corresponding Flickr nodes. (Top 700 nodes by in-degree.)
Image indegree-ratio


... 95.6\%.5
Here and in the sequel, the cosine similarity between two sets $ X$ and $ Y$ is $ \frac{\ \ \vert X \cap Y\vert}{\sqrt{\vert X\vert \cdot \vert Y\vert}}$.
Arvind Narayanan 2011-03-09