Link Prediction

Use of voting to increase edge coverage

As stated earlier, de-anonymization covered 64.7% of the edges in the test set. Recall that some nodes are not deterministically de-anonymized; multiple candidate matches are produced instead. We now describe how we leverage these candidates for link prediction.

The idea behind voting is very simple. Let $ v_K$ denote a Kaggle node, and $ \mathcal{C}(v_K)$ denote $ v_K$'s de-anonymization candidates, i.e., potential Flickr nodes that correspond to $ v_K$. Given an edge $ (a, b)$ in the test set, suppose $ \vert\mathcal{C}(a)\vert \neq 0$ and $ \vert\mathcal{C}(b)\vert \neq 0$. Then the edge $ (a, b)$ has $ \vert\mathcal{C}(a)\vert\cdot\vert\mathcal{C}(b)\vert$ possible de-anonymization candidates. If all candidates unanimously vote 0, i.e., none of these candidate edges exist in the Flickr graph, then we output the prediction 0. Conversely, if all candidates unanimously vote $ 1$, we output the prediction $ 1$.
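As a concrete illustration, the voting rule can be written as a short function. The following is a minimal Python sketch, under the assumption that the Flickr graph is available as a set of directed edges and that the candidate sets come from the de-anonymization step; the names candidates and flickr_edges are ours, not part of the original implementation.

def vote_on_edge(a, b, candidates, flickr_edges):
    """Unanimous voting over de-anonymization candidates.

    candidates  : dict mapping a Kaggle node to its set of candidate Flickr nodes
    flickr_edges: set of directed Flickr edges (u, v)

    Returns 1 if every candidate pair is a Flickr edge, 0 if none is,
    and None if the vote is not unanimous (left to machine learning).
    """
    cand_a, cand_b = candidates.get(a, set()), candidates.get(b, set())
    if not cand_a or not cand_b:
        return None  # no de-anonymization coverage for this test edge

    votes = [(u, v) in flickr_edges for u in cand_a for v in cand_b]
    if all(votes):
        return 1
    if not any(votes):
        return 0
    return None  # mixed votes: fall back to the ML classifier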

To exploit the full power of voting, we applied the following additional steps to the de-anonymization algorithm: 1) prune the de-anonymization output by confidence score; 2) among the remaining nodes, run stage 2 of the propagation algorithm with the “sufficiently similar” criterion eliminated entirely, thereby encouraging the algorithm to include even less likely candidates. As a result, some nodes have large numbers of candidates. This is not a drawback: since we adopt a unanimous voting strategy, having more candidates only makes a unanimous verdict harder to reach, so the predictions that do emerge from the vote are more confident.

Naturally, this gives almost complete coverage of the nodes and edges: 96.0% node coverage and 92.6% edge coverage. Of the test edges, 57.0% had unique candidates; we applied voting to the remaining 35.6%. As shown in Table III, voting produced unanimous results for 18.7% of the test edges (out of that 35.6%).

Table I is a breakdown of the number of edges and non-edges found by the de-anonymization and voting strategies. One interesting observation is that the de-anonymization algorithm determined more edges than non-edges, while the voting strategy uncovered only non-edges, 98.1% of which are true non-edges. This is not surprising, for two reasons. First, the de-anonymization algorithm has better coverage of higher-degree nodes. Second, we use a unanimous voting strategy to ensure high confidence. When a test edge has multiple de-anonymization candidates, it is highly likely that some candidate pair will vote no, since two nodes picked at random from the graph are far more likely not to share an edge (see the calculation below). Therefore, unanimous voting is much more likely to uncover a non-edge than an edge.
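One way to make this calculation explicit: for a directed graph with $ n$ nodes and $ m$ edges, the probability that a uniformly random ordered pair of nodes forms an edge is $ m/(n(n-1))$. At the scale of the challenge graph (on the order of $ 10^6$ nodes and $ 10^7$ edges), this is roughly $ 10^{-5}$, so a randomly chosen candidate pair is overwhelmingly likely to vote no, and unanimous “yes” verdicts are correspondingly rare.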

Algorithm 3 formally describes how we combine de-anonymization with machine learning predictions.


Algorithm 3: Combining de-anonymization and machine learning.
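The following is a minimal Python sketch, in our own notation, of how the text describes combining the three methods; it reuses the vote_on_edge helper from the sketch above, and the names candidates, flickr_edges, and ml_score are placeholders rather than the original implementation.

def predict_all(test_edges, candidates, flickr_edges, ml_score):
    """Combine de-anonymization, unanimous voting, and machine learning.

    For each test edge: if both endpoints are uniquely de-anonymized, look the
    edge up directly in the Flickr graph; otherwise try unanimous voting over
    candidate pairs; if the vote is not unanimous (or there is no coverage),
    fall back to the ML classifier's probability estimate.
    """
    predictions = {}
    for a, b in test_edges:
        cand_a = candidates.get(a, set())
        cand_b = candidates.get(b, set())
        if len(cand_a) == 1 and len(cand_b) == 1:
            # Deterministic de-anonymization ("DA" rows in Tables I and III).
            u, v = next(iter(cand_a)), next(iter(cand_b))
            predictions[(a, b)] = 1.0 if (u, v) in flickr_edges else 0.0
            continue
        verdict = vote_on_edge(a, b, candidates, flickr_edges)
        if verdict is not None:
            predictions[(a, b)] = float(verdict)
        else:
            # "ML set": edges not resolved by de-anonymization or voting.
            predictions[(a, b)] = ml_score(a, b)
    return predictions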


Table I: Results: breakdown of the number of edges and non-edges predicted by the de-anonymization and voting algorithms.

Method    #(%) of edges predicted   #(%) of non-edges predicted
DA        2868 (56.2%)              2240 (43.8%)
Voting    0 (0%)                    1677 (100%)

Machine learning. We implemented 25 features (and their variations) capturing neighborhood characteristics up to $ 4$ hops away. These features include Adamic/Adar [18], Jaccard [18], localized random walks, node degrees, local clustering coefficients, whether the reverse edge exists, and so on. For a subset of the above features, we also implemented different variations with different parametrizations (e.g., hop count), and using in-degrees, out-degrees, or both.
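To make two of these features concrete, here is a minimal sketch (not the original implementation) of Adamic/Adar and Jaccard over common neighbors, assuming the graph is held as a dictionary mapping each node to the set of its neighbors; the paper's variants additionally distinguish in- and out-neighborhoods.

import math

def common_neighbor_features(a, b, neighbors):
    """Two of the standard neighborhood features: Adamic/Adar and Jaccard.

    neighbors: dict mapping each node to the set of its (here: undirected)
    neighbors; in- and out-neighborhoods could be used separately as well.
    """
    na, nb = neighbors.get(a, set()), neighbors.get(b, set())
    common = na & nb
    union = na | nb

    # Adamic/Adar: common neighbors weighted by the inverse log of their degree.
    adamic_adar = sum(1.0 / math.log(len(neighbors[z]))
                      for z in common if len(neighbors[z]) > 1)

    # Jaccard: fraction of the combined neighborhood that is shared.
    jaccard = len(common) / len(union) if union else 0.0

    return adamic_adar, jaccard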

We ran the Random Forest [8] classifier on top of the 25 selected features. We created a training and validation set using the “ground truth” obtained from de-anonymization. The training set contains 3,000-4,200 examples, and the validation set contains roughly 1,000-2,000 examples.
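The text does not name a particular Random Forest implementation; purely as an illustration, a scikit-learn version of the training and validation step might look as follows, with the hyperparameters being our own assumptions rather than values from the paper.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def train_random_forest(X_train, y_train, X_val, y_val):
    """Fit a Random Forest on the feature matrix and report validation AUC.

    X_train / X_val: arrays of shape (n_examples, n_features) built from the
    neighborhood features; y_* are the 0/1 labels obtained from the
    de-anonymized "ground truth". Hyperparameters here are illustrative.
    """
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    return clf, val_auc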

Table II shows the performance of the machine learning algorithm and of selected individual features. Notably, among the features we tried, the best stand-alone feature is the localized random walk. Given a test edge $ (a, b)$, we perform a random walk starting at node $ a$ for a limited number of hops, and compute how likely it is to reach node $ b$ in the process. To compute this probability, we ran an approximate version of 3-4 rounds of the PageRank-style matrix-multiplication iteration, started at node $ a$. The localized random walk feature achieves an AUC of 0.912 and 0.924 when the maximum number of hops is 3 and 4, respectively.
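As a sketch of this feature (our own approximation, not the original code), the limited-hop walk can be computed with a few sparse matrix-vector products, starting from a probability distribution concentrated at node a and accumulating the mass that arrives at node b.

import numpy as np
from scipy.sparse import csr_matrix

def localized_random_walk(adj: csr_matrix, a: int, b: int, hops: int = 3) -> float:
    """Accumulated probability mass at node b over the first `hops` steps of a
    random walk started at node a (a proxy for how likely the walk reaches b)."""
    n = adj.shape[0]
    # Row-normalize the adjacency matrix to get transition probabilities.
    out_deg = np.asarray(adj.sum(axis=1)).ravel()
    out_deg[out_deg == 0] = 1.0  # avoid division by zero for sink nodes
    inv_deg = 1.0 / out_deg

    dist = np.zeros(n)
    dist[a] = 1.0
    reached = 0.0
    for _ in range(hops):
        dist = adj.T @ (dist * inv_deg)  # one step of the walk
        reached += dist[b]               # accumulate mass arriving at b
    return reached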

The Random Forest algorithm achieved an AUC of 0.935 to 0.945 on the validation set. On the entire Kaggle test set, it achieved an AUC of 0.953; however, this number has to be taken with a grain of salt, as we used part of the test set for training (see Figure 13). The classifier performed worse on the ML set, which was not covered by de-anonymization or voting, achieving an AUC of 0.881. An interesting observation, therefore, is that the subset of test edges that is harder to de-anonymize also turns out to be relatively hard for link prediction through ML. This is likely because these nodes lack the information needed for either de-anonymization or accurate link prediction.


Figure 13: Breakdown of the Kaggle test set.


Table II: Machine Learning Results.

Method                           AUC (test set)   AUC (ML set)
Adamic/Adar                      0.835            0.760
Jaccard                          0.851            0.775
Localized Random Walk (3 hops)   0.912            0.834
Localized Random Walk (4 hops)   0.924            0.865
Random Forest (25 features)      0.953*           0.881

*: Part of the test set was used for training, with labels obtained from the de-anonymization result.



Table III: Results.

Method    % Edges Covered   Accuracy   AUC
DA        57.0%*            0.987      -
Voting    18.7%             0.981      -
ML        24.3%             -          0.881**
Overall   100%              -          0.981

* Based on a pruned version of the de-anonymization result.
** AUC on the ML set was 0.881, whereas AUC on the validation set was 0.935-0.945. The ML set is a hard subset for both de-anonymization and machine learning, since it is biased toward low-degree nodes.

