Graph Matching via Simulated Annealing

In Section III-A we showed how to formulate the problem of seed-identification, the first step of de-anonymization, as a weighted graph matching problem. To recap, we are given two graphs with weighted edges, and we are required to find the mapping between the nodes that minimizes a given function, such as the sum of absolute differences between the weights of edges mapped to each other.

Weighted graph matching generalizes its unweighted version, inexact graph matching, which in turn generalizes graph isomorphism. Unsurprisingly, inexact graph matching is NP-complete [2], and therefore weighted graph matching is NP-complete as well.

Choice of weights. In our experiments the weight of an edge is the in-neighbor cosine-similarity score of the pair of nodes it is incident to. Although there is more information available that could be incorporated into the edge weights—a fraction of the Kaggle nodes have out-edges in addition to in-edges—we choose not to do so for simplicity. Further, one might imagine incorporating the intuition that higher-degree nodes are more likely to correspond to higher-degree nodes, but it is not clear if it is possible to encapsulate this insight in the form of edge weights.

Simulated annealing is a state-space search “metaheuristic” that is ideal for approximate combinatorial optimization. The randomized search begins by exploring the state-space, placing relatively even probability on all directions, in order to avoid local optima. As the search proceeds, the distribution on directions begins to concentrate more tightly around the gradient direction, converging on greedy hill-climbing in the limit. A global parameter called temperature, which gradually decreases with time, controls the trade-off between exploration and hill-climbing. With a suitable “cooling schedule,” local optima can be avoided in favor of the global optimum. For more see the tutorial survey [14].

Our goal in this section is to show that simulated annealing is a scalable approach to weighted graph matching, using a real-world dataset. The salient property of this dataset is that the difference between the graphs to be matched arises primarily due to the evolution of a social network with time. We are not claiming that simulated annealing is necessarily better suited than other approaches such as genetic algorithms, whether on this type of input or for weighted graph matching in general. We review the literature on graph matching in Section VI.

Ground truth and dummy nodes. Ideally, we would like to measure how close the output of our algorithm is to the global optimum, as well as how close the global optimum is to the “ground truth.” However, we do not know the global optima, for obvious reasons. Therefore we directly measure the performance of the algorithm against the ground truth.

Another subtlety is that we actually want a partial mapping. We handle this by the typical approach of adding “dummy” nodes to both graphs [7].7 Adding $ k$ dummy nodes to $ n$ regular nodes has the effect of finding a mapping of size $ n-k$. This is because the weights of all edges incident on dummy nodes are 0, and therefore dummy nodes will never map to other dummy nodes (since that would result in a sub-optimal mapping that can be improved in 1 step).

If a mapping between two non-dummy nodes output by the algorithm is not an actual correspondence, we call it a false positive. If a correspondence between a pair of nodes is not output by the algorithm, it is a false negative. Ideally, we'd like the algorithm to behave as follows. Let $ m$ be the number of pairs of corresponding nodes. Then the number of false positives should be $ \textsf{max}(n-m-k, 0)$ and the number of false negatives $ \textsf{max}(m+k-n, 0)$.

Potential function. Choosing a potential function is difficult—simply adding the differences between cosines gave poor results. For one, Figure 6 shows that a scaling factor is necessary. Moreover since a few false positives might be inevitable (with a sufficiently small number of dummies), these might throw off the potential function due to wildly mismatched cosines. Finally, with the additive function, edges with higher cosine scores have a higher impact on the potential; we would like it to be more balanced.

The algorithm we used, listed as Algorithm 2, incorporates the above rationale. It compares a pair of mapped nodes; the potential function is the sum of the scores of the pairs of mapped nodes. We experimented with a few choices for $ \alpha$ and $ \beta$ and found that $ \alpha = \beta = 0.5$ appears to yield the best results. The reason for this is not yet clear.

% latex2html id marker 363
...e potential function, the summation is unnecessary in practice.}

States and cooling schedule. The set of all bijective mappings between the two set of nodes (including the dummy nodes) forms the set of states for simulated annealing. Initially we start with a random bijection. The “neighbor states” of a bijection are derived by swapping the images of a pair of nodes.8

The acceptance probability of a state transition was chosen in a straightforward manner: $ P(\Delta p, T) = e^{-\frac{\Delta p}{cT}}$ where $ \Delta p$ is the change in potential function incurred by the transition, $ T$ is the temperature, and $ c$ is a constant. Thus, transitions to lower-potential states will always be taken, and other transitions will be taken with a nonzero probability.

$ T$ was varied as $ T = \frac{1}{t}$, where $ t$ is the time or number of iterations. The “constant” $ c$ is actually dependent on the number of nodes $ n$: $ c = c(n) = 20n$. These two choices together constitute the “cooling schedule”.

Figure 8: Illustration of bimodality. This behavior is typical
Image fp-40-0-pdf

Performance. The results are summarized in Figures 9-11. All observations are medians of (at least) 30 trials. The median is a more meaningful statistic to report than the mean, because the distribution is often bimodal, with modes near 0 and $ n$ as shown in Figure 8. This behavior is due to one of two events occurring: the algorithm gets stuck far away from the global optimum, or it finds the global optimum (or a point very close to it), which still results in some error because the global optimum differs from the ground truth by a small amount.

For $ n=20$, the algorithm matches the ideal performance (Figure 9). For $ n=40$, it is not quite ideal, but comes close, making no more than three errors (Figure 10). For $ n=80$, a different behavior is observed: when the number of dummies is too small, the algorithm is unable to find the global optimum in the median case.9 With 8 or more dummies the algorithm performs well (Figure 11).

Figure 9: Simulated annealing performance: 20 nodes.
Image fpfn-20

Figure 10: Simulated annealing performance: 40 nodes.
Image fpfn-40

Figure 11: Simulated annealing performance: 80 nodes.
Image fpfn-80

For the purposes of the propagation algorithm, the results presented here are more than adequate, considering that it worked even with 10 seeds, and can tolerate a fraction of erroneous mappings.

Parameters. As the graphs show, the best results (measured by false positives + false negatives) are obtained when the selected number of dummies is (roughly) equal to the “correct” number of dummies, which is $ n-m$, i.e. the number of nodes without a corresponding node in the other graph. Not knowing the optimal number of dummies is not a problem for our application, since we can try the propagation algorithm with different sets of seeds and as long as any one run results in an “epidemic”, de-anonymization is successful.

Nevertheless, in other applications, being able to determine the right number of dummies automatically might be important. In combinatorial optimization problems, approaches that have fewer knobs that require tweaking are preferable. We describe two potential ways to accomplish this, both of which are possible future extensions:

Figure 12: Simulated annealing running time.
Image complexity

Running time. Finally, we present some observations on running time. Figure 12 shows the number of evaluations of the potential function, which is the bottleneck step. We implemented a cache (with no expiry) of already-seen mappings and their potential function values; thus, what is shown on the $ y$-axis is the number of cache misses. A further optimization computes only the difference in potential function values, by exploiting the fact that only one pair of nodes is swapped in one step, thus decreasing the complexity of potential function evaluation from quadratic to linear.

It is not clear how to interpret Figure 12. The curve appears to start off as a parabola and then become a straight line. We caution against generalizing too much from this figure, because in general the running time to get an equivalent error rate is likely to depend heavily on the nature of the data.


This is analogous to slack variables.
... nodes.8
We tried “clever” ways of picking neighbors where only pairs of nodes whose cosine similarity is greater than a threshold are considered. These choices performed worse than considering all possible pairs.
... case.9
With 0 dummies, the outputs are essentially no better than random permutations, although this is not shown in the graph.
Arvind Narayanan 2011-03-09