Notes on data acquisition

Typically, a network crawl can recover only the giant connected component. Both Twitter and Flickr allow querying only forward (outgoing) links. Therefore, we can expect to recover the strongly connected component (SCC) fully and the weakly connected component (WCC) incompletely: a node that links into the crawled region but is not linked from it cannot be discovered.
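
The effect of forward-only queries can be illustrated with a minimal sketch. It is not the crawler used for these experiments; the function name forward_reachable and the toy graph are assumptions made purely for illustration. A breadth-first traversal that follows only outgoing links reaches every member of the seed's SCC, but never finds a node whose only connection is an edge pointing into that component.

```python
from collections import deque


def forward_reachable(seed, out_links):
    """Return all nodes reachable from `seed` by following outgoing links only."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nbr in out_links.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen


# Hypothetical toy graph: a, b, c form a directed cycle (a single SCC);
# x links into the cycle, but nothing links back to x.
toy_graph = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a"],
    "x": ["a"],
}

found = forward_reachable("a", toy_graph)
# x belongs to the same WCC as a, b, c but is absent from `found`.
```

In the toy graph, x is part of the weakly connected component, yet no sequence of forward-link queries starting inside the cycle can reveal it.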

We crawled the entire SCC of Twitter, subject to the caveat that the Twitter API for discovering relationships is indirect; in particular, we cannot discover users whose activity on the website is "protected," i.e., viewable by friends only. Interestingly, the size of the Twitter user population, at least as reflected in the connected component of regular users, turned out to be much smaller than was being reported in the media at the time of our crawl. It is also worth noting that since then Twitter has introduced crippling rate limitations on its API, which make a large-scale crawl infeasible.

We could not crawl the entire SCC of the Flickr graph due to its size. We crawled it in a priority-queue fashion, giving the highest priority to the nodes with the highest number of incoming edges from the already crawled nodes. Comparing our numbers with [58], we conclude that we have, in fact, recovered most of the SCC.
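
The priority-queue strategy described above might be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual crawler: the names priority_crawl, get_out_links, and budget are hypothetical, get_out_links stands in for whatever API call returns a node's forward links, and the tie-breaking rule (smallest node identifier first) is our own choice, since the text does not specify one.

```python
import heapq
from collections import defaultdict


def priority_crawl(seed, get_out_links, budget):
    """Crawl at most `budget` nodes, always expanding the uncrawled node with
    the most incoming edges from already-crawled nodes."""
    crawled = []                  # nodes in the order they were crawled
    crawled_set = set()
    in_count = defaultdict(int)   # incoming edges seen from crawled nodes
    heap = [(0, seed)]            # min-heap of (-in_count, node)
    while heap and len(crawled) < budget:
        neg_count, node = heapq.heappop(heap)
        # Skip entries for crawled nodes and entries with an outdated count.
        if node in crawled_set or -neg_count != in_count[node]:
            continue
        crawled.append(node)
        crawled_set.add(node)
        for nbr in set(get_out_links(node)):
            if nbr in crawled_set:
                continue
            in_count[nbr] += 1
            # Push a fresh entry; the older one becomes stale and is skipped.
            heapq.heappush(heap, (-in_count[nbr], nbr))
    return crawled


# Hypothetical toy graph; node e has no incoming links and is never discovered.
example_graph = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": ["a"],
    "e": ["a"],
}

order = priority_crawl("a", lambda n: example_graph.get(n, []), budget=3)
```

Rather than updating priorities in place, the sketch pushes a new heap entry each time a node's count grows and discards stale entries when they are popped, which keeps the code short at the cost of a somewhat larger heap.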

Finally, the authors of [58], who kindly provided us with the LiveJournal data, report that their crawl covers the vast majority of the users in LiveJournal's WCC.

Arvind Narayanan 2009-03-19