Pandagon and Pharyngula have random blogrolls: they have a database of many blogs, of which 20 and 10 respectively are randomly chosen to be displayed each time one loads the front page. Given that, it’s interesting to ask how many blogs there are in each blog’s database.

The basic method is to use duplicates. If Pandagon has 100 blogs on its blogroll, we’ll expect two different reloads to have 20*20/100 = 4 entries in common. More generally, we’re given *k* randomly selected blogs each reload out of *n* on the blogroll; the problem is to find *n*, using the fact that two reloads are expected to have *k*^2/*n* entries in common.
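The *k*^2/*n* overlap rule is easy to check empirically. Here's a minimal Monte Carlo sketch (the function name, trial count, and seed are my own choices):

```python
import random

def expected_overlap(n, k, trials=20000, seed=0):
    """Empirically estimate the mean overlap between two random
    k-blog samples drawn from a blogroll of n blogs."""
    rng = random.Random(seed)
    blogs = range(n)
    total = 0
    for _ in range(trials):
        first = set(rng.sample(blogs, k))
        second = set(rng.sample(blogs, k))
        total += len(first & second)
    return total / trials

# With n = 100 and k = 20, the theoretical mean is 20*20/100 = 4.
print(expected_overlap(100, 20))  # close to 4
```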

Since this method inherently relies on randomness, no database’s size can be determined with certainty. However, by reloading enough times, one can reach arbitrarily high levels of confidence. Here I’ll assume a 95% level, but the same method generalizes to any level.

For example, let’s do it with one reload, i.e. two loads. The first load on Amanda’s blogroll yields,

- Obsidian Wings
- James Wolcott
- Punk Ass Blog
- Big Brass Blog
- Rad Geek
- The Heretik
- Shakespeare’s Sister
- Creek Running North
- Hughes for America
- Punning Pundit
- Blonde Sense
- Alas, a Blog
- Jesus’ General
- The Girl Gets Away
- Firedoglake
- Neal Pollack
- Talk Left
- 3 Quarks Daily
- Ezra Klein
- August Pollak

The second yields,

- Ethical Werewolf
- Faux Real
- Soloway Now
- Steve Gilliard
- Women of Color Blog
- Body and Soul
- Blonde Sense
- Daily Pepper
- Rox Populi
- Ezra Klein
- Skippy the Bush Kangaroo
- Bitch, Ph.D.
- Talk Left
- Big Brass Blog
- PZ Myers
- Malkin(s) Watch
- John and Belle Have a Blog
- Liberal Feminist Bias
- Rad Geek
- Echidne of the Snakes

The two lists have five entries in common, so if that’s the actual expected value, there are 80 entries on Amanda’s blogroll. But there’s a large margin of error: if there are 100 entries, then the probability of a match is 0.2 rather than 0.25, and the probability of having 5 or more matches is 0.37. To rule out *n* = 100 at a two-tailed 95% confidence level, that probability would need to be below 0.025.
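The 0.37 figure can be reproduced with a binomial tail sum, treating each of the 20 entries on the first load as matching independently with probability *k*/*n* — the same approximation used throughout this post:

```python
from math import comb

def binom_tail(k, p, x):
    """P(X >= x) for X ~ Binomial(k, p)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(x, k + 1))

# If n = 100, each of the 20 entries on the first load matches
# with probability k/n = 0.2.
print(round(binom_tail(20, 0.2, 5), 2))  # 0.37
```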

If *m* reloads produce *x* matches in total, then the observed mean number of matches per reload is *x*/*m*, so the point estimate comes from setting *x*/*m* = *k*^2/*n*, or *n* = *mk*^2/*x*. If in fact *n* < *mk*^2/*x*, then the expected number of matches is higher than *x*/*m*, and to rule this out we need the probability of getting *x* or fewer matches to be less than 0.025. If *n* > *mk*^2/*x*, we similarly need the probability of getting *x* or more matches to be less than 0.025.
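The point estimate is a one-liner; here is a sketch with a hypothetical function name:

```python
def estimate_n(m, k, x):
    """Point estimate of the blogroll size n from m reloads that
    produced x matches in total, via x/m = k^2/n."""
    return m * k**2 / x

# Amanda's blogroll above: one reload (m = 1), k = 20, x = 5 matches.
print(estimate_n(1, 20, 5))  # 80.0
```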

Although we can use more information, for example by comparing the first reload to the third, this will vastly complicate the situation, because the events will no longer be independent. If *n* = 100 and *k* = 20, then given that the second reload has no matches with either the first or the third, both of those loads must draw from the same 80 remaining blogs, so the probability that a given entry on the first load also appears on the third goes up from 0.2 to 0.25; and if the second reload is identical to the first and has 4 matches with the third, then obviously there are exactly 4 matches between the first and the third.

Similarly, to maintain independence, we can only use aggregate data from all *m* reloads, rather than the sequence of means.

For an approximate formula, let’s assume the distribution of possible *x*’s given *n*, *k*, and *m* is normal. The distribution is actually binomial, but it approaches normality when *m* is large. We need to find some *m* for which the 95% confidence interval resulting from *n* is disjoint from those resulting from *n* + 1 and *n* – 1.

The expected number of matches is (*k*/*n*)**mk* = *mk*^2/*n*; the variance for a binomial distribution is *mk**(*k*/*n*)(1 – *k*/*n*), so the standard deviation is SQRT(*m*(*n* – *k*)*k*^2/*n*^2) = *k*/*n* * SQRT(*m*(*n* – *k*)). A normal distribution’s 95% confidence interval is between -1.96 and 1.96 standard deviations from the mean, so the upper and lower bounds of the interval are *mk*^2/*n* (+/-) 1.96*k*/*n* * SQRT(*m*(*n* – *k*)).
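These bounds translate directly into code; a sketch of the interval, with the 1.96 cutoff as a default parameter:

```python
from math import sqrt

def match_interval(n, k, m, z=1.96):
    """Normal-approximation 95% interval for the total number of
    matches over m reloads, given blogroll size n and k blogs per load."""
    mean = m * k**2 / n
    sd = (k / n) * sqrt(m * (n - k))
    return mean - z * sd, mean + z * sd

# With n = 80, k = 20 and a single reload, the expected 5 matches
# come with a wide interval:
low, high = match_interval(80, 20, 1)
print(round(low, 2), round(high, 2))  # roughly 1.2 and 8.8
```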

Now, for the intervals for *n* and *n* – 1 to be disjoint, the lower bound for *n* – 1 (which has the larger mean) must exceed the upper bound for *n*: we want *mk*^2/(*n* – 1) – 1.96*k*/(*n* – 1) * SQRT(*m*(*n* – 1 – *k*)) > *mk*^2/*n* + 1.96*k*/*n* * SQRT(*m*(*n* – *k*)), or *mk*^2/*n*(*n* – 1) > 1.96*k*(SQRT(*m*(*n* – 1 – *k*))/(*n* – 1) + SQRT(*m*(*n* – *k*))/*n*), i.e. SQRT(*m*) > 1.96(*n*SQRT(*n* – 1 – *k*) + (*n* – 1)SQRT(*n* – *k*))/*k*. We also want a similar relation with *n* and *n* + 1, which requires *m* to be greater.

In other words, to pinpoint *n* to a 95% confidence, we need to reload 3.84(*n*SQRT(*n* + 1 – *k*) + (*n* + 1)SQRT(*n* – *k*))^2/*k*^2 times, where 3.84 ≈ 1.96^2. If, as the above experiment suggests, Amanda has *n* = 80, it means we need 15,061 reloads; it takes that many to weed out the possibilities of 79 and 81.
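Plugging numbers into the SQRT(*m*) inequality gives the reload count directly; a sketch, using *z* = 1.96 from the derivation above (squaring it yields the constant 1.96^2 ≈ 3.84) and the harder *n*/(*n* + 1) pair:

```python
from math import ceil, sqrt

def reloads_needed(n, k, z=1.96):
    """Smallest m satisfying
    SQRT(m) > z*(n*SQRT(n + 1 - k) + (n + 1)*SQRT(n - k))/k,
    i.e. enough reloads to separate the intervals for n and n + 1
    (the harder of the two neighboring cases)."""
    root_m = z * (n * sqrt(n + 1 - k) + (n + 1) * sqrt(n - k)) / k
    return ceil(root_m**2)

# Amanda's blogroll: n = 80, k = 20.
print(reloads_needed(80, 20))  # 15061
```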
