## More on Random Blogrolls

A more efficient method to determine the length of random blogrolls is to wait until a reload features no entries that haven’t been seen before. After such a point, if the number of unique entries is n‘, then to check whether n‘ = n, reload again. If n‘ = n – 1, then the probability that a given reload will include the remaining entry is k/n. So each reload has a probability of (nk)/n of having no new entry. After sufficiently many reloads, the probability will be 0.05 and the result will be significant at the 95% level.

Now, let’s see what “sufficiently many” means. After r reloads, the probability is ((nk)/n)^r, which we want to be less than 0.05. But that implies that r is greater than the log of 0.05 to base (nk)/n, i.e. r > log 0.05 / log (nk)/n = 1.3/(log n – log(nk)), where the log is taken to base 10.

This method is much, much more efficient than the other one. With n = 80 and k = 20, we need r > 10.4, i.e. r = 11. The absolute worst case scenario is when the second reload is identical to the first, and then each time we need 11 reloads to get a new entry. Then m = 1 + 11(nk) = 661. And that’s an extremely unlikely upper bound.

When n‘ < n – 1, the probability that at least one entry will be new is way higher than k/n. For instance, when n‘ = n – 2, it’s the complement of (nk)/n * (nk – 1)/(n – 1) = 0.56, and r > log 0.05 / log 0.56 = 5.16, so 6 reloads are enough. The actual formula for the probability for any value of nn‘ is then the obvious generalization of the formula for nn‘ = 2.

But note that since we don’t actually know n, only n‘, we can only use r = 11. But the expected values for nn‘ > 1 can be used to calculate the expected number of reloads it will take to cover all entries. An upper bound is the sum of expected values for each nn‘, since it in effect assumes there’s only one new entry each time.

If we have an event with probability p, the expected number of trials it takes to get it is 1/p. The expected value for a given s = nn‘ is then 1/(1 – (nk)(nk – 1)…(nks + 1)/n(n – 1)…(ns + 1)) = n(n – 1)…(ns + 1)/(n(n – 1)…(ns + 1) – (nk)(nk – 1)…(nks + 1)). We need to sum that over all s from 1 to nk and add r.

With n = 80 and k = 20, we get 4 + 2.27 + 1.71 + 1.45 + … + 1 + 1 = 66.39, i.e. 78 reloads are enough.

Note that we don’t actually use n when reloading to calculate probabilities. Instead, we can use an estimate derived from the method of matches of successive reloads. For each one point increase in n, the number of reloads goes up by a little more than one (for example, n = 81 yields 67.53).