Reservoir Sampling

Labeled as Algorithm R in the description by Jeffrey Vitter in his subject of Random Sampling with a Reservoir, reservoir sampling is a common technique in data processing: randomly choose k samples out of a set S with n items wherein the n is either very large or unknown beforehand; all the chosen k items form a "reservoir" in this sense and guarantee to have each of them chosen with equal probability 1/n.

Formal Description

Given a stream of data of unknown size n, randomly select k items "reservoir" from it with equal probability by: saving the first k items in the array of size k. For each item j, wherein j>k, choose a random integer M from 1 to j, if M is smaller than k, then substitute the value of "reservoir" in index M with the value of array in index j.

Note: assist in understanding the problem, there is an interesting post here with an example of how girls fairly choose a date from a bunch of bachelors.

Algorithm R

ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]

  // replace elements with gradually decreasing probability
  for j = k+1 to n
    M := random(1, j)   // important: inclusive range
    if M <= k
        R[M] := S[j]

This yields to Ο(n) time in complexity.

To see how it works, consider a proof by induction:

After the (i-1)th round, let us assume, the probability of a number being in the "reservoir" is k / (i-1). Since the probability of the number being replaced in the ith round is 1 / i, the probability that it survives the ith round is (i-1) / i. Thus, the probability that a given number is in the "reservoir" after the ith round is the product of these two probabilities, i.e. the probability of being in the reservoir after the (i-1)th round, and surviving replacement in the ith round. This is (k / (i-1)) * ((i-1) / i) = k / i. Hence, the result holds for i, and is therefore true by induction.

Additional References

Last updated