Reservoir Sampling
Labeled as Algorithm R in the description by Jeffrey Vitter in his subject of Random Sampling with a Reservoir, reservoir sampling is a common technique in data processing: randomly choose k samples out of a set S with n items wherein the n is either very large or unknown beforehand; all the chosen k items form a "reservoir" in this sense and guarantee to have each of them chosen with equal probability 1/n.
Formal Description
Given a stream of data of unknown size n, randomly select k items "reservoir" from it with equal probability by: saving the first k items in the array of size k. For each item j, wherein j>k, choose a random integer M from 1 to j, if M is smaller than k, then substitute the value of "reservoir" in index M with the value of array in index j.
Note: assist in understanding the problem, there is an interesting post here with an example of how girls fairly choose a date from a bunch of bachelors.
Algorithm R
This yields to Ο(n) time in complexity.
To see how it works, consider a proof by induction:
After the (i-1)th round, let us assume, the probability of a number being in the "reservoir" is k / (i-1). Since the probability of the number being replaced in the ith round is 1 / i, the probability that it survives the ith round is (i-1) / i. Thus, the probability that a given number is in the "reservoir" after the ith round is the product of these two probabilities, i.e. the probability of being in the reservoir after the (i-1)th round, and surviving replacement in the ith round. This is (k / (i-1)) * ((i-1) / i) = k / i. Hence, the result holds for i, and is therefore true by induction.
Additional References
Medium, Reservoir Sampling. https://medium.com/100-days-of-algorithms/day-33-reservoir-sampling-252062ce0baa
Last updated