algorithm-notes
  • Preface
  • Outline
    • Asymptotic-Analysis
    • Divide-and-Conquer (DnC)
      • Overview
      • Multiplication
      • DnC Strategy
      • Master Method
    • Randomization
      • Overview
      • Discrete Probability
      • Randomized Algorithms
      • Reservoir Sampling
    • Sorting
      • Overview
      • Selection Sort
      • Bubble Sort
      • Insertion Sort
      • Shell Sort
      • Merge Sort
      • Heap Sort
      • Quick Sort
      • Non-Comparison Sort
    • Graph Algorithms
      • Overview
      • Minimum Spanning Tree
      • Shortest Path
      • Topological Sort
    • Tree
      • Overview
      • Balanced Search Tree
        • AVL Tree
        • Red-Black Tree
        • B Tree
      • Heap
      • Disjoint Set
      • Threaded Binary Tree
      • Segment Tree
    • Searching
      • Overview
      • Binary Search
      • Graph Search
      • Hash Table
      • String Search
    • Dynamic Programming
      • Overview
      • Fibonacci Numbers
      • LCS problem
    • Greedy Algorithms
      • Overview
Powered by GitBook
On this page
  • Formal Description
  • Algorithm R
  • Additional References

Was this helpful?

  1. Outline
  2. Randomization

Reservoir Sampling

PreviousRandomized AlgorithmsNextSorting

Last updated 6 years ago

Was this helpful?

Labeled as Algorithm R in the description by in his subject of , reservoir sampling is a common technique in data processing: randomly choose k samples out of a set S with n items wherein the n is either very large or unknown beforehand; all the chosen k items form a "reservoir" in this sense and guarantee to have each of them chosen with equal probability 1/n.

Formal Description

Given a stream of data of unknown size n, randomly select k items "reservoir" from it with equal probability by: saving the first k items in the array of size k. For each item j, wherein j>k, choose a random integer M from 1 to j, if M is smaller than k, then substitute the value of "reservoir" in index M with the value of array in index j.

Note: assist in understanding the problem, there is an interesting post with an example of how girls fairly choose a date from a bunch of bachelors.

Algorithm R

ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]

  // replace elements with gradually decreasing probability
  for j = k+1 to n
    M := random(1, j)   // important: inclusive range
    if M <= k
        R[M] := S[j]

This yields to Ο(n) time in complexity.

To see how it works, consider a proof by induction:

After the (i-1)th round, let us assume, the probability of a number being in the "reservoir" is k / (i-1). Since the probability of the number being replaced in the ith round is 1 / i, the probability that it survives the ith round is (i-1) / i. Thus, the probability that a given number is in the "reservoir" after the ith round is the product of these two probabilities, i.e. the probability of being in the reservoir after the (i-1)th round, and surviving replacement in the ith round. This is (k / (i-1)) * ((i-1) / i) = k / i. Hence, the result holds for i, and is therefore true by induction.

Additional References

Medium, Reservoir Sampling.

Jeffrey Vitter
Random Sampling with a Reservoir
here
https://medium.com/100-days-of-algorithms/day-33-reservoir-sampling-252062ce0baa