美文网首页
Reservoir Sampling

Reservoir Sampling

作者: SharlotteZZZ | 来源:发表于2018-09-21 05:34 被阅读0次

Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory. For example, a list of search queries in Google and Facebook.

So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].

The probability that any item stream[i] where 0 <= i < n will be in final reservoir[] is k/n.

It can be solved in O(n) time. The solution also suits well for input in the form of stream. The idea is similar to this post. Following are the steps.

  1. Create an array reservoir[0..k-1] and copy first k items of stream[] to it.
  2. Now one by one consider all items from (k+1)th item to nth item.
  • generate random number j from U(0,index) where index is the index of current item in stream[].
  • If j falls in range 0 to k-1, replace reservoir[j] with arr[i]

Proof:
(1)For stream[i] where k <= i < n
Consider the last item -- the probability that the last item is in final reservoir = The probability that one of the first k indexes is picked for last item = k/n
Let us now consider the second last item. The probability that the second last item is in final reservoir[] = [Probability that one of the first k indexes is picked in iteration for stream[n-2]] X [Probability that the index picked in iteration for stream[n-1] is not same as index picked for stream[n-2] ] = [k/(n-1)]*[(n-1)/n] = k/n.

(2)For stream[i] where 0 <= i < k
The first k items are initially copied to reservoir[] and may be removed later in iterations for stream[k] to stream[n].
The probability that an item from stream[0..k-1] is in final array = Probability that the item is not picked when items stream[k], stream[k+1], …. stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n

# An efficient Python3 program  
# to randomly select k items from a stream of items 
import random 

#used to print our reservoir
def printArray(stream,n): 
    for i in range(n): 
        print(stream[i],end=" "); 
    print(); 
  
# A function to randomly select k items from stream[0..n-1]. 
def selectKItems(stream, n, k): 
        i=0;  # index for elements in stream[] 
          
        # reservoir[] is the output  
        reservoir = [0]*k; 
        for i in range(k): 
            reservoir[i] = stream[i]; 
          
        # Iterate from the (k+1)th to nth element 
        while(i < n): 

            j = random.randrange(i+1); 

            if(j < k): 
                reservoir[j] = stream[i]; 
            i+=1; 
          
        print("Following are k randomly selected items"); 
        printArray(reservoir, k); 
     
if __name__ == "__main__": 
    stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]; 
    n = len(stream); 
    k = 5; 
    selectKItems(stream, n, k); 
  
# This code is contributed by mits

Practice:
LC382 Linked List Random Node

Reference:
https://www.geeksforgeeks.org/reservoir-sampling/

相关文章

网友评论

      本文标题:Reservoir Sampling

      本文链接:https://www.haomeiwen.com/subject/rswonftx.html