Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory. For example, a list of search queries in Google and Facebook.
So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].
The probability that any item stream[i] where 0 <= i < n will be in final reservoir[] is k/n.
It can be solved in O(n) time. The solution also suits well for input in the form of stream. The idea is similar to this post. Following are the steps.
- Create an array reservoir[0..k-1] and copy first k items of stream[] to it.
- Now one by one consider all items from (k+1)th item to nth item.
- generate random number j from U(0,index) where index is the index of current item in stream[].
- If j falls in range 0 to k-1, replace reservoir[j] with arr[i]
Proof:
(1)For stream[i] where k <= i < n
Consider the last item -- the probability that the last item is in final reservoir = The probability that one of the first k indexes is picked for last item = k/n
Let us now consider the second last item. The probability that the second last item is in final reservoir[] = [Probability that one of the first k indexes is picked in iteration for stream[n-2]] X [Probability that the index picked in iteration for stream[n-1] is not same as index picked for stream[n-2] ] = [k/(n-1)]*[(n-1)/n] = k/n.
(2)For stream[i] where 0 <= i < k
The first k items are initially copied to reservoir[] and may be removed later in iterations for stream[k] to stream[n].
The probability that an item from stream[0..k-1] is in final array = Probability that the item is not picked when items stream[k], stream[k+1], …. stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n
# An efficient Python3 program
# to randomly select k items from a stream of items
import random
#used to print our reservoir
def printArray(stream,n):
for i in range(n):
print(stream[i],end=" ");
print();
# A function to randomly select k items from stream[0..n-1].
def selectKItems(stream, n, k):
i=0; # index for elements in stream[]
# reservoir[] is the output
reservoir = [0]*k;
for i in range(k):
reservoir[i] = stream[i];
# Iterate from the (k+1)th to nth element
while(i < n):
j = random.randrange(i+1);
if(j < k):
reservoir[j] = stream[i];
i+=1;
print("Following are k randomly selected items");
printArray(reservoir, k);
if __name__ == "__main__":
stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12];
n = len(stream);
k = 5;
selectKItems(stream, n, k);
# This code is contributed by mits
Practice:
LC382 Linked List Random Node
Reference:
https://www.geeksforgeeks.org/reservoir-sampling/
网友评论