Random sampling from distributed streams

A technique for random sampling from distributed streams of data. It addresses settings where it is impractical to collect all data at a single site and process it centrally, and where communication between the sites and a coordinator must be kept low.

Status: Inactive
Publication Date: 2013-03-21
IOWA STATE UNIV RES FOUND +1
2 Cites, 0 Cited by

AI Technical Summary

Benefits of technology

The invention is a method, computer program product, and system for distributed sampling on a network with multiple sites and a coordinator. The coordinator receives from a site a data element that carries a randomly assigned weight, compares that weight with a global value stored at the coordinator, and then either updates the global value to the received weight or communicates the stored global value back to the site. This maintains a random sample over the union of the streams while limiting communication between the sites and the coordinator.

Problems solved by technology

For many data analysis tasks, it is impractical to collect all the data at a single site and process it in a centralized manner.
A challenge is to minimize the communication between the different sites and the coordinator, while providing an accurate answer to queries at the coordinator at all times.
A problem in this setting is to obtain a random sample drawn from the union of all distributed streams.
Other problems on distributed stream processing, including the estimation of the number of distinct elements and heavy hitters, use random sampling as a primitive.



Examples


case i

[0052] v sends a message to the coordinator in epoch i in Process A. In this case, the first time v sends a message to the coordinator in this epoch, v receives the current value of u, which is smaller than or equal to $m_i$. This exchange costs two messages, one in each direction. From then on in this epoch, the number of messages sent in Process A is no more than the number sent in Process B. Hence, in this epoch, the number of messages transmitted to / from v in Process A is at most twice the number in Process B, which has at least one transmission from the coordinator to site v.

case ii

[0053] v did not send a message to the coordinator in this epoch, in Process A. In this case, the number of messages sent in this epoch to / from site v in Process A is smaller than in Process B.

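[0054] Let $\xi$ denote the total number of epochs.

[0055] Lemma 4. If $r \ge 2$,

$$E[\xi] \le \frac{\log(n/s)}{\log r} + 2$$

[0056] Proof

Let $z = \frac{\log(n/s)}{\log r}$.

First, it is noted that in each epoch, u decreases by a factor of at least r. Thus, after $(z+\ell)$ epochs, u is no more than

$$\frac{1}{r^{z+\ell}} = \left(\frac{s}{n}\right)\frac{1}{r^{\ell}}.$$

Thus,

[0057]

$$\Pr[\xi \ge z+\ell] \le \Pr\left[u \le \left(\frac{s}{n}\right)\frac{1}{r^{\ell}}\right]$$

[0058] Let Y denote the number of elements (out of n) that have been assigned a weight of $\frac{s}{n r^{\ell}}$ or less. Y is a binomial random variable with expectation $\frac{s}{r^{\ell}}$. Note that if $u \le \frac{s}{n r^{\ell}}$, it must be true that $Y \ge s$. Hence

$$\Pr[\xi \ge z+\ell] \le \Pr[Y \ge s] \le \Pr\left[Y \ge r^{\ell} E[Y]\right] \le \frac{1}{r^{\ell}}$$

where Markov's inequality has been used.

[0059] Since $\xi$ takes only positive integral values,

$$E[\xi] = \sum_{i>0} \Pr[\xi \ge i] = \sum_{i=1}^{z} \Pr[\xi \ge i] + \sum_{\ell \ge 1} \Pr[\xi \ge z+\ell] \le z + \sum_{\ell \ge 1} \frac{1}{r^{\ell}} \le z + \frac{1}{1 - 1/r} \le z + 2$$

where $r \ge 2$ has been used.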
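The bound of Lemma 4, log(n/s)/log r + 2, can be checked empirically with a small simulation. The sketch below is illustrative only and rests on assumptions not spelled out in this excerpt: each element receives an independent uniform weight in (0, 1), u is the s-th smallest weight seen so far (1 until s elements have arrived), and an epoch ends as soon as u has dropped to at most 1/r of its value at the start of that epoch. The names `count_epochs` and `lemma4_bound` are hypothetical helpers.

```python
import heapq
import math
import random


def lemma4_bound(n, s, r):
    """Upper bound on the expected number of epochs: log(n/s)/log(r) + 2."""
    return math.log(n / s) / math.log(r) + 2


def count_epochs(n, s, r, rng):
    """Simulate n arrivals and count epochs under the assumed epoch rule.

    Assumption: an epoch ends once u is at most (value of u at the start
    of the epoch) / r, so u shrinks by a factor of at least r per epoch.
    """
    smallest = []        # negated weights: a max-heap of the s smallest weights
    u = 1.0              # s-th smallest weight so far (1.0 until s arrivals)
    epoch_start = 1.0    # value of u at the start of the current epoch
    epochs = 1           # the first epoch is in progress from the beginning
    for _ in range(n):
        w = rng.random()
        if len(smallest) < s:
            heapq.heappush(smallest, -w)
        elif w < -smallest[0]:
            heapq.heapreplace(smallest, -w)
        if len(smallest) == s:
            u = -smallest[0]
        if u <= epoch_start / r:
            epochs += 1
            epoch_start = u
    return epochs


if __name__ == "__main__":
    rng = random.Random(0)
    n, s, r = 10_000, 10, 2
    runs = [count_epochs(n, s, r, rng) for _ in range(50)]
    print("mean epochs:", sum(runs) / len(runs))
    print("Lemma 4 bound:", lemma4_bound(n, s, r))
```

The empirical mean is only evidence for the bound under these assumptions, not a substitute for the proof.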

[0060] Let $n_j$ denote the total number of elements that arrived in epoch j, thus $n = \sum_{j=0}^{\xi-1} n_j$. Let $\mu$ denote the total number of messag...


Abstract

Described herein are methods, systems, apparatuses and products for random sampling from distributed streams. An aspect provides a method for distributed sampling on a network with a plurality of sites and a coordinator, including: receiving at the coordinator a data element from a site of the plurality of sites, said data element having a weight randomly associated therewith deemed reportable by comparison at the site to a locally stored global value; comparing the weight of the data element received with a global value stored at the coordinator; and performing one of: updating the global value stored at the coordinator to the weight of the data element received; and communicating the global value stored at the coordinator back to the site of the plurality of sites. Other embodiments are disclosed.
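The exchange described in the abstract can be sketched for a sample of size one. This is a minimal illustration under stated assumptions, not the claimed implementation: weights are drawn uniformly from [0, 1), a weight is "reportable" when it is smaller than the site's locally stored copy of the global value, and a site that receives no reply assumes its weight became the new global value. The class names `Coordinator` and `Site` and the message counter are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Coordinator:
    global_value: float = 1.0   # smallest weight accepted so far
    sample: object = None       # element that carries that weight

    def receive(self, element, weight):
        """Update the global value, or return it so it can be sent back."""
        if weight < self.global_value:
            self.global_value = weight
            self.sample = element
            return None             # updated; nothing is sent back
        return self.global_value    # stale site: communicate the global value


@dataclass
class Site:
    coordinator: Coordinator
    rng: random.Random
    local_value: float = 1.0    # locally stored copy of the global value
    messages: int = 0           # messages to and from this site

    def observe(self, element):
        """Process one stream element; return the weight assigned to it."""
        weight = self.rng.random()
        if weight >= self.local_value:
            return weight           # not reportable: no communication
        self.messages += 1          # site -> coordinator
        reply = self.coordinator.receive(element, weight)
        if reply is None:
            self.local_value = weight   # assumed: our weight is now global
        else:
            self.messages += 1          # coordinator -> site
            self.local_value = reply
        return weight


if __name__ == "__main__":
    coord = Coordinator()
    rng = random.Random(7)
    sites = [Site(coord, rng) for _ in range(4)]
    for t in range(2000):
        sites[t % 4].observe(t)
    print("sample:", coord.sample, "messages:", sum(s.messages for s in sites))
```

In this sketch every local value stays at or above the coordinator's global value, so any element whose weight beats the global value is always reported, and the coordinator ends up holding the minimum-weight element over the union of the streams.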

Description

FIELD OF THE INVENTION

[0001] The subject matter presented herein generally relates to optimal random sampling from distributed streams of data.

BACKGROUND

[0002] For many data analysis tasks, it is impractical to collect all the data at a single site and process it in a centralized manner. For example, data arrives at multiple network routers at extremely high rates, and queries are often posed on the union of data observed at all the routers. Since the data set is changing, the query results could also be changing continuously with time. This has motivated the continuous, distributed, streaming model. In this model there are k physically distributed sites receiving high-volume local streams of data. These sites talk to a central coordinator that has to continuously respond to queries over the union of all streams observed so far. A challenge is to minimize the communication between the different sites and the coordinator, while providing an accurate answer to queries at the coordinator...

Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/30
CPC: G06F17/30536, G06F17/30516, G06F17/30533, G06F16/24568, G06F16/2458, G06F16/2462
Inventors: WOODRUFF, DAVID P.; TIRTHAPURA, SRIKANTA N.
Owner: IOWA STATE UNIV RES FOUND