Parallel processing of continuous queries on data streams

Data stream and parallel processing technology, applied in the field of data stream processing and event management, addresses the inability to scale out with respect to the incoming stream volume and the resulting limitation on system capacity.

Inactive Publication Date: 2011-12-22
UNIV MADRID POLITECNICA
Cites: 0; Cited by: 234

AI Technical Summary

Benefits of technology

[0010] Parallel processing of data streams provides scalability: throughput increases as new nodes are added. This parallel processing can be applied to both data stream processing and complex event processing.
[0016] Stream processing engines can be centralized or distributed. A centralized stream processing engine has a single system instance that runs on a single computer or node. A distributed stream processing engine has multiple instances, each of which can run on a different node. The most basic distributed engines execute different queries on different nodes, so they can scale out with respect to the number of queries by adding nodes. Some distributed engines also allow the operators of a query to be distributed across different nodes, which lets them scale out with respect to the number of operators by adding nodes.
[0035] If any source subquery does not produce tuples to be processed by the destination subquery, the input merger will block. To avoid this situation, the load balancers work as follows. Each load balancer keeps track of the timestamp of the last tuple generated for each destination subquery. When no tuple has been sent to a destination subquery for a maximum period of time m, the load balancer sends it a dummy tuple whose timestamp is identical to that of the last tuple sent by that load balancer. When an input merger receives the dummy tuple, it uses it only to unblock its processing. If the dummy tuple does not have the smallest timestamp, the input merger takes the tuple with the smallest timestamp. Sooner or later the dummy tuple will be the one with the smallest timestamp, at which point the input merger simply discards it. Thus, periodic generation of dummy tuples in the load balancers prevents the input merger from blocking.
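The dummy-tuple mechanism described above can be illustrated with a minimal sketch. It is not the patented implementation: the class names `StreamTuple`, `InputMerger` and `LoadBalancer`, the explicit `now` clock argument, and the choice to stamp a dummy with the timestamp of the last real tuple sent to any destination are all assumptions made for illustration. The parameter `max_idle` plays the role of the period m.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Dict, List


@dataclass(frozen=True)
class StreamTuple:
    timestamp: int
    payload: Any = None
    dummy: bool = False  # True for tuples sent only to unblock a merger


class InputMerger:
    """Merges several timestamp-ordered inputs into one ordered output."""

    def __init__(self, source_ids: List[str]) -> None:
        if not source_ids:
            raise ValueError("an input merger needs at least one source")
        self.queues: Dict[str, Deque[StreamTuple]] = {s: deque() for s in source_ids}

    def receive(self, source_id: str, tup: StreamTuple) -> None:
        self.queues[source_id].append(tup)

    def drain(self) -> List[StreamTuple]:
        """Return every real tuple that can be released in timestamp order."""
        out: List[StreamTuple] = []
        # Releasing is only safe while every input has a pending tuple: an
        # empty input could still deliver a smaller timestamp later on.
        while all(self.queues.values()):
            source = min(self.queues, key=lambda s: self.queues[s][0].timestamp)
            tup = self.queues[source].popleft()
            if not tup.dummy:  # a dummy that reaches the front is discarded
                out.append(tup)
        return out


class LoadBalancer:
    """Forwards tuples and emits dummy tuples to idle destinations."""

    def __init__(self, source_id: str, mergers: Dict[str, InputMerger],
                 max_idle: float) -> None:
        self.source_id = source_id
        self.mergers = mergers
        self.max_idle = max_idle  # the period m from the text
        self.last_timestamp = 0   # timestamp of the last real tuple sent
        self.last_sent_at = {d: 0.0 for d in mergers}  # clock time per destination

    def send(self, dest: str, tup: StreamTuple, now: float) -> None:
        self.mergers[dest].receive(self.source_id, tup)
        self.last_sent_at[dest] = now
        if not tup.dummy:
            self.last_timestamp = tup.timestamp

    def tick(self, now: float) -> None:
        """Send a dummy tuple to every destination idle for max_idle or longer."""
        for dest in self.mergers:
            if now - self.last_sent_at[dest] >= self.max_idle:
                self.send(dest, StreamTuple(self.last_timestamp, dummy=True), now)
```

In this sketch `drain()` returns nothing while any input queue is empty, which models the blocked merger. Calling `tick()` on the silent load balancer places a dummy in the empty queue, so tuples with smaller timestamps from the other inputs can be released.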
[0036] Elasticity is a property of distributed systems that refers to the capacity to grow and shrink the number of nodes so that the incoming load is processed with the minimum required resources, that is, the smallest number of nodes able to process the incoming load while satisfying the quality of service requirements.
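As a rough illustration of the elasticity definition above, the following sketch computes the smallest node count for a given load. The text does not define the quality of service requirement, so it is modeled here, purely as an assumption, as a cap on per-node utilization. The function name `nodes_needed` and its parameters are hypothetical.

```python
import math


def nodes_needed(load: float, node_capacity: float,
                 max_utilization: float = 0.75) -> int:
    """Smallest node count that keeps every node at or below max_utilization.

    load and node_capacity are in the same unit (for example tuples per
    second). max_utilization is an assumed stand-in for the quality of
    service requirement.
    """
    if load < 0 or node_capacity <= 0 or not 0 < max_utilization <= 1:
        raise ValueError("invalid load, capacity, or utilization bound")
    usable = node_capacity * max_utilization  # capacity usable per node
    return max(1, math.ceil(load / usable))   # always keep at least one node
```

An elastic system would re-evaluate such a function as the load changes, adding nodes when the result grows and releasing them when it shrinks.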

Problems solved by technology

None of the currently existing approaches can scale out with respect to the incoming stream volume.
This is because the data stream processed by a query or query operator must go through the single node that contains that query or operator, so the system capacity is limited by the capacity of a single node.
For stream volumes exceeding the processing capacity of a node, these systems cannot scale out.
However, this load balancing is studied in the context of a distributed query engine that does not parallelize queries; it therefore addresses how to distribute the load across different subqueries, not between instances of the same subquery.
The problem with this technique is that it loses information, which is not permissible for a multitude of applications, and it has associated tradeoffs such as loss of precision or even loss of consistency in the query results.

Examples

Embodiment Construction

[0052] FIG. 1 shows a query with Map (M), Filter (F), Join (J) and Aggregate (A) operators. In this query, incoming tuples enter through the leftmost operator. The map operator transforms each tuple with its associated transformation function. The filter operator applies a predicate to the tuple: if the predicate is satisfied, the tuple is forwarded to the next operator; otherwise it is discarded. The output of the filter operator is connected to both inputs of the join operator. That is, each tuple produced by the filter operator is sent to each of the two inputs of the join operator, which thus performs a self-join. The join operator applies a predicate to all pairs of tuples kept in the two sliding windows (one associated with each input stream). Each pair that satisfies the predicate is concatenated and emitted as an output tuple. The next operator is an aggregate. It aggregates the tuples according to a given function or a group-by clause. A tuple is generated periodically with the aggregated value ...
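The operator chain of FIG. 1 can be sketched with Python generators. This is only an illustration under stated assumptions: the row schema (a single integer field `v`), the count-based sliding windows, the count-based aggregation period, and the concrete functions passed in `run_query` are all hypothetical and do not come from the patent text.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Iterator, List

Row = Dict[str, int]  # assumed schema for illustration


def map_op(stream: Iterable[Row], fn: Callable[[Row], Row]) -> Iterator[Row]:
    for row in stream:
        yield fn(row)  # transform each tuple


def filter_op(stream: Iterable[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    for row in stream:
        if pred(row):  # forward only tuples that satisfy the predicate
            yield row


def self_join_op(stream: Iterable[Row], window_size: int,
                 pred: Callable[[Row, Row], bool]) -> Iterator[Row]:
    left: Deque[Row] = deque(maxlen=window_size)   # window of the left input
    right: Deque[Row] = deque(maxlen=window_size)  # window of the right input
    for row in stream:
        # Arrival on the left input: probe the right window, then store.
        for other in right:
            if pred(row, other):
                yield {"l": row["v"], "r": other["v"]}
        left.append(row)
        # Arrival on the right input: probe the left window, then store.
        for other in left:
            if pred(other, row):
                yield {"l": other["v"], "r": row["v"]}
        right.append(row)


def aggregate_op(stream: Iterable[Row], every: int,
                 fn: Callable[[List[Row]], int]) -> Iterator[int]:
    batch: List[Row] = []
    for row in stream:
        batch.append(row)
        if len(batch) == every:  # emit one aggregated value per full batch
            yield fn(batch)
            batch = []


def run_query(source: Iterable[Row]) -> List[int]:
    mapped = map_op(source, lambda r: {"v": r["v"] * 10})
    kept = filter_op(mapped, lambda r: r["v"] >= 20)
    joined = self_join_op(kept, 2, lambda l, r: l["v"] < r["v"])
    return list(aggregate_op(joined, 2, lambda b: sum(p["r"] - p["l"] for p in b)))
```

With the input rows `v = 1, 5, 2, 8`, the filter keeps 50, 20 and 80, the self-join emits the pairs (20, 50) and (20, 80), and `run_query` returns `[90]`. Because the same tuple feeds both join inputs, a strict predicate such as `l < r` is used here so that a tuple is not paired with itself.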

Abstract

A parallel engine for continuous queries on data streams provides scalability and increases throughput through the addition of new nodes. The parallel processing can be applied to data stream processing and complex event processing. The engine receives the query to be deployed and splits it into subqueries, obtaining at least one subquery; each subquery is executed on at least one node. Tuples produced by each operator of each subquery are labeled with timestamps. A load balancer is interposed at the output of each node that executes an instance of a source subquery, and an input merger is interposed in each node that executes an instance of a destination subquery. After checks are performed, further load balancers or input mergers may be added.
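The placement of load balancers and input mergers described in the abstract can be sketched as a wiring plan. The sketch assumes a linear chain of subqueries in which every load balancer of a source subquery is connected to every input merger of the next subquery. The `Subquery` class, the `wire` function, and the example split into `sq1`, `sq2` and `sq3` are hypothetical names chosen for illustration, not the engine's actual interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Subquery:
    name: str
    operators: List[str]
    instances: int  # number of nodes running this subquery


def wire(subqueries: List[Subquery]) -> Dict[str, List[str]]:
    """List the load balancers, input mergers and channels of a linear chain."""
    plan: Dict[str, List[str]] = {
        "load_balancers": [], "input_mergers": [], "channels": []}
    # Each consecutive pair forms a (source subquery, destination subquery).
    for src, dst in zip(subqueries, subqueries[1:]):
        sources = [f"{src.name}[{i}]" for i in range(src.instances)]
        dests = [f"{dst.name}[{j}]" for j in range(dst.instances)]
        plan["load_balancers"] += sources  # one at each source instance output
        plan["input_mergers"] += dests     # one at each destination instance input
        plan["channels"] += [f"{s} -> {d}" for s in sources for d in dests]
    return plan


# Assumed split of a Map/Filter/Join/Aggregate query into three subqueries.
QUERY = [
    Subquery("sq1", ["Map", "Filter"], instances=2),
    Subquery("sq2", ["Join"], instances=3),
    Subquery("sq3", ["Aggregate"], instances=1),
]
PLAN = wire(QUERY)
```

For this example the plan contains five load balancers, four input mergers and nine channels. Adding an instance to `sq2` would add one load balancer, one input merger and three channels.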

Description

[0001] This application claims benefit of U.S. Ser. No. 61/356,353, filed 18 Jun. 2011 and which application is incorporated herein by reference. To the extent appropriate, a claim of priority is made to the above disclosed application.

FIELD OF THE INVENTION

[0002] The present invention belongs to the data stream processing and event management fields.

BACKGROUND OF THE INVENTION

[0003] Continuous query processing engines enable processing data streams by queries that process continuously those streams producing results that are updated with the arrival of new data in the data stream. Known continuous query processing engines are Borealis (Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, Stanley B. Zdonik: The Design of the Borealis Stream Processing Engine. CIDR 2005: 277-289), Aurora (Daniel J. Abadi, Donald Carney, Ugur Çetintemel, Mitch Cherniack,...

Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/30
CPC: G06F9/5066, G06F17/30445, G06F17/30516, G06F9/5088, G06F16/24568, G06F16/24532
Inventor: JIMENEZ PERIS, RICARDO; PATINO MARTINEZ, MARTA
Owner: UNIV MADRID POLITECNICA