Data partitioning method and device for flow data processing system

A data partitioning technology for stream data processing systems, applied in the data processing field of big data technology. It addresses the problem that unevenly distributed data overloads some working nodes and degrades system performance, and it achieves good load balance while avoiding communication overhead.

Pending Publication Date: 2017-12-08
NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT

AI Technical Summary

Problems solved by technology

[0002] With the rapid development of the Internet and social networks, streaming data has become an important type of big data. Large-scale streaming data is widely used in fields such as network monitoring, stock market forecasting, aerospace, Web applications, and meteorological measurement and control. Compared with static data, streaming data has unique characteristics, so stream data processing systems face additional challenges: the continuous flow of data requires the system to process data in real time or near real time; the data flow rate is difficult to control and predict, so the system must be scalable and self-adaptive; and the randomness of the data generated by the data source makes the load of the system usually unpredictable. Moreover, if the data source follows a skewed distribution, the amount of data distributed to the parallel computing nodes is usually unbalanced, which overloads some working nodes and thus degrades the performance of the entire system.
[0003] In the prior art, widely used data stream distribution strategies include random partitioning (Shuffle Grouping) and key-based partitioning (Key Grouping). Shuffle Grouping is applicable only to stateless data. Key Grouping sends all data stream tuples that share the same key to the same working node for processing, but when the keys in the data stream are unevenly distributed, this partitioning places excessive load on some working nodes and results in serious imbalance.
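The contrast between the two prior-art strategies can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names, the CRC32 hash, the four workers, and the made-up skewed stream. It counts how many tuples each worker receives under each strategy.

```python
import random
import zlib

def key_grouping(key: str, num_workers: int) -> int:
    # Deterministic hash: the same key always reaches the same worker.
    return zlib.crc32(key.encode("utf-8")) % num_workers

def shuffle_grouping(rng: random.Random, num_workers: int) -> int:
    # Random assignment: ignores the key, so per-key state is scattered.
    return rng.randrange(num_workers)

def worker_loads(stream, num_workers, route):
    # Count how many tuples each worker receives under a routing function.
    loads = [0] * num_workers
    for key in stream:
        loads[route(key)] += 1
    return loads

# Hypothetical skewed stream: one key accounts for 70% of the tuples.
NUM_WORKERS = 4
stream = ["the"] * 700 + ["of"] * 200 + ["champagne"] * 50 + ["stream"] * 50
rng = random.Random(0)
rng.shuffle(stream)

key_loads = worker_loads(stream, NUM_WORKERS, lambda k: key_grouping(k, NUM_WORKERS))
shuffle_loads = worker_loads(stream, NUM_WORKERS, lambda k: shuffle_grouping(rng, NUM_WORKERS))
print("key grouping:    ", key_loads)
print("shuffle grouping:", shuffle_loads)
```

Under Key Grouping, whichever worker receives "the" carries at least 700 of the 1000 tuples, while Shuffle Grouping spreads tuples roughly evenly but gives up the guarantee that one worker sees every tuple of a key.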

Examples

Embodiment

[0043] In this embodiment, the data partitioning method provided by the present invention is applied to finding the k most frequent words (Top-k) in the stream data.

[0044] With the Key Grouping method, the Top-k words in the stream data are counted as follows: each word serves as the key, and the data source node (source) uses a hash function to map different words to different working nodes for processing. Each working node (worker) runs several counting programs that count how often the different words occur, selects the local Top-k of that node after a period of time, and sends it to a downstream working node, which aggregates the results and computes the final Top-k.

[0045] Because words occur in the data stream with different frequencies (for example, "the" occurs far more often than "champagne"), the load on the working nodes that process different words is severely uneven.
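The Key Grouping Top-k pipeline described in [0044] can be sketched as follows. This is a single-process simulation under stated assumptions, not the patent's implementation: the helper names, the CRC32 hash, and the sample word list are all hypothetical. Because Key Grouping places every occurrence of a word on exactly one worker, merging the local Top-k lists yields the exact global Top-k.

```python
import heapq
import zlib
from collections import Counter

def route(word: str, num_workers: int) -> int:
    # Key Grouping: hash the word to pick its single owning worker.
    return zlib.crc32(word.encode("utf-8")) % num_workers

def local_counts(words, num_workers):
    # Each worker keeps a frequency counter for the words routed to it.
    counters = [Counter() for _ in range(num_workers)]
    for word in words:
        counters[route(word, num_workers)][word] += 1
    return counters

def top_k(counter, k):
    # k most frequent (word, count) pairs; ties broken by word for determinism.
    return heapq.nlargest(k, counter.items(), key=lambda kv: (kv[1], kv[0]))

def global_top_k(counters, k):
    # Downstream node: merge each worker's local Top-k, then take the Top-k.
    merged = Counter()
    for counter in counters:
        merged.update(dict(top_k(counter, k)))
    return top_k(merged, k)

# Hypothetical sample stream.
words = ["the"] * 5 + ["a"] * 3 + ["of"] * 2 + ["champagne"]
counters = local_counts(words, num_workers=3)
result = global_top_k(counters, k=2)
per_worker_load = [sum(c.values()) for c in counters]
print(result, per_worker_load)
```

The `per_worker_load` list makes the imbalance from [0045] visible: the worker that owns "the" handles at least 5 of the 11 tuples regardless of how the other words hash.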

[0046] The top k words with the highest frequency in ...

Abstract

The invention provides a data partitioning method and device for a stream data processing system. The method comprises: dividing the stream data, per key, into hot keys and non-hot keys according to the number of stream tuples counted in real time; and allocating the classified stream data to the least-loaded working node among the selected working nodes, thereby completing the data partitioning. The method and device save communication overhead between the data source node and the working nodes and achieve effective load balance.
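The two steps in the abstract (classify keys as hot or non-hot from running tuple counts, then send each tuple to the least-loaded of a set of selected workers) might look like the sketch below. The abstract does not say how hot keys are detected or how candidate workers are selected, so every specific here is an assumption: the class name `HotKeyPartitioner`, the `hot_fraction` and `min_count` thresholds, the choice of all workers as candidates for hot keys and two hashed candidates for non-hot keys, and the use of a load estimate kept locally at the source so that workers need not be queried.

```python
import zlib
from collections import Counter

class HotKeyPartitioner:
    """Illustrative source-side partitioner; all policies are assumptions."""

    def __init__(self, num_workers, hot_fraction=0.2, min_count=10, non_hot_choices=2):
        self.num_workers = num_workers
        self.hot_fraction = hot_fraction        # assumed hot-key threshold
        self.min_count = min_count              # assumed warm-up guard
        self.non_hot_choices = non_hot_choices  # assumed candidate count
        self.key_counts = Counter()             # tuples seen per key so far
        self.total = 0                          # tuples seen overall
        self.loads = [0] * num_workers          # local estimate of worker load

    def is_hot(self, key):
        count = self.key_counts[key]
        return count >= self.min_count and count > self.hot_fraction * self.total

    def _hashed_candidates(self, key, n):
        # n salted hashes give a small, fixed candidate set for a key.
        return [zlib.crc32(f"{i}:{key}".encode("utf-8")) % self.num_workers
                for i in range(n)]

    def assign(self, key):
        self.key_counts[key] += 1
        self.total += 1
        if self.is_hot(key):
            candidates = range(self.num_workers)
        else:
            candidates = self._hashed_candidates(key, self.non_hot_choices)
        # Pick the least-loaded worker among the selected candidates.
        worker = min(candidates, key=lambda w: self.loads[w])
        self.loads[worker] += 1
        return worker

# Hypothetical skewed stream: "the" makes up 70% of the tuples.
stream = (["the"] * 7 + ["of", "a", "champagne"]) * 100
partitioner = HotKeyPartitioner(num_workers=4)
for key in stream:
    partitioner.assign(key)
print(partitioner.loads)
```

In this sketch a hot key is spread over several workers, so a stateful computation such as word counting would need a downstream step that merges the partial counts for that key.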

Description

Technical field

[0001] The invention relates to the field of data processing of big data technology, in particular to a data partitioning method and device for a streaming data processing system.

Background technique

[0002] With the rapid development of the Internet and social networks, streaming data has become an important type of big data. Large-scale streaming data is widely used in fields such as network monitoring, stock market forecasting, aerospace, Web applications, and meteorological measurement and control. Compared with static data, streaming data has unique characteristics, so stream data processing systems face additional challenges: the continuous flow of data requires the system to process data in real time or near real time; the data flow rate is difficult to control and predict, so the system must be scalable and self-adaptive; and the randomness of the data generated by the data source makes the load of th...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L12/803, G06F17/30
CPC: H04L47/125, G06F16/278, G06F16/285
Inventor: 史亮, 王勇, 张鸿, 刘谦
Owner NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT