Data partitioning method and device for flow data processing system

A technology for processing systems and data partitioning, applied in the data processing field of big data technology. It addresses the heavy workload placed on some working nodes, which degrades system performance, and aims to achieve good load balance while avoiding communication overhead.

Pending Publication Date: 2017-12-08

NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT

Problems solved by technology

[0002] With the rapid development of the Internet and social networks, streaming data has become an important type of big data. Large-scale streaming data is widely used in fields such as network monitoring, stock market forecasting, aerospace, Web applications, and meteorological measurement and control. Compared with static data, streaming data has unique characteristics, so stream data processing systems face additional challenges: the continuous flow of data requires the system to process data in real time or near real time; the flow rate of the data is difficult to control and predict, which requires the system to be scalable and self-adaptive; and the randomness of the data generated by the data source makes the system's load usually unpredictable.

Moreover, if the data from the data source follows a skewed distribution, the amount of data distributed to the parallel computing nodes is usually unbalanced. Some working nodes become overloaded, which degrades the performance of the entire system.

[0003] In the prior art, widely used data stream distribution strategies include random partitioning (Shuffle Grouping) and key-based partitioning (Key Grouping). Random partitioning is only suitable for stateless data. Key-based partitioning sends all data stream tuples that share the same key to the same working node for processing, but when the keys in the data stream are unevenly distributed, this places excessive load on some working nodes and results in serious imbalance.
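The contrast between the two prior-art strategies can be sketched in a few lines of Python. This is an illustration only, not code from the patent: the function names, the use of `zlib.crc32` as the hash, the round-robin stand-in for random assignment, and the toy stream are all assumptions made for the example.

```python
import itertools
import zlib


def key_grouping(key: str, num_workers: int) -> int:
    """Route a tuple by hashing its key, so equal keys always share a worker."""
    # crc32 is used instead of the built-in hash() because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def make_shuffle_grouping(num_workers: int):
    """Return a router that ignores the key and cycles through the workers.

    Round-robin is an assumed, deterministic stand-in for random assignment.
    """
    counter = itertools.count()

    def route(_key: str) -> int:
        return next(counter) % num_workers

    return route


def load_per_worker(keys, route, num_workers):
    """Count how many tuples each worker receives under a routing function."""
    loads = [0] * num_workers
    for key in keys:
        loads[route(key)] += 1
    return loads


# A skewed toy stream (hypothetical data): one key dominates.
stream = ["the"] * 90 + ["champagne"] * 5 + ["stream"] * 5
NUM_WORKERS = 4

key_loads = load_per_worker(stream, lambda k: key_grouping(k, NUM_WORKERS), NUM_WORKERS)
shuffle_loads = load_per_worker(stream, make_shuffle_grouping(NUM_WORKERS), NUM_WORKERS)
```

Under these assumptions, the shuffle router spreads the 100 tuples evenly but scatters tuples of the same key across workers, while the key router keeps each key on one worker and therefore sends at least 90 tuples to whichever worker owns the dominant key.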

Embodiment

[0043] The data partitioning method provided by the present invention is applied to counting the k most frequent words (Top-k) in the stream data.

[0044] Counting the k most frequent words in the stream data with the Key Grouping method works as follows: each word serves as the key; the data source node (source) uses a hash function to map different words to different working nodes for processing; each working node (worker) runs several counting programs that count how often each word occurs; after a period of time, each node selects its local Top-k and sends it to the downstream working nodes, which aggregate the results and compute the final Top-k.

[0045] Because words occur in the data stream with different frequencies (for example, "the" occurs far more often than "champagne"), the load on the working nodes that process different words is severely unbalanced.
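The Key Grouping word count described in this embodiment could look roughly like the following Python sketch. It is a minimal illustration under stated assumptions: the helper names, the `zlib.crc32` hash, the single counter per worker, and the sample sentence are hypothetical and do not come from the patent.

```python
import heapq
import zlib
from collections import Counter


def worker_for(word: str, num_workers: int) -> int:
    """Key Grouping: the word is the key, and its hash selects the worker."""
    return zlib.crc32(word.encode("utf-8")) % num_workers


def count_on_workers(words, num_workers: int):
    """Each worker counts only the words that are hashed to it."""
    counters = [Counter() for _ in range(num_workers)]
    for word in words:
        counters[worker_for(word, num_workers)][word] += 1
    return counters


def global_top_k(partials, k: int):
    """Downstream aggregation: merge per-worker candidates into the final Top-k."""
    merged = [pair for partial in partials for pair in partial]
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])


# Hypothetical input stream of 11 words.
words = "the cat and the dog and the bird saw the cat".split()

counters = count_on_workers(words, num_workers=3)
partials = [c.most_common(2) for c in counters]   # local Top-k on each worker
top2 = global_top_k(partials, k=2)                # final Top-k downstream
loads = [sum(c.values()) for c in counters]       # tuples handled by each worker
```

Because every occurrence of a word lands on the same worker, the local counts are already exact and the downstream step only has to merge candidates. The `loads` list shows the cost: the worker that owns "the" handles at least 4 of the 11 tuples regardless of how the other words are spread.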

[0046] The top k words with the highest frequency in ...

Abstract

The invention provides a data partitioning method and device for a stream data processing system. The method comprises: classifying the stream data, key by key, into hot keys and non-hot keys according to the number of stream tuples counted in real time; and allocating the classified stream data to the least-loaded working node among the selected working nodes, thereby completing the data partitioning. The method and device save communication overhead between the data source node and the working nodes and achieve effective load balance.
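One possible reading of the abstract is sketched below in Python. The abstract does not state how a key is judged hot, how the candidate working nodes are selected, or how load is measured, so every such detail here is an assumption: the class name `HotKeyPartitioner`, the `hot_fraction` threshold, the `warmup` period, the two salted-hash candidates for non-hot keys, the choice of all workers as candidates for hot keys, and the load estimate kept locally at the source.

```python
import zlib
from collections import Counter


class HotKeyPartitioner:
    """Hypothetical sketch of hot-key-aware routing at a single source node."""

    def __init__(self, num_workers, hot_fraction=0.2, cold_choices=2, warmup=20):
        self.num_workers = num_workers
        self.hot_fraction = hot_fraction   # assumed threshold, not from the source
        self.cold_choices = cold_choices   # assumed candidate count for non-hot keys
        self.warmup = warmup               # assumed: no key is hot before this many tuples
        self.key_counts = Counter()        # tuples seen per key, counted as they arrive
        self.total = 0
        self.loads = [0] * num_workers     # load estimated locally, no worker feedback

    def _candidates(self, key, how_many):
        # Derive candidate workers from salted hashes of the key.
        return {
            zlib.crc32(f"{salt}:{key}".encode("utf-8")) % self.num_workers
            for salt in range(how_many)
        }

    def is_hot(self, key):
        if self.total < self.warmup:
            return False
        return self.key_counts[key] > self.hot_fraction * self.total

    def route(self, key):
        self.key_counts[key] += 1
        self.total += 1
        if self.is_hot(key):
            candidates = range(self.num_workers)  # hot keys may use any worker
        else:
            candidates = self._candidates(key, self.cold_choices)
        # Send the tuple to the least-loaded worker among the candidates.
        target = min(candidates, key=lambda w: self.loads[w])
        self.loads[target] += 1
        return target


partitioner = HotKeyPartitioner(num_workers=4)
stream = ["the"] * 90 + ["champagne"] * 5 + ["stream"] * 5  # hypothetical skewed stream
for word in stream:
    partitioner.route(word)
```

In this sketch the source decides from its own counters alone, which is one way the communication overhead between the source and the working nodes might be avoided. A hot key ends up spread over several workers, so any per-key result has to be merged downstream, as in the Top-k aggregation step of the embodiment.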

Description

Technical field

[0001] The invention relates to the data processing field of big data technology, and in particular to a data partitioning method and device for a stream data processing system.

Background technique

[0002] With the rapid development of the Internet and social networks, streaming data has become an important type of big data. Large-scale streaming data is widely used in fields such as network monitoring, stock market forecasting, aerospace, Web applications, and meteorological measurement and control. Compared with static data, streaming data has unique characteristics, so stream data processing systems face additional challenges: the continuous flow of data requires the system to process data in real time or near real time; the flow rate of the data is difficult to control and predict, which requires the system to be scalable and self-adaptive; the randomness of the data generated by the data source makes the load of th...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): H04L12/803, G06F17/30

CPC: H04L47/125, G06F16/278, G06F16/285

Inventor: 史亮, 王勇, 张鸿, 刘谦

Owner: NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT
