A highly reliable distributed data flow real-time statistical method and system

A statistical method and distributed technology, applied in the field of big data, that addresses problems such as second-level processing delay, nodes that cannot be scaled dynamically, and single points of failure.

Active Publication Date: 2019-11-05

INST OF INFORMATION ENG CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Existing products such as S4, Storm, and Spark have been recognized and adopted by the industry to some extent, but their performance is unsatisfactory in more demanding scenarios.

[0003] S4 has two main shortcomings. The first is reliability: S4 guarantees only at-most-once semantics. When a processing node goes down, its tasks can be transferred, but all data held in memory is lost.

The second is that S4 cannot add nodes dynamically, which is unacceptable for a distributed system.

[0004] Storm has two problems. The first is the single point of failure introduced by its weakly centralized structure: after the Nimbus node or the UI node goes down, the execution of statistical tasks is affected. The second is semantic guarantee, that is, reliability: the Trident mechanism is provided to guarantee exactly-once semantics, but it seriously degrades processing performance.

[0005] Spark Streaming also has two problems. First, its processing delay is high: second-level delays cannot meet application scenarios that require high real-time performance. Second, in terms of semantic guarantee, specific data sources must be used to ensure that lost data ...

Examples

Example 1

[0066] Suppose each record from the data source contains four fields: id, timestamp, source IP (sip), and destination IP (dip), and one of the data streams is as follows:

[0067]

ID  Timestamp  Sip      Dip
1   09:25      1.1.1.1  2.2.2.2
1   09:25      1.1.1.1  3.3.3.3
1   09:26      1.1.1.1  6.6.6.6
1   09:26      3.3.3.3  5.5.5.5
2   09:27      4.4.4.4  2.2.2.2
2   09:28      4.4.4.4  3.3.3.3
2   09:28      6.6.6.6  1.1.1.1

[0068] The requirement is to count, for every id that appears within a 3 min window, how many records were generated from each source IP. With this invention, the requirement is configured as the following service rule:

[0069]

Data source  Map granularity  Reduce granularity  Send to    Data processing rule
Info_mq      1min             3min                Result_mq  Group_by_and_count: id, sip

[0070] Both the map node and the reduce node will read this rule and parse the rule into tasks corresponding to map a...
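As a rough illustration of the rule in [0069], the following Python sketch counts records per (id, sip) in two stages: partial counts at the 1 min map granularity, which are then merged into 3 min windows at the reduce granularity. It runs in a single process rather than on Mapper and Reducer clusters. The function names `map_stage` and `reduce_stage`, and the alignment of windows to multiples of 3 minutes from midnight, are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter, defaultdict

# Records from the example stream in [0067]: (id, timestamp, sip, dip).
RECORDS = [
    (1, "09:25", "1.1.1.1", "2.2.2.2"),
    (1, "09:25", "1.1.1.1", "3.3.3.3"),
    (1, "09:26", "1.1.1.1", "6.6.6.6"),
    (1, "09:26", "3.3.3.3", "5.5.5.5"),
    (2, "09:27", "4.4.4.4", "2.2.2.2"),
    (2, "09:28", "4.4.4.4", "3.3.3.3"),
    (2, "09:28", "6.6.6.6", "1.1.1.1"),
]


def to_minutes(hhmm):
    """Convert an 'HH:MM' timestamp to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def map_stage(records, map_minutes=1):
    """Hypothetical map task: partial (id, sip) counts per map-granularity bucket."""
    partials = defaultdict(Counter)
    for rec_id, ts, sip, _dip in records:
        bucket_start = to_minutes(ts) // map_minutes * map_minutes
        partials[bucket_start][(rec_id, sip)] += 1
    return partials


def reduce_stage(partials, reduce_minutes=3):
    """Hypothetical reduce task: merge partial counts into reduce-granularity windows.

    Assumption: windows are aligned to multiples of reduce_minutes from midnight.
    """
    windows = defaultdict(Counter)
    for bucket_start, counts in partials.items():
        window_start = bucket_start // reduce_minutes * reduce_minutes
        windows[window_start].update(counts)  # Counter.update adds counts
    return windows


windows = reduce_stage(map_stage(RECORDS))
for start in sorted(windows):
    print(f"{start // 60:02d}:{start % 60:02d}", dict(windows[start]))
```

Under these assumptions the sample stream falls into two windows. The window starting at 09:24 holds 3 records for (1, 1.1.1.1) and 1 record for (1, 3.3.3.3). The window starting at 09:27 holds 2 records for (2, 4.4.4.4) and 1 record for (2, 6.6.6.6).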

Abstract

The invention discloses a highly reliable distributed data stream real-time statistical method and system. It comprises three technologies: the first is a distributed data stream computation model based on the MapReduce programming model, the second is a concurrent data transmission mechanism with serial numbers, and the third is a distributed task management and scheduling mechanism based on states and signals. The first technology extends the idea of the MapReduce model to clusters: each Map or Reduce computation unit is a node in a distributed cluster, called a Mapper or a Reducer; all Mapper nodes form a Mapper cluster, and all Reducer nodes form a Reducer cluster. The first technology achieves throughput by guaranteeing the scalability of the distributed system; the second and third technologies achieve data reliability and task availability, and thereby guarantee reliable semantics.
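The abstract does not detail how the serial-numbered transmission works. As a minimal sketch of the general idea only, the Python class below shows how per-sender serial numbers let a receiver ignore retransmitted duplicates and detect gaps. The class `SequencedReceiver`, its methods, and the sender name `mapper-1` are hypothetical and should not be read as the patented mechanism.

```python
class SequencedReceiver:
    """Hypothetical receiver that applies each (sender, seq) message at most once."""

    def __init__(self):
        self.seen = {}      # sender -> set of serial numbers already applied
        self.applied = []   # payloads in the order they were accepted

    def receive(self, sender, seq, payload):
        """Apply the payload unless this serial number was already seen."""
        seen = self.seen.setdefault(sender, set())
        if seq in seen:
            return False    # duplicate, e.g. from a retransmission; ignore it
        seen.add(seq)
        self.applied.append(payload)
        return True

    def missing(self, sender, highest_sent):
        """Serial numbers in 1..highest_sent not yet received from the sender."""
        seen = self.seen.get(sender, set())
        return [s for s in range(1, highest_sent + 1) if s not in seen]


receiver = SequencedReceiver()
first = receiver.receive("mapper-1", 1, "a")
duplicate = receiver.receive("mapper-1", 1, "a")   # retransmission of seq 1
third = receiver.receive("mapper-1", 3, "c")       # seq 2 has not arrived
gaps = receiver.missing("mapper-1", 3)
```

In this sketch a sender that retries after a timeout cannot cause double counting, and the list of missing serial numbers tells the receiver which messages to request again.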

Description

Technical field

[0001] The invention relates to a highly reliable distributed data stream real-time statistical method and system, and belongs to the field of big data technology.

Background

[0002] For statistical processing of data streams, most current industry solutions are based on distributed in-memory computing. The concurrency of distributed systems copes well with large-scale data streams. In-memory computing is used instead of a traditional local or distributed file system because the system must process incoming data as quickly as possible: in data stream computing, data arrives at high frequency and is time-sensitive, and once a packet of data has flowed past, it cannot be retrieved again. Under such a distributed in-memory computing architecture, there are many problems that need to be solved, from the compromise of consistency, availability, and partition tolerance, ...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F9/48, G06F9/54, G06F11/30

CPC: G06F9/4881, G06F9/546, G06F11/3006, G06F11/3017, G06F2209/548

Inventor: 木伟民, 李召希, 王坤朋, 王伟平

Owner: INST OF INFORMATION ENG CHINESE ACAD OF SCI
