High-reliability distributed data stream real-time statistical method and system

A distributed statistical method, applied in the field of big data, that addresses problems such as data loss, delays at the level of seconds that are too high for some scenarios, and single points of failure.

Active Publication Date: 2017-09-22
INST OF INFORMATION ENG CAS
Cites: 2, Cited by: 6

AI Technical Summary

Problems solved by technology

Existing products such as S4, Storm, and Spark have gained a degree of industry recognition and adoption, but their performance is unsatisfactory in more demanding scenarios.
[0003] S4 has two main shortcomings. The first is reliability: S4 guarantees only at-most-once semantics. When a processing node goes down, its task can be transferred to another node, but all data held in that node's memory is lost.
The second is that S4 cannot dynamically add nodes, which is unacceptable for a distributed system.
[0004] Storm has two problems. The first is the single point of failure introduced by its weakly centralized structure: after the nimbus node or ui node goes down, the execution of statistical tasks is disrupted. The second is semantic guarantee, that is, reliability: the Trident mechanism is provided to guarantee exactly-once semantics, but it seriously degrades processing performance.
[0005] Spark Streaming also has two problems. First, its processing delay is high: delays at the level of seconds cannot meet application scenarios that require high real-time performance. Second, in terms of semantic guarantee, it is necessary to use specific data sources to ensure that lost data ...

Examples

Example 1

[0066] Suppose each record from the data source contains four fields: id, timestamp, source IP (sip), and destination IP (dip). One segment of the data stream is as follows:

[0067]

ID   Timestamp   sip       dip
1    09:25       1.1.1.1   2.2.2.2
1    09:25       1.1.1.1   3.3.3.3
1    09:26       1.1.1.1   6.6.6.6
1    09:26       3.3.3.3   5.5.5.5
2    09:27       4.4.4.4   2.2.2.2
2    09:28       4.4.4.4   3.3.3.3
2    09:28       6.6.6.6   1.1.1.1

[0068] The task is to count, for every id that appears within a 3-minute window, how many records were generated by each source IP. With this invention, the requirement is configured as the following service rule:

[0069]

Data source   Map granularity   Reduce granularity   Send to     Data processing rules
Info_mq       1min              3min                 Result_mq   Group_by_and_count: id, sip

[0070] Both the map node and the redu...
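
The service rule above (Group_by_and_count on id and sip, with a 1-minute map granularity and a 3-minute reduce granularity) can be illustrated with a minimal single-process sketch. It is not the patented implementation: the function names `map_stage` and `reduce_stage` are hypothetical, and the choice of tumbling windows aligned to minutes since midnight is an assumption, since the source does not say how windows are aligned.

```python
from collections import Counter, defaultdict

# Records from the example stream: (id, timestamp "HH:MM", sip, dip).
RECORDS = [
    (1, "09:25", "1.1.1.1", "2.2.2.2"),
    (1, "09:25", "1.1.1.1", "3.3.3.3"),
    (1, "09:26", "1.1.1.1", "6.6.6.6"),
    (1, "09:26", "3.3.3.3", "5.5.5.5"),
    (2, "09:27", "4.4.4.4", "2.2.2.2"),
    (2, "09:28", "4.4.4.4", "3.3.3.3"),
    (2, "09:28", "6.6.6.6", "1.1.1.1"),
]


def to_minutes(hhmm):
    """Convert "HH:MM" to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def map_stage(records, map_granularity=1):
    """Pre-aggregate: count records per (map window, id, sip)."""
    partial = Counter()
    for rec_id, ts, sip, _dip in records:
        window = to_minutes(ts) // map_granularity
        partial[(window, rec_id, sip)] += 1
    return partial


def reduce_stage(partial, map_granularity=1, reduce_granularity=3):
    """Merge map-window counts into the coarser reduce windows."""
    ratio = reduce_granularity // map_granularity
    merged = defaultdict(Counter)
    for (window, rec_id, sip), count in partial.items():
        merged[window // ratio][(rec_id, sip)] += count
    return merged


if __name__ == "__main__":
    result = reduce_stage(map_stage(RECORDS))
    for window in sorted(result):
        for (rec_id, sip), count in sorted(result[window].items()):
            print(window, rec_id, sip, count)
```

Under the assumed alignment, the records at 09:25 and 09:26 fall into one 3-minute window and those at 09:27 and 09:28 into the next. A different alignment would group the same records differently, but the two-level aggregation (fine-grained counts first, then merged) stays the same.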

Abstract

The invention discloses a high-reliability distributed data stream real-time statistical method and system. It comprises three technologies: the first is a distributed data stream calculation model based on the MapReduce programming model; the second is a concurrent data transmission mechanism with serial numbers; the third is a distributed task management and scheduling mechanism based on states and signals. The first technology extends the idea of the MapReduce model to the cluster level: each Map or Reduce calculation unit is a node in a distributed cluster, called a Mapper or a Reducer; all Mapper nodes form a Mapper cluster, and all Reducer nodes form a Reducer cluster. The first technology achieves throughput by guaranteeing the scalability of the distributed system; the second and third technologies achieve data reliability and task availability, and thereby guarantee reliable semantics.
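
The abstract does not say how the serial numbers in the transmission mechanism are used. One common use, sketched here purely as an assumption, is for the receiver to discard retransmitted duplicates and to deliver each sender's messages in serial order, so that a resend after a failure does not double-count data. The class `SequencedReceiver` and its fields are hypothetical names, not taken from the patent.

```python
class SequencedReceiver:
    """Delivers each (sender, serial number) message once, in serial order.

    Assumption: every sender numbers its messages 0, 1, 2, ... and may
    resend a message if it is unsure the message arrived.
    """

    def __init__(self):
        self.next_seq = {}   # sender -> next serial number expected
        self.pending = {}    # sender -> {serial: payload} held out of order
        self.delivered = []  # messages handed to downstream processing

    def receive(self, sender, seq, payload):
        """Return True if the message is new, False if it is a duplicate."""
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return False  # already delivered earlier
        buffered = self.pending.setdefault(sender, {})
        if seq in buffered:
            return False  # already buffered, waiting for a gap to fill
        buffered[seq] = payload
        # Deliver every message that is now contiguous with what came before.
        while expected in buffered:
            self.delivered.append((sender, expected, buffered.pop(expected)))
            expected += 1
        self.next_seq[sender] = expected
        return True


if __name__ == "__main__":
    receiver = SequencedReceiver()
    receiver.receive("mapper-1", 1, "b")  # arrives early, held back
    receiver.receive("mapper-1", 0, "a")  # fills the gap, both delivered
    receiver.receive("mapper-1", 1, "b")  # resend, ignored
    print(receiver.delivered)
```

Tracking the next expected serial number per sender keeps the duplicate check cheap: anything below that number has already been delivered, so only out-of-order messages need to be buffered.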

Description

Technical field

[0001] The invention relates to a highly reliable distributed data stream real-time statistical method and system, belonging to the field of big data technology.

Background technique

[0002] For statistical processing of data streams, most current industry solutions are based on distributed in-memory computing. Distributed systems are used because their concurrency copes well with large-scale data streams. In-memory computing is used instead of a traditional local or distributed file system because the system must process incoming data as quickly as possible: in data stream computing, data flows at high frequency and is time-sensitive, and once a packet of data has flowed past, it cannot be retrieved again. Such a distributed in-memory computing architecture raises many problems that need to be solved, from the trade-off among consistency, availability, and partition tolerance, ...

Application Information

IPC(8): G06F9/48, G06F9/54, G06F11/30
CPC: G06F9/4881, G06F9/546, G06F11/3006, G06F11/3017, G06F2209/548
Inventor: 木伟民, 李召希, 王坤朋, 王伟平
Owner: INST OF INFORMATION ENG CAS