Spark Streaming-based big data stream processing method and system

A processing method and system applied in the field of big data stream processing. It addresses problems such as incorrect updates of mutable state, difficulty in understanding the implementation, and the lack of native YARN support, and aims to achieve better fault tolerance, faster processing, and improved processing efficiency.

Inactive Publication Date: 2016-09-07
北京思特奇信息技术股份有限公司

AI Technical Summary

Problems solved by technology

[0003] Storm, however, has its own shortcomings. In terms of fault tolerance and data guarantees, each individual record must be tracked as it passes through the system, so Storm can guarantee that each record is processed at least once; but duplicate records are allowed when recovering from errors, which means that mutable state may be incorrectly updated twice. In terms of implementation and programming API, the core of Storm is written in Clojure (although most of the extension work is written in Java), which makes its implementation harder to understand. In terms of cluster management integration, Storm can run on its own cluster or on Mesos, but running on YARN requires a third-party component, Storm on YARN, and is not natively supported.
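The double-update problem described above can be illustrated with a minimal sketch. It is not Storm code and does not model Storm's acknowledgement mechanism; the function names and the crash-then-replay scenario are assumptions chosen only to show how replaying records against a non-idempotent update inflates mutable state.

```python
from collections import Counter


def apply_updates(records, counts):
    """Non-idempotent update: increment a counter once per record seen."""
    for key in records:
        counts[key] += 1


def run_with_replay(records, fail_after):
    """Hypothetical at-least-once run: the first `fail_after` records are
    applied, the job fails before they are acknowledged, and the whole
    batch is replayed on recovery."""
    counts = Counter()
    apply_updates(records[:fail_after], counts)  # work done before the failure
    apply_updates(records, counts)               # full replay after recovery
    return counts


def run_without_failure(records):
    """Reference run in which every record is applied exactly once."""
    counts = Counter()
    apply_updates(records, counts)
    return counts


records = ["a", "b", "a"]
replayed = run_with_replay(records, fail_after=2)
expected = run_without_failure(records)
print(replayed, expected)
```

In the replayed run the records applied before the failure are counted a second time, so the resulting counts are higher than in the failure-free run.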




Embodiment Construction

[0049] The principles and features of the present invention are described below in conjunction with the accompanying drawings, and the examples given are only used to explain the present invention, and are not intended to limit the scope of the present invention.

[0050] Spark Streaming is an extension of the Spark core API that enables high-throughput, fault-tolerant processing of real-time data streams. It supports many data sources, including Kafka, Flume, Twitter, ZeroMQ, and traditional TCP sockets.

[0051] As an extension of the core Spark API, Spark Streaming does not process a data stream one record at a time as Storm does; instead, it segments the stream into batch jobs by time interval before processing. Spark's abstraction for a continuous data stream is called a DStream (Discretized Stream): a DStream is a sequence of micro-batches, each represented as an RDD (resilient distributed dataset). An RDD is a distributed dataset that can be operated on in parallel in two ways, namely...
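The idea of segmenting a stream into batches by time interval can be sketched in plain Python. This is only a conceptual model, not the Spark API: the `discretize` function, the `(timestamp, value)` event format, and the one-second interval are assumptions made for illustration.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Intervals that receive no events still yield an empty batch, so the
    result has one entry per interval between the first and last event.
    """
    if not events:
        return []
    buckets = {}
    for ts, value in events:
        index = int(ts // batch_interval)  # which interval the event falls in
        buckets.setdefault(index, []).append(value)
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


# Timestamps are in seconds; values stand in for arbitrary records.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1.0)

# Each batch is then processed as a unit, here by counting its records.
batch_sizes = [len(batch) for batch in batches]
print(batches, batch_sizes)
```

Each inner list plays the role of one micro-batch; the per-batch count stands in for whatever job is run on each batch.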



Abstract

The invention relates to a Spark Streaming-based big data stream processing method and system. The method includes: step S1, receiving data sent by a data source at a specified location, executing step S2 if the data source is HDFS, and executing step S3 if the data source is Flume; step S2, storing the data in file form, and then executing step S3; step S3, processing the received data or file through Spark Streaming; and step S4, writing the processing result of the file or data to a result directory through Spark Streaming at a set time interval. In terms of fault tolerance and data guarantees, the method and system provide good fault-tolerant stateful computation; in terms of API programming, they support both Scala and Java; and in terms of cluster management integration, Spark Streaming can run on its own cluster as well as on YARN and Mesos.
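The control flow of steps S1 to S4 can be sketched as follows. This is a hypothetical plain-Python outline rather than the patented implementation or real Spark Streaming code: the function names, the word count used as the S3 processing step, the local directories standing in for HDFS and the result directory, and the `batch-N.txt` file naming are all assumptions.

```python
import tempfile
from collections import Counter
from pathlib import Path


def receive(source_type, records, staging_dir):
    """S1/S2: accept data; HDFS-style input is first stored as a file."""
    if source_type == "hdfs":
        staged = Path(staging_dir) / "staged.txt"
        staged.write_text("\n".join(records), encoding="utf-8")  # S2
        return staged.read_text(encoding="utf-8").splitlines()
    if source_type == "flume":
        return list(records)  # passed straight on to S3
    raise ValueError(f"unsupported source type: {source_type}")


def process(lines):
    """S3: stand-in for the streaming job (an assumed word count)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def write_result(counts, result_dir, batch_id):
    """S4: write one result file per time interval."""
    out = Path(result_dir) / f"batch-{batch_id}.txt"
    body = "\n".join(f"{word} {n}" for word, n in sorted(counts.items()))
    out.write_text(body, encoding="utf-8")
    return out


def run_batch(source_type, records, staging_dir, result_dir, batch_id):
    """Run S1 to S4 once for a single interval's worth of input."""
    lines = receive(source_type, records, staging_dir)
    return write_result(process(lines), result_dir, batch_id)


with tempfile.TemporaryDirectory() as workdir:
    result_file = run_batch("hdfs", ["a b", "a"], workdir, workdir, batch_id=0)
    result_text = result_file.read_text(encoding="utf-8")
    print(result_text)
```

Calling `run_batch` once per interval with an increasing `batch_id` mirrors the "according to a time interval" wording of step S4.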

Description

Technical field

[0001] The invention relates to the field of big data stream processing, in particular to a Spark Streaming-based big data stream processing method and system.

Background art

[0002] In the prior art, Storm is often used to implement a data flow model. When Storm is used for this purpose, data flows continuously through a network of transformation entities. The abstraction of a data flow is called a stream, which is an unbounded sequence of tuples. A tuple is like a structure that uses some additional serialization code to represent standard data types (such as integers, floats, and byte arrays) or user-defined types. Each stream is defined by a unique ID, which can be used to build a topology of data sources and sinks.

[0003] But Storm has its own flaws, for example: in terms of fault tolerance and data guarantees, each individual record in Storm must be tracked as it passes through the system, so Storm can at least guarantee that each rec...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/24568
Inventor: 杜旭苗
Owner: 北京思特奇信息技术股份有限公司