
A method and system for ensuring that MapReduce data input shards contain complete records

A data record and data packet technology applied in the fields of electrical digital data processing, digital data information retrieval, and special data processing applications. Stated effects: simple and efficient.

Active Publication Date: 2020-02-07
CHENGDU GOLDTEL IND GROUP

AI Technical Summary

Problems solved by technology

[0013] (2) Network transmission overhead. If a split is so large that it spans multiple HDFS blocks, the map task must fetch several of those blocks over the network, so the split size should not exceed the HDFS block size.
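
As a rough illustration of the point in [0013], the sketch below counts how many HDFS blocks a split touches. The 128 MiB block size and the helper name `blocks_spanned` are assumptions chosen for illustration; neither comes from the patent text.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MiB), illustration only


def blocks_spanned(split_start: int, split_length: int, block_size: int = BLOCK_SIZE) -> int:
    """Return how many blocks the byte range [split_start, split_start + split_length) touches."""
    if split_length <= 0:
        return 0
    first_block = split_start // block_size
    last_block = (split_start + split_length - 1) // block_size
    return last_block - first_block + 1


# A split no larger than one block and aligned to it touches a single block;
# one extra byte already pulls in a second block.
one_block = blocks_spanned(0, BLOCK_SIZE)
two_blocks = blocks_spanned(0, BLOCK_SIZE + 1)
```

Every block beyond the first is a candidate for a network fetch, which is why the text recommends keeping the split no larger than a block.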
[0015] With this setting, the MapReduce architecture appears to guarantee data-local processing, but it does not. HDFS divides a file into data blocks purely by physical size, without considering the file's content.
A Map task, by contrast, processes data according to the file's content: it handles each record separately, and each record is a key-value pair. Because HDFS cuts data blocks by size, a single data record can easily be split across two data blocks, which may even reside on different DataNodes.
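
The effect described here can be reproduced with a small sketch. It assumes tab-separated, newline-terminated key-value records and an artificially small 12-byte block size; the names `split_into_blocks`, `records`, and `broken` are hypothetical and only serve the illustration.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into fixed-size blocks by byte offset only, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Three key-value records (assumed format: key, tab, value, newline).
records = [b"k1\tapple\n", b"k2\tbanana\n", b"k3\tcherry\n"]
data = b"".join(records)

block_size = 12  # tiny on purpose so the cuts are visible
blocks = split_into_blocks(data, block_size)

# A record is "broken" when a block boundary falls strictly inside it.
boundaries = range(block_size, len(data), block_size)
broken = []
offset = 0
for record in records:
    if any(offset < b < offset + len(record) for b in boundaries):
        broken.append(record)
    offset += len(record)
```

Here the second and third records each straddle a block boundary, so neither can be processed from a single block.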
[0016] To keep processing correct, the MapReduce architecture reads the remainder of a record from the next data split whenever a record crosses a split boundary, continuing until the complete record has been read. This, however, greatly reduces the system's processing efficiency and increases the amount of data transferred.
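
A simplified sketch of this read-past-the-boundary strategy follows. It assumes newline-terminated records and is written in the spirit of line-oriented record readers, not copied from Hadoop; the function `read_split` and its `overrun` return value are hypothetical names for this illustration.

```python
from typing import List, Tuple


def read_split(data: bytes, start: int, end: int) -> Tuple[List[bytes], int]:
    """Read newline-terminated records for the split [start, end).

    Returns the records plus the number of bytes read beyond `end`,
    i.e. bytes that belong to the next split and may need a remote fetch.
    """
    pos = start
    if start != 0:
        # The first (possibly partial) line is read by the previous split: skip it.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # Keep reading while a record starts at or before the split end.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        records.append(data[pos:nxt])
        pos = nxt
    return records, max(0, pos - end)


data = b"k1\tapple\nk2\tbanana\nk3\tcherry\n"  # 29 bytes, three records
first, overrun_first = read_split(data, 0, 12)
second, overrun_second = read_split(data, 12, 24)
third, overrun_third = read_split(data, 24, 29)
```

Each record is returned exactly once, but the first two splits both read bytes that lie beyond their own end. That overrun is the extra transfer the paragraph above refers to.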



Examples


Embodiment 1

[0078] Embodiment 1: under the scheme of the present invention, the data blocks of the HDFS system are stored and divided in a way that completely avoids reading data across data splits during processing. This keeps data processing local and thereby greatly improves the system's processing efficiency.

[0079] The existing HDFS file-writing design mainly includes the following steps:

[0080] S001. The client calls create() on the DistributedFileSystem of DFSClient to create a file;

[0081] S002. The DistributedFileSystem of DFSClient uses RPC to call the create() method of the metadata node (NameNode) to create a new file. In this step the metadata node first checks that the file does not already exist and that the client has permission to create it; if both conditions hold, the new file is created, and otherwise it is not;

[0082] S003. After the file is created, DistributedFileSystem returns an FSOutputStream to the client;

[0083] S004. Use the write()...
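
The check made by the metadata node in step S002 can be modelled with a toy in-memory class. This is only a sketch under stated assumptions: `ToyNameNode`, its `writers` set, and the boolean return value are hypothetical and do not correspond to the Hadoop API.

```python
class ToyNameNode:
    """In-memory stand-in for the metadata node; not the Hadoop API."""

    def __init__(self, writers):
        self.files = set()           # paths that already exist
        self.writers = set(writers)  # hypothetical: clients allowed to create files

    def create(self, path: str, client: str) -> bool:
        # S002: create only if the file does not exist and the client has permission.
        if path in self.files or client not in self.writers:
            return False
        self.files.add(path)
        return True


name_node = ToyNameNode(writers={"alice"})
created = name_node.create("/data/input.txt", "alice")    # new file, permitted client
duplicate = name_node.create("/data/input.txt", "alice")  # file already exists
denied = name_node.create("/data/other.txt", "bob")       # client lacks permission
```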



Abstract

The invention discloses a method and a system for guaranteeing that a MapReduce data input split contains complete records. The method comprises the following steps: S1, creating a storage file in an HDFS system; S2, inputting data into the client side of the HDFS system and describing each input data item; S3, sequentially receiving each data record through the client side of the HDFS system and constructing data packets, where, when a data record is received during construction of the nth data packet, it is judged whether the currently received record can be stored completely in the current data packet; S4, constructing the received data packets into data blocks through the client side of the HDFS system and writing the data blocks into the storage file. The method and system avoid the need to read data across data splits during processing and guarantee data-local processing, thereby greatly improving system processing efficiency.
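
Step S3 can be sketched roughly as follows. The abstract states only that the client judges whether a record fits completely in the current packet; what happens when it does not fit is an assumption here (the record opens a new packet), as are the function name `build_packets`, the byte-based capacity, and the error raised for an oversized record.

```python
from typing import List


def build_packets(records: List[bytes], packet_capacity: int) -> List[List[bytes]]:
    """Group whole records into packets so that no record is split across packets.

    Assumption: a record that does not fit in the current packet starts a new one.
    """
    packets: List[List[bytes]] = []
    current: List[bytes] = []
    used = 0
    for record in records:
        if len(record) > packet_capacity:
            # Not covered by the abstract; this sketch simply refuses such records.
            raise ValueError("record larger than a packet")
        if used + len(record) > packet_capacity:
            # The record would not fit completely: close the current packet.
            packets.append(current)
            current, used = [], 0
        current.append(record)
        used += len(record)
    if current:
        packets.append(current)
    return packets


records = [b"k1\tapple\n", b"k2\tbanana\n", b"k3\tcherry\n"]  # 9, 10, and 10 bytes
packets = build_packets(records, packet_capacity=20)
```

With a 20-byte capacity the first two records share a packet and the third opens a new one, so every packet, and therefore every block built from packets, begins and ends on a record boundary.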

Description

Technical field

[0001] The invention relates to a method and system for ensuring that MapReduce data input splits contain complete records.

Background technique

[0002] MapReduce is a distributed computing software architecture first proposed by Google to solve distributed computing problems over large amounts of data; it is a typical data-splitting processing architecture.

[0003] The architecture derives from the map and reduce functions of functional programming. The Map master node reads the input data, divides it into small data pieces (input splits) that can all be processed in the same way, and distributes these pieces to different data nodes (DataNodes). Each data node applies the same processing to each of its data pieces in a loop. The Reduce master node collects the processing results of all Map data nodes, combines them, and returns the combined result as output.

[0004] The operation of each Map h...
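
The functional-programming origin mentioned in [0003] can be illustrated with Python's built-in `map` and `functools.reduce`. The word-count task, the sample splits, and the names `map_split` and `merge_counts` are assumptions chosen for illustration; real MapReduce jobs run the map step on separate nodes rather than in one process.

```python
from collections import Counter
from functools import reduce


def map_split(split: str) -> Counter:
    """Map step: apply the same processing (word counting) to one input split."""
    return Counter(split.split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results."""
    return left + right


# Three small input splits standing in for pieces of a larger input.
splits = ["a b a", "b c", "a"]
partials = list(map(map_split, splits))
total = reduce(merge_counts, partials, Counter())
```

Each split is processed independently by the same function, and the reduce step merges the partial counts into a single result.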


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/182, G06F16/172
CPC: G06F16/182
Inventor: 武志学, 赵阳, 田盛
Owner: CHENGDU GOLDTEL IND GROUP