Distributed type processing method based on massive data

A distributed processing technology for massive data, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problem that existing tools cannot be directly applied to scientific data processing, and achieves simple use and efficient operation.

Active Publication Date: 2013-09-04

TSINGHUA UNIV

Cites: 3; Cited by: 0


AI Technical Summary

Problems solved by technology

Hive cannot be directly applied to scientific data processing: processing in Hive is based on the rows and columns of a table, whereas scientific data stored in the form of arrays has no concept of table rows and columns.


Examples


Example 1

[0059] The SQL command is: select time, var1, var2 from example where y>=3 and y<=9;

[0060] (1) Condition variable selection optimization and array merge storage are not performed

[0061] From the query command, NcFileInputFormat knows that four variables are used: time, y, var1, and var2, where time, var1, and var2 are the output variables and y is the condition variable. According to the NetCDF file header information, var1 and var2 are the main variables, and time and y are both dimension variables of var1 and var2. NcFileInputFormat traverses the NetCDF file and outputs data tuples {time, y, var1, var2}; for example, the first data tuple is {1000.0, 3, 1, 2}. The number of data tuples depends on the dimensions of the main variable var1 or var2, here 6x2=12. NcFileSerDe learns from NcFileInputFormat that the variables to be used are time, y, var1, and var2, and deserializes the data tuples output by NcFileInputFormat according to the types of the variables in the table {time is double type, ...
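The split between main variables and dimension variables described in [0061] can be sketched in a few lines of Python. This is an illustrative sketch only, not the patent's NcFileInputFormat: the HEADER dictionary, the function names, and the (time, y) dimension order are assumptions made for the example; the dimension sizes 2 (time) and 6 (y) are taken from the tuple counts quoted in the examples.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for NetCDF header information:
# variable name -> names of the dimensions it is defined over.
# A dimension variable is assumed to be indexed only by itself.
HEADER: Dict[str, Tuple[str, ...]] = {
    "time": ("time",),
    "y": ("y",),
    "var1": ("time", "y"),  # dimension order is an assumption
    "var2": ("time", "y"),
}

# Dimension lengths consistent with the 6x2=12 tuples in the text.
DIM_SIZES: Dict[str, int] = {"time": 2, "y": 6}


def split_variables(
    used: List[str], header: Dict[str, Tuple[str, ...]]
) -> Tuple[List[str], List[str]]:
    """Split the used variables into (main variables, dimension variables)."""
    main_vars = [v for v in used if header[v] != (v,)]
    dim_vars = [v for v in used if header[v] == (v,)]
    return main_vars, dim_vars


def tuple_count(
    main_var: str,
    header: Dict[str, Tuple[str, ...]],
    dim_sizes: Dict[str, int],
) -> int:
    """Number of data tuples: the product of the main variable's dimension sizes."""
    count = 1
    for dim in header[main_var]:
        count *= dim_sizes[dim]
    return count


main_vars, dim_vars = split_variables(["time", "y", "var1", "var2"], HEADER)
n_tuples = tuple_count(main_vars[0], HEADER, DIM_SIZES)
```

With these assumed inputs, main_vars holds var1 and var2, dim_vars holds time and y, and n_tuples is 12, matching the count stated in the paragraph.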

Example 2

[0071] The SQL command is: select time, y, x, var3 from example where y=6 and x>=6 and x<=8;

[0072] (1) Conditional selection optimization and array merge storage are not performed

[0073] The main variable in {time, y, x, var3} is var3. A total of 2x6x4=48 data tuples are generated, and 2x1x2=4 qualified data tuples remain after filtering, as follows: {1000.0, 6, 6, 6}, {1000.0, 6, 8, 8}, {1001.0, 6, 6, 30}, {1001.0, 6, 8, 32}. The final generated NetCDF file is as follows:
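The unoptimized path in [0073] (enumerate every tuple of the main variable, then filter) can be illustrated with a small Python sketch. The actual NetCDF contents are not given in this excerpt, so the coordinate values of y and x and the contents of var3 below are hypothetical, chosen only so that the sketch reproduces the counts and the four tuples quoted above; the function names are likewise illustrative.

```python
from itertools import product

# Coordinate values. time comes from the quoted tuples; the y and x
# values are HYPOTHETICAL (only y=6, x=6 and x=8 appear in the text).
time = [1000.0, 1001.0]
y = [3, 6, 9, 12, 15, 18]
x = [5, 6, 9, 8]


def var3(t_idx: int, y_idx: int, x_idx: int) -> int:
    """Hypothetical var3 contents: 1-based row-major position in a 2x6x4 array."""
    return (t_idx * len(y) + y_idx) * len(x) + x_idx + 1


def all_tuples():
    """Yield one {time, y, x, var3} tuple per element of the main variable."""
    for (ti, t), (yi, yv), (xi, xv) in product(
        enumerate(time), enumerate(y), enumerate(x)
    ):
        yield (t, yv, xv, var3(ti, yi, xi))


def matches(row) -> bool:
    """WHERE clause of the example: y=6 and x>=6 and x<=8."""
    _, yv, xv, _ = row
    return yv == 6 and 6 <= xv <= 8


tuples = list(all_tuples())
qualified = [row for row in tuples if matches(row)]
```

Under these assumptions, tuples has 48 entries and qualified has 4, which shows why filtering after full enumeration reads far more data than the query needs.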

[0074]

[0075] (2) The case of conditional selection optimization

[0076] The selection range [1, 3] given by x>=6 and x<=8 for the condition variable x is not continuous, so conditional selection optimization cannot be used.
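One way to read [0076] is that the optimization applies only when the coordinate values satisfying the condition occupy a contiguous run of indices, so that a single array slice can be read. The Python sketch below illustrates that check under this interpretation; it is not the patent's implementation, the function names are invented for the example, and the coordinate values of x and y are hypothetical.

```python
from typing import List, Sequence


def selected_indices(coords: Sequence[float], lo: float, hi: float) -> List[int]:
    """Indices of the coordinate values that satisfy lo <= value <= hi."""
    return [i for i, v in enumerate(coords) if lo <= v <= hi]


def is_contiguous(indices: List[int]) -> bool:
    """True if the (ascending) indices form one unbroken run."""
    return bool(indices) and indices[-1] - indices[0] + 1 == len(indices)


# HYPOTHETICAL coordinate values, for illustration only.
x = [5, 6, 9, 8]
y = [3, 6, 9, 12, 15, 18]

x_sel = selected_indices(x, 6, 8)  # condition x>=6 and x<=8
y_sel = selected_indices(y, 3, 9)  # condition y>=3 and y<=9

x_can_optimize = is_contiguous(x_sel)
y_can_optimize = is_contiguous(y_sel)
```

With these assumed values, x_sel is [1, 3], which skips index 2 and therefore rules out the optimization, while y_sel is [0, 1, 2] and could be read as one slice.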

[0077] It can be seen from the above embodiments that the present invention designs a MapReduce-based distributed processing method for massive data stored in the form of arrays, so that users can use SQL commands to perform distributed processing on the ...


Abstract

The invention relates to the technical field of distributed processing of massive data and discloses a distributed processing method for massive data. The method includes the following steps: S1, determining the master variable among the output variables and the condition variable set according to the SQL (Structured Query Language) command and the variable information in the array files, and executing S2 if a master variable exists; S2, judging whether to apply condition variable optimization; S3, judging whether to store the arrays in merged form; S4, having the SQL engine process the data tuples received in the MapReduce task according to the SQL command and the data type of each column defined in the table, and generate result data tuples; and S5, judging whether the generated result data tuples need to be stored as array files. The method is a MapReduce-based distributed processing method for massive data stored as arrays, so that a user can process such data in a distributed manner with SQL commands. It is simple to use, runs efficiently, and is fault tolerant.

Description

Technical field

[0001] The invention relates to the technical field of distributed processing of massive data, in particular to a distributed processing method for massive data stored in the form of arrays.

Background technique

[0002] With the rapid development of the Internet, more and more data is generated by the network, and how to store and process this massive data is an urgent problem. MapReduce is a distributed processing framework proposed by Google. As long as the user expresses the processing in "Map" and "Reduce", the MapReduce system lets the user carry out distributed parallel processing of massive data and provides automatic task fault tolerance and load balancing. Combined with Google's distributed file system (GFS), MapReduce can make full use of data locality, thereby greatly reducing the amount of data transmitted over the network. Hadoop is an open-source implementation of the MapReduce and GFS architectures, used by Yahoo!, Fac...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F17/30

Inventor: 杨广文 (Yang Guangwen), 耿益锋 (Geng Yifeng), 黄小猛 (Huang Xiaomeng)

Owner: TSINGHUA UNIV
