
Distributed type processing method based on massive data

A distributed processing technology for massive data, applied to electrical digital data processing, special data processing applications, instruments, etc. It addresses the problem that Hive cannot be directly applied to scientific data processing, and achieves simple use and efficient operation.

Active Publication Date: 2013-09-04
TSINGHUA UNIV

AI Technical Summary

Problems solved by technology

Hive cannot be directly applied to scientific data processing: Hive operates on the rows and columns of a table, whereas scientific data stored in the form of arrays has no concept of table rows and columns.



Examples


Example 1

[0059] The SQL command is: select time, var1, var2 from example where y>=3 and y<=9;

[0060] (1) Condition variable selection optimization and array merge storage are not performed

[0061] From the query command, NcFileInputFormat determines that the variables time, y, var1, and var2 are needed: time, var1, and var2 are output variables, and y is a condition variable. From the NetCDF file header information, var1 and var2 are main variables, and time and y are both dimension variables of var1 and var2. NcFileInputFormat traverses the NetCDF file and outputs data tuples of the form {time, y, var1, var2}; the first data tuple, for example, is {1000.0, 3, 1, 2}. The number of data tuples is determined by the dimensions of the main variable var1 or var2, here 6x2=12. NcFileSerDe learns from NcFileInputFormat that the variables to be used are time, y, var1, and var2, and deserializes the data tuples output by NcFileInputFormat according to the types of the variables in the table {time is double type, ...
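The classification in paragraph [0061] can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's NcFileInputFormat: the `header` dictionary stands in for NetCDF header information, the dimension lengths (time of length 2, y of length 6) are assumed so that the product matches the 12 tuples quoted above, and the function names are hypothetical.

```python
from math import prod

# Hypothetical stand-in for NetCDF header information:
# each variable maps to the names of its dimensions.
header = {
    "time": ("time",),
    "y": ("y",),
    "var1": ("time", "y"),
    "var2": ("time", "y"),
}
# Assumed dimension lengths (the source only states the product, 12).
dim_lengths = {"time": 2, "y": 6}


def classify(used, header):
    """Split the used variables into main variables and dimension variables.

    A variable whose only dimension is itself is treated as a dimension
    variable; every other variable is treated as a main variable.
    """
    dimension_vars = sorted(v for v in used if header[v] == (v,))
    main_vars = sorted(v for v in used if header[v] != (v,))
    return main_vars, dimension_vars


def tuple_count(main_var, header, dim_lengths):
    """Number of data tuples produced by traversing one main variable."""
    return prod(dim_lengths[d] for d in header[main_var])


used = {"time", "y", "var1", "var2"}
main_vars, dimension_vars = classify(used, header)
n_tuples = tuple_count(main_vars[0], header, dim_lengths)
```

With these assumed inputs, `main_vars` holds var1 and var2, `dimension_vars` holds time and y, and `n_tuples` equals the 12 tuples mentioned in the paragraph.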

Example 2

[0071] The SQL command is: select time, y, x, var3 from example where y=6 and x>=6 and x<=8;

[0072] (1) Conditional selection optimization and array merge storage are not performed

[0073] The main variable in {time, y, x, var3} is var3. A total of 2x6x4=48 data tuples are generated, of which 2x1x2=4 satisfy the conditions after filtering: {1000.0, 6, 6, 6}, {1000.0, 6, 8, 8}, {1001.0, 6, 6, 30}, {1001.0, 6, 8, 32}. The final generated NetCDF file is as follows:

[0074]
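The tuple counts in paragraph [0073] can be reproduced with a small Python sketch of the unoptimized path: expand the array into {time, y, x, var3} tuples, then filter them with the WHERE clause. The source does not list the coordinate values, so the `y` and `x` lists below are assumptions chosen to be consistent with the four result tuples; the formula for var3 (a running index starting at 1) is likewise inferred from those tuples and is not stated in the patent.

```python
from itertools import product

# Assumed coordinate values; only the matching ones (y=6, x=6, x=8)
# appear in the source, the rest are placeholders.
time = [1000.0, 1001.0]
y = [3, 6, 9, 12, 15, 18]
x = [5, 6, 9, 8]


def var3(t_idx, y_idx, x_idx):
    """Assumed contents of var3: 1, 2, 3, ... in row-major order."""
    return t_idx * len(y) * len(x) + y_idx * len(x) + x_idx + 1


def traverse():
    """Yield one (time, y, x, var3) tuple per element of the main variable."""
    index_space = product(range(len(time)), range(len(y)), range(len(x)))
    for t_idx, y_idx, x_idx in index_space:
        yield (time[t_idx], y[y_idx], x[x_idx], var3(t_idx, y_idx, x_idx))


all_tuples = list(traverse())

# WHERE y = 6 AND x >= 6 AND x <= 8, applied after the full traversal.
matches = [t for t in all_tuples if t[1] == 6 and 6 <= t[2] <= 8]
```

Under these assumptions `all_tuples` has 48 entries and `matches` contains the four tuples listed in paragraph [0073].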

[0075] (2) The case of conditional selection optimization

[0076] For the conditional variable x, the selection range [1, 3] given by x>=6 and x<=8 is not continuous, so conditional selection optimization cannot be used.
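The continuity test in paragraph [0076] might look like the following Python sketch. It reads "[1, 3]" as the positions of the matching elements within the x dimension, which is an interpretation rather than something the source states outright; the `x` and `y` values and both function names are assumptions made for illustration.

```python
def matching_indices(values, predicate):
    """Positions of the dimension values that satisfy the condition."""
    return [i for i, v in enumerate(values) if predicate(v)]


def contiguous_range(indices):
    """Return (start, stop) if the indices form one unbroken run, else None.

    The indices are assumed to be sorted and unique, as produced by
    matching_indices.
    """
    if not indices:
        return None
    start, last = indices[0], indices[-1]
    if last - start + 1 != len(indices):
        return None
    return (start, last + 1)


# Assumed coordinate values, consistent with the result tuples of Example 2.
x = [5, 6, 9, 8]
y = [3, 6, 9, 12, 15, 18]

x_idx = matching_indices(x, lambda v: 6 <= v <= 8)
x_range = contiguous_range(x_idx)  # None: positions 1 and 3 leave a gap

y_idx = matching_indices(y, lambda v: v == 6)
y_range = contiguous_range(y_idx)  # a single position is trivially contiguous
```

A contiguous result could be read as one sub-array slice, which is presumably why the optimization requires it; the gap between positions 1 and 3 for x forces the fallback to full traversal followed by filtering.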

[0077] It can be seen from the above embodiments that the present invention designs a MapReduce-based distributed processing method for massive data stored in the form of arrays, so that users can use SQL commands to perform distributed processing on the ...



Abstract

The invention relates to the technical field of distributed processing of massive data and discloses a distributed processing method for massive data. The method includes the following steps: S1, computing the main variable among the output variables and the conditional variable set according to the SQL (Structured Query Language) command and the variable information in the array files, and executing S2 if a main variable exists; S2, judging whether to optimize the conditional variables; S3, judging whether to store the arrays in merged form; S4, having the SQL engine process the data tuples received in the MapReduce task according to the SQL command and the data type of each column defined in the table, generating result data tuples; and S5, judging whether the generated result data tuples need to be stored as array files. The invention provides a MapReduce-based distributed processing method for massive data stored as arrays, so that users can process such data in a distributed manner with SQL commands. The method is simple to use, runs efficiently, and is fault tolerant.

Description

Technical field

[0001] The invention relates to the technical field of distributed processing of massive data, and in particular to a distributed processing method for massive data stored in the form of arrays.

Background art

[0002] With the rapid development of the Internet, networks generate more and more data, and how to store and process this massive data is an urgent problem. MapReduce is a distributed processing framework proposed by Google. Once the user expresses the processing logic as "Map" and "Reduce" functions, the MapReduce system performs distributed parallel processing of massive data and provides automatic task fault tolerance and load balancing. Combined with the Google File System (GFS), MapReduce can make full use of data locality, thereby greatly reducing the amount of data transmitted over the network. Hadoop is an open-source implementation of the MapReduce and GFS architectures, used by Yahoo!, Fac...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventors: 杨广文, 耿益锋, 黄小猛
Owner: TSINGHUA UNIV