Theta-join method for massive distributed data

A theta-join (non-equi-join) technique for massive distributed data. It addresses the low efficiency of existing theta-join methods by reducing the Reducer's workload and thereby improving query efficiency.

Active Publication Date: 2016-10-12
云尧科技浙江有限公司

AI Technical Summary

Problems solved by technology

[0004] To overcome the inefficiency of existing theta-join (non-equi-join) methods, the present invention provides a theta-join method for massive distributed data.


Embodiment Construction

[0040] The specific steps of the theta-join (non-equi-join) method for massive distributed data in the present invention are as follows:

[0041] A. Theta-Join query definition:

[0042] Suppose there are two relational tables R(A,B) and S(B,C), and the function θ belongs to {>, <, ≥, ≤, ≠}. If QB = R(A,B) ⋈_{R.B θ S.B} S(B,C), then QB is called a theta-join (non-equi-join) query between relational tables R and S joined on field B.

[0043] Explanation of symbols: R(A,B) denotes relational table R with attributes A and B, and S(B,C) denotes relational table S with attributes B and C. θ is the join predicate between R and S, QB denotes a query involving R and S, and "⋈" denotes the join operator. Because data in a distributed environment is processed as key-value pairs, field B can be regarded as the key of R and S, field A can be regarded as the combination (value) of all fields in relational table R except B, and field C can be regarded as the combination (value) o...
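The key-value view described above can be illustrated with a minimal Python sketch. The names `THETA_OPS`, `to_key_value`, and `naive_theta_join` are hypothetical and do not come from the patent; the sketch simply models each row as a (key, value) pair keyed on field B and evaluates the predicate over the full Cartesian product, which is the unoptimized baseline rather than the patented method.

```python
import operator

# Comparison functions for the non-equality predicates a theta-join may use
# (hypothetical mapping, chosen for illustration).
THETA_OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "!=": operator.ne,
}

def to_key_value(rows, key_field="B"):
    """Turn each row (a dict) into (key, value): the key is the join field,
    the value is the tuple of all remaining fields."""
    return [
        (row[key_field], tuple(v for f, v in row.items() if f != key_field))
        for row in rows
    ]

def naive_theta_join(r_pairs, s_pairs, theta):
    """Compare every R pair with every S pair (full Cartesian product)."""
    op = THETA_OPS[theta]
    return [
        (r_key, r_val, s_key, s_val)
        for r_key, r_val in r_pairs
        for s_key, s_val in s_pairs
        if op(r_key, s_key)
    ]

R = to_key_value([{"A": "a1", "B": 1}, {"A": "a5", "B": 5}])
S = to_key_value([{"B": 3, "C": "c3"}])
print(naive_theta_join(R, S, "<"))
```

Every pair is compared here, so the cost grows with |R| × |S|; this is the overhead that motivates filtering before the Cartesian product.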


Abstract

The invention discloses a theta-join method for massive distributed data, which is used for solving the technical problem of low efficiency of the conventional theta-join method. The method adopts the technical scheme that prior to theta-join of two tables, appropriate filtering rules are firstly selected according to join conditions, the maximum values and the minimum values of join fields of the two tables are then calculated, all records in the two tables are scanned according to the maximum values and the minimum values, the records irrelevant to output results are removed through filtering, Cartesian product calculation is only carried out for filtered data, and a secondary comparison of Cartesian product results is finally carried out according to the join conditions, so that the records satisfying the join conditions are obtained through screening. The method has the advantages that a large number of the records which fail to satisfy the join conditions are removed through the filtering prior to the Cartesian product calculation, so that the workload of a Reducer is effectively reduced, and the theta-join query efficiency is improved.
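The flow described in the abstract (compute the minimum and maximum of each join field, filter, take the Cartesian product of the survivors, then compare again) can be sketched in single-process Python as follows. The specific filtering rules in `prefilter` are an assumption made for illustration, since the abstract does not spell them out, and the function names are hypothetical. A real deployment would run the filter and the join as MapReduce stages, which this sketch does not attempt.

```python
import operator
from itertools import product

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "!=": operator.ne}

def prefilter(r, s, theta):
    """Drop (key, value) records that cannot match any record on the other
    side, using only the min/max of each join field (assumed rules)."""
    if not r or not s:
        return [], []
    op = OPS[theta]
    r_min, r_max = min(k for k, _ in r), max(k for k, _ in r)
    s_min, s_max = min(k for k, _ in s), max(k for k, _ in s)
    if theta in ("<", "<="):
        # r can match only if it is below the largest S key;
        # s can match only if it is above the smallest R key.
        r_keep = [(k, v) for k, v in r if op(k, s_max)]
        s_keep = [(k, v) for k, v in s if op(r_min, k)]
    elif theta in (">", ">="):
        r_keep = [(k, v) for k, v in r if op(k, s_min)]
        s_keep = [(k, v) for k, v in s if op(r_max, k)]
    else:  # "!=": a record is useless only if every key on the other side equals it
        r_keep = [(k, v) for k, v in r if not (s_min == s_max == k)]
        s_keep = [(k, v) for k, v in s if not (r_min == r_max == k)]
    return r_keep, s_keep

def filtered_theta_join(r, s, theta):
    """Filter first, then compare only the surviving pairs."""
    op = OPS[theta]
    r_keep, s_keep = prefilter(r, s, theta)
    return [(rk, rv, sk, sv)
            for (rk, rv), (sk, sv) in product(r_keep, s_keep)
            if op(rk, sk)]  # secondary comparison on the join condition

R = [(1, "a1"), (5, "a5"), (9, "a9")]
S = [(2, "c2"), (6, "c6")]
print(prefilter(R, S, "<"))
print(filtered_theta_join(R, S, "<"))
```

In this example the R record with key 9 is removed before the Cartesian product because no S key exceeds it, so fewer pairs reach the comparison step while the join result is unchanged.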

Description

Technical field

[0001] The invention relates to a theta-join (non-equi-join) method, and in particular to a theta-join method for massive distributed data.

Background

[0002] In cloud computing environments, the explosive growth of data volume has brought new challenges to data storage, processing, and analysis. Traditional databases and data processing methods cannot meet the storage and processing requirements of big data. The mainstream approach today is to use MapReduce parallel processing to speed up data processing. Under a MapReduce-based parallel distributed model, data can be partitioned and processed in a distributed manner, but the Cartesian product produced by join operations, especially theta-joins, sharply increases the amount of data on the network and on disk, resulting in very large I/O and disk overhead...


Application Information

IPC(8): G06F17/30
CPC: G06F16/284
Inventors: 刘文洁, 李占怀, 潘巍, 张晓
Owner: 云尧科技浙江有限公司