A data dependency mining method and system based on distributed computing

Distributed computing and data dependency technology, applied in the field of data processing, that addresses the low performance and frequent omissions of existing mining methods and reduces the amount of computation

Active Publication Date: 2019-02-12
HARBIN INST OF TECH

AI Technical Summary

Problems solved by technology

The two are difficult to reconcile, so data dependency mining either misses many dependencies or performs poorly.


Examples

Embodiment 1

[0064] As shown in Figure 1, the distributed computing-based data dependency mining method provided in Embodiment 1 of the present invention may include the following steps:

[0065] Step S1: Generate an attribute similarity posting list from the original data set. This is the data redistribution step. Each row of the attribute similarity posting list corresponds to one data pair in the original data set and records the numbers of the attributes on which that pair satisfies the similarity constraint.

[0066] Step S2: Mine first-order data dependencies from the attribute similarity posting list. This is the first-order dependency mining step.

[0067] Step S3: Mine high-order data dependencies level by level. A high-order data dependency candidate set is generated first, that is, all candidate relationships for k-order data dependencies, and high-order data dependency candidates are generated based on the mined low-or...
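Steps S1 to S3 can be illustrated with a small in-memory sketch. It is not the patented implementation: the function names are hypothetical, the similarity constraint is assumed to be plain equality, the posting list is built by brute-force pairwise comparison rather than by distributed computation, and a dependency X -> Y is assumed to hold when no data pair is similar on every attribute in X while failing the constraint on Y. The pruning rule shown (skip a candidate when a proper subset of its left-hand side already determines the same attribute) is likewise an assumption, since the source text is truncated at that point.

```python
from itertools import combinations

def build_posting_list(rows, similar):
    """S1 (sketch): map each data pair (i, j), i < j, to the set of attribute
    numbers on which the two records satisfy the similarity constraint."""
    posting = {}
    for i, j in combinations(range(len(rows)), 2):
        posting[(i, j)] = frozenset(
            a for a, (x, y) in enumerate(zip(rows[i], rows[j])) if similar(a, x, y)
        )
    return posting

def holds(lhs, rhs, posting):
    """ASSUMED semantics: lhs -> rhs holds if no pair is similar on all of lhs
    while failing the constraint on rhs."""
    return all(not (lhs <= attrs and rhs not in attrs) for attrs in posting.values())

def mine_level(k, n_attrs, posting, found):
    """S2/S3 (sketch): generate order-k candidates, prune those implied by an
    already mined lower-order dependency, verify the rest on the posting list."""
    result = set()
    for lhs in map(frozenset, combinations(range(n_attrs), k)):
        for rhs in set(range(n_attrs)) - lhs:
            # ASSUMED pruning rule: a proper subset of lhs already determines rhs.
            if any(sub < lhs and r == rhs for sub, r in found):
                continue
            if holds(lhs, rhs, posting):
                result.add((lhs, rhs))
    return result

# Hypothetical toy data: four records, three attributes (numbered 0, 1, 2).
rows = [(1, "x", 5), (1, "x", 5), (1, "y", 6), (2, "x", 7)]
posting = build_posting_list(rows, lambda a, x, y: x == y)  # equality is an assumption
level1 = mine_level(1, 3, posting, set())
level2 = mine_level(2, 3, posting, level1)
```

On this toy data, attribute 2 determines attributes 0 and 1 at the first order, and the only second-order dependency that survives pruning and verification is {0, 1} -> 2.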

Embodiment 2

[0075] On the basis of the distributed computing-based data dependency mining method provided in Embodiment 1 of the present invention, the specific implementation process of the data redistribution step is provided as follows:

[0076] The first step: because different attributes of the data may have different formats and different similarity (or distance) functions, the data is not convenient to process in parallel as stored. The present invention therefore first converts the row-stored database into several sub-datasets reorganized by column. Specifically, a data ID is specified or generated for each record in the original data set, and each attribute value of each record is stored as a triplet of data ID, attribute number, and attribute value. After all the data is processed, the triplets are redistributed according to the attribute number (corresponding to the ReduceByKey operation in the Spark framework), where each attribute number corresponds to a sub-database, which records the data ID and The value o...
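The row-to-column reorganization described above can be sketched in plain Python. This is only an in-memory stand-in for the distributed regrouping by key that the text attributes to Spark; the function names are hypothetical, and using the row index as the generated data ID is an assumption.

```python
from collections import defaultdict

def to_triplets(rows):
    """Yield one (data_id, attribute_number, value) triplet per cell.
    The row index is used as the generated data ID (an assumption)."""
    for data_id, row in enumerate(rows):
        for attr_no, value in enumerate(row):
            yield data_id, attr_no, value

def regroup_by_attribute(triplets):
    """Group triplets by attribute number, giving one column-wise
    sub-dataset of (data_id, value) entries per attribute."""
    columns = defaultdict(list)
    for data_id, attr_no, value in triplets:
        columns[attr_no].append((data_id, value))
    return dict(columns)

# Hypothetical two-record, two-attribute data set.
rows = [("Alice", 30), ("Bob", 41)]
columns = regroup_by_attribute(to_triplets(rows))
# columns == {0: [(0, "Alice"), (1, "Bob")], 1: [(0, 30), (1, 41)]}
```

Once the data is grouped this way, each sub-dataset contains values of a single format, so one similarity (or distance) function can be applied to it independently of the other attributes.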

Embodiment 3

[0154] On the basis of the distributed computing-based data dependency mining method provided in Embodiment 2 of the present invention, the specific implementation process of the first-order dependency mining step is as follows:

[0155] 1) For each data pair (i, j) in the attribute similarity posting list, take the attribute set A_ij that satisfies the similarity constraint, generate a Cartesian product, and aggregate the results into a first-order exclusion list.

[0156] 2) Remove the duplicate elements from the first-order exclusion list to obtain the first-order data dependencies to be excluded;

[0157] 3) Use the Cartesian product to generate a candidate set of first-order data dependencies, then eliminate the diagonal elements to obtain the non-trivial candidate set of first-order data dependencies;

[0158] 4) Subtract the first-order data dependencies to be excluded from the non-trivial candidate set of first-order data dependencies to obtain the mined first-order data ...
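The four steps above can be sketched with ordinary set operations. The operands of the Cartesian product in step 1 are not recoverable from the text, so this sketch assumes it pairs each attribute in A_ij with each attribute outside A_ij, meaning a pair that is similar on x but not on y rules out x -> y. The function name and the toy posting list are hypothetical.

```python
from itertools import product

def mine_first_order(posting, n_attrs):
    """Return the first-order dependencies (x, y), read as x -> y, that
    survive exclusion. `posting` maps a data pair to its similar attributes."""
    all_attrs = set(range(n_attrs))

    # 1) Per pair, build a Cartesian product and collect it in the exclusion
    #    list. ASSUMPTION: the product is A_ij x (all attributes - A_ij).
    exclusion_list = []
    for attrs in posting.values():
        exclusion_list.extend(product(attrs, all_attrs - attrs))

    # 2) Remove duplicates to get the dependencies to be excluded.
    excluded = set(exclusion_list)

    # 3) All attribute pairs minus the diagonal give the non-trivial candidates.
    candidates = {(x, y) for x, y in product(all_attrs, repeat=2) if x != y}

    # 4) Subtract the excluded dependencies from the candidates.
    return candidates - excluded

# Hypothetical posting list for four records and three attributes.
posting = {
    (0, 1): {0, 1, 2},
    (0, 2): {0},
    (0, 3): {1},
    (1, 2): {0},
    (1, 3): {1},
    (2, 3): set(),
}
first_order = mine_first_order(posting, 3)
# first_order == {(2, 0), (2, 1)}
```

In a distributed setting, step 1 would run per posting-list row and step 2 would be a global deduplication; here both are collapsed into local loops for readability.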


Abstract

The invention relates to the technical field of data processing and provides a data dependency mining method and system based on distributed computing. The method comprises: a data redistribution step of generating an attribute similarity inverted table (posting list) from an original data set; a first-order dependency mining step of mining first-order data dependencies from the attribute similarity inverted table; and a high-order dependency mining step of mining high-order data dependencies level by level, in which a high-order data dependency candidate set is generated, the candidate set is pruned based on the mined low-order data dependency set, and the pruned candidate set is verified using the attribute similarity inverted table. By generating the attribute similarity inverted table and adopting a recursive data dependency mining mode, the method makes data dependency mining more reliable and more accurate.

Description

Technical field

[0001] The invention relates to the technical field of data processing, in particular to a distributed computing-based data dependency mining method and system.

Background art

[0002] Even before the era of big data, discovering the regularities in data, and using them to infer and explore the physical world, was indispensable work for researchers in many fields. One common form of regularity is the data dependency, that is, a relationship in which a certain attribute of a record is uniquely or approximately determined by other attributes. The discovery of a new dependency can often bring new insights to theoretical research in related fields, and it also has practical value for tasks such as data cleaning and data query optimization. With the advent of the era of big data, a large amount of data is being generated in various fields such as industry, medical care, finance, and meteorology. The amou...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/2458
Inventor: 王宏志, 张翔熙
Owner: HARBIN INST OF TECH