A method and system for data dependency mining based on distributed computing

A data dependency mining technology based on distributed computing, applied in the field of data processing. It addresses problems such as low performance and many missed dependencies during mining, and reduces the amount of computation.

Active Publication Date: 2020-09-25

HARBIN INST OF TECH

6 Cites, 0 Cited by


AI Technical Summary

Problems solved by technology

The two are difficult to reconcile, so data dependency mining either misses many dependencies or performs poorly.



Examples


Embodiment 1

[0064] As shown in Figure 1, the distributed computing-based data dependency mining method provided in Embodiment 1 of the present invention may include the following steps:

[0065] Step S1: Generate an attribute similarity posting list from the original data set. This step is the data redistribution step. Each row in the attribute similarity posting list corresponds to one data pair in the original data set, and the row records the numbers of the attributes on which that pair satisfies the similarity constraint.

[0066] Step S2: Mine first-order data dependencies according to the attribute similarity posting list. This step is the first-order dependency mining step.

[0067] Step S3: Mine high-order data dependencies level by level. High-order data dependency candidate sets are generated first, that is, all candidate relationships of k-order data dependencies are generated, and high-order data dependency candidates are generated based on the mined low-or...
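Step S1 can be illustrated with a minimal single-machine sketch. It is not the patent's distributed implementation: the function name `build_posting_list`, the per-attribute `similar` predicates, and the choice to omit pairs that are similar on no attribute are all assumptions made for illustration.

```python
from itertools import combinations

def build_posting_list(rows, similar):
    """Map each data pair (i, j) to the attribute numbers on which it is similar.

    rows:    list of records, each a list of attribute values
    similar: one predicate per attribute, similar[a](x, y) -> bool
             (hypothetical stand-in for the similarity constraint)
    """
    posting = {}
    for i, j in combinations(range(len(rows)), 2):
        attrs = {a for a, sim in enumerate(similar) if sim(rows[i][a], rows[j][a])}
        if attrs:  # assumption: pairs similar on no attribute get no row
            posting[(i, j)] = attrs
    return posting

rows = [
    ["Alice", 30, "Boston"],
    ["Alice", 31, "Boston"],
    ["Bob", 55, "Denver"],
]
similar = [
    lambda x, y: x == y,           # attribute 0: exact match
    lambda x, y: abs(x - y) <= 2,  # attribute 1: numeric distance threshold
    lambda x, y: x == y,           # attribute 2: exact match
]
posting = build_posting_list(rows, similar)
print(posting)
```

In this toy data set only the pair (0, 1) is similar on any attribute, so the posting list has a single row.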

Embodiment 2

[0075] On the basis of the distributed computing-based data dependency mining method provided in Embodiment 1 of the present invention, the specific implementation process of the data redistribution step is provided as follows:

[0076] The first step: because different attributes of the data may have different formats and different similarity (or distance) functions, the data is not convenient to process in parallel as stored. The present invention therefore first converts the database stored by rows into several sub-datasets reorganized by columns. Specifically, a data ID is specified or generated for each piece of data in the original data set, and for each attribute of each piece of data, the data ID, attribute number, and attribute value are stored as a triplet. After all the data is processed, it is redistributed according to the attribute number (corresponding to the ReduceByKey operation in the Spark framework), where each attribute number corresponds to a sub-database, which records the data ID and the value o...
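The row-to-column conversion described above can be sketched in plain Python. This is only an illustration: an in-memory dictionary stands in for Spark's key-based redistribution, the data IDs are generated as row indices, and the helper names `to_triples` and `redistribute` are hypothetical.

```python
from collections import defaultdict

def to_triples(rows):
    """Flatten row-stored records into (data ID, attribute number, value) triples."""
    return [
        (data_id, attr, value)
        for data_id, row in enumerate(rows)  # assumption: data ID = row index
        for attr, value in enumerate(row)
    ]

def redistribute(triples):
    """Group triples by attribute number: one (data ID, value) list per attribute."""
    subsets = defaultdict(list)
    for data_id, attr, value in triples:
        subsets[attr].append((data_id, value))
    return dict(subsets)

rows = [["Alice", 30], ["Bob", 55]]
columns = redistribute(to_triples(rows))
print(columns)
```

Each resulting sub-dataset holds values of a single attribute, so one similarity function can be applied uniformly within it.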

Embodiment 3

[0154] On the basis of the distributed computing-based data dependency mining method provided in Embodiment 2 of the present invention, the specific implementation process of the first-order dependency mining step is as follows:

[0155] 1) For each data pair (i, j) in the attribute similarity posting list, take the attribute set A_ij on which the pair satisfies the similarity constraint, generate a Cartesian product, and aggregate the results into a first-order exclusion list;

[0156] 2) Eliminate the repeated elements in the first-order exclusion list to obtain the first-order data dependencies to be excluded;

[0157] 3) Use the Cartesian product to generate a candidate set of first-order data dependencies, and obtain a non-trivial candidate set of first-order data dependencies after eliminating the diagonal elements;

[0158] 4) Subtract the first-order data dependencies to be excluded from the non-trivial candidate set of first-order data dependencies to obtain the mined first-order data ...
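The four steps above can be sketched as follows. The operands of the Cartesian product in step 1 are not given in the text, so this sketch assumes it pairs each attribute in A_ij with each attribute outside A_ij: a pair that is similar on attribute a but not on attribute b rules out the dependency a -> b. The function name `mine_first_order` and the dictionary representation of the posting list are also assumptions.

```python
from itertools import product

def mine_first_order(posting, num_attrs):
    """Return the first-order dependencies (a, b), read as a -> b, that no pair violates.

    posting:   dict mapping a data pair (i, j) to its similar attribute set A_ij
    num_attrs: total number of attributes
    """
    all_attrs = set(range(num_attrs))

    # Steps 1 and 2: build the exclusion list; a set removes repeated elements.
    excluded = set()
    for attrs in posting.values():
        # ASSUMPTION: the product is A_ij x (all attributes minus A_ij).
        excluded.update(product(attrs, all_attrs - attrs))

    # Step 3: all attribute pairs, minus the diagonal (trivial a -> a).
    candidates = {(a, b) for a, b in product(all_attrs, repeat=2) if a != b}

    # Step 4: subtract the excluded dependencies from the candidates.
    return candidates - excluded

posting = {(0, 1): {0, 1, 2}, (0, 2): {2}, (1, 3): {0, 1}}
deps = mine_first_order(posting, num_attrs=3)
print(sorted(deps))
```

Here the pair (0, 2) excludes 2 -> 0 and 2 -> 1, and the pair (1, 3) excludes 0 -> 2 and 1 -> 2, leaving 0 -> 1 and 1 -> 0.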


Abstract

The invention relates to the technical field of data processing, and provides a data dependency mining method and system based on distributed computing. The method comprises a data redistribution step of generating an attribute similarity inverted table according to an original data set; a first-order dependency mining step of mining first-order data dependency relationships according to the attribute similarity inverted table; and a high-order dependency mining step of mining high-order data dependency relationships level by level, wherein a high-order data dependency candidate set is generated, the candidate set is pruned based on the already mined low-order data dependency set, and the dependencies in the pruned candidate set are verified using the attribute similarity inverted table. By generating the attribute similarity inverted table and adopting a recursive data dependency mining mode, the method of the invention makes data dependency mining more reliable and accurate.
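The generate, prune, and verify cycle of the high-order step can be sketched for a single level k. This is a sketch under stated assumptions rather than the patented procedure: the pruning rule (skip a candidate when a mined dependency with a smaller left-hand side already implies it), the violation test, and the name `mine_level` are illustrative choices not confirmed by the text.

```python
from itertools import combinations

def mine_level(posting, num_attrs, k, lower):
    """Mine dependencies lhs -> rhs whose left-hand side has exactly k attributes.

    posting: dict mapping a data pair (i, j) to its similar attribute set
    lower:   set of (frozenset lhs, rhs) dependencies mined at lower orders
    """
    found = set()
    for lhs in map(frozenset, combinations(range(num_attrs), k)):
        for rhs in range(num_attrs):
            if rhs in lhs:
                continue  # trivial candidate
            # Prune (ASSUMPTION): a mined dependency on a proper subset implies it.
            if any(l < lhs and r == rhs for l, r in lower):
                continue
            # Verify: no pair may be similar on all of lhs yet dissimilar on rhs.
            if all(not (lhs <= sim and rhs not in sim) for sim in posting.values()):
                found.add((lhs, rhs))
    return found

posting = {(0, 1): {0, 1, 2}, (0, 2): {0}, (1, 2): {1}}
# First-order dependencies that hold for this posting list: 2 -> 0 and 2 -> 1.
first_order = {(frozenset({2}), 0), (frozenset({2}), 1)}
second_order = mine_level(posting, num_attrs=3, k=2, lower=first_order)
print(second_order)
```

In this example the candidates {0, 2} -> 1 and {1, 2} -> 0 are pruned because 2 -> 1 and 2 -> 0 already hold, and only {0, 1} -> 2 is verified against the posting list.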

Description

Technical field

[0001] The invention relates to the technical field of data processing, in particular to a distributed computing-based data dependency mining method and system.

Background

[0002] Before the era of big data, discovering the regularities in data, so as to infer and explore the physical world, was an indispensable task for researchers in various fields. Among these regularities, the data dependency relationship, that is, the relationship in which a certain attribute of a record is uniquely or approximately determined by other attributes, is a common form. The discovery of a new dependency relationship can often bring new insights and discoveries to theoretical research in related fields, and also has practical value in tasks such as data cleaning and data query optimization. With the advent of the era of big data, a large amount of data is being generated in various fields such as industry, medical care, finance, and meteorology. The amou...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F16/2458

Inventors: 王宏志, 张翔熙

Owner: HARBIN INST OF TECH
