Data space heterogeneous data automatic fusion method based on unsupervised learning clustering algorithm

An unsupervised learning clustering technique applied in the field of data integration. It addresses three shortcomings of existing approaches: matching accuracy cannot be guaranteed, heterogeneous synonymy of attribute names is not considered, and conflicts between data values cannot be resolved. The stated effect is accurate matching.

Pending Publication Date: 2022-07-12
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0006] Existing techniques do not consider heterogeneous synonyms among attribute names, cannot guarantee matching accuracy, and cannot resolve conflicts between data values. To address these problems, the present invention provides a method for automatic fusion of heterogeneous data in a data space based on an unsupervised learning clustering algorithm, which is characterized by accurate matching.



Examples


Embodiment 1

[0048] As shown in Figure 2, the method for automatic fusion of heterogeneous data in data space based on an unsupervised learning clustering algorithm includes the following steps:

[0049] S1. Obtain heterogeneous data, and preprocess the heterogeneous data;

[0050] S2. Perform schema extraction on the preprocessed heterogeneous data to obtain attribute names and attribute values of the heterogeneous data;

[0051] S3. Obtain the corpus, and pre-train the word embedding module through the corpus;

[0052] S4. Input the attribute name into the pre-trained word embedding module, and obtain the attribute name vector;

[0053] S5. Pre-select a target schema and determine whether the source schema and the target schema are consistent. If they are consistent, go to the next step; if they are inconsistent, input the attribute values into the hybrid matcher and calculate the similarity between the source schema and the target schema from the attribute values, to determine whether the source ...
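
Steps S3 and S4 can be illustrated with a minimal sketch. The text above does not say which embedding model is used, so the code substitutes plain co-occurrence count vectors for a trained word embedding. The toy corpus, the function names, and the window size are all assumptions made for illustration, not details taken from the patent.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def train_cooccurrence_embeddings(corpus, window=2):
    """S3 (stand-in): map each word to a sparse vector of neighbour counts."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return dict(vectors)


def attribute_name_vector(name, embeddings):
    """S4: sum the word vectors of the tokens that make up an attribute name."""
    total = Counter()
    for token in tokenize(name):
        total.update(embeddings.get(token, {}))
    return total


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy corpus; a real corpus would be far larger.
corpus = [
    "the customer phone number is required",
    "the customer telephone number is required",
    "the order price is shown in dollars",
]
embeddings = train_cooccurrence_embeddings(corpus)
phone = attribute_name_vector("phone", embeddings)
telephone = attribute_name_vector("telephone", embeddings)
price = attribute_name_vector("price", embeddings)
```

In this toy setting, "phone" and "telephone" occur in identical contexts, so their vectors are closer to each other than either is to "price". That is the property the method relies on to recognize synonymous attribute names.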

Embodiment 2

[0056] As shown in Figures 2 and 3, the method for automatic fusion of heterogeneous data in data space based on an unsupervised learning clustering algorithm includes the following steps:

[0057] S1. Obtain heterogeneous data, and preprocess the heterogeneous data;

[0058] S2. Perform schema extraction on the preprocessed heterogeneous data to obtain attribute names and attribute values of the heterogeneous data;

[0059] S3. Obtain the corpus, and pre-train the word embedding module through the corpus;

[0060] S4. Input the attribute name into the pre-trained word embedding module, and obtain the attribute name vector;

[0061] S5. Pre-select a target schema and determine whether the source schema and the target schema are consistent. If they are consistent, go to the next step; if they are inconsistent, input the attribute values into the hybrid matcher and calculate the similarity between the source schema and the target schema from the attribute values, judging whether the ...
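
The value-based comparison in step S5 can be sketched as follows. The internals of the hybrid matcher are not given in the text above, so this example assumes one plausible design: a weighted mix of two value-level measures, the overlap of distinct values and the overlap of value "shapes" (character patterns). The function names, the weight, and the 0.5 threshold are hypothetical.

```python
def schemas_consistent(source_attrs, target_attrs):
    """First check in S5: do both schemas use the same attribute names?"""
    return set(source_attrs) == set(target_attrs)


def jaccard(a, b):
    """Overlap of the distinct items in two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def shape(value):
    """Reduce a value to its character pattern: digits -> 9, letters -> a."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


def hybrid_similarity(source_values, target_values, weight=0.5):
    """Assumed hybrid: weighted mix of value overlap and shape overlap."""
    value_sim = jaccard(source_values, target_values)
    shape_sim = jaccard(
        [shape(v) for v in source_values], [shape(v) for v in target_values]
    )
    return weight * value_sim + (1 - weight) * shape_sim


def is_match(similarity, threshold=0.5):
    """Hypothetical decision rule: match when similarity reaches the threshold."""
    return similarity >= threshold


# Hypothetical attribute columns from a source and a target schema.
source_phone = ["555-0101", "555-0102"]
target_tel = ["555-0102", "555-0199"]
target_city = ["Guangzhou", "Shenzhen"]

phone_vs_tel = hybrid_similarity(source_phone, target_tel)
phone_vs_city = hybrid_similarity(source_phone, target_city)
```

Here the two phone columns share one value and the same shape, so they score well above the threshold, while the phone and city columns share neither and score zero.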

Embodiment 3

[0102] The method for automatic fusion of heterogeneous data in data space based on unsupervised learning clustering algorithm includes the following steps:

[0103] S1. Obtain heterogeneous data, and preprocess the heterogeneous data. In this embodiment, the data to be matched is loaded into the system through the open API of the source data, and each schema is checked in turn for null values or abnormal data. If the data is found to be incomplete, it is filled or discarded according to the actual storage requirements.

[0104] S2. Perform schema extraction on the preprocessed heterogeneous data to obtain the attribute names and attribute values of the heterogeneous data. In this embodiment, the complete data attribute names are extracted from each data source and stored together in a file, which can be in txt or csv format.

[0105] S3. Obtain the corpus, and pre-train the word embedding module through the co...
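
Steps S1 and S2 of this embodiment can be sketched as below. This is a minimal illustration, assuming records arrive as a list of dictionaries and that "incomplete" means a field is `None` or an empty string; detection of abnormal values is left out. The function names and the sample records are hypothetical.

```python
import csv
import io

MISSING = (None, "")


def preprocess(records, fill_value=None):
    """S1: fill missing fields with fill_value, or drop the record if it is None."""
    cleaned = []
    for record in records:
        incomplete = any(v in MISSING for v in record.values())
        if not incomplete:
            cleaned.append(dict(record))
        elif fill_value is not None:
            cleaned.append(
                {k: (fill_value if v in MISSING else v) for k, v in record.items()}
            )
    return cleaned


def extract_schema(records):
    """S2: collect attribute names and, for each name, its list of values."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema.setdefault(name, []).append(value)
    return schema


def write_attribute_names(schema, stream):
    """Store the attribute names one per row in csv format."""
    writer = csv.writer(stream)
    for name in schema:
        writer.writerow([name])


# Hypothetical records loaded from a source system.
records = [
    {"name": "Ann", "phone": "555-0101"},
    {"name": "Bob", "phone": ""},
]
dropped = preprocess(records)                        # discard incomplete rows
filled = preprocess(records, fill_value="unknown")   # or fill them instead
schema = extract_schema(filled)

buffer = io.StringIO()  # stands in for a real csv file on disk
write_attribute_names(schema, buffer)
```

Whether to fill or discard is a per-deployment choice, which is why the sketch exposes it as a single `fill_value` parameter rather than fixing one policy.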


Abstract

The invention relates to the technical field of data integration and discloses a method for automatic fusion of heterogeneous data in a data space based on an unsupervised learning clustering algorithm. The method comprises the following steps: S1, preprocessing the heterogeneous data; S2, performing schema extraction to obtain the attribute names and attribute values of the heterogeneous data; S3, pre-training a word embedding module on a corpus; S4, inputting the attribute names into the pre-trained word embedding module to obtain attribute name vectors; S5, judging whether the source schema is consistent with the target schema; if so, proceeding to the next step; if not, inputting the attribute values into a hybrid matcher, calculating the similarity between the source schema and the target schema from the attribute values, judging whether the source schema matches the target schema, and if so, proceeding to the next step; and S6, performing clustering integration with an unsupervised learning clustering algorithm according to the attribute name vectors. The method addresses the problems in the prior art that matching accuracy cannot be guaranteed and that conflicts between data values cannot be resolved, which arise because heterogeneous synonymy of attribute names is not considered.
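
Step S6 can be illustrated with a small sketch. The abstract does not name the clustering algorithm, so the code below uses greedy leader clustering over cosine similarity purely as a stand-in. The three-dimensional attribute name vectors, the function names, and the 0.8 threshold are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_attributes(name_vectors, threshold=0.8):
    """Greedy leader clustering (a stand-in for the unspecified algorithm).

    Each attribute joins the first cluster whose leader vector is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # list of (leader_vector, member_names)
    for name, vector in name_vectors.items():
        for leader, members in clusters:
            if cosine(leader, vector) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vector, [name]))
    return [members for _, members in clusters]


# Hypothetical attribute name vectors, as if produced by step S4.
name_vectors = {
    "phone": [0.9, 0.1, 0.0],
    "telephone": [0.8, 0.2, 0.0],
    "price": [0.0, 0.1, 0.9],
    "cost": [0.1, 0.0, 0.9],
}
clusters = cluster_attributes(name_vectors)
```

Attributes that end up in the same cluster are treated as synonymous and can be merged into one attribute of the fused schema. No labels are needed, which is what makes the step unsupervised.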

Description

Technical field

[0001] The invention relates to the technical field of data integration, and more particularly to a method for automatic fusion of heterogeneous data in data space based on an unsupervised learning clustering algorithm.

Background

[0002] With the transition from the IT (Information Technology) era to the DT (Data Technology) era, data has become increasingly large in scale, complex in its sources, and diverse in type and structure, so that enterprises face the problem of data asset management. Independent business applications form closed, isolated "data islands" whose data lacks correlation with other data, making it difficult for data to flow between business applications.

[0003] The core task of data integration is to integrate interrelated but physically isolated multi-source heterogeneous data so that users can access the data in a transparent manner. The essence of data integration is schema matching. In the process of data fusion, it is nece...


Application Information

IPC(8): G06F16/35, G06F16/33, G06K9/62, G06N3/08
CPC: G06F16/355, G06F16/3347, G06N3/088, G06F18/2321, G06F18/2135, G06F18/22, Y02D10/00
Inventor: 孙伟, 沈光明
Owner: SUN YAT SEN UNIV