MIC-based field value preferential connection method for heterogeneous data sets

A technology for connecting heterogeneous data, applied in database retrieval, special data processing applications, etc. It addresses problems such as the difficulty of realistically simulating heterogeneous data, large errors relative to real data, and overestimated results in test performance evaluation.

Active Publication Date: 2019-07-09
FUJIAN NORMAL UNIV

AI Technical Summary

Problems solved by technology

The disadvantage is that ProWGen only implements simple positive/negative correlation for field connection, which makes it difficult to realistically simulate the complex and diverse heterogeneous data found in practice.
With the explosive growth of Internet data volume, the Zipf-like distribution is no longer suitable for describing heterogeneous data distributions with heavy-tail characteristics. If a Zipf-like distribution is used to generate data, the performance evaluation of the system being tested with that data will be overestimated and will deviate substantially from the results on real data, which means that the generated data is unreliable.

Examples

Embodiment Construction

[0039] The present invention provides an MIC-based field value preferential connection method for heterogeneous data sets. Given two heterogeneous data sets U and V, where U contains field A and V contains field B, the problem is to connect field A and field B into a data set of l records. All the values in field A form the set S_A = {A_1, A_2, A_3, ..., A_m}, and all the values in field B form the set S_B = {B_1, B_2, B_3, ..., B_n}. Each record has the form {A_x, B_y}, 1 ≤ x ≤ m, 1 ≤ y ≤ n, where m and n are the numbers of values in fields A and B, respectively. As shown in Figure 1, the method includes the following steps:

[0040] Step 1: Fit the stretched exponential (SE) distribution of the heterogeneous data set, that is, the SE parameters a, b, c, and x_0, where c is the stretch parameter, x_0 is the scale parameter, a represents the slope of the approximately straight line obtained by the SE fit, and...
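The description of Step 1 is truncated here, so the following Python sketch is only one plausible reading of the SE fit, not the patent's own procedure. It assumes the commonly used rank-ordered form y_i^c = -a ln(i) + b with a = x_0^c, where y_i is the occurrence count of the i-th most frequent value. The function name `fit_se_rank`, the grid search over c, and the use of R^2 as the selection criterion are all assumptions made for illustration.

```python
import numpy as np

def fit_se_rank(counts, c_grid=None):
    """Fit y_i^c = -a*ln(i) + b to rank-ordered counts (assumed SE form).

    Returns (a, b, c, x0). The sign convention (a is the magnitude of the
    slope) and the relation a = x0**c are assumptions, not taken from the
    truncated source text.
    """
    y = np.sort(np.asarray(counts, dtype=float))[::-1]  # rank 1 = largest count
    log_rank = np.log(np.arange(1, len(y) + 1))
    if c_grid is None:
        c_grid = np.linspace(0.05, 1.0, 96)  # candidate stretch parameters

    best = None
    for c in c_grid:
        yc = y ** c
        # Least-squares straight line in (ln rank, y^c) coordinates.
        slope, intercept = np.polyfit(log_rank, yc, 1)
        residual = yc - (slope * log_rank + intercept)
        ss_tot = np.sum((yc - yc.mean()) ** 2)
        r2 = 1.0 - np.sum(residual ** 2) / ss_tot if ss_tot > 0 else 0.0
        if best is None or r2 > best[0]:
            best = (r2, -slope, intercept, c)

    _, a, b, c = best
    # Scale parameter from a = x0**c; undefined if the fitted slope is not negative.
    x0 = a ** (1.0 / c) if a > 0 else float("nan")
    return a, b, c, x0
```

Choosing c by scanning a grid keeps the sketch dependency-light; a nonlinear optimizer could replace the scan if finer resolution is needed.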

Abstract

The invention relates to an MIC-based field value preferential connection method for heterogeneous data sets, comprising the following steps: (1) fitting the parameters of the SE distribution of the heterogeneous data set; (2) calculating the MIC coefficient between field A and field B; (3) generating the sets StA and StB of all occurrences of the values in fields A and B, respectively; (4) establishing the cumulative distribution functions PA(x) and PB(y) corresponding to StA and StB; (5) determining whether the total record count l is zero: if it is, going to the last step, otherwise proceeding to the next step; (6) calculating the field value Ax of field A according to PA(x); (7) calculating the field value By of field B based on the field preferential connection model; (8) saving {Ax, By} as a record; (9) updating the total record count l = l - 1 and returning to step 5; (10) completing all connections of the heterogeneous data. The method helps simulate heterogeneous data sets realistically, so that the connected data set maintains a reasonable balance between the fields and the similarity between nodes.
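The control flow in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented method: the abstract does not define the field preferential connection model or how the MIC coefficient enters it, so that part is passed in as a hypothetical callable `pick_b`, and the SE fitting and MIC calculation are omitted. The names `build_cdf`, `sample_from_cdf`, `connect_fields`, and `independent_pick` are invented for this example.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

def build_cdf(observed_values):
    """Count the occurrences of each value and build an empirical CDF."""
    counts = Counter(observed_values)
    values = sorted(counts)
    total = sum(counts.values())
    cdf = [s / total for s in accumulate(counts[v] for v in values)]
    cdf[-1] = 1.0  # guard against floating-point rounding
    return values, cdf

def sample_from_cdf(values, cdf, rng):
    """Inverse-transform sampling: first value whose cumulative mass >= u."""
    u = rng.random()  # u is in [0, 1), so the index is always valid
    return values[bisect.bisect_left(cdf, u)]

def connect_fields(field_a, field_b, total_records, pick_b, seed=0):
    """Generate total_records records (A_x, B_y).

    pick_b is a HYPOTHETICAL stand-in for the field preferential connection
    model, which the abstract names but does not specify.
    """
    rng = random.Random(seed)
    values_a, cdf_a = build_cdf(field_a)
    values_b, cdf_b = build_cdf(field_b)
    records = []
    remaining = total_records
    while remaining > 0:                               # stop when no records remain
        a_x = sample_from_cdf(values_a, cdf_a, rng)    # draw A_x from P_A
        b_y = pick_b(a_x, values_b, cdf_b, rng)        # connection model (assumed)
        records.append((a_x, b_y))                     # save the record
        remaining -= 1                                 # update the record count
    return records

def independent_pick(a_x, values_b, cdf_b, rng):
    """Placeholder model: ignores a_x and MIC, samples B from P_B alone."""
    return sample_from_cdf(values_b, cdf_b, rng)
```

A call such as `connect_fields(["a", "a", "b"], [1, 2, 2, 2], 50, independent_pick)` returns 50 records. Replacing `independent_pick` with a model that uses the MIC coefficient is where the dependence between the two fields would be introduced.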

Description

Technical field

[0001] The invention relates to the technical field of heterogeneous data field value connection, and in particular to an MIC-based field value preferential connection method for heterogeneous data sets.

Background technique

[0002] Reasonable analysis of the field content of heterogeneous data sets is helpful for the construction and testing of their domain systems. However, heterogeneous data sets usually reach the TB or even PB level, which consumes a great deal of network resources, and the fields describing user behavior and related item attributes contain private information, so companies and governments are rarely willing to share their data with researchers. With the continuous expansion of the Internet, the heavy-tail phenomenon in heterogeneous data has become more and more common, and the connection relationships between the various fields have become more and more complicated. It is extremely difficult to generate heterog...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/90
CPC: G06F16/90
Inventor: 肖如良, 丘志鹏, 张锐, 蔡声镇, 倪友聪, 杜欣
Owner: FUJIAN NORMAL UNIV