Linking Data Elements Based on Similarity Data Values and Semantic Annotations

A data element linking technology based on similarity of data values and semantic annotations, applied in the field of data management and data linking. It addresses limits on the number of data sources that can be linked, the inapplicability of standard analytic techniques that assume complete access to the whole data, and the limited scale of existing linking systems.

Inactive Publication Date: 2013-12-12
IBM CORP
1 Cites, 26 Cited by

AI Technical Summary

Benefits of technology

[0006] Exemplary embodiments of systems and methods in accordance with the present invention are directed to an automated method for choosing data elements from data sources to link based on similarity of semantic annotations, similarity between data value instances, or a combination of both semantic annotations and data instances. Well-known measures of similarity from Information Retrieval and data mining are used, e.g., Jaccard similarity and cosine similarity. Any type of annotations about semantic features of data elements can be used; however, the present invention is not dependent on the method used to create the annotations. Unlike existing techniques that require user or human interaction to choose candidate data elements for linking, the present invention is completely automatic, because all pairs of data elements across and within all data sources are considered for linking. Systems and methods in accordance with the present invention are computationally feasible in both space and time through the significant reduction in search space achieved through the use of signatures. For each data element, a summary of the set of data values or annotations for that element is computed. This summary is called a signature. These signatures are constructed in a manner that preserves similarity, i.e., the similarity of two signatures is very close to the similarity of their underlying sets. In addition, these signatures have a fixed size regardless of the size of the sets of annotations and data values.
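
The signature idea described above can be illustrated with a small sketch. The text does not say which signature scheme is used, so the code below assumes MinHash, a standard similarity-preserving summary for Jaccard similarity. The function names (`jaccard`, `minhash_signature`, `estimate_similarity`), the choice of `blake2b` as the hash, and the default of 128 hash functions are all illustrative assumptions, not details taken from the patent.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(value, seed):
    """One member of a family of hash functions, selected by `seed` (assumed scheme)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(values, num_hashes=128):
    """Summarize a non-empty set of values as a list of `num_hashes` integers.

    The signature length depends only on `num_hashes`, never on how many
    values the data element holds.
    """
    values = set(values)
    if not values:
        raise ValueError("cannot build a signature for an empty value set")
    return [min(_seeded_hash(v, seed) for v in values) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree; approximates Jaccard."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)
```

Comparing `estimate_similarity(minhash_signature(a), minhash_signature(b))` with `jaccard(a, b)` on a few sets shows the two properties the paragraph claims: the estimate tracks the exact value, and the signature size stays fixed however large the underlying sets grow. More hash functions give a tighter estimate at the cost of a larger signature.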

Problems solved by technology

Discovering links between data elements of different data sources is a fundamental problem for the emerging semantic web, as well as traditional data integration systems.
Therefore, users need a reasonably deep understanding of the data sources and their elements, limiting both which data sources will be linked (only those with which a user is familiar) and the number of data sources that will be linked, since these specifications take time to create.
Furthermore, the ability of existing linking systems to scale to a large number of medium to large size data sources is limited by the quadratic number of comparisons that need to be performed to exhaustively check for links between pairs of elements of different data sources.
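
To make the quadratic growth concrete, the following minimal sketch counts the comparisons an exhaustive check would need. It assumes every unordered pair of data elements is compared exactly once; the function name `exhaustive_comparisons` is hypothetical.

```python
from math import comb


def exhaustive_comparisons(num_elements):
    """Number of unordered pairs among `num_elements` data elements: n * (n - 1) / 2."""
    return comb(num_elements, 2)


if __name__ == "__main__":
    # Ten times more elements means roughly one hundred times more comparisons.
    print(exhaustive_comparisons(1_000))   # 499500
    print(exhaustive_comparisons(10_000))  # 49995000
```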
The result is that some standard analytic techniques that assume complete access to the whole data are not applicable.
Very few existing methods even attempt to use instance values for matching data elements, and such instance-based methods have been described as useful but prohibitively expensive.
Examples of methods that use instance values include: methods that use values only for validation; methods that look at the distribution and other properties of the values to infer meta-data about the elements; methods that use only a selected sample of instance values; methods that are expensive and do not avoid the quadratic number of comparisons; and methods that rely on particular properties of instance values and so work only in a limited set of applications.

Embodiment Construction

[0023] Referring initially to FIG. 1, an exemplary embodiment of a system for linking data elements derived from data sources 100 is illustrated. The system includes a data element linking computing system 102 that is in communication with at least one and preferably a plurality of data sources 104. Suitable computing systems include computers and processors within a single domain as well as distributed computing systems. These systems include the processors and software applications required to perform the functions of the data element linking computing system. Suitable data sources include collections or repositories of data, preferably in a computer or machine readable format. These data include structured data such as a relational database and semi-structured data such as an Extensible Markup Language (XML) document. Each data source comprises at least one and preferably a plurality of data elements 106. A given data element represents a subset of the data contained in its data ...

Abstract

Data elements from data sources, each having a data value set, are linked by using hash functions to determine a dimensionally reduced instance signature for each data element based on all data values associated with that data element. This yields a plurality of dimensionally reduced instance signatures of equivalent fixed size, such that similarities among the data values in the data value sets across all data elements are maintained among the plurality of instance signatures. Candidate pairs of data elements to link are identified using the plurality of instance signatures in locality sensitive hash functions, and a similarity index is generated for each candidate pair using a pre-determined measure of similarity. Candidate pairs of data elements having a similarity index above a given threshold are linked.
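
The pipeline in the abstract (signatures, locality sensitive hashing to pick candidate pairs, then a thresholded similarity index) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes MinHash signatures, the common banding form of locality sensitive hashing, and signature agreement as the similarity index. All names (`signature`, `candidate_pairs`, `link_elements`) and the defaults for `num_hashes`, `bands`, and `threshold` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _seeded_hash(value, seed):
    """One hash function from a seeded family (assumed scheme)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(values, num_hashes):
    """Fixed-size MinHash signature of a non-empty set of data values."""
    values = set(values)
    if not values:
        raise ValueError("cannot build a signature for an empty value set")
    return [min(_seeded_hash(v, s) for v in values) for s in range(num_hashes)]


def candidate_pairs(signatures, bands):
    """Banding LSH: elements sharing any identical band become a candidate pair."""
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


def link_elements(elements, num_hashes=128, bands=32, threshold=0.5):
    """Return {(element_a, element_b): similarity index} for pairs above `threshold`.

    `elements` maps a data element name to its collection of data values.
    """
    if num_hashes % bands != 0:
        raise ValueError("num_hashes must be divisible by bands")
    sigs = {name: signature(vals, num_hashes) for name, vals in elements.items()}
    links = {}
    for a, b in candidate_pairs(sigs, bands):
        # Similarity index: fraction of agreeing positions (estimates Jaccard).
        score = sum(x == y for x, y in zip(sigs[a], sigs[b])) / num_hashes
        if score > threshold:
            links[(a, b)] = score
    return links
```

Only elements that collide in at least one band are ever scored, which is how this kind of scheme avoids comparing every pair. The number of bands trades recall against the number of candidates: more bands with fewer rows each catch lower-similarity pairs but also admit more candidates to score.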

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application is a continuation of co-pending U.S. patent application Ser. No. 13/491,724 filed Jun. 8, 2012. The entire disclosure of that application is incorporated herein by reference.

FIELD OF THE INVENTION

[0002] The present invention relates to data management and data linking.

BACKGROUND OF THE INVENTION

[0003] The semantic web is an extension of the world wide web that incorporates semantics into the data or web pages that are accessed and downloaded across the internet. Discovering links between data elements of different data sources is a fundamental problem for the emerging semantic web, as well as traditional data integration systems. These links are the building blocks for searching, querying and other higher level services. Link discovery techniques span syntactic methods, many derived from similarity measures developed by Information Retrieval (IR). These techniques include structural ones, e.g., using foreign key relat...

Application Information

Patent Type & Authority: Applications (United States)
IPC: IPC(8): G06F17/30
CPC: G06F16/951
Inventor: BORNEA, MIHAELA ANCUTA; DUAN, SONGYUN; FOKOUE-NKOUTCHE, ACHILLE BELLY; HASSANZADEH, OKTIE; KEMENTSIETSIDIS, ANASTASIOS; SRINIVAS, KAVITHA; WARD, MICHAEL J.
Owner: IBM CORP