
Linking Data Elements Based on Similarity Data Values and Semantic Annotations

A technique for linking data elements by the similarity of their data values, applied in the field of data management and data linking. It addresses three problems: limits on the number of data sources that can be linked, the inapplicability of standard analytic techniques that assume complete access to the whole data, and the limited scalability of existing linking systems.

Status: Inactive
Publication Date: 2013-12-12
IBM CORP

AI Technical Summary

Benefits of technology

This patent text describes a method for identifying similar data elements across data sources using Locality Sensitive Hashing (LSH). LSH reduces the cost of pairwise comparison because similar data elements are more likely than dissimilar ones to hash to the same bucket, so only a small portion of all possible pairs needs to be compared. The method also uses semantic annotations to construct signatures that better represent the content of each data element, and using multiple signatures increases the accuracy of the similarity predictions. Data value instances and semantic annotations can also be combined to create more complex signatures.
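To make the idea of a fixed-size signature concrete, here is a minimal sketch that assumes MinHash as the signature scheme. The summary above does not name a specific hash family, so MinHash, the helper names (`stable_hash`, `minhash_signature`, `estimated_similarity`), the sample columns, and the `value:` / `annotation:` token prefixes are all illustrative assumptions, not the patented method.

```python
import hashlib


def stable_hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes: int = 64) -> list:
    """Fixed-size signature: for each seed, the minimum hash over the token set."""
    unique = set(tokens)
    if not unique:
        raise ValueError("a data element needs at least one token")
    return [min(stable_hash(seed, t) for t in unique) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of agreeing positions, which approximates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b) -> float:
    """Exact Jaccard similarity of two token collections, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical instance values of two data elements (columns).
city_values = ["paris", "rome", "berlin", "madrid"]
town_values = ["paris", "rome", "berlin", "vienna"]

# Assumption: one simple way to mix instance values and semantic annotations
# is to prefix each token with its kind and hash them as a single set.
city_tokens = [f"value:{v}" for v in city_values] + ["annotation:City"]
town_tokens = [f"value:{v}" for v in town_values] + ["annotation:City"]

sig_city = minhash_signature(city_tokens)
sig_town = minhash_signature(town_tokens)

print("exact Jaccard:", jaccard(city_tokens, town_tokens))
print("signature estimate:", estimated_similarity(sig_city, sig_town))
```

Every signature has the same length regardless of how many values the data element holds, which is what lets later steps compare elements cheaply. More hash functions give a tighter estimate at the cost of a longer signature.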

Problems solved by technology

Discovering links between data elements of different data sources is a fundamental problem for the emerging semantic web, as well as traditional data integration systems.
Therefore, users need a reasonably deep understanding of the data sources and their elements, limiting both which data sources will be linked (only those with which a user is familiar) and the number of data sources that will be linked, since these specifications take time to create.
Furthermore, the ability of existing linking systems to scale to a large number of medium to large size data sources is limited by the quadratic number of comparisons needed to exhaustively check for links between pairs of elements of different data sources.
The result is that some standard analytic techniques that assume complete access to the whole data are not applicable.
Very few existing methods even attempt to use instance values for matching data elements, and such instance-based methods have been described as useful but prohibitively expensive.
Examples of methods that use instance values include methods that use values only for validation, methods that look at the distribution and other properties of the values to infer meta-data about the elements, methods that use only a selected sample of instance values, methods that are expensive and do not avoid the quadratic number of comparisons, and methods that rely on particular properties of instance values that work only in a limited set of applications.



Examples


Embodiment Construction

[0023] Referring initially to FIG. 1, an exemplary embodiment of a system for linking data elements derived from data sources 100 is illustrated. The system includes a data element linking computing system 102 that is in communication with at least one and preferably a plurality of data sources 104. Suitable computing systems include computers and processors within a single domain as well as distributed computing systems. These systems include the processors and software applications required to perform the functions of the data element linking computing system. Suitable data sources include collections or repositories of data, preferably in a computer or machine readable format. These data include structured data such as a relational database and semi-structured data such as an extensible mark-up language (XML) document. Each data source comprises at least one and preferably a plurality of data elements 106. A given data element represents a subset of the data contained in its data ...



Abstract

Data elements from data sources and having a data value set are linked by using hash functions to determine a dimensionally reduced instance signature for each data element based on all data values associated with that data element. This yields a plurality of dimensionally reduced instance signatures of equivalent fixed size such that similarities among the data values in the data value sets across all data elements are maintained among the plurality of instance signatures. Candidate pairs of data elements to link are identified using the plurality of instance signatures in locality sensitive hash functions, and a similarity index is generated for each candidate pair using a pre-determined measure of similarity. Candidate pairs of data elements having a similarity index above a given threshold are linked.
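The candidate identification and thresholding steps in the abstract can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: it assumes the common banding form of locality sensitive hashing, uses the fraction of agreeing signature positions as the pre-determined similarity measure, and works on small made-up signatures. The function names, element names, and parameter values are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands: int, rows: int) -> set:
    """Pairs of elements whose signatures agree on every row of at least one band."""
    pairs = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            if len(sig) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            # Elements with an identical band slice land in the same bucket.
            buckets[tuple(sig[start:start + rows])].append(name)
        for names in buckets.values():
            pairs.update(combinations(sorted(names), 2))
    return pairs


def similarity_index(sig_a, sig_b) -> float:
    """Assumed similarity measure: fraction of agreeing signature positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def link(signatures, bands: int, rows: int, threshold: float) -> list:
    """Score only the candidate pairs and keep those above the threshold."""
    links = []
    for a, b in sorted(candidate_pairs(signatures, bands, rows)):
        score = similarity_index(signatures[a], signatures[b])
        if score > threshold:
            links.append((a, b, score))
    return links


# Hypothetical fixed-size signatures (length 6) for four data elements.
signatures = {
    "db1.city": [3, 7, 1, 9, 4, 2],
    "db2.town": [3, 7, 1, 9, 5, 2],
    "db3.price": [8, 0, 6, 2, 7, 5],
    "db4.place": [3, 7, 0, 1, 6, 8],
}

for a, b, score in link(signatures, bands=3, rows=2, threshold=0.5):
    print(f"{a} <-> {b}: {score:.2f}")
```

With four elements there are six possible pairs, but only the pairs that share a bucket are scored. In this toy data `db3.price` shares no band with any other element and is never compared at all, which is how the approach avoids the exhaustive quadratic comparison.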

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application is a continuation of co-pending U.S. patent application Ser. No. 13/491,724 filed Jun. 8, 2012. The entire disclosure of that application is incorporated herein by reference.

FIELD OF THE INVENTION

[0002] The present invention relates to data management and data linking.

BACKGROUND OF THE INVENTION

[0003] The semantic web is an extension of the world wide web that incorporates semantics into the data or web pages that are accessed and downloaded across the internet. Discovering links between data elements of different data sources is a fundamental problem for the emerging semantic web, as well as traditional data integration systems. These links are the building blocks for searching, querying and other higher level services. Link discovery techniques span syntactic methods, many derived from similarity measures developed by Information Retrieval (IR). These techniques include structural ones, e.g., using foreign key relat...


Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/30
CPC: G06F16/951
Inventors: BORNEA, MIHAELA ANCUTA; DUAN, SONGYUN; FOKOUE-NKOUTCHE, ACHILLE BELLY; HASSANZADEH, OKTIE; KEMENTSIETSIDIS, ANASTASIOS; SRINIVAS, KAVITHA; WARD, MICHAEL J.
Owner: IBM CORP