
Block-based Web record linkage system and method

A blocking-based record linkage technique, applicable to data processing and network data retrieval, that addresses time-consuming blocking, the need for large amounts of sample data, and very large data sets, with the aim of improving recall and efficiency.

Active Publication Date: 2017-05-31
XUZHOU NORMAL UNIVERSITY

AI Technical Summary

Problems solved by technology

The three existing approaches have the following disadvantages: formulating rules requires domain knowledge; training classifiers requires a large amount of sample data; and the weight parameters of each attribute need careful tuning.
The data set is very large, and blocking it is itself time-consuming, which raises the question of how to perform blocking in parallel.



Examples


Embodiment Construction

[0036] The present invention will be described in further detail below in conjunction with the accompanying drawings and with reference to the data. It should be understood that the embodiments are only for illustrating the present invention and not limiting the scope of the present invention in any way.

[0037] Web data is huge. Even within a single domain, such as books, hotels, or flights, the volume of information is massive, and many records describe the same entity. The traditional approach compares these records pairwise to find those that describe the same entity. However, given the huge scale of Web records, a fast matching method is needed for record linkage to be feasible.
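To make the scale argument concrete, the following is a minimal arithmetic sketch, not taken from the patent. It counts the comparisons needed for exhaustive pairwise matching and for matching restricted to blocks. The record count and block sizes are illustrative assumptions.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered record pairs among n records."""
    return comb(n, 2)


def blocked_comparisons(block_sizes) -> int:
    """Comparisons needed when records are only compared within their block."""
    return sum(comb(size, 2) for size in block_sizes)


# Assumed workload: one million records (illustrative, not from the source).
exhaustive = pairwise_comparisons(1_000_000)

# Same records split into 1000 equal blocks of 1000 records each.
balanced = blocked_comparisons([1000] * 1000)

# Same records, but one oversized block dominates the work.
skewed = blocked_comparisons([500_500] + [500] * 999)

print(exhaustive)  # 499999500000
print(balanced)    # 499500000
print(skewed > balanced)  # True
```

Under these assumptions, equal-sized blocks cut the work by roughly a factor of 1000, while a single oversized block brings much of the quadratic cost back. That is one plausible reason to keep block sizes even.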

[0038] As shown in Figure 1, the block-based Web record linkage system disclosed in the present invention includes a Web crawler, a Sample database, a Web record database, a block attribute analysis module, a blocking module, a block balance mod...


Abstract

The invention discloses a block-based Web record linkage system comprising a Web crawler, a Sample database, a Web record database, a block attribute analysis module, a blocking module, a block balancing module, a pairwise matching module, a matching determination module, and a record linkage result set. In the block-based Web record linkage method, data from various data sources are quickly partitioned into blocks using the MapReduce model, and records are compared only within each block, which greatly improves record matching efficiency. On this basis, block sizes are balanced, which further improves matching efficiency. The recall of record linkage is also improved by blocking the data set from multiple angles with multiple blocking functions.
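The blocking idea in the abstract can be sketched in plain Python as a map step, a grouping step, and a reduce step. This is a minimal single-process illustration under stated assumptions, not the patent's implementation. The sample records, the two blocking functions (`key_title_prefix`, `key_author_prefix`), and the prefix lengths are all hypothetical, and no real MapReduce framework is used.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical book records: (record_id, title, author).
RECORDS = [
    (1, "Data Mining Concepts", "Han"),
    (2, "data mining: concepts", "Han J."),
    (3, "Database Systems", "Ullman"),
    (4, "Mining of Massive Datasets", "Ullman"),
]


def key_title_prefix(rec):
    """Hypothetical blocking function: first 4 characters of the title."""
    return "t:" + rec[1].lower()[:4]


def key_author_prefix(rec):
    """Hypothetical blocking function: first 3 characters of the author."""
    return "a:" + rec[2].lower()[:3]


def map_phase(records, blocking_functions):
    """Emit one (block_key, record) pair per record per blocking function."""
    for rec in records:
        for fn in blocking_functions:
            yield fn(rec), rec


def shuffle(pairs):
    """Group records by block key, as a MapReduce shuffle would."""
    blocks = defaultdict(list)
    for key, rec in pairs:
        blocks[key].append(rec)
    return blocks


def reduce_phase(blocks):
    """Generate candidate pairs only among records sharing a block."""
    candidates = set()
    for members in blocks.values():
        for a, b in combinations(members, 2):
            candidates.add((min(a[0], b[0]), max(a[0], b[0])))
    return candidates


single = reduce_phase(shuffle(map_phase(RECORDS, [key_title_prefix])))
multi = reduce_phase(
    shuffle(map_phase(RECORDS, [key_title_prefix, key_author_prefix]))
)

print(sorted(single))  # [(1, 2), (1, 3), (2, 3)]
print(sorted(multi))   # [(1, 2), (1, 3), (2, 3), (3, 4)]
```

In this toy data, the title-only key never puts records 3 and 4 in the same block, so that pair is never compared. Adding the author key recovers it, which illustrates how blocking from several angles can raise recall at the cost of some extra comparisons. The candidate pairs would then go to a pairwise matching step, which is not shown here.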

Description

Technical field

[0001] The invention relates to the technical field of Web record linkage, in particular to a block-based system and method for Web record linkage.

Background

[0002] The era of big data has arrived, and the scale of data, the speed at which it is updated, and the range it covers are unprecedented. Organizing and analyzing these data so as to maximize their value is an extremely challenging research problem. However, since these data come from different data sources on the Web, the value of the same attribute for the same entity often differs because of writing errors, multiple naming conventions, and other reasons. The purpose of record linkage is to identify which records represent the same entity.

[0003] Traditional record linkage methods are mainly aimed at millions of records from dozens or hundreds of data sources, while in a big data environment, the available data sources may number in the millions, of which a considerable number of data sou...


Application Information

IPC(8): G06F17/30
CPC: G06F16/245, G06F16/951
Inventor 姜芳艽
Owner XUZHOU NORMAL UNIVERSITY