Adjacent sorting repetition-reducing method based on Map-Reduce and segmentation

A technique combining word segmentation with adjacent sorting, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses the inability to process massive amounts of information efficiently and improves the efficiency of deduplication.

Active Publication Date: 2011-08-24
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

[0007] And because the source of information largely comes from the Internet, and the information on the Internet is ver


Examples

Embodiment Construction

[0038] The present invention is further described below in conjunction with the drawings and specific embodiments.

[0039] In data deduplication, the data set to be deduplicated is called a record set, and each record in the record set contains multiple fields. The general approach is to compare records in pairs and use their similarity to judge whether they are duplicates. An implementation has three levels. At the top is the deduplication framework. In the middle is the method that decides whether two records are the same. At the bottom is field matching, because the similarity between records depends on how well their fields match. Every pairwise comparison of records involves all three levels. This method focuses on two parts: the deduplication framework and the fiel...
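The three levels described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the window size of 3, the threshold of 0.8, sorting on the whole record, and the unweighted mean over fields are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Record = Tuple[str, ...]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Bottom level: similarity of two field values, in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def record_similarity(r1: Record, r2: Record) -> float:
    """Middle level: unweighted mean of field similarities (an assumption).

    Assumes both records are non-empty and have the same number of fields.
    """
    return sum(field_similarity(x, y) for x, y in zip(r1, r2)) / len(r1)


def snm_duplicates(records: Sequence[Record],
                   window: int = 3,
                   threshold: float = 0.8) -> List[Tuple[Record, Record]]:
    """Top level: sort, then compare each record only with its window neighbours."""
    ordered = sorted(records)  # sort key is the whole tuple (an assumption)
    pairs = []
    for i, rec in enumerate(ordered):
        # A window of size w covers the record plus the next w - 1 records.
        for other in ordered[i + 1 : i + window]:
            if record_similarity(rec, other) >= threshold:
                pairs.append((rec, other))
    return pairs


records = [("john smith", "hangzhou"),
           ("mary jones", "beijing"),
           ("jon smith", "hangzhou")]
print(snm_duplicates(records))
```

Because only neighbours inside the window are compared, the number of comparisons grows roughly with the number of records times the window size rather than with the square of the number of records.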

Abstract

The present invention discloses an adjacent sorting repetition-reducing (deduplication) method based on Map-Reduce and word segmentation. Building on the SNM method under Hadoop's Map-Reduce distributed framework, the method addresses the large amount of duplicate data that arises when information is extracted with information extraction technology. The data are processed in a distributed way, and the similarity between records is calculated by a field matching method to judge whether the records are duplicates, thereby increasing the overall efficiency of deduplication.
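The distributed arrangement described in the abstract can be illustrated with a small in-process simulation of the map, shuffle, and reduce phases. This sketch does not use the Hadoop API. The function names, the blocking key (first letter of the first field), the sorting inside each group, and the stand-in duplicate test (equality after trimming and lowercasing) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Record = Tuple[str, ...]
Pair = Tuple[Record, Record]


def map_phase(records: Sequence[Record],
              key_fn: Callable[[Record], str]) -> List[Tuple[str, Record]]:
    """Map: emit one (key, record) pair per input record."""
    return [(key_fn(r), r) for r in records]


def shuffle(pairs: List[Tuple[str, Record]]) -> Dict[str, List[Record]]:
    """Shuffle: group records by key and sort each group (assumed here)."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return {k: sorted(v) for k, v in groups.items()}


def reduce_phase(group: List[Record],
                 is_duplicate: Callable[[Record, Record], bool],
                 window: int = 3) -> List[Pair]:
    """Reduce: slide a window over one sorted group and report duplicate pairs."""
    found = []
    for i, rec in enumerate(group):
        for other in group[i + 1 : i + window]:
            if is_duplicate(rec, other):
                found.append((rec, other))
    return found


def run_job(records: Sequence[Record],
            key_fn: Callable[[Record], str],
            is_duplicate: Callable[[Record, Record], bool],
            window: int = 3) -> List[Pair]:
    """Run the three phases in sequence; each group could go to its own reducer."""
    out: List[Pair] = []
    for _, group in sorted(shuffle(map_phase(records, key_fn)).items()):
        out.extend(reduce_phase(group, is_duplicate, window))
    return out


def normalized_equal(a: Record, b: Record) -> bool:
    """Stand-in duplicate test: equal after trimming and lowercasing each field."""
    return tuple(f.strip().lower() for f in a) == tuple(f.strip().lower() for f in b)


records = [("Alice", "NY"), ("alice", "ny"), ("Bob", "LA"), ("ALICE ", "NY")]
duplicates = run_job(records, key_fn=lambda r: r[0][:1].lower(),
                     is_duplicate=normalized_equal)
print(duplicates)
```

One consequence of this design is that records with different keys are never compared, so the choice of key trades recall against the amount of work each reducer does.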

Description

Technical field

[0001] The present invention relates to an efficient data deduplication method based on the Map-Reduce distributed framework. The method uses similarity matching based on word segmentation and edit distance as the field similarity method, and the adjacent sorting method (SNM) as the record deduplication method, which can effectively improve the running efficiency of deduplication on a computer.

Background technique

[0002] With the rapid development of the Internet, it has become the most popular medium for distributing information and has developed into a global, huge, distributed and shared information space. The network has also rapidly emerged as an important means of exchange and information dissemination, abundant data resources have appeared on the Web, and the Internet has become an important way for people to obtain information. However, with the explosive growth of the Internet, peo...
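Paragraph [0001] names word segmentation and edit distance as the basis of the field similarity method but, in the text available here, does not say how the two are combined. The sketch below shows one plausible combination, offered only as an assumption: segment both fields into words, match each word of the longer field to its closest word in the other by normalized edit distance, and average the results. The `segment` function is a stand-in that splits on non-word characters; Chinese text would need a real word segmenter, which the source does not name.

```python
import re
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def segment(text: str) -> List[str]:
    """Stand-in segmenter: lowercase, then split on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def token_similarity(a: str, b: str) -> float:
    """Edit distance normalized by the longer token, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def segmented_field_similarity(x: str, y: str) -> float:
    """Average best-match token similarity (hypothetical combination rule)."""
    tx, ty = segment(x), segment(y)
    if not tx and not ty:
        return 1.0
    if not tx or not ty:
        return 0.0
    if len(tx) < len(ty):
        tx, ty = ty, tx  # iterate over the longer list so extra words lower the score
    return sum(max(token_similarity(t, u) for u in ty) for t in tx) / len(tx)


print(segmented_field_similarity("Zhejiang University", "university, zhejiang"))
print(segmented_field_similarity("zhejiang univ", "zhejiang university"))
```

Matching at the word level makes the score insensitive to word order and punctuation, which a single edit distance over the whole field would penalize.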

Application Information

IPC(8): G06F17/30
Inventor: 尹建伟, 苏伟兵, 吴朝晖, 邓水光, 李莹, 吴健
Owner: ZHEJIANG UNIV