Adjacent sorting repetition-reducing method based on Map-Reduce and segmentation

A word-segmentation and adjacent-sorting technology, applied in fields such as instruments, computing, and electrical digital data processing, that addresses the inability to process massive amounts of information efficiently.

Active Publication Date: 2013-03-13
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

[0007] Because most of the information comes from the Internet, where the volume of data is enormous, the existing stand-alone processing framework can no longer process massive amounts of information efficiently.

Examples

Embodiment Construction

[0038] The present invention is further described below with reference to the accompanying drawings and a specific embodiment.

[0039] In the data deduplication method, the data set to be deduplicated is called a record set, and each record in the record set contains multiple fields. In general, the deduplication method compares records pair by pair and uses the similarity of each pair to decide whether the two records are duplicates. The implementation has three levels: the top level is the deduplication framework, the middle level is the method for judging whether two records are the same, and the bottom level is the matching of fields between records, on which the similarity between records depends. Every pairwise similarity comparison involves all three levels. This method focuses on the two parts of the deduplication method framework and the fiel...
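The three levels described in [0039] can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the names `field_similarity`, `record_similarity`, and `find_duplicates`, the use of `difflib.SequenceMatcher` as the field matcher, the averaging of field scores, and the 0.85 threshold are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a: str, b: str) -> float:
    """Bottom level: similarity of two field values in [0, 1].

    SequenceMatcher is only a stand-in matcher for this sketch.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict) -> float:
    """Middle level: average the similarities of the fields both records share."""
    fields = sorted(set(r1) & set(r2))
    if not fields:
        return 0.0
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)


def find_duplicates(records: list, threshold: float = 0.85) -> list:
    """Top level: compare records pair by pair and report likely duplicates.

    The threshold is an assumed value, not one taken from the patent.
    """
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if record_similarity(r1, r2) >= threshold
    ]


records = [
    {"name": "Zhejiang University", "city": "Hangzhou"},
    {"name": "zhejiang university", "city": "Hangzhou"},
    {"name": "Peking University", "city": "Beijing"},
]
print(find_duplicates(records))
```

Comparing every pair costs a number of comparisons quadratic in the size of the record set, which is why the framework level matters for large inputs.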

Abstract

The present invention discloses an adjacent sorting repetition-reducing (deduplication) method based on Map-Reduce and word segmentation. The method applies the SNM method under the Map-Reduce distributed framework of Hadoop to address the large amount of duplicate data produced when information is gathered with information extraction technology. The data are processed in a distributed manner, and the similarity between records is calculated with a field matching method to judge whether the records are duplicates, which increases the overall efficiency of the deduplication operation.
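The combination of SNM with a map, shuffle, and reduce pipeline can be illustrated with a single-process Python simulation. This is a rough sketch and does not use the Hadoop API. The excerpt does not show how the patent partitions or keys the records, so the blocking function, the sort key, the window size, and the similarity test below are all assumptions, and every function name is hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def map_phase(records, key_fn, block_fn):
    """Map: emit (block, (sort_key, record_id)) for each record."""
    for rec_id, rec in enumerate(records):
        key = key_fn(rec)
        yield block_fn(key), (key, rec_id)


def shuffle(pairs):
    """Shuffle: group mapper output by block, as the framework would."""
    groups = defaultdict(list)
    for block, value in pairs:
        groups[block].append(value)
    return groups


def reduce_phase(values, records, similar, window):
    """Reduce: sort one block by key, then compare each record only with
    the window - 1 records that follow it in sorted order."""
    ordered = sorted(values)
    for i, (_, id_a) in enumerate(ordered):
        for _, id_b in ordered[i + 1 : i + window]:
            if similar(records[id_a], records[id_b]):
                yield tuple(sorted((id_a, id_b)))


def snm_dedup(records, key_fn, block_fn, similar, window=3):
    """Run the simulated pipeline and return sorted duplicate id pairs."""
    groups = shuffle(map_phase(records, key_fn, block_fn))
    found = set()
    for values in groups.values():
        found.update(reduce_phase(values, records, similar, window))
    return sorted(found)


def similar(a: str, b: str) -> bool:
    # Stand-in record comparison; the 0.9 threshold is an assumed value.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.9


records = [
    "Map-Reduce deduplication",
    "map-reduce deduplication",
    "Sorted neighborhood method",
    "Sorted neighbourhood method",
    "Hadoop cluster setup",
]
pairs = snm_dedup(
    records,
    key_fn=lambda r: r.lower(),     # assumed sort key: lower-cased text
    block_fn=lambda key: key[:1],   # assumed partitioning: first character
    similar=similar,
)
print(pairs)
```

Because each record is compared with only a fixed number of neighbours, the work per block is roughly linear after sorting. One known trade-off of this kind of partitioning is that near-duplicates whose keys land in different blocks are never compared.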

Description

Technical field

[0001] The present invention relates to an efficient data deduplication method based on the Map-Reduce distributed framework. It uses a similarity matching method based on word segmentation, sorting, and edit distance as the field similarity method, and the adjacent sorting method (SNM) as the record deduplication method, which can effectively improve the efficiency of computerized deduplication.

Background

[0002] With its rapid development, the Internet has become the most popular medium for publishing information and has grown into a global, huge, distributed, and shared information space. It has rapidly emerged as an important means of exchanging and disseminating information, and abundant data resources have appeared on the Web. The Internet has also become an important way for people to obtain information, but with the explosive growth of the Internet...
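One plausible reading of "word segmentation sorting edit distance" is: split each field into words, sort the words, re-join them, and compute the edit distance between the two normalized strings. The Python sketch below follows that reading. It is an assumption about the technique rather than the patent's exact definition, the helper names are hypothetical, and whitespace splitting is only a placeholder for a real word segmenter (Chinese text in particular would need one).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def segment_and_sort(field: str) -> str:
    """Split a field into words, sort them, and re-join.

    Whitespace splitting is a placeholder for a real segmenter.
    """
    return " ".join(sorted(field.lower().split()))


def sorted_token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance
    between the segmented and sorted forms of the two fields."""
    x, y = segment_and_sort(a), segment_and_sort(b)
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(x, y) / longest


print(sorted_token_similarity("Zhejiang University", "University Zhejiang"))
print(sorted_token_similarity("data cleaning", "cleaning dta"))
```

Sorting the words before measuring edit distance makes the score insensitive to word order, so reordered fields are not penalized the way they would be by plain edit distance.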

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 尹建伟, 苏伟兵, 吴朝晖, 邓水光, 李莹, 吴健
Owner: ZHEJIANG UNIV