System and method for distributed retrieval and deduplication of massive documents

A distributed document deduplication technology, applied in the field of information processing, that addresses the low efficiency and large amount of calculation of massive-document deduplication. It reduces the amount of calculation in document deduplication and relies on words with high frequency and large semantic contribution.

Active Publication Date: 2014-02-12
TRS INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to provide a system and method for distributed retrieval and deduplication of massive documents. It uses the idea of a distributed system to disperse the massive document library into dozens of subsets or more, so that deduplication can be performed within one or several subsets. This addresses the low efficiency and large amount of calculation of massive-document deduplication in the era of big data.



Examples


Embodiment Construction

[0020] To adapt to the big data era and solve the problems of the existing technology, the system and method for distributed retrieval and deduplication of massive documents provided by the embodiment of the present invention borrow the idea of a distributed system: fingerprint hash values are used to distribute the massive documents evenly across several subset storage areas. Document similarity calculation then runs on only one or several subsets, which greatly reduces the amount of computation and meets the efficiency requirements of massive-document deduplication.
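The partitioning idea in paragraph [0020] can be sketched as follows. This is a minimal illustration, not the patented method: the excerpt does not say how the fingerprint is computed, so the sketch assumes each document is fingerprinted by its most frequent words and that each such word is hashed with MD5 to pick a subset. The names `NUM_PARTITIONS`, `top_terms`, `partition_of`, `partitions_for`, and `build_index` are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 16  # hypothetical number of subset storage areas


def top_terms(text, k=3):
    """Return the k most frequent words (assumed stand-in for fingerprint terms)."""
    counts = Counter(text.lower().split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def partition_of(term, num_partitions=NUM_PARTITIONS):
    """Hash one term to a subset number; MD5 is an assumption, any stable hash works."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def partitions_for(text, k=3):
    """A document belongs to one subset per fingerprint term (one or several subsets)."""
    return sorted({partition_of(term) for term in top_terms(text, k)})


def build_index(docs):
    """Map subset number -> set of document ids stored in that subset."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for p in partitions_for(text):
            index[p].add(doc_id)
    return index
```

Under these assumptions, two near-duplicate documents that share their most frequent words land in the same subsets, so a later similarity check only needs to look inside those subsets rather than across the whole library.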

[0021] To make the purpose, technical method, and advantages of the embodiments of the present invention clearer, the technical solutions provided by the embodiments are described in detail below with reference to the accompanying drawings.

[0022] Figure 1 shows a module diagram of the distributed retrieval and ranking syst...


Abstract

A system for distributed retrieval and deduplication of massive documents comprises a document preprocessing module, a document feature calculation module, a distributed database building module, a storage module, a distributed retrieval module, and a similarity calculation module. The document feature calculation module computes a document's feature vector according to the importance of each word to the document. The distributed database building module maps each document into different storage subregions according to its feature vector. The distributed retrieval module finds the subregions to which a target document belongs, and the similarity calculation module computes the similarity between the target document and all documents in those subregions, thereby achieving distributed deduplication of massive documents. Following the idea of a distributed system, the system and method scatter the massive documents into multiple subsets and perform the deduplication calculation in only one or a few subsets, which reduces the amount of similarity calculation and improves deduplication efficiency.
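The flow described in the abstract (feature vector, mapping to subregions, retrieval of candidate subregions, similarity calculation) can be sketched end to end. This is an illustration under stated assumptions rather than the patented implementation: word importance is approximated with a smoothed TF-IDF weight, subregions are chosen by hashing the highest-weighted words with CRC32, and similarity is cosine similarity. The class `DedupIndex`, its parameters `num_partitions`, `top_k`, and `threshold`, and the helper names are all hypothetical.

```python
import math
import zlib
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenizer standing in for the preprocessing module."""
    return text.lower().split()


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class DedupIndex:
    def __init__(self, docs, num_partitions=8, top_k=3):
        self.num_partitions = num_partitions
        self.top_k = top_k
        tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
        self.num_docs = len(docs)
        # Document frequency: in how many documents each word appears.
        self.doc_freq = Counter(
            word for tokens in tokenized.values() for word in set(tokens)
        )
        self.vectors = {}
        self.partitions = defaultdict(set)
        for doc_id, tokens in tokenized.items():
            vec = self._vector(tokens)
            self.vectors[doc_id] = vec
            for p in self._partitions(vec):
                self.partitions[p].add(doc_id)

    def _vector(self, tokens):
        """Smoothed TF-IDF weights (an assumed measure of word importance)."""
        tf = Counter(tokens)
        return {
            word: count
            * (math.log((1 + self.num_docs) / (1 + self.doc_freq.get(word, 0))) + 1)
            for word, count in tf.items()
        }

    def _partitions(self, vec):
        """Subregions chosen by hashing the top_k highest-weighted words."""
        top = sorted(vec, key=lambda word: (-vec[word], word))[: self.top_k]
        return {zlib.crc32(w.encode("utf-8")) % self.num_partitions for w in top}

    def find_duplicates(self, text, threshold=0.8):
        """Compare the target only against documents in its own subregions."""
        vec = self._vector(tokenize(text))
        candidates = set()
        for p in self._partitions(vec):
            candidates |= self.partitions.get(p, set())
        return sorted(
            doc_id
            for doc_id in candidates
            if cosine(vec, self.vectors[doc_id]) >= threshold
        )
```

The point of the design is that `find_duplicates` never scans the full collection: the candidate set is the union of a few subregions, so the number of cosine computations depends on subregion size rather than on the size of the document library.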

Description

Technical field

[0001] The invention belongs to the technical field of information processing, and in particular relates to a system and method for distributed retrieval and deduplication of massive documents in the era of big data.

Background technique

[0002] With the advent of the era of big data, all kinds of information are increasing exponentially, and all industries and fields face the pressure of collecting, processing, and storing massive amounts of information. Checking for duplicate or similar documents at the source has therefore become a technical difficulty that must be overcome. For example, results with the same or similar content account for about 45% of the results returned by current search engines, so it is necessary to judge which web pages have the same or similar content when collecting search information.

[0003] There are three types of web page deduplication technologies commonly used in the fiel...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/182, G06F16/93
Inventors: 王洪俊, 肖诗斌, 施水才
Owner: TRS INFORMATION TECH CO LTD