System and method for distributed retrieval and deduplication of massive documents

A distributed document technology, applied in the field of information processing, that addresses the low efficiency and heavy computation of existing deduplication techniques for massive document collections. It reduces the amount of calculation needed for document deduplication and relies on words with a large semantic contribution and high word frequency.
CN103577418A | Active | Publication Date: 2014-02-12 | TRS INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: TRS INFORMATION TECH CO LTD
Publication Date: 2014-02-12


Abstract

A system for distributed retrieval and deduplication of massive documents comprises a document pre-processing module, a document feature calculation module, a distributed database building module, a storage module, a distributed retrieval module and a similarity calculation module. The document feature calculation module computes a feature vector for each document according to the importance of each word to the document. The distributed database building module maps each document into different storage partitions according to its feature vector. The distributed retrieval module searches the partitions to which a target document belongs, and the similarity calculation module calculates the similarity between the target document and all documents in those partitions, thereby performing distributed deduplication over the massive document collection. The system and method adopt a distributed design: the massive document collection is scattered into multiple subsets, and the deduplication calculation is carried out in only one or a few of those subsets, which reduces the amount of similarity calculation and improves deduplication efficiency.
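The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not name a weighting scheme, a partitioning rule, or a similarity measure, so the sketch assumes TF-IDF for word importance, a CRC32 hash of the top-weighted terms for choosing partitions, and cosine similarity with a fixed threshold. The names `DedupIndex`, `NUM_PARTITIONS`, `TOP_K`, and the threshold value are hypothetical.

```python
import math
import re
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 8  # hypothetical number of storage partitions
TOP_K = 3           # hypothetical number of feature terms used for routing


def tokenize(text):
    """Stand-in for pre-processing: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(tokens, doc_freq, num_docs):
    """Weight each term by smoothed TF-IDF (an assumed importance measure)."""
    counts = Counter(tokens)
    return {
        term: n * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, n in counts.items()
    }


def partitions_for(weights, top_k=TOP_K):
    """Route a document to the partitions of its top_k highest-weighted terms."""
    top = sorted(weights, key=lambda t: (-weights[t], t))[:top_k]
    return {zlib.crc32(t.encode("utf-8")) % NUM_PARTITIONS for t in top}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DedupIndex:
    """Toy in-memory index: each document is stored in several partitions."""

    def __init__(self, corpus):
        tokenized = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
        self.num_docs = len(tokenized)
        self.doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
        self.vectors = {}
        self.partitions = defaultdict(set)
        for doc_id, toks in tokenized.items():
            vec = tfidf(toks, self.doc_freq, self.num_docs)
            self.vectors[doc_id] = vec
            for p in partitions_for(vec):
                self.partitions[p].add(doc_id)

    def find_duplicates(self, text, threshold=0.8):
        """Compare the target only with documents in its own partitions."""
        vec = tfidf(tokenize(text), self.doc_freq, self.num_docs)
        candidates = set()
        for p in partitions_for(vec):
            candidates |= self.partitions.get(p, set())
        return sorted(d for d in candidates
                      if cosine(vec, self.vectors[d]) >= threshold)


corpus = {
    "a": "big data document deduplication with distributed retrieval",
    "b": "big data document deduplication using distributed retrieval",
    "c": "weather forecast sunny tomorrow",
}
index = DedupIndex(corpus)
print(index.find_duplicates(corpus["a"]))
```

The saving comes from `find_duplicates` scoring only the candidates found in the target's partitions rather than the whole corpus. Two near-duplicates whose top-weighted terms differ can land in disjoint partitions and be missed, which is why the sketch routes each document to several partitions instead of one.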

Description

Technical field

[0001] The invention belongs to the technical field of information processing, and in particular relates to a system and method for distributed retrieval and deduplication of massive documents in the era of big data.

Background technique

[0002] With the advent of the era of big data, information of all kinds is growing exponentially, and every industry and field faces the pressure of collecting, processing, and storing massive amounts of information. Detecting duplicate or similar documents at the source has therefore become a technical difficulty that must be overcome. For example, results with the same or similar content account for about 45% of the results returned by current search engines, so it is necessary to judge which web pages have the same or similar content when collecting search information.

[0003] There are three types of web page deduplication technologies commonly used in the fiel...

Claims
