System and method for distributed retrieval and deduplication of massive documents
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- TRS INFORMATION TECH CO LTD
- Publication Date
- 2014-02-12
Abstract
Description
Technical field
[0001] The invention belongs to the technical field of information processing, and in particular relates to a system and method for distributed retrieval and deduplication of massive documents in the era of big data.
Background art
[0002] With the advent of the era of big data, information of all kinds is growing exponentially, and every industry and field faces the pressure of collecting, processing, and storing massive amounts of information. It is therefore necessary to overcome the technical difficulty of detecting duplicate or similar documents at the source. For example, results with the same or similar content account for about 45% of the results returned by current search engines, so web pages with the same or similar content must be identified when search information is collected.
[0003] There are three types of web page deduplication technologies commonly used in the fiel...