The invention provides a system for removing duplicate and near-duplicate web pages based on a parallel programming model, comprising a web page content preprocessing module, a web page feature vector extraction module, a web page feature fingerprint calculation module, a web page fingerprint online deduplication module, a web page fingerprint distributed batch deduplication module, and a specific distributed computing platform. The
system processes the web pages fetched by web crawlers through a sequence of steps, including unified conversion of text content encoding, standardization of document structure, removal of web page noise content, analysis and identification of the thematic content of the web pages, and word segmentation of continuous text, thereby forming feature vectors that represent the web pages. Relevant algorithms can then be applied to these vectors to obtain web page fingerprints that represent the characteristics of the web pages. The
system provided by the invention accurately and quickly detects exact or near-duplicate web page content caused by site mirroring, reposting of web documents, and the like across the massive volume of data on the Internet, and removes the duplicates accordingly, thereby improving the storage efficiency of search engines and giving search engine users a better experience.
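
As a non-limiting illustration, the path from segmented text to a feature vector, then to a fingerprint, then to a duplicate check might be sketched as follows. The abstract does not name the fingerprint algorithm, so this sketch assumes a SimHash-style fingerprint over term frequencies. The function names, the 64-bit width, the regular-expression tokenizer standing in for lexical segmentation, and the Hamming-distance threshold of 3 are all assumptions made for the example, not details taken from the invention.

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the abstract does not specify one


def feature_vector(text):
    """Map cleaned page text to a term -> frequency vector."""
    # A simple word regex stands in for real lexical segmentation.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def simhash(features):
    """Fold a weighted feature vector into a 64-bit fingerprint (SimHash-style)."""
    totals = [0] * FINGERPRINT_BITS
    for term, weight in features.items():
        # Take the first 8 bytes of MD5 as a stable 64-bit hash of the term.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        term_hash = int.from_bytes(digest[:8], "big")
        for i in range(FINGERPRINT_BITS):
            # Each term votes +weight or -weight on every bit position.
            totals[i] += weight if (term_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a, fp_b, max_distance=3):
    """Treat two pages as duplicates if their fingerprints differ in few bits."""
    return hamming_distance(fp_a, fp_b) <= max_distance


if __name__ == "__main__":
    original = "Breaking news: the city council approved the new budget today."
    mirrored = "breaking news the city council approved the new budget today"
    fp_original = simhash(feature_vector(original))
    fp_mirrored = simhash(feature_vector(mirrored))
    print(is_near_duplicate(fp_original, fp_mirrored))
```

In this sketch, two copies of a page that differ only in case or punctuation produce the same feature vector and therefore the same fingerprint, which corresponds to the exact-duplicate case such as site mirroring. Pages with small textual differences tend to produce fingerprints that differ in only a few bits, which is why a distance threshold, rather than strict equality, is assumed for the near-duplicate case.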