The invention provides a similar web page duplicate-removing system based on a parallel programming mode, comprises a web page content pre-processing module, a web page eigenvector extracting module,a web page feature fingerprint calculation module, a web page fingerprint on-line duplicate-removing module, a web page fingerprint distributed batch duplicate-removing module and a computing platformbased on specific distribution. The system can complete links of carrying out unified conversion of text content encoding, standardization of document structure, web page noise content abortion, thematic content analysis and identification of web pages, lexical segmentation of continuous text content, and the like on the web pages obtained by crawling of web crawlers, thereby forming eigenvectorswhich can present the web pages. Relative algorithms can be used to obtain web page fingerprints which present web page characteristics aiming at the vector. The system provided by the invention accurately and fast detects fully complete repetition or approximate repetition of the web page contents caused by site mirroring, web document transshipment, and the like on the condition of massive amount of data of Internet and completes corresponding repetition-removing works, thereby enhancing the storage efficiency of search engines and bringing better use experience for the search engines.