The invention provides an efficient and accurate web page deduplication system based on cloud computing. It aims to solve the following problem: most of the web pages retrieved by existing search engines are static web pages, and because of widespread reposting and plagiarism, the main content of a large number of these pages is duplicated; for a search engine, the duplicated pages add to the burden of index storage and consume additional retrieval time. The web page deduplication system is designed and implemented on the Hadoop cloud platform in combination with an open source framework. Duplicates are detected and judged in real time after the spider program captures a web page, so the system connects well with the other modules of the search engine. In the massive web page collection stage, the system first preprocesses the web pages and then performs web page similarity detection and discovery, removing duplicate web pages and web pages with high similarity, thereby improving index quality, optimizing retrieval results, and providing users with a good search experience.
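The preprocess, detect, and remove flow described above can be illustrated with a minimal single-process sketch. The abstract does not name the similarity measure or the preprocessing steps, so everything here is an assumption made for illustration: tag stripping with a regular expression, word-level 3-gram shingles, Jaccard similarity, a threshold of 0.8, and the function names `preprocess`, `shingles`, `jaccard`, and `deduplicate`. It does not use Hadoop and is not the claimed implementation.

```python
import re
from html import unescape

# Assumption: a crude tag stripper is enough for illustration. It does not
# remove the contents of <script> or <style> elements.
TAG_RE = re.compile(r"<[^>]+>")


def preprocess(html):
    """Strip tags, decode entities, lowercase, and split into word tokens."""
    text = unescape(TAG_RE.sub(" ", html)).lower()
    return re.findall(r"\w+", text)


def shingles(tokens, k=3):
    """Return the set of k-word shingles (tuples) of a token list."""
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8, k=3):
    """Keep the first page seen from each group of near-duplicates.

    pages: iterable of (url, html) pairs in crawl order.
    A page is dropped when its similarity to any kept page is >= threshold.
    """
    kept = []  # (url, shingle set) for every page retained so far
    for url, html in pages:
        sh = shingles(preprocess(html), k)
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((url, sh))
    return [url for url, _ in kept]


# Hypothetical crawl results; the URLs are placeholders.
pages = [
    ("http://example.com/a",
     "<html><body><p>The quick brown fox jumps over the lazy dog</p></body></html>"),
    ("http://example.com/b",
     "<div>The quick brown fox jumps over the lazy dog today</div>"),
    ("http://example.com/c",
     "<p>Hadoop stores large files across many machines</p>"),
]
unique_urls = deduplicate(pages)
```

In this sketch each incoming page is compared against every page already kept, which is quadratic in the number of pages. A system working at the scale the abstract describes would presumably distribute the work across the cluster and use compact fingerprints in place of full shingle sets, but the abstract does not say how.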