The invention discloses a Chinese web page text deduplication system and a Chinese web page text deduplication method. The deduplication system comprises an index server and a search server, wherein the index server comprises a web page text preprocessing module, a combined characteristic sentence extraction module and a digital signature calculation module, and the search server comprises a web page text capture module and a hash query module. The deduplication method comprises the following steps: normalizing a web page text; extracting a combined characteristic sentence of the text; calculating a digital signature of the combined characteristic sentence; and comparing the digital signature with the existing digital signatures in a hash table to judge whether the digital signature is duplicated. With the deduplication system and the deduplication method, a search engine can quickly and accurately identify and remove a large number of Chinese web pages with duplicated content on the Internet. When the search engine captures a new web page, the digital signature of the new web page is calculated and compared with the digital signatures of the web pages already stored by the search engine to judge whether the new web page is duplicated; a duplicated web page is not stored, so that waste of storage space is avoided and the search accuracy of the search engine is improved at the same time.
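
The four steps of the deduplication method can be illustrated with a minimal Python sketch. The abstract does not specify how the text is normalized, how the combined characteristic sentence is chosen, or which signature algorithm is used, so everything concrete here is an assumption made for illustration: normalization strips HTML tags and whitespace and applies Unicode NFKC, the combined characteristic sentence is taken to be the `top_n` longest sentences joined in their original order, a SHA-256 digest stands in for the digital signature, and a Python `set` stands in for the hash table. The names `normalize`, `characteristic_sentence`, `signature` and `SignatureIndex` are hypothetical.

```python
import hashlib
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
# Sentence delimiters after NFKC normalization (full-width !, ?, ; become ASCII).
SENTENCE_END_RE = re.compile(r"[。!?;]")


def normalize(html: str) -> str:
    """Step 1 (assumed form): drop tags, unify character widths, drop whitespace."""
    text = TAG_RE.sub("", html)
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", "", text)


def characteristic_sentence(text: str, top_n: int = 3) -> str:
    """Step 2 (assumed heuristic): join the top_n longest sentences in page order."""
    sentences = [s for s in SENTENCE_END_RE.split(text) if s]
    longest = sorted(range(len(sentences)),
                     key=lambda i: len(sentences[i]), reverse=True)[:top_n]
    return "".join(sentences[i] for i in sorted(longest))


def signature(sentence: str) -> str:
    """Step 3 (assumed algorithm): SHA-256 digest used as the digital signature."""
    return hashlib.sha256(sentence.encode("utf-8")).hexdigest()


class SignatureIndex:
    """Step 4: a set plays the role of the hash table of stored signatures."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, html: str) -> bool:
        """Return True and record the page if it is new; False if duplicated."""
        sig = signature(characteristic_sentence(normalize(html)))
        if sig in self._seen:
            return False
        self._seen.add(sig)
        return True
```

In this sketch, a newly captured page is stored only when `add_if_new` returns `True`. Two pages that differ only in markup or whitespace yield the same signature and the second one is rejected. An exact digest of this kind detects only pages whose characteristic sentences match exactly after normalization.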