The invention discloses a method for deduplicating similar Chinese web pages based on micro-features, in order to solve the problem of automatically detecting Chinese web pages with similar content. The Chinese near-duplicate
web page deduplication method, which considers both the syntactic information and the semantic information of web pages, comprises the following steps: first, building a term co-occurrence graph from the effective information extracted from the web page; second, extracting document feature vectors, which comprise keyword terms and keyword position information; finally, building an inverted index file of document keywords by making full use of a retrieval system and classification information, and completing the retrieval matching of document feature vectors against the
inverted index file, thereby detecting similar web pages. The method effectively reduces the harmful effect of noise information on the accuracy of the algorithm, takes both the content and the structure information of the web page text into account, and makes full use of the advantages of a retrieval and classification system. It achieves a deduplication accuracy above 90 percent and an average recall above 80 percent, and is especially suitable for large-scale web page deduplication.
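
The three steps described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. It assumes that each page has already been stripped of noise and segmented into a list of terms, and every name and parameter in it (`cooccurrence_graph`, `feature_vector`, `find_duplicates`, `window`, `top_k`, `threshold`) is hypothetical. Ranking keywords by weighted degree in the graph and comparing pages by Jaccard overlap of their keywords are stand-ins chosen for brevity; the abstract does not specify either, and the sketch omits the classification information entirely.

```python
from collections import Counter, defaultdict


def cooccurrence_graph(terms, window=3):
    """Undirected graph; an edge weight counts how often two distinct
    terms appear within `window` positions of each other."""
    graph = defaultdict(Counter)
    for i, a in enumerate(terms):
        for j in range(i + 1, min(i + window, len(terms))):
            b = terms[j]
            if a != b:
                graph[a][b] += 1
                graph[b][a] += 1
    return graph


def feature_vector(terms, window=3, top_k=3):
    """Return [(first_position, keyword), ...] for the top_k terms,
    ranked by weighted degree (an assumed ranking criterion)."""
    graph = cooccurrence_graph(terms, window)
    first_pos = {}
    for pos, term in enumerate(terms):
        first_pos.setdefault(term, pos)
    ranked = sorted(
        graph, key=lambda t: (-sum(graph[t].values()), first_pos[t])
    )
    return sorted((first_pos[t], t) for t in ranked[:top_k])


def build_index(vectors):
    """Inverted index: keyword -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, vec in vectors.items():
        for _, term in vec:
            index[term].add(doc_id)
    return index


def find_duplicates(docs, threshold=0.8):
    """Return the set of (id, id) pairs whose keyword sets overlap enough.
    Only documents sharing at least one keyword are ever compared."""
    vectors = {doc_id: feature_vector(terms) for doc_id, terms in docs.items()}
    index = build_index(vectors)
    pairs = set()
    for doc_id, vec in vectors.items():
        keywords = {t for _, t in vec}
        candidates = set().union(*(index[t] for t in keywords)) - {doc_id}
        for other in candidates:
            other_keywords = {t for _, t in vectors[other]}
            jaccard = len(keywords & other_keywords) / len(keywords | other_keywords)
            if jaccard >= threshold:
                pairs.add(tuple(sorted((doc_id, other))))
    return pairs
```

The inverted index is what keeps this approach practical at scale: a page is compared only with pages that share at least one of its keywords, rather than with the whole collection. The keyword positions are stored in the feature vector but are not used by the Jaccard comparison in this sketch.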