Chinese web page repeated document detection and filtration method based on full stop characteristic word string

A filtering method and feature-word technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problems that existing duplicate web page detection methods struggle to achieve good detection accuracy and good computing performance at the same time, requiring a large amount of calculation or giving poor accuracy. The method is simple, easy to implement, and low in cost, and it enables simple and effective real-time detection and processing.

Inactive Publication Date: 2013-02-27
NANJING UNIV

AI Technical Summary

Problems solved by technology

The second difficulty in duplicate web page detection is scale. Detection involves comparing millions of documents, each of which is relatively long, and the comparisons among this large number of documents must be completed within a limited time. Comparison is a very time-consuming computation. If it takes too long, it cannot meet the practical need of search engines to crawl and update pages regularly and as quickly as possible.
[0004] Existing duplicate web page detection methods struggle to achieve ideal results in both detection accuracy and computing performance. The Shingling detection method is fast but has poor detection accuracy; the Random Projection method also has a large performance advantage but improves accuracy very little; the Imatch method improves the detection accuracy by strength
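
For reference, the Shingling baseline named above can be sketched as follows: each document is reduced to the set of its contiguous w-character substrings (shingles), and two documents are compared by the Jaccard resemblance of those sets. This is a minimal illustration of the general technique, not code from the patent; the function names and the default w = 4 are assumptions made here.

```python
def shingles(text: str, w: int = 4) -> set[str]:
    """Return the set of all contiguous w-character substrings of text."""
    if len(text) < w:
        # A text shorter than one shingle is treated as a single shingle.
        return {text} if text else set()
    return {text[i:i + w] for i in range(len(text) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard resemblance of the two shingle sets, in the range [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)
```

Because every character position contributes a shingle, navigation bars and advertisements count as much as the article body, which is one plausible reason for the accuracy problem described above.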


Examples


Example Embodiment

[0044] The present invention is further explained below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments only illustrate the present invention and do not limit its scope. After reading this disclosure, modifications in equivalent forms made by those skilled in the art fall within the scope defined by the appended claims of this application.

[0045] The main design idea and processing flow of the Chinese duplicate document detection method of the present invention are as follows. To detect duplicate documents among the huge number of web pages that a search engine retrieves in response to a user's search request, we propose a simple and effective feature based on the Chinese period, and use the usage characteristics and statistical characteristics of the Chinese period in web page text to complete the filtering of we...
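
The idea of using the Chinese period as a feature can be illustrated with a minimal sketch. The text available here is truncated and does not define the feature string precisely, so the sketch rests on two assumptions that are not from the patent: a feature string is taken to be the k characters immediately before each period, and a text block is treated as body text when it contains at least one period (navigation bars and advertisements rarely do). The names `period_feature_strings` and `extract_body` are hypothetical.

```python
PERIOD = "。"  # Chinese full stop (U+3002)


def period_feature_strings(text: str, k: int = 5) -> list[str]:
    """Collect the k characters before each Chinese period (assumed definition)."""
    features = []
    for i, ch in enumerate(text):
        if ch == PERIOD:
            left = text[max(0, i - k):i]
            # Do not let a feature string reach back across an earlier period.
            left = left.split(PERIOD)[-1].strip()
            if left:
                features.append(left)
    return features


def extract_body(blocks: list[str], min_periods: int = 1) -> str:
    """Keep only blocks containing at least min_periods periods (assumed rule)."""
    kept = [b.strip() for b in blocks if b.count(PERIOD) >= min_periods]
    return "\n".join(kept)


blocks = [
    "首页 | 新闻 | 体育",
    "今天南京大学发布了一项新的研究成果。该成果用于网页去重。",
    "广告:点击这里",
]
body = extract_body(blocks)
features = period_feature_strings(body, k=5)
```

In this example only the second block survives filtering, and one feature string is extracted for each of its two sentences.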



Abstract

The invention discloses a method for detecting and filtering duplicate documents in Chinese web pages based on full stop characteristic word strings. The method comprises the following steps: extracting the full stop characteristic word strings of the web page to be detected; using these strings to filter out template information from the web page, so that the main body text of the page is retained and extracted; calculating the similarity of the main body text of the web pages, and judging the duplication relationship and inclusion relationship between web pages; and clustering the web pages that have a duplication or inclusion relationship. Aimed at Chinese web pages, particularly Chinese news pages, the invention first finds an effective detection feature that can identify the main text portions of a page and filter out noise unrelated to the main text, such as advertisements. On this basis it addresses similarity measurement and duplicate document detection, and finally the parallel processing problem that arises in large-scale duplicate document detection.
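
The last two steps of the abstract (judging duplication and inclusion relationships, then clustering) can be sketched as below. This is an illustration under stated assumptions, not the patented procedure: each page is assumed to be already reduced to a set of feature strings, duplication is measured with Jaccard similarity, inclusion with the fraction of one page's features found in the other, and clustering uses a union-find structure. The thresholds 0.8 and 0.9 and all function names are hypothetical.

```python
from itertools import combinations


def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set[str], b: set[str]) -> float:
    """Fraction of a's features that also appear in b."""
    return len(a & b) / len(a) if a else 0.0


def cluster_pages(features: dict[str, set[str]],
                  dup_t: float = 0.8, incl_t: float = 0.9) -> list[set[str]]:
    """Group pages linked by a duplication or inclusion relationship."""
    parent = {page: page for page in features}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q in combinations(features, 2):
        fp, fq = features[p], features[q]
        related = (similarity(fp, fq) >= dup_t
                   or containment(fp, fq) >= incl_t
                   or containment(fq, fp) >= incl_t)
        if related:
            parent[find(p)] = find(q)  # merge the two clusters

    groups: dict[str, set[str]] = {}
    for page in features:
        groups.setdefault(find(page), set()).add(page)
    return list(groups.values())


pages = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},  # duplicate of a
    "c": {"x", "y"},       # contained in a and b
    "d": {"q"},            # unrelated
}
clusters = cluster_pages(pages)
```

The all-pairs loop is quadratic in the number of pages, so a sketch like this only suits small inputs; it does not attempt the parallel processing that the abstract mentions for large-scale detection.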

Description

Technical field

[0001] The invention relates to a document detection method, in particular to a method for detecting and filtering duplicate documents in Chinese web pages based on period feature strings.

Background

[0002] The Internet contains a large number of near-duplicate web pages (according to statistics, the duplication rate of Chinese web pages reaches 29%). This causes many problems for search engines: it greatly increases the overhead of web page crawling, index construction, and storage, and it significantly degrades the experience of search engine users, reducing user satisfaction.

[0003] It is relatively easy to detect two identical web pages, but in practice almost no web pages are completely identical. The first major difficulty in duplicate web page detection is that many websites, especially news websites, republish the same report or article. Therefore, the subject content in these webpages is exactly the same, but some...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 黄宜华, 袁春风, 韦永壮, 刘玉龙, 张建
Owner: NANJING UNIV