Document detection method and system

A detection method and document technology, applied in the fields of special data processing applications, instruments, and electrical digital data processing, addresses the problem that the precision and recall of existing methods cannot meet high requirements, improving recall as well as accuracy and efficiency.

Active Publication Date: 2013-09-11
SHENZHEN SHI JI GUANG SU INFORMATION TECH

AI Technical Summary

Problems solved by technology

However, in the process of using each method, it is found that each meth

Examples

Embodiment 1

[0017] This embodiment provides a document detection method. As shown in Figure 1, the method includes the following steps.

[0018] 101. Extract characters from a webpage document to be tested to obtain at least one document feature.

[0019] Specifically, the document features stand in for the webpage document to be tested and are compared with the document features of other webpage documents to determine whether the webpage document to be tested is a near-duplicate of any of them.

[0020] 102. Perform hash calculation on each of the document features to obtain corresponding feature fingerprints.
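
Step 102 can be illustrated with a minimal sketch. The source does not name a hash function or a fingerprint width, so MD5 truncated to 64 bits and the whitespace normalization below are assumptions, and the function names `feature_fingerprint` and `fingerprints` are hypothetical.

```python
import hashlib


def feature_fingerprint(feature: str) -> int:
    """Hash one document feature to a 64-bit integer fingerprint.

    Assumption: MD5 truncated to 8 bytes. The source only says
    "hash calculation" and does not name an algorithm.
    """
    # Collapse runs of whitespace so spacing differences do not change the hash
    # (an assumption; the source does not describe any normalization).
    normalized = " ".join(feature.split())
    digest = hashlib.md5(normalized.encode("utf-8")).digest()
    # Keep the first 8 bytes as an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")


def fingerprints(features: list[str]) -> list[int]:
    """Compute one fingerprint per document feature, in order."""
    return [feature_fingerprint(f) for f in features]
```

Each feature yields exactly one fingerprint, so a document with several features is represented by several independent lookup keys.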

[0021] 103. If no document cluster similar to the webpage document to be tested is found in the fingerprint mapping database according to the feature fingerprints, compare the webpage document to be tested for similarity with the webpage documents from within a specified number of days.

[0022] Specifically, the (K, V) pair of ...
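
The fingerprint lookup at the start of step 103 might be sketched as follows. The layout of the (K, V) pairs is cut off in the text above, so this sketch assumes that a key is a feature fingerprint and a value is a cluster identifier. The in-memory dictionary, the `min_hits` parameter, and the name `find_cluster` are all hypothetical.

```python
from collections import Counter
from typing import Optional


def find_cluster(
    fps: list[int],
    fingerprint_db: dict[int, str],
    min_hits: int = 1,
) -> Optional[str]:
    """Return the cluster matched by the most fingerprints, or None on a miss.

    Assumption: fingerprint_db maps fingerprint -> cluster id.
    min_hits is an illustrative knob, not something the source specifies.
    """
    # Count how many of the document's fingerprints point at each cluster.
    hits = Counter(fingerprint_db[fp] for fp in fps if fp in fingerprint_db)
    if not hits:
        return None  # No similar cluster found: fall back to similarity comparison.
    cluster_id, count = hits.most_common(1)[0]
    return cluster_id if count >= min_hits else None
```

A `None` result corresponds to the case in step 103 where no similar document cluster is found and the similarity comparison is triggered.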

Embodiment 2

[0035] This embodiment provides a document detection method. As shown in Figure 3, the method includes the following steps.

[0036] 301. Extract characters from a webpage document to be tested to obtain at least one document feature.

[0037] Specifically, the document features stand in for the webpage document to be tested and are compared with the document features of other webpage documents to determine whether the webpage document to be tested is a near-duplicate of any of them. At least one document feature can be obtained by the following method.

[0038] Divide the text of the webpage document to be tested into at least one paragraph according to paragraph identifiers; select the N paragraphs that contain the most characters; divide each of the N paragraphs into at least one sentence according to punctuation marks; each paragraph contains a sentence with the la...
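
The part of this procedure that survives in the text (paragraph splitting, selecting the N longest paragraphs, and sentence splitting) can be sketched as below. The sketch stops where the source is cut off and does not guess how a feature is chosen from the sentences. Treating a newline as the paragraph identifier, the punctuation set, and the name `split_top_paragraphs` are assumptions.

```python
import re

# Sentence-ending punctuation, Western and CJK (an assumed set).
SENTENCE_BREAK = re.compile(r"[.!?;。！？；]+")


def split_top_paragraphs(text: str, n: int = 3) -> list[list[str]]:
    """Return the sentences of the n paragraphs with the most characters."""
    # Assumption: a newline acts as the paragraph identifier.
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    # Select the n paragraphs containing the largest number of characters.
    top = sorted(paragraphs, key=len, reverse=True)[:n]
    result = []
    for paragraph in top:
        # Divide each selected paragraph into sentences at punctuation marks.
        sentences = [s.strip() for s in SENTENCE_BREAK.split(paragraph) if s.strip()]
        result.append(sentences)
    return result
```

For HTML input, the paragraph identifier would more likely be a tag such as `<p>` or `<br>`; the newline is used here only to keep the sketch self-contained.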

Embodiment 3

[0099] This embodiment provides a document detection system. As shown in Figure 4, it includes: a feature extraction device 41, which extracts characters from the webpage document to be tested to obtain at least one document feature and performs a hash calculation on each document feature to obtain a corresponding feature fingerprint; and a document comparison device 43, which, when no document cluster similar to the webpage document to be tested is found in the fingerprint mapping database according to the feature fingerprints, compares the webpage document to be tested for similarity with webpage documents from within a specified number of days; when it is determined that, among those webpage documents, a webpage document whose similarity value with the webpage document to be tested is greater than the similarity threshold is included in a document cluster of the document mapping database, the document cluster is a target docu...
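
The fallback performed by the document comparison device 43 might look like the sketch below. The source does not say how the similarity value is computed, so Jaccard similarity over word 3-shingles is a stand-in. The `StoredDoc` type, the default window of 7 days, the default threshold of 0.8, and the `doc_to_cluster` dictionary standing in for the document mapping database are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class StoredDoc:
    doc_id: str
    crawled: date
    text: str


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-word sequences in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity; two empty sets count as dissimilar here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def target_cluster(
    text: str,
    today: date,
    recent_docs: list[StoredDoc],
    doc_to_cluster: dict[str, str],
    days: int = 7,
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the cluster of the most similar recent document above threshold."""
    cutoff = today - timedelta(days=days)
    query = shingles(text)
    best_cluster, best_score = None, threshold
    for doc in recent_docs:
        if doc.crawled < cutoff:
            continue  # Outside the specified number of days.
        score = jaccard(query, shingles(doc.text))
        # The similar document must already belong to a cluster in the mapping.
        if score > best_score and doc.doc_id in doc_to_cluster:
            best_cluster, best_score = doc_to_cluster[doc.doc_id], score
    return best_cluster
```

Restricting the scan to a recent window keeps the pairwise comparison affordable, since it runs only when the fingerprint lookup misses.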

Abstract

An embodiment of the invention provides a document detection method and system, relating to the field of data processing technology, which solve the problem that conventional near-duplicate document detection methods cannot meet higher requirements for precision and recall. The embodiment combines multi-feature fingerprint queries with document similarity comparison. Multi-feature fingerprints accurately reflect the features that distinguish the webpage document to be detected from other webpage documents, and matching records can be retrieved quickly from the correspondence between feature fingerprints and near-duplicate documents in an existing database, which improves the accuracy and efficiency of near-duplicate document detection. Document similarity comparison covers the case where a webpage document to be detected is in fact a near-duplicate but cannot be found by the multi-feature fingerprint query because the database is incomplete, which improves the recall of near-duplicate document detection.

Description

Technical field

[0001] The invention relates to the technical field of Internet data processing, in particular to a document detection method and system.

Background

[0002] Near-duplicate documents usually refer to two or more webpage documents on the Internet that differ only slightly, for example in counters, timestamps, a small number of words, or a small number of sentences. In addition, webpage documents on the same theme, that is, about the same event, are also near-duplicate documents: even if their words, sentences, paragraphs, and lengths differ greatly, from the perspective of network users they express similar meaning.

[0003] According to statistics, duplicate webpage documents account for about 30% to 45% of all documents on the Internet, and most of them are generated by mirroring. Detecting near-duplicate documents in a huge number of web do...

Application Information

IPC(8): G06F17/30
Inventors: 王炫聪, 孙甲慧, 陈锡彬, 李翔, 黄斌强
Owner: SHENZHEN SHI JI GUANG SU INFORMATION TECH