Document detection method and system

A document detection method, classified under special data processing applications, instruments, electrical digital data processing, etc. It addresses the problem that the precision and recall of existing methods cannot meet high requirements, and it improves recall as well as accuracy and efficiency.

Active Publication Date: 2018-04-27
SHENZHEN SHI JI GUANG SU INFORMATION TECH
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

In practice, none of the existing near-duplicate document detection methods meets higher requirements for precision and recall.

Examples

Embodiment 1

[0017] This embodiment provides a document detection method. As shown in Figure 1, the method includes the following steps.

[0018] 101. Extract characters from a webpage document to be tested to obtain at least one document feature.

[0019] Specifically, the document features stand in for the webpage document to be tested and are compared with the document features of other webpage documents, so as to determine whether the webpage document to be tested is a near-duplicate of any of them.

[0020] 102. Perform hash calculation on each of the document features to obtain corresponding feature fingerprints.

[0021] 103. If no document cluster similar to the webpage document to be tested is found in the fingerprint mapping database according to the feature fingerprints, perform a similarity comparison between the webpage document to be tested and the webpage documents from within a specified number of days.

[0022] Specifically, the (K, V) pair of ...
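
Steps 101 to 103 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not name a hash function, a database layout, a matching rule, or a similarity measure. MD5, an in-memory dict standing in for the fingerprint mapping database, "any matching fingerprint counts as a hit", and Jaccard similarity over character 3-grams with a threshold of 0.8 are all assumptions, and every function name here is hypothetical.

```python
import hashlib


def fingerprint(feature: str) -> str:
    """Hash one document feature into a fingerprint (MD5 is an assumed choice)."""
    return hashlib.md5(feature.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Character k-grams used by the assumed fallback similarity measure."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets (assumed measure)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def detect(features, text, fingerprint_db, recent_docs, threshold=0.8):
    """Return the id of a similar document cluster, or None.

    fingerprint_db: dict mapping fingerprint -> cluster id (hypothetical layout).
    recent_docs: list of (cluster_id, text) pairs for documents within the
                 specified number of days (hypothetical layout).
    """
    # Step 102: hash every document feature into a fingerprint.
    prints = [fingerprint(f) for f in features]
    # Step 103, first stage: query the fingerprint mapping database.
    for fp in prints:
        if fp in fingerprint_db:
            return fingerprint_db[fp]
    # Step 103, fallback: compare directly with recent documents.
    for cluster_id, other in recent_docs:
        if jaccard(text, other) > threshold:
            return cluster_id
    return None
```

The fingerprint lookup is a constant-time dictionary query, while the fallback scans only the recent documents, which mirrors the source's motivation of combining a fast query with a recall-oriented comparison.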

Embodiment 2

[0035] This embodiment provides a document detection method. As shown in Figure 3, the method includes the following steps.

[0036] 301. Extract characters from a webpage document to be tested to obtain at least one document feature.

[0037] Specifically, the document features stand in for the webpage document to be tested and are compared with the document features of other webpage documents, so as to determine whether the webpage document to be tested is a near-duplicate of any of them. At least one document feature can be obtained through the following method.

[0038] Divide the text of the webpage document to be tested into at least one paragraph according to the paragraph identifier; select the N paragraphs containing the largest number of characters among them; divide each of the N paragraphs into at least one sentence according to punctuation marks; each paragraph contains a sentence with the la...
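
The paragraph and sentence selection in [0038] might look like the sketch below. It is illustrative only. The newline as paragraph identifier, the punctuation set, and the default N = 3 are assumptions. The source text is also truncated before it states which sentence is taken from each paragraph, so choosing the longest sentence is an assumption as well. The function name `extract_features` is hypothetical.

```python
import re

# Assumed set of sentence-ending punctuation marks (not given in the source).
SENTENCE_DELIMS = r"[。！？!?.;；]"


def extract_features(text: str, n: int = 3) -> list:
    """Pick one representative sentence from each of the N longest paragraphs."""
    # Split on the paragraph identifier (assumed here to be a newline).
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    # Keep the N paragraphs containing the most characters.
    longest = sorted(paragraphs, key=len, reverse=True)[:n]
    features = []
    for para in longest:
        # Split the paragraph into sentences at punctuation marks.
        sentences = [s.strip() for s in re.split(SENTENCE_DELIMS, para) if s.strip()]
        if sentences:
            # Assumption: the longest sentence represents the paragraph.
            features.append(max(sentences, key=len))
    return features
```

Each returned sentence would then be hashed into a feature fingerprint, so a document yields at most N fingerprints.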

Embodiment 3

[0099] This embodiment provides a document detection system. As shown in Figure 4, it includes: a feature extraction device 41, which extracts characters from the webpage document to be tested to obtain at least one document feature and performs a hash calculation on each document feature to obtain a corresponding feature fingerprint; and a document comparison device 43, which, when no document cluster similar to the webpage document to be tested is found in the fingerprint mapping database according to the feature fingerprints, compares the webpage document to be tested with the webpage documents from within a specified number of days, and, when it determines that a webpage document from within the specified number of days whose similarity value with the webpage document to be tested is greater than the similarity threshold is included in a document cluster of the document mapping database, the document cluster is a target docu...

Abstract

Embodiments of the present invention provide a document detection method and system, which relate to the technical field of Internet data processing and solve the problem that existing near-duplicate document detection methods cannot meet higher requirements in terms of precision and recall. The embodiments combine a multi-feature fingerprint query with a document similarity comparison. Multiple feature fingerprints accurately reflect the characteristics that distinguish the webpage document to be tested from other webpage documents, and the correspondence between feature fingerprints and near-duplicate documents already stored in the database allows qualifying records to be queried quickly, so the accuracy and efficiency of near-duplicate document detection are improved. The document similarity comparison covers the case in which the webpage document to be tested is in fact a near-duplicate but the multi-feature fingerprint query cannot detect it because the database is incomplete, thereby improving the recall of near-duplicate document detection.

Description

Technical field

[0001] The invention relates to the technical field of Internet data processing, in particular to a document detection method and system.

Background

[0002] Near-duplicate documents usually refer to two or more webpage documents on the Internet that differ only slightly, for example in counters, timestamps, a small number of words, or a small number of sentences. In addition, webpage documents on the same theme, that is, about the same event, are also near-duplicate documents: even if their words, sentences, paragraphs, and lengths differ greatly, they express similar meaning from the perspective of network users.

[0003] According to statistics, duplicate webpage documents account for about 30% to 45% of all documents on the Internet, and most of them are generated by mirroring. Detecting near-duplicate documents in a huge number of web do...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 王炫聪, 孙甲慧, 陈锡彬, 李翔, 黄斌强
Owner: SHENZHEN SHI JI GUANG SU INFORMATION TECH