Document similarity calculation method and near-duplicate document detection method and device

A document similarity calculation technology, applied in the computer field, that addresses problems in near-duplicate document detection: the growing complexity of finding duplicate content, and accuracy that is easily affected by the order of word segmentation.

Inactive Publication Date: 2014-12-31
HUAWEI TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0003] Near-duplicate detection over massive text data faces two main problems. The first is accuracy: because documents pass through different data input and processing pipelines, near-duplicate documents are not necessarily exactly the same.
The second is efficiency: the information explosion brought about by the development of Internet technology has increased the complexity of discovering duplicate content in massive data.
[0005] However, existing set-based similarity calculation methods cannot perceive the edit similarity of text at the word segmentation level. When editing errors occur at the wor...

Examples

Embodiment 1

[0118] Figure 1 is a flowchart of the document similarity calculation method provided by this embodiment. As shown in Figure 1, the document similarity calculation method of the present invention includes:

[0119] S101: Perform word segmentation processing on two documents to be detected respectively to obtain respective word segmentation sets of the documents to be detected.

[0120] Word segmentation is performed separately on the two documents to be detected, $s_1$ and $s_2$, to obtain the word segmentation sets $T_1$ and $T_2$.

[0121] S102: Calculate the edit similarity of all word segmentation pairs in the two word segmentation sets.

[0122] The two segmented words of each pair come from the two word segmentation sets respectively.

[0123] Specifically, build a bipartite graph whose vertices are all segmented words $t_{1,i} \in T_1$ and $t_{2,j} \in T_2$ of the word segmentation sets $T_1$ and $T_2$, where $1 \le i \le m$, $1 \le j \le n$, and $m$ and $n$ are the sizes of the word segmentation sets $T_1$ and $T_2$. Calculate all word...
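
As an illustration of step S102, the following is a minimal sketch of computing the edit similarity of every cross-set pair and keeping only the pairs that clear a threshold, as the abstract describes for building the weighted bipartite graph. The excerpt does not define edit similarity, so the normalization 1 - distance / max(length) is an assumption, as are the function names and the default threshold of 0.8.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed definition: 1 - edit_distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def build_weighted_edges(t1: Sequence[str], t2: Sequence[str],
                         threshold: float = 0.8) -> Dict[Tuple[int, int], float]:
    """Return {(i, j): similarity} for pairs whose similarity >= threshold.

    i indexes t1 and j indexes t2, so each edge joins the two sides of the
    bipartite graph. The threshold value is a hypothetical choice.
    """
    edges = {}
    for (i, a), (j, b) in product(enumerate(t1), enumerate(t2)):
        sim = edit_similarity(a, b)
        if sim >= threshold:
            edges[(i, j)] = sim
    return edges
```

For example, `build_weighted_edges(["huawei", "phone"], ["hauwei", "phone"], threshold=0.6)` keeps two edges: the exact match between the two copies of "phone" with weight 1.0, and the misspelled pair "huawei"/"hauwei" with weight 2/3.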

Embodiment 2

[0140] Figure 2 is a flowchart of the near-duplicate document detection method provided by this embodiment. As shown in Figure 2, the near-duplicate document detection method of the present invention includes:

[0141] S201: Perform word segmentation processing on each document to be detected to obtain a respective word segmentation set of each document to be detected.

[0142] Word segmentation is performed on each document to be detected using an existing word segmentation method, for example splitting on specific non-English characters (such as punctuation and numbers) or the forward maximum matching method, to obtain its word segmentation set.
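
The forward maximum matching method mentioned above can be sketched as follows. This is the textbook form of the algorithm, not necessarily the variant used in the patent; the function name, the `dictionary` argument, and the fallback of emitting a single character when no dictionary word matches are assumptions for illustration.

```python
from typing import List, Optional, Set


def forward_maximum_matching(text: str, dictionary: Set[str],
                             max_len: Optional[int] = None) -> List[str]:
    """Segment text by greedily taking the longest dictionary word at each position."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)  # guard against an empty or degenerate dictionary

    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted so the scan keeps moving.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
    return tokens
```

With the hypothetical dictionary `{"near", "duplicate", "dup"}`, the input `"nearduplicate"` is segmented into `["near", "duplicate"]`, because the longer word "duplicate" is preferred over "dup" at the same position.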

[0143] Optionally, number the segmented words obtained by word segmentation and record each number, where the number indicates the order in which the segmented word appears in the document to be detected.
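
A minimal sketch of the optional numbering step, assuming a simple delimiter-based tokenizer that splits on any run of characters other than English letters (punctuation, digits, whitespace). Both function names and the choice to start numbering at 1 are assumptions; the excerpt only states that the number reflects the order of appearance.

```python
import re
from typing import List, Tuple


def split_on_delimiters(text: str) -> List[str]:
    """Split on runs of non-English-letter characters and drop empty pieces."""
    return [piece for piece in re.split(r"[^A-Za-z]+", text) if piece]


def number_tokens(tokens: List[str]) -> List[Tuple[int, str]]:
    """Pair each segmented word with its 1-based order of appearance."""
    return list(enumerate(tokens, start=1))
```

For instance, `number_tokens(split_on_delimiters("Near-duplicate docs, 2014 edition!"))` yields `[(1, "Near"), (2, "duplicate"), (3, "docs"), (4, "edition")]`.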

[014...

Embodiment 3

[0187] Figure 4 is a schematic diagram of the document similarity calculation device provided by this embodiment. As shown in Figure 4, the document similarity calculation device of the present invention includes: a word segmentation module 401, a first calculation module 402, a weighted bipartite graph establishment module 403, a second calculation module 404, and a third calculation module 405.

[0188] The word segmentation module 401 is configured to perform word segmentation processing on two documents to be detected respectively to obtain respective word segmentation sets of the documents to be detected.

[0189] The word segmentation module 401 performs word segmentation separately on the two documents to be detected, $s_1$ and $s_2$, to obtain the word segmentation sets $T_1$ and $T_2$.

[0190] The first calculation module 402 is configured to calculate the edit similarity of all word segmentation pairs in the two word segmentation sets obtained by the word segmentation module 4...

Abstract

The invention relates to a document similarity calculation method and a near-duplicate document detection method and device. The calculation method comprises the following steps: performing word segmentation on two documents to be detected to obtain the respective word segmentation sets of the documents; calculating the edit similarity of all word pairs across the two word segmentation sets, wherein the two segmented words in each pair come from the two sets respectively; establishing edges between those pairs whose edit similarity meets a certain requirement to obtain a weighted bipartite graph, wherein the edit similarity of a pair is the weight of its edge; calculating the maximum weighted matching value of the weighted bipartite graph; and calculating the similarity between the documents to be detected from the maximum weighted matching value. The document similarity calculation method and the near-duplicate document detection method and device provided by the invention achieve high accuracy: near-duplicate texts whose segmented words contain editing errors can be identified effectively, near-duplicate document detection accuracy is increased, calculation complexity is lowered, and calculation efficiency is optimized.
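
The last two steps of the abstract can be sketched as follows, assuming the weighted bipartite graph is given as a matrix in which 0.0 means "no edge". The matching value is computed with the Hungarian algorithm, which is one standard choice and not necessarily the one used in the patent. The final normalization, M / (m + n - M), is a Jaccard-style assumption: this excerpt does not state how the document similarity is derived from the matching value M. All function names are hypothetical.

```python
from typing import List


def max_weight_matching_value(weights: List[List[float]]) -> float:
    """Maximum total edge weight of a matching in a weighted bipartite graph.

    weights[i][j] >= 0 is the weight between left vertex i and right vertex j.
    Runs the Hungarian algorithm (O(n^2 * m)) on the negated weights.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    if n == 0 or m == 0:
        return 0.0
    if n > m:  # the algorithm below needs rows <= columns, so transpose
        weights = [list(column) for column in zip(*weights)]
        n, m = m, n

    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row matched to column j, 0 if unmatched
    way = [0] * (m + 1)   # previous column on the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    reduced = -weights[i0 - 1][j - 1] - u[i0] - v[j]
                    if reduced < minv[j]:
                        minv[j], way[j] = reduced, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the virtual column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    return sum(weights[p[j] - 1][j - 1] for j in range(1, m + 1) if p[j])


def similarity_from_matching(weights: List[List[float]], m: int, n: int) -> float:
    """Assumed Jaccard-style score: M / (m + n - M), with M the matching value."""
    if m == 0 and n == 0:
        return 1.0
    matching = max_weight_matching_value(weights)
    return matching / (m + n - matching)
```

As a usage example, two documents with two segmented words each and the weight matrix `[[1.0, 0.0], [0.0, 2/3]]` have a matching value of 5/3, which gives a score of 5/7 under the assumed normalization. The matrix `[[0.9, 0.8], [0.8, 0.0]]` shows why a true maximum matching is needed: greedily taking the 0.9 edge yields 0.9, whereas the optimal matching uses both 0.8 edges for a total of 1.6.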

Description

Technical field

[0001] The present invention relates to the field of computer technology, in particular to a method for calculating document similarity and a method and a device for detecting near-duplicate documents.

Background

[0002] With the popularization of electronic equipment and the development of Internet technology, the number of Internet users continues to grow, leading to the continuous expansion of Internet data. According to a research report by the International Data Corporation (IDC), about 75% of existing data is duplicate information; that is, only 25% of the data is unique. Detection technology for duplicate or near-duplicate data in massive data is therefore particularly important. It can not only reduce the storage and bandwidth resources consumed by a deduplication system, but also help improve the quality of the information processed by data cleaning and analysis systems.

[0003] Approximate duplicate detection of massive...

Application Information

IPC(8): G06F17/27
CPC: G06F16/334, G06F16/9024, G06F40/205
Inventor: 李国良, 冯建华, 魏建生
Owner: HUAWEI TECH CO LTD