
A text deduplication method and device

A text deduplication technique, applied in the field of text processing, that addresses the heavy computation, high misjudgment rate, and complex implementation of existing methods.

Active Publication Date: 2018-09-28
TENCENT TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

[0009] The advantage of the simhash algorithm is that it greatly reduces the computational workload when the number of texts is very large. The disadvantage is that it is more complicated to implement, and computing the Hamming distances is relatively expensive.
[0010] Of the three methods above, the first has a high misjudgment rate and the latter two require too much computation; none of them achieves both a low misjudgment rate and a low computational cost.
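
The Hamming-distance cost mentioned in [0009] can be illustrated with a minimal sketch. It assumes fingerprints are held as Python integers and that a naive linear scan is used; the function names and the `max_distance` threshold are hypothetical and do not come from the source.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query: int, fingerprints: list[int], max_distance: int = 3) -> list[int]:
    """Return indices of stored fingerprints within max_distance bits of the query.

    This is a linear scan: every query is compared against every stored
    fingerprint, which is where the cost grows with the size of the corpus.
    The default threshold of 3 bits is an assumption for illustration.
    """
    return [
        i for i, fp in enumerate(fingerprints)
        if hamming_distance(query, fp) <= max_distance
    ]
```

With this naive approach, each lookup performs one XOR and one bit count per stored fingerprint, so the work scales linearly with the number of stored texts.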



Embodiment Construction

[0025] In order to make the objectives, technical solutions, and advantages of the present invention clearer, the following further describes the solutions of the present invention in detail with reference to the accompanying drawings and embodiments.

[0026] In the embodiment of the present invention, text deduplication is accomplished through the following three steps:

[0027] Step 1. Build a case library:

[0028] To deduplicate text, multiple texts are first designated as case texts, and each case text is processed to build a case library.

[0029] The processing of each case text includes the following steps:

[0030] A1. Extract the feature words of the case text to obtain a feature word string.

[0031] An existing word segmentation method can be used to extract the feature words.

[0032] For example, for the case text: What happened to your car:

[0033] Extract feature words to get the following feature word string: What happened to your ca...
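
Step A1 could be sketched as follows. This is only an illustration: the source does not say which word segmentation method is used, so a simple regular expression that keeps runs of word characters stands in for a real segmenter, and the function name `extract_feature_words` is hypothetical.

```python
import re


def extract_feature_words(text: str) -> list[str]:
    """Split a text into feature words.

    Assumption: runs of word characters are treated as feature words and
    punctuation is dropped. A production system would more likely use a
    dedicated word segmenter, especially for Chinese text.
    """
    return re.findall(r"\w+", text)


def feature_word_string(text: str) -> str:
    """Join the feature words into a single space-separated string."""
    return " ".join(extract_feature_words(text))
```

Dropping punctuation at this stage means that texts differing only in punctuation produce the same feature word string.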


Abstract

The invention provides a text deduplication method and device. The feature word string of each case text is sliced and a signature value is calculated for every slice; an association between the signature values and the case texts is then established to form a case library. When a text to be processed needs to be deduplicated, its feature word string is sliced, the signature value of every slice is calculated, and the case texts corresponding to the slices are determined from those signature values. The number of signature values corresponding to each case text is counted, the largest count is used to calculate the similarity between the text to be processed and the corresponding case text, and a similarity judgment is made on that basis. The method and device require less computation while keeping the misjudgment rate low.
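
The flow described in the abstract can be sketched in a few lines. Several details here are assumptions, since the visible text does not specify them: slices are taken as overlapping windows of `SLICE_LEN` feature words, MD5 stands in for the signature function, and similarity is the largest match count divided by the number of distinct signatures in the text being processed. All function names are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

SLICE_LEN = 3  # assumed number of feature words per slice


def slices(words, n=SLICE_LEN):
    """Overlapping windows of n consecutive feature words (assumed slicing rule)."""
    if len(words) <= n:
        return [tuple(words)]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def signature(slice_words):
    """Signature value of one slice; MD5 is an illustrative choice of hash."""
    return hashlib.md5(" ".join(slice_words).encode("utf-8")).hexdigest()


def build_case_library(case_texts):
    """Map each signature value to the ids of the case texts that contain it."""
    library = defaultdict(set)
    for case_id, words in case_texts.items():
        for s in slices(words):
            library[signature(s)].add(case_id)
    return library


def best_match(words, library):
    """Return (case_id, similarity) for the case text sharing the most signatures."""
    sigs = {signature(s) for s in slices(words)}
    # Count, per case text, how many of the query's signatures it shares.
    counts = Counter(case_id for sig in sigs for case_id in library.get(sig, ()))
    if not counts:
        return None, 0.0
    case_id, hits = counts.most_common(1)[0]
    return case_id, hits / len(sigs)  # assumed similarity formula
```

A text would then be treated as a duplicate when the returned similarity exceeds some threshold; the threshold value is not given in the visible text. Because the lookup is a dictionary access per slice, the cost depends on the length of the text being processed rather than on the number of case texts.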

Description

Technical field

[0001] This application relates to the technical field of text processing, and in particular to a method and device for text deduplication.

Background technique

[0002] Current text deduplication methods mainly include the following: text hash, cosine similarity calculation, and simhash, each introduced below.

[0003] 1) Text hash method: Calculate a hash value of the text content (for example, the Murmur hash value) and compare the hash values of the two texts. If the hash values are the same, the texts are considered the same.

[0004] The text hash method can quickly judge whether two texts are the same, but the condition is too strict: the text content must be exactly identical, otherwise different hash values are calculated. For example, "Let it evolve." and "Let it evolve!" are essentially the same text, but because the final punctuation marks are different, th...
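
The strictness of the text hash method in [0003] and [0004] can be seen in a minimal sketch. The source names the Murmur hash as an example; because Murmur is not part of the Python standard library, MD5 from `hashlib` is used here as a stand-in, and the function names are hypothetical.

```python
import hashlib


def text_hash(text: str) -> str:
    """Hash the full text content (MD5 as a stand-in for Murmur)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def same_text(a: str, b: str) -> bool:
    """Texts are judged identical only if their hash values match."""
    return text_hash(a) == text_hash(b)


# The two example sentences differ only in the final punctuation mark,
# yet the exact-hash comparison treats them as different texts.
print(same_text("Let it evolve.", "Let it evolve."))
print(same_text("Let it evolve.", "Let it evolve!"))
```

Any single-character difference changes the hash, so this method only detects exact copies.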


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/21, G06F17/30
Inventor: 贾铸斌, 袁昌文
Owner: TENCENT TECH (BEIJING) CO LTD