Text duplicate removal method and device

A text deduplication technology, applied in the field of text processing, that addresses the high misjudgment rate, complex implementation, and large amount of calculation of existing methods.

Active Publication Date: 2015-05-20
TENCENT TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

[0009] The advantage of the simhash algorithm is that it greatly reduces the calculation workload when processing massive numbers of texts. Its disadvantages are that it is more complicated to implement and that calculating the Hamming distances requires a relatively large amount of computation.
[0010] It can be seen that, of the above three methods, the first has a higher misjudgment rate and the latter two require too much calculation; none of them achieves both a low misjudgment rate and a small amount of calculation.
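To make the Hamming-distance cost mentioned in [0009] concrete, the following is a minimal sketch of how two simhash-style fingerprints might be compared. It is not taken from the patent: the function names and the `max_distance` threshold of 3 are assumptions chosen for illustration, and the fingerprints are treated as plain integers.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query: int, fingerprints: list[int], max_distance: int = 3) -> list[int]:
    """Return the indices of stored fingerprints within max_distance bits of the query.

    max_distance=3 is an assumed threshold, not a value from the source.
    """
    return [
        index
        for index, fingerprint in enumerate(fingerprints)
        if hamming_distance(query, fingerprint) <= max_distance
    ]
```

A naive search of this kind compares the query against every stored fingerprint, which is one way the amount of calculation can grow with the size of the collection.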



Embodiment Construction

[0025] In order to make the object, technical solution and advantages of the present invention clearer, the solutions of the present invention will be further described in detail below with reference to the accompanying drawings and examples.

[0026] In the embodiment of the present invention, text deduplication is completed through the following three steps:

[0027] Step 1. Create a case library:

[0028] In order to deduplicate text, it is first necessary to designate multiple pieces of text as case texts, and process each of the case texts to build a case library.

[0029] The processing of each case text includes the following steps:

[0030] A1. Extract the feature words of the case text to obtain a feature word string.

[0031] Existing word segmentation methods can be used to extract text feature words.

[0032] For example, for the case text "What the hell happened to your car":

[0033] Extract the feature words to get the following feature word string: what happ...



Abstract

The invention provides a text deduplication method and device. In the technical scheme, the feature word string of each case text is sliced and a signature value is calculated for each slice, so that an association between the signature values and the case texts is established to form a case library. When a text to be processed needs to be deduplicated, its feature word string is sliced, the signature value of each slice is calculated, and the case text corresponding to each slice is determined from that slice's signature value. The number of signature values corresponding to the same case text is then counted, the largest such count is used to calculate the similarity between the text to be processed and the corresponding case text, and the similarity judgment is made on that basis. The method and device require less calculation while still ensuring a low misjudgment rate.
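The flow summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the slice length, the use of overlapping slices, the MD5-based signature, and the similarity formula (largest hit count divided by the number of distinct query signatures) are all assumptions, since the visible text does not specify them. All function names are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

SLICE_LEN = 3  # assumed number of feature words per slice


def slices(feature_words, n=SLICE_LEN):
    """Cut a feature word list into overlapping windows of n words (assumed scheme)."""
    if len(feature_words) < n:
        return [tuple(feature_words)] if feature_words else []
    return [tuple(feature_words[i:i + n]) for i in range(len(feature_words) - n + 1)]


def signature(slice_words):
    """Signature value of one slice; MD5 is a stand-in for an unspecified hash."""
    digest = hashlib.md5(" ".join(slice_words).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_case_library(case_texts):
    """Map each signature value to the ids of the case texts that contain it.

    case_texts: dict of case id -> list of feature words.
    """
    library = defaultdict(set)
    for case_id, words in case_texts.items():
        for s in slices(words):
            library[signature(s)].add(case_id)
    return library


def best_match(library, feature_words):
    """Return (case id, similarity) for the case text sharing the most signatures."""
    query_signatures = {signature(s) for s in slices(feature_words)}
    counts = Counter()
    for sig in query_signatures:
        for case_id in library.get(sig, ()):
            counts[case_id] += 1
    if not counts:
        return None, 0.0
    case_id, hits = counts.most_common(1)[0]
    # Assumed similarity: share of the query's signatures found in the best case text.
    return case_id, hits / len(query_signatures)
```

Under these assumptions, a text would be judged a duplicate when the returned similarity exceeds some chosen threshold; the threshold itself is not given in the visible text.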

Description

Technical field

[0001] The present application relates to the technical field of text processing, in particular to a text deduplication method and device.

Background technique

[0002] Current text deduplication methods mainly include the following types: text hash, cosine similarity calculation, and simhash, which are introduced below.

[0003] 1) Text hash method: Calculate a hash value (for example, the Murmur hash value) of the text content, and compare whether the hash values of two texts are the same to determine whether they are the same text; if the hash values are consistent, the texts are considered to be the same.

[0004] The text hash method can quickly judge whether two texts are similar, but the judgment conditions are too strict: the text content must be exactly the same, otherwise different hash values may be calculated. For example: "Let it go." and "Let it go!" are the same text, but because the final punctuation marks are different, they are m...
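The strictness of the text hash method described in [0003] and [0004] can be illustrated with a short sketch. The source names Murmur hash as an example; because Python's standard library does not include Murmur, SHA-1 from `hashlib` is used here as a stand-in, and the function names are hypothetical.

```python
import hashlib


def text_hash(text: str) -> str:
    """Hash the full text content (SHA-1 used as a stand-in for Murmur hash)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def is_duplicate(a: str, b: str) -> bool:
    """Text hash method: two texts count as duplicates only if their hashes match."""
    return text_hash(a) == text_hash(b)


print(is_duplicate("Let it go.", "Let it go."))  # True
print(is_duplicate("Let it go.", "Let it go!"))  # False
```

The second call returns False because a single differing punctuation mark changes the hash, which is the limitation the background section points out.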


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/21, G06F17/30
Inventor: 贾铸斌, 袁昌文
Owner: TENCENT TECH (BEIJING) CO LTD