Text similarity detection method

A text similarity detection method, applied in the field of text similarity detection, that addresses problems such as poor support for short texts and loss of effective information, and improves the accuracy and reliability of detection.

Active Publication Date: 2018-01-09
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] The invention provides a text similarity detection method that addresses the Simhash algorithm's poor support for short tex...



Examples


Embodiment 1

[0024] Example 1: As shown in Figures 1-5, a text similarity detection method comprises the following steps:

[0025] Step 1: Input text A and text B.

[0026] The content of text A is "Xiao Ming, your friend is calling you to go to the stadium to play basketball, and then have dinner together!", and the content of text B is "Xiao Ming, your friend is calling you to go to the playground to play rugby, and then we will have dinner together!".

[0027] Step 2: Preprocess text A and text B to obtain their substantive words; calculate the TF-IDF value of each substantive word of text A and text B and use it as that word's weight; using these weights, generate a Simhash fingerprint of length l_1 for the substantive words of text A and of text B respectively, and calculate the Hamming distance h_1 between the two fingerprints; from the Hamming distance h_1 and the length l of the generated fingerprint...
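The fingerprint and distance computation in Step 2 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names `simhash` and `hamming_distance`, the use of MD5 as the per-word hash, and the 64-bit fingerprint length are all assumptions. The weights are passed in as a plain mapping; the usage lines substitute raw term counts for the TF-IDF values the text calls for, and the word lists are hand-picked stand-ins for the preprocessing output. The formula that turns h_1 and the fingerprint length into a similarity is truncated in the source, so it is not reproduced here.

```python
import hashlib
from collections import Counter


def simhash(weights, bits=64):
    """Build a `bits`-bit Simhash fingerprint from a {word: weight} mapping.

    Assumption: MD5 is used as the per-word hash; any stable hash would do.
    """
    totals = [0.0] * bits
    for word, weight in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # A set bit adds the word's weight, a clear bit subtracts it.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical substantive words for texts A and B (stand-in preprocessing).
words_a = ["friend", "call", "stadium", "play", "basketball", "dinner"]
words_b = ["friend", "call", "playground", "play", "rugby", "dinner"]

# Stand-in weights: raw term counts instead of real TF-IDF values.
fp_a = simhash(Counter(words_a))
fp_b = simhash(Counter(words_b))
h1 = hamming_distance(fp_a, fp_b)
```

Because each bit of the fingerprint is decided by a weighted vote over all words, two short texts that differ in only a couple of words can still flip many bits, which is one way to see the short-text weakness the patent describes.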


Abstract

The invention relates to a text similarity detection method and belongs to the technical field of natural language processing. The method first calculates the similarity of the texts using the conventional Simhash algorithm. Second, it introduces an N-Gram language model to combine the text keywords so that they carry contextual relationships, and calculates the similarity again with the Simhash algorithm. Third, it introduces the longest common substring as a further similarity criterion and calculates a similarity from it. Finally, it assigns a weight to each calculated similarity and superimposes them to obtain the final similarity. Compared with the prior art, the method mainly mitigates the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and it improves the accuracy and reliability of text similarity detection.
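The remaining building blocks named in the abstract can be sketched in the same hedged way. The helper names `word_ngrams`, `longest_common_substring`, and `combine_similarities` are hypothetical, and so are the example weights; the abstract does not state how the longest common substring is converted into a similarity score or which weights are used, so the sketch leaves the scores as inputs.

```python
def word_ngrams(words, n=2):
    """Join each run of n consecutive keywords into one combined term."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def combine_similarities(similarities, weights):
    """Weighted sum of individual similarity scores (weights are assumed)."""
    if len(similarities) != len(weights):
        raise ValueError("similarities and weights must have the same length")
    return sum(s * w for s, w in zip(similarities, weights))


# Bigrams tie each keyword to its neighbour before fingerprinting.
bigrams = word_ngrams(["friend", "call", "stadium", "play", "basketball"])

# Longest shared run of characters between two short strings.
shared = longest_common_substring("play basketball", "play rugby")

# Hypothetical scores and weights, only to show the superposition step.
final_score = combine_similarities([0.5, 1.0], [0.5, 0.5])
```

The N-Gram terms would be fed to the same Simhash routine as single keywords, so word order starts to influence the fingerprint; the longest common substring works on the raw text and is unaffected by hashing.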

Description

Technical field

[0001] The invention relates to a text similarity detection method, which belongs to the technical field of natural language processing.

Background art

[0002] Currently, many learning materials are stored in large-scale data centers. However, these data centers contain a large number of duplicate or similar files, which affects both the storage space of the data center and data retrieval by search engines.

[0003] Simhash is currently the mainstream algorithm for near-duplicate text detection, but using Simhash for text similarity detection still has many problems. For example, its accuracy on short texts is very poor, and Simhash applies several dimensionality reductions while generating fingerprints, which may cause some valid information to be lost.

Contents of the invention

[0004] The invention provides a text similarity detection method, which is used to solve the phenomenon that the Simhash algorithm ha...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor 龙华, 祁俊辉, 杜庆治, 邵玉斌
Owner KUNMING UNIV OF SCI & TECH