Improved text similarity solving algorithm based on semantic analysis

A text similarity technique based on semantic analysis, applied in the field of semantic-analysis-based text similarity algorithms. It addresses the problems that existing methods ignore the semantic status of words and their contribution to the text, ignore the semantic relationships between words, and therefore cannot truly reflect the similarity of texts. The claimed effects are high precision, good applicability, and greater practical value.

Inactive Publication Date: 2017-05-03
SICHUAN YONGLIAN INFORMATION TECH CO LTD
Cites: 0 | Cited by: 21

AI Technical Summary

Problems solved by technology

The Vector Space Model (VSM) is a classic method for calculating text similarity. It does not consider the semantic information of words or the semantic connections between words, so it cannot truly reflect the similarity between texts.

Examples

Embodiment Construction

[0023] To account for the differing importance of individual feature words to a text, to handle polysemy (one word with several meanings) and synonymy (several words with one meaning), and to improve the accuracy of text similarity calculation, the invention is described in detail with reference to Figures 1-3. The specific implementation steps are as follows:

[0024] Step 1: Initialize the text corpus module and preprocess the texts to be compared (W1, W2). The process is described as follows:

[0025] Preprocessing combines word segmentation with stop-word removal. The flow chart of the Chinese text preprocessing process is shown in Figure 2.

[0026] Word segmentation here uses a Chinese automatic word segmentation algorithm based on information theory. The specific word segmentation and stop-word processing steps are as follows:

[0027] Step 1.1: Use the stop-word list to separate the text (W1, W ...
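The stop-word part of Step 1 can be illustrated with a minimal Python sketch. It assumes the texts have already been segmented into tokens; the information-theoretic segmentation algorithm itself is not reproduced here, and the function name, stop-word list, and sample tokens are hypothetical. English tokens stand in for segmented Chinese words.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word list, keeping order."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical stop-word list and pre-segmented texts (illustrative only).
STOP_WORDS = {"the", "of", "is", "a"}
w1_tokens = ["the", "similarity", "of", "texts", "is", "important"]
w2_tokens = ["semantic", "analysis", "is", "a", "method"]

w1_clean = remove_stop_words(w1_tokens, STOP_WORDS)
w2_clean = remove_stop_words(w2_tokens, STOP_WORDS)
```

A set is used for the stop-word list so that each membership check is constant time on average.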

Abstract

The invention discloses an improved text similarity solving algorithm based on semantic analysis. The algorithm comprises the following steps: performing word segmentation and stop-word removal on the two texts; computing the weights of the words in each text using an improved information-theoretic method; deriving position and part-of-speech weights for each word from its position and properties; constructing, from these three factors, an objective function (given in the description) for extracting the feature words of a text; and finally reducing the dimensionality of the two sets of feature words according to semantic similarity to obtain two feature word vectors, then computing the text similarity sim(W1, W2) between the texts (W1, W2) with the Pearson correlation coefficient. Compared with traditional text similarity methods, the algorithm is stated to have higher accuracy, a wider application range, and higher application value. It can accurately compute the contribution of different words to the main idea of a text, addresses the problems of polysemy and synonymy, agrees better with empirical values, and provides a good theoretical basis for subsequent text clustering.
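The final step, scoring two feature word vectors with the Pearson correlation coefficient, can be sketched in Python as follows. This is only an illustration of the standard coefficient: it assumes both texts have already been reduced to equal-length numeric weight vectors over the same feature words, which is where the weighting and dimensionality reduction described above would apply. The function name and the zero-variance fallback are assumptions, not part of the source.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        # Assumption: treat a constant vector as uncorrelated.
        return 0.0
    return cov / (dev_x * dev_y)


# Hypothetical feature word weights for the two texts (illustrative only).
v1 = [0.8, 0.1, 0.4, 0.0]
v2 = [0.7, 0.2, 0.5, 0.1]
sim = pearson(v1, v2)
```

The coefficient ranges from -1 to 1, so a downstream clustering step may need to rescale it if it expects a similarity in [0, 1].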

Description

Technical field

[0001] The invention relates to the technical field of semantic networks, and in particular to an improved algorithm for solving text similarity based on semantic analysis.

Background art

[0002] At present, there are two commonly used approaches to calculating text similarity: one based on mathematical statistics and the other based on semantic analysis. Statistical methods compute similarity from word form and word frequency, while semantic methods use the internal semantic connections among the words in a text. The Vector Space Model (VSM) is a classic method for calculating text similarity. It does not consider the semantic information of words or the semantic connections between words, so it cannot truly reflect the similarity between texts. In addition, VSM does not consider the semantic status of vocabulary in the text and the contribution to the exp...
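The limitation attributed to VSM can be seen in a minimal Python sketch of a plain bag-of-words cosine similarity. This is a generic illustration rather than the patent's method; the function name and the sample tokens are hypothetical. Two texts made of synonyms share no word forms, so their term-frequency vectors are orthogonal.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of raw term-frequency vectors (plain VSM)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(freq_a) | set(freq_b)
    # Counter returns 0 for words missing from a text.
    dot = sum(freq_a[t] * freq_b[t] for t in vocab)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical texts: same meaning, no shared word forms.
doc_a = ["buy", "car"]
doc_b = ["purchase", "automobile"]

sim_synonyms = cosine_similarity(doc_a, doc_b)  # no overlap in word forms
sim_identical = cosine_similarity(doc_a, doc_a)
```

Because the score depends only on matching word forms, the synonymous pair scores zero while the identical pair scores one, regardless of meaning.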

Application Information

IPC(8): G06F17/27
CPC: G06F40/284, G06F40/30
Inventor: 金平艳
Owner: SICHUAN YONGLIAN INFORMATION TECH CO LTD