Improved text similarity solution method

A text similarity technology, applied in the field of improved text similarity calculation, that addresses the problems of ignoring differences in word importance, low accuracy, and neglecting the internal connections between semantics, and that achieves high accuracy and good applicability.

Inactive Publication Date: 2017-05-03
SICHUAN YONGLIAN INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

Therefore, to calculate the similarity between texts accurately, one must start from the semantic level. The second type of method is based on semantic analysis: it uses a semantic dictionary to analyze the words in a text, but it does not explore the internal relationships between semantics, and it does not consider how the importance of different words to the text differs, so its accuracy is low.

Examples

Embodiment Construction

[0015] To account for the differing importance of individual feature words to a text, the invention is described in detail below with reference to Figure 1. The specific implementation steps are as follows:

[0016] Step 1: Initialize the text corpus module and preprocess the two texts to be compared (W_1, W_2). The process is as follows:

[0017] Perform word segmentation and stop-word removal on the texts (W_1, W_2).
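
The preprocessing in Step 1 can be sketched roughly as follows. This is only an illustrative sketch: the source does not name a segmenter or a stop-word list, so the regular-expression tokenizer (suitable for English, not for Chinese, which would need a dedicated word segmenter), the `STOP_WORDS` set, and the `preprocess` function are all assumptions made for this example.

```python
import re

# Hypothetical stop-word list (assumption; the source does not specify one).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Assumption: a simple regex tokenizer stands in for word segmentation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Two short example texts standing in for W_1 and W_2.
w1_tokens = preprocess("The cat sat in the garden")
w2_tokens = preprocess("A cat is sleeping in a garden")
```

The resulting token lists are the input for the weighting steps that follow.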

[0018] Step 2: Using an information-theoretic method, calculate the weight W_I of each word in the text. The calculation is as follows:

[0019] The formulas for calculating word frequency based on information theory are:

[0020]

[0021] The formula above gives the amount of information that a word carries in the document with respect to its word frequency; p(c_1) and p(c_2) are the probabilities of the words c_1 and c_2 in the text, respectively.
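
The formula in [0020] is not reproduced in this text, so the following is only a hedged illustration of the general idea. It assumes the standard Shannon self-information, -log2 p(c), with p(c) estimated as the relative frequency of a word in the text. This may differ from the patent's actual formula, and the function name `information_content` is hypothetical.

```python
import math
from collections import Counter

def information_content(tokens):
    """Return -log2 p(c) for each word c, with p(c) = count(c) / total tokens.

    Assumption: standard self-information, not necessarily the patent's formula.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log2(n / total) for word, n in counts.items()}

# "cat" has probability 0.5; "garden" and "sat" have probability 0.25 each.
info = information_content(["cat", "cat", "garden", "sat"])
```

Under this assumption, rarer words receive a larger information value than frequent ones.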

[0022] The formula for calcu...

Abstract

The invention provides an improved method for computing text similarity. The method comprises the following steps: performing word segmentation and stop-word removal on two texts (the formula is as shown in the specification); calculating the weights of the feature words in the texts using an information-theoretic method; calculating a position weight for each word from its position information; constructing a feature-value objective function (the formula is as shown in the specification) for the texts by combining all of these factors; extracting two feature-word vectors; and converting the text similarity problem into a vector distance problem. Compared with traditional text similarity calculation methods, the improved method offers higher accuracy and a wider range of application, and it also provides a sound theoretical basis for subsequent text clustering.
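
As a rough illustration of the final step, turning text similarity into a vector distance problem, the sketch below compares two weighted feature-word vectors. The source does not state which distance measure is used; cosine similarity is assumed here purely for illustration, and the example weights and the `cosine_similarity` function are hypothetical.

```python
import math

def cosine_similarity(weights_a, weights_b):
    """Cosine similarity between two sparse word -> weight mappings."""
    vocab = set(weights_a) | set(weights_b)
    dot = sum(weights_a.get(w, 0.0) * weights_b.get(w, 0.0) for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
    norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)

# Hypothetical feature weights for two texts (not taken from the source).
vec1 = {"cat": 1.0, "garden": 2.0, "sat": 2.0}
vec2 = {"cat": 1.0, "garden": 2.0, "sleeping": 2.0}
similarity = cosine_similarity(vec1, vec2)
```

A value near 1 indicates texts whose weighted feature words largely coincide, while a value of 0 indicates no shared feature words.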

Description

Technical field

[0001] The invention relates to the technical field of semantic networks, and in particular to an improved method for computing text similarity.

Background technique

[0002] At present, there are two main types of text similarity calculation methods. The first type is based on mathematical statistics, such as the classic vector space model. Methods of this type are simple to compute and can reflect the similarity between two texts at the level of word frequency. However, a meaningful text has a central idea that it wants to express, which is a concept at the semantic level, and the central idea extracted by statistical methods is far from the one the text actually expresses. Therefore, to calculate the similarity between texts accurately, one must start from the semantic level. The second type is a method based on semantic analysis, which uses a semantic dictionary to perform semantic analys...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/30
Inventor: 金平艳
Owner: SICHUAN YONGLIAN INFORMATION TECH CO LTD