Method for solving text similarity based on Gini index

A text similarity technique based on the Gini index, applied in the field of semantic networks. It addresses the low efficiency and accuracy of text similarity algorithms in the presence of synonyms and polysemous words, and their failure to consider the importance and contribution of the feature vocabulary set. The claimed effects are high accuracy and strong practical value.

Status: Inactive; Publication Date: 2017-05-03
SICHUAN YONGLIAN INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] The present invention provides a method for computing text similarity based on the Gini index. It targets four problems: high-dimensional sparse feature vectors; failure to consider the importance and contribution of the feature vocabulary set to the text; synonyms and polysemous words; and the low efficiency and accuracy of text similarity algorithms.


Embodiment Construction

[0025] To address high-dimensional sparse feature vectors, the neglected importance and contribution of the feature vocabulary set to the text, synonyms and polysemous words, and the low efficiency and accuracy of text similarity algorithms, the present invention is described in detail below with reference to Figures 1-3. The specific implementation steps are as follows:

[0026] Step 1: Use Chinese word segmentation to segment each of the two texts (w1, w2). The segmentation process is as follows:

[0027] Step 1.1: Scan the entire Chinese character string to be segmented and look up its substrings in the system's word segmentation dictionary. Mark each word that is found in the dictionary; if there is no relevant match in the dictionary, simply split the ...
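The dictionary lookup in Step 1.1 can be illustrated with a minimal sketch. It uses forward maximum matching, a common dictionary-based segmentation strategy that the source does not name, so it is a stand-in and not necessarily the patent's procedure. The function name `segment`, the toy dictionary, and the single-character fallback for unmatched text are all assumptions, since the source text is truncated at that point.

```python
def segment(text, dictionary, max_len=None):
    """Segment text by forward maximum matching against a word dictionary.

    Assumption: when no dictionary word starts at the current position,
    a single character is emitted. The source is truncated where it
    describes this fallback, so this choice is illustrative only.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"文本", "相似度", "计算", "方法"}
print(segment("文本相似度的计算方法", toy_dictionary))
```

Trying the longest candidate first means a longer dictionary word such as "相似度" wins over any shorter entry that starts at the same position.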


Abstract

The invention discloses a method for computing text similarity based on the Gini index. The method comprises the following steps: segmenting the text into words using word segmentation; removing stop words from the vocabulary by matching against a stop word list; obtaining a series of word positions and word feature weights based on research statistics; selecting feature words and reducing the dimensionality of the text vocabulary using a target weight function (given in the description); merging highly similar words according to their semantic similarity, which selects and reduces the feature words a second time; and computing the similarity between texts from the similarity between their vectors. Compared with traditional methods for extracting text feature vocabulary, the disclosed method is more accurate, has greater application value, and processes data well. It overcomes the defects of the information gain method, and its results agree more closely with empirical values. It addresses the high-dimensional sparsity of text feature vocabulary and the problem of synonyms and polysemous words, computes the degree to which different words contribute to the main idea of the text, and provides a sound theoretical basis for subsequent text similarity and text clustering work.
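The steps listed in the abstract can be sketched end to end under several assumptions. The target weight function is not reproduced in this excerpt, so the per-word `weights` are passed in as a plain dictionary. The `synonyms` mapping stands in for the semantic-similarity merge, and cosine similarity is assumed as the "similarity between the vectors". All names and sample data below are hypothetical.

```python
import math
from collections import Counter


def weighted_vector(tokens, stop_words, synonyms, weights):
    """Build a sparse weighted term vector from segmented tokens.

    stop_words: words to discard (stop word elimination).
    synonyms:   hypothetical map from a word to its canonical form,
                standing in for the semantic-similarity merge.
    weights:    hypothetical per-word weights, standing in for the
                target weight function that this excerpt omits.
    """
    kept = [synonyms.get(t, t) for t in tokens if t not in stop_words]
    counts = Counter(kept)
    return {term: count * weights.get(term, 1.0) for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical sample data, for illustration only.
stop_words = {"the", "is"}
synonyms = {"automobile": "car", "quick": "fast"}
weights = {"car": 2.0, "fast": 1.0}

vec_a = weighted_vector(["the", "car", "is", "fast"], stop_words, synonyms, weights)
vec_b = weighted_vector(["the", "automobile", "is", "quick"], stop_words, synonyms, weights)
print(cosine(vec_a, vec_b))  # close to 1.0: both texts reduce to the same vector
```

Merging synonyms before building the vectors shrinks the vocabulary and lets two texts that share no surface words still score as similar, which is the effect the abstract attributes to the second dimensionality reduction.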

Description

Technical field

[0001] The invention relates to the technical field of semantic networks, and in particular to a method for calculating text similarity based on the Gini index.

Background

[0002] In Chinese information processing, the calculation of text similarity is widely used in information retrieval, machine translation, automatic question answering, text mining, and other fields. At present, most text similarity algorithms are based on the vector space model, but this approach produces high-dimensional sparse vectors and leaves the importance and contribution of the feature vocabulary set to the text unknown. Moreover, such algorithms do not handle synonyms and polysemous words in text data well. These problems reduce the efficiency and accuracy of text similarity algorithms and degrade the performance of similarity calculation. To address these problems, the present invention provides a metho...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
CPC: G06F16/35, G06F40/289, G06F40/30
Inventor: 金平艳
Owner: SICHUAN YONGLIAN INFORMATION TECH CO LTD