Text similarity measuring system based on multi-feature fusion

A semantics-based text similarity measurement method and system that uses multi-feature fusion. It addresses the lack of semantics, large differences in text length, and low accuracy of similarity results.

Active Publication Date: 2015-06-10
XINJIANG TECHN INST OF PHYSICS & CHEM CHINESE ACAD OF SCI
4 Cites, 59 Cited by

AI Technical Summary

Problems solved by technology

[0016] The present invention provides a text similarity measurement system based on multi-feature fusion, which combines multiple features based on word frequency, word vectors, and Wikipedia tags to measure text similarity...


Examples


Embodiment

[0066] In order to make those skilled in the art better understand the present invention, the present invention will be described in further detail below in conjunction with the accompanying drawings:

[0067] As shown in Figure 1, the present invention comprises the following steps:

[0068] Training text preprocessing: The training text is preprocessed by word segmentation, stop word removal, and punctuation removal. For example, sentence A, "The leader reprimanded the staff", and sentence B, "The employee was criticized by the boss", are represented after word segmentation, stop word removal, and punctuation removal as A: [leadership, reprimand, employee] and B: [employee, boss, criticism];
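The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess`, the stop-word list, and the regex-based tokenization are all assumptions, and the English sentences stand in for the original text, which would need a dedicated word segmenter. Because no lemmatization is applied, the output differs slightly from the token lists given in [0068].

```python
import re
from typing import List

# Hypothetical stop-word list for illustration; the patent does not specify one.
STOP_WORDS = {"the", "a", "an", "was", "by", "of", "to"}

def preprocess(text: str) -> List[str]:
    """Split text into lowercase word tokens, dropping punctuation and stop words."""
    # \w+ matches runs of letters and digits, so punctuation is discarded here.
    tokens = re.findall(r"\w+", text.lower())
    # Filter out stop words while preserving the original word order.
    return [token for token in tokens if token not in STOP_WORDS]

tokens_a = preprocess("The leader reprimanded the staff.")
tokens_b = preprocess("The employee was criticized by the boss.")
print(tokens_a)
print(tokens_b)
```

A whitespace or regex split is only adequate for space-delimited languages; for the patent's source language, the segmentation call would be replaced by a proper segmenter while the stop-word and punctuation filtering stays the same.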

[0069] Word vector model training: To capture the semantic features between words in the text, a deep learning method is used to train on the text over multiple iterations, and each word in the training text set is represented as a 200-d...


Abstract

The invention provides a text similarity measuring system based on multi-feature fusion and relates to the field of intelligent information processing. The system measures text similarity by fusing multiple features based on word frequencies, word vectors, and Wikipedia labels. The invention aims to solve two problems of conventional text similarity measuring systems: the loss of semantics caused by ignoring context, and the low accuracy of similarity results caused by large differences in text length. The system is implemented by the following steps: preprocessing the training text, including word segmentation and stop word removal; using the processed training text as the corpus for training a word vector model; and, for each input text pair, computing the similarity based on word frequencies, the similarity based on word vectors, and the similarity based on Wikipedia labels, then taking their weighted sum as the final text semantic similarity result. The system can improve the accuracy of text similarity measurement and thereby meet the requirements of intelligent information processing.
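The weighted summation of the three similarities can be sketched as below. The abstract does not give the formulas or the weights, so everything here is an assumption for illustration: cosine similarity over raw term counts stands in for the word-frequency similarity, the function names are hypothetical, and the weights and the word-vector and Wikipedia-label scores are placeholder numbers, not results from the patent.

```python
import math
from collections import Counter
from typing import Sequence

def word_frequency_similarity(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """Cosine similarity between raw term-frequency vectors (an assumed formula)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no defined direction
    return dot / (norm_a * norm_b)

def fuse_similarities(similarities: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-feature similarity scores."""
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity score")
    return sum(w * s for w, s in zip(weights, similarities))

# Token lists taken from the example in the description.
tokens_a = ["leadership", "reprimand", "employee"]
tokens_b = ["employee", "boss", "criticism"]

sim_freq = word_frequency_similarity(tokens_a, tokens_b)
sim_vec = 0.8   # placeholder: would come from the word vector model
sim_wiki = 0.6  # placeholder: would come from Wikipedia label matching
weights = [0.2, 0.5, 0.3]  # hypothetical weights, not from the patent

final_score = fuse_similarities([sim_freq, sim_vec, sim_wiki], weights)
print(round(sim_freq, 3), round(final_score, 3))
```

The two example token lists share only one word, so the word-frequency score is low even though the sentences mean nearly the same thing; this is the semantic gap that the word-vector and Wikipedia-label features are meant to close.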

Description

Technical field

[0001] The invention relates to intelligent information processing within the field of information technology, and in particular to a semantics-based method and system for measuring text similarity.

Background art

[0002] Semantic similarity is a core technology in intelligent information processing and can be applied to query expansion, word sense disambiguation, question answering systems, information retrieval, and similar tasks. Assessing semantic similarity is also an important task in numerous research fields, such as psychology, cognitive science, and artificial intelligence.

[0003] Supervised and unsupervised methods are the two mainstream approaches to semantic similarity measurement. Supervised methods require prior knowledge, such as knowledge base systems or ontology resources like DBPedia, WordNet, and HowNet; unsupervised methods mainly use statistical learning methods to obtain context information and ru...


Application Information

IPC(8): G06F17/30
CPC: G06F16/30
Inventor: 马博, 李晓, 蒋同海, 周喜, 王磊, 杨雅婷, 赵凡
Owner: XINJIANG TECHN INST OF PHYSICS & CHEM CHINESE ACAD OF SCI