Short text similarity calculating method

A short text similarity calculation technology applied in the fields of intelligent human-computer dialogue and the Internet. It addresses the problem that existing methods consider only literal text similarity, which degrades matching quality and causes the human-computer dialogue system to reply to the user with inaccurate information.

Active Publication Date: 2017-03-08
EMOTIBOT TECH LTD

AI Technical Summary

Problems solved by technology

However, using the TF*IDF of the vector to calculate the cosine similarity and find the most similar question considers only the text similarity between sentences. That is, it judges similarity by how many segmented words are literally repeated across the two sentences, which is not enough. For example, "I'm sleepy" has the same semantics as "I want to sleep", but the two share almost no words literally, so this method cannot handle such cases.
In addition, since man-machine dialogue systems usually deal with short sentences, TF is basically 1 and contributes little, which further weakens the method.
[0005] Therefore, the defect in the prior art is that the cosine similarity between the word vectors of the user input and of the question of the candidate question-answer pair is calculated through the TF*IDF values of the segmented words, so only the text similarity between sentences is considered. Similarity can be judged only by how many segmented words are literally repeated, which makes the similarity judgment very inaccurate and directly leads to inaccurate replies to users in the man-machine dialogue system.
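
The limitation described above can be reproduced with a few lines of Python. This is a minimal sketch of the prior-art TF*IDF cosine baseline, not code from the patent: the toy corpus, the whitespace-style tokens, and the plain `log(N / df)` IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Map each token to TF * IDF, where TF is the raw count in the sentence."""
    tf = Counter(tokens)
    return {tok: count * idf.get(tok, 0.0) for tok, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, already segmented into words (illustrative data only).
corpus = [
    ["i", "am", "sleepy"],
    ["i", "want", "to", "sleep"],
    ["i", "want", "to", "eat"],
]

# Document frequency and a plain log(N / df) IDF (assumed formula).
doc_freq = Counter(tok for sent in corpus for tok in set(sent))
idf = {tok: math.log(len(corpus) / df) for tok, df in doc_freq.items()}

vectors = [tfidf_vector(sent, idf) for sent in corpus]

# Same meaning, almost no shared words: the only shared token is "i".
sim_sleepy = cosine(vectors[0], vectors[1])
# Different meaning, but the words "want" and "to" are repeated.
sim_related = cosine(vectors[1], vectors[2])

# In short sentences each word tends to occur once, so TF carries no signal.
max_tf = max(max(Counter(sent).values()) for sent in corpus)

print(sim_sleepy, sim_related, max_tf)
```

In this toy setup the paraphrase pair scores lower than the pair that merely repeats function words, and the largest term frequency in any sentence is 1, which matches the two weaknesses named in the text.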

Examples

Embodiment 1

[0036] Figure 1 shows a flow chart of a method for calculating similarity between short texts provided by the first embodiment of the present invention. As shown in Figure 1, the method for calculating similarity between short texts according to the first embodiment of the present invention includes:

[0037] Step S1: acquire corpus data and preprocess the corpus data to obtain a training corpus;

[0038] Step S2: obtain a keyword extraction model from the training corpus, segment the training corpus with a word segmentation tool, and train word2vec on it to obtain a word vector set;

[0039] Step S3: obtain the user input text and the questions of the candidate question-answer pairs, segment the candidate question and the user input text separately with the word segmentation tool, and extract keywords from the candidate question and the user input text separately with the keyword extraction model...
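
The data flow of steps S1 to S3 might be sketched as follows. The excerpt does not say which preprocessing, segmentation tool, or keyword extraction model the patent uses, so every function here is a hypothetical stand-in: `preprocess` lowercases and strips punctuation, `segment` splits on whitespace, and the "keyword extraction model" is simply an IDF table learned from the training corpus. The word2vec training from step S2 is omitted.

```python
import math
import re
from collections import Counter

def preprocess(raw_lines):
    """S1 (assumed form): lowercase, strip punctuation, drop empty lines."""
    cleaned = []
    for line in raw_lines:
        text = re.sub(r"[^\w\s]", " ", line.lower()).strip()
        if text:
            cleaned.append(text)
    return cleaned

def segment(text):
    """Stand-in for the word segmentation tool: split on whitespace."""
    return text.split()

def fit_keyword_model(training_corpus):
    """S2 stand-in: learn an IDF table from the segmented training corpus."""
    n_docs = len(training_corpus)
    doc_freq = Counter()
    for text in training_corpus:
        doc_freq.update(set(segment(text)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def extract_keywords(tokens, idf, top_k=1):
    """S3 stand-in: keep the top_k tokens with the highest learned IDF."""
    default = max(idf.values(), default=0.0)  # unseen words are treated as rare
    ranked = sorted(set(tokens), key=lambda t: (-idf.get(t, default), t))
    return ranked[:top_k]

# S1: raw corpus data (illustrative) -> training corpus.
raw = ["I want to sleep.", "I want to eat!", "", "I am sleepy.", "I am hungry."]
corpus = preprocess(raw)

# S2: keyword extraction model learned from the training corpus.
idf = fit_keyword_model(corpus)

# S3: segment and extract keywords for the user input and a candidate question.
user_tokens = segment(preprocess(["I am sleepy"])[0])
question_tokens = segment(preprocess(["I want to sleep"])[0])
user_keywords = extract_keywords(user_tokens, idf)
question_keywords = extract_keywords(question_tokens, idf)

print(user_tokens, user_keywords)
print(question_tokens, question_keywords)
```

For Chinese text a real segmenter would replace `segment`, since whitespace splitting only works for space-delimited languages.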

Abstract

The invention discloses a method for calculating short text similarity. The method comprises: acquiring corpus data and preprocessing it to obtain a training corpus; obtaining a keyword extraction model from the training corpus, segmenting the training corpus with a word segmentation tool, and training word2vec to obtain a word vector set; acquiring the text input by a user and the questions of candidate question-answer pairs, and obtaining a segmentation result and a keyword extraction result for each; computing, from the segmentation and keyword extraction results, the word vector sets of the candidate question and of the user input text; deriving a sentence vector from each word vector set; calculating the similarity of the two sentence vectors; and correcting that similarity according to information contained in the user input text and the candidate question to obtain a corrected similarity. In short, the cosine similarity between the sentence vectors of the user input text and of the candidate question is calculated, and the similarity is then corrected using sentence patterns, named entities and pronouns.
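
As a rough illustration of the sentence-vector step, the sketch below averages word vectors and compares the results with cosine similarity. Several things are assumptions rather than details from the abstract: the three-dimensional vectors in `WORD_VECTORS` are hand-made placeholders for word2vec output, and weighting keywords by `keyword_weight` is only one possible way of using the keyword extraction result. The correction based on sentence patterns, named entities and pronouns is not sketched, because the excerpt does not give its rules.

```python
import math

# Hypothetical 3-dimensional word vectors; real ones would come from word2vec.
WORD_VECTORS = {
    "i": [0.1, 0.0, 0.1],
    "am": [0.0, 0.1, 0.0],
    "sleepy": [0.9, 0.1, 0.0],
    "want": [0.1, 0.2, 0.1],
    "to": [0.0, 0.1, 0.1],
    "sleep": [0.8, 0.2, 0.1],
    "eat": [0.0, 0.2, 0.9],
}

def sentence_vector(tokens, keywords, keyword_weight=2.0):
    """Weighted average of word vectors; keywords weigh more (assumed scheme)."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = WORD_VECTORS.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary words
        weight = keyword_weight if tok in keywords else 1.0
        total = [t + weight * x for t, x in zip(total, vec)]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

user_vec = sentence_vector(["i", "am", "sleepy"], {"sleepy"})
sleep_vec = sentence_vector(["i", "want", "to", "sleep"], {"sleep"})
eat_vec = sentence_vector(["i", "want", "to", "eat"], {"eat"})

sim_sleep = cosine(user_vec, sleep_vec)
sim_eat = cosine(user_vec, eat_vec)
print(sim_sleep, sim_eat)
```

With these placeholder vectors, "I am sleepy" lands closer to "I want to sleep" than to "I want to eat" even though the first pair shares only the word "i", which is the behaviour a literal word-overlap measure cannot provide.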

Description

Technical field

[0001] The invention relates to the field of Internet technology, in particular to the field of intelligent man-machine dialogue.

Background

[0002] With the continuing informatization of human society and the rising cost of human services, people increasingly hope to communicate with computers through natural language, and human-computer intelligent chat systems have emerged against this background.

[0003] Existing human-computer dialogue systems are mainly implemented in two ways: the retrieval model and the generation model. The retrieval model regards a round of human-computer dialogue as an information retrieval process. It prepares a certain amount of data in advance and builds an index over the questions of the question-answer pairs (each pair consists of a question and several answers). When the user enters one or several sentences, it is regarded as a retrieval, an...
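
The retrieval model described in the background can be pictured with a small inverted index over the questions of the question-answer pairs. This is only a sketch under stated assumptions: the `QA_PAIRS` data is made up, and ranking candidates by the number of shared words is an illustrative choice, since the excerpt is cut off before it says how candidates are scored.

```python
from collections import defaultdict

# Illustrative question-answer pairs: one question, several answers each.
QA_PAIRS = [
    {"question": "what time is it", "answers": ["It is noon.", "Around twelve."]},
    {"question": "what is your name", "answers": ["I am a chatbot."]},
    {"question": "how is the weather", "answers": ["It looks sunny."]},
]

def build_index(qa_pairs):
    """Inverted index: word -> set of positions of questions containing it."""
    index = defaultdict(set)
    for i, pair in enumerate(qa_pairs):
        for word in pair["question"].split():
            index[word].add(i)
    return index

def retrieve(user_input, index, qa_pairs):
    """Return candidate pairs, ranked by number of shared words (assumed)."""
    hits = defaultdict(int)
    for word in set(user_input.lower().split()):
        for i in index.get(word, ()):
            hits[i] += 1
    ranked = sorted(hits, key=lambda i: (-hits[i], i))
    return [qa_pairs[i] for i in ranked]

index = build_index(QA_PAIRS)
candidates = retrieve("what is the time", index, QA_PAIRS)
for pair in candidates:
    print(pair["question"], "->", pair["answers"])
```

The candidates returned by such a lookup are what a similarity measure would then re-rank before an answer is chosen.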

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22
CPC: G06F40/194
Inventors: 简仁贤, 陈秀龙
Owner: EMOTIBOT TECH LTD