Short text similarity calculation method and system

A short text similarity calculation technology, applied in computing, computer components, and special data processing applications. It addresses problems such as ignoring the semantic information of words, failing to express the semantic meaning of sentences accurately, and unsatisfactory results, and it achieves a high accuracy rate.

Status: Inactive
Publication Date: 2018-07-27
NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT

AI Technical Summary

Problems solved by technology

Analyzing and processing short texts in this way has two main problems. First, because feature words in short texts are sparse, the text vectors produced by common text algorithms are too sparse, so clustering results are poor and fall short of what is achieved on long texts. Second, representing text with the vector space model considers only the statistical characteristics of words in context and assumes linear independence between keywords. It ignores the semantic information of the words themselves, so it has significant limitations and cannot accurately express the inner semantic meaning of a sentence.
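The second problem can be illustrated with a small sketch. It assumes a plain term-count representation (one simple form of the vector space model) and two hypothetical, already segmented short texts. The function name `bow_cosine` and the sample tokens are illustrative and are not taken from the patent.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between the raw term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical short texts: similar meaning, no shared tokens.
text_a = ["cheap", "phone", "deal"]
text_b = ["inexpensive", "mobile", "offer"]

same = bow_cosine(text_a, text_a)      # identical texts
disjoint = bow_cosine(text_a, text_b)  # related meaning, zero word overlap
print(same, disjoint)
```

Because the two texts share no tokens, the count-based similarity is zero even though a reader would consider them related. This is the kind of limitation that word-vector-based representations aim to address.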


Embodiment Construction

[0041] The present invention will be further described in detail below in conjunction with the accompanying drawings, so that those skilled in the art can implement it with reference to the description.

[0042] It should be understood that terms such as "having", "comprising" and "including" as used herein do not preclude the presence or addition of one or more other elements or combinations thereof.

[0043] As shown in Figure 1, a short text similarity calculation method includes the following steps:

[0044] S1. Obtain the training corpus, segment it into words, and train on it with the deep learning word2vec algorithm to obtain the word vector $(a_{1i}, a_{2i}, a_{3i}, \ldots)$ of each word in the training corpus; then combine the word vectors to form a word vector set $S$;

[0045] $S = \big((a_{11}, a_{21}, a_{31}, \ldots),\ (a_{12}, a_{22}, a_{32}, \ldots),\ (a_{13}, a_{23}, a_{33}, \ldots),\ \ldots,\ (a_{1i}, a_{2i}, a_{3i}, \ldots),\ \ldots,\ (a_{1N}, a_{2N}, a_{3N}, \ldots)\big)$

[0046] S2. Segment the shor...
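Paragraph [0046] is truncated in this copy, so the following is only a minimal sketch of steps S1 and S2 as the abstract summarizes them. It assumes the word2vec training has already been done and uses a small hand-written table in its place. The names `trained`, `vocab`, and `short_text_vectors`, the toy vectors, and the choice to skip out-of-vocabulary words are all assumptions, not details from the patent.

```python
# Hypothetical stand-in for word2vec output: word -> word vector.
# In the described method these vectors come from training on a segmented corpus.
trained = {
    "phone": [0.9, 0.1, 0.0],
    "mobile": [0.8, 0.2, 0.1],
    "cheap": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.1, 0.9],
}

# S1: the word vector set S holds one vector per vocabulary word (N = len(vocab)).
vocab = sorted(trained)
S = [trained[w] for w in vocab]


def short_text_vectors(tokens, table):
    """S2 (per the abstract): look up the vector of each segmented word.

    Skipping words that are missing from the table is an assumption.
    """
    return [table[t] for t in tokens if t in table]


# A hypothetical, already segmented short text.
T = short_text_vectors(["cheap", "phone", "unknownword"], trained)
print(len(S), len(T))
```

Keeping `vocab` in a fixed order means the i-th entry of `S` always refers to the same word, which matters if later steps build vectors indexed by vocabulary position.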


Abstract

The invention provides a short text similarity calculation method. The method comprises the following steps: S1, performing word segmentation on a training corpus, obtaining the word vector of each word with the word2vec algorithm, and combining the word vectors into a word vector set; S2, performing word segmentation on each short text to be compared, finding the word vector of each word of the short text in the word vector set, and combining these word vectors into a short text vector set; S3, calculating the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtaining the maximum similarity value for each word vector, and combining the maximum similarity values into a short text sentence vector; S4, calculating the similarity between the two short text sentence vectors to obtain the similarity between the two short texts. The invention also provides a short text similarity calculation system. With this similarity algorithm, short text sentences are represented by sentence vectors, the semantic similarity between short text sentences is captured effectively, and the accuracy is high.
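Steps S3 and S4 can be sketched as follows. This is one reading of the abstract, not the patent's reference implementation: the sentence vector is taken to have one entry per vector in the word vector set, each entry being the maximum cosine similarity to any word of the short text. The abstract does not name the similarity measure used in S4, so cosine similarity is assumed here. All function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_vector(word_vector_set, text_vectors):
    """S3: for each vector in the word vector set, keep its maximum cosine
    similarity to any word vector of the short text."""
    return [
        max((cosine(s, t) for t in text_vectors), default=0.0)
        for s in word_vector_set
    ]


def short_text_similarity(word_vector_set, vectors_a, vectors_b):
    """S4: compare the two sentence vectors (cosine is an assumed choice)."""
    return cosine(
        sentence_vector(word_vector_set, vectors_a),
        sentence_vector(word_vector_set, vectors_b),
    )


# Toy word vector set and three one-word short texts (illustrative values only).
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_a = [[1.0, 0.0]]
text_b = [[0.9, 0.1]]   # points in nearly the same direction as text_a
text_c = [[0.0, 1.0]]   # orthogonal to text_a

sim_ab = short_text_similarity(S, text_a, text_b)
sim_ac = short_text_similarity(S, text_a, text_c)
print(sim_ab > sim_ac)
```

Under this reading every sentence vector has the same length as the word vector set, so two short texts with different word counts can still be compared directly.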

Description

Technical field

[0001] The invention belongs to the technical field of short text similarity, and in particular relates to a short text similarity calculation method and system.

Background technique

[0002] With the rapid development of computer science and technology and the Internet, more and more data appears on the Internet in the form of short text, such as Weibo news, news headlines, and posted comments. Applying machine learning techniques such as classification and clustering to Internet short text data, in order to extract valuable information that makes people's lives more convenient and meets a variety of needs, has become a very popular topic in big data application technology. However, short Chinese texts are characterized by sparse words, scattered semantics, and arbitrary wording, which pose great challenges to research on short Chinese texts. Therefore, it is necessary to mine short text data and accurately recognize its inner meaning It...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06K9/62
CPC: G06F40/211, G06F40/289, G06F40/30, G06F18/22
Inventor: 王慧, 汪立东, 王博, 刘春阳, 张旭, 王萌, 李雄
Owner: NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT