Similarity calculation method based on text and semantics, server and storage medium

A technique combining semantic similarity and text similarity, applied in text database querying, unstructured text data retrieval, special data processing applications, and similar areas. It addresses the difficulty of handling diverse expressions in conversation and the problem of replies limited to a single form of expression, with the effects of ensuring diverse expressions, improving intelligence, and reducing the loss of text information.

Pending Publication Date: 2019-09-10
武汉斗鱼鱼乐网络科技有限公司

AI Technical Summary

Problems solved by technology

[0003] However, existing similarity calculations for conversational content often consider only semantics. After the text is segmented, semantic similarity is calculated through frequency statistics or text clustering based on vocabulary or vocabulary sets, or text similarity is calculated based on cosine ...



Examples


Embodiment 1

[0027] Referring to Figure 1, a schematic flowchart of a method for calculating similarity based on text and semantics provided by an embodiment of the present invention, the method includes:

[0028] S101. Select any two vectorized sentence texts from the corpus;

[0029] A corpus consists of language material that has appeared in real language use. It is generally a large-scale electronic text library, and it can be built by manually collecting language material from a specific field, for example collecting bullet screen (barrage) material from the live-streaming field to form a bullet screen corpus. Vectorized representation means expressing the sentence text as a series of vectors that capture the semantics of the text. It is generally realized through word vectorization, and it makes the sentence text easier for a computer to recognize and process.
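As a rough illustration of what "vectorized representation realized by word vectorization" can look like, the sketch below averages per-word vectors into one sentence vector. This is a simplified stand-in, not the method of the patent (Embodiment 2 names doc2vec). The embedding table `TOY_EMBEDDINGS`, its values, and the function name `sentence_vector` are hypothetical and chosen only for the example.

```python
from typing import Dict, List

# Hypothetical toy embedding table; a real system would load trained vectors.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "nice": [0.9, 0.1, 0.0],
    "singing": [0.2, 0.8, 0.1],
    "voice": [0.1, 0.9, 0.2],
}


def sentence_vector(
    tokens: List[str], table: Dict[str, List[float]], dim: int = 3
) -> List[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        # No known token: fall back to a zero vector of the expected size.
        return [0.0] * dim
    # zip(*known) iterates over the vectors dimension by dimension.
    return [sum(col) / len(known) for col in zip(*known)]


vec = sentence_vector(["nice", "singing", "voice", "unknown"], TOY_EMBEDDINGS)
print(vec)
```

Averaging discards word order, which is one reason a document-level model may be preferred in practice.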

[0030] Preferably, a corpus is formed by collecting conversational materials ...

Embodiment 2

[0057] Figure 2 is another schematic flowchart of the text- and semantics-based similarity calculation method, provided by Embodiment 2 of the present invention. Building on Embodiment 1, it describes the method in detail, taking the actual calculation of the similarity of two sentences as an example:

[0058] Collect language material from a specific field to form a corpus. This embodiment takes a bullet screen corpus as an example: collect the bullet screen material, then remove duplicates, remove phrase numbers, remove sensitive words, and shield vulgar sentences, among other cleaning steps. Next, perform jieba word segmentation and use doc2vec to vectorize all the text in the bullet screen corpus.
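The cleaning steps described above might be sketched as follows, using only the standard library. The function name `clean_corpus`, the sample messages, and the blocked-word set are hypothetical. Reading "removing phrase numbers" as dropping number-only messages is an assumption, and dropping any message that contains a blocked word is a simplification of the sensitive-word and vulgar-sentence handling. The default tokenizer splits on whitespace; for Chinese text, a segmenter such as jieba's `lcut` could presumably be passed as `tokenize` instead.

```python
import re
from typing import Callable, Iterable, List, Set


def clean_corpus(
    lines: Iterable[str],
    sensitive_words: Set[str],
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[List[str]]:
    """Deduplicate, drop number-only messages, filter blocked text, tokenize."""
    seen: Set[str] = set()
    cleaned: List[List[str]] = []
    for raw in lines:
        text = raw.strip()
        if not text or text in seen:
            continue  # skip empty lines and exact repeats
        seen.add(text)
        if re.fullmatch(r"[\d\s]+", text):
            continue  # assumption: messages made only of digits are noise
        if any(word in text for word in sensitive_words):
            continue  # simplification: drop the whole message
        cleaned.append(tokenize(text))
    return cleaned


sample = ["nice voice", "nice voice", "666", "badword here", "sweet singing voice"]
tokens = clean_corpus(sample, sensitive_words={"badword"})
print(tokens)
```

The resulting token lists are the kind of input a document-embedding model would then be trained on.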

[0059] S201. Select vectorized texts A and B;

[0060] Assume that the vectorized bullet screen texts are: A = "Miss Sister has a nice singing voice, I like it very much" and B = "Miss Sister with beautiful voice and sweet singing voice".

[0061] Text A and B can be represented by doc2vec vectors as:

[0062]

[0063]

[0064] Among them...

Embodiment 3

[0085] Figure 3 is another schematic flowchart of the text- and semantics-based similarity calculation method, provided by Embodiment 3 of the present invention. Building on Embodiment 1, it further describes the calculation of the semantic and text similarity of sentences in step S104, as follows:

[0086] Calculating both the semantic similarity and the text similarity of sentence texts makes it possible to determine how similar any two sentences are in semantic expression and in textual composition. If only semantic similarity is calculated, the form of expression of the sentences will be too uniform. If only text similarity is calculated, texts with the same composition but different expressions will occur. Therefore, by combining the two, the computer can reply with varied text content in a man-machine dialogue, based on the existing corpus.

[0087] It should be noted that this implementation is only a specific application provided by the comprehensi...


Abstract

The embodiments of the invention disclose a similarity calculation method based on text and semantics, a server, and a storage medium, belonging to the field of natural language processing. The method comprises: selecting any two sentence texts in vectorized representation; calculating the projection distance of the two texts in vector space to obtain their semantic similarity; calculating the text similarity from the proportion that the overlapping part of the two texts occupies in the global text formed by the two sentence texts; and setting a weight for the text similarity and for the semantic similarity respectively, then calculating a combined score. With this technical scheme, both the semantic similarity and the text similarity of conversational text are taken into account, diversified expressions are made available for computer conversation, and monotonous, repetitive replies are effectively avoided.
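The steps in the abstract can be sketched as below, under several stated assumptions. The "projection distance" in vector space is interpreted here as cosine similarity. The "proportion of the overlapping part in the global text" is interpreted as the share of shared tokens among all distinct tokens of both sentences (a Jaccard-style ratio). The default weights of 0.5 are placeholders. None of these choices, nor the function names, are confirmed by the source text.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed reading of the semantic similarity: cosine of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def overlap_similarity(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """Assumed reading of the text similarity: shared tokens / all distinct tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def combined_score(
    vec_a: Sequence[float],
    vec_b: Sequence[float],
    tokens_a: Sequence[str],
    tokens_b: Sequence[str],
    w_semantic: float = 0.5,  # hypothetical weight
    w_text: float = 0.5,      # hypothetical weight
) -> float:
    """Weighted sum of the semantic and the text similarity."""
    semantic = cosine_similarity(vec_a, vec_b)
    text = overlap_similarity(tokens_a, tokens_b)
    return w_semantic * semantic + w_text * text


score = combined_score([1.0, 0.0], [0.0, 1.0], ["a", "b"], ["a", "b"])
print(score)
```

Raising `w_text` relative to `w_semantic` favors candidates that share wording, while lowering it favors candidates that share meaning but are phrased differently.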

Description

Technical field

[0001] The invention relates to the field of natural language processing, in particular to a similarity calculation method based on text and semantics, a server and a storage medium.

Background

[0002] When conducting a man-machine dialogue, or when a computer receives natural language instructions, the computer must process and understand human language. Because human language is expressed in diverse ways, natural language processing (NLP) inevitably has to deal with diversity in both literal meaning and semantics.

[0003] However, existing similarity calculations for conversational content often consider only semantics. After the text is segmented, semantic similarity is calculated through frequency statistics or text clustering based on vocabulary or vocabulary sets, or text similarity is calculated based on cosine. Considering only the literal similarity of the text or only its semantics, it is difficult ...


Application Information

IPC(8): G06F16/33
CPC: G06F16/3344
Inventor 徐乐乐
Owner 武汉斗鱼鱼乐网络科技有限公司