Text similarity measurement method based on semantic document expression

A text similarity measurement technique based on semantic document expression. It addresses problems in existing methods, which do not consider the semantic and grammatical meaning of text, produce large errors, or ignore the relationships between words and word positions. The stated effects are outstanding substantive features, a simple structure, and broad application prospects.

Inactive Publication Date: 2020-07-24
山东山大鸥玛软件股份有限公司

AI Technical Summary

Problems solved by technology

[0009] There are certain problems in the above-mentioned semantic similarity model methods based on literal matching and latent semantic analysis in the prior art. The former does not take into account the semantic and grammatical meaning of the text, an


Examples

Embodiment 1

[0063] As shown in Figure 1, the present invention provides a text similarity measurement method based on semantic document expression, comprising the following steps:

[0064] S1. Obtain the first text and the second text to be compared, perform word segmentation preprocessing on the sentences of each text, and remove punctuation marks;

[0065] S2. After the first text and the second text are preprocessed, obtain each word and map it to a word vector, and match the word vectors to the inputs of the convolutional neural network (CNN) model and the bidirectional long short-term memory (BiLSTM) recurrent network model;

[0066] S3. Process each text through the CNN model and the BiLSTM model, and extract the CNN sentence semantic feature vector and the BiLSTM sentence semantic feature vector of each text;

[0067] S4. For each sentence semantic feature of each text, use the attention mechanism mode...
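The CNN branch of step S3 can be illustrated with a minimal sketch. It assumes a single one-dimensional convolution layer with ReLU activation sliding over consecutive word vectors; the listing does not give the layer configuration, so the function name conv1d_features, the window size, and the filter weights are hypothetical, and the BiLSTM branch is not shown.

```python
def conv1d_features(word_vectors, filters, biases, window):
    """Slide each filter over `window` consecutive word vectors and apply ReLU.

    word_vectors: list of equal-length lists, one per word (output of step S2).
    filters:      list of flat weight lists, each of length window * vector_dim.
    biases:       one bias per filter.
    Returns one feature vector (one value per filter) for every window position.
    All parameter values are illustrative assumptions, not taken from the patent.
    """
    features = []
    for start in range(len(word_vectors) - window + 1):
        # Flatten the current window of word vectors into a single patch.
        patch = [x for vec in word_vectors[start:start + window] for x in vec]
        features.append([
            max(0.0, sum(w * x for w, x in zip(f, patch)) + b)
            for f, b in zip(filters, biases)
        ])
    return features


# Three 2-dimensional word vectors, two hypothetical filters over 2-word windows.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, -1.0, -1.0]]
cnn_features = conv1d_features(sentence, filters, biases=[0.0, 0.5], window=2)
```

Each row of cnn_features corresponds to one window position, which is the kind of per-position sentence feature that a later attention step could weight.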

Embodiment 2

[0070] As shown in Figure 2, the present invention provides a text similarity measurement method based on semantic document expression, comprising the following steps:

[0071] S1. Obtain the first text and the second text to be compared, perform word segmentation preprocessing on the sentences of each text, and remove punctuation marks; the specific steps are as follows:

[0072] S11. Obtain the domains of the first text and the second text to be compared;

[0073] S12. Construct a professional dictionary according to the text calculation target;

[0074] S13. According to the text domain and the constructed professional dictionary, perform word segmentation with the word segmentation tool, remove stop words and punctuation marks, and set the length of the sentence to be segmented; the word segmentation tool is the jieba ("stutter") word segmentation tool;

[0075] S2. After the first text and the second text are preprocessed, each word is obtained and mapped, and a...
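Steps S11 to S13 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a simple forward maximum matching segmenter stands in for the jieba tool so that the example has no external dependency, the dictionary and stop-word contents are hypothetical, and "setting the sentence length" is read here as padding or truncating to a fixed number of words.

```python
import unicodedata

# Hypothetical professional dictionary and stop-word list (step S12);
# the patent does not list their contents.
DOMAIN_DICT = {"文本", "相似度", "神经网络"}
STOP_WORDS = {"的", "是", "了"}
PAD = "<pad>"


def segment(sentence, dictionary, max_word_len=4):
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character. Stand-in for jieba."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_word_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def keep(word, stop_words):
    """Drop stop words, whitespace, and tokens made only of punctuation."""
    if word in stop_words or word.isspace():
        return False
    return not all(unicodedata.category(ch).startswith("P") for ch in word)


def preprocess(sentence, dictionary=DOMAIN_DICT, stop_words=STOP_WORDS, length=5):
    """Segment, filter, then pad or truncate to a fixed sentence length."""
    words = [w for w in segment(sentence, dictionary) if keep(w, stop_words)]
    return (words + [PAD] * length)[:length]


tokens = preprocess("文本的相似度，是神经网络！")
# tokens == ["文本", "相似度", "神经网络", "<pad>", "<pad>"]
```

With jieba installed, the segment function could be replaced by a call to the tool after loading the professional dictionary as a user dictionary; the filtering and length handling would stay the same.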

Embodiment 3

[0096] As shown in Figure 3, in Embodiment 2 above, the specific steps of step S22 are as follows:

[0097] S221. Obtain each word in the sentence sequence of each text, and set the current word;

[0098] S222. Determine whether the current word has a word vector;

[0099] If so, map the current word to its word vector and go to step S224;

[0100] If not, go to step S223;

[0101] S223. Perform secondary segmentation on the current word to obtain subwords, and use the mean of the word vectors of all its subwords as the word vector of the current word;

[0102] If a subword still has no word vector, continue splitting that subword, and return the mean of the word vectors at each layer to the layer above;

[0103] If there is still no word vector after splitting down to single characters, mark it as an unknown word and represent it with a zero vector of the corresponding length;

[0104] S224....
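The fallback in steps S222 and S223 is naturally recursive, and a minimal sketch may make the layering clearer. Two points are assumptions made for illustration: the secondary segmentation is approximated by splitting a word in half (the patent does not say how subwords are obtained), and a single character without a vector contributes a zero vector to the mean of its layer. The names word_vector, split_word, and the lookup table are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def split_word(word):
    """Hypothetical secondary segmentation: split the word into two halves."""
    mid = len(word) // 2
    return [word[:mid], word[mid:]]


def word_vector(word, table, dim):
    """Look up a word vector, falling back to the mean of subword vectors."""
    if word in table:                      # S222: a vector exists, use it
        return list(table[word])
    if len(word) <= 1:                     # single character, still unknown
        return [0.0] * dim                 # zero vector of matching length
    # S223: split, resolve each subword recursively, return the mean upward.
    return mean_vector([word_vector(sub, table, dim) for sub in split_word(word)])


# Hypothetical 2-dimensional embedding table.
table = {"深度": [1.0, 3.0], "学": [3.0, 1.0], "习": [5.0, 5.0]}
vec = word_vector("深度学习", table, dim=2)   # mean of "深度" and mean("学", "习")
unknown = word_vector("犇", table, dim=2)     # no entry at any level
```

Here "深度学习" has no entry, so it is split into "深度" (found) and "学习" (not found, split again into "学" and "习"), and each layer returns its mean to the layer above.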

Abstract

The invention provides a text similarity measurement method based on semantic document expression, which comprises the following steps: acquiring two texts to be compared, and performing word segmentation preprocessing on the sentences of each text; mapping the preprocessed words of the two texts to generate word vectors; processing each text through a convolutional neural network (CNN) model and a bidirectional long short-term memory (BiLSTM) recurrent network model, and extracting the CNN sentence semantic features and the BiLSTM sentence semantic features of each text; for each sentence semantic feature of each text, capturing an attention feature through an attention mechanism model, generating a weight vector, calculating a weighted sum, and generating a CNN semantic representation vector and a BiLSTM semantic representation vector; splicing the two semantic representation vectors of each text to generate a vocabulary semantic association feature vector; and constructing a similarity calculation function from the vocabulary semantic association feature vectors of the two texts, and calculating the similarity of the sentences of the two texts.
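The attention, splicing, and similarity steps of the abstract can be sketched as follows. The abstract does not specify the attention scoring function or the similarity function, so this sketch assumes dot-product scoring against a query vector, softmax weights, and cosine similarity; all function names and the query parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Weight each feature vector by softmax(dot(query, feature)) and sum."""
    scores = [sum(q * x for q, x in zip(query, f)) for f in features]
    weights = softmax(scores)                      # the weight vector
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def text_representation(cnn_features, bilstm_features, cnn_query, bilstm_query):
    """Splice (concatenate) the two attention-pooled representation vectors."""
    return (attention_pool(cnn_features, cnn_query)
            + attention_pool(bilstm_features, bilstm_query))


def cosine_similarity(u, v):
    """Assumed similarity function; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# With a zero query every position gets the same weight, so pooling is a mean.
pooled = attention_pool([[1.0, 0.0], [3.0, 2.0]], query=[0.0, 0.0])
rep = text_representation([[1.0, 0.0], [3.0, 2.0]], [[0.5, 0.5]],
                          cnn_query=[0.0, 0.0], bilstm_query=[1.0, 1.0])
score = cosine_similarity(rep, rep)
```

In a trained model the query vectors would be learned parameters; here they are fixed only to keep the example self-contained.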

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and in particular relates to a text similarity measurement method based on semantic document expression.

Background

[0002] For intelligent grading of text-based questions, it is necessary to consider whether candidates have given similar answers and whether question stems have been plagiarized, and measuring the similarity between candidates' answers and reference answers is a practical requirement. Providing a more reasonable and effective text similarity measurement method for intelligent grading is the focus and difficulty of research. Existing semantic similarity calculation methods can be summarized into three categories:

[0003] A semantic similarity calculation method based on literal matching. Typical examples are the semantic similarity calculation methods based on LCS and TF-IDF. The semantic similarity calculation method based o...


Application Information

IPC(8): G06F40/194, G06F40/289, G06F40/35, G06N3/04
CPC: G06N3/048, G06N3/044, G06N3/045
Inventors: 马磊, 邢金宝, 袁峰, 薛勇
Owner 山东山大鸥玛软件股份有限公司