Method and system for judging text similarity

A text similarity judgment technology, applied in special data processing applications, instruments, electrical digital data processing and other fields. It can handle various types of text judgment with a high recognition rate and achieves accurate judgment results.

Inactive Publication Date: 2018-04-27
CHINA TECHENERGY +1
4 Cites, 28 Cited by

AI Technical Summary

Problems solved by technology

[0005] To overcome the shortcomings of the three text similarity judgment algorithms in the prior art, the present invention provides a method and system for judging text similarity that can handle various types of text judgment with a high recognition rate.


Examples

Embodiment 1

[0041] As shown in Figure 1, this embodiment provides a method for judging text similarity based on a Siamese network. First, a vector space model (VSM) is established for the text data; this process includes preprocessing, word segmentation, and stop-word removal, and quantifies the text into a processable feature vector. Next, a Siamese network (also called a twin network) is constructed to extract, from the feature vectors, the semantic similarity features of a sample pair. Finally, a triplet loss function is constructed and used to judge the relevance of text pairs. The method specifically includes:

[0042] S1. Construct a vector space model to quantify text into processable objects;

[0043] In text processing, the text must first be quantized into a processable object. Preferably, a vector space model (VSM) is constructed for this purpose, including: 1, text ...
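The VSM step described above (preprocessing, word segmentation, stop-word removal, quantization into a feature vector) can be illustrated with a minimal sketch. It is not the patent's implementation: the stop-word list, the regex tokenizer (a stand-in for word segmentation, which for Chinese text would need a dedicated segmenter), and the raw term-frequency weighting are all assumptions, since the source text is truncated before it gives those details.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Preprocess and segment: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(texts):
    """Map each distinct term in the corpus to a fixed vector index."""
    terms = sorted({t for text in texts for t in tokenize(text)})
    return {term: i for i, term in enumerate(terms)}

def vectorize(text, vocabulary):
    """Return a raw term-frequency vector (assumed weighting) for one text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for term, n in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = n
    return vector

# Illustrative two-document corpus (made-up sentences).
corpus = ["The pump shall stop on a trip signal.",
          "Trip signal handling of the pump."]
vocab = build_vocabulary(corpus)
vectors = [vectorize(t, vocab) for t in corpus]
```

Every text is mapped into the same fixed-length space, so the resulting vectors can be fed to a downstream feature extractor regardless of the original text length.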

Embodiment 2

[0074] This embodiment builds on Embodiment 1. To make the trained model more flexible, the three branches of the Siamese network may have different weights and different numbers of layers; that is, the three functions are independent of one another and are tied together only in the final distance calculation. Content not repeated here is the same as in Embodiment 1. Specifically, as shown in Figure 6, in this embodiment the cosine of the corresponding angle in S3 and S4 of Figure 1 is preferably calculated with a ternary metric function. The triplet samples (x', x, x') pass through feature extraction with different network parameters W, giving the semantic feature expressions of the three samples, denoted respectively as

[0075] $G_{w_1}(x')$, $G_{w_2}(x)$, $G_{w_3}(x')$;

[0076] When $D(G_{w_1}(x'), G_{w_2}(x)) - D(G_{w_3}(x'), G_{w_2}(x)) > \alpha$, it is judged as similar, otherwise it is judged as dissimilar; where...
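The decision rule in [0076] can be sketched as follows, under stated assumptions. D is taken here to be the cosine of the angle between two feature vectors, and each branch G_w is a toy single-layer map (linear transform followed by tanh); neither is confirmed by the truncated source. The extracted text does not preserve which x' is the similar sample, so the two outer samples are simply called `x_first` and `x_third`. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v (assumed form of the metric D)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def make_branch(weights):
    """Return a toy feature extractor G_w: one linear layer followed by tanh."""
    def g(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return g

def judge(g_w1, g_w2, g_w3, x_first, x, x_third, alpha):
    """Apply D(G_w1(x_first), G_w2(x)) - D(G_w3(x_third), G_w2(x)) > alpha."""
    d_first = cosine(g_w1(x_first), g_w2(x))
    d_third = cosine(g_w3(x_third), g_w2(x))
    return d_first - d_third > alpha

# Three branches with unrelated weights, as Embodiment 2 allows.
g1 = make_branch([[1.0, 0.0], [0.0, 1.0]])
g2 = make_branch([[0.5, 0.0], [0.0, 0.5]])
g3 = make_branch([[2.0, 0.0], [0.0, 2.0]])

similar = judge(g1, g2, g3, [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], alpha=0.5)
swapped = judge(g1, g2, g3, [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], alpha=0.5)
```

Because the branches share no parameters, they meet only inside `judge`, which mirrors the statement that the three functions are associated only in the final distance calculation.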

Embodiment 3

[0079] Technical documents for software development, especially those used in the nuclear power field, should be compiled in accordance with standard specifications, and their titles are highly generalized. Treating the titles the same as the body text would lose much of the important information that titles contribute to classification. Therefore, the contribution of paragraph titles to the text similarity measure should be properly considered during training and testing. As shown in Figure 7, in this embodiment, preferably, a pair of texts is selected as the input in step S2 of Embodiment 1 or Embodiment 2, denoted as $(x_i, x_j)$; the paragraph title and body of each text are divided into two parts, and at the same time the bodies and titles of the two texts are merged as input.

[0080] Specifically, first select a pair of texts as input, denoted as $(x_i, x_j)$, which can be similar or dissimilar; divide the paragraph title and body of each text into two parts, and at the same t...

Abstract

The invention belongs to the technical field of text classification and provides a method and system for judging text similarity, which overcome the respective defects of three text similarity judgment algorithms in the prior art. The method comprises the following steps: 1, a vector space model is established to quantize text into a processable object; 2, a Siamese network is used to establish a text semantic similarity extraction model, in which a text semantic extraction network and a similarity judgment network are connected in series and optimized simultaneously at the sample training stage; 3, based on the semantic feature expressions of the samples at the training stage, a text similarity calculation function based on the cosine of the angle between feature vectors and a final loss function are established; 4, two texts to be tested are input; after their semantic features are extracted by the Siamese network, the cosine of the angle between the two vectors is calculated and compared with a set threshold; when it is larger than the threshold the texts are judged similar, otherwise they are judged dissimilar.
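Step 4 of the abstract can be illustrated with a short sketch. It assumes the two inputs are feature vectors already produced by the trained network, which is not modelled here; the default threshold of 0.8 and the function names are hypothetical, as the source gives no value.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def judge_similarity(features_a, features_b, threshold=0.8):
    """Judge 'similar' when the cosine exceeds the threshold (0.8 is an assumed value)."""
    score = cosine_similarity(features_a, features_b)
    return "similar" if score > threshold else "dissimilar"

# Parallel vectors have cosine 1, orthogonal vectors have cosine 0.
print(judge_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(judge_similarity([1.0, 0.0], [0.0, 1.0]))
```

Note that the quantity compared against the threshold grows with similarity, so a larger value means the texts are closer even though the abstract calls it a distance.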

Description

Technical field

[0001] The present invention relates to the technical field of text classification, in particular to the verification and validation of nuclear safety-level software; more specifically, it relates to a method and system for judging text similarity.

Background technique

[0002] In the verification and validation (V&V) of nuclear safety-level software, it is necessary to evaluate the execution documents, analyze traceability, analyze risks, and so on. As the number of technical documents keeps growing, repeating these activities at each stage of each project requires a great deal of manpower. Therefore, in document evaluation, automatically identifying the items to be evaluated, automatically judging the semantic relevance of upper- and lower-level documents in traceability analysis, and automatically matching the failure modes of similar products in risk analysis have become V&V problems that people nee...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/22
CPC: G06F40/194, G06F40/289, G06F40/30
Inventor: 冯素梅, 江国进, 孙永滨, 白涛, 杜乔瑞, 王晓燕, 张亚栋, 徐先柱
Owner: CHINA TECHENERGY