Text semantic similarity information processing method and system based on multi-model fusion

An information processing method based on text semantic similarity, applied in natural language data processing, semantic analysis, neural learning methods, etc. It addresses unacceptable feedback time, heavy computation, and wasted hardware resources, with the effect of speeding up real-time feedback and reducing the amount of calculation.

Pending Publication Date: 2020-12-04
GLOBAL TONE COMM TECH
Cites: 4; Cited by: 8

AI Technical Summary

Problems solved by technology

[0005] Most existing research-oriented methods for computing text semantic similarity train deep neural network models with supervised learning. Such algorithms need a large number of labeled samples, but labeled data is often scarce, especially at the start of a project. Labeling text also differs from labeling images: it requires a subjective understanding of the article, so the demands on annotators are higher. Large-scale supervised learning is therefore impractical in industrial settings at the start of a project.
[0006] Deep neural network algorithms also require a large amount of calculation. This is feasible on small data sets, but on industrial-scale data ranging from a few gigabytes to several terabytes or even petabytes, finding the articles similar to a given article means running a single neural network hundreds of millions of times, so the feedback time is bound to be unacceptable.
[0007] Most existing semantic similarity detection algorithms in industry are character-based prior-probability statistical models. They cannot capture context or word order, so they amount to only a shallow semantic similarity calculation.
[0008] From the above analysis, the problems and defects of the existing technology are: (1) existing text semantic similarity calculation methods train models with supervised learning, which requires a large number of labeled samples and a large amount of calculation;
[0009] (2) existing semantic similarity detection algorithms are mostly character-based prior-probability statistical models, which cannot capture context or word order;
[0010] (3) existing deep-learning models, such as Siamese_LSTM, RCNN, and DSSM, are computationally heavy, need high-specification GPU servers, and carry high hardware costs.
[0014] Solving defect (3) only requires funding to purchase high-specification servers. However, such a system is developed mainly for a specific group of users, with a small audience and a low utilization rate, so the hardware is likely to be wasted.
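
The word-order limitation described in [0007] can be illustrated with a small sketch. It uses a plain token-count cosine as a stand-in for a purely statistical model; this is an assumption made for illustration, not the specific models the text refers to, and the function name and sample sentences are hypothetical.

```python
import math
from collections import Counter


def bag_cosine(a: str, b: str) -> float:
    """Cosine similarity between token-count vectors (word order is discarded)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two sentences with the same words but opposite meaning (hypothetical examples).
s1 = "the valve controls the pump"
s2 = "the pump controls the valve"

score = bag_cosine(s1, s2)
print(round(score, 6))  # the counts are identical, so the score is about 1.0
```

Because both sentences produce the same counts, a count-only model cannot tell them apart, which is the sense in which such similarity is "shallow".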

Examples

Embodiment 1

[0108] For the patent data in the patent database, the title, abstract, claims, and specification are processed with different models. The four parts differ considerably in word-frequency distribution, text length, and syntactic structure, so each needs its own model.

[0109] Titles and abstracts are short, consist mostly of technical terms and the vocabulary that explains them, and are written concisely. For the title and abstract, word segmentation is therefore performed first, keywords are then extracted, and the keywords are fed to the word vector model to be converted into the corresponding word vectors. The word vector model is used here because it is an unsupervised model: it slides a window over the article to cut out fragments and, as shown in Figure 4, uses the center word to predict the context words, the ...
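
The sliding-window idea in [0109] can be sketched as follows. This is a minimal illustration of how (center word, context word) training pairs might be generated, assuming the text has already been segmented and reduced to keywords. The function name, the window size, and the sample keywords are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Slide a window over the tokens and emit (center, context) pairs.

    Each pair is one training example in which the center word is used
    to predict a word from its surrounding context.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical keywords extracted from a title after word segmentation.
keywords = ["text", "semantic", "similarity", "model", "fusion"]
pairs = skipgram_pairs(keywords, window=1)
for center, context in pairs:
    print(center, "->", context)
```

No labels are needed to build these pairs, which is why a model trained this way counts as unsupervised.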

Embodiment 2

[0132] The text semantic similarity information processing method based on multi-model fusion includes:

[0133] Step 1, perform word segmentation operations on the titles and abstracts of the papers in the paper database.

[0134] Step 2, use the title and abstract to train the word vector model.

[0135] Step 3, split the full-text data in the paper into large chapters such as introduction, background, experiment, and effect comparison, and perform word segmentation for each chapter.

[0136] Step 4, use the word list of each chapter obtained in the above step 3 to train the sentence vector model.

[0137] Step 5, save the word vector model obtained in step 2 and the sentence vector model obtained in step 4.

[0138] Step 6, use the word vector model and the sentence vector model to perform sub-module feature extraction on the papers in the local paper database.

[0139] Step 7, build a feature fusion method, and fuse the features obtained in the above step 6.

[0140] Step ...
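
Steps 6 and 7 above can be sketched as follows. The visible text does not say how the features are fused, so this sketch assumes one simple option: average the keyword word vectors for the title and abstract, L2-normalize each module's feature, and concatenate them in a fixed order. All function names, module names, and vector values are hypothetical.

```python
import math
from typing import Dict, List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so no single module dominates the fusion."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Average several word vectors into one feature (an assumed pooling choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse(features: Dict[str, List[float]], order: List[str]) -> List[float]:
    """Concatenate the normalized per-module features in a fixed order."""
    fused: List[float] = []
    for name in order:
        fused.extend(l2_normalize(features[name]))
    return fused


# Hypothetical 2-dimensional features for one paper.
paper = {
    "title_abstract": mean_vector([[1.0, 0.0], [0.0, 1.0]]),  # word-vector feature
    "introduction": [3.0, 4.0],                               # sentence-vector feature
}
fused = fuse(paper, ["title_abstract", "introduction"])
print(len(fused))
```

Keeping the module order fixed matters: two papers can only be compared if the same positions in their fused vectors describe the same module.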

Abstract

The invention belongs to the technical field of patent retrieval and discloses a text semantic similarity information processing method and system based on multi-model fusion. The method comprises: obtaining patent data; segmenting the titles, abstracts, claims, and specifications of the patents into words and processing them with different models to obtain the corresponding word vector features and sentence vector features; fusing the word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims, and the sentence vector features of the specification into a combined feature vector for each patent; and calculating the similarity between the combined feature vector of a patent and the combined feature vectors of the other patents in the database. Because an unsupervised learning model is used, the algorithm's need for annotated data is greatly reduced; sentence vectors allow the deep semantic features of an article to be mined; the amount of real-time calculation is greatly reduced; and feedback is faster.
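
The final similarity step can be sketched as a ranking over precomputed combined feature vectors. The abstract does not name the similarity measure, so cosine similarity is assumed here; the patent identifiers and vector values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_similar(query_id: str,
                 database: Dict[str, List[float]],
                 top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank all other patents by similarity to the query patent's combined vector."""
    query = database[query_id]
    scores = [(pid, cosine(query, vec))
              for pid, vec in database.items() if pid != query_id]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Hypothetical combined feature vectors, computed offline and stored.
database = {
    "P1": [1.0, 0.0, 1.0, 0.0],
    "P2": [0.9, 0.1, 1.0, 0.0],
    "P3": [0.0, 1.0, 0.0, 1.0],
}
ranked = rank_similar("P1", database)
print(ranked[0][0])  # the most similar patent to P1
```

Since the combined vectors are stored in advance, a query only needs vector comparisons rather than a fresh model run per document, which is consistent with the reduced real-time calculation the abstract describes.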

Description

Technical field

[0001] The invention belongs to the technical field of patent retrieval, and in particular relates to a method and system for processing text semantic similarity information based on multi-model fusion.

Background technique

[0002] Text semantic similarity calculation is currently an important research direction in natural language processing. Its results are widely used in retrieval systems, plagiarism-checking systems, and the like, where they help users quickly find what they want, uncover users' deeper needs, and avoid differences in results caused by different ways of expressing the same thing. It therefore has high academic research value and industrial application value.

[0003] Research on text semantic similarity calculation can be roughly divided into two directions. One is scientific research, carried out mostly by university scholars or research staff in enterprises. The commonly used technica...

Application Information

IPC(8): G06F40/289, G06F40/30, G06F16/33, G06N3/04, G06N3/08
CPC: G06F40/289, G06F40/30, G06F16/3344, G06N3/049, G06N3/08, G06N3/045
Inventors: 杨万征, 蔡超, 程国艮
Owner: GLOBAL TONE COMM TECH