Question semantic matching method based on optimized BERT

A semantic matching technology applied in semantic analysis, neural learning methods, natural language data processing, and related fields. It addresses problems such as the low quality of sentence vectors and their difficulty in reflecting semantic similarity, and it achieves fast results.

Pending Publication Date: 2022-03-22
Institute of Medical Information, Chinese Academy of Medical Sciences (中国医学科学院医学信息研究所)

AI Technical Summary

Problems solved by technology

Although the BERT-based model has achieved good performance in many NLP tasks, the quality of the sentenc...

Examples

Embodiment Construction

[0029] Step 1: Data collection and preprocessing.

[0030] Collect real medical dialogue records from the Internet and store them on the local machine as natural-language text.

[0031] Step 2: Split the data into a training set and a validation set.

[0032] Remove non-compliant characters and redundant punctuation marks from the corpus, unify the full-width and half-width representation of punctuation marks, and use regular expressions to strip non-text content. After preprocessing, perform word segmentation and build the word-segmentation reference file used for unsupervised whole-word-masking training.

[0033] Step 3: Starting from the pre-trained model Bert-wwm-ext, carry out unsupervised whole-word-masking training.

[0034] Use the following script for unsupervised training: the model is further pre-trained on the processed dataset so that it learns the characteristics of that dataset.
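The cleanup and split described in Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the patent's actual code: the regular expressions, the character whitelist, the function names `clean_text` and `split_train_valid`, and the 10% validation ratio are all assumptions, since the source does not list its exact rules.

```python
import random
import re
import unicodedata

# Hypothetical cleanup rules; the source does not give the exact patterns.
_NON_TEXT = re.compile(r"https?://\S+|<[^>]+>")        # URLs and HTML tags
_DISALLOWED = re.compile(r"[^\w\s,.!?;:()%/\-。、]")    # keep words and basic punctuation
_REPEATED_PUNCT = re.compile(r"([,.!?;:。、])\1+")      # "!!!" -> "!"
_SPACES = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize width variants, strip non-text content, collapse punctuation."""
    # NFKC folds full-width letters, digits and most punctuation to half-width.
    text = unicodedata.normalize("NFKC", text)
    text = _NON_TEXT.sub(" ", text)
    text = _DISALLOWED.sub("", text)
    text = _REPEATED_PUNCT.sub(r"\1", text)
    return _SPACES.sub(" ", text).strip()


def split_train_valid(items, valid_ratio=0.1, seed=0):
    """Shuffle deterministically and return (train, valid) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_valid = int(len(items) * valid_ratio)
    return items[n_valid:], items[:n_valid]
```

Word segmentation itself is left out here because the source does not name the segmenter it uses.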

[0035] export TRAIN_FILE=/path/to/dataset/wiki.train.raw

[0036...
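To illustrate what whole-word masking means for segmented Chinese text, here is a simplified sketch. It is not the training script referred to above, which is truncated in the source. The function name `whole_word_mask`, the 15% masking rate, and the use of `None` for unmasked labels are assumptions. The usual BERT refinement of sometimes substituting a random token or keeping the original token is omitted.

```python
import random

MASK = "[MASK]"


def whole_word_mask(words, mask_prob=0.15, seed=None):
    """Mask at word granularity: every character of a chosen word is masked.

    `words` is the output of a word segmenter. Chinese BERT vocabularies
    work on single characters, so each word expands to one token per character.
    Returns (tokens, labels); labels hold the original character at masked
    positions and None elsewhere.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for word in words:
        masked = rng.random() < mask_prob   # one decision per word, not per character
        for ch in word:
            tokens.append(MASK if masked else ch)
            labels.append(ch if masked else None)
    return tokens, labels
```

The key property is that a word is never partially masked, which is why the segmentation reference file built during preprocessing is needed.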


Abstract

The invention discloses a Bert-based semantic matching method. The method starts from Bert-wwm-ext, a pre-trained model from the Harbin Institute of Technology. The model first undergoes unsupervised whole-word-masking training on our large-scale data so that it adapts to the characteristics of that data, and the adapted model is then saved. The model structure is adjusted as follows: a pooling layer is added to the output layer of Bert, and each batch fed to the model contains a specific group of sentences, some of which are semantically similar while the rest are semantically different. In this way the model learns in a manner closer to human learning, taking the contrast between data into account, so that it converges more quickly. After the architecture has been modified, the model is trained again for sentence semantic similarity on a large corpus. During training, a comparison calculation between synonymous and non-synonymous sentences is added before back-propagation, which yields the sentence semantic similarity. The sentence-vector semantic representations finally obtained are more practical.
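The pooling layer and the in-batch comparison between similar and dissimilar sentences can be sketched in plain Python as below. This is a hedged illustration only: the abstract does not specify the pooling type or the loss, so mean pooling, cosine similarity, the softmax-style contrastive loss, the temperature of 0.05, and all function names are assumptions.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is non-zero (padding is skipped)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    dim = len(token_vectors[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """positives[i] is the sentence similar to anchors[i]; every other
    positive in the batch serves as a dissimilar sentence for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each sentence vector is closest to its own similar sentence and large when it is closer to a dissimilar one, which is the signal that back-propagation would use in an actual training framework.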

Description

Technical field

[0001] The invention relates to a Bert-based semantic matching technology and belongs to the field of artificial intelligence.

Background

[0002] Current mainstream solutions to the text matching problem fall into three categories: statistical learning, deep learning, and transfer learning. Statistical learning approaches mostly obtain text features through manual or statistical methods and then compare the similarity between text pairs. Typical methods include but are not limited to:

[0003] (1) Evaluating similarity based on string operations, such as edit distance;

[0004] (2) Counting the number of terms and directly using statistical indicators such as similarity coefficients to calculate the similarity between the two texts;

[0005] (3) Obtaining a vector representation of the text through encodings such as term frequency-inverse document frequency (TF-IDF), and then obtaining text similarity through inner product ...
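Two of the statistical baselines listed above, edit distance and TF-IDF vectors compared by a normalized inner product, can be sketched as follows. The function names and the smoothed IDF formula are illustrative assumptions and are not taken from the patent.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def tfidf_vectors(docs):
    """docs is a list of token lists; returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption): never zero, never a division by zero.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine_sparse(u, v):
    """Inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Both measures depend only on surface overlap: two questions that share no terms score zero under TF-IDF cosine even if they mean the same thing.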


Application Information

IPC(8): G06F40/211, G06F40/30, G06N3/04, G06N3/08, G06K9/62
CPC: G06F40/30, G06F40/211, G06N3/084, G06N3/088, G06N3/045, G06F18/22, G06F18/214
Inventor 高东平秦奕杨渊李玲池慧
Owner: Institute of Medical Information, Chinese Academy of Medical Sciences (中国医学科学院医学信息研究所)