Unsupervised machine reading comprehension method based on large-scale question self-learning

A reading comprehension and self-learning technology, applied in the field of unsupervised machine reading comprehension, which addresses the difficulty of obtaining large amounts of labeled data and achieves the effect of improving accuracy.

Pending Publication Date: 2021-12-24
宏龙科技(杭州)有限公司 and one other applicant

AI Technical Summary

Problems solved by technology

[0003] In many NLP applications, it is very difficult to obtain large amounts of labeled data


Examples


Embodiment

[0031] Example: We use a variety of pre-trained language models (such as GPT-2 and T5) to generate a large number of candidate question-answer pairs from unlabeled passages of in-domain text. This approach allows us to cold-start in a completely new domain. We then pre-train the model on these generated samples and finally fine-tune it on a specific labeled dataset.
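
As a rough illustration of this generation step, the sketch below uses an off-the-shelf T5 question-generation checkpoint; the model name, the <hl> highlight-token prompt format, and the helper function are illustrative assumptions, not the patent's exact setup.

    # Sketch of the question-generation step: a seq2seq model proposes a
    # question for a chosen answer span in an unlabeled in-domain passage.
    # "valhalla/t5-base-qg-hl" is a public question-generation checkpoint
    # used here for illustration only.
    from transformers import pipeline

    qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")

    def generate_qa_pair(passage, answer_span):
        # Highlight the answer span so the model knows what to ask about.
        highlighted = passage.replace(answer_span, f"<hl> {answer_span} <hl>", 1)
        question = qg("generate question: " + highlighted)[0]["generated_text"]
        return {"context": passage, "question": question, "answer": answer_span}

    passage = "BERT was introduced by researchers at Google in 2018."
    print(generate_qa_pair(passage, "2018"))  # e.g. "When was BERT introduced?"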

[0032] Although a model trained on the SQuAD1.1 training set achieves state-of-the-art performance (EM score of 85%) on the SQuAD1.1 dev set, it is unable to reach the same level of inference on a completely new domain, namely NewsQA (EM score of 32%). We have found that when pre-training a model on a synthetic dataset, preventing overfitting is critical, because such datasets often contain many noisy samples. However, these synthetic datasets are very useful in the early stage, when there is little or no in-domain training data, because we can use this method ...
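
One common way to realize the filtering described in step S3 of the abstract is round-trip consistency: a generated pair is kept only when a QA model trained on general-domain data reproduces the generated answer. The sketch below is an illustration under assumptions; the scoring model and the confidence threshold are placeholders, not the patent's specification.

    # Round-trip consistency filter for synthetic QA pairs. Pairs whose
    # answer a general-domain QA model (here a public SQuAD checkpoint)
    # can recover are kept as "high quality"; the rest fall into the
    # low-quality pool (cf. steps S3/S4 in the abstract).
    from transformers import pipeline

    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    def split_by_quality(samples, threshold=0.5):
        high, low = [], []
        for s in samples:
            pred = qa(question=s["question"], context=s["context"])
            # Exact string match on the answer, gated by model confidence.
            if (pred["score"] >= threshold
                    and pred["answer"].strip() == s["answer"].strip()):
                high.append(s)  # used for continued pre-training (S3)
            else:
                low.append(s)   # re-labeled and mixed with labeled data (S4)
        return high, low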


Abstract

The invention discloses an unsupervised machine reading comprehension method based on large-scale question self-learning. The method first divides the data into four types and then proceeds as follows. S1: train on the unlabeled general data with a standard pre-training model to obtain a pre-trained language model. S2: train on the labeled general data with the pre-trained language model to obtain a question generator and a task-specific general-domain model. S3: use the question generator to generate synthetic in-domain data from the unlabeled in-domain data, filter it with the task-specific general-domain model, and train on the resulting high-quality synthetic in-domain dataset to obtain a new pre-trained model. S4: mix the labeled in-domain data with the low-quality synthetic dataset obtained from the filtering, label the answers, and train with the new pre-trained model to obtain the final model. Based on the final model, input data yields the machine reading comprehension result.
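
Read as a pipeline, the four steps compose roughly as in the sketch below; every callable is a placeholder standing in for one of the patent's components, injected by the caller, not a published API.

    # Schematic composition of steps S1-S4; all callables are placeholders.
    from typing import Callable

    def self_learning_pipeline(
        pretrain: Callable,            # S1/S3: language-model pre-training
        train_question_gen: Callable,  # S2: builds the question generator
        finetune: Callable,            # S2/S4: task-specific fine-tuning
        filter_by_quality: Callable,   # S3: splits synthetic data by quality
        relabel: Callable,             # S4: re-labels answers in low-quality data
        unlabeled_general, labeled_general,
        unlabeled_in_domain, labeled_in_domain,
    ):
        lm = pretrain(unlabeled_general)                             # S1
        question_gen = train_question_gen(lm, labeled_general)       # S2
        general_model = finetune(lm, labeled_general)                # S2
        synthetic = question_gen(unlabeled_in_domain)                # S3: generate
        high_q, low_q = filter_by_quality(general_model, synthetic)  # S3: filter
        new_lm = pretrain(high_q)                                    # S3: re-pretrain
        mixed = labeled_in_domain + relabel(low_q)                   # S4: mix
        return finetune(new_lm, mixed)                               # S4: final model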

Description

Technical Field

[0001] The invention relates to the field of machine reading comprehension, in particular to an unsupervised machine reading comprehension method based on large-scale question self-learning.

Background Technique

[0002] Many state-of-the-art algorithms for natural language processing (NLP) tasks require human-annotated data. At the outset we usually do not have any domain-specific labeled dataset, and annotating a sufficient amount of such data is usually expensive and laborious. Thus, for many NLP applications, even resource-rich languages such as English have labeled data in only a few domains.

[0003] In many NLP applications, obtaining large amounts of labeled data is difficult. Therefore, in many cases, we train a model from a small amount of data. However, such a model often overfits and fails to generalize to unseen data. Therefore, researchers take advantage of large unlabeled datasets by pre-training language models, which of...


Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): G06F40/211G06F40/253G06F40/295G06F40/58G06N3/04G06N3/08
CPCG06F40/211G06F40/253G06F40/295G06F40/58G06N3/088G06N3/045
Inventor 赵天成
Owner 宏龙科技(杭州)有限公司