A Data Augmentation Algorithm for Chinese Named Entity Recognition Based on Sequence Generative Adversarial Networks

A named entity recognition and sequence generation technology, applied in the Internet field, that addresses the lack of a large amount of labeled data and the heavy manpower and time cost of collecting external resources.

Active Publication Date: 2021-04-13
BEIJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0015] 1. Modifying the structure of the deep model can enhance the semantic representation of the text, but it does not solve the lack of a large amount of labeled data.
[0016] 2. Introducing external resources requires considerable manpower and time to collect them, and effective rules must be designed to incorporate them into the model.



Examples


Embodiment 1

[0061] Referring to Figures 1 and 2, the present invention provides a method for applying a data augmentation algorithm based on a sequence generative adversarial network to a named entity recognition task. Specifically, during training, the method includes:

[0062] Step 1: Process the sentences in the corpus. Using each sentence's entity label information, divide the sentence into entity and non-entity parts, and add both kinds of parts to the dictionary. For example, given a text sequence {c_1, c_2, c_3, c_4, c_5, c_6} with labels {O, O, B-PER, I-PER, O, O}, c_1 c_2 and c_5 c_6 are classified as non-entity parts and c_3 c_4 as an entity part; these parts and their corresponding labels are then added to the dictionary.
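The segmentation in Step 1 can be sketched in Python as follows. This is a minimal illustration assuming BIO-style labels as in the example above; the function names `split_segments` and `add_to_dictionary`, and the choice to store each part together with its labels as the dictionary key, are hypothetical and not taken from the patent text.

```python
from typing import Dict, List, Tuple

# A segment is (characters, BIO labels); both are tuples so they can be dict keys.
Segment = Tuple[Tuple[str, ...], Tuple[str, ...]]


def _starts_new_segment(prev: str, cur: str) -> bool:
    """Return True if the label `cur` opens a new segment after label `prev`."""
    if cur.startswith("B-"):
        return True                # a new entity always opens a segment
    if cur == "O":
        return prev != "O"         # leaving an entity opens a non-entity segment
    # cur is an I-* label: it continues only an entity of the same type
    return prev == "O" or prev[2:] != cur[2:]


def split_segments(chars: List[str], labels: List[str]) -> List[Segment]:
    """Split a labelled sentence into alternating entity / non-entity parts."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(chars) + 1):
        if i == len(chars) or _starts_new_segment(labels[i - 1], labels[i]):
            segments.append((tuple(chars[start:i]), tuple(labels[start:i])))
            start = i
    return segments


def add_to_dictionary(dictionary: Dict[Segment, int], segments: List[Segment]) -> None:
    """Assign the next free index to every part not yet in the dictionary."""
    for seg in segments:
        dictionary.setdefault(seg, len(dictionary))


chars = ["c1", "c2", "c3", "c4", "c5", "c6"]
labels = ["O", "O", "B-PER", "I-PER", "O", "O"]
segments = split_segments(chars, labels)
dictionary: Dict[Segment, int] = {}
add_to_dictionary(dictionary, segments)
```

For the example sentence this yields three parts: the non-entity part c1 c2, the entity part c3 c4 (labelled B-PER, I-PER), and the non-entity part c5 c6.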

[0063] Step 2: Using the dictionary built from the entity and non-entity parts, map the entity and non-entity parts of each sentence to their corresponding indexes in the dictionary to form an index sequence.
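The mapping in Step 2 might look like the sketch below. It assumes each sentence has already been split into parts, represented here as hypothetical `(text, label)` pairs; `build_dictionary` and `to_index_sequence` are illustrative names, and the patent text does not specify how indexes are assigned.

```python
from typing import Dict, List, Tuple

Part = Tuple[str, str]  # (part text, label), an assumed representation


def build_dictionary(sentences: List[List[Part]]) -> Dict[Part, int]:
    """Give every distinct part an index, in order of first appearance."""
    dictionary: Dict[Part, int] = {}
    for parts in sentences:
        for part in parts:
            dictionary.setdefault(part, len(dictionary))
    return dictionary


def to_index_sequence(parts: List[Part], dictionary: Dict[Part, int]) -> List[int]:
    """Replace each part of a sentence with its dictionary index."""
    return [dictionary[part] for part in parts]


sentences = [
    [("c1c2", "O"), ("c3c4", "PER"), ("c5c6", "O")],
    [("c3c4", "PER"), ("c1c2", "O")],
]
dictionary = build_dictionary(sentences)
index_sequences = [to_index_sequence(s, dictionary) for s in sentences]
print(index_sequences)  # [[0, 1, 2], [1, 0]]
```

Treating whole entity and non-entity parts as vocabulary items, rather than single characters, means each index in the sequence stands for a complete labelled span.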

[0064] Step ...


Abstract

The present invention provides a method that selects positive sample data from source-domain data to expand the training data of the target domain, by fusing the semantic differences and label differences between sentences in the source domain and the target domain, thereby improving named entity recognition performance in the target domain. Building on the earlier Bi-LSTM+CRF model, the semantic difference and label difference are introduced through the state representation and reward setting in reinforcement learning. The trained decision network can then select, from the source-domain data, sentences that have a positive impact on named entity recognition performance in the target domain. This expands the target-domain training data, addresses the shortage of training data in the target domain, and improves target-domain named entity recognition performance.

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a method that uses a sequence generative adversarial network to augment data and improve the performance of Chinese named entity recognition.

Background

[0002] In recent years, deep learning has made great progress in image, speech and natural language processing. As an emerging branch of machine learning, deep learning is motivated by building neural networks that simulate the human brain for analysis and learning. In the image field, deep neural networks are used for object detection, for example combining convolutional neural networks with candidate windows to detect pedestrians in images; in the speech field, deep learning is used for speech synthesis and recognition, providing intelligent voice systems; in the field of natural language processing, deep learning is applied to various life scenarios, ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/295, G06F40/216, G06F16/31, G06F16/35, G06F16/36, G06N3/04, G06N3/08
CPC: G06F40/295, G06F40/216, G06F16/316, G06F16/35, G06F16/36, G06N3/049, G06N3/084, G06N3/045
Inventors: 李思, 王蓬辉, 李明正, 孙忆南
Owner: BEIJING UNIV OF POSTS & TELECOMM