Coordinated training-based dual-language named entity identification method

A named entity recognition technology applied in the field of natural language processing (NLP). It addresses the performance degradation, unsatisfactory performance, and poor suitability of supervised learning methods, and achieves reduced domain dependence, improved consistency, and strong generalization ability.

Active Publication Date: 2014-06-11
BEIJING INSTITUTE OF TECHNOLOGY
5 Cites, 34 Cited by
AI Technical Summary

Problems solved by technology

Among the statistical methods, the supervised learning method has a good performance in the task of named entity recognition, but it has two shortcomings: First, the method requires a large amount of labeled data to ensure the accuracy of learning, so it is not suitable for tho


Examples

Embodiment Construction

[0019] Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0020] A bilingual named entity recognition method based on collaborative training, comprising the following steps:

[0021] Step 1. Initialize the bilingual sequence tagging models: train the Chinese and English sequence tagging models, Cmodel(s) and Cmodel(t), on the tagged corpus sets Ls and Lt respectively, which are aligned at the Chinese-English sentence level. Three types of named entity are marked in the annotated corpus: PER (person name), LOC (place name), and ORG (organization name). The BIO annotation scheme is used, giving 7 tags in total for all words: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, and O. For Chinese, the features are single characters, single words, and combinations of 2-3 characters or words; for English, the feature templates combine the word, its part of speech, and the case of its initial letter.
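The tag set and feature templates in Step 1 can be illustrated with a minimal sketch. All function names here (`bio_tagset`, `spans_to_bio`, `english_features`, `chinese_char_features`) are hypothetical and chosen for illustration; the patent text shown does not specify the exact templates, so the feature functions are one plausible reading of "word, part of speech, initial letter case" and "single character plus 2-3 character combinations".

```python
ENTITY_TYPES = ["PER", "LOC", "ORG"]  # the three entity types named in Step 1


def bio_tagset(entity_types):
    """Build the BIO tag set: B-X and I-X for each type, plus O."""
    tags = []
    for etype in entity_types:
        tags.append(f"B-{etype}")
        tags.append(f"I-{etype}")
    tags.append("O")
    return tags


def spans_to_bio(length, spans):
    """Convert (start, end_exclusive, type) entity spans to a BIO tag list."""
    tags = ["O"] * length
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for k in range(start + 1, end):
            tags[k] = f"I-{etype}"
    return tags


def english_features(tokens, pos_tags, i):
    """Assumed English template: word, part of speech, initial-letter case."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "pos": pos_tags[i],
        "init_cap": word[:1].isupper(),
    }


def chinese_char_features(chars, i):
    """Assumed Chinese template: single character plus 2- and 3-character
    combinations starting at position i (when they fit in the sentence)."""
    feats = {"c0": chars[i]}
    if i + 1 < len(chars):
        feats["c0c1"] = chars[i] + chars[i + 1]
    if i + 2 < len(chars):
        feats["c0c1c2"] = chars[i] + chars[i + 1] + chars[i + 2]
    return feats
```

Feature dictionaries of this shape could be fed to any sequence tagger that accepts per-token features; the choice of tagger is outside what this excerpt states.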

[0022]...

Abstract

The invention discloses a named entity identification method based on dual-language coordinated training, and belongs to the technical field of natural language processing in computer science. Parallel Chinese and English sentence datasets are treated as two different views of one dataset for dual-language coordinated training. A log-linear model is used to correct projected labels during the projection process, and the consistency of dual-language aligned named entity annotations is introduced as a measure of label confidence when the model predicts unseen cases. Compared with the prior art, the method reduces the domain dependence of named entity identification, combines the advantages of identification in both languages, resolves part of the identification ambiguity that arises in single-language identification, and is particularly suitable for synchronous dual-language named entity identification on large-scale corpora.
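The idea of using bilingual annotation consistency as a confidence measure can be sketched as follows. This is an illustrative assumption, not the patent's procedure: the score here is simply the fraction of word-alignment links touching an entity on which both sides agree on the entity type, and it stands in for the log-linear model, whose details are not given in this excerpt. The names `alignment_consistency`, `select_confident`, and the `threshold` parameter are hypothetical.

```python
def entity_type(tag):
    """Strip the B-/I- prefix from a BIO tag; 'O' stays 'O'."""
    return "O" if tag == "O" else tag.split("-", 1)[1]


def alignment_consistency(src_tags, tgt_tags, alignment):
    """Fraction of entity-bearing alignment links whose two ends carry the
    same entity type. `alignment` is a list of (src_index, tgt_index) pairs.
    Links where both ends are 'O' are ignored; with no entity-bearing links
    the pair is treated as fully consistent (an assumption of this sketch)."""
    relevant = [
        (i, j) for i, j in alignment
        if src_tags[i] != "O" or tgt_tags[j] != "O"
    ]
    if not relevant:
        return 1.0
    agree = sum(
        1 for i, j in relevant
        if entity_type(src_tags[i]) == entity_type(tgt_tags[j])
    )
    return agree / len(relevant)


def select_confident(pairs, threshold=0.8):
    """Keep automatically tagged sentence pairs whose consistency score
    reaches the threshold, as candidates to add to the training sets."""
    return [
        p for p in pairs
        if alignment_consistency(p["src_tags"], p["tgt_tags"], p["alignment"]) >= threshold
    ]
```

In a co-training loop, the pairs returned by `select_confident` would be added to the labeled sets of both views before the two taggers are retrained; the threshold value of 0.8 is arbitrary and would need tuning.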

Description

Technical field

[0001] The invention relates to a method for identifying bilingual named entities. It is especially suitable for identifying named entities in large-scale cross-domain bilingual corpora as a pre-processing step for machine translation, and belongs to the technical field of natural language processing (NLP) in computer science.

Background art

[0002] A named entity is the proper name of a unique entity. Named entity recognition is an important basic technical problem in the field of natural language processing, and has become one of the technical bottlenecks in multilingual information processing tasks such as cross-language information retrieval and machine translation.

[0003] Currently, researchers have developed many models for named entity recognition. Because rule-based methods do not generalize well across different types of languages, statistical methods have received extensive attention in recent years. Among the...

Application Information

IPC(8): G06F17/28
Inventor: 黄河燕, 史树敏, 李业刚
Owner: BEIJING INSTITUTE OF TECHNOLOGY