Unregistered word identification method and system using five-stroke character root deep learning

Deep learning technology for unregistered words, applied in character and pattern recognition, electrical digital data processing, special data processing applications, etc. Benefits: improved accuracy and improved effect.

Pending Publication Date: 2019-09-27
GUANGDONG POLYTECHNIC NORMAL UNIV
10 Cites, 4 Cited by

AI Technical Summary

Problems solved by technology

[0006] (1) Rule-based methods require several rules to be formulated by hand, which is rarely feasible; when the application fields differ greatly, portability is poor and the rules must be formulated again.
[0007] (2) Methods based on machine learning and recognition methods based on neural network models cannot recognize unregistered words.
[0011] At present, professional terms in various fields are complex in category, general in content, large in information volume, and complex in composition.
As a result, people cannot describe or express them accurately and completely, and instead fall back on aliases, abbreviations, and similar substitutes. This often introduces typos, ambiguous words, near-synonyms, and the like.
These problems severely affect named entity recognition in the domain.

Examples

Embodiment 1

[0054] The present invention is a model that combines LSTM with Wubi radicals for Chinese named entity recognition. The invention encodes the input character sequence together with all potential words that match the Wubi radical dictionary. In contrast to character-based approaches, the present invention explicitly uses word and word-order information. The gated recurrent units let the model select the most relevant characters and words from a sentence, which yields better named entity recognition results.
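The step of collecting "all potential words" that match a dictionary can be sketched as a substring scan over the sentence. This is a minimal illustration, not the patented implementation: the function name `match_words`, the `max_len` limit, and the tiny lexicon are assumptions, and a plain word set stands in for the Wubi radical dictionary named in the text.

```python
from typing import List, Set, Tuple


def match_words(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every multi-character substring found in the lexicon."""
    matches = []
    for start in range(len(sentence)):
        # Try every candidate of length 2..max_len that begins at this character.
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 2, last_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


# Hypothetical lexicon, chosen so that the candidate words overlap.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
spans = match_words("南京市长江大桥", lexicon)
for start, end, word in spans:
    print(start, end, word)
```

Overlapping candidates such as 市长 and 长江 are all kept here; choosing among them is left to the downstream model, which is the role the paragraph above assigns to the gated units.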

[0055] For the input word embeddings, the embodiments of the present invention use Wubi radicals to represent Chinese characters, and these representations are combined into character embeddings. This strengthens the model's use of the morphological and semantic information carried by characters and lets the neural network extract n-gram features automatically. Each character is decomposed into strokes to form an n-gram model, and each character is represented by 4 English letters. For differen...
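As a rough sketch of the representation described above, the following maps each character to a fixed-width 4-letter code and then cuts that code into letter n-grams. Everything here is assumed for illustration: `WUBI_TABLE` is a tiny stand-in whose codes are placeholders rather than entries checked against a real Wubi root table, and the `_` padding for codes shorter than 4 letters is a choice made for this sketch, not something the source specifies.

```python
from typing import Dict, List

# Hypothetical stand-in for a Wubi root table; codes are illustrative placeholders.
WUBI_TABLE: Dict[str, str] = {"中": "khk", "国": "lgyi", "人": "w"}
PAD = "_"  # assumed padding symbol for codes shorter than the fixed width


def wubi_code(char: str, table: Dict[str, str] = WUBI_TABLE, width: int = 4) -> str:
    """Look up a character and return a code of exactly `width` letters."""
    code = table.get(char, "")  # unknown characters become all padding
    return code[:width].ljust(width, PAD)


def char_ngrams(code: str, n: int) -> List[str]:
    """Return every contiguous letter n-gram of a code."""
    return [code[i:i + n] for i in range(len(code) - n + 1)]


def encode_word(word: str, n: int = 2) -> List[List[str]]:
    """Encode a word as one list of letter n-grams per character."""
    return [char_ngrams(wubi_code(c), n) for c in word]


features = encode_word("中国", n=2)
print(features)
```

In a full model each n-gram would index an embedding table, and the per-character n-gram embeddings would be combined into the character embedding; that part is omitted here.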

Abstract

The invention belongs to the technical field of natural language data processing and discloses an unregistered word recognition method and system using deep learning of Wubi (five-stroke) radicals. The method comprises: converting each Chinese character into four English letters according to a Wubi radical table; then feeding the resulting representation into the model as its embedding vector, together with the embedding vectors of the words in a corpus, to train a neural network model; and finally having the model output the most similar vocabulary vector from the existing corpus, which serves as an important basis for recognizing unregistered words. Because Chinese words with similar radicals mostly share the same part of speech and have similar Wubi codes, the invention provides a neural network entity recognition method based on Wubi radicals that can improve the performance of the neural network model in recognizing unregistered words. Because words are represented by word vectors learned through deep learning, the sparsity problem of high-dimensional vector spaces is addressed, and the method is simpler and more effective.
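The final step, returning the most similar vocabulary vector from the corpus, can be illustrated with a nearest-neighbour lookup. This is a sketch under assumptions: the abstract does not name a similarity measure, so cosine similarity is used here as a common choice, and the function names and the three toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query: List[float], vocab: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the vocabulary entry whose vector is closest to the query, with its score."""
    best = max(vocab, key=lambda word: cosine(query, vocab[word]))
    return best, cosine(query, vocab[best])


# Toy vectors standing in for embeddings learned from a corpus.
corpus_vectors = {
    "alpha": [1.0, 0.0, 0.0],
    "beta": [0.0, 1.0, 0.0],
    "gamma": [0.7, 0.7, 0.0],
}
word, score = most_similar([0.9, 0.1, 0.0], corpus_vectors)
print(word, round(score, 3))
```

The matched entry can then supply evidence, such as its entity label, for a word that never appeared in training.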

Description

Technical field

[0001] The invention belongs to the technical field of natural language data processing, and in particular relates to an unregistered word recognition method and system using deep learning of Wubi radicals.

Background technique

[0002] At present, the commonly used existing technologies in the industry are as follows. The term "named entity", widely used in natural language processing, was originally proposed at the Sixth Message Understanding Conference (MUC-6) in 1996. Most of the research at MUC-6 relied on rule-based methods, such as lexical rules on word forms or parts of speech. Character matching rules were formulated from the cue words before and after named entities, the context, and so on, mainly for information extraction tasks. Named entities are objects of interest that can be used to solve specific problems. Sekine believes that the general seven subcategories of named entities cannot meet the application requirements of automatic q...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06K9/62, G06N3/04
CPC: G06F40/295, G06N3/045, G06F18/214
Inventor: 肖政宏, 闫艺婷, 王华嘉, 周健烨, 李旺, 梁志鹏
Owner: GUANGDONG POLYTECHNIC NORMAL UNIV