Method for generating Chinese word vector with multi-submodule information

A multi-submodule technique applied to Chinese word vector generation. It addresses the problem that treating all sub-modules equally can strengthen less important information, weaken important information, and reduce the accuracy of the generated word vectors.

Active Publication Date: 2020-05-15
EAST CHINA NORMAL UNIV +1

AI Technical Summary

Problems solved by technology

[0004] Embedding methods in the prior art often give equal weight to all the sub-modules they use. Treating sub-modules equally can strengthen less important information, weaken important information, and reduce the accuracy of the generated word vectors.



Examples


Embodiment 1

[0040] Referring to Figure 1, the present invention generates Chinese word vectors according to the following steps:

[0041] Step 1, background and definition stage: the background of word vectors and some basic definitions used for training word vectors in the present invention are described. The specific steps are as follows:

[0042] a. In Chinese, a character is the basic symbol of the writing system and carries its own meaning. A word is usually composed of multiple characters and expresses a complete meaning. Characters are associated with various kinds of sub-character information, such as components and radicals. A character can be further divided into meaningful components, and one of these components can be regarded as the radical; the radical conveys the lexical meaning of the character and gives a hint about what the character is associated with, and multiple characters can share the same ...
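The word, character, radical, and component hierarchy described above can be illustrated with a small lookup table. This is only a sketch: the `CharInfo` record, the `CHAR_TABLE` entries, and the helper names are hypothetical and hand-written for illustration, not part of the patent; a real system would load such data from a character decomposition dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CharInfo:
    """Sub-character information for one Chinese character (illustrative)."""
    char: str
    radical: str
    components: tuple
    pinyin: str


# Hand-written toy entries (assumption: not taken from the patent).
CHAR_TABLE = {
    "你": CharInfo("你", "亻", ("亻", "尔"), "nǐ"),
    "好": CharInfo("好", "女", ("女", "子"), "hǎo"),
    "妈": CharInfo("妈", "女", ("女", "马"), "mā"),
}


def decompose_word(word):
    """Split a word into characters and look up each character's information."""
    return [CHAR_TABLE[ch] for ch in word]


def group_by_radical(table):
    """Map each radical to the characters in the table that contain it."""
    groups = defaultdict(list)
    for info in table.values():
        groups[info.radical].append(info.char)
    return dict(groups)


for info in decompose_word("你好"):
    print(info.char, info.radical, info.components, info.pinyin)
print(group_by_radical(CHAR_TABLE))
```

In this toy table, 好 and 妈 both carry the radical 女, which is the kind of shared sub-character signal an embedding model can exploit.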


Abstract

The invention provides a method for generating Chinese word vectors with multi-submodule information. The method fuses six kinds of sub-module information (words, characters, radicals, components, fonts, and pinyin) by means of an attention mechanism: an improved Chinese character embedding representation is learned and fused into the word embedding with appropriate weights, producing a high-precision word vector. Compared with the prior art, the method assigns an appropriate weight to each kind of sub-module information according to the attention mechanism, so that the weight of sub-modules carrying less semantic information is reduced and the weight of sub-modules carrying richer semantic information is increased. Chinese word embedding is thereby improved, and a considerable performance improvement is achieved.
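The attention-weighted fusion described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented formulation: the source does not give the scoring function, so the dot product against a `query` vector, the softmax normalization, the toy vectors, and all function names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_fuse(submodule_vectors, query):
    """Fuse sub-module vectors into one vector using attention weights.

    Assumption: each sub-module is scored by its dot product with `query`
    (for example, a context vector); the patent does not specify this.
    """
    names = list(submodule_vectors)
    scores = [dot(submodule_vectors[name], query) for name in names]
    weights = softmax(scores)
    fused = [
        sum(w * submodule_vectors[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy 3-dimensional vectors for the six sub-modules (made-up values).
vectors = {
    "word": [1.0, 0.0, 0.0],
    "character": [0.8, 0.2, 0.0],
    "radical": [0.1, 0.9, 0.0],
    "component": [0.2, 0.7, 0.1],
    "font": [0.0, 0.1, 0.9],
    "pinyin": [0.0, 0.0, 1.0],
}
query = [1.0, 0.5, 0.0]

weights, fused = attention_fuse(vectors, query)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
print("fused:", [round(x, 3) for x in fused])
```

With equal weighting every sub-module would receive 1/6. Here the sub-modules that align with the query receive more than 1/6 and the others less, which is the effect the abstract attributes to the attention mechanism.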

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a method for generating Chinese word vectors with multi-submodule information.

Background

[0002] In recent years, multiple distributed representations based on deep neural network models, i.e., word embeddings, have been proposed, which lay a solid foundation for downstream NLP tasks such as named entity recognition, text classification, machine translation, and question answering. Correctly representing words is the most basic task of natural language processing (NLP), and the execution of other NLP tasks depends on how words are represented. Traditional word embedding methods focus on learning the representation of words from their context, and these methods are effective for Indo-European languages (which use Latin script in their writing system). However, for Sino-Tibetan languages, learning word represent...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/284, G06F40/211, G06N3/04, G06N3/08
CPC: G06N3/08, G06N3/045
Inventor: 朱鹏, 程大伟, 杨芳洲, 罗轶凤, 钱卫宁, 周傲英
Owner: EAST CHINA NORMAL UNIV