
Tibetan word vector representation method fusing components and character information

A word vector technique for Tibetan that fuses component and character information. It addresses three problems of existing methods: word vectors are learned from too few kinds of objects, semantic information is not fully mined, and good word vectors are not obtained. The stated effect is that the similar and related words found are semantically close and highly correlated.

Active Publication Date: 2019-06-28
QINGHAI NORMAL UNIV
Cites: 4; Cited by: 3

AI Technical Summary

Problems solved by technology

[0015] (2) Current word vector representation methods acquire word vectors from word information only. For example, the CBOW and GloVe models obtain the vector of the target word only from the information of its context words, while Skip-gram obtains the vectors of the context words only from the information of the target word.
Although word vectors of this kind carry the semantic information of words, they are learned from too few kinds of objects, so the semantic information of words is not fully mined and better word vectors are not obtained.
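
The difference between the two training directions described above can be made concrete with a small sketch. The function name `training_pairs`, the `window` parameter, and the placeholder tokens are assumptions introduced here for illustration; they are not taken from the patent.

```python
def training_pairs(tokens, window=2):
    """Build CBOW and Skip-gram training examples from one token list.

    CBOW examples are (context_words, target_word): predict the target
    from its context. Skip-gram examples are (target_word, context_word):
    predict each context word from the target.
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if not context:
            continue  # a one-word sentence has no context
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


# Placeholder tokens stand in for segmented Tibetan words.
cbow, skipgram = training_pairs(["w1", "w2", "w3"], window=1)
print(cbow)
print(skipgram)
```

In both cases the only learning objects are whole words, which is the limitation the paragraph above points out.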



Examples


Specific Embodiment 1

[0100] A given Tibetan word is decomposed into characters, and each character is further decomposed into several components. Tibetan words carry word-meaning information. Prefix letters carry tense and energy relationship information; superscript letters carry phonetic and verb category information; base letters carry phonetic and tense information; subscript letters carry phonological information; vowels carry morphological information; suffix letters carry phonological, lexical, and grammatical information; and post-suffix letters carry phonological, lexical, grammatical, and tense information, as well as degree information. For example, the Tibetan word […] (meaning "students") can be decomposed into two characters, […] (the present tense of "learn") and […] (a suffix denoting a person). The character […] can be further broken down into the components […] (base letter), […] (subscript letter), […] (vowel), and […] (suffix letter); these components...
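
The word, character, and component hierarchy described above can be sketched as follows. This is a minimal illustration under two assumptions that are not from the patent: characters (syllables) are separated by the Tibetan tsheg mark (U+0F0B), and components are approximated by the Unicode code points of each character. The sample string is chosen here for illustration, and the helper name `decompose` is hypothetical.

```python
import unicodedata

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark that separates characters


def decompose(word):
    """Split a Tibetan word into (character, components) pairs.

    Assumption: each Unicode code point of a character is treated as one
    component. This is an approximation, not the patent's own procedure.
    """
    return [(char, list(char)) for char in word.split(TSHEG) if char]


# Illustrative two-character string written with escapes:
# SA + SUBJOINED LA + VOWEL SIGN O + BA, tsheg, MA.
word = "\u0f66\u0fb3\u0f7c\u0f56\u0f0b\u0f58"

for char, components in decompose(word):
    # Print code point names so the output is ASCII on any terminal.
    print([unicodedata.name(c) for c in components])
```

The code only splits the string; it does not label a component as prefix, superscript, base, subscript, vowel, or suffix. Assigning those roles needs the rules of Tibetan orthography, which this sketch does not attempt.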

Specific Embodiment 2

[0101] Specific embodiment 2: the traditional CBOW word vector representation model representing the sentence "[…] (World Progress)":

[0102] As shown in Figure 5, the CBOW model learns the semantics of a target word from the word's context in the corpus. The semantics of the word […] are acquired from its context words […] and […]; that is, the word vector of […] is obtained from the vectors of its context words […] and […]. Vector concatenation, summation, or averaging is generally used when computing word vectors;
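
As a rough illustration of the averaging option mentioned above, the following sketch computes a CBOW-style hidden vector from context word vectors. The embedding values, the placeholder words, and the function names are made up for this example; no training step is shown.

```python
def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cbow_hidden(context_words, word_emb):
    """CBOW-style hidden vector: the mean of the context word vectors."""
    return average([word_emb[w] for w in context_words])


# Toy 2-dimensional embeddings for two placeholder context words.
word_emb = {"left": [1.0, 0.0], "right": [0.0, 1.0]}
hidden = cbow_hidden(["left", "right"], word_emb)
print(hidden)  # [0.5, 0.5]
```

Only the word embedding table contributes to the hidden vector here, which is what distinguishes this baseline from the model of Embodiment 3.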

Specific Embodiment 3

[0103] Specific embodiment 3: the TCCWEI model representing the sentence "[…] (World Progress)":

[0104] As shown in Figure 6, the word […] is acquired jointly from its context words, their characters, and their components. That is, the word vector of […] is obtained jointly from the vectors of the context words […] and […], the characters […], and the components […].
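
The source does not give the formula by which TCCWEI combines the three kinds of vectors. The sketch below therefore assumes the simplest option, a plain average over the context word vectors together with the vectors of their characters and components. All names (`joint_hidden`, `chars_of`, `comps_of`) and all values are hypothetical.

```python
def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def joint_hidden(context_words, word_emb, char_emb, comp_emb, chars_of, comps_of):
    """Average word, character, and component vectors of the context.

    Assumption: all three kinds of vectors are weighted equally. The
    patent may combine them differently.
    """
    vectors = []
    for w in context_words:
        vectors.append(word_emb[w])
        for ch in chars_of[w]:
            vectors.append(char_emb[ch])
            vectors.extend(comp_emb[c] for c in comps_of[ch])
    return average(vectors)


# Toy data: one context word "w" with one character "a",
# which has two components "x" and "y".
word_emb = {"w": [4.0, 0.0]}
char_emb = {"a": [0.0, 4.0]}
comp_emb = {"x": [2.0, 2.0], "y": [2.0, 2.0]}
chars_of = {"w": ["a"]}
comps_of = {"a": ["x", "y"]}

hidden = joint_hidden(["w"], word_emb, char_emb, comp_emb, chars_of, comps_of)
print(hidden)  # [2.0, 2.0]
```

Compared with plain CBOW, the hidden vector now depends on three embedding tables, so a training step would update word, character, and component vectors together.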



Abstract

The invention belongs to the technical field of Tibetan language information processing and discloses a Tibetan word vector representation method that fuses component and character information. The method comprises three models: TCCWEI, which fuses component and character information directly into the Tibetan word vector representation; TCCWEII, which first fuses component information into the vector representation of each character and then fuses the character information into the Tibetan word vector representation; and TCCWEII+P, which adds character position information to the TCCWEII model. Compared with the current best word vector representation, the Tibetan word vector representation model TCCWE provided by the invention improves by 8% on the Tibetan similarity evaluation set TWordSim215 and by 7% on the Tibetan relatedness evaluation set TWordRel215, and the similar and related words it finds are semantically close and highly correlated.
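
The two-stage fusion and the position variant described in the abstract can be sketched as below. The abstract does not state the fusion operators, so this sketch assumes addition and averaging. It also assumes that position information is modelled by keying character vectors on (character, position). The function `compose_word` and all data are hypothetical.

```python
def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def compose_word(word, word_emb, char_emb, comp_emb, chars_of, comps_of,
                 use_position=False):
    """Two-stage composition: components -> character -> word.

    With use_position=True, char_emb is looked up by (character, index),
    so the same character can have a different vector at each position.
    """
    char_vecs = []
    for pos, ch in enumerate(chars_of[word]):
        key = (ch, pos) if use_position else ch
        comp_mean = average([comp_emb[c] for c in comps_of[ch]])
        char_vecs.append(add(char_emb[key], comp_mean))  # stage 1
    return add(word_emb[word], average(char_vecs))       # stage 2


chars_of = {"w": ["a", "b"]}
comps_of = {"a": ["x"], "b": ["y"]}
comp_emb = {"x": [1.0, 0.0], "y": [0.0, 1.0]}
word_emb = {"w": [0.0, 0.0]}

plain = compose_word("w", word_emb, {"a": [1.0, 1.0], "b": [1.0, 1.0]},
                     comp_emb, chars_of, comps_of)
positional = compose_word("w", word_emb,
                          {("a", 0): [1.0, 1.0], ("b", 1): [3.0, 3.0]},
                          comp_emb, chars_of, comps_of, use_position=True)
print(plain)       # [1.5, 1.5]
print(positional)  # [2.5, 2.5]
```

The design point the sketch tries to show is the ordering: component information reaches the word vector only through the character vectors, whereas in the direct variant all three are combined in one step.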

Description

Technical field

[0001] The invention belongs to the technical field of Tibetan information processing, and in particular relates to a Tibetan word vector representation method fusing components and character information.

Background technique

[0002] At present, the word vector representation of language units is a focus of research. Words are the basic language units of natural language processing, and word vectors contain the semantic information of words, so word vectors have become the core issue of language unit vector representation. A word vector not only contains the context information of a word but also maps the word into Euclidean space, so that the distance between words can be calculated as a distance in Euclidean space, which makes it easier to describe the semantic distance between words.

[0003] In natural language processing, word vector representation (Word Vector Representation) is also called distributed represen...
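
The Euclidean distance mentioned in paragraph [0002] can be illustrated with a short sketch that finds the nearest word to a query word. The toy vectors and the helper `nearest` are assumptions made for this example only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, emb):
    """Return the word whose vector is closest to the query word's vector."""
    candidates = [w for w in emb if w != query]
    return min(candidates, key=lambda w: euclidean(emb[query], emb[w]))


# Toy 2-dimensional word vectors with placeholder words.
emb = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0]}
print(euclidean(emb["a"], emb["b"]))  # 5.0
print(nearest("a", emb))              # c
```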


Application Information

IPC(8): G06F17/27, G06F16/33
Inventor: 才智杰, 才让卓玛
Owner QINGHAI NORMAL UNIV