
Word embedding learning method based on Chinese word feature substrings

A word embedding learning method using Chinese character feature substrings, applied in the fields of instruments, digital data processing, semantic analysis, etc. It addresses the problems that existing methods use only single or partial character features, cannot fully capture the semantic information of Chinese characters, and therefore produce poor word embeddings; it achieves faster training, lower time consumption, and an enhanced word embedding effect.

Pending Publication Date: 2020-07-31
UNIV OF ELECTRONIC SCI & TECH OF CHINA
Cites: 5 · Cited by: 2

AI Technical Summary

Problems solved by technology

[0008] The technical problem to be solved by the present invention is that most existing word embedding methods for Chinese use only a single feature, or a partial combination of features, of Chinese characters; they cannot effectively capture the semantic information of Chinese characters, and the resulting word embeddings are poor. The present invention provides a word embedding learning method based on Chinese word feature substrings. It designs feature substrings that integrate the structure, stroke, and pinyin features of Chinese characters, so as to capture the glyph and pronunciation information of Chinese words and solve the problem that a single character feature cannot completely capture the semantic information of Chinese characters; it then uses the target word to predict its context words in order to learn Chinese word embeddings. The method enhances the word embedding effect and provides necessary technical support for Chinese natural language processing, text mining, and related fields.


Image

Figures 1 to 3: Word embedding learning method based on Chinese word feature substrings

Examples


Embodiment

[0042] As shown in Figures 1 to 3, a word embedding learning method based on Chinese word feature substrings according to the present invention comprises the following steps:

[0043] S1: Obtain Chinese text and preprocess it into a corresponding word sequence;
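Step S1 can be illustrated with a minimal, self-contained sketch. The patent does not specify a segmenter (a real pipeline would typically use a tool such as jieba); the forward-maximum-matching routine and toy dictionary below are illustrative assumptions only.

```python
# Sketch of step S1: turning raw Chinese text into a word sequence.
# Forward maximum matching against a toy dictionary; the dictionary
# and sample text are illustrative assumptions, not the patent's setup.

def segment(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dict = {"中文", "词", "嵌入", "学习", "方法"}
print(segment("中文词嵌入学习方法", toy_dict))
# ['中文', '词', '嵌入', '学习', '方法']
```

The greedy longest-match fallback to single characters guarantees the loop always advances, so any input produces a word sequence.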

[0044] S2: Obtain a Chinese target word and its context words from the word sequence obtained in step S1, and split the Chinese target word into individual Chinese characters; look up each Chinese character in a Chinese dictionary, and encode and concatenate its pinyin, stroke, and structural features to generate feature substrings that represent single features or combinations of features of the Chinese character;

[0045] The introduction of structural features effectively distinguishes Chinese characters that share components but differ in meaning, such as "叶" and "古"; the introduction of pinyin features can effectively solve the problem of the same...
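A minimal sketch of the feature-substring construction in step S2. The per-character feature table, the structure and stroke codes, and the substring length range below are all hypothetical placeholders for the Chinese-dictionary lookup and encoding scheme the patent describes but does not fully specify here.

```python
# Sketch of step S2: building feature substrings for one Chinese character.
# Structure codes (e.g. "LR" = left-right), stroke-type letters, and the
# length range are illustrative assumptions, not the patent's encoding.
CHAR_FEATURES = {
    # char: (structure code, stroke codes, pinyin) -- placeholder values
    "叶": ("LR", "szhhs", "ye4"),
    "古": ("U",  "hshhh", "gu3"),
}

def feature_string(char):
    """Concatenate structure, stroke, and pinyin codes into one string."""
    structure, strokes, pinyin = CHAR_FEATURES[char]
    return structure + strokes + pinyin

def feature_substrings(char, min_len=3, max_len=6):
    """All substrings of the feature string within a length range; short
    ones capture a single feature, longer ones span feature combinations."""
    s = feature_string(char)
    return {s[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(s) - n + 1)}

print(feature_string("叶"))            # LRszhhsye4
print("ye4" in feature_substrings("叶"))  # True
```

Because substrings of different lengths are taken over the concatenated encoding, a single substring can cover just the pinyin, just part of the strokes, or a span that bridges structure and strokes, which is how variable-length substrings represent partial or combined features.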


PUM

No PUM

Abstract

The invention discloses a word embedding learning method based on Chinese word feature substrings. The method comprises the steps of: S1, obtaining a Chinese text and processing it into a corresponding word sequence; S2, obtaining a Chinese target word and its context words from the word sequence, splitting the Chinese target word into a plurality of Chinese characters, retrieving each Chinese character in a Chinese dictionary, and encoding and concatenating the pinyin, stroke, and structural features of each Chinese character to generate feature substrings; S3, constructing a prediction model in which the Chinese target word predicts the embeddings of its context words, using a binary log-likelihood method, and training the model to obtain word embedding representations. The stroke, structure, and pinyin characteristics of Chinese characters are integrated, solving the problem that the semantic information of Chinese characters cannot be completely captured by a single characteristic. Feature substrings are proposed to capture Chinese character forms, pinyin information, and the relations between them; feature substrings of different lengths can represent partial characteristics or combinations of characteristics of Chinese characters, providing fine-grained feature representations of Chinese words.
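The prediction model in step S3 can be sketched under stated assumptions: the target word's vector is taken as the sum of its feature-substring embeddings (a fastText-style composition), and a binary log-likelihood (sigmoid) objective scores observed target-context pairs against negative samples. The dimensions, toy substring table, and context vocabulary below are illustrative, not the patent's exact setup.

```python
# Sketch of step S3: binary log-likelihood scoring of (target, context)
# pairs, with the target word composed from feature-substring embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
substring_vecs = {s: rng.normal(scale=0.1, size=dim)
                  for s in ["LRs", "szh", "ye4"]}   # toy substring table
context_vecs = {w: rng.normal(scale=0.1, size=dim)
                for w in ["绿色", "植物"]}           # toy context vocab

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_vec(substrings):
    """Target-word embedding = sum of its feature-substring embeddings."""
    return sum(substring_vecs[s] for s in substrings)

def pair_loss(substrings, context, label):
    """Negative binary log-likelihood for one (target, context) pair;
    label is 1 for an observed pair, 0 for a negative sample."""
    p = sigmoid(word_vec(substrings) @ context_vecs[context])
    return -np.log(p) if label == 1 else -np.log(1.0 - p)

loss = (pair_loss(["LRs", "szh", "ye4"], "绿色", 1)
        + pair_loss(["LRs", "szh", "ye4"], "植物", 0))
print(float(loss))
```

Training would minimize this loss over all windowed pairs by gradient descent, updating both the substring embeddings and the context vectors, so that characters sharing feature substrings share statistical strength.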

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a word embedding learning method based on Chinese word feature substrings.

Background technique

[0002] Word embedding, also known as the distributed representation of words, encodes the semantics of words into a low-dimensional vector space and captures semantic information well. Word embeddings used as input features have proven effective in many natural language processing (NLP) tasks, such as stemming, named entity recognition, text classification, and machine translation. Designing effective models for learning word embeddings is crucial to understanding the semantics of words.

[0003] Most current methods learn word embeddings by modeling the relationship between a target word and its context words. For example, the CBOW model predicts the target word from its context, while the SG model predicts the context th...
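Both model families mentioned in the background (CBOW: context predicts target; SG: target predicts context) start from the same windowed pairs drawn from a word sequence. A minimal sketch of extracting them, with an illustrative sentence and window size:

```python
# Windowed (target, context) pair extraction shared by CBOW- and SG-style
# models; the sample sentence and window size are illustrative.
def training_pairs(words, window=2):
    """Yield (target, context) pairs as used by an SG-style model;
    reading each pair in the reverse direction gives the CBOW view."""
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

print(training_pairs(["我", "喜欢", "中文"], window=1))
# [('我', '喜欢'), ('喜欢', '我'), ('喜欢', '中文'), ('中文', '喜欢')]
```

The patent's method keeps the SG direction (target predicts context) but replaces the atomic target-word vector with the feature-substring composition described in steps S2 and S3.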

Claims


Application Information

Patent Timeline
No application data
IPC(8): G06F40/30, G06F40/289, G06F40/284
Inventors: 刘勇国, 郑子强, 李巧勤, 杨尚明
Owner: UNIV OF ELECTRONIC SCI & TECH OF CHINA