A word vector representation method based on Chinese language element and pinyin joint statistics

A word vector representation technology based on joint statistics of Chinese morphemes and pinyin, applied in the field of Chinese word vector representation. It addresses problems such as poor compatibility with the distinctive characteristics of the Chinese language, typos degrading the performance of Chinese word vector representation models, and the difficulty of learning directly from Internet text data, and it achieves high compatibility.

Active Publication Date: 2019-05-28
STATE GRID ZHEJIANG ELECTRIC POWER CO LTD HANGZHOU POWER SUPPLY CO +2
5 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

Typos affect not only word segmentation results but also the performance of the Chinese word vector representation model.
[0004] Existing representation methods such as regular expressions, vector spaces, and word vectors cannot adapt to both offline dictionaries and the scale of corpus data, and they have difficulty learning directly from large-scale unlabeled Internet text data.
At the same time, conventional word embedding models are poorly compatible with the distinctive characteristics of the Chinese language, and their accuracy in representing and recognizing typos and wrongly written words is low.



Examples


Embodiment Construction

[0029] To make the object, technical solution, and advantages of the present invention clearer, a further detailed description is given below in conjunction with specific embodiments of the present invention and the accompanying drawings. The described embodiments are only some embodiments of the present invention and do not cover all application scenarios.

[0030] The invention provides a Chinese word vector representation method based on joint statistics of morphemes and pinyin. The method comprises the following steps:

[0031] 1. Generating word representation vectors requires the support of a large corpus. The corpus is built mainly from Internet news, forum and media content, and Wikipedia's open-source text corpus. The invention collects the Wikipedia Chinese data set as a general corpus, and the news data on the official website of the State Grid Zhejiang Electric Power Company as a professional corpu...


Abstract

A word vector representation method based on joint statistics of Chinese morphemes and pinyin comprises the following steps: (1) collecting Internet text information to construct a corpus, and performing text cleaning and word segmentation on the constructed corpus; (2) after word segmentation of the Chinese corpus, converting the processed corpus into pinyin without tone information, and then computing the term-frequency and inverse-document-frequency weights TFc, IDFc, TFp and IDFp for the morpheme and pinyin features, over the training set corpus and the whole document collection respectively; (3) constructing Chinese single-morpheme representation vectors with a Chinese word representation model based on joint statistics of contextual morphemes and pinyin; and (4) on the basis of step (3), training a three-layer neural network to predict the central target word. The method adapts to offline dictionaries and to the scale of the corpus data, can learn directly from large-scale unlabeled Internet text data, improves on how conventional word embedding models handle the distinctive characteristics of the Chinese language, and improves the accuracy of representing and recognizing wrongly written words.
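The statistics in step (2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the exact weighting formulas are not given in this listing, so the sketch assumes the textbook definitions (term frequency as count divided by document length, inverse document frequency as log(N / df)). The lookup table `TONELESS_PINYIN`, the helper names, and the toy corpus are all hypothetical, and each character is treated as one morpheme for simplicity.

```python
import math
from collections import Counter

# Hypothetical toy lookup: character -> pinyin without tone information.
# A real pipeline would rely on a full pronunciation dictionary.
TONELESS_PINYIN = {"电": "dian", "力": "li", "网": "wang", "店": "dian"}


def to_pinyin(chars):
    """Map each character to toneless pinyin; unknown characters stay unchanged."""
    return [TONELESS_PINYIN.get(ch, ch) for ch in chars]


def tf(tokens):
    """Term frequency (assumed form): token count divided by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}


def idf(docs):
    """Inverse document frequency (assumed form): log(N / df)."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / d) for tok, d in df.items()}


def joint_statistics(corpus):
    """Compute TFc/IDFc over characters and TFp/IDFp over toneless pinyin."""
    char_docs = [list(doc) for doc in corpus]
    pinyin_docs = [to_pinyin(doc) for doc in char_docs]
    return {
        "TFc": [tf(d) for d in char_docs],
        "IDFc": idf(char_docs),
        "TFp": [tf(d) for d in pinyin_docs],
        "IDFp": idf(pinyin_docs),
    }


# Toy corpus: three very short "documents".
corpus = ["电力网", "电网", "店"]
stats = joint_statistics(corpus)
```

In this toy corpus, "电" and "店" are different characters with the same toneless pinyin "dian", so their pinyin-level counts are pooled while their character-level counts stay separate. That pooling is one plausible reason pinyin statistics could help with homophone typos, though the listing does not spell out the mechanism.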

Description

Technical field:

[0001] The invention belongs to the technical field of natural language processing and relates to a Chinese word vector representation model, in particular to a word vector representation method based on joint statistics of Chinese morphemes and pinyin.

Background technique:

[0002] At present, natural language processing technology is applied in many areas, and word representation in text is a basic research topic in the field. Chinese word representation expresses Chinese characters as numeric vectors and feeds them to a neural network language model. Data representation is a preparatory step, and its quality strongly affects language model training and performance in downstream applications.

[0003] Usually, the completion of text data analysis for natural language processing requires the mining of massive text corpus information. With the rapid g...


Application Information

IPC(8): G06F17/27, G06N3/04, G06N3/08
Inventor 潘坚跃, 刘祝平, 潘艺旻, 王译田, 陈文康, 王汝英, 李欣荣, 赵光俊, 周航帆, 魏伟, 刘畅, 李艳
Owner STATE GRID ZHEJIANG ELECTRIC POWER CO LTD HANGZHOU POWER SUPPLY CO