Self-adaptive Chinese word segmentation method based on embedded representation

A Chinese word segmentation technique based on embedded representation, applied in special data processing applications, natural language data processing, instruments, etc.

Active Publication Date: 2017-09-08
BEIJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

In semi-supervised domain transfer, a large amount of labeled data is available in the source domain, but only unlabeled data is available in the target domain.


Embodiment Construction

[0022] The implementation of the present invention is described in more detail below.

[0023] Figure 1 is a network structure diagram of the word segmentation method provided by the present invention, which includes:

[0024] Training part:

[0025] Step S1: The shared character embedding layer maps the characters of the input labeled sentences and of randomly sampled unlabeled sentences to vectors;

[0026] Step S2: The convolutional neural network extracts hidden multi-granularity local information from the labeled sentence;

[0027] Step S3: The feed-forward neural network calculates the label score of each character;

[0028] Step S4: Label inference is used to obtain the optimal label sequence and the loss function value;

[0029] Step S5: The unlabeled sentence is fed into a character language model based on a long short-term memory (LSTM) recurrent neural network to obtain the hidden-layer representation at each character position;

[003...
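The label inference of step S4 can be illustrated with a small Viterbi search over per-character label scores. This is only a sketch under stated assumptions: the source does not name the label set or the inference algorithm, so the B/M/E/S labels, the hard transition constraints in `TRANS`, and the hand-picked `scores` below are all hypothetical.

```python
TAGS = ["B", "M", "E", "S"]  # assumed label set: Begin, Middle, End, Single
NEG = float("-inf")

# Hypothetical transition scores: 0.0 for moves the B/M/E/S scheme allows, -inf otherwise.
ALLOWED = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}
TRANS = {p: {n: (0.0 if n in ALLOWED[p] else NEG) for n in TAGS} for p in TAGS}


def viterbi(scores):
    """scores: one dict (tag -> score) per character. Returns (best_tags, best_score)."""
    if not scores:
        return [], 0.0
    # A sentence can only start with B or S.
    best = {t: (scores[0][t] if t in "BS" else NEG) for t in TAGS}
    back = []
    for step in scores[1:]:
        new_best, pointers = {}, {}
        for t in TAGS:
            # Best previous tag for reaching tag t at this position.
            prev = max(TAGS, key=lambda p: best[p] + TRANS[p][t])
            new_best[t] = best[prev] + TRANS[prev][t] + step[t]
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    # A sentence can only end with E or S.
    last = max("ES", key=lambda t: best[t])
    total = best[last]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total


# Hand-picked scores for three characters (illustrative, not model output).
scores = [
    {"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
    {"B": 0.0, "M": 0.5, "E": 1.0, "S": 1.5},
    {"B": 0.5, "M": 0.0, "E": 0.0, "S": 3.0},
]
print(viterbi(scores))  # (['B', 'E', 'S'], 6.0)
```

Taking the highest-scoring label at each position independently would give B, S, S here, which is not a valid segmentation because B must be followed by M or E. Sentence-level inference returns the best valid sequence instead.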

Abstract

The embodiment of the invention discloses a self-adaptive Chinese word segmentation method based on embedded representation, belonging to the field of information processing. The method is characterized in that a character embedding layer is shared by a word segmentation network and a character language model. On the one hand, the word segmentation network, which is based on a convolutional neural network, extracts hidden multi-granularity local features of the text to be segmented from the character embeddings; a feed-forward network layer then produces the label probabilities of each character; finally, label inference yields the optimal segmentation result at the sentence level. On the other hand, unlabeled text is randomly sampled, and a character language model based on a long short-term memory (LSTM) recurrent neural network predicts the next character at each position, which constrains the word segmentation network. By modeling the character co-occurrence relationships in texts from different domains with the character language model and transferring this information to the word segmentation network through the shared embedded representation, the domain transfer ability of word segmentation is enhanced, and the method has considerable practical value.
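As a rough illustration of "multi-granularity local features", the sketch below slides filters of several widths over a sequence of character embeddings and concatenates their outputs per character. The filter widths (1, 2 and 3), zero padding, ReLU activation, and all numeric values are assumptions chosen for readability. The source does not specify them, and a real network would learn the weights.

```python
def conv1d(embeddings, kernel, bias=0.0):
    """Apply one filter of width len(kernel) at every character position.

    Uses zero padding so the output has one value per character, then ReLU.
    """
    width, dim = len(kernel), len(embeddings[0])
    left = (width - 1) // 2
    pad = [[0.0] * dim]
    padded = pad * left + embeddings + pad * (width - 1 - left)
    out = []
    for i in range(len(embeddings)):
        window = padded[i:i + width]
        total = bias + sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        out.append(max(0.0, total))
    return out


def multi_granularity_features(embeddings, kernels):
    """Concatenate, per character, the outputs of filters with different widths."""
    per_filter = [conv1d(embeddings, k) for k in kernels]
    return [list(col) for col in zip(*per_filter)]


# Three characters with 2-dimensional embeddings (toy values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One hypothetical filter per window width: 1, 2 and 3 characters.
kernels = [
    [[1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
features = multi_granularity_features(embeddings, kernels)
print(features)  # [[1.0, 2.0, 2.0], [1.0, 1.0, 4.0], [2.0, 1.0, 3.0]]
```

Each row of `features` describes one character through windows of one, two and three characters, which is the sense in which the features cover several granularities.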

Description

Technical field

[0001] The invention relates to the field of information processing, and in particular to a domain transfer method for neural-network-based Chinese word segmentation.

Background

[0002] Chinese word segmentation is a basic task in Chinese natural language processing. Its goal is to convert a sequence of Chinese characters into a sequence of Chinese words. Because words are the basic unit of semantic expression in Chinese, the performance of the word segmentation system directly affects downstream tasks of Chinese natural language processing, such as information retrieval and machine translation.

[0003] In the past ten years, there has been a great deal of research on Chinese word segmentation, and many remarkable achievements have been made. On the one hand, many standard data sets for Chinese word segmentation have been established; on the other hand, ma...
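To make the task concrete, the conversion from a character sequence to a word sequence is commonly framed as assigning a position label to each character and then grouping characters by label. The sketch below assumes the widely used B/M/E/S scheme (Begin, Middle, End, Single); the source text does not say which label set the invention uses, and the function name `tags_to_words` is hypothetical.

```python
def tags_to_words(chars, tags):
    """Group characters into words using B/M/E/S position tags (assumed scheme)."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # E closes a multi-character word; S marks a single-character word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends in the middle of a word
        words.append(current)
    return words


sentence = "我爱北京"
tags = ["S", "S", "B", "E"]
print(tags_to_words(sentence, tags))  # ['我', '爱', '北京']
```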

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/04
CPC: G06F40/289, G06N3/04
Inventor: 李思, 包祖贻, 徐蔚然, 高升
Owner: BEIJING UNIV OF POSTS & TELECOMM