Chinese word segmentation method by using character embedding based on word context and neural network

A neural-network-based Chinese word segmentation technique, classified under biological neural network models, semantic analysis, and electrical digital data processing, that addresses the insufficient use of word-level information in existing segmentation models.

Active Publication Date: 2017-09-15
NANJING UNIV

AI Technical Summary

Problems solved by technology

[0007] Purpose of the invention: Existing character-tagging models for Chinese word segmentation cannot make full use of word-level information. To address this shortcoming, the present invention proposes a method for learning character embeddings from word context, which fuses word-level information indirectly and thereby improves the accuracy of the Chinese word segmentation task.


Examples


Embodiment 1

[0145] The labeled data used in this example is the Penn Chinese Treebank (CTB) 6.0, with 23,401 sentences in the training set, 2,078 in the development set, and 2,795 in the test set. The automatically segmented data consists of 41,071,242 sentences drawn from Chinese Gigaword (LDC2011T13).

[0146] In this embodiment, the complete process of applying the Chinese word segmentation method of the present invention, which combines character embeddings based on word context with a neural network, is as follows:

[0147] Step 1-1, determine the tag set of the character tagging model by defining four tags, B, M, E, and S; see 1-1 in the description for their specific meanings;
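
The four-tag scheme of Step 1-1 can be illustrated with a small sketch. It assumes the conventional reading of the tags (B = first character of a multi-character word, M = middle character, E = last character, S = single-character word); the function names and the example sentence are hypothetical and are not taken from the patent.

```python
def words_to_tags(words):
    """Map a list of segmented words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings defensively
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


# Hypothetical example sentence, already segmented into words.
example_words = ["南京", "大学", "的", "研究生"]
example_tags = words_to_tags(example_words)
```

Under this convention, segmentation reduces to predicting one tag per character, and the word sequence is recovered by cutting after every E or S.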

[0148] Step 1-2, train on the automatically segmented Chinese Gigaword data to obtain the single-character embedding matrix e_uni and the two-character (bigram) embedding matrix e_bi;
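
This extract does not give the training objective used in Step 1-2. As a rough sketch only, the code below shows one hypothetical way to turn automatically segmented sentences into training instances that pair each character with its neighbouring characters and their word-position tags, plus the two-character units that a bigram embedding table would index. The window size, the instance layout, and all function names are assumptions, not the patent's definition.

```python
def position_tags(word):
    """B/M/E/S tags for the characters of one word (conventional reading)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def flatten(segmented_sentence):
    """Turn a list of words into parallel lists of characters and tags."""
    chars, tags = [], []
    for word in segmented_sentence:
        if word:
            chars.extend(word)
            tags.extend(position_tags(word))
    return chars, tags


def training_instances(segmented_sentence, window=1):
    """Pair each character with its tagged neighbours inside the window."""
    chars, tags = flatten(segmented_sentence)
    instances = []
    for i, ch in enumerate(chars):
        lo, hi = max(0, i - window), min(len(chars), i + window + 1)
        context = [(chars[j], tags[j]) for j in range(lo, hi) if j != i]
        instances.append((ch, context))
    return instances


def bigram_units(segmented_sentence):
    """Adjacent two-character units, ignoring word boundaries."""
    chars, _ = flatten(segmented_sentence)
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
```

Instances of this shape could feed a skip-gram-style learner; whether the patent's objective matches that is not stated in the text available here.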

[0149] Step 2-1, read a Chinese sentence ("You will come right away" in translation) and compute the score of each tag at each position:

[0150] 1. Your score(B)...
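
The text is truncated at this point, so the way the per-position tag scores are combined is not shown. Assuming they are decoded into a single legal tag sequence, a minimal Viterbi sketch over the B/M/E/S tags could look as follows. The transition constraints (for example, B may only be followed by M or E), the additive scoring, and the numeric scores in the usage example are assumptions for illustration only.

```python
import math

TAGS = ("B", "M", "E", "S")
# Assumed legal successors of each tag under the B/M/E/S convention.
ALLOWED = {"B": ("M", "E"), "M": ("M", "E"), "E": ("B", "S"), "S": ("B", "S")}


def viterbi(scores):
    """Return the highest-scoring legal tag sequence.

    scores: list with one dict per character, mapping each tag to a float.
    """
    if not scores:
        return []
    # A sentence can only start with B or S.
    best = {t: (scores[0][t] if t in ("B", "S") else -math.inf) for t in TAGS}
    back = []
    for pos in scores[1:]:
        new_best, pointers = {}, {}
        for t in TAGS:
            # Best predecessor among the tags that may precede t.
            prev = max((p for p in TAGS if t in ALLOWED[p]), key=lambda p: best[p])
            new_best[t] = best[prev] + pos[t]
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    # A sentence can only end with E or S.
    path = [max(("E", "S"), key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for a three-character sentence.
demo_scores = [
    {"B": 2.0, "M": 0.1, "E": 0.0, "S": 1.0},
    {"B": 0.5, "M": 0.2, "E": 1.5, "S": 0.3},
    {"B": 1.0, "M": 0.0, "E": 0.2, "S": 0.9},
]
demo_path = viterbi(demo_scores)
```

With these scores, picking the best tag independently at each position would end the sentence on B, which is not a legal final tag; the constrained search avoids such sequences.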

Embodiment 2

[0158] The algorithms used in the present invention are all implemented in C++. The experiments in this embodiment were run on a machine with an Intel(R) Core(TM) i7-4790K processor (4.0 GHz clock frequency) and 24 GB of memory. The labeled data used in this example is the Penn Chinese Treebank (CTB) 6.0, with 23,401 sentences in the training set, 2,078 in the development set, and 2,795 in the test set. The automatically segmented data consists of 41,071,242 sentences drawn from Chinese Gigaword (LDC2011T13). The model parameters are trained on the Gigaword data and the CTB 6.0 data. The experimental results are shown in Table 1:

[0159] Table 1: Experimental results

[0160]

[0161]

[0162] Among them, Xu and Sun (2016) adopted a word segmentation model based on a dependency-based recursive neural network, Liu (2016) used a word segmentation model based on segment representations, Zhang (2016) used a neural n...


Abstract

The invention proposes a Chinese word segmentation method that combines character embeddings based on word context with a neural network. The character embeddings are learned on large-scale automatically segmented data and are then used as the input of a neural network segmentation model, which effectively helps the model learn. The method comprises two concrete steps: learning character embeddings on large-scale automatically segmented data according to word context and word-position tags; and using these character embeddings as the input of the neural network segmentation model to improve segmentation performance. Compared with other neural-network-based Chinese word segmentation methods, the method uses character embeddings based on word context, so that word information is effectively integrated into the segmentation model and the accuracy of the word segmentation task is improved.

Description

Technical field

[0001] The invention relates to a method for segmenting Chinese text into words by computer, and in particular to a method for automatic Chinese word segmentation that combines character embeddings based on word context with a neural network.

Background technique

[0002] Chinese word segmentation is a basic task of natural language processing. Its wide range of applications has attracted a large body of research, which has driven the rapid development of related technologies. Unlike Western languages, Chinese has no explicit delimiters between the words of a sentence. The smallest unit in most natural language processing tasks is the word, so for Chinese the first problem is to identify the word sequence. Current approaches to Chinese word segmentation fall roughly into two categories: rule-based methods and statistical methods. Dictionary-based rule-based metho...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/02
CPC: G06F40/289, G06F40/30, G06N3/02
Inventors: 戴新宇, 郁振庭, 陈家骏, 黄书剑, 张建兵
Owner: NANJING UNIV