
Training method of a Chinese word segmentation model based on a neural network

A Chinese word segmentation and neural network technology, applied in special data processing applications, instruments, and electrical digital data processing. It addresses the problem that existing approaches increase model complexity or reduce word segmentation performance, and achieves the effects of improving word segmentation performance and expanding the training corpus.

Active Publication Date: 2019-05-24
SUZHOU UNIV
3 Cites, 14 Cited by

AI Technical Summary

Problems solved by technology

[0009] The object of the present invention is to provide a training method, device, equipment, and computer-readable storage medium for a neural network-based Chinese word segmentation model. It aims to solve the problem that traditional methods of training a Chinese word segmentation model on training corpora of multiple word segmentation specifications either increase the complexity of the model or reduce its word segmentation performance.


Examples


Embodiment 1

[0052] The first embodiment of the training method of a neural network-based Chinese word segmentation model provided by the present invention is introduced below. Referring to Figure 1, Embodiment 1 includes:

[0053] Step S101: Obtain training corpora of multiple word segmentation specifications.

[0054] A word segmentation specification is the set of rules and criteria by which a text sentence is divided into words. Known word segmentation specifications include CTB, PKU, and MSR. Different specifications may segment the same text sentence differently, and each result is reasonable under its own specification. As shown in Table 1, depending on the specification, the phrase "all parts of the country" can be segmented into different word sequences, such as "all / country / each place".

[0055] Table 1

[0056]
Word segmentation specification    Word segmentation result
CTB                                Whole|Country|Each|Distri...

Embodiment 2

[0071] Embodiment 2 is a specific implementation. For the word segmentation specifications, three are selected to train the model: CTB, MSR, and PKU. For the vector representation, the unigram embedding vector and the bigram embedding vector of a character are used to form the vector representation of the character. For the Chinese word segmentation model, the BiLSTM-CRF model is selected.
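One possible way to assemble a character representation from unigram and bigram embeddings is sketched below in Python. This is only an illustration under stated assumptions: the function names, the embedding dimension, the random initialisation, the `</s>` padding token, the sample sentence, and the choice to pair each character with its right-hand neighbour and to concatenate the two vectors are all hypothetical and are not taken from the patent text.

```python
import random


def make_table(dim, seed=0):
    """Return a lookup function that lazily assigns a random vector to each key.

    Stands in for a trainable embedding table (an assumption for illustration).
    """
    rng = random.Random(seed)
    table = {}

    def lookup(key):
        if key not in table:
            table[key] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        return table[key]

    return lookup


def char_representations(sentence, unigram_lookup, bigram_lookup, pad="</s>"):
    """Build one vector per character: unigram embedding + bigram embedding.

    The bigram is the character plus its right neighbour (assumed design);
    the last character is paired with a padding token.
    """
    reps = []
    for i, ch in enumerate(sentence):
        nxt = sentence[i + 1] if i + 1 < len(sentence) else pad
        # List concatenation joins the two embeddings into one longer vector.
        reps.append(unigram_lookup(ch) + bigram_lookup(ch + nxt))
    return reps


uni = make_table(dim=4, seed=1)
bi = make_table(dim=4, seed=2)
reps = char_representations("我爱北京", uni, bi)  # arbitrary sample sentence
```

Each entry of `reps` is then the input for one time step of the sequence model; in a real system the tables would be trained parameters rather than fixed random vectors.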

[0072] Referring to Figure 2, Embodiment 2 specifically includes:

[0073] Step S201: Obtain the training corpora of CTB, MSR, and PKU.

[0074] The training corpus includes text sentences and the label sequences corresponding to those text sentences. In the fully labeled scenario, when the word segmentation specification is uniquely determined, each character, and even each punctuation mark, of a text sentence has a definite label. Therefore, a text sentence has only one reasonabl...
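The relationship between a segmented sentence and its label sequence can be illustrated with a short Python sketch. The excerpt does not name the tag set, so the B/M/E/S scheme used here (begin, middle, end, single-character word) is an assumption, as are the function names and the sample sentence.

```python
def words_to_tags(words):
    """Convert a list of words into one tag per character (assumed BMES scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # First character B, last character E, anything between M.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Recover the word sequence from characters and their tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


segmented = ["我", "爱", "北京"]  # arbitrary sample segmentation
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Under a fixed specification every character receives exactly one tag, so the mapping between a segmentation and its tag sequence is one-to-one in both directions.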


Abstract

The invention discloses a training method of a Chinese word segmentation model based on a neural network. A corresponding corpus feature vector is set for each of multiple word segmentation specifications. After training corpora of the multiple word segmentation specifications are obtained, the vector representation of each character is determined from the embedding vector of the character and the corpus feature vector. The vector representation of each character in a text sentence is then input into the Chinese word segmentation model to obtain a prediction result, and the model parameters are adjusted according to the prediction result to complete training. The method therefore does not need to change the model structure: the corresponding corpus feature vector only needs to be added to the vector representation of each character, and these representations are used to train the model. This expands the training corpus and lets the model learn what the different word segmentation specifications have in common, which improves word segmentation performance under a single word segmentation specification. In addition, the invention provides a training device and equipment for the neural network-based Chinese word segmentation model and a computer-readable storage medium, whose effects correspond to those of the method.
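The idea of attaching a corpus feature vector to each character representation can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: the abstract does not say whether the corpus feature vector is concatenated or summed, so concatenation is assumed here, and the names `init_vectors` and `represent`, the vector dimensions, and the random initialisation are hypothetical.

```python
import random

SPECS = ("CTB", "MSR", "PKU")  # specifications named in the description


def init_vectors(keys, dim, seed):
    """Assign a random vector to each key (stand-in for trainable parameters)."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}


def represent(sentence, spec, char_emb, corpus_emb):
    """Append the corpus feature vector of `spec` to every character embedding.

    Concatenation is an assumed way of combining the two vectors.
    """
    corpus_vec = corpus_emb[spec]
    return [char_emb[ch] + corpus_vec for ch in sentence]


char_emb = init_vectors(sorted(set("我爱北京")), dim=4, seed=1)
corpus_emb = init_vectors(SPECS, dim=2, seed=2)

ctb_input = represent("北京", "CTB", char_emb, corpus_emb)
pku_input = represent("北京", "PKU", char_emb, corpus_emb)
```

The same sentence drawn from two corpora shares its character part and differs only in the appended corpus part, so a single model with an unchanged structure can be trained on all corpora while still being told which specification to follow.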

Description

Technical field

[0001] The invention relates to the field of natural language processing, in particular to a training method, device, equipment and computer-readable storage medium of a neural network-based Chinese word segmentation model.

Background technique

[0002] Chinese word segmentation is a process of dividing text sentences into word sequences. Relevant scholars have proposed a variety of word segmentation specifications, and manually marked the corresponding training corpus for different word segmentation specifications. With the development of neural networks, Chinese word segmentation by training neural network-based models is becoming more and more common.

[0003] At present, most Chinese word segmentation models focus on using the training corpus of the same word segmentation specification to improve the word segmentation performance under the word segmentation specification. This method is limited by the number of training corpora, and it is difficult to imp...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/332, G06F17/27
CPC: Y02D10/00
Inventor: 李正华, 朱运, 黄德朋, 张民, 陈文亮
Owner: SUZHOU UNIV