A Chinese word segmentation method based on a bidirectional long short-term memory network model

A Chinese word segmentation technology based on long short-term memory, applied in biological neural network models, special data processing applications, instruments, etc. It addresses the inability of earlier models to use the future text information of a sentence and their tendency to overfit, achieving good word segmentation and improved accuracy.

Active Publication Date: 2019-01-15
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

Neural network-based word segmentation methods learn data features automatically, which avoids the limitations that hand-crafted settings impose on traditional word segmentation methods. However, such models are strongly affected by the size of the context window: a large window easily introduces too many features, and the noise they carry makes the model prone to overfitting. In addition, a traditional recurrent neural network (RNN) relies only on the preceding text in sentence order and cannot use the future text information in the sentence.



Examples


Embodiment 1

[0036] Embodiment 1: Figure 1 shows the workflow of the Chinese word segmentation method for the metallurgical field based on the bidirectional long short-term memory (Bi-LSTM) network model. The specific steps are:

[0037] Step 1: Because the metallurgical information field lacks an authoritative corpus, data were crawled from the metallurgical information network to obtain a text data set for the metallurgical field. The text data set was divided into a training set and a test set, and the training set was then preprocessed: the Chinese characters in the training set are tagged with the BMES tagging method, as shown in Table 1. For a multi-character word, B labels the first character, M labels every character between the first and the last, and E labels the last character; S labels a single-character word. The data set msr...
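The BMES tagging rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names `bmes_tags` and `tag_sentence` and the sample sentence are assumptions introduced here, and the input is assumed to be already segmented into words.

```python
def bmes_tags(word):
    """Return the BMES tag sequence for one already-segmented word."""
    if not word:
        return []
    if len(word) == 1:
        return ["S"]  # single-character word
    # first character, zero or more middle characters, last character
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def tag_sentence(words):
    """Flatten a pre-segmented sentence into parallel character and tag lists."""
    chars, tags = [], []
    for word in words:
        chars.extend(word)
        tags.extend(bmes_tags(word))
    return chars, tags


# Hypothetical sample sentence, segmented by hand for illustration.
chars, tags = tag_sentence(["冶金", "工业", "的", "发展"])
print(list(zip(chars, tags)))
```

Each character receives exactly one tag, so the segmentation can be recovered later by cutting after every E or S.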

Embodiment 2

[0058] Embodiment 2: The method of this embodiment is the same as that of Embodiment 1, except that it is applied to a non-metallurgical field. The selected text is tagged with the four word-position labels (BEMS), and the results are shown in Table 7:

[0059] Table 7 Four-position tag format

[0060]

[0061] Split the labeled data at punctuation marks, and use the arrays data and label to hold the results of the split, as shown in Table 8:

[0062] Table 8 data and label data format

[0063]
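Splitting the tagged text at punctuation marks into the `data` and `label` arrays can be sketched as follows. The source does not say which punctuation marks are used or whether the marks themselves are kept, so the `PUNCTUATION` set and the choice to drop the marks are assumptions, as are the function name `split_on_punctuation` and the sample sentence.

```python
# Assumed set of sentence-internal and sentence-final punctuation marks.
PUNCTUATION = set("，。！？；：、")


def split_on_punctuation(chars, tags):
    """Split parallel character/tag lists into segments at punctuation marks.

    Returns (data, label): data[i] is the character list of segment i and
    label[i] is the matching tag list. Punctuation is dropped (an assumption).
    """
    data, label = [], []
    cur_chars, cur_tags = [], []
    for ch, tag in zip(chars, tags):
        if ch in PUNCTUATION:
            if cur_chars:  # close the current segment, skip empty ones
                data.append(cur_chars)
                label.append(cur_tags)
            cur_chars, cur_tags = [], []
        else:
            cur_chars.append(ch)
            cur_tags.append(tag)
    if cur_chars:  # trailing segment without closing punctuation
        data.append(cur_chars)
        label.append(cur_tags)
    return data, label


# Hypothetical tagged sentence for illustration.
chars = list("我爱北京，天气很好。")
tags = ["S", "S", "B", "E", "S", "B", "E", "S", "S", "S"]
data, label = split_on_punctuation(chars, tags)
print(data)
print(label)
```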

[0064] The data array holds each Chinese character, and the label array holds the tag corresponding to each character. The two arrays are then digitized separately: each Chinese character in the data array is represented by the sequence number of its first appearance and stored in d['x'], and the labels of the label data gr...
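The character digitization can be sketched as follows, under stated assumptions. Only `d['x']` comes from the text; numbering from 1, the helper name `build_char_index`, the sample arrays, and the handling of labels (the `TAG_TO_ID` mapping and the `label_ids` variable) are hypothetical, because the source passage is truncated before it describes how labels are stored.

```python
def build_char_index(data):
    """Map each character to the order of its first appearance (starting at 1)."""
    char_to_id = {}
    for sentence in data:
        for ch in sentence:
            if ch not in char_to_id:
                char_to_id[ch] = len(char_to_id) + 1
    return char_to_id


# Hypothetical sample arrays in the data/label format.
data = [["我", "爱", "北", "京"], ["北", "京", "很", "好"]]
label = [["S", "S", "B", "E"], ["B", "E", "S", "S"]]

char_to_id = build_char_index(data)
d = {"x": [[char_to_id[ch] for ch in sentence] for sentence in data]}

# Assumed label encoding; the source does not give the actual mapping.
TAG_TO_ID = {"B": 0, "M": 1, "E": 2, "S": 3}
label_ids = [[TAG_TO_ID[t] for t in sentence] for sentence in label]

print(d["x"])
print(label_ids)
```

Repeated characters reuse the number assigned at their first appearance, so 北 and 京 keep the same IDs in both sentences.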



Abstract

The invention discloses a Chinese word segmentation method based on a bidirectional long short-term memory (Bi-LSTM) network model. First, a data set from any field is obtained and divided into a training set and a test set. The training set is then preprocessed, and the preprocessed training set and the open data set msr of Microsoft Research Asia are each converted to embeddings. The processed training set and the msr data set are input into the Bi-LSTM neural network model for training, yielding the X_Bi-LSTM model and the msr_Bi-LSTM model respectively. The X_Bi-LSTM model and the msr_Bi-LSTM model each predict the labels of the test set, and the prediction probabilities of the two models are combined by weighting to obtain the combined probability of each label for each Chinese character. The Viterbi algorithm is then applied to the combined label probabilities to obtain the final probability of each Chinese character belonging to each label, and the label with the maximum probability is taken as the label of that character, thereby completing Chinese word segmentation. The invention obtains better word segmentation results and improves the accuracy of word segmentation.
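The last two stages of the abstract, weighting the two models' probabilities and decoding with the Viterbi algorithm, can be sketched in plain Python. This is an illustration under assumptions, not the patent's procedure: the weight value, the hard BMES transition constraints in `ALLOWED`, the function names `combine` and `viterbi`, and the sample probabilities are all hypothetical, since the text shown here does not give the weights or the transition scores.

```python
import math

TAGS = ["B", "M", "E", "S"]
# Assumed hard constraints: which tags may follow each tag in a valid BMES sequence.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}
START = ("B", "S")  # a sentence can only begin a word
END = ("E", "S")    # a sentence can only end a word


def combine(p_x, p_msr, weight=0.5):
    """Weighted average of two models' per-character label probabilities."""
    return [[weight * a + (1 - weight) * b for a, b in zip(rx, rm)]
            for rx, rm in zip(p_x, p_msr)]


def viterbi(probs):
    """Return the highest-scoring valid tag path for per-character probabilities."""
    if not probs:
        return []
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    score = {t: log(probs[0][i]) if t in START else float("-inf")
             for i, t in enumerate(TAGS)}
    back = []
    for row in probs[1:]:
        cur, ptr = {}, {}
        for i, t in enumerate(TAGS):
            # best previous tag among those allowed to precede t
            prev = max((p for p in TAGS if t in ALLOWED[p]), key=lambda p: score[p])
            cur[t] = score[prev] + log(row[i])
            ptr[t] = prev
        score = cur
        back.append(ptr)
    path = [max(END, key=lambda t: score[t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical outputs of the two models for a three-character sentence,
# one row per character, columns ordered as TAGS.
p_x = [[0.6, 0.1, 0.1, 0.2], [0.1, 0.2, 0.6, 0.1], [0.1, 0.1, 0.2, 0.6]]
p_msr = [[0.4, 0.1, 0.1, 0.4], [0.3, 0.1, 0.5, 0.1], [0.2, 0.1, 0.1, 0.6]]
best_path = viterbi(combine(p_x, p_msr, weight=0.5))
print(best_path)
```

Decoding over whole paths rather than taking the per-character maximum rules out invalid sequences such as S followed by M.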

Description

Technical field

[0001] The invention relates to a Chinese word segmentation method based on a bidirectional long short-term memory network model, and belongs to the field of natural language processing.

Background

[0002] In Chinese, there are no separators between words, and the words themselves lack obvious morphological marks. A problem specific to Chinese information processing is therefore how to divide a string of Chinese characters into a reasonable sequence of words, that is, Chinese word segmentation. Word segmentation is the first step in Chinese natural language processing. It distinguishes Chinese natural language processing systems from those for other languages, and it is also an important factor affecting the application of natural language processing to Chinese information processing. In recent years, many scholars at home and abroad have done a great deal of research in the field of Chinese word segmentation and achieved certain results. However, f...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/04
CPC: G06N3/04, G06F40/284
Inventor: 邵党国, 郑娜
Owner: KUNMING UNIV OF SCI & TECH