Chinese word segmentation method based on bidirectional LSTM, CNN and CRF

A Chinese word segmentation method using bidirectional LSTM technology, applied in the fields of neural learning methods, special data processing applications, instruments, etc. It addresses the low accuracy and slow speed of traditional Chinese word segmentation, reducing the workload and the cost of human labeling while improving word segmentation efficiency.

Active Publication Date: 2018-07-10
NANJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0005] In view of the deficiencies of the above-mentioned prior art, the present invention provides a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF, which effectively addresses the slow speed and low accuracy of traditional Chinese word segmentation in practical applications.

Examples


Embodiment

[0059] This embodiment provides a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF. The flow chart of the method is shown in Figure 1, and the method includes the following steps:

[0060] Step one:

[0061] The initial corpus is preprocessed: individual characters are extracted as the character feature information of the corpus, and each character is converted into its pinyin form as the pinyin feature information of the corpus. The text is then labeled to obtain the labeled text, and the character table, the alphabet, and the label table are constructed.
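
The table construction in this step can be sketched as follows. This is a minimal illustration, not the patent's implementation: `TOY_PINYIN`, `build_table`, `preprocess`, and the reserved `<PAD>`/`<UNK>` entries are hypothetical names introduced here, and the four-entry pinyin lookup stands in for whatever character-to-pinyin conversion the inventors actually used.

```python
# Hypothetical character-to-pinyin lookup (an assumption for illustration);
# a real system would use a full pronunciation dictionary instead.
TOY_PINYIN = {"我": "wo", "爱": "ai", "北": "bei", "京": "jing"}


def build_table(symbols, reserved=("<PAD>", "<UNK>")):
    """Map each distinct symbol to an integer id, reserved symbols first."""
    table = {tok: i for i, tok in enumerate(reserved)}
    for sym in symbols:
        if sym not in table:
            table[sym] = len(table)
    return table


def preprocess(sentences, labels=("B", "M", "E", "O")):
    """Build the character table, the pinyin alphabet, and the label table."""
    chars = [ch for sent in sentences for ch in sent]
    # Characters missing from the toy lookup contribute no letters.
    pinyin = [TOY_PINYIN.get(ch, "") for ch in chars]
    char_table = build_table(chars)
    alphabet = build_table(letter for syllable in pinyin for letter in syllable)
    label_table = build_table(labels, reserved=())
    return char_table, alphabet, label_table


char_table, alphabet, label_table = preprocess(["我爱北京"])
```

The character table indexes whole characters, while the alphabet indexes the Latin letters of their pinyin, which is the granularity a character-level CNN over pinyin would need.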

[0062] In this step, the BMEO annotation set {B, M, E, O} is used to annotate the text. A character that is the first character of a word and combines with the following characters to form that word is marked B. A character in the middle of a word is marked M. A character at the end of a word is marked E. For single characters, which do not form words before and af...
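
As a small illustration of the labeling scheme, the sketch below converts an already segmented sentence into per-character tags. The function name `tag_words` is hypothetical, and the use of O for single-character words is an assumption: the source sentence describing that case is truncated, and O is the only label in the set not otherwise assigned.

```python
def tag_words(words):
    """Return one BMEO tag per character for a list of segmented words."""
    tags = []
    for word in words:
        if len(word) == 1:
            # Assumption: a single-character word is tagged O.
            tags.append("O")
        else:
            # First character B, interior characters M, last character E.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


example_words = ["我", "爱", "北京", "天安门"]
example_tags = tag_words(example_words)
```

Each character receives exactly one tag, so the tag sequence has the same length as the unsegmented sentence and can serve directly as a training target for a sequence labeler.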


Abstract

The invention discloses a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF, which improves and optimizes traditional Chinese word segmentation using deep learning algorithms. The method comprises the following steps: preprocessing the initial corpus and extracting the character feature information of the corpus and the pinyin feature information corresponding to the characters; using a convolutional neural network to obtain the pinyin feature vectors of the characters; using the word2vec model to obtain the character feature vectors of the text; concatenating the pinyin feature vectors and the character feature vectors to obtain context information vectors, and feeding these vectors into a bidirectional LSTM neural network; decoding the output of the bidirectional LSTM with a linear-chain conditional random field to obtain the word segmentation label sequence; and decoding the word segmentation label sequence to obtain the word segmentation result. The invention uses deep neural networks to extract character features and pinyin features from text and combines them with conditional random field decoding, so it can effectively extract Chinese text features and achieves good results on Chinese word segmentation tasks.
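
The last two steps of the abstract, CRF decoding and label-to-word decoding, can be sketched as below. The source does not name a decoding algorithm; Viterbi search is used here because it is the standard way to find the best path in a linear-chain CRF. The emission scores stand in for the bidirectional LSTM output and are made-up numbers, the transition scores are hand-set rather than learned, and `viterbi`, `labels_to_words`, and `NEG` are hypothetical names. Treating O as a single-character word is also an assumption.

```python
LABELS = ["B", "M", "E", "O"]
NEG = -1e4  # hand-set penalty for transitions that cannot occur


def viterbi(emissions, transitions):
    """Return the highest-scoring label index path.

    emissions[t][k] is the score of label k at position t;
    transitions[i][j] is the score of moving from label i to label j.
    """
    n = len(transitions)
    scores = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_scores, ptrs = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: scores[i] + transitions[i][j])
            ptrs.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = new_scores
        backptrs.append(ptrs)
    path = [max(range(n), key=lambda j: scores[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path


def labels_to_words(chars, labels):
    """Group characters into words according to their BMEO labels."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        if label in ("B", "O") and current:
            words.append(current)  # flush an unfinished word defensively
            current = ""
        current += ch
        if label in ("E", "O"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Rows are the "from" label, columns the "to" label, in LABELS order.
transitions = [
    [NEG, 0.0, 0.0, NEG],  # B -> M or E
    [NEG, 0.0, 0.0, NEG],  # M -> M or E
    [0.0, NEG, NEG, 0.0],  # E -> B or O
    [0.0, NEG, NEG, 0.0],  # O -> B or O
]
# Made-up scores for the four characters of "我爱北京".
emissions = [
    [0.1, 0.0, 0.0, 2.0],
    [0.2, 0.0, 0.0, 1.5],
    [2.0, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.8, 0.9],
]
path = viterbi(emissions, transitions)
decoded_labels = [LABELS[k] for k in path]
decoded_words = labels_to_words("我爱北京", decoded_labels)
```

In this toy input the last character scores highest for B when considered alone, but B cannot follow B, so the search selects E instead. That is the practical benefit of placing a CRF on top of per-character scores: the output is a label sequence that is consistent as a whole.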

Description

Technical field

[0001] The invention relates to a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF, and belongs to the field of natural language processing.

Background

[0002] Chinese word segmentation is a basic task of natural language processing (NLP). Its purpose is to split an input sequence of Chinese characters into individual words.

[0003] In the field of Chinese word segmentation, traditional techniques fall into two categories. The first is based on dictionaries and rules: it traverses the Chinese character string in a fixed order and matches substrings against the entries of a dictionary; if a substring is found in the dictionary, the match succeeds. The second is based on statistics. Related methods include the conditional random field (CRF), the hidden Markov model (HMM), and the maximum entropy model. Among them, the conditional random field has been widely used in the field of Chinese word segmentation in recent ...
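
For comparison with the dictionary-based category described in the background, the following is a minimal sketch of forward maximum matching, one common dictionary-matching strategy. The source does not say which traversal the prior art uses, so this choice, the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, always taking the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


toy_dictionary = {"北京", "天安门", "爱"}
segmented = forward_max_match("我爱北京天安门", toy_dictionary)
```

A matcher of this kind depends entirely on dictionary coverage and cannot resolve ambiguous spans from context, which is part of the motivation for statistical and neural approaches.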


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/04, G06N3/08
CPC: G06N3/049, G06N3/08, G06F40/284, G06N3/045
Inventors: 王保云, 顾孙炎, 苗栋晨
Owner: NANJING UNIV OF POSTS & TELECOMM