
A Chinese word segmentation method based on bidirectional LSTM, CNN and CRF

A Chinese word segmentation method using a bidirectional network, applied in the fields of neural learning methods, instruments, and biological neural network models. It addresses the slow speed and low accuracy of traditional Chinese word segmentation, with the effects of reducing workload, reducing manual labeling costs, and reducing the sentence feature dimension.

Active Publication Date: 2021-11-02
NANJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0005] To address the deficiencies of the prior art described above, the present invention provides a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF, which effectively solves the problems of slow speed and low accuracy of traditional Chinese word segmentation in practical applications.



Examples


Embodiment

[0059] The present embodiment provides a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF. The flow chart of the method is shown in Figure 1, and the method includes the following steps:

[0060] Step one:

[0061] The initial corpus is preprocessed: each single character is extracted as the character feature information of the corpus, and each character is converted into its pinyin form as the pinyin feature information of the corpus. The text is then annotated to obtain the labeled text, and the character table, the alphabet and the label table are constructed.
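The table construction in step one can be illustrated with a minimal Python sketch. The function name `build_tables`, the toy lookup `TOY_PINYIN`, and the choice to reserve index 0 for padding are all assumptions made for illustration; the patent text does not specify how characters are converted to pinyin or how the tables are indexed.

```python
# Hypothetical toy lookup; a real system would need a full
# character-to-pinyin dictionary, which the source does not name.
TOY_PINYIN = {"我": "wo", "爱": "ai", "北": "bei", "京": "jing"}


def build_tables(sentences, pinyin_of, labels=("B", "M", "E", "O")):
    """Build index tables for characters, pinyin letters, and labels."""
    chars = sorted({ch for s in sentences for ch in s})
    letters = sorted(
        {c for s in sentences for ch in s for c in pinyin_of.get(ch, "")}
    )
    # Index 0 is left free for padding or unknown symbols (an assumption).
    char_table = {ch: i + 1 for i, ch in enumerate(chars)}
    alphabet = {c: i + 1 for i, c in enumerate(letters)}
    label_table = {lab: i for i, lab in enumerate(labels)}
    return char_table, alphabet, label_table


char_table, alphabet, label_table = build_tables(["我爱北京"], TOY_PINYIN)
print(len(char_table), len(alphabet), label_table)
```

Here the character table indexes the distinct Chinese characters, the alphabet indexes the Latin letters that occur in their pinyin, and the label table indexes the annotation set.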

[0062] In this step, the BMEO annotation set {B, M, E, O} is used to annotate the text. A character that forms a word with the characters following it, and is itself the first character of that word, is marked as B. A character in the middle of a word is marked as M. A character at the end of a word is marked as E. For a single character that does not form a word with the characters before or after it, we mark it as...
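As a rough illustration of this labeling scheme, the sketch below converts an already segmented sentence into a tag sequence. The source text is cut off before it names the label for single-character words; the code assumes it is O, the one label in {B, M, E, O} not otherwise assigned, and exposes that choice as the `single` parameter. The function name `tag_words` is hypothetical.

```python
def tag_words(words, single="O"):
    """Turn a list of segmented words into one tag per character.

    `single` is the label for one-character words; "O" is an assumption,
    since the source is truncated before stating it.
    """
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append(single)
        else:
            # First character B, last character E, everything between M.
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


words = ["我", "爱", "北京", "天安门"]
tags = tag_words(words)
print(list(zip("".join(words), tags)))
```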



Abstract

The invention discloses a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF, which improves and optimizes traditional Chinese word segmentation using a deep learning algorithm. The specific steps of the method are as follows: preprocess the initial corpus, extracting the character feature information of the corpus and the corresponding pinyin feature information of each character; use a convolutional neural network to obtain the pinyin feature vector of each character; use the word2vec model to obtain the character feature vector of the text; concatenate the pinyin feature vector and the character feature vector to obtain the context information vector, and feed it into the bidirectional LSTM neural network; use a linear-chain conditional random field to decode the output of the bidirectional LSTM and obtain the word segmentation tag sequence; decode the word segmentation tag sequence to obtain the word segmentation result. The present invention uses a deep neural network to extract text character features and pinyin features and combines them with conditional random field decoding, which can effectively extract Chinese text features and achieve good results in Chinese word segmentation tasks.
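The last two steps of the abstract, linear-chain CRF decoding followed by conversion of the tag sequence into words, can be sketched in plain Python. This is a generic Viterbi decoder, not the patent's implementation: the emission scores stand in for the bidirectional LSTM output and are invented for the example, the transition scores simply penalize tag pairs that cannot follow each other, and the names `viterbi`, `tags_to_words`, `ALLOWED` and `NEG` are hypothetical. A trained CRF would learn its transition scores instead.

```python
LABELS = ["B", "M", "E", "O"]
# Tag pairs allowed to follow each other, with O used for single characters.
ALLOWED = {"B": "ME", "M": "ME", "E": "BO", "O": "BO"}
NEG = -1e4  # large penalty for an illegal transition (illustrative value)
transitions = [[0.0 if b in ALLOWED[a] else NEG for b in LABELS] for a in LABELS]


def viterbi(emissions, transitions):
    """Return the highest-scoring label index path.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)
        score = new_score
        backptr.append(ptrs)
    path = [max(range(n), key=lambda j: score[j])]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return path[::-1]


def tags_to_words(chars, tags):
    """Group characters into words according to their tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "O") and current:  # flush a dangling partial word
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "O"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Made-up emission scores for "我爱北京"; columns follow LABELS.
emissions = [
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.5, 2.0],  # O scores highest alone, but O cannot follow B
]
path = viterbi(emissions, transitions)
tags = [LABELS[i] for i in path]
words = tags_to_words("我爱北京", tags)
print(tags, words)
```

In the last position the per-character scores alone would favor O, but the transition penalty steers the decoder to E. This is the kind of sequence-level consistency that motivates placing a CRF on top of per-character network outputs.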

Description

Technical field

[0001] The invention relates to a Chinese word segmentation method based on bidirectional LSTM, CNN and CRF, belonging to the field of natural language processing.

Background technique

[0002] Chinese word segmentation is a basic task of Natural Language Processing (NLP). Its purpose is to split the input sequence of Chinese characters into individual words.

[0003] In the field of Chinese word segmentation, traditional technologies can be divided into two categories. One is based on dictionaries and rules. It traverses Chinese character strings in a certain way and matches them with entries in the dictionary. If a certain string is found in the dictionary, then the match is successful. The other is a method based on statistics. Related methods include conditional random field (CRF), hidden Markov model (HMM), and maximum entropy model (Maximum Entropy). Among them, conditional random field has been widely used in the field of Chinese word segmentation in...
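For context on the dictionary-based category described in the background, the following sketch shows forward maximum matching, one common way of traversing a string and matching it against dictionary entries. The patent text does not say which traversal strategy it has in mind, so this choice, the function name `forward_max_match`, and the `max_len` parameter are assumptions for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation against a word dictionary.

    At each position, try the longest candidate first and fall back to a
    single character when no dictionary entry matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


result = forward_max_match("我爱北京天安门", {"北京", "天安门"})
print(result)
```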


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/284, G06N3/04, G06N3/08
CPC: G06N3/049, G06N3/08, G06F40/284, G06N3/045
Inventor: 王保云, 顾孙炎, 苗栋晨
Owner: NANJING UNIV OF POSTS & TELECOMM