A Chinese word segmentation method based on deep learning

A deep learning and Chinese word segmentation technology, applied in instruments, biological neural network models, calculations, etc., that addresses problems such as vanishing and exploding gradients in recurrent neural networks and their inability to handle long-distance historical memory.

Active Publication Date: 2022-07-26
NANJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0005] Early deep-learning approaches to Chinese word segmentation used a simple feedforward neural network to label each character in the training sequence. This approach only captures context within a fixed window and cannot learn how the current input relates to earlier inputs.
[0006] A recurrent neural network can automatically learn more complex features by accumulating historical memory and making full use of context. In practice, however, recurrent neural networks suffer from exploding and vanishing gradients, which prevents them from handling long-distance historical memory well.


Examples

Embodiment 1

[0252] A Chinese word segmentation method based on deep learning, comprising the following steps:

[0253] Step 1: Compute character frequency statistics over the large-scale corpus D. Using the CBOW model with the HS (hierarchical softmax) training method, initialize each character in corpus D as a basic distributed character vector, and store the resulting character vectors by index in dictionary V.
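
The bookkeeping part of Step 1 can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the function names `build_char_vocab` and `cbow_pairs`, the `<PAD>`/`<UNK>` entries, the window size, and the toy corpus are all assumptions introduced here. It counts character frequencies, assigns each character an index, and generates the (context, target) pairs that a CBOW model would be trained on; the embedding training itself (with hierarchical softmax) is left out.

```python
from collections import Counter


def build_char_vocab(corpus, min_freq=1):
    """Count character frequencies and map each character to an integer index.

    The special entries <PAD> and <UNK> are assumptions of this sketch.
    """
    freq = Counter(ch for sentence in corpus for ch in sentence)
    # Most frequent characters first; ties broken by code point for determinism.
    ordered = sorted(
        (c for c, n in freq.items() if n >= min_freq),
        key=lambda c: (-freq[c], c),
    )
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for ch in ordered:
        vocab[ch] = len(vocab)
    return vocab, freq


def cbow_pairs(sentence, window=2):
    """Yield (context_chars, target_char) pairs of the kind CBOW predicts from."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        context = list(left) + list(right)
        if context:
            yield context, target


# Toy corpus (hypothetical), standing in for the large-scale corpus D.
corpus = ["我爱北京", "北京欢迎你"]
vocab, freq = build_char_vocab(corpus)
pairs = list(cbow_pairs(corpus[0], window=1))
```

In a real pipeline the pairs would be fed to a CBOW trainer, and the learned vectors would be stored in a matrix whose rows follow the indices in `vocab`.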

[0254] Step 2: Convert the training corpus, sentence by sentence, into fixed-length vectors and feed them into the improved bidirectional LSTM model. Training the parameters of the bidirectional LSTM model refines and updates the character-level vectors in dictionary V, yielding a feature vector that carries contextual semantics and a vector that contains character features.
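
The conversion of a sentence into a fixed-length vector might look like the sketch below. The function name `encode_sentence`, the padding and unknown-character ids, and the toy dictionary are assumptions made for illustration; the source does not say how short or long sentences are handled, so truncation and right-padding are one plausible choice rather than the patent's method. The bidirectional LSTM itself is not shown.

```python
def encode_sentence(sentence, vocab, max_len, pad_id=0, unk_id=1):
    """Map a sentence to exactly max_len character indices.

    Longer sentences are truncated, shorter ones are right-padded with pad_id,
    and characters missing from vocab are mapped to unk_id.
    """
    ids = [vocab.get(ch, unk_id) for ch in sentence[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))


# Hypothetical toy dictionary standing in for dictionary V.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "北": 2, "京": 3}
short = encode_sentence("北京人", toy_vocab, max_len=5)
long_ = encode_sentence("北京北京北京", toy_vocab, max_len=4)
```

Each index would then be used to look up the corresponding character vector before the sequence enters the model.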

[0255] Step 3: For each training sentence, processed character by character, use the idea of full segmentation to enumerate all candidate words that end with the current character and fall within the maximum word length, and fuse t...
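
The candidate enumeration described in Step 3 can be illustrated as follows. The function name `candidates_ending_at` and the default maximum word length of 4 are assumptions of this sketch; only the enumeration is shown, not the fusion of character vectors into word vectors.

```python
def candidates_ending_at(sentence, i, max_word_len=4):
    """Return every substring that ends at position i (inclusive)
    and is at most max_word_len characters long, shortest first."""
    start_min = max(0, i - max_word_len + 1)
    return [sentence[s:i + 1] for s in range(i, start_min - 1, -1)]


sentence = "南京市长江大桥"
at_shi = candidates_ending_at(sentence, 2)    # candidates ending at "市"
at_qiao = candidates_ending_at(sentence, 6)   # candidates ending at "桥"
```

Near the start of a sentence fewer candidates are available, because a candidate cannot begin before the first character.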

Abstract

The invention discloses a Chinese word segmentation method based on deep learning, comprising the following steps: mapping Chinese characters to character vectors based on character frequencies; refining the character vectors and extracting feature vectors that carry contextual semantic information and feature vectors that carry the characters' own features; fusing the character-level vectors into word-level distributed representations; feeding the fused candidate word vectors into a deep learning model to compute sentence scores; decoding with beam search; and finally selecting the appropriate segmentation result according to the sentence score. In this way, the word segmentation task is freed from tedious feature engineering, better system performance is obtained by extracting richer feature information, and the complete segmentation history is used for modeling, giving the method sequence-level segmentation capability.
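
The beam search decoding mentioned in the abstract can be sketched as below. This is a simplified illustration under stated assumptions: the deep learning model that scores sentences is replaced by a hypothetical lookup table (`TOY_LEXICON` and `toy_score` are invented for this example), and the names `beam_search_segment`, `beam_size`, and `max_word_len` do not come from the source. The scorer receives the segmentation history so that a history-aware model could be plugged in, although the toy scorer ignores it.

```python
def beam_search_segment(sentence, score_word, beam_size=3, max_word_len=4):
    """Segment a sentence by beam search over candidate words.

    beams[i] keeps at most beam_size hypotheses (score, words) that
    exactly cover sentence[:i]. The sentence score is the sum of the
    per-word scores returned by score_word(word, history).
    """
    n = len(sentence)
    beams = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, [])]
    for i in range(1, n + 1):
        expanded = []
        # Every candidate word ending at position i within the length limit.
        for start in range(max(0, i - max_word_len), i):
            word = sentence[start:i]
            for score, words in beams[start]:
                expanded.append((score + score_word(word, words), words + [word]))
        expanded.sort(key=lambda h: h[0], reverse=True)
        beams[i] = expanded[:beam_size]
    return beams[n][0]


# Hypothetical word scores standing in for the neural scoring model.
TOY_LEXICON = {
    "南京": 2.0, "南京市": 3.0, "市长": 2.0, "长江": 2.0,
    "长江大桥": 5.0, "大桥": 2.0, "江大桥": 1.0,
}


def toy_score(word, history):
    """Known words get their table score; unknown single characters get -1,
    unknown multi-character strings get -5. The history is unused here."""
    if word in TOY_LEXICON:
        return TOY_LEXICON[word]
    return -1.0 if len(word) == 1 else -5.0


best_score, best_words = beam_search_segment("南京市长江大桥", toy_score)
print(best_words, best_score)  # ['南京市', '长江大桥'] 8.0
```

A larger `beam_size` keeps more competing segmentation histories alive at each position, at the cost of more scoring calls.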

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a Chinese word segmentation method based on deep learning.

Background

[0002] In the current big data environment, with the rapid development of IoT data perception, cloud computing, triple play, and the mobile Internet, the amount of data, especially unstructured text, has grown exponentially, and the data exhibit characteristics such as diverse types, heterogeneity, information fragmentation, and low value density. The rapid expansion of data poses great challenges to automatic information processing. How to process massive amounts of text efficiently and accurately and extract valuable information has become an important topic in Natural Language Processing (NLP).

[0003] In the field of natural language processing, especially in Chinese natural language processing, word segmentation is an important benchmark task, and the p...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/289, G06N3/02
CPC: G06N3/02, G06F40/289
Inventor: 王传栋, 史宇, 李智
Owner: NANJING UNIV OF POSTS & TELECOMM