A Chinese word segmentation method based on deep learning

A deep-learning Chinese word segmentation technique, applicable to instruments, biological neural network models, computing, and related fields, that addresses three problems: exploding gradients in recurrent neural networks, the inability to retain long-distance history, and poor learning of associations within the data.

Active Publication Date: 2018-12-25
NANJING UNIV OF POSTS & TELECOMM
Cites: 1; Cited by: 32

AI Technical Summary

Problems solved by technology

[0005] Early deep-learning approaches to Chinese word segmentation used a simple feedforward neural network to label each character in the training sequence. Such a method captures context only within a fixed window and cannot learn how the current data relates to earlier data.
[0006] Recursive neural network can automatically lear



Examples


Embodiment 1

[0252] A Chinese word segmentation method based on deep learning, comprising the steps of:

[0253] Step 1: Compute character frequency statistics over the large-scale corpus D. Using the CBOW model with hierarchical softmax (HS) training, initialize each character in D as a basic distributed character vector, and store the resulting vectors, indexed, in the dictionary V.
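The data preparation behind Step 1 might look roughly like the sketch below. It only counts character frequencies, assigns dictionary indices, and generates the (context, target) pairs that a CBOW model consumes; the embedding training itself is not shown. The names `build_char_index` and `cbow_pairs`, the `<UNK>` entry, and the `window` and `min_count` parameters are assumptions made for illustration and do not come from the patent text.

```python
from collections import Counter

def build_char_index(corpus, min_count=1):
    """Count character frequencies and assign an index to each character.

    Indices follow descending frequency; index 0 is reserved for a
    hypothetical <UNK> symbol (an assumption, not from the source).
    """
    freq = Counter(ch for sentence in corpus for ch in sentence)
    index = {"<UNK>": 0}
    # Sort by (-count, character) so that ties are broken deterministically.
    for ch, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            index[ch] = len(index)
    return freq, index

def cbow_pairs(sentence, index, window=2):
    """Yield (context_ids, target_id) pairs of the kind a CBOW model trains on."""
    ids = [index.get(ch, 0) for ch in sentence]
    for i, target in enumerate(ids):
        left = ids[max(0, i - window):i]
        right = ids[i + 1:i + 1 + window]
        yield left + right, target

corpus = ["我爱北京", "北京欢迎你"]
freq, V = build_char_index(corpus)
pairs = list(cbow_pairs(corpus[0], V, window=1))
```

In practice the pairs would be fed to an embedding trainer; the sketch stops at the point where the dictionary and training samples exist.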

[0254] Step 2: Convert the training corpus into fixed-length vectors sentence by sentence and feed them into the improved bidirectional LSTM model. Training the model's parameters refines and updates the character vectors in the dictionary V, yielding a feature vector that carries contextual semantics and a feature vector that captures character-level features.
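As a rough illustration of the "fixed-length vectors" in Step 2, the sketch below turns a sentence into a padded or truncated list of dictionary indices plus a mask. The bidirectional LSTM itself is not shown. The function name `encode_sentence`, the `<PAD>` and `<UNK>` entries, and the `max_len` parameter are hypothetical; the patent text here does not specify how padding is handled.

```python
def encode_sentence(sentence, index, max_len, pad_id=0, unk_id=1):
    """Map a sentence to exactly max_len character ids plus a 0/1 mask.

    Longer sentences are truncated; shorter ones are right-padded with pad_id.
    The mask marks real characters with 1 and padding with 0.
    """
    ids = [index.get(ch, unk_id) for ch in sentence[:max_len]]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# A toy dictionary standing in for V (illustrative values only).
V = {"<PAD>": 0, "<UNK>": 1, "北": 2, "京": 3, "欢": 4, "迎": 5, "你": 6}
ids, mask = encode_sentence("北京欢迎您", V, max_len=6)
```

The mask is a common convenience so that a sequence model can ignore padded positions; whether the patented model uses one is not stated in this excerpt.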

[0255] Step 3: For each training sentence, proceeding character by character, apply the full-segmentation idea to enumerate all candidate words that end at the current character and fall within the maximum word length, and fuse the refined character-level fea...
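The candidate enumeration described in Step 3 can be sketched as follows, under the assumption that every substring ending at the current position and no longer than the maximum word length counts as a candidate. The excerpt does not say whether candidates are further filtered (for example against a lexicon), so no filtering is applied here. The function name and the `max_word_len` default are illustrative.

```python
def candidates_ending_at(sentence, i, max_word_len=4):
    """Return all substrings that end at position i (inclusive) and are
    at most max_word_len characters long, shortest first."""
    start_min = max(0, i - max_word_len + 1)
    return [sentence[s:i + 1] for s in range(i, start_min - 1, -1)]

sentence = "南京市长江大桥"
for i in range(len(sentence)):
    # Print the candidate words that end at each character position.
    print(i, candidates_ending_at(sentence, i, max_word_len=3))
```

For the character 长 at position 3 with a maximum length of 3, this yields 长, 市长, and 京市长; a scoring model would then decide among them.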


Abstract

The invention discloses a Chinese word segmentation method based on deep learning, comprising the following steps: Chinese characters are mapped to character vectors based on character frequency; the character vectors are refined to extract a feature vector carrying contextual semantic information and a feature vector carrying character features; the character-level vectors are fused with the word-level distributed representation, and the fused candidate vectors are fed into the deep learning model to compute sentence scores; the candidates are decoded by beam search, and the final segmentation is selected according to the sentence scores. In this way, word segmentation is freed from tedious feature engineering, richer feature information yields better system performance, and the whole segmentation history can be used for modeling, giving the method sequence-level segmentation ability.
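To make the decoding step concrete, here is a minimal beam search over candidate segmentations. It is a sketch under strong assumptions: the sentence score is taken to be a simple sum of per-word scores, and a small hand-written table (`TOY_SCORES`) stands in for the learned model. The patent's model scores sentences using the full segmentation history, which this toy scorer does not attempt to reproduce. All names and parameters are illustrative.

```python
def beam_search_segment(sentence, word_score, beam_size=2, max_word_len=4):
    """Segment `sentence` by beam search over candidate words.

    beams[i] holds up to `beam_size` (score, words) hypotheses that
    cover exactly the first i characters of the sentence.
    """
    beams = [[] for _ in range(len(sentence) + 1)]
    beams[0] = [(0.0, [])]
    for i in range(1, len(sentence) + 1):
        expanded = []
        # Try every candidate word that ends at position i.
        for start in range(max(0, i - max_word_len), i):
            word = sentence[start:i]
            for score, words in beams[start]:
                expanded.append((score + word_score(word), words + [word]))
        # Keep only the best `beam_size` hypotheses ending at position i.
        expanded.sort(key=lambda h: h[0], reverse=True)
        beams[i] = expanded[:beam_size]
    return beams[-1][0]

# Toy scores standing in for the learned model (purely illustrative).
TOY_SCORES = {"南京市": 3.0, "长江大桥": 5.0, "南京": 2.0, "市长": 2.0,
              "长江": 2.0, "大桥": 2.0, "江大桥": 0.5}

def toy_word_score(word):
    # Known words use the table; single characters score 0;
    # unknown multi-character strings are penalised.
    if word in TOY_SCORES:
        return TOY_SCORES[word]
    return 0.0 if len(word) == 1 else -5.0

best_score, best_words = beam_search_segment("南京市长江大桥", toy_word_score)
```

With these toy scores the search prefers 南京市 / 长江大桥 over the locally attractive 南京 / 市长 split, which is the kind of ambiguity that sentence-level scoring is meant to resolve.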

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a Chinese word segmentation method based on deep learning.

Background

[0002] In the current big data environment, with the rapid development of Internet of Things data sensing, cloud computing, triple-network convergence, and the mobile Internet, data, especially unstructured text data, is growing exponentially and exhibits characteristics such as diverse types, heterogeneity, globalization, information fragmentation, and low value density. The rapid expansion of data poses great challenges to automatic information processing. How to process massive amounts of text efficiently and accurately and extract valuable information has become an important topic in Natural Language Processing (NLP).

[0003] In the field of natural language processing, especially in Chinese natural language processing, word segmentation is an important benchmark task,...


Application Information

IPC(8): G06F17/27, G06N3/02
CPC: G06N3/02, G06F40/289
Inventor: 王传栋, 史宇, 李智
Owner: NANJING UNIV OF POSTS & TELECOMM