LSTM-based mixed corpus word segmentation method

A word segmentation method for mixed corpora, applied in natural language data processing, special data processing applications, instruments, etc. It addresses the loss of word segmentation accuracy, dependence on dictionaries, and the inability to recognize unregistered words, and achieves improved segmentation accuracy.

Inactive Publication Date: 2018-05-04
北京知道未来信息技术有限公司
Cites: 6, Cited by: 4

AI Technical Summary

Problems solved by technology

[0014] Disadvantage 1: The detection granularity for multiple languages is hard to determine, and word segmentation accuracy is lost when a language goes undetected.
[0015] Disadvantage 2: Dictionary-based methods depend too heavily on the dictionary and cannot use semantic information to identify unregistered words, that is, words that do not appear in the dictionary.
[0016] Disadvantage 3: Current statistics-based methods are mainly the HMM (Hidden Markov Model) and the CRF (Conditional Random Field) model. Because of computational complexity, they consider only the correlation between the current word and the previous word and treat everything else as conditionally independent. This does not match reality, so there is room to further improve word segmentation accuracy.


Examples

Embodiment Construction

[0053] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.

[0054] The flow chart of the present invention is shown in Figure 1. Its implementation can be divided into two stages: 1) a training stage and 2) a prediction stage.

[0055] (1) Training stage:

[0056] Step 1: Data preprocessing. The mixed corpus involves multiple languages, such as Simplified Chinese, Traditional Chinese, Japanese and Korean. We first format the tagged data for each language so that each word segmentation result occupies one line. For example, for the sentence "I am Chinese, I love China." in Traditional Chinese, Simplified Chinese and Japanese, the data format is as follows:

[0057] Simplified Chinese:

[0058] I

[0059] am

[0060] Chinese

[0061] ,

[0062] I

[0063] love

[0064] China

[0065] .

[0066] traditional Chin...
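The one-word-per-line format shown above can be turned into character-level labelled data roughly as follows. This is a minimal sketch rather than the patent's own procedure: the B/M/E/S tag scheme and the function names `words_to_char_tags` and `tags_to_words` are assumptions made for illustration, and the Chinese tokens are an assumed rendering of the translated example sentence.

```python
from typing import List, Tuple


def words_to_char_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Label each character of each word with an assumed B/M/E/S tag.

    S = single-character word, B = begin, M = middle, E = end.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # skip empty lines in the one-word-per-line file
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        pairs.append((word[0], "B"))
        for ch in word[1:-1]:
            pairs.append((ch, "M"))
        pairs.append((word[-1], "E"))
    return pairs


def tags_to_words(pairs: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (character, tag) pairs; a word ends at S or E."""
    words: List[str] = []
    current = ""
    for ch, tag in pairs:
        current += ch
        if tag in ("S", "E"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


# One segmented word per line, as in the Simplified Chinese example.
lines = "我\n是\n中国人\n，\n我\n爱\n中国\n。"
words = [w for w in lines.splitlines() if w.strip()]
pairs = words_to_char_tags(words)

# Segmentation output uses spaces as the separator between words.
restored = " ".join(tags_to_words(pairs))
```

The same tags that serve as training labels can be decoded back into space-separated words at prediction time, which is why the sketch includes both directions.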


Abstract

The invention discloses an LSTM-based mixed corpus word segmentation method. The training mixed corpus data is converted into character-level mixed corpus data and divided according to sentence length to obtain a plurality of sentences. The sentences are grouped by sentence length to obtain a data set comprising n groups of sentences. A plurality of pieces of data are extracted from the data set to serve as iteration data. The data for each iteration is converted into fixed-length vectors and fed into the deep learning model LSTM to train its parameters. Training terminates when the change in the loss value between iterations falls below a set threshold, when the loss no longer decreases, or when the maximum number of iterations is reached, yielding a trained LSTM model. To-be-predicted mixed corpus data is then converted into character-level corpus data and fed into the trained LSTM model to obtain a word segmentation result.
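The data handling and stopping rule described in the abstract might look roughly like the sketch below. It is illustrative only and makes several assumptions not stated in the source: sentences are already lists of integer character IDs, "fixed-length vectors" is read as padding each ID sequence to a common length, groups are formed by a hypothetical `bucket_width`, and every function name and default value is invented for the example. The LSTM itself is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence

PAD_ID = 0  # assumed padding index, not specified in the source


def group_by_length(sentences: Sequence[Sequence[int]],
                    bucket_width: int = 10) -> Dict[int, List[Sequence[int]]]:
    """Group sentences so that similar lengths share a group."""
    groups: Dict[int, List[Sequence[int]]] = defaultdict(list)
    for sent in sentences:
        if len(sent) == 0:
            continue  # nothing to segment
        # Lengths 1..bucket_width fall in group 0, the next range in group 1, ...
        groups[(len(sent) - 1) // bucket_width].append(sent)
    return dict(groups)


def sample_batch(groups: Dict[int, List[Sequence[int]]], batch_size: int,
                 rng: random.Random) -> List[Sequence[int]]:
    """Pick one group at random and draw iteration data from it."""
    group = groups[rng.choice(sorted(groups))]
    return rng.sample(group, min(batch_size, len(group)))


def pad_batch(batch: Sequence[Sequence[int]], fixed_len: int) -> List[List[int]]:
    """Truncate or right-pad every sequence to exactly fixed_len IDs."""
    return [list(s[:fixed_len]) + [PAD_ID] * (fixed_len - len(s)) for s in batch]


def should_stop(losses: Sequence[float], threshold: float = 1e-4,
                max_iters: int = 1000) -> bool:
    """Stop when the loss change is below threshold (which also covers a
    loss that no longer decreases) or when max_iters is reached."""
    if len(losses) >= max_iters:
        return True
    if len(losses) < 2:
        return False
    return (losses[-2] - losses[-1]) < threshold


sentences = [[1, 2, 3], [4, 5], list(range(6, 18))]
groups = group_by_length(sentences, bucket_width=10)
padded = pad_batch(groups[0], fixed_len=4)
batch = sample_batch(groups, batch_size=2, rng=random.Random(0))
```

Drawing each batch from a single length group keeps the amount of padding small, which is one plausible reason for grouping sentences by length before sampling.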

Description

Technical field

[0001] The invention belongs to the technical field of computer software, and relates to an LSTM-based mixed corpus word segmentation method.

Background technique

[0002] Mixed corpus, in this patent, refers to training or prediction data that includes corpus data in at least two languages.

[0003] Word segmentation (Word Segment) refers to marking the input continuous string into a continuous label sequence according to the semantic information. In this patent, it refers to segmenting the sequence data of Asian type characters (simplified Chinese, traditional Chinese, Korean and Japanese) into individual words, and using spaces as the segmentation between words.

[0004] The word segmentation method of mixed corpus involves two aspects of professional knowledge: on the one hand, the data format of multiple corpora is unified according to the character level; on the other hand, the professional knowledge involved is mainly sequence annotation in natural la...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/211, G06F40/284
Inventors: 岳永鹏, 唐华阳
Owner: 北京知道未来信息技术有限公司