LSTM-based mixed corpus word segmentation method

A word segmentation method for mixed corpora, applied in natural language data processing, special data processing applications, instruments, etc. It addresses the loss of word segmentation accuracy, the dependence on dictionaries, and the inability to recognize unregistered (out-of-vocabulary) words, and it achieves improved segmentation accuracy.

Inactive Publication Date: 2018-05-04

北京知道未来信息技术有限公司



Problems solved by technology

[0014] Disadvantage 1: When multiple languages are mixed, the granularity at which each language is detected is hard to determine, and word segmentation accuracy is lost whenever a language goes undetected.

[0015] Disadvantage 2: Dictionary-based methods depend too heavily on the dictionary and cannot use semantic information to identify unregistered words, that is, words that do not appear in the dictionary.

[0016] Disadvantage 3: Current statistics-based methods are mainly the HMM (Hidden Markov Model) and the CRF (Conditional Random Field) model. Because of computational complexity, they consider only the correlation between the current word and the previous word and treat everything else as conditionally independent. This is inconsistent with real language, so there is room to further improve word segmentation accuracy.
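
The independence assumption described in Disadvantage 3 can be written out for a first-order HMM. In this sketch, x_t (the t-th input unit) and y_t (its label) are notation introduced here for illustration and are not taken from the patent text; y_0 stands for a fixed start state.

```latex
P(x_{1:T}, y_{1:T}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)
```

Each label depends only on the label immediately before it, and each observation depends only on its own label, so context further back than one step cannot influence the segmentation decision.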


Embodiment Construction

[0053] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.

[0054] The flow of the present invention is shown in Figure 1. Its implementation can be divided into two stages: 1) a training stage and 2) a prediction stage.

[0055] (1) Training stage:

[0056] Step 1: Data preprocessing. The mixed corpus involves multiple languages, such as Simplified Chinese, Traditional Chinese, Japanese and Korean. We first format the tagged data of each language so that each word of the segmentation result occupies one line. For example, for the sentence "I am Chinese, I love China." in Simplified Chinese, Traditional Chinese and Japanese, the data format is as follows:

[0057] Simplified Chinese:

[0058] I

[0059] am

[0060] Chinese

[0061] ,

[0062] I

[0063] love

[0064] China

[0065] .

[0066] traditional Chin...
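
The one-word-per-line format above can be turned into character-level training data along the following lines. This is a minimal sketch, not the patent's implementation: the B/M/E/S tag scheme, the function names, and the Chinese sample sentence (an assumed original of the translated example) are all assumptions made for illustration.

```python
def read_segmented_lines(text):
    """Parse the one-word-per-line format, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def words_to_char_tags(words):
    """Convert segmented words into (character, tag) pairs.

    Tag scheme (an assumption, not taken from the patent text):
    S = single-character word, B = begin, M = middle, E = end of a word.
    """
    pairs = []
    for word in words:
        if not word:
            continue  # ignore empty entries
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        pairs.append((word[0], "B"))
        for ch in word[1:-1]:
            pairs.append((ch, "M"))
        pairs.append((word[-1], "E"))
    return pairs


# Hypothetical sample: an assumed Simplified Chinese form of the example sentence.
sample = "我\n是\n中国人\n，\n我\n爱\n中国\n。\n"
words = read_segmented_lines(sample)
char_tags = words_to_char_tags(words)
```

Working at the character level means the same representation can be used for every language in the mixed corpus, without a per-language dictionary.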


Abstract

The invention discloses an LSTM-based mixed corpus word segmentation method. The training mixed corpus data is converted into character-level mixed corpus data and divided into a plurality of sentences. The sentences are grouped according to sentence length, yielding a data set comprising n groups of sentences. Several pieces of data are extracted from the data set as iteration data. In each iteration, the data is converted into fixed-length vectors and input into the deep learning model LSTM to train its parameters. Training terminates when the change in the loss value between iterations is smaller than a set threshold, when the loss no longer decreases, or when the maximum number of iterations is reached, which yields a trained LSTM model. For prediction, the mixed corpus data to be predicted is converted into character-level corpus data and input into the trained LSTM model to obtain the word segmentation result.
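
The grouping, fixed-length conversion, and stopping rule described in the abstract might be sketched as follows. The bucket width, the padding id, the default threshold values, and all function names are hypothetical; the abstract does not specify them, and the LSTM itself is left out.

```python
from collections import defaultdict

PAD_ID = 0  # hypothetical id reserved for padding


def bucket_by_length(sentences, bucket_width=10):
    """Group sentences so each group holds sentences of similar length."""
    buckets = defaultdict(list)
    for sent in sentences:
        # Key is the smallest multiple of bucket_width that is >= len(sent).
        key = ((len(sent) + bucket_width - 1) // bucket_width) * bucket_width
        buckets[key].append(sent)
    return dict(buckets)


def to_fixed_length(char_ids, length):
    """Pad with PAD_ID, or truncate, so the result has exactly `length` items."""
    return (char_ids + [PAD_ID] * length)[:length]


def should_stop(losses, threshold=1e-4, max_iters=1000):
    """Return True when training should end.

    Stops when the maximum number of iterations is reached, or when the
    loss decreased by less than `threshold` (which includes the case
    where the loss did not decrease at all).
    """
    if len(losses) >= max_iters:
        return True
    if len(losses) < 2:
        return False
    return (losses[-2] - losses[-1]) < threshold


# Sentences are lists of character ids (made-up values for illustration).
sentences = [[5, 9, 2], [7] * 12, [3, 4]]
buckets = bucket_by_length(sentences, bucket_width=10)
batch = [to_fixed_length(s, 10) for s in buckets[10]]
```

Grouping by length before padding keeps the amount of padding per batch small, which is one plausible reason for the grouping step.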

Description

Technical field

[0001] The invention belongs to the technical field of computer software and relates to an LSTM-based mixed corpus word segmentation method.

Background technique

[0002] Mixed corpus, in this patent, refers to training or prediction data that includes corpus data in at least two languages.

[0003] Word segmentation refers to labeling an input continuous string as a continuous label sequence according to its semantic information. In this patent, it refers to segmenting sequences of Asian-script characters (Simplified Chinese, Traditional Chinese, Korean and Japanese) into individual words, with spaces as the separators between words.

[0004] The word segmentation method for mixed corpora involves two areas of professional knowledge: on the one hand, the data formats of the multiple corpora are unified at the character level; on the other hand, the professional knowledge involved is mainly sequence annotation in natural la...
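
Treating segmentation as sequence labeling means the final output is recovered from per-character labels. The sketch below assumes a B/M/E/S tag scheme, which is a common convention and not something stated in the text shown here; the function name is hypothetical.

```python
def tags_to_segmented(chars, tags):
    """Join characters into words using B/M/E/S tags, separated by spaces.

    Assumed tags: B = begin, M = middle, E = end, S = single-character word.
    Decoding is lenient: a word left open at the end of input is still emitted.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # flush an unfinished word if the tags end on B or M
        words.append(current)
    return " ".join(words)


segmented = tags_to_segmented("我爱中国", ["S", "S", "B", "E"])
```

Here `segmented` is the string "我 爱 中国", with spaces marking the word boundaries as described in paragraph [0003].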


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/27

CPC: G06F40/211, G06F40/284

Inventor: 岳永鹏, 唐华阳

Owner: 北京知道未来信息技术有限公司
