LSTM-CNN-based word segmentation method

A word segmentation method based on LSTM-CNN, applied in the field of word segmentation technology. It addresses the problems of networks with few layers, segmentation results with no obvious advantage, and low recognition rates, and achieves the effect of improving accuracy and avoiding the problem of unregistered words.

Inactive Publication Date: 2018-04-20
北京知道未来信息技术有限公司
4 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

Knowledge-based artificial neural network models suffer from the vanishing gradient problem during training, so in practice they use only a small number of network layers, and the final word segmentation results show no obvious advantage.
[0007] Dictionary-based word segmentation methods rely heavily on the dictionary, are relatively inefficient, and cannot identify unregistered words. In the present invention, registered words are words that have appeared in the corpus vocabulary, and unregistered words are words that do not appear in the corpus vocabulary.
[0008] Word segmentation methods based on word-frequency statistics (such as N-Gram) can only relate the current word to the semantics of the preceding N-1 words, so the recognition accuracy is not high enough, and efficiency becomes very low as N increases. The recognition rate for unregistered words is also low.

Examples

Embodiment Construction

[0045] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.

[0046] The process of the method of the present invention is shown in Figure 1 and includes:

[0047] (1) Training stage:

[0048] Step 1: Convert the original training corpus data OrgData into character-level corpus data NewData. Specifically, using the BMES (Begin, Middle, End, Single) tagging scheme, each labeled word in the original training corpus data is split into characters. The first character of a word is tagged B, characters in the middle of the word are tagged M, the last character of the word is tagged E, and a word consisting of a single character is tagged S.
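
The BMES conversion in Step 1 can be illustrated with a minimal Python sketch. The function name `to_bmes` and the sample sentence are hypothetical and do not come from the patent; the sketch assumes the training corpus is already available as lists of segmented words.

```python
def to_bmes(words):
    """Convert a list of segmented words into (character, tag) pairs.

    Hypothetical helper: B = begin, M = middle, E = end, S = single.
    """
    tagged = []
    for word in words:
        if not word:
            continue  # skip empty tokens, which carry no characters
        if len(word) == 1:
            tagged.append((word, "S"))
        else:
            tagged.append((word[0], "B"))
            for ch in word[1:-1]:
                tagged.append((ch, "M"))
            tagged.append((word[-1], "E"))
    return tagged


# Illustrative sentence (an assumption, not taken from the patent corpus).
sentence = ["我", "喜欢", "自然语言"]
for ch, tag in to_bmes(sentence):
    print(ch, tag)
```

A two-character word yields only B and E, so the M tag appears only for words of three or more characters.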

[0049] Step 2: Count the characters in NewData to obtain a character set CharSet. For example, suppose there are two words: Zhonghua...
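
As a rough illustration of Step 2, the sketch below collects the distinct characters of NewData into a set and then assigns each character a number, as the abstract describes. The function names, the representation of NewData as lists of (character, tag) pairs, the sorted ordering, and the starting number 0 are all assumptions made for this example; the patent excerpt does not specify them.

```python
def build_char_set(new_data):
    """Collect the distinct characters that occur in the character-level corpus.

    new_data is assumed to be a list of sentences, each a list of
    (character, tag) pairs.
    """
    chars = set()
    for sentence in new_data:
        for ch, _tag in sentence:
            chars.add(ch)
    return chars


def number_chars(char_set, first_id=0):
    """Assign a number to every character (ordering and first_id are assumptions)."""
    return {ch: i for i, ch in enumerate(sorted(char_set), start=first_id)}


# Illustrative data only.
new_data = [
    [("你", "B"), ("好", "E")],
    [("你", "S"), ("来", "S")],
]
char_set = build_char_set(new_data)
char_ids = number_chars(char_set)
print(len(char_set), char_ids)
```

Sorting before numbering keeps the mapping reproducible between runs, which matters because the same numbering must be reused at prediction time.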

Abstract

The invention discloses an LSTM-CNN-based word segmentation method. The method comprises: converting training corpus data into character-level corpus data; collecting the characters of the corpus data to obtain a character set, and numbering the characters to obtain a character number set; collecting the character tags to obtain a tag set, and numbering the tags to obtain a tag number set; dividing the corpus according to sentence length, and grouping the resulting sentences by length to obtain a data set comprising n groups of sentences; randomly selecting a sentence group from the data set without replacement, wherein the characters of each sentence form a piece of data w and the corresponding tag sequence is y; converting the data w and the tags y into their corresponding numbers, inputting them to the LSTM-CNN deep learning model, and training the parameters of the model; and converting the data to be predicted into data matched with the deep learning model, and inputting it to the trained model to obtain a word segmentation result.
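
The data preparation steps in the abstract (grouping sentences by length, drawing groups without replacement, and converting w and y to numbers) might look like the following Python sketch. All function names are hypothetical, and grouping by exactly equal length is an assumption; the abstract does not say whether groups hold equal-length sentences or length ranges. The LSTM-CNN model itself is not shown.

```python
import random
from collections import defaultdict


def group_by_length(sentences):
    """Group sentences so that each group holds sentences of equal length."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[len(sentence)].append(sentence)
    return [groups[length] for length in sorted(groups)]


def draw_groups(groups, seed=None):
    """Yield every group exactly once, in random order (no replacement)."""
    rng = random.Random(seed)
    order = list(range(len(groups)))
    rng.shuffle(order)
    for index in order:
        yield groups[index]


def encode(sentence, char_ids, tag_ids):
    """Turn one sentence of (character, tag) pairs into number lists w and y."""
    w = [char_ids[ch] for ch, _tag in sentence]
    y = [tag_ids[tag] for _ch, tag in sentence]
    return w, y


# Illustrative data and numbering (assumptions, not from the patent).
corpus = [
    [("你", "B"), ("好", "E")],
    [("我", "S"), ("来", "S")],
    [("我", "S"), ("喜", "B"), ("欢", "E")],
]
char_ids = {ch: i for i, ch in enumerate(sorted({c for s in corpus for c, _ in s}))}
tag_ids = {"B": 0, "M": 1, "E": 2, "S": 3}

for group in draw_groups(group_by_length(corpus), seed=0):
    batch = [encode(sentence, char_ids, tag_ids) for sentence in group]
    print(len(group), batch)
```

Keeping equal-length sentences together means each batch can be fed to the network without padding, which is one plausible reason for the grouping step.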

Description

Technical field

[0001] The invention belongs to the technical field of computer software and relates to a word segmentation method based on LSTM-CNN.

Background technique

[0002] In natural language processing, Asian texts do not have natural space separators as Western texts do. Many Western text processing methods cannot be applied directly to Asian texts (Chinese, Korean, and Japanese), because such texts must first go through word segmentation in order to be consistent with Western languages. Word segmentation is therefore the basis of information processing for Asian text, and its application scenarios include:

[0003] 1. Search engine: An important function of a search engine is full-text indexing of documents. The text is segmented, and an inverted index is then built from the documents and their word segmentation results. When users query, they also first query ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/289
Inventors: 唐华阳, 岳永鹏, 刘林峰
Owner: 北京知道未来信息技术有限公司