Word-splitting method based on Bi-LSTM-CNN

A word segmentation method based on Bi-LSTM-CNN, applied in neural learning methods, special data processing applications, instruments, etc. It addresses the problems of shallow networks with few layers, low efficiency, and low recognition rate, and improves word segmentation accuracy.

Inactive Publication Date: 2018-04-27
北京知道未来信息技术有限公司

AI Technical Summary

Problems solved by technology

Knowledge-based artificial neural network models are limited to a small number of network layers in practice because of the vanishing-gradient problem during training, so their final word segmentation results show no obvious advantage.
[0007] Dictionary-based word segmentation relies heavily on the dictionary, is relatively inefficient, and cannot identify unregistered words. In the present invention, registered words are words that already appear in the corpus vocabulary, and unregistered words are words that do not appear in the corpus vocabulary.
[0008] Word segmentation based on word-frequency statistics (such as N-Gram) can only take into account the N-1 words preceding the current word, so its recognition accuracy is not high enough, and its efficiency drops sharply as N increases. Its recognition rate for unregistered words is also low.



Examples


Embodiment Construction

[0047] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.

[0048] The method of the present invention, shown in Figure 1, includes:

[0049] (1) Training stage:

[0050] Step 1: Convert the original training corpus data OrgData into character-level corpus data NewData. Specifically, using the BMES (Begin, Middle, End, Single) tagging scheme, each labeled word in the original training corpus is split into characters. The first character of a word is tagged B, each character in the middle of the word is tagged M, and the last character is tagged E; a word consisting of a single character is tagged S.
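
The BMES conversion in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names `bmes_tags` and `to_char_level` are hypothetical, and the input is assumed to be a sentence that has already been segmented into a list of words.

```python
def bmes_tags(word):
    """Return one BMES tag per character of an already-segmented word."""
    if not word:
        return []
    if len(word) == 1:
        return ["S"]  # single-character word
    # First character is B, last is E, everything in between is M.
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def to_char_level(words):
    """Flatten a segmented sentence into (character, tag) pairs."""
    pairs = []
    for word in words:
        pairs.extend(zip(word, bmes_tags(word)))
    return pairs


# Hypothetical segmented sentence, used only for illustration.
print(to_char_level(["abc", "d", "ef"]))
```

Each word contributes exactly as many tags as it has characters, so the character sequence and the tag sequence stay aligned.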

[0051] Step 2: Count the characters in NewData to obtain a character set CharSet. For example, suppose there are two words: Zhonghua...
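
As a rough sketch of Step 2, the character set can be collected from the character-level corpus and each distinct character given an integer number. The helper `number_items`, the sample data, and the choice to number in sorted order starting from 0 are assumptions made for illustration; the source does not specify the numbering scheme.

```python
def number_items(items):
    """Map each distinct item to an integer ID (sorted order, starting at 0)."""
    return {item: idx for idx, item in enumerate(sorted(set(items)))}


# Hypothetical character-level corpus: (character, BMES label) pairs.
new_data = [("北", "B"), ("京", "E"), ("北", "B"), ("方", "E"), ("人", "S")]

# CharSet with numbers; repeated characters collapse to a single entry.
char_ids = number_items(ch for ch, _ in new_data)

# The label set can be numbered with the same helper.
label_ids = number_items(label for _, label in new_data)
```

Using one helper for both characters and labels keeps the two mappings consistent and easy to invert when decoding predictions.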



Abstract

The invention discloses a word segmentation method based on Bi-LSTM-CNN. The method comprises the steps of: converting training corpus data into character-level corpus data; counting the characters in the corpus data to obtain a character set, and numbering each character to obtain a character number set; counting the character labels to obtain a label set, and numbering the labels to obtain a label number set; dividing the corpus into sentences and grouping the resulting sentences by length to obtain a data set of n groups of sentences; selecting a group of sentences from the data set without replacement, extracting multiple sentences from the group, forming data w from the characters of each sentence and collecting the corresponding labels as y; converting data w and labels y into their corresponding numbers, feeding them into the Bi-LSTM-CNN model, and training the parameters of the deep learning model; and converting the data to be predicted into a form matching the deep learning model and feeding it to the trained deep learning model to obtain the word segmentation result.
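
The grouping and sampling steps described above might look roughly like the sketch below. It is one possible reading of the abstract, not the patent's actual procedure: the names `group_by_length` and `iter_batches`, the `batch_size` and `seed` parameters, and the choice to exhaust each group before moving to the next are all assumptions.

```python
import random
from collections import defaultdict


def group_by_length(sentences):
    """Group sentences so that every group holds sentences of one length."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[len(sentence)].append(sentence)
    return dict(groups)


def iter_batches(groups, batch_size, seed=0):
    """Yield batches of equal-length sentences.

    Groups are visited in shuffled order, each exactly once, which is
    selection without replacement; within a group, sentences are shuffled
    and cut into batches of at most batch_size.
    """
    rng = random.Random(seed)
    lengths = list(groups)
    rng.shuffle(lengths)
    for length in lengths:
        sentences = list(groups[length])
        rng.shuffle(sentences)
        for start in range(0, len(sentences), batch_size):
            yield sentences[start:start + batch_size]


# Hypothetical sentences, used only for illustration.
corpus = ["ab", "cd", "abc", "e", "fg"]
for batch in iter_batches(group_by_length(corpus), batch_size=2):
    print(batch)
```

Because every batch contains sentences of a single length, the batch can be stacked into one array without padding before it is passed to a model.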

Description

Technical field

[0001] The invention belongs to the technical field of computer software and relates to a word segmentation method based on Bi-LSTM-CNN.

Background

[0002] In natural language processing, Asian texts lack the natural space separators of Western texts. Many Western text processing methods therefore cannot be applied directly to Asian texts (Chinese, Korean, and Japanese), which must first go through word segmentation in order to be handled consistently with Western languages. Word segmentation is thus the basis of information processing for Asian text, and its application scenarios include:

[0003] 1. Search engine: An important function of a search engine is full-text indexing of documents. This involves segmenting the text and then building an inverted index from the documents and their word segmentation results. When users query, they also first que...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/08
CPC: G06F40/289, G06N3/08
Inventor: 唐华阳, 岳永鹏, 刘林峰
Owner: 北京知道未来信息技术有限公司