Word segmentation method based on Bi-LSTM

A word segmentation method in the field of data technology, applied in neural learning methods, special data processing applications, instruments, etc. It addresses the problems of few network layers, inability to recognize unregistered words, and low efficiency, and achieves improved segmentation accuracy.

Inactive Publication Date: 2018-04-10
北京知道未来信息技术有限公司

AI Technical Summary

Problems solved by technology

Knowledge-based artificial neural network models have few network layers in practical applications because of the vanishing gradient problem during training, so the final word segmentation result shows no obvious advantage.
[0007] Dictionary-based word segmentation relies heavily on the dictionary, is relatively inefficient, and cannot identify unregistered words. In the present invention, registered words are words that have appeared in the corpus vocabulary, and unregistered words are words that do not appear in the corpus vocabulary.
[0008] Word segmentation based on word-frequency statistics (such as N-Gram) can only associate the semantics of the N-1 words preceding the current word, so its recognition accuracy is not high enough, and its efficiency becomes very low as N increases. Its recognition rate for unregistered words is also low.




Embodiment Construction

[0042] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.

[0043] The method flow of the present invention is shown in Figure 1 and includes:

[0044] (1) Training stage:

[0045] Step 1: If there are multiple word-segmented corpora, integrate them into one training corpus, OrgData, formatted so that each word segmentation result occupies one line; then convert the original training corpus data OrgData into character-level corpus data. Specifically, following the BMES (Begin, Middle, End, Single) marking scheme, the original training corpus data is split into characters and labeled, producing New_Data. Let the label corresponding to a given word be Label; then the character at the beginning of the word is marked as LabelB, the character in the middle of the word is marked as LabelM, and the charact...
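The character-level conversion in Step 1 can be illustrated with a minimal Python sketch. The source text is truncated before it states the End and Single rules, so the sketch assumes the conventional BMES reading (last character of a multi-character word is E, a one-character word is S). The function names `bmes_tags` and `to_char_level` and the optional `label` prefix argument are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def bmes_tags(word: str, label: str = "") -> List[str]:
    """Return one tag per character of `word`.

    Assumption: conventional BMES scheme. `label` is an optional prefix
    (the patent's "Label"); it defaults to the empty string.
    """
    if not word:
        return []
    if len(word) == 1:
        return [f"{label}S"]
    middle = [f"{label}M"] * (len(word) - 2)
    return [f"{label}B"] + middle + [f"{label}E"]


def to_char_level(segmented_words: List[str]) -> List[Tuple[str, str]]:
    """Convert one line of already-segmented words into (character, tag) pairs."""
    pairs: List[Tuple[str, str]] = []
    for word in segmented_words:
        pairs.extend(zip(word, bmes_tags(word)))
    return pairs


if __name__ == "__main__":
    for char, tag in to_char_level(["我", "喜欢", "自然语言"]):
        print(char, tag)
```

Each character carries exactly one tag, so a model that predicts a tag per character determines the word boundaries implicitly.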


Abstract

The invention discloses a word segmentation method based on Bi-LSTM. The method includes the following steps: training corpus data is converted into character-level corpus data; the corpus data is segmented according to sentence length to obtain several sentences; the obtained sentences are grouped by sentence length to obtain a data set comprising n groups of sentences; several pieces of data are extracted from the data set to serve as iteration data; the data for each iteration is converted into fixed-length vectors and fed into the deep learning model Bi-LSTM to train its parameters; when the change in the loss value between iterations is smaller than a set threshold and the loss no longer decreases, or when the maximum number of iterations is reached, training is terminated and the trained deep learning model Bi-LSTM is obtained; corpus data to be predicted is converted into character-level corpus data and fed into the trained deep learning model Bi-LSTM to obtain a word segmentation result.
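The data-preparation and stopping steps named in the abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the patented procedure: the abstract does not say how the n groups are formed (the sketch groups by exact sentence length), which padding value is used (`PAD_ID = 0` is assumed), or exactly how the loss change is measured (the absolute difference between the last two losses is assumed). All function names are hypothetical, and the Bi-LSTM model itself is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence

PAD_ID = 0  # assumption: index 0 is reserved for padding


def group_by_length(sentences: Sequence[Sequence[int]]) -> Dict[int, List[List[int]]]:
    """Group encoded sentences by exact length (one assumed way to form n groups)."""
    groups: Dict[int, List[List[int]]] = defaultdict(list)
    for sentence in sentences:
        groups[len(sentence)].append(list(sentence))
    return dict(groups)


def sample_batch(groups: Dict[int, List[List[int]]], batch_size: int,
                 rng: random.Random) -> List[List[int]]:
    """Pick one group at random, then draw up to batch_size sentences from it."""
    length = rng.choice(sorted(groups))
    pool = groups[length]
    return rng.sample(pool, min(batch_size, len(pool)))


def pad_to_fixed_length(batch: List[List[int]], fixed_len: int) -> List[List[int]]:
    """Pad with PAD_ID, or truncate, so every sentence has exactly fixed_len ids."""
    return [(sentence + [PAD_ID] * fixed_len)[:fixed_len] for sentence in batch]


def should_stop(losses: List[float], threshold: float, max_iters: int) -> bool:
    """Stop at max_iters, or when the last loss change falls below threshold."""
    if len(losses) >= max_iters:
        return True
    if len(losses) < 2:
        return False
    return abs(losses[-2] - losses[-1]) < threshold


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[1, 2], [3, 4], [5, 6, 7], [8, 9, 10]]
    batch = sample_batch(group_by_length(data), batch_size=2, rng=rng)
    print(pad_to_fixed_length(batch, fixed_len=4))
```

Drawing each batch from a single length group keeps the amount of padding per batch small, which is one plausible motivation for grouping sentences by length before training.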

Description

Technical Field

[0001] The invention belongs to the technical field of computer software and relates to a word segmentation method based on Bi-LSTM.

Background Technique

[0002] In natural language processing, Asian texts do not have natural space separators like Western texts. Many Western text processing methods cannot be applied directly to Asian texts (Chinese, Korean, and Japanese), because these texts must first go through word segmentation to be consistent with Western languages. Therefore, word segmentation is the basis of information processing for Asian text, and its application scenarios include:

[0003] 1. Search engine: An important function of a search engine is full-text indexing of documents. The text is segmented, and an inverted index is then built from the documents and their word segmentation results. When users query, they also first query ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/08
CPC: G06F40/284, G06F40/289, G06N3/08
Inventor: 岳永鹏, 唐华阳
Owner: 北京知道未来信息技术有限公司