Capsule model Chinese word segmentation method based on multi-regularization combination

A Chinese word segmentation method based on capsule technology, applied in the Internet field. It addresses problems such as performance degradation, the inapplicability of the capsule model to sequence labeling, and mediocre accuracy, and it achieves lower labor and time costs, better generalization, and domain migration.

Status: Inactive
Publication date: 2019-05-17
Owner: BEIJING UNIV OF POSTS & TELECOMM
Cites: 2; Cited by: 6

AI Technical Summary

Problems solved by technology

[0023] 1. It is designed only for handwritten digit recognition, and the overall model is not suitable for sequence labeling tasks;
[0024] 2. The image-reconstruction regularization term is not suitable for Chinese word segmentation tasks;
[0026] 1. Accuracy is only average;
[0027] 2. The capsule model is not suitable for sequence labeling tasks;
[0028] 3. Generalization is only average: training and testing can only be performed on the same corpus, and performance drops sharply when the test corpus is replaced with one from another field.

Examples

Embodiment 1

[0084] Referring to Figures 4-6, Figures 4 and 5 show a Chinese word segmentation method based on a multi-regularization combined capsule model provided by the present invention. Specifically, when training on a single-domain corpus, the method includes:

[0085] Step 1: Determine the maximum sentence length in the corpus, and pad every sentence shorter than that length with pre-stored characters until it reaches the maximum length.

[0086] In this embodiment, the maximum sentence length is set to 128 and the corpus is CTB6.0. The purpose of this step is to fix all sentences fed into the network model to a uniform length.
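The padding in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the padding symbol `<PAD>` and the helper names `pad_sentence` and `pad_corpus` are assumptions, since the text only says that "pre-stored characters" are used. Only the length of 128 comes from the embodiment.

```python
PAD = "<PAD>"   # hypothetical padding symbol; the source does not name it
MAX_LEN = 128   # maximum sentence length used in this embodiment


def pad_sentence(chars, max_len=MAX_LEN, pad=PAD):
    """Return a list of exactly max_len symbols, right-padded with `pad`."""
    if len(chars) > max_len:
        raise ValueError("sentence is longer than max_len")
    return list(chars) + [pad] * (max_len - len(chars))


def pad_corpus(sentences, max_len=None, pad=PAD):
    """Pad every sentence to the longest length found (or to max_len)."""
    if max_len is None:
        max_len = max(len(s) for s in sentences)
    return [pad_sentence(s, max_len, pad) for s in sentences]


# Toy corpus: the longest sentence has 6 characters.
corpus = ["我爱北京", "自然语言处理"]
padded = pad_corpus(corpus)
```

Right-padding is one possible choice; the source does not say on which side the filler characters are placed.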

[0087] Step 2: Map the Chinese characters of the sentences in the corpus to vector representations. Using a mapping dictionary and the word embedding method, each Chinese character is mapped to a dense (non-sparse) vector representation.
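Step 2 amounts to a dictionary lookup followed by an embedding-table lookup. The sketch below uses plain Python lists to show the idea; the function names, the embedding dimension of 4, and the random initialization are all assumptions made for illustration, and a real model would learn the table during training.

```python
import random


def build_vocab(sentences, pad="<PAD>"):
    """Mapping dictionary: each distinct symbol gets an integer index."""
    vocab = {pad: 0}  # index 0 reserved for the (hypothetical) padding symbol
    for sentence in sentences:
        for ch in sentence:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def init_embeddings(vocab_size, dim, seed=0):
    """Dense embedding table: one small random vector per vocabulary entry."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(symbols, vocab, table):
    """Replace each symbol with its dense vector from the table."""
    return [table[vocab[s]] for s in symbols]


vocab = build_vocab(["我爱北京", "北京大学"])
table = init_embeddings(len(vocab), dim=4)
vectors = embed(list("北京") + ["<PAD>"], vocab, table)
```

This sketch does not handle characters missing from the dictionary; an out-of-vocabulary entry would be needed in practice.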

[0088] Further, through the mapping di...

Abstract

The invention provides a capsule model Chinese word segmentation method based on multi-regularization combination. By adding a capsule sliding window (capsule split window), the capsule model is transferred to a natural language processing (NLP) sequence labeling task, namely Chinese word segmentation, which solves the technical problem that the capsule model is not suitable for sequence labeling. Adapting the capsule model to sequence labeling allows Chinese word segmentation to be completed with higher accuracy and supports more complex natural language processing tasks. Combining multiple regularization terms improves the generalization ability of the model and achieves a degree of domain migration, which reduces the need for manual corpus annotation and therefore the labor and time cost of annotation in natural language processing research.
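To make the sliding-window idea concrete, here is a minimal sketch of how a window could give every character position its own fixed-size input, so that one label can be predicted per character. The abstract does not describe how the capsule sliding window is built, so the window size, the zero-padding at sentence edges, and the name `sliding_windows` are all assumptions.

```python
def sliding_windows(vectors, window=5):
    """Return, for each position, the `window` vectors centered on it.

    Positions past either end of the sentence are filled with zero vectors
    (an assumption; the source does not specify edge handling).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centered")
    if not vectors:
        return []
    half = window // 2
    zero = [0.0] * len(vectors[0])
    padded = [zero] * half + list(vectors) + [zero] * half
    return [padded[i:i + window] for i in range(len(vectors))]


# Three one-dimensional character vectors, window of 3.
windows = sliding_windows([[1.0], [2.0], [3.0]], window=3)
```

Each element of `windows` could then be passed to a capsule layer to produce the label for the character at its center.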

Description

Technical field

[0001] The invention relates to the field of Internet technology, and in particular to a Chinese word segmentation method based on a multi-regularization combined capsule model.

Background art

[0002] With the development of information technology, machine learning, and related technologies, automatic information processing has gradually been applied to a variety of scenarios, such as mining user preferences from movie reviews and shopping product reviews, or automatically generating a short summary of an article. As Chinese users become more active on the Internet and generate more and more information, automated processing of textual information becomes all the more necessary. These developments have led natural language processing technologies to be applied widely throughout society. For natural language processing technology, especially for the development of domestic natur...

Application Information

IPC(8): G06F17/27, G06K9/00, G06K9/62, G06N3/04, G06N3/08
Inventors: 李明正, 李思, 孙忆南, 徐雅静, 王蓬辉, 赵建博, 刘伟杰
Owner BEIJING UNIV OF POSTS & TELECOMM