Multi-criterion Chinese word segmentation method based on local self-attention mechanism and segmentation tree

A Chinese word segmentation technology using attention, applied in neural learning methods, natural language data processing, instruments, etc. It addresses problems such as unreasonable word combinations and failure to make use of word segmentation criteria, with the stated effects of improving accuracy and reducing impact.

Inactive Publication Date: 2020-08-07
HANGZHOU DIANZI UNIV

AI Technical Summary

Problems solved by technology

[0003] Existing solutions can extract common features from multiple segmentation criteria, thereby improving the accuracy of each word segmentation criterion, but none of the current soluti

Examples


Embodiment

[0093] Suppose the input sentence is X = "literary experts from all over the country walk out of the Great Hall of the People".

[0094] 1. First, obtain the unigram and bigram features of each character through word2vec, and combine them with the predefined position vector to form the embedding layer.

[0095] 2. Pass the embedding layer to the self-attention network and obtain its output. The output of the self-attention network is decoded by a CRF layer, which labels each character and yields multiple labeling results. The results are shown in Figure 2.

[0096] 3. Combine these annotation results into a segmentation tree to generate multiple segmentation sequences. The generated segmentation tree is shown in Figure 3.

[0097] 4. Input the multiple segmentation sequences into the scoring system, and select the set of segmentation sequences with the highest score as output. The scoring system is shown in Figure 4.
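Steps 3 and 4 can be illustrated with a minimal Python sketch. The record does not disclose how the segmentation tree is built or how the scoring system works, so everything here is an assumption made for illustration: the labels are taken to be well-formed BMES tags, the tree is represented as a map from each start offset to the word ends proposed by any labeling, and the hypothetical `score` function simply averages how many labelings agree with each word. All function names are invented for this sketch.

```python
from typing import Dict, Iterator, List, Tuple

Span = Tuple[int, int]  # half-open character span [start, end)


def labels_to_spans(labels: List[str]) -> List[Span]:
    """Convert one well-formed BMES label sequence into word spans."""
    spans, start = [], 0
    for i, tag in enumerate(labels):
        if tag in ("B", "S"):
            start = i
        if tag in ("E", "S"):
            spans.append((start, i + 1))
    return spans


def build_tree(labelings: List[List[str]]) -> Dict[int, List[int]]:
    """Map each start offset to the word end offsets proposed by any labeling."""
    tree: Dict[int, set] = {}
    for labels in labelings:
        for start, end in labels_to_spans(labels):
            tree.setdefault(start, set()).add(end)
    return {start: sorted(ends) for start, ends in tree.items()}


def enumerate_paths(tree: Dict[int, List[int]], n: int, start: int = 0) -> Iterator[List[Span]]:
    """Yield every span sequence in the tree that covers characters [0, n)."""
    if start == n:
        yield []
        return
    for end in tree.get(start, []):
        for rest in enumerate_paths(tree, n, end):
            yield [(start, end)] + rest


def score(path: List[Span], labelings: List[List[str]]) -> float:
    """Hypothetical score: average number of labelings that propose each word."""
    proposed = [set(labels_to_spans(labels)) for labels in labelings]
    votes = sum(span in spans for span in path for spans in proposed)
    return votes / max(len(path), 1)


def best_segmentation(sentence: str, labelings: List[List[str]]) -> List[str]:
    """Pick the highest-scoring candidate segmentation and return its words."""
    tree = build_tree(labelings)
    best = max(enumerate_paths(tree, len(sentence)), key=lambda p: score(p, labelings))
    return [sentence[s:e] for s, e in best]


# Three illustrative labelings of "人民大会堂" (Great Hall of the People).
labelings = [list("BMMME"), list("BEBME"), list("BEBME")]
print(best_segmentation("人民大会堂", labelings))  # ['人民', '大会堂']
```

Because the tree merges word boundaries from all labelings, a path may mix words proposed under different criteria; the averaging in `score` is only one simple way to rank such paths.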

Abstract

The invention discloses a multi-criterion Chinese word segmentation method based on a local attention mechanism and a segmentation tree. For a text sequence of a corpus, the method comprises the following implementation steps: inputting the text sequence and obtaining the unigram and bigram features of each character through word2vec; combining the unigram and bigram features with a predefined position vector to serve as the embedding layer; passing the embedding layer to a self-attention network and obtaining the network's output; labeling each character through CRF layer decoding and obtaining a plurality of labeling results; combining the labeling results into a segmentation tree to form a plurality of segmentation sequences; and inputting the plurality of segmentation sequences into a scoring system and selecting the group of segmentation sequences with the highest score as output. The method improves the accuracy of multi-criterion word segmentation.
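The first two steps of the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: `toy_vector` is a deterministic stand-in for a trained word2vec lookup, the features are combined by concatenation, the position vector is a simple sine/cosine pair, "local" is taken to mean that each character attends only to neighbours within a fixed `window`, and the attention uses the inputs directly without learned query, key, or value projections. None of these details are stated in the record.

```python
import math
from typing import List, Tuple


def char_ngrams(sentence: str, pad: str = "</s>") -> List[Tuple[str, str]]:
    """Return (unigram, bigram) per character; the last bigram uses a pad token."""
    chars = list(sentence) + [pad]
    return [(chars[i], chars[i] + chars[i + 1]) for i in range(len(sentence))]


def toy_vector(token: str, dim: int) -> List[float]:
    """Stand-in for a word2vec lookup (assumption): deterministic toy values."""
    seed = sum(ord(c) for c in token)
    return [((seed * (k + 1)) % 97) / 97.0 for k in range(dim)]


def embed(sentence: str, dim: int = 4) -> List[List[float]]:
    """Concatenate unigram, bigram, and position vectors for each character."""
    rows = []
    for pos, (uni, bi) in enumerate(char_ngrams(sentence)):
        position = [math.sin(pos), math.cos(pos)]  # assumed position encoding
        rows.append(toy_vector(uni, dim) + toy_vector(bi, dim) + position)
    return rows


def local_self_attention(x: List[List[float]], window: int) -> List[List[float]]:
    """Scaled dot-product self-attention restricted to |i - j| <= window."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        neighbours = range(max(0, i - window), min(n, i + window + 1))
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in neighbours]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(w / total * x[j][k] for w, j in zip(exps, neighbours)) for k in range(d)])
    return out


embedded = embed("人民大会堂")
attended = local_self_attention(embedded, window=1)
print(len(attended), len(attended[0]))  # 5 10
```

In a full model the attended vectors would feed the CRF decoding layer; a narrower `window` keeps each character's representation focused on its immediate context.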

Description

Technical field

[0001] The invention relates to the Chinese word segmentation task, specifically to a multi-criteria Chinese word segmentation method based on a local self-attention mechanism and a segmentation tree, and belongs to the technical field of natural language processing.

Background technique

[0002] In recent years, Chinese word segmentation models based on neural networks have achieved very good segmentation accuracy. At present, the best Chinese word segmentation methods are based on supervised machine learning algorithms, which rely heavily on large-scale, high-quality annotated corpora. However, building a high-quality annotated corpus is very costly. It is therefore necessary to combine existing corpora: although different corpora follow different segmentation standards, they still share many common features, and exploiting these common features can improve the accuracy under each word segmentation criterion.

[0003] Existing solu...

Application Information

IPC(8): G06F40/289, G06N3/04, G06N3/08
CPC: G06F40/289, G06N3/08, G06N3/045
Inventor: 张旻, 夏小勇, 姜明, 汤景凡
Owner: HANGZHOU DIANZI UNIV