A Chinese Word Segmentation Incremental Learning Method

An incremental learning technique for Chinese word segmentation, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the model retraining, large data processing volume, long computing time, and high hardware requirements of existing approaches, and thereby reduces training cost and saves processing time.

Active Publication Date: 2017-11-17
哈尔滨工业大学人工智能研究院有限公司
7 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0006] Existing methods add target-domain data to the source-domain segmentation data and train on the mixed data. They must retrain the model every time data is mixed in, and the very large volume of data to process leads to long computation times and high hardware requirements. The present invention aims to solve these problems.



Examples


Specific Embodiment 1

[0024] Specific Embodiment 1: This embodiment is described with reference to FIG. 1.

[0025] A Chinese word segmentation incremental learning method, comprising the steps of:

[0026] Step 1: Suppose the Chinese sentence set contains N sentences. Each sentence x_n in the set is manually labeled, and the manual labeling result for x_n is y_n. The manually labeled sentences (x_n, y_n) are recorded as the training set, where n is the index of the sentence, n = 1, 2, ..., N;

[0027] Step 2: Initialize the weight vector W of the features in the Chinese sentence set, and denote the initialized weight vector as W_1 = (w_1, w_2, ..., w_M), where w_1, w_2, ..., w_M are the weights corresponding to each feature in the Chinese sentence set and M is the total number of features in the Chinese sentence set;

[0028] Step 3: For the N sentences in the Chinese sentence set, calculate the weight vector W of each se...

Specific Embodiment 2

[0041] In this embodiment, the specific steps for calculating the weight vector W_n of each sentence for the N sentences in the Chinese sentence set, as described in Step 3, are as follows:

[0042] Step 3.1: Segment the sentence x_n according to the Chinese word segmentation method. The segmentation process admits many possible segmentations, and each one is recorded as a possible labeling result y'_n;

[0043] For the labeling result y'_n, extract the feature vector (f_1, f_2, ..., f_M) according to the feature extraction function Φ(x_n, y'_n);

[0044] Step 3.2: Calculate the score obtained when the sentence x_n is segmented into the labeling result y'_n, according to the following formula:

[0045] score = w_1 f_1 + w_2 f_2 + … + w_M f_M = W_n · Φ(x_n, y'_n)
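The score in Step 3.2 is a dot product between the weight vector and the feature vector. The following is a minimal sketch of that computation, not the patent's implementation. The feature templates (word identity and word length), the `feature_index` mapping, and the weight values are all hypothetical, since the text does not say which features Φ extracts.

```python
from typing import Dict, List


def phi(x: str, y: List[str], feature_index: Dict[str, int]) -> List[float]:
    """Hypothetical feature extraction: counts of word and word-length features."""
    if "".join(y) != x:
        raise ValueError("segmentation must cover the sentence exactly")
    f = [0.0] * len(feature_index)
    for word in y:
        for name in (f"word={word}", f"len={len(word)}"):
            if name in feature_index:
                f[feature_index[name]] += 1.0
    return f


def score(w: List[float], f: List[float]) -> float:
    """score = w_1 f_1 + w_2 f_2 + ... + w_M f_M."""
    return sum(w_i * f_i for w_i, f_i in zip(w, f))


# Assumed toy feature space with M = 4 features and made-up weights.
feature_index = {"word=北京": 0, "word=大学": 1, "len=1": 2, "len=2": 3}
W_n = [1.0, 0.5, -1.0, 0.25]

x_n = "北京大学"
f_good = phi(x_n, ["北京", "大学"], feature_index)
f_bad = phi(x_n, ["北", "京", "大", "学"], feature_index)
score_good = score(W_n, f_good)
score_bad = score(W_n, f_bad)
```

With these assumed weights, the two-word segmentation scores higher than the character-by-character one, which is the kind of preference the weight vector is meant to encode.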

[0046] Step 3.3: Segment the sentence x_n in every possible way and calculate the corresponding score for each; the segmentation method with the largest scor...
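Selecting the highest-scoring candidate in Step 3.3 can be illustrated by brute force on a short sentence. This is a sketch under stated assumptions: the scoring function is a simplified, hypothetical stand-in (a sum of per-word weights), and `word_weights` holds made-up values. Exhaustive enumeration grows exponentially with sentence length, so practical decoders typically use dynamic programming or beam search instead.

```python
from typing import Dict, Iterator, List


def all_segmentations(x: str) -> Iterator[List[str]]:
    """Yield every way of splitting x into contiguous, non-empty words."""
    if not x:
        yield []
        return
    for end in range(1, len(x) + 1):
        for rest in all_segmentations(x[end:]):
            yield [x[:end]] + rest


def score(y: List[str], word_weights: Dict[str, float]) -> float:
    """Hypothetical sparse score: sum of the weights of the words in y."""
    return sum(word_weights.get(word, 0.0) for word in y)


def best_segmentation(x: str, word_weights: Dict[str, float]) -> List[str]:
    """Return the candidate segmentation with the largest score."""
    return max(all_segmentations(x), key=lambda y: score(y, word_weights))


# Made-up weights for illustration only.
word_weights = {"北京": 1.0, "大学": 1.0, "北京大学": 1.5, "京大": -0.5}

candidates = list(all_segmentations("北京大学"))
best = best_segmentation("北京大学", word_weights)
```

A sentence of four characters has 2^3 = 8 candidate segmentations, one for each choice of whether to cut at each of the three internal boundaries.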

Embodiment

[0053] Experiments are conducted on CTB5.0 and on data from the web novel Zhu Xian. CTB5.0 serves as the source-domain data and is divided into a CTB5.0 training set and a CTB5.0 test set following the split used in "Enhancing Chinese Word Segmentation Using Unlabeled Data". The incremental data is taken from the novel Zhu Xian, denoted ZX; following the split used in "Type-supervised domain adaptation for joint segmentation and pos-tagging", it is divided into a ZX training set and a ZX test set. From the ZX training set, 500 sentences are randomly selected as a small-scale training set and 2400 sentences are randomly selected as a large-scale training set.
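The random selection of the small-scale and large-scale ZX training sets could be reproduced along the following lines. This is only a sketch: the placeholder corpus, its size, the seeds, and the choice to draw the two subsets independently are assumptions, since the text does not specify them.

```python
import random
from typing import List, Sequence


def sample_subset(sentences: Sequence[str], k: int, seed: int) -> List[str]:
    """Draw k distinct sentences at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(sentences), k)


# Placeholder standing in for the ZX training set; the size is hypothetical.
zx_train = [f"sentence-{i}" for i in range(3000)]

small_train = sample_subset(zx_train, 500, seed=0)    # small-scale training set
large_train = sample_subset(zx_train, 2400, seed=1)   # large-scale training set
```

Fixing the seed keeps the subsets stable across runs, which makes comparisons between training configurations repeatable.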

[0054] The model is trained on the CTB5.0 training set and then tested on the CTB5.0 test set and the ZX test set. The test results are shown in Table 1, and the experimental results are...



Abstract

A Chinese word segmentation incremental learning method, relating to the field of Chinese word segmentation. Existing methods add target-domain data to the source-domain segmentation data and train on the mixed data; they must retrain the model every time data is mixed in, and the very large volume of data to process leads to long computation times and high hardware requirements. The present invention aims to solve these problems. In the present invention, each sentence x_n in the Chinese sentence set is first manually labeled, and the manually labeled sentences (x_n, y_n) are recorded as the training set. The weight vector W of the features in the Chinese sentence set is initialized, and the weight vector W_n of each of the N sentences in the Chinese sentence set is calculated. T iterations are then performed, and the average of the weight vectors is calculated. When an incremental Chinese sentence set is introduced into the Chinese sentence set, the weight vectors of the incremental Chinese sentence set are calculated and the average is computed to obtain the Chinese word segmentation incremental weight parameter

$$\bar{W}_{\Delta} = \frac{1}{N T + N_{add} T_{add}} \left( \sum_{n=1,\,t=1}^{n=N,\,t=T} W_{n,t} + \sum_{n=1,\,t=1}^{n=N_{add},\,t=T_{add}} W^{add}_{n,t} \right)$$

which completes the incremental learning for Chinese word segmentation. The invention is applicable to the field of Chinese word segmentation.
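The incremental weight parameter in the abstract is an average over all N·T source weight vectors and all N_add·T_add incremental weight vectors. One way to read this is that only a running sum and a count need to be stored, so new data can be folded in without reprocessing the source data. The sketch below illustrates that reading with made-up toy values; the class name `RunningAverage` and the numbers are hypothetical and are not taken from the patent.

```python
from typing import List

Vector = List[float]


class RunningAverage:
    """Keeps the sum and the count of the weight vectors W_{n,t} seen so far."""

    def __init__(self, dim: int) -> None:
        self.total: Vector = [0.0] * dim
        self.count: int = 0

    def add(self, w: Vector) -> None:
        for i, value in enumerate(w):
            self.total[i] += value
        self.count += 1

    def average(self) -> Vector:
        if self.count == 0:
            raise ValueError("no weight vectors recorded yet")
        return [s / self.count for s in self.total]


# Toy values: N = 2 sentences, T = 2 iterations, so N*T = 4 source vectors.
source_weights = [[1.0, 0.0], [1.0, 2.0], [3.0, 2.0], [3.0, 4.0]]
# N_add = 1 sentence, T_add = 2 iterations, so 2 incremental vectors.
added_weights = [[5.0, 4.0], [5.0, 6.0]]

acc = RunningAverage(dim=2)
for w in source_weights:
    acc.add(w)
source_average = acc.average()        # average over the N*T source vectors

for w in added_weights:               # only the new vectors are processed
    acc.add(w)
incremental_average = acc.average()   # divides by N*T + N_add*T_add
```

In this toy run the second average is taken over six vectors (4 + 2), matching the denominator N T + N_add T_add in the formula, and the four source vectors are never revisited.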

Description

Technical field

[0001] The invention relates to the field of Chinese word segmentation.

Background

[0002] A word is the smallest language unit with independent meaning. Chinese uses the character as its basic writing unit, and there is no explicit delimiter between words. Chinese word segmentation is therefore the foundation and key of Chinese information processing, and it is widely used in tasks such as information retrieval and text mining.

[0003] In recent years, statistics-based Chinese word segmentation methods have achieved good performance in the news domain. However, with the rapid development of the Internet, social media, and mobile platforms, the data processed by current Chinese word segmentation models is no longer limited to the news domain; more and more open-domain data is being added, which places new requirements on Chinese word segmentation models. Existing research shows that when the Chinese word segmentat...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/27
Inventor: 车万翔, 刘一佳, 刘挺, 赵妍妍
Owner: 哈尔滨工业大学人工智能研究院有限公司