
Chinese formal text word segmentation method based on active learning

An active learning and text segmentation technology, applied in fields such as special data processing applications, instruments, and electronic digital data processing. It addresses the problems that existing methods need a large amount of manually labeled data and cannot resolve boundary ambiguity and unregistered words, and it has the effect of reducing manual data labeling.

Inactive Publication Date: 2018-09-11
CHENGDU UNIV OF INFORMATION TECH

AI Technical Summary

Problems solved by technology

Traditional greedy algorithms include forward maximum matching, reverse maximum matching, and two-way (bidirectional) matching. These methods require a large amount of manually labeled data and cannot solve the two major problems of Chinese word segmentation: ambiguity and unregistered words.
In 1986, Liang Nanyuan and others applied the maximum matching method to Chinese word segmentation. Maximum matching is a typical dictionary-based Chinese word segmentation method; its disadvantage is that it cannot resolve boundary ambiguity or unregistered words.
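
The forward maximum matching idea mentioned above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `forward_max_match`, the toy dictionary, and the fallback of emitting a single character when nothing matches are all assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumed fallback: an unmatched single character becomes its own word.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy dictionary, chosen to expose a boundary ambiguity.
toy_dict = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", toy_dict))
```

On this input the greedy match takes 研究生 first and leaves 命 stranded, whereas the intended reading is 研究 / 生命 / 的 / 起源. That is the boundary ambiguity the text refers to; a word missing from the dictionary would likewise be broken into single characters.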



Examples


Embodiment 1

[0035] The present invention provides a method of active learning, as shown in Figure 1, including the following steps:

[0036] Step 1: Use the existing small amount of labeled data to train a prediction model;

[0037] Step 2: Use the trained prediction model to predict the unlabeled data; the prediction results are used to select the data to be labeled from the unlabeled data;

[0038] Step 3: Use the sampling method to select the most informative data fragments from the data to be labeled and submit them to experts for labeling;

[0039] Step 4: Combine the newly labeled data with the existing labeled data to retrain the prediction model, and iterate until a set labeling ratio is reached;
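
The four steps above can be sketched as a loop. This is a minimal sketch under stated assumptions, not the patented implementation: the gap-level naive Bayes model (`GapNaiveBayes`), its two features (the characters on either side of a gap), the uncertainty score used as the "sampling method", and the `oracle` callable standing in for the expert are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def gaps(words):
    """Turn a segmented sentence into ((left_char, right_char), is_boundary) pairs."""
    text = "".join(words)
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return [((text[i - 1], text[i]), i in cuts) for i in range(1, len(text))]


class GapNaiveBayes:
    """Naive Bayes over two features of a character gap: the left and right character."""

    def fit(self, sentences):
        self.prior = Counter()
        self.left = {True: Counter(), False: Counter()}
        self.right = {True: Counter(), False: Counter()}
        self.vocab = set()
        for words in sentences:
            for (l, r), y in gaps(words):
                self.prior[y] += 1
                self.left[y][l] += 1
                self.right[y][r] += 1
                self.vocab.update((l, r))
        return self

    def p_boundary(self, l, r):
        """P(boundary | left char, right char), with add-one smoothing."""
        v = len(self.vocab) + 1
        score = {}
        for y in (True, False):
            n = self.prior[y]
            score[y] = (math.log(n + 1)
                        + math.log((self.left[y][l] + 1) / (n + v))
                        + math.log((self.right[y][r] + 1) / (n + v)))
        m = max(score.values())
        e = {y: math.exp(s - m) for y, s in score.items()}
        return e[True] / (e[True] + e[False])


def uncertainty(model, text):
    """Mean closeness of gap probabilities to 0.5; higher means less confident."""
    ps = [model.p_boundary(text[i - 1], text[i]) for i in range(1, len(text))]
    return sum(0.5 - abs(p - 0.5) for p in ps) / max(len(ps), 1)


def active_learning(labeled, unlabeled, oracle, target_ratio):
    """Train, score the pool, query the oracle, retrain, until the labeling ratio is reached."""
    labeled, pool = list(labeled), list(unlabeled)
    total = len(labeled) + len(pool)
    model = GapNaiveBayes().fit(labeled)                        # Step 1
    while pool and len(labeled) / total < target_ratio:
        pick = max(pool, key=lambda t: uncertainty(model, t))   # Steps 2-3
        pool.remove(pick)
        labeled.append(oracle(pick))                            # expert labels the pick
        model = GapNaiveBayes().fit(labeled)                    # Step 4
    return model, labeled


# Hypothetical toy data: the dict lookup plays the role of the expert.
gold = {
    "主动学习方法": ["主动", "学习", "方法"],
    "文本分词": ["文本", "分词"],
    "学习分词": ["学习", "分词"],
}
seed = [["中文", "分词", "方法"]]
model, labeled = active_learning(seed, list(gold), gold.__getitem__, target_ratio=0.75)
print(len(labeled), "labeled sentences after the loop")
```

Uncertainty sampling is only one possible reading of "most informative"; any scoring function over the unlabeled pool could be substituted for `uncertainty` without changing the loop.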

[0040] Specifically, when there is little or no labeled data, manually labeling data is a time-consuming and labor-intensive task. Active learning is to use learning algorithms to sub...

Experiment example

[0088] Data generally includes formal data and informal data. For example, literature and People's Daily are formal data, while Weibo is informal data. The data used here come from 16 core journals such as "Computer Science", "Computer Application", "Journal of Software", and "Journal of Medical Informatics"; a total of 10,000 paper titles are used. The data are formal text: information-dense, short, and concise.

[0089] Experimental evaluation

[0090] This application uses the commonly used F-score to measure the performance of the classifier, that is, the harmonic mean of the precision rate and the recall rate. The confusion matrix shown in Table 1 is used to introduce the precision and recall of the experiments.

[0091] Table 1 Confusion matrix

[0092]

Segmented after word segmentation

Not segmented after word segmentation

should act...
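
As a rough illustration of the evaluation described in [0090], the sketch below computes precision, recall, and F-score from confusion-matrix counts. Counting at the level of word boundaries (a boundary predicted and expected is a true positive, and so on) is an assumption made for this example, as are the function names `boundary_counts` and `precision_recall_f`.

```python
def cut_positions(words):
    """Character offsets at which a segmentation places a word boundary."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def boundary_counts(gold_words, pred_words):
    """Return (tp, fp, fn) over boundaries; assumes both segment the same string."""
    gold, pred = cut_positions(gold_words), cut_positions(pred_words)
    return len(gold & pred), len(pred - gold), len(gold - pred)


def precision_recall_f(tp, fp, fn):
    """Precision, recall, and their harmonic mean (the F-score)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: one misplaced boundary out of three.
gold = ["研究", "生命", "的", "起源"]
pred = ["研究生", "命", "的", "起源"]
tp, fp, fn = boundary_counts(gold, pred)
p, r, f = precision_recall_f(tp, fp, fn)
print(tp, fp, fn, round(p, 3), round(r, 3), round(f, 3))
```

Here two of the three predicted boundaries agree with the reference, so precision, recall, and F-score all equal 2/3.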


Abstract

The invention provides a Chinese formal text word segmentation method based on active learning. The method comprises the following steps: the current annotated data set L is used to train a naive Bayes classifier; the current naive Bayes classifier is used to annotate the to-be-annotated data set U; a sampling method is used to select the most informative fragments and submit them to an expert for annotation; the newly annotated fragments are added to the annotated data set L; and the process iterates until a preset stopping condition is met. The method can effectively reduce the amount of manually annotated data and obtain a word segmenter with better performance. The performance (measured by F value) of a model trained on data selected by active learning is about 5 percentage points higher than that of a model trained on data selected by random sampling. When active learning is combined with EM iteration, the performance of the model obtained at each round is about 1.5 percentage points higher than that of the model obtained by active learning alone.

Description

Technical field

[0001] The invention relates to the technical field of word segmentation, in particular to a Chinese formal text word segmentation method based on active learning and the expectation maximization algorithm.

Background technique

[0002] Word segmentation is a key basic step in natural language processing and an indispensable link in many application systems, such as information retrieval, named entity recognition, machine translation, and syntactic analysis. The quality of word segmentation directly affects the final results of these applications. However, compared with texts in languages such as English, Chinese text has no explicit separator such as a space between words. Chinese word segmentation is the task of letting the computer automatically recognize the boundaries between words in a Chinese character string. There has been a large amount of research on Chinese word segme...


Application Information

IPC(8): G06F17/27
CPC: G06F40/289
Inventor: 王亚强, 何梦秋, 何思佑, 唐聃, 舒红平
Owner: CHENGDU UNIV OF INFORMATION TECH