Chinese formal text word segmentation method based on active learning

An active learning text segmentation technique, applied in fields such as special data processing applications, instruments, and electronic digital data processing. It addresses the need for large amounts of manually labeled data and the inability of existing methods to handle boundary ambiguity, unregistered words, and differences in word meaning, with the effect of reducing manual data labeling.

Status: Inactive; Publication Date: 2018-09-11

CHENGDU UNIV OF INFORMATION TECH

2 Cites, 6 Cited by

AI Technical Summary

Problems solved by technology

Traditional greedy algorithms include forward maximum matching, reverse maximum matching, and two-way matching. These methods require a large amount of manually labeled data and cannot solve the two major problems of Chinese word segmentation: ambiguity and unregistered (out-of-vocabulary) words.

In 1986, Liang Nanyuan and others applied the maximum matching method to Chinese word segmentation. Maximum matching is a typical dictionary-based Chinese word segmentation method. Its disadvantage is that it can neither resolve boundary ambiguity nor handle unregistered words.
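The forward maximum matching idea mentioned above can be illustrated with a minimal Python sketch. This is not code from the patent: the function name `forward_max_match`, the toy dictionary, and the single-character fallback for unmatched text are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    if max_len is None:
        # Longest dictionary entry bounds the candidate length (1 if the dictionary is empty).
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single character is always accepted as a fallback,
            # which is how unregistered words end up split into characters.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, chosen to expose a boundary ambiguity.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", toy_dictionary))
```

With this toy dictionary the greedy choice takes "研究生" first, so "生命" can no longer be recovered, even though the intended reading is 研究 / 生命 / 的 / 起源. This is the kind of boundary ambiguity that a purely dictionary-driven method cannot resolve.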

Examples

Embodiment 1

[0035] As shown in Figure 1, the present invention provides an active learning method comprising the following steps:

[0036] Step 1: Train a prediction model on the small amount of labeled data that already exists;

[0037] Step 2: Apply the trained prediction model to the unlabeled data to obtain prediction results, which are used to select the data to be labeled from the unlabeled data;

[0038] Step 3: Use a sampling method to select the most informative data fragments from the data to be labeled and submit them to experts for labeling;

[0039] Step 4: Combine the newly labeled data with the existing labeled data to retrain the prediction model, and iterate until a preset labeling ratio is reached;
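The four steps above can be sketched as a pool-based active learning loop. The following Python sketch is illustrative only and rests on several assumptions that are not in the patent text: a toy one-feature class-mean model stands in for the prediction model, least-confidence scoring stands in for the sampling method, and the names `train`, `prob_positive`, `least_confidence`, `active_learning`, `oracle`, `batch_size`, and `target_ratio` are hypothetical.

```python
def train(labeled):
    """Toy stand-in model: the mean feature value of each class (both classes must be present)."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in labeled if y == label]
        means[label] = sum(values) / len(values)
    return means


def prob_positive(model, x):
    """Pseudo-probability of class 1, based on the distances to the two class means."""
    d0 = abs(x - model[0])
    d1 = abs(x - model[1])
    if d0 + d1 == 0:
        return 0.5
    return d0 / (d0 + d1)


def least_confidence(p):
    """Uncertainty score: highest when the model is closest to a 50/50 guess."""
    return 1.0 - max(p, 1.0 - p)


def active_learning(labeled, unlabeled, oracle, batch_size, target_ratio):
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    total = len(labeled) + len(unlabeled)
    model = train(labeled)  # Step 1: train on the small labeled seed set
    while unlabeled and len(labeled) / total < target_ratio:
        # Step 2: score every unlabeled item with the current model
        ranked = sorted(
            unlabeled,
            key=lambda x: least_confidence(prob_positive(model, x)),
            reverse=True,
        )
        # Step 3: send the most informative items to the expert (oracle)
        for x in ranked[:batch_size]:
            labeled.append((x, oracle(x)))
            unlabeled.remove(x)
        # Step 4: retrain on the enlarged labeled set and repeat
        model = train(labeled)
    return model, labeled


# Hypothetical data: one numeric feature, true class boundary at 5.0.
seed = [(0.0, 0), (10.0, 1)]
pool = [1.0, 2.0, 4.0, 4.9, 5.1, 6.0, 8.0, 9.0]
final_model, final_labeled = active_learning(
    seed, pool, oracle=lambda x: int(x >= 5.0), batch_size=2, target_ratio=0.6
)
print(sorted(x for x, _ in final_labeled))
```

In this toy run the loop asks the oracle only about the points nearest the current decision boundary and leaves the easy points unlabeled, which is the intended saving in labeling effort.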

[0040] Specifically, when there is little or no labeled data, manually labeling data is a time-consuming and labor-intensive task. Active learning is to use learning algorithms to sub...

Experiment example

[0088] Data can generally be divided into formal and informal data. For example, academic literature and the People's Daily are formal data, while Weibo posts are informal data. The data used here come from 16 core journals, such as "Computer Science", "Computer Application", "Journal of Software", and "Journal of Medical Informatics"; a total of 10,000 paper titles are used. These titles are formal text: they are short and concise yet carry a large amount of information.

[0089] Experimental evaluation

[0090] This application uses the commonly used F-score, that is, the harmonic mean of precision and recall, to measure the performance of the classifier. The confusion matrix in Table 1 is used to define the precision and recall of the experiments.

[0091] Table 1 Confusion matrix

[0092]

Segmented after word segmentation | Not segmented after word segmentation

should act...
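The F-score described in [0090] can be computed from confusion-matrix counts as in the sketch below. Because the table is truncated in this copy, the sketch assumes the counts refer to word-boundary decisions (a boundary that should be placed and is placed counts as a true positive); the function names `boundary_counts` and `f_score` and the example boundary positions are hypothetical.

```python
def boundary_counts(gold, predicted):
    """Compare two sets of boundary positions and return (tp, fp, fn)."""
    tp = len(gold & predicted)   # boundaries that should exist and were produced
    fp = len(predicted - gold)   # boundaries produced but not wanted
    fn = len(gold - predicted)   # boundaries wanted but missed
    return tp, fp, fn


def f_score(tp, fp, fn):
    """Return (precision, recall, F), where F is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: boundaries are character offsets inside "研究生命的起源".
gold = {2, 4, 5}        # 研究 / 生命 / 的 / 起源
predicted = {3, 4, 5}   # 研究生 / 命 / 的 / 起源
precision, recall, f = f_score(*boundary_counts(gold, predicted))
print(precision, recall, f)
```

Here two of the three predicted boundaries are correct and two of the three gold boundaries are found, so precision, recall, and F all come out at about two thirds.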

Abstract

The invention provides a Chinese formal text word segmentation method based on active learning. The method comprises the following steps: the current annotated data set L is used to train a naive Bayes classifier; the current naive Bayes classifier is used to annotate the to-be-annotated data set U; a sampling method is used to select the most informative fragments for annotation by an expert; the newly annotated fragments are added to the annotated data set L; and iteration continues until a preset stopping condition is met. The method can effectively reduce the amount of manually annotated data while obtaining a segmenter with better performance. The performance (measured by F value) of a model trained on data selected by active learning is about 5 percentage points higher than that of a model trained on randomly drawn data. When active learning is combined with EM iteration, the model trained at each round performs about 1.5 percentage points better than the model trained with active learning alone.

Description

Technical field

[0001] The invention relates to the technical field of word segmentation, in particular to a Chinese formal text word segmentation method based on active learning and the expectation maximization algorithm.

Background

[0002] Word segmentation is a key basic step in natural language processing and an indispensable part of many application systems, such as information retrieval, named entity recognition, machine translation, and syntactic analysis. The quality of word segmentation directly affects the final results of these applications. However, unlike texts in inflectional languages such as English, texts in isolating languages such as Chinese have no obvious separator, such as a space, between words. Chinese word segmentation is the task of having a computer automatically identify the boundaries between words in a Chinese character string. Nowadays, there have been a large number of researches on Chinese word segme...

Application Information

IPC(8): G06F17/27

CPC: G06F40/289

Inventor: 王亚强, 何梦秋, 何思佑, 唐聃, 舒红平

Owner: CHENGDU UNIV OF INFORMATION TECH
