Chinese text category recognition system and method for unbalanced data sampling

A data sampling and text classification technology, applied in text database clustering/classification, neural learning methods, electrical digital data processing, and related fields, which addresses problems such as feature loss, low feature-extraction granularity, and reliance on positive samples alone.

Pending Publication Date: 2020-08-25
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

Undersampling screens out a representative subset of samples so that positive and negative samples reach a balanced proportion. However, the two current families of undersampling methods, clustering-based and ensemble-based (integration-based), only consider the selection of important features, so they carry some risk of feature loss. Redundancy in the negative sample data also needs to be addressed.
Oversampling uses a model to generate new positive samples from the small number of existing positive samples, so as to balance positive and negative samples in the data set. There are currently two families of methods: filter-based sample generation, and generation within a restricted sample space. Restricting the sample space limits the diversity of the generated samples, so features cannot be distributed evenly over the sample space. Filter-based methods, because of their coarse feature-extraction granularity and their use of positive samples alone, often cannot build a good sample selector that picks better samples to supplement the data.
In mixed sampling, the relationship between positive and negative sample features is not fully exploited; the undersampling and oversampling methods are simply stacked.
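The filter-based oversampling idea described above (generate candidates at random, then let a selector keep only the good ones) can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the patented method: samples are assumed to be already encoded as numeric vectors, candidates are produced by adding Gaussian noise to existing positives, and the selector is a hypothetical centroid-distance rule standing in for the deep learning sample selector. All function names and parameters (`noise_scale`, `max_tries`, and so on) are invented for the example. Note that this stand-in selector looks at both positive and negative samples, which is the kind of information the text says positive-only filters miss.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def generate_candidate(positives, noise_scale, rng):
    """Randomly pick a positive vector and perturb it with Gaussian noise."""
    base = rng.choice(positives)
    return [x + rng.gauss(0.0, noise_scale) for x in base]


def make_centroid_selector(positives, negatives):
    """Hypothetical stand-in selector (an assumption, not the source's model):
    accept a candidate only if it lies closer to the positive centroid
    than to the negative centroid."""
    pos_c = centroid(positives)
    neg_c = centroid(negatives)

    def accept(vector):
        return math.dist(vector, pos_c) < math.dist(vector, neg_c)

    return accept


def oversample_with_filter(positives, negatives, n_needed,
                           noise_scale=0.3, seed=0, max_tries=10000):
    """Generate random candidates and keep those the selector accepts,
    stopping once n_needed samples are collected or max_tries is reached."""
    rng = random.Random(seed)
    accept = make_centroid_selector(positives, negatives)
    selected = []
    tries = 0
    while len(selected) < n_needed and tries < max_tries:
        candidate = generate_candidate(positives, noise_scale, rng)
        tries += 1
        if accept(candidate):
            selected.append(candidate)
    return selected


positives = [[1.0, 1.0], [1.2, 0.9]]
negatives = [[-1.0, -1.0], [-1.1, -0.8], [-0.9, -1.2]]
new_positives = oversample_with_filter(positives, negatives, n_needed=4)
```

In a real system the `accept` function would be replaced by a trained model, and the accepted vectors would still have to be decoded back into text.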



Examples


Embodiment

[0122] The present invention was evaluated in a large number of experiments on the data set of the Wenyin Internet competition held by AI100. Comparative experiments were carried out against basic sampling methods such as random undersampling and random oversampling, and against improved models based on them. The results confirm that the Chinese text category recognition method for unbalanced data sampling (MUDS) achieves the best performance (experimental results are shown in Table 1).
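For reference, the random undersampling and random oversampling baselines named above can be sketched as follows. This is a minimal illustration in Python, not the procedure used in the experiments; the function names, the `target_size` parameter, and the fixed seed are assumptions made for the example. The last function shows the naive "stacked" form of mixed sampling, in which the two steps run independently of each other.

```python
import random


def random_undersample(negatives, target_size, seed=0):
    """Keep a random subset of the majority (negative) class, without replacement."""
    negatives = list(negatives)
    if target_size >= len(negatives):
        return negatives
    return random.Random(seed).sample(negatives, target_size)


def random_oversample(positives, target_size, seed=0):
    """Pad the minority (positive) class by repeating randomly chosen samples."""
    positives = list(positives)
    rng = random.Random(seed)
    extra = max(0, target_size - len(positives))
    return positives + [rng.choice(positives) for _ in range(extra)]


def naive_mixed_sample(positives, negatives, target_size, seed=0):
    """Apply both baselines independently so each class has target_size samples."""
    return (random_oversample(positives, target_size, seed),
            random_undersample(negatives, target_size, seed))


pos = ["p1", "p2"]
neg = ["n%d" % i for i in range(10)]
balanced_pos, balanced_neg = naive_mixed_sample(pos, neg, target_size=5)
```

Random oversampling only duplicates existing samples and random undersampling may discard informative ones, which is why such baselines are a natural point of comparison for a selective method.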


Abstract

The invention discloses a Chinese text category recognition system and method for unbalanced data sampling. The method comprises the following steps: first, a text encoder encodes the Chinese text into sentence vector representations; second, for the negative sample data, a multi-pass DBSCAN clustering algorithm removes redundant and noisy data, while for the positive sample data, samples are randomly generated and a deep learning sample selector is constructed to filter them and select high-quality samples to supplement the positive samples; finally, a text decoder decodes the selected samples into text data, which together with the processed negative sample data forms a balanced data set for use with a text classification model. The method processes the data with mixed sampling: redundant data are removed from the negative samples, a deep learning sample selector is constructed for the positive samples to filter the generated samples, and the selected high-quality samples are used to train the classifier, which improves text classification performance.
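The negative-sample step can be illustrated with a small sketch. The abstract does not specify how the multi-pass DBSCAN is configured, so everything below is an assumption: each pass runs plain DBSCAN with the next radius from a hypothetical `eps_schedule` and drops the points labelled as noise, and a final greedy pass drops points that lie within a hypothetical `dup_eps` of an already kept point, as a simple stand-in for redundancy removal. The sentence vectors are assumed to be available as numeric tuples.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


def clean_negatives(points, eps_schedule, min_pts, dup_eps):
    """Assumed multi-pass cleaning: drop DBSCAN noise on each pass,
    then greedily drop near-duplicates of already kept points."""
    kept = list(points)
    for eps in eps_schedule:
        labels = dbscan(kept, eps, min_pts)
        kept = [p for p, label in zip(kept, labels) if label != NOISE]
    result = []
    for p in kept:
        if all(math.dist(p, q) > dup_eps for q in result):
            result.append(p)
    return result


negatives = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
             (0.0, 0.001), (5.0, 5.0)]
cleaned = clean_negatives(negatives, eps_schedule=[0.5], min_pts=3, dup_eps=0.01)
```

Here the isolated point `(5.0, 5.0)` is treated as noise and `(0.0, 0.001)` as a redundant near-copy of `(0.0, 0.0)`. The brute-force neighbor search is quadratic, so a real data set would need an indexed implementation.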

Description

【Technical field】
[0001] The invention belongs to the technical field of natural language processing and relates to unbalanced data processing and text classification, in particular to a Chinese text category recognition system and method for unbalanced data sampling.
【Background technique】
[0002] In recent years, the rapid development of science and technology has driven explosive growth of data in every field. Imbalance in the number of samples per category and uneven distribution of features appear in almost every practical, real-world data set, so classifying unbalanced data effectively is of great significance. For example, enterprises can use classified user review information to help different departments make targeted improvements to related products and services. Specifically, real unbalanced data exhibit the following two states: one is that the amount of negative sample data is large, and there are certain noise da...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F40/126, G06F40/289, G06F40/216, G06N3/04, G06N3/08
CPC: G06F16/355, G06F40/126, G06F40/289, G06F40/216, G06N3/08, G06N3/045
Inventor: 饶元, 祁江楠, 贺龙, 贺王卜
Owner: XI AN JIAOTONG UNIV