Text classification method

A text classification technology, applicable to special data processing applications, instruments, and electronic digital data processing. It addresses data imbalance (data skew), under which traditional methods cannot achieve ideal text classification results. Its effect is to smooth the data and weaken the impact of the skew.

Status: Inactive
Publication Date: 2009-11-25
UNIV OF SCI & TECH OF CHINA
Cites: 0. Cited by: 65.

AI Technical Summary

Problems solved by technology

[0003] However, practical classification applications often encounter data skew, also known as data imbalance or category imbalance. It is one of the important factors affecting classification performance and poses a challenge to traditional classification methods.
Most classification algorithms were designed for uniformly distributed data, so under data skew traditional classification methods alone cannot achieve ideal text classification results.
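
To make the effect of data skew concrete, the following minimal sketch measures the imbalance of a label set and shows how a classifier that always predicts the majority class can reach high accuracy while never recognising the minority class. The label names, the 95/5 split, and the helper functions are illustrative assumptions and do not come from the patent.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def per_class_recall(true_labels, predicted_labels):
    """Fraction of each class's documents that were labelled correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical skewed label set: 95 "sports" documents, 5 "finance" documents.
labels = ["sports"] * 95 + ["finance"] * 5

# A degenerate classifier that always predicts the majority class.
predictions = ["sports"] * len(labels)

accuracy = sum(t == p for t, p in zip(labels, predictions)) / len(labels)
recall = per_class_recall(labels, predictions)

print("imbalance ratio:", imbalance_ratio(labels))
print("accuracy:", accuracy)
print("recall per class:", recall)
```

Overall accuracy hides the failure here, which is why per-class measures are usually reported when categories are imbalanced.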



Examples


Embodiment Construction

[0024] The following describes in detail the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, but not to be construed as a limitation of the present invention.

[0025] Figure 1 is a flow chart of the text classification method according to an embodiment of the present invention. As shown in the figure, the initial training text set is first divided by category into multiple subsets, each containing texts of the same category, and a corresponding probability topic model is extracted from each subset (step 102). The initial training text set here may exhibit data skew or class imbalance. The text category refers to...
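
Step 102 can be sketched roughly as follows. The patent text shown here does not specify which probability topic model is used or how it is fitted, so this sketch swaps in a much simpler stand-in: a per-category unigram word distribution, which behaves like a topic model with a single topic. The function names and the toy training set are hypothetical.

```python
from collections import Counter, defaultdict


def split_by_category(training_set):
    """Group (text, category) pairs into one subset of texts per category."""
    subsets = defaultdict(list)
    for text, category in training_set:
        subsets[category].append(text)
    return dict(subsets)


def fit_unigram_model(texts):
    """Simplified stand-in for topic-model extraction: relative word frequencies."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical, deliberately skewed training set (3 sports texts, 1 finance text).
training_set = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the defence", "sports"),
    ("the bank raised interest rates", "finance"),
]

subsets = split_by_category(training_set)
models = {category: fit_unigram_model(texts) for category, texts in subsets.items()}

for category, texts in subsets.items():
    print(category, "has", len(texts), "texts and", len(models[category]), "distinct words")
```

A real topic model would replace `fit_unigram_model` with a learner that captures several latent topics per category; the per-category split stays the same.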


Abstract

A text classification method comprises the following steps: dividing the initial training text set by category into a plurality of subsets, each containing texts of the same category, and extracting a corresponding probability topic model from each subset; generating new texts with the corresponding probability topic model to balance the categories of the subsets; constructing a classifier from the balanced training text set corresponding to the plurality of subsets; and classifying text with the classifier. The invention can improve the classification results of the text classification method under data skew.
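
The generate, balance, and classify steps of the abstract can be illustrated with the minimal sketch below. It is not the patented implementation: a per-category unigram model stands in for the probability topic model, a multinomial naive Bayes classifier with add-one smoothing stands in for the unspecified classifier, and the choice to top every category up to the size of the largest one is an assumption. All names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def word_counts(texts):
    """Word frequencies of a subset (unigram stand-in for a topic model)."""
    return Counter(word for text in texts for word in text.split())


def generate_text(counts, length, rng):
    """Sample a synthetic text word by word from the unigram distribution."""
    words = list(counts)
    weights = [counts[word] for word in words]
    return " ".join(rng.choices(words, weights=weights, k=length))


def balance(subsets, length=6, seed=0):
    """Top up every category with synthetic texts to the largest category's size."""
    rng = random.Random(seed)
    target = max(len(texts) for texts in subsets.values())
    balanced = {}
    for category, texts in subsets.items():
        counts = word_counts(texts)
        synthetic = [generate_text(counts, length, rng)
                     for _ in range(target - len(texts))]
        balanced[category] = texts + synthetic
    return balanced


def train_naive_bayes(balanced):
    """Multinomial naive Bayes with add-one smoothing over the shared vocabulary."""
    vocab = {w for texts in balanced.values() for t in texts for w in t.split()}
    n_docs = sum(len(texts) for texts in balanced.values())
    model = {}
    for category, texts in balanced.items():
        counts = word_counts(texts)
        total = sum(counts.values())
        log_prior = math.log(len(texts) / n_docs)
        log_like = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
        model[category] = (log_prior, log_like)
    return model


def classify(model, text):
    """Return the category with the highest log posterior; unknown words are skipped."""
    def score(category):
        log_prior, log_like = model[category]
        return log_prior + sum(log_like[w] for w in text.split() if w in log_like)
    return max(model, key=score)


# Hypothetical skewed subsets: four sports texts, one finance text.
subsets = {
    "sports": ["team won match", "striker scored goal",
               "coach praised defence", "team lost final"],
    "finance": ["bank raised rates"],
}

balanced = balance(subsets)
model = train_naive_bayes(balanced)
print({category: len(texts) for category, texts in balanced.items()})
print(classify(model, "bank rates"))
```

Because the synthetic texts are drawn from the minority category's own word distribution, balancing equalises the class priors without introducing vocabulary from other categories.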

Description

Technical field

[0001] The invention relates to data preprocessing technology, in particular to a text classification method.

Background technique

[0002] With the rapid development of the Internet, electronic texts such as web pages, e-mails, databases, and digital libraries have grown exponentially. How to process and classify these texts effectively is a very important topic. Text classification refers to constructing a classification model, that is, a classifier, from existing data. The classifier assigns a category to each document in the test document collection according to a pre-defined classification system, so that users can browse documents conveniently and can narrow the search scope to make document search easier. Automatic text classification uses a large number of texts with class labels to train classification criteria or model parameters, and then uses the training results to identify texts of unknown categories.

[0003] However, in p...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 陈恩红, 林洋港, 马海平, 曹欢欢
Owner: UNIV OF SCI & TECH OF CHINA