Text categorization method based on Xgboost categorization algorithm

A text classification technology based on a classification algorithm, applied in computing, special-purpose data processing applications, instruments, etc. It addresses the problems of low classification performance, large memory consumption, and low classification accuracy, and achieves the effects of simple preprocessing, reduced memory use, and reduced dimensionality.

Active Publication Date: 2017-06-09
SUN YAT SEN UNIV
Cites: 3. Cited by: 30.

AI Technical Summary

Problems solved by technology

[0015] The present invention provides a text classification method based on the Xgboost classification algorithm to solve the defects of



Examples


Embodiment 1

[0033] This embodiment includes three specific cases, each classifying a text corpus with different characteristics: the public English corpus WebKB, with samples that have no content excluded; a public Chinese long-text corpus, the Fudan University text classification corpus; and a highly imbalanced Chinese short-text corpus of news comments, divided into two categories, normal and advertising, with a positive-to-negative ratio of 2742 / 42416 = 0.065.
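The imbalance figure quoted for the news-comment corpus can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the text does not say which of the two categories is treated as the positive class, so the counts are simply labeled as the smaller and larger class.

```python
# Class counts quoted in the text for the news-comment corpus.
minority = 2742    # smaller class (which category this is, is not stated)
majority = 42416   # larger class

# Ratio as quoted in the text: smaller class divided by larger class.
ratio = minority / majority

# Share of the smaller class among all samples, a related but different number.
minority_share = minority / (minority + majority)

print(f"ratio = {ratio:.3f}")
print(f"minority share = {minority_share:.3f}")
```

The ratio rounds to the 0.065 given in the text; the share of the smaller class among all samples is slightly lower because the denominator includes both classes.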

[0034] Table 1 Summary of text classification datasets

[0035]

[0036] As shown in Figure 1, the specific implementation steps of the text classification method of the present invention, which is based on the Xgboost classification algorithm and uses Labeled-LDA for feature extraction, are as follows:

[0037] Step 1: Text Preprocessing

[0038] Prepare a batch of already-classified (labeled) text sets in advance, such as those of the 3 cases, and randomly divide each into a training set and a prediction set (news commenta...
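A random division into a training set and a prediction set could look like the following sketch. Several things here are assumptions and not taken from the patent: the function name `stratified_split`, the 20% prediction fraction, the fixed seed, and the choice to split per label so that a rare class such as advertising stays represented in both parts. The sample texts are made up.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.2, seed=0):
    """Randomly split (text, label) pairs, splitting each label separately.

    Assumption: per-label splitting and the 0.2 fraction are illustrative
    choices, not values stated in the source.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)  # random order within this label
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Made-up toy data with a 9:1 imbalance between the two categories.
samples = [(f"normal comment {i}", "normal") for i in range(90)]
samples += [(f"ad {i}", "advertising") for i in range(10)]

train, test = stratified_split(samples)
print(len(train), len(test))
```

With a plain random split over all samples, a very small class can end up almost absent from one part, which is why the sketch splits each label on its own.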



Abstract

Provided is a text categorization method based on the Xgboost categorization algorithm. In this method, feature values are computed from label words extracted by Labeled-LDA, and text categorization is then performed with the Xgboost algorithm. Compared with methods that pair a common categorization algorithm with a common vector space model as the feature space, the method reduces memory consumption. The number of distinct words in Chinese text runs to several million, so the dimensionality is high; if words are used as features, memory consumption is massive, and a single machine may be unable to process the data. By contrast, there are no more than about ten thousand common Chinese characters, and only two to three thousand frequently used ones, so the dimensionality is greatly reduced. In addition, Xgboost supports input in dictionary form rather than array form. The invention also provides a novel feature selection algorithm, Labeled-LDA, which is supervised and captures latent semantics. Using Labeled-LDA for feature selection makes it possible both to mine the semantic information of a large corpus with LDA and to exploit the class information contained in the text. Furthermore, preprocessing is simple, features do not need to be carefully hand-extracted, and categorization accuracy and performance are improved by Xgboost, a strong ensemble learning algorithm that supports distributed operation.
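The memory argument can be illustrated with a small sketch of character-level features stored sparsely. The assumptions are considerable: raw character counts stand in for the Labeled-LDA feature values, which the abstract does not spell out; "dictionary form" is interpreted here as sparse key-value input; and the helper names `char_features`, `build_vocab` and `to_libsvm_line` are hypothetical. The output uses the LIBSVM-style `label index:value` text layout, which XGBoost can read.

```python
from collections import Counter

def char_features(text):
    """Count each non-whitespace character; returns a sparse {char: count} dict."""
    return dict(Counter(ch for ch in text if not ch.isspace()))

def build_vocab(texts):
    """Assign each distinct character an integer index in order of first appearance."""
    vocab = {}
    for text in texts:
        for ch in text:
            if not ch.isspace() and ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def to_libsvm_line(label, features, vocab):
    """Render one sample as 'label index:value ...', listing only nonzero features."""
    pairs = sorted((vocab[ch], n) for ch, n in features.items() if ch in vocab)
    return " ".join([str(label)] + [f"{i}:{n}" for i, n in pairs])

# Made-up toy comments: 1 = advertising, 0 = normal.
texts = ["买买买 优惠", "今天天气好"]
labels = [1, 0]

vocab = build_vocab(texts)
lines = [to_libsvm_line(y, char_features(t), vocab) for t, y in zip(texts, labels)]
for line in lines:
    print(line)
```

Each sample stores only the characters it actually contains, so its size grows with the text length and not with the vocabulary size, and the vocabulary itself is bounded by the few thousand characters in common use.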

Description

Technical field

[0001] The present invention relates to the field of text classification, more specifically, to a text classification method based on Xgboost classification algorithm.

Background technique

[0002] Text classification methods have been widely used in search engines, personalized recommendation systems, public opinion monitoring and other fields, and are an important part of efficient management and accurate positioning of massive information.

[0003] The common framework of text classification methods is based on machine learning classification algorithms, which include data preprocessing, followed by feature extraction, feature selection, feature classification and other steps.

[0004] Feature extraction is to use a unified method and model to identify the text. This method or model can represent the characteristics of the text and can be easily converted into a mathematical language, and then converted into a mathematical model that can be processed by a ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/353
Inventor: 庞宇明, 任江涛
Owner SUN YAT SEN UNIV