Text categorization method based on Xgboost categorization algorithm

A classification algorithm and text classification technology, applied in computing, special data processing applications, instruments, etc. It addresses low classification performance, large memory consumption, and low classification accuracy, and achieves simple preprocessing, reduced memory use, and reduced dimensionality.

Active Publication Date: 2017-06-09

SUN YAT SEN UNIV

3 Cites, 30 Cited by

AI Technical Summary

Problems solved by technology

[0015] The present invention provides a text classification method based on the Xgboost classification algorithm to overcome the low classification performance, large memory consumption, and low classification accuracy of prior-art methods.

Examples

Embodiment 1

[0033] This embodiment comprises 3 specific cases, each classifying a text corpus with different characteristics: the public English corpus WebKB, with samples lacking any content excluded, and two Chinese corpora. The first Chinese corpus is a public long-text corpus, the Fudan University text classification corpus. The second is a highly imbalanced corpus of Chinese short texts, news comments, divided into two categories (normal and advertising) with a positive-negative ratio of 2742 / 42416 = 0.065.
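The imbalance figure quoted for the news-comment corpus can be checked with a few lines of Python. This is only an arithmetic sketch: the variable names are hypothetical, and the inverse weight shown at the end is an assumption about how such imbalance is often handled in gradient boosting (XGBoost exposes a `scale_pos_weight` parameter for this), not something the excerpt says the inventors did.

```python
# Sample counts as quoted in paragraph [0033] for the news-comment corpus.
n_positive = 2742    # smaller class
n_negative = 42416   # larger class

# Positive-negative ratio as reported in the text.
ratio = n_positive / n_negative

# Assumption (not stated in the patent excerpt): a common remedy for class
# imbalance is to weight the rarer class by the inverse ratio.
inverse_weight = n_negative / n_positive

print(f"positive/negative ratio: {ratio:.3f}")
print(f"inverse weight:          {inverse_weight:.2f}")
```

The ratio rounds to the 0.065 given in the text, which means roughly one positive sample for every fifteen negative ones.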

[0034] Table 1 Summary of text classification datasets

[0035]

[0036] As shown in Figure 1, the text classification method of the present invention, which is based on the Xgboost classification algorithm and uses Labeled-LDA for feature extraction, is implemented in the following steps:

[0037] Step 1: Text Preprocessing

[0038] Prepare a batch of pre-classified text sets in advance, such as the 3 cases above, and randomly divide each into a training set and a prediction set (news commenta...
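The random division into a training set and a prediction set might look like the following minimal sketch. The helper name `random_split`, the 80/20 fraction, the seed, and the toy corpus are all assumptions made for illustration; the excerpt is truncated before it states the actual proportions.

```python
import random

def random_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into (train, prediction) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy labeled corpus of (text, label) pairs, purely illustrative.
corpus = [(f"document {i}", i % 2) for i in range(10)]
train_set, prediction_set = random_split(corpus, train_fraction=0.8, seed=42)

print(len(train_set), len(prediction_set))
```

For a corpus as imbalanced as the news comments, a purely random split can leave very few minority samples in the smaller part, so a stratified split is a reasonable alternative to consider.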

Abstract

Provided is a text categorization method based on the Xgboost categorization algorithm. The method computes feature values by extracting tagged words through Labeled-LDA and then categorizes the text with Xgboost. Compared with methods that use a common categorization algorithm with the usual vector space model as the feature space, the method reduces memory consumption. Chinese text contains several million distinct words, so the dimensionality is high; if words are used as features, memory consumption is massive and a single machine may be unable to process the data. By contrast, there are no more than ten thousand common Chinese characters, and only two to three thousand frequently used ones, so the dimensionality is greatly reduced. In addition, Xgboost supports input in dictionary form rather than array form. The invention also provides a novel feature selection algorithm, Labeled-LDA, which is supervised and captures latent semantics; using it for feature selection exploits both the semantic information that LDA mines from massive corpora and the class information contained in the text. Furthermore, preprocessing is simple, careful feature engineering is unnecessary, and the accuracy and performance of categorization are improved by Xgboost, a strong ensemble learning algorithm that supports distributed operation.
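The memory argument in the abstract can be illustrated with a small sketch that represents each text as a sparse dictionary of character counts instead of a dense word vector. This is not the patented method: it leaves out the Labeled-LDA step entirely, and the function names and sample texts are hypothetical. The last helper renders samples in the sparse LibSVM text format, which XGBoost can read; treating that as the "dictionary mode" mentioned in the abstract is an assumption.

```python
from collections import Counter

def char_features(text):
    """Map a text to a sparse {character: count} dictionary, skipping whitespace."""
    return dict(Counter(ch for ch in text if not ch.isspace()))

def build_vocabulary(feature_dicts):
    """Assign each distinct character a stable integer column index."""
    chars = sorted({ch for feats in feature_dicts for ch in feats})
    return {ch: idx for idx, ch in enumerate(chars)}

def to_libsvm_line(label, feats, vocab):
    """Render one sample as 'label index:value ...', listing only non-zero entries."""
    pairs = sorted((vocab[ch], count) for ch, count in feats.items() if ch in vocab)
    return " ".join([str(label)] + [f"{idx}:{count}" for idx, count in pairs])

# Two toy comments (illustrative only): label 0 = normal, 1 = advertising.
texts = ["今天天气好", "免费领取优惠"]
labels = [0, 1]

dicts = [char_features(t) for t in texts]
vocab = build_vocabulary(dicts)
lines = [to_libsvm_line(y, d, vocab) for y, d in zip(labels, dicts)]

for line in lines:
    print(line)
```

Each sample stores only the characters it actually contains, so memory grows with text length rather than with vocabulary size, and the vocabulary itself is bounded by the number of characters in use.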

Description

Technical field

[0001] The present invention relates to the field of text classification, more specifically, to a text classification method based on Xgboost classification algorithm.

Background technique

[0002] Text classification methods have been widely used in search engines, personalized recommendation systems, public opinion monitoring and other fields, and are an important part of efficient management and accurate positioning of massive information.

[0003] The common framework of text classification methods is based on machine learning classification algorithms, which include data preprocessing, followed by feature extraction, feature selection, feature classification and other steps.

[0004] Feature extraction is to use a unified method and model to identify the text. This method or model can represent the characteristics of the text and can be easily converted into a mathematical language, and then converted into a mathematical model that can be processed by a ...

Application Information

IPC(8): G06F17/30

CPC: G06F16/353

Inventor: 庞宇明, 任江涛

Owner: SUN YAT SEN UNIV
