LDA-based text classification method

A text classification technology applied in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc. It offers ease of update and maintenance, high usability of results, and broad adaptability.

Active Publication Date: 2017-06-13
NINGBO UNIV

AI Technical Summary

Problems solved by technology

However, with the development of the Internet, the Internet is now filled with large numbers of informational texts in various forms, such as news, blogs, and meeting minutes. Such texts contain academic information to varying degrees and often include the latest academic research.



Examples

Embodiment Construction

[0040] Specific embodiments of the present invention will be described in detail below.

[0041] An LDA-based text classification method, as shown in Figure 1, uses a Bayesian probability model as the text classification model. A set of feature words that best reflects the characteristics of the text to be classified is extracted and used as the input to the text classification model. The original feature word set is the leading portion of the original word set after sorting by feature weight. The text classification model calculates the probability that the feature word set belongs to each of the A predetermined categories, and the category with the largest probability is taken as the category of the text. Following common subject classification practice, all subjects can be divided into 75 subject categories, so the number of categories A is 75. Use the LDA topic model to assist the text clas...
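The feature selection and Bayesian classification step described in [0041] can be sketched as follows. This is only a minimal illustration, not the patented implementation: the names `top_feature_words` and `NaiveBayes`, the multinomial word model, and the Laplace smoothing constant `alpha` are all assumptions, and the source does not say how the feature weights are computed, so they are passed in as a plain dictionary.

```python
import math
from collections import Counter


def top_feature_words(words, weights, keep):
    """Sort the distinct words by descending weight and keep the leading `keep` words.

    `weights` is a hypothetical word -> weight mapping; the weighting scheme
    itself is not specified in the source text.
    """
    distinct = sorted(set(words), key=lambda w: (-weights.get(w, 0.0), w))
    return distinct[:keep]


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing (an assumed model choice)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()   # number of training documents per category
        self.word_counts = {}         # category -> Counter of word occurrences
        self.vocab = set()

    def fit(self, docs, labels):
        for words, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)
        return self

    def log_scores(self, feature_words):
        """Return the unnormalized log posterior of each category."""
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in feature_words:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[w] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, feature_words):
        """Take the category with the largest probability as the result."""
        scores = self.log_scores(feature_words)
        return max(scores, key=scores.get)


# Toy two-category corpus; the patent itself uses A = 75 subject categories.
docs = [["neural", "network", "training"], ["network", "gradient"],
        ["protein", "cell"], ["cell", "gene", "protein"]]
labels = ["cs", "cs", "bio", "bio"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["protein", "gene"]))
```

Working in log space avoids numeric underflow when many feature words are multiplied together; the argmax is unchanged because the logarithm is monotonic.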


Abstract

The invention provides an LDA-based text classification method. The method comprises the following steps: extracting a feature word set and inputting it into a text classification model to calculate the probability that a text belongs to each of A predetermined categories, and taking the category with the maximum probability as the category of the text; training an LDA topic model in advance on a training corpus with a set number of topics K, so as to obtain K topic-associated word sets; verifying the text classification model in advance on a category-labeled verification corpus, so as to obtain the classification accuracy for each of the A categories; when classifying with the text classification model, outputting the result directly if the classification accuracy of the category obtained by the model reaches a set threshold; otherwise, calculating the weights of the K topics for the text using the LDA topic model, selecting the topic with the maximum weight, forming an expanded feature word set from the first Y words among that topic's associated words, and classifying again with the text classification model. The method adapts well to different scenarios and yields highly usable results.
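The control flow in the abstract (classify, check the per-category accuracy against a threshold, and fall back to topic-based expansion) can be sketched as below. This is a hedged illustration under stated assumptions: the function name `classify_with_lda_fallback` and all parameter names are hypothetical, the classifier and the LDA topic weighting are passed in as plain callables rather than tied to any library, and "expanded feature word set" is interpreted here as the original feature words plus the first Y associated words of the best topic, which the source does not spell out.

```python
def classify_with_lda_fallback(feature_words, classify, class_accuracy,
                               threshold, topic_weights, topic_words, y):
    """Return (category, feature words actually used).

    classify:       callable, feature words -> category (the text classification model)
    class_accuracy: dict, category -> accuracy measured on the verification corpus
    topic_weights:  callable, feature words -> list of K topic weights (from the LDA model)
    topic_words:    list of K lists; topic_words[k] holds topic k's associated words
    y:              number of associated words to take from the best topic
    """
    label = classify(feature_words)
    # Trust the first result when this category was classified reliably in verification.
    if class_accuracy.get(label, 0.0) >= threshold:
        return label, list(feature_words)

    # Otherwise pick the topic with the maximum weight for this text ...
    weights = topic_weights(feature_words)
    best_topic = max(range(len(weights)), key=weights.__getitem__)
    # ... and expand the feature words with its first Y associated words (assumed union).
    extra = [w for w in topic_words[best_topic][:y] if w not in feature_words]
    expanded = list(feature_words) + extra
    return classify(expanded), expanded


# Toy stand-ins for the trained models, for illustration only.
def demo_classify(words):
    return "bio" if "protein" in words else "cs"

def demo_topic_weights(words):
    return [0.2, 0.8]

demo_topics = [["network", "gradient"], ["protein", "gene", "cell"]]
demo_accuracy = {"cs": 0.6, "bio": 0.95}

print(classify_with_lda_fallback(["cell"], demo_classify, demo_accuracy, 0.8,
                                 demo_topic_weights, demo_topics, 2))
```

Passing the models in as callables keeps the sketch independent of any particular naive Bayes or LDA implementation.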

Description

Technical field

[0001] The invention relates to an LDA-based text classification method.

Background

[0002] Text classification is a core technology in information retrieval and data mining. The main algorithms include Bayesian, K-nearest neighbor, neural network, and SVM. Among them, the Bayesian algorithm assumes that features are mutually independent when classifying text, which greatly simplifies training and classification. It therefore runs fast and is easy to implement, has become widely used in text classification, and has attracted the attention of many scholars. One proposed approach is a naive Bayesian text classification algorithm based on expectation maximization (EM), which improves the utilization of unlabeled corpora. Others combine the naive Bayesian text classification algorithm with the SVM algorithm to improve classification accuracy. How...
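The independence assumption mentioned in [0002] can be written out as the standard naive Bayes decision rule. The notation is an assumption introduced here for illustration: $w_1, \dots, w_n$ are the feature words of a text, $c_a$ is the $a$-th of the $A$ categories, and $\hat{c}$ is the predicted category.

```latex
\begin{aligned}
\hat{c} &= \arg\max_{a \in \{1,\dots,A\}} P(c_a \mid w_1,\dots,w_n) \\
        &= \arg\max_{a \in \{1,\dots,A\}} \frac{P(c_a)\, P(w_1,\dots,w_n \mid c_a)}{P(w_1,\dots,w_n)} \\
        &= \arg\max_{a \in \{1,\dots,A\}} P(c_a) \prod_{i=1}^{n} P(w_i \mid c_a)
\end{aligned}
```

The last step uses the independence assumption $P(w_1,\dots,w_n \mid c_a) = \prod_{i=1}^{n} P(w_i \mid c_a)$ and drops the denominator, which is the same for every category. This factorization is why training and classification reduce to simple per-word counts.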

Application Information

IPC(8): G06F17/30; G06K9/62
CPC: G06F16/35; G06F18/24155
Inventor: 刘柏嵩, 高元, 王洋洋, 尹丽玲, 费晨杰
Owner: NINGBO UNIV