Two-level combined text classification method based on probabilistic subject words

A text classification technique based on subject words, applicable to special-purpose data processing, instruments, and electrical digital data processing, that improves efficiency while keeping good classification performance.

Status: Inactive
Publication Date: 2009-08-26
INST OF AUTOMATION CHINESE ACAD OF SCI
Cites: 4; Cited by: 0

AI Technical Summary

Problems solved by technology

[0004] In the prior art, a single classifier rarely has a clear advantage in both accuracy and efficiency. To make up for this shortcoming, the present invention proposes a combined classification method: a two-level combined text classification method based on probabilistic subject words.


Detailed Description of the Embodiments

[0015] The present invention is described in detail below with reference to the accompanying drawings. The described embodiments are illustrative only and do not limit the present invention.

[0016] The present invention proposes a two-level combined text classification method based on probabilistic subject words. When people classify a text manually, they often need to look at only a few key words in the text to judge its category correctly. Such key words are generally called subject words, and many classified dictionaries include them. However, a strict formal definition of subject words cannot be given. In a corpus-based learning method, a statistical subject word can be defined instead, named a "probabilistic topic word" (Probabilistic Topic Word, PTW). These words are then extracted from the corpus by statistical means. Then use these "st...
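The paragraph above is cut off before it states how probabilistic topic words are selected, so the following is only a minimal sketch under an assumed criterion, not the selection rule of the invention. It assumes a word counts as a PTW for a category when it occurs in at least `min_count` training texts and the fraction of those texts belonging to one category is at least `min_posterior`. The function name `extract_ptw`, both thresholds, and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def extract_ptw(docs, min_count=2, min_posterior=0.8):
    """Return {word: (label, P(label | word))} for words that pass both thresholds.

    docs is a list of (tokens, label) pairs. The selection criterion is an
    assumption made for illustration; the source text does not give it.
    """
    # word -> label -> number of training texts that contain the word
    doc_freq = defaultdict(Counter)
    for tokens, label in docs:
        for word in set(tokens):
            doc_freq[word][label] += 1

    ptw = {}
    for word, per_label in doc_freq.items():
        total = sum(per_label.values())
        label, hits = per_label.most_common(1)[0]
        if total >= min_count and hits / total >= min_posterior:
            ptw[word] = (label, hits / total)
    return ptw

# Toy corpus (hypothetical data).
docs = [
    ("the striker scored a goal".split(), "sports"),
    ("a late goal won the match".split(), "sports"),
    ("the bank raised a key interest rate".split(), "finance"),
    ("interest in the bank shares rose".split(), "finance"),
]
ptw = extract_ptw(docs)
```

On this toy corpus, frequent function words such as "the" and "a" are spread across both categories and fall below `min_posterior`, while words seen only once fall below `min_count`.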

Abstract

The present invention relates to the fields of natural language processing and pattern recognition, and discloses a two-level combined text classification method based on probabilistic subject words. First-level classification: based on the naive Bayesian classification method, the test text is classified using probabilistic subject words as features, subject to a rejection condition. Second-level classification: feature words are extracted with the information gain feature extraction method and used to classify the test texts rejected by the first level. By classifying texts with this hierarchical combination, the invention integrates the characteristics of different classifiers, so that many texts are classified correctly and very quickly at the first level. This substantially improves the efficiency of the text classification system and offers a practical processing method for real applications. The probabilistic subject word is proposed in view of the characteristics of text: under an appropriate rejection condition, probabilistic subject words can complete a large share of the text classification task with a high accuracy rate. Experiments show that, compared with a traditional single classifier, the two-level combination of the present invention substantially reduces time consumption and improves the classification accuracy of the system.
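The two-level scheme described in the abstract can be sketched as a cascade. This is an illustrative sketch, not the implementation of the invention: the rejection condition is assumed to be "no subject word occurs, or the largest first-level posterior is below `threshold`"; the second level is assumed to be naive Bayes as well, trained on words whose information gain is at least `min_gain`; and the subject-word set `ptw_vocab` is supplied by hand. All function names, thresholds, and data are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, vocab):
    """Multinomial naive Bayes with add-one smoothing, restricted to vocab."""
    labels = sorted({y for _, y in docs})
    prior = Counter(y for _, y in docs)
    counts = {y: Counter() for y in labels}
    for tokens, y in docs:
        counts[y].update(t for t in tokens if t in vocab)
    model = {}
    for y in labels:
        total = sum(counts[y].values()) + len(vocab)
        cond = {w: math.log((counts[y][w] + 1) / total) for w in vocab}
        model[y] = (math.log(prior[y] / len(docs)), cond)
    return model

def posterior(model, tokens):
    """Return {label: P(label | tokens)}, or None if no vocabulary word occurs."""
    vocab = next(iter(model.values()))[1]  # every label shares the same vocabulary
    seen = [t for t in tokens if t in vocab]
    if not seen:
        return None
    logp = {y: lp + sum(cond[t] for t in seen) for y, (lp, cond) in model.items()}
    top = max(logp.values())
    z = sum(math.exp(v - top) for v in logp.values())
    return {y: math.exp(v - top) / z for y, v in logp.items()}

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c) if n else 0.0

def info_gain(docs):
    """Information gain of each word's presence or absence with respect to the label."""
    labels = Counter(y for _, y in docs)
    base = entropy(labels.values())
    gains = {}
    for w in {t for tokens, _ in docs for t in tokens}:
        with_w = Counter(y for tokens, y in docs if w in tokens)
        without = labels - with_w
        n_w = sum(with_w.values())
        cond = n_w * entropy(with_w.values()) + (len(docs) - n_w) * entropy(without.values())
        gains[w] = base - cond / len(docs)
    return gains

def classify(tokens, stage1, stage2, threshold=0.75):
    """Return (label, level). Level 1 answers only when it is confident enough."""
    p = posterior(stage1, tokens)
    if p is not None and max(p.values()) >= threshold:
        return max(p, key=p.get), 1
    p = posterior(stage2, tokens)  # text was rejected by level 1
    if p is None:  # no selected feature occurs: fall back to the class priors
        return max(stage2, key=lambda y: stage2[y][0]), 2
    return max(p, key=p.get), 2

# Toy training data and a hand-picked subject-word set (both hypothetical).
train = [
    ("the striker scored a goal".split(), "sports"),
    ("a late goal won the match".split(), "sports"),
    ("the bank raised a key interest rate".split(), "finance"),
    ("interest in the bank shares rose".split(), "finance"),
]
ptw_vocab = {"goal", "bank", "interest"}
min_gain = 0.1
ig_vocab = {w for w, g in info_gain(train).items() if g >= min_gain}

stage1 = train_nb(train, ptw_vocab)
stage2 = train_nb(train, ig_vocab)

confident = classify("the striker missed the goal".split(), stage1, stage2)
no_ptw = classify("the match was late".split(), stage1, stage2)
conflict = classify("the bank scored a late goal".split(), stage1, stage2)
```

The first level only has to score a handful of subject words, which is where the speed-up in such a cascade would come from; texts with no subject word, or with conflicting ones, are passed to the larger second-level feature set.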

Description

Technical field

[0001] The invention relates to the technical fields of pattern recognition and natural language processing, and in particular to a serially combined text classification method based on probabilistic subject words.

Background

[0002] Text categorization is one of the comprehensive applications of natural language processing technologies. Automatic classification of text by computer helps us organize and use today's vast volume of text information. At the same time, text classification methods touch on many basic problems in pattern recognition, such as classifier design and high-dimensional features. Research on text classification technology therefore has both practical value and theoretical significance.

[0003] To measure the quality of a text classification method, two factors are generally considered. One is the correct rate of classification results, which is often t...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 宗成山, 李寿山
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI