Two-stage combined text classification method based on probabilistic subject words

A text classification and subject word technology, applied in the fields of special data processing applications, instruments, and electrical digital data processing, that improves efficiency and achieves a good classification effect.

Inactive Publication Date: 2007-10-24
INST OF AUTOMATION CHINESE ACAD OF SCI
0 Cites, 48 Cited by

AI Technical Summary

Problems solved by technology

[0004] To address the difficulty that a single classifier in the prior art rarely has clear advantages in both accuracy and efficiency, the purpose of the present invent...


Embodiment Construction

[0015] The present invention is described in detail below in conjunction with the accompanying drawings. The described embodiments are intended only to illustrate the present invention, not to limit it.

[0016] The present invention proposes a two-stage combined text classification method based on probabilistic subject words. When people classify a text manually, they often need to observe only a few key words in the text to judge correctly which category it belongs to. These key words are generally called subject words, and many classification dictionaries include them. However, it is impossible to give a strict formal definition of subject words. In corpus-based learning, a statistical subject word can be defined instead, named the "probabilistic topic word" (Probabilistic Topic Word, PTW). Such words are extracted from the corpus by statistical means. Then use these "st...


Abstract

The invention relates to the technical fields of natural language processing and pattern recognition, and discloses a two-stage combined text classification method based on probabilistic subject words. The first stage is based on the Bayes classification method and classifies the test text using probabilistic subject words together with a rejection condition. The second stage uses a traditional feature extraction method to extract feature words and classifies the test texts that the first stage rejected. By combining the characteristics of different classifiers, the method classifies texts quickly in the first stage, which improves classification efficiency and gives the text classification system better overall handling. The invention also proposes probabilistic subject words as a text feature; under a suitable rejection condition, these words classify texts effectively. Compared with a traditional single classifier, the invention reduces time consumption and improves classification accuracy.
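
The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the subject word list is supplied by hand, the rejection condition (a minimum number of subject word hits plus a minimum log-score margin between the two best classes) is a hypothetical choice, and the second stage uses a Naive Bayes model over the full vocabulary as a stand-in for the "traditional feature extraction method". The names `NaiveBayes`, `classify_two_stage`, `min_hits`, and `min_margin` are invented for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over a fixed vocabulary."""

    def __init__(self, vocab):
        self.vocab = set(vocab)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.n_docs = len(labels)
        self.counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            # Only words inside this model's vocabulary are counted.
            self.counts[label].update(w for w in words if w in self.vocab)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def log_scores(self, words):
        """Return (log score per class, number of in-vocabulary words seen)."""
        kept = [w for w in words if w in self.vocab]
        scores = {}
        for c in self.classes:
            score = math.log(self.prior[c] / self.n_docs)
            denom = self.totals[c] + len(self.vocab)
            for w in kept:
                score += math.log((self.counts[c][w] + 1) / denom)
            scores[c] = score
        return scores, len(kept)


def classify_two_stage(words, stage1, stage2, min_hits=1, min_margin=1.0):
    """Return (label, stage_used). Stage 1 rejects when evidence is weak."""
    scores, hits = stage1.log_scores(words)
    ranked = sorted(scores.values(), reverse=True)
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else float("inf")
    # Hypothetical rejection condition: too few subject words, or the two
    # best classes are too close to call.
    if hits >= min_hits and margin >= min_margin:
        return max(scores, key=scores.get), 1
    # Rejected texts fall through to the slower full-vocabulary classifier.
    scores2, _ = stage2.log_scores(words)
    return max(scores2, key=scores2.get), 2


# Toy training data (invented for illustration only).
train_docs = [
    ["goal", "match", "team", "win"],
    ["team", "coach", "goal", "season"],
    ["stock", "market", "price", "bank"],
    ["bank", "loan", "market", "rate"],
]
train_labels = ["sports", "sports", "finance", "finance"]

subject_words = {"goal", "team", "stock", "bank"}  # hand-picked assumption
full_vocab = {w for doc in train_docs for w in doc}

stage1 = NaiveBayes(subject_words).fit(train_docs, train_labels)
stage2 = NaiveBayes(full_vocab).fit(train_docs, train_labels)

for text in (["goal", "team", "today"], ["coach", "season", "win"]):
    print(text, classify_two_stage(text, stage1, stage2))
```

The first stage only looks at a small vocabulary, so it is cheap; the cost of the second stage is paid only for texts that the first stage rejects. How strict the rejection thresholds should be is a tuning question that this sketch does not settle.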

Description

Technical field

[0001] The invention relates to the technical fields of pattern recognition and natural language processing, and in particular to a serially combined text classification method based on probabilistic subject words.

Background

[0002] Text categorization is one of the comprehensive applications of natural language processing technologies. Automatic classification of text by computer helps organize and make use of the vast amount of text information now available. At the same time, text classification methods involve many basic problems in pattern recognition, such as classifier design and high-dimensional features. Research on text classification technology therefore has both practical value and theoretical significance.

[0003] Two factors are generally considered when measuring the quality of a text classification method. One is the accuracy of the classification results, which is often t...

Application Information

IPC IPC(8): G06F17/30, G06F17/27
Inventor 宗成山, 李寿山
Owner INST OF AUTOMATION CHINESE ACAD OF SCI