Text feature extraction method based on categorical distribution probability

A feature extraction technique based on category distribution probability, applied in the field of special data processing applications. It addresses the problems of low classification accuracy, high word-space dimensionality, and heavy computation, improving processing efficiency and classification effect while reducing running time.

Inactive Publication Date: 2013-09-11
EAST CHINA NORMAL UNIV
2 Cites, 36 Cited by

AI Technical Summary

Problems solved by technology

[0004] At present, the use of computer technology to solve text classification problems generally adopts the vector space model, which has the problems of high word space dimension, large amount of calculation, and low classification accuracy.

Examples

Embodiment

[0038] Referring to Figure 2, the effectiveness of the text feature extraction method based on category distribution probability is verified on a text classification task. A set of Chinese texts is selected, and the corpus texts are manually classified according to predefined categories. The classified text set is preprocessed, and feature extraction is then performed on the preprocessed set to obtain the desired number of feature words. The selected feature word set defines the vector space, and each preprocessed text is converted into its vector space model representation using the standard tf-idf weighting. The specified classifier is then trained on the text vectors to obtain the trained classification model.
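
The vector space step described in [0038] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `fit_idf` and `tfidf_vector` are hypothetical, texts are assumed to be already tokenized, and "standard tf-idf" is assumed to mean raw term frequency multiplied by log(N / df).

```python
import math
from collections import Counter


def fit_idf(docs, features):
    """Compute idf = log(N / df) for each feature word (assumed tf-idf variant).

    docs: list of token lists (already preprocessed).
    features: list of selected feature words that define the vector space.
    """
    n_docs = len(docs)
    feature_set = set(features)
    df = Counter()
    for tokens in docs:
        # Count each feature word at most once per document.
        df.update(set(tokens) & feature_set)
    # A feature word that never occurs in the corpus gets idf 0.
    return {w: math.log(n_docs / df[w]) if df[w] else 0.0 for w in features}


def tfidf_vector(tokens, features, idf):
    """Represent one text as a tf-idf vector over the feature words."""
    tf = Counter(tokens)
    return [tf[w] * idf[w] for w in features]


# Toy corpus: two sports texts and two finance texts.
docs = [
    ["ball", "game", "the"],
    ["game", "team", "the"],
    ["stock", "market", "the"],
    ["market", "bank", "the"],
]
features = ["game", "market"]
idf = fit_idf(docs, features)
vector = tfidf_vector(["game", "game", "team"], features, idf)
```

The resulting vectors would then be passed to whichever classifier is specified. The same `features` and `idf` values fitted on the training set are reused when converting a text to be classified.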

[0039] To classify a new text, it is only necessary to convert the text to be classified into the representation of the vector space model on the feature ...

Abstract

The invention discloses a text feature extraction method based on category distribution probability. The method extracts text feature words by estimating how differently each word in the texts to be classified is distributed across categories. Using the per-category word frequency probability of each word, the mean square error of the word's probability distribution over the categories is computed. A specified number of words with the highest mean square error values are extracted to form the final feature set. In practical application, this feature set is used as the feature words of the text classification task to build a vector space model, and a designated classifier is trained to obtain the final classification model, which classifies the texts to be classified. The method measures the category distribution of words by probability statistics and estimates the category value of each word by mean square error, so that text features are selected accurately. For the text classification task, the classification effect is clearly improved on both balanced and unbalanced corpora.
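
The scoring idea in the abstract can be sketched as follows. The abstract does not give the exact formula, so this sketch rests on stated assumptions: a word's "category word frequency probability" is taken as its relative frequency within each category, these values are normalized to sum to 1 across categories, and the mean square error is taken around the mean of that distribution. The names `category_mse_scores` and `select_features` are hypothetical.

```python
from collections import Counter, defaultdict


def category_mse_scores(docs, labels):
    """Score each word by the MSE of its distribution over categories.

    docs: list of token lists; labels: one category label per document.
    ASSUMPTION: the probability of a word in a category is its relative
    frequency in that category, normalized across categories.
    """
    categories = sorted(set(labels))
    k = len(categories)
    word_counts = defaultdict(Counter)  # word -> category -> count
    category_sizes = Counter()          # category -> total tokens
    for tokens, label in zip(docs, labels):
        category_sizes[label] += len(tokens)
        for token in tokens:
            word_counts[token][label] += 1

    scores = {}
    for word, per_category in word_counts.items():
        # Relative frequency per category reduces the effect of unequal category sizes.
        rel = [per_category[c] / category_sizes[c] for c in categories]
        total = sum(rel)
        probs = [r / total for r in rel]
        mean = 1.0 / k  # the probabilities sum to 1
        scores[word] = sum((p - mean) ** 2 for p in probs) / k
    return scores


def select_features(docs, labels, n):
    """Return the n words with the highest scores (ties broken alphabetically)."""
    scores = category_mse_scores(docs, labels)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


docs = [
    ["ball", "game", "the"],
    ["game", "team", "the"],
    ["stock", "market", "the"],
    ["market", "bank", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
scores = category_mse_scores(docs, labels)
top_words = select_features(docs, labels, 6)
```

Under these assumptions, a word spread evenly over all categories (such as "the" in the toy corpus) scores 0, while a word concentrated in one category scores highest, so it is kept as a feature.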

Description

Technical field

[0001] The invention relates to computer text processing technology, in particular to a text feature extraction method based on category distribution probability.

Background

[0002] With the rapid development of the Internet, the number of electronic documents on the network has expanded rapidly. Effectively helping users find, filter and manage this massive text data has become an important topic of natural language processing research. The representation of text and the selection of feature items is a basic problem in text mining and information retrieval: the feature words extracted from a text are quantified to represent the text information. This transforms the text from unstructured original text into structured information that a computer can recognize and process, that is, the text is scientifically abstracted and a mathematical model is established to describe and replace it. It enables the computer to realize the recognition of text t...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 杨燕李强潘云杜泽宇杨河彬倪敏杰
Owner: EAST CHINA NORMAL UNIV