Text feature extraction method based on categorical distribution probability

A technique based on distribution probability and feature extraction, applied in special data processing applications, instruments, calculations, etc. It addresses the problems of low classification accuracy, high word-space dimension, and a large amount of calculation, improving processing efficiency and classification effect while reducing running time.

Status: Inactive
Publication Date: 2013-09-11

EAST CHINA NORMAL UNIV

2 Cites, 36 Cited by

Problems solved by technology

[0004] At present, computer-based text classification generally adopts the vector space model, which suffers from a high-dimensional word space, a large amount of calculation, and low classification accuracy.


Embodiment

[0038] Referring to Figure 2, the effectiveness of the text feature extraction method based on category distribution probability is verified on a text classification task. A set of Chinese texts is selected, and the corpus texts are manually classified according to predefined categories. The classified text set is preprocessed, and feature extraction is then performed on the preprocessed text set to obtain the desired number of feature words. The selected feature word set defines the vector space, and each preprocessed text is converted into its vector space model representation using the standard TF-IDF weight calculation method. The specified classifier is then trained on the text vectors to obtain the trained classification model.

[0039] To classify a new text, it is only necessary to convert the text into the representation of the vector space model on the feature ...
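The flow in paragraphs [0038] and [0039] can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: texts are assumed to be already tokenized, the feature word list is passed in as given, TF-IDF is taken as raw term count times log(N/df), and a cosine nearest-centroid rule stands in for the unspecified "specified classifier". All function names are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs, features):
    """Inverse document frequency log(N / df) for each feature word."""
    n = len(docs)
    feature_set = set(features)
    df = Counter(w for tokens in docs for w in set(tokens) if w in feature_set)
    return {w: math.log(n / df[w]) if df[w] else 0.0 for w in features}


def tfidf_vector(tokens, features, idf):
    """Represent one tokenized text in the space defined by the feature words."""
    tf = Counter(tokens)
    return [tf[w] * idf[w] for w in features]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(vectors, labels):
    """Stand-in classifier (an assumption): one mean vector per category."""
    counts = Counter(labels)
    sums = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def classify(tokens, features, idf, centroids):
    """Vectorize a new text on the same features and pick the closest category."""
    vec = tfidf_vector(tokens, features, idf)
    return max(sorted(centroids), key=lambda c: cosine(vec, centroids[c]))


# Toy usage with made-up, pre-tokenized texts and a hand-picked feature list.
docs = [["ball", "goal", "the"], ["goal", "team", "the"],
        ["stock", "market", "the"], ["market", "price", "the"]]
labels = ["sports", "sports", "finance", "finance"]
features = ["goal", "market"]

idf = fit_idf(docs, features)
centroids = train_centroids([tfidf_vector(d, features, idf) for d in docs], labels)
print(classify(["goal", "win"], features, idf, centroids))
```

The new text is vectorized with the same feature list and IDF values that were fitted on the training set, which is the point of paragraph [0039]: no re-selection of features is needed at classification time.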


Abstract

The invention discloses a text feature extraction method based on category distribution probability. The method extracts text feature words by estimating how differently each word in the texts to be classified is distributed across categories. Using each word's per-category word frequency probability, it computes the mean square error of the word's probability distribution over the different categories. A specified number of words with the highest mean square error values are extracted to form the final feature set. In practical application, this feature set serves as the feature words of the text classification task and is used to build a vector space model. A designated classifier is then trained to obtain the final classification model, which classifies the text to be classified. The method measures the category distribution of words by probability statistics and estimates the category value of each word by mean square error, so as to select text features accurately. For text classification tasks, it markedly improves the classification effect on both balanced and unbalanced corpora.
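The scoring step described in the abstract can be illustrated with a short Python sketch. The exact formulas are not given in this excerpt, so the following are assumptions: the category word frequency probability of a word is taken as its count in a category divided by the total token count of that category, and the "mean square error" is taken as the mean squared deviation of those per-category probabilities from their average. Function names are hypothetical.

```python
from collections import Counter


def category_word_probs(docs, labels):
    """Return {category: {word: relative frequency of the word in that category}}."""
    counts = {}
    for tokens, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokens)
    probs = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        probs[label] = {w: n / total for w, n in counter.items()}
    return probs


def mse_scores(probs):
    """Mean squared deviation of each word's per-category probabilities (assumed form)."""
    categories = list(probs)
    vocab = set().union(*(probs[c].keys() for c in categories))
    scores = {}
    for w in vocab:
        p = [probs[c].get(w, 0.0) for c in categories]
        mean = sum(p) / len(p)
        scores[w] = sum((x - mean) ** 2 for x in p) / len(p)
    return scores


def select_features(docs, labels, k):
    """Keep the k words with the highest score; ties are broken alphabetically."""
    scores = mse_scores(category_word_probs(docs, labels))
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy usage with made-up, pre-tokenized texts.
docs = [["ball", "goal", "the"], ["goal", "team", "the"],
        ["stock", "market", "the"], ["market", "price", "the"]]
labels = ["sports", "sports", "finance", "finance"]
print(select_features(docs, labels, 2))  # ['goal', 'market']
```

In this toy corpus "the" is equally frequent in both categories and scores zero, while "goal" and "market" each appear in only one category and rank highest, which matches the intuition that words with uneven category distributions are the more useful features.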

Description

Technical field

[0001] The invention relates to computer text processing technology, in particular to a text feature extraction method based on category distribution probability.

Background technique

[0002] With the rapid development of the Internet, the number of electronic documents on the network has grown rapidly. Effectively helping users find, filter, and manage this massive text data has become an important topic in natural language processing research. Text representation and the selection of feature items are basic problems in text mining and information retrieval: the feature words extracted from a text are quantified to represent its information, transforming unstructured original text into structured information that a computer can recognize and process. That is, the text is scientifically abstracted and a mathematical model is established to describe and stand in for it. It enables the computer to realize the recognition of text t...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

Inventor: 杨燕李强潘云杜泽宇杨河彬倪敏杰

Owner: EAST CHINA NORMAL UNIV
