Chinese short text classification method based on characteristic extension

A classification method and short-text technology, applied in special data processing applications, instruments, and electrical digital data processing. It addresses the problems that short-text features are discrete, that short texts are brief, and that conventional methods cannot achieve good classification results on them, with the effect of improving accuracy and recall.

Inactive Publication Date: 2013-03-06
北京洛克威尔科技有限公司

AI Technical Summary

Problems solved by technology

[0005] Because short texts are brief and their features are discrete, traditional text classification methods applied directly to short text corpora cannot match the classification results they achieve on long text corpora.




Embodiment Construction

[0039] Embodiments of the present invention are now described in conjunction with the accompanying drawings.

[0040] As shown in Figure 1, the present invention includes five main steps: establishing a background knowledge base, expanding the short texts in the training set, establishing a classification model, expanding the short texts to be classified, and generating classification results.

[0041] Step (1) Establish the background knowledge base: from the long text corpus, use an improved Apriori algorithm to mine two-tuples (pairs) of feature words that co-occur and share the same category tendency, and build the background knowledge base from these pairs. The specific steps are:

[0042] Step ① Segment the long texts in the long text corpus into words. For each long text, retain only nouns, time words, place words, locality (direction) words, verbs, adjectives, distinguishing words, status words and strings, so as to obtain the feature word set of the long text corpus;

[0043] Step ② C...
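
Step ① and the overall goal of step (1) can be illustrated with a minimal sketch. It is not the patent's improved Apriori algorithm, whose details are not visible in this excerpt; it is a plain frequent-pair count that uses the standard Apriori pruning idea (a pair can only be frequent if both of its words are frequent). Several things are assumptions made for illustration: the input is already segmented and tagged, the tag letters follow an ICTCLAS-style tag set, the names `feature_words` and `mine_pairs` and the thresholds `min_support` and `min_tendency` are hypothetical, and "same category tendency" is read as "both words appear mostly in the same category".

```python
from collections import Counter, defaultdict
from itertools import combinations

# Assumed ICTCLAS-style tags for the nine retained classes: noun, time,
# place, locality, verb, adjective, distinguishing, status, string.
KEPT_TAGS = {"n", "t", "s", "f", "v", "a", "b", "z", "x"}


def feature_words(tagged_tokens):
    """Keep only words whose tag is in KEPT_TAGS (input is pre-segmented)."""
    return {word for word, tag in tagged_tokens if tag in KEPT_TAGS}


def mine_pairs(docs, min_support=2, min_tendency=0.6):
    """Mine co-occurring feature-word pairs that lean toward one category.

    docs: list of (tagged_tokens, category) for the labelled long texts.
    min_support and min_tendency are hypothetical thresholds.
    """
    word_count = Counter()                 # documents containing each word
    word_cat_count = defaultdict(Counter)  # word -> category -> documents
    doc_words = []
    for tagged_tokens, category in docs:
        words = feature_words(tagged_tokens)
        doc_words.append(words)
        for w in words:
            word_count[w] += 1
            word_cat_count[w][category] += 1

    # Apriori pruning: a frequent pair must consist of frequent words.
    frequent = {w for w, c in word_count.items() if c >= min_support}

    pair_count = Counter()
    for words in doc_words:
        for pair in combinations(sorted(words & frequent), 2):
            pair_count[pair] += 1

    def tendency(word):
        """Dominant category of a word, or None if no category dominates."""
        category, hits = word_cat_count[word].most_common(1)[0]
        return category if hits / word_count[word] >= min_tendency else None

    knowledge_base = {}
    for (a, b), count in pair_count.items():
        if count < min_support:
            continue
        cat_a, cat_b = tendency(a), tendency(b)
        if cat_a is not None and cat_a == cat_b:
            knowledge_base[(a, b)] = cat_a
    return knowledge_base


# Toy labelled corpus; the particle tagged "u" is dropped by the filter.
docs = [
    ([("球队", "n"), ("比赛", "v"), ("的", "u")], "sports"),
    ([("球队", "n"), ("比赛", "v"), ("昨天", "t")], "sports"),
    ([("股票", "n"), ("上涨", "v")], "finance"),
    ([("股票", "n"), ("上涨", "v"), ("比赛", "v")], "finance"),
]
kb = mine_pairs(docs)
```

In this toy corpus the pairs that co-occur only once, or whose words lean toward different categories, are discarded; only pairs meeting both constraints enter the knowledge base.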


Abstract

The invention provides a Chinese short text classification method based on feature extension. The method comprises the following steps: (1) a background knowledge base is established: two-tuples of feature words that meet certain constraint conditions are mined from a long text corpus with category labels to form the background knowledge base; (2) the short texts in the training set are extended: extension words are added to each short text in the training set according to an extension rule, based on the two-tuples in the background knowledge base; (3) a classification model is built: a support vector machine (SVM) classification model is established from the extended short text training set; (4) the short text to be classified is extended: extension words are added to the short text to be classified according to an extension rule, based on the two-tuples in the background knowledge base and the feature space of the classification model; and (5) a classification result is generated from the classification model and the extended short text. Because the features of the short text are enriched by means of the long text corpus, the accuracy and recall of short text classification are improved.
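
Steps (2) and (4) of the abstract can be illustrated with a minimal sketch. The abstract only says "a certain extension rule", so the rule used here is an assumption: if one word of a mined two-tuple occurs in the short text, the other word is added. The function name `expand`, its parameters, and the toy English words (standing in for segmented Chinese words) are hypothetical. The optional `feature_space` argument reflects the abstract's statement that texts to be classified are extended with regard to the feature space of the classification model.

```python
def expand(words, knowledge_base, feature_space=None):
    """Append extension words to a segmented short text.

    Assumed rule (hypothetical): if one member of a mined pair occurs in
    the original text, add the other member. When feature_space is given,
    as for texts to be classified, only words in that set are added.
    """
    present = set(words)            # only original words trigger extension
    expanded = list(words)
    for a, b in sorted(knowledge_base):   # sorted for a reproducible order
        for source, candidate in ((a, b), (b, a)):
            if source not in present or candidate in expanded:
                continue
            if feature_space is not None and candidate not in feature_space:
                continue
            expanded.append(candidate)
    return expanded


# Toy two-tuples standing in for the background knowledge base.
pairs = {("basketball", "league"), ("market", "stock")}

# Step (2): extend a training short text without restriction.
train_text = expand(["basketball", "tonight"], pairs)

# Step (4): extend a text to be classified, restricted to the vocabulary
# seen in the extended training set (a stand-in for the model's feature space).
feature_space = set(train_text)
new_text = expand(["stock", "league"], pairs, feature_space)
```

Restricting step (4) to the feature space keeps the extension from adding words the trained model has no weight for; in this toy run "market" is skipped for that reason.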

Description

Technical field

[0001] The invention relates to the technical field of text classification systems, in particular to a method for classifying Chinese short texts based on feature expansion.

Background

[0002] According to statistics, about 80% of electronic information data exists in the form of unstructured text files. On the Internet, text is not only the most common form of data storage; searches for other data such as video, audio, and pictures also rely on the text data associated with them.

[0003] Text classification is a key technology for processing and organizing massive text data. It can effectively reduce information clutter and helps users accurately locate and distribute the information they need. Traditional text classification systems mainly use classification methods such as KNN and SVM, which achieve good classification results on long texts.

[0004] With the continuous developmen...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 欧阳元新, 罗建辉, 刘文琦, 熊璋
Owner: 北京洛克威尔科技有限公司