Short text classification method based on CHI and classified association rule algorithm

A short text classification method applied in the fields of unstructured text data retrieval, text database clustering/classification, and computing. It addresses the problems of excessive manual intervention, hard-to-determine thresholds, and low algorithm flexibility and program controllability.

Active Publication Date: 2016-12-07
GUILIN UNIV OF ELECTRONIC TECH
2 Cites, 30 Cited by

AI Technical Summary

Problems solved by technology

[0005] 2. After mining the feature items that have a co-occurrence relationship, the traditional approach expands the features of the original text directly, without considering the category relationship of the associated features. This introduces noisy feature words and degrades classification performance. In existing research, the category tendency of a feature is calculated against a manually set confidence threshold, and the frequent word sets are then filtered according to whether their items share the same category tendency. This involves too much manual intervention, the threshold is difficult to determine, and the flexibility of the algorithm and the controllability of the program are low.
[0006] 3. Network data volume has expanded rapidly in recent years, and massive data places high demands on CPU, I/O throughput, and so on. With large volumes of text data, traditional serial text classification algorithms fall short in computing speed, file storage, and fault tolerance, so it is necessary to study distributed algorithms that can run in a multi-node big data computing mode.

Examples

Embodiment

[0070] News headline classification method based on CHI and classification association rule algorithm.

[0071] The data set contains news headlines and body texts in 5 categories (entertainment, finance, sports, IT, women), 30,000 items in total: 20,000 news headlines are training data and 10,000 news headlines are test data. The body texts of the 20,000 training items are used as long texts for constructing the feature expansion knowledge base.

[0072] Category frequent factor:

[0073] As Figure 6 shows, if a unified minimum support threshold is set for frequent word set mining, the number of frequent word sets varies greatly between categories. In the figure, the unified minimum support threshold is 800. A total of 1025 frequent word sets were mined from the five categories, and the finance category alone accounts for 1022 of them, or 99.7%. The frequent word set category skew problem is therefore serious...
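
The skew described in [0073] can be reproduced on a toy scale. The sketch below is an illustration only: it counts single frequent words rather than running FP-Growth, the corpus is invented, and the per-category relative threshold is an assumed stand-in, not the category frequent factor formula of the invention, which this excerpt does not give. The names `frequent_words`, `corpus`, `unified`, and `relative` are hypothetical.

```python
from collections import Counter


def frequent_words(docs, min_support):
    """Return the words whose document frequency is at least min_support."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word once per document
    return {w for w, n in df.items() if n >= min_support}


# Invented toy corpus: finance headlines repeat a few words heavily,
# sports headlines are fewer and more varied.
corpus = {
    "finance": [["stock", "market"]] * 8 + [["stock", "bank"]] * 4,
    "sports": [["football", "match"]] * 3
    + [["tennis", "final"]] * 2
    + [["golf", "open"]] * 1,
}

# One absolute threshold for every category: the larger, more repetitive
# category supplies all of the frequent words.
unified = {cat: frequent_words(docs, 4) for cat, docs in corpus.items()}

# Assumed alternative: a threshold proportional to each category's size.
relative = {cat: frequent_words(docs, 0.4 * len(docs)) for cat, docs in corpus.items()}

print(unified)
print(relative)
```

With the unified threshold, every frequent word comes from the finance category and sports contributes none. With a threshold scaled to category size, each category contributes two words.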

Abstract

The invention provides a short text classification method based on CHI and a classification association rule algorithm. The frequency of frequent word sets is measured for each text category and a category frequent factor (LFF) is introduced. The LFF is used to allocate the minimum support threshold of each text category reasonably, which avoids the category skew of frequent word sets mined with the traditional FP-Growth algorithm. Category tendency is also judged for the frequent word sets: the CHI (chi-square) test is used to measure the relevance between feature words and categories, instead of a measure based on simple word frequency statistics. This removes the step of finding the best parameters through manual settings and experiments and improves the controllability of the classification system. In addition, the invention provides a parallel feature expansion short text classification algorithm based on the Hadoop/MapReduce big data computing platform. The category frequent factor calculation and the feature expansion method are given a MapReduce parallel design, which improves short text classification accuracy and efficiency as well as the controllability of the system.
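
As a rough illustration of measuring word/category relevance with CHI, the sketch below uses the standard chi-square statistic on a 2x2 contingency table, as commonly used for text feature selection. The excerpt does not state the exact formula or tie-breaking rule used by the invention, so the positive-association check, the function names `chi_square` and `category_tendency`, and the toy data are all assumptions.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for one word/category pair.

    a: documents in the category that contain the word
    b: documents outside the category that contain the word
    c: documents in the category that lack the word
    d: documents outside the category that lack the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def category_tendency(word, docs_by_category):
    """Return (category, score) with the highest positive CHI association.

    Assumption: only categories where the word is over-represented
    (a*d > b*c) are considered; returns (None, 0.0) if there is none.
    """
    total_docs = sum(len(docs) for docs in docs_by_category.values())
    total_with = sum(
        1 for docs in docs_by_category.values() for doc in docs if word in doc
    )
    best, best_score = None, 0.0
    for cat, docs in docs_by_category.items():
        a = sum(1 for doc in docs if word in doc)
        b = total_with - a
        c = len(docs) - a
        d = total_docs - len(docs) - b
        score = chi_square(a, b, c, d)
        if a * d > b * c and score > best_score:
            best, best_score = cat, score
    return best, best_score


# Invented toy data: each document is a set of words.
toy_docs = {
    "finance": [{"stock", "market"}, {"stock", "bank"}, {"market", "news"}],
    "sports": [{"match", "final"}, {"match", "news"}, {"team", "final"}],
}

print(category_tendency("stock", toy_docs))  # ('finance', 3.0)
print(category_tendency("news", toy_docs))   # (None, 0.0)
```

Here "stock" appears only in finance documents and receives a clear category tendency, while "news" is spread evenly across both categories and receives none, so no confidence threshold has to be tuned by hand to separate the two cases.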

Description

Technical field

[0001] The invention relates to the fields of natural language processing and text mining, in particular to a short text classification method based on CHI and classification association rule algorithms.

Background art

[0002] With the development of the Internet, especially social media, text content on the Internet is becoming more and more abundant. In addition to long texts such as blogs and news, short texts such as Weibo posts, emails, and comments have also grown explosively in recent years as Internet users participate more and more in online topics. Unlike long texts, short texts contain little content, so their features are sparse and their descriptive information is weak. As a result, traditional feature extraction, text representation models, and text classification methods perform poorly on short texts. To solve this problem, the most direct and effective way is to expand the featur...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 黄文明, 莫阳, 邓珍荣
Owner: GUILIN UNIV OF ELECTRONIC TECH