
Short Text Classification Method Based on CHI and Classification Association Rules Algorithm

A short text classification technology, applied in unstructured text data retrieval, text database clustering/classification, special data processing applications, etc. It addresses the difficulty of determining thresholds, the failure to consider whether associated features tend toward the same category, and the low flexibility of the algorithm and controllability of the program, thereby enhancing controllability.

Active Publication Date: 2019-07-30
GUILIN UNIV OF ELECTRONIC TECH

AI Technical Summary

Problems solved by technology

[0005] 2. After mining the feature items that have a co-occurrence relationship, the traditional method expands the features of the original text directly, without considering the category relationship of the associated features. This introduces noisy feature words and degrades classification performance.
In existing research, the category tendency of a feature is calculated against a manually set reliability threshold, and the frequent word sets are then filtered according to whether their words tend toward the same category. This involves too much manual intervention, the threshold is difficult to determine, and the flexibility of the algorithm and the controllability of the program are low.
[0006] 3. Network data volume has expanded rapidly in recent years, and massive data places high demands on CPU, I/O throughput, and so on. With large volumes of text data, the traditional serial text classification algorithm falls short in computing speed, file storage, and fault tolerance, so it is necessary to study distributed algorithms that can run in a multi-node big data computing mode.
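
To make the multi-node idea in [0006] concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce stages applied to counting how many headlines of each category contain each word. It only imitates the MapReduce programming model in plain Python; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are hypothetical and are not taken from the patent or from the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit ((category, word), 1) for each distinct word in one headline."""
    category, text = record
    return [((category, word), 1) for word in set(text.split())]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one (category, word) key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory record list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical toy records: (category, headline text).
records = [
    ("finance", "bank raises rates"),
    ("finance", "bank shares fall"),
    ("sports", "team wins final"),
]
counts = run_job(records)
print(counts[("finance", "bank")])
```

Because each record is mapped independently and each key is reduced independently, both stages could in principle be spread across nodes, which is the property the paragraph above relies on.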

Examples

Embodiment

[0070] News headline classification method based on CHI and classification association rule algorithm.

[0071] The data set contains news headlines and body texts in 5 categories (entertainment, finance, sports, IT, women), 30,000 texts in total: 20,000 news headlines are training data and 10,000 news headlines are test data. The body text of the 20,000 training items is used as long text for constructing the feature expansion knowledge base.

[0072] Category frequent factor:

[0073] As Figure 6 shows, if a unified minimum support threshold is set for frequent word set mining, the number of frequent word sets varies greatly between categories. In the figure, the unified minimum support threshold is 800, and a total of 1025 frequent word sets are mined from the five categories. The finance category alone accounts for 1022 of them, or 99.7%. The category skew of the frequent word sets is severe...
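
The skew described in [0073] can be reproduced on a toy corpus. The sketch below counts word sets by brute force (it does not implement FP-Growth) and compares one uniform minimum support with per-category values. The per-category thresholds are hand-picked for illustration only; the excerpt does not give the formula by which the category frequent factor allocates them, so none is assumed here. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_word_sets(docs, min_support, max_size=2):
    """Return word sets (size <= max_size) found in at least min_support docs.

    Brute-force counting for illustration; FP-Growth would find the same
    sets far more efficiently on real data.
    """
    counts = Counter()
    for words in docs:
        unique = sorted(set(words))
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {ws for ws, c in counts.items() if c >= min_support}


def mine_by_category(corpus, thresholds):
    """corpus: {category: [word lists]}; thresholds: {category: min_support}."""
    return {cat: frequent_word_sets(docs, thresholds[cat])
            for cat, docs in corpus.items()}


# Hypothetical toy corpus: finance headlines reuse the same few words,
# sports headlines are more varied.
corpus = {
    "finance": [["stock", "market"], ["stock", "market"],
                ["stock", "bank"], ["stock", "market", "bank"]],
    "sports": [["match", "goal"], ["match", "coach"],
               ["team", "goal"], ["league", "final"]],
}

uniform = mine_by_category(corpus, {"finance": 3, "sports": 3})
per_category = mine_by_category(corpus, {"finance": 3, "sports": 2})

print({cat: len(sets) for cat, sets in uniform.items()})
print({cat: len(sets) for cat, sets in per_category.items()})
```

With the uniform threshold, every frequent word set comes from the finance category; lowering the threshold for sports alone lets that category contribute word sets too, without changing what is mined for finance.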

Abstract

The present invention is a short text classification method based on CHI and a classification association rule algorithm. It measures how often frequent word sets occur in texts of different categories, introduces the category frequent factor (LFF), and uses LFF to allocate a reasonable minimum support threshold of the FP-Growth algorithm to each text category, which overcomes the category skew of the frequent word sets mined by the traditional FP-Growth algorithm. Rather than relying on simple word frequency statistics, the method avoids manual parameter setting and experimental determination of the best parameters, and enhances the controllability of the classification system. In addition, a parallel feature expansion short text classification algorithm based on the Hadoop/MapReduce big data computing platform is proposed, with a MapReduce parallel design for the calculation of the category frequent factor and for the feature expansion method. This improves the accuracy and efficiency of short text classification and the controllability of the system.
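
The abstract names CHI without defining it in this excerpt. Assuming it refers to the standard chi-square statistic used for feature selection in text classification, a minimal sketch of the score for one term and one category looks like this. The function name and the sample counts are hypothetical, and how the patent actually applies the score is not shown in the excerpt.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a category from a 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the table is empty, so the score is undefined;
        # returning 0.0 here is a choice made for this sketch.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: a term concentrated in one category scores high,
# a term spread evenly across categories scores zero.
print(chi_square(40, 10, 10, 40))
print(chi_square(25, 25, 25, 25))
```

A higher score means the term's presence depends more strongly on the category, which is why such scores are commonly used to rank candidate features.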

Description

Technical field

[0001] The invention relates to the fields of natural language processing and text mining, in particular to a short text classification method based on CHI and a classification association rule algorithm.

Background technique

[0002] With the development of the Internet, especially social media, text content on the Internet is becoming more and more abundant. In addition to long texts such as blogs and news, short texts such as Weibo posts, emails, and comments have also grown explosively in recent years as Internet users participate more in online topics. Unlike long texts, short texts contain little content, so their features are sparse and their descriptive information is weak, which makes traditional feature extraction, text representation models, and text classification methods perform poorly on them. To solve this problem, the most direct and effective way is to expand the featur...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/35
CPC: G06F16/35
Inventor: 黄文明, 莫阳, 邓珍荣
Owner: GUILIN UNIV OF ELECTRONIC TECH