Distributed mass short text KNN (K Nearest Neighbor) classification algorithm and distributed mass short text KNN classification system based on information entropy feature weight quantification

A classification algorithm using information entropy technology, applied to distributed KNN classification of massive short texts. It addresses the problem that traditional classification algorithms have not yet been effectively extended to parallel and distributed computing platforms.

Active Publication Date: 2014-07-30
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

However, most traditional algorithms process data in a stand-alone environment and have not been effectively extended to parallel and distributed computing platforms, so they cannot meet the demands of large-scale data processing.



Examples


Embodiment

[0059] As shown in Figure 1, the distributed KNN classification algorithm for massive short texts based on information entropy feature weight quantization comprises the following steps:

[0060] S1. Information entropy is used to measure how certain the distribution of each feature is across the data set. Features with high certainty are assigned high weights and features with low certainty are assigned low weights, yielding a weight quantization method that reflects the class distribution;
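As an illustration of step S1, the following is a minimal sketch of entropy-based feature weighting. The text above does not give the exact formula, so the normalization `1 - H / log(number of classes)` and the function name `entropy_weights` are assumptions made here for illustration, not the patented method.

```python
import math
from collections import Counter, defaultdict


def entropy_weights(docs, labels):
    """Weight each feature by how concentrated it is in one class.

    docs:   list of token lists (one per training document)
    labels: class label of each document
    Assumption: weight = 1 - H(feature) / log(number of classes),
    so a feature seen in a single class gets weight 1.0.
    """
    classes = set(labels)
    per_feature = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for tok in set(tokens):  # count document frequency per class
            per_feature[tok][label] += 1

    # Guard against a single-class data set (log(1) == 0).
    max_h = math.log(len(classes)) if len(classes) > 1 else 1.0
    weights = {}
    for tok, counts in per_feature.items():
        total = sum(counts.values())
        h = -sum((n / total) * math.log(n / total) for n in counts.values())
        weights[tok] = 1.0 - h / max_h
    return weights


docs = [["goal", "match"], ["goal", "team"],
        ["stock", "market"], ["match", "market"]]
labels = ["sports", "sports", "finance", "finance"]
weights = entropy_weights(docs, labels)
```

In this toy data set, `goal` appears only in one class and therefore receives the highest weight, while `match` is split evenly between both classes and receives the lowest.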

[0061] S2. The algorithm is designed with the MapReduce computing framework on the Hadoop distributed computing platform and is organized as a combination of two rounds of MapReduce operations;

[0062] In the first-round Map operation, the training set is evenly divided into multiple sub-training sets, which are distributed to the nodes that perform the operation, and the similarity calculation is carried out against each sub-training set. In the first-round Reduce operation, the similarities calculated by Map are sorted in eac...
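The two-round structure can be simulated on a single machine to make the data flow concrete. The sketch below is not Hadoop code: the function names, the cosine similarity measure, keeping only the local top k in the first Reduce, and the majority vote in the second round are all assumptions made for illustration, since the description above is truncated before the second round is specified.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def split_evenly(items, n_parts):
    """Divide the training set into n_parts sub-training sets."""
    return [items[i::n_parts] for i in range(n_parts)]


def round1_map(sub_train, query):
    """Round 1 Map: similarity of the query to each local training vector."""
    return [(cosine(query, vec), label) for vec, label in sub_train]


def round1_reduce(scored, k):
    """Round 1 Reduce: sort locally; keeping the top k is an assumption."""
    return heapq.nlargest(k, scored, key=lambda s: s[0])


def round2(local_topk_lists, k):
    """Round 2 (assumed): merge local candidates, then majority vote."""
    merged = [s for part in local_topk_lists for s in part]
    top = heapq.nlargest(k, merged, key=lambda s: s[0])
    return Counter(label for _, label in top).most_common(1)[0][0]


def classify(train, query, k=3, n_nodes=2):
    parts = split_evenly(train, n_nodes)
    local = [round1_reduce(round1_map(p, query), k) for p in parts]
    return round2(local, k)


train = [
    ({"goal": 1.0, "team": 1.0}, "sports"),
    ({"goal": 1.0}, "sports"),
    ({"stock": 1.0}, "finance"),
    ({"market": 1.0, "stock": 1.0}, "finance"),
]
predicted = classify(train, {"goal": 1.0}, k=3, n_nodes=2)
```

Because each simulated node forwards its own k best candidates, the merged candidate list always contains the global k nearest neighbors, so the split does not change the result compared with a single-node KNN.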


Abstract

The invention discloses a distributed KNN (K Nearest Neighbor) classification algorithm and system for massive short texts based on information entropy feature weight quantification. The algorithm mainly comprises text preprocessing, feature weighting measured by information entropy, a document vector representation method, and a KNN classification algorithm that uses indexing and caching techniques. The whole algorithm is implemented on the MapReduce computing framework. By using the information entropy of features in the training set, the invention effectively measures how much each word in a text contributes to classification. Because it is built on the MapReduce computing framework, the algorithm can use large-scale clusters to classify massive short texts efficiently. The classification algorithm provided by the invention has the advantages of high accuracy, high operating efficiency, and strong scalability.

Description

Technical field

[0001] The invention belongs to the technical field of text classification and relates to a distributed KNN classification algorithm and system for massive short texts based on information entropy feature weight quantization.

Background

[0002] With the rapid development of the Internet, the number of electronic documents and resources generated by users grows daily, and text classification has become a key technology for processing and organizing massive document data. In particular, as large amounts of short text data such as microblogs and user comments appear on the Internet, research on short texts has attracted increasing attention. A classification algorithm generally includes text preprocessing, document representation, and classifier selection and training. Specifically, text preprocessing mainly segments the text into words and removes stop words that are useless for semantic expres...
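The preprocessing step mentioned in the background (word segmentation followed by stop-word removal) can be sketched as follows. This is a hedged illustration only: the `preprocess` function and the stop-word list are hypothetical, and the regular-expression tokenizer is a stand-in that works for English text, whereas Chinese short texts would need a dedicated word segmenter.

```python
import re

# Hypothetical stop-word list, chosen only for this example.
STOP_WORDS = {"the", "a", "is", "of", "and", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Split text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


tokens = preprocess("The match is a great win!")
```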


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35
Inventor: 蔡毅, 蔡志威, 王涛
Owner: SOUTH CHINA UNIV OF TECH