KNN text classification method based on improved K-Medoids

A text classification technology, applied in special data processing applications and electrical digital data processing, that addresses problems of conventional KNN such as the huge amount of similarity calculation, the sensitivity of classification performance to the training samples, and the resulting inapplicability to many practical tasks.

Status: Inactive
Publication Date: 2016-03-23
BEIJING UNIV OF TECH
Cites: 5; Cited by: 24

AI Technical Summary

Problems solved by technology

[0004] KNN is one of the classic classification methods. It is simple to implement and robust, but it also has shortcomings that make it unsuitable for many practical applications.
These shortcomings fall into two main groups. First, classifying a text requires computing its similarity to every training sample, so the classification process is time-consuming and classification efficiency is low.
Second, classification performance is easily affected by the training samples. When the training data is severely unevenly distributed across classes, the performance of the classifier may degrade seriously or even become extremely poor.
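The first shortcoming can be made concrete with a minimal brute-force KNN sketch. This is only an illustration under stated assumptions: cosine similarity, majority voting, the toy vectors, and the names `cosine` and `knn_predict` are all hypothetical choices, not details taken from the patent. The point is that each query is compared against every training vector.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k most similar training vectors.

    `train` is a list of (vector, label) pairs; every call scans all of them.
    """
    scored = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two classes, three features.
train = [
    ([1.0, 0.0, 0.0], "sports"),
    ([0.9, 0.1, 0.0], "sports"),
    ([0.0, 1.0, 0.2], "finance"),
    ([0.1, 0.9, 0.0], "finance"),
]
queries = [[0.8, 0.2, 0.0], [0.0, 0.7, 0.1]]

predictions = [knn_predict(train, q, k=3) for q in queries]
# One similarity computation per (training sample, query) pair.
similarity_calls = len(train) * len(queries)
```

The number of similarity computations grows with the product of training-set size and test-set size, which is why reducing the training set is a natural way to speed up classification.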



Examples


Embodiment Construction

[0038] The present invention is realized by adopting the following technical means:

[0039] A KNN text classification method based on improved K-Medoids. First, the training text set and the test text set are preprocessed, including word segmentation, stop word removal, and DF (document frequency) feature selection, and both the training texts and the test texts are represented as vectors. Then the training texts are clipped with the improved K-Medoids method to obtain a new training text set S_new. Finally, a representative degree function is defined and introduced into the category attribute function of the original KNN algorithm for KNN classification.
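The preprocessing stage in [0039] can be illustrated with a small sketch of DF feature selection followed by vector representation. It assumes the texts are already segmented into tokens with stop words removed. The `min_df` threshold, the raw term-frequency weighting, and the function names are hypothetical choices for illustration; the text above does not specify them.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df

def select_features(docs, min_df):
    """Keep terms whose document frequency is at least `min_df` (assumed threshold)."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)

def to_vector(tokens, vocabulary):
    """Term-frequency vector over the selected vocabulary (assumed weighting)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Toy segmented documents (hypothetical).
docs = [
    ["stock", "market", "rise"],
    ["stock", "fall", "market", "market"],
    ["team", "win", "match"],
    ["team", "lose", "match", "stock"],
]

vocabulary = select_features(docs, min_df=2)
vectors = [to_vector(tokens, vocabulary) for tokens in docs]
```

Terms that appear in only one document are dropped, so every training and test text is mapped into the same low-dimensional vector space before clipping and classification.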

[0040] The above-mentioned improved KNN text classification method comprises the following steps:

[0041] Step 1, download a publicly released Chinese corpus from the Internet to serve as the training text set and the test text set;

[0042] Step 2, using the word segmentation software ICTCLAS to perform word segmentation and stop word removal prep...



Abstract

The invention provides a KNN (K-Nearest-Neighbor) text classification method based on improved K-Medoids and relates to the field of computer text data processing. The method comprises the following steps. First, a training text set and a test text set are preprocessed, where the preprocessing comprises word segmentation, stop word removal, DF feature selection, and vector representation, so as to obtain a training text vector space and a test text vector space. Second, the training samples are clipped on the basis of an improved K-Medoids method: K-Medoids is optimized in terms of initial center point selection and the search strategy for replacing center points, and the optimized method is applied to training sample clipping to obtain a new training text space. Finally, KNN classification is carried out: a representative degree function is defined and applied to the class attribute function of KNN to obtain the final result. Experimental results show that, compared with the conventional KNN method and a KNN method based on standard K-Medoids, the proposed method achieves higher classification accuracy and classification efficiency.
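The idea of clipping training samples with K-Medoids can be sketched as follows. This is a minimal illustration, not the patented method: it uses plain alternating K-Medoids with the first k points as initial medoids, whereas the abstract describes an improved initial center selection and replacement search strategy that are not detailed here. Keeping only the medoids of each class as the reduced training set is one possible reading of "clipping" and is an assumption, as are the names `k_medoids` and `clip_training_set`.

```python
import math

def distance(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_medoids(points, k, max_iter=20):
    """Plain alternating K-Medoids; returns sorted indices of the medoids.

    Initial medoids are simply the first k points (an assumption made for
    determinism, not the improved initialisation described in the abstract).
    """
    medoids = list(range(k))
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: distance(p, points[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the lowest
        # total distance to the other members (keep the old medoid if empty).
        new_medoids = [
            min(members or [m],
                key=lambda c: sum(distance(points[c], points[j]) for j in members))
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

def clip_training_set(samples, k_per_class):
    """Keep only the medoids of each class as the reduced training set (assumed rule)."""
    reduced = []
    for label in sorted({lab for _, lab in samples}):
        vectors = [vec for vec, lab in samples if lab == label]
        for idx in k_medoids(vectors, min(k_per_class, len(vectors))):
            reduced.append((vectors[idx], label))
    return reduced

# Toy labelled vectors (hypothetical): class "a" has two groups, class "b" one.
samples = [
    ([0.0, 0.0], "a"), ([0.1, 0.0], "a"), ([0.0, 0.1], "a"),
    ([1.0, 1.0], "a"), ([1.1, 1.0], "a"), ([1.0, 1.1], "a"),
    ([5.0, 5.0], "b"), ([5.1, 5.0], "b"), ([5.0, 5.1], "b"),
]
reduced = clip_training_set(samples, k_per_class=2)
```

In this toy run, nine training samples shrink to four, so a KNN classifier built on `reduced` needs fewer similarity computations per query. Clustering per class also keeps at least one representative for every class, which is one way to limit the effect of uneven class sizes.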

Description

Technical field

[0001] The invention relates to the field of computer text data processing, in particular to a K-Nearest-Neighbor (KNN) text classification method based on improved K-Medoids.

Background art

[0002] With the development of the Internet, the Internet of Things, and cloud computing, data is growing exponentially, leading us into the era of big data. The International Data Corporation (IDC) of the United States has pointed out that the data on the Internet is increasing at a rate of 50% every year, and that more than 90% of the data in the world has been generated in recent years. At present, the amount of global data has reached the ZB level, and this large volume of data contains great potential value.

[0003] In today's era of big data, it is very important to mine the potential value of data. As a technology for discovering the potential value of data, data mining has attracted great attention. Text data accounts f...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06K9/62
CPC: G06F16/355, G06F18/241
Inventor: 汪友生, 樊存佳, 王信
Owner BEIJING UNIV OF TECH