A Text Classification Method and System Based on Rough Set and KNN

A text classification technique based on rough sets, applied to text database clustering/classification and unstructured text data retrieval. It addresses the high computational cost of KNN classification and the failure of earlier methods to consider processing speed on massive data and the text recall rate.

Active Publication Date: 2020-09-25
BEIJING INFORMATION SCI & TECH UNIV

AI Technical Summary

Problems solved by technology

The main disadvantage of the KNN algorithm is that classification requires computing the distance between the text to be classified and every training text. The time complexity is therefore proportional to both the number and the dimensionality of the training texts, so computing similarity against a large training set is very expensive.
For example, the closest prior patent, "K-Nearest Neighbor Text Classification Method Based on Amendment" (Patent Application No. 201010601777.5), uses the KNN method, but its system does not consider processing speed on massive data or the recall rate of text.
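
To make the cost concrete, here is a minimal sketch of brute-force KNN text classification in Python. It is not the patent's implementation: the bag-of-words representation, the cosine similarity measure, the function names `cosine` and `knn_classify`, and the toy training texts are all assumptions chosen for illustration. The point is that every query is compared against every training text, so the work grows with both the number of training texts and the number of distinct terms.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k most similar training texts.

    `training` is a list of (term_counts, label) pairs. Every pair is
    scored, which is the full scan described in the text.
    """
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
training = [
    (Counter("stock market price rises".split()), "finance"),
    (Counter("bank raises interest rate".split()), "finance"),
    (Counter("team wins football match".split()), "sports"),
    (Counter("player scores goal in match".split()), "sports"),
]
query = Counter("football player scores".split())
print(knn_classify(query, training, k=3))
```

Reducing the number of attributes (terms) per text, or the number of training texts that must be scored, attacks the two factors in this cost directly.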

Examples

Embodiment Construction

[0016] In a KNN (K Nearest Neighbor) text classifier, the high dimensionality of the text data to be classified and the large number of computations lead to high time and space costs. To address this, the present invention preprocesses the data to be classified with a rough set attribute reduction algorithm. The NP-hard problem in rough set attribute reduction is then addressed by processing methods and algorithms based on attribute order. To reduce the computational complexity of the algorithm, the work proceeds from two directions, the algorithm itself and computational techniques: a decreasing calculation method is proposed for computing the positive region, a key step of the rough set discernibility matrix, to reduce the workload of computing equivalence classes; and a lookup-table method for removing stop words, the introduction of position information into the attribute order, and an inverted-index retrieval method further reduce the running time ...
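
The following Python sketch illustrates the two rough set ideas named in this paragraph: the positive region of a decision table and an attribute reduction driven by a fixed attribute order. It is an interpretation, not the patent's algorithm. The names `positive_region` and `ordered_reduct`, the dict-based decision table, and the backward pruning pass are assumptions. The "decreasing" step is modelled here as dropping rows that are already in the positive region, which is valid because adding attributes only refines equivalence classes, so a consistent row stays consistent.

```python
from collections import defaultdict


def positive_region(rows, attrs, decision):
    """Indices of rows whose equivalence class under `attrs` has one decision value."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    pos = set()
    for members in blocks.values():
        if len({rows[i][decision] for i in members}) == 1:
            pos.update(members)
    return pos


def ordered_reduct(rows, attr_order, decision):
    """Pick attributes in the given order until the full positive region is reached."""
    target = len(positive_region(rows, attr_order, decision))
    remaining = list(range(len(rows)))  # rows not yet known to be consistent
    chosen, covered = [], 0
    for a in attr_order:
        if covered == target:
            break
        chosen = chosen + [a]
        # Decreasing step: only the still-inconsistent rows are partitioned again.
        blocks = defaultdict(list)
        for i in remaining:
            blocks[tuple(rows[i][x] for x in chosen)].append(i)
        remaining = []
        for members in blocks.values():
            if len({rows[i][decision] for i in members}) == 1:
                covered += len(members)
            else:
                remaining.extend(members)
    # Backward pass (an assumption): drop attributes that turned out to be redundant.
    for a in reversed(chosen[:]):
        trial = [x for x in chosen if x != a]
        if len(positive_region(rows, trial, decision)) == target:
            chosen = trial
    return chosen


# Hypothetical decision table: term-presence attributes and a class label.
rows = [
    {"goal": 1, "bank": 0, "the": 1, "label": "sports"},
    {"goal": 1, "bank": 0, "the": 0, "label": "sports"},
    {"goal": 0, "bank": 1, "the": 1, "label": "finance"},
    {"goal": 0, "bank": 1, "the": 0, "label": "finance"},
]
print(ordered_reduct(rows, ["goal", "bank", "the"], "label"))
```

Fixing the attribute order turns the search for a reduct into a single pass over the attributes instead of a search over all subsets, which is how an order-based method sidesteps the NP-hard minimal-reduct problem at the price of an order-dependent result.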

Abstract

The invention discloses a text classification method and system based on rough sets and KNN (K Nearest Neighbor), intended for classifying various types of text information. The method comprises the following steps: preprocessing the data to be classified with a rough set attribute reduction algorithm; addressing the NP-hard problem in rough set attribute reduction through a processing method and algorithm based on attribute order; reducing the workload of computing equivalence classes with a decreasing calculation method in the positive region computation, a key step of the rough set discernibility matrix; further reducing the running time and space cost of the system with a lookup-table method for removing stop words, the introduction of position information into the attribute order, and an inverted-index search method; and finally constructing a classifier on this basis, for which the accuracy, recall rate and F value of the classification are relatively good and the classification speed is greatly improved. The disclosed system and method address the high time and space costs that a KNN text classifier incurs because of the high data dimensionality and the large number of computations for the texts to be classified.
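
Two of the speed-ups named in the abstract, stop-word removal by table lookup and retrieval through an inverted index, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented system: the stop-word list, whitespace tokenization, and the names `tokenize`, `build_inverted_index` and `candidates` are all hypothetical, and the use of position information in the attribute order is not modelled.

```python
from collections import defaultdict

# Illustrative stop-word table; a hash-based set gives constant-time lookup.
STOP_WORDS = frozenset({"the", "a", "of", "in", "and", "is"})


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words via table lookup."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def candidates(index, query):
    """Ids of documents sharing at least one non-stop-word term with the query."""
    ids = set()
    for term in tokenize(query):
        ids |= index.get(term, set())
    return ids


# Hypothetical training texts, for illustration only.
docs = [
    "The price of the stock rises",
    "The team wins the match",
    "Goal in the final match",
]
index = build_inverted_index(docs)
print(candidates(index, "the match result"))
```

Only the candidate documents need to be scored by the KNN step; training texts that share no term with the query have zero similarity under a term-overlap measure such as cosine similarity and can be skipped.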

Description

Technical field

[0001] The invention belongs to the field of intelligent information processing. It relates to a method and implementation system for efficiently processing and classifying unstructured text information, and in particular to a text classification method and system combining rough sets and KNN (K Nearest Neighbor).

Background technology

[0002] With the rapid development of the Internet, online information resources have grown rapidly. Most of the online information that people encounter is text, expressed in the form of electronic documents. Faced with such a huge and rapidly expanding ocean of information, how to organize and manage it effectively, and how to mine the information that users need quickly, accurately and comprehensively, is a major challenge for the current information science and technology fields. As a result, data mining technology has become a research hotspot and cutting-edge technology in...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35
Inventor: 朱敏玲
Owner: BEIJING INFORMATION SCI & TECH UNIV