Improvement-based KNN (K Nearest Neighbor) text classification method

A text classification technology, applied in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc.

Active Publication Date: 2015-03-11
CHINA TECHENERGY +1

AI Technical Summary

Problems solved by technology

[0004] Existing text classification algorithms cannot achieve both accuracy and speed. To address this problem, the present invention proposes an improved text classifica




Embodiment Construction

[0056] The invention proposes an improved KNN-based text classification algorithm, which is applied in the review of software requirements and design documents (especially software reliability review). The algorithm first preprocesses the training text and builds a feature vector space model. Preprocessing includes word segmentation (the algorithm uses a general word segmentation method that combines statistical segmentation with dictionary-based segmentation), stop-word removal (stop words are words that occur with high frequency in the file set and contribute little or nothing to the classification task; function words such as adverbs, pronouns, articles, prepositions, and conjunctions, which carry no actual semantics, belong to this category), and feature word extraction (the purpose is to select words that are helpful for classification and to reduce the dimension, using the chi-square test method, see...
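The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the source text is cut off before it gives its chi-square formula, so the sketch assumes the standard 2x2 contingency-table statistic commonly used for text feature selection. The whitespace tokenizer, the `STOP_WORDS` list, and the function names are all hypothetical stand-ins (the patent uses a combined statistical and dictionary-based segmenter).

```python
# Hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "of", "and", "is", "to"}


def tokenize(text):
    """Whitespace split plus stop-word removal (stand-in for a real segmenter)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def chi_square(docs, labels, term, cls):
    """Standard 2x2 chi-square score of `term` against class `cls`.

    docs is a list of token sets, labels the parallel list of class labels.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if has_term and label == cls:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document not in class
        elif label == cls:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document not in class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, top_k):
    """Keep the top_k terms ranked by their best chi-square score over all classes."""
    vocab = set().union(*docs)
    classes = set(labels)
    scores = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:top_k]


texts = [
    "the reactor shall trip",
    "reactor trip logic",
    "display shows a menu",
    "menu display color",
]
labels = ["safety", "safety", "ui", "ui"]
docs = [set(tokenize(t)) for t in texts]
features = select_features(docs, labels, top_k=4)
```

Terms that occur in every document of one class and in none of the other receive the highest score, so they survive the dimensionality reduction, while terms such as "shall" or "color" that appear only once are dropped.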


Abstract

The invention provides an improved KNN (K Nearest Neighbor) text classification method. The method comprises the following steps: preprocessing the training text, computing the feature vector of each training sample, and constructing a feature vector space model of the training set; defining a density and a distance, dividing the whole sample space by class into a plurality of spherical regions and outliers, and storing these as the training set library; during testing, judging whether the text to be tested falls into one of the spherical regions and, if so, assigning its class according to the label of that region; otherwise, using the outliers and the center point of each sphere as the training set library, calling the KNN algorithm, and judging the class of the text to be tested. The method takes classification speed, classification accuracy, and sensitivity to data skew into account. It can be applied well to classification problems with non-spherical distributions, and is particularly suitable for text classification problems with high-dimensional feature vectors and irregular distributions.
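The test-time decision described in the abstract can be sketched as a two-stage lookup. This is only an illustration under stated assumptions: the abstract does not say how the density and distance are defined or how the spheres are built, so the sketch takes the spheres as already given, each represented as a hypothetical `(center, radius, label)` tuple, and assumes Euclidean distance and a simple majority vote. The function names and the parameter `k` are illustrative, not taken from the patent.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(x, spheres, outliers, k=3):
    """Two-stage classification of feature vector x.

    spheres:  list of (center, radius, label) tuples, assumed precomputed
    outliers: list of (vector, label) tuples left outside every sphere
    """
    # Stage 1: if x falls inside a spherical region, take that region's label.
    # When regions overlap, this sketch simply returns the first match.
    for center, radius, label in spheres:
        if euclidean(x, center) <= radius:
            return label

    # Stage 2: otherwise run plain KNN over the reduced library made of the
    # sphere centers plus the outliers.
    reduced = [(center, label) for center, _, label in spheres] + list(outliers)
    nearest = sorted(reduced, key=lambda item: euclidean(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


spheres = [((0.0, 0.0), 1.0, "A"), ((5.0, 5.0), 1.0, "B")]
outliers = [((2.0, 2.0), "A"), ((3.5, 3.5), "B")]

inside = classify((0.5, 0.0), spheres, outliers)         # falls inside sphere A
fallback = classify((4.0, 4.0), spheres, outliers, k=3)  # outside all spheres
```

The speed-up suggested by the abstract comes from the reduced library: stage 2 compares the test vector with one center per sphere plus the outliers rather than with every training sample.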

Description

Technical field

[0001] The invention belongs to the technical field of verification and validation of nuclear safety-level software, and in particular relates to an improved KNN-based text classification method.

Background

[0002] In recent years, with the gradual standardization of the software development process, the quality requirements for technical documents have become higher and higher. In the development of nuclear safety-level software in particular, a large number of technical documents, such as requirements documents and design documents, are generated. According to the requirements of relevant nuclear power standards, each requirement item and design item must meet the requirements of certain evaluation rules; and as the software is developed iteratively, the technical documents are constantly updated and upgraded, so for each version of the document, each item (such as a requirement item or a design item) must also meet the requ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 冯素梅, 赵云飞, 刘建龙, 张亚栋, 刘邦信, 周小波, 程建明
Owner: CHINA TECHENERGY