Machine learning-based data classification method and device

A machine learning-based data classification technology that addresses time-consuming, complicated, and disjointed workflows, achieving better feature-word extraction, a simpler process, and higher overall efficiency.

Active Publication Date: 2017-06-09
SUNINFO INFORMATION TECH

AI Technical Summary

Problems solved by technology

[0005] (1) The clustering results are unsatisfactory. Once obtained, the clusters are used directly as produced, with no room for more flexible adjustment.
In practice, some words are polysemous and carry different meanings in different documents, as do idioms, proverbs, and similar expressions. Such expressions are likely to be split apart during word segmentation, so that originally highly representative terms are reduced to ordinary words, which seriously degrades the clustering result.
[0006] (2) Individual keywords are short and may also appear in documents of other types, so enough keywords must be supplied to characterize a given document type. Selecting them is not easy: if there are many candidates, selection takes a considerable amount of time.
If all candidates are selected to save time, a considerable part of the vocabulary is redundant, which affects matching results and efficiency to some extent.
If only a subset is selected, the articles must be compared and screened manually, which is clearly impractical.
[0007] (3) The process is too complicated, the operations are disjointed, and it is hard to get started. A step is easily missed during the operation, or time is wasted unnecessarily.


Examples


Embodiment 1

[0044] As shown in Figure 1, a machine learning-based data classification method establishes a data classification model through machine learning, then reads and classifies the data to be classified according to that model. The data classification model is built on feature words, and the documents are clustered: the file content is segmented into words, the TF-IDF algorithm is used to calculate word weights, file similarity is then calculated, and similar files are clustered.
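The pipeline described in [0044] (segmentation, TF-IDF weighting, similarity, clustering) can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the patent's implementation: `tokenize` uses whitespace splitting as a stand-in for a real word segmenter, the IDF uses a common smoothed variant, similarity is cosine similarity, and `cluster` is a simple greedy threshold grouping with a hypothetical `threshold` parameter.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in for word segmentation (assumption): lowercase whitespace split.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {word: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Smoothed IDF variant (assumption): ln((1 + N) / (1 + df)) + 1
            word: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose first
    member is at least `threshold` similar to it, else starts a new cluster."""
    vectors = tfidf_vectors(docs)
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Calling `cluster(docs)` on a list of strings returns lists of document indices, one list per cluster. The greedy scheme is chosen only for brevity; the source does not say which clustering algorithm is used.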

[0045] The method of Embodiment 1 comprises the following steps:

[0046] S11. Determine the first feature word group corresponding to each type of data based on the learning data;

[0047] Learning data refers to the sample data used for machine learning. These sample data have already been classified; for example, they are selected from academic materials in different fields such as medicine, forestry, architecture, and petroleum. Because a single field can itself be divided into different categories or research directions, the present invention clusters these data, grouping documents with high similarity into one category as far as possible, and then calculates the corresponding rules for that category, that is, the model, which contains the characteristics of the class. With these characteristics, users can search a large number of documents for the ones they care about, that is, documents that meet the characteristics of the model, compared wi...
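As a rough illustration of step S11, the sketch below derives a feature word group per class from already-labelled learning data. The scoring rule (term frequency within a class multiplied by a log inverse class frequency), the whitespace tokenization, and the names `feature_word_groups` and `top_k` are all assumptions made for this example; the source does not specify how the first feature word group is computed.

```python
import math
from collections import Counter, defaultdict


def feature_word_groups(samples, top_k=3):
    """samples: list of (label, text) pairs from the learning data.
    Returns {label: [feature words]}, highest-scoring words first."""
    class_counts = defaultdict(Counter)
    for label, text in samples:
        class_counts[label].update(text.lower().split())

    n_classes = len(class_counts)
    class_freq = Counter()  # in how many classes does each word occur?
    for counts in class_counts.values():
        class_freq.update(counts.keys())

    groups = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        scores = {
            # A word found in every class gets log(1) = 0 and is dropped below.
            word: (count / total) * math.log((1 + n_classes) / (1 + class_freq[word]))
            for word, count in counts.items()
        }
        ranked = sorted(
            (w for w, s in scores.items() if s > 0),
            key=lambda w: (-scores[w], w),  # score descending, then alphabetical
        )
        groups[label] = ranked[:top_k]
    return groups
```

Under this scoring, words shared by all classes never become feature words, which matches the stated goal that a class's feature words should distinguish it from other classes.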

Embodiment 2

[0062] As shown in Figure 2, a machine learning-based data classification method includes the following steps:

[0063] S11. Determine the first feature word group corresponding to each type of data based on the learning data;

[0064] Learning data refers to the sample data used for machine learning. These sample data have already been classified; for example, they are selected from academic materials in different fields such as medicine, forestry, architecture, and petroleum. Because a single field can itself be divided into different categories or research directions, the present invention clusters these data, grouping documents with high similarity into one category as far as possible, and then calculates the corresponding rules for that category, that is, the model, which contains the characteristics of the class. With these characteristics, users can search a large number of documents for the ones they care about, th...

Abstract

The present invention provides a machine learning-based data classification method. The method comprises the following steps: S11, determining, based on learning data, a first feature word group corresponding to each type of data; S12, classifying the learning data according to the feature words; S13, judging whether the classification of the learning data is correct, and if so, jumping to step S15; if not, adjusting the first feature word group and jumping to step S12; S15, establishing a data classification model based on the first feature word group. The machine learning-based data classification device comprises a first feature word group determination module, a first data classification module, a judgment and classification module, a second feature word group determination module, and a modeling module. According to the technical scheme of the invention, the content of a file is segmented into words, and the weight of each word is calculated with the TF-IDF algorithm. File similarity is then calculated, similar files are clustered, and feature words are extracted. Feature words differ from keywords: they are more representative and better suited to serve as sensitive information, so the feature words of one cluster can be distinguished from those of other clusters.
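The control flow of steps S11 to S15 can be sketched as a loop that classifies the learning data, checks the result against the known labels, and adjusts the feature word groups until everything is classified correctly. This is a minimal sketch under stated assumptions: the matching rule in `classify` (most feature-word hits wins), the adjustment rule (add one unused word from each misclassified sample), and the `max_rounds` cap are hypothetical, since the abstract does not say how the adjustment is performed.

```python
def classify(text, groups):
    """S12: pick the label whose feature words overlap the text most.
    Returns None when no feature word matches (assumed rule)."""
    tokens = set(text.lower().split())
    best_label, best_hits = None, 0
    for label in sorted(groups):  # sorted for deterministic tie-breaking
        hits = len(tokens & set(groups[label]))
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


def build_model(samples, initial_groups, max_rounds=10):
    """samples: list of (label, text); initial_groups: result of S11.
    Returns (groups, converged)."""
    groups = {label: list(words) for label, words in initial_groups.items()}
    for _ in range(max_rounds):
        # S12 + S13: classify the learning data and collect the mistakes.
        wrong = [(label, text) for label, text in samples
                 if classify(text, groups) != label]
        if not wrong:
            return groups, True  # S15: the groups become the model
        # Adjustment (assumed rule): give each misclassified sample's class
        # one word from that sample that no other class already uses.
        for label, text in wrong:
            own = groups.setdefault(label, [])
            others = {w for other, ws in groups.items() if other != label for w in ws}
            for word in text.lower().split():
                if word not in others and word not in own:
                    own.append(word)
                    break
    return groups, False  # gave up after max_rounds without full agreement
```

For example, with initial groups `{"medicine": ["drug"], "petroleum": ["oil"]}`, any sample that contains neither word is misclassified in the first round and contributes one new feature word to its class before the loop re-checks.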

Description

Technical field

[0001] The invention relates to the field of data classification, in particular to a data classification method and equipment based on machine learning.

Background art

[0002] As the importance of critical data in ensuring enterprise operations continues to rise, managers are increasingly concerned about data security. At the same time, leaks of sensitive documents occur frequently, but it is obviously difficult to identify and classify sensitive documents only by humans. For a batch of files, it is difficult to accurately distinguish which files belong to which category only by the file name or file type. For the extraction of sensitive data, it is obviously very difficult to filter and screen and extract sensitive data only by manual methods.

[0003] Internal personnel themselves lack the experience, motivation, and technical support to complete the data combing work. The existing sensitive data discovery technology architecture generally require...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06N99/00
CPC: G06N20/00, G06F18/241
Inventor: 李洋, 张志勇, 高政伟
Owner: SUNINFO INFORMATION TECH