Automatic identifying and classifying method for sensitive data

Automatic identification technology for sensitive data, applied in digital data processing, character and pattern recognition, and special data processing applications. It addresses inconsistent identification results, interference between classification results, and low identification accuracy, and thereby improves the efficiency and accuracy of sensitive data identification and classification.

Inactive Publication Date: 2015-09-23
北京途美科技有限公司 +1

AI Technical Summary

Problems solved by technology

[0007] The dictionary matching method for sensitive data has the following defects: 1. Low recognition accuracy. Dictionary matching is a form of pattern matching, so the quality of the data dictionary determines the accuracy of sensitive data recognition; when the dictionary is incomplete or built incorrectly, recognition accuracy drops. 2. Interference between classification results. The same data can match several data dictionaries, and because traditional data dictionaries cannot perform weighted calculations, the competing matches interfere with one another and the classification result becomes inaccurate.
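
The two defects can be illustrated with a minimal sketch. The dictionary names and entries below are hypothetical and are not taken from the patent; the point is only that unweighted matching can return several categories with no way to rank them, and that a term missing from every dictionary goes undetected.

```python
# Hypothetical, unweighted data dictionaries (illustrative entries only).
DICTIONARIES = {
    "medical": {"patient", "record"},
    "financial": {"account", "record"},
}

def dictionary_match(text):
    """Return the names of all dictionaries that share at least one word with the text."""
    tokens = set(text.lower().split())
    return sorted(name for name, words in DICTIONARIES.items() if tokens & words)

# Defect 2: "record" is in both dictionaries, so both categories match
# and nothing indicates which one should win.
ambiguous = dictionary_match("patient record")

# Defect 1: "salary" is sensitive but absent from every dictionary, so it is missed.
missed = dictionary_match("salary slip")
```

In this sketch `ambiguous` contains both category names and `missed` is empty, which mirrors the interference and accuracy problems described in paragraph [0007].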
[0008] The artific


Examples

Detailed Description of the Embodiments

[0026] The sensitive data identification and classification process is described below:

[0027] Step 1: Establish a basic data corpus. This comprises two sub-steps: (1) Preprocess the training data set with word segmentation to obtain a vocabulary set. Word segmentation mainly consists of stemming for English vocabulary and dictionary-based segmentation for Chinese vocabulary; preprocessing also removes meaningless words from the vocabulary set, such as function words and pronouns, leaving a meaningful vocabulary set. (2) Process the vocabulary set according to TF-IDF. The more frequently a word appears across the training data set, the more important the word is taken to be, and the higher its vector weight value. Calculating the vector weight value of every word completes the corpus.
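
Step 1 can be sketched roughly as follows. The source does not give its exact weighting formula, so this sketch assumes a common smoothed TF-IDF form, tf × (ln((1 + N) / (1 + df)) + 1), summed over all training documents. The stop-word list, the whitespace tokenizer (no stemming, no Chinese segmentation), and the function names are all hypothetical simplifications.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the source only says that meaningless words
# such as function words and pronouns are removed.
STOP_WORDS = {"the", "a", "an", "of", "is", "it", "he", "she", "they", "and", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def build_corpus(documents):
    """Return {word: weight}, where weight is the word's TF-IDF summed over all documents."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF (an assumption; always positive).
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            weights[word] += tf * idf
    return dict(weights)

corpus = build_corpus([
    "The patient diagnosis is confidential.",
    "Salary of the patient",
])
```

Because per-document scores are summed, a word that recurs across the training set accumulates a larger weight, which is one way to read the statement in paragraph [0027]; other weighting choices are equally compatible with the text.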

[0028] Step 2: Sensitive feature extraction and establishment of sen...

Abstract

The invention discloses a method for automatically identifying and classifying sensitive data, comprising the steps of: preprocessing a training data set and removing meaningless words to obtain a word set; processing the word set according to TF-IDF and calculating the vector weight value of each word to obtain a basic data corpus; manually identifying and classifying the basic data corpus and selecting the words that can mark sensitive data, thereby forming a sensitive word corpus; segmenting the target data into words and matching the extracted features against the sensitive word corpus; and ranking the classifications from high to low by the accumulated weight values of the matched sensitive words. By establishing the sensitive word corpus, segmenting the target data, and accumulating the weight values of the sensitive words, the method classifies sensitive data, improves the efficiency and accuracy of sensitive data identification and classification, and remedies the shortcomings of existing approaches that depend on dictionary classification and manual classification.
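
The matching and ranking steps in the abstract can be sketched as below. The shape of the sensitive word corpus (category name mapped to word weights), the sample categories and weights, and the function name `classify` are assumptions made for illustration; the patent text shown here does not specify a data structure.

```python
def classify(text, sensitive_corpus):
    """Rank categories by the accumulated weight of sensitive words found in the text.

    sensitive_corpus maps a category name to a {word: weight} dictionary.
    Categories with no matching word are omitted from the result.
    """
    # Simplified segmentation: lowercase and split on whitespace.
    tokens = text.lower().split()
    scores = {}
    for category, word_weights in sensitive_corpus.items():
        total = sum(word_weights.get(token, 0.0) for token in tokens)
        if total > 0:
            scores[category] = total
    # Sort from the highest accumulated weight to the lowest.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sensitive word corpus with manually assigned categories.
SENSITIVE_CORPUS = {
    "medical": {"patient": 0.8, "diagnosis": 0.9},
    "financial": {"salary": 0.7, "account": 0.6, "patient": 0.1},
}

ranking = classify("patient diagnosis and account number", SENSITIVE_CORPUS)
```

Here "patient" appears in both categories, but its weight differs between them, so the accumulated score still places "medical" ahead of "financial". This is the weighted calculation that plain dictionary matching lacks.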

Description

Technical field

[0001] The invention relates to the technical field of information monitoring in computer networks, and in particular to a method for the automatic identification and classification of sensitive data.

Background

[0002] Data is the foundation of enterprise business and the core of the enterprise information system. A problem with the database management system affects the continuity of the entire enterprise business. In any protection scheme for sensitive data, the core task is to pick out the sensitive data from massive volumes of data, that is, to identify the sensitive data.

[0003] At present, the identification of sensitive data relies mainly on dictionary matching and manual identification.

[0004] The dictionary matching method relies on manually defined pattern matching formulas for sensitive data and checks the data item by item against them. When the data is found to meet the pattern matching formula, the data is defined ...

Application Information

IPC(8): G06K9/62, G06F17/27
CPC: G06F40/211, G06F40/284, G06F18/24
Inventor: 王雷, 林素标
Owner: 北京途美科技有限公司