Sensitive data discovery method and system based on text recognition

A sensitive data discovery technology based on text recognition, applied in the field of data security. It addresses the low accuracy and slow recognition of sensitive data, and achieves reduced interference, efficient judgment, and consistent results.

Active Publication Date: 2020-02-21
SHANGHAI GUAN AN INFORMATION TECH

AI Technical Summary

Problems solved by technology

[0007] The technical problem to be solved by the present invention is the slow speed and low precision of sensitive data identification in the prior art.

Examples

Embodiment 1

[0069] This embodiment provides a sensitive data discovery method based on text recognition. As shown in Figure 1, it includes the following steps:

[0070] S01: Sample data extraction

[0071] Extract the standardized business data tables for a specified time period (a day or a month) as the original sample data.
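
As a minimal sketch of step S01, the extraction can be thought of as a filter over a time window. The row layout, the `created_on` field, and the function name `extract_sample` are assumptions made for illustration; the patent text does not specify the table schema.

```python
from datetime import date

def extract_sample(rows, start, end, date_field="created_on"):
    """Return the rows whose date falls in the half-open window [start, end)."""
    return [row for row in rows if start <= row[date_field] < end]

# Hypothetical business table rows (schema assumed for illustration).
rows = [
    {"id": 1, "created_on": date(2020, 1, 31), "phone": "13800000000"},
    {"id": 2, "created_on": date(2020, 2, 1), "phone": "13911112222"},
    {"id": 3, "created_on": date(2020, 2, 15), "phone": "13512345678"},
    {"id": 4, "created_on": date(2020, 3, 1), "phone": "13766667777"},
]

# Extract one month of data as the original sample.
sample = extract_sample(rows, date(2020, 2, 1), date(2020, 3, 1))
```

A half-open window is used so that consecutive daily or monthly extractions do not overlap.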

[0072] S02: Text annotation processing

[0073] Collect a large text corpus and, with the help of text annotation tools, manually annotate the keywords in the corpus using the BIO annotation method to construct a large number of training samples.

[0074] BIO labeling: each element is labeled "B-X", "I-X" or "O". Here "X" indicates the type of the labeled element; "B-X" indicates that the segment containing the element is of type X and the element is at the beginning of the segment; "I-X" indicates that the segment containing the element is of type X and the element is in the middle of the segment; "O" mea...
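
The BIO scheme can be illustrated with a short sketch that turns annotated segments into per-element tags. The function name `bio_tags`, the token list, and the `NAME` and `PHONE` types are assumptions chosen for illustration and do not come from the patent.

```python
def bio_tags(tokens, spans):
    """Convert annotated segments into BIO tags.

    spans is a list of (start, end, type) tuples over token indices,
    with end exclusive. Elements outside every segment are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first element of the segment
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining elements of the segment
    return tags

# Hypothetical annotated sentence.
tokens = ["Contact", "Zhang", "Wei", "at", "138", "0000", "0000"]
spans = [(1, 3, "NAME"), (4, 7, "PHONE")]
tags = bio_tags(tokens, spans)
```

Each token receives exactly one tag, so the tagged sequence can be used directly as a training sample for a sequence labeling model.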

Embodiment 2

[0104] Correspondingly, this embodiment also provides a sensitive data discovery system based on text recognition, which comprises:

[0105] A sample data extraction module, which extracts the standardized business data tables within a specified time period as the original sample data;

[0106] A training sample construction module, which collects text data sets, uses text labeling tools to label keywords in the text data sets, and builds a large number of training samples. BIO labeling: each element is labeled "B-X", "I-X" or "O". Here "X" indicates the type of the labeled element; "B-X" indicates that the segment containing the element is of type X and the element is at the beginning of the segment; "I-X" indicates that the segment containing the element is of type X and the element is in the middle of the segment; "O" means the element does not belong to any type.

[0107] The training sample labeling model module, based on the obtained training samples, uses t...
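
To show how BIO tags can be consumed downstream, the following sketch groups a tagged sequence back into typed segments. It is an illustrative assumption rather than part of the described system; the function name `decode_bio` and the sample data are hypothetical.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (type, text) segments.

    An "I-X" tag that does not continue an open segment of type X
    is treated like "O" and dropped.
    """
    segments = []
    current = None  # (type, list of tokens) for the segment being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                segments.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                segments.append(current)
            current = None
    if current:
        segments.append(current)
    return [(label, " ".join(parts)) for label, parts in segments]

# Hypothetical tagged sequence.
tokens = ["Contact", "Zhang", "Wei", "at", "138", "0000", "0000"]
tags = ["O", "B-NAME", "I-NAME", "O", "B-PHONE", "I-PHONE", "I-PHONE"]
segments = decode_bio(tokens, tags)
```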

Abstract

The invention discloses a sensitive data discovery method based on text recognition. The method comprises the following steps: S01, extracting sample data; S02, constructing training samples by collecting a text data set and building the training samples from it; S03, training a sample annotation model, in which a text annotation model is trained on the training samples obtained in S02; S04, constructing data features; S05, constructing a training set, in which the data set obtained in S04 is labeled to form a training set for building a classification judgment model; S06, constructing the classification judgment model, in which a variable prediction model is formed from the training set obtained in S05; S07, testing the model. By identifying data variables, sensitive data can be judged and identified accurately and efficiently even when the data dictionary and the matching rules are incomplete, and the consistency of identification and classification results is ensured.
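
Steps S04 to S06 can be sketched roughly as follows. The abstract does not state which features or which classifier are used, so everything here is an assumption for illustration: the two column features (mean value length and digit ratio), the labels, and the nearest-centroid classifier are stand-ins, not the patented model.

```python
from statistics import mean

def column_features(values):
    """Hypothetical S04 features: mean value length and share of digit characters."""
    texts = [str(v) for v in values]
    total_chars = sum(len(t) for t in texts) or 1
    digit_ratio = sum(ch.isdigit() for t in texts for ch in t) / total_chars
    return (mean(len(t) for t in texts), digit_ratio)

def train_centroids(training_set):
    """Stand-in for S06: average the feature vectors of each label."""
    grouped = {}
    for features, label in training_set:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(dim) for dim in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))

# S05: a tiny labeled training set built from hypothetical columns.
training_set = [
    (column_features(["13800000000", "13911112222"]), "sensitive"),
    (column_features(["13512345678"]), "sensitive"),
    (column_features(["red", "blue"]), "non-sensitive"),
    (column_features(["large", "small"]), "non-sensitive"),
]
centroids = train_centroids(training_set)

# S07: check the model on an unseen column.
prediction = predict(centroids, column_features(["13766667777", "13688889999"]))
```

The point of the sketch is the data flow: columns become feature vectors, labeled vectors become a training set, and the fitted model predicts a label for a new column without relying on a data dictionary or matching rules.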

Description

Technical field

[0001] The invention relates to the technical field of data security, and in particular to a method and system for discovering sensitive data based on text recognition.

Background art

[0002] Data is the foundation of enterprise operations and the core of enterprise information systems. Problems in data-related management and application systems can seriously affect an enterprise's image and development, so data security has always been a concern for enterprises. At present, data protection schemes in practical applications mainly include data isolation, permission setting, and data desensitization. Within these schemes, the protection of sensitive data is particularly important. The core of a sensitive data protection scheme is to select sensitive data from massive data and identify it precisely.

[0003] At present, the identification of s...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/284, G06F16/45, G06F21/62
CPC: G06F16/45, G06F21/6245
Inventor: 殷钱安, 梁淑云, 刘胜, 马影, 陶景龙, 王启凡, 魏国富, 徐明, 余贤喆, 周晓勇
Owner: SHANGHAI GUAN AN INFORMATION TECH