Text sensitive information recognition method based on semi-supervised learning

A technology for sensitive information identification and semi-supervised learning

Inactive Publication Date: 2017-06-27
NO 30 INST OF CHINA ELECTRONIC TECH GRP CORP

AI Technical Summary

Problems solved by technology

In this process, the sensitive information data set used for learning must reflect the sensitive information of the problem domain as truthfully and completely as possible; otherwise the accuracy of the algorithm drops sharply.
In practice, however, manually labeling the nature of documents is costly, while large numbers of unlabeled documents are much easier to obtain. A complete sensitive data set of this kind is therefore hard to obtain, which limits the use of these methods.


Examples

Embodiment Construction

[0014] The present invention will be further described below in conjunction with the accompanying drawings.

[0015] As shown in Figure 1, a text sensitive information recognition method based on semi-supervised learning includes the following process.

[0016] (1) Based on the learning samples in the labeled sensitive document set L and the unlabeled unknown document set U, conduct semi-supervised learning to obtain the classification strategy knowledge base.

[0017] The purpose of semi-supervised learning is to make combined use of labeled and unlabeled document samples to form classification strategy knowledge. In the sensitive identification problem, documents are divided into sensitive documents and safe (non-sensitive) documents. As shown in Figure 2, the semi-supervised learning process is:

[0018] ① Construct a labeled sensitive document set L and an unlabeled unknown document set U;

[0019] The sensitive document set L stores the...
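The learning procedure itself is not spelled out in the text available here, so the following is only a generic sketch of how a labeled sensitive set L and an unlabeled set U might be combined. It assumes a nearest-centroid self-training loop with bag-of-words vectors; the names `self_train`, `classify`, `rounds`, and `margin` are hypothetical, and whitespace splitting stands in for real word segmentation.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts; whitespace split is a stand-in for segmentation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Sum of the count vectors (scale does not matter for cosine)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def self_train(labeled_sensitive, unlabeled, rounds=3, margin=0.1):
    """Assumed procedure, not the patented one: grow both classes from U."""
    sensitive = [vectorize(d) for d in labeled_sensitive]
    pool = [vectorize(d) for d in unlabeled]
    # Bootstrap: the half of U least similar to L becomes provisional "safe".
    c_s = centroid(sensitive)
    pool.sort(key=lambda v: cosine(v, c_s))
    n_seed = max(1, len(pool) // 2)
    safe, pool = pool[:n_seed], pool[n_seed:]
    for _ in range(rounds):
        c_s, c_n = centroid(sensitive), centroid(safe)
        remaining = []
        for v in pool:
            diff = cosine(v, c_s) - cosine(v, c_n)
            if diff > margin:
                sensitive.append(v)   # confident: pseudo-label as sensitive
            elif diff < -margin:
                safe.append(v)        # confident: pseudo-label as safe
            else:
                remaining.append(v)   # undecided: retry next round
        if len(remaining) == len(pool):
            break                     # nothing new was labeled
        pool = remaining
    # The two centroids play the role of the "knowledge base" here.
    return centroid(sensitive), centroid(safe)


def classify(text, knowledge):
    """Assign the class whose centroid is closer; ties fall back to safe."""
    c_s, c_n = knowledge
    v = vectorize(text)
    return "sensitive" if cosine(v, c_s) > cosine(v, c_n) else "safe"
```

Only documents whose similarity gap exceeds `margin` are pseudo-labeled in each round, which is one common way to limit error propagation in self-training; a production system would more likely use a trained classifier in place of the centroids.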

Abstract

The invention relates to the technical field of information security and discloses a text sensitive information recognition method based on semi-supervised learning. The method comprises the steps of: 1, conducting semi-supervised learning on the learning samples in a labeled sensitive text set and an unlabeled unknown text set to obtain a classification strategy knowledge base; 2, performing Chinese word segmentation and stop-word removal on a text to be detected to obtain the feature element data in the text; 3, representing the feature element data as a feature vector and extracting feature values; 4, using the classification strategy knowledge base to judge the nature of the text from the feature values, and outputting a judgment of either sensitive text or safe text. By labeling only a small number of sensitive texts and performing semi-supervised learning over a large text set, the method improves the scalability and practicality of sensitive information recognition.
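Steps 2 to 4 of the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the regex tokenizer is a stand-in for a real Chinese word segmenter, and `STOP_WORDS`, the vocabulary, and the linear weights in the hypothetical `knowledge_base` are invented values, not taken from the patent.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would load a curated one.
STOP_WORDS = {"the", "a", "of", "is", "and", "to"}


def segment(text):
    """Step 2a: split into tokens (stand-in for Chinese word segmentation)."""
    return re.findall(r"\w+", text.lower())


def extract_elements(text):
    """Step 2b: remove stop words to obtain the feature elements."""
    return [t for t in segment(text) if t not in STOP_WORDS]


def feature_vector(elements, vocabulary):
    """Step 3: relative term frequency of each vocabulary term."""
    counts = Counter(elements)
    total = len(elements) or 1  # avoid division by zero on empty input
    return [counts[term] / total for term in vocabulary]


def judge(text, knowledge_base):
    """Step 4: linear score against the knowledge base, then a verdict."""
    vector = feature_vector(extract_elements(text), knowledge_base["vocabulary"])
    score = knowledge_base["bias"] + sum(
        w * x for w, x in zip(knowledge_base["weights"], vector)
    )
    return "sensitive" if score > 0 else "safe"


# Hypothetical knowledge base, as if produced by step 1.
knowledge_base = {
    "vocabulary": ["salary", "contract", "weather"],
    "weights": [2.0, 1.5, -2.0],
    "bias": -0.2,
}
```

With these assumed values, `judge("The salary of the contract is confidential", knowledge_base)` scores the text above zero and `judge("The weather is sunny", knowledge_base)` scores it below zero. The linear scorer is only one possible form for the knowledge base; the source does not state which classifier is used.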

Description

Technical field

[0001] The invention relates to the technical field of information security, in particular to a text sensitive information recognition method based on semi-supervised learning.

Background technology

[0002] In modern society, data is an enterprise asset, a matter of personal privacy, and an embodiment of the core competitiveness of many industries. Effectively protecting an enterprise's key sensitive data helps it hold its position in fierce business competition; protecting personal sensitive information prevents leaks that cause social harm. Research on sensitive data identification has therefore become a hot topic in recent years. The problem spans fields such as text mining and information security, and is a core technology of DLP (Data Leakage Prevention) data security products.

[0003] Existing sensitive information identification methods include basic detection technology and advanced detec...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27, G06K9/62
CPC: G06F16/35, G06F40/284, G06F18/2193, G06F18/2411
Inventor: 梁玲玲
Owner: NO 30 INST OF CHINA ELECTRONIC TECH GRP CORP