Knowledge distillation-based confidential text recognition model training method, system and device

A text recognition and model training technology, applied in the field of text recognition, which addresses the problems that pre-trained models are very large and online prediction is slow, and achieves the effect of improved classification accuracy and fast prediction speed.

Pending Publication Date: 2022-01-07
STATE GRID INFORMATION & TELECOMM BRANCH +1

AI Technical Summary

Problems solved by technology

However, existing approaches suffer from the defects that the pre-trained model is very large and its online prediction speed is slow.


Examples


Embodiment 1

[0065] As shown in figure 1, this embodiment provides a method for training a confidential text recognition model based on knowledge distillation, comprising the following steps:

[0066] S1: Prepare an unlabeled corpus A in the confidentiality domain.

[0067] The purpose of this step is to prepare a large-scale unlabeled confidential domain corpus for subsequent knowledge distillation.

[0068] S2: Construct a text label hierarchy tree based on the confidential business data, annotate data to obtain the labeled data set B, and prepare the unlabeled data set C.

[0069] Here, each node of the label hierarchy tree carries a label, the label of each child node belongs to (is subsumed by) the label of its parent node, and the category label of each text in the labeled data set B is a leaf-node label of the tree.

[0070] Specifically, this step constructs a label hierarchy tree based on confidential business knowledge; each node of the tree has a corresponding label.
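The label hierarchy tree can be pictured with a small sketch. The following Python is illustrative only; the `LabelNode` class and the example labels are assumptions, not taken from the patent. It shows the two properties stated above: every child's label belongs to its parent's label, and texts in the labeled data set B carry leaf-node labels, whose root-to-leaf path forms a label path.

```python
# Illustrative sketch of the label hierarchy tree; class and labels are assumed.
class LabelNode:
    """A node in the label hierarchy tree; each node carries one label."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_leaf(self):
        return not self.children

    def label_path(self):
        """Return the label path from the root down to this node."""
        path, node = [], self
        while node is not None:
            path.append(node.label)
            node = node.parent
        return list(reversed(path))

# Hypothetical labels: texts in data set B are tagged with leaf-node labels.
root = LabelNode("document")
secret = LabelNode("confidential", parent=root)
finance = LabelNode("financial-report", parent=secret)   # leaf node
assert finance.is_leaf()
assert finance.label_path() == ["document", "confidential", "financial-report"]
```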

Embodiment 2

[0097] Building on Embodiment 1, and as shown in figure 3, the present invention also discloses a knowledge distillation-based confidential text recognition model training system, comprising: a corpus preparation unit 1, a data set preparation unit 2, a text enhancement unit 3, a first knowledge distillation unit 4, a supervised classification training unit 5, a classification model construction unit 6, a second knowledge distillation unit 7 and a classification model storage unit 8.

[0098] The corpus preparation unit 1 is used to prepare the unlabeled corpus A in the confidential field;

[0099] The data set preparation unit 2 is used to construct a text label hierarchy tree according to the confidential business data, annotate data to obtain the labeled data set B, and prepare the unlabeled data set C.

[0100] The text enhancement unit 3 is configured to perform text enhancement on the texts in the unlabeled data set C (a sketch of one possible augmentation follows below).

[0101] The first knowledge distillation unit 4 is used to perform knowledge distillation through the unlabeled corpus A, so that the IDCNN model learns the semantic feature extraction capability from the Bert model.
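The patent does not name a specific augmentation technique for unit 3, so the following Python is only a plausible sketch of what text enhancement on data set C might look like (random token deletion and swap, in the style of simple EDA-type augmentation); the `augment` function and its parameters are hypothetical.

```python
# Hypothetical text enhancement for the unlabeled data set C.
import random

def augment(tokens, p_delete=0.1, n_swaps=1):
    """Return an augmented copy of a tokenized text."""
    out = [t for t in tokens if random.random() > p_delete]  # random deletion
    for _ in range(n_swaps):                                 # random swap
        if len(out) >= 2:
            i, j = random.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
    return out if out else tokens  # never return an empty text

# Usage: expand C with several augmented variants per text.
dataset_c = [["this", "report", "contains", "confidential", "figures"]]
augmented_c = [augment(t) for t in dataset_c for _ in range(3)]
```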

Embodiment 3

[0107] This embodiment discloses a knowledge distillation-based confidential text recognition model training device, comprising a processor and a memory, wherein the processor, when executing the knowledge distillation-based confidential text recognition model training program stored in the memory, implements the following steps:

[0108] 1. Prepare the unlabeled corpus A in the confidential field.

[0109] 2. Construct a text label hierarchy tree based on the confidential business data, annotate data to obtain the labeled data set B, and prepare the unlabeled data set C.

[0110] 3. Perform text enhancement on the text in the unlabeled dataset C.

[0111] 4. Carry out knowledge distillation through the unlabeled corpus A, so that the IDCNN model can learn the semantic feature extraction ability from the Bert model (a sketch of this stage follows these steps).

[0112] 5. Build Bert-clf, a label path classification model based on the Bert model, and conduct supervised classification training on the Bert model through the labeled data set B to obtain the label path classification model Bert-clf.
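Step 4 is a feature-level distillation on the unlabeled corpus A. Below is a minimal sketch, assuming PyTorch; the `teacher` (Bert encoder) and `student` (IDCNN encoder) callables, their shared feature dimensionality, and the MSE objective are assumptions, since the patent text does not specify the training loop.

```python
# Minimal sketch of step 4 (feature-level distillation), assuming PyTorch.
# `teacher` and `student` are assumed nn.Module encoders mapping a batch of
# token ids to one sentence-feature vector per text.
import torch
import torch.nn as nn

mse = nn.MSELoss()

def feature_distill_step(teacher, student, batch, optimizer):
    """One optimization step on a batch drawn from unlabeled corpus A."""
    teacher.eval()
    with torch.no_grad():
        t_feat = teacher(batch)        # frozen Bert semantic features
    s_feat = student(batch)            # IDCNN features, being trained
    loss = mse(s_feat, t_feat)         # pull student features toward teacher's
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```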



Abstract

The invention provides a knowledge distillation-based confidential text recognition model training method, system and device. The method comprises the following steps: preparing an unlabeled corpus A in the confidentiality domain; constructing a text label hierarchy tree according to confidential business data, annotating data to obtain a labeled data set B, and preparing an unlabeled data set C; performing text enhancement on the texts in the unlabeled data set C; performing knowledge distillation through the unlabeled corpus A, so that the IDCNN model learns the semantic feature extraction capability from the Bert model; constructing a label path classification model Bert-clf based on the Bert model, and performing supervised classification training on the Bert model through the labeled data set B to obtain the label path classification model Bert-clf; building a label path classification model Idcnn-clf based on the IDCNN model; performing knowledge distillation on the label path classification model Idcnn-clf through the labeled data set B and the unlabeled data set C; and storing the label path classification model Idcnn-clf. According to the invention, the prediction speed and the classification accuracy of the confidential text recognition model can be effectively improved.
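The second knowledge distillation named in the abstract (training Idcnn-clf against Bert-clf on the labeled data set B plus the augmented, unlabeled data set C) is conventionally done with softened teacher outputs. Below is a minimal sketch, assuming PyTorch; the temperature `T`, the mixing weight `alpha`, and the function name are assumed hyperparameters and identifiers, not values specified by the patent.

```python
# Minimal sketch of a standard soft-label distillation loss for Idcnn-clf;
# T, alpha, and the function name are assumptions, not from the patent.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, hard_labels=None,
                 T=2.0, alpha=0.5):
    """KL divergence to the teacher's softened distribution; mixes in
    cross-entropy on gold labels when the batch comes from data set B."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # T^2 keeps the gradient scale stable
    if hard_labels is None:            # unlabeled batch from data set C
        return soft
    return alpha * soft + (1.0 - alpha) * F.cross_entropy(
        student_logits, hard_labels)
```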

Description

Technical field

[0001] The present invention relates to the technical field of text recognition, and more specifically to a method, system and device for training a confidential text recognition model based on knowledge distillation.

Background technique

[0002] The continuous advancement of science and technology has driven the rapid development of productivity and profound changes in society, and various industries, fields and scenarios now generate large amounts of data. As a special form of asset, the security of data assets is as important as the security of data content. Especially in special industries, the leakage of confidential information brings huge losses, so the identification of confidential text is of great significance to information security.

[0003] Existing confidential text recognition mainly uses rule-based methods, in which business personnel construct a large number of rules and judge whether a text...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/35; G06F40/30; G06K9/62; G06N3/08
CPC: G06F16/35; G06F40/30; G06N3/08; G06F18/214
Inventors: 程杰, 卢腾, 吴海杰, 崔兆伟, 吕俊峰, 胡威, 王国青, 刘思尧, 魏家辉, 林冰洁, 夏昂, 牟霄寒, 王超, 杨青, 章东润
Owner: STATE GRID INFORMATION & TELECOMM BRANCH