A method and apparatus for data security classification

By defining multi-level labels and hierarchical logical relationships, and using pre-trained models for multi-label hierarchical annotation and training of hierarchical classification models, the problems of low efficiency and insufficient accuracy in data security classification in existing technologies are solved, and the automation and interpretability of data security classification are realized.

CN117272190B · Active · Publication Date: 2026-06-30 · BEIJING CHIBO INFORMATION ENG CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING CHIBO INFORMATION ENG CO LTD
Filing Date: 2023-10-12
Publication Date: 2026-06-30


IPC: G06F18/243; G06F18/2431; G06F18/214

AI Tagging

Technology Topics

Data set; Big data security


AI Technical Summary

Technical Problem

Existing data security classification methods rely on human assessment and single-label classification models, resulting in a large workload, numerous disputes, and an inability to effectively resolve hierarchical classification issues among multiple labels, lacking interpretability and accuracy.

Method used

By determining multi-level labels and their hierarchical logical relationships, a pre-trained data labeling model is used to perform multi-label hierarchical labeling, and a hierarchical classification model is trained to achieve automation and interpretability of data security classification.

Benefits of technology

It improves the accuracy and efficiency of data security classification, overcomes the limitations of existing technologies that rely on experience and single-label models, and promotes the efficient sharing and use of data.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a data security classification method and apparatus, belonging to the field of data analysis technology and also applicable to the field of big data security technology. One embodiment of the data security classification method includes: determining the required domain knowledge and corresponding data classification standards based on the data to be classified; determining multi-level labels and hierarchical logical relationships between the labels from the data classification standards; acquiring a sample dataset and, based on the hierarchical logical relationships and a pre-trained data annotation model, performing multi-label hierarchical annotation on each sample data in the sample dataset to obtain the multi-label hierarchical annotation result of the sample dataset; training a hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation result, and determining the classification result of the data to be classified based on the trained hierarchical classification model. The technical solution of this application has interpretability and can improve the accuracy of data security classification.


Description

Technical Field

[0001] This application belongs to the field of data analysis technology and can also be used in the field of big data security technology. Specifically, it relates to a data security classification method and apparatus.

Background Technology

[0002] In the field of data governance, data security classification is the foundation for data sharing and use. Data security classification requires comprehensive consideration of the needs of data owners and users, as well as compliance with the requirements of companies and industries. Classification is carried out according to the prescribed security level standards. It is evident that data security involves a large number and variety of data objects, making the task heavy and complex.

[0003] Existing data security classification methods have several drawbacks. On the one hand, relevant professionals assess data security levels and classifications based on their understanding of data items and their experience. This approach is labor-intensive and prone to disputes during qualitative analysis. On the other hand, single-layer classification models based on deep learning are designed for single-label classification, where labels are mutually exclusive, and cannot address hierarchical classification. Specifically, labels have inclusion relationships based on their levels. Each data item must carry one label per level of the hierarchy, which creates a multi-label problem that single-label, single-layer classification models cannot handle. Furthermore, these models rely solely on the model's accuracy evaluation formula, resulting in limited interpretability.

Summary of the Invention

[0004] To address the problems existing in the prior art, this application provides a data security classification method and apparatus, which can make data security classification work more efficient, interpretable, and improve the accuracy of data security classification.

[0005] According to a first aspect of this application, a data security classification method is provided, the method comprising:

[0006] Based on the data to be classified, determine the required domain knowledge and the corresponding data classification standards for the domain knowledge;

[0007] From the data classification standards, determine the multi-level tags and the hierarchical logical relationships between the tags at each level;

[0008] Obtain a sample dataset, and based on the hierarchical logical relationship and the pre-trained data annotation model, perform multi-label hierarchical annotation on each sample data in the sample dataset to obtain the multi-label hierarchical annotation result of the sample dataset;

[0009] Based on the sample dataset and the multi-label hierarchical annotation results, a hierarchical classification model is trained, and based on the trained hierarchical classification model, the classification result of the data to be classified is determined.

[0010] In some optional embodiments of this example, determining the multi-level tags and the hierarchical logical relationships between the tags from the data classification criteria includes:

[0011] Based on the classification and subdivision in the data grading criteria, a set of primary tags is determined, wherein the set of primary tags includes multiple primary tags;

[0012] The set of tags corresponding to the child nodes of the first-level tag is taken as the set of second-level tags corresponding to the first-level tag, wherein the set of second-level tags includes multiple second-level tags;

[0013] The set of tags corresponding to the child nodes of the second-level tag is taken as the set of third-level tags corresponding to the second-level tag, wherein the set of third-level tags includes multiple third-level tags;

[0014] The set of tags corresponding to the child nodes of the third-level tag is taken as the set of fourth-level tags corresponding to the third-level tag, wherein the set of fourth-level tags includes multiple fourth-level tags;

[0015] The hierarchical logical relationship between the tags at each level is as follows: the second-level tag is a sub-tag of the first-level tag, the third-level tag is a sub-tag of the second-level tag, and the fourth-level tag is a sub-tag of the third-level tag.

[0016] In some optional embodiments of this example, the step of performing multi-label hierarchical annotation on each sample data in the sample dataset based on the hierarchical logical relationship and the pre-trained data annotation model to obtain the multi-label hierarchical annotation result of the sample dataset includes:

[0017] Based on the hierarchical logical relationship, determine the prompt statements for each sample data;

[0018] Based on the prompt statement and the data annotation model, multi-label hierarchical annotation is performed on each sample data to obtain the multi-label hierarchical annotation results of each sample data.

[0019] In response to the fact that the multi-label hierarchical annotation results do not include the last-level label results, the last-level label results are determined according to the data classification criteria, and the data annotation model is optimized based on the last-level label results.

[0020] In some optional embodiments of this example, after obtaining the multi-label hierarchical annotation results of the sample dataset, the method further includes:

[0021] Determine the text similarity between each sample data, and from the sample dataset, select sample data whose text similarity is greater than or equal to the preset similarity threshold as similar text data;

[0022] From the similar text data, those text data with the same multi-label hierarchical annotation results are considered as identical text data;

[0023] The proportion of the same text data in the similar text data is determined. In response to the proportion being less than a preset proportion threshold, the prompt statement is adjusted, and the data annotation model is optimized based on the adjusted prompt statement.

[0024] In some optional embodiments of this example, the hierarchical classification model includes multiple binary classifiers, each corresponding one-to-one with a hierarchical label. Training the hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation results includes:

[0025] Based on the sample dataset and the multi-label hierarchical annotation results, a specific dataset is determined for each binary classifier to train its model, and each binary classifier is trained based on the specific dataset; wherein, the label values of the specific dataset required by each binary classifier are different.

[0026] The step of determining the specific dataset required for model training of each binary classifier includes:

[0027] Based on the correspondence between the binary classifier and the hierarchical labels, in the multi-label hierarchical annotation results of the sample dataset, the annotation value of the sample data containing the hierarchical label corresponding to the binary classifier is set to 1; the annotation value of the sample data that does not contain the hierarchical label corresponding to the binary classifier is set to 0, thus obtaining the specific dataset required for the binary classifier to train the model.
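The per-classifier dataset construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_binary_targets` is hypothetical, and the second sample ("payer mobile number") and its labels are invented for the example.

```python
from typing import Dict, List, Set


def build_binary_targets(
    annotations: List[Set[str]],
    labels: List[str],
) -> Dict[str, List[int]]:
    """For each hierarchical label, build the 0/1 target column for its binary classifier.

    A sample gets 1 if its multi-label hierarchical annotation contains the
    label, and 0 otherwise.
    """
    return {
        label: [1 if label in annotation else 0 for annotation in annotations]
        for label in labels
    }


# Sample data items (the second one is a hypothetical illustration).
samples = [
    "receiving participating institution bank code",
    "payer mobile number",
]
# Multi-label hierarchical annotation results, one label per level.
annotations = [
    {"business", "transaction information",
     "general transaction information", "counterparty information"},
    {"customer", "individual",
     "personal natural information", "personal contact information"},
]
labels = ["customer", "business", "counterparty information"]

targets = build_binary_targets(annotations, labels)
for label, column in targets.items():
    print(label, column)
```

Each classifier sees the same sample texts but a different target column, which is why the label values of the specific datasets differ from one classifier to the next.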

[0028] In some optional embodiments of this example, the hierarchical labels include the first-level label, the second-level label, the third-level label, and the fourth-level label. Determining the classification result of the data to be classified based on the hierarchical classification model includes:

[0029] Based on the hierarchical classification model, the hierarchical classification results of the data to be classified are determined, wherein the hierarchical classification results include first-level label prediction values, second-level label prediction values, third-level label prediction values, and fourth-level label prediction values;

[0030] The actual attributes of the first-level labels corresponding to the binary classifier that outputs the first-level label prediction value, the actual attributes of the second-level labels corresponding to the binary classifier that outputs the second-level label prediction value, the actual attributes of the third-level labels corresponding to the binary classifier that outputs the third-level label prediction value, and the actual attributes of the fourth-level labels corresponding to the binary classifier that outputs the fourth-level label prediction value are determined respectively.

[0031] From the data grading criteria, the security levels of the actual attributes of the first-level tags, second-level tags, third-level tags, and fourth-level tags are determined as the grading results of the data to be graded.

[0032] In some optional embodiments of this example, the hierarchical classification model includes a first-level classification sub-model, a second-level classification sub-model, a third-level classification sub-model, and a fourth-level classification sub-model. The first-level classification sub-model includes N binary classifiers, where N is the number of first-level labels; the second-level classification sub-model includes M binary classifiers, where M is the number of second-level labels; the third-level classification sub-model includes P binary classifiers, where P is the number of third-level labels; and the fourth-level classification sub-model includes Q binary classifiers, where Q is the number of fourth-level labels.

[0033] The step of determining the hierarchical classification result of the data to be classified based on the hierarchical classification model includes:

[0034] In the first-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (N-1) binary classifiers are 0. The 1 is used as the first-level label prediction value.

[0035] In the second-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (M-1) binary classifiers are 0. The 1 is used as the second-level label prediction value.

[0036] In the third-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (P-1) binary classifiers are 0. The 1 is used as the predicted value of the third-level label.

[0037] In the fourth-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (Q-1) binary classifiers are 0. The 1 is used as the predicted value of the fourth-level label.
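The decoding rule above (exactly one positive classifier per level) can be sketched as follows. The function names and the dictionary encoding of classifier outputs are assumptions made for illustration; the source does not specify how violations of the one-positive rule are handled, so this sketch simply raises an error.

```python
from typing import Dict, List


def decode_level(outputs: Dict[str, int]) -> str:
    """Return the label whose binary classifier output 1 within one level's sub-model."""
    positive = [label for label, value in outputs.items() if value == 1]
    if len(positive) != 1:
        # Assumption: anything other than exactly one positive output is rejected.
        raise ValueError(f"expected exactly one positive output, got {len(positive)}")
    return positive[0]


def decode_hierarchy(level_outputs: List[Dict[str, int]]) -> List[str]:
    """Decode the first- to fourth-level sub-model outputs into one label per level."""
    return [decode_level(outputs) for outputs in level_outputs]


# Outputs of the binary classifiers in each sub-model (label subsets for brevity).
level_outputs = [
    {"customer": 0, "regulation": 0, "operation and management": 0, "business": 1},
    {"account information": 0, "transaction information": 1, "agreement information": 0},
    {"insurance fee information": 0, "general transaction information": 1},
    {"transaction amount information": 0, "counterparty information": 1},
]

predicted_path = decode_hierarchy(level_outputs)
print(predicted_path)
```

The decoded labels are the actual attributes from which the security level is then looked up in the data classification standard.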

[0038] According to a second aspect of this application, a data security classification device is also provided, comprising:

[0039] The data classification standard determination module is configured to determine the required domain knowledge and the data classification standard corresponding to the domain knowledge based on the data to be classified.

[0040] The hierarchical logical relationship determination module is configured to determine multi-level labels and the hierarchical logical relationships between each level of labels from the data classification criteria.

[0041] The data annotation module is configured to acquire a sample dataset and, based on the hierarchical logical relationship and the pre-trained data annotation model, perform multi-label hierarchical annotation on each sample data in the sample dataset to obtain the multi-label hierarchical annotation result of the sample dataset.

[0042] The training and grading module is configured to train a hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation results, and determine the grading result of the data to be graded based on the trained hierarchical classification model.
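The four modules of the device can be pictured as a simple composition. The sketch below is only one possible arrangement: the class name, field names, type aliases, and the stub functions are all hypothetical, and real modules would wrap a document parser, a large-model annotator, and trained classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Hierarchy = Dict[str, str]   # child label -> parent label (assumed encoding)
Annotation = List[str]       # one label per level, first level first


@dataclass
class DataSecurityClassificationApparatus:
    """Wires the four modules together in the order of the method steps."""
    determine_standard: Callable[[str], str]
    determine_hierarchy: Callable[[str], Hierarchy]
    annotate: Callable[[List[str], Hierarchy], List[Annotation]]
    train_and_grade: Callable[[List[str], List[Annotation], str], str]

    def classify(self, data_item: str, samples: List[str]) -> str:
        standard = self.determine_standard(data_item)
        hierarchy = self.determine_hierarchy(standard)
        annotations = self.annotate(samples, hierarchy)
        return self.train_and_grade(samples, annotations, data_item)


# Stub modules, used only to show the data flow between the four stages.
apparatus = DataSecurityClassificationApparatus(
    determine_standard=lambda item: "financial classification standard",
    determine_hierarchy=lambda standard: {"transaction information": "business"},
    annotate=lambda samples, hierarchy: [["business", "transaction information"]
                                         for _ in samples],
    train_and_grade=lambda samples, annotations, item: "placeholder grading result",
)

result = apparatus.classify(
    "receiving participating institution bank code",
    ["initiating participating institution bank code"],
)
print(result)
```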

[0043] According to a third aspect of this application, an electronic device is also provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the data security classification method.

[0044] According to a fourth aspect of this application, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the data security classification method.

[0045] This application provides a data security classification method and apparatus, proposing an automated implementation method for interpretable data security classification. It addresses the current situation where the work mainly relies on human experience and judgment, as well as the limitations of single-label classification prediction models that lack comprehensive knowledge consideration. This method can make data security classification work more efficient, improve the accuracy of data security classification, and promote better sharing and use of data.

Brief Description of the Drawings

[0046] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0047] Figure 1 is the first flowchart of a data security classification method according to an embodiment of this application;

[0048] Figure 2 is the second flowchart of a data security classification method according to an embodiment of this application;

[0049] Figure 3 is the first schematic diagram of the multi-level tags and hierarchical logical relationships according to an embodiment of this application;

[0050] Figure 4 is the second schematic diagram of the multi-level tags and hierarchical logical relationships according to an embodiment of this application;

[0051] Figure 5 is the third flowchart of a data security classification method according to an embodiment of this application;

[0052] Figure 6 is the fourth flowchart of a data security classification method according to an embodiment of this application;

[0053] Figure 7 is a schematic diagram of a hierarchical classification model according to an embodiment of this application;

[0054] Figure 8 is the fifth flowchart of a data security classification method according to an embodiment of this application;

[0055] Figure 9 is a schematic diagram of a data security classification device according to an embodiment of this application;

[0056] Figure 10 is a block diagram of an electronic device used to implement a data security classification method according to an embodiment of this application.

Detailed Description

[0057] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0058] One embodiment of this application provides a data security classification method. As shown in Figure 1, the method includes:

[0059] Step 101: Based on the data to be classified, determine the required domain knowledge and the corresponding data classification standards;

[0060] Step 102: Determine the multi-level tags and the hierarchical logical relationships between the tags from the data classification criteria;

[0061] Step 103: Obtain the sample dataset, and based on the hierarchical logical relationship and the pre-trained data annotation model, perform multi-label hierarchical annotation on each sample data in the sample dataset to obtain the multi-label hierarchical annotation result of the sample dataset;

[0062] Step 104: Based on the sample dataset and the multi-label hierarchical annotation results, train a hierarchical classification model, and determine the classification result of the data to be classified based on the trained hierarchical classification model.

[0063] This application provides a data security classification method, proposing an automated implementation approach for interpretable data security classification. This addresses the current situation where the work mainly relies on human experience and judgment, as well as the limitations of single-label classification prediction models that lack comprehensive knowledge consideration. This method can make data security classification work more efficient, improve the accuracy of data security classification, and promote better sharing and use of data.

[0064] Each step in Figure 1 is explained in detail below:

[0065] Step 101: Based on the data to be classified, determine the required domain knowledge and the corresponding data classification standards.

[0066] In this embodiment, to ensure the interpretability of data security classification, the required domain knowledge and the corresponding data classification standards are determined in advance based on the data type of the data to be classified. For example, if the data to be classified is "receiving participating institution bank code", the required domain knowledge is determined to be financial knowledge, and the corresponding data classification standard can be the industry standard "Guidelines for Financial Data Security Classification"; if the data to be classified is "social medical insurance card", the required domain knowledge is determined to be medical knowledge, and the corresponding data classification standard can be the industry standard formulated by the medical industry, and so on. These cases are not enumerated further in this application.

[0067] Step 102: Determine the multi-level labels and the hierarchical logical relationships between the labels from the data classification criteria.

[0068] In this embodiment, multi-level tags and the hierarchical logical relationships between the tags at each level can be determined from the data classification criteria. In some optional implementations of this embodiment, as shown in Figure 2, step 102 further includes:

[0069] Step 1021: Determine the set of primary tags according to the classification and subdivision in the data grading standard, wherein the set of primary tags includes multiple primary tags.

[0070] Step 1022: Take the set of tags corresponding to the child nodes of the first-level tag as the set of second-level tags corresponding to the first-level tag, wherein the set of second-level tags includes multiple second-level tags.

[0071] Step 1023: Take the set of tags corresponding to the child nodes of the second-level tag as the set of third-level tags corresponding to the second-level tag, wherein the set of third-level tags includes multiple third-level tags.

[0072] Step 1024: Take the set of tags corresponding to the child nodes of the third-level tag as the set of fourth-level tags corresponding to the third-level tag, wherein the set of fourth-level tags includes multiple fourth-level tags;

[0073] In this embodiment, as shown in Figure 3, the first-level tags (a1, a2, a3...aN), second-level tags (b1, b2, b3...bM), third-level tags (c1, c2, c3...cP), and fourth-level tags (d1, d2, d3...dQ) can be determined from the data classification criteria. All first-level tags are summarized into a first-level tag attribute set, all second-level tags are summarized into a second-level tag attribute set, all third-level tags are summarized into a third-level tag attribute set, and all fourth-level tags are summarized into a fourth-level tag attribute set.

[0074] The hierarchical logical relationship between the tags at each level is as follows: the second-level tag is a sub-tag of the first-level tag, the third-level tag is a sub-tag of the second-level tag, and the fourth-level tag is a sub-tag of the third-level tag. For example, as shown in Figure 3, the second-level tags b1 and b2 are child tags of the first-level tag a1; the third-level tags c1, c2, c3 and c4 are child tags of the second-level tag b1; and the fourth-level tags d1, d2, d3 and d4 are child tags of the third-level tag c3.
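As a rough illustration of how the per-level tag attribute sets can be collected from such a hierarchy, the sketch below encodes the Figure 3 example as a nested dictionary and walks it level by level. The nested-dictionary encoding and the function name `labels_by_level` are assumptions made for this example.

```python
from typing import Any, Dict, List

# The Figure 3 example: each tag maps to a dictionary of its child tags.
TREE: Dict[str, Any] = {
    "a1": {
        "b1": {
            "c1": {},
            "c2": {},
            "c3": {"d1": {}, "d2": {}, "d3": {}, "d4": {}},
            "c4": {},
        },
        "b2": {},
    },
}


def labels_by_level(tree: Dict[str, Any]) -> List[List[str]]:
    """Collect the tags of each level with a breadth-first walk over the tree."""
    levels: List[List[str]] = []
    frontier = [tree]
    while frontier:
        names: List[str] = []
        next_frontier = []
        for node in frontier:
            for name, children in node.items():
                names.append(name)
                if children:
                    next_frontier.append(children)
        if names:
            levels.append(names)
        frontier = next_frontier
    return levels


level_sets = labels_by_level(TREE)
for depth, names in enumerate(level_sets, start=1):
    print(f"level {depth}: {names}")
```

Because every child sits exactly one level below its parent, the position of a tag in the walk is also its level in the hierarchy.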

[0075] Taking the financial data classification standard, the "Guidelines for Financial Data Security Classification", as an example, as shown in Figure 4, the first-level labels include customer, regulation, operation and management, and business. Among the second-level labels, "individual" and "entity" are sub-labels of the first-level label "customer", and "account information", "transaction information", and "agreement information" are sub-labels of the first-level label "business". Among the third-level labels, "personal identity information", "personal information", "personal natural information", and "personal behavior information" are sub-labels of the second-level label "individual", and "insurance fee information" and "general transaction information" are sub-labels of the second-level label "transaction information". Among the fourth-level labels, "personal profile information", "personal property information", "personal contact information", and "personal health information" are sub-labels of the third-level label "personal natural information", and "transaction amount information", "counterparty information", "transaction settlement information", and "transaction accounting information" are sub-labels of the third-level label "general transaction information".

[0076] It should also be noted that if the multi-level labels in the data grading standards are not complete enough, the definitions and descriptions of the relevant labels need to be supplemented to make the multi-label system more complete.

[0077] Step 103: Obtain the sample dataset, and based on the hierarchical logical relationship and the pre-trained data annotation model, perform multi-label hierarchical annotation on each sample data in the sample dataset to obtain the multi-label hierarchical annotation result of the sample dataset.

[0078] In this embodiment, the prompt paradigm of a large natural language processing model is used as the data annotation model to automatically annotate the data, realize multi-label hierarchical annotation of the sample dataset, and obtain the multi-label hierarchical annotation result of the sample dataset.

[0079] In some optional implementations of this embodiment, as shown in Figure 5, the multi-label hierarchical annotation in step 103 above further includes:

[0080] Step 1031: Determine the prompt statements for each sample data based on the hierarchical logical relationship;

[0081] Step 1032: Based on the prompt statement and the data annotation model, perform multi-label hierarchical annotation on each sample data to obtain the multi-label hierarchical annotation results of each sample data.

[0082] In this embodiment, a Chinese large model platform can be selected for prompt verification. A document-understanding dialogue mode is used: a normative document for data security classification, such as the "Guidelines for Financial Data Security Classification", is uploaded, and the prompt statements drive a question-and-answer process that performs multi-label hierarchical annotation on each sample data.

[0083] In a specific example, the prompt statement could be: "Based on your understanding of the uploaded document 'Guidelines for Financial Data Security Classification', as a financial economics expert, determine which subdivision in Appendix B of the document 'receiving participating institution bank code' falls under." The prompt statement could also be: "Based on the hierarchical logical relationship between business, transaction information, general transaction information, and counterparty information, determine which subdivision in Appendix B of the 'Guidelines for Financial Data Security Classification' the data falls under," and so on. This application does not limit the prompt statement.
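Prompt statements of this kind can be generated from a template. The sketch below is an assumption about how such a template might look; the function name `build_prompt`, its parameters, and the exact wording are illustrative and are not prescribed by the application.

```python
from typing import Optional, Sequence


def build_prompt(
    data_item: str,
    document_title: str,
    expert_role: str = "financial economics expert",
    hierarchy_path: Optional[Sequence[str]] = None,
) -> str:
    """Build a prompt statement for one sample data item.

    If a hierarchy path is supplied, the prompt names the hierarchical
    logical relationship; otherwise it relies on the uploaded document alone.
    """
    if hierarchy_path:
        context = (
            "Based on the hierarchical logical relationship between "
            + ", ".join(hierarchy_path)
        )
    else:
        context = (
            f"Based on your understanding of the uploaded document "
            f"'{document_title}', as a {expert_role}"
        )
    return (
        f"{context}, determine which subdivision in Appendix B of "
        f"'{document_title}' the data item '{data_item}' falls under."
    )


title = "Guidelines for Financial Data Security Classification"
item = "receiving participating institution bank code"

plain_prompt = build_prompt(item, title)
hierarchy_prompt = build_prompt(
    item,
    title,
    hierarchy_path=[
        "business",
        "transaction information",
        "general transaction information",
        "counterparty information",
    ],
)
print(plain_prompt)
print(hierarchy_prompt)
```

Keeping the template in one function makes the later prompt adjustment a matter of changing the wording in a single place.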

[0084] Furthermore, based on this prompt and the data labeling model, multi-label hierarchical labeling can be performed on each sample data. Using the aforementioned prompt as an example, the data labeling model's response and analysis are as follows:

[0085] "The 'receiving participating institution bank code' belongs to the 'counterparty information' category among the fourth-level labels of the 'Guidelines for Financial Data Security Classification'.

[0086] In the 'Guidelines for Financial Data Security Classification', counterparty information refers to information related to a specific counterparty, such as the counterparty's name, address, and contact information. This information is crucial for financial institutions to identify and verify the identity of counterparties, understand their risk and creditworthiness, and conduct compliance audits.

[0087] Therefore, the 'receiving participating institution bank code', as information related to a specific counterparty, falls under the 'counterparty information' category among the fourth-level labels."

[0088] Thus, the annotation result of the sample data "receiving participating institution bank code" is obtained as "counterparty information" among the fourth-level labels. Based on the aforementioned hierarchical logical relationship, its first-level label, second-level label, and third-level label can also be determined, thereby obtaining the multi-label hierarchical annotation result of "receiving participating institution bank code".
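Deriving the higher-level labels from a fourth-level label amounts to walking up the hierarchy. A minimal sketch, assuming the hierarchical logical relationship is stored as a child-to-parent mapping (the encoding and the name `path_from_leaf` are illustrative):

```python
from typing import Dict, List, Optional

# Child label -> parent label; None marks a first-level label.
# Only the branch used in the example above is listed.
PARENT: Dict[str, Optional[str]] = {
    "business": None,
    "transaction information": "business",
    "general transaction information": "transaction information",
    "counterparty information": "general transaction information",
}


def path_from_leaf(leaf: str) -> List[str]:
    """Return the labels from the first level down to the given label."""
    path: List[str] = []
    current: Optional[str] = leaf
    while current is not None:
        path.append(current)
        current = PARENT[current]
    path.reverse()
    return path


annotation = path_from_leaf("counterparty information")
print(annotation)
```

The resulting list is the multi-label hierarchical annotation result for the sample: one label per level, ordered from the first level to the fourth.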

[0089] Step 1033: In response to the fact that the multi-label hierarchical annotation results do not include the last-level label results, determine the last-level label results according to the data classification criteria, and optimize the data annotation model based on the last-level label results.

[0090] It should be noted that the large model's response to a prompt may not reach the last classification level (the "Guidelines for Financial Data Security Classification", for example, has four levels). In that case, it is necessary to manually determine whether a fourth-level classification needs to be added. If so, it is added to the document, and the data annotation model's understanding of the document is updated and improved accordingly.
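The check that triggers this manual step can be sketched as a filter over the annotation results. The function name, the list-per-sample encoding, and the incomplete "payer name" annotation are hypothetical, chosen only to show the idea.

```python
from typing import Dict, List


def missing_last_level(annotations: Dict[str, List[str]], depth: int = 4) -> List[str]:
    """Return the sample data items whose annotation stops before the last level."""
    return [item for item, path in annotations.items() if len(path) < depth]


annotations = {
    "receiving participating institution bank code": [
        "business",
        "transaction information",
        "general transaction information",
        "counterparty information",
    ],
    # Hypothetical incomplete result: the model stopped at the third level.
    "payer name": [
        "business",
        "transaction information",
        "general transaction information",
    ],
}

to_review = missing_last_level(annotations)
print(to_review)
```

Items returned by the filter are the ones handed to a person, who decides the last-level label according to the data classification standard.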

[0091] To improve the accuracy of the data annotation model, after the multi-label hierarchical annotation results are obtained, the data annotation model can be fine-tuned based on those results. In some optional implementations of this embodiment, the fine-tuning proceeds as shown in Figure 6:

[0092] Step 105: Determine the text similarity between each sample data, and from the sample dataset, select sample data whose text similarity is greater than or equal to a preset similarity threshold as similar text data.

[0093] Step 106: From the similar text data, select the similar text data with the same multi-label hierarchical annotation result as the same text data.

[0094] Step 107: Determine the proportion of the same text data in the similar text data; in response to the proportion being less than a preset proportion threshold, adjust the prompt statement and optimize the data annotation model based on the adjusted prompt statement.

[0095] In this embodiment, taking the aforementioned financial field as an example, since structured financial data is a description of the database table structure, the terminology is relatively concise. In similar scenarios or systems, the repetition of descriptive terms is high, such as the bank code of the receiving participating institution and the bank code of the initiating participating institution, the name of the payer and the name of the payee, etc.

[0096] For example, the similarity of structured financial data can be calculated using word frequency-based text matching algorithms such as the cosine similarity algorithm or the Jaccard similarity algorithm. The preset similarity threshold can be set to, for example, 70% or 65%. When the text similarity between sample data is greater than or equal to the preset similarity threshold, these sample data are regarded as similar text data. It should be understood that similar text data are highly likely to have the same multi-label hierarchical annotation results.
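The two similarity measures named above can be sketched as follows. This is an illustrative sketch, assuming whitespace tokenization of the English renderings of the field descriptions (the original descriptions are Chinese, where a different tokenization would be needed); the function names are hypothetical.

```python
import math
from collections import Counter

def tokens(text):
    """Split a description into lowercase word tokens (assumed tokenization)."""
    return text.lower().split()

def jaccard(a, b):
    """Jaccard similarity: size of the shared vocabulary over the combined vocabulary."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine similarity between word-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

x = "bank code of the receiving participating institution"
y = "bank code of the initiating participating institution"
j = jaccard(x, y)   # 6 shared words out of 8 distinct words
c = cosine(x, y)    # 6 shared words, each vector has 7 words of frequency 1
print(round(j, 3), round(c, 3))
```

With a preset similarity threshold of 70%, both measures would treat these two descriptions as similar text data.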

[0097] Therefore, it is also necessary to identify similar text data with identical multi-label hierarchical annotation results from the similar text data and determine the proportion of such identical text data within the overall similar text data. In other words, it is necessary to determine whether the multi-label hierarchical annotation results of similar text data with text similarity above a preset similarity threshold are identical, and to calculate the proportion of identical results in the total.
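One way to compute this proportion is sketched below. The text does not specify whether the proportion is counted over pairs or over individual samples; this sketch assumes it is counted over pairs of similar samples. The third description ("sending"), all annotation values, and the function names are made up for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Word-level Jaccard similarity (assumed tokenization: whitespace)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def consistency_ratio(annotations, sim_threshold=0.70):
    """Among pairs of similar samples, return the share with identical annotations."""
    similar = [(a, b) for a, b in combinations(annotations, 2)
               if jaccard(a, b) >= sim_threshold]
    if not similar:
        return None  # no similar pairs, nothing to compare
    same = sum(1 for a, b in similar if annotations[a] == annotations[b])
    return same / len(similar)

# Hypothetical annotation results (label paths shortened to tuples of placeholder names).
annotations = {
    "bank code of the receiving participating institution": ("a1", "b2", "c2", "d2"),
    "bank code of the initiating participating institution": ("a1", "b2", "c2", "d2"),
    "bank code of the sending participating institution": ("a1", "b2", "c1", "d1"),
    "name of the payer": ("a1", "b2", "c2", "d2"),
}

ratio = consistency_ratio(annotations)
PROPORTION_THRESHOLD = 0.80  # example value given in the text
needs_prompt_adjustment = ratio is not None and ratio < PROPORTION_THRESHOLD
print(ratio, needs_prompt_adjustment)
```

In this made-up data, three pairs are similar and only one of them carries identical annotations, so the proportion falls below the threshold and the prompt statement would be adjusted.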

[0098] Finally, when the proportion is less than the preset proportion threshold, the prompt statement is adjusted, and the data annotation model is optimized based on the adjusted prompt statement. For example, the preset proportion threshold can be 80%; if the proportion is less than 80%, the prompt statement needs to be readjusted. This process is repeated until the fine-tuning of the large model's prompt statement is complete. Based on the adjusted prompt statement, the automatic annotation of the sample dataset is completed. Table 1 shows some examples of the multi-label hierarchical annotation results obtained by automatically annotating the sample dataset.

[0099] Table 1

[0100]

[0101] Step 104: Based on the sample dataset and the multi-label hierarchical annotation results, train a hierarchical classification model, and determine the classification result of the data to be classified based on the trained hierarchical classification model.

[0102] In this embodiment, the training of the hierarchical classification model is introduced first. As shown in Figure 7, the hierarchical classification model includes multiple binary classifiers, each of which corresponds one-to-one with a hierarchical label. In other words, each binary classifier uniquely corresponds to one hierarchical label, where the hierarchical labels include first-level labels, second-level labels, third-level labels, and fourth-level labels.

[0103] In this embodiment, training the hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation results includes:

[0104] The model is expected to be trained to output a set of labels based on the input Chinese description, and these labels are hierarchical. Using traditional neural networks and pre-trained language models, the hierarchical relationship is handled by combining a multi-label prediction model with a check of the relationship between upper and lower levels, thereby obtaining the hierarchical classification result.

[0105] Based on the sample dataset and the multi-label hierarchical annotation results, a specific dataset is determined for each binary classifier to train its model, and each binary classifier is trained based on the specific dataset; wherein the label values of the specific dataset required for each binary classifier are different.

[0106] Specifically, the step of determining the specific dataset required for model training of each binary classifier includes:

[0107] Based on the correspondence between the binary classifier and the hierarchical labels, in the multi-label hierarchical annotation results of the sample dataset, the annotation value of the sample data containing the hierarchical label corresponding to the binary classifier is set to 1; the annotation value of the sample data that does not contain the hierarchical label corresponding to the binary classifier is set to 0, thus obtaining the specific dataset required for the binary classifier to train the model.

[0108] In other words, each binary classifier in the hierarchical classification model is trained, and a dataset is constructed for training each binary classifier model based on the sample dataset. The labeled values of the dataset used for training each binary classifier are different.

[0109] The dataset construction process for each binary classifier includes:

[0110] The dataset annotation values are generated based on the correspondence between the binary classifier and the hierarchical labels. In the multi-label hierarchical annotation results of the sample dataset, the annotation value is set to 1 when the data item contains the unique label corresponding to the binary classifier, and the annotation value is set to 0 when it does not contain this label. Thus, the specific dataset of the binary classifier is obtained, and the binary classifier is trained based on the specific dataset.

[0111] For example, as shown in Figure 7, if the binary classifier corresponding to the first-level label a1 is A1, then the data in the sample dataset containing the first-level label a1 are labeled 1 and all the rest are labeled 0, which forms the specific dataset for binary classifier A1; if the binary classifier corresponding to the second-level label b1 is B1, then the data in the sample dataset containing the second-level label b1 are labeled 1 and all the rest are labeled 0, which forms the specific dataset for binary classifier B1. The same applies to the other binary classifiers, which will not be described in detail here.
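The construction of the classifier-specific datasets can be sketched as follows. This is a minimal illustration: the sample names `x1` to `x3`, the labels other than a1 and b1, and the function name `build_binary_datasets` are hypothetical.

```python
def build_binary_datasets(annotations):
    """For every hierarchical label, build a dataset with annotation value 1 or 0.

    annotations maps each sample to its multi-label hierarchical annotation result.
    A sample gets 1 in a label's dataset if its annotation contains that label, else 0.
    """
    all_labels = sorted({label for path in annotations.values() for label in path})
    return {
        label: [(sample, 1 if label in path else 0) for sample, path in annotations.items()]
        for label in all_labels
    }

# Hypothetical two-level annotation results for three samples.
annotations = {
    "x1": ["a1", "b1"],
    "x2": ["a1", "b2"],
    "x3": ["a2", "b3"],
}

datasets = build_binary_datasets(annotations)
print(datasets["a1"])  # training data for binary classifier A1
print(datasets["b1"])  # training data for binary classifier B1
```

The same samples appear in every dataset; only the annotation values differ from one binary classifier to the next.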

[0112] In some optional embodiments of this example, the hierarchical labels include the first-level label, the second-level label, the third-level label, and the fourth-level label. As shown in Figure 8, determining the classification result of the data to be classified based on the hierarchical classification model includes:

[0113] Step 1041: Based on the hierarchical classification model, determine the hierarchical classification result of the data to be classified, wherein the hierarchical classification result includes the predicted value of the first-level label, the predicted value of the second-level label, the predicted value of the third-level label, and the predicted value of the fourth-level label.

[0114] In some optional embodiments of this example, the hierarchical classification model includes a first-level classification sub-model, a second-level classification sub-model, a third-level classification sub-model, and a fourth-level classification sub-model. Taking the data to be classified as input, the model can obtain the output at each level, such as the first-level label prediction value, the second-level label prediction value, the third-level label prediction value, and the fourth-level label prediction value.

[0115] In some optional embodiments of this example, the first-level classification sub-model includes N binary classifiers, where N is the number of first-level labels; the second-level classification sub-model includes M binary classifiers, where M is the number of second-level labels; the third-level classification sub-model includes P binary classifiers, where P is the number of third-level labels; and the fourth-level classification sub-model includes Q binary classifiers, where Q is the number of fourth-level labels.

[0116] The step of determining the hierarchical classification result of the data to be classified based on the hierarchical classification model includes:

[0117] In the first-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (N-1) binary classifiers are 0. The 1 is used as the first-level label prediction value.

[0118] In the second-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (M-1) binary classifiers are 0. The 1 is used as the predicted value of the second-level label.

[0119] In the third-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (P-1) binary classifiers are 0. The 1 is used as the predicted value of the third-level label.

[0120] In the fourth-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (Q-1) binary classifiers are 0. The 1 is used as the predicted value of the fourth-level label.
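The text states that exactly one binary classifier per level outputs 1, but does not say how this is enforced. One possible way, assumed here purely for illustration, is to set the highest-scoring classifier of each level to 1 and the rest to 0. The scores and the function name `one_hot` are hypothetical.

```python
def one_hot(scores):
    """Set the highest-scoring classifier to 1 and all others to 0."""
    winner = max(scores, key=scores.get)
    return {name: int(name == winner) for name in scores}

# Hypothetical scores from the binary classifiers of each of the four levels.
level_scores = [
    {"A1": 0.91, "A2": 0.12},
    {"B1": 0.20, "B2": 0.85, "B3": 0.40},
    {"C1": 0.33, "C2": 0.77},
    {"D1": 0.05, "D2": 0.64, "D3": 0.58},
]

outputs = [one_hot(scores) for scores in level_scores]
winners = [name for out in outputs for name, value in out.items() if value == 1]
print(winners)
```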

[0121] It should be noted that, to handle the relationship between label levels, and given the importance of initialization parameters to a binary classifier, the model parameters of the parent label are used as initialization values when fine-tuning the model of the child label.
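The parent-to-child initialization can be sketched as copying the parent's parameters before fine-tuning the child. This is a simplified stand-in: the class `BinaryClassifier`, its dictionary of parameters, and the single update step are assumptions, and a real model would hold its parameters in a deep learning framework.

```python
import copy

class BinaryClassifier:
    """Minimal stand-in for one per-label classifier (hypothetical structure)."""

    def __init__(self, label, params=None):
        self.label = label
        # Parameters are kept as a plain dict of name -> list of floats.
        self.params = params if params is not None else {}

def init_child_from_parent(parent, child_label):
    """Create the child-label classifier with a deep copy of the parent's parameters."""
    return BinaryClassifier(child_label, copy.deepcopy(parent.params))

parent = BinaryClassifier("transaction information", {"w": [0.4, -0.2, 0.7], "b": [0.1]})
child = init_child_from_parent(parent, "general transaction information")

# Fine-tuning updates only the child's copy; a single made-up update step:
child.params["w"][0] += 0.05
print(parent.params["w"], child.params["w"])
```

The deep copy keeps the parent's trained parameters unchanged while the child is fine-tuned on its own specific dataset.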

[0122] Step 1042: Determine the actual attributes of the first-level labels corresponding to the binary classifier that outputs the first-level label prediction value, the actual attributes of the second-level labels corresponding to the binary classifier that outputs the second-level label prediction value, the actual attributes of the third-level labels corresponding to the binary classifier that outputs the third-level label prediction value, and the actual attributes of the fourth-level labels corresponding to the binary classifier that outputs the fourth-level label prediction value.

[0123] Taking "Payee Name" as an example of data to be classified, in the first-level classification sub-model, if the output of binary classifier A1 is 1, then the actual attribute of the first-level label corresponding to binary classifier A1 needs to be determined, such as business; in the second-level classification sub-model, if the output of binary classifier B2 is 1, then the actual attribute of the second-level label corresponding to binary classifier B2 needs to be determined, such as transaction information; in the third-level classification sub-model, if the output of binary classifier C2 is 1, then the actual attribute of the third-level label corresponding to binary classifier C2 needs to be determined, such as general transaction information; in the fourth-level classification sub-model, if the output of binary classifier D2 is 1, then the actual attribute of the fourth-level label corresponding to binary classifier D2 needs to be determined, such as counterparty information.
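Step 1042 amounts to a lookup from the classifier that output 1 to the label it represents. The sketch below uses the classifier names and label attributes from the "Payee Name" example; the mapping table `CLASSIFIER_TO_LABEL`, the other classifier names, and the function `resolve_attributes` are hypothetical.

```python
# Mapping from binary classifier to the actual attribute of its label.
# Only the entries mentioned in the "Payee Name" example are filled in.
CLASSIFIER_TO_LABEL = {
    "A1": "business",
    "B2": "transaction information",
    "C2": "general transaction information",
    "D2": "counterparty information",
}

def resolve_attributes(outputs_per_level):
    """Return the label attribute of the single classifier that output 1 at each level."""
    attributes = []
    for outputs in outputs_per_level:
        active = [name for name, value in outputs.items() if value == 1]
        if len(active) != 1:
            raise ValueError("expected exactly one classifier to output 1 per level")
        attributes.append(CLASSIFIER_TO_LABEL[active[0]])
    return attributes

# Outputs of the four sub-models for "Payee Name" (values other than the 1s are illustrative).
payee_name_outputs = [
    {"A1": 1, "A2": 0},
    {"B1": 0, "B2": 1},
    {"C1": 0, "C2": 1},
    {"D1": 0, "D2": 1},
]

attributes = resolve_attributes(payee_name_outputs)
print(attributes)
```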

[0124] Step 1043: Determine the security level of the actual attributes of the first-level label, the second-level label, the third-level label, and the fourth-level label from the data classification criteria, and use this as the classification result of the data to be classified.

[0125] After using the trained hierarchical classification model to perform multi-label hierarchical classification prediction on the data to be classified and obtaining the hierarchical classification results, it is also necessary to determine the security level mapped by the hierarchical classification results according to the data classification standards, that is, to realize the quantitative mapping of security classification.

[0126] For example, code can be written to implement the mapping between multi-label hierarchical classification and security classification according to the typical data classification rules of financial institutions. The mapping provides a reliable quantitative basis: it takes the hierarchical classification result of the data to be classified as input and outputs the security level. For example, as shown in Table 2 below, when the actual attribute of the first-level label is business, the actual attribute of the second-level label is transaction information, the actual attribute of the third-level label is general transaction information, and the actual attribute of the fourth-level label is counterparty information, the security level is 3.

[0127] Table 2

[0128]
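The mapping from a hierarchical classification result to a security level can be sketched as a table lookup. Only the single rule stated in the text is included; the names `SECURITY_LEVEL_RULES` and `security_level` are hypothetical, and a full rule table would follow the classification rules for typical data from financial institutions.

```python
# Rule table keyed by (first-level, second-level, third-level, fourth-level) attributes.
# Only the row described in the text is included here.
SECURITY_LEVEL_RULES = {
    (
        "business",
        "transaction information",
        "general transaction information",
        "counterparty information",
    ): 3,
}

def security_level(attributes):
    """Return the security level for a four-level label path, or None if no rule matches."""
    return SECURITY_LEVEL_RULES.get(tuple(attributes))

level = security_level([
    "business",
    "transaction information",
    "general transaction information",
    "counterparty information",
])
print(level)
```

Returning `None` for an unmatched path makes missing rules visible instead of silently assigning a level.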

[0129] It should be noted that, as seen in the "Guidelines for Data Security Classification in the Financial Industry," the principles for data security classification are conceptual and macro-level. Appendix B provides a reference table of classification rules for typical data from financial institutions to guide the work. The implementation approach of this application is to refer to the classification rules for typical data from financial institutions, classify the data to be classified according to the categorization and subdivision of typical data from financial institutions, map the data to be classified to the classification rules for typical data from financial institutions, and determine the data security level of the data to be classified based on the minimum security level reference items given in the classification rules for typical data from financial institutions. This approach automatically maps the data to be classified to the rules for typical data from financial institutions, thus providing a reliable basis for the classification results and making them interpretable.

[0130] The implementation process of this application involves automatically labeling the dataset, using a hierarchical classification model for text classification to predict labels, completing data classification, obtaining hierarchical classification results, and then writing code to determine the security classification of the data to be classified based on the correspondence of rules, ultimately obtaining the classification result.

[0131] In summary, this application implements a data security classification method that is an interpretable and automated approach. It addresses the limitations of existing methods that rely primarily on human experience and judgment, as well as the lack of comprehensive knowledge consideration in single-label classification prediction models. This method enables data security classification to be more efficient, improves its accuracy, and promotes better sharing and use of data.

[0132] Based on the same inventive concept, this application also provides a data security classification device, which can be used to implement the method described in the above embodiments, as described in the following embodiments. Since the principle by which this data security classification device solves the problem is similar to that of a data security classification method, the implementation of a data security classification device can refer to the implementation of a data security classification method, and repeated details will not be elaborated further. As used below, the terms "unit" or "module" can refer to a combination of software and / or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.

[0133] As shown in Figure 9, the data security classification device includes:

[0134] The data classification standard determination module 901 is configured to determine the required domain knowledge and the data classification standard corresponding to the domain knowledge based on the data to be classified.

[0135] The hierarchical logical relationship determination module 902 is configured to determine multi-level labels and hierarchical logical relationships between each level of labels from the data classification criteria.

[0136] The data annotation module 903 is configured to acquire a sample dataset and, based on the hierarchical logical relationship and the pre-trained data annotation model, perform multi-label hierarchical annotation on each sample data in the sample dataset to obtain the multi-label hierarchical annotation result of the sample dataset.

[0137] The training and classification module 904 is configured to train a hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation results, and determine the classification result of the data to be classified based on the trained hierarchical classification model.

[0138] In some optional embodiments of this example, the hierarchical logical relationship determination module is further configured as follows:

[0139] Based on the classification and subdivision in the data grading criteria, a set of primary tags is determined, wherein the set of primary tags includes multiple primary tags;

[0140] The set of tags corresponding to the child nodes of the first-level tag is taken as the set of second-level tags corresponding to the first-level tag, wherein the set of second-level tags includes multiple second-level tags;

[0141] The set of tags corresponding to the child nodes of the second-level tag is taken as the set of third-level tags corresponding to the second-level tag, wherein the set of third-level tags includes multiple third-level tags;

[0142] The set of tags corresponding to the child nodes of the third-level tag is taken as the set of fourth-level tags corresponding to the third-level tag, wherein the set of fourth-level tags includes multiple fourth-level tags;

[0143] The hierarchical logical relationship between the tags at each level is as follows: the second-level tag is a sub-tag of the first-level tag, the third-level tag is a sub-tag of the second-level tag, and the fourth-level tag is a sub-tag of the third-level tag.

[0144] In some optional embodiments of this example, the data annotation module is further configured as follows:

[0145] Based on the hierarchical logical relationship, determine the prompt statements for each sample data;

[0146] Based on the prompt statement and the data annotation model, multi-label hierarchical annotation is performed on each sample data to obtain the multi-label hierarchical annotation results of each sample data.

[0147] In response to the fact that the multi-label hierarchical annotation results do not include the last-level label results, the last-level label results are determined according to the data classification criteria, and the data annotation model is optimized based on the last-level label results.

[0148] In some optional embodiments of this invention, the apparatus further includes an optimization module, which is configured to:

[0149] Determine the text similarity between each sample data, and from the sample dataset, select sample data whose text similarity is greater than or equal to the preset similarity threshold as similar text data;

[0150] From the similar text data, those text data with the same multi-label hierarchical annotation results are considered as identical text data;

[0151] The proportion of the same text data in the similar text data is determined. In response to the proportion being less than a preset proportion threshold, the prompt statement is adjusted, and the data annotation model is optimized based on the adjusted prompt statement.

[0152] In some optional embodiments of this example, the hierarchical classification model includes multiple binary classifiers, each corresponding one-to-one with a hierarchical label, and the training and classification module includes:

[0153] The training unit is configured to determine the specific dataset required for model training of each binary classifier based on the sample dataset and the multi-label hierarchical annotation results, and to train each binary classifier based on the specific dataset; wherein the label values of the specific dataset required by each binary classifier are different;

[0154] The step of determining the specific dataset required for model training of each binary classifier includes:

[0155] Based on the correspondence between the binary classifier and the hierarchical labels, in the multi-label hierarchical annotation results of the sample dataset, the annotation value of the sample data containing the hierarchical label corresponding to the binary classifier is set to 1; the annotation value of the sample data that does not contain the hierarchical label corresponding to the binary classifier is set to 0, thus obtaining the specific dataset required for the binary classifier to train the model.

[0156] In some optional embodiments of this example, the hierarchical labels include the first-level labels, the second-level labels, the third-level labels, and the fourth-level labels, and the training and classification module further includes:

[0157] A hierarchical classification unit is configured to determine the hierarchical classification result of the data to be classified based on the hierarchical classification model, wherein the hierarchical classification result includes first-level label prediction values, second-level label prediction values, third-level label prediction values, and fourth-level label prediction values;

[0158] The label attribute determination unit is configured to determine the actual first-level label attribute corresponding to the binary classifier that outputs the first-level label prediction value, the actual second-level label attribute corresponding to the binary classifier that outputs the second-level label prediction value, the actual third-level label attribute corresponding to the binary classifier that outputs the third-level label prediction value, and the actual fourth-level label attribute corresponding to the binary classifier that outputs the fourth-level label prediction value.

[0159] The rating unit is configured to determine the security level of the actual attributes of the first-level tag, the actual attributes of the second-level tag, the actual attributes of the third-level tag, and the actual attributes of the fourth-level tag from the data rating criteria, and use this as the rating result of the data to be rated.

[0160] In some optional embodiments of this example, the hierarchical classifier includes a first-level classification sub-model, a second-level classification sub-model, a third-level classification sub-model, and a fourth-level classification sub-model. The first-level classification sub-model includes N binary classifiers, where N is the number of first-level labels; the second-level classification sub-model includes M binary classifiers, where M is the number of second-level labels; the third-level classification sub-model includes P binary classifiers, where P is the number of third-level labels; and the fourth-level classification sub-model includes Q binary classifiers, where Q is the number of fourth-level labels.

[0161] The step of determining the hierarchical classification result of the data to be classified based on the hierarchical classification model includes:

[0162] In the first-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (N-1) binary classifiers are 0. The 1 is used as the first-level label prediction value.

[0163] In the second-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (M-1) binary classifiers are 0. The 1 is used as the predicted value of the second-level label.

[0164] In the third-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (P-1) binary classifiers are 0. The 1 is used as the predicted value of the third-level label.

[0165] In the fourth-level classification sub-model, only one binary classifier outputs 1, while the outputs of the remaining (Q-1) binary classifiers are 0. The 1 is used as the predicted value of the fourth-level label.

[0166] According to embodiments of this disclosure, this disclosure also provides an electronic device and a readable storage medium.

[0167] An electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of a data security classification method according to the foregoing embodiments.

[0168] A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the steps of a data security classification method according to the foregoing embodiments.

[0169] Figure 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0170] As shown in Figure 10, device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 1002 or a computer program loaded from storage unit 1008 into random access memory (RAM) 1003. The RAM 1003 may also store various programs and data required for the operation of device 1000. The computing unit 1001, ROM 1002, and RAM 1003 are interconnected via bus 1004. Input / output (I / O) interface 1005 is also connected to bus 1004.

[0171] Multiple components in device 1000 are connected to I / O interface 1005, including: input unit 1006, such as keyboard, mouse, etc.; output unit 1007, such as various types of monitors, speakers, etc.; storage unit 1008, such as disk, optical disk, etc.; and communication unit 1009, such as network card, modem, wireless transceiver, etc. Communication unit 1009 allows device 1000 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0172] The computing unit 1001 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as a data security classification method.

[0173] For example, in some embodiments, a data security classification method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 1008. In some embodiments, part or all of the computer program may be loaded and / or installed on device 1000 via ROM 1002 and / or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the data security classification method described above may be performed. Alternatively, in other embodiments, computing unit 1001 may be configured to perform a data security classification method by any other suitable means (e.g., by means of firmware).

[0174] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0175] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0176] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0177] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0178] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0179] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.

[0180] It should be understood that steps can be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this disclosure is achieved; no limitation is imposed herein.

[0181] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A data security classification method, characterized in that it includes: determining, based on the data to be classified, the required domain knowledge and the data classification standard corresponding to the domain knowledge; determining, from the data classification standard, multi-level labels and the hierarchical logical relationships between the labels of each level, wherein the hierarchical logical relationships are as follows: a second-level label is a sub-label of a first-level label, a third-level label is a sub-label of a second-level label, and a fourth-level label is a sub-label of a third-level label; obtaining a sample dataset and, based on the hierarchical logical relationships and a pre-trained data annotation model, performing multi-label hierarchical annotation on each sample in the sample dataset to obtain the multi-label hierarchical annotation results of the sample dataset; and training a hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation results, and determining the classification result of the data to be classified based on the trained hierarchical classification model; wherein the step of performing multi-label hierarchical annotation on each sample in the sample dataset based on the hierarchical logical relationships and the pre-trained data annotation model includes: determining a prompt statement for each sample based on the hierarchical logical relationships; uploading the normative document for data security classification and, based on the prompt statement and the data annotation model, performing multi-label hierarchical annotation on each sample in a question-and-answer manner to obtain the multi-label hierarchical annotation result of each sample; and, in response to the multi-label hierarchical annotation results not including a last-level label result, determining the last-level label result according to the data classification standard and optimizing the data annotation model based on the last-level label result; and wherein, after the multi-label hierarchical annotation results of the sample dataset are obtained, the method further includes: calculating the similarity of structured financial data using the cosine similarity algorithm or the Jaccard similarity algorithm; when the text similarity is greater than or equal to a preset similarity threshold, treating the corresponding samples as similar text data; among the similar text data, treating those with the same multi-label hierarchical annotation results as identical text data and determining the proportion of identical text data in the similar text data; and, when the proportion is less than a preset proportion threshold, adjusting the prompt statement and optimizing the data annotation model based on the adjusted prompt statement.
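
As a non-limiting illustration, the annotation consistency check recited at the end of claim 1 might be sketched as follows. The sketch assumes pairwise comparison, Jaccard similarity over whitespace-separated tokens, and label paths stored as tuples; the function names and the threshold values are hypothetical and are not taken from the disclosure.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated token sets (assumed tokenization)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def consistency_ratio(samples, labels, sim_threshold=0.8):
    """Among sample pairs whose similarity reaches the threshold, return the
    fraction whose annotation results are identical; None if no pair is similar."""
    similar = consistent = 0
    for i, j in combinations(range(len(samples)), 2):
        if jaccard(samples[i], samples[j]) >= sim_threshold:
            similar += 1
            if labels[i] == labels[j]:
                consistent += 1
    return consistent / similar if similar else None


def needs_prompt_adjustment(samples, labels, sim_threshold=0.8, ratio_threshold=0.9):
    """True when similar samples agree on their labels less often than the preset ratio."""
    ratio = consistency_ratio(samples, labels, sim_threshold)
    return ratio is not None and ratio < ratio_threshold


if __name__ == "__main__":
    samples = [
        "customer account balance",
        "customer account balance total",
        "server log entry",
    ]
    labels = [("Finance", "Account"), ("Finance", "Transaction"), ("Ops", "Log")]
    print(needs_prompt_adjustment(samples, labels, sim_threshold=0.7))
```

In this sketch a `True` result corresponds to the condition under which the claim adjusts the prompt statement; how the prompt is adjusted is left open.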

2. The method according to claim 1, characterized in that the step of determining, from the data classification standard, the multi-level labels and the hierarchical logical relationships between the labels of each level includes: determining a set of first-level labels based on the classification and subdivision in the data classification standard, wherein the set of first-level labels includes multiple first-level labels; taking the set of labels corresponding to the child nodes of a first-level label as the set of second-level labels corresponding to that first-level label, wherein the set of second-level labels includes multiple second-level labels; taking the set of labels corresponding to the child nodes of a second-level label as the set of third-level labels corresponding to that second-level label, wherein the set of third-level labels includes multiple third-level labels; and taking the set of labels corresponding to the child nodes of a third-level label as the set of fourth-level labels corresponding to that third-level label, wherein the set of fourth-level labels includes multiple fourth-level labels.
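
As a non-limiting illustration, the derivation of per-level label sets from the child nodes of a classification standard might be sketched as follows. The `STANDARD` excerpt, its label names, and the `build_levels` helper are hypothetical; the sketch assumes the standard has already been parsed into a nested structure four levels deep.

```python
# Hypothetical excerpt of a classification standard, nested by level.
# Dict keys are levels 1 to 3; the innermost lists hold level-4 labels.
STANDARD = {
    "Customer": {
        "Personal": {
            "Identity": ["ID number", "Passport number"],
            "Contact": ["Phone number"],
        },
    },
    "Business": {
        "Transaction": {
            "Payment": ["Card number"],
        },
    },
}


def build_levels(tree, max_depth=4):
    """Return (levels, parent).

    levels[k] lists the label paths at level k + 1; parent maps each label
    path to the path of its parent label (None for first-level labels).
    Paths are tuples so that equal names under different parents stay distinct.
    """
    levels = [[] for _ in range(max_depth)]
    parent = {}

    def walk(node, path):
        depth = len(path)
        for name in node:  # iterates dict keys or list items
            child_path = path + (name,)
            levels[depth].append(child_path)
            parent[child_path] = path or None
            if isinstance(node, dict):
                walk(node[name], child_path)

    walk(tree, ())
    return levels, parent


if __name__ == "__main__":
    levels, parent = build_levels(STANDARD)
    for depth, paths in enumerate(levels, start=1):
        print(depth, [p[-1] for p in paths])
```

Storing the parent of each path keeps the sub-label relationship of claim 1 available for later steps, such as building prompt statements.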

3. The method according to claim 2, characterized in that the hierarchical classification model includes multiple binary classifiers, each corresponding one-to-one with a hierarchical label; the step of training the hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation results includes: determining, based on the sample dataset and the multi-label hierarchical annotation results, the specific dataset that each binary classifier requires for model training, and training each binary classifier on its specific dataset, wherein the label values of the specific datasets required by the binary classifiers differ from one another; and the step of determining the specific dataset required for model training of each binary classifier includes: based on the correspondence between the binary classifier and the hierarchical labels, setting, in the multi-label hierarchical annotation results of the sample dataset, the annotation value of each sample that contains the hierarchical label corresponding to the binary classifier to 1 and the annotation value of each sample that does not contain that hierarchical label to 0, thereby obtaining the specific dataset required for training the binary classifier.

4. The method according to claim 3, characterized in that the hierarchical labels include the first-level labels, the second-level labels, the third-level labels, and the fourth-level labels; determining the classification result of the data to be classified based on the hierarchical classification model includes: determining, based on the hierarchical classification model, the hierarchical classification results of the data to be classified, wherein the hierarchical classification results include a first-level label prediction value, a second-level label prediction value, a third-level label prediction value, and a fourth-level label prediction value; determining, respectively, the actual attribute of the first-level label corresponding to the binary classifier that outputs the first-level label prediction value, the actual attribute of the second-level label corresponding to the binary classifier that outputs the second-level label prediction value, the actual attribute of the third-level label corresponding to the binary classifier that outputs the third-level label prediction value, and the actual attribute of the fourth-level label corresponding to the binary classifier that outputs the fourth-level label prediction value; and determining, from the data classification standard, the security levels of the actual attributes of the first-level, second-level, third-level, and fourth-level labels as the classification result of the data to be classified.
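
As a non-limiting illustration, the final lookup of a security level from the predicted labels might be sketched as follows. The sketch assumes that the standard assigns one security level to each complete four-level label path; the `SECURITY_LEVEL` table, its values, and the `security_level` helper are hypothetical.

```python
# Hypothetical table: full four-level label path -> security level.
SECURITY_LEVEL = {
    ("Customer", "Personal", "Identity", "ID number"): 4,
    ("Customer", "Personal", "Contact", "Phone number"): 3,
    ("Business", "Transaction", "Payment", "Card number"): 4,
}


def security_level(path):
    """Return the security level for a predicted four-level label path."""
    key = tuple(path)
    if len(key) != 4:
        raise ValueError("expected one label for each of the four levels")
    if key not in SECURITY_LEVEL:
        raise KeyError(f"label path not found in the standard: {key}")
    return SECURITY_LEVEL[key]


if __name__ == "__main__":
    print(security_level(["Customer", "Personal", "Contact", "Phone number"]))
```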

5. The method according to claim 4, characterized in that the hierarchical classification model includes a first-level classification sub-model, a second-level classification sub-model, a third-level classification sub-model, and a fourth-level classification sub-model; the first-level classification sub-model includes N binary classifiers, where N is the number of first-level labels; the second-level classification sub-model includes M binary classifiers, where M is the number of second-level labels; the third-level classification sub-model includes P binary classifiers, where P is the number of third-level labels; and the fourth-level classification sub-model includes Q binary classifiers, where Q is the number of fourth-level labels; the step of determining the hierarchical classification results of the data to be classified based on the hierarchical classification model includes: in the first-level classification sub-model, only one binary classifier outputs 1 while the remaining (N-1) binary classifiers output 0, and the 1 is used as the first-level label prediction value; in the second-level classification sub-model, only one binary classifier outputs 1 while the remaining (M-1) binary classifiers output 0, and the 1 is used as the second-level label prediction value; in the third-level classification sub-model, only one binary classifier outputs 1 while the remaining (P-1) binary classifiers output 0, and the 1 is used as the third-level label prediction value; and in the fourth-level classification sub-model, only one binary classifier outputs 1 while the remaining (Q-1) binary classifiers output 0, and the 1 is used as the fourth-level label prediction value.
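
As a non-limiting illustration, reading one label per level from the outputs of the binary classifiers might be sketched as follows. The sketch assumes each sub-model returns a list of 0/1 outputs in the same order as that level's label list; the function names and the sample labels are hypothetical, and raising an error when the exactly-one condition does not hold is an assumption of this sketch, not a behaviour stated in the claims.

```python
def decode_level(outputs, labels):
    """Return the label whose binary classifier output 1.

    outputs: 0/1 values from the binary classifiers of one sub-model.
    labels:  label names in the same order as the classifiers.
    """
    if len(outputs) != len(labels):
        raise ValueError("one output is required per label")
    hits = [i for i, y in enumerate(outputs) if y == 1]
    if len(hits) != 1:
        raise ValueError("expected exactly one classifier to output 1")
    return labels[hits[0]]


def decode_hierarchy(level_outputs, level_labels):
    """Decode the sub-models in level order and return one label per level."""
    return [decode_level(o, l) for o, l in zip(level_outputs, level_labels)]


if __name__ == "__main__":
    level_labels = [
        ["Customer", "Business"],
        ["Personal", "Transaction"],
        ["Identity", "Contact", "Payment"],
        ["ID number", "Passport number", "Phone number", "Card number"],
    ]
    level_outputs = [[1, 0], [1, 0], [0, 1, 0], [0, 0, 1, 0]]
    print(decode_hierarchy(level_outputs, level_labels))
```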

6. A data security classification device, characterized in that it includes: a data classification standard determination module configured to determine, based on the data to be classified, the required domain knowledge and the data classification standard corresponding to the domain knowledge; a hierarchical logical relationship determination module configured to determine, from the data classification standard, multi-level labels and the hierarchical logical relationships between the labels of each level; a data annotation module configured to acquire a sample dataset and, based on the hierarchical logical relationships and a pre-trained data annotation model, perform multi-label hierarchical annotation on each sample in the sample dataset to obtain the multi-label hierarchical annotation results of the sample dataset; and a training and classification module configured to train a hierarchical classification model based on the sample dataset and the multi-label hierarchical annotation results, and to determine the classification result of the data to be classified based on the trained hierarchical classification model; wherein the data annotation module is further configured to: determine a prompt statement for each sample based on the hierarchical logical relationships; upload the normative document for data security classification and, based on the prompt statement and the data annotation model, perform multi-label hierarchical annotation on each sample in a question-and-answer manner to obtain the multi-label hierarchical annotation result of each sample; and, in response to the multi-label hierarchical annotation results not including a last-level label result, determine the last-level label result according to the data classification standard and optimize the data annotation model based on the last-level label result; and wherein, after the multi-label hierarchical annotation results of the sample dataset are obtained, the data annotation module is further configured to: calculate the similarity of structured financial data using the cosine similarity algorithm or the Jaccard similarity algorithm; when the text similarity is greater than or equal to a preset similarity threshold, treat the corresponding samples as similar text data; among the similar text data, treat those with the same multi-label hierarchical annotation results as identical text data and determine the proportion of identical text data in the similar text data; and, when the proportion is less than a preset proportion threshold, adjust the prompt statement and optimize the data annotation model based on the adjusted prompt statement.

7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the data security classification method according to any one of claims 1 to 5.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the steps of the data security classification method according to any one of claims 1 to 5.

Citation Information

Patent Citations

Intelligent hierarchical labeling method
CN112685999A
Data annotation method, AI development platform, computing device cluster and storage medium
CN116862001A
