Artificial intelligence-based data security risk early warning method and system

By constructing a data flow graph and dynamic early warning thresholds, the problem of insufficient perception of the dynamic data flow process in data security early warning is solved, and rapid risk identification and accurate early warning of the data flow process are achieved.

CN121923947B · Active · Publication Date: 2026-06-26 · ZHEJIANG CARD WINNER INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG CARD WINNER INFORMATION TECH CO LTD
Filing Date
2026-03-27
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing data security early warning technologies cannot perceive the dynamic flow of data. As a result, they struggle to identify the scope of impact and the spread trend in a timely manner, and cannot respond quickly once data has been compromised.

Method used

Sensitive data is identified and marked by acquiring metadata lists, a data flow graph is constructed, node risk values are quantified, and iterative propagation and risk diffusion simulation are performed. Data coloring is carried out by capturing data flow event sets with multi-source probes, and early warning thresholds are dynamically adjusted to trigger early warnings.

Benefits of technology

It enables continuous tracking and rapid risk identification of sensitive data throughout the entire process, improves the rigor and accuracy of risk assessment, avoids false alarms and missed alarms, and makes security early warning more adaptive and forward-looking.



Abstract

This application belongs to the technical field of data processing, and in particular relates to a data security risk early warning method and system based on artificial intelligence. The method comprises: obtaining a metadata list and a sensitive data list; matching a data flow event set with the sensitive data in the sensitive data list and performing data coloring to obtain a data flow event list; constructing a data flow graph; calculating the final risk value of each node; adding nodes whose final risk value is greater than an initial threshold to an infection source set; calculating the predicted infection probability of each node within a future time window based on the final risk value of each node and the infection source set; determining a dynamic early warning threshold based on the final risk value and the predicted infection probability of each node; and, if a final risk value is greater than the dynamic early warning threshold, triggering an early warning and generating an early warning report for the nodes whose final risk value is greater than the dynamic early warning threshold. The method can quickly locate affected nodes and predict the diffusion path after a data security problem occurs.

Description

Technical Field

[0001] This application belongs to the field of data processing technology, and in particular relates to a data security risk early warning method and system based on artificial intelligence.

Background Technology

[0002] With the deepening of digital transformation, data security risks are increasingly becoming a major challenge for enterprises and organizations. Traditional security early warning measures are no longer sufficient to cope with complex and evolving cyber threats, especially in the context of big data, cloud computing, and the Internet of Things. Artificial intelligence (AI), as a powerful technological tool, is being widely applied in the field of data security due to its advantages in massive data processing, pattern recognition, and automated decision-making.

[0003] In existing technologies, traditional data security alerts focus on static assets and single points of failure, failing to detect the dynamic risks faced by data during complex processes. When data / users / devices / applications are compromised, security teams struggle to quickly determine which data has been affected, where it is currently flowing, and where it will spread next.

[0004] In summary, when issuing security warnings for data, there is a lack of awareness of the dynamic flow of data, which makes it difficult to identify the scope of impact and the spread trend in a timely manner after the data has been compromised.

Summary of the Invention

[0005] This application provides a data security risk early warning method and system based on artificial intelligence, which can solve the problem that when providing data security early warning, the lack of perception of the dynamic flow process of data makes it difficult to identify the scope of impact and the spread trend after the data has been compromised.

[0006] In a first aspect, embodiments of this application provide a data security risk early warning method based on artificial intelligence, including:

[0007] Obtain a metadata list, and identify and mark sensitive data in the metadata list to obtain a sensitive data list;

[0008] Capture a data flow event set through multi-source probes, match the data flow event set with the sensitive data in the sensitive data list, and color the data flow event set according to the identifier of each piece of sensitive data to obtain a data flow event list;

[0009] Based on the sensitive data list and the data flow event list, a data flow graph is constructed, and the basic risk value of each node in the data flow graph is quantified;

[0010] When an initial infection source is detected, iterative propagation is performed in the data flow graph, starting from the node corresponding to the initial infection source, and the final risk value of each node is calculated based on the basic risk value of each node. Nodes whose final risk value is greater than the initial threshold are added to the infection source set. The initial infection source characterizes an event in which a security problem occurs, and the infection source set includes the initial infection source.

[0011] Based on the final risk value of each node and the set of infection sources, a risk diffusion simulation is performed in the data flow graph to calculate the probability that each node will become a new infection source within a future time window, and to obtain the predicted infection probability of each node.

[0012] Based on the final risk value and the predicted infection probability of each node, a dynamic early warning threshold is determined; the final risk value of each node is compared with the dynamic early warning threshold, and if the final risk value is greater than the dynamic early warning threshold, an early warning is triggered, and an early warning report is generated for the node whose final risk value is greater than the dynamic early warning threshold.

[0013] The technical solutions described in this application embodiment have at least the following technical effects:

[0014] The data security risk early warning method based on artificial intelligence provided in this application first obtains a metadata list and identifies and marks sensitive data in the metadata list to obtain a sensitive data list. It then captures a data flow event set through multi-source probes, matches the data flow event set with the sensitive data in the sensitive data list, and colors the data flow event set according to the identifier of each piece of sensitive data to obtain a data flow event list. Based on the sensitive data list and the data flow event list, a data flow graph is constructed, and the basic risk value of each node in the data flow graph is quantified. When an initial infection source (i.e., an event in which a security problem occurs) is detected, iterative propagation is performed in the data flow graph, starting from the node corresponding to the initial infection source, and the final risk value of each node is calculated based on its basic risk value. Nodes whose final risk value is greater than an initial threshold are added to the infection source set. Then, based on the final risk value of each node and the infection source set, a risk diffusion simulation is performed in the data flow graph to calculate the probability of each node becoming a new infection source within a future time window, yielding the predicted infection probability of each node. Finally, a dynamic early warning threshold is determined based on the final risk value and predicted infection probability of each node, and the final risk value of each node is compared with the dynamic early warning threshold. If the final risk value is greater than the dynamic early warning threshold, an early warning is triggered, and an early warning report is generated for the nodes whose final risk value is greater than the dynamic early warning threshold.
By identifying and marking sensitive data, and by matching and coloring the data flow event sets captured by multi-source probes, the method gives sensitive data a recognizable identifier as it flows. This enables continuous tracking of sensitive data from generation and storage through processing and transmission, and improves the ability to perceive the dynamic flow of data. Through the propagation calculation on the data flow graph and the risk diffusion simulation, the method can quickly locate affected nodes and predict diffusion paths after a data security issue occurs. It avoids the subjectivity and bias of judgments triggered by a single event, improves the rigor and comparability of risk assessment, and enables proactive risk assessment and defense. Furthermore, the dynamic early warning threshold allows warning conditions to be adjusted according to the risk situation, avoiding the false alarms and missed alarms caused by fixed thresholds and improving the accuracy and adaptability of security early warnings.

[0015] In a second aspect, embodiments of this application provide an artificial intelligence-based data security risk early warning system, including:

[0016] A sensitive data marking unit is used to obtain a metadata list and identify and mark sensitive data in the metadata list to obtain a sensitive data list.

[0017] The data coloring unit is used to capture a data flow event set through multi-source probes, match the data flow event set with the sensitive data in the sensitive data list, and color the data flow event set according to the identifier of each piece of sensitive data to obtain a data flow event list;

[0018] The graph construction unit is used to construct a data flow graph based on the sensitive data list and the data flow event list, and to quantify the basic risk value of each node in the data flow graph;

[0019] The risk value calculation unit is used to, when an initial infection source is detected, perform iterative propagation in the data flow graph starting from the node corresponding to the initial infection source, calculate the final risk value of each node based on the basic risk value of each node, and add nodes whose final risk value is greater than the initial threshold to the infection source set; wherein the initial infection source characterizes an event in which a security problem occurs, and the infection source set includes the initial infection source;

[0020] An infection probability prediction unit is used to perform risk diffusion simulation in the data flow graph based on the final risk value of each node and the set of infection sources, calculate the probability that each node will become a new source of infection within a future time window, and obtain the predicted infection probability of each node.

[0021] The early warning unit is used to determine a dynamic early warning threshold based on the final risk value of each node and the predicted infection probability of each node; compare the final risk value of each node with the dynamic early warning threshold; if the final risk value is greater than the dynamic early warning threshold, trigger an early warning and generate an early warning report for nodes whose final risk value is greater than the dynamic early warning threshold.

[0022] In a third aspect, embodiments of this application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method described in any of the embodiments of the first aspect.

[0023] It can be understood that the beneficial effects of the second and third aspects are described in the relevant parts of the first aspect and are not repeated here.

Brief Description of the Drawings

[0024] To illustrate the technical solutions in the embodiments of this application more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of this application. Those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0025] Figure 1 is a flowchart of an embodiment of the artificial intelligence-based data security risk early warning method provided in this application;

[0026] Figure 2 is an example diagram of the unified event format for data flow events in the artificial intelligence-based data security risk early warning method provided in the embodiments of this application;

[0027] Figure 3 is a schematic structural diagram of the artificial intelligence-based data security risk early warning system provided in the embodiments of this application;

[0028] Figure 4 is a schematic structural diagram of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0029] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0030] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or collections thereof.

[0031] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0032] In related technologies, data is not stored statically in a single database or server. Instead, it flows continuously between databases, caching systems, message queues, API interfaces, data platforms, cloud storage, and external partners. This multi-entity, multi-path, cross-domain flow makes the data itself a dynamically changing object. Traditional security systems, however, can usually only identify whether an asset is abnormal or whether a behavior is illegal. They cannot depict the propagation trajectory of data between nodes, nor can they continuously track the diffusion of sensitive data across complex links.

[0033] Secondly, when a user account is stolen, a terminal is infected with malware, or a database has a vulnerability, the system generates single-point alerts. These alerts are isolated and are not backed by any overall correlation analysis. The security team knows only that a certain object has a problem, and cannot readily determine that object's position and role in the overall data flow network.

[0034] Furthermore, when an entity is confirmed to have a security issue, the security team can only take reactive measures, such as banning accounts, isolating hosts, and patching vulnerabilities. However, due to the lack of modeling of data flow relationships and propagation paths, the system cannot predict whether the risk has already spread to other related entities along existing data links, nor can it determine which assets may become new sources of risk in the future. This lack of predictive capability in the protection model often results in security responses lagging behind the speed of attack spread, easily leading to a situation where a chain reaction has already occurred by the time the issue is discovered.

[0035] To address the aforementioned issues, this application provides a data security risk early warning method and system based on artificial intelligence. The method first obtains a metadata list and identifies and marks sensitive data within it to obtain a sensitive data list. It then captures a data flow event set using multi-source probes, matches the data flow event set with the sensitive data in the sensitive data list, and colors the data flow event set according to the identifier of each piece of sensitive data to obtain a data flow event list. Based on the sensitive data list and the data flow event list, a data flow graph is constructed, and the basic risk value of each node in the graph is quantified. When an initial infection source (i.e., an event in which a security problem occurs) is detected, iterative propagation is performed in the data flow graph, starting from the node corresponding to the initial infection source, and the final risk value of each node is calculated based on its basic risk value. Nodes whose final risk value is greater than an initial threshold are added to the infection source set. Then, based on the final risk value of each node and the infection source set, a risk diffusion simulation is performed in the data flow graph to calculate the probability of each node becoming a new infection source within a future time window, yielding the predicted infection probability of each node. Finally, a dynamic early warning threshold is determined based on the final risk value and predicted infection probability of each node, and the final risk value of each node is compared with the dynamic early warning threshold. If the final risk value is greater than the dynamic early warning threshold, an early warning is triggered, and an early warning report is generated for the nodes whose final risk value is greater than the dynamic early warning threshold.
By identifying and marking sensitive data, and by matching and coloring the data flow event sets captured by multi-source probes, the method gives sensitive data a recognizable identifier as it flows. This enables continuous tracking of sensitive data from generation and storage through processing and transmission, and improves the ability to perceive the dynamic flow of data. Through the propagation calculation on the data flow graph and the risk diffusion simulation, the method can quickly locate affected nodes and predict diffusion paths after a data security issue occurs. It avoids the subjectivity and bias of judgments triggered by a single event, improves the rigor and comparability of risk assessment, and enables proactive risk assessment and defense. Furthermore, the dynamic early warning threshold allows warning conditions to be adjusted according to the risk situation, avoiding the false alarms and missed alarms caused by fixed thresholds and improving the accuracy and adaptability of security early warnings.

[0036] The artificial intelligence-based data security risk early warning method provided in the embodiments of this application can be applied to an electronic device, which then serves as the executing entity of the method. The embodiments of this application place no restriction on the specific type of electronic device.

[0037] For example, the electronic device can be a laptop, an ultra-mobile personal computer (UMPC), a netbook, a desktop computer, etc.

[0038] To aid understanding, the specific implementation process of the artificial intelligence-based data security risk early warning method provided in this application is described below by way of example.

[0039] Figure 1 shows a schematic flowchart of the artificial intelligence-based data security risk early warning method provided in an embodiment of this application. The method includes:

[0040] S100: Obtain the metadata list, and identify and mark sensitive data in the metadata list to obtain the sensitive data list.

[0041] For example, a data source adaptation layer can be built to provide standardized access to different types of data assets. The adaptation layer is implemented as connectors or plugins that abstract different data sources into a unified data object model, covering at least the following: structured data, including relational databases (MySQL, Oracle, SQL Server, PostgreSQL), data warehouses (Hive, MaxCompute), and cloud databases (RDS, Redshift); semi-structured and unstructured data, including distributed file systems and network storage (HDFS, NAS), object storage (OSS, S3, MinIO), knowledge base and document management systems (SharePoint, Confluence), and code repositories (Git); and data flow nodes, including mail servers, FTP servers, file transfers in collaboration tools (subject to compliance and authorization), and designated terminal directories.

[0042] The system can periodically scan pre-configured IP ranges or service ports to automatically discover newly added database instances or shared storage. It can connect via API to the metadata management center of a cloud platform or big data platform (such as Hive Metastore or AWS Glue Data Catalog) to pull the schema information of databases, tables, and columns. At permitted deployment points, it can parse database protocols or network traffic to passively identify the data objects being accessed, fill in active objects missing from the asset list, and record the access context (access subject, client, time, statement fingerprint, etc.).

[0043] The above process allows the system to generate and continuously update a metadata list. This list can include data object location information (such as data source identifier, instance / project, etc.), object attributes (such as type, format, size / number of rows, etc.), ownership and permissions (such as owner, responsible person, etc.), and usage characteristics (such as recent access time, access frequency, etc.). This metadata list provides clear scanning targets and boundaries for subsequent sensitive data identification, avoiding the costs and disruptions caused by indiscriminate full-data scanning.
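
The four groups of information listed above can be pictured as one record per data object. The following is a minimal sketch only: the class name, the field names, and the `upsert` helper are illustrative assumptions, not structures defined in this application.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MetadataEntry:
    """One row of the metadata list (all field names are hypothetical)."""
    # Location information
    source_id: str
    instance: str
    object_path: str
    # Object attributes
    object_type: str                      # e.g. "table" or "file"
    data_format: str                      # e.g. "mysql" or "pdf"
    size_bytes: Optional[int] = None
    row_count: Optional[int] = None
    # Ownership and permissions
    owner: Optional[str] = None
    # Usage characteristics
    last_access: Optional[datetime] = None
    access_count_30d: int = 0


def upsert(metadata: dict, entry: MetadataEntry) -> None:
    """Insert a new entry or refresh an existing one, keyed by location."""
    key = (entry.source_id, entry.instance, entry.object_path)
    metadata[key] = entry
```

Keying the list by location lets a repeated scan refresh an existing entry instead of duplicating it, which is one simple way to keep the list continuously updated.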

[0044] Sensitive data identification can employ a multi-layered identification engine: The first layer uses a rule engine for rapid filtering: regular expression matching can be used to quickly identify data with obvious format characteristics, such as ID card numbers, mobile phone numbers, bank card numbers, IP addresses, license plates, etc., and rules such as check digits / length / number ranges can be introduced to reduce false alarms; keyword matching can be used to identify sensitive semantic clues (such as passwords / secrets, confidentiality agreements, etc.) through field names / titles / paragraph keywords and context dictionaries for scenarios such as configurations, contracts, regulations, and quotations; file fingerprint matching can be used to generate hashes (or segmented hashes) for known sensitive files, and when identical or highly similar files are found across locations, they are directly identified as sensitive and their tags are inherited.
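
The regular expression and check digit part of the first layer can be sketched as follows. This is an illustration under stated assumptions: the mobile pattern assumes the mainland China 11-digit format, the bank card check assumes the Luhn algorithm commonly used for payment card numbers, and the names `MOBILE_RE`, `CARD_RE`, `luhn_valid`, and `scan_text` are hypothetical.

```python
import re

# Candidate patterns (assumptions); real deployments tune these per data source.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")   # 11-digit mobile number
CARD_RE = re.compile(r"(?<!\d)\d{13,19}(?!\d)")       # bank card candidate


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (label, matched value) pairs for format-based hits."""
    hits = [("mobile", m.group()) for m in MOBILE_RE.finditer(text)]
    # The check digit rule drops digit runs that only look like card numbers.
    hits += [("bank_card", m.group())
             for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    return hits
```

The lookaround assertions keep a pattern from matching inside a longer digit run, and the Luhn step shows how a check digit rule reduces false alarms from order numbers or other long numeric strings.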

[0045] The second layer uses a machine learning engine for deep understanding and verification: For unstructured text (such as contracts, reports, and source code), NLP models (such as BERT, TextCNN, etc.) can be used to analyze the data and automatically determine whether the content belongs to sensitive data (such as trade secrets, medical information, personal privacy, etc.). For images and PDF scans, OCR (Optical Character Recognition) technology can be used to extract text from the images, and then NLP models or rule engines can be used for verification.

[0046] Sensitive data identified through the above process can be classified into different levels using predefined classification rules. These rules can be based on the data's name, format, and field labels (such as field names and column names). For example, sensitive data can be classified into levels such as L4 (highest sensitivity, such as ID card numbers and bank card numbers), L3 (high sensitivity, such as salary information and customer financial information), L2 (medium sensitivity, such as business reports and financial budgets), and L1 (lowest sensitivity, such as operation logs and public reports).

[0047] The classification level of sensitive data can be adjusted based on factors such as access frequency and storage location. For example, if sensitive data (such as financial data or salary information) is frequently accessed or stored in an insecure location, its level can be raised. If sensitive data is subject to regulatory requirements (such as GDPR or HIPAA), its level can be assigned automatically according to those requirements. For example, medical records subject to HIPAA can be classified as L4.
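
The base classification and its adjustment can be sketched as a small function. The category names, the access threshold of 1000, the one-level increment, and the L1 default for unknown categories are all assumptions made for illustration; the application does not specify them.

```python
LEVELS = ["L1", "L2", "L3", "L4"]

# Hypothetical base rules that mirror the examples given for L1 to L4.
BASE_LEVEL = {
    "id_card_number": "L4", "bank_card_number": "L4",
    "salary": "L3", "customer_financial": "L3",
    "business_report": "L2", "financial_budget": "L2",
    "operation_log": "L1", "public_report": "L1",
}


def classify(category, access_count_30d=0, insecure_location=False,
             regulatory_floor=None, hot_threshold=1000):
    """Return the adjusted level for a data category."""
    level = LEVELS.index(BASE_LEVEL.get(category, "L1"))  # assumed default
    if access_count_30d >= hot_threshold:   # frequently accessed: raise by one
        level += 1
    if insecure_location:                   # insecure storage: raise by one
        level += 1
    if regulatory_floor is not None:        # regulation sets a minimum level
        level = max(level, LEVELS.index(regulatory_floor))
    return LEVELS[min(level, len(LEVELS) - 1)]
```

For instance, `classify("salary", access_count_30d=5000)` returns `"L4"`, while `classify("operation_log", regulatory_floor="L4")` shows a regulatory requirement overriding a low base level.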

[0048] The system can generate stable, referenceable IDs for different sensitive data objects: for structured objects, it can combine the database name, table name, column name, and schema version and apply a hash algorithm (such as SHA-256) to generate field-level / table-level IDs; for unstructured objects, it can combine the bucket / directory, object key, and version / ETag and apply a hash algorithm to generate file-level IDs.

[0049] The content of sensitive data can be extracted for fingerprint calculation. For example, for file data, the entire file content can be extracted; for database data, the field content of the table can be extracted; for text data, the text itself can be used; for binary files (such as PDFs and images), the file can be converted into a byte stream. The extracted content can be hashed using a hash algorithm (such as SHA-256 or MD5) to generate a fixed-length hash value, which serves as the fingerprint of the sensitive data. For files that are not entirely identical (such as updated documents or slightly modified files), fuzzy hash algorithms (such as ssdeep) can be used to generate fingerprints, so that similar data can be identified in addition to completely identical content.
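
A minimal sketch of the ID generation described above, using SHA-256 from the Python standard library. The function names, the `"db"` / `"file"` prefixes, and the use of a separator character are assumptions for illustration.

```python
import hashlib

SEP = "\x1f"  # unit separator, so ("ab", "c") and ("a", "bc") hash differently


def structured_object_id(database, table, column=None, schema_version="1"):
    """Field-level ID when column is given, table-level ID otherwise."""
    parts = ["db", database, table, column or "", schema_version]
    return hashlib.sha256(SEP.join(parts).encode("utf-8")).hexdigest()


def file_object_id(bucket, object_key, version_or_etag):
    """File-level ID for an unstructured object."""
    parts = ["file", bucket, object_key, version_or_etag]
    return hashlib.sha256(SEP.join(parts).encode("utf-8")).hexdigest()
```

Because the inputs are location and version attributes rather than content, the same object yields the same ID on every scan, and a schema or version change yields a new ID.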

[0050] This step enables unified organization and visual management of data assets scattered across multi-source heterogeneous environments. It also separates sensitive data from ordinary data at the semantic level and locates it precisely, providing clear data object boundaries and unique identifiers for subsequent data flow matching, data coloring, and risk propagation calculations. This enhances the system's ability to perceive the storage location, distribution status, and attribute characteristics of sensitive data, enabling accurate identification, standardized management, and computable processing of sensitive data, and thus lays the foundation for data security risk analysis and dynamic early warning.

[0051] In one possible implementation, marking the sensitive data in step S100 includes:

[0052] S110: Divide all sensitive data into multiple granularities to obtain the granularity corresponding to each piece of sensitive data. The granularity can be a file-level object, table-level object, field-level object, record-level object, or text fragment-level object.

[0053] For example, a file-level object refers to a complete file, such as an office document (Word, Excel, PowerPoint, etc.), a compressed file (ZIP, RAR, etc.), or a PDF file. Sensitive data can be identified as a file by its file extension (e.g., .pdf, .docx, .xlsx, etc.) and file metadata (e.g., type, size, modified_time, etc.). If it is a file, it is marked as a file-level object.

[0054] Table-level objects are suitable for structured sensitive data, which can be database tables, datasets, views, or partitioned tables. Sensitive data can be identified as table-structured by scanning the database schema or querying database metadata (such as table names, field names, field types, etc.), and if so, it is marked as a table-level object.

[0055] Field-level objects are suitable for individual fields in structured data, such as ID card numbers, bank card numbers, and passwords. Sensitive data can be identified as field-level objects by their field names (e.g., id_card, salary) and data formats (e.g., numbers, characters).

[0056] A record-level object refers to a single record in a database table. Each record represents a business entity, such as an employee, a customer, or an order. Whether data belongs to a record-level object can be determined by analyzing the record's primary key identifier (such as employee ID or order ID) and the record's field contents.

[0057] Text fragment-level objects are suitable for specific segments within unstructured text, such as contract terms, sensitive paragraphs in technical documents, and source code snippets. Specific text fragments within sensitive data can be extracted using NLP (Natural Language Processing) techniques and regular expression matching to determine if the sensitive data belongs to a text fragment-level object.
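
The granularity decision in S110 can be sketched as a simple rule cascade. The descriptor keys (`path`, `table`, `column`, `primary_key`), the extension set, and the fallback to fragment level are assumptions for illustration; a real system would draw these from the metadata list and the identification engine.

```python
from enum import Enum
from pathlib import PurePosixPath


class Granularity(Enum):
    FILE = "file"
    TABLE = "table"
    FIELD = "field"
    RECORD = "record"
    FRAGMENT = "fragment"


FILE_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".zip", ".rar"}


def infer_granularity(obj: dict) -> Granularity:
    """Pick a granularity from a hypothetical object descriptor."""
    path = obj.get("path")
    if path and PurePosixPath(path).suffix.lower() in FILE_EXTENSIONS:
        return Granularity.FILE
    if "primary_key" in obj:      # one row of a table
        return Granularity.RECORD
    if "column" in obj:           # one column of a table
        return Granularity.FIELD
    if "table" in obj:            # the table as a whole
        return Granularity.TABLE
    return Granularity.FRAGMENT   # assumed fallback: a span of text
```

The more specific checks run first, so a descriptor that names both a table and a column is treated as field-level rather than table-level.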

[0058] S120: Based on the granularity corresponding to each piece of sensitive data, select at least one fingerprint strategy from the fingerprint strategy set, and generate a fingerprint set for each piece of sensitive data using the selected fingerprint strategies. The fingerprint strategy set includes precise fingerprint strategies, approximate fingerprint strategies, and statistical fingerprint strategies.

[0059] It can be understood that precise fingerprints are used to identify sensitive data that is completely identical. They offer high matching accuracy and a low false alarm rate, and are suitable for matching data that has not been modified in any way.

[0060] Approximate fingerprints are used to identify sensitive data that still retains the same origin after minor modifications, format changes, field rearrangements, or partial truncation. They are suitable for tracking variations of sensitive data, such as edited documents or data.

[0061] Statistical fingerprints are used to extract statistical features from data. They are mainly used for structured data and can help the system determine the similarity or common origin of different data. They are suitable for comparing table-level and field-level data.

[0062] For example, a precise fingerprinting strategy may include calculating a hash value (such as SHA-256 or MD5) for the file content as a precise fingerprint of a file-level object; calculating a hash value after combining and encoding the schema information (field name, field type, field order) and content digest of a structured data table as a precise fingerprint of a table-level object; calculating a hash value for the field metadata (field name, type) and sample value digest as a precise fingerprint of a field-level object; calculating a hash value for the primary key identifier and field content of a record as a precise fingerprint of a record-level object; and calculating a hash value for a text fragment as a precise fingerprint of a text fragment-level object.
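
Three of the precise fingerprints described above can be sketched with the standard library. The function names and the JSON-based canonical encoding are assumptions; the application only states that the listed inputs are combined, encoded, and hashed.

```python
import hashlib
import json


def _sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def file_fingerprint(content: bytes) -> str:
    """Precise fingerprint of a file-level object: hash of the raw content."""
    return _sha256_hex(content)


def table_fingerprint(columns: list, content_digest: str) -> str:
    """Hash of the schema (name, type, in order) plus a content digest."""
    payload = json.dumps({"columns": columns, "digest": content_digest},
                         separators=(",", ":"), ensure_ascii=False)
    return _sha256_hex(payload.encode("utf-8"))


def record_fingerprint(primary_key: str, fields: dict) -> str:
    """Hash of the primary key plus field contents, independent of key order."""
    payload = json.dumps({"pk": primary_key, "fields": fields},
                         sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return _sha256_hex(payload.encode("utf-8"))
```

The column list is hashed in order because field order is part of the schema information, whereas the record's fields are sorted so that the same record always produces the same fingerprint.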

[0063] Approximate fingerprinting strategies can include using fuzzy hashing algorithms (such as ssdeep) to generate approximate fingerprints for file-level objects and text fragment-level objects; and using Locality Sensitive Hashing (LSH) to generate approximate fingerprints for table-level objects and field-level objects.

[0064] The statistical fingerprinting strategy can extract statistical feature vectors (such as the number of rows, the number of columns, the uniqueness of the field, the data missing rate, the duplication rate, etc.) from table-level objects and use these statistical feature vectors as the statistical fingerprints of the table-level objects; it can also extract statistical feature vectors (such as field length, the proportion of unique values, the range of values, the distribution pattern (e.g., the ID card number field may have a length of 18 characters, the number pattern of the bank card number field, etc.)) from field-level objects and use these statistical feature vectors as the statistical fingerprints of the field-level objects.
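A field-level statistical fingerprint of this kind might be computed as in the sketch below. The function name and the exact choice of features are assumptions for illustration; the text only lists field length, proportion of unique values, value range, and distribution pattern as examples.

```python
def field_statistics(values: list) -> dict:
    """Build a small statistical feature set for one field (column)."""
    non_null = [v for v in values if v is not None]
    as_text = [str(v) for v in non_null]
    lengths = [len(s) for s in as_text]
    total = len(values)
    return {
        # Share of missing values in the field.
        "missing_rate": (total - len(non_null)) / total if total else 0.0,
        # Proportion of distinct values among the non-missing ones.
        "unique_ratio": len(set(non_null)) / len(non_null) if non_null else 0.0,
        # Length range, e.g. an ID card number field would show 18 and 18.
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        # Simple pattern feature: share of values consisting only of digits.
        "digit_only_ratio": (
            sum(s.isdigit() for s in as_text) / len(as_text) if as_text else 0.0
        ),
    }
```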

[0065] A fingerprint strategy corresponding to the granularity of each sensitive data can be selected from the fingerprint strategy set mentioned above, and a fingerprint set for each sensitive data can be generated using the selected fingerprint strategy.

[0066] S130, generate an identifier for each sensitive data item according to preset rules. The preset rules include granularity object, data source type, timestamp, and random unique factor.

[0067] For example, a globally unique identifier, DataID, can be assigned to each piece of sensitive data. DataIDs can be generated using structured encoding and may include the granularity object (file / table / field / record / fragment), the data source type (DB, FILE, OBJ, MAIL, IM, etc.), a timestamp, and a random unique factor (UUID or Snowflake ID). The specific format of the DataID can be configured; for example, a DataID may be FILE_OBJ_20260225_9f3c7a.
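A DataID in the example format could be generated roughly as follows. The function name, the underscore-separated layout, and the truncation of the UUID to six hexadecimal characters are assumptions chosen to mirror the sample value; a six-character suffix alone does not guarantee global uniqueness, so a real deployment would keep more of the UUID or use a Snowflake ID.

```python
import uuid
from datetime import date
from typing import Optional

def make_data_id(granularity: str, source_type: str, day: Optional[date] = None) -> str:
    """Compose a DataID as <granularity>_<source type>_<YYYYMMDD>_<random factor>."""
    day = day or date.today()
    # Random unique factor: first six hex characters of a UUID4 (illustrative only).
    suffix = uuid.uuid4().hex[:6]
    return f"{granularity}_{source_type}_{day:%Y%m%d}_{suffix}"
```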

[0068] Fingerprints and identifiers not only play a crucial role in data protection and risk assessment, but can also serve as a basis for matching sensitive data in subsequent data stream events, helping to identify and respond to potential risks and breaches in a timely manner.

[0069] S200: Capture a data stream event set using multi-source probes, match the data stream event set with sensitive data in the sensitive data list, and color the data stream event set according to the identifier of each sensitive data, thereby obtaining a data stream event list. Data coloring refers to writing the identifier of the sensitive data into the data stream event.

[0070] It can be understood that data flows through various paths, such as database queries, file sharing, email sending, API calls, instant messaging tools, and even USB copying. Different types of probes can be deployed at key nodes to comprehensively capture data flow events.

[0071] For example, a database probe can be deployed in a bypass manner to capture database traffic via port mirroring (SPAN), or it can act as a database plugin / proxy (such as the audit plugin for MySQL) to capture audit logs. The captured content can include operation type (such as SELECT, INSERT, UPDATE, etc.), object information (such as database name, table name, field name, key values in WHERE conditions, etc.), result set characteristics (such as the number of rows returned, data size, sample of the first record, etc.), and context information (such as client IP, database login user, timestamp, etc.).

[0072] Network traffic probes can be deployed in bypass mode at core switches, internet egress points, and server area entry points to perform deep packet inspection (DPI) on network traffic. Captured content can include file transfer protocols (i.e., parsing upload / download traffic for SMB / CIFS, NFS, FTP, and HTTP / HTTPS; for HTTPS encrypted traffic, recording the connection tuple, certificate information, amount of data transferred, file name, etc.), email protocols (i.e., parsing SMTP, POP3, and IMAP to extract email headers, subjects, attachment names, and sizes), and instant messaging and collaboration (i.e., identifying traffic characteristics of common applications (such as DingTalk, WeChat Work, Slack, etc.) and parsing file metadata).

[0073] API gateway plugins (such as Kong, Zuul, etc.) can be used to intercept all API requests and responses. The captured content can include RESTful / gRPC calls (i.e., parsing the request path, method, parameters, and key fields in the response body), data object identifiers (i.e., if the API follows REST principles in its design, the resource ID (such as / api / employees / {id}) can uniquely identify the accessed data object), and call chain information (such as the calling service, the called service, user authentication information, etc.).

[0074] A lightweight endpoint for data loss prevention (DLP) can be installed on enterprise-managed terminals (such as Windows, macOS, Linux, etc.). Captured content can include file operations (such as copying, moving, renaming, and deleting files, and information such as source file paths, target paths, and process names), clipboard operations (such as copy / paste operations and text in the clipboard), peripheral device control (i.e., writing files to USB devices), printing operations (such as the name of the printed document and printer information), and screenshots.

[0075] The events captured by the probes above have varying formats. They can be converted into a unified event format (table), and contextual information can be added for subsequent correlation analysis. An example of the unified event format is shown in Figure 2. A pre-defined operation semantic mapping table can be used to uniformly convert events into standard operation types, facilitating the correlation and calculation of data flow behavior across different channels within the same graph. For example, the operation semantic mapping table maps DB's SELECT to query / read; API's GET / employees / {id} to read; terminal copy to copy; and email sending to send.
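The normalisation step can be sketched as a lookup keyed by channel and raw operation. The field names of the unified event (`event_id`, `timestamp`, `source`, `target`, `data_object`, `operation`) and the channel labels are hypothetical, since the actual unified format is defined in Figure 2; only the four mappings themselves are taken from the text.

```python
# Operation semantic mapping table: (channel, raw operation) -> standard type.
SEMANTIC_MAP = {
    ("DB", "SELECT"): "read",
    ("API", "GET"): "read",
    ("ENDPOINT", "COPY"): "copy",
    ("MAIL", "SEND"): "send",
}

def normalize_event(raw: dict) -> dict:
    """Convert a probe-specific event into a unified event record."""
    key = (raw["channel"].upper(), raw["operation"].upper())
    return {
        "event_id": raw["event_id"],
        "timestamp": raw["timestamp"],
        "source": raw.get("source"),
        "target": raw.get("target"),
        "data_object": raw.get("data_object"),
        # Unmapped operations are kept visible instead of being dropped.
        "operation": SEMANTIC_MAP.get(key, "unknown"),
    }
```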

[0076] Whether a data stream event involves sensitive data can be determined by matching field names (such as matching fields in SQL queries, paths in API requests, or filenames in file operations). For example, in the database query `SELECT id_card, name FROM employee WHERE id_card = '123456789'`, the system matches the `id_card` field and marks the query event as related to ID card number data in the sensitive data list.
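Field-name matching of this kind can be approximated by tokenising the event text and intersecting the tokens with the known sensitive field names. This is a rough sketch: the `SENSITIVE_FIELDS` table and the function name are assumptions, and a production system would use a real SQL parser to avoid matching names that appear inside string literals.

```python
import re

# Hypothetical excerpt of the sensitive data list: field name -> description.
SENSITIVE_FIELDS = {
    "id_card": "ID card number",
    "bank_card": "bank card number",
}

def match_sensitive_fields(text: str) -> set:
    """Return the sensitive field names that appear as identifiers in text."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))
    return {name for name in SENSITIVE_FIELDS if name in tokens}
```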

[0077] You can calculate the fingerprint of the data object in a data stream event (such as SHA-256 or fuzzy hash) and compare it with the fingerprints in a sensitive data list. If the fingerprint of the data object in the data stream event matches a fingerprint in the sensitive data list, it indicates that the data stream event may contain sensitive data.

[0078] When an event in the data stream event set matches a sensitive data item in the sensitive data list, the sensitive data ID can be assigned to that event (i.e., data coloring). This step effectively improves the ability to identify sensitive data during data flow, ensuring that every event involving sensitive data can be marked and monitored, providing a reliable basis for subsequent risk assessment and security response.

[0079] In one possible implementation, in step S200, the data stream event set is matched with sensitive data in the sensitive data list, and the data stream event set is colored according to the identifier of each sensitive data to obtain the data stream event list, including:

[0080] For each data stream event:

[0081] S210: Generate an event precise fingerprint based on the granularity and precise fingerprint strategy corresponding to the data object in the data stream event, and match the event precise fingerprint with the precise fingerprint in the sensitive data list. If the match is successful, write the identifier of the matched sensitive data into the data stream event.

[0082] For example, the granularity of the data object in the data stream event can be identified by the method in step S110, and the corresponding strategy can be found in the precise fingerprint strategy in step S120 according to its granularity to generate the precise fingerprint of the event.

[0083] Search for the exact fingerprint in the sensitive data list that matches the exact fingerprint of the event (i.e., the hash value is the same). If a unique exact fingerprint is matched, the DataID corresponding to the exact fingerprint can be written to the data stream event, that is, the data stream event is associated with the matched DataID. If no fingerprint is matched or multiple fingerprints are matched, proceed to step S220.

[0084] S220, if the matching fails, generate an event approximate fingerprint based on the granularity and approximate fingerprint strategy corresponding to the data object in the data stream event, calculate the similarity between the event approximate fingerprint and each approximate fingerprint in the sensitive data list, and write the identifier of the sensitive data with the highest similarity and exceeding the similarity threshold into the data stream event.

[0085] For example, the granularity of the data object in the data stream event can be identified by the method in step S110, and the corresponding strategy can be found in the approximate fingerprint strategy in step S120 according to its granularity to generate an approximate fingerprint of the event.

[0086] For approximate fingerprints generated using fuzzy hashing algorithms, the Hamming distance method can be used to calculate the similarity between the event approximate fingerprint and each approximate fingerprint in the sensitive data list. Both the event approximate fingerprint and the approximate fingerprints in the sensitive data list are converted into binary form, the number of bit positions at which the two fingerprints differ (i.e., the Hamming distance) is counted, and the similarity between the two fingerprints is evaluated from that distance. The similarity calculation formula is as follows: $S = 1 - H / L$, where $S$ represents similarity, $H$ represents Hamming distance, and $L$ represents hash value (fingerprint) length.
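A Hamming-based similarity can be sketched as below, assuming the fingerprints are fixed-length and hex-encoded and that the similarity is normalised as one minus the Hamming distance divided by the fingerprint length in bits. Both assumptions are made for the example; the function name is hypothetical.

```python
def hamming_similarity(fp_a: str, fp_b: str) -> float:
    """Similarity in [0, 1] between two equal-length hex fingerprints."""
    if not fp_a or len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must be non-empty and of equal length")
    bits = len(fp_a) * 4  # each hex character encodes 4 bits
    # XOR leaves a 1 at every bit position where the fingerprints differ.
    distance = bin(int(fp_a, 16) ^ int(fp_b, 16)).count("1")
    return 1.0 - distance / bits
```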

[0087] For approximate fingerprints generated by Locality Sensitive Hashing (LSH), the Jaccard similarity method can be used to calculate the similarity between the event approximate fingerprint and each approximate fingerprint in the sensitive data list. The Jaccard similarity formula is as follows: $J(A, B) = |A \cap B| / |A \cup B|$, where $J(A, B)$ indicates the similarity between the event approximate fingerprint $A$ and the approximate fingerprint $B$ in the sensitive data list, $|A \cap B|$ represents the portion shared by the two hash values (fingerprints), and $|A \cup B|$ represents the total of the two hash values (fingerprints).
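The Jaccard similarity is the size of the intersection divided by the size of the union. The sketch below treats each fingerprint as a set of hash values (for example, LSH bucket identifiers); returning 1.0 for two empty sets is a convention chosen for the example.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity between two fingerprints given as sets of hash values."""
    if not a and not b:
        return 1.0  # convention for the example: two empty fingerprints are identical
    return len(a & b) / len(a | b)
```

For instance, the sets {1, 2, 3} and {2, 3, 4} share two of four distinct elements, giving a similarity of 0.5.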

[0088] The DataID corresponding to the approximate fingerprint that has the highest similarity and exceeds the similarity threshold can be selected and written to the data stream event. The similarity threshold can be configured per granularity. For example, the similarity threshold for text fragment-level objects can be set higher (e.g., 0.85) to prevent false associations; the similarity threshold for table-level objects can be slightly lower (e.g., 0.70) to facilitate the recall of tables from the same source. If no similarity exceeds the similarity threshold, proceed to step S230.
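The selection rule can be sketched as follows. The two thresholds come from the examples in the text; the dictionary keys, the default threshold of 0.8 for other granularities, and the function name are assumptions.

```python
# Per-granularity similarity thresholds (values from the examples in the text).
THRESHOLDS = {"fragment": 0.85, "table": 0.70}

def best_match(similarities: dict, granularity: str, default_threshold: float = 0.8):
    """Return the DataID with the highest similarity if it exceeds the threshold.

    similarities maps DataID -> similarity score. None is returned when no
    candidate qualifies, which corresponds to falling through to step S230.
    """
    if not similarities:
        return None
    threshold = THRESHOLDS.get(granularity, default_threshold)
    data_id, score = max(similarities.items(), key=lambda kv: kv[1])
    return data_id if score > threshold else None
```

The same scores can therefore match at table level and fail at fragment level, which is the intended effect of the stricter fragment threshold.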

[0089] S230, if all similarities do not exceed the similarity threshold, extract the statistical features of the data objects in the data stream event, and match the statistical features of the data objects with the statistical fingerprints in the sensitive data list to obtain the identifier set of candidate sensitive data; based on the time, source object, target object and operation type in the data stream event, perform context association verification on the identifier set of candidate sensitive data, determine the identifier of the target sensitive data, and write the identifier of the target sensitive data into the data stream event.

[0090] For example, the granularity of the data object in the data stream event can be identified using the method in step S110, and the corresponding strategy can be found in the statistical fingerprint strategy in step S120 according to its granularity to generate a statistical feature vector. The similarity between the statistical feature vector of the data object and each statistical fingerprint in the sensitive data list can be calculated using a weighted distance or cosine similarity method, and the DataID corresponding to the statistical fingerprint with a similarity higher than the similarity threshold is added to the identifier set of the candidate sensitive data.
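Candidate selection by cosine similarity might look like the sketch below. It assumes that statistical fingerprints are stored as equal-length numeric vectors keyed by DataID; the function names are hypothetical, and in practice the features would need to be scaled to comparable ranges first.

```python
import math

def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_u * norm_v)

def candidate_ids(event_vec: list, stat_fingerprints: dict, threshold: float) -> list:
    """DataIDs whose statistical fingerprint is similar enough to the event."""
    return [
        data_id
        for data_id, vec in stat_fingerprints.items()
        if cosine_similarity(event_vec, vec) > threshold
    ]
```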

[0091] By matching the time in the data stream event, the time when the event occurred can be determined. If the historical record of a candidate sensitive data is associated with the event time period (such as the update history of the data table matching the event time), it can be further confirmed that the data stream event is associated with the candidate sensitive data.

[0092] By mapping the roles of the source and target objects, the sensitive data actually involved in the data flow event can be identified. For example, if the source object is a financial system application that can only access the financial_data table, then it can be inferred that the data flow event is related to the financial_data table.

[0093] You can verify whether the operation type of candidate sensitive data matches the event type based on the operation type in the data stream event (such as INSERT, UPDATE, DELETE, etc.). For example, if the operation type is INSERT, the candidate sensitive data can be a table or field that can receive inserts; if the operation type is UPDATE, the candidate sensitive data can be a record or field that can be updated.

[0094] By matching contextual information such as timestamp, source object, target object, and operation type, it can be determined which candidate sensitive data matches the data stream event in key attributes (the above process can be quantified and confidence levels calculated). If only one candidate sensitive data passes the verification (highest confidence level), then the DataID of that candidate sensitive data becomes the identifier of the target sensitive data, and the identifier of the target sensitive data is written into the data stream event.

[0095] The above process not only enables the identification of sensitive data in various scenarios, but also ensures the accurate correlation of data through context verification, greatly improving the ability to track, manage, and prevent leakage of sensitive data.

[0096] S300: Construct a data flow graph based on the sensitive data list and the data flow event list, and quantify the basic risk value of each node in the data flow graph.

[0097] For example, each sensitive data item in the sensitive data list can be treated as a data node, and the target and source objects in the data flow event list can be treated as entity nodes. Edges of the data flow graph can be constructed based on the operation type of the data flow event, including access edges (indicating that an entity node accesses a data node) and transmission edges (indicating that data flows from one entity node to another). For example, a user accesses sensitive data by querying a database; a file is transferred from a server to a terminal device.

[0098] Whenever the system captures a data stream event, it can trigger the graph update mechanism. The system can check whether the source object, target object, and data object in the data stream event already exist in the data stream graph. If they do not exist, it can create the corresponding new nodes and add a directed edge between the corresponding nodes according to the operation type of the data stream event.

[0099] The basic risk value of an entity node can be quantified by multiple factors: the vulnerability score is calculated based on the number of vulnerabilities or the severity score of the device, such as vulnerability score = number of vulnerabilities / total number of vulnerabilities. Vulnerability scanners can be used to regularly scan devices, servers, and networks to detect known vulnerabilities (such as CVE vulnerabilities).

[0100] The abnormality of behavior is calculated based on the degree of deviation between historical behavior and current behavior. For example, the abnormality of behavior = the degree of deviation from normal behavior / total behavior. The UEBA (User and Entity Behavior Analytics) system can be used to find out whether a user's behavior is consistent with the normal behavior pattern. For example, if an employee logs in during working hours under normal circumstances, but suddenly accesses the company's core database late at night, this behavior is marked as abnormal.

[0101] The threat intelligence hit score can be a binary variable: a score of 1 indicates a hit, while a score of 0 indicates a miss. Threat intelligence platforms (such as AlienVault and Anomali) can provide information on known malicious IP addresses, domains, and attack patterns. If an entity node is associated with these known threats, the threat intelligence hit score is 1.

[0102] Historical security incident scores can be calculated by the number of recorded security incidents. For example, historical security incident score = number of security incidents in the past period / total number of security incidents. Historical security incidents, such as intrusion detection and malware incidents, are recorded by a security incident management platform (such as a SIEM system) to determine the security history of an entity.

[0103] The vulnerability score, behavioral anomaly score, threat intelligence hit score, and historical security event score of each entity node can be weighted and summed to obtain the basic risk value of each entity node.
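The weighted sum can be sketched as below. The weights are purely illustrative, since the text does not specify them; the function name is likewise an assumption. All four factor scores are assumed to lie in [0, 1], so with weights that sum to 1 the result stays in the same range.

```python
def entity_base_risk(
    vulnerability: float,
    anomaly: float,
    intel_hit: int,
    history: float,
    weights: tuple = (0.3, 0.3, 0.2, 0.2),  # hypothetical weights
) -> float:
    """Basic risk value of an entity node as a weighted sum of four factors."""
    factors = (vulnerability, anomaly, float(intel_hit), history)
    return sum(w * f for w, f in zip(weights, factors))
```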

[0104] The basic risk value of a data node can be quantified by multiple factors: the higher the data level, the more sensitive the data, and the higher the risk. The mapping between data level and score can be preset, such as L4→1.0, L3→0.8, L2→0.5, L1→0.2.

[0105] After capturing the data stream event set in step S200, the access popularity can be calculated by counting the frequency of access to sensitive data and / or the number of users, such as access popularity = number of accesses in a recent period / total number of accesses.

[0106] The basic risk value of a data node can be calculated using a comprehensive calculation formula that combines the score $L$ corresponding to the data level of the data node with the access popularity $A$ through a weighting coefficient.

[0107] The basic risk values for all entity nodes and data nodes can be recalculated periodically or based on event triggers. The updated basic risk values will serve as the basis for further risk propagation and diffusion calculations. Quantifying risk factors in this way helps identify and assess potential security threats in sensitive data flows, providing data support for subsequent risk propagation and early warning.

[0108] In one possible implementation, step S300 involves constructing a data flow graph based on the sensitive data list and the data flow event list, including:

[0109] S310: Use each sensitive data in the sensitive data list as a data node in the data flow graph, and use the source object and target object in the data flow event list as entity nodes in the data flow graph.

[0110] For example, each sensitive data in the sensitive data list can be used as a data node. The node attributes of the data node may include DataID, data level, fingerprint index reference (or fingerprint set); the source object and target object (such as user entity, device entity, application entity, etc.) in the data flow event list can be used as entity nodes. The node attributes of the entity node may include entity ID (such as username, device ID, application ID, etc.) and entity type (such as user, device, application, etc.).

[0111] S320: Based on the operation type in the data flow event list, construct directed relationship edges between entity nodes and data nodes. These directed relationship edges can be either access edges or transmission edges.

[0112] For example, an access edge indicates that an entity node accesses a data node; a transmission edge indicates that data in a data node is transmitted from one entity node to another. Directed relationship edges can be added between entity nodes and data nodes or between entity nodes based on the operation type of each data flow event. The attributes of the edge can include event_id (event ID), timestamp (event occurrence time), operation (operation type), and DataID (associated sensitive data ID).

[0113] For example, if a data stream event is that user_01 queries the file financial_report.pdf, then an access edge is established between the entity node (user_01) and the data node (financial_report.pdf). The edge attributes include event_id: event_001; timestamp: 2026-01-25 10:00:00; operation: Query; DataID: FILE_FILE_20260123_9f3c7a. If a data stream event is that user_01 sends the file financial_report.pdf to the application hr_system, then a transmission edge is established between the entity node (user_01) and the entity node (hr_system). The edge attributes include event_id: event_002; timestamp: 2026-01-25 10:05:00; operation: Send; DataID: FILE_FILE_20260123_9f3c7a.
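The two example events can be turned into nodes and edges with a small in-memory structure such as the one below. This is a sketch only: the class name, the event keys, and the rule "an event with a target object yields a transmission edge, otherwise an access edge" are assumptions, and the access edge is drawn from the entity node to the data node as one possible choice of direction.

```python
from dataclasses import dataclass, field

@dataclass
class FlowGraph:
    nodes: dict = field(default_factory=dict)  # node id -> attributes
    edges: list = field(default_factory=list)  # (source id, target id, attributes)

    def add_node(self, node_id: str, kind: str) -> None:
        # Create the node only if it is not already in the graph.
        self.nodes.setdefault(node_id, {"kind": kind})

    def add_event(self, event: dict) -> None:
        """Update the graph from one colored data stream event."""
        self.add_node(event["source"], "entity")
        self.add_node(event["data_id"], "data")
        attrs = {
            "event_id": event["event_id"],
            "timestamp": event["timestamp"],
            "operation": event["operation"],
            "DataID": event["data_id"],
        }
        target = event.get("target")
        if target:
            # Data moves from one entity to another: transmission edge.
            self.add_node(target, "entity")
            self.edges.append((event["source"], target, {"type": "transmission", **attrs}))
        else:
            # An entity touches a data node: access edge.
            self.edges.append((event["source"], event["data_id"], {"type": "access", **attrs}))
```

Feeding in the query event and the send event from the example produces three nodes (user_01, hr_system, and the data node) and one edge of each type.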

[0114] S330: Integrate all data nodes, entity nodes, and directed relationship edges between entity nodes and data nodes to obtain the data flow graph.

[0115] For example, a graph database can be selected to store the data flow graph. Graph databases include Neo4j, Amazon Neptune, and ArangoDB. Nodes and edges can be created in the graph database through steps S310 and S320. The nodes and edges can then be integrated using the graph query language of the graph database (such as Cypher for Neo4j) to form a complete data flow graph.

[0116] The data flow graph constructed through the above steps can intuitively display the flow and interaction of sensitive data between different entities, and can capture the path and pattern of data flow in real time, providing strong support for the tracking, monitoring and risk assessment of sensitive data.

[0117] In one possible implementation, the AI-based data security risk early warning method also includes:

[0118] S301, construct derived edges between data nodes based on the operation types in the data flow event list.

[0119] It can be understood that a derived edge means that due to data operations (such as exporting, saving as, converting, etc.), one data node (parent node) derives another data node (child node).

[0120] For example, the decision to establish a derived edge can be determined based on the operation type in the data flow event list. The conditions for establishing a derived edge can include: if the operation type is export (e.g., EXPORT), it means exporting data from a certain data source to generate a new data object; if the operation type is transformation (e.g., CONVERT), it means converting data of a certain format to another format to generate a new data object; if the operation type is filtering (e.g., FILTER), it means filtering certain specific data from the original data to form a new data object.

[0121] Parent and child data objects can be extracted from the data_object (description of the data object involved) of a data stream event, and a derived edge can be established between the data node corresponding to the parent data object and the data node corresponding to the child data object. The attributes of the edge can include event_id (event ID), timestamp (event occurrence time), operation (operation type), parent_DataID (parent data object ID), and derived_DataID (child data object ID).

[0122] S302, construct the storage edge between the data node and the entity node according to the operation type in the data flow event list.

[0123] It can be understood that a storage edge represents a data node being stored or residing in a specific location or system (such as a database, cloud storage, file system, etc.).

[0124] For example, if the operation type of a data stream event is STORE, UPLOAD, SAVE, etc., the target object and data object are extracted from the data stream event, and a storage edge is established between the data node corresponding to the data object and the entity node corresponding to the target object. The edge attributes may include event_id (event ID), timestamp (event occurrence time), operation (operation type), DataID (data object ID), and location_id (unique identifier of storage location).

[0125] The above process further refines the data flow graph, clearly reflecting the data generation, transmission, and storage processes. Derived edges help track inheritance and transformation relationships between data, while storage edges provide information about data storage locations. This allows the system to more accurately track the lifecycle of sensitive data, improving data security and transparency, and providing more detailed information for subsequent risk analysis and diffusion prediction.

[0126] S400: When an initial infection source is detected, iterative propagation occurs in the data flow graph, starting from the node corresponding to the initial infection source. Based on the basic risk value of each node, the final risk value of each node is calculated. Nodes with a final risk value greater than the initial threshold are added to the infection source set. The initial infection source characterizes the event that causes a security problem, and the infection source set includes the initial infection source.

[0127] For example, initial infection source detection can utilize Endpoint Detection and Response (EDR), User and Entity Behavior Analysis (UEBA), Network Traffic Analysis (NTA) platforms, and Data Loss Prevention (DLP). For instance, an EDR system detects ransomware encrypting files on an employee's computer and issues an alert with the following information: Endpoint IP: 10.1.1.2, Process: ransomware.exe, Behavior: File encryption. Upon receiving the EDR alert (via API or Syslog), the system parses the endpoint IP or hostname from the alert and locates the corresponding entity node in the data flow graph, marking that node as the infection source.

[0128] The risk value of the node corresponding to the initial source of infection can be set to 1. The first round of propagation starts from the initial source node, calculates the risk values of its neighboring nodes according to the propagation rules, and updates the risk values of all nodes directly connected to the source node. Based on the results of the first round of propagation, the risk values continue to propagate along the data flow path, updating the risk values of nodes in each round, and gradually expanding the scope of propagation until the risk values of all relevant nodes have been calculated. Iterative propagation stops when the preset maximum number of iterations is reached.

[0129] The final risk value of each node is compared with the initial threshold. If the final risk value of a node is greater than the initial threshold, the node is considered a new source of infection and added to the source of infection set. The initial threshold can be set relatively high (e.g., 0.9) to help control the source of infection and prevent the risk from spreading indefinitely.

[0130] This step effectively identifies the transmission path from the initial source of infection, helps analyze the spread of infection, and accurately determines which nodes become potential sources of infection.

[0131] In one possible implementation, in step S400, starting from the node corresponding to the initial source of infection, iterative propagation is performed in the data flow graph, and the final risk value of each node is calculated based on the basic risk value of each node, including:

[0132] An iterative algorithm similar to Breadth-First Search (BFS) can be used for iterative propagation in the data flow graph. The propagation rules can include the following. Directionality: risk can only propagate from the source node along outgoing edges (the direction of data outflow) to the target node. For example, if the source of infection is a terminal that sends data outwards, then downstream nodes receiving the data may be infected; conversely, if the terminal only receives data (incoming edges), risk will not propagate upstream (unless the upstream node also becomes an infection source for other reasons). Probabilistic propagation: propagation is not inevitable; a propagation probability is calculated based on the edge attributes and data level. Decay: as the number of propagation hops increases, the risk value gradually decreases, reflecting that the attacker needs to overcome more obstacles. Superposition: a node may receive risk from multiple infection sources, and its final risk value is the maximum of the values received.

[0133] S410, perform a first initialization on each node in the data flow graph. This first initialization involves setting the dynamic risk value of the node corresponding to the initial infection source to 1, adding the node corresponding to the initial infection source to a first queue, and setting the dynamic risk value of all other nodes except the node corresponding to the initial infection source to 0.

[0134] For example, suppose the data flow graph is $G(V, E)$, where $V$ represents the set of nodes, $v \in V$ represents a node, $E$ represents the set of edges, and $e \in E$ represents an edge; the initial set of infection sources is $S_0 \subseteq V$, and $s \in S_0$ represents a source of infection.

[0135] For all nodes other than the initial sources of infection, set $R_{dyn}(v) = 0$, indicating a dynamic risk value of 0; for all initial sources of infection, set $R_{dyn}(s) = 1$, indicating a dynamic risk value of 1. Add all sources of infection to the first queue Q1.

[0136] For each node in the first queue:

[0137] S420: Traverse the propagable edges of the current node in the data flow graph and calculate the propagation probability of each propagable edge. Here, a propagable edge represents the edge between the current node and its downstream nodes that can be propagated to.

[0138] For example, the propagation probability $P(e)$ can be determined by the operation type (different operations lead to different probabilities of risk propagation) and the data level (the higher the sensitivity, the greater the probability of risk propagation). A mapping table (configurable) can be pre-established for operation types. For example, the operation type mapping table is: COPY / EXPORT / USB_WRITE→0.9; SEND / UPLOAD / SYNC→0.8; DOWNLOAD→0.7; QUERY / READ→0.3; VIEW / PREVIEW→0.1. A mapping table (configurable) can also be pre-established for data levels. For example, the data level mapping table is: L4→1.0, L3→0.8, L2→0.5, L1→0.2.

[0139] The formula for calculating the propagation probability can be: $P(e) = W_{op}(e) \times W_{level}(e)$, where $W_{op}(e)$ represents the numerical value corresponding to the operation type in the attributes of edge e (via the operation type mapping table), and $W_{level}(e)$ represents the numerical value corresponding to the data level (via the data level mapping table). For access edges and storage edges, the data level in the node attributes of the data node can be used; the data node can be found by querying the DataID attribute of edge e. For derived edges, the data level of the data node can likewise be used.
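The two mapping tables can be combined as in the sketch below. It assumes that the propagation probability is the product of the operation-type value and the data-level value, which is one natural reading of the text but is not confirmed by it; the function and table names are hypothetical.

```python
# Operation type mapping table (values from the example in the text).
OPERATION_WEIGHT = {
    "COPY": 0.9, "EXPORT": 0.9, "USB_WRITE": 0.9,
    "SEND": 0.8, "UPLOAD": 0.8, "SYNC": 0.8,
    "DOWNLOAD": 0.7,
    "QUERY": 0.3, "READ": 0.3,
    "VIEW": 0.1, "PREVIEW": 0.1,
}

# Data level mapping table (values from the example in the text).
LEVEL_WEIGHT = {"L4": 1.0, "L3": 0.8, "L2": 0.5, "L1": 0.2}

def propagation_probability(operation: str, data_level: str) -> float:
    """Propagation probability of an edge, assumed to be the product of both weights."""
    return OPERATION_WEIGHT[operation] * LEVEL_WEIGHT[data_level]
```

Under this assumption, copying L4 data propagates with probability 0.9, while previewing L1 data propagates with probability 0.02.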

[0140] A node can be taken from Q1 (denoted u, the current node) and, following the propagation rules, its propagable edges e (u→v, where v represents the downstream node) are traversed. The propagation probability $P(e)$ of each propagable edge e is calculated according to the formula for calculating the propagation probability.

[0141] S430: Calculate the risk increment of each propagable edge based on the propagation probability of each propagable edge and the dynamic risk value of the current node.

[0142] For example, the risk increment brought by each propagable edge e can be calculated from the propagation probability of edge e, the dynamic risk value of the current node u, and the decay factor Decay (decay per hop; for example, 0.8).

[0143] S440, for each propagable edge: if the risk increment of the current propagable edge is greater than the dynamic risk value of the current downstream node, update the dynamic risk value of the current downstream node to the risk increment of the current propagable edge, and add the current downstream node to the first queue; take the risk value with the largest value between the basic risk value and the dynamic risk value of the current downstream node as the final risk value of the current downstream node. Here, the current downstream node is used to represent the downstream node of the current propagable edge.

[0144] For example, for each propagable edge e, the risk increment $\Delta R(e)$ of propagable edge e is compared with the dynamic risk value $R_{dyn}(v)$ of the current downstream node v. If $\Delta R(e) > R_{dyn}(v)$, the dynamic risk value of node v is updated, i.e. $R_{dyn}(v) = \Delta R(e)$, and node v is added to the first queue Q1 (v's risk has increased, and it may continue to spread downstream).

[0145] The basic risk value $R_{base}(v)$ of the current downstream node v is compared with its dynamic risk value $R_{dyn}(v)$, and the larger of the two is taken as the final risk value of the current downstream node v, i.e. $R_{final}(v) = \max(R_{base}(v), R_{dyn}(v))$. If multiple nodes propagate to node v, the maximum among node v's basic risk value and all of its dynamic risk values is still taken as node v's final risk value.

[0146] S450, loop until the first queue is empty, then stop iterative propagation.

[0147] For example, repeat steps S420, S430, and S440 until the first queue Q1 is empty (or the maximum depth / time window is exceeded), then stop the iterative propagation.

[0148] The above process enables dynamic risk assessment during the flow of sensitive data, as well as the ability to identify and track potential risk propagation paths in real time, thereby providing effective early warning, risk assessment, and decision support, and improving data security and defense capabilities.
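Steps S410 to S450 can be sketched as a queue-driven traversal. The following Python sketch is illustrative only: the function name `propagate` and the adjacency-list layout are hypothetical, edge probabilities are assumed to be precomputed, and the risk increment is assumed to be the product of the current node's dynamic risk value, the edge's propagation probability, and the decay factor.

```python
from collections import deque


def propagate(edges, base_risk, initial_sources, decay=0.8):
    """Iterative risk propagation over a data flow graph.

    edges: {u: [(v, p_uv), ...]} propagable edges with propagation probabilities.
    base_risk: {node: basic risk value}; must contain every node.
    Returns {node: final risk value}.
    """
    # First initialization: sources get dynamic risk 1, all other nodes 0.
    dynamic = {node: 0.0 for node in base_risk}
    for source in initial_sources:
        dynamic[source] = 1.0
    queue = deque(initial_sources)

    while queue:  # S450: stop when the first queue is empty
        u = queue.popleft()
        for v, p in edges.get(u, []):       # S420: traverse propagable edges
            increment = dynamic[u] * p * decay  # S430 (ASSUMED product form)
            if increment > dynamic[v]:      # S440: keep only the larger value
                dynamic[v] = increment
                queue.append(v)

    # Final risk value: the larger of basic and dynamic risk.
    return {n: max(base_risk[n], dynamic[n]) for n in base_risk}
```

Because each hop multiplies the value by a factor below 1, a node is only re-queued when its dynamic risk value strictly increases, so the loop also terminates on graphs that contain cycles.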

[0149] In one possible implementation, the propagable edges include at least:

[0150] S401, if the current node is an entity node, determine the access edge and / or transmission edge between the entity node and the data node as a propagable edge.

[0151] For example, if the current node is an entity node, the propagable edge can consist of access edges between the entity node and the data node, and / or transmission edges between the entity node and the data node.

[0152] S402, if the current node is a data node, determine the derived edges between data nodes as propagable edges, determine the storage edges between data nodes and entity nodes as propagable edges, and / or determine the reverse relationships of the access edges between entity nodes and data nodes as propagable edges.

[0153] For example, if the current node is a data node, the propagable edge can consist of derived edges between data nodes, and / or storage edges between data nodes and entity nodes, and / or reverse relationships of access edges between entity nodes and data nodes (i.e., allowing a jump back to the entity node from the data node along the reverse direction of the access edge during propagation). The reason for including the reverse relationship of access edges between entity nodes and data nodes in the propagable edge is that if some data is already within the scope of risk impact, then entities that have accessed that data can also be included in the scope of risk impact.

[0154] Through these steps, the appropriate propagable edges can be flexibly selected according to the type of the current node. The system can accurately simulate and calculate the propagation path of risks based on different types of nodes and edges in the data flow graph, making the data flow graph more accurate and operable in risk assessment, monitoring and data security management.
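The selection rules of S401 and S402 can be sketched as a small filter over typed edges. This Python sketch is an assumption-laden illustration: the `Edge` tuple, the `kind` labels, and the convention that access edges are stored in the entity-to-data direction are all hypothetical choices, not details given in the text.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    kind: str  # "access", "transmission", "derived" or "storage"


def propagable_edges(node, node_type, edges):
    """Return (downstream_node, edge) pairs that risk can follow from `node`.

    node_type: {node: "entity" or "data"}.
    ASSUMPTION: access edges are stored as entity -> data.
    """
    result = []
    for e in edges:
        if node_type[node] == "entity":
            # S401: access and transmission edges leaving the entity node.
            if e.src == node and e.kind in ("access", "transmission"):
                result.append((e.dst, e))
        else:
            # S402: derived and storage edges leaving the data node ...
            if e.src == node and e.kind in ("derived", "storage"):
                result.append((e.dst, e))
            # ... plus access edges followed in reverse, back to the entity.
            if e.dst == node and e.kind == "access":
                result.append((e.src, e))
    return result
```

Following access edges in reverse is what lets an entity that has already read a compromised data object fall inside the risk scope.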

[0155] S500: Simulate risk diffusion in the data flow graph based on the final risk value of each node and the set of infection sources, calculate the probability that each node will become a new source of infection within a future time window, and obtain the predicted infection probability of each node.

[0156] Understandably, risk diffusion simulations can use Monte Carlo simulations based on propagation dynamics to calculate the probability that each node will become a new source of infection within a future time window. Monte Carlo simulation is a numerical computation method based on random sampling, which can estimate solutions to complex problems using a large number of random samples.

[0157] For example, a future time window can be preset (e.g., 3600 seconds). The infection probability of every node in the infection source set can be set to 1 or to its final risk value, and that of all other nodes to 0. Because the infection probability of nodes spreads gradually over time, multiple rounds of propagation can be performed within the future time window. The first round starts from the infection probabilities of the nodes in the infection source set and uses a propagation model to calculate the infection probability of their adjacent nodes. If the infection probability of an adjacent node is greater than a random number r (r~U(0,1)), the node is considered infected and is added to the infection source set as a propagation source that continues to spread outwards. Each round updates the infection probability of each node based on the result of the previous round, and propagation ends once a certain threshold is reached. The propagation model relates the following quantities: the infection probability of adjacent node j at time t; the infection probability of adjacent node j at time t-1; the propagation weight between node i and node j (determined by the edge weight, such as the score corresponding to the data level); and the infection probability of node i (the current source of infection).

[0158] The simulation can be performed multiple times (e.g., 1000 times). Each simulation follows the above process, and the number of times each node is infected in all simulations is counted. The ratio of the number of times each node is infected in all simulations to the total number of simulations is calculated, which is the probability that each node will become a source of infection in the future time window (predicted infection probability).

[0159] This step enables the prediction of potential sources of infection, providing effective decision support for cybersecurity protection.

[0160] In one possible implementation, S500, based on the final risk value of each node and the set of infection sources, performs risk diffusion simulation in the data flow graph, calculates the probability that each node will become a new infection source within a future time window, and obtains the predicted infection probability of each node, including:

[0161] Single risk diffusion simulation:

[0162] S510, perform a second initialization on each node in the data flow graph. This second initialization involves setting the node state of each node, using the final risk value of all nodes except those corresponding to the infection source set as the current risk value, setting the current risk value of the node corresponding to the infection source set to 1, and adding the node corresponding to the infection source set as a source node to the second queue.

[0163] For example, node status can include HEALTHY (not infected), EXPOSED (exposed but not reaching the infection threshold), and INFECTED (infected).

[0164] Let the set of infection sources be $S$. The node state of all nodes in $S$ is initialized to INFECTED, and the current risk value of all nodes in $S$ is set to $R_{t_0}(v) = 1$ (the risk value of an infection source is set to 1 so that the infection source serves as the maximum-intensity benchmark for transmission intensity), where v represents a node and $t_0$ indicates the current moment. All nodes in $S$ are added to the second queue Q2 as source nodes. The node states of the remaining nodes can be initialized to HEALTHY, and the current risk value of each remaining node is set to its final risk value, i.e. $R_{t_0}(v) = R_{final}(v)$.

[0165] A time step $\Delta t$ can be preset. Given the future time window $T$, the maximum number of simulation steps is $M = T / \Delta t$. For k=1,...,M, the current simulation time is $t = t_0 + k\,\Delta t$, where $t_0$ is the current moment. At each time step, a diffusion attempt is performed on all nodes whose state is INFECTED (i.e., steps S520, S530, and S540).

[0166] For each source node in the second queue:

[0167] S520: Traverse the propagable edges of the source node in the data flow graph, and determine whether the propagation from the source node to the downstream node of each propagable edge is successful based on the propagation probability of each propagable edge.

[0168] For example, a node u can be taken from Q2, and, based on the propagation rules, the propagable edges e of u (u→v, where v represents the downstream node) are traversed. For each propagable edge e, a random number r~U(0,1) is generated using a pseudo-random number generator (PRNG). If the propagation probability $P(e)$ of propagable edge e is greater than r, i.e., $r < P(e)$, the propagation along propagable edge e is considered successful in this time step, node v is exposed, and the node state of node v can be changed to EXPOSED.

[0169] S530, if propagation is successful, calculate the risk increment of the downstream node that has successfully propagated based on the current risk value of the source node and the propagation probability of the propagable edge that has successfully propagated, and calculate the cumulative risk value of the downstream node that has successfully propagated based on the current risk value and risk increment of the downstream node that has successfully propagated.

[0170] For example, if the propagation is successful, the risk increment propagated from node u to node v can be calculated from the risk value of node u at the previous time step (initially set to 1), the propagation probability of the propagable edge, and the decay factor (decay per hop).

[0171] The current risk value of node v (i.e., the cumulative risk value of node v at the previous time step) can be added to the risk increment; the sum gives the cumulative risk value of node v at the current time step. If the sum exceeds 1, the cumulative risk value of node v is set to 1, i.e. $R_t(v) = \min(1, R_{t-1}(v) + \Delta R)$, where $\Delta R$ is the risk increment and $R_{t-1}(v)$ represents the cumulative risk value at the previous time step, initially set to the final risk value of node v.

[0172] S540, if the cumulative risk value of the downstream node that has successfully propagated is greater than the initial threshold, add the downstream node that has successfully propagated to the second queue and update the node status of the downstream node that has successfully propagated.

[0173] For example, the cumulative risk value of node v at the current time step can be compared with an initial threshold. If it is greater than the initial threshold, node v is added to the second queue Q2 to continue spreading outward as a propagation source. The node state of node v can be changed to INFECTED.

[0174] S550: When the maximum number of simulation steps is reached, a single risk diffusion simulation ends. The maximum number of simulation steps is the ratio between the future time window and the preset simulation time step.

[0175] For example, a single risk diffusion simulation ends when the maximum number of simulation steps M is reached. It is possible to identify which nodes in this simulation had an INFECTED state.

[0176] S560: Perform N risk diffusion simulations, count the number of times each node in the data flow graph has been infected, and calculate the predicted infection probability of each node based on the number of times each node has been infected and the number of risk diffusion simulations.

[0177] For example, N independent risk diffusion simulations (e.g., 10,000) can be run, the number of times each node is in the INFECTED state in the N simulations can be counted, and the predicted infection probability of each node can be calculated as $P(v) = n(v) / N$, where $P(v)$ represents the predicted infection probability of node v, and $n(v)$ represents the number of times node v is in the INFECTED state in the N simulations.

[0178] The above process can provide forward-looking risk assessment, help identify and prevent potential security threats in advance, provide scientific and real-time prediction of node risks in the data flow graph of the system, and enhance the initiative and response capabilities of data security management.
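Steps S510 to S560 amount to a Monte Carlo estimate. The Python sketch below is a simplified illustration under stated assumptions: function and parameter names are hypothetical, edge probabilities are precomputed, the risk increment is assumed to be the product of the source's previous-step risk value, the edge probability, and the decay factor, and nodes infected during a time step only start spreading at the next step.

```python
import random


def simulate_once(edges, final_risk, sources, threshold, steps, decay, rng):
    """Run one risk diffusion simulation; return the set of INFECTED nodes."""
    # Second initialization (S510).
    risk = dict(final_risk)
    state = {node: "HEALTHY" for node in final_risk}
    for s in sources:
        risk[s], state[s] = 1.0, "INFECTED"

    for _ in range(steps):                      # M = window / time step
        prev = dict(risk)                       # risk values of the previous step
        spreaders = [n for n in state if state[n] == "INFECTED"]
        for u in spreaders:
            for v, p in edges.get(u, []):
                if rng.random() >= p:           # S520: success only if r < p
                    continue
                increment = prev[u] * p * decay  # S530 (ASSUMED product form)
                risk[v] = min(1.0, risk[v] + increment)
                if risk[v] > threshold:         # S540: becomes a new source
                    state[v] = "INFECTED"
                elif state[v] == "HEALTHY":
                    state[v] = "EXPOSED"
    return {n for n, s in state.items() if s == "INFECTED"}


def predicted_infection_probability(edges, final_risk, sources, threshold,
                                    steps, runs, decay=0.8, seed=0):
    """S560: fraction of runs in which each node ends up INFECTED."""
    rng = random.Random(seed)  # fixed seed so results are reproducible
    counts = {node: 0 for node in final_risk}
    for _ in range(runs):
        for node in simulate_once(edges, final_risk, sources, threshold,
                                  steps, decay, rng):
            counts[node] += 1
    return {node: counts[node] / runs for node in final_risk}
```

The number of runs trades accuracy for cost: the standard error of each estimated probability shrinks roughly with the square root of the number of runs.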

[0179] S600: Determine a dynamic early warning threshold based on the final risk value and the predicted infection probability of each node, and compare the final risk value of each node with the dynamic early warning threshold; if the final risk value is greater than the threshold, an early warning is triggered, and an early warning report is generated for the node whose final risk value exceeds the threshold.

[0180] For example, a basic early warning threshold can be set for each node based on its data level, business importance, historical security events, etc. For example, the basic early warning threshold of a core database can be 0.5 (a lower threshold, more sensitive), and that of ordinary employee terminals can be 0.8 (a higher threshold, reducing false alarms).

[0181] For each node, the basic early warning threshold can be adjusted based on the node's final risk value to obtain the first warning threshold, and adjusted based on the node's predicted infection probability to obtain the second warning threshold. The minimum of the basic early warning threshold, the first warning threshold, and the second warning threshold can then be selected as the dynamic early warning threshold.

[0182] The final risk value of the node is compared with the dynamic early warning threshold. When the final risk value of the node exceeds the dynamic early warning threshold, an early warning is triggered, and the system generates an early warning report for the node. This report may include the early warning trigger time, the final risk value, the predicted infection probability, the dynamic early warning threshold at the time of triggering, a list of upstream infection sources that caused the risk of the node, the transmission path (from the infection source to the node), and sensitive data associated with the node.
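The comparison and the report contents listed above can be sketched as a small record type. This Python sketch is only an illustration: the class name `WarningReport`, the field names, and the helper `check_node` are hypothetical and are chosen to mirror the report items named in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class WarningReport:
    node: str
    trigger_time: datetime
    final_risk: float
    predicted_infection_probability: float
    dynamic_threshold: float
    upstream_sources: List[str] = field(default_factory=list)
    propagation_path: List[str] = field(default_factory=list)
    sensitive_data_ids: List[str] = field(default_factory=list)


def check_node(node: str, final_risk: float, probability: float,
               threshold: float, **context) -> Optional[WarningReport]:
    """Return a report if the final risk value exceeds the dynamic threshold."""
    if final_risk <= threshold:
        return None  # no early warning for this node
    return WarningReport(node, datetime.now(timezone.utc), final_risk,
                         probability, threshold, **context)
```

For instance, `check_node("core_db", 0.9, 0.7, 0.74, upstream_sources=["pc1"])` would yield a report, whereas a node whose final risk value is at or below its threshold yields `None`.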

[0183] This step improves the sensitivity and accuracy of early warnings through a dynamic early warning threshold mechanism, and can respond to potential security threats in real time, generating detailed early warning reports to provide data support for subsequent security analysis and response, thereby achieving more efficient security protection and rapid response to potential threats.

[0184] In one possible implementation, step S600, based on the final risk value of each node and the predicted infection probability of each node, determines a dynamic early warning threshold, including:

[0185] The calculation method for the dynamic warning threshold in step S600 may push the threshold too low, leading to an increased probability of false alarms. Therefore, a gating function can be used to clamp the threshold with the lower limit to prevent it from being lowered excessively.

[0186] For each node:

[0187] S610, calculate the first warning threshold based on the basic warning threshold and the final risk value.

[0188] For example, in addition to the basic warning threshold, a minimum warning threshold can also be set, so that the threshold is limited to the interval [minimum warning threshold, basic warning threshold], such as [0.35, 0.9].

[0189] The first warning threshold of step S600 can be changed to a lowering form with a floor, in which the basic warning threshold is lowered by at most a maximum decrease (for example, 0.3 means a maximum decrease of 30%), scaled by a gating function that only takes effect when the risk is truly high. The gating function can be chosen so that, when the final risk value is below the gating parameter, the threshold is not lowered; once the gating parameter is exceeded, the threshold decreases linearly. The gating parameter for this threshold adjustment can be, for example, 0.6.

[0190] S620, calculate the second warning threshold based on the basic warning threshold and the predicted infection probability.

[0191] For example, similarly, so that the second warning threshold is not driven to 0, it can be rewritten in the same form: the basic warning threshold is lowered by at most a maximum decrease, scaled by a gating function. The gating function can be chosen so that the threshold decreases linearly only when the predicted infection probability is greater than or equal to the gating parameter. The gating parameter for this threshold adjustment can be, for example, 0.5.

[0192] S630, determine the dynamic warning threshold based on the first warning threshold, the second warning threshold, and the basic warning threshold.

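[0193] For example, the dynamic early warning threshold can be determined by lower-limit clamping, i.e. $T_{dyn} = \mathrm{clamp}(\min(T_1, T_2),\ T_{min},\ T_{base})$, where $T_{dyn}$ indicates the dynamic early warning threshold, $T_1$ and $T_2$ are the first and second warning thresholds, $T_{min}$ is the minimum warning threshold, $T_{base}$ is the basic warning threshold, and $\mathrm{clamp}(x, a, b)$ restricts x to [a, b]. For example, with $T_{base}$=0.8, $T_{min}$=0.65, $T_1$=0.74, and $T_2$=0.768, the value $\min(T_1, T_2)$=0.74 is within the interval [0.65, 0.8], so $T_{dyn}$=0.74.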

[0194] These steps allow the warning threshold to be adjusted dynamically based on actual risk and infection probability, keeping the warning system sensitive while avoiding false alarms. Because the threshold is clamped at the lower limit, it is not pushed to an unreasonably low value even if the risk value and infection probability are extremely high, which keeps the warning system controllable and effective.
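Steps S610 to S630 can be sketched as follows. The exact formulas are not reproduced here, so this Python sketch assumes one plausible form: a gate that is 0 below the gating parameter and rises linearly to 1, and thresholds of the form base × (1 − maximum decrease × gate). The function names, the second maximum decrease of 0.2, and the inputs in the usage note are assumptions.

```python
def gate(x: float, theta: float) -> float:
    """ASSUMED gating function: 0 below theta, then linear up to 1 at x = 1."""
    return max(0.0, (x - theta) / (1.0 - theta))


def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def dynamic_threshold(t_base, t_min, final_risk, infection_prob,
                      alpha=0.3, beta=0.2, theta_r=0.6, theta_p=0.5):
    """Return (first threshold, second threshold, dynamic threshold).

    alpha / beta: maximum decreases (beta = 0.2 is an assumed value).
    theta_r / theta_p: gating parameters (0.6 and 0.5 are the text's examples).
    """
    t1 = t_base * (1.0 - alpha * gate(final_risk, theta_r))      # S610, ASSUMED form
    t2 = t_base * (1.0 - beta * gate(infection_prob, theta_p))   # S620, ASSUMED form
    return t1, t2, clamp(min(t1, t2), t_min, t_base)             # S630
```

With these assumed forms, a basic threshold of 0.8, a minimum threshold of 0.65, a final risk value of 0.7, and a predicted infection probability of 0.6 give thresholds of approximately 0.74 and 0.768 and a dynamic threshold of approximately 0.74, which matches the numerical example in the text.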

[0195] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0196] Corresponding to the data security risk warning method based on artificial intelligence described in the above embodiments, this application also provides a data security risk warning system based on artificial intelligence. Each unit of the system can implement each step of the data security risk warning method based on artificial intelligence. Figure 3 shows a structural block diagram of an artificial intelligence-based data security risk warning system provided in an embodiment of this application. For ease of explanation, only the parts related to the embodiments of this application are shown.

[0197] Referring to Figure 3, the system includes:

[0198] The sensitive data marking unit is used to obtain a metadata list and identify and mark sensitive data in the metadata list to obtain a sensitive data list.

[0199] The data coloring unit is used to capture data stream event sets through multi-source probes, match the data stream event sets with sensitive data in the sensitive data list, and color the data stream event sets according to the identifier of each sensitive data to obtain the data stream event list.

[0200] The graph construction unit is used to construct a data flow graph based on a list of sensitive data and a list of data flow events, and to quantify the basic risk value of each node in the data flow graph.

[0201] The risk value calculation unit is used to iteratively propagate through the data flow graph, starting from the node corresponding to the initial infection source, upon detection of the initial infection source. Based on the basic risk value of each node, it calculates the final risk value for each node. Nodes with a final risk value greater than the initial threshold are added to the infection source set. The initial infection source represents an event indicating a security problem, and the infection source set includes the initial infection source.

[0202] The infection probability prediction unit is used to simulate risk diffusion in the data flow graph based on the final risk value and the set of infection sources of each node, calculate the probability that each node will become a new source of infection within a future time window, and obtain the predicted infection probability of each node.

[0203] The early warning unit determines a dynamic early warning threshold based on the final risk value and the predicted infection probability of each node. It compares the final risk value of each node with the dynamic early warning threshold; if the final risk value is greater than the dynamic early warning threshold, an early warning is triggered, and an early warning report is generated for the node whose final risk value is greater than the dynamic early warning threshold.

[0204] It should be noted that the information interaction and execution process between the above-mentioned units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, which will not be repeated here.

[0205] This application also provides an electronic device. Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 4, the electronic device 6 of this embodiment includes: at least one processor 60 (only one is shown in Figure 4), at least one memory 61 (only one is shown in Figure 4), and a computer program 62 stored in the at least one memory 61 and executable on the at least one processor 60. When the processor 60 executes the computer program 62, it causes the electronic device 6 to implement the steps in any of the above-described embodiments of the data security risk warning method based on artificial intelligence, or causes the electronic device 6 to implement the functions of each unit in the above-described system embodiments.

[0206] For example, the computer program 62 may be divided into one or more modules / units, which are stored in the memory 61 and executed by the processor 60 to complete this application. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program 62 in the electronic device 6.

[0207] The electronic device 6 can be a desktop computer, laptop, handheld computer, or other computing device. The electronic device may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art will understand that Figure 4 is merely an example of the electronic device 6 and does not constitute a limitation on the electronic device 6. The device may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input / output devices, network access devices, buses, etc.

[0208] The processor 60 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device. The general-purpose processor may be a microprocessor or any conventional processor.

[0209] In some embodiments, the memory 61 may be an internal storage unit of the electronic device 6, such as a hard disk or memory of the electronic device 6. In other embodiments, the memory 61 may be an external storage device of the electronic device 6, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 6. Furthermore, the memory 61 may include both internal and external storage units of the electronic device 6. The memory 61 is used to store the operating system, applications, bootloader, data, and other programs, such as the program code of the computer program. The memory 61 can also be used to temporarily store data that has been output or will be output.

[0210] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0211] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A data security risk early warning method based on artificial intelligence, characterized in that it includes:
obtaining a metadata list, and identifying and marking sensitive data in the metadata list to obtain a sensitive data list;
a data stream event set is captured by a multi-source probe, the data stream event set is matched with sensitive data in the sensitive data list, and the data stream event set is colored according to the identifier of each sensitive data to obtain the data stream event list; wherein, the data coloring is used to characterize the writing of the identifier of sensitive data into the data stream event;
based on the sensitive data list and the data flow event list, a data flow graph is constructed, and the basic risk value of each node in the data flow graph is quantified;
when an initial source of infection is detected, iterative propagation is performed in the data flow graph, starting from the node corresponding to the initial source of infection; based on the basic risk value of each node, the final risk value of each node is calculated; nodes with a final risk value greater than the initial threshold are added to the infection source set; the initial source of infection is used to characterize an event that causes a security problem, and the infection source set includes the initial source of infection;
based on the final risk value of each node and the set of infection sources, a risk diffusion simulation is performed in the data flow graph to calculate the probability that each node will become a new infection source within a future time window, and to obtain the predicted infection probability of each node;
based on the final risk value of each node and the predicted infection probability of each node, a dynamic early warning threshold is determined; the final risk value of each node is compared with the dynamic early warning threshold, and if the final risk value is greater than the dynamic early warning threshold, an early warning is triggered, and an early warning report for the node whose final risk value is greater than the dynamic early warning threshold is generated;
the process of marking sensitive data includes:
all sensitive data is divided into multiple granularities to obtain the granularity corresponding to each sensitive data; wherein the granularity can be a file-level object, a table-level object, a field-level object, a record-level object, or a text fragment-level object;
based on the granularity corresponding to each sensitive data, at least one fingerprint strategy is selected from the fingerprint strategy set, and a fingerprint set for each sensitive data is generated based on the fingerprint strategy set for each sensitive data; wherein, the fingerprint strategy set includes an exact fingerprint strategy, an approximate fingerprint strategy, and a statistical fingerprint strategy;
an identifier for each piece of sensitive data is generated according to preset rules; wherein, the preset rules include granularity object, data source type, timestamp, and random unique factor;
the step of matching the data stream event set with the sensitive data in the sensitive data list, and coloring the data stream event set according to the identifier of each sensitive data to obtain the data stream event list, includes:
for each data stream event:
based on the granularity of the data object in the data stream event and the exact fingerprint strategy, an exact fingerprint of the event is generated, and the exact fingerprint of the event is matched with the exact fingerprints in the sensitive data list;
if the match is successful, the identifier of the matched sensitive data is written into the data stream event;
if the match fails, an approximate fingerprint of the event is generated according to the granularity of the data object in the data stream event and the approximate fingerprint strategy; the similarity between the approximate fingerprint of the event and each approximate fingerprint in the sensitive data list is calculated; the identifier of the sensitive data with the highest similarity and exceeding the similarity threshold is written into the data stream event;
if none of the similarities exceeds the similarity threshold, the statistical features of the data objects in the data stream event are extracted, and the statistical features of the data objects are matched with the statistical fingerprints in the sensitive data list to obtain a set of candidate sensitive data identifiers;
based on the time, source object, target object, and operation type in the data stream event, the identifier set of candidate sensitive data is subjected to context association verification to determine the identifier of the target sensitive data, and the identifier of the target sensitive data is written into the data stream event.

2. The data security risk early warning method based on artificial intelligence as described in claim 1, characterized in that the construction of a data flow graph based on the sensitive data list and the data flow event list includes:
each sensitive data in the sensitive data list is used as a data node in the data flow graph, and the source object and target object in the data flow event list are used as entity nodes in the data flow graph;
based on the operation types in the data flow event list, a directed relationship edge is constructed between the entity node and the data node; wherein, the directed relationship edge is an access edge or a transmission edge;
the data flow graph is obtained by integrating all data nodes, entity nodes, and directed relationship edges between entity nodes and data nodes.

3. The data security risk early warning method based on artificial intelligence as described in claim 2, characterized in that the method further includes:
based on the operation types in the data flow event list, constructing derived edges between data nodes;
based on the operation types in the data flow event list, constructing storage edges between data nodes and entity nodes.

4. The data security risk early warning method based on artificial intelligence as described in claim 3, characterized in that performing iterative propagation in the data flow graph starting from the node corresponding to the initial source of infection, and calculating the final risk value of each node based on the basic risk value of each node, includes:
performing a first initialization on each node in the data flow graph; wherein, the first initialization is to set the dynamic risk value of the node corresponding to the initial infection source to 1, add the node corresponding to the initial infection source to a first queue, and set the dynamic risk value of other nodes except the node corresponding to the initial infection source to 0;
for each node in the first queue:
traversing the propagable edges of the current node in the data flow graph and calculating the propagation probability of each propagable edge; wherein, the propagable edge is used to represent the edge between the current node and the downstream nodes that the current node can propagate to;
calculating the risk increment of each propagable edge based on the propagation probability of each propagable edge and the dynamic risk value of the current node;
for each propagable edge: if the risk increment of the current propagable edge is greater than the dynamic risk value of the current downstream node, updating the dynamic risk value of the current downstream node to the risk increment of the current propagable edge, and adding the current downstream node to the first queue; taking the larger of the basic risk value and the dynamic risk value of the current downstream node as the final risk value of the current downstream node; wherein, the current downstream node is used to represent the downstream node of the current propagable edge;
looping until the first queue is empty, at which point the iterative propagation stops.

5. The data security risk early warning method based on artificial intelligence as described in claim 4, characterized in that the propagable edges include at least: if the current node is an entity node, the access edges and/or transmission edges between the entity node and data nodes are determined as propagable edges; if the current node is a data node, the derived edges between data nodes are determined as propagable edges, the storage edges between data nodes and entity nodes are determined as propagable edges, and/or the reverse relationships of the access edges between entity nodes and data nodes are determined as propagable edges.
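
One possible way to select propagable edges by node type, as set out in claim 5, is sketched below. The edge representation `(source, target, kind)` and the stored edge directions (access and transmission edges from entity to data, derived edges from data to data, storage edges from data to entity) are assumptions for the example, as is the label `access_reverse`.

```python
def propagable_edges(node, node_kind, edges):
    """Return (downstream_node, edge_kind) pairs reachable from `node`.

    edges is a list of (source, target, kind) triples; node_kind is
    "entity" or "data". Edge directions are assumed, see the lead-in.
    """
    result = []
    for src, dst, kind in edges:
        if node_kind == "entity":
            # Entity nodes propagate along access and transmission edges.
            if src == node and kind in ("access", "transmission"):
                result.append((dst, kind))
        elif node_kind == "data":
            # Data nodes propagate along derived and storage edges ...
            if src == node and kind in ("derived", "storage"):
                result.append((dst, kind))
            # ... and back to any entity that accessed them.
            elif dst == node and kind == "access":
                result.append((src, "access_reverse"))
    return result


edges = [
    ("svc", "orders", "access"),
    ("svc", "orders_copy", "transmission"),
    ("orders", "orders_view", "derived"),
    ("orders", "db_host", "storage"),
]
from_entity = propagable_edges("svc", "entity", edges)
from_data = propagable_edges("orders", "data", edges)
```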

6. The data security risk early warning method based on artificial intelligence as described in claim 4, characterized in that performing a risk diffusion simulation in the data flow graph based on the final risk value of each node and the infection source set, and calculating the probability that each node will become a new infection source within a future time window to obtain the predicted infection probability of each node, includes:
single risk diffusion simulation: performing a second initialization on each node in the data flow graph, wherein the second initialization sets the node state of each node, takes the final risk value of each node outside the infection source set as its current risk value, sets the current risk value of each node in the infection source set to 1, and adds the nodes in the infection source set to a second queue as source nodes;
for each source node in the second queue: traversing the propagable edges of the source node in the data flow graph, and determining, based on the propagation probability of each propagable edge, whether propagation from the source node to the downstream node of that edge is successful; if the propagation is successful, calculating the risk increment of the downstream node based on the current risk value of the source node and the propagation probability of the successfully propagated edge; calculating the cumulative risk value of the downstream node based on its current risk value and the risk increment; and, if the cumulative risk value of the downstream node is greater than the initial threshold, adding the downstream node to the second queue and updating its node state;
ending the single risk diffusion simulation when the maximum number of simulation steps is reached, wherein the maximum number of simulation steps is the ratio of the future time window to the preset simulation time step;
performing N risk diffusion simulations, counting the number of times each node in the data flow graph is in the infected state, and calculating the predicted infection probability of each node based on that count and the number of risk diffusion simulations.
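
The repeated simulation of claim 6 resembles a Monte Carlo estimate, and can be sketched in Python as follows. This is a non-limiting illustration under several assumptions: one simulation step processes all source nodes currently queued, the risk increment is the product of the source node's current risk value and the edge's propagation probability, the cumulative risk value is the sum of current risk and increment capped at 1, and nodes already infected are not infected again. All function and parameter names are hypothetical.

```python
import random


def simulate_once(edges, final_risk, sources, threshold, max_steps, rng):
    """Run one diffusion simulation and return the set of infected nodes."""
    # Second initialization: sources get risk 1, others keep their final risk.
    current = dict(final_risk)
    for node in sources:
        current[node] = 1.0
    infected = set(sources)
    frontier = list(sources)

    for _ in range(max_steps):
        if not frontier:
            break
        next_frontier = []
        for node in frontier:
            for downstream, prob in edges.get(node, []):
                # Skip infected nodes; otherwise draw whether propagation succeeds.
                if downstream in infected or rng.random() >= prob:
                    continue
                increment = prob * current[node]
                current[downstream] = min(1.0, current[downstream] + increment)
                if current[downstream] > threshold:
                    infected.add(downstream)
                    next_frontier.append(downstream)
        frontier = next_frontier
    return infected


def predict_infection_probability(edges, final_risk, sources, threshold,
                                  window, time_step, n_runs, seed=0):
    """Estimate each node's infection probability over n_runs simulations."""
    rng = random.Random(seed)
    max_steps = int(window // time_step)  # future window / simulation step
    counts = {node: 0 for node in final_risk}
    for _ in range(n_runs):
        for node in simulate_once(edges, final_risk, sources,
                                  threshold, max_steps, rng):
            counts[node] += 1
    return {node: counts[node] / n_runs for node in final_risk}


edges = {"app": [("table_a", 0.8)], "table_a": [("table_b", 0.5)]}
final_risk = {"app": 1.0, "table_a": 0.8, "table_b": 0.6}
probs = predict_infection_probability(
    edges, final_risk, sources=["app"], threshold=0.7,
    window=24, time_step=6, n_runs=1000,
)
```

A larger `n_runs` reduces the sampling noise of the estimate at the cost of more simulation time; the fixed `seed` only serves to make the sketch reproducible.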

7. The data security risk early warning method based on artificial intelligence as described in claim 1, characterized in that determining the dynamic early warning threshold based on the final risk value and the predicted infection probability of each node includes, for each node: calculating a first warning threshold based on the basic warning threshold and the final risk value; calculating a second warning threshold based on the basic warning threshold and the predicted infection probability; and determining the dynamic early warning threshold based on the first warning threshold, the second warning threshold, and the basic warning threshold.
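
Claim 7 names the inputs of each threshold but not the formulas that combine them. The Python sketch below is therefore purely illustrative: the linear scaling by the hypothetical weights `alpha` and `beta`, the choice of the minimum as the combining rule, and the lower bound `floor` are all assumptions, chosen so that a node with higher risk or higher predicted infection probability receives a lower (more sensitive) threshold.

```python
def dynamic_threshold(base_threshold, final_risk, infection_prob,
                      alpha=0.5, beta=0.5, floor=0.1):
    """Illustrative per-node dynamic early warning threshold.

    alpha, beta and floor are hypothetical tuning parameters.
    """
    # First warning threshold: basic threshold lowered by the final risk value.
    first = base_threshold * (1.0 - alpha * final_risk)
    # Second warning threshold: basic threshold lowered by the predicted
    # infection probability.
    second = base_threshold * (1.0 - beta * infection_prob)
    # Assumed combination: the strictest of the three, bounded below by floor.
    return max(floor, min(first, second, base_threshold))


threshold = dynamic_threshold(base_threshold=0.8, final_risk=0.6, infection_prob=0.4)
```

With these inputs the first threshold is 0.8 × (1 − 0.3) and the second is 0.8 × (1 − 0.2), so the first one determines the result.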

8. A data security risk early warning system based on artificial intelligence, characterized in that the system is used for implementing the method as described in any one of claims 1 to 7 and comprises:
a sensitive data marking unit, used to obtain a metadata list and to identify and mark sensitive data in the metadata list to obtain a sensitive data list;
a data coloring unit, used to capture a data flow event set through multi-source probes, match the data flow event set with the sensitive data in the sensitive data list, and color the data flow event set according to the identifier of each item of sensitive data to obtain a data flow event list;
a graph construction unit, used to construct a data flow graph based on the sensitive data list and the data flow event list, and to quantify the basic risk value of each node in the data flow graph;
a risk value calculation unit, used to, when an initial infection source is detected, perform iterative propagation in the data flow graph starting from the node corresponding to the initial infection source, calculate the final risk value of each node based on the basic risk value of each node, and add nodes whose final risk value is greater than the initial threshold to the infection source set, wherein the initial infection source characterizes an event in which a security problem occurs, and the infection source set includes the initial infection source;
an infection probability prediction unit, used to perform a risk diffusion simulation in the data flow graph based on the final risk value of each node and the infection source set, calculate the probability that each node will become a new infection source within a future time window, and obtain the predicted infection probability of each node;
an early warning unit, used to determine a dynamic early warning threshold based on the final risk value of each node and the predicted infection probability of each node, compare the final risk value of each node with the dynamic early warning threshold, and, if the final risk value is greater than the dynamic early warning threshold, trigger an early warning and generate an early warning report for the nodes whose final risk value is greater than the dynamic early warning threshold.
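
The comparison performed by the early warning unit can be illustrated with a short Python sketch. It assumes that the final risk values and per-node dynamic early warning thresholds have already been computed; the report layout (a list of dictionaries sorted by descending risk) and the name `generate_warning_report` are hypothetical, since the claim does not prescribe a report format.

```python
def generate_warning_report(final_risk, dynamic_thresholds):
    """Return report entries for nodes whose final risk exceeds their threshold."""
    report = []
    for node, risk in final_risk.items():
        threshold = dynamic_thresholds[node]
        if risk > threshold:  # strict comparison, as in the claim wording
            report.append(
                {"node": node, "final_risk": risk, "threshold": threshold}
            )
    # Highest-risk nodes first (a presentation choice for this sketch).
    return sorted(report, key=lambda entry: entry["final_risk"], reverse=True)


final_risk = {"app": 1.0, "table_a": 0.8, "table_b": 0.6, "archive": 0.1}
dynamic_thresholds = {"app": 0.5, "table_a": 0.7, "table_b": 0.6, "archive": 0.8}
report = generate_warning_report(final_risk, dynamic_thresholds)
```

Here `table_b` is not reported because its final risk value equals, but does not exceed, its threshold.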