Dynamic assessment system and method for dynamic data privacy protection based on big data mining
By constructing a dynamic assessment model for the evolution of privacy risks based on the temporal characteristics of big data mining behavior, real-time capture of operational time-series data is achieved, risk transmission paths and evolution patterns are explored, and proactive warnings of dynamic data privacy risks are realized. This solves the problem of lagging assessment results in existing technologies, provides accurate decision support, and reduces the risk of privacy leakage.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHENGZHOU UNIVERSITY OF LIGHT INDUSTRY
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies cannot effectively capture how dynamic data privacy risks propagate and evolve as mining activities progress, so their assessment results are delayed and one-sided. They therefore cannot provide comprehensive, real-time decision support for protecting dynamic data, which makes privacy leaks more likely.
A dynamic assessment model for the evolution of privacy risks based on the temporal characteristics of big data mining behavior is constructed. The model captures real-time temporal data of all operational processes through a temporal data acquisition module, mines risk transmission paths and evolution patterns through a risk evolution analysis module, achieves advanced prediction of subsequent risks through a risk prediction module, and finally completes real-time assessment by integrating multi-dimensional results through a dynamic assessment module.
It enables real-time tracking, evolution analysis, and early warning of dynamic data privacy risks, ensuring comprehensive and timely assessment results, providing precise decision support for dynamic data privacy protection, and effectively reducing the risk of privacy leakage during big data mining.
Figure CN122241750A_ABST
Description
Technical Field
[0001] This invention relates to the field of big data privacy protection technology, specifically to a dynamic data privacy protection assessment system and method based on big data mining.
Background Technology
[0002] In the era of big data, the value of data is fully realized through mining and analysis, but the risk of privacy breaches during data mining is also becoming increasingly prominent. First, dynamic data refers to data whose content, structure, access permissions, and usage scenarios continuously change over time or business processes throughout its lifecycle. This is commonly found in real-time data streams and dynamically updated business data—core processing objects in big data mining. Second, dynamic data privacy protection addresses the characteristics of dynamic data by employing technical or management measures to prevent privacy breaches throughout the entire process of data generation, transmission, processing, and mining. Its core objective is to adapt to dynamic data changes to maintain continuous and effective privacy protection. Dynamic assessment of dynamic data privacy protection refers to the process of tracking the state changes of dynamic data and data mining operations in real time, continuously evaluating the effectiveness of privacy protection measures and potential privacy breach risks. Its core significance lies in the timely discovery of privacy protection vulnerabilities in dynamic scenarios, providing a scientific basis for adjusting privacy protection strategies, and ensuring the security and compliance of data privacy during big data mining.
[0003] However, existing technologies mostly focus on the static attributes of the data itself or on risk detection in a single mining operation, and fail to fully consider the temporal correlation between operations in the big data mining process. They cannot capture the propagation and evolution of privacy risks as mining progresses, nor can they predict the privacy leakage risks of subsequent mining stages in advance. This results in delayed and one-sided assessment results, fails to provide comprehensive and real-time decision support for the privacy protection of dynamic data, easily leads to privacy leaks, and restricts the safe and compliant application of big data mining technology. Therefore, it is of great significance to develop a dynamic assessment system and method for dynamic data privacy protection based on big data mining.
Summary of the Invention
[0004] The purpose of this invention is to overcome the shortcomings of existing technologies and provide a dynamic data privacy protection assessment system and method based on big data mining. It constructs a dynamic assessment model of privacy risk evolution based on the temporal characteristics of big data mining behavior, captures real-time temporal data of all operational processes using a time-series data acquisition module, mines risk transmission paths and evolutionary patterns using a risk evolution analysis module, and achieves advanced prediction of subsequent risks using a risk prediction module. Finally, a dynamic assessment module completes a real-time assessment by fusing multi-dimensional results, enabling real-time tracking, evolutionary analysis, and early warning of dynamic data privacy risks. This ensures comprehensive and timely assessment results, provides accurate decision support for dynamic data privacy protection, and effectively reduces the risk of privacy leakage during big data mining.
[0005] To solve the above technical problems, the present invention provides the following technical solution: a dynamic data privacy protection assessment system based on big data mining. The system comprises a time-series data acquisition module, a risk evolution analysis module, a risk prediction module, and a dynamic assessment module, connected in sequence by signal links.
The time-series data acquisition module collects real-time operation time-series data from all stages of big data mining, including data access, feature extraction, correlation analysis, model training, and result output.
The risk evolution analysis module receives the operation time-series data output by the time-series data acquisition module and constructs a risk propagation map from that data to mine the transmission paths and evolution patterns of privacy risks.
The risk prediction module receives the transmission paths and evolution patterns output by the risk evolution analysis module and uses a time-series prediction algorithm to predict, in advance, the privacy leakage risk of subsequent mining stages.
The dynamic assessment module integrates the transmission paths and evolution patterns output by the risk evolution analysis module with the prediction results output by the risk prediction module, constructs a real-time assessment system, and completes the dynamic assessment of dynamic data privacy protection.
[0006] Furthermore, the time-series data acquisition module performs the following operations when acquiring operation time-series data:
Based on the specific business scenarios of big data mining, the entire data flow is mapped out, and a complete list of key operations (data access, feature extraction, correlation analysis, model training, and result output) is generated, clarifying the boundaries and relationships of each operation.
For each key operational step in the list, real-time monitoring technology captures the identity of the entity performing the operation, the specific data identifier of the object being operated on, the timestamps of operation initiation and completion, the permission level of the operation, and the direction of data flow in each step.
The captured information is converted to a standardized data format with unified field names and data types, and duplicate and invalid information is removed. The resulting unified operation time-series data is transmitted to the risk evolution analysis module through a preset data transmission channel.
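The standardization and de-duplication step described above can be illustrated with a minimal Python sketch. The record layout, the alias table, and the function names are assumptions made for illustration; the source specifies only which kinds of information are captured, not a concrete schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional

# The five mining stages named in the text.
STAGES = ("data_access", "feature_extraction", "correlation_analysis",
          "model_training", "result_output")

@dataclass(frozen=True)
class OperationRecord:
    """Hypothetical unified record; field names are illustrative only."""
    subject_id: str        # identity of the entity performing the operation
    data_id: str           # identifier of the data object operated on
    stage: str             # one of STAGES
    started_at: datetime   # operation initiation timestamp
    finished_at: datetime  # operation completion timestamp
    permission_level: str  # permission level of the operation
    flow_target: str       # where the data flows next

# Assumed source-specific field names mapped to the unified names.
ALIASES = {"operator": "subject_id", "start": "started_at",
           "end": "finished_at", "level": "permission_level",
           "target": "flow_target"}

def normalize(raw: dict) -> Optional[OperationRecord]:
    """Return a unified record, or None if the raw row is invalid."""
    row = {ALIASES.get(key, key): value for key, value in raw.items()}
    try:
        record = OperationRecord(
            subject_id=str(row["subject_id"]),
            data_id=str(row["data_id"]),
            stage=str(row["stage"]),
            started_at=datetime.fromisoformat(row["started_at"]),
            finished_at=datetime.fromisoformat(row["finished_at"]),
            permission_level=str(row["permission_level"]),
            flow_target=str(row["flow_target"]),
        )
        if record.stage not in STAGES or record.finished_at < record.started_at:
            return None
    except (KeyError, ValueError, TypeError):
        return None  # missing field, bad timestamp, or incomparable times
    return record

def build_series(raw_rows: Iterable[dict]) -> List[OperationRecord]:
    """Drop invalid and duplicate rows, then order by start time."""
    seen = set()
    series = []
    for raw in raw_rows:
        record = normalize(raw)
        if record is None or record in seen:
            continue
        seen.add(record)
        series.append(record)
    series.sort(key=lambda r: r.started_at)
    return series
```

Using a frozen dataclass makes records hashable, so exact duplicates (for example, a row resent after a network retry) can be detected with a set.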
[0007] Furthermore, the risk evolution analysis module performs the following operations when mining risk transmission paths and evolution patterns:
An outlier detection algorithm preprocesses the received operation time-series data to remove invalid data caused by transmission delay or equipment failure, and missing data is filled in by interpolation.
Each operational step is treated as an independent node and the flow of data between steps as a link; the data transmission volume and interaction frequency of each link are marked to construct an initial risk propagation map.
Based on the entropy weight method, the operation frequency and data interaction intensity of each node in the initial map are weighted. A graph topology analysis algorithm then identifies key risk nodes with wide impact and frequent interactions, and the risk transmission intensity between nodes is calculated by a formula (the formula image and its symbols are not reproduced in this text). The terms of the formula are: the intensity of privacy risk transmission from one node to another; the node weighting coefficient, obtained by processing statistics on node operation frequency and data interaction intensity with the entropy weight method; the interaction adaptation coefficient of the link between the two nodes, determined as the ratio of the frequency of link risk occurrence to the interaction success rate, based on historical interaction data from the same big data mining scenario over the past six months; the real-time data interaction volume of the link; and the total number of nodes associated with the node.
The association paths between nodes are then traced, and the evolution pattern of privacy risk as operations progress (its transmission path, intensity change, and scope of spread) is summarized.
[0008] Furthermore, the risk prediction module performs the following operations when predicting, in advance, the privacy leakage risk of subsequent mining stages:
The path length, node density, and frequency of contact with sensitive data are extracted from the transmission path output by the risk evolution analysis module, and the risk growth rate and influence cycle parameters are extracted from the evolution pattern. These features and parameters are combined into the input variable set of the time-series prediction algorithm.
Historical risk data from the same or similar big data mining scenarios within the past year are selected, and abnormal samples are removed to construct a training dataset. Cross-validation is used to adaptively train the time-series prediction algorithm and adjust its parameters.
The input variable set is substituted into the trained algorithm, and the probability of privacy leakage risk at each subsequent mining stage is calculated by a formula (the formula image and its symbols are not reproduced in this text). The terms of the formula are: the privacy leakage probability of a given mining stage; the transmission path length feature value for that stage; the maximum transmission path length in historical data; the risk growth rate parameter for that stage; the maximum risk growth rate in historical data; a mining-stage risk probability term; and three feature weight coefficients, obtained by iterative training on the training dataset with cross-validation.
The risk level is labeled according to the preset risk classification standard, forming a prediction result that contains the risk probability, the risk level, and the key risk points, which is then transmitted to the dynamic assessment module.
[0009] Furthermore, the time-series data acquisition module is configured with a data synchronization interface based on a RESTful architecture. This interface establishes a long connection with the operation log system of the big data mining platform to achieve real-time communication. The interface supports high-concurrency data transmission and is configured with a data caching mechanism, which can temporarily store high-frequency acquired data in a short period of time. During the acquisition process, a resource usage monitoring unit is set up to monitor the CPU and memory usage of the big data mining platform in real time. When the load exceeds a preset threshold, the acquisition frequency is automatically reduced. After the load returns to normal, the original acquisition frequency is restored. At the same time, the integrity of the acquired data is verified by comparing the checksum to achieve data transmission verification.
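The load-aware acquisition frequency and the checksum comparison described above can be sketched as follows. This is an illustration under stated assumptions: the 0.8 load threshold, the doubling back-off, the 60-second cap, and the choice of SHA-256 are all hypothetical, since the source only says that a preset threshold and a checksum are used.

```python
import hashlib
import hmac

def next_interval(base_s: float, current_s: float, cpu: float, mem: float,
                  threshold: float = 0.8, backoff: float = 2.0,
                  max_s: float = 60.0) -> float:
    """Return the next polling interval in seconds.

    cpu and mem are utilisation fractions in [0, 1]. While either exceeds
    the (assumed) threshold, the interval grows, i.e. the acquisition
    frequency drops. Once load is back to normal the base interval,
    and thus the original frequency, is restored.
    """
    if max(cpu, mem) > threshold:
        return min(current_s * backoff, max_s)
    return base_s

def checksum(payload: bytes) -> str:
    """Checksum of a transmitted batch (SHA-256 is an assumed choice)."""
    return hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, expected: str) -> bool:
    """Compare the recomputed checksum with the one sent by the source."""
    return hmac.compare_digest(checksum(payload), expected)
```

For example, with a base interval of 1 s, a CPU reading of 0.9 lengthens the next interval to 2 s, and a later reading of 0.3 returns it to 1 s.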
[0010] Furthermore, the risk propagation map adopts a multi-level structure design consisting of an operation layer, a data layer, and a risk layer. The operation layer records in detail the operation instructions, execution time, operation results, and operation subject information of each link. The data layer marks the type, sensitivity level, data subject, and data format attribute information of the flowing data. The risk layer uses a five-color hierarchical method to map different levels of privacy risks, and each layer has an independent update trigger mechanism. The three-layer structure establishes the connection between the operation layer and the data layer through the operation ID, and establishes the connection between the data layer and the risk layer through the data sensitivity level. When the operation behavior or data attributes change during the big data mining process, the map automatically updates the information of the corresponding layer synchronously, completing the visualization and dynamic update of the risk transmission path.
[0011] Furthermore, the real-time assessment system of the dynamic assessment module includes a risk level classification unit and a risk warning unit. The risk level classification unit calculates the comprehensive risk score by a formula (the formula image and its symbols are not reproduced in this text). The terms of the formula are: the comprehensive risk score; a feature normalization function; the feature parameters of the transmission path; the feature parameters of the evolution pattern; the predicted risk probability; and three feature weight coefficients. The feature weight coefficients are determined using the analytic hierarchy process (AHP): three to five experts in big data privacy protection are invited to make pairwise comparisons of the importance of the transmission path, the evolution pattern, and the prediction result, and the coefficients are calculated from the resulting judgment matrix. Five risk levels are defined based on multi-dimensional indicators such as whether the influence range of the transmission path covers the sensitive data set, whether the strength of the evolution pattern exceeds the historical average, and whether the predicted risk probability is higher than a preset threshold. Each level corresponds to a clear risk description and processing priority. The risk warning unit outputs different forms of warning signals according to the defined risk levels, including system pop-up prompts, SMS notifications to designated personnel, and pushes through the platform's API. It supports customizable warning recipients and receiving methods, and records the warning issuance time, reception status, and subsequent processing results, forming a closed loop for warning processing.
[0012] Furthermore, the time-series prediction algorithm incorporates a scene recognition module. By parsing the data field identifiers during the big data mining process, it identifies whether the data contains highly sensitive privacy data such as ID card numbers, bank card numbers, and health records, thereby determining the sensitivity level of the mining scene. When a highly sensitive scene is identified, the algorithm automatically initiates a parameter optimization process, increasing the number of prediction iterations to a preset upper limit, adjusting the weight coefficients of the loss function, using gradient descent to accelerate the algorithm's convergence speed, and increasing the extraction dimensions of data features. When the scene sensitivity level is low, the algorithm's basic parameter configuration is maintained.
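A minimal sketch of the scene recognition step, assuming that field identifiers are plain strings and that the base and optimized parameter values shown here are placeholders; the source names the sensitive categories (ID card numbers, bank card numbers, health records) but gives no concrete identifiers or parameter values.

```python
# Hypothetical field identifiers for the highly sensitive categories
# named in the text.
SENSITIVE_FIELDS = {"id_card_number", "bank_card_number", "health_record"}

# Placeholder base configuration of the prediction algorithm.
BASE_PARAMS = {"max_iterations": 100, "feature_dims": 16}

def scene_sensitivity(field_names) -> str:
    """Return "high" if any field identifier is in the sensitive set."""
    normalized = {name.strip().lower() for name in field_names}
    return "high" if normalized & SENSITIVE_FIELDS else "low"

def select_params(field_names, iteration_cap: int = 500) -> dict:
    """Pick algorithm parameters according to scene sensitivity.

    High-sensitivity scenes raise the iteration count to the preset cap
    and widen feature extraction; low-sensitivity scenes keep the base
    configuration unchanged.
    """
    params = dict(BASE_PARAMS)
    if scene_sensitivity(field_names) == "high":
        params["max_iterations"] = iteration_cap
        params["feature_dims"] = BASE_PARAMS["feature_dims"] * 2
    return params
```

The loss-function weight adjustment mentioned in the text is left out here because the source does not describe the loss function.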
[0013] The dynamic data privacy protection assessment method based on big data mining is applicable to the aforementioned dynamic data privacy protection assessment system based on big data mining. The method includes the following steps:
S1. The time-series data acquisition module is started to collect real-time operation time-series data from all stages of big data mining, including data access, feature extraction, correlation analysis, model training, and result output.
S2. The risk evolution analysis module receives the operation time-series data and, after preprocessing, constructs a risk propagation map to mine the transmission paths and evolution patterns of privacy risks.
S3. The risk prediction module, based on the transmission paths and evolution patterns, uses a time-series prediction algorithm to predict, in advance, the privacy leakage risk of subsequent mining stages.
S4. The dynamic assessment module integrates the transmission paths, evolution patterns, and prediction results to construct a real-time assessment system and complete the dynamic assessment of dynamic data privacy protection.
[0014] Furthermore, in step S4, when the dynamic evaluation module fuses data, a weighted fusion algorithm is used. The weight of the transmission path is determined according to the number of sensitive data types involved. The more sensitive data types, the higher the weight. The weight of the evolution law is determined according to its historical verification accuracy. The higher the accuracy, the higher the weight. The weight of the prediction result is determined according to the current prediction accuracy of the algorithm. The higher the accuracy, the higher the weight. The comprehensive evaluation score is obtained through weighted calculation, and the risk level is corresponding to the preset score range.
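The weighted fusion and the mapping from score to level can be sketched as below. The inputs are assumed to be already normalized to [0, 1], and the score ranges are hypothetical, since the source says only that preset ranges are used. Level 1 is taken as the highest risk and Level 5 as the lowest, following the embodiment.

```python
from typing import Sequence

def fuse(path_score: float, evolution_score: float, predicted_prob: float,
         weights: Sequence[float]) -> float:
    """Weighted average of the three inputs; weights need not sum to 1."""
    w_path, w_evolution, w_prediction = weights
    total = w_path + w_evolution + w_prediction
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return (w_path * path_score
            + w_evolution * evolution_score
            + w_prediction * predicted_prob) / total

# Assumed lower bounds of each score range, highest risk first.
LEVEL_BOUNDS = ((0.8, 1), (0.6, 2), (0.4, 3), (0.2, 4))

def risk_level(score: float) -> int:
    """Map a fused score in [0, 1] to a level from 1 (highest) to 5."""
    for lower_bound, level in LEVEL_BOUNDS:
        if score >= lower_bound:
            return level
    return 5
```

Dividing by the weight sum keeps the fused score in [0, 1] even when the three weights are set independently, as they are in the text (by sensitive data type count, historical verification accuracy, and current prediction accuracy).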
[0015] Compared with existing technologies, this dynamic data privacy protection assessment system and method based on big data mining has the following advantages: This invention constructs a dynamic assessment model for the evolution of privacy risks based on the temporal characteristics of big data mining behavior. It relies on a time-series data acquisition module to capture real-time operational time-series data across all stages, a risk evolution analysis module to uncover risk transmission paths and evolutionary patterns, and a risk prediction module to predict subsequent risks. Finally, a dynamic assessment module completes a real-time assessment by fusing multi-dimensional results. This overcomes the limitations of existing static and single-stage assessment technologies, enabling real-time tracking, evolutionary analysis, and early warning of dynamic data privacy risks. It ensures comprehensive and timely assessment results, providing precise decision support for dynamic data privacy protection, effectively reducing the risk of privacy leaks during big data mining, and guaranteeing data privacy security and compliant use.
[0016] Other advantages, objectives and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination or study, or may be learned from the practice of the invention.
Attached Figure Description
[0017] Figure 1 is a schematic diagram of the structure of a dynamic data privacy protection assessment system based on big data mining; Figure 2 is a flowchart of a dynamic data privacy protection assessment method based on big data mining; Figure 3 is a flowchart of a dynamic data privacy protection assessment method based on big data mining.
Detailed Implementation
[0018] To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, the specific implementation, structure, features, and effects of the invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
[0019] Example 1. This embodiment applies to a medical big data mining platform in a large comprehensive hospital. This platform needs to continuously mine and analyze dynamic data such as patients' electronic medical records, examination and test results, health records, and treatment plans to support core business operations such as disease prediction, treatment plan optimization, and medical research. During the data mining process, the data undergoes multiple stages, including access, feature extraction, correlation analysis, model training, and result output. The data content is updated in real time as the patient's treatment progresses, and access permissions are dynamically adjusted according to the work needs of medical staff and researchers, representing a typical dynamic data scenario. Because medical data contains a large amount of highly sensitive privacy information such as ID card numbers, bank card numbers, and health records, leakage would seriously harm patients' rights. Therefore, a dynamic assessment system is needed to track privacy risks in real time, providing a scientific basis for adjusting privacy protection strategies and ensuring the security and compliance of the medical big data mining process.
[0020] Referring to Figure 1, Figure 2, and Figure 3, the specific implementation process of this embodiment is as follows: First, the time-series data acquisition module is activated. Based on the specific business scenarios of hospital medical big data mining, the entire data flow is mapped out: patient visits and data entry by medical staff, feature extraction by researchers, analysis of the correlation between symptoms and treatment plans, training of disease prediction models, and finally the output of research reports or clinical treatment recommendations. A complete list of key operations is generated, including data access, feature extraction, correlation analysis, model training, and result output, and the boundaries and relationships of each operation are clearly defined. For example, the data access step can only be performed by authorized personnel during working hours, and the feature extraction step must be explicitly linked to the original data storage step.
[0021] For each key operational step in the checklist, core information is captured through real-time monitoring technology: the identity information of the entity executing the operation includes the employee number of medical staff, the department and authorization number of the researcher; the specific data identifier of the operation object is the set of associated data corresponding to the patient's unique medical ID; the timestamps of the operation initiation and completion are accurately recorded to ensure the integrity of the timeline; the permission levels corresponding to the operation are divided into three categories: clinical diagnosis and treatment level, research level, and management level, clearly defining the data scope that can be operated at different levels; at the same time, the flow direction of data in each step is tracked, for example, from the original database to the feature extraction module, and then from the feature extraction module to the correlation analysis module.
[0022] The captured information is processed using a standardized data format, with unified field naming and data types. For example, the identity information field is uniformly named "Operating Subject Identifier," and the timestamp field adopts a standard time format. Duplicate records and invalid information caused by network fluctuations are removed, resulting in a structured operation time-series data. This module is configured with a RESTful architecture-based data synchronization interface, establishing a long connection with the operation log system of the hospital's medical big data mining platform for real-time communication. The interface supports high-concurrency data transmission and is configured with a data caching mechanism to temporarily store frequently collected medical and research operation data within a short period. During the collection process, a resource usage monitoring unit is set up to monitor the CPU and memory usage of the big data mining platform in real time. When the load exceeds a preset threshold, the collection frequency is automatically reduced, and the original collection frequency is restored after the load returns to normal. At the same time, the integrity of the collected data is verified by comparing checksums to ensure accurate data transmission. Finally, the standardized operation time-series data is transmitted to the risk evolution analysis module through a preset data transmission channel.
[0023] After receiving the operation time series data output by the time series data acquisition module, the risk evolution analysis module first uses an outlier detection algorithm to preprocess the data, eliminating invalid data caused by transmission delays and equipment failures. For missing values that occur during the transmission of patient diagnosis and treatment data, interpolation is used to supplement them, ensuring the integrity and availability of the data.
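The preprocessing described above can be illustrated with a short sketch. The source does not name the outlier detection algorithm or the interpolation scheme, so the modified z-score (median absolute deviation) test and linear interpolation used here are assumed stand-ins, and the 3.5 cut-off is a conventional default rather than a value from the source.

```python
from statistics import median
from typing import List, Optional

def remove_outliers(values: List[Optional[float]],
                    cutoff: float = 3.5) -> List[Optional[float]]:
    """Replace outliers with None using the modified z-score (MAD) test."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    med = median(present)
    mad = median([abs(v - med) for v in present])
    if mad == 0:
        return list(values)  # no spread, nothing can be flagged
    return [None if v is not None and 0.6745 * abs(v - med) / mad > cutoff
            else v for v in values]

def interpolate(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps linearly; edge gaps take the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled

# Example: one missing reading and one spike in a per-minute volume series.
cleaned = interpolate(remove_outliers([10, 12, None, 16, 500, 14, 13]))
```

The two functions are kept separate so that a removed outlier is treated exactly like a value that was missing on arrival.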
[0024] Using data access, feature extraction, correlation analysis, model training, and result output as independent nodes, the flow of data between different stages is taken as the link, and the data transmission volume and interaction frequency of each link are marked. For example, the daily interaction frequency between the feature extraction stage and the correlation analysis stage is relatively high, so the change in the transmission volume of this link needs to be marked in detail to construct an initial risk propagation map. This data map employs a multi-layered structure comprising an operation layer, a data layer, and a risk layer. The operation layer meticulously records the operational instructions, execution time, results, and operator information for each stage, such as training instructions, execution time, training accuracy, and the research team involved in the model training stage. The data layer labels the type, sensitivity level, data subject, and format attributes of the flowing data. Sensitivity levels are categorized into four levels—extremely high sensitivity, high sensitivity, medium sensitivity, and low sensitivity—based on the characteristics of medical data. Extremely high sensitivity data includes patient ID numbers and core content of health records. The risk layer uses a five-color grading system to map different levels of privacy risk. Each layer has an independent update trigger mechanism. The operation layer is linked to the data layer via operation ID, and the data layer is linked to the risk layer via data sensitivity level. When operational behaviors or data attributes change during the medical big data mining process, the map automatically updates the information at the corresponding level, enabling a visual representation and dynamic update of the risk transmission path.
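As a rough illustration of the node-and-link part of the map, the sketch below accumulates transmission volume and interaction frequency per link and enumerates candidate transmission paths. It is a simplified, single-layer view: the three-layer structure and its update triggers are not modelled, and the data structure and function names are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

@dataclass
class Link:
    volume: float = 0.0  # total data transmitted over the link
    frequency: int = 0   # number of observed interactions

def build_graph(flows: Iterable[Tuple[str, str, float]]) -> Dict[str, Dict[str, Link]]:
    """Build a directed graph from (source_stage, target_stage, volume) tuples."""
    graph: Dict[str, Dict[str, Link]] = defaultdict(dict)
    for source, target, volume in flows:
        link = graph[source].setdefault(target, Link())
        link.volume += volume
        link.frequency += 1
    return graph

def paths(graph, start: str, end: str, seen: Tuple[str, ...] = ()) -> List[List[str]]:
    """Enumerate simple (cycle-free) paths from start to end."""
    if start == end:
        return [[end]]
    found = []
    for nxt in graph.get(start, {}):
        if nxt in seen:
            continue
        for tail in paths(graph, nxt, end, seen + (start,)):
            found.append([start] + tail)
    return found
```

Path enumeration is exponential in the worst case, which is acceptable for a five-stage pipeline but would need pruning on a larger graph.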
[0025] Based on the entropy weight method, the operation frequency and data interaction intensity of each node in the initial map are weighted. A graph topology analysis algorithm is used to identify key risk nodes with wide impact and frequent interactions. In the specific implementation of this embodiment, the risk transmission intensity between nodes is calculated by a formula (the formula image and its symbols are not reproduced in this text). The terms of the formula are: the intensity of privacy risk transmission from one node to another; the node weighting coefficient; the interaction adaptation coefficient of the link between the two nodes, determined as the ratio of the frequency of link risk occurrence to the interaction success rate, based on historical interaction data from the same medical big data mining scenario over the past six months; the real-time data interaction volume of the link; and the total number of nodes associated with the node. By tracing the association paths between nodes with this formula, the transmission path, intensity changes, and scope-of-spread patterns of privacy risks as the mining operation progresses can be summarized. For example, the intensity of risk transmission from the data access stage to the feature extraction stage may be found to be relatively high, with the risk increasing significantly with the frequency of interaction.
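The entropy weight method referred to above is a standard technique, and a minimal sketch of it follows. The layout of the input (one row per node, one column per indicator such as operation frequency and interaction intensity) and the way a per-node score is derived from the indicator weights are assumptions; the source does not spell out either.

```python
import math
from typing import List

def _proportions(matrix: List[List[float]]) -> List[List[float]]:
    """Column-wise proportions p_ij = x_ij / sum_i x_ij (values must be >= 0)."""
    columns = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(columns)]
    if any(total <= 0 for total in totals):
        raise ValueError("every indicator column needs a positive sum")
    return [[row[j] / totals[j] for j in range(columns)] for row in matrix]

def entropy_weights(matrix: List[List[float]]) -> List[float]:
    """Indicator weights: low-entropy (more discriminating) columns weigh more."""
    rows = len(matrix)
    if rows < 2:
        raise ValueError("at least two nodes are required")
    props = _proportions(matrix)
    k = 1.0 / math.log(rows)
    divergences = []
    for j in range(len(matrix[0])):
        entropy = -k * sum(p[j] * math.log(p[j]) for p in props if p[j] > 0)
        divergences.append(max(0.0, 1.0 - entropy))
    total = sum(divergences)
    if total == 0:
        return [1.0 / len(divergences)] * len(divergences)
    return [d / total for d in divergences]

def node_scores(matrix: List[List[float]]) -> List[float]:
    """Assumed per-node coefficient: weighted sum of the node's proportions."""
    weights = entropy_weights(matrix)
    return [sum(w * p for w, p in zip(weights, row))
            for row in _proportions(matrix)]
```

An indicator that is identical across all nodes has maximum entropy and receives zero weight, so it does not influence which nodes are flagged as key risk nodes.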
[0026] After receiving the transmission path and evolution pattern output by the risk evolution analysis module, the risk prediction module extracts path length, node density, and sensitive data contact frequency features from the transmission path. Among them, the sensitive data contact frequency focuses on the number of times extremely sensitive data such as ID card numbers and health records are contacted. The risk growth rate and influence cycle parameters are extracted from the evolution pattern. These features and parameters are integrated into the input variable set of the time series prediction algorithm.
[0027] Historical risk data from the hospital's medical big data mining platform for similar scenarios within the past year were selected, including records of past privacy breaches and risk handling results. After removing abnormal samples caused by special medical events, a training dataset was constructed. Cross-validation was used to adaptively train the time-series prediction algorithm, continuously adjusting algorithm parameters to improve prediction accuracy. This time-series prediction algorithm incorporates a scene recognition module. By parsing data field identifiers during the big data mining process, it identifies data containing highly sensitive privacy data such as ID card numbers and health records. If the mining scenario is deemed highly sensitive, the algorithm automatically initiates a parameter optimization process, increasing the number of prediction iterations to a preset limit, adjusting the weight coefficients of the loss function, using gradient descent to accelerate convergence, and increasing the dimensionality of data feature extraction to ensure the accuracy of the prediction results.
[0028] In the specific implementation of this embodiment, the integrated set of input variables is substituted into the trained algorithm, and the probability of privacy leakage risk at each subsequent mining stage is calculated by a formula (the formula image and its symbols are not reproduced in this text). The terms of the formula are: the privacy leakage probability of a given mining stage; the transmission path length feature value for that stage; the maximum transmission path length in historical data; the risk growth rate parameter for that stage; the maximum risk growth rate in historical data; a mining-stage risk probability term; and three feature weight coefficients, obtained through multiple rounds of iterative training on the training dataset using cross-validation. Based on preset risk classification criteria, the calculated risk probabilities are labeled with risk levels, forming a prediction result that includes the risk probability, the risk level, and the key risk points. For example, the model training stage is labeled high-risk because of frequent data interaction, and this result is transmitted to the dynamic assessment module in real time.
[0029] The dynamic assessment module receives the transmission path and evolution pattern output from the risk evolution analysis module, and the prediction results output from the risk prediction module. It then uses a weighted fusion algorithm to fuse the three types of data. The weight of the transmission path is determined based on the number of sensitive data types involved. In this medical scenario, the transmission path involves multiple sensitive data types such as ID numbers, health records, and medical records, thus receiving a higher weight. The weight of the evolution pattern is determined based on its historical verification accuracy. Statistically, the historical verification accuracy of the evolution pattern is high in this scenario, corresponding to a larger weight. The weight of the prediction result is determined based on the algorithm's current prediction accuracy. Because the algorithm has undergone parameter optimization for highly sensitive scenarios, its prediction accuracy is high, hence it is given a correspondingly high weight.
[0030] In the specific implementation of this embodiment, the comprehensive risk score is calculated by a formula [formula omitted]. The terms of the formula are: the comprehensive risk score; the feature normalization function; the characteristic parameter of the transmission path; the characteristic parameter of the evolution pattern; the predicted risk probability; and three feature weight coefficients. The feature weight coefficients are determined using the analytic hierarchy process. Specifically, 3-5 experts in the field of medical big data privacy protection are invited to conduct pairwise comparisons of the importance of the transmission path, the evolution pattern, and the prediction result. After a judgment matrix is constructed, the feature weight coefficients are calculated from it.
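The analytic hierarchy process step described above can be illustrated with a minimal sketch. It assumes a single aggregated 3x3 judgment matrix on Saaty's 1-9 scale and uses the row geometric-mean approximation of the priority vector; the matrix values, the function names `ahp_weights` and `consistency_ratio`, and the choice of approximation are illustrative assumptions, not details given in this embodiment.

```python
import math

# Hypothetical pairwise judgment matrix (illustrative values, not from the source).
# Row/column order: transmission path, evolution pattern, prediction result.
# JUDGMENT[i][j] is the importance of factor i relative to factor j.
JUDGMENT = [
    [1.0, 2.0, 0.5],
    [0.5, 1.0, 1.0 / 3.0],
    [2.0, 3.0, 1.0],
]

def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def consistency_ratio(matrix, weights):
    """Saaty consistency ratio; values below about 0.1 are conventionally accepted."""
    n = len(matrix)
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    random_index = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # Saaty's random index table
    return consistency_index / random_index

weights = ahp_weights(JUDGMENT)
ratio = consistency_ratio(JUDGMENT, weights)
```

With several experts, one common option is to take the element-wise geometric mean of their individual matrices before computing the weights; whether the embodiment aggregates judgments this way is not stated.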
[0031] Based on multi-dimensional indicators such as whether the impact range of the transmission path covers sensitive data sets, whether the strength of the evolutionary pattern exceeds the historical average, and whether the predicted risk probability is higher than a preset threshold, five risk levels are defined. Each level corresponds to a clear risk description and processing priority. For example, Level 1 risk is extremely high, requiring immediate activation of the emergency response mechanism, while Level 5 risk is extremely low, requiring only routine monitoring. The risk warning unit outputs different forms of warning signals according to the defined risk levels. When a Level 3 or higher risk is detected in the model training stage, the system pops up a notification to the platform administrator, simultaneously notifies the hospital's information security manager via SMS, and pushes warning information through the API interface connected to the hospital's information system. Administrators can customize the warning recipients and reception methods. The system also records the warning issuance time, reception status, and subsequent processing results in detail, such as recording the data encryption and access restriction measures taken by the administrator after receiving the warning, forming a complete warning processing closed loop and ultimately achieving a dynamic assessment of dynamic data privacy protection.
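The tiered classification and warning loop described above might be sketched as follows. The numeric score cut-offs, the channel names, and the `WarningRecord` fields are hypothetical, since this embodiment defines five levels but gives no score ranges; "Level 3 or higher" is read here as Levels 1 to 3, with Level 1 the most severe.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical lower bounds for a score in [0, 1]; not specified in the source.
LEVEL_BOUNDS = [(0.8, 1), (0.6, 2), (0.4, 3), (0.2, 4)]

def risk_level(score):
    """Map a comprehensive risk score to Level 1 (extremely high) .. 5 (extremely low)."""
    for lower_bound, level in LEVEL_BOUNDS:
        if score >= lower_bound:
            return level
    return 5

def warning_channels(level):
    """Levels 1 to 3 trigger all three channels; Levels 4 and 5 only routine monitoring."""
    if level <= 3:
        return ["popup", "sms", "api_push"]
    return []

@dataclass
class WarningRecord:
    """One entry of the closed-loop warning log."""
    level: int
    channels: list
    issued_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    received: bool = False
    handling: str = ""  # e.g. "data encryption; access restriction"

def issue_warning(score, log):
    """Classify the score, record the warning, and return the record."""
    level = risk_level(score)
    record = WarningRecord(level, warning_channels(level))
    log.append(record)
    return record
```

An administrator's follow-up would then be written back into `received` and `handling` on the same record, which is what closes the loop.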
[0032] In summary, this embodiment applies a dynamic data privacy protection assessment system based on big data mining to a hospital's medical big data mining platform, achieving a full-process, real-time, and accurate assessment of medical dynamic data privacy risks. The time-series data acquisition module ensures comprehensive capture and reliable transmission of operational data at each stage of medical data mining. The risk evolution analysis module clearly identifies the transmission path and evolutionary patterns of privacy risks by constructing a multi-level risk propagation map and calculating risk transmission intensity. The risk prediction module, utilizing a time-series sequence prediction algorithm adapted to highly sensitive scenarios, enables proactive risk prediction in subsequent mining stages. The dynamic assessment module, through multi-dimensional data fusion and scientific weight allocation, completes the calculation of comprehensive risk scores and tiered early warning.
[0033] Example 2. This embodiment applies to a big data mining system for a large-scale integrated e-commerce platform. This platform needs to continuously mine dynamic data such as user browsing history, order data, payment information, logistics addresses, and after-sales interactions to support core business operations such as precision marketing, product recommendation, inventory optimization, and user profile construction. Platform data is characterized by high-frequency updates: user behavior changes in real time with shopping needs, and payment and logistics data are generated dynamically along with the transaction process. Furthermore, the data contains a large amount of highly sensitive private information, including mobile phone numbers, bank card numbers, and shipping addresses. If this data is leaked during the mining process, it will threaten users' financial security and damage the platform's reputation. Therefore, a dynamic evaluation system adapted to the e-commerce scenario is needed to prevent privacy leaks in real time and ensure the safe and compliant advancement of e-commerce big data mining.
[0034] See Figure 1, Figure 2, and Figure 3. The specific implementation process of this embodiment is as follows. The time-series data acquisition module is activated. Based on the aforementioned embodiments, and combined with the business characteristics of big data mining on e-commerce platforms, the entire data flow process is analyzed: user page access generates browsing data, order submission generates transaction data, payment completion records payment information, logistics delivery updates logistics data, and after-sales communication forms interaction data. A list of key operations, including data access, feature extraction, correlation analysis, model training, and result output, is generated, clarifying the boundaries and relationships of each operation in the e-commerce transaction chain.
[0035] For key operational steps in the list, core information is captured through real-time monitoring technology: the identity information of the entity executing the operation includes the platform operator's account, third-party service provider authorization ID, and user login identifier; the specific data identifier of the operation object is the unique order number, user ID, and logistics tracking number; the timestamps of operation initiation and completion are accurately recorded, with a focus on capturing the high-frequency operation sequence during promotional activities; the corresponding permission levels for operations are divided into transaction management, marketing analysis, and logistics integration levels, clearly defining the access permissions of each level to sensitive data such as payment information and logistics addresses; at the same time, the flow direction of data in each stage is tracked, such as from the user behavior collection module to the recommendation model training module, and from the order system to the data analysis module.
[0036] The captured information is processed using a standardized data format, with unified field naming and data types. Duplicate browsing records and invalid order submission data are removed to form a structured, consistent operational time-series data. This module is configured with a RESTful architecture-based data synchronization interface, establishing a long-lived connection with the e-commerce platform's transaction system, logistics system, user management system, and operation log system to achieve real-time communication. The interface supports high-concurrency data transmission and features a robust data caching mechanism to temporarily store frequently collected transaction and behavioral data during peak traffic periods like e-commerce promotions. A resource usage monitoring unit is set up during the collection process to monitor the CPU and memory usage of the big data mining platform in real time. When the load exceeds a preset threshold due to high traffic during promotions, the collection frequency of non-core data is automatically reduced, prioritizing the collection of critical data such as order payments. Once the load recovers, the original frequency is restored. Simultaneously, data integrity is verified through checksum comparison, and the standardized data is finally transmitted to the risk evolution analysis module.
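The load-aware collection and integrity check described above can be sketched roughly as below. The 0.8 load threshold, the backoff factor, the use of SHA-256 over canonical JSON, and all function names are assumptions made for illustration; this embodiment only states that the frequency of non-core data is reduced above a preset threshold and that data is verified by checksum comparison.

```python
import hashlib
import json

def next_interval(base_interval, cpu_load, mem_load, threshold=0.8,
                  is_core=False, backoff=4.0):
    """Return the sampling interval in seconds for one data source.

    Non-core sources are sampled less often while CPU or memory load exceeds
    the threshold; core sources (e.g. order payments) keep their base interval.
    Once load drops back, the base interval is returned again.
    """
    overloaded = max(cpu_load, mem_load) > threshold
    if overloaded and not is_core:
        return base_interval * backoff
    return base_interval

def checksum(record):
    """SHA-256 digest of a record serialized as canonical (sorted-key) JSON."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def verify(record, expected):
    """Compare the recomputed checksum with the one sent alongside the record."""
    return checksum(record) == expected
```

Sorting keys before hashing keeps the digest stable regardless of field order, which matters when the sender and receiver serialize the same record independently.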
[0037] After receiving the time-series data, the risk evolution analysis module uses an outlier detection algorithm to preprocess the data, removing invalid data caused by network congestion and invalid data caused by logistics system failures, and supplementing some missing data caused by high-frequency operations during the promotion period through interpolation.
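A minimal sketch of this preprocessing step follows. The embodiment does not name the outlier detection algorithm or the interpolation scheme, so the modified z-score (median absolute deviation) test, the cut-off `k=3.5`, and linear interpolation are assumptions chosen for illustration; missing samples are represented as `None`.

```python
import statistics

def remove_outliers(values, k=3.5):
    """Replace points whose modified z-score exceeds k with None."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median([abs(v - med) for v in present])
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    return [
        None if v is not None and 0.6745 * abs(v - med) / mad > k else v
        for v in values
    ]

def interpolate(values):
    """Fill None gaps linearly between the nearest known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return out
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            out[i] = values[right]   # leading gap: copy the first known value
        elif right is None:
            out[i] = values[left]    # trailing gap: copy the last known value
        else:
            t = (i - left) / (right - left)
            out[i] = values[left] + t * (values[right] - values[left])
    return out

series = [10, 11, None, 13, 500, 12, 11]  # hypothetical per-interval counts
cleaned = remove_outliers(series)
filled = interpolate(cleaned)
```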
[0038] Using each operational step of e-commerce data mining as an independent node, the flow of data between browsing, analysis, order placement, payment verification, and logistics synchronization is used as a link. The data transmission volume and interaction frequency of each link are marked, with a focus on marking the interaction data in the payment information flow link and the logistics address transmission link, thus constructing an initial risk propagation map. This map adopts a multi-level structure of operation layer, data layer, and risk layer. The operation layer records detailed operation information such as order placement instructions, payment operations, and logistics synchronization instructions, as well as the executing entity. The data layer marks the type and sensitivity level of the flowed data, setting payment information (bank card number) as extremely sensitive and logistics address (phone number) as highly sensitive. The risk layer uses a five-color grading method to map privacy risk levels. The three-layer structure is linked through operation ID and data sensitivity level. When a user's order status is updated or payment information is modified, the map automatically updates the corresponding layer information.
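The three-layer linkage described above could be modeled along the following lines. This is a simplified sketch: it keeps one data item per operation, and the colour names, the numeric sensitivity scale (1 = most sensitive), and the class and method names are hypothetical; only the linkage by operation ID and by data sensitivity level comes from the description.

```python
from dataclasses import dataclass

# Hypothetical five-colour scale keyed by sensitivity level (1 = most sensitive).
RISK_COLOURS = {1: "red", 2: "orange", 3: "yellow", 4: "blue", 5: "green"}

@dataclass
class Operation:
    """Operation layer: what was done and by whom."""
    op_id: str
    instruction: str
    executor: str

@dataclass
class DataItem:
    """Data layer: what kind of data flowed and how sensitive it is."""
    op_id: str        # link back to the operation layer
    data_type: str
    sensitivity: int  # link forward to the risk layer

class RiskMap:
    """Keeps the three layers consistent whenever one of them changes."""

    def __init__(self):
        self.operations = {}  # op_id -> Operation
        self.data_items = {}  # op_id -> DataItem
        self.risk = {}        # op_id -> colour

    def upsert(self, op, item):
        self.operations[op.op_id] = op
        self.data_items[op.op_id] = item
        self.risk[op.op_id] = RISK_COLOURS[item.sensitivity]

    def update_sensitivity(self, op_id, sensitivity):
        """A data-attribute change updates the data layer and the risk layer."""
        self.data_items[op_id].sensitivity = sensitivity
        self.risk[op_id] = RISK_COLOURS[sensitivity]

risk_map = RiskMap()
risk_map.upsert(Operation("op-1", "payment verification", "operator-7"),
                DataItem("op-1", "bank_card_number", 1))
risk_map.upsert(Operation("op-2", "logistics synchronization", "provider-3"),
                DataItem("op-2", "shipping_address", 2))
```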
[0039] Based on the entropy weight method, the operation frequency and data interaction intensity of each node in the initial map are weighted and assigned values. A graph topology analysis algorithm is used to identify key risk nodes such as the payment process and logistics information flow. In the specific implementation of this embodiment, a formula [formula omitted] is used to calculate the intensity of risk transmission between nodes. The correlation paths between nodes are then traced, and the transmission paths and evolution patterns of privacy risks as e-commerce data mining operations progress are summarized.
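The entropy weight method mentioned above has a standard form, sketched here for two indicators per node (operation frequency and data interaction intensity). The node values are hypothetical, and the final scoring step (a weighted sum of max-normalized indicators) is an illustrative assumption; it is not the risk transmission intensity formula of this embodiment, which is not reproduced in the text.

```python
import math

def entropy_weights(rows):
    """Entropy weight method: indicators that vary more across nodes get more weight.

    rows[i][j] is the (positive) value of indicator j at node i.
    """
    n, m = len(rows), len(rows[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in rows]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    total_divergence = sum(divergences)
    if total_divergence <= 1e-12:
        return [1.0 / m] * m  # all indicators uniform: fall back to equal weights
    return [d / total_divergence for d in divergences]

def node_scores(rows, weights):
    """Weighted sum of max-normalized indicators (assumes positive column maxima)."""
    maxima = [max(row[j] for row in rows) for j in range(len(weights))]
    return [sum(w * row[j] / maxima[j] for j, w in enumerate(weights)) for row in rows]

# Hypothetical nodes: (operation frequency, data interaction intensity).
NODES = [(120, 0.90), (30, 0.80), (15, 0.85), (200, 0.95)]
indicator_weights = entropy_weights(NODES)
scores = node_scores(NODES, indicator_weights)
```

In this example the frequency column is far more dispersed than the intensity column, so it receives nearly all of the weight, and the highest-frequency node ranks first.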
[0040] After receiving the transmission path and evolution pattern, the risk prediction module extracts path length, order density, payment information, and contact frequency features from the transmission path, and extracts risk growth rate and influence cycle parameters from the evolution pattern, integrating them into the input variable set of the time series prediction algorithm.
[0041] Historical risk data from the past year, including similar scenarios such as major promotional periods and daily operations of e-commerce platforms, were selected. This included past risks of payment information leakage and abnormal access records of order data. After removing abnormal samples caused by system upgrades, a training dataset was constructed. Cross-validation was used to adaptively train the time series prediction algorithm and adjust the algorithm parameters.
[0042] This time-series prediction algorithm incorporates a scene recognition module. By parsing data field identifiers, it identifies data containing highly sensitive private information such as bank card numbers and mobile phone numbers, and classifies the scenario as a high-sensitivity scene. It then automatically initiates a parameter optimization process: increasing the number of prediction iterations, adjusting the loss function weight coefficients, employing gradient descent to accelerate convergence, and increasing the dimensionality of data feature extraction. In the specific implementation of this embodiment, the input variable set is substituted into the trained algorithm, and a formula [formula omitted] is used to calculate the probability of privacy leakage in each subsequent mining stage. The risk level is marked according to the preset risk classification standard, forming a prediction result that includes the risk probability, risk level, and key risk points, which is then transmitted to the dynamic evaluation module.
[0043] After receiving relevant data, the dynamic evaluation module uses a weighted fusion algorithm to fuse the three types of data. The weight of the transmission path is determined based on the number of sensitive e-commerce data types involved, with transmission paths involving multiple types of sensitive data such as payment information and logistics addresses receiving higher weights; the weight of the evolution pattern is determined based on its historical verification accuracy in e-commerce scenarios; and the weight of the prediction result is determined based on the algorithm's current prediction accuracy in high-concurrency e-commerce scenarios.
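The weighted fusion described above can be sketched under stated assumptions. The description says only what each weight depends on, so the way the three raw factors are turned into weights (scaling the sensitive-type count by a hypothetical maximum, then normalizing the three factors to sum to 1), the min-max normalization, and all names and sample values below are illustrative.

```python
def fusion_weights(sensitive_type_count, max_type_count,
                   pattern_accuracy, prediction_accuracy):
    """Derive three fusion weights that sum to 1.

    Order: transmission path, evolution pattern, prediction result.
    More sensitive data types or higher accuracies yield larger weights.
    """
    raw = [
        sensitive_type_count / max_type_count,
        pattern_accuracy,
        prediction_accuracy,
    ]
    total = sum(raw)
    return [r / total for r in raw]

def min_max(value, low, high):
    """Clamp-and-scale a feature into [0, 1]."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))

def fused_score(weights, path_feature, pattern_feature, predicted_probability):
    """Weighted sum of three inputs, each already in [0, 1]."""
    parts = [path_feature, pattern_feature, predicted_probability]
    return sum(w * p for w, p in zip(weights, parts))

# Hypothetical inputs: 4 of 5 sensitive types on the path, 90% and 85% accuracy.
w = fusion_weights(4, 5, 0.90, 0.85)
score = fused_score(w, min_max(7, 0, 10), min_max(1.4, 0.0, 2.0), 0.62)
```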
[0044] In the specific implementation of this embodiment, a comprehensive risk score is calculated by a formula [formula omitted]. Based on whether the transmission path covers core sensitive data sets such as payment and logistics, whether the evolution pattern strength exceeds the historical average of e-commerce scenarios, and whether the predicted risk probability is higher than a preset threshold, five risk levels are defined, each with a specific handling method. For example, a high-risk level triggers the suspension of order transactions, while a low-risk level only initiates routine monitoring.
[0045] The risk warning unit outputs different forms of warning signals based on the risk level. In cases of high risk, a system pop-up alerts the risk control specialists, risk warning SMS messages are sent to users, and the e-commerce transaction system API is called to suspend the related order operations. In cases of medium to low risk, only operations management personnel are notified. The system supports customizable warning recipients and receiving methods, and records warning issuance times, reception status, and subsequent processing results in detail, forming a closed-loop warning processing mechanism. Ultimately, this achieves a dynamic assessment of data privacy protection in e-commerce scenarios.
[0046] In summary, this embodiment applies a dynamic data privacy protection assessment system based on big data mining to an e-commerce platform. Building upon the aforementioned embodiments, it adapts to the high-concurrency dynamic data characteristics of e-commerce scenarios, achieving precise control over privacy risks associated with various types of dynamic data, including user behavior, transaction information, and logistics data. The time-series data acquisition module optimizes high-concurrency data processing and resource scheduling mechanisms for scenarios such as e-commerce promotions, ensuring the real-time nature and completeness of key data collection. The risk evolution analysis module accurately identifies key risk nodes in core e-commerce processes such as payment and logistics, clearly outlining the transmission path of privacy risks. The risk prediction module, through algorithm optimization adapted to e-commerce scenarios, achieves proactive risk prediction under high-frequency operations during promotional periods. The tiered early warning system of the dynamic assessment module, linked with the transaction system, forms an immediate risk handling mechanism.
[0047] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.
Claims
1. A dynamic data privacy protection assessment system based on big data mining, characterized in that, The system includes: a time-series data acquisition module, a risk evolution analysis module, a risk prediction module, and a dynamic assessment module, with each module connected by signals in sequence; The time-series data acquisition module is used to collect real-time time-series data of all stages of big data mining, including data access, feature extraction, correlation analysis, model training, and result output. The risk evolution analysis module is used to receive the operation time series data output by the time series data acquisition module, and construct a risk propagation map based on the data to explore the transmission path and evolution pattern of privacy risks. The risk prediction module is used to receive the transmission path and evolution law output by the risk evolution analysis module, and to make an advance prediction of the privacy leakage risk in the subsequent mining stage based on the time series prediction algorithm. The dynamic evaluation module is used to integrate the transmission path and evolution law output by the risk evolution analysis module with the prediction results output by the risk prediction module to construct a real-time evaluation system to complete the dynamic evaluation of dynamic data privacy protection.
2. The dynamic data privacy protection assessment system based on big data mining according to claim 1, characterized in that, The time-series data acquisition module performs the following operations when acquiring operational time-series data: Based on specific business scenarios of big data mining, the entire data flow process is sorted out, and a complete list of key operations including data access, feature extraction, correlation analysis, model training, and result output is generated, clarifying the boundaries and relationships of each operation; For each key operational step in the list, real-time monitoring technology is used to capture the identity information of the entity performing the operation, the specific data identifier of the object being operated on, the timestamps of the operation initiation and completion, the corresponding permission level of the operation, and the direction of data flow in each step. The captured information is processed using a standardized data format, with unified field naming and data types, and duplicate and invalid information is removed to form a unified operational time series data, which is then transmitted to the risk evolution analysis module through a preset data transmission channel.
3. The dynamic data privacy protection assessment system based on big data mining according to claim 1, characterized in that the risk evolution analysis module performs the following operations when mining risk transmission paths and evolution patterns: an outlier detection algorithm is used to preprocess the received operation time-series data to remove invalid data caused by transmission delay or equipment failure, and missing data is supplemented by interpolation; each operational step is treated as an independent node, the flow of data between different steps is treated as a link, and the data transmission volume and interaction frequency of each link are marked to construct an initial risk propagation map; based on the entropy weight method, the operation frequency and data interaction intensity of each node in the initial map are weighted; a graph topology analysis algorithm is then used to identify key risk nodes with a wide impact and frequent interactions, and the risk transmission intensity between nodes is calculated by a formula [formula omitted], whose terms are: the intensity of privacy risk transmission from one node to another node; the weighting coefficient of a node; the interaction adaptation coefficient of the link between the two nodes; the real-time data interaction volume of that link; and the total number of nodes associated with a node; the connection paths between nodes are traced, and the evolution patterns of privacy risks as the operation progresses, including their transmission paths, intensity changes, and scope expansion, are summarized.
4. The dynamic data privacy protection assessment system based on big data mining according to claim 1, characterized in that the risk prediction module performs the following operations when predicting in advance the privacy leakage risk of subsequent mining stages: the path length, node density, and sensitive-data contact frequency features are extracted from the transmission path output by the risk evolution analysis module, the risk growth rate and influence cycle parameters are extracted from the evolution law, and these features and parameters are integrated into the input variable set of the time series prediction algorithm; historical risk data from the same or similar big data mining scenarios within the past year is selected, and abnormal samples are removed to construct a training dataset; cross-validation is used to adaptively train the time series prediction algorithm and adjust the algorithm parameters; the integrated set of input variables is substituted into the trained algorithm, and a formula [formula omitted] is used to calculate the probability of privacy leakage risk at each subsequent mining stage, the terms of the formula being: the probability of privacy leakage at the i-th mining stage; the characteristic value of the transmission path length corresponding to the i-th stage; the maximum transmission path length in the historical data; the risk growth rate parameter of the i-th stage; the maximum risk growth rate in the historical data; the risk probability of a mining stage [stage index omitted]; and three feature weight coefficients; the risk level is labeled according to the preset risk classification standards, forming a prediction result that includes the risk probability, risk level, and key risk points, which is then transmitted to the dynamic evaluation module.
5. The dynamic data privacy protection assessment system based on big data mining according to claim 1, characterized in that, The time-series data acquisition module is configured with a data synchronization interface based on a RESTful architecture. This interface establishes a long connection with the operation log system of the big data mining platform to achieve real-time communication. The interface supports high-concurrency data transmission and is configured with a data caching mechanism, which can temporarily store high-frequency acquired data in a short period of time. During the acquisition process, a resource usage monitoring unit is set up to monitor the CPU and memory usage of the big data mining platform in real time. When the load exceeds a preset threshold, the acquisition frequency is automatically reduced. After the load returns to normal, the original acquisition frequency is restored. At the same time, the integrity of the acquired data is verified by comparing the verification code to achieve data transmission verification.
6. The dynamic data privacy protection assessment system based on big data mining according to claim 1, characterized in that, The risk propagation map adopts a multi-level structure design consisting of an operation layer, a data layer, and a risk layer. The operation layer records in detail the operation instructions, execution time, operation results, and operation subject information of each step. The data layer marks the type, sensitivity level, data subject, and data format attribute information of the data being transferred. The risk layer uses a five-color hierarchical method to map different levels of privacy risks, and each layer has an independent update trigger mechanism. The three-layer structure establishes the connection between the operation layer and the data layer through the operation ID, and establishes the connection between the data layer and the risk layer through the data sensitivity level. When the operation behavior or data attributes change during the big data mining process, the map automatically updates the information of the corresponding layer, completing the visualization and dynamic update of the risk transmission path.
7. The dynamic data privacy protection assessment system based on big data mining according to claim 1, characterized in that the real-time assessment system of the dynamic assessment module includes a risk level classification unit and a risk warning unit; the risk level classification unit calculates the comprehensive risk score by a formula [formula omitted], whose terms are: the comprehensive risk score; the feature normalization function; the characteristic parameter of the transmission path; the characteristic parameter of the evolution pattern; the predicted risk probability; and three feature weight coefficients; to obtain the feature weight coefficients, 3-5 experts in the field of big data privacy protection are invited to conduct pairwise comparisons of the importance of the transmission path, the evolution pattern, and the prediction result, and after a judgment matrix is constructed, the feature weight coefficients are calculated; based on multi-dimensional indicators such as whether the influence range of the transmission path covers the sensitive data set, whether the strength of the evolution pattern exceeds the historical average, and whether the predicted risk probability is higher than the preset threshold, five risk levels are defined, each level corresponding to a clear risk description and processing priority; the risk warning unit outputs different forms of warning signals according to the defined risk levels, including system pop-up prompts, SMS notifications to designated personnel, and pushes through the platform's API interface; it supports custom warning recipients and receiving methods, and records the warning issuance time, reception status, and subsequent processing results to form a closed loop for warning processing.
8. The dynamic data privacy protection assessment system based on big data mining according to claim 1, characterized in that, The time series prediction algorithm has a built-in scene recognition module. By parsing the data field identifiers in the big data mining process, it identifies whether the data contains highly sensitive privacy data such as ID card numbers, bank card numbers, and health records, thereby determining the sensitivity level of the mining scene. When a highly sensitive scene is determined, the algorithm automatically starts the parameter optimization process, increases the number of prediction iterations to a preset upper limit, adjusts the weight coefficients of the loss function, uses gradient descent to accelerate the convergence speed of the algorithm, and increases the extraction dimension of data features. When the scene sensitivity level is low, the basic parameter configuration of the algorithm is maintained.
9. A dynamic data privacy protection assessment method based on big data mining, applicable to the dynamic data privacy protection assessment system based on big data mining as described in any one of claims 1-8, characterized in that, The method includes the following steps: S1. Start the time series data acquisition module to collect real-time time series data of all stages of big data mining, including data access, feature extraction, correlation analysis, model training, and result output. S2. The risk evolution analysis module receives operation time-series data, and after preprocessing, constructs a risk propagation map to explore the transmission path and evolution pattern of privacy risks. S3, the risk prediction module, based on the transmission path and evolution law, uses a time series prediction algorithm to make an advance prediction of the privacy leakage risk in the subsequent mining process; S4, the dynamic evaluation module integrates transmission paths, evolution patterns and prediction results to construct a real-time evaluation system and complete the dynamic evaluation of dynamic data privacy protection.
10. The dynamic data privacy protection assessment method based on big data mining according to claim 9, characterized in that, In step S4, when the dynamic evaluation module fuses data, a weighted fusion algorithm is used. The weight of the transmission path is determined according to the number of sensitive data types involved. The more sensitive data types, the higher the weight. The weight of the evolution law is determined according to its historical verification accuracy. The higher the accuracy, the higher the weight. The weight of the prediction result is determined according to the current prediction accuracy of the algorithm. The higher the accuracy, the higher the weight. The comprehensive evaluation score is obtained through weighted calculation, and the risk level is corresponding to the preset score range.