A big data-based data cleaning system

By standardizing multi-source heterogeneous data and constructing a weighted directed graph, and using dynamic graph neural networks and variational energy models for data cleaning, the problem of data quality control in dynamic spatiotemporal coupling environments is solved, automated data repair and adaptive updates are realized, and the accuracy of data processing and the robustness of the system are improved.

CN122241033A · Pending · Publication Date: 2026-06-19 · BEIJING UNIV OF TECH +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING UNIV OF TECH
Filing Date
2026-05-20
Publication Date
2026-06-19


Abstract

This invention relates to the field of big data processing and data quality control technology, specifically a data cleaning system based on big data, comprising: a data preprocessing center for collecting multi-source heterogeneous data, generating dimensionless statistical distribution values, and constructing a weighted directed graph; a state mapping unit for generating state space vectors; an energy verification unit for comparing the global consistency energy with a preset safety threshold, generating a logical conflict signal if the global consistency energy is greater than the threshold and a logical consistency signal if it is less than or equal to the threshold; a gradient repair unit for generating dimensionless repair values; and a feedback control unit for generating the final cleaning result from the dimensionless repair values, or for setting the multi-source heterogeneous data as the final cleaning result in response to the logical consistency signal. This invention solves the problem that existing technologies lack automatic correction methods based on system constraints, and improves the convergence speed and accuracy of the repair.

Description

Technical Field

[0001] This invention relates to the field of big data processing and data quality control technology, specifically a data cleaning system based on big data.

Background Technology

[0002] With the in-depth application of big data technology in the industrial internet and complex financial systems, the generation speed and correlation complexity of multi-source heterogeneous data have increased significantly. These data not only have huge differences in physical attributes, but are also usually in a dynamically changing spatiotemporal coupling environment, with close topological connections or business logic associations between nodes.

[0003] Currently, conventional data processing methods mainly focus on statistical analysis of single data points or on filtering based on static rules. However, when faced with systems that have complex graph structures, these methods often struggle to eliminate the influence of differing dimensions on computation and easily overlook the temporal evolution and spatial coupling characteristics between nodes. Detection based on isolated points has difficulty discovering deep logical conflicts hidden in data associations, and it lacks a closed-loop control mechanism that can automatically deduce and correct abnormal data from the inherent constraints of the system, so data quality control cannot meet the needs of large-scale dynamic systems. Therefore, how to build a cleaning system that eliminates the influence of dimensions, effectively verifies logical consistency, and achieves automatic repair based on energy gradients for multi-source heterogeneous data in a dynamic spatiotemporal coupling environment has become an urgent problem in this field.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention provides a data cleaning system based on big data. Specifically, the technical solution of this invention includes:

[0005] The data preprocessing center is used to collect multi-source heterogeneous data, standardize the multi-source heterogeneous data, generate dimensionless statistical distribution values, and construct a weighted directed graph.

[0006] The state mapping unit is used to input the dimensionless statistical distribution values and the weighted directed graph into a dynamic graph neural network to generate a state space vector.

[0007] The energy verification unit is used to substitute the state space vector into the variational energy model, calculate the global consistency energy, and compare the global consistency energy with a preset safety threshold. If the global consistency energy is greater than the preset safety threshold, a logical conflict signal is generated; if the global consistency energy is less than or equal to the preset safety threshold, a logical consistency signal is generated.

[0008] The gradient repair unit is used to respond to the logical conflict signal, calculate the energy gradient scalar, update the dimensionless statistical distribution value using the energy gradient scalar, and generate the dimensionless repair value.

[0009] The feedback control unit is used to respond to the dimensionless repair value, perform reverse mapping on the dimensionless repair value, and generate the final cleaning result; and is used to respond to the logical consistency signal to set the multi-source heterogeneous data as the final cleaning result.

[0010] Preferably, the standardization process is as follows:

[0011] Obtain the historical observation mean and standard deviation of the multi-source heterogeneous data;

[0012] Subtract the historical observation mean from the multi-source heterogeneous data to obtain the centered data;

[0013] Divide the centered data by the standard deviation to generate a dimensionless statistical distribution value;

[0014] The process of constructing a weighted directed graph is as follows:

[0015] Obtain the physical connection relationships or business association relationships between entities;

[0016] Generate an adjacency matrix based on physical connection relationships or business association relationships;

[0017] Set the adjacency matrix as a weighted directed graph.

[0018] Preferably, the process of generating the state space vector is as follows:

[0019] Obtain the input transformation matrix, the self-circulating recursive matrix, and the weights of neighboring nodes;

[0020] The input features are obtained by performing feature space enhancement on the dimensionless statistical distribution values using the input transformation matrix;

[0021] The historical evolution features are obtained by using the self-circulating recursive matrix to process the state space vector of the node at the previous time step;

[0022] By aggregating the state space vectors of neighboring nodes at the previous time step using the weights of neighboring nodes, spatial aggregation features are obtained.

[0023] The input features, historical evolution features, and spatial aggregation features are linearly superimposed to obtain superimposed features;

[0024] The hyperbolic tangent function is used to activate the superimposed features to generate a state space vector.

[0025] Preferably, the process for calculating the global consistency energy is as follows:

[0026] Obtain the logical mapping matrix and coupling weights;

[0027] The current state space vector of the neighboring nodes is transformed using a logical mapping matrix to obtain the predicted state vector;

[0028] Calculate the Euclidean distance between the current state space vector and the predicted state vector of the current node at the current moment;

[0029] Calculate the squared value of the Euclidean distance;

[0030] Multiplying the squared value by the coupling weight yields the node energy value;

[0031] The node energy values of all nodes are summed to generate the global consistency energy.

[0032] Preferably, the process for calculating the energy gradient scalar is as follows:

[0033] The state derivative is obtained by taking the partial derivative of the global consistency energy with respect to the state space vector;

[0034] Differentiating the hyperbolic tangent function yields the activation derivative;

[0035] The energy gradient scalar is generated by performing chain rule operations using the input transformation matrix, state derivative, and activation derivative.

[0036] The process of generating dimensionless repair values is as follows:

[0037] Get the preset repair step size;

[0038] The adjustment amount is obtained by multiplying the energy gradient scalar by the preset repair step size;

[0039] Subtract the adjustment amount from the dimensionless statistical distribution value to generate the dimensionless repair value.

[0040] Preferably, the process for obtaining the preset repair step size is as follows:

[0041] Obtain the extreme point information of the global consistency energy;

[0042] Based on the extreme point information, the basic step size is decayed using an optimization algorithm to generate a preset repair step size.

[0043] Preferably, the reverse mapping process for the dimensionless repair value is as follows:

[0044] Obtain the historical observation mean and standard deviation of multi-source heterogeneous data;

[0045] Multiply the dimensionless repair value by the standard deviation to obtain the restored fluctuation value;

[0046] The restored fluctuation value is added to the historical observation average to generate the final cleaning result.
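
The reverse mapping steps above simply invert the standardization. A minimal sketch in Python follows; the function name `reverse_map` and the sample numbers are illustrative assumptions, not taken from the patent text.

```python
def reverse_map(z_repaired, hist_mean, hist_std):
    """Map a dimensionless repair value back to physical units.

    Step 1: multiply by the historical standard deviation (restored fluctuation).
    Step 2: add the historical observation mean (final cleaning result).
    """
    restored_fluctuation = z_repaired * hist_std
    return restored_fluctuation + hist_mean


# Hypothetical node statistics: mean 10.0, standard deviation 2.0.
final_value = reverse_map(1.5, hist_mean=10.0, hist_std=2.0)
print(final_value)
```

Because the same mean and standard deviation are used in both directions, an unrepaired dimensionless value maps back to the original observation.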

[0047] Preferably, the feedback control unit is also used to perform the following steps:

[0048] Receive valid data confirmation commands from external sources;

[0049] In response to a valid data confirmation command, calculate the average cumulative energy value within a preset time period;

[0050] Obtain the preset forgetting factor;

[0051] The coupling weights are updated using the preset forgetting factor and the average cumulative energy value to generate updated coupling weights.
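
The text above names the inputs of the coupling weight update (a forgetting factor and an average cumulative energy value) but does not state the update formula here. The sketch below therefore assumes a common exponential-forgetting form, in which the old weight is blended with the new average; both this form and the names `average_energy` and `update_coupling_weight` are assumptions for illustration only.

```python
def average_energy(energy_samples):
    """Average of the energy values collected within the preset time period."""
    if not energy_samples:
        raise ValueError("no energy samples in the time period")
    return sum(energy_samples) / len(energy_samples)


def update_coupling_weight(old_weight, avg_energy, forgetting_factor):
    """ASSUMED update rule (not given in the source):
    new = lambda * old + (1 - lambda) * avg_energy.
    A forgetting factor near 1 keeps most of the old weight.
    """
    if not 0.0 <= forgetting_factor <= 1.0:
        raise ValueError("forgetting factor is assumed to lie in [0, 1]")
    return forgetting_factor * old_weight + (1.0 - forgetting_factor) * avg_energy


# Hypothetical values, run only after a valid data confirmation command arrives.
avg = average_energy([0.4, 0.6])
new_weight = update_coupling_weight(1.0, avg, forgetting_factor=0.9)
print(new_weight)
```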

[0052] Compared with the prior art, the present invention has the following beneficial effects:

[0053] 1. This system eliminates the impact of dimensional differences in data with different physical attributes on calculation by standardizing multi-source heterogeneous data and constructing a weighted directed graph; it organizes discrete data points into an associated network, effectively solving the shortcomings of traditional isolated point detection methods that ignore the connection strength and topology between nodes, ensuring that the subsequent cleaning process fully considers the inherent business logic relationship of the system and improves the accuracy of data processing.

[0054] 2. This system utilizes dynamic graph neural networks and variational energy models to accurately capture the temporal evolution and spatial coupling characteristics between nodes; by calculating global consistency energy, it quantifies the degree of conflict between data state and logical constraints, achieving effective verification of deep logical conflicts in a dynamic spatiotemporal coupling environment, and breaking through the limitations of conventional static rule filtering in dealing with complex dynamic relational systems.

[0055] 3. This system constructs a gradient repair mechanism based on the principle of minimum energy. It uses differential calculation to obtain the energy gradient and updates the abnormal data in reverse, realizing automatic deduction that conforms to global logical constraints. With the help of a dynamic step size decay strategy, it establishes an automated closed-loop control from error detection to intelligent repair, which solves the problem that existing technologies lack automatic correction methods for system constraints and improves the convergence speed and accuracy of repair.

[0056] 4. This system is equipped with a feedback control unit that includes a forgetting factor, which has adaptive learning capabilities. It can dynamically update the coupling weights based on valid external confirmation instructions and automatically adapt to parameter drift caused by physical system aging or environmental changes. This online update mechanism reduces the false alarm rate and enables the cleaning logic to be adaptively maintained in complex and ever-changing environments for a long time, ensuring the robustness of system operation.

Attached Figure Description

[0057] The present invention will be further explained below with reference to the accompanying drawings and embodiments:

[0058] Figure 1 is a structural diagram of the system of the present invention.

Detailed Implementation

[0059] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments.

[0060] Example 1:

[0061] Referring to Figure 1, a data cleaning system based on big data includes:

[0062] The data preprocessing center is used to collect multi-source heterogeneous data, standardize the multi-source heterogeneous data, generate dimensionless statistical distribution values, and construct a weighted directed graph.

[0063] The state mapping unit is used to input the dimensionless statistical distribution values and the weighted directed graph into a dynamic graph neural network to generate a state space vector.

[0064] The energy verification unit is used to substitute the state space vector into the variational energy model, calculate the global consistency energy, and compare the global consistency energy with a preset safety threshold. If the global consistency energy is greater than the preset safety threshold, a logical conflict signal is generated; if the global consistency energy is less than or equal to the preset safety threshold, a logical consistency signal is generated.

[0065] The gradient repair unit is used to respond to the logical conflict signal, calculate the energy gradient scalar, update the dimensionless statistical distribution value using the energy gradient scalar, and generate the dimensionless repair value.

[0066] The feedback control unit is used to respond to the dimensionless repair value, perform reverse mapping on the dimensionless repair value, and generate the final cleaning result; and is used to respond to the logical consistency signal to set the multi-source heterogeneous data as the final cleaning result.

[0067] This embodiment provides a data cleaning system based on big data. The system constructs a closed-loop automated data quality control environment, which can perform effective logical consistency verification and automatic repair of multi-source heterogeneous data in a dynamic spatiotemporal coupling environment.

[0068] The system mainly includes a data preprocessing center, a state mapping unit, an energy verification unit, a gradient repair unit, and a feedback control unit;

[0069] The data preprocessing center is equipped with communication interfaces for connecting to external sensor networks or business databases, and is used to collect multi-source heterogeneous data. Considering the differences in physical properties of different source data, the center performs standardization processing on the multi-source heterogeneous data to generate dimensionless statistical distribution values, eliminating the influence of dimensions on subsequent calculations. At the same time, the center constructs a weighted directed graph based on a pre-configured system topology or business logic relationship, organizing discrete data points into a graph-structured network.

[0070] The state mapping unit is connected to the data preprocessing center. It is used to input the dimensionless statistical distribution values and the weighted directed graph into a preset dynamic graph neural network. Through the forward propagation calculation of this network, the system captures the temporal evolution characteristics and spatial coupling characteristics between nodes and generates a high-dimensional state space vector that characterizes the current system state.

[0071] The energy verification unit receives the aforementioned state space vector and substitutes it into the variational energy model. Based on the principle of minimum energy in physics, this unit calculates the global consistency energy, which characterizes the degree of conflict between the current data state and the inherent logical constraints of the system. The unit compares the calculated global consistency energy with a preset safety threshold. If the global consistency energy is greater than the preset safety threshold, it indicates that there is a logical conflict in the current data, and the system generates a logical conflict signal. If the global consistency energy is less than or equal to the preset safety threshold, it indicates that the data conforms to the logical constraints, and the system generates a logical consistency signal.

[0072] The gradient repair unit is activated in response to a logical conflict signal. The unit uses differential calculation to obtain the gradient of the energy function with respect to the input data, that is, the energy gradient scalar, and uses it to update the original dimensionless statistical distribution value in the direction opposite to the gradient, generating a dimensionless repair value that reduces the global energy.

[0073] The feedback control unit is used for system output management. When a dimensionless repair value is received, the unit performs a reverse mapping to restore the dimensionless value to a value with actual physical meaning, generating the final cleaning result. When a logical consistency signal is received, the unit directly outputs the original multi-source heterogeneous data as the final cleaning result. Through the above settings, the embodiments of this application realize closed-loop control from error detection to automatic error repair, which is particularly suitable for processing industrial or financial data with complex coupling relationships.

[0074] Example 2:

[0075] The standardization process is as follows:

[0076] Obtain the historical observation mean and standard deviation of multi-source heterogeneous data;

[0077] Subtract the historical observation mean from the multi-source heterogeneous data to obtain the centered data;

[0078] Divide the centered data by the standard deviation to generate a dimensionless statistical distribution value;

[0079] The process of constructing a weighted directed graph is as follows:

[0080] Obtain the physical connection relationships or business association relationships between entities;

[0081] Generate an adjacency matrix based on physical connection relationships or business association relationships;

[0082] Set the adjacency matrix as a weighted directed graph.

[0083] This embodiment provides a detailed description of the standardization process and weighted directed graph construction process performed by the data preprocessing center;

[0084] During the standardization process, in order to ensure that data with different dimensions can be processed in the same vector space, the system obtains the historical observation mean $\mu_i$ and the historical standard deviation $\sigma_i$ of the multi-source heterogeneous data from the historical database. Here, the historical mean $\mu_i$ is the arithmetic mean of the valid historical data of node $i$ within a preset time window, and the historical standard deviation $\sigma_i$ is the standard deviation calculated on the same set of valid historical data. The system subtracts the historical mean $\mu_i$ from the currently collected multi-source heterogeneous data $x_i(t)$ to obtain the centered data, then divides the centered data by the historical standard deviation $\sigma_i$ to generate the dimensionless statistical distribution value $z_i(t)$. The calculation is shown in the following formula:

[0085] $$z_i(t) = \frac{x_i(t) - \mu_i}{\sigma_i}$$

[0086] where $z_i(t)$ is the standardized dimensionless value and $x_i(t)$ is the original physical observation;
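
The standardization step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `standardize`, the sample history, and the choice of the population standard deviation (`statistics.pstdev`) are assumptions, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def standardize(value, history):
    """Return (z, mu, sigma) where z = (value - mu) / sigma for one node.

    `history` holds the node's valid historical observations inside the
    preset time window.
    """
    mu = mean(history)
    sigma = pstdev(history)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("history has zero variance; cannot standardize")
    centered = value - mu    # centered data
    return centered / sigma, mu, sigma


# Hypothetical readings for a single node.
history = [8.0, 12.0]
z, mu, sigma = standardize(14.0, history)
print(z, mu, sigma)
```

Returning the mean and standard deviation alongside the dimensionless value keeps them available for the later reverse mapping.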

[0087] During the construction of the weighted directed graph, the system obtains the physical connection relationships or business association relationships between entities. For example, in a power system this relationship is represented by the topology connection table of the power grid; in a financial system it is represented by the transfer link table between accounts. Based on these relationships, the system generates an adjacency matrix whose elements represent the existence and strength of connections between nodes, and sets this adjacency matrix as the weighted directed graph, which defines the paths for subsequent information aggregation in the graph neural network. Considering that business relationships change dynamically, such as the addition of new transfer records or changes in device connections, the data preprocessing center is configured with a periodic reconstruction mechanism: at a preset time-window interval, or in response to a network topology change event, the construction process described above is re-executed to update the adjacency matrix. This ensures that subsequent state mapping and energy verification are always based on the latest system topology. In this way, this application ensures that the data cleaning process fully considers the system topology and avoids the limitations of isolated-point detection methods.
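
As a rough illustration of the graph construction described above, the sketch below builds a dense adjacency matrix from a list of directed, weighted relations. The `(source, target, strength)` tuple layout and the function name `build_adjacency` are assumptions; a real connection table or transfer link table would first need to be mapped to integer node indices.

```python
def build_adjacency(num_nodes, relations):
    """Build a weighted directed adjacency matrix.

    adj[src][dst] > 0 means an edge src -> dst exists, and the value is its
    connection strength; 0.0 means no connection.
    """
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for src, dst, strength in relations:
        adj[src][dst] = strength
    return adj


# Hypothetical transfer links between three accounts.
relations = [(0, 1, 0.8), (1, 2, 0.5), (2, 0, 0.3)]
adj = build_adjacency(3, relations)

# Periodic reconstruction: rebuild from the updated relation list whenever
# the time window elapses or a topology change event is received.
relations.append((0, 2, 0.1))
adj = build_adjacency(3, relations)
print(adj)
```

A dense matrix is used here only for readability; for large graphs a sparse edge structure would be the more natural choice.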

[0088] Example 3:

[0089] The process of generating the state space vector is as follows:

[0090] Obtain the input transformation matrix, the self-circulating recursive matrix, and the weights of neighboring nodes;

[0091] The input features are obtained by performing feature space enhancement on the dimensionless statistical distribution values using the input transformation matrix;

[0092] The historical evolution features are obtained by using the self-circulating recursive matrix to process the state space vector of the node at the previous time step;

[0093] By aggregating the state space vectors of neighboring nodes at the previous time step using the weights of neighboring nodes, spatial aggregation features are obtained.

[0094] The input features, historical evolution features, and spatial aggregation features are linearly superimposed to obtain superimposed features;

[0095] The hyperbolic tangent function is used to activate the superimposed features to generate a state space vector.

[0096] This embodiment provides a detailed description of the process by which the state mapping unit generates the state space vector;

[0097] To capture the temporal evolution of the data while also considering the influence of the spatial neighborhood, the system obtains a pre-trained input transformation matrix $W_{in}$, self-circulating recursive matrix $W_{self}$, and neighbor node aggregation weights $a_{ij}$. These parameters are pre-trained on the historical dataset with the Backpropagation Through Time (BPTT) algorithm, with the objective of minimizing the error in predicting the state at the next time step, which ensures that the generated vectors have temporal prediction capability. The system uses the input transformation matrix $W_{in}$ to perform feature space enhancement on the dimensionless statistical distribution value $z_i(t)$, mapping the scalar input to a feature vector to obtain the input features;

[0098] Meanwhile, to preserve the evolutionary inertia of the node itself, the system uses the self-circulating recursive matrix $W_{self}$ to apply a linear transformation to the node's state space vector at the previous time step, $h_i(t-1)$, obtaining the historical evolution features. Furthermore, to aggregate spatial information, the system uses the neighbor node weights $a_{ij}$ to compute a weighted sum of the state space vectors $h_j(t-1)$ of all neighboring nodes at the previous time step, obtaining the spatial aggregation features;

[0099] The system linearly superimposes the input features, historical evolution features, and spatial aggregation features to obtain the superimposed features, and applies the hyperbolic tangent function $\tanh$ as a nonlinear activation to generate the state space vector $h_i(t)$ at the current time step. The mathematical expression is as follows:

[0100] $$h_i(t) = \tanh\Big(W_{in}\, z_i(t) + W_{self}\, h_i(t-1) + \sum_{j \in \mathcal{N}(i)} a_{ij}\, h_j(t-1)\Big)$$

[0101] It should be noted that the aggregation weights $a_{ij}$ defined here are used for feature extraction in the graph neural network and are fixed once determined in the pre-training phase, whereas the coupling weights $w_{ij}$ in Example 4 below are used for logic verification in the energy function and can be dynamically adjusted based on system feedback;

[0102] where $\mathcal{N}(i)$ represents the set of neighboring nodes of node $i$. By introducing the self-circulating recursive matrix and the neighbor aggregation mechanism, the state space vector simultaneously encodes the node's own trend and the influence of its environment, providing rich semantic information for subsequent energy verification.
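
The state update of this embodiment (input features, plus historical evolution features, plus spatial aggregation features, passed through tanh) can be sketched for a single node as follows. This is a minimal pure-Python illustration; the parameter values are made up, whereas in the described system they would come from pre-training, and the names `update_state` and `matvec` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def update_state(z_i, h_prev_i, neighbor_prev_states, w_in, w_self, agg_weights):
    """One forward step for node i.

    z_i                  : scalar dimensionless value at the current step
    h_prev_i             : node i's state vector at the previous step
    neighbor_prev_states : neighbours' state vectors at the previous step
    w_in                 : input transformation (scalar -> feature vector)
    w_self               : self-circulating recursive matrix
    agg_weights          : one aggregation weight per neighbour
    """
    dim = len(h_prev_i)
    input_feat = [w * z_i for w in w_in]           # feature space enhancement
    hist_feat = matvec(w_self, h_prev_i)           # historical evolution
    spatial_feat = [0.0] * dim                     # spatial aggregation
    for a_ij, h_j in zip(agg_weights, neighbor_prev_states):
        for d in range(dim):
            spatial_feat[d] += a_ij * h_j[d]
    return [math.tanh(input_feat[d] + hist_feat[d] + spatial_feat[d])
            for d in range(dim)]


# Hypothetical two-dimensional example with one neighbour.
h_new = update_state(
    z_i=1.0,
    h_prev_i=[0.0, 0.0],
    neighbor_prev_states=[[0.2, 0.4]],
    w_in=[0.5, -0.5],
    w_self=[[0.1, 0.0], [0.0, 0.1]],
    agg_weights=[0.5],
)
print(h_new)
```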

[0103] Example 4:

[0104] The process of calculating the global consistency energy is as follows:

[0105] Obtain the logical mapping matrix and coupling weights;

[0106] The current state space vector of the neighboring nodes is transformed using a logical mapping matrix to obtain the predicted state vector;

[0107] Calculate the Euclidean distance between the current state space vector and the predicted state vector of the current node at the current moment;

[0108] Calculate the squared value of the Euclidean distance;

[0109] Multiplying the squared value by the coupling weight yields the node energy value;

[0110] The node energy values of all nodes are summed to generate the global consistency energy.

[0111] This embodiment provides a detailed description of the process by which the energy verification unit calculates the global consistency energy.

[0112] The purpose of calculating the global consistency energy is to quantify the deviation between the current system state and the ideal logical state. The system obtains the logical mapping matrix $M_{ij}$ and the coupling weights $w_{ij}$. The logical mapping matrix $M_{ij}$ is obtained as follows: a self-supervised reconstruction loss function is constructed, defined as the sum of squared reconstruction errors of historical normal data samples in the state space, i.e. the sum of $\lVert h_i - M_{ij} h_j \rVert^2$ over those samples. The loss function is iteratively optimized using the stochastic gradient descent (SGD) algorithm until convergence, yielding the logical mapping matrix $M_{ij}$ that minimizes the reconstruction error;

[0113] It should be noted that, considering the massive number of nodes in big data scenarios, the logical mapping matrix $M_{ij}$ in this embodiment adopts a parameter sharing mechanism in order to avoid parameter explosion and improve computational efficiency. Specifically, the system divides the edges into $R$ preset business relationship types, such as transfer relationship, guarantee relationship, and holding relationship, and each type $r$ corresponds to a shared mapping matrix $M_r$. The value of $M_{ij}$ then depends on the connection type between node $i$ and node $j$: if the edge $(i, j)$ belongs to the $r$-th relationship type, then $M_{ij} = M_r$. In this way, the number of model parameters is decoupled from the number of nodes, ensuring the scalability of the system on large-scale graph data.

[0114] The system uses the logical mapping matrix $M_{ij}$ to transform the state space vector $h_j(t)$ of the neighboring node at the current time, yielding the predicted state vector. This predicted state vector represents the theoretical state that node $i$ should have if the logical constraints are satisfied. The system then calculates the Euclidean distance between the actual state space vector $h_i(t)$ of the current node at the current moment and the predicted state vector;

[0115] To construct the energy function, the system calculates the square of the Euclidean distance and multiplies it by the coupling weight $w_{ij}$ to obtain the node energy value. The system sums the node energy values of all nodes in the entire graph to generate the global consistency energy $E$. The calculation formula is as follows:

[0116] $$E = \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij}\, \big\lVert h_i(t) - M_{ij}\, h_j(t) \big\rVert^2$$

[0117] Based on this, the system compares the calculated global consistency energy $E$ with the preset safety threshold $\theta$. This preset safety threshold $\theta$ is set based on the 95th percentile of the energy distribution during the system's historical normal operation. If $E > \theta$, it is determined that there is a logical conflict in the system.
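
A small sketch of the energy verification described in this embodiment is given below. It is an illustration under stated assumptions: edges are stored as `(i, j)` pairs meaning "j is a neighbour of i", the mapping matrices and coupling weights are looked up per edge in dictionaries, and the signal names are placeholder strings. None of these representations come from the patent text.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def global_consistency_energy(states, edges, mapping, coupling):
    """Sum over edges of coupling * ||h_i - M_ij h_j||^2."""
    energy = 0.0
    for i, j in edges:
        predicted = matvec(mapping[(i, j)], states[j])   # predicted state of i
        squared_distance = sum((a - b) ** 2
                               for a, b in zip(states[i], predicted))
        energy += coupling[(i, j)] * squared_distance    # node energy value
    return energy


def verify(energy, threshold):
    """Conflict if energy exceeds the threshold, consistency otherwise."""
    return "logical_conflict" if energy > threshold else "logical_consistency"


# Hypothetical two-node graph with identity mapping matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
states = {0: [1.0, 0.0], 1: [0.0, 1.0]}
edges = [(0, 1), (1, 0)]
mapping = {(0, 1): identity, (1, 0): identity}
coupling = {(0, 1): 0.5, (1, 0): 0.5}

energy = global_consistency_energy(states, edges, mapping, coupling)
signal = verify(energy, threshold=1.5)
print(energy, signal)
```

With relation-type parameter sharing, `mapping` would hold one matrix per relationship type and each edge would look up its type first; the per-edge dictionary is used here only to keep the sketch short.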

[0118] Example 5:

[0119] The process of calculating the energy gradient scalar is as follows:

[0120] The state derivative is obtained by taking the partial derivative of the global consistency energy with respect to the state space vector;

[0121] Differentiating the hyperbolic tangent function yields the activation derivative;

[0122] The energy gradient scalar is generated by performing chain rule operations using the input transformation matrix, state derivative, and activation derivative.

[0123] The process of generating dimensionless repair values is as follows:

[0124] Get the preset repair step size;

[0125] The adjustment amount is obtained by multiplying the energy gradient scalar by the preset repair step size;

[0126] Subtract the adjustment amount from the dimensionless statistical distribution value to generate the dimensionless repair value.

[0127] This embodiment provides a detailed explanation of the process by which the gradient repair unit calculates the energy gradient scalar and generates dimensionless repair values.

[0128] When the system detects a logical conflict, it needs to minimize the global energy by adjusting the input data; to do this, the system uses a chain rule to optimize the global consistency energy. Relative to the state space vector Take the partial derivative to obtain the state derivative. Simultaneously, the system applies the hyperbolic tangent function. Find the derivative, specifically the state derivative. Based on globally consistent energy function Taking the partial derivative with respect to the state vector, the calculation formula is as follows:

[0129]

[0130] This formula characterizes the weighted sum of the residuals of the current node's state vector deviating from the predicted state vectors derived from its neighbors; it should be understood that, in the rigorous mathematical derivation, the total graph energy... right The partial derivatives should also include the nodes. As a component of the energy terms of other nodes that are neighbors, i.e. The derivative; In order to reduce the complexity of real-time computation, this embodiment adopts a local gradient approximation strategy, which only uses the active constraint gradient calculated by the above formula for updating, which is sufficient to guide the system to converge to a low-energy state; The effectiveness of this local approximation strategy is based on the following engineering assumption: In each small repair iteration step... Next, node State changes affect neighboring nodes The reverse effect, i.e., the second-order effect of the passive constraint gradient is a higher-order small quantity relative to the active constraint gradient; in addition, the system performs closed-loop monitoring through the feedback control unit described below. Once it detects that this approximation leads to local energy oscillation, i.e., the total energy does not decrease after repair, it will automatically trigger the step size decay mechanism, thereby numerically ensuring the robustness of the iteration process; the activation derivative is obtained, i.e. ;

[0131] The system utilizes the input transformation matrix The state derivative and activation derivative are chained together to generate an energy gradient scalar. The gradient scalar indicates the direction in which the input data should change to reduce system energy. The gradient calculation formula is as follows:

[0132]
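The chained gradient described in [0131] can be sketched as follows, in notation assumed for illustration: $W_{in}$ is the input transformation matrix, $h_i = \tanh(z_i)$ is the state space vector of node $i$, $\partial E / \partial h_i$ is the state derivative, and $g_i$ is the energy gradient scalar.

$$
g_i = W_{in}^{\top}\left[\frac{\partial E}{\partial h_i} \odot \bigl(1 - h_i \odot h_i\bigr)\right]
$$

When the dimensionless input of a node is a single number, $W_{in}$ is a column vector and $g_i$ reduces to a scalar, which matches the term "energy gradient scalar".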

[0133] In the formula, the product operator between the state derivative and the activation derivative denotes the Hadamard product, i.e., the element-wise product of corresponding entries.

[0134] To generate the dimensionless repair value, the system obtains the preset repair step size and multiplies the energy gradient scalar by it to obtain the adjustment amount; the system then subtracts this adjustment amount from the original dimensionless statistical distribution value to generate the dimensionless repair value. The calculation formula is as follows:

[0135]
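In notation assumed for illustration, with $x_i$ the original dimensionless statistical distribution value, $\eta$ the preset repair step size, $g_i$ the energy gradient scalar, and $\hat{x}_i$ the dimensionless repair value, the update described in [0134] reads:

$$
\hat{x}_i = x_i - \eta\, g_i
$$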

[0136] By iteratively updating along the opposite direction of the gradient, this application can automatically deduce data values that conform to global logical constraints, thus realizing intelligent repair based on causal logic.
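The repair step described in [0128] to [0136] can be sketched for a single node as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the node energy takes the form given in claim 4, the recurrent and neighbor-aggregation inputs to the state update are lumped into a fixed `context` vector, the input is a scalar, and every function name and numeric value is hypothetical.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def state_vector(x, w_in, context):
    """h = tanh(w_in * x + context). `context` stands in for the recurrent
    and neighbor-aggregation terms, held fixed here (assumption)."""
    return [math.tanh(w * x + c) for w, c in zip(w_in, context)]

def node_energy(h, neighbors, mapping):
    """Sum over neighbors of w_ij * ||h - M h_j||^2 (form taken from claim 4)."""
    total = 0.0
    for weight, h_j in neighbors:
        predicted = matvec(mapping, h_j)
        total += weight * sum((a - b) ** 2 for a, b in zip(h, predicted))
    return total

def energy_gradient(x, w_in, context, neighbors, mapping):
    """Chain rule: state derivative, times activation derivative, times input weights."""
    h = state_vector(x, w_in, context)
    state_deriv = [0.0] * len(h)
    for weight, h_j in neighbors:
        predicted = matvec(mapping, h_j)
        for d in range(len(h)):
            state_deriv[d] += 2.0 * weight * (h[d] - predicted[d])
    activation_deriv = [1.0 - v * v for v in h]  # tanh'(z) = 1 - tanh(z)^2
    return sum(w * s * a for w, s, a in zip(w_in, state_deriv, activation_deriv))

def repair_step(x, gradient, step_size):
    """Move the dimensionless value against the gradient."""
    return x - step_size * gradient

# Hypothetical toy values for one node with two neighbors: (coupling weight, state).
W_IN = [0.8, -0.5]
CONTEXT = [0.1, 0.2]
MAPPING = [[1.0, 0.0], [0.0, 1.0]]
NEIGHBORS = [(1.0, [0.2, 0.1]), (0.5, [0.3, -0.1])]

x_old = 1.5
grad = energy_gradient(x_old, W_IN, CONTEXT, NEIGHBORS, MAPPING)
x_new = repair_step(x_old, grad, step_size=0.1)
energy_old = node_energy(state_vector(x_old, W_IN, CONTEXT), NEIGHBORS, MAPPING)
energy_new = node_energy(state_vector(x_new, W_IN, CONTEXT), NEIGHBORS, MAPPING)
print(energy_old, energy_new)
```

Because the neighbor states are frozen in this toy setup, the local gradient is exact for the node's own energy term, so a sufficiently small step lowers that term. In the full graph the passive constraint terms are ignored, which is why the text pairs this update with closed-loop energy monitoring.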

[0137] Example 6:

[0138] The process of obtaining the preset repair step size is as follows:

[0139] Obtain information on the extreme points of globally consistent energy;

[0140] Based on the extreme point information, the basic step size is decayed using an optimization algorithm to generate a preset repair step size.

[0141] This embodiment further optimizes the process of obtaining the preset repair step size;

[0142] To achieve rapid convergence in the early stages of repair and maintain repair accuracy in the later stages, the system employs a dynamic step-size strategy. The system acquires the extreme point information of the global consistency energy, specifically the current energy value and its rate of change. Based on this extreme point information, the system uses an optimization algorithm to decay the base step size, generating the preset repair step size;

[0143] In this embodiment, an exponential decay strategy is adopted: as the number of iterations increases, or as the global consistency energy approaches its minimum value, the step size is updated according to the following rule:

[0144]
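An exponential decay rule consistent with [0143] and [0145] can be written as follows, in notation assumed for illustration: $\eta_0$ is the base step size, $\gamma$ is the preset decay rate, $k$ is the internal iteration counter, and $\eta_k$ is the repair step size used at iteration $k$.

$$
\eta_k = \eta_0\,\gamma^{k}, \qquad 0 < \gamma < 1
$$

Equivalently, each iteration multiplies the previous step size by the decay rate: $\eta_{k+1} = \gamma\,\eta_k$.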

[0145] Here, the preset decay rate takes a value between 0 and 1, for example 0.99, and k is an internal iteration counter for the single repair process triggered by the current logical conflict signal. When the system generates a new logical conflict signal and starts a new round of repair, the counter is automatically reset to 0; as the number of repair iterations for the same data point increases, the counter increments accordingly. In this way, the system uses large steps for rapid correction when far from the steady state and small steps for fine-tuning when approaching it, effectively avoiding numerical oscillations and improving the accuracy of the repair results.
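The counter-based schedule described above can be sketched as follows. This is an illustrative sketch only: the class name `RepairStepSchedule`, its methods, and the base step size of 0.1 are hypothetical, and the sketch covers only the iteration-count trigger, not decay driven by the energy value or its rate of change.

```python
class RepairStepSchedule:
    """Exponentially decaying repair step size with a per-repair counter."""

    def __init__(self, base_step, decay_rate):
        if not 0.0 < decay_rate < 1.0:
            raise ValueError("decay_rate must lie strictly between 0 and 1")
        self.base_step = base_step
        self.decay_rate = decay_rate
        self.k = 0  # internal iteration counter for the current repair process

    def reset(self):
        """Call when a new logical conflict signal starts a new repair round."""
        self.k = 0

    def next_step(self):
        """Return base_step * decay_rate**k, then advance the counter."""
        step = self.base_step * self.decay_rate ** self.k
        self.k += 1
        return step


schedule = RepairStepSchedule(base_step=0.1, decay_rate=0.99)
first_round = [schedule.next_step() for _ in range(3)]
schedule.reset()  # a new logical conflict signal arrives
restarted = schedule.next_step()
print(first_round, restarted)
```

Resetting the counter restores the full base step, so each new conflict starts with coarse corrections before the steps shrink again.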

[0146] Example 7:

[0147] The reverse mapping process for the dimensionless repair value is as follows:

[0148] Obtain the historical observation mean and standard deviation of multi-source heterogeneous data;

[0149] Multiply the dimensionless repair value by the standard deviation to obtain the restored fluctuation value;

[0150] The restored fluctuation value is added to the historical observation mean to generate the final cleaning result.

[0151] This embodiment provides a detailed description of the reverse mapping process performed by the feedback control unit;

[0152] For the repaired data to be recognized and used by business systems, it must be restored to physical space. The system obtains the historical observation mean and the historical standard deviation of the multi-source heterogeneous data corresponding to this node. It multiplies the dimensionless repair value generated by the gradient repair unit by the historical standard deviation to obtain the restored fluctuation value, then adds the restored fluctuation value to the historical observation mean to generate the final cleaning result. The calculation formula is as follows:

[0153]
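In notation assumed for illustration, with $\hat{x}_i$ the dimensionless repair value, $\sigma_i$ the historical standard deviation, $\mu_i$ the historical observation mean, and $y_i$ the final cleaning result, the reverse mapping described in [0152] reads:

$$
y_i = \hat{x}_i\,\sigma_i + \mu_i
$$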

[0154] This step is the exact inverse of the standardization performed in the preprocessing stage, ensuring that the physical meaning of the data is fully restored.
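The inverse relationship between standardization (claim 2) and reverse mapping (claim 7) can be checked with a short round trip. The function names and sample numbers below are hypothetical and serve only to illustrate the arithmetic.

```python
def standardize(value, mean, std):
    """Preprocessing: center by the historical mean, scale by the standard deviation."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std


def reverse_map(repair_value, mean, std):
    """Feedback control: restore a dimensionless value to physical units."""
    return repair_value * std + mean


# Hypothetical sensor reading of 75.0 with historical mean 60.0 and std 5.0.
z = standardize(75.0, mean=60.0, std=5.0)
restored = reverse_map(z, mean=60.0, std=5.0)
print(z, restored)
```

An unrepaired value passes through the round trip unchanged, so any difference between input and output is attributable to the gradient repair alone.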

[0155] Example 8:

[0156] The feedback control unit is also used to perform the following steps:

[0157] Receive valid data confirmation commands from external sources;

[0158] In response to a valid data confirmation command, calculate the average cumulative energy value within a preset time period;

[0159] Obtain the preset forgetting factor;

[0160] The coupling weights are updated using the preset forgetting factor and the average cumulative energy value to generate updated coupling weights.

[0161] This embodiment provides a detailed description of the weight update process performed by the feedback control unit;

[0162] To address parameter drift caused by physical system aging or environmental changes, the system possesses adaptive learning capabilities. Receiving an external valid data confirmation command indicates that the current high-energy state is caused by changes in the system's own characteristics rather than by data errors. In response to this command, the system calculates the average cumulative energy value over a preset time period;

[0163] The system obtains the preset forgetting factor, which controls the speed at which the historical weights adapt to the new environment; its value is usually set to a small positive number, such as 0.01. The system uses the preset forgetting factor and the average cumulative energy value to update the coupling weights, treating the average cumulative energy value as the objective function to be optimized. The update rule is as follows:

[0164]
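A gradient-descent form consistent with [0163] and [0165] can be written as follows, in notation assumed for illustration: $w_{ij}$ is the coupling weight, $\lambda$ is the preset forgetting factor, and $\bar{E}$ is the average cumulative energy value over the preset time period.

$$
w_{ij} \leftarrow w_{ij} - \lambda\,\frac{\partial \bar{E}}{\partial w_{ij}}
$$

If the energy takes the form given in claim 4, $\bar{E}$ is linear in $w_{ij}$, so the partial derivative equals the time average of the squared residual on that edge.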

[0165] Here, the partial derivative of the average cumulative energy value with respect to the coupling weight characterizes the contribution of that connection weight to the average energy; through this step, provided that the data has been confirmed valid, the system automatically reduces the weights of high-residual edges, thereby adapting to environmental changes;

[0166] Through the above-mentioned online update mechanism, the system can automatically adjust the coupling strength between nodes to adapt to the new physical environment, thereby reducing the false alarm rate and realizing long-term adaptive maintenance of the cleaning logic.
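The weight update of this embodiment can be sketched as follows. The sketch assumes the claim 4 energy form, under which the derivative of the average energy with respect to a coupling weight is the mean squared residual on that edge over the window. The function names and the residual histories are hypothetical.

```python
def mean_squared_residual(residual_history):
    """Average of ||h_i - M h_j||^2 over the window.

    Under the assumed energy form, this equals the partial derivative of the
    average cumulative energy with respect to the coupling weight w_ij.
    """
    squared = [sum(r * r for r in residual) for residual in residual_history]
    return sum(squared) / len(squared)


def update_coupling_weight(weight, residual_history, forgetting_factor):
    """One gradient-descent step on the average energy for a single edge."""
    return weight - forgetting_factor * mean_squared_residual(residual_history)


# Hypothetical residual vectors recorded on two edges over a two-sample window.
high_residual_edge = [[0.6, -0.8], [0.3, 0.4]]    # squared norms 1.0 and 0.25
low_residual_edge = [[0.06, -0.08], [0.03, 0.04]]

w_high = update_coupling_weight(1.0, high_residual_edge, forgetting_factor=0.01)
w_low = update_coupling_weight(1.0, low_residual_edge, forgetting_factor=0.01)
print(w_high, w_low)
```

Both weights shrink, but the edge with larger confirmed-valid residuals shrinks faster, which is the behavior described in [0165].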

[0167] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A data cleaning system based on big data, characterized in that it comprises: a data preprocessing center, used to collect multi-source heterogeneous data, standardize the multi-source heterogeneous data to generate dimensionless statistical distribution values, and construct a weighted directed graph; a state mapping unit, used to input the dimensionless statistical distribution values and the weighted directed graph into a dynamic graph neural network to generate a state space vector; an energy verification unit, used to substitute the state space vector into a variational energy model, calculate the global consistency energy, and compare the global consistency energy with a preset safety threshold, wherein if the global consistency energy is greater than the preset safety threshold, a logical conflict signal is generated, and if the global consistency energy is less than or equal to the preset safety threshold, a logical consistency signal is generated; a gradient repair unit, used to respond to the logical conflict signal by calculating an energy gradient scalar and updating the dimensionless statistical distribution value with the energy gradient scalar to generate a dimensionless repair value; and a feedback control unit, used to respond to the dimensionless repair value by performing reverse mapping on it to generate the final cleaning result, and further used to set the multi-source heterogeneous data as the final cleaning result in response to the logical consistency signal.

2. The data cleaning system based on big data according to claim 1, characterized in that the standardization process is as follows: obtain the historical observation mean and standard deviation of the multi-source heterogeneous data; subtract the historical observation mean from the multi-source heterogeneous data to obtain centered data; divide the centered data by the standard deviation to generate the dimensionless statistical distribution value; and the process of constructing the weighted directed graph is as follows: obtain the physical connection relationships or business association relationships between entities; generate an adjacency matrix based on the physical connection relationships or business association relationships; set the adjacency matrix as the weighted directed graph.

3. The data cleaning system based on big data according to claim 1, characterized in that the process of generating the state space vector is as follows: obtain the input transformation matrix, the self-circulating recursive matrix, and the weights of neighboring nodes; perform feature space enhancement on the dimensionless statistical distribution values using the input transformation matrix to obtain input features; process the state space vector of the node at the previous time step using the self-circulating recursive matrix to obtain historical evolution features; aggregate the state space vectors of neighboring nodes at the previous time step using the weights of neighboring nodes to obtain spatial aggregation features; linearly superimpose the input features, historical evolution features, and spatial aggregation features to obtain superimposed features; and activate the superimposed features using the hyperbolic tangent function to generate the state space vector.

4. The data cleaning system based on big data according to claim 1, characterized in that the process of calculating the global consistency energy is as follows: obtain the logical mapping matrix and the coupling weights; transform the current state space vectors of the neighboring nodes using the logical mapping matrix to obtain predicted state vectors; calculate the Euclidean distance between the state space vector of the current node at the current moment and the predicted state vector; calculate the squared value of the Euclidean distance; multiply the squared value by the coupling weight to obtain the node energy value; and sum the node energy values of all nodes to generate the global consistency energy.

5. The data cleaning system based on big data according to claim 4, characterized in that the process of calculating the energy gradient scalar is as follows: take the partial derivative of the global consistency energy with respect to the state space vector to obtain the state derivative; differentiate the hyperbolic tangent function to obtain the activation derivative; perform chain rule operations using the input transformation matrix, the state derivative, and the activation derivative to generate the energy gradient scalar; and the process of generating the dimensionless repair value is as follows: obtain the preset repair step size; multiply the energy gradient scalar by the preset repair step size to obtain the adjustment amount; subtract the adjustment amount from the dimensionless statistical distribution value to generate the dimensionless repair value.

6. The data cleaning system based on big data according to claim 5, characterized in that the process of obtaining the preset repair step size is as follows: obtain the extreme point information of the global consistency energy; based on the extreme point information, decay the base step size using an optimization algorithm to generate the preset repair step size.

7. The data cleaning system based on big data according to claim 1, characterized in that the reverse mapping process for the dimensionless repair value is as follows: obtain the historical observation mean and standard deviation of the multi-source heterogeneous data; multiply the dimensionless repair value by the standard deviation to obtain the restored fluctuation value; add the restored fluctuation value to the historical observation mean to generate the final cleaning result.

8. The data cleaning system based on big data according to claim 4, characterized in that the feedback control unit is also used to perform the following steps: receive a valid data confirmation command from an external source; in response to the valid data confirmation command, calculate the average cumulative energy value within a preset time period; obtain the preset forgetting factor; update the coupling weights using the preset forgetting factor and the average cumulative energy value to generate updated coupling weights.