Privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning

By employing a vertical federated learning architecture in CPPS and combining SCINet and Transformer models for feature compression and optimization, the problems of underutilization of information-side data and privacy leakage in existing technologies are solved, achieving efficient anomaly detection and privacy protection.

CN121479828B · Active · Publication Date: 2026-06-26 · SICHUAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SICHUAN UNIV
Filing Date
2025-11-07
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing CPPS anomaly detection methods mainly focus on the physical side of the power system, failing to effectively integrate information-side data. Furthermore, the centralized training mode poses a privacy risk and cannot meet the current security requirements of CPPS.

Method used

A vertical federated learning architecture is adopted, deploying SCINet and Transformer models on the physical and information sides respectively. Through feature compression and cross-entropy loss function, deep feature extraction and privacy protection are achieved, and a bidirectional collaborative optimization mechanism between the client and the server is constructed.

Benefits of technology

While ensuring detection accuracy, it significantly improves data privacy protection, reduces the risk of data reconstruction, and enhances the practicality and security of CPPS anomaly detection.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning and relates to the field of CPPS anomaly detection. Through longitudinal federated learning, the SCINet model and the Transformer model are deployed on the physical-side and information-side clients, respectively, for local deep feature extraction. In the feature uploading stage, feature compression is performed separately on the deep features of the physical-side data and of the information-side data, which not only compresses the uploaded features effectively but also greatly reduces the amount of sensitive information that may leak through the intermediate features. In addition, the application constructs a bidirectional collaborative optimization mechanism between the client and the server, which can optimize the local feature extraction strategy in real time. The application guarantees the accuracy of anomaly detection while significantly improving data privacy protection, overcomes the shortcomings of traditional anomaly detection methods in terms of privacy security, feature compression effectiveness, and bilateral collaborative optimization, and effectively improves the practicality and security of CPPS anomaly detection.

Description

Technical Field

[0001] This invention relates to the field of CPPS anomaly detection, and specifically to a privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning.

Background Technology

[0002] With the widespread application of information technology and intelligent equipment in power systems, power systems are no longer simply physical entities, but have gradually evolved into complex architectures that deeply integrate information and physical systems, namely Cyber-Physical Power Systems (CPPS). Its core characteristic is the establishment of a tightly coupled closed-loop control and feedback mechanism between physical power equipment and information communication networks. Under this architecture, physical-side data (such as measurements of node voltage, current, and power) and information-side data (such as network throughput, transmission delay, and message loss rate) together constitute a comprehensive description of the system's operational status. Both physical-side and information-side data exhibit high temporal correlation and interactive relationships during normal operation and fault evolution: once an anomaly occurs at one end (e.g., abnormal generator vibration or communication network congestion), it often rapidly propagates to the other end through coupling links, causing a chain reaction and cross-domain interference, thus posing a serious threat to the safe and stable operation of the power grid. Therefore, collaborative analysis and accurate anomaly detection of massive, multi-source, heterogeneous data in CPPS have become fundamental requirements for ensuring power grid security.

[0003] However, existing CPPS anomaly detection technologies primarily focus on analyzing operating parameters on the physical side of the power system, while the auxiliary application value of information-side data has not been fully explored. Although some studies have attempted to combine data from different sources for anomaly detection, most remain at a shallow fusion stage, failing to effectively combine the deep features hidden in the data from both sides for collaborative analysis, thus failing to comprehensively and accurately characterize CPPS anomaly states. With the continuous growth of the scale of multi-source heterogeneous data in CPPS, the coupling and correlation characteristics between the data from both sides are becoming increasingly significant. Traditional detection methods based on rules or shallow models are no longer sufficient to meet the current security requirements of CPPS in terms of feature coverage, discrimination accuracy, and robustness.

[0004] Furthermore, most current CPPS anomaly detection methods rely on a centralized data training model: the collected raw data is centralized on a central server for unified training, and the constructed machine learning or deep learning model directly maps input features to system operating status. However, with the ever-expanding scale of data, this centralized model not only brings extremely high computational and storage overhead, but also exposes the raw data to potential attackers during transmission, posing a serious risk of privacy leakage. Specifically, in the centralized training model, each data source must upload its raw data, which may become a target for malicious attackers, leading to the theft and leakage of various privacy-sensitive information (such as user electricity usage habits, equipment operating parameters, and network communication status). Malicious attackers can then reconstruct or reverse-analyze this private data to gradually uncover system weaknesses and launch targeted attacks to induce wider power outages or equipment failures. Therefore, how to effectively protect data privacy while ensuring anomaly detection accuracy has become a key technical challenge that CPPS security research urgently needs to solve.

Summary of the Invention

[0005] To address the aforementioned shortcomings of existing technologies, the privacy-enhanced CPPS anomaly detection method based on vertical federated learning provided by this invention not only ensures detection accuracy but also effectively reduces the risk of data reconstruction, thus exhibiting strong privacy protection.

[0006] To achieve the above-mentioned objectives, the technical solution adopted by this invention is as follows:

[0007] A privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning is provided, which includes the following steps:

[0008] Introducing a vertical federated learning architecture to build local models on the client side:

[0009] The SCINet model is deployed on the physical side of CPPS, and the Transformer model is deployed on the information side of CPPS; the SCINet model is used to extract deep features of the physical side data, and the Transformer model is used to extract deep features of the information side data.

[0010] With the goal of minimizing the mutual information between the intermediate features uploaded by the client and the original data and maximizing the mutual information between the intermediate features and the output labels, feature compression processing is performed on the deep features of the physical side data and the deep features of the information side data respectively, resulting in physical side compressed features and information side compressed features.

[0011] The physical-side compressed features and the information-side compressed features are uploaded to the server and concatenated in one-dimensional space to obtain the global features.

[0012] In the server, global features are mapped to score vectors through fully connected layers, and the predicted probabilities of each category are obtained through a classifier; the category with the highest predicted probability is selected as the detection and classification result.

[0013] Calculate the loss function for the detection classification result and the true label, generate the corresponding gradient, and send the gradient to the corresponding client after splitting it according to the concatenation order of the global features;

[0014] In the client, the local model is trained based on the penalty term of the KL divergence and the received gradient to obtain the trained local model.

[0015] The deep features of the local data to be detected are obtained through the trained local model, and then the corresponding anomaly detection results are obtained through the server.

[0016] Furthermore, the specific methods for extracting deep features from physical-side data include the following steps:

[0017] T consecutive time-step data points are extracted from the physical-side time-series dataset of CPPS and arranged in chronological order to form a time series $X = \{x_1, x_2, \dots, x_T\}$, where each time step contains N-dimensional features and $x_i$ denotes the data at the i-th time step; the time-series data includes node voltage, current, and power;

[0018] The time series $X$ is used as input to the SCINet model, and the SCINet model outputs the corresponding feature vector, thus obtaining the deep features of the physical-side data.

[0019] Furthermore, the specific methods for extracting deep features from information-side data include the following steps:

[0020] From the information side of CPPS, select traffic records or business records that are at the same time point as the corresponding physical side data, and then obtain the information side sample sequence;

[0021] By using the information-side sample sequence as input to the Transformer model and outputting the corresponding feature vector through the Transformer model, we obtain the deep features of the information-side data.

[0022] Furthermore, the specific methods for performing feature compression processing on the deep features of physical-side data and the deep features of information-side data respectively include the following steps:

[0023] A feature compression layer is added before the final output layer of the SCINet and Transformer models. This layer performs feature compression on both the physical and information-side deep data features. The expression for feature compression in this layer is as follows:

[0024] $z = \mu + \sigma \odot \epsilon$

[0025] where $z$ is the compressed feature obtained by the feature compression layer; $\mu$ and $\sigma$ are the mean vector and standard deviation vector of the latent variable distribution, respectively, output by the SCINet model or the Transformer model; $\epsilon$ is random noise sampled independently from the standard normal distribution $\mathcal{N}(0, I)$; and $\odot$ denotes element-wise multiplication.
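The compression step described above (mean plus standard deviation times standard normal noise, element by element) can be illustrated with a minimal sketch. The function name `compress`, the example vectors, and the use of plain Python lists are assumptions made for illustration; in the method itself the mean and standard deviation come from the SCINet or Transformer model.

```python
import random

def compress(mu, sigma, rng=None):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    rng = rng or random.Random()
    eps = [rng.gauss(0.0, 1.0) for _ in mu]  # independent standard normal noise
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Hypothetical encoder outputs for a 3-dimensional compressed feature.
mu = [0.5, -1.0, 2.0]
sigma = [0.1, 0.2, 0.0]
z = compress(mu, sigma, random.Random(0))
```

Because the randomness enters only through the noise term, the sample is a deterministic function of the mean and standard deviation once the noise is fixed. A dimension with zero standard deviation is passed through unchanged.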

[0026] Furthermore, specific methods for splicing physical-side compressed features and information-side compressed features in one-dimensional space include:

[0027] The physical-side compressed features and information-side compressed features at the same time are concatenated into a two-sided deep feature; the two-sided deep features are then concatenated in chronological order to obtain the global feature.

[0028] Furthermore, the expression for mapping global features to a score vector through a fully connected layer is:

[0029] $s = W h + b$

[0030] where $s$ is the score vector obtained from the fully connected layer; $h$ is the input feature vector of the fully connected layer, i.e., the global features; $W$ is the weight matrix of the fully connected layer; and $b$ is the bias vector.

[0031] Furthermore, the classifier in the server is a softmax classifier, and its classification expression is:

[0032] $p_m = \dfrac{e^{s_m}}{\sum_{j=1}^{M} e^{s_j}}$

[0033] where $p_m$ is the predicted probability of category m given the score vector $s$; $s_m$ is the score of the global features for category m; M is the total number of categories; $s_j$ is the score of the global features for category j; and $e$ is the base of the natural exponential.
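The server-side path described above (concatenate the uploaded features, apply a fully connected layer, apply softmax, pick the most probable category) might look like the following sketch. All weights, feature values, and helper names are hypothetical, and plain lists stand in for tensors.

```python
import math

def concat_features(client_features):
    """Concatenate per-client compressed features in a fixed client order."""
    return [v for feats in client_features for v in feats]

def fully_connected(W, b, h):
    """Score vector: one dot product per category row of W, plus the bias."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

def softmax(scores):
    """Numerically stable softmax (subtract the maximum before exponentiating)."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

z_phys = [0.2, -0.1]            # hypothetical physical-side compressed feature
z_info = [0.4]                  # hypothetical information-side compressed feature
h = concat_features([z_phys, z_info])

W = [[1.0, 0.0, 0.0],           # two categories, three global feature dimensions
     [0.0, 1.0, 2.0]]
b = [0.0, 0.1]
p = softmax(fully_connected(W, b, h))
pred = max(range(len(p)), key=p.__getitem__)  # index of the most probable category
```

The concatenation order matters: the server must remember it so that gradients can later be sliced back to the right client.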

[0034] Furthermore, the loss between the detection classification result and the true label is calculated with the cross-entropy loss function.

[0035] Furthermore, the expression for splitting the gradient according to the concatenation order of the global features is:

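[0036] $g_i = \dfrac{\partial \mathcal{L}_{CE}}{\partial z_i} = \left[\dfrac{\partial \mathcal{L}_{CE}}{\partial h}\right]_{\mathrm{start}_i : \mathrm{end}_i}$

[0037] where $g_i$ is the gradient component corresponding to the i-th client; $\mathcal{L}_{CE}$ is the cross-entropy loss function; $z_i$ is the compressed feature corresponding to the i-th client, which is either the physical-side compressed feature or the information-side compressed feature; $\mathrm{start}_i$ and $\mathrm{end}_i$ are the starting and ending indices of the compressed features uploaded by the i-th client within the global features $h$; and $[\cdot]_{\mathrm{start}_i : \mathrm{end}_i}$ denotes the part of the gradient between the starting index $\mathrm{start}_i$ and the ending index $\mathrm{end}_i$.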
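As an illustration of the gradient splitting described above, the sketch below computes the gradient of softmax cross-entropy with respect to the global features and slices it by each client's index range. The helper names and numbers are hypothetical. The identity "gradient with respect to the scores equals predicted probabilities minus the one-hot label" is the standard result for softmax followed by cross-entropy.

```python
def grad_wrt_global(W, p, label):
    """Gradient of the loss w.r.t. the global features: W^T (p - onehot(label))."""
    delta = [pj - (1.0 if j == label else 0.0) for j, pj in enumerate(p)]
    return [sum(W[j][k] * delta[j] for j in range(len(W)))
            for k in range(len(W[0]))]

def split_gradient(global_grad, feature_sizes):
    """Slice the global gradient into per-client pieces, in concatenation order."""
    if sum(feature_sizes) != len(global_grad):
        raise ValueError("feature sizes must add up to the global feature length")
    pieces, start = [], 0
    for size in feature_sizes:
        end = start + size
        pieces.append(global_grad[start:end])
        start = end
    return pieces

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 2.0]]
p = [0.25, 0.75]                      # hypothetical predicted probabilities
g = grad_wrt_global(W, p, label=1)
g_phys, g_info = split_gradient(g, [2, 1])  # physical side uploaded 2 dims, information side 1
```

Each client receives only the slice that matches the features it uploaded, so neither side sees the other side's gradient.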

[0038] Furthermore, during the training process, the loss function $\mathcal{L}$ of the local model is expressed as:

[0039] $\mathcal{L} = \dfrac{1}{N}\sum_{i=1}^{N}\Big[-\log q_{\phi}(y \mid z_i) + \beta\,\mathrm{KL}\big(p(z_i \mid x_i)\,\|\,r(z_i)\big)\Big]$

[0040] where N is the total number of clients; $q_{\phi}(y \mid z)$ is the distribution of the detection and classification results obtained when the server has parameters $\phi$; $\mathrm{KL}\big(p(z \mid x)\,\|\,r(z)\big)$ is a penalty term based on KL divergence, used to measure the degree of difference between the distribution $p(z \mid x)$ of the compressed features produced by the local model and a pre-defined simple prior distribution $r(z)$; $\beta$ is the weight; $z$ is the intermediate compressed feature of the local model; $y$ is the detection and classification result output by the server; and $x$ is the input during the local model training process.
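To make the KL penalty concrete, the sketch below assumes that the compressed feature follows a diagonal Gaussian and that the pre-defined simple prior is a standard normal distribution. The source only says the prior is "simple", so that choice is an assumption. Under it the KL divergence has a closed form. The function names and the weight value are illustrative.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def cross_entropy(p, label):
    """Negative log-probability that the classifier assigns to the true label."""
    return -math.log(p[label])

def local_loss(p, label, mu, sigma, beta):
    """Task loss plus weighted KL penalty for one client and one sample."""
    return cross_entropy(p, label) + beta * kl_to_standard_normal(mu, sigma)

loss = local_loss([0.1, 0.9], 1, mu=[0.3, -0.2], sigma=[0.8, 1.1], beta=0.01)
```

A larger weight pushes the compressed features toward the prior, which removes more input-specific detail at some cost in accuracy. The weight therefore acts as the privacy and accuracy trade-off knob.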

[0041] During the training process of the local model, the expression for updating the local model parameters of the i-th client is:

[0042] $\theta_i' = \theta_i - \eta\left(g_i^{\mathrm{feat}} + \lambda\, g_i^{\mathrm{KL}}\right)$

[0043] where $\theta_i'$ denotes the updated parameters of the local model on the i-th client; $\theta_i$ denotes the parameters of the local model on the i-th client before the update; $\eta$ is the learning rate; $g_i^{\mathrm{feat}}$ is the gradient generated from the features of the i-th client; $g_i^{\mathrm{KL}}$ is the gradient obtained by calculating the KL loss of the i-th client; and $\lambda$ is the balance coefficient.
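The client-side update described above combines two gradients: the one derived from the server's feedback on the uploaded features and the one from the local KL penalty. The sketch below assumes a plain gradient-descent step in which the balance coefficient scales the KL gradient. The parameter vectors and the function name are hypothetical.

```python
def update_parameters(theta, grad_feat, grad_kl, lr, lam):
    """One descent step: theta - lr * (feature gradient + lam * KL gradient)."""
    return [t - lr * (gf + lam * gk)
            for t, gf, gk in zip(theta, grad_feat, grad_kl)]

theta = [1.0, 2.0]          # hypothetical local model parameters
grad_feat = [0.5, -1.0]     # gradient derived from the server's feedback
grad_kl = [1.0, 0.0]        # gradient of the local KL penalty
theta_new = update_parameters(theta, grad_feat, grad_kl, lr=0.1, lam=0.5)
```

In practice the feature gradient would first be back-propagated through the local network to reach its parameters. The sketch starts from gradients that are already expressed in parameter space.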

[0044] The beneficial effects of this invention are as follows: This invention utilizes vertical federated learning to deploy the SCINet model and the Transformer model on the physical-side and information-side clients, respectively, for local deep feature extraction. During the feature upload stage, it aims to minimize the mutual information between the client-uploaded intermediate features and the original data, and maximize the mutual information between the intermediate features and the output labels. Feature compression processing is performed on the deep features of the physical and information side data, respectively, achieving effective compression of uploaded features and significantly reducing the amount of potentially leaked sensitive information in the intermediate features. Furthermore, this invention constructs a bidirectional collaborative optimization mechanism between the client and server. After receiving feedback gradient information from the server, the client can optimize its local feature extraction strategy in real time, thereby continuously improving the anomaly detection performance of the overall detection model. This invention significantly improves data privacy protection while ensuring anomaly detection accuracy, overcoming the shortcomings of traditional anomaly detection methods in terms of privacy security, feature compression effectiveness, and bidirectional collaborative optimization, effectively enhancing the practicality and security of CPPS anomaly detection.

Attached Figure Description

[0045] Figure 1 is a flowchart of the method;

[0046] Figure 2 is a schematic diagram of the feature compression layer structure;

[0047] Figure 3 is a comparison chart of the accuracy rates of different combination models in Example 2;

[0048] Figure 4 is a comparison chart of the accuracy rates of different combination models in Example 2;

[0049] Figure 5 is a comparison chart of the recall rates of different combination models in Example 2;

[0050] Figure 6 is a comparison chart of the F1 scores of different combination models in Example 2;

[0051] Figure 7 is a comparison chart of the FedAvg performance of the four methods in Example 4;

[0052] Figure 8 is a performance comparison chart of the four methods in Example 4;

[0053] Figure 9 is a performance comparison chart of the four methods in Example 4;

[0054] Figure 10 is a performance comparison chart of the four methods in Example 4.

Detailed Implementation

[0055] The specific embodiments of the present invention are described below to enable those skilled in the art to understand the present invention. However, it should be understood that the present invention is not limited to the scope of the specific embodiments. For those skilled in the art, various changes are obvious as long as they are within the spirit and scope of the present invention as defined and determined by the appended claims. All inventions utilizing the concept of the present invention are protected.

[0056] Example 1:

[0057] As shown in Figure 1, the privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning includes the following steps:

[0058] S1. Introduce a vertical federated learning architecture to build a local model on the client side:

[0059] The SCINet model is deployed on the physical side of CPPS, and the Transformer model is deployed on the information side of CPPS; the SCINet model is used to extract deep features of the physical side data, and the Transformer model is used to extract deep features of the information side data.

[0060] S2. With the goal of minimizing the mutual information between the intermediate features uploaded by the client and the original data (reducing the risk of privacy leakage) and maximizing the mutual information between the intermediate features and the output labels (maintaining model performance), feature compression processing is performed on the deep features of the physical side data and the deep features of the information side data respectively, resulting in physical side compressed features and information side compressed features.

[0061] S3. Upload the physical-side compressed features and the information-side compressed features to the server and concatenate them in one-dimensional space to obtain the global features;

[0062] S4. In the server, the global features are mapped to a score vector through a fully connected layer, and the predicted probability of each category is obtained through a classifier; the category with the highest predicted probability is selected as the detection and classification result.

[0063] S5. Calculate the loss function of the detection classification result and the real label, generate the corresponding gradient, and send the gradient to the corresponding client after splitting it according to the concatenation order of the global features.

[0064] S6. In the client, the local model is trained based on the penalty term of KL divergence and the received gradient to obtain the trained local model.

[0065] S7. The trained local model removes redundant information and retains only the feature information related to label identification, generating more concise and more anomaly-discriminative local deep features of the data to be detected; the corresponding anomaly detection result is then obtained through the server.

[0066] Federated learning, based on the different data distribution patterns and the degree of overlap among clients, can generally be divided into three categories: Horizontal Federated Learning (HFL), Vertical Federated Learning (VFL), and Federated Transfer Learning (FTL). These three architectures correspond to different practical application scenarios, reflecting the diversity of data distribution, feature space, and participants, while finding a balance between data privacy protection and model performance. Considering the application scenario of this method, VFL is chosen because CPPS data on both sides have the typical characteristics of "complementary features and overlapping samples." Physical and informational data are often in the same system state, generated at the same timestamp, and oriented towards the same batch of event labels. Therefore, there is overlap at the sample level, differences in feature dimensions, while the data labels remain consistent. By using VFL, feature information from different data sources can be combined, and their complementary characteristics can be utilized to construct a more discriminative anomaly detection model.

[0067] In this embodiment, extracting deep features from physical-side data includes the following operations:

[0068] First, T consecutive time-step data points are extracted from the physical-side time-series dataset and arranged chronologically to form a time series, in which each time-step data point contains N-dimensional features. The final time series can be represented as $X = \{x_1, x_2, \dots, x_T\}$, where $x_i$ is the data of one time step in the sequence. When the sequence $X$ is input into the SCINet model, the model separates the odd-numbered elements of the time series from the even-numbered elements. Specifically, the original sequence $X$ is divided into two subsequences, the odd subsequence $X_{odd} = \{x_1, x_3, \dots, x_{2k-1}\}$ and the even subsequence $X_{even} = \{x_2, x_4, \dots, x_{2k}\}$, where $k = \lfloor T/2 \rfloor$ is the largest integer not exceeding half of T, and $x_{2k-1}$ and $x_{2k}$ are the last odd-numbered element and the last even-numbered element of the original sequence $X$. Although these two subsequences have a relatively coarse time resolution, they still preserve most of the trend information of the original sequence $X$. The SCINet model therefore applies different convolutional kernels to the two subsequences, extracting different but valuable features from them, including temporal features that can enhance the model's representational capabilities.

[0069] Meanwhile, to compensate for the information loss that may result from the downsampling operation, this application also introduces an interactive learning strategy. The core idea of this strategy is to achieve information exchange between the odd and even subsequences by learning affine transformation parameters from each other, thereby preserving more contextual information and potential correlation characteristics and alleviating the problem of incomplete information caused by sequence decomposition. Specifically, after the original sequence is decomposed into the odd and even subsequences $X_{odd}$ and $X_{even}$, the two subsequences are passed through two one-dimensional convolutional modules $\phi$ and $\psi$, which are structurally similar but independently parameterized. The resulting hidden states are converted into exponential (exp) form, giving two sets of projected feature representations. Each set is then multiplied element-wise with the other subsequence to achieve cross-sequence information compensation and sharing. That is, the projected features of $X_{even}$ interact with $X_{odd}$, and the projected features of $X_{odd}$ interact with $X_{even}$. This interactive learning process enables the feature information of the two subsequences to complement and enhance each other in the numerical space, thereby effectively improving the overall quality and representational ability of the feature sequence. The specific calculation process is shown in Equation (1).

[0070] $X_{odd}^{s} = X_{odd} \odot \exp\big(\phi(X_{even})\big), \qquad X_{even}^{s} = X_{even} \odot \exp\big(\psi(X_{odd})\big)$ (1)

[0071] In the formula, $X_{odd}^{s}$ and $X_{even}^{s}$ are the intermediate odd and even subsequences, respectively; $\odot$ denotes element-wise multiplication.

[0072] After completing the initial interactive learning, the SCINet model obtains two intermediate subsequences, $X_{odd}^{s}$ and $X_{even}^{s}$. Next, a second interaction is performed on $X_{odd}^{s}$ and $X_{even}^{s}$ to further uncover and supplement feature information that may have been overlooked or not fully revealed in the previous step. First, $X_{even}^{s}$ and $X_{odd}^{s}$ are fed into two other one-dimensional convolutional modules, $\rho$ and $\eta$. Like the previous one-dimensional convolutional modules, $\rho$ and $\eta$ have similar structures but independent parameters, enabling differentiated feature extraction and projection mapping for their respective input sequences and thereby yielding the hidden states of $X_{even}^{s}$ and $X_{odd}^{s}$. These two hidden state vectors are then added to or subtracted from the other subsequence. That is, the projection result obtained from $X_{even}^{s}$ is added to or subtracted from $X_{odd}^{s}$, and the projection result obtained from $X_{odd}^{s}$ is added to or subtracted from $X_{even}^{s}$. The specific calculation process is shown in equation (2).

[0073] $X_{odd}' = X_{odd}^{s} \pm \rho\big(X_{even}^{s}\big), \qquad X_{even}' = X_{even}^{s} \pm \eta\big(X_{odd}^{s}\big)$ (2)

[0074] In the formula, $X_{odd}'$ and $X_{even}'$ are the two feature subsequences that the interactive learning module finally outputs. "±" indicates that addition or subtraction can be selected during the design process according to specific needs.
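The two-stage interaction inside one SCI-Block can be sketched as follows for a univariate series. This is only an illustration under stated assumptions: the learned one-dimensional convolutional modules are replaced by a hypothetical stand-in (a fixed-kernel convolution followed by tanh), the signs in the second stage are fixed to "add on the odd branch, subtract on the even branch", and an unpaired last element is dropped when T is odd.

```python
import math

def conv1d_same(seq, kernel):
    """1-D convolution (odd-length kernel) with edge padding; keeps the length."""
    pad = len(kernel) // 2
    padded = [seq[0]] * pad + list(seq) + [seq[-1]] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]

def make_module(kernel):
    """Hypothetical stand-in for a learned module: convolution, then tanh."""
    return lambda seq: [math.tanh(v) for v in conv1d_same(seq, kernel)]

def sci_block(x, phi, psi, rho, eta):
    # Split by 1-based position: odd-numbered elements are x[0], x[2], ...
    odd, even = list(x[0::2]), list(x[1::2])
    k = min(len(odd), len(even))           # drop the unpaired tail if T is odd
    odd, even = odd[:k], even[:k]
    # Stage 1: scale each branch by exp(module(other branch)).
    odd_s = [o * math.exp(v) for o, v in zip(odd, phi(even))]
    even_s = [e * math.exp(v) for e, v in zip(even, psi(odd))]
    # Stage 2: add to one branch, subtract from the other (a design choice).
    odd_out = [o + v for o, v in zip(odd_s, rho(even_s))]
    even_out = [e - v for e, v in zip(even_s, eta(odd_s))]
    return odd_out, even_out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
modules = [make_module([0.1, 0.2, 0.1]) for _ in range(4)]
odd_out, even_out = sci_block(x, *modules)
```

Applying the same function recursively to `odd_out` and `even_out` would give the binary tree arrangement of SCI-Blocks.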

[0075] In this embodiment, the SCINet architecture is constructed by arranging multiple basic modules (SCI-Blocks) hierarchically, resulting in a binary tree structure. This gives the SCINet model both a local view and a global view of the entire time series, facilitating more efficient extraction of useful temporal feature information. The two feature subsequences output by the first SCI-Block are used as inputs to the next nodes in the binary tree, where the same processing yields further feature subsequences. Feature information extracted by earlier layers accumulates gradually, so that deeper features contain temporal information transmitted from shallower layers. In this way, both short-term and long-term temporal dependencies in the physical time-series data can be captured simultaneously. Finally, all feature subsequences obtained from the last layer of SCI-Blocks are rearranged to generate a new perceptually enhanced feature sequence for subsequent feature fusion and anomaly detection classification.

[0076] In this embodiment, extracting deep features from information-side data includes the following operations:

[0077] From the information-side dataset, traffic records or business data that are at the same time point as the corresponding physical-side data are selected to construct a time-synchronized sample sequence, which is then input into the Transformer model. Only by ensuring that the data from both sides are strictly aligned in the time dimension can the SCINet-Transformer combined model (i.e., SCINet model + Transformer model) effectively support the accurate judgment of the overall operating status of CPPS at a certain moment.
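A minimal sketch of the time alignment described above: only timestamps present on both sides are kept, so every training sample pairs one physical-side record with the information-side record from the same moment. The dictionary layout and the measurement values are hypothetical.

```python
def align_by_timestamp(physical, information):
    """Return (timestamp, physical record, information record) for shared timestamps."""
    common = sorted(physical.keys() & information.keys())
    return [(t, physical[t], information[t]) for t in common]

# Hypothetical records keyed by timestamp.
physical = {0: [1.01, 0.52], 1: [1.00, 0.50], 2: [0.97, 0.61]}      # voltage, current
information = {1: [12.5, 0.01], 2: [80.3, 0.20], 3: [11.9, 0.00]}   # delay, loss rate
aligned = align_by_timestamp(physical, information)
```

In a vertical federated setting the two sides would only need to agree on the shared timestamps, not exchange the records themselves. The sketch puts both dictionaries in one place purely for illustration.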

[0078] The Transformer model employs a positional encoding mechanism based on sine and cosine functions, attaching a unique positional identifier to the feature vector corresponding to each time step in the input sequence. This ensures that each time step has clearly distinguishable positional information in the dimension of the input vector, avoiding confusion or positional ambiguity during subsequent training. The specific calculation process is shown in equations (3) and (4):

[0079] $PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$ (3)

[0080] $PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$ (4)

[0081] In the formula, $PE_{(pos,\,2i)}$ is the value of the $2i$-th (even-indexed) dimension of the encoding vector for position $pos$ in the sequence, generated by a sine function; $PE_{(pos,\,2i+1)}$ is the value of the $(2i+1)$-th (odd-indexed) dimension of the encoding vector for position $pos$, generated by a cosine function; $pos$ is the position index; $d_{model}$ is the dimension of each position vector in the Transformer model; and $i$ is the index of the current dimension.
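The sine and cosine positional encoding can be computed directly. The sketch below assumes the standard Transformer base constant of 10000, which the source text does not state. Even dimensions receive the sine value and the following odd dimension receives the cosine of the same angle.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position codes."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for dim in range(0, d_model, 2):          # dim plays the role of 2i
            angle = pos / (10000 ** (dim / d_model))
            pe[pos][dim] = math.sin(angle)
            if dim + 1 < d_model:
                pe[pos][dim + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=4)
```

The table is added element-wise to the feature vectors of the input sequence before the first attention layer.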

[0082] After positional encoding, the input sequence serves as the input to the multi-head attention layer. Three sets of weight matrices are used to generate the query (Query, Q), key (Key, K), and value (Value, V) matrices. Q and K contain the positional information of the input sequence and are used to determine the locations of interest, while V contains the numerical information of the input sequence. The calculation process is shown in equations (5), (6), and (7):

[0083] $Q = H W^{Q}$ (5)

[0084] $K = H W^{K}$ (6)

[0085] $V = H W^{V}$ (7)

[0086] In the formula, $W^{Q}$, $W^{K}$, and $W^{V}$ are parameter matrices, and $H$ is the position-encoded input sequence.

[0087] Then, the Q matrix is multiplied by the transpose of the K matrix to calculate the correlation between the features at the current location and the features at other locations; a higher value indicates a stronger correlation. The result is divided by the square root of the dimension $d_k$ to stabilize the gradient. The softmax function then normalizes the result, giving the attention weight of each K with respect to the current Q. Finally, the softmax result is multiplied by the corresponding V, which assigns higher weights to important features while suppressing irrelevant information. The calculation process is shown in equation (8):

[0088] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q K^{T}}{\sqrt{d_k}}\right) V$ (8)

[0089] In the formula, $d_k$ is the dimension of K.
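Scaled dot-product attention as described above can be written in a few lines. The sketch uses plain lists of rows for the query, key, and value matrices. The example values are hypothetical and chosen so the result is easy to check by hand.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """For each query: weights = softmax(q . k / sqrt(d_k)), output = weights @ V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[0.0, 0.0]]                 # one query that matches both keys equally
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
result = attention(Q, K, V)      # equal weights, so the two value rows are averaged
```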

[0090] The Transformer model enhances its representation capabilities through a multi-head attention mechanism. Specifically, after Q, K, and V are mapped through a fully connected network, they are fed into multiple self-attention modules, each of which calculates its own attention. The results are then concatenated to obtain the output of the multi-head attention layer. The calculation process is shown in equations (9) and (10):

[0091] $\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q},\ K W_i^{K},\ V W_i^{V}\big)$ (9)

[0092] $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \dots, \mathrm{head}_h\big)\, W^{O}$ (10)

[0093] In the formula, $\mathrm{head}_i$ represents each individual "attention head" in the multi-head attention mechanism; $\mathrm{MultiHead}$ represents the multi-head attention function; $\mathrm{Concat}$ is the concatenation function; and $W^{O}$ represents the additional weight matrix.
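The multi-head mechanism can be sketched by running several attention heads, each with its own projection matrices, and concatenating their outputs before a final projection. The matrix layout (lists of rows), the helper names, and the tiny example are assumptions made for illustration.

```python
import math

def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])   # Q times K transposed
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples; W_o: output projection matrix."""
    outputs = [attention(matmul(Q, Wq), matmul(K, Wk), matmul(V, Wv))
               for Wq, Wk, Wv in heads]
    # Concatenate the head outputs position by position.
    concat = [sum((out[t] for out in outputs), []) for t in range(len(Q))]
    return matmul(concat, W_o)

X = [[1.0, 2.0]]                                   # a single position, d_model = 2
heads = [([[1.0], [0.0]],) * 3,                    # head 1 looks at dimension 0
         ([[0.0], [1.0]],) * 3]                    # head 2 looks at dimension 1
W_o = [[1.0, 0.0], [0.0, 1.0]]                     # identity output projection
y = multi_head(X, X, X, heads, W_o)
```

With a single position each head's attention weight is 1, so each head simply returns its projected value and the concatenation reassembles the input.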

[0094] Finally, the output of the multi-head attention layer is input into the feedforward neural network, and after residual connection and layer normalization processing, the final output of the encoder is obtained. The corresponding calculation process is shown in equations (11), (12), and (13):

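[0095] $X = \mathrm{LayerNorm}\big(H + \mathrm{MultiHead}(Q, K, V)\big)$ (11)

[0096] $\mathrm{FFN}(X) = \sigma\big(X W_1 + b_1\big)\, W_2 + b_2$ (12)

[0097] $Y = \mathrm{LayerNorm}\big(X + \mathrm{FFN}(X)\big)$ (13)

[0098] In the formula, $X$ is the input matrix of the feedforward neural network, obtained from the input $H$ of the multi-head attention layer and its output; $Y$ is the output matrix; $\mathrm{LayerNorm}$ denotes the layer normalization method; $\mathrm{FFN}$ denotes the feedforward network; $\sigma$ is an activation function; $W_1$ and $W_2$ are the weight matrices of the two linear layers; and $b_1$ and $b_2$ are the corresponding bias terms.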
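The residual connection, layer normalization, and feedforward steps described above can be sketched for a single position vector. Several details are assumptions: ReLU is used as the activation function, layer normalization has no learned gain or bias, and all weights are made-up numbers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, W, b):
    """x W + b, with W stored as a list of rows (len(x) rows, len(b) columns)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between (ReLU is an assumption)."""
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def encoder_tail(x, attn_out, W1, b1, W2, b2):
    """Add and normalize after attention, then add and normalize after the FFN."""
    x1 = layer_norm([a + b for a, b in zip(x, attn_out)])
    return layer_norm([a + b for a, b in zip(x1, ffn(x1, W1, b1, W2, b2))])

x = [1.0, 2.0, 3.0, 4.0]                 # hypothetical position vector
attn_out = [0.5, -0.5, 0.25, 0.0]        # hypothetical multi-head attention output
W1 = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1], [-0.1, 0.0]]
b1 = [0.0, 0.1]
W2 = [[0.5, 0.0, -0.5, 0.1], [0.2, 0.2, 0.0, -0.3]]
b2 = [0.0, 0.0, 0.0, 0.0]
y = encoder_tail(x, attn_out, W1, b1, W2, b2)
```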

[0099] After processing by multiple Transformer encoders, the final feature vector is output. This vector captures the contextual information at each position in the input sequence and will be used for subsequent feature fusion and anomaly detection.

[0100] The deep feature information extracted from the two-sided data still contains a large amount of detailed information related to the original data. If stolen by a malicious attacker, this feature information can be used to reverse-engineer the distribution of the original data, leading to privacy leaks. To address the privacy leak problem, this method aims to minimize the mutual information between the intermediate features uploaded by the client and the original data, and maximize the mutual information between the intermediate features and the output labels. It performs feature compression processing on the deep features of the physical side data and the deep features of the information side data, respectively, resulting in compressed physical side features and compressed information side features.

[0101] For example, in this embodiment, the SCINet model is applied uniformly to the CPPS physical-side clients to process measurement data (node voltage, current, power, etc.), while the Transformer model is applied uniformly to the CPPS information-side clients to process traffic data (packet latency, effective throughput, etc.). Both models serve only as deep feature extractors and do not perform detection or classification locally. However, an encoding layer (the feature compression layer) is added before the final output layer of each model. The structure of the feature compression layer is shown in Figure 2.

[0102] As encoders, the client models (the SCINet model and the Transformer model) on both sides of the CPPS system aim to learn the conditional probability distribution from the local input data $x$ to the intermediate feature variable $z$, as shown in equation (14).

[0103] $p(z \mid x) = \mathcal{N}\big(z;\ \mu(x),\ \sigma^{2}(x)\big)$ (14)

[0104] In the formula, $\mu(x)$ and $\sigma(x)$ are the mean and standard deviation output by the encoder, respectively, reflecting the central position and distribution width of the intermediate feature $z$.

[0105] If samples were drawn directly from the conditional distribution of the intermediate feature variable, the encoder would have to introduce an external, independent random source and sample directly from the distribution defined by the mean and standard deviation. In that case, the sampling operation behaves like a "black box" random function, and no clear mapping can be established between the network parameters and the drawn samples. In other words, there is no differentiable functional form between the network parameters and the samples. Furthermore, the uncertainty of random sampling cuts off the derivative of the target loss function with respect to the neural network parameters, which prevents conventional backpropagation from being used directly for gradient updates and end-to-end optimization.

[0106] Therefore, this method employs the reparameterization technique to transform the originally non-differentiable random sampling process into a deterministic and continuously differentiable function. That is, by reconstructing the sampling process, random sampling is decomposed into a combination of a "differentiable function mapping" and "independent noise", so that the process of "how to sample" is embedded back into the model's computation, as shown in equation (15).

[0107] $z = \mu + \sigma \odot \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I)$ (15)

[0108] In the formula, $\mu$ and $\sigma$ represent the mean vector and standard deviation vector of the latent variable distribution, respectively; they are output by the encoder and are learnable. $\varepsilon$ is random noise sampled independently from the standard normal distribution $\mathcal{N}(0, I)$; it is independent of the encoder network parameters and only provides the randomness required during sampling.
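The reparameterization step can be sketched in a few lines of Python. This is only an illustration under assumed names: `sample_noise` and `reparameterize` are hypothetical helpers, and the mean and standard deviation vectors are hard-coded here rather than produced by an encoder.

```python
import random

def sample_noise(dim, rng):
    """Draw independent standard normal noise, one value per feature dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def reparameterize(mu, sigma, eps):
    """Deterministic mapping: z = mu + sigma * eps, element by element."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]            # stand-in for the encoder's mean output
sigma = [0.1, 1.0, 0.0]          # stand-in for the encoder's standard deviation output
eps = sample_noise(len(mu), rng)
z = reparameterize(mu, sigma, eps)

# Once eps is fixed, z is an ordinary differentiable function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps in every dimension.
grad_z_wrt_mu = [1.0 for _ in mu]
grad_z_wrt_sigma = list(eps)
```

All randomness is confined to `eps`, so gradients can flow through `mu` and `sigma` with ordinary backpropagation. A dimension with zero standard deviation, such as the third one here, simply returns its mean.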

[0109] Through the above transformation, the encoder network parameters are separated from the random sampling process. The mean and standard deviation output by the encoder only determine the shape of the intermediate feature distribution (its center location and scale) and do not directly participate in random sampling. Randomness enters only through the noise term, which is independent of the encoder network parameters. Therefore, during forward propagation, it is only necessary to sample the noise from the standard normal distribution, after which the required result is obtained through deterministic calculation. During backpropagation, the non-differentiability of random sampling is bypassed, and standard gradient calculation is performed on the deterministic mapping (the mean and the standard deviation). This preserves the randomness of sampling while keeping the encoder's key parameters inside a differentiable computation, resolving the conflict between automatic differentiation and random sampling.

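[0110] From a mathematical perspective, during backpropagation the gradient of the KL divergence loss with respect to the model parameters is calculated as shown in equations (16) and (17).

[0111] $\dfrac{\partial \mathcal{L}}{\partial \theta} = \dfrac{\partial \mathcal{L}}{\partial z} \cdot \dfrac{\partial z}{\partial \theta}$ (16)

[0112] $\dfrac{\partial z}{\partial \theta} = \dfrac{\partial \mu}{\partial \theta} + \varepsilon \odot \dfrac{\partial \sigma}{\partial \theta}$ (17)

[0113] In the formula, $\partial \mathcal{L} / \partial z$ is the gradient of the loss function with respect to the intermediate features, and $\partial z / \partial \theta$ is the derivative of the intermediate features with respect to the model parameters $\theta$; $\mu$ and $\sigma$ are the mean and standard deviation output by the encoder, and $\varepsilon$ is the independent standard normal noise.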

[0114] Because the mean and standard deviation are both mapping functions output by the encoder, their gradients with respect to the encoder network parameters are directly computable. Therefore, the gradient from the loss function to the encoder network parameters can be obtained accurately and stably using the backpropagation algorithm. Specifically, when the encoder network parameters are updated, the values of the mean and standard deviation are adjusted automatically according to the optimization requirements of the loss function. When the model finds that certain features contribute little to the prediction of the target variable or are redundant, it tends to increase the standard deviation of the corresponding dimension of the intermediate feature. This compresses the information content of those dimensions and gradually reduces their contribution to the encoder output until they are effectively ignored. For key features that are genuinely valuable for prediction, the encoder reduces the standard deviation of the corresponding dimension so that more of their information is retained. Through repeated iteration, the encoder adaptively retains key features and eliminates redundant ones, ultimately achieving an optimal balance between information compression and encoder performance.

[0115] The final target loss function of each client model deployed on the information side and the physical side of the CPPS is given by Equation (18):

[0116] (18)

[0117] In the formula, the first term on the right-hand side is the prediction performance loss term, which measures how well the true label variable is predicted given the intermediate features. It is expressed as a cross-entropy loss, a commonly used loss function that guides the model to learn more accurate and discriminative latent feature representations. Here, N is the total number of clients, and the predicted distribution is produced by the decoder (the fully connected layer plus the classifier on the server, with its own parameters). Because the client model acts only as a feature extractor and does not include a decoder, the compressed deep features are uploaded and concatenated on the server side, which then completes the detection and classification tasks. After calculating the total cross-entropy loss, the server distributes the corresponding gradients to each client model for its update. The second term is the information compression regularization term, a penalty based on KL divergence. It measures the difference between the distribution of the intermediate compressed features Z produced by the encoder and a pre-defined simple prior distribution. Its function is to prevent the model from extracting excessive redundant information from the input variables, so that only the information genuinely useful for the output prediction is extracted, which supports good generalization performance.

[0118] In actual training and optimization, the two loss terms above are combined into a unified loss function, and a hyperparameter weights and balances them. The loss function is explicit and fully differentiable, so it can be optimized with standard deep learning training methods: forward propagation to calculate the loss value, backpropagation to calculate the gradient, and an optimizer such as Adam to update the network parameters.
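The two-term objective described above can be sketched as follows. The sketch assumes a diagonal Gaussian encoder distribution and a standard normal prior, for which the KL divergence has a closed form; the text only says the prior is a pre-defined simple distribution, so this choice is an assumption. The function names and the weight name `beta` are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over feature dimensions."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def cross_entropy(probs, label):
    """Negative log-probability that the server classifier assigns to the true label."""
    return -math.log(probs[label])

def client_objective(probs, label, mu, sigma, beta):
    """Prediction loss plus beta times the compression penalty."""
    return cross_entropy(probs, label) + beta * kl_to_standard_normal(mu, sigma)

# Toy values: a two-class prediction and a one-dimensional intermediate feature.
loss = client_objective(probs=[0.5, 0.5], label=0, mu=[1.0], sigma=[1.0], beta=0.1)
```

A larger `beta` pushes the feature distribution toward the prior and therefore compresses more strongly; a smaller `beta` favors prediction accuracy.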

[0119] The specific training process of the client model includes the following steps. First, local data is fed into the encoder to obtain the intermediate feature distribution parameters, and feature compression is performed using reparameterized sampling. Second, the compressed features are uploaded to the central server, which performs the anomaly detection task and obtains the corresponding outputs. Then, the KL divergence loss of the client model is calculated and backpropagated to obtain its gradients, and the local model is trained iteratively by combining these gradients with the gradient values returned by the central server to optimize the local model parameters.

[0120] Each client holds a different local dataset, while the label set is held by the server. In each training round, each client model extracts features from its local data and compresses and uploads them to the server. These uploaded feature vectors carry key information from their respective data sources, but have been effectively compressed locally and no longer contain too many original data details. After receiving the feature vectors from different clients, the server uses a feature concatenation method to concatenate the uploaded deep feature vectors in one-dimensional space, thereby generating a unified global feature representation that covers information from all data sources. As shown in Equation (19):

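[0121] $Z_{g} = \mathrm{Concat}(z_{\mathrm{phy}},\ z_{\mathrm{inf}})$ (19)

[0122] In the formula, $Z_{g}$ represents the global features, and $z_{\mathrm{phy}}$ and $z_{\mathrm{inf}}$ represent the extracted bilateral deep features, that is, the compressed physical-side and information-side features.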

[0123] This fusion feature not only preserves the original feature information of both sides of the data, but also enhances the feature representation by capturing complex interaction patterns and potential correlations between the information and physical domains, thereby forming richer global features in the combined feature space. This serves as the main input for subsequent anomaly detection and classification tasks.

[0124] Subsequently, the server uses the top-level classifier to perform forward inference on this global feature representation, outputs the predicted label, and completes the detection and classification task.

[0125] Specifically, global features are input into a fully connected layer. This layer learns weights and biases to perform a non-linear transformation on the feature mapping from the previous layer (the layer that acquires global features), thereby mapping the input feature sequence to a higher-level feature representation, as shown in Equation (20):

[0126] $s = W x + b$ (20)

[0127] In the formula, $s$ represents the score vector; $x$ represents the input feature vector of the fully connected layer; $W$ is the weight matrix of the fully connected layer; and $b$ is the bias vector.

[0128] Then, the output vector from the fully connected layer is fed into the final "decision maker"—the softmax function—to present the multi-class classification results in the form of probabilities. The process is shown in equation (21):

[0129] $p_m = \dfrac{e^{s_m}}{\sum_{j=1}^{M} e^{s_j}}$ (21)

[0130] In the formula, $s_m$ is the score of the input sample for category m, where M is the total number of categories; $p_m$ represents the predicted probability for category m; and $s_j$ is the score of the input sample for category j.

[0131] Each element in the output vector of the fully connected layer represents the raw score of a different category. The softmax function maps these scores to category probabilities between 0 and 1 that sum to 1. These probability values reflect the confidence that the input sample belongs to each category. The category label with the highest probability is then selected as the final output, which determines the current operating state of the CPPS system.
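The server-side scoring and decision step can be sketched as below. The weights, the feature values, and the mapping of class indices to operating states are hypothetical; only the structure (fully connected layer, softmax, highest-probability label) follows the description.

```python
import math

def fully_connected(features, weight, bias):
    """Score vector: one weighted sum plus bias per category (one weight row each)."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weight, bias)]

def softmax(scores):
    """Map raw scores to probabilities that sum to 1 (max-shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, weight, bias):
    """Return the index of the most probable category and all probabilities."""
    probs = softmax(fully_connected(features, weight, bias))
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs

# Hypothetical class order: 0 = normal, 1 = line fault, 2 = network attack.
global_features = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
label, probs = predict(global_features, weight, bias)
```

Subtracting the maximum score before exponentiating does not change the result of the softmax; it only avoids overflow for large scores.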

[0132] Then, based on the loss function between the output label and the true label, the corresponding gradient is generated and sent to each client, prompting the client to update its local model parameters through backpropagation. Specifically, the server calculates a unified cross-entropy loss function for the globally concatenated features. However, the intermediate features uploaded by each client only correspond to a portion of the global features. Therefore, directly sending the overall gradient to all clients is clearly unreasonable. To address this, the server decomposes the gradient of the global loss function with respect to the globally concatenated features into local gradients specific to each client, as follows:

[0133] First, the server takes the derivative of the main loss through backpropagation to obtain the overall gradient for the global splicing features, as shown in Equation (22).

[0134] $g = \dfrac{\partial \mathcal{L}_{CE}}{\partial Z_{g}}$ (22)

[0135] Second, because the global concatenated feature $Z_{g}$ is composed of the features uploaded by each client model, and the server records the start and end positions of each client's features in the complete concatenated feature vector at concatenation time, the server can apply an "index slicing" operation directly once the complete gradient has been generated. Since the order of feature concatenation is fixed, the server splits the complete gradient into the parts corresponding to each client according to the start and end positions of each feature segment and returns each part to the corresponding client's local model, as shown in equation (23).

[0136] $g_i = g[\mathrm{start}_i : \mathrm{end}_i]$ (23)

[0137] In the formula, $\mathcal{L}_{CE}$ is the cross-entropy loss calculated by the server and $g$ is the overall gradient with respect to the global concatenated feature $Z_{g}$; $\mathrm{start}_i$ and $\mathrm{end}_i$ represent the start and end indices, within $Z_{g}$, of the intermediate compressed feature uploaded by client i; and $g_i$ is the portion of the gradient lying between the start index $\mathrm{start}_i$ and the end index $\mathrm{end}_i$.
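The concatenation and gradient-splitting bookkeeping can be sketched as follows. The function names and toy values are hypothetical, and the sketch stores each span as a half-open `(start, end)` pair to match Python slicing, which is a convention chosen here rather than one stated in the text.

```python
def concatenate(client_features):
    """Concatenate per-client feature vectors and record each client's span."""
    global_features, spans, start = [], [], 0
    for feats in client_features:
        global_features.extend(feats)
        spans.append((start, start + len(feats)))   # end index is exclusive
        start += len(feats)
    return global_features, spans

def split_gradient(global_gradient, spans):
    """Slice the gradient of the global feature back into per-client parts."""
    return [global_gradient[s:e] for s, e in spans]

physical_side = [0.1, 0.2, 0.3]     # compressed features from the physical-side client
information_side = [0.4, 0.5]       # compressed features from the information-side client
global_features, spans = concatenate([physical_side, information_side])

# Stand-in for the gradient of the server loss with respect to the global feature.
global_gradient = [1.0, 2.0, 3.0, 4.0, 5.0]
client_gradients = split_gradient(global_gradient, spans)
```

Because the concatenation order is fixed, the spans need to be computed only once and can be reused in every training round.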

[0138] After receiving the feature gradient returned by the server, each client performs local backpropagation, as shown in equation (24).

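[0139] $\theta_i' = \theta_i - \eta\,\big(\nabla_{\theta_i}\mathcal{L}_{CE} + \lambda\,\nabla_{\theta_i}\mathcal{L}_{KL}^{(i)}\big)$ (24)

[0140] where $\theta_i'$ denotes the updated parameters of the local model on the i-th client; $\theta_i$ denotes the parameters of the local model on the i-th client before the update; $\eta$ is the learning rate; $\nabla_{\theta_i}\mathcal{L}_{CE}$ denotes the gradient generated from the feature gradient returned to the i-th client; $\nabla_{\theta_i}\mathcal{L}_{KL}^{(i)}$ denotes the gradient obtained by calculating the KL loss of the i-th client; and $\lambda$ is the balance coefficient.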
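A minimal sketch of the local update is given below. It assumes plain gradient descent and assumes that the balance coefficient multiplies the KL gradient; both gradients are taken as already expressed in parameter space, whereas a real client would first propagate the returned feature gradient back through its local model. The function and argument names are hypothetical.

```python
def client_update(params, grad_task, grad_kl, learning_rate, balance):
    """One gradient-descent step combining the server-driven and KL gradients."""
    return [p - learning_rate * (gt + balance * gk)
            for p, gt, gk in zip(params, grad_task, grad_kl)]

params = [1.0, 2.0]        # local model parameters before the update
grad_task = [0.5, -0.5]    # gradient derived from the feature gradient sent by the server
grad_kl = [1.0, 0.0]       # gradient of the local KL penalty
new_params = client_update(params, grad_task, grad_kl, learning_rate=0.1, balance=0.5)
```

In practice an adaptive optimizer such as Adam, which the description mentions, would replace the fixed-step rule used here.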

[0141] Through the backpropagation described above, the client model will adjust its local model parameters. This allows for the generation of features that are more beneficial for the classification task in the next iteration.

[0142] Based on the above description, the specific algorithm flow is summarized as follows.

[0143]

[0144] In this method, the client-side model undertakes the crucial tasks of feature extraction and encoding compression of the raw data. Unlike traditional centralized models that rely on a unified architecture, each client in VFL must independently model, train, and optimize based on the characteristics of its local data. Therefore, the quality of the chosen client-side model directly determines the quality of subsequent feature concatenation and the discriminative power of the server-side classifier.

[0145] This embodiment will conduct a series of client-side model difference comparison experiments to comprehensively and objectively compare and analyze the anomaly detection performance of different model combinations (such as LSTM, CNN, GRU, Transformer, and SCINet) under federated learning. In particular, it will verify whether the combination of the SCINet model (physical side) and the Transformer model (information side) selected in this embodiment has significant advantages and rationality, thereby verifying the scientific nature and effectiveness of the method selection.

[0146] Example 2:

[0147] This embodiment 2 is a further extension based on embodiment 1. In this embodiment 2, a total of six client combination model schemes are set up, as shown in Table 1.

[0148] Table 1: Different Combination Model Schemes

[0149]

[0150] To ensure the fairness and objectivity of the experimental results, all model combinations were run under identical conditions, using the same dataset, data partitioning method, training method, DeepVIB compression module, and server-side model structure; they differed only in the client-side feature extraction model. Performance was evaluated by accuracy, precision, recall, and F1 score to measure the differences between model combinations comprehensively. The experimental results are shown in Figures 3, 4, 5, and 6.
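For reference, the four evaluation metrics can be computed as in the sketch below. It assumes macro averaging over classes for precision, recall, and F1 score, which the text does not specify; the labels and predictions are made-up toy values, not experimental data.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {"accuracy": accuracy, "precision": sum(precisions) / n,
            "recall": sum(recalls) / n, "f1": sum(f1s) / n}

# Toy example with three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = classification_metrics(y_true, y_pred)
```

Macro averaging weights every class equally, which matters when anomalous states are much rarer than normal operation.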

[0151] Experimental results show significant differences in detection performance among different client-side model combinations. Combination F performed best, achieving an accuracy of 87.3%, nearly 4 percentage points higher than the second-best combination B, and leading by nearly 3 percentage points in F1 score, forming a clear performance gap. The advantage of combination F stems from the high degree of adaptation between its model structure and data characteristics. The SCINet model, based on its unique architecture design including segmented convolution and interactive learning, effectively captures periodic patterns and abrupt fluctuations in physical measurement data. Meanwhile, the Transformer model, through its self-attention mechanism, accurately identifies differentiated representations that deviate from normal traffic patterns. This combination ensures significant depth and complementarity in bilateral feature extraction, thereby improving the quality of features uploaded to the server and enhancing the detection capabilities of the cloud-based classifier.

[0152] In contrast, other combinations, such as D or E, while performing reasonably well on some individual metrics (e.g., recall), lagged behind combination F by more than 5 percentage points in overall metrics. The reason for this is that these combinations used homogeneous model structures across different clients, resulting in biased feature extraction. This made it difficult to process heterogeneous data on both sides with equal precision, easily leading to simplistic feature representations. This resulted in redundant and insufficiently complementary feature information uploaded to the server. Consequently, the information gain from concatenating features was limited, thus restricting the discriminative ability of the cloud-based classifier. Therefore, when selecting a model, the compatibility between the model structure and data characteristics must be considered; blindly applying the same model will not improve performance.

[0153] Furthermore, experimental results show that in CPPS federated learning, blindly relying on a single "high-performance" model cannot guarantee optimal anomaly detection performance under dual-client collaboration. More importantly, it is crucial to consider the adaptability between the model structure and the characteristics of the data itself. That is, based on the characteristics of different client data, a suitable combination of model structures for processing heterogeneous data from various locations should be selected to efficiently extract deep features from distributed heterogeneous data. This perfectly meets the inherent requirements of FL for concise, efficient, and complementary feature representation, while also fully demonstrating the structural advantages and application potential of highly adaptable heterogeneous model combinations under non-independent and identically distributed data.

[0154] Example 3:

[0155] This embodiment 3 is a further extension based on embodiment 2. Real-world power grid systems have complex and diverse topologies and significant differences in scale. This often leads to anomaly detection models failing to exhibit good generalization performance in practical applications. That is, when a detection model is trained on a power system of a specific scale, its anomaly detection performance often deteriorates significantly or even fails to operate normally when migrated to a power system with a drastically different scale. Therefore, the primary challenge facing anomaly detection models in practical applications is how to achieve better generalization performance.

[0156] Embodiment 3 designs a cross-dataset transfer generalization experiment to evaluate whether the proposed method can achieve sufficient generalization ability while balancing detection performance and privacy protection. The experiment includes three training modes: single-sided training, combined training, and federated training.

[0157] (1) Single-sided training: The detection model is trained using only data from either the physical or information side of the CPPS system, without cross-domain feature fusion. That is, all physical data is used only to train the SCINet model, and all information data is used only to train the Transformer model. The performance metric of the best performing model is selected.

[0158] (2) Combined training: Same as the SCINet-Transformer combined model settings;

[0159] (3) Federated training: The technical solution used in this method is adopted. After extracting the deep features, each client model compresses and uploads them, and the server completes the feature stitching and performs the detection task.

[0160] In short, the single-sided training mode can be regarded as a detection model trained using only a single modality feature, the combined training mode is equivalent to an ideal model with complete information and no privacy constraints, while the federated training mode is a detection model that fuses information from both sides under privacy protection conditions.

[0161] Example 3 uses IEEE 33-node system simulation data as the training dataset for the models. Three training modes are trained based on this dataset. After training, performance metrics are recorded. The training samples include normal operation data, line fault data, and network attack data to ensure that each model can effectively learn the feature patterns of different abnormal states. Simultaneously, a more complex and larger-scale IEEE 118-node system simulation dataset is selected as the test set, which includes abnormal data of the same type as the training set. After the training phase is complete, the models are transferred to the IEEE 118-node dataset for testing. During the transfer testing phase, the models are not retrained or fine-tuned to examine their generalization performance in unfamiliar data environments. The performance evaluation metrics for anomaly detection include accuracy, precision, recall, and F1 score to objectively and fairly assess the differences in generalization performance among the three learning modes. The experimental results are shown in Table 2.

[0162] Table 2: Comparison of Generalization Performance

[0163]

[0164] The experimental results show that, as expected, the detection performance of each training mode differed significantly during the training and testing phases. Firstly, the combined training mode demonstrated the best overall performance, especially with an accuracy rate as high as 95.1%, significantly outperforming both the single-sided and federated training modes. The F1 score was also nearly 10% higher than the single-sided training mode. This is because, under the combined training mode, the anomaly detection model has no information constraints or feature compression, thus possessing the most comprehensive feature information and the highest feature space integrity. The model has sufficient discriminative information for detection and classification.

[0165] In contrast, although federated training also uses bilateral data from the CPPS, it operates under privacy constraints: some feature detail is lost during feature compression and uploading, which leads to slightly inferior performance. The gap relative to combined training is approximately 6-8% across the metrics. From a practical application perspective, however, this compromise in accuracy yields higher privacy security. Single-sided training performs worst, with an average gap of over 10% in accuracy and F1 score relative to combined training. This indicates that, without cross-modal information, relying on a single information source makes high-precision anomaly detection and recognition difficult.

[0166] During the transfer testing phase, the overall detection performance of all three training modes declined. However, it's worth noting that the combined training mode and the federated training mode demonstrated stronger cross-scenario adaptability. Compared to the training and testing phase, the overall performance decline of the single-sided training mode exceeded 15%, the combined training mode experienced a performance drop of over 5% during transfer, while the federated training mode only saw a decline of around 2-3%. This indicates that federated training has a greater generalization advantage when dealing with new data environment changes.

[0167] This is primarily because both combined training and federated training utilize bilateral data, resulting in broader feature coverage and stronger adaptability to similar anomalies. Furthermore, federated training introduces a feature compression layer, making the model more inclined to retain key information useful for the detection task while actively removing redundant information lacking discriminative power. This mechanism prevents the model from overfitting to specific topologies and reduces its dependence on specific training environments. Thus, even in new testing environments, the general essential information learned by the model remains effective when facing similar anomalies. In contrast, combined training, lacking effective feature compression and generalization mechanisms, tends to exhibit overfitting during training. When the model is transferred to new data environments, this overfitting problem is amplified, leading to a decline in performance.

[0168] In summary, while combined training offers optimal detection performance under ideal conditions, it carries a significant risk of privacy breaches due to its reliance on complete raw data. Federated training, while slightly inferior in detection performance, still performs well overall, and its advantages in privacy protection and stable generalization performance are more pronounced, making it a more promising training mode choice.

[0169] Example 4:

[0170] This embodiment 4 is a further extension of embodiment 3. In this embodiment 4, the impact of the feature compression strategy (feature compression layer) on the communication load and transmission efficiency of the client uploading data to the server will be evaluated. Specifically, the experiment will compare the standard federated learning based on the FedAvg method with FD-CP models with different degrees of feature compression. The communication efficiency and detection performance of each method will be evaluated using the following two metrics: (1) average number of communication rounds, and (2) average accuracy. Finally, the advantages of the feature compression strategy in reducing communication overhead will be quantitatively analyzed based on the experimental results.

[0171] FedAvg: All client models are uniformly set to Transformer models and trained on local data. Each round, the model parameters are uploaded to the server. The server then performs a weighted average of the uploaded model parameters to generate a global model, which is sent to each client for the next iteration. While this avoids sharing raw data, it also means that each round of communication transmits redundant payloads equivalent to the size of the complete model structure, potentially consuming significant communication resources.
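The server-side aggregation in the FedAvg baseline can be sketched as follows. The text only says that a weighted average is taken; weighting by local sample count is the usual FedAvg convention and is assumed here. Parameters are represented as flat lists and all names are hypothetical.

```python
def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors, weighted by local sample count."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [sum(p[j] * n for p, n in zip(client_params, client_sizes)) / total
            for j in range(n_params)]

# Two clients with different amounts of local data.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]
global_params = fedavg(client_params, client_sizes)
```

Every round transmits a vector as long as the full model in both directions, which is the communication cost that feature-level uploading avoids.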

[0172] FD-CP experimental group: Three different feature compression levels were set up for control experiments, and feature entropy compression ratio (FC) was introduced to characterize the different compression levels. As shown in Equation (25).

[0173] $FC = \dfrac{H(Z)}{H(X)}$ (25)

[0174] In the formula, $H(X)$ represents the total amount of information contained in the original input data, and $H(Z)$ represents the amount of information retained in the compressed features.

[0175] The calculation of information entropy is specifically expressed as shown in equations (26) and (27).

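[0176] $H(X) = \dfrac{1}{2}\ln\!\big((2\pi e)^{m}\,\lvert\Sigma_X\rvert\big)$ (26)

[0177] In the formula, m represents the feature dimension of the original input data, and $\Sigma_X$ is the covariance matrix of the original input data.

[0178] $H(Z) = \dfrac{1}{2}\ln\!\big((2\pi e)^{k}\,\lvert\Sigma_Z\rvert\big)$ (27)

[0179] In the formula, k represents the compressed feature dimension, and $\Sigma_Z$ is the covariance matrix of the compressed features.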

[0180] In essence, the FC ratio reflects the degree to which the compressed features preserve information relative to the original data. A higher value indicates more conservative compression and more information retained; a lower value indicates stronger compression and less information retained.
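One way to compute such a ratio is sketched below. It assumes that both the input data and the compressed features are approximately Gaussian, so that their entropy follows from the covariance matrix, and that the covariance matrices are symmetric positive definite. The helper names are hypothetical, and the identity covariances are toy values.

```python
import math

def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance cov."""
    d = len(cov)
    return 0.5 * (d * math.log(2.0 * math.pi * math.e) + log_det(cov))

def fc_ratio(cov_input, cov_compressed):
    """Entropy of the compressed features divided by entropy of the input."""
    return gaussian_entropy(cov_compressed) / gaussian_entropy(cov_input)

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Unit-variance, uncorrelated toy data: 4 input dimensions compressed to 2.
ratio = fc_ratio(identity(4), identity(2))
```

With identity covariances the ratio reduces to the ratio of dimensions. Differential entropy can be negative for very small variances, so the ratio is only meaningful when both entropies are positive.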

[0181] To study the impact of different degrees of feature compression on communication efficiency, this embodiment designs three compression schemes: (1) an FC ratio of 1/3; (2) an FC ratio of 1/2; and (3) an FC ratio of 2/3.

[0182] After the four methods converged, the experimental results are shown in Figures 7, 8, 9, and 10.

[0183] Experimental results show that, with the FedAvg method, the frequent uploading and aggregation of complete model parameters between the clients and the server leads to a large amount of communication data per round and low communication efficiency. In addition, data heterogeneity among clients and redundant weight aggregation increase the number of communication rounds, slow the convergence of the global model, and raise the overall communication overhead. The scheme with an FC ratio of 1/3, by contrast, compresses too aggressively, so a large amount of detail that is crucial for the detection task is lost during feature compression. The model's discriminative power drops, and the average detection accuracy is only 41.3%. Moreover, because too little information is uploaded in each round, the model needs more training rounds to compensate for the missing information and reach convergence. Ultimately, this scheme requires nearly 30% more training rounds, yet it performs poorly and further increases the communication burden.

[0184] The other two schemes, with FC ratios of 1/2 and 2/3, employ relatively mild feature compression and achieve a good balance between detection performance and communication efficiency. In particular, the better-performing of the two effectively eliminates redundant information in the intermediate features while retaining key information with high discriminative power; it reaches the highest accuracy among all schemes, 84.2%, an improvement of nearly 7% over FedAvg, with approximately 40% fewer communication rounds. The remaining scheme is slightly inferior, but its communication efficiency is still about 30% higher than that of FedAvg and its accuracy loss is within 3%, demonstrating strong application feasibility.

[0185] Example 4 further verifies the impact of feature compression strategies on federated communication efficiency and detection performance. While excessive feature compression can theoretically reduce the cost of a single round of communication, it leads to severe degradation of model performance, resulting in a net loss in overall convergence efficiency. Conversely, a reasonable and moderate feature compression strategy can effectively reduce redundancy and retain key information, thereby accelerating model convergence while ensuring detection accuracy.

[0186] In summary, this invention fully considers the heterogeneity and distributed nature of the data from both sides of the CPPS system. It employs vertical federated learning to achieve privacy-preserving collaborative training and designs specific feature extraction methods for both physical-side measurement data and information-side traffic data. The physical-side client uses the SCINet model with multi-scale dynamic feature extraction capabilities to process the physical measurement data, while the information-side client utilizes the Transformer model to capture local mutations and long-term dependencies in the communication traffic data. Unlike traditional vertical federated learning methods that simply concatenate or aggregate features, this invention uses a learnable distribution of hidden variables to adaptively compress deep features of local data from each client. This effectively removes redundant privacy-sensitive information while retaining the discriminative information required for anomaly detection, significantly reducing the risk of privacy leakage. Furthermore, this invention designs a bidirectional collaborative optimization mechanism between the client and server. Through gradient feedback from the server, the client model is dynamically iteratively optimized, enabling the client feature extraction network to respond in real-time to the needs of the global classification task, continuously improving the model's perceptual sensitivity and generalization ability. This collaborative mechanism solves the bottleneck problem of isolated optimization and insufficient collaboration between the client and server in existing technologies, thereby further improving the accuracy and robustness of anomaly detection while ensuring privacy protection.

Claims

1. A privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning, characterized in that it comprises the following steps:
introducing a vertical federated learning architecture and building local models on the clients: an SCINet model is deployed on the physical side of the CPPS and a Transformer model is deployed on the information side of the CPPS; the SCINet model is used to extract deep features of the physical-side data, and the Transformer model is used to extract deep features of the information-side data;
with the goal of minimizing the mutual information between the intermediate features uploaded by a client and the original data while maximizing the mutual information between the intermediate features and the output labels, performing feature compression on the deep features of the physical-side data and on the deep features of the information-side data respectively, to obtain physical-side compressed features and information-side compressed features;
uploading the physical-side compressed features and the information-side compressed features to the server and concatenating them in one-dimensional space to obtain global features;
in the server, mapping the global features to a score vector through a fully connected layer, obtaining the predicted probability of each category through a classifier, and selecting the category with the highest predicted probability as the detection classification result;
calculating a loss function from the detection classification result and the true label, generating the corresponding gradient, splitting the gradient according to the concatenation order of the global features, and sending each part to the corresponding client;
in each client, training the local model based on a KL-divergence penalty term and the received gradient to obtain a trained local model;
obtaining deep features of the local data to be detected through the trained local model, and then obtaining the corresponding anomaly detection result through the server.
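As a non-limiting illustration, the compression goal stated in claim 1 (little mutual information between the uploaded features and the raw data, much mutual information between the uploaded features and the labels) has the general shape of an information-bottleneck objective. The symbols below are assumptions introduced only for this illustration: $Z$ is the uploaded intermediate feature, $X$ the client's raw local data, $Y$ the output label, $\theta$ the local model parameters, and $\beta > 0$ a trade-off coefficient.

```latex
\max_{\theta}\; I(Z;Y) \;-\; \beta\, I(Z;X), \qquad \beta > 0
```

A larger $\beta$ favours stronger compression (less information about $X$ retained), while a smaller $\beta$ favours retaining label-relevant detail.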

2. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 1, characterized in that the deep features of the physical-side data are extracted by the following steps: data points of T consecutive time steps are extracted from the physical-side time-series dataset of the CPPS and arranged in chronological order to form a time series $X = \{x_1, x_2, \ldots, x_T\}$, where each time step contains N-dimensional features and $x_i \in \mathbb{R}^N$ denotes the data at the i-th time step; the time-series data includes node voltage, current, and power; the time series $X$ is used as the input of the SCINet model, and the SCINet model outputs the corresponding feature vector, thereby obtaining the deep features of the physical-side data.
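As a non-limiting illustration of the windowing step in claim 2, the following Python sketch cuts a physical-side series into runs of T consecutive time steps. The function name `extract_windows`, the `stride` parameter, and the sample values are assumptions made for this illustration; the SCINet model itself is not shown.

```python
from typing import List

Sample = List[float]  # one time step: N feature values (e.g. voltage, current, power)


def extract_windows(series: List[Sample], T: int, stride: int = 1) -> List[List[Sample]]:
    """Return every run of T consecutive time steps, kept in chronological order."""
    if T <= 0 or stride <= 0:
        raise ValueError("T and stride must be positive")
    # A series shorter than T yields no window at all.
    return [series[start:start + T] for start in range(0, len(series) - T + 1, stride)]


# Toy series: 5 time steps, N = 3 features per step (illustrative values only).
series = [
    [1.00, 0.50, 0.50],
    [1.01, 0.52, 0.53],
    [0.99, 0.49, 0.49],
    [1.02, 0.55, 0.56],
    [0.98, 0.47, 0.46],
]
windows = extract_windows(series, T=3)
```

Each element of `windows` would then serve as one input sequence for the physical-side local model.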

3. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 1, characterized in that the deep features of the information-side data are extracted by the following steps: traffic records or business records at the same time points as the corresponding physical-side data are selected from the information side of the CPPS to obtain an information-side sample sequence; the information-side sample sequence is used as the input of the Transformer model, and the Transformer model outputs the corresponding feature vector, thereby obtaining the deep features of the information-side data.
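As a non-limiting illustration of the time alignment in claim 3, the following Python sketch selects, for each physical-side timestamp, the information-side record carrying the same timestamp. The dictionary layout, the integer timestamps, the function name `align_information_side`, and the choice to raise an error on a missing record are assumptions made for this illustration.

```python
from typing import Dict, List

Record = List[float]  # one information-side traffic or business record


def align_information_side(physical_timestamps: List[int],
                           info_records: Dict[int, Record]) -> List[Record]:
    """Return, in order, the information-side record sharing each physical-side timestamp."""
    missing = [ts for ts in physical_timestamps if ts not in info_records]
    if missing:
        # Assumption: an unmatched timestamp is treated as an error rather than imputed.
        raise KeyError(f"no information-side record for timestamps {missing}")
    return [info_records[ts] for ts in physical_timestamps]


# Illustrative values only.
physical_timestamps = [100, 101, 102]
info_records = {99: [0.0, 1.0], 100: [12.0, 3.0], 101: [15.0, 2.0], 102: [90.0, 7.0]}
info_sequence = align_information_side(physical_timestamps, info_records)
```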

4. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 1, characterized in that feature compression is performed on the deep features of the physical-side data and on the deep features of the information-side data respectively by the following steps: a feature compression layer is connected before the final output layer of the SCINet model and of the Transformer model, and this layer performs feature compression on the deep features of the physical-side data and of the information-side data; the expression for feature compression in this layer is:
$$z = \mu + \sigma \odot \epsilon$$
where $z$ is the compressed feature obtained by the feature compression layer; $\mu$ and $\sigma$ are, respectively, the mean vector and the standard deviation vector of the latent-variable distribution output by the SCINet model or the Transformer model; $\epsilon$ is random noise sampled independently from the standard normal distribution $\mathcal{N}(0, I)$; and $\odot$ denotes element-wise multiplication.
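As a non-limiting illustration of the compression layer in claim 4, the following Python sketch draws a compressed feature as the mean vector plus the standard deviation vector multiplied element-wise by independent standard normal noise. The function name `compress` and the numeric values are assumptions made for this illustration; in practice the mean and standard deviation would be produced by the local model.

```python
import random
from typing import List


def compress(mu: List[float], sigma: List[float], rng: random.Random) -> List[float]:
    """Sample mu + sigma * eps element-wise, with eps drawn independently from N(0, 1)."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Illustrative values only.
mu = [0.2, -1.0, 0.5]
sigma = [0.1, 0.3, 0.05]
z = compress(mu, sigma, random.Random(0))
```

Because the noise enters only through an addition and a multiplication, the sampled feature remains differentiable with respect to the mean and standard deviation, which is what allows the returned gradient to train the local model.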

5. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 1, characterized in that the physical-side compressed features and the information-side compressed features are concatenated in one-dimensional space as follows: the physical-side compressed feature and the information-side compressed feature at the same time point are concatenated into a two-sided deep feature; the two-sided deep features are then concatenated in chronological order to obtain the global features.
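As a non-limiting illustration of the concatenation in claim 5, the following Python sketch first joins the two sides at each time point and then appends the joined vectors in chronological order. The function name `build_global_feature` and the sample values are assumptions made for this illustration.

```python
from typing import List


def build_global_feature(physical: List[List[float]],
                         information: List[List[float]]) -> List[float]:
    """Concatenate per-time-point features of both sides, then chain them chronologically."""
    if len(physical) != len(information):
        raise ValueError("both sides must cover the same time points")
    global_feature: List[float] = []
    for z_phys, z_info in zip(physical, information):
        # Two-sided deep feature for one time point: physical side first, then information side.
        global_feature.extend(z_phys + z_info)
    return global_feature


# Two time points; physical side has 2 values per point, information side has 1.
physical = [[1.0, 2.0], [5.0, 6.0]]
information = [[3.0], [7.0]]
global_feature = build_global_feature(physical, information)
```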

6. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 1, characterized in that the expression for mapping the global features to a score vector through the fully connected layer is:
$$s = W h + b$$
where $s$ is the score vector obtained from the fully connected layer; $h$ is the input feature vector of the fully connected layer, i.e., the global features; $W$ is the weight matrix of the fully connected layer; and $b$ is the bias vector.
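As a non-limiting illustration of the fully connected mapping in claim 6, the following Python sketch computes one score per category as a weighted sum of the global features plus a bias. The function name `fully_connected`, the row-per-category weight layout, and the numeric values are assumptions made for this illustration.

```python
from typing import List


def fully_connected(h: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Return the score vector: for each category row w of W, dot(w, h) plus its bias."""
    if len(W) != len(b) or any(len(row) != len(h) for row in W):
        raise ValueError("shape mismatch between W, h and b")
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


# Illustrative values: 2 global feature values mapped to 3 category scores.
h = [3.0, 4.0]
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [0.5, 0.0, -1.0]
scores = fully_connected(h, W, b)
```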

7. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 6, characterized in that the classifier in the server is a softmax classifier, and its classification expression is:
$$p_m = \frac{e^{s_m}}{\sum_{j=1}^{M} e^{s_j}}$$
where $p_m$ represents the predicted probability of category m obtained from the score vector $s$; $s_m$ is the score of category m for the global features $h$; $M$ represents the total number of categories; $s_j$ is the score of category j for the global features $h$; and $e$ is the base of the natural logarithm.
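As a non-limiting illustration of the softmax classifier in claim 7, the following Python sketch converts a score vector into category probabilities and selects the most probable category. Subtracting the maximum score before exponentiation is a common numerical safeguard that leaves the probabilities unchanged; it is an addition of this illustration and is not stated in the claim. The function names are likewise assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Return exp(s_m) / sum_j exp(s_j) for every category m."""
    shift = max(scores)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores: List[float]) -> int:
    """Return the index of the category with the highest predicted probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


probs = softmax([1.0, 3.0, 2.0])
label = predict([1.0, 3.0, 2.0])
```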

8. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 1, characterized in that the loss function calculated from the detection classification result and the true label is the cross-entropy loss function.
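As a non-limiting illustration of the cross-entropy loss in claim 8, the following Python sketch evaluates the loss for a single sample and the gradient with respect to the scores fed to a softmax classifier, using the standard identity that this gradient equals the predicted probabilities minus the one-hot true label. The function names and numeric values are assumptions made for this illustration.

```python
import math
from typing import List


def cross_entropy(probs: List[float], true_label: int) -> float:
    """Cross-entropy for one sample: the negative log-probability of the true category."""
    return -math.log(probs[true_label])


def grad_wrt_scores(probs: List[float], true_label: int) -> List[float]:
    """Gradient of the loss w.r.t. the softmax input scores: probabilities minus one-hot label."""
    return [p - (1.0 if m == true_label else 0.0) for m, p in enumerate(probs)]


# Illustrative predicted probabilities for 3 categories; the true category is index 1.
probs = [0.25, 0.5, 0.25]
loss = cross_entropy(probs, true_label=1)
grad = grad_wrt_scores(probs, true_label=1)
```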

9. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 8, characterized in that the expression for splitting the gradient according to the concatenation order of the global features is:
$$g_i = \frac{\partial \mathcal{L}}{\partial z_i} = \left[\frac{\partial \mathcal{L}}{\partial h}\right]_{a_i : b_i}$$
where $g_i$ denotes the gradient component corresponding to the i-th client; $\mathcal{L}$ is the cross-entropy loss function; $z_i$ is the compressed feature corresponding to the i-th client, being either the physical-side compressed feature or the information-side compressed feature; $a_i$ and $b_i$ are, respectively, the starting index and the ending index, within the global features $h$, of the compressed feature uploaded by the i-th client; and $\left[\partial \mathcal{L} / \partial h\right]_{a_i : b_i}$ is the portion of the gradient whose starting index and ending index are $a_i$ and $b_i$ respectively.
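As a non-limiting illustration of the gradient splitting in claim 9, the following Python sketch cuts the gradient of the loss with respect to the global features into one slice per client, following the order in which the clients' features were concatenated. The function name `split_gradient`, the use of per-client widths instead of explicit start and end indices, and the assumption that each client occupies one contiguous index range are all choices made for this illustration.

```python
from typing import List


def split_gradient(global_grad: List[float], widths: List[int]) -> List[List[float]]:
    """Slice the global-feature gradient into consecutive per-client parts of the given widths."""
    if sum(widths) != len(global_grad):
        raise ValueError("client widths must add up to the global feature length")
    parts: List[List[float]] = []
    start = 0
    for width in widths:
        end = start + width          # start and end index of this client's features
        parts.append(global_grad[start:end])
        start = end
    return parts


# Illustrative values: physical side uploaded 2 values, information side uploaded 3.
global_grad = [0.1, 0.2, 0.3, 0.4, 0.5]
physical_grad, information_grad = split_gradient(global_grad, [2, 3])
```

Only the slice belonging to a client is returned to that client, so neither side receives gradient information about the other side's features.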

10. The privacy-enhanced CPPS anomaly detection method based on longitudinal federated learning according to claim 9, characterized in that, during the training process of the local model, the expression of the loss function $\mathcal{L}_{\text{local}}$ of the local model is:
$$\mathcal{L}_{\text{local}} = \frac{1}{N}\sum_{n=1}^{N}\Big[-\log q_{\phi}(y_n \mid z_n) + \beta\,\mathrm{KL}\big(p(z_n \mid x_n)\,\|\,r(z)\big)\Big]$$
where $N$ is the total number of clients; $q_{\phi}(y \mid z)$ denotes the distribution of the detection classification results obtained when the server has parameters $\phi$; $\mathrm{KL}\big(p(z \mid x)\,\|\,r(z)\big)$ is the penalty term based on KL divergence, used to measure the degree of difference between the distribution $p(z \mid x)$ of the compressed features produced by the local model and a pre-defined simple prior distribution $r(z)$; $\beta$ is the weight; $z$ is the intermediate compressed feature of the local model; $y$ is the detection classification result output by the server; and $x$ is the input during the local model training process.
During the training process of the local model, the expression for updating the local model parameters of the i-th client is:
$$\theta_i' = \theta_i - \eta\left(g_i + \lambda\, g_i^{\mathrm{KL}}\right)$$
where $\theta_i'$ is the updated parameter of the local model on the i-th client; $\theta_i$ is the parameter of the local model on the i-th client before the update; $\eta$ is the learning rate; $g_i$ denotes the gradient generated for the features of the i-th client; $g_i^{\mathrm{KL}}$ denotes the gradient obtained by calculating the KL loss of the i-th client; and $\lambda$ is the balance coefficient.
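As a non-limiting illustration of the KL-divergence penalty and the parameter update in claim 10, the following Python sketch assumes that the compressed-feature distribution is a diagonal Gaussian and that the pre-defined simple prior is the standard normal distribution, in which case the KL divergence has a well-known closed form. Both distributional choices, the function names, and the assumption that the two gradients have already been back-propagated to the local model parameters are made only for this illustration.

```python
import math
from typing import List


def kl_to_standard_normal(mu: List[float], sigma: List[float]) -> float:
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


def update_parameters(theta: List[float], grad_task: List[float], grad_kl: List[float],
                      learning_rate: float, balance: float) -> List[float]:
    """One gradient step combining the server-returned gradient and the KL-penalty gradient."""
    return [t - learning_rate * (g + balance * k)
            for t, g, k in zip(theta, grad_task, grad_kl)]


# Illustrative values only.
penalty = kl_to_standard_normal([1.0], [1.0])
new_theta = update_parameters([1.0, 2.0], [0.5, -0.5], [1.0, 0.0],
                              learning_rate=0.1, balance=0.5)
```

The penalty is zero exactly when the compressed-feature distribution already equals the prior, so a larger balance coefficient pushes the uploaded features toward carrying less sample-specific information.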