Transverse federated learning training method capable of improving system fairness in sensitive scenarios

By introducing the Gini coefficient and the impact evaluator of deep reinforcement learning into federated learning, and adjusting the aggregation weights of the global model, the performance unfairness caused by heterogeneity of client data in federated learning is solved, and a balance between fairness and accuracy is achieved in heterogeneous scenarios.

CN115238905B · Active · Publication Date: 2026-06-30 · TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
Filing Date: 2022-07-05
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In sensitive scenarios, federated learning suffers from slow model convergence, overall performance degradation, and performance unfairness across clients due to heterogeneity in client data.

Method used

We use the Gini coefficient as a fairness metric for federated aggregation and construct an impact evaluator based on deep reinforcement learning. By iteratively aggregating local models from different clients, we adjust the aggregation weights of the global model to improve system fairness.

Benefits of technology

While maintaining the overall performance of the model, the method significantly improves fairness across clients, achieving a balance between fairness and accuracy in heterogeneous scenarios.



Abstract

The application discloses a transverse federated learning training method capable of improving system fairness in a sensitive scene, and comprises the following steps: S1, establishing a fairness index of federated aggregation; S2, constructing an influence evaluator based on deep reinforcement learning; S3, based on transverse federated learning, a client aggregates local models from different clients through iteration, and trains a shared global model through an aggregation mode. According to the application, the verification accuracy of different clients can be used to allocate the aggregation weight of the current round, and the contribution of different clients to the global model during each aggregation round can be intelligently arranged, so that the performance of the global model is ensured, and the fairness of the clients is ensured.


Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a horizontal federated learning training method that can improve system fairness in sensitive scenarios.

Background Technology

[0002] The development of machine learning technology has enabled predictive algorithms to perform better than expected in some fields. However, in practice, a sufficiently effective model requires massive amounts of data for training. But in some sensitive scenarios, such as patient data from different hospitals or driving data from different vehicles, a single device may not have enough quantity and quality of data to learn a more robust model.

[0003] Federated learning is a novel distributed machine learning approach where each client trains a local model or updates parameters using local data, then only sends the model parameters to the server. All parameters are aggregated in the cloud, resulting in a more comprehensive model without the need for data exchange. However, in practical applications, heterogeneity leads to a significant performance degradation in federated learning, including a 9.2% drop in accuracy and a 2.32-fold increase in training time, while also compromising fairness. For example, if different clients have vastly different data distributions, the model's performance may vary significantly across different vehicles, with accuracy even lower than that of the local model.

[0004] Due to the heterogeneity of data size and distribution across different clients in federated learning, simply minimizing the total loss in large networks may disproportionately improve or degrade model performance on some clients, resulting in a loss of uniformity in device outcomes. For example, while the federated average accuracy is high, the accuracy of individual devices within the network cannot be guaranteed.

[0005] Recently, there has been increasing interest in fair machine learning algorithms; however, current methods are not directly applicable to federated settings. Recent work has introduced fairness algorithms suitable for federated learning, employing minimax optimization methods to ensure that overall fairness is not improved at the expense of some clients' performance. Borrowing the idea of resource allocation, other work treats fairness as a resource to be allocated, achieving a more concentrated distribution of client accuracy and mitigating potential conflicts between clients before gradients are averaged. However, these algorithms all treat fairness as their sole objective, whereas it is reasonable to assume a trade-off between fairness and optimal performance (usually represented by average performance). Therefore, in real-world federated learning applications, it is natural to want a method that further guarantees fairness while preserving optimal performance.

[0006] To reflect the fairness of a federated network, metrics such as uniformity and the standard deviation (STD) of client accuracy have been proposed as indicators of network fairness. However, metrics such as the STD depend on the expected accuracy. For example, suppose the test accuracies of all clients in networks A and B are 0.07, 0.08, 0.09 and 0.7, 0.8, 0.9, respectively. Although the relative "rich-poor" gap in test accuracy is the same in both networks, the STD of A is smaller, so the fairness levels of federated networks with different system performance levels cannot be compared.

Summary of the Invention

[0007] To overcome the deficiencies of the above background art, the present invention proposes a horizontal federated learning training method that can improve system fairness in sensitive scenarios to solve the problems of slow convergence of federated learning, overall performance degradation, and performance unfairness across clients caused by heterogeneous data on different clients.

[0008] The technical problems of the present invention are solved by the following technical solutions:

[0009] The present invention discloses a horizontal federated learning training method that can improve system fairness in sensitive scenarios, including the following steps:

[0010] S1. Establish a fairness metric for federated aggregation;

[0011] S2. Construct an impact evaluator based on deep reinforcement learning;

[0012] S3. Based on horizontal federated learning, the client iteratively aggregates local models from different clients and trains a shared global model through an aggregation method.

[0013] In some embodiments, in step S1, the Gini coefficient is used as the fairness metric for federated aggregation.

[0014] In some embodiments, the definition of the Gini coefficient is as follows: Let the performances of model one and model two on N clients be ω1 and ω2 respectively. If G(ω1) < G(ω2), then model one is more balanced, where:

[0015] $$G(\omega) = \frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\left|acc_i - acc_j\right|}{2K^{2}\mu}, \qquad \mu = \frac{1}{K}\sum_{k=1}^{K} acc_k$$ (1)

[0016] where G represents the fairness coefficient of accuracy across clients, $acc_i$ and $acc_j$ represent the accuracies on the test sets of any two clients under the model, $\mu$ represents the average accuracy of all clients, K represents the total number of clients, and k denotes the k-th client; the fairness of the model for all clients is defined according to the Lorenz curve, and the metric for judging fairness is a proportional value between 0 and 1.

[0017] In some embodiments, step S2 specifically includes:

[0018] S2.1 The state is represented by the model parameters of each client device in each round.

[0019] S2.2 Given the current state, the reinforcement learning agent learns a policy distribution, calculates the aggregation weights for each client based on the policy, and then updates the global model.

[0020] S2.3. Transmit the updated global model parameters to the local machine, observe the local validation accuracy of the global model on the local validation set, and obtain the reward function of the deep reinforcement learning agent from the average validation accuracy and Gini coefficient on each client.

[0021] In some embodiments, step S2.1 specifically includes: the state of the t-th round is represented by the vector $s_t = \{\omega_1^t, \omega_2^t, \ldots, \omega_K^t\}$, whose entries are the model parameters of the K clients participating in the update; during training, the client and server jointly maintain the model parameter list $\{\omega_k \mid k \in K\}$. The set contains K clients, where k denotes the k-th client in the set; in each round of federated learning, the list is updated after a client uploads its trained local model to the server.

[0022] In some embodiments, step S2.2 specifically includes: after each state list update, a deep reinforcement learning agent is trained using the client model parameters participating in the aggregation. The action space is represented by the vector $a_t = \{a_1^t, a_2^t, \ldots, a_K^t\}$, where $a_k^t$ is a learnable mean; the weight with which the k-th client participates in the aggregation of the global model is thus sampled from a Gaussian distribution with mean $a_k^t$ and fixed variance.

[0023] In some embodiments, step S2.3 specifically includes: the reward comes from a small validation set local to the client and is defined as $r_t = \mu_t \log(G_t)$, where $\mu$ and $G$ represent the average validation accuracy and the accuracy fairness coefficient on the client side, respectively.

[0024] In some embodiments, in step S2, the deep reinforcement learning-based impact evaluator learns the optimal aggregation strategy using a fully connected network with four layers.

[0025] In some embodiments, step S3 specifically includes: in each iteration, the server randomly selects a certain number of clients to transmit global model parameters, the clients participating in the training use the downloaded global model for training, and then upload the locally trained model parameters and aggregate them into a new global model on the server.

[0026] In some embodiments, training a shared global model via aggregation includes the following steps:

[0027] The federated learning problem can be written as:

[0028] $$\min_{\omega} F(\omega) = \sum_{i=1}^{N} p_i F_i(\omega)$$ (2)

[0029] where $F_i(\omega)$ is the local loss function of the i-th client;

[0030] The loss functions of different clients are aggregated, and the aggregation weights $p_i$ are learned using a deep reinforcement learning-based impact evaluator. The state-action transitions and rewards during the aggregation training process are abstracted into a Markov decision process. In the t-th iteration, the impact evaluator obtains the environmental observation $s_t$ and performs action $a_t$; the environment is affected by the agent and transitions to state $s_{t+1}$, and the agent receives a reward $r_t$ for this process. The goal of a reinforcement learning agent is to find an optimal policy that maximizes the expected long-term reward.

[0031] $$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi(\tau)}\left[r(\tau)\right]$$ (3)

[0032] Where π represents the policy, τ represents a trajectory obtained by interacting with the environment using the policy, E denotes the expectation, and r(τ) represents the total reward of this trajectory.

[0033] Next, by expanding the formula, we obtain the objective function and the gradient based on the Monte Carlo approximation as follows:

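[0034] $$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[r(\tau)\right] = \int \pi_{\theta}(\tau)\, r(\tau)\, d\tau, \qquad \nabla_{\theta} J(\theta) \approx \frac{1}{M}\sum_{m=1}^{M} \nabla_{\theta} \log \pi_{\theta}(\tau_m)\, r(\tau_m)$$ (4)

[0035] Where J is the objective function, θ denotes the neural network parameters of the policy, d is the differential symbol in the integral, and M is the number of sampled trajectories $\tau_m$;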

[0036] Because federated learning training is very time-consuming and computationally resource-intensive, the PPO algorithm from reinforcement learning is used to improve sample efficiency; the importance sampling concept is applied, as follows:

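[0037] $$J(\theta_{new}) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}(\tau)}\left[\frac{\pi_{\theta_{new}}(\tau)}{\pi_{\theta_{old}}(\tau)}\, r(\tau)\right]$$ (5)

[0038] Where $\pi_{\theta_{new}}$ denotes the new policy and $\pi_{\theta_{old}}$ denotes the old policy; that is, interactions collected under the old policy are reused to learn the new policy.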

[0039] The beneficial effects of this invention compared to the prior art include:

[0040] This invention provides a horizontal federated learning training method for sensitive scenarios that can improve system fairness. It establishes a fairness index for federated aggregation, constructs an impact evaluator based on deep reinforcement learning, and then, based on horizontal federated learning, allows clients to iteratively aggregate local models from different clients, training a shared global model through this aggregation. The parameters output by the deep reinforcement learning-based impact evaluator can modify the weights of different clients aggregated in the global model based on horizontal federated learning, thereby allowing the model to exhibit varying degrees of bias towards different clients and improving system fairness.

[0041] In some embodiments, the present invention allocates the aggregation weight for this round based on the verification accuracy of different clients, and intelligently arranges the contribution of different clients to the global model in each round of aggregation, thereby ensuring the performance of the global model while further ensuring the fairness of the clients.

Description of the Drawings

[0042] Figure 1 is a flowchart of the horizontal federated learning training method according to an embodiment of the present invention.

[0043] Figure 2 is a schematic diagram of the structure of the horizontal federated learning training method according to an embodiment of the present invention.

[0044] Figures 3a to 3c are performance (average accuracy) comparison charts of the horizontal federated learning training method of this invention combined with FedAvg and FedProx.

[0045] Figures 4a to 4c are fairness (as defined by the Gini coefficient) comparison charts of the horizontal federated learning training method of this invention combined with FedAvg and FedProx.

[0046] Figures 5a to 5c are comparison charts of the performance of the horizontal federated learning training method of this invention and other methods for fairness correction in federated learning.

Detailed Implementation

[0047] The present invention will be further described below with reference to the accompanying drawings and preferred embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other.

[0048] It should be noted that the directional terms such as left, right, up, down, top, and bottom in this embodiment are only relative concepts or are based on the normal use of the product, and should not be considered as restrictive.

[0049] To overcome the problems of slow convergence, overall performance degradation, and performance unfairness across clients caused by heterogeneous data on different clients, an embodiment of the present invention provides a horizontal federated learning training method that can improve system fairness in sensitive scenarios. This method can allocate the aggregation weights of this round according to the verification accuracy of different clients, and intelligently arrange the contributions of different clients to the global model during each round of aggregation, thereby ensuring the fairness of the clients while ensuring the performance of the global model.

[0050] The core problems to be solved in the embodiments of the present invention include: (1) designing a function that can reflect fairness at the client granularity; (2) constructing a deep reinforcement learning impact evaluator, and using the proposed fairness function to automatically learn a fairness strategy to allocate the impact of clients on the global model; (3) designing an algorithm to solve the problem of low training efficiency caused by low sample utilization rate in online reinforcement learning.

[0051] The methods adopted in the embodiments of the present invention to solve the problems mainly include: (1) proposing a measurement index for evaluating fairness at the client granularity; (2) constructing a high-sample-utilization impact evaluator based on deep reinforcement learning based on the fairness coefficient and the verification accuracy of the clients; (3) ensuring that the proposed impact evaluator is compatible with existing federated learning methods.

[0052] As shown in Figure 1, the horizontal federated learning training method for improving system fairness in sensitive scenarios according to the embodiments of the present invention includes the following steps:

[0053] S1. Establish a fairness index for federated aggregation.

[0054] Specifically, the embodiments of the present invention use the Gini coefficient as the index for measuring system fairness, and this index can better measure the fairness degree of the network.

[0055] The definition of the Gini coefficient is: Let the performances of model one and model two on N clients be ω1 and ω2 respectively. If G(ω1) < G(ω2), then model one is more balanced.

[0056] where:

[0057] $$G(\omega) = \frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\left|acc_i - acc_j\right|}{2K^{2}\mu}, \qquad \mu = \frac{1}{K}\sum_{k=1}^{K} acc_k$$ (1)

[0058] Here, $acc_i$ and $acc_j$ represent the accuracies on the test sets of any two clients under the model, $\mu$ represents the average accuracy of all clients, K represents the total number of clients, and k denotes the k-th client. The fairness of the model for all clients can be defined according to the Lorenz curve, and the index for judging fairness is a proportional value between 0 and 1.

[0059] Unlike existing definitions of fairness in federated systems, such as uniformity and STD, the Gini coefficient proposed in this invention describes the dispersion of a constant distribution and is scale-invariant. Therefore, the fairness of networks with different average performances can be compared using a unified metric.
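
The scale-invariance argument can be checked numerically. The following is a minimal Python sketch, assuming the standard mean-absolute-difference form of the Gini coefficient (the sum of pairwise absolute accuracy differences divided by 2K² times the mean accuracy). The function name `gini` and the use of NumPy are illustrative choices rather than part of the patent. It reuses the accuracies of networks A and B from the background discussion.

```python
import numpy as np

def gini(acc):
    """Gini coefficient of client accuracies (mean-absolute-difference form, assumed)."""
    acc = np.asarray(acc, dtype=float)
    k = acc.size
    # Sum of |acc_i - acc_j| over all ordered pairs of clients.
    pairwise = np.abs(acc[:, None] - acc[None, :]).sum()
    return pairwise / (2.0 * k * k * acc.mean())

network_a = [0.07, 0.08, 0.09]   # low-accuracy network from the background example
network_b = [0.7, 0.8, 0.9]      # same relative spread, ten times the accuracy

std_a, std_b = np.std(network_a), np.std(network_b)
gini_a, gini_b = gini(network_a), gini(network_b)

print(f"STD : A={std_a:.4f}  B={std_b:.4f}")    # STD grows with the accuracy scale
print(f"Gini: A={gini_a:.4f}  B={gini_b:.4f}")  # Gini is the same for both networks
```

Under this assumed normalization, both networks receive the same Gini value while their STDs differ by a factor of ten, which is the property that lets networks with different average performance be compared on one scale.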

[0060] S2. Construct an impact evaluator based on deep reinforcement learning.

[0061] Specifically, in federated learning training, since a certain proportion of clients are randomly selected to participate in updates each round, the optimal weight allocation is non-differentiable. There are various methods to address this non-differentiable optimization bottleneck. This embodiment of the invention employs reinforcement learning to seek the optimal allocation strategy by maximizing the reward function. Simultaneously, reinforcement learning encourages random exploration; therefore, the problem of allocating the proportion of different local models within the global model is modeled as a reinforcement learning problem to explore the optimal aggregation strategy.

[0062] Therefore, the federated learning process can be modeled as a Markov decision process, where the state is represented by the model parameters of each client device in each round. Given the current state, the reinforcement learning agent learns a policy distribution, calculates the aggregate weights for each client based on the policy, and updates the global model. The updated global model parameters are then transmitted to the local client. On the local validation set, the validation accuracy of the global model is observed. The reward function of the deep reinforcement learning agent is obtained from the average validation accuracy across all clients and the Gini coefficient. The goal is to train the deep reinforcement learning agent to converge to the target accuracy and fairness as quickly as possible. The specific settings are as follows:

[0063] State $s_t$: The state in round t is represented by the vector $s_t = \{\omega_1^t, \omega_2^t, \ldots, \omega_K^t\}$, whose entries are the model parameters of the K clients participating in the update. During training, the client and server jointly maintain the model parameter list $\{\omega_k \mid k \in K\}$. In each round of federated learning, the list is updated after the client uploads its trained local model to the server.

[0064] Actions: After each state list update, a deep reinforcement learning agent is trained using the parameters of the client models participating in the aggregation. The action space is represented by the vector $a_t = \{a_1^t, a_2^t, \ldots, a_K^t\}$, where $a_k^t$, a learnable mean, is the action for the k-th client; the weight with which the k-th client participates in the aggregated global model is sampled from a Gaussian distribution with mean $a_k^t$ and fixed variance.

[0065] Reward: The reward comes from a small validation set local to each client and is defined as $r_t = \mu_t \log(G_t)$, where $\mu$ and $G$ represent the average validation accuracy and the accuracy fairness coefficient on the client side, respectively (see the definition of the Gini coefficient). This setting encourages the federated model to achieve optimal performance and fairness.

[0066] In this embodiment of the invention, the impact evaluator based on deep reinforcement learning employs a fully connected network with four layers to learn the optimal aggregation strategy.
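
The state, action, and network settings above can be illustrated with a small sketch. This is not the patented implementation: the layer widths, the ReLU activations, the flattened random vector standing in for client parameters, the standard deviation `sigma`, and the softmax used to turn sampled actions into positive weights that sum to one are all assumptions made for illustration. The names `init_mlp`, `policy_mean`, and `sample_weights` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random parameters for a fully connected network with len(sizes) - 1 layers."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def policy_mean(params, state):
    """Forward pass: ReLU on hidden layers, linear output giving one mean per client."""
    h = state
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

def sample_weights(mean, sigma=0.1):
    """Sample one action per client from N(mean, sigma^2), then normalize (assumption)."""
    action = rng.normal(mean, sigma)
    z = np.exp(action - action.max())   # softmax keeps the weights positive
    return action, z / z.sum()

K, feat_dim = 5, 16                                  # hypothetical sizes
state = rng.normal(size=K * feat_dim)                # stand-in for client model parameters
params = init_mlp([K * feat_dim, 64, 64, 32, K])     # four fully connected layers
mean = policy_mean(params, state)
action, weights = sample_weights(mean)
print(weights)
```

Keeping the variance fixed means the network only has to learn the per-client means, while the sampling step still provides the random exploration that the reinforcement learning formulation relies on.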

[0067] S3. Based on horizontal federated learning, the client iteratively aggregates local models from different clients and trains a shared global model through aggregation.

[0068] Specifically, in each iteration, the server randomly selects a certain number of clients to transmit global model parameters. The clients participating in the training use the downloaded global model for training, and then upload the locally trained model parameters and aggregate them into a new global model on the server.

[0069] This invention takes a classification task as an example, defining the problem as a C-class classification problem. In this case, the most common federated learning problem can be written as:

[0070] $$\min_{\omega} F(\omega) = \sum_{i=1}^{N} p_i F_i(\omega)$$ (2)

[0071] where $F_i(\omega)$ is the local loss function of the i-th client.

[0072] The loss functions of different clients are aggregated. Assuming that the data is partitioned across N clients, a deep reinforcement learning-based impact evaluator is used to learn the aggregation weights $p_i$. The state-action transitions and rewards during the aggregation training process are abstracted into a Markov decision process. In the t-th iteration, the impact evaluator obtains the environmental observation $s_t$ and performs action $a_t$, where $a_t$ is the vector of all actions in round t; the environment transitions to state $s_{t+1}$ under the agent's influence, and the agent receives a reward $r_t$ for this process. The goal of a reinforcement learning agent is to find an optimal policy that maximizes the expected long-term reward.

[0073] $$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi(\tau)}\left[r(\tau)\right]$$ (3)

[0074] Where π represents the policy, τ represents a trajectory obtained by interacting with the environment using the policy, E denotes the expectation, and r(τ) represents the overall reward of this trajectory. Next, by expanding the formula, we can obtain the objective function and the gradient based on the Monte Carlo approximation, respectively:

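[0075] $$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[r(\tau)\right] = \int \pi_{\theta}(\tau)\, r(\tau)\, d\tau, \qquad \nabla_{\theta} J(\theta) \approx \frac{1}{M}\sum_{m=1}^{M} \nabla_{\theta} \log \pi_{\theta}(\tau_m)\, r(\tau_m)$$ (4)

[0076] Where J is the objective function, θ denotes the neural network parameters of the policy, d is the differential symbol in the integral, and M is the number of sampled trajectories $\tau_m$;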
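
A Monte Carlo policy-gradient estimate of this kind can be sketched on a toy problem. The example below assumes the standard score-function (REINFORCE) estimator, which averages the gradient of the log-probability times the reward over sampled actions, applied to a one-step Gaussian policy with fixed variance. The reward function `toy_reward` and the name `reinforce_gradient` are hypothetical and stand in for the federated reward, which would require a full training round per sample.

```python
import numpy as np

def reinforce_gradient(theta, sigma, reward_fn, num_samples, rng):
    """Monte Carlo estimate of dJ/dtheta for a Gaussian policy N(theta, sigma^2)."""
    actions = rng.normal(theta, sigma, size=num_samples)
    # Gradient of log N(a; theta, sigma^2) with respect to the mean theta.
    score = (actions - theta) / sigma**2
    return np.mean(score * reward_fn(actions))

def toy_reward(a):
    """Hypothetical reward that peaks at a = 2."""
    return -(a - 2.0) ** 2

rng = np.random.default_rng(0)
grad = reinforce_gradient(theta=0.0, sigma=0.5, reward_fn=toy_reward,
                          num_samples=20000, rng=rng)
print(grad)   # positive: increasing the mean moves actions toward the reward peak
```

For this toy reward the exact gradient at theta = 0 is 2(2 - theta) = 4, so the estimate can be compared against a known value. The estimator needs only sampled rewards, which is why it suits the non-differentiable weight-allocation problem described earlier.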

[0077] Because federated learning requires significant time and computational resources for training, the PPO algorithm from reinforcement learning is used to improve sample efficiency. Applying the idea of importance sampling, we obtain:

[0078] $$J(\theta_{new}) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}(\tau)}\left[\frac{\pi_{\theta_{new}}(\tau)}{\pi_{\theta_{old}}(\tau)}\, r(\tau)\right]$$ (5)

[0079] Where π represents the policy, τ represents a trajectory obtained by interacting with the environment using the policy, r(τ) represents the total reward of this trajectory, $\pi_{\theta_{new}}$ denotes the new policy, and $\pi_{\theta_{old}}$ denotes the old policy.

[0080] That is, the new policy is learned from interactions collected under the old policy.
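
The reuse of old interactions can be sketched as follows. The example computes the importance-sampling ratio between a new and an old Gaussian policy and a clipped surrogate objective. The clipping threshold `clip_eps`, the use of raw rewards in place of advantages, and the toy reward are assumptions: the text names PPO and importance sampling but does not specify these details, and clipping is simply the most common PPO variant.

```python
import numpy as np

def gaussian_log_prob(a, mean, sigma):
    """Log-density of N(mean, sigma^2) evaluated at a."""
    return -0.5 * ((a - mean) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def ppo_surrogate(actions, rewards, mean_new, mean_old, sigma=0.1, clip_eps=0.2):
    """Clipped surrogate objective evaluated on samples drawn from the old policy."""
    ratio = np.exp(gaussian_log_prob(actions, mean_new, sigma)
                   - gaussian_log_prob(actions, mean_old, sigma))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    objective = np.mean(np.minimum(ratio * rewards, clipped * rewards))
    return objective, ratio

rng = np.random.default_rng(0)
mean_old = 0.3
actions = rng.normal(mean_old, 0.1, size=256)   # collected once under the old policy
rewards = -np.abs(actions - 0.5)                # hypothetical per-sample rewards

# The same batch is scored under two candidate policies without new interaction.
objective_same, ratio_same = ppo_surrogate(actions, rewards, mean_old, mean_old)
objective_new, ratio_new = ppo_surrogate(actions, rewards, 0.32, mean_old)
print(objective_same, objective_new)
```

Because each federated round is expensive, scoring several candidate policies on one collected batch is what makes the sample-efficiency argument concrete; the clipping keeps the new policy from drifting far from the one that generated the data.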

[0081] In some embodiments, a schematic diagram of the structure of a horizontal federated learning training method that can improve system fairness in sensitive scenarios is shown in Figure 2. In the figure, $\omega_{global}$ is the global parameter: it is calculated from the local parameters $\omega_i$ uploaded by each client and the weights $p_i$ output by the impact evaluator based on deep reinforcement learning, according to the formula shown in the server block of the figure. The process corresponding to Figure 2 is as follows:

[0082] (1) All N available devices with different data sizes and distributions sign in to the server as clients. The server selects K = N * C clients to participate in the update according to a certain ratio C, and initializes the model parameters $\omega_{init}$, which are then transmitted to the selected clients. Each selected client uses the global model parameters to obtain the validation accuracy on its validation set, trains the model on local data, and returns the local model parameters $\{\omega_k^1, k \in K\}$ and validation accuracy $\{acc_k^1, k \in K\}$.

[0083] (2) In the t-th iteration, the server calculates the average accuracy $\mu^t$ and Gini coefficient $Gini^t$ based on the returned $acc_k^t$, calculates the weight $p_k^t$ of client k participating in the global update, and then updates the global model parameters based on $\{\omega_k^t, k \in K\}$ and $\{p_k^t, k \in K\}$.

[0084] (3) The server randomly selects a certain proportion of clients to participate in the update. The selected clients train locally starting from the global model $\omega_{global}$ of the previous round, then upload the updated model parameters and validation accuracy.
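
The server-side part of steps (1) to (3) can be sketched as below. This is a simplified illustration under stated assumptions: the global model is taken to be the weighted sum of client parameters with weights normalized to sum to one, the Gini coefficient uses the standard mean-absolute-difference form, and the weights are passed in as plain numbers rather than produced by a trained impact evaluator. The names `server_round` and `gini` and all numeric values are hypothetical.

```python
import numpy as np

def gini(acc):
    """Gini coefficient of client accuracies (mean-absolute-difference form, assumed)."""
    acc = np.asarray(acc, dtype=float)
    pairwise = np.abs(acc[:, None] - acc[None, :]).sum()
    return pairwise / (2.0 * acc.size**2 * acc.mean())

def server_round(client_params, client_acc, weights):
    """One aggregation round: fairness statistics plus weighted parameter averaging."""
    acc = np.asarray(client_acc, dtype=float)
    mu, g = acc.mean(), gini(acc)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                                   # normalization is an assumption
    global_params = sum(p_k * w_k for p_k, w_k in zip(p, client_params))
    return global_params, mu, g

rng = np.random.default_rng(0)
N, C = 10, 0.3                          # N registered devices, participation ratio C
K = max(1, round(N * C))                # number of clients selected this round
selected = rng.choice(N, size=K, replace=False)

# Stand-ins for what the selected clients would upload after local training.
client_params = [np.full(4, float(k)) for k in selected]
client_acc = [0.6, 0.7, 0.8]
weights = [0.5, 0.3, 0.2]               # would come from the impact evaluator

global_params, mu, g = server_round(client_params, client_acc, weights)
print(selected, global_params, mu, g)
```

In a full system, `mu` and `g` would feed the reward used to update the impact evaluator, and `global_params` would be sent to the clients selected for the next round.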

[0085] Figures 3a to 3c and Figures 4a to 4c demonstrate the effectiveness of the proposed horizontal federated learning training method, which improves system fairness in sensitive scenarios, when combined with FedAvg (Federated Averaging) and FedProx, respectively. The simulation involved 100 clients trained on the CIFAR-10, CIFAR-100, and Fashion-MNIST datasets, with highly heterogeneous local data distributions among the clients. For model selection, CNN (Convolutional Neural Network) models were used for CIFAR-10 and CIFAR-100, while a four-layer MLP (Multilayer Perceptron) was used for Fashion-MNIST. It can be observed that the method in this embodiment maintains a convergence speed similar to the baselines, with slightly improved or more stable average accuracy and significantly enhanced fairness.

[0086] Next, the method of this embodiment of the invention is compared on the same dataset with other federated learning methods aimed at improving fairness. To align with the baseline algorithms and use the same experimental setup as the prior art, this embodiment extracts the subset of data labeled with three categories (T-shirts/tops, pullovers, and shirts) and divides this subset into three domains, each containing one type of clothing. A classifier for these three classes is then trained using logistic regression and the Adam optimizer. Since the client-side labels are uniquely identified here, this embodiment is not compared with models trained on specific domains in this experiment. The validation results are shown in Figures 5a to 5c. For each of clients 1, 2, and 3, the left, middle, and right results are those of the q-FFL algorithm (Gini = 0.084), the AFL algorithm (Gini = 0.046), and the method of this embodiment (Gini = 0.027), respectively. A smaller Gini value indicates greater fairness. As can be observed from Figures 5a to 5c, the method of this embodiment performs better in terms of both final average accuracy and fairness among clients.

[0087] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, several equivalent substitutions or obvious modifications can be made without departing from the concept of the present invention, and all such modifications, achieving the same performance or purpose, should be considered within the scope of protection of the present invention.

Claims

1. A horizontal federated learning training method capable of improving system fairness in a sensitive scenario, characterized in that it includes the following steps: S1. Establish a fairness metric for federated aggregation, including the Gini coefficient; S2. Construct an impact evaluator based on deep reinforcement learning; wherein the impact evaluator is trained through a reward function, which couples the average validation accuracy on the clients and the Gini coefficient to guide the deep reinforcement learning agent to converge to the target accuracy and fairness; S3. Based on horizontal federated learning, the client iteratively aggregates local models from different clients and trains a shared global model through aggregation. In each iteration, the server randomly selects a certain number of clients to transmit global model parameters. The clients participating in the training use the downloaded global model for training, and then upload the locally trained model parameters, which are aggregated into a new global model on the server. Then, the updated global model parameters are sent to the selected clients. The selected clients use the global model parameters to obtain the validation accuracy on the validation set. The resulting validation accuracy data is used to calculate the average accuracy and Gini coefficient for the new round, and then to update the parameters of the impact evaluator.

2. The horizontal federated learning training method as described in claim 1, characterized in that the definition of the Gini coefficient is: let the performances of model one and model two on N clients be $\omega_1$ and $\omega_2$, respectively; if $G(\omega_1) < G(\omega_2)$, then model one is more balanced, wherein: $$G(\omega) = \frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\left|acc_i - acc_j\right|}{2K^{2}\mu}, \qquad \mu = \frac{1}{K}\sum_{k=1}^{K} acc_k$$ (1) where i represents the client number of Model 1, and j represents the client number of Model 2; G represents the fairness coefficient of accuracy on the client side; $acc_i$ and $acc_j$ represent the accuracies on the test set of any client under the model; $\mu$ represents the average accuracy across all clients; K indicates that there are K clients in total; k denotes the k-th client; the fairness of the model for all clients is defined according to the Lorenz curve, and the index for judging fairness is a ratio between 0 and 1.

3. The horizontal federated learning training method as described in claim 1, characterized in that step S2 specifically includes: S2.1 The state is represented by the model parameters of each client device in each round; S2.2 Given the current state, the reinforcement learning agent learns a policy distribution, calculates the aggregation weights corresponding to each client based on the policy, and then updates the global model; S2.3 Transmit the updated global model parameters to the local machine, observe the local validation accuracy of the global model on the local validation set, and obtain the reward function of the deep reinforcement learning agent from the average validation accuracy on each client and the Gini coefficient.

4. The horizontal federated learning training method as described in claim 3, characterized in that step S2.1 specifically includes: the state of the t-th round is represented by the vector $s_t = \{\omega_1^t, \omega_2^t, \ldots, \omega_K^t\}$, whose entries are the model parameters of the K clients participating in the update; the client and server jointly maintain the model parameter list $\{\omega_k \mid k \in K\}$ during training. The set contains a total of K clients, and k denotes the k-th client in the set; in each round of federated learning, the list is updated after the client uploads its trained local model to the server.

5. The horizontal federated learning training method as described in claim 4, characterized in that step S2.2 specifically includes: after each state list update, training a deep reinforcement learning agent using the parameters of the client models participating in the aggregation, with the action space represented by the vector $a_t = \{a_1^t, a_2^t, \ldots, a_K^t\}$, where $a_k^t$ represents a learnable mean, and thus the weight with which the k-th client participates in the aggregated global model is sampled from a Gaussian distribution with mean $a_k^t$ and fixed variance.

6. The horizontal federated learning training method as described in claim 5, characterized in that step S2.3 specifically includes: the reward comes from a small validation set local to the client and is defined as $r_t = \mu_t \log(G_t)$, where $\mu$ and $G$ represent the average validation accuracy and the accuracy fairness coefficient on the client side, respectively.

7. The horizontal federated learning training method as described in claim 1, characterized in that, in step S2, the deep reinforcement learning-based impact evaluator uses a fully connected network with four layers to learn the optimal aggregation strategy.

8. The horizontal federated learning training method as described in claim 7, characterized in that training a shared global model through aggregation includes the following steps: the federated learning problem can be written as: $$\min_{\omega} F(\omega) = \sum_{i=1}^{N} p_i F_i(\omega)$$ (2) where $F_i(\omega)$ is the local loss function of the i-th client; the loss functions of different clients are aggregated, and the aggregation weights $p_i$ are learned using a deep reinforcement learning-based impact evaluator. The state-action transitions and rewards during the aggregation training process are abstracted into a Markov decision process: in the t-th iteration, the impact evaluator obtains the environmental observation $s_t$ and performs action $a_t$; the environment is affected by the agent and transitions to state $s_{t+1}$, and the agent receives a reward $r_t$ for this process. The goal of a reinforcement learning agent is to find an optimal policy that maximizes the expected long-term reward: $$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi(\tau)}\left[r(\tau)\right]$$ (3) where π represents the policy, τ represents a trajectory obtained by interacting with the environment using the policy, E denotes the expectation, and r(τ) represents the overall reward of this trajectory; next, by expanding the formula, the objective function and the gradient based on the Monte Carlo approximation are obtained as follows: $$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[r(\tau)\right] = \int \pi_{\theta}(\tau)\, r(\tau)\, d\tau, \qquad \nabla_{\theta} J(\theta) \approx \frac{1}{M}\sum_{m=1}^{M} \nabla_{\theta} \log \pi_{\theta}(\tau_m)\, r(\tau_m)$$ (4) where J is the objective function, θ denotes the neural network parameters of the policy, d is the differential symbol in the integral, and M is the number of sampled trajectories $\tau_m$; because federated learning training is very time-consuming and computationally resource-intensive, the PPO algorithm from reinforcement learning is used to improve sample efficiency; the importance sampling concept is applied, as follows: $$J(\theta_{new}) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}(\tau)}\left[\frac{\pi_{\theta_{new}}(\tau)}{\pi_{\theta_{old}}(\tau)}\, r(\tau)\right]$$ (5) where $\pi_{\theta_{new}}$ denotes the new policy and $\pi_{\theta_{old}}$ denotes the old policy; that is, interactions collected under the old policy are reused to learn the new policy.