Secure system framework to perform training in federated learning
By integrating SVRG, SAGA, and Top-k sparsification with gradient perturbation, the framework addresses privacy and performance issues in federated learning, ensuring secure and efficient model training across heterogeneous clients.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ST ENGINEERING IHQ PTE LTD
- Filing Date
- 2024-12-24
- Publication Date
- 2026-07-02
AI Technical Summary
Federated learning systems face challenges in maintaining privacy and performance due to privacy attacks that infer sensitive information from gradient submissions, and existing security measures like SecAgg incur significant communication overhead, making them impractical for bandwidth-limited applications.
Integrate Stochastic Variance Reduced Gradient (SVRG) and Stochastic Average Gradient (SAGA) variance reduction techniques with gradient perturbation and Top-k sparsification to enhance privacy and model training convergence, reducing communication overhead by transmitting only significant model updates.
The proposed framework improves model training convergence and robust privacy preservation, mitigating data leakage risks and enhancing the usability of federated learning in privacy-sensitive applications.
Abstract
Description
SECURE SYSTEM FRAMEWORK TO PERFORM TRAINING IN FEDERATED LEARNING
TECHNICAL FIELD
[0001] The present disclosure relates generally to computer learning, and more specifically, to systems and methods that use federated learning with high levels of privacy and improved performance.
BACKGROUND
[0002] Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance. ML is now used in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture and medicine.
[0003] Deep learning is a subset of machine learning methods based on neural networks with representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers (ranging from three to several hundred or thousands) in the network. Methods used can be either supervised (i.e., using input objects and a desired output value to train a model), unsupervised (i.e., wherein algorithms learn patterns exclusively from unlabeled data) or semi-supervised (i.e., using a combination of a small amount of human-labeled data followed by a large amount of unlabeled data).
[0004] As computing ability increases and data volume proliferates, deep learning has developed rapidly in the field of artificial intelligence (AI). To obtain high-performance deep learning models, organizations hope to collect more individual data for model training. Large datasets have allowed for significant breakthroughs in machine learning. For example, health care professionals can design better treatment plans with knowledge about patient outcomes due to various interventions worldwide. Pharmaceutical companies with proprietary drug development data can collaborate to build knowledge about how the body is likely to metabolize different compounds.
[0005] However, because data acquisition involves individual privacy and inter-organizational interests, data silos and privacy protection become major bottlenecks for traditional centralized training in reality. Efforts have been made to use AI without sharing highly sensitive personal data.
[0006] Federated learning is a deep learning method that can protect sensitive information including personal information. With federated learning, it is possible to collaboratively train a model with data from multiple users without any raw data leaving their devices. In this process, only information on the global model and the local model updates are exchanged between the server and the client, ensuring that the data collected by the client is not exposed to external entities, which has an advantage in terms of information protection. A method of generating a global model encompassing all local model updates of clients (also referred to as participants, parties, edge devices, nodes, users, etc.) participating in a federated learning network may be implemented. More specifically, each client may collect data, train the local model, and upload information on the trained local model updates to a server. The server may collect information on local models, update a global model, and transmit information on the updated global model to a client. The client may update the local model with the information on the global model received from the server.
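The exchange described above (clients train locally, the server combines their results and redistributes the global model) can be illustrated with a minimal sketch. It assumes a scalar least-squares model and a dataset-size-weighted average on the server; the function names `local_train` and `federated_round`, the learning rate, and the toy data are hypothetical and are not taken from this specification.

```python
def local_train(w, data, lr=0.1, epochs=5):
    """Run a few gradient-descent epochs of least squares y ~ w * x on one client."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, client_datasets):
    """One round: every client starts from the global weight, the server averages."""
    local_ws = [local_train(global_w, data) for data in client_datasets]
    sizes = [len(data) for data in client_datasets]
    # Assumption: the server weights each client by its number of samples.
    return sum(w * n for w, n in zip(local_ws, sizes)) / sum(sizes)


# Two clients whose raw (x, y) pairs never leave the client; both follow y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, clients)
print(w)
```

Only the scalar weight crosses the client boundary in this sketch; the framework described later additionally sparsifies and perturbs what is sent.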
[0007] In practice, federated learning has shortcomings. Privacy attacks can infer or reconstruct sensitive information from the submitted gradients, which leaks users’ private information in federated learning. Techniques like Soteria, GradDefense and OutPost provide some level of defense against gradient leakage attacks but do not combine their defenses with variance reduction methods for enhanced performance. Moreover, recent efforts to secure data have generally been unsuccessful. For example, the secure aggregation (SecAgg) protocol can protect users’ privacy while completing federated learning tasks, but it incurs significant communication overhead and wall-clock training time on large-scale model training tasks. Thus, it is difficult to apply SecAgg in bandwidth-limited federated applications.
[0008] There is thus a need for an improved framework for using federated learning. The framework should allow for analysis of large amounts of data with high levels of privacy and improved performance. Embodiments include a federated learning framework that integrates gradient perturbation with variance reduction techniques for enhanced privacy and model training convergence performance.
SUMMARY OF INVENTION
[0009] The following summary is provided to facilitate an understanding of some of the innovative features unique to the disclosed embodiment and is not intended to be a full description. A full appreciation of the various aspects of the embodiments disclosed herein can be gained by taking into consideration the entire specification, claims, drawings, and abstract as a whole.
[0010] Embodiments include methods of federated learning that combine Stochastic Variance Reduced Gradient (SVRG) and Stochastic Average Gradient “ameliore” (SAGA) variance reduction techniques with gradient perturbation. The methods can also use Top-k sparsification for efficient communication and robust defense against gradient leakage attacks.
[0011] Embodiments include a method of federated learning that includes steps of (a) initialization on a central server, (b) local training on a plurality of clients, each client comprising a local dataset, and (c) global aggregation on the central server. The step of initialization can include directing each of the plurality of clients to download a global model from the central server for local training. The step of local training can include gathering model updates and variance reduction. In aspects, Stochastic Variance Reduced Gradient (SVRG) and Stochastic Average Gradient (SAGA) are integrated into this step along with gradient sparsification. In aspects, gradient sparsification includes Top-k sparsification computed by each client and selection of Top-k gradients at a magnitude of 70%, 80%, 90% or higher. In aspects, the step of global aggregation includes collection of noisy, sparsified updates from the clients, aggregation to form a new global model and redistribution for one or more additional rounds of training.
[0012] Embodiments also include a method of federated learning that includes (a) setting up a federated learning environment with a plurality of clients, (b) applying one or more variance reduction techniques to enhance model convergence and (c) using Top-k sparsification to reduce communication overhead by transmitting only significant model updates.
[0013] Embodiments also include a method of improving the security of data of a federated learning system. The method can include (a) setting up a federated learning environment with a plurality of clients, (b) applying one or more variance reduction techniques to enhance model convergence and (c) using Top-k sparsification to reduce communication overhead by transmitting only significant model updates.
[0014] Embodiments also include a system for a federated learning machine. The system can include (a) a central server and (b) a plurality of clients, each with a local dataset. The central server can be configured to direct each of the plurality of clients to download a global model from the central server for local training. Local clients can be configured to apply one or more variance reduction techniques to enhance model convergence. Furthermore, local clients can use Top-k sparsification to reduce communication overhead by transmitting only significant model updates. In aspects, the central server is configured to apply one or more variance reduction techniques selected from Stochastic Variance Reduced Gradient (SVRG), Stochastic Average Gradient (SAGA) and Stochastic Dual Coordinate Ascent (SDCA). In aspects, the central server is configured to use Top-k sparsification and select Top-k gradients at a magnitude of 70%, 80%, 90% or higher. The central server can then aggregate the sparsified updates from the clients to form a new global model for redistribution.
[0015] In aspects, the methods can be applied to centralized federated learning, decentralized federated learning or heterogeneous federated learning.
[0016] The invention is an improvement to existing federated learning technologies by integrating gradient perturbation with variance reduction techniques for enhanced privacy and model training convergence performance. In aspects, the framework provides improved model training convergence and robust privacy preservation, reducing the risk of data leakage and enhancing the practical usability of federated learning systems in privacy-sensitive applications.
Definitions
[0017] Reference in this specification to "one embodiment / aspect" or "an embodiment / aspect" means that a particular feature, structure, or characteristic described in connection with the embodiment / aspect is included in at least one embodiment / aspect of the disclosure. The use of the phrase "in one embodiment / aspect" or "in another embodiment / aspect" in various places in the specification are not necessarily all referring to the same embodiment / aspect, nor are separate or alternative embodiments / aspects mutually exclusive of other embodiments / aspects. Moreover, various features are described which may be exhibited by some embodiments / aspects and not by others. Similarly, various requirements are described which may be requirements for some embodiments / aspects but not other embodiments / aspects. Embodiment and aspect can in certain instances be used interchangeably.
[0018] The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. It will be appreciated that the same thing can be said in more than one way.
[0019] The term “internet privacy” involves the right or mandate of personal privacy concerning the storage, re-purposing, provision to third parties, and display of information pertaining to oneself via the Internet. Internet privacy is a subset of data privacy. Privacy concerns have been articulated from the beginnings of large-scale computer sharing and especially relate to mass surveillance. Privacy can entail either personally identifiable information (PII) or non-PII information such as a site visitor's behavior on a website. PII refers to any information that can be used to identify an individual. For example, age and physical address alone could identify who an individual is without explicitly disclosing their name, as these two parameters are unique enough to identify a specific person typically. Other forms of PII may include GPS tracking data used by apps, as the daily commute and routine information can be enough to identify an individual.
[0020] The term “intrusion detection system” or “IDS” refers to a network security tool that monitors network traffic and devices for known malicious activity, suspicious activity or security policy violations. An IDS can help accelerate and automate network threat detection by alerting security administrators to known or potential threats, or by sending alerts to a centralized security tool. A centralized security tool such as a security information and event management (SIEM) system can combine data from other sources to help security teams identify and respond to cyberthreats that might slip by other security measures. An IDS cannot stop security threats on its own. Today IDS capabilities are typically integrated with, or incorporated into, intrusion prevention systems (IPSs), which can detect security threats and automatically act to prevent them.
[0021] The term “ransomware” refers to malicious software that prevents a user from accessing their files until a ransom is paid. There are two variations of ransomware: crypto ransomware and locker ransomware. Locker ransomware just locks down a computer system without encrypting its contents, whereas crypto ransomware locks down a system and encrypts its contents. For example, programs such as CryptoLocker encrypt files securely, and only decrypt them on payment of a substantial sum of money.
[0022] The term “spyware” refers to programs designed to monitor users' web browsing, display unsolicited advertisements, or redirect affiliate marketing revenues. Spyware programs do not spread like viruses; instead they are generally installed by exploiting security holes. They can also be hidden and packaged together with unrelated user-installed software. The Sony BMG rootkit was intended to prevent illicit copying, but it also reported on users' listening habits and unintentionally created extra security vulnerabilities.
[0023] The term “advanced persistent threat” or “APT” refers to a stealthy threat actor, typically a state or state-sponsored group, which gains unauthorized access to a computer network and remains undetected for an extended period. In recent times, the term may also refer to non-state-sponsored groups conducting large-scale targeted intrusions for specific goals. Such threat actors' motivations are typically political or economic. Every major business sector has recorded instances of cyberattacks by advanced actors with specific goals, whether to steal, spy, or disrupt. These targeted sectors include government, defense, financial services, legal services, industrial, telecoms, consumer goods and many more. Some groups utilize traditional espionage vectors, including social engineering, human intelligence and infiltration to gain access to a physical location to enable network attacks. The purpose of these attacks is to install custom malware (i.e., malicious software).
[0024] The term “adversarial machine learning” refers to the study of the attacks on machine learning algorithms, and of the defenses against such attacks. Most machine learning techniques are designed to work on specific problem sets, under the assumption that the training and test data are generated from the same statistical distribution (IID). However, this assumption is often dangerously violated in practical high-stake applications, where users may intentionally supply fabricated data that violates the statistical assumption. The most common attacks in adversarial machine learning include evasion attacks, data poisoning attacks, Byzantine attacks and model extraction.
[0025] The term “classifier” refers to the mathematical function, implemented by a classification algorithm that maps input data to a category. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.
[0026] The term “software as a service” or “SaaS” refers to a software licensing and delivery model in which software is licensed on a subscription basis and is centrally hosted. SaaS is also known as on-demand software, web-based software, or web-hosted software. SaaS is considered to be part of cloud computing, along with several other "as a service" business models. SaaS apps are typically accessed by users of a web browser (a thin client). SaaS became a common delivery model for many business applications, including office software, messaging software, development software, gamification, virtualization, etc. In aspects, the system and methods described herein are provided as SaaS.
[0027] The term “artificial intelligence” or “AI” refers to intelligence exhibited by machines, rather than humans. The term, as applied herein, refers to when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving."
[0028] The term “computer learning” or “machine learning” refers to an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
[0029] The term “machine learning software” generally refers to a type of software application that uses artificial intelligence to make predictions or decisions based on data.
[0030] The term “deep learning” generally refers to a type of machine learning based on artificial neural networks in which multiple layers of processing are used to extract progressively higher-level features from data.
[0031] The term “module” such as that used in “deep learning module” generally refers to an extension to a software application’s main program that is dedicated to a specific function. For example, a “deep learning module” refers to an extension dedicated to deep learning which can perform all deep learning functions automatically without additional programming.
[0032] The term “neural network” refers to a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain.
[0033] “Artificial neural networks” or “ANNs” are distributed computing systems that include a number of neurons interconnected through connection points called synapses. Each synapse encodes the strength of the connection between the output of one neuron and the input of another. The output of each neuron is determined by the aggregate input received from other neurons that are connected to it. Thus, the output of a given neuron is based on the outputs of connected neurons from preceding layers and the strength of the connections as determined by the synaptic weights. An ANN is trained to solve a specific problem (e.g., pattern recognition) by adjusting the weights of the synapses such that a particular class of inputs produce a desired output.
[0034] Various algorithms may be used for this learning process. Certain algorithms may be suitable for specific tasks such as image recognition, speech recognition, or language processing. Training algorithms lead to a pattern of synaptic weights that, during the learning process, converges toward an optimal solution of the given problem.
[0035] Artificial neural networks include, for example, a feedforward neural network, a radial basis function network, a self-organizing map, learning vector quantization, a recurrent neural network, a Hopfield network, a Boltzmann machine, an echo state network, long short-term memory, a bi-directional recurrent neural network, a hierarchical recurrent neural network, a stochastic neural network, a modular neural network, an associative neural network, a deep neural network, a deep belief network, a convolutional neural network, a convolutional deep belief network, a large memory storage and retrieval neural network, a deep Boltzmann machine, a deep stacking network, a tensor deep stacking network, a spike and slab restricted Boltzmann machine, a compound hierarchical-deep model, a deep coding network, a multilayer kernel machine, or a deep Q-network.
[0036] A “convolutional neural network” or “CNN” refers to a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers and normalization layers. Convolutional layers apply a convolution operation to the input, passing the result to the next layer. Local or global pooling layers combine the outputs of neuron clusters at one layer into a single neuron in the next layer. Fully connected layers connect every neuron in one layer to every neuron in another layer. CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.
[0037] The term “bottleneck” in a neural network refers to a layer with fewer neurons than the layer below or above it. Having such a layer encourages the network to compress feature representations (of salient features for the target variable) to best fit in the available space. Improvements to compression occur due to the goal of reducing the cost function, as for all weight updates. In a CNN, bottleneck layers are added to reduce the number of feature maps (aka channels) in the network, which, otherwise, tend to increase in each layer. This is achieved by using 1x1 convolutions with fewer output channels than input channels.
[0038] The term “gradient perturbation” refers to a method that injects noise at every iterative update to ensure differential privacy. It is widely used for differentially private optimization.
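As an illustration of noise injection at each update, the following minimal sketch clips a gradient to a fixed L2 norm and then adds Gaussian noise. Clipping before adding noise is a common practice in differentially private optimization but is an assumption here, as are the function name `perturb_gradient` and the parameters `clip_norm` and `noise_multiplier`; none of them are specified in this disclosure, and the sketch makes no claim about the privacy budget achieved.

```python
import math
import random


def perturb_gradient(grad, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip `grad` to L2 norm `clip_norm`, then add zero-mean Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm (an assumed
    convention, not one stated in the specification).
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [g * scale + rng.gauss(0.0, sigma) for g in grad]


# With the noise switched off, only the clipping is visible.
clipped = perturb_gradient([3.0, 4.0], clip_norm=1.0, noise_multiplier=0.0)

# With noise on, each call at each iteration yields a different perturbed gradient.
noisy = perturb_gradient([3.0, 4.0], rng=random.Random(0))
```

In an iterative optimizer, the perturbed gradient would replace the raw gradient in every update step.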
[0039] The term “variance reduction technique” refers to a method used in optimization to reduce the variability of gradient estimates during iterative updates, leading to faster and more stable convergence. These techniques, such as Stochastic Variance Reduced Gradient (SVRG) and Stochastic Average Gradient (SAGA), are particularly useful in stochastic optimization and machine learning. By reducing variance, these methods enable more efficient training, especially when working with large datasets, resulting in improved model performance and reduced computational costs.
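The two named techniques can be sketched on a toy problem. The example below is an illustrative sketch under stated assumptions, not the implementation of this disclosure: it uses a scalar least-squares objective, a step size of 1/(3L) with L the largest per-sample curvature, and hypothetical helper names (`grad_i`, `full_grad`, `svrg`, `saga`). SVRG corrects each stochastic gradient with a full gradient computed at a periodic snapshot; SAGA corrects it with a table of the most recently seen per-sample gradients.

```python
import random


def grad_i(w, x, y):
    """Gradient of the per-sample loss 0.5 * (x * w - y) ** 2 with respect to w."""
    return (x * w - y) * x


def full_grad(w, data):
    """Average gradient over the whole (local) dataset."""
    return sum(grad_i(w, x, y) for x, y in data) / len(data)


def svrg(data, lr, epochs=30, inner_steps=8, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        snapshot = w
        mu = full_grad(snapshot, data)  # reference gradient at the snapshot
        for _ in range(inner_steps):
            x, y = rng.choice(data)
            # Stochastic gradient corrected by the snapshot gradient.
            v = grad_i(w, x, y) - grad_i(snapshot, x, y) + mu
            w -= lr * v
    return w


def saga(data, lr, steps=2000, seed=0):
    rng = random.Random(seed)
    w = 0.0
    table = [grad_i(w, x, y) for x, y in data]  # last gradient seen per sample
    avg = sum(table) / len(table)
    for _ in range(steps):
        i = rng.randrange(len(data))
        x, y = data[i]
        g = grad_i(w, x, y)
        v = g - table[i] + avg  # corrected by the stored gradients
        w -= lr * v
        avg += (g - table[i]) / len(data)  # incremental update of the average
        table[i] = g
    return w


data = [(0.9, 1.9), (1.0, 2.0), (1.1, 2.1), (1.0, 2.2)]
lr = 1.0 / (3.0 * max(x * x for x, _ in data))
w_star = sum(x * y for x, y in data) / sum(x * x for x, _ in data)  # closed form
w_svrg = svrg(data, lr)
w_saga = saga(data, lr)
print(w_star, w_svrg, w_saga)
```

Because the correction terms cancel the sampling noise as the iterates approach the optimum, both methods can use a constant step size here, whereas plain SGD on the same noisy data would keep fluctuating around the solution.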
[0040] The term “stochastic variance reduction” refers to an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting.
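For reference, the finite-sum objective and the gradient estimators of the two techniques named in this specification can be written as follows. The notation is chosen here for illustration and does not appear in the source: w denotes the model parameters, f_i the loss on the i-th of n local samples, \tilde{w} the SVRG snapshot, \phi_j the point at which SAGA last evaluated sample j, and i_t an index drawn uniformly at random at step t.

```latex
% Finite-sum objective
\min_{w} \; F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)

% SVRG estimator (snapshot \tilde{w} refreshed periodically)
v_t^{\mathrm{SVRG}} = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \nabla F(\tilde{w})

% SAGA estimator (one stored gradient per sample)
v_t^{\mathrm{SAGA}} = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\phi_{i_t})
                      + \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\phi_j)

% Both estimators are unbiased under uniform sampling of i_t
\mathbb{E}_{i_t}\left[ v_t \right] = \nabla F(w_t)
```

In both cases the correction terms have zero mean, so the estimator stays unbiased while its variance shrinks as w_t, \tilde{w} and the \phi_j approach the same minimizer.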
[0041] The term “variance reduction” refers to approaches that are widely used for training machine learning models such as logistic regression and support vector machines as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction.
[0042] The term “stochastic gradient descent” or “SGD” refers to an optimization method for unconstrained optimization problems. In contrast to (batch) gradient descent, SGD approximates the true gradient of the objective E(w, b) by considering a single training example at a time.
[0043] The term “gradient method” refers to a conventional approach used in computer science for solving optimization problems. It involves constructing an error function and designing a neural network model based on the negative gradient descent. The method is used to prove convergence and stability in optimization problems.
[0044] The term “convergence accuracy” refers to the final accuracy achieved by the models after a fixed number of training rounds. It is measured as the highest accuracy attained by the models during the training process and is used to assess the effectiveness of our variance reduction methods.
[0045] The term “convergence rate” refers to how quickly the models reach their optimal performance. It is measured by the number of communication rounds required for the models to converge to their maximum accuracy. Faster convergence indicates a more efficient training process, which is crucial in federated learning due to the iterative nature of training across multiple devices.
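The two metrics defined in paragraphs [0044] and [0045] can be computed from a per-round accuracy log. The sketch below is a minimal illustration; the function name `convergence_metrics` and the accuracy values are hypothetical and are not results from this disclosure.

```python
def convergence_metrics(accuracy_per_round):
    """Return (convergence accuracy, rounds needed to first reach it).

    Convergence accuracy is the highest accuracy in the log; the convergence
    rate is reported as the 1-based round in which that accuracy first occurs.
    """
    best = max(accuracy_per_round)
    rounds_to_best = accuracy_per_round.index(best) + 1
    return best, rounds_to_best


# Made-up accuracy log for six communication rounds.
log = [0.52, 0.71, 0.80, 0.84, 0.84, 0.83]
best_acc, rounds = convergence_metrics(log)
print(best_acc, rounds)  # prints: 0.84 4
```

A smaller round count for the same convergence accuracy indicates faster convergence.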
[0046] The term “communication overhead” refers to the number of parameters transmitted between clients and the central server. By focusing on the most significant gradients, the goal is to decrease communication overhead while maintaining model performance using Top-k sparsification. The impact of different sparsification levels (e.g., top 70%, 80%, and 90% gradients) was evaluated on the communication payload and model accuracy.
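Top-k sparsification can be sketched as follows. The example assumes that a "top 70%" level means the 70% of gradient entries with the largest magnitude are kept and transmitted as (index, value) pairs; the source does not spell out this convention, and the names `top_k_sparsify`, `densify` and `keep_percent` are hypothetical.

```python
def top_k_sparsify(grad, keep_percent):
    """Keep the `keep_percent`% of entries with the largest absolute value.

    Returns (indices, values) so that only the kept entries are transmitted.
    """
    k = max(1, (len(grad) * keep_percent) // 100)
    by_magnitude = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = sorted(by_magnitude[:k])
    return kept, [grad[i] for i in kept]


def densify(indices, values, size):
    """Rebuild a dense vector on the receiver, with zeros for dropped entries."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


grad = [0.05, -1.2, 0.3, 0.0, 2.5, -0.01, 0.7, -0.4, 0.09, 1.1]
idx, vals = top_k_sparsify(grad, 70)
restored = densify(idx, vals, len(grad))
print(idx)  # prints: [1, 2, 4, 6, 7, 8, 9]
```

Sending indices alongside values adds its own cost, so the net saving depends on how the index list is encoded; this sketch does not model that encoding.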
[0047] Other technical terms used herein have their ordinary meaning in the art that they are used, as exemplified by a variety of technical dictionaries. The particular values and configurations discussed in these non-limiting examples can be varied and are cited merely to illustrate at least one embodiment and are not intended to limit the scope thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0048] The accompanying drawings illustrate aspects of the present invention. In such drawings:
[0049] FIG. 1 depicts a system in which federated learning is performed, according to embodiments.
[0050] FIG. 2A shows the training accuracy convergence of models for SVRG variance reduction gradient method on EMNIST (with LeNet architecture).
[0051] FIG. 2B shows the results also using SVRG with ResNet-18 architecture.
[0052] FIG. 2C shows the results also using SAGA with LeNet architecture.
[0053] FIG. 2D shows the results also using SAGA with ResNet-18 architecture.
[0054] FIG. 3A shows the training accuracy convergence of models for SVRG variance reduction gradient method on CIFAR-10 (with LeNet architecture).
[0055] FIG. 3B shows the results also using SVRG with ResNet-18 architecture.
[0056] FIG. 3C shows the results also using SAGA with LeNet architecture.
[0057] FIG. 3D shows the results also using SAGA with ResNet-18 architecture.
[0058] FIG. 4A shows the effectiveness of defense mechanism against DLG attack in scenarios consisting of different training methods on EMNIST.
[0059] FIG. 4B shows the effectiveness of defense mechanism against DLG attack in scenarios consisting of different training methods on CIFAR-10.
DETAILED DESCRIPTION
[0060] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the subject technology as claimed. Additional features and advantages of the subject technology are set forth in the description below, and in part will be apparent from the description, or may be learned by practice of the subject technology. The advantages of the subject technology will be realized and attained by the methods and systems particularly pointed out in the written description and claims hereof.
[0061] Federated learning aims at training a machine learning algorithm, for example deep neural networks, on multiple local datasets contained in local nodes without explicitly exchanging data samples. The general principle consists in training local models on local data samples and exchanging parameters (e.g. the weights and biases of a deep neural network) between these local nodes at some frequency to generate a global model shared by all nodes.
[0062] The main difference between federated learning and distributed learning lies in the assumptions made on the properties of the local datasets, as distributed learning originally aims at parallelizing computing power where federated learning originally aims at training on heterogeneous datasets. While distributed learning also aims at training a single model on multiple servers, a common underlying assumption is that the local datasets are independent and identically distributed (i.i.d.) and roughly have the same size. None of these hypotheses are made for federated learning; instead, the datasets are typically heterogeneous and their sizes may span several orders of magnitude. Moreover, the clients involved in federated learning may be unreliable as they are subject to more failures or dropouts because they commonly rely on less powerful communication media (e.g., Wi-Fi) and battery-powered systems (e.g., smartphones and IoT devices) compared to distributed learning where nodes are typically datacenters that have powerful computational capabilities and are connected to one another with fast networks. Federated learning can be implemented as centralized, decentralized or heterogeneous.
Centralized federated learning
[0063] In the centralized federated learning setting, a central server is used to orchestrate the different steps of the algorithms and coordinate all the participating nodes during the learning process. The server is responsible for the nodes selection at the beginning of the training process and for the aggregation of the received model updates. Since all the selected nodes have to send updates to a single entity, the server may become a bottleneck of the system.
Decentralized federated learning
[0064] In the decentralized federated learning setting, the nodes are able to coordinate themselves to obtain the global model. This setup prevents single point failures as the model updates are exchanged only between interconnected nodes without the orchestration of the central server. Nevertheless, the specific network topology may affect the performance of the learning process.
Heterogeneous federated learning
[0065] An increasing number of application domains involve a large set of heterogeneous clients, e.g., mobile phones and IoT devices. Most of the existing federated learning strategies assume that local models share the same global model architecture. Recently, a new federated learning framework named HeteroFL was developed to address heterogeneous clients equipped with very different computation and communication capabilities. The HeteroFL technique can enable the training of heterogeneous local models with dynamically varying computation and non-IID data complexities while still producing a single accurate global inference model.
[0066] Recent studies have highlighted privacy vulnerabilities related to gradient leakage attacks in federated learning systems. Prior efforts to address this privacy concern include adding noise to shared gradients according to a leakage risk calculated from the spread of the model weight variance or the sensitivity of the gradients. The magnitude of neural network weights can intuitively reveal the privacy leakage risks associated with their gradients. Therefore, the variance statistics of the neural network weights are used to quantify the privacy leakage risks per layer. Gradients from layers with high variance in their weights have a high privacy leakage risk and therefore incur a higher perturbation footprint. Reducing the variance of model weights is a key strategy to mitigate these risks. This reduction minimizes the privacy leakage risk and, consequently, reduces the perturbation footprint during training, leading to more stable training and better convergence performance. To achieve this, Applicants have developed an FL algorithm that incorporates local variance reduction in the context of using gradient perturbation in federated learning. To further minimize the risk of privacy breaches while also enhancing overall efficiency, Applicants have adopted Top-k sparsification. This technique restricts the number of gradients transmitted, mitigating the risk of gradient leakage attacks and reducing communication overhead.
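The per-layer risk estimation described above can be illustrated with a minimal sketch. It assumes that the risk of a layer is the population variance of its weights and that the Gaussian noise standard deviation grows linearly with that risk, normalized by the riskiest layer; this scaling rule, the parameter `base_sigma`, and the function names `layer_risk` and `perturb_by_risk` are hypothetical, since the specification does not state the exact mapping from risk to noise.

```python
import random
import statistics


def layer_risk(weights_by_layer):
    """Estimate leakage risk per layer as the variance of that layer's weights."""
    return {name: statistics.pvariance(w) for name, w in weights_by_layer.items()}


def perturb_by_risk(grads_by_layer, risk, base_sigma=0.1, rng=None):
    """Add Gaussian noise to each layer's gradient, scaled by its relative risk.

    Assumed rule: sigma_layer = base_sigma * risk_layer / max_risk.
    """
    rng = rng or random.Random()
    max_risk = max(risk.values()) or 1.0  # avoid dividing by zero
    noisy = {}
    for name, grad in grads_by_layer.items():
        sigma = base_sigma * risk[name] / max_risk
        noisy[name] = [g + rng.gauss(0.0, sigma) for g in grad]
    return noisy


weights = {
    "conv1": [0.1, -0.1, 0.1, -0.1],  # low weight variance, low estimated risk
    "fc": [1.0, -1.0, 1.0, -1.0],     # high weight variance, high estimated risk
}
grads = {"conv1": [0.2, 0.1, -0.3, 0.0], "fc": [0.5, -0.5, 0.25, 0.1]}

risk = layer_risk(weights)
noisy_grads = perturb_by_risk(grads, risk, base_sigma=0.1, rng=random.Random(0))
```

Under this rule, a training procedure that lowers the weight variance of a layer also lowers the noise added to that layer's gradients, which is the link between variance reduction and perturbation footprint drawn in the paragraph above.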
[0067] FIG. 1 is a flowchart that depicts components of a system for federated learning according to embodiments. The framework includes three main steps: initialization, local training and global aggregation. The system does not require an exchange of raw data from client devices to global servers. Instead, the raw data on edge devices is used to train the model locally, increasing data privacy.
Step 1: Initialization
[0068] The process begins with initialization on a central server. The central server initializes the global model and distributes it to all participating clients. Each client downloads the global model for local training.
Step 2: Local Training of Model
[0069] This step is further divided into several sub-steps:
2.1 Gather Model Updates
• Function: Each client trains the global model using its local dataset and computes the model updates.
• Process: Clients calculate the difference between their locally trained model parameters and the initial global model parameters.
2.2 Variance Reduction
• Function: Apply variance reduction techniques (such as SVRG or SAGA) to stabilize and improve the convergence of local updates.
• SVRG: Computes a reference gradient over the entire local dataset and uses it to correct the stochastic gradient updates.
• SAGA: Maintains a running average of past gradients and updates the gradient estimates incrementally.
2.3 Gradient Sparsification
• Function: Reduce communication overhead by selecting and transmitting only the most significant gradients.
• Process: Apply Top-k sparsification to the gradients computed by each client, selecting only the Top-k gradients with the highest magnitudes (e.g., 70%, 80% or 90%).
2.4 Risk Estimation and Perturbation
Clients assess the privacy leakage risk by analyzing the variance of model weights. Based on the estimated risk, noise is added to the gradients using gradient perturbation techniques (e.g., Gaussian noise) before sending updates to the central server. This step ensures that the shared gradients do not reveal sensitive information about the local data.
• Function: Ensure privacy by adding noise to the gradients based on the estimated privacy leakage risk.
• Risk Estimation: Clients assess the privacy leakage risk by analyzing the variance of model weights.
• Perturbation: Based on the estimated risk, clients add noise to the gradients using gradient perturbation techniques (Gaussian noise) to protect individual data before sending updates to the central server.
Step 3: Global Aggregation
[0070] Central Server: The central server collects the noisy, sparsified updates from all clients.
• Aggregation: These updates are aggregated to form a new global model, incorporating the knowledge gained from all participating clients while preserving privacy.
• Redistribution: The updated global model is then redistributed to the clients for the next round of training.
Variance Reduction
[0071] Variance reduction is a procedure used to increase the precision of the estimates obtained for a given simulation or computational effort. Every output random variable from the simulation is associated with a variance that limits the precision of the simulation results. As described above, the framework can use Stochastic Variance Reduced Gradient (SVRG) and Stochastic Average Gradient “ameliore” (SAGA) to improve model convergence speed and accuracy. Their implementation helps mitigate privacy leakage risks and reduce the perturbation footprint, which improves stability during training and convergence performance.
[0072] SVRG is an optimization algorithm designed to reduce the variance of stochastic gradients. SVRG periodically computes the full gradient of the loss function over the entire local dataset and uses it to correct stochastic gradient updates for more stable convergence. Similarly, SAGA aims to reduce the variance in gradient updates. SAGA stores the most recent gradient computed for each sample and maintains a running average of these past gradients, leading to reduced variance in the stochastic gradient estimates.
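To make the two update rules concrete, the following Python sketch applies textbook SVRG and SAGA to a small least-squares problem standing in for a client's local objective. The loss, the function names (`grad_i`, `full_grad`, `svrg`, `saga`), the step size and the epoch count are illustrative assumptions and are not taken from the disclosure.

```python
import numpy as np

def grad_i(w, X, y, i):
    """Gradient of the per-sample loss 0.5 * (x_i . w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w, X, y):
    """Average of the per-sample gradients over the whole local dataset."""
    return X.T @ (X @ w - y) / len(y)

def svrg(w, X, y, lr=0.1, epochs=100, rng=None):
    """SVRG: correct each stochastic gradient with a full reference gradient."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(y)
    for _ in range(epochs):
        w_ref = w.copy()             # snapshot of the weights
        mu = full_grad(w_ref, X, y)  # reference gradient at the snapshot
        for i in rng.integers(0, n, size=n):
            w = w - lr * (grad_i(w, X, y, i) - grad_i(w_ref, X, y, i) + mu)
    return w

def saga(w, X, y, lr=0.1, epochs=100, rng=None):
    """SAGA: keep the last gradient seen per sample plus their running average."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(y)
    table = np.array([grad_i(w, X, y, i) for i in range(n)])
    avg = table.mean(axis=0)
    for i in rng.integers(0, n, size=epochs * n):
        g = grad_i(w, X, y, i)
        w = w - lr * (g - table[i] + avg)  # variance-reduced step
        avg = avg + (g - table[i]) / n     # update the average incrementally
        table[i] = g                       # overwrite the stored gradient
    return w
```

The practical difference is the trade-off between computation and memory: SVRG recomputes a full gradient once per epoch and stores only one snapshot, whereas SAGA avoids full passes after initialization but stores one gradient per sample.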
[0073] Without being bound by theory, Applicants propose the following rationale for the improvements. Reducing model weight/gradient variance can lead to a reduced leakage risk, consequently incurring less perturbation and a more lightweight defense, for faster convergence, improved model accuracy, and more efficient federated learning. This can assist in mitigating the challenges associated with noisy and non-IID (non-Independently and Identically Distributed) data present on client devices.
Gradient Sparsification
[0074] Gradient sparsification is used to reduce communication overhead by transmitting only significant model updates. After variance reduction, Top-k sparsification is applied to the gradients. This process selects only the gradients with the highest magnitudes (the Top-k) for communication, which reduces the amount of data transmitted and enhances privacy. Because only the most impactful updates are sent, communication overhead is minimized.
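A minimal Python sketch of Top-k sparsification follows. The function name `top_k_sparsify` and the convention that `ratio` is the fraction of entries kept are assumptions made for illustration; the disclosure does not state whether its percentages refer to the fraction kept or the fraction dropped.

```python
import numpy as np

def top_k_sparsify(grad, ratio):
    """Keep the `ratio` fraction of entries with the largest magnitude; zero the rest.

    Returns the sparsified gradient (same shape as `grad`) and the flat
    indices of the kept entries, which is all a client would need to send
    together with the kept values.
    """
    flat = grad.ravel()
    k = min(flat.size, max(1, int(round(ratio * flat.size))))
    # argpartition places the k largest magnitudes in the last k positions.
    keep = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(grad.shape), keep
```

For example, with a 2x2 gradient `[[0.1, -3.0], [2.0, -0.5]]` and `ratio=0.5`, only the entries -3.0 and 2.0 survive and the other two are set to zero.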
[0075] Applicants propose that sparsification to select Top-k high-magnitude gradients in federated learning safeguards data privacy by transmitting less information. It can also accelerate model training, adapt to varying data types, and, when combined with gradient perturbation, ensure strong privacy protection.
Operating Environment
[0076] Embodiments include a federated machine learning system that includes at least one processor and at least one memory including computer program code, the at least one memory and the computer program code being configured to, with the at least one processor, cause the system at least to update a machine learning model with a federated machine learning network.
[0077] The software of the present invention can be compatible with any operating system, including Mac OS, Microsoft Windows, Linux, Arthur, ARX, MOS, RISC iX, RISC OS, Fire OS, AmigaOS, Amiga Unix, AMSDOS, Contiki, CP/M 2.2, CP/M Plus, SymbOS, IBM, IBM AIX, Newton OS, iPadOS, watchOS, tvOS, bridgeOS, visionOS, XTS-400, BeOS, BeIA, Unix, MINI-UNIX, PWB/UNIX, CB UNIX, BESYS, Inferno, Burroughs MCP, GEOS, AmigaOS, AROS Research Operating System, SCOPE, Chippewa Operating System, MACE, Kronos, NOS, SIPROS, Puffin OS, CTOS, AOS, DG/UX, RDOS, CTOS, Deos, HeartOS, CP/M, Personal CP/M, CP/M Plus, CP/M-68K, CP/M-8000, CP/M-86 Plus, Personal CP/M-86, MP/M, MP/M II, FlexOS, Novell, or any other known operating system in the art.
[0078] The software of the present invention is also compatible with myriad graphical user interfaces. Compatible graphical user interfaces include but are not limited to those hosted on desktop computer monitors, laptop computer monitors, smartphone screens, smartwatch screens, television monitors, projection screens, multiplexed screens, LCD displays or any other hosted graphical user interface known in the art.
[0079] The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer program and data may be fixed in any form (e.g., source code form, computer executable form, or an intermediate form) either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed hard disk), an optical memory device (e.g., a CD-ROM or DVD), a PC card (e.g., PCMCIA card), or other memory device. The computer program and data may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies, networking technologies, and internetworking technologies. The computer program and data may be distributed in any form as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software or a magnetic tape), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web). It is appreciated that any of the software components of the present invention may, if desired, be implemented in ROM (read-only memory) form. The software components may, generally, be implemented in hardware, if desired, using conventional techniques.
[0080] The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Practitioners of ordinary skill will recognize that the invention may be executed on one or more computer processors that are linked using a data network, including, for example, the Internet. In another embodiment, different steps of the process can be executed by one or more computers and storage devices geographically separated but connected by a data network in a manner so that they operate together to execute the process steps. In one embodiment, a user's computer can run an application that causes the user's computer to transmit a stream of one or more data packets across a data network to a second computer, referred to here as a server. The server, in turn, may be connected to one or more mass data storage devices where the database is stored. The server can execute a program that receives the transmitted packets and interprets them in order to extract database query information. The server can then execute the remaining steps of the invention by means of accessing the mass storage devices to derive the desired result of the query. Alternatively, the server can transmit the query information to another computer that is connected to the mass storage devices, and that computer can execute the invention to derive the desired result. The result can then be transmitted back to the user's computer by means of another stream of one or more data packets appropriately addressed to the user's computer. In one embodiment, the relational database may be housed in one or more operatively connected servers operatively connected to computer memory, for example, disk drives. In yet another embodiment, the initialization of the relational database may be prepared on the set of servers and the interaction with the user's computer occurs at a different place in the overall process.
[0081] Further illustration of the present invention is shown in the working examples produced below.
EXAMPLES
[0082] The following non-limiting examples are provided for illustrative purposes only in order to facilitate a more complete understanding of representative embodiments now contemplated. These examples are intended to be a mere subset of all possible contexts in which the components of the formulation may be combined. Thus, these examples should not be construed to limit any of the embodiments described in the present specification, including those pertaining to the type of components and/or methods and uses thereof.
Example 1
Secure System Framework to Perform Training in Federated Learning using CIFAR-10 datasets with LeNet and ResNet-18 Architectures
[0083] FIGS. 2A-2D and FIGS. 3A-3D show a comparison of the runtimes of two models against two datasets. The objective was to observe the impact of variance reduction applied in federated learning on the convergence performance of model training.
[0084] The CIFAR-10 dataset is one of the most widely used datasets for machine learning research. The dataset contains 60,000 32x32 color images in ten different classes (i.e., airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks).
[0085] The training was conducted on non-i.i.d. data distributed across 100 clients for 100 rounds, except for ResNet-18 on CIFAR-10, which was trained for 80 rounds. Each round was trained for 5 epochs, with a partition sample size of 1128 for EMNIST and 500 for CIFAR-10. Variance reduction techniques SVRG and SAGA, coupled with Top-k sparsification, were employed to improve training stability and convergence. The federated learning process was carried out using the FedAvg aggregation algorithm with a learning rate of 0.01. In each round, 100 clients were available, out of which 5 were selected randomly for federated learning.
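The per-round client selection and FedAvg aggregation described in this setup can be sketched in Python as follows. This is an illustrative sketch only: the function names `select_clients` and `fedavg`, the choice to weight clients by local sample count, and the assumption that each client sends a parameter delta (local parameters minus global parameters) are not specified by the disclosure.

```python
import numpy as np

def select_clients(num_clients, num_selected, rng):
    """Pick distinct clients uniformly at random for one round."""
    return rng.choice(num_clients, size=num_selected, replace=False)

def fedavg(global_w, updates, sizes):
    """Apply the sample-size-weighted average of client updates to the global model.

    `updates` are assumed to be parameter deltas; entries zeroed by
    sparsification simply contribute nothing to the average.
    """
    weights = np.asarray(sizes, dtype=float)
    weights = weights / weights.sum()
    avg_update = sum(wt * u for wt, u in zip(weights, updates))
    return global_w + avg_update
```

With the numbers reported above, a round would call `select_clients(100, 5, rng)` and then `fedavg` on the five returned updates; when all partitions have the same size (for example 500 samples each), the weighted average reduces to a plain mean.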
[0086] FIGS. 2A-2D show a comparison of the performance of a typical learning model implementing FedAvg and the performance of a learning model implementing a local model training method of federated learning framework implementing a training data classification. FIG. 2A shows the training accuracy convergence of models for SVRG variance reduction gradient method with LeNet architecture; FIG. 2B shows the results also using SVRG with ResNet-18 architecture; FIG. 2C shows the results also using SAGA with LeNet architecture; FIG. 2D shows the results also using SAGA with ResNet-18 architecture.
Example 2
Secure System Framework to Perform Training in Federated Learning using EMNIST datasets with LeNet and ResNet-18 Architectures
[0087] The EMNIST dataset is a set of handwritten character digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset. FIGS. 3A-3D show a comparison of the performance of a typical learning model implementing FedAvg and the performance of a learning model implementing a local model training method of federated learning framework implementing a training data classification.
[0088] FIG. 3A shows the training accuracy convergence of models for SVRG variance reduction gradient method with LeNet architecture; FIG. 3B shows the results also using SVRG with ResNet-18 architecture; FIG. 3C shows the results also using SAGA with LeNet architecture; FIG. 3D shows the results also using SAGA with ResNet-18 architecture.
[0089] Table 1 shows the final convergence accuracy for LeNet across different optimizers and datasets.
Table 1
[0090] Similarly, Table 2 shows the final convergence accuracy for ResNet-18 across different optimizers and datasets.
Table 2
[0091] FIG. 4A shows the effectiveness of the defense mechanism against a DLG (Deep Leakage from Gradients) attack for different training methods on EMNIST. Similarly, FIG. 4B shows the effectiveness of the defense mechanism against the DLG attack for different training methods on CIFAR-10.
[0092] The results show that both SVRG and SAGA methods significantly enhance the convergence speed and accuracy compared to the baseline OutPost SGD method, across different Top-k sparsification levels.
[0093] In closing, it is to be understood that although aspects of the present specification are highlighted by referring to specific embodiments, one skilled in the art will readily appreciate that these disclosed embodiments are only illustrative of the principles of the subject matter disclosed herein. Therefore, it should be understood that the disclosed subject matter is in no way limited to a particular compound, composition, article, apparatus, methodology, protocol, and/or reagent, etc., described herein, unless expressly stated as such. In addition, those of ordinary skill in the art will recognize that certain changes, modifications, permutations, alterations, additions, subtractions and sub-combinations thereof can be made in accordance with the teachings herein without departing from the spirit of the present specification. It is therefore intended that the following appended claims and claims hereafter introduced are interpreted to include all such changes, modifications, permutations, alterations, additions, subtractions and sub-combinations as are within their true spirit and scope.
[0094] Certain embodiments of the present invention are described herein, including the best mode known to the inventors for carrying out the invention. Of course, variations on these described embodiments will become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the present invention to be practiced otherwise than specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described embodiments in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.
[0095] Groupings of alternative embodiments, elements, or steps of the present invention are not to be construed as limitations. Each group member may be referred to and claimed individually or in any combination with other group members disclosed herein. It is anticipated that one or more members of a group may be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.
[0096] Unless otherwise indicated, all numbers expressing a characteristic, item, quantity, parameter, property, term, and so forth used in the present specification and claims are to be understood as being modified in all instances by the term “about.” As used herein, the term “about” means that the characteristic, item, quantity, parameter, property, or term so qualified encompasses a range of plus or minus ten percent above and below the value of the stated characteristic, item, quantity, parameter, property, or term. Accordingly, unless indicated to the contrary, the numerical parameters set forth in the specification and attached claims are approximations that may vary. For instance, as mass spectrometry instruments can vary slightly in determining the mass of a given analyte, the term “about” in the context of the mass of an ion or the mass/charge ratio of an ion refers to +/-0.50 atomic mass unit. At the very least, and not as an attempt to limit the application of the doctrine of equivalents to the scope of the claims, each numerical indication should at least be construed in light of the number of reported significant digits and by applying ordinary rounding techniques.
[0097] Use of the terms “may” or “can” in reference to an embodiment or aspect of an embodiment also carries with it the alternative meaning of “may not” or “cannot.” As such, if the present specification discloses that an embodiment or an aspect of an embodiment may be or can be included as part of the inventive subject matter, then the negative limitation or exclusionary proviso is also explicitly meant, meaning that an embodiment or an aspect of an embodiment may not be or cannot be included as part of the inventive subject matter. In a similar manner, use of the term “optionally” in reference to an embodiment or aspect of an embodiment means that such embodiment or aspect of the embodiment may be included as part of the inventive subject matter or may not be included as part of the inventive subject matter. Whether such a negative limitation or exclusionary proviso applies will be based on whether the negative limitation or exclusionary proviso is recited in the claimed subject matter. Further, the use of the terms “include,” “includes” and “including” means include, includes and including, as well as include, includes and including, but not limited to.
[0098] Notwithstanding that the numerical ranges and values setting forth the broad scope of the invention are approximations, the numerical ranges and values set forth in the specific examples are reported as precisely as possible. Any numerical range or value, however, inherently contains certain errors necessarily resulting from the standard deviation found in their respective testing measurements. Recitation of numerical ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate numerical value falling within the range. Unless otherwise indicated herein, each individual value of a numerical range is incorporated into the present specification as if it were individually recited herein.
[0099] All patents, patent publications, and other publications referenced and identified in the present specification are individually and expressly incorporated herein by reference in their entirety for the purpose of describing and disclosing, for example, the compositions and methodologies described in such publications that might be used in connection with the present invention. These publications are provided solely for their disclosure prior to the filing date of the present application. Nothing in this regard should be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior invention or for any other reason. All statements as to the date or representation as to the contents of these documents are based on the information available to the applicants and do not constitute any admission as to the correctness of the dates or contents of these documents.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order or out of order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
[0100] Lastly, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present invention, which is defined solely by the claims. Accordingly, the present invention is not limited to that precisely as shown and described.
[0101] Although embodiments of the current disclosure have been described comprehensively in considerable detail to cover the possible aspects, those skilled in the art would recognize that other versions of the disclosure are also possible.
[0102] While the present invention has been described in terms of particular embodiments and applications, in both summarized and detailed forms, it is not intended that these descriptions in any way limit its scope to any such embodiments and applications, and it will be understood that many substitutions, changes and variations in the described embodiments, applications and details of the method and system illustrated herein and of their operation can be made by those skilled in the art without departing from the spirit of this invention.
Claims
CLAIMS
What is claimed is:
1. A method of federated learning, the method comprising steps of:
(a) initialization on a central server;
(b) local training on a plurality of clients, each client comprised of a local dataset; and
(c) global aggregation on the central server.
2. The method of claim 1, wherein the step of initialization comprises each of the plurality of clients downloading a global model from the central server for local training.
3. The method of claim 1, wherein the step of local training comprises a step of gathering model updates.
4. The method of claim 3, wherein the step of local training further comprises a step of variance reduction.
5. The method of claim 4, wherein the variance reduction comprises use of a Stochastic Variance Reduced Gradient (SVRG) algorithm.
6. The method of claim 4, wherein the variance reduction comprises use of a Stochastic Average Gradient (SAGA) algorithm.
7. The method of claim 3, wherein the step of local training further comprises a step of gradient sparsification.
8. The method of claim 7, wherein the step of gradient sparsification comprises Top-k sparsification computed by each client and selection of Top-k gradients at a magnitude of 70% or higher.
9. The method of claim 1, wherein the step of global aggregation comprises collection of noisy, sparsified updates from the plurality of clients, aggregation to form a new global model and redistribution to the plurality of clients for one or more additional rounds of training.
10. A method of federated learning, the method comprising:
(a) setting up a federated learning environment with a plurality of clients;
(b) applying one or more variance reduction techniques to enhance model convergence; and
(c) using Top-k sparsification to reduce communication overhead by transmitting only significant model updates.
11. The method of claim 10, further comprising a step of:
(d) implementing gradient perturbation techniques to add noise to selected gradients at each layer, updating the model parameters locally, and then sending the local model updates to the central server.
12. The method of claim 10, wherein the step of setting up a federated learning environment comprises initializing each of the plurality of clients to download a global model from a central server for local training.
13. The method of claim 10, wherein the step of applying one or more variance reduction techniques comprises use of a Stochastic Variance Reduced Gradient (SVRG) algorithm.
14. The method of claim 10, wherein the step of applying one or more variance reduction techniques comprises use of a Stochastic Average Gradient (SAGA) algorithm.
15. The method of claim 10, wherein the step of using Top-k sparsification comprises a step of computing by each client and selection of Top-k gradients at a magnitude of 70% or higher.
16. The method of claim 10, further comprising a step of global aggregation comprising collection of noisy, sparsified updates from the plurality of clients, aggregation to form a new global model and redistribution to the plurality of clients for one or more additional rounds of training.
17. A method of improving the security of data of a federated learning system, the method comprising:
(a) setting up a federated learning environment with a plurality of clients;
(b) applying one or more variance reduction techniques to enhance model convergence; and
(c) using Top-k sparsification to reduce communication overhead by transmitting only significant model updates.
18. The method of claim 17, further comprising a step of:
(d) implementing gradient perturbation techniques to add noise to selected gradients at each layer, updating the model parameters locally, and then sending the local model updates to the central server.
19. The method of claim 17, wherein the step of applying one or more variance reduction techniques comprises use of a Stochastic Variance Reduced Gradient (SVRG) algorithm.
20. The method of claim 17, wherein the step of applying one or more variance reduction techniques comprises use of a Stochastic Average Gradient (SAGA) algorithm.
21. The method of claim 17, wherein the step of using Top-k sparsification comprises a step of computing by each client and selection of Top-k gradients at a magnitude of 70% or higher.
22. The method of claim 17, further comprising a step of global aggregation comprising collection of noisy, sparsified updates from the plurality of clients, aggregation to form a new global model and redistribution to the plurality of clients for one or more additional rounds of training.
23. A system for a federated learning machine, the system comprised of:
(a) a central server;
(b) a plurality of clients, each client comprised of a local dataset;
wherein the central server is configured to initialize each of the plurality of clients to download a global model from the central server for local training;
wherein the central server is configured to apply one or more variance reduction techniques to enhance model convergence;
wherein the central server is configured to use Top-k sparsification to reduce communication overhead by transmitting only significant model updates.
24. The system of claim 23, wherein the central server is configured to apply one or more variance reduction techniques selected from Stochastic Variance Reduced Gradient (SVRG) and Stochastic Average Gradient (SAGA).
25. The system of claim 23, wherein the central server is configured to use Top-k sparsification and select Top-k gradients at a magnitude of 70% or higher.
26. The system of claim 23, wherein the central server is configured to collect noisy, sparsified updates from the plurality of clients and aggregate updates to form a new global model for redistribution to the plurality of clients for a subsequent round of training.