Federated double stochastic kernel learning for vertically partitioned data

By using the Federated Double Stochastic Kernel Learning (FDSKL) algorithm to compute random features locally, together with a tree-structured communication scheme, the approach addresses the inefficiency and privacy problems of training on vertically partitioned data. It enables efficient and secure training of nonlinear models and removes the linear separability limitation.

CN115943397B · Active · Publication Date: 2026-06-16 · JINGDONG TECH HLDG CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JINGDONG TECH HLDG CO LTD
Filing Date
2021-06-24
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies struggle to efficiently and securely train machine learning models for vertically partitioned data on multiple distributed edge devices, especially when maintaining data and model privacy. Traditional methods are inefficient and limited to linearly separable models.

Method used

The Federated Double Stochastic Kernel Learning (FDSKL) algorithm is adopted. By computing random features locally and updating the kernel model using double stochastic gradients, combined with a tree-structured communication scheme, data and model privacy are protected, and a constant learning rate is used in a parallel computing environment.

Benefits of technology

The algorithm converges to the optimal solution at a rate of O(1/t), where t is the number of iterations. It significantly improves training efficiency, removes the linear separability constraint, and provides cheaper privacy protection and better scalability.



Abstract

A system and method for making predictions using a machine learning model. The system includes a coordinator, an active computing device, and a passive computing device in communication with each other. The active computing device has a processor and a storage device storing computer executable code. The computer executable code is configured to: obtain parameters of a machine learning model; retrieve an instance from local data; sample a random direction for the instance; compute a dot product of the random direction and a point of the instance, compute a random feature; compute predicted values of the instance in the active computing device and the passive computing device, aggregate the predicted values to obtain a final predicted value; determine model coefficients using the random feature, the final predicted value, and a target value of the instance; update the machine learning model using the model coefficients; and predict a value of a new instance.

Description

[0001] Cross-references

[0002] References are cited and discussed in the description of this disclosure, which may include patents, patent applications, and various publications. The citations and / or discussions of such references are provided solely to clarify the description of this disclosure and do not imply that any such references constitute "prior art" as disclosed herein. All references cited and discussed in the specification are incorporated herein by reference in their entirety, to the same extent as each individual reference is incorporated by reference individually.

Technical Field

[0003] This disclosure generally relates to federated learning, and more specifically, to a large-scale privacy-preserving federated learning approach for vertically partitioned data using a kernel method.

Background Technology

[0004] The background description provided herein is intended to provide a general overview of the context of this disclosure. Within the scope of this background description, the work of the currently named inventors and descriptions that might not have been considered prior art at the time of filing are not, expressly or impliedly, acknowledged as prior art to this disclosure.

[0005] Federated learning is a machine learning technique that allows algorithms to be trained on multiple distributed edge devices or servers that store local data samples without exchanging their data samples. However, handling large amounts of data with sufficient efficiency, scalability, and security is a challenge.

[0006] Therefore, there is a need in this field to address the aforementioned defects and shortcomings.

Summary of the Invention

[0007] In some aspects, this disclosure relates to a system for making predictions using a machine learning model. In some embodiments, the system includes an active computing device and at least one passive computing device in communication with the active computing device. Each of the active and passive computing devices includes local data. The active computing device includes a processor and a storage device storing computer-executable code. The computer-executable code, when executed at the processor, is configured to:

[0008] Obtain the parameters of the machine learning model;

[0009] Retrieve instances from local data of the active computing device;

[0010] Sample a random direction for the instance;

[0011] Calculate the dot product between the random direction and the instance, and calculate the random features based on the dot product;

[0012] Calculate the predicted value of the instance in the active computing device, instruct the at least one passive computing device to calculate the predicted value of the instance in the at least one passive computing device, and aggregate the predicted values from the active computing device and the at least one passive computing device to obtain the final predicted value of the instance, wherein the predicted value of the instance in the at least one passive computing device is obtained based on the local data of the at least one passive computing device;

[0013] The model coefficients are determined using the difference between the final predicted value of the instance and the target value of the instance, and the random features.

[0014] The machine learning model is updated using the model coefficients; and

[0015] The machine learning model is used to predict the value of new instances.

[0016] In some embodiments, the parameters of the machine learning model include a constant learning rate.

[0017] In some embodiments, the instance is characterized by an index, and the computer-executable code is configured to provide the index to the at least one passive computing device, and each client computer in the active computing device and the at least one passive computing device is configured to sample the random direction based on the index. In some embodiments, the random direction is sampled from a Gaussian distribution.

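[0018] In some embodiments, the random feature is calculated using the equation $\phi_{\omega_i}(x_i) = \sqrt{2}\cos(\omega_i^T x_i + b)$, where $\omega_i^T x_i$ is the dot product and $b$ is a random value. In some embodiments, the value of $b$ is within the range $[0, 2\pi]$. In some embodiments, $\omega_i^T x_i + b$ is calculated by $\sum_{\hat{l}=1}^{q}\left((\omega_i)_{\mathcal{G}_{\hat{l}}}^T (x_i)_{\mathcal{G}_{\hat{l}}} + b^{\hat{l}}\right) - \sum_{\hat{l} \neq l'} b^{\hat{l}}$, where $q$ is the number of the active computing device and the at least one passive computing device, $\hat{l}$ denotes the $\hat{l}$-th of the $q$ computing devices, $(\omega_i)_{\mathcal{G}_{\hat{l}}}^T (x_i)_{\mathcal{G}_{\hat{l}}}$ is the dot product of the random direction and the instance in the $\hat{l}$-th computing device, $b^{\hat{l}}$ is a random number generated in the $\hat{l}$-th computing device, and $l'$ is the active computing device. In some embodiments, the value of $b^{\hat{l}}$ is within the range $[0, 2\pi]$.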

[0019] In some embodiments, the predicted value of the instance in the active computing device is calculated over multiple iterations, and in each iteration the predicted value is updated using the equation $f_l(x_i) = f_l(x_i) + \alpha_i \phi_{\omega_i}(x_i)$, where $f_l(x_i)$ is the predicted value of the instance, $\alpha_i$ is the model coefficient of the instance, and $\phi_{\omega_i}(x_i)$ is the random feature.

[0020] In some embodiments, the number of iterations is equal to or greater than 2.

[0021] In some embodiments, the computer-executable code is configured to update the machine learning model by replacing each previous model coefficient using the equation $\alpha_j = (1 - \gamma\lambda)\alpha_j$, where $\alpha_j$ is any one of the previous model coefficients, $\gamma$ is the learning rate of the machine learning model, and $\lambda$ is the regularization parameter of the machine learning model.

[0022] In some embodiments, communication between the active computing device and the at least one passive computing device is performed using a tree structure by a coordinating computing device that communicates with the active computing device and the at least one passive computing device.

[0023] In some respects, this disclosure relates to a method for making predictions using a machine learning model. The method includes:

[0024] The parameters of the machine learning model are obtained by an active computing device;

[0025] The active computing device retrieves instances from its local data.

[0026] The active computing device samples a random direction for the instance;

[0027] The active computing device calculates the dot product between the random direction and the instance, and calculates random features based on the dot product;

[0028] The active computing device calculates a predicted value for the instance, instructs at least one passive computing device to calculate a predicted value for the instance in the at least one passive computing device, and aggregates the predicted values from the active computing device and the at least one passive computing device to obtain a final predicted value for the instance, wherein the predicted value for the instance in the at least one passive computing device is obtained based on local data of the at least one passive computing device;

[0029] The active computing device uses the difference between the final predicted value of the instance and the target value of the instance, along with the random features, to determine the model coefficients.

[0030] The active computing device uses the model coefficients to update the machine learning model; and

[0031] The active computing device uses the machine learning model to predict the value of the new instance.

[0032] In some embodiments, the parameters of the machine learning model include a constant learning rate.

[0033] In some embodiments, the instance is characterized by an index, and the computer-executable code is configured to provide the index to the at least one passive computing device, and each client computer in the active computing device and the at least one passive computing device is configured to sample the random direction based on the index. In some embodiments, the random direction is sampled from a Gaussian distribution.

[0034] In some embodiments, the random feature is calculated using the equation $\phi_{\omega_i}(x_i) = \sqrt{2}\cos(\omega_i^T x_i + b)$, where $\omega_i^T x_i$ is the dot product and $b$ is a random value.

[0035] In some embodiments, $\omega_i^T x_i + b$ is calculated by $\sum_{\hat{l}=1}^{q}\left((\omega_i)_{\mathcal{G}_{\hat{l}}}^T (x_i)_{\mathcal{G}_{\hat{l}}} + b^{\hat{l}}\right) - \sum_{\hat{l} \neq l'} b^{\hat{l}}$, where $q$ is the number of the active computing device and the at least one passive computing device, $\hat{l}$ denotes the $\hat{l}$-th of the $q$ computing devices, $(\omega_i)_{\mathcal{G}_{\hat{l}}}^T (x_i)_{\mathcal{G}_{\hat{l}}}$ is the dot product of the random direction and the instance in the $\hat{l}$-th computing device, $b^{\hat{l}}$ is a random number generated in the $\hat{l}$-th computing device, and $l'$ is the active computing device.

[0036] In some embodiments, the predicted value of the instance in the active computing device is calculated over multiple iterations, and in each iteration the predicted value is updated using the equation $f_l(x_i) = f_l(x_i) + \alpha_i \phi_{\omega_i}(x_i)$, where $f_l(x_i)$ is the predicted value of the instance, $\alpha_i$ is the model coefficient of the instance, and $\phi_{\omega_i}(x_i)$ is the random feature.

[0037] In some embodiments, the computer-executable code is configured to update the machine learning model by replacing each previous model coefficient using the equation $\alpha_j = (1 - \gamma\lambda)\alpha_j$, where $\alpha_j$ is any one of the previous model coefficients, $\gamma$ is the learning rate of the machine learning model, and $\lambda$ is the regularization parameter of the machine learning model.

[0038] In some embodiments, communication between the active computing device and the at least one passive computing device is performed using a tree structure by a coordinating computing device that communicates with the active computing device and the at least one passive computing device.

[0039] In some aspects, this disclosure relates to a non-transitory computer-readable medium storing computer-executable code. The computer-executable code, when executed at a processor of a computing device, is configured to perform the methods described above.

[0040] These and other aspects of this disclosure will become apparent from the following description of preferred embodiments taken in conjunction with the accompanying drawings, although variations and modifications therein may be made without departing from the spirit and scope of the novel concepts of this disclosure.

Description of the Drawings

[0041] The accompanying drawings illustrate one or more embodiments of this disclosure and, together with the written description, serve to explain the principles of this disclosure. Where possible, the same reference numerals are used throughout the drawings to refer to the same or similar elements of the embodiments.

[0042] Figure 1 schematically depicts a federated double stochastic kernel learning (FDSKL) system according to certain embodiments of the present disclosure.

[0043] Figure 2A schematically depicts a tree structure according to certain embodiments of the present disclosure.

[0044] Figure 2B schematically depicts a tree structure according to certain embodiments of the present disclosure.

[0045] Figure 3A schematically depicts a federated learning system according to certain embodiments of the present disclosure.

[0046] Figure 3B schematically depicts a worker according to certain embodiments of the present disclosure.

[0047] Figure 4 schematically depicts the FDSKL training process according to certain embodiments of the present disclosure.

[0048] Figure 5 schematically depicts a flowchart of calculating the predicted value of a sample using the FDSKL model according to certain embodiments of the present disclosure.

[0049] Figure 6 schematically depicts a flowchart of calculating the adjusted dot product using the FDSKL model according to certain embodiments of the present disclosure.

[0050] Figure 7A schematically depicts the algorithms used for comparison according to certain embodiments of the present disclosure.

[0051] Figure 7B schematically depicts the benchmark datasets used in the experiments according to certain embodiments of the present disclosure.

[0052] Figures 8A-8H schematically depict the results of binary classification using the comparison methods shown in Figure 7A according to certain embodiments of the present disclosure.

[0053] Figures 9A-9D schematically depict the elapsed time of different structures on four datasets according to certain embodiments of the present disclosure.

[0054] Figures 10A-10D schematically depict the change in training time as the number of training instances increases according to certain embodiments of the present disclosure.

[0055] Figure 11 schematically depicts box plots of the test errors of three kernel methods, a linear method (FD-SVRG), and FDSKL according to certain embodiments of the present disclosure.

[0056] Overview of this disclosure

[0057] In some embodiments, the symbols and equations are defined as follows:

[0058] $x_i$ represents a portion of an instance or sample of data, which can be available on any worker or server; the instance is indexed by $i$;

[0059] $(x_i)_{\mathcal{G}_l}$ is an instance or sample in a local worker or local server $l$;

[0060] $\mathbb{P}(\omega)$ is a distribution or measure, such as the Gaussian distribution;

[0061] $\omega_i$ is the random direction corresponding to index $i$;

[0062] $\omega_i^T$ is the transpose of $\omega_i$;

[0063] $(\omega_i)_{\mathcal{G}_l}$ is the random direction corresponding to index $i$ in the local server $l$;

[0064] $(\omega_i)_{\mathcal{G}_l}^T$ is the transpose of $(\omega_i)_{\mathcal{G}_l}$;

[0065] $\omega_i^T x_i$ is the dot product of $\omega_i$ and $x_i$;

[0066] $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l}$ is the dot product of $(\omega_i)_{\mathcal{G}_l}$ and $(x_i)_{\mathcal{G}_l}$;

[0067] $b$ is a random number within the range $[0, 2\pi]$, which can be generated by a random number generator;

[0068] $\omega_i^T x_i + b$ denotes the adjusted dot product, where the dot product of $\omega_i$ and $x_i$ is adjusted by a random value $b$;

[0069] $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l} + b$ denotes the adjusted dot product, where the dot product of $(\omega_i)_{\mathcal{G}_l}$ and $(x_i)_{\mathcal{G}_l}$ is adjusted by a random value $b$;

[0070] $f(x_i)$ is the predicted value of the instance $x_i$;

[0071] $y_i$ is the label of the instance $x_i$;

[0072] $\phi_{\omega_i}(x_i)$ is the random feature of the instance $x_i$, which can be calculated from $\omega_i^T x_i + b$;

[0073] $\alpha_i$ is the model coefficient of the instance $x_i$;

[0074] $\{\alpha_i\}_{i=1}^{t}$ represents the model coefficients of the instances at different iterations, where $t$ is the number of iterations (not the transpose operator), and each iteration $i$ has a corresponding model coefficient $\alpha_i$;

[0075] $\{\alpha_{\Lambda^l}\}_{l=1}^{q}$ represents the model coefficients at different iterations in the workers, where $q$ is the total number of workers, $\alpha_{\Lambda^l}$ are the model coefficients of the $l$-th worker, and $\Lambda^l$ is the corresponding set of iteration indices. For each local worker $l$, there is a model coefficient $\alpha_i$ corresponding to each iteration $i \in \Lambda^l$.

[0076] $T_0$, $T_1$, and $T_2$ are tree structures used for communication.

[0077] In many real-world machine learning applications, data is provided by multiple providers, each maintaining private records of different sets of features about common entities. Training on such vertically partitioned data effectively and efficiently with traditional machine learning algorithms, while maintaining data privacy, is a challenge.

[0078] This disclosure relates to large-scale privacy-preserving federated learning for vertically partitioned data, focusing on nonlinear learning using kernels. In some aspects, this disclosure provides a Federated Double Stochastic Kernel Learning (FDSKL) algorithm for vertically partitioned data. In some embodiments, this disclosure uses random features to approximate the kernel mapping function and uses double stochastic gradients to update the kernel model; all of these computations are federated, without revealing the entire data sample to any worker. Furthermore, this disclosure uses a tree-structured communication scheme to distribute and aggregate computations with minimal communication cost. This disclosure proves that FDSKL converges to the optimal solution at a rate of O(1/t), where t is the number of iterations. This disclosure also provides a data security analysis under the semi-honest assumption. In summary, FDSKL is the first efficient and scalable privacy-preserving federated kernel method. Extensive experimental results on various benchmark datasets demonstrate that FDSKL is significantly faster than state-of-the-art federated learning methods when dealing with kernels.

[0079] Some embodiments of this disclosure have the following advantages: (1) The FDSKL algorithm can be trained efficiently, scalably, and securely on vertically partitioned data using a kernel method. (2) FDSKL is a distributed double stochastic gradient algorithm with a constant learning rate, which is much faster than existing double stochastic gradient algorithms based on decreasing learning rates, and also much faster than existing privacy-preserving federated kernel learning algorithms. (3) The tree-structured communication scheme is used for distributed and aggregated computation, which is more efficient than star-structured and ring-structured communication, making FDSKL more efficient than existing federated learning algorithms. (4) Existing federated learning algorithms for vertically partitioned data use encryption technology to ensure the security of the algorithm, which is time-consuming and laborious. However, the method of this disclosure uses random perturbation to maintain the security of the algorithm, which is cheaper than encryption technology and makes FDSKL more efficient than existing federated learning algorithms. (5) Most existing federated learning algorithms for vertically partitioned data are limited to linearly separable models. The FDSKL of this disclosure is the first efficient and scalable federated learning algorithm for vertically partitioned data that breaks through the implicit linear separability limitation.

[0080] In some aspects, the important novel features of this disclosure include: (1) FDSKL is a distributed double stochastic gradient algorithm for vertically partitioned data with a constant learning rate. (2) This disclosure proves the sublinear convergence rate of FDSKL. (3) This disclosure computes $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l} + b$ locally, to avoid directly transmitting the local data $(x_i)_{\mathcal{G}_l}$ to other workers for calculating $\omega_i^T x_i + b$, where $b$ is an added random number that ensures the security of the algorithm. (4) This disclosure provides a data security analysis under the semi-honest assumption.

[0081] In some aspects, this disclosure relates to random feature approximation. Random features (Rahimi and Recht, 2008, 2009, which are incorporated herein by reference in their entirety) are a powerful technique for scaling kernel methods. As shown in Theorem 0, this technique uses an interesting duality between continuous, shift-invariant positive definite kernels (i.e., $K(x, x') = K(x - x')$) and stochastic processes. Theorem 0 (Rudin, 1962, which is incorporated herein by reference in its entirety): a continuous, real-valued, symmetric, and shift-invariant function $K(x, x') = K(x - x')$ on $\mathbb{R}^d$ is a positive definite kernel if and only if there exists a finite non-negative measure $\mathbb{P}(\omega)$ on $\mathbb{R}^d$ such that:

[0082] $K(x - x') = \int_{\mathbb{R}^d} e^{i\omega^T (x - x')}\, d\mathbb{P}(\omega) = \int_{\mathbb{R}^d \times [0, 2\pi]} 2\cos(\omega^T x + b)\cos(\omega^T x' + b)\, d\left(\mathbb{P}(\omega) \times \mathbb{P}(b)\right)$, where $\mathbb{P}(b)$ is the uniform distribution on $[0, 2\pi]$ and $\phi_\omega(x) = \sqrt{2}\cos(\omega^T x + b)$.

[0083] According to Theorem 0, the value of the kernel function can be approximated by explicitly computing the random feature map $\phi_\omega(x)$, as shown below:

[0084] $K(x, x') \approx \frac{1}{m}\sum_{i=1}^{m}\phi_{\omega_i}(x)\phi_{\omega_i}(x')$, where $m$ is the number of random features and the $\omega_i$ are drawn from $\mathbb{P}(\omega)$. Specifically, for the Gaussian RBF kernel $K(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$, $\mathbb{P}(\omega)$ is a Gaussian distribution with density proportional to $\exp(-\sigma^2\|\omega\|^2 / 2)$. For the Laplace kernel (Yang et al., 2014, which is incorporated herein by reference in its entirety), this yields a Cauchy distribution. Note that computing the random feature map $\phi_\omega(x)$ requires only a linear combination of the original input features, which can be partitioned vertically. This property makes random feature approximation well suited to the federated learning setting.
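As a minimal sketch of the random feature approximation for the Gaussian RBF kernel, the following Python code compares the exact kernel value with the average of products of random features of the form sqrt(2)·cos(ωᵀx + b). The function names, the seed, and the choice of σ are illustrative assumptions and do not come from the disclosure.

```python
import math
import random


def sample_direction(dim, sigma, rng):
    """Draw omega from N(0, sigma^-2 I), the spectral measure of the RBF kernel."""
    return [rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]


def random_feature(omega, b, x):
    """Random feature sqrt(2) * cos(omega^T x + b)."""
    dot = sum(w * v for w, v in zip(omega, x))
    return math.sqrt(2.0) * math.cos(dot + b)


def rbf_kernel(x, y, sigma):
    """Exact Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def approx_kernel(x, y, sigma, m, seed=0):
    """Monte Carlo estimate of the kernel using m random features."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        omega = sample_direction(len(x), sigma, rng)
        b = rng.uniform(0.0, 2.0 * math.pi)  # same b for both points
        total += random_feature(omega, b, x) * random_feature(omega, b, y)
    return total / m
```

The estimate tightens as m grows, and each feature only needs the scalar ωᵀx, which is a sum of per-feature-block dot products. That additive structure is what allows the computation to be split across vertically partitioned workers.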

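[0085] In some aspects, this disclosure relates to double stochastic gradients. Because the functional gradient in the RKHS $\mathcal{H}$ can be calculated as $\nabla f(x) = K(x, \cdot)$, the stochastic gradient of $\nabla f(x)$ with respect to the random feature $\omega$ can be expressed as: $\xi(\cdot) = \phi_\omega(x)\phi_\omega(\cdot)$.

[0086] Given a randomly sampled data instance $(x, y)$ and a random feature $\omega$, the double stochastic gradient of the loss function $L(f(x), y)$ on the RKHS $\mathcal{H}$ with respect to the sampled instance $(x, y)$ and the random feature $\omega$ can be expressed as: $\zeta(\cdot) = L'(f(x), y)\,\phi_\omega(x)\phi_\omega(\cdot)$.

[0087] Because $\nabla\|f\|_{\mathcal{H}}^2 = 2f$, the stochastic gradient of the regularized objective $\mathcal{R}(f) = \mathbb{E}[L(f(x), y)] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$ can be expressed as: $\hat{\zeta}(\cdot) = \zeta(\cdot) + \lambda f(\cdot)$.

[0088] It is important to note that $\mathbb{E}_{(x,y)}\mathbb{E}_{\omega}\,\hat{\zeta}(\cdot) = \nabla\mathcal{R}(f)$. According to the stochastic gradient $\hat{\zeta}(\cdot)$, the solution can be updated with step size $\gamma_t$. Then, letting $f_1(\cdot) = 0$, we get:

[0089] $f_{t+1}(\cdot) = f_t(\cdot) - \gamma_t\left(\zeta_t(\cdot) + \lambda f_t(\cdot)\right) = \sum_{i=1}^{t} -\gamma_i \prod_{j=i+1}^{t}(1 - \gamma_j\lambda)\,\zeta_i(\cdot) = \sum_{i=1}^{t}\alpha_i^t\,\phi_{\omega_i}(\cdot)$, where $\alpha_i^t = -\gamma_i \prod_{j=i+1}^{t}(1 - \gamma_j\lambda)\,L'(f_i(x_i), y_i)\,\phi_{\omega_i}(x_i)$.

[0090] According to the equation above, $\alpha_i^t$ are the important coefficients that define the model $f(\cdot)$. Note that the model $f(x)$ in the above equation, like the usual kernel model $f(x) = \sum_{i}\alpha_i K(x_i, x)$, does not satisfy the implicit linear separability assumption.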
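The update described above can be sketched on a single machine as follows. This is a minimal illustration, assuming a squared loss (so that L'(f(x), y) = f(x) - y), a constant learning rate γ, and a random direction and offset b that are regenerated from the iteration index used as a seed. The function names and default values are hypothetical.

```python
import math
import random


def feature(seed, x, sigma=1.0):
    """Regenerate (omega, b) from the seed; return sqrt(2) * cos(omega^T x + b)."""
    rng = random.Random(seed)
    omega = [rng.gauss(0.0, 1.0 / sigma) for _ in x]
    b = rng.uniform(0.0, 2.0 * math.pi)
    return math.sqrt(2.0) * math.cos(sum(w * v for w, v in zip(omega, x)) + b)


def predict(alphas, x):
    """f(x) = sum_i alpha_i * phi_{omega_i}(x); the seed of feature i is i."""
    return sum(a * feature(i, x) for i, a in enumerate(alphas))


def train(data, steps, gamma=0.1, lam=1e-3, seed=12345):
    """Double stochastic gradient steps with a constant learning rate (assumed)."""
    rng = random.Random(seed)
    alphas = []
    for t in range(steps):
        x, y = rng.choice(data)                  # random data instance
        residual = predict(alphas, x) - y        # L'(f(x), y) for squared loss
        new_alpha = -gamma * residual * feature(t, x)
        # Regularization shrinks every earlier coefficient by (1 - gamma * lam).
        alphas = [(1.0 - gamma * lam) * a for a in alphas]
        alphas.append(new_alpha)
    return alphas
```

Storing only the coefficients and regenerating each direction from its seed keeps the model compact, at the cost of recomputing features at prediction time.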

[0091] In some aspects, the Federated Double Stochastic Kernel Learning (FDSKL) algorithm is described as follows.

[0092] FDSKL system architecture: Figure 1 schematically depicts an FDSKL system according to certain embodiments of this disclosure. As shown in Figure 1, the system includes a coordinator and multiple workers. The workers are named worker 1, 2, ..., q, where worker 1 is the active worker and the rest are passive workers. Note that any worker that initiates federated learning can be the active worker. Each worker has its own private data, which is inaccessible to other workers. The system features a novel approach to achieving data and model privacy, and uses a special tree-structured communication mechanism between the workers and the coordinator. The FDSKL structure vertically partitions the computation of random features.

[0093] Data privacy: To maintain the privacy of vertically partitioned data, certain embodiments of this disclosure divide the computation of the value of $\phi_{\omega_i}(x_i) = \sqrt{2}\cos(\omega_i^T x_i + b)$ to avoid transmitting the local data $(x_i)_{\mathcal{G}_l}$ to other workers. Specifically, this disclosure sends a random seed to the $l$-th worker. Once the $l$-th worker receives the random seed, it can uniquely generate the random direction $\omega_i$ based on the random seed. Therefore, this disclosure can compute $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l} + b^l$ locally, which avoids directly transmitting the local data $(x_i)_{\mathcal{G}_l}$ to other workers for calculating $\omega_i^T x_i + b$. It is difficult to infer any $(x_i)_{\mathcal{G}_l}$ from the value of $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l} + b^l$ received from other workers, which ensures data privacy.

[0094] Model privacy: In addition to maintaining the privacy of vertically partitioned data, this disclosure also maintains model privacy. The model coefficients $\alpha_i$ are stored separately and privately in different workers. Based on the location of the model coefficients $\alpha_i$, this disclosure partitions the model coefficients $\{\alpha_i\}_{i=1}^{t}$ into $\{\alpha_{\Lambda^l}\}_{l=1}^{q}$, where $\alpha_{\Lambda^l}$ denotes the model coefficients of the $l$-th worker and $\Lambda^l$ is the corresponding set of iteration indices. This disclosure does not directly transmit the local model coefficients $\alpha_{\Lambda^l}$ to other workers. To calculate $f(x)$, this disclosure computes $f_l(x) = \sum_{i \in \Lambda^l}\alpha_i\phi_{\omega_i}(x)$ locally and transmits it to other workers; $f(x)$ can then be reconstructed by summing over all $f_l(x)$. If $|\Lambda^l| \geq 2$, it is difficult to infer the local model coefficients $\alpha_{\Lambda^l}$ from the value of $f_l(x)$. Therefore, this disclosure achieves model privacy.
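The partitioned prediction described above can be sketched as follows: each worker keeps a private mapping from iteration index to coefficient and shares only its partial sum. The data layout (dictionaries keyed by iteration index) and the function names are assumptions made for illustration, and the feature values are taken as already computed.

```python
def local_prediction(local_alphas, feature_values):
    """Partial prediction of one worker: sum of alpha_i * phi_i over its own indices."""
    return sum(a * feature_values[i] for i, a in local_alphas.items())


def global_prediction(workers, feature_values):
    """Sum of the partial predictions; only these sums leave the workers."""
    return sum(local_prediction(w, feature_values) for w in workers)


# Hypothetical example: four coefficients split across two workers.
feature_values = {0: 1.0, 1: -0.5, 2: 0.25, 3: 2.0}   # phi_{omega_i}(x) per iteration i
workers = [{0: 0.2, 2: 0.4}, {1: 1.0, 3: -0.1}]       # private coefficient sets
f_x = global_prediction(workers, feature_values)
```

Because each worker here holds two coefficients, a single partial sum does not determine either coefficient on its own, which reflects the condition stated in the text that a worker should hold at least two coefficients.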

[0095] Tree-structured communication: To obtain $\omega_i^T x_i$ and $f(x_i)$, this disclosure needs to accumulate the local results from different workers. This disclosure uses an efficient tree-structured communication scheme to obtain the global sum, which is faster than simple star-structured and ring-structured communication strategies. The tree structure described by Zhang et al., 2018 is incorporated herein by reference in its entirety. Figure 2A and Figure 2B schematically depict two examples, T1 and T2, of tree-structured communication. As shown in Figure 2A, communication T1 involves four workers, with communication values of 6, 5, -7, and 2 for workers 1, 2, 3, and 4, respectively. This disclosure pairs the workers so that while worker 1 adds the result of worker 2, worker 3 can simultaneously add the result of worker 4. Finally, the results of the two pairs of workers are sent to the coordinator, and this disclosure obtains the global sum. In some embodiments, when the above steps are performed in reverse order, this disclosure refers to the process as reverse-order tree-structured communication. Similarly, as shown in Figure 2B, communication T2 involves three workers.
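The pairwise aggregation in the four-worker example can be sketched as a simple reduction. This is a single-process illustration of the summation pattern only, not of the network communication; the function name and the handling of an unpaired worker are assumptions.

```python
def tree_sum(values):
    """Pairwise (tree-structured) reduction; returns (total, number_of_rounds)."""
    if not values:
        raise ValueError("at least one worker value is required")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Adjacent workers are paired; all pairs in a round could run in parallel.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired worker passes its value up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_sum([6, 5, -7, 2])  # values of workers 1 to 4 in the example
```

With the values 6, 5, -7, and 2, the first round produces 11 and -5, and the second round produces the global sum 6. The number of rounds grows logarithmically with the number of workers, whereas a ring-style accumulation passes through the workers one after another.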

[0096] FDSKL algorithm: To extend the double stochastic gradient (DSG) method to federated learning on vertically partitioned data while maintaining data privacy, this disclosure needs to carefully design the procedures for computing $\omega_i^T x_i + b$ and $f(x_i)$ and for updating the solution. In some embodiments, these procedures are detailed in steps 1-3 below and illustrated by Algorithm 1, which refers to Algorithms 2 and 3. In some embodiments, FDSKL uses a constant learning rate $\gamma$, which is more easily implemented in a parallel computing environment, in contrast to the decreasing learning rate used in DSG.

[0097] 1. Computing $\omega_i^T x_i + b$: This disclosure generates the random direction $\omega_i$ at each worker according to the same random seed $i$ and the probability measure $\mathbb{P}(\omega)$. Therefore, this disclosure can compute $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l}$ locally. To keep $(x_i)_{\mathcal{G}_l}$ private, instead of directly transmitting $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l}$ to other workers, this disclosure generates $b^l$ uniformly at random from $[0, 2\pi]$ and transmits $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l} + b^l$ to another worker. After all workers have computed $(\omega_i)_{\mathcal{G}_l}^T (x_i)_{\mathcal{G}_l} + b^l$ locally, this disclosure can efficiently and securely obtain the global sum $\sum_{\hat{l}=1}^{q}\left((\omega_i)_{\mathcal{G}_{\hat{l}}}^T (x_i)_{\mathcal{G}_{\hat{l}}} + b^{\hat{l}}\right)$ by using a tree-structured communication scheme based on the tree structure T1 for the workers {1, ..., q}.

[0098] At this point, for the $l$-th worker, the global sum contains $q$ values of $b$. To recover the value of $\omega_i^T x_i + b$, this disclosure selects one $b^{l'}$ with $l'$ from {1, ..., q} - {l} as the value of $b$ by removing the other values of $b^{\hat{l}}$ (i.e., removing $\bar{b}^{l'} = \sum_{\hat{l} \neq l'} b^{\hat{l}}$). To prevent leaking any information about $b^{\hat{l}}$, this disclosure computes $\bar{b}^{l'}$ for the workers {1, ..., q} - {l'} using a completely different tree structure T2 (two tree structures T1 and T2 are completely different if there is no subtree with more than one leaf that belongs to both T1 and T2). Algorithm 3 summarizes the detailed steps for computing $\omega_i^T x_i + b$.
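The masking arithmetic of this step can be sketched as follows: every worker adds a private random offset to its partial dot product, the masked values are summed, and all offsets except one are then subtracted, so the result is the full dot product plus a single random offset. The sketch below uses plain sums in place of the two tree-structured aggregations and runs in one process; the function names and the `keep` parameter are hypothetical.

```python
import math
import random


def masked_partials(omega_parts, x_parts, rng):
    """Each worker computes its partial dot product plus a private mask b^l."""
    shared, masks = [], []
    for w, x in zip(omega_parts, x_parts):
        b = rng.uniform(0.0, 2.0 * math.pi)
        shared.append(sum(wi * xi for wi, xi in zip(w, x)) + b)  # value that is sent
        masks.append(b)                                          # stays with the worker
    return shared, masks


def adjusted_dot(omega_parts, x_parts, keep, rng):
    """Return (omega^T x + b, b), where b is the mask of worker `keep`."""
    shared, masks = masked_partials(omega_parts, x_parts, rng)
    total = sum(shared)  # aggregated over the first tree structure in the text
    # Sum of the masks to remove; aggregated over a different tree in the text.
    removed = sum(b for l, b in enumerate(masks) if l != keep)
    return total - removed, masks[keep]


# Hypothetical example with three workers holding 2, 1, and 2 features.
omega_parts = [[1.0, 2.0], [0.5], [-1.0, 3.0]]
x_parts = [[0.1, 0.2], [4.0], [1.0, 1.0]]
value, b = adjusted_dot(omega_parts, x_parts, keep=1, rng=random.Random(0))
```

In this example the unmasked dot product is 4.5, so `value - b` recovers 4.5 up to floating-point error, while no worker ever sends its raw partial dot product.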

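[0099] 2. Computing $f(x_i)$: According to

[0100] $f(\cdot) = \sum_{j=1}^{t}\alpha_j^t\,\phi_{\omega_j}(\cdot)$,

[0101] this disclosure has $f(x_i) = \sum_{j=1}^{t}\alpha_j^t\,\phi_{\omega_j}(x_i)$. However, $\alpha_j^t$ and $\phi_{\omega_j}(x_i)$ are stored in different workers. Therefore, this disclosure first computes $f_l(x_i) = \sum_{j \in \Lambda^l}\alpha_j^t\,\phi_{\omega_j}(x_i)$ locally, as summarized in Algorithm 2. By using a tree-structured communication scheme, this disclosure can efficiently obtain the global sum $\sum_{l=1}^{q} f_l(x_i)$, which is equal to $f(x_i)$ (see line 7 in Algorithm 1).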

[0102] 3. Update rules: Because the coefficients $\alpha_i$ are stored in different workers, this disclosure uses a reverse-order tree-structured communication scheme to update the coefficients $\alpha_i$ in each worker by the factor $(1 - \gamma\lambda)$ (see line 10 in Algorithm 1).

[0103] Based on these key steps, this disclosure summarizes the FDSKL algorithm in Algorithm 1. It should be noted that, unlike the decreasing learning rate used in DSG, the FDSKL of this disclosure uses a constant learning rate, which is more easily implemented in parallel computing environments. However, convergence analysis with a constant learning rate is more difficult than convergence analysis with a decreasing learning rate.
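The update rule of step 3 can be sketched as follows: the shrink factor (1 - γλ) is applied to every coefficient held by every worker, and the newly computed coefficient is stored at one worker. This is a minimal single-process sketch; storing the new coefficient at the active worker, the dictionary layout, and the function name are assumptions made for illustration.

```python
def apply_update(workers, active, new_index, new_alpha, gamma, lam):
    """Shrink all stored coefficients by (1 - gamma * lam), then store the new one."""
    shrink = 1.0 - gamma * lam
    for coeffs in workers:
        for i in coeffs:
            coeffs[i] *= shrink            # each worker rescales its private coefficients
    workers[active][new_index] = new_alpha  # assumed: new coefficient stays at the active worker


# Hypothetical state: two workers, one coefficient each, before iteration 2.
workers = [{0: 1.0}, {1: 2.0}]
apply_update(workers, active=0, new_index=2, new_alpha=0.5, gamma=0.1, lam=0.5)
```

Only the scalar shrink factor needs to reach the workers, so no coefficient leaves the worker that owns it.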

[0104]

[0105]

[0106] In some embodiments, the output It is line 5 in Algorithm 1. .

[0107] Theoretical Analysis:

[0108] In some embodiments, this disclosure demonstrates that FDSKL converges to the optimal solution at a rate of O(1 / t), as shown in Theorem 1.

[0109] Assumption 1: Assume the following conditions are true:

[0110] 1. There exists an optimal solution to the objective problem, denoted as $f_*$.

[0111] 2. The derivative of $L(u, y)$ with respect to its first argument has an upper bound, that is, $|L'(u, y)| < M$.

[0112] 3. The loss function $L(u, y)$ and its first derivative are L-Lipschitz continuous with respect to the first argument.

[0113] 4. The kernel value has an upper bound $\kappa$, i.e., $K(x, x') \leq \kappa$. The random feature mapping has an upper bound $\phi$, i.e., $|\phi_\omega(x)\phi_\omega(x')| \leq \phi$.

[0114] Theorem 1:

[0115] Under assumption 1, let the value be... Set Algorithm 1 , This disclosure will be made in [year]. Later reached , in , , as well as .

[0116] In some embodiments, this disclosure demonstrates that FDSKL can prevent inference attacks (as defined in Definition 1 below) under the semi-honesty assumption (Assumption 2 below):

[0117] Definition 1 (Inference attack): An inference attack on the $l$-th worker is an attack that infers a certain feature group $\mathcal{G}$ of a sample $x_i$ that belongs to other workers, without directly accessing it.

[0118] Assumption 2 (Semi-honest security): All workers will follow the protocol or algorithm to perform correct computations. However, they may retain records of intermediate computation results, which can be used later to infer data for other workers.

Detailed Implementation

[0119] The present disclosure is described in more detail in the following examples, which are intended to be illustrative only, as many modifications and variations therein will be apparent to those skilled in the art. Various embodiments of the present disclosure are now described in detail. Referring to the accompanying drawings, the same numerals indicate the same parts throughout the views. Unless the context clearly dictates otherwise, the terms “a,” “an,” and “the” as used herein and throughout the claims include plural reference. Furthermore, as used in the description and claims of this disclosure, unless the context clearly dictates otherwise, “in” includes “in” and “on”. Additionally, headings or subheadings may be used in the specification for the reader's convenience, but these do not affect the scope of the present disclosure. Furthermore, some terms used in this specification are given more specific definitions below.

[0120] The terms used in this specification generally have their ordinary meaning in the art, in the context of this disclosure, and in the specific context in which each term is used. Certain terms used to describe this disclosure are discussed below or elsewhere in the specification to provide practitioners with additional guidance regarding the description of this disclosure. It will be understood that the same thing can be expressed in more than one way. Therefore, alternative language and synonyms may be used for any one or more terms discussed herein, and have no particular significance in whether a term is elaborated or discussed herein. This disclosure provides synonyms for certain terms. The use of one or more synonyms does not preclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any terms discussed herein, is merely illustrative and in no way limits the scope and meaning of this disclosure or any exemplary terms. Likewise, this disclosure is not limited to the various embodiments given in this specification.

[0121] As described herein, the term "module" can refer to or include application-specific integrated circuits (ASICs); electronic circuits; combinational logic circuits; field-programmable gate arrays (FPGAs); processors (shared, dedicated, or grouped) that execute code; other suitable hardware components that provide the described functionality; or some or all of the above, such as in a system-on-a-chip. The term "module" can include memory (shared, dedicated, or grouped) that stores code executed by the processor.

[0122] As described herein, the term "code" can include software, firmware, and / or microcode, and can refer to programs, routines, functions, classes, and / or objects. The term "shared" as used above means that some or all of the code from multiple modules can be executed using a single (shared) processor. Furthermore, some or all of the code from multiple modules can be stored in a single (shared) memory. The term "group" as used above means that some or all of the code from a single module can be executed using a group of processors. Furthermore, a group of memories can be used to store some or all of the code from a single module.

[0123] As described herein, the term "interface" generally refers to a communication tool or device used at the interaction point between components to perform data communication between components. Generally, interfaces can be applied at both the hardware and software levels, and can be unidirectional or bidirectional. Examples of physical hardware interfaces can include electrical connectors, buses, ports, cables, terminals, and other I / O devices or components. Components communicating with the interface can be, for example, multiple components of a computer system or peripheral devices.

[0124] This disclosure relates to computer systems. As shown in the accompanying drawings, computer components may include physical hardware components indicated by solid lines and virtual software components indicated by dashed lines. Those skilled in the art will understand that, unless otherwise stated, these computer components may be implemented as software, firmware, or hardware components, or combinations thereof, but are not limited to these forms.

[0125] The apparatus, systems, and methods described herein can be implemented by one or more computer programs executed by one or more processors. The computer program includes processor-executable instructions stored on a non-transitory tangible computer-readable medium. The computer program may also include stored data. Non-limiting examples of non-transitory tangible computer-readable media are non-volatile memory, magnetic memory, and optical memory.

[0126] This disclosure will now be described more fully below with reference to the accompanying drawings, in which embodiments of the disclosure are illustrated. However, this disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.

[0127] Figure 3A schematically depicts a federated learning system according to certain embodiments of this disclosure. As shown in Figure 3A, system 300 includes a coordinator 310 and multiple workers 350 that communicate with each other via network 330. In some embodiments, each of the coordinator 310 and the workers 350 shown in Figure 3A can be a server computer, cluster, cloud computer, general-purpose computer, headless computer, or special-purpose computer that provides federated learning capabilities. In some embodiments, each of the coordinator 310 and the workers 350 is a server computing device. In some embodiments, each worker 350 includes a model and privacy data for federated learning, and the coordinator 310 includes a mechanism for collecting certain data from the workers 350 using a tree structure. Each worker 350 can initiate federated learning as an active worker, and an active worker can invoke one or more other workers as passive workers. Active and passive workers train the federated learning model together, but the privacy data stays protected within its own worker and is not shared with other workers. In some embodiments, the federated learning model is Federated Doubly Stochastic Kernel Learning (FDSKL), and the privacy data is organized as vertically partitioned data. The network 330 can be a wired or wireless network, and can be a public or private network. Examples of network 330 may include, but are not limited to, a local area network (LAN) or a wide area network (WAN) including the Internet. In some embodiments, two or more different networks 330 may be used to connect the coordinator 310 and the workers 350.

[0128] Figure 3B schematically depicts a worker 350 according to certain embodiments of the present disclosure. In some embodiments, worker 350 is a server computing device and acts as an active worker. However, worker 350 can be any of workers 350-1 to 350-q, because each worker can initialize FDSKL training and operate as an active worker, and each worker can also operate as a passive worker and provide information to the active worker. As shown in Figure 3B, worker 350 may include, but is not limited to, processor 352, memory 354, and storage device 356. In some embodiments, worker 350 may include other hardware and software components (not shown) to perform its respective tasks. Examples of such hardware and software components may include, but are not limited to, other required memory, interfaces, buses, input / output (I / O) modules or devices, network interfaces, and peripheral devices.

[0129] Processor 352 may be a Central Processing Unit (CPU) configured to control the operation of worker 350. Processor 352 may execute an operating system (OS) or other application on worker 350. In some embodiments, worker 350 may have multiple CPUs as processors, such as two CPUs, four CPUs, eight CPUs, or any suitable number of CPUs.

[0130] Memory 354 may be volatile memory, such as random-access memory (RAM), used to store data and information during the operation of worker 350. In some embodiments, memory 354 may be an array of volatile memory. In some embodiments, worker 350 may operate on multiple memories 354. In some embodiments, worker 350 may also include a graphics card to assist processor 352 and memory 354 in image processing and display.

[0131] Storage device 356 is a non-volatile data storage medium used to store the operating system (not shown) and other applications of worker 350. Examples of storage device 356 may include non-volatile memory such as flash memory, memory card, USB drive, hard disk drive, floppy disk, optical drive, or any other type of data storage device. In some embodiments, worker 350 may have multiple storage devices 356, which may be the same type of storage device or different types of storage devices, and the applications of worker 350 may be stored in one or more storage devices 356 within worker 350.

[0132] In this embodiment, processor 352, memory 354, and storage device 356 are components of worker 350 (e.g., a server computing device). In other embodiments, worker 350 may be a distributed computing device, where processor 352, memory 354, and storage device 356 are shared resources from multiple computers in a predetermined area.

[0133] Storage device 356 specifically includes an FDSKL application 358 and privacy data 372. FDSKL application 358 includes a listener 360, a parameter module 362, a sampling module 364, a random feature module 366, an output prediction module 368, and a model coefficient module 370. In some embodiments, storage device 356 may include other applications or modules necessary for the operation of FDSKL application 358. It should be noted that modules 360-370 are all implemented using computer-executable code or instructions, or data tables or databases, and together they constitute an application. In some embodiments, each module may also include sub-modules. Alternatively, some modules may be combined into a stack. In other embodiments, some modules may be implemented as circuits rather than executable code. In some embodiments, a module or combination of modules may also be referred to as a model, which may have multiple parameters that can be learned through training, and the model with trained parameters can be used for prediction.

[0134] Listener 360 is configured to initialize FDSKL training upon receiving a training instruction and to send a notification to parameter module 362. The instruction can be received from an administrator or user of worker 350. In this case, worker 350 acts as an active worker. In some embodiments, when a request is received from an active worker, listener 360 can also instruct parameter module 362 and the other relevant modules to compute information and provide it to the active worker. In this case, worker 350 acts as a passive worker. The information provided by the passive worker may include the predicted output corresponding to the sample and the adjusted dot product of the random direction and the sample. It should be noted that the modules of FDSKL application 358 in the active and passive workers are essentially the same, and operations of the application in the active worker can invoke certain functions of the application in the passive worker. Unless otherwise stated, the following modules are described with respect to the active worker.

[0135] The parameter module 362 is configured to provide the sampling module 364 with the parameters of the FDSKL application upon receiving a notification from the listener 360 that FDSKL training should be performed. These parameters include, for example, the distribution $P(\omega)$, the regularization parameter λ, and the constant learning rate γ. In some embodiments, the distribution or measure $P(\omega)$ is a Gaussian distribution. In some embodiments, the regularization parameter λ, the constant learning rate γ, and the measure $P(\omega)$ are predefined.

[0136] Sampling module 364 is configured to select an instance or sample $x_i$ with index i from privacy data 372 when the parameters for the FDSKL application are available. For example, the privacy data 372, i.e., the local data in worker 350, can include a customer's online spending, loan, and repayment information. Index i is used to identify the customer and can be the customer's personal ID. Sampling module 364 is configured to select instances randomly; therefore, index i is also called the random seed. The sampled instance can include a set of customer attributes, each corresponding to a customer record. A customer's records can include their monthly online spending, the amount and number of their loans, their repayment history, etc.

[0137] The sampling module 364 is also configured to send index i to other relevant workers via coordinator 310 when index i is available. Other relevant workers are workers available in system 300 or related to the model training of the active workers; these relevant workers are defined as passive workers.

[0138] Sampling module 364 is also configured to sample the random direction $\omega_i$ from the distribution $P(\omega)$ using index i, and to send the instance $x_i$ and the random direction $\omega_i$ to the random feature module 366 and the output prediction module 368. Because the instance $x_i$ is randomly selected from privacy data 372, the corresponding index i can also be regarded as a random value. Therefore, index i is used as the random seed for sampling the random direction $\omega_i$ from the distribution $P(\omega)$.
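
As a non-limiting illustration of how a shared index can act as a random seed, the following Python sketch lets two workers derive the same random direction from index i without exchanging the direction itself. The function name sample_direction, the dimension, and the standard deviation are hypothetical, and Python's built-in generator merely stands in for whatever generator an embodiment uses.

```python
import random


def sample_direction(index, dim, sigma=1.0):
    """Sample a Gaussian random direction, using the instance index as the seed."""
    rng = random.Random(index)  # index i doubles as the random seed
    return [rng.gauss(0.0, sigma) for _ in range(dim)]


# Two workers that only share the index i derive the same direction.
omega_active = sample_direction(index=42, dim=5)
omega_passive = sample_direction(index=42, dim=5)
```

Because the direction is a deterministic function of the index, only i needs to travel between workers in this sketch.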

[0139] The random feature module 366 is configured to, upon receiving the instance $x_i$ and the random direction $\omega_i$: calculate the dot product $\omega_i^T x_i$ of the random direction and the instance; add a random number b to the dot product to obtain an adjusted dot product, and save the adjusted dot product locally; calculate the random feature from the adjusted dot product; and send the adjusted dot product to the output prediction module 368 and the random feature to the model coefficient module 370. In some embodiments, the adjusted dot product is obtained using the formula $\omega_i^T x_i + b$, where b is a random number in the range $[0, 2\pi]$. In some embodiments, the random feature $\phi_{\omega_i}(x_i)$ is obtained using the equation $\phi_{\omega_i}(x_i) = \sqrt{2}\cos(\omega_i^T x_i + b)$.
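
As a non-limiting illustration, the adjusted dot product and the random feature can be sketched in Python as follows. The sketch assumes the common random Fourier feature form, i.e., the square root of 2 times the cosine of the adjusted dot product, with b drawn uniformly from [0, 2π]; the function names and the toy vectors are hypothetical.

```python
import math
import random


def adjusted_dot_product(omega, x, b):
    """Dot product of the random direction and the instance, shifted by b."""
    return sum(w * v for w, v in zip(omega, x)) + b


def random_feature(adjusted):
    """Assumed random Fourier feature: sqrt(2) * cos(adjusted dot product)."""
    return math.sqrt(2.0) * math.cos(adjusted)


rng = random.Random(0)
omega = [0.5, -1.0, 2.0]             # toy random direction
x = [1.0, 2.0, 0.5]                  # toy instance
b = rng.uniform(0.0, 2.0 * math.pi)  # random number in [0, 2*pi]
z = adjusted_dot_product(omega, x, b)
phi = random_feature(z)
```

The random feature is bounded in magnitude by the square root of 2 regardless of the data.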

[0140] The output prediction module 368 is configured to calculate the predicted output value of the sample $x_i$, instruct the output prediction modules 368 of the other relevant workers to calculate their respective predicted output values, calculate the final predicted output by adding the predicted output value of the active worker and the predicted output values of the passive workers, and send the final predicted output to the model coefficient module 370. In some embodiments, each of the active worker and the passive workers is configured to invoke Algorithm 2 described above to calculate its predicted output value. In some embodiments, the predicted output values are communicated based on a tree structure T0. In some embodiments, the tree structure T0 is the same as or similar to the structure shown in Figure 2A or Figure 2B.

[0141] The model coefficient module 370 is configured to, upon receiving the random feature $\phi_{\omega_i}(x_i)$ from the random feature module 366 and the final predicted value $f(x_i)$ from the output prediction module 368, calculate the model coefficient $\alpha_i$ corresponding to the instance $x_i$, save the model coefficient locally, and update all previous model coefficients. In some embodiments, the model coefficient module 370 is configured to calculate the model coefficient using the equation $\alpha_i = -\gamma L'(f(x_i), y_i)\,\phi_{\omega_i}(x_i)$. Here, $f(x_i)$ is the final predicted output of the instance $x_i$, $y_i$ is the true value of the instance $x_i$, and $L'(f(x_i), y_i)$ is the first-order derivative of the loss function, which is based on the difference between $f(x_i)$ and $y_i$. While calculating and saving the model coefficient, the model coefficient module 370 is also configured to update the previous coefficients in the model. For example, if there are j previous model coefficients, the equation $\alpha_j = (1 - \gamma\lambda)\alpha_j$ is used to update each of the j previous model coefficients. In some embodiments, … is equal to or greater than 0, and … is greater than 0.
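
As a non-limiting illustration, one coefficient update can be sketched in Python as follows. The sketch assumes a squared loss, a new coefficient equal to minus the learning rate times the loss derivative times the random feature, and a shrinkage of every previous coefficient by the factor (1 - γλ); these forms, the function names, and the toy numbers are assumptions of the sketch.

```python
def squared_loss_derivative(prediction, target):
    """Derivative of 0.5 * (prediction - target) ** 2 w.r.t. the prediction."""
    return prediction - target


def update_coefficients(alphas, phi, prediction, target, gamma, lam):
    """Return the coefficient list after one training iteration.

    alphas: previous model coefficients of this worker
    phi: random feature of the current instance
    prediction: final predicted value summed over all workers
    gamma, lam: constant learning rate and regularization parameter
    """
    # Shrink every previous coefficient (assumed regularization step).
    decayed = [(1.0 - gamma * lam) * a for a in alphas]
    # Create the coefficient for the current instance (assumed gradient step).
    new_alpha = -gamma * squared_loss_derivative(prediction, target) * phi
    return decayed + [new_alpha]


updated = update_coefficients(
    alphas=[1.0, -2.0], phi=0.5, prediction=3.0, target=1.0, gamma=0.1, lam=0.5
)
```

Each iteration therefore adds exactly one coefficient, so the model grows with the number of sampled instances.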

[0142] Privacy data 372 stores data specific to worker 350. Privacy data 372 comprises numerous instances or samples, each of which can be indexed. The privacy data 372 stored in different workers is different, but it can be indexed or linked in the same way, for example, by customer identification. In one example, the first worker 350-1 could be a server computing device in a digital financial company, whose privacy data includes online spending, loan, and repayment information. The second worker 350-2 could be a server computing device in an e-commerce company, whose privacy data includes online shopping information. The third worker 350-3 could be a server computing device in a bank, whose privacy data includes customer information such as average monthly deposits and account balances. If someone submits a loan application to the digital financial company, the digital financial company can combine the information stored in these three workers to assess the credit risk of the loan. Therefore, for the assessment, the first worker 350-1 can act as the active worker that initiates the process, while the second and third workers 350-2 and 350-3 can operate as passive workers. These three workers do not share privacy data. However, since some customers of the digital financial company, the e-commerce company, and the bank are the same, these customers can be indexed and linked so that the FDSKL model or FDSKL application 358 can utilize their privacy data in the three workers. In some embodiments, each of the three workers 350-1, 350-2, and 350-3 has an FDSKL application 358 installed, and each worker can initialize FDSKL training as an active worker. It should be noted that, for each worker, privacy data 372 can be accessed by that worker's own FDSKL application 358.
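
As a non-limiting illustration of vertically partitioned data linked by customer identification, the following Python sketch keeps one dictionary per worker. All customer IDs, attribute names, and values are hypothetical toy data, and the intersection is computed in a single process only for readability; how workers agree on shared IDs without revealing the remaining IDs is outside this sketch.

```python
# Hypothetical toy records; each worker keeps its own feature group per customer ID.
finance_worker = {
    "c001": {"monthly_spend": 320.0, "open_loans": 2},
    "c002": {"monthly_spend": 75.5, "open_loans": 0},
    "c003": {"monthly_spend": 910.0, "open_loans": 1},
}
ecommerce_worker = {
    "c001": {"orders_per_month": 4},
    "c003": {"orders_per_month": 11},
    "c004": {"orders_per_month": 1},
}
bank_worker = {
    "c001": {"avg_deposit": 1500.0, "balance": 8200.0},
    "c002": {"avg_deposit": 300.0, "balance": 950.0},
    "c003": {"avg_deposit": 4100.0, "balance": 20350.0},
}


def shared_customer_ids(*workers):
    """Return the sorted IDs that are present in every worker's local data."""
    common = set(workers[0])
    for local_data in workers[1:]:
        common &= set(local_data)
    return sorted(common)


def local_features(worker, customer_id):
    """Return one worker's feature values for a customer, in a fixed key order."""
    record = worker[customer_id]
    return [record[name] for name in sorted(record)]


ids = shared_customer_ids(finance_worker, ecommerce_worker, bank_worker)
```

Each worker only ever reads its own dictionary through local_features, which mirrors the statement that privacy data is accessed by the worker's own application.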

[0143] In some embodiments, the FDSKL application 358 may also include a user interface. This user interface is configured to provide a user interface or graphical user interface within the worker 350. In some embodiments, the user is able to configure or modify the parameters used for training or using the FDSKL application 358.

[0144] Figure 4 schematically depicts an FDSKL training process according to certain embodiments of the present disclosure. In some embodiments, the training process is performed by a server computing device, such as the worker 350 shown in Figure 3B, and specifically by the FDSKL application 358. It should be noted that, unless otherwise stated in this disclosure, the steps of the FDSKL training process or method may be arranged in a different order and are therefore not limited to the order shown in Figure 4. In some embodiments, training process 400 corresponds to Algorithm 1.

[0145] As shown in Figure 4, in step 402, the listener 360 of the active worker 350 receives a notification from the administrator or user that an FDSKL training process is required and, in response, initializes the FDSKL training process by sending the notification to the parameter module 362. The parameter module 362 accordingly provides the parameters of the FDSKL application 358 to the sampling module 364. These parameters include, for example, the distribution $P(\omega)$, the regularization parameter λ, and the constant learning rate γ.

[0146] In step 404, the sampling module 364 randomly selects an instance or sample with index i from the privacy data 372. The index i is then sent to the passive worker. An instance may include multiple attributes. In some embodiments, the random seed i is stored in each active and passive worker for later use.

[0147] In step 406, the sampling module 364 samples the random direction $\omega_i$ from the distribution $P(\omega)$ based on index i, and sends the instance $x_i$ and the sampled random direction $\omega_i$ to the random feature module 366 and the output prediction module 368.

[0148] In step 408, when the instance $x_i$ and the random direction $\omega_i$ are available, the random feature module 366 calculates the dot product $\omega_i^T x_i$ of the random direction and the instance, adds a random number b to obtain the adjusted dot product, and saves the adjusted dot product locally. In some embodiments, the adjusted dot product is obtained using the formula $\omega_i^T x_i + b$, where b is a random number in the range $[0, 2\pi]$.

[0149] In step 410, after obtaining the value of $\omega_i^T x_i + b$, the random feature module 366 uses the function $\phi_{\omega_i}(x_i) = \sqrt{2}\cos(\omega_i^T x_i + b)$ to calculate the random feature $\phi_{\omega_i}(x_i)$. The random feature is then sent to the output prediction module 368 and the model coefficient module 370.

[0150] In step 412, when the sample $x_i$, the weight values, the measure $P(\omega)$, and the model coefficients $\alpha_i$ are available, the output prediction module 368 of each worker calculates the predicted value of the sample $x_i$. Here, the sample $x_i$ and the model coefficients $\alpha_i$ are specific to each worker; different workers generally have different features $x_i$ and different model coefficients $\alpha_i$. However, the distribution $P(\omega)$ and the index i are the same across workers, so the workers have the same random direction $\omega_i$, which is sampled from the distribution $P(\omega)$ based on index i. In some embodiments, a worker may not have any model coefficients before any training. After each training iteration, a corresponding model coefficient α is created and added to the model. In some embodiments, the output prediction module 368 of the active worker calculates its predicted value for the sample $x_i$ and also coordinates with the passive workers so that they calculate their respective predicted values for their respective samples $x_i$. In some embodiments, the output prediction module 368 of the active worker coordinates with the passive workers through the coordinator 310.

[0151] In step 414, when the respective predicted output values are available in the workers, the coordinator 310 uses a tree-structured communication scheme to send the predicted values from the passive workers to the active worker, and the output prediction module 368 uses the predicted values from the active worker and the passive workers to obtain the final predicted value $f(x_i)$ of the sample $x_i$. The final predicted value is then sent to the model coefficient module 370. In some embodiments, the calculation is performed using $f(x_i) = \sum_{l=1}^{q} f^{l}(x_i)$, where $f^{l}(x_i)$ is the predicted value from the l-th of the q workers.
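
As a non-limiting illustration of tree-structured aggregation, the following Python sketch sums one partial predicted value per worker by pairwise reduction. The balanced pairwise shape and the function name tree_sum are assumptions of the sketch; the actual tree structure used in an embodiment may differ, and the list of values is assumed to be non-empty.

```python
def tree_sum(values):
    """Pairwise (binary-tree) aggregation of one partial value per worker.

    Returns the total and the number of communication rounds used.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Neighbouring workers are paired and their values are added.
        merged = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # an unpaired worker passes its value up
        level = merged
        rounds += 1
    return level[0], rounds


final_prediction, rounds = tree_sum([1.0, 2.0, 3.0, 4.0, 5.0])
```

With five workers the sketch needs three rounds, whereas forwarding every value to one worker in sequence would need four messages at that worker.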

[0152] In step 416, upon receiving the random feature $\phi_{\omega_i}(x_i)$ from the random feature module 366 and the final predicted value $f(x_i)$ from the output prediction module 368, the model coefficient module 370 calculates the model coefficient $\alpha_i$ of the active worker corresponding to the instance $x_i$. In some embodiments, similar steps are performed on all passive workers, and each passive worker has its own newly calculated model coefficient.

[0153] In step 418, after calculating the model coefficients, the model coefficient module 370 of the active worker updates all previous model coefficients.

[0154] In some embodiments, the passive workers similarly perform the above steps in parallel. Specifically, after receiving index i from the active worker via coordinator 310, a passive worker: selects the instance $x_i$ corresponding to index i from its local privacy data 372; samples the random direction $\omega_i$ corresponding to index i (or alternatively receives the random direction $\omega_i$ from the active worker); calculates the dot product $\omega_i^T x_i$ between the random direction and the instance; obtains the adjusted dot product $\omega_i^T x_i + b$ by adding a random value b in the range $[0, 2\pi]$; calculates the random feature $\phi_{\omega_i}(x_i)$ from the adjusted dot product; calculates its own predicted output according to Algorithm 2; calculates the final or overall predicted output $f(x_i)$ by summing the predicted outputs of the active worker and all passive workers; calculates the model coefficient $\alpha_i$ specific to this worker and saves it locally; and updates all previous model coefficients.

[0155] It should be noted that the active worker and the passive workers share the same index i and the same random direction $\omega_i$, but each has its own instance $x_i$. Furthermore, the random value b used to calculate the adjusted dot product differs between workers. Additionally, each worker has its own predicted output corresponding to its own instance $x_i$, but all workers use the same final predicted output $f(x_i)$, that is, the sum of the predicted outputs of all workers.

[0156] In some embodiments, after the model is trained through the above process, prediction can be performed in a similar manner. The differences between prediction and training include, for example: one instance is provided for prediction, whereas training uses multiple randomly selected instances; prediction uses the provided instance, whereas training iterates over randomly selected instances; and prediction stops at step 414, because the prediction for the provided instance is then complete, whereas training goes on to update the model parameters in steps 416 and 418.

[0157] Figure 5 schematically depicts a process of computing the predicted value of a sample using an FDSKL model according to certain embodiments of this disclosure. In some embodiments, the process is performed by a server computing device, such as the worker 350 shown in Figure 3B, and specifically by the FDSKL application 358. It should be noted that, unless otherwise stated in this disclosure, the steps of the process or method may be arranged in a different order and are therefore not limited to the order shown in Figure 5. In some embodiments, the process corresponds to step 412 in Figure 4 and is performed by the output prediction module 368 of the active worker. In some embodiments, the output prediction module 368 of either the active worker or a passive worker can perform the process of Figure 5. Once the predicted values from all workers have been calculated, the final predicted value can be calculated in step 414 above by adding the predicted values from all workers. In some embodiments, process 500 corresponds to Algorithm 2.

[0158] As shown in Figure 5, when the distribution $P(\omega)$, the model coefficients of the l-th active worker, the corresponding iteration index set, and the sample x are available, the process of calculating the predicted value can be executed iteratively a predetermined number of times. In some embodiments, the number of iterations is defined by the user based on the data to be analyzed.

[0159] In step 502, the output prediction module 368 in the l-th active worker sets the predicted value $f^{l}(x)$ to 0.

[0160] In step 504, for each iteration of the calculation using sample x, the output prediction module 368 samples the random direction $\omega_i$ from the distribution $P(\omega)$ using the corresponding random seed i. In some embodiments, the random seed i is saved, for example, in step 404 described above, and retrieved in step 504.

[0161] In step 506, if the adjusted dot product $\omega_i^T x + b_i$ is saved locally, the output prediction module 368 retrieves it; otherwise, the output prediction module 368 instructs the calculation of $\omega_i^T x + b_i$ using Algorithm 3.

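[0162] In step 508, the output prediction module 368 uses the function $\phi_{\omega_i}(x) = \sqrt{2}\cos(\omega_i^T x + b_i)$ to calculate the random feature $\phi_{\omega_i}(x)$ based on $\omega_i^T x + b_i$. In some embodiments, the random direction and the random number are different for each iteration.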

[0163] In step 510, the output prediction module 368 uses $f^{l}(x) = f^{l}(x) + \alpha_i \phi_{\omega_i}(x)$ to calculate the predicted value on the local worker.
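
As a non-limiting illustration, steps 502 to 510 can be sketched in Python as an accumulation over the indices for which a worker holds a model coefficient. The sketch assumes that the adjusted dot products are already available (saved locally or obtained through Algorithm 3) and that the random feature has the cosine form used for random Fourier features; the dictionary layout and the function name local_prediction are hypothetical.

```python
import math


def local_prediction(alphas, adjusted_dots):
    """Predicted value contributed by one worker for one sample.

    alphas: {index i: model coefficient alpha_i} held by this worker
    adjusted_dots: {index i: adjusted dot product for that index and the sample}
    """
    f_l = 0.0  # step 502: start from zero
    for i, alpha in alphas.items():
        # Step 508: assumed random feature sqrt(2) * cos(adjusted dot product).
        phi = math.sqrt(2.0) * math.cos(adjusted_dots[i])
        # Step 510: accumulate coefficient times random feature.
        f_l += alpha * phi
    return f_l


prediction = local_prediction(
    alphas={3: 0.5, 7: -1.0},
    adjusted_dots={3: 0.0, 7: math.pi},
)
```

The cost of one prediction grows linearly with the number of stored coefficients, which is why each worker keeps its coefficients and saved adjusted dot products locally.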

[0164] In some embodiments, after the model is trained, the steps of Figure 5 can be used for prediction. Specifically, given a new sample x with index i, the corresponding random direction $\omega_i$ can be obtained from the distribution $P(\omega)$. The model then uses the sample and the random direction to calculate the adjusted dot product, uses the adjusted dot product to calculate the random feature, and uses the random feature to calculate the prediction corresponding to the new sample x.

[0165] Figure 6 schematically depicts a process of computing an adjusted dot product using an FDSKL model according to certain embodiments of this disclosure. In some embodiments, the process is performed by a server computing device, such as the worker 350 shown in Figure 3B, and specifically by the FDSKL application 358. It should be noted that, unless otherwise stated in this disclosure, the steps of the process or method can be arranged in a different order and are therefore not limited to the order shown in Figure 6. In some embodiments, the process corresponds to step 408 in Figure 4 and is executed by the random feature module 366 of the active worker. In some embodiments, the random feature module 366 of either the active worker or a passive worker can execute the process of Figure 6.

[0166] As shown in Figure 6, when the instance $x_i$ and the corresponding random direction $\omega_i$ are available, in step 602 the random feature module 366 in the l-th active worker (or any worker) calculates the dot product $\omega_i^T x_i$ using the instance $x_i$ and the random direction $\omega_i$.

[0167] In step 604, the random feature module 366 uses a seed to generate a random number b in the range $[0, 2\pi]$. In some embodiments, the random seed is generated locally, for example, by any type of random number generator.

[0168] In step 606, the random feature module 366 adds the random number b to the dot product $\omega_i^T x_i$ to obtain the adjusted dot product $\omega_i^T x_i + b$.

[0169] In step 608, steps 602-606 are repeated for each of the workers 1 to q. Specifically, while the l-th active worker executes steps 602-606, the random feature module 366 of the active worker also asks the passive workers to repeat steps 602, 604, and 606 locally, so that each worker computes its own adjusted dot product. Assume there are q workers in total. For any worker $\hat{l}$ of the q workers, with feature group $G_{\hat{l}}$, the random seed used is denoted by $\sigma^{\hat{l}}(i)$, the generated random number in the range $[0, 2\pi]$ is denoted by $b^{\hat{l}}$, the dot product is denoted by $(\omega_i)_{G_{\hat{l}}}^T (x_i)_{G_{\hat{l}}}$, and the adjusted dot product is denoted by $(\omega_i)_{G_{\hat{l}}}^T (x_i)_{G_{\hat{l}}} + b^{\hat{l}}$. Note that the random direction $\omega_i$ is the same among the q workers, because every worker samples it from the distribution $P(\omega)$ using the same index i. However, because different workers store different data for the same customer i, the instances $(x_i)_{G_{\hat{l}}}$ in the q workers are different, and because each random number is generated locally, the random numbers $b^{\hat{l}}$ in the q workers are different. By executing steps 602-608, each of the q workers has its own adjusted dot product calculated by that worker.

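[0170] In step 610, the random feature module 366 sums the adjusted dot products from workers 1 to q to obtain a summed dot product $\xi$. In some embodiments, the summation is performed using the equation $\xi = \sum_{\hat{l}=1}^{q} \left( (\omega_i)_{G_{\hat{l}}}^T (x_i)_{G_{\hat{l}}} + b^{\hat{l}} \right)$, where $\hat{l}$ indexes the workers, $G_{\hat{l}}$ is the feature group held by worker $\hat{l}$, and $b^{\hat{l}}$ is the random number generated by worker $\hat{l}$. Since the summation uses data from the q workers, the summed dot product $\xi$ includes q random numbers $b^{\hat{l}}$. In some embodiments, the summation is coordinated by the coordinator 310. In some embodiments, a tree structure T1 is used for the summation.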

[0171] In step 612, the random feature module 366 randomly selects a worker l' that is not the l-th active worker, and uses a tree structure T2 to calculate the sum of the random numbers b of all workers except the l'-th worker. Since l' is different from l, the l'-th worker is a passive worker randomly selected from all passive workers. The sum of the random numbers b is denoted by $\bar{b}^{l'}$ and is calculated using the equation $\bar{b}^{l'} = \sum_{\hat{l} \neq l'} b^{\hat{l}}$, where $b^{\hat{l}}$ is the random number generated by worker $\hat{l}$. Since the sum does not include the random number b from the l'-th passive worker, the sum $\bar{b}^{l'}$ includes (q-1) random numbers b.

[0172] In step 614, the random feature module 366 subtracts the summed random numbers $\bar{b}^{l'}$ from the summed dot product $\xi$, and the result $\xi - \bar{b}^{l'}$ is used to generate the random feature. Because the summed dot product $\xi$ includes q random numbers b from the q workers, and $\bar{b}^{l'}$ includes (q-1) random numbers b, the random number component of $\xi - \bar{b}^{l'}$ is the random number b of the l'-th passive worker.
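
As a non-limiting illustration, steps 602 to 614 can be simulated inside a single Python process, with each list entry playing the role of one worker. The function names, the toy vectors, and the use of one shared generator are assumptions of the sketch; in a deployment each worker would generate its own random number locally and the two sums would be computed over the tree structures rather than with sum().

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def masked_dot_product(omega_parts, x_parts, active, rng):
    """Simulate steps 602-614 for q >= 2 workers inside one process.

    omega_parts[k] and x_parts[k] are the slices of the random direction and
    of the instance held by worker k; active is the index of the active worker.
    """
    q = len(x_parts)
    # Steps 602-606: every worker masks its partial dot product with its own b.
    masks = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(q)]
    adjusted = [dot(omega_parts[k], x_parts[k]) + masks[k] for k in range(q)]
    # Step 610: sum of all adjusted dot products (contains q masks).
    xi = sum(adjusted)
    # Step 612: pick a passive worker l' and sum the masks of everyone else.
    l_prime = rng.choice([k for k in range(q) if k != active])
    b_bar = sum(masks[k] for k in range(q) if k != l_prime)
    # Step 614: only the mask of worker l' remains in the result.
    return xi - b_bar, l_prime, masks


omega_parts = [[1.0, 2.0], [0.5], [-1.0, 1.0]]
x_parts = [[1.0, 1.0], [2.0], [3.0, 1.0]]
value, l_prime, masks = masked_dot_product(
    omega_parts, x_parts, active=0, rng=random.Random(7)
)
```

The masks are returned here only so that the result can be checked; the point of the scheme is that no single worker sees another worker's unmasked partial dot product.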

[0173] Through the operations described above with reference to Figures 4-6, the FDSKL application 358 of the l-th active worker can be trained using local data from the l-th active worker and data from the passive workers. The instance index i is randomly selected by the active worker and is the same for all workers; each random number b is generated locally at random, and the random numbers in different workers are different. The data shared among the workers are the index i, the adjusted dot products, and the predicted outputs. By using these specific training steps and the random numbers, data exchange between workers is limited, which achieves high security.

[0174] In some aspects, this disclosure relates to a non-transitory computer-readable medium for storing computer-executable code. In some embodiments, the computer-executable code may be software stored in the aforementioned storage device 356. When executed, the computer-executable code may perform one of the methods described above.

[0175] In some aspects, this disclosure relates to a method for predicting the outcome of an instance using a trained FDSKL model. In some embodiments, this disclosure uses the steps described in Figure 4 to train the random-feature-based FDSKL model, and uses the steps in Figure 5 to provide the prediction of the trained FDSKL model for a given sample. In some embodiments, this disclosure predicts whether a loan should be granted to a customer based on online financial information from a digital finance company, online shopping patterns from an e-commerce company, and banking information from a traditional bank. The prediction can be initiated by the server of any of the three entities, but the three entities do not need to share their actual customer data.

[0176] Example. Exemplary experiments have been conducted using models according to certain embodiments of this disclosure.

[0177] Experimental Design: To demonstrate the superiority of FDSKL in federated kernel learning on vertically partitioned data, we compared it with PP-SVMV (Yu, Vaidya, and Jiang, 2006). Furthermore, to verify the prediction accuracy of FDSKL on vertically partitioned data, we compared it with oracle learners that can access the entire data sample without federated learning constraints. For the oracle learners, we used state-of-the-art kernel classification solvers, including LIBSVM (Chang and Lin, 2011) and DSG (Dai et al., 2014). Finally, we included FD-SVRG (Wan et al., 2007), which uses a linear model, as a comparative check on the accuracy of FDSKL. The algorithms used for comparison are listed in Table 1 of Figure 7A.

[0178] Implementation Details: Our experiments were conducted on a 24-core dual-socket Intel Xeon CPU E5-2650 v4 machine with 256GB of RAM. We implemented our FDSKL in Python, where parallel computation was handled by MPI4py (Dalcin et al., 2011). The code for LIBSVM was provided by Chang and Lin (2011). We used the implementation of DSG provided by Dai et al. (2014), modified to use a constant learning rate. Our experiments used the binary classification datasets described below.
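For readers unfamiliar with doubly stochastic gradient updates, the following single-machine sketch shows what such an update with a constant learning rate can look like. It is an illustration under stated assumptions, not the code used in the experiments: the squared loss, the Gaussian-kernel random Fourier features, the seed-based regeneration of random directions, and every name (`random_feature`, `predict`, `train`, `lr`, `lam`, `sigma`) are assumptions introduced here. In FDSKL the prediction would additionally be aggregated across workers, which this sketch omits.

```python
import numpy as np

def random_feature(x, seed, sigma=1.0):
    """Regenerate a random direction and offset from a seed, then map x."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / sigma, size=x.shape[0])  # Gaussian-kernel direction
    b = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(w @ x + b)

def predict(x, alphas, sigma=1.0):
    """f(x) = sum_j alpha_j * phi_j(x); seed j identifies the j-th feature."""
    return sum(a * random_feature(x, j, sigma) for j, a in enumerate(alphas))

def train(X, y, steps=100, lr=0.1, lam=1e-4, sigma=1.0, seed=0):
    """Doubly stochastic updates with a constant learning rate lr."""
    rng = np.random.default_rng(seed)
    alphas = []
    for t in range(steps):
        i = rng.integers(len(X))                  # stochastic instance
        f_i = predict(X[i], alphas, sigma)        # current prediction
        phi = random_feature(X[i], t, sigma)      # stochastic feature (seed t)
        # Shrink all previous coefficients by the regularization factor.
        alphas = [(1.0 - lr * lam) * a for a in alphas]
        # New coefficient from the squared-loss gradient (f_i - y_i).
        alphas.append(-lr * (f_i - y[i]) * phi)
    return alphas

X = np.array([[0.5, -1.0], [-0.5, 1.0]])
y = np.array([1.0, -1.0])
alphas = train(X, y, steps=100)
```

Storing seeds instead of random directions keeps memory proportional to the number of coefficients; the cost is that each prediction regenerates every direction, which is acceptable only for a small illustration like this one.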

[0179] Datasets: Table 2 in Figure 7B summarizes the benchmark datasets used in the experiments. The eight real-world binary classification datasets in Table 2 are from the LIBSVM website. We split each dataset 4:1 into training and test sets. Note that PP-SVMV consistently ran out of memory on the epsilon, real-sim, and w8a datasets, which means that this method only works when the number of instances is below approximately 45,000 on the computational resources specified above.

[0180] Results and Discussion: The results are shown in Figures 8A-8H, Figures 9A-9D, and Figures 10A-10D. Figures 8A-8H plot test error versus training time for four state-of-the-art kernel methods on the datasets gisette, phishing, a9a, ijcnn1, cod-rna, w8a, real-sim, and epsilon. Our algorithm consistently converges fastest among these kernel methods. Figures 10A-10D show training time versus training set size for FDSKL and PP-SVMV on the datasets phishing, a9a, cod-rna, and w8a. The results missing for PP-SVMV in Figure 10B, Figure 10C, and Figure 10D are due to insufficient memory. Our method scales better than PP-SVMV. This scalability advantage comes from a combination of factors, primarily that FDSKL employs a random feature method, which is efficient and easily parallelized. Furthermore, we can show that the communication structure used in PP-SVMV is not optimal, so sending and receiving the partitioned kernel matrix takes longer.

[0181] As described in the previous section, FDSKL uses a tree-structured communication scheme to distribute and aggregate computations. To validate this design, we also compared the efficiency of three commonly used communication structures: ring-based, tree-based, and star-based. The comparison task was to compute the kernel matrix (linear kernel) of the training set for four datasets. Specifically, each node maintained a subset of the features of the training set and was instructed to compute the kernel matrix using only that subset. The local kernel matrices computed at the nodes were then summed using one of the three communication structures. Our experiments compared the efficiency (elapsed communication time) of obtaining the final kernel matrix; the results for the datasets gisette, phishing, a9a, and ijcnn1 are shown in Figures 9A-9D. From Figures 9A-9D, we can see that as the number of nodes increases, our tree-based communication structure has the lowest communication cost. This explains why PP-SVMV, which uses a ring-based communication structure, is inefficient, as shown in Figures 10A-10D.
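The comparison task above can be mimicked on one machine to see why the aggregation structure matters. The sketch below is only an illustration under assumptions: `tree_sum` and `ring_sum` are hypothetical helpers, the data are random, and the count of sequential steps is a crude proxy for communication time that ignores bandwidth and latency. It relies on the fact that the linear kernel of the full data equals the sum of the linear kernels of the per-node feature slices.

```python
import numpy as np

def tree_sum(blocks):
    """Pairwise (binary-tree) reduction; returns the sum and the number of rounds."""
    level = list(blocks)
    rounds = 0
    while len(level) > 1:
        # Pairs within one round could be added in parallel on different nodes.
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # the unpaired node waits for the next round
        level = nxt
        rounds += 1
    return level[0], rounds

def ring_sum(blocks):
    """Sequential pass around a ring; each hop waits for the previous one."""
    total = blocks[0]
    hops = 0
    for blk in blocks[1:]:
        total = total + blk
        hops += 1
    return total, hops

rng = np.random.default_rng(0)
n, q = 6, 8  # n instances, q nodes
# Each node holds a vertical slice of the features and its local linear kernel.
slices = [rng.normal(size=(n, 3)) for _ in range(q)]
local_kernels = [s @ s.T for s in slices]

k_tree, tree_rounds = tree_sum(local_kernels)
k_ring, ring_hops = ring_sum(local_kernels)
```

With q = 8 nodes the tree needs 3 sequential rounds while the ring needs 7 sequential hops, and both produce the same kernel matrix; the gap widens as q grows, since the tree depth grows logarithmically in q.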

[0182] Figure 11 shows boxplots of the test errors of three state-of-the-art kernel methods, a linear method (FD-SVRG), and our FDSKL. All results are averaged over 10 different training-test split trials. Based on these results, our FDSKL consistently exhibits the lowest test error and variance. Furthermore, the linear method generally performs worse than the kernel methods.

[0183] Conclusion: Privacy-preserving federated learning for vertically partitioned data is currently in high demand in machine learning. In some embodiments and examples of this disclosure, we propose a federated double stochastic kernel learning (FDSKL) algorithm for vertically partitioned data, which overcomes the implicit linear separability limitation of existing privacy-preserving federated learning algorithms. We prove that FDSKL has a sublinear convergence rate and can guarantee data security under the semi-honest assumption. To our knowledge, FDSKL is the first efficient and scalable privacy-preserving federated kernel method. Extensive experimental results show that FDSKL handles high-dimensional data more efficiently than state-of-the-art kernel methods while maintaining similar generalization performance.

[0184] Some embodiments of this disclosure have the following advantages: (1) The FDSKL algorithm can be trained efficiently, scalably, and securely on vertically partitioned data using a kernel method. (2) FDSKL is a distributed double stochastic gradient algorithm with a constant learning rate, which is much faster than existing double stochastic gradient algorithms based on decreasing learning rates, and also much faster than existing privacy-preserving federated kernel learning algorithms. (3) The tree-structured communication scheme is used for distributed and aggregated computation, which is more efficient than star-structured and ring-structured communication, making FDSKL more efficient than existing federated learning algorithms. (4) Existing federated learning algorithms for vertically partitioned data use encryption technology to ensure the security of the algorithm, which is time-consuming and laborious. However, the method of this disclosure uses random perturbation to maintain the security of the algorithm, which is cheaper than encryption technology and makes FDSKL more efficient than existing federated learning algorithms. (5) Most existing federated learning algorithms for vertically partitioned data are limited to linearly separable models. The FDSKL of this disclosure is the first efficient and scalable federated learning algorithm for vertically partitioned data that breaks through the implicit linear separability limitation.

[0185] The foregoing description of exemplary embodiments of this disclosure is presented for illustrative and descriptive purposes only and is not intended to be exhaustive or to limit this disclosure to the precise form disclosed. Many modifications and variations are possible in accordance with the foregoing teachings.

[0186] The embodiments were chosen and described to explain the principles of this disclosure and its practical application, thereby enabling others skilled in the art to utilize this disclosure and various embodiments, as well as various modifications suitable for the particular intended use. Alternative embodiments will become apparent to those skilled in the art to which this disclosure pertains without departing from the spirit and scope of this disclosure. Therefore, the scope of this disclosure is defined by the appended claims rather than the foregoing description and the exemplary embodiments described therein.

Claims

1. A system for making predictions using a machine learning model, comprising: an active computing device and at least one passive computing device in communication with the active computing device, wherein each of the active computing device and the at least one passive computing device includes local data, the active computing device includes a processor and a storage device storing computer-executable code, and the computer-executable code is configured to: obtain parameters of the machine learning model; retrieve an instance from the local data of the active computing device; sample a random direction of the instance; calculate a dot product between the random direction and the instance, and calculate a random feature based on the dot product; calculate a predicted value of the instance in the active computing device, instruct the at least one passive computing device to calculate a predicted value of the instance in the at least one passive computing device, and aggregate the predicted values from the active computing device and the at least one passive computing device to obtain a final predicted value of the instance, wherein the predicted value of the instance in the at least one passive computing device is obtained based on the local data of the at least one passive computing device; determine model coefficients using the random feature and the difference between the final predicted value of the instance and a target value of the instance; update the machine learning model using the model coefficients; and predict a value of a new instance using the machine learning model.

2. The system according to claim 1, wherein the parameters of the machine learning model include a constant learning rate.

3. The system according to claim 1, wherein the instance is characterized by an index, the computer-executable code is configured to provide the index to the at least one passive computing device, and each of the active computing device and the at least one passive computing device is configured to sample the random direction based on the index.

4. The system according to claim 3, wherein the random direction is sampled from a Gaussian distribution.

5. The system according to claim 1, wherein the random feature is calculated using an equation involving the dot product and a random value, the dot product being taken between the instance, or a portion of the instance or data sample, identified by an index and the random direction corresponding to that index.

6. The system according to claim 5, wherein the calculation uses an equation whose terms denote: the number of computing devices, counting the active computing device and the at least one passive computing device; an index identifying one of those computing devices; the dot product of the random direction and the instance in the indexed computing device; the random number generated in the indexed computing device; the active computing device; the random direction corresponding to the instance index in the indexed computing device; and the portion of the instance or data sample with that index in the indexed computing device.

7. The system according to claim 1, wherein the predicted value of the instance in the active computing device is calculated using multiple iterations, and the predicted value is updated in the iterations using an equation whose terms are the predicted value of the instance, the model coefficients of the instance, and the random feature.

8. The system according to claim 7, wherein the number of iterations is equal to or greater than 2.

9. The system according to claim 1, wherein the computer-executable code is configured to update the machine learning model by replacing each previous model coefficient using an equation whose terms are any one of the previous model coefficients, the learning rate of the machine learning model, and the regularization parameter of the machine learning model.

10. The system according to claim 1, wherein communication between the active computing device and the at least one passive computing device is performed using a tree structure by a coordinating computing device that communicates with the active computing device and the at least one passive computing device.

11. A method for making predictions using a machine learning model, comprising: obtaining, by an active computing device, parameters of the machine learning model; retrieving, by the active computing device, an instance from local data of the active computing device; sampling, by the active computing device, a random direction of the instance; calculating, by the active computing device, a dot product between the random direction and the instance, and calculating a random feature based on the dot product; calculating, by the active computing device, a predicted value of the instance, instructing at least one passive computing device to calculate a predicted value of the instance in the at least one passive computing device, and aggregating the predicted values from the active computing device and the at least one passive computing device to obtain a final predicted value of the instance, wherein the predicted value of the instance in the at least one passive computing device is obtained based on local data of the at least one passive computing device; determining, by the active computing device, model coefficients using the random feature and the difference between the final predicted value of the instance and a target value of the instance; updating, by the active computing device, the machine learning model using the model coefficients; and predicting, by the active computing device, a value of a new instance using the machine learning model.

12. The method according to claim 11, wherein the parameters of the machine learning model include a constant learning rate.

13. The method according to claim 11, wherein the instance is characterized by an index, computer-executable code is configured to provide the index to the at least one passive computing device, and each of the active computing device and the at least one passive computing device is configured to sample the random direction based on the index.

14. The method according to claim 13, wherein the random direction is sampled from a Gaussian distribution.

15. The method according to claim 11, wherein the random feature is calculated using an equation involving the dot product and a random value, the dot product being taken between the instance, or a portion of the instance or data sample, identified by an index and the random direction corresponding to that index.

16. The method according to claim 11, wherein the calculation uses an equation whose terms denote: the number of computing devices, counting the active computing device and the at least one passive computing device; an index identifying one of those computing devices; the dot product of the random direction and the instance in the indexed computing device; the random number generated in the indexed computing device; the active computing device; the random direction corresponding to the instance index in the indexed computing device; and the portion of the instance or data sample with that index in the indexed computing device.

17. The method according to claim 11, wherein the predicted value of the instance in the active computing device is calculated using multiple iterations, and the predicted value is updated in the iterations using an equation whose terms are the predicted value of the instance, the model coefficients of the instance, and the random feature.

18. The method according to claim 11, wherein computer-executable code is configured to update the machine learning model by replacing each previous model coefficient using an equation whose terms are any one of the previous model coefficients, the learning rate of the machine learning model, and the regularization parameter of the machine learning model.

19. The method according to claim 11, wherein communication between the active computing device and the at least one passive computing device is performed using a tree structure by a coordinating computing device that communicates with the active computing device and the at least one passive computing device.

20. A non-transitory computer-readable medium for storing computer-executable code, wherein the computer-executable code is configured to, when executed at the processor of an active computing device: obtain parameters of a machine learning model; retrieve an instance from local data of the active computing device; sample a random direction of the instance; calculate a dot product between the random direction and the instance, and calculate a random feature based on the dot product; calculate a predicted value of the instance in the active computing device, instruct at least one passive computing device to calculate a predicted value of the instance in the at least one passive computing device, and aggregate the predicted values from the active computing device and the at least one passive computing device to obtain a final predicted value of the instance, wherein the predicted value of the instance in the at least one passive computing device is obtained based on local data of the at least one passive computing device; determine model coefficients using the random feature and the difference between the final predicted value of the instance and a target value of the instance; update the machine learning model using the model coefficients; and predict a value of a new instance using the machine learning model.