Model training method based on longitudinal federated learning, medium, device and product
By employing a publish-subscribe architecture and channel caching mechanism in vertical federated learning, asynchronous training is achieved, which solves the problems of high computational cost and training latency in vertical federated learning, improves model training efficiency and accuracy, and ensures data security.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING ZITIAO NETWORK TECH CO LTD
- Filing Date
- 2025-04-30
- Publication Date
- 2026-06-19
AI Technical Summary
In vertical federated learning, traditional federated gradient transfer methods incur high computational costs and large communication overhead, while synchronous training methods reduce model training efficiency and lead to significant training latency.
The method adopts vertical federated learning based on split learning. Through a publish-subscribe architecture and a channel caching mechanism, the data is divided into batches and one-to-one corresponding channels are established to achieve asynchronous training, decouple the training processes of multiple participants, use embedding vectors and gradient information for model training, and use a differential privacy protocol to protect data security.
It eliminates training latency caused by synchronization dependencies, improves model training efficiency, and ensures the accuracy of model training and data security.
Figure CN120235271B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of federated learning technology, and more specifically, to a model training method, medium, device, and product based on vertical federated learning.
Background Technology
[0002] In Vertical Federated Learning (VFL), sample identifiers (e.g., IDs) of the data are aligned across different participants, but the feature sets differ. For example, as shown in Figure 1, labeled organization A possesses the first feature data of the first object (i.e., organization A's private data), while unlabeled organization B possesses the second feature data of the first object (i.e., organization B's private data). For example, organization A is a bank, and organization B is an e-commerce platform. The bank possesses users' financial data, while the e-commerce platform possesses users' consumption records. The labels could indicate whether a user will default on payments or the user's credit rating. Both parties want to collaborate on training the model but cannot directly share the original feature data. Traditional VFL uses federated gradient transfer, where each party calculates local gradients and exchanges information through encryption techniques (such as secure multi-party computation or homomorphic encryption) to ultimately optimize the global model. However, this method incurs high communication overhead and computational costs.
[0003] To reduce the computational burden and communication overhead during model training, VFL based on Split Learning is currently commonly used for model training. However, this method typically employs synchronous training, which can impact training efficiency and lead to significant training latency.
Summary of the Invention
[0004] This summary section is provided to briefly introduce the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
[0005] Firstly, this disclosure provides a model training method based on vertical federated learning. Multiple participants in the model training include a labeled first participant and M unlabeled second participants, where M ≥ 1. The data held by each participant is aligned by sample identifier and then divided into N batches, where N ≥ 1. The N batches correspond one-to-one with N first channels and one-to-one with N second channels. The method is applied to any one of the M unlabeled second participants. The method includes: obtaining a first training sample of the current batch, wherein the current batch is one of the N batches; inputting the first training sample into a first feature extraction model to obtain a first embedding vector; placing the first embedding vector into the first channel corresponding to the current batch; and, if there is a non-empty second channel among the N second channels, retrieving the earliest stored first gradient information from the non-empty second channel and updating the first feature extraction model using the earliest stored first gradient information. The second channel is used to store the first gradient information of the first training samples of the corresponding batch. The first gradient information is obtained by the first participant training the inference model based on the second embedding vector and M first embedding vectors taken from the first channel of the corresponding batch. The second embedding vector is obtained by the first participant inputting the second training samples of the corresponding batch into the second feature extraction model. If all N second channels are empty, the same process is performed for the next batch of training samples until all N batches are completed.
[0006] Secondly, this disclosure provides a model training method based on vertical federated learning. Multiple participants in the model training include a labeled first participant and M unlabeled second participants, where M ≥ 1. The data held by each participant is aligned by sample identifier and then divided into N batches, where N ≥ 1. The N batches correspond one-to-one with N first channels and one-to-one with N second channels. The method is applied to the first participant and includes: obtaining a second training sample of the current batch, wherein the current batch is one of the N batches; inputting the second training sample into a second feature extraction model to obtain a second embedding vector; extracting M first embedding vectors from the first channel corresponding to the current batch, wherein the M first embedding vectors are obtained by each of the M unlabeled second participants inputting the first training sample of the current batch into its local first feature extraction model, and are placed by those participants into the first channel corresponding to the current batch; training the inference model based on the second embedding vector and the M first embedding vectors to obtain first gradient information, second gradient information, and third gradient information; and placing the first gradient information into the second channel corresponding to the current batch, updating the second feature extraction model using the second gradient information, and updating the inference model using the third gradient information.
[0007] Thirdly, this disclosure provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the method provided in the first aspect of this disclosure or the steps of the method provided in the second aspect of this disclosure.
[0008] Fourthly, this disclosure provides an electronic device, comprising: a storage device having a computer program stored thereon; and a processing device for executing the computer program in the storage device to implement the steps of the method provided in the first aspect of this disclosure or the steps of the method provided in the second aspect of this disclosure.
[0009] Fifthly, this disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method provided in the first aspect of this disclosure or the steps of the method provided in the second aspect of this disclosure.
[0010] In the above technical solution, before training the model based on vertical federated learning, the M unlabeled second participants and the labeled first participant first align the data they hold by sample identifier, each divide the aligned sample set into N batches, and establish N first channels and N second channels corresponding one-to-one with the N batches. The first channel is used to store the embedding vector of the first training sample of the corresponding batch, and the second channel is used to store the gradient information of the first training sample of the corresponding batch. During batch training of the model, the second participant can use the first training samples of the current batch to train the first feature extraction model and obtain the first embedding vector. Then, the second participant, as the publisher, puts the first embedding vector into the first channel corresponding to the current batch. At the same time, the first participant can use the second training samples of the current batch to train the second feature extraction model and obtain the second embedding vector. Then, the first participant, as the subscriber, takes M first embedding vectors out of the first channel corresponding to the current batch and trains the inference model based on the second embedding vector and the M first embedding vectors to obtain the first gradient information, the second gradient information, and the third gradient information. Next, the first participant, as the publisher, puts the first gradient information into the second channel corresponding to the current batch, updates the second feature extraction model with the second gradient information, and updates the inference model with the third gradient information.
After the second participant places the first embedding vector into the corresponding first channel, it can act as a subscriber and determine whether first gradient information exists in the N second channels. If there is a non-empty second channel among the N second channels, the earliest stored first gradient information is retrieved from the non-empty second channel, and the first feature extraction model is updated using this earliest stored first gradient information. If there is no first gradient information in any of the N second channels, i.e., all N second channels are empty, then there is no need to wait for the first participant to produce the gradient information of the first training sample of the current batch, and the training of the first feature extraction model can continue. Thus, through the publish-subscribe architecture and channel caching mechanism, the training processes of multiple participants are decoupled and the model is trained asynchronously, thereby eliminating the training latency caused by synchronization dependencies in vertical federated learning and improving model training efficiency. In addition, the embedding vectors and gradient information of training samples from different batches are placed into the channels of their own batches, which keeps the samples aligned at all times and thereby ensures the accuracy of model training.
[0011] Other features and advantages of this disclosure will be described in detail in the following detailed description section.
Attached Figure Description
[0012] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and the components and elements are not necessarily drawn to scale. In the drawings:
[0013] Figure 1 is a schematic diagram of vertical federated learning based on split learning in the related art.
[0014] Figure 2 is a schematic diagram of synchronization dependency analysis in the related art.
[0015] Figure 3 is a flowchart illustrating a model training method based on vertical federated learning applied to any unlabeled second participant, according to an exemplary embodiment.
[0016] Figure 4 is an overview diagram of a vertical federated learning system according to an exemplary embodiment.
[0017] Figure 5 is a flowchart illustrating a vertical federated learning system according to an exemplary embodiment.
[0018] Figure 6 is a flowchart illustrating a model training method based on vertical federated learning applied to a labeled first participant, according to an exemplary embodiment.
[0019] Figure 7 is a block diagram illustrating a model training apparatus based on vertical federated learning applied to any unlabeled second participant, according to an exemplary embodiment.
[0020] Figure 8 is a block diagram illustrating a model training apparatus based on vertical federated learning applied to a labeled first participant, according to an exemplary embodiment.
[0021] Figure 9 is a schematic diagram of the structure of an electronic device according to an exemplary embodiment.
Detailed Implementation
[0022] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0023] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this disclosure is not limited in this respect.
[0024] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.
[0025] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.
[0026] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0027] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
[0028] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.
[0029] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium performing the operations of this disclosed technical solution, based on the prompt message.
[0030] As an optional but non-limiting implementation, in response to a user's active request, sending a prompt message to the user can be done via a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control allowing the user to choose "agree" or "disagree" to provide personal information to the electronic device.
[0031] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.
[0032] Meanwhile, it is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition or use of the data) shall comply with the requirements of relevant laws, regulations and related provisions.
[0033] As discussed in the background section, to reduce the computational burden and communication overhead during model training, vertical federated learning (VFL) based on Split Learning is currently commonly used for model training. However, this model training method typically employs synchronous training, which can affect model training efficiency and lead to significant training latency.
[0034] Specifically, Split Learning reduces computational burden and data exposure risks by breaking down a model to be trained (e.g., a deep neural network) into multiple sub-models, allowing each participant to train only the sub-model it is responsible for. For example, as shown in Figure 1, the model to be trained is split into a bottom model and a top model. Before training the model, organizations A and B need to align their private data. For example, they can obtain the intersection of common sample identifiers based on the Private Set Intersection (PSI) technique. Then, the model is trained based on the aligned private data.
[0035] Among them, the model to be trained can be a classification model, which can be used for image classification, text classification, group classification and other scenarios. The model to be trained can also be a prediction model, which can be applied to trend prediction and other scenarios.
[0036] Specifically, as shown in Figure 1, organization B can train its own bottom model based on aligned private data to obtain embedding vectors (hidden representations). This process can be called the forward propagation of organization B's bottom model. The embedding vector is perturbed with noise using a security protocol (e.g., a differential privacy protocol) and then sent to organization A. Organization A trains its own bottom model based on aligned private data to obtain embedding vectors. This process can be called the forward propagation of organization A's bottom model. Then, organization A trains its top model based on this locally generated embedding vector and the noise-perturbed embedding vector received from organization B. This process can be called the forward propagation of organization A's top model. Based on the label Y corresponding to the private data and the output of the top model, the model loss is calculated, and the model loss is used to calculate the gradient information of organization A's aligned private data, the gradient information of organization B's aligned private data, and the gradient information of the embedding vector. Next, organization A updates its own bottom model based on the gradient information of its own aligned private data. This process can be called the backpropagation of organization A's bottom model. Then, it updates its top model based on the gradient information of the embedding vector. This process can be called the backpropagation of organization A's top model. At the same time, organization A adds noise to the gradient information of organization B's aligned private data through a security protocol and sends it to organization B. Organization B can update its own bottom model based on the received noisy gradient information. This process can be called the backpropagation of organization B's bottom model.
[0037] As shown in Figure 2, organizations A and B train their models synchronously, with synchronization dependencies between different processes. For example, the backpropagation process of the top model in organization A can only be executed after the forward propagation of the bottom model in organization B is completed; similarly, the backpropagation process of the bottom model in organization B can only be executed after the backpropagation of the top model in organization A is completed. These synchronization dependencies affect model training efficiency and lead to significant training latency.
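The noise step mentioned above can be illustrated with a minimal Python sketch. It assumes a clip-then-add-Gaussian-noise scheme, which is one common way to perturb a vector; the source does not specify the protocol, and the function name `add_noise` and the parameters `clip_norm` and `sigma` are hypothetical and not calibrated to any privacy budget.

```python
import math
import random


def add_noise(embedding, clip_norm=1.0, sigma=0.1, rng=None):
    """Clip the vector to clip_norm (L2) and add Gaussian noise per element.

    Assumption: this clip-and-perturb scheme only stands in for the
    unspecified "security protocol"; sigma is an illustrative value.
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(v * v for v in embedding))
    # Scale down only when the vector is longer than clip_norm.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale + rng.gauss(0.0, sigma * clip_norm) for v in embedding]


# Organization B perturbs its embedding before sending it to organization A.
noisy_embedding = add_noise([3.0, 4.0], rng=random.Random(0))
print(noisy_embedding)
```

The same helper could be applied to the gradient information that organization A sends back, since the text describes noise being added in both directions.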
[0038] In view of this, this disclosure provides a model training method, medium, device and product based on vertical federated learning.
[0039] Figure 3 is a flowchart illustrating a model training method based on vertical federated learning applied to any unlabeled second participant, according to an exemplary embodiment. As shown in Figure 3, the model training method based on vertical federated learning applied to any unlabeled second participant may include the following S101~S106.
[0040] In S101, obtain the first training sample of the current batch.
[0041] In this disclosure, the multiple participants in model training include a labeled first participant and M unlabeled second participants, where M ≥ 1, meaning there can be one or more unlabeled participants in model training. Each of the M second participants holds first private sample data, and the first participant holds second private sample data and the labels corresponding to the second private sample data. The multiple participants can train the model based on VFL. Before model training, the private sample data must be aligned. For example, the intersection of common sample identifiers can be obtained based on PSI. Then, the first private sample data corresponding to the sample identifier intersection is used as the first sample set, and the second private sample data corresponding to the sample identifier intersection is used as the second sample set. Each first sample set and the second sample set are then divided according to a preset batch size, resulting in N batches. That is, the data held by each participant is divided into N batches after sample identifier alignment. The M second participants and the first participant can perform batch training on the model, where the batch size is the number of samples in each batch. Training samples in the same batch of the first and second sample sets have the same sample identifiers, and N ≥ 1. The first training sample of the current batch is one of the N batches mentioned above.
[0042] The model to be trained can be split into a feature extraction model (bottom model) and an inference model (top model). Each of the M unlabeled second participants trains its own first feature extraction model based on its first sample set. The first participant trains its own second feature extraction model and inference model based on the second sample set and the corresponding labels. The first feature extraction model and the second feature extraction model have the same structure, and their initial weights can be the same or different.
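The alignment and batching described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a plain set intersection stands in for PSI (a real PSI protocol would obtain the same intersection without revealing non-shared identifiers), and the function name `align_and_batch` and the identifier ranges are hypothetical.

```python
def align_and_batch(first_ids, second_ids, batch_size):
    """Return the shared sample identifiers split into consecutive batches.

    Assumption: a plain set intersection replaces PSI for illustration.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Sorting gives every participant the same batch boundaries.
    shared = sorted(set(first_ids) & set(second_ids))
    return [shared[i:i + batch_size] for i in range(0, len(shared), batch_size)]


# Hypothetical identifiers: the two parties overlap on IDs 100..699.
second_participant_ids = range(0, 700)    # unlabeled side
first_participant_ids = range(100, 900)   # labeled side
batches = align_and_batch(second_participant_ids, first_participant_ids, 200)
print(len(batches), [len(b) for b in batches])
```

Because both sides derive the batches from the same sorted intersection, batch k on one side contains the same sample identifiers as batch k on the other.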
[0043] In S102, the first training sample is input into the first feature extraction model to obtain the first embedding vector.
[0044] In S103, the first embedding vector is placed into the first channel corresponding to the current batch. N batches correspond one-to-one with N first channels (also called embedding channels) and N batches correspond one-to-one with N second channels (also called gradient channels). The second channel is used to store the first gradient information of the first training sample of the corresponding batch. The first gradient information is obtained by the first participant training the inference model based on the second embedding vector and M first embedding vectors taken from the first channel corresponding to the corresponding batch.
[0045] In this disclosure, the second embedding vector is obtained by the first participant inputting the second training samples of the corresponding batch into the second feature extraction model. Retrieving the embedding vector from the first channel corresponding to the corresponding batch means removing the embedding vector from the first channel corresponding to the corresponding batch; after removal, the first channel corresponding to the corresponding batch no longer contains the removed embedding vector.
[0046] To ensure that samples remain aligned at all times during model training, N first channels and N second channels can be established, each corresponding to one batch. The first channel stores the embedding vectors of the first training sample in its corresponding batch, and the second channel stores the gradient information of the first training sample in its corresponding batch. Furthermore, a buffer mechanism can be designed for both the first and second channels, with a preset number of buffers configured. For example, the first channel can be configured with 5 buffers, storing a maximum of 5 embedding vectors, and the second channel can be configured with 5 buffers, storing a maximum of 5 pieces of gradient information. If the buffers in the first and second channels are full, embedding vectors or gradient information can be discarded according to the first-in, first-out (FIFO) principle.
[0047] For example, if M=1, and both the first and second sample sets contain 600 training samples with a batch size of 200, then the first and second sample sets can be divided into three batches: batch A, batch B, and batch C. In this case, three first channels and three second channels can be established: first channel A1 and second channel A2 corresponding to batch A, first channel B1 and second channel B2 corresponding to batch B, and first channel C1 and second channel C2 corresponding to batch C. First channel A1 stores the embedding vectors of the first training samples from batch A of the second participant; first channel B1 stores the embedding vectors of the first training samples from batch B of the second participant; first channel C1 stores the embedding vectors of the first training samples from batch C of the second participant; second channel A2 stores the gradient information of the first training samples from batch A of the second participant; second channel B2 stores the gradient information of the first training samples from batch B of the second participant; and second channel C2 stores the gradient information of the first training samples from batch C of the second participant.
[0048] For example, if M=2, and the second sample set held by the first participant and the first sample sets held by the two second participants each contain 600 training samples, with a batch size of 200, then the second sample set and the two first sample sets can each be divided into three batches: batch A, batch B, and batch C. In this case, three first channels and three second channels can be established: first channel A1 and second channel A2 corresponding to batch A, first channel B1 and second channel B2 corresponding to batch B, and first channel C1 and second channel C2 corresponding to batch C. First channel A1 stores the embedding vectors of the first training samples from batch A of the two second participants; first channel B1 stores the embedding vectors of the first training samples from batch B of the two second participants; first channel C1 stores the embedding vectors of the first training samples from batch C of the two second participants; second channel A2 stores the gradient information of the first training samples from batch A of the two second participants; second channel B2 stores the gradient information of the first training samples from batch B of the two second participants; and second channel C2 stores the gradient information of the first training samples from batch C of the two second participants.
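A channel with a fixed number of buffers can be sketched in Python with `collections.deque`. This is a minimal sketch: the class name `Channel` and its methods are hypothetical, and it assumes that "discarded according to FIFO" means the oldest stored entry is dropped when a new one arrives at a full channel.

```python
from collections import deque


class Channel:
    """Bounded per-batch buffer; a hypothetical helper, not from the source."""

    def __init__(self, capacity=5):
        # deque(maxlen=...) drops the oldest item when a new item is
        # appended to a full buffer (assumed reading of the FIFO rule).
        self._items = deque(maxlen=capacity)

    def put(self, item):
        self._items.append(item)

    def take(self):
        """Remove and return the earliest stored item."""
        return self._items.popleft()

    def is_empty(self):
        return not self._items

    def __len__(self):
        return len(self._items)


channel = Channel(capacity=5)
for step in range(7):
    channel.put(f"embedding-{step}")
first_out = channel.take()
print(first_out, len(channel))
```

Taking an item removes it from the channel, which matches the description that a retrieved embedding vector is no longer present in the channel afterwards.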
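The channel layout in these two examples can be written out as a short Python sketch. The dictionary names and the tuple stored in the channel are illustrative assumptions; the buffer size of 5 is taken from the buffer example given earlier in the description.

```python
from collections import deque

num_samples, batch_size, buffers = 600, 200, 5
num_batches = -(-num_samples // batch_size)  # ceiling division
batch_names = [chr(ord("A") + i) for i in range(num_batches)]

# One first (embedding) channel and one second (gradient) channel per batch.
embedding_channels = {name: deque(maxlen=buffers) for name in batch_names}
gradient_channels = {name: deque(maxlen=buffers) for name in batch_names}

# With M = 2, both second participants publish into the same first channel
# of the batch they are processing (here batch B, i.e. channel B1).
M = 2
for participant in range(M):
    embedding_channels["B"].append((participant, "embedding of batch B"))

print(batch_names, len(embedding_channels["B"]))
```

Tagging each entry with the participant index is one hypothetical way for the first participant to tell the M embedding vectors of a batch apart.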
[0049] The second participant can use the first training sample of the current batch to train a local first feature extraction model, obtain the first embedding vector of the first training sample of the current batch, and put the first embedding vector into the first channel corresponding to the current batch. For example, if the current batch is batch B, the second participant can put the first embedding vector of the first training sample of batch B into the first channel B1 after obtaining it.
[0050] The first participant can use the second training samples of the current batch to train a local second feature extraction model, that is, input the second training samples of the current batch into the second feature extraction model to obtain the second embedding vector of the second training samples of the current batch. Then, the first participant extracts M first embedding vectors from the first channel corresponding to the current batch. Next, the first participant trains the local inference model based on the second embedding vector and the extracted M first embedding vectors, that is, it concatenates the first embedding vectors and the second embedding vector and inputs them into the inference model to obtain the output result. Then, based on the output result and the labels corresponding to the second training samples of the current batch, the model loss is calculated, and based on the model loss, the first gradient information, the second gradient information, and the third gradient information are calculated. The first gradient information is the gradient information of the first training samples of the current batch, used to update the first feature extraction model; the second gradient information is the gradient information of the second training samples of the current batch, used to update the second feature extraction model; and the third gradient information is the gradient information of the embedding vectors, used to update the inference model. Next, the first participant puts the first gradient information into the second channel corresponding to the current batch, uses the second gradient information to update the second feature extraction model, and uses the third gradient information to update the inference model.
[0051] For example, if the current batch is batch C, the first participant can use the second training samples of batch C to train the local second feature extraction model and obtain the second embedding vector of the second training samples of batch C. Then, the first participant extracts M first embedding vectors from the first channel C1 corresponding to batch C. Next, the first participant trains the local inference model based on the second embedding vectors and the extracted M first embedding vectors to obtain the first gradient information, the second gradient information, and the third gradient information. Next, the first participant puts the first gradient information into the second channel C2 corresponding to batch C, and uses the second gradient information to update the second feature extraction model and uses the third gradient information to update the inference model.
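The first participant's handling of one batch can be sketched in pure Python. Everything concrete here is an assumption made for illustration: M = 1, each feature extraction model is a single scalar weight, the inference model is a two-weight linear layer over the concatenated embeddings, the loss is mean squared error, and the first gradient information is taken to be the loss gradient with respect to the first embedding vector. The function and variable names are hypothetical.

```python
from collections import deque


def first_participant_step(x2, labels, first_embedding, state, lr=0.1):
    """Process one batch on the labeled side; returns the first gradient."""
    w2, (v1, v2) = state["w2"], state["v"]
    n = len(labels)
    # Forward pass of the second feature extraction model.
    second_embedding = [w2 * x for x in x2]
    # Forward pass of the inference model on the concatenated embeddings.
    preds = [v1 * e1 + v2 * e2
             for e1, e2 in zip(first_embedding, second_embedding)]
    # Mean squared error: d(loss)/d(prediction) per sample.
    d_pred = [2.0 * (p - y) / n for p, y in zip(preds, labels)]
    # First gradient: w.r.t. the first embedding (sent to the other side).
    first_grad = [d * v1 for d in d_pred]
    # Second gradient: w.r.t. the second feature extraction weight.
    second_grad = sum(d * v2 * x for d, x in zip(d_pred, x2))
    # Third gradient: w.r.t. the inference model weights.
    third_grad = (sum(d * e for d, e in zip(d_pred, first_embedding)),
                  sum(d * e for d, e in zip(d_pred, second_embedding)))
    state["w2"] = w2 - lr * second_grad
    state["v"] = (v1 - lr * third_grad[0], v2 - lr * third_grad[1])
    return first_grad


# Batch C: the second participant has already published one embedding.
embedding_channel_c = deque([[0.5, -1.0]])
gradient_channel_c = deque()
state = {"w2": 1.0, "v": (1.0, 1.0)}

first_embedding = embedding_channel_c.popleft()      # subscribe
first_grad = first_participant_step([1.0, 2.0], [1.0, 0.0],
                                    first_embedding, state)
gradient_channel_c.append(first_grad)                # publish
print(first_grad, state)
```

Only the first gradient leaves the first participant; the second and third gradients are consumed locally, as in the description above.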
[0052] In S104, determine whether all N second channels are empty.
[0053] After the second participant places the first embedding vector into the corresponding first channel, it can attempt backpropagation of the first feature extraction model. At this time, it can be determined whether all N second channels are empty. If all N second channels are empty, it indicates that the first participant has not yet generated new gradient information. In this case, backpropagation of the first feature extraction model can be temporarily suspended, and forward propagation of the first feature extraction model can continue. That is, the next batch of training samples is obtained and the same process is executed: the next batch of training samples is used as the first training sample of the current batch, and S102~S106 are executed until all N batches are completed. When all N batches are completed, one training round has been completed. Multiple training rounds can then be repeated until the model converges. If there is a non-empty second channel among the N second channels, it indicates that the first participant has generated new gradient information. At this time, backpropagation of the first feature extraction model can be performed, that is, S105 and S106 are executed. After that, the process can return to S104.
[0054] In S105, the earliest stored first gradient information is retrieved from the non-empty second channel.
[0055] In S106, the first feature extraction model is updated using the earliest stored first gradient information.
[0056] In the above technical solution, before training the model based on vertical federated learning, the M unlabeled second participants and the labeled first participants first perform sample label alignment on the data they hold, and divide the aligned sample set into N batches respectively, and establish N first channels and N second channels corresponding one-to-one with the N batches. The first channel is used to store the embedding vector of the first training sample of the corresponding batch, and the second channel is used to store the gradient information of the first training sample of the corresponding batch. During batch training of the model, the second participant can use the first training samples of the current batch to train the first feature extraction model and obtain the first embedding vector. Then, the second participant, as the publisher, puts the first embedding vector into the first channel corresponding to the current batch. At the same time, the first participant can use the second training samples of the current batch to train the second feature extraction model and obtain the second embedding vector. Then, the first participant, as the subscriber, takes out M first embedding vectors from the first channel corresponding to the current batch and trains the inference model based on the second embedding vector and the taken M first embedding vectors to obtain the first gradient information, the second gradient information, and the third gradient information. Next, the first participant, as the publisher, puts the first gradient information into the second channel corresponding to the current batch, updates the second feature extraction model with the second gradient information, and updates the inference model with the third gradient information. 
After the second participant places the first embedding vector into the corresponding first channel, it can act as a subscriber to determine whether first gradient information exists in the N second channels. If there are non-empty second channels among the N second channels, the earliest stored first gradient information is retrieved from the non-empty second channels, and the first feature extraction model is updated using this earliest stored first gradient information. If there is no first gradient information in any of the N second channels, i.e., all N second channels are empty, then there is no need to wait for the gradient information of the first training sample of the current batch from the first participant, and the training of the first feature extraction model can continue. Thus, through the publish-subscribe architecture and channel caching mechanism, the training processes of multiple participants are decoupled, realizing asynchronous training of the model, thereby eliminating the training latency caused by synchronization dependency in vertical federated learning and improving model training efficiency. In addition, the embedding vectors and gradient information of training samples from different batches are placed into the corresponding channels of the corresponding batches, which can ensure that the sample labels are aligned at all times, thereby ensuring the accuracy of model training.
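The non-blocking publish-subscribe loop of the second participant described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the feature extraction model is reduced to a single scalar weight, the names `Channels` and `second_participant_round` and the learning rate are hypothetical, and when several second channels are non-empty the sketch simply serves the first one it finds.

```python
from collections import deque


class Channels:
    """One first (embedding) channel and one second (gradient) channel per batch."""

    def __init__(self, n_batches):
        self.embedding = [deque() for _ in range(n_batches)]
        self.gradient = [deque() for _ in range(n_batches)]


def second_participant_round(weight, batches, channels, lr=0.1):
    """Run one training round (S101 to S106) for a toy scalar feature extractor."""
    for batch_id, samples in enumerate(batches):
        # S102/S103: forward pass, then publish to this batch's first channel.
        embedding = [weight * x for x in samples]
        channels.embedding[batch_id].append(embedding)
        # S104 to S106: consume gradients that are already available, never block.
        while True:
            pending = next((b for b, q in enumerate(channels.gradient) if q), None)
            if pending is None:
                break  # all second channels are empty, move on to the next batch
            grads = channels.gradient[pending].popleft()  # earliest stored entry
            # Chain rule for embedding = weight * x (assumed toy model).
            step = sum(g * x for g, x in zip(grads, batches[pending]))
            weight -= lr * step
    return weight
```

Because each channel is indexed by batch, a gradient taken from channel `pending` is always combined with the samples of that same batch, which is what keeps samples aligned under asynchronous execution.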
[0057] In addition, to prevent the second participant's data from leaking if the embedding information is cracked, the first channel can integrate a differential privacy protocol that adds noise to the embedding information placed in the first channel. Likewise, to prevent the first participant's data from leaking if the gradient information is cracked, the second channel can integrate a differential privacy protocol that adds noise to the gradient information placed in the second channel.
[0058] To further improve the training efficiency of the model, the M second participants and the first participant can adopt a parameter server (PS) architecture to achieve efficient data parallelism. Specifically, each second participant can include a first number of first working nodes and a first parameter server, where each first working node corresponds to at least one of the N batches and has a first feature extraction model deployed on it.
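The disclosure does not specify which differential privacy mechanism the channels use. As one common possibility, the sketch below clips a vector's L2 norm and adds Gaussian noise before the vector is published. The function name `privatize` and the parameters `clip_norm` and `noise_multiplier` are assumptions made for illustration only.

```python
import math
import random


def privatize(vector, clip_norm=1.0, noise_multiplier=0.5, rng=random):
    """Clip the vector to clip_norm (L2), then add Gaussian noise to each entry."""
    norm = math.sqrt(sum(v * v for v in vector))
    # Scale down only when the norm exceeds the clipping bound.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm  # assumed noise calibration
    return [v * scale + rng.gauss(0.0, sigma) for v in vector]


# The same helper could be applied to an embedding before it enters a first
# channel or to gradient information before it enters a second channel.
noisy_embedding = privatize([3.0, 4.0], rng=random.Random(0))
```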
[0059] For example, as shown in Figure 4, M=1, and the second participant includes two first working nodes, namely Worker11 and Worker12. The first sample set is divided into two batches, batch A and batch B. The Batch ID of batch A is k, and the Batch ID of batch B is l. Worker11 corresponds to batch A (i.e., Batch ID k) and trains the first feature extraction model deployed on Worker11 using the first training samples of batch A; Worker12 corresponds to batch B (i.e., Batch ID l) and trains the first feature extraction model deployed on Worker12 using the first training samples of batch B.
[0060] For example, M=2, meaning that the multiple participants in the model training include two second participants, namely second participant L1 and second participant L2. Second participant L1 includes two first working nodes, namely Worker11 and Worker12, and second participant L2 includes two first working nodes, namely Worker13 and Worker14. The first sample set is divided into five batches, namely batch A, batch B, batch C, batch D and batch F. Worker11 and Worker13 each correspond to batches A, B, and F: Worker11 trains the first feature extraction model deployed on Worker11 using the first training samples of these three batches held by second participant L1, and Worker13 trains the first feature extraction model deployed on Worker13 using the first training samples of these three batches held by second participant L2. Worker12 and Worker14 each correspond to batches C and D: Worker12 trains the first feature extraction model deployed on Worker12 using the first training samples of these two batches held by second participant L1, and Worker14 trains the first feature extraction model deployed on Worker14 using the first training samples of these two batches held by second participant L2.
[0061] Specifically, each first working node is used to: obtain the first training sample of the current batch, where the current batch is one batch of the first batch, and the first batch includes the at least one batch corresponding to the first working node; input the first training sample of the current batch into the first feature extraction model deployed on the first working node to obtain the first embedding vector; put the first embedding vector into the first channel corresponding to the current batch; if any of the second channels corresponding to the first batch is non-empty, retrieve the earliest stored first gradient information from that non-empty second channel and use it to update the first feature extraction model deployed on the first working node; and if all the second channels corresponding to the first batch are empty, obtain the training sample of the next batch and perform the same process until the first batch is completed.
[0062] In this disclosure, when the second channel corresponding to the first batch is empty, the training sample of the next batch in the first batch can be used as the first training sample of the current batch, and the same process can be continued until the first batch is completed. At this time, when all batches in the first batch have been completed, it indicates that the first working node has completed one training round it is responsible for. In this way, the first working node can repeat multiple training rounds until the model converges.
[0063] After updating the first feature extraction model deployed on the first working node using the earliest stored first gradient information, it can continue to determine whether the second channel corresponding to the first batch is empty. If the second channel corresponding to the first batch is empty, the same process is performed on the next batch of training samples until the first batch is completed. If there is a non-empty second channel corresponding to the first batch, the earliest stored first gradient information is retrieved from the non-empty second channel corresponding to the first batch. Then, the earliest stored first gradient information is used to update the first feature extraction model deployed on the first working node.
[0064] For example, with M=1 and as shown in Figure 4, Worker11 corresponds to batch A (i.e., Batch ID k). If the current batch is batch A, Worker11 is used to: input the first training sample of batch A into the first feature extraction model deployed on Worker11 to obtain the first embedding vector; put the first embedding vector into the first channel (i.e., the embedding channel) corresponding to batch A (i.e., Batch ID k); and, if the second channel (i.e., the gradient channel) corresponding to Batch ID k is empty, take batch A as the next batch again and obtain the first training sample of batch A.
[0065] For example, the first working node corresponds to batch A, batch B, and batch F. The current batch is batch B. The first working node is used to input the first training sample of batch B into the first feature extraction model deployed on the first working node to obtain the first embedding vector. The first embedding vector is put into the first channel (i.e., the embedding channel) corresponding to batch B. If the second channel corresponding to batch A, the second channel corresponding to batch B, and the second channel corresponding to batch F are all empty, the same process can be performed on the training samples of the next batch, that is, batch F is used as the current batch, until the first batch is completed.
[0066] In addition, each first working node can also be used to send the first model parameters of the updated first feature extraction model to the first parameter server. The first parameter server is used to aggregate the first model parameters sent by the first working nodes to obtain the second model parameters, and to send the second model parameters to each first working node. Each first working node is also used to update the model parameters of the first feature extraction model deployed on it to the second model parameters.
[0067] For example, as shown in Figure 4, Worker11 corresponds to batch A (i.e., Batch ID k) and the current batch is batch A. After putting the first embedding vector into the first channel corresponding to batch A (i.e., Batch ID k), Worker11 can also be used to: if the second channel corresponding to Batch ID k is not empty, retrieve the earliest stored first gradient information from the second channel corresponding to Batch ID k; update the first feature extraction model deployed on Worker11 using the earliest stored first gradient information; and send the first model parameters of the updated first feature extraction model to the first parameter server PS1.
[0068] The first parameter server PS1 is used to aggregate the first model parameters sent by Worker11 and Worker12 to obtain the second model parameters, and then send the second model parameters to Worker11 and Worker12. Worker11 is also used to update the model parameters of the first feature extraction model deployed on Worker11 to the second model parameters. Worker12 is also used to update the model parameters of the first feature extraction model deployed on Worker12 to the second model parameters.
[0069] In addition, to reduce the number of communications, as shown in Figure 4, a semi-asynchronous update mechanism can be used within the PS architecture: a working node interacts with the parameter server to update its local model only once every synchronization interval, where the synchronization interval is a number of training rounds greater than or equal to 1. Specifically, each first working node is used to send, once every synchronization interval, the first model parameters most recently updated in that interval to the first parameter server. The first parameter server is used to aggregate the most recently updated first model parameters sent by the first working nodes to obtain the second model parameters, and to send the second model parameters to each first working node. In this way, each first working node can update the model parameters of the first feature extraction model deployed on it to the second model parameters. The synchronization interval can be a preset value.
[0070] When the M second participants and the first participant train the model asynchronously, convergence can become more difficult. To address this challenge, a dynamic semi-asynchronous update mechanism can be provided within each participant to achieve efficient iterative training and thus fast convergence. This mechanism dynamically adjusts the asynchronous update frequency based on real-time feedback during training, balancing the need for faster computation with the stability required for model convergence. To implement the dynamic semi-asynchronous update mechanism in the PS architecture of VFL, the synchronization interval is designed to decrease as the model accuracy approaches the target accuracy; for example, it is negatively correlated with the number of training rounds. In the early stages of model training, when the model accuracy is far from the target accuracy, the synchronization interval is large, enabling the model to achieve stable learning. As the model accuracy improves, the synchronization interval decreases and the synchronization frequency increases to fine-tune the model and ensure faster convergence.
[0071] For example, the synchronization interval corresponding to training round t can be computed from the initial synchronization interval, which is a preset value, using the hyperbolic tangent function tanh.
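The interaction between working nodes and the parameter server under a fixed synchronization interval can be sketched as follows. The disclosure only says that the parameters are "aggregated"; element-wise averaging is assumed here, and the names `aggregate`, `train_with_periodic_sync`, and `local_update` are hypothetical.

```python
def aggregate(worker_params):
    """Element-wise mean of the parameter vectors reported by the working nodes."""
    n = len(worker_params)
    return [sum(values) / n for values in zip(*worker_params)]


def train_with_periodic_sync(worker_params, local_update, rounds, interval):
    """Let each working node train locally and synchronize every `interval` rounds."""
    for t in range(1, rounds + 1):
        # Each working node updates its own copy of the model.
        worker_params = [local_update(p) for p in worker_params]
        if t % interval == 0:
            # Parameter server: aggregate, then send the result to every node.
            merged = aggregate(worker_params)
            worker_params = [list(merged) for _ in worker_params]
    return worker_params
```

With `interval=1` this degenerates to synchronizing after every round; larger intervals trade fewer communications for more drift between the working nodes' local models.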
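The exact expression for the synchronization interval is not legible in this text, so the following is only one schedule that is consistent with the description: it starts at the preset initial interval, uses the hyperbolic tangent, and shrinks toward 1 as training proceeds. The function name `sync_interval`, the `scale` parameter, and the specific form `1 - tanh(t / scale)` are assumptions, not the formula of the disclosure.

```python
import math


def sync_interval(round_index, initial_interval, scale=10.0):
    """Hypothetical schedule: decrease from initial_interval toward 1 over rounds."""
    shrink = 1.0 - math.tanh(round_index / scale)  # 1 at round 0, tends to 0
    # Never synchronize less often than allowed by the floor of one round.
    return max(1, round(initial_interval * shrink))


schedule = [sync_interval(t, initial_interval=10) for t in range(0, 40, 5)]
```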
[0072] In this disclosure, the parameters such as batch size, the first number of first working nodes in each of the M second participants, and the second number of second working nodes in the first participant can all be preset values. In VFL, computational resources and data feature dimensions often differ significantly among the participants, resulting in heterogeneity in data or computational resources among different participants, leading to inconsistent computational speeds and affecting training throughput and convergence time. Specifically, the number of first working nodes in each of the M second participants remains consistent.
[0073] To address this issue, an optimal set of parameters can be determined based on the system configuration files (including model information and hardware capability information) of multiple participants. These parameters include batch size, the first number of first worker nodes in the second participant, the second number of second worker nodes in the first participant, and the number of Central Processing Unit (CPU) cores allocated to each worker node. By determining these parameters based on the specific resources and constraints of multiple participants, the goal is to balance computational load, reduce latency, and improve the overall efficiency of the publish-subscribe VFL system while maintaining privacy compliance. This allows for balancing the computational speed of multiple participants while eliminating training latency caused by synchronization dependencies in VFL, thereby maximizing computational resource utilization. Consequently, this reduces the training cost of VFL and promotes efficient use of computational resources by each participant.
[0074] A VFL system based on a publish-subscribe architecture can include M second participants and a first participant, with the multiple participants employing a PS architecture. For example, as shown in Figure 5, the VFL system flow can include a profiling phase, a planning phase, and a training phase (the training phase is not shown in Figure 5).
[0075] During the profiling phase, each participant (Figure 5 takes Party A and Party B as examples) uses a validation dataset (non-private data) to collect information about the model it holds (i.e., model information) and about its hardware capabilities. Specifically, hardware capability information may include communication bandwidth and memory limits, while model information includes the size E of the embedding vectors, the size G of the gradient information, and parameter sizes (i.e., the values of the hyperparameters used later).
[0076] During the planning phase, the task publisher or coordinator can construct an optimization problem based on the system configuration files (i.e., system profile files) of the multiple participants. This optimization problem models the single-batch training duration of the multiple participants. If M=1, a set of optimal hyperparameter configurations is solved directly using dynamic programming based on the single-batch training durations of the multiple participants, and batch sharding, worker node allocation, and channel allocation are then performed for the training samples of the multiple participants based on the hyperparameter configurations. If M is greater than 1, the second participant with the longest single-batch training duration (referred to below as the third participant for ease of distinction) is first selected from the M unlabeled second participants. Then, based on the single-batch training durations of the first participant and the third participant, a set of optimal hyperparameter configurations is solved using dynamic programming, and batch sharding, worker node allocation, and channel allocation are performed for the training samples of the multiple participants based on the hyperparameter configurations. The number of first working nodes is the same in each second participant. The single-batch training duration of a second participant can include the forward propagation duration of the second participant's first feature extraction model, the backpropagation duration of the second participant's first feature extraction model, and the time it takes for the second participant to send the embedding vector to the first participant. When M=1, this second participant is the third participant.
[0077] During the training phase, after obtaining the hyperparameter configuration, the VFL system begins model training.
[0078] Specifically, when the above method is applied to the third participant, the method may further include the following steps (a1) to (a4).
[0079] Step (a1): Construct the objective function with the minimum first duration as the optimization objective.
[0080] The independent variables of the objective function include batch size, the first number of first working nodes in the second participant, and the second number of second working nodes in the first participant. The first duration is the maximum value between the second duration of single-batch training of the second participant (i.e., the third participant) and the third duration of single-batch training of the first participant. The second duration is determined based on the batch size and the first number, and the third duration is determined based on the batch size and the second number.
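The optimization objective described in step (a1) can be written compactly as below. The symbols are introduced here only for illustration and are not taken from the disclosure: B is the batch size, n_1 and n_2 are the first and second numbers of working nodes, and T^{(1)}, T^{(2)}, T^{(3)} are the first, second, and third durations.

```latex
T^{(1)}(B, n_1, n_2) = \max\bigl\{\, T^{(2)}(B, n_1),\; T^{(3)}(B, n_2) \,\bigr\},
\qquad
(B^{*}, n_1^{*}, n_2^{*}) = \arg\min_{B,\, n_1,\, n_2} T^{(1)}(B, n_1, n_2)
```

Minimizing the maximum of the two durations balances the two sides: neither the third participant nor the first participant becomes the bottleneck of a single-batch step.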
[0081] Step (a2): Based on the basic memory consumption of the third participant, construct the first memory constraint of the third participant, and based on the basic memory consumption of the first participant, construct the second memory constraint of the first participant.
[0082] Among them, basic memory consumption is the amount of memory used by the participants to maintain basic functions.
[0083] For example, the first memory constraint of the third participant can be constructed based on the third participant's basic memory consumption using the following equation (1):
[0084] (1)
[0085] where B is the batch size; the first memory constraint is a function describing the third participant's memory consumption with respect to the batch size; the base memory consumption of the third participant is obtained through the profiling phase; and the remaining coefficients are all hyperparameters.
[0086] For example, the second memory constraint of the first participant can be constructed based on the first participant's basic memory consumption through the following equation (2):
[0087] (2)
[0088] where the second memory constraint is a function describing the first participant's memory consumption with respect to the batch size; the base memory consumption of the first participant is obtained through the profiling phase; and the remaining coefficient is a hyperparameter.
[0089] Step (a3): Based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant, construct a constraint on the batch size as a constraint condition for the objective function.
[0090] For example, the batch size constraint can be constructed based on the first memory constraint, the second memory constraint, the base memory consumption of the third participant, and the base memory consumption of the first participant, using the following equation (3):
[0091] (3)
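[0092] where the candidate set is the set of candidate batch sizes and its upper bound is the maximum batch size; different values of B give different values of the first memory constraint, the largest of which is taken as its maximum value; and different values of B likewise give different values of the second memory constraint, the largest of which is taken as its maximum value.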
[0093] Step (a4): Solve the objective function under constraints using the dynamic programming algorithm to obtain the optimal solution for the independent variables.
[0094] In this disclosure, the second duration of single-batch training of the third participant may include the duration of forward propagation of the first feature extraction model of the third participant, the duration of backpropagation of the first feature extraction model of the third participant, and the duration of the third participant sending the embedding vector to the first participant.
[0095] For example, the forward propagation duration of the first feature extraction model of the third participant can be calculated using the following equation (4):
[0096] (4)
[0097] where the coefficients are all hyperparameters; the core-count term is the number of CPU cores allocated to the j-th first working node of the third participant, which lies between a lower limit and an upper limit; and the total is the total number of CPU cores of the third participant.
[0098] For example, the backpropagation duration of the first feature extraction model of the third participant can be calculated using the following equation (5):
[0099] (5)
[0100] where the coefficients are all hyperparameters.
[0101] For example, the duration for the third participant to send an embedding vector to the first participant can be calculated using the following equation (6):
[0102] (6)
[0103] where the bandwidth term is the network bandwidth between the third participant and the first participant.
[0104] It should be noted that the single-batch training duration of each of the other second participants (that is, those other than the third participant) is calculated in a similar way to that of the third participant, and the details are not repeated here.
[0105] The third duration of single-batch training for the first participant may include the duration of forward propagation of the second feature extraction model of the first participant, the duration of backward propagation of the second feature extraction model of the first participant, the total duration of forward and backward propagation of the inference model of the first participant, and the duration of the first participant sending gradient information to the third participant.
[0106] For example, the forward propagation duration of the second feature extraction model of the first participant can be calculated using the following equation (7):
[0107] (7)
[0108] where the coefficients are all hyperparameters; the core-count term is the number of CPU cores allocated to the i-th second working node of the first participant, with i ranging over the second working nodes of the first participant; the core count lies between a lower limit and an upper limit; and the total is the total number of CPU cores of the first participant.
[0109] For example, the backpropagation duration of the second feature extraction model of the first participant can be calculated using the following equation (8):
[0110] (8)
[0111] where the coefficients are all hyperparameters.
[0112] For example, the total duration of forward and backward propagation of the first participant's inference model can be calculated using the following equation (9):
[0113] (9)
[0114] where the coefficients are all hyperparameters.
[0115] For example, the duration for the first participant to send gradient information to the third participant can be calculated using the following equation (10):
[0116] (10)
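[0117] Therefore, the objective function is to minimize the first duration, that is, the larger of the second duration of the third participant (made up of the durations given by equations (4) to (6)) and the third duration of the first participant (made up of the durations given by equations (7) to (10)).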
[0118] The following provides a detailed explanation of the specific implementation method for solving the objective function under constraints using the dynamic programming algorithm in step (a4) above, to obtain the optimal solution for the independent variables. Specifically, this can be achieved through the following steps (a41) to (a46):
[0119] Step (a41): Determine the candidate values of the batch size based on the batch size constraint; determine the candidate values of the first number based on the value range of the first number of first working nodes; and determine the candidate values of the second number based on the value range of the second number of second working nodes.
[0120] Step (a42): Define the dynamic programming array dp[i][j][k], where i is the index of the candidate values of the batch size, j is the index of the candidate values of the first number, and k is the index of the candidate values of the second number. dp[i][j][k] denotes the optimal value of the objective function when the batch size takes its i-th candidate value, the first number takes its j-th candidate value, and the second number takes its k-th candidate value.
[0121] Step (a43): Define the state transition equation based on the objective function:
[0122] .
[0123] Step (a44): Set the initial value of the dynamic programming table to infinity.
[0124] Step (a45): Fill the dynamic programming table in a preset order (usually bottom-up). For each state dp[i][j][k], calculate the value according to the state transition equation; if the calculated value is less than the value currently stored in the dynamic programming table, update the table entry to the calculated value; otherwise, continue filling the table in the preset order until the table is completely filled.
[0125] Step (a46): Determine the latest value of the dynamic programming table as the global optimal solution of the objective function, thereby obtaining the specific values of the optimal parameter combination.
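Steps (a41) to (a46) can be sketched as a table fill over the three candidate lists. Because the state transition equation is not legible in this text, the sketch evaluates every cell directly from the objective, which amounts to an exhaustive search over the table rather than a true recurrence. The function name `solve` and the duration callbacks `second_duration` and `third_duration` are hypothetical stand-ins for equations (4) to (10).

```python
import math
from itertools import product


def solve(batch_sizes, first_counts, second_counts, second_duration, third_duration):
    """Fill dp[i][j][k] with the first duration and return (best value, best config)."""
    # (a44): initialize every table entry to infinity.
    dp = [[[math.inf] * len(second_counts) for _ in first_counts] for _ in batch_sizes]
    best = (math.inf, None)
    # (a45): visit the states in a fixed order and keep the smaller value.
    for i, j, k in product(
        range(len(batch_sizes)), range(len(first_counts)), range(len(second_counts))
    ):
        b, n1, n2 = batch_sizes[i], first_counts[j], second_counts[k]
        value = max(second_duration(b, n1), third_duration(b, n2))
        if value < dp[i][j][k]:
            dp[i][j][k] = value
        # (a46): track the smallest entry seen so far as the optimum.
        if dp[i][j][k] < best[0]:
            best = (dp[i][j][k], (b, n1, n2))
    return best
```

In practice the candidate batch sizes passed in would already be filtered by the memory constraints of steps (a2) and (a3).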
[0126] To reduce the occurrence of model training errors, as shown in Figure 4, a waiting deadline mechanism can be set for each of the M unlabeled second participants, where T_ddl denotes the waiting deadline of the second participant, i.e., the first preset duration. Specifically, the above-described model training method based on longitudinal federated learning applied to any second participant may further include the following steps:
[0127] After the first preset duration has elapsed since the first embedding vector was placed into the first channel, determine whether the first gradient information of the first training sample of the current batch exists in the second channel corresponding to the current batch.
[0128] If the first gradient information of the first training sample of the current batch is not found in the second channel corresponding to the current batch, then a first instruction is sent to the first participant.
[0129] The first instruction instructs the first participant to regenerate the first gradient information of the first training sample in the current batch. Upon receiving the first instruction, the first participant can regenerate the first gradient information of the first training sample in the current batch. The first preset duration can be a preset value.
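The deadline check on the second participant's side can be sketched as follows. The helper name `overdue_batches`, the `published_at` bookkeeping, and the use of plain timestamps are assumptions; the sketch also assumes that a batch is removed from `published_at` once its gradient has been consumed, so that an empty channel really means "not yet received".

```python
from collections import deque


def overdue_batches(published_at, gradient_channels, now, t_ddl):
    """Return the batch IDs whose gradient is still missing after the deadline."""
    return [
        batch_id
        for batch_id, sent in published_at.items()
        if now - sent >= t_ddl and not gradient_channels[batch_id]
    ]


# Hypothetical usage: batch 0 was published at t=0.0, batch 1 at t=5.0.
channels = {0: deque(), 1: deque()}
late = overdue_batches({0: 0.0, 1: 5.0}, channels, now=6.0, t_ddl=5.0)
# For every batch in `late`, the second participant would send the first
# instruction asking the first participant to regenerate the gradient.
```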
[0130] To further reduce model training errors, in addition to the waiting deadline mechanism for the second participant, a waiting deadline mechanism can also be set for the first participant. Specifically, once a second preset duration has elapsed after the first participant obtains the second embedding vector, the first participant determines whether the number of first embedding vectors in the first channel corresponding to the current batch reaches M, that is, whether all M second participants have placed the first embedding vectors of their training samples for the current batch into the first channel. If the number of first embedding vectors in the first channel corresponding to the current batch does not reach M, the first participant sends a second instruction to the fourth participant, instructing the fourth participant to regenerate the first embedding vector. The fourth participant comprises those of the M unlabeled second participants that have not placed the first embedding vectors of their training samples for the current batch into the first channel. In this case, the above-described model training method based on vertical federated learning applied to any second participant may also include the following steps:
[0131] In response to receiving the second instruction sent by the first participant, the steps of inputting the first training sample into the first feature extraction model to obtain the first embedding vector and then inserting the first embedding vector into the first channel corresponding to the current batch are re-executed.
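On the first participant's side, identifying the fourth participant requires knowing which second participants have already published for the current batch. The sketch below assumes that each entry in the first channel is tagged with its sender as a `(participant_id, embedding)` pair; this tagging and the name `missing_publishers` are illustrative assumptions, not details from the disclosure.

```python
def missing_publishers(embedding_channel, participant_ids):
    """Return the second participants with no embedding in this batch's first channel."""
    present = {sender for sender, _ in embedding_channel}
    return [p for p in participant_ids if p not in present]


# Hypothetical usage with M=2: only L1 has published for the current batch,
# so L2 would receive the second instruction.
to_notify = missing_publishers([("L1", [0.2, 0.7])], ["L1", "L2"])
```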
[0132] After the multiple participants complete model training based on VFL, they can perform joint inference based on their respective sub-models. For example, when M=1, the multiple participants include a first participant and a second participant; the first participant takes part in joint inference using the trained second feature extraction model and inference model, and the second participant takes part using the trained first feature extraction model.
[0133] Specifically, each of the M second participants can input its own features into the locally trained first feature extraction model to obtain a third embedding vector, and then send the third embedding vector to the first participant. The first participant inputs its own features into the locally trained second feature extraction model to obtain a fourth embedding vector. Then, the third embedding vectors sent by the second participants and the fourth embedding vector are concatenated to obtain a concatenated vector, and the concatenated vector is input into the locally trained inference model to obtain the inference result.
[0134] Figure 6 is a flowchart illustrating a model training method based on longitudinal federated learning applied to a labeled first participant, according to an exemplary embodiment. As shown in Figure 6, the model training method based on longitudinal federated learning applied to the labeled first participant may include the following S201 to S205.
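The concatenation step of joint inference can be sketched as follows. The function name `joint_inference` is hypothetical, the inference model is passed in as an arbitrary callable, and the concatenation order (second participants first, then the first participant) is an assumption.

```python
def joint_inference(third_embeddings, fourth_embedding, inference_model):
    """Concatenate all embeddings and feed the result to the inference model."""
    # Flatten the M third embedding vectors, then append the fourth one.
    concatenated = [v for emb in third_embeddings for v in emb] + list(fourth_embedding)
    return inference_model(concatenated)


# Hypothetical usage with M=2 and a stand-in model that sums its inputs.
result = joint_inference([[1.0, 2.0], [3.0]], [4.0], inference_model=sum)
```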
[0135] In S201, obtain the second training sample of the current batch.
[0136] The current batch is one of the N batches.
[0137] In S202, the second training sample is input into the second feature extraction model to obtain the second embedding vector.
[0138] In S203, M first embedding vectors are extracted from the first channel corresponding to the current batch. The M first embedding vectors are obtained by each of the M unlabeled second participants by inputting the first training sample of the current batch into the first feature extraction model and then putting them into the first channel corresponding to the current batch. N batches correspond one-to-one with N first channels and N batches correspond one-to-one with N second channels.
[0139] In S204, the inference model is trained based on the second embedding vector and M first embedding vectors to obtain the first gradient information, the second gradient information and the third gradient information.
[0140] In S205, the first gradient information is placed into the second channel corresponding to the current batch, and the second gradient information is used to update the second feature extraction model, and the third gradient information is used to update the inference model.
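Steps S202 to S205 can be sketched for a single-sample batch as follows. This is a toy illustration under several assumptions that are not stated in the disclosure: the second feature extraction model is a single scalar weight `w2`, the inference model is a linear layer `v` with a squared-error loss, the first gradient information is the gradient with respect to each first embedding, and the channels are `queue.Queue` objects. The function name and the `state` layout are hypothetical.

```python
import queue
from typing import Dict, List


def first_participant_step(
    batch_id: int,
    x2: float,
    label: float,
    state: Dict[str, List[float]],
    first_channels: Dict[int, "queue.Queue[float]"],
    second_channels: Dict[int, "queue.Queue[List[float]]"],
    m: int,
    lr: float = 0.1,
) -> float:
    """One pass over S202 to S205 for a single-sample batch; returns the loss."""
    w2 = state["w2"][0]
    v = state["v"]
    # S202: second embedding vector from the second feature extraction model.
    e2 = w2 * x2
    # S203: take the M first embedding vectors out of the batch's first channel
    # (blocks until all M are available).
    e1 = [first_channels[batch_id].get() for _ in range(m)]
    # S204: forward and backward pass of the inference model.
    z = e1 + [e2]
    error = sum(vi * zi for vi, zi in zip(v, z)) - label
    grad_z = [error * vi for vi in v]
    first_grad = grad_z[:m]                 # w.r.t. each first embedding
    second_grad = grad_z[m] * x2            # w.r.t. w2
    third_grad = [error * zi for zi in z]   # w.r.t. v
    # S205: publish the first gradient information, then update local models.
    second_channels[batch_id].put(first_grad)
    state["w2"][0] = w2 - lr * second_grad
    state["v"] = [vi - lr * gi for vi, gi in zip(v, third_grad)]
    return 0.5 * error * error


# Example with M = 1: one embedding is already waiting in the first channel.
first_channels = {0: queue.Queue()}
second_channels = {0: queue.Queue()}
first_channels[0].put(1.0)
state = {"w2": [1.0], "v": [1.0, 1.0]}
loss = first_participant_step(0, 2.0, 1.0, state, first_channels, second_channels, m=1)
```

After the call, the second channel of batch 0 holds the first gradient information for the second participant to pick up whenever it next checks.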
[0141] In the above technical solution, before training the model based on vertical federated learning, the M unlabeled second participants and the labeled first participants first perform sample label alignment on the data they hold, and divide the aligned sample set into N batches respectively, and establish N first channels and N second channels corresponding one-to-one with the N batches. The first channel is used to store the embedding vector of the first training sample of the corresponding batch, and the second channel is used to store the gradient information of the first training sample of the corresponding batch. During batch training of the model, the second participant can use the first training samples of the current batch to train the first feature extraction model and obtain the first embedding vector. Then, the second participant, as the publisher, puts the first embedding vector into the first channel corresponding to the current batch. At the same time, the first participant can use the second training samples of the current batch to train the second feature extraction model and obtain the second embedding vector. Then, the first participant, as the subscriber, takes out M first embedding vectors from the first channel corresponding to the current batch and trains the inference model based on the second embedding vector and the taken M first embedding vectors to obtain the first gradient information, the second gradient information, and the third gradient information. Next, the first participant, as the publisher, puts the first gradient information into the second channel corresponding to the current batch, updates the second feature extraction model with the second gradient information, and updates the inference model with the third gradient information. 
After the second participant places the first embedding vector into the corresponding first channel, it can act as a subscriber to determine whether first gradient information exists in the N second channels. If there are non-empty second channels among the N second channels, the earliest stored first gradient information is retrieved from the non-empty second channels, and the first feature extraction model is updated using this earliest stored first gradient information. If there is no first gradient information in any of the N second channels, i.e., all N second channels are empty, then there is no need to wait for the gradient information of the first training sample of the current batch from the first participant, and the training of the first feature extraction model can continue. Thus, through the publish-subscribe architecture and channel caching mechanism, the training processes of multiple participants are decoupled, realizing asynchronous training of the model, thereby eliminating the training latency caused by synchronization dependency in vertical federated learning and improving model training efficiency. In addition, the embedding vectors and gradient information of training samples from different batches are placed into the corresponding channels of the corresponding batches, which can ensure that the sample labels are aligned at all times, thereby ensuring the accuracy of model training.
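The second participant's side of this publish-subscribe flow can be sketched as a non-blocking loop. The sketch assumes a scalar first feature extraction model `w1`, single-sample batches, channels modelled as `collections.deque` objects, and a hypothetical sequence number stored with each gradient so that the "earliest stored" entry can be identified; none of these details come from the disclosure.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

# Each second-channel entry is (sequence number, gradient w.r.t. the embedding).
GradEntry = Tuple[int, float]


def pop_earliest(
    second_channels: Dict[int, Deque[GradEntry]],
) -> Optional[Tuple[int, float]]:
    """Take the earliest-stored gradient among all non-empty second channels."""
    non_empty = [(ch[0][0], bid) for bid, ch in second_channels.items() if ch]
    if not non_empty:
        return None  # all N second channels are empty
    _, batch_id = min(non_empty)
    _, grad = second_channels[batch_id].popleft()
    return batch_id, grad


def second_participant_epoch(
    samples: List[float],
    w1: float,
    first_channels: Dict[int, Deque[float]],
    second_channels: Dict[int, Deque[GradEntry]],
    lr: float = 0.1,
) -> float:
    """Publish one embedding per batch; apply a gradient only if one is waiting."""
    for batch_id, x1 in enumerate(samples):
        first_channels[batch_id].append(w1 * x1)  # publish the first embedding
        earliest = pop_earliest(second_channels)   # subscribe without blocking
        if earliest is not None:
            grad_batch, grad = earliest
            w1 -= lr * grad * samples[grad_batch]  # chain rule through e1 = w1 * x1
        # Otherwise continue with the next batch instead of waiting.
    return w1


first = {0: deque(), 1: deque()}
second = {0: deque(), 1: deque()}
# No gradients have been published yet, so the loop never waits.
w1 = second_participant_epoch([1.0, 2.0], 1.0, first, second)
# Later, gradients arrive out of batch order; the earliest stored one wins.
second[1].append((0, 0.5))
second[0].append((1, -0.2))
earliest = pop_earliest(second)
```

Keying every channel by batch is what keeps an embedding and its gradient matched to the same samples even when they are produced and consumed at different times.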
[0142] Optionally, the first participant includes a plurality of second working nodes, and each of the second working nodes corresponds to at least one of the N batches. Each of the second working nodes is used to: acquire a second training sample of the current batch, wherein the current batch is one of the second batches, and the second batches include the at least one batch corresponding to the second working node; input the second training sample into a second feature extraction model deployed on the second working node to obtain a second embedding vector; extract M first embedding vectors from the first channel corresponding to the current batch; train the inference model deployed on the second working node based on the second embedding vector and the M first embedding vectors to obtain first gradient information, second gradient information and third gradient information; put the first gradient information into the second channel corresponding to the current batch, update the second feature extraction model deployed on the second working node using the second gradient information, and update the inference model deployed on the second working node using the third gradient information.
[0143] For example, as shown in Figure 4, the first participant includes three second working nodes, namely Worker21, Worker22, and Worker23. The second sample set is divided into two batches, batch A and batch B. The batch ID of batch A is k, and the batch ID of batch B is l. Worker22 and Worker23 both correspond to batch A (i.e., Batch ID k) and are used to train the second feature extraction model and inference model deployed on themselves using the second training samples of batch A; Worker21 corresponds to batch B (i.e., Batch ID l) and is used to train the second feature extraction model and inference model deployed on Worker21 using the second training samples of batch B.
[0144] Optionally, the first participant may also include a second parameter server. Each of the second working nodes is also used to send the updated third model parameters of the second feature extraction model and the updated fourth model parameters of the inference model to the second parameter server. The second parameter server is used to aggregate the third model parameters sent by each second working node to obtain the fifth model parameters, aggregate the fourth model parameters sent by each second working node to obtain the sixth model parameters, and send the fifth and sixth model parameters to each second working node. Each of the second working nodes is also used to update the model parameters of the second feature extraction model deployed on the second working node to the fifth model parameters, and to update the model parameters of the inference model deployed on the second working node to the sixth model parameters.
[0145] Optionally, each of the second working nodes is used to send, once every preset number of training rounds, the third and fourth model parameters updated in the most recent preset number of training rounds to the second parameter server, where the preset number is ≥ 1. The second parameter server is used to aggregate the third model parameters, updated in the most recent preset number of training rounds and sent by each second working node, to obtain the fifth model parameters, to aggregate the fourth model parameters, updated in the most recent preset number of training rounds and sent by each second working node, to obtain the sixth model parameters, and to send the fifth and sixth model parameters to each second working node.
[0146] Optionally, the preset number is positively correlated with the number of training rounds.
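The parameter-server aggregation and the round-dependent synchronization interval described in [0144] to [0146] can be sketched as follows. The disclosure only says that parameters are "aggregated" and that the interval is positively correlated with the number of training rounds; the element-wise mean and the step-wise growth schedule below are assumptions chosen for illustration, and all function names are hypothetical.

```python
from typing import List


def aggregate(parameter_sets: List[List[float]]) -> List[float]:
    """Element-wise mean of the parameters reported by each working node."""
    count = len(parameter_sets)
    return [sum(values) / count for values in zip(*parameter_sets)]


def sync_interval(rounds_done: int, base: int = 1, growth_every: int = 10) -> int:
    """Assumed schedule: the interval grows as more rounds are completed."""
    return base + rounds_done // growth_every


def should_sync(round_index: int, last_sync_round: int) -> bool:
    """True when at least one full interval has passed since the last upload."""
    return round_index - last_sync_round >= sync_interval(round_index)


# Two working nodes report their third model parameters.
fifth_parameters = aggregate([[1.0, 2.0], [3.0, 4.0]])
early_interval = sync_interval(0)
late_interval = sync_interval(25)
```

With a schedule of this kind the working nodes synchronize often early in training and less often later, which reduces communication as the local models stabilize.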
[0147] Optionally, the model training method based on longitudinal federated learning applied to the labeled first participant further includes: constructing an objective function with minimizing a first duration as the optimization objective, wherein the independent variables of the objective function include batch size, the first number of first working nodes in the second participant, and the second number of second working nodes; the first duration is the larger of the second duration of the third participant's single-batch training and the third duration of the first participant's single-batch training; the third participant is the second participant with the longest single-batch training duration among the M unlabeled second participants; the second duration is determined based on the batch size and the first number, and the third duration is determined based on the batch size and the second number; constructing a first memory constraint for the third participant based on the third participant's basic memory consumption, and constructing a second memory constraint for the first participant based on the first participant's basic memory consumption, wherein the basic memory consumption is the amount of memory occupied by the participant to maintain basic functions; constructing a batch size constraint based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant, as a constraint condition for the objective function; and solving the objective function under the constraint condition using a dynamic programming algorithm to obtain the optimal solution for the independent variables.
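The shape of this optimization problem can be illustrated with a small sketch. Note the substitutions: the disclosure solves the problem with a dynamic programming algorithm, whereas the sketch below uses a plain exhaustive search over a small grid, and the duration model, memory model and every numeric constant are hypothetical placeholders rather than values from the disclosure.

```python
from itertools import product
from typing import Optional, Tuple

# Hypothetical per-sample costs (seconds) for the two parties.
THIRD_PARTICIPANT_COST = 0.02
FIRST_PARTICIPANT_COST = 0.05


def batch_duration(batch_size: int, workers: int, cost_per_sample: float) -> float:
    """Assumed single-batch duration: work is split evenly across workers."""
    return batch_size * cost_per_sample / workers


def max_batch_size(memory_limit: float, base_memory: float, per_sample: float) -> int:
    """Largest batch that fits once the basic memory consumption is reserved."""
    return int((memory_limit - base_memory) // per_sample)


def solve(
    batch_sizes: range,
    first_counts: range,
    second_counts: range,
    cap: int,
) -> Optional[Tuple[float, int, int, int]]:
    """Exhaustive search for (first duration, batch size, first number, second number)."""
    best = None
    for b, p, q in product(batch_sizes, first_counts, second_counts):
        if b > cap:
            continue  # violates the batch size constraint
        second_duration = batch_duration(b, p, THIRD_PARTICIPANT_COST)
        third_duration = batch_duration(b, q, FIRST_PARTICIPANT_COST)
        candidate = (max(second_duration, third_duration), b, p, q)
        if best is None or candidate < best:
            best = candidate
    return best


# The batch size cap is the tighter of the two memory constraints.
cap = min(max_batch_size(1024, 256, 8), max_batch_size(2048, 512, 16))
best = solve(range(32, 129, 32), range(1, 5), range(1, 5), cap)
```

Because the first duration is the larger of the two single-batch durations, adding workers to the faster party beyond the point where it stops being the bottleneck does not improve the objective, which is why ties are broken here in favor of fewer workers.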
[0148] Optionally, the model training method based on longitudinal federated learning applied to the labeled first participant further includes: after a second preset time period has elapsed since obtaining the second embedding vector, determining whether the number of the first embedding vectors in the first channel corresponding to the current batch reaches M; if the number of the first embedding vectors in the first channel corresponding to the current batch does not reach M, then sending a second instruction to the fourth participant, wherein the second instruction is used to instruct the fourth participant to regenerate the first embedding vector, and the fourth participant includes the second participants among the M unlabeled second participants who have not put the first embedding vector into the first channel.
[0149] Optionally, the model training method based on longitudinal federated learning applied to the labeled first participant further includes: in response to receiving a first instruction, re-executing the steps from inputting the second training sample into the second feature extraction model to obtain the second embedding vector through putting the first gradient information into the second channel corresponding to the current batch, updating the second feature extraction model using the second gradient information, and updating the inference model using the third gradient information, wherein the first instruction is used to instruct the first participant to regenerate the first gradient information of the first training sample of the current batch.
[0150] The specific implementation of each step in the model training method based on vertical federated learning applied to a labeled first participant according to the embodiments of this disclosure has been described in detail in the model training method based on vertical federated learning applied to any unlabeled second participant according to the embodiments of this disclosure, and will not be repeated here.
[0151] Figure 7 is a block diagram illustrating a model training apparatus based on longitudinal federated learning applied to any unlabeled second participant, according to an exemplary embodiment. The multiple participants in the model training include a labeled first participant and M unlabeled second participants, where M ≥ 1. The data held by each participant is aligned with sample labels and then divided into N batches, N ≥ 1. The N batches correspond one-to-one with N first channels and one-to-one with N second channels. As shown in Figure 7, the model training device 300 based on longitudinal federated learning is applied to any one of the M unlabeled second participants. The device 300 includes: a first acquisition module 301, used to acquire a first training sample of the current batch, wherein the current batch is one of the N batches; a first input module 302, used to input the first training sample into a first feature extraction model to obtain a first embedding vector; a storage module 303, used to put the first embedding vector into a first channel corresponding to the current batch; a first retrieval module 304, used to, if there is a non-empty second channel among the N second channels, retrieve the earliest stored first gradient information from the non-empty second channels and update the first feature extraction model using the earliest stored first gradient information, wherein the second channel is used to store the first gradient information of the first training samples of the corresponding batch, the first gradient information is obtained by the first participant training the inference model based on the second embedding vector and M first embedding vectors taken from the first channel corresponding to the corresponding batch, and the second embedding vector is obtained by the first participant inputting the second training samples of the corresponding batch into the second feature extraction model; and a first trigger module 305, used to obtain the next batch of training samples and execute the same process if all N second channels are empty, until all N batches have been executed.
[0152] In the above technical solution, before training the model based on vertical federated learning, the M unlabeled second participants and the labeled first participants first perform sample label alignment on the data they hold, and divide the aligned sample set into N batches respectively, and establish N first channels and N second channels corresponding one-to-one with the N batches. The first channel is used to store the embedding vector of the first training sample of the corresponding batch, and the second channel is used to store the gradient information of the first training sample of the corresponding batch. During batch training of the model, the second participant can use the first training samples of the current batch to train the first feature extraction model and obtain the first embedding vector. Then, the second participant, as the publisher, puts the first embedding vector into the first channel corresponding to the current batch. At the same time, the first participant can use the second training samples of the current batch to train the second feature extraction model and obtain the second embedding vector. Then, the first participant, as the subscriber, takes out M first embedding vectors from the first channel corresponding to the current batch, and trains the inference model based on the second embedding vector and the taken M first embedding vectors to obtain the first gradient information, the second gradient information, and the third gradient information. Next, the first participant, as the publisher, puts the first gradient information into the second channel corresponding to the current batch, updates the second feature extraction model with the second gradient information, and updates the inference model with the third gradient information. 
After the second participant places the first embedding vector into the corresponding first channel, it can act as a subscriber to determine whether first gradient information exists in the N second channels. If there are non-empty second channels among the N second channels, the earliest stored first gradient information is retrieved from the non-empty second channels, and the first feature extraction model is updated using this earliest stored first gradient information. If there is no first gradient information in any of the N second channels, i.e., all N second channels are empty, then there is no need to wait for the gradient information of the first training sample of the current batch from the first participant, and the training of the first feature extraction model can continue. Thus, through the publish-subscribe architecture and channel caching mechanism, the training processes of multiple participants are decoupled, realizing asynchronous training of the model, thereby eliminating the training latency caused by synchronization dependency in vertical federated learning and improving model training efficiency. In addition, the embedding vectors and gradient information of training samples from different batches are placed into the corresponding channels of the corresponding batches, which can ensure that the sample labels are aligned at all times, thereby ensuring the accuracy of model training.
[0153] Optionally, the second participant includes a plurality of first working nodes, and each of the first working nodes corresponds to at least one of the N batches. Each of the first working nodes is configured to: acquire the first training sample of the current batch, wherein the current batch is one of the first batches, and the first batches include the at least one batch corresponding to the first working node; input the first training sample into the first feature extraction model deployed on the first working node to obtain the first embedding vector; place the first embedding vector into the first channel corresponding to the current batch; if there is a non-empty second channel among the second channels corresponding to the first batches, retrieve the earliest stored first gradient information from the non-empty second channels corresponding to the first batches and update the first feature extraction model deployed on the first working node using the earliest stored first gradient information; if all the second channels corresponding to the first batches are empty, acquire the next batch of training samples and perform the same process until the first batches are completed.
[0154] Optionally, the second participant further includes a first parameter server. Each of the first working nodes is further configured to send the updated first model parameters of the first feature extraction model to the first parameter server. The first parameter server is configured to aggregate the first model parameters sent by each of the first working nodes to obtain second model parameters, and to send the second model parameters to each of the first working nodes. Each of the first working nodes is also used to update the model parameters of the first feature extraction model deployed on the first working node to the second model parameters.
[0155] Optionally, each of the first working nodes is used to send, once every preset number of training rounds, the first model parameters updated in the most recent preset number of training rounds to the first parameter server, where the preset number is ≥ 1. The first parameter server is used to aggregate the first model parameters, updated in the most recent preset number of training rounds and sent by each first working node, to obtain second model parameters, and to send the second model parameters to each of the first working nodes.
[0156] Optionally, the preset number is positively correlated with the number of training rounds.
[0157] Optionally, when the model training device 300 based on longitudinal federated learning is applied to a third participant, the device 300 further includes: a first construction module, used to construct an objective function with minimizing a first duration as the optimization objective, wherein the third participant is the second participant with the longest single-batch training duration among the M unlabeled second participants, the independent variables of the objective function include batch size, a first number of first working nodes in the second participant, and a second number of second working nodes in the first participant, the first duration is the larger of the second duration of the single-batch training of the third participant and the third duration of the single-batch training of the first participant, the second duration is determined based on the batch size and the first number, and the third duration is determined based on the batch size and the second number; a second construction module, used to construct a first memory constraint for the third participant based on the basic memory consumption of the third participant, and to construct a second memory constraint for the first participant based on the basic memory consumption of the first participant, wherein the basic memory consumption is the memory size occupied by the participant to maintain basic functions; a third construction module, used to construct a batch size constraint based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant, as a constraint condition for the objective function; and a first solution module, used to solve the objective function under the constraint condition using a dynamic programming algorithm to obtain the optimal solution of the independent variables.
[0158] Optionally, the model training device 300 based on longitudinal federated learning applied to any unlabeled second participant further includes: a first determining module, configured to determine whether, after a first preset time period following the placement of the first embedding vector into the first channel, there exists first gradient information of the first training sample of the current batch in the second channel corresponding to the current batch; and a first sending module, configured to send a first instruction to the first participant if, in the second channel corresponding to the current batch, there is no first gradient information of the first training sample of the current batch, wherein the first instruction is configured to instruct the first participant to regenerate the first gradient information of the first training sample of the current batch.
[0159] Optionally, the model training device 300 based on longitudinal federated learning applied to any unlabeled second participant further includes: a second triggering module, configured to, in response to receiving a second instruction sent by the first participant, re-execute the steps from inputting the first training sample into the first feature extraction model to obtain the first embedding vector to placing the first embedding vector into the first channel corresponding to the current batch, wherein the second instruction is used to instruct the second participant to regenerate the first embedding vector.
[0160] Figure 8 is a block diagram illustrating a model training apparatus based on longitudinal federated learning applied to a labeled first participant, according to an exemplary embodiment. The multiple participants in the model training include a labeled first participant and M unlabeled second participants, where M ≥ 1. The data held by each participant is aligned with sample labels and then divided into N batches, N ≥ 1. The N batches correspond one-to-one with N first channels and one-to-one with N second channels. As shown in Figure 8, a model training device 400 based on longitudinal federated learning applied to a labeled first participant includes: a second acquisition module 401, used to acquire a second training sample of the current batch, wherein the current batch is one of the N batches; a second input module 402, used to input the second training sample into a second feature extraction model to obtain a second embedding vector; a second extraction module 403, used to extract M first embedding vectors from a first channel corresponding to the current batch, wherein the M first embedding vectors are obtained by each of the M unlabeled second participants by inputting the first training sample of the current batch into a local first feature extraction model and placing them into the first channel corresponding to the current batch; a training module 404, used to train an inference model based on the second embedding vector and the M first embedding vectors to obtain first gradient information, second gradient information, and third gradient information; and a first update module 405, used to place the first gradient information into the second channel corresponding to the current batch, update the second feature extraction model using the second gradient information, and update the inference model using the third gradient information.
[0161] In the above technical solution, before training the model based on vertical federated learning, the M unlabeled second participants and the labeled first participants first perform sample label alignment on the data they hold, and divide the aligned sample set into N batches respectively, and establish N first channels and N second channels corresponding one-to-one with the N batches. The first channel is used to store the embedding vector of the first training sample of the corresponding batch, and the second channel is used to store the gradient information of the first training sample of the corresponding batch. During batch training of the model, the second participant can use the first training samples of the current batch to train the first feature extraction model and obtain the first embedding vector. Then, the second participant, as the publisher, puts the first embedding vector into the first channel corresponding to the current batch. At the same time, the first participant can use the second training samples of the current batch to train the second feature extraction model and obtain the second embedding vector. Then, the first participant, as the subscriber, takes out M first embedding vectors from the first channel corresponding to the current batch and trains the inference model based on the second embedding vector and the taken M first embedding vectors to obtain the first gradient information, the second gradient information, and the third gradient information. Next, the first participant, as the publisher, puts the first gradient information into the second channel corresponding to the current batch, updates the second feature extraction model with the second gradient information, and updates the inference model with the third gradient information. 
After the second participant places the first embedding vector into the corresponding first channel, it can act as a subscriber to determine whether first gradient information exists in the N second channels. If there are non-empty second channels among the N second channels, the earliest stored first gradient information is retrieved from the non-empty second channels, and the first feature extraction model is updated using this earliest stored first gradient information. If there is no first gradient information in any of the N second channels, i.e., all N second channels are empty, then there is no need to wait for the gradient information of the first training sample of the current batch from the first participant, and the training of the first feature extraction model can continue. Thus, through the publish-subscribe architecture and channel caching mechanism, the training processes of multiple participants are decoupled, realizing asynchronous training of the model, thereby eliminating the training latency caused by synchronization dependency in vertical federated learning and improving model training efficiency. In addition, the embedding vectors and gradient information of training samples from different batches are placed into the corresponding channels of the corresponding batches, which can ensure that the sample labels are aligned at all times, thereby ensuring the accuracy of model training.
[0162] Optionally, the first participant includes a plurality of second working nodes, and each of the second working nodes corresponds to at least one of the N batches. Each of the second working nodes is configured to: acquire the second training sample of the current batch, wherein the current batch is one of the second batches, and the second batches include the at least one batch corresponding to the second working node; input the second training sample into the second feature extraction model deployed on the second working node to obtain the second embedding vector; extract the M first embedding vectors from the first channel corresponding to the current batch; train the inference model deployed on the second working node based on the second embedding vector and the M first embedding vectors to obtain the first gradient information, the second gradient information and the third gradient information; put the first gradient information into the second channel corresponding to the current batch, update the second feature extraction model deployed on the second working node using the second gradient information, and update the inference model deployed on the second working node using the third gradient information.
[0163] Optionally, the first participant further includes a second parameter server. Each of the second working nodes is further configured to send the updated third model parameters of the second feature extraction model and the updated fourth model parameters of the inference model to the second parameter server. The second parameter server is configured to aggregate the third model parameters sent by each of the second working nodes to obtain the fifth model parameters, aggregate the fourth model parameters sent by each of the second working nodes to obtain the sixth model parameters, and send the fifth model parameters and the sixth model parameters to each of the second working nodes. Each of the second working nodes is further configured to update the model parameters of the second feature extraction model deployed on the second working node to the fifth model parameters, and to update the model parameters of the inference model deployed on the second working node to the sixth model parameters.
[0164] Optionally, each of the second working nodes is used to send, once every preset number of training rounds, the third model parameters and the fourth model parameters updated in the most recent preset number of training rounds to the second parameter server, where the preset number is ≥ 1. The second parameter server is used to aggregate the third model parameters, updated in the most recent preset number of training rounds and sent by each second working node, to obtain the fifth model parameters, to aggregate the fourth model parameters, updated in the most recent preset number of training rounds and sent by each second working node, to obtain the sixth model parameters, and to send the fifth model parameters and the sixth model parameters to each of the second working nodes.
[0165] Optionally, the preset number is positively correlated with the number of training rounds.
[0166] Optionally, the model training device 400 applied to the labeled first participant based on longitudinal federated learning further includes: a fourth construction module, used to construct an objective function with minimizing a first duration as the optimization objective, wherein the independent variables of the objective function include batch size, a first number of first working nodes in the second participant, and a second number of second working nodes, the first duration is the larger of the second duration of the third participant's single-batch training and the third duration of the first participant's single-batch training, the third participant is the second participant with the longest single-batch training duration among the M unlabeled second participants, the second duration is determined based on the batch size and the first number, and the third duration is determined based on the batch size and the second number; a fifth construction module, used to construct a first memory constraint for the third participant based on its basic memory consumption, and a second memory constraint for the first participant based on its basic memory consumption, wherein the basic memory consumption is the amount of memory occupied by the participant to maintain basic functions; a sixth construction module, used to construct a batch size constraint based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant, as a constraint condition for the objective function; and a second solution module, used to solve the objective function under the constraint condition using a dynamic programming algorithm to obtain the optimal solution for the independent variables.
[0167] Optionally, the model training device 400 applied to the labeled first participant based on longitudinal federated learning further includes: a second determining module, configured to determine, after a second preset time after obtaining the second embedding vector, whether the number of the first embedding vectors in the first channel corresponding to the current batch reaches M; and a second sending module, configured to send a second instruction to the fourth participant if the number of the first embedding vectors in the first channel corresponding to the current batch does not reach M, wherein the second instruction is used to instruct the fourth participant to regenerate the first embedding vector, and the fourth participant includes the second participants among the M unlabeled second participants who have not placed the first embedding vectors in the first channel.
[0168] Optionally, the model training device 400 applied to the labeled first participant based on longitudinal federated learning further includes: a third triggering module, configured to, in response to receiving a first instruction, re-execute the steps from inputting the second training sample into the second feature extraction model to obtain the second embedding vector to placing the first gradient information into the second channel corresponding to the current batch, updating the second feature extraction model using the second gradient information, and updating the inference model using the third gradient information, wherein the first instruction is used to instruct the first participant to regenerate the first gradient information of the first training sample of the current batch.
[0169] In addition, this disclosure also provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the above-described model training method based on longitudinal federated learning applied to any unlabeled second participant, or implements the steps of the above-described model training method based on longitudinal federated learning applied to a labeled first participant.
[0170] This disclosure also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the above-described model training method based on longitudinal federated learning applied to any unlabeled second participant, or implements the steps of the above-described model training method based on longitudinal federated learning applied to a labeled first participant.
[0171] Referring now to Figure 9, a structural schematic is shown of an electronic device (e.g., a terminal device or a server) 600 suitable for implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 9 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0172] As shown in Figure 9, electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 602 or a program loaded from storage device 608 into random access memory (RAM) 603. RAM 603 also stores various programs and data required for the operation of electronic device 600. Processing device 601, ROM 602, and RAM 603 are interconnected via bus 604. Input/output (I/O) interface 605 is also connected to bus 604.
[0173] Typically, the following devices can be connected to I/O interface 605: input devices 606 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 607 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 608 including, for example, magnetic tapes, hard disks, etc.; and communication devices 609. Communication device 609 allows electronic device 600 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 9 shows an electronic device 600 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed.
[0174] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 609, or installed from a storage device 608, or installed from a ROM 602. When the computer program is executed by the processing device 601, it performs the functions defined in the methods of embodiments of this disclosure.
[0175] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0176] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol such as HTTP (Hypertext Transfer Protocol) and can interconnect with digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0177] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.
[0178] The aforementioned computer-readable medium carries one or more programs. When executed by the electronic device, the one or more programs cause the electronic device to: acquire a first training sample of the current batch, wherein the current batch is one of N batches, the multiple participants in model training include a labeled first participant and M unlabeled second participants, M ≥ 1, the data held by each of the multiple participants is divided into N batches after sample label alignment, N ≥ 1, the N batches correspond one-to-one with N first channels, and the N batches correspond one-to-one with N second channels; input the first training sample into a first feature extraction model to obtain a first embedding vector; place the first embedding vector into the first channel corresponding to the current batch; if any of the N second channels is non-empty, retrieve the earliest stored first gradient information from the non-empty second channels; update the first feature extraction model using the earliest stored first gradient information; wherein the second channel is used to store the first gradient information of the first training sample of the corresponding batch, the first gradient information is obtained by the first participant training the inference model based on the second embedding vector and M first embedding vectors retrieved from the first channel corresponding to the corresponding batch, and the second embedding vector is obtained by the first participant inputting the second training sample of the corresponding batch into the second feature extraction model; if all N second channels are empty, acquire the first training sample of the next batch and perform the same process until all N batches are completed.
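The loop executed by an unlabeled second participant can be sketched as follows. This is a minimal single-process illustration, not the disclosed implementation: channels are modeled as in-memory `deque` objects, each second-channel entry is assumed to carry a sequence number so that the "earliest stored" entry can be identified across channels, and the names `pop_earliest`, `run_second_participant`, `extract` and `update` are hypothetical. The sketch also assumes that pending gradients are consumed until every second channel is empty before the next batch is started, which is one possible reading of the text.

```python
from collections import deque


def pop_earliest(second_channels):
    """Pop the (seq, grad) entry with the smallest seq among all channel heads.

    Returns None when every second channel is empty.
    """
    heads = [(ch[0][0], idx) for idx, ch in enumerate(second_channels) if ch]
    if not heads:
        return None
    _, idx = min(heads)
    return second_channels[idx].popleft()


def run_second_participant(batches, first_channels, second_channels, extract, update):
    """Publish one embedding per batch and apply gradients as they become available.

    extract(batch) stands in for the first feature extraction model's forward pass;
    update(grad) stands in for one optimizer step on that model (both hypothetical).
    Returns the sequence numbers of the gradients in the order they were applied.
    """
    applied = []
    for i, batch in enumerate(batches):
        # Place the first embedding vector into the first channel of this batch.
        first_channels[i].append(extract(batch))
        # Apply whatever gradients are already waiting, earliest stored first.
        while True:
            entry = pop_earliest(second_channels)
            if entry is None:
                break  # all second channels empty: go on to the next batch
            seq, grad = entry
            update(grad)
            applied.append(seq)
    return applied
```

Because the loop never blocks on an empty second channel, the second participant keeps producing embeddings for later batches while the first participant is still working on earlier ones.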
[0179] Alternatively, the aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire a second training sample in the current batch, wherein the second training sample in the current batch is one of N batches, the multiple participants in model training include a labeled first participant and M unlabeled second participants, wherein M≥1, the data held by each of the multiple participants are divided into N batches after sample label alignment, N≥1, the N batches correspond one-to-one with N first channels and the N batches correspond one-to-one with N second channels; input the second training sample into a second feature extraction model to obtain a second embedding vector; M first embedding vectors are extracted from the first channel corresponding to the current batch; wherein the M first embedding vectors are obtained by each of the M unlabeled second participants by inputting the first training sample of the current batch into the local first feature extraction model and placing it into the first channel corresponding to the current batch; based on the second embedding vectors and the M first embedding vectors, the inference model is trained to obtain first gradient information, second gradient information and third gradient information; the first gradient information is placed into the second channel corresponding to the current batch, and the second feature extraction model is updated using the second gradient information, and the inference model is updated using the third gradient information.
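The step performed by the labeled first participant can be illustrated with a toy model. In this sketch the inference model is assumed to be a single linear layer over the concatenated embeddings with a squared-error loss; the disclosure does not specify the model or the loss, and the names `first_participant_step` and `train_batch` are hypothetical. Channel entries are assumed to arrive in a fixed participant order so that gradient slices can be matched back to their senders.

```python
from collections import deque


def first_participant_step(second_emb, first_embs, weights, label):
    """Return (first_grads, second_grad, weight_grad) for loss = 0.5 * (pred - label)^2."""
    z = list(second_emb)
    for emb in first_embs:
        z.extend(emb)
    pred = sum(w * x for w, x in zip(weights, z))
    err = pred - label
    grad_z = [err * w for w in weights]   # gradient w.r.t. the concatenated embeddings
    weight_grad = [err * x for x in z]    # third gradient information (inference model)
    k = len(second_emb)
    second_grad = grad_z[:k]              # second gradient information (own extractor)
    first_grads, start = [], k
    for emb in first_embs:                # first gradient information, one slice per sender
        first_grads.append(grad_z[start:start + len(emb)])
        start += len(emb)
    return first_grads, second_grad, weight_grad


def train_batch(i, second_emb, first_channels, second_channels, weights, label, m, lr=0.1):
    """Consume M embeddings for batch i, publish gradients, return updated weights.

    Returns None if fewer than m embeddings have been published for batch i.
    """
    if len(first_channels[i]) < m:
        return None
    first_embs = [first_channels[i].popleft() for _ in range(m)]
    first_grads, second_grad, weight_grad = first_participant_step(
        second_emb, first_embs, weights, label)
    second_channels[i].append(first_grads)  # place first gradient information
    new_weights = [w - lr * g for w, g in zip(weights, weight_grad)]
    return new_weights, second_grad
```

The returned `second_grad` would then be back-propagated through the second feature extraction model, which is omitted here.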
[0180] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0181] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0182] The modules described in the embodiments of this disclosure can be implemented in software or in hardware. The names of the modules are not necessarily limiting in certain circumstances; for example, the first acquisition module can also be described as "the module for acquiring the first training sample of the current batch".
[0183] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.
[0184] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0185] According to one or more embodiments of this disclosure, Example 1 provides a model training method based on longitudinal federated learning. Multiple participants in the model training include a labeled first participant and M unlabeled second participants, where M ≥ 1. The data held by each participant is aligned with sample labels and then divided into N batches, N ≥ 1. The N batches correspond one-to-one with N first channels and one-to-one with N second channels. The method is applied to any one of the M unlabeled second participants. The method includes: obtaining a first training sample of the current batch, wherein the current batch is one of the N batches; inputting the first training sample into a first feature extraction model to obtain a first embedding vector; placing the first embedding vector into the first channel corresponding to the current batch; if there is a non-empty second channel among the N second channels, taking the earliest stored first gradient information from the non-empty second channels; updating the first feature extraction model using the earliest stored first gradient information; wherein the second channel is used to store the first gradient information of the first training sample of the corresponding batch, the first gradient information is obtained by the first participant training the inference model based on the second embedding vector and M first embedding vectors taken from the first channel corresponding to the corresponding batch, and the second embedding vector is obtained by the first participant inputting the second training sample of the corresponding batch into the second feature extraction model; if all N second channels are empty, obtaining the first training sample of the next batch and performing the same process until all N batches are completed.
[0186] According to one or more embodiments of this disclosure, Example 2 provides the method of Example 1, wherein the second participant includes a number of first working nodes, and each of the first working nodes corresponds to at least one of the N batches; each of the first working nodes is configured to: acquire the first training sample of the current batch, wherein the current batch is one of the first batches, and the first batches include the at least one batch corresponding to the first working node; input the first training sample into the first feature extraction model deployed on the first working node to obtain the first embedding vector; place the first embedding vector into the first channel corresponding to the current batch; if there is a non-empty second channel among the second channels corresponding to the first batches, retrieve the earliest stored first gradient information from the non-empty second channels corresponding to the first batches; update the first feature extraction model deployed on the first working node using the earliest stored first gradient information; if all the second channels corresponding to the first batches are empty, acquire the first training sample of the next batch and perform the same process until the first batches are completed.
[0187] According to one or more embodiments of this disclosure, Example 3 provides the method of Example 2, wherein the second participant further includes a first parameter server; each of the first working nodes is further configured to send the updated first model parameters of the first feature extraction model to the first parameter server; the first parameter server is configured to aggregate the first model parameters sent by each of the first working nodes to obtain second model parameters, and to send the second model parameters to each of the first working nodes; each of the first working nodes is further configured to update the model parameters of the first feature extraction model deployed on the first working node to the second model parameters.
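The worker and parameter-server exchange in Example 3 can be sketched as below. The disclosure only says that the parameters are "aggregated"; element-wise averaging is assumed here as one common choice, and the names `aggregate` and `sync_workers`, as well as the dictionary layout of a worker, are hypothetical.

```python
def aggregate(params_per_worker):
    """Element-wise mean of the parameter vectors sent by the working nodes."""
    n = len(params_per_worker)
    return [sum(column) / n for column in zip(*params_per_worker)]


def sync_workers(workers):
    """Aggregate every worker's parameters and write the result back to each worker."""
    merged = aggregate([w["params"] for w in workers])
    for w in workers:
        w["params"] = list(merged)  # each node gets its own copy
    return merged


workers = [{"params": [1.0, 2.0]}, {"params": [3.0, 6.0]}]
second_model_parameters = sync_workers(workers)  # [2.0, 4.0]
```

After the call, every working node holds the same second model parameters and continues training from them.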
[0188] According to one or more embodiments of this disclosure, Example 4 provides the method of Example 3, wherein each of the first working nodes is configured to send, once every fixed interval of training rounds, the first model parameters updated in the most recent interval of training rounds to the first parameter server, the interval being ≥ 1; the first parameter server is configured to aggregate the first model parameters most recently sent by each of the first working nodes to obtain second model parameters, and to send the second model parameters to each of the first working nodes.
[0189] According to one or more embodiments of this disclosure, Example 5 provides the method of Example 4, wherein the interval of training rounds between successive uploads to the first parameter server is positively correlated with the number of training rounds.
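One way to make the upload interval grow with training progress is a simple step schedule, sketched below. The disclosure states only that the interval is positively correlated with the number of training rounds; the particular formula and the names `sync_interval`, `sync_rounds`, `base` and `growth_every` are assumptions made for illustration.

```python
def sync_interval(rounds_done, base=1, growth_every=10):
    """Interval (in rounds) until the next upload; grows as training progresses."""
    return base + rounds_done // growth_every


def sync_rounds(total_rounds, base=1, growth_every=10):
    """List the rounds at which a working node uploads its parameters."""
    uploads = []
    next_sync = sync_interval(0, base, growth_every)
    for r in range(1, total_rounds + 1):
        if r == next_sync:
            uploads.append(r)
            next_sync = r + sync_interval(r, base, growth_every)
    return uploads


print(sync_rounds(12, base=1, growth_every=4))  # [1, 2, 3, 4, 6, 8, 11]
```

Uploads are frequent early on and become sparser later, which reduces communication with the parameter server as training proceeds.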
[0190] According to one or more embodiments of this disclosure, Example 6 provides the method of Example 2. When the method is applied to a third participant, the method further includes: constructing an objective function with minimizing a first duration as the optimization objective, wherein the third participant is the second participant with the longest single-batch training duration among the M unlabeled second participants, the independent variables of the objective function include the batch size, a first number of first working nodes in the second participant, and a second number of second working nodes in the first participant, the first duration is the maximum of a second duration of single-batch training of the third participant and a third duration of single-batch training of the first participant, the second duration is determined based on the batch size and the first number, and the third duration is determined based on the batch size and the second number; constructing a first memory constraint for the third participant based on its basic memory consumption, and a second memory constraint for the first participant based on its basic memory consumption, wherein the basic memory consumption is the amount of memory occupied by a participant to maintain basic functions; constructing a constraint on the batch size based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant, as a constraint condition for the objective function; and solving the objective function under the constraint condition using a dynamic programming algorithm to obtain the optimal solution for the independent variables.
[0191] According to one or more embodiments of this disclosure, Example 7 provides the method of Example 1, the method further comprising: after a first preset time period following the placement of the first embedding vector into the first channel, determining whether there is first gradient information of the first training sample of the current batch in the second channel corresponding to the current batch; if there is no first gradient information of the first training sample of the current batch in the second channel corresponding to the current batch, sending a first instruction to the first participant, wherein the first instruction is used to instruct the first participant to regenerate the first gradient information of the first training sample of the current batch.
[0192] According to one or more embodiments of this disclosure, Example 8 provides the method of Example 1, the method further comprising: in response to receiving a second instruction sent by the first participant, re-executing the step of inputting the first training sample into a first feature extraction model to obtain a first embedding vector to the step of placing the first embedding vector into a first channel corresponding to the current batch, wherein the second instruction is used to instruct the second participant to regenerate the first embedding vector.
[0193] According to one or more embodiments of this disclosure, Example 9 provides a model training method based on longitudinal federated learning. Multiple participants in the model training include a labeled first participant and M unlabeled second participants, where M ≥ 1. The data held by each participant is aligned with sample labels and then divided into N batches, where N ≥ 1. The N batches correspond one-to-one with N first channels and one-to-one with N second channels. The method is applied to the first participant and includes: obtaining a second training sample of the current batch, wherein the current batch is one of the N batches; inputting the second training sample into a second feature extraction model to obtain a second embedding vector; extracting M first embedding vectors from the first channel corresponding to the current batch, wherein the M first embedding vectors are obtained by each of the M unlabeled second participants inputting the first training sample of the current batch into the local first feature extraction model and are placed by them into the first channel corresponding to the current batch; training the inference model based on the second embedding vector and the M first embedding vectors to obtain first gradient information, second gradient information and third gradient information; placing the first gradient information into the second channel corresponding to the current batch, updating the second feature extraction model using the second gradient information, and updating the inference model using the third gradient information.
[0194] According to one or more embodiments of this disclosure, Example 10 provides the method of Example 9, wherein the first participant includes a number of second working nodes, and each of the second working nodes corresponds to at least one of the N batches; each of the second working nodes is configured to: acquire the second training sample of the current batch, wherein the current batch is one of the second batches, and the second batches include the at least one batch corresponding to the second working node; input the second training sample into the second feature extraction model deployed on the second working node to obtain the second embedding vector; extract the M first embedding vectors from the first channel corresponding to the current batch; train the inference model deployed on the second working node based on the second embedding vector and the M first embedding vectors to obtain the first gradient information, the second gradient information and the third gradient information; place the first gradient information into the second channel corresponding to the current batch, update the second feature extraction model deployed on the second working node using the second gradient information, and update the inference model deployed on the second working node using the third gradient information.
[0195] According to one or more embodiments of this disclosure, Example 11 provides the method of Example 10, wherein the first participant further includes a second parameter server; each of the second working nodes is further configured to send the updated third model parameters of the second feature extraction model and the updated fourth model parameters of the inference model to the second parameter server; the second parameter server is configured to aggregate the third model parameters sent by each of the second working nodes to obtain the fifth model parameters, aggregate the fourth model parameters sent by each of the second working nodes to obtain the sixth model parameters, and send the fifth model parameters and the sixth model parameters to each of the second working nodes; each of the second working nodes is further configured to update the model parameters of the second feature extraction model deployed on the second working node to the fifth model parameters, and to update the model parameters of the inference model deployed on the second working node to the sixth model parameters.
[0196] According to one or more embodiments of this disclosure, Example 12 provides the method of Example 11, wherein each of the second working nodes is configured to send, once every fixed interval of training rounds, the third model parameters and the fourth model parameters updated in the most recent interval of training rounds to the second parameter server, the interval being ≥ 1; the second parameter server is configured to aggregate the third model parameters most recently sent by each of the second working nodes to obtain the fifth model parameters, aggregate the fourth model parameters most recently sent by each of the second working nodes to obtain the sixth model parameters, and send the fifth model parameters and the sixth model parameters to each of the second working nodes.
[0197] According to one or more embodiments of this disclosure, Example 13 provides the method of Example 12, wherein the interval of training rounds between successive uploads to the second parameter server is positively correlated with the number of training rounds.
[0198] According to one or more embodiments of this disclosure, Example 14 provides the method of Example 10, the method further comprising:
[0199] An objective function is constructed with minimizing a first training duration as the optimization objective. The independent variables of the objective function include the batch size, the first number of first working nodes in the second participant, and the second number of second working nodes in the first participant. The first training duration is the maximum of the second training duration of the third participant for a single batch and the third training duration of the first participant for a single batch. The third participant is the second participant with the longest single-batch training duration among the M unlabeled second participants. The second training duration is determined based on the batch size and the first number, and the third training duration is determined based on the batch size and the second number. A first memory constraint is constructed for the third participant based on its basic memory consumption, and a second memory constraint is constructed for the first participant based on its basic memory consumption. The basic memory consumption is the amount of memory occupied by a participant to maintain basic functions. A constraint on the batch size is constructed based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant, serving as a constraint condition for the objective function. The objective function is solved under this constraint using a dynamic programming algorithm to obtain the optimal solution for the independent variables.
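To make the shape of this optimization concrete, the sketch below solves a tiny instance by exhaustive enumeration. Note that this is not the dynamic programming algorithm named in the disclosure; brute force is used only because it is easy to verify on a small grid. The duration models `t2` and `t3`, the memory model in `feasible`, and all numeric constants are invented for illustration and are not taken from the disclosure.

```python
from itertools import product


def solve(batch_sizes, k1_options, k2_options, t2, t3, feasible):
    """Return (duration, batch_size, k1, k2) minimizing max(t2, t3), or None."""
    best = None
    for b, k1, k2 in product(batch_sizes, k1_options, k2_options):
        if not feasible(b):
            continue
        duration = max(t2(b, k1), t3(b, k2))  # first duration
        if best is None or duration < best[0]:
            best = (duration, b, k1, k2)
    return best


# Hypothetical duration models: compute time shrinks with more working nodes,
# while a coordination overhead grows with the number of nodes.
def t2(b, k1):
    return b / k1 + 2 * k1      # third participant, single batch


def t3(b, k2):
    return 2 * b / k2 + k2      # first participant, single batch


# Hypothetical memory model: basic consumption plus a per-sample cost must fit the limit.
def feasible(b):
    fits_third = 8 + 0.5 * b <= 40    # first memory constraint
    fits_first = 16 + 1.0 * b <= 48   # second memory constraint
    return fits_third and fits_first


print(solve([32, 64], range(1, 9), range(1, 9), t2, t3, feasible))
# (16.0, 32, 4, 8)
```

In this toy instance the batch size 64 is ruled out by the first participant's memory limit, and the best remaining configuration balances the two parties so that neither waits on the other.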
[0200] According to one or more embodiments of this disclosure, Example 15 provides the method of Example 9, the method further comprising: after a second preset time period following obtaining the second embedding vector, determining whether the number of the first embedding vectors in the first channel corresponding to the current batch reaches M; if the number of the first embedding vectors in the first channel corresponding to the current batch does not reach M, sending a second instruction to a fourth participant, wherein the second instruction is used to instruct the fourth participant to regenerate the first embedding vector, and the fourth participant includes those second participants among the M unlabeled second participants that have not placed the first embedding vector in the first channel.
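The timeout check in Example 15 can be sketched as follows. It assumes that each entry in a first channel is tagged with the identifier of the participant that produced it, which the disclosure does not state; the names `find_missing`, `check_after_timeout` and `send_second_instruction` are hypothetical, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


def find_missing(channel, participant_ids):
    """Return the participants that have not yet placed an embedding in the channel.

    channel is an iterable of (participant_id, embedding) pairs for one batch.
    """
    present = {pid for pid, _ in channel}
    return [pid for pid in participant_ids if pid not in present]


def check_after_timeout(channel, participant_ids, wait_seconds,
                        send_second_instruction, sleep=time.sleep):
    """Wait for the preset period, then ask every missing participant to regenerate."""
    sleep(wait_seconds)
    missing = find_missing(channel, participant_ids)
    for pid in missing:
        send_second_instruction(pid)
    return missing
```

An empty return value means all M embeddings have arrived and training on the current batch can proceed.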
[0201] According to one or more embodiments of this disclosure, Example 16 provides the method of Example 9, the method further comprising: in response to receiving a first instruction, re-executing the steps of inputting the second training sample into the second feature extraction model to obtain a second embedding vector to the steps of placing the first gradient information into the second channel corresponding to the current batch, updating the second feature extraction model using the second gradient information, and updating the inference model using the third gradient information, wherein the first instruction is used to instruct the first participant to regenerate the first gradient information of the first training sample of the current batch.
[0202] According to one or more embodiments of the present disclosure, Example 17 provides a computer-readable medium having a computer program stored thereon that, when executed by a processing device, implements the steps of the method described in any one of Examples 1-16.
[0203] According to one or more embodiments of this disclosure, Example 18 provides an electronic device including: a storage device having a computer program stored thereon; and a processing device for executing the computer program in the storage device to implement the steps of the method of any one of Examples 1-16.
[0204] According to one or more embodiments of the present disclosure, Example 19 provides a computer program product including a computer program that, when executed by a processor, implements the steps of the method described in any one of Examples 1-16.
[0205] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept, for example, technical solutions formed by interchanging the above features with (but not limited to) technical features disclosed in this disclosure that have similar functions.
[0206] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.
[0207] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative forms of implementing the claims. Regarding the apparatus in the above embodiments, the specific manner in which the various modules perform their operations has been described in detail in the embodiments relating to the method, and will not be elaborated upon here.
Claims
1. A model training method based on longitudinal federated learning, characterized in that the multiple participants in model training include a labeled first participant and M unlabeled second participants, where M≥1; the data held by each participant is aligned by sample identifier and then divided into N batches according to a batch size, where N≥1; the N batches correspond one-to-one with N first channels and one-to-one with N second channels; the first channels and the second channels are configured with buffers, the buffer of a first channel being used to store embedding vectors and the buffer of a second channel being used to store gradient information; the batch size is determined according to the system configuration file of the multiple participants, the system configuration file including model information and hardware capability information, and the hardware capability information including communication bandwidth and memory limits; the method is applied to any one of the M unlabeled second participants, and the method comprises: obtaining a first training sample of a current batch, wherein the current batch is one of the N batches; inputting the first training sample into a first feature extraction model to obtain a first embedding vector; placing, by the second participant acting as a publisher, the first embedding vector into the first channel corresponding to the current batch; determining, by the second participant acting as a subscriber, whether the N second channels are empty; if there is a non-empty second channel among the N second channels, retrieving the earliest stored first gradient information from the non-empty second channel and updating the first feature extraction model using the earliest stored first gradient information, wherein each second channel is used to store the first gradient information of the first training sample of the corresponding batch, the first gradient information is obtained by the first participant training an inference model based on a second embedding vector and the M first embedding vectors retrieved from the first channel of the corresponding batch, and the second embedding vector is obtained by the first participant inputting a second training sample of the corresponding batch into a second feature extraction model; and if all N second channels are empty, executing the same process for the training sample of the next batch, until all N batches have been completed.
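By way of non-limiting illustration, the publisher and subscriber roles of a second participant recited in claim 1 may be sketched as follows. The in-memory `Channel` class, the callbacks `extract` and `apply_gradient`, and the choice to consume at most one gradient per batch from the lowest-indexed non-empty second channel are assumptions introduced only for this sketch.

```python
from collections import deque


class Channel:
    """Buffered first-in, first-out channel (assumed in-memory stand-in)."""

    def __init__(self):
        self._buffer = deque()

    def publish(self, item):
        self._buffer.append(item)

    def is_empty(self):
        return not self._buffer

    def take_oldest(self):
        return self._buffer.popleft()


def second_participant_pass(batches, first_channels, second_channels, extract, apply_gradient):
    """One pass of an unlabeled second participant over its N batches."""
    for index, sample in enumerate(batches):
        # Publisher role: place the first embedding vector in this batch's first channel.
        first_channels[index].publish(extract(sample))
        # Subscriber role: look for any second channel that already holds a gradient.
        for channel in second_channels:
            if not channel.is_empty():
                apply_gradient(channel.take_oldest())
                break
        # If every second channel is empty, continue with the next batch without waiting.
```

Because the loop never blocks on an empty second channel, the second participant keeps producing embeddings while the first participant is still training, which is the decoupling the channels are intended to provide.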
2. The method according to claim 1, characterized in that the second participant includes at least one first working node, each of the first working nodes corresponding to at least one of the N batches; each of the first working nodes is configured to: obtain the first training sample of the current batch, wherein the current batch is one of the first batches, and the first batches include the at least one batch corresponding to the first working node; input the first training sample into the first feature extraction model deployed on the first working node to obtain the first embedding vector; place the first embedding vector into the first channel corresponding to the current batch; if there is a non-empty second channel among the second channels corresponding to the first batches, retrieve the earliest stored first gradient information from that non-empty second channel and update the first feature extraction model deployed on the first working node using the earliest stored first gradient information; and if the second channels corresponding to the first batches are all empty, execute the same process for the training sample of the next batch, until the first batches have been completed.
3. The method according to claim 2, characterized in that the second participant further includes a first parameter server; each of the first working nodes is further configured to send the updated first model parameters of the first feature extraction model to the first parameter server; the first parameter server is configured to aggregate the first model parameters sent by the first working nodes to obtain second model parameters, and to send the second model parameters to each of the first working nodes; and each of the first working nodes is further configured to update the model parameters of the first feature extraction model deployed on that first working node to the second model parameters.
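By way of non-limiting illustration, the aggregation performed by the first parameter server may be sketched as follows. The claim does not specify the aggregation rule; an element-wise mean is assumed here, and the representation of each working node as a dictionary with a `params` list, as well as the function names, are hypothetical.

```python
def aggregate_parameters(worker_params):
    """Element-wise mean of the parameter vectors sent by the working nodes."""
    if not worker_params:
        raise ValueError("no parameters to aggregate")
    count = len(worker_params)
    return [sum(values) / count for values in zip(*worker_params)]


def synchronize(workers):
    """Aggregate the first model parameters and push the result to every node."""
    second_params = aggregate_parameters([w["params"] for w in workers])
    for worker in workers:
        # Each node replaces its local parameters with the aggregated ones.
        worker["params"] = list(second_params)
    return second_params
```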
4. The method according to claim 3, characterized in that each of the first working nodes is configured to send to the first parameter server, once every given number of training rounds, the first model parameters updated in the most recent such training rounds, that number being ≥1; and the first parameter server is configured to aggregate the first model parameters, updated in the most recent such training rounds, sent by the first working nodes to obtain the second model parameters, and to send the second model parameters to each of the first working nodes.
5. The method according to claim 4, characterized in that the number of training rounds between successive transmissions to the first parameter server is positively correlated with the number of training rounds.
6. The method according to claim 2, characterized in that, when the method is applied to a third participant, the method further includes: constructing an objective function with minimization of a first training duration as the optimization objective, wherein the third participant is the second participant with the longest single-batch training duration among the M unlabeled second participants; the independent variables of the objective function include the batch size, the first number of first working nodes in the second participant, and the second number of second working nodes in the first participant; the first training duration is the maximum of the second training duration of the third participant in a single batch and the third training duration of the first participant in a single batch; the second training duration is determined based on the batch size and the first number, and the third training duration is determined based on the batch size and the second number; constructing a first memory constraint for the third participant based on the basic memory consumption of the third participant, and constructing a second memory constraint for the first participant based on the basic memory consumption of the first participant, wherein the basic memory consumption is the amount of memory occupied by a participant to maintain basic functions; constructing a constraint on the batch size, as the constraint condition for the objective function, based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant; and solving the objective function under the constraint condition using a dynamic programming algorithm to obtain the optimal solution for the independent variables.
7. The method according to claim 1, characterized in that, The method further includes: After a first preset time after the first embedding vector is placed into the first channel, it is determined whether the first gradient information of the first training sample of the current batch exists in the second channel corresponding to the current batch; If the first gradient information of the first training sample of the current batch is not present in the second channel corresponding to the current batch, a first instruction is sent to the first participant, wherein the first instruction is used to instruct the first participant to regenerate the first gradient information of the first training sample of the current batch.
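By way of non-limiting illustration, the timeout handling recited in claim 7 may be sketched as follows. The sketch uses a blocking read with a timeout from Python's standard `queue.Queue`, so it returns early if the gradient arrives before the first preset time and consumes the gradient rather than only checking for its presence; both behaviors, together with the callback `send_first_instruction` and the function name, are assumptions introduced only for this sketch.

```python
import queue


def fetch_gradient_or_request(second_channel, first_preset_time_s, send_first_instruction):
    """Wait up to the first preset time for the current batch's gradient.

    Returns the first gradient information if it arrives in time; otherwise
    asks the first participant to regenerate it and returns None.
    """
    try:
        return second_channel.get(timeout=first_preset_time_s)
    except queue.Empty:
        # No gradient for the current batch: issue the first instruction.
        send_first_instruction()
        return None
```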
8. The method according to claim 1, characterized in that, The method further includes: In response to receiving a second instruction from the first participant, the steps from inputting the first training sample into the first feature extraction model to obtain the first embedding vector to placing the first embedding vector into the first channel corresponding to the current batch are re-executed, wherein the second instruction is used to instruct the second participant to regenerate the first embedding vector.
9. A model training method based on longitudinal federated learning, characterized in that the multiple participants in model training include a labeled first participant and M unlabeled second participants, where M≥1; the data held by each participant is aligned by sample identifier and then divided into N batches according to a batch size, where N≥1; the N batches correspond one-to-one with N first channels and one-to-one with N second channels; the first channels and the second channels are configured with buffers, the buffer of a first channel being used to store embedding vectors and the buffer of a second channel being used to store gradient information; the batch size is determined according to the system configuration file of the multiple participants, the system configuration file including model information and hardware capability information, and the hardware capability information including communication bandwidth and memory limits; the method is applied to the first participant, and the method comprises: obtaining a second training sample of a current batch, wherein the current batch is one of the N batches; inputting the second training sample into a second feature extraction model to obtain a second embedding vector; extracting, by the first participant acting as a subscriber, M first embedding vectors from the first channel corresponding to the current batch, wherein the M first embedding vectors are obtained by each of the M unlabeled second participants inputting the first training sample of the current batch into a local first feature extraction model, and are placed by the second participants into the first channel corresponding to the current batch; training an inference model based on the second embedding vector and the M first embedding vectors to obtain first gradient information, second gradient information and third gradient information; and placing, by the first participant acting as a publisher, the first gradient information into the second channel corresponding to the current batch, updating the second feature extraction model using the second gradient information, and updating the inference model using the third gradient information.
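By way of non-limiting illustration, the training step of the first participant recited in claim 9 may be sketched as follows. The claim does not specify the inference model or the loss; a linear model with a squared-error loss, a batch consisting of a single sample, and the use of Python's standard `queue.Queue` for both channels are assumptions introduced only for this sketch, as is the function name.

```python
import queue


def first_participant_step(label, second_embedding, first_channel, second_channel, weights, m):
    """Train on one batch once all M first embedding vectors are available.

    Returns (second_gradient, third_gradient), or None if embeddings are missing.
    """
    if first_channel.qsize() < m:
        return None
    # Subscriber role: take the M first embedding vectors for this batch.
    first_embeddings = [first_channel.get_nowait() for _ in range(m)]
    features = list(second_embedding)
    for embedding in first_embeddings:
        features.extend(embedding)
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - label  # derivative of 0.5 * error**2 w.r.t. the prediction
    input_gradients = [error * w for w in weights]
    third_gradient = [error * x for x in features]  # w.r.t. the inference-model weights
    split = len(second_embedding)
    second_gradient = input_gradients[:split]       # w.r.t. the second embedding vector
    first_gradient, offset = [], split
    for embedding in first_embeddings:              # one slice per second participant
        first_gradient.append(input_gradients[offset:offset + len(embedding)])
        offset += len(embedding)
    # Publisher role: place the first gradient information in the second channel.
    second_channel.put(first_gradient)
    return second_gradient, third_gradient
```

The returned second and third gradients would then be used locally to update the second feature extraction model and the inference model, respectively; those update rules are outside this sketch.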
10. The method according to claim 9, characterized in that the first participant includes at least one second working node, each of the second working nodes corresponding to at least one of the N batches; each of the second working nodes is configured to: obtain the second training sample of the current batch, wherein the current batch is one of the second batches, and the second batches include the at least one batch corresponding to the second working node; input the second training sample into the second feature extraction model deployed on the second working node to obtain the second embedding vector; extract the M first embedding vectors from the first channel corresponding to the current batch; train the inference model deployed on the second working node, based on the second embedding vector and the M first embedding vectors, to obtain the first gradient information, the second gradient information and the third gradient information; and place the first gradient information into the second channel corresponding to the current batch, update the second feature extraction model deployed on the second working node using the second gradient information, and update the inference model deployed on the second working node using the third gradient information.
11. The method according to claim 10, characterized in that the first participant further includes a second parameter server; each of the second working nodes is further configured to send the third model parameters of the updated second feature extraction model and the fourth model parameters of the updated inference model to the second parameter server; the second parameter server is configured to aggregate the third model parameters sent by the second working nodes to obtain fifth model parameters, to aggregate the fourth model parameters sent by the second working nodes to obtain sixth model parameters, and to send the fifth model parameters and the sixth model parameters to each of the second working nodes; and each of the second working nodes is further configured to update the model parameters of the second feature extraction model deployed on that second working node to the fifth model parameters, and to update the model parameters of the inference model deployed on that second working node to the sixth model parameters.
12. The method according to claim 11, characterized in that each of the second working nodes is configured to send to the second parameter server, once every given number of training rounds, the third model parameters and the fourth model parameters updated in the most recent such training rounds, that number being ≥1; and the second parameter server is configured to aggregate the third model parameters, updated in the most recent such training rounds, sent by the second working nodes to obtain the fifth model parameters, to aggregate the fourth model parameters, updated in the most recent such training rounds, sent by the second working nodes to obtain the sixth model parameters, and to send the fifth model parameters and the sixth model parameters to each of the second working nodes.
13. The method according to claim 12, characterized in that the number of training rounds between successive transmissions to the second parameter server is positively correlated with the number of training rounds.
14. The method according to claim 10, characterized in that the method further includes: constructing an objective function with minimization of a first duration as the optimization objective, wherein the independent variables of the objective function include the batch size, the first number of first working nodes in the second participant, and the second number of second working nodes in the first participant; the first duration is the maximum of the second duration of single-batch training by the third participant and the third duration of single-batch training by the first participant; the third participant is the second participant with the longest single-batch training duration among the M unlabeled second participants; the second duration is determined based on the batch size and the first number, and the third duration is determined based on the batch size and the second number; constructing a first memory constraint for the third participant based on the basic memory consumption of the third participant, and constructing a second memory constraint for the first participant based on the basic memory consumption of the first participant, wherein the basic memory consumption is the amount of memory occupied by a participant to maintain basic functions; constructing a constraint on the batch size, as the constraint condition for the objective function, based on the first memory constraint, the second memory constraint, the basic memory consumption of the third participant, and the basic memory consumption of the first participant; and solving the objective function under the constraint condition using a dynamic programming algorithm to obtain the optimal solution for the independent variables.
15. The method according to claim 9, characterized in that the method further includes: after a second preset time period has elapsed since the second embedding vector was obtained, determining whether the number of first embedding vectors in the first channel corresponding to the current batch has reached M; and if the number of first embedding vectors in the first channel corresponding to the current batch is less than M, sending a second instruction to a fourth participant, wherein the second instruction is used to instruct the fourth participant to regenerate the first embedding vector, and the fourth participant includes each second participant, among the M unlabeled second participants, that has not placed the first embedding vector into the first channel.
16. The method according to claim 9, characterized in that, The method further includes: In response to receiving a first instruction, the steps of inputting the second training sample into the second feature extraction model to obtain the second embedding vector are re-executed, up to the steps of putting the first gradient information into the second channel corresponding to the current batch, updating the second feature extraction model using the second gradient information, and updating the inference model using the third gradient information, wherein the first instruction is used to instruct the first participant to regenerate the first gradient information of the first training sample of the current batch.
17. A computer-readable medium having a computer program stored thereon, characterized in that, When executed by a processing device, the computer program performs the steps of the method described in any one of claims 1-16.
18. An electronic device, characterized by comprising: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to implement the steps of the method according to any one of claims 1-16.
19. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1-16.