Self-supervised longitudinal federated learning method based on dual perception collaboration and semantic calibration
By employing a self-supervised longitudinal federated learning method with dual-perception collaboration and semantic calibration, this approach addresses the semantic misalignment problem caused by label scarcity in longitudinal federated learning, improves model performance, reduces privacy leakage risks, and achieves efficient semantic information fusion and localized calibration.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV OF FINANCE & ECONOMICS
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
Figure CN122242812A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of federated learning, and specifically relates to a self-supervised longitudinal (vertical) federated learning method based on dual-perception collaboration and semantic calibration.
Background Technology
[0002] Vertical Federated Learning (VFL), as an important research direction in Federated Learning (FL), aims to explore a privacy-preserving distributed learning method. In this method, all participants share the same sample identity (ID) space but possess different feature spaces. By utilizing complementary multi-source data, VFL can collaboratively train joint VFL models without compromising privacy, thereby addressing real-world needs and finding wide application in cross-domain scenarios.
[0003] Traditional supervised VFL research typically focuses on joint model training using samples aligned across participants. In real-world scenarios, however, participants draw their data from different sources, so aligned samples are sparse, which severely limits the applicability of traditional methods. To address this issue, recent supervised VFL research has increasingly turned to training with the participants' unaligned samples. These studies, however, rely too heavily on the semantic information provided by the active party's labels and neglect the scarcity of labels in real-world scenarios. This scarcity stems from the fact that different participants hold different feature spaces: although collaboration among participants expands the feature space, the labels held by the active party may no longer apply to the expanded feature space, and relabeling is costly. Consequently, when participants learn representations individually from their own private unaligned samples under label-scarce conditions, the significant differences in feature distribution among participants (the so-called domain shift) exacerbate the semantic misalignment in these methods, leading to poor model performance.
[0004] Existing techniques transfer centralized self-supervised learning (SSL) methods to VFL scenarios. By exploiting a large number of unlabeled samples, they reduce the dependence of multi-source semantic representation learning on labeled resources and thus alleviate semantic misalignment to some extent. Despite this progress, simply transferring centralized SSL methods still faces two unavoidable problems. First, existing vertical federated self-supervised learning (VFSSL) frameworks still employ simple information exchange mechanisms, such as directly exchanging representations between participants or averaging shared representations. These mechanisms fail to account for severe domain shift, resulting in poor semantic fusion between different sources and weakening the effectiveness of centralized SSL methods in vertical multi-source semantic representation learning. Second, centralized SSL methods capture semantic information for representation learning only through instance-level perception. When such methods are transferred to VFL scenarios, where data from different sources cannot be directly shared, this single perception cannot capture the deep correlations and inter-domain knowledge between distributed data. These two issues make it difficult to resolve the semantic misalignment caused by severe domain shift, ultimately limiting the overall performance of VFSSL frameworks.
[0005] Currently, most supervised federated learning research focuses on improving the effectiveness of SSL in multi-source semantic representation learning. For example, constructing a correlation matrix and leveraging dimension-level perception to facilitate SSL captures additional semantic information and effectively alleviates semantic misalignment. However, these methods incur significant computational and communication overhead because the correlation matrix must be constructed, and their implementation still relies on the supervised paradigm and depends heavily on label resources. They are therefore difficult to apply directly to VFSSL frameworks designed for label-scarce scenarios.
Summary of the Invention
[0006] The purpose of this invention is to provide a self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration, which can train a high-performance VFL model in VFL scenarios with scarce labels and effectively reduce the risk of privacy leakage caused by advanced label inference attacks.
[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0008] A self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration, in which multiple participants and one server collaboratively train a joint longitudinal federated learning model, and each participant holds a collaborative model, a local model, and a local momentum model, the method including:
[0009] Each participant obtains a collaborative representation based on the aligned image samples through the collaborative model and uploads it to the server. The server integrates the collaborative representations of each participant and the average value of the collaborative representations to obtain a federated representation and distributes it to each participant.
[0010] Each participant constructs positive instance pairs and positive dimension pairs based on its local collaborative representation and the received federated representation, and calculates instance-level perceptual self-supervised learning loss and dimension-level perceptual self-supervised learning loss to update the collaborative model and complete one collaborative training.
[0011] Each participant augments its local private image sample data into two augmented views. The local model and the local momentum model are used to process the two augmented views respectively and calculate the symmetric local self-supervised learning loss. The local model and the collaborative model are used to process the two augmented views respectively and calculate the attention knowledge transfer loss. The local model is updated according to the loss, and the local momentum model is updated using the updated local model to complete one local training.
[0012] After alternating between collaborative training and local training, the local models of each participant are taken, and the local encoders in the local models of each participant are combined with the downstream classifier held by the active party to form a joint longitudinal federated learning model. Aligned and labeled image samples are used to collaboratively fine-tune the joint longitudinal federated learning model, and the final joint longitudinal federated learning model is output.
[0013] Several alternative methods are provided below, but they are not intended as additional limitations on the overall solution above. They are merely further additions or optimizations. Provided there are no technical or logical contradictions, each alternative method can be combined individually with respect to the overall solution above, or multiple alternative methods can be combined with each other.
[0014] Preferably, the server integrates the collaboration representations of each participant and the average of the collaboration representations to obtain a federated representation, including:
[0015] For each aligned image sample, the collaborative representations of all participants are fused as follows:
[0016]
[0017] In the formula, the terms are: the fused representation of the given aligned image sample; the participant index; the semantic similarity function; each participant's collaborative representation output for the given aligned image sample; and the average of the collaborative representations of all participants for the given aligned image sample;
[0018] For each aligned image sample, the fused representation and the average collaborative representation are added together and then averaged to obtain the federated representation of that aligned image sample.
[0019] Preferably, the instance-level perceptual self-supervised learning loss is calculated as follows:
[0020] Take positive instance pairs and negative instance pairs, where a positive instance pair consists of a participant's collaborative representation output for a given aligned image sample and the federated representation corresponding to the same aligned image sample, a negative instance pair consists of that collaborative representation and the federated representation corresponding to a different aligned image sample, and the sample indices range up to the total number of aligned image samples;
[0021] The instance-level perceptual self-supervised learning loss is then calculated as follows:
[0022]
[0023] In the formula, the terms are: the instance-level perceptual self-supervised learning loss of the participant; the semantic similarity function; the first temperature hyperparameter; the participant's collaborative representation output for each aligned image sample; and an indicator function whose value is 1 if its condition is true and 0 otherwise.
[0024] Preferably, the dimension-level perceptual self-supervised learning loss is calculated as follows:
[0025] Take positive dimension pairs and negative dimension pairs, where a positive dimension pair consists of a given dimension of the participant's collaborative representations across all aligned image samples and the same dimension of the federated representations across all aligned image samples, a negative dimension pair consists of that dimension of the collaborative representations and a different dimension of the federated representations, and the dimension indices range up to the total number of dimensions;
[0026] The loss for dimension-level perceptual self-supervised learning is then calculated as follows:
[0027]
[0028] In the formula, the terms are: the dimension-level perceptual self-supervised learning loss of the participant; a given dimension of the participant's collaborative representations across all aligned image samples; the first temperature hyperparameter; and an indicator function whose value is 1 if its condition is true and 0 otherwise.
[0029] Preferably, the step of processing the two augmented views using the local model and the collaborative model respectively and calculating the attention knowledge transfer loss includes:
[0030] The two augmented views are processed using the local model to obtain the first local representation and the second local representation.
[0031] The two augmented views are processed using the collaborative model to obtain the first collaborative representation and the second collaborative representation.
[0032] Based on the same augmented view, a first positive view pair is constructed from the first local representation and the first collaborative representation, and a second positive view pair is constructed from the second local representation and the second collaborative representation;
[0033] The attention knowledge transfer loss is calculated based on the first and second positive view pairs as follows:
[0034]
[0035] In the formula, the terms are: the attention knowledge transfer loss of the participant; the total number of local private image samples; the attention coefficient that the participant assigns to a given positive view pair of a given local private image sample; the local representation and the collaborative representation that the participant obtains for that local private image sample and view; and the self-supervised loss.
[0036] Preferably, the step of collaboratively fine-tuning the joint longitudinal federated learning model using aligned and labeled image samples includes:
[0037] Each participant's local encoder transforms aligned and labeled image samples into hidden representations;
[0038] The active party receives and aggregates all hidden representations sent by the passive parties;
[0039] The active party uses a downstream classifier to map the aggregated representations to prediction results;
[0040] Based on the prediction results and labels, the global loss is calculated using the cross-entropy loss function and the joint longitudinal federated learning model is updated. This process is repeated until the collaborative fine-tuning is complete.
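The fine-tuning steps above can be sketched as a single forward pass followed by a loss computation. This is only an illustrative sketch under stated assumptions: the text does not specify the form of aggregation or the classifier, so concatenation and a single linear layer are assumed here, and the names `joint_forward` and `cross_entropy` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_forward(hidden_reps, weights, bias):
    """Aggregate per-party hidden representations and classify them.

    Assumption: aggregation is concatenation and the downstream
    classifier is one linear layer followed by softmax.
    """
    aggregated = [v for rep in hidden_reps for v in rep]
    logits = [sum(w * x for w, x in zip(row, aggregated)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs, label):
    """Cross-entropy loss for one sample with an integer class label."""
    return -math.log(probs[label])


# Two parties: the active party contributes 2 hidden features, the passive party 1.
hidden = [[1.0, 0.0], [0.0]]
weights = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # hypothetical 2-class classifier
bias = [0.0, 0.0]
probs = joint_forward(hidden, weights, bias)
loss = cross_entropy(probs, label=0)
```

In an actual system the loss gradient would flow back to the classifier and, through the exchanged hidden representations, to each participant's local encoder; that update step is omitted here.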
[0041] This invention provides a self-supervised longitudinal federated learning method based on dual-aware collaboration and localized semantic calibration (DACLSC) for label-scarce VFL under semi-honest and non-colluding assumptions. Specifically, the DACLSC framework divides the pre-training process into two stages: a collaborative training stage and a local training stage. In the collaborative training stage, a multi-source adaptive semantic fusion (MASF) mechanism replaces the simple information exchange mechanism. Multi-source semantic information from aligned samples is fully fused to generate a more semantically complete federated representation, which supports collaboration among participants and achieves efficient information interaction. Furthermore, a dual-aware collaboration (DAC) method is proposed, which combines instance-level and dimension-level awareness in SSL without constructing an additional correlation matrix, enabling the capture of deeper correlations and inter-domain knowledge in aligned samples. Combined with the MASF mechanism, DAC further enhances the ability of SSL to promote semantic alignment and domain invariance, while significantly reducing computational and communication overhead. In the subsequent local training stage, a localized semantic calibration (LSC) method is proposed. This method adaptively localizes inter-domain knowledge through attention-based knowledge transfer (AKT), effectively mitigating semantic misalignment between local private samples from different sources. By alternately executing DAC and LSC, DACLSC unifies vertical multi-source semantic representation learning and alignment. While protecting data privacy, it makes full use of all available samples from each participant, improving the generalization ability of the VFL model. Ultimately, it solves the semantic misalignment problem in label-scarce VFL and effectively reduces the risk of privacy leakage caused by advanced label inference attacks.
Brief Description of the Drawings
[0042] Figure 1 is a schematic diagram illustrating data partitioning between two parties in a real-world scenario.
[0043] Figure 2 is a flowchart of the self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration of the present invention.
[0044] Figure 3 is an overall architecture diagram of the pre-training stage of the DACLSC framework of this invention.
[0045] Figure 4 is an architecture diagram of the downstream supervision stage of the DACLSC framework of this invention.
Detailed Description
[0046] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0047] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention.
[0048] In real-world scenarios, the scarcity of labeled resources and aligned samples limits the applicability of supervised vertical federated learning (VFL). Current research addresses the label scarcity problem by transferring centralized self-supervised learning methods to VFL to utilize a large number of unlabeled samples, thereby reducing dependence on labeled resources. However, simply transferring SSL policies and related information exchange mechanisms cannot solve the semantic misalignment problem caused by severe domain shifts, thus limiting the overall efficiency of the vertical federated SSL framework. To address these challenges, this embodiment proposes a VFL framework with dual-aware collaboration and localized semantic calibration.
[0049] As shown in Figure 1, the standard VFL setup includes multiple participants, identified by a participant index, that collaboratively train a joint VFL model through a central server. Each participant holds a private dataset that represents a vertical partition of the overall dataset. Typically, labels are held by a single participant, known as the active participant (Participant 1); the remaining participants, called passive participants, hold only feature data. In VFL, the participants align their data by sample ID to construct the complete dataset. In real-world scenarios, however, the complete dataset contains only a small number of aligned and labeled samples, consisting of each participant's aligned and labeled samples together with the corresponding labels held by the active participant. Aligned but unlabeled samples form a second set, of which each participant holds its own subset. Furthermore, each participant also holds unaligned and unlabeled private samples.
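The three kinds of samples described above can be illustrated with a small ID-based partition. This is a sketch only: the function name `partition_ids`, the party names, and the toy IDs are hypothetical, and a real deployment would perform the ID alignment with a privacy-preserving protocol rather than by sharing raw ID sets.

```python
def partition_ids(party_ids, labeled_ids):
    """Split sample IDs into aligned-labeled, aligned-unlabeled, and unaligned sets.

    party_ids: dict mapping a (hypothetical) party name to the set of sample IDs it holds.
    labeled_ids: set of sample IDs for which the active party holds labels.
    """
    aligned = set.intersection(*party_ids.values())  # IDs present at every party
    aligned_labeled = aligned & labeled_ids
    aligned_unlabeled = aligned - labeled_ids
    # Each party's remaining IDs are its unaligned private samples.
    unaligned = {name: ids - aligned for name, ids in party_ids.items()}
    return aligned_labeled, aligned_unlabeled, unaligned


parties = {"active": {1, 2, 3, 4, 5}, "passive": {3, 4, 5, 6, 7}}
aligned_labeled, aligned_unlabeled, unaligned = partition_ids(parties, labeled_ids={3})
```

In this toy case only sample 3 is aligned and labeled, samples 4 and 5 are aligned but unlabeled, and each party keeps two unaligned private samples.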
[0050] Most VFL studies typically rely solely on aligned and labeled samples, severely limiting their practicality in real-world scenarios. In contrast, DACLSC utilizes all available samples for pre-training and requires only a small number of aligned and labeled samples for fine-tuning to adapt to specific downstream tasks. Consistent with traditional VFL research, the ultimate goal of DACLSC is to improve the performance of joint VFL models on specific downstream tasks. Therefore, the effectiveness of DACLSC is evaluated on the final downstream task.
[0051] As shown in Figure 2, the overall process of DACLSC is divided into a pre-training stage and a downstream supervision stage. The pre-training stage is further divided into a collaborative training stage and a local training stage. The core idea of DACLSC is to capture deeper correlations and inter-domain knowledge through dual-perception SSL without constructing an additional correlation matrix. Subsequently, the proposed LSC method alleviates the semantic misalignment between local private samples from different sources.
[0052] (i) Pre-training phase of DACLSC.
[0053] In the DACLSC framework, each participant holds three models: a collaborative model, a local model, and a local momentum model. Each model consists of an encoder and a predictor. Similar to the standard configuration of federated SSL frameworks, the local momentum model is a momentum version of the local model and is used to assist pre-training.
[0054] Figure 3 illustrates the overall architecture of the pre-training phase of the DACLSC framework. Specifically, the pre-training phase consists of two alternating stages: collaborative training and local training.
[0055] (1) Collaborative training phase.
[0056] For aligned samples, each participant holds different longitudinal slices covering the complete feature space, which constitute complementary perspectives of the data instances. Because these complementary perspectives contain rich inter-domain knowledge, they can naturally serve as positive samples in SSL. Therefore, DACLSC performs SSL on aligned samples, jointly training the collaborative models of all participants, enabling the models to acquire inter-domain knowledge and further improving their generalization ability.
[0057] Given a batch of aligned samples, each participant holds a vertical slice of the batch. Each participant uses its collaborative encoder to transform its input into hidden representations, and its collaborative predictor then converts the hidden representations into collaborative representations of fixed dimensionality, one for each sample in the batch. Existing VFSSL frameworks typically rely on simple information exchange mechanisms, such as averaging collaborative representations on the server side. This fails to adequately account for severe domain shift, resulting in poor fusion of multi-source semantics. To address this deficiency, DACLSC employs the MASF mechanism to comprehensively fuse multi-source semantic information, providing support for subsequent collaborative training. As shown in Equation (1), the server fuses the collaborative representations of all participants to obtain a set of fused representations, one for each aligned sample index.
[0058] (1)
[0059] where the three quantities are, respectively, the fused representation, a participant's collaborative representation, and the average of the collaborative representations for a given aligned sample. The similarity function measures the semantic similarity between two representations, and the norm appearing in it is the L2 norm. Each collaborative representation is assigned a weight according to its semantic similarity to the average collaborative representation; the higher the similarity, the greater the weight, which enhances semantic consistency during semantic fusion.
[0060] To prevent the collaboration process from relying too heavily on the collaborative representation of any single party, the server combines the fused representation with the average collaborative representation, which reflects the basic trend of the global distribution, to obtain the final federated representation defined by formula (2). The server then distributes the federated representation to all participants. In addition, to further enhance privacy protection, the gradients of all representations involved in the information exchange are truncated.
[0061] (2)
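The MASF step described above, in which similarity-weighted fusion is followed by averaging with the mean representation, can be sketched for a single aligned sample as follows. The exact weighting in formula (1) is not reproduced in the text, so this sketch assumes cosine similarity and softmax-normalized weights; the names `cosine` and `masf` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, i.e. the dot product normalized by the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def masf(reps):
    """Fuse one aligned sample's collaborative representations from all parties.

    reps: list with one representation (list of floats) per party.
    Returns (weights, fused, federated).
    """
    dim = len(reps[0])
    mean = [sum(r[d] for r in reps) / len(reps) for d in range(dim)]
    # Assumption: weights are a softmax over each party's similarity to the mean.
    scores = [math.exp(cosine(r, mean)) for r in reps]
    weights = [s / sum(scores) for s in scores]
    fused = [sum(w * r[d] for w, r in zip(weights, reps)) for d in range(dim)]
    # Add the fused representation and the mean, then average them.
    federated = [(f + m) / 2 for f, m in zip(fused, mean)]
    return weights, fused, federated


weights, fused, federated = masf([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]])
same_weights, _, same_federated = masf([[1.0, 2.0], [1.0, 2.0]])
```

In the first call the third party's representation points away from the mean and therefore receives the smallest weight; when all parties agree, the federated representation coincides with their common representation.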
[0062] Upon receiving the federated representations, each participant performs DAC, training its collaborative model using SSL with dual awareness at both the instance and dimension levels.
[0063] A. Instance-level aware SSL. Each participant matches the corresponding positive instance pairs between its collaborative representations and the federated representations, since each positive instance pair consists of complementary perspectives of the same sample; the two members of a pair are the representations at the same sample index in the two sets. Subsequently, as shown in formula (3), each participant optimizes its collaborative model (composed of the collaborative encoder and the collaborative predictor) by minimizing the instance-level aware SSL loss.
[0064] (3)
[0065] where the terms are the instance-level perceptual self-supervised learning loss of the participant and the entries at the same sample index of its collaborative representations and of the federated representations. The temperature is the first temperature hyperparameter, and the indicator function takes the value 1 if its condition is true and 0 otherwise. For negative instance pairs, the second member is the federated representation corresponding to a different aligned image sample, with indices ranging up to the total number of aligned image samples. Minimizing this loss makes positive instance pairs in the aligned samples attract each other while negative instance pairs repel each other. However, instance-level aware SSL alone struggles to capture deeper relationships within aligned samples and therefore cannot resolve semantic misalignment. DACLSC therefore also employs dimension-level aware SSL, which learns dimensional relationships and domain invariance in aligned samples without constructing additional correlation matrices.
[0066] B. Dimension-level aware SSL. Each participant matches the same dimension of its collaborative representations and of the federated representations to construct positive dimension pairs, and treats any other dimension of the federated representations as a negative. Here, a dimension of the collaborative representations means that dimension taken across the participant's collaborative representations of all aligned image samples, and a dimension of the federated representations means that dimension taken across the federated representations of all aligned image samples; the dimension indices are integers ranging up to the total number of dimensions. The participant's training objective for dimension-level aware SSL is shown in Equation (4).
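The attract-and-repel behaviour of the instance-level loss can be illustrated with a contrastive (InfoNCE-style) sketch. Formula (3) is not reproduced in the text, so this is an assumed standard form in which the positive pair is also kept in the denominator; the function name `instance_level_loss` and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def instance_level_loss(collab, fed, tau=0.5):
    """Contrast each collaborative representation with all federated representations.

    collab[i] and fed[i] form the positive pair for aligned sample i;
    fed[j] with j != i are the negatives. tau is the temperature.
    """
    n = len(collab)
    total = 0.0
    for i in range(n):
        logits = [cosine(collab[i], fed[j]) / tau for j in range(n)]
        # Negative log-softmax of the positive pair over all candidates.
        total += math.log(sum(math.exp(x) for x in logits)) - logits[i]
    return total / n


collab = [[1.0, 0.0], [0.0, 1.0]]
matched = instance_level_loss(collab, [[1.0, 0.0], [0.0, 1.0]])
shuffled = instance_level_loss(collab, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each collaborative representation is closest to its own sample's federated representation and grows when the pairing is shuffled.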
[0067] (4)
[0068] where the terms are the dimension-level perceptual self-supervised learning loss of the participant and a given dimension of the participant's collaborative representations across all aligned image samples; the indicator function takes the value 1 if its condition is met and 0 otherwise. Dimension-level aware SSL enables the collaborative models of the participants to learn domain invariance from positive dimension pairs while simultaneously decoupling different dimensions. This mechanism not only promotes semantic alignment of aligned samples in the representation space but also significantly enhances the generalization ability of the collaborative models without requiring the construction of additional correlation matrices.
[0069] By combining the instance-level and dimension-level losses, DAC further improves the performance of SSL in vertical multi-source semantic representation learning, while significantly reducing computational and communication overhead. In summary, the DAC loss of each participant is given by formula (5).
[0070] (5)
[0071] where the coefficient is a hyperparameter that trades off the instance-level and dimension-level perceptual losses.
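As an illustration of how the two perceptions can share one contrastive routine, the sketch below applies the same loss to rows (instances) and to columns (dimensions) and then combines them. Formulas (4) and (5) are not reproduced in the text, so the InfoNCE form, the placement of the trade-off weight `lam` on the dimension-level term, and the names `info_nce` and `dac_loss` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, targets, tau):
    """Contrastive loss where anchors[i] and targets[i] are the positive pair."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], targets[j]) / tau for j in range(n)]
        total += math.log(sum(math.exp(x) for x in logits)) - logits[i]
    return total / n


def dac_loss(collab, fed, tau=0.5, lam=1.0):
    """Combine instance-level and dimension-level contrastive losses.

    Transposing the batch turns every dimension into a vector over samples,
    so no separate correlation matrix has to be built.
    """
    instance = info_nce(collab, fed, tau)
    dimension = info_nce(list(zip(*collab)), list(zip(*fed)), tau)
    return instance + lam * dimension, instance, dimension


collab = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
total, instance, dimension = dac_loss(collab, collab, lam=0.5)
_, _, swapped_dimension = dac_loss(collab, [row[::-1] for row in collab])
```

Swapping the two dimensions of the federated representations leaves every dimension paired with the wrong counterpart, which raises the dimension-level term.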
[0072] (2) Local training phase.
[0073] During the local training phase, each participant independently pre-trains its local model on its local private samples to learn domain-specific knowledge. However, the representations learned independently by each participant are often scattered across the representation space, exacerbating semantic misalignment between local private samples from different sources. To correct this misalignment, DACLSC employs the LSC method and improves local training through AKT.
[0074] Gradient truncation is applied to all local momentum models. For each participant, given a batch of local private samples, a data augmentation strategy (a collection of multiple data augmentation methods) converts the batch into two augmented views. The local encoder maps the first augmented view to the first hidden representation, and the local momentum encoder maps the second augmented view to the second hidden momentum representation. The local predictor then converts the first hidden representation into the first local representation of fixed dimensionality, and the local momentum predictor converts the second hidden momentum representation into the second local momentum representation, with one representation per sample in the batch. Positive sample pairs can be formed between the first local representation and the second local momentum representation at the same sample index, because each such pair originates from the same sample. Following the symmetric computation path, the two augmented views are exchanged as inputs to the local model and the local momentum model, yielding a second local representation and a first local momentum representation; positive sample pairs can likewise be formed between these at the same sample index. Finally, as shown in Equations (6) and (7), each participant optimizes its local model (comprising the local encoder and the local predictor) by minimizing the symmetric local SSL loss.
[0075] (6)
[0076] (7)
[0077] where the terms are the symmetric local self-supervised loss of the participant, the total number of local private samples, and the function used to calculate the self-supervised loss for the corresponding positive sample pairs. A queue preserves historical local momentum representations as negative samples; the negatives in the loss are local momentum representations drawn from this queue, which includes the local momentum representations generated in each round of training.
[0078] By minimizing this loss, each participant makes the positive samples within its local private samples attract each other while the negative samples repel each other. However, the semantic misalignment problem between local private samples from different sources remains unresolved.
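The symmetric local loss with a queue of historical negatives can be sketched in the style of momentum-contrast methods. Formulas (6) and (7) are not reproduced in the text, so the exact loss form, the temperature, and the names `ssl_loss` and `symmetric_local_loss` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def ssl_loss(query, key, queue, tau=0.2):
    """Contrast one positive pair against the negatives stored in the queue."""
    positive = math.exp(cosine(query, key) / tau)
    negatives = sum(math.exp(cosine(query, neg) / tau) for neg in queue)
    return -math.log(positive / (positive + negatives))


def symmetric_local_loss(local_1, local_2, momentum_1, momentum_2, queue, tau=0.2):
    """Average the two crossed pairings over the batch.

    local_v[i] / momentum_v[i]: output of the local / momentum model for
    sample i under augmented view v. View 1 of the local model is paired
    with view 2 of the momentum model, and vice versa.
    """
    n = len(local_1)
    total = 0.0
    for i in range(n):
        total += ssl_loss(local_1[i], momentum_2[i], queue, tau)
        total += ssl_loss(local_2[i], momentum_1[i], queue, tau)
    return total / n


queue = [[0.0, 1.0]]  # historical momentum representations used as negatives
view = [[1.0, 0.0]]
consistent = symmetric_local_loss(view, view, view, view, queue)
inconsistent = symmetric_local_loss(view, view, queue, queue, [[1.0, 0.0]])
```

The loss is near zero when both views of a sample agree and differ from the queued negatives, and it becomes large when the momentum outputs resemble the negatives instead.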
[0079] To correct this misalignment, the DACLSC framework introduces the LSC method, which uses AKT to localize inter-domain knowledge from the collaborative model, thereby enhancing local training. Specifically, for the two augmented views, each participant uses its frozen collaborative model to compute the first collaborative representation and the second collaborative representation, respectively. Each participant then associates the first local representation with the first collaborative representation as positive view pairs, because each pair originates from the same augmented view. Similarly, the second local representation and the second collaborative representation are associated as positive view pairs. In both cases, the members of a pair are the representations at the same sample index. The AKT training objective of each participant is given by formulas (8) and (9).
[0080] (8)
[0081] (9)
[0082] where the terms are: the attention knowledge transfer loss of the participant; the attention coefficient that the participant assigns to a given positive view pair of a given local private image sample; the local representation and the collaborative representation that the participant obtains for that sample and view; the augmented-view index; and the second temperature hyperparameter. For each sample index, the two positive view pairs originate from the two different augmented views. The self-supervised loss in equation (8) is analogous to Equation (7), except that the queue is a historical collaborative representation queue containing the collaborative representations generated in each round of training. AKT thus localizes inter-domain knowledge by adaptively adjusting the intensity of knowledge transfer for the two views: it enhances semantic consistency by strengthening the alignment between semantically similar views, while weakening the influence between dissimilar views to alleviate semantic misalignment.
[0083] LSC combines the symmetric local SSL loss with the AKT loss, and thereby combines the participant's own knowledge with knowledge from the other participants, effectively mitigating semantic misalignment between local private samples from different sources. In summary, the total LSC loss of each participant is given by formula (10).
[0084] (10)
[0085] where μ controls the intensity of AKT.
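The attention-weighted transfer and the role of μ can be illustrated as follows. Formulas (8) to (10) are not reproduced in the text, so this sketch assumes that the attention coefficients are a softmax, over the two views, of the local-to-collaborative similarity scaled by the second temperature, and that the total loss is the local SSL loss plus μ times the AKT loss; the names `attention`, `akt_loss`, and `lsc_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def ssl_loss(query, key, queue, tau=0.2):
    """Contrast one positive pair against the negatives stored in the queue."""
    positive = math.exp(cosine(query, key) / tau)
    negatives = sum(math.exp(cosine(query, neg) / tau) for neg in queue)
    return -math.log(positive / (positive + negatives))


def attention(local_pair, collab_pair, tau_att=1.0):
    """Softmax over the two views of the local-to-collaborative similarity."""
    scores = [math.exp(cosine(l, c) / tau_att) for l, c in zip(local_pair, collab_pair)]
    return [s / sum(scores) for s in scores]


def akt_loss(local_views, collab_views, queue, tau=0.2, tau_att=1.0):
    """Attention-weighted SSL loss between local and frozen collaborative outputs.

    local_views[v][i]: local representation of sample i under augmented view v.
    queue: historical collaborative representations used as negatives.
    """
    n = len(local_views[0])
    total = 0.0
    for i in range(n):
        local_pair = [local_views[v][i] for v in range(2)]
        collab_pair = [collab_views[v][i] for v in range(2)]
        coeffs = attention(local_pair, collab_pair, tau_att)
        for v in range(2):
            total += coeffs[v] * ssl_loss(local_pair[v], collab_pair[v], queue, tau)
    return total / n


def lsc_loss(local_ssl, akt, mu):
    """Total local objective; mu controls the intensity of AKT."""
    return local_ssl + mu * akt


coeffs = attention([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
loss = akt_loss([[[1.0, 0.0]], [[1.0, 0.0]]],
                [[[1.0, 0.0]], [[0.0, 1.0]]],
                queue=[[-1.0, 0.0]])
```

The view whose local and collaborative representations agree receives the larger coefficient, so its transfer term dominates while the dissimilar view is down-weighted.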
[0086] To further integrate the knowledge of all local models, the DACLSC framework performs partial model aggregation after LSC. Specifically, to accommodate the feature heterogeneity in VFL, each participating party... its local encoder It is broken down into two components: a participant-specific local underlying encoder. and a local top-level encoder Then, part of the local model Uploaded to the server. Subsequently, the server performs partial model aggregation to obtain a global aggregated model, defined as follows: .in, Indicates the participating parties Number of samples held This represents the total number of samples from all participating parties. Indicates the global top-level encoder. Represents the global predictor. This represents a portion of the global model. Ultimately, the server will... The model is distributed to each participant, who then replaces the local top-level encoder in their local model with the global top-level encoder, resulting in a new local model for the next round of pre-training. Partial model aggregation enables the local top-level encoder to learn more general representations, thus achieving effective sharing among the participants.
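Partial model aggregation can be sketched as a sample-count-weighted average over only the shared components. The aggregation formula itself is not reproduced in the text, so the weighting by each participant's share of the samples, the flat parameter lists, and the names `aggregate_partial` and `install_global` are assumptions; replacing the predictor along with the top-level encoder is likewise an assumption of this sketch.

```python
SHARED_PARTS = ("top", "predictor")  # the bottom encoder stays participant-specific


def aggregate_partial(models, counts):
    """Average the shared parts, weighting each participant by its sample count.

    models: list of dicts with flat parameter lists under the keys
            "bottom", "top", and "predictor" (a hypothetical layout).
    counts: number of samples held by each participant.
    """
    total = sum(counts)
    global_part = {}
    for name in SHARED_PARTS:
        size = len(models[0][name])
        global_part[name] = [
            sum(c / total * m[name][j] for m, c in zip(models, counts))
            for j in range(size)
        ]
    return global_part


def install_global(model, global_part):
    """Return a new local model whose shared parts are replaced by the global ones."""
    updated = dict(model)
    updated.update({name: list(params) for name, params in global_part.items()})
    return updated


models = [
    {"bottom": [1.0], "top": [0.0, 2.0], "predictor": [1.0]},
    {"bottom": [5.0, 6.0], "top": [4.0, 6.0], "predictor": [3.0]},
]
global_part = aggregate_partial(models, counts=[1, 3])
new_local = install_global(models[1], global_part)
```

Because the bottom encoders may have different input widths at each participant, they are never averaged; only the components with matching shapes are shared.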
[0087] After aggregation is completed, each participant updates its local momentum model by exponential moving average, as shown in Equation (11). The local momentum model consists of a local momentum encoder and a local momentum predictor and is used in the next round of pre-training.
[0088] $\theta_{\mathrm{m}} \leftarrow m\,\theta_{\mathrm{m}} + (1 - m)\,\theta$ (11)
[0089] where $\theta_{\mathrm{m}}$ and $\theta$ denote the parameters of the local momentum model and the local model, respectively, and $m$ is the momentum coefficient. Under this momentum update mechanism, the local momentum model evolves more smoothly than the local model, which stabilizes the local training process.
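The exponential moving average update can be illustrated with a short sketch. The dictionary layout and the function name `ema_update` are assumptions made for illustration; the default coefficient 0.99 matches the value reported in the implementation details.

```python
import numpy as np

def ema_update(momentum_params, local_params, m=0.99):
    """Move each momentum parameter a fraction (1 - m) toward the local parameter."""
    return {name: m * momentum_params[name] + (1.0 - m) * local_params[name]
            for name in momentum_params}

# Hypothetical one-parameter model: the momentum copy drifts slowly toward the local model.
momentum = {"w": np.zeros(3)}
local = {"w": np.ones(3)}
for _ in range(2):
    momentum = ema_update(momentum, local, m=0.99)
```

With a coefficient close to 1, many updates are needed before the momentum model catches up with the local model, which is what makes its evolution smooth.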
[0090] This embodiment unifies vertical multi-source semantic representation learning and semantic alignment by alternating between the collaborative training phase and the local training phase. The total number of pre-training rounds is preset. In each pre-training round, the collaborative training phase and the local training phase are executed alternately, each one or more times. For example, executing the collaborative training phase twice and then the local training phase twice constitutes one pre-training round.
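The alternation can be written down as a simple schedule. The sketch below only enumerates the order of phases; the function name and its parameters are hypothetical, and the actual DAC and LSC updates are omitted.

```python
def pretraining_schedule(num_rounds, collab_steps, local_steps):
    """List (round, phase) pairs in execution order for the pre-training stage."""
    schedule = []
    for round_index in range(num_rounds):
        # Collaborative training (DAC) runs first within each round ...
        schedule += [(round_index, "collaborative")] * collab_steps
        # ... followed by local training (LSC).
        schedule += [(round_index, "local")] * local_steps
    return schedule

# The example from the text: two collaborative steps, then two local steps per round.
plan = pretraining_schedule(num_rounds=2, collab_steps=2, local_steps=2)
```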
[0091] (ii) DACLSC downstream supervision phase.
[0092] After the DACLSC pre-training phase, each participant has obtained a pre-trained local encoder. In the downstream supervision phase, as shown in Figure 4, the joint VFL model consists of the pre-trained encoders of all participants and a task-specific classifier (downstream classifier) held by the active party, and it is collaboratively fine-tuned using a small number of aligned and labeled samples.
[0093] For each participant, its local encoder transforms the aligned and labeled input samples into a hidden representation. Subsequently, consistent with the traditional VFL paradigm, each passive party sends its hidden representation to the active party. A task-specific classifier then maps the aggregated representation to the prediction result, as shown in Equation (12).
[0094] $\hat{y} = C(h)$ (12)
[0095] where $C$ is the task-specific classifier, $h$ is the aggregated representation, and $\hat{y}$ is the prediction result.
[0096] Finally, the joint VFL model is fine-tuned by minimizing the global loss in Equation (13).
[0097] $\mathcal{L}_{\mathrm{global}} = \ell_{\mathrm{CE}}(\hat{y}, y)$ (13)
[0098] where $y$ is the true label held by the active party and $\ell_{\mathrm{CE}}$ is the cross-entropy loss function.
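A forward pass of the downstream supervision phase might look like the sketch below. Several points are assumptions rather than details from the text: each encoder is reduced to one linear layer with ReLU, the hidden representations are aggregated by concatenation, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true classes."""
    n = labels.shape[0]
    return float(-np.log(probs[np.arange(n), labels] + 1e-12).mean())

rng = np.random.default_rng(0)
n, d_active, d_passive, hidden, num_classes = 8, 6, 5, 4, 3

# Aligned samples: same IDs, different feature spaces; labels live on the active party.
x_active = rng.normal(size=(n, d_active))
x_passive = rng.normal(size=(n, d_passive))
labels = rng.integers(0, num_classes, size=n)

# Stand-ins for the pre-trained local encoders (assumed: linear layer + ReLU).
w_active = rng.normal(size=(d_active, hidden))
w_passive = rng.normal(size=(d_passive, hidden))
h_active = np.maximum(x_active @ w_active, 0.0)
h_passive = np.maximum(x_passive @ w_passive, 0.0)  # sent to the active party

# Assumed aggregation: concatenate the hidden representations.
h_agg = np.concatenate([h_active, h_passive], axis=1)

# Task-specific classifier held by the active party (single linear layer).
w_cls = 0.1 * rng.normal(size=(2 * hidden, num_classes))
probs = softmax(h_agg @ w_cls)
loss = cross_entropy(probs, labels)
```

In actual fine-tuning, the gradient of `loss` would be propagated back through the classifier and into each participant's encoder; that step is left out here.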
[0099] To visually demonstrate the effectiveness of the invention, the following experiment is provided in this embodiment.
[0100] (1) Experimental setup.
[0101] Given that a two-party setup is the standard configuration in label-scarce VFL, this experiment mainly conducts evaluations under a two-party setup and further extends the evaluation to multi-party scenarios for broader validation.
[0102] Datasets and Models. This experiment was conducted on two tabular datasets and two image datasets to evaluate the performance of DACLSC, as detailed below.
[0103] The NUS-WIDE dataset consists of images from Flickr, containing 634-dimensional low-level visual features and 1000-dimensional text label features. In this experiment, data from 10 categories were selected for a multi-class classification task, and the image features and corresponding text label features were assigned to two participants respectively to simulate a longitudinal federated learning scenario.
[0104] The Avazu dataset contains 8 continuous features and 14 categorical features. In this experiment, the categorical features are mapped to a 32-dimensional embedding space. Furthermore, the continuous and categorical features are randomly assigned to two participants for the click-through rate (CTR) prediction task. To control computational cost, this experiment randomly selects 800,000 samples from the original dataset as the training set and 200,000 samples as the test set.
[0105] The Breast Histopathology Images (BHI) dataset is used for a binary classification task aimed at breast cancer identification. To simulate a vertical federated learning (VFL) environment, this experiment assigns two different images of the same patient to two different participants.
[0106] The ModelNet dataset contains 40 classes, each covering multiple 3D objects. This experiment selected 15 classes exhibiting a long-tailed distribution for a multi-class classification task. For each object, 12 images were captured from different perspectives and evenly distributed among the four participants. Each VFL sample was constructed by randomly selecting one image from each participant.
[0107] The encoder structure of each participant is customized according to the requirements of the task. Specific configurations are detailed in Table 1. "2FC" denotes two fully connected (FC) functional layers, each of which contains a fully connected layer, a ReLU layer, and a BatchNorm layer. The fully connected layers of the two FC functional layers each have 512 neurons. ResNet-18 is the open-source ResNet with the standard parameter configuration, with only the last linear output layer removed. On all datasets, the classifier owned by the active party consists of a single FC functional layer, while each predictor consists of three FC functional layers whose fully connected layers have 512, 512, and 512 neurons, respectively.
[0108] Table 1 Encoder structure for different datasets
[0109]
[0110] Baseline Methods. To comprehensively evaluate the performance of the proposed DACLSC framework, this experiment uses several state-of-the-art methods for label-scarce VFL as baselines.
[0111] LightGBM (Light Gradient Boosting Machine) and Vanilla VFL are both supervised VFL methods trained using only aligned and labeled samples.
[0112] FedCVT (Federated Cross-View Training) is a semi-supervised learning framework for two-party label-scarce VFL. This framework enlarges the training dataset through missing feature estimation and pseudo-label generation.
[0113] FedLocal utilizes local private samples to pre-train the local models of each participant. FedLocal covers representative methods such as VFLFS and SSVFL.
[0114] FedHSSL (Federated Hybrid Self-Supervised Learning framework) is an SSL framework that achieves state-of-the-art performance in label-scarce VFL. FedHSSL integrates three representative self-supervised learning methods (SimSiam, BYOL, and MoCo) for pre-training the VFL models of each party. Depending on the SSL method employed, FedHSSL yields three baseline variants: FedHSSL-BYOL (FedHSSL-B), FedHSSL-SimSiam (FedHSSL-S), and FedHSSL-MoCo (FedHSSL-M).
[0115] To ensure fairness in the comparison, the DACLSC framework proposed in this invention is trained using the same number of aligned and labeled samples as all baselines. The downstream supervision phase uses 200 to 1000 aligned and labeled samples for fine-tuning, with the result being the average of five runs.
[0116] Evaluation metrics. Top-1 accuracy was used as the evaluation metric on the NUS-WIDE and ModelNet datasets, while the area under the ROC curve (AUC) and F1 score were used as evaluation metrics on the Avazu and BHI datasets, respectively.
[0117] Data augmentation. Data augmentation method for the NUS-WIDE dataset: randomly select 30% of the feature dimensions for each input sample and mask them. Then replace these selected dimensions with values randomly sampled from the historical data distribution range of that dimension, while leaving the unselected dimensions unchanged.
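The feature-masking augmentation can be sketched as follows. Two points are assumptions: the "historical data distribution range" is taken to be the per-dimension minimum and maximum of a reference batch, and replacement values are drawn uniformly from that range. The function name `corrupt_continuous` is hypothetical.

```python
import numpy as np

def corrupt_continuous(x, history, mask_ratio=0.3, rng=None):
    """Replace a random 30% of each sample's dimensions with in-range random values.

    x:       (n, d) batch of feature vectors.
    history: (m, d) reference samples defining each dimension's value range.
    Returns the augmented batch and the boolean mask of replaced entries.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = x.shape
    k = int(round(mask_ratio * d))
    low, high = history.min(axis=0), history.max(axis=0)
    out = x.copy()
    mask = np.zeros((n, d), dtype=bool)
    for i in range(n):
        cols = rng.choice(d, size=k, replace=False)  # dimensions selected for this sample
        mask[i, cols] = True
        out[i, cols] = rng.uniform(low[cols], high[cols])
    return out, mask

rng = np.random.default_rng(0)
history = rng.normal(size=(100, 10))
x = history[:5]
# Two independent calls give the two augmented views of the same batch.
view_1, mask_1 = corrupt_continuous(x, history, rng=rng)
view_2, mask_2 = corrupt_continuous(x, history, rng=rng)
```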
[0118] Data augmentation method for the BHI and ModelNet datasets: two random but semantically consistent visual transformations are applied independently to the same original image, generating two augmented views that differ in appearance but share the same semantics. The augmentation design integrates multiple complementary visual perturbations. First, random scaling and positional cropping change the field of view and local structure, preventing the model from over-relying on details at fixed locations. Then, color jittering simulates differences in brightness, contrast, and color under different imaging conditions. Next, Gaussian blur of random intensity suppresses high-frequency texture information, guiding the model to focus on more stable morphological and structural features. In addition, random horizontal flipping improves the model's adaptability to changes in orientation. All transformations are triggered randomly with preset probabilities, so that each generated view is unique.
[0119] Data augmentation method for the Avazu dataset: continuous features are processed in the same way as for the NUS-WIDE dataset. Replacement values are randomly sampled within a reasonable range of the normalized values, introducing controlled numerical perturbations, while unselected feature dimensions remain unchanged so that the augmented samples stay close to the true data distribution. For discrete features, 30% of the discrete feature fields are randomly selected as perturbation targets. The selected fields are not replaced with another real category; instead, they are uniformly mapped to a special placeholder value (usually a reserved low-frequency index). This placeholder is a numerically valid category code but does not point to any specific entity; it is equivalent to "this field is unavailable in the current sample" or "this field's information is masked."
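The discrete-feature masking can be sketched in a few lines. The placeholder index 0 and the function name `mask_categorical` are assumptions for illustration; the text only states that a reserved index is used.

```python
import numpy as np

def mask_categorical(fields, placeholder_index=0, mask_ratio=0.3, rng=None):
    """Map a random 30% of each sample's categorical fields to a placeholder code.

    fields: (n, f) integer category codes, one column per categorical field.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, f = fields.shape
    k = int(round(mask_ratio * f))
    out = fields.copy()
    for i in range(n):
        cols = rng.choice(f, size=k, replace=False)  # fields masked for this sample
        out[i, cols] = placeholder_index
    return out

rng = np.random.default_rng(0)
# 14 categorical fields as in Avazu; real codes start at 1 so that 0 stays reserved.
fields = rng.integers(1, 50, size=(6, 14))
masked = mask_categorical(fields, placeholder_index=0, rng=rng)
```

With 14 fields and a ratio of 30%, four fields per sample are masked after rounding.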
[0120] Implementation Details. In this experiment, DACLSC uses the same encoder architecture as all baseline methods. 40% of the dataset is used as aligned samples, and the remainder as unaligned samples. Two learning rates, 0.025 and 0.01, were tested in the downstream supervision phase, and the best results are reported. Pre-training on all datasets was run for 40 epochs with a batch size of 1024. The temperature hyperparameter and the momentum coefficient are fixed at 0.5 and 0.99, respectively. See Table 2 for other hyperparameter configurations.
[0121] Table 2 Hyperparameter configurations used for different datasets
[0122]
[0123] (2) Experimental results.
[0124] This experiment compares the performance of DACLSC with the baseline method on four datasets. The results are detailed in Table 3.
[0125] Table 3 Performance evaluation of different methods
[0126]
[0127] Note: In each column, the best value is highlighted in bold.
[0128] Experimental results show that in label-scarce VFLs, the performance of both LightGBM and Vanilla VFL is severely limited by the scarcity of label resources, because such supervised learning frameworks are highly dependent on aligned and labeled samples for model training.
[0129] Experimental results show that DACLSC consistently outperforms the FedCVT method on all datasets. For example, when using 200 aligned and labeled samples, DACLSC outperforms FedCVT by 26.8%, 13.7%, and 14.1% on the NUS-WIDE, Avazu, and BHI datasets, respectively. Because aligned and labeled samples are extremely limited in label-scarce VFLs, and FedCVT heavily relies on these samples for missing feature estimation and pseudo-label generation, it faces significant challenges when utilizing semi-supervised learning methods.
[0130] The FedHSSL-based approaches are robust in label-scarce VFL. However, simply integrating a centralized SSL method into FedHSSL is insufficient to address the semantic misalignment caused by severe domain shift. Experimental results show that the proposed DACLSC framework resolves semantic misalignment more effectively on all datasets. For example, with 200 aligned and labeled samples, DACLSC improves on FedHSSL-M by 4.0%, 2.2%, 1.8%, and 2.0% on the NUS-WIDE, Avazu, BHI, and ModelNet datasets, respectively.
[0131] Thanks to the MASF mechanism and DAC method, DACLSC significantly outperforms FedLocal, demonstrating that collaborative training plays a crucial role in promoting representation alignment and mitigating semantic misalignment.
[0132] Furthermore, DACLSC demonstrates superior performance in experiments on the ModelNet dataset, showcasing its applicability to multi-participant scenarios and its robustness to significantly long-tailed distributions. Moreover, under various experimental settings, DACLSC's standard deviation is lower than most baseline methods, indicating superior stability and generalization ability across all datasets.
[0133] (3) Privacy experiment analysis against label inference attack.
[0134] (3.1) Security and privacy analysis.
[0135] Security and privacy were primary considerations throughout the design of the framework. Under the assumption that all participants are semi-honest and do not collude, the security and privacy of DACLSC depend mainly on two steps that involve communication between participants: information exchange during the collaborative training phase and partial model aggregation during the local training phase. During the collaborative training phase, the server collects gradient-free collaborative representations from each participant for information exchange. These representations are obtained by applying a series of nonlinear transformations to the original data with deep neural networks. Existing research shows that, without access to the model structure or parameters of other participants, the ability of a semi-honest participant to recover the original data from such representations is significantly weakened. Accordingly, DACLSC requires each participant to keep its local underlying (bottom-level) encoder entirely private during partial model aggregation, which prevents semi-honest participants from recovering the original data of other participants by exploiting exposed model components. Furthermore, DACLSC employs gradient truncation during information exchange, which reduces the risk of gradient inversion attacks and other feature reconstruction attacks and further strengthens the framework's privacy protection.
[0136] Furthermore, recent research indicates that passive parties can launch model-based label inference attacks to infer the original labels held by the active party. To mitigate such threats, advanced defense techniques such as Gaussian noise injection can be seamlessly integrated into the DACLSC framework to effectively prevent model-based label inference attacks.
[0137] (3.2) Computational cost analysis.
[0138] Because the DACLSC model architecture is task-dependent, this analysis ignores the computational cost of forward and backward propagation within the model and focuses on the overhead incurred by the training algorithm during the critical pre-training phase. The overhead is analyzed from both the participant and the server perspectives, using asymptotic complexity notation. On the participant side, the overhead for each participant to perform DAC is expressed in terms of the total number of collaborative training batches, and the overhead for each participant to perform LSC is expressed in terms of that participant's total number of local training batches and the length of the queue. On the server side, the analysis covers the overhead of the MASF mechanism and the overhead of partial model aggregation, the latter expressed in terms of the indexes of the hidden layers selected for aggregation from the local models of the participants and the number of neurons in each such hidden layer.
[0139] In summary, the total computational cost of DACLSC combines these participant-side and server-side terms. In comparison, the total computational cost of the Federated Hybrid Self-Supervised Learning framework (FedHSSL) includes the computational cost of co-training in FedHSSL. When one of the compared cost terms is significantly greater than the other, the computational cost of DACLSC is asymptotically equivalent to that of FedHSSL. Therefore, the method proposed in this invention can significantly improve the performance of the joint VFL model while keeping the increase in computational cost within a reasonable range.
[0140] (3.3) Threat Model.
[0141] Existing research indicates that model-based label inference attacks pose another significant privacy threat to the VFSSL framework, potentially leaking the label privacy of the active party. Therefore, this experiment further investigates this threat within the proposed DACLSC framework and explores corresponding privacy protection mechanisms.
[0142] The threat model encompasses four key dimensions of the attacker: attacker's objectives, capabilities, knowledge, and attack methods, as detailed below:
[0143] Attacker's objective: The passive party with the largest index is designated as the attacker, who attempts to infer the labels held by the active party.
[0144] Attacker capabilities: Under the semi-honest assumption, the attacker adheres to the VFL protocol and does not deviate from the prescribed collaborative process, but may still attempt to infer the private data of other participants.
[0145] Attacker knowledge: To comprehensively assess the potential threats faced by DACLSC, this experiment assumes that the attacker has some prior knowledge about the active party, including the model architecture, input dimensions, and number of categories. Furthermore, the attacker is assumed to hold a small number of auxiliary labeled samples.
[0146] Attack Methods: Model-based label inference attacks are a major threat to the DACLSC framework. To evaluate DACLSC's robustness against such threats, this experiment employs a model completion (MC) attack, an advanced model-based label inference method. In the DACLSC framework, MC attacks can be initiated by attackers during the pre-training phase or the downstream supervision phase.
[0147] Existing defense techniques can be seamlessly integrated into the DACLSC framework to reduce the risk of privacy breaches caused by model-based label inference attacks. Specifically, in this experiment, the active party applies an isotropic Gaussian noise (ISO) perturbation to the information communicated between participants during training.
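An isotropic Gaussian perturbation of a communicated tensor can be sketched as below. Treating the perturbation strength as the standard deviation of zero-mean noise is an assumption, as the text does not define how strength is parameterized; the function name `add_iso_noise` is hypothetical.

```python
import numpy as np

def add_iso_noise(tensor, strength, rng=None):
    """Add zero-mean Gaussian noise with the same variance in every coordinate."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=strength, size=tensor.shape)
    return tensor + noise

# Hypothetical collaborative representations for 2000 samples, 16 dimensions each.
representations = np.zeros((2000, 16))
protected = add_iso_noise(representations, strength=0.1, rng=np.random.default_rng(0))
# A strength of zero leaves the tensor unchanged.
unchanged = add_iso_noise(representations, strength=0.0, rng=np.random.default_rng(0))
```

The same function could be applied to model parameters before aggregation; larger strengths hide more label information but also degrade the main task.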
[0148] This analysis focuses on the risk of label privacy leakage in the most critical stage of DACLSC, the pre-training stage. Specifically, the attacker builds the MC attack model during the pre-training stage. To mitigate privacy risks, the active party adds ISO perturbations to both the collaborative representations involved in information exchange and the model parameters participating in aggregation. After training is complete, the attacker uses the MC attack model to perform label inference on the test set. All datasets were fine-tuned using 200 aligned and labeled samples. During pre-training, 20% of the data was designated as aligned samples. Furthermore, this experiment employed a baseline attack model pre-trained using FedLocal to simulate the attacker's prior knowledge. Both attack models are trained using 80 auxiliary labeled samples, and label recovery accuracy is used as the evaluation metric for the MC attack. A comparison of the main task performance (i.e., the performance of DACLSC on a specific task) and the MC attack results is detailed in Table 4, which also reports the strength of the ISO perturbation used during the pre-training phase.
[0149] Table 4 Comparison of main task performance and MC attack effectiveness
[0150]
[0151] Note: "w / ISO" and "w / o ISO" indicate that ISO perturbation was applied and not applied during model training, respectively.
[0152] The results show that, without ISO perturbation, the MC attack built on DACLSC outperformed the baseline attack on all datasets. This indicates that DACLSC may leak additional label information during the pre-training phase. When ISO perturbation of appropriate strength is applied, the effectiveness of the attack decreases significantly, falling below the baseline on most datasets. Although this protection slightly reduces main task performance, the fluctuation remains within acceptable limits. These results demonstrate that, with appropriate privacy protection mechanisms, DACLSC can effectively reduce the risk of privacy breaches caused by advanced model-based label inference attacks.
[0153] This invention proposes the DACLSC framework for label-scarce VFLs, aiming to effectively address the semantic misalignment problem caused by severe domain shift. Combining a novel MASF mechanism, the proposed DAC method enhances the performance of SSL in promoting semantic alignment and domain invariance, while significantly reducing computational and communication overhead. Furthermore, DACLSC achieves a unified approach to longitudinal multi-source semantic representation learning and semantic alignment by alternately executing DAC and LSC, leveraging available samples from all participants to improve the generalization ability of the VFL model, ultimately resolving the semantic misalignment problem. Extensive experimental results demonstrate that DACLSC exhibits superior performance in label-scarce VFLs. Moreover, comprehensive security and privacy analyses demonstrate the robustness of DACLSC in protecting the privacy of participating parties.
[0154] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0155] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the appended claims.
Claims
1. A self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration, used for collaboratively training a joint longitudinal federated learning model by a plurality of participants and one server, characterized in that each participant holds a collaborative model, a local model, and a local momentum model, and the method includes: each participant obtains a collaborative representation from aligned image samples through its collaborative model and uploads it to the server; the server integrates the collaborative representations of the participants and the average of the collaborative representations to obtain a federated representation and distributes it to each participant; each participant constructs positive instance pairs and positive dimension pairs from its local collaborative representation and the received federated representation, and calculates an instance-level perceptual self-supervised learning loss and a dimension-level perceptual self-supervised learning loss to update the collaborative model, completing one round of collaborative training; each participant augments its local private image samples into two augmented views; the local model and the local momentum model process the two augmented views respectively to calculate a symmetric local self-supervised learning loss; the local model and the collaborative model process the two augmented views respectively to calculate an attention knowledge transfer loss; the local model is updated according to these losses, and the local momentum model is updated using the updated local model, completing one round of local training; after collaborative training and local training have been alternated, the local encoders in the local models of the participants are combined with the downstream classifier held by the active party to form the joint longitudinal federated learning model; and aligned and labeled image samples are used to collaboratively fine-tune the joint longitudinal federated learning model, and the final joint longitudinal federated learning model is output.
2. The self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration according to claim 1, characterized in that the server integrating the collaborative representations of the participants and the average of the collaborative representations to obtain a federated representation includes: for each aligned image sample, the collaborative representations of all participants are fused into a fused representation of that sample, the fusion being computed over the participant index from the semantic similarity, the collaborative representation output by each participant for that aligned image sample, and the average of the collaborative representations of all participants for that aligned image sample; and the fused representation of each aligned image sample and the average of its collaborative representations are added together and averaged to obtain the federated representation of that aligned image sample.
3. The self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration according to claim 1, characterized in that the instance-level perceptual self-supervised learning loss is calculated as follows: positive instance pairs and negative instance pairs are taken, where a positive instance pair consists of the collaborative representation output by the participant for an aligned image sample and the federated representation corresponding to the same aligned image sample, a negative instance pair consists of that collaborative representation and the federated representation corresponding to a different aligned image sample, and the sample indices range over the total number of aligned image samples; the instance-level perceptual self-supervised learning loss of the participant is then calculated from the semantic similarity of these pairs, the first temperature hyperparameter, and an indicator function whose value is 1 if its condition is true and 0 otherwise.
4. The self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration according to claim 1, characterized in that the dimension-level perceptual self-supervised learning loss is calculated as follows: positive dimension pairs and negative dimension pairs are taken, where a positive dimension pair consists of one dimension of the collaborative representations output by the participant for all aligned image samples and the same dimension of the federated representations of all aligned image samples, a negative dimension pair consists of that dimension of the collaborative representations and a different dimension of the federated representations, and the dimension indices range over the total number of dimensions; the dimension-level perceptual self-supervised learning loss of the participant is then calculated from the semantic similarity of these pairs, the first temperature hyperparameter, and an indicator function whose value is 1 if its condition is true and 0 otherwise.
5. The self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration according to claim 1, characterized in that using the local model and the collaborative model to process the two augmented views and calculating the attention knowledge transfer loss includes: the two augmented views are processed using the local model to obtain a first local representation and a second local representation; the two augmented views are processed using the collaborative model to obtain a first collaborative representation and a second collaborative representation; based on the same augmented view, a first positive view pair is constructed from the first local representation and the first collaborative representation, and a second positive view pair is constructed from the second local representation and the second collaborative representation; and the attention knowledge transfer loss of the participant is calculated from the first and second positive view pairs over the total number of local private image samples, using the attention coefficient that the participant assigns to each positive view pair of each local private image sample, the local representation and the collaborative representation that the participant obtains for each local private image sample under each augmented view, and the self-supervised loss.
6. The self-supervised longitudinal federated learning method based on dual-perception collaboration and semantic calibration according to claim 1, characterized in that collaboratively fine-tuning the joint longitudinal federated learning model using aligned and labeled image samples includes: each participant's local encoder transforms the aligned and labeled image samples into hidden representations; the active party receives and aggregates all hidden representations sent by the passive parties; the active party uses the downstream classifier to map the aggregated representation to prediction results; and, based on the prediction results and the labels, the global loss is calculated using the cross-entropy loss function and the joint longitudinal federated learning model is updated, this process being repeated until the collaborative fine-tuning is complete.
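As an illustrative aid to claims 3 and 4, the sketch below shows one common way to compute instance-level and dimension-level contrastive losses from a participant's collaborative representations and the federated representations. It is not the claimed formula: the InfoNCE form, cosine similarity as the semantic similarity, and all function names are assumptions, and the temperature 0.5 follows the value given in the implementation details.

```python
import numpy as np

def _l2_normalize(x, axis):
    """Scale vectors along the given axis to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def info_nce(anchors, targets, temperature):
    """Row i of anchors is positive with row i of targets, negative with all other rows."""
    a = _l2_normalize(anchors, axis=1)
    t = _l2_normalize(targets, axis=1)
    logits = a @ t.T / temperature                      # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def instance_level_loss(collab, federated, temperature=0.5):
    """Pairs are formed across samples: rows of the (samples x dimensions) matrices."""
    return info_nce(collab, federated, temperature)

def dimension_level_loss(collab, federated, temperature=0.5):
    """Pairs are formed across dimensions: columns become the compared vectors."""
    return info_nce(collab.T, federated.T, temperature)

rng = np.random.default_rng(0)
collab = rng.normal(size=(16, 8))                       # 16 aligned samples, 8 dimensions
federated = collab + 0.01 * rng.normal(size=(16, 8))    # nearly aligned federated view

aligned_instance = instance_level_loss(collab, federated)
shuffled_instance = instance_level_loss(collab, np.roll(federated, 1, axis=0))
aligned_dimension = dimension_level_loss(collab, federated)
shuffled_dimension = dimension_level_loss(collab, np.roll(federated, 1, axis=1))
```

Both losses are lower when the positive pairs are correctly matched than when the federated representations are shifted by one sample or one dimension, which is the behavior the two perception levels rely on.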