Wireless multi-modal federated learning system for modal-heterogeneous users

The wireless multimodal federated learning system, through decision-level fusion and resource optimization, solves the problems of low training efficiency and energy waste caused by heterogeneous user modalities, and achieves more efficient multimodal model training and higher classification accuracy.

CN120529414B · Active · Publication Date: 2026-06-30 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2025-06-18
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing multimodal federated learning systems in wireless scenarios cannot effectively handle heterogeneous user data modalities. The result is increased computational load, a more complex communication process, and inconsistent convergence speeds across modalities, all of which reduce model training efficiency and accuracy.

Method used

A decision-level fusion multimodal model is adopted, which combines user scheduling and uplink bandwidth allocation. By optimizing user participation and resource allocation through base stations and servers, a wireless multimodal federated learning system is designed. The training process of the model is optimized by using a single-modal loss function and global gradient to ensure the convergence of each modality model.

Benefits of technology

It speeds up the training convergence of multimodal models, improves classification accuracy on both multimodal and single-modal data, and, under user energy constraints, saves at least 20% of energy consumption compared with similar algorithms.



Abstract

This invention provides a wireless multimodal federated learning system for users with heterogeneous modalities, comprising a server, a base station, and users. The server is directly deployed at the base station or connected to the base station via optical fiber. The server broadcasts a global multimodal model to users through the base station, and simultaneously sends the user scheduling results and the uplink bandwidth allocated to the users. After receiving the local multimodal models uploaded by the users, the server performs global aggregation to obtain a new global multimodal model. The base station uses a wireless network to complete communication between the server and users. The scheduled users update the received model using their own data and upload the resulting local multimodal model to the base station. This invention considers wireless multimodal federated learning under heterogeneous modalities, designs a wireless multimodal federated learning system, and reduces the energy consumption of federated learning by jointly optimizing user scheduling and uplink bandwidth allocation, further improving the performance of the multimodal model on both single-modal and multimodal data.

Description

Technical Field

[0001] This invention relates to the fields of wireless communication and artificial intelligence, and more specifically, to a wireless multimodal federated learning system for users with heterogeneous modalities.

Background Technology

[0002] Existing federated learning systems and methods in wireless scenarios typically only consider the case where user data is unimodal. When user data is multimodal, the modal heterogeneity among users will bring new problems to multimodal federated learning in wireless scenarios, thus requiring the provision of corresponding systems and solutions.

[0003] Modal heterogeneity in multimodal federated learning manifests as different users possessing different modal data. Due to limitations in user devices, modal heterogeneity is quite common in practical applications. Currently, in the research field of multimodal federated learning, there are many solutions to the modal heterogeneity problem—such as reconstruction of missing modal representations, knowledge distillation based on heterogeneous models, and modification of multimodal fusion methods. In wireless scenarios with fixed latency, limited bandwidth, and limited computing resources, training the model required for reconstruction and reconstructing missing modal representations introduces significant additional computational overhead. Similarly, knowledge distillation on proxy datasets also introduces additional computational overhead. Furthermore, the collection and sharing of proxy datasets complicates the existing communication process.

[0004] Besides the difficulties in training methods, the modal heterogeneity of multimodal models also leads to significant differences in computation and communication between users with different modalities. Due to the vast differences in the structure and modal data of each modal sub-model within a multimodal model, inconsistent convergence speeds among the modalities are inevitable during training. This problem means that an absolute balancing strategy for each modality cannot guarantee rapid convergence of the multimodal model. On the contrary, overtraining on converged modalities not only wastes computational resources but may also lead to overfitting to that modality's data. Consequently, non-converged modalities cannot be adequately trained.

[0005] Patent application CN116386058A discloses a multimodal federated learning training method and apparatus. The method includes: inputting shared data into an initial server-side model to obtain an output global feature representation, and transmitting the global feature representation to a client; receiving local feature representations generated by the client; aggregating the local feature representations transmitted by the client based on the global feature representation and the local feature representations to obtain an aggregated feature representation; and training the server-side model based on the aggregated feature representation to complete one round of model training. However, this patent cannot completely solve the existing technical problems, nor can it meet the needs of this invention.

Summary of the Invention

[0006] To address the shortcomings of existing technologies, the purpose of this invention is to provide a wireless multimodal federated learning system for users with heterogeneous modalities.

[0007] The wireless multimodal federated learning system for modally heterogeneous users provided by the present invention includes:

[0008] The server is used to decide the user scheduling result and uplink bandwidth allocation result for each communication round, generate and broadcast a global multimodal model, receive the local multimodal model uploaded by the scheduled user and perform global aggregation to update the global multimodal model.

[0009] The base station utilizes a wireless network to facilitate communication between the server and the user, collects the channel gain of the user equipment, broadcasts the global multimodal model, user scheduling results, and uplink bandwidth allocation results issued by the server, and receives the local multimodal model uploaded by the scheduled user and transmits it to the server.

[0010] Multiple users, each with heterogeneous multimodal data, meaning that the modalities possessed by different users are any subset of all modal sets; the scheduled user uses local multimodal data to update the received global multimodal model to obtain a local multimodal model, and uses the allocated uplink bandwidth to upload the local multimodal model to the base station.

[0011] The server is either directly deployed on the base station side or connected to the base station via optical cable.

[0012] Preferably, in the nth communication round the server decides the user scheduling vector $a^n$ and the uplink communication bandwidth allocation $B^n$, where U is the total number of users; the scheduled users form the set of users participating in the nth communication round. The server's global multimodal model consists of M global unimodal sub-models.

[0013] The base station utilizes a wireless network to facilitate communication between the user and the server, and collects the channel gain of user i in the nth communication round.

[0014] The U users constitute the user set, and all modalities owned by the users constitute the modality set; the users that possess modality m form the user set of modality m. User i has its own modality set, with a corresponding number of modalities, and a local dataset of size $D_i$, in which $x_{i,m,j}$ is the feature vector of modality m of sample j and $y_{i,j}$ is the label of sample j. After the local update, each scheduled user uploads its local multimodal model. The participating users that possess data of modality m in the nth communication round form the participating set of modality m.

[0015] Preferably, during the local update process, user i calculates the multimodal loss function using the local multimodal dataset as follows:

[0016]

[0017] Where L(·) is the loss function. The feature $x_{i,m,j}$ of modality m is input to the global single-modal sub-model of modality m to obtain an output, and the outputs of all modalities possessed by user i are averaged; the multimodal model output is thus obtained through decision-level fusion. Each unimodal sub-model in the multimodal model can also complete the machine learning inference task using unimodal features alone. The sum of the user's unimodal loss functions is expressed as:

[0018]

[0019] For each modality owned by the user, the single-modal loss function is calculated as follows:

[0020]

[0021] Due to modal heterogeneity, for each modality that the user is missing, the single-modal loss function is defined as the global loss, and its gradient is defined as the global single-modal gradient, i.e.:

[0022]

[0023] Adding the two together yields the user's local loss function:

[0024] $H_i(\theta^{n-1}) = F_i(\theta^{n-1}) + G_i(\theta^{n-1})$

[0025] During the user's local update process, the number of local training cycles in each communication round is set to 1, and batch gradient descent is used to obtain the local gradient:

[0026] $\nabla H_i(\theta^{n-1}) = \nabla F_i(\theta^{n-1}) + \nabla G_i(\theta^{n-1})$

[0027] The local gradient also consists of M local sub-gradients, namely:

[0028]

[0029] For each modality owned by the user, the corresponding local sub-gradient is calculated as follows:

[0030]

[0031] For modalities where the user is missing, the local sub-gradient is defined as the global gradient. Finally, the local update formula for user i is obtained:

[0032]

[0033] Preferably, the global multimodal model on the server is obtained by aggregating the local multimodal models uploaded by the scheduled users. For each modality owned by at least one scheduled user, the aggregation takes the form:

[0034]

[0035] The aggregation above uses an aggregation weight for each participating user. When no scheduled user owns modality m, that is, when the participating set of modality m is empty, the parameters of the single-modal sub-model of that modality are kept the same as at the end of the previous communication round, that is:

[0036]

[0037] The goal of the entire wireless multimodal federated learning system is to minimize the sum of the multimodal loss function and the unimodal loss function, expressed as:

[0038]

[0039] Preferably, user i incurs computation latency and energy overhead when performing local updates. It is assumed that all users have the same computing power, i.e., the CPU frequency f and energy consumption coefficient α are the same. Let $\beta_m$ denote the number of CPU cycles required to train one sample on modality m. Under the decision-level fusion method, the number of CPU cycles user i requires to train one sample follows from the modalities it possesses, and the user's computation latency and computation energy overhead are respectively:

[0040]

[0041] Preferably, after user i completes the local update, communication latency and energy overhead are incurred while uploading the local multimodal model. Multi-user uplink communication is implemented using frequency division multiple access (FDMA). According to Shannon's formula, in the nth communication round, the uplink rate of user i is:

[0042]

[0043] where the formula involves the bandwidth allocated to user i, the uplink communication power p, the power spectral density of white noise $N_0$, and the channel gain;

[0044] In an FDMA system, the total bandwidth allocated to participating users does not exceed the total bandwidth used by the base station for wireless multimodal federated services.

[0045]

[0046] Based on their respective modalities, different users upload single-modal sub-models of different modalities. Single-modal sub-models of the same modality have the same structure; therefore, let $l_m$ denote the data length of the single-modal sub-model of modality m. The total data volume of user i's uplink communication is the sum of $l_m$ over the modalities it possesses, which gives the uplink latency of user i in the nth communication round:

[0047]

[0048] Furthermore, the energy cost of user i during the uplink communication process is given by:

[0049]

[0050] Preferably, each participating user in the nth communication round must finish uploading its local model within the specified maximum delay $T_{\max}$; therefore, each scheduled user must satisfy the delay constraint:

[0051] $T_i^{n,\mathrm{com}} + T_i^{\mathrm{cmp}} \le T_{\max}$

[0052] Let the energy allocated to each user for a single communication round of wireless multimodal federated learning be $E_{\mathrm{add}}$. The remaining energy of user i in the nth communication round is therefore:

[0053]

[0054] The remaining energy from each communication round is accumulated and can be used in high-overhead communication rounds. Over the entire training process, it is only required that the total remaining energy of user i be non-negative, which gives the constraint:

[0055]

[0056] Under bandwidth, latency, and energy constraints, decisions are made on which users to admit and how to allocate bandwidth. Taking the loss function $H(\theta^N)$ as the objective function, the performance optimization problem of wireless multimodal federated learning can be expressed as:

[0057]

[0058] Where N→+∞ indicates that enough communication rounds have been trained until the multimodal model converges.

[0059] Preferably, assume that the loss function H(θ) is γ-smooth and ρ-Lipschitz continuous, that the global gradient of each single-modal sub-model has an upper bound, and that the difference between the local gradient and the global gradient of each single-modal sub-model has an upper bound. Under these assumptions, the upper bound of the loss function is derived as follows:

[0060]

[0061] Here, γ is the smoothness coefficient and ρ is the Lipschitz continuity coefficient; the remaining symbols are, respectively, the upper bound coefficient of the modality-m gradient, the difference coefficient of the modality-m gradient for user i, and the ideal aggregation weight when all users are admitted.

[0062] Preferably, after the constant term $H(\theta^0)$ is subtracted from the objective function of P1, the objective is decomposed into a sum of differences between successive global loss functions. Furthermore, for the long-term average energy constraint C5, a virtual energy queue is first constructed, and then a virtual queue mean stability constraint C5′ equivalent to C5 is given:

[0063]

[0064] After performing an equivalent transformation on the objective function and C5, we obtain the standard structure that can be solved using the Lyapunov optimization method:

[0065]

[0066] s.t. C1, C2, C3, C4, C5′

[0067] Based on the Lyapunov optimization method, the Lyapunov drift function with a penalty term is given:

[0068]

[0069] Substituting the performance upper bound, relaxing it through a series of inequalities, and ignoring the constant terms and the terms independent of $a^n$ yields an upper bound to optimize and the transformed optimization problem:

[0070]

[0071] s.t. C1, C2, C3, C4

[0072] In the objective function $J_1(a^n, B^n)$, the first term is the upper bound on the performance of multimodal federated learning, and the second term is the energy cost of the current communication round. The Lyapunov penalty factor V > 0 balances the two: as V→+∞, the optimization tends to accept a higher energy cost to improve the performance of wireless multimodal federated learning, while as V→0, it tries to reduce the energy cost and ignores performance. The specific value of V is chosen according to the energy cost and performance requirements.
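The virtual energy queue and the role of V can be illustrated with a minimal Python sketch. The queue and objective equations are not reproduced in this text, so the forms used here are assumptions based on the standard Lyapunov drift-plus-penalty construction: the queue backlog grows by the energy spent beyond the per-round budget $E_{\mathrm{add}}$, and the per-round score is V times a performance bound plus the queue-weighted energy. The function names and numbers are hypothetical.

```python
def update_energy_queue(queue, energy_used, energy_budget):
    """Assumed virtual-queue update: backlog grows when a round overspends its budget."""
    return max(queue + energy_used - energy_budget, 0.0)


def drift_plus_penalty(perf_bound, queues, energies, V):
    """Assumed per-round score (lower is better): V-weighted bound plus queue-weighted energy."""
    return V * perf_bound + sum(q * e for q, e in zip(queues, energies))


# Hypothetical per-round energy use of one user with budget E_add = 1.0.
E_add = 1.0
queue = 0.0
history = []
for used in [1.5, 0.5, 0.5]:
    queue = update_energy_queue(queue, used, E_add)
    history.append(queue)

# A large V makes the performance term dominate; a small V makes energy dominate.
score = drift_plus_penalty(perf_bound=2.0, queues=[0.5], energies=[1.5], V=10.0)
```

In this sketch an overspending round leaves a backlog that later low-cost rounds pay down, which mirrors the idea that unused energy can be carried over to high-overhead rounds.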

[0073] Preferably, since $a^n$ is a combinatorial optimization variable and $B^n$ is a continuous optimization variable, the problem can be written as an equivalent master problem P4 and decomposed into a combinatorial optimization subproblem P4.1 and a continuous optimization subproblem P4.2, which are respectively:

[0074]

[0075] s.t. C1, C2, C3, C4.

[0076]

[0077] s.t. C1, C3.

[0078]

[0079] s.t. C2, C3, C4.

[0080] where the optimal bandwidth allocation denotes the optimal solution of P4.2, and the objective function of P4.2 is:

[0081]

[0082] P4.2 is transformed into a convex problem through a series of transformations, and the optimal bandwidth is given using the KKT conditions:

[0083]

[0084] The relevant variables satisfy the following relationship:

[0085]

[0086] The remaining formulas involved in the numerical solution using Newton's iteration method are:

[0087]

[0088]

[0089] Using the above relationships, the optimal bandwidth vector can be obtained;

[0090] P4.1 is solved using an immune algorithm. Let the randomly generated initial antibody set contain S antibodies, where S is the total number of antibodies; the s-th antibody in generation 0 corresponds to a user access vector. Let g denote the current generation, initially g = 0. Based on the objective function in P4.1, an affinity function aff is constructed to measure the quality of an antibody:

[0091]

[0092] where ι > 0 is the exponential coefficient that adjusts the dispersion of the affinity. For infeasible solutions, the affinity function is set to 0.

[0093] The antibody concentration function den in the immune algorithm measures the similarity between an antibody and other antibodies in the set. Its calculation formula is as follows:

[0094]

[0095] The similarity function sim(·) compares the Hamming distance dis(·) between two antibodies with the distance threshold Dis:

[0096]

[0097] The affinity and concentration of an antibody together determine its stimulation level, which is used for antibody selection:

[0098]

[0099] where ε1 and ε2 are the weights of the affinity function and the antibody concentration function, respectively. The S/μ antibodies with the highest stimulation in the antibody set are selected. Each selected antibody is then cloned μ times, and the clones are randomly mutated to obtain the mutated set. The highest-affinity antibodies in the mutated set, together with S/μ randomly generated antibodies, constitute the next-generation antibody set. These operations are repeated on the new antibody set until the maximum number of iterations is reached, yielding the optimal result.
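The selection, cloning, and mutation loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented procedure: the affinity is assumed to be exp(-ι · cost), the concentration is assumed to be the fraction of antibodies within Hamming distance Dis, the stimulation is assumed to be ε1·aff − ε2·den, and the number of mutated antibodies kept per generation is assumed to be S − S/μ. The toy cost function, the feasibility rule, and all parameter values are hypothetical.

```python
import math
import random


def hamming(a, b):
    """Hamming distance between two binary user access vectors."""
    return sum(x != y for x, y in zip(a, b))


def affinity(antibody, cost, feasible, iota):
    """Assumed affinity: exp(-iota * cost); infeasible antibodies get 0."""
    return math.exp(-iota * cost(antibody)) if feasible(antibody) else 0.0


def density(antibody, population, dis_threshold):
    """Assumed concentration: share of antibodies within the distance threshold."""
    similar = sum(hamming(antibody, other) <= dis_threshold for other in population)
    return similar / len(population)


def immune_search(num_users, cost, feasible, S=20, mu=4, generations=30,
                  iota=0.2, eps1=1.0, eps2=0.5, dis_threshold=1,
                  p_mut=0.1, seed=0):
    rng = random.Random(seed)

    def rand_ab():
        return tuple(rng.randint(0, 1) for _ in range(num_users))

    def aff(ab):
        return affinity(ab, cost, feasible, iota)

    population = [rand_ab() for _ in range(S)]
    best = max(population, key=aff)
    keep = S // mu
    for _ in range(generations):
        def stim(ab):
            return eps1 * aff(ab) - eps2 * density(ab, population, dis_threshold)

        # Select the S/mu most stimulated antibodies, clone each mu times, mutate.
        selected = sorted(population, key=stim, reverse=True)[:keep]
        clones = [ab for ab in selected for _ in range(mu)]
        mutated = [tuple(b ^ (rng.random() < p_mut) for b in ab) for ab in clones]
        # Keep the best mutated antibodies and refill with random ones.
        survivors = sorted(mutated, key=aff, reverse=True)[:S - keep]
        population = survivors + [rand_ab() for _ in range(keep)]
        best = max(population + [best], key=aff)
    return best


# Toy problem: schedule at most 3 of 6 users to maximise a hypothetical utility.
weights = [5, 1, 4, 2, 3, 6]
cost = lambda ab: -sum(w * a for w, a in zip(weights, ab))
feasible = lambda ab: sum(ab) <= 3
best = immune_search(6, cost, feasible)
```

The concentration penalty discourages the population from collapsing onto near-identical access vectors, while the random antibodies added each generation keep exploring new scheduling patterns.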

[0100] Compared with the prior art, the present invention has the following beneficial effects:

[0101] (1) The wireless multimodal federated learning system of the present invention can adapt to the situation of heterogeneous user modalities and can ensure that the multimodal model and each single-modal sub-model can be trained using the user's heterogeneous modal data to achieve convergence, and is suitable for a wide range of wireless multimodal federated learning application scenarios.

[0102] (2) Compared with similar algorithms, the solution of the present invention can improve the accuracy by up to 4.62% in the classification task of multimodal data and up to 2.79% in the classification task of single-modal data.

[0103] (3) The solution of the present invention takes into account the energy constraints of user equipment. Under the premise of achieving the above-mentioned performance advantages, it saves at least 20% of energy consumption compared with similar algorithms.

Attached Figure Description

[0104] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0105] Figure 1 shows the wireless multimodal federated learning system for users with heterogeneous modalities;

[0106] Figure 2 shows the multimodal model based on decision-level fusion;

[0107] Figures 3a and 3b show the effects of different Lyapunov penalty factors on multimodal performance, single-modal performance, and energy cost on the CREMA-D and IEMOCAP datasets, respectively;

[0108] Figures 4a to 4d compare the present invention with other schemes on the CREMA-D dataset in terms of multimodal performance, single-modal performance, and energy consumption;

[0109] Figures 5a to 5d compare the present invention with other schemes on the IEMOCAP dataset in terms of multimodal performance, single-modal performance, and energy consumption.

Detailed Implementation

[0110] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.

[0111] Example

[0112] This invention can be applied to next-generation wireless communication networks to assist in the efficient training of multimodal models. In next-generation wireless communication networks, some terminal devices are equipped with one or more sensing devices such as cameras, microphones, and sensors to jointly collect data and train models. This embodiment can be deployed on base stations or edge servers to build a multimodal collaborative system for various scenarios such as intelligent driving, telemedicine, and industrial IoT. By efficiently training multimodal models, tasks can be assisted faster and more accurately, improving user experience and enhancing the system's intelligence level.

[0113] This invention discloses a wireless multimodal federated learning system which, as shown in Figure 1, includes U users, a base station, and a server. All users constitute the user set, and all modalities of the data constitute the modality set. The set of modalities possessed by user i is a subset of the modality set, and its size is the number of modalities of user i. The local dataset of user i has size $D_i$, where $x_{i,m,j}$ is the feature vector of modality m of sample j and $y_{i,j}$ is the label of sample j. All users possessing modality m constitute the user set of modality m.

[0114] Similar to unimodal federated learning, the entire process of multimodal federated learning is divided into N communication rounds. Each communication round consists of model distribution, local update, model upload, and global aggregation. In the nth communication round, the multimodal global model $\theta^{n-1}$ is first broadcast to all users; for the first communication round, $\theta^0$ is a randomly initialized multimodal model. It should be noted that the multimodal model used in multimodal learning varies with the modality fusion method. In this invention, to adapt to modal heterogeneity among users and avoid redundant computation and communication burdens, a multimodal model based on decision-level fusion is adopted, whose architecture is shown in Figure 1. The global multimodal model consists of M global unimodal sub-models. During training in the nth communication round, the global single-modal sub-model of modality m obtained in the (n-1)th communication round computes an output from the input feature vector $x_{i,m,j}$ of user i in modality m. If a modality is missing, no output is generated for that modality. A fusion processor then averages the outputs of the available modalities to obtain the output of the multimodal model. Therefore, after receiving the global multimodal model from the server, the user calculates the multimodal loss on its local dataset as follows:

[0115]

[0116] Where L(·) is the loss function, generally cross-entropy for classification tasks and mean squared error for regression tasks; the sub-model in the formula is the global single-modal sub-model obtained in the (n-1)th communication round, and $y_{i,j}$ is the sample label.
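The decision-level fusion and multimodal loss described above can be illustrated with a short NumPy sketch. It assumes, for illustration only, linear single-modal sub-models with softmax outputs, two hypothetical modalities named "audio" and "video", and a cross-entropy loss; the patent's own loss formula is not reproduced in this text.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_decisions(outputs):
    """Average the outputs of the modalities the user actually possesses."""
    if not outputs:
        raise ValueError("a user must possess at least one modality")
    return np.mean(list(outputs.values()), axis=0)


def cross_entropy(probs, labels):
    """Mean cross-entropy between predicted class probabilities and integer labels."""
    rows = np.arange(len(labels))
    return float(-np.mean(np.log(probs[rows, labels] + 1e-12)))


# Hypothetical global sub-models (one weight matrix per modality) and local data.
rng = np.random.default_rng(0)
num_classes = 3
submodels = {"audio": rng.normal(size=(4, num_classes)),
             "video": rng.normal(size=(6, num_classes))}
features = {"audio": rng.normal(size=(5, 4)), "video": rng.normal(size=(5, 6))}
labels = np.array([0, 2, 1, 1, 0])

# Each sub-model produces its own decision; missing modalities simply have no entry.
outputs = {m: softmax(x @ submodels[m]) for m, x in features.items()}
fused = fuse_decisions(outputs)
multimodal_loss = cross_entropy(fused, labels)

# A user that only owns audio fuses a single decision, which is that decision itself.
audio_only = fuse_decisions({"audio": outputs["audio"]})
```

Because fusion is a plain average over available decisions, a user with fewer modalities needs no placeholder or reconstructed input for the modalities it lacks.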

[0117] In addition to the multimodal loss function, each unimodal sub-model in the multimodal model can also utilize unimodal features to complete machine learning inference tasks. Therefore, the sum of the user's unimodal loss functions can be expressed as:

[0118]

[0119] For each modality owned by the user, the single-modal loss function can be calculated as follows:

[0120]

[0121] where $v_m$ is the weight of the single-modal loss function of modality m, and $x_{i,m,j}$ is the input feature vector.

[0122] Due to modal heterogeneity, for each modality that the user is missing, the single-modal loss function is defined as the global loss, and its gradient is defined as the global single-modal gradient. That is:

[0123]

[0124] The above definition aims to unify the forms of multimodal loss functions and unimodal loss functions, by adding the two together to obtain the user's local loss function form:

[0125] $H_i(\theta^{n-1}) = F_i(\theta^{n-1}) + G_i(\theta^{n-1})$.

[0126] During the user's local update process, the number of local training cycles in each communication round is set to 1. Batch gradient descent is used to obtain the local gradient by adding the multimodal gradient to the unimodal gradient.

[0127]

[0128] Similar to the local model, the local gradient also consists of M local sub-gradients, i.e.:

[0129]

[0130] For each modality owned by the user, the corresponding local sub-gradient can be calculated as follows:

[0131]

[0132] For modalities where the user is missing, the local sub-gradient is also defined as the global gradient. Ultimately, the local update formula for user i can be obtained:

[0133]

[0134] Where η is the pre-set learning rate. Although the above equation formally includes model updates for missing modalities, for those modalities the calculation is identical to the global update.
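A minimal sketch of one local update step is given below. It assumes a single gradient step with learning rate η on each sub-model the user owns, and it assumes that sub-models of missing modalities are simply left at their received global values on the user side; the exact update formula is not reproduced in this text, and the function and variable names are hypothetical.

```python
import numpy as np


def local_update(global_model, local_grads, lr):
    """One local step: owned modalities take a gradient step, others stay unchanged."""
    updated = {}
    for modality, theta in global_model.items():
        if modality in local_grads:
            updated[modality] = theta - lr * local_grads[modality]
        else:
            # Assumption: no local computation happens for a missing modality.
            updated[modality] = theta.copy()
    return updated


# Hypothetical global model with two sub-models; the user only owns "audio".
global_model = {"audio": np.array([1.0, 1.0]), "video": np.array([2.0, 2.0])}
local_grads = {"audio": np.array([0.5, 0.5])}
local_model = local_update(global_model, local_grads, lr=0.1)
```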

[0135] The goal of multimodal federated learning is to minimize the global loss function, which is a weighted sum of the local loss functions. Therefore, the goal of the entire wireless multimodal federated learning system is to minimize the sum of the multimodal loss function and the unimodal loss function.

[0136]

[0137] User participation is represented by the participation vector $a^n$. The scheduled users form the set of users participating in the nth communication round, and those among them that possess data of modality m form the participating set of modality m in the nth communication round.

[0138] After completing local training, the user uploads the local model, as shown in Figure 1. For each modality possessed by at least one participating user, the aggregation takes the form:

[0139]

[0140] The formula above uses the aggregation weight of each participating user, and η is the pre-set learning rate.

[0141] When no scheduled user owns modality m, that is, when the participating set of modality m is empty, the parameters of the single-modal sub-model of that modality are kept the same as at the end of the previous communication round, that is:

[0142]
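The per-modality aggregation rule, including the fallback when no scheduled user owns a modality, can be sketched as follows. The aggregation weights are not reproduced in this text, so this sketch assumes weights proportional to each participating user's dataset size; the function name and data are hypothetical.

```python
import numpy as np


def aggregate(prev_global, uploads, data_sizes):
    """Aggregate uploaded sub-models per modality.

    uploads: {user: {modality: parameters}}, data_sizes: {user: D_i}.
    """
    new_global = {}
    for modality, prev in prev_global.items():
        owners = [u for u, model in uploads.items() if modality in model]
        if not owners:
            # No scheduled user owns this modality: keep last round's parameters.
            new_global[modality] = prev.copy()
            continue
        total = sum(data_sizes[u] for u in owners)
        # Assumed weights: dataset size normalised over the owners of this modality.
        new_global[modality] = sum(
            (data_sizes[u] / total) * uploads[u][modality] for u in owners
        )
    return new_global


prev_global = {"audio": np.zeros(2), "video": np.ones(3)}
uploads = {1: {"audio": np.array([1.0, 1.0])},
           2: {"audio": np.array([4.0, 4.0])}}
data_sizes = {1: 100, 2: 200}
new_global = aggregate(prev_global, uploads, data_sizes)
```

Normalising the weights over the owners of each modality, rather than over all scheduled users, keeps every sub-model a convex combination of the uploads it actually received.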

[0143] In the wireless multimodal federated learning architecture of Figure 1, at the beginning of each communication round, the server first solves the performance optimization problem of wireless multimodal federated learning based on the user, channel, and model conditions, thereby generating the user participation vector $a^n$ and the vector $B^n$ of channel bandwidths used by users to upload local models. Their i-th elements are, respectively, a 0-1 indicator variable showing whether user i has been scheduled and the uplink bandwidth allocated to user i. The server then sends out $a^n$, $B^n$, and the multimodal global model. Users that are not scheduled do not participate in training and uploading during this communication round, so the final global aggregation is completed with only some of the users participating.

[0144] The communication process in wireless multimodal federated learning can be divided into two parts: base station downlink communication and user uplink communication. Communication between the base station and the server is achieved via fiber optic cable, resulting in high communication rates and negligible latency. Downlink communication relies on the base station's broadcast communication. Due to the base station's large downlink transmission power and ample downlink bandwidth, the downlink latency is negligible compared to the uplink latency. Similarly, since the base station generally has sufficient energy supply, the focus is on the energy consumption of each user terminal.

[0145] After completing a local update, users incur communication latency and energy overhead while uploading their local multimodal models. Because modal heterogeneity causes significant differences in the amount of data transmitted by users, multi-user uplink communication is implemented using Frequency Division Multiple Access (FDMA). According to Shannon's formula, in the nth communication round, the uplink rate of user i is:

[0146]

[0147] where the formula involves the bandwidth allocated to user i, the uplink communication power p, the power spectral density of white noise $N_0$, and the channel gain. In an FDMA system, the total bandwidth allocated to participating users cannot exceed the total bandwidth $B_{\max}$ that the base station uses for wireless multimodal federated services:

[0148]

[0149] where the 0-1 indicator variable shows whether user i has been scheduled, and $B_{\max}$ is the total bandwidth used by the base station for wireless multimodal federated services;

[0150] Depending on the modalities they possess, different users upload single-modal sub-models of different modalities, but single-modal sub-models of the same modality have the same structure; therefore, $l_m$ can denote the data length of the single-modal sub-model of modality m. The total data volume of user i's uplink communication is the sum of $l_m$ over the modalities it possesses, which gives the uplink delay of user i in the nth communication round:

[0151]

[0152] This allows us to determine the energy cost of user i during the uplink communication process:

[0153]
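The uplink rate, latency, and energy relations can be illustrated with the following Python sketch. The equations themselves are not reproduced in this text, so the sketch assumes the standard Shannon form for an FDMA sub-band, rate = B · log2(1 + p·h / (N0·B)), a latency equal to the uploaded bits divided by the rate, and an energy equal to the transmit power times the latency. All names and numbers are hypothetical.

```python
import math


def uplink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_psd):
    """Assumed Shannon rate in bit/s for one user's FDMA sub-band."""
    if bandwidth_hz <= 0:
        return 0.0
    snr = tx_power_w * channel_gain / (noise_psd * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)


def uplink_latency(model_bits, rate_bps):
    """Time to upload the sub-models of all owned modalities."""
    return sum(model_bits.values()) / rate_bps


def uplink_energy(tx_power_w, latency_s):
    """Transmit energy: constant power over the upload duration."""
    return tx_power_w * latency_s


def bandwidth_feasible(schedule, bandwidths, b_max):
    """FDMA constraint: bandwidth of scheduled users must not exceed B_max."""
    return sum(a * b for a, b in zip(schedule, bandwidths)) <= b_max


# Hypothetical values chosen so that the SNR is about 1, i.e. rate is about B bit/s.
rate = uplink_rate(bandwidth_hz=1e6, tx_power_w=0.1, channel_gain=1e-7, noise_psd=1e-14)
latency = uplink_latency({"audio": 2e6, "video": 3e6}, rate)
energy = uplink_energy(0.1, latency)
ok = bandwidth_feasible([1, 0, 1], [1e6, 2e6, 1e6], b_max=2e6)
```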

[0154] Local updates by users incur computation latency and energy costs. It is assumed that all users have the same computing power, i.e., the CPU frequency f and energy consumption coefficient α are identical. Because the single-modal sub-models differ in structure and the input data differ in characteristics, the computation required to train on a single sample also varies; $\beta_m$ denotes the number of CPU cycles required to train one sample on modality m. Under the decision-level fusion method, the computation required to train multiple single-modal sub-models is additive, while the computation of decision-level fusion is linear in the number of modalities: adding one modal decision requires $\beta_0$ CPU cycles, so the cycles needed to sum the decisions grow in proportion to the number of modalities the user possesses. Although the local loss function includes both multimodal and unimodal local losses, the outputs of the unimodal sub-models have already been obtained when calculating the multimodal local loss. Therefore, only the loss between each unimodal sub-model output and the label needs to be calculated additionally, and this cost is negligible compared with forward propagation in the neural network. Similarly, the computation required for the unimodal local gradient is also negligible. This determines the number of CPU cycles user i requires to train one sample, from which the user's computation latency and computation energy overhead can be expressed as follows:

[0155]

[0156] Here, $\beta_0$ is the number of CPU cycles required for one addition of modal decisions.
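The computation model of [0154] can be illustrated with a short Python sketch. The expressions below are assumptions, since the formula of [0155] is not reproduced in this text: the fusion of K modalities is taken to need K - 1 additions of $\beta_0$ cycles each, latency is taken as total cycles divided by the CPU frequency f, and energy follows the commonly used model α · cycles · f². All names and numbers are hypothetical.

```python
def cycles_per_sample(user_modalities, beta, beta0):
    """CPU cycles for one sample: additive sub-model cost plus fusion cost."""
    k = len(user_modalities)
    # Assumption: fusing k modal decisions takes k - 1 additions.
    fusion_cycles = beta0 * max(k - 1, 0)
    return sum(beta[m] for m in user_modalities) + fusion_cycles

def computation_cost(num_samples, cycles, cpu_freq_hz, alpha):
    """Return (latency in s, energy in J) for one local update."""
    total_cycles = num_samples * cycles
    latency = total_cycles / cpu_freq_hz
    # Assumption: energy per cycle scales with the square of the frequency.
    energy = alpha * total_cycles * cpu_freq_hz ** 2
    return latency, energy

# Hypothetical per-sample costs beta_m and fusion cost beta_0.
beta = {"image": 2e6, "audio": 5e5}
cycles = cycles_per_sample(["image", "audio"], beta, beta0=1e3)
latency, energy = computation_cost(100, cycles, cpu_freq_hz=1e9, alpha=1e-28)
print(cycles, latency, energy)
```

Because every user shares the same f and α in this model, differences in computation latency and energy between users come only from their modality sets and dataset sizes.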

[0157] Each user participating in the nth communication round must finish uploading its local model within the specified maximum delay $T^{\max}$. The user must therefore satisfy the delay constraint:

[0158] $T_i^{n,\mathrm{com}} + T_i^{\mathrm{cmp}} \le T^{\max}$.

[0159] Besides the delay constraint, users are generally battery-powered wireless terminals that cannot afford excessive energy consumption. The energy allocated to each user for a single communication round of wireless multimodal federated learning is therefore set to $E^{\mathrm{add}}$, and the remaining energy of user i in the nth communication round is:

[0160]

[0161] Unlike latency, the remaining energy from each communication round can be accumulated and used in high-overhead communication rounds. For the entire training process, it is only necessary for the total remaining energy of user i to be non-negative, i.e., there is a constraint:

[0162]
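The delay constraint of [0158] and the accumulated energy constraint of [0161] can be checked as in the following Python sketch. It assumes that the remaining energy of a round equals $E^{\mathrm{add}}$ minus the communication and computation energy spent if the user is scheduled (0-1 indicator), which is how the claims describe it in words; the function names and numbers are hypothetical.

```python
def delay_ok(t_com, t_cmp, t_max):
    """Per-round delay constraint: upload plus computation within T_max."""
    return t_com + t_cmp <= t_max

def remaining_energy(scheduled, e_com, e_cmp, e_add):
    """Energy left over in one round; `scheduled` is the 0-1 indicator."""
    return e_add - scheduled * (e_com + e_cmp)

def energy_budget_ok(schedule, e_com, e_cmp, e_add):
    """Whole-training constraint: accumulated remaining energy >= 0."""
    total = sum(remaining_energy(a, ec, ep, e_add)
                for a, ec, ep in zip(schedule, e_com, e_cmp))
    return total >= 0

# Three rounds for one user: idle rounds save energy for expensive rounds.
schedule = [1, 0, 1]
e_com = [0.25, 0.25, 0.5]
e_cmp = [0.25, 0.25, 0.5]
print(energy_budget_ok(schedule, e_com, e_cmp, e_add=0.6))   # budget holds
print(energy_budget_ok([1, 1, 1], e_com, e_cmp, e_add=0.6))  # budget violated
```

The delay check applies to every round separately, while the energy check applies only to the running total, which is why a round that overspends can be compensated by rounds in which the user is not scheduled.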

[0163] Under the bandwidth, latency, and energy constraints, the access users and the bandwidth allocation are decided. Taking the loss function $H(\theta^N)$ as the objective function, the performance optimization problem of wireless multimodal federated learning can be expressed as:

[0164]

[0165] Here, N→+∞ signifies that training runs for enough communication rounds for the multimodal model to converge.

[0166] Based on the properties that the loss function H(θ) is γ-smooth and ρ-Lipschitz continuous, and under the assumptions that the global gradient of each single-modal sub-model has an upper bound and that the difference between the local gradient and the global gradient of each single-modal sub-model has an upper bound, the upper bound of the loss function can be derived as follows:

[0167]

[0168] In this bound, γ is the smoothness coefficient and ρ is the Lipschitz continuity coefficient; the remaining coefficients are the upper bound coefficient of the modality-m gradient and the modality-m gradient difference coefficient of user i. All four coefficients can be estimated. The bound also involves the ideal aggregation weight of user i when all users are connected, together with intermediate process variables.

[0169] After the constant term $H(\theta^0)$ is subtracted from the objective function of P1, the objective can be decomposed into a sum of differences between successive global loss functions. In addition, for the long-term average energy constraint C5, a virtual queue for energy is first constructed, and a virtual-queue mean stability constraint C5′ equivalent to C5 is given:

[0170]

[0171] Here, the queue variable denotes the virtual queue for energy.
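A virtual energy queue of the kind described in [0169] can be sketched in Python as follows. The update rule below is the standard Lyapunov virtual-queue form (backlog grows by the energy consumed in a round, shrinks by the per-round allowance, and never goes below zero); it is an assumption, since the patent's own queue formula is not reproduced in this text. The names and values are hypothetical.

```python
def update_queue(q, consumed, e_add):
    """One virtual-queue step: backlog of energy spent beyond the allowance."""
    return max(q + consumed - e_add, 0.0)

def simulate(consumption, e_add):
    """Return the queue backlog after each communication round."""
    q, trace = 0.0, []
    for e in consumption:
        q = update_queue(q, e, e_add)
        trace.append(q)
    return trace

# A user alternates between an expensive round and an idle round.
trace = simulate([0.9, 0.0, 0.9, 0.0], e_add=0.5)
print(trace)
```

If the backlog stays bounded over the rounds, the average energy consumed per round does not exceed the allowance, which is the intuition behind replacing the long-term constraint C5 with the queue stability constraint C5′.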

[0172] After performing an equivalent transformation on the objective function and C5, we obtain the standard structure that can be solved using the Lyapunov optimization method:

[0173]

[0174] s.t. C1, C2, C3, C4, C5′.

[0175] Based on the Lyapunov optimization method, the Lyapunov drift function with a penalty term can be given:

[0176]

[0177] After relaxation through a series of inequalities, and after the constant terms and the terms independent of the optimization variables are ignored, an upper bound for optimization and the transformed optimization problem are obtained:

[0178]

[0179] s.t. C1, C2, C3, C4.

[0180] In the objective function $J_1(a^n, B^n)$, the first term is the upper bound on the performance of multimodal federated learning and the second term is the energy cost of the current communication round. The Lyapunov penalty factor V > 0 balances the two: when V→+∞, the optimization tends to increase the energy cost to improve the performance of wireless multimodal federated learning, while when V→0, the optimization tries to reduce the energy cost and ignores the performance of wireless multimodal federated learning. The specific setting of V can be determined according to the energy cost and performance requirements.
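The role of V can be illustrated with a small Python sketch of a generic drift-plus-penalty objective. The exact form of $J_1$ is not reproduced in this text, so the sketch assumes the usual structure: V multiplies the performance bound, and each user's energy cost is weighted by that user's virtual-queue backlog. The candidate decisions and all numbers are hypothetical.

```python
def drift_plus_penalty(perf_bound, energy_costs, queues, V):
    """Assumed objective: V * performance bound + queue-weighted energy."""
    return V * perf_bound + sum(q * e for q, e in zip(queues, energy_costs))

def pick_best(candidates, queues, V):
    """Choose the candidate decision with the smallest objective value."""
    return min(candidates,
               key=lambda c: drift_plus_penalty(c["bound"], c["energy"], queues, V))

candidates = [
    # Schedules one user: little energy, looser performance bound.
    {"name": "frugal", "bound": 1.0, "energy": [0.2, 0.0]},
    # Schedules both users: more energy, tighter performance bound.
    {"name": "greedy", "bound": 0.6, "energy": [0.5, 0.4]},
]
queues = [1.0, 1.0]
small_v = pick_best(candidates, queues, V=0.1)["name"]
large_v = pick_best(candidates, queues, V=10.0)["name"]
print(small_v, large_v)
```

With a small V the energy term dominates and the low-energy decision wins; with a large V the performance term dominates and the higher-energy decision wins, matching the two limiting cases described above.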

[0181] Because $a^n$ is a combinatorial optimization variable and $B^n$ is a continuous optimization variable, the problem can be written as an equivalent master problem P4 and decomposed into a combinatorial optimization subproblem P4.1 and a continuous optimization subproblem P4.2, respectively:

[0182]

[0183] s.t. C1, C2, C3, C4.

[0184]

[0185] s.t. C1, C3.

[0186]

[0187] s.t. C2, C3, C4.

[0188] Here, the optimal bandwidth allocation denotes the optimal solution to P4.2, and the objective function of P4.2 is:

[0189]

[0190] P4.2 can be transformed into a convex problem through a series of transformations, and the optimal bandwidth can be given using the KKT conditions:

[0191]

[0192] The relevant variables satisfy the following relationship:

[0193]

[0194] The above relationship does not admit a closed-form solution, but it can be solved numerically using Newton's iteration method. The remaining formulas involved are:

[0195]

[0196]

[0197] Using the above relationships, the optimal bandwidth vector can be obtained.
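Newton's iteration on a bandwidth relation without a closed form can be illustrated as follows. This Python sketch does not reproduce the patent's KKT relation, which is not available in this text; instead it solves a related, simpler problem as an assumption: finding the bandwidth b at which the Shannon rate b·log2(1 + c/b) equals a required rate, where c stands for transmit power times channel gain divided by noise power spectral density. All names and numbers are hypothetical.

```python
import math

def rate(b, c):
    """Shannon rate b * log2(1 + c / b) with c = p * h / N0."""
    return b * math.log2(1.0 + c / b)

def rate_derivative(b, c):
    """Derivative of rate(b, c) with respect to the bandwidth b."""
    x = c / b
    return math.log2(1.0 + x) - x / ((1.0 + x) * math.log(2.0))

def bandwidth_for_rate(required_rate, c, b0=1.0, tol=1e-9, max_iter=100):
    """Newton iteration for rate(b, c) = required_rate."""
    if required_rate >= c / math.log(2.0):
        raise ValueError("required rate exceeds the infinite-bandwidth limit")
    b = b0
    # Start left of the root; the rate is increasing and concave in b,
    # so the iterates then approach the root from below and stay positive.
    while rate(b, c) >= required_rate:
        b /= 2.0
    for _ in range(max_iter):
        g = rate(b, c) - required_rate
        if abs(g) <= tol * required_rate:
            return b
        b -= g / rate_derivative(b, c)
    raise RuntimeError("Newton iteration did not converge")

c = 0.1 * 1e-7 / 1e-17          # hypothetical p * h / N0
b_star = bandwidth_for_rate(1e6, c)
print(b_star, rate(b_star, c))
```

The same pattern (evaluate the residual, divide by its derivative, repeat until the residual is small) applies to any scalar relation that is smooth and monotone in the unknown.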

[0198] P4.1 can be solved using an immune algorithm. An initial antibody set of S antibodies is generated randomly, where S is the total number of antibodies and each antibody corresponds to a user access vector; in particular, the s-th antibody of generation 0 corresponds to one such vector. Let the current generation be g = 0. Based on the objective function of P4.1, an affinity function aff(·) can be constructed to measure the quality of an antibody:

[0199]

[0200] Here, ι > 0 is the exponential coefficient that adjusts the dispersion of the affinity; the argument of aff(·) is the user access vector corresponding to the s-th antibody in the g-th generation; s is the antibody index; and S is the total number of antibodies.

[0201] For infeasible solutions, the affinity function is set to 0. The antibody concentration function den(·) of the immune algorithm measures the similarity between an antibody and the other antibodies in the set; it is calculated as follows:

[0202]

[0203] The similarity function sim(·) is determined by comparing the Hamming distance dis(·) between two antibodies with the distance threshold Dis:

[0204]

[0205] The affinity and the concentration of an antibody together determine its stimulation level, which is used for antibody selection:

[0206]

[0207] Where ε1 and ε2 represent the weights corresponding to the affinity function and antibody concentration function, respectively; aff() is the affinity function; and den() is the antibody concentration function.
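The three antibody measures described in [0198] to [0207] can be sketched in Python. The exact formulas are not reproduced in this text, so the forms below are assumptions: affinity is exponential in the negated objective value with coefficient ι, two antibodies count as similar when their Hamming distance is below the threshold Dis, concentration is the fraction of similar antibodies in the set, and stimulation is a weighted difference that rewards affinity and penalizes concentration. All function names are hypothetical.

```python
import math

def affinity(objective_value, iota, feasible=True):
    """Assumed form exp(-iota * J); infeasible vectors get affinity 0."""
    return math.exp(-iota * objective_value) if feasible else 0.0

def hamming(a, b):
    """Number of positions where two user access vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def similar(a, b, dis_threshold):
    """1 if the two antibodies are closer than the threshold, else 0."""
    return 1 if hamming(a, b) < dis_threshold else 0

def concentration(antibody, population, dis_threshold):
    """Fraction of the population that is similar to the antibody."""
    hits = sum(similar(antibody, other, dis_threshold) for other in population)
    return hits / len(population)

def stimulation(aff, den, eps1, eps2):
    """Assumed form: high affinity raises it, high concentration lowers it."""
    return eps1 * aff - eps2 * den

population = [(1, 0, 1), (1, 0, 0), (0, 1, 0)]
den = concentration((1, 0, 1), population, dis_threshold=2)
print(den, stimulation(affinity(0.5, iota=1.0), den, eps1=1.0, eps2=0.2))
```

Penalizing concentration keeps the search from collapsing onto many near-identical access vectors, which is the usual reason immune algorithms combine the two measures.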

[0208] The S/μ antibodies with the highest stimulation level are selected from the antibody set. Each selected antibody is then cloned μ times, and the clones undergo random mutation to obtain the mutated set. The highest-affinity antibodies of the mutated set, together with S/μ randomly generated antibodies, constitute the next-generation antibody set. The above operations are then repeated on the new antibody set until the maximum number of iterations is reached, yielding the optimal user access vector $a^{n*}$.
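The select, clone, mutate, and refresh loop of [0208] can be sketched in Python on a toy scheduling problem. Several details are assumptions because the corresponding formulas are not reproduced in this text: the affinity and stimulation forms, the bit-flip mutation, and the choice to keep S - S/μ mutated antibodies so that the population size stays at S. The toy objective, the feasibility rule, and every parameter value are hypothetical.

```python
import math
import random

def hamming(a, b):
    """Number of positions where two access vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def immune_search(objective, feasible, num_users, S=20, mu=4, generations=30,
                  iota=1.0, eps1=1.0, eps2=0.2, dis=2, p_mut=0.1, seed=0):
    rng = random.Random(seed)

    def random_antibody():
        return tuple(rng.randint(0, 1) for _ in range(num_users))

    def aff(ab):
        # Assumed affinity: exponential in the objective, 0 if infeasible.
        return math.exp(-iota * objective(ab)) if feasible(ab) else 0.0

    pop = [random_antibody() for _ in range(S)]
    best = max(pop, key=aff)
    n_sel = S // mu
    for _ in range(generations):
        def stim(ab):
            # Concentration: share of the population within distance `dis`.
            den = sum(hamming(ab, other) < dis for other in pop) / len(pop)
            return eps1 * aff(ab) - eps2 * den

        selected = sorted(pop, key=stim, reverse=True)[:n_sel]
        clones = [ab for ab in selected for _ in range(mu)]
        # Random mutation: flip each bit independently with probability p_mut.
        mutated = [tuple(bit ^ (rng.random() < p_mut) for bit in ab)
                   for ab in clones]
        # Assumption: keep S - S/mu mutated antibodies by affinity.
        survivors = sorted(mutated, key=aff, reverse=True)[:S - n_sel]
        pop = survivors + [random_antibody() for _ in range(n_sel)]
        best = max([best] + pop, key=aff)
    return best

# Toy problem: each scheduled user adds a gain; at most 3 users fit.
gain = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
objective = lambda ab: -sum(g * a for g, a in zip(gain, ab))
feasible = lambda ab: sum(ab) <= 3
best = immune_search(objective, feasible, num_users=6)
print(best, objective(best))
```

Injecting fresh random antibodies in every generation keeps some exploration going even after the selected antibodies have become similar to one another.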

[0209] This invention discloses a wireless multimodal federated learning system. Figure 1 describes the basic structure of the present invention. Figure 2 describes the basic model on which the invention is based. Figure 3 (including Figure 3b), Figures 4a to 4d, and Figures 5a-5d compare the performance of this invention with that of other solutions.

[0210] Modal heterogeneity in multimodal federated learning manifests as different users possessing different modalities of data. Wireless multimodal federated learning is an emerging system that enables users to fully train their data across various modalities. In wireless scenarios, latency is fixed, and bandwidth and computing resources are limited. Modal heterogeneity leads to significant differences in computation and communication between users with different modalities. The impact of modal heterogeneity on multimodal performance needs to be re-characterized.

[0211] This invention provides a wireless multimodal federated learning system (as shown in Figure 1). The system includes servers, base stations, and users with multimodal data. The system design incorporates the performance requirements of federated learning training and, under the constraints of communication and computing resources, improves the performance of multimodal federated learning through user selection and resource scheduling schemes.

[0212] Users in the system possess multimodal data and vary in their computational and communication capabilities. After updating their federated learning models using their own data, users upload the updated models to the base station via a wireless network. The base station configures a server to receive the user-uploaded models and perform partial aggregation; subsequently, the central server receives the models uploaded by the edge servers and performs global aggregation to complete the federated learning task.

[0213] Given the modal differences in user data within the system, this invention aims to schedule user participation in the federated learning process. Simultaneously, taking into account the performance of each modal model and the user's computing and communication capabilities, a novel wireless multimodal federated learning system is designed.

[0214] This invention employs a decision-level fusion multimodal model and constructs a single-modal loss function using the output of each single modality, which is then incorporated into the final objective. This system design can adapt to heterogeneous user modalities and ensures convergence for each modality.
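The loss construction of [0214] can be illustrated with a plain-Python sketch. It follows the description in the claims, where the outputs of a user's modalities are averaged to form the fused multimodal output and the local loss is the multimodal loss plus the sum of the single-modal losses. Two points are assumptions: the averaged outputs are taken to be softmax probabilities, and the loss is taken to be cross-entropy. Modalities that the user lacks are simply omitted here, whereas the claims define their loss through the global model. All names and values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability of the true class."""
    return -math.log(probs[label])

def local_loss(modal_logits, label):
    """modal_logits maps each owned modality to its sub-model's logits."""
    modal_probs = {m: softmax(z) for m, z in modal_logits.items()}
    k = len(modal_probs)
    num_classes = len(next(iter(modal_probs.values())))
    # Decision-level fusion: average the per-modality outputs.
    fused = [sum(p[c] for p in modal_probs.values()) / k
             for c in range(num_classes)]
    multimodal = cross_entropy(fused, label)
    unimodal = sum(cross_entropy(p, label) for p in modal_probs.values())
    return multimodal + unimodal, fused

total, fused = local_loss({"image": [2.0, 0.5, 0.1],
                           "audio": [0.3, 1.2, 0.2]}, label=0)
print(total, fused)
```

Because each single-modal loss appears in the objective on its own, a sub-model keeps receiving a training signal even when the fused output is already dominated by a stronger modality.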

[0215] This invention jointly optimizes user scheduling and uplink bandwidth allocation. User scheduling is based on real-time channel conditions, latency, and energy constraints, while also considering the training and convergence levels for each modality. This optimized design enables the system to schedule appropriate users in a more effective manner, improving the performance of wireless multimodal federated learning.

[0216] The wireless multimodal federated learning system of the present invention can adapt to the situation of heterogeneous user modalities and can ensure that the multimodal model and each single-modal sub-model can be trained to converge using the user's heterogeneous modal data, thus adapting to a wide range of wireless multimodal federated learning application scenarios.

[0217] Those skilled in the art will understand that, in addition to implementing the system, apparatus, and their modules provided by this invention in purely computer-readable program code, the same program can be implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, the system, apparatus, and their modules provided by this invention can be considered a hardware component, and the modules included therein for implementing various programs can also be considered structures within the hardware component; alternatively, modules for implementing various functions can be considered both software programs implementing the method and structures within the hardware component.

[0218] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.

Claims

1. A wireless multi-modal federated learning system for modal-heterogeneous users, characterized in that it comprises:
a server, configured to decide the user scheduling result and the uplink bandwidth allocation result for each communication round, to generate and broadcast a global multimodal model, and to receive the local multimodal models uploaded by the scheduled users and perform global aggregation to update the global multimodal model;
a base station, which uses a wireless network to carry the communication between the server and the users, collects the channel gains of the user equipment, broadcasts the global multimodal model, the user scheduling result, and the uplink bandwidth allocation result issued by the server, and receives the local multimodal models uploaded by the scheduled users and forwards them to the server;
multiple users, each holding heterogeneous multimodal data, meaning that the modalities possessed by different users are arbitrary subsets of the full modality set; a scheduled user updates the received global multimodal model with its local multimodal data to obtain a local multimodal model, and uploads the local multimodal model to the base station over the allocated uplink bandwidth;
wherein the server is either deployed directly on the base station side or connected to the base station via optical cable;
each user participating in the nth communication round must complete the local model upload within a specified maximum latency $T^{\max}$, so the user must satisfy the latency constraint $T_i^{n,\mathrm{com}} + T_i^{\mathrm{cmp}} \le T^{\max}$;
the energy allocated to a user for a single communication round of wireless multi-modal federated learning is set to $E^{\mathrm{add}}$, from which the remaining energy of the user in the nth communication round is obtained, with a 0-1 indicator variable representing whether the user is scheduled;
the remaining energy of each communication round is accumulated for use in high-overhead communication rounds, and over the whole training process it is only required, as a constraint, that the total remaining energy of each user be non-negative;
under the bandwidth, latency, and energy constraints, the access users and the bandwidth allocation are decided, and with the loss function $H(\theta^N)$ as the objective function the performance optimization problem of wireless multimodal federated learning is formulated, in which N→+∞ indicates that enough communication rounds are trained for the multimodal model to converge;
the goal of the entire wireless multimodal federated learning system is to minimize the sum of the multimodal loss function and the unimodal loss function, where N is the total number of communication rounds.

2. The wireless multimodal federated learning system for heterogeneous modal users according to claim 1, characterized in that: in the nth communication round, the server decides the user scheduling vector $a^n$ and the uplink bandwidth allocation result $B^n$ over the total number of users, which determine the set of users participating in that communication round; the server's global multimodal model consists of several global single-modal sub-models; the base station uses the wireless network to carry the communication between the users and the server and collects the channel gain of each user in each communication round; the modality set consists of all the modalities owned by the users, and for each modality there is a set of users possessing that modality; each user has its own modality set, number of modalities, and local dataset of a given size, in which every sample comprises the feature vectors of the user's modalities and a label; after its local update, a scheduled user uploads its local multimodal model; and in the nth communication round there is, for each modality, a set of participating users holding data of that modality.

3. The wireless multimodal federated learning system for heterogeneous modal users according to claim 2, characterized in that: during the local update, user i calculates the multimodal loss function on its local multimodal dataset, where the features of each modality are input to the corresponding global single-modal sub-model to obtain that modality's output, and the outputs of the user's modalities are averaged to obtain the multimodal model output by decision-level fusion; each unimodal sub-model in the multimodal model uses unimodal features to complete the machine learning inference task, and the user's unimodal loss is expressed as the sum of its single-modal loss functions; for a modality that the user possesses, the single-modal loss function is calculated on the user's local data; due to modal heterogeneity, for a modality that the user lacks, the single-modal loss function is defined as the global loss and its gradient is defined as the global single-modal gradient; adding the multimodal loss and the unimodal loss yields the user's local loss function; during the user's local update, the number of local training epochs in each communication round is set to 1, and batch gradient descent is used to obtain the local gradient; the local gradient consists of local sub-gradients, one for each single-modal sub-model; for a modality that the user possesses, the corresponding local sub-gradient is calculated from the local loss function, and for a modality that the user lacks, the local sub-gradient is defined as the global gradient; finally, the user's local update formula is obtained, in which the learning rate is set in advance.

4. The wireless multimodal federated learning system for heterogeneous modal users according to claim 3, characterized in that: the global multimodal model on the server is obtained by aggregating the local multimodal models uploaded by the scheduled users; for each modality possessed by a scheduled user, the corresponding single-modal sub-models are aggregated using the defined aggregation weights; when no scheduled user possesses a given modality, the parameters of the single-modal sub-model of that modality are kept the same as at the end of the previous communication round.

5. The wireless multimodal federated learning system for heterogeneous modal users according to claim 4, characterized in that: when user i performs a local update, computation latency and energy costs are incurred; all users are assumed to have the same computing capability, i.e., the same CPU frequency f and energy consumption coefficient α; $\beta_m$ denotes the number of CPU cycles required to train on a single sample of modality m; according to the decision-level fusion method, the number of CPU cycles that user i requires to train one sample is obtained from the $\beta_m$ of its modalities and $\beta_0$, and the user's computation latency and computation energy overhead are obtained from it; $\beta_0$ is the number of CPU cycles required for a single addition of modal decisions.

6. The wireless multimodal federated learning system for heterogeneous modal users according to claim 5, characterized in that: after user i completes its local update, communication latency and energy overhead are incurred while uploading the local multimodal model; multi-user uplink communication is implemented using frequency division multiple access (FDMA); according to Shannon's formula, the uplink rate of user i in the nth communication round is determined by the bandwidth allocated to the user, the uplink communication power, the power spectral density of white noise, and the channel gain; in the FDMA system, the total bandwidth allocated to the participating users does not exceed the total bandwidth that the base station uses for the wireless multimodal federated learning service; depending on the modalities they possess, different users upload single-modal sub-models of different modalities, and single-modal sub-models of the same modality have the same structure; $l_m$ denotes the data length of the single-modal sub-model of modality m, from which the total data volume of the user's uplink communication, the uplink latency of user i in the nth communication round, and the energy consumption of user i during uplink communication are obtained.

7. The wireless multimodal federated learning system for heterogeneous modal users according to claim 6, characterized in that: based on the properties that the loss function is γ-smooth and ρ-Lipschitz continuous, and under the assumptions that the global gradient of each single-modal sub-model has an upper bound and that the difference between the local gradient and the global gradient of each single-modal sub-model has an upper bound, the upper bound of the loss function is derived, in which γ is the smoothness coefficient, ρ is the Lipschitz continuity coefficient, the remaining coefficients are respectively the upper bound coefficient of the modality-m gradient and the modality-m gradient difference coefficient of user i, the aggregation weight is the ideal aggregation weight when all users are connected, and the other quantities are process variables.

8. The wireless multimodal federated learning system for heterogeneous modal users according to claim 7, characterized in that: after the constant term $H(\theta^0)$ is subtracted from the objective function, the objective is decomposed into a sum of differences between successive global loss functions; in addition, for the long-term average energy constraint C5, a virtual queue for energy is first constructed, and a virtual-queue mean stability constraint C5′ equivalent to C5 is given; after the equivalent transformation of the objective function and C5, the standard structure that can be solved using the Lyapunov optimization method is obtained; based on the Lyapunov optimization method, the Lyapunov drift function with a penalty term is given, in which V is the Lyapunov penalty factor; substituting the performance upper bound, relaxing through a series of inequalities, and ignoring the constant terms and the terms independent of the optimization variables yields an upper bound for optimization and the transformed optimization problem; in its objective function $J_1(a^n, B^n)$, the first term represents the upper bound on the performance of multimodal federated learning and the second term represents the energy cost of the current communication round; the Lyapunov penalty factor V > 0 weighs the two: when V→+∞, the optimization tends to increase the energy overhead to improve the performance of wireless multimodal federated learning, while when V→0, the optimization strives to reduce the energy overhead and neglects the performance of wireless multimodal federated learning; the specific setting of V is determined according to the energy overhead and performance requirements.

9. The wireless multimodal federated learning system for heterogeneous modal users according to claim 8, characterized in that: because $a^n$ is a combinatorial optimization variable and $B^n$ is a continuous optimization variable, the problem is written as an equivalent master problem P4 and decomposed into a combinatorial optimization subproblem P4.1 and a continuous optimization subproblem P4.2; P4.2 is transformed into a convex problem through a series of transformations, and the optimal bandwidth is given using the KKT conditions; the relationship satisfied by the relevant variables is solved numerically using Newton's iteration method, and the optimal bandwidth vector is obtained from these relationships; P4.1 is solved using an immune algorithm, in which a randomly generated initial antibody set contains S antibodies, S being the total number of antibodies, the s-th antibody of generation 0 corresponds to a user access vector, and the current generation is set to g = 0; according to the objective function of P4.1, an affinity function aff(·) that measures the quality of an antibody is constructed, in which ι > 0 is the exponential coefficient that adjusts the dispersion of the affinity, and for infeasible solutions the affinity function is set to 0; the antibody concentration function den(·) of the immune algorithm measures the similarity of an antibody to the other antibodies in the set; the similarity function sim(·) is determined by comparing the Hamming distance dis(·) between two antibodies with the distance threshold Dis; the affinity and the concentration of an antibody together determine its stimulation level for antibody selection, where ε1 and ε2 are the weights corresponding to the affinity function and the antibody concentration function, respectively; the S/μ antibodies with the highest stimulation level are selected from the antibody set; each selected antibody is cloned μ times, and the clones are randomly mutated to obtain the mutated set; the highest-affinity antibodies of the mutated set, together with S/μ randomly generated antibodies, constitute the next-generation antibody set; the above operations are then repeated on the new antibody set until the maximum number of iterations is reached, thereby obtaining the optimal user access vector $a^{n*}$.