Method and apparatus for training a model
By introducing attention-based fusion weights into vertical federated learning, sparse tensors are weighted and fused. This addresses the problems of high communication volume and low model accuracy, and achieves more efficient model training and more accurate data processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
- Filing Date
- 2023-04-23
- Publication Date
- 2026-06-16
AI Technical Summary
In the vertical federated learning architecture, the amount of communication between feature members and label members is large, resulting in a long training time, and the fusion of sparse intermediate results may affect the model accuracy and convergence speed.
We employ attention-based fusion weights to perform weighted fusion of sparse tensors, utilize the fusion and prediction modules in the global model, train the model using the sparsified intermediate results to reduce communication overhead, and update local model parameters through gradient feedback.
While reducing communication volume, it improves the convergence speed and accuracy of the model, ensuring the efficiency of vertical federated learning and the accuracy of data processing results.
Figure CN116468115B_ABST
Description
Technical Field
[0001] One or more embodiments of this specification relate to the field of computer technology, and more particularly to a method and apparatus for jointly training a model.
Background
[0002] With the rapid development of deep learning, artificial intelligence technology is demonstrating its advantages in almost every industry. However, big data-driven AI faces many challenges in reality. For example, data silos are severe, resulting in low utilization and persistently high costs. Individual training members in some industries may also have limited or poor-quality data. Furthermore, due to industry competition, privacy concerns, and complex management procedures, data integration between different departments within the same company can face significant obstacles and high costs.
[0003] Federated learning was proposed in this context. Federated learning is a framework based on distributed machine learning, whose main idea is to build machine learning models from datasets distributed across multiple devices while preventing data leakage. In this framework, clients (e.g., mobile devices) collaboratively train the model under the coordination of a server, while the training data stays on the client, eliminating the need to upload data to a data center as in traditional machine learning methods. During federated learning, the amount of data transmitted is usually proportional to the sample size: the larger the data volume, the greater the communication volume. For large-scale joint learning, excessive communication volume can lead to a longer overall training time.
Summary of the Invention
[0004] One or more embodiments of this specification describe a method, apparatus, and system for jointly updating a model, to address one or more of the problems mentioned in the background.
[0005] According to a first aspect, a method for jointly training a model is provided, applicable to a process in which multiple training members in a vertical federated learning architecture jointly update undetermined parameters of the model using their local private data. The multiple training members include at least one feature member and a label member holding label data. The model includes local models corresponding to the respective feature members and a global model corresponding to the label member. The global model includes a fusion module and a prediction module. The method is executed by the label member and, in the current parameter update cycle of the model, includes: receiving a sparse tensor from each feature member, wherein a single sparse tensor is obtained by sparsifying the local intermediate tensor of a single feature member, and the local intermediate tensor of a single feature member is an intermediate result obtained by that member processing its feature data of the current batch of training samples with its own local model; fusing, by the fusion module, the sparse tensors based on their corresponding fusion weights to obtain a fused tensor, wherein each fusion weight is an undetermined parameter of the global model; processing the fused tensor with the prediction module to obtain a prediction result for the current batch of training samples, and determining the model loss by comparing the prediction result with the label data; determining, based on the model loss, global gradient data for each undetermined parameter of the global model and intermediate gradient data for each intermediate tensor, wherein each piece of intermediate gradient data is provided to the corresponding feature member to update the corresponding local model; and updating each undetermined parameter of the global model using the global gradient data.
[0006] In one embodiment, fusing the sparse tensors based on their respective fusion weights using the fusion module to obtain the fused tensor includes: determining a state matrix for each sparse tensor, where a single state matrix describes the positions of the valid elements of the corresponding sparse tensor, the valid elements being the elements retained when the intermediate tensor is sparsified, with the positions of valid elements marked in the state matrix by a predetermined non-zero value and all other positions set to 0; and performing an element-wise division of the summed tensor, obtained by summing the sparse tensors, by the weighted tensor, obtained by a weighted sum of the state matrices according to their respective fusion weights, to obtain the fused tensor.
[0007] In one embodiment, when the first element in the weighted tensor is 0, a predetermined element value or a predetermined marker value is used as the result of the division operation at the position corresponding to the first element in the fused tensor.
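As an illustration of the state-matrix fusion and the zero-denominator fallback, the following is a minimal sketch in plain Python, with nested lists standing in for two-dimensional tensors. It follows one literal reading of the text above (summed sparse tensors divided element-wise by the fusion-weighted sum of state matrices). The function names `state_matrix` and `fuse`, the use of one scalar fusion weight per feature member, the marker value 1, and the `fill_value` parameter are assumptions made for the example, not details taken from the specification.

```python
def state_matrix(sparse):
    """Return 1 where the sparse tensor holds a valid (retained) element, else 0.

    Assumption: a retained element is never exactly 0, so non-zero means valid.
    """
    return [[1 if v != 0 else 0 for v in row] for row in sparse]


def fuse(sparse_tensors, fusion_weights, fill_value=0.0):
    """Element-wise sum(H_i) / sum(w_i * S_i) over the feature members i."""
    states = [state_matrix(h) for h in sparse_tensors]
    rows, cols = len(sparse_tensors[0]), len(sparse_tensors[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            summed = sum(h[r][c] for h in sparse_tensors)
            weighted = sum(w * s[r][c] for w, s in zip(fusion_weights, states))
            # Denominator is 0 (e.g. no member kept this position): use the fallback.
            fused_row.append(summed / weighted if weighted != 0 else fill_value)
        fused.append(fused_row)
    return fused


h1 = [[3.0, 0.0, 0.0]]  # sparse tensor from one feature member
h2 = [[1.0, 2.0, 0.0]]  # sparse tensor from another feature member
fused = fuse([h1, h2], fusion_weights=[1.0, 1.0])
print(fused)  # [[2.0, 2.0, 0.0]]
```

With unit weights the result is the average over the members that actually retained each position, rather than over all members. In training, the fusion weights would be adjusted by gradient descent like any other undetermined parameter.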
[0008] In one embodiment, the at least one feature member includes a first member, and the single sparse tensor corresponding to the first member is a first sparse tensor, which is obtained by retaining K elements in a first intermediate tensor as valid elements and setting the other elements to zero.
[0009] In one embodiment, the K elements are the K largest elements in the first intermediate tensor, or K elements randomly selected from the first intermediate tensor.
[0010] In one embodiment, K is determined in one of the following ways: it is preset; it is determined according to a predetermined compression ratio α; or it is determined according to a compression ratio α_t that is negatively correlated with the current cycle number t.
[0011] In one embodiment, each sparse tensor is in the following compressed format: it includes only the valid elements, and each valid element is described by its position in the corresponding intermediate tensor and its value.
[0012] According to a second aspect, an apparatus for jointly training a model is provided, applicable to a process in which multiple training members in a vertical federated learning architecture jointly update undetermined parameters of the model using their local private data. The multiple training members include at least one feature member and a label member holding label data. The model includes local models corresponding to the respective feature members and a global model corresponding to the label member. The global model includes a fusion module and a prediction module. The apparatus is located at the label member and includes a receiving unit, a fusion unit, a prediction unit, a gradient determination unit, and an update unit.
[0013] In the current parameter update cycle of the model:
[0014] The receiving unit is configured to receive a sparse tensor from each feature member, wherein a single sparse tensor is obtained by sparsifying the local intermediate tensor of a single feature member, and the local intermediate tensor of a single feature member is an intermediate result obtained by that member processing its local feature data of the current batch of training samples with its local model.
[0015] The fusion unit is configured to use the fusion module to fuse the sparse tensors based on their corresponding fusion weights to obtain a fused tensor, wherein each fusion weight is an undetermined parameter of the global model.
[0016] The prediction unit is configured to process the fused tensor with the prediction module to obtain the prediction result for the current batch of training samples, and to determine the model loss by comparing the prediction result with the label data.
[0017] The gradient determination unit is configured to determine, based on the model loss, the global gradient data of each undetermined parameter of the global model and the intermediate gradient data corresponding to each intermediate tensor. Each piece of intermediate gradient data is provided to the corresponding feature member to update the corresponding local model.
[0018] The update unit is configured to update each undetermined parameter in the global model using global gradient data.
[0019] According to a third aspect, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of the first aspect.
[0020] According to a fourth aspect, a computing device is provided, including a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, it implements the method of the first aspect.
[0021] According to the methods and apparatus provided in the embodiments of this specification, under a vertical federated learning architecture, feature members process local feature data using their local models to obtain intermediate tensors, sparsify the intermediate tensors, and pass them to the label member. The label member, through the fusion module in the global model, fuses the sparse tensors based on their corresponding fusion weights, treating these fusion weights as undetermined parameters under the attention mechanism and adjusting them during model training. This sparse tensor fusion method can reduce the communication volume between feature members and the label member, while using the fusion weights to attend to the importance of the valid elements in each sparse tensor, thereby improving model convergence speed and ensuring model accuracy.
Brief Description of the Drawings
[0022] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0023] Figure 1 is a schematic diagram of the system architecture of the training members under the vertical federated learning architecture;
[0024] Figure 2 is a schematic diagram of the model architecture of the training members under the vertical federated learning architecture;
[0025] Figure 3 illustrates the interaction flow of a single parameter update cycle of jointly training a model according to an embodiment of this specification;
[0026] Figure 4 is a flowchart of jointly training a model as performed by the label member according to one embodiment of this specification;
[0027] Figure 5 is a schematic block diagram of an apparatus for jointly training a model, located at the label member, according to one embodiment of this specification.
Detailed Description
[0028] The solution provided in this specification will now be described with reference to the accompanying drawings.
[0029] Federated learning, also known as federated machine learning, consortium learning, or alliance learning, is a machine learning framework that effectively helps multiple organizations use data and perform machine learning modeling while meeting user privacy, data security, and regulatory requirements.
[0030] Specifically, suppose company A and company B each build a task model, where a single task could be classification or prediction, and these tasks have already been approved by their respective users when the data was acquired. However, due to incomplete data—for example, company A lacks labeled data, company B lacks user feature data, or the data is insufficient, with an inadequate sample size to build a good model—the models on each end may fail to be built or perform poorly. Federated learning aims to solve the problem of how to build high-quality models on both A and B, where the training of the model utilizes data from both companies, and each company's proprietary data remains unknown to other parties, i.e., a shared model is built without violating data privacy regulations. This shared model is like a superior model built by aggregating the data from all parties. In this way, the built model serves only the respective objectives of each party in its region.
[0031] In federated learning, the various entities can be referred to as training members (or data parties, etc.). Each training member can hold different business data and participate in the joint training of the model through devices, computers, servers, etc. This business data can be various types of data, such as characters, images, voice, animation, and video. Typically, the business data held by each training member is related, and the business parties corresponding to each training member can also be related. For example, among multiple business parties involved in financial business, business party 1 is a bank, providing savings and loan services to users, and can hold data such as users' income and expenditure records, loan amounts, and deposit amounts; business party 2 is a wealth management platform, and can hold data such as users' loan records, investment records, and repayment deadlines; business party 3 is a shopping website, and can hold data such as users' shopping habits, payment habits, and payment accounts. For example, in a healthcare business involving multiple stakeholders, each stakeholder could be a hospital, a medical examination institution, etc. Stakeholder 1 could be Hospital A, whose local business data includes user symptoms, diagnoses, treatment plans, treatment results, etc. Stakeholder 2 could be Medical Examination Institution B, whose local business data includes user symptoms, examination conclusions, etc., and so on. A single training member in federated learning can hold business data from one stakeholder or multiple stakeholders.
[0032] In a federated learning architecture, two or more data providers can jointly train a model. This model can be any model used to process business data and obtain corresponding business processing results; it can also be called a business model. For example, business data could be user financial data, and the resulting business processing result could be the user's financial credit assessment. Another example is user customer service dialogue data, and the resulting business processing result could be a customer service answer recommendation, and so on. Each training member can use its local business model to process its local business data locally. The goal of federated learning is to train a model that can better handle this business data; therefore, a federated learning model can also be called a business model.
[0033] Federated learning is divided into horizontal federated learning and vertical federated learning. In a horizontal federated learning architecture, the sample sets of different training members have a high degree of feature overlap, but the samples come from different sources. For example, multiple sample sets correspond to customers of different banks. Generally, the data features managed by banks are similar, but the customers may be different. Thus, when different training members hold business data from different banks, a horizontal federated learning approach can be used to train the model. In a vertical federated learning architecture, different datasets have high overlap in sample IDs (e.g., the same phone numbers), but the features are different. For example, consider a bank and a hospital serving the same user group (such as the residents of a small county). The samples held by the bank and the hospital overlap to a large degree in the people they cover, but the features are different. Bank data may correspond to features such as deposits and loans, while hospital data may correspond to features such as physiological indicators, health status, and medical records. Here, training members holding business data from the bank and the hospital respectively can jointly train the model using a vertical federated learning approach.
[0034] The technical solutions provided in this specification are improvements to the vertical federated learning architecture.
[0035] Vertical federated learning can also be called vertical split learning, split learning, etc. Figure 1 illustrates a specific implementation architecture for vertical federated learning. In this architecture, the feature data of the training samples is distributed among some (usually most) or all of the training members, while the label data is typically held by a few (e.g., one) training members. For ease of description, this specification refers to the training members holding feature data as feature members (e.g., training member 1, training member 2, ..., training member n in Figure 1), and to the training member holding label data as the label member (e.g., training member X in Figure 1). Specifically, each feature member can hold a portion of the feature data of the training samples, such as the first feature data, the second feature data, and so on up to the nth feature data. A training sample can thus have first feature data held by training member 1, second feature data held by training member 2, and so on up to nth feature data held by training member n, together with label data held by training member X. In particular, training member X can be any one of training member 1, training member 2, ..., training member n.
[0036] Furthermore, Figure 2 shows the model architecture of vertical federated learning. As shown in Figure 2, in a vertical federated learning scenario, the business model is typically divided into two parts: a local model held by each feature member, which processes local feature data to obtain intermediate results, and a global model held by the label member, which processes the intermediate results of the training members to obtain the final global output (prediction result). Each feature member can provide the label member with the intermediate results obtained by processing local feature data using its local model. The label member can then use the global model to process the intermediate results to obtain the global prediction result.
[0037] On the other hand, the dashed arrows in Figure 2 also illustrate the backpropagation path of parameter gradients in the model. After obtaining the global prediction result (i.e., the global output), the label member can compare it with its local label data to obtain the current model loss. Then, the label member determines the gradient of each undetermined parameter in the global model (the partial derivative of the model loss with respect to that parameter) according to the model loss, in order to update the global model. The gradient of each intermediate result can also be determined using the global model. The label member feeds back the gradient of each intermediate result to the corresponding feature member, which backpropagates it to determine the gradient of each undetermined parameter in its local model and updates those parameters accordingly.
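The forward and backward exchange described above can be sketched as follows. This is a deliberately small illustration rather than the patented method: it omits sparsification and fusion weights, uses a single training sample, and assumes linear local models, a linear global model, and a squared-error loss. The class names `FeatureMember` and `LabelMember`, the initial values, and the learning rate are all hypothetical.

```python
class FeatureMember:
    """Holds a local linear model z = w . x over its own feature slice."""

    def __init__(self, weights):
        self.w = list(weights)

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return sum(wj * xj for wj, xj in zip(self.w, x))

    def backward(self, grad_z, lr):
        # Chain rule: dL/dw_j = dL/dz * x_j, using the gradient fed back
        # by the label member.
        self.w = [wj - lr * grad_z * xj for wj, xj in zip(self.w, self.x)]


class LabelMember:
    """Holds the global model y_hat = sum(v_i * z_i) + b and the label."""

    def __init__(self, n_members):
        self.v = [0.5] * n_members
        self.b = 0.0

    def step(self, zs, y, lr):
        y_hat = sum(vi * zi for vi, zi in zip(self.v, zs)) + self.b
        g = y_hat - y                        # dL/dy_hat for L = 0.5 * (y_hat - y)^2
        loss = 0.5 * g * g
        grad_zs = [g * vi for vi in self.v]  # gradients of the intermediate results
        self.v = [vi - lr * g * zi for vi, zi in zip(self.v, zs)]
        self.b -= lr * g
        return loss, grad_zs


members = [FeatureMember([0.1, 0.1]), FeatureMember([0.1])]
features = [[1.0, 2.0], [0.5]]               # each member's slice of one sample
label_member = LabelMember(len(members))
losses = []
for _ in range(50):                          # 50 parameter update cycles
    zs = [m.forward(x) for m, x in zip(members, features)]
    loss, grad_zs = label_member.step(zs, y=1.0, lr=0.05)
    for m, gz in zip(members, grad_zs):
        m.backward(gz, lr=0.05)
    losses.append(loss)
```

Only the intermediate results `zs` and their gradients `grad_zs` cross the boundary between members. The raw features and the label never leave their holders.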
[0038] When a training member holds both label data and partial feature data, it can hold both the global model and a local model that processes the local feature data. In this case, the part that processes the feature data through the local model can serve as the feature member, while the part that performs global processing by comparing the prediction results of the global model with the label data can serve as the label member. That is, the training member is both a feature member and a label member. Furthermore, the intermediate data obtained by the feature member can be directly used by the label member without communication.
[0039] In a vertical federated learning architecture, there is forward intermediate result and backward gradient transmission between feature members and label members. The amount of data transmitted is proportional to the sample size; the larger the data volume, the greater the communication transmission. For split learning on large-scale data, communication time can lead to a longer overall training time. Conventional techniques can sparsify the intermediate results obtained by the local model processing of corresponding feature data on feature members. This method can reduce communication volume. However, if the sparsity of the intermediate results is high (e.g., retaining 10% of the data), the fusion result of the sparse data will be offset from the fusion result of the original intermediate results during the fusion process, thus affecting the model accuracy and convergence speed of split learning.
[0040] In view of this, this specification provides a technical solution for jointly training a model, applicable to a vertical federated learning architecture. Under the technical concept provided in this specification, the label member performs weighted fusion of the sparse intermediate results using weights based on an attention mechanism. The weights used for this weighting are called fusion weights. The fusion weights can indicate the importance of each value in the intermediate results of the corresponding feature member. During model training, the fusion weights are treated as undetermined parameters and adjusted according to the corresponding gradient data (the partial derivatives of the model loss with respect to the fusion weights). Because the feature members upload sparse intermediate results to the label member, the amount of data transmitted is significantly reduced. Because the sparse data of the feature members is fused according to the fusion weights under the attention mechanism, the deviation between the fused result of the sparse data and the fused result of the original intermediate results is reduced. Together, these effectively ensure the accuracy of the data processing results and the efficiency of vertical federated learning.
[0041] The technical concept of this specification is described in detail below.
[0042] Figure 3 illustrates the interaction flow for jointly updating the model in one embodiment. This interaction flow is described using the interaction between feature members and the label member in one update cycle as an example. For ease of description, Figure 3 shows the interaction flow between one feature member (hereinafter referred to as the first member) and the label member. Assume the first member holds the first feature data of the training samples. This first feature data, together with the feature data held by the other feature members, constitutes the sample feature data of the training samples. The label member holds the label data of the training samples. The label data, together with the feature data held by each feature member, forms the training samples in the training sample set. The training members can perform a private set intersection (PSI) operation using the unique identifiers of the training samples (such as a user's mobile phone number or a unique user ID randomly assigned during registration) to align the training samples. As an example, one PSI process is as follows: each training member calculates the hash value of the unique identifier of each sample (such as a mobile phone number) and uploads it to a server (such as a trusted third party). The server matches the hash values to determine the intersection of the feature datasets and the label dataset. Optionally, the server can also sort the data intersection and feed the sorting result back to each training member, so that each training member can obtain the same batch of training samples according to the same rules during training.
[0043] In the vertical federated learning process, the training members can jointly negotiate the model structure. Based on the negotiation results, each feature member can construct its own local model, and the label member can construct the global model. The negotiated model structure includes at least constraints on the output of each local model (corresponding to the intermediate results) and on the input data format of the global model. For example, the local models may all have the same output format. The specific internal structure of the global and local models can be determined based on the actual characteristics of the data.
[0044] Under the technical concept described in this specification, the global model may include a fusion module and a prediction module. The fusion module is used to fuse the outputs of the local models, and the prediction module can be used to process the fusion result to obtain a prediction result (such as the global output in Figure 2). Under the implementation architecture of this specification, the outputs of the local models can have the same dimensions. For example, for a single training sample, the output can be an m-dimensional vector. For a batch of n training samples, the output can then be an n×m or m×n intermediate tensor.
[0045] It is worth noting that when a training member holds both partial feature data as a feature member and label data as the label member, the output obtained by processing its local feature data through its local model can be used directly in the computation of the global model without data communication. Therefore, the first member involved in Figure 3 can be any feature member other than the label member. The local model corresponding to the first member can be called the first local model.
[0046] It can be understood that the process in which the training members jointly update the model in vertical federated learning can span multiple parameter update cycles. Within each parameter update cycle, the label member can update the undetermined parameters in the global model, and each feature member can update the undetermined parameters in its local model. Undetermined parameters are model parameters that need to be adjusted during model training, such as the propagation weights between hidden neurons in a fully connected neural network, and the coefficients and constant term parameters in a multinomial fusion model. The embodiment of Figure 3 illustrates the interaction flow of a single parameter update cycle. As shown in Figure 3, the interaction process may include the following steps 301 to 308.
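The hash-based alignment described above can be sketched as follows. This is only an illustration of the flow in the text (hash locally, intersect on a server, sort, map back). The function names `hash_ids` and `server_intersect`, the choice of SHA-256, and the sample identifiers are assumptions for the example. Plain hashing of low-entropy identifiers such as phone numbers can be reversed by enumeration, so a deployed PSI protocol would typically rely on stronger cryptographic constructions.

```python
import hashlib


def hash_ids(ids):
    """Run by a training member: map SHA-256 digest -> local identifier."""
    return {hashlib.sha256(i.encode("utf-8")).hexdigest(): i for i in ids}


def server_intersect(hashed_sets):
    """Run by the server: intersect the digests and fix one common order."""
    return sorted(set.intersection(*(set(h) for h in hashed_sets)))


# Hypothetical identifiers held by a feature member and by the label member.
feature_member = hash_ids(["user-001", "user-002", "user-003"])
label_member = hash_ids(["user-002", "user-003", "user-004"])

common = server_intersect([feature_member.keys(), label_member.keys()])

# Each member maps the sorted digests back to its own records, so both
# sides iterate over the shared samples in the same order.
feature_aligned = [feature_member[d] for d in common]
label_aligned = [label_member[d] for d in common]
```

Because both members receive the same sorted list of digests, they can draw identical batches by position without exchanging the identifiers themselves.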
[0047] In step 301, the first member processes the first feature data of several training samples in the current batch based on the first local model to obtain the first intermediate tensor.
[0048] It is understood that the first member holds at least several sets of first feature data for training samples. Each set of first feature data can correspond to one or more feature items. The first feature data can be extracted in advance from local data. For example, in a business scenario where a user's creditworthiness is assessed, one training sample corresponds to one user. The local data held by the first training member consists of the user's financial management, loan, and repayment data. Features such as financial management type, amount, returns, loan frequency, loan amount, and repayment timeliness can then be extracted as the first feature data for the corresponding training sample. The current batch of training samples can include one or more training samples.
[0049] In a vertical federated learning architecture, several training samples in the current batch can be sampled from the local dataset by each training member through a consensus-based privacy-preserving method, and the sampled training samples are mutually aligned. The first feature data can be a subset of the feature data of these training samples. For example, if a training sample has 100 feature items, and the first member holds 10 of them, these 10 feature items can be called the first feature data of the training sample. Correspondingly, the first intermediate tensor can be the processing result of the first local model on these 10 feature items (first feature data) of the current batch. The first local model can be an embedding model or an encoding model, which fuses the feature values in the first feature data or mines its deep features and represents them through a vector of predetermined dimensions. The first local model processes the first feature data of at least one training sample in the current batch, and the output result is denoted as the first intermediate tensor, such as M1.
[0050] Here, the first intermediate tensor can be a one-dimensional tensor, a two-dimensional tensor, or a three-dimensional tensor; this specification does not limit it. For example, the first local model can obtain an m-dimensional embedding vector for each training sample, and n training samples can obtain n m-dimensional embedding vectors, for example, forming an n×m two-dimensional tensor as the first intermediate tensor.
[0051] Further, in step 302, the first member sparsifies the first intermediate tensor to obtain the first sparsified tensor.
[0052] The sparsification process of a tensor involves setting some of its elements to 0. Specifically, it retains the values of elements at certain positions while setting the values of elements at other positions to zero. For example, a sparsified result of a one-dimensional tensor (3, 2, 1) with only one valid element can be (3, 0, 0). These elements with valid values are called valid elements. Assuming the number of elements in the first intermediate tensor is n1, and the number of valid elements retained after sparsification of the first intermediate tensor is n2, then n2 is less than n1, and the remaining n1-n2 elements are set to 0.
[0053] Sparsification of the first intermediate tensor can be performed using various sparsification methods such as random sparsification and Top K sparsification. Taking Top K as an example, the K elements with the largest values are retained (where n² = K), and the other values are set to 0. Random sparsification randomly selects K positions, retains the corresponding values as valid elements, and sets the values at other positions to 0. The specific number of valid elements retained (i.e., K) can be determined in different ways. In one embodiment, the number of valid elements K can be a preset value (e.g., K = 100), in which case the first member can select elements with the preset value as valid elements for sparsification in each parameter update cycle. In another embodiment, the number of valid elements K can be an integer value determined according to a predetermined compression ratio α (e.g., 10%), such as K = [α × n1]. Here, [] can represent rounding, which can be either rounding up or rounding down. In a further embodiment, the compression ratio α can decrease with the cumulative number of cycles. For example, a preset decay factor β (less than 1) can be used, and the compression ratio corresponding to the current cycle t can be α. t The exponential result, with a preset decay factor β as the base and the cumulative number of periods as the exponent, is positively correlated, for example denoted as: α t =α×β t K = α t ×n1. α is the preset initial value of the attenuation factor, such as 20%. In other embodiments, the number of effective elements K can also be determined in other reasonable ways, such as the compression ratio of the current period being positively correlated with the network bandwidth of the current device and negatively correlated with CPU utilization.
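The two sparsification methods and the ways of choosing K can be sketched as follows, with a flat Python list standing in for the (flattened) intermediate tensor. The function names `choose_k`, `top_k_sparsify`, and `random_sparsify` are hypothetical, rounding down is one of the two rounding options the text allows, and the sample values are illustrative.

```python
import math
import random


def choose_k(n1, alpha, beta=None, t=0):
    """K = floor(alpha_t * n1), with alpha_t = alpha * beta**t when a decay
    factor beta is given, and alpha_t = alpha otherwise."""
    alpha_t = alpha if beta is None else alpha * beta ** t
    return math.floor(alpha_t * n1)


def top_k_sparsify(values, k):
    """Keep the k largest values and set every other position to 0."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0 for i, v in enumerate(values)]


def random_sparsify(values, k, rng=None):
    """Keep k randomly chosen positions and set every other position to 0."""
    if rng is None:
        rng = random.Random()
    keep = set(rng.sample(range(len(values)), k))
    return [v if i in keep else 0 for i, v in enumerate(values)]


print(top_k_sparsify([3, 2, 1], k=1))            # [3, 0, 0]
print(choose_k(1000, alpha=0.25))                # 250
print(choose_k(1000, alpha=0.5, beta=0.5, t=2))  # 125
```

With a decay factor below 1, K shrinks from cycle to cycle, so later cycles transmit fewer valid elements than earlier ones.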
[0054] The sparsification result of the first intermediate tensor can be called the first sparsified tensor, denoted as H1. Similar to the first member, each feature member can process the corresponding feature data of the current batch of training samples through its local model to obtain the corresponding intermediate tensors, such as M2, M3, etc. The sparsification results of these intermediate tensors are denoted as H2, H3, etc.
[0055] In step 303, the first member provides the first sparsified tensor to the label member.
[0056] It can be understood that the first sparsified tensor can be a two-dimensional tensor, a three-dimensional tensor, etc. Taking a two-dimensional tensor as an example, the first sparsified tensor can be a sparse matrix. The proportion of elements retained in the sparsified tensor is usually small, for example, less than 10%, so most positions in the first sparsified matrix are 0, while the total number of elements remains unchanged. In other words, the sparsification of the first intermediate tensor reduces the number of valid elements and the numerical complexity.
[0057] To further reduce data transmission volume, in possible designs, the first sparsified tensor can be compressed for communication, so that only the valid elements are transmitted. In this case, a position index and an element value can be transmitted for each valid element. The position index describes the position of the valid element in the first sparsified tensor. In one embodiment, the row and column coordinates of the valid element in the tensor can be used as the position index, such as the position in the 3rd row and 5th column of a two-dimensional tensor, corresponding to x=3, y=5, or (3, 5), etc. The description of a valid element can then be a triple, in which two entries describe the position index and one entry describes the value of the valid element. For example, the triple (3, 5, 5) indicates that the element in the 3rd row and 5th column is a valid element with a value of 5. When the first intermediate tensor has a different number of dimensions, tuples of other lengths can be used to describe the valid elements, which will not be listed here. In another embodiment, the positions in the first sparsified tensor can be numbered sequentially, and the sequence number (e.g., ranging from 1 to n1) can be used as the position index of the corresponding element. For example, the positions of a tensor with 150 rows and 1000 columns can be numbered row by row from 1 to 150,000. The element in the 3rd row and 5th column (e.g., with the value 5) then corresponds to position number 2005 and can be described by the tuple (2005, 5). In other embodiments, there are other methods for communication compression of the first sparsified tensor, which will not be elaborated here. In this way, the amount of data transmitted from the first member to the label member is greatly reduced. Taking a 150-row, 1000-column tensor as an example, assuming a sparsity ratio of 10%, 15,000 out of 150,000 values are retained as valid elements. Compressed into (position number, value) tuples, the transmitted data amounts to 30,000 values (15,000 tuples), far less than a sparse matrix containing 150,000 values, which significantly reduces communication volume.
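The (position number, value) encoding can be sketched as follows. This is only an illustration, assuming NumPy arrays, row-by-row numbering starting at 1, and that valid elements are recognized as the non-zero entries; `compress` and `decompress` are hypothetical names.

```python
import numpy as np

def compress(sparse):
    """Return (1-based row-major position numbers, values) of non-zero elements."""
    flat = sparse.ravel()
    idx = np.flatnonzero(flat)      # 0-based positions of non-zero entries
    return idx + 1, flat[idx]

def decompress(positions, values, shape):
    """Rebuild the sparse tensor from position numbers and values."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[positions - 1] = values
    return flat.reshape(shape)

# The example from the text: value 5 in the 3rd row and 5th column of a
# 150 x 1000 tensor is described by the tuple (2005, 5).
tensor = np.zeros((150, 1000))
tensor[2, 4] = 5.0                  # 0-based indices for row 3, column 5
positions, values = compress(tensor)
print(list(zip(positions.tolist(), values.tolist())))
```

The number of transmitted values is twice the number of valid elements, which for 15,000 valid elements gives the 30,000 values mentioned above.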
[0058] On the other hand, in step 304, the label member uses the fusion module in the global model to fuse the sparse tensors sent by the training members who hold feature data based on the corresponding fusion weights to obtain the fused tensor.
[0059] It can be understood that each feature member can use its local model and local feature data to perform the corresponding operations according to steps 301 to 303, thereby providing the label member with the sparse tensor corresponding to its local intermediate tensor. The label member can first use the fusion module to fuse these sparse tensors.
[0060] This specification employs an attention mechanism during the fusion of various sparse tensors, assigning different fusion weights to each sparse tensor. These fusion weights are adjusted as undetermined parameters during model updates.
[0061] In one embodiment, the sparse tensors can be merged into a fused tensor by processing them according to their respective fusion weights (e.g., performing multiplication), then summing or concatenating them. For example, w1, w2, ... are the respective fusion weights, and the fused tensor can be H = w1H1 + w2H2 + ...
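The weighted-sum variant H = w1H1 + w2H2 + ... can be sketched as follows, assuming all sparse tensors have the same shape and are NumPy arrays; `fuse_weighted_sum` is a hypothetical name and the numeric weights are arbitrary illustration values.

```python
import numpy as np

def fuse_weighted_sum(sparse_tensors, weights):
    """Return sum_i w_i * H_i for equally shaped sparse tensors H_i."""
    stacked = np.stack(sparse_tensors).astype(float)  # shape (members, ...)
    w = np.asarray(weights, dtype=float)              # one weight per member
    return np.tensordot(w, stacked, axes=1)           # contract the member axis

h1 = np.array([[3.0, 0.0], [0.0, 1.0]])
h2 = np.array([[0.0, 2.0], [0.0, 4.0]])
print(fuse_weighted_sum([h1, h2], [0.5, 2.0]))
```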
[0062] In another embodiment, the state tensors describing the position information of the valid elements in the sparse tensors can be weighted according to the fusion weights, and the fusion weight applied to each element position can be determined from the corresponding element of the resulting weighted tensor (e.g., when each sparse tensor is a two-dimensional tensor, the state tensors are state matrices and the weighted tensor is a weighted matrix). Specifically, the label member can first determine the corresponding state tensor for each sparse tensor, where, in the state tensor, each position corresponding to a valid element of the sparse tensor holds a predetermined non-zero value and all other positions hold 0. For example, if a sparse tensor is (3, 2, 0), the corresponding state tensor can be (1, 1, 0). The state tensors corresponding to the sparse tensors are denoted as S1, S2, ... Then, the label member can weight the state tensors according to the fusion weights. Assuming that the fusion weights are w1, w2, ..., the weighted tensor is S = w1 S1 + w2 S2 + ... It can be seen that each element of the weighted tensor S is the sum of the fusion weights of those sparse tensors that retain a valid element at the corresponding position. For example, if the sparse tensors that retain a valid element in the first row and first column are those of the first and fourth feature members, with fusion weights w1 and w4 respectively, then the element in the first row and first column of the weighted tensor is w1 + w4. The label member then fuses the sparse tensors based on the weighted tensor, for example by dividing the sum of the sparse tensors element-wise by the weighted tensor to obtain the fused tensor, denoted as: H = (∑_i H_i) / S. Here, tensor division is element-wise, meaning that elements at corresponding positions are divided. In other words, each element in the weighted tensor describes the importance of the corresponding element in the sum tensor of the sparse tensors, and for a single position, the importance of the corresponding element in the sum tensor is negatively correlated with the value of the corresponding element in the weighted tensor.
[0063] In particular, it may happen that none of the sparse tensors retains a valid element at a given position. In this case, the element of the weighted tensor S at that position is 0, so the element-wise division would divide by 0, which may lead to a calculation error. Therefore, for each element of the weighted tensor S with a value of 0, the corresponding element in H can be set to a preset value (such as 0) or a preset flag (such as the error flag NON), etc., without any restriction here.
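The state-tensor fusion, including the handling of positions where the weighted tensor is 0, can be sketched as follows. This is an illustration under stated assumptions: the predetermined non-zero value is taken to be 1, valid elements are recognized as non-zero entries, and the preset value for zero divisors is the parameter `fill_value`; `fuse_with_state` is a hypothetical name.

```python
import numpy as np

def fuse_with_state(sparse_tensors, weights, fill_value=0.0):
    """Return (H, S) with S = sum_i w_i * S_i and H = (sum_i H_i) / S element-wise.

    Positions where S is 0 receive fill_value instead of a division result.
    """
    stacked = np.stack(sparse_tensors).astype(float)
    states = (stacked != 0).astype(float)        # S_i: 1 at valid positions, else 0
    w = np.asarray(weights, dtype=float)
    weighted = np.tensordot(w, states, axes=1)   # S
    total = stacked.sum(axis=0)                  # sum_i H_i
    fused = np.full(total.shape, fill_value, dtype=float)
    # Divide only where the divisor is non-zero; other positions keep fill_value.
    np.divide(total, weighted, out=fused, where=weighted != 0)
    return fused, weighted

h1 = np.array([3.0, 2.0, 0.0])   # state tensor (1, 1, 0), as in the text
h2 = np.array([1.0, 0.0, 0.0])   # state tensor (1, 0, 0)
fused, weighted = fuse_with_state([h1, h2], [0.5, 1.5])
print(weighted, fused)
```

In this example the third position is valid in neither tensor, so the weighted tensor is 0 there and the fused tensor takes the preset value.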
[0064] In other embodiments, the label member can fuse the sparse tensors in other reasonable ways, which will not be listed here.
[0065] Subsequently, according to step 305, the label member processes the fused tensor through the prediction module in the global model to obtain the prediction results for the training samples of the current batch, and then determines the model loss based on the comparison between the prediction results and the label data.
[0066] The prediction module can be implemented using various machine learning models, such as decision trees and fully connected neural networks, that can output category prediction values or multi-category prediction vectors. By calling the prediction module, the label member can predict the corresponding category for each training sample in the current batch as the prediction result. For example, in a business scenario of predicting user trustworthiness, the prediction result is the probability that a user is trustworthy or not. Correspondingly, for each training sample in the current batch, the sample label can indicate whether the user is trustworthy, or give the label values in the trustworthy and non-trustworthy dimensions (e.g., a trustworthy user is labeled as 1 in the trustworthy dimension and 0 in the non-trustworthy dimension, and vice versa). The label member can then determine the model loss based on the comparison between the prediction results and the label data. The model loss can be determined using various reasonable loss functions such as difference, variance, Euclidean distance, cosine similarity, and cross-entropy, which will not be elaborated upon here.
[0067] Further, in step 306, the label member determines the global gradient data of each undetermined parameter in the global model, as well as the intermediate gradient data corresponding to each intermediate tensor, based on the model loss.
[0068] Here, both global gradient data and intermediate gradient data are gradient data. The term "global" is used to correspond to "global model," and "intermediate" is used to correspond to "intermediate tensor." These names do not impose any other substantial constraints on the corresponding gradient data. In practice, they can also be named in other ways without affecting their substantive meaning.
[0069] It can be understood that the gradient of a parameter is the partial derivative of the model loss with respect to the parameter. In determining the gradient of a parameter, the input value can be considered a fixed value. For example, in a model y = wx, when determining the gradient of the parameter w, x can be considered a fixed value. The gradient of w is the product of the partial derivative of the model loss with respect to y and the partial derivative of y with respect to w (i.e., x). Furthermore, gradient data is back-propagated between the layers of a multi-layer neural network. For example, if the first layer is y1 = w_a x and the second layer is y2 = w_b y1, then the gradient of w_a is the product of the partial derivative of the model loss Loss with respect to y1 and the partial derivative of y1 with respect to w_a (i.e., x). The partial derivative of Loss with respect to y1 can in turn be determined as the product of the partial derivative of Loss with respect to y2 and the partial derivative of y2 with respect to y1.
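The two-layer chain rule can be checked numerically with a short sketch. The squared-error loss Loss = 0.5 (y2 - target)^2 is an assumption made only for this illustration, as are the function names and the numeric values.

```python
def loss(w_a, w_b, x, target):
    """Two scalar layers y1 = w_a * x, y2 = w_b * y1 with an assumed squared-error loss."""
    y1 = w_a * x
    y2 = w_b * y1
    return 0.5 * (y2 - target) ** 2

def grad_w_a(w_a, w_b, x, target):
    """Gradient of the loss with respect to w_a via back-propagation."""
    y1 = w_a * x
    y2 = w_b * y1
    d_loss_d_y2 = y2 - target          # derivative of the assumed loss
    d_loss_d_y1 = d_loss_d_y2 * w_b    # dy2/dy1 = w_b
    return d_loss_d_y1 * x             # dy1/dw_a = x, the input held fixed

# Compare against a central finite difference.
w_a, w_b, x, target, eps = 0.5, 2.0, 3.0, 1.0, 1e-6
numeric = (loss(w_a + eps, w_b, x, target) - loss(w_a - eps, w_b, x, target)) / (2 * eps)
print(grad_w_a(w_a, w_b, x, target), numeric)
```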
[0070] Thus, on the one hand, the label members can locally use the model loss to determine the corresponding gradients for each undetermined parameter in the global model (including the prediction module and the fusion module), including the gradients of each fusion weight. These gradient data can be used by the label members to update each undetermined parameter in the global model based on gradient descent or similar methods. For example, if the gradient of the undetermined parameter w is δ, it can be updated with a step size λ as: w = w - λδ.
[0071] Thus, in step 3071, the label member can update the corresponding undetermined parameters using the global gradient data of each undetermined parameter in the global model. During this update, the fusion weights in the fusion module are updated, thereby adjusting the importance (or attention) assigned to each sparse tensor.
[0072] On the other hand, the label member can determine the corresponding gradient data for each intermediate tensor. The determination method is similar to that used to determine the partial derivative of y1 previously. The gradient data determined for the first intermediate tensor can be denoted as the first gradient data. Through step 3072, the label member can feed back the first gradient data to the first member.
[0073] Then, in step 308, the first member updates the undetermined parameters in the first local model using the first gradient data. The update method can be gradient descent, Newton's method, etc., which will not be elaborated here.
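The feature-member update in step 308 can be sketched for the simplest case. The sketch assumes a linear local model M = X W, plain gradient descent, and that the first gradient data arrives as a dense array with the same shape as the intermediate tensor; none of these choices is prescribed by the text, and `local_update` is a hypothetical name.

```python
import numpy as np

def local_update(w_local, x_batch, grad_intermediate, lr=0.1):
    """One gradient-descent step for an assumed linear local model M = X @ W.

    grad_intermediate is dLoss/dM as fed back by the label member.
    """
    grad_w = x_batch.T @ grad_intermediate   # chain rule: dLoss/dW = X^T (dLoss/dM)
    return w_local - lr * grad_w             # w = w - lambda * delta

x = np.array([[1.0, 2.0]])                   # one sample with two local features
w = np.zeros((2, 1))                         # local parameters
g = np.array([[1.0]])                        # gradient fed back for this sample
print(local_update(w, x, g))
```

The feature member never needs the label data or the model loss itself; the fed-back gradient is sufficient to continue back-propagation locally.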
[0074] Correspondingly, the label member can also feed back the gradient data corresponding to other intermediate tensors to other relevant feature members, so that the other feature members can update the undetermined parameters of their local models.
[0075] According to the process illustrated in Figure 3, the training members can jointly complete one parameter update cycle of the model. Through iteration over multiple parameter update cycles, the model performance can be stabilized, for example with the undetermined parameters tending to converge, the gradient data tending to converge, or the accuracy reaching a predetermined value, thereby completing the joint training of the model.
[0076] The above, in conjunction with the schematic diagram of Figure 3, describes the model update process of one embodiment of this specification from the perspective of the interaction between the feature members and the label member. As can be seen from the above process, the technical concept provided in this specification introduces an attention mechanism when the label member aggregates the sparse tensors from the feature members. This mechanism assigns different levels of attention to the valid elements transmitted by each feature member, thereby enabling more effective fusion of the sparse data from each feature member and improving the accuracy of the fusion result of the sparse data relative to the fusion result of the intermediate tensors. Thus, while reducing data communication, it ensures model accuracy and improves the training efficiency of the longitudinal federated learning process.
[0077] Figure 4 illustrates a process of jointly training a model according to one embodiment, corresponding to the operations executed by the label member within a single parameter update cycle of the flow shown in Figure 3. As shown in Figure 4, the process includes:
[0078] Step 401: Receive each sparse tensor from each feature member;
[0079] Here, a single sparse tensor is obtained by sparsifying the local intermediate tensor of a single feature member. The local intermediate tensor of a single feature member is an intermediate result obtained by processing the local feature data of the training samples in the current batch using its local model.
[0080] Step 402: Use the fusion module to fuse each sparse tensor based on the corresponding fusion weights to obtain the fused tensor;
[0081] Here, each fusion weight is an undetermined parameter in the global model. In one embodiment, each fusion weight can be used as the weighting coefficient of the corresponding sparse tensor, and the sparse tensors are weighted and summed to obtain the fused tensor. In another embodiment, a state matrix can be determined from each sparse tensor. Then, the summed tensor obtained by summing the sparse tensors is divided element-wise by the weighted tensor obtained by summing the state matrices according to the fusion weights, resulting in the fused tensor. Each individual state matrix describes the position information of the valid elements of the corresponding sparse tensor. Valid elements are those retained during the sparsification of the intermediate tensor, such as the element in the first row and first column. The positions of valid elements in the state matrix are described by predetermined non-zero values, while other positions are 0. Further, denoting any element in the weighted tensor as the first element, if the first element is 0, a predetermined element value (e.g., 0) or a predetermined flag value (e.g., NON) can be used as the result of the division operation at the position corresponding to the first element in the fused tensor.
[0082] Step 403: The prediction module processes the fusion tensor to obtain the prediction result for the training samples of the current batch, and then determines the model loss based on the comparison between the prediction result and the label data.
[0083] Step 404: Determine the global gradient data of each undetermined parameter in the global model and the intermediate gradient data corresponding to each intermediate tensor based on the model loss.
[0084] Each intermediate gradient data can be provided to the corresponding feature member, which can then use it to update the corresponding local model.
[0085] Step 405: Update the undetermined parameters in the global model using the global gradient data. This includes updating the fusion weights, which describe the degree of attention paid to each sparse tensor during the fusion process.
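Steps 401 to 405 can be put together in one sketch of a single parameter update cycle at the label member. Everything beyond the step structure is an assumption made for illustration: weighted-sum fusion, a logistic-regression prediction module with parameter vector `pred_v`, a binary cross-entropy loss, plain gradient descent, and dense intermediate gradients. `label_member_cycle` is a hypothetical name.

```python
import numpy as np

def label_member_cycle(sparse_tensors, labels, fusion_w, pred_v, lr=0.1):
    """One update cycle: fuse, predict, compute loss and gradients, update."""
    hs = np.stack(sparse_tensors).astype(float)        # step 401: (members, batch, d)
    fused = np.tensordot(fusion_w, hs, axes=1)         # step 402: H = sum_i w_i H_i
    probs = 1.0 / (1.0 + np.exp(-(fused @ pred_v)))    # step 403: prediction
    eps = 1e-12                                        # guards log(0)
    loss = -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))
    d_logits = (probs - labels) / labels.size          # step 404: dLoss/dlogits
    grad_v = fused.T @ d_logits                        # gradient of prediction parameters
    d_fused = np.outer(d_logits, pred_v)               # dLoss/dH
    grad_w = np.array([np.sum(d_fused * h) for h in hs])  # gradient of each fusion weight
    inter_grads = [w * d_fused for w in fusion_w]      # dLoss/dH_i, fed back to member i
    new_w = fusion_w - lr * grad_w                     # step 405: update global model
    new_v = pred_v - lr * grad_v
    return loss, new_w, new_v, inter_grads

h1 = np.array([[1.0, 0.0], [0.0, 2.0]])
h2 = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 0.0])
w, v = np.array([0.5, 0.5]), np.array([0.1, -0.1])
loss1, w, v, grads = label_member_cycle([h1, h2], y, w, v)
loss2, _, _, _ = label_member_cycle([h1, h2], y, w, v)
print(loss1, loss2)
```

Because the fusion weights receive their own gradients, repeated cycles shift weight toward the members whose valid elements reduce the loss most, which is the attention effect the text describes.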
[0086] Reviewing the above process, in the split learning jointly conducted by multiple training members, each feature member transmits only a small portion (e.g., 10%) of its intermediate result to the label member as valid data, thereby reducing the complexity of the communication data. In particular, when the sparse tensor is compressed for transmission, the amount of data transmitted is significantly reduced. The label member, through an attention mechanism based on fusion weights, mines the importance of the valid elements transmitted by each feature member and gives them different levels of attention. This allows for more effective fusion of the sparse tensors, improving model accuracy. Experiments have shown that, in a training process where the proportion of valid elements transmitted by a single feature member is around 10%, a model achieving the required accuracy can be obtained with higher efficiency.
[0087] According to another embodiment, an apparatus for jointly updating a model is also provided. This apparatus may be located at the label member in the process where multiple training members in a longitudinal federated learning architecture jointly update undetermined parameters in the model using their local privacy data. Figure 5 shows an apparatus 500 for jointly updating a model, located at the label member, according to one embodiment. The apparatus 500 may include a receiving unit 501, a fusion unit 502, a prediction unit 503, a gradient determination unit 504, and an update unit 505.
[0088] During a parameter update cycle of the model, the receiving unit 501 is configured to receive sparse tensors from each feature member, wherein a single sparse tensor is obtained by sparsifying the local intermediate tensor of a single feature member, and the local intermediate tensor of a single feature member is an intermediate result obtained by processing the feature data of the current batch of training samples using its local model; the fusion unit 502 is configured to use the fusion module to fuse the sparse tensors based on the corresponding fusion weights to obtain a fused tensor, wherein each fusion weight is an undetermined parameter in the global model; the prediction unit 503 is configured to process the fused tensor through the prediction module to obtain the prediction result for the current batch of training samples, and then determine the model loss based on the comparison between the prediction result and the label data; the gradient determination unit 504 is configured to determine, according to the model loss, the global gradient data of each undetermined parameter in the global model and the intermediate gradient data corresponding to each intermediate tensor, the intermediate gradient data being provided to the corresponding feature members to update the corresponding local models; the update unit 505 is configured to update each undetermined parameter in the global model using the global gradient data.
[0089] In one embodiment, the fusion unit 502 is further configured to: determine a state matrix from each sparse tensor, where a single state matrix describes the position information of the valid elements of the corresponding sparse tensor, the valid elements are the elements retained when the intermediate tensor is sparsified, and the positions of the valid elements in the state matrix are described by predetermined non-zero values, while other positions are 0; and divide, element-wise, the summed tensor obtained by summing the sparse tensors by the weighted tensor obtained by a weighted summation of the state matrices according to the fusion weights, to obtain the fused tensor. Optionally, denoting any element in the weighted tensor as the first element, if the first element is 0, a predetermined element value or a predetermined marker value is used as the result of the division operation at the position corresponding to the first element in the fused tensor.
[0090] In an optional implementation, the at least one feature member includes a first member, and the single sparse tensor corresponding to the first member is a first sparse tensor. The first sparse tensor is obtained by retaining K elements from a first intermediate tensor as valid elements and setting the other elements to zero. The K elements are either the largest among all elements of the first intermediate tensor or randomly selected from the first intermediate tensor. According to one embodiment, K is determined by one of the following methods: pre-setting; determining according to a predetermined compression ratio α; or determining based on a compression ratio α_t that is negatively correlated with the current cycle number t.
[0091] In one embodiment, each sparse tensor is described by the following compressed format: including only the valid elements, and each valid element corresponds to its position information and value in the corresponding intermediate tensor.
[0092] It is worth noting that the apparatus 500 shown in Figure 5 corresponds to the method embodiment shown in Figure 4 and can be applied to the label member in the interactive flow shown in Figure 3, where it works in conjunction with the feature members to complete the parameter update process of Figure 3. Therefore, the descriptions of the label member given in connection with Figure 3 and the descriptions of the method embodiment given in connection with Figure 4 also apply to the apparatus 500 shown in Figure 5 and will not be repeated here.
[0093] According to another embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method described in conjunction with Figure 4.
[0094] According to another embodiment, a computing device is also provided, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, it implements the method described in conjunction with Figure 4.
[0095] Those skilled in the art will recognize that the functions described in the embodiments of this specification in one or more of the above examples can be implemented using hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
[0096] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the technical concept in this specification. It should be understood that the above description is only a specific embodiment of the technical concept in this specification and is not intended to limit the scope of protection of the technical concept in this specification. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solutions of the embodiments in this specification should be included within the scope of protection of the technical concept in this specification.
Claims
1. A method for jointly training a model, applicable to a process in which multiple training members in a longitudinal federated learning architecture jointly update the undetermined parameters in the model using their local privacy data, wherein the plurality of training members include at least one feature member and a label member holding label data; the model includes local models corresponding to the respective feature members and a global model corresponding to the label member; and the global model includes a fusion module and a prediction module. The method is executed by the label member during the current parameter update cycle of the model, and includes: receiving each sparse tensor from each feature member, wherein a single sparse tensor is obtained by a single feature member sparsifying its local intermediate tensor, and the local intermediate tensor of a single feature member is an intermediate result obtained by locally processing the feature data of the current batch of training samples using its local model; using the fusion module to fuse the sparse tensors based on their corresponding fusion weights to obtain a fused tensor, wherein each fusion weight is an undetermined parameter in the global model; processing the fused tensor through the prediction module to obtain the prediction result for the current batch of training samples, and then determining the model loss based on the comparison between the prediction result and the label data; determining, based on the model loss, the global gradient data of each undetermined parameter in the global model and the intermediate gradient data corresponding to each intermediate tensor, wherein each intermediate gradient data is provided to the corresponding feature member to update the corresponding local model; and updating the undetermined parameters in the global model using the global gradient data.
2. The method according to claim 1, wherein using the fusion module to fuse the sparse tensors based on their corresponding fusion weights to obtain the fused tensor includes: determining a state matrix from each sparse tensor, wherein a single state matrix describes the position information of the valid elements of the corresponding sparse tensor, the valid elements are the elements retained when the intermediate tensor is sparsified, and the positions of the valid elements in the state matrix are described by a predetermined non-zero value, while other positions are 0; and dividing, element-wise, the summed tensor obtained by summing the sparse tensors by the weighted tensor obtained by summing the state matrices according to the fusion weights, to obtain the fused tensor.
3. The method according to claim 2, wherein, The weighted tensor includes a first element. When the first element is 0, a predetermined element value or a predetermined marker value is used as the result of the division operation at the position corresponding to the first element in the fused tensor.
4. The method according to claim 1, wherein, The at least one feature member includes a first member, and the single sparse tensor corresponding to the first member is a first sparse tensor. The first sparse tensor is obtained by retaining K elements in the first intermediate tensor as valid elements and setting the other elements to zero.
5. The method according to claim 4, wherein, The K elements are either the largest in the first intermediate tensor or randomly selected from the first intermediate tensor.
6. The method according to claim 4, wherein K is determined by one of the following methods: presetting; determining according to a predetermined compression ratio α; or determining based on a compression ratio α_t that is negatively correlated with the current cycle number t.
7. The method according to claim 1, wherein, Each sparse tensor is described by the following compressed format: it includes only the valid elements, and each valid element corresponds to its position information and value in the corresponding intermediate tensor.
8. An apparatus for jointly training a model, suitable for a process in which multiple training members under a longitudinal federated learning architecture jointly update undetermined parameters in the model using their local privacy data, wherein the plurality of training members include at least one feature member and a label member holding label data; the model includes local models corresponding to the respective feature members and a global model corresponding to the label member; and the global model includes a fusion module and a prediction module. The apparatus is located at the label member and includes a receiving unit, a fusion unit, a prediction unit, a gradient determination unit, and an update unit. In the current parameter update cycle of the model: the receiving unit is configured to receive each sparse tensor from each feature member, wherein a single sparse tensor is obtained by sparsifying the local intermediate tensor of a single feature member, and the local intermediate tensor of a single feature member is an intermediate result obtained by locally processing the feature data of the current batch of training samples using its local model; the fusion unit is configured to use the fusion module to fuse the sparse tensors based on the corresponding fusion weights to obtain a fused tensor, wherein each fusion weight is an undetermined parameter in the global model; the prediction unit is configured to process the fused tensor through the prediction module to obtain the prediction result for the training samples of the current batch, and then determine the model loss based on the comparison between the prediction result and the label data; the gradient determination unit is configured to determine, based on the model loss, the global gradient data of each undetermined parameter in the global model and the intermediate gradient data corresponding to each intermediate tensor, wherein each intermediate gradient data is provided to the corresponding feature member to update the corresponding local model; and the update unit is configured to update each undetermined parameter in the global model using the global gradient data.
9. A computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of claims 1-7.
10. A computing device, comprising a memory and a processor, characterized in that, The memory stores executable code, and when the processor executes the executable code, it implements the method of any one of claims 1-7.