Compilation methods, devices, electronic devices, and storage media
By generating judgment statements that identify the executor in vertical federated learning, the source code compilation process is simplified, the problem of complex code in traditional vertical federated learning is solved, and efficient code development and maintenance are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WEBANK (CHINA)
- Filing Date
- 2020-11-13
- Publication Date
- 2026-06-30
AI Technical Summary
Traditional vertical federated learning code development is complex, resulting in high machine learning development and maintenance costs, poor user experience, and difficulty in meeting data privacy protection and policy and regulatory requirements.
By generating conditional statements that identify the executor, the corresponding code segments are identified and executed, simplifying the source code compilation process of vertical federated learning and generating unified code applicable to multiple participants.
It reduces the complexity and maintenance difficulty of source code development, improves the efficiency of code development and maintenance, and meets the requirements of data privacy protection and policies and regulations.
Figure CN114489655B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing, and more particularly to a compilation method, apparatus, electronic device, and storage medium.
Background Technology
[0002] With the continuous development of technology, artificial intelligence has been widely applied, and the value of data has become increasingly prominent. However, with the growing emphasis on data privacy and the release of corresponding policies and regulations in recent years, the traditional approach of acquiring raw data and performing machine learning modeling is no longer convenient or feasible. People have begun to turn to methods such as federated machine learning or multi-party secure computation for modeling.
[0003] Federated machine learning or multi-party secure computation involves information transmission and scheduling among multiple participants, typically requiring explicit encryption of information transmission during code development. This necessitates different code logic implementations for each participant with a different role, leading to code complexity and making development and maintenance difficult. It also increases the development and learning costs of the corresponding machine learning methods, resulting in a poor user experience.
Summary of the Invention
[0004] The main objective of this invention is to provide a compilation method, apparatus, electronic device, and storage medium, which aims to improve the compilation process to simplify source code development and thus increase the efficiency of source code development and maintenance.
[0005] To achieve the above objectives, the present invention provides a compilation method applied to the first participant among multiple participants in vertical federated learning, the method comprising:
[0006] Obtain the source code for vertical federated learning; the source code includes multiple code segments for implementing vertical federated learning and an executor identifier corresponding to each code segment, the executor identifier being used to indicate at least one participant executing the code segment;
[0007] For each code segment, a corresponding judgment statement is generated based on the executor identifier corresponding to the code segment. The judgment statement is used to determine whether the first participant belongs to the executor of the code segment, so that when the judgment statement is executed, the judgment result of the judgment statement determines whether to execute the code segment.
[0008] Based on each code segment and the judgment statement corresponding to each code segment, the compiled code is generated.
[0009] Optionally, before generating the corresponding judgment statement, the method further includes:
[0010] Determine the data merging function in the source code; the data merging function is used to indicate the merging method of the unilateral data provided by each participant in the vertical federated learning;
[0011] The source code is parsed according to the data merging function to generate a computation graph; the computation graph is used to represent the data processing logic of the unilateral data provided by each participant.
[0012] The step of generating a corresponding judgment statement based on the executor identifier corresponding to the code segment includes:
[0013] Based on the executor identifier corresponding to the code segment and the data processing logic represented by the computation graph, a corresponding judgment statement is generated.
[0014] Optionally, the method further includes:
[0015] Based on the computation graph, determine the node where the unilateral data provided by each participant is located;
[0016] Add an encryption node after the node containing the unilateral data provided by each participant;
[0017] For the encryption node, generate an encryption statement to encrypt the unilateral data;
[0018] The step of generating compiled code based on each code segment and the judgment statement corresponding to each code segment includes:
[0019] Based on each code segment, the judgment statement corresponding to each code segment, and the encryption statements, the compiled code is generated.
[0020] Optionally, the method further includes:
[0021] Determine the data transmission function in the source code; the data transmission function is used to indicate the direction of transmission of model training data, which is calculated based on the unilateral data.
[0022] Add an encryption node before the node containing the data transmission function;
[0023] For the encryption node, generate an encryption statement to encrypt the model training data.
[0024] Optionally, generating the encryption statement for encrypting the model training data includes:
[0025] If the data transmission function corresponding to the encryption node points to the coordinator among the participants, then an encryption statement for encrypting the model training data is generated based on the coordinator's public key.
[0026] Optionally, the method further includes:
[0027] Based on the data transmission function, the parameters of the data transmission function, and the identifier of at least one participant corresponding to the code segment where the data transmission function is located, a corresponding data transmission statement is generated;
[0028] The step of generating compiled code based on each code segment and the judgment statement corresponding to each code segment includes:
[0029] Based on each code segment, the judgment statement corresponding to each code segment, and the data transmission statements, the compiled code is generated.
[0030] Optionally, the method further includes:
[0031] Identify the data decryption function in the source code;
[0032] Generate a decryption statement based on the private key of the coordinator among the participants;
[0033] The step of generating compiled code based on each code segment and the judgment statement corresponding to each code segment includes:
[0034] Based on each code segment, the judgment statement corresponding to each code segment, and the decryption statements, the compiled code is generated.
[0035] The present invention also provides a compilation apparatus, comprising:
[0036] An acquisition module is used to acquire the source code of vertical federated learning; the source code includes multiple code segments for implementing vertical federated learning and an executor identifier corresponding to each code segment, the executor identifier being used to indicate at least one participant executing the code segment;
[0037] The generation module is used to generate a corresponding judgment statement for each code segment based on the executor identifier corresponding to the code segment. The judgment statement is used to determine whether the first participant belongs to the executor of the code segment. When the judgment statement is executed, the module determines whether to execute the code segment based on the judgment result of the judgment statement. The module also generates compiled code based on each code segment and the judgment statement corresponding to each code segment.
[0038] The present invention also provides an electronic device, the electronic device comprising: a memory, a processor, and a compiler stored in the memory and executable on the processor, wherein the compiler, when executed by the processor, implements the steps of the compilation method as described above.
[0039] The present invention also provides a computer-readable storage medium storing a compiler, which, when executed by a processor, implements the steps of the compilation method described above.
[0040] This invention provides a compilation method, apparatus, electronic device, and storage medium. The compilation method can be applied to a first participant among multiple participants in a vertical federated learning process, where the first participant can be any party. The method includes: obtaining source code for the vertical federated learning process; the source code includes multiple code segments for implementing the vertical federated learning process and an executor identifier corresponding to each code segment, the executor identifier indicating at least one participant executing the code segment; for each code segment, generating a corresponding judgment statement based on the executor identifier corresponding to the code segment, the judgment statement determining whether the first participant is an executor of the code segment, so that when the judgment statement is executed, the result of the judgment statement determines whether the code segment should be executed; and generating compiled code based on each code segment and the corresponding judgment statement. Each participant can use this compilation method to compile the source code. During the compilation process, the executor identifier corresponding to each code segment in the source code is identified, and a corresponding judgment statement is generated based on the executor identifier, thereby obtaining the compiled code. When each participant executes the compiled code, upon reaching the code corresponding to a conditional statement, it can determine whether it belongs to the executor of the corresponding code segment and automatically execute its own code segment, thus cooperating to complete the vertical federated learning process. In other words, this compilation method is the underlying foundation for supporting source code simplification. 
Because the compilation method generates the corresponding judgment statements from the executor identifiers in the source code while compiling the source code into executable code, only the executor of each code segment needs to be marked in the source code. This simplifies the original separate source code for each of the multiple participants into a single set of source code applicable to all participants. Furthermore, since the executor of each code segment is at least one participant, identical operations performed by several parties can be integrated into the same code segment using the "executor identifier," further simplifying the source code. This reduces the complexity of source code development and the difficulty of later maintenance.
Attached Figure Description
[0041] Figure 1 is a schematic diagram illustrating an application scenario provided by the present invention;
[0042] Figure 2 is a flowchart of a compilation method provided in an embodiment of the present invention;
[0043] Figure 3 is a flowchart of a parsing method provided in an embodiment of the present invention;
[0044] Figure 4 is a schematic diagram of a generated computation graph provided in an embodiment of the present invention;
[0045] Figure 5 is a schematic diagram of a vertical federated learning process provided in an embodiment of the present invention;
[0046] Figure 6 is a schematic diagram of the structure of a compilation apparatus provided in an embodiment of the present invention;
[0047] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
[0048] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0049] Exemplary embodiments of the present disclosure will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
[0050] In many fields, technologies such as machine learning and model training based on big data are constantly evolving. Mining big data can yield a wealth of valuable information. With technological advancements, the sources of raw data are becoming increasingly diverse, even involving cross-domain collaborations. However, in recent years, there has been a growing emphasis on data privacy, along with the issuance of corresponding policies and regulations. This has rendered the traditional approach of directly acquiring raw data from multiple sources for machine learning modeling inconvenient and impractical. Federated machine learning and multi-party secure computation have emerged as solutions.
[0051] Federated machine learning, or multi-party secure computation, typically involves encrypted information transmission between multiple participants. Federated learning is an emerging foundational artificial intelligence technology designed to enable efficient machine learning across multiple participants or computing nodes while ensuring information security during big data exchange, protecting endpoint and personal data privacy, and guaranteeing legal compliance. The machine learning algorithms used in federated learning are not limited to neural networks but also include important algorithms such as random forests.
[0052] Federated learning is divided into horizontal federated learning, vertical federated learning, and federated transfer learning, depending on the dataset.
[0053] Vertical federated learning refers to a model training process where at least two sample datasets have significant user overlap but limited user feature overlap. Therefore, the datasets are split vertically (i.e., along the feature dimension), and the portion of data where users are the same but user features are not entirely identical is used for training. For example, consider two different institutions: a local bank and a local e-commerce platform. Their user bases likely encompass a large portion of the local population, resulting in significant user overlap. However, since the bank records user spending and credit ratings, while the e-commerce platform retains browsing and purchase history, their user feature overlap is relatively small. Vertical federated learning aggregates the different features of the same users from both datasets in an encrypted state, using these as training samples to enhance the model's capabilities.
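The bank and e-commerce example above can be sketched as a join on shared users. This is a minimal illustration of the data shape only: the user IDs, feature names, and the `vertical_join` helper are hypothetical, and the join is done in plaintext here, whereas in actual vertical federated learning the alignment and aggregation take place in an encrypted state and raw features do not leave their owner.

```python
# Hypothetical feature tables held by two institutions (illustration only).
bank = {
    "u1": {"credit": 710},
    "u2": {"credit": 640},
    "u3": {"credit": 580},
}
shop = {
    "u2": {"purchases": 12},
    "u3": {"purchases": 3},
    "u4": {"purchases": 7},
}


def vertical_join(a, b):
    """Keep only users present in both tables and combine their features."""
    shared_users = sorted(a.keys() & b.keys())
    return {uid: {**a[uid], **b[uid]} for uid in shared_users}


samples = vertical_join(bank, shop)
for uid, features in samples.items():
    print(uid, features)
```

Only the overlapping users remain, and each training sample carries features from both parties, which is the sense in which the data is split along the feature dimension.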
[0054] In the process of vertical federated learning, the first step is to use encrypted user sample alignment technology to align the encrypted sample data provided by each data provider for model training. The model training process then requires encrypted training with the help of a third party. Specifically, the third party distributes its public key to each data provider to encrypt the data exchanged during training. This data exchange mainly involves data providers exchanging intermediate results for gradient calculation in encrypted form; data providers with labeled data calculating the loss function based on their labeled data and transmitting the encrypted result to the third party; and the third party decrypting the data, updating the model weights, and then sending them back to each data provider in encrypted form.
[0055] Because it involves multi-party collaboration, and the execution operations of each party are not entirely the same, implementing the above-mentioned vertical federated learning process generally requires developing corresponding source code for each role participating in the federated learning. This results in extremely complex and numerous source codes used to complete federated learning, leading to high machine learning costs, a poor user development experience, and significant difficulties in later maintenance and modification of the source code.
[0056] To address the aforementioned problems, this invention proposes a compilation method that identifies executor identifiers within the source code during compilation and generates corresponding judgment statements based on these identifiers. When the compiled code is run, each participating party executes these judgment statements and thereby executes only its own code segments, enabling all parties to collaboratively complete the federated learning process. Based on this compilation method, the code used for vertical federated learning can be written within the same source code during development, eliminating the need to develop separate source code for different participating parties. This significantly reduces the complexity of source code development.
[0057] Figure 1 is a schematic diagram illustrating an application scenario provided by the present invention. As shown in Figure 1, users write source code for vertical logistic regression on their client according to their business needs, and then send the source code to the participants in the vertical logistic regression (the diagram shows three participants as an example; the specific number of participants depends on the actual scenario). Each participant compiles the source code, generates executable machine code, and runs it. During execution, each participant executes the code segments corresponding to its own role, and all participants cooperate to carry out the training and/or prediction process of vertical logistic regression, thereby updating the parameters of the vertical federated learning model.
[0058] Figure 2 is a flowchart illustrating a compilation method according to an embodiment of the present invention. The execution entity of the compilation method provided in this embodiment is any one of the multiple participants in vertical federated learning, referred to as the first participant. As shown in Figure 2, the method in this embodiment may include:
[0059] S201. Obtain the source code for vertical federated learning; the source code includes multiple code segments for implementing vertical federated learning and an executor identifier corresponding to each code segment. The executor identifier is used to indicate at least one participant executing the code segment.
[0060] In this invention, a code segment refers to a piece of code delimited by executor identifiers. Because code may be nested at multiple levels, different code segments may also be nested within one another.
[0061] The executor identifier may include the device identifier or role identifier of the participating party.
[0062] In vertical federated learning, the participants are the parties involved in the learning process, and there can be several of them, for example providers of raw training data and providers of model parameters. Each participant may also have one or more device nodes involved in the training. However, the participants essentially take one of three roles: the provider of feature and target variables (guest, abbreviated as G), the provider of feature variables (host, abbreviated as H), and the coordinator (arbiter, abbreviated as A). Because a participant's role in vertical federated learning determines the operations it needs to perform, it is preferable to use role identifiers as executor identifiers during source code development.
[0063] For example, a role condition function `role()` can be defined as the executor identifier. The parameter of this function is the role (G, H, A) of the participants in the vertical federated learning. In this example, if the executor identifier corresponding to a certain code segment is `role(G)`, then the executor of this code segment is G.
[0064] S202. For each code segment, generate a corresponding judgment statement based on the executor identifier corresponding to the code segment. The judgment statement is used to determine whether the first participant belongs to the executor of the code segment. When the judgment statement is executed, the result of the judgment statement is used to determine whether the code segment should be executed.
[0065] The subject of this method is the first participant in the vertical federated learning process, and the first participant can be any party in the federated learning process.
[0066] The source code only contains executor identifiers; the executing device itself cannot automatically identify which code segments it needs to execute. Therefore, during the compilation process, a corresponding judgment statement needs to be generated for each code segment based on its executor identifier. The executing device then evaluates this statement when running the compiled code to determine whether to execute the corresponding code segment.
[0067] Taking the role condition function role(G, H) as an example, the generated judgment statement should roughly mean: "Is the first participant G or H? If so, execute the corresponding code segment; otherwise, skip to the next statement."
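One possible way to picture this step is the following minimal sketch, which rewrites role-marked blocks into judgment statements using Python's standard `ast` module. It rests on several assumptions that the source does not confirm: that code segments are written as `with role(...):` blocks, that the local party's role is available in a variable named `SELF_ROLE`, and that the segment bodies simply append to a `log` list for demonstration.

```python
import ast

# Hypothetical unified source code: one file shared by all participants.
SOURCE = """
with role(G, H):
    log.append("load local features")
with role(G):
    log.append("compute loss from labels")
with role(A):
    log.append("update weights")
"""


class RoleToJudgment(ast.NodeTransformer):
    """Rewrite `with role(X, Y): body` into `if SELF_ROLE in ("X", "Y"): body`."""

    def visit_With(self, node):
        self.generic_visit(node)  # rewrite nested code segments first
        call = node.items[0].context_expr
        is_role_marker = (
            len(node.items) == 1
            and isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "role"
        )
        if not is_role_marker:
            return node
        roles = ast.Tuple(
            elts=[ast.Constant(value=arg.id) for arg in call.args],
            ctx=ast.Load(),
        )
        test = ast.Compare(
            left=ast.Name(id="SELF_ROLE", ctx=ast.Load()),
            ops=[ast.In()],
            comparators=[roles],
        )
        judgment = ast.If(test=test, body=node.body, orelse=[])
        return ast.copy_location(judgment, node)


def compile_source(source):
    """Produce a code object in which every segment is guarded by a judgment."""
    tree = RoleToJudgment().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return compile(tree, "<vfl-source>", "exec")


def run_as(role):
    """Run the same compiled code as the party holding the given role."""
    env = {"SELF_ROLE": role, "log": []}
    exec(compile_source(SOURCE), env)
    return env["log"]


for party in ("G", "H", "A"):
    print(party, run_as(party))
```

Every party compiles and runs the same source, and the judgment statements alone decide which segments each party actually executes.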
[0068] S203. Generate compiled code based on each code segment and the judgment statement corresponding to each code segment.
[0069] The source code can be a program written in an interpreted language, such as Python. Python is simple and easy to learn, but Python source code cannot be executed by the machine directly, so it needs to be compiled into a machine language that the machine can understand.
[0070] The compilation method provided in this embodiment can be applied to the first participant among multiple participants in vertical federated learning, where the first participant can be any party. The method includes: obtaining the source code for vertical federated learning; the source code includes multiple code segments for implementing vertical federated learning and an executor identifier corresponding to each code segment, the executor identifier indicating at least one participant executing the code segment; for each code segment, generating a corresponding judgment statement based on the executor identifier corresponding to the code segment, the judgment statement determining whether the first participant is an executor of the code segment, so that when executing the judgment statement, the result of the judgment statement determines whether to execute the code segment; and generating compiled code based on each code segment and the judgment statement corresponding to each code segment. Each participant can use this compilation method to compile the source code. During the compilation process, the executor identifier corresponding to each code segment in the source code is identified, and a corresponding judgment statement is generated based on the executor identifier, thereby obtaining the compiled code. When each participant executes the compiled code, upon reaching the code corresponding to the judgment statement, it can determine whether it belongs to the executor of the corresponding code segment, and then automatically execute its own code segment, working together to complete the vertical federated learning process. In other words, this compilation method is the underlying foundation for supporting source code simplification. Because the compilation method supports generating corresponding conditional statements based on the executor identifier in the source code, it compiles the source code into executable code. 
Therefore, only the executor of each code segment needs to be marked in the source code. This simplifies the original different source code applicable to multiple participants into a single set of source code applicable to all participants. Furthermore, since each code segment's executor is at least one participant, identical operations by all parties can be integrated into the same code segment using the "executor identifier," further simplifying the source code. This reduces the complexity of source code development and the difficulty of later maintenance.
[0071] In some embodiments, the source code may be parsed before generating the judgment statement corresponding to the executor identifier.
[0072] As shown in Figure 3, the method for parsing the source code may include:
[0073] S301. Determine the data merging function in the source code; the data merging function is used to indicate how to merge the unilateral data provided by each participant in the vertical federated learning.
[0074] Based on the above explanation of vertical federated learning, it is clear that both G and H, the participants in vertical federated learning, need to provide original sample data to participate in model training, and then update their own model parameters according to the training results. During model training and use, even when data transmission is involved, all parties' data is encrypted. Therefore, it can be considered that the data provided by a particular party always belongs solely to that party. Hence, in this invention, it is referred to as unilateral data.
[0075] In vertical federated learning, the participants providing sample data for model training fall into two categories: those providing only feature variables (role H) and those providing both feature and target variables (role G). Furthermore, the two types of sample data have significant user overlap but minimal feature variable overlap, necessitating data merging. The purpose of data merging is to extract and combine the data of both parties that belongs to the same users but covers different features.
[0076] Therefore, in practical implementation, a data merging function can be defined to indicate how the data from each party in vertical federated learning is merged. In this way, although the data from both parties G and H remain separate, they can be logically merged into the data required for model training, and treated as a whole in the source code. This eliminates the need to write separate code for each party's data, further simplifying the source code.
[0077] S302. Based on the data merging function, parse the source code to generate a computation graph; the computation graph is used to represent the data processing logic of the unilateral data provided by each participant.
[0078] The data merging function characterizes how the unilateral data is merged. When parsing based on the data merging function, the data produced by that function in the source code can be broken down into the original unilateral data. Thus, the generated computation graph can be used to represent the data processing logic of the unilateral data. Nodes in the computation graph represent data or their corresponding logical methods, and paths between nodes represent the logical relationships between them.
[0079] Figure 4 shows a computation graph generated by parsing the statement "w = vfed(w_g, w_h)" in the source code. Each node is represented as a tensor. "Op" represents an operator. Specifically, "TensorSymbolOP(w_g)" represents the operator for w_g, "TensorSymbolOP(w_h)" represents the operator for w_h, and "FedTensorCompositeOP" represents the operator for vfed. In addition, the computation graph can also contain nodes representing constants and methods. For example, "Literal" represents a constant, and "Op.method" represents a method possessed by an operator.
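As a rough illustration of how such a statement might be parsed into a graph, the sketch below uses Python's standard `ast` module. The operator names `TensorSymbolOP` and `FedTensorCompositeOP` and the function name `vfed` come from the description; the `Node` class, its fields, and the `build_graph` helper are assumptions made for this example and are not the patent's actual data structures.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                                      # operator name, e.g. "TensorSymbolOP"
    name: str                                    # symbol this node stands for
    inputs: list = field(default_factory=list)   # upstream nodes


def build_graph(statement):
    """Parse `target = vfed(a, b, ...)` into a small computation graph."""
    stmt = ast.parse(statement).body[0]
    if not isinstance(stmt, ast.Assign) or not isinstance(stmt.value, ast.Call):
        raise ValueError("expected an assignment from a function call")
    call = stmt.value
    if getattr(call.func, "id", None) != "vfed":
        raise ValueError("expected a call to the data merging function vfed")
    # Each argument of vfed is one party's unilateral data: a leaf node.
    leaves = [Node("TensorSymbolOP", arg.id) for arg in call.args]
    return Node("FedTensorCompositeOP", stmt.targets[0].id, leaves)


graph = build_graph("w = vfed(w_g, w_h)")
print(graph.op, graph.name, [leaf.name for leaf in graph.inputs])
```

The merged tensor `w` is a single node in the source code's view, while its inputs keep track of which unilateral data it was assembled from.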
[0080] After parsing and generating the computation graph, the executor identifier occupies a node in the computation graph. The aforementioned generation of corresponding judgment statements based on the executor identifier of the code segment can specifically include: generating corresponding judgment statements based on the executor identifier of the code segment and the data processing logic represented by the computation graph. Furthermore, adjustments can be made at the nodes corresponding to the executor identifiers based on the data processing logic represented by the computation graph; this may involve adding, deleting, or modifying corresponding nodes, or adding, deleting, or modifying the paths between corresponding nodes, forming judgment logic between nodes. Based on the newly generated judgment logic, corresponding judgment statements are generated at the code level.
[0081] In vertical federated learning, data encryption is required whenever data transmission is involved. Referring to the schematic diagram of vertical federated learning in Figure 5, the process primarily involves the transmission of two types of data: the transmission of unilateral data between G and H, and the transmission of model training data between G, H, and A. In this invention, model training data refers to parameters that indicate the model's training performance, such as the loss function and its convergence state. To reduce the complexity of the source code, the encryption steps can be omitted from it, and encryption statements can be added by finding the corresponding nodes during the compilation process.
[0082] Specifically, for the first scenario, the nodes containing the unilateral data provided by each participant can be determined based on the computation graph; an encryption node can be added after the nodes containing the unilateral data provided by each participant; and, for the encryption node, an encryption statement can be generated to encrypt the unilateral data.
[0083] Here, "after the node containing the unilateral data" refers to the logical order following the node containing the unilateral data. Generally, the data foundation for the model training process is the unilateral data, therefore the node containing the unilateral data is a leaf node.
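The first scenario can be sketched as a graph rewrite followed by statement generation. This is an illustrative sketch under stated assumptions: the `Node` class, the `EncryptOP` operator name, the `enc_` prefix, and the `encrypt(...)` call in the emitted statements are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str
    name: str
    inputs: list = field(default_factory=list)


def add_encryption_nodes(node):
    """Return a copy of the graph with an encryption node after every leaf."""
    if not node.inputs:  # leaf node: unilateral data provided by one party
        return Node("EncryptOP", f"enc_{node.name}", [node])
    return Node(node.op, node.name, [add_encryption_nodes(c) for c in node.inputs])


def encryption_statements(node):
    """Emit one encryption statement per encryption node, inputs first."""
    found = []
    for child in node.inputs:
        found.extend(encryption_statements(child))
    if node.op == "EncryptOP":
        found.append(f"{node.name} = encrypt({node.inputs[0].name})")
    return found


graph = Node(
    "FedTensorCompositeOP",
    "w",
    [Node("TensorSymbolOP", "w_g"), Node("TensorSymbolOP", "w_h")],
)
secured = add_encryption_nodes(graph)
for line in encryption_statements(secured):
    print(line)
```

Because the encryption nodes are inserted by the compiler, the source code author never writes an explicit encryption step for the unilateral data.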
[0084] For the second scenario, the data transmission function in the source code can be identified; an encryption node can be added before the node containing the data transmission function; and for the encryption node, an encryption statement can be generated to encrypt the model training data.
[0085] The data transfer function indicates the direction of model training data transmission. The model training data is related data for model training calculated based on unilateral data.
[0086] For unilateral data, since it is known that it is provided by parties G and H, and generally H sends the data to G, the transmission direction is fixed. Therefore, it is not necessary to write corresponding code for encryption and transmission in the source code, or the encryption and transmission direction can be included in the data merging function.
[0087] For model training data, based on the principles of vertical federated learning, interaction typically occurs between G, H, and the coordinator A. The transmission direction may differ depending on the specific parameters. Correspondingly, instead of writing corresponding code for encryption and transmission in the source code, encryption and transmission nodes can be added during compilation by determining the nodes where the corresponding data resides. Alternatively, the transmission direction for different parameters can be represented in the source code using data transmission functions. During parsing, the transmission direction is determined and encryption nodes are added by locating the corresponding nodes of the data transmission functions.
[0088] Furthermore, if the data transmission function corresponding to the added encryption node points to the coordinator among the participants, then an encryption statement for encrypting the model training data is generated based on the coordinator's public key. The coordinator's public key can be pre-stored at each participant and retrieved directly from their local machine during the encryption step.
[0089] Specifically, when generating data transmission statements based on data transmission functions, the corresponding data transmission statements can be generated according to the data transmission function, its parameters, and the identifier of at least one participant corresponding to the code segment containing the data transmission function. Data transmission functions can be divided into two categories based on the data transmission direction: sending functions and receiving functions. Sending and receiving are determined primarily by the participant executing the transmission step. The parameters of the data transmission function are the identifiers of the participants at the other end of the data transmission direction. Thus, combining these three factors determines the specific transmission direction of the data to be transmitted. For example, if the data transmission function is a receiving function, and its parameter in a certain code segment is H, and the identifier of the participant in that code segment is A, then the transmission direction of the data to be transmitted can be determined as from H to A.
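The three factors above (function category, function parameter, and executor identifier) can be combined as in the following minimal sketch. The function names `transmission_direction` and `needs_coordinator_key`, and the string labels `"send"` and `"recv"`, are hypothetical and chosen only for illustration; the second helper shows the check that decides whether the coordinator's public key applies.

```python
from typing import Tuple


def transmission_direction(kind: str, peer: str, executor: str) -> Tuple[str, str]:
    """Return (sender, receiver) for one data transmission call.

    kind     -- "send" or "recv", the category of the transmission function
    peer     -- the function's parameter: the participant at the other end
    executor -- identifier of the participant running the enclosing code segment
    """
    if kind == "send":
        return executor, peer
    if kind == "recv":
        return peer, executor
    raise ValueError("unknown transmission function kind: %r" % (kind,))


def needs_coordinator_key(kind: str, peer: str, executor: str,
                          coordinator: str = "A") -> bool:
    """True when the data flows to the coordinator, so its public key is used."""
    _, receiver = transmission_direction(kind, peer, executor)
    return receiver == coordinator
```

For the example in the text, `transmission_direction("recv", "H", "A")` yields `("H", "A")`, that is, from H to A.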
[0090] Similar to encryption, the decryption process can also be omitted from the source code and handled during compilation instead. Referring to Figure 5, the vertical federated learning process shows that decryption is required only after the coordinator receives data. Therefore, it is not necessary to write code for the decryption step in the source code. Instead, the node containing the received data can be determined during compilation, and a decryption node can be added after it. Alternatively, the decryption step can be represented by a decryption function in the source code; during parsing, the decryption statement is generated by locating the node corresponding to this decryption function. Specifically, the key in the decryption statement can be the coordinator's private key.
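The first option, adding decryption automatically after the coordinator receives data, might look like the sketch below. It assumes a code segment is represented as a list of `(operation, argument)` pairs; that representation, the operation names, and the key label `private_key_A` are all hypothetical.

```python
def add_decryption(statements, executor, coordinator="A"):
    """Insert a 'decrypt' step after each 'recv' step run by the coordinator."""
    out = []
    for op, arg in statements:
        out.append((op, arg))
        if op == "recv" and executor == coordinator:
            # Only the coordinator holds the private key, so only it decrypts.
            out.append(("decrypt", "private_key_" + coordinator))
    return out


coordinator_steps = add_decryption([("recv", "G"), ("check_convergence", "tol")], "A")
guest_steps = add_decryption([("recv", "A"), ("update", "w_g")], "G")
```

Segments executed by G or H pass through unchanged, which matches the observation that decryption is needed only on the coordinator side.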
[0091] In a specific embodiment, a federated learning model training function can be defined, which can be defined as def f(train_data, params). The two input parameters of this function are the training data "train_data" and the model training-related parameters "params". The model training-related parameters can include three parameters: the loss tolerance value tol, the learning rate learning_rate, and the maximum number of training epochs max_iter.
[0092] In actual federated learning, calling this function and inputting the relevant data corresponding to the parameters will initiate the model training process, which involves executing the source code corresponding to this function.
[0093] The source code corresponding to function f() may include an initialization program, a model training program, and a prediction program for the trained model.
[0094] During initialization, for G, the data provided by G can be defined as x_g (sample data), y (label), and w_g (weights). For H, the data provided by H can be defined as x_h and w_h. x is defined as the total sample data after vertical federation of the two data sources using vfed, and w is defined as the total weights after vertical federation of the two data sources using vfed.
[0095] During training, data from G and H are used in computation (in practice, the loss function is calculated on the G side), resulting in the loss value `fed_loss` and the gradient value `fed_grad`. G then sends the loss value to A. Upon receiving the loss value, A decrypts it and determines whether the loss function has converged based on the loss value and a preset loss tolerance, sending the convergence status to G and H. G, H, and A jointly determine the weight changes based on the gradient value and a preset learning rate. If the loss function has not converged, G and H update the model weights based on the weight changes. The entire training process needs to iterate until the loss function converges or until the preset maximum number of training epochs is reached.
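The control flow of this training loop can be sketched in plain Python. This is a single-process, plaintext stand-in that ignores encryption and the split between G, H, and A; it assumes logistic regression with labels in {0, 1} and treats "converged" as "the loss changed by less than tol since the previous epoch", neither of which is fixed by the text. The comments mark which party would perform each step in the described flow.

```python
import math


def f(train_data, params):
    """Plaintext stand-in for the described loop; returns (weights, epochs run)."""
    x, y = train_data
    tol, lr, max_iter = params["tol"], params["learning_rate"], params["max_iter"]
    n = len(x)
    w = [0.0] * len(x[0])
    prev_loss = float("inf")
    epoch = 0
    for epoch in range(1, max_iter + 1):
        # Predicted probabilities from the merged features and weights.
        p = [1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, row))))
             for row in x]
        # Loss and gradient (computed on the G side in the described flow).
        fed_loss = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                        for yi, pi in zip(y, p)) / n
        fed_grad = [sum((pi - yi) * row[j] for pi, yi, row in zip(p, y, x)) / n
                    for j in range(len(w))]
        # Convergence check (performed by coordinator A in the described flow).
        if abs(prev_loss - fed_loss) < tol:
            break
        prev_loss = fed_loss
        # Weight update (applied by G and H to their own weights).
        w = [wj - lr * gj for wj, gj in zip(w, fed_grad)]
    return w, epoch


train_data = ([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], [0, 1, 0, 1])
w, epochs = f(train_data, {"tol": 1e-4, "learning_rate": 0.5, "max_iter": 1000})
```

The loop stops either when the convergence test passes or when `max_iter` epochs have run, mirroring the two exit conditions in the text.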
[0096] The prediction process is performed by G and H. For H, the intermediate prediction result pred_h is obtained based on the sample data and weight data and sent to G. For G, the intermediate prediction result pred_g is obtained based on the sample data and weight data, and after being added to pred_h, a sigmoid transformation is performed to obtain the prediction probability.
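A plaintext sketch of this prediction step follows. It assumes each intermediate result is the inner product of one party's features and weights, which is the usual choice for logistic regression but is not spelled out in the text; the helper names are hypothetical, and the transfer of `pred_h` from H to G is represented only by a comment.

```python
import math


def partial_prediction(x_part, w_part):
    """Per-sample inner product of one party's features and weights."""
    return [sum(xj * wj for xj, wj in zip(row, w_part)) for row in x_part]


def predict(x_g, w_g, x_h, w_h):
    pred_h = partial_prediction(x_h, w_h)  # computed by H, then sent to G
    pred_g = partial_prediction(x_g, w_g)  # computed by G
    # G adds the two intermediate results and applies the sigmoid.
    return [1.0 / (1.0 + math.exp(-(g + h))) for g, h in zip(pred_g, pred_h)]


probabilities = predict([[1.0], [3.0]], [2.0], [[1.0], [0.0]], [-2.0])
```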
[0097] After writing the source code, users can send and store it on the electronic devices of all parties involved in the federated learning process, so that each participant can optimize the federated learning model by executing the source code.
[0098] When running this source code, the compilation method of this invention can be used to compile the source code, generate machine code, and then execute the compiled machine code.
[0099] In vertical federated learning, the model training method can employ logistic regression or linear regression to calculate quantities such as the loss function and its gradient during training.
[0100] For example, in logistic regression, let the data features of G be x_g, the weights of the logistic regression function of G be w_g, and the target variable of G be y; let the data features of H be x_h, and the weights of the logistic regression function of H be w_h; let the total sample data features be x, and the total weights of the logistic regression function be w. Let the data merging function be vfed(). This function finds the intersection of the sample data of G and the sample data of H, identifies the users in that intersection, and takes the union of the data features of these users in G and the data features of these users in H. This function can be used to obtain the data features x and the logistic regression function weights w. The corresponding Python code is as follows:
[0101] x = vfed(x_g, x_h)
[0102] w = vfed(w_g, w_h)
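To make the described semantics concrete, the sketch below applies them to plaintext feature tables. It is illustrative only: in the scheme described here vfed() also serves as a marker that the compiler recognizes, a real deployment would not exchange raw features, and the dictionary-based data layout is an assumption. The sketch covers the feature case; merging weights (w_g, w_h) would be a plain concatenation.

```python
def vfed(data_g, data_h):
    """Join two per-user feature tables on their shared user IDs.

    Each argument maps a user ID to the list of feature values held by one party.
    """
    shared = sorted(data_g.keys() & data_h.keys())              # sample intersection
    return {uid: data_g[uid] + data_h[uid] for uid in shared}   # feature union


x_g = {"u1": [1.0, 2.0], "u2": [3.0, 4.0]}
x_h = {"u2": [5.0], "u3": [6.0]}
x = vfed(x_g, x_h)
```

Only `u2` appears in both tables, so the merged result holds one row that combines G's two features with H's one feature.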
[0103] The corresponding loss function l(w) and its gradient are given by the following formulas:
[0104] [formula for the loss function l(w); not reproduced in this text]
[0105] [formula for the gradient of the loss function; not reproduced in this text]
[0106] Based on the above formula, corresponding source code can be written to represent the specific calculation process.
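For orientation, the standard textbook form of the logistic regression loss and its gradient is given below. This is an assumption, since the formulas referenced above are not reproduced in this text and may differ (for example in label encoding or normalization). Here n is the number of merged samples, x_i is the merged feature vector of sample i, y_i ∈ {0, 1} is its label, and σ is the sigmoid function.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
l(w) = -\frac{1}{n} \sum_{i=1}^{n}
  \Bigl[ y_i \log \sigma(w^{\top} x_i)
       + (1 - y_i) \log\bigl(1 - \sigma(w^{\top} x_i)\bigr) \Bigr], \qquad
\nabla l(w) = \frac{1}{n} \sum_{i=1}^{n}
  \bigl( \sigma(w^{\top} x_i) - y_i \bigr)\, x_i
```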
[0107] For example, in linear regression, let the data features of G be x_g, the weights of the linear regression function of G be w_g, and the target variable of G be y; let the data features of H be x_h, and the weights of the linear regression function of H be w_h; let the total sample data features be x, and the total weights of the linear regression function be w. Let the data merging function be vfed().
[0108] The corresponding loss function l(w) and its gradient are given by the following formulas:
[0109] [formula for the loss function l(w); not reproduced in this text]
[0110] [formula for the gradient of the loss function; not reproduced in this text]
[0111] Based on the above formula, corresponding source code can be written to represent the specific calculation process.
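For orientation, the standard least-squares form of the linear regression loss and its gradient is given below. This is an assumption, since the formulas referenced above are not reproduced in this text and may use a different scaling. Here n is the number of merged samples, x_i is the merged feature vector of sample i, and y_i is its target value.

```latex
l(w) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( w^{\top} x_i - y_i \bigr)^{2}, \qquad
\nabla l(w) = \frac{1}{n} \sum_{i=1}^{n} \bigl( w^{\top} x_i - y_i \bigr)\, x_i
```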
[0112] Figure 6 is a schematic structural diagram of a compilation apparatus provided in an embodiment of the present invention. As shown in Figure 6, the compilation apparatus includes an acquisition module 601 and a generation module 602.
[0113] The acquisition module 601 is used to acquire the source code of vertical federated learning; the source code includes multiple code segments for implementing vertical federated learning and an executor identifier corresponding to each code segment, the executor identifier being used to indicate at least one participant executing the code segment.
[0114] The generation module 602 is used to generate a corresponding judgment statement for each code segment based on the executor identifier corresponding to the code segment. The judgment statement is used to determine whether the first participant belongs to the executor of the code segment. When the judgment statement is executed, the result of the judgment statement is used to determine whether to execute the code segment. The module also generates compiled code based on each code segment and the judgment statement corresponding to each code segment.
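The role of the judgment statement can be illustrated with a small source-to-source sketch. It assumes that each code segment arrives as a pair of executor identifiers and Python code, and that the running participant's identifier is available at run time in a variable named `role`; both are hypothetical choices. The sketch emits guarded Python source rather than machine code.

```python
def compile_segments(segments):
    """Turn (executors, code) pairs into one Python source string.

    Each segment is guarded by an `if` that tests the run-time variable `role`.
    """
    lines = []
    for executors, code in segments:
        # The judgment statement: does the running participant execute this segment?
        lines.append("if role in %r:" % (tuple(sorted(executors)),))
        lines.extend("    " + line for line in code.splitlines())
    return "\n".join(lines) + "\n"


segments = [
    ({"G"}, "log.append('G computes loss')"),
    ({"G", "H"}, "log.append('update weights')"),
    ({"A"}, "log.append('A checks convergence')"),
]
compiled = compile_segments(segments)


def run_as(role):
    """Execute the same compiled code as a given participant."""
    env = {"role": role, "log": []}
    exec(compiled, env)
    return env["log"]
```

Every participant receives the same `compiled` text; `run_as("H")` executes only the segments whose guard includes H, which is the point of generating one unified program for all parties.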
[0115] Optionally, the device 600 further includes: a determining module 603 and a parsing module 604.
[0116] Before the generation module 602 generates the corresponding judgment statement, the determination module 603 determines the data merging function in the source code; the data merging function is used to indicate the merging method of the unilateral data provided by each participant in the vertical federated learning; the parsing module 604 parses the source code according to the data merging function and generates a computation graph; the computation graph is used to represent the data processing logic of the unilateral data provided by each participant.
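As a first step toward such a computation graph, the data merging calls can be located with Python's standard `ast` module. The sketch below assumes the source code is Python and that merges are written as simple assignments such as `x = vfed(x_g, x_h)`; the helper name `find_merge_calls` is hypothetical. Each result pairs a merged node with the unilateral-data nodes that feed it.

```python
import ast


def find_merge_calls(source, merge_name="vfed"):
    """Return (target, [argument names]) for every `target = vfed(...)` assignment."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Assign)
                and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id == merge_name):
            target = node.targets[0].id
            args = [a.id for a in node.value.args if isinstance(a, ast.Name)]
            found.append((target, args))
    return found


edges = find_merge_calls("x = vfed(x_g, x_h)\nw = vfed(w_g, w_h)\nz = other(x)\n")
```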
[0117] When generating a corresponding judgment statement based on the executor identifier corresponding to the code segment, the generation module 602 is specifically used to generate a corresponding judgment statement based on the executor identifier corresponding to the code segment and the data processing logic represented by the computation graph.
[0118] Optionally, the determining module 603 is further configured to determine the node where the unilateral data provided by each participant is located based on the computation graph.
[0119] The parsing module 604 is also used to add an encryption node after the node containing the unilateral data provided by each participant;
[0120] The generation module 602 is also used to generate an encryption statement for the encryption node to encrypt the unilateral data;
[0121] When generating compiled code based on each code segment and the corresponding judgment statement, the generation module 602 is specifically used to generate compiled code based on each code segment, the corresponding judgment statement, and the encryption statement.
[0122] Optionally, the determining module 603 is further configured to determine a data transmission function in the source code; the data transmission function is used to indicate the transmission direction of the model training data, which is calculated based on the unilateral data;
[0123] The parsing module 604 is also used to add an encryption node before the node where the data transmission function is located;
[0124] The generation module 602 is also used to generate encryption statements for the encryption node to encrypt the model training data.
[0125] Optionally, when generating the encryption statement to encrypt the model training data, the generation module 602 is specifically used to generate the encryption statement to encrypt the model training data based on the coordinator's public key if the transmission direction of the data transmission function corresponding to the encryption node points to the coordinator among the participants.
[0126] Optionally, the generation module 602 is further configured to generate a corresponding data transmission statement based on the data transmission function, the parameters of the data transmission function, and the identifier of at least one participant corresponding to the code segment in which the data transmission function is located;
[0127] When generating compiled code based on each code segment and the corresponding judgment statement, the generation module 602 is specifically used to:
[0128] generate compiled code based on each code segment, the corresponding judgment statement, and the data transmission statement.
[0129] Optionally, the determining module 603 is further configured to determine the data decryption function in the source code;
[0130] The generation module 602 is also used to generate a decryption statement based on the private key of the coordinator among the participants;
[0131] When generating compiled code based on each code segment and the corresponding judgment statement, the generation module 602 is specifically used to generate compiled code based on each code segment, the corresponding judgment statement, and the decryption statement.
[0132] The compilation apparatus provided in the embodiments of the present invention can execute the compilation method provided in any embodiment of the present invention, and has the functional modules corresponding to the above-described compilation method and the same beneficial effects as the above-described compilation method, which will not be repeated here.
[0133] Figure 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention. As shown in Figure 7, the electronic device 700 includes a memory 701, a processor 702, and a computer program.
[0134] The computer program is stored in memory 701 and configured to be executed by processor 702 to implement the compilation method provided in any embodiment of the present invention.
[0135] The memory 701 and the processor 702 are connected via a bus.
[0136] The relevant explanations can be understood by referring to the corresponding descriptions and effects in the above embodiments, and will not be elaborated further here.
[0137] The present invention also provides a computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the compilation method provided in any of the embodiments described above.
[0138] The computer-readable storage medium can be a read-only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.
[0139] In the several embodiments provided by this invention, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms.
[0140] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0141] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each module can exist physically separately, or two or more modules can be integrated into one unit. The unit composed of the above modules can be implemented in hardware or in the form of hardware plus software functional units.
[0142] The integrated modules described above, implemented as software functional modules, can be stored in a computer-readable storage medium. These software functional modules, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of the present invention.
[0143] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.
[0144] The memory may include high-speed RAM, and may also include non-volatile storage (NVM), such as at least one disk storage device, and may also be a USB flash drive, external hard drive, read-only memory, disk or optical disc, etc.
[0145] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.
[0146] The aforementioned storage medium can be implemented from any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The storage medium can be any available medium accessible to general-purpose or special-purpose computers.
[0147] An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium can be an integral part of the processor. Both the processor and the storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and storage medium can exist as discrete components in an electronic device or host device.
[0148] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0149] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0150] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0151] The above are merely preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.
Claims
1. A compilation method, characterized in that the method is applied to a first participant among multiple participants in vertical federated learning and includes:
obtaining source code for vertical federated learning, wherein the source code includes multiple code segments for implementing vertical federated learning and an executor identifier corresponding to each code segment, the executor identifier being used to indicate at least one participant executing the code segment;
for each code segment, generating a corresponding judgment statement based on the executor identifier corresponding to the code segment, wherein the judgment statement is used to determine whether the first participant is an executor of the code segment, so that when the judgment statement is executed, its result determines whether the code segment is executed;
generating compiled code based on each code segment and the judgment statement corresponding to each code segment;
wherein, before generating the corresponding judgment statement, the method further includes:
determining the data merging function in the source code, the data merging function being used to indicate how the unilateral data provided by each participant in the vertical federated learning is merged;
parsing the source code according to the data merging function to generate a computation graph, the computation graph being used to represent the data processing logic of the unilateral data provided by each participant;
wherein generating a corresponding judgment statement based on the executor identifier corresponding to the code segment includes:
generating the corresponding judgment statement based on the executor identifier corresponding to the code segment and the data processing logic represented by the computation graph; further, based on the data processing logic represented by the computation graph, adjusting the nodes corresponding to the executor identifier by adding, deleting, or modifying corresponding nodes, or by adding, deleting, or modifying the paths between corresponding nodes, to form judgment logic between nodes, and generating the corresponding judgment statement at the code level based on the newly formed judgment logic;
the method further includes:
determining, based on the computation graph, the node where the unilateral data provided by each participant is located;
adding an encryption node after the node containing the unilateral data provided by each participant;
for the encryption node, generating an encryption statement to encrypt the unilateral data;
wherein generating compiled code based on each code segment and the judgment statement corresponding to each code segment includes:
generating the compiled code based on each code segment, the judgment statement corresponding to each code segment, and the encryption statement;
the method further includes:
determining the data transmission function in the source code, the data transmission function being used to indicate the transmission direction of model training data, which is calculated based on the unilateral data;
adding an encryption node before the node containing the data transmission function;
for the encryption node, generating an encryption statement to encrypt the model training data.
2. The method according to claim 1, characterized in that generating the encryption statement for encrypting the model training data includes: if the data transmission function corresponding to the encryption node points to the coordinator among the participants, generating the encryption statement for encrypting the model training data based on the coordinator's public key.
3. The method according to claim 2, characterized in that the method further includes: generating a corresponding data transmission statement based on the data transmission function, the parameters of the data transmission function, and the identifier of at least one participant corresponding to the code segment where the data transmission function is located; wherein generating compiled code based on each code segment and the judgment statement corresponding to each code segment includes: generating the compiled code based on each code segment, the corresponding judgment statement, and the data transmission statement.
4. The method according to any one of claims 1-3, characterized in that the method further includes: determining the data decryption function in the source code; generating a decryption statement based on the private key of the coordinator among the participants; wherein generating compiled code based on each code segment and the judgment statement corresponding to each code segment includes: generating the compiled code based on each code segment, the corresponding judgment statement, and the decryption statement.
5. A compilation apparatus, characterized in that the apparatus includes:
an acquisition module, used to acquire source code for vertical federated learning, wherein the source code includes multiple code segments for implementing vertical federated learning and an executor identifier corresponding to each code segment, the executor identifier being used to indicate at least one participant executing the code segment;
a generation module, used to generate, for each code segment, a corresponding judgment statement based on the executor identifier corresponding to the code segment, wherein the judgment statement is used to determine whether the first participant is an executor of the code segment, so that when the judgment statement is executed, its result determines whether the code segment is executed; and to generate compiled code based on each code segment and the judgment statement corresponding to each code segment;
a determination module, used to determine, before the generation module generates the corresponding judgment statement, the data merging function in the source code, the data merging function being used to indicate how the unilateral data provided by each participant in the vertical federated learning is merged;
a parsing module, used to parse the source code according to the data merging function and generate a computation graph, the computation graph being used to represent the data processing logic of the unilateral data provided by each participant;
wherein generating the corresponding judgment statement according to the executor identifier corresponding to the code segment includes: generating the corresponding judgment statement according to the executor identifier corresponding to the code segment and the data processing logic represented by the computation graph; further, according to the data processing logic represented by the computation graph, adjusting the nodes corresponding to the executor identifier by adding, deleting, or modifying the corresponding nodes, or by adding, deleting, or modifying the paths between the corresponding nodes, to form judgment logic between nodes, and generating the corresponding judgment statement at the code level based on the newly formed judgment logic;
the determination module is further used to determine, based on the computation graph, the node where the unilateral data provided by each participant is located;
the parsing module is further used to add an encryption node after the node containing the unilateral data provided by each participant;
the generation module is further used to generate, for the encryption node, an encryption statement to encrypt the unilateral data;
when generating compiled code based on each code segment and the corresponding judgment statement, the generation module is specifically used to generate the compiled code based on each code segment, the corresponding judgment statement, and the encryption statement;
the determination module is further used to determine the data transmission function in the source code, the data transmission function being used to indicate the transmission direction of the model training data, which is calculated based on the unilateral data;
the parsing module is further used to add an encryption node before the node where the data transmission function is located;
the generation module is further used to generate, for the encryption node, an encryption statement to encrypt the model training data.
6. An electronic device, characterized in that the electronic device includes: a memory, a processor, and a compiler stored in the memory and executable on the processor, wherein the compiler, when executed by the processor, implements the steps of the compilation method according to any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a compiler which, when executed by a processor, implements the steps of the compilation method according to any one of claims 1 to 4.