A high-performance multi-party secure computing training method and system based on GPU
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HARBIN INST OF TECH
- Filing Date
- 2023-10-09
- Publication Date
- 2026-06-26
Figure CN117332838B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of multi-party secure computation protocols, specifically a high-performance multi-party secure computation training system based on a GPU.
Background Technology
[0002] In recent years, with the rapid development of computer technology, machine learning models have been increasingly widely used, and more and more private data is being used as training data. Multi-party computation (MPC) provides a privacy-preserving computation method that can perform privacy-preserving computations on objective functions without revealing input data, showing broad application prospects and significant research value. However, to meet certain security assumptions, MPC protocols are based on secret value sharing, increasing computational complexity and generating substantial inter-party communication. Furthermore, because communication and computation are performed serially during encrypted computation, frequent communication leads to GPU idle time, resulting in low hardware utilization during training. These two factors significantly increase the training time cost based on MPC, which is the main obstacle to the widespread application of this secure and privacy-preserving computation method.
[0003] The main problem with using secure multi-party computation (MPC) for model training is that all operations in MPC protocols are based on additive secret shared values, resulting in significant computational and communication overhead and a sharp increase in training time. Because computations are performed on encrypted data, basic operations such as addition, multiplication, and comparison consist of interdependent computational and communication steps, leading to low average utilization of computing resources and network bandwidth over a period of time, resulting in very slow computation speeds. In fact, low computational efficiency has always been a core issue in MPC model training. Due to this problem, MPC-based training remained in the theoretical research stage for a long time after its inception and could not be practically implemented. Currently, most research on MPC training focuses on improving computational speed by reducing communication complexity; however, computation and communication are still performed serially, offering very limited improvement in hardware resource utilization. Modern model training frequently utilizes GPUs, and frequent communication causes GPUs to be idle for extended periods, resulting in training times that are thousands or even tens of thousands of times longer than plaintext training, even with the most advanced MPC training frameworks. The problems of excessively long training time and slow data throughput have not been effectively solved.
Summary of the Invention
[0004] The technical problem to be solved by this invention is:
[0005] The purpose of this invention is to provide a high-performance multi-party secure computation training system based on GPU, that is, to provide a multi-party secure computation training framework with a higher degree of parallelism, so as to realize the parallelism between different layers of the neural network by combining data parallelism and model parallelism, thereby improving the data throughput speed of the training process.
[0006] The technical solution adopted by the present invention to solve the above-mentioned technical problems is as follows:
[0007] A high-performance multi-party secure computation training method based on GPU, the method comprising:
[0008] The subnetwork segmentation step is used to divide the entire neural network and enable the segmented subnetworks to run simultaneously, so that the computation of linear and nonlinear layers overlaps as much as possible.
[0009] The pipeline training process runs two subnetworks as two processes on the same GPU, performing linear and nonlinear layer calculations respectively. The two processes can simultaneously utilize multiple resources such as GPU cores, video memory, and network bandwidth of the computing server. At the same time, parameter caching and optimization of video memory usage are added to minimize contention for these computing resources.
[0010] Furthermore, the algorithm for subnetwork segmentation is as follows:
[0011] To achieve parallelism between subnetworks, the algorithm allocates subnetwork sizes so that the average batch throughput of the subnetworks is as close as possible, thereby synchronizing the batch throughput of each subnetwork and improving pipeline utilization. Because adjacent subnetworks must exchange outputs and gradients, the algorithm also calculates the size of the data exchanged between them to assess the additional communication time caused by segmentation. The goal of the subnetwork segmentation algorithm is to calculate the segmentation with the shortest average batch training time based on the computational unit topology of the multi-party secure computation system. Each subnetwork is assigned to a different computational unit, so the subnetworks are trained in parallel. The total throughput time of each network layer, i.e., the total time spent on forward and backward propagation, is measured under typical training conditions.
[0012] Let A_k(i→j, m) denote the training time of the longest subnetwork in the optimally segmented pipeline when m workers at level k train the network between layers i and j. The optimal subnetwork segmentation problem then reduces to computing the minimum A_L(0→N, m_L), where L is the total number of levels in the topology and N is the total number of layers in the network. T_k(i→j, m) denotes the total forward and backward propagation time when m computing units synchronously train the subnetwork from layer i to layer j under bandwidth B_k.
[0013] Furthermore, the subnetwork segmentation is implemented as a dynamic programming algorithm that calculates the optimal segmentation. It uses nested loops to expand the computation, decomposing the problem from top to bottom. In each loop, the optimal time is used for comparison: the best theoretical training time for each level of computing unit is calculated from the bottom up, the optimal segmentation is selected for each subproblem, and finally the globally optimal segmentation is obtained. The formulas for A_k and T_k are:
[0014]
[0015]
[0016] If A k (i→j,m) then A k (i→j,m)=A
[0017] If A_k(i→j, m) < A_k(i→N, m), then a better partition exists for layers i→N of the network, with partition point j, and A_k(i→N, m) = A_k(i→j, m);
[0018] If A_k(i→N, m) < A_k(m), then A_k(m) = A_k(i→N, m);
[0019] If A_k(m) < A_k, then A_k = A_k(m).
[0020] Furthermore, the algorithm for implementing pipeline training is as follows:
[0021] Pipeline training employs a scheduling strategy that alternates between forward and backward propagation. After each subnetwork computes the forward propagation of the k-th batch, it sends the output tensor of its last layer to the downstream subnetwork as that subnetwork's input. It then switches to backward propagation mode, receives the gradients returned by the downstream subnetwork, computes the backward propagation gradient of the (k-x)-th batch, and updates its own parameters. Next, the subnetwork sends the gradient results to the upstream subnetwork for backward propagation and receives the output tensor of the upstream subnetwork for the (k+1)-th forward propagation.
[0022] Furthermore, during pipeline training,
[0023] A parameter caching method is used to avoid parameter version inconsistencies. The parameter cache stores multiple parameter versions in each subnetwork, each identified by the subnetwork a that holds it and the batch b it was used to compute. Subnetworks closer to the input side need to cache more parameter versions. During forward propagation, each subnetwork uses the latest parameter version. After subnetwork 1 completes the forward computation of the first batch, the parameter version it used is cached; during backpropagation of the first batch, this version is retrieved for the backpropagation calculation. The gradient correction resulting from the intervening parameter update is then calculated and merged into the gradient of the next batch, and finally this parameter version is discarded. The parameter caching mechanism ensures that the backpropagation of each batch within a subnetwork is valid, guaranteeing parameter consistency within the subnetwork.
[0024] Furthermore, the pipeline training implementation algorithm is a pipeline algorithm based on a parameter caching mechanism, including: a preparation phase, in which each sub-network performs forward propagation of different batches according to its own position; a stabilization phase, in which each sub-network alternates between forward and backward computation; and a termination phase, in which each sub-network performs unfinished backward propagation computation.
[0025] Furthermore, during the implementation of pipeline training, the inconsistencies in data between sub-networks caused by parameter caching are corrected, specifically as follows:
[0026] Consider a pipeline with no duplicate subnetworks that contains n subnetworks, with the parameters of the subnetworks denoted w_1, w_2, ..., w_n. The parameter versions obtained after training on t batches are indexed by t, and so on. After each batch, the average gradient ∇f(w_1, w_2, …, w_n) over all data in that batch is calculated, where f is the loss function. Assuming the learning rate is v, the original parameter update rule is:
[0027]
[0028] Because of parameter caching, the parameters in subnetwork 1 use the version from n-1 batches earlier, the parameters in subnetwork 2 use the version from n-2 batches earlier, and so on; the actual parameter update is:
[0029]
[0030] To ensure the consistency of parameters across the entire model, the parameter update method will be changed to:
[0031]
[0032] Furthermore, in the implementation of pipeline training, the number of batch intervals between two adjacent sub-networks is increased to reduce pipeline bubbles caused by communication time. The number of batch intervals needs to be dynamically adjusted according to the memory usage and the actual training speed.
[0033] A GPU-based high-performance multi-party secure computation training system, comprising program modules corresponding to the steps of the aforementioned technical solution, which execute the steps of the GPU-based high-performance multi-party secure computation training method during runtime; the program modules include:
[0034] The subnetwork segmentation module is used to segment the entire neural network and enable the segmented subnetworks to run simultaneously, so that the calculations of linear and nonlinear layers overlap as much as possible.
[0035] The pipeline training module is used to run two sub-networks as two processes on the same GPU, respectively running the computation of linear and non-linear layers. The two processes can simultaneously utilize multiple resources such as GPU cores, video memory, and network bandwidth of the computing server; at the same time, parameter caching and optimization of video memory usage are added to minimize the contention of these computing resources.
[0036] A computer-readable storage medium storing a computer program configured to implement, when invoked by a processor, the steps of a GPU-based high-performance multi-party secure computation training method.
[0037] The present invention has the following beneficial technical effects:
[0038] This invention provides a multi-party secure computation training system based on a pipeline training method, as shown in Figure 1. The method addresses the computation and communication bottlenecks of the linear and nonlinear network layers, respectively, during MPC model training. It designs a pipeline training method to achieve parallelism between subnetworks and implements an optimal subnetwork segmentation algorithm to balance the training load across the subnetworks. The result is a multi-party secure computation training framework with a higher degree of parallelism: by combining data parallelism and model parallelism, it achieves parallelism between different layers of the neural network, significantly improving data throughput during training.
Attached Figure Description
[0039] Figure 1 is a block diagram of the overall design of the high-performance multi-party secure computation training system;
[0040] Figure 2 is a framework diagram of the overall system optimization;
[0041] Figure 3 is a schematic diagram of the design principle of the subnetwork segmentation algorithm;
[0042] Figure 4 is a schematic diagram of the pipeline training method;
[0043] Figure 5 is a schematic diagram of the improved pipeline;
[0044] Figure 6 is a diagram of communication through a dedicated communication process;
[0045] Figure 7 is a diagram of distributed communication;
[0046] Figure 8 compares training speed before and after optimization;
[0047] Figure 9 is a bar chart comparing average GPU utilization;
[0048] Figure 10 is a graph of GPU utilization over time for each training configuration;
[0049] Figure 11 is a comparison chart of the loss reduction for the five training configurations.
Detailed Implementation
[0050] The implementation of the GPU-based high-performance multi-party secure computation training method and system described in this invention is explained as follows:
[0051] 1. Technical Concept
[0052] This invention proposes a multi-party secure computation training system based on a pipeline training method, as shown in Figure 1. The method addresses the computation and communication bottlenecks of the linear and nonlinear network layers, respectively, during MPC model training. It designs a pipeline training method to achieve parallelism between subnetworks and implements an optimal subnetwork segmentation algorithm to balance the training load across the subnetworks.
[0053] 2. Technical Solution
[0054] The training system designed in this paper is described below, including the construction of the overall system optimization idea and the construction of each component of the system, mainly including: sub-network segmentation algorithm, pipeline training framework, and further optimization of derivative problems found in the process of system implementation.
[0055] 2.1 Overall System Optimization Strategy
[0056] We designed a GPU-based high-performance multi-party secure computation training framework that focuses on improving GPU utilization and increasing training batch throughput. The framework significantly improves model training speed with minimal impact on accuracy. Our theoretical and experimental analysis reveals significant differences in resource utilization among network layers within the multi-party secure computation framework: linear layers have relatively high computational cost, low communication time, and short GPU idle time, so computation is the primary bottleneck; nonlinear layers have many communication rounds, relatively low computational cost, and long GPU idle time, so the frequent communication rounds are the main bottleneck. To improve GPU utilization and increase parallelism during training, we compute linear and nonlinear layers simultaneously, running both kinds of layers at once while maintaining computational speed. The overall optimization framework is shown in Figure 2. A pipeline training method is designed and implemented that segments the entire neural network and lets the segmented subnetworks run simultaneously, maximizing the overlap between linear and nonlinear layer computations. The aim is that when two subnetworks run as two separate processes on the same GPU, performing linear and nonlinear layer computations respectively, both processes can simultaneously use the GPU cores, video memory, and network bandwidth of the computing server while minimizing contention for these resources.
[0057] 2.2 Subnetwork Segmentation Algorithm
[0058] The design idea of the subnetwork segmentation algorithm is shown in Figure 3. To achieve parallelism between subnetworks, subnetwork sizes must be allocated so that the average batch throughput of the subnetworks is as close as possible, thereby synchronizing the batch throughput of each subnetwork and improving pipeline utilization. At the same time, since adjacent subnetworks pass outputs and gradients to each other, the size of the data exchanged between them must also be calculated to evaluate the additional communication time caused by the segmentation.
[0059] The goal of the subnetwork segmentation algorithm is to calculate the segmentation with the shortest average training time per batch, based on the computational unit topology of the multi-party secure computation system. Each subnetwork is assigned to a different computational unit, so the subnetworks are trained in parallel. The total throughput time of each network layer, i.e., the total time spent on forward and backward propagation, must be measured in a typical training environment. Taking forward propagation as an example, when the first subnetwork completes the forward computation of a batch, it passes the output of its last layer to the second subnetwork as input, which causes a data transmission delay between computational units. This delay can be calculated by dividing the total data transmission size by the connection bandwidth. Let A_k(i→j, m) denote the training time of the longest subnetwork in the optimally segmented pipeline when m workers at level k train the network between layers i and j. The optimal subnetwork segmentation problem then reduces to computing the minimum A_L(0→N, m_L), where L is the total number of levels in the topology and N is the total number of layers in the network. T_k(i→j, m) denotes the total forward and backward propagation time when m computing units synchronously train the subnetwork from layer i to layer j under bandwidth B_k. Algorithm 1 shows a dynamic programming algorithm for calculating the optimal subnetwork segmentation. Lines 2 to 11 use nested loops to expand the computation, representing the top-down decomposition of the problem; lines 12 to 15 correspond to the calculation of A_k and T_k described above; lines 16 to 20 use the optimal time for comparison in each loop. That is, the best theoretical training time for each level of computing unit is calculated from the bottom up, the optimal segmentation is selected for each subproblem, and finally the globally optimal segmentation is obtained, which is the solution to the original problem.
[0060]
[0061]
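The segmentation idea above can be illustrated with a simplified, single-level sketch. It is not a reproduction of Algorithm 1: it assumes one level of computing units, one worker per subnetwork, and that the steady-state batch time of the pipeline is bounded by the slowest of stage compute time and boundary transfer time (transfer size divided by bandwidth, as described above). The names `partition_layers`, `layer_times`, `boundary_bytes`, and `bandwidth` are hypothetical.

```python
from functools import lru_cache


def partition_layers(layer_times, boundary_bytes, bandwidth, num_stages):
    """Split layers into contiguous stages so the slowest stage is as fast as possible.

    layer_times[i]    -- measured forward + backward time of layer i (seconds)
    boundary_bytes[i] -- bytes exchanged if a cut is placed after layer i
    bandwidth         -- link bandwidth between stages (bytes per second)
    Returns (bottleneck_time, cut_points), where each cut point c means
    "a new stage starts at layer c".
    """
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    def stage_time(i, j):
        # Compute time of the stage holding layers i .. j-1.
        return prefix[j] - prefix[i]

    @lru_cache(maxsize=None)
    def best(j, s):
        # Best bottleneck when the first j layers are split into s stages.
        if s == 1:
            return stage_time(0, j), ()
        result = (float("inf"), ())
        for cut in range(s - 1, j):
            left_time, left_cuts = best(cut, s - 1)
            comm = boundary_bytes[cut - 1] / bandwidth  # size / bandwidth
            bottleneck = max(left_time, comm, stage_time(cut, j))
            if bottleneck < result[0]:
                result = (bottleneck, left_cuts + (cut,))
        return result

    return best(n, num_stages)
```

For example, with layer times `[4, 1, 1, 4]`, free communication, and two stages, the sketch places the cut in the middle so that both stages take 5 time units. Making the middle boundary expensive pushes the cut to a cheaper boundary even though the stages become less balanced.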
[0062] 2.3 Pipeline Training Framework
[0063] The basic design of the pipeline training method is shown in Figure 4. It is a scheduling strategy that alternates between forward and backward propagation. After each subnetwork computes the forward propagation of the k-th batch, it sends the output tensor of its last layer to the downstream subnetwork as that subnetwork's input. It then switches to backward propagation mode, receives the gradients returned by the downstream subnetwork, computes the backward propagation gradient of the (k-x)-th batch, and updates its own parameters. Next, the subnetwork sends the gradient results to the upstream subnetwork for backward propagation and receives the output tensor of the upstream subnetwork for the (k+1)-th forward propagation.
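The alternating schedule can be sketched as a per-subnetwork list of forward ("F") and backward ("B") operations. This is an illustrative sketch, not the patent's algorithm: it assumes that a subnetwork runs ahead by one forward batch for every subnetwork downstream of it, and the function name `one_f_one_b_schedule` and zero-based stage indexing are hypothetical.

```python
def one_f_one_b_schedule(stage, num_stages, num_batches):
    """Return the ordered ("F" | "B", batch_index) operations for one subnetwork.

    stage is zero-based; stage 0 is the subnetwork nearest the input.
    """
    # Assumed lead: one extra forward batch per downstream subnetwork.
    warmup = min(num_stages - 1 - stage, num_batches)
    ops = []
    fwd = bwd = 0

    # Preparation: forward-only batches until the pipeline is filled.
    for _ in range(warmup):
        ops.append(("F", fwd))
        fwd += 1

    # Stabilization: alternate one forward with one backward.
    while fwd < num_batches:
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1

    # Termination: finish the backward passes still outstanding.
    while bwd < num_batches:
        ops.append(("B", bwd))
        bwd += 1
    return ops
```

With two subnetworks and three batches, the upstream subnetwork runs F0, F1, B0, F2, B1, B2 while the downstream one strictly alternates F0, B0, F1, B1, F2, B2.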
[0064] Because each subnetwork has performed forward propagation on more batches than it has backpropagated, directly backpropagating with the current model parameters makes the forward and backward parameters of the same batch inconsistent. This causes deviations in the model's convergence direction and speed, resulting in ineffective gradients, oscillating convergence, or even non-convergence, thus reducing training efficiency. Therefore, a parameter caching method is employed to avoid inconsistent parameter versions. The parameter cache stores multiple parameter versions in each subnetwork, each identified by the subnetwork a that holds it and the batch b it was used to compute. Subnetworks closer to the input side need to cache more parameter versions. During forward propagation, each subnetwork uses the latest parameter version. After subnetwork 1 completes the forward computation of the first batch, the parameter version it used is cached; during backpropagation of the first batch, this version is retrieved for the backpropagation calculation. The gradient correction caused by the intervening parameter update is then calculated and merged into the gradient of the next batch, and finally this parameter version is discarded. This mechanism ensures that the backpropagation of each batch within a subnetwork is valid, guaranteeing parameter consistency within the subnetwork. Algorithm 2 shows the pipeline algorithm based on this parameter caching mechanism. Lines 2-6 are the preparation phase, in which each subnetwork performs forward propagation on a different number of batches according to its position; lines 7-16 are the stabilization phase, in which each subnetwork alternates between forward and backward computation; lines 17-22 are the termination phase, in which each subnetwork completes the outstanding backpropagation computations.
[0065]
[0066]
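The parameter caching mechanism can be sketched with a small class that snapshots the parameters used for each batch's forward pass and hands the same snapshot back for that batch's backward pass. This is a minimal sketch under stated assumptions: parameters are plain floats in a dictionary, the class name `ParameterCache` and its methods are hypothetical, and the gradient correction step described above is omitted.

```python
class ParameterCache:
    """Keeps one parameter snapshot per in-flight batch of a subnetwork."""

    def __init__(self, params):
        self.latest = dict(params)   # most recent parameter version
        self._stash = {}             # batch index -> snapshot used in forward

    def forward_version(self, batch):
        # Forward always uses the latest parameters; remember which ones.
        self._stash[batch] = dict(self.latest)
        return self._stash[batch]

    def backward_version(self, batch):
        # Backward reuses the forward snapshot, then discards it.
        return self._stash.pop(batch)

    def apply_update(self, grads, lr):
        # Plain gradient step on the latest version.
        for name, grad in grads.items():
            self.latest[name] -= lr * grad

    def cached_versions(self):
        return len(self._stash)
```

A subnetwork near the input has more batches in flight, so `cached_versions()` grows with its distance from the output, which matches the memory cost discussed for upstream subnetworks.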
[0067] 2.4 Further optimization of the system
[0068] The strategy described above introduces some problems. For each batch of training data, parameter caching causes inconsistencies between subnetworks, so a method is needed to correct them. Consider a pipeline with no duplicate subnetworks that contains n subnetworks, with the parameters of the subnetworks denoted w_1, w_2, ..., w_n. The parameter versions obtained after training on t batches are indexed by t, and so on. After each batch, the average gradient ∇f(w_1, w_2, …, w_n) over all data in that batch is calculated, where f is the loss function. Assuming the learning rate is v, the original parameter update rule is:
[0069]
[0070] Because parameter caching is used, the parameters in subnetwork 1 use the version from n-1 batches earlier, the parameters in subnetwork 2 use the version from n-2 batches earlier, and so on. In other words, the actual parameter update is:
[0071]
[0072] To ensure the consistency of the parameters across the entire model, we changed the parameter update method to:
[0073]
[0074] Although this method leads to some parameter lag, which affects the convergence speed, it can theoretically maintain the consistency of parameters between subnetworks as much as possible, thereby improving the effectiveness and accuracy of training.
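The three update rules described in this section can be written out as follows. This is a reconstruction from the surrounding prose, not the patent's own formulas: the superscript notation w^{(t)} for the parameters after t batches is an assumption, and the third rule assumes that consistency is restored by having every subnetwork use the oldest version in the pipeline, which is consistent with the parameter lag mentioned above.

```latex
% Original update (no pipelining): all subnetworks use version t.
w^{(t+1)} = w^{(t)} - v \,\nabla f\!\left(w_1^{(t)}, w_2^{(t)}, \dots, w_n^{(t)}\right)

% With parameter caching: subnetwork i uses the version from n-i batches earlier.
w^{(t+1)} = w^{(t)} - v \,\nabla f\!\left(w_1^{(t-n+1)}, w_2^{(t-n+2)}, \dots, w_n^{(t)}\right)

% Assumed consistent variant: every subnetwork uses the same (oldest) version.
w^{(t+1)} = w^{(t)} - v \,\nabla f\!\left(w_1^{(t-n+1)}, w_2^{(t-n+1)}, \dots, w_n^{(t-n+1)}\right)
```

In the second rule the version offsets follow directly from the text: subnetwork 1 lags by n-1 batches, subnetwork 2 by n-2 batches, and subnetwork n does not lag.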
[0075] The second issue is the additional memory overhead caused by parameter caching. While pipelined systems do not incur the same massive memory usage as data parallelism, the greater number of parameter versions cached in upstream subnetworks makes this memory usage significant. Besides reducing batch size, the pipeline design can be adjusted further, as shown in Figure 5: increasing the batch interval between adjacent subnetworks reduces pipeline bubbles caused by communication time, improves pipeline utilization, and increases training data throughput. However, increasing the number of batches in training at the same time comes at the cost of increased GPU memory usage. Therefore, the batch interval needs to be dynamically adjusted based on GPU memory usage and actual training speed.
[0076] Algorithm 3 demonstrates the improved pipeline training process, which can dynamically adjust the pipeline scheduling based on the actual training performance during runtime. Lines 4 and 7 refer to the forward propagation calculation shown in lines 8-11 of Algorithm 3-2 and the backward propagation calculation shown in lines 12-16 of Algorithm 3-2, respectively.
[0077]
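One way to picture the dynamic adjustment of the batch interval is a small rule that narrows the interval under memory pressure and widens it while training does not slow down. This is a hypothetical sketch, not Algorithm 3: the function name `adjust_batch_interval`, the 0.9 memory threshold, and the upper bound of 8 are all assumed values chosen only for illustration.

```python
def adjust_batch_interval(interval, mem_fraction, batch_time, prev_batch_time,
                          mem_limit=0.9, max_interval=8):
    """Return the batch interval to use for the next scheduling window.

    mem_fraction    -- fraction of GPU memory currently in use (0.0 to 1.0)
    batch_time      -- average batch time in the current window (seconds)
    prev_batch_time -- average batch time in the previous window (seconds)
    """
    if mem_fraction > mem_limit:
        # Memory pressure: keep fewer batches in flight.
        return max(1, interval - 1)
    if batch_time <= prev_batch_time and interval < max_interval:
        # Memory headroom and no slowdown: try a wider interval.
        return interval + 1
    if batch_time > prev_batch_time:
        # Training slowed down: step back.
        return max(1, interval - 1)
    return interval
```

A training loop would call this once per measurement window and feed the result back into the scheduler.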
[0078] The training system was implemented with two inter-process communication schemes. The first uses an independent communication process to manage all inter-process communication, as shown in Figure 6. This method requires creating a dedicated communication process on each server to forward all communication during training. The initial system design adopted this simple, easy-to-implement approach, but inter-server communication then involves multiple copies through GPU memory, RAM, and the network, significantly increasing communication latency.
[0079] To improve data exchange throughput and reduce communication latency, this invention designs the distributed communication model shown in Figure 7, which enables direct communication between training processes. Specifically, based on the PyTorch distributed training backend gloo, a process group is created for each set of processes that need to communicate with each other. The communication thread within each process group uses the broadcast() function to transmit data and sends back an acknowledgment after receiving data.
[0080] This approach enables GPU-to-GPU communication, directly transmitting data via PCIe channels and GPU memory copying, eliminating the overhead of copying between GPU memory and system memory twice. Furthermore, the communication backend, gloo, is specifically optimized for PyTorch's large-scale tensor transfers, significantly improving tensor transfer efficiency between processes and reducing additional latency caused by communication control. In addition, this paper also designs a tensor receive buffer mechanism to improve the data exchange rate.
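The process-group communication described above can be sketched with the `torch.distributed` API. This is a minimal sketch under stated assumptions, not the system's implementation: the helper names `init_pair_group` and `send_tensor` are hypothetical, a file-based rendezvous is used only to keep the example self-contained, and the acknowledgment and receive-buffer mechanisms are omitted.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def init_pair_group(rank, world_size, peer_ranks, store_path):
    """Join the global gloo job, then create a group for the given peers.

    Every process in the job must call new_group, including processes
    that are not members of the group being created.
    """
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{store_path}",
        rank=rank,
        world_size=world_size,
    )
    return dist.new_group(ranks=peer_ranks)


def send_tensor(tensor, src_rank, group):
    """Broadcast tensor from src_rank to the other members of group.

    On the sender the tensor holds the data; on receivers it is a
    preallocated buffer of the same shape and dtype, filled in place.
    """
    dist.broadcast(tensor, src=src_rank, group=group)
    return tensor
```

In a two-subnetwork setup, the upstream process would call `send_tensor` with its activation tensor and the downstream process would call it with an empty buffer of the same shape; gradients flow the other way with the source rank swapped.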
[0081] 3. Invention Effects
[0082] 3.1 Test Environment
[0083] The test experiments were compared with the CryptGPU framework. Three neural networks, LeNet, AlexNet, and VGG16, were used to represent different network sizes, and three datasets, MNIST, CIFAR10, and Tiny Imagenet, were used to represent different dataset sizes. A total of five training configurations were used, and the specific information is shown in Table 1.
[0084] Table 1. Detailed information on the five training configurations.
[0085]
[0086] The experiment was conducted in a cloud server environment, using three Tencent Cloud "GPU Computing GN7.2XLARGE32" servers in the same region as the three computing servers in a multi-party secure computing environment. This model of server is equipped with eight Intel Xeon Platinum 8255C processor cores and one NVIDIA Tesla T4 GPU. The measured network bandwidth between any two servers was approximately 3Gbps, and the network latency was approximately 0.16 milliseconds.
[0087] 3.2 Test Results and Analysis
[0088] Experiment 1 Training Speed Test: In this experiment, the framework designed in this paper is compared with the CryptGPU framework. A binary splitting method is used, dividing each neural network to be trained into two sub-networks, each assigned to a process in the GPU for execution. This part of the experiment performed 5 rounds of training with a length of 101 batches for each of the five training configurations, and removed the first batch of data in each round to eliminate the error caused by GPU warm-up overhead.
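The measurement procedure, several rounds of batches with the first batch of each round discarded as warm-up, can be sketched as follows. The function name `mean_batch_time` is hypothetical, and the numbers in the usage note are synthetic placeholders, not measurements from the experiment.

```python
from statistics import mean


def mean_batch_time(round_times, warmup=1):
    """Average per-batch time over all rounds, skipping warm-up batches.

    round_times -- list of rounds, each a list of per-batch times (seconds)
    warmup      -- number of leading batches to drop from every round
    """
    kept = [t for batch_times in round_times for t in batch_times[warmup:]]
    return mean(kept)
```

For instance, `mean_batch_time([[9.0, 1.0, 1.0], [8.0, 2.0, 2.0]])` ignores the slow first batch of each round and averages the remaining four values.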
[0089] Experiment 1 Results: The experimental results are shown in Table 2 and Figure 8. In summary, the proposed framework improves training speed by more than 15% for each configuration, and the improvement is greater for configurations with smaller training loads.
[0090] Table 2 Experimental Results of Training Speed
[0091]
[0092] Analysis of Experiment 1: The framework designed in this paper can significantly improve the parallelism of the training process for small networks because, for smaller networks and datasets, operations such as convolution and matrix multiplication have lower computational costs and utilize fewer GPU cores. This allows the system to use the remaining GPU cores to simultaneously perform operations like ReLU, where communication is the primary bottleneck, without affecting computational efficiency. This minimizes resource contention between the two sub-networks, resulting in a significant improvement compared to sequential training of a single network. For configurations with higher training loads, such as the VGG16 network, the large number of layers and scale lead to very high dimensionality in convolution and matrix multiplication operations within linear computation layers. When one sub-network is performing linear computations, the remaining GPU cores are insufficient to support nonlinear function calculations, causing the two sub-networks to compete for GPU cores, thus reducing the computational performance of both. If both sub-networks happen to be performing extensive linear layer computations, the training time for each sub-network will increase significantly, and the probability of both sub-networks simultaneously entering subsequent nonlinear computations will increase, further reducing training speed and widening the gap with the theoretical training time. In other words, under the training framework designed in this paper, the longer the linear and nonlinear computations overlap between the two sub-networks, the longer they can run at full speed simultaneously, and the greater the relative improvement.
[0093] Experiment 2 GPU Utilization Test: The average GPU utilization of each training configuration was measured during training; the results are shown in Figure 9. The change in GPU utilization over a period of time was then measured for the five training configurations; the results are shown in Figure 10.
[0094] Experiment 2 Analysis: The experimental results show that under the pipeline scheduling strategy designed in this paper, the average GPU utilization of all training configurations reached over 70%. It can be seen that the lower the computational load of the training configuration, the lower the GPU utilization, and the greater the relative improvement. For training configurations with high computational load, the GPU is almost always at full capacity. Even so, the pipeline training system designed in this paper still achieved a 22% improvement in GPU utilization. GPU utilization is one of the optimization directions of the training system designed in this paper, and improving hardware utilization has a significant impact on the batch data throughput rate of the pipeline. For configurations with fewer network layers and a relatively high proportion of computation time for nonlinear layers, a significant drop in utilization is clearly observed during the training process of the CryptGPU system. However, for the latter three training configurations with very high computational load, the pipeline scheduling design can fully load the computational load of the two processes onto the GPU simultaneously, keeping the GPU utilization at a high level, demonstrating a very efficient scheduling and utilization of hardware resources.
[0095] Experiment 3, Convergence Speed and Accuracy Test: Finally, the convergence of the model was tracked by measuring the cross-entropy loss during training. Figure 11 shows the loss reduction for the five training configurations; the data show that the pipeline method has a very small impact on convergence speed.
[0096] Finally, the accuracy of the model after several training epochs was measured and compared with the accuracy of CryptGPU. The experimental results are shown in Table 3. It can be seen that the system has a very small impact on the accuracy of the model.
[0097] Analysis of Experiment 3: The impact is small mainly because the test was conducted with only two sub-networks, so the problem of outdated parameter versions caused by the pipeline training method was not pronounced and had little effect on the direction and distance of convergence. Furthermore, since CryptGPU's own floating-point simulation of integer calculation and its nonlinear function calculation already introduce some error, the relative impact of pipeline errors on model convergence efficiency is reduced to a certain extent.
[0098] Experimental Summary: Experimental results show that, compared with existing state-of-the-art methods, the multi-party secure computation training system designed in this paper achieves a significant improvement in training speed, reaching up to 34% on the best-performing training configuration. Regarding GPU utilization, this system shows a significant improvement for each training configuration, reaching a maximum of 94%. As for training effectiveness, the pipeline method did not significantly impact model accuracy.
[0099] Table 3 Validation set accuracy for each training configuration
[0100]
Claims
1. A high-performance multi-party secure computation training method based on GPU, characterized in that, The method includes: The subnetwork segmentation step is used to divide the entire neural network and enable the segmented subnetworks to run simultaneously, so that the computation of linear and nonlinear layers overlaps as much as possible. The pipeline training process runs two subnetworks as two processes on the same GPU, performing linear and nonlinear layer calculations respectively. The two processes can simultaneously utilize multiple resources of the computing server, such as GPU cores, video memory, and network bandwidth. At the same time, parameter caching and optimization of video memory usage are added to minimize contention for these computing resources. The algorithm for subnetwork segmentation is as follows: The algorithm optimizes the size of the subnetworks to achieve parallelism and minimizes the average batch throughput time of each subnetwork, thereby synchronizing the batch throughput of the subnetworks and improving pipeline utilization. Simultaneously, considering that adjacent subnetworks need to exchange outputs and gradients, the algorithm calculates the size of the neural network to assess the additional communication time overhead caused by segmentation. The goal of the subnetwork segmentation algorithm is to calculate, based on the computational unit topology of the multi-party secure computation system, the segmentation method with the shortest average batch training time. Each subnetwork is assigned to a different computational unit, so the subnetworks are trained in parallel. The total throughput time of each network layer, i.e., the total time spent on forward and backward propagation, is obtained under typical training conditions.
Let [...] represent the time taken by the slowest subnetwork in the optimally segmented pipeline when [...] workers at level [...] of the topology train the network between layer [...] and layer [...]; the optimal subnetwork segmentation problem can then be reduced to computing the minimum value [...], where [...] is the total number of levels in the topology and [...] is the total number of layers in the network. [...] represents the total forward and backward propagation time when the computing units at level [...] synchronously train, under bandwidth [...], the subnetwork from layer [...] to layer [...]. The subnetwork segmentation algorithm is a dynamic programming algorithm used to calculate the optimal subnetwork segmentation method. It uses nested loops to expand the computation, decomposing the problem from top to bottom. In each loop iteration, the optimal time, i.e. the optimal theoretical training time of each computational unit calculated from the bottom up, is used for comparison; the optimal segmentation method is selected for each subproblem, and finally the globally optimal segmentation method is obtained. [...] and [...] are calculated as follows: [...], if [...]; if [...], then a better partitioning exists for the first [...] layers of the network, and the partitioning point is [...], [...]; [...], if [...].
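As an illustration of the dynamic programming idea in claim 1, the following Python sketch handles a simplified case: a single-level topology, no replicated subnetworks, and a pipeline whose speed is set by its slowest stage. The names `partition_layers`, `layer_times`, and `cut_costs` are hypothetical, and taking the maximum of compute time and cut communication time is an assumption of this sketch, not the claimed formula.

```python
from itertools import accumulate


def partition_layers(layer_times, cut_costs, num_stages):
    """Split consecutive layers into num_stages subnetworks.

    layer_times[i]: forward + backward time of layer i.
    cut_costs[i]:   time to exchange outputs and gradients if the network
                    is cut between layer i and layer i + 1.
    Returns (bottleneck_time, cut_points), where each cut point is the
    number of layers placed before that cut.
    """
    n = len(layer_times)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the layer count")
    if len(cut_costs) != n - 1:
        raise ValueError("cut_costs needs one entry per layer boundary")

    prefix = [0.0] + list(accumulate(layer_times))

    def span(i, j):
        """Total time of layers i .. j-1."""
        return prefix[j] - prefix[i]

    inf = float("inf")
    # best[s][j]: smallest bottleneck when the first j layers form s stages
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    for j in range(1, n + 1):
        best[1][j] = span(0, j)

    for s in range(2, num_stages + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                # Assumption: the slowest of (earlier stages, the cut's
                # communication, the new last stage) limits throughput.
                cand = max(best[s - 1][i], cut_costs[i - 1], span(i, j))
                if cand < best[s][j]:
                    best[s][j] = cand
                    cut[s][j] = i

    # Walk the recorded cut points back from the full network.
    points, j = [], n
    for s in range(num_stages, 1, -1):
        j = cut[s][j]
        points.append(j)
    points.reverse()
    return best[num_stages][n], points
```

For per-layer times `[4, 1, 1, 4, 2]` with unit cut costs, two subnetworks are best split after the third layer, giving a bottleneck of 6 time units. A multi-level topology would add an outer loop over levels with a per-level bandwidth, which this sketch leaves out.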
2. The GPU-based high-performance multi-party secure computation training method according to claim 1, characterized in that, The algorithm for pipeline training is as follows: Pipeline training employs a scheduling strategy that alternates between forward and backward propagation. After each subnetwork computes the forward propagation of the k-th batch, it sends the output tensor of its last layer to the downstream subnetwork as that subnetwork's input, then switches to backward propagation mode to receive the gradients returned by the downstream subnetwork, compute the backward propagation gradient of the (k-x)-th batch, and update its own parameters. Next, the subnetwork sends the gradient calculation results to the upstream subnetwork for its backward propagation and receives the output tensor of the upstream subnetwork for the forward propagation of the (k+1)-th batch.
3. The GPU-based high-performance multi-party secure computation training method according to claim 2, characterized in that, In the process of implementing pipeline training, A parameter caching method is used to avoid parameter version inconsistencies. The parameter cache stores multiple parameter versions in each subnetwork, where [...] denotes the parameter version used by subnetwork a during the computation of batch b. Subnetworks closer to the input side need to cache more parameter versions. During forward propagation, each subnetwork uses the latest parameter version for computation. After subnetwork 1 completes the forward computation of the first batch, this parameter version is [...]. The parameter caching mechanism ensures that the backpropagation calculation of each batch within the subnetwork is valid, guaranteeing parameter consistency within the subnetwork. The parameters are cached, and during the backpropagation of the first batch, this parameter version is retrieved for the backpropagation calculation. The gradient correction values resulting from parameter updates are then calculated and merged into the gradients of the next batch. Finally, this parameter version is discarded.
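The cache lifecycle in claim 3 (store the version used in a batch's forward pass, retrieve it for that batch's backward pass, then discard it) can be sketched as follows. The class `ParameterCache` and its method names are hypothetical, parameters are modelled as plain Python containers, and the gradient correction step is left out of this sketch.

```python
import copy


class ParameterCache:
    """Per-subnetwork store of parameter versions, keyed by batch id."""

    def __init__(self):
        self._versions = {}

    def stash(self, batch_id, params):
        """Save a snapshot of the parameters used for this batch's forward pass."""
        # Deep copy so that later in-place updates do not alter the snapshot.
        self._versions[batch_id] = copy.deepcopy(params)

    def retrieve_and_discard(self, batch_id):
        """Return the snapshot for this batch's backward pass and drop it."""
        return self._versions.pop(batch_id)

    def __len__(self):
        """Number of parameter versions currently held."""
        return len(self._versions)


# Usage: batch 1 runs forward, the parameters are then updated by another
# batch, and batch 1's backward pass still sees the version it started with.
cache = ParameterCache()
params = {"w": [1.0, 2.0]}
cache.stash(1, params)
params["w"][0] = 0.5          # an update applied between forward and backward
cache.stash(2, params)
old = cache.retrieve_and_discard(1)
```

The number of snapshots held at once equals the number of batches in flight for that subnetwork, which is why subnetworks nearer the input need to cache more versions.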
4. The GPU-based high-performance multi-party secure computation training method according to claim 3, characterized in that, The pipeline training algorithm is a pipeline algorithm based on a parameter caching mechanism, which includes: a preparation phase, in which each sub-network performs forward propagation of different batches according to its position; a stabilization phase, in which each sub-network alternates between forward and backward computation; and a termination phase, in which each sub-network performs unfinished backward propagation computation.
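The three phases in claim 4 can be made concrete with a small schedule generator. This is a sketch under stated assumptions: `stage_schedule` is a hypothetical name, stages are indexed from 0 at the input side, and the warm-up count `num_stages - 1 - stage` is the usual one-forward-one-backward choice rather than a value given in the claims.

```python
def stage_schedule(stage, num_stages, num_batches):
    """List the operations one subnetwork performs, in order.

    Each entry is ("F", k) for the forward pass of batch k or ("B", k)
    for the backward pass of batch k; batches are numbered from 1.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage index out of range")

    # Assumption: earlier stages run more forward passes before their
    # first backward pass, because gradients reach them later.
    warmup = min(num_stages - 1 - stage, num_batches)
    ops = []

    # Preparation phase: forward passes only.
    for k in range(1, warmup + 1):
        ops.append(("F", k))

    # Stabilization phase: alternate one forward and one backward pass.
    next_backward = 1
    for k in range(warmup + 1, num_batches + 1):
        ops.append(("F", k))
        ops.append(("B", next_backward))
        next_backward += 1

    # Termination phase: finish the outstanding backward passes.
    for k in range(next_backward, num_batches + 1):
        ops.append(("B", k))
    return ops
```

With two subnetworks and three batches, the input-side subnetwork runs F1, F2, B1, F3, B2, B3, while the output-side subnetwork runs F1, B1, F2, B2, F3, B3. In the stabilization phase the backward pass lags the forward pass by the number of warm-up batches.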
5. The GPU-based high-performance multi-party secure computation training method according to claim 4, characterized in that, In the implementation of pipeline training, the inconsistencies in data between subnetworks caused by parameter caching are corrected, specifically as follows: Suppose a pipeline with no replicated subnetworks contains n subnetworks, and the parameters of each subnetwork are represented as [...], ..., [...]. After training through t batches, the parameters are represented as [...], [...], and so on. After each batch passes, the average gradient of all data in that batch, [...], is calculated, where f represents the loss function. Assuming the learning rate is v, the original parameter update method is: [...]. Because of parameter caching, the parameters in subnetwork 1 use the version from n-1 batches earlier, the parameters in subnetwork 2 use the version from n-2 batches earlier, and so on; the result of the parameter update is: [...]. To ensure the consistency of parameters across the entire model, the parameter update method is changed to: [...].
6. The GPU-based high-performance multi-party secure computation training method according to claim 5, characterized in that, In the implementation of pipeline training, pipeline bubbles caused by communication time are reduced by increasing the number of batch intervals between two adjacent subnetworks. The number of batch intervals needs to be dynamically adjusted according to the memory usage and the actual training speed.
7. A high-performance multi-party secure computation training system based on GPU, characterized in that, The system has a program module corresponding to the steps of any one of the claims 1-6 above, and executes the steps in the GPU-based high-performance multi-party secure computation training method described above when running; The program modules include: The subnetwork segmentation module is used to segment the entire neural network and enable the segmented subnetworks to run simultaneously, so that the calculations of linear and nonlinear layers overlap as much as possible. The pipeline training module is used to run two sub-networks as two processes on the same GPU, respectively running the computation of linear and non-linear layers. The two processes can simultaneously utilize multiple resources such as GPU cores, video memory, and network bandwidth of the computing server; at the same time, parameter caching and optimization of video memory usage are added to minimize the contention of these computing resources.
8. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores a computer program configured to implement, when invoked by a processor, the steps of the GPU-based high-performance multi-party secure computation training method according to any one of claims 1-6.