Information transmission system based on federated learning

By establishing a direct link between the central server and the participants in federated learning to transmit task logs, the problem of excessive bandwidth load on the central server is solved, training speed and data throughput are improved, and the security and privacy of federated learning are ensured.

CN116305029B · Active · Publication Date: 2026-06-23 · HANGZHOU YIKANG HUILIAN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HANGZHOU YIKANG HUILIAN TECH CO LTD
Filing Date: 2023-02-14
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

During federated learning training, viewing task logs can affect data throughput and the speed of federated learning. Excessive bandwidth load on the central server can also impact training speed and efficiency.

Method used

By establishing a link channel between the central server and the participants, task logs and other information can be transmitted directly, avoiding data transmission on the central server and reducing bandwidth load.

Benefits of technology

This effectively reduces the data transmission burden on the central server, improves data throughput and training speed, and ensures the security and privacy of federated learning.



Abstract

The application relates to a federated learning-based information transmission system, comprising a central server and a plurality of participants for executing a training task. Task information of the training task comprises first type information stored in the central server and second type information stored only in each participant. When the central server receives a user's request to view the first type information, it provides the corresponding first type information according to the user's authority; when the central server receives a user's request to view the second type information, it provides, according to the user's authority, a link channel connected to the corresponding participant for accessing the second type information. Because the user accesses the second type information through this link channel, the second type information is not uploaded to or downloaded from the central server, and viewing task logs does not affect the data throughput or the speed of federated learning training.

Description

Technical Field

[0001] This application relates to the field of privacy protection technology in federated learning, and in particular to an information transmission system based on federated learning.

Background Technology

[0002] Federated learning training records the entire process through task logs. The longer the training period, the larger the task log file becomes. Querying the task logs is essential in federated learning training and is typically done for development debugging and user viewing. This is because the task logs record the time differences between machines in each round, the task's running status, the training time for each participant in each round, the status of each physical machine involved in the training, and so on. Task metrics alone do not offer this richness of queryable information.

[0003] Generally, task logs provided by federated learning are distributed externally through a central server node, which is more intuitive and transparent to users. However, during federated learning training, the central server continuously uploads and downloads data, simultaneously collaborating with multiple participants to complete the training. The large size of these task log files can negatively impact the central server's bandwidth. This is because the central server needs to synchronize the task logs from the physical machine where they reside. This data traffic affects gradient propagation during training, hindering the central server's normal control and model parameter updates during federated learning. Consequently, network transmission time increases as a proportion of the overall training time, prolonging the training duration.

[0004] In summary, viewing task logs during federated learning training currently impacts data throughput and the speed of federated learning training.

Summary of the Invention

[0005] Therefore, it is necessary to provide an information transmission system based on federated learning to address the aforementioned technical problems.

[0006] This application provides an information transmission system based on federated learning, which includes a central server for executing training tasks and multiple participants. The task information of the training tasks includes a first type of information stored in the central server and a second type of information stored only in each of the participants.

[0007] When the central server receives a user's request to view the first type of information, it provides the corresponding first type of information according to the user's permissions;

[0008] When the central server receives a user's request to view the second type of information, it provides a link channel connecting the relevant participants to access the second type of information, based on the user's permissions.

[0009] Optionally, the first type of information includes a link to the second type of information;

[0010] When the central server provides the link channel, the user who initiates the request and the participant storing the second type of information directly transmit data.

[0011] Optionally, the first type of information may also include at least one of the following: task identifier, participant information, task status information, and task performance metrics.

[0012] The second type of information includes task logs.

[0013] Optionally, each of the participating parties includes a physical machine cluster. The central server communicates with each of the participating parties to obtain resource usage information of each physical machine in the physical machine cluster. The resource usage information belongs to the participating party information and includes CPU information, memory information, GPU information, and hard disk information.

[0014] Optionally, the task logs are stored on the physical machines within each participating party. When the central server provides the link channel, the user initiating the request can directly transmit data with the physical machines within the corresponding participating party through the link channel.

[0015] Optionally, the central server includes a front-end and a back-end that periodically exchange information, with the first type of information stored in the back-end;

[0016] The backend communicates with each of the participating parties to receive user requests to view the first type of information;

[0017] The front end obtains the first type of information and displays it visually according to the user's permissions.

[0018] Optionally, the users include administrators and all the participating parties:

[0019] When the backend receives a request from each of the participants to view the task information, it provides the first type of information corresponding to the requesting participant and sends it to the frontend for viewing.

[0020] When the backend receives a request from the administrator to view the task information, it provides all the first type of information and sends it to the frontend for viewing.

[0021] Optionally, the task metrics include local metrics and global metrics.

[0022] The local metrics reflect the quality of the local training set and include the local training loss and the performance on the classification task or object detection task.

[0023] The global metrics include the metrics reflected by the model obtained after completing the training task on the test set or validation set.

[0024] Optionally, the task status information includes: task submission, task initialization, task running, task early termination, and task end.

[0025] Task submission includes: one of the participants initiating a request to the central server to establish the training task, which is then confirmed by the central server;

[0026] Task initialization includes: selecting the corresponding physical machines from each of the participating parties to participate in the training task based on the resource occupancy information obtained from the central server;

[0027] Early termination of a task includes: ending the training task early when the convergence condition of the algorithm in the training task is met.

[0028] Optionally, the completion of the task includes: ending the training task, and the central server saving, backing up, and distributing the model obtained from the completion of the training task to each of the participating parties.

[0029] The method of the information transmission system based on federated learning in this application has at least the following effects:

[0030] When a user accesses the second type of information, a link channel is provided for accessing it. This avoids uplink and downlink transfer of the second type of information through the central server and prevents task log viewing from affecting data throughput and the speed of federated learning training.

Attached Figure Description

[0031] Figure 1 is a functional flowchart of an information transmission system based on federated learning in one embodiment of this application;

[0032] Figure 2 is a schematic diagram of the structure of an information transmission system based on federated learning in one embodiment of this application.

Detailed Implementation

[0033] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0034] Currently, viewing task logs during federated learning training impacts data throughput and training speed. To resolve this technical issue, referring to Figure 1 and Figure 2, one embodiment of this application provides an information transmission system based on federated learning (hereinafter referred to as the information transmission system), including a central server for executing training tasks and multiple participants. The task information of the training tasks includes a first type of information stored on the central server and a second type of information stored only on each participant. When the central server receives a request from a user to view the first type of information, it provides the corresponding first type of information according to the user's permissions; when the central server receives a request from a user to view the second type of information, it provides, according to the user's permissions, a link channel connecting to the corresponding participant for accessing the second type of information.

[0035] When a user accesses the second type of information, a link channel is provided for this access, so the second type of information is not transmitted upstream or downstream through the central server. This prevents task log viewing from affecting data throughput and the speed of federated learning training. The second type of information includes task logs.

[0036] Furthermore, the first type of information includes links pointing to the second type of information; when the central server provides a link channel, the user initiating the request and the participant storing the second type of information directly transmit data. The second type of information includes task logs.

[0037] In this embodiment, the central server acts as a third-party server to organize and complete the entire federated learning training. The participants include the task initiator who initiates the training task and the data providers who participate in the training task. During federated learning training, the task initiator can also provide its own data for training. The fundamental difference between the first type of information and the second type of information lies in the location of their actual physical storage. The first type of information is stored on the central server, specifically in its database. The second type of information is actually stored on the physical machines of the participating parties.

[0038] Since the first type of information includes links pointing to the second type of information, obtaining these links requires access to the central server. The central server can control user permissions, ensuring the confidentiality of federated learning training. This embodiment transmits task logs over the network without routing them through the central server (a decentralized approach), which avoids the impact of task logs on the central server's data throughput and ensures the security of federated learning.
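The request handling described in paragraphs [0034] to [0038] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class names, the `permissions` mapping, and the `/logs/<task_id>` link layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    status: str
    # Hypothetical: participant name -> base URL of that participant's log endpoint.
    participants: dict


@dataclass
class CentralServer:
    tasks: dict = field(default_factory=dict)        # task_id -> Task
    permissions: dict = field(default_factory=dict)  # user -> set of viewable task_ids

    def _check(self, user, task_id):
        # The central server gates every request on the user's permissions.
        if task_id not in self.permissions.get(user, set()):
            raise PermissionError(f"{user} may not view task {task_id}")
        return self.tasks[task_id]

    def view_first_type(self, user, task_id):
        # First type of information is stored on, and served by, the central server.
        task = self._check(user, task_id)
        return {
            "task_id": task.task_id,
            "status": task.status,
            "participants": sorted(task.participants),
        }

    def view_second_type(self, user, task_id, participant):
        # Second type of information (task logs) is never relayed: only a link
        # pointing at the participant is returned.
        task = self._check(user, task_id)
        return f"{task.participants[participant]}/logs/{task.task_id}"
```

The point of the design is that `view_second_type` returns a string, not log content, so the log bytes never cross the central server's network interface.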

[0039] In general, task information can be divided into two aspects: first, the information of the distributed cluster, that is, the information of the participants, including the hardware information (resource usage information) of each physical machine in the participants, so that the administrator (engineer) can select the appropriate physical machines or physical machine clusters of the participants for scheduling during training; second, the task configuration information of the federated learning training task.

[0040] In one embodiment, the first type of information, in addition to links to the second type of information, also includes at least one of the following: task identifier, participant information, task configuration information, task status information, and task metrics. The task identifier is a unique identifier ID for a training task; the task configuration information includes the algorithm or strategy used by the task; and the participant information includes a list of participants, the datasets of each participant, their respective training progress, and validation set accuracy.

[0041] Task configuration information can be organized in two different ways: by participant or by task. These two methods refer to the different ways task configuration information is displayed within the user's permissions. When organized by participant, users can view training records, current training progress, and training logs for each participant's task. When organized by task, users can view the completion status of training tasks, model ownership, and available algorithms. In a task-based information transmission system, users often prioritize the timeliness of tasks, such as checking the accuracy of the model during training, the task's running status, or downloading task results. Therefore, the backend of the central server should first check whether the user has permission to access the task and its results or model.

[0042] As shown in Figure 2, each participant includes a physical machine cluster (target cluster). The central server communicates with each participant to obtain the resource usage information of each physical machine in the physical machine cluster. The resource usage information belongs to the participant information and includes CPU information, memory information, GPU information and hard disk information.

[0043] It is understandable that participants in federated learning training participate in the form of physical machine clusters, although it is possible that some participants participate with a single physical machine. Since each participant may have multiple different physical machines, and tasks may be executed on different physical machines of the same participant, relevant information should be recorded in the database.

[0044] In a federated learning-based information transmission system, each participant possesses a different dataset and different computing power (physical machines). Therefore, integrating information from these participants is also a function of the information transmission system. For each participant, the information that needs to be recorded includes the tasks they have participated in, the tasks they are currently undertaking, and the task logs for each task.

[0045] In one embodiment, task logs are stored on physical machines within each participating party. When the central server provides a link channel, the user initiating the request can directly transmit data with each physical machine within the corresponding participating party through the link channel.

[0046] The federated learning-based information transmission system creates an independent link for each task of each participant, facilitating the sharing of task logs and providing corresponding download access. These links are interfaces provided by HTTP servers on the physical machines, within the participants, where the task logs reside. When users access task logs, they do not go through the central server; instead, they obtain the source files directly from the physical machines that the participants used for training, achieving zero-copy of task logs throughout the entire training process.
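A per-task log link of this kind could be served by a small HTTP endpoint on each physical machine. The sketch below uses only the Python standard library; the `/logs/<task_id>` path, the `<task_id>.log` file naming, and the function names are assumptions for illustration, and authentication is deliberately omitted.

```python
import http.server
import pathlib
import threading


def make_log_handler(log_dir):
    """Build a handler that serves <log_dir>/<task_id>.log at /logs/<task_id>."""
    root = pathlib.Path(log_dir).resolve()

    class LogHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            parts = self.path.strip("/").split("/")
            # Accept only /logs/<task_id> with a simple id; this also blocks
            # path traversal such as "../".
            valid = (
                len(parts) == 2
                and parts[0] == "logs"
                and parts[1].replace("-", "").replace("_", "").isalnum()
            )
            path = root / f"{parts[1]}.log" if valid else None
            if path is None or not path.is_file():
                self.send_error(404)
                return
            data = path.read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def log_message(self, *args):
            # Silence the default per-request console output.
            pass

    return LogHandler


def serve_logs(log_dir, host="127.0.0.1", port=0):
    """Start the log endpoint in a background thread; port=0 picks a free port."""
    server = http.server.ThreadingHTTPServer((host, port), make_log_handler(log_dir))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A user holding the link then fetches the log straight from the participant's machine, for example `http://<machine>:<port>/logs/t1`, without involving the central server.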

[0047] By directly transmitting data through the link channel to each physical machine, the load can be evenly distributed across all physical machines, preventing any participant from becoming a performance bottleneck by relying on a single physical machine to provide access. Furthermore, the information transmission system caches each task log in the backend, appending only the updated log to the end each time an API is called.
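The "append only the updated log" behaviour in paragraph [0047] can be illustrated with a byte-offset cache. This is a sketch under assumptions: the class name and the truncation handling are not from the patent, and a multi-byte character split across two reads would be decoded with a replacement character.

```python
import os


class LogTailCache:
    """Cache log text per file and read only the bytes appended since the last call."""

    def __init__(self):
        self._offsets = {}  # path -> number of bytes already cached
        self._cached = {}   # path -> cached text

    def refresh(self, path):
        """Read the newly appended part of the log, cache it, and return it."""
        offset = self._offsets.get(path, 0)
        if os.path.getsize(path) < offset:
            # Assumption: a shrinking file means truncation or rotation, so restart.
            offset = 0
            self._cached[path] = ""
        with open(path, "rb") as f:
            f.seek(offset)
            new_bytes = f.read()
        self._offsets[path] = offset + len(new_bytes)
        new_text = new_bytes.decode("utf-8", errors="replace")
        self._cached[path] = self._cached.get(path, "") + new_text
        return new_text

    def full_text(self, path):
        """Return everything cached so far for this log."""
        return self._cached.get(path, "")
```

Each API call then costs only the size of the new log tail rather than the size of the whole file, which matters because the logs grow with training time.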

[0048] For a single physical machine, two auxiliary Python modules, GPUtil and psutil, are used to obtain its resource usage information in a structured form. GPUtil uses the nvidia-smi command provided with the NVIDIA graphics card driver to obtain the available GPU hardware information of the physical machine, including memory utilization, GPU power information, GPU temperature information, driver information, and device ID.

[0049] The psutil module can monitor the system's running status, similar to commands like ps, top, free, and du in Linux systems. However, psutil is a cross-platform module, supporting operating systems such as Linux / UNIX / OSX / Windows. It can easily obtain CPU, memory, disk, network, and process information. Using this information, federated learning researchers and engineers can choose which physical machines to use for training, thereby achieving better load balancing.
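A hedged sketch of how a single machine might gather this information with psutil and GPUtil is given below. The function name and the dictionary layout are assumptions; both libraries are optional here, so the function degrades to empty fields when they are not installed or when no NVIDIA driver is present.

```python
import os


def collect_resource_usage():
    """Return a JSON-serializable snapshot of CPU, memory, disk, and GPU usage."""
    usage = {"cpu": None, "memory": None, "disk": None, "gpus": []}

    try:
        import psutil
    except ImportError:
        psutil = None  # psutil not installed: leave CPU/memory/disk as None
    if psutil is not None:
        usage["cpu"] = {
            "percent": psutil.cpu_percent(interval=None),
            "logical_cores": psutil.cpu_count(),
        }
        mem = psutil.virtual_memory()
        usage["memory"] = {"total_bytes": mem.total, "percent": mem.percent}
        disk = psutil.disk_usage(os.path.abspath(os.sep))  # root of the current drive
        usage["disk"] = {"total_bytes": disk.total, "percent": disk.percent}

    try:
        import GPUtil
        gpus = GPUtil.getGPUs()  # queries nvidia-smi under the hood
    except Exception:
        gpus = []  # GPUtil missing or unusable on this machine
    for gpu in gpus:
        usage["gpus"].append({
            "id": gpu.id,
            "name": gpu.name,
            "load": gpu.load,
            "memory_util": gpu.memoryUtil,
            "temperature_c": gpu.temperature,
            "driver": gpu.driver,
        })
    return usage
```

Because the result is a plain dictionary, it can be serialized and sent to the backend as-is, and an engineer can compare snapshots across machines when choosing where to schedule training.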

[0050] In one embodiment, the central server includes a front-end (front-end server) and a back-end (back-end server) that periodically exchange information. The first type of information is stored in the back-end, which communicates with each participating party to receive user requests to view the first type of information. The front-end obtains the first type of information and displays it visually according to the user's permissions.

[0051] Specifically, when a user views the visualization provided by the front-end, the back-end retrieves the user's cluster and the physical machines they can view from the database based on their permissions. Generally, these machines represent all the physical machines of the user's participating party. The back-end then communicates with the external interfaces of each physical machine in the cluster via socket communication, concurrently initiating processes on each physical machine to view physical resources and updating them according to pre-defined schedules. This information is then aggregated and transmitted to the back-end machines via socket communication.

[0052] The backend then sends information about the physical machines in the cluster to the frontend in JSON format at regular intervals. Each time the frontend receives data, it updates the page, providing a real-time preview of the physical machine resources and allowing for transparent viewing of the physical cluster's system information. After the user closes the page, the backend performs garbage collection and concurrently terminates the system resource monitoring processes on the physical machines.
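The periodic JSON push and the cleanup on page close can be sketched with a stoppable background thread. The names `MonitorSession`, `collect`, and `send` are hypothetical; in a real deployment `send` would write to whatever channel connects the backend to the frontend.

```python
import json
import threading


class MonitorSession:
    """Push collected resource data as JSON at a fixed interval until closed."""

    def __init__(self, collect, send, interval_s=1.0):
        self._collect = collect        # callable returning a JSON-serializable dict
        self._send = send              # callable taking the JSON string
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._send(json.dumps(self._collect()))
            # wait() returns early when close() sets the event.
            self._stop.wait(self._interval_s)

    def close(self):
        """Called when the user closes the page: stop monitoring and release the thread."""
        self._stop.set()
        self._thread.join()
```

Using an event rather than a sleep loop means that closing the page stops the monitoring work promptly instead of after the next full interval.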

[0053] In addition, the federated learning system in this application is based on Flask and Jinja2, which has lightweight logic and code dependencies, enabling rapid deployment on various physical machines. At the same time, some parsing and request logic is implemented in the user's browser to reduce the pressure on the central server.

[0054] Due to the high requirements for data privacy, the training process of federated learning is distributed, involving access control of various participants and complex authentication and encryption operations, which greatly increases the complexity of training, testing and auditing tasks.

[0055] As shown in Figure 2, in one embodiment, users include administrators and various participants. When the backend receives a request from a participant to view task information, it provides the first type of information corresponding to the requesting participant and sends it to the frontend for viewing. When the backend receives a request from an administrator to view task information, it provides all the first type of information and sends it to the frontend for viewing.

[0056] The backend will communicate with the participants' external interfaces according to their permissions. The participants will then locate the physical machine actually executing the task within their own physical machine cluster and communicate with it. Subsequently, task-related information will be aggregated layer by layer to the server backend and displayed to the user. For the task initiator, the task information they receive includes not only their own information but also the information of the task participants.

[0057] When users view task information organized by participating entities, their login permissions must be restricted. Specific permissions can be categorized as unrelated parties, participating parties, and task initiators, with permissions increasing in that order. For privacy protection purposes, ordinary users of a single participating party should only be able to view partial information about tasks they are involved in and all information about tasks they have initiated, while administrators should be able to view task information initiated and participated in by all participating parties. As shown in Figure 2, the physical machine cluster of the participating party (target cluster) sends a viewing request to the backend. The backend performs permission checks, provides the first type of information, and refreshes the page in real time.
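The permission tiers above can be expressed as a simple filter. This is an illustrative sketch only: the dictionary layout is hypothetical, and which fields count as "partial information" for an ordinary participant is an assumption (here, the task status plus the participant's own record).

```python
def visible_task_info(user, is_admin, task):
    """Return the part of a task's first-type information that `user` may see.

    `task` is a dict with keys 'initiator', 'status', and 'participants'
    (participant name -> that participant's record). Returns None for
    unrelated parties.
    """
    if is_admin or user == task["initiator"]:
        # Administrators and the task initiator see every participant's record.
        return {"status": task["status"], "participants": dict(task["participants"])}
    if user in task["participants"]:
        # An ordinary participant sees only its own record (assumed scope).
        return {
            "status": task["status"],
            "participants": {user: task["participants"][user]},
        }
    return None  # unrelated party: nothing is visible
```

For example, with an initiator `a` and a second participant `b`, a request from `b` returns only `b`'s record, while a request from `a` or from an administrator returns both.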

[0058] For a single participant, the backend needs to communicate with that participant's external interface via socket communication to read the task information recorded in the database, including the participant's role in the task, task logs, and task results or models. This data is then sent by the backend to the frontend and displayed to the user. When the user is an administrator, the above operations of reading task information and sending it to the frontend are performed in each participant, and then aggregated and sent to the server backend before being sent to the frontend for the administrator to view.

[0059] The federated learning-based information transmission system provided in this embodiment not only allows for visual viewing of the federated learning training process but also makes the training process transparent, enabling administrators to monitor it as if it were local training. This helps engineers identify factors that significantly impact training speed and model accuracy, allowing for performance optimization. Federated learning researchers and engineers can use the information transmission system provided in each embodiment to monitor the distributed training process, obtain training logs, and compare training results as if it were local training. The information transmission system provided in this application is a federated learning information monitoring system integrating auditing, training, maintenance, and testing information. Most of the data and information researchers need when researching and implementing federated learning algorithms can be obtained through this system.

[0060] A task in federated learning refers to a training session in a broad sense, which may involve only the initiator or many different data providers and computing power providers. For a given task, task information includes task logs, task metrics (task-related metrics), and task status.

[0061] Task metrics include local metrics and global metrics. Local metrics reflect the quality of the local training set, including the local training loss and the performance on the classification or object detection task. Global metrics include the metrics reflected by the model on the test or validation set after completing the training task.

[0062] Task metrics can be downloaded for offline analysis, such as visualization and quantitative analysis. Specifically, for task-related metrics, in addition to classification display, the information transmission system provides options to select the real-time refresh interval and download metrics for offline analysis. These metrics can be divided into local metrics and global metrics. Generally, local metrics only include the loss, accuracy (for classification tasks), or AP (for object detection tasks) from local training; these metrics can reflect the quality of the local training set relatively well. Global metrics include the model's performance on the test or validation set; these metrics reflect the performance of the trained model and the training process. Metrics are stored in TensorBoard's proto format and CSV format. The former allows for better visualization using TensorBoard, while the latter enables quantitative analysis of the metrics.
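For the CSV side of metric storage, a minimal sketch might look like the following. The column names in `FIELDS` and both helper functions are assumptions made for illustration; the TensorBoard proto format is not shown.

```python
import csv

# Hypothetical column layout: "scope" is "local" or "global"; "participant"
# is left empty for global metrics.
FIELDS = ["round", "scope", "participant", "metric", "value"]


def write_metrics_csv(rows, fileobj):
    """Write metric rows (dicts keyed by FIELDS) to an open text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def mean_by_round(fileobj, scope, metric):
    """Offline analysis example: average one metric per training round."""
    values = {}
    for row in csv.DictReader(fileobj):
        if row["scope"] == scope and row["metric"] == metric:
            values.setdefault(int(row["round"]), []).append(float(row["value"]))
    return {rnd: sum(v) / len(v) for rnd, v in sorted(values.items())}
```

A downloaded CSV file can be passed to `mean_by_round` to compare, say, the average local training loss across participants against the global validation metric for the same round.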

[0063] Task status information includes: task submission, task initialization, task execution, early task termination, and task completion. Task submission includes one participant initiating a request to the central server to establish a training task, which is then confirmed by the central server. Task initialization includes selecting appropriate physical machines from each participant to participate in the training task based on resource usage information obtained from the central server. Early task termination includes ending the training task early when the convergence conditions of the algorithm in the training task are met. Task completion includes ending the training task; the central server saves, backs up, and distributes the model obtained from the completed training task to each participant.

[0064] Specifically, task status information refers to the task's running status. The information transmission system further refines this running status, dividing it into six states: task submission, task initialization, task running, task premature termination, task end, and task completion. Among these:

[0065] Task initialization can be understood as the process of establishing connections and distributing configurations among participating parties. It requires coordinating the preparation of datasets and computing resources by using the directed acyclic graph generated by the central server (central node). The system confirms in advance whether the task initiator has the appropriate permissions and whether all participants are in the correct initialization state; a failure at any step causes the entire task to fail and returns the corresponding error reason to the user. This design reduces the system's flexibility to some extent, but avoids wasting computing resources.
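The all-or-nothing check described in this paragraph can be sketched as a single function that either succeeds or returns an error reason. The function name, its arguments, and the wording of the reasons are hypothetical; the directed acyclic graph coordination is not modelled.

```python
def initialize_task(initiator, authorized_initiators, participant_ready):
    """Return (ok, reason) for a task initialization attempt.

    `participant_ready` maps participant name -> True when that participant
    has prepared its dataset and computing resources.
    """
    if initiator not in authorized_initiators:
        return False, f"initiator {initiator!r} lacks permission"
    not_ready = sorted(name for name, ready in participant_ready.items() if not ready)
    if not_ready:
        # One unready participant fails the whole task before any compute is spent.
        return False, "participants not ready: " + ", ".join(not_ready)
    return True, ""
```

Failing fast here is the trade-off the text describes: the task cannot start in a degraded form, but no machine begins training for a task that cannot complete.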

[0066] Early stopping of a task corresponds to the termination of training prematurely upon reaching convergence conditions in certain algorithms. When the entire training process tends towards convergence, the training task is stopped early and enters the final state. This setting is to avoid model overfitting and shorten training time. There are generally two ways to use early stopping algorithms in federated learning: one is for multiple participants to independently determine whether early stopping is appropriate, and the other is for the central node to make a unified decision. This embodiment chooses the second method, which offers stronger robustness and avoids inconsistencies in the system's state.
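A centrally decided early stop could be sketched as below. The patent does not specify the convergence condition, so the patience-based rule on a global validation loss used here is an assumed, commonly used criterion, and the class and parameter names are hypothetical.

```python
class CentralEarlyStopper:
    """Central-node early stopping: one decision applied to all participants."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # rounds without improvement tolerated
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = float("inf")
        self.stale_rounds = 0

    def should_stop(self, global_val_loss):
        """Record this round's global validation loss and decide whether to stop."""
        if global_val_loss < self.best - self.min_delta:
            self.best = global_val_loss
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
        return self.stale_rounds >= self.patience
```

Because the decision is made once at the central node after each aggregation round, every participant receives the same stop signal, which is the consistency benefit the paragraph attributes to the second method.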

[0067] Task completion is a temporary state, and it is a more complex state in federated learning than in local training. Once the task reaches this state, the central node saves, backs up, and distributes the model, computation results, and training records. The central node holds a copy of the results until all participants have received them. Task end is the follow-up to task completion; at this point, an interface for model download is provided to authorized users. This task status information helps users better understand the current progress of the task.
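The six task states can be modelled as a small state machine. The transition table below is an assumption for illustration: the patent lists the states but not the allowed transitions, and this sketch treats "task end" as the state that follows the temporary "task completion" state.

```python
from enum import Enum


class TaskState(Enum):
    SUBMITTED = "task submission"
    INITIALIZING = "task initialization"
    RUNNING = "task running"
    EARLY_STOPPED = "task premature termination"
    COMPLETED = "task completion"  # temporary: save, back up, distribute results
    ENDED = "task end"             # model download offered to authorized users


# Assumed transitions; not specified in the patent text.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.INITIALIZING},
    TaskState.INITIALIZING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.EARLY_STOPPED, TaskState.COMPLETED},
    TaskState.EARLY_STOPPED: {TaskState.COMPLETED},
    TaskState.COMPLETED: {TaskState.ENDED},
    TaskState.ENDED: set(),
}


def advance(current, target):
    """Move a task to `target`, rejecting transitions not in the assumed table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value!r} to {target.value!r}")
    return target
```

Keeping the transitions in one table makes it straightforward for the backend to report a consistent status to every user who queries the task.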

[0068] The federated learning-based information transmission system provided in the embodiments of this application can realize most of the functions required by researchers and engineers when conducting algorithm research and implementation. It offers an experience similar to local single-machine training, reducing the negative impact of federated learning training on bandwidth. Furthermore, performance and communication optimizations have been made to minimize its resource consumption in the distributed system and distribute the load evenly across each server in the system, avoiding the bottleneck effect in distributed systems.

[0069] The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as the combination of these technical features does not contradict each other, it should be considered to be within the scope of this specification. When technical features of different embodiments are embodied in the same drawing, it can be regarded as the drawing also disclosing examples of combinations of the various embodiments involved.

[0070] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. An information transmission system based on federated learning, comprising a central server for performing training tasks and multiple participating parties, characterized in that: the training task information includes a first type of information stored on the central server and a second type of information stored only on each of the participating parties; the second type of information includes task logs, which are stored on physical machines within each participating party; when the central server receives a user's request to view the first type of information, it provides the corresponding first type of information according to the user's permissions; when the central server receives a user's request to view the second type of information, it provides a link channel connecting the corresponding participants for accessing the second type of information, based on the user's permissions; when the central server provides the link channel, the user who initiated the request can directly transmit data with each physical machine within the corresponding participant through the link channel.

2. The information transmission system based on federated learning according to claim 1, characterized in that: the first type of information includes links pointing to the second type of information; and when the central server provides the link channel, the user who initiated the request and the participant storing the second type of information transmit data directly with each other.

3. The information transmission system based on federated learning according to claim 2, characterized in that the first type of information also includes at least one of the following: task identifier, participant information, task status information, and task performance metrics.

4. The information transmission system based on federated learning according to claim 3, characterized in that: each of the participants includes a physical machine cluster; the central server communicates with each of the participants to obtain resource usage information of each physical machine in the physical machine cluster; and the resource usage information belongs to the participant information and includes CPU information, memory information, GPU information, and hard disk information.
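A minimal sketch of how the resource usage information of claim 4 might be represented is given below. The field names, the percentage units, and the helper `build_participant_info` are hypothetical. The claim only states that CPU, memory, GPU, and hard disk information are collected per physical machine and belong to the participant information.

```python
from dataclasses import dataclass, asdict


@dataclass
class ResourceUsage:
    """Usage of one physical machine (field names and units are assumptions)."""
    machine_id: str
    cpu_percent: float
    memory_percent: float
    gpu_percent: float
    disk_percent: float


def build_participant_info(participant_id, reports):
    """Fold per-machine reports into the participant information
    that the central server keeps as first-type information."""
    return {
        "participant_id": participant_id,
        "machines": {r.machine_id: asdict(r) for r in reports},
    }
```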

5. The information transmission system based on federated learning according to claim 4, characterized in that: the task status information includes task submission, task initialization, task running, task early termination, and task end; task submission includes one of the participants initiating a request to the central server to establish the training task, the request then being confirmed by the central server; task initialization includes selecting, from each of the participants, the corresponding physical machines to participate in the training task based on the resource usage information obtained by the central server; and task early termination includes ending the training task early when the convergence condition of the algorithm in the training task is met.
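The task statuses listed in claim 5 can be modeled, as one possible sketch, with an enumeration and a transition table. The allowed transitions, the CPU threshold used for machine selection, and the names `advance` and `select_machines` are assumptions. The claim names the statuses and says machines are selected based on resource usage, but does not prescribe the transition order or a selection rule.

```python
from enum import Enum


class TaskStatus(Enum):
    SUBMITTED = "task submission"
    INITIALIZING = "task initialization"
    RUNNING = "task running"
    EARLY_TERMINATED = "task early termination"
    ENDED = "task end"


# Assumed transition table; the claim only enumerates the statuses.
ALLOWED = {
    TaskStatus.SUBMITTED: {TaskStatus.INITIALIZING},
    TaskStatus.INITIALIZING: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {TaskStatus.EARLY_TERMINATED, TaskStatus.ENDED},
    TaskStatus.EARLY_TERMINATED: {TaskStatus.ENDED},
    TaskStatus.ENDED: set(),
}


def advance(current, target):
    """Return the new status, rejecting transitions not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def select_machines(usage_by_machine, max_cpu_percent=80.0):
    """Pick machines whose CPU usage is below a threshold.
    The threshold rule is a hypothetical selection criterion."""
    return sorted(
        machine_id
        for machine_id, usage in usage_by_machine.items()
        if usage["cpu_percent"] < max_cpu_percent
    )
```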

6. The information transmission system based on federated learning according to claim 5, characterized in that task end includes: ending the training task, and the central server saving, backing up, and distributing the model obtained from the completed training task to each of the participants.

7. The information transmission system based on federated learning according to claim 3, characterized in that: the task performance metrics include local metrics and global metrics; the local metrics reflect the quality of the local training set and include the local training loss and the metrics of the classification task or object detection task; and the global metrics include the metrics achieved on the test set or validation set by the model obtained after the training task is completed.

8. The information transmission system based on federated learning according to claim 1, characterized in that: the central server includes a front end and a back end that periodically exchange information, with the first type of information stored in the back end; the back end communicates with each of the participants to receive user requests to view the first type of information; and the front end obtains the first type of information and displays it visually according to the user's permissions.

9. The information transmission system based on federated learning according to claim 8, characterized in that: the users include an administrator and each of the participants; when the back end receives a request from a participant to view the task information, it provides the first type of information corresponding to the requesting participant and sends it to the front end for viewing; and when the back end receives a request from the administrator to view the task information, it provides all of the first type of information and sends it to the front end for viewing.
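The permission rule of claim 9 can be illustrated with a short sketch. The function name `visible_first_type_info`, the `is_admin` flag, and the assumption that each task record carries a `participants` list are hypothetical. The claim only requires that a participant sees the first-type information corresponding to itself, while the administrator sees all of it.

```python
def visible_first_type_info(all_info, user, is_admin=False):
    """Return the first-type information a user may see.

    all_info maps task_id -> record, where each record holds a
    "participants" list (hypothetical layout).
    """
    if is_admin:
        # The administrator is given all first-type information.
        return dict(all_info)
    # A participant is given only the tasks it takes part in.
    return {
        task_id: record
        for task_id, record in all_info.items()
        if user in record["participants"]
    }
```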