Method for distributing workload based on message passing in a parallel computing system
By introducing an awareness mechanism and representative computing units into the parallel computing system, MPI communication is optimized. This addresses the problem of suboptimal communication patterns in MPI, achieves more efficient inter-group communication and better balancing of computing tasks, and improves system performance and bandwidth utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2023-12-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing MPI methods in parallel computing systems suffer from insufficient understanding of the underlying network topology, leading to suboptimal communication patterns and increased communication latency between nodes with large topological distances. Furthermore, existing solutions fail to effectively address the latency overhead in small messages and the problem of inter-group communication partitioning.
An awareness mechanism is introduced into the parallel computing system: each computing unit group is divided into representative computing units and background computing units, and the message passing interface is optimized so that each representative communicates with only a subset of the other representatives. This reduces the size of the communicator, decreases inter-group communication latency, and allows communication tasks to be executed in parallel.
The method improves the performance and network bandwidth utilization of parallel computing systems, reduces inter-group communication latency, and improves computing efficiency and the balance of communication tasks.
Figure CN122249794A_ABST
Abstract
Description
Technical Field
[0001] This invention generally relates to workload allocation, and more specifically, to a method for allocating workloads involving communication tasks in a parallel computing system based on message passing. The invention also relates to a method for allocating workloads based on message passing within a computing unit of a parallel computing system. Furthermore, the invention relates to a parallel computing system including a controller for performing workload allocation; and a controller for operating within the parallel computing system to perform workload allocation.
Background Technology
[0002] In recent years, across fields ranging from High-Performance Computing (HPC) to Artificial Intelligence (AI), computing applications have grown in complexity and scale, driven by high demand for computing resources. To meet this demand, the natural approach has been to distribute an application's computation across multiple nodes or processors using the Message Passing Interface (MPI). The work of a computing application is thus divided among, and processed by, multiple nodes or processors so that increasingly heavy workloads can be handled. Distributing computation in this way creates a bottleneck in MPI collective operations, which play a crucial role in coordinating interactions between the distributed nodes or processors. This bottleneck significantly affects the overall efficiency of computing applications in both HPC and AI. Collective operations in MPI exhibit different communication patterns, including all-to-one, one-to-all, and all-to-all. They can be implemented in MPI according to the specific requirements of the computing application. However, because of insufficient knowledge of the underlying network topology, the efficiency of collective operations suffers.
[0003] Existing collective algorithms in MPI assume that communication time is the same between every pair of processes, an assumption that is highly inaccurate in today's clusters. In practice, communication time depends strongly on the distance between processes in the topology (e.g., within a socket, within a node, within a rack, etc.). This inaccurate assumption leads to suboptimal communication patterns. In multi-level clusters, communication between nodes that are adjacent in the topology requires only one hop, while communication between nodes that are far apart in the topology requires five hops. The additional hops increase the communication latency between nodes that are far apart in the topology.
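The one-hop and five-hop figures can be reproduced with a small model. The sketch below assumes a hypothetical three-tier switch tree (leaf, aggregation, core) and counts the switches a message passes through; the host coordinates `(pod, leaf, host_index)`, the function names, and the per-hop cost are illustrative assumptions and do not come from the specification.

```python
def switch_hops(a, b):
    """Count switches traversed between hosts a and b in an assumed
    three-tier tree. Each host is a tuple (pod, leaf, host_index)."""
    if a == b:
        return 0  # same host: nothing crosses the network
    if a[0] == b[0] and a[1] == b[1]:
        return 1  # both hosts hang off the same leaf switch
    if a[0] == b[0]:
        return 3  # leaf -> aggregation -> leaf inside one pod
    return 5      # leaf -> aggregation -> core -> aggregation -> leaf


def estimated_latency_us(a, b, per_hop_us=1.0):
    """Hypothetical cost model: latency grows linearly with hop count."""
    return switch_hops(a, b) * per_hop_us


print(switch_hops((0, 0, 0), (0, 0, 1)))  # adjacent hosts
print(switch_hops((0, 0, 0), (1, 2, 3)))  # hosts in different pods
```

Under this assumed topology, a collective algorithm that treats every pair of processes as equidistant would schedule five-hop exchanges as freely as one-hop exchanges.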
[0004] Some existing schemes utilize multiple representatives within each node to minimize contention within nodes during data aggregation. However, this design has limitations; it is specific to two-level communicators. These existing schemes do not address the issue of inter-group communication partitioning, but only focus on intra-group data partitioning or contention mitigation. Furthermore, these existing schemes are limited to partitioning data among different representatives only at higher topology levels, and do not utilize multiple representatives in lower topology levels. Data partitioning only applies to large messages, while the latency overhead problem, which dominates in small messages, remains unresolved.
[0005] Therefore, it is necessary to address the aforementioned technical problems / deficiencies in computing systems used for workload allocation.
Summary of the Invention
[0006] The present invention aims to provide: a method for allocating workloads involving communication tasks based on message passing in a parallel computing system; a method for allocating workloads based on message passing in a computing unit of a parallel computing system; a parallel computing system including a controller for performing workload allocation; and a controller for operating in a parallel computing system to perform workload allocation while avoiding one or more disadvantages of prior art methods.
[0007] This objective is achieved by the features of the independent claims. Other implementations are apparent from the dependent claims, the specification, and the drawings.
[0008] According to a first aspect, a method is provided for allocating workloads involving communication tasks based on message passing in a parallel computing system. The parallel computing system includes one or more computing units, arranged into three or more groups of computing units. Each group of computing units includes at least two representative computing units and any number of background computing units, wherein the at least two representative computing units act as representatives of the group. The representatives of a group communicate with each representative of the other groups. The method includes: each group of computing units receiving a communication task to be executed in parallel, the communication task indicating inter-unit communication within each group; the computing units in each group performing the inter-unit communication within the group to which they belong; each computing unit acting as a representative receiving the communication tasks from the members of its group; aggregating the received communication tasks; sending the aggregated communication task of the group to which the computing unit belongs to the representatives of the other computing unit groups; receiving the aggregated communication tasks from the other computing unit groups; and finally determining the workload based on the aggregated communication task of the group to which the computing unit belongs and the aggregated communication tasks of the other computing unit groups.
[0009] The method optimizes collective operations for Message Passing Interface (MPI) users, who execute communication tasks in parallel across multiple computing units. This optimization makes collective operations more efficient, reduces overall computation time, and improves the performance of the parallel computing system. The method mitigates the performance limitations of the parallel computing system by introducing an awareness mechanism that minimizes communication overhead. It improves the efficiency of the parallel computing system by partitioning communication tasks so that each representative communicates only with a subset of the other representatives. This partitioning reduces the size of the communicator and thereby lowers the overall latency of the parallel computing system during inter-group communication. Restricting communication to specific representatives also results in faster interaction between representatives. The partitioned communication tasks of all representatives can be performed in parallel, which improves network bandwidth utilization.
[0010] Optionally, the method further includes: a representative of each computing unit group sending the aggregated communication task of the computing unit group to which the representative belongs to the other computing unit groups, and receiving the aggregated communication tasks of the other computing unit groups in parallel.
[0011] Optionally, the communication tasks to be executed in parallel are partial tasks of the workload to be assigned. The method further includes: splitting the workload into communication tasks for each of the computing unit groups.
[0012] Optionally, the workload also includes computational tasks to be executed. As part of the communication task, a portion of the results of the computational task is transmitted between the computing units. The method further includes: each background computing unit executing its portion of the computational task and sending the results of that portion as part of the inter-unit communication. This achieves more efficient utilization of computing resources by ensuring a good balance between communication and computational tasks.
[0013] Optionally, the parallel computing system is used to operate using a message passing interface (MPI).
[0014] Optionally, the method further includes: after the computing units in a computing unit group have executed the computing task and the inter-unit communication within the group, using a first MPI_barrier, so that the aggregated communication task is provided to the representative for processing without any computing unit continuing to operate.
[0015] Optionally, the method further includes: the representative computing unit receiving the aggregated communication tasks of the computing unit group before the first MPI_barrier, and after sending the aggregated communication tasks of each computing unit group to the other computing unit groups and receiving the aggregated communication tasks of the other computing unit groups, utilizing a second MPI_barrier between the representatives to ensure that all aggregated communication tasks are received before continuing the operation.
[0016] Optionally, the method further includes: after the workload is finally determined based on the aggregated communication tasks of the computing unit groups and the aggregated communication tasks of the other computing unit groups, utilizing a third MPI_barrier within each group.
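The ordering imposed by the three barriers in [0014] to [0016] can be illustrated without an MPI installation. The sketch below is a simulation under stated assumptions: Python threads stand in for MPI processes, `threading.Barrier` stands in for MPI_barrier, the aggregation is a maximum, and the group values and the choice of units 0 and 1 as representatives are hypothetical.

```python
import threading

# Hypothetical values: three groups, three units each. In every group,
# units 0 and 1 act as representatives and unit 2 is a background unit.
GROUP_VALUES = {0: [3, 9, 4], 1: [7, 1, 8], 2: [2, 6, 5]}
REPS_PER_GROUP = 2


def run_with_barriers():
    lock = threading.Lock()
    inbox = {g: [] for g in GROUP_VALUES}  # intra-group contributions
    board = {}                             # aggregated value per group
    result = {}                            # final value per group
    seen = {}                              # value each unit ends up holding

    first = {g: threading.Barrier(len(v)) for g, v in GROUP_VALUES.items()}
    second = threading.Barrier(len(GROUP_VALUES) * REPS_PER_GROUP)
    third = {g: threading.Barrier(len(v)) for g, v in GROUP_VALUES.items()}

    def unit(group, index, value):
        with lock:
            inbox[group].append(value)     # inter-unit communication
        first[group].wait()                # first barrier: whole group has contributed
        if index < REPS_PER_GROUP:
            with lock:
                board[group] = max(inbox[group])     # aggregate own group
            second.wait()                  # second barrier: every aggregate is posted
            with lock:
                result[group] = max(board.values())  # finally determine the result
        third[group].wait()                # third barrier: result is ready in the group
        with lock:
            seen[(group, index)] = result[group]

    threads = [
        threading.Thread(target=unit, args=(g, i, v))
        for g, values in GROUP_VALUES.items()
        for i, v in enumerate(values)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(run_with_barriers())
```

Only representatives take part in the second barrier, so background units wait at the third barrier until a representative of their group has stored the final value.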
[0017] Optionally, the method further includes performing an MPI_reduce operation: reducing the computing units in a group to the representatives of the group, then broadcasting to other representatives, thereby performing the reduction of the representatives in parallel, followed by a reduction from the representatives to the representatives of the first group.
[0018] According to a second aspect, a method is provided for distributing workload based on message passing within a computing unit of a parallel computing system. The parallel computing system includes one or more computing units, arranged into three or more groups of computing units. Each computing unit is a representative of one of the computing unit groups. The computing unit is used to communicate with representatives of other groups. The method includes: finally determining the workload.
[0019] The method optimizes collective operations for Message Passing Interface (MPI) users, who execute communication tasks in parallel across multiple computing units. This optimization makes collective operations more efficient, reduces overall computation time, and improves the performance of the parallel computing system. The method mitigates the performance limitations of the parallel computing system by introducing an awareness mechanism that minimizes communication overhead. It improves the efficiency of the parallel computing system by partitioning communication tasks so that each representative communicates only with a subset of the other representatives. This partitioning reduces the size of the communicator and thereby lowers the overall latency of the parallel computing system during inter-group communication. Restricting communication to specific representatives also results in faster interaction between representatives. The partitioned communication tasks of all representatives can be performed in parallel, which improves network bandwidth utilization.
[0020] According to a third aspect, a parallel computing system is provided that includes a controller for executing the method.
[0021] According to a fourth aspect, a controller is provided for running in a parallel computing system, the controller being used to execute the method.
[0022] The controller optimizes collective operations for Message Passing Interface (MPI) users, who execute communication tasks in parallel across multiple computing units. This optimization makes collective operations more efficient, reduces overall computation time, and improves the performance of the parallel computing system. The controller mitigates the performance limitations of the parallel computing system by introducing an awareness mechanism that minimizes communication overhead. The controller partitions communication tasks and improves the efficiency of the parallel computing system by ensuring that each representative communicates only with a subset of the other representatives. Task partitioning by the controller reduces the size of the communicators and thereby lowers the overall latency of the parallel computing system during inter-group communication. Restricting communication to specific representatives also results in faster interaction between representatives. The partitioned communication tasks of all representatives can be performed in parallel, which improves network bandwidth utilization.
[0023] According to a fifth aspect, a computer program product is provided comprising program instructions, wherein the program instructions, when executed by one or more processors in a parallel computing system, perform the method.
[0024] Therefore, unlike existing solutions, this method used in parallel computing systems allocates workloads involving communication tasks based on message passing. This method executes communication tasks in parallel across multiple computing units, reducing the overall latency of parallel computing systems during inter-group communication.
[0025] These and other aspects of the invention will become apparent from one or more implementations described below.
Attached Figure Description
[0026] Implementations of the present invention are described below by way of example and in conjunction with the accompanying drawings, in which:
Figure 1 is a block diagram of a parallel computing system including a controller according to an implementation of the present invention, the controller being used to perform the allocation of workloads involving communication tasks;
Figure 2A shows a possible exemplary implementation of a three-stage collective communication task performed by a parallel computing system according to some embodiments of the present invention;
Figure 2B shows a possible exemplary implementation of a three-stage collective operation performed by a parallel computing system according to some embodiments of the present invention;
Figures 3A to 3D are example diagrams illustrating an implementation of the Allreduce collective operation performed by a parallel computing system using the MPI_MAX operator of the message passing interface, according to an embodiment of the present invention;
Figures 4A to 4C are a flowchart of a method for allocating workloads involving communication tasks in a parallel computing system based on message passing, according to an implementation of the present invention;
Figure 5 is a flowchart of a method for allocating workload based on message passing in a computing unit of a parallel computing system according to an implementation of the present invention;
Figure 6 is a diagram of a computer system (e.g., a parallel computing system and a controller) in which the various architectures and functions of the preceding implementations can be realized.
Detailed Implementation
[0027] The present invention provides a method for allocating workloads involving communication tasks based on message passing in a parallel computing system; a method for allocating workloads based on message passing in a computing unit of a parallel computing system; a parallel computing system including a controller for performing workload allocation; and a controller for running in the parallel computing system to perform workload allocation.
[0028] To enable those skilled in the art to more easily understand the present invention, the following implementation of the present invention is described in conjunction with the accompanying drawings.
[0029] The terms “first,” “second,” “third,” and “fourth” (if any) used in the description of the invention, the claims, and the accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the terms thus used are interchangeable where appropriate, allowing various implementations of the invention described herein to be implemented, for example, in an order different from that shown or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed, or other steps or units inherent to such a process, method, product, or apparatus.
[0030] Figure 1 shows a block diagram of a parallel computing system 102 including a controller 110 according to an implementation of the present invention. The controller 110 is used to perform workload allocation involving communication tasks and operates within the parallel computing system 102. The parallel computing system 102 includes one or more computing units, which are configured into three or more computing unit groups 104A to 104N. Each computing unit group 104A to 104N includes at least two representative computing units 106A to 106N and any number of background computing units 108A to 108N, wherein the at least two representative computing units 106A to 106N act as representatives of the group. The representatives of a group communicate with each representative of the other groups. The controller 110 enables each computing unit group 104A to 104N to receive communication tasks to be executed in parallel, which indicate inter-unit communication within each group 104A to 104N. The controller 110 enables the computing units in each computing unit group 104A to 104N to perform inter-unit communication within their respective computing unit group. Each computing unit acting as a representative receives communication tasks from the members of its group. The controller 110 aggregates the received communication tasks. The controller 110 then forwards the aggregated communication tasks of a computing unit group 104A to the representatives of the other computing unit groups 104N, and receives the aggregated communication tasks from the other computing unit groups 104N. The controller 110 finally determines the workload based on the aggregated communication tasks of the computing unit group 104A and of the other computing unit groups 104N.
[0031] In one implementation, there is a controller 110 for operation in the parallel computing system 102. The controller 110 is used to perform workload distribution.
[0032] Controller 110 optimizes collective operations for Message Passing Interface (MPI) users, who execute communication tasks in parallel across multiple computing units. This optimization ensures efficient execution of collective operations, reduces overall computation time, and improves the performance of the parallel computing system 102. Controller 110 mitigates the performance limitations of the parallel computing system 102 by introducing an awareness mechanism to minimize communication overhead. Controller 110 partitions communication tasks and improves the efficiency of the parallel computing system 102 by ensuring that each representative communicates only with a subset of other representatives. Task partitioning by controller 110 minimizes the overall latency of the parallel computing system 102 during inter-group communication by reducing the size of the communicators. By restricting communication to specific representatives, the latency of inter-group communication performed by the parallel computing system is reduced, resulting in faster interaction between representatives. The partitioned communication tasks of all representatives can also be performed in parallel, improving network bandwidth utilization. The reduction in communicator size and the possibility of executing parallel communication tasks for different representatives reduce the overall latency during inter-group communication and improve bandwidth utilization.
[0033] Figure 2A shows a possible exemplary implementation 200 of a three-stage collective communication task performed by a parallel computing system according to some embodiments of the present invention. The parallel computing system includes one or more computing units and a controller.
[0034] The controller is used to group one or more rank numbers associated with one or more computing units or nodes into one or more computing unit groups or one or more rank number groups based on the location of one or more computing units or nodes in the network topology of the parallel computing system and their hierarchy in the network topology. This hierarchy can be a node hierarchy, a switch hierarchy, or a rank number hierarchy. This means that (i) if each node is associated with one or more rank numbers, these rank numbers can be organized into one or more computing unit groups, and (ii) if one or more rank numbers associated with each node communicate through the same switch due to the physical proximity of the nodes, these rank numbers can be organized into one or more computing unit groups. Within one or more computing unit groups, each computing unit group includes at least one representative computing unit and one or more background computing units. The representative computing unit in each computing unit group acts as a representative, performing the workload of the parallel computing system. The workload includes computational tasks and communication tasks.
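The grouping described in [0034] can be sketched as follows. This is a minimal illustration, assuming a hypothetical placement table that records the node and switch of each rank number; the names `PLACEMENT`, `group_ranks`, and `split_roles`, and the rule of taking the first members of a group as representatives, are assumptions made for the example rather than details of the specification.

```python
from collections import defaultdict

# Hypothetical placement: eight rank numbers, two per node, two nodes per switch.
PLACEMENT = {
    rank: {"node": f"n{rank // 2}", "switch": f"s{rank // 4}"}
    for rank in range(8)
}


def group_ranks(placement, level):
    """Group rank numbers that share the same element at the chosen
    hierarchy level ("node" or "switch")."""
    groups = defaultdict(list)
    for rank in sorted(placement):
        groups[placement[rank][level]].append(rank)
    return dict(groups)


def split_roles(members, n_representatives=1):
    """Assumed selection rule: the first members become representatives,
    the remaining members become background computing units."""
    return members[:n_representatives], members[n_representatives:]


by_switch = group_ranks(PLACEMENT, "switch")
print(by_switch)
print(split_roles(by_switch["s0"], n_representatives=2))
```

Choosing `"node"` instead of `"switch"` yields smaller groups, which is one way the hierarchy level changes how much traffic stays inside a group.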
[0035] Optionally, network topology refers to the arrangement of various units such as nodes and switches in a parallel computing system. A switch is a network device that facilitates communication between different nodes. Each node in the network topology is associated with one or more rank numbers. One or more rank numbers represent multiple node connections linked to each node. Each node represents a computing unit.
[0036] The controller is used to assign computational tasks of the parallel computing system to one or more background computing units within each computing unit group. The one or more background computing units within each computing unit group execute the portion of the computational task assigned to them. The one or more background computing units transmit the results of their computational tasks through a representative within each computing unit group as part of inter-unit communication. This means that communication tasks share partial results of the computational tasks executed by the one or more background computing units within each computing unit group among other computing unit groups.
[0037] The controller is used to divide communication tasks (i.e., collective operations) into intra-group communication 202, inter-group communication 204, and additional intra-group communication 206 within one or more computing unit groups (1 to N), as shown in Figure 2A. The parallel computing system divides the communication tasks and assigns them to each computing unit group. This means that each computing unit group receives communication tasks to be executed in parallel from the parallel computing system.
[0038] In inter-group communication 204, the representative computing unit in each computing unit group aggregates the communication tasks of its corresponding background computing unit. The representative computing unit transmits the aggregated communication tasks of its computing unit group to representatives of other computing unit groups. This means that, as part of the communication task, a portion of the result of a computing task executed by one or more background computing units in the computing unit group to which the representative computing unit belongs will be transmitted between other computing unit groups.
[0039] The controller is used to perform intra-group communication 202 to aggregate communication tasks of one or more background computing units within each computing unit group and forward the aggregated communication tasks to the representative computing unit in each computing unit group.
[0040] In intra-group communication 202, the one or more background computing units in each computing unit group perform the communication tasks assigned by the parallel computing system to the group to which they belong.
[0041] The controller is used to execute additional intra-group communication 206. Additional intra-group communication 206 sends the results of computation tasks obtained from inter-group communication 204 back to one or more background computation units within each computation unit group. These results constitute aggregated communication tasks for other computation unit groups.
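The three stages described in [0037] to [0041] can be sketched as a plain-Python simulation. It is a simplified model rather than the claimed implementation: no real MPI calls are made, each group is reduced as if by a single representative, and the group values are hypothetical.

```python
from functools import reduce


def three_phase_allreduce(groups, op=max):
    """groups maps a group id to the list of values held by its units.
    Returns the values every unit holds after the three stages."""
    # Intra-group communication 202: aggregate each group's values
    # at its representative.
    at_representative = {g: reduce(op, values) for g, values in groups.items()}
    # Inter-group communication 204: representatives exchange their
    # aggregates and combine them.
    combined = reduce(op, at_representative.values())
    # Additional intra-group communication 206: the combined result is
    # sent back to every unit of every group.
    return {g: [combined] * len(values) for g, values in groups.items()}


groups = {1: [5, 1, 25, 12], 2: [77, 40], 3: [88, 3, 9]}  # hypothetical values
print(three_phase_allreduce(groups))
```

Passing a different associative operator, for example addition, gives the corresponding sum-based collective with the same three stages.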
[0042] Figure 2B shows a possible exemplary implementation of a three-stage collective operation performed by a parallel computing system according to some embodiments of the present invention. The three stages are intra-group, inter-group, and additional intra-group operations. The collective operation can be MPI_BARRIER, MPI_REDUCE, or MPI_ALLREDUCE, where MPI is the message passing interface. The MPI barrier is used in the parallel computing system as follows: after a computing unit (i.e., a representative) has completed its computational task and inter-unit communication within the computing unit group, the controller executes a first MPI_barrier. The first MPI_barrier ensures that all computing units within the group have completed their tasks and communications before the operation proceeds. The aggregated communication tasks generated by the computing unit group are then provided to the representative computing unit for processing, without any computing unit continuing to operate.
[0043] Before the first MPI_barrier is executed, the representatives receive the aggregated communication tasks for their respective computing unit groups. After sending the aggregated communication tasks of each computing unit group to other computing unit groups and receiving the corresponding aggregated communication tasks from other computing unit groups, the controller executes a second MPI_barrier between representatives to achieve synchronization and ensure that all aggregated communication tasks are received before proceeding. The controller finally determines the workload based on the aggregated communication tasks of the computing unit group and the aggregated communication tasks of other computing unit groups.
[0044] After finalizing the workload based on the aggregated communication tasks of this computing unit group and the aggregated communication tasks of other computing unit groups, the controller executes a third MPI_barrier within each group. Executing the third MPI_barrier means that the controller ensures that all communication tasks, whether within this group or other computing unit groups, are completed before any subsequent computations occur.
[0045] Figures 3A to 3D show example diagrams 300A to 300D illustrating an implementation of the Allreduce collective operation performed by a parallel computing system using the MPI_MAX operator of the message passing interface, according to an embodiment of the present invention. The parallel computing system may include 48 rank numbers, each an independent node. A controller in the parallel computing system divides these 48 rank numbers into 6 rank number groups. Optionally, rank numbers within the same group are closer to each other in the topology than to rank numbers in different groups. The Allreduce collective operation uses the MPI_MAX operator to find the maximum value among the 48 rank numbers and to inform the rank numbers of it.
[0046] Diagrams 300A to 300D illustrate two different implementations of the Allreduce collective operation. The first implementation has a single representative for each group. The single representative in each group is responsible for communicating with the representatives of the other computing unit groups to coordinate the Allreduce collective operation. In Figure 3A, the single representative within each computing unit group is marked with a thick outline. For example, the single representative in group 1 is 5, as shown in Figure 3A. The communicators in group 1 are 1, 25, 19, 2, 1, 12, 17, and 7, as shown in Figure 3A.
[0047] The second implementation involves multiple representatives per group (i.e., multiple-level-multiple-representatives (MLMR)). The controller can select multiple representatives from each computing unit group and manage communication between groups through these selected representatives, according to criteria that ensure each computing unit group communicates with every other group. These criteria are: (i) each possible pair of computing unit groups must have at least one communicator containing representatives from each of the two groups; (ii) each representative forwards the aggregated communication task for its entire group to other groups; and (iii) each representative within each group creates at least one communicator with one or more representatives from other groups. In Figure 3A, the multiple representatives within each computing unit group are marked with thick outlines, and each group also contains communicators. For example, the multiple representatives in group 1 are 5, 1, 25, 12, and 19, as shown in Figure 3A. The communicators in group 1 are 17, 2, and 7, as shown in Figure 3A.
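Criteria (i) and (iii) can be checked mechanically for a candidate assignment. The sketch below shows one hypothetical assignment, not one prescribed by the specification: it assumes each of the G groups has G - 1 representatives and that every communicator contains exactly two representatives, one from each group of a pair. The function name and the modular indexing rule are illustrative assumptions.

```python
from itertools import combinations


def pairwise_communicators(n_groups):
    """Assign every pair of groups (g, h) a two-member communicator.
    Representatives are identified as (group, representative_index)."""
    comms = {}
    for g, h in combinations(range(n_groups), 2):
        rep_in_g = (h - g - 1) % n_groups  # g's representative facing h
        rep_in_h = (g - h - 1) % n_groups  # h's representative facing g
        comms[(g, h)] = ((g, rep_in_g), (h, rep_in_h))
    return comms


comms = pairwise_communicators(6)
print(len(comms))      # number of inter-group communicators
print(comms[(0, 1)])   # communicator linking groups 0 and 1
```

With six groups this yields one communicator per pair of groups, and every representative belongs to exactly one of them, so all inter-group exchanges can proceed in parallel over communicators of size two.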
[0048] The Allreduce collective operation using the MPI_MAX operator can be implemented in three phases: intra-group communication, inter-group communication, and additional intra-group communication.
[0049] The controller can execute the Allreduce collective operation, using the MPI_MAX operator to find the rank number with the maximum value in each rank number group. In the first implementation of intra-group communication, the rank number with the maximum value in each rank number group is sent to the single representative of that group. At the end of the intra-group communication, the representative of each rank number group holds the maximum value of the group to which it belongs. The single representative in group 1 holds the rank number with the maximum value (shown as 25 in Figure 3B).
[0050] In the second implementation of intra-group communication, the rank number with the maximum value in each rank number group is transmitted to the multiple representatives of that group. At the end of the intra-group communication, each representative in each group holds the maximum value of its own group. The multiple representatives in group 1 hold the rank number with the maximum value (shown as 25 in Figure 3B).
[0051] In the first implementation of inter-group communication, the communication task is divided among one or more rank number groups. Each representative in a rank number group communicates with the representatives of the other rank number groups (i.e., five representatives, from group 2, group 3, group 4, group 5, and group 6). In inter-group communication, the number of representatives is equal to the number of rank number groups minus one, which ensures that each representative in a rank number group communicates with exactly one representative from every other rank number group.
[0052] As shown in Figure 3B, the single representative in group 1 (rank number 25) communicates with the single representatives in group 2 (rank number 77), group 3 (rank number 88), group 4 (rank number 92), group 5 (rank number 81), and group 6 (rank number 34). After inter-group communication between the single representatives in group 1 and group 2, the rank number of the single representative in group 1 is 77. Then, the single representative in group 1 (rank number 77) communicates with the single representative in group 3 (rank number 88).
[0053] After inter-group communication between the single representatives in group 1 and group 3, the rank number of the single representative in group 1 is 88. Then, the single representative in group 1 (rank number 88) communicates with the single representative in group 4 (rank number 92). After inter-group communication between the single representatives in group 1 and group 4, the rank number of the single representative in group 1 is 92. Then, the single representative in group 1 (rank number 92) communicates with the single representative in group 5 (rank number 81).
[0054] After inter-group communication between a single representative in group 1 and a single representative in group 5, the rank number of the single representative in group 1 is 92. Then, the single representative in group 1 (rank number 92) communicates with a single representative in group 6 (rank number 34). After inter-group communication between the single representative in group 1 and a single representative in group 6, the rank number of the single representative in group 1 is 92. This means that when representatives from different rank number groups communicate, the rank number assigned to each representative is based on the maximum value between the two groups involved in the communication.
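The sequence of values held by the single representative of group 1 is a running maximum over the group maxima. A short sketch, using only the values stated in paragraph [0052] for Figure 3B (the variable names are illustrative):

```python
from itertools import accumulate

# Maxima held by the single representatives of groups 1..6 (paragraph [0052]).
group_maxima = [25, 77, 88, 92, 81, 34]

# Value held by the group 1 representative after each successive exchange.
held_by_group1 = list(accumulate(group_maxima, max))
print(held_by_group1)  # [25, 77, 88, 92, 92, 92]
```

The value rises to 92 after the exchange with group 4 and stays there, because the maxima of groups 5 and 6 are smaller.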
[0055] Figure 3C shows that, after the single representative in group 1 finds the rank number with the maximum value of 92, the single representative in group 2 (rank number 77) communicates with the single representatives in group 1 (rank number 92), group 3 (rank number 88), group 4 (rank number 92), group 5 (rank number 81), and group 6 (rank number 34).
[0056] At the end of inter-group communication, each representative in each communicating pair of groups holds the rank number with the maximum value. Across the 6 groups, the single representative of each rank number group holds the maximum value of 92, as shown in Figure 3C.
[0057] In the second implementation of inter-group communication, each of the multiple representatives in a rank number group communicates with one representative from another rank number group. For example, one representative from group 1 (rank number 25) communicates with a representative from group 2 (rank number 77), another with a representative from group 3 (rank number 88), another with a representative from group 4 (rank number 92), another with a representative from group 5 (rank number 81), and another with a representative from group 6 (rank number 34), in each case to find the rank number with the maximum value among the two groups participating in the communication. At the end of the inter-group communication, each representative from group 1 holds the rank number with the maximum value of its pair; that is, the multiple representatives hold the maximum values 77, 88, 92, 81, and 34, as shown in Figure 3C.
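The pairwise exchanges of the second implementation are independent of one another, so they can be computed as separate maxima. A small sketch for group 1, taking the group maxima from paragraph [0052]; the dictionary layout is an illustrative assumption, and the final reduction among the group's own representatives anticipates the supplementary intra-group phase:

```python
group1_maximum = 25
# Maxima of groups 2..6 as stated for Figure 3B in paragraph [0052].
other_group_maxima = {2: 77, 3: 88, 4: 92, 5: 81, 6: 34}

# Each representative of group 1 exchanges with exactly one peer group and
# keeps the larger of the two group maxima. The exchanges are independent,
# so in a real system they could run in parallel.
held_by_representative = {
    peer: max(group1_maximum, peer_maximum)
    for peer, peer_maximum in other_group_maxima.items()
}

# Reducing over the group's own representatives then yields the global maximum.
group1_result = max(held_by_representative.values())
print(held_by_representative, group1_result)
```

The five representatives end up holding 77, 88, 92, 81, and 34, matching the values in the paragraph above, and the reduction over them gives 92.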
[0058] In the first implementation of supplementary intra-group communication, the single representative in each rank number group can send the rank number with the maximum value within the corresponding communicator.
[0059] In the second implementation of supplementary intra-group communication, the multiple representatives in each rank number group send the rank number with the maximum value through their respective communicators. Sending the rank number with the maximum value to the corresponding communicator in each rank number group requires additional computational overhead. At the end of the supplementary intra-group communication, the Allreduce collective operation completes, and all rank numbers in each rank number group hold the maximum value across the one or more rank number groups. For example, the rank numbers in the one or more rank number groups hold 92, as shown in Figure 3D.
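The three phases can be put together in a plain-Python simulation. This is a sketch under stated assumptions, not MPI code and not the claimed implementation: the function name `allreduce_max_three_phase` is hypothetical, each group is assumed to have one representative per peer group, and all values other than those of group 1 and the five other group maxima are placeholder zeros that do not come from the figures.

```python
def allreduce_max_three_phase(groups):
    """Simulate a three-phase Allreduce with a maximum operator.

    groups is a list of lists of values, one inner list per group.
    Returns the values held by every unit after the operation.
    """
    # Phase 1, intra-group: every group reduces to its own maximum.
    local_max = [max(values) for values in groups]

    # Phase 2, inter-group: the representative of group i that is paired
    # with group j keeps the larger of the two group maxima.
    n = len(groups)
    pairwise = [
        [max(local_max[i], local_max[j]) for j in range(n) if j != i]
        for i in range(n)
    ]

    # Phase 3, supplementary intra-group: reduce over the group's own
    # representatives and hand the result to every unit in the group.
    final = [max(p, default=local_max[i]) for i, p in enumerate(pairwise)]
    return [[final[i]] * len(groups[i]) for i in range(n)]


# Group 1 values are taken from the Figure 3A description; the zeros in the
# other groups are placeholders, only their maxima appear in the source.
groups = [[5, 1, 25, 19, 2, 12, 17, 7]] + [
    [group_max] + [0] * 7 for group_max in (77, 88, 92, 81, 34)
]
result = allreduce_max_three_phase(groups)
print(result[0])
```

After the third phase all 48 simulated units hold 92, which is the outcome the paragraph above describes for Figure 3D.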
[0060] Figures 4A to 4C are a flowchart illustrating a method for allocating workloads involving communication tasks based on message passing in a parallel computing system, according to an implementation of the present invention. The parallel computing system includes one or more computing units. In step 402, the method includes: arranging the one or more computing units into three or more computing unit groups. Each computing unit group includes at least two representative computing units and any number of background computing units, wherein the at least two representative computing units act as representatives of the group. The representatives of the group are used to communicate with each representative of the other groups. In step 404, the method includes: each of the computing unit groups receiving a communication task to be executed in parallel. The communication task indicates inter-unit communication within each group. In step 406, the method includes: the computing units in each computing unit group performing the inter-unit communication within their own computing unit group. In step 408, the method includes: each computing unit acting as a representative receiving the communication task from the members of its own group. In step 410, the method includes: aggregating the received communication tasks. In step 412, the method includes: sending the aggregated communication task of the representative's own computing unit group to the representatives of the other computing unit groups. In step 414, the method includes: receiving the aggregated communication tasks from the other computing unit groups. In step 416, the method includes: finally determining the workload based on the aggregated communication task of the representative's own computing unit group and the aggregated communication tasks of the other computing unit groups.
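Steps 402 to 416 can be traced with a small in-memory model. The sketch below is illustrative only: the `Group` class, the `determine_workload` function, the numeric payloads, and the use of `sum` as the aggregation are all assumptions made for the example, and the source does not prescribe any of them.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    representatives: list  # unit ids, at least two per group (step 402)
    background: list       # unit ids, any number


def determine_workload(groups, payloads, aggregate=sum):
    """Trace steps 402 to 416 for numeric payloads keyed by unit id."""
    # Step 402 precondition: three or more groups, two or more representatives.
    if len(groups) < 3 or any(len(g.representatives) < 2 for g in groups):
        raise ValueError("need >= 3 groups with >= 2 representatives each")

    # Steps 404 to 410: units communicate inside their group and the
    # representatives aggregate what they receive from the group members.
    own = {
        g.name: aggregate([payloads[u] for u in g.representatives + g.background])
        for g in groups
    }

    # Steps 412 and 414: every group's aggregate is sent to, and received
    # from, the representatives of the other groups.
    received = {
        g.name: [own[o.name] for o in groups if o.name != g.name] for g in groups
    }

    # Step 416: the workload is finally determined from the group's own
    # aggregate together with the aggregates of the other groups.
    return {g.name: aggregate([own[g.name]] + received[g.name]) for g in groups}


groups = [
    Group("A", [0, 1], [2]),
    Group("B", [3, 4], []),
    Group("C", [5, 6], [7, 8]),
]
payloads = {unit: unit + 1 for unit in range(9)}
workload = determine_workload(groups, payloads)
print(workload)
```

With a sum as the aggregation, every group ends with the same total, which is the behaviour expected of an Allreduce-style collective.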
[0061] This method optimizes collective operations for users of the message passing interface (MPI) who execute communication tasks in parallel across multiple computing units. The optimization makes collective operations execute more efficiently, reduces overall computation time, and improves the performance of the parallel computing system. The method alleviates the performance limitations of the parallel computing system by introducing a perception mechanism that reduces communication overhead. It partitions the communication tasks so that each representative communicates only with a subset of the other representatives, which reduces the size of each communicator and thereby the latency of inter-group communication. Because communication is restricted to specific representatives, interaction between representatives is faster. The partitioned communication tasks of all representatives can also be performed in parallel, which improves network bandwidth utilization.
[0062] Optionally, the method further includes: a representative of each computing unit group sending the aggregated communication task of its own computing unit group to the other computing unit groups, and receiving the aggregated communication tasks of the other computing unit groups in parallel.
[0063] Optionally, the communication tasks to be executed in parallel are partial tasks of the workload to be assigned. The method further includes splitting the workload into communication tasks for each computing unit group.
[0064] Optionally, the workload also includes computational tasks to be executed. As part of the communication task, a portion of the results of the computational tasks will be transferred between the computational units. The method further includes: each background computational unit executing the portion of the computational task belonging to that background computational unit and sending the result of that portion as part of the inter-unit communication.
[0065] Optionally, the parallel computing system is used to operate using the message passing interface (MPI).
[0066] Optionally, the method further includes: after the computing units in a computing unit group have executed the computing tasks and the inter-unit communication within the computing unit group, using a first MPI_barrier so that the aggregated communication task can be provided to the representative and processed without any computing unit continuing to operate.
[0067] Optionally, the method further includes: the representative computing units receiving the aggregated communication task of their computing unit group before the first MPI_barrier, and, after sending the aggregated communication task of each computing unit group to the other computing unit groups and receiving the aggregated communication tasks of the other computing unit groups, using a second MPI_barrier between the representatives to ensure that all aggregated communication tasks are received before the operation continues.
[0068] Optionally, the method further includes: utilizing a third MPI_barrier within each group after the workload is finally determined based on the aggregated communication tasks of the computing unit group and the aggregated communication tasks of other computing unit groups.
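The ordering enforced by the three barriers can be illustrated without an MPI installation. The sketch below uses Python threads, with `threading.Barrier` standing in for the barrier calls. It is an analogy rather than MPI code: the function name `run_with_barriers`, the choice of unit 0 as the only representative of each group, and the maximum as aggregation are assumptions made for the example.

```python
import threading


def run_with_barriers(groups):
    """groups maps a group name to its units' values; unit 0 is the representative."""
    # One reusable barrier per group serves as the first and the third barrier.
    group_barrier = {g: threading.Barrier(len(v)) for g, v in groups.items()}
    rep_barrier = threading.Barrier(len(groups))  # second barrier, representatives only
    inbox = {g: [] for g in groups}  # contributions provided to each representative
    aggregated = {}                  # per-group aggregate, written by the representative
    final = {}                       # finally determined value per group
    result = {}                      # value seen by every unit at the end

    def unit(g, idx):
        inbox[g].append(groups[g][idx])  # provide the local contribution
        group_barrier[g].wait()          # first barrier: no unit runs ahead
        if idx == 0:
            aggregated[g] = max(inbox[g])
            rep_barrier.wait()           # second barrier: all aggregates are available
            final[g] = max(aggregated.values())
        group_barrier[g].wait()          # third barrier: the result is final in the group
        result[(g, idx)] = final[g]

    threads = [
        threading.Thread(target=unit, args=(g, i))
        for g, values in groups.items()
        for i in range(len(values))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


outcome = run_with_barriers({"g1": [5, 1, 25], "g2": [77, 3], "g3": [88, 92, 4]})
print(sorted(outcome.items()))
```

Without the second barrier a representative could read the aggregates before every group had written its own, and without the third barrier a background unit could read the final value before its representative had stored it.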
[0069] Optionally, the method further includes performing an MPI_reduce operation in which the computing units in a group are reduced to the representatives of that group, followed by a broadcast to the other representatives, so that the reductions to the representatives are performed in parallel, followed by a reduction from those representatives to the representatives of the first group.
[0070] Figure 5 is a flowchart illustrating a method for distributing workload based on message passing in a computing unit of a parallel computing system, according to an implementation of the present invention. The parallel computing system includes one or more computing units. The one or more computing units are arranged into three or more computing unit groups. The computing unit is a representative of one of the computing unit groups and is used to communicate with representatives of other groups. In step 502, the method includes: finally determining the workload of the parallel computing system.
[0071] This method optimizes collective operations for users of the message passing interface (MPI) who execute communication tasks in parallel across multiple computing units. The optimization makes collective operations execute more efficiently, reduces overall computation time, and improves the performance of the parallel computing system. The method alleviates the performance limitations of the parallel computing system by introducing a perception mechanism that reduces communication overhead. It partitions the communication tasks so that each representative communicates only with a subset of the other representatives, which reduces the size of each communicator and thereby the latency of inter-group communication. Because communication is restricted to specific representatives, interaction between representatives is faster. The partitioned communication tasks of all representatives can also be performed in parallel, which improves network bandwidth utilization.
[0072] In one implementation, there is a computer program product comprising program instructions that, when executed by one or more processors in a storage system, perform the methods described above.
[0073] Figure 6 is an illustration of a computer system (e.g., a parallel computing system and controller) in which the various architectures and functions of the preceding implementations can be implemented. As shown, computer system 600 includes at least one processor 604 connected to bus 602. Bus 602 can be implemented using any suitable protocol, such as Peripheral Component Interconnect (PCI), PCI-Express, Accelerated Graphics Port (AGP), HyperTransport, or any other bus or point-to-point communication protocol. Computer system 600 also includes memory 606.
[0074] The control logic (software) and data are stored in memory 606, which can take the form of random-access memory (RAM). In this invention, a single semiconductor platform can refer to a standalone, single-semiconductor-based integrated circuit or chip. The term "single semiconductor platform" can also refer to a multi-chip module with enhanced connectivity that simulates on-chip operation and achieves a significant improvement over implementations using a traditional central processing unit (CPU) and bus. The various modules can also be arranged separately or in various combinations of semiconductor platforms, depending on the user's needs.
[0075] Computer system 600 may also include auxiliary storage 610. Auxiliary storage 610 includes, for example, hard disk drives and removable storage drives, such as floppy disk drives, magnetic tape drives, compact disk drives, digital versatile disk (DVD) drives, recording devices, and universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a known manner.
[0076] A computer program or computer control logic algorithm may be stored in at least one of the memory 606 and the auxiliary storage 610. When such a computer program is executed, it enables the computer system 600 to perform the various functions described above. The memory 606, the auxiliary storage 610, and any other storage devices are computer-readable media.
[0077] In one implementation, the architecture and functions described in the preceding figures can be implemented in the context of processor 604, a graphics processor coupled to communication interface 612, an integrated circuit (not shown) capable of simultaneously possessing at least some of the capabilities of processor 604 and graphics processor, and a chipset (i.e., a set of integrated circuits designed to function and be sold as units performing related functions, etc.).
[0078] Furthermore, the architectures and functions described in the previously illustrated figures can also be implemented in environments such as general-purpose computer systems, circuit board systems, game console systems for entertainment purposes, and application-specific systems. For example, computer system 600 can take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.
[0079] Furthermore, the computer system 600 can take the form of various other devices, including but not limited to personal digital assistant (PDA) devices, mobile phone devices, smartphones, televisions, etc. Additionally, although not shown, the computer system 600 can be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, or a wired network, etc.) via I/O interface 608 for communication purposes.
[0080] It should be understood that the arrangement of components shown in the described figures is exemplary and other arrangements are possible. It should also be understood that the various system components (and devices) defined by the claims, described below, and shown in the various block figures represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and devices) may be implemented wholly or partially by at least some of the components shown in the arrangements illustrated in the described figures.
[0081] Furthermore, while at least one of these components is implemented at least partially as an electronic hardware component and thus constitutes a machine, the other components may be implemented in software, which, when included in the execution environment, constitutes a machine, hardware, or a combination of software and hardware.
[0082] Although the invention and its advantages have been described in detail, it should be understood that various changes, substitutions and modifications may be made without departing from the spirit and scope of the invention as defined by the appended claims.
Claims
1. A method for allocating workloads involving communication tasks based on message passing in a parallel computing system (102), characterized in that the parallel computing system (102) includes multiple computing units, wherein the multiple computing units are arranged into three or more computing unit groups (104A to 104N), each computing unit group (104A to 104N) includes at least two representative computing units (106A to 106N) and any number of background computing units (108A to 108N), the at least two representative computing units (106A to 106N) serve as representatives of the group, and the representatives of the group are used to communicate with each representative of the other groups, and the method includes:
each computing unit group (104A to 104N) receiving a communication task to be executed in parallel, the communication task indicating inter-unit communication within each group;
the computing units in each computing unit group (104A to 104N) performing the inter-unit communication within the computing unit group to which the computing unit belongs;
each computing unit that acts as a representative:
receiving the communication task from the group members of the computing unit;
aggregating the received communication tasks;
sending the aggregated communication task of the computing unit group (104A) to which the computing unit belongs to the representatives of the other computing unit groups; and
receiving the aggregated communication tasks from the other computing unit groups (104N);
and wherein the method further includes: finally determining the workload based on the aggregated communication task of the computing unit group (104A) to which the computing unit belongs and the aggregated communication tasks of the other computing unit groups (104N).
2. The method according to claim 1, characterized in that the method further includes: a representative of each computing unit group (104A to 104N) sending the aggregated communication task of the computing unit group (104A) to which the representative belongs to the other computing unit groups (104N), and receiving the aggregated communication tasks of the other computing unit groups (104N) in parallel.
3. The method according to claim 1 or 2, characterized in that the communication tasks to be executed in parallel are partial tasks of the workload to be assigned, and the method further includes: splitting the workload into communication tasks for each computing unit group (104A to 104N).
4. The method according to any one of the preceding claims, characterized in that the workload also includes a computing task to be executed, wherein, as part of the communication task, a portion of the result of the computing task will be transmitted between computing units, and wherein the method further includes: each background computing unit (108A to 108N) executing the portion of the computing task belonging to the background computing unit and sending the result of the portion of the computing task as part of the inter-unit communication.
5. The method according to any one of the preceding claims, characterized in that the parallel computing system (102) is used to operate using the message passing interface MPI.
6. The method according to claim 5, characterized in that the method further includes: after the computing units in the computing unit group (104A to 104N) have executed the computing tasks and the inter-unit communication within the computing unit group, using a first MPI_barrier so that the aggregated communication task can be provided to the representative and processed without any computing unit continuing to operate.
7. The method according to claim 6, characterized in that the method further includes: the representative computing units (106A to 106N) receiving the aggregated communication task from the computing unit group (104A) before the first MPI_barrier, and, after sending the aggregated communication task of each computing unit group (104A to 104N) to the other computing unit groups (104A to 104N) and receiving the aggregated communication tasks of the other computing unit groups (104A to 104N), using a second MPI_barrier between the representatives (106A to 106N) to ensure that all aggregated communication tasks are received before the operation continues.
8. The method according to claim 7, characterized in that the method further includes: after the workload is finally determined based on the aggregated communication task of the computing unit group (104A) and the aggregated communication tasks of the other computing unit groups (104N), using a third MPI_barrier within each group.
9. The method according to any one of claims 5 to 8, characterized in that the method further includes: performing an MPI_reduce operation in which the computing units in a group are reduced to the representatives of the group, followed by a broadcast to the other representatives, so that the reductions to the representatives are performed in parallel, followed by a reduction from the representatives to the representatives of the first group.
10. A method for distributing workload based on message passing in a computing unit of a parallel computing system (102), characterized in that the parallel computing system (102) includes multiple computing units, wherein the multiple computing units are arranged into three or more computing unit groups (104A to 104N), the computing unit is a representative of one of the computing unit groups (104A to 104N) and is used to communicate with representatives of other groups, and the method includes: finally determining the workload of the parallel computing system (102).
11. A parallel computing system (102), characterized in that it includes a controller (110) for performing the method according to any one of claims 1 to 10.
12. A controller (110) for operation in a parallel computing system (102), characterized in that the controller (110) is used to perform the method according to any one of claims 1 to 10.
13. A computer program product, characterized in that it includes program instructions that, when executed by one or more processors in the parallel computing system (102), perform the method according to any one of claims 1 to 10.