An apparatus for reducing power consumption of high performance computing memory and a method of using the same

By using heterogeneous computing modules and parallel processing technology, the problems of high computation time and high power consumption in high-performance computing systems are solved, achieving efficient and accurate data processing and reasonable resource allocation, while reducing power consumption.

CN115712336B · Active · Publication Date: 2026-06-16 · GUIZHOU POWER GRID CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUIZHOU POWER GRID CO LTD
Filing Date
2022-11-23
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing high-performance computing systems, the multi-module computation process is slow, requires repeated searches of the data, and is prone to missing some data, while computation results and speed vary across environments. Existing schemes also allocate resources poorly, resulting in high power consumption.

Method used

The computing module adopts a heterogeneous architecture comprising multiple many-core coprocessors and high-performance memory units. It uses a CPU+GPU architecture together with the CHAOS parallel framework for parallel computing, and combines top-down and bottom-up breadth-first search to reduce the number of memory accesses. A resource management unit and a shared storage unit support dynamic scheduling for parallel processing and data allocation.

Benefits of technology

Parallel computing on a heterogeneous architecture improves computational efficiency and accuracy, shortens computation time, and reduces memory access overhead and power consumption.



Abstract

The application discloses a device for reducing power consumption of high-performance computing memory and a use method thereof, and relates to the technical field of high-performance computing, and specifically comprises the following modules: a computing module: a plurality of many-core coprocessors perform real-time computation and processing on data with the aid of a high-performance memory unit; a monitoring module: data obtained through computation is sent to an analysis module when the computing module normally works; an analysis module: a plurality of algorithm units are divided into multiple groups to perform different algorithms to compute data transmitted by the monitoring module; and a distribution module: slave node executors running in different resource environments monitor tasks that need to be executed by themselves from a message management unit. According to the application, data in the computing module can be effectively searched, the computation accuracy is improved, different computation data is conveniently fused into a unified shared storage unit, each kind of data has a suitable computation environment, and the power consumption of the device is effectively reduced.

Description

Technical Field

[0001] This invention relates to a device and method for reducing the power consumption of high-performance computing memory, belonging to the field of memory power consumption calculation technology.

Background Technology

[0002] High-performance computing (HPC) is a strategic national high technology and a crucial means of solving major challenges in economic construction, social development, scientific progress, and national security. It has become a strategic technological high ground for countries worldwide in the information age. Today, China's economic and social development and national security have an urgent need for HPC; for example, solving major challenges such as energy shortages, environmental pollution, and global climate change urgently requires HPC. High-performance computers play an irreplaceable role in all sectors of the national economy and have become an important tool for researching and solving challenging problems in various fields.

[0003] Chinese patent application (publication number CN105607726A) discloses a method and apparatus for reducing memory power consumption in high-performance computing clusters. This patent utilizes multiple modules to monitor and analyze high-performance computing cluster jobs in real time, determine the memory fault tolerance mechanism of the currently running job type, and allocate memory power consumption based on the fault tolerance level of the mechanism. This adjusts memory power consumption according to the fault tolerance level, reducing overall power consumption while maintaining cluster performance. However, the internal calculation process is time-consuming and requires repeated searches of computational data. During these searches, some data may be missed, or some data may be calculated multiple times, increasing the device's power consumption. Furthermore, the computational environment for large datasets varies, resulting in different calculation speeds. While the invention primarily adjusts memory power consumption based on fault tolerance levels, other methods could be chosen to allocate the data to be computed to different computing environments, improving the device's compatibility with the computation and thus reducing the device's own power consumption.

[0004] However, existing devices for reducing the power consumption of high-performance memory have a long internal calculation process and require repeated searches for calculation data. During the search process, some data may be missed or some data may be calculated multiple times. Furthermore, different types of data and different calculation environments require redistribution during the allocation process. Existing devices do not have good allocation performance, resulting in high power consumption.

Summary of the Invention

[0005] The technical problem to be solved by the present invention is to provide a device and method for reducing the power consumption of high-performance computing memory that effectively address the shortcomings of existing devices: the internal calculation process is long and requires repeated searches for calculation data; during those searches some data is missed or calculated multiple times; and because data types and calculation environments differ during allocation, existing devices allocate poorly, resulting in high power consumption.

[0006] The technical solution adopted in this invention is: a device for reducing the power consumption of high-performance computing memory, comprising a computing module, a monitoring module, an analysis module, and an allocation module;

[0007] The computing module includes multiple many-core coprocessors and high-performance memory units. The many-core coprocessors adopt a heterogeneous architecture, and the high-performance memory units store the data that needs to be computed. The many-core coprocessors use the high-performance memory units to perform real-time computation and processing on the data. The many-core coprocessors adopt a CPU+GPU heterogeneous architecture to achieve parallel processing. The CPU is responsible for the complex logic calculation part, and the GPU is responsible for the intensive operation with high parallelism and few branches. The many-core coprocessors are used for training large-scale deep neural networks. The parallel operation of the many-core coprocessors is completed through the CHAOS parallel framework.

[0008] The monitoring module includes a resource management unit, an early warning unit, and a data transmission unit. The resource management unit monitors the operation of high-performance computing in the computing module in real time. When high-performance computing fails, the early warning unit lights the warning light and stops the computing module from working. The data transmission unit sends the calculated data to the analysis module when the computing module is working normally.

[0009] The analysis module includes multiple algorithm units, which are divided into groups to execute different algorithms to calculate the data transmitted by the monitoring module. Some algorithm units use the breadth-first search algorithm, with a bitmap data structure representing the visited structure of the search, which improves the locality of the visited structure and reduces the number of memory accesses. The bottom-up search method avoids atomic operations executed by multiple threads, and combining top-down and bottom-up search methods further reduces the number of traversals during the search, thus reducing memory access overhead. The algorithm units also apply memory binding and thread binding optimization techniques, and the incoming data is divided so that, when multiple threads execute in parallel, each thread makes fewer remote memory accesses during the search, further reducing memory access overhead. After the analysis is completed, the analysis results are sent to the allocation module.

[0010] The allocation module includes a master node manager, a message management unit, slave node executors, and a shared storage unit. The master node manager provides task orchestration definition and scheduling functions, defining the received analysis results as tasks to be run and sending them to the message management unit. Then, slave node executors running in different resource environments monitor the tasks they need to execute from the message management unit. When a task that needs to be executed by a slave node executor appears, the corresponding slave node executor executes the corresponding task. Finally, the slave node executors store the files that need to be input and output in a shared storage unit.

[0011] Preferably, the parallel framework of CHAOS in the above computing module uses the HogWild method to store the gradient accumulation in the computing module body and uses the worker to update the global weight parameters, thereby reducing the training time of each round of the neural network of multiple many-core coprocessors.

[0012] Preferably, the above-mentioned resource management unit uses a unified underlying resource management framework, on which different application frameworks are migrated and installed.

[0013] Preferably, the CPU and GPU read data directly from the high-performance memory unit and perform fast and accurate calculations on the data in the high-performance memory unit.

[0014] Preferably, the different application frameworks in the aforementioned underlying resource management framework are all compatible with the underlying resource management framework.

[0015] Preferably, the GPU has multi-threading technology and a fine-grained synchronization mechanism, thereby accelerating the breadth-first search algorithm and using the SIMDVLQ encoding method to compress external data.

[0016] Preferably, the aforementioned many-core coprocessor, combined with high-performance memory units, has gather / scatter capabilities. Bottom-up and top-down algorithms are employed on the many-core coprocessor, using thousands of threads to traverse the graph.

[0017] Preferably, in the scheduling work of the allocation module, I / O resources, computing resources, accelerator resources, network resources, data and software library resources are dynamically scheduled and configured according to the characteristics of the task at different stages.
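The VLQ (variable-length quantity) encoding named in paragraph [0015] can be illustrated with a small scalar sketch. The patent does not specify the byte layout or the SIMD kernel, so everything here is an assumption made for illustration: the function names are hypothetical, the layout is the common "7 data bits per byte, high bit as continuation flag, most significant group first" form, the delta step on sorted ids is an added convenience, and the SIMD vectorization is omitted entirely.

```python
def vlq_encode(value: int) -> bytes:
    """Encode a non-negative integer as a VLQ (7 data bits per byte)."""
    if value < 0:
        raise ValueError("VLQ encodes non-negative integers only")
    groups = [value & 0x7F]                    # last byte: continuation bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)   # earlier bytes: continuation bit set
        value >>= 7
    return bytes(reversed(groups))             # most significant group first


def vlq_decode_stream(data: bytes) -> list[int]:
    """Decode a concatenation of VLQ-encoded integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:                    # continuation bit clear: value complete
            values.append(current)
            current = 0
    return values


def compress_sorted_ids(ids: list[int]) -> bytes:
    """Assumed usage: store gaps between sorted ids so most values fit in one byte."""
    out, previous = bytearray(), 0
    for i in ids:
        out += vlq_encode(i - previous)        # raises if ids are not sorted
        previous = i
    return bytes(out)


def decompress_sorted_ids(data: bytes) -> list[int]:
    """Invert compress_sorted_ids by summing the decoded gaps."""
    ids, running = [], 0
    for gap in vlq_decode_stream(data):
        running += gap
        ids.append(running)
    return ids
```

Small values occupy a single byte, which is why this kind of encoding can shrink adjacency data before it is moved into memory; whether the device uses delta encoding at all is not stated in the source.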

[0018] A method for using a device to reduce the power consumption of high-performance computing memory, characterized by the following specific steps:

[0019] Step 1: External computation data is input into a high-performance memory unit for storage, and then computed by multiple many-core coprocessors. The CPU and GPU in the multiple many-core coprocessors directly read data from the high-performance memory unit, and the thread parallel operation of the many-core coprocessors is completed through the parallel framework of CHAOS.

[0020] Step 2: After receiving the data transmitted from the computing module, the monitoring module uses the resource management unit for real-time monitoring and sends normal data to the analysis module through the data transmission unit;

[0021] Step 3: Multiple algorithm units in the analysis module combine top-down and bottom-up search methods to analyze the data. Some algorithm units use bitmap data structures to represent the visited structure in the breadth-first search algorithm. The data analysis is performed in parallel by multiple threads. After the analysis is completed, the analysis results are sent to the allocation module.

[0022] Step 4: The analysis results are sent to the allocation module, which defines the received analysis results as tasks to be run and sends them to the message management unit. The slave executors running in different resource environments monitor the tasks they need to execute from the message management unit. When a task that needs to be executed by the slave executor appears, the corresponding slave executor executes the corresponding task. Finally, the slave executors store the files that need to be input and output in a shared storage unit.

[0023] The beneficial effects of the present invention are as follows: Compared with the prior art, the present invention has the following advantages:

[0024] 1) In this invention, through the configuration of the computing module, multiple many-core coprocessors adopt a heterogeneous architecture and use high-performance memory units to perform real-time computation and processing of data. The many-core coprocessors employ a CPU+GPU heterogeneous architecture to achieve parallel processing: the CPU handles complex logical calculations, while the GPU handles intensive computations with high parallelism and few branches. The computational data is processed simultaneously on multiple many-core coprocessors, effectively improving computational efficiency. Furthermore, these many-core coprocessors are used for training large-scale deep neural networks, and the CHAOS parallel framework carries out their thread-parallel operation. The CHAOS parallel framework in the computing module uses the HogWild method to accumulate and store gradients within the computing module itself and uses workers to update the global weight parameters. Explicit synchronization is therefore not required, which significantly reduces the training time of each round of the neural network on the many-core coprocessors and thereby achieves acceleration. Parallel processing effectively improves the efficiency of data processing and analysis and significantly reduces computation time, and the SIMDVLQ encoding method compresses external data to further improve computational performance. The invention also fully utilizes the gather/scatter capabilities of the many-core coprocessors and high-performance memory units: bottom-up and top-down algorithms run on the many-core coprocessors, using thousands of threads to traverse the graph. This allows data in the computing module to be searched effectively without repeated searches, so data is unlikely to be missed or calculated multiple times during the search, which improves the accuracy of the computation.

[0025] 2) In this invention, through the configuration of the analysis module, some algorithm units use the breadth-first search algorithm. The visited structure in the breadth-first search algorithm is represented by a bitmap data structure, which improves the locality of the visited structure and reduces the number of memory accesses. The bottom-up search method avoids atomic operations executed by multiple threads. Combining the top-down and bottom-up search methods further reduces the number of traversals during the search, thus reducing memory access overhead. The algorithm units also apply memory binding and thread binding optimization techniques, and the input data is divided so that, when multiple threads execute in parallel, each thread makes fewer remote memory accesses during the search, further reducing memory access overhead. After the analysis is completed, the analysis results are sent to the allocation module. This design scales the multi-threaded parallel breadth-first search algorithm, accelerates it, reduces computation time, and uses hardware multi-threading technology to hide memory access latency.

[0026] 3) In this invention, the allocation module includes a master node manager, a message management unit, slave node executors, and a shared storage unit. The master node manager provides task orchestration definition and scheduling functions, defining the received analysis results as tasks to be run and sending them to the message management unit. Then, slave node executors running in different resource environments monitor the tasks they need to execute from the message management unit. When a task that needs to be executed by a slave node executor appears, the corresponding slave node executor executes the corresponding task. In this process, the message management unit can be used to define the dependencies and environment requirements of tasks, which is easier to maintain than scripts. The message management unit distinguishes task environment types and provides a loosely coupled and flexible task orchestration method. Different tasks can be assigned to appropriate slave node executors for processing, making it convenient to integrate different computational data into a unified shared storage unit. Each type of data has its own suitable computational environment, eliminating the need for multiple allocations and effectively reducing the power consumption of the device.

Attached Figure Description

[0027] Figure 1 is a flowchart of the present invention;

[0028] Figure 2 is a schematic diagram showing the connection of the various modules of the present invention;

[0029] Figure 3 is a flowchart of the analysis module of the present invention;

[0030] Figure 4 is a flowchart of the allocation module of the present invention.

Detailed Implementation

[0031] The present invention will be further described below with reference to the accompanying drawings and specific embodiments.

[0032] Example 1: As shown in Figures 1-4, a device for reducing the power consumption of high-performance computing memory includes a computing module, a monitoring module, an analysis module, and an allocation module.

[0033] The computing module comprises multiple many-core coprocessors and high-performance memory units. The many-core coprocessors employ a heterogeneous architecture, while the high-performance memory units store the data to be computed. The many-core coprocessors use the high-performance memory units to perform real-time computation and processing on the data, and processing the computational data on multiple coprocessors simultaneously effectively improves computational efficiency. The combination of many-core coprocessors and high-performance memory units provides gather/scatter capabilities, which refer to implementing simple I/O operations on multiple buffers, such as reading data from a channel into multiple buffers or writing data from multiple buffers into a channel. Bottom-up and top-down algorithms are employed on the many-core coprocessors, using thousands of threads to traverse the graph to improve computational performance. This allows data within the computing module to be searched effectively, eliminating the need for repeated searches and minimizing the risk of missing or recalculating data during the search, thus improving computational accuracy. The many-core coprocessors use a CPU+GPU heterogeneous architecture to achieve parallel processing: the CPU handles complex logical calculations, while the GPU handles intensive computations with high parallelism and few branches. The CPU and GPU read data directly from the high-performance memory units for fast and accurate calculation. Multiple many-core coprocessors are used for training large-scale deep neural networks, and the CHAOS parallel framework carries out their thread-parallel operation. CHAOS employs the HogWild method to accumulate gradients and store them within the computing module itself. The HogWild method involves parallel processing across multiple CPUs, with processors accessing parameters via shared memory without locking. Each CPU is allocated a non-overlapping set of parameters, and each CPU only updates the parameters it is responsible for. Workers update the global weight parameters, eliminating the need for explicit synchronization. This significantly reduces the training time for each round of the neural network on the many-core coprocessors, thus accelerating the process. The GPU's multi-threading technology and fine-grained synchronization mechanisms accelerate the breadth-first search algorithm, and the SIMDVLQ encoding method compresses external data, further improving computational performance. Parallel processing effectively improves the efficiency of data processing and analysis, significantly reducing computation time.
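The lock-free update pattern attributed to the HogWild method in paragraph [0033] can be illustrated with a toy sketch. It is not the CHAOS framework and not the patented implementation: the function names, the linear-regression model, the learning rate, and the synthetic data are all assumptions. The sketch also differs from the description in one deliberate way: it shards the training samples across workers and lets every worker write to the whole shared weight vector, rather than giving each CPU a non-overlapping parameter set. Python threads are used only to show the pattern, since CPython's global interpreter lock prevents a real speedup.

```python
import random
import threading


def make_data(n_samples: int, n_features: int, seed: int = 0):
    """Synthetic sparse regression data (hypothetical, for illustration only)."""
    rng = random.Random(seed)
    true_w = [rng.uniform(-1, 1) for _ in range(n_features)]
    data = []
    for _ in range(n_samples):
        # Sparse samples: few active features, so concurrent updates rarely collide.
        active = rng.sample(range(n_features), 3)
        x = {j: rng.uniform(-1, 1) for j in active}
        y = sum(true_w[j] * v for j, v in x.items())
        data.append((x, y))
    return data


def mean_squared_error(weights, data) -> float:
    total = 0.0
    for x, y in data:
        pred = sum(weights[j] * v for j, v in x.items())
        total += (pred - y) ** 2
    return total / len(data)


def hogwild_worker(weights, shard, lr: float, epochs: int) -> None:
    """Run SGD on one shard, writing to the shared weights without any lock."""
    for _ in range(epochs):
        for x, y in shard:
            err = sum(weights[j] * v for j, v in x.items()) - y
            for j, v in x.items():
                weights[j] -= lr * err * v     # unsynchronized shared update


def train_hogwild(data, n_features: int, n_workers: int = 4,
                  lr: float = 0.1, epochs: int = 20):
    weights = [0.0] * n_features               # global weights in shared memory
    shards = [data[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=hogwild_worker,
                                args=(weights, shard, lr, epochs))
               for shard in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                               # the only synchronization point
    return weights
```

Because each sample touches only a few weights, occasional overwritten updates do little harm, which is the property that lets this style of training skip explicit synchronization.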

[0034] The monitoring module includes a resource management unit, an early warning unit, and a data transmission unit. The resource management unit monitors the operation of high-performance computing in the computing module in real time. When a high-performance computing failure occurs, the early warning unit will light up an alarm and stop the computing module from working. The data transmission unit sends the calculated data to the analysis module when the computing module is working normally. The resource management unit uses a unified underlying resource management framework, on which different application frameworks can be migrated and installed. Different application frameworks in the underlying resource management framework are compatible with the underlying resource management framework. The advantage of this is that the underlying resource framework can centralize global resource information and provide a unified task and resource management strategy, so that the management efficiency and effectiveness can reach a relatively good level.
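The control flow of the monitoring module in paragraph [0034] can be summarized in a short sketch. The class, its field names, and the boolean `status_ok` health signal are hypothetical; the patent does not say how a failure is detected, so the sketch simply assumes that the resource management unit supplies a pass/fail status alongside each computed result.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringModule:
    """Toy model of the resource management, early warning, and data transmission units."""
    warning_light_on: bool = False
    computing_enabled: bool = True
    forwarded: list = field(default_factory=list)   # stands in for the analysis module's input

    def observe(self, status_ok: bool, result) -> bool:
        """Handle one computed result; return True if it was forwarded for analysis."""
        if not self.computing_enabled:
            return False                  # computing module already stopped
        if not status_ok:
            self.warning_light_on = True  # early warning unit lights the warning light
            self.computing_enabled = False  # and stops the computing module
            return False
        self.forwarded.append(result)     # data transmission unit forwards normal data
        return True
```

A caller would invoke `observe` once per result; after the first failure, nothing further is forwarded until the module is reset, which matches the "stop the computing module" behavior described above.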

[0035] The analysis module comprises multiple algorithm units, which are divided into groups to execute different algorithms to calculate the data transmitted by the monitoring module. Some algorithm units use a breadth-first search algorithm, representing the visited structure of the search with a bitmap data structure, which improves the locality of the visited structure and reduces the number of memory accesses. The bottom-up search method avoids atomic operations in multi-threaded execution, and the combination of top-down and bottom-up search methods further reduces the number of traversals during the search, reducing memory access overhead. The algorithm units also apply memory binding and thread binding optimization techniques, and the input data is divided so that, when executing in parallel, each thread makes fewer remote memory accesses during the search, further reducing memory access overhead. After the analysis is completed, the analysis results are sent to the allocation module. This design scales the multi-threaded parallel breadth-first search algorithm, accelerates it, reduces calculation time, and uses hardware multi-threading technology to hide memory access latency.
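The search strategy in paragraph [0035] can be sketched as a breadth-first search that keeps its visited set in a bitmap and switches between top-down and bottom-up steps. This is a single-threaded illustration, not the patented implementation: the function names, the `switch_fraction` parameter, and the rule of switching on frontier size are assumptions, and the graph is assumed to be undirected and given as adjacency lists.

```python
def bitmap_test(bitmap: bytearray, v: int) -> bool:
    """Return True if bit v is set (one bit per vertex keeps the structure compact)."""
    return bool(bitmap[v >> 3] & (1 << (v & 7)))


def bitmap_set(bitmap: bytearray, v: int) -> None:
    bitmap[v >> 3] |= 1 << (v & 7)


def hybrid_bfs(adj: list[list[int]], source: int,
               switch_fraction: float = 0.25) -> list[int]:
    """Return the BFS parent of every vertex (-1 if unreachable)."""
    n = len(adj)
    visited = bytearray((n + 7) // 8)
    parent = [-1] * n
    parent[source] = source
    bitmap_set(visited, source)
    frontier = [source]
    while frontier:
        next_frontier = []
        if len(frontier) < switch_fraction * n:
            # Top-down step: each frontier vertex claims its unvisited neighbors.
            for u in frontier:
                for v in adj[u]:
                    if not bitmap_test(visited, v):
                        bitmap_set(visited, v)
                        parent[v] = u
                        next_frontier.append(v)
        else:
            # Bottom-up step: each unvisited vertex looks for a parent in the frontier.
            # A vertex writes only its own parent entry, so a parallel version
            # would need no atomic operations here.
            in_frontier = bytearray((n + 7) // 8)
            for u in frontier:
                bitmap_set(in_frontier, u)
            for v in range(n):
                if bitmap_test(visited, v):
                    continue
                for u in adj[v]:
                    if bitmap_test(in_frontier, u):
                        parent[v] = u
                        next_frontier.append(v)
                        break             # stop at the first parent found
            for v in next_frontier:
                bitmap_set(visited, v)
        frontier = next_frontier
    return parent
```

The bottom-up step pays off when the frontier is large, because most unvisited vertices find a parent after checking only a few neighbors. Production implementations usually switch on edge counts rather than on frontier size, so the threshold here should be read as a placeholder.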

[0036] The allocation module includes a master node manager, a message management unit, slave node executors, and a shared storage unit. The master node manager provides task orchestration and scheduling functions, defining received analysis results as tasks to be run and sending them to the message management unit. Then, slave node executors running in different resource environments monitor their assigned tasks from the message management unit. When a task requiring execution appears, the corresponding slave node executor executes it. Finally, the slave node executors store both input and output files in a shared storage unit to achieve optimal matching between the device and the computation. In the scheduling work of the allocation module, I/O resources, computing resources, accelerator resources, network resources, data and software library resources are dynamically scheduled and configured based on the characteristics of tasks at different stages, to achieve the best match between the system and the application. In this process, the message management unit can be used to define the dependencies and environmental requirements of tasks, which is easier to maintain than scripts. The message management unit can also be used to distinguish task environment types, providing a loosely coupled and flexible task orchestration method. Different tasks can be assigned to appropriate slave node executors for processing, making it easy to integrate different computing data into a unified shared storage unit. Each type of data has its own suitable computing environment, eliminating the need for multiple allocations and effectively reducing the power consumption of the device.
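The task flow in paragraph [0036] (master node manager, message management unit, slave node executors, shared storage unit) can be sketched with in-process queues. All names are hypothetical, and several simplifications are assumed: one executor thread per resource environment, an in-memory dictionary in place of shared file storage, and tasks that arrive already tagged with the environment they need.

```python
import queue
import threading

STOP = object()   # sentinel telling an executor to shut down


class MessageManagementUnit:
    """Keeps one task queue per resource environment."""

    def __init__(self, environments):
        self._queues = {env: queue.Queue() for env in environments}

    def publish(self, env, task) -> None:
        self._queues[env].put(task)

    def subscribe(self, env) -> queue.Queue:
        return self._queues[env]


def slave_node_executor(env, mmu, shared_storage, lock) -> None:
    """Watch the queue for this environment and run the tasks that appear."""
    tasks = mmu.subscribe(env)
    while True:
        task = tasks.get()                # blocks until a task is available
        if task is STOP:
            break
        name, func, payload = task
        result = func(payload)
        with lock:
            shared_storage[name] = (env, result)   # write output to shared storage


def master_node_manager(analysis_results, mmu) -> None:
    """Define each analysis result as a task and route it by environment."""
    for name, env, func, payload in analysis_results:
        mmu.publish(env, (name, func, payload))


def run_allocation(analysis_results, environments) -> dict:
    mmu = MessageManagementUnit(environments)
    shared_storage, lock = {}, threading.Lock()
    executors = [threading.Thread(target=slave_node_executor,
                                  args=(env, mmu, shared_storage, lock))
                 for env in environments]
    for t in executors:
        t.start()
    master_node_manager(analysis_results, mmu)
    for env in environments:
        mmu.publish(env, STOP)            # shut down after queued tasks finish
    for t in executors:
        t.join()
    return shared_storage
```

For example, `run_allocation([("sum", "cpu", sum, [1, 2, 3])], ["cpu", "gpu"])` routes the task to the executor subscribed to the `"cpu"` queue. The master never calls an executor directly, which is the loose coupling the description attributes to the message management unit.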

[0037] Example 2: A method for using a device to reduce the power consumption of high-performance computing memory, the method comprising the following specific steps:

[0038] Step 1: The data that needs to be calculated from the outside world is input into a high-performance memory unit for storage, and then calculated by multiple many-core coprocessors. The CPU and GPU in the multiple many-core coprocessors directly read the data from the high-performance memory unit, and the thread parallel operation of the many-core coprocessors is completed through the parallel framework of CHAOS.

[0039] Step 2: After receiving the data transmitted from the computing module, the monitoring module uses the resource management unit for real-time monitoring and sends normal data to the analysis module through the data transmission unit;

[0040] Step 3: Multiple algorithm units in the analysis module combine top-down and bottom-up search methods to analyze the data. Some algorithm units use bitmap data structures to represent the visited structure in the breadth-first search algorithm. The data analysis is performed in parallel by multiple threads. After the analysis is completed, the analysis results are sent to the allocation module.

[0041] Step 4: The analysis results are sent to the allocation module, which defines the received analysis results as tasks to be run and sends them to the message management unit. The slave executors running in different resource environments monitor the tasks they need to execute from the message management unit. When a task that needs to be executed by the slave executor appears, the corresponding slave executor executes the corresponding task. Finally, the slave executors store the files that need to be input and output in a shared storage unit.

[0042] The working principle of this invention is as follows: The data to be calculated is input into a high-performance memory unit for storage and then calculated by multiple many-core coprocessors. The CPU and GPU in the many-core coprocessors read the data directly from the high-performance memory unit, and the parallel operation of the many-core coprocessors is completed through the CHAOS parallel framework. The data is then transmitted to the monitoring module. Once monitoring confirms normal operation, the data is sent to the analysis module through the data transmission unit. Multiple algorithm units in the analysis module analyze the data using top-down and bottom-up search methods. Some algorithm units use a bitmap data structure to represent the visited structure in the breadth-first search algorithm. The data analysis is performed in parallel by multiple threads. After the algorithm units have finished processing and analyzing the data, they send the analysis results to the allocation module. In the allocation module, slave node executors running in different resource environments monitor the tasks they need to execute from the message management unit. When a task that needs to be executed by a slave node executor appears, the corresponding slave node executor executes it. Finally, the slave node executors store the files that need to be input and output in the shared storage unit, effectively reducing the power consumption of the device.

[0043] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of protection of the claims.

Claims

1. A device for reducing the power consumption of high-performance computing memory, characterized in that: it includes a computing module, a monitoring module, an analysis module, and an allocation module. The computing module includes multiple many-core coprocessors and high-performance memory units. The many-core coprocessors adopt a heterogeneous architecture, and the high-performance memory units store the data that needs to be computed. The many-core coprocessors use the high-performance memory units to perform real-time computation and processing on the data. The many-core coprocessors adopt a CPU+GPU heterogeneous architecture to achieve parallel processing. The CPU is responsible for the complex logic calculation part, and the GPU is responsible for the intensive operation with high parallelism and few branches. The many-core coprocessors are used for training large-scale deep neural networks. The parallel operation of the many-core coprocessors is completed through the CHAOS parallel framework. The monitoring module includes a resource management unit, an early warning unit, and a data transmission unit. The resource management unit monitors the operation of high-performance computing in the computing module in real time. When high-performance computing fails, the early warning unit lights the warning light and stops the computing module from working. The data transmission unit sends the calculated data to the analysis module when the computing module is working normally. The analysis module includes multiple algorithm units, which are divided into multiple groups to execute different algorithms to calculate the data transmitted by the monitoring module. Some algorithm units use the breadth-first search algorithm, with a bitmap data structure representing the visited structure in the breadth-first search algorithm; a bottom-up search method is used, and top-down and bottom-up search methods are combined. The multiple algorithm units also have memory binding and thread binding optimization techniques and divide the incoming data. After the analysis is completed, the analysis results are sent to the allocation module. The allocation module includes a master node manager, a message management unit, slave node executors, and a shared storage unit. The master node manager provides the functions of task orchestration definition and scheduling, defining the received analysis results as tasks to be run and sending them to the message management unit. Then, the slave node executors running in different resource environments monitor the tasks they need to execute from the message management unit. When a task that needs to be executed by a slave node executor appears, the corresponding slave node executor executes the corresponding task. Finally, the slave node executors store the files that need to be input and output in the shared storage unit.

2. The device for reducing power consumption of high-performance computing memory according to claim 1, characterized in that: The CHAOS parallel framework in the computing module uses the HogWild method to accumulate and store gradients within the computing module, and uses workers to update the global weight parameters.
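
By way of non-limiting illustration, the update pattern recited in claim 2 (workers accumulate gradients locally and then update shared global weight parameters without locking, in the HogWild manner) might be sketched as follows. This is not the CHAOS implementation: the function name `hogwild_train`, the toy linear model, and all hyperparameters are assumptions made for this sketch. Standard CPython threads are used only to show the update pattern; they do not give the parallel speed-up of a many-core coprocessor.

```python
import random
import threading


def hogwild_train(data, n_workers=4, steps=500, lr=0.1, batch=4):
    """Fit y = w * x + b with lock-free SGD on shared parameters.

    All names and defaults here are hypothetical choices for the sketch.
    """
    weights = [0.0, 0.0]  # shared global parameters [w, b], no lock guards them

    def worker(seed):
        rng = random.Random(seed)
        for _ in range(steps):
            # Accumulate the gradient locally over a small batch.
            grad_w = 0.0
            grad_b = 0.0
            for x, y in rng.sample(data, batch):
                err = weights[0] * x + weights[1] - y
                grad_w += err * x
                grad_b += err
            # Apply the accumulated gradient to the shared weights without
            # taking a lock; occasional overwritten updates are tolerated.
            weights[0] -= lr * grad_w / batch
            weights[1] -= lr * grad_b / batch

    threads = [threading.Thread(target=worker, args=(seed,))
               for seed in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return weights


# Noise-free samples of y = 2x + 1 on x in [-1, 1].
data = [(k / 10, 2 * k / 10 + 1) for k in range(-10, 11)]
w, b = hogwild_train(data)
```

The design point is that no worker waits for another: the cost of an occasionally lost update is accepted in exchange for removing synchronization on the shared weights.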

3. The device for reducing power consumption of high-performance computing memory according to claim 2, characterized in that: The resource management unit uses a unified underlying resource management framework, on which different application frameworks can be migrated and installed.

4. The device for reducing power consumption of high-performance computing memory according to claim 3, characterized in that: The CPU and GPU directly read data from the high-performance memory unit and perform calculations on the data in the high-performance memory unit.

5. The device for reducing power consumption of high-performance computing memory according to claim 4, characterized in that: The various application frameworks built on top of the underlying resource management framework are all compatible with it.

6. The device for reducing power consumption of high-performance computing memory according to claim 5, characterized in that: The GPU uses multi-threading technology and a fine-grained synchronization mechanism to accelerate the breadth-first search algorithm, and uses the SIMDVLQ encoding method to compress external data.
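
By way of non-limiting illustration, variable-length quantity (VLQ) compression of graph data might be sketched as follows. The claim does not define the SIMDVLQ format, so this sketch makes two assumptions: a scalar, little-endian byte layout with 7 payload bits per byte and a continuation flag in the high bit, and gap (delta) encoding of sorted neighbour lists before VLQ is applied. The SIMD vectorization is omitted, and all function names are hypothetical.

```python
def vlq_encode(values):
    """Encode non-negative integers, 7 payload bits per byte, low bits first."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("VLQ encodes non-negative integers only")
        while True:
            byte = v & 0x7F
            v >>= 7
            if v:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)         # high bit clear: last byte of value
                break
    return bytes(out)


def vlq_decode(blob):
    """Inverse of vlq_encode."""
    values = []
    current = 0
    shift = 0
    for byte in blob:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current = 0
            shift = 0
    if shift:
        raise ValueError("truncated VLQ stream")
    return values


def compress_neighbors(neighbors):
    """Store a neighbour list as VLQ-encoded gaps between sorted vertex ids."""
    gaps = []
    previous = 0
    for v in sorted(neighbors):
        gaps.append(v - previous)
        previous = v
    return vlq_encode(gaps)


def decompress_neighbors(blob):
    """Rebuild the sorted neighbour list from its VLQ-encoded gaps."""
    neighbors = []
    total = 0
    for gap in vlq_decode(blob):
        total += gap
        neighbors.append(total)
    return neighbors


blob = compress_neighbors([1000, 1003, 1010, 5000])
restored = decompress_neighbors(blob)
```

Small gaps fit in a single byte, so a sorted neighbour list usually takes fewer bytes than fixed-width vertex ids, which reduces the volume of data moved through memory.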

7. The device for reducing power consumption of high-performance computing memory according to claim 6, characterized in that: The many-core coprocessor, combined with the high-performance memory units, has gather/scatter capabilities. The bottom-up and top-down algorithms run on the many-core coprocessor, using thousands of threads to traverse the graph.

8. The device for reducing power consumption of high-performance computing memory according to claim 1, characterized in that: In the scheduling work of the allocation module, I/O resources, computing resources, accelerator resources, network resources, and data and software library resources are dynamically scheduled and configured according to the characteristics of the task at its different stages.

9. A method of using a device for reducing the power consumption of high-performance computing memory, characterized in that the method includes the following specific steps:
Step 1: External computation data is input into a high-performance memory unit for storage and is then computed by multiple many-core coprocessors. The CPU and GPU in the many-core coprocessors read data directly from the high-performance memory unit, and the thread-parallel operation of the many-core coprocessors is completed through the CHAOS parallel framework.
Step 2: After receiving the data transmitted from the computing module, the monitoring module uses the resource management unit for real-time monitoring and sends normal data to the analysis module through the data transmission unit.
Step 3: Multiple algorithm units in the analysis module combine the top-down and bottom-up search methods to analyze the data. Some algorithm units use a bitmap data structure to represent the visited structure in the breadth-first search algorithm. The data analysis is performed in parallel by multiple threads. After the analysis is completed, the analysis results are sent to the allocation module.
Step 4: The allocation module defines the received analysis results as tasks to be run and sends them to the message management unit. Slave node executors running in different resource environments monitor the message management unit for the tasks they need to execute. When a task for a slave node executor appears, the corresponding slave node executor executes it. Finally, the slave node executor stores both the input and output files in the shared storage unit.
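
By way of non-limiting illustration, the task flow of Step 4 (the master defines tasks from analysis results, the message management unit holds them, slave node executors in different resource environments pick up their own tasks, and inputs and outputs go to shared storage) might be sketched as follows. The function name `run_allocation`, the environment labels `"cpu"` and `"gpu"`, the record layout, and the placeholder workload (a sum) are assumptions made for this sketch. In-process queues and threads stand in for the message management unit and the executors.

```python
import queue
import threading


def run_allocation(analysis_results, environments=("cpu", "gpu")):
    """Dispatch analysis results as tasks to per-environment executors."""
    # Message management unit: one queue per resource environment.
    message_unit = {env: queue.Queue() for env in environments}
    # Shared storage unit: here just a dictionary guarded by a lock.
    shared_storage = {}
    storage_lock = threading.Lock()

    def slave_node_executor(env):
        # Each executor watches only the queue of its own environment.
        while True:
            task = message_unit[env].get()
            if task is None:          # sentinel: no more tasks
                break
            task_id, payload = task
            output = sum(payload)     # placeholder for the real computation
            with storage_lock:
                shared_storage[task_id] = {
                    "env": env,
                    "input": payload,
                    "output": output,
                }

    executors = [threading.Thread(target=slave_node_executor, args=(env,))
                 for env in environments]
    for t in executors:
        t.start()

    # Master node manager: turn each analysis result into a task and publish it.
    for task_id, result in enumerate(analysis_results):
        message_unit[result["env"]].put((task_id, result["data"]))
    for env in environments:
        message_unit[env].put(None)
    for t in executors:
        t.join()
    return shared_storage


results = [
    {"env": "cpu", "data": [1, 2, 3]},
    {"env": "gpu", "data": [10, 20]},
]
storage = run_allocation(results)
```

Keeping one queue per environment means an executor never has to inspect tasks meant for another environment; a single queue with filtering would be an equally valid reading of the claim.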