Supercomputing task execution method, control device, storage medium and cluster system
By establishing an SSH protocol communication mechanism between the workstation and the server, generating a resource polling table and an arbitrator, the problem of low resource utilization in distributed systems is solved, and the rational allocation of supercomputing cluster resources and the balance of task execution are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUXI STARS MICRO SYSTEM TECHNOLOGIES CO LTD
- Filing Date
- 2025-07-07
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This application belongs to the field of supercomputing technology, and specifically relates to a supercomputing task execution method, control device, storage medium and cluster system.
Background Technology
[0002] In distributed systems, supercomputing applications may simultaneously request large-scale resources, such as server computing resources, storage resources, and network bandwidth, for supercomputing. Related resource allocation methods, such as static allocation or simple random allocation, cannot adapt to complex and variable load conditions, easily leading to low utilization and uneven resource usage of servers and other resources in large-scale clusters.
[0003] In related technologies, while polling algorithms can allocate resources sequentially and achieve a relatively balanced resource distribution, they perform poorly when faced with tasks of different priorities or with significantly different resource requirements. For example, high-performance resources may be needed to execute high-priority tasks because of their superior performance, but because of the fixed polling order, they must wait until the low-performance resources have been polled before the high-priority task can be allocated to them for execution.
Summary of the Invention
[0004] The purpose of this application is to provide a supercomputing task execution method, control device, storage medium and cluster system, which aims to solve the problems of low resource utilization, uneven resource use and unreasonable execution of high and low priority tasks in large-scale clusters.
[0005] According to a first aspect of this application, a supercomputing task execution method is provided, applied to a workstation. The supercomputing task execution method includes: sending request information to a server based on a communication mechanism established with the server; receiving resource information of a supercomputing cluster in response to the request information from the server; generating a resource polling table and a resource arbitrator that conform to the supercomputing cluster based on the resource information of the supercomputing cluster, and synchronizing the resource polling table and the resource arbitrator to the server; and after the supercomputing cluster enters a ready state, batch distributing supercomputing tasks to the server according to application requirements, so that the supercomputing cluster executes the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator.
[0006] A communication mechanism based on the SSH protocol is established between the workstations and the server (the computing nodes of the supercomputing cluster). Based on this established communication mechanism, the workstations actively request the supercomputing cluster to obtain all resource information. The server responds to the workstations' requests, replying with all resource information. The workstations calculate a resource polling table and resource arbitrator suitable for the supercomputing cluster and synchronize them to the cluster. The supercomputing cluster receives and confirms the information. After the supercomputing cluster enters a ready state, the workstations batch-issue supercomputing task execution requests according to application requirements. The server executes the corresponding supercomputing tasks based on the resource polling table and resource arbitrator. The resource polling table and resource arbitrator ensure the rationality of resource selection, enabling the reasonable allocation of all server resources in the supercomputing cluster.
[0007] In an optional implementation, before sending the request information to the server, the supercomputing task execution method further includes: requesting storage space from the server to create a temporary data folder on the server and generating a UUID folder with a unique identifier.
[0008] Resource information is saved as a file in the uuid folder for data verification and result checking.
[0009] In an optional implementation, generating a resource polling table and a resource arbitrator that conform to the resource information of the supercomputing cluster includes: extracting at least one of the following feature values from the resource information of the supercomputing cluster: peak performance of computing nodes, number and size of computing node memory, storage type and size of computing nodes, network type and supported speed of computing nodes; and generating the resource polling table and the resource arbitrator based on at least one feature value from the resource information of the supercomputing cluster.
[0010] In an optional implementation, the resource polling table includes two resource polling tables. Each of the two resource polling tables includes an entry index and corresponding entry entity data. The two resource polling tables are configured such that their entries are independent, the data is not shared, and no entry is duplicated.
[0011] In an optional implementation, the resource arbitrator includes a time arbitrator and a data throughput arbitrator. The supercomputing cluster executing the corresponding supercomputing task according to the resource polling table and the resource arbitrator includes: determining the maximum timeout for the corresponding computing node to execute the supercomputing task according to the time arbitrator; executing the supercomputing task on the computing node within the determined maximum timeout; and after the supercomputing task is completed, not selecting the node again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase; and determining the maximum data throughput for the corresponding computing node to execute the supercomputing task according to the data throughput arbitrator; executing the supercomputing task on the computing node with the determined maximum data throughput; and after the supercomputing task is completed, not selecting the node again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase, wherein the time arbitrator and the data throughput arbitrator do not operate simultaneously.
[0012] By using specific resource polling tables and resource arbitrators, the rationality of resource selection can be ensured, so that all server resources in the supercomputing cluster are allocated reasonably.
[0013] In an optional implementation, the supercomputing task execution method further includes: receiving the supercomputing task execution result sent by the server; and displaying the supercomputing task execution result on the user terminal.
[0014] According to a second aspect of this application, a supercomputing task execution method is provided, applied on a server side. The supercomputing task execution method includes: receiving request information sent by a workstation based on a communication mechanism established with a workstation; responding to the request information, obtaining resource information of a supercomputing cluster, and sending the resource information of the supercomputing cluster to the workstation; receiving a resource polling table and a resource arbitrator generated by the workstation based on the resource information of the supercomputing cluster; and after the supercomputing cluster enters a ready state, receiving supercomputing tasks distributed in batches by the workstation according to application requirements, and executing the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator.
[0015] In an optional implementation, before receiving the request information sent by the workstation, the supercomputing task execution method further includes: receiving the storage space application from the workstation, establishing a temporary data folder, and generating a UUID folder with a unique identifier; and after obtaining the resource information of the supercomputing cluster, storing the obtained resource information into the UUID folder.
[0016] In an optional implementation, after receiving the resource polling table and the resource arbitrator sent by the workstation, the supercomputing task execution method further includes: for the resource polling table, verifying whether the number of servers executing the supercomputing task is accurate, and verifying whether the configuration of each data entry is accurate; and for the resource arbitrator, determining which arbitration method is in effect.
[0017] In an optional implementation, the resource polling table includes two resource polling tables. Each of the two resource polling tables includes an entry index and corresponding entry entity data. The two resource polling tables are configured such that their entries are independent, the data is not shared, and no entry is duplicated.
[0018] In an optional implementation, the resource arbitrator includes a time arbitrator and a data throughput arbitrator. The execution of the corresponding supercomputing task according to the resource polling table and the resource arbitrator includes: determining the maximum timeout for the corresponding computing node to execute the supercomputing task based on the time arbitrator; executing the supercomputing task on the computing node within the determined maximum timeout; and after the supercomputing task is completed, the node will not be selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase; and determining the maximum data throughput for the corresponding computing node to execute the supercomputing task based on the data throughput arbitrator; executing the supercomputing task on the computing node with the determined maximum data throughput; and after the supercomputing task is completed, the node will not be selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase. The time arbitrator and the data throughput arbitrator do not operate simultaneously.
[0019] In an optional implementation, after receiving the supercomputing task execution requests issued in batches by the workstation according to application requirements, the supercomputing task execution method further includes: placing the supercomputing task execution requests into an execution queue middleware, so that the middleware selects suitable computing nodes and executes the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator.
[0020] According to a second aspect of this application, a workstation is provided, the workstation including a control device, the control device including: a memory, a processor and a computer program stored in the memory and executable on the processor, the processor executing the computer program to implement the above-described supercomputing task execution method applied to the workstation.
[0021] According to a third aspect of this application, a server is provided, the server including a control device, the control device including: a memory, a processor and a computer program stored in the memory and executable on the processor, the processor executing the computer program to implement the above-described supercomputing task execution method applied to the server.
[0022] According to a fourth aspect of this application, a machine-readable storage medium is provided, on which instructions are stored, which cause a machine to execute the supercomputing task execution method applied to a workstation or the supercomputing task execution method applied to a server as described above.
[0023] According to a fifth aspect of this application, a multi-data center cluster system is provided, the multi-data center cluster system comprising multiple clusters, the aforementioned server electrically connected to each of the multiple clusters, and the aforementioned workstation electrically connected to the server, each cluster comprising multiple computing nodes.
[0024] Through the above technical solution, the supercomputing task execution method provided in this application establishes a communication mechanism based on the SSH protocol between the workstation and the server (the computing nodes of the supercomputing cluster). Based on the established communication mechanism, the workstation actively requests the supercomputing cluster to obtain all resource information of the supercomputing cluster. The server responds to the workstation's request by replying with all resource information of the supercomputing cluster. The workstation calculates a resource polling table and resource arbitrator that conform to the supercomputing cluster and synchronizes them to the supercomputing cluster. The supercomputing cluster receives and confirms the information. After the supercomputing cluster enters the ready state, the workstation issues supercomputing task execution requests in batches according to application requirements, and the server executes the corresponding supercomputing tasks according to the resource polling table and resource arbitrator. This application embodiment can ensure the rationality of resource selection through the resource polling table and resource arbitrator, so as to achieve reasonable allocation of all server resources in the supercomputing cluster.
[0025] Other features and advantages of this application will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the application. The objectives and other advantages of this application may be realized and obtained by means of the structures and processes shown in the description and the accompanying drawings.
Attached Figure Description
[0026] To more clearly illustrate the technical solutions in the embodiments or related technologies of this application, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0027] Figure 1 is a topology diagram of a multi-data center cluster system architecture provided in an exemplary embodiment of this application.
[0028] Figure 2 is a flowchart illustrating the supercomputing task execution method provided in an exemplary embodiment of this application.
[0029] Figure 3 is a schematic diagram of a resource polling table and a resource arbitrator provided in an exemplary embodiment of this application.
[0030] Figure 4 is a flowchart illustrating a supercomputing task execution method provided in another exemplary embodiment of this application.
[0031] Figure 5 is a schematic diagram of the workflow of a multi-data center cluster system provided in an exemplary embodiment of this application.
Detailed Implementation
[0032] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0033] As mentioned above, while round-robin (polling) algorithms can allocate resources sequentially and achieve a relatively balanced resource distribution, they perform poorly when faced with tasks of different priorities or with significantly different resource requirements. For example, high-performance resources may be needed to execute high-priority tasks because of their superior performance, but because of the fixed round-robin order, they must wait until the low-performance resources have had their turn before the high-priority task can be allocated to them for execution. Furthermore, arbitration mechanisms are typically based on a single criterion, such as first-come, first-served handling of resource requests or the remaining amount of available resources, and fail to comprehensively consider multiple factors such as task priority, resource performance efficiency, and the overall load balancing of the distributed system, making it difficult to achieve reasonable resource allocation. To address this, this application provides a supercomputing task execution method.
[0034] Before explaining the embodiments of this application in detail, the multi-data center cluster system architecture to which the embodiments apply is introduced first. Figure 1 illustrates the architecture (topology) of a large-scale multi-data center cluster system for supercomputing. A supercomputing cluster may contain multiple data centers, and communication between these data centers is established using switches or routers. Within the same data center, several switches can be used to establish communication between servers. Referring to Figure 1, 101-116 indicate computing nodes (clusters) within the supercomputing cluster; 120-123 indicate switches within the supercomputing cluster; 124 indicates the router for communication between data centers; 125 indicates the server (jump server) through which the supercomputing cluster is accessed; and 126 indicates the workstation. 101-104 represent the four clusters in data center 1, each with multiple server resources deployed, the number depending on actual usage requirements. Communication between clusters 101-104 relies on switch 120; that is, servers in clusters 101-104 send and receive data through switch 120. Similarly, the clusters within data centers 2, 3, and 4 in Figure 1 exchange data in the same way. Data transmission and exchange between data centers relies on routing device 124; that is, data centers do not communicate directly but must pass through routing device 124. Server device 125 is independent of the supercomputing cluster, is not a supercomputing node, and is connected to routing device 124. Workstation 126 establishes communication with server device 125, allowing end users to access the supercomputing cluster, query its computing, storage, and network resources, and issue supercomputing application tasks through workstation 126. Upon completion of the supercomputing application execution, the supercomputing cluster returns the execution result to workstation 126, where end users can view it.
[0035] Referring to Figure 2, the supercomputing task execution method provided in this embodiment can be applied to workstation 126 in Figure 1. The supercomputing task execution method may include the following steps:
[0036] Step S210: Based on the communication mechanism established with the server, send request information to the server.
[0037] In this embodiment, before the communication mechanism is established between the workstation and the server, the workstation already stores relevant information about the server, such as IP address, login account, login password, and other necessary information. Furthermore, the server can be configured to communicate with the data center of the supercomputing cluster; it is also configured to query and modify information including but not limited to: 1) information about all servers in the data center of the supercomputing cluster (e.g., IP address, CPU type, storage, network, etc.); 2) administrator and remote access permissions for all server nodes; 3) queue middleware plugins for supercomputing task distribution, relay, and collection.
[0038] In a preferred embodiment of this application, before sending the request information to the server, the supercomputing task execution method may further include: requesting storage space from the server to create a temporary data folder on the server and generating a UUID folder with a unique identifier.
[0039] For example, the workstation and the server can establish communication based on the Secure Shell (SSH) protocol. SSH is a protocol used for secure remote login and other secure network services over insecure networks, and communication mechanisms based on the SSH protocol have the ability to send and receive data bidirectionally. After communication is established, the workstation can request storage space from the server, create a temporary data folder, and generate a folder with a unique identifier (uuid). This uuid folder can be used for data verification and result checking operations described below. Based on the communication mechanism established with the server, the workstation sends a request to the server, requesting resource information from all servers in the data center of the supercomputing cluster.
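The folder-creation step described above can be illustrated with a minimal Python sketch. It only builds the argument list for the standard `ssh` command-line client and does not open a connection. The function name `build_uuid_folder_command`, the account `hpcuser`, the address `192.0.2.10`, and the base directory `/tmp/hpc_tasks` are hypothetical values chosen for illustration; the application does not specify them.

```python
import shlex
import uuid
from pathlib import PurePosixPath


def build_uuid_folder_command(user: str, host: str, base_dir: str, folder_id: str) -> list[str]:
    """Return an argv list that would create <base_dir>/<folder_id> on the server via ssh."""
    remote_path = PurePosixPath(base_dir) / folder_id
    # shlex.quote protects the path from shell interpretation on the remote side.
    remote_cmd = f"mkdir -p {shlex.quote(str(remote_path))}"
    return ["ssh", f"{user}@{host}", remote_cmd]


# Hypothetical values; a real workstation would use its stored server information.
folder_id = str(uuid.uuid4())
command = build_uuid_folder_command("hpcuser", "192.0.2.10", "/tmp/hpc_tasks", folder_id)
```

The resulting list could be passed to `subprocess.run` on a workstation that has an SSH client and valid credentials; that step is omitted here because it depends on the deployment.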
[0040] Step S220: Receive the resource information of the supercomputing cluster in response to the request information sent by the server.
[0041] Following the example above, after receiving the request information, the server can retrieve the resource information of all servers in the supercomputing cluster's data center (e.g., IP address, CPU type, storage, network information, etc.) from the database, save the resource information as a file in the uuid folder, and send the resource information to the workstation in the form of a response message.
[0042] Step S230: Based on the resource information of the supercomputing cluster, generate a resource polling table and a resource arbitrator that conform to the supercomputing cluster, and synchronize the resource polling table and the resource arbitrator to the server.
[0043] In a preferred embodiment of this application, generating a resource polling table and a resource arbitrator that conform to the supercomputing cluster's resource information may include: extracting at least one of the following characteristic values from the supercomputing cluster's resource information: peak performance of the supercomputing cluster's computing nodes, quantity and size of computing node memory, storage type and size of computing nodes, network type and supported speed of computing nodes; and generating a resource polling table and a resource arbitrator based on at least one characteristic value from the supercomputing cluster's resource information.
[0044] For example, after receiving the supercomputing cluster resource information from the server, the workstation extracts the following features from the resource information: 1) the theoretical peak performance of the supercomputing cluster's computing nodes; 2) the amount and size of memory on the computing nodes; 3) the storage type and size of the computing nodes; and 4) the network type and supported speed of the computing nodes. The peak performance of a computing node = number of CPU cores on the computing node × number of floating-point operations per cycle × clock frequency. For example, a computing node with 8 cores, a clock frequency of 3 GHz, and 4 floating-point operations per cycle has a peak performance of 8 × 4 × 3.0 = 96 GFLOPS.
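The peak-performance formula above can be written as a short Python function. This is only a sketch of the stated arithmetic; the function and parameter names are chosen here for illustration and do not come from the application.

```python
def peak_gflops(cores: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak = cores x floating-point operations per cycle x clock frequency (GHz)."""
    return cores * flops_per_cycle * clock_ghz


# The worked example from the text: 8 cores, 4 operations per cycle, 3.0 GHz.
example = peak_gflops(cores=8, flops_per_cycle=4, clock_ghz=3.0)
print(example)  # 96.0
```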
[0045] Based on the aforementioned characteristic values, a resource polling table and a resource arbitrator are generated. For example, the resource information of a supercomputing cluster can be input into a pre-configured algorithm or function, which outputs a specific resource polling table and a resource arbitrator.
[0046] The preferred resource polling table in this application embodiment may include two resource polling tables. Each resource polling table includes a table entry index and corresponding table entry entity data. The two resource polling tables are configured such that the table entries are independent, the data is not shared, and the table entries are not duplicated.
[0047] Referring to Figure 3, as indicated by 301 and 302, the resource polling table can include Table A and Table B. Each table includes multiple entry indexes and the corresponding entry entity data. Each entry can include key information. The data content of Tables A and B is not duplicated, and the data in Tables A and B is not shared, so the two tables remain independent. The entry indexes can be sequentially generated sequence numbers, and the entry entity data can store resource information for the corresponding server.
[0048] In this embodiment, resource information can be distributed in two resource polling tables (Table A and Table B) in an approximately even manner. The allocation method tends to be random and does not refer to any characteristic attributes. The two resource polling tables can be configured as Table A and Table B, with consistent table entries, both including index and entry items. The table entries in Table A and Table B are not fixed and can be allocated according to the actual computing nodes of the supercomputing cluster. In Table A and Table B, the table entries are independent, the data is not shared, and all entries are unique.
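One way to picture the approximately even, random distribution into Table A and Table B is the following Python sketch. The application does not disclose the actual allocation algorithm, so the shuffle-and-split approach, the function name `split_into_tables`, and the sample addresses are assumptions made for illustration.

```python
import random


def split_into_tables(node_ips, seed=None):
    """Randomly split unique node addresses into two disjoint, near-equal tables.

    Each table maps a sequential entry index (starting at 1) to a node address.
    """
    rng = random.Random(seed)
    unique_ips = list(dict.fromkeys(node_ips))  # drop duplicates, keep first occurrence
    rng.shuffle(unique_ips)
    half = (len(unique_ips) + 1) // 2
    table_a = {index: ip for index, ip in enumerate(unique_ips[:half], start=1)}
    table_b = {index: ip for index, ip in enumerate(unique_ips[half:], start=1)}
    return table_a, table_b


# Hypothetical node addresses for illustration only.
nodes = [f"10.0.0.{i}" for i in range(1, 8)]
table_a, table_b = split_into_tables(nodes, seed=0)
```

Because the list is split at a single point after shuffling, no address can appear in both tables, and the table sizes differ by at most one.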
[0049] Referring to Figure 3, in this embodiment the entry entity data may include the fields indicated by 304:
[0050] The IP field is configured to indicate the IPv4 address of the corresponding computing node on the network.
[0051] The weight field is configured to indicate the weight value of the corresponding computing node, which can be used to describe the computing power of the corresponding computing node.
[0052] The throughput field is configured to indicate the data throughput arbiter default unit value of the corresponding compute node. It can be used to describe the storage capacity value of the corresponding compute node and can be configured by the user.
[0053] The timer field is configured to indicate the default unit value of the time arbiter for the corresponding compute node. It can be used to describe the network capability value of the corresponding compute node. This value can be configured by the user.
[0054] The `tag` field is configured to indicate the status of the corresponding compute node, which can be categorized as online, running, or offline. An online status indicates that the compute node can be assigned tasks; a running status indicates that the compute node is currently performing a task; and an offline status indicates that the compute node cannot be assigned tasks.
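The five fields described above could be modeled as a small record type. The following Python sketch is illustrative only: the class names, the choice of seconds and megabytes as units, and the helper `can_accept_task` are assumptions, not part of the application.

```python
from dataclasses import dataclass
from enum import Enum


class NodeTag(Enum):
    ONLINE = "online"    # node can be assigned tasks
    RUNNING = "running"  # node is currently performing a task
    OFFLINE = "offline"  # node cannot be assigned tasks


@dataclass
class TableEntry:
    ip: str             # IPv4 address of the computing node
    weight: int         # weight describing the node's computing power
    throughput_mb: int  # default unit value of the data throughput arbitrator (MB assumed)
    timer_s: int        # default unit value of the time arbitrator (seconds assumed)
    tag: NodeTag = NodeTag.ONLINE

    def can_accept_task(self) -> bool:
        """Only nodes tagged online are eligible for a new task."""
        return self.tag is NodeTag.ONLINE


# Hypothetical entry for illustration.
entry = TableEntry(ip="10.0.0.1", weight=3, throughput_mb=1, timer_s=60)
```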
[0055] The preferred resource arbitrator in this application embodiment may include a time arbitrator and a data throughput arbitrator. That is, this application embodiment provides two arbitration methods: time arbitration or data throughput arbitration. The time arbitrator and the data throughput arbitrator do not operate simultaneously.
[0056] For example, after the workstation generates a resource polling table and a resource arbitrator that conform to the supercomputing cluster based on the resource information of the supercomputing cluster using the above method, it can send the resource polling table and the resource arbitrator to the server through the SSH protocol communication mechanism; after the server receives the corresponding message, it confirms that the reception was successful.
[0057] In a preferred embodiment of this application, after the server receives the resource polling table and resource arbitrator sent by the workstation, the supercomputing task execution method may further include: for the resource polling table, verifying whether the number of servers executing the supercomputing task is accurate, and verifying whether the configuration of each data entry is accurate; and for the resource arbitrator, determining which arbitration method is in effect.
[0058] In this embodiment, there is a time gap between the server sending the supercomputing cluster's resource information and receiving the resource polling table and resource arbitrator. Therefore, after receiving the resource polling table and resource arbitrator from the workstation, the server can verify whether the server entries in the resource polling table are still correct (e.g., a server may have been powered off in the meantime). If there is a discrepancy, the server configuration is updated to the latest version, for example, by deleting the powered-off server and its corresponding configuration. For example, as described above, after receiving the request information, the server can retrieve the resource information (e.g., IP address, CPU type, storage, network information, etc.) of all servers in the supercomputing cluster's data center from the database and save the resource information as a file in the UUID folder. After receiving the resource polling table and resource arbitrator from the workstation, the server performs verification operations on the corresponding data. These operations include: verifying whether all server entries in resource polling tables A and B are consistent with all servers in the supercomputing cluster's data center; confirming whether each entry in the table matches the actual server configuration and, if not, updating the entry to the latest server configuration; and confirming which arbitration method of the resource arbitrator is in effect (i.e., the time arbitrator or the data throughput arbitrator). After completing these verification operations, the server persists the resource polling table and resource arbitrator to its local UUID folder, in a file format such as .csv.
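The server-side verification and persistence step might look like the following Python sketch. It assumes a simplified entry with only `ip` and `weight` columns and writes CSV text to an in-memory buffer rather than to the UUID folder; the function names and sample data are hypothetical.

```python
import csv
import io


def verify_table(table, live_nodes):
    """Drop entries whose node is gone and refresh the rest from current data.

    table: {entry_index: {"ip": ..., "weight": ...}}
    live_nodes: {ip: {"weight": ...}} describing nodes currently in the data center.
    """
    verified = {}
    for index, entry in table.items():
        current = live_nodes.get(entry["ip"])
        if current is None:
            continue  # node disappeared (for example, powered off): delete its entry
        verified[index] = {**entry, **current}  # overwrite stale fields
    return verified


def to_csv(table) -> str:
    """Serialize a verified table to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["index", "ip", "weight"], lineterminator="\n")
    writer.writeheader()
    for index, entry in table.items():
        writer.writerow({"index": index, **entry})
    return buffer.getvalue()


# Hypothetical data: node 10.0.0.2 was powered off, and 10.0.0.3 changed weight.
received = {
    1: {"ip": "10.0.0.1", "weight": 3},
    2: {"ip": "10.0.0.2", "weight": 1},
    3: {"ip": "10.0.0.3", "weight": 5},
}
live = {"10.0.0.1": {"weight": 3}, "10.0.0.3": {"weight": 2}}
verified = verify_table(received, live)
csv_text = to_csv(verified)
```

This sketch keeps the original entry indexes after a deletion; whether the real system renumbers the entries is not stated in the text.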
[0059] Step S240: After the supercomputing cluster enters the ready state, supercomputing tasks are distributed to the server in batches according to application requirements, and the supercomputing cluster executes the corresponding supercomputing tasks according to the resource polling table and resource arbitrator.
[0060] For example, after the supercomputing cluster (the server resources of all computing nodes) enters the ready state, the workstation can send supercomputing application requests over the SSH-based communication mechanism to distribute supercomputing tasks to the server in batches. After receiving the supercomputing application requests from the workstation, the server places the supercomputing tasks into an execution queue middleware. The middleware selects suitable computing nodes for supercomputing task execution based on the resource polling table and the resource arbitrator.
[0061] In a preferred embodiment of this application, the supercomputing cluster executing the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator may include: determining the maximum timeout for a corresponding computing node to execute a supercomputing task based on the time arbitrator; executing the supercomputing task on the computing node within the determined maximum timeout; and after the supercomputing task is completed, the node will not be selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase; and determining the maximum data throughput for a corresponding computing node to execute a supercomputing task based on the data throughput arbitrator; executing the supercomputing task on the computing node with the determined maximum data throughput; and after the supercomputing task is completed, the node will not be selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase. The time arbitrator and the data throughput arbitrator do not operate simultaneously.
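The rule that a node is not selected again until every entry in the table has been polled can be sketched in Python as follows. This is a simplified model that ignores timeouts, throughput limits, and node status; the function name `run_polling_rounds` and the sample node and task names are assumptions for illustration.

```python
from collections import deque


def run_polling_rounds(node_ips, tasks):
    """Assign queued tasks so that each node receives at most one task per polling phase.

    Returns a list of polling phases, each a list of (node_ip, task) pairs.
    """
    if not node_ips:
        raise ValueError("the polling table has no entries")
    pending = deque(tasks)
    rounds = []
    while pending:
        assignments = []
        for ip in node_ips:  # one pass over the table is one polling phase
            if not pending:
                break
            assignments.append((ip, pending.popleft()))
        rounds.append(assignments)
    return rounds


# Hypothetical nodes and tasks for illustration.
rounds = run_polling_rounds(["node-a", "node-b"], ["t1", "t2", "t3"])
```

With two nodes and three tasks, `node-a` receives its second task only in the second phase, after `node-b` has also been polled.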
[0062] Please refer to Figure 3. As indicated by 303, the resource arbitrator in this application embodiment includes two arbitration methods: a time arbitrator and a data throughput arbitrator. In this application embodiment, the maximum timeout for a computing node to execute a supercomputing task can be expressed as (weight * default value of the time arbitrator), where the weight value can be obtained from the resource polling table. For example, if the weight of a computing node selected as a server is 3 and the default value is 60 seconds, then the maximum timeout for that computing node to execute a supercomputing task is 3 * 60 = 180 seconds. This can be interpreted as follows: the computing node executes a supercomputing task for at most 180 seconds; after the task completes, the node is not selected again as a task execution server until all entries in the resource polling table have been polled and executed, at which point it is allocated a new supercomputing task in the next polling phase. Similarly, the maximum data throughput for a computing node to execute a supercomputing task can be expressed as (weight * default value of the data throughput arbitrator), where the weight value can be obtained from the resource polling table. For example, if the computing node selected as a server has a weight of 3 and the default value is 1 MB, then the maximum data throughput for that computing node to execute a supercomputing task is 3 * 1 MB = 3 MB. This can be interpreted as follows: the computing node executes a supercomputing task with a maximum data throughput of 3 MB; after the task completes, the node is not selected again as a task execution server until all entries in the resource polling table have been polled and executed, at which point it is allocated a new supercomputing task in the next polling phase. The two arbitrators thus arbitrate along two dimensions, time and data, and end users can choose between them according to their actual usage.
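The arithmetic in the two examples above can be reproduced with a minimal sketch. The names `ArbitratorMode` and `node_budget` are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the description does not state which megabyte convention is meant.

```python
from enum import Enum


class ArbitratorMode(Enum):
    """Only one arbitration method is active at a time."""

    TIME = "time"
    THROUGHPUT = "throughput"


DEFAULT_TIMEOUT_S = 60                   # default value from the example, in seconds
DEFAULT_THROUGHPUT_BYTES = 1024 * 1024   # 1 MB; binary megabyte is an assumption


def node_budget(weight: int, mode: ArbitratorMode) -> int:
    """Return weight * default value for the active arbitrator."""
    if weight <= 0:
        raise ValueError("weight must be a positive integer")
    if mode is ArbitratorMode.TIME:
        return weight * DEFAULT_TIMEOUT_S
    return weight * DEFAULT_THROUGHPUT_BYTES


print(node_budget(3, ArbitratorMode.TIME))        # seconds
print(node_budget(3, ArbitratorMode.THROUGHPUT))  # bytes
```

Passing the mode as a single enum value mirrors the statement that the two arbitrators do not operate simultaneously.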
[0063] In a preferred embodiment of this application, the supercomputing task execution method may further include: receiving the supercomputing task execution result sent by the server; and displaying the supercomputing task execution result on the user terminal.
[0064] Following the example above, once all supercomputing tasks in the server's queue have been completed, the server can send the execution results back to the workstation. Upon receiving the results, the workstation can display them to the end user, allowing the user to analyze the supercomputing tasks based on the results.
[0065] In a preferred embodiment of this application, the server can return the supercomputing task execution results in batches, while releasing the corresponding resources and updating the resource polling table and resource arbitrator.
[0066] Accordingly, the supercomputing task execution method provided in this application establishes a communication mechanism based on the SSH protocol between the workstation and the server (the computing nodes of the supercomputing cluster). Based on the established communication mechanism, the workstation actively requests the supercomputing cluster to obtain all resource information of the supercomputing cluster. The server responds to the workstation's request by replying with all resource information of the supercomputing cluster. The workstation calculates a resource polling table and a resource arbitrator that conform to the supercomputing cluster and synchronizes them to the supercomputing cluster. The supercomputing cluster receives and confirms the information. After the supercomputing cluster enters the ready state, the workstation issues supercomputing task execution requests in batches according to application requirements, and the server executes the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator. This application embodiment ensures the rationality of resource selection through the resource polling table and the resource arbitrator, so as to achieve reasonable allocation of all server resources in the supercomputing cluster.
[0067] Please refer to Figure 4. The supercomputing task execution method provided in this application embodiment can be applied to the server 125 shown in Figure 1, and may include the following steps:
[0068] Step S410: Based on the communication mechanism established with the workstation, receive the request information sent by the workstation.
[0069] In a preferred embodiment of this application, before receiving the request information sent by the workstation, the supercomputing task execution method may further include: receiving the workstation's storage space request, establishing a temporary data folder, and generating a UUID folder with a unique identifier; and after obtaining the resource information of the supercomputing cluster, storing the obtained resource information into the UUID folder.
[0070] As mentioned earlier, for example, the workstation and the server can establish communication based on the SSH protocol. After communication is established, the workstation can request storage space from the server, create a temporary data folder, and generate a folder with a unique UUID. Based on the communication mechanism established with the server, the workstation sends a request to the server, requesting resource information from all servers in the data center of the supercomputing cluster.
[0071] Step S420: In response to the request information, obtain the resource information of the supercomputing cluster and send the resource information of the supercomputing cluster to the workstation.
[0072] Following the example above, after receiving the request information, the server can retrieve the resource information of all servers in the data center of the supercomputing cluster from the information database, save the resource information as a file in the UUID folder, and send the resource information to the workstation in the form of a response message.
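As a rough sketch of the folder handling described above, the following creates a uniquely named folder under a temporary data directory and stores resource information in it as a file. The function names, the file name `resource_info.json`, the JSON format, and the sample record are all assumptions made for illustration; the application does not specify them.

```python
import json
import tempfile
import uuid
from pathlib import Path


def create_uuid_folder(base_dir: Path) -> Path:
    """Create a folder whose name is a freshly generated UUID."""
    folder = base_dir / str(uuid.uuid4())
    folder.mkdir(parents=True, exist_ok=False)
    return folder


def save_resource_info(folder: Path, resource_info: list[dict]) -> Path:
    """Persist the cluster resource information as a file in the UUID folder."""
    path = folder / "resource_info.json"   # hypothetical file name and format
    path.write_text(json.dumps(resource_info, indent=2), encoding="utf-8")
    return path


# Temporary data folder standing in for the storage space requested on the server.
base = Path(tempfile.mkdtemp(prefix="supercomputing-"))
session_folder = create_uuid_folder(base)
info_file = save_resource_info(
    session_folder, [{"ip": "10.0.0.1", "cpu_type": "x86_64"}]  # made-up sample record
)
print(info_file)
```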
[0073] Step S430: Receive the resource polling table and resource arbitrator generated by the workstation based on the resource information of the supercomputing cluster.
[0074] The preferred resource polling table in this application embodiment may include two resource polling tables. Each resource polling table includes a table entry index and corresponding table entry entity data. The two resource polling tables are configured such that the table entries are independent, the data is not shared, and the table entries are not duplicated.
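One possible in-memory shape for the two tables is sketched below, purely as an assumption: each table maps a table entry index to its entry entity data, nodes are split alternately between the tables, and each entry is deep-copied so that no data is shared. The alternating split, the field names, and the helper names are hypothetical; the application does not state how entries are divided between the two tables.

```python
from copy import deepcopy

PollingTable = dict[int, dict]  # table entry index -> table entry entity data


def build_tables(nodes: list[dict]) -> tuple[PollingTable, PollingTable]:
    """Split nodes alternately into two tables, copying so no data is shared."""
    table_a = {i: deepcopy(n) for i, n in enumerate(nodes[0::2])}
    table_b = {i: deepcopy(n) for i, n in enumerate(nodes[1::2])}
    return table_a, table_b


def tables_are_disjoint(a: PollingTable, b: PollingTable) -> bool:
    """True when no node appears twice, within one table or across both."""
    ids_a = {entry["node"] for entry in a.values()}
    ids_b = {entry["node"] for entry in b.values()}
    return len(ids_a) == len(a) and len(ids_b) == len(b) and ids_a.isdisjoint(ids_b)


nodes = [
    {"node": "node-a", "weight": 3},
    {"node": "node-b", "weight": 1},
    {"node": "node-c", "weight": 2},
]
table_a, table_b = build_tables(nodes)
print(tables_are_disjoint(table_a, table_b))
```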
[0075] The preferred resource arbitrator in this application embodiment may include a time arbitrator and a data throughput arbitrator. That is, this application embodiment provides two arbitration methods: time arbitration or data throughput arbitration. The time arbitrator and the data throughput arbitrator do not operate simultaneously.
[0076] Following the example above, after receiving the supercomputing cluster resource information from the server, the workstation generates a specific resource polling table and resource arbitrator based on the following characteristics in the resource information: 1) the (theoretical) peak performance of the supercomputing cluster's computing nodes; 2) the amount and size of memory on the computing nodes; 3) the storage type and size of the computing nodes; and 4) the network type and supported speed of the computing nodes. For the resource polling table, see the content indicated by 301 and 302 in Figure 3; for the resource arbitrator, see the content indicated by 304 in Figure 3. For detailed configuration of the resource polling table and resource arbitrator, please refer to the above text; it will not be repeated here.
[0077] In a preferred embodiment of this application, after step S430, the supercomputing task execution method may further include: for the resource polling table, verifying whether the number of servers executing the supercomputing task is accurate, and verifying whether the configuration of each data entry is accurate; and for the resource arbitrator, determining the effective method.
[0078] Following the example above, after receiving the resource polling table and resource arbitrator from the workstation, the server performs verification operations on the corresponding data. These operations include: verifying whether the server entries in resource polling table A and resource polling table B are consistent with all servers in the supercomputing cluster's data center; confirming whether each entry in the tables matches the actual server configuration and, if it does not, updating the entry to the latest server configuration; and confirming which resource arbitrator is in effect (i.e., the time arbitrator or the data throughput arbitrator). After completing these verification operations, the server persists the resource polling table and resource arbitrator to the local UUID folder, in a file format such as .csv.
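The verification and persistence steps above might look like the following sketch. The entry fields (`node`, `weight`, `cpus`), the inventory structure, and the output file name are hypothetical; the application only states that entries are checked against the actual server configuration and then persisted, for example as .csv.

```python
import csv
import tempfile
from pathlib import Path


def verify_table(entries: list[dict], inventory: dict[str, dict]) -> list[dict]:
    """Check server coverage, then refresh entries from the real configuration."""
    names = [entry["node"] for entry in entries]
    if sorted(names) != sorted(inventory):
        raise ValueError("polling table does not match the servers in the data center")
    # Where an entry disagrees with the actual configuration, the inventory wins.
    return [{**entry, **inventory[entry["node"]]} for entry in entries]


def persist_csv(entries: list[dict], path: Path) -> None:
    """Write the verified entries to a CSV file with a header row."""
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(entries[0].keys()))
        writer.writeheader()
        writer.writerows(entries)


entries = [
    {"node": "node-a", "weight": 3, "cpus": 16},   # stale CPU count on purpose
    {"node": "node-b", "weight": 1, "cpus": 8},
]
inventory = {"node-a": {"cpus": 32}, "node-b": {"cpus": 8}}

checked = verify_table(entries, inventory)
out = Path(tempfile.mkdtemp()) / "polling_table.csv"
persist_csv(checked, out)
print(out.read_text(encoding="utf-8"))
```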
[0079] Step S440: After the supercomputing cluster enters the ready state, it receives supercomputing tasks distributed in batches by workstations according to application requirements, and executes the corresponding supercomputing tasks according to the resource polling table and resource arbitrator.
[0080] In a preferred embodiment of this application, after receiving supercomputing task execution requests issued in batches by workstations according to application requirements, the supercomputing task execution method may further include: placing, by the middleware, the supercomputing task execution requests into an execution queue; and selecting, by the middleware, suitable computing nodes based on the resource polling table and the resource arbitrator to execute the corresponding supercomputing tasks.
[0081] As mentioned earlier, middleware is a crucial component for implementing asynchronous communication and data transfer. Continuing the example above, after the supercomputing cluster (server resources across all computing nodes) enters a ready state, the workstation can send supercomputing application requests over the SSH connection, thereby distributing supercomputing tasks to the server in batches. Upon receiving the supercomputing application requests from the workstation, the server places the supercomputing tasks into the middleware's execution queue. The middleware then selects suitable computing nodes for supercomputing task execution based on the resource polling table and the resource arbitrator.
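The asynchronous hand-off through an execution queue can be illustrated with a small sketch. It is a simplification under stated assumptions: the function name `run_middleware`, the task and node names, and the single consumer thread are hypothetical, and a plain rotation over the node list stands in for the resource polling table and resource arbitrator.

```python
import queue
import threading


def run_middleware(batches: list[list[str]], nodes: list[str]) -> list[tuple[str, str]]:
    """Queue batched tasks and let one consumer assign each task to a node."""
    pending: queue.Queue = queue.Queue()
    assignments: list[tuple[str, str]] = []

    def consumer() -> None:
        slot = 0
        while True:
            task = pending.get()
            if task is None:          # sentinel: no more tasks will arrive
                break
            # Plain rotation here; a real selector would consult table and arbitrator.
            assignments.append((task, nodes[slot % len(nodes)]))
            slot += 1

    worker = threading.Thread(target=consumer)
    worker.start()
    for batch in batches:             # producer side: tasks arrive in batches
        for task in batch:
            pending.put(task)
    pending.put(None)
    worker.join()
    return assignments


result = run_middleware([["task-1", "task-2"], ["task-3"]], ["node-a", "node-b"])
print(result)
```

Because the producer only enqueues and the consumer only dequeues, task submission does not block on node selection, which is the decoupling the middleware is described as providing.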
[0082] In a preferred embodiment of this application, the supercomputing cluster executing the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator may include: determining, based on the time arbitrator, the maximum timeout for the corresponding computing node to execute a supercomputing task; executing the supercomputing task on the computing node within the determined maximum timeout; and, after the supercomputing task is completed, not selecting the node again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase; and determining, based on the data throughput arbitrator, the maximum data throughput for the corresponding computing node to execute a supercomputing task; executing the supercomputing task on the computing node within the determined maximum data throughput; and, after the supercomputing task is completed, not selecting the node again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, at which point a new supercomputing task is allocated in the next polling phase. The time arbitrator and the data throughput arbitrator do not operate simultaneously.
[0083] Please refer to Figure 3. As indicated by 303, the resource arbitrator in this application embodiment includes two arbitration methods: a time arbitrator and a data throughput arbitrator. In this application embodiment, the maximum timeout for a computing node to execute a supercomputing task can be expressed as (weight * default value of the time arbitrator), where the weight value can be obtained from the resource polling table. For example, if the weight of a computing node selected as a server is 3 and the default value is 60 seconds, then the maximum timeout for that computing node to execute a supercomputing task is 3 * 60 = 180 seconds. This can be interpreted as follows: the computing node executes a supercomputing task for at most 180 seconds; after the task completes, the node is not selected again as a task execution server until all entries in the resource polling table have been polled and executed, at which point it is allocated a new supercomputing task in the next polling phase. Similarly, the maximum data throughput for a computing node to execute a supercomputing task can be expressed as (weight * default value of the data throughput arbitrator), where the weight value can be obtained from the resource polling table. For example, if the computing node selected as a server has a weight of 3 and the default value is 1 MB, then the maximum data throughput for that computing node to execute a supercomputing task is 3 * 1 MB = 3 MB. This can be interpreted as follows: the computing node executes a supercomputing task with a maximum data throughput of 3 MB; after the task completes, the node is not selected again as a task execution server until all entries in the resource polling table have been polled and executed, at which point it is allocated a new supercomputing task in the next polling phase. The two arbitrators thus arbitrate along two dimensions, time and data, and end users can choose between them according to their actual usage.
[0084] This application embodiment also provides a workstation, which includes a control device. The control device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the above-described supercomputing task execution method applied to the workstation.
[0085] This application also provides a server, which includes a control device. The control device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement a supercomputing task execution method applied to the server.
[0086] This application also provides a machine-readable storage medium storing instructions that cause a machine to execute a supercomputing task execution method applied to a workstation or applied to a server.
[0087] It should be noted that the control device and machine-readable storage medium described above can implement the supercomputing task execution method provided in the above embodiments. For specific implementation methods, please refer to the description of the supercomputing task execution method in the above embodiments, which will not be repeated here.
[0088] This application also provides a multi-data center cluster system, which may include multiple clusters, servers electrically connected to each of the multiple clusters, and workstations electrically connected to the servers, wherein each cluster includes multiple computing nodes.
[0089] In this embodiment of the application, the topology of the data center cluster system can be as shown in Figure 1, and the process by which this multi-data center cluster system executes supercomputing tasks can be as shown in Figure 5.
[0090] In step 501, the workstation and server can establish communication using the SSH protocol. Based on the SSH protocol, the workstation and server have the ability to send and receive data bidirectionally. After the communication is established, the workstation requests storage space from the server, creates a temporary data folder, and generates a UUID folder with a unique identifier. The contents of this UUID folder facilitate subsequent data verification, result checking, and other operations.
[0091] In step 502, based on the SSH communication mechanism created in step 501, the workstation sends a request to the server, requesting information on all server nodes in the data center of the supercomputing cluster.
[0092] In step 503, after receiving the request information, the server retrieves all server information of the supercomputing cluster data center (e.g., IP address, CPU type, storage, network information, etc.) from the information database, saves the resource information as a file in the UUID folder, and sends the resource information to the workstation in the form of a response message.
[0093] In step 504, the workstation receives the supercomputing cluster resource information from the server and generates a specific resource polling table and resource arbitrator based on the following characteristics of the resource information: 1) number of CPUs in the computing node * frequency of each CPU; 2) amount and size of memory in the computing node; 3) storage type and size of the computing node; 4) network type and supported speed of the computing node.
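The application does not disclose how these characteristics are combined into the weight stored in the resource polling table. The sketch below is therefore only one hypothetical scoring: it uses characteristic 1) alone (number of CPUs * frequency), normalizes against the weakest node, and rounds to an integer weight of at least 1. The field names `cpus` and `ghz`, the node names, and the normalization rule are all assumptions; memory, storage, and network are omitted for brevity.

```python
def cpu_score(node: dict) -> float:
    """Characteristic 1: number of CPUs multiplied by the frequency of each CPU."""
    return node["cpus"] * node["ghz"]


def assign_weights(nodes: list[dict]) -> dict[str, int]:
    """Derive integer weights relative to the weakest node (hypothetical rule)."""
    base = min(cpu_score(n) for n in nodes)
    if base <= 0:
        raise ValueError("every node needs a positive CPU score")
    return {n["name"]: max(1, round(cpu_score(n) / base)) for n in nodes}


nodes = [
    {"name": "node-a", "cpus": 48, "ghz": 2.5},   # made-up sample hardware
    {"name": "node-b", "cpus": 16, "ghz": 2.5},
]
weights = assign_weights(nodes)
print(weights)
```

With these sample values `node-a` receives three times the weight of `node-b`, so under the time arbitrator it would be given three times the default timeout.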
[0094] The workstation sends the resource polling table and resource arbitrator to the server. After receiving the corresponding information, the server confirms successful reception.
[0095] In step 505, the server receives the resource polling table and resource arbitrator from the workstation and performs verification. After confirming the server information, it persists the data to the local path. At the same time, all server resources enter a ready state and await task execution, and the server replies to the workstation that supercomputing tasks can be deployed.
[0096] In step 506, after receiving the ready-state reply from the server, the workstation can issue supercomputing application tasks in batches.
[0097] In step 507, the server receives the supercomputing application requests sent by the workstation and puts them into the execution queue of the middleware, and the middleware selects suitable computing nodes to execute the supercomputing tasks based on the resource polling table and the resource arbitrator.
[0098] In step 508, after all supercomputing tasks in the server's queue have been completed, the server replies with the execution results of all supercomputing application tasks to the workstation. Upon receiving the task execution results, the workstation displays them to the end user, who then performs task analysis based on the results.
[0099] It is understood that the circuit structures, names, and parameters described in the above embodiments are merely examples. Those skilled in the art can also make readily conceived combinations and adjustments to the structural features of the above embodiments according to their needs, and the concept of this application should not be limited to the specific details of the above examples.
[0100] Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A method for executing supercomputing tasks, characterized in that, applied to a workstation, the supercomputing task execution method includes: based on the communication mechanism established with the server, sending request information to the server; receiving the resource information of the supercomputing cluster sent by the server in response to the request information; based on the resource information of the supercomputing cluster, generating a resource polling table and a resource arbitrator conforming to the supercomputing cluster, and synchronizing the resource polling table and the resource arbitrator to the server; and after the supercomputing cluster enters the ready state, distributing supercomputing tasks to the server in batches according to application requirements, so that the supercomputing cluster executes the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator. The resource polling table includes two resource polling tables. Each resource polling table includes a table entry index and corresponding table entry entity data. The two resource polling tables are configured such that table entries are independent, data is not shared, and table entries are not duplicated. The resource arbitrator includes a time arbitrator and a data throughput arbitrator. The supercomputing cluster executing the corresponding supercomputing tasks based on the resource polling table and the resource arbitrator includes: according to the time arbitrator, determining the maximum timeout for the corresponding computing node to execute the supercomputing task; the computing node executing the supercomputing task within the determined maximum timeout; after the supercomputing task is completed, the node is not selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, and a new supercomputing task is allocated when entering the next polling phase; and according to the data throughput arbitrator, determining the maximum data throughput for the corresponding computing node to execute the supercomputing task; the computing node executing the supercomputing task within the determined maximum data throughput; after the supercomputing task is completed, the node is not selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, and a new supercomputing task is allocated when entering the next polling phase. The time arbitrator and the data throughput arbitrator do not operate simultaneously.
2. The supercomputing task execution method according to claim 1, characterized in that, Before sending the request information to the server, the supercomputing task execution method further includes: Request storage space from the server to create a temporary data folder on the server and generate a UUID folder with a unique identifier.
3. The supercomputing task execution method according to claim 1, characterized in that, The step of generating a resource polling table and a resource arbitrator that conform to the resource information of the supercomputing cluster includes: Extract at least one of the following features from the resource information of the supercomputing cluster: peak performance of the supercomputing nodes, number and size of memory on the computing nodes, storage type and size on the computing nodes, network type and supported speed of the computing nodes; and The resource polling table and the resource arbitrator are generated based on at least one feature value of the resource information of the supercomputing cluster.
4. The supercomputing task execution method according to claim 1, characterized in that, The supercomputing task execution method also includes: Receive the supercomputing task execution results sent by the server; and The results of the supercomputing task are displayed on the user's device.
5. A method for executing supercomputing tasks, characterized in that, applied to the server side, the supercomputing task execution method includes: based on the communication mechanism established with the workstation, receiving the request information sent by the workstation; in response to the request information, obtaining the resource information of the supercomputing cluster and sending it to the workstation; receiving from the workstation a resource polling table and a resource arbitrator that are generated based on the resource information of the supercomputing cluster and conform to the supercomputing cluster; and after the supercomputing cluster enters the ready state, receiving supercomputing tasks distributed in batches by the workstation according to application requirements, and executing the corresponding supercomputing tasks according to the resource polling table and the resource arbitrator. The resource polling table includes two resource polling tables. Each resource polling table includes a table entry index and corresponding table entry entity data. The two resource polling tables are configured such that table entries are independent, data is not shared, and table entries are not duplicated. The resource arbitrator includes a time arbitrator and a data throughput arbitrator. The execution of the corresponding supercomputing tasks based on the resource polling table and the resource arbitrator includes: according to the time arbitrator, determining the maximum timeout for the corresponding computing node to execute the supercomputing task; the computing node executing the supercomputing task within the determined maximum timeout; after the supercomputing task is completed, the node is not selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, and a new supercomputing task is allocated when entering the next polling phase; and according to the data throughput arbitrator, determining the maximum data throughput for the corresponding computing node to execute the supercomputing task; the computing node executing the supercomputing task within the determined maximum data throughput; after the supercomputing task is completed, the node is not selected again as a server for supercomputing task execution until all entries in the resource polling table have been polled and executed, and a new supercomputing task is allocated when entering the next polling phase. The time arbitrator and the data throughput arbitrator do not operate simultaneously.
6. The supercomputing task execution method according to claim 5, characterized in that, Before receiving the request information sent by the workstation, the supercomputing task execution method further includes: Receive the workstation's storage space request, create a temporary data folder, and generate a UUID folder with a unique identifier; and After obtaining the resource information of the supercomputing cluster, the obtained resource information is stored in the UUID folder.
7. The supercomputing task execution method according to claim 6, characterized in that, After receiving the resource polling table and the resource arbitrator sent by the workstation, the supercomputing task execution method further includes: For the resource polling table, verify whether the number of servers executing supercomputing tasks is accurate, and verify whether the configuration of each data entry is accurate; and For the aforementioned resource arbitrator, determine the effective method.
8. The supercomputing task execution method according to claim 5, characterized in that, After receiving the supercomputing task execution requests issued in batches by the workstation according to application requirements, the supercomputing task execution method further includes: The middleware places the supercomputing task execution request into the execution queue, and then selects a suitable computing node based on the resource polling table and the resource arbitrator to execute the corresponding supercomputing task.
9. A workstation, characterized in that, The workstation includes a control device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the supercomputing task execution method according to any one of claims 1-4.
10. A server, characterized in that, The server includes a control device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the supercomputing task execution method according to any one of claims 5-8.
11. A machine-readable storage medium, characterized in that, The machine-readable storage medium stores instructions that cause the machine to perform the supercomputing task execution method according to any one of claims 1-4 or any one of claims 5-8.
12. A multi-data center cluster system, characterized in that, The multi-data center cluster system includes multiple clusters, a server as described in claim 10 electrically connected to each of the multiple clusters, and a workstation as described in claim 9 electrically connected to the server. Each cluster comprises multiple compute nodes.