Heterogeneous GPU computing power cluster monitoring method and device, electronic equipment and storage medium

By consuming various monitoring data for the GPU computing cluster from the message center, storing it, and performing aggregated analysis, the method addresses the lack of in-depth monitoring in existing technologies and achieves application-dimension monitoring of the GPU computing cluster.

CN118972302B · Active · Publication Date: 2026-06-23 · CHINA TELECOM CORP LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TELECOM CORP LTD
Filing Date
2024-08-01
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing GPU monitoring technologies cannot obtain deeper monitoring data and cannot meet the effective monitoring and management needs of large-scale GPU computing clusters.

Method used

By subscribing to message topics, static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data are consumed from the message center, and these data are stored in storage media for aggregation and analysis to obtain monitoring results of application monitoring metrics.

Benefits of technology

It enables application-level monitoring of applications running in GPU computing clusters, providing deeper monitoring data and meeting the management needs of large-scale GPU computing clusters.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the application disclose a heterogeneous GPU computing power cluster monitoring method and device, electronic equipment and storage medium, the method comprising: consuming static resource information, host monitoring information, container monitoring information, service application instance data and service application traffic data from a message center by subscribing to a message topic, and storing the static resource information, host monitoring information, container monitoring information, service application instance data and service application traffic data in a storage medium; obtaining the static resource information, host monitoring information, container monitoring information, service application instance data and service application traffic data from the storage medium; and performing aggregation analysis on at least one of the static resource information, host monitoring information, container monitoring information, service application instance data and service application traffic data according to an application monitoring index, to obtain a monitoring result of the application monitoring index. The embodiments of the application achieve application dimension monitoring.

Description

Technical Field

[0001] This application relates to the field of computing technology, and in particular to a method, apparatus, electronic device, and storage medium for monitoring heterogeneous GPU computing power clusters.

Background Technology

[0002] In modern computer science, GPUs (Graphics Processing Units) are widely used for parallel computing and graphics rendering tasks. With the increasing computing power of GPUs and the continuous expansion of their application scenarios, large-scale GPU computing clusters need to be built to meet business development needs. Therefore, effective monitoring and management of GPUs themselves has become particularly important.

[0003] Currently, there are some monitoring technologies and methods available for monitoring and managing GPU resources. However, these technologies and methods monitor only the CPU, memory, hard drive, and GPU levels, and cannot obtain deeper monitoring data.

Summary of the Invention

[0004] This application provides a method, apparatus, electronic device, and storage medium for monitoring heterogeneous GPU computing clusters, which can acquire monitoring data of applications running on the GPU computing cluster.

[0005] To address the aforementioned problems, in a first aspect, embodiments of this application provide a method for monitoring heterogeneous GPU computing clusters, including:

[0006] By subscribing to message topics, static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data are consumed from the message center, and the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data are stored in the storage medium. The static resource information includes host information and cluster information. The host machine monitoring information and container monitoring information are obtained by resource acquisition plugins in each host machine and sent to the message center by the host machine. The business application instance data is obtained by the computing power scheduling center and sent to the message center. The business application traffic data is obtained by the access gateway of the computing power scheduling center and sent to the message center.

[0007] Obtain the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from the storage medium;

[0008] Based on the application monitoring metrics, at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data is aggregated and analyzed to obtain the monitoring results of the application monitoring metrics.

[0009] In a second aspect, embodiments of this application provide a heterogeneous GPU computing cluster monitoring device, comprising:

[0010] The data storage module is used to consume static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data from the message center by subscribing to message topics, and store the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data in a storage medium; the static resource information includes host information and cluster information; the host machine monitoring information and container monitoring information are obtained by resource acquisition plugins in each host machine and sent to the message center by the host machine; the business application instance data is obtained by the computing power scheduling center and sent to the message center; and the business application traffic data is obtained by the access gateway of the computing power scheduling center and sent to the message center.

[0011] The data acquisition module is used to acquire the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from the storage medium.

[0012] The aggregation and analysis module is used to perform aggregation and analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring indicators, and obtain the monitoring results of the application monitoring indicators.

[0013] In a third aspect, embodiments of this application also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the heterogeneous GPU computing power cluster monitoring method described in embodiments of this application.

[0014] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the heterogeneous GPU computing power cluster monitoring method disclosed in embodiments of this application.

[0015] The heterogeneous GPU computing cluster monitoring method, apparatus, electronic device, and storage medium provided in the embodiments of this application consume static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data from a message center by subscribing to message topics. These data are then stored in a storage medium. When aggregation analysis is required, the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data are retrieved from the storage medium. Based on the application monitoring metrics, at least one of these types of data is aggregated and analyzed to obtain the monitoring results of the application monitoring metrics. This enables monitoring of applications running in the GPU computing cluster according to application monitoring metrics, achieving application-level monitoring.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart of a heterogeneous GPU computing power cluster monitoring method provided in an embodiment of this application;

[0018] Figure 2 is a schematic diagram of the heterogeneous GPU computing power cluster monitoring system provided in the embodiments of this application;

[0019] Figure 3 is a schematic diagram of runtime monitoring of computing resources in an embodiment of this application;

[0020] Figure 4 is a schematic diagram of business application traffic data collection in an embodiment of this application;

[0021] Figure 5 is a schematic diagram illustrating the aggregation and analysis of monitoring data in an embodiment of this application;

[0022] Figure 6 is a flowchart of cluster-level monitoring in an embodiment of this application;

[0023] Figure 7 is a flowchart of host-level monitoring in an embodiment of this application;

[0024] Figure 8 is a flowchart of GPU-level monitoring in an embodiment of this application;

[0025] Figure 9 is a flowchart illustrating the monitoring process based on application-level monitoring metrics in an embodiment of this application;

[0026] Figure 10a is a trend chart of the average resource utilization rate of Application 1 in the embodiments of this application;

[0027] Figure 10b is a trend chart of the average resource utilization rate of Application 2 in the embodiments of this application;

[0028] Figure 11a is a request concurrency trend chart of Application 1 in the embodiments of this application;

[0029] Figure 11b is a request concurrency trend chart of Application 2 in the embodiments of this application;

[0030] Figure 12 is a schematic diagram of the structure of a heterogeneous GPU computing power cluster monitoring device provided in an embodiment of this application;

[0031] Figure 13 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0032] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0033] Figure 1 is a flowchart of a heterogeneous GPU computing cluster monitoring method provided in an embodiment of this application. The method can be executed by the monitoring center in a heterogeneous GPU computing cluster monitoring system. Figure 2 is a schematic diagram of the heterogeneous GPU computing power cluster monitoring system provided in the embodiments of this application. As shown in Figure 2, the system includes a GPU computing cluster, a message center, a computing power scheduling center, and a monitoring center. Each host machine in the GPU computing cluster is equipped with a resource acquisition plugin, which collects the static resource information of the host machine and, at a certain frequency, collects monitoring data of computing resources during runtime. This runtime monitoring data includes host machine monitoring information and container monitoring information: the host machine monitoring information is the CPU, memory, hard disk, and GPU information of the host machine, and the container monitoring information is the CPU, memory, hard disk, and GPU information of the container instance. The collected information is reported as messages to the corresponding topics in the message center. The computing power scheduling center can collect business application instance data and business application traffic data and likewise report them as messages to the corresponding topics in the message center. The message center stores the messages uploaded by the host machines and the computing power scheduling center. The monitoring center executes the heterogeneous GPU computing cluster monitoring method provided in this embodiment.

[0034] As shown in Figure 1, the heterogeneous GPU computing power cluster monitoring method includes steps 110 to 130.

[0035] Step 110: Consume static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from the message center by subscribing to message topics, and store the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data in the storage medium.

[0036] The static resource information includes host information and cluster information. The business application instance data is obtained by the computing power scheduling center and sent to the message center. The business application traffic data is obtained by the access gateway of the computing power scheduling center and sent to the message center. The host machine monitoring information and container monitoring information are obtained by the resource collection plugins in each host machine and sent to the message center by the host machine.

[0037] Optionally, the host monitoring information includes: host CPU monitoring information, host memory monitoring information, host disk monitoring information, and host GPU monitoring information; the container monitoring information includes: container information, container CPU monitoring information, and container memory monitoring information.

[0038] Optionally, the host CPU monitoring information includes: host IP, collection time, CPU utilization, number of CPU cores, and cluster identifier; the host memory monitoring information includes: host IP, collection time, memory utilization, memory size, and cluster identifier; the host disk monitoring information includes: host IP, collection time, disk utilization, disk size, disk identifier, and cluster identifier; the host GPU monitoring information includes: host IP, collection time, GPU utilization, GPU memory size, GPU identifier, name of the process using the GPU, GPU brand, and cluster identifier. The host IP is the IP address of each physical host. The name of the process using the GPU is the name of the pod using that GPU, that is, the identifier of the container instance using that GPU. The cluster identifier can include the cluster name and a unique cluster identifier.

[0039] Optionally, the container information includes: container identifier, container IP, number of container CPU cores, container memory, container GPU card identifier, container startup time, host IP, and cluster identifier, etc.; the container CPU monitoring information includes: host IP, container IP, container identifier, collection time, CPU utilization, number of CPU cores, container tag, and cluster identifier, wherein the container tag represents the service associated with the container; the container memory monitoring information includes: host IP, container IP, container identifier, collection time, memory utilization, memory size, and cluster identifier, etc. The cluster identifier may include the cluster name and a unique cluster identifier.

[0040] Optionally, the business application instance data includes: application name, application description, application manager, application manager's phone number, application deployment host IP list, application instance IP list, application instance name, application deployment tag, cluster name, GPU card type, etc. The application deployment tag represents the business associated with the application; the business application traffic data includes: host IP, application name, request time, response time, HTTP request result code, request path, etc.

[0041] Optionally, the host information includes: host IP, cluster identifier, CPU model, brand, number of CPU cores, GPU model, GPU brand, GPU memory, memory size, memory model, and memory brand, etc.; the cluster information includes: cluster identifier, cluster type, and cluster technology platform, etc. The cluster identifier may include the cluster name and a unique cluster identifier. The host information may also include a GPU identifier.

[0042] The host monitoring information and container monitoring information are monitoring data collected during the runtime of computing resources. Figure 3 is a schematic diagram of runtime monitoring of computing resources in an embodiment of this application. As shown in Figure 3, the resource collection plugin (also known as the control center) in each host machine collects the CPU, memory, hard disk, and GPU information of the host machine and its container instances at a certain frequency, yielding the host machine monitoring information and container monitoring information. The collected information is first cached in the host machine's memory and then, according to a synchronization strategy, periodically synchronized to the corresponding topic in the message center. Caching the information for a period of time before synchronizing it reduces the load that would result from synchronizing on every collection.
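The cache-then-synchronize behavior described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the class name `BufferedReporter`, the `publish` callback standing in for a message center client, the field name `cpu_utilization`, and the 30-second default interval are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class BufferedReporter:
    """Caches collected samples in memory and sends them per topic in batches."""

    def __init__(self, publish: Callable[[str, List[dict]], None],
                 flush_interval_s: float = 30.0) -> None:
        self._publish = publish                  # hypothetical message center client
        self._flush_interval_s = flush_interval_s
        self._buffer: Dict[str, List[dict]] = {}
        self._last_flush = time.monotonic()

    def collect(self, topic: str, sample: dict) -> None:
        """Cache one sample; synchronize only once the interval has elapsed."""
        self._buffer.setdefault(topic, []).append(sample)
        if time.monotonic() - self._last_flush >= self._flush_interval_s:
            self.flush()

    def flush(self) -> None:
        """Send every cached batch to its topic, then clear the cache."""
        for topic, batch in self._buffer.items():
            if batch:
                self._publish(topic, batch)
        self._buffer = {}
        self._last_flush = time.monotonic()


# Usage: three samples are cached and leave the host as a single batch.
sent = []
reporter = BufferedReporter(lambda topic, batch: sent.append((topic, len(batch))),
                            flush_interval_s=3600.0)
for i in range(3):
    reporter.collect("HOST_CPU_INFO_MONITOR_TOPIC", {"cpu_utilization": 10.0 * i})
reporter.flush()
```

Batching trades a bounded reporting delay for fewer publish calls; the interval would need to be chosen against how fresh the monitoring views must be.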

[0043] The resource acquisition plugins in each host machine of the GPU computing power cluster collect host CPU monitoring information from the host machine and report it to the host CPU monitoring topic (HOST_CPU_INFO_MONITOR_TOPIC) in the message center. The message body of the reported host CPU monitoring information includes host IP, collection time, CPU utilization, number of CPU cores, and cluster identifier.

[0044] The resource acquisition plugins in each host machine in the GPU computing power cluster collect host machine memory monitoring information from the host machine and report it to the host memory monitoring topic (HOST_MEM_INFO_MONITOR_TOPIC) in the message center. The message body of the reported host memory monitoring information includes host IP, collection time, memory utilization, memory size, and cluster identifier.

[0045] The host hard disk monitoring information is collected from the host machine by the resource collection plugin in each host machine in the GPU computing power cluster and reported to the host hard disk monitoring topic (HOST_DISK_INFO_MONITOR_TOPIC) in the message center. The message body of the reported host hard disk monitoring information includes host IP, collection time, hard disk utilization, hard disk size, hard disk ID, and cluster identifier.

[0046] The host GPU monitoring information is collected from each host machine in the GPU computing power cluster by the resource collection plugin and reported to the host GPU monitoring topic (HOST_GPU_INFO_MONITOR_TOPIC) in the message center. The message body of the reported host GPU monitoring information includes host IP, collection time, GPU utilization, GPU memory size, GPU unique identifier, name of the process occupying the GPU, GPU brand, and cluster identifier.
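As an illustration of such a message body, the sketch below models the host GPU monitoring fields listed above as a Python dataclass and serializes one sample to JSON. The field names, the JSON encoding, the units, the convention that an empty `process_name` means no process occupies the GPU, and all sample values are assumptions; the source only enumerates the fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HostGpuMonitorMessage:
    host_ip: str             # IP address of the physical host
    collection_time: str     # assumed ISO 8601 timestamp
    gpu_utilization: float   # assumed percentage, 0 to 100
    gpu_memory_size_mb: int  # assumed unit: MiB
    gpu_id: str              # GPU unique identifier
    process_name: str        # pod name occupying the GPU; assumed empty when idle
    gpu_brand: str
    cluster_id: str          # cluster identifier


# Made-up sample values, for illustration only.
message = HostGpuMonitorMessage(
    host_ip="192.0.2.10",
    collection_time="2024-08-01T00:00:00Z",
    gpu_utilization=87.5,
    gpu_memory_size_mb=81920,
    gpu_id="GPU-0",
    process_name="inference-pod-1",
    gpu_brand="example-brand",
    cluster_id="cluster-a",
)
payload = json.dumps(asdict(message))  # body reported to HOST_GPU_INFO_MONITOR_TOPIC
```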

[0047] The container information is collected from each host machine in the GPU computing power cluster by the resource collection plugin and reported to the container monitoring topic (POD_INFO_MONITOR_TOPIC) in the message center. The message body of the reported container information includes: container name, container unique identifier, container IP, number of container CPU cores, container memory, container GPU card identifier, container startup time, host machine IP, cluster identifier, etc.

[0048] The resource acquisition plugins in each host machine of the GPU computing power cluster collect container CPU monitoring information from the host machine and report it to the container CPU monitoring topic (POD_CPU_INFO_MONITOR_TOPIC) in the message center. The message body of the reported container CPU monitoring information includes the host machine IP, container IP, container unique identifier, collection time, CPU utilization, number of CPU cores, container tag, and cluster identifier.

[0049] The resource acquisition plugins in each host machine of the GPU computing power cluster collect container memory monitoring information from the host machine and report it to the container memory monitoring topic (POD_MEM_INFO_MONITOR_TOPIC) in the message center. The message body of the reported container memory monitoring information includes the host machine IP, container IP, container unique identifier, collection time, memory utilization, memory size, and cluster identifier.

[0050] The cluster identifier mentioned above may include the cluster name and the cluster unique identifier.

[0051] Application tasks are deployed in the computing power scheduling platform. The application tasks collect business application instance data and synchronously report it to the business application instance topic (APP_RES_INFO_TOPIC) in the message center. The reported business application instance data message includes: application name, application description, application owner, application owner's phone number, application deployment host IP list, application instance IP list, application instance name, application deployment tag, cluster name, GPU card type, etc.

[0052] Figure 4 is a schematic diagram of business application traffic data collection in an embodiment of this application. As shown in Figure 4, the access gateway of the computing power scheduling platform receives business traffic and forwards it to the corresponding inference instance (application instance). A probe analyzes the traffic passing through the access gateway to obtain the business application traffic data, which is reported to the business application traffic topic (APP_REQ_INFO_TOPIC) in the message center. The business application traffic data includes key parameters such as the requested host IP, application name, request time, response time, HTTP request result code, and request path.
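A gateway probe of this kind might be sketched as a wrapper around the forwarding call. This is only an assumed shape: the function `forward_with_probe`, the `forward` and `report` callbacks, the record field names, and the choice to record the response time as a timestamp rather than a duration are not specified by the source.

```python
import time
from typing import Callable, List, Tuple

TRAFFIC_TOPIC = "APP_REQ_INFO_TOPIC"


def forward_with_probe(
    forward: Callable[[str], int],         # hypothetical: forwards a request, returns HTTP code
    report: Callable[[str, dict], None],   # hypothetical message center client
    host_ip: str,
    app_name: str,
    request_path: str,
) -> dict:
    """Forward one request and report a traffic record describing it."""
    request_time = time.time()
    http_code = forward(request_path)
    response_time = time.time()
    record = {
        "host_ip": host_ip,
        "app_name": app_name,
        "request_time": request_time,
        "response_time": response_time,
        "http_code": http_code,
        "request_path": request_path,
    }
    report(TRAFFIC_TOPIC, record)
    return record


# Usage with stand-in callbacks: the "inference instance" always answers 200.
reported: List[Tuple[str, dict]] = []
record = forward_with_probe(
    forward=lambda path: 200,
    report=lambda topic, body: reported.append((topic, body)),
    host_ip="192.0.2.10",
    app_name="app-1",
    request_path="/v1/infer",
)
```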

[0053] Figure 5 is a schematic diagram illustrating the aggregation and analysis of monitoring data in an embodiment of this application. As shown in Figure 5, the monitoring data must first be obtained before it can be aggregated and analyzed. A message consumer subscribes to the message topics and consumes the messages corresponding to each topic from the message center, yielding the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data. These data are then persistently stored in the monitoring center's storage medium. The message topics may include host CPU monitoring topics, host memory monitoring topics, host hard disk monitoring topics, host GPU monitoring topics, container monitoring topics, container CPU monitoring topics, container memory monitoring topics, business application instance topics, and business application traffic topics.
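The subscribe, consume, and persist flow can be sketched in Python as follows. The source does not name a message broker or a storage engine, so `InMemoryMessageCenter` is a stand-in for the message center and SQLite is an assumed storage medium; the class names, table layout, and message fields are hypothetical.

```python
import json
import sqlite3
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, dict], None]


class InMemoryMessageCenter:
    """Stand-in for the message center: delivers messages to topic subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(topic, message)


class MonitoringStore:
    """Persists consumed messages so they can be read back for aggregation."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS messages (topic TEXT NOT NULL, body TEXT NOT NULL)"
        )

    def save(self, topic: str, message: dict) -> None:
        self._db.execute(
            "INSERT INTO messages (topic, body) VALUES (?, ?)",
            (topic, json.dumps(message)),
        )
        self._db.commit()

    def load(self, topic: str) -> List[dict]:
        rows = self._db.execute(
            "SELECT body FROM messages WHERE topic = ? ORDER BY rowid", (topic,)
        )
        return [json.loads(body) for (body,) in rows]


# Usage: the consumer subscribes to two topics and stores whatever arrives.
center = InMemoryMessageCenter()
store = MonitoringStore()
for topic in ("HOST_CPU_INFO_MONITOR_TOPIC", "APP_RES_INFO_TOPIC"):
    center.subscribe(topic, store.save)

center.publish("HOST_CPU_INFO_MONITOR_TOPIC",
               {"host_ip": "192.0.2.10", "cpu_utilization": 42.0})
cpu_rows = store.load("HOST_CPU_INFO_MONITOR_TOPIC")
```

Storing the raw message per topic keeps the consumer independent of any one metric; the aggregation step can then decide which topics to join.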

[0054] In summary, the message topics involved in the monitoring center are shown in Table 1.

[0055] Table 1. Message Topics Covered by the Monitoring Center

[0056]
Resource type                          Message topic
Host information synchronization       HOST_INFO_MONITOR_TOPIC
Cluster information synchronization    CLUSTER_INFO_MONITOR_TOPIC
Application information synchronization  APP_RES_INFO_TOPIC
Application traffic synchronization    APP_REQ_INFO_TOPIC
Host CPU monitoring                    HOST_CPU_INFO_MONITOR_TOPIC
Host memory monitoring                 HOST_MEM_INFO_MONITOR_TOPIC
Host disk monitoring                   HOST_DISK_INFO_MONITOR_TOPIC
Host GPU monitoring                    HOST_GPU_INFO_MONITOR_TOPIC
Container information synchronization  POD_INFO_MONITOR_TOPIC
Container CPU monitoring               POD_CPU_INFO_MONITOR_TOPIC
Container memory monitoring            POD_MEM_INFO_MONITOR_TOPIC

[0057] Step 120: Obtain the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from the storage medium.

[0058] As shown in Figure 5, when it is necessary to aggregate and analyze monitoring data, static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data are obtained from the storage medium.

[0059] Step 130: Based on the application monitoring metrics, perform aggregated analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data to obtain the monitoring results of the application monitoring metrics.

[0060] Among them, application monitoring metrics are application-level monitoring metrics used to analyze the resources consumed by applications.

[0061] According to the application monitoring metrics, the monitoring data required for the application monitoring metrics are obtained from at least one of the following: static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data. For each application, the obtained monitoring data is aggregated and analyzed to obtain the monitoring results of each application for the application monitoring metrics.
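One possible per-application aggregation is sketched below for a single assumed metric, average container CPU utilization per application. The join key is an assumption: the sketch matches each entry of an application's instance IP list against the container IP in the container CPU monitoring records. The function and field names are hypothetical, and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def average_cpu_by_application(app_instances: List[dict],
                               container_cpu: List[dict]) -> Dict[str, float]:
    """Average container CPU utilization per application name."""
    # Map every application instance IP to the application that owns it.
    ip_to_app: Dict[str, str] = {}
    for app in app_instances:
        for ip in app["instance_ips"]:
            ip_to_app[ip] = app["app_name"]

    # Group CPU samples by application; samples from unknown containers are skipped.
    samples: Dict[str, List[float]] = defaultdict(list)
    for rec in container_cpu:
        app_name = ip_to_app.get(rec["container_ip"])
        if app_name is not None:
            samples[app_name].append(rec["cpu_utilization"])

    return {name: mean(values) for name, values in samples.items()}


apps = [
    {"app_name": "app-1", "instance_ips": ["10.0.0.1", "10.0.0.2"]},
    {"app_name": "app-2", "instance_ips": ["10.0.0.3"]},
]
cpu = [
    {"container_ip": "10.0.0.1", "cpu_utilization": 20.0},
    {"container_ip": "10.0.0.2", "cpu_utilization": 40.0},
    {"container_ip": "10.0.0.3", "cpu_utilization": 70.0},
    {"container_ip": "10.0.0.9", "cpu_utilization": 99.0},  # belongs to no known app
]
result = average_cpu_by_application(apps, cpu)
# result == {"app-1": 30.0, "app-2": 70.0}
```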

[0062] As shown in Figure 5, the data adaptation layer can provide interfaces for the monitoring front-end of the monitoring center to obtain the monitoring results of the corresponding application monitoring metrics. The monitoring front-end can display the monitoring results of the application monitoring metrics in the form of monitoring views. The displayed monitoring results views may include: application resource monitoring view, application idle / busy analysis view, intelligent computing resource distribution view, application concurrency rationality analysis view, etc. The intelligent computing resource distribution view is used to show the distribution location of each intelligent computing resource (i.e., GPU computing power cluster).

[0063] The heterogeneous GPU computing cluster monitoring method provided in this application consumes static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from a message center by subscribing to message topics. This data is then stored in a storage medium. When aggregation analysis is required, the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data are retrieved from the storage medium. Based on the application monitoring metrics, at least one of these types of data is aggregated and analyzed to obtain the monitoring results of the application monitoring metrics. This method enables monitoring of applications running in the GPU computing cluster according to application monitoring metrics, achieving application-level monitoring.

[0064] Based on the above technical solution, the host information includes: host IP, cluster identifier, CPU model, brand, number of CPU cores, GPU model, GPU brand, GPU memory, memory size, memory model, and memory brand; the cluster information includes: cluster identifier, cluster type, and cluster technology base, etc. Among these, the cluster type represents the architecture type of the cluster, such as a Kubernetes cluster. The cluster technology base represents the composition type of the hosts in the cluster.

[0065] Based on the above technical solution, the method may also include cluster-level monitoring. Figure 6 is a flowchart of cluster-level monitoring in an embodiment of this application. As shown in Figure 6, cluster-level monitoring includes steps 610 to 640.

[0066] Step 610: Based on the cluster information and the host information, determine the host information associated with the first target cluster identifier, and obtain the first node set corresponding to the first target cluster identifier.

[0067] Full cluster basic information (including the cluster information of each GPU computing cluster) can be obtained by subscribing to the cluster information monitoring topic (CLUSTER_INFO_MONITOR_TOPIC). Each piece of cluster information includes a cluster identifier, so collecting the identifiers across all cluster information yields the cluster identifiers of all clusters. Cluster-level monitoring can be performed for each cluster identifier in turn, with the currently monitored cluster identifier serving as the first target cluster identifier; alternatively, it can be performed for a specified first target cluster identifier.

[0068] Full host basic information (including the host information of each GPU computing cluster) can be obtained by subscribing to the host information monitoring topic (HOST_INFO_MONITOR_TOPIC). The host information containing the first target cluster identifier is selected from it, and the host IPs are extracted from the selected host information. Each host IP represents a node of the first target cluster corresponding to the first target cluster identifier, and all of these host IPs form the first node set corresponding to that identifier. In this way, the logical cluster and the physical hosts are associated through the first target cluster identifier (the unique identifier of the target cluster).
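Step 610 amounts to a filter followed by a projection, which could be sketched in Python as below. The function name `first_node_set` and the record field names `host_ip` and `cluster_id` are assumptions, and the host records are made-up samples.

```python
from typing import List, Set


def first_node_set(host_info: List[dict], target_cluster_id: str) -> Set[str]:
    """Host IPs of all hosts whose cluster identifier matches the target cluster."""
    return {
        host["host_ip"]
        for host in host_info
        if host["cluster_id"] == target_cluster_id
    }


hosts = [
    {"host_ip": "192.0.2.10", "cluster_id": "cluster-a"},
    {"host_ip": "192.0.2.11", "cluster_id": "cluster-a"},
    {"host_ip": "192.0.2.20", "cluster_id": "cluster-b"},
]
nodes = first_node_set(hosts, "cluster-a")
```

Using a set removes duplicates if the same host information is consumed more than once.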

[0069] Step 620: Based on each node identifier in the first node set, display the host monitoring information corresponding to each node identifier.

[0070] A node identifier can be the combination of a host IP and the first target cluster identifier, which together identify one physical host.

[0071] By subscribing to the host GPU monitoring topic (HOST_GPU_INFO_MONITOR_TOPIC), you can obtain all host GPU monitoring information. Then, based on each node identifier in the node set, you can retrieve the host machine monitoring information corresponding to that node identifier. You can display the host machine monitoring information for each node identifier individually, or you can display the corresponding host machine monitoring information based on a view command. The host machine monitoring information characterizes the operating load of each node in the first target cluster.

[0072] Step 630: Determine the first GPU card set corresponding to the first target cluster identifier based on the host GPU monitoring information in the host monitoring information.

[0073] The host GPU monitoring information includes a cluster identifier. The host GPU monitoring information with the cluster identifier of the first target cluster identifier can be filtered out from all the host GPU monitoring information, and the GPU identifier can be extracted from these host GPU monitoring information. One GPU identifier represents one GPU card. All GPU identifiers corresponding to the first target cluster identifier constitute the first GPU card set corresponding to the first target cluster identifier.

[0074] Step 640: Based on the host GPU monitoring information, determine the number of idle GPUs and the number of active GPUs corresponding to the first target cluster identifier.

[0075] The idle status of a GPU card is determined by checking whether the host GPU monitoring information contains a process name occupying the GPU. If a process name occupying the GPU is present in the host GPU monitoring information, the GPU card is considered to be in use; otherwise, it is considered to be idle. The idle status of every GPU card in the first GPU card set is checked, and the number of idle GPU cards and the number of active GPU cards are counted from the results.

[0076] When monitoring at the cluster level, the load of the host machine of each node in the cluster can be monitored, as well as the number of idle GPUs and the number of GPUs in use in the cluster, thus realizing the monitoring of the host machine and GPUs in the cluster.
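
Steps 610 to 640 can be illustrated with a minimal sketch. It assumes the messages consumed from HOST_INFO_MONITOR_TOPIC and HOST_GPU_INFO_MONITOR_TOPIC have already been decoded into Python dictionaries, with one record per GPU card and GPU identifiers that are unique across the cluster. The function name `cluster_level_view` and the field names `cluster_id`, `host_ip`, `gpu_id` and `process_names` are hypothetical and are not taken from the embodiment.

```python
def cluster_level_view(host_infos, host_gpu_infos, target_cluster_id):
    """Illustrative cluster-level aggregation for one cluster identifier."""
    # Step 610: host IPs whose host record carries the target cluster identifier.
    node_set = {h["host_ip"] for h in host_infos
                if h["cluster_id"] == target_cluster_id}

    # Step 630: GPU records of the target cluster form the GPU card set.
    cluster_gpus = [g for g in host_gpu_infos
                    if g["cluster_id"] == target_cluster_id]
    gpu_card_set = {g["gpu_id"] for g in cluster_gpus}

    # Step 640: a card counts as in use when its record names an occupying process.
    in_use = {g["gpu_id"] for g in cluster_gpus if g.get("process_names")}
    return {
        "nodes": node_set,
        "gpu_cards": gpu_card_set,
        "in_use_count": len(in_use),
        "idle_count": len(gpu_card_set - in_use),
    }
```

Step 620 (displaying the host monitoring information per node) would then iterate over `nodes` and look up each node's monitoring record.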

[0077] Based on the above technical solution, the method may also include host-level monitoring. Figure 7 is a flowchart of host-level monitoring in an embodiment of this application. As shown in Figure 7, host-level monitoring includes steps 710 to 730.

[0078] Step 710: Determine the full host information based on the host information.

[0079] You can obtain full host information (including host information of each host in each GPU computing cluster) by subscribing to the host information monitoring topic (HOST_INFO_MONITOR_TOPIC).

[0080] Step 720: Based on the host IP, cluster identifier and container monitoring information in the full host information, determine the first container set corresponding to each host IP and cluster identifier.

[0081] Based on the host IPs and cluster identifiers in the full host information, the container identifiers corresponding to those host IPs and cluster identifiers can be found in the container information of the container monitoring information. Each container identifier represents a container, and all container identifiers corresponding to that host IP and cluster identifier form the first container set corresponding to that host IP and cluster identifier. The first container set corresponding to each host can be calculated separately.

[0082] Step 730: Collect the second set of GPU cards under the same host from the full host information, and for each host, determine the number of idle GPU cards, the number of in-use GPU cards, and the container instance corresponding to the in-use GPU cards based on the host GPU monitoring information.

[0083] From the full host information, all GPU cards under the same host can be counted, resulting in a second set of GPU cards under that host. The second set of GPU cards can include the GPU identifier, GPU brand, GPU memory, etc. of each GPU card.

[0084] For each host in the full host information (identified by host IP and cluster identifier), the number of idle GPUs, the number of active GPUs, and the container instances corresponding to active GPUs can be counted separately. By subscribing to the host GPU monitoring topic (HOST_GPU_INFO_MONITOR_TOPIC), host GPU monitoring information for the current host can be obtained. For each GPU identifier, the presence of a process name occupying the GPU in the host GPU monitoring information determines whether the GPU is idle. If a process name occupying the GPU is present in the host GPU monitoring information, the GPU is considered active; otherwise, it is considered idle.

[0085] For each GPU card associated with a host, its idle status is determined. Based on the determination result, the number of idle GPU cards for that host is counted. Similarly, the number of active GPU cards for that host is counted. The container instance (i.e., the pod information, or container identifier) corresponding to the active GPU card can be determined based on the name of the GPU process using that active GPU card.

[0086] When monitoring at the host level, it is possible to monitor the host's containers, as well as the number of idle GPUs, the number of active GPUs, and the container instances on active GPUs, thus enabling monitoring of both GPUs and containers within the host.
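
The host-level flow of steps 710 to 730 can be sketched as follows. This is only an illustration under stated assumptions: records are plain dictionaries, there is one GPU record per card, and the occupying process name identifies the container instance. The function `host_level_view` and all field names are hypothetical.

```python
def host_level_view(host_infos, container_infos, host_gpu_infos):
    """Illustrative per-host view keyed by (host IP, cluster identifier)."""
    # Step 710: one entry per host in the full host information.
    views = {
        (h["host_ip"], h["cluster_id"]): {
            "containers": set(),   # first container set of the host
            "idle_gpus": set(),    # idle GPU cards of the host
            "in_use_gpus": {},     # gpu_id -> occupying container instances
        }
        for h in host_infos
    }

    # Step 720: attach each container to the host it runs on.
    for c in container_infos:
        key = (c["host_ip"], c["cluster_id"])
        if key in views:
            views[key]["containers"].add(c["container_id"])

    # Step 730: classify each GPU card as idle or in use.
    for g in host_gpu_infos:
        key = (g["host_ip"], g["cluster_id"])
        if key not in views:
            continue
        procs = g.get("process_names") or []
        if procs:
            views[key]["in_use_gpus"][g["gpu_id"]] = list(procs)
        else:
            views[key]["idle_gpus"].add(g["gpu_id"])
    return views
```

The idle and active counts for a host are then `len(view["idle_gpus"])` and `len(view["in_use_gpus"])`.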

[0087] Based on the above technical solution, the method may also include GPU-level monitoring. Figure 8 is a flowchart of GPU-level monitoring in an embodiment of this application. As shown in Figure 8, GPU-level monitoring includes steps 810 to 840.

[0088] Step 810: Determine the third GPU card set based on the host GPU monitoring information in the host machine monitoring information.

[0089] All host GPU monitoring information can be obtained by subscribing to the host GPU monitoring topic (HOST_GPU_INFO_MONITOR_TOPIC). Collecting the GPU identifiers across all of this host GPU monitoring information yields all GPU identifiers, which form the third GPU card set.

[0090] Step 820: Determine the occupancy status corresponding to the target GPU identifier based on the host GPU monitoring information corresponding to the target GPU identifier in the third GPU card set.

[0091] You can monitor each GPU identifier in the third GPU card set at the GPU level, and use the GPU identifier currently being monitored at the GPU level as the target GPU identifier; or you can monitor the GPU level based on a specified target GPU identifier.

[0092] When monitoring the target GPU corresponding to the target GPU identifier at the GPU level, the occupancy status of the target GPU identifier is determined by whether there is a process name occupying the GPU in the host GPU monitoring information corresponding to the target GPU identifier. If there is a process name occupying the GPU in the host GPU monitoring information corresponding to the target GPU identifier, the occupancy status of the target GPU identifier is determined to be occupied. If there is no process name occupying the GPU in the host GPU monitoring information corresponding to the target GPU identifier, the occupancy status of the target GPU identifier is determined to be idle.

[0093] Step 830: When the occupancy status is occupied, determine the application information corresponding to the target GPU identifier based on the business application instance data, and display the application information.

[0094] The application information may include information such as the application administrator, application purpose, and application deployment time.

[0095] When the occupancy status corresponding to the target GPU identifier is occupied, determine the business application instance data whose application instance name is the name of the occupied GPU process in the host GPU monitoring information corresponding to the target GPU identifier from all business application instance data, obtain application information such as application manager, application manager's phone number, application purpose, and application deployment time from these business application instance data, and display the application information.

[0096] Step 840: Display the host information corresponding to the target GPU identifier.

[0097] When the target GPU identifier is in occupied or idle status, the host information corresponding to that target GPU identifier can be displayed. When the host information includes the GPU identifier, the host information whose GPU identifier is the target GPU identifier can be identified from all host information and displayed. When the host information does not include the GPU identifier, the host IP and cluster identifier corresponding to the target GPU identifier can be identified from all host GPU monitoring information. Based on the host IP and cluster identifier, the host information corresponding to the host IP and cluster identifier can be identified from all host information. This host information is the host information corresponding to the target GPU identifier and can be displayed.

[0098] When monitoring at the GPU level, the GPU's usage status can be monitored. When the usage status is occupied, the application information of the GPU can be monitored. When the usage status is occupied or idle, the host information corresponding to the GPU can be monitored.
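
A minimal sketch of steps 820 to 840 for one target GPU identifier follows. It assumes dictionary records, that the application instance name equals the name of the GPU-occupying process, and that the host information does not itself carry the GPU identifier, so the host is resolved through host IP and cluster identifier. The function `gpu_level_view` and the field names are hypothetical.

```python
def gpu_level_view(target_gpu_id, host_gpu_infos, app_instances, host_infos):
    """Illustrative GPU-level lookup: occupancy, applications, host."""
    record = next((g for g in host_gpu_infos
                   if g["gpu_id"] == target_gpu_id), None)
    if record is None:
        return None  # the identifier is not in the third GPU card set

    # Step 820: occupied if any process name is reported for the card.
    procs = record.get("process_names") or []
    status = "occupied" if procs else "idle"

    # Step 830: match application instances by the occupying process name.
    apps = [a for a in app_instances if a["instance_name"] in procs]

    # Step 840: resolve the host through host IP and cluster identifier.
    host = next((h for h in host_infos
                 if h["host_ip"] == record["host_ip"]
                 and h["cluster_id"] == record["cluster_id"]), None)
    return {"status": status, "applications": apps, "host": host}
```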

[0099] Based on the above technical solution, the application monitoring metrics are application-dimensional monitoring metrics, and the monitoring results include application cluster information, a second node set, a second container set, and a set of occupied GPU cards. Figure 9 is a flowchart of monitoring according to application-dimensional monitoring metrics in an embodiment of this application, that is, an implementation flowchart of step 130 above. As shown in Figure 9, step 130 may include steps 131 to 134.

[0100] Step 131: Based on the second target cluster identifier in the business application instance data corresponding to the first target application identifier, obtain the cluster information corresponding to the second target cluster identifier from the cluster information to obtain the application cluster information.

[0101] The first target application identifier is the application identifier for which application-level monitoring metrics analysis is currently required. This first target application identifier can be represented by the application name.

[0102] The monitoring center can obtain business application instance data (i.e. application information) by subscribing to the message topic of the business application instance (APP_RES_INFO_TOPIC). Adding, deleting or modifying the application information will trigger message synchronization in the monitoring center.

[0103] After the application is deployed, the information of the cluster on which it is deployed (the cluster name) is included in the business application instance data and synchronized to the monitoring center. This allows the second target cluster identifier (such as the cluster name or the cluster's unique identifier) corresponding to the first target application identifier to be obtained from the business application instance data. Based on this second target cluster identifier, the cluster information corresponding to the second target cluster identifier can be obtained from all the cluster information as the application cluster information corresponding to the first target application identifier.

[0104] Step 132: Determine the second node set corresponding to the first target application identifier based on the application deployment host IP list and the second target cluster identifier in the business application instance data.

[0105] After the application is deployed, the host IP and cluster identifier will be included in the business application instance data and synchronized to the monitoring center along with the application information in the business application instance data. Based on the list of application deployment host IPs and the second target cluster identifier in the business application instance data corresponding to the first target application identifier, the second node set corresponding to the first target application identifier can be determined. Each node in the second node set can be characterized by the host IP and the second target cluster identifier.

[0106] Step 133: Based on the application instance IP list and the second target cluster identifier in the business application instance data, and the container monitoring information, determine the second container set corresponding to the first target application identifier.

[0107] Based on the application instance IP list and the second target cluster identifier in the business application instance data corresponding to the first target application identifier, the container monitoring information containing the application instance IP (i.e., container IP) in the application instance IP list and the second target cluster identifier is searched from the container monitoring information. The container identifier is extracted from these container monitoring information, and all the extracted container identifiers form the second container set corresponding to the first target application identifier.

[0108] Step 134: Determine the set of GPU cards occupied by the first target application identifier based on the target container monitoring information corresponding to each container identifier in the second container set.

[0109] The target container monitoring information corresponding to each container identifier in the second container set is determined from the container information of the container monitoring information. The host IP and cluster identifier are extracted from this target container monitoring information. Based on the host IP, cluster identifier and container identifier, the GPU identifier corresponding to the host IP, cluster identifier and container identifier (the name of the GPU-occupying process in the host GPU monitoring information is the same as the container identifier) is extracted from the host GPU monitoring information. All the extracted GPU identifiers form the set of GPU cards occupied by the first target application identifier.

[0110] When performing aggregated analysis of application-level monitoring metrics, it is possible to monitor application cluster information, node sets, container sets, and GPU card usage sets.
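
Steps 131 to 134 amount to a chain of joins, which the following sketch illustrates. It assumes dictionary records and, as stated above, that the GPU-occupying process name equals the container identifier. The function `application_view` and the field names (`host_ips`, `instance_ips`, `container_ip`, and so on) are hypothetical.

```python
def application_view(app, cluster_infos, container_infos, host_gpu_infos):
    """Illustrative application-dimension aggregation for one application."""
    cluster_id = app["cluster_id"]  # second target cluster identifier

    # Step 131: cluster information of the cluster the application runs in.
    app_cluster = next((c for c in cluster_infos
                        if c["cluster_id"] == cluster_id), None)

    # Step 132: nodes are (host IP, cluster identifier) pairs.
    nodes = {(ip, cluster_id) for ip in app["host_ips"]}

    # Step 133: containers whose IP is one of the application instance IPs.
    instance_ips = set(app["instance_ips"])
    targets = [c for c in container_infos
               if c["cluster_id"] == cluster_id
               and c["container_ip"] in instance_ips]
    containers = {c["container_id"] for c in targets}

    # Step 134: GPU cards whose occupying process is one of those containers.
    wanted = {(c["host_ip"], c["cluster_id"], c["container_id"])
              for c in targets}
    gpus = {g["gpu_id"] for g in host_gpu_infos
            for p in (g.get("process_names") or [])
            if (g["host_ip"], g["cluster_id"], p) in wanted}

    return {"cluster": app_cluster, "nodes": nodes,
            "containers": containers, "gpus": gpus}
```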

[0111] Based on the above technical solution, the application monitoring indicators are application business dimension monitoring indicators, and the monitoring results include resource usage information and resource utilization rate;

[0112] The step of performing aggregated analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring metrics, to obtain the monitoring results of the application monitoring metrics, includes:

[0113] Based on the target host IP and application deployment tag in the business application instance data corresponding to the second target application identifier, and the container monitoring information, the resource usage information corresponding to the second target application identifier is determined.

[0114] The resource utilization rate of the target host machine where the second target application identifier is located is obtained from the host machine monitoring information, and the resource utilization rate is displayed. The resource utilization rate includes CPU utilization rate, memory utilization rate and GPU utilization rate.

[0115] The second target application identifier is the application identifier currently being analyzed for application business dimension monitoring indicators.

[0116] By analyzing all business application instance data, the overall set of applications running on the underlying cluster resources can be obtained. Based on the target host IP and application deployment tag in the business application instance data corresponding to the second target application identifier, the container monitoring information whose container tag is the same as the application deployment tag and whose host IP is the target host IP is extracted from all container monitoring information. The resource usage information corresponding to the second target application identifier is then determined from the resource usage information in the extracted container monitoring information. This resource usage information includes CPU usage (the sum of the number of CPU cores of each container), memory usage (the sum of the memory usage of each container), and GPU resource usage (the GPU card identifiers used by each container).

[0117] Table 2 shows the resource usage information of an application. As shown in Table 2, the resource usage information of an application may include information such as the number of CPU cores, memory size, GPU identifier, GPU model, and the cluster in which it is deployed.

[0118] Table 2 Application resource usage information

[0119]

[0120]

[0121] After extracting container monitoring information from all container monitoring information where the container tag is the same as the application deployment tag and the host IP is the target host IP, container identifiers are extracted from these container monitoring information. These container identifiers are the target container identifiers corresponding to the second target application identifier. The host IP and cluster identifier are extracted from the container information corresponding to the target container identifier. The host IP and cluster identifier represent the target host where the second target application identifier is located. Then, based on the host IP and cluster identifier, information such as CPU utilization, memory utilization, and GPU utilization corresponding to the host IP and cluster identifier can be extracted from all host monitoring information.

[0122] By monitoring application business-level metrics, we can monitor the application's resource usage and resource utilization.
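
The resource usage aggregation described above can be sketched as a filter followed by sums. This sketch assumes dictionary records; the function `resource_usage` and the field names `deploy_tag`, `target_host_ips`, `tag`, `cpu_cores`, `memory_bytes` and `gpu_ids` are hypothetical.

```python
def resource_usage(app, container_infos):
    """Illustrative resource usage of one application from container records."""
    hosts = set(app["target_host_ips"])
    # Keep containers whose tag matches the deployment tag on a target host.
    matched = [c for c in container_infos
               if c["tag"] == app["deploy_tag"] and c["host_ip"] in hosts]
    return {
        # CPU usage: sum of the CPU cores of each matched container.
        "cpu_cores": sum(c["cpu_cores"] for c in matched),
        # Memory usage: sum of the memory of each matched container.
        "memory_bytes": sum(c["memory_bytes"] for c in matched),
        # GPU usage: identifiers of the GPU cards used by the containers.
        "gpu_ids": sorted({gid for c in matched
                           for gid in c.get("gpu_ids", [])}),
    }
```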

[0123] Based on the above technical solution, the application monitoring metrics are application-dimensional concurrency monitoring metrics, and the monitoring results include request concurrency and average resource utilization.

[0124] The step of performing aggregated analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring metrics, to obtain the monitoring results of the application monitoring metrics, includes:

[0125] The traffic data of the target business application of the third target application identifier within the target time period are statistically analyzed to obtain the request concurrency of the third target application identifier within the target time period.

[0126] Based on the target business application traffic data of the third target application identifier within the target time period, the container monitoring data, and the host GPU monitoring information in the host machine monitoring information, the average resource utilization rate of the third target application identifier within the target time period is determined.

[0127] The third target application identifier is the application identifier currently being monitored for concurrent metrics at the application dimension. The target time period can be one hour, one day, one week, or one month, etc.

[0128] Obtain the business application traffic data corresponding to the third target application identifier from all business application traffic data, and determine the target business application traffic data within the target time period from the business application traffic data corresponding to the third target application identifier. Count the number of these target business application traffic data as the request concurrency of the third target application identifier in the target time period.
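
Counting the request concurrency as described above can be sketched in a few lines. The sketch assumes each traffic record is a dictionary with an application name and a `datetime` timestamp, and that the target time period is the half-open interval from `start` to `end`; these conventions, the function `request_concurrency` and the field names are assumptions for illustration.

```python
from datetime import datetime

def request_concurrency(traffic, app_name, start, end):
    """Count traffic records of one application with start <= timestamp < end."""
    return sum(1 for r in traffic
               if r["app_name"] == app_name and start <= r["timestamp"] < end)
```

Calling the function once per hour with consecutive windows yields the points of an hourly request concurrency trend.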

[0129] The target application name is extracted from the target business application traffic data, and the application deployment tag and target host IP corresponding to the target application name are extracted from the target business application instance data. The container monitoring information whose container tag is the same as the application deployment tag and whose host IP is the target host IP is extracted from all container monitoring information, and the container identifiers are extracted from it. These container identifiers are the target container identifiers corresponding to the third target application identifier. Based on the target container identifiers, the resource utilization of the third target application identifier during the target time period, such as CPU utilization, memory utilization, and GPU utilization, can be determined from the container monitoring information and the host GPU monitoring information. The average value of these resource utilizations is then calculated to obtain the average resource utilization of the third target application identifier during the target time period.

[0130] Based on the above technical solution, the average resource utilization rate includes: CPU average utilization rate, memory average utilization rate and GPU average utilization rate;

[0131] The step of determining the average resource utilization of the third target application identifier during the target time period based on the target business application traffic data of the third target application identifier within the target time period, the container monitoring data, and the host GPU monitoring information in the host machine monitoring information includes:

[0132] Obtain the application deployment tag corresponding to the third target application identifier from the target business application instance data;

[0133] Based on the application deployment tag, determine the target container tag, obtain the target container identifier corresponding to the target container tag from the container monitoring data, and obtain the container CPU monitoring information and container memory monitoring information corresponding to the target container identifier within the target time period.

[0134] The CPU utilization in the container CPU monitoring information within the target time period is averaged to obtain the average CPU utilization of the third target application identifier in the target time period.

[0135] The memory utilization in the container memory monitoring information within the target time period is averaged to obtain the average memory utilization of the third target application identifier in the target time period.

[0136] Obtain the target GPU card identifier corresponding to the target container identifier from the container monitoring data, and obtain the GPU utilization rate corresponding to the target GPU card identifier within the target time period from the host GPU monitoring information in the host machine monitoring information;

[0137] The average GPU utilization rate of the third target application identifier during the target time period is calculated by averaging the GPU utilization rate of each GPU during the target time period.

[0138] The application deployment tag corresponding to the third target application identifier is obtained from the target business application instance data. This application deployment tag is then taken as the target container tag. The target container identifier corresponding to the target container tag is obtained from all container monitoring data, and the container CPU monitoring information and container memory monitoring information corresponding to the target container identifier within the target time period are then obtained. The CPU utilization in the container CPU monitoring information within the target time period is averaged to obtain the average CPU utilization of the third target application identifier within the target time period. Similarly, the memory utilization in the container memory monitoring information within the target time period is averaged to obtain the average memory utilization of the third target application identifier within the target time period.

[0139] The target GPU card identifier corresponding to the target container identifier is obtained from the container monitoring data (i.e., the container GPU card identifier in the container monitoring data). The host GPU monitoring information corresponding to the target GPU card identifier is obtained from the host GPU monitoring information of the host machine monitoring information. The host GPU monitoring information within the target time period (collection time within the target time period) is obtained from the determined host GPU monitoring information. The average value of the GPU utilization in these host GPU monitoring information is calculated to obtain the average GPU utilization of the third target application identifier in the target time period.

[0140] When calculating the average resource utilization rate, if the third target application identifier has multiple application instances, it is necessary to calculate the average of the sum of the resource utilization rates of all application instances at the same time point as the final average resource utilization rate.

[0141] Suppose that inference application A is deployed with three instances, c1, c2, and c3, and that the target time period includes data collection times t1, t2, ..., tn. Let f(ci,tj) be the CPU utilization of instance ci at collection time tj, so that f(c1,t1) is the CPU utilization of instance c1 at time t1, ..., and f(c3,tn) is the CPU utilization of instance c3 at time tn. Then, the average CPU utilization of this application within the target time period can be calculated using the following formula:

[0142]

[0143] Where avg(cpu) represents the average CPU utilization, sum() represents the summation function, and n represents the number of acquisition times within the target time period.

[0144] The application's average memory utilization during the target time period can be calculated using the following formula:

[0145]

[0146] Where avg(mem) represents the average memory utilization, sum() represents the summation function, n represents the number of collection times within the target time period, k(ci,tj) represents the memory utilization of instance ci at collection time tj, i ranges from 1 to the number of instances corresponding to the application, and j ranges from 1 to n.

[0147] The application's average GPU utilization during the target time period can be calculated using the following formula:

[0148]

[0149] Where avg(gpu) represents the average GPU utilization, sum() represents the summation function, n represents the number of collection times within the target time period, T(ci,tj) represents the GPU utilization of instance ci at collection time tj, i ranges from 1 to the number of instances corresponding to the application, and j ranges from 1 to n.
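
The averaging over instances and collection times can be sketched as follows. The sketch follows one reading of the description above: at each collection time the utilization values of all instances are summed, and these sums are averaged over the n collection times. Whether the result is additionally divided by the number of instances is not stated here, so the sketch exposes that choice as the hypothetical flag `per_instance`. The function `average_utilization` and the input layout are assumptions.

```python
def average_utilization(samples, per_instance=False):
    """Average a utilization metric over the collection times of a period.

    samples maps instance name -> {collection time: utilization}.
    """
    times = sorted({t for series in samples.values() for t in series})
    if not times:
        return 0.0
    # Sum over all instances at each collection time (missing samples count as 0).
    totals = [sum(series.get(t, 0.0) for series in samples.values())
              for t in times]
    # Average of the per-time sums over the n collection times.
    avg = sum(totals) / len(times)
    if per_instance:
        # Optional normalization by the number of instances (an assumption).
        avg /= len(samples)
    return avg
```

The same function applies unchanged to CPU, memory and GPU utilization, since only the sampled metric differs.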

[0150] Figure 10a is a trend chart of the average resource utilization of application 1 in an embodiment of this application. Figure 10b is a trend chart of the average resource utilization of application 2 in an embodiment of this application. As shown in Figure 10a and Figure 10b, the target time period can be one hour, and statistics are collected every hour to obtain the trend charts of the average resource utilization of application 1 and application 2. Curve 1 represents the average CPU utilization, curve 2 represents the average memory utilization, and curve 3 represents the average GPU utilization.

[0151] Based on the above technical solution, after determining the average resource utilization of the third target application identifier during the target time period based on the business application traffic data of the third target application identifier within the target time period, the container monitoring data, and the host GPU monitoring information in the host machine monitoring information, the method further includes:

[0152] If the average CPU utilization is less than or equal to the first CPU utilization threshold, the average memory utilization is less than or equal to the first memory utilization threshold, and the average GPU utilization is less than or equal to the first GPU utilization threshold, a first prompt message for performing a scaling-down operation is sent to the application manager corresponding to the third target application identifier.

[0153] If the average CPU utilization is greater than the second CPU utilization threshold, the average memory utilization is greater than the second memory utilization threshold, and the average GPU utilization is greater than the second GPU utilization threshold, a second prompt message for performing a capacity expansion operation is sent to the application manager corresponding to the third target application identifier.

[0154] When calculating the average utilization rate of resources, the target time period can be a relatively long period, such as a day, a week, or a month.

[0155] The average CPU utilization, maximum CPU utilization, minimum CPU utilization, average memory utilization, maximum memory utilization, minimum memory utilization, average GPU utilization, maximum GPU utilization, minimum GPU utilization, average business request concurrency, maximum business request concurrency, and minimum business request concurrency can be statistically analyzed through several time dimensions, such as one day, seven days, and 30 days.

[0156] If the average resource utilization rate is consistently lower than or equal to a system preset threshold (e.g., 40%), specifically if the average CPU utilization rate is less than or equal to a first CPU utilization threshold, the average memory utilization rate is less than or equal to a first memory utilization threshold, and the average GPU utilization rate is less than or equal to a first GPU utilization threshold, the background system will trigger a first notification message to the application manager and resource management department corresponding to the third target application identifier. This notification message prompts the application deployment to be scaled down to free up GPU resources for other scenarios. The first notification message can be sent via email.

[0157] If the average resource utilization rate consistently exceeds a system preset threshold (e.g., 60%), specifically if the average CPU utilization rate exceeds the second CPU utilization threshold, the average memory utilization rate exceeds the second memory utilization threshold, and the average GPU utilization rate exceeds the second GPU utilization threshold, the background system will trigger a second notification message to the application manager and resource management department corresponding to the third target application identifier. This notification message prompts the application deployment to be scaled up to meet business development needs. The second notification message can be sent via email.
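
The two threshold checks can be sketched as a single decision function. The 40% and 60% defaults are the example values given above; the function `scaling_prompt`, its return values and the tuple layout of the thresholds are hypothetical, and sending the notification (for example by email) is left out.

```python
def scaling_prompt(avg_cpu, avg_mem, avg_gpu,
                   first=(40.0, 40.0, 40.0), second=(60.0, 60.0, 60.0)):
    """Return 'scale_down', 'scale_up' or None from average utilizations (%)."""
    values = (avg_cpu, avg_mem, avg_gpu)
    # First prompt: all three averages at or below their first thresholds.
    if all(v <= t for v, t in zip(values, first)):
        return "scale_down"
    # Second prompt: all three averages above their second thresholds.
    if all(v > t for v, t in zip(values, second)):
        return "scale_up"
    return None
```

Because both conditions require all three metrics to agree, a mixed profile such as high GPU but low CPU utilization produces no prompt.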

[0158] Table 3 is an example of resource utilization evaluation criteria. As shown in Table 3, the first threshold (first CPU utilization threshold, first memory utilization threshold, first GPU utilization threshold) can be 40%, and the second threshold (second CPU utilization threshold, second memory utilization threshold, second GPU utilization threshold) can be 60%.

[0159] Table 3 Resource Utilization Rate Assessment Standards

[0160]
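The scale-down and scale-up conditions described above can be sketched as a small decision function. The 40% and 60% defaults are the example values given in the text; the `Thresholds` class and the `scaling_advice` function name are hypothetical, and how the notification is delivered (for example by email) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    cpu: float
    memory: float
    gpu: float


def scaling_advice(avg_cpu, avg_mem, avg_gpu,
                   first=Thresholds(40.0, 40.0, 40.0),
                   second=Thresholds(60.0, 60.0, 60.0)):
    """Return 'scale_down', 'scale_up', or None for one application.

    Scale down only if ALL three averages are <= the first thresholds;
    scale up only if ALL three averages are > the second thresholds.
    """
    if avg_cpu <= first.cpu and avg_mem <= first.memory and avg_gpu <= first.gpu:
        return "scale_down"
    if avg_cpu > second.cpu and avg_mem > second.memory and avg_gpu > second.gpu:
        return "scale_up"
    return None
```

Note that a mixed profile, such as low CPU utilization with high GPU utilization, triggers neither notification under these conditions.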

[0161] Based on the above technical solution, the monitoring results also include: average response time;

[0162] The step of performing aggregated analysis of at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data according to application monitoring metrics to obtain the monitoring results of the application monitoring metrics also includes:

[0163] The average response time of the third target application identifier within the target time period is obtained by statistically analyzing the average response time of the business application traffic data within the target time period.

[0164] The average response time of each target business application traffic data collected within the target time period is calculated to obtain the average response time of the third target application identifier within the target time period.
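A minimal sketch of the average response time calculation is given below. It assumes that the "response time" field of each traffic record is a duration in milliseconds and that records are dictionaries with the hypothetical keys `app_name`, `request_time`, and `response_time_ms`; the source does not specify the record layout or units.

```python
from datetime import datetime
from statistics import mean


def average_response_time(records, app_name, start, end):
    """Average response time of one application within [start, end).

    Returns None when no request of that application falls in the period.
    """
    durations = [
        r["response_time_ms"]
        for r in records
        if r["app_name"] == app_name and start <= r["request_time"] < end
    ]
    return mean(durations) if durations else None
```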

[0165] Figure 11a is the request concurrency trend chart of application 1 in an embodiment of this application, and Figure 11b is the request concurrency trend chart of application 2. As shown in Figures 11a and 11b, for example, the target time period can be one hour, and statistics are collected every hour to obtain the request concurrency trend charts for application 1 and application 2. Curve 4 represents the number of concurrent requests, and curve 5 represents the average response time.

[0166] By providing an interface to display application traffic trends (i.e., request concurrency trends) and computing resource trends (i.e., average resource utilization trends), users can intuitively monitor application idle and busy periods. When computing resource utilization spikes, users can use the interface to check whether it is caused by high business concurrency.

[0167] Based on the above technical solution, a natural day includes multiple target time periods;

[0168] After statistically analyzing the business application traffic data of the third target application identifier within the target time period to obtain the request concurrency of the third target application identifier within the target time period, the method further includes:

[0169] Within each natural day of the statistical period, at least one target time period with request concurrency less than the first concurrency threshold is determined as the first target time period, and at least one target time period with request concurrency greater than or equal to the second concurrency threshold is determined as the second target time period.

[0170] Based on the first target time period, the idle time period of the third target application identifier within the statistical period is determined, and based on the second target time period, the busy time period of the third target application identifier within the statistical period is determined.

[0171] The statistical period may include multiple natural days; for example, the statistical period may be one week or one month. One natural day includes the target time period, which may be one hour, two hours, etc.

[0172] By sampling the number of concurrent requests for each application in each time period throughout the day, analyzing idle periods, and then repeatedly verifying over a longer time window, we can finally obtain an estimated value of the current application's busy / idle status over a future period.

[0173] Within each natural day of the statistical period, the request concurrency of each target time period within that natural day is compared with the first concurrency threshold and the second concurrency threshold. If the request concurrency of the target time period is less than the first concurrency threshold, then the target time period is designated as the first target time period. If the request concurrency of the target time period is greater than or equal to the second concurrency threshold, then the target time period is designated as the second target time period.

[0174] The first target time period of each natural day is statistically analyzed, and multiple consecutive first target time periods are merged to obtain the idle time period of each natural day. The intersection of the idle time periods of all natural days within the statistical period is taken as the idle time period of the third target application in the statistical period.

[0175] The second target time period within each natural day is statistically analyzed. Multiple consecutive second target time periods are merged to obtain the busy time period within each natural day. The intersection of the busy time periods within all natural days in the statistical period is used as the busy time period of the third target application within the statistical period.
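The classification, merging, and intersection steps above can be sketched as follows, assuming one-hour target time periods and a per-day mapping from hour (0 to 23) to request concurrency. The function names and data layout are hypothetical. Because all periods are aligned to hourly slots, intersecting the per-day hour sets and then merging consecutive hours gives the same result as merging per day and then intersecting the merged periods.

```python
def classify_slots(day, first_threshold, second_threshold):
    """Split one day's hourly concurrency into idle and busy hour sets."""
    idle = {hour for hour, c in day.items() if c < first_threshold}
    busy = {hour for hour, c in day.items() if c >= second_threshold}
    return idle, busy


def merge_consecutive(hours):
    """Merge consecutive hours into (start_hour, end_hour) periods, end exclusive."""
    periods = []
    for hour in sorted(hours):
        if periods and periods[-1][1] == hour:
            periods[-1] = (periods[-1][0], hour + 1)  # extend the open period
        else:
            periods.append((hour, hour + 1))
    return periods


def idle_busy_periods(days, first_threshold, second_threshold):
    """Idle and busy periods shared by every natural day in the statistical period."""
    if not days:
        return [], []
    classified = [classify_slots(d, first_threshold, second_threshold) for d in days]
    idle = set.intersection(*(i for i, _ in classified))
    busy = set.intersection(*(b for _, b in classified))
    return merge_consecutive(idle), merge_consecutive(busy)
```

With a first concurrency threshold of 40 and a second of 1000, a single day whose concurrency stays below 40 from 0:00 to 9:00 and at or above 1000 from 12:00 to 16:00 yields the idle period `(0, 9)` and the busy period `(12, 16)`.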

[0176] Table 4 shows the request concurrency of a certain application in one day. As shown in Table 4, assuming that the first concurrency threshold is 40 and the second concurrency threshold is 1000, the idle time period of the application is 0:00-9:00 and the busy time period is 12:00-16:00.

[0177] Table 4 Request Concurrency of an Application in One Day

[0178]

[0179] In addition to performing the above analysis, for abnormal scenarios with low business request concurrency and high utilization of corresponding deployment resources such as CPU, memory, and GPU, a warning email can be triggered to notify the application manager to make rectifications and provide explanations, so as to optimize the efficiency of application request processing.
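The abnormal scenario above could be detected with a check along the following lines. The source gives no thresholds for this case, so the default values, the function name, and the choice to flag the application when any one of the three resources is highly utilized are all assumptions.

```python
def is_abnormal(request_concurrency, avg_cpu, avg_mem, avg_gpu,
                low_concurrency=40, high_utilization=60.0):
    """Flag low business concurrency combined with high resource utilization.

    Assumption: utilization counts as high when ANY of CPU, memory, or GPU
    exceeds `high_utilization` (percent).
    """
    low_traffic = request_concurrency < low_concurrency
    high_usage = max(avg_cpu, avg_mem, avg_gpu) > high_utilization
    return low_traffic and high_usage
```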

[0180] The embodiments of this application are more suitable for scenarios where GPU computing power clusters are physically scattered, there are many inference applications deployed, and one-to-one operation and management of applications is difficult. The heterogeneous GPU computing power cluster monitoring method can realize integrated monitoring and operation and maintenance, reduce manpower input, and improve work efficiency.

[0181] This application addresses the scarcity and high cost of GPU resources, which are also scattered in distribution. To achieve reasonable allocation and use of GPU computing power clusters, it can track memory usage, computing unit load, and power consumption in real time. It provides a data support basis for evaluating application deployment resources during runtime, appropriately scaling up applications that run at full load for extended periods, and scaling down applications that do not match business development forecasts, ultimately achieving reasonable allocation and use of resources.

[0182] This application embodiment combines CPU, memory, GPU monitoring and business traffic monitoring to achieve in-depth analysis of the underlying GPU computing power cluster, while also providing intelligence to the monitoring (providing prompts for scaling down or up) and offering sustainable suggestions for application deployment and operation.

[0183] Figure 12 is a schematic diagram of the structure of a heterogeneous GPU computing power cluster monitoring device provided in an embodiment of this application. As shown in Figure 12, the device includes:

[0184] The data storage module 1210 is used to consume static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from the message center by subscribing to message topics, and to store the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data in a storage medium; the static resource information includes host information and cluster information; the host monitoring information and container monitoring information are obtained by resource acquisition plugins in each host and sent to the message center by the host; the business application instance data is obtained by the computing power scheduling center and sent to the message center; and the business application traffic data is obtained by the access gateway of the computing power scheduling center and sent to the message center.

[0185] The data acquisition module 1220 is used to acquire the static resource information, host monitoring information, container monitoring information, business application instance data and business application traffic data from the storage medium.

[0186] The aggregation analysis module 1230 is used to perform aggregation analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data and business application traffic data according to the application monitoring indicators, and obtain the monitoring results of the application monitoring indicators.
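The relationship between the data storage module and the data acquisition module can be sketched with an in-memory stand-in for the message center. The topic names, class names, and the dictionary used as the storage medium are all hypothetical; a real deployment would use a message broker and a database, neither of which is named in this section.

```python
from collections import defaultdict

# Hypothetical topic names; the source does not name the message topics.
TOPICS = (
    "static_resource",
    "host_monitoring",
    "container_monitoring",
    "app_instance",
    "app_traffic",
)


class InMemoryMessageCenter:
    """Stand-in for the message center (publish/subscribe by topic)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


class DataStorageModule:
    """Subscribes to every topic and appends each consumed message to storage."""

    def __init__(self, center, storage):
        for topic in TOPICS:
            # Bind `topic` per iteration so each callback writes to its own list.
            center.subscribe(topic, lambda msg, t=topic: storage[t].append(msg))


class DataAcquisitionModule:
    """Reads previously stored data back for aggregation analysis."""

    def __init__(self, storage):
        self._storage = storage

    def fetch(self, topic):
        return list(self._storage[topic])
```

An aggregation analysis module would then call `fetch` for the data categories that a given application monitoring indicator needs.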

[0187] Optionally, the host monitoring information includes: host CPU monitoring information, host memory monitoring information, host disk monitoring information, and host GPU monitoring information;

[0188] The container monitoring information includes: container information, container CPU monitoring information, and container memory monitoring information.

[0189] Optionally, the host CPU monitoring information includes: host IP, collection time, CPU utilization, number of CPU cores, and cluster identifier;

[0190] The host memory monitoring information includes: host IP, collection time, memory utilization, memory size, and cluster identifier;

[0191] The host disk monitoring information includes: host IP, collection time, disk utilization, disk size, disk identifier, and cluster identifier;

[0192] The host GPU monitoring information includes: host IP, collection time, GPU utilization, GPU memory size, GPU identifier, name of the process occupying the GPU, GPU brand, and cluster identifier.

[0193] Optionally, the container information includes: container identifier, container IP, number of container CPU cores, container memory, GPU card identifier used by the container, container startup time, host IP, and cluster identifier;

[0194] The container CPU monitoring information includes: host IP, container IP, container identifier, collection time, CPU utilization, number of CPU cores, container tag, and cluster identifier. The container tag represents the service associated with the container.

[0195] The container memory monitoring information includes: host IP, container IP, container identifier, collection time, memory utilization, memory size, and cluster identifier.
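The field lists above map naturally onto record types. The sketch below models the host GPU monitoring information, the container information, and the container CPU monitoring information; the attribute names, types, and units are assumptions, since the source lists only the field meanings.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class HostGpuSample:
    host_ip: str
    collected_at: datetime
    gpu_utilization: float        # assumed to be a percentage
    gpu_memory_size: int          # unit not specified in the source
    gpu_id: str
    process_name: Optional[str]   # name of the process occupying the GPU, if any
    gpu_brand: str
    cluster_id: str


@dataclass(frozen=True)
class ContainerInfo:
    container_id: str
    container_ip: str
    cpu_cores: int
    memory: int                   # unit not specified in the source
    gpu_ids: Tuple[str, ...]      # GPU card identifiers used by the container
    started_at: datetime
    host_ip: str
    cluster_id: str


@dataclass(frozen=True)
class ContainerCpuSample:
    host_ip: str
    container_ip: str
    container_id: str
    collected_at: datetime
    cpu_utilization: float        # assumed to be a percentage
    cpu_cores: int
    container_tag: str            # the service associated with the container
    cluster_id: str
```

The shared keys (`host_ip`, `container_id`, `gpu_id`, `cluster_id`) are what allow records from different sources to be joined during aggregation analysis.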

[0196] Optionally, the business application instance data includes: application name, application description, application manager, application manager's phone number, application deployment host IP list, application instance IP list, application instance name, application deployment tag, cluster name, and GPU card type, wherein the application deployment tag represents the business associated with the application;

[0197] The business application traffic data includes: host IP, application name, request time, response time, HTTP request result code, and request path.

[0198] Optionally, the host information includes: host IP, cluster identifier, CPU model, brand, number of CPU cores, GPU model, GPU brand, GPU memory, memory size, memory model, and memory brand;

[0199] The cluster information includes: cluster identifier, cluster type, and cluster technology base.

[0200] Optionally, the device further includes:

[0201] The first node set determination module is used to determine the host information associated with the first target cluster identifier based on the cluster information and the host information, and obtain the first node set corresponding to the first target cluster identifier.

[0202] The host information display module is used to display the host monitoring information corresponding to each node identifier in the first node set.

[0203] The first GPU card set determination module is used to determine the first GPU card set corresponding to the first target cluster identifier based on the host GPU monitoring information in the host monitoring information;

[0204] The GPU card count determination module is used to determine the number of idle GPU cards and the number of active GPU cards corresponding to the first target cluster identifier based on the host GPU monitoring information.
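A cluster-level view along these lines could be computed as shown below. The sketch assumes that records are dictionaries with hypothetical keys, and that a GPU counts as idle when the latest sample reports no occupying process; this section does not state the idle criterion.

```python
def first_node_set(hosts, cluster_id):
    """Host IPs that belong to the target cluster."""
    return sorted({h["host_ip"] for h in hosts if h["cluster_id"] == cluster_id})


def gpu_counts(gpu_samples, cluster_id):
    """Count idle and active GPU cards of a cluster from host GPU samples.

    Only the most recent sample per (host_ip, gpu_id) is considered.
    Assumption: a card is idle when its latest sample has no process name.
    """
    latest = {}
    for sample in gpu_samples:
        if sample["cluster_id"] != cluster_id:
            continue
        key = (sample["host_ip"], sample["gpu_id"])
        if key not in latest or sample["collected_at"] > latest[key]["collected_at"]:
            latest[key] = sample
    idle = sum(1 for s in latest.values() if not s["process_name"])
    return {"total": len(latest), "idle": idle, "active": len(latest) - idle}
```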

[0205] Optionally, the device further includes:

[0206] The full host information determination module is used to determine the full host information based on the host information.

[0207] The first container set determination module is used to determine the first container set corresponding to each host IP and cluster identifier based on the host IP, cluster identifier and container monitoring information in the full host information;

[0208] The host information statistics module is used to count the second set of GPU cards under the same host from the full host information, and for each host, determine the number of idle GPU cards, the number of GPU cards in use, and the container instance corresponding to the GPU card in use based on the host GPU monitoring information.

[0209] Optionally, the device further includes:

[0210] The third GPU card set determination module is used to determine the third GPU card set based on the host GPU monitoring information in the host machine monitoring information.

[0211] The occupancy status determination module is used to determine the occupancy status corresponding to the target GPU identifier based on the host GPU monitoring information corresponding to the target GPU identifier in the third GPU card set.

[0212] The application information display module is used to determine the application information corresponding to the target GPU identifier based on the business application instance data when the occupancy status is occupied, and to display the application information.

[0213] The host information display module is used to display the host information corresponding to the target GPU identifier.
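The GPU-dimension lookup above can be sketched as follows. The source says only that the application information is determined from the business application instance data; linking a GPU to an application through the container that uses the card and the application instance IP list is an assumption made for this example, as are all key and function names.

```python
def occupancy_status(gpu_sample):
    """'occupied' if a process holds the GPU, else 'idle' (assumed criterion)."""
    return "occupied" if gpu_sample.get("process_name") else "idle"


def application_for_gpu(gpu_id, host_ip, containers, app_instances):
    """Find the application whose instance runs in a container using this GPU.

    Join path (assumed): GPU card -> container on the same host that lists the
    card -> application whose instance IP list contains that container's IP.
    """
    container_ips = {
        c["container_ip"]
        for c in containers
        if c["host_ip"] == host_ip and gpu_id in c["gpu_ids"]
    }
    for app in app_instances:
        if container_ips & set(app["instance_ips"]):
            return app
    return None
```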

[0214] Optionally, the application monitoring metrics are application-dimensional monitoring metrics, and the monitoring results include application cluster information, a second set of nodes, a second set of containers, and a set of GPU cards in use.

[0215] The aggregation analysis module includes:

[0216] The application cluster information acquisition unit is used to obtain the cluster information corresponding to the second target cluster identifier from the cluster information based on the second target cluster identifier in the business application instance data corresponding to the first target application identifier, thereby obtaining the application cluster information;

[0217] The second node set determination unit is used to determine the second node set corresponding to the first target application identifier based on the application deployment host machine IP list and the second target cluster identifier in the business application instance data.

[0218] The second container set determination unit is used to determine the second container set corresponding to the first target application identifier based on the application instance IP list and the second target cluster identifier in the business application instance data, as well as the container monitoring information.

[0219] The GPU card set determination unit is used to determine the GPU card set corresponding to the first target application identifier based on the target container monitoring information corresponding to each container identifier in the second container set.
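The four units above amount to a chain of lookups starting from one application's instance data. A minimal sketch, assuming dictionary records with hypothetical keys and assuming that the container information carries the GPU card identifiers used by each container:

```python
def application_view(app, clusters, containers):
    """Assemble the application-dimension monitoring result for one application."""
    cluster_id = app["cluster_id"]

    # Application cluster information: the cluster record for the target cluster.
    cluster_info = next(
        (c for c in clusters if c["cluster_id"] == cluster_id), None
    )

    # Second node set: hosts from the application deployment host IP list.
    node_set = set(app["deploy_host_ips"])

    # Second container set: containers in the same cluster whose IP appears
    # in the application instance IP list.
    instance_ips = set(app["instance_ips"])
    container_set = [
        c for c in containers
        if c["cluster_id"] == cluster_id and c["container_ip"] in instance_ips
    ]

    # GPU cards in use: (host IP, GPU card identifier) pairs of those containers.
    gpu_set = {(c["host_ip"], g) for c in container_set for g in c["gpu_ids"]}

    return {
        "cluster": cluster_info,
        "nodes": node_set,
        "containers": [c["container_id"] for c in container_set],
        "gpus": gpu_set,
    }
```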

[0220] Optionally, the application monitoring metrics are application business dimension monitoring metrics, and the monitoring results include resource usage information and resource utilization rate;

[0221] The aggregation analysis module includes:

[0222] The resource occupancy information determination unit is used to determine the resource occupancy information corresponding to the second target application identifier based on the target host IP and application deployment tag in the business application instance data corresponding to the second target application identifier, as well as the container monitoring information.

[0223] The resource utilization display unit is used to obtain the resource utilization of the target host machine where the second target application identifier is located from the host machine monitoring information, and display the resource utilization, which includes CPU utilization, memory utilization and GPU utilization.

[0224] Optionally, the application monitoring metrics are application-level concurrency monitoring metrics, and the monitoring results include request concurrency and average resource utilization.

[0225] The aggregation analysis module includes:

[0226] The request concurrency determination unit is used to statistically analyze the target business application traffic data of the third target application identifier within the target time period to obtain the request concurrency of the third target application identifier within the target time period.

[0227] The resource average utilization determination unit is used to determine the resource average utilization of the third target application identifier in the target time period based on the target business application traffic data of the third target application identifier in the target time period, the container monitoring data, and the host GPU monitoring information in the host machine monitoring information.

[0228] Optionally, the average resource utilization includes: CPU average utilization, memory average utilization, and GPU average utilization;

[0229] The resource average utilization rate determination unit is specifically used for:

[0230] Obtain the application deployment tag corresponding to the third target application identifier from the target business application instance data;

[0231] Based on the application deployment tag, determine the target container tag, obtain the target container identifier corresponding to the target container tag from the container monitoring data, and obtain the container CPU monitoring information and container memory monitoring information corresponding to the target container identifier within the target time period.

[0232] The CPU utilization in the container CPU monitoring information within the target time period is averaged to obtain the average CPU utilization of the third target application identifier in the target time period.

[0233] The memory utilization in the container memory monitoring information within the target time period is averaged to obtain the average memory utilization of the third target application identifier in the target time period.

[0234] Obtain the target GPU card identifier corresponding to the target container identifier from the container monitoring data, and obtain the GPU utilization rate corresponding to the target GPU card identifier within the target time period from the host GPU monitoring information in the host machine monitoring information;

[0235] The average GPU utilization rate of the third target application identifier during the target time period is calculated by averaging the GPU utilization rate of each GPU during the target time period.
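The steps of the resource average utilization determination unit can be sketched as one function. The sketch assumes dictionary records with hypothetical keys, assumes that the target container tag equals the application deployment tag, and averages over all samples in the period rather than first averaging per card.

```python
from statistics import mean


def average_utilization(deploy_tag, container_cpu, container_mem,
                        container_info, host_gpu, start, end):
    """Average CPU, memory, and GPU utilization of one application in [start, end)."""

    def in_period(ts):
        return start <= ts < end

    def avg(values):
        return mean(values) if values else None

    # Deployment tag -> container tag -> target container identifiers.
    ids = {s["container_id"] for s in container_cpu if s["container_tag"] == deploy_tag}

    cpu = [s["cpu_util"] for s in container_cpu
           if s["container_id"] in ids and in_period(s["collected_at"])]
    mem = [s["mem_util"] for s in container_mem
           if s["container_id"] in ids and in_period(s["collected_at"])]

    # Target containers -> GPU cards they use -> host GPU samples for those cards.
    gpu_keys = {(c["host_ip"], g) for c in container_info
                if c["container_id"] in ids for g in c["gpu_ids"]}
    gpu = [s["gpu_util"] for s in host_gpu
           if (s["host_ip"], s["gpu_id"]) in gpu_keys and in_period(s["collected_at"])]

    return {"cpu": avg(cpu), "memory": avg(mem), "gpu": avg(gpu)}
```

Note that GPU utilization comes from host-level samples, so the join through the container's GPU card identifiers is what attributes it to the application.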

[0236] Optionally, the device further includes:

[0237] The first prompt sending module is used to send a first prompt message for performing a scaling-down operation to the application manager corresponding to the third target application identifier if the average CPU utilization is less than or equal to the first CPU utilization threshold, the average memory utilization is less than or equal to the first memory utilization threshold, and the average GPU utilization is less than or equal to the first GPU utilization threshold.

[0238] The second prompt sending module is used to send a second prompt message for performing a capacity expansion operation to the application manager corresponding to the third target application identifier if the average CPU utilization is greater than the second CPU utilization threshold, the average memory utilization is greater than the second memory utilization threshold, and the average GPU utilization is greater than the second GPU utilization threshold.

[0239] Optionally, the monitoring results may also include: average response time;

[0240] The aggregation analysis module also includes:

[0241] The average response time determination unit is used to calculate the average response time in the target business application traffic data within the target time period, and obtain the average response time of the third target application identifier within the target time period.

[0242] Optionally, a natural day may include multiple target time periods;

[0243] The device further includes:

[0244] The time period statistics module is used to determine, within each natural day of the statistical period, at least one target time period in which the request concurrency is less than a first concurrency threshold, as the first target time period, and at least one target time period in which the request concurrency is greater than or equal to a second concurrency threshold, as the second target time period.

[0245] The idle / busy time period determination module is used to determine the idle time period of the third target application identifier within the statistical period based on the first target time period, and to determine the busy time period of the third target application identifier within the statistical period based on the second target time period.

[0246] The heterogeneous GPU computing power cluster monitoring device provided in this application embodiment is used to implement the steps of the heterogeneous GPU computing power cluster monitoring method described in this application embodiment. The specific implementation of each module of the device is described in the corresponding steps, and will not be repeated here.

[0247] The heterogeneous GPU computing cluster monitoring device provided in this application embodiment consumes static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data from a message center by subscribing to message topics. It stores this static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data in a storage medium. When aggregation analysis is required, it retrieves the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data from the storage medium. Based on application monitoring metrics, it performs aggregation analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data to obtain the monitoring results of the application monitoring metrics. This enables monitoring of applications running in the GPU computing cluster according to application monitoring metrics, achieving monitoring at the application dimension.

[0248] Figure 13 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 13, the electronic device 1300 may include one or more processors 1310 and one or more memories 1320 connected to the processors 1310. The electronic device 1300 may also include an input interface 1330 and an output interface 1340 for communicating with another device or system. Program code executed by the processor 1310 may be stored in the memory 1320.

[0249] The processor 1310 in the electronic device 1300 calls the program code stored in the memory 1320 to execute the heterogeneous GPU computing power cluster monitoring method in the above embodiment.

[0250] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the heterogeneous GPU computing power cluster monitoring method as described in this application.

[0251] This application also provides a computer program product that, when executed by a processor, implements the steps of the heterogeneous GPU computing power cluster monitoring method as described in this application.

[0252] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. For parts that are identical or similar between embodiments, the embodiments can be referred to one another. Since the apparatus embodiments are fundamentally similar to the method embodiments, their description is relatively brief; for relevant parts, refer to the descriptions in the method embodiments.

[0253] The above provides a detailed description of a heterogeneous GPU computing cluster monitoring method, device, electronic device, and storage medium provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

[0254] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

Claims

1. A method for monitoring heterogeneous GPU computing power clusters, characterized in that the method includes: consuming static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data from the message center by subscribing to message topics, and storing the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data in a storage medium, wherein the static resource information includes host information and cluster information, the host monitoring information and container monitoring information are obtained by the resource acquisition plugins in each host and sent to the message center by the host, the business application instance data is obtained by the computing power scheduling center and sent to the message center, and the business application traffic data is obtained by the access gateway of the computing power scheduling center and sent to the message center; obtaining the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from the storage medium; and performing aggregation analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring metrics, to obtain the monitoring results of the application monitoring metrics; wherein the application monitoring metrics are application-dimensional monitoring metrics, and the monitoring results include application cluster information, a second node set, a second container set, and a set of GPU cards in use;
the step of performing aggregation analysis on at least one of the static resource information, host machine monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring metrics, to obtain the monitoring results of the application monitoring metrics, includes: based on the second target cluster identifier in the business application instance data corresponding to the first target application identifier, obtaining the cluster information corresponding to the second target cluster identifier from the cluster information, to obtain the application cluster information; based on the application deployment host IP list and the second target cluster identifier in the business application instance data, determining the second node set corresponding to the first target application identifier; based on the application instance IP list and the second target cluster identifier in the business application instance data, and the container monitoring information, determining the second container set corresponding to the first target application identifier; and based on the target container monitoring information corresponding to each container identifier in the second container set, determining the set of GPU cards occupied by the first target application identifier.

2. The method according to claim 1, characterized in that the host monitoring information includes: host CPU monitoring information, host memory monitoring information, host disk monitoring information, and host GPU monitoring information; and the container monitoring information includes: container information, container CPU monitoring information, and container memory monitoring information.

3. The method according to claim 2, characterized in that the host CPU monitoring information includes: host IP, collection time, CPU utilization, number of CPU cores, and cluster identifier; the host memory monitoring information includes: host IP, collection time, memory utilization, memory size, and cluster identifier; the host disk monitoring information includes: host IP, collection time, disk utilization, disk size, disk identifier, and cluster identifier; and the host GPU monitoring information includes: host IP, collection time, GPU utilization, GPU memory size, GPU identifier, name of the process occupying the GPU, GPU brand, and cluster identifier.

4. The method according to claim 2, characterized in that the container information includes: container identifier, container IP, number of container CPU cores, container memory, GPU card identifier used by the container, container startup time, host IP, and cluster identifier; the container CPU monitoring information includes: host IP, container IP, container identifier, collection time, CPU utilization, number of CPU cores, container tag, and cluster identifier, wherein the container tag represents the service associated with the container; and the container memory monitoring information includes: host IP, container IP, container identifier, collection time, memory utilization, memory size, and cluster identifier.

5. The method according to claim 1, characterized in that the business application instance data includes: application name, application description, application manager, application manager's phone number, application deployment host IP list, application instance IP list, application instance name, application deployment tag, cluster name, and GPU card type, wherein the application deployment tag represents the business associated with the application; and the business application traffic data includes: host IP, application name, request time, response time, HTTP request result code, and request path.

6. The method according to claim 1, characterized in that, The host information includes: host IP, cluster identifier, CPU model, CPU brand, number of CPU cores, GPU model, GPU brand, GPU memory, memory size, memory model, and memory brand; The cluster information includes: cluster identifier, cluster type, and cluster technology base.

7. The method according to claim 6, characterized in that, The method further includes: Based on the cluster information and the host information, determine the host information associated with the first target cluster identifier, and obtain the first node set corresponding to the first target cluster identifier; Based on each node identifier in the first node set, display the host monitoring information corresponding to each node identifier; Based on the host GPU monitoring information in the host monitoring information, determine the first GPU card set corresponding to the first target cluster identifier; Based on the host GPU monitoring information, determine the number of idle GPUs and the number of active GPUs corresponding to the first target cluster identifier.
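
By way of non-limiting illustration only, the cluster-level counting of claim 7 could be sketched in Python as follows. The class and field names (`HostInfo`, `GpuSample`, `process_name`, and so on) are hypothetical stand-ins for the fields listed in claims 3 and 6. The rule that a GPU is idle when its latest sample names no occupying process is an assumption, since the claim does not fix how idleness is decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostInfo:
    host_ip: str
    cluster_id: str


@dataclass(frozen=True)
class GpuSample:
    host_ip: str
    cluster_id: str
    gpu_id: str
    collected_at: int   # collection time, epoch seconds
    process_name: str   # assumption: empty string means no process occupies the GPU


def cluster_gpu_summary(hosts, gpu_samples, cluster_id):
    """Return the node set, GPU card set and idle/active counts for one cluster."""
    # First node set: hosts whose static host information carries the cluster identifier.
    nodes = {h.host_ip for h in hosts if h.cluster_id == cluster_id}

    # Keep only the newest sample per (host, GPU) so that each card is counted once.
    latest = {}
    for s in gpu_samples:
        if s.cluster_id != cluster_id or s.host_ip not in nodes:
            continue
        key = (s.host_ip, s.gpu_id)
        if key not in latest or s.collected_at > latest[key].collected_at:
            latest[key] = s

    idle = sum(1 for s in latest.values() if not s.process_name)
    return {
        "nodes": sorted(nodes),
        "gpus": sorted(latest),          # first GPU card set as (host IP, GPU id) pairs
        "idle": idle,
        "active": len(latest) - idle,
    }
```

Keying each card by the pair (host IP, GPU identifier) is a design choice of this sketch: it avoids collisions when different hosts reuse the same local GPU identifier.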

8. The method according to claim 6, characterized in that, The method further includes: Based on the host information, determine the full host information; Based on the host IP and cluster identifier in the full host information, and on the container monitoring information, determine the first container set corresponding to each host IP and cluster identifier; Determine, from the full host information, the second GPU card set under each host, and for each host, determine the number of idle GPU cards, the number of in-use GPU cards, and the container instance corresponding to each in-use GPU card based on the host GPU monitoring information in the host monitoring information.

9. The method according to any one of claims 1-8, characterized in that, The method further includes: Determine the third GPU card set based on the host GPU monitoring information in the host monitoring information; Based on the host GPU monitoring information corresponding to the target GPU identifier in the third GPU card set, determine the occupancy status corresponding to the target GPU identifier; When the occupancy status is occupied, determine the application information corresponding to the target GPU identifier based on the business application instance data, and display the application information; Display the host information corresponding to the target GPU identifier.

10. The method according to any one of claims 1-8, characterized in that, The application monitoring metrics are application business dimension monitoring metrics, and the monitoring results include resource usage information and resource utilization rate; The step of performing aggregation analysis on at least one of the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring metrics, to obtain the monitoring results of the application monitoring metrics, includes: Based on the target host IP and application deployment tag in the business application instance data corresponding to the second target application identifier, and on the container monitoring information, determine the resource usage information corresponding to the second target application identifier; Obtain, from the host monitoring information, the resource utilization rate of the target host on which the application of the second target application identifier is located, and display the resource utilization rate; The resource utilization rate includes CPU utilization rate, memory utilization rate and GPU utilization rate.

11. The method according to any one of claims 1-8, characterized in that, The application monitoring metrics are application concurrency dimension monitoring metrics, and the monitoring results include request concurrency and average resource utilization; The step of performing aggregation analysis on at least one of the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring metrics, to obtain the monitoring results of the application monitoring metrics, includes: Statistically analyze the target business application traffic data of the third target application identifier within the target time period to obtain the request concurrency of the third target application identifier within the target time period; Based on the target business application traffic data of the third target application identifier within the target time period, the container monitoring information, and the host GPU monitoring information in the host monitoring information, determine the average resource utilization of the third target application identifier within the target time period.

12. The method according to claim 11, characterized in that, The average resource utilization includes: CPU average utilization, memory average utilization, and GPU average utilization; The step of determining the average resource utilization of the third target application identifier during the target time period based on the target business application traffic data of the third target application identifier within the target time period, the container monitoring information, and the host GPU monitoring information in the host monitoring information includes: Obtain the application deployment tag corresponding to the third target application identifier from the target business application instance data; Based on the application deployment tag, determine the target container tag, obtain the target container identifier corresponding to the target container tag from the container monitoring information, and obtain the container CPU monitoring information and container memory monitoring information corresponding to the target container identifier within the target time period; Calculate the average of the CPU utilization in the container CPU monitoring information within the target time period to obtain the average CPU utilization of the third target application identifier within the target time period; Calculate the average of the memory utilization in the container memory monitoring information within the target time period to obtain the average memory utilization of the third target application identifier within the target time period; Obtain the target GPU card identifier corresponding to the target container identifier from the container monitoring information, and obtain the GPU utilization rate corresponding to the target GPU card identifier within the target time period from the host GPU monitoring information in the host monitoring information; Calculate the average of the GPU utilization rates of the target GPU cards within the target time period to obtain the average GPU utilization of the third target application identifier within the target time period.
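
By way of non-limiting illustration only, the tag-to-container-to-GPU chain of claim 12 could be sketched in Python as follows. All dictionary keys (`tag`, `container_id`, `gpu_ids`, `cpu_util`, and so on) are hypothetical names for the fields of claims 3 and 4. Two points are assumptions of this sketch rather than statements of the claim: the target container tag is taken to equal the application deployment tag, and the GPU average is computed as the mean of per-card means.

```python
from statistics import mean


def average_utilization(deploy_tag, container_info, container_cpu,
                        container_mem, host_gpu, start, end):
    """Return average CPU, memory and GPU utilization for one deployment tag.

    Every argument except deploy_tag, start and end is a list of dicts with
    hypothetical keys; the window is the half-open interval [start, end).
    """
    def in_window(record):
        return start <= record["time"] < end

    # Target containers: container tag equals the application deployment tag.
    ids = {r["container_id"] for r in container_cpu if r["tag"] == deploy_tag}

    cpu = [r["cpu_util"] for r in container_cpu
           if r["container_id"] in ids and in_window(r)]
    mem = [r["mem_util"] for r in container_mem
           if r["container_id"] in ids and in_window(r)]

    # GPU cards used by the target containers, keyed by (host IP, GPU identifier).
    cards = {(c["host_ip"], g) for c in container_info
             if c["container_id"] in ids for g in c["gpu_ids"]}

    # Collect the in-window host GPU samples of each target card.
    per_card = {}
    for r in host_gpu:
        key = (r["host_ip"], r["gpu_id"])
        if key in cards and in_window(r):
            per_card.setdefault(key, []).append(r["gpu_util"])

    def avg(values):
        return mean(values) if values else None

    return {
        "cpu": avg(cpu),
        "memory": avg(mem),
        "gpu": avg([mean(v) for v in per_card.values()]),
    }
```

Returning `None` when no sample falls in the window keeps "no data" distinct from "0% utilization".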

13. The method according to claim 12, characterized in that, After determining the average resource utilization of the third target application identifier during the target time period based on the business application traffic data of the third target application identifier within the target time period, the container monitoring information, and the host GPU monitoring information in the host monitoring information, the method further includes: If the average CPU utilization is less than or equal to the first CPU utilization threshold, the average memory utilization is less than or equal to the first memory utilization threshold, and the average GPU utilization is less than or equal to the first GPU utilization threshold, a first prompt message for performing a scale-down operation is sent to the application manager corresponding to the third target application identifier; If the average CPU utilization is greater than the second CPU utilization threshold, the average memory utilization is greater than the second memory utilization threshold, and the average GPU utilization is greater than the second GPU utilization threshold, a second prompt message for performing a scale-up operation is sent to the application manager corresponding to the third target application identifier.
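
By way of non-limiting illustration only, the two-threshold decision of claim 13 could be sketched in Python as follows. The `Thresholds` class, the function name and the returned strings are hypothetical, and the actual sending of the prompt message to the application manager is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    # First (lower) thresholds: at or below all three suggests scaling down.
    low_cpu: float
    low_mem: float
    low_gpu: float
    # Second (upper) thresholds: above all three suggests scaling up.
    high_cpu: float
    high_mem: float
    high_gpu: float


def scaling_hint(cpu: float, mem: float, gpu: float, t: Thresholds) -> Optional[str]:
    """Return "scale_down", "scale_up" or None for the given average utilizations."""
    # All three averages must be at or below their first thresholds.
    if cpu <= t.low_cpu and mem <= t.low_mem and gpu <= t.low_gpu:
        return "scale_down"
    # All three averages must be strictly above their second thresholds.
    if cpu > t.high_cpu and mem > t.high_mem and gpu > t.high_gpu:
        return "scale_up"
    # Mixed signals: no prompt message is issued.
    return None
```

Because both conditions are conjunctions, an application that is GPU-bound but light on CPU produces no prompt under this reading of the claim.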

14. The method according to claim 11, characterized in that, The monitoring results further include: average response time; The step of performing aggregation analysis on at least one of the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring metrics, to obtain the monitoring results of the application monitoring metrics, further includes: Calculate the average of the response times in the target business application traffic data within the target time period to obtain the average response time of the third target application identifier within the target time period.
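
By way of non-limiting illustration only, the traffic statistics of claims 11 and 14 could be sketched in Python as follows. The dictionary keys are hypothetical names for the traffic fields of claim 5. Two readings are assumptions of this sketch: request concurrency is taken as the number of requests whose request time falls in the target time period, and the response time field is taken as a timestamp, so that the per-request latency is the difference between response time and request time.

```python
from datetime import datetime
from typing import Optional, Tuple


def traffic_stats(records, app_name: str, start: datetime,
                  end: datetime) -> Tuple[int, Optional[float]]:
    """Return (request count, average latency in seconds) for one application.

    The target time period is the half-open interval [start, end).
    """
    # Select the application's requests that started inside the window.
    hits = [r for r in records
            if r["app_name"] == app_name and start <= r["request_time"] < end]
    if not hits:
        return 0, None

    # Latency of one request: response timestamp minus request timestamp.
    total = sum((r["response_time"] - r["request_time"]).total_seconds()
                for r in hits)
    return len(hits), total / len(hits)
```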

15. The method according to claim 11, characterized in that, A natural day includes multiple target time periods; After statistically analyzing the business application traffic data of the third target application identifier within the target time period to obtain the request concurrency of the third target application identifier within the target time period, the method further includes: Within each natural day of the statistical period, at least one target time period with request concurrency less than the first concurrency threshold is determined as the first target time period, and at least one target time period with request concurrency greater than or equal to the second concurrency threshold is determined as the second target time period. Based on the first target time period, the idle time period of the third target application identifier within the statistical period is determined, and based on the second target time period, the busy time period of the third target application identifier within the statistical period is determined.
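
By way of non-limiting illustration only, the idle and busy period determination of claim 15 could be sketched in Python as follows. The claim does not state how the per-day first and second target time periods are combined over the statistical period. This sketch assumes one possible rule: a slot counts as idle (or busy) only if it qualifies on every natural day of the period. The function name and the input layout are hypothetical.

```python
def idle_and_busy_slots(daily_concurrency, low, high):
    """Return (idle slots, busy slots) over a statistical period.

    daily_concurrency maps each natural day to {slot: request concurrency},
    where a slot identifies one target time period of the day (for example
    an hour index). low and high are the first and second concurrency thresholds.
    """
    idle_sets, busy_sets = [], []
    for slots in daily_concurrency.values():
        # First target time periods of this day: concurrency below the first threshold.
        idle_sets.append({s for s, c in slots.items() if c < low})
        # Second target time periods of this day: concurrency at or above the second threshold.
        busy_sets.append({s for s, c in slots.items() if c >= high})

    # Assumed combination rule: keep slots that qualify on every day.
    idle = set.intersection(*idle_sets) if idle_sets else set()
    busy = set.intersection(*busy_sets) if busy_sets else set()
    return sorted(idle), sorted(busy)
```

A looser rule, such as keeping slots that qualify on most days, would also fit the claim wording. The intersection is simply the most conservative choice.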

16. A monitoring device for heterogeneous GPU computing power clusters, characterized in that, the device comprises: The data storage module is used to consume static resource information, host monitoring information, container monitoring information, business application instance data and business application traffic data from the message center by subscribing to message topics, and store the static resource information, host monitoring information, container monitoring information, business application instance data and business application traffic data in the storage medium; The static resource information includes host information and cluster information. The host monitoring information and container monitoring information are obtained by the resource acquisition plugins in each host and sent to the message center by the host. The business application instance data is obtained by the computing power scheduling center and sent to the message center. The business application traffic data is obtained by the access gateway of the computing power scheduling center and sent to the message center. The data acquisition module is used to acquire the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data from the storage medium. The aggregation and analysis module is used to perform aggregation and analysis of at least one of the static resource information, host monitoring information, container monitoring information, business application instance data, and business application traffic data according to the application monitoring indicators, and obtain the monitoring results of the application monitoring indicators. The application monitoring metrics are application-dimensional monitoring metrics, and the monitoring results include application cluster information, second node set, second container set, and GPU card usage set. The aggregation and analysis module includes: The application cluster information acquisition unit is used to obtain the cluster information corresponding to the second target cluster identifier from the cluster information based on the second target cluster identifier in the business application instance data corresponding to the first target application identifier, thereby obtaining the application cluster information; The second node set determination unit is used to determine the second node set corresponding to the first target application identifier based on the application deployment host IP list and the second target cluster identifier in the business application instance data. The second container set determination unit is used to determine the second container set corresponding to the first target application identifier based on the application instance IP list and the second target cluster identifier in the business application instance data, as well as the container monitoring information. The GPU card set determination unit is used to determine the GPU card usage set corresponding to the first target application identifier based on the target container monitoring information corresponding to each container identifier in the second container set.
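
By way of non-limiting illustration only, the four units of the aggregation and analysis module in claim 16 could be sketched in Python as a single function. All dictionary keys (`cluster_id`, `host_ips`, `instance_ips`, `container_ip`, `gpu_ids`) are hypothetical names for the fields of claims 4 to 6, and matching containers to application instances by container IP is an assumption of this sketch.

```python
def application_view(app, clusters, container_info):
    """Build the application-dimension monitoring result for one application.

    app is one business application instance record; clusters and
    container_info are lists of dicts with hypothetical keys.
    """
    cluster_id = app["cluster_id"]

    # Application cluster information: the cluster record with the target identifier.
    cluster = next((c for c in clusters if c["cluster_id"] == cluster_id), None)

    # Second node set: deployment hosts listed in the instance data.
    nodes = sorted(set(app["host_ips"]))

    # Second container set: containers in the same cluster whose IP is an instance IP.
    instance_ips = set(app["instance_ips"])
    containers = [c for c in container_info
                  if c["cluster_id"] == cluster_id and c["container_ip"] in instance_ips]

    # GPU card usage set: cards referenced by those containers, as (host IP, GPU id).
    gpus = sorted({(c["host_ip"], g) for c in containers for g in c["gpu_ids"]})

    return {
        "cluster": cluster,
        "nodes": nodes,
        "containers": sorted(c["container_id"] for c in containers),
        "gpus": gpus,
    }
```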

17. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the heterogeneous GPU computing power cluster monitoring method according to any one of claims 1 to 15.

18. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the heterogeneous GPU computing power cluster monitoring method according to any one of claims 1 to 15.