An abnormality positioning method based on a cloud management platform and a cloud management platform
By extracting and evaluating the log characteristics of computing nodes through a cloud management platform, the problem of low efficiency in anomaly location in existing technologies is solved, enabling fast and accurate screening of anomaly nodes and improving the efficiency of anomaly location.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2024-12-31
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, cloud vendors consider only a few factors when locating abnormal computing nodes, resulting in low efficiency in anomaly location. This is especially true when multiple computing nodes are abnormal, requiring them to be checked one by one, making it difficult to quickly and accurately identify the cause of the anomaly.
Logs from computing nodes are obtained through a cloud management platform. Log spatial and temporal features are extracted, and computing nodes are evaluated based on these features. Nodes with evaluation values greater than a threshold are identified as abnormal nodes. Log event templates are used to segment and filter logs, improving the accuracy and efficiency of anomaly localization.
It enables the rapid and accurate screening of abnormal nodes from multiple computing nodes, reducing the need for individual investigation and improving the efficiency of anomaly localization.
Figure CN122309110A_ABST
Description
Technical Field
[0001] This application relates to the field of cloud technology, and in particular to an anomaly location method based on a cloud management platform, and to a cloud management platform.
Background Technology
[0002] With the rapid development of cloud technology, more and more tenants are choosing to complete their job tasks in the cloud. Specifically, tenants can deploy their compute node clusters in the cloud. Multiple compute nodes in this cluster can communicate with each other, so these multiple compute nodes can work together to complete the tenant's job tasks, thereby meeting the tenant's job requirements.
[0003] Cloud providers typically offer tenants large-scale compute node clusters that contain a large number of compute nodes. Because there are so many nodes, if one or more of them malfunction while executing a tenant's job task, the job will fail. In related technologies, to pinpoint which compute node(s) are malfunctioning, cloud providers can compare a node's successful and failed historical operations with its current operation. If the current operation matches a failed historical operation, the malfunctioning compute node can be directly identified.
[0004] In the above process, the cloud provider mainly considers the relationship between the historical and current operations of a single computing node, so the factors considered are relatively limited. Once several computing nodes in the computing node cluster become abnormal, all computing nodes need to be checked one by one, resulting in low efficiency in anomaly localization.
Summary of the Invention
[0005] This application provides an anomaly location method based on a cloud management platform, and a cloud management platform. The method can quickly and accurately filter out all abnormal computing nodes from multiple computing nodes without having to check each computing node one by one, which can improve the efficiency of anomaly location to a certain extent.
[0006] The first aspect of this application provides an anomaly localization method based on a cloud management platform. The cloud management platform implementing this method manages infrastructure that provides cloud services to tenants. This infrastructure includes a large-scale computing node cluster created by the cloud management platform for the tenants. The large-scale computing node cluster may contain multiple computing nodes for executing the tenants' job tasks. The method includes:
[0007] If these multiple compute nodes fail to execute the job task, the cloud management platform can obtain the logs of these compute nodes. For any one of these compute nodes, the logs describe the overall process of that compute node executing the job task.
[0008] After obtaining the logs from these multiple computing nodes, the cloud management platform can extract features from the logs of each computing node to obtain the log spatial features and log temporal features of these multiple computing nodes.
[0009] After obtaining the log space characteristics and log time characteristics of these multiple computing nodes, the cloud management platform can use these log space characteristics and log time characteristics to evaluate these multiple computing nodes, thereby obtaining the evaluation value of these multiple computing nodes. The evaluation value of these multiple computing nodes is used to indicate the degree of abnormality of these multiple computing nodes in the process of executing the job task.
[0010] After obtaining the evaluation values of these multiple computing nodes, the cloud management platform can identify computing nodes whose evaluation values are greater than or equal to a preset threshold as abnormal computing nodes, thereby completing the anomaly location and taking subsequent actions.
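The final screening step described in [0010] can be sketched as a simple threshold comparison. This is a minimal illustration only: the function name, the node identifiers, and the numeric values are hypothetical, and the application does not specify how the preset threshold is chosen.

```python
def locate_abnormal_nodes(evaluation_values: dict[str, float],
                          threshold: float) -> list[str]:
    """Return the ids of nodes whose evaluation value is >= threshold.

    The ids are sorted only to make the output deterministic.
    """
    return sorted(node for node, value in evaluation_values.items()
                  if value >= threshold)


# Hypothetical evaluation values for three compute nodes.
scores = {"node-a": 0.0, "node-b": 3.5, "node-c": 2.0}
abnormal = locate_abnormal_nodes(scores, threshold=2.0)
print(abnormal)
```

Note that the comparison is inclusive ("greater than or equal to"), so a node whose evaluation value equals the threshold is also reported.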
[0011] As can be seen from the above method, since the logs of these multiple computing nodes are used to record the process of executing the job task, the cloud management platform can extract the log feature information of these multiple computing nodes (including the log space characteristics and log time characteristics of each computing node) from the logs of these multiple computing nodes. Using the log feature information of these multiple computing nodes, the platform can evaluate and identify abnormal computing nodes. It is evident that during the execution of the job task, the cloud management platform not only considers the impact generated by each computing node itself, but also the mutual influence between computing nodes. The factors considered are relatively comprehensive, which can quickly and accurately filter out all abnormal computing nodes from these multiple computing nodes without having to check each computing node one by one, thus improving the efficiency of anomaly localization to a certain extent.
[0012] In one possible implementation, the job task comprises multiple subtasks. The method further includes: the cloud management platform segmenting the logs of the computing nodes based on log event templates to obtain multiple log events for the computing nodes. These multiple log events describe the process of the computing nodes executing multiple subtasks. The cloud management platform also performs feature extraction on the logs of the multiple computing nodes to obtain log spatial features and log temporal features of the multiple computing nodes. This includes: the cloud management platform extracting features from the multiple log events of the computing nodes to obtain log spatial features and log temporal features of the computing nodes. In the aforementioned implementation, since the job task may contain multiple subtasks, in order to understand the process of these multiple computing nodes executing each subtask, the cloud management platform can obtain log event templates. For any one of these multiple computing nodes, the cloud management platform can use these log event templates to decompose the logs of that computing node into multiple log events for that computing node. These multiple log events can be used to record the process of that computing node executing these multiple subtasks. Based on this, the cloud management platform can extract features from the multiple log events of that computing node to obtain the log spatial features and log temporal features of that computing node. The cloud management platform can perform similar operations on the remaining computing nodes, so the cloud management platform can ultimately obtain the log space characteristics and log time characteristics of these multiple computing nodes accurately.
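As an illustration of the segmentation in [0012], the sketch below splits an unstructured log into structured log events using regular-expression templates. The log line layout (a numeric timestamp followed by a message), the template names, and the patterns are all assumptions made for this example; the application does not define the format of a log event template.

```python
import re
from typing import NamedTuple


class LogEvent(NamedTuple):
    event_type: str   # name of the template that matched
    timestamp: float  # time at which the line was logged


# Hypothetical log event templates: event name -> pattern over the message.
TEMPLATES = {
    "LOAD_DATA": re.compile(r"loading shard \d+"),
    "ALLREDUCE": re.compile(r"allreduce step \d+"),
    "TIMEOUT": re.compile(r"timeout waiting for peer \S+"),
}

# Assumed line layout: "<timestamp> <message>".
LINE = re.compile(r"^(?P<ts>\d+(?:\.\d+)?)\s+(?P<msg>.*)$")


def segment_log(raw_log: str) -> list[LogEvent]:
    """Turn a raw log into a list of log events, one per matching line."""
    events = []
    for line in raw_log.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        for name, pattern in TEMPLATES.items():
            if pattern.search(match.group("msg")):
                events.append(LogEvent(name, float(match.group("ts"))))
                break  # first matching template wins
    return events


raw = """\
0.0 loading shard 3
1.5 allreduce step 1
9.0 timeout waiting for peer node-7
garbage line without timestamp
"""
events = segment_log(raw)
print(events)
```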
[0013] In one possible implementation, the infrastructure also includes a target compute node that is not one of the multiple compute nodes, and the target compute node successfully executes the job task. The method further includes: the cloud management platform acquires multiple target log events of the target compute node, which describe the process of the target compute node executing the multiple subtasks; the cloud management platform removes log events matching the multiple target log events from the multiple log events of the compute node, obtaining filtered log events of the compute node. The cloud management platform extracting features from the multiple log events of the compute node to obtain the log space features and log time features of the compute node includes: the cloud management platform extracts features from the filtered log events to obtain the log space features and log time features of the compute node. In the aforementioned implementation, while creating a large-scale compute node cluster for the tenant, the cloud management platform can also create a small-scale compute node cluster for the tenant. The small-scale compute node cluster may contain the target compute node. The cloud management platform can instruct the target compute node to execute the job task. Since the target compute node successfully executes the job task, the cloud management platform can acquire the multiple target log events of the target compute node in a similar manner. After obtaining the log events from each compute node in the large-scale compute node cluster, for any one of these compute nodes, the cloud management platform can remove log events that match the multiple pre-obtained target log events from the node's log events, thus obtaining the filtered log events for that compute node. Based on this, the cloud management platform can extract features from the filtered log events to obtain the log space and log time features of that compute node. The cloud management platform can perform similar operations on the remaining compute nodes, thus ultimately obtaining the log space and log time features of these multiple compute nodes quickly and accurately.
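The filtering in [0013] can be sketched as follows. The example assumes that two log events "match" when they have the same event type; the application does not state the matching criterion, and the event names are hypothetical.

```python
def filter_events(node_events: list[str],
                  target_events: list[str]) -> list[str]:
    """Drop every event whose type also occurs on the successful target node.

    Assumption: events are represented by their type name, and matching
    means equality of type names.
    """
    known_good = set(target_events)
    return [event for event in node_events if event not in known_good]


# Event types seen on the target node, which executed the job successfully.
target = ["LOAD_DATA", "ALLREDUCE", "SAVE_CKPT"]
# Event types seen on one node of the failed large-scale cluster.
node = ["LOAD_DATA", "ALLREDUCE", "TIMEOUT", "ALLREDUCE", "RETRY"]

filtered = filter_events(node, target)
print(filtered)
```

Only the events that never appear in a successful run remain, which is why the later feature extraction works on a much smaller set of events.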
[0014] In one possible implementation, the log space characteristics of a compute node are used to indicate the types and quantities of filtered log events, and the log time characteristics of a compute node are used to indicate the execution time of the filtered log events. In the aforementioned implementation, for any one of these compute nodes, the log space characteristics can be used to indicate the types and quantities of filtered log events for that compute node, and the log time characteristics can be used to indicate the execution time of the filtered log events for that compute node.
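One possible reading of [0014] is sketched below: the log space feature counts how many filtered events of each type occurred, and the log time feature accumulates the execution time per event type. Representing an event as a (type, start, end) tuple and summing durations per type are assumptions of this example, not details given in the application.

```python
from collections import Counter, defaultdict

# Assumed event representation: (event_type, start_time, end_time).
Event = tuple[str, float, float]


def spatial_feature(events: list[Event]) -> dict[str, int]:
    """Types and quantities: number of filtered events per event type."""
    return dict(Counter(event_type for event_type, _, _ in events))


def temporal_feature(events: list[Event]) -> dict[str, float]:
    """Execution time: total duration of filtered events per event type."""
    total: dict[str, float] = defaultdict(float)
    for event_type, start, end in events:
        total[event_type] += end - start
    return dict(total)


events = [
    ("TIMEOUT", 9.0, 39.0),
    ("RETRY", 39.0, 40.5),
    ("TIMEOUT", 41.0, 71.0),
]
spatial = spatial_feature(events)
temporal = temporal_feature(events)
print(spatial, temporal)
```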
[0015] In one possible implementation, the cloud management platform evaluating the multiple computing nodes based on their log space characteristics and log time characteristics to obtain evaluation values for the multiple computing nodes includes: the cloud management platform acquires a first distance between the log space characteristics of the computing node and the log space characteristics of the other computing nodes; if the first distance is less than or equal to a first distance threshold, the cloud management platform sets the first evaluation value of the computing node to a preset value; if the first distance is greater than the first distance threshold, the cloud management platform sets the first evaluation value of the computing node to the first distance; the cloud management platform acquires a second distance between the log time characteristics of the computing node and the log time characteristics of the other computing nodes; if the second distance is less than or equal to a second distance threshold, the cloud management platform sets the second evaluation value of the computing node to the preset value; if the second distance is greater than the second distance threshold, the cloud management platform sets the second evaluation value of the computing node to the second distance; and the cloud management platform obtains the evaluation value of the computing node based on the first and second evaluation values. In the aforementioned implementation, for any one of the multiple computing nodes, the cloud management platform can obtain a first distance between the log space characteristics of that computing node and the log space characteristics of the other computing nodes. If the first distance corresponding to that computing node is less than or equal to the first distance threshold, the cloud management platform sets the first evaluation value of that computing node to a preset value. If the first distance corresponding to that computing node is greater than the first distance threshold, the cloud management platform can use the first distance as the first evaluation value of that computing node. Next, the cloud management platform can also obtain a second distance between the log time characteristics of that computing node and the log time characteristics of the other computing nodes. If the second distance corresponding to that computing node is less than or equal to the second distance threshold, the cloud management platform sets the second evaluation value of that computing node to the aforementioned preset value. If the second distance corresponding to that computing node is greater than the second distance threshold, the cloud management platform can use the second distance as the second evaluation value of that computing node. Then, the cloud management platform can add the first evaluation value and the second evaluation value of that computing node to obtain the evaluation value of that computing node. The cloud management platform can perform similar operations on the remaining computing nodes, thus accurately obtaining the evaluation values for these multiple computing nodes.
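The evaluation procedure in [0015] can be sketched as below. Several details are assumptions made only for illustration: features are encoded as fixed-length numeric vectors, the distance to "the other computing nodes" is taken as the mean Euclidean distance to every other node, and the preset value is 0. The application fixes none of these choices; it only states the thresholding rule and that the two partial values are added.

```python
import math

Features = dict[str, list[float]]  # node id -> feature vector (assumed encoding)


def mean_distance_to_others(features: Features, node: str) -> float:
    """Mean Euclidean distance from one node's vector to all other nodes."""
    others = [vec for name, vec in features.items() if name != node]
    return sum(math.dist(features[node], vec) for vec in others) / len(others)


def partial_value(distance: float, threshold: float, preset: float = 0.0) -> float:
    """Preset value if the distance is within the threshold, else the distance."""
    return preset if distance <= threshold else distance


def evaluate(spatial: Features, temporal: Features,
             first_threshold: float, second_threshold: float) -> dict[str, float]:
    """Evaluation value per node = first evaluation value + second evaluation value."""
    values = {}
    for node in spatial:
        first = partial_value(mean_distance_to_others(spatial, node), first_threshold)
        second = partial_value(mean_distance_to_others(temporal, node), second_threshold)
        values[node] = first + second
    return values


# Hypothetical features: node "c" has a very different event profile.
spatial = {"a": [0.0, 0.0], "b": [0.0, 0.0], "c": [3.0, 4.0]}
temporal = {"a": [1.0], "b": [1.0], "c": [1.0]}

values = evaluate(spatial, temporal, first_threshold=3.0, second_threshold=0.5)
print(values)
```

In this example nodes "a" and "b" each have a mean spatial distance of 2.5, which is within the first threshold, so they receive the preset value, while node "c" keeps its distance of 5.0 as its first evaluation value.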
[0016] In one possible implementation, the log event template is a pre-set template, or the log event template is obtained by the cloud management platform by parsing the logs of multiple computing nodes.
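For the second option in [0016], where templates are obtained by parsing the logs, one simple heuristic is to mask the variable parts of each message and keep the patterns that recur. The sketch below masks only runs of digits and uses a hypothetical `min_count` cutoff; this is a stand-in technique chosen for illustration, since the application does not describe how the parsing is done.

```python
import re
from collections import Counter


def derive_templates(messages: list[str], min_count: int = 2) -> list[str]:
    """Mask digit runs with <*> and keep masked messages seen min_count+ times."""
    masked = [re.sub(r"\d+", "<*>", message) for message in messages]
    counts = Counter(masked)
    return sorted(template for template, count in counts.items()
                  if count >= min_count)


messages = [
    "allreduce step 1",
    "allreduce step 2",
    "loading shard 3",
    "loading shard 14",
    "unexpected crash",
]
templates = derive_templates(messages)
print(templates)
```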
[0017] In one possible implementation, the multiple compute nodes may include any of the following: physical servers, virtual machines, containers, micro virtual machines (microVMs), or bare metal servers.
[0018] In one possible implementation, multiple compute nodes are located at the same site or different sites, where a site includes any of the following: a region, an availability zone, a data center, a server room, and a rack.
[0019] A second aspect of this application provides a cloud management platform for managing infrastructure, which includes multiple computing nodes for executing tenant jobs. The cloud management platform includes: an acquisition module for acquiring logs of multiple computing nodes if the multiple computing nodes fail to execute job tasks, wherein the log of any one of the multiple computing nodes describes the process of the computing node executing job tasks; an extraction module for extracting features from the logs of the multiple computing nodes to obtain log spatial features and log temporal features of the multiple computing nodes; an evaluation module for evaluating the multiple computing nodes based on the log spatial features and log temporal features of the multiple computing nodes to obtain evaluation values of the multiple computing nodes, wherein the evaluation values of the computing nodes indicate the degree of abnormality of the computing nodes; and a location module for identifying computing nodes whose evaluation values are greater than or equal to a preset threshold as abnormal computing nodes.
[0020] In one possible implementation, the job task includes multiple subtasks, and the cloud management platform also includes: a segmentation module, used to segment the logs of the computing node based on log event templates to obtain multiple log events of the computing node, which are used to describe the process of the computing node executing multiple subtasks; and an extraction module, used to extract features from the multiple log events of the computing node to obtain the log spatial features and log temporal features of the computing node.
[0021] In one possible implementation, the infrastructure includes a target compute node located outside of multiple compute nodes. The target compute node successfully executes a job task. The cloud management platform also includes: a filtering module, used to: acquire multiple target log events of the target compute node, which describe the process of the target compute node executing multiple subtasks; remove log events that match the multiple target log events from the multiple log events of the compute node to obtain the filtered log events of the compute node; and an extraction module, used to extract features from the filtered log events to obtain the log spatial features and log temporal features of the compute node.
[0022] In one possible implementation, the log space characteristics of the compute node are used to indicate the type and number of filtered log events, and the log time characteristics of the compute node are used to indicate the execution time of the filtered log events.
[0023] In one possible implementation, the evaluation module is configured to: obtain a first distance between the log space characteristics of the compute node and the log space characteristics of the other compute nodes; if the first distance is less than or equal to a first distance threshold, set the first evaluation value of the compute node to a preset value; if the first distance is greater than the first distance threshold, set the first evaluation value of the compute node to the first distance; obtain a second distance between the log time characteristics of the compute node and the log time characteristics of the other compute nodes; if the second distance is less than or equal to a second distance threshold, set the second evaluation value of the compute node to the preset value; if the second distance is greater than the second distance threshold, set the second evaluation value of the compute node to the second distance; and obtain an evaluation value of the compute node based on the first evaluation value and the second evaluation value.
[0024] In one possible implementation, the log event template is a pre-set template, or the log event template is obtained by the cloud management platform by parsing the logs of multiple computing nodes.
[0025] In one possible implementation, the multiple compute nodes may include any of the following: physical servers, virtual machines, containers, micro virtual machines (microVMs), or bare metal servers.
[0026] In one possible implementation, multiple compute nodes are located at the same site or different sites, where a site includes any of the following: a region, an availability zone, a data center, a server room, and a rack.
[0027] A third aspect of this application provides a computing device cluster, which includes at least one computing device, each computing device including a processor and a memory: the memory is used to store instructions; the processor is used to cause the computing device cluster to perform the method described in the first aspect or any possible implementation of the first aspect according to the instructions.
[0028] A fourth aspect of this application provides a computer storage medium storing one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the method described in the first aspect or any possible implementation of the first aspect.
[0029] A fifth aspect of this application provides a computer program product storing instructions that, when executed by a computer, cause the computer to perform the method described in the first aspect or any possible implementation of the first aspect.
[0030] In this embodiment, the cloud management platform can create a large-scale computing node cluster for tenants. This cluster can contain multiple computing nodes that can execute tenant job tasks. If these multiple computing nodes fail to execute the job task, the cloud management platform can first obtain the logs generated by these multiple computing nodes during the execution of the job task, and generate log space characteristics and log time characteristics of these multiple computing nodes based on the logs. Then, the cloud management platform can evaluate these multiple computing nodes based on these log space characteristics and log time characteristics to obtain evaluation values for these multiple computing nodes. Subsequently, the cloud management platform can identify computing nodes with evaluation values greater than or equal to a preset threshold as abnormal computing nodes, thereby achieving anomaly localization. In the aforementioned process, since the logs of these multiple computing nodes are used to record the execution of the job task, the cloud management platform can extract the log feature information of these multiple computing nodes (including the log space characteristics and log time characteristics of each computing node) from the logs of these multiple computing nodes. Using the log feature information of these multiple computing nodes, the platform can evaluate and identify abnormal computing nodes. It can be seen that in the process of executing the job task, the cloud management platform not only considers the impact generated by each computing node itself, but also the mutual impact between computing nodes. The factors considered are relatively comprehensive, which can quickly and accurately filter out all abnormal computing nodes from these multiple computing nodes without having to check each computing node one by one, thus improving the efficiency of anomaly location to a certain extent.
Attached Figure Description
[0031] Figure 1 is a schematic diagram of the structure of the cloud service system provided in the embodiments of this application;
[0032] Figure 2 is a flowchart illustrating an anomaly location method based on a cloud management platform provided in an embodiment of this application;
[0033] Figure 3 is a schematic diagram illustrating an application example of the anomaly location method based on a cloud management platform provided in this application embodiment;
[0034] Figure 4 is a schematic diagram of the structure of the cloud management platform provided in the embodiments of this application;
[0035] Figure 5 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;
[0036] Figure 6 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of this application;
[0037] Figure 7 is a schematic diagram illustrating the network connection of computer devices in a computer cluster provided in an embodiment of this application.
Detailed Implementation
[0038] This application provides an anomaly location method based on a cloud management platform, and a cloud management platform. The method can quickly and accurately filter out all abnormal computing nodes from multiple computing nodes without having to check each computing node one by one, which can improve the efficiency of anomaly location to a certain extent.
[0039] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to those processes, methods, products, or apparatuses.
[0040] With the rapid development of cloud technology, more and more tenants are choosing to complete their job tasks in the cloud. Specifically, tenants can deploy their compute node clusters in the cloud. Multiple compute nodes in this cluster can communicate with each other, so these multiple compute nodes can work together to complete the tenant's job tasks, thereby meeting the tenant's job requirements.
[0041] To efficiently complete tenant jobs, cloud providers offer large-scale compute node clusters, which typically contain a large number of compute nodes. Due to the large number of these nodes, if one or more compute nodes malfunction while executing tenant jobs, the jobs will fail. In related technologies, to pinpoint which compute node(s) are malfunctioning, for any given compute node, the cloud provider can compare its past successful and failed operations with its current operation. If the current operation matches a successful past operation, the compute node is considered normal; if it matches a failed past operation, the compute node is considered malfunctioning, and similar handling measures can be taken to resolve the issue.
[0042] In the above process, cloud vendors mainly consider the impact between the historical and current operations of a certain computing node. The factors considered are relatively simple. Once several computing nodes in the computing node cluster become abnormal, all computing nodes need to be checked one by one, resulting in low efficiency in anomaly localization.
[0043] Furthermore, during the execution of tenant jobs, communication often occurs between compute nodes. If one compute node malfunctions, it can cause other compute nodes that are communicating with it to also malfunction. Therefore, checking multiple compute nodes one by one often fails to identify the compute node that is the source of the malfunction and cannot accurately determine the cause of the malfunction.
[0044] To address the aforementioned issues, this application provides an anomaly localization method based on a cloud management platform. This method can be implemented through a cloud service system. Figure 1 is a schematic diagram of the structure of the cloud service system provided in the embodiments of this application. As shown in Figure 1, the cloud service system includes the infrastructure that provides cloud services and a cloud management platform that manages this infrastructure. The cloud management platform and the infrastructure are described separately below:
[0045] A cloud management platform can centrally manage the infrastructure of the entire cloud service system (for example, within the infrastructure, it can create a large-scale computing node cluster to serve a tenant, as instructed by that tenant. This cluster can contain multiple computing nodes that execute the tenant's job tasks, thus meeting the tenant's job requirements). The cloud management platform can also be accessible to tenants outside the cloud service system and respond to their requests. For example, it can provide various interfaces, such as login and processing interfaces, for tenant clients (e.g., the terminal device used by the tenant or the browser on that device). Specifically, the cloud management platform can authenticate a tenant's client through the login interface, allowing the client to log in after successful authentication. For example, the cloud management platform can also allow the tenant's client to send processing requests for its job tasks to the cloud management platform through a processing interface. Since the processing request for the job task includes the performance requirements of the job task, the cloud management platform can create a large-scale computing node cluster in the infrastructure that meets the performance requirements of the job task. This cluster can contain multiple computing nodes (for executing the tenant's job tasks), and these multiple computing nodes can communicate with each other, thus enabling them to collaboratively execute the tenant's job tasks. If these multiple computing nodes fail to execute the job task, the cloud management platform can first obtain the logs generated by these multiple computing nodes in executing the job task, and generate log spatial characteristics and log temporal characteristics for these multiple computing nodes based on these logs. 
Then, the cloud management platform can evaluate these multiple computing nodes based on these log spatial and temporal characteristics to obtain evaluation values for these multiple computing nodes. Subsequently, the cloud management platform can identify computing nodes with evaluation values greater than or equal to a preset threshold as abnormal computing nodes, thereby achieving anomaly localization.
[0046] The infrastructure comprises a large-scale cluster of compute nodes providing cloud services to the tenant. This cluster can contain multiple compute nodes that run the tenant's applications. These applications can execute the tenant's job tasks to meet the tenant's job requirements. It should be noted that the specifications of these compute nodes all meet the performance requirements of the job task (e.g., the specifications of the computing resources, storage resources, and network resources required to execute the job task), and these compute nodes have a certain communication relationship (this communication relationship can be specified by the tenant (included in the job task information) or determined by the cloud management platform). In other words, these compute nodes can communicate to collaboratively complete the job task.
[0047] It should be noted that, for any one of these computing nodes, the log of that computing node records the overall process of the node executing the job task (the log is typically unstructured data). Since the job task may contain multiple subtasks, and in order to understand in detail how the computing node executes each subtask, the cloud management platform can break down the computing node's log into multiple log events (these log events are typically structured data) in a certain way. Each of the multiple log events of this computing node records the process of the computing node executing one specific subtask. Therefore, taken together, these multiple log events record the process of the computing node executing the multiple subtasks.
[0048] It should also be noted that for any one of these computing nodes, since the number of log events on that node is usually large, the cloud management platform can filter these log events in a certain way to remove some unnecessary log events, leaving only the necessary ones. These necessary log events constitute the filtered (several) log events for that computing node. Understandably, the number of filtered log events for that computing node is less than the number of log events before filtering. Furthermore, the cloud management platform can perform similar operations on the log events of the other computing nodes, which will not be elaborated upon here.
[0049] It should also be noted that, for any one of these computing nodes, the cloud management platform can further process the filtered log events of that computing node to obtain its log space characteristics and log time characteristics. The log space characteristics indicate the types and quantities of the filtered log events, while the log time characteristics indicate the order of the filtered log events, i.e., the chronological order of their execution. (Of course, the cloud management platform can also process the unfiltered log events of the computing node, i.e., the multiple log events of that computing node, to obtain its log space characteristics and log time characteristics. In this case, the log space characteristics indicate the types and quantities of the multiple log events, and the log time characteristics indicate the order of the multiple log events.)
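Paragraph [0049] describes the log time characteristics as the chronological order of the filtered log events. A minimal sketch of that reading is shown below, assuming each event is a (type, timestamp) pair; the pair representation and the event names are hypothetical.

```python
def order_feature(events: list[tuple[str, float]]) -> list[str]:
    """Return event types in chronological order of their timestamps."""
    return [event_type for event_type, _ in sorted(events, key=lambda e: e[1])]


# Filtered events of one node, not necessarily stored in time order.
events = [("RETRY", 39.0), ("TIMEOUT", 9.0), ("ABORT", 41.0)]
ordering = order_feature(events)
print(ordering)
```

Two nodes with the same event counts can still differ in this ordering, which is one reason to keep a temporal feature alongside the spatial one.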
[0050] Furthermore, for the multiple computing nodes serving the tenant, these computing nodes can be presented in various forms. For example, these computing nodes can be physical servers (containing certain specifications of computing resources, storage resources, and network resources, etc.) selected by the cloud management platform in the infrastructure. Alternatively, they can be bare metal servers (containing certain specifications of computing resources, storage resources, and network resources, etc.) selected by the cloud management platform in the infrastructure. Furthermore, these computing nodes can be virtual machines (VMs) created by the cloud management platform on physical servers or bare metal servers using virtualization technology. They can also be containers (Docker) created by the cloud management platform on physical servers or bare metal servers using virtualization technology. Finally, they can be microVMs created by the cloud management platform on physical servers or bare metal servers using virtualization technology, and so on.
[0051] Furthermore, for multiple computing nodes serving a tenant, these computing nodes can be deployed in the same site or different sites. Sites can be presented in various forms, such as a region in the infrastructure, an availability zone in the infrastructure, a data center (DC) in the infrastructure, a room in the infrastructure, or a rack (also known as a physical server group) in the infrastructure, etc.
[0052] Based on the aforementioned cloud service system, the cloud management platform can create a large-scale compute node cluster for a tenant. The cluster can contain multiple compute nodes that execute the tenant's job task. If these compute nodes fail to execute the job task, the cloud management platform can first obtain the logs generated by these compute nodes during execution of the task and generate log space characteristics and log time characteristics for each compute node based on these logs. Then, the cloud management platform can evaluate these compute nodes based on these log space and time characteristics to obtain their evaluation values. Subsequently, the cloud management platform can identify compute nodes with evaluation values greater than or equal to a preset threshold as abnormal compute nodes, thereby achieving anomaly localization. In the aforementioned process, since the logs of these multiple computing nodes record the execution of the job task, the cloud management platform can extract log feature information (including the log space characteristics and log time characteristics of each computing node) from these logs and use it to assess and identify abnormal computing nodes. Therefore, the cloud management platform considers not only the impact of each computing node itself during the execution of the job task but also the mutual influence between computing nodes, so the factors considered are relatively comprehensive. This allows all abnormal computing nodes to be screened out quickly and accurately without checking each node individually, thus improving the efficiency of anomaly localization. To further explain the working process of the cloud management platform, the workflow is described in more detail below with reference to Figure 2.
Figure 2 is a flowchart of the anomaly localization method based on a cloud management platform provided in this embodiment of the application. As shown in Figure 2, this method can be implemented by the cloud service system shown in Figure 1. The cloud service system includes infrastructure that provides cloud services to tenants and a cloud management platform that manages this infrastructure. This infrastructure includes a large-scale computing node cluster created by the cloud management platform for the tenants. The large-scale computing node cluster contains multiple computing nodes that can be used to execute the tenants' job tasks (e.g., model training tasks, etc.).
[0053] The method includes:
[0054] 201. If multiple compute nodes fail to execute job tasks, the cloud management platform obtains logs from multiple compute nodes. The logs of any one of the compute nodes are used to describe the process of the compute node executing the job task.
[0055] In this embodiment, multiple compute nodes in a large-scale compute node cluster serving a tenant can execute the tenant's job tasks. The cloud management platform can monitor the execution of these job tasks by these compute nodes in real time to determine whether the job tasks have been successfully executed. If the execution of the job tasks by these compute nodes fails, the cloud management platform needs to perform anomaly localization on these compute nodes to determine which part of the compute nodes is experiencing anomalies, thereby resolving these anomalies in a timely manner. To perform anomaly localization on these compute nodes, the cloud management platform can first obtain the logs of these compute nodes. Specifically, for any one of these compute nodes, the logs describe the overall process of that compute node executing the job task.
[0056] For example, as shown in Figure 3 (Figure 3 is a schematic diagram of an application example of the anomaly location method based on the cloud management platform provided in the embodiments of this application), when compute nodes 1 to 100 in a tenant's large-scale compute node cluster fail to execute a job task, the cloud management platform can obtain the logs generated by compute nodes 1 to 100 in executing the job task. These logs can also be understood as failure logs.
[0057] Specifically, the cloud management platform can also preprocess the logs from these multiple computing nodes:
[0058] Since this job task may contain multiple subtasks, in order to understand the process of these multiple computing nodes executing each subtask, the cloud management platform can obtain log event templates. These log event templates can be either preset templates or templates parsed by the cloud management platform from the logs of the multiple computing nodes. (For example, when the cloud management platform finds that several parts of the logs share a large amount of duplicate content and differ only in some parameters, it can use this duplicate content as one log event template. In the same way, the cloud management platform can obtain multiple log event templates from these logs.)
[0059] Then, for any one of these computing nodes, the cloud management platform can use these log event templates to break down the node's logs into multiple log events. Each of these log events can record the execution (entire or partial) of a subtask within the job task by the computing node. The same applies to the remaining log events; therefore, these multiple log events can be used to record the execution of the multiple subtasks included in the job task. The cloud management platform can perform similar operations on the logs of the other computing nodes, thus also obtaining multiple log events from the other computing nodes, which will not be elaborated further here.
[0060] Continuing with the example above, after obtaining the logs from compute nodes 1 to 100, assuming the job task contains 200 subtasks, these logs will record the overall process of these 100 compute nodes executing the job task (wherein, the content of each of these 200 subtasks is usually different). Therefore, the cloud management platform can traverse these logs and extract 50 log event templates. (Generally, these 100 compute nodes will execute these 200 subtasks. That is to say, when these 100 compute nodes execute these 200 subtasks, the operations performed by different compute nodes are similar, and these operations will be recorded in the logs. Assuming that there are 50 types of operations involved in the process of these 100 compute nodes executing these 200 subtasks, each type is a log event template. However, each compute node can execute each type of operation at least once. Assuming that each compute node executes these 50 types of operations a total of 300 times, each execution of an operation is a log event.) Then, the cloud management platform can use these 50 log event templates to break down the logs of compute node 1 into 300 log events for compute node 1. Similarly, the cloud management platform can also break down the logs of compute node 2 into 300 log events for compute node 2, ... and break down the logs of compute node 100 into 300 log events for compute node 100.
[0061] Specifically, the 300 log events of compute node 1 are used to indicate the execution of these 200 subtasks by compute node 1, and so on, up to the 300 log events of compute node 100, which are used to indicate the execution of these 200 subtasks by compute node 100. Furthermore, each of the 300 log events of compute node 1 corresponds one-to-one with one of the 300 operations performed by compute node 1: if an operation succeeds, the corresponding log event is a success log event; if an operation fails, the corresponding log event is a failure log event. The same applies to the remaining compute nodes, which will not be elaborated upon here.
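The template-based segmentation described above can be illustrated with a minimal sketch. It assumes that each log event template is expressed as a regular expression whose fixed text is the duplicate content and whose wildcards stand for the varying parameters. The template texts, the `LogEvent` type, and the `split_log` function are hypothetical names introduced only for illustration; the application does not specify a template format.

```python
import re
from typing import NamedTuple


class LogEvent(NamedTuple):
    template_id: int  # index of the matching template (the operation type)
    line: str         # raw log line that produced this event
    failed: bool      # True if the line records a failed operation


# Hypothetical templates: fixed text plus wildcards for the parameters.
TEMPLATES = [
    re.compile(r"load shard \d+ (ok|failed)"),
    re.compile(r"allreduce step \d+ (ok|failed)"),
]


def split_log(raw_log: str) -> list[LogEvent]:
    """Break an unstructured log into structured log events.

    Lines that match no template are skipped in this sketch.
    """
    events = []
    for line in raw_log.splitlines():
        line = line.strip()
        for template_id, pattern in enumerate(TEMPLATES):
            if pattern.fullmatch(line):
                events.append(
                    LogEvent(template_id, line, line.endswith("failed"))
                )
                break
    return events


sample = """load shard 3 ok
allreduce step 1 ok
allreduce step 2 failed
"""
events = split_log(sample)
```

In this sketch the two templates play the role of the 50 operation types in the example, and each matched line plays the role of one log event, marked as a success or failure log event.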
[0062] More specifically, the cloud management platform can also filter multiple log events from each compute node:
[0063] While creating large-scale compute node clusters for tenants, the cloud management platform can also create an additional small-scale compute node cluster for them. It's important to note that the small-scale compute node cluster contains fewer compute nodes than the large-scale cluster, and the compute nodes in these two clusters are different. For clarity, the compute nodes in the small-scale cluster will be referred to as the target compute nodes below.
[0064] After creating the small-scale compute node cluster, the cloud management platform can have all target compute nodes in the cluster pre-execute the tenant's job task. Since the target compute nodes execute the job task successfully, the cloud management platform can obtain the logs generated by the target compute nodes during execution of the job task. It then uses the log event templates to break down these logs into multiple target log events for the target compute nodes. These target log events can be used to indicate the execution, by the target compute nodes, of the multiple subtasks within the job task. It should be noted that the process of obtaining the log event templates and breaking the logs down into target log events is similar to the process described above for the large-scale compute node cluster and will not be repeated here.
[0065] After creating a large-scale compute node cluster, the cloud management platform can then have multiple compute nodes within the cluster execute the job task. If these compute nodes fail to execute the job, the cloud management platform can obtain the logs generated by these nodes and use log event templates to break them down into multiple log events for each compute node. Next, for any one of these compute nodes, the cloud management platform can remove log events that match pre-defined target log events from that node's multiple log events, thus obtaining the filtered log events for that node. Similar operations can be performed on the remaining compute nodes, ultimately resulting in filtered log events for all compute nodes in the large-scale compute node cluster.
[0066] As in the example above, the cloud management platform can also create a small-scale compute node cluster for the tenant. This cluster contains compute nodes 101 to 110, and these 10 compute nodes can pre-execute the tenant's job tasks. After the job task is successfully executed, the cloud management platform can obtain the logs generated by these 10 compute nodes in executing the job task. These logs can also be understood as success logs.
[0067] Similarly, the cloud management platform can use log event templates to break down the logs of compute node 101 into 150 log events for compute node 101, and so on, up to breaking down the logs of compute node 110 into 150 log events for compute node 110. It should be noted that the job task executed by compute nodes 101 to 110 is the same as that executed by compute nodes 1 to 100 and also involves the 200 subtasks of this job task. However, because there are fewer compute nodes, these 10 compute nodes may involve only 30 types of operations when executing these 200 subtasks, and each node may need to perform operations only 150 times in total. Therefore, each of these 10 compute nodes has only 150 log events, and all of these log events are success log events.
[0068] Based on this, if 100 of the 300 log events in compute node 1 match 10 log events in compute node 101, ..., and 10 log events in compute node 110 (e.g., they are the same), the cloud management platform can remove these 100 log events from the 300 log events in compute node 1, resulting in the remaining (filtered) 200 log events for compute node 1. This process continues until the remaining 200 log events for compute node 1, ..., and the remaining 200 log events for compute node 100 are obtained.
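The filtering step can be sketched as follows. This minimal sketch assumes that a log event "matches" a target log event when the two are identical, which is the case given as an example above. The string encoding of events and the `filter_events` function are hypothetical.

```python
def filter_events(
    node_events: list[str],
    target_events_per_node: list[list[str]],
) -> list[str]:
    """Drop events that also occur in the successful (target) runs.

    Matching is assumed to be exact equality; order is preserved so
    that the log time characteristics can still be derived afterwards.
    """
    seen_in_success: set[str] = set()
    for target_events in target_events_per_node:
        seen_in_success.update(target_events)
    return [e for e in node_events if e not in seen_in_success]


# Hypothetical events encoded as "operation:outcome".
node_1 = ["init:ok", "load:ok", "sync:timeout", "load:ok", "sync:failed"]
targets = [
    ["init:ok", "load:ok"],
    ["init:ok", "load:ok", "sync:ok"],
]
filtered = filter_events(node_1, targets)
```

Here the events that also appear in the success logs are removed, and only the events specific to the failed run remain for feature extraction.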
[0069] 202. The cloud management platform extracts features from the logs of multiple computing nodes to obtain the log spatial features and log temporal features of multiple computing nodes.
[0070] After obtaining the logs from these multiple computing nodes, the cloud management platform can extract features from the logs of these multiple computing nodes to obtain the log spatial features and log temporal features of these multiple computing nodes.
[0071] Specifically, cloud management platforms can extract features in the following ways:
[0072] (1) For any one of these computing nodes, after the cloud management platform breaks down the logs of the computing node into multiple log events of the computing node, the cloud management platform can extract features from these log events to obtain the log space features and log time features of the computing node. The log space features of the computing node are used to indicate the types and quantities of the multiple log events of the computing node, and the log time features of the computing node are used to indicate the execution time of the multiple log events of the computing node.
[0073] Similarly, for the other computing nodes, the cloud management platform can also perform operations on multiple log events of the other computing nodes, so that the log space characteristics and log time characteristics of these multiple computing nodes can be obtained in the end.
[0074] Continuing with the example above, after obtaining 300 log events from compute node 1, ..., and 300 log events from compute node 100, the cloud management platform can extract features from the 300 log events of compute node 1 to obtain the log space feature v1 = (c1, c2, ..., cn) and the log time feature s1 = (e1, e2, ..., e300) of compute node 1. Here, n represents the number of types of log events on compute node 1 (i.e., the number of types of operations performed by compute node 1, for example, n = 50), c1 represents the number of log events of the first type (i.e., the number of times the first type of operation is performed), cn represents the number of log events of the nth type (i.e., the number of times the nth type of operation is performed), e1 represents the log event corresponding to the first operation performed by compute node 1 (hence e1 has the earliest execution time), and e300 represents the log event corresponding to the last operation performed by compute node 1 (hence e300 has the latest execution time).
[0075] Similarly, the cloud management platform can also obtain the log space feature v2 and log time feature s2 of computing node 2, ..., the log space feature v100 and log time feature s100 of computing node 100.
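The construction of v and s can be sketched in a few lines. This minimal sketch assumes that each log event carries a timestamp and an integer type identifier, and that each entry of s is represented by the event's type identifier; these representations and the `extract_features` function are hypothetical.

```python
from collections import Counter


def extract_features(
    events: list[tuple[float, int]], n_types: int
) -> tuple[list[int], list[int]]:
    """Return (v, s) for one node.

    events: (timestamp, type_id) pairs, in any order.
    v: count of events per type, i.e. (c1, ..., cn).
    s: type ids ordered by execution time, i.e. (e1, ..., ek).
    """
    counts = Counter(type_id for _, type_id in events)
    v = [counts.get(t, 0) for t in range(n_types)]
    s = [type_id for _, type_id in sorted(events, key=lambda e: e[0])]
    return v, s


events = [(3.0, 1), (1.0, 0), (2.0, 1), (4.0, 2)]
v, s = extract_features(events, n_types=3)
```

With these four events, v counts one event of type 0, two of type 1, and one of type 2, while s lists the types in order of execution time.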
[0076] (2) For any one of these computing nodes, after the cloud management platform obtains the filtered log events of the computing node, the cloud management platform can extract features from the filtered log events of the computing node to obtain the log space features and log time features of the computing node. The log space features of the computing node are used to indicate the types and quantities of the filtered log events of the computing node, and the log time features of the computing node are used to indicate the execution time of the filtered log events of the computing node.
[0077] Similarly, for the remaining computing nodes, the cloud management platform can also perform operations on the filtered log events of the remaining computing nodes, so that the log space characteristics and log time characteristics of these multiple computing nodes can be obtained in the end.
[0078] Continuing with the example above, after obtaining the remaining 200 log events for compute node 1, ..., and the remaining 200 log events for compute node 100, the cloud management platform can extract features from the remaining 200 log events for compute node 1, thereby obtaining the log space feature v1 = (c1, c2, ..., cm) and the log time feature s1 = (e1, e2, ..., e200) for compute node 1, where m is the number of types of remaining log events for compute node 1, c1 is the number of log events of the first type, cm is the number of log events of the mth type, e1 is the log event corresponding to the first operation executed by compute node 1, and e200 is the log event corresponding to the last operation executed by compute node 1.
[0079] Similarly, the cloud management platform can also obtain the log space feature v2 and log time feature s2 of computing node 2, ..., the log space feature v100 and log time feature s100 of computing node 100.
[0080] 203. The cloud management platform evaluates multiple computing nodes based on the log space characteristics and log time characteristics of multiple computing nodes, and obtains evaluation values for multiple computing nodes. The evaluation values of computing nodes are used to indicate the degree of anomaly of computing nodes.
[0081] After obtaining the log space characteristics and log time characteristics of these multiple computing nodes, the cloud management platform can perform a series of processing steps on these characteristics to evaluate the computing nodes and obtain their evaluation values. For any one of these computing nodes, the evaluation value indicates the degree of anomaly during the execution of the job task.
[0082] Specifically, the cloud management platform can obtain the evaluation values of each computing node in the following ways:
[0083] For any one of these computing nodes, the cloud management platform can use an algorithm (e.g., the Isolation Forest algorithm) to obtain the first distance between the log space characteristics of that computing node and the log space characteristics of the other computing nodes, i.e., the first distance corresponding to that computing node. If the first distance corresponding to that computing node is less than or equal to a first distance threshold, the cloud management platform sets the first evaluation value of that computing node to a preset value (this value is usually much smaller than the first distance threshold; for example, the value can be zero, etc.). If the first distance corresponding to that computing node is greater than the first distance threshold, the cloud management platform can set the first distance corresponding to that computing node as the first evaluation value of that computing node.
[0084] Next, the cloud management platform can also obtain a second distance between the log time characteristics of the compute node and the log time characteristics of other compute nodes through another algorithm (e.g., dynamic time warping distance, etc.), that is, the second distance corresponding to the compute node. If the second distance corresponding to the compute node is less than or equal to the second distance threshold, the cloud management platform sets the second evaluation value of the compute node to the aforementioned value; if the second distance corresponding to the compute node is greater than the second distance threshold, the cloud management platform can set the second distance corresponding to the compute node as the second evaluation value of the compute node.
[0085] Then, the cloud management platform can add the first evaluation value and the second evaluation value corresponding to the computing node to obtain the evaluation value of the computing node, which is the abnormal score of the computing node.
[0086] For the remaining computing nodes, the cloud management platform can perform similar operations, thus ultimately obtaining the evaluation values for these multiple computing nodes. It should be noted that the aforementioned first distance threshold can be a preset distance threshold, or it can be obtained by collecting the first distances corresponding to these multiple computing nodes and then aggregating the first distances corresponding to the most densely packed (closest) computing nodes into a distance range. This distance range represents the normal distance set, and distances outside this range are considered abnormal distances. Therefore, the left endpoint (minimum value) of this distance range can be used as the first distance threshold for determining whether a distance is abnormal, and so on. Similarly, the aforementioned second distance threshold can also be obtained in the same way as the first distance threshold, which will not be elaborated upon here.
[0087] Continuing with the example above, after obtaining the log space feature v1 and log time feature s1 of compute node 1, ..., the log space feature v100 and log time feature s100 of compute node 100, the cloud management platform can calculate the distances between v1 and v2, v1 and v3, ..., and v1 and v100. Assuming the distance between v1 and v2 is greater than distance threshold 1 (the aforementioned first distance threshold), the cloud management platform can set this distance as the evaluation value 1.1 for compute node 1. Next, the cloud management platform can also calculate the distances between s1 and s2, s1 and s3, ..., and s1 and s100. Assuming these distances are all less than distance threshold 2 (the aforementioned second distance threshold), the cloud management platform can set 0 as the evaluation value 2.1 for compute node 1. Then, the cloud management platform can add the evaluation value 1.1 and the evaluation value 2.1 to obtain the final evaluation value for compute node 1.
[0088] Similarly, the cloud management platform can also calculate the evaluation value of computing node 2, ..., and the evaluation value of computing node 100.
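The scoring in steps [0083] to [0085] can be sketched as follows, under several assumptions that are not taken from this application. Euclidean distance is substituted for the Isolation Forest algorithm as the first distance. The second distance is a textbook dynamic time warping distance with a 0/1 mismatch cost. A node's distance to "the other computing nodes" is taken as the mean of its pairwise distances. All function names and thresholds are hypothetical, and at least two nodes are required.

```python
import math


def euclidean(a: list[int], b: list[int]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a: list[int], b: list[int]) -> float:
    """Dynamic time warping distance with a 0/1 mismatch cost."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]


def gated(distance: float, threshold: float, preset: float = 0.0) -> float:
    """Keep the distance only if it exceeds the threshold."""
    return distance if distance > threshold else preset


def evaluate(vs, ss, thr1: float, thr2: float) -> list[float]:
    """Evaluation value per node: gated first distance + gated second."""
    n = len(vs)
    scores = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        d1 = sum(euclidean(vs[i], vs[j]) for j in others) / len(others)
        d2 = sum(dtw(ss[i], ss[j]) for j in others) / len(others)
        scores.append(gated(d1, thr1) + gated(d2, thr2))
    return scores


# Three similar nodes and one outlier (hypothetical data).
vs = [[2, 2], [2, 2], [2, 2], [8, 0]]
ss = [[0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1], [0] * 8]
scores = evaluate(vs, ss, thr1=3.0, thr2=1.0)
```

In this sketch the three similar nodes receive the preset value 0 for both partial evaluation values, while the outlier's two distances exceed their thresholds and are summed into its evaluation value.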
[0089] 204. The cloud management platform identifies computing nodes with evaluation values greater than or equal to a preset threshold as abnormal computing nodes among multiple computing nodes.
[0090] After obtaining the evaluation values of these multiple computing nodes, the cloud management platform can identify computing nodes whose evaluation values are greater than or equal to a preset threshold (the size of this threshold can be set according to actual needs and is not limited here) as abnormal computing nodes. It should be noted that when there are several abnormal computing nodes, the cloud management platform can identify the computing node with the highest evaluation value as the source of the anomaly. In this way, the cloud management platform can complete the anomaly location and take subsequent actions.
[0091] As in the example above, after obtaining the evaluation values of compute node 1, ..., compute node 100, since the evaluation values of compute nodes 1 to 8 are all greater than a certain evaluation threshold, the cloud management platform can regard compute nodes 1 to 8 as abnormal compute nodes. Moreover, the evaluation value of compute node 5 is the largest, so compute node 5 can be regarded as the source of the anomaly.
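Step 204 reduces to a threshold comparison followed by a maximum, as the following minimal sketch shows. The node names, scores, and the `locate` function are hypothetical.

```python
from typing import Optional


def locate(
    scores: dict[str, float], threshold: float
) -> tuple[list[str], Optional[str]]:
    """Return the abnormal nodes and the presumed source of the anomaly.

    A node is abnormal if its evaluation value is greater than or equal
    to the threshold; the source is the abnormal node with the largest
    evaluation value, or None if no node is abnormal.
    """
    abnormal = [node for node, s in scores.items() if s >= threshold]
    source = max(abnormal, key=lambda n: scores[n]) if abnormal else None
    return abnormal, source


scores = {"node-1": 4.2, "node-2": 0.0, "node-5": 9.7, "node-8": 3.1}
abnormal, source = locate(scores, threshold=3.0)
```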
[0092] In this embodiment, the cloud management platform can create a large-scale computing node cluster for tenants. This cluster can contain multiple computing nodes that can execute tenant job tasks. If these multiple computing nodes fail to execute the job task, the cloud management platform can first obtain the logs generated by these multiple computing nodes during the execution of the job task, and generate log space characteristics and log time characteristics of these multiple computing nodes based on the logs. Then, the cloud management platform can evaluate these multiple computing nodes based on these log space characteristics and log time characteristics to obtain evaluation values for these multiple computing nodes. Subsequently, the cloud management platform can identify computing nodes with evaluation values greater than or equal to a preset threshold as abnormal computing nodes, thereby achieving anomaly localization. In the aforementioned process, since the logs of these multiple computing nodes are used to record the execution of the job task, the cloud management platform can extract the log feature information of these multiple computing nodes (including the log space characteristics and log time characteristics of each computing node) from the logs of these multiple computing nodes. Using the log feature information of these multiple computing nodes, the platform can evaluate and identify abnormal computing nodes. It can be seen that in the process of executing the job task, the cloud management platform not only considers the impact generated by each computing node itself, but also the mutual impact between computing nodes. The factors considered are relatively comprehensive, which can quickly and accurately filter out all abnormal computing nodes from these multiple computing nodes without having to check each computing node one by one, thus improving the efficiency of anomaly location to a certain extent.
[0093] Furthermore, in this embodiment of the application, after the cloud management platform determines several abnormal computing nodes based on the evaluation values of each computing node, it can further determine the computing node with the largest evaluation value as the source of the anomaly. This is beneficial for the cloud management platform to accurately detect the cause of the anomaly and take corresponding anomaly handling measures.
[0094] The above is a detailed description of the anomaly location method based on a cloud management platform provided in the embodiments of this application. The cloud management platform provided in the embodiments of this application will be introduced below. Figure 4 is a schematic diagram of the structure of the cloud management platform provided in the embodiments of this application. As shown in Figure 4, the cloud management platform is used to manage the infrastructure, which includes multiple computing nodes for executing tenant job tasks. The cloud management platform includes:
[0095] The acquisition module 401 is used to acquire logs from multiple computing nodes if the multiple computing nodes fail to execute job tasks. The logs of any one of the computing nodes describe the process of that computing node executing the job task. For example, the acquisition module 401 is used to perform step 201 in the embodiment shown in Figure 2.
[0096] The extraction module 402 is used to extract features from the logs of the multiple computing nodes to obtain the log spatial features and log temporal features of the multiple computing nodes. For example, the extraction module 402 is used to perform step 202 in the embodiment shown in Figure 2.
[0097] The evaluation module 403 is used to evaluate the multiple computing nodes based on their log space characteristics and log time characteristics, obtaining evaluation values for each computing node. These evaluation values indicate the degree of anomaly of the computing node. For example, the evaluation module 403 is used to perform step 203 in the embodiment shown in Figure 2.
[0098] The location module 404 is used to identify, among the multiple computing nodes, computing nodes whose evaluation values are greater than or equal to a preset threshold as abnormal computing nodes. For example, the location module 404 is used to perform step 204 in the embodiment shown in Figure 2.
[0099] In one possible implementation, the job task includes multiple subtasks, and the cloud management platform also includes: a segmentation module, used to segment the logs of the computing node based on the log event template to obtain multiple log events of the computing node, which are used to describe the process of the computing node executing multiple subtasks; and an extraction module 402, used to extract features from the multiple log events of the computing node to obtain the log spatial features and log temporal features of the computing node.
[0100] In one possible implementation, the infrastructure includes a target computing node located outside of multiple computing nodes. The target computing node successfully executes a job task. The cloud management platform also includes: a filtering module, used to: acquire multiple target log events of the target computing node, the multiple target log events being used to describe the process of the target computing node executing multiple sub-tasks; among the multiple log events of the computing node, log events matching the multiple target log events are removed to obtain the filtered log events of the computing node; and an extraction module 402, used to extract features from the filtered log events to obtain the log spatial features and log temporal features of the computing node.
[0101] In one possible implementation, the log space characteristics of the compute node are used to indicate the type and number of filtered log events, and the log time characteristics of the compute node are used to indicate the execution time of the filtered log events.
[0102] In one possible implementation, the evaluation module is configured to: obtain a first distance between the log space characteristics of the compute node and the log space characteristics of other compute nodes; if the first distance is less than or equal to a first distance threshold, the cloud management platform sets the first evaluation value of the compute node to a preset value; if the first distance is greater than the first distance threshold, the cloud management platform sets the first distance to the first evaluation value; obtain a second distance between the log time characteristics of the compute node and the log time characteristics of other compute nodes; if the second distance is less than or equal to a second distance threshold, the cloud management platform sets the second evaluation value of the compute node to a preset value; if the second distance is greater than the second distance threshold, the cloud management platform sets the second distance to the second evaluation value; and obtain an evaluation value of the compute node based on the first evaluation value and the second evaluation value.
[0103] In one possible implementation, the log event template is a pre-set template, or the log event template is obtained by the cloud management platform by parsing the logs of multiple computing nodes.
[0104] In one possible implementation, multiple compute nodes may include any of the following: physical servers, virtual machines, containers, microvirtual machines, and bare metal servers.
[0105] In one possible implementation, multiple compute nodes are located at the same site or different sites, where a site includes any of the following: a region, an availability zone, a data center, a server room, and a rack.
[0106] It should be noted that the information interaction and implementation process between the modules / units of the above-mentioned device are based on the same concept as the method embodiments of this application, and the resulting technical effects are the same as those of the method embodiments of this application. For details, please refer to the description in the method embodiments shown above in the embodiments of this application, and will not be repeated here.
[0107] Please refer to Figure 5, which is a schematic diagram of the structure of a computing device provided in an embodiment of this application. As shown in Figure 5, the computing device 500 (which can be used to implement the aforementioned cloud management platform) includes: a processor 501, a memory 502, a communication interface 503, and a bus 504. The processor 501, memory 502, and communication interface 503 are coupled via the bus (not shown in the figure). The memory 502 stores instructions. When the instructions in the memory 502 are executed, the computing device 500 performs the method executed by the cloud management platform in the above method embodiment.
[0108] The computing device 500 may be one or more integrated circuits configured to implement the methods described above, such as: one or more application-specific integrated circuits (ASICs), or one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs), or a combination of at least two of these forms of integrated circuits. Furthermore, when the units in the device are implemented in the form of a program scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling programs. Alternatively, these units may be integrated together and implemented as a system-on-a-chip (SOC).
[0109] Processor 501 can be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. A general-purpose processor can be a microprocessor or any conventional processor.
[0110] Memory 502 can be volatile memory or non-volatile memory, or may include both. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous linked dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).
[0111] The memory 502 stores executable program code, and the processor 501 executes this executable program code to implement the functions of the aforementioned acquisition module, extraction module, evaluation module, and location module, thereby realizing the above-mentioned anomaly location method based on the cloud management platform. That is, the memory 502 stores instructions for executing the above-mentioned anomaly location method based on the cloud management platform.
[0112] The communication interface 503 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computing device 500 and other devices or communication networks.
[0113] In addition to a data bus, the bus 504 can also include a power bus, a control bus, and a status signal bus. The bus can be a Peripheral Component Interconnect Express (PCIe) bus, an Extended Industry Standard Architecture (EISA) bus, a Unified Bus (Ubus or UB), a Compute Express Link (CXL) bus, a Cache Coherent Interconnect for Accelerators (CCIX) bus, etc. The bus can be divided into an address bus, a data bus, and a control bus.
[0114] Please refer to Figure 6, which is a schematic diagram of a computing device cluster provided in an embodiment of this application. As shown in Figure 6, the computing device cluster 600 includes at least one computing device 500.
[0115] As shown in Figure 6, the computing device cluster 600 includes at least one computing device 500. The memory 502 of one or more computing devices 500 in the computing device cluster 600 may store the same instructions for executing the aforementioned anomaly localization method based on the cloud management platform.
[0116] In some possible implementations, the memory 502 of one or more computing devices 500 in the computing device cluster 600 may also store partial instructions for executing the aforementioned anomaly localization method based on the cloud management platform. In other words, a combination of one or more computing devices 500 can jointly execute the aforementioned anomaly localization method based on the cloud management platform.
[0117] It should be noted that the memory 502 in different computing devices 500 within the computing device cluster 600 can store different instructions, each used to execute a portion of the functions of the aforementioned cloud management platform. That is, the instructions stored in the memory 502 of different computing devices 500 can implement the functions of one or more modules, such as the acquisition module, extraction module, evaluation module, and positioning module.
[0118] In some possible implementations, one or more computing devices 500 in the computing device cluster 600 can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc.
[0119] Please refer to Figure 7, which is a schematic diagram illustrating the network connection of computing devices in a computing device cluster provided in an embodiment of this application. As shown in Figure 7, the two computing devices 500A and 500B are connected via a network. Specifically, each computing device is connected to the network through its communication interface.
[0120] In one possible implementation, the memory in computing device 500A stores instructions for performing the functions of modules such as the acquisition module. Meanwhile, the memory in computing device 500B stores instructions for performing the functions of modules such as the extraction module, the evaluation module, and the positioning module.
[0121] It should be understood that the functions of computing device 500A shown in Figure 7 can also be performed by multiple computing devices. Similarly, the functions of computing device 500B can also be performed by multiple computing devices.
[0122] This application also relates to a computer storage medium storing a program for signal processing, which, when run on a computer, causes the computer to perform the steps performed by the cloud management platform in the embodiment shown in Figure 2.
[0123] This application also relates to a computer program product that stores instructions that, when executed by a computer, cause the computer to perform the steps performed by the cloud management platform in the embodiment shown in Figure 2.
[0124] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0125] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between apparatuses or units through some interfaces, and may be electrical, mechanical, or other forms.
[0126] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0127] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0128] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Claims
1. A method for locating an anomaly based on a cloud management platform, characterized in that, The cloud management platform is used to manage infrastructure, which includes multiple computing nodes for executing tenant job tasks. The method includes: If the multiple computing nodes fail to execute the job task, the cloud management platform obtains the logs of the multiple computing nodes, and the logs of any one of the multiple computing nodes are used to describe the process of the computing node executing the job task. The cloud management platform extracts features from the logs of the multiple computing nodes to obtain the log space features and log time features of the multiple computing nodes. The cloud management platform evaluates the multiple computing nodes based on their log space characteristics and log time characteristics, and obtains evaluation values for the multiple computing nodes. These evaluation values are used to indicate the degree of abnormality of the computing nodes. The cloud management platform identifies computing nodes whose evaluation values are greater than or equal to a preset threshold as abnormal computing nodes among the multiple computing nodes.
2. The method according to claim 1, characterized in that, The job task comprises multiple sub-tasks, and the method further includes: The cloud management platform segments the logs of the computing node based on log event templates to obtain multiple log events of the computing node. These multiple log events describe the process of the computing node executing the multiple sub-tasks. The cloud management platform extracts features from the logs of the multiple computing nodes to obtain the log space features and log time features of the multiple computing nodes, including: The cloud management platform extracts features from the multiple log events of the computing node to obtain the log space features and log time features of the computing node.
3. The method according to claim 2, characterized in that, The infrastructure includes a target computing node located outside the plurality of computing nodes, the target computing node successfully executing the job task, and the method further includes: Obtain multiple target log events of the target computing node, wherein the multiple target log events are used to describe the process of the target computing node executing the multiple subtasks; The cloud management platform removes log events that match the multiple target log events from the multiple log events of the computing node, thus obtaining the filtered log events of the computing node. The cloud management platform extracts features from multiple log events of the computing node to obtain the log space features and log time features of the computing node, including: The cloud management platform extracts features from the filtered log events to obtain the log space features and log time features of the computing node.
4. The method according to claim 3, characterized in that, The log space characteristics of the computing node are used to indicate the type and quantity of the filtered log events, and the log time characteristics of the computing node are used to indicate the execution time of the filtered log events.
5. The method according to any one of claims 1 to 4, characterized in that, The cloud management platform evaluates the multiple computing nodes based on their log space characteristics and log time characteristics, obtaining evaluation values for the multiple computing nodes, including: The cloud management platform obtains a first distance between the log space characteristics of the computing node and the log space characteristics of the other computing nodes. If the first distance is less than or equal to a first distance threshold, the cloud management platform sets the first evaluation value of the computing node to a preset value. If the first distance is greater than the first distance threshold, the cloud management platform uses the first distance as the first evaluation value. The cloud management platform obtains a second distance between the log time characteristics of the computing node and the log time characteristics of the other computing nodes. If the second distance is less than or equal to a second distance threshold, the cloud management platform sets the second evaluation value of the computing node to the preset value. If the second distance is greater than the second distance threshold, the cloud management platform uses the second distance as the second evaluation value. The cloud management platform obtains the evaluation value of the computing node based on the first evaluation value and the second evaluation value.
6. The method according to any one of claims 2 to 4, characterized in that, The log event template is a preset template, or the log event template is obtained by the cloud management platform by parsing the logs of the multiple computing nodes.
7. The method according to any one of claims 1 to 6, characterized in that, The plurality of computing nodes includes any of the following: physical servers, virtual machines, containers, microvirtual machines, and bare metal servers.
8. The method according to any one of claims 1 to 7, characterized in that, The multiple computing nodes are located in the same site or different sites, and the site includes any of the following: region, availability zone, data center, server room, and rack.
9. A cloud management platform, characterized in that, The cloud management platform is used to manage infrastructure, which includes multiple computing nodes for executing tenant job tasks. The cloud management platform includes: The acquisition module is used to acquire the logs of the multiple computing nodes if the multiple computing nodes fail to execute the job task. The logs of any one of the multiple computing nodes are used to describe the process of the computing node executing the job task. The extraction module is used to extract features from the logs of the multiple computing nodes to obtain the log spatial features and log temporal features of the multiple computing nodes. An evaluation module is used to evaluate the multiple computing nodes based on their log space characteristics and log time characteristics, and to obtain evaluation values for the multiple computing nodes. The evaluation values of the computing nodes are used to indicate the degree of abnormality of the computing nodes. The positioning module is used to identify computing nodes whose evaluation values are greater than or equal to a preset threshold as abnormal computing nodes among the plurality of computing nodes.
10. The cloud management platform according to claim 9, characterized in that, The job task includes multiple sub-tasks, and the cloud management platform also includes: The segmentation module is used to segment the logs of the computing node based on the log event template to obtain multiple log events of the computing node. The multiple log events of the computing node are used to describe the process of the computing node executing the multiple sub-tasks. The extraction module is used to extract features from multiple log events of the computing node to obtain the log space features and log time features of the computing node.
11. The cloud management platform according to claim 10, characterized in that, The infrastructure includes a target computing node located outside the plurality of computing nodes, the target computing node successfully executing the job task, and the cloud management platform further includes: a filtering module, used for: Obtain multiple target log events of the target computing node, wherein the multiple target log events are used to describe the process of the target computing node executing the multiple subtasks; In the multiple log events of the computing node, log events that match the multiple target log events are removed to obtain the filtered log events of the computing node; The extraction module is used to extract features from the filtered log events to obtain the log space features and log time features of the computing node.
12. The cloud management platform according to claim 11, characterized in that, The log space characteristics of the computing node are used to indicate the type and quantity of the filtered log events, and the log time characteristics of the computing node are used to indicate the execution time of the filtered log events.
13. The cloud management platform according to any one of claims 9 to 12, characterized in that, The evaluation module is used for: Obtaining a first distance between the log space characteristics of the computing node and the log space characteristics of the other computing nodes. If the first distance is less than or equal to a first distance threshold, setting the first evaluation value of the computing node to a preset value. If the first distance is greater than the first distance threshold, using the first distance as the first evaluation value. Obtaining a second distance between the log time characteristics of the computing node and the log time characteristics of the other computing nodes. If the second distance is less than or equal to a second distance threshold, setting the second evaluation value of the computing node to the preset value. If the second distance is greater than the second distance threshold, using the second distance as the second evaluation value. Obtaining the evaluation value of the computing node based on the first evaluation value and the second evaluation value.
14. The cloud management platform according to any one of claims 10 to 12, characterized in that, The log event template is a preset template, or the log event template is obtained by the cloud management platform by parsing the logs of the multiple computing nodes.
15. The cloud management platform according to any one of claims 9 to 14, characterized in that, The plurality of computing nodes includes any of the following: physical servers, virtual machines, containers, microvirtual machines, and bare metal servers.
16. The cloud management platform according to any one of claims 9 to 15, characterized in that, The multiple computing nodes are located in the same site or different sites, and the site includes any of the following: region, availability zone, data center, server room, and rack.
17. A computing device cluster, characterized in that, The computing device cluster includes at least one computing device, each computing device including a processor and memory: The memory is used to store instructions; The processor is configured to, according to the instructions, cause the computing device cluster to perform the method of any one of claims 1 to 8.
18. A computer storage medium, characterized in that, The computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the method of any one of claims 1 to 8.
19. A computer program product, characterized in that, The computer program product stores instructions that, when executed by a computer, cause the computer to perform the method described in any one of claims 1 to 8.