Model training method, fault positioning method, device and storage medium
By using neural network model training methods and utilizing input-output recorded information and actual component failure results, accurate fault location of cloud hard drives is achieved, solving the problems of low efficiency and low accuracy in fault location in existing technologies, and improving the efficiency and accuracy of fault handling.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN TENCENT COMP SYST CO LTD
- Filing Date
- 2022-07-07
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, cloud disk fault location devices can only locate faults to storage server clusters, computing server clusters, infrastructure management servers, or network switches, and cannot accurately locate specific components, resulting in low fault location efficiency and low accuracy.
By acquiring the input/output records and actual component failure results of the storage server cluster, and using neural network model training methods, component failure prediction results are generated, enabling automatic fault location of components in the storage server cluster.
It improves the efficiency and accuracy of fault location, can automatically identify the type and cause of component faults, reduces manual intervention, and improves the accuracy of fault handling.
Figure CN115329840B_ABST
Description
Technical Field
[0001] This application relates to the field of cloud storage technology, and in particular to a model training method, a fault location method, an apparatus, and a storage medium.
Background Technology
[0002] The physical architecture of a cloud server consists of a storage server cluster, a compute server cluster, an infrastructure management server, and network switches. A cloud disk is a disk that a cloud service provider carves out as a logical volume after building a large-scale physical storage system, and then allocates to users for mounting. If a cloud disk fails, users cannot access the cloud server normally; therefore, fault location for cloud disks is crucial.
[0003] Currently, fault location equipment can only pinpoint faults to storage server clusters, compute server clusters, infrastructure management servers, or network switches. In other words, the fault location equipment treats the storage server cluster as a black box. If the fault location equipment determines that the fault occurred in the storage server cluster, further investigation currently relies on manual methods to determine which component within the storage server cluster is experiencing the fault. However, this manual location method suffers from low efficiency and low accuracy in fault location.
Summary of the Invention
[0004] This application provides a model training method, a fault location method, an apparatus, and a storage medium, which can improve the efficiency and accuracy of fault location.
[0005] In a first aspect, a model training method is provided, comprising: acquiring multiple pieces of input / output record information of a storage server cluster and the actual component failure results corresponding to the multiple pieces of input / output record information; for each piece of input / output record information, determining the target storage server in the storage server cluster that generated the input / output record information, acquiring information about each component of the target storage server, acquiring the operation logs and operational data of each component within a preset time period based on the information of each component, generating a target vector based on the operation logs and operational data, and inputting the target vector into a target neural network model to obtain the component failure prediction result corresponding to the input / output record information, wherein the preset time period includes the generation time of the input / output record information; and training the target neural network model based on the actual component failure results and the component failure prediction results corresponding to the multiple pieces of input / output record information.
[0006] In a second aspect, a fault location method is provided, comprising: acquiring target input / output record information of a storage server cluster; identifying the target storage server in the storage server cluster that generated the target input / output record information; acquiring information about each component of the target storage server; acquiring the operation logs and operational data of each component within a preset time period based on the information of each component, the preset time period including the generation time of the target input / output record information; generating a target vector based on the operation logs and operational data; and inputting the target vector into a target neural network model trained by the above model training method to obtain the component failure prediction result corresponding to the target input / output record information.
[0007] In a third aspect, a model training device is provided, comprising: an acquisition module, a processing module, and a training module. The acquisition module is used to acquire multiple pieces of input / output record information of a storage server cluster and the actual component failure results corresponding to the multiple pieces of input / output record information. The processing module is used to, for each piece of input / output record information, determine the target storage server in the storage server cluster that generated the input / output record information, acquire information about each component of the target storage server, acquire the operation logs and operational data of each component within a preset time period based on the information of each component, generate a target vector based on the operation logs and operational data, and input the target vector into a target neural network model to obtain the component failure prediction result corresponding to the input / output record information. The preset time period includes the generation time of the input / output record information. The training module is used to train the target neural network model based on the actual component failure results and the component failure prediction results corresponding to the multiple pieces of input / output record information.
[0008] Optionally, for each component, the operation log includes: the alarm level of each sub-module in the component; the operational data includes the performance metrics of each sub-module in the component over multiple time periods; the processing module is specifically used to: convert the alarm level of each sub-module into a first value corresponding to each sub-module; for each time period, convert the performance metrics of each sub-module in the time period into a second value corresponding to each sub-module in the time period; for any sub-module, obtain a third value corresponding to the sub-module based on the second values corresponding to the sub-module in the multiple time periods; and combine the first value and the third value corresponding to each sub-module to form a target vector.
[0009] Optionally, the processing module is specifically used to: normalize the alarm levels of each sub-module to obtain the first value corresponding to each sub-module.
[0010] Optionally, the processing module is specifically used to: normalize the performance indicators of each sub-module for each time period in the multiple time periods, and obtain the second value corresponding to each sub-module in the time period.
[0011] Optionally, the processing module is specifically used to: calculate the average of the second values corresponding to any one of the sub-modules over multiple time periods to obtain the third value corresponding to the sub-module.
[0012] Optionally, the actual component failure result includes: the actual failure type; the component failure prediction result includes: the predicted failure type.
[0013] Optionally, the actual component failure result also includes: the actual failure cause; the component failure prediction result also includes: the predicted failure cause.
[0014] Optionally, the multiple pieces of input / output record information include at least one piece of input / output hang (IO hang) alarm record information.
[0015] Optionally, the multiple pieces of input / output record information may also include: at least one piece of normal input / output record information.
[0016] Optionally, for any piece of the at least one piece of normal input / output record information, the corresponding actual failure type is a new type of failure.
[0017] In a fourth aspect, a fault location device is provided, comprising: a first acquisition module, a determination module, a second acquisition module, a third acquisition module, a generation module, and an input module. The first acquisition module acquires target input / output record information of a storage server cluster; the determination module determines the target storage server in the storage server cluster that generated the target input / output record information; the second acquisition module acquires information about each component of the target storage server; the third acquisition module acquires the operation logs and operational data of each component within a preset time period based on the information of each component, the preset time period including the generation time of the target input / output record information; the generation module generates a target vector based on the operation logs and operational data; and the input module inputs the target vector into a target neural network model trained by the aforementioned model training method to obtain the component failure prediction result corresponding to the target input / output record information.
[0018] Optionally, the component failure prediction result includes: the predicted failure type; the device also includes: a display module and a push module, wherein if the predicted failure type is a new type of failure, the display module is used to display the component failure prediction result, and the push module is used to push alarm information to notify the operation and maintenance team of the existence of a new type of failure.
[0019] Optionally, for each component, the operation log includes: the alarm level of each sub-module in the component; the operational data includes the performance metrics of each sub-module in the component over multiple time periods; the generation module is specifically used to: convert the alarm level of each sub-module into a first value corresponding to each sub-module; for each time period, convert the performance metrics of each sub-module in the time period into a second value corresponding to each sub-module in the time period; for any sub-module, obtain a third value corresponding to the sub-module based on the second values corresponding to the sub-module in the multiple time periods; and combine the first value and the third value corresponding to each sub-module to form a target vector.
[0020] Optionally, the generation module is specifically used to: normalize the alarm levels of each sub-module to obtain the first value corresponding to each sub-module.
[0021] Optionally, the generation module is specifically used to: normalize the performance indicators of each sub-module for each time period in multiple time periods, and obtain the second value corresponding to each sub-module in each time period.
[0022] Optionally, the generation module is specifically used to: for any one of the sub-modules, calculate the average of the second values corresponding to the sub-module over multiple time periods, and obtain the third value corresponding to the sub-module.
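The target vector construction described above can be illustrated with a minimal Python sketch. The text does not state which normalization is used or how several performance metrics are merged into a single second value, so the following are assumptions made only for illustration: alarm levels are scaled by an assumed maximum level of 3, percentage metrics are scaled by 100 and averaged within each time period, and sub-modules are ordered by name. All function and field names are hypothetical.

```python
from statistics import mean

MAX_ALARM_LEVEL = 3  # assumption: alarm levels range from 0 (normal) to 3


def first_value(alarm_level, max_level=MAX_ALARM_LEVEL):
    """Normalize an alarm level into [0, 1] (assumed scaling by the maximum level)."""
    return alarm_level / max_level


def second_value(metrics):
    """Collapse one time period's percentage metrics into one value in [0, 1].

    Assumption: each metric is divided by 100 and the results are averaged.
    """
    return mean(v / 100.0 for v in metrics.values())


def third_value(periods):
    """Average the second values of one sub-module over all time periods."""
    return mean(second_value(m) for m in periods)


def target_vector(alarm_levels, operational_data):
    """Combine (first value, third value) of every sub-module, in name order."""
    vector = []
    for name in sorted(operational_data):
        # A sub-module absent from the operation log is treated as level 0 (normal).
        vector.append(first_value(alarm_levels.get(name, 0)))
        vector.append(third_value(operational_data[name]))
    return vector


# Hypothetical input: one alarming sub-module and one normal sub-module.
alarm_levels = {"resource_management": 2}
operational_data = {
    "resource_management": [
        {"cpu": 5, "memory": 20, "disk": 8},
        {"cpu": 10, "memory": 30, "disk": 10},
        {"cpu": 12, "memory": 21, "disk": 10},
    ],
    "snapshot": [
        {"cpu": 3, "memory": 9, "disk": 0},
    ],
}
vec = target_vector(alarm_levels, operational_data)
print(vec)
```

Keeping a fixed sub-module order matters in such a sketch, because every position of the vector must mean the same thing for every record fed to the model.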
[0023] In a fifth aspect, an electronic device is provided, comprising: a processor and a memory for storing a computer program, the processor being used to call and run the computer program stored in the memory to perform the methods described in the first aspect, the second aspect, or various implementations thereof.
[0024] In a sixth aspect, a computer-readable storage medium is provided for storing a computer program that causes a computer to perform the methods described in the first aspect, the second aspect, or various implementations thereof.
[0025] In a seventh aspect, a computer program product is provided, including computer program instructions that cause a computer to perform the methods described in the first aspect, the second aspect, or various implementations thereof.
[0026] In an eighth aspect, a computer program is provided that causes a computer to perform the methods described in the first aspect, the second aspect, or various implementations thereof.
[0027] In this embodiment of the invention, the training device can construct training data by combining input / output record information, component information, operation logs, operational data, and actual component failure results to train the neural network model. During the execution phase, the execution device only needs to input the target vector corresponding to the target input / output record information into the target neural network model to obtain the component failure prediction result corresponding to the target input / output record information. This automatic fault location method can improve the efficiency and accuracy of fault location compared to manual location methods.
Brief Description of the Drawings
[0028] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0029] Figure 1 is a flowchart of a fault location method provided by the prior art;
[0030] Figure 2 is a schematic diagram of a system architecture according to an embodiment of the present invention;
[0031] Figure 3 is a system architecture diagram of a server provided in an embodiment of this application;
[0032] Figure 4 is a flowchart of a model training method provided in an embodiment of this application;
[0033] Figure 5 is a schematic diagram of an MLP model provided in an embodiment of this application;
[0034] Figure 6 is a flowchart of a fault location method provided in an embodiment of this application;
[0035] Figure 7 is a schematic diagram of a model training device 700 provided in an embodiment of this application;
[0036] Figure 8 is a schematic diagram of a fault location device 800 provided in an embodiment of this application;
[0037] Figure 9 is a schematic block diagram of the electronic device 900 provided in an embodiment of this application.
Detailed Implementation
[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0039] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or server that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.
[0040] Figure 1 is a flowchart of a fault location method provided by the prior art. As shown in Figure 1, an electronic device can collect streaming data through pipelines, including real-time and non-real-time data, and store them in a database in tabular form. Real-time data can include virtual hard disk (VHD) (i.e., cloud disk) failure signals, while non-real-time data includes virtual machine (VM) information, user information, network topology, etc. Furthermore, the electronic device can construct a directed acyclic graph of all paths from VM to VHD using the non-real-time data in the database within periodic time windows. Assuming that the failure probabilities of the devices on a path (i.e., storage server cluster, compute server cluster, infrastructure management server, and network switch) are independent, a probability equation for each path is constructed by combining the ratio of the number of successful VHD read / write requests to the total number of read / write requests within the current time window. The failure probability of each device is solved using the Lasso regression algorithm, and confidence is verified through hypothesis testing to filter out the devices most likely to fail. Finally, the electronic device can display the view of the faulty device to the corresponding operation and maintenance team, such as the computing cluster operation and maintenance team, network operation and maintenance team, and storage cluster operation and maintenance team, and send alarm information to notify the operation and maintenance team of the device that has triggered the alarm.
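The text does not give the probability equation itself. One common way to write such a model is sketched below, purely as an assumption: $p_d$ denotes the failure probability of device $d$, $P_j$ the set of devices on path $j$, and $s_j$ and $n_j$ the numbers of successful and total read / write requests on path $j$ in the current time window. The symbols $x_d$, $A$, $y$, and $\lambda$ are introduced here for illustration and do not appear in the original.

```latex
% Independence assumption: a request on path j succeeds only if no device on the path fails.
\frac{s_j}{n_j} \approx \prod_{d \in P_j} (1 - p_d)

% Taking logarithms makes the model linear in x_d = -\ln(1 - p_d) \ge 0.
y_j := -\ln \frac{s_j}{n_j} \approx \sum_{d \in P_j} x_d = (A x)_j,
\qquad A_{jd} = \begin{cases} 1 & d \in P_j \\ 0 & \text{otherwise} \end{cases}

% Lasso regression: the L1 penalty favors explanations with few faulty devices.
\hat{x} = \arg\min_{x \ge 0} \; \tfrac{1}{2} \lVert y - A x \rVert_2^2 + \lambda \lVert x \rVert_1,
\qquad \hat{p}_d = 1 - e^{-\hat{x}_d}
```

Under this reading, a device with $\hat{x}_d = 0$ is estimated not to contribute to any failed request, which is what lets the sparse solution single out a small set of suspect devices.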
[0041] Currently, fault location equipment can only pinpoint faults to storage server clusters, compute server clusters, infrastructure management servers, or network switches. In other words, the fault location equipment treats the storage server cluster as a black box. If the fault location equipment determines that the fault occurred in the storage server cluster, further investigation requires manual determination of which component within the cluster is affected. However, this manual fault location method suffers from low efficiency and low accuracy.
[0042] To address the aforementioned technical issues, this application can combine input / output record information, component information, operation logs, operational data, and actual component failure results to construct training data for training a neural network model. This allows the model to obtain the component failure results corresponding to the input / output record information to be predicted during the model execution phase.
[0043] This application mainly relates to cloud storage technology, which is a new concept that is extended and developed from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to as a storage system) refers to a storage system that uses cluster applications, grid technology and distributed storage file systems to bring together a large number of storage devices of various types in the network (storage devices are also called storage nodes) to work together through application software or application interfaces to jointly provide data storage and business access functions to the outside world.
[0044] Currently, the storage method in storage systems is as follows: Logical volumes are created. During creation, physical storage space is allocated to each logical volume. This physical storage space may consist of a single storage device or the disks of several storage devices. Clients store data on a logical volume, which means storing the data on the file system. The file system divides the data into many parts, each part being an object. Each object contains not only the data but also additional information such as a data identifier (ID). The file system writes each object to the physical storage space of that logical volume and records the storage location information of each object. Therefore, when a client requests access to data, the file system can allow the client to access the data based on the storage location information of each object.
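The write and read path described above can be sketched in a few lines of Python. This is a toy model, not the storage system's actual implementation: a byte array stands in for the physical storage space of one logical volume, the fixed object size is an arbitrary choice, and the class and field names are hypothetical.

```python
import uuid


class LogicalVolume:
    """Toy logical volume: a byte array stands in for physical storage space."""

    def __init__(self, object_size=4):
        self.object_size = object_size  # hypothetical fixed object size in bytes
        self.space = bytearray()        # stand-in for the allocated physical space
        self.locations = {}             # data name -> list of (object_id, offset, length)

    def write(self, name, data):
        """Split data into objects, store each one, and record where it went."""
        entries = []
        for start in range(0, len(data), self.object_size):
            part = data[start:start + self.object_size]
            object_id = uuid.uuid4().hex  # data identifier (ID) kept with the object
            entries.append((object_id, len(self.space), len(part)))
            self.space.extend(part)
        self.locations[name] = entries

    def read(self, name):
        """Reassemble the data from the recorded storage locations."""
        return b"".join(
            bytes(self.space[offset:offset + length])
            for _, offset, length in self.locations[name]
        )


volume = LogicalVolume(object_size=4)
volume.write("data_a", b"hello cloud disk")
restored = volume.read("data_a")
print(restored)
```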
[0045] The process by which a storage system allocates physical storage space to a logical volume is as follows: the physical storage space is pre-divided into stripes according to the capacity estimate of the objects to be stored in the logical volume (this estimate often has a large margin relative to the actual capacity of the objects to be stored) and the grouping of the Redundant Array of Independent Disks (RAID). A logical volume can be understood as a stripe, and physical storage space is thereby allocated to the logical volume.
[0046] In some embodiments of the present invention, the system architecture is as shown in Figure 2.
[0047] Figure 2 is a schematic diagram of a system architecture according to an embodiment of the present invention. The system architecture may include: a storage server cluster 201, a training device 202 and an execution device 203. The training device 202 may include: a first data aggregator 204 and a training module 205. The execution device 203 may include: a second data aggregator 206 and a computing module 207.
[0048] The storage server cluster 201 can store multiple input / output record information, the actual results of component failures corresponding to each of the multiple input / output record information, component information of each storage server, and the operation logs and operational data of each component of each storage server.
[0049] The first data aggregator 204 can obtain multiple input / output record information, component information of each storage server, and operation logs and operational data of each component of each storage server from the storage server cluster 201. It can then determine the storage server that generated each input / output record information, determine the component information of that storage server, and further determine the operation logs and operational data of each component of that storage server. Furthermore, the first data aggregator 204 can generate a target vector based on the operation logs and operational data.
[0050] The training module 205 can obtain the target vector corresponding to each input and output record information from the first data aggregator 204, and obtain the actual component failure results corresponding to each input and output record information from the storage server cluster 201. The target vector corresponding to each input and output record information and the actual component failure results are used as a set of training data to train the target neural network model.
[0051] In addition, referring to Figure 2, the storage server cluster 201 can also store the input / output record information to be predicted, the component information of the storage server corresponding to the input / output record information to be predicted, the operation logs and operational data of each component of the storage server, etc.
[0052] The second data aggregator 206 can obtain the input / output record information to be predicted, the component information of the storage server corresponding to the input / output record information to be predicted, the operation logs and operational data of each component of the storage server from the storage server cluster 201, determine the storage server that generated the input / output record information to be predicted, determine the component information of the storage server, and then determine the operation logs and operational data of each component of the storage server. Furthermore, the second data aggregator 206 can generate a target vector based on the operation logs and operational data.
[0053] The computing module 207 can obtain the target vector corresponding to the input / output record information to be predicted from the second data aggregator 206, and input the target vector into the trained target neural network model to obtain the component failure prediction result corresponding to the input / output record information to be predicted.
[0054] It should be noted that Figure 2 is merely a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationships between the devices, components, modules, etc. shown in the figure do not constitute any limitation. In some embodiments, the training device 202 and the execution device 203 may be the same device, for example, both being management devices for managing and controlling storage server clusters. The first data aggregator 204 and the second data aggregator 206 may be the same data aggregator. The training module 205 and the computing module 207 may be the same module.
[0055] Optionally, the components of each storage server may include, but are not limited to, the storage server, distribution layer, indexing layer, caching layer, persistence layer, and hard disk.
[0056] For example, Figure 3 is a system architecture diagram of a server provided in an embodiment of this application. As shown in Figure 3, a storage server cluster can include one master storage server and two slave storage servers. The master storage server can receive write requests and distribute them to the two slave storage servers through a distribution layer, enabling the slave storage servers to back up the data to be written. An index layer can generate an index for the data to be written. A caching layer can cache the write request input / output record. A persistence layer can store the data to be written and its corresponding index to disk. The master storage server can also receive read requests and distribute them to the two slave storage servers through a distribution layer, enabling it to read data from the slave storage servers. A caching layer can cache the read request input / output record. A persistence layer can read the data from disk based on the index of the data to be read. Furthermore, the master storage server can generate operational logs and data for each component within the server, storing these logs and data through either a caching layer or a persistence layer. Similarly, the slave storage servers can also generate their own operational logs and data for each component, storing these logs and data through their own caching layer or persistence layer.
[0057] The control equipment can obtain the operation logs and data of each component from the storage server cluster, such as the main storage server, and perform fault location based on the operation logs and data of each component.
[0058] The technical solutions of the embodiments of the present invention will be described in detail below through some examples. These embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
[0059] Figure 4 is a flowchart of a model training method provided in an embodiment of this application. This method can be executed by any electronic device such as a computer, desktop computer, or laptop computer. For example, the method can be executed by the training device 202 in Figure 2, but is not limited thereto. As shown in Figure 4, the method may include the following steps:
[0060] S410: Obtain multiple input / output record information of the storage server cluster and the actual component failure results corresponding to each of the multiple input / output record information;
[0061] S420: For each input / output record in multiple input / output record information, determine the target storage server in the storage server cluster that generated the input / output record information, obtain the information of each component of the target storage server, obtain the operation logs and operational data of each component within a preset time period based on the information of each component, generate a target vector based on the operation logs and operational data, input the target vector into the target neural network model, and obtain the component fault prediction result corresponding to the input / output record information;
[0062] S430: Train the target neural network model based on the actual component failure results and component failure prediction results corresponding to multiple input and output record information.
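Steps S420 and S430 can be illustrated with a small, self-contained training loop. This is only a sketch under stated assumptions: the text refers to a target neural network model (an MLP is shown in Figure 5) but does not specify layer sizes, activation, loss function, or optimizer, so the one-hidden-layer tanh network, softmax output, cross-entropy loss, and plain stochastic gradient descent below are illustrative choices. The target vectors and fault type labels are made-up data, and all names are hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class TinyMLP:
    """One tanh hidden layer followed by a softmax over fault types (assumed design)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        p = softmax([sum(w * v for w, v in zip(row, h)) + b
                     for row, b in zip(self.w2, self.b2)])
        return h, p

    def train_step(self, x, label, lr=0.1):
        """One SGD step on the cross-entropy loss; returns the loss before the update."""
        h, p = self.forward(x)
        # Gradient of the loss with respect to the output logits.
        dz2 = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
        # Back-propagate to the hidden layer before w2 is modified.
        dh = [sum(dz2[k] * self.w2[k][j] for k in range(len(dz2))) for j in range(len(h))]
        dz1 = [dh[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
        for k, g in enumerate(dz2):
            for j in range(len(h)):
                self.w2[k][j] -= lr * g * h[j]
            self.b2[k] -= lr * g
        for j, g in enumerate(dz1):
            for i in range(len(x)):
                self.w1[j][i] -= lr * g * x[i]
            self.b1[j] -= lr * g
        return -math.log(p[label])


FAULT_TYPES = ["hard disk failure", "distribution layer failure", "new fault type"]

# Hypothetical (target vector, index of actual fault type) training pairs.
samples = [
    ([0.9, 0.8, 0.1, 0.1], 0), ([0.8, 0.9, 0.0, 0.2], 0),
    ([0.1, 0.1, 0.9, 0.8], 1), ([0.0, 0.2, 0.8, 0.9], 1),
    ([0.0, 0.1, 0.0, 0.1], 2), ([0.1, 0.0, 0.1, 0.0], 2),
]

model = TinyMLP(n_in=4, n_hidden=8, n_out=len(FAULT_TYPES))
epoch_losses = []
for epoch in range(500):
    total = sum(model.train_step(x, label) for x, label in samples)
    epoch_losses.append(total / len(samples))


def predict(x):
    """Return the fault type with the highest predicted probability."""
    _, p = model.forward(x)
    return FAULT_TYPES[p.index(max(p))]


print(epoch_losses[0], epoch_losses[-1], predict(samples[0][0]))
```

In each step the prediction (S420) is compared with the actual result and the weights are adjusted to reduce the difference (S430); a production system would more likely rely on an established deep learning framework than on hand-written back-propagation.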
[0063] It should be understood that the actual result of a component failure refers to the actual outcome of the failure. This actual result can be obtained by the operations team based on the analysis of operational logs and data corresponding to the input and output records.
[0064] Optionally, the actual result of a component failure includes: the actual failure type. For example, the actual failure type could be a failure of the primary storage server, a failure of the distribution layer of the primary storage server, a failure of the hard drive of the primary storage server, etc.
[0065] Optionally, the actual result of a component failure may also include the actual cause of the failure. For example, the actual cause of a hard drive failure type could be high hard drive I / O write latency. Based on this, the actual result of a component failure could be: hard drive failure - high hard drive I / O write latency.
[0066] Optionally, the input / output log information is log information about read / write requests. For example, an input / output log information might be: "2022-06-22-11:00 Data A was written to storage server 1, critical alarm." Here, 2022-06-22-11:00 indicates the time when the input / output log information was generated, and critical alarm indicates the alarm level of the input / output log information.
[0067] It should be understood that, in order for the target neural network model to identify component failures, the aforementioned multiple input / output record information may include at least one input / output hang (IO hang) alarm record information.
[0068] It should be understood that although the types of component failures are limited, the operations and maintenance team may not be able to consider all failure types comprehensively. Based on this, during model training, a certain number of normal input and output records can be used to classify the component failure types corresponding to these records as new failure types. If, during model execution, the component failure type corresponding to the input and output records to be tested is a new failure type, it indicates that the failure type corresponding to the input and output records to be tested needs to be further determined. Once the operations and maintenance team has determined the failure type corresponding to the input and output records to be tested, this failure type can be added back to the existing component failure types to retrain the target neural network model, thereby enabling the target neural network model to identify more failure types.
[0069] Optionally, the training device can select normal input / output record information in a certain proportion to the IO hang alarm record information, such as 5%. The target vectors corresponding to the normal input / output record information and the IO hang alarm record information, together with the corresponding actual component failure results, form a dataset, which is randomly shuffled to obtain a training set and a test set for training and testing the target neural network model. Specifically, the target vector and the actual component failure result corresponding to each piece of input / output record information constitute one set of training or test data.
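The dataset construction can be sketched as follows. The interpretation that the number of normal records equals 5% of the number of IO hang alarm records, the 80 / 20 split between training and test data, and the label string are assumptions for illustration; the function and variable names are hypothetical.

```python
import random


def build_datasets(hang_samples, normal_vectors, normal_ratio=0.05,
                   test_fraction=0.2, seed=0):
    """Mix IO hang samples with a proportion of normal samples, shuffle, and split.

    hang_samples:   list of (target_vector, actual_result) for IO hang alarm records
    normal_vectors: list of target vectors for normal input / output records
    """
    rng = random.Random(seed)
    n_normal = min(len(normal_vectors), round(len(hang_samples) * normal_ratio))
    chosen = rng.sample(normal_vectors, n_normal)
    # Normal records are labelled as the new fault type.
    dataset = list(hang_samples) + [(vec, "new fault type") for vec in chosen]
    rng.shuffle(dataset)
    n_test = int(len(dataset) * test_fraction)
    return dataset[n_test:], dataset[:n_test]


# Hypothetical data: 100 IO hang samples and 50 normal samples.
hang = [([i / 100.0], "hard disk failure") for i in range(100)]
normal = [[0.0] for _ in range(50)]
train_set, test_set = build_datasets(hang, normal)
print(len(train_set), len(test_set))
```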
[0070] Optionally, the input / output record information includes the identifier of the target storage server that generated the input / output record information, so that the training device can determine the target storage server based on the identifier, obtain information about each component of the target storage server, and then obtain the operation logs and operational data of each component within a preset time period. For example, an input / output record information is: "2022-06-22-11:00 Data A was written to storage server 1, critical alarm". Here, "1" is the identifier of the target storage server. Assuming that the components of the target storage server include: storage server 1, distribution layer 11, index layer 12, cache layer 13, persistence layer 14, and hard disk 15, then the training device can determine that the identifiers of the components of the target storage server are 1, 11, 12, 13, 14, and 15. Furthermore, the training device can find the operation logs and operational data that include these component identifiers and select those within the preset time period.
[0071] It should be understood that while the target storage server can generate operational logs and data for each component in real time, the training device only needs to acquire the operational logs and data for each input / output record within a preset time period, which includes the generation time of the input / output record. For example, if the generation time of an input / output record is 2022-06-22-11:00, then the preset time period for that record could be from 2022-06-22-10:55 to 2022-06-22-11:00. In other words, the training device needs to acquire the operational logs and data within this preset time period.
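The window selection above can be illustrated with a short sketch. The entry layout (dictionaries with `time`, `component_id`, and `alarm_level` keys) and the function names are hypothetical; the five-minute lookback and the component identifiers are taken from the examples in the text.

```python
from datetime import datetime, timedelta

def preset_window(generated_at, lookback=timedelta(minutes=5)):
    """Return the (start, end) of a preset period ending at the generation time."""
    return generated_at - lookback, generated_at

def select_in_window(entries, generated_at, component_ids):
    """Keep entries that belong to the given components and fall in the window."""
    start, end = preset_window(generated_at)
    return [e for e in entries
            if e["component_id"] in component_ids and start <= e["time"] <= end]

generated_at = datetime(2022, 6, 22, 11, 0)
entries = [
    {"time": datetime(2022, 6, 22, 10, 50), "component_id": 15, "alarm_level": 1},
    {"time": datetime(2022, 6, 22, 10, 57), "component_id": 15, "alarm_level": 2},
    {"time": datetime(2022, 6, 22, 10, 58), "component_id": 99, "alarm_level": 3},
]
selected = select_in_window(entries, generated_at, {1, 11, 12, 13, 14, 15})
```

Only the 10:57 entry for hard disk 15 is kept: the 10:50 entry is outside the window, and component 99 does not belong to the target storage server.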
[0072] Optionally, for each component, the operation log includes the alarm level of each submodule within that component. The operation log may or may not include alarm level 0, which indicates normal operation. For example, the operation logs of each component in the primary storage server may include: a level 2 alarm for the resource management submodule, normal operation of the snapshot submodule, a level 1 alarm for the authentication submodule in the distribution layer, a level 2 alarm for the lookup submodule in the index layer, normal operation of the cache submodule in the cache layer, a level 3 alarm for the read / write submodule in the persistence layer, and a level 2 alarm for the hard disk. As another example, the operation logs of each component in the primary storage server may include: a level 2 alarm for the resource management submodule, a level 1 alarm for the authentication submodule in the distribution layer, a level 2 alarm for the lookup submodule in the index layer, a level 3 alarm for the read / write submodule in the persistence layer, and a level 2 alarm for the hard disk.
[0073] Optionally, for each component, the operational data includes: performance metrics of each sub-module in that component over multiple time periods, and may also include latency statistics of each sub-module over multiple time periods.
[0074] Optionally, performance metrics include: Central Processing Unit (CPU) utilization, memory utilization, and hard disk utilization. For example, the operational data of each component in the main storage server includes: the resource management submodule in the main storage server has a CPU utilization of 5% in time period 1, a memory utilization of 20% in time period 1, and a hard disk utilization of 8% in time period 1; the resource management submodule has a CPU utilization of 10% in time period 2, a memory utilization of 30% in time period 2, and a hard disk utilization of 10% in time period 2; the resource management submodule has a CPU utilization of 12% in time period 3, a memory utilization of 21% in time period 3, and a hard disk utilization of 10% in time period 3.
[0075] The training device can determine the target vector in any of the following ways, but is not limited to:
[0076] In the first possible approach, for any component in the target neural network model, the training device can convert the alarm level of each sub-module in that component into a first value corresponding to each sub-module; for each of multiple time periods, convert the performance index of each sub-module in that time period into a second value corresponding to each sub-module in that time period; for any one of the sub-modules, obtain a third value corresponding to the sub-module based on the second values corresponding to the sub-module in the multiple time periods; and combine the first and third values corresponding to each sub-module to form a target vector.
[0077] In the second possible approach, for any component in the target neural network model, the training device can convert the alarm level of each sub-module in that component into a first value corresponding to each sub-module; for any one of the sub-modules, obtain a fourth value for that sub-module based on its performance indicators over multiple time periods; obtain a fifth value corresponding to each sub-module based on the fourth value corresponding to each sub-module; and combine the first value and the fifth value corresponding to each sub-module to form a target vector.
[0078] The following explains the first possible implementation method:
[0079] The training device can convert the alarm level of each submodule in the component into a first value corresponding to each submodule in any of the following possible ways, but is not limited to:
[0080] In one possible implementation, the training device can normalize the alarm levels of each sub-module based on the alarm levels of each sub-module to obtain the first value corresponding to each sub-module.
[0081] For example, suppose the components of the target storage server include: the target storage server itself, a distribution layer, an indexing layer, a caching layer, a persistence layer, and a hard disk. The target storage server includes a resource management submodule and a snapshot submodule. The distribution layer includes an authentication submodule. The indexing layer includes a lookup submodule. The caching layer includes a caching submodule. The persistence layer includes a read / write submodule. The hard disk is a separate submodule. Assume their respective alarm levels are: resource management submodule level 2 alarm, snapshot submodule operating normally (alarm level 0), authentication submodule level 1 alarm, lookup submodule level 2 alarm, caching submodule operating normally, read / write submodule level 3 alarm, and hard disk level 2 alarm. Based on this, the first values corresponding to the snapshot submodule and the caching submodule are both 0, and the first values corresponding to the resource management submodule, the authentication submodule, the lookup submodule, the read / write submodule, and the hard disk are obtained by normalizing their alarm levels (2, 1, 2, 3, and 2, respectively) against the alarm levels of all of these submodules.
[0082] As mentioned above, for each component, the operation log includes the alarm level of each sub-module in that component. The operation log may not include alarm level 0. Therefore, for the sub-module corresponding to alarm level 0, the first value corresponding to the sub-module can be assumed to be 0. For the sub-modules with alarm levels greater than 0, the training device can normalize the alarm levels of these sub-modules based on their alarm levels to obtain the first value corresponding to each of these sub-modules.
[0083] As described above, for each component, the runtime log includes the alarm levels of each submodule within that component. The runtime log can include alarm level 0. Before normalization, the training device can filter alarms with alarm levels greater than 0 using a pre-defined set of regular expressions. For submodules with alarm levels greater than 0, the training device can normalize their alarm levels to obtain a first value for each submodule. For submodules corresponding to alarm level 0, the first value can be assumed to be 0.
[0084] It should be understood that a regular expression is a string composed of characters with special meanings, and is mostly used to find and replace strings that match a rule. For example, the regular expression metacharacter $ matches the end of a string. Each alarm message in the runtime log can be regarded as a string, so a pattern anchored with $ can be used to select the alarm messages whose last character, the alarm level, is greater than 0.
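As a minimal sketch of this filtering step, the pattern below keeps only the messages that end in a digit from 1 to 9. The log line format is hypothetical; the text only states that the alarm level is the last character of each message.

```python
import re

# Hypothetical log format: the alarm level is the final character of each line.
# [1-9] matches a non-zero digit and $ anchors it to the end of the string.
NONZERO_LEVEL = re.compile(r"[1-9]$")

log_lines = [
    "resource_management alarm_level=2",
    "snapshot alarm_level=0",
    "authentication alarm_level=1",
]

# Keep only the alarm messages whose level is greater than 0.
alarms = [line for line in log_lines if NONZERO_LEVEL.search(line)]
```

The snapshot line, whose level is 0, is dropped, so its first value can later be taken as 0.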
[0085] In another possible implementation, the training device can normalize the alarm levels of each of the above sub-modules based on all alarm levels to obtain the first value corresponding to each sub-module.
[0086] It should be understood that in the embodiments of this application, all alarm levels are independent of sub-modules. For example, all alarm levels include: alarm level 0, alarm level 1, alarm level 2 and alarm level 3. The larger the alarm level index, the higher the alarm level.
[0087] For example, suppose the components of the target storage server include: the target storage server itself, a distribution layer, an indexing layer, a caching layer, a persistence layer, and a hard disk. The target storage server includes a resource management submodule and a snapshot submodule. The distribution layer includes an authentication submodule. The indexing layer includes a lookup submodule. The caching layer includes a caching submodule. The persistence layer includes a read / write submodule. The hard disk is a separate submodule. Assume their respective alarm levels are: resource management submodule level 2 alarm, snapshot submodule operating normally (alarm level 0), authentication submodule level 1 alarm, lookup submodule level 2 alarm, caching submodule operating normally, read / write submodule level 3 alarm, and hard disk level 2 alarm. Based on this, with all alarm levels being 0, 1, 2, and 3, the first value corresponding to the resource management submodule is 2/(0+1+2+3) ≈ 0.33; the first value corresponding to the snapshot submodule is 0; the first value corresponding to the authentication submodule is 1/(0+1+2+3) ≈ 0.17; the first value corresponding to the lookup submodule is 2/(0+1+2+3) ≈ 0.33; the first value corresponding to the caching submodule is 0; the first value corresponding to the read / write submodule is 3/(0+1+2+3) = 0.5; and the first value corresponding to the hard disk is 2/(0+1+2+3) ≈ 0.33.
[0088] As mentioned above, for each component, the operation log includes the alarm level of each sub-module in that component. The operation log may not include alarm level 0. Therefore, for the sub-module corresponding to alarm level 0, the first value corresponding to the sub-module can be assumed to be 0. For the sub-module with alarm level greater than 0, the training device can normalize the alarm levels of these sub-modules based on all alarm levels to obtain the first value corresponding to each of these sub-modules.
[0089] As described above, for each component, the runtime log includes the alarm levels of each submodule within that component. The runtime log can include alarm level 0. Before normalization, the training device can filter alarm levels greater than 0 using a pre-defined set of regular expressions. For submodules with alarm levels greater than 0, the training device can normalize the alarm levels of these submodules based on all alarm levels to obtain the first value corresponding to each submodule. For the submodule corresponding to alarm level 0, the first value can be assumed to be 0.
[0090] In another possible implementation, the training device can use the index of the alarm level of each of the above sub-modules as the first value corresponding to each sub-module.
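The three ways of obtaining the first value can be compared in a short sketch. It assumes that "normalize" means dividing each alarm level by a sum (of the submodules' own levels, or of all defined alarm levels 0 to 3). That reading is an assumption, although the second variant does reproduce the first seven entries of the target vector given later in paragraph [0103]. The function and key names are illustrative.

```python
levels = {"resource_management": 2, "snapshot": 0, "authentication": 1,
          "lookup": 2, "cache": 0, "read_write": 3, "hard_disk": 2}
ALL_ALARM_LEVELS = (0, 1, 2, 3)

def normalize_by_submodules(levels):
    """Variant 1 (assumed): divide by the sum of the submodules' own levels."""
    total = sum(levels.values())
    return {name: (lvl / total if total else 0.0) for name, lvl in levels.items()}

def normalize_by_all_levels(levels, all_levels=ALL_ALARM_LEVELS):
    """Variant 2 (assumed): divide by the sum of all defined alarm levels."""
    total = sum(all_levels)
    return {name: lvl / total for name, lvl in levels.items()}

def level_index(levels):
    """Variant 3: use the alarm level index itself as the first value."""
    return dict(levels)

by_submodules = [round(v, 2) for v in normalize_by_submodules(levels).values()]
by_all = [round(v, 2) for v in normalize_by_all_levels(levels).values()]
```

Submodules at alarm level 0 map to 0 in every variant, which matches the rule stated above for logs that omit level 0.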
[0091] For each of the multiple time periods, the training device can convert the performance metrics of each submodule for that time period into a second value corresponding to each submodule for that time period in the following ways, but is not limited to:
[0092] In one possible implementation, for each of multiple time periods, the training device can normalize the performance metrics of each submodule for that time period, based on the performance metrics of all the submodules for that time period, to obtain the second value corresponding to each submodule for that time period.
[0093] Optionally, the training device may use the Z-score method to normalize the performance metrics of each sub-module for that time period, but is not limited to this.
[0094] For example, suppose the components of the target storage server include: the target storage server itself, a distribution layer, an indexing layer, a caching layer, a persistence layer, and a hard disk. The target storage server includes a resource management submodule and a snapshot submodule. The distribution layer includes an authentication submodule. The indexing layer includes a lookup submodule. The caching layer includes a caching submodule. The persistence layer includes a read / write submodule. The hard disk is a separate submodule. Assume their respective CPU utilization rates during time period 1 are: 5% for the resource management submodule, 2% for the snapshot submodule, 2% for the authentication submodule, 3% for the lookup submodule, 1% for the caching submodule, 5% for the read / write submodule, and 3% for the hard disk. Assume their respective CPU utilization rates during time period 2 are: 5% for the resource management submodule, 3% for the snapshot submodule, 3% for the authentication submodule, 4% for the lookup submodule, 2% for the caching submodule, 4% for the read / write submodule, and 2% for the hard disk. Based on this, the utilization rates sum to 21% in time period 1, so the second value in time period 1 is 5/21 ≈ 0.24 for the resource management submodule, 2/21 ≈ 0.10 for the snapshot submodule, 2/21 ≈ 0.10 for the authentication submodule, 3/21 ≈ 0.14 for the lookup submodule, 1/21 ≈ 0.05 for the caching submodule, 5/21 ≈ 0.24 for the read / write submodule, and 3/21 ≈ 0.14 for the hard disk. The utilization rates sum to 23% in time period 2, so the second value in time period 2 is 5/23 ≈ 0.22 for the resource management submodule, 3/23 ≈ 0.13 for the snapshot submodule, 3/23 ≈ 0.13 for the authentication submodule, 4/23 ≈ 0.17 for the lookup submodule, 2/23 ≈ 0.09 for the caching submodule, 4/23 ≈ 0.17 for the read / write submodule, and 2/23 ≈ 0.09 for the hard disk.
[0095] It should be understood that if the operational data does not include the performance metrics of a certain sub-module in a certain time period, then the second value of that sub-module in that time period can be taken to be 0. For the performance metrics that are included in the operational data, the training device can perform the normalization based on those performance metrics.
[0096] In another possible implementation, for each of the multiple time periods, the training device can use the performance metric of each submodule in that time period as the second value corresponding to each submodule in that time period.
[0097] Optionally, for any one of the submodules, the training device can obtain the third value corresponding to the submodule according to any of the following feasible methods:
[0098] In one possible implementation, for any one of the sub-modules, the training device can calculate the average of the second values corresponding to that sub-module over multiple time periods to obtain the third value corresponding to the sub-module.
[0099] For example, suppose the second values in time period 1 are 0.24 for the resource management submodule, 0.10 for the snapshot submodule, 0.10 for the authentication submodule, 0.14 for the lookup submodule, 0.05 for the caching submodule, 0.24 for the read / write submodule, and 0.14 for the hard disk, and the second values in time period 2 are 0.22 for the resource management submodule, 0.13 for the snapshot submodule, 0.13 for the authentication submodule, 0.17 for the lookup submodule, 0.09 for the caching submodule, 0.17 for the read / write submodule, and 0.09 for the hard disk. Then the third value corresponding to the resource management submodule is (0.24 + 0.22)/2 = 0.23; the third value corresponding to the snapshot submodule is (0.10 + 0.13)/2 ≈ 0.12; the third value corresponding to the authentication submodule is (0.10 + 0.13)/2 ≈ 0.12; the third value corresponding to the lookup submodule is (0.14 + 0.17)/2 ≈ 0.16; the third value corresponding to the caching submodule is (0.05 + 0.09)/2 = 0.07; the third value corresponding to the read / write submodule is (0.24 + 0.17)/2 ≈ 0.21; and the third value corresponding to the hard disk is (0.14 + 0.09)/2 ≈ 0.12.
[0100] In another possible implementation, for any one of the submodules, the training device can calculate the sum of the second values corresponding to that submodule over multiple time periods to obtain the third value corresponding to the submodule.
[0101] Optionally, the training device can combine the first and third values corresponding to each sub-module according to preset rules to obtain the target vector.
[0102] Optionally, the preset rules can be in the order of authentication submodule, lookup submodule, cache submodule, read / write submodule and hard disk, and for each submodule, sorted in the order of first value, third value corresponding to CPU utilization, third value corresponding to memory utilization, and third value corresponding to hard disk utilization.
[0103] For example, assuming the performance metric is CPU utilization, then combining the above example, the final target vector is $(0.33, 0, 0.17, 0.33, 0, 0.5, 0.33, 0.23, 0.12, 0.12, 0.16, 0.07, 0.21, 0.12)^{T}$.
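The first approach can be traced end to end with the numbers from the worked example. The sketch rests on two assumptions that are not stated in the text but that reproduce the vector above exactly: normalization divides each value by a sum (all alarm levels 0 to 3 for the first values, the per-period total for CPU utilization), and every intermediate result is rounded half-up to two decimal places. The helper names are illustrative.

```python
from fractions import Fraction

def round_half_up(value, places=2):
    """Round a non-negative Fraction half-up, keeping it an exact Fraction."""
    scale = 10 ** places
    return Fraction(int(value * scale + Fraction(1, 2)), scale)

def share_of_total(values, total):
    """Assumed normalization: each value as a rounded share of a total."""
    return [round_half_up(Fraction(v, total)) for v in values]

# Order: resource management, snapshot, authentication, lookup,
# cache, read/write, hard disk (values taken from the example above).
alarm_levels = [2, 0, 1, 2, 0, 3, 2]
cpu_by_period = [
    [5, 2, 2, 3, 1, 5, 3],  # CPU utilization in time period 1 (%)
    [5, 3, 3, 4, 2, 4, 2],  # CPU utilization in time period 2 (%)
]

first = share_of_total(alarm_levels, sum((0, 1, 2, 3)))
second = [share_of_total(period, sum(period)) for period in cpu_by_period]
# Third value: average of each submodule's second values across periods.
third = [round_half_up(sum(vals) / len(vals)) for vals in zip(*second)]
target_vector = [float(v) for v in first + third]
```

Exact fractions are used so that ties such as 0.115 round up to 0.12 as in the example, which binary floating point does not guarantee.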
[0104] The following explains the second possible method:
[0105] It should be understood that the method for determining the first value can be referred to above, and will not be repeated in the embodiments of this application.
[0106] For any given submodule, the training device can obtain a fourth value for that submodule based on its performance metrics over multiple time periods, but is not limited to the following methods:
[0107] In one possible implementation, for any one of the submodules, the training device calculates the average of the performance metrics of that submodule over multiple time periods to obtain a fourth value for that submodule.
[0108] For example, suppose the components of the target storage server include: the target storage server itself, a distribution layer, an indexing layer, a caching layer, a persistence layer, and a hard disk. The target storage server includes a resource management submodule and a snapshot submodule. The distribution layer includes an authentication submodule. The indexing layer includes a lookup submodule. The caching layer includes a caching submodule. The persistence layer includes a read / write submodule. The hard disk is a separate submodule. Assume their respective CPU utilization rates during time period 1 are: 5% for the resource management submodule, 2% for the snapshot submodule, 2% for the authentication submodule, 3% for the lookup submodule, 1% for the caching submodule, 5% for the read / write submodule, and 3% for the hard disk. Assume their respective CPU utilization rates during time period 2 are: 5% for the resource management submodule, 3% for the snapshot submodule, 3% for the authentication submodule, 4% for the lookup submodule, 2% for the caching submodule, 4% for the read / write submodule, and 2% for the hard disk. Based on this, the fourth value corresponding to the resource management submodule is (5% + 5%)/2 = 5%; the fourth value corresponding to the snapshot submodule is (2% + 3%)/2 = 2.5%; the fourth value corresponding to the authentication submodule is (2% + 3%)/2 = 2.5%; the fourth value corresponding to the lookup submodule is (3% + 4%)/2 = 3.5%; the fourth value corresponding to the caching submodule is (1% + 2%)/2 = 1.5%; the fourth value corresponding to the read / write submodule is (5% + 4%)/2 = 4.5%; and the fourth value corresponding to the hard disk is (3% + 2%)/2 = 2.5%.
[0109] In another possible implementation, for any one of the submodules, the training device calculates the sum of the performance metrics of that submodule over multiple time periods to obtain a fourth value for that submodule.
[0110] Optionally, the training device may obtain the fifth value corresponding to each submodule based on the fourth value corresponding to each submodule in the following feasible manner, but is not limited to this.
[0111] In one possible implementation, the training device normalizes the fourth value corresponding to each submodule to obtain the fifth value corresponding to each submodule.
[0112] Optionally, the training device may use the Z-score method to normalize the fourth value corresponding to each sub-module, but is not limited to this.
[0113] For example, suppose the fourth value corresponding to the resource management submodule is 5%, the fourth value corresponding to the snapshot submodule is 2.5%, the fourth value corresponding to the authentication submodule is 2.5%, the fourth value corresponding to the lookup submodule is 3.5%, the fourth value corresponding to the caching submodule is 1.5%, the fourth value corresponding to the read / write submodule is 4.5%, and the fourth value corresponding to the hard disk is 2.5%, which sum to 22%. Based on this, the fifth value corresponding to the resource management submodule is 5/22 ≈ 0.23; the fifth value corresponding to the snapshot submodule is 2.5/22 ≈ 0.11; the fifth value corresponding to the authentication submodule is 2.5/22 ≈ 0.11; the fifth value corresponding to the lookup submodule is 3.5/22 ≈ 0.16; the fifth value corresponding to the caching submodule is 1.5/22 ≈ 0.07; the fifth value corresponding to the read / write submodule is 4.5/22 ≈ 0.20; and the fifth value corresponding to the hard disk is 2.5/22 ≈ 0.11.
[0114] In another possible implementation, the training device uses the fourth value corresponding to each submodule directly as the fifth value corresponding to each submodule.
[0115] Optionally, the training device can combine the first and fifth values corresponding to each sub-module according to preset rules to obtain the target vector.
[0116] Optionally, the preset rules can be in the order of authentication submodule, lookup submodule, cache submodule, read / write submodule and hard disk, and for each submodule, sorted in the order of the first value, the fifth value corresponding to CPU utilization, the fifth value corresponding to memory utilization, and the fifth value corresponding to hard disk utilization.
[0117] For example, assuming the performance metric is CPU utilization, then combining the above example, the final target vector is $(0.33, 0, 0.17, 0.33, 0, 0.5, 0.33, 0.23, 0.11, 0.11, 0.16, 0.07, 0.20, 0.11)^{T}$.
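The second approach differs only in the order of operations: the per-submodule averages (fourth values) are computed first and then normalized (fifth values). The sketch below makes the same assumptions as the worked example appears to make, namely normalization as a share of the total and half-up rounding to two decimals. Under those assumptions it reproduces the vector above. The helper names are illustrative.

```python
from fractions import Fraction

def round_half_up(value, places=2):
    """Round a non-negative Fraction half-up, keeping it an exact Fraction."""
    scale = 10 ** places
    return Fraction(int(value * scale + Fraction(1, 2)), scale)

# Order: resource management, snapshot, authentication, lookup,
# cache, read/write, hard disk (values taken from the example above).
alarm_levels = [2, 0, 1, 2, 0, 3, 2]
cpu_by_period = [
    [5, 2, 2, 3, 1, 5, 3],  # CPU utilization in time period 1 (%)
    [5, 3, 3, 4, 2, 4, 2],  # CPU utilization in time period 2 (%)
]

# First values: assumed share of the sum of all alarm levels (0+1+2+3).
first = [round_half_up(Fraction(lvl, 6)) for lvl in alarm_levels]
# Fourth values: average utilization of each submodule across periods.
fourth = [Fraction(sum(vals), len(vals)) for vals in zip(*cpu_by_period)]
# Fifth values: each fourth value as a rounded share of the total.
total = sum(fourth)
fifth = [round_half_up(v / total) for v in fourth]
target_vector = [float(v) for v in first + fifth]
```

The two approaches give slightly different vectors (0.11 versus 0.12 in three positions) because averaging before normalizing avoids one rounding step.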
[0118] It should be understood that since the types of component failures in a storage server cluster are relatively fixed, and the causes of failures are limited after manual sorting and classification of historical IO hangs, the problem of fault localization can be transformed into a multi-classification task in machine learning. This can be solved automatically and quickly using a multilayer perceptron (MLP) model trained on historical data. In other words, the target neural network model mentioned above can be a multilayer perceptron model, but is not limited to it.
[0119] In this embodiment of the application, the input of the MLP model is the above-mentioned target vector, and the output is the probability of all component failure results. There are a total of C+1 component failure results, C of which are manually defined component failure results, and the remaining 1 component failure result is a novel failure.
[0120] It should be understood that if a component failure result only includes the failure type, then the number of categories of component failure results is the same as the number of components in the storage server. For example, the components of a storage server may include: the storage server, the distribution layer, the index layer, the cache layer, the persistence layer, and the disk. Then the corresponding component failure results include: storage server failure, distribution layer failure, index layer failure, cache layer failure, persistence layer failure, and disk failure. If a component failure result includes both the failure type and the failure cause, then the number of categories of component failure results is the product of the number of components in the storage server and the number of failure causes, and the same failure type can correspond to at least one failure cause.
[0121] It should be understood that the MLP model can be implemented using a sequential model. Figure 5 is a schematic diagram of an MLP model provided in an embodiment of this application. As shown in Figure 5, this MLP model includes an input layer, hidden layers, and an output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers, with adjacent layers fully connected. The number of neurons in the input layer is N, where N is the dimension of the target vector. If a storage server includes six components in total (the storage server itself, the distribution layer, the index layer, the caching layer, the persistence layer, and the hard disk), each component corresponds to one alarm level, and the operational data includes three performance metrics (CPU utilization, memory utilization, and hard disk utilization), then after normalization these alarm levels and performance metrics yield 6 × (1 + 3) = 24 values between 0 and 1. Based on this, the dimension of the target vector is N = 24. The default number of hidden layers is 2, and the number of neurons in a hidden layer can be 128. Each part of the MLP model is a linear model followed by an activation function. For example, for any neuron in any hidden layer, the training device can first calculate the linear result from the previous layer's neurons to that neuron, and then apply an activation function to that linear result to obtain the value of that neuron.
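A forward pass through such a network can be sketched in plain Python. This is an untrained illustration, not the model of the embodiment: the weights are random, ReLU is picked as the hidden activation, 128 neurons are assumed for each of the two hidden layers, and C = 6 manually defined failure results (plus one novel result) is an assumption based on the six-component example.

```python
import math
import random

def linear(inputs, weights, biases):
    """Linear result of every neuron in a layer: w . x + b."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    # Subtracting the maximum keeps exp() from overflowing.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(vector, layers):
    """Hidden layers use linear + ReLU; the output layer uses linear + Softmax."""
    activations = vector
    for weights, biases in layers[:-1]:
        activations = relu(linear(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(linear(activations, weights, biases))

rng = random.Random(0)
N, HIDDEN, C = 24, 128, 6            # C is an illustrative assumption
sizes = [N, HIDDEN, HIDDEN, C + 1]   # input, two hidden layers, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probabilities = mlp_forward([0.5] * N, layers)
```

The output holds C + 1 probabilities that sum to 1, one for each component failure result.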
[0122] In the embodiments of this application, the activation function of the hidden layer can be any of the following, but is not limited to:
[0123] In some embodiments, the activation function of the hidden layer can be a Rectified Linear Unit (ReLU) function:
[0124] $$\mathrm{ReLU}(x) = \max(0, x)$$
[0125] In some embodiments, the activation function of the hidden layer can be the Sigmoid function:
[0126] $$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
[0127] In some embodiments, the activation function of the hidden layer can be the tanh function:
[0128] $$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
[0129] In some embodiments, the activation function of the hidden layer can be the Leaky ReLU function:
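[0130] $$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$ where $\alpha$ is a small positive slope coefficient.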
[0131] In some embodiments, the activation function of the hidden layer can be an ELU function:
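[0132] $$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha \left(e^{x} - 1\right), & x \le 0 \end{cases}$$ where $\alpha$ is a positive constant.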
[0133] In all the activation functions mentioned above, x represents the linear result from the previous neuron to this neuron.
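For reference, the five hidden-layer activation functions named above can be written as follows, using their standard textbook definitions. The default slope `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are common conventions assumed here, not values given in the text.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # Two branches avoid overflow in exp() for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def leaky_relu(x, alpha=0.01):
    # alpha is the assumed small slope applied to negative inputs.
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # alpha is the assumed scale of the negative saturation value.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

In each case `x` is the linear result arriving at the neuron, as stated above.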
[0134] Optionally, the activation function of the output layer can be a normalized exponential function (Softmax):
[0135] $$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C+1} e^{z_j}}$$
[0136] Here, $z_i$ represents the linear result of the i-th neuron in the output layer, C+1 represents the number of types of component failure results, and $\mathrm{Softmax}(z_i)$ represents the probability of the i-th component failure result.
[0137] It should be understood that since fault location is essentially a multi-class classification task, the loss function can be the multi-class cross-entropy loss function.
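A minimal sketch of the multi-class cross-entropy loss, assuming each training sample has exactly one true component failure result (identified by its index) and that the model output is already a probability vector. The clamp `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math

def cross_entropy(probabilities, true_index, eps=1e-12):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(max(probabilities[true_index], eps))

def mean_cross_entropy(batch_probabilities, true_indices):
    """Average loss over a batch of samples."""
    losses = [cross_entropy(p, t)
              for p, t in zip(batch_probabilities, true_indices)]
    return sum(losses) / len(losses)

confident = cross_entropy([0.7, 0.2, 0.1], 0)   # true class given 0.7
uncertain = cross_entropy([0.4, 0.3, 0.3], 0)   # true class given 0.4
```

The loss shrinks as the probability assigned to the true failure result approaches 1, which is what training minimizes.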
[0138] Optionally, the optimizer used in the embodiments of this application may be any of the following, but not limited to: Adaptive Gradient Algorithm (Adagrad), standard momentum optimization algorithm, Adaptive Moment Estimation (Adam) algorithm, etc.
[0139] Adagrad makes larger updates to low-frequency parameters (i.e., fault characteristics), which can speed up convergence. The standard momentum optimization algorithm introduces momentum into stochastic gradient descent (SGD), mitigating the large oscillations that SGD exhibits around the correct gradient direction during convergence. The Adam algorithm builds on the per-parameter adaptive learning rates of Adagrad-style methods and combines them with momentum.
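The update rules of the three optimizers can be compared on a one-dimensional toy problem, minimizing f(x) = x². This is a sketch of the standard textbook update rules rather than of the embodiment's training code; the objective and all hyperparameter values are illustrative assumptions.

```python
import math

def grad(x):
    """Gradient of the toy objective f(x) = x**2."""
    return 2.0 * x

def sgd_momentum(x, steps, lr=0.1, momentum=0.9):
    velocity = 0.0
    for _ in range(steps):
        # The velocity accumulates past gradients and smooths the trajectory.
        velocity = momentum * velocity - lr * grad(x)
        x += velocity
    return x

def adagrad(x, steps, lr=0.5, eps=1e-8):
    accum = 0.0
    for _ in range(steps):
        g = grad(x)
        # Parameters with little accumulated gradient get larger steps.
        accum += g * g
        x -= lr * g / (math.sqrt(accum) + eps)
    return x

def adam(x, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (scaling)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

Starting from x = 5, the first step moves to 4.0 with momentum SGD, to about 4.5 with Adagrad, and to about 4.9 with Adam, whose first step has a magnitude close to the learning rate regardless of the gradient scale.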
[0140] Optionally, when the loss function reaches its minimum, the training device stops training the target neural network model until the training set is updated. For example, if the operations and maintenance team analyzes a new type of fault and obtains new component fault results, the training set can be updated based on this. Alternatively, the target neural network model can be updated again after a preset time period.
[0141] It should be understood that after obtaining the probabilities of all component failures through the MLP model, the training device can select the component failure result with the highest probability as the component failure prediction result.
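Selecting the prediction from the probability vector amounts to an argmax. In the sketch below the label list, its ordering, and the probability values are illustrative assumptions; the last entry stands for the novel failure result.

```python
FAILURE_RESULTS = [
    "storage server failure", "distribution layer failure",
    "index layer failure", "cache layer failure",
    "persistence layer failure", "hard disk failure",
    "novel failure",
]

def predict(probabilities, labels=FAILURE_RESULTS):
    """Return the failure result with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]

label, confidence = predict([0.05, 0.10, 0.05, 0.05, 0.05, 0.60, 0.10])
```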
[0142] Optionally, the component failure prediction result may include: the predicted failure type, for example, the predicted failure type may be a failure of the distribution layer of the main storage server, a failure of the hard disk of the main storage server, a failure of the index layer of the main storage server, a failure of the cache layer of the main storage server, etc.
[0143] To help the operations and maintenance team identify the cause of the failure, the component failure prediction results can also include the predicted cause of the failure. For example, the cause of a hard drive failure could be high hard drive I / O write latency. Therefore, the component failure prediction result could be: Hard drive failure - High hard drive I / O write latency.
[0144] In this embodiment, the training device can combine input / output record information, component information, operation logs, operational data, and actual component failure results to construct training data to train a neural network model. This model can then automatically locate faults. Compared to manual location methods, the automatic fault location method implemented through model training in this embodiment can improve the efficiency and accuracy of fault location.
[0145] The component failure prediction results obtained through the target neural network model can also include: the cause of the failure, which helps the operation and maintenance team to repair cloud storage failures as soon as possible.
[0146] Figure 6 is a flowchart of a fault location method provided in an embodiment of this application. This method can be executed by any electronic device such as a desktop computer or a laptop computer. For example, the method can be executed by, but is not limited to, the execution device 203 in Figure 2. As shown in Figure 6, the method may include the following steps:
[0147] S610: Obtain target input / output record information of the storage server cluster;
[0148] S620: Determine the target storage server in the storage server cluster that generates the target input / output record information;
[0149] S630: Obtain information about each component of the target storage server;
[0150] S640: Obtain the running logs and operational data of each component within a preset time period based on the information of each component. The preset time period includes the generation time of the target input and output record information.
[0151] S650: Generate a target vector based on the operation logs and operational data;
[0152] S660: Input the target vector into the target neural network model trained by the above model training method to obtain the component fault prediction result corresponding to the target input and output record information.
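Steps S610 to S660 form a simple pipeline, sketched below as a chain of function calls. Every callable passed in is a hypothetical stand-in (the text does not define such interfaces), and the toy lambdas exist only so the sketch runs end to end.

```python
def locate_fault(record, find_server, get_components, get_logs_and_data,
                 build_vector, model):
    """Chain steps S620 to S660 for one target input/output record (S610)."""
    server = find_server(record)                        # S620
    components = get_components(server)                 # S630
    logs, data = get_logs_and_data(components, record)  # S640
    vector = build_vector(logs, data)                   # S650
    return model(vector)                                # S660

# Toy stand-ins; a real system would query the cluster and a trained model.
record = {"server_id": 1, "time": "2022-06-22-11:00"}
result = locate_fault(
    record,
    find_server=lambda r: r["server_id"],
    get_components=lambda server: [server, 11, 12, 13, 14, 15],
    get_logs_and_data=lambda comps, r: ({c: 0 for c in comps},
                                        {c: 0.0 for c in comps}),
    build_vector=lambda logs, data: list(logs.values()) + list(data.values()),
    model=lambda vector: ("hard disk failure" if len(vector) == 12
                          else "novel failure"),
)
```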
[0153] It should be understood that the target input / output record information is the input / output record information to be predicted. The execution device can obtain this input / output record information from the storage server cluster.
[0154] It should be understood that the explanations of S610 to S650 can be found above, and will not be repeated here.
[0155] Optionally, the component failure prediction results include: the predicted failure type.
[0156] Optionally, the component failure prediction results may also include the predicted cause of failure.
[0157] Optionally, if the predicted fault type is a novel fault type, the device displays the component fault prediction result and pushes an alarm message to alert the operations and maintenance team of the novel fault. This allows the operations and maintenance team to quickly locate the fault type and cause of the novel fault, forming a component fault result. The component fault result and its corresponding target vector are then added to the training data to further train the target neural network model.
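The handling of a novel fault can be sketched as two small steps: alert the operations and maintenance team and hold the vector for review, then append the confirmed result to the training data. The function names, the `notify` callback, and the in-memory lists are illustrative assumptions.

```python
NOVEL = "novel failure"

def handle_prediction(target_vector, predicted_type, pending_review, notify):
    """Alert the O&M team and queue the vector when the type is novel."""
    if predicted_type != NOVEL:
        return False
    notify(f"Novel fault detected: {predicted_type}")
    pending_review.append(target_vector)
    return True

def add_confirmed_result(target_vector, confirmed_result, training_data):
    """Add the team's confirmed result so the model can be trained further."""
    training_data.append((target_vector, confirmed_result))

alerts, pending, training_data = [], [], []
handle_prediction([0.33, 0.0], NOVEL, pending, alerts.append)
handle_prediction([0.1, 0.2], "hard disk failure", pending, alerts.append)
add_confirmed_result(
    pending[0], "hard disk failure - high hard disk I/O write latency",
    training_data)
```

Only the first prediction triggers an alert; once its cause has been confirmed, the pair becomes a new training sample.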
[0158] In this embodiment, the execution device only needs to input the target vector corresponding to the target input / output record information into the target neural network model to obtain the component fault prediction result corresponding to that record information. Compared with manual fault location, this automatic method can improve both the efficiency and the accuracy of fault location.
[0159] Figure 7 is a schematic diagram of a model training device 700 provided in an embodiment of this application. The device 700 may include: an acquisition module 710, a processing module 720, and a training module 730. The acquisition module 710 is used to acquire multiple input / output record information of a storage server cluster and the actual component failure results respectively corresponding to the multiple input / output record information. The processing module 720 is used to, for each of the multiple input / output record information: determine the target storage server in the storage server cluster that generated the input / output record information; acquire information about each component of the target storage server; acquire the operation logs and operational data of each component within a preset time period based on the information of each component; generate a target vector based on the operation logs and operational data; and input the target vector into the target neural network model to obtain the component failure prediction result corresponding to the input / output record information. The preset time period includes the generation time of the input / output record information. The training module 730 is used to train the target neural network model based on the actual component failure results and the component failure prediction results corresponding to the multiple input / output record information.
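The role of the training module 730 can be illustrated with a deliberately small stand-in. This application does not specify the network architecture, loss function, or optimizer, so the sketch below assumes a single-layer softmax classifier trained by stochastic gradient descent on cross-entropy, with fault types encoded as integer labels. All function names and hyperparameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (target vector, actual fault type index)


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, bias, x) -> List[float]:
    scores = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(scores)


def train(samples: List[Sample], n_classes: int, epochs: int = 200, lr: float = 0.5):
    n_features = len(samples[0][0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for x, actual in samples:
            predicted = predict_proba(weights, bias, x)
            for k in range(n_classes):
                # Cross-entropy gradient: predicted probability minus one-hot actual.
                error = predicted[k] - (1.0 if k == actual else 0.0)
                bias[k] -= lr * error
                for j in range(n_features):
                    weights[k][j] -= lr * error * x[j]
    return weights, bias


def predict(weights, bias, x) -> int:
    probs = predict_proba(weights, bias, x)
    return max(range(len(probs)), key=probs.__getitem__)
```

The update step is where the actual component failure result and the component failure prediction result meet: their difference drives each parameter change.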
[0160] Optionally, for each component, the operation log includes the alarm level of each sub-module in the component, and the operational data includes the performance indicators of each sub-module in the component over multiple time periods. The processing module 720 is specifically used to: convert the alarm level of each sub-module into a first value corresponding to that sub-module; for each time period, convert the performance indicators of each sub-module in the time period into a second value corresponding to that sub-module in the time period; for any sub-module, obtain a third value corresponding to the sub-module based on the second values corresponding to the sub-module in the multiple time periods; and combine the first value and the third value corresponding to each sub-module to form a target vector.
[0161] Optionally, the processing module 720 is specifically used to: normalize the alarm levels of each sub-module to obtain the first value corresponding to each sub-module.
[0162] Optionally, the processing module 720 is specifically used to: for each of the multiple time periods, normalize the performance indicators of each sub-module in that time period to obtain the second value corresponding to each sub-module in that time period.
[0163] Optionally, the processing module 720 is specifically used to: for any one of the sub-modules, calculate the average of the second values corresponding to the sub-module over the multiple time periods to obtain the third value corresponding to the sub-module.
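The construction of the target vector from first, second, and third values can be sketched as follows. The text only says "normalize", so min-max scaling is an assumption here, as are the function names, the single performance indicator per sub-module per time period, and the alphabetical ordering of sub-modules.

```python
from typing import Dict, List, Tuple


def min_max(value: float, low: float, high: float) -> float:
    """Scale value into [0, 1]; an assumed normalization, not specified in the text."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


def build_target_vector(
    alarm_levels: Dict[str, int],     # sub-module -> alarm level from the operation log
    metrics: Dict[str, List[float]],  # sub-module -> one indicator per time period
    max_alarm_level: int,
    metric_range: Tuple[float, float],
) -> List[float]:
    vector: List[float] = []
    for name in sorted(alarm_levels):  # fixed order keeps vector positions stable
        # First value: normalized alarm level of the sub-module.
        first = min_max(alarm_levels[name], 0, max_alarm_level)
        # Second values: normalized indicator for each time period.
        seconds = [min_max(m, *metric_range) for m in metrics[name]]
        # Third value: average of the second values over the time periods.
        third = sum(seconds) / len(seconds)
        vector.extend([first, third])
    return vector
```

For example, with alarm levels `{"disk0": 2, "nic0": 0}` on a 0 to 4 scale and indicators `{"disk0": [50.0, 100.0], "nic0": [0.0, 20.0]}` on a 0 to 100 scale, the function returns one (first, third) pair per sub-module, four values in total.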
[0164] Optionally, the actual component failure result includes: the actual failure type; the component failure prediction result includes: the predicted failure type.
[0165] Optionally, the actual component failure result also includes: the actual failure cause; the component failure prediction result also includes: the predicted failure cause.
[0166] Optionally, the multiple input / output record information includes: at least one input / output pending alarm record information.
[0167] Optionally, the multiple input / output record information may also include: at least one normal input / output record information.
[0168] Optionally, for any one of the at least one normal input / output record information, the actual fault type corresponding to the normal input / output record information is a novel fault type.
[0169] It should be understood that the device embodiments and the method embodiments correspond to each other, and similar descriptions can be found in the method embodiments. To avoid repetition, further details are not provided here. Specifically, the device 700 shown in Figure 7 can perform the method embodiment corresponding to Figure 4, and the foregoing and other operations and / or functions of each module in the device 700 respectively implement the corresponding processes of the method in Figure 4. For the sake of brevity, they are not described in detail here.
[0170] The apparatus 700 of this application embodiment has been described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that these functional modules can be implemented in hardware, in software instructions, or in a combination of hardware and software modules. Specifically, the steps of the method embodiments in this application can be completed by integrated logic circuits in the processor's hardware and / or by software instructions. The steps of the method disclosed in this application embodiment can be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. Optionally, the software module can reside in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, etc. This storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps in the above method embodiments.
[0171] Figure 8 is a schematic diagram of a fault location device 800 provided in an embodiment of this application. As shown in Figure 8, the device 800 includes: a first acquisition module 810, a determination module 820, a second acquisition module 830, a third acquisition module 840, a generation module 850, and an input module 860. The first acquisition module 810 is used to acquire target input / output record information of the storage server cluster; the determination module 820 is used to determine the target storage server in the storage server cluster that generated the target input / output record information; the second acquisition module 830 is used to acquire information about each component of the target storage server; the third acquisition module 840 is used to acquire, based on the information of each component, the operation logs and operational data of each component within a preset time period, where the preset time period includes the generation time of the target input / output record information; the generation module 850 is used to generate a target vector based on the operation logs and operational data; and the input module 860 is used to input the target vector into the target neural network model trained by the aforementioned model training method to obtain the component fault prediction result corresponding to the target input / output record information.
[0172] Optionally, the component failure prediction result includes: the predicted failure type. The device 800 also includes a display module 870 and a push module 880. If the predicted failure type is a novel fault type, the display module 870 is used to display the component failure prediction result, and the push module 880 is used to push alarm information to notify the operations and maintenance team that a novel fault exists.
[0173] Optionally, for each component, the operation log includes the alarm level of each sub-module in the component, and the operational data includes the performance indicators of each sub-module in the component over multiple time periods. The generation module 850 is specifically used to: convert the alarm level of each sub-module into a first value corresponding to that sub-module; for each time period, convert the performance indicators of each sub-module in the time period into a second value corresponding to that sub-module in the time period; for any sub-module, obtain a third value corresponding to the sub-module based on the second values corresponding to the sub-module in the multiple time periods; and combine the first value and the third value corresponding to each sub-module to form a target vector.
[0174] Optionally, the generation module 850 is specifically used to: normalize the alarm levels of each sub-module to obtain the first value corresponding to each sub-module.
[0175] Optionally, the generation module 850 is specifically used to: for each of the multiple time periods, normalize the performance indicators of each sub-module in that time period to obtain the second value corresponding to each sub-module in that time period.
[0176] Optionally, the generation module 850 is specifically used to: for any one of the sub-modules, calculate the average of the second values corresponding to the sub-module over the multiple time periods to obtain the third value corresponding to the sub-module.
[0177] It should be understood that the device embodiments and the method embodiments correspond to each other, and similar descriptions can be found in the method embodiments. To avoid repetition, further details are not provided here. Specifically, the device 800 shown in Figure 8 can perform the method embodiment corresponding to Figure 6, and the foregoing and other operations and / or functions of each module in the device 800 respectively implement the corresponding processes of the method in Figure 6. For the sake of brevity, they are not described in detail here.
[0178] The apparatus 800 of this application embodiment has been described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that these functional modules can be implemented in hardware, in software instructions, or in a combination of hardware and software modules. Specifically, the steps of the method embodiments in this application can be completed by integrated logic circuits in the processor's hardware and / or by software instructions. The steps of the method disclosed in this application embodiment can be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. Optionally, the software module can be located in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, etc. This storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps in the above method embodiments.
[0179] Figure 9 is a schematic block diagram of an electronic device 900 provided in an embodiment of this application.
[0180] As shown in Figure 9, the electronic device 900 may include:
[0181] A memory 910 and a processor 920. The memory 910 is used to store a computer program and transfer the program code to the processor 920. In other words, the processor 920 can retrieve and run the computer program from the memory 910 to implement the methods described in the embodiments of this application.
[0182] For example, the processor 920 can be used to execute the above-described method embodiments according to instructions in the computer program.
[0183] In some embodiments of this application, the processor 920 may include, but is not limited to:
[0184] General-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0185] In some embodiments of this application, the memory 910 includes, but is not limited to:
[0186] Volatile memory and / or non-volatile memory. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
[0187] In some embodiments of this application, the computer program may be divided into one or more modules, which are stored in the memory 910 and executed by the processor 920 to complete the method provided in this application. The one or more modules may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program in the electronic device.
[0188] As shown in Figure 9, the electronic device may also include:
[0189] A transceiver 930, which can be connected to the processor 920 or the memory 910.
[0190] The processor 920 can control the transceiver 930 to communicate with other devices; specifically, it can send information or data to other devices or receive information or data sent by other devices. The transceiver 930 may include a transmitter and a receiver. The transceiver 930 may further include one or more antennas.
[0191] It should be understood that the various components in the electronic device are connected through a bus system, which includes a data bus, a power bus, a control bus, and a status signal bus.
[0192] This application also provides a computer storage medium storing a computer program thereon, which, when executed by a computer, enables the computer to perform the methods of the above-described method embodiments. Alternatively, embodiments of this application also provide a computer program product containing instructions that, when executed by a computer, cause the computer to perform the methods of the above-described method embodiments.
[0193] When implemented using software, it can be implemented entirely or partially as a computer program product. This computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid-state disk (SSD)).
[0194] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0195] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or modules may be electrical, mechanical, or other forms.
[0196] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. For example, the functional modules in the various embodiments of this application may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.
[0197] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A model training method, characterized in that it comprises: obtaining multiple input / output record information of a storage server cluster and the actual component failure results respectively corresponding to the multiple input / output record information; for each input / output record information in the multiple input / output record information: determining the target storage server in the storage server cluster that generated the input / output record information, obtaining information about each component of the target storage server, and obtaining the operation logs and operational data of each component within a preset time period based on the information of each component, wherein, for each component, the operation log includes the alarm level of each sub-module in the component, and the operational data includes the performance indicators of each sub-module in the component over multiple time periods; normalizing the alarm level of each sub-module to obtain the first value corresponding to each sub-module; for each of the multiple time periods, normalizing the performance indicators of each sub-module in that time period to obtain the second value corresponding to each sub-module in that time period; for any one of the sub-modules, calculating the average of the second values corresponding to the sub-module in the multiple time periods to obtain the third value corresponding to the sub-module; combining the first value and the third value corresponding to each sub-module to form a target vector; and inputting the target vector into a target neural network model to obtain the component failure prediction result corresponding to the input / output record information, wherein the preset time period includes the generation time of the input / output record information; and training the target neural network model based on the actual component failure results and the component failure prediction results corresponding to the multiple input / output record information.
2. The method according to claim 1, characterized in that the actual component failure result includes: the actual failure type; and the component failure prediction result includes: the predicted failure type.
3. The method according to claim 2, characterized in that the actual component failure result also includes: the actual failure cause; and the component failure prediction result also includes: the predicted failure cause.
4. The method according to any one of claims 1-3, characterized in that the multiple input / output record information includes: at least one input / output pending alarm record information.
5. The method according to claim 4, characterized in that the multiple input / output record information also includes: at least one normal input / output record information.
6. The method according to claim 5, characterized in that, for any one of the at least one normal input / output record information, the actual fault type corresponding to the normal input / output record information is a novel fault type.
7. A fault location method, characterized in that it comprises: obtaining target input / output record information of a storage server cluster; determining the target storage server in the storage server cluster that generated the target input / output record information; obtaining information about each component of the target storage server; obtaining, based on the information of each component, the operation logs and operational data of each component within a preset time period, wherein the preset time period includes the generation time of the target input / output record information; generating a target vector based on the operation logs and the operational data, wherein, for each of the components, the operation log includes the alarm level of each sub-module in the component, and the operational data includes the performance indicators of each sub-module in the component over multiple time periods; and inputting the target vector into the target neural network model trained by the method of any one of claims 1-6 to obtain the component fault prediction result corresponding to the target input / output record information.
8. The method according to claim 7, characterized in that the component failure prediction result includes: the predicted failure type; and the method further comprises: if the predicted failure type is a novel fault type, displaying the component failure prediction result and pushing alarm information to alert the operations and maintenance team that a novel fault exists.
9. The method according to claim 7 or 8, characterized in that generating the target vector based on the operation logs and the operational data comprises: converting the alarm level of each sub-module into a first value corresponding to each sub-module; for each of the multiple time periods, converting the performance indicators of each sub-module in that time period into a second value corresponding to each sub-module in that time period; for any one of the sub-modules, obtaining a third value corresponding to the sub-module based on the second values corresponding to the sub-module in the multiple time periods; and combining the first value and the third value corresponding to each sub-module to form the target vector.
10. The method according to claim 9, characterized in that converting the alarm level of each sub-module into a first value corresponding to each sub-module comprises: normalizing the alarm level of each sub-module to obtain the first value corresponding to each sub-module.
11. The method according to claim 9, characterized in that converting, for each of the multiple time periods, the performance indicators of each sub-module in that time period into a second value corresponding to each sub-module in that time period comprises: for each of the multiple time periods, normalizing the performance indicators of each sub-module in that time period to obtain the second value corresponding to each sub-module in that time period.
12. The method according to claim 9, characterized in that obtaining, for any one of the sub-modules, a third value corresponding to the sub-module based on the second values corresponding to the sub-module in the multiple time periods comprises: for any one of the sub-modules, calculating the average of the second values corresponding to the sub-module in the multiple time periods to obtain the third value corresponding to the sub-module.
13. A model training device, characterized in that it comprises: an acquisition module, used to acquire multiple input / output record information of a storage server cluster and the actual component failure results respectively corresponding to the multiple input / output record information; a processing module, used to, for each of the multiple input / output record information: determine the target storage server in the storage server cluster that generated the input / output record information, obtain information about each component of the target storage server, and obtain the operation logs and operational data of each component within a preset time period based on the information of each component, wherein, for each component, the operation log includes the alarm level of each sub-module in the component, and the operational data includes the performance indicators of each sub-module in the component over multiple time periods; normalize the alarm level of each sub-module to obtain the first value corresponding to each sub-module; for each of the multiple time periods, normalize the performance indicators of each sub-module in that time period to obtain the second value corresponding to each sub-module in that time period; for any one of the sub-modules, calculate the average of the second values corresponding to the sub-module in the multiple time periods to obtain the third value corresponding to the sub-module; combine the first value and the third value corresponding to each sub-module to form a target vector; and input the target vector into a target neural network model to obtain the component failure prediction result corresponding to the input / output record information, wherein the preset time period includes the generation time of the input / output record information; and a training module, used to train the target neural network model based on the actual component failure results and the component failure prediction results corresponding to the multiple input / output record information.
14. A fault location device, characterized in that it comprises: a first acquisition module, used to acquire target input / output record information of a storage server cluster; a determination module, used to determine the target storage server in the storage server cluster that generated the target input / output record information; a second acquisition module, used to acquire information about each component of the target storage server; a third acquisition module, used to acquire, based on the information of each component, the operation logs and operational data of each component within a preset time period, wherein the preset time period includes the generation time of the target input / output record information; a generation module, used to generate a target vector based on the operation logs and the operational data, wherein, for each of the components, the operation log includes the alarm level of each sub-module in the component, and the operational data includes the performance indicators of each sub-module in the component over multiple time periods; and an input module, used to input the target vector into the target neural network model trained by the method of any one of claims 1-6 to obtain the component fault prediction result corresponding to the target input / output record information.
15. An electronic device, characterized in that it comprises: a processor and a memory, the memory being used to store a computer program, and the processor being used to invoke and run the computer program stored in the memory to perform the method of any one of claims 1 to 12.
16. A computer-readable storage medium, characterized in that it is used to store a computer program that causes a computer to perform the method of any one of claims 1 to 12.