Feature extraction method and device, electronic equipment and storage medium
By extracting temporal and spatial features from the raw logs of a cloud system, then processing and fusing them, the method addresses the reliance on a single data source in memory fault detection for cloud computing systems and improves the accuracy and stability of fault analysis and prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIBABA CLOUD COMPUTING CO LTD
- Filing Date
- 2022-12-02
- Publication Date
- 2026-06-19
AI Technical Summary
In cloud computing systems, the memory fault detection process suffers from a single data source, a small number of features, and a lack of feature fusion, resulting in low accuracy and instability in fault analysis and prediction, and poor prediction performance.
By acquiring raw logs and configuration information during device operation, the original temporal features of memory timing anomalies and the original spatial features of memory space anomalies are extracted, and then processed and fused to generate rich high-level features. These features are then combined with the temporal and spatial features for memory fault detection.
It improves the accuracy and stability of memory fault analysis and prediction, enhances prediction effectiveness, and ensures the normal operation of cloud systems.
Figure CN116127292B_ABST
Description
Technical Field
[0001] This application relates to the field of equipment testing technology, and in particular to a feature extraction method, a feature extraction device, an electronic device, and a computer-readable storage medium.
Background Technology
[0002] With the rapid development of cloud computing technologies and the increasing maturity of cloud platforms, a large share of software and hardware development and deployment now relies on the cloud. Because cloud services are foundational and widely used, any anomaly can have extensive negative impacts, including economic and reputational damage, so ensuring the high availability of cloud services is crucial. Among the cloud nodes of a cloud computing system, memory failure is one of the important causes of cloud service unavailability. When memory failures in cloud nodes are detected, memory failure signals are captured for fault analysis or prediction. However, because abnormal events (such as a large number of memory errors) have a certain probability of affecting system stability, the collected data may be biased, which affects the accuracy of fault analysis and prediction.
Summary of the Invention
[0003] This application provides a feature extraction method, apparatus, electronic device, and computer-readable storage medium to solve or partially solve the problems of low accuracy, instability, and poor prediction performance in fault analysis and prediction caused by a single data source, a small number of features, and a lack of feature fusion during memory detection.
[0004] This application discloses a feature extraction method for device detection, including:
[0005] Obtain the original logs generated while the device is running, as well as the device configuration information of the device;
[0006] Extract the original temporal features corresponding to memory timing anomaly events and the original spatial features corresponding to memory space anomaly events from the different original logs.
[0007] Based on the original temporal features, memory fault features of the device are extracted at the statistical time level to obtain temporal features corresponding to the device. Based on the original spatial features, memory fault features of the device are extracted at the hardware fault level to obtain spatial features corresponding to the device.
[0008] The device configuration information, the spatial features, and the temporal features are used as memory features for fault detection of the device's memory.
[0009] Optionally, extracting the original temporal features corresponding to the memory timing anomaly events from the different original logs includes:
[0010] Extract, from the different raw logs, the memory timing anomaly events that occur within a first time window, as well as the number of occurrences of the memory timing anomaly events in each statistical period within the first time window. The first time window is divided into several statistical periods of equal length.
[0011] The original logs include at least two of the following: hardware fault logs, error detection and correction logs, system operation logs, and baseboard management controller logs.
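The counting step above can be sketched as follows. This is a minimal illustration, assuming the event timestamps have already been parsed from the logs into seconds; the function name `count_per_period`, its parameters, and the sample timestamps are hypothetical and do not come from this application.

```python
from typing import List


def count_per_period(event_times: List[float], window_start: float,
                     window_length: float, num_periods: int) -> List[int]:
    """Count events in each equal-length statistical period of one time window."""
    period = window_length / num_periods
    counts = [0] * num_periods
    for t in event_times:
        offset = t - window_start
        if 0 <= offset < window_length:
            # min() guards against floating-point rounding at the upper edge
            idx = min(int(offset // period), num_periods - 1)
            counts[idx] += 1
    return counts


# Hypothetical timestamps (seconds) of one anomaly event type, merged from two logs.
times = [5, 30, 61, 62, 200, 359, 400]
counts = count_per_period(times, window_start=0, window_length=360, num_periods=6)
print(counts)  # [2, 2, 0, 1, 0, 1]; the event at t=400 lies outside the window
```

Events outside the window are ignored here; whether a real implementation drops or carries them over is a design choice this application does not specify.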
[0012] Optionally, the step of extracting memory fault features of the device from a statistical time perspective based on the original time features to obtain time features corresponding to the device includes:
[0013] Following the chronological order of the first time window, the quantity sequence of the memory timing anomaly events within the first time window is generated from the number of occurrences of the memory timing anomaly events in each statistical period.
[0014] Obtain the sequence mean and the sequence standard deviation of the quantity sequence;
[0015] Based on at least one of the number of occurrences of the memory timing anomaly events within each statistical period, the sequence mean, and the sequence standard deviation, memory fault features of the device are extracted at the statistical time level to obtain the time features corresponding to the device.
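As a small worked example of the sequence mean and sequence standard deviation, the sketch below uses Python's standard `statistics` module. The use of the population standard deviation is an assumption; this application does not state whether the population or sample form is intended, and the sample sequence is hypothetical.

```python
import statistics
from typing import List, Tuple


def sequence_stats(quantity_sequence: List[int]) -> Tuple[float, float]:
    """Return (sequence mean, sequence standard deviation) of per-period counts."""
    mean = statistics.fmean(quantity_sequence)
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = statistics.pstdev(quantity_sequence)
    return mean, std


# Hypothetical quantity sequence: occurrences in six consecutive statistical periods.
seq_mean, seq_std = sequence_stats([2, 2, 0, 1, 0, 1])
print(seq_mean, seq_std)
```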
[0016] Optionally, the first time window includes at least two non-overlapping sub-windows. The step of extracting memory fault features of the device from the statistical time level based on at least one of the number of occurrences of the memory timing anomaly events within the statistical period, the mean of each sequence, and the standard deviation of the sequence, to obtain the time features corresponding to the device, includes:
[0017] The current total number of occurrences and the standard deviation of occurrence of the memory timing anomaly events within the first time window are calculated using the number of occurrences of the memory timing anomaly events within each statistical period.
[0018] Obtain the historical total number of occurrences of the memory timing anomaly event in the previous time window corresponding to the first time window, and calculate the inter-window increment of the memory timing anomaly event in the first time window relative to the previous time window based on the current total number of occurrences and the historical total number of occurrences;
[0019] The number of occurrences of the memory timing anomaly event within each statistical period is used to calculate the first occurrence count of the memory timing anomaly event in the first sub-window of the first time window and the second occurrence count in the second sub-window of the first time window, and the intra-window increment of the memory timing anomaly event in the second sub-window relative to the first sub-window is calculated based on the first occurrence count and the second occurrence count.
[0020] Using the quantity sequence, the sequence mean, and the sequence standard deviation, calculate the peak value (kurtosis) and skewness of the probability density distribution corresponding to the quantity sequence;
[0021] Based on the memory timing anomaly event and the first time window, Cartesian products are calculated with at least one of the current total number of occurrences, the standard deviation of occurrence, the inter-window increment, the intra-window increment, the peak value, and the skewness to generate a time feature corresponding to the device.
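The statistics listed above could be combined as in the following sketch. Several points are assumptions rather than statements of this application: increments are taken as simple differences, the window is split into two halves as its sub-windows, the peak value is interpreted as kurtosis computed from standardized moments, and the event names `CE` and `UE`, the window label `6h`, and all function names are hypothetical. The Cartesian product is illustrated by keying each feature with an (event, window, statistic) tuple.

```python
import math
from typing import Dict, List, Tuple


def window_statistics(counts: List[int], prev_total: int) -> Dict[str, float]:
    """Per-window statistics of one anomaly event type."""
    n = len(counts)
    total = sum(counts)
    mean = total / n
    std = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    half = n // 2  # assumption: two sub-windows of equal size
    first, second = sum(counts[:half]), sum(counts[half:])
    if std > 0:
        skew = sum((c - mean) ** 3 for c in counts) / (n * std ** 3)
        kurt = sum((c - mean) ** 4 for c in counts) / (n * std ** 4)
    else:
        skew = kurt = 0.0  # constant sequence: moments are undefined, use 0
    return {
        "total": total,
        "std": std,
        "inter_window_increment": total - prev_total,
        "intra_window_increment": second - first,
        "kurtosis": kurt,
        "skewness": skew,
    }


def build_time_features(counts_by_event: Dict[str, List[int]], window: str,
                        prev_totals: Dict[str, int]) -> Dict[Tuple[str, str, str], float]:
    """One feature per (event, window, statistic) combination."""
    features = {}
    for event, counts in counts_by_event.items():
        stats = window_statistics(counts, prev_totals.get(event, 0))
        for stat_name, value in stats.items():
            features[(event, window, stat_name)] = value
    return features


feats = build_time_features(
    {"CE": [2, 2, 0, 1, 0, 1], "UE": [0, 0, 0, 0, 0, 0]}, "6h", {"CE": 4})
print(len(feats))  # 2 events x 1 window x 6 statistics = 12 features
```

Repeating the call for further window lengths and merging the results would extend the product along the window axis.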
[0022] Optionally, the memory includes several memory banks, each memory bank including several memory cells, and the step of extracting the original spatial features corresponding to memory space anomaly events from the different original logs includes:
[0023] Obtain several memory error modes corresponding to memory space anomaly events. The memory error modes reflect the severity and distribution of the underlying memory hardware failures of the device.
[0024] Extract, from the different original logs, at least one piece of memory error information for each memory error mode that occurred within the second time window. The memory error information includes at least the number of memory banks in which errors occurred under the same memory error mode, the number of memory cells in which errors occurred, and the number of errors that occurred.
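The three counts named above can be illustrated with a short sketch. The record layout `ErrorRecord` (error mode, bank, row, column) is a hypothetical, already-parsed form of a log entry and is not defined by this application; a memory cell is identified here by its (bank, row, column) triple.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class ErrorRecord(NamedTuple):
    """Hypothetical parsed log entry for one memory error."""
    mode: str
    bank: int
    row: int
    col: int


def raw_spatial_features(records: Iterable[ErrorRecord]) -> Dict[str, Dict[str, int]]:
    """Per error mode: distinct failing banks, distinct failing cells, error count."""
    banks = defaultdict(set)
    cells = defaultdict(set)
    errors = defaultdict(int)
    for r in records:
        banks[r.mode].add(r.bank)
        cells[r.mode].add((r.bank, r.row, r.col))
        errors[r.mode] += 1
    return {
        mode: {"num_banks": len(banks[mode]),
               "num_cells": len(cells[mode]),
               "num_errors": errors[mode]}
        for mode in errors
    }


records = [
    ErrorRecord("single_hard", 0, 1, 1),
    ErrorRecord("single_hard", 0, 1, 1),
    ErrorRecord("multiple", 1, 2, 3),
    ErrorRecord("multiple", 1, 2, 4),
    ErrorRecord("multiple", 2, 5, 5),
]
raw = raw_spatial_features(records)
print(raw)
```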
[0025] Optionally, the second time window includes at least two non-overlapping sub-windows, and the step of extracting memory fault features of the device from the hardware fault level based on the original spatial features to obtain spatial features corresponding to the device includes:
[0026] Using the memory error information, calculate the feature mean, feature variance, and feature maximum value of memory errors occurring under each memory error mode;
[0027] Extract from the memory error information the first error count in the first sub-window of the second time window and the second error count in the second sub-window of the second time window, and calculate the error increment of the memory space anomaly event in the second sub-window relative to the first sub-window based on the first error count and the second error count.
[0028] Based on the memory space anomaly event and the second time window, Cartesian products are calculated with at least one of the feature mean, the feature variance, the feature maximum value, and the error increment to generate spatial features corresponding to the device.
[0029] Optionally, the step of using each of the memory error information to calculate the feature mean, feature variance, and feature maximum value of memory errors occurring under each of the memory error modes includes:
[0030] Extract from the memory error information the maximum number of memory cells and the maximum number of errors that occurred in different memory banks under the same error mode;
[0031] Using the number of memory banks corresponding to each memory error mode, calculate a first memory bank mean and variance for the memory banks in which errors occur under the same error mode, and a second memory bank mean and variance for the memory banks in which errors occur under different error modes;
[0032] Using the number of memory cells corresponding to each memory error mode, calculate a first memory cell mean and variance for the memory cells in which errors occur under the same error mode, and a second memory cell mean and variance for the memory cells in which errors occur under different error modes;
[0033] Using the number of errors corresponding to each memory error mode, calculate the first error mean for errors occurring in different memory banks under the same error mode, the second error mean for errors occurring in different memory cells of the same memory bank under the same error mode, and the third error mean for errors occurring under different error modes.
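A simplified sketch of the mean, variance, and maximum calculations follows. It covers only two of the groupings described above (across banks within one error mode, and across error modes); the input layout, the sample numbers, and the use of population variance are assumptions made for illustration.

```python
import statistics
from typing import Dict, List


def summarize(values: List[int]) -> Dict[str, float]:
    """Mean, population variance, and maximum of a list of counts."""
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),  # assumption: population form
        "max": max(values),
    }


# Hypothetical input: error counts per failing bank, grouped by error mode.
errors_per_bank = {"single_hard": [2, 4, 6], "multiple": [3, 3]}

# Statistics across banks under the same error mode.
per_mode = {mode: summarize(v) for mode, v in errors_per_bank.items()}
# Statistics across different error modes, using each mode's total error count.
across_modes = summarize([sum(v) for v in errors_per_bank.values()])
print(per_mode, across_modes)
```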
[0034] Optionally, the step of extracting from the memory error information the first error count in the first sub-window of the second time window and the second error count in the second sub-window of the second time window, and calculating the error increment of the memory space anomaly event in the second sub-window relative to the first sub-window based on the first error count and the second error count, includes:
[0035] Extract from the memory error information the first memory bank quantity, the first memory cell quantity, and the first error count corresponding to each memory error mode in the first sub-window of the second time window, and the second memory bank quantity, the second memory cell quantity, and the second error count corresponding to each memory error mode in the second sub-window;
[0036] Using the first memory bank quantity and the second memory bank quantity, calculate the memory bank increment of the memory space anomaly event in the second sub-window relative to the first sub-window; using the first memory cell quantity and the second memory cell quantity, calculate the memory cell increment of the memory space anomaly event in the second sub-window relative to the first sub-window; and using the first error count and the second error count, calculate the error occurrence increment of the memory space anomaly event in the second sub-window relative to the first sub-window.
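The three increments can be sketched as per-mode differences between the two sub-windows. Taking the increment as a plain difference (second minus first) is an assumption, as are the dictionary keys and the function name `spatial_increments`; a mode missing from one sub-window is treated as all zeros.

```python
from typing import Dict

Counts = Dict[str, Dict[str, int]]
KEYS = ("num_banks", "num_cells", "num_errors")


def spatial_increments(first: Counts, second: Counts) -> Counts:
    """Per error mode: second sub-window counts minus first sub-window counts."""
    zero = dict.fromkeys(KEYS, 0)
    result = {}
    for mode in sorted(set(first) | set(second)):
        a, b = first.get(mode, zero), second.get(mode, zero)
        result[mode] = {key: b[key] - a[key] for key in KEYS}
    return result


first_sub = {"multiple": {"num_banks": 1, "num_cells": 2, "num_errors": 2}}
second_sub = {
    "multiple": {"num_banks": 2, "num_cells": 5, "num_errors": 7},
    "mixed": {"num_banks": 1, "num_cells": 2, "num_errors": 3},
}
inc = spatial_increments(first_sub, second_sub)
print(inc)
```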
[0037] Optionally, the memory error modes include at least single-event soft error mode, single-event hard error mode, multiple-event error mode, and mixed error mode; wherein, the single-event soft error mode is a mode in which only one memory cell in a memory bank experiences an error once, the single-event hard error mode is a mode in which only one memory cell in a memory bank experiences an error at least twice, the multiple-event error mode is a mode in which at least two memory cells in a row or column of a memory bank each experience an error once, and the mixed error mode is a mode in which at least two different memory cells in at least two rows or two columns of a memory bank experience errors.
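To make the four mode definitions concrete, the sketch below classifies the errors observed in a single memory bank. Each error is given as a hypothetical (row, column) pair. One case is not covered by the definitions above, namely several cells in one row or column where some cell errs more than once; this sketch assigns it to the mixed mode, which is an assumption.

```python
from collections import Counter
from typing import List, Tuple


def classify_bank(errors: List[Tuple[int, int]]) -> str:
    """Classify the error mode of one memory bank from (row, col) error events."""
    per_cell = Counter(errors)
    if not per_cell:
        return "none"
    if len(per_cell) == 1:
        # Exactly one failing cell: soft if it erred once, hard if repeatedly.
        count = next(iter(per_cell.values()))
        return "single_soft" if count == 1 else "single_hard"
    rows = {row for row, _ in per_cell}
    cols = {col for _, col in per_cell}
    same_line = len(rows) == 1 or len(cols) == 1
    if same_line and all(n == 1 for n in per_cell.values()):
        return "multiple"
    # Cells spread over several rows and columns, or the uncovered case above.
    return "mixed"


print(classify_bank([(1, 1)]))          # single_soft
print(classify_bank([(1, 1), (1, 1)]))  # single_hard
print(classify_bank([(1, 1), (1, 2)]))  # multiple
print(classify_bank([(1, 1), (2, 2)]))  # mixed
```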
[0038] Optionally, the method further includes:
[0039] The memory features are input into the memory detection model to perform memory detection on the device, and the detection results for the memory detection are output.
[0040] This application also discloses a feature extraction device for device detection, comprising:
[0041] The log acquisition module is used to acquire the original logs generated while the device is running, as well as the device configuration information of the device.
[0042] The original feature extraction module is used to extract, from the different original logs, the original temporal features corresponding to memory timing anomaly events and the original spatial features corresponding to memory space anomaly events.
[0043] The feature extraction module is used to extract memory fault features of the device at the statistical time level based on the original temporal features to obtain temporal features corresponding to the device, and to extract memory fault features of the device at the hardware fault level based on the original spatial features to obtain spatial features corresponding to the device.
[0044] The memory feature determination module is used to use the device configuration information, the spatial features, and the temporal features as memory features for fault detection of the device's memory.
[0045] Optionally, the original feature extraction module is specifically used for:
[0046] Extract, from the different raw logs, the memory timing anomaly events that occur within a first time window, as well as the number of occurrences of the memory timing anomaly events in each statistical period within the first time window. The first time window is divided into several statistical periods of equal length.
[0047] The original logs include at least two of the following: hardware fault logs, error detection and correction logs, system operation logs, and baseboard management controller logs.
[0048] Optionally, the feature extraction module is specifically used for:
[0049] Following the chronological order of the first time window, the quantity sequence of the memory timing anomaly events within the first time window is generated from the number of occurrences of the memory timing anomaly events in each statistical period.
[0050] Obtain the sequence mean and the sequence standard deviation of the quantity sequence;
[0051] Based on at least one of the number of occurrences of the memory timing anomaly events within each statistical period, the sequence mean, and the sequence standard deviation, memory fault features of the device are extracted at the statistical time level to obtain the time features corresponding to the device.
[0052] Optionally, the first time window includes at least two non-overlapping sub-windows, and the feature extraction module is specifically used for:
[0053] The current total number of occurrences and the standard deviation of occurrence of the memory timing anomaly events within the first time window are calculated using the number of occurrences of the memory timing anomaly events within each statistical period.
[0054] Obtain the historical total number of occurrences of the memory timing anomaly event in the previous time window corresponding to the first time window, and calculate the inter-window increment of the memory timing anomaly event in the first time window relative to the previous time window based on the current total number of occurrences and the historical total number of occurrences;
[0055] The number of occurrences of the memory timing anomaly event within each statistical period is used to calculate the first occurrence count of the memory timing anomaly event in the first sub-window of the first time window and the second occurrence count in the second sub-window of the first time window, and the intra-window increment of the memory timing anomaly event in the second sub-window relative to the first sub-window is calculated based on the first occurrence count and the second occurrence count.
[0056] Using the quantity sequence, the sequence mean, and the sequence standard deviation, calculate the peak value (kurtosis) and skewness of the probability density distribution corresponding to the quantity sequence;
[0057] Based on the memory timing anomaly event and the first time window, Cartesian products are calculated with at least one of the current total number of occurrences, the standard deviation of occurrence, the inter-window increment, the intra-window increment, the peak value, and the skewness to generate a time feature corresponding to the device.
[0058] Optionally, the memory includes several memory banks, each memory bank including several memory cells, and the original feature extraction module is specifically used for:
[0059] Obtain several memory error modes corresponding to memory space anomaly events. The memory error modes reflect the severity and distribution of the underlying memory hardware failures of the device.
[0060] Extract, from the different original logs, at least one piece of memory error information for each memory error mode that occurred within the second time window. The memory error information includes at least the number of memory banks in which errors occurred under the same memory error mode, the number of memory cells in which errors occurred, and the number of errors that occurred.
[0061] Optionally, the second time window includes at least two non-overlapping sub-windows, and the feature extraction module is specifically used for:
[0062] Using the memory error information, calculate the feature mean, feature variance, and feature maximum value of memory errors occurring under each memory error mode;
[0063] Extract from the memory error information the first error count in the first sub-window of the second time window and the second error count in the second sub-window of the second time window, and calculate the error increment of the memory space anomaly event in the second sub-window relative to the first sub-window based on the first error count and the second error count.
[0064] Based on the memory space anomaly event and the second time window, Cartesian products are calculated with at least one of the feature mean, the feature variance, the feature maximum value, and the error increment to generate spatial features corresponding to the device.
[0065] Optionally, the feature extraction module is specifically used for:
[0066] Extract from the memory error information the maximum number of memory cells and the maximum number of errors that occurred in different memory banks under the same error mode;
[0067] Using the number of memory banks corresponding to each memory error mode, calculate a first memory bank mean and variance for the memory banks in which errors occur under the same error mode, and a second memory bank mean and variance for the memory banks in which errors occur under different error modes;
[0068] Using the number of memory cells corresponding to each memory error mode, calculate a first memory cell mean and variance for the memory cells in which errors occur under the same error mode, and a second memory cell mean and variance for the memory cells in which errors occur under different error modes;
[0069] Using the number of errors corresponding to each memory error mode, calculate the first error mean for errors occurring in different memory banks under the same error mode, the second error mean for errors occurring in different memory cells of the same memory bank under the same error mode, and the third error mean for errors occurring under different error modes.
[0070] Optionally, the feature extraction module is specifically used for:
[0071] Extract from the memory error information the first memory bank quantity, the first memory cell quantity, and the first error count corresponding to each memory error mode in the first sub-window of the second time window, and the second memory bank quantity, the second memory cell quantity, and the second error count corresponding to each memory error mode in the second sub-window;
[0072] Using the first memory bank quantity and the second memory bank quantity, calculate the memory bank increment of the memory space anomaly event in the second sub-window relative to the first sub-window; using the first memory cell quantity and the second memory cell quantity, calculate the memory cell increment of the memory space anomaly event in the second sub-window relative to the first sub-window; and using the first error count and the second error count, calculate the error occurrence increment of the memory space anomaly event in the second sub-window relative to the first sub-window.
[0073] Optionally, the memory error modes include at least single-event soft error mode, single-event hard error mode, multiple-event error mode, and mixed error mode; wherein, the single-event soft error mode is a mode in which only one memory cell in a memory bank experiences an error once, the single-event hard error mode is a mode in which only one memory cell in a memory bank experiences an error at least twice, the multiple-event error mode is a mode in which at least two memory cells in a row or column of a memory bank each experience an error once, and the mixed error mode is a mode in which at least two different memory cells in at least two rows or two columns of a memory bank experience errors.
[0074] Optionally, the device further includes:
[0075] The fault detection module is used to input the memory features into the memory detection model to perform memory detection on the device, and output the detection results for the memory detection.
[0076] This application also discloses an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
[0077] The memory is used to store computer programs;
[0078] When the processor executes a program stored in the memory, it implements the method described in the embodiments of this application.
[0079] This application also discloses a computer-readable storage medium storing instructions that, when executed by one or more processors, cause the processors to perform the methods described in this application.
[0080] The embodiments of this application have the following advantages:
[0081] In this embodiment, a cloud system may include several devices, and each device may include at least one memory. To ensure accurate memory fault analysis and prediction for the cloud system, the original logs generated during operation of each device and the device configuration information of each device can be obtained. The original temporal features corresponding to memory timing anomaly events and the original spatial features corresponding to memory space anomaly events are then extracted from the different original logs. Based on the original temporal features, memory fault features are extracted at the statistical time level to obtain the time features corresponding to the device; based on the original spatial features, memory fault features are extracted at the hardware fault level to obtain the spatial features corresponding to the device. The device configuration information, spatial features, and temporal features are then used as memory features for fault detection. In this way, multi-source raw data is extracted from different logs, which ensures data diversity; the raw data is further processed and fused to generate rich high-order features, which increases the number of features; and temporal and spatial features are fused to further enrich the memory features. As a result, memory fault analysis and prediction based on the obtained memory features can be more accurate and stable, and the prediction effect can be improved.
Attached Figure Description
[0082] Figure 1 is a flowchart of the steps of a feature extraction method for device detection provided in the embodiments of this application;
[0083] Figure 2 is a schematic diagram of memory feature extraction provided in the embodiments of this application;
[0084] Figure 3 is a structural block diagram of a feature extraction device for device detection provided in an embodiment of this application;
[0085] Figure 4 is a block diagram of an electronic device provided in the embodiments of this application.
Detailed Implementation
[0086] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0087] As an example, a cloud system may include several computing nodes (hereinafter referred to as devices), each of which can be configured with at least one memory module (i.e., a memory stick). For cloud systems, memory failure is one of the most significant causes of cloud service unavailability. Therefore, effectively analyzing and predicting memory failures in cloud systems is crucial for ensuring the normal operation of cloud services. Memory failure analysis and prediction rely on the collection of memory features. Rich and effective memory features can effectively guarantee the accuracy, stability, and predictive performance of memory failure analysis and prediction. However, related technologies suffer from problems such as a single data source for memory features, a limited number of features, and insufficient spatiotemporal fusion. This significantly reduces prediction accuracy and stability during memory failure analysis and prediction, resulting in poor prediction performance and severely impacting the normal operation of cloud systems.
[0088] One of the core inventive points of this application lies in obtaining the original logs corresponding to the device generated during the operation of the cloud system, as well as the device configuration information corresponding to the device. Then, it extracts the original temporal features corresponding to memory timing anomalies and the original spatial features corresponding to memory space anomalies from different original logs. Based on the original temporal features, it extracts memory fault features from the statistical time perspective to obtain the time features corresponding to the device. Based on the original spatial features, it extracts memory fault features from the hardware fault perspective to obtain the spatial features corresponding to the device. Finally, it uses the device configuration information, spatial features, and temporal features as memory features for memory fault detection. Thus, in the process of memory fault analysis and prediction of the cloud system, it extracts multi-source original data based on different logs, ensuring data diversity. Furthermore, it performs secondary processing and fusion of the original data to generate rich high-order features, increasing the number of features. Simultaneously, it fuses the time and spatial features, further enriching the memory features. This ensures that subsequent memory fault analysis and prediction of the cloud system based on the obtained memory features can effectively guarantee high accuracy and stability in fault analysis and prediction, and improve prediction results.
[0089] Referring to Figure 1, a flowchart of the steps of a feature extraction method for device detection provided in an embodiment of this application is shown. The method may specifically include the following steps:
[0090] Step 101: Obtain the original logs generated while the device is running, and the device configuration information of the device;
[0091] Step 102: Extract the original temporal features corresponding to memory timing anomaly events and the original spatial features corresponding to memory space anomaly events from the different original logs.
[0092] Step 103: Extract memory fault features of the device from the statistical time level based on the original time features to obtain time features corresponding to the device; and extract memory fault features of the device from the hardware fault level based on the original spatial features to obtain spatial features corresponding to the device.
[0093] Step 104: Use the device configuration information, the spatial characteristics, and the temporal characteristics as memory characteristics for fault detection of the device's memory.
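Step 104 can be pictured as merging the three feature groups into one flat record per device. The sketch below is only an illustration of that merge: the prefixes, the feature names, and the sample values are hypothetical, and the real feature layout is not specified by this application.

```python
from typing import Dict, Union

Value = Union[int, float, str]


def assemble_memory_features(config: Dict[str, Value],
                             spatial: Dict[str, Value],
                             temporal: Dict[str, Value]) -> Dict[str, Value]:
    """Merge configuration, spatial, and temporal features under distinct prefixes."""
    features: Dict[str, Value] = {}
    groups = (("config", config), ("spatial", spatial), ("temporal", temporal))
    for prefix, group in groups:
        for name, value in group.items():
            # The prefix keeps equally named features from different groups apart.
            features[f"{prefix}.{name}"] = value
    return features


memory_features = assemble_memory_features(
    config={"dimm_count": 8, "cpu_vendor": "vendor_a"},
    spatial={"multiple.num_errors": 3},
    temporal={"CE.6h.total": 6},
)
print(sorted(memory_features))
```

A record of this shape could then be passed to whatever memory detection model is used downstream.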
[0094] Optionally, memory feature extraction can be performed on the memory of the computing nodes of the cloud system. A cloud system can include several computing nodes, each of which can be an electronic device configured with at least one memory. In the process of providing cloud services, each computing node can host multiple virtual machines, which then provide the cloud services. When the memory of a computing node fails, the virtual machines hosted on it can no longer operate normally, which significantly reduces the availability of cloud services. Therefore, memory failure analysis and prediction for computing nodes are necessary.
[0095] Before memory fault analysis and prediction are performed, the original logs generated during device operation and the device configuration information can be obtained. The original logs can include at least two of the following: hardware fault logs (Machine Check Exception logs), error detection and correction logs (Error Detection and Correction logs), system operation logs (Kernel logs), and baseboard management controller logs (Baseboard Management Controller logs). Hardware fault logs record information about faults occurring during device operation; error detection and correction logs record information about errors that occur and are corrected during device operation (such as error correction codes and uncorrectable errors); system operation logs record information reported by the operating system; and baseboard management controller logs record information about the operation of the baseboard management controller. This application does not impose any limitation on this.
[0096] In addition, the device configuration information can be the static information features corresponding to the device, which may include the device's CPU (central processing unit) manufacturer, CPU model, CPU parameters, memory manufacturer, number of memory modules, memory module size, memory module parameters, and other feature information that characterizes the static configuration of the device.
[0097] In this embodiment, after obtaining different raw logs, temporal raw features corresponding to memory timing anomalies and spatial raw features corresponding to memory space anomalies can be extracted from the different raw logs. Memory timing anomalies can be memory error events associated with time information, and memory space anomalies can be memory error events associated with the underlying memory hardware. Accordingly, temporal raw features can include feature information related to time information, and spatial raw features can include feature information related to the underlying hardware. This allows for feature extraction from different logs, enriching the data source, and ensuring data diversity by extracting spatiotemporal features based on different logs.
[0098] In practical implementation, corresponding raw time features can be extracted from different raw logs, and these raw time features can be further processed to obtain time features for fault analysis and prediction of the device. At the same time, corresponding raw spatial features can be extracted, and these raw spatial features can be further processed to obtain spatial features for fault analysis and prediction of the device. Furthermore, the time features and spatial features can be fused, thereby generating rich high-order features, increasing the number of features, and further enriching the memory features. As a result, when performing memory fault analysis and prediction of the cloud system based on the obtained memory features, it is possible to effectively ensure high accuracy and stability of fault analysis and prediction, and improve the prediction effect.
[0099] It should be noted that both memory timing exceptions and memory space exceptions can be classified as memory exceptions, and this application does not impose any restrictions on this.
[0100] For the extraction of time features, first, the memory timing anomaly events occurring within a first time window, and the number of occurrences of these events within each statistical period of the first time window, can be extracted from the different raw logs. The first time window is divided into several statistical periods of equal length. Next, following the chronological order of the first time window and using the number of occurrences in each statistical period, a quantity sequence of the memory timing anomaly events within the first time window can be generated, and the mean and standard deviation of this sequence are obtained. Finally, based on at least one of the number of occurrences within the statistical periods, the sequence mean, and the sequence standard deviation, memory fault features are extracted for the device at the statistical time level, yielding the time features corresponding to the device. Optionally, the sequence mean can be the average number of occurrences of memory timing anomaly events per statistical period of the first time window. Correspondingly, the sequence standard deviation can be obtained from the sequence mean and used to describe how the memory timing anomaly events are distributed across the statistical periods of the first time window.
[0101] The first time window can be the collection period set for feature extraction of the device, for example 7 days, 1 day, or 3 hours. It can include at least two non-overlapping sub-windows; for example, when a first time window of length t includes two sub-windows, they can be the first half window and the second half window, corresponding to the time periods 0 to t/2 and t/2 to t of the first time window. Furthermore, the first time window can be divided into several statistical periods of equal length, such as one hour or one minute each, resulting in several statistical periods within a collection period. This allows the memory timing anomaly events occurring within the collection period, and the number of occurrences of these events in each statistical period, to be extracted from the different raw logs. For example, given a memory timing anomaly event x, a time window of 1 hour, and a statistical period of 1 minute, the number of occurrences of event x in each minute of that hour can be obtained from the different raw logs.
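As a minimal illustrative sketch of the counting described above (not part of the claimed method), the following Python function bins event timestamps into equal statistical periods over the window (window_end - window, window_end]. The function name count_sequence and its parameters are hypothetical, and the raw logs are assumed to have already been parsed into a list of timestamps for one event type. The population form of the standard deviation is also an assumption.

```python
import math
from datetime import datetime, timedelta
from statistics import mean, pstdev


def count_sequence(event_times, window_end, window, period):
    """Return per-period occurrence counts in chronological order.

    Hypothetical helper: event_times are datetimes of one anomaly event type;
    the window (window_end - window, window_end] is split into equal periods.
    """
    window_start = window_end - window
    n_periods = int(window / period)  # assumes window is a multiple of period
    counts = [0] * n_periods
    for ts in event_times:
        if window_start < ts <= window_end:
            # Each period is half-open on the left: (start, start + period].
            idx = math.ceil((ts - window_start) / period) - 1
            counts[min(idx, n_periods - 1)] += 1
    return counts


# Example: a 1-hour window with 1-minute statistical periods.
end = datetime(2022, 12, 2, 12, 0, 0)
events = [
    end - timedelta(minutes=59, seconds=30),
    end - timedelta(minutes=59, seconds=10),
    end - timedelta(seconds=5),
    end - timedelta(hours=2),  # outside the window, ignored
]
seq = count_sequence(events, end, timedelta(hours=1), timedelta(minutes=1))
seq_mean = mean(seq)   # sequence mean
seq_std = pstdev(seq)  # sequence standard deviation (population form assumed)
```

Here seq has 60 entries, one per minute, and the event that falls two hours before the window end is not counted.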
[0102] After obtaining the original time features, secondary processing can be performed on them to obtain the time features corresponding to the device. Specifically, the current total number of occurrences and the occurrence standard deviation of the memory timing anomaly events in the first time window can be calculated from the number of occurrences within each statistical period. The historical total number of occurrences of the memory timing anomaly events in the previous time window corresponding to the first time window can also be obtained, and the inter-window increment of the first time window relative to the previous time window can be calculated from the current total and the historical total. Then, using the number of occurrences within each statistical period, the first occurrence count in the first sub-window of the first time window and the second occurrence count in the second sub-window can be calculated, and the intra-window increment of the memory timing anomaly events in the second sub-window relative to the first sub-window can be calculated from these two counts. The peak value (kurtosis) and skewness of the probability density distribution corresponding to the quantity sequence are calculated from the quantity sequence, the sequence mean, and the sequence standard deviation. Finally, the Cartesian product of the memory timing anomaly events, the first time window, and at least one of the current total number of occurrences, the occurrence standard deviation, the inter-window increment, the intra-window increment, the peak value, and the skewness is taken to generate the time features corresponding to the device.
Thus, rich time features can be obtained by combining different memory anomaly events, time windows, and time aggregation statistics, and the device's memory can be characterized along the time dimension through these features.
[0103] In one example, the temporal features can be the Cartesian product of memory fault events, time windows, and aggregation statistics. The aggregation statistics can include Sum (summation), Std (standard deviation), Diff (difference), Delta (increment), Kurtosis (kurtosis of the distribution), Skew (asymmetry of the distribution), etc. Specifically, let x represent a specific memory-fault-related event, t a point in time, and w the length of the time window. Assuming the number of fault events is counted per minute, within the time window (t - w, t] the minute-level counts of event x form a sequence X, such as the sequence of per-minute CE counts of a node over the past day. Denote the mean of X as the sequence mean and the standard deviation of X as the sequence standard deviation. At time point t, each statistic can be defined as follows:
[0104] Sum(x, t, w): The total number of occurrences of event x within the time window;
[0105] Std(x, t, w): The standard deviation of event x calculated within the time window (i.e., the occurrence standard deviation).
[0106] Diff(x, t, w): The increment of the memory anomaly event in the second sub-window relative to the first sub-window (i.e., the intra-window increment);
[0107] Delta(x, t, w): The increment of the memory anomaly event in the current window relative to the previous window (i.e., the inter-window increment);
[0108] Kurtosis(x, t, w): The kurtosis (peak value) of the probability density distribution of sequence X;
[0109] Skew(x, t, w): The skewness of the probability density distribution of sequence X.
[0110] By combining different memory anomaly events, time windows, and time aggregation statistics, rich time features can be obtained, and the device's memory can be characterized along the time dimension.
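The aggregation statistics listed above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the function name time_statistics and the dictionary keys are hypothetical, the two increments are taken as simple differences, the standard deviation uses the population form, and kurtosis and skewness use the standardized-moment definitions (non-excess kurtosis). The description does not fix any of these choices.

```python
from statistics import mean, pstdev


def time_statistics(counts, previous_total):
    """Aggregate one quantity sequence (per-period counts) into statistics.

    counts: per-period occurrence counts within the current time window.
    previous_total: total occurrences in the previous time window.
    """
    n = len(counts)
    total = sum(counts)
    mu = mean(counts)       # sequence mean
    sigma = pstdev(counts)  # sequence standard deviation (population form)

    # Split the window into a first half and a second half (two sub-windows).
    half = n // 2
    first_count = sum(counts[:half])
    second_count = sum(counts[half:])

    if sigma > 0:
        skew = sum((c - mu) ** 3 for c in counts) / (n * sigma ** 3)
        kurt = sum((c - mu) ** 4 for c in counts) / (n * sigma ** 4)
    else:
        # A constant sequence has no spread; report 0.0 by convention here.
        skew = kurt = 0.0

    return {
        "sum": total,
        "std": sigma,
        "intra_window_increment": second_count - first_count,
        "inter_window_increment": total - previous_total,
        "kurtosis": kurt,
        "skew": skew,
    }


stats = time_statistics([0, 0, 1, 3], previous_total=1)
```

In this usage example the second half of the window holds all four occurrences, so the intra-window increment is 4, and the inter-window increment relative to the previous total of 1 is 3.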
[0111] The process of extracting spatial features can be performed from the underlying hardware level. For each memory, there can be several memory banks, and each memory bank can include several memory units. Specifically, at the micro level, a DRAM (Dynamic Random Access Memory) memory chip can be composed of multiple banks (memory banks, physical storage units). Each bank can be regarded as a two-dimensional matrix composed of multiple cells (memory units). Each cell can store several data bits and is indexed by a unique row ID and column ID.
[0112] Regarding spatial features, multiple memory error modes can be defined for different memory space anomaly events. These memory error modes reflect the severity and distribution of the underlying memory hardware failures. Specifically, the memory error modes can include a single-event soft error mode, a single-event hard error mode, a multiple-event error mode, and a mixed error mode. The single-event soft error mode is a mode in which only one memory cell in a memory bank experiences an error once. The single-event hard error mode is a mode in which only one memory cell in a memory bank experiences an error at least twice. The multiple-event error mode is a mode in which at least two memory cells in a row or column of a memory bank each experience an error once. The mixed error mode is a mode in which at least two different memory cells in at least two rows or columns of a memory bank experience errors.
[0113] In one example, memory error modes may include Single_soft_error, Single_hard_error, Faulty_row, Faulty_column, Corrupt_row, Corrupt_column, etc., and the relevant definitions can be described as follows:
[0114] Single_soft_error: A cell in a bank has experienced an error, and that cell is not in a Faulty_row / Faulty_column of the bank;
[0115] Single_hard_error: A cell in a bank has experienced at least two errors, and that cell is not in a Faulty_row / Faulty_column of the bank;
[0116] Faulty_row / Faulty_column: At least two distinct cells in a row / column of a bank have experienced an error;
[0117] Corrupt_row / Corrupt_column: A Faulty_row / Faulty_column that has experienced a Single_hard_error (i.e., at least two different cells in at least two rows / columns have encountered an error).
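The definitions above can be illustrated with a small Python sketch that classifies the errors observed in a single bank. The function name classify_bank and the input format (one (row, column) pair per error) are hypothetical. The sketch also assumes one particular reading of the corrupt modes, namely that a faulty row or column becomes corrupt when at least one of its cells has erred at least twice; other readings of the definition are possible.

```python
from collections import Counter, defaultdict


def classify_bank(errors):
    """Return the set of error modes present in one bank.

    errors: iterable of (row, col) pairs, one pair per observed error.
    """
    per_cell = Counter(errors)  # error count for each erroneous cell
    cells_by_row = defaultdict(set)
    cells_by_col = defaultdict(set)
    for row, col in per_cell:
        cells_by_row[row].add((row, col))
        cells_by_col[col].add((row, col))

    # A row/column is faulty when at least two distinct cells in it have erred.
    faulty_rows = {r for r, cells in cells_by_row.items() if len(cells) >= 2}
    faulty_cols = {c for c, cells in cells_by_col.items() if len(cells) >= 2}

    modes = set()
    for (row, col), count in per_cell.items():
        if row in faulty_rows or col in faulty_cols:
            continue  # cells in faulty rows/columns are handled below
        modes.add("Single_soft_error" if count == 1 else "Single_hard_error")

    for r in faulty_rows:
        hard = any(per_cell[cell] >= 2 for cell in cells_by_row[r])
        modes.add("Corrupt_row" if hard else "Faulty_row")
    for c in faulty_cols:
        hard = any(per_cell[cell] >= 2 for cell in cells_by_col[c])
        modes.add("Corrupt_column" if hard else "Faulty_column")
    return modes
```

For instance, two errors on the same cell yield Single_hard_error, while single errors on two different cells of the same row yield Faulty_row.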
[0118] In a specific implementation, several memory error modes corresponding to memory space anomaly events are obtained. Then, at least one piece of memory error information corresponding to each memory error mode that occurred within a second time window can be extracted from the different raw logs. The memory error information includes at least the number of memory banks in which errors occurred under the same memory error mode, the number of memory units in which errors occurred, and the number of errors that occurred. The second time window can include at least two non-overlapping sub-windows.
[0119] After obtaining the memory error information corresponding to each memory error mode, spatial aggregation of the memory error information corresponding to each memory error mode can be performed at the memory and device levels to calculate the feature parameters corresponding to each memory error mode. Specifically, using each memory error information, the feature mean, feature variance, and feature maximum value of memory errors occurring under each memory error mode are calculated. Simultaneously, the number of first errors occurring in the first sub-window and the number of second errors occurring in the second sub-window within the second time window are extracted from the memory error information. Based on the number of first and second errors, the error increment of memory space anomaly events in the second sub-window relative to the first sub-window is calculated. Then, Cartesian products are performed on the memory space anomaly events and the second time window with at least one of the feature mean, feature variance, feature maximum value, and error increment to generate spatial features corresponding to the device. This process generates rich spatial features, increasing the number of features, and fuses temporal and spatial features, further enriching the memory features. Consequently, when performing memory fault analysis and prediction on cloud systems based on the obtained memory features, the accuracy and stability of fault analysis and prediction can be effectively guaranteed, and the prediction effect can be improved.
[0120] The calculation of the feature mean, feature variance, and feature maximum value can include: extracting, from the memory error information, the maximum number of erroneous memory cells and the maximum error count across different memory banks under the same error mode; using the number of memory banks corresponding to each memory error mode, calculating the first memory bank mean and variance for memory banks that err under the same error mode, and the second memory bank mean and variance for memory banks that err under different error modes; using the number of memory cells corresponding to each memory error mode, calculating the first memory cell mean and variance for memory cells that err under the same error mode, and the second memory cell mean and variance for memory cells that err under different error modes; and using the error count corresponding to each memory error mode, calculating the first error mean for errors occurring in different memory banks under the same error mode, the second error mean for errors occurring in different memory cells of the same memory bank under the same error mode, and the third error mean for errors occurring under different error modes. Optionally, the maximum value can be the highest value among the number of erroneous cells and the error count in different banks under the same memory error mode.
[0121] The calculation of the error increment may include: extracting, from the memory error information, the first memory bank count, the first memory unit count, and the first error count corresponding to each memory error mode in the first sub-window of the second time window; extracting the second memory bank count, the second memory unit count, and the second error count corresponding to each memory error mode in the second sub-window; then, using the first and second memory bank counts, calculating the memory bank increment of the memory space anomaly event in the second sub-window relative to the first sub-window; using the first and second memory unit counts, calculating the memory unit increment of the memory space anomaly event in the second sub-window relative to the first sub-window; and using the first and second error counts, calculating the error occurrence increment of the memory space anomaly event in the second sub-window relative to the first sub-window.
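A minimal sketch of the spatial aggregation and the sub-window increments, for a single memory error mode, might look as follows in Python. The names spatial_stats and increments, the input layout (one (erroneous cell count, error count) tuple per erroneous bank), and the use of population variance are all assumptions made for illustration; increments are taken as simple differences.

```python
from statistics import mean, pvariance


def spatial_stats(per_bank):
    """Aggregate one error mode over the erroneous banks of a device.

    per_bank: list of (erroneous_cell_count, error_count), one per bank.
    """
    if not per_bank:
        return {"bank_count": 0}
    cells = [c for c, _ in per_bank]
    errors = [e for _, e in per_bank]
    return {
        "bank_count": len(per_bank),
        "cell_mean": mean(cells),
        "cell_var": pvariance(cells),
        "cell_max": max(cells),
        "error_mean": mean(errors),
        "error_var": pvariance(errors),
        "error_max": max(errors),
    }


def increments(first, second):
    """Difference of (bank_count, cell_count, error_count) between sub-windows.

    Returns second minus first, element by element.
    """
    return tuple(b - a for a, b in zip(first, second))


# Two erroneous banks under one mode: 2 cells / 2 errors and 2 cells / 4 errors.
stats = spatial_stats([(2, 2), (2, 4)])
# First sub-window: 1 bank, 2 cells, 2 errors; second: 3 banks, 6 cells, 6 errors.
delta = increments((1, 2, 2), (3, 6, 6))
```

Here delta holds the memory bank increment, the memory unit increment, and the error occurrence increment, in that order.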
[0122] In one example, assuming the spatial features of a device are collected over one day, the corresponding raw spatial features can be extracted from the different raw logs. After classifying each memory error event according to the memory error modes, the following information can be obtained:
[0123] Single_soft_error: 5 banks have errors; 5 cells have errors, with a total error count of 5 (assuming each bank has exactly one cell that experiences one error).
[0124] Single_hard_error: 5 banks have errors; 5 cells have errors, with a total error count of 10 (assuming each bank has exactly one cell that experiences two errors).
[0125] Faulty_row / Faulty_column: 3 banks have errors; 6 cells have errors, with a total error count of 6 (assuming each bank has 2 different cells in the same row / column that each experience one error).
[0126] Corrupt_row / Corrupt_column: 5 banks have errors; 10 cells have errors, with a total error count of 20 (assuming each bank has 2 different cells in the same row / column that each experience two errors).
[0127] Using the above parameters, spatiotemporal feature fusion can be performed using the corresponding time windows. The total amount, maximum value, increment, mean, and variance of various spatial features corresponding to each memory error mode in the corresponding time window can be calculated. For example, in addition to setting 1 day, different time windows such as 1 week, 2 weeks, 4 weeks, and online lifecycle can be set to perform spatiotemporal feature fusion and enrich memory features.
[0128] In practical implementation, after obtaining the temporal and spatial features corresponding to the device through the above process, these features, along with the device configuration information, can be used as memory features for memory fault analysis and prediction. Specifically, refer to Figure 2, which shows a schematic diagram of memory feature extraction provided in an embodiment of this application. After original features are extracted from the original logs (MCE logs, EDAC logs, and other related logs), the original features corresponding to spatial and temporal anomaly events are obtained. Then, the Cartesian product (anomaly event * time window * statistical operation) can be calculated to obtain the corresponding spatial and temporal features. At the same time, the machine configuration information corresponding to the device can be obtained and used as static information features. The temporal features, spatial features, and static information features can then be input into a corresponding machine learning classification model for memory fault analysis and prediction, yielding the analysis and prediction results for the device. Thus, in the process of memory fault analysis and prediction for the cloud system, multi-source original data is extracted from different logs to ensure data diversity; the original data is further processed and fused to generate rich high-order features, increasing the number of features; and the temporal and spatial features are fused to further enrich the memory features. Furthermore, the extracted memory features can be input into a corresponding memory detection model to detect memory faults in the device, thereby obtaining the device's memory status.
Thus, when memory fault analysis and prediction are performed on the cloud system based on the obtained memory features, high accuracy and stability of fault analysis and prediction can be effectively ensured, and the prediction effect can be improved.
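The Cartesian product (anomaly event * time window * statistical operation) can be illustrated by enumerating feature names in Python. The event names, window labels, and the name format below are purely illustrative assumptions; only the six statistic names come from the description.

```python
from itertools import product

# Illustrative inputs; the event and window labels are hypothetical.
events = ["CE", "UE"]
windows = ["3h", "1d", "7d"]
statistics = ["Sum", "Std", "Diff", "Delta", "Kurtosis", "Skew"]


def feature_names(events, windows, statistics):
    """Enumerate one feature name per (event, window, statistic) triple."""
    return [
        f"{event}|{window}|{stat}"
        for event, window, stat in product(events, windows, statistics)
    ]


names = feature_names(events, windows, statistics)
```

With 2 events, 3 windows, and 6 statistics, this yields 36 feature names, which shows how a small number of raw signals expands into a much larger set of features.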
[0129] It should be noted that the embodiments of this application include, but are not limited to, the examples described above. It is understood that those skilled in the art can make further settings according to actual needs under the guidance of the ideas in the embodiments of this application, and this application does not impose any restrictions on this.
[0130] In this embodiment, a cloud system may include several devices (computing nodes), and each computing node may include at least one memory. To ensure the accuracy of memory fault analysis and prediction for the cloud system, the original logs generated during the operation of each device, as well as the device configuration information, can be obtained. Then, the original temporal features corresponding to memory timing anomaly events and the original spatial features corresponding to memory space anomaly events can be extracted from the different original logs. Based on the original temporal features, memory fault features are extracted from the statistical time level to obtain the time features corresponding to the device; based on the original spatial features, memory fault features are extracted from the hardware fault level to obtain the spatial features corresponding to the device. Then, the device configuration information, spatial features, and temporal features are used as memory features for fault detection. In the process of memory fault analysis and prediction of cloud systems, multi-source raw data is extracted based on different logs to ensure data diversity. The raw data is further processed and fused to generate rich high-order features, increasing the number of features. At the same time, temporal and spatial features are fused to further enrich the memory features. As a result, when performing memory fault analysis and prediction of cloud systems based on the obtained memory features, the accuracy and stability of fault analysis and prediction can be effectively guaranteed, and the prediction effect can be improved.
[0131] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this application are not limited to the described order of actions, because according to the embodiments of this application, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.
[0132] Referring to Figure 3, a structural block diagram of a feature extraction device for device detection provided in an embodiment of this application is shown, which may specifically include the following modules:
[0133] The log acquisition module 301 is used to acquire the original logs generated during device operation that correspond to the device, as well as the device configuration information corresponding to the device.
[0134] The original feature extraction module 302 is used to extract the original temporal features corresponding to memory timing anomaly events from different original logs, and to extract the original spatial features corresponding to memory space anomaly events.
[0135] The feature extraction module 303 is used to extract memory fault features of the device from the statistical time level based on the original time features to obtain time features corresponding to the device, and to extract memory fault features of the device from the hardware fault level based on the original spatial features to obtain spatial features corresponding to the device.
[0136] The memory feature determination module 304 is used to use the device configuration information, the spatial features, and the temporal features as memory features for fault detection of the device's memory.
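As a rough sketch of what the memory feature determination module 304 produces, the following Python function merges the three feature groups into one flat record. The function name build_memory_features, the key prefixes, and the sample values are hypothetical and are not taken from this application.

```python
def build_memory_features(config, temporal, spatial):
    """Merge static, temporal, and spatial features into one flat record.

    Each argument is a dict mapping a feature name to its value. A prefix is
    added per group so that names from different groups cannot collide.
    """
    features = {}
    for prefix, group in (("cfg", config), ("time", temporal), ("space", spatial)):
        for name, value in group.items():
            features[f"{prefix}.{name}"] = value
    return features


record = build_memory_features(
    config={"dimm_count": 8},                      # static information feature
    temporal={"CE|1d|Sum": 12},                    # time feature
    spatial={"Single_hard_error|1d|error_max": 4}, # spatial feature
)
```

A record of this kind could then be passed to a memory detection model, as described for the fault detection module.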
[0137] In one optional embodiment, the original feature extraction module 302 is specifically used for:
[0138] Extract memory timing anomaly events that occur within a first time window from different raw logs, as well as the number of occurrences of the memory timing anomaly events in each statistical period within the first time window. The first time window is divided into several statistical periods of equal length.
[0139] The original logs include at least two of the following: hardware fault logs, error and correction logs, system operation logs, and baseboard management controller logs.
[0140] In an optional embodiment, the first time window includes at least two non-overlapping sub-windows, and the feature extraction module 303 is specifically used for:
[0141] Based on the chronological order of the first time window, the number sequence of the memory timing anomaly events within the first time window is generated by using the number of occurrences of the memory timing anomaly events in each statistical period.
[0142] Obtain the mean and standard deviation of the sequence corresponding to the quantity sequence;
[0143] Based on at least one of the number of occurrences of the memory timing anomaly events within the statistical period, the mean of each sequence, and the standard deviation of the sequence, memory fault features of the device are extracted from the statistical time level to obtain the time features corresponding to the device.
[0144] In one optional embodiment, the feature extraction module 303 is specifically used for:
[0145] The total number of occurrences and the standard deviation of the memory timing anomaly events within the first time window are calculated using the number of occurrences of the memory timing anomaly events within the statistical period.
[0146] Obtain the historical total number of occurrences of the memory timing anomaly event in the previous time window corresponding to the first time window, and calculate the inter-window increment of the memory timing anomaly event in the first time window relative to the previous time window based on the current total number of occurrences and the historical total number of occurrences;
[0147] The number of occurrences of the memory timing anomaly event within the statistical period is used to calculate the first occurrence count of the memory timing anomaly event in the first sub-window of the first time window and the second occurrence count in the second sub-window of the first time window, and the intra-window increment of the memory timing anomaly event in the second sub-window relative to the first sub-window is calculated based on the first occurrence count and the second occurrence count.
[0148] Using the quantity sequence, the sequence mean, and the sequence standard deviation, calculate the peak value and skewness of the probability density distribution corresponding to the quantity sequence;
[0149] Based on the memory timing anomaly event and the first time window, Cartesian products are calculated with at least one of the current total number of occurrences, the standard deviation of occurrence, the inter-window increment, the intra-window increment, the peak value, and the skewness to generate a time feature corresponding to the device.
[0150] In one optional embodiment, the memory includes a plurality of memory banks, each of which includes a plurality of memory units, and the original feature extraction module 302 is specifically used for:
[0151] Obtain several memory error modes corresponding to memory space anomaly events. The memory error modes are used to reflect the severity and distribution of underlying memory hardware failures of the device.
[0152] Extract at least one piece of memory error information corresponding to each memory error mode that occurred within the second time window from the different original logs. The memory error information includes at least the number of memory banks in which errors occurred under the same memory error mode, the number of memory units in which errors occurred, and the number of errors that occurred.
[0153] In one alternative embodiment, the second time window includes at least two non-overlapping sub-windows, and the feature extraction module 303 is specifically used for:
[0154] Using the memory error information, calculate the feature mean, feature variance, and feature maximum value of memory errors occurring under each memory error mode;
[0155] Extract, from the memory error information, the first error count that occurred in the first sub-window of the second time window and the second error count that occurred in the second sub-window, and calculate the error increment of the memory space anomaly event in the second sub-window relative to the first sub-window based on the first error count and the second error count.
[0156] Based on the memory space anomaly event and the second time window, Cartesian products are calculated with at least one of the feature mean, the feature variance, the feature maximum value, and the error increment to generate spatial features corresponding to the device.
[0157] In one optional embodiment, the feature extraction module 303 is specifically used for:
[0158] Extract, from the memory error information, the maximum number of erroneous memory cells and the maximum number of errors that occurred in different memory banks under the same error mode;
[0159] Using the number of memory banks corresponding to each memory error mode, calculate the first memory bank mean and variance for memory banks that err under the same error mode, and the second memory bank mean and variance for memory banks that err under different error modes;
[0160] Using the number of memory cells corresponding to each memory error mode, calculate the first memory cell mean and variance for memory cells that err under the same error mode, and the second memory cell mean and variance for memory cells that err under different error modes;
[0161] Using the number of errors corresponding to each memory error mode, calculate the first error mean for errors occurring in different memory banks under the same error mode, the second error mean for errors occurring in different memory units of the same memory bank under the same error mode, and the third error mean for errors occurring under different error modes.
[0162] In one optional embodiment, the feature extraction module 303 is specifically used for:
[0163] Extract, from the memory error information, the first memory bank count, the first memory unit count, and the first error count corresponding to each memory error mode in the first sub-window of the second time window, and the second memory bank count, the second memory unit count, and the second error count corresponding to each memory error mode in the second sub-window;
[0164] Using the first memory bank count and the second memory bank count, calculate the memory bank increment of the memory space anomaly event in the second sub-window relative to the first sub-window; using the first memory unit count and the second memory unit count, calculate the memory unit increment of the memory space anomaly event in the second sub-window relative to the first sub-window; and using the first error count and the second error count, calculate the error occurrence increment of the memory space anomaly event in the second sub-window relative to the first sub-window.
[0165] In one optional embodiment, the memory error modes include at least a single-event soft error mode, a single-event hard error mode, a multiple-event error mode, and a mixed error mode; wherein, the single-event soft error mode is a mode in which only one memory cell in a memory bank experiences an error once, the single-event hard error mode is a mode in which only one memory cell in a memory bank experiences an error at least twice, the multiple-event error mode is a mode in which at least two memory cells in a row or column of a memory bank each experience an error once, and the mixed error mode is a mode in which at least two different memory cells in at least two rows or two columns of a memory bank experience errors.
[0166] In one alternative embodiment, the device further includes:
[0167] The fault detection module is used to input the memory features into the memory detection model to perform memory detection on the device, and output the detection results for the memory detection.
[0168] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.
[0169] In addition, this application also provides an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the various processes of the above-described feature extraction method embodiments for device detection and achieves the same technical effect. To avoid repetition, it will not be described again here.
[0170] This application also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the various processes of the above-described feature extraction method embodiments and achieves the same technical effect. To avoid repetition, it will not be described again here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
[0171] Figure 4 is a schematic diagram of the hardware structure of an electronic device for implementing the various embodiments of this application.
[0172] The electronic device 400 includes, but is not limited to, components such as: a radio frequency unit 401, a network module 402, an audio output unit 403, an input unit 404, a sensor 405, a display unit 406, a user input unit 407, an interface unit 408, a memory 409, a processor 410, and a power supply 411. Those skilled in the art will understand that the electronic device structure described in the embodiments of this application does not constitute a limitation on the electronic device. An electronic device may include more or fewer components than illustrated, or combine certain components, or have different component arrangements. In the embodiments of this application, the electronic device includes, but is not limited to, mobile phones, tablet computers, laptops, PDAs, in-vehicle terminals, wearable devices, and pedometers.
[0173] It should be understood that, in this embodiment, the radio frequency unit 401 can be used for receiving and transmitting signals during information transmission or calls. Specifically, it receives downlink data from the base station and processes it with the processor 410; additionally, it transmits uplink data to the base station. Typically, the radio frequency unit 401 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, a duplexer, etc. Furthermore, the radio frequency unit 401 can also communicate with networks and other devices through a wireless communication system.
[0174] The electronic device provides users with wireless broadband internet access through network module 402, such as helping users send and receive emails, browse web pages, and access streaming media.
[0175] The audio output unit 403 can convert audio data received by the radio frequency unit 401 or the network module 402 or stored in the memory 409 into audio signals and output them as sound. Furthermore, the audio output unit 403 can also provide audio output related to specific functions performed by the electronic device 400 (e.g., call signal reception sound, message reception sound, etc.). The audio output unit 403 includes a speaker, a buzzer, and a receiver, etc.
[0176] Input unit 404 is used to receive audio or video signals. Input unit 404 may include a graphics processing unit (GPU) 4041 and a microphone 4042. The GPU 4041 processes image data of still images or videos acquired by an image capture device (such as a camera) in video capture mode or image capture mode. The processed image frames can be displayed on display unit 406. The image frames processed by GPU 4041 can be stored in memory 409 (or other storage media) or transmitted via radio frequency unit 401 or network module 402. Microphone 4042 can receive sound and process such sound into audio data. The processed audio data can be converted into a format that can be transmitted to a mobile communication base station via radio frequency unit 401 in telephone call mode.
[0177] The electronic device 400 also includes at least one sensor 405, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 4061 according to the ambient light level, and the proximity sensor can turn off the display panel 4061 and / or backlight when the electronic device 400 is moved to the ear. As a type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in various directions (generally three axes). When stationary, it can detect the magnitude and direction of gravity and can be used to identify the posture of the electronic device (such as landscape / portrait switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, tapping), etc. The sensor 405 may also include a fingerprint sensor, pressure sensor, iris sensor, molecular sensor, gyroscope, barometer, hygrometer, thermometer, infrared sensor, etc., which will not be described in detail here.
[0178] The display unit 406 is used to display information input by the user or information provided to the user. The display unit 406 may include a display panel 4061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
[0179] The user input unit 407 can be used to receive input numeric or character information, and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 407 includes a touch panel 4071 and other input devices 4072. The touch panel 4071, also known as a touch screen, can collect touch operations performed by the user on or near it (such as operations performed by the user using a finger, stylus, or any suitable object or accessory on or near the touch panel 4071). The touch panel 4071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position and the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 410, and receives and executes commands sent by the processor 410. In addition, the touch panel 4071 can be implemented using various types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 4071, the user input unit 407 may also include other input devices 4072. Specifically, the other input devices 4072 may include, but are not limited to, physical keyboards, function keys (such as volume control buttons, power buttons, etc.), trackballs, mice, joysticks, etc., which will not be described in detail here.
[0180] Furthermore, the touch panel 4071 can cover the display panel 4061. When the touch panel 4071 detects a touch operation on or near it, it transmits the information to the processor 410 to determine the type of touch event. Subsequently, the processor 410 provides corresponding visual output on the display panel 4061 according to the type of touch event. It is understood that in one embodiment, the touch panel 4071 and the display panel 4061 are implemented as two independent components to realize the input and output functions of the electronic device. However, in some embodiments, the touch panel 4071 and the display panel 4061 can be integrated to realize the input and output functions of the electronic device. The specific implementation is not limited here.
[0181] Interface unit 408 serves as an interface for connecting external devices to electronic device 400. For example, external devices may include a wired or wireless headphone port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device with an identification module, an audio input / output (I / O) port, a video I / O port, a headphone port, and so on. Interface unit 408 can be used to receive input from external devices (e.g., data, power, etc.) and transmit the received input to one or more components within electronic device 400, or it can be used to transmit data between electronic device 400 and external devices.
[0182] The memory 409 can be used to store software programs and various data. The memory 409 may primarily include a program storage area and a data storage area. The program storage area may store the operating system, applications required for at least one function (such as sound playback, image playback, etc.), etc.; the data storage area may store data created based on the use of the mobile phone (such as audio data, phonebook, etc.). Furthermore, the memory 409 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device.
[0183] The processor 410 is the control center of the electronic device. It connects various parts of the electronic device via various interfaces and lines. By running or executing software programs and / or modules stored in the memory 409, and by calling data stored in the memory 409, it performs various functions and processes data, thereby providing overall monitoring of the electronic device. The processor 410 may include one or more processing units; preferably, the processor 410 may integrate an application processor and a modem processor. The application processor mainly handles the operating system, user interface, and applications, while the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into the processor 410.
[0184] The electronic device 400 may also include a power supply 411 (such as a battery) that supplies power to various components. Preferably, the power supply 411 can be logically connected to the processor 410 through a power management system, thereby enabling functions such as managing charging, discharging, and power consumption through the power management system.
[0185] In addition, the electronic device 400 includes some functional modules not shown, which will not be described in detail here.
[0186] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0187] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0188] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.
[0189] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed in this application can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0190] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0191] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0192] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0193] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0194] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, or optical disks.
[0195] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A feature extraction method for device detection, characterized in that it comprises: obtaining the original logs generated during operation of a device, as well as the device configuration information corresponding to the device; extracting, from the different original logs, the original temporal features corresponding to memory timing anomaly events and the original spatial features corresponding to memory space anomaly events; based on the original temporal features, extracting memory fault features of the device from the statistical time level to obtain time features corresponding to the device; based on the original spatial features, extracting memory fault features of the device from the hardware fault level to obtain spatial features corresponding to the device; and using the device configuration information, the spatial features, and the time features as memory features for fault detection of the device's memory; wherein the step of extracting the original spatial features corresponding to memory space anomaly events from the different original logs includes: obtaining several memory error modes corresponding to the memory space anomaly events; and extracting, from the different original logs, at least one piece of memory error information corresponding to each memory error mode that occurred within a second time window, the second time window including at least two non-overlapping sub-time windows; and wherein the step of extracting memory fault features of the device from the hardware fault level based on the original spatial features to obtain spatial features corresponding to the device includes: using the memory error information, calculating the characteristic parameters of the memory errors occurring under each memory error mode; calculating, based on the memory error information, the error increment of the memory space anomaly event between different sub-time windows; and, based on the memory space anomaly event and the second time window, calculating Cartesian products with at least one of the characteristic parameters and the error increment to generate the spatial features corresponding to the device.
2. The method according to claim 1, characterized in that, The step of extracting the original temporal features corresponding to memory timing anomaly events from different original logs includes: Extract memory timing anomaly events that occur within a first time window from the different original logs, as well as the number of occurrences of the memory timing anomaly events in each statistical period within the first time window. The first time window is divided into several statistical periods of equal length. The original logs include at least two of the following: hardware fault logs, error and correction logs, system operation logs, and baseboard management controller (BMC) logs.
3. The method according to claim 2, characterized in that, The step of extracting memory fault features of the device from a statistical time perspective based on the original time features to obtain time features corresponding to the device includes: Based on the chronological order of the first time window, the number sequence of the memory timing anomaly events within the first time window is generated by using the number of occurrences of the memory timing anomaly events in each statistical period. Obtain the mean and standard deviation of the sequence corresponding to the quantity sequence; Based on at least one of the number of occurrences of the memory timing anomaly events within the statistical period, the mean of each sequence, and the standard deviation of the sequence, memory fault features of the device are extracted from the statistical time level to obtain the time features corresponding to the device.
4. The method of claim 3, wherein, The first time window includes at least two non-overlapping sub-windows. The step of extracting memory fault features of the device from the statistical time level based on at least one of the number of occurrences of the memory timing anomaly event within the statistical period, the sequence mean, and the sequence standard deviation, to obtain the time features corresponding to the device, includes: Using the number of occurrences of the memory timing anomaly events within each statistical period, calculate the current total number of occurrences and the standard deviation of occurrence of the memory timing anomaly events within the first time window; Obtain the historical total number of occurrences of the memory timing anomaly event in the previous time window corresponding to the first time window, and calculate the inter-window increment of the memory timing anomaly event in the first time window relative to the previous time window based on the current total number of occurrences and the historical total number of occurrences; Using the number of occurrences of the memory timing anomaly event within each statistical period, calculate the first occurrence count of the memory timing anomaly event in the first sub-window of the first time window and the second occurrence count in the second sub-window of the first time window, and calculate the intra-window increment of the memory timing anomaly event in the second sub-window relative to the first sub-window based on the first occurrence count and the second occurrence count; Using the quantity sequence, the sequence mean, and the sequence standard deviation, calculate the peak value and skewness of the probability density distribution corresponding to the quantity sequence; Based on the memory timing anomaly event and the first time window, Cartesian products are calculated with at least one of the current total number of occurrences, the standard deviation of occurrence, the inter-window increment, the intra-window increment, the peak value, and the skewness to generate the time features corresponding to the device.
5. The method according to claim 1, characterized in that, The memory includes several memory libraries, and each memory library includes several memory units; the memory error mode is used to reflect the severity and distribution of underlying memory hardware failures of the device; the memory error information includes at least the number of memory libraries corresponding to memory libraries that have errors under the same memory error mode, the number of memory units corresponding to memory units that have errors, and the number of errors that have occurred.
6. The method of claim 5, wherein, The step of extracting memory fault features of the device from the hardware fault level based on the original spatial features to obtain spatial features corresponding to the device includes: Using the memory error information, calculate the feature mean, feature variance, and feature maximum value of memory errors occurring under each memory error mode; Extract, from the memory error information, the first error count occurring in the first sub-window of the second time window and the second error count occurring in the second sub-window of the second time window, and calculate the error increment of the memory space anomaly event in the second sub-window relative to the first sub-window based on the first error count and the second error count; Based on the memory space anomaly event and the second time window, Cartesian products are calculated with at least one of the feature mean, the feature variance, the feature maximum value, and the error increment to generate spatial features corresponding to the device.
7. The method according to claim 6, characterized in that, The step of using the memory error information to calculate the feature mean, feature variance, and feature maximum value of memory errors occurring under each of the memory error modes includes: Extract, from the memory error information, the maximum number of memory units and the maximum number of errors occurring in different memory libraries under the same error mode; Using the number of memory libraries corresponding to each memory error mode, calculate a first memory library mean and variance for the memory libraries in which errors occur under the same error mode, and a second memory library mean and variance for the memory libraries in which errors occur under different error modes; Using the number of memory units corresponding to each memory error mode, calculate a first memory unit mean and variance for the memory units in which errors occur under the same error mode, and a second memory unit mean and variance for the memory units in which errors occur under different error modes; Using the number of errors corresponding to each memory error mode, calculate a first error mean for errors occurring in different memory libraries under the same error mode, a second error mean for errors occurring in different memory units of the same memory library under the same error mode, and a third error mean for errors occurring under different error modes.
8. The method of claim 7, wherein, The step of extracting the number of first errors occurring in the first sub-window and the number of second errors occurring in the second sub-window from the memory error information, and calculating the error increment of the memory space anomaly event in the second sub-window relative to the first sub-window based on the first error count and the second error count, includes: Extract the number of first memory libraries, the number of first memory units, and the number of first errors corresponding to each memory error mode in the first sub-window of the second time window, and the number of second memory libraries, the number of second memory units, and the number of second errors corresponding to each memory error mode in the second sub-window from the memory error information; Using the first memory library quantity and the second memory library quantity, calculate the memory library increment of the memory space anomaly event in the second sub-window relative to the first sub-window; using the first memory unit quantity and the second memory unit quantity, calculate the memory unit increment of the memory space anomaly event in the second sub-window relative to the first sub-window; and using the first error count and the second error count, calculate the error occurrence increment of the memory space anomaly event in the second sub-window relative to the first sub-window.
9. The method according to any one of claims 5 to 8, characterized in that, The memory error modes include at least single-event soft error mode, single-event hard error mode, multiple-event error mode, and mixed error mode; wherein, the single-event soft error mode is a mode in which only one memory cell in a memory bank experiences an error once, the single-event hard error mode is a mode in which only one memory cell in a memory bank experiences an error at least twice, the multiple-event error mode is a mode in which at least two memory cells in a row or column of a memory bank each experience an error once, and the mixed error mode is a mode in which at least two different memory cells in at least two rows or two columns of a memory bank experience errors.
10. The method of claim 1, further comprising: inputting the memory features into a memory detection model to perform memory detection on the device, and outputting the detection results of the memory detection.
11. A feature extraction apparatus for device detection, characterized in that it comprises: a log acquisition module, used to acquire the original logs generated during operation of a device, as well as the device configuration information corresponding to the device; an original feature extraction module, used to extract, from the different original logs, the original temporal features corresponding to memory timing anomaly events and the original spatial features corresponding to memory space anomaly events; a feature extraction module, used to extract memory fault features of the device from the statistical time level based on the original temporal features to obtain time features corresponding to the device, and to extract memory fault features of the device from the hardware fault level based on the original spatial features to obtain spatial features corresponding to the device; and a memory feature determination module, used to use the device configuration information, the spatial features, and the time features as memory features for fault detection of the device's memory; wherein the original feature extraction module is specifically used for: obtaining several memory error modes corresponding to the memory space anomaly events; and extracting, from the different original logs, at least one piece of memory error information corresponding to each memory error mode that occurred within a second time window, the second time window including at least two non-overlapping sub-time windows; and wherein the feature extraction module is specifically used for: using the memory error information, calculating the characteristic parameters of the memory errors occurring under each memory error mode; calculating, based on the memory error information, the error increment of the memory space anomaly event between different sub-time windows; and, based on the memory space anomaly event and the second time window, calculating Cartesian products with at least one of the characteristic parameters and the error increment to generate the spatial features corresponding to the device.
12. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the method as described in any one of claims 1-10.
13. A computer-readable storage medium having instructions stored thereon that, when executed by one or more processors, cause the processors to perform the method as described in any one of claims 1-10.
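For illustration only, the temporal statistics recited in claims 3 and 4 can be sketched in Python as follows. Several points are assumptions rather than statements of the claimed method: the "peak value" of the probability density distribution is interpreted as kurtosis, increments are taken as plain differences, the two sub-windows are the two halves of the quantity sequence, and the Cartesian product is realized as one named feature per (event, window, statistic) combination. All function and key names are hypothetical.

```python
from itertools import product
from statistics import fmean, pstdev
from typing import Dict, Sequence, Tuple


def temporal_stats(counts: Sequence[int], previous_total: int) -> Dict[str, float]:
    """Statistics of one anomaly event over one time window.

    counts: occurrences per equal-length statistical period, in time order.
    previous_total: total occurrences in the preceding time window.
    """
    mu = fmean(counts)
    sigma = pstdev(counts)  # population standard deviation of the sequence
    half = len(counts) // 2  # assumption: sub-windows are the two halves
    first, second = sum(counts[:half]), sum(counts[half:])

    def standardized_moment(k: int) -> float:
        # Defined as 0.0 for a constant sequence to avoid division by zero.
        if sigma == 0:
            return 0.0
        return fmean([(c - mu) ** k for c in counts]) / sigma ** k

    total = sum(counts)
    return {
        "total": float(total),
        "std": sigma,
        "inter_window_increment": float(total - previous_total),
        "intra_window_increment": float(second - first),
        "skewness": standardized_moment(3),
        "kurtosis": standardized_moment(4),  # assumed meaning of "peak value"
    }


def cartesian_features(
    events: Sequence[str],
    windows: Sequence[str],
    counts: Dict[Tuple[str, str], Sequence[int]],
    previous_totals: Dict[Tuple[str, str], int],
) -> Dict[str, float]:
    """One feature per (event, window, statistic) combination."""
    features: Dict[str, float] = {}
    for event, window in product(events, windows):
        stats = temporal_stats(counts[(event, window)], previous_totals[(event, window)])
        for name, value in stats.items():
            features[f"{event}|{window}|{name}"] = value
    return features


stats = temporal_stats([1, 1, 3, 3], previous_total=5)
features = cartesian_features(
    ["ce"], ["1h", "6h"],
    {("ce", "1h"): [1, 1, 3, 3], ("ce", "6h"): [2, 2, 2, 2]},
    {("ce", "1h"): 5, ("ce", "6h"): 8},
)
```

With the sequence `[1, 1, 3, 3]` and a previous total of 5, the sketch yields a total of 8, a standard deviation of 1.0, an inter-window increment of 3, and an intra-window increment of 4. The same pattern would apply to the spatial features of claim 1, with the memory error mode statistics taking the place of the temporal statistics.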