Memory fault prediction method, electronic device and computer readable storage medium

By acquiring various log data for feature engineering and concatenation, and combining them with a pre-trained model, the problem of low accuracy in memory fault prediction in existing technologies has been solved, achieving more accurate memory fault prediction.

CN115981911B · Active · Publication Date: 2026-06-26 · ZTE INTELLIGENT TECH NANJING CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZTE INTELLIGENT TECH NANJING CO LTD
Filing Date
2021-10-12
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, memory fault prediction methods based on ECC counting have low accuracy and cannot effectively predict memory faults, leading to frequent computer crashes and server service interruptions.

Method used

By acquiring various log data, including memory error information address data, memory logs, operating system kernel logs, EDAC logs, performance data, and environmental and location information data, feature engineering is performed to construct and splice the data, and a pre-trained fault prediction model is used to predict memory faults.

Benefits of technology

It improves the accuracy and stability of memory fault prediction, enhances the generalization ability of the fault prediction model, and achieves accurate prediction of memory faults.

Abstract

The embodiment of the application relates to the technical field of big data analysis and artificial intelligence, in particular to a memory fault prediction method, an electronic device and a computer readable storage medium. The memory fault prediction method comprises the following steps: obtaining a plurality of log data of a to-be-tested memory; wherein the plurality of log data at least comprises memory error information address data; performing feature engineering construction according to the plurality of log data to obtain a feature data table corresponding to each of the plurality of log data; splicing the feature data table corresponding to each of the plurality of log data to obtain a feature splicing data table; obtaining a fault prediction result of the to-be-tested memory according to the feature splicing data table and a pre-trained fault prediction model; wherein the fault prediction model is obtained by training a pre-collected training data set, and samples in the training data set comprise a plurality of log data of a plurality of memories, so that the accuracy of predicting memory faults can be improved.

Description

Technical Field

[0001] This invention relates to the fields of big data analytics and artificial intelligence, and in particular to a method for predicting memory faults, an electronic device, and a computer-readable storage medium.

Background Technology

[0002] In big data applications such as servers, unexpected memory failures frequently cause computer crashes, and in severe cases, even disrupt server operations. Therefore, real-time memory fault prediction and health assessment are of significant practical importance for maintaining business reliability and stability.

[0003] Currently, the industry assesses memory failure risk by simply counting the memory error reports in the logs: when the number of Error Checking and Correcting (ECC) occurrences reaches a set threshold, an alarm is reported to the network administrator. In practice, however, there is no strong correlation between the number of ECC occurrences and memory failures, and relying solely on ECC counting to predict memory failures has very low accuracy.

Summary of the Invention

[0004] The main objective of this application is to provide a method, electronic device, and computer-readable storage medium for predicting memory faults, thereby improving the accuracy of memory fault prediction.

[0005] To achieve the above objectives, this application provides a method for predicting memory faults, comprising: acquiring multiple log data of the memory to be tested; wherein the multiple log data includes at least: memory error information address data; constructing feature data tables corresponding to the multiple log data through feature engineering; concatenating the feature data tables corresponding to the multiple log data to obtain a feature concatenation data table; and obtaining a fault prediction result of the memory to be tested based on the feature concatenation data table and a pre-trained fault prediction model; wherein the fault prediction model is trained based on a pre-collected training dataset, and the samples in the training dataset include multiple log data of multiple types of memory.

[0006] To achieve the above objectives, embodiments of this application also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the above-described memory fault prediction method.

[0007] To achieve the above objectives, embodiments of this application also provide a computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the above-described method for predicting memory faults.

[0008] In this embodiment, multiple log data of the memory under test are acquired. These log data include at least memory error information address data. The multiple log data can measure the state of the memory under test from multiple perspectives, thus providing a more comprehensive reference for fault prediction and improving its accuracy. The inclusion of memory error information address data ensures that necessary reference information is provided for fault prediction. By performing feature engineering on each of the multiple log data sets and concatenating the feature data tables corresponding to each log data set after feature engineering, a feature concatenation data table suitable for the input of the fault prediction model can be obtained. Based on the feature concatenation data table and the pre-trained fault prediction model, the fault prediction result of the memory under test can be obtained. Since the training dataset of the fault prediction model includes multiple log data sets of various memory types, the generalization ability and stability of the fault prediction model are stronger, resulting in more accurate prediction results and improved accuracy in predicting memory faults.

Description of the Drawings

[0009] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings are denoted as similar elements. Unless otherwise stated, the figures in the drawings do not constitute a limitation on scale.

[0010] Figure 1 is a flowchart of a memory fault prediction method according to an embodiment of the present invention;

[0011] Figure 2 is an architecture diagram of the memory under test provided according to an embodiment of the present invention;

[0012] Figure 3 is a schematic diagram illustrating the acquisition of various log data according to an embodiment of the present invention;

[0013] Figure 4 is a data preprocessing flowchart provided according to an embodiment of the present invention;

[0014] Figure 5 is a flowchart of obtaining the feature data table corresponding to the memory error information address data according to an embodiment of the present invention;

[0015] Figure 6 is a flowchart of training a fault prediction model according to an embodiment of the present invention;

[0016] Figure 7 is a labeling diagram of a binary classification and regression method according to an embodiment of the present invention;

[0017] Figure 8 is a structural diagram of an electronic device provided according to an embodiment of the present invention.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the various embodiments of this application will be described in detail below with reference to the accompanying drawings. However, those skilled in the art will understand that many technical details have been provided in the various embodiments of this application to help readers better understand this application. However, the technical solutions claimed in this application can be implemented even without these technical details and various changes and modifications based on the following embodiments. The division of the various embodiments below is for the convenience of description and should not constitute any limitation on the specific implementation of this application. The various embodiments can be combined with and referenced by each other without contradiction.

[0019] One embodiment of this application relates to a method for predicting memory faults, applied to an electronic device. The electronic device can be a server for fault prediction (hereinafter referred to as a prediction server). For example, the device under test (DUT) that needs memory fault prediction collects its own data and sends it to the prediction server, which then predicts the memory faults of the DUT. The DUT can be a server, a big data center cluster, a communication base station, a PC, or other device containing memory devices. In this embodiment, the specific flow of the memory fault prediction method is shown in Figure 1 and includes:

[0020] Step 101: Obtain various log data of the memory to be tested; among which, the various log data include at least: memory error information address data.

[0021] Step 102: Perform feature engineering to construct feature data tables corresponding to various log data types.

[0022] Step 103: Concatenate the feature data tables corresponding to the various log data to obtain the feature concatenated data table.

[0023] Step 104: Based on the feature concatenation data table and the pre-trained fault prediction model, obtain the fault prediction result of the memory to be tested; wherein, the fault prediction model is trained based on the pre-collected training dataset, and the samples in the training dataset include various log data of various types of memory.

[0024] In this embodiment, multiple log data sets of the memory under test are acquired. These log data sets include at least memory error information address data. These multiple log data sets can measure the state of the memory under test from multiple perspectives, thus providing a more comprehensive reference for fault prediction and improving the accuracy of fault prediction. The inclusion of memory error information address data ensures that essential reference information is provided for fault prediction. By performing feature engineering on each of the multiple log data sets and concatenating the feature data tables corresponding to each log data set after feature engineering, a feature concatenation data table suitable for the input of the fault prediction model can be obtained. Based on the feature concatenation data table and the pre-trained fault prediction model, the fault prediction result of the memory under test can be obtained. Since the training dataset of the fault prediction model includes multiple log data sets of various memory types, the generalization ability and stability of the fault prediction model are stronger, resulting in more accurate prediction results and improved accuracy in predicting memory faults.
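Step 103 can be pictured with a minimal sketch. It assumes that each per-log feature data table is keyed by a (memory serial number, whole-minute timestamp) pair, that feature names are unique across tables, and that features missing for a key are filled with 0. The key layout, the fill value, and the name `splice_feature_tables` are illustrative assumptions, not details given in the application.

```python
from typing import Dict, List, Tuple

# Assumed join key: (memory serial number, whole-minute timestamp string).
Key = Tuple[str, str]
FeatureTable = Dict[Key, Dict[str, float]]


def splice_feature_tables(tables: List[FeatureTable], fill: float = 0.0) -> FeatureTable:
    """Concatenate per-log feature tables column-wise on a shared key.

    Every output row carries every feature column seen in any input table;
    a feature that is missing for a given key is set to `fill`.
    """
    all_columns = sorted({col for t in tables for row in t.values() for col in row})
    all_keys = sorted({key for t in tables for key in t})
    spliced: FeatureTable = {}
    for key in all_keys:
        row = {col: fill for col in all_columns}
        for t in tables:
            row.update(t.get(key, {}))
        spliced[key] = row
    return spliced
```

Each row of the resulting table is then one input sample for the pre-trained fault prediction model in step 104.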

[0025] The following is a detailed explanation of the implementation details of the memory fault prediction method in this embodiment. The following content is only for the convenience of understanding and is not necessary for implementing this solution.

[0026] In step 101, the server hosting the memory under test can obtain various log data from the memory under test and send this log data to the prediction server, thereby enabling the prediction server to obtain various log data from the memory under test. These various log data include at least memory error information address data, ensuring that essential reference information is provided for memory fault prediction.

[0027] The architecture of the memory under test is shown in Figure 2. The left side of Figure 2 is a schematic diagram of multiple memory chips in the memory under test, and the right side of Figure 2 shows one chip in the memory under test. A chip is physically divided into eight layers, with each layer representing a bank. For example, in Figure 2, bank 1 can represent the first layer in the chip, row is a row in bank 1, column is a column in bank 1, and the intersection of a row and a column is a memory unit (cell).

[0028] The acquired memory error information address data includes: memory serial number, memory manufacturer, log reporting time, dual-inline-memory modules (DIMMs), memory rank number, memory chip number, memory bank information, row number of the memory cell, and column number of the memory cell. The memory error information address data may also include the detailed physical location of the memory failure parsed from the memory fault log.

[0029] In one embodiment, the acquired log data, in addition to memory error information address data, also includes one or any combination of the following: memory log data, operating system kernel logs, Error Detection and Correction (EDAC) logs, performance data, and environment and location information data.

[0030] The memory log data includes: register group number field, transaction field, memory serial number, memory manufacturer, and log reporting time.

[0031] In one example, the memory log data is the Dynamic Random Access Memory (DRAM) fault log collected and reported by the mcelog tool. The mcelog tool is a standard tool on Linux systems for recording DRAM faults based on Intel's Machine Check Architecture (MCA). In other words, the memory log data can be mcelog data.

[0032] In one example, the operating system kernel log can record information related to memory failures obtained from the Linux kernel log. The operating system kernel log includes: memory serial number, memory manufacturer, log reporting time, and various fields related to memory errors.

[0033] In one example, the Error Detection and Correction (EDAC) log includes: memory serial number, log reporting time, memory controller (MC) field, page field, and offset field. Both EDAC and ECC are error detection and correction mechanisms; ECC is hardware-based error detection and correction for memory, while EDAC is software-based error detection and correction within the Linux kernel.

[0034] In one example, the performance data records physical performance data related to the server where the memory under test resides, including: page-ins per second (data transfers from disk to memory), page-outs per second (data transfers from memory to disk), minimum voltage, maximum voltage, configured voltage, and memory operating speed.

[0035] In one example, the environmental and location information data records the environmental and location information of the server where the memory under test is located, including: temperature, humidity, the site where the server is located, the room where the server is located, and the rack where the server is located.

[0036] In this embodiment, research has found that memory log data, operating system kernel logs, EDAC logs, performance data, and environmental and location information data all have a significant impact on memory failures. Including these data can improve the stability and generalization ability of the failure prediction model and help improve the accuracy of failure prediction.

[0037] In one embodiment, when implementing step 101, the server where the memory under test resides can simultaneously acquire six types of log data from the memory under test. As shown in Figure 3, these are: mcelog data, Linux kernel log data, EDAC log data, performance data, environment and location information data, and memory error information address data.

[0038] In step 102, before constructing feature data tables corresponding to various log data based on feature engineering, the prediction server can preprocess the various log data in the memory to be tested. The preprocessing may include: filling fields that are not obtained from the various log data with the mean or 0; discarding fields that are not related to memory failure; if some columns of the various log data are highly correlated, then only one column is kept and the other columns are deleted.

[0039] In one example, data verification can also be performed on the preprocessed log data. Specifically, this involves verifying whether the time length and number of sampling points of the data can meet the minimum requirements for fault prediction or the amount of data required for fault prediction. That is, whether the data collection time meets the time length required for feature engineering construction, and whether the collected log data includes at least memory error information address data.

[0040] In one embodiment, step 102, which performs feature engineering on the various log data to obtain the feature data table corresponding to each type of log data, can be implemented through the following sub-steps. As shown in Figure 4, these include:

[0041] Sub-step 1021 involves preprocessing various log data.

[0042] Preprocessing may include: filling null values with "0"; analyzing the various log data collected using Pearson correlation coefficients; deleting columns that are strongly correlated with other columns, as well as columns whose data never changes or is all "0"; and discarding fields that are not related to memory failures.
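The preprocessing in sub-step 1021 can be sketched as follows, using only the Python standard library. The sketch assumes numeric columns, fills nulls with 0, drops constant columns (which covers all-"0" columns), and keeps only the first column of any strongly correlated group. The 0.95 threshold and the names `pearson` and `preprocess` are illustrative assumptions; the application does not state a threshold. Discarding fields unrelated to memory failures requires domain knowledge and is not shown.

```python
import math
from typing import Dict, List, Optional


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def preprocess(
    columns: Dict[str, List[Optional[float]]],
    threshold: float = 0.95,  # assumed cut-off for "strong correlation"
) -> Dict[str, List[float]]:
    # 1. Fill null values with 0.
    filled = {
        name: [0.0 if v is None else float(v) for v in values]
        for name, values in columns.items()
    }
    # 2. Drop columns whose data never changes (this includes all-zero columns).
    varying = {name: values for name, values in filled.items() if len(set(values)) > 1}
    # 3. Keep a column only if it is not strongly correlated with one already kept.
    kept: Dict[str, List[float]] = {}
    for name, values in varying.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept
```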

[0043] Sub-step 1022: Data verification.

[0044] Data verification may include verifying the duration and number of sampling points of various preprocessed log data. This involves verifying whether the duration and number of sampling points meet the minimum data requirements for fault detection and prediction: that is, whether the data acquisition time meets the time required for feature engineering construction, and whether the various log data include at least memory error information address data.
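A minimal sketch of the check in sub-step 1022 is given below. It assumes each log type is represented by the list of timestamps of its sampling points, and that the minimum duration and minimum number of points are supplied by the caller. The name `verify_logs` and the key `"mem_error_address"` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def verify_logs(
    logs: Dict[str, List[datetime]],
    min_duration: timedelta,
    min_points: int,
    required: str = "mem_error_address",  # hypothetical key for the address data
) -> bool:
    """Return True only if the collected logs are sufficient for feature construction."""
    # The memory error information address data must be present and non-empty.
    if not logs.get(required):
        return False
    for timestamps in logs.values():
        # Enough sampling points?
        if len(timestamps) < min_points:
            return False
        # Collection period long enough for the feature windows?
        if max(timestamps) - min(timestamps) < min_duration:
            return False
    return True
```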

[0045] Sub-step 1023: Based on the verified log data, feature engineering is performed to construct features respectively.

[0046] In one embodiment, the prediction server constructs features based on the preprocessed memory error information address data to obtain a feature data table corresponding to the memory error information address data. Referring to Figure 5, obtaining the feature data table corresponding to the memory error information address data includes:

[0047] Step 501: Count the first quantity and the first count. The prediction server, based on the memory error information address data, counts the first quantity of first targets appearing within a first preset time period before the current time point, and the first count, i.e., the number of ECC occurrences on each first target; wherein the first target is a memory cell in the memory under test that has accumulated two or more ECC occurrences within a past second preset time period, and the duration of the second preset time period is longer than the duration of the first preset time period.

[0048] Step 502: Count the second quantity and the second count. The prediction server, based on the memory error information address data, counts the second quantity of second targets appearing within the third preset time period before the current time point, and the second count, i.e., the number of ECC occurrences on each second target. The second target is a column of memory cells in the memory under test that has accumulated two or more ECC occurrences at the same time within the past fourth preset time period. The duration of the fourth preset time period is longer than the duration of the third preset time period. The same time can be the same second or the same minute.

[0049] Step 503: Count the third quantity and the third count. The prediction server, based on the memory error information address data, counts the third quantity of third targets appearing within the fifth preset time period before the current time point, and the third count, i.e., the number of ECC occurrences on each third target. The third target is a row of memory cells in the memory under test that has accumulated two or more ECC occurrences at the same time within the past sixth preset time period. The duration of the sixth preset time period is longer than the duration of the fifth preset time period.

[0050] Step 504: Count the fourth quantity. The prediction server, based on the memory error information address data, counts the fourth quantity of fourth targets appearing within the seventh preset time period before the current time point. The fourth target is a memory cell block in the memory under test consisting of at least three memory cells located in the same column, where the at least three memory cells all experienced ECC at the same time during the past eighth preset time period and neighboring cells among them are spaced at most one memory cell apart. The duration of the eighth preset time period is longer than the duration of the seventh preset time period.

[0051] Step 505: Count the fifth quantity. The prediction server, based on the memory error information address data, counts the fifth quantity of fifth targets appearing within the ninth preset time period before the current time point. The fifth target is a memory cell block in the memory under test consisting of at least three memory cells located in the same row, where the at least three memory cells all experienced ECC at the same time during the past tenth preset time period and neighboring cells among them are spaced at most one memory cell apart. The duration of the tenth preset time period is longer than the duration of the ninth preset time period.

[0052] Step 506: Obtain the feature data table corresponding to the memory error information address data.

[0053] Based on one or any combination of the following, obtain the feature data table corresponding to the memory error information address data: first quantity, first count, second quantity, second count, third quantity, third count, fourth quantity, fifth quantity.

[0054] In step 501, the first preset time period is a period of time before the current time point, the first target can be denoted as ERROR CELL, the first quantity is the number of first target ERROR CELLs that appear in the first preset time period before the current time point, the first count is the total number of ECC occurrences on each ERROR CELL in the first preset time period before the current time point, and the second preset time period is the collection time length of the memory error information address data required when constructing the first target.

[0055] In one example, the second preset time period can be 3 months. To facilitate data splicing and alignment, the first preset time period can be a time period of whole-minute granularity before the current time point. For example, the first preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point. That is, the total number of ECC occurrences on ERROR CELLs within 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point (i.e., the first count mentioned above) and the number of ERROR CELLs on which ECC occurred (i.e., the first quantity mentioned above) are counted respectively.
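The ERROR CELL statistics can be sketched as below. The sketch assumes each ECC record is a (timestamp, cell address) pair, that a cell address is the tuple (rank, chip, bank, row, column), that 3 months is approximated as 90 days, and that the first count is aggregated over all ERROR CELLs in the window. The function name and feature names are illustrative, not taken from the application.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Cell = Tuple[int, int, int, int, int]  # assumed: (rank, chip, bank, row, column)
Event = Tuple[datetime, Cell]          # one ECC occurrence


def error_cell_features(
    events: List[Event],
    now: datetime,
    lookback: timedelta = timedelta(days=90),   # second preset period (assumed 90 days)
    windows_min: Tuple[int, ...] = (1, 2, 4, 8),  # first preset periods, in minutes
) -> Dict[str, int]:
    # An ERROR CELL is a cell with two or more ECC occurrences in the lookback period.
    history = Counter(cell for ts, cell in events if now - lookback <= ts <= now)
    error_cells = {cell for cell, n in history.items() if n >= 2}

    feats: Dict[str, int] = {}
    for m in windows_min:
        start = now - timedelta(minutes=m)
        hits = [cell for ts, cell in events if start <= ts <= now and cell in error_cells]
        feats[f"error_cell_num_{m}min"] = len(set(hits))  # first quantity
        feats[f"error_cell_ecc_{m}min"] = len(hits)       # first count
    return feats
```

Because every window ends at the same current time point, the resulting columns line up with the window-based features of the other log types.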

[0056] In step 502, the second target can be denoted as ERROR COLUMN, the second quantity is the number of second target ERROR COLUMNs that appear in the third preset time period before the current time point, the second count is the total number of ECC occurrences on each ERROR COLUMN in the third preset time period before the current time point, and the fourth preset time period is the collection time length of memory error information address data required when constructing the second target.

[0057] In one example, the fourth preset time period can be 3 months, and the third preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point. That is, the total number of ECC occurrences on ERROR COLUMNs within 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point (i.e., the second count mentioned above) and the number of ERROR COLUMNs on which ECC occurred (i.e., the second quantity mentioned above) are counted respectively.

[0058] In step 503, the third target can be denoted as ERROR ROW, the third quantity is the number of third target ERROR ROWs that appear in the fifth preset time period before the current time point, the third count is the total number of ECC occurrences on each ERROR ROW in the fifth preset time period before the current time point, and the sixth preset time period is the collection time length of the memory error information address data required when constructing the third target.

[0059] In one example, the sixth preset time period can be 3 months, and the fifth preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point. That is, the total number of ECC occurrences on ERROR ROWs within 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point (i.e., the third count mentioned above) and the number of ERROR ROWs on which ECC occurred (i.e., the third quantity mentioned above) are counted respectively.
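One way to identify ERROR COLUMNs and ERROR ROWs is sketched below. It reads "at the same time" as "within the same whole minute" and simplifies a cell address to (bank, row, column); both are assumptions, as is the name `simultaneous_error_lines`. A column (or row) qualifies when two or more ECC occurrences fall on it within one such minute during the lookback period.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Set, Tuple

Cell = Tuple[int, int, int]    # assumed simplified address: (bank, row, column)
Event = Tuple[datetime, Cell]  # one ECC occurrence


def simultaneous_error_lines(
    events: List[Event],
    now: datetime,
    axis: str,                                 # "row" or "column"
    lookback: timedelta = timedelta(days=90),  # fourth/sixth preset period (assumed)
) -> Set[Tuple[int, int]]:
    """Return the (bank, index) pairs of rows or columns that had two or more
    ECC occurrences within the same minute during the lookback period."""
    idx = 1 if axis == "row" else 2
    per_bucket: Counter = Counter()
    for ts, cell in events:
        if now - lookback <= ts <= now:
            line = (cell[0], cell[idx])
            bucket = ts.replace(second=0, microsecond=0)  # whole-minute bucket
            per_bucket[(line, bucket)] += 1
    return {line for (line, _), n in per_bucket.items() if n >= 2}
```

The quantity and count features for the 8, 4, 2, and 1 minute windows can then be computed over these targets in the same way as for ERROR CELLs.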

[0060] In step 504, the fourth target can be denoted as ERROR COL BLOCK, the fourth quantity is the number of fourth target ERROR COL BLOCKs that appear in the seventh preset time period before the current time point, and the eighth preset time period is the length of time required to collect memory error information address data when constructing the fourth target.

[0061] In one example, the eighth preset time period can be 3 months. For instance, if at the same moment 3 out of 5 consecutive memory cells in the same column have experienced ECC, namely the 1st, 3rd, and 5th memory cells, then these 5 memory cells constitute one ERROR COL BLOCK. Alternatively, if at the same moment all 3 of 3 consecutive memory cells in the same column have experienced ECC, then these 3 memory cells constitute one ERROR COL BLOCK. The seventh preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point; that is, the number of ERROR COL BLOCKs within 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point (i.e., the fourth quantity mentioned above) is counted respectively. The same moment can be the same second or the same minute.

[0062] In step 505, the fifth target can be denoted as ERROR ROW BLOCK, the fifth quantity is the number of fifth target ERROR ROW BLOCKs that appear within the ninth preset time period before the current time point, and the tenth preset time period is the collection time length of the memory error information address data required when constructing the fifth target.

[0063] In one example, the tenth preset time period can be 3 months. For instance, if at the same moment 3 out of 5 consecutive memory cells in the same row have experienced ECC, namely the 1st, 3rd, and 5th memory cells, then these 5 consecutive memory cells constitute one ERROR ROW BLOCK. Alternatively, if at the same moment all 3 of 3 consecutive memory cells in the same row have experienced ECC, then these 3 memory cells constitute one ERROR ROW BLOCK. The ninth preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point; that is, the number of ERROR ROW BLOCKs within 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point (i.e., the fifth quantity mentioned above) is counted respectively.
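The block definition used for both column blocks and row blocks can be illustrated with a short sketch. Given the positions along one column (or one row) at which ECC occurred at the same moment, it counts maximal runs of at least three erroring cells in which neighboring erroring cells are at most one cell apart. Treating a block as a maximal run is an interpretive assumption, and the name `count_blocks` is illustrative.

```python
from typing import Iterable


def count_blocks(positions: Iterable[int], min_cells: int = 3, max_gap: int = 1) -> int:
    """Count blocks among the cell positions that erred at the same moment.

    `positions` are row indices within one column (for a column block) or
    column indices within one row (for a row block). `max_gap` is the largest
    number of non-erroring cells allowed between two neighboring erroring cells.
    """
    blocks = 0
    run = 0       # number of erroring cells in the current run
    prev = None
    for pos in sorted(set(positions)):
        if prev is not None and pos - prev <= max_gap + 1:
            run += 1
        else:
            if run >= min_cells:
                blocks += 1
            run = 1
        prev = pos
    if run >= min_cells:
        blocks += 1
    return blocks
```

With this reading, the 1st, 3rd, and 5th cells of five consecutive cells form one block, as do three directly adjacent cells, while cells 1, 4, and 7 do not.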

[0064] It should be noted that the first, third, fifth, seventh, and ninth preset time periods can be the same as or different from one another, and likewise the second, fourth, sixth, eighth, and tenth preset time periods can be the same as or different from one another.

[0065] In this embodiment, since memory faults tend to spread along the rows and columns of memory, constructing concepts related to memory defects, such as the first target ERROR CELL, the second target ERROR COLUMN, the third target ERROR ROW, the fourth target ERROR COL BLOCK, and the fifth target ERROR ROW BLOCK, and computing the related quantity and count statistics, helps to accurately measure memory faults and improve the accuracy of memory fault prediction.

[0066] In one embodiment, the prediction server constructs features based on various log data to obtain feature data tables corresponding to each type of log data. These various log data, in addition to including memory error information address data, also include one or more combinations of the following: memory log data, operating system kernel logs, Error Detection and Correction (EDAC) logs, performance data, and environmental and location information data. By performing refined processing on memory log data, operating system kernel logs, EDAC logs, performance data, and environmental and location information data, feature data tables corresponding to each type of log data are obtained, which helps improve the accuracy of fault prediction. The following is a detailed explanation of obtaining feature data tables corresponding to various log data:

[0067] In one example, when multiple log data include memory log data, the transaction field of the eleventh preset time period before the current time point is summed based on the memory log data to obtain the sum of the transaction field. The register group number field of the eleventh preset time period before the current time point is counted to obtain the count of the register group number field. Then, based on the sum of the transaction field and the count of the register group number field, the feature data table corresponding to the memory log data is obtained.

[0068] In one example, the register group number field can be the mca_bank field, and the eleventh preset time period can be 8min, 4min, 2min, or 1min. That is, the prediction server can count the mca_bank within 8min, 4min, 2min, or 1min before the current time point and sum the transaction field as a new feature column. This new feature column is the feature column in the feature data table corresponding to the memory log data.

[0069] For example, if the current time is 10:27, then the data within 1 minute before the current time point is the data collected between 10:26 and 10:27. Then, the mca_bank field collected during the period of 10:26-10:27 is counted, and the transaction field is summed. The data within 2 minutes before the current time point is the data collected between 10:25 and 10:27. The data within 4 minutes before the current time point is the data collected between 10:23 and 10:27. The data within 8 minutes before the current time point is the data collected between 10:19 and 10:27.

[0070] In the specific implementation, since the register group number field is the register group number retrieved from memory by the mcelog tool and its data type is string, it can be counted; while the transaction field is a set of operations transferred between the CPU and memory, including read transactions and write transactions, and its data type is integer. Summing the transaction field can better reflect the data characteristics of the field. Simply counting will lose the data information in the transaction field. Therefore, it is necessary to sum the transaction field.
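The window statistics for the memory log data can be sketched as follows. The sketch assumes each mcelog record has already been parsed into a dictionary with the hypothetical keys `"time"`, `"mca_bank"`, and `"transaction"`; the parsing itself and the feature names are not taken from the application.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def mcelog_features(
    records: List[dict],
    now: datetime,
    windows_min: Tuple[int, ...] = (1, 2, 4, 8),  # eleventh preset periods, in minutes
) -> Dict[str, int]:
    feats: Dict[str, int] = {}
    for m in windows_min:
        start = now - timedelta(minutes=m)
        in_window = [r for r in records if start <= r["time"] <= now]
        # mca_bank is a string identifier, so it is counted.
        feats[f"mca_bank_count_{m}min"] = sum(
            1 for r in in_window if r.get("mca_bank") is not None
        )
        # transaction is an integer, so it is summed to keep its magnitude.
        feats[f"transaction_sum_{m}min"] = sum(r.get("transaction", 0) for r in in_window)
    return feats
```

With a current time of 10:27, the 1 minute window covers 10:26 to 10:27 and the 8 minute window covers 10:19 to 10:27, matching the example above.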

[0071] In one example, when multiple log data include operating system kernel logs, various fields related to memory errors in the operating system kernel logs are counted. The various fields within the twelfth preset time period before the current time point are counted to obtain the count results corresponding to each field. Then, based on the count results corresponding to each field, the feature data table corresponding to the operating system kernel logs is obtained.

[0072] Among them, statistics can be performed on various fields related to memory errors within the twelfth preset time period according to the granularity of whole minutes. For example, when the twelfth preset time period is 8min, 4min, 2min, and 1min, the prediction server can count various fields related to memory errors within 8min, 4min, 2min, and 1min before the current time point as new feature columns. These new feature columns are the feature columns in the feature data table corresponding to the operating system kernel log.

[0073] In the specific implementation, since there are 24 types of fields in the operating system kernel log, of which 8 types are related to memory errors, it is only necessary to count the 8 types of fields related to memory errors within the twelfth preset time period before the current time point to obtain the count results of the 8 types of fields related to memory errors. Then, based on the count results corresponding to these 8 types of fields related to memory errors, the feature data table corresponding to the operating system kernel log is obtained.

[0074] In one example, when multiple log data include EDAC logs, the MC, page, and offset fields are counted within the thirteenth preset time period prior to the current time point based on the EDAC logs. The count results for the MC, page, and offset fields are then obtained. Based on these count results, a feature data table corresponding to the EDAC logs is derived. Here, the MC field represents the memory controller number, page represents a virtual memory page, and offset represents the offset. The physical memory address can be calculated by adding page and offset.

[0075] Specifically, the fields (MC, page, offset) within the thirteenth preset time period can be statistically analyzed at the minute level. For example, if the thirteenth preset time period is 8 minutes, 4 minutes, 2 minutes, and 1 minute, the prediction server can count the fields (MC, page, offset) within the 8 minutes, 4 minutes, 2 minutes, and 1 minute prior to the current time point as new feature columns. These new feature columns are the feature columns in the feature data table corresponding to the EDAC logs.

[0076] In one example, when multiple log data also include performance data, the average of various performance data affecting memory failures within the fourteenth preset time period prior to the current time point is calculated based on the performance data to obtain the average value of each type of performance data. Based on the average value of each type of performance data, the corresponding feature data table is obtained.

[0077] Among them, the fields (Page_in, page_out, minimum voltage, maximum voltage, configuration voltage, memory running speed) within the fourteenth preset time period can be statistically analyzed at the minute level. For example, the fourteenth preset time period can be 8min, 4min, 2min, and 1min. That is, the prediction server can average the fields related to performance data within the previous 8min, 4min, 2min, and 1min as new feature columns.

[0078] In the specific implementation, the average values of page in, page out, minimum voltage, maximum voltage, configuration voltage, and memory running speed within the fourteenth preset time period before the current time point are calculated, and the feature data table corresponding to the performance data is obtained based on these average values.
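The windowed averaging of performance data described in [0076] to [0078] can be sketched as follows. The field names, the column naming scheme, and the use of `None` for an empty window are assumptions made for illustration only.

```python
from datetime import datetime, timedelta
from statistics import mean

WINDOWS_MIN = (8, 4, 2, 1)
# Hypothetical column names for the six performance fields listed in the text.
PERF_FIELDS = ("page_in", "page_out", "min_voltage",
               "max_voltage", "configured_voltage", "memory_speed")


def performance_features(samples, now):
    """Average each performance field over each window before `now`."""
    row = {}
    for w in WINDOWS_MIN:
        start = now - timedelta(minutes=w)
        in_window = [s for s in samples if start < s["time"] <= now]
        for field in PERF_FIELDS:
            values = [s[field] for s in in_window]
            # An empty window has no average; None marks the gap (assumption).
            row[f"{field}_avg_{w}min"] = mean(values) if values else None
    return row


now = datetime(2021, 10, 12, 10, 27)
samples = [
    {"time": datetime(2021, 10, 12, 10, 22), "page_in": 100, "page_out": 40,
     "min_voltage": 1.19, "max_voltage": 1.21, "configured_voltage": 1.2,
     "memory_speed": 2666},
    {"time": datetime(2021, 10, 12, 10, 26, 30), "page_in": 300, "page_out": 60,
     "min_voltage": 1.17, "max_voltage": 1.23, "configured_voltage": 1.2,
     "memory_speed": 2666},
]
perf = performance_features(samples, now)
```

The same helper shape would apply to any numeric field that is averaged rather than counted.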

[0079] In one example, when multiple log data also include environmental and location information data, the average temperature and humidity over a 15th preset time period prior to the current time point are calculated based on the environmental and location information data to obtain the average temperature and the average humidity. Based on the average temperature and the average humidity, a feature data table corresponding to the environmental and location information data is obtained.

[0080] Specifically, fields (such as temperature and humidity) within the fifteenth preset time period can be statistically analyzed at the minute level. For example, the fifteenth preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute. The prediction server can average the temperature, humidity, and other environmental data-related fields within the previous 8 minutes, 4 minutes, 2 minutes, and 1 minute as a new feature column. This new feature column is the feature column in the feature data table corresponding to the environmental and location information data. Then, the memory under test is mapped to the location information such as the site, room, and rack of the server.

[0081] It should be noted that, in specific implementations, the eleventh, twelfth, thirteenth, fourteenth, and fifteenth preset time periods mentioned above may be the same or different, and this embodiment does not impose any specific limitations on this.

[0082] In one embodiment, in step 101, the prediction server can acquire various log data of the memory under test at a granular level of whole minutes. Step 102, feature engineering construction, can include: when acquiring various log data within the nearest m minutes to the current time, merging the log data within the nearest m minutes with the log data within the nearest n minutes before that time, and using the merged log data as the log data required for fault prediction at the current time; where n+m minutes is the preset minimum duration required for fault prediction. Based on the log data required for fault prediction at the current time, minute-level feature engineering construction is performed to obtain feature data tables corresponding to the various log data. For example, the value of m can be 1, 10, 0.1, etc. Acquiring log data at a minute-level granularity facilitates minute-level prediction, enabling real-time monitoring of memory status and improving the real-time performance of memory fault prediction.

[0083] In one example, m can be 1, and n can be 7. Specifically, if multiple log data within the nearest minute to the current time are obtained, and 8 minutes of multiple log data are required for fault prediction (i.e., 8 minutes is the preset minimum duration for fault prediction), then at the 9th minute, the multiple log data obtained in the 9th minute can be merged with the multiple log data obtained within the previous 7 minutes, and the merged multiple log data will be used as the multiple log data required for fault prediction in the 9th minute. For example, if the current time is 10:27, and multiple log data within the minute before 10:27 (i.e., 10:26-10:27) are obtained, but multiple log data within the 8 minutes from 10:19-10:27 are needed for fault prediction, then the 10:26-10:27 minute data will be merged with the 7 minutes from 10:19-10:26, and the merged 8 minutes of data will be used as the multiple log data required for fault prediction at 10:27.
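The merging of the newest minute of data with the previous n minutes, as described in [0082] and [0083], can be sketched with a fixed-length buffer. The class name `MinuteBuffer` and the choice to return `None` until enough history exists are assumptions; the sketch also assumes m = 1, so that each pushed batch covers exactly one minute.

```python
from collections import deque

M_MINUTES = 1  # the newest batch covers the latest m minutes
N_MINUTES = 7  # earlier data kept from the previous n minutes


class MinuteBuffer:
    """Keeps the most recent n + m one-minute batches of log records."""

    def __init__(self, m=M_MINUTES, n=N_MINUTES):
        # n + m minutes is the preset minimum duration required for prediction.
        self.required = n + m
        self.batches = deque(maxlen=self.required)

    def push(self, minute_batch):
        """Add the newest batch; return merged records or None if history is short."""
        self.batches.append(minute_batch)
        if len(self.batches) < self.required:
            return None
        return [record for batch in self.batches for record in batch]


buffer = MinuteBuffer()
results = []
for minute in range(19, 28):  # batches starting at 10:19, 10:20, ..., 10:27
    results.append(buffer.push([f"10:{minute}"]))
```

The first seven pushes return `None`; the eighth returns the merged 8 minutes of data, and each later push drops the oldest minute.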

[0084] In step 103, features are constructed for various log data of the memory under test. After obtaining the corresponding feature data tables, the feature data table corresponding to the memory error information address data and one or more of the following feature data tables are concatenated according to the memory sequence number and the whole minute granularity: the feature data table corresponding to the memory log data, the feature data table corresponding to the operating system kernel log, the feature data table corresponding to the performance data, and the feature data table corresponding to the environment and location information data, to obtain the feature concatenation data table.
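The concatenation in step 103 can be sketched as a join on the pair (memory sequence number, whole minute). The dictionary layout, the function name, and the left-join behavior (keeping only keys present in the memory error information address table) are assumptions for illustration.

```python
def concat_feature_tables(base_table, *other_tables):
    """Left-join feature tables on (memory sequence number, whole minute).

    Each table is a dict mapping (serial, minute) -> dict of feature columns.
    Rows of `base_table` are kept; columns from the other tables are added
    when the same key exists there.
    """
    merged = {}
    for key, base_row in base_table.items():
        row = dict(base_row)
        for table in other_tables:
            row.update(table.get(key, {}))
        merged[key] = row
    return merged


# Hypothetical feature tables keyed by (memory sequence number, minute).
address_table = {("DIMM-A1", "10:27"): {"ecc_cell_count_8min": 2}}
memory_log_table = {("DIMM-A1", "10:27"): {"transaction_sum_8min": 10}}
environment_table = {
    ("DIMM-A1", "10:27"): {"temperature_avg_8min": 31.5},
    ("DIMM-B2", "10:27"): {"temperature_avg_8min": 29.0},
}

spliced = concat_feature_tables(address_table, memory_log_table, environment_table)
```

Keys that appear only in a secondary table (here `DIMM-B2`) are dropped in this sketch; a real pipeline might instead fill the missing columns with default values.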

[0085] In step 104, the servers housing various types of memory pre-collect various log data from these memory types. The prediction server then performs feature engineering based on this log data to construct feature data tables corresponding to each type of log data. These feature data tables are then concatenated to form a concatenated feature data table. This concatenated feature data table is used as a sample in the training dataset for model training, resulting in a pre-trained fault prediction model. After training, the fault prediction model can be used to predict faults in the memory under test.

[0086] In one embodiment, the fault prediction model is trained using a pre-collected training dataset and the LightGBM machine learning method, where the loss function of the LightGBM machine learning method is Focal Loss, and the Focal Loss formula is:

[0087]

[0088] Where α is the balance factor, γ is the modulation parameter, y′ is the predicted sample value, y is the sample label, and L_fl represents the error between the predicted value and the label value of the sample.
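A minimal sketch of a binary Focal Loss using the symbols defined in [0088] is given below. It assumes the commonly used form L_fl = -α·y·(1-y′)^γ·log(y′) - (1-α)·(1-y)·(y′)^γ·log(1-y′), which may differ in detail from the formula of this embodiment; the default values of `alpha` and `gamma` and the clipping constant `eps` are illustrative, not taken from the text.

```python
import math


def focal_loss(y, y_pred, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary Focal Loss for one sample (assumed standard form).

    y:      sample label, 0 or 1.
    y_pred: predicted probability y' that the label is 1.
    alpha:  balance factor; gamma: modulation parameter.
    """
    # Clip the prediction so that log() never receives 0.
    p = min(max(y_pred, eps), 1.0 - eps)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


easy_positive = focal_loss(1, 0.9)  # well-classified faulty sample
hard_positive = focal_loss(1, 0.1)  # badly misclassified faulty sample
```

The factor (1-y′)^γ shrinks the loss of well-classified samples, so training concentrates on the hard, rare fault samples. Using such a loss with a gradient boosting library generally also requires its first and second derivatives, which are omitted here.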

[0089] In practice, memory fault prediction faces the problem of imbalanced positive and negative samples, where normal memory data is much larger than faulty memory data. The Focal Loss function is specifically designed to handle the problem of imbalanced positive and negative samples. Therefore, when using the LightGBM machine learning method to build a training model, the use of the Focal Loss function can effectively solve the problem of imbalanced positive and negative samples in the training dataset of the fault prediction model, thus ensuring the accuracy of memory fault prediction.

[0090] In one embodiment, all samples in the training dataset are labeled with a binary value. For example, if a sample at a certain point in time is a memory failure sample, then samples within T minutes prior to that point are labeled "1", and samples before T minutes are labeled "0". The value of T can be between 30 and 100, for example, 90. That is, although this is a faulty memory, it had not reached the risk threshold 90 minutes ago and is considered normal data. Based on the fault prediction model trained using samples labeled with this binary method, if the output of the fault prediction model is "1", it indicates that the memory under test may fail within the next T time period; if the output of the fault prediction model is "0", it indicates that the memory under test is normal.
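The binary labeling rule in [0090] can be sketched as follows. The function name, the use of `None` for memories that never failed, and the inclusive boundary at exactly T minutes are assumptions.

```python
def binary_label(minutes_before_failure, T=90):
    """Return 1 for a sample within T minutes before the failure, else 0.

    minutes_before_failure: minutes between the sample and the failure time,
                            or None for a memory that never failed.
    T: preset fault impact duration in minutes (90 in the example).
    """
    if minutes_before_failure is None:
        return 0
    return 1 if 0 <= minutes_before_failure <= T else 0


labels = [binary_label(x) for x in (5, 90, 91, 300, None)]
```

A sample taken 300 minutes before the failure of a faulty memory is therefore labeled "0", the same as a sample from a normal memory.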

[0091] In one embodiment, all samples in the training dataset are labeled with a label value determined by regression, for example: calculating the time interval between the time point corresponding to each sample and the time point corresponding to the sample in which a memory failure occurred; and calculating the label value of each sample based on the time interval and a calculation formula used to map the time interval to the [0,1] interval; wherein the calculation formula is:

[0092]

[0093] Where label is the calculated label value, X is the time interval, a is the first preset coefficient, and T is the preset fault impact duration. A regression-based labeling method is used to avoid data loss caused by the one-size-fits-all approach of binary labeling. The label value can reflect the proximity of the data to the fault, making it more realistic.

[0094] In the specific implementation, the samples in the training dataset are mapped to the interval [0,1] using a modified sigmoid function. The label value of each sample is calculated. The sigmoid function can be used to predict the time point when the memory under test will fail, so as to determine the degree of distance between the label and the time point of failure. Among them, the first preset coefficient 'a' is an empirical value with a value range of [1,10], for example, it can be 8; the value range of X / T is [0,1].

[0095] In one embodiment, after obtaining the output of the fault prediction model based on the feature splicing data table and the pre-trained fault prediction model, the predicted time of the fault occurrence of the memory under test can be obtained based on the output and the time formula used to predict the time of the fault occurrence.

[0096] The time formula is as follows:

[0097]

[0098] Where t represents the time point when the memory under test fails, b is the second preset coefficient, output is the output result, and T is the preset fault impact duration. In this way, the method predicts not only whether a failure will occur, but also the specific time at which it will occur, which yields higher prediction accuracy.

[0099] The output of the fault prediction model has a value range of [0,1], and the second preset coefficient is an empirical value with a value range of [1,20], such as 11.25.

[0100] In one example, a*b = T. For instance, when a = 8 and b = 11.25, that is, a*b = 8 * 11.25 = 90, then the preset fault impact duration T = 90.

[0101] In one embodiment, the fault prediction model in step 104 can be obtained by, for example, the method shown in Figure 6, which includes:

[0102] Step 601: Label the samples in the training dataset to obtain the label value of each sample.

[0103] Step 602: Train the model to obtain the fault prediction model based on each sample with labeled values.

[0104] Step 603: Evaluate the fault prediction model.

[0105] The training dataset can be formed in the following ways:

[0106] S1: Collect memory-related log data from different data centers, manufacturers, and models (such as the 6 types of log data mentioned above);

[0107] S2: Perform feature engineering on the log data to obtain feature data tables corresponding to various log data types;

[0108] S3: Combine the feature data tables corresponding to the various log data to form a training dataset.

[0109] In S2, when the various log data include memory error information address data, feature engineering can be performed based on the memory error information address data obtained in S1 to obtain the corresponding feature data table, which is used as a training dataset sample from which the prediction server builds the fault prediction model. The method for constructing features based on memory error information address data is the method shown in Figure 5 and described above, and is not repeated here.

[0110] In S2, various log data also include memory log data, operating system kernel logs, performance data, and environmental and location information data. To obtain feature data tables corresponding to these data—memory log data, operating system kernel logs, performance data, and environmental and location information data—and use them as training dataset samples for the fault prediction model, the prediction server can construct feature data based on the memory log data, operating system kernel logs, Error Detection and Correction (EDAC) logs, performance data, and environmental and location information data obtained in S1. The method of feature engineering construction has already been described above and will not be repeated here to avoid repetition.

[0111] In one example, step 601 can use a binary classification method (i.e., the binary method described above) to label the samples in the training dataset. Specifically, the label-time relationship for binary classification is shown by the dashed line "Classification" in Figure 7, where the horizontal axis represents the time before the memory under test fails and the vertical axis represents the label values assigned using the binary classification method. If the preset fault impact duration T is 90 minutes, then for the binary classification method, the prediction server labels all samples in the training dataset within the 90 minutes before the failure as "1", and labels samples more than 90 minutes before the failure as "0".

[0112] In another example, step 601 can use regression to label the samples in the training dataset. Specifically, the label-time relationship for regression is shown by the solid line "Regression" in Figure 7, where the horizontal axis represents the time before the memory under test fails and the vertical axis represents the label values assigned using the regression method. If the preset fault impact duration T is 90 minutes, then for the regression method, the prediction server maps the time interval to the [0,1] interval using the following modified sigmoid function:

[0113]

[0114] Where label is the calculated label value, and X is the time interval. For data from normal memory, X is set to +∞, and label is 0.

[0115] In one example, step 602 can use LightGBM to train and model each sample with labeled values, and the loss function can be the mean square error (MSE) loss function or the Focal loss function.

[0116] In another example, step 602 can use a random forest to train and model each sample with labeled values.

[0117] In step 603, 30% of the sample data in the training dataset is randomly selected as the validation set, and the remaining sample data is used as the training set.
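The random 30% hold-out in step 603 can be sketched as follows. The function name and the fixed seed are assumptions added so that the example is reproducible.

```python
import random


def split_train_validation(samples, validation_ratio=0.3, seed=0):
    """Randomly hold out a share of the samples as the validation set."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * validation_ratio)
    # The first n_val shuffled samples form the validation set.
    return shuffled[n_val:], shuffled[:n_val]


train_set, validation_set = split_train_validation(range(10))
```

With 10 samples, 3 go to the validation set and the remaining 7 to the training set.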

[0118] In one example, the fault prediction model can be evaluated on the validation set using the F1-Score metric. The definition, related terminology, and detailed metrics of the F1-Score are as follows, where precision is the precision rate and recall is the recall rate:

[0119] n_pp: the number of memories predicted, within the evaluation window, to fail within the next T time period;

[0120] n_tp: the number of faulty memories within the evaluation window that were detected in advance, within the T time period before failure;

[0121] n_tr: the total number of memory faults within the evaluation window.

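[0122] $\text{precision} = n_{tp} / n_{pp}$

[0123] $\text{recall} = n_{tp} / n_{tr}$

[0124] $\text{F1-Score} = \dfrac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$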

[0125] One approach is to use a grid search on the model's relevant parameters to maximize the model's F1 score.
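The F1-Score evaluation built on the counts defined in [0119] to [0121] can be sketched as follows, assuming the conventional definitions precision = n_tp / n_pp and recall = n_tp / n_tr. The function name, the zero-division handling, and the example counts are illustrative assumptions, not measured results.

```python
def f1_score(n_pp, n_tp, n_tr):
    """Return (precision, recall, F1) from the three evaluation counts.

    n_pp: memories predicted to fail within the next T time period.
    n_tp: faulty memories that were detected in advance.
    n_tr: total number of memory faults in the evaluation window.
    """
    precision = n_tp / n_pp if n_pp else 0.0
    recall = n_tp / n_tr if n_tr else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up counts, used only to show the arithmetic.
precision, recall, f1 = f1_score(n_pp=20, n_tp=15, n_tr=25)
```

A grid search would call such a function once per parameter combination and keep the combination with the highest F1 value.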

[0126] In one embodiment, obtaining the fault prediction result of the memory under test based on the feature concatenation data table and the pre-trained fault prediction model includes: obtaining, based on the feature concatenation data table and the pre-trained fault prediction model, the confidence level with which the memory under test is classified as "1"; and determining the health level of the memory under test based on the confidence level; wherein the health level is 1 minus the confidence level, and the lower the health level, the more likely the memory under test is to fail. That is, if a binary classification method is used to label the samples in the training dataset, the output of the fault prediction model obtained by the method shown in Figure 6 is the confidence level (range 0-100%) with which the machine learning algorithm classifies the memory under test as "1". The health level of the memory under test is determined from this confidence level, and the health level is finally reported to the monitoring center.

[0127] In one embodiment, the fault prediction result of the memory under test is obtained based on the feature concatenation data table and the pre-trained fault prediction model, including: obtaining the predicted value of the memory under test based on the feature concatenation data table and the pre-trained fault prediction model; wherein the predicted value is a floating-point number between 0 and 1; and determining the health of the memory under test based on the predicted value; wherein the health is 1 minus the predicted value, and the lower the health, the more likely the memory under test is to fail. That is, if the regression method is used to label the samples in the training dataset, the prediction server will map the time interval to the [0,1] interval using the following modified sigmoid function:

[0128]

[0129] Where label is the calculated label value, and X is the time interval. For data from normal memory, X is set to +∞, and label is 0. The output of the fault prediction model obtained by the method shown in Figure 6 is a floating-point number in the interval [0,1]. Generally, an output greater than 0.5 means that a memory failure will occur within the next T time period; otherwise, no memory failure is expected within the next T time period. The output, which is the predicted value (range 0-1) of the memory under test obtained through the machine learning algorithm, is converted into a health level of 0-100%: the health level of the memory under test is 1 minus the predicted value. The lower the health level, the higher the probability of memory failure within the next T time period. Finally, the health level is reported to the monitoring center.
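The conversion of the model output into a health level and a failure decision, as described in [0127] and [0129], can be sketched as follows. The function name, the dictionary keys, and the rounding to one decimal place are assumptions for illustration.

```python
def health_report(predicted_value, threshold=0.5):
    """Turn a model output in [0, 1] into a health level and a failure flag."""
    if not 0.0 <= predicted_value <= 1.0:
        raise ValueError("predicted value must lie in [0, 1]")
    health = 1.0 - predicted_value  # health level = 1 - predicted value
    return {
        # Output greater than the threshold: failure expected within T.
        "will_fail_within_T": predicted_value > threshold,
        "health_percent": round(health * 100, 1),
    }


report = health_report(0.8)
```

An output of 0.8 gives a health level of 20% and flags the memory as likely to fail within the next T time period; this dictionary stands in for what would be reported to the monitoring center.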

[0130] In one embodiment, if regression is used to label the samples in the training dataset, the prediction result of the fault prediction model obtained by the method shown in Figure 6 can be the time point at which a memory failure will occur. For example, when building the fault prediction model, the loss function is the MSE loss function, and the time point at which a memory failure will occur can be obtained through the following formula:

[0131]

[0132] Finally, the prediction server can report the calculated time of memory failure to the monitoring center so that the monitoring center can take appropriate action.

[0133] It should be noted that the examples described above in this embodiment are merely illustrative for ease of understanding and do not constitute a limitation on the technical solution of the present invention.

[0134] The steps of the various methods described above are only for clarity. In practice, they can be combined into one step or some steps can be split into multiple steps. As long as they include the same logical relationship, they are all within the scope of protection of this patent. Adding insignificant modifications or introducing insignificant designs to the algorithm or process, but without changing the core design of the algorithm and process, are also within the scope of protection of this patent.

[0135] Another embodiment of the present invention relates to an electronic device which, as shown in Figure 8, includes at least one processor 801 and a memory 802 communicatively connected to the at least one processor 801; wherein the memory 802 stores instructions executable by the at least one processor 801, and the instructions are executed by the at least one processor 801 to enable the at least one processor 801 to execute the memory fault prediction method of the above embodiment.

[0136] The memory 802 and processor 801 are connected via a bus, which can include any number of interconnecting buses and bridges. The bus connects various circuits of one or more processors 801 and memory 802 together. The bus can also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver can be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by processor 801 is transmitted over a wireless medium via an antenna, which further receives data and transmits it to processor 801.

[0137] The processor 801 is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 802 can be used to store data used by the processor 801 during operation.

[0138] Another embodiment of the present invention relates to a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the method embodiments described above.

[0139] That is, those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing related hardware. This program is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0140] Those skilled in the art will understand that the above embodiments are specific examples of implementing the present invention, and in practical applications, various changes in form and detail may be made without departing from the spirit and scope of the present invention.

Claims

1. A memory fault prediction method, characterized in that it includes: obtaining various log data of the memory under test, wherein the various log data include at least memory error information address data; performing feature engineering construction based on the various log data to obtain feature data tables corresponding to the various log data respectively; concatenating the feature data tables corresponding to the various log data to obtain a feature concatenation data table; and obtaining the fault prediction result of the memory under test based on the feature concatenation data table and the pre-trained fault prediction model, wherein the fault prediction model is trained on a pre-collected training dataset, and the samples in the training dataset include various log data of various memories. The step of obtaining the fault prediction result of the memory under test based on the feature concatenation data table and the pre-trained fault prediction model includes: obtaining the output result of the fault prediction model based on the feature concatenation data table and the pre-trained fault prediction model; and obtaining the predicted time of failure of the memory under test based on the output result and the time formula used to predict the time of failure. The time formula is as follows: Where t is the time point at which the memory under test fails, b is the second preset coefficient, output is the output result with a value range of [0, 1], and T is the preset fault impact duration.

2. The memory fault prediction method according to claim 1, characterized in that, The step of constructing feature data tables corresponding to the various log data based on feature engineering includes: Based on the memory error information address data, the first number of first targets appearing in the first preset time period before the current time point and the first number of error checks and corrections (ECCs) appearing on each of the first targets are counted; wherein, the first target is the memory unit in the memory under test that has accumulated 2 or more ECCs in each memory unit in the past second preset time period, and the duration of the second preset time period is longer than the duration of the first preset time period. Based on the memory error information address data, the second number of second targets appearing in the third preset time period before the current time point and the second number of ECC occurrences on each second target are counted; wherein, the second target is the column of memory cells in the memory under test that has accumulated 2 or more ECC occurrences at the same time in the past fourth preset time period, and the duration of the fourth preset time period is longer than the duration of the third preset time period. Based on the memory error information address data, the third number of third targets appearing in the fifth preset time period before the current time point and the third number of ECC occurrences on each third target are counted; wherein, the third target is the row of memory cells in the memory under test that has accumulated 2 or more ECC occurrences at the same time in the past sixth preset time period, and the duration of the sixth preset time period is longer than the duration of the fifth preset time period. 
Based on the memory error information address data, the fourth number of fourth targets appearing in the seventh preset time period before the current time point is counted; wherein, the fourth target is a memory cell block in the memory under test consisting of at least 3 memory cells located in the same column, the at least 3 memory cells located in the same column all have ECC at the same time in the past eighth preset time period, and the at least 3 memory cells located in the same column are separated by at most 1 memory cell, and the duration of the eighth preset time period is longer than the duration of the seventh preset time period; Based on the memory error information address data, count the fifth number of fifth targets that appeared within the ninth preset time period before the current time point; wherein, the fifth target is a memory cell block in the memory under test consisting of at least 3 memory cells located in the same row, wherein the at least 3 memory cells located in the same row all experienced ECC at the same time in the past tenth preset time period, and the at least 3 memory cells located in the same row are separated by at most 1 memory cell, and the duration of the tenth preset time period is longer than the duration of the ninth preset time period; The feature data table corresponding to the memory error information address data is obtained based on one or any combination of the following: The first quantity, the first number of times, the second quantity, the second number of times, the third quantity, the third number, the fourth quantity, and the fifth quantity.

3. The memory fault prediction method according to claim 1 or 2, characterized in that, The various log data also include one or any combination of the following: Memory log data, operating system kernel logs, error detection and correction EDAC logs, performance data, and environmental and location information data.

4. The memory fault prediction method according to claim 3, characterized in that:
when the plurality of log data further includes the memory log data, the step of performing feature engineering construction according to the plurality of log data to obtain the feature data table corresponding to each of the plurality of log data includes: based on the memory log data, summing the transaction field over the eleventh preset time period before the current time point to obtain a summation result of the transaction field, and counting the register group number field over the eleventh preset time period before the current time point to obtain a count result of the register group number field; and obtaining the feature data table corresponding to the memory log data based on the summation result of the transaction field and the count result of the register group number field;
when the plurality of log data further includes the operating system kernel log, the step of performing feature engineering construction according to the plurality of log data to obtain the feature data table corresponding to each of the plurality of log data includes: parsing the fields related to memory errors in the operating system kernel log; counting each of those fields over the twelfth preset time period before the current time point to obtain a count result for each field; and obtaining the feature data table corresponding to the operating system kernel log based on the count results of the fields;
when the plurality of log data further includes the EDAC log, the step of performing feature engineering construction according to the plurality of log data to obtain the feature data table corresponding to each of the plurality of log data includes: based on the EDAC log, counting the MC field, the page field, and the offset field over the thirteenth preset time period before the current time point to obtain count results of the MC field, the page field, and the offset field; and obtaining the feature data table corresponding to the EDAC log based on the count results of the MC field, the page field, and the offset field;
when the plurality of log data further includes the performance data, the step of performing feature engineering construction according to the plurality of log data to obtain the feature data table corresponding to each of the plurality of log data includes: based on the performance data, calculating the average value of each item of performance data that affects memory faults over the fourteenth preset time period before the current time point; and obtaining the feature data table corresponding to the performance data based on the average values of the items of performance data;
when the plurality of log data further includes the environmental and location information data, the step of performing feature engineering construction according to the plurality of log data to obtain the feature data table corresponding to each of the plurality of log data includes: based on the environmental and location information data, calculating the average temperature and the average humidity over the fifteenth preset time period before the current time point; and obtaining the feature data table corresponding to the environmental and location information data based on the average temperature and the average humidity.
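
The windowed aggregations recited in claim 4 (sum, count, and average over a preset period before the current time point) can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the key names (`transaction`, `register_group`, `mc`, `page`, `offset`, `temperature`, `humidity`), and the one-hour window are hypothetical, and "counting" a field is assumed to mean counting the records in the window that carry that field.

```python
from datetime import datetime, timedelta
from statistics import mean


def in_window(records, now, window):
    """Return records whose timestamp lies in [now - window, now]."""
    start = now - window
    return [r for r in records if start <= r["time"] <= now]


def memory_log_features(records, now, window):
    """Sum the transaction field and count register-group entries."""
    recent = in_window(records, now, window)
    return {
        "transaction_sum": sum(r["transaction"] for r in recent),
        # Assumption: count the records that carry the field.
        "register_group_count": sum(1 for r in recent if "register_group" in r),
    }


def edac_features(records, now, window):
    """Count occurrences of the MC, page, and offset fields."""
    recent = in_window(records, now, window)
    return {
        f"{field}_count": sum(1 for r in recent if field in r)
        for field in ("mc", "page", "offset")
    }


def environment_features(records, now, window):
    """Average temperature and humidity; None when the window is empty."""
    recent = in_window(records, now, window)
    if not recent:
        return {"avg_temperature": None, "avg_humidity": None}
    return {
        "avg_temperature": mean(r["temperature"] for r in recent),
        "avg_humidity": mean(r["humidity"] for r in recent),
    }


# Hypothetical sample data; the third memory-log record is outside the window.
now = datetime(2021, 10, 12, 12, 0)
window = timedelta(hours=1)
memory_log = [
    {"time": now - timedelta(minutes=5), "transaction": 3, "register_group": 1},
    {"time": now - timedelta(minutes=20), "transaction": 2, "register_group": 2},
    {"time": now - timedelta(minutes=90), "transaction": 7, "register_group": 1},
]
edac_log = [
    {"time": now - timedelta(minutes=10), "mc": 0, "page": 4096, "offset": 128},
    {"time": now - timedelta(minutes=30), "mc": 1, "page": 8192},
]
environment = [
    {"time": now - timedelta(minutes=15), "temperature": 30.0, "humidity": 40.0},
    {"time": now - timedelta(minutes=45), "temperature": 34.0, "humidity": 50.0},
]

memory_row = memory_log_features(memory_log, now, window)
edac_row = edac_features(edac_log, now, window)
environment_row = environment_features(environment, now, window)
```

Each returned dictionary plays the role of one row of the corresponding feature data table. The kernel-log and performance-data branches would follow the same count and average patterns.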

5. The memory fault prediction method according to claim 1, characterized in that the fault prediction model is obtained by training with the pre-collected training data set and a machine learning method, wherein the machine learning method uses FocalLoss as its default loss function, and the formula for the loss function is as follows: [formula not reproduced in the source text] where the symbols, which are likewise not reproduced, denote respectively a balance factor, a modulation parameter, the predicted value of the sample, the label value of the sample, and the error between the predicted value of the sample and the label value of the sample.
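
Because the formula of claim 5 is not available in this text, the sketch below shows only the widely used binary focal loss, written so that it also accepts soft labels in [0, 1]. It should not be read as the patent's exact expression: the claim mentions an error term between the predicted value and the label value, which suggests a variant. The function name and the default values `alpha=0.25` and `gamma=2.0` are assumptions, not values taken from the patent.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for one sample (illustrative only).

    p     : predicted probability of the fault class, in [0, 1]
    y     : label value, 0/1 or a soft label in [0, 1]
    alpha : balance factor weighting the fault class against the rest
    gamma : modulation parameter; larger values down-weight easy samples
    """
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    positive_term = -alpha * y * (1.0 - p) ** gamma * math.log(p)
    negative_term = -(1.0 - alpha) * (1.0 - y) * p ** gamma * math.log(1.0 - p)
    return positive_term + negative_term


easy_loss = focal_loss(0.9, 1)  # confidently correct prediction
hard_loss = focal_loss(0.1, 1)  # confidently wrong prediction
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which is a convenient sanity check. Raising `gamma` shifts the training signal toward hard samples, which helps when fault samples are rare compared with healthy ones.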

6. The memory fault prediction method according to claim 1, characterized in that all samples in the training data set are labeled with label values, which are determined as follows: calculating the time interval between the time point corresponding to each sample and the time point corresponding to the sample at which a memory fault occurred; and calculating the label value of each sample based on the time interval and a calculation formula that maps the time interval to the interval [0,1], the calculation formula being as follows: [formula not reproduced in the source text] where label is the calculated label value, X is the time interval, a is the first preset coefficient, and T is the preset fault impact duration.
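
The labeling procedure of claim 6 has two steps: compute each sample's time interval to the fault, then map that interval into [0, 1]. The sketch below implements the first step directly and leaves the second pluggable, since the claim's formula is not reproduced in this text. The function `placeholder_mapping` is a purely hypothetical stand-in (an exponential decay in X scaled by a and T), chosen only because it lands in (0, 1]. It is not the patent's formula, and all names here are assumptions.

```python
import math
from datetime import datetime, timedelta


def interval_minutes(sample_time, fault_time):
    """Time interval X between a sample and the fault, in minutes."""
    return abs((fault_time - sample_time).total_seconds()) / 60.0


def placeholder_mapping(x, a, t):
    """Hypothetical stand-in for the claim's formula: maps X >= 0 into (0, 1]."""
    return math.exp(-a * x / t)


def label_samples(sample_times, fault_time, a, t, mapping=placeholder_mapping):
    """Label each sample from its time interval to the fault."""
    return [mapping(interval_minutes(s, fault_time), a, t) for s in sample_times]


# Hypothetical example: samples 0, 30, and 120 minutes before a fault,
# with coefficient a = 1.0 and fault impact duration T = 60 minutes.
fault_time = datetime(2021, 10, 12, 12, 0)
sample_times = [fault_time - timedelta(minutes=k) for k in (0, 30, 120)]
labels = label_samples(sample_times, fault_time, a=1.0, t=60.0)
```

Under this stand-in, samples closer to the fault receive labels nearer 1. Whether the actual formula is increasing or decreasing in X cannot be determined from the text, so swap in the real mapping through the `mapping` argument when it is known.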

7. The memory fault prediction method according to claim 1, characterized in that the step of obtaining the plurality of log data of the memory under test includes: obtaining the plurality of log data of the memory under test at minute granularity; and the step of performing feature engineering construction according to the plurality of log data to obtain the feature data table corresponding to each of the plurality of log data includes: when the plurality of log data within the m minutes closest to the current time is obtained, merging the plurality of log data within the m minutes with the plurality of log data within the n minutes preceding the m minutes, and using the merged log data as the log data required for fault prediction at the current time, where n+m minutes is the preset minimum duration required for fault prediction; and performing feature engineering construction at minute granularity based on the log data required for fault prediction at the current time, to obtain the feature data table corresponding to each of the plurality of log data.
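
The merging step of claim 7 can be pictured as a small buffer that keeps the previous n minutes of logs and joins them with each newly arrived batch of m minutes. The sketch below is one possible arrangement, not the patent's implementation: the class name `LogWindow`, the representation of each minute as a list of records, and the choice to return `None` until n minutes of history exist are all assumptions.

```python
from collections import deque


class LogWindow:
    """Keep the last n minutes of logs and merge them with each new m-minute batch."""

    def __init__(self, n, m):
        self.n = n
        self.m = m
        # One entry per minute; maxlen discards minutes older than n.
        self._history = deque(maxlen=n)

    def merge(self, latest):
        """Merge the stored n minutes with the newest m minutes.

        latest is a list of m per-minute record lists. Returns the merged
        records, or None while fewer than n minutes of history are stored.
        """
        if len(latest) != self.m:
            raise ValueError(f"expected {self.m} minutes, got {len(latest)}")
        merged = None
        if len(self._history) == self.n:
            minutes = list(self._history) + list(latest)
            merged = [record for minute in minutes for record in minute]
        self._history.extend(latest)  # the newest minutes become history
        return merged


# Hypothetical usage with n = 2 and m = 1 (n + m = 3 minutes per prediction).
window = LogWindow(n=2, m=1)
first = window.merge([["a"]])        # not enough history yet
second = window.merge([["b"]])       # still not enough history
third = window.merge([["c", "d"]])   # minutes a, b + newest minute c, d
fourth = window.merge([[]])          # minutes b, (c, d) + an empty minute
```

The merged list is what the minute-granularity feature engineering would consume for the prediction at the current time.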

8. The memory fault prediction method according to claim 5, characterized in that the step of obtaining the fault prediction result of the memory under test based on the feature splicing data table and the pre-trained fault prediction model includes: obtaining, based on the feature splicing data table and the pre-trained fault prediction model, the confidence level with which the memory under test is classified as 1; and determining the health level of the memory under test based on the confidence level, wherein the health level is 1 minus the confidence level, and the lower the health level, the more likely the memory under test is to fail.

9. The memory fault prediction method according to claim 6, characterized in that the step of obtaining the fault prediction result of the memory under test based on the feature splicing data table and the pre-trained fault prediction model includes: obtaining, based on the feature splicing data table and the pre-trained fault prediction model, a predicted value for the memory under test, wherein the predicted value is a floating-point number between 0 and 1; and determining the health level of the memory under test based on the predicted value, wherein the health level is 1 minus the predicted value, and the lower the health level, the more likely the memory under test is to fail.
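
Claims 8 and 9 both derive the health level as 1 minus a model output in [0, 1] (the class-1 confidence in claim 8, the predicted value in claim 9). A minimal sketch of that conversion follows; the function names and the memory-module identifiers are hypothetical, and the ranking helper is an added convenience rather than something the claims recite.

```python
def health_level(score):
    """Health level = 1 - score, where score is the model output in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return 1.0 - score


def rank_by_risk(scores):
    """Order memory modules from least healthy to most healthy."""
    return sorted(scores, key=lambda name: health_level(scores[name]))


# Hypothetical model outputs for three memory modules.
scores = {"dimm_a": 0.1, "dimm_b": 0.9, "dimm_c": 0.5}
ranking = rank_by_risk(scores)
```

Here `dimm_b` comes first in the ranking because its health level is lowest, meaning it is the module most likely to fail.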

10. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the memory fault prediction method according to any one of claims 1 to 9.

11. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the memory fault prediction method according to any one of claims 1 to 9.