Memory error recovery method and device based on repeated page chain, equipment and medium

By detecting hardware memory errors, determining whether they are KSM pages, and finding healthy copies, combined with dynamically adjusting linked list thresholds and load balancing mechanisms, the problem of unrecoverable memory caused by hardware failures is solved, achieving data recovery and business continuity under hardware failures.

CN122309252A · Pending · Publication Date: 2026-06-30 · KYLIN CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: KYLIN CORP
Filing Date: 2026-04-02
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies cannot effectively recover data when faced with unrecoverable physical memory page errors caused by hardware failures, thus affecting business continuity.

Method used

After a hardware memory error is detected, the method determines whether the faulty page is a Kernel Samepage Merging (KSM) page and searches the duplicate page chain for a healthy copy. The recovery strategy is chosen according to the importance of the affected processes, and the chain formation threshold is adjusted dynamically. Load balancing and health indicators guide failover and page table remapping, and a memory reservation pool provides emergency replenishment.

Benefits of technology

It significantly expands the scope of application of memory error recovery, ensures fault tolerance protection even in the critical state of OOM, reduces system performance jitter, maintains the redundancy of linked lists at a safe level, and ensures the continuity of core business.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a memory error recovery method, apparatus, device, and medium based on a duplicate page chain, relating to the field of computer system technology. The method includes: in response to a hardware memory error, determining whether the faulty page is a KSM page; searching for a healthy copy in its duplicate page chain; determining the recovery strategy level based on the parameters of the affected processes; allocating a new physical page, adding it to the chain, and copying data from the healthy copy to form a recovery copy; remapping the page tables of the affected processes to the recovery copy, updating the load count and share count, and isolating the faulty page. By introducing dynamic thresholds, periodic degradation marking, a health-level response, and a memory reservation pool, the method significantly lowers the chain formation threshold, expands the scope of fault tolerance protection, and gives the chain self-maintenance capabilities. This improves fault tolerance coverage and system reliability under extreme memory pressure while reducing performance jitter.

Description

Technical Field

[0001] This invention relates to the field of computer system technology, and in particular to a method, apparatus, device, and medium for memory error recovery based on a duplicate page chain.

Background Technology

[0002] In computer systems, physical memory (RAM) is the key hardware for storing program execution data. Each physical memory page is typically a 4KB block of memory, and a mapping is established between its physical address and the virtual address space of a process. When a physical memory page suffers physical damage or an unrecoverable error such as a bit flip, the data stored in that page becomes unusable.

[0003] Traditional Linux kernels identify faulty pages through hardware memory error detection mechanisms and immediately terminate all processes using those pages. While this ensures system security, it can lead to unexpected application termination and disrupt business continuity. Mechanisms such as the soft-offline path of memory_failure, HWPOISON flags, and page migration can also be used to take affected pages out of service or to move their data to new physical pages. However, when the source page has suffered a hardware error and its data is corrupted, there is no correct data to migrate, which limits these mechanisms' ability to recover from uncorrectable errors such as hardware failures.

Summary of the Invention

[0004] This invention provides a memory error recovery method, apparatus, device, and medium based on repeated page chains to solve the technical problem that physical memory pages cannot be correctly recovered due to uncorrectable errors such as hardware failures.

[0005] In a first aspect, embodiments of the present invention provide a memory error recovery method based on a duplicate page chain, comprising:
S101: in response to a detected hardware memory error, check whether the faulty memory page is a Kernel Samepage Merging (KSM) page;
S102: when the faulty memory page is a KSM page, search for the healthy page copy corresponding to the faulty memory page in the duplicate page chain to which the KSM page belongs;
S103: when an available healthy page copy is found, obtain the parameters of all affected processes mapped to the faulty memory page to determine the recovery strategy level;
S104: allocate a new physical memory page according to the recovery strategy level, add the newly allocated page to the linked list of the duplicate page chain, copy data from the healthy page copy, and initialize its load count and share count information, forming a recovery copy page for the faulty memory page;
S105: modify the page table mappings of all affected processes mapped to the faulty memory page so that they point to the recovery copy page, flush the translation lookaside buffer (TLB), update the load count and share count of the recovery copy page, and mark and isolate the faulty memory page.

[0006] In a second aspect, embodiments of the present invention provide a memory error recovery device based on a duplicate page chain, comprising:
an error checking module, used to check whether the faulty memory page is a Kernel Samepage Merging (KSM) page in response to a detected hardware memory error;
a copy lookup module, used to find the healthy page copy corresponding to the faulty memory page in the duplicate page chain to which the KSM page belongs when the faulty memory page is a KSM page;
a recovery strategy grading module, used to obtain the parameters of all affected processes mapped to the faulty memory page when an available healthy page copy is found, in order to determine the recovery strategy level;
a recovery copy allocation module, used to allocate a new physical memory page according to the recovery strategy level, add the newly allocated page to the linked list of the duplicate page chain, copy data from the healthy page copy, and initialize its load count and share count information, forming a recovery copy page for the faulty memory page;
a recovery page update module, used to modify the page table mappings of all affected processes mapped to the faulty memory page so that they point to the recovery copy page, flush the translation lookaside buffer (TLB), update the load count and share count of the recovery copy page, and mark and isolate the faulty memory page.

[0007] In a third aspect, embodiments of the present invention provide an electronic device, including: one or more processors; and a storage device for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the memory error recovery method based on a duplicate page chain described above.

[0008] Fourthly, embodiments of the present invention provide a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the aforementioned memory error recovery method based on a repeating page chain.

[0009] This invention provides a memory error recovery method, apparatus, device, and medium based on a duplicate page chain. Upon detecting a hardware memory error, the method determines whether the faulty memory page is a Kernel Samepage Merging (KSM) page and searches its duplicate page chain for a healthy copy to replace it. It ranks the affected processes using parameters such as the control group flag and applies a priority-differentiated recovery strategy. During failover and page table remapping, linked list node load balancing and a maximum sharing limit are introduced, so that the healthy node with the lowest load count that has not yet reached the sharing limit is preferred as the mapping target. A dynamic pressure threshold calculation with periodic recalculation adjusts the linked list formation threshold of the duplicate page chain, allowing the threshold to drop as low as 3 pages. A health index calculation with a tiered response counters the risk of redundancy being gradually exhausted by node failures; pages are replenished in an emergency from the memory reservation pool, and the number of nodes to add is derived from the current and target health values. The dynamic threshold mechanism lowers the linked list formation threshold from a fixed 256 pages to a minimum of 3 pages, so more duplicate pages receive fault tolerance protection when memory is sufficient, which significantly expands the applicability of memory error recovery. Periodic degradation marking avoids frequent creation and dissolution of linked lists caused by short-term fluctuations in memory pressure, reducing system performance jitter. The health-level response and proactive page replenishment give the linked list self-maintenance capabilities, keep its redundancy at a safe level, and prevent it from gradually degrading and eventually failing as nodes fail. The memory reservation pool ensures that the linked list can still obtain replenishment pages even in the OOM critical state, guaranteeing fault tolerance under extreme memory pressure.

Attached Figure Description

[0010] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Figure 1 is a flowchart of a memory error recovery method based on a duplicate page chain as described in Embodiment 1 of the present invention;
Figure 2 is a flowchart of a memory error recovery method based on a duplicate page chain as described in Embodiment 2 of the present invention;
Figure 3 is a schematic diagram of the structure of a memory error recovery device based on a duplicate page chain according to Embodiment 3 of the present invention;
Figure 4 is a structural diagram of the electronic device described in Embodiment 4 of the present invention.

Detailed Implementation

[0011] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, the accompanying drawings show only the parts relevant to the present invention, and not all of the structures.

[0012] When a physical memory page (typically 4KB) suffers physical damage or an irreparable hardware error, the existing Linux kernel often terminates all processes using that page directly, causing unexpected application termination and impacting business continuity. Mechanisms like memory_failure also fail to utilize redundant copies of data already existing in memory. However, by using the Kernel Samepage Merging (KSM) mechanism, redundant structures originally used for memory optimization can be reused for memory reliability assurance. The specific method is as follows:

Embodiment 1

Figure 1 is a flowchart of a memory error recovery method based on a duplicate page chain according to Embodiment 1 of the present invention. When a hardware memory error is detected, a healthy copy is found through the KSM duplicate page chain mechanism. Combined with dynamic threshold adjustment, chain health assessment, and proactive page replenishment, this achieves transparent memory error recovery and dynamic self-maintenance of the chain. Specifically, the method includes the following steps:

S101: in response to a detected hardware memory error, check whether the faulty memory page is a Kernel Samepage Merging (KSM) page.

[0013] During system operation, the status of physical memory is continuously monitored by the underlying hardware. When irreparable physical damage such as a bit flip occurs, the hardware memory error is detected first. Instead of a blanket termination of related processes, the faulty memory page is examined to determine whether it belongs to KSM (Kernel Samepage Merging). The page flags (page->flags) of the faulty memory page are queried to determine whether it is a shared page managed by the KSM mechanism. A KSM duplicate page chain is a data structure, introduced for memory optimization, that manages pages with identical content. When multiple physical pages with identical content exist in the system, these pages are organized into a linked list, with each node recording the virtual address mappings that point to it.

[0014] S102, when the faulty memory page is a KSM page, search for the healthy page copy corresponding to the faulty memory page in the duplicate page chain to which the KSM page belongs.

[0015] After confirming that the faulty memory page is a KSM page, and since KSM maintains a chain of duplicate pages consisting of physical memory pages with identical content, the error flags of the other nodes in the chain can be checked to filter out nodes that are not marked as faulty and are in normal condition. This confirms whether an undamaged healthy page copy with identical data is available for recovery.

[0016] S103, when an available healthy page copy is found, obtain the parameters of all affected processes mapped to the faulty memory page to determine the recovery strategy level.

[0017] After a healthy node is found, the parameters of all affected processes (i.e., processes whose virtual addresses are mapped to the faulty memory page) are obtained. These parameters are used to classify the processes by importance and priority, which in turn determines the order in which recovery resources are allocated.

[0018] S104: Allocate new physical memory pages according to the recovery policy level, add the newly allocated physical memory pages to the linked list of the duplicate page chain, copy data from the healthy page copy, initialize its load count and sharing count information, and form a recovery copy page of the faulty memory page.

[0019] Based on the recovery strategy level determined for the process, the system attempts to request a new physical memory page from the memory management subsystem. If the request succeeds, then because the page content of every node in the KSM linked list is identical, the healthy and complete data can be copied from the previously found healthy page copy to the new physical memory page. This new page is then attached as a new node to the duplicate page chain. At the same time, its metadata is initialized: the load count, which reflects the number of virtual address mappings it currently carries, and the share count, which is checked against the maximum allowed sharing limit. The result is a healthy recovery copy page for the faulty memory page.

[0020] S105: modify the page table mappings of all affected processes mapped to the faulty memory page so that they point to the recovery copy page, flush the translation lookaside buffer (TLB), update the load count and share count of the recovery copy page, and mark and isolate the faulty memory page.

[0021] After the recovery copy page is created, pointer redirection is required. This can be achieved by calling underlying functions to modify all page table entries (PTEs) of affected processes that previously pointed to the faulty physical page, so that they now point to the newly created recovery copy page. After the page table modification is complete, the CPU's Translation Lookaside Buffer (TLB) is flushed so that the modified page table content takes effect, ensuring that subsequent memory accesses by the processes are directed to the healthy page. The load count and share count of the recovery copy page are updated to reflect the redirected mappings. Finally, a marking function is called to mark the faulty memory page as unavailable and remove the faulty node from the duplicate page chain, completing fault isolation and preventing reuse of the corrupted physical memory page.
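The recovery path of S102, S104, and S105 can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the `Node` and `Chain` classes, the `recover` function, and the dictionary standing in for page tables are hypothetical names introduced here, not kernel structures, and the KSM check (S101), strategy grading (S103), TLB flush, and share count bookkeeping are omitted.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove() targets one exact node
class Node:
    """One physical page in a duplicate page chain (hypothetical model)."""
    data: bytes
    faulty: bool = False
    load_count: int = 0  # number of virtual addresses mapped to this node


@dataclass
class Chain:
    nodes: list = field(default_factory=list)


def recover(chain, bad, page_table):
    """Replace `bad` with a recovery copy; return the copy, or None on failure.

    `page_table` maps a virtual address to the Node that backs it.
    """
    # S102: find a node other than the faulty one that is not marked faulty.
    healthy = next((n for n in chain.nodes if n is not bad and not n.faulty), None)
    if healthy is None:
        return None  # the caller falls back to conventional error handling

    # S104: allocate a new page, copy the healthy data, link it into the chain.
    copy = Node(data=bytes(healthy.data))
    chain.nodes.append(copy)

    # S105: redirect every mapping that pointed at the faulty page.
    for vaddr, node in list(page_table.items()):
        if node is bad:
            page_table[vaddr] = copy
            copy.load_count += 1

    # Mark the faulty page and isolate it from the chain.
    bad.faulty = True
    bad.load_count = 0
    chain.nodes.remove(bad)
    return copy
```

In this model a return value of `None` corresponds to the case where no healthy copy exists and the conventional handling path takes over.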

[0022] Optionally, the method further includes: calculating the dynamic pressure threshold of the duplicate page chain based on the current system memory pressure, and dynamically adjusting the linked list formation threshold of the duplicate page chain based on the dynamic pressure threshold. The linked list formation threshold takes at least the values 3 pages, 32 pages, 128 pages, and 256 pages.

[0023] The original KSM (Kernel Samepage Merging) duplicate page chain formation threshold is fixed at 256 identical pages. This means a duplicate page chain is only formed when 256 identical pages exist, which limits recovery capability and flexibility. Dynamically adjusting the duplicate page chain formation threshold based on current system memory pressure improves the recovery capability and flexibility for faulty memory pages. The dynamic pressure threshold for the duplicate page chain can be calculated based on the current system memory pressure, and the dynamically adjusted chain formation threshold can be determined accordingly. For example:
When memory is ample (utilization <50%), the linked list formation threshold can be set to a minimum of 3 pages to maximize redundancy coverage and prioritize fault tolerance.
When memory utilization is moderate (50%-80%), the linked list formation threshold can be set to 32 pages to balance memory overhead and fault tolerance benefits.
When memory is strained (utilization >80%), the linked list formation threshold can be set to 128 pages to reduce linked list maintenance overhead.
When memory is extremely strained and reaches the OOM (Out of Memory) threshold, the linked list formation threshold can be set to 256 pages (the original value) to maintain consistency with the original behavior.
The dynamically adjusted linked list formation threshold T is calculated from four quantities: the lower limit of the linked list formation threshold (default 3), the base linked list formation threshold (default 256), the amount of memory in use, and the total memory. [The formula itself is not reproduced in this text.] Dynamically adjusting the linked list formation threshold ensures maximum redundancy coverage when memory is ample and automatically reduces overhead when memory is tight. If the number of pages with the same content reaches T, a duplicate page chain is created; if it does not reach T, no linked list is created and the original KSM merging behavior is maintained.
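The tiered example in the paragraph above can be written as a small lookup. This is a sketch of the tiers only, not of the formula for T; the function names, the `oom_critical` flag, and the treatment of exactly 50% and 80% as part of the moderate tier are assumptions made for illustration.

```python
def chain_formation_threshold(mem_used, mem_total, oom_critical=False):
    """Return the linked list formation threshold for the current memory pressure.

    `oom_critical` is a hypothetical flag; how the OOM threshold is detected
    is not specified here.
    """
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    if oom_critical:
        return 256  # original fixed value, keeps the original behavior
    usage = mem_used / mem_total
    if usage < 0.5:
        return 3    # ample memory: maximize redundancy coverage
    if usage <= 0.8:
        return 32   # moderate pressure: balance overhead and fault tolerance
    return 128      # strained memory: reduce chain maintenance overhead


def should_form_chain(identical_pages, threshold):
    """A duplicate page chain is created once the page count reaches the threshold."""
    return identical_pages >= threshold
```

For instance, with 40% of memory in use the threshold is 3, so three identical pages are already enough to form a chain.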

[0024] Based on the dynamically adjusted linked list formation threshold, when the kernel finds KSM pages with the same content during page merging, it no longer uses a fixed 256 as the formation condition. Instead, it determines whether to create a duplicate page chain based on the currently calculated linked list formation threshold, and initializes the load count and share count for each linked list node of the created duplicate page chain.

[0025] When each KSM scan completes a full cycle, a linked list formation threshold is calculated. Based on the recalculated linked list formation threshold, the number of nodes in the existing duplicate page chains is checked. Linked lists with fewer nodes than the recalculated linked list formation threshold are marked as degraded. If the number of nodes is still lower than the threshold in the next linked list formation threshold calculation, the duplicate page chain corresponding to that linked list is disbanded.

[0026] In each KSM scan cycle (scanning once every 100 milliseconds by default), the linked list formation threshold is recalculated based on the current memory pressure. The number of nodes in each existing duplicate page chain is then checked. If the number of nodes is lower than the recalculated linked list formation threshold, the linked list is marked as degraded but not immediately disbanded. The next calculated linked list formation threshold is used for another check. If the number of nodes is still lower, the degradation condition is met and the corresponding duplicate page chain is disbanded. This avoids performance jitter caused by frequent creation and disbanding due to short-term fluctuations in memory pressure.
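The two-step degradation check can be sketched as follows. The dictionary layout and the name `recheck_chains` are hypothetical, and clearing the degraded mark when a chain meets the threshold again is an assumption; the text only states that a chain is disbanded when it falls short in two consecutive checks.

```python
def recheck_chains(chains, threshold):
    """Run one recalculation pass; return the names of disbanded chains.

    `chains` maps a chain name to {"nodes": int, "degraded": bool}.
    """
    disbanded = []
    for name, chain in list(chains.items()):
        if chain["nodes"] >= threshold:
            chain["degraded"] = False   # assumption: recovery clears the mark
        elif chain["degraded"]:
            del chains[name]            # second consecutive shortfall: disband
            disbanded.append(name)
        else:
            chain["degraded"] = True    # first shortfall: mark only
    return disbanded
```

A chain that dips below the threshold for a single pass is therefore only marked, which is what damps short-term fluctuations.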

[0027] When the dynamic pressure threshold takes effect, the health index of the duplicate page chain containing the faulty memory page is calculated. The health index is defined as the ratio of the number of healthy nodes to the total number of nodes, and it is graded into four levels: healthy, attention, warning, and danger.

[0028] For example, a health metric, health_ratio, is maintained for each KSM duplicate page chain, and the calculation formula is as follows:
health_ratio = healthy_node_count / total_node_count
where healthy_node_count is the number of nodes currently in a normal state, and total_node_count is the total number of nodes in the linked list.

[0029] The health index is compared with a preset page replenishment threshold. Based on the comparison result, the duplicate page chain is replenished accordingly. When the health index drops to a dangerous level, new physical memory pages are allocated from the pre-configured memory reservation pool.

[0030] For example:
A health index >70% indicates the "Healthy" level: redundancy is sufficient and fault tolerance is good, so normal operation continues without additional steps.
A health index between 50% and 70% indicates the "Attention" level: low-priority replenishment can be performed without affecting foreground work, with new pages allocated asynchronously in the background.
A health index between 30% and 50% indicates the "Warning" level: replenishment priority needs to be increased, and pages can be allocated from the memory reservation pool, with new pages for replenishment given priority.
A health index <30% indicates the "Danger" level: pages must be replenished at the highest priority, low-priority pages can be reclaimed if necessary, and pages are forcibly allocated from the memory reservation pool for emergency replenishment.
The pre-configured memory reservation pool size is 0.1% of total memory by default and can be adjusted according to actual needs. The memory reservation pool does not participate in regular memory allocation and is only used for emergency replenishment of the linked list in the danger state. Forcibly allocating new physical memory pages from the pre-configured memory reservation pool ensures that the linked list can still maintain itself even under extreme system memory pressure, such as in the OOM critical state.
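The health index and its four levels can be sketched as below. The function names are hypothetical, and assigning the boundary values of exactly 70%, 50%, and 30% to the attention, attention, and warning levels respectively is an assumption, since the text does not say which side each boundary belongs to.

```python
def health_ratio(healthy_node_count, total_node_count):
    """Ratio of healthy nodes to all nodes in one duplicate page chain."""
    if total_node_count <= 0:
        raise ValueError("total_node_count must be positive")
    return healthy_node_count / total_node_count


def health_level(ratio):
    """Map a health ratio to one of the four levels named in the text."""
    if ratio > 0.70:
        return "healthy"    # no action needed
    if ratio >= 0.50:
        return "attention"  # low-priority background replenishment
    if ratio >= 0.30:
        return "warning"    # raised priority, reservation pool allowed
    return "danger"         # forced allocation from the reservation pool
```

A chain with 2 healthy nodes out of 8 has a ratio of 0.25 and lands in the danger level, which is the only level that triggers forced allocation from the reservation pool.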

[0031] The health index is compared with the preset target health index to determine the number of supplementary nodes for the page supplementation operation. The supplemented new physical memory pages are added to the linked list as new nodes, and their load count and sharing count are initialized.

[0032] The number of nodes to be added can be determined by comparing the health index with the preset target health level. The calculation uses the following quantities: the current total number of nodes; the preset target health level (default 70%); and healthy_node_count, the current number of healthy nodes, which can be calculated from the total number of nodes and the health index. [The formula itself is not reproduced in this text.] Each new physical memory page is added to the linked list as a new node after data is copied to it from a healthy node, and its load count and share count are initialized, enabling the linked list to maintain itself.
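Since the formula is not available in this text, the following is one plausible reading rather than the patent's own definition: the number of healthy nodes needed to reach the target share of the current total, minus the healthy nodes already present. The function name and the integer-percent parameter are assumptions.

```python
def nodes_to_add(total_node_count, healthy_node_count, target_percent=70):
    """Assumed reading: ceil(total * target) - healthy, never below zero.

    Integer arithmetic is used to avoid floating-point rounding at the boundary.
    """
    needed = (total_node_count * target_percent + 99) // 100  # ceiling division
    return max(0, needed - healthy_node_count)
```

Note that this reading measures the target against the current total; because each added node also increases the total, a stricter calculation would need somewhat more nodes to actually reach the target ratio.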

[0033] Upon detecting a hardware memory error, this embodiment determines whether the faulty memory page is a Kernel Samepage Merging (KSM) page and searches its duplicate page chain for a healthy copy to replace it. It ranks the affected processes using parameters such as the control group flag and applies a priority-differentiated recovery strategy. During failover and page table remapping, linked list node load balancing and a maximum sharing limit are introduced, so that the healthy node with the lowest load count that has not yet reached the sharing limit is preferred as the mapping target. A dynamic pressure threshold calculation with periodic recalculation adjusts the linked list formation threshold of the duplicate page chain, allowing the threshold to drop as low as 3 pages. A health index calculation with a tiered response counters the risk of redundancy being gradually exhausted by node failures; pages are replenished in an emergency from the memory reservation pool, and the number of nodes to add is derived from the current and target health values. The dynamic threshold mechanism lowers the linked list formation threshold from a fixed 256 pages to a minimum of 3 pages, so more duplicate pages receive fault tolerance protection when memory is sufficient, which significantly expands the applicability of memory error recovery. Periodic degradation marking avoids frequent creation and dissolution of linked lists caused by short-term fluctuations in memory pressure, reducing system performance jitter. The health-level response and proactive page replenishment give the linked list self-maintenance capabilities, keep its redundancy at a safe level, and prevent it from gradually degrading and eventually failing as nodes fail. The memory reservation pool ensures that the linked list can still obtain replenishment pages even in the OOM critical state, guaranteeing fault tolerance under extreme memory pressure.

[0034] Embodiment 2

Figure 2 is a flowchart of a memory error recovery method based on a duplicate page chain according to Embodiment 2 of the present invention. This embodiment is an optimization based on the above embodiment. In this embodiment, S104 is specifically optimized as follows:
A load counter is maintained for each linked list node of the duplicate page chain to record the number of virtual addresses mapped to each node.
If allocating a new physical page fails, the node with the lowest load whose share count has not reached the limit is found based on the load counter and becomes the target node for load balancing mapping.
If the share count of the lowest-loaded node has reached the limit, the next lowest-loaded node (excluding the lowest-loaded node) is selected as the target node for load balancing mapping.
If the load counters of all nodes exceed the preset load limit threshold, a new physical page is forcibly allocated and added to the duplicate page chain as the target node for load balancing mapping.
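The target selection used when allocation of a new page fails can be sketched as follows. The dictionary keys, the function name `select_target`, and the use of `None` to signal that the caller must force-allocate a new page are assumptions; the text does not say what happens when every node has reached the share limit, so this sketch treats that case as forced allocation too.

```python
def select_target(nodes, share_limit, load_limit):
    """Pick the mapping target among chain nodes, or None to force allocation.

    Each node is a dict with "load" (mapped virtual addresses),
    "shares" (share count), and "faulty" (error flag).
    """
    healthy = [n for n in nodes if not n["faulty"]]
    # Every node overloaded (or no healthy node left): force a new page.
    if all(n["load"] > load_limit for n in healthy):
        return None
    # Walk nodes from lowest load upward, skipping those at the share limit.
    for node in sorted(healthy, key=lambda n: n["load"]):
        if node["shares"] < share_limit:
            return node
    return None  # assumption: all nodes at the share limit also forces allocation
```

Sorting by load and taking the first node under the share limit covers both the lowest-loaded case and the next-lowest fallback in a single pass.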

[0035] Accordingly, the memory error recovery method based on a duplicate page chain provided in this embodiment specifically includes: S201: in response to a detected hardware memory error, check whether the faulty memory page is a Kernel Samepage Merging (KSM) page.

[0036] Specifically, hardware memory errors are detected through the Machine Check Exception (MCE) hardware memory error detection mechanism. When a physical memory error is detected, the page flags of the faulty memory page are queried, and the red-black tree index maintained by KSM is used to determine whether the faulty memory page is a KSM page. A KSM duplicate page chain consists of at least two physical memory page nodes with the same content, and each node records the virtual address mapping information mapped to that node.

[0037] The health status of physical memory is monitored in real time through a hardware memory error detection mechanism. When the CPU's memory controller detects an Uncorrected Error (UCE) in a physical memory page, a Machine Check Exception (MCE) is triggered. The kernel's MCE handler is then called; it reads the error information from the machine check registers and locates the corresponding page descriptor (struct page) based on the physical address where the error occurred. Since the KSM mechanism sets specific flags on pages during page merging, the PG_ksm flag in the page's flags field is checked. If the PG_ksm flag is set, the page is a KSM merged page. The stable tree index maintained by KSM is then used for secondary confirmation: because KSM stores the root node of each duplicate page chain in a red-black tree, the hash value of the page content (calculated by KSM) is used as the key to search the red-black tree. If the corresponding node is found, the page is confirmed to belong to a KSM duplicate page chain, that is, it is a KSM page.

[0038] If the faulty memory page is not a KSM page, it means that the page does not have a redundant backup. In this case, the memory error recovery process is terminated, and the original memory_failure handling function is called. This function will traverse all processes mapped to the faulty page and send a SIGBUS signal.

[0039] S202, when the faulty memory page is a KSM page, search for the healthy page copy corresponding to the faulty memory page in the duplicate page chain to which the KSM page belongs.

[0040] Specifically, when the faulty memory page is a KSM page, all nodes in the duplicate page chain to which the KSM page belongs are traversed, the error flag and load count of each node are checked, and an available healthy page copy is found.

[0041] When the faulty memory page is confirmed to be a KSM page, all nodes in the linked list of the duplicate page chain are traversed, and the error flag of each node is checked to exclude faulty nodes that have experienced hardware errors. At the same time, the load count of each node, which records the number of virtual addresses currently mapped to the node, is checked. Nodes that are not marked as faulty and whose load count has not reached the upper limit are taken as usable healthy page copies.

[0042] If multiple physical failures result in no healthy redundant nodes remaining in the entire linked list, and no usable healthy page copy is found, the memory error recovery process is terminated, and the process reverts to the traditional memory error handling method.

[0043] S203, when a usable copy of a healthy page is found, retrieve the parameters of all affected processes mapped to the faulty memory page to determine the recovery strategy level.

[0044] Specifically, the cgroup control group flag, nice value, and scheduling policy of the affected process are queried. The recovery policies of the affected processes are divided into critical, important, normal, and low priority levels, and the corresponding recovery policies are executed for the affected processes classified into each level.

[0045] For each affected process, the system obtains its priority identifiers: the cgroup control group flag, used to query the control group (cgroup) to which the process belongs and determine whether it is marked as critical business; the nice value, which is the static priority value of the process (the smaller the nice value, the higher the priority); and the scheduling policy (e.g., SCHED_FIFO or SCHED_RR for real-time processes, SCHED_NORMAL for ordinary processes). The recovery strategy for the affected processes is then divided into critical, important, normal, and low priority levels. For example, processes marked as critical in their cgroup or using a real-time scheduling policy are automatically classified as critical: a new page is allocated first to create the recovery copy, and if allocation fails, direct mapping to an existing healthy node is attempted. Processes with a nice value < 0 are classified as important: a new page is allocated first for recovery; failing that, direct mapping is attempted or a SIGBUS signal is sent. Ordinary user processes are classified as normal: recovery may be attempted but is not guaranteed, and if recovery fails, a SIGBUS signal is sent to terminate the process. Background tasks with a nice value > 10 are classified as low priority and may be terminated directly so that they consume no recovery resources.
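The classification rules in this paragraph can be sketched as follows. The `ProcInfo` record and the `classify` function are hypothetical names. The text does not assign a level to nice values from 0 to 10 explicitly, so treating them as the normal level is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    CRITICAL = "critical"
    IMPORTANT = "important"
    NORMAL = "normal"
    LOW = "low"


# Real-time scheduling policies named in the description.
REALTIME_POLICIES = {"SCHED_FIFO", "SCHED_RR"}


@dataclass
class ProcInfo:
    cgroup_critical: bool  # cgroup marks the process as critical business
    nice: int              # static priority; smaller means higher priority
    policy: str            # scheduling policy name, e.g. "SCHED_NORMAL"


def classify(proc: ProcInfo) -> Level:
    """Map the three process parameters to a recovery strategy level."""
    if proc.cgroup_critical or proc.policy in REALTIME_POLICIES:
        return Level.CRITICAL
    if proc.nice < 0:
        return Level.IMPORTANT
    if proc.nice > 10:
        return Level.LOW
    return Level.NORMAL  # assumption: nice 0..10 without other markers


levels = [
    classify(ProcInfo(False, 0, "SCHED_FIFO")),
    classify(ProcInfo(False, -5, "SCHED_NORMAL")),
    classify(ProcInfo(False, 0, "SCHED_NORMAL")),
    classify(ProcInfo(False, 15, "SCHED_NORMAL")),
]
```

The cgroup and real-time checks come first so that a critical process with a high nice value is still treated as critical.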

[0046] S204, maintain a load counter for each linked list node of the repeating page chain to record the number of virtual addresses mapped to each node.

[0047] When multiple faulty memory pages need to be recovered simultaneously, if all mappings are transferred to the same healthy node, that node will be under excessive pressure and may become a new single point of failure. By introducing a linked list node load balancing algorithm, a load counter (load_count) is maintained for each linked list node of the duplicate page chain to record the number of virtual addresses mapped to each node. This is used for subsequent load judgment and load balancing. The counter is incremented when each new mapping is established and decremented when each mapping is terminated.
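As a minimal illustration of the counter maintenance described above, the sketch below increments the count when a mapping is established and decrements it when a mapping is terminated. The class and method names are hypothetical, and the guard against underflow is an added assumption.

```python
from dataclasses import dataclass


@dataclass
class ChainNode:
    pfn: int             # hypothetical page frame number
    load_count: int = 0  # number of virtual addresses mapped to this node

    def map_address(self) -> None:
        """A new mapping to this node was established."""
        self.load_count += 1

    def unmap_address(self) -> None:
        """An existing mapping to this node was terminated."""
        if self.load_count == 0:
            raise ValueError("no mapping to remove")
        self.load_count -= 1


node = ChainNode(pfn=200)
node.map_address()
node.map_address()
node.unmap_address()
```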

[0048] S205, if allocating a new physical page fails, then find the node with the lowest load and whose sharing count has not reached the upper limit based on the load counter, in order to determine the target node for load balancing mapping.

[0049] If the allocation of new physical pages fails due to reasons such as severe system memory shortage, blind mapping transfer will no longer be performed. Instead, the node with the lowest load and whose share count has not reached the upper limit (max_share_count) will be found based on the load counter and identified as the target node for mapping transfer in order to balance the load of each node.

[0050] S206, if the share count of the lowest-load node has reached the upper limit, select the next-lowest-load node and determine it as the target node for load balancing mapping.

[0051] If the node with the lowest load has reached its sharing limit, it is skipped and the node with the next-lowest load is examined, again confirming whether its sharing limit has been reached. This continues until a node is found that has the lowest load among the nodes that have not reached the sharing limit; that node is used as the target node for load balancing. The sharing limit can also be tied to the number of nodes in the linked list: when there are many nodes, `max_share_count` can be set lower (e.g., 512), and when there are few nodes, it can be set higher (e.g., 1024).
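The selection in S205 and S206 can be sketched as a scan over the nodes in ascending load order. The names `pick_target` and `max_share_for` are hypothetical, and the cutoff of 8 nodes between "many" and "few" is an assumption; the description only gives the example limits 512 and 1024.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChainNode:
    pfn: int          # hypothetical page frame number
    load_count: int   # virtual addresses currently mapped to the node
    share_count: int  # how many times the node has been shared


def max_share_for(node_count: int, many_nodes: int = 8) -> int:
    """Lower sharing limit for long chains, higher for short ones.

    The many_nodes cutoff is an assumed value, not taken from the source.
    """
    return 512 if node_count >= many_nodes else 1024


def pick_target(nodes: List[ChainNode],
                max_share_count: int) -> Optional[ChainNode]:
    """Return the lowest-load node whose share count is below the limit."""
    for node in sorted(nodes, key=lambda n: n.load_count):
        if node.share_count < max_share_count:
            return node
    return None  # every node has reached its sharing limit


nodes = [
    ChainNode(pfn=1, load_count=5, share_count=1024),  # lowest load, but full
    ChainNode(pfn=2, load_count=9, share_count=10),
    ChainNode(pfn=3, load_count=20, share_count=3),
]
limit = max_share_for(len(nodes))
target = pick_target(nodes, limit)
```

Here node 1 has the lowest load but is skipped because its share count has reached the limit, so node 2 becomes the target.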

[0052] S207, if the load counters of all nodes exceed the preset load limit threshold, force the allocation of new physical pages and add them to the duplicate page chain as the target nodes for load balancing mapping.

[0053] To maintain list redundancy, if the load counters of all healthy nodes exceed the preset load limit threshold (default is 80% of max_share_count), it indicates that the existing healthy nodes are close to their capacity limit. Adding more mappings to these nodes would result in an excessively high single point of failure risk. In this case, even if memory is tight, a new physical page will be forcibly allocated and added to the duplicate page chain as the target node for load balancing mapping.
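The trigger condition in S207 reduces to a single comparison per node. The sketch below assumes "exceed" means strictly greater than the threshold and uses the default ratio of 80% from the description; the function name is hypothetical.

```python
from typing import Sequence


def must_force_allocate(load_counts: Sequence[int],
                        max_share_count: int,
                        ratio: float = 0.8) -> bool:
    """True when every healthy node's load exceeds ratio * max_share_count."""
    if not load_counts:
        # Assumption: with no healthy node there is nothing to balance,
        # so this check does not apply.
        return False
    limit = ratio * max_share_count
    return all(count > limit for count in load_counts)


all_near_capacity = must_force_allocate([410, 500], max_share_count=512)
one_has_headroom = must_force_allocate([410, 100], max_share_count=512)
```

With `max_share_count` set to 512, the threshold is 409.6, so a chain whose nodes carry 410 and 500 mappings triggers forced allocation, while a chain with one node at 100 does not.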

[0054] S208, modify the page table mappings of all affected processes mapped to the faulty memory page so that they point to the recovery copy page, flush the translation lookaside buffer (TLB), update the load count and share count of the recovery copy page, and mark and isolate the faulty memory page.

[0055] In one optional implementation of this embodiment, step S208 further includes: calling the reverse mapping traversal function to modify the page table entries of all affected processes corresponding to the faulty memory page so that they point to the recovery copy page, flushing the TLB so that the page table entry modifications take effect, and updating the load count and share count of the recovery copy page.

[0056] When establishing the new mapping relationship, the system calls the reverse mapping traversal function rmap_walk() to traverse the reverse mapping linked list of the faulty memory page, modifies the page table entries (PTEs) of all affected processes corresponding to the faulty memory page so that they point to the previously established recovery copy page, and flushes the TLB so that the page table entry modifications take effect immediately. At the same time, the load count and share count of the recovery copy page are updated.

[0057] The `set_page_hwpoison` flag function is called to mark the faulty memory page so that the fault is isolated.

[0058] To prevent the faulty memory page from being reused, its PageHWPOISON flag is set by calling the set_page_hwpoison flag function, and the faulty node is removed from the duplicate page chain, thereby isolating the fault.
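The remapping and isolation in S208 can be modeled in user space with dictionaries standing in for the page tables and the reverse mapping. This is a simplified simulation, not kernel code: the `Page` type, the dictionary layout, and the function name are assumptions, and the TLB flush is only noted in a comment because the model has no TLB.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Mapping = Tuple[int, int]  # (pid, virtual address), a hypothetical key


@dataclass
class Page:
    pfn: int
    load_count: int = 0
    share_count: int = 0
    hwpoison: bool = False  # stands in for the hardware-poison page flag


def remap_and_isolate(page_table: Dict[Mapping, int],
                      rmap: Dict[int, List[Mapping]],
                      chain: List[Page],
                      faulty: Page,
                      recovery: Page) -> int:
    """Move every mapping of the faulty page to the recovery page."""
    mappings = rmap.pop(faulty.pfn, [])
    for key in mappings:
        page_table[key] = recovery.pfn  # rewrite the page table entry
    rmap.setdefault(recovery.pfn, []).extend(mappings)
    # A real kernel would flush the TLB here so the new entries take effect.
    recovery.load_count += len(mappings)
    recovery.share_count += len(mappings)
    faulty.hwpoison = True  # mark the page so it is never reused
    faulty.load_count = 0
    chain[:] = [p for p in chain if p is not faulty]  # drop the faulty node
    return len(mappings)


faulty = Page(pfn=10, load_count=2, share_count=2)
recovery = Page(pfn=11)
chain = [faulty, recovery]
page_table = {(1, 0x1000): 10, (2, 0x2000): 10}
rmap = {10: [(1, 0x1000), (2, 0x2000)]}
moved = remap_and_isolate(page_table, rmap, chain, faulty, recovery)
```

Both counts on the recovery page grow by the number of moved mappings, and the faulty page leaves the chain only after all of its mappings have been redirected.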

[0059] If allocating a new physical page fails, the page tables of all affected processes mapped to the faulty memory page are modified to map to the target node, the translation lookaside buffer (TLB) is flushed, and the load count and share count of the target node are updated.

[0060] If the allocation of a new physical page fails because of a severe system memory shortage, the reverse mapping traversal function is invoked to modify the page table entries of all affected processes corresponding to the faulty page so that they point to the previously identified target node. The TLB is then flushed so that the page table entry modifications take effect immediately. At the same time, the load count and share count of the target node are increased by the number of affected processes. This ensures that even under extreme memory pressure, when new pages cannot be allocated, the mappings of the faulty memory page can still be migrated to existing healthy nodes through load balancing mapping, preventing single-node overload or system crashes.

[0061] This embodiment employs a dual confirmation method combining the MCE hardware memory error detection mechanism with page flags and KSM red-black tree indexes to accurately determine whether an erroneous page is a KSM page. It then checks the error flags and load counts within the corresponding duplicate page chain to filter available healthy page replicas. By acquiring the cgroup control group flags, nice values, and scheduling policies of the affected processes, it categorizes processes into recovery strategy levels and executes priority-differentiated recovery strategies accordingly. Furthermore, it calculates and dynamically adjusts the linked list formation threshold of the duplicate page chain based on the current system memory pressure. During erroneous memory page relocation and page table remapping, a linked list node load balancing algorithm is introduced, prioritizing healthy nodes with the lowest load counts that have not reached the sharing limit as mapping targets. Finally, it triggers different levels of page replenishment processes by real-time monitoring and calculation of the health indicators of the duplicate page chain. The dual confirmation mechanism of MCE detection and KSM ensures that the recovery process is triggered only when a KSM page with redundant copies fails. The hierarchical recovery strategy based on process priority prioritizes critical business processes, ensuring the continuity of core business. The load balancing and sharing limit mechanism effectively distributes the access pressure caused by the remapping of faulty memory pages, avoiding the risk of healthy nodes becoming new single points of failure due to overload during failover, thus improving system stability. 
The forced allocation of new pages mechanism actively supplements new nodes when existing healthy nodes are close to their capacity limits, maintaining the redundancy of the linked list and ensuring that the recovery operation itself does not introduce new single points of failure. This enables the duplicate page chain to have the ability to automatically maintain and actively repair itself, ensuring the long-term effectiveness of this fault-tolerant system.

[0062] Embodiment 3. Figure 3 is a schematic diagram of a memory error recovery device based on a repeating page chain according to Embodiment 3 of the present invention. In this embodiment, the memory error recovery device based on a repeating page chain includes: an error checking module 810, used to check whether the faulty memory page is a Kernel Samepage Merging (KSM) page in response to a detected hardware memory error; a replica lookup module 820, used to search for a healthy page copy corresponding to the faulty memory page in the duplicate page chain to which the KSM page belongs when the faulty memory page is a KSM page; a recovery strategy grading module 830, used to obtain the parameters of all affected processes mapped to the faulty memory page to determine the recovery strategy level when an available healthy page copy is found; a recovery copy allocation module 840, used to allocate new physical memory pages according to the recovery strategy level, add the newly allocated physical memory pages to the linked list of the duplicate page chain, copy data from the healthy page copy, initialize the load count and share count information, and form a recovery copy page for the faulty memory page; and a recovery page update module 850, used to modify the page table mappings of all affected processes mapped to the faulty memory page so that they point to the recovery copy page, flush the translation lookaside buffer (TLB), update the load count and share count of the recovery copy page, and mark and isolate the faulty memory page.

[0063] In this embodiment, the error checking module detects hardware memory errors and confirms whether the faulty page is a KSM page. The replica lookup module searches for healthy page copies in the corresponding duplicate page chain. The recovery strategy grading module obtains the parameters of the affected processes to determine the recovery strategy level. The recovery copy allocation module allocates the corresponding physical memory page and copies the data from the healthy page copy to form a recovery copy page. The recovery page update module modifies the page table mappings of the affected processes and flushes the translation lookaside buffer (TLB) so that the changes take effect. The dynamic threshold mechanism lowers the linked list formation threshold from a fixed 256 pages to a minimum of 3 pages, allowing more duplicate pages to receive fault tolerance protection when memory is sufficient and expanding the applicability of memory error recovery. Periodic degradation marking avoids frequent creation and dissolution of linked lists caused by short-term fluctuations in memory pressure, reducing system performance jitter. The health grading response and proactive page replenishment mechanism give the linked list self-maintenance capabilities, keeping its redundancy at a safe level and overcoming the problem of linked list nodes gradually being consumed by faults until the list fails. The memory reservation pool mechanism ensures that the linked list can still obtain replenishment pages even in the OOM critical state, guaranteeing fault tolerance under extreme memory pressure.
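The periodic degradation marking mentioned above follows a two-step rule stated in claim 2: a chain that falls below the recalculated formation threshold is first marked as degraded, and it is disbanded only if it is still below the threshold at the next calculation. A minimal sketch of that rule follows. The `Chain` type and function name are hypothetical, and clearing the mark when a chain meets the threshold again is an assumption not spelled out in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chain:
    name: str
    node_count: int
    degraded: bool = False


def end_of_scan_cycle(chains: List[Chain], threshold: int) -> List[Chain]:
    """Apply one threshold check and return the chains that survive it."""
    survivors = []
    for chain in chains:
        if chain.node_count >= threshold:
            chain.degraded = False  # assumption: a recovered chain loses the mark
            survivors.append(chain)
        elif not chain.degraded:
            chain.degraded = True   # first miss: mark, but keep the chain
            survivors.append(chain)
        # second consecutive miss: the chain is disbanded (not kept)
    return survivors


chains = [Chain("a", node_count=40), Chain("b", node_count=5)]
after_first = end_of_scan_cycle(chains, threshold=32)
after_second = end_of_scan_cycle(after_first, threshold=32)
```

Because a single miss only sets the mark, a brief rise in the threshold during a short burst of memory pressure does not immediately dissolve the chain.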

[0064] The memory error recovery device based on repeated page chains provided in this invention can execute the memory error recovery method based on repeated page chains provided in any embodiment of this invention, and has the corresponding functional modules and beneficial effects of the method.

[0065] Embodiment 4. Figure 4 is a structural diagram of an electronic device according to Embodiment 4 of the present invention. Figure 4 shows a block diagram of an exemplary electronic device 12 suitable for implementing embodiments of the present invention. The electronic device 12 shown in Figure 4 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.

[0066] As shown in Figure 4, the electronic device 12 takes the form of a general-purpose computing device. The components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and bus 18 connecting different system components (including system memory 28 and processing unit 16).

[0067] Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

[0068] Electronic device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by electronic device 12, including volatile and non-volatile media, removable and non-removable media.

[0069] System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 4; commonly referred to as a "hard drive"). Although not shown in Figure 4, a disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 via one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present invention.

[0070] A program / utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. Program modules 42 typically perform the functions and / or methods described in the embodiments of the present invention.

[0071] Electronic device 12 can also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device 12 to communicate with one or more other computing devices. This communication can be performed through input/output (I/O) interface 22. Furthermore, electronic device 12 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown in Figure 4, network adapter 20 communicates with other modules of electronic device 12 via bus 18. It should be understood that, although not shown in Figure 4, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0072] The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing the memory error recovery method based on repeated page chains provided in the embodiments of the present invention.

[0073] Embodiment 5. Embodiment 5 of the present invention also provides a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the memory error recovery method based on repeated page chains as provided in the above embodiments.

[0074] The computer storage medium of this invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0075] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, capable of sending, propagating, or transmitting programs for use by or in connection with an instruction execution system, apparatus, or device.

[0076] The program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.

[0077] Computer program code for performing the operations of this invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0078] Note that the above description is merely a preferred embodiment of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and may include many other equivalent embodiments without departing from the concept of the present invention, the scope of which is determined by the scope of the appended claims.

Claims

1. A memory error recovery method based on repeated page chains, characterized in that it includes: S101, in response to a detected hardware memory error, check whether the faulty memory page is a Kernel Samepage Merging (KSM) page; S102, when the faulty memory page is a KSM page, search for the healthy page copy corresponding to the faulty memory page in the duplicate page chain to which the KSM page belongs; S103, when an available healthy page copy is found, obtain the parameters of all affected processes mapped to the faulty memory page to determine the recovery strategy level; S104, allocate new physical memory pages according to the recovery strategy level, add the newly allocated physical memory pages to the linked list of the duplicate page chain, copy data from the healthy page copy, initialize the load count and share count information, and form a recovery copy page of the faulty memory page; S105, modify the page table mappings of all affected processes mapped to the faulty memory page so that they point to the recovery copy page, flush the translation lookaside buffer (TLB), update the load count and share count of the recovery copy page, and mark and isolate the faulty memory page.

2. The method according to claim 1, characterized in that the method further includes: calculating the dynamic pressure threshold of the duplicate page chain based on the current system memory pressure, and dynamically adjusting the linked list formation threshold of the duplicate page chain based on the dynamic pressure threshold, where the linked list formation threshold includes at least 3 pages, 32 pages, 128 pages, and 256 pages; when Kernel Samepage Merging finds KSM pages with the same content, determining whether to create a duplicate page chain based on the current linked list formation threshold, and initializing the load count and share count for each linked list node of the created duplicate page chain; each time a KSM scan completes a full cycle, recalculating the linked list formation threshold once, checking the number of linked list nodes of existing duplicate page chains against the recalculated threshold, marking as degraded any linked list whose number of nodes is lower than the recalculated threshold, and, if the number of nodes is still lower than the threshold at the next threshold calculation, disbanding the duplicate page chain corresponding to that linked list; calculating, based on the dynamic pressure threshold, the health index of the linked list of the duplicate page chain where the faulty memory page is located, where the health index includes the levels healthy, attention, warning, and danger; comparing the health index with the preset page replenishment threshold and, based on the comparison result, performing the corresponding page replenishment operation on the duplicate page chain; when the health index drops to the danger level, allocating new physical memory pages from the pre-configured memory reservation pool; comparing the health index with the preset target health index to determine the number of nodes to add in the page replenishment operation; and adding the replenished new physical memory pages to the linked list as new nodes and initializing their load count and share count.

3. The method according to claim 1, characterized in that S101 includes: detecting hardware memory errors through the MCE hardware memory error detection mechanism; when a physical memory error is detected, querying the page flag of the faulty memory page and using the red-black tree index maintained by Kernel Samepage Merging to determine whether the faulty memory page is a KSM page, where the KSM page consists of at least two physical memory page nodes with the same content, and each node records the virtual address mapping information mapped to that node; and, if the faulty memory page is not a KSM page, terminating the memory error recovery process.

4. The method according to claim 1, characterized in that S102 includes: when the faulty memory page is a KSM page, traversing all nodes in the duplicate page chain to which the KSM page belongs, checking the error flag and load count of each node, and finding an available healthy page copy; and, if no usable healthy page copy is found, terminating the memory error recovery process.

5. The method according to claim 1, characterized in that S103 further includes: querying the cgroup control group flag, nice value, and scheduling policy of each affected process, classifying the recovery strategy of the affected processes into critical, important, normal, and low priority levels, and executing the recovery strategy of the corresponding level for the affected processes classified into each level.

6. The method according to claim 1, characterized in that S104 further includes: maintaining a load counter for each linked list node of the repeating page chain to record the number of virtual addresses mapped to each node; if allocating a new physical page fails, finding, based on the load counter, the node with the lowest load whose share count has not reached the limit, to determine the target node for load balancing mapping; if the share count of the lowest-load node has reached the limit, selecting the next-lowest-load node as the target node for load balancing mapping; and, if the load counters of all nodes exceed the preset load limit threshold, forcibly allocating new physical pages from the pre-configured memory reservation pool and adding them to the duplicate page chain as the target node for load balancing mapping.

7. The method according to claim 6, characterized in that S105 further includes: calling the reverse mapping traversal function to modify the page table entries of all affected processes corresponding to the faulty memory page so that they point to the recovery copy page, flushing the translation lookaside buffer (TLB) so that the page table entry modifications take effect, and updating the load count and share count of the recovery copy page; calling the `set_page_hwpoison` flag function to mark the faulty memory page so that the fault is isolated; and, if allocating a new physical page fails, modifying the page tables of all affected processes mapped to the faulty memory page to map to the target node, flushing the TLB, and updating the load count and share count of the target node.

8. A memory error recovery apparatus based on a repeating page chain, used to implement the memory error recovery method based on a repeating page chain as described in any one of claims 1-7, characterized in that it includes: an error checking module, used to check whether the faulty memory page is a Kernel Samepage Merging (KSM) page in response to a detected hardware memory error; a replica lookup module, used to find a healthy page copy corresponding to the faulty memory page in the duplicate page chain to which the KSM page belongs when the faulty memory page is a KSM page; a recovery strategy grading module, used to obtain the parameters of all affected processes mapped to the faulty memory page to determine the recovery strategy level when an available healthy page copy is found; a recovery copy allocation module, used to allocate new physical memory pages according to the recovery strategy level, add the newly allocated physical memory pages to the linked list of the duplicate page chain, copy data from the healthy page copy, initialize the load count and share count information, and form a recovery copy page for the faulty memory page; and a recovery page update module, used to modify the page table mappings of all affected processes mapped to the faulty memory page so that they point to the recovery copy page, flush the translation lookaside buffer (TLB), update the load count and share count of the recovery copy page, and mark and isolate the faulty memory page.

9. An electronic device, characterized in that the electronic device includes: one or more processors; and a storage device for storing one or more programs, where, when the one or more programs are executed by the one or more processors, the one or more processors implement the memory error recovery method based on repeated page chains as described in any one of claims 1-7.

10. A storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the memory error recovery method based on a repeating page chain as described in any one of claims 1-7.