Memory fault processing method, heterogeneous system, medium and product

By isolating the erroneous global address when a memory failure occurs and triggering hardware repair when the system allows, the problem of memory failure repair affecting system efficiency in existing technologies is solved, achieving efficient handling of memory failures and improving application performance and stability.

CN122240364A · Pending · Publication Date: 2026-06-19 · CAMBRIAN (KUNSHAN) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CAMBRIAN (KUNSHAN) INFORMATION TECH CO LTD
Filing Date
2024-12-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies require stopping all applications during memory fault repair to ensure data access security, resulting in low computing system efficiency.

Method used

By obtaining memory error information, the global address of the error is determined and isolated to prevent application use. Then, when the system load allows, a hardware repair mechanism is triggered to complete the repair and return of the address.

Benefits of technology

Memory errors are isolated and repaired without degrading system performance; address holes perceptible to the application are eliminated, the amount of available memory is increased, and the performance and stability of the application are ensured.



Abstract

This application provides a memory fault handling method, a heterogeneous system, a medium, and a product. The method includes: acquiring memory error information, wherein the error information represents an erroneous physical address; acquiring the erroneous global address corresponding to the error information, and performing an isolation operation on the erroneous global address; wherein the global address is a physical address managed by a software-view physical memory allocator; triggering a hardware repair mechanism, and upon receiving a successful repair notification, returning the erroneous global address to the physical memory allocator. This method aims to reduce the impact of memory fault handling on the operating efficiency of a computing system.

Description

Technical Field

[0001] This application relates to computer technology, and more particularly to a memory fault handling method, heterogeneous system, medium, and product.

Background Technology

[0002] Memory is a crucial component of a computer, used to temporarily store data and programs used in computation. The performance of the memory directly impacts the overall performance of the computing system. As modern computing systems use increasingly larger amounts of memory, the likelihood of memory errors is also rising, potentially affecting system performance.

[0003] In existing technologies, hardware repair mechanisms are often used to repair memory addresses at which errors have occurred. However, the hardware repair process requires stopping all applications and resetting the computing system to ensure that no data is accessed during the process. It therefore greatly reduces the operating efficiency of the computing system.

Summary of the Invention

[0004] This application provides a memory fault handling method, a heterogeneous system, a medium, and a product to reduce the impact of memory fault handling on the operating efficiency of a computing system.

[0005] On one hand, embodiments of this application provide a memory fault handling method applied to heterogeneous systems, the method comprising:

[0006] Obtain memory error information, wherein the error information represents the faulty physical address;

[0007] Obtain the erroneous global address corresponding to the error information, and perform an isolation operation on the erroneous global address; wherein the global address is a physical address managed by the software-view physical memory allocator;

[0008] The hardware repair mechanism is triggered, and upon receiving a successful repair notification, the erroneous global address is returned to the physical memory allocator.

[0009] In one possible implementation, the isolation operation on the erroneous global address includes:

[0010] Based on a specific address range, an isolation operation is performed on the portion of memory where the erroneous global address is located;

[0011] Returning the erroneous global address to the physical memory allocator includes:

[0012] Based on a specific address range, the portion of memory containing the erroneous global address is returned to the physical memory allocator.

[0013] In one possible implementation, the specific address range is the address range corresponding to the page.

[0014] In one possible implementation, the heterogeneous system includes a work queue; prior to triggering the hardware repair mechanism, it further includes:

[0015] Add the isolated global address of the error to the work queue;

[0016] Triggering the hardware repair mechanism includes:

[0017] The work queue is woken up, and the isolated global addresses of errors in the work queue are passed to the hardware repair unit.

[0018] In one possible implementation, waking up the work queue includes:

[0019] Detect the currently available memory on the device side of the heterogeneous system;

[0020] If the available memory is greater than a preset threshold, the work queue is woken up when the load on the heterogeneous system is lower than a preset load.

[0021] If the available memory is not greater than a preset threshold, the work queue is woken up after the isolated global error address is added to the work queue.

[0022] In one possible implementation, passing the isolated global address of the error in the work queue to the hardware repair unit includes:

[0023] The isolated global addresses of errors are traversed in the work queue, and the isolated global addresses of errors are passed to the hardware repair unit.

[0024] In one possible implementation, the method further includes:

[0025] If a repair error message is received, the global address corresponding to the repair error message will be marked as repair failed.

[0026] The step of traversing the isolated global addresses of errors in the work queue includes:

[0027] If an error global address is marked as a failed repair, then that error global address is skipped.

[0028] In one possible implementation, obtaining the global error address corresponding to the error information includes:

[0029] Based on the mapping relationship between physical addresses and global addresses, the global address corresponding to the error message is determined.

[0030] In one possible implementation, the isolation operation on the erroneous global address includes:

[0031] The erroneous global address is marked as unavailable, and the application using the erroneous global address is terminated, so that the erroneous global address is isolated by the physical memory allocator after being released from the application.

[0032] In one possible implementation, the hardware repair mechanism includes one or more of the following: address replacement at the bit level, address replacement at the cell level, address replacement at the row level, address replacement at the column level, address replacement at the bank level, address replacement at the device level, address replacement at the rank level, address replacement at the channel level, or address replacement of a specific address range in memory.

[0033] On the other hand, embodiments of this application provide a heterogeneous system, including: interconnected hosts and devices;

[0034] The host is configured to: acquire memory error information, wherein the error information represents an erroneous physical address; acquire the erroneous global address corresponding to the error information, and perform an isolation operation on the erroneous global address; wherein the global address is a physical address managed by the software-view physical memory allocator; and trigger a hardware repair mechanism.

[0035] The device is used to perform hardware repair processing on isolated erroneous global addresses;

[0036] The host is also configured to return the erroneous global address to the physical memory allocator after receiving a successful repair notification.

[0037] In another aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the method described above.

[0038] In another aspect, embodiments of this application provide a computer program product, including computer execution instructions, which, when executed by a processor, are used to implement the method described above.

[0039] In the memory fault handling method, heterogeneous system, medium, and product provided in this application, when a memory error occurs, the corresponding global address of the error is first determined based on the error information and isolated, so that the global address of the error will not be used by the application. At the same time, there is no need to reset the heterogeneous system, thus avoiding impacting the system's operating efficiency. After isolating the global address of the error, a hardware repair mechanism is triggered based on the system's operating status to complete the repair of the global address of the error. This can eliminate address gaps that are perceptible to the application, increase the amount of available memory, and ensure the performance and stability of the application.

Description of the Drawings

[0040] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0041] Figure 1 is a flowchart of a memory fault handling method provided in an embodiment of this application.

[0042] Figure 2 is a schematic diagram of a memory fault handling scenario provided in an embodiment of this application.

[0043] Figure 3 is a flowchart of another memory fault handling method provided in an embodiment of this application.

[0044] Figure 4 is a structural schematic of a heterogeneous system provided in an embodiment of this application.

[0045] Figure 5 is a structural schematic of another heterogeneous system provided in an embodiment of this application.

[0046] Figure 6 is a structural schematic of an electronic device provided in an embodiment of this application.

[0047] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0048] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0049] In this application, a module refers to a functional module or a logical module. It can be in software form, where its function is implemented by a processor executing program code; or it can be in hardware form. "And/or" describes the relationship between related objects and indicates that three relationships can exist. For example, A and/or B can represent: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the preceding and following objects.

[0050] First, the terms used in the embodiments of this application will be explained.

[0051] A graphics processing unit (GPU), also known as a display core, display chip, video processor, or graphics card, is a hardware processor used for processing images and graphics calculations. It can also perform some non-graphics-related high-performance computing tasks.

[0052] Neural Processing Unit (NPU): A hardware processor specifically designed to accelerate neural network operations, primarily used for artificial intelligence (AI) tasks, especially deep learning-related computations.

[0053] Dynamic Random Access Memory (DRAM): A commonly used type of volatile memory that allows the processor to quickly access data and instructions. High Bandwidth Memory (HBM): A high-performance memory based on 3D stacking technology, characterized by high bandwidth and low latency.

[0054] ECC (Error Checking and Correcting) is a memory error correction technology that uses redundant bits to detect and correct errors in the data.

[0055] Memory, such as DRAM or HBM, is a crucial component of a computer. It temporarily stores data and programs used in the computing system, and its performance directly impacts the overall performance of the computing system. As modern computing systems use increasingly larger amounts of memory, the likelihood of memory errors is also rising, potentially affecting system performance.

[0056] In existing technologies, hardware repair mechanisms are often used to repair memory addresses at which errors have occurred. However, the hardware repair process requires stopping all applications and resetting the computing system to ensure that no data is accessed during the process. It therefore greatly reduces the operating efficiency of the computing system.

[0057] This application provides a memory fault handling method applied to a heterogeneous system comprising a host and devices. When a memory error occurs, the host first determines the corresponding erroneous global address based on the error information and isolates the erroneous global address, preventing it from being used by the application. This eliminates the need to reset the heterogeneous system, thus avoiding impact on system efficiency. After isolating the erroneous global address, a hardware repair mechanism is triggered based on system operation to repair the erroneous global address. This eliminates address gaps perceptible to the application, increases the amount of available memory, and ensures application performance and stability.

[0058] The technical solutions of this application are illustrated below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0059] Figure 1 is a flowchart illustrating a memory fault handling method provided in an embodiment of this application. The method is applied to a heterogeneous system, which may include a host (software side) and a device (hardware side). The host may include a CPU and host memory, while the device may include a processor such as a GPU or NPU and device memory. The host and device processors have different instruction set architectures. The device memory may be DRAM, HBM, or the like. The device memory may include an error checking module, such as an ECC module, which can detect and correct errors occurring in the device memory and can transmit the erroneous physical address to the host. Furthermore, the device may also include a hardware repair unit for replacing and repairing the memory region where the memory error occurred. For example, an HBM design may implement row-level address replacement to replace and repair the erroneous memory region.

[0060] The host may include an address translation unit, a physical memory allocator, and an isolation unit. The address translation unit translates and converts erroneous physical addresses transmitted by the device, enabling the host to further process the erroneous physical addresses. The physical memory allocator is used to implement memory allocation; the isolation unit is used to isolate erroneous physical addresses, preventing applications from continuing to use that portion of storage space.

[0061] As shown in Figure 1, the memory fault handling method of the embodiments of this application can be executed by a host in a heterogeneous system. The method may include:

[0062] S101, retrieve memory error information, including the faulty physical address.

[0063] Figure 2 is a schematic diagram illustrating a memory fault handling scenario provided in an embodiment of this application. As shown in Figure 2, in a specific implementation, the error verification module checks whether a memory error has occurred and, when one is detected, sends a memory error interrupt notification to the host. The notification carries the memory error information. After receiving the notification, the host parses the memory error information, which includes the erroneous physical address.

[0064] For example, the memory may include, but is not limited to, DRAM, HBM, etc., the error verification module may be an ECC verification module, and the error information may include, but is not limited to, ECC error information, etc. The selection can be made according to production needs, and there are no restrictions on it here.

[0065] It can be understood that a physical address here refers to an address within a specific memory. When the memory is DRAM, an erroneous physical address can include: the erroneous address within the memory channel, the memory channel ID, and DRAM interleaving information.

[0066] S102, obtain the global address of the error corresponding to the error information, and perform an isolation operation on the global address of the error; where the global address is the physical address managed by the physical memory allocator from the software perspective.

[0067] In practical implementation, based on the erroneous physical address in the error information, the host can obtain the corresponding software-view physical address, that is, the erroneous global address, and then isolate it. Once the erroneous global address is isolated, it will not be used by the application, which prevents memory errors from affecting application performance and avoids program crashes or security vulnerabilities.

[0068] It can be understood that the global address is different from the physical address described above. The global address is equivalent to a global physical address, for example an address in the global physical address space formed by multiple memories.

[0069] S103, trigger the hardware repair mechanism and, upon receiving a successful repair notification, return the erroneous global address to the physical memory allocator.

[0070] In the specific implementation, since the erroneous global address has been isolated, the application will no longer use it. Therefore, a hardware repair mechanism can be triggered as needed, enabling the device-side hardware repair unit to replace and repair the erroneous physical address in the device-side memory based on the host's trigger signal. Optionally, the hardware repair unit can execute one or more of the following hardware repair mechanisms based on the host's trigger signal: bit-level address replacement, cell-level address replacement, row-level address replacement, column-level address replacement, bank-level address replacement, device-level address replacement, rank-level address replacement, channel-level address replacement, or address replacement of a specific address range in memory. Other hardware repair mechanisms can also be selected according to actual production needs; no restrictions are imposed here.

[0071] A successful repair notification received by the host indicates that the memory error has been repaired. The repaired global address can then be returned to the physical memory allocator so that the allocator can allocate it again and the application can resume using it. This eliminates address gaps that are perceptible to the application, increases the amount of available memory, avoids the impact of insufficient available memory on business performance, and ensures the performance and stability of the application.

[0072] In the embodiments of this disclosure, when a memory error occurs, the corresponding global address of the error is first determined based on the error information and isolated, so that the global address of the error will not be used by the application. At the same time, there is no need to reset the heterogeneous system, thus avoiding the impact on the system's operating efficiency. After isolating the global address of the error, a hardware repair mechanism is triggered based on the system's operating status to complete the repair of the global address of the error. This can eliminate address gaps that are perceptible to the application, increase the amount of available memory, and ensure the performance and stability of the application.

[0073] In one possible implementation, the host performs isolation operations on erroneous global addresses, including:

[0074] The host performs isolation operations on erroneous global addresses based on a specific address range;

[0075] The host returns the erroneous global address to the physical memory allocator, including:

[0076] The host returns the erroneous global address to the physical memory allocator based on a specific address range.

[0077] As shown in Figure 2, in a specific implementation, the host can define a specific address range, isolate the memory space corresponding to the erroneous global address in units of that range, and, after the erroneous global address is repaired, return that memory space to the physical memory allocator so that it can continue to be used by the application. Operating in units of a specific address range makes the operation more convenient and efficient, and because the range can be chosen freely, it also makes memory fault handling more flexible.

[0078] For example, a specific address range can be the address range corresponding to a page. The host performing isolation operations on the specific address range containing the erroneous global address can include isolating the memory page containing the erroneous global address, thereby achieving page retirement. The host returning the specific address range containing the erroneous global address to the physical memory allocator can include returning the memory page containing the erroneous global address to the physical memory allocator.

[0079] In memory paging management, physical memory can be divided into multiple fixed-size memory blocks, called physical pages or page frames. Correspondingly, software-level virtual memory can be divided into multiple blocks of the same size, called virtual pages, with a mapping relationship between pages and page frames. Generally, the size of a page is a power of 2. The host's physical memory manager supports the memory paging management mechanism and can manage physical memory based on the mapping relationship between pages and page frames. In the method of this disclosure embodiment, the host uses pages as the smallest unit to isolate and release erroneous global addresses, thereby achieving management such as physical memory isolation.
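A minimal sketch of page-granular isolation and return, under stated assumptions: the 4 KiB page size, the `PageAllocator` class, and its method names are hypothetical and are not taken from this application. It only illustrates that any erroneous global address inside a page retires the whole page, and that returning the page restores the available memory.

```python
PAGE_SIZE = 4096  # assumed page size; must be a power of two


def page_base(global_addr, page_size=PAGE_SIZE):
    """Round a global address down to the start of its page."""
    return global_addr & ~(page_size - 1)


class PageAllocator:
    """Toy stand-in for a physical memory allocator that tracks pages by base address."""

    def __init__(self, num_pages, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.free = {i * page_size for i in range(num_pages)}
        self.isolated = set()

    def isolate(self, global_addr):
        """Retire the page containing global_addr so it can no longer be allocated."""
        base = page_base(global_addr, self.page_size)
        self.free.discard(base)
        self.isolated.add(base)
        return base

    def give_back(self, global_addr):
        """Return a repaired page to the free pool; no effect if it was not isolated."""
        base = page_base(global_addr, self.page_size)
        if base in self.isolated:
            self.isolated.remove(base)
            self.free.add(base)
        return base

    def available_bytes(self):
        return len(self.free) * self.page_size
```

Because the page size is a power of two, masking off the low bits is enough to find the page that contains the erroneous address.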

[0080] Figure 3 is a flowchart illustrating another memory fault handling method provided in an embodiment of this application. As shown in Figure 2 and Figure 3, in one possible implementation, before the host triggers the hardware repair mechanism, the following step is also included:

[0081] The host adds the isolated error global address to the work queue;

[0082] The host triggers a hardware repair mechanism, including:

[0083] The host wakes up the work queue and passes the isolated global addresses of errors in the work queue to the hardware repair unit.

[0084] In a practical implementation, the host can manage isolated global error addresses through a work queue. After isolating a global error address, the host can add it to the work queue. This work queue can be a host-maintained FIFO (First In, First Out) queue used to store at least one isolated page.

[0085] In one possible implementation, the host wakes up the work queue, including:

[0086] The host detects the current available memory on the heterogeneous system device side;

[0087] If the available memory is greater than the preset threshold, the work queue will be woken up when the load on the heterogeneous system is lower than the preset load.

[0088] If the available memory is not greater than the preset threshold, the work queue will be woken up after the isolated global error address is added to the work queue.

[0089] In practical implementation, based on the remaining available memory in the heterogeneous system, the host can decide whether to wake up the work queue immediately after completing the isolation operation or to defer the wake-up until the heterogeneous system's load is low. When the available memory is not greater than the preset threshold, available memory is insufficient, so the host wakes up the work queue right away to complete hardware repair, thereby increasing the amount of available memory and ensuring the performance and stability of the application. When the available memory is greater than the preset threshold, available memory is sufficient, so the host can wait and wake up the work queue once the heterogeneous system's load is below a preset load. When the load is low, even if hardware repair requires resetting the heterogeneous system, the impact on the system's operating efficiency is minimal.
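The wake-up decision can be sketched as a small pure function. This is an illustrative sketch only: the function name, the `WakeDecision` values, and the way load is measured (any number comparable to a preset load) are assumptions, not details given in this application.

```python
from enum import Enum


class WakeDecision(Enum):
    WAKE_NOW = "wake_now"
    WAIT_FOR_LOW_LOAD = "wait_for_low_load"


def decide_wake(available_bytes, memory_threshold, load, load_threshold):
    """Decide when to wake the repair work queue after a page has been enqueued.

    Memory not greater than the threshold: repair immediately to recover capacity.
    Memory above the threshold: repair only while the load is below the preset load.
    """
    if available_bytes <= memory_threshold:
        return WakeDecision.WAKE_NOW
    if load < load_threshold:
        return WakeDecision.WAKE_NOW
    return WakeDecision.WAIT_FOR_LOW_LOAD
```

A caller that receives `WAIT_FOR_LOW_LOAD` would re-evaluate the decision later, for example whenever the measured load changes.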

[0090] In one possible implementation, the host passes the isolated global addresses of errors in the work queue to the hardware repair unit, including:

[0091] The host iterates through the isolated global addresses of errors in the work queue and passes the global addresses of isolated errors to the hardware repair unit.

[0092] In a practical implementation, the host can pass the isolated global addresses of errors to the hardware repair unit according to the first-in, first-out (FIFO) principle and the order in which the isolated global addresses of errors were added to the work queue, so that the hardware repair unit can perform hardware repair on the isolated physical memory. In other embodiments, the isolated global addresses of errors may also be passed in other orders, which are not restricted here.

[0093] Furthermore, the host's physical memory allocator can obtain the isolated physical memory based on the isolated global address of the error and transmit the address of the isolated physical memory to the hardware repair unit. The hardware repair unit can then replace and repair this portion of physical memory based on its address. As mentioned earlier, the hardware repair unit can execute one or more of the following hardware repair mechanisms based on the host's trigger signal: bit-level address replacement, cell-level address replacement, row-level address replacement, column-level address replacement, bank-level address replacement, device-level address replacement, rank-level address replacement, channel-level address replacement, or address replacement of a specific address range of memory. Other hardware repair mechanisms can also be selected based on actual production needs; these are not restricted here.

[0094] In one possible implementation, the above method also includes:

[0095] If the host receives a repair error message, it will mark the global address corresponding to the repair error message as a repair failure.

[0096] Furthermore, if an error global address is marked as a failed repair, the host can skip that error global address.

[0097] In practice, hardware repair may fail. For example, as shown in Figure 3, if a repair error message is received, the host marks the global address corresponding to that message as repair failed. The host can then skip that erroneous global address when traversing the isolated erroneous global addresses in the work queue, which avoids repeatedly attempting to repair addresses that cannot be repaired on later passes over the work queue. This can effectively improve the reliability of memory fault handling.
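The traversal described above can be sketched as follows. The function name, the `repair` callback (standing in for the hardware repair unit and returning True on a successful repair notification), and the choice to leave unrepairable addresses in the queue are assumptions made for illustration.

```python
from collections import deque


def drain_repair_queue(queue, repair, failed):
    """Traverse the work queue in FIFO order and attempt hardware repair.

    queue  -- deque of isolated erroneous global addresses
    repair -- callable(addr) -> bool, True when the repair succeeded
    failed -- set of addresses already marked as repair failed
    Returns the addresses repaired in this pass; they can be handed back
    to the physical memory allocator by the caller.
    """
    repaired = []
    still_isolated = deque()
    while queue:
        addr = queue.popleft()
        if addr in failed:
            still_isolated.append(addr)  # skip: known to be unrepairable
            continue
        if repair(addr):
            repaired.append(addr)
        else:
            failed.add(addr)             # mark as repair failed
            still_isolated.append(addr)
    queue.extend(still_isolated)         # unrepaired pages stay isolated
    return repaired
```

On a second pass over the same queue, an address already in `failed` is never handed to `repair` again, which is the behaviour the skip rule is meant to guarantee.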

[0098] As shown in Figure 2 and Figure 3, in one possible implementation, the host obtains the global error address corresponding to the error information, including:

[0099] The host determines the global address corresponding to the error message based on the mapping relationship between physical addresses and global addresses.

[0100] In a specific implementation, the address translation unit stores the mapping relationship between physical addresses and global addresses. The host can send the error information to the address translation unit, which translates the erroneous physical address it contains into the corresponding erroneous global address, so that the host obtains the erroneous global address from the error information.
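As an illustration of what such a mapping might look like, the sketch below assumes simple round-robin channel interleaving at a fixed granule. The interleaving scheme, the channel count, the granule size, and all names are hypothetical; the application does not specify the actual mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalError:
    channel_id: int    # memory channel that reported the error
    channel_addr: int  # byte address inside that channel


def to_global_address(err, num_channels=4, granule=256):
    """Map a per-channel address to a global address (assumed round-robin interleaving)."""
    if not 0 <= err.channel_id < num_channels:
        raise ValueError("channel_id out of range")
    block, offset = divmod(err.channel_addr, granule)
    return (block * num_channels + err.channel_id) * granule + offset


def to_physical(global_addr, num_channels=4, granule=256):
    """Inverse mapping, used here only to check that the translation round-trips."""
    global_block, offset = divmod(global_addr, granule)
    block, channel_id = divmod(global_block, num_channels)
    return PhysicalError(channel_id, block * granule + offset)
```

Under these assumptions, consecutive granules of the global address space rotate across the channels, so the channel ID and the in-channel address together identify exactly one global address.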

[0101] In one possible implementation, the host performs isolation operations on erroneous global addresses, which also includes:

[0102] The host marks the erroneous global address as unavailable and terminates the application that is using the erroneous global address so that the erroneous global address is isolated by the physical memory allocator after being released from the application.

[0103] In the actual implementation, after the host marks the erroneous global address as unavailable, it can determine the application using that erroneous global address based on the address and terminate that application. After the application exits, the erroneous global address is released and isolated by the physical memory allocator, thus completing the isolation operation for the erroneous global address.
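The deferred isolation described above can be modelled as a small state machine per page. The state names, the `Page` record, and the two helper functions are hypothetical and serve only to show the ordering: mark unavailable first, isolate once the owning application has released the page.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class PageState(Enum):
    FREE = auto()
    IN_USE = auto()
    UNAVAILABLE = auto()  # erroneous, still held by an application
    ISOLATED = auto()     # withheld from allocation, waiting for repair


@dataclass
class Page:
    base: int
    state: PageState = PageState.FREE
    owner: Optional[str] = None


def mark_unavailable(page):
    """Mark an erroneous page; return the owning application to terminate, if any."""
    if page.state is PageState.IN_USE:
        page.state = PageState.UNAVAILABLE
        return page.owner
    if page.state is PageState.FREE:
        page.state = PageState.ISOLATED  # nobody holds it: isolate right away
    return None


def release(page):
    """Called when the owning application exits or frees the page."""
    page.owner = None
    if page.state is PageState.UNAVAILABLE:
        page.state = PageState.ISOLATED  # isolate instead of returning to the free pool
    elif page.state is PageState.IN_USE:
        page.state = PageState.FREE
```

The owner returned by `mark_unavailable` is the application that the host would either terminate actively or wait on until it exits.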

[0104] For example, terminating an application may include actively terminating the application or waiting for the application to exit voluntarily.

[0105] In this embodiment, when a memory error occurs, the corresponding global address of the error is first determined based on the error information and isolated so that the global address of the error cannot be used by the application. At the same time, there is no need to reset the heterogeneous system, thus avoiding impacting the system's operating efficiency. After isolating the global address of the error, a hardware repair mechanism is triggered based on the system's operating status to complete the repair of the global address of the error. This can eliminate address gaps that are perceptible to the application, increase the amount of available memory, and ensure the performance and stability of the application.

[0106] Figure 4 is a schematic diagram of a heterogeneous system provided in an embodiment of this application, and Figure 5 is a schematic diagram of another heterogeneous system provided in an embodiment of this application. As shown in Figure 4 and Figure 5, the heterogeneous system 10 includes an interconnected host 11 and device 12;

[0107] Host 11 is used to obtain memory error information, the error information representing the erroneous physical address; to obtain the erroneous global address corresponding to the error information and perform an isolation operation on it, wherein the global address is a physical address managed by the software-view physical memory allocator; and to trigger the hardware repair mechanism.

[0108] Device 12 is used to perform hardware repair processing on isolated faulty global addresses;

[0109] Host 11 is also used to return the faulty global address to the physical memory allocator after receiving a successful repair notification.

[0110] The host 11 includes a driver 111 and a physical memory allocator 112, while the device 12 includes an error verification module 121 and a hardware repair unit 122. The error verification module 121 verifies memory errors and sends memory error information to the driver 111 when a memory error occurs. The driver 111 obtains the global address corresponding to the error information and performs isolation operations on the global address.

[0111] Hardware repair unit 122 is used to perform hardware repair processing on isolated erroneous global addresses. In one possible implementation, the hardware repair mechanism includes one or more of the following: bit-level address replacement, cell-level address replacement, row-level address replacement, column-level address replacement, bank-level address replacement, device-level address replacement, rank-level address replacement, channel-level address replacement, or address replacement of a specific address range in memory.

[0112] Driver 111 is also used to return the faulty global address to physical memory allocator 112 after receiving a successful repair notification.
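
The division of work between host 11 and device 12 can be sketched as follows. This is a simplified illustration only: the classes `Driver` and `Device`, the dictionary-based address mapping, and the always-successful repair are assumptions made for the example and do not describe the actual driver 111 or hardware repair unit 122.

```python
class Device:
    """Stands in for device 12; repair always succeeds in this sketch."""

    def repair(self, global_addr):
        return True  # assumption: models a "repair successful" notification


class Driver:
    """Stands in for driver 111 on host 11."""

    def __init__(self, device, free_addrs, phys_to_global):
        self.device = device
        self.free = set(free_addrs)   # addresses the allocator may hand out
        self.isolated = set()         # addresses withheld from applications
        self.phys_to_global = phys_to_global

    def on_memory_error(self, faulty_phys_addr):
        # 1. Map the reported physical address to its global address.
        global_addr = self.phys_to_global[faulty_phys_addr]
        # 2. Isolate it so applications can no longer obtain it.
        self.free.discard(global_addr)
        self.isolated.add(global_addr)
        return global_addr

    def trigger_repair(self):
        # 3. Ask the device to repair each isolated address.
        # 4. Return successfully repaired addresses to the allocator.
        for addr in sorted(self.isolated):
            if self.device.repair(addr):
                self.isolated.remove(addr)
                self.free.add(addr)
```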

[0113] In one possible implementation, host 11 further includes: isolation unit 113;

[0114] Driver 111 is also used to send the erroneous global address to isolation unit 113;

[0115] Isolation unit 113 is used to perform isolation operations on the portion of memory where the erroneous global address is located based on a specific address range;

[0116] Driver 111 is also used for:

[0117] Based on a specific address range, the portion of memory containing the erroneous global address is returned to the physical memory allocator.

[0118] In one possible implementation, the specific address range is the address range corresponding to the page.
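
When the specific address range is a page, the range to isolate or return can be derived from the erroneous global address by simple arithmetic. The sketch below assumes a 4096-byte page; this application does not specify a page size, and the function name `page_range` is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: the page size is not specified in this application


def page_range(global_addr, page_size=PAGE_SIZE):
    """Return the [start, end) range of the page containing global_addr."""
    start = global_addr - (global_addr % page_size)
    return start, start + page_size
```

Isolating at page granularity means the whole page containing the erroneous address is withheld and later returned together, which matches how many allocators track memory.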

[0119] In one possible implementation, host 11 also includes work queue 114;

[0120] Driver 111 is also used to: add the isolated error global address to work queue 114; and wake up work queue 114;

[0121] Work queue 114 is used to pass the isolated erroneous global addresses that it holds to hardware repair unit 122.

[0122] In one possible implementation, driver 111 is specifically used for:

[0123] Detect the current available memory of device 12;

[0124] If the available memory is greater than the preset threshold, then when the load on heterogeneous system 10 is lower than the preset load, work queue 114 will be woken up.

[0125] If the available memory is not greater than the preset threshold, then after adding the isolated erroneous global address to work queue 114, work queue 114 will be woken up.
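
The wake-up decision in the two branches above can be expressed as a small predicate. This is a sketch under assumed units; the function name `should_wake_now` and its parameters are hypothetical, and how available memory and system load are measured is left open by this application.

```python
def should_wake_now(available_memory, memory_threshold, system_load, load_limit):
    """Decide whether to wake the repair work queue right away."""
    if available_memory > memory_threshold:
        # Memory is plentiful: defer repair until the system is lightly loaded.
        return system_load < load_limit
    # Memory is scarce: repair immediately to reclaim the isolated address.
    return True
```

The design intent is that repair is opportunistic while memory is plentiful and becomes urgent only when isolated addresses start to limit the memory available to applications.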

[0126] In one possible implementation, work queue 114 is specifically used for:

[0127] Traversing the isolated erroneous global addresses in work queue 114 and passing each of them to hardware repair unit 122.

[0128] In one possible implementation, work queue 114 is also used for:

[0129] If a repair error message is received, the erroneous global address corresponding to that message is marked as repair failed.

[0130] If an erroneous global address is marked as repair failed, that address is skipped.
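
The traversal, failure marking, and skipping behavior of the work queue can be sketched as below. The function `process_repair_queue` and its arguments are hypothetical; `repair` stands in for the hardware repair unit and is assumed to report success or failure synchronously, which the real mechanism may not do.

```python
def process_repair_queue(queue, repair, failed):
    """Traverse queued addresses, skipping those already marked failed.

    queue  -- list of isolated erroneous global addresses
    repair -- callable(addr) -> bool, True on a successful repair notification
    failed -- set of addresses marked "repair failed" (updated in place)
    Returns the list of addresses that were repaired.
    """
    repaired = []
    for addr in queue:
        if addr in failed:
            continue  # marked as repair failed earlier: skip it
        if repair(addr):
            repaired.append(addr)
        else:
            failed.add(addr)  # a repair error message was received
    return repaired
```

Skipping addresses already marked as failed avoids repeatedly submitting addresses that the hardware cannot repair each time the queue is woken.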

[0131] In one possible implementation, host 11 further includes address translation unit 115;

[0132] Driver 111 is also configured to: send error information to address translation unit 115;

[0133] Address translation unit 115 is used to translate the error information into the corresponding erroneous global address, wherein address translation unit 115 stores the mapping relationship between physical addresses and global addresses.
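
One possible way to store the mapping relationship is a sorted table of contiguous regions, searched by binary search. This layout is an assumption made for illustration; this application does not specify how address translation unit 115 stores the mapping, and the class and method names below are hypothetical.

```python
import bisect


class AddressTranslationUnit:
    """Maps device physical addresses to global addresses (hypothetical layout)."""

    def __init__(self, regions):
        # regions: iterable of (phys_start, size, global_start) tuples
        self.regions = sorted(regions)
        self.starts = [r[0] for r in self.regions]

    def to_global(self, phys_addr):
        # Find the last region whose start is <= phys_addr.
        i = bisect.bisect_right(self.starts, phys_addr) - 1
        if i >= 0:
            phys_start, size, global_start = self.regions[i]
            if phys_addr < phys_start + size:
                return global_start + (phys_addr - phys_start)
        raise KeyError(f"no mapping for physical address {phys_addr:#x}")
```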

[0134] In one possible implementation, isolation unit 113 is specifically used for:

[0135] The faulty global address is marked as unavailable, and the application using the faulty global address is terminated so that the faulty global address is isolated by the physical memory allocator after being released from the application.

Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. This electronic device can be a heterogeneous system as described above. The electronic device is shown in Figure 6 and is described below.

[0136] The electronic device includes a processor 291 and a memory 292; it may also include a communication interface 293 and a bus 294. The processor 291, memory 292, and communication interface 293 can communicate with each other via the bus 294. The communication interface 293 can be used for information transmission. The processor 291 can invoke logical instructions stored in the memory 292 to execute the methods of the above embodiments.

[0137] Furthermore, the logic instructions in the aforementioned memory 292 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.

[0138] The memory 292, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this application. The processor 291 executes functional applications and data processing by running the software programs, instructions, and modules stored in the memory 292, thereby implementing the methods in the above-described method embodiments.

[0139] The memory 292 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 292 may include high-speed random access memory and may also include non-volatile memory.

[0140] This application provides a non-transitory computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the methods described in the foregoing embodiments.

[0141] This application provides a computer program product, including a computer program that, when executed by a processor, implements the methods provided in any of the embodiments described above.

[0142] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.

[0143] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A memory fault handling method, characterized in that the method is applied to a heterogeneous system and includes: obtaining memory error information, wherein the error information represents an erroneous physical address; obtaining the erroneous global address corresponding to the error information, and performing an isolation operation on the erroneous global address, wherein the global address is a physical address managed by a software-view physical memory allocator; and triggering a hardware repair mechanism and, upon receiving a successful repair notification, returning the erroneous global address to the physical memory allocator.

2. The method according to claim 1, characterized in that, The isolation operation on the erroneous global address includes: Based on a specific address range, an isolation operation is performed on the erroneous global address; Returning the erroneous global address to the physical memory allocator includes: Based on a specific address range, the erroneous global address is returned to the physical memory allocator.

3. The method according to claim 2, characterized in that, The specific address range refers to the address range corresponding to the page.

4. The method according to claim 1, characterized in that the heterogeneous system includes a work queue; prior to triggering the hardware repair mechanism, the method further includes: adding the isolated erroneous global address to the work queue; and triggering the hardware repair mechanism includes: waking up the work queue, and passing the isolated erroneous global addresses in the work queue to a hardware repair unit.

5. The method according to claim 4, characterized in that waking up the work queue includes: detecting the currently available memory on the device side of the heterogeneous system; if the available memory is greater than a preset threshold, waking up the work queue when the load on the heterogeneous system is lower than a preset load; and if the available memory is not greater than the preset threshold, waking up the work queue after the isolated erroneous global address is added to the work queue.

6. The method according to claim 4, characterized in that the method further includes: if a repair error message is received, marking the erroneous global address corresponding to the repair error message as repair failed; and if an erroneous global address is marked as repair failed, skipping that erroneous global address.

7. The method according to claim 1, characterized in that obtaining the erroneous global address corresponding to the error information includes: determining, based on the mapping relationship between physical addresses and global addresses, the erroneous global address corresponding to the error information.

8. The method according to claim 1, characterized in that, The isolation operation on the erroneous global address includes: The erroneous global address is marked as unavailable, and the application using the erroneous global address is terminated, so that the erroneous global address is isolated by the physical memory allocator after being released from the application.

9. The method according to any one of claims 1-8, characterized in that, The hardware repair mechanism includes one or more of the following: address replacement at the bit level, address replacement at the cell level, address replacement at the row level, address replacement at the column level, address replacement at the bank level, address replacement at the device level, address replacement at the rank level, address replacement at the channel level, or address replacement of a specific memory address range.

10. A heterogeneous system, characterized in that it includes: a host and a device that are interconnected; the host is used to obtain memory error information, the error information representing an erroneous physical address; to obtain the erroneous global address corresponding to the error information and perform an isolation operation on the erroneous global address, wherein the global address is a physical address managed by a software-view physical memory allocator; and to trigger a hardware repair mechanism; the device is used to perform hardware repair processing on the isolated erroneous global address; and the host is also configured to return the erroneous global address to the physical memory allocator after receiving a successful repair notification.

11. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, are used to implement the method as described in any one of claims 1-9.

12. A computer program product, characterized in that it includes computer-executable instructions which, when executed by a processor, are used to implement the method as described in any one of claims 1-9.