Methods for handling ECC errors in heterogeneous systems, heterogeneous systems and related products

By collaboratively shielding the storage addresses of ECC errors on both the host and device sides, the problem of error propagation when hardware repair resources are exhausted is solved, ensuring system stability and subsequent repair conditions.

CN116166468B · Active · Publication Date: 2026-06-30 · CAMBRICON TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CAMBRICON TECH CO LTD
Filing Date
2021-11-24
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In the prior art, uncorrectable ECC errors that occur in memory after prolonged read and write operations cannot be repaired by hardware. When spare memory is exhausted, repair is not possible, causing ECC errors to spread and affect the normal operation of other applications.

Method used

By using software mechanisms to work collaboratively on the host and device sides, the storage addresses associated with ECC errors are obtained and masked to prevent them from being reassigned and thus prevent the spread of errors.

Benefits of technology

It effectively prevents the spread of ECC errors, ensures system stability, provides conditions for subsequent hardware repair, and avoids impacting other applications.



Abstract

This disclosure relates to a method for handling ECC errors in heterogeneous systems, a heterogeneous system, and related products. The heterogeneous system includes a computing device within a combined processing unit, which may further include interface devices and other processing devices. The computing device interacts with the other processing devices to jointly perform user-specified computational operations. The combined processing unit may also include a storage device connected to both the computing device and the other processing devices to store data from those devices. The disclosed solution can effectively handle ECC errors and prevent their propagation.

Description

Technical Field

[0001] This disclosure generally relates to the field of storage. More specifically, this disclosure relates to methods for handling error checking and correcting (“ECC”) errors in heterogeneous systems, heterogeneous systems, computer-readable storage media, computer program products, and computing devices.

Background Technology

[0002] In data storage and transmission, various types of memory, such as high-bandwidth memory (HBM), play a crucial role. After prolonged read and write operations, memory often develops data errors. In some scenarios, these data errors cannot be corrected by the memory's own ECC error correction (such uncorrectable errors are referred to as ECC errors in the context of this application). To address such uncorrectable errors, some memory types (e.g., HBM) support the use of spare memory to repair and replace faulty memory when an ECC error occurs. However, due to hardware limitations, such as the limited number of spare memory modules available for replacement, the hardware cannot repair subsequent uncorrectable ECC errors once the spare memory is exhausted. Therefore, a solution for addressing such uncorrectable ECC errors is needed.

Summary of the Invention

[0003] In view of the technical problems mentioned in the background section above, this disclosure proposes a software mechanism that can mask the memory where an ECC error occurred when hardware repair resources are unable to repair the ECC error, or unable to repair it in a timely manner. This ensures that the faulty memory is not reallocated and thus prevents the spread of the ECC error. To this end, this disclosure provides ECC error solutions for heterogeneous systems in several aspects.

[0004] In a first aspect, this disclosure provides a method for handling ECC errors in a heterogeneous system, wherein the heterogeneous system includes a host side and a device side, wherein the method is performed at the device side and includes: obtaining address information of a storage address to be masked from the host side, wherein the storage address is associated with an ECC error; and performing a masking operation on the storage address based on the address information.

[0005] In a second aspect, this disclosure provides a method for handling ECC errors in a heterogeneous system, wherein the heterogeneous system includes a host side and a device side, the method is executed on the host side and includes obtaining address information of a storage address to be masked, wherein the address information is associated with an ECC error; and storing the address information so that the device side can perform a reading and a masking operation on the storage address based on the address information.

[0006] In a third aspect, this disclosure provides a computer system including a device side and a host side, wherein the device side is configured to perform the method described in the first aspect, and the host side is configured to perform the method described in the second aspect, thereby enabling a shielding operation against the ECC error within the computer system.

[0007] In a fourth aspect, this disclosure provides a method for handling ECC errors in a heterogeneous system, the heterogeneous system including a device side and a host side, the method comprising: obtaining address information of a storage address to be masked at the host side, wherein the storage address is associated with the ECC error; storing the address information at the host side in a storage area shared with the device side; and instructing the device side to read the address information from the shared storage area and perform a masking operation on the address information.

[0008] In a fifth aspect, this disclosure provides a computer-readable storage medium having stored thereon computer program code for handling ECC errors in heterogeneous systems, wherein the computer program code, when run by a processing device, performs the methods described above.

[0009] In a sixth aspect, this disclosure provides a computer program product including a computer program for handling ECC errors in a heterogeneous system, which, when executed by a processor, implements the steps of the above-described method.

[0010] In a seventh aspect, this disclosure provides a computer apparatus including a memory, a processor, and a computer program stored in the memory, wherein when the processor executes the computer program, it implements the steps of the above-described method.

[0011] According to the methods, heterogeneous systems, and related devices provided in the foregoing aspects of this disclosure, when hardware repair resources are unable to perform hardware repair on memory that has experienced an ECC error, the faulty physical memory can be masked through software, thereby ensuring that it cannot be re-allocated. In this way, the spread of ECC errors can be prevented, thus preventing potential impact on the normal operation of other applications.

Description of the Drawings

[0012] The above and other objects, features, and advantages of this disclosure will become readily apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings. In the drawings, several embodiments of this disclosure are illustrated by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, wherein:

[0013] Figure 1 is a structural diagram of a board according to an embodiment of the present disclosure;

[0014] Figure 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure;

[0015] Figure 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure;

[0016] Figure 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure;

[0017] Figure 5 is a schematic diagram illustrating the internal structure of a processor core according to an embodiment of the present disclosure;

[0018] Figure 6 is a flowchart illustrating a method for handling ECC errors in a heterogeneous system according to an embodiment of the present disclosure;

[0019] Figure 7 is a flowchart illustrating another method for handling ECC errors in heterogeneous systems according to embodiments of the present disclosure;

[0020] Figure 8 is a schematic block diagram illustrating the handling of ECC errors during the initialization phase of a heterogeneous system according to an embodiment of the present disclosure;

[0021] Figure 9 is a schematic block diagram illustrating the handling of ECC errors during the operation phase of a heterogeneous system according to an embodiment of the present disclosure;

[0022] Figure 10 is a schematic state transition diagram illustrating the state machine used in the operation phase of a heterogeneous system according to embodiments of the present disclosure; and

[0023] Figure 11 is a flowchart illustrating the handling of ECC errors in a heterogeneous system according to an embodiment of the present disclosure.

Detailed Implementation

[0024] The technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, not all of them. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0025] It should be understood that the terms "first," "second," and "third," etc., that may be used in the claims, specification, and drawings of this disclosure are used to distinguish different objects, rather than to describe a specific order. The terms "comprising" and "including" as used in the specification and claims of this disclosure indicate the presence of the described features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or sets thereof.

[0026] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this disclosure. As used in this disclosure and claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in this disclosure and claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations.

[0027] As used in this specification and claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."

[0028] The specific embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0029] Figure 1 shows a schematic diagram of the structure of a board 10 according to an embodiment of the present disclosure. It will be understood that the structure and composition shown in Figure 1 are merely examples and are not intended to limit the scope of this disclosure in any way.

[0030] As shown in Figure 1, board 10 includes chip 101, which can be a system-on-chip (SoC), i.e., the system-on-a-chip described in the context of this disclosure. In one implementation scenario, it can integrate one or more combined processing devices. The aforementioned combined processing device can be an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining, especially since deep learning technology is widely used in the field of cloud intelligence. A significant characteristic of cloud intelligence applications is the large amount of input data, which places high demands on the platform's storage and computing capabilities. Board 10 of this embodiment is suitable for cloud intelligence applications, possessing massive off-chip storage, on-chip storage, and powerful computing capabilities.

[0031] As further shown in the figure, chip 101 is connected to external device 103 via external interface device 102. External device 103 can be a host computer, which can be a general-purpose processor with a different instruction set architecture than chip 101, such as a central processing unit (CPU). Of course, depending on the application scenario, external device 103 can be, for example, a server, computer, camera, monitor, mouse, keyboard, network card, or Wi-Fi interface. Data to be processed can be transmitted from external device 103 to chip 101 via external interface device 102. The calculation results of chip 101 can be transmitted back to external device 103 via external interface device 102. Depending on the application scenario, external interface device 102 can have different interface forms, such as a PCIe interface.

[0032] The board 10 may also include a storage device 104 for storing data, which includes one or more memory cells 105. The storage device 104 is connected to and transmits data with the controller 106 and the chip 101 via a bus. The controller 106 in the board 10 can be configured to regulate the state of the chip 101. For this purpose, in one application scenario, the controller 106 may include a microcontroller (MCU).

[0033] Figure 2 is a structural diagram illustrating the combined processing device in chip 101 according to the above embodiment. As shown in Figure 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (DRAM) 204.

[0034] The computing device 201 can be configured to execute user-specified operations, primarily implemented as a single-core or multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning calculations, and can also interact with the processing device 203 via the interface device 202 to jointly complete the user-specified operations.

[0035] The interface device 202 can be used to transmit data and control commands between the computing device 201 and the processing device 203. For example, the computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it into the on-chip storage device of the computing device 201. Further, the computing device 201 can obtain control commands from the processing device 203 via the interface device 202 and write them into the on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 can also read data from the storage device of the computing device 201 and transmit it to the processing device 203.

[0036] The processing device 203, as a general-purpose processing device, performs basic controls including but not limited to data transfer and starting and / or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, the computing device 201 of this disclosure can be considered to have a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they are considered to form a heterogeneous multi-core structure. DRAM 204 is used to store data to be processed. It is DDR memory, typically 16G or larger in size, and is used to store data in computing device 201 and / or processing device 203.

[0037] Figure 3 shows the internal structure of the computing device 201 when it is a single-core device. The single-core computing device 301 is used to process input data from fields such as computer vision, speech, natural language, and data mining. The single-core computing device 301 includes three main modules: a control module 31, a computation module 32, and a storage module 33.

[0038] The control module 31 coordinates and controls the operation of the computation module 32 and the storage module 33 to complete the deep learning task. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 fetches instructions from the processing device 203, and the instruction decode unit 312 decodes the fetched instructions and sends the decoding result as control information to the computation module 32 and the storage module 33.

[0039] The computation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 performs vector operations and supports complex operations such as vector multiplication, addition, and nonlinear transformations. The matrix operation unit 322 is responsible for the core computations of the deep learning algorithm, namely matrix multiplication and convolution. The storage module 33 is used to store or move related data, including a neuron RAM (NRAM) 331, a weight RAM (WRAM) 332, and a direct memory access (DMA) module 333. The NRAM 331 stores the input neurons, output neurons, and intermediate results after computation. The WRAM 332 stores the convolution kernels of the deep learning network, i.e., the weights. The DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.

[0040] Figure 4 shows a schematic diagram of the internal structure of the computing device 201 when it is a multi-core processor. The multi-core computing device 41 adopts a hierarchical design. As a system-on-a-chip, the multi-core computing device 41 includes at least one cluster (or computing cluster) according to this disclosure, and each cluster includes multiple processor cores. In other words, the multi-core computing device 41 is constructed in a hierarchical structure of system-on-a-chip, cluster, and processor cores. From the perspective of the system-on-a-chip hierarchy, as shown in Figure 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and multiple clusters 405.

[0041] There may be multiple external storage controllers 401 (two are shown as an example in the figure), which are used to access external storage devices, i.e., off-chip memory in the context of this disclosure (e.g., DRAM 204 in Figure 2), in response to access requests issued by the processor cores, so as to read data from or write data to the off-chip memory. The peripheral communication module 402 receives control signals from the processing device 203 via the interface device 202 and starts the computing device 201 to execute tasks. The on-chip interconnect module 403 connects the external storage controller 401, the peripheral communication module 402, and multiple clusters 405 to transmit data and control signals between modules. The synchronization module 404 is a Global Barrier Controller (GBC) used to coordinate the working progress of each cluster and ensure information synchronization. The multiple clusters 405 of this disclosure are the computing cores of the multi-core computing device 41. Although Figure 4 shows four clusters by way of example, with the development of hardware, the multi-core computing device 41 of this disclosure may also include eight, sixteen, sixty-four, or even more clusters 405. In one application scenario, cluster 405 can be used to efficiently execute deep learning algorithms.

[0042] From the perspective of the cluster hierarchy, as shown in Figure 4, each cluster 405 may include multiple processor cores (IPU cores) 406 and a memory core (MEM core) 407, which may include, for example, a cache memory (e.g., LLC).

[0043] Four processor cores 406 are shown in the figure by way of example; this disclosure does not limit the number of processor cores 406. The internal architecture of a processor core is shown in Figure 5. Similar to the single-core computing device 301 in Figure 3, each processor core 406 can include three modules: a control module 51, a computation module 52, and a storage module 53. The functions and structures of the control module 51, computation module 52, and storage module 53 are roughly the same as those of the control module 31, computation module 32, and storage module 33, and will not be described in detail here. It should be noted that the storage module 53 can include an Input / Output Direct Memory Access (IODMA) module 533 and a Move Direct Memory Access (MVDMA) module 534. The IODMA 533 controls memory access between NRAM 531 / WRAM 532 and DRAM 204 through the broadcast bus 409; the MVDMA 534 controls memory access between NRAM 531 / WRAM 532 and SRAM 408.

[0044] Returning to Figure 4, the storage core 407 is primarily used for storage and communication, namely storing shared data or intermediate results among processor cores 406, and performing communication between cluster 405 and DRAM 204, communication between clusters 405, and communication between processor cores 406. In other embodiments, the storage core 407 may have scalar operation capabilities to perform scalar operations.

[0045] Storage core 407 may include Static Random-Access Memory (SRAM) 408, a broadcast bus 409, a Cluster Direct Memory Access (CDMA) module 410, and a Global Direct Memory Access (GDMA) module 411. In one implementation scenario, SRAM 408 can act as a high-performance data relay station. Therefore, data multiplexed between different processor cores 406 within the same cluster 405 does not need to be obtained from DRAM 204 by each processor core 406 individually; instead, it is relayed between processor cores 406 via SRAM 408. Furthermore, storage core 407 only needs to quickly distribute multiplexed data from SRAM 408 to multiple processor cores 406, thereby improving inter-core communication efficiency and significantly reducing on-chip and off-chip I / O access.

[0046] Broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication between processor cores 406, communication between clusters 405, and data transfer between cluster 405 and DRAM 204, respectively. These will be explained below.

[0047] The broadcast bus 409 is used to complete high-speed communication between the processor cores 406 within the cluster 405. In this embodiment, the broadcast bus 409 supports inter-core communication methods including unicast, multicast, and broadcast. Unicast refers to point-to-point (e.g., data transmission from one processor core to another) data transmission. Multicast is a communication method that transmits a piece of data from SRAM 408 to several specific processor cores 406. Broadcast is a communication method that transmits a piece of data from SRAM 408 to all processor cores 406, and is a special case of multicast.
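The relationship among the three communication methods can be illustrated with a minimal sketch. It is not the bus implementation: the `Core` class, its `inbox` buffer, and the three function names are hypothetical, and modeling unicast as delivery to a single target is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical stand-in for a processor core 406 with a local buffer."""
    core_id: int
    inbox: list = field(default_factory=list)


def multicast(data, cores, targets):
    """Deliver one piece of data to the selected cores only."""
    for core in cores:
        if core.core_id in targets:
            core.inbox.append(data)


def broadcast(data, cores):
    """Broadcast is multicast with every core selected."""
    multicast(data, cores, {core.core_id for core in cores})


def unicast(data, cores, target):
    """Point-to-point delivery, modeled here as a single-target multicast."""
    multicast(data, cores, {target})
```

Expressing `broadcast` through `multicast` mirrors the statement that broadcast is a special case of multicast.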

[0048] CDMA 410 is used to control memory accesses of SRAM 408 between different clusters 405 within the same computing device 201. GDMA 411 works in conjunction with the external storage controller 401 to control memory accesses from SRAM 408 of cluster 405 to DRAM 204, or to read data from DRAM 204 into SRAM 408. As described above, communication between DRAM 204 and NRAM 531 or WRAM 532 can be achieved in two ways. The first way is through IODMA 533, which transfers data directly between DRAM 204 and NRAM 531 or WRAM 532; the second way is to first transmit data between DRAM 204 and SRAM 408 via GDMA 411, and then transmit data between SRAM 408 and NRAM 531 or WRAM 532 via MVDMA 534. Although the second approach may require more components and has a longer data flow, in some embodiments, the bandwidth of the second approach is significantly greater than that of the first approach. Therefore, performing communication between DRAM 204 and NRAM 531 or WRAM 532 using the second approach may be more efficient. It is understood that the data transmission methods described herein are merely exemplary, and those skilled in the art can flexibly select and apply various data transmission methods according to the specific hardware arrangement based on the teachings of this disclosure.

[0049] In other embodiments, the functions of GDMA 411 and IODMA 533 can be integrated into the same component. Although this disclosure treats GDMA 411 and IODMA 533 as different components for ease of description, those skilled in the art will recognize that any component implementing similar functions and achieving similar technical effects falls within the scope of this disclosure. Furthermore, the functions of GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 can also be implemented by the same component.

[0050] The hardware architecture and internal structure of this disclosure have been described in detail above in conjunction with Figures 1-5. It is understood that the above description is merely exemplary and not restrictive. Depending on different application scenarios and hardware specifications, those skilled in the art may also make changes to the board and its internal structure disclosed herein, and such changes shall still fall within the protection scope of this disclosure.

[0051] According to the scheme disclosed herein, the board or chip 101 described above in conjunction with Figures 1-5 can act as a device in a heterogeneous system, or be implemented on the device side of a heterogeneous system. Therefore, it can cooperate with a host in the heterogeneous system to handle ECC errors occurring in the heterogeneous system under the host's control. Based on this, the technical solution of this disclosure proposes a software solution based on a heterogeneous platform, which handles the occurrence of uncorrectable ECC errors so as to isolate ECC errors in a timely and effective manner. Thus, the solution of this disclosure can prevent the spread of ECC errors and facilitate the repair of faulty memory. The procedures for handling ECC errors in a heterogeneous system, executed at the device side and the host side, are described in detail below in conjunction with Figure 6 and Figure 7.

[0052] Figure 6 is a simplified flowchart illustrating a method 600 for handling ECC errors performed on the device side of a heterogeneous system. As previously described, in the context of this disclosure, the heterogeneous system may include a host side and a device side. In one embodiment, the aforementioned host side may be a processing device or apparatus including a main processor, which may be, for example, a general-purpose processor (“CPU”). Correspondingly, the aforementioned device side may be a processing device or apparatus including a slave processor, which may be a dedicated processor (such as a graphics processing unit “GPU”) for computations in the field of artificial intelligence. In one embodiment, the device side may include the board or chip described above in conjunction with Figures 1-5.

[0053] As shown in Figure 6, in step S602, the address information of the storage address to be masked is obtained, wherein the storage address is associated with the ECC error, that is, it is the physical memory address where an unrecoverable ECC error occurs. In an exemplary operating scenario, the host-side main processor can read the address information of the storage address to be masked from the device-side memory (e.g., Static Random Access Memory SRAM 408) when the memory management module is initialized, and write it to a storage area shared with the device side. In this scenario, a shared memory can be arranged on the host side or the device side to facilitate the storage of address information on the host side and the reading of the corresponding address information on the device side. Specifically, the aforementioned shared memory can be a memory space on the device side (e.g., DRAM 204). Alternatively, the shared memory can also be a memory space allocated on the host side to store the address information when an ECC error occurs.

[0054] Since the operating system on the device side is also stored in the DRAM on the device side, and the memory space on this DRAM used to store the operating system may itself have uncorrectable ECC errors, this application can have the host side, which is not affected by such ECC errors, read the corresponding address information from the device side. Of course, in other embodiments, when the shared memory is set on the device side, the address information to be masked can also be read directly from the shared memory by the slave processor on the device side.
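The hand-off through the shared storage area described above can be sketched as follows. This is only an illustration under stated assumptions: the shared area is modeled as a `bytearray`, and the record layout (a 32-bit entry count followed by 64-bit physical addresses) and both function names are hypothetical, since the disclosure does not specify a format.

```python
import struct

HEADER = struct.Struct("<I")  # number of entries (assumed layout)
ENTRY = struct.Struct("<Q")   # one 64-bit physical address per entry (assumed)


def host_write_addresses(shared: bytearray, addresses):
    """Host side: store the addresses to be masked in the shared area."""
    needed = HEADER.size + ENTRY.size * len(addresses)
    if needed > len(shared):
        raise ValueError("shared area too small")
    HEADER.pack_into(shared, 0, len(addresses))
    for i, addr in enumerate(addresses):
        ENTRY.pack_into(shared, HEADER.size + i * ENTRY.size, addr)


def device_read_addresses(shared: bytearray):
    """Device side: read back the addresses the host stored."""
    (count,) = HEADER.unpack_from(shared, 0)
    return [
        ENTRY.unpack_from(shared, HEADER.size + i * ENTRY.size)[0]
        for i in range(count)
    ]
```

A fixed little-endian layout is used so that both sides interpret the area identically even if their processors differ, which is one plausible concern in a heterogeneous system.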

[0055] Next, in step S604, a masking operation (“PageRetire”) is performed on the storage address based on the aforementioned address information. This masking operation isolates ECC errors in a software manner, thereby preventing the spread of ECC errors and providing conditions for timely subsequent repairs.

[0056] In one implementation scenario, the masking scheme described above can be executed during the initialization and / or runtime phases of the entire heterogeneous system. When the masking operation is performed during the initialization phase, it can address the situation where spare memory resources are exhausted due to uncorrectable ECC errors during driver loading, thereby masking the erroneous physical addresses that cannot be repaired by hardware at the software level. When the masking operation is performed during the runtime phase, it can address situations where uncorrectable ECC errors occur during application execution and hardware repair methods cannot intervene quickly, requiring software masking of the erroneous physical addresses to prevent error propagation. Subsequently, for example, the user can be allowed to manually reset the driver, and then relevant hardware mechanisms (e.g., HBM) can be used to repair the erroneous physical memory.

[0057] When executed during the initialization phase of a heterogeneous system, the method may further include performing initialization operations on the device side, and after completing the initialization operations, obtaining the address information from the host side. Then, a masking operation can be performed on the storage address based on the address information. Additionally or optionally, the masking result obtained after performing the masking operation can be written to the host side so that the host side can perform its subsequent initialization operations or subsequent masking operations.
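The initialization-phase sequence above (device initialization, obtaining the address information from the host side, masking, and writing the result back) can be sketched as follows. The `Host` and `Device` classes and all of their methods are hypothetical placeholders, and the masking step is reduced to recording the address.

```python
class Host:
    """Hypothetical host side holding error addresses and masking results."""

    def __init__(self, error_addresses):
        self.error_addresses = list(error_addresses)
        self.mask_results = {}

    def get_addresses(self):
        return list(self.error_addresses)

    def report_result(self, results):
        # The host can use these results for its subsequent initialization.
        self.mask_results.update(results)


class Device:
    """Hypothetical device side; masking is reduced to recording the address."""

    def __init__(self):
        self.initialized = False
        self.retired = set()

    def initialize(self):
        self.initialized = True

    def mask(self, addr):
        self.retired.add(addr)
        return True


def init_phase(host, device):
    """Device initializes first, then fetches, masks, and reports back."""
    device.initialize()
    results = {addr: device.mask(addr) for addr in host.get_addresses()}
    host.report_result(results)
    return results
```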

[0058] In one operational scenario, the aforementioned masking operation on a storage address based on address information may include isolating the storage address in response to its current unoccupied state. Conversely, if the storage address to be masked is currently occupied, it can be marked as an error. Specifically, when the storage address to be masked is occupied, the storage address can be saved, and after it is released, a masking operation can be performed on the released storage address.

[0059] In an exemplary operation, for the aforementioned unoccupied storage addresses (i.e., erroneous physical addresses), an interface can be directly invoked to isolate them, preventing improper allocation and use of erroneous physical addresses during the storage address allocation phase, which could lead to the spread of ECC errors. Correspondingly, for erroneous physical addresses that are already occupied, they can be marked in their relevant program structures so that the memory space pointed to by the erroneous address can be isolated after the memory corresponding to the erroneous address is released, thereby preventing the subsequent allocation and use of the erroneous physical memory. Then, the hardware error flag bit in the corresponding structure of the erroneous physical address can be set. Regarding the masking of erroneous addresses, in one operational scenario, the device side can invoke a designated interface (e.g., the function malloc()) in the memory allocator used for memory allocation, thereby separating the erroneous physical address from the memory allocator through this designated interface. Specifically, the memory space pointed to by the erroneous physical address is allocated through this memory allocation interface (i.e., the designated interface), and by occupying the memory space pointed to by the erroneous physical address, the erroneous physical address is masked.
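The two branches described above (isolate an unoccupied address immediately, flag an occupied one and isolate it on release) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the allocator of the disclosure: `PageAllocator`, `alloc_at`, the `"ECC_MASK"` owner tag, and the page-granular model are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class PageAllocator:
    """Toy physical-page allocator; every name here is hypothetical."""
    num_pages: int
    free: set = field(init=False)
    owner: dict = field(default_factory=dict)    # page -> owning structure
    hw_error: set = field(default_factory=set)   # occupied pages flagged as erroneous
    masked: set = field(default_factory=set)     # pages permanently isolated

    def __post_init__(self):
        self.free = set(range(self.num_pages))

    def alloc(self, who):
        page = min(self.free)        # raises ValueError when no page is free
        self.free.remove(page)
        self.owner[page] = who
        return page

    def alloc_at(self, page, who):
        """Allocate one specific page (stands in for the designated interface)."""
        self.free.remove(page)
        self.owner[page] = who

    def mask(self, page):
        """Isolate a free page now, or flag an occupied page for later."""
        if page in self.masked:
            return "masked"
        if page in self.free:
            self.alloc_at(page, "ECC_MASK")   # occupying the page masks it
            self.masked.add(page)
            return "masked"
        self.hw_error.add(page)               # set the hardware error flag
        return "deferred"

    def release(self, page):
        del self.owner[page]
        self.free.add(page)
        if page in self.hw_error:             # complete the deferred masking
            self.hw_error.remove(page)
            self.mask(page)
```

Because a masked page is simply held by a permanent owner, the ordinary allocation path never needs a separate "is this page bad?" check.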

[0060] When a masking operation is performed during the operation of the heterogeneous system disclosed herein, the storage address to be masked is saved in response to its current occupation. Then, the masking operation can be performed on the saved storage address after it is released. In one embodiment, the aforementioned saving operation can be implemented using a linked list data structure; for example, the storage address to be masked can be attached as a node to the linked list. Based on this, the linked list can be traversed during a memory release operation to determine whether a storage address to be released exists in the list. And in response to the existence of the storage address in the linked list, a masking operation can be performed on the storage address after, for example, a memory release operation targeting that storage address, to achieve isolation of the storage address.
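The linked-list bookkeeping described above might look like the following sketch. The names `PendingMaskList`, `pop_if_present`, and `release`, and the use of plain sets for the free pool and the masked set, are assumptions made for illustration only.

```python
class Node:
    """One pending address in the singly linked list."""

    def __init__(self, addr, next=None):
        self.addr = addr
        self.next = next


class PendingMaskList:
    """Occupied erroneous addresses that still need masking (hypothetical)."""

    def __init__(self):
        self.head = None

    def add(self, addr):
        # Attach the address as a new node at the head of the list.
        self.head = Node(addr, self.head)

    def pop_if_present(self, addr):
        """Traverse the list; unlink addr and return True if it was pending."""
        prev, cur = None, self.head
        while cur is not None:
            if cur.addr == addr:
                if prev is None:
                    self.head = cur.next
                else:
                    prev.next = cur.next
                return True
            prev, cur = cur, cur.next
        return False


def release(addr, pending, free_pool, masked):
    """Release addr; a pending address is isolated instead of being reused."""
    if pending.pop_if_present(addr):
        masked.add(addr)       # masking follows the release
    else:
        free_pool.add(addr)    # healthy memory returns to the pool
```

The traversal cost is linear in the number of pending addresses, which is presumably acceptable because uncorrectable errors are expected to be rare.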

[0061] In one implementation scenario, to accelerate memory allocation and deallocation operations in a heterogeneous system, both master and slave processors have their own memory caching mechanisms. Memory released by a program structure may be cached by the device-side memory caching mechanism. Therefore, the method of this disclosure proposes performing a cache invalidation operation on the memory address where an ECC error occurs, so that the memory address is released from the device-side cache. As an example, the aforementioned cache invalidation operation may include releasing memory allocated before the masking operation is performed, so that the allocated memory does not reside in the device-side cache. Furthermore, to ensure accurate release of the memory allocated before the masking operation, the method of this disclosure proposes recording the first moment of memory allocation in the cache and recording the second moment before the masking operation begins. Then, by comparing the first and second moments, all memory residing in the cache before the second moment can be released. In terms of implementation, the aforementioned recording operation can be accomplished by setting corresponding timestamps.
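As a rough illustration of the timestamp comparison described above, the sketch below tags each cached block with a "first moment" and evicts everything older than a "second moment" taken just before masking starts. The class `DeviceCache`, its logical clock, and the method names are hypothetical; a real device-side cache would use its own time source and eviction path.

```python
import itertools


class DeviceCache:
    """Cache of blocks tagged with the time they entered the cache (hypothetical)."""

    def __init__(self):
        self._clock = itertools.count(1)   # logical clock standing in for timestamps
        self.blocks = {}                   # addr -> first moment

    def now(self):
        return next(self._clock)

    def put(self, addr):
        self.blocks[addr] = self.now()     # record the first moment

    def invalidate_before(self, second_moment):
        """Evict every block cached before second_moment; return its addresses."""
        evicted = [a for a, t in self.blocks.items() if t < second_moment]
        for a in evicted:
            del self.blocks[a]
        return evicted


cache = DeviceCache()
cache.put(0xA000)
cache.put(0xB000)
second = cache.now()          # second moment, taken before masking begins
cache.put(0xC000)             # cached afterwards, so it stays resident
evicted = cache.invalidate_before(second)
```

Comparing timestamps avoids having to look up which cached block contains the erroneous address: everything that could contain it is simply released back to the allocator.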

[0062] Figure 7 is a simplified flowchart illustrating a method 700 for handling ECC errors executed on the host side of a heterogeneous system. Based on the above description, those skilled in the art will understand that the heterogeneous system referred to here is the one described above in conjunction with Figure 6; therefore, the preceding description of the heterogeneous system and its device side also applies to the following text.

[0063] As shown in Figure 7, in step S702, the address information of the storage address to be masked is obtained, wherein the address information is associated with an ECC error. As mentioned earlier, the address information obtained here may be address information obtained from shared memory shared with the host side. Next, in step S704, the aforementioned address information can be stored so that the device side can read it and perform a masking operation on the storage address according to the address information. In practice, when an ECC error occurs on the device side and triggers interrupt handling for the masking operation, the host side's main processor can write the address information of the ECC error to the shared memory shared with the device side for the device side to read.

[0064] As previously described, the operation of the heterogeneous system disclosed herein may include an initialization phase and an operation phase. Therefore, during the initialization phase of the heterogeneous system, the host side can perform its initialization operations and read and store the address information associated with ECC errors. Additionally, after the slave device successfully completes the masking operation, the master device can read the masking result obtained after performing the masking operation from the slave device and update the status bits associated with the masking operation based on the masking result.

[0065] During the operation of a heterogeneous system, the method performed by the host side can also invalidate the host-side memory cache flag. For example, host-side cache invalidation can be achieved by timely updating the corresponding flag in the host-side memory cache to invalidate the memory address. This ensures that memory blocks in the cache pool only receive release requests from users and are no longer re-allocated. In one scenario during operation, the host side can also receive an interrupt signal generated in response to an ECC error and, based on this interrupt signal, call the interrupt handler function for the masking operation to initiate communication with the device side and cause the device side to perform the masking operation. Specifically, the process by which the host side calls the interrupt handler function for the masking operation based on the interrupt signal can be similar to the implementation steps shown in Figure 7 and will not be repeated here.

[0066] In one embodiment, considering that the masking operation performed by a device-side component, such as a memory allocator, is not performed within an interrupt handler, there is a possibility that a new ECC error interrupt may occur before the previous masking operation request has been processed. To avoid multiple calls to the interrupt handler causing unprocessed error address information stored in, for example, shared memory to be overwritten by new error address information, the present disclosure proposes arranging a state machine on the host side. This state machine can be used to maintain the timing of address information storage and the working state of the masking operation.

[0067] In one implementation scenario, the state machine can be configured to include at least an idle state and a suspended state. Based on this, the method executed by the host side can further include performing N masking operations associated with N address information respectively, based on the idle and suspended states of the state machine, where N is greater than or equal to 2. In this scenario, to successfully execute the N masking operations, the host side can sequentially perform the following actions: First, in response to the state machine being in an idle state, the nth address information is stored for retrieval by the device side, where 1 ≤ n ≤ N and is a positive integer. Next, after writing the nth address information, the state machine is updated to a suspended state to prevent the storage of the (n+1)th address information associated with the (n+1)th masking operation. Subsequently, in response to the device side completing the nth masking operation, the state machine is updated to an idle state to allow the storage of the (n+1)th address information.
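The idle/suspended gating of the N stores can be sketched as a one-slot mailbox. This is only an illustrative model under stated assumptions: `Mailbox`, `try_store`, `complete`, and `run_all` are hypothetical names, and the single `slot` attribute stands in for the shared memory.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    SUSPENDED = "suspended"


class Mailbox:
    """One-slot shared-memory stand-in guarded by a two-state machine."""

    def __init__(self):
        self.state = State.IDLE
        self.slot = None

    def try_store(self, addr_info):
        """Store the n-th address only when idle, then block further stores."""
        if self.state is not State.IDLE:
            return False                 # the (n+1)-th store is refused
        self.slot = addr_info
        self.state = State.SUSPENDED
        return True

    def complete(self):
        """The device side has finished the n-th masking operation."""
        done, self.slot = self.slot, None
        self.state = State.IDLE          # the (n+1)-th store is allowed again
        return done


def run_all(addresses):
    """Drive N masking operations strictly one at a time."""
    box, masked = Mailbox(), []
    for addr in addresses:
        box.try_store(addr)
        masked.append(box.complete())
    return masked
```

The point of the gate is that an unread address in the slot can never be overwritten: a second store attempt fails until the device side reports completion.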

[0068] To achieve the statistics and maintenance of the aforementioned multiple (i.e., N) masking operations, this disclosure further proposes to set a counter on the host side for counting the masking operations to be performed. In this case, the host side can determine whether to store the address information based on the counter value and the state of the state machine described above.

[0069] In terms of specific implementation, the state machine can be set to include a running state for the counter. Therefore, in response to the state machine being in a suspended state, its state is updated to the running state, so that the device side can perform the masking operation in the running state. Next, in response to receiving a notification from the device side that the masking operation has been completed, the state machine is updated to the idle state and the counter value is decremented by one. Afterwards, it can be determined whether the counter value is zero. In response to the counter's current value not being zero (i.e., there are still uncompleted masking operations), the next address information can be stored and the state machine's state updated to the suspended state, so that the device side can then perform the masking operation for the next address information. It can be understood that when the counter value is zero, it means that N masking operations have been completed.

[0070] Figure 8 is a schematic block diagram illustrating a process 800 for handling ECC errors during the initialization phase of a heterogeneous system according to an embodiment of this disclosure. It is understood that the processing flow shown in Figure 8 is only one possible implementation of the method 600 shown in Figure 6; therefore, the description regarding Figure 6 also applies to the following description made in conjunction with Figure 8.

[0071] In step S802, the host-side main processor can use the driver to initialize the host-side memory management module.

[0072] In step S804, the address information for which the masking operation needs to be performed is read from the device-side memory (e.g., the device-side SRAM 408) and written to the shared memory shared with the device side in step S806. In one implementation scenario, the aforementioned SRAM can be a storage space in the device-side memory used to store address information of ECC errors. When the device-side hardware memory (e.g., "HBM", "DRAM") detects an ECC error, the address information associated with the ECC error can be stored in the device-side memory (e.g., SRAM) for the main processor to read and write to the shared memory.

[0073] At step S808, the device side performs a memory module initialization operation. For example, the slave processor on the device side can load firmware to initialize its memory management module.

[0074] Next, in step S810, the initialization of the masking structure is performed, that is, the program related to the masking operation to be performed is initialized. In one scenario, this initialization operation may involve initializing the data structures (e.g., a structure used to manage the number of ECC errors) and related resources (e.g., the device-side memory space used to store the data structures) required for the masking operation.

[0075] After initializing the masking operation, in step S812, the device can read the address information to be masked (i.e., the physical memory address where the error occurred) from the shared memory and execute the masking operation S814 of this disclosure. In one implementation scenario, during initialization, if the erroneous physical address is not occupied by a program structure, the device can directly execute the masking operation of this disclosure; if the erroneous physical address is occupied by a related program structure, the device can first set the hardware error flag bit in the structure corresponding to the erroneous physical address to complete the masking operation after the erroneous physical address is released. Specifically, the slave processor on the device can call the memory allocator to call a specified interface so that the physical address where the ECC error occurred can be allocated from the memory allocator through the interface, thereby completing the masking operation. When there are multiple address information associated with the ECC error, step S814 can be executed multiple times. After all masking operations are executed, the process proceeds to step S816.

[0076] In step S816, the device can store the final processing result of the masking operation performed on all erroneous addresses into shared memory, and wait for the main processor to parse the aforementioned processing result.

[0077] The process then returns to the host side. Specifically, in step S818, the main processor waits for device-side firmware initialization. Here, device-side firmware initialization may involve the initialization of hardware resources or scheduling modules on the slave processor. Then, after the slave processor's firmware initialization process returns successfully, the main processor executes a two-phase initialization process for the memory module in step S820. In one embodiment, this two-phase initialization process includes retrieving, in step S822, the processing result from shared memory after the slave processor has performed the masking operation. In one scenario, the main processor can update the corresponding status bits based on the processing result of the masking operation, so that users can query the working status of the masking operation through relevant applications and understand which erroneous physical addresses have been masked.
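The initialization-phase handshake through shared memory (steps S804 to S822) can be modeled loosely as three functions operating on a dictionary that stands in for the shared memory. All function names, the `"pending"` and `"results"` keys, and the `"masked"` / `"deferred"` outcome strings are assumptions made for this sketch; the disclosure does not specify a result format.

```python
def host_stage_addresses(device_sram, shared):
    """S804/S806: copy the logged error addresses into shared memory."""
    shared["pending"] = list(device_sram)


def device_mask_all(shared, occupied):
    """S812 to S816: mask each address and publish a per-address result."""
    results = {}
    for addr in shared["pending"]:
        # An occupied address is only flagged; masking completes on release.
        results[addr] = "deferred" if addr in occupied else "masked"
    shared["results"] = results


def host_update_status(shared, status_bits):
    """S822: fold the published results into host-visible status bits."""
    for addr, outcome in shared["results"].items():
        status_bits[addr] = (outcome == "masked")
    return status_bits


shared = {}
host_stage_addresses([0x10, 0x20], shared)
device_mask_all(shared, occupied={0x20})
status = host_update_status(shared, {})
```

Here `status` records which erroneous addresses are already isolated, which is the information a user-facing query tool would read.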

[0078] The above, in conjunction with Figure 8, describes the operational procedures of this disclosure for handling ECC errors during the initialization phase of a heterogeneous system. By executing such procedures, timely handling and isolation of ECC errors can be achieved during the initialization phase, thereby effectively preventing the propagation and spread of ECC errors during the initialization phase of the heterogeneous system.

[0079] Figure 9 is a schematic operational block diagram illustrating a process 900 for handling ECC errors during the operation phase of a heterogeneous system according to an embodiment of this disclosure. It is understood that the heterogeneous system here can be any heterogeneous system described above in conjunction with Figure 8 (e.g., one including a host side and a device side); the descriptions above therefore apply equally and will not be repeated hereafter. Furthermore, it is understood that the solutions of this invention can also be applied to computer systems including a host side and a device side within the context of this application.

[0080] As shown in Figure 9, in step S901, the occurrence of an ECC error triggers an interrupt signal. Specifically, when an uncorrectable ECC error occurs during the application runtime phase of a heterogeneous system, the host processor on the host side receives a corresponding interrupt signal. Then, in step S902, in response to receiving the interrupt signal, the host processor can call the interrupt handler function related to the masking operation, so that the device side can perform the masking operation. In one scenario, the processing flow in the interrupt handler function for the masking operation is similar to the processing flow discussed for the initialization phase above. For example, the erroneous address information can be read from the device-side SRAM and written to shared memory. Then, the bottom half of the interrupt handler function can be called, that is, waking up communication between the master and slave processors and notifying the slave processor to perform tasks including the masking operation, such as memory allocation and deallocation.

[0081] Unlike the initialization phase, during the runtime phase of a heterogeneous system, applications may frequently allocate and release memory. To accelerate memory allocation and release operations in heterogeneous systems, both master and slave processors have their own memory caching mechanisms. When memory in use is released, it may be cached by the caching mechanism. In this case, the memory corresponding to the physical address of an uncorrectable ECC error used in the program structure may be cached and thus cannot be reclaimed from the physical memory allocator. To ensure that the physical address of the ECC error does not reside in any level of cache, this application proposes to perform a cache invalidation operation for the erroneous physical address in step S903. For the cache invalidation operation of the master processor, for example, the memory address can be invalidated by timely updating the corresponding flag bit of the master processor's memory cache, thereby ensuring that the memory block in the cache pool only receives release requests issued by the user and is no longer re-allocated. For the cache invalidation operation on the device side, the device side can perform a cache invalidation operation for the storage address of the ECC error so that the storage address is released from the device-side cache.

[0082] Next, in step S904, it can be determined whether the state machine is in an idle state. As previously described in conjunction with Figure 7, to manage and maintain multiple masking operations, a state machine and a counter can be deployed on the host side. The state machine can be used to maintain the timing of address information storage and the working state of the masking operation, while the counter can count the execution of multiple masking operations. Based on this, when it is determined in step S904 that the state machine is in an idle state, the state machine is updated to a suspended state in step S906, and the address information (i.e., the erroneous physical address information) is stored in shared memory shared with the device side in step S907 for subsequent reading by the device side. When it is determined in step S904 that the state machine is not in an idle state, the counter value is updated in step S905. In other words, when a new ECC error triggers an interrupt call for a masking operation, if the state machine's status flag is not in an idle state, the operation of writing the erroneous address information to shared memory is no longer performed; instead, the counter's count regarding the masking operation is updated to indicate that there is still address information associated with the ECC error that has not been read.

[0083] In step S908, when the master processor subsequently initiates master-slave processor communication, for example based on an interrupt handler, it determines in step S909 whether the current state machine is in a suspended state. In response to determining that the current state of the state machine is suspended, it updates the state machine to a running state and notifies the slave processor on the device side in step S911 to execute related tasks (e.g., including masking operations). When it is determined in step S909 that the state machine is not suspended, the process proceeds directly to step S911. When communication from the master side is received in step S911, the process can proceed to the device-side processing operations. The related operations performed on the device side will be described below.

[0084] In step S912, the slave processor on the device side can execute a communication callback. For example, after the master processor initiates master-slave processor communication, if the slave processor determines that the communication is a request to release memory, the slave processor will register a callback function for releasing memory with the master processor. Next, in step S913, the slave processor can determine whether the state machine is in the running state. If it is determined to be in the running state, then in step S914, the slave processor can read the address information about the ECC error that occurred from the shared memory, and in step S915, update the cache timestamp. As previously stated in conjunction with Figure 6, the timestamp here can be used to record the second moment before the masking operation begins.

[0085] Subsequently, in step S916, a specific masking operation is performed, such as setting the hardware error flag in the structure corresponding to the erroneous physical address. Additionally, other tasks can be performed in step S917, such as performing a cache release operation. In some scenarios, when there may be multiple addresses requiring masking operations, some erroneous addresses may be found to be occupied during the masking operation in S916. In this case, the masking operation in step S916 can be performed again on the erroneous physical addresses in the released memory after the memory release operation in step S917. In one embodiment, to ensure that allocated memory is directly released back to the physical memory allocator and does not reside in the processor's cache, this disclosure proposes using a timestamp to record the time of memory allocation in the cache (i.e., the first time mentioned above), and comparing the first time with the second time mentioned above, thereby releasing all memory residing in the cache before the second time, so that physical address masking for ECC errors can be performed in the released memory.

[0086] After completing the aforementioned device-side operations, the process returns to the host side. Subsequently, in step S918, the main processor can update the state machine's state to the idle state. In step S919, the main processor can determine whether the counter's count value is zero. When the counter's count value is zero, the entire process returns, i.e., it returns to step S901 to wait for a subsequent interrupt signal. Conversely, if in step S919 the counter's value is not zero, indicating unfinished masking operations, the state machine's state is updated to the suspended state in step S906, so that the address information of the next masking operation can be written to shared memory in step S907 for subsequent reading by the slave processor. As mentioned earlier, when multiple pieces of address information exist, this disclosure also proposes using a linked list to store them, i.e., each stored address is linked as a node in the linked list. Based on this, the linked list can be traversed when performing a memory release operation to determine whether the memory address to be released is in the linked list. In response to the existence of such a memory address in the linked list, a masking operation can be performed on that memory address after the memory release operation on it, so as to isolate erroneous memory addresses.

[0087] The above, in conjunction with Figure 9, provides a detailed exemplary description of the operation of this disclosure for handling ECC errors in a heterogeneous system during the runtime phase. Based on this runtime processing, and especially considering the impact of the caching mechanism, the solution of this disclosure can achieve the ordered execution of multiple masking operations. Furthermore, by designing a state machine and a counter, the solution of this disclosure also achieves effective management and efficient execution of multiple masking operations.

[0088] Figure 10 is a schematic state transition diagram illustrating the state machine 1000 used in the operation phase of a heterogeneous system according to an embodiment of the present disclosure.

[0089] As mentioned earlier, to implement multiple masking operations for multiple ECC errors, this disclosure proposes using a state machine to maintain the timing of shared memory writes and the working state of the masking operations. Specifically, when the masking operation executed in the memory allocator is not performed within an interrupt handler, a new ECC error interrupt may occur before the previous masking operation request has been processed. To prevent unprocessed address information in shared memory from being overwritten by new address information due to multiple calls to the interrupt handler, this disclosure introduces a state machine with idle, suspended, and running states.

[0090] Specifically, when the host processor calls the interrupt handler for the masking operation due to an ECC error, if the current state machine is in the idle state as shown at position 1002, the host processor reads the address information associated with the ECC error and writes it into shared memory. After the write operation is complete, the state machine is updated to the suspended state as shown at position 1004. When a new ECC error subsequently triggers the interrupt call for the masking operation, if it is determined that the state machine is not in the idle state (e.g., determined by the state machine's status flags), the operation of writing the address information into shared memory is not executed. Instead, the wait reference count for the masking operation is updated (e.g., by incrementing or decrementing the counter as described above) to indicate that there is still address information related to the ECC error that has not been read.

[0091] Furthermore, when the master processor initiates master-slave processor communication, if it determines that the state machine is in a suspended state, it adjusts the state machine's flag to the running state, as shown at position 1006 in the diagram. Subsequently, the slave processor performs a masking operation before processing communication requests (e.g., requests for memory release). Once the slave processor completes its task and notifies the master processor, the master processor updates the state machine to the idle state (this state transition is shown by the arrow from 1006 to 1002). At this point, if the master processor finds that the wait reference count for the masking operation is not 0 (i.e., the counter value is not 0), it continues, for example, by reading address information from SRAM and writing it to shared memory, and sets the state machine's state to the suspended state (this state transition is shown by the arrow from 1002 to 1004), thus starting a new loop processing.
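The three-state loop described above (idle 1002, suspended 1004, running 1006) together with the wait reference count can be sketched as follows. This is a simplified host-side model, not the disclosed implementation: the class and method names are hypothetical, the SRAM is modeled as a FIFO queue, and decrementing the wait count at the moment a waiting address is published is an assumption, since the text leaves the exact point of the counter update open.

```python
from collections import deque
from enum import Enum, auto


class MaskState(Enum):
    IDLE = auto()        # position 1002
    SUSPENDED = auto()   # position 1004
    RUNNING = auto()     # position 1006


class HostStateMachine:
    """Hypothetical host-side model of the state machine and its counter."""

    def __init__(self):
        self.state = MaskState.IDLE
        self.waiting = 0        # wait reference count
        self.sram = deque()     # error addresses logged by hardware (assumed FIFO)
        self.shared = None      # single shared-memory slot

    def _publish(self):
        # Write the next address to shared memory, then suspend further writes.
        self.shared = self.sram.popleft()
        self.state = MaskState.SUSPENDED

    def on_ecc_interrupt(self, addr):
        self.sram.append(addr)
        if self.state is MaskState.IDLE:
            self._publish()
        else:
            self.waiting += 1   # do not overwrite the unread address

    def start_communication(self):
        """Master initiates master-slave communication."""
        if self.state is MaskState.SUSPENDED:
            self.state = MaskState.RUNNING
        return self.state is MaskState.RUNNING

    def on_device_done(self):
        """Slave reports completion: 1006 -> 1002, then 1002 -> 1004 if needed."""
        self.state = MaskState.IDLE
        if self.waiting:
            self.waiting -= 1
            self._publish()
```

With two back-to-back interrupts, the second address waits in the queue while the first is in flight, and is published only after the device side reports completion.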

[0092] Figure 11 is a flowchart illustrating a method 1100 for handling ECC errors in a heterogeneous system according to an embodiment of this disclosure. It is understood that the heterogeneous system here is the same as the heterogeneous system described above in conjunction with the accompanying drawings, and therefore the foregoing description of the heterogeneous system also applies to the heterogeneous system involved in method 1100.

[0093] As shown in Figure 11, in step S1102, the address information of the storage address to be masked is obtained on the host side. As mentioned earlier, this storage address is associated with the ECC error and is the physical address where the error occurred. Next, in step S1104, the address information is stored on the host side in a storage area shared with the device side. This shared storage area can be implemented as the shared memory mentioned above. Finally, in step S1106, the device side is instructed to read the address information from the shared storage area and perform a masking operation on the storage address according to that address information.

[0094] For the sake of brevity, the above description does not elaborate further on method 1100. However, those skilled in the art will understand from the foregoing description in conjunction with the accompanying drawings that method 1100 may include more steps and operations, such as using a state machine and / or a counter to maintain the timing and state changes of the masking operation.

[0095] The above description, in conjunction with the accompanying drawings, provides a detailed overview of the solutions disclosed herein. Depending on the application scenario, the shielding solutions disclosed herein may also include or be implemented in servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, dashcams, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and / or medical devices. The vehicles include airplanes, ships, and / or vehicles; the home appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, lights, gas stoves, and range hoods; the medical devices include MRI scanners, ultrasound machines, and / or electrocardiographs. The electronic devices or apparatus disclosed herein can also be applied in fields such as the Internet, IoT, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Furthermore, the electronic devices or apparatuses disclosed herein can also be used in application scenarios related to artificial intelligence, big data, and / or cloud computing, such as cloud computing, edge computing, and terminals. In one or more embodiments, the high-computing-power electronic devices or apparatuses according to the disclosed scheme can be applied to cloud devices (e.g., cloud servers), while the low-power electronic devices or apparatuses can be applied to terminal devices and / or edge devices (e.g., smartphones or cameras). 
In one or more embodiments, the hardware information of the cloud devices and the hardware information of the terminal devices and / or edge devices are compatible with each other, so that suitable hardware resources can be matched from the hardware resources of the cloud devices to simulate the hardware resources of the terminal devices and / or edge devices based on the hardware information of the terminal devices and / or edge devices, so as to complete the unified management, scheduling, and collaborative work of end-to-cloud or cloud-edge-end integration.

[0096] It should be noted that, for the sake of brevity, this disclosure describes some methods and their embodiments as a series of actions and combinations thereof. However, those skilled in the art will understand that the solutions disclosed herein are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of this document, those skilled in the art will understand that some steps can be performed in a different order or simultaneously. Furthermore, those skilled in the art will understand that the embodiments described in this disclosure can be considered optional embodiments, that is, the actions or modules involved are not necessarily essential for the implementation of one or more solutions disclosed herein. In addition, depending on the solution, the description of some embodiments in this disclosure may have different emphases. In view of this, those skilled in the art will understand that parts not described in detail in a certain embodiment of this disclosure can also be referred to the relevant descriptions of other embodiments.

[0097] In terms of specific implementation, based on the disclosure and teachings of this document, those skilled in the art will understand that several embodiments disclosed herein can also be implemented in other ways not disclosed herein. For example, regarding the various units in the electronic device or apparatus embodiments described above, this document divides them based on logical functions, but in actual implementation, there may be other division methods. As another example, multiple units or components can be combined or integrated into another system, or some features or functions in a unit or component can be selectively disabled. Regarding the connection relationships between different units or components, the connections discussed above in conjunction with the accompanying drawings can be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interface can support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

[0098] In this disclosure, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed across multiple network units. Furthermore, depending on actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of this disclosure. Additionally, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit or each unit may exist physically independently.

[0099] In some implementation scenarios, the integrated unit described above can be implemented as a software program module. If implemented as a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. Accordingly, when the disclosed solution is embodied as a software product (e.g., a computer-readable storage medium), the software product can be stored in a memory and may include several instructions that cause a computer device (e.g., a personal computer, server, or network device) to execute some or all of the steps of the method described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as USB drives, flash disks, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0100] In other implementation scenarios, the integrated units described above can also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and / or analog circuits. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors. Therefore, the various devices described herein (e.g., computing devices or other processing devices) can be implemented using appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Furthermore, the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), such as resistive random access memory ("RRAM"), dynamic random access memory ("DRAM"), static random access memory ("SRAM"), enhanced dynamic random access memory ("EDRAM"), high bandwidth memory ("HBM"), hybrid memory cube ("HMC"), ROM, and RAM, etc.

[0101] The foregoing can be better understood in light of the following clauses:

[0102] Clause A1. A method for handling ECC errors in a heterogeneous system, wherein the heterogeneous system includes a host side and a device side, wherein the method is performed at the device side and includes:

[0103] Obtain the address information of the storage address to be masked from the host side, wherein the storage address is associated with the ECC error; and

[0104] The storage address is masked based on the address information.

[0105] Clause A2. The method according to Clause A1, wherein during the initialization phase of the heterogeneous system, the method includes:

[0106] Perform the initialization operation on the device side;

[0107] After performing the initialization operation, the address information is obtained from the host side, and a masking operation is performed on the storage address based on the address information; and

[0108] The masking result obtained after performing the masking operation is written to the host side.

[0109] Clause A3. The method according to Clause A2, wherein performing a masking operation on the storage address based on the address information includes:

[0110] In response to the fact that the storage address to be masked is currently occupied, the storage address is marked as faulty.

[0111] Clause A4. The method according to Clause A2, wherein performing a masking operation on the storage address based on the address information includes:

[0112] In response to the fact that the storage address to be masked is currently unoccupied, isolation is performed on the storage address.
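
As an illustration only, the initialization-phase branching of Clauses A3 and A4 can be sketched in Python. The `ToyAllocator` class, its page-granular bookkeeping, and the `PageState` names are hypothetical and are not taken from this disclosure; the sketch only shows an occupied address being marked as faulty and an unoccupied address being isolated so that it is never handed out again.

```python
from enum import Enum


class PageState(Enum):
    FREE = "free"
    IN_USE = "in_use"
    FAULTY = "faulty"      # occupied address flagged as bad (Clause A3)
    ISOLATED = "isolated"  # free address removed from the pool (Clause A4)


class ToyAllocator:
    """Hypothetical page-granular allocator used only for this sketch."""

    def __init__(self, num_pages):
        self.state = {page: PageState.FREE for page in range(num_pages)}

    def allocate(self):
        # Only FREE pages are allocatable; FAULTY and ISOLATED are skipped.
        for page, st in self.state.items():
            if st is PageState.FREE:
                self.state[page] = PageState.IN_USE
                return page
        raise MemoryError("no allocatable page left")

    def mask_at_init(self, page):
        # Occupied: mark as faulty. Unoccupied: isolate.
        if self.state[page] is PageState.IN_USE:
            self.state[page] = PageState.FAULTY
        elif self.state[page] is PageState.FREE:
            self.state[page] = PageState.ISOLATED
        return self.state[page]


alloc = ToyAllocator(3)
in_use_page = alloc.allocate()
print(alloc.mask_at_init(in_use_page))  # occupied page is marked faulty
print(alloc.mask_at_init(1))            # free page is isolated
```

In this sketch both resulting states keep the page out of `allocate`, which is the property the masking operation is meant to guarantee.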

[0113] Clause A5. The method according to Clause A3 or A4, wherein during the operation phase of the heterogeneous system, the method further includes:

[0114] In response to the fact that the storage address to be masked is currently occupied, the storage address is saved; and

[0115] After the storage address is released, the masking operation is performed on the saved storage address.

[0116] Clause A6. The method according to Clause A5, wherein saving the storage address includes attaching the storage address as a node to a linked list, and the method further includes:

[0117] When performing a memory release operation, the linked list is traversed to determine whether the storage address to be released exists in the linked list; and

[0118] In response to the existence of the storage address in the linked list, after performing the memory release operation for the storage address, the masking operation is performed on the storage address.
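
The deferred masking of Clauses A5 and A6 can be pictured with the following minimal Python sketch. The `DeferredMasker` class, its method names, and the use of Python sets for the occupied and masked addresses are assumptions made for illustration; only the linked list of pending addresses and the traversal on release come from the clauses.

```python
class Node:
    """One pending storage address in a singly linked list."""

    def __init__(self, addr, next_node=None):
        self.addr = addr
        self.next = next_node


class DeferredMasker:
    """Hypothetical device-side bookkeeping for run-time masking."""

    def __init__(self):
        self.in_use = set()       # addresses currently occupied
        self.masked = set()       # addresses that must never be reallocated
        self.pending_head = None  # linked list of saved addresses

    def allocate(self, addr):
        if addr in self.masked:
            raise MemoryError("address is masked")
        self.in_use.add(addr)

    def request_mask(self, addr):
        if addr in self.in_use:
            # Occupied: save the address as a node instead of masking now.
            self.pending_head = Node(addr, self.pending_head)
        else:
            self.masked.add(addr)

    def _unlink_pending(self, addr):
        # Traverse the list; remove and report the node if it is present.
        prev, node = None, self.pending_head
        while node is not None:
            if node.addr == addr:
                if prev is None:
                    self.pending_head = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False

    def free(self, addr):
        self.in_use.discard(addr)
        if self._unlink_pending(addr):
            # Released address was pending, so mask it now.
            self.masked.add(addr)


m = DeferredMasker()
m.allocate(0x1000)
m.request_mask(0x1000)   # deferred, because the address is occupied
m.free(0x1000)           # release triggers the saved masking operation
print(0x1000 in m.masked)
```

A linked list keeps insertion constant-time while the address is still occupied; the cost is a linear traversal on every release, which is acceptable when the number of pending addresses is small.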

[0119] Clause A7. The method according to Clause A5 or A6, further including:

[0120] In response to the occurrence of the ECC error, a cache invalidation operation is performed on the storage address to release the storage address from the device-side cache.

[0121] Clause A8. The method according to Clause A7, wherein performing a cache invalidation operation for the storage address includes:

[0122] Release the memory requested before performing the masking operation so that the requested memory does not reside in the device-side cache.

[0123] Clause A9. The method according to Clause A8, wherein releasing the memory requested before performing the masking operation includes:

[0124] Record a first moment at which memory is allocated in the cache;

[0125] Record a second moment before the masking operation begins; and

[0126] Compare the first moment with the second moment so as to release all memory that was placed in the cache before the second moment.
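
The timestamp comparison of Clause A9 can be sketched as follows. This is a simplified illustration under stated assumptions: the `TimestampedCache` class is hypothetical, and a logical counter stands in for whatever clock a real device-side cache would use.

```python
import itertools


class TimestampedCache:
    """Hypothetical device-side cache that stamps each cached block."""

    def __init__(self):
        self._clock = itertools.count(1)  # logical clock, assumed for the sketch
        self.entries = {}                 # address -> first moment

    def now(self):
        return next(self._clock)

    def cache_block(self, addr):
        # First moment: recorded when memory is allocated in the cache.
        self.entries[addr] = self.now()

    def release_before(self, second_moment):
        # Release every block whose first moment precedes the second moment.
        released = [a for a, t in self.entries.items() if t < second_moment]
        for addr in released:
            del self.entries[addr]
        return released


cache = TimestampedCache()
cache.cache_block(0xA000)
cache.cache_block(0xB000)
second_moment = cache.now()        # taken just before masking begins
cache.cache_block(0xC000)          # allocated later, so it is kept
released = cache.release_before(second_moment)
print([hex(a) for a in released])
```

Comparing against a single recorded moment avoids having to know which cached block actually contains the faulty address: everything that could have been cached before the masking operation started is flushed.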

[0127] Clause A10. A method for handling ECC errors in a heterogeneous system, wherein the heterogeneous system includes a host side and a device side, the method is performed on the host side, and includes:

[0128] Obtain the address information of the storage address to be masked, wherein the address information is associated with the ECC error; and

[0129] The address information is stored so that the device side can read it and perform a masking operation on the storage address based on the address information.

[0130] Clause A11. The method according to Clause A10, wherein during the initialization phase of the heterogeneous system, the method comprises:

[0131] Perform the initialization operation on the host side; and

[0132] Read the address information associated with the ECC error and store it.

[0133] Clause A12. The method according to Clause A10, further including:

[0134] Read, from the device side, the masking result obtained after performing the masking operation; and

[0135] Update the status bits associated with the masking operation based on the masking result.

[0136] Clause A13. The method according to any one of clauses A10-A12, wherein during the operation phase of the heterogeneous system, the method further comprises:

[0137] Before triggering the masking operation on the device side, the memory cache flag on the host side is invalidated so as to prevent the memory associated with the storage address from being allocated and used.

[0138] Clause A14. The method according to Clause A13, wherein during the operation phase, the method further comprises:

[0139] Acquire the interrupt signal generated in response to the ECC error; and

[0140] The interrupt handler function for the masking operation is invoked based on the interrupt signal to initiate communication with the device side and cause the device side to perform the masking operation.

[0141] Clause A15. The method according to Clause A13, wherein a state machine is arranged on the host side for maintaining the storage timing and masking operation status of the address information, wherein the state machine includes at least an idle state and a suspended state, the method comprising:

[0142] Based on the idle and suspended states of the state machine, execute N masking operations associated with N address information respectively, where N is greater than or equal to 2.

[0143] Clause A16. The method according to Clause A15, wherein in performing the N masking operations, the method includes:

[0144] In response to the state machine being in the idle state, the nth address information is stored so that it can be read by the device side, where n and N are positive integers and 1≤n≤N;

[0145] After writing the nth address information, the state machine is updated to the suspended state to prevent the storage of the (n+1)th address information associated with the (n+1)th masking operation; and

[0146] In response to the completion of the nth masking operation on the device side, the state of the state machine is updated to the idle state to allow the storage of the (n+1)th address information.

[0147] Clause A17. The method according to Clause A15 or A16, wherein a counter for counting the masking operations to be performed is provided on the host side, the method further comprising:

[0148] Based on the value of the counter and the state of the state machine, determine whether to store the address information.

[0149] Clause A18. The method according to Clause A17, wherein the states of the state machine further include a running state, the method further includes:

[0150] In response to the state machine being in the suspended state, the state of the state machine is updated to the running state, so that the device side performs the masking operation in the running state;

[0151] In response to receiving a notification from the device side that the current masking operation has been completed, the state machine is updated to the idle state and the value of the counter is decremented by one;

[0152] Determine whether the counter is zero; and

[0153] In response to the counter being non-zero, the next address information is stored and the state machine is updated to the suspended state so that the device side can perform a masking operation for the next address information.
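
The host-side serialization described in Clauses A15 to A18 can be sketched as a small state machine. The `HostMaskScheduler` class, the `waiting` queue, and the single `shared_slot` attribute are hypothetical stand-ins chosen for this sketch; the clauses themselves specify only the idle, suspended, and running states and the counter of masking operations to be performed.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SUSPENDED = auto()
    RUNNING = auto()


class HostMaskScheduler:
    """Hypothetical host-side scheduler for N masking operations."""

    def __init__(self):
        self.state = State.IDLE
        self.counter = 0         # masking operations still to be performed
        self.waiting = deque()   # address information not yet stored
        self.shared_slot = None  # stands in for the area the device reads

    def submit(self, addr_info):
        self.counter += 1
        self.waiting.append(addr_info)
        self._store_next()

    def _store_next(self):
        # Address information may be stored only while the machine is idle.
        if self.state is State.IDLE and self.waiting:
            self.shared_slot = self.waiting.popleft()
            self.state = State.SUSPENDED

    def device_start(self):
        if self.state is not State.SUSPENDED:
            raise RuntimeError("nothing stored for the device side")
        self.state = State.RUNNING
        return self.shared_slot

    def device_done(self):
        if self.state is not State.RUNNING:
            raise RuntimeError("no masking operation in progress")
        self.state = State.IDLE
        self.counter -= 1
        if self.counter != 0:
            self._store_next()


host = HostMaskScheduler()
host.submit(0x1000)
host.submit(0x2000)            # held back: the machine is suspended
print(hex(host.device_start()))
host.device_done()             # counter drops to 1, next address is stored
print(hex(host.device_start()))
host.device_done()
print(host.state, host.counter)
```

Because a new address is stored only in the idle state, the single shared slot is never overwritten while the device side is still working on the previous address.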

[0154] Clause A19. A computer system comprising a device side and a host side, wherein the device side is configured to perform the method according to any one of clauses A1-A9, and the host side is configured to perform the method according to any one of clauses A10-A18, thereby enabling a masking operation for the ECC error within the computer system.

[0155] Clause A20. A method for handling ECC errors in a heterogeneous system, the heterogeneous system comprising a device side and a host side, the method comprising:

[0156] The address information of the storage address to be masked is obtained at the host side, wherein the storage address is associated with the ECC error;

[0157] The address information is stored on the host side in a storage area shared with the device side; and

[0158] The device side reads the address information from the shared storage area and performs a masking operation on the storage address based on the address information.
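
The exchange through a shared storage area in Clause A20, together with the result write-back of Clauses A2 and A12, might look like the sketch below. The record layout (an 8-byte address, a valid flag, and a result byte), the `bytearray` standing in for the shared area, and all function names are assumptions made for illustration and are not specified by this disclosure.

```python
import struct

# Hypothetical record layout: 8-byte address, 1-byte valid flag, 1-byte result.
RECORD = struct.Struct("<QBB")
RESULT_PENDING, RESULT_MASKED = 0, 1

# A bytearray stands in for the storage area shared by host and device.
shared = bytearray(RECORD.size)


def host_publish(buf, addr):
    """Host side: store the address information for the device to read."""
    RECORD.pack_into(buf, 0, addr, 1, RESULT_PENDING)


def device_service(buf, masked):
    """Device side: read the address, mask it, and write back the result."""
    addr, valid, _ = RECORD.unpack_from(buf, 0)
    if not valid:
        return None
    masked.add(addr)  # placeholder for the real masking operation
    RECORD.pack_into(buf, 0, addr, 0, RESULT_MASKED)
    return addr


def host_read_result(buf):
    """Host side: read the masking result to update its status bits."""
    _, _, result = RECORD.unpack_from(buf, 0)
    return result


masked_addresses = set()
host_publish(shared, 0xDEAD0000)
device_service(shared, masked_addresses)
print(host_read_result(shared) == RESULT_MASKED)
```

Clearing the valid flag after servicing keeps the device side from masking the same record twice if it polls the area again.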

[0159] Clause A21. A computer-readable storage medium having stored thereon computer program code for handling ECC errors in a heterogeneous system, wherein when the computer program code is run by a processor, it performs the method described in any one of Clauses A1-A9 or any one of Clauses A10-A18.

[0160] Clause A22. A computer program product comprising a computer program for handling ECC errors in a heterogeneous system, which, when executed by a processor, implements the steps of the method described in any one of Clauses A1-A9 or any one of Clauses A10-A18.

[0161] Clause A23. A computing device comprising a memory, a processor, and a computer program stored in the memory, wherein when the processor executes the computer program, it implements the steps of the method described in any one of Clauses A1-A9 or any one of Clauses A10-A18.

[0162] While numerous embodiments of this disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Many modifications, alterations, and alternatives will occur to those skilled in the art without departing from the spirit and intent of this disclosure. It should be understood that various alternatives to the embodiments of this disclosure described herein may be employed in the practice of this disclosure. The appended claims are intended to define the scope of this disclosure and therefore cover equivalents or alternatives within the scope of these claims.

Claims

1. A method for handling uncorrectable ECC errors in a heterogeneous system, wherein the heterogeneous system includes a host side and a device side, and the method is performed at the device side and includes: obtaining address information of a storage address to be masked, wherein the storage address is associated with the ECC error, and the address information is obtained from a memory shared by the host side and the device side; and performing a masking operation on the storage address based on the address information by calling a specified interface in a memory allocator used to request memory; wherein the masking operation is performed during the initialization phase and/or the operation phase of the heterogeneous system; during the initialization phase of the heterogeneous system, performing the masking operation on the storage address based on the address information includes: in response to the storage address to be masked being currently occupied, marking the storage address as an error; and in response to the storage address to be masked being currently unoccupied, performing isolation on the storage address; and during the operation phase of the heterogeneous system, the method further includes: in response to the storage address to be masked being currently occupied, saving the storage address; and after the storage address is released, performing the masking operation on the saved storage address.

2. The method according to claim 1, wherein during the initialization phase of the heterogeneous system, the method further includes: performing an initialization operation on the device side; after performing the initialization operation, obtaining the address information from the host side and performing the masking operation on the storage address based on the address information; and writing the masking result obtained after performing the masking operation to the host side.

3. The method according to claim 1, wherein during the operation phase of the heterogeneous system, saving the storage address includes attaching the storage address as a node to a linked list, and the method further includes: when performing a memory release operation, traversing the linked list to determine whether the storage address to be released exists in the linked list; and in response to the storage address existing in the linked list, performing the masking operation on the storage address after performing the memory release operation for the storage address.

4. The method according to any one of claims 1 to 3, further comprising: In response to the occurrence of the ECC error, a cache invalidation operation is performed on the storage address to release the storage address from the device-side cache, wherein performing the cache invalidation operation includes releasing the memory allocated before performing the masking operation.

5. The method of claim 4, wherein releasing the memory allocated before performing the masking operation comprises: recording a first moment at which memory is allocated in the cache; recording a second moment before the masking operation begins; and comparing the first moment with the second moment so as to release all memory that was placed in the cache before the second moment.

6. A method for handling uncorrectable ECC errors in a heterogeneous system, wherein the heterogeneous system includes a host side and a device side, and the method is performed on the host side and includes: obtaining address information of a storage address to be masked, wherein the address information is associated with the ECC error, and the address information is obtained from a memory shared by the host side and the device side; and storing the address information so that the device side, by calling a specified interface in a memory allocator used to request memory, can read the address information and perform a masking operation on the storage address based on the address information; wherein the masking operation is performed during the initialization phase and/or the operation phase of the heterogeneous system; during the initialization phase of the heterogeneous system, performing the masking operation on the storage address based on the address information includes: in response to the storage address to be masked being currently occupied, marking the storage address as an error; and in response to the storage address to be masked being currently unoccupied, performing isolation on the storage address; and during the operation phase of the heterogeneous system, the method further includes: in response to the storage address to be masked being currently occupied, saving the storage address; and after the storage address is released, performing the masking operation on the saved storage address.

7. The method according to claim 6, further comprising: reading, from the device side, the masking result obtained after performing the masking operation; and updating the status bits associated with the masking operation based on the masking result.

8. The method according to claim 6 or 7, wherein during the operation phase of the heterogeneous system, the method further comprises: before triggering the masking operation on the device side, invalidating the memory cache flag on the host side to prevent the memory associated with the storage address from being allocated and used.

9. The method of claim 8, wherein during the operation phase, the method further comprises: obtaining the interrupt signal generated in response to the ECC error; and invoking the interrupt handler function for the masking operation based on the interrupt signal in order to initiate communication with the device side and cause the device side to perform the masking operation.

10. The method according to claim 6 or 7, wherein a state machine is arranged on the host side for maintaining the storage timing and working state of the address information and the masking operation, wherein the state machine includes at least an idle state and a suspended state, the method comprising: Based on the idle and suspended states of the state machine, execute N masking operations associated with N address information respectively, where N is greater than or equal to 2.

11. The method of claim 10, wherein performing the N masking operations comprises: in response to the state machine being in the idle state, storing the nth address information so that it can be read by the device side, where n and N are positive integers and 1≤n≤N; after writing the nth address information, updating the state machine to the suspended state to prevent the storage of the (n+1)th address information associated with the (n+1)th masking operation; and in response to the completion of the nth masking operation on the device side, updating the state of the state machine to the idle state to allow the storage of the (n+1)th address information.

12. The method of claim 11, wherein a counter for counting the masking operations to be performed is provided on the host side, the method further comprising: Based on the value of the counter and the state of the state machine, determine whether to store the address information.

13. The method of claim 11, wherein the states of the state machine further include a running state, and the method further includes: in response to the state machine being in the suspended state, updating the state of the state machine to the running state, so that the device side performs the masking operation in the running state; in response to receiving a notification from the device side that the current masking operation has been completed, updating the state machine to the idle state and decrementing the counter value by one; determining whether the counter is zero; and in response to the counter being non-zero, storing the next address information and updating the state machine to the suspended state so that the device side can perform a masking operation for the next address information.

14. A computer system comprising a device side and a host side, wherein the device side is configured to perform the method according to any one of claims 1-5, and the host side is configured to perform the method according to any one of claims 6-13, thereby enabling a masking operation for uncorrectable ECC errors within the computer system.

15. A method for handling uncorrectable ECC errors in a heterogeneous system, the heterogeneous system comprising a device side and a host side, the method comprising: obtaining, at the host side, address information of a storage address to be masked, wherein the storage address is associated with the ECC error, and the address information is obtained from a memory shared by the host side and the device side; storing the address information, at the host side, in a storage area shared with the device side; and reading, by the device side, the address information from the shared storage area and performing a masking operation on the storage address based on the address information by invoking a specified interface in a memory allocator used to request memory; wherein the masking operation is performed during the initialization phase and/or the operation phase of the heterogeneous system; during the initialization phase of the heterogeneous system, performing the masking operation on the storage address based on the address information includes: in response to the storage address to be masked being currently occupied, marking the storage address as an error; and in response to the storage address to be masked being currently unoccupied, performing isolation on the storage address; and during the operation phase of the heterogeneous system, the method further includes: in response to the storage address to be masked being currently occupied, saving the storage address; and after the storage address is released, performing the masking operation on the saved storage address.

16. A computer-readable storage medium having stored thereon computer program code for handling uncorrectable ECC errors in a heterogeneous system, wherein when the computer program code is run by a processor, it performs the method of any one of claims 1-5 or any one of claims 6-13.