Indirect memory copy method, processing unit, computing device, and system

By introducing an index cache and addressing unit outside the processing unit, the problem of slow indirect access to main memory in the prior art is solved, and more efficient data processing and optimization of computing resources are achieved.

CN115904213B · Active · Publication Date: 2026-06-12 · ALIBABA DAMO (HANGZHOU) TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ALIBABA DAMO (HANGZHOU) TECH CO LTD
Filing Date: 2021-09-30
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, indirect access to main memory requires the use of arithmetic units for index calculation and data loading, resulting in long execution cycles and slow speed.

Method used

An index cache and an addressing unit are introduced, located outside the arithmetic unit. The index address is sent to the index cache through an indirect storage copy instruction. The addressing unit calculates the source address based on the base address and the index, and loads data from the main storage area, thus separating the computing function from the storage function.

Benefits of technology

It saves instruction overhead and computing resources of the computing unit, improves the continuity and copying speed of data processing, and enhances data access efficiency, especially in sparse neural networks and recommendation machine learning.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present disclosure provide an indirect memory copy method, a processing unit, a computing device and a system. The processing unit of the embodiments of the present disclosure comprises an operation unit, an addressing unit and an index cache, and the addressing unit and the index cache are located between the main storage area outside the processing unit and the operation unit. The operation unit executes an indirect memory copy instruction, and the indirect memory copy instruction at least has a base address, an index address and a destination address, so as to send the index address to the index cache and send the base address and the destination address to the addressing unit; the index cache loads a corresponding index from the main storage area according to the index address and sends the index to the addressing unit; the addressing unit determines a source address of the main storage area corresponding to source data according to the base address and the index; and the addressing unit loads the source data from the main storage area according to the source address and sends the source data to the destination address of the internal cache of the operation unit. The scheme of the embodiments of the present disclosure saves the computing resources of the operation unit itself.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to an indirect storage copying method, processing unit, computing device, and system.

Background Technology

[0002] Generally, processing units such as Graphics Processing Units (GPUs) access source data in main memory in one of two ways: direct access and indirect access. In direct access, the processing unit first obtains the source address of the source data in main memory and then accesses the source data at that address. In indirect access, the processing unit obtains an index for the source address, calculates the source address from the index, and then accesses the source data at that address. Indirect access is common when the data to be accessed is concentrated in a specific area of main memory: because the data is stored close together, it is more efficient to express each address relative to a base address. Indirect access is used primarily in sparse neural networks, Graph Neural Networks (GNNs), and embedding gathering in recommendation machine learning.

[0003] In the prior art, indirect access is implemented by having the arithmetic unit generate index loading instructions and data loading instructions. The index is loaded from main memory using the index loading instructions, and, after the source address has been calculated, the source data is loaded using the data loading instructions. One drawback of this method is that it occupies the arithmetic unit, which therefore gets no relief while it is also performing a large number of calculations. Another drawback is that it requires multiple instructions, so the execution cycle is long and the access is slow.

Summary of the Invention

[0004] In view of this, embodiments of the present disclosure provide an indirect storage copying method, processing unit, computing device, and system that reduce the instruction overhead of the arithmetic unit itself and improve the data copying speed.

[0005] According to a first aspect of the present disclosure, a processing unit is provided. The processing unit includes an arithmetic unit, an addressing unit, and an index cache. The addressing unit and the index cache are located between a main storage area outside the processing unit and the arithmetic unit. The arithmetic unit has an internal cache. The arithmetic unit executes an indirect storage copy instruction, which has at least a base address, an index address, and a destination address, to send the index address to the index cache and send the base address and the destination address to the addressing unit. The index cache loads a corresponding index from the main storage area based on the index address and sends it to the addressing unit. The addressing unit determines the source address of the main storage area corresponding to the source data based on the base address and the index. The addressing unit loads the source data from the main storage area based on the source address and sends it to the destination address in the internal cache.

[0006] In another implementation of this disclosure, the indirect storage copy instruction further includes an addressing operand for calculating the source address; the arithmetic unit also sends the addressing operand to the addressing unit; the addressing unit determines the source address based on the base address, the index, and the addressing operand.

[0007] In another implementation of this disclosure, the addressing operand includes at least one of an offset and a step size.

[0008] In another implementation of this disclosure, the addressing unit determines the source address according to one of the following: source address = base address + index; source address = base address + index × step size; source address = base address + offset + index; source address = base address + offset + index × step size.
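The four address forms listed in [0008] can be read as one expression in which an absent offset behaves as 0 and an absent step size behaves as 1. The following Python sketch illustrates that reading only; the function name `source_address`, its parameter names, and the example values are assumptions and do not appear in the disclosure.

```python
def source_address(base, index, offset=0, step=1):
    """Compute a source address from the operands named in [0008].

    With offset=0 and step=1 this reduces to base + index; supplying
    either addressing operand yields the other three forms.
    """
    return base + offset + index * step


# The four forms, evaluated on made-up example values.
plain = source_address(0x1000, 3)                       # base + index
strided = source_address(0x1000, 3, step=4)             # base + index * step
shifted = source_address(0x1000, 3, offset=0x20)        # base + offset + index
both = source_address(0x1000, 3, offset=0x20, step=4)   # base + offset + index * step
print(hex(plain), hex(strided), hex(shifted), hex(both))
```

Treating the step size as an element width (for example, 4 bytes per 32-bit value) is one plausible use of the operand, not something the text prescribes.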

[0009] In another implementation of this disclosure, the index cache loads the corresponding index from the main storage area and stores the index and the index address in correspondence; after receiving the index address, the index cache searches for the corresponding stored index and index address; if the corresponding index is found, it is sent to the addressing unit; if not found, the corresponding index is loaded from the main storage area according to the index address.
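The lookup behaviour described in [0009] resembles an ordinary cache keyed by index address. A minimal Python sketch follows, assuming the main storage area can be modelled as a dictionary from address to value; the class name `IndexCache`, the `misses` counter, and the example addresses are hypothetical.

```python
class IndexCache:
    """Toy model of the index cache: maps index address -> index."""

    def __init__(self, main_storage):
        self.main_storage = main_storage  # assumed stand-in: dict of address -> value
        self.entries = {}                 # stored (index address, index) correspondences
        self.misses = 0                   # number of loads from the main storage area

    def get_index(self, index_address):
        # Hit: the index was stored earlier together with its index address.
        if index_address in self.entries:
            return self.entries[index_address]
        # Miss: load the index from the main storage area and remember it.
        self.misses += 1
        index = self.main_storage[index_address]
        self.entries[index_address] = index
        return index


main_storage = {0x200: 7, 0x204: 2}
cache = IndexCache(main_storage)
first = cache.get_index(0x200)    # miss, loaded from main_storage
second = cache.get_index(0x200)   # hit, served from the cache
```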

[0010] In another implementation of this disclosure, the addressing unit determines the main storage area that matches the source address, and loads the source data from the main storage area according to the correspondence between the main storage area and the arithmetic unit.

[0011] In another implementation of this disclosure, the processing unit includes a first partition and a second partition. The first partition includes a first arithmetic unit and a first addressing unit. The first addressing unit determines that the main storage area corresponds to the first partition, loads the source data from the main storage area via the last-level cache corresponding to the first partition, and sends the source data to the destination address in the first arithmetic unit.

[0012] In another implementation of this disclosure, the second partition includes a second processing unit and a second addressing unit. The first addressing unit and the second addressing unit are connected via an internal transmission link. The first addressing unit determines that the main storage area corresponds to the second partition and sends a read request to the second addressing unit via the internal transmission link. The read request instructs the second addressing unit to load the source data from the main storage area and return the source data to the first addressing unit via the internal transmission link. The first addressing unit sends the source data to the destination address in the first processing unit.
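The local and remote paths of [0011] and [0012] can be sketched as two addressing units that each own part of the main storage and forward a read request to each other when an address belongs to the other partition. This is an illustrative model only: the class and method names are assumptions, dictionaries stand in for the main storage areas, and the internal transmission link is reduced to a direct method call.

```python
class AddressingUnit:
    """Toy addressing unit for one partition (all names are illustrative)."""

    def __init__(self, local_storage):
        self.local_storage = local_storage  # dict: source address -> source data
        self.peer = None                    # unit reached over the internal link
        self.remote_requests = 0

    def read_local(self, source_address):
        # Models a read through this partition's own last-level cache path.
        return self.local_storage[source_address]

    def load(self, source_address):
        if source_address in self.local_storage:
            return self.read_local(source_address)
        # The address belongs to the other partition: send a read request
        # over the internal link and relay the returned source data.
        self.remote_requests += 1
        return self.peer.read_local(source_address)


first_unit = AddressingUnit({0x100: "a"})
second_unit = AddressingUnit({0x900: "b"})
first_unit.peer, second_unit.peer = second_unit, first_unit

internal_cache = {}                                 # cache of the first arithmetic unit
internal_cache["dst0"] = first_unit.load(0x100)     # served locally
internal_cache["dst1"] = first_unit.load(0x900)     # served by the second partition
```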

[0013] In another implementation of this disclosure, the addressing unit determines whether the source address matches the main storage area by determining whether a predetermined field of the source address is a preset value corresponding to the main storage area; or, the addressing unit determines that the source address matches the main storage area based on the pre-stored address-main storage area matching relationship.
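The two matching strategies in [0013] can be illustrated as follows. The sketch assumes 32-bit addresses whose top four bits act as the predetermined field, and a small range table as the pre-stored matching relationship; the field position, the table contents, and all names are hypothetical choices made for the example.

```python
REGION_SHIFT = 28   # assumed: top 4 bits of a 32-bit address select the region
REGION_MASK = 0xF


def region_by_field(source_address):
    """Read the predetermined field of the address as a region identifier."""
    return (source_address >> REGION_SHIFT) & REGION_MASK


# Assumed pre-stored matching relationship: (start, end_exclusive, region id).
REGION_TABLE = [
    (0x0000_0000, 0x1000_0000, 0),
    (0x1000_0000, 0x2000_0000, 1),
]


def region_by_table(source_address):
    """Look the address up in the pre-stored address/region table."""
    for start, end, region in REGION_TABLE:
        if start <= source_address < end:
            return region
    return None   # no main storage area matches this address


print(region_by_field(0x1000_0040), region_by_table(0x1000_0040))
```

The field check needs no stored state, while the table form can describe regions of unequal size; which one an implementation uses is left open by the text.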

[0014] According to a second aspect of the present disclosure, a computing device is provided, comprising: a processing unit as described in the first aspect and the main storage area.

[0015] According to a third aspect of the present disclosure, an on-chip interconnect network system is provided, including the processing unit described in the first aspect.

[0016] According to a fourth aspect of the present disclosure, an indirect storage copying method for a processing unit is provided. The processing unit includes an arithmetic unit, an addressing unit, and an index cache. The addressing unit and the index cache are located between a main storage area outside the processing unit and the arithmetic unit. The arithmetic unit has an internal cache. The method includes: executing an indirect storage copying instruction through the arithmetic unit, the indirect storage copying instruction having at least a base address, an index address, and a destination address, to send the index address to the index cache and send the base address and the destination address to the addressing unit; loading a corresponding index from the main storage area according to the index address through the index cache and sending it to the addressing unit; determining the source address of the main storage area corresponding to source data through the addressing unit according to the base address and the index; and loading the source data from the main storage area according to the source address through the addressing unit and sending it to the destination address of the internal cache.

[0017] In the scheme of this embodiment, since the index cache and the addressing unit are located outside the arithmetic unit, the arithmetic unit does not have to load the index from the index address, calculate the source address from the index, or load the source data from the source address, which saves the instruction overhead and computing resources of the arithmetic unit itself. Furthermore, the index cache is used to retrieve the index, and the addressing unit is used to calculate the source address, so that storage functions and computational functions are separated. This ensures the storage efficiency of the index cache and the computational efficiency of the addressing unit. The addressing unit calculates the source address and then loads the source data; this continuity of data processing ensures the speed of source data copying.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings.

[0019] Figure 1A is a schematic structural diagram of a processing unit according to an example;

[0020] Figure 1B is a schematic block diagram of an example of the processing unit in Figure 1A;

[0021] Figure 2A is a schematic structural diagram of a processing unit according to an embodiment of the present disclosure;

[0022] Figure 2B is a schematic structural diagram of a computing device according to another embodiment of the present disclosure;

[0023] Figure 3 is a schematic block diagram of an indirect storage copying method according to another embodiment of the present disclosure;

[0024] Figure 4 is a schematic block diagram of an indirect storage copying method according to another embodiment of the present disclosure;

[0025] Figure 5 is a schematic structural diagram of an on-chip interconnect network system according to another embodiment of the present disclosure;

[0026] Figure 6 is a schematic flowchart of an indirect storage copying method according to another embodiment of the present disclosure.

Detailed Implementation

[0027] To enable those skilled in the art to better understand the technical solutions in the embodiments of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art should fall within the protection scope of this disclosure.

[0028] The specific implementation of the embodiments of this disclosure will be further described below with reference to the accompanying drawings.

[0029] Figure 1A is a schematic structural diagram of a processing unit according to an example. The computer system 10 is an example of a "centralized" system architecture. The computer system 10 can be built on various models of processing units currently on the market and is driven by operating systems such as Windows, UNIX, and Linux. Furthermore, the computer system 10 can be implemented in hardware and / or software such as PCs, desktops, laptops, servers, and mobile communication devices.

[0030] As shown in Figure 1A, the computer system 10 of this embodiment may include one or more processing units 12 and a memory 14.

[0031] The memory 14 in the computer system 10 can be main memory (also referred to simply as memory or RAM). It is used to store instruction information and / or data information represented by data signals, such as data provided by the processing unit 12 (e.g., calculation results), and can also be used to exchange data between the processing unit 12 and the external storage device 16 (also called auxiliary memory or external storage).

[0032] In some situations, processing unit 12 may need to access memory 14 to retrieve or modify data in memory 14. Because memory 14 has a relatively slow access speed, to alleviate the speed difference between processing unit 12 and memory 14, computer system 10 also includes a cache memory 18 coupled to bus 11. The cache memory 18 is used to cache program data or message data that may be repeatedly accessed in memory 14. The cache memory 18 is implemented, for example, by a storage device of the type Static Random Access Memory (SRAM). The cache memory 18 can be a multi-level structure, such as a three-level cache structure with a level 1 cache (L1 cache), a level 2 cache (L2 cache), and a level 3 cache (L3 cache), or a cache structure with more than three levels or other types of cache structures. In some embodiments, a portion of the cache memory 18 (e.g., the level 1 cache, or a level 1 cache and a level 2 cache) can be integrated inside processing unit 12 or integrated with processing unit 12 on the same on-chip system.

[0033] Based on this, the processing unit 12 may include an instruction execution unit 13, and may also include a memory management unit, etc. When executing some instructions that require memory modification, the instruction execution unit 13 initiates a write access request, which specifies the data to be written into memory and the corresponding physical address; the memory management unit is used to translate the virtual address specified by these instructions into the physical address mapped by the virtual address, and the physical address specified by the write access request may be the same as the physical address specified by the corresponding instruction.

[0034] Information exchange between memory 14 and cache memory 18 is typically organized in blocks. In some embodiments, cache memory 18 and memory 14 may be divided into data blocks of the same spatial size, and a data block may serve as the smallest unit of data exchange between cache memory 18 and memory 14 (including one or more data of a preset length). For clarity, each data block in cache memory 18 will be referred to as a cache block (or cache line), and different cache blocks will have different cache block addresses; each data block in memory 14 will be referred to as a memory block, and different memory blocks will have different memory block addresses. Cache block addresses may include, for example, physical address tags used to locate the data blocks.

[0035] Due to space and resource limitations, cache memory 18 cannot cache all the contents of memory 14; that is, the storage capacity of cache memory 18 is usually smaller than that of memory 14, and the addresses of individual cache blocks provided by cache memory 18 cannot correspond to all the memory block addresses provided by memory 14. When processing unit 12 needs to access memory, it first accesses cache memory 18 via bus 11 to determine whether the content to be accessed is already stored in cache memory 18. If so, cache memory 18 is hit, and processing unit 12 directly retrieves the content to be accessed from cache memory 18. If the content to be accessed by processing unit 12 is not in cache memory 18, processing unit 12 needs to access memory 14 via bus 11 to find the corresponding information in memory 14. Because the access speed of cache memory 18 is very fast, when cache memory 18 is hit, the efficiency of processing unit 12 can be significantly improved, thereby improving the performance and efficiency of the entire computer system 10.

[0036] The following section will provide a general explanation of the working principle of a computer system using two specific processing unit architectures. It should be understood that the two processing unit architectures described below are merely examples, and the computer system may also employ other processing unit architectures.

[0037] Example of processing unit architecture

[0038] Figure 1B is a schematic block diagram of an example of the processing unit in Figure 1A. In this schematic architecture, the processing unit can be a GPU.

[0039] In some embodiments, the processing unit architecture may include one or more streaming multiprocessors (SMs) 122 for processing instructions, the processing and execution of which can be controlled by a user (e.g., through an application) and / or the system platform. In some embodiments, each SM 122 may be used to process a specific instruction set. In some embodiments, the instruction set may support Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computation based on Very Long Instruction Words (VLIW). Different SMs 122 may each process different or the same instruction sets. In some embodiments, the SM 122 may also include other processing modules, such as a Digital Signal Processor (DSP).

[0040] In some embodiments, the cache memory 18 shown in Figure 1A can be partially integrated into the SM 122 as a multi-level cache 182. Depending on the architecture, the cache memory 18 can be a single-level or multi-level internal cache located within and / or outside each SM 122 (Figure 1B shows a two-level cache, L1 and L2, uniformly identified as 182), and may also include an instruction-oriented cache and a data-oriented cache. In some embodiments, the components in SM 122 may share at least a portion of the cache memory as shared memory 136; for example, as shown in Figure 1B, multiple SMs 122 share the second-level cache L2. It should be understood that other cache structures can also be used as external caches for SM 122.

[0041] In some embodiments, as shown in Figure 1B, the SM 122 may include multiple streaming processors (SPs) 1221 and a register file 138. Register file 138 may include multiple registers for storing different types of data and / or instructions. These registers can be of different types. For example, register file 138 may include integer registers, floating-point registers, status registers, instruction registers, and pointer registers. The registers in register file 138 can be implemented using general-purpose registers, or a specific design can be adopted according to the actual needs of the SM 122.

[0042] Each SP 1221 is connected to register file 138 and is used to execute the instruction sequence (i.e., program). SP 1221 fetches the instruction after instruction scheduling from SM 122. SP 1221 performs steps such as decoding the fetched instruction, executing the decoded instruction, and saving the instruction execution result. This process is repeated until all instructions in the instruction sequence have been executed or a halt instruction is encountered.

[0043] When executing instructions, each SP 1221 can use the registers corresponding to that SP 1221 in register file 138. Different SPs 1221 can share the storage space in shared memory 136, and shared memory 136 and register file 138 can be different logical regions in the same storage medium. In addition, different SMs 122 can also have global memory, etc. (not shown).

[0044] Furthermore, each SP 1221 and the register file 138 can be connected to the scheduling unit 1222 of the SM 122. In the example of Figure 1B, data access can be performed using either direct addressing or indirect addressing. The scheduling unit 1222 can be used to generate direct store copy instructions and indirect store copy instructions. As an example of direct addressing, the processing unit core 1201 or SM 122 can respond to a direct store copy instruction by retrieving the address information of the data from memory 14 (including memory 141 or memory 142) via cache memory 18, and then retrieving the data from memory 14 based on that address information. As an example of indirect addressing, the processing unit core 1201 or SM 122 can respond to an indirect store copy instruction by retrieving the index address corresponding to the source data from memory 14 (including memory 141 or memory 142) via cache memory 18, retrieving the index from memory 14 via cache memory 18 based on the index address, calculating the source address of the source data based on the index, and then retrieving the source data from memory 14 via cache memory 18 based on the source address. When parallel computation is required on a large amount of source data stored in adjacent memory, this indirect addressing method lets the processing unit first obtain the index corresponding to the source data and then calculate the actual source address from the index, which improves addressing efficiency to a certain extent. However, compared with direct addressing, calculating the actual source address from the index consumes more computing resources of the processing unit.
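To make the cost of this baseline concrete, the following Python sketch models an indirect access in which the arithmetic unit itself issues every step. It is a simplified illustration, not the behaviour of any particular GPU: memory is modelled as a dictionary, the source address is assumed to be base + index, and the `issued` counter is a hypothetical proxy for instruction overhead.

```python
def baseline_indirect_load(memory, base, index_addresses):
    """Gather performed entirely by the arithmetic unit (baseline style).

    Returns the gathered data and the number of operations that the
    arithmetic unit itself had to issue.
    """
    issued = 0
    gathered = []
    for index_address in index_addresses:
        index = memory[index_address]             # 1) index loading instruction
        issued += 1
        source_address = base + index             # 2) source address calculation
        issued += 1
        gathered.append(memory[source_address])   # 3) data loading instruction
        issued += 1
    return gathered, issued


memory = {0x10: 2, 0x14: 0, 0x100: "row0", 0x101: "row1", 0x102: "row2"}
data, issued = baseline_indirect_load(memory, base=0x100,
                                      index_addresses=[0x10, 0x14])
```

In this model each gathered element costs the arithmetic unit three operations, which is the overhead the later embodiments move into the index cache and the addressing unit.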

[0045] The processing unit of the embodiments of the present disclosure will first be described below with reference to Figure 2A and Figure 2B. Figure 2A is a schematic block diagram of a processing unit according to another embodiment of the present disclosure. Figure 2B is a schematic structural diagram of a computing device including a processing unit according to another embodiment of the present disclosure.

[0046] The processing unit 200 of Figure 2A includes an arithmetic unit 210, an addressing unit 230, and an index cache 220. The addressing unit 230 and the index cache 220 are located between the main storage area outside the processing unit 200 and the arithmetic unit 210. The arithmetic unit 210 has an internal cache 211.

[0047] The arithmetic unit 210 executes an indirect storage copy instruction, which has at least a base address, an index address, and a destination address, to send the index address to the index cache 220 and the base address and destination address to the addressing unit 230.

[0048] Index cache 220 loads the corresponding index from the main storage area according to the index address and sends it to addressing unit 230.

[0049] Addressing unit 230 determines the source address of the main storage area corresponding to the source data based on the base address and index.

[0050] Addressing unit 230 loads source data from the main storage area according to the source address and sends it to the destination address of internal cache 211.
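The hand-offs in [0047] to [0050] can be traced in a few lines of Python. The sketch below is a behavioural model under simplifying assumptions: the main storage area and the internal cache are dictionaries, the source address is taken as base address + index, and the function name `indirect_copy` and the example values are hypothetical.

```python
def indirect_copy(main_storage, internal_cache, base, index_address, destination):
    """Model one indirect storage copy as the hand-offs of [0047]-[0050]."""
    # Arithmetic unit: issues the instruction operands and does nothing further.
    # Index cache: loads the index stored at the index address.
    index = main_storage[index_address]
    # Addressing unit: derives the source address from base address and index.
    source_address = base + index
    # Addressing unit: loads the source data and writes it to the destination
    # address in the internal cache of the arithmetic unit.
    internal_cache[destination] = main_storage[source_address]
    return source_address


main_storage = {0x20: 5, 0x400: "x0", 0x405: "x5"}
internal_cache = {}
src = indirect_copy(main_storage, internal_cache,
                    base=0x400, index_address=0x20, destination=0)
```

Only the first comment line corresponds to work done by the arithmetic unit; everything after it is attributed to the index cache or the addressing unit.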

[0051] It should be understood that the processing units in this document include, but are not limited to, the various processing unit architectures shown in Figures 1A and 1B, such as multiprocessor architectures, CPU architectures, and GPU architectures. When the processing unit is a CPU, the arithmetic unit can be a CPU core; when the processing unit is a GPU, the arithmetic unit can be an SM or SP. When the arithmetic unit is an SM, the internal cache can be an L1 cache, shared memory, or L2 cache; when the arithmetic unit is an SP, the internal cache can be an L1 cache or shared memory.

[0052] It should also be understood that cache 240 can be any level of the multi-level cache within the processing unit. The cache can be implemented as Dynamic Random Access Memory (DRAM).

[0053] It should also be understood that index cache 220 can be configured in correspondence with cache 240. Specifically, index cache 220 is independent of cache 240: it can be a logically separate storage space within cache 240, or it can be a physical storage space separate from cache 240.

[0054] It should also be understood that addressing unit 230 can correspond to cache 240. Addressing unit 230 can be configured in software or hardware, and can communicate directly with cache 240 or index cache 220, for example, directly retrieving data of any form from cache 240 or index cache 220 or caching data in cache 240 or index cache 220. The main memory area can be implemented as main memory or a block of main memory, for example, as dynamic random access memory.

[0055] It should also be understood that the addressing unit 230 and index cache 220 mentioned above can correspond to the same cache level in a multi-level cache, or they can correspond to different cache levels.

[0056] It should also be understood that the base address can be interpreted as the location where the first source data in a series of source data is stored; in other words, the base address can be the source address of the first source data. The source addresses of other source data can be obtained by referring to the base address. For example, source address = base address + relative position, where the relative position refers to the relative position between the source address and the base address. The relative position can be obtained based on the offset of the source address relative to the base address, or it can be obtained based on the index. It should also be understood that the index itself can indicate the relative position with respect to the base address, or the relative position can be determined based on the product of the index and the step size.

[0057] It should also be understood that the source address can be the address of the source data in the main storage area. The data can be obtained from the main storage area based on the source address, or it can be obtained from the cache based on the source address, or it can be obtained from the main storage area via the cache. This embodiment does not limit this.

[0058] It should also be understood that the destination address can be an address in the internal cache 211 of the arithmetic unit 210. After the source data used for calculation by the arithmetic unit is copied to the internal cache, the source data can be further processed according to the corresponding instructions.

[0059] It should also be understood that the index cache and addressing unit being located between the main storage area and the arithmetic unit means that the index cache and addressing unit are located outside the arithmetic unit in terms of hardware configuration, while the index cache and addressing unit are part of the processing unit and are located inside the processing unit in terms of physical configuration.

[0060] The operations performed by each of the above units will be explained below with reference to Figure 2B. The computing device of Figure 2B may include the processing unit 200 of Figure 2A and the main storage area 250.

[0061] The arithmetic unit 210 may include registers, a logic computation unit, and a scheduling unit. When the arithmetic unit is a CPU core, the logic computation unit can be used to retrieve source data from the internal cache, and the scheduling unit can directly or indirectly retrieve indirect memory copy instructions from the main storage area and determine the base address, destination address, addressing parameters, etc., from those instructions. When the arithmetic unit is part of a GPU, the scheduling unit can be implemented in the SM (streaming multiprocessor), the logic computation unit can be included in the SP (streaming processor), and the registers can be registers in the SP or registers in the register file of the SM. The computation can also be performed by the SP. For the specific architecture of the arithmetic unit in this example, see Figures 1A and 1B.

[0062] The following example illustrates the operations and methods performed by each unit. In this example, the arithmetic unit is an SM, the main storage area 250 is implemented as a DRAM chip, and the cache 240 is implemented as a level 2 cache. The level 2 cache can serve as the last-level cache; the level 1 cache is not shown.

[0063] The SM 210 may include multiple streaming processors (SPs) 219 and a register file 218. The register file 218 may include multiple registers for storing different types of data and / or instructions. These registers can be of different types. For example, the register file 218 may include integer registers, floating-point registers, status registers, instruction registers, and pointer registers. The registers in the register file 218 can be implemented using general-purpose registers, or a specific design can be adopted according to the actual needs of the SM 210.

[0064] Each SP 219 is connected to register file 218 and is used to execute the instruction sequence (i.e., the program). SP 219 fetches the instruction after instruction scheduling from SM 210. SP 219 performs steps such as decoding the fetched instruction, executing the decoded instruction, and saving the instruction execution result. This process is repeated until all instructions in the instruction sequence have been executed or a halt instruction is encountered.

[0065] When executing instructions, each SP 219 can use the registers corresponding to that SP 219 in register file 218. Different SPs 219 can share the storage space in shared memory 211, and shared memory 211 and register file 218 can be different logical regions in the same storage medium. In addition, different SMs 210 can also have global memory, etc. (not shown).

[0066] First, the SM 210 can obtain indirect memory copy instructions from the DRAM 250. These instructions have at least a base address, an index address, and a destination address. For example, the SM 210 can use the scheduling unit to resolve the indirect memory copy instruction into the base address, index address, and destination address.

[0067] Then, SM 210 can directly send the index address to index cache 220 and the base address and destination address to addressing unit 230. Alternatively, SM 210 can send the base address, index address, and destination address to an index request queue outside of SM 210, through which the index address will be sent to index cache 220 and the base address and destination address will be sent to addressing unit 230. It should be understood that when DRAM 250 is the main memory, the index at the index address can be obtained from DRAM 250; when DRAM 250 is a partition of the main memory, the index can also be obtained from other partitions.

[0068] Index cache 220 loads the corresponding index from DRAM 250 based on the obtained index address and sends it to addressing unit 230. Specifically, index cache 220 can obtain an index request, which includes the index address, from the index request queue. Regarding how the index is acquired, in one example, index cache 220 obtains the index corresponding to the index address from DRAM 250 and caches the index locally; in another example, index cache 220 checks whether an index corresponding to the index address already exists locally, and if so, uses that locally cached index directly. Index cache 220 then sends the index to addressing unit 230. It should be understood that when checking whether the index exists locally, index cache 220 checks whether a previously stored index corresponds to the index address. In addition, when an index is loaded from DRAM 250 into index cache 220, a correspondence between the index address and the index can be established, and this correspondence can be removed when the index is removed from index cache 220. An index may be removed from index cache 220 because the number of currently cached indexes has exceeded a threshold, so that previously cached indexes are evicted, or because the current processing task of SM 210 has ended, requiring the indexes in index cache 220 to be cleared. More specifically, a queue can be maintained in index cache 220. In one example, the current state of the queue is determined, and if the queue is full, the target index to be retrieved replaces another index in the queue. When the queue is a first-in-first-out (FIFO) queue, the target index can be copied to the head of the FIFO queue, and the other index at the tail of the queue can be deleted. In another example, the usage frequency of each index stored in the queue can be determined, and the least frequently used index can be replaced with the target index to be retrieved.
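
The lookup-then-load behavior and the FIFO replacement described in this paragraph can be sketched in software as follows. This is a minimal illustrative model under stated assumptions, not the hardware design of the disclosure: the class name `IndexCache`, the `capacity` parameter, the hit and miss counters, and the dictionary standing in for DRAM 250 are all hypothetical. The oldest cached entry (the "tail" in the wording above) is the one evicted.

```python
from collections import OrderedDict


class IndexCache:
    """Toy software model of index cache 220 with FIFO replacement.

    All names here are hypothetical and chosen only for illustration.
    """

    def __init__(self, main_memory, capacity=4):
        self.main_memory = main_memory  # dict: index address -> index (stands in for DRAM 250)
        self.capacity = capacity        # assumed threshold on the number of cached indexes
        self.entries = OrderedDict()    # index address -> index, oldest entry first
        self.hits = 0
        self.misses = 0

    def load(self, index_address):
        """Return the index for index_address, loading it from main memory on a miss."""
        if index_address in self.entries:
            self.hits += 1              # a hit does not reorder entries (FIFO, not LRU)
            return self.entries[index_address]
        self.misses += 1
        index = self.main_memory[index_address]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # queue full: evict the oldest entry
        self.entries[index_address] = index   # record the address-to-index correspondence
        return index

    def clear(self):
        """Drop all cached indexes, e.g. when the current processing task ends."""
        self.entries.clear()
```

A frequency-based variant would keep a use counter per entry and evict the entry with the smallest counter instead of the oldest one.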

[0069] Addressing unit 230 determines the source address of the DRAM 250 corresponding to the source data based on the base address and index. Specifically, addressing unit 230 can calculate the source address of the source data based on the base address and index, for example, source address = base address + index. Additionally, the indirect memory copy instruction also has addressing operands for source address calculation. For example, the scheduling unit parses the indirect memory copy instruction to obtain addressing operands. In this case, addressing unit 230 can calculate the source address based on the base address, index, and addressing operands. The addressing operands can include at least one of offset and step size, where source address = base address + index × step size; or, source address = base address + offset + index; or, source address = base address + offset + index × step size.
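
The four address formulas in this paragraph collapse into one expression when the offset defaults to 0 and the step size defaults to 1. The sketch below illustrates this; the function name `source_address` and the default values are assumptions made for illustration and do not appear in the disclosure.

```python
def source_address(base, index, offset=0, step=1):
    """Compute base + offset + index * step.

    With the defaults (offset=0, step=1) this reduces to base + index.
    The function name and defaults are hypothetical.
    """
    return base + offset + index * step
```

For example, with a base address of 0x1000, an index of 3, and a step size of 4 (such as 4-byte elements), the source address is 0x100C.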

[0070] Addressing unit 230 loads source data from DRAM 250 according to the source address and sends it to the destination address of shared memory 211. Specifically, addressing unit 230 can obtain source data according to the source address and send it to shared memory 211 in SM 210, where SM 210, such as SP 219 and register file 218, can perform further processing based on the data.

[0071] In the embodiments of this disclosure, since the index cache and addressing unit are located outside the arithmetic unit, the arithmetic unit does not itself have to load the index from the index address, calculate the source address from the index, or load the source data from the source address, thus saving the instruction overhead and computational resources of the arithmetic unit itself. Furthermore, the index cache is used to retrieve the index, and the addressing unit is used to perform the calculation of the source address, achieving separation of computational and storage functions. This ensures the storage efficiency of the index cache and the computational efficiency of the addressing unit. The addressing unit calculates the source address and then loads the source data; this continuity of data processing ensures the efficiency of source data retrieval. In other words, under the indirect memory copy instruction, the source data is returned to the internal cache of the arithmetic unit after processing by the index cache and addressing unit, sparing the arithmetic unit a separate source data load instruction and source address calculation instruction.

[0072] Furthermore, for other parts related to the addressing unit and index cache in the embodiments of this disclosure, reference can be made to Figures 1A and 1B. Based on the descriptions and explanations in the various examples above, those skilled in the art can implement the computations and data processing related to the indirect storage copying method of the embodiments of this disclosure.

[0073] Furthermore, when GPUs execute neural network-related algorithms, the elements within a neural network layer exhibit simple correlations. The solutions in this disclosure can significantly improve the data processing efficiency of GPUs. This is particularly true in low-level data computations related to sparse neural networks, graph neural networks (GNNs), and embedding clustering in machine learning algorithms such as recommendation algorithms, further conserving GPU resources and computing power.

[0074] Figure 3 is a schematic block diagram of an indirect storage copying method according to another embodiment of the present disclosure. It should be understood that the request queue 310 may include a base address, an index address, and a destination address. For example, the request queue 310 may obtain the base address, index address, and destination address from the processing unit. The request queue 310 may include an index request queue and a source data request queue; here, only a unified request queue 310 is used as an example. The response queue 320 is a response queue for source data. The request queue 310 may send the index address to the index cache 220 and send the base address to the addressing unit 230. The request queue 310 may also send the destination address to the response queue 320 via the addressing unit 230.

[0075] First, index cache 220 retrieves an index request from request queue 310, the index request including an index address. In one example, this index address corresponds to main storage area 251, so index cache 220 can retrieve the index corresponding to that index address from cache 240. If the index does not exist in cache 240, it can retrieve the index corresponding to that index address from main storage area 251 and store it in cache 240. Alternatively, index cache 220 can also directly retrieve the index corresponding to that index address from main storage area 251. In another example, the index address may also correspond to main storage area 252, in which case the index can be retrieved through an internal communication link.

[0076] Then, index cache 220 sends the index to addressing unit 230, which calculates the source address of the source data based on the previously acquired base address and index. Specifically, addressing unit 230 can calculate the source address based on the base address, index, and addressing operands. The addressing operands can include at least one of offset and step size, and the source address can be calculated in any of the following ways. First way: source address add = [base] + [ind], where [base] represents the base address included in the address parameters, and [ind] represents the index. Second way: source address add = [base] + [ind] * n, where [base] represents the base address, n represents the step size included in the addressing operands, and [ind] represents the index. Third way: source address add = [base] + [offset] + [ind], where [base] represents the base address, [offset] represents the offset included in the addressing operands, and [ind] represents the index. Fourth way: source address add = [base] + [offset] + [ind] * n, where [base] represents the base address, [offset] and n represent the offset and step size included in the addressing operands, respectively, and [ind] represents the index. Therefore, when the main storage area stores a large amount of location-related source data, calculating the source address of the source data through step size, offset, and index improves the overall data access efficiency.

[0077] Then, addressing unit 230 puts the source data request, including the source address, into request queue 310. Request queue 310 sends the source data request to cache 240. If the source data has already been cached from main storage area 251 into cache 240, cache 240 sends the source data directly to response queue 320; otherwise, the source data is first cached from main storage area 251 into cache 240, and cache 240 then sends the source data to response queue 320.

[0078] Then, response queue 320 can send the source data to the internal cache in the processing unit based on the pre-obtained destination address.
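
The flow of Figure 3 described above can be approximated by the following software sketch. It is a simplified model under stated assumptions: the function name `run_figure3_flow`, the tuple layout of a request, and the dictionaries standing in for main storage area 251, cache 240, and the internal cache are hypothetical, and the source address is computed with the first way only (base address plus index). The sketch also folds the index request and the source data request into a single pass per request rather than re-queuing the source data request.

```python
from collections import deque


def run_figure3_flow(requests, main_memory, cache):
    """Process (base, index_address, destination) requests in order.

    main_memory and cache are dicts mapping address -> value; cache is
    filled on misses. Returns a dict modelling the internal cache of the
    processing unit (destination -> source data). All names are hypothetical.
    """
    request_queue = deque(requests)   # stands in for request queue 310
    response_queue = deque()          # stands in for response queue 320
    internal_cache = {}

    def read_through(address):
        # Serve from the cache if present; otherwise fill it from main memory.
        if address not in cache:
            cache[address] = main_memory[address]
        return cache[address]

    while request_queue:
        base, index_address, destination = request_queue.popleft()
        index = read_through(index_address)   # index cache step
        source = base + index                 # addressing unit step (first way)
        data = read_through(source)           # source data request via the cache
        response_queue.append((destination, data))

    while response_queue:
        destination, data = response_queue.popleft()
        internal_cache[destination] = data    # deliver to the destination address
    return internal_cache
```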

[0079] Figure 4 is a schematic block diagram of an indirect storage copying method according to another embodiment of the present disclosure. In the example of Figure 4, addressing unit 230 is used to determine the main storage area 250 that matches the source address, and loads source data from main storage area 250 according to the correspondence between main storage area 250 and arithmetic unit 210. It should be understood that main storage areas 251 and 252 are examples of main storage area 250. The correspondence between main storage area 250 and arithmetic unit 210 improves the efficiency and reliability of source data retrieval. Main storage areas that have a correspondence with the arithmetic unit can store source data with high access frequency, while main storage areas without such a correspondence can store source data with low access frequency, thereby improving the overall computational efficiency of the arithmetic unit.

[0080] Specifically, the processing unit may include a first partition and a second partition (corresponding to partition A and partition B in Figure 4, respectively). The first partition and the second partition represent different partitions among multiple partitions of the processing unit. The first partition includes a first arithmetic unit 2101 and a first addressing unit 231, and the second partition includes a second arithmetic unit 2102 and a second addressing unit 232.

[0081] In one example of retrieving source data, the first addressing unit 231 determines that the main storage area 251 corresponds to the first partition, loads the source data from the main storage area 251 via the last-level cache 241 corresponding to the first partition, and sends the source data to the destination address in the first arithmetic unit 2101. Similarly, the second addressing unit 232 determines that the main storage area 252 corresponds to the second partition, loads the source data from the main storage area 252 via the last-level cache 242 corresponding to the second partition, and sends the source data to the destination address in the second arithmetic unit 2102. Thus, each addressing unit efficiently reads the source data from its main storage area via its corresponding cache.

[0082] In another example of acquiring source data, the first addressing unit 231 and the second addressing unit 232 are communicatively connected via an internal transmission link 410. It should be understood that the internal transmission link 410 is used to enable communication between different partitions; in other words, in this example, the internal transmission link 410 can connect at least the addressing units of different partitions, i.e., it enables the first addressing unit 231 and the second addressing unit 232 to communicate. Specifically, the internal transmission link 410 can be implemented as a Network on Chip (NOC), and within each partition, the internal transmission link 410 can also be used to connect arithmetic units to caches, or caches of different levels (e.g., when a partition has a Level 3 cache, the internal transmission link 410 can be used to connect the Level 2 cache and the Level 3 cache). In this example, the first addressing unit 231 determines that the main storage area 252 corresponds to the second partition and sends a read request to the second addressing unit 232 via the internal transmission link 410. The read request instructs the second addressing unit 232 to load source data from the main storage area 252 and return the source data to the first addressing unit 231 via the internal transmission link 410. Then, the first addressing unit 231 sends the source data to the destination address in the first arithmetic unit 2101, thereby enabling the first addressing unit to efficiently read the source data from the main storage area via the internal transmission link.
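
The local read and the cross-partition read described above can be contrasted with a small software model. This is a sketch under stated assumptions only: the class `AddressingUnit`, its methods, and the `remote_reads` counter are hypothetical, the internal transmission link is reduced to a list of directly reachable peer units, and each main storage area is a dictionary.

```python
class AddressingUnit:
    """Toy model of a per-partition addressing unit (names are hypothetical)."""

    def __init__(self, name, local_area):
        self.name = name
        self.local_area = local_area  # dict: address -> data, this partition's main storage area
        self.peers = []               # addressing units reachable over the internal link
        self.remote_reads = 0         # number of reads served by another partition

    def connect(self, other):
        """Model a bidirectional internal transmission link between two units."""
        self.peers.append(other)
        other.peers.append(self)

    def serve_read(self, address):
        """Handle a read request arriving from a peer over the link."""
        return self.local_area[address]

    def load(self, address):
        """Load source data locally when possible, otherwise via a peer."""
        if address in self.local_area:
            return self.local_area[address]
        for peer in self.peers:
            if address in peer.local_area:
                self.remote_reads += 1
                return peer.serve_read(address)
        raise KeyError(address)
```

In this model, a unit standing in for addressing unit 231 returns local data without touching the link, and forwards the request to the unit standing in for addressing unit 232 only when the address belongs to the other partition.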

[0083] Furthermore, the index retrieval method can also be performed based on the internal transmission link 410. The first partition includes a first index cache 221, and the second partition includes a second index cache 222. The internal transmission link 410 is also used to connect the index caches in different partitions, that is, to connect the first index cache 221 and the second index cache 222.

[0084] In one example of retrieving the index, the first index cache 221 determines that the index address corresponds to the first partition, and loads the index from the main storage area 251 via the last-level cache 241 corresponding to the first partition, based on the index address of the main storage area 251. Similarly, the second index cache 222 determines that the index address corresponds to the second partition, and loads the index from the main storage area 252 via the last-level cache 242 corresponding to the second partition, based on the index address of the main storage area 252. Therefore, each index cache efficiently retrieves the index from its main storage area via its corresponding cache.

[0085] In another example of index retrieval, the first index cache 221 determines that the index address corresponds to the second partition and sends an index request to the second index cache 222 via the internal transmission link 410. The index request instructs the second index cache 222 to load the index from the main storage area 252 according to the index address and return the index to the first index cache 221 via the internal transmission link 410. Similarly, if the second index cache 222 determines that the index address corresponds to the first partition, it can also send an index request to the first index cache 221 via the internal transmission link 410. The first index cache 221 can respond to the index request, load the index corresponding to the index address in the main storage area 251, and return the index to the second index cache 222. Therefore, both the first index cache and the second index cache can efficiently retrieve the index from the main storage area via the internal transmission link.

[0086] It should be understood that the above examples illustrate how the target partition's arithmetic unit generates indirect storage copy instructions, retrieves the index through the index cache corresponding to the arithmetic unit, then retrieves the source data through the addressing unit corresponding to the arithmetic unit, and finally returns the source data to the target partition's own internal cache. If the index is in the main storage area corresponding to the target partition, it is retrieved directly from that main storage area; otherwise, it is retrieved indirectly from the main storage area corresponding to another partition. Similarly, if the source data is in the main storage area of the target partition, it is retrieved directly from that main storage area; otherwise, it is retrieved indirectly from the main storage area corresponding to another partition. These other partitions may or may not be adjacent to the target partition.

[0087] Furthermore, regarding the method by which the addressing unit determines whether the source address of the source data corresponds to a specific partition, in one example, it can be determined whether the source address matches the main storage area by judging whether a predetermined field of the source address is a preset value corresponding to the main storage area. For example, after calculating the source address, the first addressing unit 231 judges whether the predetermined field of the source address is a preset value corresponding to the main storage area 251. If yes, the first addressing unit 231 judges that the source address matches the main storage area 251, in other words, the source address matches the first partition A; if no, the first addressing unit 231 judges that the source address matches the main storage area 252, in other words, the source address matches the second partition B.
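
The predetermined-field check described above can be illustrated as a bit-field comparison. The disclosure does not specify which bits form the predetermined field, so the field position (`PARTITION_SHIFT`), its width (`PARTITION_MASK`), and the preset values 0 and 1 in this sketch are assumptions chosen purely for illustration.

```python
PARTITION_SHIFT = 32   # hypothetical: the field starts at bit 32 of the source address
PARTITION_MASK = 0x1   # hypothetical: a 1-bit field distinguishes two main storage areas


def partition_field(source_address):
    """Extract the (assumed) predetermined field from a source address."""
    return (source_address >> PARTITION_SHIFT) & PARTITION_MASK


def matches_area(source_address, preset_value):
    """Return True if the field equals the preset value of a main storage area."""
    return partition_field(source_address) == preset_value
```

With these assumed values, an address whose field equals 0 would match main storage area 251, and any other address would be treated as matching main storage area 252.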

[0088] In another example, the addressing unit can determine that the source address matches the main memory area based on the pre-stored address matching relationship with the main memory area. For example, the first addressing unit 231 determines that the source address matches the main memory area 251 or 252 based on the pre-stored address matching relationship with the main memory area 251 or 252.
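
One way to realize a pre-stored address matching relationship is a sorted table of address ranges searched by binary search. The sketch below assumes such a table; the range boundaries, the area labels, and the names `AREA_TABLE`, `AREA_END`, and `match_area` are hypothetical and are not taken from the disclosure.

```python
from bisect import bisect_right

# Hypothetical pre-stored relationship: (start address, area label), sorted by start.
AREA_TABLE = [(0x0000, "area_251"), (0x8000, "area_252")]
AREA_END = 0x10000  # hypothetical end (exclusive) of the last area
_STARTS = [start for start, _ in AREA_TABLE]


def match_area(source_address):
    """Return the label of the main storage area containing source_address."""
    if not 0 <= source_address < AREA_END:
        raise ValueError(f"address {source_address:#x} is outside all areas")
    # bisect_right finds the first start greater than the address; the
    # entry just before it is the range that contains the address.
    return AREA_TABLE[bisect_right(_STARTS, source_address) - 1][1]
```

Unlike the fixed-field check, a table of this kind allows areas of unequal size and can be updated without changing the address layout.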

[0089] It should be understood that the internal transmission link, for example a NOC, is located between the arithmetic unit and the last-level cache, and can thus efficiently assist in processing data already cached in the last-level cache.

[0090] Figure 5 is a schematic structural diagram of an on-chip interconnect network system according to another embodiment of the present disclosure. The on-chip interconnect network system 500 of Figure 5 includes:

[0091] Multiple processing unit partitions 510 and an internal communication link 520 are provided, and the multiple processing unit partitions 510 are interconnected through the internal communication link 520. Each processing unit partition 510 may include at least one addressing unit, at least one arithmetic unit, and at least one index cache as described in the above embodiments. For example, each processing unit partition 510 includes at least one CPU core, or at least one SM. The on-chip interconnect network system 500 may also include main memory (not shown).

[0092] It should be understood that the internal communication link 520 can be implemented as a NOC. The internal communication link 520 can connect the addressing units and / or index caches of different processing unit partitions through communication, thereby realizing the connectivity of different processing unit partitions. The on-chip interconnect network system 500 achieves overall data processing efficiency through the cooperation and coordination of data access and computation processing between multiple processing unit partitions 510.

[0093] The on-chip interconnect network system 500 can access various main memory areas in main memory to load source data. Each main memory area in main memory can correspond to a processing unit partition 510. In one example, within each processing unit partition 510, each arithmetic unit can access the main memory area corresponding to that processing unit partition without going through the internal communication link 520. In another example, the internal communication link 520 is also used to communicate between different levels of cache within the same processing unit partition. In this case, different arithmetic units within the same processing unit partition can obtain source data from the main memory area corresponding to that processing unit partition via the internal communication link 520.

[0094] Figure 6 is a schematic flowchart of an indirect storage copying method according to another embodiment of the present disclosure. The indirect storage copying method of Figure 6 is used for a processing unit. The processing unit includes an arithmetic unit, an addressing unit, and an index cache. The addressing unit and index cache are located between the main memory area outside the processing unit and the arithmetic unit. The arithmetic unit has an internal cache. The method of Figure 6 includes:

[0095] S610: Through the arithmetic unit, execute the indirect storage copy instruction. The indirect storage copy instruction has at least a base address, an index address, and a destination address, so as to send the index address to the index cache and send the base address and destination address to the addressing unit.

[0096] S620: Through the index cache, the corresponding index is loaded from the main storage area according to the index address and sent to the addressing unit.

[0097] S630: Using the addressing unit, the source address of the main storage area corresponding to the source data is determined based on the base address and index.

[0098] S640: Using the addressing unit, the source data is loaded from the main storage area according to the source address and sent to the destination address of the internal cache.
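
Steps S610 to S640 can be summarized in a short functional sketch. This is an illustrative software model under stated assumptions, not the claimed hardware: the function name `indirect_memory_copy`, the dictionaries standing in for the main storage area and the internal cache, and the optional `offset` and `step` parameters are hypothetical.

```python
def indirect_memory_copy(main_memory, internal_cache, base, index_address,
                         destination, offset=0, step=1):
    """Model one indirect memory copy; returns the computed source address.

    main_memory: dict mapping address -> value (holds indexes and source data).
    internal_cache: dict mapping destination address -> value.
    All names are hypothetical.
    """
    # S610: the arithmetic unit issues the instruction; its operands are
    # the arguments of this function.
    # S620: the index cache loads the index stored at the index address.
    index = main_memory[index_address]
    # S630: the addressing unit derives the source address.
    source = base + offset + index * step
    # S640: the addressing unit loads the source data and writes it to the
    # destination address in the internal cache.
    internal_cache[destination] = main_memory[source]
    return source
```

In this model the caller never computes the source address or issues the two loads itself, which mirrors the work that the disclosure moves out of the arithmetic unit.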

[0099] In the embodiments of this disclosure, since the index cache and addressing unit are located outside the arithmetic unit, the arithmetic unit does not itself have to load the index from the index address, calculate the source address from the index, or load the source data from the source address, thus saving the instruction overhead and computational resources of the arithmetic unit itself. Furthermore, the index cache is used to retrieve the index, and the addressing unit is used to perform the calculation of the source address, achieving a separation of computational and storage functions. This ensures the storage efficiency of the index cache and the computational efficiency of the addressing unit. The addressing unit calculates the source address and then loads the source data; this continuity of data processing ensures the efficiency of source data retrieval.

[0100] In other examples, the indirect storage copy instruction also has an addressing operand for calculating the source address. The method further includes sending the addressing operand to the addressing unit via the arithmetic unit. Determining the source address of the main memory area corresponding to the source data based on the base address and the index includes determining the source address based on the base address, the index, and the addressing operand.

[0101] In other examples, the addressing operand includes at least one of an offset and a step size.

[0102] In other examples, determining the source address based on the base address, the index, and the addressing operand includes determining the source address based on one of the following: source address = base address + index; source address = base address + index × step size; source address = base address + offset + index; source address = base address + offset + index × step size.

[0103] In other examples, the method further includes: loading a corresponding index from the main storage area through the index cache, storing the index in correspondence with the index address; searching for the stored index and index address; if the corresponding index is found, sending it to the addressing unit; if not found, loading the corresponding index from the main storage area according to the index address.

[0104] In other examples, loading source data from the main storage area based on the source address includes: determining the main storage area that matches the source address, and loading the source data from the main storage area according to the correspondence between the main storage area and the arithmetic unit.

[0105] In other examples, the processing unit includes a first partition and a second partition, the first partition including a first arithmetic unit and a first addressing unit. Determining the main storage area matching the source address includes: the first addressing unit determining that the main storage area corresponds to the first partition. Loading the source data from the main storage area according to the correspondence between the main storage area and the arithmetic unit includes: loading the source data from the main storage area via the last-level cache corresponding to the first partition, so as to send the source data to the destination address in the first arithmetic unit.

[0106] In other examples, the second partition includes a second processing unit and a second addressing unit, the first addressing unit and the second addressing unit being communicatively connected via an internal transmission link. Determining the main storage area matching the source address includes: the first addressing unit determining that the main storage area corresponds to the second partition. Loading the source data from the main storage area based on the correspondence between the main storage area and the processing unit includes: sending a read request to the second addressing unit via the internal transmission link, wherein the read request instructs the second addressing unit to load the source data from the main storage area, and returning the source data to the first addressing unit via the internal transmission link to send the source data to the destination address in the first processing unit.

[0107] In other examples, determining the source address of the main storage area corresponding to the source data includes: determining whether the source address matches the main storage area by judging whether a predetermined field of the source address is a preset value corresponding to the main storage area; or, determining that the source address matches the main storage area based on the pre-stored address matching relationship with the main storage area.

[0108] It should be noted that, depending on the implementation needs, the various components / steps described in the embodiments of this disclosure can be broken down into more components / steps, or two or more components / steps or parts of the operation of components / steps can be combined into new components / steps to achieve the purpose of the embodiments of this disclosure.

[0109] The methods described above according to embodiments of this disclosure can be implemented in hardware, firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code downloaded over a network that is originally stored in a remote recording medium or a non-transitory machine-readable medium and will be stored in a local recording medium. Thus, the methods described herein can be processed by software stored on a recording medium using a general-purpose computer, a dedicated processing unit, or programmable or dedicated hardware (such as an ASIC or FPGA). It is understood that the computer, processing unit, microprocessor unit controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) capable of storing or receiving software or computer code that, when accessed and executed by the computer, processing unit, or hardware, implements the methods described herein. Furthermore, when a general-purpose computer accesses code used to implement the methods shown herein, the execution of the code transforms the general-purpose computer into a dedicated computer for executing the methods shown herein.

[0110] Those skilled in the art will recognize that the units and method steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments disclosed herein.

[0111] The above embodiments are only used to illustrate the embodiments of this disclosure, and are not intended to limit the embodiments of this disclosure. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of this disclosure. Therefore, all equivalent technical solutions also fall within the scope of the embodiments of this disclosure, and the patent protection scope of the embodiments of this disclosure should be defined by the claims.

Claims

1. A processing unit, comprising an arithmetic unit, an addressing unit, and an index cache, wherein the addressing unit and the index cache are located between a main memory area outside the processing unit and the arithmetic unit, and the arithmetic unit has an internal cache, wherein, The arithmetic unit executes an indirect storage copy instruction, which has at least a base address, an index address, and a destination address, to send the index address to the index cache and send the base address and the destination address to the addressing unit. The index cache loads the corresponding index from the main storage area according to the index address and sends it to the addressing unit; The addressing unit determines the source address of the main storage area corresponding to the source data based on the base address and the index; The addressing unit loads the source data from the main storage area according to the source address and sends it to the destination address in the internal cache.

2. The processing unit according to claim 1, wherein, The indirect storage copy instruction also has an addressing operand for source address calculation; The arithmetic unit also sends the addressing operand to the addressing unit; The addressing unit determines the source address based on the base address, the index, and the addressing operand.

3. The processing unit according to claim 2, wherein, The addressing operands include at least one of offset and step size.

4. The processing unit according to claim 3, wherein, The addressing unit determines the source address based on one of the following: Source address = base address + index; Source address = base address + index × step size; Source address = base address + offset + index; Source address = base address + offset + index × step size.

5. The processing unit according to claim 1, wherein: the index cache stores each index loaded from the main storage area together with its index address; and after receiving the index address, the index cache searches the stored indexes and index addresses, sends the corresponding index to the addressing unit if it is found, and loads the corresponding index from the main storage area according to the index address if it is not found.
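By way of non-limiting illustration, the lookup-then-load behaviour of the index cache in claim 5 can be sketched as below. The class name `IndexCache`, the dictionary used for storage, and the `misses` counter are assumptions introduced for the example; the disclosure does not specify a capacity or replacement policy, so none is modelled here.

```python
class IndexCache:
    """Illustrative model of an index cache keyed by index address."""

    def __init__(self, main_storage):
        self.main_storage = main_storage  # hypothetical dict: address -> value
        self.entries = {}                 # index address -> cached index
        self.misses = 0                   # counts loads from the main storage area

    def lookup(self, index_address):
        # Hit: the index was stored earlier together with its index address.
        if index_address in self.entries:
            return self.entries[index_address]
        # Miss: load the index from the main storage area and remember it.
        self.misses += 1
        index = self.main_storage[index_address]
        self.entries[index_address] = index
        return index


# Usage: the second lookup of the same index address is served from the cache.
cache = IndexCache({0x10: 7})
first = cache.lookup(0x10)
second = cache.lookup(0x10)
```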

6. The processing unit according to claim 1, wherein the addressing unit determines the main storage area that matches the source address, and loads the source data from that main storage area according to the correspondence between the main storage area and the arithmetic unit.

7. The processing unit according to claim 6, wherein: the processing unit includes a first partition and a second partition; the first partition includes a first arithmetic unit and a first addressing unit; and the first addressing unit determines that the main storage area corresponds to the first partition, loads the source data from the main storage area via the last-level cache corresponding to the first partition, and sends the source data to the destination address in the first arithmetic unit.

8. The processing unit according to claim 7, wherein: the second partition includes a second arithmetic unit and a second addressing unit, and the first addressing unit and the second addressing unit are communicatively connected via an internal transmission link; the first addressing unit determines that the main storage area corresponds to the second partition, and sends a read request to the second addressing unit via the internal transmission link, wherein the read request instructs the second addressing unit to load the source data from the main storage area and return the source data to the first addressing unit via the internal transmission link; and the first addressing unit sends the source data to the destination address in the first arithmetic unit.
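By way of non-limiting illustration, the local and cross-partition load paths of claims 7 and 8 can be modelled as two cooperating objects. The class name `AddressingUnit`, the `peer` attribute standing in for the internal transmission link, and the per-partition dictionaries are hypothetical; the last-level cache is not modelled.

```python
class AddressingUnit:
    """Illustrative model of one partition's addressing unit."""

    def __init__(self, local_storage):
        self.local_storage = local_storage  # storage area tied to this partition
        self.peer = None                    # other partition's addressing unit
        self.remote_requests = 0            # read requests sent over the link

    def serve_read(self, address):
        # Handle a read request arriving from the peer addressing unit.
        return self.local_storage[address]

    def load(self, address):
        # Local case (claim 7): the storage area corresponds to this partition.
        if address in self.local_storage:
            return self.local_storage[address]
        # Remote case (claim 8): ask the peer to load and return the data.
        self.remote_requests += 1
        return self.peer.serve_read(address)


# Usage: address 0x10 is local to the first partition, 0x20 to the second.
first_unit = AddressingUnit({0x10: "a"})
second_unit = AddressingUnit({0x20: "b"})
first_unit.peer, second_unit.peer = second_unit, first_unit
local_data = first_unit.load(0x10)
remote_data = first_unit.load(0x20)
```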

9. The processing unit according to claim 6, wherein: the addressing unit determines whether the source address matches the main storage area by determining whether a predetermined field of the source address is a preset value corresponding to the main storage area; or the addressing unit determines that the source address matches the main storage area based on a pre-stored address matching relationship with the main storage area.
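By way of non-limiting illustration, the two matching alternatives of claim 9 can be sketched as follows. The field position (a 4-bit field starting at bit 28), the range table layout, and the function names `matches_by_field` and `match_by_table` are assumptions made for the example and are not specified in the disclosure.

```python
def matches_by_field(source_address, preset_value, shift=28, width=4):
    """First alternative: compare a predetermined address field with a preset value.

    The field position (shift) and size (width) are hypothetical.
    """
    field = (source_address >> shift) & ((1 << width) - 1)
    return field == preset_value


def match_by_table(source_address, ranges):
    """Second alternative: consult a pre-stored address matching relationship.

    ranges maps a storage area name to a half-open (start, end) address range.
    Returns the matching name, or None if no range contains the address.
    """
    for area, (start, end) in ranges.items():
        if start <= source_address < end:
            return area
    return None


# Usage with hypothetical values.
in_area_2 = matches_by_field(0x20000040, 0x2)
table = {"A": (0x100, 0x200), "B": (0x200, 0x300)}
area = match_by_table(0x150, table)
```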

10. A computing device, comprising: the processing unit according to any one of claims 1-9; and the main storage area.

11. An on-chip interconnect network system, comprising a processing unit according to any one of claims 1-9.

12. An indirect storage copy method for a processing unit, the processing unit comprising an arithmetic unit, an addressing unit, and an index cache, wherein the addressing unit and the index cache are located between the arithmetic unit and a main storage area outside the processing unit, and the arithmetic unit has an internal cache, the method comprising: executing, by the arithmetic unit, an indirect storage copy instruction having at least a base address, an index address, and a destination address, so as to send the index address to the index cache and send the base address and the destination address to the addressing unit; loading, by the index cache, a corresponding index from the main storage area according to the index address and sending the index to the addressing unit; determining, by the addressing unit and based on the base address and the index, a source address in the main storage area corresponding to source data; and loading, by the addressing unit, the source data from the main storage area according to the source address and sending the source data to the destination address in the internal cache.