Hardware-based micro-scaling methods, systems, computer devices, readable storage media, and program products

By incorporating a micro-scaling module into the tensor memory acceleration unit, a hardware-based data processing flow is achieved, which solves the computational pressure on the vector core and improves the computational efficiency of the artificial intelligence chip.

CN122243717A · Pending · Publication Date: 2026-06-19 · SHANGHAI BIREN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI BIREN TECH CO LTD
Filing Date
2026-03-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, when AI chips execute the Micro Scaling mechanism, the vector core must simultaneously undertake scaling factor calculation and quantization conversion tasks, occupying a large number of instruction cycles and register resources, resulting in a decrease in computational efficiency.

Method used

The tensor memory acceleration unit incorporates a micro-scaling module, which uses hardware circuitry to acquire data block addresses, scale and quantize data, and convert formats. This automatically performs data analysis, scaling factor calculation, and quantization conversion, thereby releasing the computing power of the vector core.

Benefits of technology

It significantly improves the computing efficiency of artificial intelligence chips, reduces reliance on vector cores, and increases the speed and efficiency of data processing.


Abstract

This application relates to a hardware-based microscaling method, system, computer device, computer-readable storage medium, and computer program product. The method includes: receiving a microscaling instruction sent by an instruction scheduling unit, the microscaling instruction carrying configuration information that includes the address information of the original data block, target quantization format information, and global memory address information of the target data block; parsing the configuration information from the microscaling instruction; and using the hardware circuit structure built into the microscaling module to obtain the original data block from its address information, scale and quantize the original data block based on the target quantization format information, and write the resulting target data block and the scaling factor, in association with each other, into global memory based on the global memory address information. This method frees up the computing power of the vector core and improves the computational efficiency of the artificial intelligence chip.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a hardware-based microscaling method, system, computer device, readable storage medium, and program product.

Background Technology

[0002] In the field of artificial intelligence training, AI chips have become the core hardware support due to their powerful parallel computing capabilities. To further improve training efficiency, the industry has introduced ultra-low bit-width floating-point formats such as FP8 (8-bit floating point) and FP4 (4-bit floating point), which accelerate training by reducing data storage and transmission overhead. However, these formats have a limited dynamic range and require a Micro Scaling mechanism for data scaling to keep model training stable.

[0003] In related technologies, the Micro Scaling mechanism is implemented in software by the Vector Core of the artificial intelligence chip. Specifically, the Vector Core traverses the data block (such as a matrix row or column) to find the maximum value and calculates the scaling factor from that value and the range that the target low-precision format can represent; it then uses the scaling factor to multiply and format-convert the data block, and finally writes the quantized data and the scaling factor back to memory separately.

[0004] As a result, the vector core must perform both the scaling factor calculation and the quantization conversion, which occupy a large number of instruction cycles and register resources. This severely squeezes the computing power of the vector core and thus reduces the computing efficiency of the artificial intelligence chip.

Summary of the Invention

[0005] Therefore, it is necessary to provide a hardware-based microscaling method, system, computer device, readable storage medium, and program product that can improve the computational efficiency of artificial intelligence chips, addressing the aforementioned technical problems.

[0006] In a first aspect, this application provides a hardware-based microscaling method applied to a tensor memory acceleration unit, wherein the tensor memory acceleration unit has a built-in microscaling module, and the method includes:

[0007] The system receives a micro-scaling instruction sent by the instruction scheduling unit. The micro-scaling instruction carries configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block.

[0008] The configuration information is parsed from the micro-scaling instruction. The hardware circuit structure built into the micro-scaling module is used to obtain the original data block from the address information of the original data block, scale and quantize the original data block based on the target quantization format information, and write the scaled and quantized target data block and scaling factor into global memory based on the global memory address information.

[0009] In one embodiment, the step of using the hardware circuit structure built into the micro-scaling module to obtain the original data block from the address information of the original data block, and scaling and quantizing the original data block based on the target quantization format information includes:

[0010] The original data block is obtained based on its address information, and the original data block is traversed based on the comparator tree hardware logic built into the micro-scaling module to obtain the numerical range of the original data block.

[0011] Based on the numerical range and the target quantization format information, a scaling factor is determined, and the array multiplier built into the micro scaling module performs a scaling operation on each data element in the original data block based on the scaling factor in parallel to obtain a scaled data block.

[0012] The scaled data block is converted into the target format indicated by the target quantization format information by the hardware format conversion circuit built into the micro-scaling module to obtain the target data block.

[0013] In one embodiment, the global memory address information includes first address information and second address information. The step of writing the scaled and quantized target data block and the scaling factor, in association with each other, into global memory based on the global memory address information includes:

[0014] Write the target data block to the data queue and write the scaling factor to the factor queue;

[0015] When the amount of data stored in the data queue meets the first transmission condition, the data stored in the data queue is packaged and written into global memory based on the first address information;

[0016] When the amount of data stored in the factor queue meets the second transmission condition, the data stored in the factor queue is packaged and written into global memory based on the second address information.

[0017] In one embodiment, the step of writing the scaled and quantized target data block and the scaling factor, in association with each other, into global memory based on the global memory address information further includes:

[0018] After the microscaling instruction is completed, if the data queue and / or the factor queue are not empty, the data stored in the data queue is packaged and written into global memory based on the first address information, and / or the data stored in the factor queue is packaged and written into global memory based on the second address information.

[0019] In a second aspect, this application also provides a hardware-based microscaling system, the system comprising an instruction scheduling unit and a tensor memory acceleration unit, wherein the tensor memory acceleration unit has a built-in microscaling module, wherein:

[0020] The instruction scheduling unit is used to send micro-scaling instructions to the tensor memory acceleration unit. The micro-scaling instructions carry configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block.

[0021] The micro-scaling module is used to receive the micro-scaling instruction, parse the configuration information from the micro-scaling instruction, and use a built-in hardware circuit structure to implement the operation of obtaining the original data block from the address information of the original data block, scaling and quantizing the original data block based on the target quantization format information, and writing the scaled and quantized target data block and scaling factor into global memory based on the global memory address information.

[0022] In one embodiment, the microscaling module includes a quantization submodule for:

[0023] The original data block is obtained based on its address information, and the original data block is traversed based on the built-in comparator tree hardware logic to obtain the numerical range of the original data block.

[0024] Based on the numerical range and the target quantization format information, a scaling factor is determined, and a scaling operation on each data element in the original data block is performed in parallel using a built-in array multiplier based on the scaling factor to obtain a scaled data block.

[0025] The scaled data block is converted into the target format indicated by the target quantization format information by the built-in hardware format conversion circuit to obtain the target data block.

[0026] In one embodiment, the global memory address information includes first address information and second address information, and the micro-scaling module further includes a splicing submodule for:

[0027] Write the target data block to the data queue and write the scaling factor to the factor queue;

[0028] When the amount of data stored in the data queue meets the first transmission condition, the data stored in the data queue is packaged and written into global memory based on the first address information;

[0029] When the amount of data stored in the factor queue meets the second transmission condition, the data stored in the factor queue is packaged and written into global memory based on the second address information.

[0030] In a third aspect, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the hardware microscaling method described above.

[0031] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements any of the above-mentioned hardware microscaling methods.

[0032] In a fifth aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, implements any of the above-mentioned hardware microscaling methods.

[0033] The aforementioned hardware-based microscaling method, system, computer device, readable storage medium, and program product are applied to a tensor memory acceleration unit. The tensor memory acceleration unit has a built-in microscaling module that receives microscaling instructions sent by an instruction scheduling unit. The microscaling instructions carry configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block. The configuration information is parsed from the microscaling instructions. The hardware circuit structure built into the microscaling module is used to implement the operations of obtaining the original data block from the address information of the original data block, scaling and quantizing the original data block based on the target quantization format information, and writing the scaled and quantized target data block and scaling factor into global memory based on the global memory address information. By employing the hardware-based microscaling method, system, computer device, readable storage medium, and program product provided in the embodiments of this application, the data processing flow of microscaling is reconstructed. Through the microscaling module built into the tensor memory acceleration unit, data analysis, scaling factor calculation, and quantization conversion are automatically completed at the hardware level without any other external intervention. This makes the entire MicroScaling process independent of the vector core, freeing up the computing power of the vector core and thus significantly improving the computing efficiency of the artificial intelligence chip.

Attached Figure Description

[0034] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0035] Figure 1 is a flowchart illustrating a hardware-based microscaling method in one embodiment;

[0036] Figure 2 is a schematic diagram of the GPGPU structure in one embodiment;

[0037] Figure 3 is a flowchart illustrating step 104 in one embodiment;

[0038] Figure 4 is a flowchart illustrating step 104 in another embodiment;

[0039] Figure 5 is a block diagram of a hardware-based microscaling system in one embodiment;

[0040] Figure 6 is a schematic diagram of the hardware-based microscaling system in another embodiment;

[0041] Figure 7 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0042] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0043] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish the first element from the second element. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and / or" used in this application refers to one of the embodiments, or any combination of multiple embodiments.

[0044] As shown in Figure 1, a hardware-based microscaling method is provided for use in artificial intelligence chips. In this embodiment, the artificial intelligence chip can be any one of a GPU (Graphics Processing Unit), TPU (Tensor Processing Unit), NPU (Neural Network Processing Unit), DPU (Deep Learning Processing Unit), APU (Accelerated Processing Unit), and GPGPU (General-Purpose Graphics Processing Unit). This embodiment does not limit the specific type of chip, and the following description uses a GPGPU as an example.

[0045] Figure 2 shows a schematic of a GPGPU. A GPGPU is essentially an array of Streaming Processor Clusters (SPCs), for example the streaming processor clusters 1, ..., M shown in Figure 2, where M is a positive integer greater than 1. In a graphics processing unit (GPU), one streaming processor cluster processes one computational task, or multiple streaming processor clusters process one computational task. Multiple streaming processor clusters share data through a global cache or global memory.

[0046] As shown in Figure 2, taking streaming processor cluster 1 as an example, one streaming processor cluster includes multiple computing units, such as Computing Unit 1, Computing Unit 2, ..., Computing Unit N in Figure 2, where N is a positive integer. Each Computing Unit (CU) performs arithmetic and logical operations other than matrix calculations such as matrix multiplication and convolution, including accumulation, reduction, and ordinary addition, subtraction, multiplication, and division. A computing unit contains multiple cores (also called computing cores), each including an Arithmetic Logic Unit (ALU), a floating-point unit, etc., which execute specific computational tasks. Furthermore, the computing unit also includes registers (e.g., the register file in Figure 2) and a shared cache, which store source and destination data related to computing tasks in a hierarchical manner. The shared cache in a computing unit is used to share data between the cores of that computing unit.

[0047] In parallel computing, computational tasks are typically executed by multiple threads. Before execution in a general-purpose graphics processor (or parallel computing processor), these threads are divided into multiple thread blocks, which are then distributed to the computing units by a thread block distribution module (not shown in Figure 2). All threads in a thread block must be assigned to the same computing unit for execution. At the same time, each thread block is split into minimum execution units called thread bundles (or warps), each containing a fixed number of threads (or fewer), for example 32 threads. Multiple thread blocks can execute in the same computing unit or in different computing units.

[0048] In each computing unit, the thread bundle scheduling / distribution module (not shown in Figure 2) schedules and allocates thread bundles so that the multiple computing cores within the computing unit can run them. Depending on the number of computing cores in the computing unit, the thread bundles within a thread block can be executed concurrently or in a time-sharing manner. All threads within a thread bundle execute the same instruction. Memory access instructions are issued to the shared cache within the computing unit, or further to intermediate-level caches, global caches, or global memory, for read and write operations.

[0049] As shown in Figure 2, the streaming processor cluster 1 also includes a tensor operation unit, which is used to perform tensor calculations, such as matrix multiplication, convolution operations, etc.

[0050] Referring to Figure 1, this embodiment provides a hardware-based microscaling method applied to a tensor memory acceleration unit that has a built-in microscaling module. The method may include the following steps 102 to 104:

[0051] Step 102: Receive the micro-scaling instruction sent by the instruction scheduling unit. The micro-scaling instruction carries configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block.

[0052] In this embodiment of the application, the instruction scheduling unit is the initiator of the micro-scaling instruction. After receiving the micro-scaling instruction issued by the instruction scheduling unit, the tensor memory acceleration unit responds to the micro-scaling instruction by starting to perform operations such as data analysis, scaling factor calculation and quantization conversion.

[0053] The micro-scaling instructions are batch-level task instructions, rather than scattered instructions targeting individual data blocks. They carry configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block. For example, the address information of the original data block may include storage level identifiers, such as the GSM (Global Shared Memory) base address, fixed offset per data block, total number of data blocks, and data block granularity. For instance, the address information of the original data block may include a GSM base address of 0x700000, a fixed offset per data block of 64 bytes, a total number of data blocks of 100, and a data block granularity of matrix rows. The target quantization format information indicates the format type of low-precision quantization, such as FP8 E4M3, FP8 E5M2, FP4 E2M1, FP4 E1M2, INT8, INT4, etc. The global memory address information of the target data block is the storage configuration information in HBM (High Bandwidth Memory), which may include information such as the quantization data base address and the scaling factor reserved area base address. For example, the quantization data base address is 0x800000 and the scaling factor base address is 0x900000, which is used to specify the storage location of the quantized data block and the scaling factor.
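The configuration information described above can be pictured as a small record. The following Python sketch is illustrative only: the class name and all field names are hypothetical and do not come from the application, and the real instruction encoding is not specified in the text. The values are the example values given in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroScalingConfig:
    """Hypothetical container for the fields a micro-scaling instruction carries."""
    src_base: int          # base address of the original data blocks (e.g. in GSM)
    block_offset: int      # fixed offset per data block, in bytes
    num_blocks: int        # total number of data blocks in this batch
    granularity: str       # data block granularity, e.g. one matrix row per block
    target_format: str     # low-precision target format, e.g. "FP8_E4M3"
    quant_data_base: int   # base address for quantized data in global memory (HBM)
    scale_base: int        # base address of the area reserved for scaling factors


# Example values taken from the description above.
cfg = MicroScalingConfig(
    src_base=0x700000,
    block_offset=64,
    num_blocks=100,
    granularity="matrix_row",
    target_format="FP8_E4M3",
    quant_data_base=0x800000,
    scale_base=0x900000,
)
```

A single record of this kind covers a whole batch of data blocks, which matches the statement that the instruction is a batch-level task instruction rather than one instruction per block.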

[0054] It should be noted that the embodiments of this application do not specifically limit parameters such as the total number of data blocks, the offset of a single data block, or the specific quantization format. Those skilled in the art can make dynamic adjustments based on the tensor dimension of the AI (Artificial Intelligence) training task, the bandwidth characteristics of the HBM bus, and the hardware computing accuracy requirements.

[0055] Step 104: Parse the configuration information from the micro-scaling instruction, and use the hardware circuit structure built into the micro-scaling module to obtain the original data block from the address information of the original data block, scale and quantize the original data block based on the target quantization format information, and write the scaled and quantized target data block and scaling factor into global memory based on the global memory address information.

[0056] In this embodiment, after receiving the micro-scaling instruction from the instruction scheduling unit, the micro-scaling module parses the configuration information in the micro-scaling instruction through its built-in hardware circuit structure, and calculates the specific address of each original data block in real time through hardware logic based on the parsed address information of the original data blocks. For example, the specific address of the original data block = the base address of the original data block + the data block number × the offset of a single data block, such as the address of the 0th data block = 0x700000 + 0 × 64 = 0x700000, and the address of the 1st data block = 0x700000 + 1 × 64 = 0x700040.
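The address formula above can be checked with a few lines of Python. This is only a software sketch of arithmetic that the application performs in hardware logic; the function name `block_address` is hypothetical.

```python
def block_address(base: int, index: int, offset: int) -> int:
    """Address of a data block = base address + block number x per-block offset."""
    return base + index * offset


# Example values from the description: base 0x700000, 64 bytes per block.
addresses = [hex(block_address(0x700000, i, 64)) for i in range(3)]
```

With these values, block 0 lands at 0x700000 and block 1 at 0x700040, as stated in the paragraph above.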

[0057] The micro-scaling module can read high-precision raw data blocks one by one or in batches from a specified storage level through the hardware data channel. Based on the target quantization format information in the configuration information, it scales and quantizes the raw data blocks to obtain the corresponding scaling factor and the low-precision target data block. Finally, based on the global memory address information of the target data block, the target data block and the scaling factor are associated and written to global memory. That is, during the data writing process, the hardware automatically records the address mapping relationship between each scaling factor and the corresponding target data block to facilitate fast reading by subsequent computing units.

[0058] The aforementioned hardware-based microscaling method is applied to a tensor memory acceleration unit. This unit has a built-in microscaling module that receives microscaling instructions from an instruction scheduling unit. These instructions carry configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block. The configuration information is parsed from the microscaling instructions. The hardware circuit structure built into the microscaling module is used to retrieve the original data block from its address information, scale and quantize the original data block based on the target quantization format information, and write the scaled and quantized target data block and scaling factor to global memory based on the global memory address information. By employing the hardware-based microscaling method provided in this application, the data processing flow for microscaling is reconstructed. Through the microscaling module built into the tensor memory acceleration unit, data analysis, scaling factor calculation, and quantization conversion are automatically completed at the hardware level without external intervention. This makes the entire MicroScaling process independent of the vector core, freeing up the vector core's computing power and significantly improving the computational efficiency of the AI chip.

[0059] In one exemplary embodiment, as shown in Figure 3, using the hardware circuit structure built into the micro-scaling module in step 104 to obtain the original data block from the address information of the original data block, and to scale and quantize the original data block based on the target quantization format information, may include the following steps 302 to 306:

[0060] Step 302: Obtain the original data block based on the address information of the original data block, and traverse the original data block based on the comparator tree hardware logic built into the micro-scaling module to obtain the numerical range of the original data block.

[0061] Step 304: Determine the scaling factor based on the numerical range and target quantization format information, and perform scaling operations on each data element in the original data block in parallel using the array multiplier built into the micro-scaling module based on the scaling factor to obtain the scaled data block.

[0062] Step 306: The scaled data block is converted into the target format indicated by the target quantization format information by the hardware format conversion circuit built into the micro-scaling module to obtain the target data block.

[0063] In this embodiment, the micro-scaling module incorporates a quantization submodule. Based on the address information of the original data block, the quantization submodule calculates the specific physical address of the data block to be processed in real time using hardware logic. The original data block, such as high-precision tensor data in FP16 / FP32 format, is then read from global memory via a dedicated hardware data channel. After the original data block is obtained, its data elements are traversed by the comparator tree built into the quantization submodule. During the traversal, the comparator tree tracks the maximum and minimum values of the data elements in real time and finally outputs the numerical range of the original data block.

[0064] The comparator tree is a hardware structure composed of multiple levels of parallel comparators, employing hierarchical parallel comparison. For example, for a raw data block containing 1024 data elements, the comparator tree performs 10 levels of parallel comparison, with each comparator in each level processing two data elements simultaneously. This allows for the traversal of all data elements to be completed in just 10 hardware clock cycles, resulting in significantly higher execution efficiency than the serial software traversal of the vector core.
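The level count in this example can be reproduced with a software model of a pairwise reduction. The Python sketch below is an illustration, not the circuit itself: the function name is hypothetical, each loop iteration stands in for one level of parallel comparators, and the handling of an odd element count (carrying the last element forward) is an assumption not stated in the application.

```python
def comparator_tree_range(values):
    """Model a comparator tree: return (minimum, maximum, number of levels)."""
    if not values:
        raise ValueError("data block must not be empty")
    lows, highs = list(values), list(values)
    levels = 0
    while len(highs) > 1:
        next_lows, next_highs = [], []
        # One level: every comparator handles one pair of inputs "in parallel".
        for i in range(0, len(highs) - 1, 2):
            next_lows.append(min(lows[i], lows[i + 1]))
            next_highs.append(max(highs[i], highs[i + 1]))
        if len(highs) % 2:  # assumption: an unpaired element passes through
            next_lows.append(lows[-1])
            next_highs.append(highs[-1])
        lows, highs = next_lows, next_highs
        levels += 1
    return lows[0], highs[0], levels


block = [float(i - 512) for i in range(1024)]
lo, hi, levels = comparator_tree_range(block)
```

For the 1024-element block, the model needs 10 levels, matching the 10 clock cycles in the example above if each level takes one cycle.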

[0065] After determining the numerical range of the original data block, based on this numerical range (e.g., [-127.8, 126.3]) and the representable range of the target format indicated by the target quantization format information (e.g., the representable range of FP8 E4M3 format [-448, 448]), the scaling factor is calculated in real time by hardware logic. The calculation logic is: scaling factor = absolute maximum value of the original data block ÷ maximum representable value of the target quantization format (e.g., 127.8 ÷ 448 ≈ 0.285). This ensures that the data elements in the original data block will not exceed the representable range of the target format after scaling, while preserving the dynamic range of the data.
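The calculation above can be sketched in Python as follows. The function and table names are hypothetical; the table holds only the FP8 E4M3 maximum of 448 quoted in the text, and returning 1.0 for an all-zero block is an assumption added here to avoid a zero scaling factor, since the application does not describe that case.

```python
# Maximum representable magnitude per target format (only the value quoted above).
FORMAT_MAX = {"FP8_E4M3": 448.0}


def scaling_factor(range_min: float, range_max: float, target_format: str) -> float:
    """Scaling factor = absolute maximum of the block / maximum of the target format."""
    abs_max = max(abs(range_min), abs(range_max))
    if abs_max == 0.0:
        return 1.0  # assumption: all-zero block, any non-zero factor works
    return abs_max / FORMAT_MAX[target_format]


scale = scaling_factor(-127.8, 126.3, "FP8_E4M3")
```

For the range [-127.8, 126.3], `scale` comes out at roughly 0.285, the value given in the paragraph above.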

[0066] The scaling factor can be stored in a high-precision format (such as FP32) to avoid precision loss during quantization. Subsequently, the array-type multipliers built into the quantization submodule receive all data elements of the original data block and the calculated scaling factor. A hardware-based module calculates the reciprocal of the scaling factor, and through a hardware-level broadcast mechanism, synchronously distributes the reciprocal of the scaling factor to each group of multipliers. This completes the multiplication of all data elements with the reciprocal of the scaling factor, thereby scaling each data element in the original data block to obtain the scaled data block.
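A minimal Python sketch of this scaling step is given below. The list comprehension stands in for the array of multipliers, and the single `reciprocal` value stands in for the broadcast; the function name is hypothetical.

```python
def scale_block(block, scale):
    """Multiply every element by the reciprocal of the scaling factor."""
    reciprocal = 1.0 / scale                 # computed once, then "broadcast"
    return [x * reciprocal for x in block]   # one multiplication per element


scale = 127.8 / 448.0                        # factor from the FP8 E4M3 example
scaled = scale_block([-127.8, 126.3, 0.0], scale)
```

Computing the reciprocal once and multiplying is what lets every element be handled by a plain multiplier; after scaling, the element with the largest magnitude sits at the edge of the target range (about -448 here, up to floating-point rounding).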

[0067] Finally, the scaled data block is converted to the target format indicated by the target quantization format information through a hardware format conversion circuit, resulting in the target data block. This hardware format conversion circuit mainly comprises three sub-modules: a mantissa processing unit, a sign bit configuration unit, and an overflow protection unit. These sub-modules work collaboratively in a pipelined manner. The sign bit configuration unit allocates one sign bit to each element in the scaled data block according to the target quantization format information. The mantissa processing unit performs truncation or rounding on the mantissa portion of the scaled data block according to the number of mantissa bits in the target format (e.g., 3 mantissa bits for FP8 E4M3, 2 mantissa bits for E5M2), ensuring that the mantissa length meets the requirements of the target format. Simultaneously, the exponent portion of the scaled data is re-encoded according to the number of exponent bits in the target format (e.g., 4 exponent bits for FP8 E4M3, 5 exponent bits for E5M2) to adapt to the exponent representation range of the target format. When the scaled data, after processing, still exceeds the representable range of the target format, the overflow protection unit forces it to saturate to the maximum or minimum value of the target format, such as 448 or -448 for FP8 E4M3, to avoid calculation errors caused by data overflow. After the above hardware conversion process, the scaled data block is encoded into a low-precision data block in the target format (such as FP8, FP4, INT8, INT4), i.e., the target data block. This target data block corresponds one-to-one with the scaling factor and is then associated with and written to global memory.
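The mantissa rounding and overflow saturation described above can be modelled at the value level. The Python sketch below is not the circuit from the application: it assumes the commonly used FP8 E4M3 layout (3 mantissa bits, smallest normal exponent of -6, maximum 448), assumes round-to-nearest-even where the text allows either truncation or rounding, and returns the rounded value as a Python float instead of producing the 8-bit encoding. The function name is hypothetical.

```python
import math

E4M3_MAX = 448.0       # saturation limit quoted in the description
E4M3_MANT_BITS = 3     # mantissa bits of FP8 E4M3
E4M3_MIN_EXP = -6      # assumed smallest normal exponent (exponent bias of 7)


def quantize_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if math.isnan(x):
        return x
    sign = math.copysign(1.0, x)             # sign bit handling
    a = abs(x)
    if a >= E4M3_MAX:
        return sign * E4M3_MAX               # overflow protection: saturate
    if a == 0.0:
        return sign * 0.0
    _, e = math.frexp(a)                     # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)           # clamp so small values use subnormal steps
    step = 2.0 ** (exp - E4M3_MANT_BITS)     # spacing of representable values here
    q = round(a / step) * step               # Python's round() is half-to-even
    return sign * min(q, E4M3_MAX)


samples = [quantize_e4m3(v) for v in (1000.0, -1000.0, 300.0, 1.06, 0.0)]
```

Values beyond the representable range are clamped to 448 or -448, while in-range values snap to the nearest representable step, for example 300 to 288.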

[0068] It should be noted that the embodiments of this application do not limit the number of levels in the comparator tree or the number of parallel comparators. Those skilled in the art can make dynamic adjustments based on the number of elements in the original data block and the hardware clock frequency. As long as it can achieve fast parallel detection of the numerical range of the original data block, it can be applied to the embodiments of this application.

[0069] In one exemplary embodiment, as shown in Figure 4, the global memory address information includes first address information and second address information. In step 104, associating the scaled and quantized target data block with the scaling factor and writing them into global memory based on the global memory address information may include steps 402 to 406, wherein:

[0070] Step 402: Write the target data block to the data queue and write the scaling factor to the factor queue.

[0071] Step 404: When the amount of data stored in the data queue meets the first transmission condition, the data stored in the data queue is packaged and written into the global memory based on the first address information;

[0072] Step 406: When the amount of data stored in the factor queue meets the second transmission condition, the data stored in the factor queue is packaged and written into global memory based on the second address information.

[0073] In this embodiment, the micro-scaling module further includes a splicing submodule. After obtaining the target data block and scaling factor, the splicing submodule can write the target data block to a data queue and the scaling factor to a factor queue. The data queue and the factor queue are two independent FIFO (First In First Out) hardware cache queues within the splicing submodule, which can satisfy batch writing of cached data and guarantee the timing of the association between the target data block and the scaling factor, thereby ensuring that the association relationship is not disordered during subsequent writing to global memory.

[0074] Both queues mentioned above are hardware-level caches, supporting parallel read and write operations. While data is being written, the splicing submodule can use a built-in hardware counter to monitor the amount of cached data in the data queue and factor queue in real time. When the amount of cached data meets the transmission conditions, it can immediately trigger batch write logic. Alternatively, if the amount of cached data does not meet the transmission conditions, it will continue to cache subsequent data until the amount of cached data meets the transmission conditions or the task is completed, at which point a forced write operation will be performed.

[0075] When the data queue meets the first transmission condition, the data stored in the data queue is packaged and written to global memory based on the first address information. The first transmission condition is an efficient transmission threshold set to maximize HBM bus bandwidth utilization; its value can be determined by HBM hardware characteristics. The first address information is the storage base address of the target data block in HBM (e.g., 0x800000). When the factor queue meets the second transmission condition, the data stored in the factor queue is packaged and written to global memory based on the second address information. The second transmission condition is an efficient transmission threshold set for the scaling factor; its value can be determined by the factor size and HBM hardware characteristics. The second address information is the storage base address of the scaling factor in HBM.

[0076] In this embodiment, the quantized data and the scaling factor generated from it are aggregated within the tensor memory acceleration unit, at the source of the write. Both are buffered and then packaged and written to global memory according to a bandwidth-optimal strategy. This addresses, at its root, the bandwidth waste caused by writing quantized data and scaling factors separately, and effectively improves bandwidth utilization.

[0077] It should be noted that the embodiments of this application do not limit the specific thresholds corresponding to the first and second transmission conditions. The two transmission conditions can be configured independently and do not need to be the same. The thresholds can be dynamically adjusted according to the HBM bus bandwidth and the tensor dimensions of the AI training task. Any threshold that achieves efficient batch transmission can be applied to the embodiments of this application.

[0078] In an exemplary embodiment, associating the scaled and quantized target data block with the scaling factor and writing them into global memory based on the global memory address information further includes:

[0079] After the microscaling instruction is completed, if the data queue and/or the factor queue are not empty, the data stored in the data queue is packaged and written into global memory based on the first address information, and/or the data stored in the factor queue is packaged and written into global memory based on the second address information.

[0080] In this embodiment, once all the original data blocks corresponding to the micro-scaling instruction have been processed, no new data flows into the data queue or the factor queue. If either queue is not empty at that point, its contents can no longer grow to meet the first and/or second transmission condition, so the remaining data would never be packaged and written to global memory. For this reason, after all the original data blocks corresponding to the micro-scaling instruction have been quantized and converted, the data remaining in the data queue and/or factor queue is forcibly packaged and written to global memory based on the first and/or second address information. This ensures that no target data blocks or scaling factors remain in the queue cache, avoiding data loss and subsequent calculation errors caused by residual data.
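The two-queue caching, threshold-triggered batch writes, and final forced write-out described in steps 402 to 406 and in the paragraph above can be modelled with a small software sketch. It is a behavioural illustration only: the class name `SplicingSketch`, thresholds counted in queue entries, addresses advanced by entry count, and the factor base address 0x900000 are all assumptions made for the example (only 0x800000 appears in the text as a sample data base address).

```python
from collections import deque


class SplicingSketch:
    """Behavioural model of two independent FIFOs with batch write-out."""

    def __init__(self, data_base, factor_base, data_threshold, factor_threshold):
        self.queues = {"data": deque(), "factor": deque()}
        self.thresholds = {"data": data_threshold, "factor": factor_threshold}
        self.next_addr = {"data": data_base, "factor": factor_base}
        self.writes = []  # (queue name, address, payload) records standing in for HBM writes

    def push(self, target_block, scaling_factor):
        # Step 402: one target block and its scaling factor enter their queues together,
        # so the i-th block and the i-th factor stay in matching order.
        self.queues["data"].append(target_block)
        self.queues["factor"].append(scaling_factor)
        # Steps 404 and 406: each queue is checked against its own threshold.
        for name in ("data", "factor"):
            if len(self.queues[name]) >= self.thresholds[name]:
                self._write_out(name)

    def _write_out(self, name):
        queue = self.queues[name]
        if not queue:
            return
        payload = list(queue)
        queue.clear()
        self.writes.append((name, self.next_addr[name], payload))
        self.next_addr[name] += len(payload)  # simplification: one address unit per entry

    def flush(self):
        # Forced write-out after the instruction completes, regardless of thresholds.
        for name in ("data", "factor"):
            self._write_out(name)


splicer = SplicingSketch(0x800000, 0x900000, data_threshold=2, factor_threshold=4)
for i in range(3):
    splicer.push(f"block{i}", float(i + 1))
splicer.flush()
for record in splicer.writes:
    print(record)
```

With three blocks pushed, the data queue reaches its threshold once, while the factor queue never does; the final `flush()` call is what writes out the leftover block and all three factors.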

[0081] In one exemplary embodiment, as shown in Figure 5, an embodiment of this application also provides a hardware-based microscaling system 500, which includes an instruction scheduling unit 502 and a tensor memory acceleration unit 504. The tensor memory acceleration unit 504 has a built-in microscaling module 506, wherein:

[0082] The instruction scheduling unit 502 is used to send micro-scaling instructions to the tensor memory acceleration unit. The micro-scaling instructions carry configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block.

[0083] The micro-scaling module 506 is used to receive micro-scaling instructions and parse configuration information from them. It uses a built-in hardware circuit structure to obtain the original data block from the address information of the original data block, scale and quantize the original data block based on the target quantization format information, and write the scaled and quantized target data block and scaling factor into global memory based on the global memory address information.
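As an illustration of what "parsing configuration information from the instruction" could look like, the sketch below decodes a packed byte string into the three kinds of configuration named above. The text does not specify an instruction encoding, so everything here is hypothetical: the field order and widths in `LAYOUT`, the format codes in `FORMATS`, and the names `MicroScalingConfig` and `parse_instruction`.

```python
import struct
from dataclasses import dataclass

# Hypothetical format codes; the text only lists FP8, FP4, INT8 and INT4 as example targets.
FORMATS = {0: "FP8_E4M3", 1: "FP8_E5M2", 2: "FP4", 3: "INT8", 4: "INT4"}

# Hypothetical layout: three little-endian 64-bit addresses followed by a one-byte format code.
LAYOUT = struct.Struct("<QQQB")


@dataclass(frozen=True)
class MicroScalingConfig:
    source_addr: int     # address information of the original data block
    data_addr: int       # first address information (target data blocks)
    factor_addr: int     # second address information (scaling factors)
    target_format: str   # target quantization format information


def parse_instruction(raw: bytes) -> MicroScalingConfig:
    """Decode the assumed instruction payload into its configuration fields."""
    source_addr, data_addr, factor_addr, code = LAYOUT.unpack(raw)
    if code not in FORMATS:
        raise ValueError(f"unknown target format code: {code}")
    return MicroScalingConfig(source_addr, data_addr, factor_addr, FORMATS[code])


raw = LAYOUT.pack(0x1000, 0x800000, 0x900000, 0)
print(parse_instruction(raw))
```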

[0084] In this embodiment, the hardware-based microscaling system 500 mainly consists of an instruction scheduling unit 502 and a tensor memory acceleration unit 504. The instruction scheduling unit 502 is the instruction initiator of the hardware-based microscaling system 500, typically integrated into the main control module of an AI chip. It is responsible for generating and issuing batch-level microscaling instructions based on AI training task requirements, providing clear operational guidelines for the tensor memory acceleration unit 504, and ensuring the accuracy and efficiency of hardware-based processing.

[0085] The microscaling module 506 is a hardware functional module of the tensor memory acceleration unit 504, mainly responsible for operations such as instruction parsing, data acquisition, quantization calculation, and associative writing. Its built-in hardware circuit structure has no software code dependency and executes the entire process autonomously through hardware logic. It should be noted that, for the specific implementation of each functional module in the hardware-based microscaling system 500, reference can be made to the relevant descriptions in the foregoing embodiments, which are not repeated here.

[0086] The aforementioned hardware-based microscaling system includes an instruction scheduling unit and a tensor memory acceleration unit. The tensor memory acceleration unit has a built-in microscaling module. The instruction scheduling unit sends microscaling instructions to the tensor memory acceleration unit. The microscaling instructions carry configuration information, including the address information of the original data block, the target quantization format information, and the global memory address information of the target data block. The microscaling module receives the microscaling instructions, parses the configuration information from the microscaling instructions, and uses the hardware circuit structure built into the microscaling module to implement the operations of obtaining the original data block from the address information of the original data block, scaling and quantizing the original data block based on the target quantization format information, and writing the scaled and quantized target data block and scaling factor into global memory based on the global memory address information. The hardware-based microscaling system provided in this application reconstructs the data processing flow of microscaling. By using the microscaling module built into the tensor memory acceleration unit, data analysis, scaling factor calculation and quantization conversion are automatically completed at the hardware level without any other external intervention. This makes the entire Micro Scaling process independent of the vector core, freeing up the computing power of the vector core and thus significantly improving the computing efficiency of the artificial intelligence chip.

[0087] In one exemplary embodiment, as shown in Figure 5, the microscaling module 506 includes a quantization submodule 508, which is used for:

[0088] The original data block is obtained based on its address information, and the original data block is traversed based on the built-in comparator tree hardware logic to obtain the numerical range of the original data block.

[0089] The scaling factor is determined based on the numerical range and target quantization format information, and the scaling operation of each data element in the original data block is performed in parallel by the built-in array multiplier based on the scaling factor to obtain the scaled data block.

[0090] The scaled data block is converted into the target format indicated by the target quantization format information through the built-in hardware format conversion circuit to obtain the target data block.

[0091] In this embodiment, the micro-scaling module 506 further includes a quantization submodule 508. The quantization submodule 508 is primarily responsible for operations such as numerical range detection, scaling factor calculation, high-precision scaling, and format conversion. The specific implementation of the quantization submodule 508 can be found in the descriptions of the foregoing embodiments, and will not be repeated here.
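The comparator-tree range detection performed by the quantization submodule can be pictured as a pairwise maximum reduction. The sketch below is a sequential software model under stated assumptions: it treats the "numerical range" as the absolute maximum (as in the FP8 example of this application), each loop iteration stands for one tree level whose comparisons would run in parallel in hardware, and the function name `comparator_tree_absmax` is hypothetical.

```python
def comparator_tree_absmax(block):
    """Return (absolute maximum, number of tree levels) via pairwise reduction."""
    level = [abs(x) for x in block]
    if not level:
        raise ValueError("block must contain at least one element")
    depth = 0
    while len(level) > 1:
        # One tree level: compare neighbouring pairs and keep the larger value.
        reduced = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            reduced.append(level[-1])   # an unpaired element passes through to the next level
        level = reduced
        depth += 1
    return level[0], depth


absmax, levels = comparator_tree_absmax([1.0, -7.5, 3.0, 2.0, -0.5])
print(absmax, levels)
```

For a block of 32 elements the reduction needs 5 levels, which is why the number of levels and parallel comparators can be tuned to the block size and clock frequency.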

[0092] In one exemplary embodiment, as shown in Figure 5, the global memory address information includes first address information and second address information. The micro-scaling module 506 also includes a splicing submodule 510, which is used for:

[0093] Write the target data block to the data queue and the scaling factor to the factor queue;

[0094] When the amount of data stored in the data queue meets the first transmission condition, the data stored in the data queue is packaged and written into global memory based on the first address information;

[0095] When the amount of data stored in the factor queue meets the second transmission condition, the data stored in the factor queue is packaged and written into global memory based on the second address information.

[0096] In this embodiment, the microscaling module 506 further includes a splicing submodule 510. The splicing submodule 510 is a data caching and high-efficiency writing unit of the microscaling module 506. It incorporates two independent hardware FIFO queues and a batch transmission control circuit. Through hardware logic of cache batching, conditional triggering, and associated writing, it maximizes the HBM bus bandwidth utilization while ensuring the association between the target data block and the scaling factor. The specific implementation of the splicing submodule 510 can be referred to the relevant descriptions in the foregoing embodiments; further details are omitted here.

[0097] To enable those skilled in the art to better understand the embodiments of this application, the embodiments of this application are described below through specific examples.

[0098] As shown in Figure 6, in this embodiment the tensor memory acceleration unit of the artificial intelligence chip is extended with two tightly coupled key sub-modules: a quantization sub-module and a splicing sub-module. The quantization sub-module is the processing unit on the data write path of the tensor memory acceleration unit. It sits at the intermediate node where data flows from the computation unit to the HBM, and receives only raw high-precision data blocks. After a data block flows into the quantization sub-module, its built-in circuitry (such as a comparator tree for finding the absolute maximum value) traverses the data block to quickly determine its numerical range. Based on this range and the characteristics of the target FP8 format (such as E4M3), a high-precision scaling factor is calculated in real time. The original data block is then multiplied by the reciprocal of this scaling factor, completing the conversion from high precision to FP8 format. Finally, the converted FP8 data block and its single, uniquely corresponding high-precision scaling factor are output to the splicing sub-module.

[0099] The splicing submodule receives the output from the quantization submodule and acts as a buffer before data is written out. It sends the FP8 data blocks and corresponding scaling factors output from the quantization submodule into the internal data queue and factor queue for caching, respectively. When the amount of data in the FP8 data queue accumulates to a level sufficient for efficient utilization of the HBM bus bandwidth (e.g., forming one or more cache lines), the data is packaged and written in batches to global memory. Similarly, when the number of factors in the factor queue accumulates to a level sufficient for efficient transmission, these high-precision factors are packaged and written in batches to the designated reserved area of HBM. When the entire data processing instruction set has been executed, all data in the cache is forcibly written out to ensure that no target data blocks or scaling factors remain in the queue cache, thus avoiding data loss and subsequent calculation errors caused by residual data.

[0100] By introducing the above two sub-modules, the complex processes that were originally scattered in the vector core and software are integrated into the hardware acceleration pipeline of the tensor memory acceleration unit, which completely solves the two major problems of performance and bandwidth.

[0101] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.

[0102] In one exemplary embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 7. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, Near Field Communication (NFC), or other technologies. When the computer program is executed by the processor, it implements a hardware-based micro-scaling method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen; buttons, a trackball, or a touchpad set on the casing of the computer device; or an external keyboard, touchpad, or mouse.

[0103] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0104] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0105] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0106] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0107] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0108] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0109] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0110] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A hardware-based microscaling method, characterized in that the method is applied to a tensor memory acceleration unit having a built-in micro-scaling module, and the method includes: receiving a micro-scaling instruction sent by an instruction scheduling unit, the micro-scaling instruction carrying configuration information that includes the address information of the original data block, the target quantization format information, and the global memory address information of the target data block; parsing the configuration information from the micro-scaling instruction; and using the hardware circuit structure built into the micro-scaling module to obtain the original data block from the address information of the original data block, scale and quantize the original data block based on the target quantization format information, and write the scaled and quantized target data block and scaling factor into global memory based on the global memory address information.

2. The method according to claim 1, characterized in that the step of using the hardware circuit structure built into the micro-scaling module to obtain the original data block from the address information of the original data block, and scaling and quantizing the original data block based on the target quantization format information, includes: obtaining the original data block based on its address information, and traversing the original data block with the comparator tree hardware logic built into the micro-scaling module to obtain the numerical range of the original data block; determining a scaling factor based on the numerical range and the target quantization format information, and performing, in parallel by the array multiplier built into the micro-scaling module, a scaling operation on each data element in the original data block based on the scaling factor to obtain a scaled data block; and converting the scaled data block into the target format indicated by the target quantization format information by the hardware format conversion circuit built into the micro-scaling module to obtain the target data block.

3. The method according to claim 2, characterized in that the global memory address information includes first address information and second address information, and the step of writing the scaled and quantized target data block and scaling factor into global memory based on the global memory address information includes: writing the target data block to the data queue and writing the scaling factor to the factor queue; when the amount of data stored in the data queue meets the first transmission condition, packaging the data stored in the data queue and writing it into global memory based on the first address information; and when the amount of data stored in the factor queue meets the second transmission condition, packaging the data stored in the factor queue and writing it into global memory based on the second address information.

4. The method according to claim 3, characterized in that the step of writing the scaled and quantized target data block and scaling factor into global memory based on the global memory address information further includes: after the microscaling instruction is completed, if the data queue and/or the factor queue are not empty, packaging the data stored in the data queue and writing it into global memory based on the first address information, and/or packaging the data stored in the factor queue and writing it into global memory based on the second address information.

5. A hardware-based microscaling system, characterized in that the system includes an instruction scheduling unit and a tensor memory acceleration unit, the tensor memory acceleration unit having a built-in micro-scaling module, wherein: the instruction scheduling unit is used to send micro-scaling instructions to the tensor memory acceleration unit, the micro-scaling instructions carrying configuration information that includes the address information of the original data block, the target quantization format information, and the global memory address information of the target data block; and the micro-scaling module is used to receive the micro-scaling instruction, parse the configuration information from the micro-scaling instruction, and use a built-in hardware circuit structure to obtain the original data block from the address information of the original data block, scale and quantize the original data block based on the target quantization format information, and write the scaled and quantized target data block and scaling factor into global memory based on the global memory address information.

6. The system according to claim 5, characterized in that the microscaling module includes a quantization submodule, which is used for: obtaining the original data block based on its address information, and traversing the original data block with the built-in comparator tree hardware logic to obtain the numerical range of the original data block; determining a scaling factor based on the numerical range and the target quantization format information, and performing, in parallel by a built-in array multiplier, a scaling operation on each data element in the original data block based on the scaling factor to obtain a scaled data block; and converting the scaled data block into the target format indicated by the target quantization format information by the built-in hardware format conversion circuit to obtain the target data block.

7. The system according to claim 5, characterized in that the global memory address information includes first address information and second address information, and the micro-scaling module also includes a splicing submodule, which is used for: writing the target data block to the data queue and writing the scaling factor to the factor queue; when the amount of data stored in the data queue meets the first transmission condition, packaging the data stored in the data queue and writing it into global memory based on the first address information; and when the amount of data stored in the factor queue meets the second transmission condition, packaging the data stored in the factor queue and writing it into global memory based on the second address information.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 4.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 4.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 4.