Heterogeneous processing circuit, matching method, and integrated circuit
By combining a heterogeneous processing circuit with a graphics processing unit and a tensor processor, the problem of computationally intensive operations in 3D graphics computing is solved, data flow synchronization and processing efficiency are improved, and the performance of the graphics processing unit is optimized.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XINXIN HANGTU (SUZHOU) TECHNOLOGY CO LTD
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-23
AI Technical Summary
The graphics processing unit (GPU) performs computationally intensive operations during 3D graphics calculations, which affects processing efficiency and makes it difficult to synchronize data flow with the tensor processor, resulting in a decrease in performance and computational efficiency.
Design a heterogeneous processing circuit that combines a graphics processing unit with a tensor processor. Through heterogeneous computing, it achieves synchronization between the pipeline structure and multiple parallel processing units. It utilizes the advantages of the first and second processing circuits to process the first and second data in the input data respectively, and coordinates the data flow through a buffer to optimize the data path.
It improves the efficiency of the graphics processing unit in processing 3D graphics, reduces data transmission latency and the desynchronization of computation tasks, keeps the data flow continuous and the processing reliable, and enhances the overall computing performance.
Figure CN122265014A_ABST
Description
Technical Field
[0001] This application relates to the field of integrated circuit technology, and in particular to a heterogeneous processing circuit, a matching method, and an integrated circuit.
Background Technology
[0002] A graphics processing unit (GPU) is a microprocessor specifically designed for processing graphics and video rendering tasks. Compared to a central processing unit (CPU), a GPU has more processing cores and can handle a large number of parallel computing tasks simultaneously, excelling in areas such as graphics rendering, video decoding, scientific computing, and deep learning. With technological advancements, GPUs can be used in high-performance computing (HPC) and artificial intelligence (AI), particularly in machine learning and deep learning, where they can provide faster data processing speeds compared to CPUs. However, computationally intensive operations in 3D graphics processing can impact the processing efficiency of GPUs.
Summary of the Invention
[0003] This application provides a heterogeneous processing circuit, a matching method, and an integrated circuit to improve the processing efficiency of 3D graphics.
[0004] In a first aspect, a heterogeneous processing circuit is provided for processing three-dimensional graphics, comprising: a first processing circuit, a first buffer, and a second processing circuit; the first processing circuit includes multiple processing units configured to acquire input data and process first data in the input data in parallel; the first buffer is connected between the first processing circuit and the second processing circuit and configured to transmit second data in the input data to the second processing circuit; the second processing circuit includes multiple processing modules configured to process the second data in the input data in a pipelined manner; wherein the data flow between the first processing circuit and the second processing circuit is matched, and a first number of the multiple processing modules is determined based on a second number of the multiple processing units.
[0005] In the above heterogeneous processing circuit, the first and second processing circuits respectively process the first and second data in the input data, so the advantages of each type of processing circuit can be leveraged. At the same time, because the first number of processing modules is determined by the second number of processing units, the data flow between the first and second processing circuits is matched, which reduces timing delays in data processing and improves the efficiency of processing 3D graphics.
[0006] In one implementation, the second data includes: vertex coordinates, vertex normals, vertex texture coordinates, vertex colors, or transformation matrices.
[0007] In one implementation, the first buffer is also configured to store the first processing result of the first data; or, to store the second processing result of the second data.
[0008] In the above heterogeneous processing circuit, the first buffer can help coordinate the data flow between the first processing circuit and the second processing circuit, ensure that the output results are generated from the input data completely and in a unified manner, and keep the data flow continuous between the processing of the first data and the processing of the second data, thereby improving the efficiency and reliability of the heterogeneous processing circuit in processing input data.
[0009] In one implementation, multiple processing modules include: a first processing module and a second processing module; the first processing module includes: a first calculation unit and a second calculation unit; the first calculation unit is configured to be coupled to the second calculation unit, and performs matrix operations on a first portion of the second data to determine a first calculation result; the second calculation unit is configured to accumulate and calculate the first calculation result to determine a first element of the second processing result. The second processing module includes: a third calculation unit and a fourth calculation unit; the third calculation unit is configured to be coupled to the first calculation unit, and performs matrix operations on a second portion of the second data to determine a second calculation result; the fourth calculation unit is configured to accumulate and calculate the second calculation result to determine a second element of the second processing result.
[0010] In the above heterogeneous processing circuit, the first calculation unit (or the third calculation unit) can focus on performing matrix operations, while the second calculation unit (or the fourth calculation unit) is responsible for accumulating and summing intermediate results, which spreads the computational burden that would otherwise fall on a single calculation unit. At the same time, the third and fourth calculation units can process other vertex data, or different parts of the matrix within the same vertex data, in parallel, further improving processing efficiency, parallelizing the computational tasks, and shortening data processing time.
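The division of labor described in [0009] and [0010] can be illustrated with a minimal Python sketch. It assumes that the "matrix operation" of the first (or third) calculation unit is the element-wise multiplication of one matrix row with a vertex vector, and that the second (or fourth) calculation unit sums those products. The function names `multiply_unit`, `accumulate_unit`, and `processing_module`, as well as the sample data, are hypothetical and do not come from the application.

```python
from typing import List, Sequence


def multiply_unit(row: Sequence[float], vector: Sequence[float]) -> List[float]:
    """Stand-in for the first/third calculation unit: element-wise products."""
    if len(row) != len(vector):
        raise ValueError("row and vector must have the same length")
    return [a * b for a, b in zip(row, vector)]


def accumulate_unit(products: Sequence[float]) -> float:
    """Stand-in for the second/fourth calculation unit: running sum of products."""
    total = 0.0
    for value in products:
        total += value
    return total


def processing_module(row: Sequence[float], vector: Sequence[float]) -> float:
    """One processing module: multiply, then accumulate, yielding one result element."""
    return accumulate_unit(multiply_unit(row, vector))


# Assumed sample data: a 2x2 matrix and a 2-component vertex vector.
matrix = [[1.0, 2.0], [3.0, 4.0]]
vertex = [5.0, 6.0]

first_element = processing_module(matrix[0], vertex)   # first processing module
second_element = processing_module(matrix[1], vertex)  # second processing module
second_processing_result = [first_element, second_element]
```

In this sketch each module handles a different portion of the matrix, so the two calls are independent of each other and could run in parallel in hardware.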
[0011] In one implementation, the first processing module further includes: a first register and a second register; the first register is configured to be coupled to the first calculation unit, to retrieve second data from memory and transmit it to the first calculation unit, and to transmit it to a second cache submodule; the second register is configured to be coupled to the second calculation unit, to retrieve the first element from the second calculation unit. The second processing module further includes: a third register and a fourth register; the third register is connected between the first register and the third calculation unit, configured to retrieve second data from the first register and transmit it to the third calculation unit; the fourth register is connected between the fourth calculation unit and the second register, configured to retrieve the second element from the fourth calculation unit and transmit the second element to the second register.
[0012] The above heterogeneous processing circuit, with its multiple registers, enables data caching, module decoupling, and parallel processing, thereby improving the pipeline efficiency of the second processing circuit and optimizing the data flow path.
[0013] In one implementation, the second processing module further includes a fifth register, and the first processing module further includes a sixth register; the fifth register is configured to be coupled to the fourth register to cache the first matrix formed based on the second element; the sixth register is configured to be coupled to the fifth register to cache the second matrix formed based on the first matrix and the first element, and write the second matrix as the second processing result into the first buffer.
[0014] In one implementation, the second processing circuit further includes: multiple pipelines, each pipeline including multiple processing modules; the second processing circuit is also configured to run the multiple pipelines in parallel to process multiple second data.
[0015] In a second aspect, an integrated circuit is provided, comprising: a control circuit and an interface circuit, and further comprising a heterogeneous processing circuit as provided in the first aspect; the control circuit being configured to generate a first control instruction, a second control instruction, and a third control instruction; the interface circuit being configured to retrieve input data from a memory according to the first control instruction; the heterogeneous processing circuit being configured to process the first data in parallel or process the second data in pipelined manner according to the second control instruction; and the interface circuit being further configured to transfer a first processing result of the first data in a first buffer or a second processing result of the second data in a second buffer to a memory according to the third control instruction.
[0016] In one implementation, the interface circuit includes: an interface unit and at least one reading unit; the reading unit is configured to cyclically acquire input data based on the bandwidth of the interface unit.
[0017] In a third aspect, a heterogeneous matching method is provided for matching data flow between a first processing circuit and a second processing circuit in the heterogeneous processing circuit of the first aspect, comprising: obtaining a first parameter of a memory, the first parameter including: the capacity of the memory; obtaining a second parameter of an interface circuit, the second parameter including: the bandwidth of the interface circuit; determining a first cycle number for the interface circuit to read data from the memory based on the first parameter and the second parameter; and determining a first number of multiple processing modules based on the first cycle number and a second number of multiple processing units.
[0018] The above heterogeneous matching method can coordinate the relationship between data input volume and processing capacity in the interface circuit, the first processing circuit, and the second processing circuit, improving the efficiency of data flow and data processing. Furthermore, it simplifies the complex matching process between the graphics processing unit and the tensor processor, reducing matching difficulty and achieving pipelined synchronization. When using heterogeneous processing circuits implemented with the above methods for 3D graphics processing, the advantages of the two different architectures can be leveraged respectively, improving the efficiency of heterogeneous processing circuits in processing 3D graphics.
[0019] In one implementation, the first parameter further includes: a third number of storage cells in the memory and the capacity of the storage cells; the second parameter further includes: a fourth number of reading units in the interface circuit; determining, based on the first and second parameters, the first cycle number for the interface circuit to read data from the memory includes: based on the third and fourth numbers, determining a fifth number of storage cells to be allocated by the reading units; based on the bandwidth and the capacity of the storage cells, determining a sixth number of storage cells to be read by the heterogeneous processing circuit at one time; and based on the fifth and sixth numbers, determining the first cycle number.
[0020] In one implementation, determining the first number of multiple processing modules based on the first cycle number and the second number of multiple processing units includes: performing a square root operation on the first cycle number to determine a seventh number; determining the second number of multiple processing units based on the seventh number; and determining the first number of multiple processing modules based on the second number.
[0021] In one implementation, the absolute value of the difference between the first number and the second number is less than or equal to a first preset value, which is determined based on the second number.
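The matching steps in [0017] to [0021] can be sketched in Python as follows. The application only states which quantities each number is "based on"; the concrete arithmetic here is an assumption made for illustration: storage cells are split evenly across reading units, the cells read at one time equal bandwidth divided by cell capacity, the cycle number is their quotient rounded up, the number of processing units equals the integer square root, the number of processing modules equals the number of processing units, and the first preset value is a fixed fraction of the unit count. All function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up (b must be positive)."""
    return -(-a // b)


def first_cycle_number(cell_count: int, cell_capacity: int,
                       read_unit_count: int, bandwidth: int) -> int:
    # Fifth number: storage cells handled by each reading unit (assumed even split).
    cells_per_read_unit = ceil_div(cell_count, read_unit_count)
    # Sixth number: storage cells read at one time (assumed bandwidth // capacity).
    cells_per_read = max(1, bandwidth // cell_capacity)
    # First cycle number: reads needed per reading unit (assumed quotient, rounded up).
    return ceil_div(cells_per_read_unit, cells_per_read)


def match_counts(cycle_number: int) -> Tuple[int, int]:
    # Seventh number: square root of the first cycle number (integer part).
    seventh_number = math.isqrt(cycle_number)
    unit_count = max(1, seventh_number)  # second number (assumed equal to seventh)
    module_count = unit_count            # first number (assumed equal to second)
    return module_count, unit_count


def is_matched(module_count: int, unit_count: int,
               preset_ratio: float = 0.25) -> bool:
    # First preset value: assumed to be a fixed fraction of the second number.
    first_preset_value = int(preset_ratio * unit_count)
    return abs(module_count - unit_count) <= first_preset_value


# Assumed example: 1024 cells of 64 bytes, 4 reading units, 256 bytes per read.
cycles = first_cycle_number(cell_count=1024, cell_capacity=64,
                            read_unit_count=4, bandwidth=256)
modules, units = match_counts(cycles)
```

With these assumed inputs each reading unit covers 256 cells, 4 cells are read at a time, and 64 read cycles are needed, which gives 8 processing units and 8 processing modules.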
[0022] In a fourth aspect, a heterogeneous matching device is provided, comprising: an acquisition unit for acquiring a first parameter of a memory, the first parameter including the capacity of the memory; and acquiring a second parameter of an interface circuit, the second parameter including the bandwidth of the interface circuit; a matching unit for determining a first cycle number for the interface circuit to read data from the memory based on the first parameter and the second parameter; and determining a first number of multiple processing modules based on the first cycle number and a second number of multiple processing units.
[0023] In a fifth aspect, a computer program product is provided, including instructions, wherein when the instructions are executed by a processor, the heterogeneous matching method of the third aspect is executed.
[0024] In a sixth aspect, a controller is provided, comprising an integrated circuit provided in the second aspect.
[0025] In a seventh aspect, a vehicle is provided, including a controller provided in the sixth aspect.
[0026] The descriptions of the beneficial effects in the second, sixth, and seventh aspects above can be found in the description of the first aspect, and the descriptions of the beneficial effects in the fourth and fifth aspects can be found in the description of the third aspect. They will not be repeated here.
Attached Figure Description
[0027] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. The accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 is a schematic diagram of the structure of an integrated circuit (or processor) provided in an embodiment of this application;
[0029] Figure 2 is a schematic diagram of a heterogeneous processing circuit provided in an embodiment of this application;
[0030] Figure 3 is a schematic diagram of the structure of multiple processing modules provided in an embodiment of this application;
[0031] Figure 4 is a schematic diagram of another set of multiple processing modules provided in an embodiment of this application;
[0032] Figure 5 is a flowchart illustrating a heterogeneous matching method provided in an embodiment of this application;
[0033] Figure 6 is a flowchart illustrating a method for determining the first cycle number provided in an embodiment of this application;
[0034] Figure 7 is a flowchart illustrating a method for determining a first number of multiple processing modules provided in an embodiment of this application;
[0035] Figure 8 is a schematic diagram illustrating the cyclic operation of multiple processing units provided in an embodiment of this application;
[0036] Figure 9 is a schematic diagram of the structure of a heterogeneous matching device provided in an embodiment of this application.
Detailed Implementation
[0037] To more clearly illustrate the technical solutions in the embodiments of this application, the specific implementation methods of this application will be described below with reference to the accompanying drawings. The accompanying drawings described below are merely some embodiments of this application. For those skilled in the art, other drawings and other implementation methods can be obtained based on these drawings without creative effort. Adjustments and improvements made without departing from the concept of this application are all within the protection scope of this application.
[0038] To keep the drawings simple, each figure only schematically shows the parts related to the corresponding embodiment, and they do not represent the actual structure of the product. In addition, for the sake of simplicity and ease of understanding, some figures only schematically show parts of components with the same structure or function, and there may actually be more or fewer components with the same structure or function.
[0039] In the embodiments of this application, unless otherwise expressly specified and limited, ordinal numbers, such as "first," "second," etc., are used only to distinguish and describe related objects, and should not be construed as indicating or implying the relative importance or order between related objects; furthermore, they do not represent the number of related objects. "Multiple" includes two or more, and other quantifiers are similar. " / " is used to describe the relationship between related objects, indicating an "or" relationship between related objects. "And / or" is used to describe the relationship between related objects, including any combination relationship between related objects, such as "a and / or b" including: "a alone," "b alone," or "a and b." "One or more" or "at least one" of multiple objects refers to any object or any combination of multiple objects, such as "one or more of a1, a2, a3" or "at least one of a1, a2, a3" including: "a1 alone," "a2 alone," "a3 alone," "a1 and a2," "a1 and a3," "a2 and a3," or "a1, a2 and a3."
[0040] In this embodiment of the application, "connection" includes direct or indirect connection between objects. It can be directly connected through a medium (e.g., wires, wiring, etc.), or indirectly connected through other components, or it can be an internal connection.
[0041] When processing 3D graphics, the graphics processing unit (GPU) typically follows a phased graphics rendering pipeline, handling different graphics operations. These phases include: vertex processing, which involves calculating and transforming vertex data and can be performed by a vertex shader; clipping, which transforms vertices to clip space and determines if they are within the viewport, thus clipping out content outside the viewport and preventing it from participating in subsequent rendering; rasterization, which converts 3D objects into 2D screen pixels and transforms the position and size of the graphics on the screen into multiple fragments, each containing relevant information for that pixel, such as color, depth, and texture coordinates; fragment processing, performed by a fragment shader, which calculates the final color or depth values for each fragment; and output merging, which merges the output of the fragment shader with the pixel data in the current framebuffer. Alternatively, it can be divided into two stages: a vertex processing stage (integrating the clipping stage) and a fragment processing stage (integrating compositing and output), handled by the vertex shader and fragment shader respectively. After processing by the above graphics rendering pipeline, the graphics processor writes the results to the frame buffer, which is then displayed on the screen. The vertex shader, in calculating and transforming vertex data, involves matrix operations with high computational density. Traditional graphics processing units can perform these tasks, but their computational efficiency is low, and they struggle to handle other tasks simultaneously while performing vertex calculations.
[0042] Tensor processing units (TPUs) are widely used for processing neural network models. They are dedicated hardware designed for deep learning, particularly adept at accelerating the training and inference of deep learning models. In deep learning, data is typically represented as tensors, which are multidimensional arrays. Tensor processors excel at performing numerous matrix operations (such as matrix multiplication and convolution), which are central to the training and inference processes of deep learning models, such as transformer networks and convolutional neural networks (CNNs).
[0043] Therefore, combining a graphics processing unit (GPU) with a tensor processor can improve the efficiency of the GPU in processing vertex matrix operations for 3D graphics. However, it is difficult for the GPU to synchronize data flow with the tensor processor, which may lead to data transmission delays and asynchrony of computational tasks, thus affecting the performance and computational efficiency of the GPU.
[0044] Considering the above factors, embodiments of this application provide a heterogeneous processing circuit, a matching method, and an integrated circuit that can combine a graphics processing unit and a tensor processor to achieve synchronization between the pipeline structure of the tensor processor and multiple parallel processing units in the graphics processing unit. Through heterogeneous computing, the efficiency of the graphics processing unit in processing three-dimensional graphics is improved.
[0045] The following explanation is provided in conjunction with the accompanying drawings. Figure 1 is a schematic diagram of the structure of an integrated circuit (or processor) provided in an embodiment of this application. As shown in Figure 1, the integrated circuit 100 includes a control circuit 110, an interface circuit 120, and a heterogeneous processing circuit 130. The control circuit 110 is configured to generate a first control instruction, a second control instruction, and a third control instruction. The interface circuit 120 is configured to retrieve input data from the memory 10 according to the first control instruction. The heterogeneous processing circuit 130 is configured to process the first data in parallel or process the second data in a pipelined manner according to the second control instruction. The interface circuit 120 is also configured to transmit the first processing result of the first data or the second processing result of the second data from the heterogeneous processing circuit 130 to the memory 10 according to the third control instruction. It should be noted that outputting the data to the memory 10 as described in this embodiment is only an example. The first processing result and the second processing result can also be transmitted to other processing circuits or systems that use this output. The input data includes, for example, vertex data in the vertex processing stage, data in the clipping stage, data in the rasterization stage, or data in the fragment processing stage during graphics processing. The first data may include one or more of the above input data, and the second data may include vertex data.
[0046] In one embodiment, the interface circuit 120 includes an interface unit 121 and at least one reading unit 122; the reading unit 122 is configured to cyclically acquire input data based on the bandwidth of the interface unit.
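The cyclic acquisition in [0046] can be pictured as reading the input in bandwidth-sized chunks until all of it has been fetched. The following Python sketch assumes that the bandwidth is expressed as a number of data items per read cycle; the function name `cyclic_read` and the sample data are hypothetical.

```python
from typing import Iterator, List, Sequence


def cyclic_read(data: Sequence[int], bandwidth: int) -> Iterator[List[int]]:
    """Yield the input in chunks of at most `bandwidth` items, one chunk per cycle."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    for start in range(0, len(data), bandwidth):
        yield list(data[start:start + bandwidth])


# Assumed example: 10 input items read through an interface that carries 4 per cycle.
input_data = list(range(10))
chunks = list(cyclic_read(input_data, bandwidth=4))
cycle_count = len(chunks)
```

The last chunk may be shorter than the bandwidth, which is why the number of cycles is the item count divided by the bandwidth, rounded up.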
[0047] This application does not limit the type of the integrated circuit 100 described above. For example, it can be a central processing unit (CPU), microcontroller unit (MCU), microprocessor unit (MPU), graphics processor (GPU), or digital signal processor (DSP). In another example, the integrated circuit 100 can realize its processing capability through the logical relationship of hardware circuits, which can be fixed or reconfigurable. For example, the processor can be a dedicated processor, such as a processor implemented using an application-specific integrated circuit (ASIC), which achieves processing capability through the design of the logical relationship between components within the circuit. Another example is a processor implemented using a programmable logic device (PLD), which achieves processing capability by configuring the logical relationship between logic devices through a configuration file; for example, a processor implemented using a field-programmable gate array (FPGA). In yet another example, the processor can be a hardware circuit designed for artificial intelligence, which can be understood as an ASIC, such as a neural network processing unit (NPU), tensor processor, deep learning processing unit (DPU), etc. In one implementation, the integrated circuit 100 can be integrated into another integrated circuit, such as a system-on-chip (SOC). Further, that integrated circuit can be located in a controller, such as the controller of an electronic device like a mobile phone or computer, or the controller of a means of transport or other equipment such as a vehicle, ship, aircraft (e.g., flying vehicle or drone), robot (e.g., industrial robot or household robot), or surveying equipment. Taking intelligent driving scenarios as an example, the integrated circuit 100 can be integrated into an in-vehicle SOC; further, the in-vehicle SOC can be integrated into an in-vehicle controller.
Such in-vehicle controllers include, for example, a domain control unit (DCU), an electronic control unit (ECU), a vehicle central computer (VCC), a zone controller (zonal / zone ECU, or zone control unit, ZCU), or a vehicle control unit (VCU). Domain controllers include, for example, vehicle domain controllers (VDC), cockpit domain controllers (CDC), or advanced driving assistance system / autonomous driving domain controllers (ADAS / AD domain controller, ADC), etc.
[0048] Figure 2 is a schematic diagram of a heterogeneous processing circuit provided in an embodiment of this application. As shown in Figure 2, the heterogeneous processing circuit 200, used for processing three-dimensional graphics, includes: a first processing circuit 210, a first buffer 220, and a second processing circuit 230; the first processing circuit 210 includes multiple processing units 21n, configured to acquire first data from input data and process the first data in parallel; the first buffer 220 is connected between the first processing circuit 210 and the second processing circuit 230, and configured to transmit second data from the input data to the second processing circuit 230; the second processing circuit 230 includes multiple processing modules 23m, configured to process the second data from the input data in a pipelined manner; wherein the data flow between the first processing circuit 210 and the second processing circuit 230 is matched, and the first number m of the multiple processing modules 23m is determined based on the second number n of the multiple processing units 21n.
[0049] The first processing circuit 210 and the second processing circuit 230 can process the stages of the 3D graphics processing process jointly or separately. For example, in the processing of 3D graphics vertex data, since the graphics processing involves graphics transformation, and the transformed vertex coordinates can be determined by the original vertex coordinates and the transformation matrix, this process involves a large number of matrix operations, such as matrix multiplication (matmul), and the scale of the matrices involved in the operations may be large. Therefore, the first processing circuit 210 and the second processing circuit 230 can process this data jointly. For example, the first data and the second data can be the same, both involving the vertex data of the 3D graphics. For example, the first processing circuit 210 and the second processing circuit 230 can process multiple vertex data simultaneously, jointly completing all vertex transformations of the 3D graphics. On the other hand, the 3D graphics processing process involves not only the vertex processing stage but may also involve other graphics rendering pipeline stages such as the fragment processing stage. These stages involve fewer matrix operations. To improve processing efficiency, the first processing circuit 210 can process the first data, which may involve data from the clipping stage, rasterization stage, or fragment processing stage—data that involves fewer matrix operations. The second data may differ from the first data. The second data processed by the second processing circuit 230 involves vertex data involving dense matrix operations. In one implementation, the second data may include: vertex coordinates, vertex normals, vertex texture coordinates, vertex colors, or transformation matrices. 
For example, the second processing circuit 230 may determine the vertex coordinates after matrix operations based on the transformation matrix and vertex coordinates in the second data. It should be noted that vertex data can be used to transform vertex coordinates in object space (or model space) to screen space through operations with the transformation matrix. For example, the transformation matrix may include: model matrix, view matrix, or projection matrix. Vertex data can be used to perform matrix operations with the transformation matrix to obtain the final screen space coordinates. The heterogeneous processing circuit of this application embodiment can leverage the advantages of both processing circuits, improving the efficiency of the heterogeneous processing circuit in processing 3D graphics.
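The vertex transformation described in [0049] can be made concrete with a small, dependency-free Python sketch. It assumes 4x4 matrices acting on homogeneous column vectors, and it uses a translation as the model matrix, another translation as the view matrix, and the identity as the projection matrix purely for illustration. It stops at the matrix product and omits the perspective divide and viewport mapping that a full path to screen space would need. All names and values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """4x4 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Apply a 4x4 matrix to a homogeneous column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    """Homogeneous translation matrix."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


identity = translation(0.0, 0.0, 0.0)

model = translation(1.0, 2.0, 3.0)   # object space -> world space (assumed)
view = translation(0.0, 0.0, -5.0)   # world space -> camera space (assumed)
projection = identity                # placeholder projection (assumed)

# Combine the matrices once, then apply the result to every vertex.
mvp = mat_mul(projection, mat_mul(view, model))
vertex = [1.0, 1.0, 1.0, 1.0]        # homogeneous vertex coordinates
transformed = mat_vec(mvp, vertex)
```

Each output component is a row-by-vector multiply followed by a sum, which is the kind of dense multiply-accumulate work the text assigns to the second processing circuit.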
[0050] When processing 3D graphics through a graphics rendering pipeline, multiple stages of the processing occur consecutively, and data flows continuously within the same or different processing stages. Discontinuous data may require more register overhead to store intermediate data, impacting processing timing and efficiency. Therefore, a matching mechanism can be established between multiple processing units 21n in the first processing circuit 210 and multiple processing modules 23m in the second processing circuit 230 to ensure the continuity of data flow during 3D graphics processing. For example, when the first processing circuit 210 processes the first data, some data within the first data may require further processing based on the processing result of the second data. For instance, vertex data processing can be performed first in the graphics rendering pipeline, followed by subsequent related processing. Therefore, by rationally designing the number n of processing units and the number m of processing modules, the processing efficiency of the multiple processing modules in the second processing circuit 230 can be matched with the processing efficiency of the first processing circuit 210, reducing the time the first processing circuit 210 waits for the second processing circuit 230 to process the second data. For example, the output of 3D graphics processing may require the simultaneous output of the first and second processing results. After the second processing circuit 230 completes the processing of the second data, the first processing circuit 210, whose processing efficiency matches that of the second processing circuit 230, can complete the processing of the first data with a short time delay, achieving simultaneous output.
Therefore, the first processing circuit 210 and the second processing circuit 230 can be connected without relying on a bus protocol, using a pipelined connection to minimize data transmission latency and improve processing efficiency. Matching can be reflected in the number of processing units and the number of processing modules being the same or similar, and in their processing efficiency being the same or similar. This avoids situations where the first processing circuit 210 waits for the second processing circuit 230, or vice versa. The heterogeneous processing circuit 200 described above can reduce timing delays and improve the efficiency of processing 3D graphics. The number n of processing units can also be determined by the input bandwidth of the heterogeneous processing circuit 200 and the parameters of the storage circuit to be read (such as a memory), to coordinate the data flow of input data entering the heterogeneous processing circuit 200 for processing.
[0051] In some implementations, the first buffer 220 is also configured to store a first processing result of the first data; or, to store a second processing result of the second data.
[0052] When the heterogeneous processing circuit 200 processes 3D graphics, the first processing circuit 210 and the second processing circuit 230 may have different processing speeds. For example, the first processing circuit 210 may complete the processing of the first data before the second processing circuit 230. In this case, the first processing circuit 210 can store the first processing result in the first buffer 220 and wait for the second processing circuit 230 to complete the processing of the second data and generate the second processing result, before the first processing result is transferred from the first buffer 220 to the memory 10 or to other systems. Alternatively, the second processing circuit 230 may complete the processing of the second data before the first processing circuit 210, and the first buffer 220 may cache the second processing result, waiting for the first processing circuit 210 to generate the first processing result. As another example, in the first data processed by the first processing circuit 210, the processing of some data may require the second processing result of the second data as a basis. In this case, the second processing circuit 230 may have already completed the processing of the second data and can store the second processing result in the first buffer 220. During the processing of the first data by the first processing circuit 210, the second processing result can be read from the first buffer 220. The first buffer 220 can help coordinate the data flow between the first processing circuit 210 and the second processing circuit 230, ensure that the output results generated from the input data are complete and output together, and maintain the continuity of data flow between the processing of the first data and the processing of the second data, thereby improving the efficiency and reliability of the heterogeneous processing circuit in processing input data.
[0053] Please refer to Figure 3, which is a schematic diagram of the structure of multiple processing modules provided in an embodiment of this application. As shown in Figure 3, the multiple processing modules include: a first processing module 310 and a second processing module 320; the first processing module 310 includes: a first calculation unit 311 and a second calculation unit 312; the first calculation unit 311 is configured to be coupled to the second calculation unit 312, and performs matrix operations on a first part of the second data to determine a first calculation result; the second calculation unit 312 is configured to accumulate and calculate the first calculation result to determine a first element of the second processing result. The second processing module 320 includes: a third calculation unit 321 and a fourth calculation unit 322; the third calculation unit 321 is configured to be coupled to the first calculation unit 311, and performs matrix operations on a second part of the second data to determine a second calculation result; the fourth calculation unit 322 is configured to accumulate and calculate the second calculation result to determine a second element of the second processing result.
[0054] The processing module can calculate the vertex processing stage with high matrix operation density in 3D graphics processing. The input second data may include vertex data (such as vertex data generated in vector or matrix form) and a transformation matrix. After the vertex data is processed with the transformation matrix, vertex data after spatial transformation (e.g., the transformation process of a vertex in 3D graphics from object space to world space, view space, and screen space) is generated. The first calculation unit 311 can perform matrix operations on the vertex data (such as vertex data generated in vector or matrix form) and the transformation matrix in the second data. The operations may include matrix addition, matrix subtraction, or matrix multiplication. The first calculation result obtained by the matrix operation may be a set of numbers to be summed. The second calculation unit 312 performs an accumulation and summation operation on the first calculation result to determine the first element in the second processing result. The second processing result may be represented in vector or matrix form, and the first element may include one or more element values in the vector or matrix. For example, the second data may include vertex coordinates and a transformation matrix, where the vertex coordinates are a matrix formed by x rows and i columns, and the transformation matrix is a matrix formed by y rows and j columns. In this system, each element of the first row a1i in the vertex coordinates can be the first part of the second data. Each element of the first row a1i in the vertex coordinates can be mapped one-to-one with the elements of the first column bj1 in the transformation matrix, for example, a11 corresponds to b11, a12 corresponds to b21, a13 corresponds to b31, and so on. 
The first calculation result determined by the first calculation unit 311 can be a set of intermediate products that have not yet been summed, such as the terms of a11×b11+a12×b21+a13×b31… These intermediate products can be accumulated by the second calculation unit 312 to determine the first element of the second processing result. The third calculation unit 321 can operate on the second part of the second data. Continuing with the above example of a matrix operation between a vertex coordinate matrix of x rows and i columns and a transformation matrix of y rows and j columns, the second part of the second data can be the elements a2i in the second row of the vertex coordinate matrix, which undergo matrix operations with the transformation matrix to obtain the second calculation result. After the second calculation result is obtained, the second element of the second processing result is determined by the fourth calculation unit 322.
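The division of labor described above can be sketched in software. The following is a minimal illustration only, not the circuit itself: the function names (`partial_products`, `accumulate`, `transform`) and the sample 3×3 scaling matrix are hypothetical, and only matrix multiplication is shown. One function plays the role of the first or third calculation unit (products not yet summed), and another plays the role of the second or fourth calculation unit (accumulation).

```python
# Illustrative sketch; all names and sample values are assumptions, not from the application.

def partial_products(row, column):
    """Role of the first/third calculation unit: element-wise products, not yet summed."""
    if len(row) != len(column):
        raise ValueError("row and column must have the same length")
    return [a * b for a, b in zip(row, column)]


def accumulate(products):
    """Role of the second/fourth calculation unit: sum the intermediate products."""
    total = 0
    for p in products:
        total += p
    return total


def transform(vertices, matrix):
    """Multiply each vertex row by the transformation matrix.

    Each row of `vertices` stands for the part of the second data
    handled by one processing module.
    """
    columns = list(zip(*matrix))  # columns b_j1, b_j2, ... of the transformation matrix
    return [
        [accumulate(partial_products(row, col)) for col in columns]
        for row in vertices
    ]


vertices = [[1, 2, 3], [4, 5, 6]]             # rows a1i and a2i
matrix = [[1, 0, 0], [0, 2, 0], [0, 0, 3]]    # assumed scaling transform
result = transform(vertices, matrix)
print(result)
```

In this sketch the first row of `result` corresponds to the elements produced by the first processing module and the second row to those produced by the second processing module.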
[0055] Through the data flow between the first processing module 310 and the second processing module 320, continuous computation can be performed on the matrix in the second data. The first computation unit 311 (or the third computation unit 321) focuses on performing basic matrix operations (such as element-wise product calculation in matrix multiplication), while the second computation unit 312 (or the fourth computation unit 322) is responsible for accumulating intermediate results, effectively distributing the computational burden. Simultaneously, the third computation unit 321 and the fourth computation unit 322 in the second processing module 320 can continue to process other vertex data or different parts of the matrix within the same vertex data in parallel, further improving processing efficiency. Different computation units can simultaneously process different parts of the same matrix (such as operations between different rows of the vertex matrix and the transformation matrix), achieving parallelization of computational tasks. When the vertex data is large, this parallel architecture can significantly shorten computation time.
[0056] Continuing to refer to Figure 3, the first processing module 310 further includes: a first register 313 and a second register 314; the first register 313 is configured to be coupled to the first computing unit 311, to obtain second data and transmit it to the first computing unit 311, and to transmit it to the second cache submodule 340; the second register 314 is configured to be coupled to the second computing unit 312, and to obtain the first element from the second computing unit 312.
[0057] The second processing module 320 further includes a third register 323 and a fourth register 324; the third register 323 is connected between the first register 313 and the third calculation unit 321, and is configured to obtain second data from the first register 313 and transmit it to the third calculation unit 321; the fourth register 324 is connected between the fourth calculation unit 322 and the second register 314, and is configured to obtain a second element from the fourth calculation unit 322 and transmit the second element to the second register 314.
[0058] The first register 313 can cache the second data obtained from the outside, reducing the latency incurred when the first computing unit 311 accesses external data directly and improving the data supply speed. The connection between the first register 313 and the third register 323 avoids the overhead of the third computing unit 321 repeatedly loading external data, so that data input once can be reused by multiple computing units, improving data processing efficiency. The second register 314 caches the first element determined by the second computing unit 312, so that the second processing result of the second processing circuit does not depend on the real-time processing of multiple processing modules, reducing the burden on the processing modules in the pipeline. For example, the second register 314 can hold the first element while waiting for the processing result of a subsequent processing module (such as the second processing module), so that the processing results are output together. The registers can act as data buffers between different computational units. For example, the connection between the first register 313 and the third register 323 decouples data transfer between the first and second processing modules, ensuring that the second processing module can obtain the second data at any time during the computation of the first processing module 310, without having to wait for the first processing module 310 to complete its processing. For instance, the first and second parts of the second data can be obtained simultaneously through the first register 313 and the third register 323 and transmitted to the corresponding computational units for processing. 
Alternatively, when multiple vertex data use the same transformation matrix for matrix operations, the first register 313 or the third register 323 can temporarily store the previously input transformation matrix without needing to transmit it again, reducing data transmission pressure and improving data transmission efficiency. The connection between the second register 314 and the fourth register 324 allows accumulation and calculation to be performed independently of data transfer, eliminating direct dependencies between processing units and improving the overall throughput of the pipeline. Furthermore, the second register 314 can integrate multiple elements determined by the fourth register 324 or subsequent processing modules to form a matrix of the second processing result, achieving unified data output. The configuration of these multiple registers enables data caching, module decoupling, and parallel processing, thereby improving the pipeline efficiency of the second processing circuit and optimizing the data flow path.
[0059] Please refer to Figure 4, which is a schematic diagram of another set of multiple processing modules provided in an embodiment of this application. The second processing module 420 further includes a fifth register 425, and the first processing module 410 further includes a sixth register 415; the fifth register 425 is configured to be coupled to the fourth register 424 to cache the first matrix formed based on the second element; the sixth register 415 is configured to be coupled to the fifth register 425 to cache the second matrix formed based on the first matrix and the first element, and write the second matrix as the second processing result into the first buffer.
[0060] The fifth register 425 can obtain the second element from the fourth register 424 to form a matrix, and the sixth register 415 can obtain the matrix formed based on the second element from the fifth register 425 and combine it with the first element to form a new matrix. Alternatively, if there are other processing modules after the first processing module 410 and the second processing module 420 (such as a third processing module coupled to the second processing module 420), the matrices transmitted by the other processing modules can be combined to form the matrix of the second processing result. By integrating data from multiple registers, the fifth register 425 and the sixth register 415 can efficiently generate larger-scale matrices (e.g., combinations of multiple elements constructing the matrix of the second processing result during the calculation of vertex coordinates and transformation matrices) to meet the complex needs of 3D graphics processing. Furthermore, it can reduce the complexity of intermediate result transmission; for example, by integrating data from multiple processing modules, the fifth register 425 reduces the amount of data that needs to be transmitted, improving overall transmission efficiency. The sixth register 415 further processes the data transmitted by the fifth register 425 and outputs a standardized matrix of the second processing result. This avoids data duplication and redundancy, providing a complete output for the entire pipeline's data processing. For complex matrix calculation scenarios, it can optimize the computation path for large-scale graphics processing. The above multiple processing modules are not limited to the examples of the first and second processing modules; they can also include 3, 4, 5, or even more processing modules. 
The number of processing modules can be determined based on the number of processing units in the first processing circuit, or set in combination with parameters such as the number of processing units and the input bandwidth of the heterogeneous processing circuit.
[0061] In one embodiment, the second processing circuit further includes: multiple pipelines, each pipeline including multiple processing modules; the second processing circuit is also configured to run the multiple pipelines in parallel to process multiple second data. Multiple pipelines can run simultaneously, distributing computational tasks, for example, by allocating multiple second data to each pipeline for parallel processing, thereby significantly improving the overall processing speed of the system. Especially in high-density computational scenarios such as graphics processing, matrix computation, or tensor processing, the parallel operation of multiple pipelines can significantly reduce the load on a single pipeline, avoiding performance bottlenecks. This allows for batch data processing, simultaneously processing multiple second data of the same type (such as multiple vertex data) or different types of second data (such as texture or lighting data), suitable for large-scale 3D graphics rendering scenarios requiring high throughput.
[0062] This application does not limit the type of memory, the first buffer, or the register. For example, it can be read-only memory (ROM) or random access memory (RAM). For example, it can be non-volatile memory or volatile memory. For example, it can be on-chip storage resources or off-chip storage resources. ROM includes, for example, mask ROM, programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash ROM; RAM includes, for example, static random access memory (SRAM) or dynamic random access memory (DRAM). On-chip storage resources refer to the storage circuitry integrated within the chip. These circuits are integrated with other circuits on the chip (such as processing circuits) to provide storage space. Examples include on-chip static random access memory (SRAM) resources or on-chip configuration register space resources. Off-chip storage resources refer to independent storage circuitry or chips, such as off-chip double data rate synchronous dynamic random access memory (DDR SDRAM, or DDR for short).
[0063] Please refer to Figure 5, which is a flowchart illustrating a heterogeneous matching method provided in an embodiment of this application. The heterogeneous matching method is used to match data flow between a first processing circuit and a second processing circuit in an integrated circuit provided in the first aspect, and includes at least the following steps:
[0064] S510: Obtain the first parameter of the memory, the first parameter including: the capacity of the memory;
[0065] S520: Obtain the second parameter of the interface circuit, which includes the bandwidth of the interface circuit;
[0066] S530: Based on the first parameter and the second parameter, determine the first cycle number for the interface circuit to read data from the memory;
[0067] S540: Determine the first number of multiple processing modules based on the first cycle number and the second number of multiple processing units.
[0068] When input data enters the heterogeneous processing circuits within an integrated circuit for processing, to improve data flow and processing efficiency, the relationship between the amount of data input and processing capabilities in the interface circuit, the first processing circuit, and the second processing circuit can be coordinated. This is achieved by obtaining the first parameter of the memory connected to the integrated circuit, such as its capacity. This capacity can be the total or partial readable capacity of the memory, or the capacity occupied by the data to be read. The second parameter of the interface circuit is obtained, such as its bandwidth. By matching the bandwidth with the capacity, the first cycle number required to read the data (input data) from the memory through the interface circuit can be determined. For example, if the memory capacity is q and the interface circuit bandwidth is p, the number of cycles required to completely read the memory through the interface circuit can be determined by dividing q by p. To improve processing efficiency, a match can be achieved between the first cycle number, the number of processing units in the first processing circuit, and the number of processing modules in the second processing circuit, maximizing the efficiency of the pipelined processing. Matching the data reading speed of the interface circuit with the calculation speed of the multiple processing units in the first processing circuit can also improve data processing efficiency. For example, one processing unit can process the data of one memory cell per clock cycle. The interface circuit reads data at a rate of 16 memory units per cycle. Therefore, to avoid waiting by processing units, the entire pipeline can be configured with 16 processing units working in parallel per cycle to process the 16 memory units read. This achieves better pipeline matching and reduces the likelihood of timing or data processing order imbalances. 
Simultaneously, after determining the number of processing units, the number of processing modules can be matched accordingly. For example, the number of processing modules can be set equal to the number of processing units, enabling cyclical pipelined operation on the data. Alternatively, the number of processing modules can be set so that its difference from the number of processing units falls within a certain range. For instance, if the number of processing units is greater than the number of processing modules and the processing units are more efficient, the mismatch can be absorbed by having the processing units wait for the processing modules. The range of this difference can therefore be determined based on the latency that the architecture can tolerate, minimizing waiting time and improving processing efficiency.
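The arithmetic in this paragraph can be illustrated with a short sketch. It is a hedged example under stated assumptions: the function names, the rounding-up of q divided by p, and the sample values (a capacity of 2048 memory cells, a tolerance of 2) are hypothetical and are not taken from the application.

```python
# Illustrative sketch; names, rounding behavior, and sample values are assumptions.

def read_cycles(capacity, bandwidth):
    """Cycles needed to read `capacity` at `bandwidth` per cycle (q / p).

    Rounding up is an assumption: a partial final read still costs one cycle.
    """
    if capacity <= 0 or bandwidth <= 0:
        raise ValueError("capacity and bandwidth must be positive")
    return (capacity + bandwidth - 1) // bandwidth


def units_needed(cells_read_per_cycle, cells_per_unit_per_cycle=1):
    """Processing units required so that no cell read in a cycle has to wait."""
    if cells_read_per_cycle <= 0 or cells_per_unit_per_cycle <= 0:
        raise ValueError("rates must be positive")
    return (cells_read_per_cycle + cells_per_unit_per_cycle - 1) // cells_per_unit_per_cycle


def within_tolerance(unit_count, module_count, tolerance):
    """True if the unit/module count difference is within the tolerated range."""
    return abs(unit_count - module_count) <= tolerance


cycles = read_cycles(capacity=2048, bandwidth=16)   # q = 2048 cells, p = 16 cells per cycle
units = units_needed(cells_read_per_cycle=16)       # one cell per unit per cycle
print(cycles, units, within_tolerance(units, 14, tolerance=2))
```

With 16 memory cells read per cycle and one cell processed per unit per cycle, the sketch yields 16 processing units, matching the example in the text; the tolerance value would in practice follow from the latency the architecture can absorb.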
[0069] The heterogeneous processing matching method described above simplifies the complex matching process between the graphics processing unit (GPU) and the tensor processor, reducing the matching difficulty and enabling pipelined synchronization. When using heterogeneous processing circuits implemented with the above method for 3D graphics processing, the advantages of the two different architectures can be leveraged to improve the efficiency of the heterogeneous processing circuit in processing 3D graphics.
[0070] In some implementations, the first parameter further includes: a third number of storage cells in the memory and the capacity of the storage cells; the second parameter further includes: a fourth number of read units in the interface circuit. Please refer to Figure 6, which is a flowchart illustrating a method for determining the first cycle number provided in an embodiment of this application. Step S530, determining, based on the first parameter and the second parameter, the first cycle number for the interface circuit to read data from the memory, further includes:
[0071] S610: Based on the third and fourth quantities, determine the fifth quantity of read storage units to be allocated to the read unit;
[0072] S620: Based on bandwidth and storage unit capacity, determine the sixth number of storage units that the heterogeneous processing circuit reads at one time;
[0073] S630: Determine the first cycle number based on the fifth and sixth quantities.
[0074] The memory may include multiple memory cells, and the interface circuit may include multiple read units, each of which can read from multiple memory cells in the memory. The third number of memory cells can be divided among the fourth number of read units, which determines the fifth number of memory cells that each read unit is to read. However, the bandwidth of the interface circuit is still a limiting factor. Therefore, from the bandwidth and the capacity of each memory cell, the sixth number of memory cells that the interface circuit can read in one read can be determined. Combining the fifth and sixth numbers, the first cycle number required for a read unit to finish reading all of the memory cells allocated to it can be determined.
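Steps S610 to S630 can be sketched as follows. The application does not state the exact formulas, so this example makes explicit assumptions: the memory cells are split evenly among the read units (rounded up), the bandwidth figure is treated as the amount a single read unit can fetch per read, and the first cycle number is the fifth number divided by the sixth number (rounded up). All function and parameter names are hypothetical.

```python
# Illustrative sketch; the formulas below are assumptions, not the application's definitions.

def ceil_div(a, b):
    """Integer division rounded up, for positive integers."""
    return -(-a // b)


def first_cycle_number(cell_count, read_unit_count, bandwidth, cell_capacity):
    """Estimate the cycles a read unit needs to read its share of the memory.

    cell_count      -- third number: memory cells in the memory
    read_unit_count -- fourth number: read units in the interface circuit
    bandwidth       -- data readable per read (assumed per read unit)
    cell_capacity   -- capacity of one memory cell, in the same unit as bandwidth
    """
    if min(cell_count, read_unit_count, bandwidth, cell_capacity) <= 0:
        raise ValueError("all parameters must be positive")
    fifth = ceil_div(cell_count, read_unit_count)   # S610: cells allocated per read unit
    sixth = bandwidth // cell_capacity              # S620: cells read at one time
    if sixth == 0:
        raise ValueError("bandwidth is smaller than one memory cell")
    return ceil_div(fifth, sixth)                   # S630: first cycle number


print(first_cycle_number(cell_count=4096, read_unit_count=4, bandwidth=64, cell_capacity=8))
```

For the sample values, each read unit is allocated 1024 cells and reads 8 cells at a time, giving 128 cycles.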
[0075] Please refer to Figure 7, which is a flowchart illustrating a method for determining the first number of multiple processing modules provided in an embodiment of this application. Step S540, determining the first number of multiple processing modules based on the first cycle number and the second number of multiple processing units, further includes:
[0076] S710: Perform a square root operation on the first cycle number to determine the seventh quantity;
[0077] S720: Determine the second number of multiple processing units based on the seventh number;
[0078] S730: Based on the second quantity, determine the first quantity of multiple processing modules.
[0079] During the parallel processing of input data in the first processing circuit, in order to fully utilize the processing capacity of each processing unit, multiple processing units can be operated at full load in a pipelined manner. Please refer to Figure 8, which is a schematic diagram of multiple processing units operating cyclically, provided in an embodiment of this application. Figure 8 illustrates the pipelined data processing of eight processing units. Each column represents a clock cycle, and each row represents the operational status of the processing units in the pipeline (how many processing units are executing tasks within a given clock cycle). Tasks are passed sequentially through the pipeline, exhibiting the characteristics of pipelined operation. After the eighth cycle, the pipeline is completely full (each processing unit has tasks executing). To ensure pipeline balance, a certain proportional relationship needs to be maintained between the number of processing units and the number of cycles. As shown in Figure 8, the task distribution of the pipeline (the blocks in the figure) exhibits the characteristics of a two-dimensional array, and the number of processing units and the number of filling cycles can be symmetrical. Therefore, 8×8=64 is a symmetrically distributed filling state. To match the first cycle number, a symmetrical point can be found by taking the square root. In a hardware implementation, the square root of the first cycle number may not be an integer. In such cases, a reasonable nearby integer value can be chosen. For example, when the first cycle number is 128, the square root is approximately 11.3, so an integer such as 8, 9, 10, 11, or 12 can be chosen to determine the second number of processing units. 
When determining the first number of multiple processing modules using the second number, to ensure processing efficiency during data flow, the difference between the first and second numbers can be determined within a reasonable range. For example, the first number and the second number can be equal. Or, the efficiency of the first processing circuit and the second processing circuit in processing data can differ within a certain range of cycle numbers. In one implementation, the absolute value of the difference between the first and second numbers is less than or equal to a first preset value, which is determined based on the second number. This ensures that the processing capacity (throughput) of multiple processing units is close to the computing capacity of multiple processing modules within a reasonable range, so as to avoid excessive differences that could lead to insufficient capacity of multiple processing modules, resulting in data backlog (bottleneck effect), or excessive processing capacity of multiple processing modules that could lead to unnecessary idle hardware resources.
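Steps S710 to S730 can be illustrated with a small sketch. It is one possible reading only: the helper names are hypothetical, the candidates returned are simply the integers immediately below and above the square root (the text also allows other nearby integers such as 8, 9, or 10), and the first preset value is passed in as a parameter because the application states only that it depends on the second number.

```python
# Illustrative sketch; names and the candidate-selection rule are assumptions.
import math


def square_root_candidates(first_cycle_number):
    """S710: take the square root and list the nearest integer candidates."""
    if first_cycle_number < 1:
        raise ValueError("first_cycle_number must be at least 1")
    root = math.sqrt(first_cycle_number)
    floor_root = math.isqrt(first_cycle_number)
    if floor_root * floor_root == first_cycle_number:
        return root, [floor_root]                 # perfect square: one symmetric point
    return root, [floor_root, floor_root + 1]     # otherwise: neighbors of the root


def modules_match(unit_count, module_count, preset):
    """S730 check: |first number - second number| <= first preset value."""
    return abs(module_count - unit_count) <= preset


root, candidates = square_root_candidates(128)
print(round(root, 1), candidates)
print(modules_match(unit_count=11, module_count=12, preset=2))
```

A first cycle number of 64 gives the single symmetric point 8, which corresponds to the 8×8 filling state described for Figure 8.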
[0080] Based on the same technical concept, please refer to Figure 9, which is a schematic diagram of a heterogeneous matching device provided in an embodiment of this application. The heterogeneous matching device 900 includes: an acquisition unit 910, used to acquire a first parameter of a memory, the first parameter including the capacity of the memory; and to acquire a second parameter of an interface circuit, the second parameter including the bandwidth of the interface circuit; and a matching unit 920, used to determine a first cycle number for the interface circuit to read data from the memory based on the first parameter and the second parameter; and to determine a first number of multiple processing modules based on the first cycle number and a second number of multiple processing units.
[0081] In one embodiment, the matching unit 920 is further configured to determine a fifth number of storage units to be read by the reading unit based on a third number and a fourth number; determine a sixth number of storage units to be read by the heterogeneous processing circuit at one time based on bandwidth and storage unit capacity; and determine a first cycle number based on the fifth number and the sixth number.
[0082] In one embodiment, the matching unit 920 is further configured to perform a square root operation on the first cycle number to determine the seventh quantity; based on the seventh quantity, determine the second quantity of the plurality of processing units; and based on the second quantity, determine the first quantity of the plurality of processing modules.
[0083] The specific implementation methods and beneficial effects of each unit in the heterogeneous matching device 900 can be referred to the above embodiments, and will not be repeated here.
[0084] The above division of units is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these units can be implemented by a processor calling software; for example, a heterogeneous matching device includes a processor coupled to memory, which stores instructions. The processor calls the instructions stored in memory to implement any of the heterogeneous matching methods or to realize the functions of each unit. The processor can be, for example, a general-purpose processor, such as a CPU, and the memory can be memory within a signal processing device or memory outside the signal processing device. Alternatively, these units can be implemented as hardware circuits. The functions of some or all units can be realized through the design of the hardware circuit, which can be understood as one or more processors. For example, the hardware circuit includes an Application-Specific Integrated Circuit (ASIC), which implements the functions of some or all units by designing the logical relationships between the components within the circuit. Furthermore, the hardware circuit can be implemented using a programmable logic device (PLD), which can include a large number of logic gates. The logical relationships between the logic gates are configured through a configuration file, thereby realizing the functions of some or all units. All units of the above heterogeneous matching device can be implemented entirely through processor calling programs, or entirely through hardware circuits, or partially through processor calling programs with the remaining parts implemented through hardware circuits.
[0085] Furthermore, embodiments of this application also provide a computer program product, including instructions, wherein when the instructions are executed by a processor, the above heterogeneous matching method is executed.
[0086] In the above embodiments, the descriptions of each embodiment have their own emphasis. Parts not described in detail or in a particular embodiment can be referred to in the relevant descriptions of other embodiments. Furthermore, the above embodiments can be freely combined as needed.
Claims
1. A heterogeneous processing circuit for processing three-dimensional graphics, comprising: a first processing circuit, a first buffer, and a second processing circuit; The first processing circuit includes multiple processing units configured to acquire input data and process first data in the input data in parallel. The first buffer, connected between the first processing circuit and the second processing circuit, is configured to transmit the second data in the input data to the second processing circuit; The second processing circuit includes multiple processing modules configured to process the second data in the input data in a pipeline manner; The data flow between the first processing circuit and the second processing circuit is matched, and the first number of the plurality of processing modules is determined based on the second number of the plurality of processing units.
2. The heterogeneous processing circuit of claim 1, wherein, The second data includes: vertex coordinates, vertex normals, vertex texture coordinates, vertex colors, or transformation matrices.
3. The heterogeneous processing circuit of claim 2, wherein, The first buffer is also configured to store the first processing result of the first data; or, to store the second processing result of the second data.
4. The heterogeneous processing circuit of claim 3, wherein, The plurality of processing modules include: a first processing module and a second processing module; the first processing module includes: a first computing unit and a second computing unit; The first computing unit is configured to be coupled to the second computing unit to perform matrix operations on a first part of the second data to determine a first calculation result; The second calculation unit is configured to perform an accumulation and calculation on the first calculation result to determine the first element of the second processing result. The second processing module includes: a third computing unit and a fourth computing unit; The third computing unit is configured to be coupled to the first computing unit, and to perform matrix operations on the second part of the second data to determine the second calculation result. The fourth calculation unit is configured to perform a summation calculation on the second calculation result to determine the second element of the second processing result.
5. The heterogeneous processing circuit of claim 4, wherein the first processing module further includes a first register and a second register; the first register is coupled to the first computing unit and is configured to retrieve the second data from a memory, to transmit the second data to the first computing unit, and to transmit the second data to the second cache submodule; the second register is coupled to the second computing unit and is configured to obtain the first element from the second computing unit; the second processing module further includes a third register and a fourth register; the third register is connected between the first register and the third computing unit and is configured to obtain the second data from the first register and to transmit the second data to the third computing unit; and the fourth register is connected between the fourth computing unit and the second register and is configured to obtain the second element from the fourth computing unit and to transfer the second element to the second register.
6. The heterogeneous processing circuit of claim 5, wherein the second processing module further includes a fifth register, and the first processing module further includes a sixth register; the fifth register is coupled to the fourth register and is configured to cache a first matrix formed based on the second element; and the sixth register is coupled to the fifth register and is configured to cache a second matrix formed based on the first matrix and the first element, and to write the second matrix into the first buffer as the second processing result.
7. The heterogeneous processing circuit of any one of claims 1-6, wherein the second processing circuit further includes a plurality of pipelines, each pipeline including the plurality of processing modules; and the second processing circuit is further configured to run the plurality of pipelines in parallel to process a plurality of pieces of the second data.
8. An integrated circuit, comprising: a control circuit, an interface circuit, and the heterogeneous processing circuit of any one of claims 1-7; wherein the control circuit is configured to generate a first control instruction, a second control instruction, and a third control instruction; the interface circuit is configured to retrieve the input data from a memory according to the first control instruction; the heterogeneous processing circuit is configured to process the first data in parallel, or to process the second data in a pipelined manner, according to the second control instruction; and the interface circuit is further configured to transmit the first processing result of the first data, or the second processing result of the second data, from the first buffer to the memory according to the third control instruction.
9. The integrated circuit of claim 8, wherein the interface circuit includes an interface unit and at least one reading unit; and the reading unit is configured to cyclically acquire the input data based on the bandwidth of the interface unit.
10. A heterogeneous matching method for matching the data flow between the first processing circuit and the second processing circuit in the heterogeneous processing circuit of any one of claims 1-7, the method comprising: obtaining a first parameter of a memory, the first parameter including the capacity of the memory; obtaining a second parameter of an interface circuit, the second parameter including the bandwidth of the interface circuit; determining, based on the first parameter and the second parameter, a first loop count for the interface circuit to read data from the memory; and determining, based on the first loop count and the second number of the plurality of processing units, the first number of the plurality of processing modules.
11. The heterogeneous matching method of claim 10, wherein the first parameter further includes a third number of storage units in the memory and the capacity of each storage unit; the second parameter further includes a fourth number of reading units in the interface circuit; and determining the first loop count for the interface circuit to read data from the memory based on the first parameter and the second parameter includes: determining, based on the third number and the fourth number, a fifth number for the reading units to read from the storage units; determining, based on the bandwidth and the capacity of each storage unit, a sixth number of storage units that the heterogeneous processing circuit reads at one time; and determining the first loop count based on the fifth number and the sixth number.
12. The heterogeneous matching method of claim 10 or 11, wherein determining the first number of the plurality of processing modules based on the first loop count and the second number of the plurality of processing units includes: squaring the first loop count to determine a seventh number; determining the second number of the plurality of processing units based on the seventh number; and determining the first number of the plurality of processing modules based on the second number.
13. The heterogeneous matching method of claim 12, wherein the absolute value of the difference between the first number and the second number is less than or equal to a first preset value, the first preset value being determined based on the second number.
14. A heterogeneous matching apparatus, comprising: an acquisition unit configured to acquire a first parameter of a memory, the first parameter including the capacity of the memory, and to acquire a second parameter of an interface circuit, the second parameter including the bandwidth of the interface circuit; and a matching unit configured to determine, based on the first parameter and the second parameter, a first loop count for the interface circuit to read data from the memory, and to determine a first number of a plurality of processing modules based on the first loop count and a second number of a plurality of processing units.
15. A computer program product, comprising instructions which, when executed by a processor, cause the heterogeneous matching method of any one of claims 10-13 to be performed.
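The matching steps recited in claims 10-13 can be illustrated with a minimal Python sketch. The claims state only which quantities each step depends on, not the formulas, so every formula below is an assumption made for illustration: the even split of storage units across reading units, the division of bandwidth by storage-unit capacity, the equality of the processing-unit count and the squared loop count, and the `tolerance_ratio` parameter. The function names `first_loop_count`, `match_counts`, and `is_matched` are hypothetical and do not appear in the application.

```python
from math import ceil


def first_loop_count(num_storage_units, unit_capacity, num_reading_units, bandwidth):
    """Sketch of claim 11; every formula below is an assumption, not claim text."""
    # Fifth number (assumed): storage units assigned to each reading unit.
    units_per_reader = ceil(num_storage_units / num_reading_units)
    # Sixth number (assumed): storage units covered by one read, with bandwidth
    # and capacity expressed in the same unit (for example, bits).
    units_per_read = max(1, bandwidth // unit_capacity)
    # First loop count (assumed): reads each reading unit needs to cover its share.
    return ceil(units_per_reader / units_per_read)


def match_counts(loop_count):
    """Sketch of claim 12: return (second number, first number)."""
    seventh = loop_count ** 2                      # claim 12: the loop count is squared
    num_processing_units = seventh                 # assumed: second number = seventh number
    num_processing_modules = num_processing_units  # assumed: equal counts
    return num_processing_units, num_processing_modules


def is_matched(num_modules, num_units, tolerance_ratio=0.25):
    """Sketch of claim 13: |first - second| <= preset value derived from second."""
    # Assumed: the preset value is a fixed fraction of the second number.
    preset = int(tolerance_ratio * num_units)
    return abs(num_modules - num_units) <= preset


loops = first_loop_count(num_storage_units=64, unit_capacity=32,
                         num_reading_units=4, bandwidth=128)
units, modules = match_counts(loops)
print(loops, units, modules, is_matched(modules, units))
```

With these illustrative inputs the sketch yields a loop count of 4, 16 processing units, and 16 processing modules, which satisfies the tolerance check of claim 13. Other mappings between the quantities would fit the claim language equally well; the sketch fixes one choice only to make the dependency chain concrete.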