A GPU scheduling apparatus and GPU chip

By introducing a central control module, a dependency configuration module, and gates (selectors) into the GPU chip, and using a dependency mapping table to configure the dependencies between core processing modules flexibly and reliably, the invention addresses the high cost of GPU chip design verification in existing technologies and improves design efficiency and flexibility.

CN116451640B · Active · Publication Date: 2026-06-19 · METAX INTEGRATED CIRCUITS (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
METAX INTEGRATED CIRCUITS (SHANGHAI) CO LTD
Filing Date
2022-01-10
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, the dependencies between the core processing modules of a GPU chip cannot be configured flexibly once placement and routing are done, which makes design verification costly and inefficient. Moreover, any adjustment to the dependencies requires placement and routing to be redone, so the core processing modules cannot be configured flexibly or reliably.

Method used

It employs a central control module, core processing modules, a dependency configuration module, and gates (selectors). A dependency mapping table enables flexible and reliable configuration of the dependencies between core processing modules while the hardware layout and wiring stay unchanged; changing the dependencies only requires adjusting the mapping table.

Benefits of technology

It reduces chip design verification costs, improves design verification efficiency, enables flexible and reliable dependency configuration between core processing modules, and improves chip utilization and design efficiency.

✦ Generated by Eureka AI based on patent content.

Figure CN116451640B_ABST (abstract drawing)

Abstract

This invention relates to a GPU scheduling device and a GPU chip. The device includes a central control module, M core processing modules, and a dependency configuration module. The central control module is connected to each core processing module and distributes task groups that have dependencies. Each core processing module is connected to the dependency configuration module and sends its current task group distribution status to it. The dependency configuration module includes a dependency mapping table and M gates. S_m is the gate corresponding to U_m: it reads from the dependency mapping table the identifier of the task group on which U_m depends, selects the distribution status corresponding to that identifier, and transmits it to U_m. When U_m receives the completion pulse of the task group it depends on, it begins distributing its own current task group. This invention reduces chip design and verification costs and improves chip design flexibility and verification efficiency.

Description

Technical Field

[0001] This invention relates to the field of chip design technology, and more particularly to a GPU scheduling device and a GPU chip.

Background Technology

[0002] A Graphics Processing Unit (GPU) chip typically includes multiple core processing modules, each containing multiple execution modules. The core processing modules often depend on one another when executing tasks: an execution module can distribute its own task group, and execute it in sequence, only after receiving the task group distribution indication from the execution modules it depends on. The final placement and routing of the GPU chip must satisfy these dependencies between the core processing modules. However, even once the core processing modules are designed, these dependencies cannot be determined directly until placement is completed.

[0003] Current technologies typically require the core processing modules to be actually placed and routed before verification, which contradicts the intended design-verification-placement sequence. Furthermore, if placement fails, the design must be modified, placed and routed again, and then re-verified. This process is highly iterative and time-consuming. Moreover, once the placement and routing of the core processing modules are fixed, execution must follow the corresponding dependencies, which limits flexibility, and a failure in a single core processing module renders the entire GPU chip unusable. Therefore, configuring the dependencies of GPU core processing modules flexibly and reliably, and reducing design verification and placement and routing costs during GPU chip design, is a pressing technical challenge.

Summary of the Invention

[0004] The purpose of this invention is to provide a GPU scheduling device and a GPU chip that can flexibly and reliably configure the dependencies between GPU core processing modules while keeping the hardware layout and wiring unchanged, thereby reducing chip design and verification costs and improving chip design and verification efficiency.

[0005] According to a first aspect of the present invention, a GPU scheduling device is provided, comprising a central control module, M core processing modules {U_1, U_2, ..., U_M}, and a dependency configuration module, where U_m is the m-th core processing module and m ranges from 1 to M.

[0006] The central control module is connected to each core processing module and is used to issue task groups with dependencies to the M core processing modules.

[0007] Each core processing module is connected to the dependency configuration module. The core processing module is used to send the corresponding current task group distribution status to the dependency configuration module. When the current task group distribution is completed, a completion pulse is sent to the dependency configuration module. When the task group distribution is not completed, the distribution status is low.
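The status signaling in [0007] can be modeled in software as a minimal sketch. It assumes (this is not stated in the patent) that each core processing module internally holds a "done" level that goes high once distribution of its current task group finishes and stays high; the hypothetical helper `completion_pulse` turns that level into the one-cycle completion pulse described above, keeping the status line low at all other times. Using a rising-edge detector is an illustrative choice, not a detail taken from the patent.

```python
def completion_pulse(done_level: list[int]) -> list[int]:
    """Convert a per-cycle 'distribution done' level into a one-cycle pulse.

    Hypothetical model: done_level[t] is 1 from the cycle in which the current
    task group finishes distributing onward. The returned status line is high
    only on the first such cycle (a rising-edge detector) and low otherwise.
    """
    status = []
    previous = 0
    for level in done_level:
        # High only when the level rises from 0 to 1.
        status.append(1 if level == 1 and previous == 0 else 0)
        previous = level
    return status


# Distribution runs for three cycles, then completes.
print(completion_pulse([0, 0, 0, 1, 1, 1]))  # [0, 0, 0, 1, 0, 0]
```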

[0008] The dependency configuration module includes a pre-configured dependency mapping table and M gates {S_1, S_2, ..., S_M}. Each gate is connected to all core processing modules and is used to obtain the current task group distribution status of each core processing module.

[0009] The dependency mapping table is used to configure the mapping relationship between each core processing module identifier and the core processing module identifiers it depends on.

[0010] S_m is the gate (strobe) corresponding to U_m. S_m reads from the dependency mapping table the identifier of the task group on which U_m depends, selects the task group distribution status corresponding to that identifier, and transmits it to U_m. When U_m receives the completion pulse, it begins distributing its current task group.
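Functionally, each gate S_m behaves like an M-to-1 multiplexer whose select input comes from the dependency mapping table. The sketch below assumes a hypothetical software representation: the table is a Python dict mapping each module identifier to the identifier it depends on, and the status lines are a dict of 0/1 values. The example table is the predicted M = 4 configuration from [0023].

```python
def gate_output(m: int, mapping_table: dict[int, int], statuses: dict[int, int]) -> int:
    """Output of gate S_m (hypothetical multiplexer model).

    mapping_table[m] is the identifier of the module that U_m depends on;
    statuses[k] is the current distribution status line of U_k (0 = low,
    1 = completion pulse). S_m sees every status but forwards only one.
    """
    return statuses[mapping_table[m]]


# U1 depends on U3, U3 on U2, U2 on U4, U4 on U1.
table = {1: 3, 3: 2, 2: 4, 4: 1}
statuses = {1: 0, 2: 0, 3: 1, 4: 0}  # U3 has just finished distributing
starts = [m for m in table if gate_output(m, table, statuses) == 1]
print(starts)  # [1]
```

Only U1 sees the pulse, because only its table entry names U3; the other gates forward low lines.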

[0011] According to a second aspect of the present invention, a GPU chip is provided, including the GPU scheduling device.

[0012] Compared with existing technologies, this invention has significant advantages and beneficial effects. Through the above technical solution, the GPU scheduling device and GPU chip provided by this invention achieve considerable technological advancement and practicality, and have broad industrial application value. It possesses at least the following advantages:

[0013] This invention enables flexible and reliable configuration of the dependencies between GPU core processing modules while keeping the hardware layout and routing unchanged. It does so by providing a dependency configuration module that combines selectors with a dependency mapping table. The invention decouples the front end and back end of chip design, allowing layout verification of any dependency relationship even when the target circuit layout is unknown. After successful verification, only the dependencies corresponding to the target circuit routing need to be configured in the dependency mapping table; no re-layout is required. This reduces chip design verification costs and improves chip design verification efficiency.

[0014] The above description is merely an overview of the technical solution of the present invention. In order to better understand the technical means of the present invention and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Attached Figure Description

[0015] Figure 1 is a schematic diagram of a GPU scheduling device provided in an embodiment of the present invention.

Detailed Implementation

[0016] To further illustrate the technical means and effects adopted by the present invention to achieve the intended purpose, the following detailed description, in conjunction with the accompanying drawings and preferred embodiments, describes in detail the specific implementation methods and effects of a GPU scheduling device and GPU chip proposed according to the present invention.

[0017] This invention provides a GPU scheduling device which, as shown in Figure 1, includes a central control module (represented by TC in Figure 1), M core processing modules {U_1, U_2, ..., U_M}, and a dependency configuration module, where U_m is the m-th core processing module, m ranges from 1 to M, and M is a positive integer greater than or equal to 2. The central control module is connected to each core processing module and is used to distribute task groups with dependencies to the M core processing modules. It should be noted that the central control module is connected to upper-layer software and receives the task groups issued by that software. For the same program, it distributes a chain of task groups with dependencies, {WG_1, WG_2, ..., WG_N}, where WG_n is the n-th task group, n ranges from 1 to N, and N is a positive integer greater than or equal to 2. To keep resources balanced, the central control module distributes the task groups WG_n to different core processing modules according to a preset distribution rule, and the dependency between task groups means that WG_n begins distribution only after the core processing module corresponding to WG_(n-1) has completed its distribution. Note that these dependencies concern the order of task distribution, not the order of task execution. Accordingly, the final chip layout design should ensure that the dependencies between the task groups distributed by the central control module under the preset distribution rule are consistent with the dependencies between the corresponding core processing modules.
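To illustrate how a preset distribution rule turns the task-group chain into dependencies between core processing modules, the sketch below assumes (hypothetically) that each task group WG_n is handed to a single module, given by `assignment[n-1]`, and that each module has one entry in the mapping table. Because WG_n may start only after WG_(n-1) has been distributed, the module receiving WG_n must depend on the module that received WG_(n-1). A rule that would make one module depend on two different modules cannot be expressed with one entry per module, so the sketch rejects it.

```python
def required_dependencies(assignment: list[int]) -> dict[int, int]:
    """Derive {module: module it depends on} from a task-group assignment.

    assignment[i] is the module that distributes WG_(i+1) (hypothetical
    one-module-per-task-group rule). WG_n may start only after WG_(n-1) is
    distributed, so assignment[i] depends on assignment[i-1].
    """
    deps: dict[int, int] = {}
    for previous, current in zip(assignment, assignment[1:]):
        existing = deps.setdefault(current, previous)
        if existing != previous:
            raise ValueError(
                f"U{current} would depend on both U{existing} and U{previous}"
            )
    return deps


# Round-robin over four modules, two full rounds: WG1..WG8.
print(required_dependencies([1, 2, 3, 4, 1, 2, 3, 4]))
# {2: 1, 3: 2, 4: 3, 1: 4}
```

A round-robin rule naturally yields a circular set of module dependencies, which matches the circular configuration the description prefers.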

[0018] Each core processing module is connected to the dependency configuration module and sends its current task group distribution status to it. When the current task group distribution is complete, the module sends a completion pulse to the dependency configuration module; while distribution is incomplete, the distribution status is low. The dependency configuration module can therefore read the current task group distribution status of every core processing module in real time. It should be noted that each core processing module includes a management module and execution modules: task groups are distributed to the corresponding execution modules, and the management module is responsible for sending the current task group distribution status to the dependency configuration module.

[0019] The dependency configuration module (represented by TD in Figure 1) includes a pre-configured dependency mapping table (Table) and M gates {S_1, S_2, ..., S_M}. Each gate is connected to all core processing modules to obtain the current task group distribution status of each core processing module. The dependency mapping table is used to configure the mapping relationship between each core processing module identifier and the identifiers of the core processing modules it depends on. Because each gate is connected to all core processing modules, it can obtain the current task group distribution status of every core processing module, which, together with the dependency mapping table, allows the dependencies between core processing modules to be configured flexibly.

[0020] S_m is the gate (strobe) corresponding to U_m. S_m reads from the dependency mapping table the identifier of the task group on which U_m depends, selects the task group distribution status corresponding to that identifier, and transmits it to U_m. When U_m receives the completion pulse, it begins distributing its current task group. Based on the dependency mapping table, each selector selects the read path from the core processing module on which its own core processing module depends, so that each core processing module receives only the current task group distribution status of the module it depends on.
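The gate behavior of [0020] can be checked with a small cycle-level simulation. It is a sketch under several assumptions that the patent does not state: every task group finishes distributing in one cycle, only one module distributes at a time, and a designated module (the hypothetical parameter `first`) starts the ring without waiting. In each cycle the active module pulses its status line, every gate S_m forwards only the status of the module named in the table, and the module whose gate sees the pulse distributes next.

```python
def simulate(mapping_table: dict[int, int], first: int, cycles: int) -> list[int]:
    """Return the module that distributes in each cycle.

    Assumptions (hypothetical): one-cycle distribution, one active module at a
    time, and `first` is seeded to start without waiting for a pulse.
    """
    order = []
    active = first
    for _ in range(cycles):
        order.append(active)
        # The active module finishes and emits its completion pulse.
        status = {m: int(m == active) for m in mapping_table}
        # Gate S_m forwards only the status of the module U_m depends on.
        gate_out = {m: status[mapping_table[m]] for m in mapping_table}
        starters = [m for m, pulse in gate_out.items() if pulse]
        if len(starters) != 1:
            raise ValueError(f"expected one module to start, got {starters}")
        active = starters[0]
    return order


# U1 depends on U3, U3 on U2, U2 on U4, U4 on U1 (the M = 4 example).
print(simulate({1: 3, 3: 2, 2: 4, 4: 1}, first=1, cycles=5))  # [1, 4, 2, 3, 1]
```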

[0021] As one embodiment, each selector and each core processing module are connected by two sets of circuit lines: input circuit lines and output circuit lines. It should be noted that Figure 1 does not show every circuit line; it only illustrates the basic architecture of the device and the connection relationships between its hardware components. In Figure 1, each selector receives the distribution statuses of the M task groups corresponding to the M core processing modules, as indicated by the multiple arrows above the selector; by querying the dependency mapping table, it selects the corresponding task group distribution status to output. The single arrow below the selector indicates that its output is connected to the core processing module corresponding to that selector. Each core processing module first sends its current task group distribution status to the selectors via the input circuit lines, and each selector then sends the current task group distribution status of the module that its core processing module depends on back to that core processing module via its output circuit line.
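The fixed wiring of [0021] can be written down as a netlist in which every module's status output fans out to all M selectors and each selector drives exactly one module. The sketch below uses hypothetical port names (`status`, `in{k}`, `out`, `dep_status`) that do not appear in the patent. Its point is that the netlist depends only on M, not on the contents of the dependency mapping table, which is why changing dependencies requires no re-routing.

```python
def build_netlist(m_modules: int) -> list[tuple[str, str]]:
    """List the point-to-point wires of the scheduling device for M modules.

    Port names are hypothetical. The result never reads the mapping table:
    the same wiring serves every dependency configuration.
    """
    wires = []
    for s in range(1, m_modules + 1):
        # Input circuit lines: every U_k feeds its status into selector S_s.
        for k in range(1, m_modules + 1):
            wires.append((f"U{k}.status", f"S{s}.in{k}"))
        # Output circuit line: S_s drives only its own module U_s.
        wires.append((f"S{s}.out", f"U{s}.dep_status"))
    return wires


netlist = build_netlist(4)
print(len(netlist))  # 20: 4*4 input lines + 4 output lines
```

In general this gives M * M input lines plus M output lines.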

[0022] It should be noted that in existing technologies, each chip verification can only verify the fixed hardware arrangement corresponding to one specific set of dependencies, and when those dependencies need adjustment, the design, verification, and hardware layout and routing must all be reworked. In contrast, the device described in this invention only requires the dependencies in the dependency mapping table to be reconfigured. As one embodiment, the dependencies between core processing modules need not be considered when designing the circuit layout of the core processing modules of the GPU chip. After a successful layout, the hardware layout and connection relationships of the device are kept as given by the layout result, and the dependencies in the dependency mapping table are reconfigured as needed and updated in the dependency configuration module. Preferably, the task group identifiers in the dependency mapping table form a circular dependency, in which each task group identifier depends on another task group identifier and is in turn depended upon by yet another, and each task group identifier is depended upon only once.
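The preferred circular configuration in [0022] means that the table forms a single cycle through all identifiers: every identifier depends on exactly one other and is depended upon exactly once. A minimal check, assuming the table is stored as a Python dict from identifier to the identifier it depends on (a hypothetical representation), first confirms that each identifier is depended upon exactly once, then follows the chain from one entry and confirms it returns to its start only after visiting every entry.

```python
def is_single_ring(mapping_table: dict[int, int]) -> bool:
    """True if the (hypothetical) table forms one cycle covering every module."""
    if not mapping_table:
        return False
    # Each identifier must be depended upon exactly once.
    if sorted(mapping_table.values()) != sorted(mapping_table):
        return False
    start = next(iter(mapping_table))
    current, visited = start, set()
    while current not in visited:
        visited.add(current)
        current = mapping_table[current]
    # A single cycle returns to the start only after visiting every module.
    return current == start and len(visited) == len(mapping_table)


print(is_single_ring({1: 3, 3: 2, 2: 4, 4: 1}))  # True
print(is_single_ring({1: 2, 2: 1, 3: 4, 4: 3}))  # False: two separate rings
```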

[0023] Taking M = 4 as an example, before layout design the dependency mapping table is configured with a predicted set of dependencies: U1 depends on U3, U3 depends on U2, U2 depends on U4, and U4 depends on U1. Task groups can then be distributed according to this table to verify the correctness of the dependencies between the core processing modules. If the prediction does not match the relative positions (dependencies) of the core processing modules after the actual layout, the dependency mapping table can be readjusted and verification repeated. For example, the actual layout might require U1 to depend on U2, U2 on U3, U3 on U4, and U4 on U1, after which verification is performed again. Once the target dependencies are confirmed, they can be set directly in the dependency mapping table without redoing the hardware layout and routing.
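The M = 4 reconfiguration in [0023] amounts to swapping one dict for another while the wiring stays the same. In the sketch below, the hypothetical helper `ring_order` lists the modules in the order they would distribute around a circular table, under the assumption that the chosen `start` module goes first; each next module is the one whose table entry names the module that just finished.

```python
def ring_order(mapping_table: dict[int, int], start: int) -> list[int]:
    """Order in which modules distribute around a circular dependency table.

    Assumes (hypothetically) that `start` distributes first and that the
    table is a single ring, so every module is depended upon exactly once.
    """
    waits_on = {dep: module for module, dep in mapping_table.items()}
    order = [start]
    while len(order) < len(mapping_table):
        order.append(waits_on[order[-1]])
    return order


predicted = {1: 3, 3: 2, 2: 4, 4: 1}  # table configured before layout
adjusted = {1: 2, 2: 3, 3: 4, 4: 1}   # table adjusted after actual layout

# Only the table contents change; the wiring is untouched.
print(ring_order(predicted, start=1))  # [1, 4, 2, 3]
print(ring_order(adjusted, start=1))   # [1, 4, 3, 2]
```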

[0024] In existing technologies, once the placement and routing of the core processing modules in a GPU chip are set, they can only be used based on the dependencies between the core processing modules corresponding to the current placement and routing. It is impossible to adapt to dependencies between other core processing modules, thus making it impossible to achieve virtual grouping of core processing modules. However, the device described in this invention, in addition to being used for one-time convergence of the chip design verification placement process, can also achieve virtual grouping of core processing modules through flexible configuration of a dependency mapping table. As one embodiment, if the GPU chip is used for virtual grouping of core processing modules, the hardware layout and connection relationships of the device remain unchanged. The task group identifiers are divided into R virtual groups, each virtual group including multiple core processing modules. Each core processing module resides in only one virtual group, and each virtual group is independent of the others, configured in the dependency mapping table. Within each virtual group, the task group identifiers within the group have a circular dependency relationship. Each task group identifier within the group depends on another task group identifier within the group and is simultaneously depended on by yet another task group identifier within the group. Each task group identifier within a group is depended on only once. The device enables virtual group processing of the core processing module by flexibly configuring the dependencies between task group identifiers within each virtual group. This is flexible, reliable, and the hardware can be reused, greatly reducing the cost of chip design verification and chip manufacturing.
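A table for R virtual groups, as described in [0024], can be built by making each group its own ring. This is a minimal sketch with a hypothetical helper, `build_grouped_table`: within a group listed as [g0, g1, ..., gk], each member depends on the previous one and g0 depends on gk, so no entry ever points outside its own group. The requirement of at least two modules per group follows the text's "multiple core processing modules"; the specific ring order within a group is an illustrative choice.

```python
def build_grouped_table(groups: list[list[int]]) -> dict[int, int]:
    """Build a mapping table in which each virtual group is its own ring.

    Within group [g0, g1, ..., gk], g_i depends on g_(i-1) and g0 depends on gk.
    Raises ValueError if a group is too small or a module appears twice.
    """
    table: dict[int, int] = {}
    for group in groups:
        if len(group) < 2:
            raise ValueError("each virtual group needs at least two modules")
        for i, module in enumerate(group):
            if module in table:
                raise ValueError(f"U{module} appears more than once")
            table[module] = group[i - 1]  # i - 1 == -1 wraps to the last member
    return table


# R = 2 virtual groups over eight modules.
print(build_grouped_table([[1, 2, 3, 4], [5, 6, 7, 8]]))
# {1: 4, 2: 1, 3: 2, 4: 3, 5: 8, 6: 5, 7: 6, 8: 7}
```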

[0025] In existing technologies, once the core processing modules in a GPU chip are laid out and routed, they can only be used with the dependencies corresponding to that layout and routing. When one core processing module fails, the chain of dependencies between the core processing modules is broken and the entire GPU chip becomes unusable. With the device described in this invention, if some core processing modules fail (e.g., due to manufacturing defects) and become unusable, the hardware layout and connection relationships are kept unchanged, the identifiers of the failed core processing modules are deleted, and a new dependency mapping table is built from the identifiers of the functioning core processing modules and updated in the dependency configuration module. The GPU chip therefore remains usable even if some core processing modules fail, which improves GPU chip utilization.
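Rebuilding the table after a failure, as in [0025], can be sketched as dropping the failed identifiers and closing a new ring over the remaining ones. The helper `rebuild_without_failed` is hypothetical, and ordering the healthy modules by ascending identifier is an illustrative assumption; any cyclic order over them would satisfy the circular rule.

```python
def rebuild_without_failed(module_ids: list[int], failed: set[int]) -> dict[int, int]:
    """Rebuild a single-ring mapping table over the modules that still work.

    Hypothetical helper: healthy modules are ringed in ascending identifier
    order, each depending on its predecessor and the first on the last.
    """
    healthy = sorted(m for m in module_ids if m not in failed)
    if not healthy:
        raise ValueError("no working core processing modules remain")
    return {m: healthy[i - 1] for i, m in enumerate(healthy)}


# U3 is found defective after manufacturing; the wiring is unchanged.
print(rebuild_without_failed([1, 2, 3, 4], failed={3}))  # {1: 4, 2: 1, 4: 2}
```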

[0026] This invention also provides a GPU chip, including the GPU scheduling device described in this invention.

[0027] The GPU scheduling device and GPU chip described in this invention can flexibly and reliably configure the dependencies between GPU core processing modules while keeping the hardware layout and wiring of the device unchanged. This is achieved by providing a dependency configuration module that combines selectors with a dependency mapping table. This invention decouples the front end and back end of chip design, enabling layout verification of any dependency relationship even when the target circuit layout is unknown. After successful verification, only the dependency relationship corresponding to the target circuit wiring needs to be configured in the dependency mapping table; no re-layout is required. This invention reduces chip design verification costs and improves chip design verification efficiency.

[0028] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.

Claims

1. A GPU scheduling device, characterized in that it includes a central control module, M core processing modules {U_1, U_2, ..., U_M}, and a dependency configuration module, where U_m is the m-th core processing module and m ranges from 1 to M. The central control module is connected to each core processing module and is used to issue task groups with dependencies to the M core processing modules. Each core processing module is connected to the dependency configuration module and is used to send its current task group distribution status to the dependency configuration module; when the current task group distribution is completed, a completion pulse is sent to the dependency configuration module, and when the task group distribution is not completed, the distribution status is low. The dependency configuration module comprises a pre-configured dependency mapping table and M gates {S_1, S_2, ..., S_M}, each of which is connected to all the core processing modules and used to obtain the current task group distribution status of each core processing module. The dependency mapping table is used to configure the mapping relationship between each core processing module identifier and the core processing module identifiers it depends on. S_m is the gate corresponding to U_m; S_m is used to read from the dependency mapping table the identifier of the task group on which U_m depends, select the task group distribution status corresponding to that identifier, and transmit it to U_m; when U_m receives the completion pulse, it begins distributing its current task group.

2. The apparatus according to claim 1, characterized in that each selector and each core processing module are connected by two sets of circuit lines, namely input circuit lines and output circuit lines. The core processing module first sends its current task group distribution status to the selector through the input circuit lines, and the selector sends, through the output circuit line, the current task group distribution status of the core processing module on which that core processing module depends.

3. The apparatus according to claim 1 or 2, characterized in that, if the GPU chip is used for core processing module circuit layout design, when the layout design of the core processing modules needs to change, the hardware layout and connection relationships of the device remain unchanged, and the dependencies in the dependency mapping table are reconfigured as needed and updated in the dependency configuration module.

4. The apparatus according to claim 3, characterized in that, in the dependency mapping table, the task group identifiers are in a circular dependency relationship, where each task group identifier depends on another task group identifier and is also depended on by yet another task group identifier, and each task group identifier is depended on only once.

5. The apparatus according to claim 1 or 2, characterized in that, if the GPU chip is used for virtual grouping of core processing modules, the hardware layout and connection relationships of the device remain unchanged, and the task group identifiers are divided into R virtual groups configured in the dependency mapping table, where each virtual group includes multiple core processing modules, each core processing module is located in only one virtual group, and the virtual groups are independent of each other.

6. The apparatus according to claim 5, characterized in that, in each virtual group, the task group identifiers within the group have a circular dependency relationship: each task group identifier within the group depends on another task group identifier within the group and is also depended on by yet another task group identifier within the group, and each task group identifier within the group is depended on only once.

7. The apparatus according to claim 1 or 2, characterized in that, if some core processing modules in the GPU chip malfunction and become unusable, the hardware layout and connection relationships remain unchanged, the identifiers of the malfunctioning core processing modules are deleted, and a dependency mapping table is re-established from the identifiers of the functioning core processing modules and updated in the dependency configuration module.

8. A GPU chip, comprising the GPU scheduling device according to any one of claims 1-7.