A graphics processor, an instruction thread scheduling method, a device and a storage medium

By introducing a thread assembler, a global counter, and a launch controller into the graphics processor, and using dwell time and saturation registers to filter target instruction slots, the problem of high hardware overhead in graphics processors is solved, and processing efficiency and performance are improved.

CN122243719A · Pending · Publication Date: 2026-06-19 · RICUN TECH (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
RICUN TECH (SHANGHAI) CO LTD
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In graphics processing units (GPUs), the divergence of different data execution paths leads to low instruction processing efficiency. Existing technologies have high hardware overhead when choosing to force issue instruction slots, which affects processing performance.

Method used

Deploying a thread assembler, global counter, and issue controller in the graphics processor, and using dwell time registers and saturation registers to filter target instruction slots, reduces hardware overhead and improves processing performance.

Benefits of technology

By reducing the complexity of hardware comparison circuitry and decision latency, the performance of the graphics processor in processing instruction threads is improved, saving hardware resources.


Figure CN122243719A_ABST

Abstract

This invention discloses a graphics processing unit (GPU), an instruction thread scheduling method, a device, and a storage medium, relating to the field of computer technology. The GPU includes a thread assembler, a global counter, and a launch controller. Each instruction slot in the thread assembler corresponds to a dwell time register and a saturation register. The global counter updates the current count value after the thread assembler receives an instruction thread to be processed. The dwell time register updates the current value based on the global counter's count value. The saturation register updates the current value after an instruction thread is pushed into an instruction slot. The thread assembler selects a target instruction slot from multiple occupied instruction slots. The launch controller forces the launch of multiple instruction threads in the target instruction slot. The technical solution of this invention can reduce the hardware overhead caused by the instruction thread scheduling process, save GPU hardware resources, and improve the GPU's instruction thread processing performance.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a graphics processor, an instruction thread scheduling method, a device, and a storage medium.

Background Technology

[0002] In a Graphics Processing Unit (GPU), Single Instruction Multiple Data (SIMD) is a parallel computing architecture that allows the GPU to perform the same operation on multiple data points simultaneously using a single instruction. However, for GPUs, not all data points have the same execution path; different data points may correspond to different computational methods. In such cases, the GPU cannot achieve parallel data execution, resulting in low instruction processing efficiency.

[0003] To address the low processing efficiency caused by divergent data execution paths, multiple instruction threads with the same execution path can be pushed into the same instruction slot to form a thread bundle for parallel execution. However, because the number of instruction slots is limited, when the execution path of a new instruction thread received by the graphics processor differs from that of every instruction slot, one of the occupied instruction slots must be selected and forcibly issued to free storage space for the new instruction thread.

[0004] In existing technologies, when selecting an instruction slot to forcibly issue from a full set of instruction slots, a timer is typically configured for each instruction slot. The timer values of the instruction slots are then compared pairwise or one by one, and the instruction slot with the largest timer value is selected for issue. However, this one-by-one comparison incurs significant hardware overhead, wasting hardware resources and degrading the processing performance of the graphics processor.

Summary of the Invention

[0005] This invention provides a graphics processor, an instruction thread scheduling method, a device, and a storage medium, which can reduce the hardware overhead caused by the instruction thread scheduling process, save hardware resources of the graphics processor, and improve the graphics processor's processing performance for instruction threads.

[0006] According to one aspect of the present invention, a graphics processor is provided, the graphics processor including a thread assembler, a global counter and a launch controller; the thread assembler includes a plurality of instruction slots, each instruction slot corresponding to a dwell time register and a saturation register; the global counter is used to update the current count value after the thread assembler receives an instruction thread to be processed; the dwell time register is used to update its current value based on the count value of the global counter; the saturation register is used to update its current value after an instruction thread is pushed into the instruction slot; the thread assembler is used to select a target instruction slot from among the multiple occupied instruction slots when the execution path of the instruction thread to be processed does not match the execution path of any instruction slot; and the launch controller is used to force the launch of the multiple instruction threads in the target instruction slot.

[0007] According to another aspect of the present invention, an instruction thread scheduling method is provided, applied to a graphics processor; the method includes: after receiving the target instruction thread to be processed, determining whether the execution path of the target instruction thread matches the execution path of any instruction slot in the thread assembler; if no instruction slot in the thread assembler matches the target instruction thread, selecting a target instruction slot from among the multiple occupied instruction slots according to the saturation mask value and dwell time mask value corresponding to each instruction slot; and forcibly issuing the multiple original instruction threads in the target instruction slot, and then pushing the target instruction thread into the target instruction slot.

[0008] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising: at least one graphics processor; and a memory communicatively connected to the at least one graphics processor; wherein the memory stores a computer program that can be executed by the at least one graphics processor, the computer program being executed by the at least one graphics processor to enable the at least one graphics processor to execute the instruction thread scheduling method according to any embodiment of the present invention.

[0009] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a graphics processor to implement the instruction thread scheduling method described in any embodiment of the present invention when executed.

[0010] According to another aspect of the present invention, a computer program product is provided, the computer program product comprising a computer program that, when executed by a graphics processor, implements the instruction thread scheduling method described in any embodiment of the present invention.

[0011] The technical solution provided by this invention deploys a thread assembler, a global counter, and a launch controller in a graphics processor. The thread assembler includes multiple instruction slots, each corresponding to a dwell time register and a saturation register. The global counter updates the current count value after the thread assembler receives an instruction thread to be processed. The dwell time register updates its current value based on the count value of the global counter. The saturation register updates its current value after an instruction thread is pushed into an instruction slot. When the execution path of the instruction thread to be processed does not match the execution path of any instruction slot, the thread assembler selects a target instruction slot from the multiple occupied instruction slots based on the values of the dwell time register and the saturation register. The launch controller forces the launch of the multiple instruction threads in the target instruction slot. This technical means can reduce the hardware overhead caused by the instruction thread scheduling process, save the hardware resources of the graphics processor, and improve the processing performance of the graphics processor for instruction threads.

[0012] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Description of the Drawings

[0013] To more clearly illustrate the technical solutions in this invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0014] Figure 1 is a schematic diagram of the structure of a graphics processor according to Embodiment 1 of the present invention; Figure 2 is a schematic diagram of another graphics processor provided according to Embodiment 2 of the present invention; Figure 3 is a flowchart of an instruction thread scheduling method provided in Embodiment 3 of the present invention; Figure 4 is a flowchart of another instruction thread scheduling method provided in Embodiment 4 of the present invention; Figure 5 is a schematic diagram of the structure of an electronic device that implements the instruction thread scheduling method of this invention.

Detailed Implementation

[0015] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0016] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0017] This first embodiment provides a graphics processor. Figure 1 is a schematic diagram of the structure of the graphics processor. As shown in Figure 1, the graphics processor 11 includes a thread assembler 101, a global counter 102, and a launch controller 103; the thread assembler 101 includes a plurality of instruction slots 104, each instruction slot 104 corresponding to a dwell time register 105 and a saturation register 106.

[0018] The global counter 102 is used to update the current count value after the thread assembler 101 receives an instruction thread to be processed. The instruction thread can be a shading instruction thread corresponding to a ray tracing task or a ray classification task in a graphics rendering scene. The count value of the global counter 102, global_cnt, is 5 bits wide. Each time the thread assembler 101 receives an instruction thread to be processed, the global counter 102 increments global_cnt by 1.

[0019] The dwell time register 105 is used to update the current value based on the count value of the global counter 102. As the number of instruction threads to be processed in the thread assembler 101 increases, global_cnt will also continue to increase. When global_cnt reaches a preset threshold (e.g., 0b11111), the dwell time register 105 corresponding to each instruction slot 104 in the thread assembler 101 can update the current value. The value of the dwell time register 105 is used to reflect the dwell time of the corresponding instruction slot 104.

[0020] The saturation register 106 is used to update its current value after an instruction thread is pushed into the instruction slot 104. Specifically, each time an instruction thread is pushed into an instruction slot 104, the corresponding saturation register 106 can be incremented by 1. The value of the saturation register 106 reflects the saturation of the corresponding instruction slot 104.

[0021] Upon receiving an instruction thread to be processed, the thread assembler 101 determines whether the execution path of that thread matches the execution path of any instruction slot 104. If the execution path of the thread to be processed does not match the execution path of any instruction slot 104, the thread assembler 101 selects a target instruction slot from the multiple occupied instruction slots based on the values of the dwell time register 105 and the saturation register 106. The launch controller 103 is used to force the launch of the multiple instruction threads in the target instruction slot.

[0022] The advantage of this configuration is that, whereas the existing technology compares the timer values of the instruction slots pairwise or one by one and therefore requires multiple comparison operations, this embodiment can compare multiple bits of the dwell time register and the saturation register with a single instruction, without performing multi-way comparisons of complete timer values. This makes full use of the parallel computing capability of the graphics processor hardware, reduces the complexity and decision latency of the hardware comparison circuit, and greatly improves the graphics processor's processing performance for instruction threads.

[0023] The technical solution provided by this invention deploys a thread assembler, a global counter, and a launch controller in a graphics processor. The thread assembler includes multiple instruction slots, each corresponding to a dwell time register and a saturation register. The global counter updates the current count value after the thread assembler receives an instruction thread to be processed. The dwell time register updates its current value based on the count value of the global counter. The saturation register updates its current value after an instruction thread is pushed into an instruction slot. When the execution path of the instruction thread to be processed does not match the execution path of any instruction slot, the thread assembler selects a target instruction slot from the multiple occupied instruction slots based on the values of the dwell time register and the saturation register. The launch controller forces the launch of the multiple instruction threads in the target instruction slot. This technical means can reduce the hardware overhead caused by the instruction thread scheduling process, save the hardware resources of the graphics processor, and improve the processing performance of the graphics processor for instruction threads.

[0024] This second embodiment provides another graphics processor. Figure 2 is a schematic diagram of the structure of the graphics processor. As shown in Figure 2, the dwell time register 105 includes a dwell period register 107 and a dwell mask register 108.

[0025] The dwell period register 107 is used to update the current value when the count value of the global counter 102 reaches the threshold; the dwell mask register 108 is used to update the current value according to the comparison result between the value of the dwell period register 107 and the period threshold.

[0026] Specifically, the dwell period register 107 is used to count the 4-bit global count period value round_cnt, which reflects the dwell time of each instruction slot. When the count value global_cnt of the global counter 102 reaches a preset threshold (e.g., 0b11111), the dwell period register 107 can increment the current value round_cnt by 1, while the global counter 102 sets the count value to 0.
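As an illustrative, non-limiting sketch, the interaction between global_cnt and round_cnt described above can be modelled in software as follows. The function name `on_thread_received` is hypothetical, and saturating round_cnt at its 4-bit maximum is an assumption, since the text does not state what happens on overflow.

```python
GLOBAL_CNT_THRESHOLD = 0b11111  # example threshold from the text (5-bit global_cnt)
ROUND_CNT_MAX = 0b1111          # assumption: the 4-bit round_cnt saturates, it does not wrap


def on_thread_received(global_cnt, round_cnts):
    """Model one thread arrival; return the updated (global_cnt, round_cnts)."""
    global_cnt += 1
    if global_cnt == GLOBAL_CNT_THRESHOLD:
        # Threshold reached: age every tracked slot by one period and restart the count.
        round_cnts = [min(r + 1, ROUND_CNT_MAX) for r in round_cnts]
        global_cnt = 0
    return global_cnt, round_cnts


# Usage: 31 arrivals complete exactly one period for two slots.
cnt, rounds = 0, [0, 3]
for _ in range(31):
    cnt, rounds = on_thread_received(cnt, rounds)
```

After the loop, the global count has returned to 0 and each slot's period value has advanced by one.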

[0027] The graphics processor 11 can preset multiple period thresholds, including a first period threshold (tbu_age_threshold_bit0), a second period threshold (tbu_age_threshold_bit1), a third period threshold (tbu_age_threshold_bit2), and a fourth period threshold (tbu_age_threshold_bit3).

[0028] In this embodiment, the first period threshold is less than the second period threshold, which is less than the third period threshold, which in turn is less than the fourth period threshold. By default, the first period threshold is 1 (value range [0,1]), the second is 2 (value range [2,3]), the third is 4 (value range [4,7]), and the fourth is 8 (value range [8,15]). The specific values can be adjusted as needed, and this embodiment does not restrict them.
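A configuration check for these thresholds might look like the following sketch. It assumes that each "value range" is the inclusive range a threshold may be set to, which is one reading of the text; the names `AGE_THRESHOLD_RANGES` and `age_thresholds_valid` are hypothetical.

```python
# Inclusive range of each threshold as listed in the text (assumed to be the permitted range).
AGE_THRESHOLD_RANGES = ((0, 1), (2, 3), (4, 7), (8, 15))
DEFAULT_AGE_THRESHOLDS = (1, 2, 4, 8)   # tbu_age_threshold_bit0 .. tbu_age_threshold_bit3


def age_thresholds_valid(thresholds):
    """Check that four thresholds lie in their ranges and strictly increase."""
    if len(thresholds) != len(AGE_THRESHOLD_RANGES):
        return False
    in_range = all(lo <= t <= hi
                   for t, (lo, hi) in zip(thresholds, AGE_THRESHOLD_RANGES))
    # The listed ranges are disjoint and ordered, so this second check is
    # redundant here; it is kept to state the ordering requirement explicitly.
    increasing = all(a < b for a, b in zip(thresholds, thresholds[1:]))
    return in_range and increasing
```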

[0029] The dwell mask register 108 is used to update the 4-bit dwell time mask value age_mask based on the comparison result between the value of the dwell period register 107 and the period thresholds.

[0030] For example, if the value of the dwell period register 107 is greater than or equal to the first period threshold, the dwell mask register 108 sets bit 0 of the dwell time mask value to 1; if the value of the dwell period register 107 is greater than or equal to the second period threshold, the dwell mask register 108 sets bit 1 of the dwell time mask value to 1; if the value of the dwell period register 107 is greater than or equal to the third period threshold, the dwell mask register 108 sets bit 2 of the dwell time mask value to 1; if the value of the dwell period register 107 is greater than or equal to the fourth period threshold, the dwell mask register 108 sets bit 3 of the dwell time mask value to 1.

[0031] Therefore, by reading which bits of the dwell time mask value of each instruction slot are set to 1, the dwell time of each instruction slot can be determined quickly: the more bits set to 1 in the dwell time mask value, the longer the corresponding instruction slot has dwelt.
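For illustration only, the mapping from round_cnt to age_mask can be sketched as below, using the default thresholds given above. The function name `age_mask` mirrors the register value named in the text; the software form is an assumption, not the disclosed circuit.

```python
AGE_THRESHOLDS = (1, 2, 4, 8)   # defaults for tbu_age_threshold_bit0 .. bit3


def age_mask(round_cnt):
    """Set bit i of the 4-bit mask when round_cnt >= the i-th period threshold."""
    mask = 0
    for bit, threshold in enumerate(AGE_THRESHOLDS):
        if round_cnt >= threshold:
            mask |= 1 << bit
    return mask
```

Because the thresholds are strictly increasing, the set bits always form a contiguous run starting at bit 0, so the number of 1 bits grows monotonically with the dwell time.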

[0032] As shown in Figure 2, the saturation register 106 includes a thread count register 109 and a saturation mask register 110; the thread count register 109 is used to update its current value after a new instruction thread is pushed into the instruction slot, and the saturation mask register 110 is used to update its current value based on the comparison result between the value of the thread count register 109 and the count thresholds.

[0033] Specifically, after a new instruction thread is pushed into instruction slot 104, the thread count register 109 can increment the 5-bit number of collected threads, simd_cnt, by 1.

[0034] The graphics processor 11 can preset multiple count thresholds, including a first count threshold (tbu_simd_threshold_bit0), a second count threshold (tbu_simd_threshold_bit1), and a third count threshold (tbu_simd_threshold_bit2). The first count threshold is less than the second, and the second is less than the third. By default, the first count threshold is 8, the second is 16, and the third is 24. These values can be adjusted as needed; this embodiment does not limit them.

[0035] The saturation mask register 110 is used to update the 3-bit saturation mask value accumulate_mask based on the comparison result between the value of the thread count register 109 and the count threshold.

[0036] For example, if the value of the thread count register 109 is greater than or equal to the first count threshold, the saturation mask register 110 sets bit 0 of the saturation mask value to 1; if the value of the thread count register 109 is greater than or equal to the second count threshold, the saturation mask register 110 sets bit 1 of the saturation mask value to 1; if the value of the thread count register 109 is greater than or equal to the third count threshold, the saturation mask register 110 sets bit 2 of the saturation mask value to 1.

[0037] Therefore, by reading which bits of the saturation mask value of each instruction slot are set to 1, the saturation of each instruction slot can be determined quickly: the more bits set to 1 in the saturation mask value, the higher the saturation of the corresponding instruction slot.
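The behaviour of the thread count register and the saturation mask register can be sketched, purely as an illustration, with the small model below. The class name `SaturationRegister` and its method are hypothetical; the default count thresholds are taken from the text.

```python
SIMD_THRESHOLDS = (8, 16, 24)   # defaults for tbu_simd_threshold_bit0 .. bit2


class SaturationRegister:
    """Software model of one slot's simd_cnt and accumulate_mask."""

    def __init__(self):
        self.simd_cnt = 0          # number of collected threads
        self.accumulate_mask = 0   # 3-bit saturation mask

    def push_thread(self):
        """Count one pushed thread and refresh the mask bits."""
        self.simd_cnt += 1
        for bit, threshold in enumerate(SIMD_THRESHOLDS):
            if self.simd_cnt >= threshold:
                self.accumulate_mask |= 1 << bit


# Usage: after 16 pushes, bits 0 and 1 are set and bit 2 is still clear.
reg = SaturationRegister()
for _ in range(16):
    reg.push_thread()
```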

[0038] The thread assembler 101 is also used to push the instruction thread to be processed into the target instruction slot after multiple instruction threads in the target instruction slot have been forcibly issued.

[0039] The technical solution provided by this invention, by deploying a dwell period register and a dwell mask register in the dwell time register, and deploying a thread count register and a saturation mask register in the saturation register, can quickly select the instruction slot with the highest saturation and the longest dwell time as the target instruction slot, saving the time spent on selecting the target instruction slot and improving the graphics processor's processing performance for instruction threads.

[0040] This third embodiment provides an instruction thread scheduling method. Figure 3 is a flowchart of the instruction thread scheduling method. This embodiment is applicable to scheduling the instruction slots and instruction threads of the thread assembler in a graphics processor, and the method can be executed by the graphics processor. As shown in Figure 3, the method includes:

Step 310: After receiving the target instruction thread to be processed, determine whether the execution path of the target instruction thread matches the execution path of any instruction slot in the thread assembler.

[0041] In this embodiment, optionally, after receiving the target instruction thread to be processed, the graphics processor can determine the execution path of the target instruction thread based on the data dependency relationship between the target instruction thread and other instruction threads, or the hardware resource usage corresponding to the target instruction thread.

[0042] In this step, specifically, after the graphics processor determines the execution path of the target instruction thread, it can obtain the execution path of each instruction slot according to the identification information of each instruction slot in the thread assembler, and then determine whether the execution path of the target instruction thread matches the execution path of each instruction slot.

[0043] The execution path of an instruction slot can be understood as the unified path along which the instruction threads collected in that instruction slot are executed. The mapping between the identification information of each instruction slot and its execution path can be written in advance into a preset storage space. The thread assembler is a core scheduling component in the graphics processor rendering pipeline, used to classify and group different types of shading instruction threads (such as vertex shading instruction threads, pixel shading instruction threads, ray tracing shading instruction threads, etc.) and record their execution priorities, resource dependencies, and execution paths.

[0044] In one specific implementation, if there is an instruction slot in the thread assembler that matches the target instruction thread, the target instruction thread can be pushed into the instruction slot so that multiple instruction threads in the instruction slot can be executed in parallel.
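The matching check of step 310 can be sketched as follows. The sketch assumes that an execution path is represented by a comparable identifier and that an idle slot is marked with `None`; neither representation is specified in the text, and `find_matching_slot` is a hypothetical name.

```python
def find_matching_slot(slot_paths, thread_path):
    """Return the index of the first occupied slot whose path equals thread_path.

    slot_paths holds one path identifier per instruction slot, with None for
    an idle slot (assumed representation). Returns None when nothing matches.
    """
    for index, path in enumerate(slot_paths):
        if path is not None and path == thread_path:
            return index
    return None


# Usage: slot 2 collects threads on path "hit_shader"; path "any_hit" has no slot.
paths = ["miss_shader", None, "hit_shader"]
matched = find_matching_slot(paths, "hit_shader")
unmatched = find_matching_slot(paths, "any_hit")
```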

[0045] Step 320: If there is no instruction slot in the thread assembler that matches the target instruction thread, then among the multiple occupied instruction slots, select the target instruction slot according to the saturation mask value and dwell time mask value corresponding to each instruction slot.

[0046] In this step, if there is no instruction slot in the thread assembler that matches the target instruction thread, the instruction slot with the highest saturation and longest dwell time can be selected from the multiple occupied instruction slots based on the saturation mask value and dwell time mask value corresponding to each instruction slot as the target instruction slot.

[0047] In this embodiment, during the process of pushing the received instruction threads into the corresponding instruction slots, the graphics processor can record the saturation mask value and dwell time mask value of the fixed bit width corresponding to each instruction slot in real time according to the saturation of each instruction slot (i.e., the number of instruction threads collected) and the dwell time.

[0048] The advantage of this configuration is that, whereas the existing technology compares the timer values of the instruction slots pairwise or one by one and therefore requires multiple comparison operations, this embodiment can compare multiple bits of the mask value with a single instruction, without performing multi-way comparisons of complete timer values. This makes full use of the parallel computing capability of the graphics processor hardware, reduces the complexity and decision latency of the hardware comparison circuit, and greatly improves the graphics processor's processing performance for instruction threads.
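One possible reading of why mask bits avoid full magnitude comparisons is sketched below. It relies on the masks being built from strictly increasing thresholds, so that a set bit implies all lower bits are set; OR-ing the masks of all slots and testing a single bit then identifies the slots in the highest tier. This is an interpretation offered for illustration, not the disclosed circuit, and `longest_dwell_slots` is a hypothetical name.

```python
def longest_dwell_slots(age_masks):
    """Return indices of the slots whose dwell time mask reaches the highest tier."""
    combined = 0
    for mask in age_masks:
        combined |= mask                      # bitwise OR across all slots
    if combined == 0:
        return list(range(len(age_masks)))    # no slot has aged yet: all tie
    top_bit = combined.bit_length() - 1       # highest tier reached by any slot
    # A single bit test per slot replaces a comparison of full timer values.
    return [i for i, mask in enumerate(age_masks) if (mask >> top_bit) & 1]
```

For masks of this form, the result is the same set of slots that a count of 1 bits would select.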

[0049] Step 330: Forcibly issue the multiple original instruction threads in the target instruction slot, and then push the target instruction thread into the target instruction slot.

[0050] In this step, the multiple original instruction threads collected in the target instruction slot can be combined into a complete thread bundle and submitted to the execution unit for parallel execution. At the same time, the target instruction slot can be marked as idle so that it can store the target instruction thread.

[0051] The technical solution provided in this embodiment, after receiving the target instruction thread to be processed, determines whether the execution path of the target instruction thread matches the execution paths of each instruction slot in the thread assembler; if there is no instruction slot in the thread assembler that matches the target instruction thread, then among the multiple occupied instruction slots, the target instruction slot is selected according to the saturation mask value and dwell time mask value corresponding to each instruction slot; the multiple original instruction threads in the target instruction slot are forcibly issued, and then the target instruction thread is pushed into the target instruction slot. This technical means can reduce the hardware overhead caused by the instruction thread scheduling process, save the hardware resources of the graphics processor, and improve the graphics processor's processing performance for instruction threads.

[0052] Figure 4 is a flowchart of another instruction thread scheduling method provided in Embodiment 4 of the present invention. This embodiment is a further refinement of the above embodiments. As shown in Figure 4, the method includes:

Step 401: Receive the target instruction thread to be processed.

[0053] Step 402: Update the total number of instruction threads received by the thread assembler, and then execute steps 403 and 406 respectively.

[0054] Step 403: Determine whether the total number of instruction threads has reached the preset threshold. If yes, proceed to steps 404-405. If no, return to the operation of step 401.

[0055] Step 404: Update the global count period value corresponding to each instruction slot in the thread assembler, and then execute step 405.

[0056] Step 405: Compare the global count period value corresponding to each instruction slot in the thread assembler with the period thresholds, and determine the dwell time mask value corresponding to each instruction slot based on the comparison result.

[0057] Step 406: Determine whether the execution path of the target instruction thread matches the execution paths of each instruction slot in the thread assembler. If yes, proceed to steps 407-408; otherwise, proceed to steps 409-410.

[0058] Step 407: Push the target instruction thread into the instruction slot, update the number of collected threads corresponding to the instruction slot, and then execute step 408.

[0059] Step 408: Compare the number of collected threads corresponding to the instruction slot with a preset number threshold, and determine the saturation mask value corresponding to the instruction slot based on the comparison result.

[0060] Step 409: Among the multiple occupied instruction slots, select the target instruction slot according to the saturation mask value and dwell time mask value corresponding to each instruction slot, and then execute step 410.

[0061] In one embodiment of this example, among multiple occupied instruction slots, a target instruction slot is selected based on the saturation mask value and dwell time mask value corresponding to each instruction slot. This includes: obtaining the saturation mask value corresponding to each instruction slot among the multiple occupied instruction slots, and selecting the instruction slot with the most 1 bits in the saturation mask value as a candidate instruction slot; obtaining the dwell time mask value corresponding to each candidate instruction slot, and selecting the instruction slot with the most 1 bits in the dwell time mask value as the target instruction slot.

[0062] In one specific embodiment, when selecting the target instruction slot to be issued from multiple occupied instruction slots, the candidate instruction slots with the highest saturation can be obtained first, and then the candidate with the longest dwell time can be selected as the target instruction slot. Optionally, if several candidate instruction slots share the longest dwell time, the instruction slot with the lowest number can be selected as the target instruction slot.
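The two-stage selection just described can be sketched as follows. Each occupied slot is represented here as an `(accumulate_mask, age_mask)` pair indexed by slot number, which is an assumed representation; `select_target_slot` and `ones` are hypothetical names.

```python
def ones(mask):
    """Number of bits set to 1 in a mask value."""
    return bin(mask).count("1")


def select_target_slot(slots):
    """Pick the slot to force-issue from a list of (accumulate_mask, age_mask) pairs."""
    # Stage 1: candidates are the slots with the most 1 bits in the saturation mask.
    best_saturation = max(ones(acc) for acc, _ in slots)
    candidates = [i for i, (acc, _) in enumerate(slots)
                  if ones(acc) == best_saturation]
    # Stage 2: among candidates, take the most 1 bits in the dwell time mask.
    best_age = max(ones(slots[i][1]) for i in candidates)
    # Tie-break: candidates are in ascending order, so the first hit has the lowest number.
    for i in candidates:
        if ones(slots[i][1]) == best_age:
            return i


# Usage: slots 1 to 3 tie on saturation; slots 2 and 3 tie on dwell time.
occupied = [(0b001, 0b1111), (0b011, 0b0001), (0b011, 0b0011), (0b011, 0b0011)]
target = select_target_slot(occupied)
```

Slot 0 has dwelt longest but is less saturated, so it is excluded in the first stage; the tie between slots 2 and 3 resolves to the lower number.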

[0063] The advantage of this arrangement is that, compared with the existing technology, which requires comparing the complete timer values of all instruction slots one by one, this embodiment only needs to check whether the saturation mask value and the dwell time mask value are set to 1 to quickly select the instruction slot to be issued. This greatly improves the scheduling efficiency of instruction threads in the graphics processor and reduces the number of logic gates and the timing overhead of the graphics processor.
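The two-stage selection described in paragraphs [0061] to [0063] can be illustrated with the following sketch. It assumes each mask is held as a small integer bit field and that occupied slots are represented as hypothetical `(slot_number, saturation_mask, dwell_mask)` tuples. The function names are illustrative and do not come from the embodiment.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer mask."""
    return bin(value).count("1")


def select_target_slot(occupied):
    """Pick the slot number to force issue, or None if nothing is occupied.

    occupied: list of (slot_number, saturation_mask, dwell_mask) tuples.
    """
    if not occupied:
        return None
    # Stage 1: keep the slots whose saturation mask has the most 1 bits.
    best_sat = max(popcount(sat) for _, sat, _ in occupied)
    candidates = [s for s in occupied if popcount(s[1]) == best_sat]
    # Stage 2: among those, keep the slots whose dwell mask has the most 1 bits.
    best_dwell = max(popcount(dwell) for _, _, dwell in candidates)
    finalists = [num for num, _, dwell in candidates if popcount(dwell) == best_dwell]
    # Tie-break: the lowest slot number wins.
    return min(finalists)


example = [(0, 0b1, 0b0), (1, 0b1, 0b1), (2, 0b0, 0b1), (3, 0b1, 0b1)]
chosen = select_target_slot(example)
```

Here slot 2 is dropped in the first stage because its saturation mask is 0, slot 0 is dropped in the second stage because its dwell mask is 0, and the tie between slots 1 and 3 is resolved in favor of slot 1. Only mask bits are inspected, so no full timer values are compared.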

[0064] Step 410: Forcibly issue the multiple original instruction threads in the target instruction slot, and then push the target instruction thread into the target instruction slot.

[0065] In this embodiment, after forcibly issuing the multiple original instruction threads in the target instruction slot, the method further includes: clearing the number of collected threads, the saturation mask value, the global count cycle value, and the dwell time mask value corresponding to the target instruction slot.

[0066] In the technical solution provided in this embodiment, the graphics processor receives a target instruction thread to be processed and updates the total number of instruction threads. When the total number of instruction threads reaches a preset threshold, the global count period value corresponding to each instruction slot in the thread assembler is updated and the dwell time mask value corresponding to each instruction slot is determined. The graphics processor then determines whether the execution path of the target instruction thread matches the execution path of any instruction slot in the thread assembler. If so, the target instruction thread is pushed into the matching instruction slot, the number of collected threads corresponding to that slot is updated, and the saturation mask value corresponding to that slot is determined. If not, the target instruction slot is selected from among the multiple occupied instruction slots based on the saturation mask value and dwell time mask value corresponding to each instruction slot, the multiple original instruction threads in the target instruction slot are forcibly issued, and the target instruction thread is then pushed into the target instruction slot. This technical approach can reduce the hardware overhead caused by the instruction thread scheduling process, save the hardware resources of the graphics processor, and improve the processing performance of the graphics processor for instruction threads.
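The overall flow summarized above can be modeled in software roughly as follows. This is a sketch under several assumptions that this section does not state: all three threshold values are hypothetical, both masks are single bits, the received-thread counter wraps to zero after each period update, only occupied slots have their period value updated, and a free slot (if any) is claimed without a force issue. The class and method names are illustrative.

```python
from dataclasses import dataclass, field

COUNT_THRESHOLD = 4   # hypothetical: received threads per period update
PERIOD_THRESHOLD = 2  # hypothetical: periods before the dwell mask is set
NUM_THRESHOLD = 3     # hypothetical: collected threads before saturation


@dataclass
class Slot:
    path: object = None                          # None marks an unoccupied slot
    threads: list = field(default_factory=list)
    period: int = 0                              # global count period value
    dwell_mask: int = 0
    saturation_mask: int = 0

    def push(self, path, thread):
        self.path = path
        self.threads.append(thread)
        self.saturation_mask = int(len(self.threads) >= NUM_THRESHOLD)

    def clear(self):
        """Force issue: return the collected threads and reset all slot state."""
        issued = self.threads
        self.path, self.threads = None, []
        self.period = self.dwell_mask = self.saturation_mask = 0
        return issued


class ThreadAssembler:
    def __init__(self, num_slots):
        self.slots = [Slot() for _ in range(num_slots)]
        self.total = 0  # total number of received instruction threads

    def receive(self, path, thread):
        """Process one thread; return the force-issued threads (possibly empty)."""
        self.total += 1
        if self.total >= COUNT_THRESHOLD:
            self.total = 0  # assumption: the counter wraps after each update
            for slot in self.slots:
                if slot.path is not None:  # assumption: occupied slots only
                    slot.period += 1
                    slot.dwell_mask = int(slot.period >= PERIOD_THRESHOLD)
        for slot in self.slots:            # matching path: push and update masks
            if slot.path == path:
                slot.push(path, thread)
                return []
        for slot in self.slots:            # assumption: claim a free slot directly
            if slot.path is None:
                slot.push(path, thread)
                return []
        # No match and no free slot: highest saturation, then longest dwell.
        # max() returns the first maximal item, so the lowest slot number wins ties.
        target = max(self.slots, key=lambda s: (s.saturation_mask, s.dwell_mask))
        issued = target.clear()
        target.push(path, thread)
        return issued


asm = ThreadAssembler(num_slots=2)
for tid, p in enumerate(["A", "A", "B", "A"]):
    asm.receive(p, tid)
issued = asm.receive("C", 4)  # no matching or free slot, so slot 0 is force issued
```

With single-bit masks, ordering slots by the pair (saturation mask, dwell mask) gives the same result as the two-stage filter, because the saturation bit dominates and the dwell bit only separates slots with equal saturation.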

[0067] Figure 5 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.

[0068] As shown in Figure 5, the electronic device 10 includes at least one graphics processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one graphics processor 11. The memory stores computer programs executable by the at least one graphics processor 11. The graphics processor 11 can perform various appropriate actions and processes based on the computer program stored in the read-only memory 12 or loaded from the storage unit 18 into the random access memory 13. The random access memory 13 can also store various programs and data required for the operation of the electronic device 10. The graphics processor 11, read-only memory 12, and random access memory 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

[0069] Multiple components in the electronic device 10 are connected to the input/output interface 15, including: an input unit 16, such as a keyboard or mouse; an output unit 17, such as various types of monitors and speakers; a storage unit 18, such as a magnetic disk or optical disk; and a communication unit 19, such as a network card, modem, or wireless transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks. The graphics processor 11 executes the various methods and processes described above, such as the instruction thread scheduling method.

[0070] In some embodiments, the instruction thread scheduling method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 10 via the read-only memory 12 and/or the communication unit 19. When the computer program is loaded into the random access memory 13 and executed by the graphics processor 11, one or more steps of the instruction thread scheduling method described above may be performed. Alternatively, in other embodiments, the graphics processor 11 may be configured to execute the instruction thread scheduling method by any other suitable means (e.g., by means of firmware).

[0071] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-chips (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0072] Computer programs used to implement the methods of the present invention can be written in any combination of one or more programming languages. These computer programs can be provided to the graphics processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the graphics processor, the computer programs cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The computer programs can be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0073] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction thread scheduling system, apparatus, or device. A computer-readable storage medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium can be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, RAM, ROM, erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0074] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT) or a liquid crystal display (LCD)) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0075] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0076] A computing system can include clients and servers. Clients and servers are generally geographically separated and typically interact via communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system. It addresses the shortcomings of traditional physical hosts and Virtual Private Servers (VPS) in terms of management difficulty and weak business scalability.

[0077] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.

[0078] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A graphics processor, characterized in that the graphics processor includes a thread assembler, a global counter, and a launch controller; the thread assembler includes multiple instruction slots, each corresponding to a dwell time register and a saturation register; the global counter is used to update its current count value after the thread assembler receives an instruction thread to be processed; the dwell time register is used to update its current value based on the count value of the global counter; the saturation register is used to update its current value after an instruction thread is pushed into the instruction slot; the thread assembler is used to select a target instruction slot from among multiple occupied instruction slots when the execution path of the instruction thread to be processed does not match the execution path of any instruction slot; and the launch controller is used to force the launch of the multiple instruction threads in the target instruction slot.

2. The graphics processor according to claim 1, characterized in that the dwell time register includes a dwell period register and a dwell mask register; the dwell period register is used to update its current value when the count value of the global counter reaches a threshold; and the dwell mask register is used to update its current value based on the result of comparing the value of the dwell period register with a period threshold.

3. The graphics processor according to claim 1, characterized in that the saturation register includes a thread count register and a saturation mask register; the thread count register is used to update its current value after a new instruction thread is pushed into the instruction slot; and the saturation mask register is used to update its current value based on the result of comparing the value of the thread count register with a count threshold.

4. The graphics processor according to claim 1, characterized in that the thread assembler is also used to push the instruction thread to be processed into the target instruction slot after the multiple instruction threads in the target instruction slot are forcibly issued.

5. A method for scheduling instruction threads, characterized in that the method is applied to the graphics processor of claim 1 and includes: after receiving a target instruction thread to be processed, determining whether the execution path of the target instruction thread matches the execution path of any instruction slot in the thread assembler; and, if no instruction slot in the thread assembler matches the target instruction thread, selecting the target instruction slot from among the multiple occupied instruction slots according to the saturation mask value and dwell time mask value corresponding to each instruction slot, forcibly issuing the multiple original instruction threads in the target instruction slot, and then pushing the target instruction thread into the target instruction slot.

6. The method according to claim 5, characterized in that, after receiving the target instruction thread to be processed, the method further includes: updating the total number of instruction threads that the thread assembler has received; and, after determining whether the execution path of the target instruction thread matches the execution path of any instruction slot in the thread assembler, the method further includes: if the thread assembler has an instruction slot that matches the target instruction thread, pushing the target instruction thread into that instruction slot and updating the number of collected threads corresponding to that instruction slot.

7. The method according to claim 6, characterized in that, after updating the number of collected threads corresponding to the instruction slot, the method further includes: comparing the number of collected threads corresponding to the instruction slot with a number threshold, and determining the saturation mask value corresponding to the instruction slot based on the comparison result.

8. The method according to claim 6, characterized in that, after updating the total number of instruction threads received by the thread assembler, the method further includes: determining whether the total number of instruction threads has reached a preset threshold; and, if so, updating the global count cycle value corresponding to each instruction slot in the thread assembler.

9. The method according to claim 8, characterized in that, after updating the global count cycle value corresponding to each instruction slot in the thread assembler, the method further includes: comparing the global count cycle value corresponding to each instruction slot in the thread assembler with a cycle threshold, and determining the dwell time mask value corresponding to each instruction slot based on the comparison result.

10. The method according to claim 5, characterized in that selecting the target instruction slot from among the multiple occupied instruction slots according to the saturation mask value and dwell time mask value corresponding to each instruction slot includes: obtaining the saturation mask value corresponding to each of the multiple occupied instruction slots, and selecting the instruction slots with the most 1 bits in the saturation mask value as candidate instruction slots; and obtaining the dwell time mask value corresponding to each of the candidate instruction slots, and selecting the candidate instruction slot with the most 1 bits in the dwell time mask value as the target instruction slot.

11. An electronic device, characterized in that the electronic device includes: at least one graphics processor; and a memory communicatively connected to the at least one graphics processor; wherein the memory stores a computer program that can be executed by the at least one graphics processor, the computer program being executed by the at least one graphics processor to enable the at least one graphics processor to perform the instruction thread scheduling method according to any one of claims 5-10.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed, cause a graphics processor to implement the instruction thread scheduling method of any one of claims 5-10.