Process switching method and apparatus, electronic device, storage medium, and program product

By using a context switching method at the instruction granularity level, unexecuted instructions in the graphics processor are paused and saved, which solves the latency problem of the graphics processor when switching high-priority tasks, realizes fast process switching, and improves the user experience.

CN122240274A · Pending · Publication Date: 2026-06-19 · MOORE THREADS TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
MOORE THREADS TECH CO LTD
Filing Date
2026-04-27
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Graphics processors have difficulty switching task processes quickly when a high-priority task needs to preempt the task currently running, resulting in long process switching delays and affecting user experience.

Method used

It employs an instruction-level context switching method that pauses the unexecuted instructions of the thread bundles in the current process, saves their context information to video memory, and records breakpoint information, allowing a switch to a higher-priority process before the current process completes.

Benefits of technology

It effectively reduces the latency of process switching and improves the user experience.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122240274A_ABST

Abstract

This disclosure relates to a process switching method, apparatus, electronic device, storage medium, and program product, comprising: in response to receiving a context save command, pausing the issuance of a new first thread group in the current first process to the execution unit, and pausing unexecuted instructions in each target thread bundle in the execution unit; saving the instruction-level context information of the target thread bundle to video memory, wherein the target thread bundle is the thread bundle in the second thread group of the first process that is currently executing instructions; recording breakpoint information of the first thread group to video memory, the breakpoint information being used to indicate the position of the first thread group; starting the second process to be switched to, until the execution result of the second process is obtained. Embodiments of this disclosure enable instruction-level context switching, effectively reducing process switching latency and improving user experience.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to a process switching method, apparatus, electronic device, storage medium, and program product.

Background Technology

[0002] Computational tasks performed on a Graphics Processing Unit (GPU) are typically time-consuming, and when a high-priority task needs to preempt them, it is often difficult to switch task processes quickly. For example, in an Artificial Intelligence Personal Computer (AIPC), while performing artificial intelligence (AI) calculations, the GPU may need to render the desktop or quickly execute other lightweight rendering tasks. Similarly, in autonomous driving, while performing a heavy-load, low-priority AI voice interaction task, the GPU may suddenly need to compute a high-priority obstacle recognition task. These scenarios require the GPU to quickly switch away from the currently executing task and schedule the high-priority task to begin execution.

Summary of the Invention

[0003] In view of this, this disclosure provides a process switching method, apparatus, electronic device, storage medium, and program product.

[0004] According to one aspect of this disclosure, a process switching method is provided, wherein the process calls multiple thread groups in a pipelined order, each thread group includes multiple thread bundles, each thread bundle includes multiple threads, and each thread includes multiple instructions. The method is applied to the front end of a target processor, the target processor further includes multiple processor units, each processor unit includes multiple execution units, and the method includes: in response to obtaining a context save command, pausing the issuance of a new first thread group in the current first process to the execution unit, and pausing unexecuted instructions in each target thread bundle in the execution unit; saving the instruction-level context information of the target thread bundle to video memory, wherein the target thread bundle is a thread bundle in the second thread group of the first process that is currently executing instructions and has been issued to the execution unit; recording breakpoint information of the first thread group to video memory, the breakpoint information being used to indicate the position of the first thread group; starting the second process to be switched until the execution result of the second process is obtained, wherein the first process and the second process are different processes.

[0005] In one possible implementation, the method further includes: in response to obtaining the running result of the second process, executing a context recovery command, the context recovery command being used to restore each target thread bundle in the second thread group in the execution unit according to context information read from the video memory; in response to the completion of execution of each target thread bundle in the second thread group, sending the first thread group to the execution unit according to the breakpoint information read from the video memory, until the running result of the first process is obtained.

[0006] In one possible implementation, the execution unit includes a thread bundle scheduler and an instruction issuing unit. Saving the instruction-level context information of the target thread bundle to video memory includes: forwarding the context save command to the thread bundle scheduler and the instruction issuing unit of the execution unit, so that the instruction issuing unit suspends the unissued instructions in the target thread bundle; when the thread bundle scheduler has completed the execution of the issued instructions in the target thread bundle, it saves the context information of the target thread bundle to a register; in response to the context information of the target thread bundle being saved to the register, it calls a switching program to read the context information from the register to video memory.
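The save sequence in this implementation can be sketched as a toy Python model. This is only an illustration under stated assumptions: the `Bundle` class, its counters, and the dictionary layout of the saved context are hypothetical, and plain dictionaries stand in for the register and the video memory.

```python
from dataclasses import dataclass


@dataclass
class Bundle:
    """Hypothetical model of one target thread bundle."""
    name: str
    total: int          # instructions in the bundle
    issued: int         # instructions already issued
    completed: int      # issued instructions that have finished
    paused: bool = False


def handle_save_command(bundle: Bundle, register: dict, video_memory: dict) -> None:
    # Instruction issuing unit: stop issuing the unissued instructions.
    bundle.paused = True
    # Thread bundle scheduler: let the already-issued instructions finish.
    bundle.completed = bundle.issued
    # Scheduler saves the bundle's context to a register ...
    register[bundle.name] = {
        "resume_at": bundle.issued,
        "remaining": bundle.total - bundle.issued,
    }
    # ... and the switching program copies it from the register to video memory.
    video_memory[bundle.name] = dict(register[bundle.name])


register, vram = {}, {}
wave = Bundle("WaveA2", total=10, issued=4, completed=2)
handle_save_command(wave, register, vram)
```

In this sketch the only work left before the switch is the two in-flight instructions; the six unissued instructions are recorded rather than executed.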

[0007] In one possible implementation, the thread bundle scheduler is also used to wake up any target thread bundle that is in a synchronization waiting state, so that the target thread bundle exits that state.

[0008] In one possible implementation, executing the context recovery command includes: invoking a recovery program to read the context information from the video memory into a register; and resuming the issuance of the paused instructions in each target thread bundle of the second thread group.

[0009] In one possible implementation, obtaining the context save command includes: receiving a context save command from an off-chip processor; or, generating the context save command if the first process is detected to be in an infinite loop.

[0010] In one possible implementation, the context information includes at least one of the following: hardware information for executing the target thread bundle, address information of the target thread bundle, data information, and identification information.
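As a rough illustration of the context information listed above, one saved record per target thread bundle might be modeled as follows. The disclosure names only the four categories (hardware, address, data, and identification information); every field name here is a hypothetical choice, and a dictionary stands in for video memory.

```python
from dataclasses import dataclass, field


@dataclass
class BundleContext:
    """Hypothetical per-thread-bundle context record; field names are assumptions."""
    bundle_id: int           # identification information
    execution_unit: int      # hardware information: unit that was running the bundle
    next_instruction: int    # address information: first unexecuted instruction
    registers: list[int] = field(default_factory=list)  # data information


def save_context(video_memory: dict[int, BundleContext], ctx: BundleContext) -> None:
    """Store one record in a dict that stands in for video memory."""
    video_memory[ctx.bundle_id] = ctx


video_memory: dict[int, BundleContext] = {}
save_context(
    video_memory,
    BundleContext(bundle_id=2, execution_unit=0, next_instruction=17, registers=[1, 2, 3]),
)
```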

[0011] According to another aspect of this disclosure, a process switching apparatus is provided, wherein a process sequentially calls multiple thread groups, each thread group including multiple thread bundles, each thread bundle including multiple threads, and each thread including multiple instructions. The apparatus is applied to the front end of a target processor, the target processor further including multiple processor units, each processor unit including multiple execution units. The apparatus includes: a response module, configured to, in response to receiving a context save command, suspend the issuance of a new first thread group in the current first process to the execution unit, and suspend the execution of unexecuted instructions in each target thread bundle in the execution unit, and save the instruction-level context information of the target thread bundle to video memory, wherein the target thread bundle is a thread bundle in the second thread group of the first process that is currently executing instructions and has been issued to the execution unit; a recording module, configured to record breakpoint information of the first thread group to video memory, the breakpoint information indicating the position of the first thread group; and a startup module, configured to start the second process to be switched until the execution result of the second process is obtained, wherein the first process and the second process are different processes.

[0012] In one possible implementation, the apparatus further includes: a recovery module, configured to execute a context recovery command in response to obtaining the execution result of the second process, the context recovery command being configured to restore each target thread bundle in the second thread group in the execution unit based on context information read from the video memory; and a launch module, configured to launch the first thread group to the execution unit based on the breakpoint information read from the video memory in response to the completion of execution of each target thread bundle in the second thread group, until the execution result of the first process is obtained.

[0013] In one possible implementation, where the execution unit includes a thread bundle scheduler and an instruction issuing unit, the response module is configured to: forward the context saving command to the thread bundle scheduler and the instruction issuing unit of the execution unit, so that the instruction issuing unit suspends the unissued instructions in the target thread bundle; when the thread bundle scheduler has completed the execution of the issued instructions in the target thread bundle, it saves the context information of the target thread bundle to a register; in response to the context information of the target thread bundle being saved to the register, it calls a switching program to read the context information from the register into video memory.

[0014] In one possible implementation, the thread bundle scheduler is also used to wake up any target thread bundle that is in a synchronization waiting state, so that the target thread bundle exits that state.

[0015] In one possible implementation, the recovery module is used to: call a recovery program to read the context information from the video memory into a register; and resume the issuance of the paused instructions in each target thread bundle of the second thread group.

[0016] In one possible implementation, the response module is further configured to: receive a context save command from an off-chip processor; or, generate the context save command if the first process is detected to be in an infinite loop.

[0017] In one possible implementation, the context information includes at least one of the following: hardware information for executing the target thread bundle, address information of the target thread bundle, data information, and identification information.

[0018] According to another aspect of this disclosure, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the above-described method.

[0019] According to another aspect of this disclosure, a non-volatile computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the above-described method.

[0020] According to another aspect of this disclosure, a computer program product is provided, including a computer program or a non-volatile computer-readable storage medium carrying the computer program, wherein the computer program, when executed by a processor, implements the steps of the above-described method.

[0021] According to the process switching method of this disclosure, in response to receiving a context save command, it can pause the issuance of a new first thread group in the current first process to the execution unit, and pause the execution of unexecuted instructions in each target thread bundle in the execution unit; save the instruction-level context information of the target thread bundle to video memory, wherein the target thread bundle is the thread bundle in the second thread group that has been issued to the execution unit in the first process and is currently executing instructions; record the breakpoint information of the first thread group to video memory, the breakpoint information being used to indicate the position of the first thread group; start the second process to be switched, until the running result of the second process is obtained.

[0022] In this way, compared to related technologies where context saving is done at the thread group level, the process switching method of this disclosure allows for context switching at the instruction granularity level. Even if the thread group in the current process has not finished executing, it can save and pause the unexecuted instructions of the thread bundles in the thread group and directly switch to the next process, effectively reducing the latency of process switching and improving the user experience.

[0023] Other features and aspects of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Attached Figure Description

[0024] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this disclosure together with the specification and serve to explain the principles of this disclosure.

[0025] Figure 1 shows a flowchart of a process switching method according to an embodiment of the present disclosure.

[0026] Figure 2 shows a schematic diagram of a target processor according to an embodiment of the present disclosure.

[0027] Figure 3 shows a schematic diagram of a processor according to an embodiment of the present disclosure.

[0028] Figure 4 shows a schematic diagram of a process switching method according to an embodiment of the present disclosure.

[0029] Figure 5 shows a block diagram of a process switching apparatus according to an embodiment of the present disclosure.

[0030] Figure 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Implementation

[0031] Various exemplary embodiments, features, and aspects of this disclosure will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0032] As used herein, the terms “comprising,” “including,” “having,” or variations thereof are open-ended and include one or more of the stated features, integers, elements, steps, components, or functions, but do not exclude the presence or addition of one or more other features, integers, elements, steps, components, functions, or groups thereof.

[0033] When an element is referred to as “connected,” “coupled,” “responding,” or a variation thereof relative to another element, it may be directly connected, coupled, or responding to another element, or there may be an intermediate element present.

[0034] Although the terms first, second, third, etc., may be used herein to describe various elements / operations, these elements / operations should not be limited by these terms. These terms are only used to distinguish one element / operation from another. Therefore, without departing from the teachings of the inventive concept, a first element / operation in some embodiments may be referred to as a second element / operation in other embodiments.

[0035] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.

[0036] Furthermore, to better illustrate this disclosure, numerous specific details are set forth in the following detailed description. Those skilled in the art will understand that this disclosure can be practiced without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail in order to highlight the main points of this disclosure.

[0037] It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, data stored, data displayed, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant regions.

[0038] The process task switching procedure may include: pausing the original task, saving the context of the original task, terminating the original task, starting a new task, ending the new task, restoring the context of the original task, and continuing to execute the original task until it ends or proceeding to the next task switch. In related technologies, the granularity of context saving is at the thread group (e.g., workgroup / block) level. The current process is saved and a new process is switched only after one thread group in the process has finished executing.

[0039] Therefore, in related technologies, process switching has a long delay and is not timely enough. A thread group takes a long time to complete, and a new process can only be switched to after the entire thread group has finished. For example, if an infinite loop occurs within the instructions of a thread group, such as in a program for dual-card communication, where card 0 needs to obtain the result from card 1 to start working, but card 1 terminates for unknown reasons, card 0 will be stuck in an infinite loop, unable to switch to a new task.

[0040] In view of this, embodiments of the present disclosure provide a process switching method, which, in response to obtaining a context save command, suspends the issuance of a new first thread group in the current first process to the execution unit, and suspends the execution of unexecuted instructions in each target thread bundle in the execution unit, saves the instruction-level context information of the target thread bundle to the video memory, wherein the target thread bundle is the thread bundle in the second thread group that has been issued to the execution unit in the first process and is currently executing instructions; records the breakpoint information of the first thread group to the video memory, the breakpoint information being used to indicate the position of the first thread group; and starts the second process to be switched until the running result of the second process is obtained.

[0041] This approach allows for context switching at the instruction granularity level. Even if the thread group in the current process has not finished executing, the unexecuted instructions of the thread bundle in the thread group can be saved and paused, and the process can be switched directly to the next process. This effectively reduces the latency of process switching and improves the user experience.
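The difference between the two granularities can be made concrete with a small counting model. This is a sketch under simplifying assumptions: the cost of a switch is measured only as the number of instructions that must still run before the switch can happen, and the per-bundle numbers are made up.

```python
def drain_cost_group_level(bundles):
    """Instructions still to run when the switch waits for the whole thread group."""
    return sum(total - completed for total, _issued, completed in bundles)


def drain_cost_instruction_level(bundles):
    """Instructions still to run when only already-issued instructions are drained."""
    return sum(issued - completed for _total, issued, completed in bundles)


# (total, issued, completed) for each thread bundle; illustrative numbers only.
group = [(1000, 400, 398), (1000, 250, 249)]
print(drain_cost_group_level(group))        # 1353
print(drain_cost_instruction_level(group))  # 3
```

If a bundle never terminates, as in the dual-card example above, the group-level cost has no upper bound, while the instruction-level cost stays limited to the instructions in flight.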

[0042] Figure 1 shows a flowchart of a process switching method according to an embodiment of this disclosure. As shown in Figure 1, the process calls multiple thread groups in a pipelined sequence. Each thread group includes multiple thread bundles, each thread bundle includes multiple threads, and each thread includes multiple instructions. The method is applied to the front end of a target processor, which also includes multiple processor units. Each processor unit includes multiple execution units. The method includes:

[0043] In step S11, in response to receiving the context save command, the issuance of new first thread groups in the current first process to the execution unit is paused, and the unexecuted instructions in each target thread bundle in the execution unit are paused. The instruction-level context information of the target thread bundle is saved to the video memory. The target thread bundle is the thread bundle in the second thread group that has been issued to the execution unit in the first process and is currently executing instructions.

[0044] In step S12, the breakpoint information of the first thread group is recorded to the video memory, and the breakpoint information is used to indicate the position of the first thread group;

[0045] In step S13, the second process to be switched is started until the running result of the second process is obtained.
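Steps S11 to S13 can be sketched from the front end's point of view as follows. This is a minimal model, not the disclosed implementation: the function name, the use of a list index as the breakpoint, the dictionary standing in for video memory, and the callable representing the second process are all assumptions.

```python
from typing import Callable


def switch_out(next_group_index: int,
               target_bundles: dict[str, int],
               video_memory: dict,
               second_process: Callable[[], str]) -> str:
    """Toy model of steps S11 to S13 as seen by the front end."""
    # S11: no further thread groups are issued; for each target bundle, the
    # index of its first unexecuted instruction is saved as its context.
    video_memory["contexts"] = dict(target_bundles)
    # S12: the breakpoint marks the first thread group that was not issued.
    video_memory["breakpoint"] = next_group_index
    # S13: run the second process until its result is available.
    return second_process()


groups = ["Group A", "Group B", "Group C"]
vram: dict = {}
result = switch_out(next_group_index=2,
                    target_bundles={"WaveA2": 4},
                    video_memory=vram,
                    second_process=lambda: "desktop frame rendered")
```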

[0046] The first process and the second process are different processes. For example, the first process and the second process can be processes for different tasks, processes that serve different users, processes that serve different applications (APPs), or processes that process different data. The embodiments of this disclosure do not impose specific limitations on this.

[0047] In one possible implementation, the method can be applied to the front end of a target processor (GPU) in an electronic device, which may include user equipment (UE), mobile device, user terminal, terminal device, server, cellular phone, cordless phone, personal digital assistant (PDA), handheld device, computing device, in-vehicle device, wearable device, etc. The embodiments disclosed herein are not intended to be limiting.

[0048] In one possible implementation, the target processor of this disclosure embodiment may be a completely new design or an improvement on an existing processor chip. The type of processor chip may include, but is not limited to: a central processing unit (CPU), a graphics processing unit (GPU), a general-purpose graphics processing unit (GPGPU), a neural network processing unit (NPU), an application-specific integrated circuit (ASIC), a tensor processing unit (TPU), a field-programmable gate array (FPGA), or other programmable logic devices, and may also include a microprocessor or other conventional processor.

[0049] Figure 2 shows a schematic diagram of a target processor according to an embodiment of the present disclosure. For example, as shown in Figure 2, the target processor may include multiple processor clusters (PCs), such as processor cluster 1 to processor cluster 4; each processor cluster may include multiple processor execution engines (PXs), such as processor execution engine 1 to processor execution engine 4; each processor execution engine may further include multiple processors, such as processor 1 and processor 2. Each processor execution engine within a processor cluster may have a control unit responsible for managing the task scheduling and resource allocation of its multiple processors.

[0050] Multiple processor execution engines within each processor cluster can be connected to the Last Level Cache (LLC) via a Network On Chip (NOC), which is directly connected to the video memory. This video memory may include, for example, Dynamic Random Access Memory (DRAM). The target processor also includes a front-end located outside the processor cluster, which can be used for global control of the processor cluster.

[0051] It should be noted that, although the target processor has been described above using Figure 2 as an example, those skilled in the art will understand that the embodiments of this disclosure do not limit the number of processor clusters in the target processor, the number of processor execution engines in a processor cluster, or the number of processors in a processor execution engine; these can be set flexibly according to the actual application scenario.

[0052] Figure 3 shows a schematic diagram of a processor according to an embodiment of the present disclosure. As shown in Figure 3, each processor unit may include four execution units. Each execution unit may include a register set (e.g., a 128 KB register, shared registers, etc.), a floating-point processing unit (FP), an integer processing unit (INT), a special function unit (SFU), a load store unit (LSU), a tensor memory engine (TME), a thread bundle scheduler (wave manager, WM), and an instruction issuing unit (II).

[0053] The floating-point processing unit (FP) handles single-precision or double-precision floating-point arithmetic; the integer processing unit (INT) handles integer operations; the special function unit (SFU) computes functions such as reciprocal, square root, exponential, logarithmic, and activation functions through numerical approximation, which can improve the computational performance of the computing unit; the load store unit (LSU) executes load and store instructions; the tensor memory engine (TME) transfers data between memory and registers; the thread bundle scheduler (WM) is responsible for thread scheduling and resource allocation; and the instruction issuing unit (II) issues instructions according to instruction dependencies.

[0054] The processor may also include a level 1 cache, local memory, and a tensor computation engine consisting of multiple (e.g., 4) arithmetic logic units (ALUs) and control logic units.

[0055] In one possible implementation, the method is applied to the front end of a target processor (e.g., a GPU), which may be firmware or hardware circuitry, and the embodiments of this disclosure are not limited thereto.

[0056] In one possible implementation, a process can call multiple programs in a pipelined order to perform different tasks. Each task can be divided into multiple thread groups, each thread group can be divided into multiple thread bundles, each thread bundle includes multiple threads, and each thread includes multiple instructions.

[0057] In one possible implementation, process preemption occurs when a high-priority process (the second process) needs to run. The GPU frontend suspends the currently executing low-priority process (the first process) and allocates GPU resources to the high-priority second process. Here, the first process and the second process are different processes; the first process is the process currently executing on the target processor, and the second process is the process that wants to preempt the first process.

[0058] Figure 4 shows a schematic diagram of a process switching method according to an embodiment of the present disclosure. As shown in Figure 4, the front end within the target processor sends thread groups from the first process to the execution unit in pipeline order until it receives a context save command. At that point, the thread groups in the first process that have already been issued to the execution unit are referred to as second thread groups, and those that have not yet been issued are referred to as first thread groups. For example, as shown in Figure 4, assume the first process contains thread group A, thread group B, and thread group C. When the front end obtains the context save command, Group A and Group B, which have been issued to the execution unit, are second thread groups, and Group C, which has not yet been issued, is a first thread group.
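The naming convention above amounts to splitting the process's thread groups at the point the front end had reached when the command arrived. A minimal sketch, assuming the front end tracks a simple count of issued groups (the function and parameter names are hypothetical):

```python
def classify_groups(groups, issued_count):
    """Split a process's thread groups at the moment the save command arrives."""
    second = groups[:issued_count]   # already issued to the execution unit
    first = groups[issued_count:]    # not yet issued
    return first, second


first, second = classify_groups(["Group A", "Group B", "Group C"], issued_count=2)
print(first)   # ['Group C']
print(second)  # ['Group A', 'Group B']
```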

[0059] In one possible implementation, in step S11, obtaining the context save command may include: receiving a context save command from an off-chip processor; or, generating the context save command if the first process is detected to be in an infinite loop.

[0060] In scenarios where a second external process attempts to preempt the first process, the front-end of the target processor can receive a context save command from an off-chip processor. Conversely, in scenarios where the first process enters an infinite loop and cannot exit, the front-end of the target processor can generate its own context save command to forcibly break the loop. This approach allows for different methods to be used to obtain the context save command depending on the application scenario, improving the adaptability of this method.

[0061] For example, assume the target processor is a GPU, the off-chip processor is a CPU, and the GPU is executing a first process. The CPU can send a context save command to the GPU to allow a higher-priority second process to preempt the first process. Upon receiving this command, the GPU's front end can suspend the current execution of the first process; that is, it lets the instructions already issued in the first process complete and stops issuing any pending instructions. Once the already-issued instructions have completed, the instruction-level context information of the first process can be saved, which facilitates a quick switch to the higher-priority second process.

[0062] Alternatively, if the GPU front end detects that the execution time of the first process exceeds a preset threshold, or that the first process repeatedly executes the same code, indicating that the first process is in an infinite loop, the GPU front end can generate a context save command itself. The command makes the first process pause its current execution and wait for the instructions already issued in the first process to finish. In this way, the instruction-level context information of the first process can be saved and the first process can be broken out of the infinite loop.
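The two detection conditions mentioned above (execution time over a preset threshold, or the same code executed repeatedly) could be checked along the following lines. This is a hedged sketch only: the disclosure does not specify how either condition is measured, so the program-counter trace, the function name, and both default thresholds are assumptions.

```python
from collections import Counter


def looks_like_infinite_loop(elapsed_ms, pc_trace, time_limit_ms=5000, repeat_limit=1000):
    """Return True if either heuristic fires. Both thresholds are made-up defaults."""
    # Heuristic 1: the process has run longer than the preset threshold.
    if elapsed_ms > time_limit_ms:
        return True
    # Heuristic 2: one program-counter value was observed too many times.
    most_common = Counter(pc_trace).most_common(1)
    return bool(most_common) and most_common[0][1] > repeat_limit
```

A legitimate hot loop would also trip the second heuristic, so a real front end would need thresholds tuned to its workloads; when the check fires, the front end would generate the context save command itself.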

[0063] It should be understood that the embodiments of this disclosure do not limit the specific method of the front end obtaining the context saving command, and can be set according to the actual application scenario.

[0064] In step S11, whenever the front end obtains the context save command, it will pause the transmission of the new first thread group in the current first process to the execution unit, and pause the unexecuted instructions in each target thread bundle in the execution unit, and save the context information of each target thread bundle that is currently executing instructions in each second thread group that has been transmitted to the execution unit in the first process to the video memory.

[0065] As shown in Figure 4, the first process includes multiple thread groups, such as thread group A, thread group B, and thread group C. Thread group C depends on thread group B, and thread group B depends on thread group A. Assuming that when the front end receives the context save command, it has already sent the second thread group A and the second thread group B to the execution unit but has not yet sent the first thread group C, the front end can pause sending the first thread group C to the execution unit and forward the context save command to the execution unit.

[0066] Suppose that when the execution unit receives the context save command, it has already finished executing the second thread group Group A and is currently executing the second thread group Group B. Furthermore, the thread bundle WaveA1 in the second thread group Group A has finished executing, and the execution unit is now executing the target thread bundle WaveA2 in the second thread group Group A. In this case, the execution unit can, according to the context save command, pause the unexecuted instructions in the target thread bundle WaveA2, and after saving the instruction-level context information of the target thread bundle WaveA2 to video memory, send a notification to the front end indicating that the saving of the instruction-level context information is complete.

[0067] In step S12, the front end can record the breakpoint information of the first thread group to video memory; the breakpoint information indicates the position of the first thread group.

[0068] As shown in Figure 4, when the front end receives the notification that the execution unit has finished saving the instruction-level context information of the target thread bundle WaveA2 to video memory, it can record to video memory the breakpoint information of first thread group C, which has not yet been issued, so that when the first process is resumed later, first thread group C can be restored from the breakpoint information.

[0069] In step S13, the second process to be switched to can be started and run until its running result is obtained. The first process is the preempted process, and the second process is the process that preempts the first process.

[0070] Once the front end receives the execution result of the second process, it can send a context recovery command to the execution unit. Upon receiving the context recovery command, the execution unit can read the context information of the target thread bundle WaveA2 from the video memory and restore the unexecuted instructions in the target thread bundle WaveA2.

[0071] When the target thread bundle WaveA2 finishes executing, the front end can be notified. The front end can then read the breakpoint information of first thread group C from the video memory, restore first thread group C based on the breakpoint information, and continue issuing first thread group C to the execution unit until the running result of the first process is obtained.
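The Figure 4 walkthrough above can be sketched as an event sequence. This is a minimal illustration rather than the disclosed implementation: the class names `VideoMemory`, `ExecutionUnit`, and `FrontEnd`, the log strings, and the placeholder context value are all assumptions made for the example, and the sketch assumes at least one thread group is still unissued when the save command arrives.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VideoMemory:
    """Stand-in for video memory: saved bundle contexts plus the breakpoint."""
    wave_contexts: dict = field(default_factory=dict)
    breakpoint: Optional[str] = None


class ExecutionUnit:
    def __init__(self, vram, running_wave):
        self.vram = vram
        self.running_wave = running_wave

    def save_context(self, log):
        # Pause the bundle and store a placeholder instruction-level context.
        self.vram.wave_contexts[self.running_wave] = {"next_pc": None}
        log.append(f"save {self.running_wave}")
        self.running_wave = None

    def restore_and_finish(self, log):
        # Restore every saved bundle and let it run to completion.
        for wave in list(self.vram.wave_contexts):
            del self.vram.wave_contexts[wave]
            log.append(f"restore and finish {wave}")


class FrontEnd:
    def __init__(self, groups, vram, unit):
        self.groups = list(groups)   # thread groups in dependency order
        self.next_index = 0          # index of the next group to issue
        self.vram, self.unit, self.log = vram, unit, []

    def issue_next(self):
        self.log.append(f"issue {self.groups[self.next_index]}")
        self.next_index += 1

    def on_context_save(self):
        self.log.append("pause issuing")                     # S11, front end
        self.unit.save_context(self.log)                     # S11, execution unit
        self.vram.breakpoint = self.groups[self.next_index]  # S12
        self.log.append(f"record breakpoint {self.vram.breakpoint}")

    def run_second_process(self):
        self.log.append("run second process")                # S13

    def on_second_process_done(self):
        self.unit.restore_and_finish(self.log)
        # Resume issuing from the recorded breakpoint.
        self.next_index = self.groups.index(self.vram.breakpoint)
        while self.next_index < len(self.groups):
            self.issue_next()


vram = VideoMemory()
front_end = FrontEnd(["A", "B", "C"], vram, ExecutionUnit(vram, "WaveA2"))
front_end.issue_next()            # thread group A
front_end.issue_next()            # thread group B
front_end.on_context_save()
front_end.run_second_process()
front_end.on_second_process_done()
print("\n".join(front_end.log))
```

The log makes the ordering explicit: thread group C is issued only after WaveA2 has been restored and has finished.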

[0072] Alternatively, suppose the first process includes multiple thread groups, such as thread group A, thread group B, and thread group C, where thread group C depends on thread groups A and B, and thread groups A and B can be executed in parallel. Assume the GPU's processor cluster is currently executing thread group A and thread group B (two existing second thread groups). When the front end receives the context save command, it pauses issuing thread group C (the first thread group) to the GPU's processor cluster and saves to video memory the context information of each target thread bundle currently executing instructions in thread groups A and B (the second thread groups).

[0073] For example, suppose thread group A includes thread bundles waveA1, waveA2, and waveA3, where waveA3 depends on the result of waveA2, and waveA2 depends on the result of waveA1. When the front-end receives a context save command, if it detects that the GPU is executing waveA2 within thread group A, it can use waveA2 as the target thread bundle. Instructions already issued in waveA2 must be completed, and unissued instructions are stopped from being issued. The front-end waits for the already issued instructions in waveA2 to complete, and then saves the context information of waveA2 to video memory.

[0074] Similarly, suppose thread group B includes thread bundles waveB1, waveB2, and waveB3, where waveB3 depends on the result of waveB2, and waveB2 depends on the result of waveB1. When the front end receives a context save command, if it detects that the GPU is executing waveB3 within thread group B, it can use waveB3 as the target thread bundle. Instructions already issued in waveB3 must be completed, and unissued instructions are stopped from being issued. The front end waits for the already issued instructions in waveB3 to complete, and then saves the context information of waveB3 to video memory.
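The selection of target thread bundles in the two examples above can be illustrated with a short sketch. The status strings `"done"`, `"executing"`, and `"pending"` and the function name `select_target_bundles` are assumptions for illustration; the disclosure does not specify how the executing bundle is detected.

```python
def select_target_bundles(issued_groups):
    """Return the bundles that are currently executing in already-issued groups."""
    targets = []
    for bundles in issued_groups.values():
        for name, status in bundles.items():
            if status == "executing":
                targets.append(name)
    return targets


# State of second thread groups A and B when the context save command arrives.
issued_groups = {
    "A": {"waveA1": "done", "waveA2": "executing", "waveA3": "pending"},
    "B": {"waveB1": "done", "waveB2": "done", "waveB3": "executing"},
}
targets = select_target_bundles(issued_groups)
print(targets)
```

Only the bundles found in the executing state are paused and saved; finished bundles need no context.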

[0075] In one possible implementation, the target processor further includes multiple processor units, each processor unit comprising multiple execution units, and each execution unit including a thread bundle scheduler and an instruction issuing unit (see Figure 3). Saving the instruction-level context information of the target thread bundle to video memory includes: forwarding the context save command to the thread bundle scheduler and the instruction issuing unit of the execution unit, so that the instruction issuing unit suspends the unissued instructions in the target thread bundle; when the issued instructions in the target thread bundle have finished executing, saving, by the thread bundle scheduler, the context information of the target thread bundle to a register; and, in response to the context information of the target thread bundle being saved to the register, calling a switching program to read the context information from the register into video memory.

[0076] For example, when a high-priority new task needs to preempt resources, the front end receives a context save command and can forward it to the processor executing the first process in the GPU processor cluster. This context save command raises the context save register. The target thread bundle currently running in the processor can monitor the register changes in real time through the thread bundle scheduler. Once the context save register is raised, the current execution is paused, and a switching state is entered. Pausing the currently executing target thread bundle means that instructions already issued in the target thread bundle need to be completed, and instructions not yet issued are stopped from being issued. This can be achieved by pausing the unissued instructions in the target thread bundle through the instruction issuing unit.
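The pause semantics described here (issued instructions complete, unissued instructions are held back once the context save register is raised) can be sketched as follows. The one-instruction-per-cycle pipeline, the `Wave` fields, and the method names are simplifying assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Wave:
    instructions: list
    next_to_issue: int = 0                         # index of the next unissued instruction
    in_flight: list = field(default_factory=list)  # issued but not yet completed
    completed: list = field(default_factory=list)
    state: str = "running"


class ExecutionUnit:
    def __init__(self, wave):
        self.wave = wave
        self.context_save_register = False  # raised by the context save command

    def cycle(self):
        """One step: complete the oldest in-flight instruction, then maybe issue."""
        wave = self.wave
        if wave.in_flight:
            wave.completed.append(wave.in_flight.pop(0))
        if self.context_save_register:
            return  # instruction issuing unit holds back unissued instructions
        if wave.next_to_issue < len(wave.instructions):
            wave.in_flight.append(wave.instructions[wave.next_to_issue])
            wave.next_to_issue += 1

    def drain_for_switch(self):
        """Wait for issued instructions to finish, then enter the switching state."""
        while self.wave.in_flight:
            self.cycle()
        self.wave.state = "switching"
        return self.wave.next_to_issue  # where issuing resumes after restore


wave = Wave(["i0", "i1", "i2", "i3", "i4"])
unit = ExecutionUnit(wave)
for _ in range(3):
    unit.cycle()
unit.context_save_register = True       # context save command arrives
resume_index = unit.drain_for_switch()
print(wave.completed, resume_index, wave.state)
```

In this run, `i2` was already issued when the register was raised, so it completes; `i3` and `i4` are never issued before the switch.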

[0077] In the switching state, each target thread bundle needs to wait for its previously issued instructions to complete. The hardware (e.g., registers) then saves the state of the target thread bundle as context information, such as the identifier of the processor cluster where the current target thread bundle is located, the identifier of the target processor that executes the target thread bundle, the image address information (scratchbase), and so on. The embodiments disclosed herein do not limit this.

[0078] After the hardware saves the context information of the target thread bundle, it can set the address of the next instruction to the starting address of the predefined switching program and start fetching and executing the switching program.

[0079] The switching program needs to save the context information of the current target thread bundle, which includes at least one of the following: hardware information for executing the target thread bundle, address information of the target thread bundle, data information, and identification information.

[0080] The context information is instruction-level information, such as the contents of registers and shared buffers. See the table below for reference:

[0081]

[0082] It should be understood that the context information shown in the table is only an example, and the specific content of the context information is not limited in the embodiments disclosed herein.

[0083] The saving of register contents is accomplished by software instructions. The switching program will end with the instruction end marker END, indicating that the target thread bundle should exit.
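The split between the hardware save and the software switching program can be sketched with a toy interpreter. Everything concrete here is an assumption for illustration: the `STORE` opcode, the field names, and the choice of `scratch_base` as the base address of the save area are not specified in the disclosure. Only the END marker and the idea that register contents are saved by software instructions come from the text.

```python
HARDWARE_FIELDS = ("cluster_id", "processor_id", "wave_id", "scratch_base", "next_pc")


def hardware_save(wave):
    """Hardware part: capture the bundle's state information."""
    return {key: wave[key] for key in HARDWARE_FIELDS}


def run_switching_program(program, registers, video_memory, base):
    """Software part: execute the switching program until the END marker."""
    for op, *args in program:
        if op == "STORE":
            reg, offset = args
            video_memory[base + offset] = registers[reg]
        elif op == "END":
            return True  # the target thread bundle exits
        else:
            raise ValueError(f"unknown instruction: {op}")
    return False  # program ran off the end without END


wave = {
    "cluster_id": 0, "processor_id": 2, "wave_id": 5,
    "scratch_base": 0x1000, "next_pc": 12,
    "registers": {"r0": 7, "r1": 9},
}
video_memory = {}
hw_state = hardware_save(wave)
switching_program = [("STORE", "r0", 0), ("STORE", "r1", 1), ("END",)]
exited = run_switching_program(
    switching_program, wave["registers"], video_memory, hw_state["scratch_base"]
)
print(hw_state, video_memory, exited)
```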

[0084] This method preserves instruction-level context information, which facilitates process switching at the instruction level, effectively reducing process switching latency and improving user experience.

[0085] In one possible implementation, the thread bundle scheduler is also used to wake up the target thread bundle that is in a synchronized state, so that the target thread bundle exits the synchronized waiting state.

[0086] For example, suppose there exist thread bundles wave0, wave1, wave2, and wave3, where thread bundles wave0, wave1, wave2, and wave3 are synchronous thread bundles.

[0087] When the front end receives the context save command, thread bundles wave2 and wave3 have executed part of their instructions but enter the switching state before reaching the synchronization instruction. Thread bundles wave0 and wave1 have already reached the synchronization node: their issued instructions are waiting for synchronization, and they can enter the switching state only after receiving the synchronization signals from thread bundles wave2 and wave3. Because wave2 and wave3 are already in the switching state, those signals never arrive, so the issued instructions of wave0 and wave1 cannot complete and their context information cannot be saved. In this situation, the thread bundle scheduler can wake up thread bundles wave0 and wave1 so that they exit the synchronization waiting state.

[0088] In other words, if thread bundles wave2 and wave3 are paused before synchronization is complete, while wave0 and wave1 are waiting for their signals at the synchronization node, the latter will be unable to complete the synchronization instructions and thus will be unable to save the context information. The scheduler wakes up wave0 and wave1, forcing them to exit the synchronization waiting state, so that the context saving of the target thread bundle can be successfully performed.

[0089] By waking up target thread bundles that are in a synchronous waiting state through the thread bundle scheduler, the synchronous deadlock problem caused by some thread bundles entering the switching state in advance is solved. This allows the context information of all target thread bundles to be maintained at the instruction level, which helps to effectively reduce the latency of process switching.
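The deadlock and its resolution can be illustrated with a small model. The `Barrier` class, the `waiting_sync` flag, and the rule that a bundle is savable only when it is not blocked at an unreleased barrier are assumptions for this sketch. How the interrupted synchronization is replayed after restore is not covered here, since the text does not describe it.

```python
class Barrier:
    """A synchronization point that releases once every participant arrives."""

    def __init__(self, participants):
        self.participants = set(participants)
        self.arrived = set()

    def released(self):
        return self.arrived == self.participants


def savable(waves, barrier):
    """Bundles whose issued instructions can complete, so their context can be saved."""
    return {
        name for name, wave in waves.items()
        if not (wave["waiting_sync"] and not barrier.released())
    }


def wake_waiting(waves):
    """Scheduler action: force waiting bundles out of the synchronization wait."""
    woken = [name for name, wave in waves.items() if wave["waiting_sync"]]
    for name in woken:
        waves[name]["waiting_sync"] = False
    return woken


waves = {
    "wave0": {"waiting_sync": True},   # reached the synchronization node
    "wave1": {"waiting_sync": True},
    "wave2": {"waiting_sync": False},  # paused before the synchronization instruction
    "wave3": {"waiting_sync": False},
}
barrier = Barrier(waves)
barrier.arrived = {"wave0", "wave1"}

before = savable(waves, barrier)
woken = wake_waiting(waves)
after = savable(waves, barrier)
print(sorted(before), woken, sorted(after))
```

Before the wake-up only wave2 and wave3 can be saved; afterwards all four bundles can.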

[0090] In step S11, the context information of each target thread bundle currently executing an instruction in the second thread groups that already exist in the first process is saved to video memory. In step S12, the breakpoint information of the first thread group is recorded to video memory. The breakpoint indicates the position of the next first thread group to be launched. For example, the front end can directly save the three-dimensional breakpoint position (group xyz) of the first thread group, which is used to restore the context of the first process and relaunch the first thread group.

[0091] After the front end finishes saving, it can switch to the new second process in step S13 and run it until the result of the second process is obtained.
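A breakpoint expressed as a three-dimensional group position can be sketched as below. The launch order (x varying fastest, then y, then z) and the `limit` parameter used to simulate preemption are assumptions; the text only states that the xyz position of the next first thread group is saved.

```python
def group_coords(dims):
    """Yield thread-group coordinates (x, y, z) in launch order, x fastest."""
    dim_x, dim_y, dim_z = dims
    for z in range(dim_z):
        for y in range(dim_y):
            for x in range(dim_x):
                yield (x, y, z)


def launch(dims, start=(0, 0, 0), limit=None):
    """Launch groups beginning at `start`.

    Returns (launched, breakpoint), where breakpoint is the coordinate of the
    next group to launch, or None if every group was launched.
    """
    launched = []
    for coord in group_coords(dims):
        # Launch order is z-major, so compare coordinates as (z, y, x).
        if coord[::-1] < start[::-1]:
            continue
        if limit is not None and len(launched) == limit:
            return launched, coord
        launched.append(coord)
    return launched, None


dims = (2, 2, 1)
first_part, breakpoint_xyz = launch(dims, limit=3)     # preempted after 3 groups
second_part, _ = launch(dims, start=breakpoint_xyz)    # resumed from the breakpoint
print(breakpoint_xyz, first_part + second_part)
```

Launching up to the breakpoint and then resuming from it covers every group exactly once.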

[0092] In one possible implementation, the method further includes: in response to obtaining the running result of the second process, executing a context recovery command, the context recovery command being used to restore each target thread bundle in the second thread group in the execution unit according to context information read from the video memory; in response to the completion of execution of each target thread bundle in the second thread group, sending the first thread group to the execution unit according to the breakpoint information read from the video memory, until the running result of the first process is obtained.

[0093] For example, after the second process has preempted the first process and finished, the original context information of the first process needs to be restored. At this point, the front end initiates a context recovery command. The hardware first restores the saved target thread bundles that were executing when preempted and returns them to the corresponding execution units. Each target thread bundle in the second thread group can be restored based on the context information read from the video memory.

[0094] Executing the context recovery command includes: calling a recovery program to read context information from the video memory into a register; and resuming the paused issuing of instructions in each target thread bundle of the second thread group. For example, the recovery program can be called to read the context information from the video memory and recover each target thread bundle in the second thread group. The recovery program first loads the saved context information from the video memory and restores it to the hardware register unit, then restores the address of the next instruction to be executed and continues the interrupted program.

[0095] Once the preempted target thread bundle has been restored and has finished executing, the front end reads the breakpoint information group xyz of the first thread group from the video memory, relaunches the new first thread group, and continues until the execution result of the first process is obtained.
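The property that matters for recovery is that an interrupted bundle produces the same result as an uninterrupted one. The sketch below checks this on a toy instruction set; the `ADD`/`MUL` opcodes, the register names, and the dictionary standing in for video memory are illustrative assumptions.

```python
PROGRAM = [("ADD", "r0", 5), ("MUL", "r0", 3), ("ADD", "r1", 2), ("MUL", "r1", 4)]


def execute(program, registers, pc=0, stop_at=None):
    """Run from `pc`; stop before index `stop_at` if given. Returns the next pc."""
    while pc < len(program) and pc != stop_at:
        op, reg, value = program[pc]
        if op == "ADD":
            registers[reg] += value
        elif op == "MUL":
            registers[reg] *= value
        pc += 1
    return pc


def save_context(registers, next_pc, video_memory, wave_id):
    video_memory[wave_id] = {"registers": dict(registers), "next_pc": next_pc}


def recovery_program(video_memory, wave_id):
    """Load the saved registers and the address of the next instruction."""
    context = video_memory.pop(wave_id)
    return dict(context["registers"]), context["next_pc"]


# Reference run without preemption.
reference = {"r0": 0, "r1": 0}
execute(PROGRAM, reference)

# Preempted run: pause before instruction 2, save, let another process
# overwrite the registers, then restore and continue.
video_memory = {}
registers = {"r0": 0, "r1": 0}
next_pc = execute(PROGRAM, registers, stop_at=2)
save_context(registers, next_pc, video_memory, "WaveA2")
registers = {"r0": -1, "r1": -1}  # clobbered by the second process
registers, next_pc = recovery_program(video_memory, "WaveA2")
execute(PROGRAM, registers, pc=next_pc)
print(reference, registers)
```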

[0096] This approach allows for instruction-level pause and resume, as well as context switching, improving the timeliness of process switching responses.

[0097] In summary, the process switching method of this disclosure may include a process of saving and restoring context information.

[0098] During the context saving process, the front end pauses issuing new first thread groups. Upon receiving the context save command, the thread bundle scheduler and instruction issuing unit within each execution unit stop all target thread bundles from fetching and issuing instructions and wait for all issued instructions to complete. The thread bundle scheduler wakes up target thread bundles in the synchronization waiting state, so that all active target thread bundles enter the switching state. Hardware saving is then enabled, saving the address of the next instruction to be issued for the current target thread bundle, along with the bundle's state information, such as the processor cluster identifier MPCid, processor execution engine identifier MPXid, processor identifier MPid, execution unit identifier MPEid, thread bundle identifier waveid, and mirror address scratch_base. The hardware can then jump to the starting address of the switching program. The switching program begins execution and saves the target thread bundle's register contents to video memory in software. The switching program exits upon reaching the end marker END. After all issued second thread groups have exited, the front end records the breakpoint group xyz of the next first thread group and saves it in video memory. The front end also saves the read pointer corresponding to the current control flow and then ends the context saving process.

[0099] During the context recovery process, the front end executes a context recovery command to restore the preempted target thread bundles. Based on the processor identifier and thread bundle identifier waveid in the saved context information, the front end sends each target thread bundle to the corresponding hardware unit. Each execution unit within the processor, including its thread bundle scheduler and instruction issuing unit, receives the target thread bundle and begins execution from the recovery program, which restores the context content saved in video memory to on-chip registers (e.g., through load instructions). The last instruction of the recovery program sets the current instruction address to the previously saved address of the next instruction. Once all preempted target thread bundles have finished executing, the front end continues issuing new first thread groups to each processor, starting from the restored breakpoint group xyz, until the first process is completed.

[0100] This approach allows for context switching at the instruction granularity level, enabling rapid response to preemption of high-priority processes at the instruction level, effectively reducing process switching latency and improving user experience.

[0101] It is understood that the various method embodiments mentioned above in this disclosure can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.

[0102] In addition, this disclosure also provides process switching apparatus, electronic devices, computer-readable storage media, and program products, all of which can be used to implement any of the process switching methods provided in this disclosure. The corresponding technical solutions and descriptions can be found in the corresponding descriptions in the method section and will not be repeated here.

[0103] Figure 5 shows a block diagram of a process switching apparatus according to an embodiment of the present disclosure. As shown in Figure 5, the process calls multiple thread groups in a pipelined sequence. Each thread group includes multiple thread bundles, each thread bundle includes multiple threads, and each thread includes multiple instructions. The apparatus is applied to the front end of a target processor, which further includes multiple processor units, each processor unit including multiple execution units. The apparatus includes:

[0104] The response module 51 is used to, in response to receiving a context save command, pause the issuance of new first thread groups in the current first process to the execution unit, and pause the unexecuted instructions in each target thread bundle in the execution unit, and save the instruction-level context information of the target thread bundle to the video memory, wherein the target thread bundle is the thread bundle in the second thread group that has been issued to the execution unit in the first process and is currently executing instructions;

[0105] Recording module 52 is used to record the breakpoint information of the first thread group to the video memory, and the breakpoint information is used to indicate the position of the first thread group;

[0106] The startup module 53 is used to start the second process to be switched to, until the running result of the second process is obtained, wherein the first process and the second process are different processes.

[0107] In one possible implementation, the apparatus further includes: a recovery module, configured to execute a context recovery command in response to obtaining the execution result of the second process, the context recovery command being configured to restore each target thread bundle in the second thread group in the execution unit based on context information read from the video memory; and a launch module, configured to launch the first thread group to the execution unit based on the breakpoint information read from the video memory in response to the completion of execution of each target thread bundle in the second thread group, until the execution result of the first process is obtained.

[0108] In one possible implementation, where the execution unit includes a thread bundle scheduler and an instruction issuing unit, the response module 51 is configured to: forward the context save command to the thread bundle scheduler and the instruction issuing unit of the execution unit, so that the instruction issuing unit suspends the unissued instructions in the target thread bundle; when the issued instructions in the target thread bundle have finished executing, save, through the thread bundle scheduler, the context information of the target thread bundle to a register; and, in response to the context information of the target thread bundle being saved to the register, call a switching program to read the context information from the register into video memory.

[0109] In one possible implementation, the thread bundle scheduler is also used to wake up the target thread bundle that is in a synchronized state, so that the target thread bundle exits the synchronized waiting state.

[0110] In one possible implementation, the recovery module is used to: call a recovery program to read context information from the video memory into a register; and resume the paused issuing of instructions in each target thread bundle of the second thread group.

[0111] In one possible implementation, the response module 51 is further configured to: receive a context save command from an off-chip processor; or, generate the context save command if the first process is detected to be in an infinite loop.

[0112] In one possible implementation, the context information includes at least one of the following: hardware information for executing the target thread bundle, address information of the target thread bundle, data information, and identification information.

[0113] In some embodiments, the functions or modules of the apparatus provided in this disclosure can be used to perform the methods described in the above method embodiments. For specific implementations, refer to the descriptions of the above method embodiments; for brevity, they are not repeated here.

[0114] This disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the above method.

[0115] This disclosure also provides a non-volatile computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the above-described method.

[0116] This disclosure also provides a computer program product, including a computer program or a non-volatile computer-readable storage medium carrying the computer program, wherein the computer program, when executed by a processor, implements the steps of the above method.

[0117] Figure 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure. For example, electronic device 1900 may be provided as a server or terminal device. Referring to Figure 6, electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 1922 is configured to execute instructions to perform the methods described above.

[0118] Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input/output (I/O) interface 1958. Electronic device 1900 can operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

[0119] In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions that can be executed by a processing component 1922 of an electronic device 1900 to perform the above-described method.

[0120] A computer-readable storage medium can be a tangible device capable of holding and storing programs/instructions used by an instruction execution device. Computer-readable storage media include, but are not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination thereof. The computer-readable storage media used herein are not to be construed as transient signals themselves, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0121] The computer program (or computer-readable program instructions) described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in the computer-readable storage medium in the respective computing/processing device.

[0122] The computer program (or computer program instructions) used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing state information from the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement various aspects of this disclosure.

[0123] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0124] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium and cause a computer, programmable data processing apparatus, and/or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0125] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0126] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions that contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0127] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, and they are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the marketplace, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A process switching method, characterized in that a process invokes multiple thread groups in a pipelined sequence, each thread group includes multiple thread bundles, each thread bundle includes multiple threads, and each thread includes multiple instructions; the method is applied to the front end of a target processor, the target processor further including multiple processor units, each processor unit including multiple execution units; the method includes: in response to receiving a context save command, pausing the issuance of new first thread groups in the current first process to the execution unit, pausing the unexecuted instructions in each target thread bundle in the execution unit, and saving the instruction-level context information of the target thread bundle to video memory, wherein the target thread bundle is a thread bundle that is currently executing instructions in a second thread group of the first process that has been issued to the execution unit; recording breakpoint information of the first thread group to the video memory, the breakpoint information being used to indicate the position of the first thread group; and starting the second process to be switched to, until the running result of the second process is obtained, wherein the first process and the second process are different processes.

2. The method according to claim 1, characterized in that the method further includes: in response to obtaining the running result of the second process, executing a context recovery command, the context recovery command being used to restore each target thread bundle in the second thread group in the execution unit according to the context information read from the video memory; and in response to the completion of execution of each target thread bundle in the second thread group, issuing the first thread group to the execution unit based on the breakpoint information read from the video memory, until the running result of the first process is obtained.

3. The method according to claim 1, characterized in that the execution unit includes a thread bundle scheduler and an instruction issuing unit, and saving the instruction-level context information of the target thread bundle to video memory comprises: forwarding the context save command to the thread bundle scheduler and the instruction issuing unit of the execution unit, so that the instruction issuing unit suspends the instructions of the target thread bundle that have not yet been issued and, once the instructions of the target thread bundle that have already been issued have finished executing, the thread bundle scheduler saves the context information of the target thread bundle to a register; and in response to the context information of the target thread bundle being saved to the register, calling a switching program to read the context information from the register into the video memory.
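The division of work in claim 3 can be illustrated with a small model: the issuing unit stops issuing, instructions already issued are allowed to finish, the scheduler latches the context into a register, and a switching program then copies that register to video memory. All names below (`ExecutionUnit`, `context_save`, `switching_program`, `resume_pc`) are hypothetical, and the context is again reduced to a single resume index.

```python
class ExecutionUnit:
    """Hypothetical execution unit running a single thread bundle."""

    def __init__(self, program):
        self.program = program
        self.next_to_issue = 0        # index of the first instruction not yet issued
        self.in_flight = []           # issued but not yet finished
        self.completed = []
        self.issue_paused = False
        self.context_register = None  # written by the thread bundle scheduler

    def issue(self):
        # Instruction issuing unit: issues nothing while paused.
        if not self.issue_paused and self.next_to_issue < len(self.program):
            self.in_flight.append(self.program[self.next_to_issue])
            self.next_to_issue += 1

    def retire(self):
        # Finish the oldest in-flight instruction.
        if self.in_flight:
            self.completed.append(self.in_flight.pop(0))

    def context_save(self):
        self.issue_paused = True      # suspend instructions not yet issued
        while self.in_flight:         # let already-issued instructions finish
            self.retire()
        # Scheduler saves the context to a register once the pipeline is drained.
        self.context_register = {"resume_pc": self.next_to_issue}


def switching_program(eu, video_memory, warp_id):
    # Runs only after the context has been saved to the register.
    if eu.context_register is not None:
        video_memory[warp_id] = dict(eu.context_register)


eu = ExecutionUnit(["i0", "i1", "i2", "i3"])
eu.issue()
eu.issue()
eu.retire()          # i0 finished, i1 still in flight
eu.context_save()    # drains i1, then latches the context
eu.issue()           # has no effect while paused
video_memory = {}
switching_program(eu, video_memory, warp_id=7)
print(eu.completed, video_memory)
```

Draining issued instructions before latching the context keeps the saved state consistent: every instruction is either fully executed or not started.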

4. The method according to claim 3, characterized in that the thread bundle scheduler is further used to wake up a target thread bundle that is in a synchronization waiting state, so that the target thread bundle exits the synchronization waiting state.

5. The method according to claim 2, characterized in that executing the context recovery command comprises: calling a recovery program to read the context information from the video memory into a register; and resuming the issuance of the paused instructions in each target thread bundle of the second thread group.

6. The method according to any one of claims 1 to 5, characterized in that obtaining the context save command comprises: receiving the context save command from an off-chip processor; or generating the context save command when the first process is detected to be in an infinite loop.
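The two sources of the context save command in claim 6 can be sketched as a single decision function. The claim does not state how an infinite loop is detected; the watchdog counter and threshold used below are purely assumptions made for illustration, as are the function and parameter names.

```python
def context_save_source(host_command_pending, cycles_without_progress, watchdog_limit):
    """Return why a context save command should be raised, or None if it should not.

    The watchdog-based loop check is an assumption, not the detection method
    described in the disclosure.
    """
    if host_command_pending:
        return "off_chip_processor"
    if cycles_without_progress >= watchdog_limit:
        return "infinite_loop_detected"
    return None


print(context_save_source(True, 0, 1000))
print(context_save_source(False, 1500, 1000))
print(context_save_source(False, 10, 1000))
```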

7. The method according to any one of claims 1 to 5, characterized in that the context information includes at least one of the following: hardware information for executing the target thread bundle, address information of the target thread bundle, data information, and identification information.
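As a rough illustration of claim 7, the context information can be pictured as a record in which at least one of the four kinds of information is present. The field names and the example meanings in the comments are assumptions; the claim names only the four categories.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class WarpContext:
    """Hypothetical record for the context of one target thread bundle."""
    hardware_info: Optional[dict] = None    # assumed: which execution unit ran the bundle
    address_info: Optional[int] = None      # assumed: address of the next instruction
    data_info: Optional[list] = None        # assumed: register contents
    identification: Optional[int] = None    # assumed: thread bundle identifier

    def __post_init__(self):
        # "At least one of the following": reject a record with nothing in it.
        if all(value is None for value in asdict(self).values()):
            raise ValueError("context must carry at least one kind of information")


ctx = WarpContext(address_info=0x40, identification=3)
print(asdict(ctx))
```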

8. A process switching device, characterized in that a process invokes multiple thread groups in a pipelined sequence, each thread group includes multiple thread bundles, each thread bundle includes multiple threads, and each thread includes multiple instructions; the device is applied to the front end of a target processor, the target processor further including multiple processor units, each processor unit including multiple execution units; the device comprises: a response module, used to, in response to receiving a context save command, pause the issuance of new first thread groups of the current first process to the execution unit, pause the unexecuted instructions in each target thread bundle in the execution unit, and save the instruction-level context information of the target thread bundle to video memory, wherein the target thread bundle is a thread bundle that belongs to a second thread group of the first process already issued to the execution unit and that is currently executing instructions; a recording module, used to record breakpoint information of the first thread group to the video memory, wherein the breakpoint information is used to indicate the position of the first thread group; and a startup module, used to start the second process to be switched to, until the running result of the second process is obtained, wherein the first process and the second process are different processes.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 7.

10. A non-volatile computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.

11. A computer program product comprising a computer program, or a non-volatile computer-readable storage medium carrying a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.