AI chip control method, signal processing method, and task processing system
By introducing low-power modes and hardware-software co-optimization into AI chips, the problem of excessive power consumption when performing large language model tasks has been solved, achieving higher energy efficiency and stability, making it suitable for edge computing scenarios.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
- Filing Date: 2025-11-03
- Publication Date: 2026-07-02
Description
AI chip control method, signal processing method, and task processing system
Technical Field
[0001] This invention relates to the field of artificial intelligence, and in particular to a control method, signal processing method and task processing system for an AI chip.
[0002] This application claims priority to Chinese Patent Application No. 202411976515.5, filed on December 26, 2024, entitled "Control Method, Signal Processing Method and Task Processing System for AI Chip", the entire contents of which are incorporated herein by reference.
Background Technology
[0003] With the continuous development of artificial intelligence (AI) technology, AI chips, as crucial hardware supporting these technologies, increasingly face challenges related to performance and power consumption. Especially in edge computing scenarios, AI chips need to efficiently execute complex tasks such as large language model (LLM) tasks while operating under stringent power consumption and heat dissipation limits. Currently, AI chips on the market often suffer from excessive power consumption, heat dissipation difficulties, and poor energy efficiency when performing LLM tasks. This not only affects the stability and reliability of the chips but also limits their application in low-power scenarios such as edge computing. Therefore, a solution to reduce the power consumption of AI chips is urgently needed.
Technical Issues
[0004] This invention provides an edge AI chip control method, aiming to offer a low-power solution for AI chips to reduce power consumption when performing LLM tasks. By introducing a first low-power mode and a second low-power mode, the AI chip can maintain a low power consumption state both when waiting for and executing tasks, solving the problem of excessive power consumption in existing systems when running large language model decoding tasks. Simultaneously, by optimizing hardware and software collaboration, a higher energy efficiency ratio can be achieved, ensuring a significant reduction in overall power consumption while maintaining high performance.
[0005] In a first aspect, embodiments of the present invention provide an edge AI chip control method, wherein the AI chip is disposed in an edge terminal, the edge terminal is communicatively connected to a host terminal, and the AI chip and the host terminal are jointly used to execute a question-answering task of a large language model. The question-answering task includes, in execution order, a question-asking phase task, a processing phase task, and a decoding phase task. The host terminal is used to execute the question-asking phase task, and the AI chip is used to execute the processing phase task and the decoding phase task. The edge AI chip control method includes the following steps:
[0006] When a wake-up command is received from the host, the AI chip is controlled to exit the first low-power mode and enter the second low-power mode. The wake-up command is generated and sent to the AI chip when the host completes the question-asking phase task. The first low-power mode is used to wait, in a sleep state, for the host to complete the question-asking phase task. The second low-power mode is used to control the AI chip to execute the processing phase task and the decoding phase task at a power consumption lower than that of normal operation;
[0007] In the second low-power mode, the AI chip is controlled to execute the processing stage task and the decoding stage task;
[0008] After the AI chip completes the decoding stage task, it sends the task result to the host and controls the AI chip to exit the second low-power mode and enter the first low-power mode.
[0009] Optionally, in the second low-power mode, controlling the AI chip to execute the processing stage task and the decoding stage task includes:
[0010] In the second low-power mode, when the first memory instruction is received from the host, the AI chip is controlled to run at a first memory frequency to execute the processing stage task, and the first memory instruction corresponds to the first memory frequency;
[0011] When the second memory instruction is received from the host, the AI chip is controlled to run at a second memory frequency to execute the decoding stage task. The second memory instruction corresponds to the second memory frequency, and the first memory frequency is lower than the second memory frequency.
[0012] Optionally, in the second low-power mode, controlling the AI chip to execute the processing stage task and the decoding stage task includes:
[0013] In the second low-power mode, when the first processing instruction is received from the host, the AI chip is controlled to execute the processing stage task at a first processing frequency. The first processing instruction corresponds to the first processing frequency. The first processing instruction is generated by the host when the number of text units to be processed in the processing stage task is less than a quantity threshold.
[0014] Upon receiving the second processing instruction from the host, the AI chip is controlled to execute the processing stage task at a second processing frequency. The second processing instruction corresponds to the second processing frequency, and the first processing frequency is lower than the second processing frequency. The second processing instruction is generated by the host when the number of text units to be processed in the processing stage task is not less than a quantity threshold.
[0015] Optionally, in the second low-power mode, controlling the AI chip to execute the processing stage task and the decoding stage task includes:
[0016] In the second low-power mode, if a third processing instruction is received from the host while the AI chip is being controlled to execute the processing stage task and the decoding stage task, the AI chip is controlled to enter the third low-power mode. The third processing instruction is generated by the host when the large language model is called.
[0017] In the third low-power mode, the AI chip waits for an interrupt signal to be generated. If an interrupt signal is received, the AI chip is controlled to exit the third low-power mode.
[0018] Optionally, controlling the AI chip to enter a third low-power mode includes:
[0019] The low-power resource operations of the AI chip are executed in static random access memory by using a preset low-power function.
[0020] In a second aspect, embodiments of the present invention also provide a signal processing method for a host terminal, wherein the host terminal is communicatively connected to an edge terminal in which an AI chip is disposed, and the AI chip and the host terminal jointly perform a question-answering task of a large language model. The question-answering task includes, in execution order, a question-asking phase task, a processing phase task, and a decoding phase task. The host terminal is used to execute the question-asking phase task, and the AI chip is used to execute the processing phase task and the decoding phase task. The signal processing method of the host terminal includes:
[0021] Upon completion of the questioning phase task, a wake-up command is generated;
[0022] The wake-up command is sent to the edge device so that when the edge device receives the wake-up command from the host device, it controls the AI chip to exit the first low-power mode and enter the second low-power mode. The wake-up command is generated and sent to the AI chip when the host device completes the questioning phase task. The first low-power mode is used to wait for the host device to complete the questioning phase task in a sleep state. The second low-power mode is used to control the AI chip to execute the processing phase task and the decoding phase task in a manner lower than the normal operating power. In the second low-power mode, the AI chip is controlled to execute the processing phase task and the decoding phase task. When the AI chip completes the decoding phase task, it sends the task result to the host device and controls the AI chip to exit the second low-power mode and enter the first low-power mode.
[0023] Optionally, after the AI chip enters the second low-power mode, the method further includes:
[0024] Before the AI chip executes the processing stage task, a first memory instruction is generated;
[0025] The first memory instruction is sent to the edge terminal so that when the edge terminal receives the first memory instruction from the host terminal, it controls the AI chip to run at a first memory frequency to execute the processing stage task. The first memory instruction corresponds to the first memory frequency.
[0026] Before the AI chip executes the decoding stage task, a second memory instruction is generated;
[0027] The second memory instruction is sent to the edge device so that when the edge device receives the second memory instruction from the host device, it controls the AI chip to run a decoding stage task at a second memory frequency. The second memory instruction corresponds to the second memory frequency, and the first memory frequency is lower than the second memory frequency.
[0028] Optionally, after the AI chip enters the second low-power mode, the method further includes:
[0029] The number of text units to be processed by the processing stage task is obtained;
[0030] When the number of text units to be processed by the processing stage task is less than the quantity threshold, a first processing instruction is generated.
[0031] The first processing instruction is sent to the edge terminal so that when the edge terminal receives the first processing instruction from the host terminal, it controls the AI chip to execute the processing stage task at a first processing frequency, wherein the first processing instruction corresponds to the first processing frequency.
[0032] When the number of text units to be processed by the processing stage task is not less than the quantity threshold, a second processing instruction is generated.
[0033] The second processing instruction is sent to the edge device so that when the edge device receives the second processing instruction from the host device, it controls the AI chip to execute the processing stage task at a second processing frequency. The second processing instruction corresponds to the second processing frequency, and the first processing frequency is lower than the second processing frequency.
[0034] Optionally, after the AI chip enters the second low-power mode, the method further includes:
[0035] When the large language model is invoked, a third processing instruction is generated;
[0036] The third processing instruction is sent to the edge device so that when the edge device receives the third processing instruction from the host device during the process of controlling the AI chip to perform the processing stage task and the decoding stage task, it controls the AI chip to enter the third low-power mode. In the third low-power mode, it waits for an interrupt signal to be generated. If an interrupt signal is received, it controls the AI chip to exit the third low-power mode.
[0037] In a third aspect, embodiments of the present invention also provide a task processing system including an edge terminal and a host terminal, the host terminal being communicatively connected to the edge terminal. The AI chip and the host terminal are jointly used to execute a question-answering task of a large language model. The question-answering task includes, in execution order, a question-asking phase task, a processing phase task, and a decoding phase task. The host terminal is used to execute the question-asking phase task, and the AI chip is used to execute the processing phase task and the decoding phase task. The edge terminal is used to execute the control method of the AI chip as described in any one embodiment of the present invention, and the host terminal is used to execute the signal processing method as described in any one embodiment of the present invention.
[0038] In this embodiment of the invention, upon receiving a wake-up command from the host, the AI chip is controlled to exit the first low-power mode and enter the second low-power mode. The wake-up command is generated and sent to the AI chip when the host completes the question-asking phase task. The first low-power mode is used to wait, in a sleep state, for the host to complete the question-asking phase task. The second low-power mode is used to control the AI chip to execute the processing phase task and the decoding phase task at a lower power consumption than normal operation. In the second low-power mode, the AI chip is controlled to execute the processing phase task and the decoding phase task. After the AI chip completes the decoding phase task, it sends the task result to the host, and the AI chip is controlled to exit the second low-power mode and enter the first low-power mode. By introducing the first low-power mode and the second low-power mode, the AI chip can maintain a low power consumption state both when waiting for tasks and when executing them, solving the problem of excessive power consumption in existing systems when running large language model decoding tasks. At the same time, by optimizing hardware and software collaboration, a higher energy efficiency ratio can be achieved, ensuring that overall power consumption is significantly reduced while maintaining high performance.
Description of the Drawings
[0039] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0040] Figure 1 is a schematic diagram of the architecture of a task processing system provided in an embodiment of the present invention;
[0041] Figure 2 is an architecture diagram of another task processing system provided in an embodiment of the present invention;
[0042] Figure 3 is a schematic diagram of a task division provided by an embodiment of the present invention;
[0043] Figure 4 is a schematic diagram of signaling interaction of a task processing system provided in an embodiment of the present invention;
[0044] Figure 5 is a schematic diagram of signaling interaction of a task processing system in a second low-power mode according to an embodiment of the present invention;
[0045] Figure 6 is a schematic diagram of signaling interaction of another task processing system provided in the second low-power mode according to an embodiment of the present invention;
[0046] Figure 7 is a schematic diagram of signaling interaction of another task processing system provided in the second low-power mode according to an embodiment of the present invention;
[0047] Figure 8 is a flowchart of a task processing system in a third low-power mode according to an embodiment of the present invention;
[0048] Figure 9 is a flowchart of a control method for an AI chip provided in an embodiment of the present invention;
[0049] Figure 10 is a flowchart of a signal processing method provided by an embodiment of the present invention.
Embodiments of the Present Invention
[0050] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0051] The following defines some terms used in the embodiments of the present invention.
[0052] AI: Artificial Intelligence refers to a technology or field that aims to simulate, extend, and expand human intelligence, including abilities such as learning, reasoning, perception, understanding, and creation. AI achieves these intelligent behaviors through computer algorithms and models and is widely applied in various fields, such as robotics, autonomous driving, speech recognition, image recognition, and natural language processing.
[0053] LLM (Large Language Model) is a computer program capable of understanding and generating human language. Based on deep learning technology, it learns the rules and patterns of language by training on large amounts of text data, thereby generating coherent and logical text. Large language models have wide applications in natural language processing, text generation, dialogue systems, and intelligent customer service.
[0054] Pre-filling (Prefill) is a stage in the text generation process of large language models (LLMs). In this stage, the model processes the tokens (text fragments or words) of the given context or prompt. The results serve as input for the subsequent decoding stage, which generates the complete text. The pre-filling stage helps the model understand the task requirements and provides a foundation for subsequent text generation.
[0055] Decode: This is a stage in the text generation process of a Large Language Model (LLM), also known as the decoding stage. In this stage, the model generates complete, coherent text based on the tokens generated in the pre-filling stage and the given context or hints. The decoding stage typically involves processing and optimizing the model's output to ensure that the generated text conforms to grammatical rules, is logically clear, and is relevant to the given context or hints.
[0056] As shown in Figure 1, Figure 1 is an architecture diagram of a task processing system provided by an embodiment of the present invention. The task processing system includes an edge terminal and a host terminal. The host terminal and the edge terminal are communicatively connected. The AI chip and the host terminal are used together to execute the question-answering task of a large language model. The question-answering task includes, in the order of execution, a question-asking phase task, a processing phase task, and a decoding phase task. The host terminal is used to execute the question-asking phase task, the AI chip is used to execute the processing phase task and the decoding phase task, the edge terminal is used to execute the control method of the AI chip, and the host terminal is used to execute the signal processing method.
[0057] The aforementioned edge device can be an edge computing module, as shown in Figure 2, which is an architecture diagram of another task processing system provided by an embodiment of the present invention. The edge device includes an AI chip, DDR (Double Data Rate memory), eMMC (Embedded Multi Media Card), and a PCIe interface. The AI chip is connected to the DDR via the DDRC pin and to the eMMC via the SDIO pin. The AI chip communicates with the host via the PCIe bus protocol. The DDR can be LPDDR@64bit, and the PCIe bus protocol can be the PCIe 3.0 bus protocol. The AI chip is interconnected with the host via a Mini-PCIe gold finger. The host can be a terminal device with a processor architecture such as x86 or ARM.
[0058] As shown in Figure 3, which is a schematic diagram of task division provided by an embodiment of the present invention, the question-and-answer task can be divided into stage A, stage B, and stage C according to execution time. Execution time T1 is stage A, execution time T2 is stage B, and execution time T3 is stage C. In stage A, the host executes the question-asking stage task (the user asks a question). In stages B and C, the edge executes the processing stage task and the decoding stage task, respectively. Stage B corresponds to the processing stage task (Prefill process), and stage C corresponds to the decoding stage task (Decode process).
[0059] Specifically, for decoder-only large language models in the task processing system, there are three main stages.
[0060] Phase A: On the host side, the user enters their inquiry.
[0061] Phase B: the Prefill process, executed on the AI chip. It processes the entire input sequence through the multi-head attention mechanisms and feedforward neural networks. This involves a large number of matrix multiplications and weighted summations over a large volume of input data, and therefore requires high computing power.
[0062] Phase C: The Decode process generates q, k, and v from the newly generated tokens and calculates their attention with all previous tokens. This step requires reading the key-value pairs of all previous tokens from the KV Cache, making it an access-intensive, memory-bound process executed on the AI chip. At each step, the generated sequence (including the "Prefill" sequence and previously generated tokens) is used as input, and calculations are performed through multiple layers of Attention and FFN (Feedforward Neural Network) to ultimately output the probability distribution of the next token.
[0063] The pre- and post-processing of the Prefill and Decode processes is executed on the host side.
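The contrast between the compute-intensive Prefill process and the memory-bound Decode process can be illustrated with a rough arithmetic-intensity estimate. The following Python sketch is not part of the described method; the function name, the 512-token prompt length, and the 2-byte weight size are assumptions chosen only for illustration, and activation and KV Cache traffic are ignored.

```python
def matmul_arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: int = 2) -> float:
    """Estimate FLOPs per byte of weight traffic for one dense layer.

    Applying a d x d weight matrix to `tokens_per_pass` token vectors costs
    2 * tokens * d * d FLOPs and reads the d * d weights once, so d cancels.
    Activation and KV Cache traffic are ignored (simplifying assumption).
    """
    flops_per_weight = 2 * tokens_per_pass
    return flops_per_weight / bytes_per_weight


# Prefill handles the whole prompt in one pass (512 tokens is an assumed length).
prefill_intensity = matmul_arithmetic_intensity(tokens_per_pass=512)
# Decode handles one newly generated token per step.
decode_intensity = matmul_arithmetic_intensity(tokens_per_pass=1)
```

Under these assumptions the Prefill pass performs far more arithmetic per byte of weights read than a Decode step, which is consistent with treating stage B as compute-intensive and stage C as access-intensive.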
[0064] Specifically, upon completion of the query phase task on the host side, a wake-up command can be generated and sent to the edge device. Upon receiving the wake-up command from the host side, the edge device controls the AI chip to exit the first low-power mode and enter the second low-power mode. In the second low-power mode, the AI chip executes the processing phase task and the decoding phase task. After completing the decoding phase task, the AI chip sends the task result to the host side and controls the AI chip to exit the second low-power mode and enter the first low-power mode. After receiving the task result, the host side displays the result to the user.
[0065] When exiting phase C, it can be considered that phase A has begun. During this time, the user enters the question on the host side, and the AI chip does not need to work. Therefore, the AI chip can be controlled to enter the first low-power mode to wait for the host side to complete phase A. After the host side completes phase A (for example, after the user enters the question and presses Enter or sends), the host side will generate a wake-up command and send a wake-up command to the edge device. The edge device controls the AI chip to exit the first low-power mode and enter the second low-power mode. In the second low-power mode, the AI chip is controlled to execute phase B and phase C.
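The switching between the first and second low-power modes described above can be sketched as a small state machine. This is a minimal Python illustration under stated assumptions, not the firmware of the embodiment: the class and method names are hypothetical, and the Prefill and Decode bodies are placeholders.

```python
from enum import Enum, auto


class PowerMode(Enum):
    FIRST_LOW_POWER = auto()   # sleep while the host runs phase A
    SECOND_LOW_POWER = auto()  # run phases B and C below normal power


class EdgeController:
    """Hypothetical edge-side controller; all names are illustrative only."""

    def __init__(self):
        self.mode = PowerMode.FIRST_LOW_POWER
        self.log = []

    def on_wake_up(self, question):
        # A wake-up command is only meaningful while the chip sleeps.
        if self.mode is not PowerMode.FIRST_LOW_POWER:
            raise RuntimeError("wake-up received outside the first low-power mode")
        self.mode = PowerMode.SECOND_LOW_POWER
        self.log.append("enter second low-power mode")
        context = self._prefill(question)  # phase B
        answer = self._decode(context)     # phase C
        # Phase C is done: return the result and go back to sleep.
        self.mode = PowerMode.FIRST_LOW_POWER
        self.log.append("enter first low-power mode")
        return answer

    def _prefill(self, question):
        self.log.append("phase B: prefill")
        return question.split()  # placeholder for real prompt processing

    def _decode(self, context):
        self.log.append("phase C: decode")
        return f"answer to {len(context)} tokens"  # placeholder result


controller = EdgeController()
result = controller.on_wake_up("what is an edge AI chip")
```

After `on_wake_up` returns, the controller is back in the first low-power mode, mirroring the return to phase A once phase C exits.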
[0066] The first low-power mode is used to wait for the host to complete the query phase task while in sleep mode. In the first low-power mode, most of the hardware in the AI chip is turned off, and it can only be woken up by transmitting a wake-up command through the PCIe bus protocol. The second low-power mode is used to control the AI chip to perform processing phase tasks and decoding phase tasks at a lower power consumption than normal operation. The second low-power mode mainly manages the low power consumption of at least one of the following: the processing frequency of the embedded neural network processor (NPU), the memory frequency of DDR, the pre- and post-processing of the prefill process, and the pre- and post-processing of the decode process.
[0067] More specifically, as shown in Figure 4, which is a signaling interaction diagram of a task processing system provided in an embodiment of the present invention, the host side includes the host (master controller) and host software, and the edge side includes CPU Linux software. When the question-and-answer task exits stage C, it can be considered to have entered stage A, waiting for the user to input a question. At this time, the host initiates PCIe suspend (the PCIe suspended state) through the host software, which pauses the underlying communication between the host side and the edge side; the AI chip enters PCIe suspend, and the entire chip goes to sleep. When the question-and-answer task exits stage A, it enters stage B; that is, when the user finishes inputting the question, the Prefill process begins. The host exits PCIe suspend through the host software, PCIe communication resumes, other chip resources and the underlying communication are restored, and the AI chip executes the subsequent stages B and C of the question-and-answer task.
[0068] Further, as shown in Figure 5, which is a signaling interaction diagram of a task processing system in a second low-power mode according to an embodiment of the present invention, the host side includes a host (master controller) and host software, and the edge side includes CPU Linux software. When the question-and-answer task exits stage A, it enters stage B. Since stage B corresponds to the processing stage task, which is computationally intensive and has a low demand for DDR memory resources, the host sends a first memory instruction to the CPU Linux software through the host software, applying a first memory frequency. Upon receiving the first memory instruction, the CPU Linux software controls the DDR to operate at the first memory frequency, so that the AI chip runs at the first memory frequency to execute the question-and-answer task in stage B. After the question-and-answer task in stage B is completed, it exits stage B and enters stage C. Since stage C corresponds to the decoding stage task, which is access-intensive (requiring frequent memory access) and has a high demand for memory resources, the host sends a second memory instruction to the CPU Linux software through the host software, applying a second memory frequency. Upon receiving the second memory instruction, the CPU Linux software controls the DDR to operate at the second memory frequency, so that the AI chip runs at the second memory frequency to execute the question-and-answer task in stage C. The first memory frequency is lower than the second memory frequency. That is, the DDR frequency is reduced in stage B and increased or restored to the normal memory frequency in stage C. In this way, the power consumption of the DDR can be reduced in stage B.
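The per-stage memory-frequency selection can be sketched as follows. This Python example is only an illustration: the instruction names, the class `EdgeDdrControl`, and the 800 MHz and 1600 MHz values are assumptions, since the text requires only that the first memory frequency be lower than the second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryInstruction:
    name: str
    ddr_mhz: int


# Placeholder frequencies (assumptions); the text only requires first < second.
FIRST_MEMORY_INSTRUCTION = MemoryInstruction("first_memory_instruction", 800)
SECOND_MEMORY_INSTRUCTION = MemoryInstruction("second_memory_instruction", 1600)


def memory_instruction_for(stage: str) -> MemoryInstruction:
    """Host side: pick the memory instruction to send before a stage starts."""
    if stage == "prefill":  # stage B: compute-intensive, low DDR demand
        return FIRST_MEMORY_INSTRUCTION
    if stage == "decode":   # stage C: access-intensive, high DDR demand
        return SECOND_MEMORY_INSTRUCTION
    raise ValueError(f"unknown stage: {stage!r}")


class EdgeDdrControl:
    """Edge side: stand-in for the CPU Linux software that sets the DDR clock."""

    def __init__(self) -> None:
        self.history = []

    def apply(self, instruction: MemoryInstruction) -> None:
        # A real system would reprogram the DDR clock here; this only records it.
        self.history.append((instruction.name, instruction.ddr_mhz))


edge = EdgeDdrControl()
for stage in ("prefill", "decode"):
    edge.apply(memory_instruction_for(stage))
```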
[0069] Furthermore, as shown in Figure 6, which is a signaling interaction diagram of another task processing system in the second low-power mode provided by an embodiment of the present invention, the host side includes a host (master controller) and host software, and the edge side includes CPU Linux software. When the question-and-answer task exits stage A, it enters stage B. Since stage B corresponds to a computationally intensive task, the processor resources are determined based on the number of tokens: the larger the number of tokens, the more processor resources are required. These processor resources can be understood as the processor's processing frequency. Therefore, when the number of tokens to be processed is less than a threshold, the host sends a first processing instruction to the CPU Linux software via the host software, applying a first processing frequency. Upon receiving the first processing instruction, the CPU Linux software controls the AI chip to operate at the first processing frequency, so that the AI chip executes the question-and-answer task in stage B at the first processing frequency. When the number of tokens to be processed is not less than (that is, greater than or equal to) the threshold, the host sends a second processing instruction to the CPU Linux software via the host software, applying a second processing frequency. Upon receiving the second processing instruction, the CPU Linux software controls the AI chip to operate at the second processing frequency, so that the AI chip executes the question-and-answer task in stage B at the second processing frequency. The first processing frequency is lower than the second processing frequency. That is, in stage B, if the number of tokens to be processed is small, the NPU of the AI chip is downclocked; if the number of tokens to be processed is large, the NPU of the AI chip is upclocked or restored to the normal processing frequency. In this way, when the number of tokens to be processed is small, the power consumption of the AI chip can be reduced.
[0070] In one possible embodiment, after exiting phase B, phase C is entered, where the host sends a second processing instruction to the CPU Linux software via the host software, applying a second processing frequency.
[0071] The threshold for the number of tokens mentioned above can be determined by testing with the actual model, provided that performance is not affected; that is, the delay before the first token should not exceed 2-3 seconds. The first processing frequency mentioned above can be determined based on the performance of the actual model; in this application, the first processing frequency is lower than the normal processing frequency.
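The threshold-based choice of processing instruction, together with one conservative way of deriving the threshold from a first-token latency budget, can be sketched in Python. Everything named here is hypothetical: the function names, the instruction strings, and in particular the latency measurements, which are made-up numbers used only to exercise the code. The calibration also assumes that first-token latency grows with the token count.

```python
def calibrate_token_threshold(latency_s_at_first_freq, budget_s):
    """Return a threshold such that only token counts verified to meet the
    first-token latency budget at the first (lower) frequency fall below it.

    `latency_s_at_first_freq` maps token count -> measured latency in seconds.
    """
    within_budget = [n for n, s in latency_s_at_first_freq.items() if s <= budget_s]
    # If nothing meets the budget, a threshold of 0 disables the low frequency.
    return max(within_budget) + 1 if within_budget else 0


def select_processing_instruction(stage, token_count, token_threshold):
    """Host side: choose which processing instruction to send to the edge."""
    if stage == "decode":
        # In one possible embodiment, stage C uses the second processing frequency.
        return "second_processing_instruction"
    if stage != "prefill":
        raise ValueError(f"unknown stage: {stage!r}")
    if token_count < token_threshold:
        return "first_processing_instruction"   # few tokens: downclock the NPU
    return "second_processing_instruction"      # many tokens: normal frequency


# Made-up measurements (assumption), with a 2-second first-token budget.
measured = {64: 0.4, 128: 0.9, 256: 1.8, 512: 3.6}
threshold = calibrate_token_threshold(measured, budget_s=2.0)
```

With these illustrative numbers, prompts of up to 256 tokens would run at the first processing frequency and longer prompts at the second.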
[0072] Furthermore, as shown in Figure 7, which is a signaling interaction diagram of another task processing system provided in the second low-power mode according to an embodiment of the present invention, in Figure 7, the Prefill process performs forward inference (processed by the NPU), and the Decode process performs model running (module.run). Before the NPU processing, the model execution interface needs to be called to perform a pre-run check. Before module.run, the model execution interface needs to be called to perform a pre-run check. The pre-run check is completed on the host side (or the main controller). Taking the main controller as an x86 as an example, during the pre-run check, the AI chip can enter the third low-power mode. When the host side calls the model execution interface, it needs to notify the AI chip to enter the third low-power mode; after receiving the notification, the AI chip needs to immediately enter the third low-power mode; after calling the model execution interface, the host side needs to notify the AI chip to exit the third low-power mode.
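The host-side notification pattern described above (notify on entry to the model execution interface, notify again after the call) maps naturally onto a context manager. The following Python sketch is an illustration under assumptions: `FakeEdgeLink`, the message strings, and `pre_run_check` are hypothetical stand-ins, not interfaces named in this application.

```python
from contextlib import contextmanager


class FakeEdgeLink:
    """Stand-in for the host-to-edge channel; it only records sent messages."""

    def __init__(self):
        self.sent = []

    def send(self, message):
        self.sent.append(message)


@contextmanager
def third_low_power_mode(link):
    # Notify the AI chip to enter the third low-power mode before the call.
    link.send("enter_third_low_power")
    try:
        yield
    finally:
        # Always notify the exit, even if the host-side check raises.
        link.send("exit_third_low_power")


def pre_run_check(model_name):
    # Placeholder for the host-side pre-run check of the execution interface.
    return bool(model_name)


link = FakeEdgeLink()
with third_low_power_mode(link):
    check_ok = pre_run_check("demo-model")
```

Using `try`/`finally` is a design choice of this sketch so that the chip is not left in the third low-power mode if the host-side check fails.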
[0073] Specifically, as shown in Figure 8, which is a flowchart of a task processing system in the third low-power mode provided by an embodiment of the present invention, the Linux tickless function is enabled while the third low-power mode is in use. At this time, the CPU runs on a single core, core0. During the pre- and post-processing of the large language model, the CPU has essentially nothing to do and waits for the tick (timer) task. When the large language model finishes processing a token and an interrupt (npu int0) is generated, Linux immediately exits the low-power mode to service the interrupt. The kernel cpuidle (CPU idle state) process can be configured using the Linux kernel and ATF firmware. The process enters the ATF firmware via psci, proceeds through psci_cpu_suspend to CPU standby, and enters the pre-designed low-power function deepeye_pwr_domain_standby, where it waits for the tick timer and other asynchronous interrupts. After the NPU completes the computation task and generates an int0 interrupt, deepeye_pwr_domain_standby receives the interrupt and sets the CPU to the running state (i.e., exits the third low-power mode) via psci_set_cpu_local_state(psci_set_local_state_run).
[0074] `psci_cpu_suspend` is a function provided by the PSCI interface that allows Linux to put a specified CPU into a low-power suspend state. When the CPU is in a suspend state, its power consumption is significantly reduced, while maintaining a certain level of capability to quickly resume normal operation.
[0075] CPU standby refers to the CPU operating in a low-power mode, in which the CPU clock is turned off or its frequency is reduced.
[0076] After enabling Linux tickless (Linux low-power mode), the AI chip receives the third processing instruction from the host to enter the third low-power mode. The kernel then quickly triggers the entry into the ATF firmware CPU idle low-power process. In the low-power function deepeye_pwr_domain_standby, key chip low-power resource operations are placed into the chip's SRAM for execution.
[0077] The low-power function deepeye_pwr_domain_standby mainly operates on clock resources: it disables the DDR clock, the SoC top-level power management clock, and the subsystem clocks. These operations save significant power with minimal time overhead. Once the large language model has finished executing, the NPU generates a completion interrupt. Upon receiving the interrupt, the AI chip immediately exits the third low-power mode, and the relevant resources are restored.
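The standby sequence described in [0073] to [0077] can be sketched as a small simulation. This is a toy model in Python, not the ATF firmware code: the class `ChipModel`, the function `standby_until_interrupt`, and the queue standing in for the interrupt controller are assumptions made for illustration; only the role of deepeye_pwr_domain_standby and the NPU int0 interrupt come from the description.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class ChipModel:
    """Hypothetical model of the resources the low-power function touches."""
    ddr_clock_on: bool = True
    subsystem_clocks_on: bool = True
    cpu_state: str = "run"
    trace: list = field(default_factory=list)


def standby_until_interrupt(chip: ChipModel, irq: "queue.Queue[str]") -> str:
    """Mimic the role of deepeye_pwr_domain_standby: gate clocks, wait, restore."""
    # Entering the third low-power mode: gate the DDR and subsystem clocks.
    chip.ddr_clock_on = False
    chip.subsystem_clocks_on = False
    chip.cpu_state = "standby"
    chip.trace.append("enter")

    # Block until an interrupt arrives (tick timer, NPU int0, or another source).
    source = irq.get()
    chip.trace.append(f"irq:{source}")

    # Exiting: restore the clocks and return the CPU to the running state.
    chip.ddr_clock_on = True
    chip.subsystem_clocks_on = True
    chip.cpu_state = "run"
    chip.trace.append("exit")
    return source


irq_line = queue.Queue()
irq_line.put("npu_int0")  # the NPU has finished processing a token
chip = ChipModel()
woken_by = standby_until_interrupt(chip, irq_line)
```

In this sketch the clocks are restored before control returns, which mirrors the statement that the relevant resources are restored on exit from the third low-power mode.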
[0078] To further illustrate the effects of this embodiment, the Tongyi 1.8B large model is used as the large language model, and the IPU-X200 is used as the edge module of the AI chip as an example. Power consumption optimization is performed through the first low-power mode, the second low-power mode, and the third low-power mode. The measured power consumption gains after optimization are as follows:
[0079] Phase A: before optimization, > 2 W; after optimization, 80% reduction. Phase B: before optimization, > 3 W; after optimization, 20% reduction. Phase C: before optimization, > 3 W; after optimization, 10% reduction.
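As a rough reading of the figures in [0079], the reduction percentages can be applied to the stated before-optimization values. Those values are given only as "greater than" bounds, so the results below are scaled bounds and not measurements; the function name `remaining_power_bound` is an illustrative assumption.

```python
# Lower bound on power before optimization (watts) and the reported
# reduction (percent) for each phase, as listed in [0079].
PHASES = {
    "A": (2, 80),
    "B": (3, 20),
    "C": (3, 10),
}


def remaining_power_bound(before_w: int, reduction_pct: int) -> float:
    """Scale a before-optimization bound by the reported reduction."""
    return before_w * (100 - reduction_pct) / 100


bounds = {name: remaining_power_bound(w, pct) for name, (w, pct) in PHASES.items()}
for name, value in bounds.items():
    print(f"Phase {name}: more than {value} W after optimization")
```

The largest relative saving falls in phase A, where the AI chip sleeps in the first low-power mode while the host handles the question.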
[0080] In this embodiment of the invention, the task processing system includes an edge terminal and a host terminal, which are communicatively connected. The AI chip and the host terminal jointly execute a question-answering task based on a large language model. The question-answering task, in execution order, includes a question-asking phase, a processing phase, and a decoding phase. The host terminal executes the question-asking phase, while the AI chip executes the processing and decoding phases. By introducing a first low-power mode and a second low-power mode, the AI chip maintains a low power consumption state while waiting for and executing tasks, solving the problem of excessive power consumption in existing systems when running large language model decoding tasks. Simultaneously, by optimizing hardware and software collaboration, a higher energy efficiency ratio can be achieved, ensuring a significant reduction in overall power consumption while maintaining high performance.
[0081] As shown in Figure 9, Figure 9 is a flowchart of a control method for an AI chip provided by an embodiment of the present invention. The control method for the AI chip includes the following steps:
[0082] 101. Upon receiving a wake-up command from the host, control the AI chip to exit the first low-power mode and enter the second low-power mode.
[0083] In this embodiment of the invention, the control method of the AI chip is used at the edge of the task processing system. The task processing system includes an edge and a host. The host and the edge are communicatively connected. The AI chip and the host are used together to execute the question-answering task of the large language model. The question-answering task includes, in the order of execution, a question-asking phase task, a processing phase task, and a decoding phase task. The host is used to execute the question-asking phase task, and the AI chip is used to execute the processing phase task and the decoding phase task.
[0084] The wake-up command is generated by the host and sent to the AI chip when the host completes the questioning phase task. In the first low-power mode, the AI chip waits in a sleep state for the host to complete the questioning phase task. In the second low-power mode, the AI chip is controlled to execute the processing phase task and the decoding phase task at a power consumption lower than that of normal operation.
[0085] Specifically, when the host completes the questioning phase task, a wake-up command can be generated and sent to the edge. When the edge receives the wake-up command from the host, it controls the AI chip to exit the first low-power mode and enter the second low-power mode. The first low-power mode is used to wait for the host to complete the questioning phase task. In the first low-power mode, most of the hardware in the AI chip is turned off, and it can only be woken up by transmitting the wake-up command through the PCIe bus protocol.
[0086] The above-mentioned question-asking phase task involves receiving user-input questions on the host side, corresponding to phase A of the question-answering task. The above-mentioned processing phase task is the Prefill process, used to process the entire input sequence, corresponding to phase B of the question-answering task. The above-mentioned decoding phase task is the Decode process, used to generate q, k, and v from the most recently generated token and calculate its attention with all previous tokens, corresponding to phase C of the question-answering task.
[0087] 102. In the second low-power mode, control the AI chip to perform processing stage tasks and decoding stage tasks.
[0088] In this embodiment of the invention, the second low-power mode can be used to control the AI chip to perform processing stage tasks and decoding stage tasks; specifically, the second low-power mode can be used to perform low-power management on at least one of the following: the processing frequency of the embedded neural network processor (NPU), the memory frequency of DDR, the pre- and post-processing of the prefill process, and the pre- and post-processing of the Decode process.
[0089] 103. After the AI chip completes the decoding stage task, it sends the task result to the host and controls the AI chip to exit the second low-power mode and enter the first low-power mode.
[0090] In this embodiment of the invention, after the AI chip completes the decoding phase task, it obtains the task result, which is the answer to the inquiry. After the AI chip completes the decoding phase task, it sends the task result to the host terminal, which can then display the result to the user.
[0091] Since the AI chip needs to wait for a new question-and-answer task after completing the above decoding stage, the edge device can control the AI chip to exit the second low-power mode and enter the first low-power mode to wait with lower power consumption.
[0092] Referring to Figure 3, when stage C is exited, stage A can be considered to have begun. During this time, the user inputs the question on the host side, and the AI chip does not need to work. Therefore, the AI chip can be controlled to enter the first low-power mode to wait for the host side to complete stage A. After the host side completes stage A (for example, after the user inputs the question and presses Enter or Send), the host side generates a wake-up command and sends it to the edge device. The edge device controls the AI chip to exit the first low-power mode and enter the second low-power mode. In the second low-power mode, the AI chip is controlled to execute stage B and stage C.
[0093] In this embodiment of the invention, upon receiving a wake-up command from the host, the AI chip is controlled to exit the first low-power mode and enter the second low-power mode. The wake-up command is generated and sent to the AI chip when the host completes the questioning phase task. The first low-power mode is used to wait for the host to complete the questioning phase task, while the second low-power mode is used to control the AI chip to execute the processing phase task and the decoding phase task. In the second low-power mode, the AI chip is controlled to execute the processing phase task and the decoding phase task. After the AI chip completes the decoding phase task, the task result is sent to the host, and the AI chip is controlled to exit the second low-power mode and enter the first low-power mode. By introducing the first and second low-power modes, the AI chip can maintain a low power consumption state when waiting for and executing tasks, solving the problem of excessive power consumption in existing systems when running large language model decoding tasks. At the same time, by optimizing hardware and software collaboration, a higher energy efficiency ratio can be achieved, ensuring that overall power consumption is significantly reduced while maintaining high performance.
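Steps 101 to 103 can be read as a two-state machine on the edge side. The following is a minimal sketch under stated assumptions: the names `Mode`, `EdgeController`, and `on_wake_up` are hypothetical, the Prefill and Decode processes are replaced by trivial stand-in functions, and passing the question together with the wake-up command is a simplification that the description does not specify.

```python
from enum import Enum


class Mode(Enum):
    FIRST_LOW_POWER = 1   # sleeping while the host handles the question phase
    SECOND_LOW_POWER = 2  # running Prefill and Decode below normal power


class EdgeController:
    """Hypothetical edge-side controller following steps 101 to 103."""

    def __init__(self, prefill, decode, send_to_host):
        self.mode = Mode.FIRST_LOW_POWER
        self.prefill = prefill
        self.decode = decode
        self.send_to_host = send_to_host

    def on_wake_up(self, question: str) -> None:
        # Step 101: leave the first mode and enter the second mode.
        if self.mode is not Mode.FIRST_LOW_POWER:
            raise RuntimeError("wake-up received outside the first low-power mode")
        self.mode = Mode.SECOND_LOW_POWER

        # Step 102: run the processing (Prefill) and decoding (Decode) phases.
        state = self.prefill(question)
        answer = self.decode(state)

        # Step 103: return the result, then go back to the first mode.
        self.send_to_host(answer)
        self.mode = Mode.FIRST_LOW_POWER


received = []
edge = EdgeController(
    prefill=lambda q: q.split(),                         # stand-in for Prefill
    decode=lambda tokens: f"{len(tokens)} tokens seen",  # stand-in for Decode
    send_to_host=received.append,
)
edge.on_wake_up("what is edge computing")
```

After `on_wake_up` returns, the controller is back in the first low-power mode and ready for the next wake-up command.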
[0094] It is understood that in the specific implementation of this application, data such as task data, chip data, and user data are involved. When the embodiments in this application are applied to specific products or technologies, user permission or consent is required. Furthermore, the collection, use, and processing of related data, as well as the training, deployment, and invocation of algorithm models, must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
[0095] Optionally, in the second low-power mode, in the step of controlling the AI chip to execute the processing stage task and the decoding stage task, when a first memory instruction is received from the host, the AI chip can be controlled to run at a first memory frequency to execute the processing stage task, where the first memory instruction corresponds to the first memory frequency; when a second memory instruction is received from the host, the AI chip can be controlled to run at a second memory frequency to execute the decoding stage task, where the second memory instruction corresponds to the second memory frequency, and the first memory frequency is lower than the second memory frequency.
[0096] In this embodiment of the invention, the second low-power mode is used to control the AI chip to perform processing stage tasks and decoding stage tasks; the second low-power mode can be used to perform low-power management of DDR memory frequency.
[0097] The processing phase is computationally intensive, while the decoding phase is access intensive (requiring frequent memory access). Therefore, the DDR frequency can be reduced during the processing phase and increased or restored to normal memory frequency during the decoding phase.
[0098] After the host completes the questioning phase, the AI chip enters the second low-power mode. Once the AI chip is in this mode, for the processing phase, the host generates a first memory instruction and sends it to the edge. Upon receiving this instruction, the edge changes the DDR memory frequency to the first memory frequency, so that the AI chip runs at that frequency to execute the processing phase. After completing the processing phase, the AI chip enters the decoding phase. For the decoding phase, the host generates a second memory instruction and sends it to the edge. Upon receiving this instruction, the edge changes the DDR memory frequency to the second memory frequency, so that the AI chip runs at that frequency to execute the decoding phase.
[0099] Referring to Figure 5, the host side includes the host (controller) and host software, while the edge side includes the CPU Linux software. After exiting stage A of the question-and-answer task, the system enters stage B. Since stage B corresponds to a computationally intensive task, its demand for DDR memory resources is relatively low. Therefore, the host, through the host software, issues a first memory instruction to the CPU Linux software to apply a first memory frequency. Upon receiving this instruction, the CPU Linux software controls the DDR to operate at the first memory frequency, so that the AI chip runs at the first memory frequency to execute stage B of the question-and-answer task. After stage B is completed, the system exits stage B and enters stage C. Since stage C corresponds to a decoding task, it is access-intensive (requiring frequent memory access) and has a higher demand for memory resources. Therefore, the host, through the host software, issues a second memory instruction to the CPU Linux software to apply a second memory frequency. Upon receiving this instruction, the CPU Linux software controls the DDR to operate at the second memory frequency, so that the AI chip runs at the second memory frequency to execute stage C of the question-and-answer task. The first memory frequency is lower than the second memory frequency. That is, the DDR frequency is reduced in stage B and increased or restored to the normal memory frequency in stage C. In this way, the power consumption of DDR can be reduced in stage B.
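On the edge side, the correspondence between memory instructions and DDR frequencies can be sketched as a lookup. The class `DdrController`, the instruction strings, and the frequency values are hypothetical; the description only requires that the first memory frequency be lower than the second.

```python
# Hypothetical frequencies in MHz; the description only requires first < second.
MEMORY_FREQ_MHZ = {
    "first_memory_instruction": 1600,
    "second_memory_instruction": 3200,
}


class DdrController:
    """Stand-in for the CPU Linux software that sets the DDR frequency."""

    def __init__(self, normal_mhz: int):
        self.freq_mhz = normal_mhz
        self.history = []

    def handle(self, instruction: str) -> int:
        # Each memory instruction corresponds to exactly one memory frequency.
        self.freq_mhz = MEMORY_FREQ_MHZ[instruction]
        self.history.append((instruction, self.freq_mhz))
        return self.freq_mhz


ddr = DdrController(normal_mhz=3200)
prefill_freq = ddr.handle("first_memory_instruction")   # stage B: compute-bound
decode_freq = ddr.handle("second_memory_instruction")   # stage C: memory-bound
```

Here the second frequency equals the assumed normal frequency, which corresponds to the "restored to the normal memory frequency" case for stage C.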
[0100] Optionally, in the second low-power mode, during the steps of controlling the AI chip to execute the processing stage task and the decoding stage task, when the first processing instruction is received from the host, the AI chip is controlled to execute the processing stage task at a first processing frequency. The first processing instruction corresponds to the first processing frequency and is generated by the host when the number of text units to be processed in the processing stage task is less than a quantity threshold. When the second processing instruction is received from the host, the AI chip is controlled to execute the processing stage task at a second processing frequency. The second processing instruction corresponds to the second processing frequency, and the first processing frequency is lower than the second processing frequency. The second processing instruction is generated by the host when the number of text units to be processed in the processing stage task is not less than a quantity threshold.
[0101] In this embodiment of the invention, when the AI chip performs a processing stage task, dynamic DFS (NPU computing power) control is performed according to the number of text units that the processing stage task needs to process. In this way, when the number of text units that need to be processed is small, the NPU DFS frequency can be reduced, and when the number of text units that need to be processed is large, the NPU DFS frequency can be increased or restored to the normal processing frequency.
[0102] Referring to Figure 6, the host side includes the host (master controller) and host software, while the edge side includes CPU Linux software. After exiting stage A of the question-and-answer task, it enters stage B. Since stage B corresponds to a computationally intensive task, processor resources are determined based on the number of tokens (text units). The larger the number of tokens, the higher the required processor resources. These processor resources can be understood as the processor's processing frequency. Therefore, when the number of tokens to be processed is less than a threshold, the host, through the host software, issues a first processing instruction to the CPU Linux software, applying a first processing frequency. Upon receiving the first processing signal, the CPU Linux software controls the AI chip to operate at the first processing frequency, thus enabling the AI chip to run in stage B, executing the question-and-answer task at the first processing frequency. When the number of tokens to be processed is not less than (greater than or equal to) the threshold, the host, through the host software, issues a second processing instruction to the CPU Linux software, applying a second processing frequency. Upon receiving the second processing signal, the CPU Linux software controls the AI chip to operate at the second processing frequency, thus enabling the AI chip to run in stage B, executing the question-and-answer task at the second processing frequency. The first processing frequency is lower than the second processing frequency. That is, in stage B, if the number of tokens to be processed is small, the NPU of the AI chip is downclocked; if the number of tokens to be processed is large, the NPU of the AI chip is upclocked or restored to the normal processing frequency. In this way, when the number of tokens to be processed is small, the power consumption of the AI chip can be reduced.
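The host-side choice between the two processing instructions reduces to a single comparison against the quantity threshold. A minimal sketch, in which the function name, the instruction strings, and the threshold value of 64 are illustrative assumptions:

```python
def choose_processing_instruction(num_tokens: int, threshold: int) -> str:
    """Return the instruction the host would issue for stage B."""
    if num_tokens < threshold:
        return "first_processing_instruction"   # lower NPU frequency
    return "second_processing_instruction"      # higher or normal NPU frequency


TOKEN_THRESHOLD = 64  # hypothetical value; the description leaves it to testing

short_prompt = choose_processing_instruction(12, TOKEN_THRESHOLD)
boundary = choose_processing_instruction(64, TOKEN_THRESHOLD)
long_prompt = choose_processing_instruction(500, TOKEN_THRESHOLD)
```

A token count equal to the threshold selects the second processing instruction, matching the "not less than" condition in the description.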
[0103] In one possible embodiment, after exiting phase B, phase C is entered, where the host sends a second processing instruction to the CPU Linux software via the host software, applying a second processing frequency.
[0104] The token-count threshold mentioned above can be determined by testing on the actual model so that performance is not affected; that is, the first-token latency should not exceed 2-3 seconds. The first processing frequency mentioned above can be determined based on the performance of the actual model; in this application, the first processing frequency is lower than the normal processing frequency.
[0105] Optionally, in the step of controlling the AI chip to execute the processing stage task and the decoding stage task in the second low-power mode, when a third processing instruction is received from the host side, the AI chip can be controlled to enter the third low-power mode. The third processing instruction is generated by the host side when the large language model is called. In the third low-power mode, the AI chip waits for an interrupt signal. When an interrupt signal is received, the AI chip is controlled to exit the third low-power mode.
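One way to derive the threshold from testing, sketched under assumptions: first-token latency is measured at the first (lower) processing frequency for several prompt lengths, latency is assumed to grow with prompt length, and the threshold is set just above the longest prompt that stays within the latency budget. The function `calibrate_threshold` and the measurement values are hypothetical and are not data from this application.

```python
def calibrate_threshold(latency_s_at_low_freq: dict, budget_s: float) -> int:
    """Pick the quantity threshold from measured first-token latencies.

    latency_s_at_low_freq maps a prompt length (in tokens) to the first-token
    latency measured at the first (lower) processing frequency. Measured prompt
    lengths below the returned threshold stay within the budget.
    """
    threshold = 0
    for tokens in sorted(latency_s_at_low_freq):
        if latency_s_at_low_freq[tokens] > budget_s:
            break
        threshold = tokens + 1  # "less than threshold" must include this length
    return threshold


# Hypothetical measurements (tokens -> seconds), for illustration only.
measured = {16: 0.4, 32: 0.9, 64: 1.8, 128: 3.6, 256: 7.5}
threshold = calibrate_threshold(measured, budget_s=2.0)
```

With these sample numbers, prompts of up to 64 tokens would run at the first processing frequency, and longer prompts would use the second.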
[0106] In this embodiment of the invention, when the large language model is invoked, a pre-run check is required. This pre-run check is completed on the host side. At this time, the AI chip is in an idle state and can enter the third low-power mode to further reduce the power consumption of the AI chip. Invoking the large language model includes forward inference (processed by the NPU) and model execution (module.run). Invoking the large language model can be achieved by calling the model execution interface. Therefore, before NPU processing, the model execution interface needs to be called to perform the pre-run check; before module.run, the model execution interface needs to be called to perform the pre-run check.
[0107] Referring to Figure 7, the Prefill process performs forward inference (processed by the NPU), and the Decode process performs model execution (module.run). Before NPU processing, the model execution interface needs to be called to perform pre-run checks. Before module.run, the model execution interface also needs to be called to perform pre-run checks. These pre-run checks are completed on the host (or main controller). Taking an x86 main controller as an example, during the pre-run checks, the AI chip can enter a third low-power mode. When the host calls the model execution interface, it needs to notify the AI chip to enter the third low-power mode; upon receiving this notification, the AI chip needs to immediately enter the third low-power mode; after calling the model execution interface, the host needs to notify the AI chip to exit the third low-power mode.
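The notify-enter, pre-run check, notify-exit ordering on the host side can be sketched with a context manager. The names `third_low_power_mode` and `call_model_execution_interface`, the message strings, and the `notify` callback standing in for the host-to-chip channel are all assumptions made for illustration.

```python
from contextlib import contextmanager


@contextmanager
def third_low_power_mode(notify):
    """Keep the AI chip in the third low-power mode for the duration of a block."""
    notify("enter_third_low_power")     # sent when the interface is called
    try:
        yield
    finally:
        notify("exit_third_low_power")  # sent after the interface call finishes


def call_model_execution_interface(pre_run_check, notify):
    """Hypothetical host-side wrapper around the model execution interface."""
    with third_low_power_mode(notify):
        return pre_run_check()          # runs on the host while the chip idles


sent = []
ok = call_model_execution_interface(lambda: True, sent.append)
```

Using `finally` means the exit notification is sent even if the pre-run check raises, so the chip is not left in the third low-power mode by a host-side error.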
[0108] Optionally, in the step of controlling the AI chip to enter the third low-power mode, the low-power resource operations of the AI chip can be executed in static random access memory through a preset low-power function.
[0109] In this embodiment of the invention, when entering the third low-power mode, the operation of key chip low-power resources can be placed into the static random access memory of the AI chip for execution through a preset low-power function. The low-power resources include turning off the memory clock, the top-level power management and subsystem clock, etc.
[0110] Referring to Figure 8, during the execution of the third low-power mode, the Linux tickless function is enabled. At this time, the CPU runs on a single core (core0). During the pre- and post-processing of the large language model, there is essentially nothing to do; it is waiting for the tick (timer) task. When the large language model completes processing a token and generates an NPU int0 interrupt, Linux immediately exits the low-power mode to execute the interrupt. This can be achieved by configuring the kernel's cpuidle (CPU idle state) process using the Linux kernel and ATF firmware. This involves entering the ATF firmware via psci, proceeding to psci_cpu_suspend to CPU standby, and then entering the pre-designed low-power function deepeye_pwr_domain_standby, where it waits for the tick timer and other asynchronous interrupts. After the NPU completes its computation task and generates an int0 interrupt, deepeye_pwr_domain_standby receives the interrupt and sets the CPU to running state (i.e., exits the third low-power mode) via psci_set_cpu_local_state(psci_set_local_state_run).
[0111] `psci_cpu_suspend` is a function provided by the PSCI interface that allows Linux to put a specified CPU into a low-power suspend state. When the CPU is in a suspend state, its power consumption is significantly reduced, while maintaining a certain level of capability to quickly resume normal operation.
[0112] CPU standby refers to the CPU operating in a low-power mode, in which the CPU clock is turned off or its frequency is reduced.
[0113] After enabling Linux tickless (Linux low-power mode), the AI chip receives the third processing instruction from the host to enter the third low-power mode. The kernel then quickly triggers the entry into the ATF firmware CPU idle low-power process. In the low-power function deepeye_pwr_domain_standby, key chip low-power resource operations are placed into the chip's SRAM for execution.
[0114] The low-power function deepeye_pwr_domain_standby mainly operates on clock resources: it disables the DDR clock, the SoC top-level power management clock, and the subsystem clocks. These operations save significant power with minimal time overhead. Once the large language model has finished executing, the NPU generates a completion interrupt. Upon receiving the interrupt, the AI chip immediately exits the third low-power mode, and the relevant resources are restored.
[0115] As shown in Figure 10, Figure 10 is a flowchart of a signal processing method provided by an embodiment of the present invention. The signal processing method includes the following steps:
[0116] 201. When the questioning phase task is completed, a wake-up command is generated.
[0117] In this embodiment of the invention, the above signal processing method is used on the host side, which is connected to the edge side for communication. The AI chip and the host side are used together to execute the question-answering task of the large language model. The question-answering task includes a questioning phase task, a processing phase task, and a decoding phase task in the execution order. The host side is used to execute the questioning phase task, and the AI chip is used to execute the processing phase task and the decoding phase task.
[0118] When the questioning phase task is completed on the host side, a wake-up command can be generated.
[0119] 202. Send the wake-up command to the edge device.
[0120] In this embodiment of the invention, the wake-up command can be generated and sent to the edge terminal via the PCIe bus protocol. When the edge terminal receives the wake-up command from the host terminal, it controls the AI chip to exit the first low-power mode and enter the second low-power mode.
[0121] In the first low-power mode, the AI chip waits in a sleep state for the host to complete the questioning phase task. In the second low-power mode, the AI chip is controlled to perform the processing phase task and the decoding phase task at a power consumption lower than that of normal operation. After the AI chip completes the decoding phase task, the task result is sent to the host, and the AI chip is controlled to exit the second low-power mode and enter the first low-power mode.
[0122] The second low-power mode can specifically involve low-power management of at least one of the following: the processing frequency of the embedded neural network processor (NPU), the memory frequency of DDR, the pre- and post-processing of the prefill process, and the pre- and post-processing of the decode process.
[0123] Specifically, the above-mentioned question-asking phase task involves receiving user-input questions on the host side, corresponding to phase A of the question-answering task. The above-mentioned processing phase task is the Prefill process, used to process the entire input sequence, corresponding to phase B of the question-answering task. The above-mentioned decoding phase task is the Decode process, used to generate q, k, and v from the most recently generated token and calculate its attention with all previous tokens, corresponding to phase C of the question-answering task.
[0124] After the AI chip completes the decoding phase, it obtains the task result, which is the answer to the inquiry. Once the AI chip has completed the decoding phase, it sends the task result to the host computer, which can then display the result to the user.
[0125] Since the AI chip needs to wait for a new question-and-answer task after completing the above decoding stage, the edge device can control the AI chip to exit the second low-power mode and enter the first low-power mode to wait with lower power consumption.
[0126] Referring to Figure 3, when stage C is exited, stage A can be considered to have begun. During this time, the user inputs the question on the host side, and the AI chip does not need to work. Therefore, the AI chip can be controlled to enter the first low-power mode to wait for the host side to complete stage A. After the host side completes stage A (for example, after the user inputs the question and presses Enter or Send), the host side generates a wake-up command and sends it to the edge device. The edge device controls the AI chip to exit the first low-power mode and enter the second low-power mode. In the second low-power mode, the AI chip is controlled to execute stage B and stage C.
[0127] In this embodiment of the invention, upon completion of the questioning phase task, a wake-up command is generated and sent to the edge device. Upon receiving the wake-up command from the host device, the edge device controls the AI chip to exit the first low-power mode and enter the second low-power mode. The wake-up command is generated and sent to the AI chip by the host device upon completion of the questioning phase task. The first low-power mode is used to wait for the host device to complete the questioning phase task while in a sleep state. The second low-power mode controls the AI chip to execute the processing and decoding phase tasks at a power consumption lower than normal operation. In the second low-power mode, the AI chip executes the processing and decoding phase tasks. After the AI chip completes the decoding phase task, the task result is sent to the host device, and the AI chip exits the second low-power mode and enters the first low-power mode. By introducing the first and second low-power modes, the AI chip can maintain a low power consumption state while waiting for and executing tasks, solving the problem of excessive power consumption in existing systems when running large language model decoding tasks. Simultaneously, by optimizing hardware and software collaboration, a higher energy efficiency ratio can be achieved, ensuring a significant reduction in overall power consumption while maintaining high performance.
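Steps 201 and 202 on the host side can be sketched as follows. The class `Host`, the dictionary layout of the wake-up command, and the `send_to_edge` callback standing in for the PCIe link are assumptions; the description does not define a command format.

```python
class Host:
    """Hypothetical host-side sketch of steps 201 and 202."""

    def __init__(self, send_to_edge):
        # send_to_edge stands in for the PCIe link to the edge side.
        self.send_to_edge = send_to_edge

    def finish_question_phase(self, question: str) -> dict:
        # Step 201: the question phase is complete, so build the wake-up command.
        command = {"type": "wake_up", "question": question}
        # Step 202: send the command to the edge side.
        self.send_to_edge(command)
        return command


link = []
host = Host(send_to_edge=link.append)
host.finish_question_phase("What is Prefill?")
```

Nothing is sent to the edge side until the question phase has finished, so the AI chip can remain in the first low-power mode while the user is typing.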
[0128] Optionally, after the AI chip enters the second low-power mode, a first memory instruction can be generated before the AI chip executes the processing stage task; the first memory instruction is sent to the edge terminal so that, when the edge terminal receives the first memory instruction from the host terminal, it controls the AI chip to run at the first memory frequency to execute the processing stage task, where the first memory instruction corresponds to the first memory frequency; before the AI chip executes the decoding stage task, a second memory instruction is generated; the second memory instruction is sent to the edge terminal so that, when the edge terminal receives the second memory instruction from the host terminal, it controls the AI chip to run at the second memory frequency to execute the decoding stage task, where the second memory instruction corresponds to the second memory frequency, and the first memory frequency is lower than the second memory frequency.
[0129] In this embodiment of the invention, the second low-power mode is used to control the AI chip to perform processing stage tasks and decoding stage tasks; the second low-power mode can be used to perform low-power management of DDR memory frequency.
[0130] The processing phase is computationally intensive, while the decoding phase is access intensive (requiring frequent memory access). Therefore, the DDR frequency can be reduced during the processing phase and increased or restored to normal memory frequency during the decoding phase.
[0131] After the host completes the questioning phase, the AI chip enters the second low-power mode. Once the AI chip is in this mode, for the processing phase, the host generates a first memory instruction and sends it to the edge. Upon receiving this instruction, the edge changes the DDR memory frequency to the first memory frequency, so that the AI chip runs at that frequency to execute the processing phase. After completing the processing phase, the AI chip enters the decoding phase. For the decoding phase, the host generates a second memory instruction and sends it to the edge. Upon receiving this instruction, the edge changes the DDR memory frequency to the second memory frequency, so that the AI chip runs at that frequency to execute the decoding phase.
[0132] Referring to Figure 5, the host side includes the host (controller) and host software, while the edge side includes the CPU Linux software. After exiting stage A of the question-and-answer task, the system enters stage B. Since stage B corresponds to a computationally intensive task, its demand for DDR memory resources is relatively low. Therefore, the host, through the host software, issues a first memory instruction to the CPU Linux software to apply a first memory frequency. Upon receiving this instruction, the CPU Linux software controls the DDR to operate at the first memory frequency, so that the AI chip runs at the first memory frequency to execute stage B of the question-and-answer task. After stage B is completed, the system exits stage B and enters stage C. Since stage C corresponds to a decoding task, it is access-intensive (requiring frequent memory access) and has a higher demand for memory resources. Therefore, the host, through the host software, issues a second memory instruction to the CPU Linux software to apply a second memory frequency. Upon receiving this instruction, the CPU Linux software controls the DDR to operate at the second memory frequency, so that the AI chip runs at the second memory frequency to execute stage C of the question-and-answer task. The first memory frequency is lower than the second memory frequency. That is, the DDR frequency is reduced in stage B and increased or restored to the normal memory frequency in stage C. In this way, the power consumption of DDR can be reduced in stage B.
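From the host's point of view, the two memory instructions bracket the two stages. The sketch below shows only that ordering; the function names, the instruction strings, and the stand-in stage functions are hypothetical.

```python
def run_stages_with_memory_instructions(send_to_edge, run_prefill, run_decode):
    """Hypothetical host-side ordering of memory instructions around B and C."""
    send_to_edge("first_memory_instruction")   # before stage B (Prefill)
    state = run_prefill()
    send_to_edge("second_memory_instruction")  # before stage C (Decode)
    return run_decode(state)


events = []


def fake_prefill():
    # Stand-in for stage B; records that it ran and returns a placeholder state.
    events.append("prefill")
    return "kv-cache"


def fake_decode(state):
    # Stand-in for stage C; consumes the placeholder state from stage B.
    events.append("decode")
    return f"answer from {state}"


result = run_stages_with_memory_instructions(events.append, fake_prefill, fake_decode)
```

The recorded event order shows each instruction arriving before the stage whose memory frequency it selects.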
[0133] Optionally, after the AI chip enters the second low-power mode, the number of text units that the processing stage task needs to process can also be obtained; when the number of text units that the processing stage task needs to process is less than a quantity threshold, a first processing instruction is generated; the first processing instruction is sent to the edge terminal so that, when the edge terminal receives the first processing instruction from the host terminal, it controls the AI chip to execute the processing stage task at a first processing frequency, where the first processing instruction corresponds to the first processing frequency; when the number of text units that the processing stage task needs to process is not less than the quantity threshold, a second processing instruction is generated; the second processing instruction is sent to the edge terminal so that, when the edge terminal receives the second processing instruction from the host terminal, it controls the AI chip to execute the processing stage task at a second processing frequency, where the second processing instruction corresponds to the second processing frequency, and the first processing frequency is lower than the second processing frequency.
[0134] In this embodiment of the invention, when the AI chip is performing a processing stage task, the host generates a first processing instruction or a second processing instruction based on the number of text units that the processing stage task needs to process, and thereby performs dynamic frequency scaling (DFS) control of the NPU computing power of the AI chip. In this way, when the number of text units to be processed is small, the NPU DFS frequency can be reduced, and when the number of text units to be processed is large, the NPU DFS frequency can be increased or restored to the normal processing frequency.
[0135] Referring to Figure 6, the host side includes the host (master controller) and host software, while the edge side includes the CPU Linux software. After the question-and-answer task exits stage A, it enters stage B. Since stage B is a computationally intensive task, the required processor resources are determined by the number of tokens (text units): the larger the number of tokens, the more processor resources are required. These processor resources can be understood as the processor's processing frequency. Therefore, when the number of tokens to be processed is less than a threshold, the host, through the host software, issues a first processing instruction to the CPU Linux software to apply a first processing frequency. Upon receiving the first processing instruction, the CPU Linux software controls the AI chip to operate at the first processing frequency, so that the AI chip executes stage B of the question-and-answer task at the first processing frequency. When the number of tokens to be processed is not less than (greater than or equal to) the threshold, the host, through the host software, issues a second processing instruction to the CPU Linux software to apply a second processing frequency. Upon receiving the second processing instruction, the CPU Linux software controls the AI chip to operate at the second processing frequency, so that the AI chip executes stage B of the question-and-answer task at the second processing frequency. The first processing frequency is lower than the second processing frequency. That is, in stage B, if the number of tokens to be processed is small, the NPU of the AI chip is downclocked; if the number of tokens to be processed is large, the NPU of the AI chip is upclocked or restored to the normal processing frequency. In this way, when the number of tokens to be processed is small, the power consumption of the AI chip can be reduced.
[0136] In one possible embodiment, after stage B is exited and stage C is entered, the host issues a second processing instruction to the CPU Linux software through the host software to apply the second processing frequency.
[0137] The threshold for the number of tokens mentioned above can be determined by testing the actual model, ensuring that performance is not affected; that is, the delay before the first token should not exceed 2-3 seconds. The first processing frequency mentioned above can be determined based on the performance of the actual model; in this application, the first processing frequency is lower than the normal processing frequency.
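The token-count decision and the threshold testing described above can be illustrated with a minimal sketch. All names, frequency values, and the default threshold are hypothetical, and `calibrate_threshold` is only one possible way to turn first-token latency measurements taken at the first processing frequency into a threshold. It is conservative in that it trusts only token counts that were actually measured.

```python
# Hypothetical values; the description fixes only FIRST < SECOND.
FIRST_PROCESSING_FREQ_MHZ = 600
SECOND_PROCESSING_FREQ_MHZ = 1000
QUANTITY_THRESHOLD = 256


def select_processing_instruction(num_tokens, threshold=QUANTITY_THRESHOLD):
    """Pick the processing instruction the host issues for the processing stage."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if num_tokens < threshold:
        return ("first_processing_instruction", FIRST_PROCESSING_FREQ_MHZ)
    # "Not less than" the threshold: use the higher (or normal) frequency.
    return ("second_processing_instruction", SECOND_PROCESSING_FREQ_MHZ)


def calibrate_threshold(first_token_latency_s, max_latency_s):
    """Derive a threshold from measurements taken at the first processing frequency.

    first_token_latency_s maps a measured token count to its first-token latency
    in seconds. Token counts are accepted in increasing order until one exceeds
    the latency budget; inputs below the returned threshold stay within budget.
    """
    threshold = 0
    for tokens in sorted(first_token_latency_s):
        if first_token_latency_s[tokens] > max_latency_s:
            break
        threshold = tokens + 1
    return threshold


# Made-up measurements, for illustration only.
measured = {64: 0.8, 128: 1.5, 256: 2.9, 512: 5.5}
threshold = calibrate_threshold(measured, max_latency_s=3.0)
print(threshold, select_processing_instruction(100, threshold))
```

Because the comparison is strictly "less than", an input whose token count equals the threshold is handled at the second processing frequency, matching the "not less than" wording of the description.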
[0138] Optionally, after the AI chip enters the second low-power mode, a third processing instruction can be generated when the large language model is called; the third processing instruction is sent to the edge terminal so that when the edge terminal receives the third processing instruction from the host terminal during the process of controlling the AI chip to perform the processing stage tasks and the decoding stage tasks, it controls the AI chip to enter the third low-power mode; in the third low-power mode, it waits for an interrupt signal to be generated, and if an interrupt signal is received, it controls the AI chip to exit the third low-power mode.
[0139] In this embodiment of the invention, when the large language model is invoked, a pre-run check is required. This pre-run check is completed on the host side. At this time, the AI chip is in an idle state and can enter the third low-power mode to further reduce the power consumption of the AI chip. Invoking the large language model includes forward inference (processed by the NPU) and model execution (module.run). Invoking the large language model can be achieved by calling the model execution interface. Therefore, before NPU processing, the model execution interface needs to be called to perform the pre-run check; before module.run, the model execution interface needs to be called to perform the pre-run check.
[0140] Referring to Figure 7, the Prefill process performs forward inference (processed by the NPU), and the Decode process performs model execution (module.run). Before NPU processing, the model execution interface needs to be called to perform pre-run checks. Before module.run, the model execution interface also needs to be called to perform pre-run checks. These pre-run checks are completed on the host (or main controller). Taking an x86 main controller as an example, during the pre-run checks, the AI chip can enter a third low-power mode. When the host calls the model execution interface, it needs to notify the AI chip to enter the third low-power mode; upon receiving this notification, the AI chip needs to immediately enter the third low-power mode; after calling the model execution interface, the host needs to notify the AI chip to exit the third low-power mode.
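The notification sequence around the pre-run check can be illustrated with a minimal host-side sketch. The `ChipLink` class, the instruction strings, and the function names are hypothetical stand-ins, not an actual interface of the AI chip. The sketch also assumes that the chip returns to the second low-power mode after the interrupt, which the description does not state explicitly.

```python
from contextlib import contextmanager


class ChipLink:
    """Hypothetical host-side handle that records what is sent to the AI chip."""

    def __init__(self):
        self.mode = "second_low_power"
        self.sent = []

    def send(self, instruction):
        self.sent.append(instruction)
        if instruction == "third_processing_instruction":
            self.mode = "third_low_power"
        elif instruction == "interrupt":
            # Assumption: the chip resumes in the second low-power mode.
            self.mode = "second_low_power"


@contextmanager
def chip_idle_during_check(link):
    """Keep the AI chip in the third low-power mode while the host works."""
    link.send("third_processing_instruction")
    try:
        yield
    finally:
        # Always wake the chip, even if the pre-run check raises.
        link.send("interrupt")


def call_model_execution_interface(link, pre_run_check, run):
    """Run the host-side pre-run check with the chip idle, then execute the model."""
    with chip_idle_during_check(link):
        pre_run_check()
    return run()


link = ChipLink()
observed = []
result = call_model_execution_interface(
    link,
    pre_run_check=lambda: observed.append(link.mode),
    run=lambda: link.mode,
)
print(observed, result, link.sent)
```

Wrapping the check in a context manager means the exit notification is sent on every path, so a failed pre-run check cannot leave the chip waiting for an interrupt that never arrives.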
[0141] The embodiments of the present invention address the high power consumption, poor heat dissipation, and power supply problems that arise in actual deployments of customer products when large language models run on AI chips. Specifically, they reduce system power consumption and thereby mitigate or resolve these problems. With power consumption optimized through the embodiments of the present invention, the energy efficiency ratio and competitiveness of chip products can be improved.
[0142] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0143] The above description discloses only preferred embodiments of the present invention and should not be construed as limiting the scope of the present invention. Therefore, equivalent variations made in accordance with the claims of the present invention are still within the scope of the present invention.
Claims
1. A control method for an AI chip, characterized in that, The AI chip is located at the edge, which is communicatively connected to the host. The AI chip and the host jointly perform a question-answering task based on a large language model. The question-answering task, in execution order, includes a question-asking phase, a processing phase, and a decoding phase. The host executes the question-asking phase, and the AI chip executes the processing and decoding phases. The control method for the AI chip at the edge includes the following steps: When the wake-up command is received from the host, the AI chip is controlled to exit the first low-power mode and enter the second low-power mode. The wake-up command is generated and sent to the AI chip when the host completes the questioning phase task. The first low-power mode is used to wait for the host to complete the questioning phase task in a sleep state. The second low-power mode is used to control the AI chip to execute the processing phase task and the decoding phase task in a manner lower than the normal operating power consumption. In the second low-power mode, the AI chip is controlled to execute the processing stage task and the decoding stage task; After the AI chip completes the decoding stage task, it sends the task result to the host and controls the AI chip to exit the second low-power mode and enter the first low-power mode.
2. The control method for the AI chip as described in claim 1, characterized in that, In the second low-power mode, controlling the AI chip to execute the processing stage task and the decoding stage task includes: In the second low-power mode, when the first memory instruction is received from the host, the AI chip is controlled to run at a first memory frequency to execute the processing stage task, and the first memory instruction corresponds to the first memory frequency; When the second memory instruction is received from the host, the AI chip is controlled to run at a second memory frequency to execute the decoding stage task. The second memory instruction corresponds to the second memory frequency, and the first memory frequency is lower than the second memory frequency.
3. The control method for the AI chip as described in claim 1, characterized in that, In the second low-power mode, controlling the AI chip to execute the processing stage task and the decoding stage task includes: In the second low-power mode, when the first processing instruction is received from the host, the AI chip is controlled to execute the processing stage task at a first processing frequency. The first processing instruction corresponds to the first processing frequency. The first processing instruction is generated by the host when the number of text units to be processed in the processing stage task is less than a quantity threshold. When the second processing instruction is received from the host, the AI chip is controlled to execute the processing stage task at a second processing frequency. The second processing instruction corresponds to the second processing frequency, and the first processing frequency is lower than the second processing frequency. The second processing instruction is generated by the host when the number of text units to be processed in the processing stage task is not less than the quantity threshold.
4. The control method for the AI chip as described in claim 1, characterized in that, In the second low-power mode, controlling the AI chip to execute the processing stage tasks and the decoding stage tasks includes: In the second low-power mode, when the AI chip is controlled to execute the processing stage task and the decoding stage task, and a third processing instruction is received from the host, the AI chip is controlled to enter the third low-power mode. The third processing instruction is generated by the host when the large language model is called. In the third low-power mode, it waits for an interrupt signal to be generated. If an interrupt signal is received, it controls the AI chip to exit the third low-power mode.
5. The control method for the AI chip as described in claim 4, characterized in that, The control of the AI chip to enter the third low-power mode includes: The low-power resource operations of the AI chip are executed in static random access memory by using a preset low-power function.
6. A signal processing method, characterized in that, For use on a host device, the host device is communicatively connected to an edge device. The AI chip and the host device jointly execute a question-answering task of a large language model. The question-answering task includes, in execution order, a question-asking phase task, a processing phase task, and a decoding phase task. The host device executes the question-asking phase task, and the AI chip executes the processing phase task and the decoding phase task. The signal processing method on the host device includes: Upon completion of the questioning phase task, a wake-up command is generated; The wake-up command is sent to the edge device so that when the edge device receives the wake-up command from the host device, it controls the AI chip to exit the first low-power mode and enter the second low-power mode. The wake-up command is generated and sent to the AI chip when the host device completes the questioning phase task. The first low-power mode is used to wait for the host device to complete the questioning phase task in a sleep state. The second low-power mode is used to control the AI chip to execute the processing phase task and the decoding phase task in a manner lower than the normal operating power. In the second low-power mode, the AI chip is controlled to execute the processing phase task and the decoding phase task. When the AI chip completes the decoding phase task, it sends the task result to the host device and controls the AI chip to exit the second low-power mode and enter the first low-power mode.
7. The signal processing method as described in claim 6, characterized in that, After the AI chip enters the second low-power mode, the method further includes: Before the AI chip executes the processing stage task, a first memory instruction is generated; The first memory instruction is sent to the edge device so that when the edge device receives the first memory instruction from the host device, it controls the AI chip to run at a first memory frequency to execute the processing stage task. The first memory instruction corresponds to the first memory frequency. Before the AI chip executes the decoding stage task, a second memory instruction is generated; The second memory instruction is sent to the edge device so that when the edge device receives the second memory instruction from the host device, it controls the AI chip to run at a second memory frequency to execute the decoding stage task. The second memory instruction corresponds to the second memory frequency, and the first memory frequency is lower than the second memory frequency.
8. The signal processing method as described in claim 6, characterized in that, After the AI chip enters the second low-power mode, the method further includes: Obtaining the number of text units that the processing stage task needs to process; When the number of text units that the processing stage task needs to process is less than a quantity threshold, a first processing instruction is generated. The first processing instruction is sent to the edge device so that when the edge device receives the first processing instruction from the host device, it controls the AI chip to execute the processing stage task at a first processing frequency, wherein the first processing instruction corresponds to the first processing frequency. When the number of text units that the processing stage task needs to process is not less than the quantity threshold, a second processing instruction is generated. The second processing instruction is sent to the edge device so that when the edge device receives the second processing instruction from the host device, it controls the AI chip to execute the processing stage task at a second processing frequency. The second processing instruction corresponds to the second processing frequency, and the first processing frequency is lower than the second processing frequency.
9. The signal processing method as described in claim 6, characterized in that, After the AI chip enters the second low-power mode, the method further includes: When the large language model is invoked, a third processing instruction is generated; The third processing instruction is sent to the edge device so that when the edge device receives the third processing instruction from the host device during the process of controlling the AI chip to perform the processing stage task and the decoding stage task, it controls the AI chip to enter the third low-power mode. In the third low-power mode, it waits for an interrupt signal to be generated. If an interrupt signal is received, it controls the AI chip to exit the third low-power mode.
10. A task processing system, characterized in that, The task processing system includes an edge terminal and a host terminal. The host terminal is communicatively connected to the edge terminal. The AI chip and the host terminal are used together to execute a question-answering task of a large language model. The question-answering task includes a questioning phase task, a processing phase task, and a decoding phase task in the execution order. The host terminal is used to execute the questioning phase task, the AI chip is used to execute the processing phase task and the decoding phase task, the edge terminal is used to execute the control method of the AI chip as described in any one of claims 1 to 5, and the host terminal is used to execute the signal processing method as described in any one of claims 6 to 9.