Large model text parallel generation system and method for edge computing device

By implementing parallel processing for large-model text generation on edge computing devices and utilizing heterogeneous scheduling and pipelined strategies of computing chips, the problems of response latency and resource idleness in high-concurrency scenarios are solved, thereby improving the text generation efficiency and user experience of edge computing devices.

CN122197905APending Publication Date: 2026-06-12SHANGHAI XIANGCHENG COMM TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI XIANGCHENG COMM TECH CO LTD
Filing Date
2026-05-13
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Edge computing devices are inefficient at generating large model text in high-concurrency scenarios, resulting in a poor user experience, especially during multi-round intensive interactions and high-concurrency requests, where hardware resources are idle and response latency is severe.

Method used

A large-model text parallel generation system for edge computing devices is adopted. Through an interface packaging module, a user request scheduling module, and a large-model text generation module, parallel processing of the pre-filling and decoding stages is achieved. The system utilizes the heterogeneous scheduling and pipelined squeezing strategy of computing chips to prioritize the processing of high-priority requests.

🎯Benefits of technology

It has significantly improved the text generation efficiency of edge computing devices, reduced user waiting time, improved system response performance, solved the lag problem in high-concurrency scenarios, and enhanced user experience.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122197905A_ABST
    Figure CN122197905A_ABST
Patent Text Reader

Abstract

The application provides a large model text parallel generation system and method for edge computing devices. The large model text parallel generation system comprises an interface packaging module; a large model text generation module configured to generate a large model text based on an inference operation, the inference operation comprising a pre-padding stage and a decoding stage; and a user request scheduling module configured to receive a text generation request sent by a user through the interface packaging module, schedule a text generation request with the highest current processing priority to enter the pre-padding stage, schedule a next text generation request with the highest current processing priority to enter the pre-padding stage after the text generation request enters the decoding stage, and feed back the large model text generated by the large model text generation module to the user through the interface packaging module. Based on the large model text parallel generation system, the large model text parallel generation method can realize parallel generation of large model texts on edge computing devices.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention belongs to the field of edge computing and large model text generation, and more specifically, relates to a parallel generation system and method for large model text for edge computing devices. Background Technology

[0002] In recent years, with the significant improvement in the performance of edge computing devices, deploying large language models on edge computing devices for text generation has become an industry trend. This localized deployment model significantly reduces deployment costs and network resource consumption by avoiding cloud data transmission, while shortening text generation response time from several seconds in the cloud model to sub-seconds, effectively meeting the needs of low-latency scenarios such as intelligent customer service and real-time translation. However, due to the inherent bottlenecks of edge computing devices in terms of memory capacity and computing power, the large model text generation currently deployed on edge computing devices still generally adopts a serial queuing processing mechanism. That is, when multiple user requests arrive simultaneously, the device must process them one by one in sequence, resulting in severe request queue accumulation in high-concurrency scenarios such as real-time replies to live stream comments and intelligent customer service during e-commerce promotions. The average response latency can reach tens of seconds, greatly affecting the user experience.

[0003] In typical scenarios of voice-based question answering, the limitations of serial processing mechanisms are particularly pronounced. These scenarios typically involve multiple rounds of intensive text generation interactions, such as question correction after speech-to-text conversion, user intent recognition, and answer generation. Each round of interaction requires rapid processing by a large language model. Once high concurrency requests occur, the latency of a single round of text generation accumulates with each round, leading to a significant decrease in the efficiency of the entire question-and-answer process. Furthermore, due to a lack of parallel processing capabilities, edge computing devices cannot simultaneously handle multiple consecutive questions from the same user or concurrent voice requests from multiple users. This not only results in idle hardware resources but also causes noticeable stuttering in voice interaction, severely impacting the user experience. Summary of the Invention

[0004] In view of this, the present invention proposes a large-model text parallel generation system and method for edge computing devices.

[0005] According to a first aspect of the present invention, a large-model text parallel generation system for edge computing devices is provided, the system comprising the following functional modules:

[0006] Interface packaging module;

[0007] A large model text generation module is used to generate large model text based on inference operations, which include a pre-filling stage and a subsequent decoding stage;

[0008] The user request scheduling module is used to receive text generation requests from users through the interface packaging module, schedule the text generation request with the highest processing priority to enter the pre-filling stage, and after the text generation request enters the decoding stage, schedule the next text generation request with the highest processing priority to enter the pre-filling stage, and feed back the large model text generated by the large model text generation module to the user through the interface packaging module.

[0009] Optionally, the interface wrapper module is configured to provide a standardized interface calling method;

[0010] The interface packaging module supports web service interfaces and system development language interfaces.

[0011] Optionally, the large model text generation module is configured to complete the large model text generation task based on the text generation request by scheduling the computing power chip of the edge computing device, and can monitor and provide feedback on the pre-filling stage and the decoding stage in the large model text generation process in real time.

[0012] Optionally, the large model text generation module is configured to: prioritize the neural network processor in the computing chip for computation during the computationally intensive pre-filling stage; and distribute the computational tasks to the general-purpose processing unit for execution according to the real-time load during the decoding stage.

[0013] Optionally, the large model text generation module includes:

[0014] The model loading submodule is used to initialize and load the large language model, and deploy the model data to the computing chip of the edge computing device so that the large language model can be put into a usable state.

[0015] The computation submodule is used to perform computational tasks in the pre-filling and decoding stages;

[0016] The reset submodule is used to reset and clean up the system state after each large model text generation task is completed, releasing the occupied computing resources.

[0017] Optionally, the reset submodule resets and cleans up the system state, releasing the occupied computing resources by forcibly releasing the video memory space and KV Cache buffer, and erasing temporary computational remnants and user information traces in the inference engine.

[0018] Optionally, the user request scheduling module calculates the current processing priority based on multi-dimensional indicators, including: the urgency of the text generation request, the estimated token output length, the user permission level, and batch processing tendency.

[0019] According to a second aspect of the present invention, a method for parallel generation of large-model text for edge computing devices is provided. This method is implemented based on any of the aforementioned parallel generation systems for large-model text for edge computing devices, and specifically includes the following steps:

[0020] Receive text generation requests from users;

[0021] The highest-priority text generation request is scheduled to enter the pre-filling stage, and after that text generation request enters the decoding stage, the next highest-priority text generation request is scheduled to enter the pre-filling stage.

[0022] The beneficial effects of this invention are as follows:

[0023] The large model text parallel generation system for edge computing devices of the present invention includes: an interface packaging module; a large model text generation module, used to generate large model text based on inference operations, the inference operations including a pre-filling stage and a subsequent decoding stage; and a user request scheduling module, used to receive text generation requests sent by users through the interface packaging module, schedule the text generation request with the highest current processing priority to enter the pre-filling stage, and after the text generation request enters the decoding stage, schedule the next text generation request with the highest current processing priority to enter the pre-filling stage, and feed back the large model text generated by the large model text generation module to the user through the interface packaging module.

[0024] This invention presents a large-model text parallel generation system for edge computing devices, which innovatively transforms the serial text generation process into a parallel mode, fully considering the differences in characteristics between the pre-filling and decoding stages in large-model text generation. In the pre-filling stage, the powerful parallel computing capabilities of the computing chip are utilized to quickly complete the computationally intensive initial calculations. When a text generation request enters the low-computation-intensity, long-duration decoding stage for streaming output, the next text generation request is immediately scheduled to enter the pre-filling stage. This scheduling strategy ensures that the edge computing device executes only one pre-filling stage operation at a time and can process the output tasks of multiple decoding stages in parallel, achieving fine-grained dynamic allocation of computing resources. Practical verification has shown that this parallel mode can significantly improve the text generation efficiency of edge computing devices, greatly shorten user waiting time, significantly enhance system response performance, and provide users with a smooth and efficient user experience.

[0025] The parallel text generation method for large models for edge computing devices of the present invention belongs to the same general inventive concept as the above-mentioned parallel text generation system for large models for edge computing devices, and has at least the same beneficial effects as the above-mentioned parallel text generation system for large models for edge computing devices, the beneficial effects of which will not be elaborated here.

[0026] Other features and advantages of the present invention will be described in detail in the following detailed description section. Attached Figure Description

[0027] The present invention can be better understood by referring to the following description taken in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout the drawings to denote the same or similar parts.

[0028] Figure 1 A block diagram of a large-model text parallel generation system for edge computing devices according to an embodiment of the present invention is shown.

[0029] Figure 2 A flowchart illustrating the implementation of a large-model text parallel generation method for edge computing devices according to an embodiment of the present invention is shown. Detailed Implementation

[0030] To enable those skilled in the art to more fully understand the technical solutions of the present invention, exemplary embodiments of the present invention will be described more comprehensively and in detail below with reference to the accompanying drawings. Obviously, the one or more embodiments of the present invention described below are merely one or more specific ways to implement the technical solutions of the present invention, and are not exhaustive. It should be understood that other ways belonging to a general inventive concept can be used to implement the technical solutions of the present invention, and should not be limited to the embodiments described exemplary. Based on one or more embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0031] Example: Figure 1 A block diagram of a large-model text parallel generation system for edge computing devices, according to an embodiment of the present invention, is shown. (Refer to...) Figure 1 The large-model text parallel generation system for edge computing devices according to embodiments of the present invention includes the following functional modules:

[0032] Interface packaging module;

[0033] The large model text generation module is used to generate large model text based on inference operations, which include a pre-filling stage and a subsequent decoding stage.

[0034] The user request scheduling module is used to receive text generation requests from users through the interface packaging module, schedule the text generation request with the highest processing priority to enter the pre-filling stage, and after the text generation request enters the decoding stage, schedule the next text generation request with the highest processing priority to enter the pre-filling stage, and feed back the large model text generated by the large model text generation module to the user through the interface packaging module.

[0035] The priority can be based on: the type of task requested (e.g., real-time interaction is preferred over background processing), the predicted length of the requested token (short text is preferred), the user level, or a "continuous batching" preference to maximize computing power utilization.

[0036] Specifically, the user request scheduling module calculates the current processing priority based on multi-dimensional indicators, including but not limited to: the urgency of the text generation request, the estimated token output length, and the permission level of the user to whom each request belongs. When the system detects that the computing power usage of the current task has shifted from high load in the Prefill phase to low load in the Decode phase, the scheduling algorithm immediately activates the idle computing power window, pulls the next request in the priority queue into the Prefill operation, and maximizes throughput with limited computing power resources through this pipeline-like squeezing.

[0037] Furthermore, in this embodiment of the invention, the interface packaging module is configured to provide a standardized interface calling method;

[0038] The interface packaging module supports web service interfaces and system development language interfaces.

[0039] Furthermore, in this embodiment of the invention, the large model text generation module is configured to complete the large model text generation task based on the text generation request by scheduling the computing power chip of the edge computing device, and can monitor and provide feedback on the pre-filling stage and the decoding stage in the large model text generation process in real time.

[0040] Furthermore, in this embodiment of the invention, the large model text generation module includes:

[0041] The model loading submodule is used to initialize and load the large language model, and deploy the model data to the computing chip of the edge computing device so that the large language model can be put into a usable state.

[0042] The computation submodule is used to perform computational tasks in the pre-filling and decoding stages;

[0043] The reset submodule is used to reset and clean up the system state after each large model text generation task is completed, releasing the occupied computing resources.

[0044] Specifically, in this embodiment of the invention, the large-model text generation module serves as the core of the system. It efficiently schedules dedicated computing chips from edge computing devices to complete text generation tasks based on user requests and can monitor and provide real-time feedback on the prefill and decoding stages of the text generation process. The computing chips include, but are not limited to, neural network processors (NPUs), graphics processing units (GPUs), or dedicated AI acceleration units in edge SoCs. The large-model text generation module possesses heterogeneous awareness and scheduling capabilities. During the computationally intensive Prefill stage, it prioritizes scheduling NPU cores for matrix parallel computation. During the low-computational-intensive and long-duration Decode stage, it can distribute some computational tasks to general-purpose processing units based on real-time load to optimize overall energy efficiency and free up dedicated acceleration units for the next Prefill task.

[0045] In this embodiment, the large model text generation module is further divided into three sub-modules:

[0046] Model loading submodule: Responsible for the initialization and loading of large language models, efficiently deploying model data to the computing chips of edge computing devices, and ensuring that the model quickly enters a usable state.

[0047] The computation submodule undertakes the core computational tasks in the Prefill and Decode stages. In the Prefill stage, it fully utilizes the parallel computing capabilities of the computing chip to quickly complete the initial calculation at a rate of hundreds of tokens per second. In the Decode stage, it adopts a streaming output strategy, performing calculations on each token and outputting them one by one to ensure the continuity of text generation.

[0048] During the Decode phase, the computation submodule synchronously enables a key-value cache (KV Cache) management mechanism to dynamically allocate independent memory buffers for each parallel text generation task. This management mechanism indexes and isolates the context states of different tasks in real time based on the remaining GPU / RAM capacity of the edge computing device. This ensures that when the next request enters the Prefill phase and consumes computing power, the states of multiple tasks currently in the Decode phase are not overwritten, thus supporting macroscopic multi-task parallel output.

[0049] The reset submodule resets and cleans up the system state after each text generation task, releasing occupied computing resources to prepare for subsequent tasks. After each large model text generation task is completely finished, the reset submodule performs a physical reset and logical cleanup of the system state: on the one hand, it forcibly releases the GPU memory and KVCache buffer occupied by the task, ensuring the continuous availability of the memory resource pool; on the other hand, it erases temporary computational remnants and user information traces in the inference engine, preventing contextual interference between different user requests, thereby ensuring the privacy and security of edge data while satisfying parallel processing efficiency.

[0050] It's worth noting that the Prefill and Decode phases differ significantly in processing speed and computational power consumption. The Prefill phase is extremely fast but consumes a large amount of computational resources, while the Decode phase, although slower and less computationally efficient, needs to run continuously. The speed difference in token processing per unit time can be tens of times. Furthermore, the Prefill phase utilizes the matrix operation capabilities of the NPU / GPU, while the Decode phase, because it generates tokens one by one, can dynamically switch to low-power cores based on the computational load, thus achieving better energy efficiency.

[0051] Specifically, in this embodiment of the invention, the user request scheduling module is responsible for the entire process management of user requests. On the one hand, it receives user-submitted requests through the interface packaging module and rationally plans the priority of request tasks based on the current computing power load and task progress of the system, ensuring that the requests can be transmitted to the large model text generation module as quickly as possible; on the other hand, it promptly feeds back the generated text to the user.

[0052] The user request scheduling module employs an innovative task scheduling strategy: once a text generation request completes its Prefill phase and enters the Decode phase, the next text generation request is immediately scheduled to enter the Prefill phase. This scheduling method ensures that the edge computing device always maintains only one text generation request in the Prefill operation state at any given time, while simultaneously processing multiple requests in the Decode phase in parallel, achieving efficient utilization of computing resources and parallel task processing.

[0053] Specifically, in this embodiment of the invention, to lower the barrier to entry for users and improve system usability and compatibility, the interface packaging module provides a standardized interface calling method. It supports two types of interfaces: one is a web service interface, covering mainstream standards such as the Ollam protocol interface and the OpenAI protocol interface, facilitating user calls via protocols such as HTTP; the other is a system development language interface, including Android SDK, iOS SDK, etc., enabling developers to integrate text generation services into mobile applications and other scenarios.

[0054] Accordingly, based on the large model text parallel generation system for edge computing devices proposed in the embodiments of the present invention, the embodiments of the present invention also propose a large model text parallel generation method for edge computing devices, which is implemented based on the large model text parallel generation system for edge computing devices proposed above.

[0055] Figure 2 A flowchart illustrating the implementation of a large-model text parallel generation method for edge computing devices according to an embodiment of the present invention is shown. (Refer to...) Figure 2 The large-model text parallel generation method for edge computing devices according to embodiments of the present invention includes the following steps:

[0056] Step S100: Receive a text generation request from the user;

[0057] Step S200: Schedule the text generation request with the highest current processing priority to enter the pre-filling stage, and after the text generation request enters the decoding stage, schedule the next text generation request with the highest current processing priority to enter the pre-filling stage.

[0058] Specifically, in this embodiment of the invention, the user submits a text generation request through the interface packaging module. After receiving the request, the user request scheduling module prioritizes tasks according to the current system state and sends the request to the large model text generation module. Upon receiving the request, the large model text generation module completes model preparation through the model loading submodule, performs rapid processing in the Prefill stage through the computation submodule, and then enters the Decode stage for streaming output. While the current request enters the Decode stage, the user request scheduling module immediately starts the Prefill stage for the next request, continuously looping this process to achieve parallel processing of large model text generation on edge computing devices, significantly improving the overall text generation efficiency.

[0059] In one optional embodiment, the system can be applied to digital human interaction scenarios. Users submit voice or text interaction requests via smart terminals or mobile applications. After voice recognition or text input, multiple text generation requests are generated, including a main response text generation request, an intent explanation text generation request, a user preference description text generation request, and a product introduction text generation request. The user request scheduling module prioritizes the main response text generation request for pre-filling according to a preset priority. When this request enters the decoding stage and is streamed, the user request scheduling module schedules the next text generation request for pre-filling, enabling the edge computing device to perform at most one pre-filling operation at any given time and to process multiple text generation requests in the decoding stage in parallel.

[0060] The embodiments of the present invention have the following beneficial effects:

[0061] I. Highly efficient parallel processing mechanism achieves double the efficiency

[0062] Traditional text generation processes often employ a serial model, where tasks must queue for processing, leading to underutilization of edge computing resources and low processing efficiency. This invention innovatively transforms the serial text generation process into a parallel model, fully considering the differences between the Prefill and Decode stages in large-scale model text generation. In the Prefill stage, the powerful parallel computing capabilities of the computing chip are used to quickly complete the computationally intensive initial calculations. When a request enters the low-computation-intensity, long-duration Decode stage for streaming output, the next request is immediately scheduled to enter the Prefill stage. This unique scheduling strategy ensures that the edge device executes only one Prefill operation at a time, while simultaneously processing multiple Decode output tasks in parallel, achieving fine-grained dynamic allocation of computing resources. Practical verification shows that this parallel model can significantly improve the text generation efficiency of edge computing devices, greatly shorten user waiting time, significantly enhance system response performance, and provide users with a smooth and efficient user experience. Furthermore, this parallel model effectively solves the 'lag' problem of edge devices handling high-concurrency voice requests or multi-round intensive interactions. By leveraging the asymmetric computing power requirements of large-model inference, this invention achieves a deep integration of 'Prefill preemptive computation' and 'Decode continuous output'. Experimental results show that, under the same hardware configuration, this system significantly improves the total token throughput per unit time compared to the traditional serial mode, and maintains a sub-second time-to-first-Token (TTFT) latency under high-load scenarios, greatly optimizing the human-computer interaction experience at the edge.

[0063] II. Standardized interface encapsulation expands diverse application scenarios

[0064] This invention provides Web API and Native API interface encapsulation based on standard protocols through an interface packaging module, greatly expanding the application boundaries and compatibility of the system. The Web API covers mainstream industry standards such as the Ollam protocol interface and the OpenAI protocol interface. Users can easily call the text generation service using the universal HTTP protocol, easily adapting to web application development, cloud service integration, and cross-platform data interaction based on the API. The Native API, such as the Android SDK and iOS SDK, provides mobile application developers with technical interfaces to directly embed text generation functionality, supporting seamless integration into native application scenarios such as mobile apps and smart terminals, meeting diverse needs such as mobile office, intelligent customer service, and content creation. This standardized interface design lowers the barrier to entry for system use. Developers do not need to deeply understand the complex underlying technical implementation; they only need to select the appropriate interface according to their application scenario to quickly complete system integration and function development, effectively promoting the widespread application and practical implementation of large-model text generation technology in different fields.

[0065] While one or more embodiments of the present invention have been described above, those skilled in the art will recognize that the present invention can be implemented in any other form without departing from its spirit and scope. Therefore, the embodiments described above are illustrative and not restrictive, and many modifications and substitutions will be apparent to those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A large-model text parallel generation system for edge computing devices, characterized in that, include: Interface packaging module; A large model text generation module is used to generate large model text based on inference operations, which include a pre-filling stage and a subsequent decoding stage; The user request scheduling module is used to receive text generation requests sent by users through the interface packaging module, schedule the text generation request with the highest processing priority to enter the pre-filling stage, and after the text generation request enters the decoding stage, schedule the next text generation request with the highest processing priority to enter the pre-filling stage, and feed back the large model text generated by the large model text generation module to the user through the interface packaging module.

2. The large-model text parallel generation system for edge computing devices according to claim 1, characterized in that, The interface packaging module is configured to provide a standardized interface calling method; The interface packaging module supports web service interfaces and system development language interfaces.

3. The large-model text parallel generation system for edge computing devices according to claim 1, characterized in that, The large model text generation module is configured to complete the large model text generation task based on the text generation request by scheduling the computing power chip of the edge computing device, and can monitor and provide feedback on the pre-filling stage and the decoding stage in the large model text generation process in real time.

4. The large-model text parallel generation system for edge computing devices according to claim 3, characterized in that, The large model text generation module is configured to: prioritize the neural network processor in the computing chip for computation during the high-computation-power-consuming pre-filling stage; and distribute the computational tasks to the general-purpose processing unit for execution according to the real-time load during the decoding stage.

5. The large-model text parallel generation system for edge computing devices according to claim 4, characterized in that, The large model text generation module includes: The model loading submodule is used to initialize and load the large language model, and deploy the model data to the computing chip of the edge computing device so that the large language model can be put into a usable state. The computation submodule is used to perform the computation tasks of the pre-filling stage and the decoding stage; The reset submodule is used to reset and clean up the system state after each large model text generation task is completed, releasing the occupied computing resources.

6. The large-model text parallel generation system for edge computing devices according to claim 5, characterized in that, The reset submodule resets and cleans up the system state, releasing occupied computing resources including: forcibly releasing video memory space and KV Cache buffer, and erasing temporary computation residues and user information traces in the inference engine.

7. The large-model text parallel generation system for edge computing devices according to claim 5, characterized in that, The user request scheduling module calculates the current processing priority based on multi-dimensional indicators, including: the urgency of the text generation request, the estimated token output length, the user permission level, and batch processing tendency.

8. A method for parallel generation of large-model text for edge computing devices, characterized in that, Implemented based on the large model text parallel generation system according to any one of claims 1-7; The large-model text parallel generation method includes: Receive text generation requests from users; The text generation request with the highest current processing priority is scheduled to enter the pre-filling stage, and after the text generation request enters the decoding stage, the next text generation request with the highest current processing priority is scheduled to enter the pre-filling stage.