Method for scheduling large language model computing power cluster and related device thereof

By introducing a scheduling kernel layer into the large language model computing power cluster, managing attention threads and performing persistent storage, the problems of insufficient scheduling granularity and programming paradigm limitations in existing technologies are solved, achieving efficient and reliable computing power scheduling and state management, and improving the application performance of large language models.

CN122240264A · Pending · Publication Date: 2026-06-19 · SHENZHEN SHENMA INNOVATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHENZHEN SHENMA INNOVATION TECHNOLOGY CO LTD
Filing Date: 2026-03-18
Publication Date: 2026-06-19

Application Information

IPC: G06F9/48; G06F9/54; G06N3/045; G06N5/04; G06N3/006

AI Technical Summary

⚠Technical Problem

Existing technologies lack a general scheduling service layer between the computing power interface layer and the application interface layer of large language models. This results in each upper-layer framework having to handle infrastructure such as computing power scheduling, state management, and fault recovery on its own. This leads to insufficient scheduling granularity, limited programming paradigms, insufficient granularity of state persistence, lack of quality assurance mechanisms, and failure to fully utilize the statelessness of large language models.

⚗Method used

The patent proposes a scheduling method for a large language model computing power cluster. A scheduling kernel layer manages attention threads and achieves priority-based preemptive scheduling within a single agent. Thread control blocks provide persistent storage, which supports externalized state and optimal single-step input. A three-layer architecture consisting of a computing power service layer, a scheduling kernel layer, and an application interface layer is established, in which a deterministic kernel program manages threads automatically.

🎯Benefits of technology

The method implements priority scheduling of attention threads within a single agent, refines scheduling granularity, lowers the development threshold, provides state recovery at the stage and parameter level together with acceptance condition verification, and improves the reliability and efficiency of large language model applications.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a scheduling method and related equipment for a large language model computing power cluster. According to a deterministic kernel program, the highest-priority thread is loaded from the thread control block. This thread performs a context switch and assembles the input content to obtain the current complete input data, which is then sent to the computing power service layer to be input into the target large language model instance to obtain the current inference result. The result is then verified according to acceptance criteria. If the verification result is deemed successful, the current execution stage corresponding to the highest-priority thread in the persistent storage space is updated accordingly, and the corresponding result is stored. This invention, through a deterministic program in the scheduling kernel layer managing the creation, scheduling, switching, and destruction of attention threads, externalizes all execution states to the thread control block for persistence, achieving preemptive scheduling of attention threads within a single agent, with scheduling granularity extending to multiple concurrent task flows within the agent.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a scheduling method and related equipment for large language model computing power clusters.

Background Technology

[0002] In recent years, agent technology based on Large Language Models (LLMs) has developed rapidly. Large language models process input text within a context window through an attention mechanism to generate natural language output. Agent systems combine large language models with capabilities such as tool invocation, external knowledge, and task planning to achieve autonomous execution of complex tasks.

[0003] However, large language models based on attention mechanisms have three fundamental drawbacks: First, attention is scattered, as all content in the context window is focused on, making it impossible to distinguish between the information needed for the current task and irrelevant noise, resulting in a decrease in output quality as the context expands; second, they cannot maintain focus for long periods, as each inference starts from scratch and does not retain cross-call state, and complex tasks may require hundreds of rounds of interaction, but large language models may lose task objectives and progress information in each round; third, they cannot reliably self-manage, as the agent decides on its own what to focus on, track progress, and switch strategies, forming a circular dependency where both the manager and the executor are unreliable.

[0004] In a public speech in 2023, Karpathy first proposed the LLM operating system metaphor, describing the large language model as a processor-like computing unit. Subsequently, "LLMs as OS, Agents as Apps" (arXiv 2312.03815, December 2023) proposed an ecological model of an intelligent agent operating system. These works established a conceptual connection between the large language model and the operating system, but all remained at the analogy level, without providing specific system architecture designs. It is worth noting that these conceptual works all equate the large language model with the operating system itself or its core components.

[0005] The AIOS system (Rutgers University, arXiv 2403.16971, accepted by COLM 2025) implements a large language model agent operating system, providing functions such as a scheduler (supporting algorithms such as FIFO, round-robin, priority, and shortest job first), context management, memory management, and storage management, and supporting the pausing and resuming of agents. AIOS's scheduling granularity is at the agent request level, determining which agent receives the large language model resource, which is inter-agent scheduling. AIOS embeds the large language model into the kernel layer as the system core, and none of its five versions (v1 to v5, March 2024 to August 2025) involve attention thread scheduling within a single agent.

[0006] AgentOS (arXiv 2602.20934, February 2026) proposes conceptual frameworks such as the Reasoning Kernel and the Cognitive Scheduler, introducing the concept of Priority-based Semantic Scheduling. However, AgentOS is a purely theoretical framework, without providing scheduling algorithm specifications or implementation code, and its scheduling object is the competition for access to the Reasoning Kernel among multiple agents, which still belongs to inter-agent scheduling.

[0007] The MemGPT / Letta system (arXiv 2310.08560, October 2023) is inspired by operating system virtual memory and implements hierarchical memory management (working memory and archive memory) to address the problem of insufficient context window space in large language models. This system focuses on the context space management dimension and does not involve attention scheduling or task switching.

[0008] In terms of programming paradigms, currently commonly used large language model agent frameworks all adopt a code-driven programming paradigm. The LangGraph framework requires developers to define state graphs using Python code and construct fixed directed acyclic graph structures through the add_node and add_edge methods. The CrewAI framework uses Python classes to define Agents and Tasks and orchestrates multi-agent collaborative processes through code. This code-driven paradigm has limitations such as high development barriers, insufficient flexibility, and incompatibility with the capabilities of large language models.

[0009] AWS Strands Agent Standard Operating Procedures (SOPs) (Amazon, open-sourced in November 2025) use Markdown files to define standard operating procedures in natural language. Each SOP file contains a YAML front matter metadata header and a Markdown body. Strands SOPs support parameterized input and multi-stage structured processes, but its Markdown format does not distinguish between attribute sections and instruction sections, lacks runtime acceptance condition verification and automated graded retry mechanisms, and does not have the ability to automatically update thread attributes when switching stages.

[0010] In terms of quality assurance, large language models are probabilistic execution units, and the same input may produce different outputs. SagaLLM (Stanford University, accepted by VLDB 2025 conference) uses the Saga transaction pattern to introduce commit / rollback semantics into multi-agent planning scenarios, achieving transactional consistency guarantees at the planning step level among multiple agents. Its transaction granularity is at the planning step level among multiple agents, without involving the attention thread scheduling within a single agent. The DSPy framework (Stanford NLP Group, arXiv 2310.03714) provides the LM Assertions mechanism to impose constraints on the output of large language models, but its module orchestration still requires Python code implementation.

[0011] Current mainstream agent frameworks (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, etc.) have achieved deterministic control at the level of inter-agent coordination. However, the deterministic control of these frameworks stops at the inter-agent level. Within a single agent, it is still driven by a large language model, that is, the agent manages the context, decides what to focus on, and tracks its own progress.

[0012] Deterministic program control of large language model execution has become an industry trend since 2025. BlueprintFirst (Alibaba, arXiv 2508.02721, August 2025) proposed a deterministic engine to control the execution path of large language models, explicitly stating that "LLM never decides workflow path." STRATUS (arXiv 2506.02009, 2025) uses a deterministic state machine to orchestrate agents. US Patent Application No. 20250110753A1 proposes state machine control of the execution flow of large language model agents. All of the above solutions are workflow-level deterministic control, employing a single state transition path and not involving multi-threaded concurrent scheduling.

[0013] US12346664A1 patent introduces the concepts of concurrent threads, priority specification, and subtask management in generative artificial intelligence. However, its technical solution is a user interaction paradigm, in which users actively initiate new threads and specify priorities during the execution of long tasks. This is user-led interactive management rather than automatic scheduling by the operating system kernel.

[0014] *Lost in the Middle*, a 2023 study published by Stanford University, found that large language models pay significantly more attention to the beginning and end of the context than to the middle. Based on this finding, the Haystack framework's *LostInTheMiddleRanker* and the LangChain framework's *Long Context Reorder* component provide tools that reorder retrieved documents by alternating relevance. However, these are application-layer, one-off document reordering tools that must be invoked actively by the application layer; they do not involve an automatic reordering mechanism at the kernel level.

[0015] In summary, no general scheduling service layer exists between the large language model computing power interface layer and the application interface layer that would pool dispersed, stateless large language model computing power into reliable, schedulable execution resources. Consequently, each upper-layer framework must handle infrastructure issues such as computing power scheduling, state management, and fault recovery independently. Specifically, existing technologies have the following shortcomings: 1) Insufficient scheduling granularity: all existing scheduling systems (AIOS, AgentOS, etc.) remain at the level of inter-agent scheduling and do not reach inside an individual agent to achieve deterministic scheduling at the attention thread level. 2) Limited programming paradigm: existing frameworks all adopt a code-driven paradigm (Python / YAML / JSON), which has a high development threshold, lacks flexibility, and is incompatible with the natural language understanding capabilities of large language models. Although Strands SOPs uses Markdown to define natural language workflows, it lacks a dedicated template file format, stage-level attribute management, and runtime acceptance verification capabilities. 3) Insufficient granularity of state persistence: LangGraph's checkpoints are at the graph node level, and Letta's state persistence is at the dialogue block level; neither can accurately recover the stage and parameters of an attention thread within a single agent. 4) Lack of a quality assurance mechanism: although SagaLLM applies transaction semantics to multi-agent planning and DSPy provides the LM Assertions mechanism, neither offers acceptance condition verification and graded retry mechanisms that are deeply integrated with the life cycle of threads within a single agent. 5) The statelessness of large language models is not fully exploited: existing work treats statelessness as a defect to be remedied (adding memory, context management, or checkpoints) rather than as an opportunity to turn it into an architectural advantage.

Summary of the Invention

[0016] This invention provides a scheduling method and related equipment for a large language model computing power cluster. It aims to solve the problem in the prior art that there is no general scheduling service layer between the large language model computing power interface layer and the application interface layer to pool the dispersed stateless large language model computing power into reliable and schedulable execution resources. This results in each upper-layer framework having to handle the infrastructure such as computing power scheduling, state management and fault recovery on its own.

[0017] In a first aspect, embodiments of the present invention provide a scheduling method for a large language model computing power cluster, applied to a scheduling kernel layer. The scheduling kernel layer runs in a scheduling middleware system of the large language model computing power cluster. The scheduling middleware system further includes a computing power service layer and an application interface layer. The scheduling kernel layer is communicatively connected to both the computing power service layer and the application interface layer. The method includes: if a scheduling event is detected in the computing power service layer, loading the current highest-priority thread from the persistent storage space of the thread control block according to the pre-deployed deterministic kernel program; performing, by the current highest-priority thread, the corresponding context switching operation and completing the assembly of input content to obtain the current complete input data; sending the current complete input data to the computing power service layer, where it is input into the connected target large language model instance for inference to obtain the current inference result; if the current inference result sent by the target large language model instance is received, verifying the current inference result according to the preset acceptance conditions to obtain the current verification result; and if the current verification result is determined to be a successful verification result, determining that the inference execution of the target large language model instance for the current complete input data was successful, changing the current execution stage corresponding to the current highest-priority thread in the persistent storage space according to the preset execution stage reference information, and saving the current inference result to the area corresponding to the current highest-priority thread in the persistent storage space.
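The steps of the first aspect can be pictured as a single pass of a small loop. The following is a minimal sketch under stated assumptions, not the patented implementation: the `Thread` class, the `schedule_once` function, the in-memory dictionary standing in for the persistent thread control block, and the convention that a larger number means higher priority are all hypothetical names and choices introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Thread:
    """Hypothetical attention thread record (subset of the TCB fields)."""
    tid: str
    priority: int          # assumption: larger value = higher priority
    content: str           # natural language instruction
    phase: str             # current execution stage
    state: str = "ready"
    results: List[str] = field(default_factory=list)


def schedule_once(
    tcb: Dict[str, Thread],           # stand-in for the persistent TCB store
    infer: Callable[[str], str],      # stand-in for the computing power service layer
    accept: Callable[[str], bool],    # preset acceptance condition
    next_phase: Dict[str, str],       # preset execution stage reference information
) -> Optional[Thread]:
    """Handle one scheduling event; return the thread that ran, if any."""
    ready = [t for t in tcb.values() if t.state == "ready"]
    if not ready:
        return None
    # Load the highest-priority ready thread.
    thread = max(ready, key=lambda t: t.priority)
    # Context switch and input assembly (deliberately simplistic here).
    full_input = f"[phase: {thread.phase}]\n{thread.content}"
    # Send the complete input to the model instance and wait for the result.
    result = infer(full_input)
    # Verify against the acceptance condition; only then advance and store.
    if accept(result):
        thread.phase = next_phase.get(thread.phase, thread.phase)
        thread.results.append(result)
    return thread
```

If the acceptance check fails, this sketch leaves the thread's stage and stored results untouched, so the same stage can be attempted again on a later scheduling event.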

[0018] In a second aspect, embodiments of the present invention provide a scheduling device for a large language model computing power cluster, configured in a scheduling kernel layer. The scheduling kernel layer runs in a scheduling middleware system for the large language model computing power cluster. The scheduling middleware system further includes a computing power service layer and an application interface layer. The scheduling kernel layer is communicatively connected to both the computing power service layer and the application interface layer. The scheduling kernel layer is used to execute the scheduling method for the large language model computing power cluster described in the first aspect.

[0019] In a third aspect, embodiments of the present invention provide a scheduling middleware system for a large language model computing power cluster, including a computing power service layer, a scheduling kernel layer, and an application interface layer; the scheduling kernel layer is communicatively connected to both the computing power service layer and the application interface layer; the scheduling kernel layer is used to execute the scheduling method for the large language model computing power cluster described in the first aspect.

[0020] In a fourth aspect, embodiments of the present invention provide a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the scheduling method for the large language model computing power cluster described in the first aspect.

[0021] In a fifth aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed by a processor, implement the scheduling method for the large language model computing power cluster described in the first aspect.

[0022] This invention provides a scheduling method and related equipment for a large language model computing power cluster. The method includes: if a scheduling event is detected in the computing power service layer, the highest priority thread is loaded from the persistent storage space of the thread control block according to a pre-deployed deterministic kernel program; the highest priority thread performs a corresponding context switching operation and completes the assembly of input content to obtain current complete input data; the current complete input data is sent to the computing power service layer to be input to the connected target large language model instance for inference to obtain the current inference result; if the current inference result sent by the target large language model instance is received, the current inference result is verified according to preset acceptance conditions to obtain a current verification result; if the current verification result is determined to be a successful verification result, the inference execution of the target large language model instance for the current complete input data is determined to be successful, the current execution stage corresponding to the highest priority thread in the persistent storage space is changed accordingly according to preset execution stage reference information, and the current inference result is saved to the area corresponding to the highest priority thread in the persistent storage space. This invention implements a deterministic program in the scheduling kernel layer to manage the creation, scheduling, switching, and destruction of attention threads, and externalizes all execution states to the thread control block for persistence. This enables priority-based preemptive scheduling of attention threads within a single agent, with scheduling granularity extending to multiple concurrent task flows within the agent.

Attached Figure Description

[0023] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0024] Figure 1 is a schematic diagram of an application scenario of the scheduling method for a large language model computing power cluster provided in an embodiment of the present invention; Figure 2 is a flowchart of the scheduling method for a large language model computing power cluster provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the first sub-process of the scheduling method for a large language model computing power cluster provided in an embodiment of the present invention; Figure 4 is a schematic diagram of the second sub-process of the scheduling method for a large language model computing power cluster provided in an embodiment of the present invention; Figure 5 is a schematic diagram of the third sub-process of the scheduling method for a large language model computing power cluster provided in an embodiment of the present invention; Figure 6 is a schematic block diagram of a computer device provided in an embodiment of the present invention.

Detailed Implementation

[0025] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0026] It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integrals, steps, operations, elements and / or components, but do not exclude the presence or addition of one or more other features, integrals, steps, operations, elements, components and / or collections thereof.

[0027] It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0028] It should also be further understood that the term "and / or" as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0029] Please refer to Figure 1 and Figure 2 together. Figure 1 is a schematic diagram of an application scenario of the scheduling method for a large language model computing power cluster according to an embodiment of the present invention, and Figure 2 is a flowchart of the scheduling method for a large language model computing power cluster provided in an embodiment of the present invention. As shown in Figure 1, the scheduling method for a large language model computing power cluster provided in this embodiment of the invention is applied to the scheduling kernel layer 10. The scheduling kernel layer 10 runs in the scheduling middleware system of the large language model computing power cluster. The scheduling middleware system also includes a computing power service layer 20 and an application interface layer 30. The scheduling kernel layer 10 is communicatively connected to the computing power service layer 20 and the application interface layer 30.

[0030] The computing power service layer 20 comprises a stateless, single-inference computing power service cluster consisting of one or more large language model instances. Each large language model instance is characterized by receiving natural language instructions as input and returning natural language text as output; it does not hold any cross-call state, has no session binding, and no state residue; each call is independent, and the instance retains no information about the call after completion. The large language model instances in the computing power service layer 20 can be arbitrarily replaced and horizontally scaled, including multiple large language model instances from different vendors and of different models, such as large model instances with high inference capabilities and low-cost lightweight model instances. Each large language model instance in the computing power service layer 20 provides stateless inference services through a unified interface (such as an HTTP RESTful API or a streaming protocol).
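The idea of interchangeable, stateless instances behind one interface can be sketched as follows. This is an illustration only: the `InferenceInstance` protocol, the toy `EchoInstance`, the `ComputeServiceLayer` class, and the round-robin dispatch policy are assumptions made for the example and are not specified in the text, which names HTTP RESTful APIs or streaming protocols as the actual transport.

```python
from typing import List, Protocol


class InferenceInstance(Protocol):
    """Hypothetical unified interface: text in, text out, no session."""

    def infer(self, prompt: str) -> str: ...


class EchoInstance:
    """Toy stand-in for a model instance; output depends only on the prompt."""

    def __init__(self, name: str) -> None:
        self.name = name  # static configuration, not per-call state

    def infer(self, prompt: str) -> str:
        return f"{self.name}: {prompt.upper()}"


class ComputeServiceLayer:
    """Pool of replaceable instances behind a single infer() call."""

    def __init__(self, instances: List[InferenceInstance]) -> None:
        self._instances = list(instances)
        self._next = 0  # dispatch counter kept by the pool, not by any instance

    def infer(self, prompt: str) -> str:
        # Round-robin dispatch is an assumption made for this sketch.
        instance = self._instances[self._next % len(self._instances)]
        self._next += 1
        return instance.infer(prompt)
```

Because no instance keeps anything between calls, any call can be served by any instance, which is what makes replacement and horizontal scaling straightforward in this picture.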

[0031] The scheduling kernel layer 10 includes a deterministic kernel program whose core responsibilities are: (a) pooling the distributed computing power of the computing power service layer 20 into a unified schedulable resource; (b) managing multiple attention threads within a single agent, where each attention thread represents a schedulable focus of attention and may include attributes such as thread identifier, priority, state, natural language instruction content, context reference, template name, current stage, stage parameters, acceptance conditions, and lifespan; (c) implementing an attention scheduler that uses a preemptive priority scheduling algorithm to select the highest-priority ready thread for execution each time a scheduling event is triggered; (d) performing context switching, rebuilding the task area as a whole during thread switching so that the large language model instance receives input tailored to the current thread on every inference; and (e) externalizing all execution states to an independent persistent storage space (such as an SQLite database) through a thread control block (TCB), with persistence granularity as fine as stage names, stage parameter dictionaries, remaining scheduling cycles, acceptance conditions, and thread priorities.
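Responsibility (c) can be illustrated with a small priority queue. The sketch below is one possible reading, not the kernel program itself: the `AttentionScheduler` class and its method names are hypothetical, larger numbers are assumed to mean higher priority, and ties are broken first-come-first-served. Preemption here means that a newly readied higher-priority thread is selected at the next scheduling event, since each model call is a single stateless inference.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Tuple


class AttentionScheduler:
    """Hypothetical preemptive priority scheduler over ready thread ids."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []  # (-priority, seq, tid)
        self._seq = itertools.count()                # FIFO tie-breaker
        self._ready: Dict[str, int] = {}             # tid -> current priority

    def make_ready(self, tid: str, priority: int) -> None:
        """Create a thread or mark it ready with the given priority."""
        self._ready[tid] = priority
        heapq.heappush(self._heap, (-priority, next(self._seq), tid))

    def terminate(self, tid: str) -> None:
        """Remove a thread; its heap entries become stale and are skipped."""
        self._ready.pop(tid, None)

    def pick(self) -> Optional[str]:
        """Return the highest-priority ready thread for this scheduling event."""
        while self._heap:
            neg_priority, _, tid = self._heap[0]
            if self._ready.get(tid) == -neg_priority:
                return tid
            heapq.heappop(self._heap)  # drop stale entry
        return None
```

For example, if a thread `"report"` with priority 3 is running and a thread `"alert"` with priority 9 becomes ready, the next call to `pick()` returns `"alert"`; once `"alert"` is terminated, `pick()` returns `"report"` again.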

[0032] Agent applications in the application interface layer 30 run by calling the kernel interfaces provided by the scheduling kernel layer 10 (including interfaces for creating threads, terminating threads, batch-terminating thread groups, switching thread stages, creating temporary threads, and executing scheduling) and do not directly operate the large language model instances in the computing power service layer 20.

[0033] Moreover, the technical solution of this application is based on a methodological insight into the mathematical essence of large language models. The mathematical essence of large language models is a memoryless probabilistic inference system. Each call receives input, produces output, and does not retain any internal state. Formalized as: output_i=f(input_i,ε_i), where f is the inference function of the large language model (uncontrollable, determined by the model), ε_i is the implicit random input (uncontrollable, independent of each call), and input_i is the complete input of this call (the only controllable variable).

[0034] The core proposition is that for a memoryless probabilistic system, under the constraint of only being able to control the input, the cumulative optimality of a single-step input is equivalent to the global optimality. The proof is as follows: Because the system has no memory, each call is independent. In the dependency chain of a multi-step task, the execution result of the preceding step is persisted as a deterministic state through the thread control block. The construction of the input for the current step depends only on the current deterministic state and is independent of the optimization objective conditions of subsequent steps (Markov property). Therefore, the cumulative quality of N calls is a decomposable summation:

$$
\begin{aligned}
\max \mathbb{E}[Q] &= \max \mathbb{E}\Big[\sum_{i=1}^{N} q(\mathrm{input}_i, \varepsilon_i)\Big] \\
&= \max \sum_{i=1}^{N} \mathbb{E}\big[q(\mathrm{input}_i, \varepsilon_i)\big] && \text{(conditional independence)} \\
&= \sum_{i=1}^{N} \max \mathbb{E}\big[q(\mathrm{input}_i, \varepsilon_i)\big] && \text{(decomposability)}
\end{aligned}
$$

Item-by-item optimality is therefore equivalent to optimality of the sum, which is a direct corollary guaranteed by memorylessness and external state.
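The decomposition step can be checked numerically on a toy case. The sketch below assumes, as the proposition does, that the expected quality of each step depends only on that step's input; the candidate inputs and their expected-quality values are made-up numbers for illustration. It compares a brute-force search over all input combinations with simply taking the best input at each step.

```python
import math
from itertools import product

# Hypothetical values of E[q(input_i, eps_i)] for each candidate input, per step.
expected_quality = [
    {"a": 0.2, "b": 0.7},            # step 1
    {"a": 0.5, "b": 0.4, "c": 0.9},  # step 2
    {"a": 0.6, "b": 0.1},            # step 3
]

# Joint optimisation: try every combination of one input per step.
joint_best = max(
    sum(step[choice] for step, choice in zip(expected_quality, combo))
    for combo in product(*expected_quality)
)

# Stepwise optimisation: pick the best input independently at each step.
stepwise_best = sum(max(step.values()) for step in expected_quality)

# The two agree because the objective is a sum of independent per-step terms.
same = math.isclose(joint_best, stepwise_best)
```

The equality would not hold in general if a step's quality depended on the inputs chosen for other steps, which is why the argument relies on memorylessness and on persisting preceding results as deterministic external state.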

[0035] From this, three architectural principles are derived: (1) External state: The large language model has no memory, and all cross-step states must be managed by an external system to ensure that the state is not lost. (2) Optimal single-step input: Before each large language model call, the deterministic kernel selectively loads the required content and assembles the input according to the context reference to ensure that each input is optimal in the current state. (3) Self-management without relying on the large language model: Any design that allows the large language model to "manage itself" is to optimize uncontrollable variables with uncontrollable components, which violates the basic principle of optimal control.
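Principle (2) can be sketched as a pure function that builds each input from externally stored state. The function name `assemble_input`, the section markers, and the dictionary standing in for the external store are assumptions for illustration; the text does not specify how the assembled input is laid out.

```python
from typing import Dict, List


def assemble_input(
    instruction: str,
    context_ref: List[str],
    store: Dict[str, str],
) -> str:
    """Build one complete model input from externally stored state.

    Only the items named in context_ref are loaded; everything else in
    the store is left out of the context window.
    """
    parts = [store[key] for key in context_ref if key in store]
    return "\n\n".join(["# Context", *parts, "# Task", instruction])


# Usage: the store holds results persisted by earlier steps.
store = {
    "outline": "1. Intro 2. Method 3. Results",
    "style": "Formal, third person.",
    "chat_log": "Unrelated earlier discussion.",
}
prompt = assemble_input("Write the Method section.", ["outline", "style"], store)
```

Because the function reads only from the external store and the thread's own references, the model never has to remember or select its own context, which is the point of principles (1) and (3).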

[0036] This core proposition is independent of the specific model's capabilities and holds true for any execution unit that retains the essence of a memoryless probabilistic state machine. The above proposition guarantees policy-level optimality, meaning that the policy of "constructing the current optimal input at every step" is optimal among all possible control policies. Due to the probabilistic nature of large language models (the existence of ε_i), the actual output of a single execution may deviate from the expectation, but this does not affect the optimality of the policy itself: the optimal policy selects the optimal input construction method for the current step in any execution state, achieving global optimality for this stochastic system in the mathematical expectation sense.

[0037] Following the methodology described above, the scheduling middleware system for the large language model computing power cluster adopts the aforementioned three-layer architecture. A scheduling kernel layer 10 is further established between the computing power service layer 20 and the application interface layer 30, allowing the application interface layer 30 to focus on agent collaboration logic rather than computing power management. The scheduling kernel layer 10 is located between the underlying computing power API (the inference interface of the computing power service layer 20) and the upper-layer agent orchestration framework (such as AIOS, CrewAI, etc.) of the application interface layer 30. Analogous to CUDA abstracting GPU hardware into a general computing power interface, this application pools the dispersed, stateless large language model computing power into reliable, schedulable execution resources. The upper-layer agent orchestration framework of the application interface layer 30 is a consumer of this system, not a competitor; that is, they are responsible for inter-agent collaboration, while the scheduling kernel layer 10 is responsible for scheduling attention threads within a single agent.

[0038] As shown in Figure 2, the method includes the following steps S110-S140.

[0039] S110. If a scheduling event is detected in the computing power service layer, the highest priority thread is loaded from the persistent storage space of the thread control block according to the pre-deployed deterministic kernel program.

[0040] In this embodiment, the technical solution is described with the scheduling kernel layer as the execution entity. When the scheduling kernel layer detects that a scheduling event in the computing power service layer has been triggered, the target large language model instance obtained by the computing power service layer from the stateless single-inference computing power service cluster has already been accessed. This indicates that there is an intelligent agent application in the application interface layer that needs to call the target large language model instance, and input content has been entered in the intelligent agent application. At this time, the target large language model instance does not directly process the input content, but it is necessary to first load the current highest priority thread from the persistent storage space of the thread control block according to the pre-deployed deterministic kernel program.

[0041] To better understand the technical solution of this application, the structure of the thread storage table in the persistent storage space of the thread control block is described in detail below. The thread storage table is stored in a SQLite database, and all scheduling decisions in the scheduling kernel layer are based on it. In one embodiment, the fields of the thread storage table include at least the thread template name (template), the name of the current execution phase (phase), the phase parameter dictionary (params), the number of remaining scheduling cycles (remaining), the thread acceptance condition (assertion), the thread priority (priority), and the thread wake-up condition (wake_condition). The fields also include the thread identifier (tid), the natural language instruction content (content), the context reference relationship (context_ref), the thread state (state), the total number of live scheduling cycles (ttl), and the creation time (created_at).
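The thread storage table described above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the column names follow the fields listed in this paragraph, but the table name threads, the column types, the JSON encoding of params, and the helper names open_store and create_thread are hypothetical and not taken from the application.

```python
import json
import sqlite3

# Hypothetical schema: column names follow the fields listed in the text;
# the table name, types, defaults, and CHECK constraint are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS threads (
    tid            TEXT PRIMARY KEY,
    template       TEXT,
    phase          TEXT,
    params         TEXT,      -- phase parameter dictionary, stored as JSON
    content        TEXT,
    context_ref    TEXT,
    state          TEXT NOT NULL DEFAULT 'READY',
    priority       INTEGER NOT NULL CHECK (priority BETWEEN 0 AND 255),
    ttl            INTEGER NOT NULL,
    remaining      INTEGER NOT NULL,
    assertion      TEXT,
    wake_condition INTEGER,
    created_at     TEXT NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open (or create) the persistent store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def create_thread(conn, tid, priority, ttl, content, created_at,
                  template=None, phase=None, params=None):
    """Insert a new thread; it starts READY with remaining equal to ttl."""
    conn.execute(
        "INSERT INTO threads (tid, template, phase, params, content,"
        " priority, ttl, remaining, created_at)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (tid, template, phase, json.dumps(params or {}), content,
         priority, ttl, ttl, created_at),
    )
    conn.commit()
```

Keeping every scheduling-relevant field in one row means a scheduling decision needs only a single SELECT over this table, which is consistent with the statement that all scheduling decisions are based on the thread storage table.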

[0042] The thread template file is identified by the thread template name in the thread storage table. For example, the thread template file is in plain text format, has the extension .thread, and includes a metadata area and a stage area. The metadata area, identified by [meta], defines the thread's default attributes, such as priority, ttl (total number of live scheduling cycles), and assertion (acceptance condition). Each stage area, identified by [stage name], defines one execution stage. A stage area contains an attribute section and an instruction section, divided by a separator line (e.g., ---): the attribute section precedes the separator and defines the attributes of that stage, which can override the default values in the metadata area; the instruction section follows the separator and contains the natural language execution instructions for that stage.

[0043] For example, the thread template file for a programming task thread template is as follows:
[meta]
priority=70
ttl=20
[planning]
priority=80
ttl=5
assertion=Output includes task breakdown steps
---
Task: {task_description}
Please analyze this task and break it down into 3-5 executable sub-steps, describing each step in one sentence.

[0044] [implementing]
ttl=10
assertion=Output includes code block
---
Implement the code according to the following task plan: {plan}
Requirements: The code must be complete and runnable, and include necessary comments.

[0045] [testing]
priority=90
ttl=5
assertion=Output includes test results
---
Test the following code: {code}
Requirements: Write test cases, execute tests, and report the results.

[0046] In the example above, "planning" indicates that the current execution stage of the thread is the planning stage, "implementing" indicates that the current execution stage of the thread is the implementation stage, and "testing" indicates that the current execution stage of the thread is the testing stage. During execution, the attention thread moves between multiple stages through a stage switching interface. Each time it switches to a new stage, the natural language execution instructions in the instruction segment of the stage area are automatically updated to the content of the instruction segment of that stage; the attributes in the metadata area of the attention thread are automatically updated to the attribute segment definitions of that stage; and the current stage name and stage parameters of the attention thread are recorded through the thread control block.
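The template format and the stage override behavior described above can be illustrated with a small parser. This is a sketch only: the function names parse_thread_template and stage_attributes are hypothetical, the separator is assumed to be a line containing exactly ---, attribute values are kept as strings, and an instruction line that itself looks like [name] would be misread as a new stage header.

```python
def parse_thread_template(text):
    """Parse a .thread template into (meta, stages).

    meta   -> dict of default attributes from the [meta] area
    stages -> dict: stage name -> {"attrs": {...}, "instruction": "..."}
    """
    meta, stages = {}, {}
    section, in_instruction = None, False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new area begins: [meta] or a stage such as [planning].
            section = line[1:-1]
            in_instruction = False
            if section != "meta":
                stages[section] = {"attrs": {}, "instruction": []}
            continue
        if section is None or not line:
            continue
        if section == "meta":
            key, _, value = line.partition("=")
            meta[key.strip()] = value.strip()
        elif not in_instruction and line == "---":
            in_instruction = True          # everything after is instruction text
        elif in_instruction:
            stages[section]["instruction"].append(line)
        else:
            key, _, value = line.partition("=")
            stages[section]["attrs"][key.strip()] = value.strip()
    for stage in stages.values():
        stage["instruction"] = "\n".join(stage["instruction"])
    return meta, stages


def stage_attributes(meta, stages, name):
    """Effective attributes of a stage: [meta] defaults overridden by the stage."""
    return {**meta, **stages[name]["attrs"]}


# Abbreviated copy of the programming task template, for illustration.
EXAMPLE = """\
[meta]
priority=70
ttl=20
[planning]
priority=80
ttl=5
assertion=Output includes task breakdown steps
---
Task: {task_description}
Please analyze this task and break it down into 3-5 executable sub-steps.
[implementing]
ttl=10
assertion=Output includes code block
---
Implement the code according to the following task plan: {plan}
"""
```

With this sketch, the implementing stage inherits priority 70 from [meta] because it defines no priority of its own, while its ttl of 10 overrides the default of 20.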

[0047] Furthermore, all attention threads support two creation methods. The first is the template thread creation method, which creates thread instances based on predefined thread template files by specifying the template name, initial stage, and parameter dictionary. This method is suitable for standardized and reusable tasks. The second is the temporary thread creation method, which creates thread instances directly using natural language text as thread execution instructions. This method does not rely on template files and is suitable for one-off dynamic tasks. Template threads and temporary threads share the same attention scheduler and are scheduled uniformly according to priority. The two creation methods complement each other, taking into account both standardization and flexibility.

[0048] In one embodiment, the attributes of the attention thread in the persistent storage space of the thread control block include at least thread identifier, priority, thread state, natural language instruction content, context reference relationship, lifespan, acceptance condition, and wake-up condition; wherein, the lifespan includes the total number of live scheduling cycles and the number of remaining scheduling cycles.

[0049] The thread identifier is represented by tid (string type). The priority is represented by priority (integer type, ranging from 0 to 255; the larger the value, the higher the priority). The thread state is represented by state (enumeration type: READY for the ready state, RUNNING for the running state, SLEEPING for the sleeping state, and TERMINATED for the terminated state). The natural language instruction content is represented by content (string type). The context reference relationship is represented by context_ref (string type). Within the lifespan, the total number of live scheduling cycles is represented by ttl (integer type) and the number of remaining scheduling cycles by remaining (integer type). The acceptance condition is represented by assertion (string type), and the wake-up condition by wake_condition (integer type). The attributes of the attention thread also include the thread template name (template, string type), the current execution phase name (phase, string type), the phase parameter dictionary (params, dictionary type), and the creation time (created_at, string type). The fields of the thread storage table correspond one-to-one with the attributes of the attention thread, so the two have the same total number. For example, and without limitation, the priority attribute corresponds to the thread priority field in the thread storage table, the acceptance condition attribute to the thread acceptance condition field, and the wake-up condition attribute to the thread wake-up condition field.

[0050] When the attributes of the attention thread adopt the data structure described above, compared with the existing LangGraph checkpoint mechanism (node-level persistence), the persistence of the thread control block in this invention achieves fine-grained persistence at the stage / parameter level. For example, when the thread executes the parameter replacement sub-step in the implementation stage, the thread control block records the specific stage name and the stage parameter dictionary {"variable_name":"user_count"}, rather than just recording the code generation node. This allows for precise return to the sub-step to continue execution during fault recovery, rather than re-executing the entire node.

[0051] The number of remaining scheduling cycles is designed to decrease only while an attention thread is in the ready state, not while it is in the running state. A running thread is executing its task, so its lifecycle consumption should be measured by actual execution rather than by waiting time. A ready thread is waiting to be scheduled; if it cannot obtain an execution opportunity for a long time (for example, because its priority is low), it should be cleaned up automatically as its remaining scheduling cycles count down, which prevents low-priority threads from occupying system resources indefinitely. This design allows high-priority threads to run for extended periods without being limited by the total number of live scheduling cycles, while low-priority threads terminate automatically if they are not scheduled for a long time, achieving adaptive resource management. The mechanism resembles a reverse application of process aging in operating systems and tilts system resources towards high-priority tasks.

[0052] In one embodiment, the complete lifecycle of the attention thread includes a planning phase, an implementation phase, and a testing phase. When the attention thread switches from its current lifecycle state to another lifecycle state through a phase switching interface, the natural language instruction content, priority, lifespan, and wake-up conditions in the attention thread's attributes are updated accordingly based on the other lifecycle state. The current execution phase name and phase parameter dictionary of the attention thread are also updated accordingly based on the other lifecycle state. The lifecycle state and the other lifecycle state are both one of the planning phase, implementation phase, and testing phase, and the lifecycle state and the other lifecycle state are different.

[0053] In this embodiment, all attention threads included in the thread control block have a complete lifecycle, specifically including a planning phase, an implementation phase, and a testing phase. Of course, in actual implementation, the complete lifecycle of an attention thread can be divided into any number of other execution phases in addition to the three phases listed in the example above, and is not entirely limited to the above example.

[0054] Taking the highest priority thread as an example, it can transition between four thread states: READY, RUNNING, SLEEPING, and TERMINATED. Specifically, the thread enters the READY state automatically upon creation, transitions to the RUNNING state after being selected by the attention scheduler, returns to the READY state if preempted by a higher priority attention thread, and transitions to the TERMINATED state when its number of remaining scheduling cycles reaches 0. If its output fails verification against the acceptance condition, the thread transitions to the SLEEPING state, and it returns to the READY state when the wake-up condition is met.
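The four thread states and the transitions between them can be written as a small lookup table. The sketch below is illustrative: the event names (selected, preempted, and so on) are hypothetical labels, and termination is modeled from the READY state on the assumption that the remaining cycle count only decreases for ready threads.

```python
from enum import Enum


class State(Enum):
    READY = "READY"
    RUNNING = "RUNNING"
    SLEEPING = "SLEEPING"
    TERMINATED = "TERMINATED"


# Event names are hypothetical labels for the transitions in the text.
TRANSITIONS = {
    (State.READY, "selected"): State.RUNNING,            # chosen by the scheduler
    (State.RUNNING, "preempted"): State.READY,           # higher priority arrived
    (State.RUNNING, "assertion_failed"): State.SLEEPING, # output failed acceptance
    (State.SLEEPING, "wake_condition_met"): State.READY,
    (State.READY, "ttl_exhausted"): State.TERMINATED,    # remaining reached 0
}


def next_state(state, event):
    """Return the state after `event`, or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None
```

Rejecting any pair that is not in the table keeps the kernel deterministic: a TERMINATED thread, for instance, can never be selected again.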

[0055] In one embodiment, the deterministic kernel program includes at least an attention scheduler and a thread control block. As shown in Figure 3, step S110 includes:
S111. Load the attention threads in the ready state from the persistent storage space of the thread control block according to the attention scheduler;
S112. If there are attention threads in the ready state and their total number is greater than 1, determine the attention thread with the highest priority based on the priority of each ready attention thread, and use it as the thread to be run;
S113. If it is determined that there is no running attention thread, convert the thread to be run to the running state and use it as the current highest priority thread;
S114. If it is determined that there is a running attention thread that is different from the thread to be run, convert the running attention thread to the ready state, convert the thread to be run to the running state, and use it as the current highest priority thread;
S115. If it is determined that there is a running attention thread that is the same thread as the thread to be run, use the running attention thread as the current highest priority thread.

[0056] In this embodiment, the attention scheduler in the scheduling kernel layer loads attention threads in the ready state from the persistent storage space of the thread control block. If it is determined that there are attention threads in the ready state and the total number is greater than one, then the attention thread with the highest priority is determined based on the priority of each ready attention thread (if there are multiple ready attention threads with the same priority, then the attention thread with the earliest creation time is selected) as the thread to be run. If it is determined that there are attention threads in the ready state and the total number is equal to one, then that ready attention thread is directly used as the thread to be run.

[0057] Once a ready attention thread has been identified as the thread to be run, it is also necessary to determine whether a running attention thread exists, because only one attention thread can be in the running state at any given time. If no running attention thread is found, the thread to be run is converted directly to the running state and becomes the current highest priority thread. If a running attention thread is found and it is a different thread from the thread to be run, the priority of the running attention thread is lower than that of the thread to be run. In this case, the running attention thread is converted to the ready state, and the thread to be run is converted to the running state and becomes the current highest priority thread, thus achieving preemptive thread scheduling. If the running attention thread is the same thread as the thread to be run, that is, their thread identifiers are identical, there is no need to start the thread to be run; the running attention thread is used directly as the current highest priority thread.

[0058] For example, the preemptive thread scheduling process in a deterministic kernel program can be understood through the following example. At tick N (N is a positive integer), there is attention thread A (denoted as thread_A) with a priority of 70, and attention thread B (denoted as thread_B) with a priority of 30. At this time, thread_A is in the running state and thread_B is in the ready state, so thread_A can continue to execute. At tick N+1, a new attention thread C (denoted as thread_C) is created with a priority of 95. The attention scheduler in the deterministic kernel program sorts the priorities of the above three attention threads, and the sorting result is thread_C > thread_A > thread_B. If thread_C subsequently completes the context switch operation, completes the input content assembly to obtain the current complete input data, and is converted to the running state, then thread_A is converted to the ready state. It can be seen that the higher-priority attention thread automatically preempts the attention thread in the running state of the previous tick in the next tick, rather than immediately preempting it, thus ensuring the integrity of the current time slice (the execution period between two adjacent ticks).
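The selection and preemption logic can be sketched as one function called once per tick. This is an illustration under assumptions: the Thread dataclass and the schedule function are hypothetical, created_at is simplified to an integer, and the sketch compares priorities explicitly before preempting so that thread_A keeps running at tick N while only the lower-priority thread_B is ready, as in the example above.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    tid: str
    priority: int        # 0-255, larger means higher priority
    created_at: int      # simplified timestamp used only for tie-breaking
    state: str = "READY"


def schedule(threads):
    """Make one scheduling decision and return the thread that should run."""
    ready = [t for t in threads if t.state == "READY"]
    running = next((t for t in threads if t.state == "RUNNING"), None)
    if not ready:
        return running
    # Highest priority first; ties go to the earliest creation time.
    candidate = min(ready, key=lambda t: (-t.priority, t.created_at))
    if running is None:
        candidate.state = "RUNNING"
        return candidate
    if candidate.priority > running.priority:
        # Preemption: the running thread goes back to READY.
        running.state = "READY"
        candidate.state = "RUNNING"
        return candidate
    return running
```

Because schedule is only called at tick boundaries, a newly created high-priority thread takes over at the next tick rather than interrupting the current time slice.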

[0059] In one embodiment, after step S113, step S114, or step S115, the method further includes: obtaining, from the persistent storage space of the thread control block, the attention threads that are in the ready state and whose total number of live scheduling cycles is greater than 0; decrementing the remaining scheduling cycle count of each obtained attention thread by 1; and converting attention threads whose remaining scheduling cycle count is 0 to the terminated state. Attention threads in the sleeping state are also obtained from the persistent storage space of the thread control block and are converted to the ready state when their preset wake-up conditions are met.

[0060] In this embodiment, after the thread control block has determined the current highest priority thread, it is also necessary to decrement the remaining scheduling cycle count by 1 for all other attention threads that are in the ready state (refer to the example above, at tick N+1, compared to tick N, the remaining scheduling cycle count of each attention thread in the ready state is decremented by 1), and at the same time, attention threads in the ready state with a remaining scheduling cycle count of 0 are converted into attention threads in the terminated state to avoid attention threads occupying system resources indefinitely.

[0061] Furthermore, attention threads in the sleeping state are retrieved from the persistent storage space of the thread control block, and each is converted to the ready state when its preset wake-up condition is met. A wake-up condition can be periodic or event-based. For an attention thread with a periodic wake-up condition, its sleep cycle count is decremented, and the thread becomes ready when the count reaches zero. For an attention thread with an event-based wake-up condition, the scheduling kernel layer checks whether the associated event has been triggered; if so, the thread becomes ready; otherwise, it remains asleep. Thus, the thread control block achieves fine-grained state monitoring of attention threads at the stage / parameter level, and regardless of the number of inference rounds or task switches, execution can resume accurately from the last position.
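The per-tick bookkeeping described in the preceding paragraphs (counting down ready threads and waking sleeping ones) might look like the sketch below. The dictionary layout, the wake encoding with kind, ticks, and event keys, and the function name tick_maintenance are assumptions made for illustration; the application does not specify how a wake-up condition is encoded.

```python
def tick_maintenance(threads, fired_events=frozenset()):
    """Apply one tick of lifetime countdown and wake-up checks in place."""
    for t in threads:
        if t["state"] == "READY" and t["remaining"] > 0:
            # Only waiting threads consume their remaining cycles.
            t["remaining"] -= 1
            if t["remaining"] == 0:
                t["state"] = "TERMINATED"
        elif t["state"] == "SLEEPING":
            wake = t["wake"]  # hypothetical encoding of the wake-up condition
            if wake["kind"] == "periodic":
                wake["ticks"] -= 1
                if wake["ticks"] <= 0:
                    t["state"] = "READY"
            elif wake["kind"] == "event" and wake["event"] in fired_events:
                t["state"] = "READY"
```

A thread that wakes in this tick is not counted down until the next tick, so waking never terminates a thread immediately.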

[0062] S120. The current highest priority thread performs the corresponding context switching operation and completes the input content assembly to obtain the current complete input data. The current complete input data is sent to the computing power service layer to be input to the accessed target large language model instance for inference to obtain the current inference result.

[0063] In this embodiment, after the scheduling kernel layer obtains the current highest priority thread, the agent application in the application interface layer runs through the kernel interface provided by the scheduling kernel layer. After the running agent application receives the user's input content (such as natural language instructions), the current highest priority thread performs a context switching operation to assemble the input content, thereby obtaining the current complete input data. The current complete input data is sent from the scheduling kernel layer to the computing power service layer and input into the target large language model instance for inference to obtain the current inference result. Note that the context switching mechanism ensures that, on each call, only the context assembled on demand for the currently running attention thread is provided to the large language model instance, which eliminates attention distraction and keeps attention focused.

[0064] In one embodiment, the current highest priority thread has an agent context window that includes at least a role area and a task area. The task area includes a context area and an attention cue area, and the data in the task area is cleared before each context switching operation. As shown in Figure 4, step S120 includes:
S121. Obtain the agent role definition information of the target large language model instance and set it in the role area;
S122. Obtain the natural language instruction content corresponding to the target large language model instance and set it in the attention cue area;
S123. Obtain the background knowledge and tool documents corresponding to the context reference relationship of the current highest priority thread, and set them in the context area to obtain the current complete input data corresponding to the agent context window of the current highest priority thread.

[0065] The attention weight of the attention cue area is greater than the attention weight of the role area, and the attention weight of the role area is greater than the attention weight of the context area.

[0066] In this embodiment, when using an agent context window to perform context switching and input content assembly in the current highest priority thread, the agent role definition information of the target large language model instance is first filled into the role area with the second highest attention weight in the agent context window. Then, the natural language instruction content corresponding to the target large language model instance is filled into the attention cue area with the highest attention weight in the agent context window. Finally, the background knowledge and tool documents corresponding to the context reference relationship of the current highest priority thread are filled into the context area with the lowest attention weight in the agent context window. The current complete input data obtained in the agent context window is sent to the computing power service layer to be input into the target large language model instance for inference to obtain the current inference result.

[0067] The attention cue area can be considered a recency effect hotspot, the role area a primacy effect hotspot, and the context area a mid-range area with the lowest attention weight. The role area includes basic role definitions and the current working mode, is managed by the application interface layer, and is relatively stable as it is unaffected by thread switching. The task area needs to be completely rebuilt each time a thread switch occurs, and the entire capacity of the context window is used by the currently running attention thread.
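The assembly of the agent context window can be sketched as a pure function that rebuilds the task area on every call. The physical ordering used here (role area first, context area in the middle, attention cue area last) is an assumption inferred from the primacy and recency description; the section markers and the function name assemble_input are hypothetical.

```python
def assemble_input(role_definition, context_docs, instruction):
    """Build the complete input for one inference call.

    Assumed layout: role area first (primacy), context area in the middle,
    attention cue area last (recency). Nothing from a previous thread is
    carried over, because the task area is rebuilt from the arguments.
    """
    role_area = role_definition.strip()
    context_area = "\n\n".join(doc.strip() for doc in context_docs)
    cue_area = instruction.strip()
    sections = [
        ("ROLE", role_area),
        ("CONTEXT", context_area),
        ("TASK", cue_area),
    ]
    # Empty areas are skipped so the model never sees a blank section.
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)
```

Because the function takes only the current thread's role, documents, and instruction, switching threads amounts to calling it again with the next thread's values.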

[0068] Before step S121, it is necessary to determine whether there is an old thread before the current highest priority thread is started. If there is an old thread, its relevant information (such as the current execution stage name, stage parameter dictionary, etc.) needs to be saved to the thread control block. If there is no old thread, the current highest priority thread can be started directly.

[0069] S130. If the current inference result sent by the target large language model instance is received, the current inference result is verified according to the preset acceptance conditions to obtain the current verification result.

[0070] In this embodiment, after the target large language model instance completes inference on the current complete input data and obtains the current inference result, it is necessary to obtain preset acceptance conditions from the current highest priority thread to verify the current inference result and obtain the current verification result. Only after the verification passes can the inference on the current complete input data be considered successful, and the agent application can output the current inference result.

[0071] In one embodiment, step S130 includes: Obtain the acceptance criteria and the preset tiered retry strategy, and perform at least one verification on the current inference result based on the tiered retry strategy and the acceptance criteria to obtain the current verification result.

[0072] The acceptance criteria include structured rules, natural language rules, or regular expressions.

[0073] In this embodiment, the acceptance criteria obtained from the current highest priority thread describe the expected output characteristics in natural language or as structured rules. Common forms include natural language descriptions (such as "output contains a function definition"), regular expressions (such as ^def\s+\w+\(.*\):$), and structured rules (such as length > 100 and contains("import")).
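The regular expression and structured rule forms can be checked without calling a model. The sketch below is illustrative: the helper names check_regex and check_structured are hypothetical, the structured rule is reduced to a minimum length plus required substrings, and the pattern used in the usage note is an assumed pattern for a Python function header.

```python
import re


def check_regex(pattern, output):
    """True if any line of `output` matches `pattern` (multiline mode)."""
    return re.search(pattern, output, flags=re.MULTILINE) is not None


def check_structured(output, min_length=0, must_contain=()):
    """True if `output` is longer than `min_length` and contains every substring."""
    return len(output) > min_length and all(s in output for s in must_contain)
```

For example, check_regex(r"^def\s+\w+\(.*\):$", code) accepts output that contains a line such as def main(argv):, and check_structured(code, min_length=100, must_contain=("import",)) expresses the length and import rule.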

[0074] Of course, the current inference result and the acceptance criteria can also be input into a large language model for auxiliary verification. For example, the acceptance criteria can be embedded in a prompt, which can be represented as follows:
prompt = f"""Please determine whether the following output meets the acceptance criteria.

[0075] Acceptance criteria: {assertion}
Actual output: {output}
Please answer "pass" or "fail" and explain your reasoning.

[0076] """
result = llm.execute(prompt)
return "pass" in result
In this way, the current inference result can also be verified at least once with the assistance of a large language model, thus obtaining the current verification result.
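A self-contained version of the auxiliary verification might look like the following. This is a sketch under assumptions: llm_execute is a hypothetical callable that takes a prompt string and returns the model's reply, and the verdict is read from the first line of the reply rather than by a substring test, a deliberate deviation so that an explanation such as "it does not pass" is not misread as a pass.

```python
def llm_verify(llm_execute, assertion, output):
    """Ask a model whether `output` meets `assertion`; return True on "pass".

    `llm_execute` is a hypothetical callable: prompt string in, reply string out.
    """
    prompt = f"""Please determine whether the following output meets the acceptance criteria.

Acceptance criteria: {assertion}
Actual output: {output}
Please answer "pass" or "fail" on the first line, then explain your reasoning."""
    reply = llm_execute(prompt).strip()
    # Read only the first line so words in the explanation cannot flip the verdict.
    first_line = reply.splitlines()[0].lower() if reply else ""
    return first_line.startswith("pass")
```

Passing the model call in as an argument keeps the check independent of any particular inference interface and makes it easy to substitute a stub during testing.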

[0077] Specifically, the current inference result is verified at least once through a tiered retry strategy and the acceptance conditions, indicating that the number of verifications is greater than or equal to 1. However, the upper limit of the number of verifications can also be limited to avoid excessive rounds of verification causing the current highest priority thread to occupy system resources for a long time.

[0078] In one embodiment, as shown in Figure 5, step S130 includes:
S131. Perform a preliminary verification on the current inference result based on the acceptance conditions to obtain an initial verification result;
S132. If the initial verification result is determined to be a verification pass result, use the initial verification result as the current verification result;
S133. If the initial verification result is determined to be a verification failure result, obtain the preset error example content corresponding to the acceptance condition, inject the error example content into the current complete input data to update the current complete input data, and send the current complete input data to the computing power service layer to be input into the target large language model instance for inference to obtain the second inference result;
S134. If it is determined that the second inference result passes the verification of the acceptance condition, use the second inference result as the current verification result;
S135. If it is determined that the second inference result fails the verification of the acceptance condition, send the current complete input data to the computing power service layer to be input into another target large language model instance with an inference capability rating higher than that of the target large language model instance to obtain the third inference result;
S136. If it is determined that the third inference result passes the verification of the acceptance condition, use the third inference result as the current verification result;
S137. If it is determined that the third inference result fails the verification of the acceptance condition, change the current highest priority thread from the running state to the sleeping state, and save all current state information of the current highest priority thread to the thread storage table in the persistent storage space of the thread control block.

[0079] In this embodiment, the maximum number of checks for the tiered retry strategy can be set to 3. The first check is a preliminary check of the current inference result based on the acceptance conditions of the highest priority thread, yielding an initial check result. If the initial check result corresponds to a pass result, it is used as the current check result, and the remaining two checks are not performed. If the initial check result corresponds to a fail result, a preset error example content corresponding to the acceptance conditions is obtained, and the error example content is injected into the current complete input data to update it. The current complete input data is then sent to the computing power service layer to be input into the target large language model instance, allowing the target large language model instance to perform inference under the condition of knowing the error pattern and obtain a second inference result, thus completing the first level of retry.

[0080] If the second inference result passes the acceptance criteria, it is used as the current verification result, and no further verification is performed. If the second inference result fails the acceptance criteria, the current complete input data is sent to the computing power service layer to be input into another target large language model instance with a higher inference capability rating than the target large language model instance for inference to obtain a third inference result, thus completing the second-level retry.

[0081] If the third inference result passes the acceptance condition verification, it is used as the current verification result, and subsequent operations can be executed. If the third inference result fails the acceptance condition verification, the current highest priority thread is switched from running to sleeping state, and all current state information of the current highest priority thread is saved to the thread storage table in the persistent storage space of the thread control block. It then waits for external intervention or a change in conditions to wake up and continue execution. Thus, the tiered retry strategy is deeply integrated with the complete lifecycle of the attention thread. Each stage can independently define acceptance conditions and the maximum number of retries. During the retry process, the stage state of the attention thread does not advance; the state change is only submitted after the verification is passed.
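The three checks of the tiered retry strategy can be sketched as a single function. Everything below is illustrative: verify, infer, and infer_stronger are hypothetical callables standing in for the acceptance check, the target model instance, and the higher-rated instance, and the wording used to inject the error example is an assumption.

```python
def run_with_tiered_retry(base_input, verify, infer, infer_stronger, error_example):
    """Return (status, result) where status is "passed" or "sleeping"."""
    # Initial attempt on the target model instance.
    result = infer(base_input)
    if verify(result):
        return "passed", result
    # Level 1 retry: same instance, with the known error example injected.
    retry_input = f"{base_input}\n\nAvoid this mistake:\n{error_example}"
    result = infer(retry_input)
    if verify(result):
        return "passed", result
    # Level 2 retry: escalate to an instance with a higher capability rating.
    result = infer_stronger(retry_input)
    if verify(result):
        return "passed", result
    # All three checks failed: the caller puts the thread to sleep and
    # persists its state; the stage has not advanced.
    return "sleeping", None
```

The function never mutates thread state itself, which reflects the rule that the stage state does not advance during retries and is committed only after verification passes.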

[0082] Furthermore, in the deterministic kernel program, the operations for each stage of the attention thread are designed as idempotent operations. That is, for the generation operations of the attention thread, the output is written to a temporary region and then atomically replaced to the target position; for the modification operations of the attention thread, they are executed based on a clear input state, and executing them twice yields the same result as executing them once. This ensures that during retries, even if the same operation is executed multiple times, the system state remains consistent, avoiding state pollution caused by repeated executions.
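The write-to-a-temporary-region-then-replace pattern for generation operations can be illustrated with the standard library. This is a minimal sketch assuming the target is a file on a local filesystem; the function name atomic_write is hypothetical.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write `text` to `path` so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic replacement of the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling atomic_write twice with the same arguments leaves the same file content as calling it once, so a retried stage cannot leave a half-written or duplicated output behind.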

[0083] S140. If the current verification result is determined to be a verification pass result, the inference execution of the target large language model instance against the current complete input data is determined to be successful. The current execution stage corresponding to the current highest priority thread in the persistent storage space is changed accordingly according to the preset execution stage reference information, and the current inference result is saved to the area corresponding to the current highest priority thread in the persistent storage space.

[0084] In this embodiment, if the current verification result obtained for the current inference result of the highest priority thread is a verification pass result, it is determined that the inference execution of the target large language model instance for the current complete input data is successful. At this time, the current execution stage corresponding to the highest priority thread is obtained from the attributes of the attention thread, and the current execution stage of the highest priority thread is updated according to the preset execution stage reference information (for example, three execution stages: planning, implementation, and testing). The current inference result is saved to the area corresponding to the highest priority thread in the persistent storage space, thereby completing the scheduling of the highest priority thread in this round, including calling the target large language model instance to perform the corresponding inference and outputting the result.
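As a rough illustration of step S140, the sketch below advances the stored execution stage and saves the inference result in one transaction. It is an assumption-laden example rather than the disclosed implementation: the SQLite table `threads`, its columns, the `STAGE_ORDER` list, and the functions `open_store` and `commit_success` are hypothetical, and staying at the last stage once testing has been reached is a choice made only for this sketch.

```python
import sqlite3

# Assumed execution stage reference information (planning, implementation, testing).
STAGE_ORDER = ["planning", "implementation", "testing"]


def open_store() -> sqlite3.Connection:
    """Create a hypothetical, simplified thread storage table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE threads ("
        "thread_id TEXT PRIMARY KEY, "
        "stage TEXT NOT NULL, "
        "params TEXT NOT NULL, "
        "result TEXT)"
    )
    return conn


def commit_success(conn: sqlite3.Connection, thread_id: str, inference_result: str) -> str:
    """Advance the stage and save the result after verification has passed."""
    row = conn.execute(
        "SELECT stage FROM threads WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    if row is None:
        raise KeyError(thread_id)
    index = STAGE_ORDER.index(row[0])
    # Assumption for this sketch: the last stage maps to itself.
    next_stage = STAGE_ORDER[min(index + 1, len(STAGE_ORDER) - 1)]
    # The connection context manager commits on success and rolls back on error,
    # so the stage change and the saved result are persisted together.
    with conn:
        conn.execute(
            "UPDATE threads SET stage = ?, result = ? WHERE thread_id = ?",
            (next_stage, inference_result, thread_id),
        )
    return next_stage
```

Writing the stage and the result in the same transaction keeps the persisted record consistent if the process stops between the two updates.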

[0085] Specifically, the scheduling middleware system for large language model computing power clusters can be applied to scenarios such as fully automated programming engines (where coding task threads and safety check threads coexist, with the safety check thread gaining execution rights with high priority in each tick), medical AI assistants (where diagnostic process threads and medication safety threads coexist, with the medication safety thread always preempting with the highest priority), and customer service robots (where conversation context threads and SLA timeout monitoring threads coexist). These three scenarios use the same kernel and the same set of interfaces; the differences lie only in the content of the thread template files and the event binding logic at the application layer.

[0086] Compared with existing technologies, the embodiments implementing this method achieve the following beneficial effects: (1) In terms of context switching overhead, existing technologies save and restore the complete agent state (a coarse-grained snapshot), while this application only loads the stage and parameter fields of the thread control block (TCB) (fine-grained), thus reducing the amount of data processed by an order of magnitude. (2) In terms of fault recovery cost, the prior art requires retrying from the beginning when the transmission session is interrupted, which wastes cost. In this application, stateless hot switching is performed and execution continues from the TCB, saving that cost. (3) In terms of resource utilization, in the prior art each task occupies a dedicated agent instance. In this application, the same context window is reused by multiple threads in a time-division multiplexing manner, thereby improving resource utilization. (4) In terms of development threshold, existing technologies require programming skills (such as Python or YAML), while this application only requires natural language skills, thus lowering the development threshold. (5) In terms of behavior modification cycle, existing technologies require a complete process of code modification, testing, and deployment, which takes hours or days. Modifications to the template file in this application take effect immediately, greatly improving the iteration speed. (6) In terms of architectural innovation, existing technologies (such as AIOS and AgentOS) regard large language models as the core of the operating system or as managed resources. This application regards large language models as the lowest-level memoryless execution unit and uses their stateless characteristics to realize computing power pooling, hot switching, and inherent disaster recovery. This application positions the large language model computing power service as a stateless hardware layer and builds a general scheduling service layer on top of it, filling the structural gap between the API of large language model computing power and the multi-agent orchestration framework. (7) In terms of scheduling granularity, scheduling in the prior art is at the inter-agent or inter-program level. This application implements priority preemptive scheduling of attention threads within a single agent, so the scheduling granularity extends to multiple concurrent task flows within the agent. (8) In terms of state persistence accuracy, the state persistence of existing technologies is at the conversation level (MemGPT) or node level (LangGraph). This application implements fine-grained persistence at the phase/params level, which can accurately restore to the specific position of the last execution and continue to advance, rather than retrying from the beginning or restoring from a coarse-grained checkpoint.
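The priority preemptive scheduling of attention threads within a single agent, mentioned in item (7), might look roughly like the following. This is a hedged sketch only: the `Tcb` dataclass, the `schedule_tick` function, and the state strings are hypothetical, a larger integer is assumed to mean a higher priority, and the running thread is assumed to compete with the ready threads on each tick.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Tcb:
    """Hypothetical thread control block holding only fine-grained fields."""
    thread_id: str
    priority: int          # assumption: larger value means higher priority
    state: str             # "ready", "running", "sleeping", or "terminated"
    stage: str
    params: dict = field(default_factory=dict)


def schedule_tick(table: Dict[str, Tcb]) -> Optional[Tcb]:
    """Pick the highest priority runnable thread and preempt any other running thread."""
    candidates = [t for t in table.values() if t.state in ("ready", "running")]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda t: t.priority)
    for t in candidates:
        # A running thread that lost the comparison goes back to ready.
        if t is not chosen and t.state == "running":
            t.state = "ready"
    chosen.state = "running"
    return chosen
```

For example, with a running coding thread of priority 1 and a ready safety check thread of priority 9, one call to `schedule_tick` moves the coding thread back to ready and returns the safety check thread as the current highest priority thread; only its `stage` and `params` fields would then be needed to resume work.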

[0087] Figure 1 is a schematic diagram illustrating a scenario of the scheduling method for a large language model computing power cluster according to an embodiment of the present invention. As shown in Figure 1, the scheduling method for a large language model computing power cluster in the embodiments of the present invention is applied to a scheduling kernel layer 10. The scheduling kernel layer 10 runs in the scheduling middleware system of the large language model computing power cluster. The scheduling middleware system of the large language model computing power cluster further includes a computing power service layer 20 and an application interface layer 30. The scheduling kernel layer 10 is communicatively connected to both the computing power service layer 20 and the application interface layer 30. The scheduling kernel layer 10 is used to execute the scheduling method for the large language model computing power cluster described in any of the foregoing embodiments.

[0088] It should be noted that those skilled in the art can clearly understand that, for the specific implementation process of the above-mentioned scheduling kernel layer and each unit, reference can be made to the corresponding description in the foregoing method embodiments. For the sake of convenience and brevity, it will not be repeated here.

[0089] The present invention also provides a scheduling middleware system for a large language model computing power cluster, which includes a computing power service layer, a scheduling kernel layer and an application interface layer; the scheduling kernel layer is communicatively connected to the computing power service layer and the application interface layer; the scheduling kernel layer is used to execute the scheduling method of the large language model computing power cluster described in any of the foregoing embodiments.

[0090] It should be noted that those skilled in the art can clearly understand that, for the specific implementation process of the above-mentioned scheduling middleware system of the large language model computing power cluster and each of its units, reference can be made to the corresponding description in the foregoing method embodiments. For the sake of convenience and brevity, it will not be repeated here.

[0091] The scheduling system of the aforementioned large language model computing power cluster can be implemented as a computer program, which can run on a computer device such as the one shown in Figure 6.

[0092] Please refer to Figure 6, which is a schematic block diagram of a computer device provided in an embodiment of the present invention. This computer device integrates any of the scheduling systems for large language model computing power clusters provided in the embodiments of the present invention.

[0093] Referring to Figure 6, the computer device 400 includes a processor 402, a memory, and a network interface 405 connected via a system bus 401. The memory may include a storage medium 403 and internal memory 404.

[0094] The storage medium 403 may store an operating system 4031 and a computer program 4032. The computer program 4032 includes program instructions that, when executed, cause the processor 402 to perform a scheduling method for a large language model computing power cluster.

[0095] The processor 402 provides computing and control capabilities to support the operation of the entire computer device.

[0096] The internal memory 404 provides an environment for the computer program 4032 in the storage medium 403 to run. When the computer program 4032 is executed by the processor 402, the processor 402 can execute the scheduling method of the large language model computing power cluster described above.

[0097] The network interface 405 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of a portion of the structure related to the present invention and does not constitute a limitation on the computer device to which the present invention is applied. A specific computer device may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0098] The processor 402 is used to run the computer program 4032 stored in the memory to implement the scheduling method of the large language model computing power cluster described above.

[0099] It should be understood that, in this embodiment of the invention, the processor 402 may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

[0100] It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program includes program instructions and can be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the embodiments of the above methods.

[0101] Therefore, the present invention also provides a computer-readable storage medium. This computer-readable storage medium stores a computer program, wherein the computer program includes program instructions. When executed by a processor, the program instructions cause the processor to perform the scheduling method for the large language model computing power cluster described above.

[0102] The storage medium can be any computer-readable storage medium that can store program code, such as a USB flash drive, external hard drive, read-only memory (ROM), magnetic disk, or optical disk.

[0103] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0104] In the several embodiments provided by this invention, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of each unit is merely a logical functional division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.

[0105] The steps in the method of this invention can be adjusted, merged, or reduced in order according to actual needs. The units in the device of this invention can be merged, divided, or reduced according to actual needs. Furthermore, the functional units in the various embodiments of this invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0106] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention.

[0107] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A scheduling method of a large language model computing power cluster, characterized in that, The method is applied to the scheduling kernel layer, which runs in the scheduling middleware system of a large language model computing power cluster. The scheduling middleware system of the large language model computing power cluster also includes a computing power service layer and an application interface layer. The scheduling kernel layer is communicatively connected to both the computing power service layer and the application interface layer. If a scheduling event is detected in the computing power service layer, the highest priority thread is loaded from the persistent storage space of the thread control block according to the pre-deployed deterministic kernel program. The current highest priority thread performs the corresponding context switching operation and completes the assembly of input content to obtain the current complete input data. The current complete input data is then sent to the computing power service layer to be input into the connected target large language model instance for inference to obtain the current inference result. If the current inference result sent by the target large language model instance is received, the current inference result is verified according to the preset acceptance conditions to obtain the current verification result; If the current verification result is determined to be a successful verification result, then the inference execution of the target large language model instance for the current complete input data is determined to be successful. According to the preset execution stage reference information, the current execution stage corresponding to the current highest priority thread in the persistent storage space is changed accordingly, and the current inference result is saved to the area corresponding to the current highest priority thread in the persistent storage space.

2. The method of claim 1, wherein, The deterministic kernel program includes at least an attention scheduler and a thread control block; loading the currently highest priority thread from the persistent storage space of the thread control block according to the pre-deployed deterministic kernel program includes: The attention scheduler loads attention threads in the ready state from the persistent storage space of the thread control block. If there are attention threads in the ready state and the total number is greater than 1, then the attention thread with the highest priority is determined based on the priority of each ready attention thread, and is used as the thread to be run. If it is determined that there is no running attention thread, then the thread to be run is converted into a running state and becomes the current highest priority thread; If it is determined that there is a running attention thread that is different from the thread to be run, then the running attention thread is converted to the ready state, and the thread to be run is converted to the running state and becomes the current highest priority thread. If it is determined that there exists a running attention thread that corresponds to the same thread as the thread to be run, then the running attention thread is taken as the current highest priority thread.

3. The method according to claim 1, characterized in that, The current highest priority thread has an agent context window, and the agent context window includes at least a role area and a task area. The task area includes a context area and an attention cue area. The data in the task area is cleared before each context switching operation. The process of executing the corresponding context switching operation through the current highest priority thread and completing the assembly of input content to obtain the current complete input data includes: Obtain the agent role definition information of the target large language model instance and set it in the role area; Obtain the natural language instruction content corresponding to the target large language model instance and set it in the attention cue area; Obtain the background knowledge and tool documents corresponding to the context reference relationship of the current highest priority thread, and set them in the context area to obtain the current complete input data corresponding to the agent context window of the current highest priority thread.

4. The method according to claim 3, characterized in that, The attention weight of the attention cue area is greater than the attention weight of the role area, and the attention weight of the role area is greater than the attention weight of the context area.

5. The method according to claim 2, characterized in that, after the step of converting the thread to be run into a running state and making it the current highest priority thread if it is determined that no running attention thread exists, or after the step of converting the running attention thread into a ready state and converting the thread to be run into a running state and making it the current highest priority thread if it is determined that a running attention thread exists and corresponds to a different thread than the thread to be run, or after the step of making the running attention thread the current highest priority thread if it is determined that a running attention thread exists and corresponds to the same thread as the thread to be run, the method further includes: obtaining attention threads that are in the ready state and have a total number of live scheduling cycles greater than 0 from the persistent storage space of the thread control block, decrementing the remaining scheduling cycle count of the obtained attention threads by 1, and converting attention threads with a remaining scheduling cycle count of 0 into attention threads in the terminated state; and obtaining attention threads in the sleeping state from the persistent storage space of the thread control block, and converting each obtained attention thread into the ready state when its preset wake-up condition is met.

6. The method according to any one of claims 1-5, characterized in that, The attributes of the attention thread in the persistent storage space of the thread control block include at least the thread identifier, priority, thread state, natural language instruction content, context reference relationship, lifespan, acceptance condition, and wake-up condition; wherein, the lifespan includes the total number of live scheduling cycles and the number of remaining scheduling cycles.

7. The method according to claim 6, characterized in that, The persistent storage space of the thread control block contains a thread storage table whose table structure fields include at least the thread template name, the name of the current execution stage, a stage parameter dictionary, the number of remaining scheduling cycles, thread acceptance conditions, thread priority, and thread wake-up conditions. The priority of the attributes included in the attention thread corresponds to the thread priority in the thread storage table, the acceptance conditions correspond to the thread acceptance conditions in the thread storage table, and the wake-up conditions correspond to the thread wake-up conditions in the thread storage table.

8. The method according to claim 7, characterized in that, The complete lifecycle of the attention thread includes a planning phase, an implementation phase, and a testing phase. When the attention thread switches from its current lifecycle state to another lifecycle state through the phase switching interface, the natural language instruction content, priority, lifespan, and wake-up conditions in the attention thread's attributes are updated accordingly based on the other lifecycle state. The name of the current execution phase and the phase parameter dictionary of the attention thread are also updated accordingly based on the other lifecycle state. The lifecycle state and the other lifecycle state are both one of the planning phase, the implementation phase, and the testing phase, and the lifecycle state and the other lifecycle state are different.

9. The method according to claim 1, characterized in that, The step of verifying the current inference result according to preset acceptance conditions to obtain the current verification result includes: The acceptance criteria and a preset tiered retry strategy are obtained. Based on the tiered retry strategy and the acceptance criteria, the current inference result is verified at least once to obtain the current verification result. The acceptance criteria include structured rules, natural language rules, or regular expressions.

10. The method according to claim 9, characterized in that, the process of performing at least one verification on the current inference result based on the tiered retry strategy and the acceptance criteria to obtain the current verification result includes: based on the acceptance criteria, performing a preliminary verification on the current inference result to obtain an initial verification result; if the initial verification result is determined to be a verification pass result, using the initial verification result as the current verification result; if the initial verification result is determined to be a verification failure result, obtaining the preset error example content corresponding to the acceptance condition, injecting the error example content into the current complete input data to update the current complete input data, and sending the current complete input data to the computing power service layer to be input into the target large language model instance for inference to obtain a second inference result; if it is determined that the second inference result passes the verification of the acceptance condition, using the second inference result as the current verification result; if it is determined that the second inference result fails the verification of the acceptance condition, sending the current complete input data to the computing power service layer to be input into another target large language model instance with a higher inference ability rating than the target large language model instance to obtain a third inference result; if it is determined that the third inference result passes the verification of the acceptance condition, using the third inference result as the current verification result; if it is determined that the third inference result fails the verification of the acceptance condition, changing the current highest priority thread from the running state to the sleeping state, and saving all current state information of the current highest priority thread to the thread storage table in the persistent storage space of the thread control block.

11. A scheduling device for a large language model computing power cluster, characterized in that, the scheduling device is configured in a scheduling kernel layer, which runs in the scheduling middleware system of the large language model computing power cluster. The scheduling middleware system of the large language model computing power cluster also includes a computing power service layer and an application interface layer. The scheduling kernel layer is communicatively connected to both the computing power service layer and the application interface layer. The scheduling kernel layer is used to execute the scheduling method of the large language model computing power cluster as described in any one of claims 1-10.

12. A scheduling middleware system for a large language model computing power cluster, characterized in that, It includes a computing power service layer, a scheduling kernel layer, and an application interface layer; the scheduling kernel layer is communicatively connected to both the computing power service layer and the application interface layer; the scheduling kernel layer is used to execute the scheduling method of the large language model computing power cluster as described in any one of claims 1-10.

13. A computer device, characterized in that, The computer device includes a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the scheduling method for the large language model computing power cluster as described in any one of claims 1-10.

14. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, which includes program instructions that, when executed by a processor, can implement the scheduling method for a large language model computing power cluster as described in any one of claims 1-10.

Citation Information

Patent Citations

Interactions with a generative response engine during a long running task
US12346664B1
State machine backed LLM agents
US20250110753A1
