Natural language-driven edge-cloud collaborative intelligent operation and maintenance system and control method thereof

The edge-cloud collaborative intelligent operation and maintenance system driven by natural language solves the problems of low automation and weak security auditing in operation and maintenance. It realizes an intelligent closed loop from intent understanding to execution, shortens fault repair time, ensures operational safety and compliance, and makes operation and maintenance experience version-based and reusable.

CN122309294A · Pending · Publication Date: 2026-06-30 · SHANDONG CITY COMMERCIAL BANK COOP ALLIANCE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANDONG CITY COMMERCIAL BANK COOP ALLIANCE CO LTD
Filing Date
2026-05-29
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies suffer from low levels of automation in operations and maintenance, lack of closed-loop capabilities, weak security auditing, low efficiency in edge-cloud collaboration, and difficulty in accumulating and reusing operational knowledge.

Method used

The edge-cloud collaborative intelligent operation and maintenance system, driven by natural language, includes a management plane, control plane, edge agent, intelligent operation and maintenance center, task center, and skill center. It realizes intent understanding, planning, routing, reasoning, and summarization through a directed graph state machine execution engine. Combined with RBAC and three-layer audit, it supports the asset-based management of operation and maintenance knowledge.

Benefits of technology

It achieves an intelligent closed loop from operational intent to execution, significantly shortens MTTR, ensures operational safety and compliance, allows for version-based reuse of operational experience, and the system has the ability to self-evolve.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122309294A_ABST

Abstract

This application discloses a natural language-driven edge-cloud collaborative intelligent operation and maintenance system and its control method, belonging to the field of intelligent operation and maintenance technology. The system includes a management plane, a control plane, an edge agent, an intelligent operation and maintenance center, a task center, and a skill center. The intelligent operation and maintenance center uses a directed graph state machine engine to convert natural language operation and maintenance goals into executable steps. The task center atomically distributes tasks to the edge agent and streams execution fragments. The skill center implements version management and AI-assisted generation of operation and maintenance skills. This invention achieves a fully automated closed loop from understanding operation and maintenance intentions to edge execution and knowledge accumulation, significantly shortening fault repair time, improving operation and maintenance security and knowledge reuse capabilities, and has outstanding substantive features and significant progress.

Description

Technical Field

[0001] This application relates to a natural language-driven edge-cloud collaborative intelligent operation and maintenance system and its control method, belonging to the field of intelligent operation and maintenance technology.

Background Technology

[0002] As enterprise IT infrastructure grows larger and more complex, traditional operation and maintenance models face severe challenges. Existing operation and maintenance automation solutions fall mainly into the following categories:
(1) Traditional monitoring and alarm systems (such as Zabbix and Prometheus): These systems focus on metric collection and threshold alarms. Their core shortcoming is that they can only discover "what happened" but cannot explain "why it happened" or "how to solve it". Alarm information is usually presented as raw data or simple aggregations, so operation and maintenance personnel still rely on human experience for root cause analysis and fault handling. This results in a long mean time to repair (MTTR) and makes fault-handling experience difficult to accumulate and reuse. In existing technologies, the "discovery" and the "handling" of an alarm remain two separate stages.
(2) Task scheduling and automation platforms (such as Ansible and Rundeck): These systems execute preset scripts or playbooks to automate task distribution. Their shortcoming is that the automation process is "static" and "preset" and lacks intelligent decision-making capabilities. When faced with unknown fault modes that were not preset, the system cannot plan handling steps autonomously. These platforms also typically require operation and maintenance personnel to define complex workflows manually, which raises learning costs and makes it hard to cope with dynamically changing operation and maintenance environments.
(3) Exploration of AI-based operation and maintenance using large models: In recent years, some solutions have attempted to introduce large models into operation and maintenance, for example by querying logs through natural language. However, these solutions mostly remain at the "question and answer" level or serve merely as auxiliary tools for operation and maintenance personnel. They have not formed a complete and secure automated link from "intent understanding" and "task planning" to "edge execution" and "result closure".

[0003] The existing technologies generally have the following drawbacks:
(1) Lack of closed-loop capability: The conclusions drawn from intelligent analysis cannot be automatically converted into actions on the edge-side host, or the conversion requires complex customized development.
(2) Security and compliance risks: Issuing operation instructions directly from large models is risky, and effective access control, approval processes, and operation audit mechanisms are lacking.
(3) Difficulty in accumulating capabilities: Successful experience from the operation and maintenance process (such as troubleshooting logic and handling scripts) is difficult to structure into reusable, version-managed assets, resulting in "knowledge silos".
(4) Low efficiency of edge-cloud collaboration: The edge-side host and the cloud management platform usually rely on simple polling or long-lived connections and lack refined mechanisms for task distribution, status synchronization, and skill updates, which makes it difficult to adapt to large-scale, high-concurrency operation and maintenance scenarios.

[0004] Therefore, there is an urgent need for a closed-loop system that can automatically understand natural language operation and maintenance goals, execute edge tasks securely and controllably, and realize the assetization of operation and maintenance knowledge.

Summary of the Invention

[0005] This invention provides a natural language-driven edge-cloud collaborative intelligent operation and maintenance system and its control method, aiming to solve the problems of low automation level, lack of closed-loop capability and weak security audit in the existing technology.

[0006] The technical solution adopted by this application to solve its technical problem is: On one hand, a natural language-driven edge-cloud collaborative intelligent operation and maintenance system is provided, including a management plane, a control plane, an edge agent, an intelligent operation and maintenance center, a task center, and a skill center. The management plane and control plane are deployed on different physical or virtual hosts and are connected in communication. The intelligent operation and maintenance center uses a directed graph-based state machine execution engine (such as, but not limited to, LangGraph) to understand, plan, route, reason, and summarize natural language objectives. The task center provides an atomic polling distribution and streaming fragment collection mechanism. The skill center realizes version management and AI-assisted generation of operation and maintenance skills. The system completes a closed-loop process of "intent → planning → execution → auditing → data accumulation" through edge-cloud collaboration.

[0007] As a preferred option, the intelligent operation and maintenance center includes a read-only snapshot module, an intent understanding module, a planner module, a router module, a sub-agent module, a summarizer module, and a final response generation module. These modules work together to achieve complex operation and maintenance decisions.

[0008] As a preferred option, the task center adopts a pending → dispatched → running → success / failed state machine and ensures the ordering of streaming fragments through unique sequence numbers, supporting incremental polling and retrieval by the management front end.

[0009] As a preferred option, the Skills Center supports skill package upload verification, AI generation optimization, differentiation between built-in and non-built-in skills, and server-side caching tool injection.

[0010] On the other hand, a natural language-driven edge-cloud collaborative intelligent operation and maintenance control method based on the system is provided, including the following steps:
S1: The management plane receives the operation and maintenance goals input by the user through natural language, creates an operation plan, and sets the scenario type;
S2: The intelligent operation and maintenance center calls the read-only snapshot module to pull read-only evidence from the alarm center, health inspection module, and document center, compresses it, and stores it in short-term memory;
S3: The intent understanding module parses the operation and maintenance goals based on heuristic rules and a large language model, and outputs the intent category, constraints, and success criteria;
S4: The planner module dynamically generates an ordered sequence of steps based on the scenario template or the large language model, with each step containing expected evidence and a confidence level;
S5: For each step, the router module decides the call type: if a read-only tool is called, the data is obtained directly; if a skill is called, the skill content is obtained through the skill center;
S6: If execution on the edge side is required, the task center is called to create a task and dispatch it to the specified agent;
S7: The task center performs atomic polling and dispatch; the agent receives the task, executes it, and reports incremental fragments through the streaming interface; the task center stores the fragments and provides incremental retrieval;
S8: The sub-agent module calls the large language model to generate sub-conclusions and suggestions for steps with low confidence or requiring deep reasoning, and performs quality review based on the suggestions;
S9: The summarizer module performs a phased summary of the results of each step, generates reflection experience, and writes it into the knowledge approval queue;
S10: The final response generation module summarizes the conclusions of all steps, generates a final response in lightweight markup language (Markdown) format, and displays it to the user through the management interface;
S11: After approval, the reflection experience is written into the document center to form a reusable knowledge asset.

[0011] One of the above technical solutions has the following advantages or beneficial effects: This invention organically integrates multiple technologies such as natural language understanding, directed graph state machine orchestration, edge-cloud task closed-loop execution, and skill asset management to achieve an intelligent closed loop from "discovery" to "disposal," significantly shortening MTTR (experimental data shows an average reduction of approximately 79%). The separation of the management and control planes, combined with RBAC and three-layer auditing, ensures operational safety and compliance. The skill center enables versioned reuse of operational experience, and AI generation capabilities lower the barrier to skill writing.

[0012] The aforementioned technical features work together: the separation of the management and control planes fundamentally eliminates the risk of agents directly accessing sensitive configurations; the directed graph-based state machine orchestration enables operational goals to be dynamically and adaptively decomposed into atomic steps, and supports low-confidence interception and manual review, overcoming the rigidity of traditional fixed scripts; skill versioning and AI generation capabilities allow operational experience to be continuously accumulated and evolved, forming a positive feedback loop. The combination of these three features enables the system not only to handle known faults, but also to attempt new solutions for unknown fault modes through LLM reasoning and sub-agents, and to automatically transform successful experience into reusable skill assets. This gives the operation and maintenance system a self-evolution capability that no existing technology using only state machines, task queues, or knowledge bases can achieve.

Description of the Drawings

[0013] Figure 1 is a schematic diagram illustrating the structure of a natural language-driven edge-cloud collaborative intelligent operation and maintenance system according to an exemplary embodiment; Figure 2 is a flowchart illustrating a natural language-driven edge-cloud collaborative intelligent operation and maintenance control method according to an exemplary embodiment; Figure 3 is a schematic diagram of the overall system architecture of the present invention according to an exemplary embodiment; Figure 4 is a technical roadmap illustrating a specific embodiment of the present invention according to an exemplary embodiment.

Detailed Implementation

[0014] To more clearly illustrate the technical features of this application, the following detailed description is provided through specific embodiments and in conjunction with the accompanying drawings.

[0015] For ease of understanding, the following English terms and abbreviations are explained uniformly in this specification:
Agent: A software module deployed on the target host to perform operation and maintenance tasks;
Large Language Model (LLM): A deep learning model trained on massive amounts of data that can understand and generate natural language;
Model Context Protocol (MCP): A standardized protocol for interaction between large language models and external tools;
JSON (JavaScript Object Notation): A lightweight data interchange format;
Markdown: A lightweight markup language used for formatting text;
ZIP: A file compression format;
YAML: A human-readable data serialization format;
Server-Sent Events (SSE): A technology that allows servers to proactively push data to clients;
Identifier (ID): A string of characters used to uniquely identify an object;
Uniform Resource Locator (URL): A string used to locate network resources;
Artificial intelligence (AI): The technology that simulates human intelligence;
Token: The basic unit in which large language models process text;
Agent Skills: A package of executable capabilities organized according to a specification.

[0016] Embodiment 1: As shown in Figure 1, the natural language-driven edge-cloud collaborative intelligent operation and maintenance system of this embodiment includes:
The management plane, used to provide the user interaction interface, permission management, auditing, and configuration management;
The control plane, deployed on a different physical or virtual host from the management plane and connected to it in communication, used to perform task polling and result collection with at least one edge agent;
The edge agent, deployed on the target host, used to collect host health indicators, execute operation and maintenance tasks issued from the cloud, and stream execution fragments back;
The intelligent operation and maintenance center, deployed on the management plane, configured to: receive operation and maintenance goals described in natural language; use a state machine execution engine based on a directed graph to perform operation and maintenance intent understanding, read-only evidence snapshot retrieval, operation and maintenance step planning, tool / skill / sub-agent routing, sub-agent deep reasoning, phase summary, and final response generation; and call the task center to dispatch execution tasks to the edge agent based on the routing results;
The task center, deployed on the control plane, used to create, poll, and distribute tasks, maintain the task state machine (pending → dispatched → running → success / failed), receive streaming output fragments and final results reported by the edge agent, and provide an interface for incrementally pulling fragments by sequence number;
The skill center, deployed on the management plane, used to manage skill packages that conform to the Agent Skills specification, enabling versioned storage of skill packages, AI-assisted generation and optimization, client-based registration and distribution, and server-side skill cache injection.

[0017] The intelligent operation and maintenance center includes:
The read-only snapshot module, used to aggregate read-only evidence from the alarm center, health inspection module, and document center and compress it into a context that the LLM can accept;
The intent understanding module, used to structure natural language goals into primary intent categories, constraints, and success criteria based on heuristic rules and large language models;
The planner module, used to dynamically generate an execution plan containing multiple steps based on a scenario template or the LLM, with each step including evidence identifiers and a confidence level;
The router module, used to select, for each step, whether to invoke read-only tools, platform skills, MCP (Model Context Protocol) tools, memory read / write, or sub-agents, and to trigger interception and manual review when the confidence level is lower than the threshold;
The sub-agent module, used to call the large language model to generate conclusions and suggestions for steps that require deep reasoning, and to perform quality checks based on the suggestions;
The summarizer module, used to generate phase summaries and reflections on operational experience, and to write the reflections into the approval queue;
The final response generation module, used to converge the conclusions of each stage into a user-friendly final response in Markdown format.

[0018] The task center includes:
The task creation unit, which supports the creation of single tasks and batch tasks and generates standardized payloads based on task type (command, script, agent instruction, skill synchronization);
The polling and retrieval unit, which provides atomic state transitions, updating pending tasks to dispatched in batches and returning the updated tasks to the edge agent;
The streaming collection unit, which receives text fragments reported by the edge agent, automatically assigns them incrementing sequence numbers, and stores them in a way that prevents duplicate writes;
The result storage unit, which receives the final execution status (success / failed) and output, and updates the task status;
The pull interface, which allows the management front end to pull fragments incrementally according to the after_seq parameter and returns the final output when the task is completed.
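As a minimal sketch of the task state machine named above, the following Python snippet encodes the pending → dispatched → running → success / failed transitions. The `TRANSITIONS` table and the `advance` helper are illustrative names that do not appear in the application, and the sketch assumes that no transition other than those named in the text is legal (for example, it does not allow dispatched → failed, which the text neither permits nor forbids).

```python
# Allowed transitions, taken from the state machine named in the text.
# Anything not listed here is treated as illegal (an assumption).
TRANSITIONS = {
    "pending": {"dispatched"},
    "dispatched": {"running"},
    "running": {"success", "failed"},
    "success": set(),  # terminal state
    "failed": set(),   # terminal state
}


def advance(current: str, target: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


# Walk one task through the happy path.
status = "pending"
for nxt in ("dispatched", "running", "success"):
    status = advance(status, nxt)
```

Keeping the transitions in one table makes it straightforward to reject out-of-order reports from an agent before they reach storage.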

[0019] The skill center includes:
The skill package entry module, used to parse ZIP-format skill packages, verify the YAML front matter in the skill Markdown file (SKILL.md), extract attachments and perform security filtering, and determine the deployment scope (client / server) and version number;
The version management module, which supports incremental release and rollback of skill versions;
The registration and distribution module, which controls the skill synchronization set of a specified Agent by registering / unregistering skills that are not built into the client;
The generation optimization module, which calls the large language model to generate compliant skill packages or optimize existing skills based on natural language descriptions, and returns the generation process through the streaming SSE (Server-Sent Events) interface;
The server-side caching module, which caches server-scope skills to a local directory and generates a manifest file, from which the execution engine dynamically loads them as tool functions.

[0020] Embodiment 2: As shown in Figure 2, the natural language-driven edge-cloud collaborative intelligent operation and maintenance control method of this embodiment includes the following steps: S1: The management plane receives the operation and maintenance goals input by the user through natural language, creates the operation plan, and sets the scenario type.

[0021] The process of creating an operation plan and setting the scenario type specifically includes: receiving a natural language string input by the user, calling the scenario classifier: if the natural language string contains "inspection", "overview", or "status", then the scenario is set as inspection; if it contains "alarm", "analysis", or "warning", then the scenario is set as alert_analysis; if it contains "root cause", "why", or "reason", then the scenario is set as root_cause; otherwise, it is set as custom; creating an operation plan record, generating a unique plan identifier (ID), with the initial state being draft, and storing the natural language string in the goal field.
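The keyword rules in paragraph [0021] can be sketched as follows. This is an illustrative reading, not the applicant's implementation: the English keywords are taken from the translated text, the first-match-wins rule order is an assumption, and `classify_scenario`, `create_plan`, and the `scenario` and `status` key names are hypothetical.

```python
import uuid

# Keyword lists follow paragraph [0021]; checking them in this order
# (first match wins) is an assumption.
SCENARIO_RULES = [
    ("inspection", ("inspection", "overview", "status")),
    ("alert_analysis", ("alarm", "analysis", "warning")),
    ("root_cause", ("root cause", "why", "reason")),
]


def classify_scenario(goal: str) -> str:
    """Map a natural language goal to a scenario type by substring match."""
    text = goal.lower()
    for scenario, keywords in SCENARIO_RULES:
        if any(keyword in text for keyword in keywords):
            return scenario
    return "custom"


def create_plan(goal: str) -> dict:
    """Create an operation plan record in the initial draft state."""
    return {
        "id": uuid.uuid4().hex,  # unique plan identifier
        "status": "draft",
        "goal": goal,
        "scenario": classify_scenario(goal),
    }
```

Plain substring matching is cheap but order-sensitive: a goal containing both "status" and "why" would be classified as inspection under this ordering.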

[0022] S2: The intelligent operation and maintenance center calls the read-only snapshot module to pull read-only evidence from the alarm center, health inspection module and document center, compress it and store it in short-term memory.

[0023] Step S2 specifically includes: calling the alarm center interface to obtain the most recent 40 alarm records, retaining only the title, severity, and occurrence time of each alarm; calling the health inspection module to obtain the most recent 50 health reports, truncating the large language model summary (llm_summary) to 1200 characters, and compressing the indicator summary to 800 characters; calling the document center, using the operation and maintenance target as the query term for word segmentation and matching, and selecting the 8 document fragments with the highest relevance; concatenating the obtained alarm data, health reports, and document fragments into structured text, and if the total length exceeds the preset token budget, further compressing the alarm sample to 20 records, the health report sample to 30 records, and retaining the top 5 document fragments.
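A hedged sketch of the two-tier compression in step S2 follows. The sample sizes (40 / 50 / 8, then 20 / 30 / 5) and the truncation lengths (1200 and 800 characters) come from the text; the record field names other than `llm_summary`, the line format, and the use of a pluggable `count_tokens` function (defaulting to character count) are assumptions. Inputs are assumed to be pre-sorted, most recent or most relevant first.

```python
def build_snapshot(alarms, reports, doc_fragments, token_budget, count_tokens=len):
    """Concatenate read-only evidence, shrinking the samples if over budget."""

    def render(n_alarms, n_reports, n_docs):
        lines = []
        for alarm in alarms[:n_alarms]:
            # Keep only title, severity, and occurrence time of each alarm.
            lines.append(
                f"[alarm] {alarm['title']} | {alarm['severity']} | {alarm['occurred_at']}"
            )
        for report in reports[:n_reports]:
            summary = report["llm_summary"][:1200]      # truncate LLM summary
            metrics = report["metrics_summary"][:800]   # hypothetical field name
            lines.append(f"[health] {summary} | {metrics}")
        for fragment in doc_fragments[:n_docs]:
            lines.append(f"[doc] {fragment}")
        return "\n".join(lines)

    text = render(40, 50, 8)
    if count_tokens(text) > token_budget:
        # Second tier from the text: 20 alarms, 30 reports, top 5 fragments.
        text = render(20, 30, 5)
    return text
```

A real deployment would pass a tokenizer-backed `count_tokens`; character count is used here only to keep the sketch self-contained.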

[0024] S3: The intent understanding module analyzes operational goals based on heuristic rules and a large language model, and outputs the intent category, constraints, and success criteria.

[0025] The specific steps for parsing the operational goals include: first, using heuristic rules to extract keywords and matching them with predefined intent templates (insight / troubleshoot / inventory / comparison / action / other) to generate preliminary intents; if a large language model is configured on the management side, the operational goals, read-only evidence summaries, and intent prompts are sent to the large language model, requiring the output of the primary intent (primary_intent), user needs (user_need), constraints (constraints), and success criteria (success_criteria) in JSON (JavaScript Object Notation) format; merging the output of the large language model (LLM) with the heuristic results, with the large language model result taking precedence in case of conflict; finally, storing the structured intent data in the user intent (user_intent) field of short-term memory.
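The merge rule in paragraph [0025], where the LLM result wins on conflict, might look like the sketch below. The field names `primary_intent`, `user_need`, `constraints`, and `success_criteria` and the six intent templates are from the text; `merge_intent` itself, and the choice to ignore empty LLM fields and unknown intent labels, are assumptions.

```python
INTENT_TEMPLATES = {"insight", "troubleshoot", "inventory", "comparison", "action", "other"}
INTENT_FIELDS = ("primary_intent", "user_need", "constraints", "success_criteria")


def merge_intent(heuristic: dict, llm_output=None) -> dict:
    """Merge heuristic and LLM intents; the LLM value wins when both are set."""
    merged = dict(heuristic)
    if not llm_output:
        return merged  # no LLM configured: keep the heuristic result
    for field in INTENT_FIELDS:
        value = llm_output.get(field)
        if value in (None, "", []):
            continue  # assumption: an empty LLM field does not override
        if field == "primary_intent" and value not in INTENT_TEMPLATES:
            continue  # assumption: unknown labels are ignored
        merged[field] = value
    return merged


user_intent = merge_intent(
    {"primary_intent": "insight", "constraints": []},
    {"primary_intent": "troubleshoot", "constraints": ["read-only"]},
)
```

The result would then be stored under the `user_intent` field of short-term memory, as the paragraph describes.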

[0026] S4: The planner module dynamically generates an ordered sequence of steps based on a scenario template or a large language model. Each step includes expected evidence and confidence level.

[0027] The dynamic generation of ordered step sequences specifically includes: if the scenario is inspection and the target involves simple indicator queries, then a simple mode is enabled, directly generating two steps: ① obtaining a snapshot of key indicators, ② generating health recommendations; if the scenario is root cause analysis, then the Large Language Model Planner (LLM Planner) is invoked, the intent and read-only evidence are input, and an array of steps is required to be output, each step containing an identifier (ID), title, role, evidence IDs, and confidence; the steps output by the Large Language Model (LLM) are validated for legality, steps lacking confidence are assigned a default value of 0.5, and the step sequence is stored in the plan JSON (plan_json), where the roles include tools / subagents.
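The validation pass over LLM-generated steps can be sketched as below. The default confidence of 0.5 and the step fields come from the text; the literal role strings `"tool"` and `"subagent"`, the decision to drop malformed steps rather than repair them, and the generated fallback identifiers are assumptions.

```python
ALLOWED_ROLES = {"tool", "subagent"}  # assumed literals for "tools / subagents"
DEFAULT_CONFIDENCE = 0.5              # default named in the text


def validate_steps(raw_steps):
    """Keep well-formed steps and fill in a default confidence where missing."""
    steps = []
    for index, raw in enumerate(raw_steps, start=1):
        if not isinstance(raw, dict) or not raw.get("title"):
            continue  # assumption: steps without a title are dropped
        if raw.get("role") not in ALLOWED_ROLES:
            continue  # assumption: unknown roles are dropped
        confidence = raw.get("confidence")
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        if not is_number or not 0 <= confidence <= 1:
            confidence = DEFAULT_CONFIDENCE
        steps.append({
            "id": str(raw.get("id") or f"step-{index}"),
            "title": raw["title"],
            "role": raw["role"],
            "evidence_ids": list(raw.get("evidence_ids") or []),
            "confidence": float(confidence),
        })
    return steps


def simple_inspection_plan():
    """The fixed two-step plan used in simple mode."""
    return validate_steps([
        {"id": "s1", "title": "Obtain a snapshot of key indicators", "role": "tool"},
        {"id": "s2", "title": "Generate health recommendations", "role": "subagent"},
    ])
```

The role assigned to each simple-mode step is a guess; the text only names the two step titles.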

[0028] S5: For each step, the router module decides the call type: if a read-only tool is called, the data is obtained directly; if a skill is called, the skill content is obtained through the skill center and executed; if execution on the end side is required, the task center is called to create a task and dispatch it to the specified Agent.

[0029] The router module decision call type specifically includes: For the current step, first check if there is a batch route cache; if so, use it directly; otherwise, call the Router LLM, inputting the step title, read-only evidence summary, and list of available tools / skills for the current step, requiring the output of a call array, each call containing type, name, reason, and confidence; only calls allowed by type are retained, deduplicated by "type, name", limiting a maximum of 5 calls per step; if the call type is a skill and the skill exists in the skill center, the skill is executed (server-side skills are called directly, client-side skills create tasks through the task center); if the confidence is below the threshold (e.g., 0.45) and the interception switch is on, the step is marked as needing review, interrupting subsequent automatic execution. The types include tools / skills / model context protocol tools / short memory / long memory / subagents.
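The post-processing applied to the Router LLM output (type filtering, deduplication by type and name, a cap of 5 calls, and low-confidence interception) can be sketched as follows. The threshold 0.45 and the cap come from the text; the literal type strings, the helper names, and the choice to intercept when any retained call falls below the threshold are assumptions.

```python
# Assumed literal names for the call types listed in the text.
ALLOWED_TYPES = {"tool", "skill", "mcp_tool", "short_memory", "long_memory", "subagent"}


def filter_calls(calls, max_calls=5):
    """Drop disallowed types and duplicates, keeping at most max_calls."""
    seen, kept = set(), []
    for call in calls:
        key = (call.get("type"), call.get("name"))
        if key[0] not in ALLOWED_TYPES or key in seen:
            continue
        seen.add(key)
        kept.append(call)
        if len(kept) == max_calls:
            break
    return kept


def needs_review(calls, threshold=0.45, intercept_enabled=True):
    """True if interception is on and any call is below the threshold.

    A call without a confidence value is treated as 0.0 (an assumption).
    """
    return intercept_enabled and any(
        call.get("confidence", 0.0) < threshold for call in calls
    )
```

When `needs_review` returns true, the step would be marked as needing review and automatic execution of later steps would stop, as the paragraph describes.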

[0030] S6: The task center performs atomic polling and dispatching. After the agent receives the task, it executes it and reports incremental fragments through the streaming interface. The task center stores the fragments and provides incremental retrieval.

[0031] Step S6 specifically includes: the control plane polling interface receives the client identifier (client_id) of the agent, queries the pending tasks of the agent (up to 5) within a database transaction, updates their status to dispatched, records the dispatch time (dispatched_at), and returns the task list; during task execution, for each output fragment generated, the agent calls POST /tasks/stream on the control plane, carrying the task identifier (task_id), client identifier (client_id), and fragment text (chunk); the control plane verifies that the task exists, the client matches, and the status is dispatched or running, queries the current maximum sequence number (seq), sets the new sequence number to the maximum sequence number plus 1 (max+1), and stores the fragment; the management front end obtains the incremental fragments and the current task status through GET /tasks/{task_id}/stream?after_seq=<last maximum sequence number>, and if the task status is success / failed, the final output (final_output) is also returned.
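To make the atomic dispatch and sequence numbering in step S6 concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table layout, function names, and the use of SQLite's `BEGIN IMMEDIATE` to serialize writers are all assumptions; the application does not name a database. The column names `client_id`, `dispatched_at`, `seq`, `chunk`, and `final_output` follow the text.

```python
import sqlite3


def open_db():
    # isolation_level=None: autocommit mode, transactions are managed explicitly.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.executescript("""
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, client_id TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending', dispatched_at TEXT, final_output TEXT);
        CREATE TABLE chunks (task_id INTEGER, seq INTEGER, chunk TEXT,
            PRIMARY KEY (task_id, seq));
    """)
    return db


def poll(db, client_id, limit=5):
    """Atomically move up to `limit` pending tasks to dispatched."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        ids = [row[0] for row in db.execute(
            "SELECT id FROM tasks WHERE client_id=? AND status='pending' "
            "ORDER BY id LIMIT ?", (client_id, limit))]
        db.executemany(
            "UPDATE tasks SET status='dispatched', dispatched_at=datetime('now') "
            "WHERE id=?", [(task_id,) for task_id in ids])
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return ids


def append_chunk(db, task_id, client_id, chunk):
    """Validate the task, then store the fragment under seq = max + 1."""
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute("SELECT status FROM tasks WHERE id=? AND client_id=?",
                         (task_id, client_id)).fetchone()
        if row is None or row[0] not in ("dispatched", "running"):
            raise ValueError("task missing, client mismatch, or task not active")
        seq = db.execute("SELECT COALESCE(MAX(seq), 0) + 1 FROM chunks WHERE task_id=?",
                         (task_id,)).fetchone()[0]
        db.execute("INSERT INTO chunks VALUES (?, ?, ?)", (task_id, seq, chunk))
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return seq


def pull(db, task_id, after_seq=0):
    """Return fragments newer than after_seq, plus final output when finished."""
    status, final_output = db.execute(
        "SELECT status, final_output FROM tasks WHERE id=?", (task_id,)).fetchone()
    chunks = db.execute(
        "SELECT seq, chunk FROM chunks WHERE task_id=? AND seq>? ORDER BY seq",
        (task_id, after_seq)).fetchall()
    result = {"status": status, "chunks": chunks}
    if status in ("success", "failed"):
        result["final_output"] = final_output
    return result
```

Computing `max + 1` inside the same transaction as the insert, backed by the `(task_id, seq)` primary key, is what keeps sequence numbers unique under concurrent reports in this sketch. The point at which a task moves from dispatched to running is not specified in the text and is left out here.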

[0032] S7: The sub-agent module calls the large language model to generate sub-conclusions and suggestions for steps with low confidence or requiring deep reasoning, and performs quality review based on the suggestions.

[0033] Step S7 specifically includes: when the role of the step is a subagent or the router explicitly calls the subagent, constructing a subagent prompt word containing read-only evidence in short-term memory, user intent, and task description of the current step; calling the large language model and requesting the output of a result, confidence, and suggestions in JSON (JavaScript Object Notation) format; if the confidence is lower than a preset minimum threshold (e.g., 0.6) and quality review is enabled, calling the review large language model again, providing the original conclusion and supplementary evidence, and requesting repair or improvement of the confidence; if the confidence is improved after review, replacing the original conclusion; writing the subagent's output into short-term memory and appending a subagent_completed event.
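The review loop in step S7 can be sketched with the LLM calls injected as plain callables, so the example stays self-contained. The 0.6 threshold and the `result` / `confidence` / `suggestions` output fields follow the text; `run_subagent`, the prompt layout, and the assumption that each callable returns a JSON string are hypothetical.

```python
import json


def run_subagent(prompt, call_llm, review_llm=None, min_confidence=0.6):
    """Run the sub-agent once, then optionally ask a reviewer to improve it.

    call_llm and review_llm are caller-supplied functions that take a prompt
    string and return a JSON string with result, confidence, and suggestions.
    """
    first = json.loads(call_llm(prompt))
    if review_llm is None or first["confidence"] >= min_confidence:
        return first  # review disabled, or confidence already acceptable
    review_prompt = prompt + "\n\nOriginal conclusion:\n" + json.dumps(first)
    reviewed = json.loads(review_llm(review_prompt))
    # Replace the original conclusion only if the review raised the confidence.
    return reviewed if reviewed["confidence"] > first["confidence"] else first


# Stand-in callables for demonstration; a real system would call a model here.
draft = lambda _prompt: json.dumps(
    {"result": "disk full?", "confidence": 0.4, "suggestions": []})
reviewer = lambda _prompt: json.dumps(
    {"result": "disk full on /var", "confidence": 0.8, "suggestions": ["rotate logs"]})
outcome = run_subagent("Diagnose host-3", draft, reviewer)
```

The caller would then write `outcome` into short-term memory and append a `subagent_completed` event.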

[0034] S8: The summarizer module summarizes the results of each step, generates reflection experiences, and writes them into the knowledge approval queue.

[0035] Step S8 specifically includes: the summarizer module collects the conclusions, routing call records, and final intermediate results of all completed steps, and calls the large language model to generate a reflection text that includes the key findings, effective steps, and improvement suggestions of this operation and maintenance process; the reflection text is assembled with metadata such as the operation plan identifier, step identifier, and generation time into a knowledge write request, whose status is set to pending and which is stored in the knowledge write request (OpsKnowledgeWriteRequest) table; at the same time, a knowledge write pending (knowledge_write_pending) event is sent to the Server-Sent Events (SSE) stream for display on the management front end.
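A small sketch of the knowledge write request and its SSE notification follows. The `pending` status and the `knowledge_write_pending` event name come from the text, and the `event:` / `data:` framing is the standard Server-Sent Events wire format; the dictionary keys and helper names are assumptions.

```python
import json
from datetime import datetime, timezone


def build_write_request(plan_id, step_ids, reflection):
    """Assemble a pending knowledge write request for the approval queue."""
    return {
        "plan_id": plan_id,
        "step_ids": list(step_ids),
        "content_json": {"reflection": reflection},
        "created_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "status": "pending",
    }


def sse_event(name, payload):
    """Serialize one Server-Sent Events message (event name plus JSON data)."""
    return f"event: {name}\ndata: {json.dumps(payload)}\n\n"


request = build_write_request("plan-1", ["s1", "s2"], "Log rotation fixed the disk alarm.")
message = sse_event("knowledge_write_pending", {"plan_id": request["plan_id"]})
```

Because `json.dumps` emits a single line by default, the payload fits in one `data:` field without further escaping.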

[0036] S9: The final response generation module summarizes the conclusions of all steps, generates a final response in Markdown format, and displays it to the user through the management interface.

[0037] Step S9 specifically includes: if final answer generation using a large language model is enabled, constructing prompts containing the user's goal, summaries of each step, and short-term memory; calling the large language model to output JSON that includes the answer, highlights, and risk level; if the large language model does not produce an answer, iterating through the summaries of each step in long-term memory and concatenating them into structured text as the final response; writing the final response into the final answer field of the execution plan, appending a final answer generation event, and pushing it to the front end via Server-Sent Events (SSE). The answer is in Markdown format.
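The fallback behaviour in step S9 (use the LLM answer when available, otherwise concatenate step summaries) can be sketched as below. The `answer`, `highlights`, and risk level outputs follow the text; the key name `risk_level`, the Markdown layout of the fallback, and the injected `call_llm` callable returning a JSON string are assumptions.

```python
import json


def final_answer(goal, step_summaries, call_llm=None):
    """Return the LLM answer if one is produced, else a concatenated fallback."""
    if call_llm is not None:
        try:
            output = json.loads(call_llm(goal, step_summaries))
            if isinstance(output, dict) and output.get("answer"):
                return {
                    "answer": output["answer"],
                    "highlights": output.get("highlights", []),
                    "risk_level": output.get("risk_level", "unknown"),
                }
        except (ValueError, TypeError):
            pass  # malformed model output: fall through to the fallback
    lines = [f"## {goal}", ""]
    for step in step_summaries:
        lines.append(f"- **{step['title']}**: {step['summary']}")
    return {"answer": "\n".join(lines), "highlights": [], "risk_level": "unknown"}


summaries = [{"title": "Check disk", "summary": "/var is 97% full"}]
fallback = final_answer("Why is host-3 alarming?", summaries)
```

Treating unparseable model output the same as a missing answer keeps the user-facing response available even when the model misbehaves.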

[0038] S10: After approval, reflective experiences will be written into the document center to form reusable knowledge assets.

[0039] The process of writing reflection experiences into the document center specifically includes: the management interface provides a knowledge approval list interface, where users review requests with a pending status; if approved, the system parses the reflection text in the content JSON (content_json), supplements metadata information, which includes at least the source operation plan identifier, approver, and approval time; it then calls the document center to create a document entry (DocArticle), with the title automatically generated as "Operation Review - {Scenario} - {Timestamp}", and the body containing the reflection content and cited evidence; the approval request status is updated to approved, and the approved document identifier (approved_doc_id) is recorded; if the approval is rejected, the status is updated to rejected, and the reason for rejection is recorded.
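The approval branch described above can be sketched as a single function. The status values, the `approved_doc_id` field, the title pattern, and the minimum metadata (source plan identifier, approver, approval time) follow the text; `review_request`, the request dictionary layout, and the injected `create_doc` callable standing in for the document center are assumptions.

```python
from datetime import datetime, timezone


def review_request(request, approver, approved, create_doc, reason=""):
    """Approve or reject a pending knowledge write request.

    create_doc is a caller-supplied function standing in for the document
    center; it must accept title, body, and metadata and return a document id.
    """
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    updated = dict(request)
    if not approved:
        updated.update(status="rejected", reject_reason=reason)
        return updated
    content = request["content_json"]
    body = content["reflection"]
    if content.get("evidence"):
        body += "\n\nEvidence:\n" + "\n".join(f"- {item}" for item in content["evidence"])
    doc_id = create_doc(
        title=f"Operation Review - {request['scenario']} - {now}",
        body=body,
        metadata={"source_plan_id": request["plan_id"],
                  "approver": approver, "approved_at": now},
    )
    updated.update(status="approved", approved_doc_id=doc_id)
    return updated
```

Returning a new dictionary rather than mutating the request keeps the pending record intact until the caller persists the outcome.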

[0040] Embodiment 3: This embodiment presents a cloud-edge collaborative intelligent operation and maintenance system based on a dual-process architecture. Through the physical separation of the management plane and control plane, combined with the intelligent data collection and execution capabilities of the edge agent, it constructs an automated operation and maintenance system covering the entire chain of "intent understanding - task planning - edge execution - result closure - knowledge accumulation." The technical implementation of each functional module is described in detail below with reference to Figure 3 and Figure 4.

[0041] I. Main Functions: 1. Overview: The overview provides a quick glance at the platform's "current global status," helping operations personnel quickly determine: which hosts are abnormal, whether there are alarms, whether there is a backlog of tasks or failures, whether the platform has available critical capabilities (such as LLM configuration), and whether there has been any recent progress.

[0042] It can typically perform the following tasks: Quickly locate "hosts / clients that need attention": Statistically analyze the distribution of health statuses (critical / warning / normal / unknown) and provide a list of abnormal hosts (sorted by severity and recent time).

[0043] Quickly identify "What's running recently": Display a brief summary of recent tasks (task type, target host, status, and brief description).

[0044] Quickly determine if the platform is operating healthily: Statistics include Agent online / offline / banned status, task pending / running status, and daily success / failure status.

[0045] Quickly confirm platform capability readiness: Display whether LLM is configured, making it easy to determine whether capabilities such as alarm analysis, skill generation / optimization, and operation and maintenance AI assistant are available.

[0046] The specific implementation steps are shown in Table 1. Table 1 Specific Implementation Steps of the Overview

[0047] 2. Intelligent Agent Center: The Intelligent Agent Center is used to "configure and orchestrate how intelligent agents execute operational goals". It separates the management of "what can be done (the scope allowed by the Agent / tool)" from "how to do it (workflow / policy / runs)", thereby making the same set of operational logic reusable, versionable, rollbackable, and auditable.

[0048] 2.1 Agent Profile: You can define a "profile of the edge-side intelligent agent at runtime" for the platform, including: ① Agent name, state, version, and role type (e.g., executor); ② Capability tags: used during orchestration to match the agent to the types of steps it is suited for; ③ Tool whitelist: restricts the set of tools that the agent can use; ④ Cost / delay profile: used for strategy selection and budget control during orchestration; ⑤ Tenant identifier (tenant_id), which isolates multiple tenants. The ultimate purpose is to enable the system to select appropriate agents to undertake sub-tasks or execute steps according to the strategy during an operation and maintenance orchestration.
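How a profile might drive agent selection can be sketched as follows. This is an illustrative sketch only: the `AgentProfile` field names other than tenant_id, the numeric `cost` score, and the "cheapest eligible agent" rule are assumptions, not the selection strategy of the described system.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class AgentProfile:
    name: str
    tenant_id: str
    capability_tags: Set[str] = field(default_factory=set)
    tool_whitelist: Set[str] = field(default_factory=set)
    cost: float = 1.0  # hypothetical relative cost score


def select_agent(
    profiles: Iterable[AgentProfile],
    tenant_id: str,
    required_tags: Set[str],
    required_tools: Set[str],
) -> Optional[AgentProfile]:
    """Pick the cheapest agent in the tenant that covers the step's tags and tools."""
    candidates = [
        p
        for p in profiles
        if p.tenant_id == tenant_id
        and required_tags <= p.capability_tags
        and required_tools <= p.tool_whitelist
    ]
    return min(candidates, key=lambda p: p.cost, default=None)
```

Returning `None` when nothing matches lets the orchestrator decide whether to degrade the step or request manual review.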

[0049] 2.2 Workflow Template: Workflows are used to define the "graph structure / process skeleton of the execution plan" and bind policies to it: ① Workflow name and scenario; ② Workflow graph definition (graph); ③ Policy binding relationships (policy_bindings); ④ Version status and whether it is the default (is_default). The typical function is to turn complex operation and maintenance execution processes into "deployable / reusable" templates; when the process evolves, new versions can be released and old versions can be rolled back to.

[0050] 2.3 Operation and Maintenance Policy: A policy defines "how to constrain and evaluate an execution in different scenarios", for example: ① Complexity rules; ② Token budget and resource control (token_budget); ③ Timeout and circuit breaker rules; ④ Quality gate; ⑤ Evaluation / review rules (used to determine how to handle low confidence). The goal of a policy is to make operations and maintenance more stable and controllable: it can make full use of the large model, and can also trigger degradation, retry, or manual review when risks and uncertainties occur.

[0051] 2.4 Policy Simulation: When you are unsure whether a policy is suitable for a particular type of input, the system performs a "policy adaptation prediction" based on the input text, classifying it as simple / normal / deep or similarity-based. Typical applications include: conducting a rehearsal and risk assessment of the policy selection before launching actual operations; and helping answer the question, "Should this input follow a simple route or a more in-depth troubleshooting route?"

[0052] 2.5 Runtime Management (Runs): A run is an instance of "the object that actually executes the goal": ① You can create a run plan, specifying the goal, scenario, and plan / context; ② The system supports executing, terminating, and viewing the runtime event stream (events); ③ After the run is completed, you can view the quality / reliability summary and the steps that require manual review. Typical uses include: like OpenClaw, turning "natural language operation and maintenance goals" into executable processes, and making the execution process visible and traceable.

[0053] The specific implementation steps of the intelligent agent center are shown in Table 2.

[0054] Table 2 Specific Implementation Steps of the Intelligent Agent Center

[0055] 3. Intelligent Operations and Maintenance Center (Ops Hub: Natural Language Query + Invocation + Command Execution): The Intelligent Operations and Maintenance Center is the core entry point for the platform's "Operations and Maintenance Task Implementation." It transforms your natural language-based operations and maintenance goals into a traceable and interruptible "Run Plan," and during execution, it performs the following: natural language understanding, read-only evidence query, tool / skill / sub-agent invocation, deployment to agents for execution when necessary, and finally converges the results into a demonstrable operations and maintenance conclusion. Simultaneously, it triggers "knowledge writing approval" when experience needs to be accumulated.

[0056] 3.1 What can Ops Hub do (user-level capabilities): Natural Language Query Type: Retrieve and interpret information based on health inspections, alarms, and document knowledge (outputting conclusions and evidence); Troubleshooting / Root Cause Analysis: Constructing troubleshooting logic based on evidence and document knowledge, providing inference chains and action suggestions; Comparative analysis: This type of analysis compares evidence from different times, different objects, and different phenomena, and presents the differences and conclusions. Execution type: When the conclusion requires "actual action", it will create an execution task to allow the Agent to complete collection / processing / script execution within a controlled scope; Knowledge accumulation type: The "reflection / summary" of this operation enters the approval queue. After approval, it is written into the document center to form reusable knowledge.

[0057] 3.2 Core objects and their responsibilities: Ops Hub's functionality primarily revolves around three objects: Operation plan (OpsAgentRunPlan): ① Represents an "execution instance" of an operational goal. ② Save the title / scenario / goal / plan / context, and generate the final response and result metadata after execution. ③ Status includes: draft / running / paused / completed / failed / cancelled (the interface allows patches to modify status fields); Runtime event timeline (OpsAgentRunEvent): ① Record key nodes and deliverables throughout the execution process: such as "Read-only snapshot pulled", "Intent identified", "Route call", "Step completed", "Final response generated", "Plan terminated", etc. ② Events are read in reverse chronological order and paginated, supporting UI elements such as progress bars, timelines, and debugging positioning; Knowledge Write-to-Approval Items (OpsKnowledgeWriteRequest): ① This is used to queue "reflection / summary content" for manual review. ② After approval, the content will be written to the document center (DocArticle) for subsequent RAG or operation and maintenance reuse.

[0058] 3.3 Ops Hub's aggregation entry point (situation dashboard): GET /api/ops-hub/summary returns the Intelligent Operations and Maintenance Center summary: ① Health status distribution: number of critical / warning / normal / unknown. ② Recent alarm summary: includes alarm severity, title, and whether LLM analysis has been performed. ③ List of abnormal hosts: Agents whose recent health check results were critical / warning. ④ Recent task briefing: overview of the type, target host, and status of recent tasks. Purpose: to help you quickly identify "where the risks are", and then proceed to a run plan for further natural language analysis or processing.

[0059] 3.4 Lifecycle capabilities of the execution plan (CRUD + execution / termination): Ops Hub offers complete management capabilities for "operation plans".

[0060] 3.4.1 Plan Management: GET /api/ops-hub/run-plans: lists run plans (paginated); POST /api/ops-hub/run-plans: creates a run plan (draft); GET /api/ops-hub/run-plans/{plan_id}: retrieves run plan details; PATCH /api/ops-hub/run-plans/{plan_id}: updates fields such as title / status / goal / plan / context; DELETE /api/ops-hub/run-plans/{plan_id}: deletes the run plan record (in a reclaimed state).

[0061] 3.4.2 Implementation Plan: (1) Trigger execution (background worker runs): Once completed, the plan will be marked as completed / failed / cancelled, and the final response can be seen in the plan details; (2) Push execution progress and process events in real time via SSE: The stream contains two types of information: ① Run events (such as events related to tools / skills / Agents / large models / memory / orchestration / knowledge bases), used for front-end timeline visualization. ②Incremental output of LLM (used for "generating and displaying" reasoning or summarizing text fragments on the UI); (3) Request to terminate the operation plan: After termination, the plan will enter the cancelled state, and the key event and reason information (such as stage timeout / user request) will appear in the event stream as "plan terminated".

[0062] 3.5 Execution process observability: Event timeline and "lane" display: GET /api/ops-hub/run-plans/{plan_id}/events returns the event list for this plan (paginated).

[0063] Event types are categorized into different "lanes" (for easier UI layout), including: ① tool: read-only tool / search / tool call. ② skill: related to skill invocation. ③ agent: sub-tasks / task dispatching executed in conjunction with the Agent side. ④ llm: large model stages such as routing decision-making, planner output, and summarizer / final response generation. ⑤ memory: short-term memory reading / long-term memory writing (process accumulation). ⑥ orchestration: orchestration phase events (such as pulling snapshots, step start / completion, deep pipeline completion, etc.). ⑦ knowledge: notification that knowledge has been written into the approval queue. Therefore, Ops Hub does not just "return a conclusion", but fully exposes to the front end "why it was done, which step has been reached, and which steps need review / coordinated execution".
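A lane assignment of this kind can be sketched as a lookup table. The event names are taken from the event types mentioned elsewhere in this description, but which event belongs to which lane, the lowercase lane keys, and the default lane for unknown events are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical event-to-lane mapping; the actual assignment is not given in the text.
LANE_BY_EVENT = {
    "tool_call": "tool",
    "mcp_call": "tool",
    "skill_call": "skill",
    "task_dispatched": "agent",
    "subagent_route": "agent",
    "intent_understood": "llm",
    "final_answer_generated": "llm",
    "short_memory_read": "memory",
    "long_memory_write": "memory",
    "ops_readonly_snapshot": "orchestration",
    "step_completed": "orchestration",
    "knowledge_write_pending": "knowledge",
}


def lane_for(event_type: str) -> str:
    """Return the UI lane for an event type (unknown types go to orchestration)."""
    return LANE_BY_EVENT.get(event_type, "orchestration")


def group_by_lane(events: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group timeline events by lane, preserving their original order."""
    lanes: Dict[str, List[dict]] = defaultdict(list)
    for ev in events:
        lanes[lane_for(ev.get("type", ""))].append(ev)
    return dict(lanes)
```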

[0064] 3.6 How are natural language objectives processed? When Ops Hub executes, it generally follows this chain: "first understand the goal, then break down the steps based on evidence, then route the calls, and finally converge." You will see the system progress step by step around your natural language intent: 3.6.1 Scenario-driven step structure: Optional scenarios when creating a plan: inspection: health inspection insight / overview goals; alert_analysis: alert analysis goals; root_cause: root cause identification goals; custom: custom arrangement of goals and steps. Different scenarios determine the style of the default "step template": for example, inspection tends to focus on snapshot aggregation and suggestion output, while root_cause tends to focus on hypothesis building and action item output.

[0065] 3.6.2 Intent Understanding (Transforming Natural Language into Executable Semantics): The system will first structure your target as follows: ① Main intent categories (insight / obstacle removal / skills assessment / comparison / execution, etc.) ② Summarize your true needs in one sentence. ③ Constraints and success criteria (what evidence is needed to be considered "achieved") ④ Whether there is a tendency to delegate processing to deeper sub-agents; The result of this step will affect how the planner and router break down the steps, select the call type, and whether to proceed to deeper reasoning.

[0066] 3.6.3 Read-only evidence snapshot (query capability): Before taking any actual action, Ops Hub first performs a read-only aggregation of "verifiable facts" for subsequent reasoning and interpretation. This type of evidence mainly comes from: ① Alarm snapshot (recent alarms and summary information). ② Health inspection snapshot (health report and status) ③ Document Center Search (focusing on document fragments as the basis for answering questions); Your final conclusion will emphasize the "sources of evidence" cited, reducing unfounded inferences.

[0067] 3.6.4 Tool / Skill / External Ability Routing (Invocation-type Ability): When a step requires "doing something," Ops Hub breaks it down into a set of route calls, with route types covering: ① Read-only tools: used to obtain evidence or information. ② Platform skills: used to execute complex data collection / troubleshooting / processing logic. ③ MCP tools: used to integrate the capabilities of external tools into the same process. ④ Memory read / write: short-term memory carries the facts of the current process, and long-term memory accumulates stage summaries. ⑤ Sub-agent: performs isolated reasoning and generates higher-quality suggestions for a specific step. The front-end event stream will clearly show "what type was called / what decision was made at a certain step".

[0068] 3.6.5 "Low Confidence" and Steps Requiring Review (Security and Quality Gate): When the routing phase determines that the confidence level is insufficient, the risk is high, or the information is inadequate, Ops Hub may trigger a "step interception / review required" process. ① The step is marked with the needs_review flag. ② A message "Step was automatically intercepted" appears in the event stream. ③ Subsequent actions are usually taken in a more conservative manner: waiting for review is emphasized, and direct execution of high-risk actions is reduced. This ensures that Ops Hub does not turn all uncertainty into immediate action.

[0069] 3.6.6 Sub-agent Deep Inference and Quality Verification (More Stable for More Complex Problems): For steps requiring deeper inference / verification, Ops Hub enables a subagent: ① The output of the sub-agent includes "conclusion / recommendation + confidence level"; ② If the confidence level is lower than the threshold, a quality check (repair / second pass) is performed. ③ If the reviewed recommendation is more credible, it replaces the original with the more rigorous version; if it is still not satisfactory, a more conservative approach is adopted (e.g., blocking or reducing the intensity of execution).

[0070] 3.6.7 Interim Summary and Final Response (Converged Output): Ops Hub generates interim conclusions step by step and outputs the final response at the end: ① The final response is a user-facing Markdown text (conclusion first, then supporting evidence). ② It will include key points highlighted and risk level (risk_level, which can be null). ③If the final response does not follow the LLM approach, the convergence results will still be pieced together from the phase summary to ensure that users at least obtain readable conclusions; You will see the complete chain in the event flow from "step completed" to "interim conclusion" and then to "final response generated".

[0071] 3.7. Execution Linkage: When is the action "actually issued to the Agent"? When the goal of a certain step goes beyond explanation and requires actual handling / execution, Ops Hub will coordinate with the task center to create an execution task, allowing registered agents to perform the corresponding action (and potentially triggering skill synchronization capabilities): ① It will create a task of type agent (with relevant prompts for operation and maintenance plan steps). ② If this step depends on skill execution, a skill_sync task will be created to ensure that the client's skill cache is consistent. ③ The front-end event stream will show task_dispatched, which contains the task ID associated with this step, making it easy for you to check the execution progress and output stream in the task center; This is the practical application of Ops Hub's "OpenClaw-like execution experience": from natural language objectives to controllable task assignment and result collection.

[0072] 3.8. Knowledge Accumulation Loop: Reflection / Summary → Approval Queue → Document Center Implementation: When the "reflection / summary" generated by the summarizer has long-term reuse value, Ops Hub first places the content into the knowledge write approval queue. (1) The event stream will show knowledge_write_pending (entering the approval queue); (2) You can manage approvals in Ops Hub: ① View the approval list (status can be filtered). ② Execute approve / reject; (3) After approval, the approved content is assembled into a DocArticle in the document center for subsequent use in "Assistant Q&A / Operation and Maintenance AI Query". In this way, Ops Hub can not only complete a one-off task, but also accumulate experience into platform knowledge assets.

[0073] The specific implementation steps of the intelligent operation and maintenance center are as follows: A. Situation Overview (GET /api/ops-hub/summary): The situation overview is responsible for aggregating global statistical data across the platform, providing a one-screen overview for the front end.

[0074] A1. Aggregated summary data: Calling `crud.dashboard_summary(db)` completes multi-table join statistics in one go: ①Agent online / offline / banned number: calculated based on heartbeat time and ban flag; ② Task status count: Statistics are collected separately for pending, running, success, and failed; ③ Number of skills and documents: Directly query the total number of skills and documents in the skill table and document table; ④ Health Distribution Count: Traverse Agents and obtain the latest health reports, and count the number of critical, warning, normal, and unknown agents; ⑤ LLM configuration ready state: Read the KEY_LLM_MANAGEMENT configuration, check if enabled and if base_url and api_key are not empty.
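Two of the A1 aggregations are simple enough to sketch directly. The function names and the dictionary shape of the configuration are assumptions for illustration; the real `crud.dashboard_summary(db)` works on database tables that are not shown here.

```python
from collections import Counter
from typing import Iterable, Optional

HEALTH_LEVELS = ("critical", "warning", "normal", "unknown")


def health_distribution(statuses: Iterable[Optional[str]]) -> dict:
    """Count the latest health status per agent; anything unrecognized is 'unknown'."""
    counts = Counter(s if s in HEALTH_LEVELS[:3] else "unknown" for s in statuses)
    return {level: counts.get(level, 0) for level in HEALTH_LEVELS}


def llm_ready(cfg: Optional[dict]) -> bool:
    """LLM is ready only if enabled and both base_url and api_key are non-empty."""
    cfg = cfg or {}
    has_fields = all(str(cfg.get(k) or "").strip() for k in ("base_url", "api_key"))
    return bool(cfg.get("enabled")) and has_fields
```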

[0075] A2. Abnormal alarm sample: In the routing file `routers/hubs.py`, call `list_health_alerts(db, limit=12, offset=0)` to retrieve the 12 most recent alerts. For each alert: ① Use crud.get_agent(db, a.client_id) to fill in the hostname; ② Generate the has_analysis flag using bool((a.llm_analysis or "").strip()) to indicate whether an LLM analysis has been performed.

[0076] A3. Abnormal Agent Sample: Iterate through all Agents (crud.list_agents(db)), and for each Agent, retrieve the latest health report (crud.latest_health_report(db, agent.client_id)). Filter out Agents whose health_status is in ("critical","warning"), sort them according to the following rules, and then retrieve the top 16: ①Severity priority: critical precedes warning; ② Within the same severity level, sort the results in descending order of reporting time. Return the results to the abnormal_agents list.
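The A3 ordering rule (severity first, then newest report first, then top 16) can be sketched with two stable sorts. The dictionary keys `health_status` and `reported_at` are assumed field names for illustration.

```python
from typing import Iterable, List

SEVERITY_RANK = {"critical": 0, "warning": 1}


def abnormal_agents(reports: Iterable[dict], limit: int = 16) -> List[dict]:
    """Keep critical/warning reports, critical first, newest first within a level."""
    rows = [r for r in reports if r.get("health_status") in SEVERITY_RANK]
    # Python's sort is stable: sort by time first, then by severity.
    rows.sort(key=lambda r: r["reported_at"], reverse=True)
    rows.sort(key=lambda r: SEVERITY_RANK[r["health_status"]])
    return rows[:limit]
```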

[0077] A4. Return to the available aggregation structure on the front end: Assemble a schemas.OpsHubSummaryOut object, which contains: ① Health distribution count (health_* fields); ② A brief list of recent alerts (alerts_recent); ③ Abnormal host list (abnormal_agents); ④ Recent task summary (recent_tasks, retrieves the most recent tasks from the task table); ⑤ Module introduction (module_intro), used for front-end display and explanation.

[0078] B. Run Plans (CRUD + Execution Control): The run plan of the intelligent operation and maintenance center is carried by the OpsAgentRunPlan model, the API is aggregated in routes/hubs.py, and the execution engine is in ops/crud_ops_run_plans.py.

[0079] B1. Run Plan List and Query: ① List run plans (paginated): GET /ops-hub/run-plans calls list_run_plans(db,limit,offset), returns paginated results in reverse order of update time, and encapsulates them as OpsAgentRunPlanListOut; ② Get details of a single run plan: GET /ops-hub/run-plans/{plan_id} calls get_run_plan(db,plan_id). If the plan does not exist, 404 is returned.

[0080] B2. Create a run plan: ① Input reception: POST /ops-hub/run-plans receives OpsAgentRunPlanCreate, which includes title, scenario, goal, plan, and context; ② Scenario normalization: the scenario must be one of inspection, alert_analysis, root_cause, and custom. Any other value is downgraded to custom. ③ Default step generation: if the passed-in plan.steps is empty, call default_plan_template(scenario,goal).steps to generate default step DAG placeholders and write them into the plan dictionary; ④ Store in the database: serialize goal, plan, and context into JSON text using _dumps() (json.dumps) and store them in goal_json, plan_json, and context_json respectively. The initial state is draft.
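The creation path in B2 can be sketched as a pure function. The helper names and the content of the default step are hypothetical placeholders (the real `default_plan_template` output is not given in the text); the scenario whitelist, the JSON column names, and the draft initial state follow the description.

```python
import json

ALLOWED_SCENARIOS = ("inspection", "alert_analysis", "root_cause", "custom")


def normalize_scenario(scenario):
    """Downgrade any unrecognized scenario to 'custom'."""
    value = (scenario or "").strip()
    return value if value in ALLOWED_SCENARIOS else "custom"


def default_steps(scenario):
    # Hypothetical placeholder step; stands in for default_plan_template(...).steps.
    return [{"id": "s1", "title": f"{scenario}: collect evidence",
             "role": "executor", "tool_hints": []}]


def build_plan_row(title, scenario, goal, plan=None, context=None):
    """Assemble the column values for a new run plan in the draft state."""
    scenario = normalize_scenario(scenario)
    plan = dict(plan or {})
    if not plan.get("steps"):
        plan["steps"] = default_steps(scenario)

    def dumps(obj):
        return json.dumps(obj, ensure_ascii=False)

    return {
        "title": title,
        "scenario": scenario,
        "goal_json": dumps(goal),
        "plan_json": dumps(plan),
        "context_json": dumps(context or {}),
        "status": "draft",
    }
```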

[0081] B3. Update the run plan (PATCH): `PATCH /ops-hub/run-plans/{plan_id}` calls `patch_run_plan`, supporting field-level overriding of `title`, `status`, `goal`, `plan`, and `context`. The `status` field only allows values from the following set: `draft`, `running`, `paused`, `completed`, `failed`, and `cancelled`.
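Field-level patching with a status whitelist might look like the sketch below. The function name `apply_patch`, the choice to ignore unknown keys, and raising `ValueError` on an invalid status are assumptions; the text does not say how `patch_run_plan` reports invalid input.

```python
ALLOWED_STATUS = {"draft", "running", "paused", "completed", "failed", "cancelled"}
PATCHABLE_FIELDS = ("title", "status", "goal", "plan", "context")


def apply_patch(row: dict, patch: dict) -> dict:
    """Return a copy of row with only the patchable fields overridden."""
    updated = dict(row)
    for key in PATCHABLE_FIELDS:
        if key not in patch:
            continue
        if key == "status" and patch[key] not in ALLOWED_STATUS:
            raise ValueError(f"invalid status: {patch[key]!r}")
        updated[key] = patch[key]
    return updated
```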

[0082] B4. Delete the run plan (DELETE): `DELETE /ops-hub/run-plans/{plan_id}` calls `delete_run_plan` to physically delete the record, and returns `{ok:true}`.

[0083] B5. Execute the run plan (synchronous mode: execute): The POST request to `ops-hub/run-plans/{plan_id}/execute` calls `execute_run_plan_stub(db,plan_id)`. After synchronous execution, it returns an updated `OpsAgentRunPlanOut`. Internally, the execution engine updates `plan_json`, `context_json`, and `status`, and writes the execution results to the audit (category=agent_run) via `record_agent_run_audit_from_row`.

[0084] B6. Execute the run plan (stream mode: execute-stream): Real-time progress is pushed using SSE (Server-Sent Events): (1) Create a message queue: create queue.Queue() in post_ops_hub_run_plan_execute_stream; (2) Background worker execution: start a thread, create a new database session SessionLocal() inside it, and call execute_run_plan_stub(db,plan_id,on_event,on_llm_delta); the callback functions push events into the queue; (3) Standardization of pushed events: ① on_event(ev)→q.put({"type":"event","payload":ev}), ② on_llm_delta(ev)→q.put({"type":"llm_delta",**ev}), ③ after execution, add {"type":"done"} or an error marker; (4) SSE output: the main thread generator gen() loops to retrieve items from the queue and outputs each one via _ops_sse_line() as a `data: <json>` line, continuing until an end marker is received. Finally, a StreamingResponse(media_type="text/event-stream") is returned.
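The queue-plus-worker pattern in B6 can be sketched with the standard library alone. Here `run_plan` is a hypothetical stand-in for `execute_run_plan_stub`, and `sse_line` stands in for `_ops_sse_line`; the database session and the web framework response object are omitted, so the generator would be wrapped in a streaming response in a real endpoint.

```python
import json
import queue
import threading
from typing import Callable, Iterator


def sse_line(obj: dict) -> str:
    """Format one item as a Server-Sent Events data frame."""
    return f"data: {json.dumps(obj, ensure_ascii=False)}\n\n"


def stream_plan(run_plan: Callable[..., None]) -> Iterator[str]:
    """Run the plan in a background thread and yield SSE frames as they arrive."""
    q: "queue.Queue[dict]" = queue.Queue()

    def worker() -> None:
        try:
            run_plan(
                on_event=lambda ev: q.put({"type": "event", "payload": ev}),
                on_llm_delta=lambda ev: q.put({"type": "llm_delta", **ev}),
            )
            q.put({"type": "done"})
        except Exception as exc:  # report the failure to the client, then stop
            q.put({"type": "error", "message": str(exc)})

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # blocks until the worker pushes something
        yield sse_line(item)
        if item["type"] in ("done", "error"):
            break
```

Because the engine never touches the response directly, the same callbacks can serve the synchronous mode by simply passing no-op functions.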

[0085] B7. Terminate the run plan: The POST request to `ops-hub/run-plans/{plan_id}/terminate` calls `terminate_run_plan(db,plan_id,reason)`, updating `context_json` to include `terminate_requested=true`, `terminate_reason`, and `terminate_requested_at`. At critical entry points, the execution engine calls `ensure_not_terminated(stage)`. If a termination flag or timeout is detected, it writes a `plan_terminated` event and throws `_PlanTerminated`, ultimately setting `OpsAgentRunPlan.status` to `cancelled`.

[0086] B8. Run event query (events): The GET request to /ops-hub/run-plans/{plan_id}/events calls the `list_run_events(db,plan_id,limit,offset)` function, which returns a paginated list of events in reverse chronological order of creation time. The `detail_json` field of each event is deserialized into a dictionary using `event_detail_payload(row)` for display on the front end.

[0087] C. Knowledge Write Approval: The "reflection / summary" knowledge generated during the execution process by the intelligent operation and maintenance center is stored in the document center through the approval queue.

[0088] (1) List query: GET /ops-hub/knowledge-write-requests supports status filtering (pending, approved, rejected), with no filtering by default. It calls list_knowledge_write_requests(db,status,limit,offset) to return paginated results; (2) Approval processing: POST /ops-hub/knowledge-write-requests/{req_id}/review calls review_knowledge_write_request: ① Processing is allowed only if the request status is pending. ② approve: parse the content_json, construct the document body (including source plan, steps, engine information, approval metadata, etc.), call crud.create_doc_article(...) to write it to DocArticle, write the returned approved_doc_id back to the request record, and update the status to approved. ③ reject: the status is updated to rejected, and the review_note and reviewer_* fields are recorded. ④ All operations are completed within a database transaction to ensure consistency.
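The review state transitions can be sketched on plain dictionaries. The `create_doc` callable stands in for the document-center write, and the `reviewer_name` / `reviewed_at` keys are hypothetical names for the reviewer_* fields; a real implementation would perform these updates inside one database transaction, as item ④ requires.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Optional


def review_request(
    req: dict,
    action: str,
    reviewer: str,
    note: str = "",
    create_doc: Optional[Callable[[dict], int]] = None,
) -> dict:
    """Approve or reject a pending knowledge write request; return the updated copy."""
    if req.get("status") != "pending":
        raise ValueError("only pending requests can be reviewed")
    out = dict(req)
    out["reviewer_name"] = reviewer
    out["reviewed_at"] = datetime.now(timezone.utc).isoformat()
    if action == "approve":
        if create_doc is None:
            raise ValueError("approve requires a document writer")
        content = json.loads(req["content_json"])
        out["approved_doc_id"] = create_doc(content)  # write to the document center
        out["status"] = "approved"
    elif action == "reject":
        out["status"] = "rejected"
        out["review_note"] = note
    else:
        raise ValueError(f"unknown action: {action!r}")
    return out
```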

[0089] D. Execution Engine (Ops Run Plan Execution Engine: LangGraph Phase 5): The execution engine is the core of the intelligent operations and maintenance center, responsible for converting natural language objectives into executable steps and driving their execution. The main entry point of the engine is ops/crud_ops_run_plans.py::execute_run_plan_stub(), which internally builds a LangGraph state machine with nodes including readonly_fetch, intent_understanding, planner, batch_router, tool_router, subagent, summarizer, and finalize.
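To make the directed-graph idea concrete, the sketch below is a plain-Python stand-in for the state machine, not the LangGraph API. The node names come from the description; the edges (in particular the loop from summarizer back to tool_router for the next step) and the `step_idx` bookkeeping are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple, Union

State = dict
Edge = Union[str, None, Callable[[State], str]]


def passthrough(state: State) -> State:
    return state


def summarizer(state: State) -> State:
    # Advance to the next planned step after summarizing the current one.
    return {**state, "step_idx": state["step_idx"] + 1}


NODE_NAMES = ("readonly_fetch", "intent_understanding", "planner", "batch_router",
              "tool_router", "subagent", "summarizer", "finalize")
NODES: Dict[str, Callable[[State], State]] = {name: passthrough for name in NODE_NAMES}
NODES["summarizer"] = summarizer

EDGES: Dict[str, Edge] = {
    "readonly_fetch": "intent_understanding",
    "intent_understanding": "planner",
    "planner": "batch_router",
    "batch_router": "tool_router",
    # Conditional edge: deep-inference steps detour through the sub-agent.
    "tool_router": lambda s: "subagent"
    if s["steps"][s["step_idx"]].get("need_subagent") else "summarizer",
    "subagent": "summarizer",
    # Conditional edge: loop until every step has been summarized.
    "summarizer": lambda s: "tool_router"
    if s["step_idx"] < len(s["steps"]) else "finalize",
    "finalize": None,
}


def run_graph(state: State, start: str = "readonly_fetch",
              max_hops: int = 100) -> Tuple[State, List[str]]:
    """Walk the graph from start until a node has no outgoing edge."""
    current, trace = start, []
    for _ in range(max_hops):
        trace.append(current)
        state = NODES[current](state)
        nxt = EDGES[current]
        if callable(nxt):
            nxt = nxt(state)
        if nxt is None:
            return state, trace
        current = nxt
    raise RuntimeError("graph did not terminate")
```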

[0090] D0. Pre-operation preparations (common prerequisites): ① Read the plan JSON: Deserialize goal_json, plan_json, and context_json into dictionaries, ensuring that context is a dict; ② Read short-term memory: Call md_read_short_memory(row.scenario,max_chars=6000) to get the most recent memory and write it to context["memory_md_short_recent"]; ③ Calculate the steps array: If plan["steps"] is empty or not a list, call _scenario_default_steps(row.scenario) to generate the default steps and write them back; ④ Read LLM configuration: _llm_cfg_for_planning(db) reads enabled, base_url, api_key, model, and timeout_seconds from KEY_LLM_MANAGEMENT; if not configured or disabled, subsequent planning nodes will not use LLM; ⑤ Timeout and Circuit Breaker Mechanism Preparation: Read environment variables to configure timeout, hard timeout, and circuit breaker thresholds for each stage (e.g., force simple mode when the recent number of timeouts reaches the threshold).

[0091] D1. Read-only snapshot fetch (readonly_fetch node): ① Evidence collection: Call collect_all_readonly(db,goal) to aggregate three sources: recent alarms, health inspection summaries, and document retrieval results; ② Write to short-term memory: Put the aggregation result into short_memory["ops_readonly"], and at the same time synchronize the server-side skill list information (server_skills, server_skills_snapshot, server_skills_ranked, server_skills_total). ③Token control: Calculate the size of the projected prompt. If it exceeds the budget, compress the short_memory using _compact_for_llm and increment the compression count. ④ Event Output: Add the ops_readonly_snapshot event, which includes debugging information such as the total number of alarms, the number of health reports, and the document search mode.

[0092] D2. Intent Understanding (intent_understanding node): ① Heuristic intent draft: _build_heuristic_user_intent infers primary_intent (such as insight, troubleshoot, inventory, comparison, action, other) based on scenario, goal, context, and read-only snapshot, and generates constraints and success_criteria; ② Optional LLM intent analysis: If LLM is enabled and environment variables allow, call _llm_json to request the model to output JSON (primary_intent, user_need, constraints, success_criteria, delegate_subagent_hint, needs_clarification, clarifying_questions), and then use _merge_user_intent_llm to safely override the heuristic fields; ③ Write the product: Store the final intent merge result into short_memory["user_intent"] and append the intent_understood event.

[0093] D3. Planner Step Output (planner node): (1) Step standardization: ensure that each step has the basic structure {id, title, role, tool_hints}, and repair elements that are not dicts; (2) Lightweight path selection: ① _detect_simple_goal determines whether to use a low-step strategy (simple mode). ② _detect_skill_inventory_goal determines whether the goal is the direct mode for "listing skills / counting skills"; (3) Mode switching: in simple mode, the number of steps is truncated, the confidence level is reduced, and the tool hints are simplified; (4) Direct mode: the skill inventory mode directly returns a fixed two-step plan (skill list summary + suggested usage), without using the LLM; (5) LLM planner: on the normal path, call _llm_json to generate step JSON, which contains the steps array and notes. Each step contains evidence_ids and confidence. When deep inference is required, the role is set to subagent.

[0094] D4. Batch Routing (batch_router node, optional): When the number of steps is ≥2 and batch routing is not disabled, the LLM is invoked to generate routing decisions for all steps at once, outputting a routes array that corresponds one-to-one with the steps, including step_id, calls, need_subagent, etc. If no valid routes are produced, a fallback is initiated by setting routes_batch=None, allowing the subsequent tool_router node to route the steps one by one.

[0095] D5. Tool / Skill / Sub-Agent Routing (tool_router node): ① Route source priority: prioritize the calls in the corresponding step of routes_batch; if none are found, check route_cache and reuse any cache hit; call the step-by-step Router LLM only as a last resort. ② Step-by-step Router LLM: _llm_json requires the output of a JSON calls array (type, name, reason, evidence_ids, confidence) and need_subagent. The prompt includes a read-only snapshot summary, user intent, skill list, MCP tools, etc. ③ Normalization and deduplication: _sanitize_router_calls retains only allowed call types (tool, skill, mcp, memory_short, memory_long, subagent); if a memory_short / memory_long call has no name, a default name is filled in automatically; deduplication is performed by (type, name), and the maximum number of calls is limited; ④ Event logging: append the corresponding event (tool_call, skill_call, mcp_call, short_memory_read, long_memory_write, subagent_route) for each route call; ⑤ Route write to memory: write the calls of the current step into the Markdown short-term memory for use by subsequent nodes; ⑥ Low-confidence interception: if OPS_INTERCEPT_LOW_CONFIDENCE_ENABLED is enabled and the route confidence is lower than the threshold, mark the step as need_review, append the step_needs_review event, and return blocked_step=True; ⑦ Subagent selection: if need_subagent is true but there is no subagent in calls, automatically add a subagent.auto entry and record the event; ⑧ Execution linkage: _dispatch_execution_tasks is called, and it creates and dispatches actual tasks only if the route calls contain semantics such as tasks.dispatch or skills.execute; ⑨ Create execution tasks: create a task with type=agent for each target client, whose payload includes a prompt (containing plan_id, step_id, source, etc.), and at the same time create a skill_sync task (used to prompt the client to prepare the skill).
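Item ③ (normalization and deduplication) can be sketched as follows. The default memory names, the cap of 8 calls, and the decision to drop nameless non-memory calls are assumptions for illustration; the text only states that allowed types are kept, memory calls get a default name, duplicates are removed by (type, name), and the count is limited.

```python
from typing import Iterable, List

ALLOWED_TYPES = {"tool", "skill", "mcp", "memory_short", "memory_long", "subagent"}
# Hypothetical default names for memory calls that arrive without one.
DEFAULT_NAMES = {"memory_short": "memory.short", "memory_long": "memory.long"}


def sanitize_router_calls(calls: Iterable[object], max_calls: int = 8) -> List[dict]:
    """Filter, name-fill, deduplicate, and cap the router's proposed calls."""
    seen = set()
    out: List[dict] = []
    for call in calls:
        if not isinstance(call, dict):
            continue
        ctype = str(call.get("type") or "").strip()
        if ctype not in ALLOWED_TYPES:
            continue
        name = str(call.get("name") or "").strip() or DEFAULT_NAMES.get(ctype, "")
        if not name:
            continue  # assumption: a nameless non-memory call cannot be routed
        key = (ctype, name)
        if key in seen:
            continue
        seen.add(key)
        out.append({**call, "type": ctype, "name": name})
        if len(out) >= max_calls:
            break
    return out
```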

[0096] D6. Subagent Node: ① Sub-inference: Record subagent_started in the event and subagent_completed after completion; ②LLM Inference: When calling _llm_json, the system prompts that "inferences should be based only on short-term memory and given facts, and monitoring data should not be fabricated", and outputs JSON{result,confidence,suggestions}; ③ Quality review: If OPS_SUBAGENT_QUALITY_RETRY is enabled and the current confidence level is lower than OPS_SUBAGENT_MIN_CONFIDENCE, call the quality reviewer LLM again. If the confidence level is higher after the repair, replace the result. ④ Can be skipped: If OPS_LANGGRAPH_SKIP_SUBAGENT_LLM is enabled, the LLM call is skipped directly, and skipped_subagent_llm=true is recorded.
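The quality-review rule in item ③ reduces to a small decision function. The callables `infer` and `review` are hypothetical stand-ins for the two LLM calls, and the default threshold of 0.6 is an assumed value in place of OPS_SUBAGENT_MIN_CONFIDENCE.

```python
from typing import Callable


def subagent_with_quality_retry(
    infer: Callable[[], dict],
    review: Callable[[dict], dict],
    min_confidence: float = 0.6,
    retry_enabled: bool = True,
) -> dict:
    """Run the sub-agent once; re-review low-confidence output and keep the better one."""
    first = infer()
    if not retry_enabled or first["confidence"] >= min_confidence:
        return first
    second = review(first)
    # Replace the result only if the reviewed version is more confident.
    return second if second["confidence"] > first["confidence"] else first
```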

[0097] D7. Phase Summary and Knowledge Write Queue (summarizer node): ① Intercept summary: If blocked_step=True, the status is set to need_review, a summary text awaiting review is constructed and written to short-term and long-term memory, and the summary_generated and step_completed events are appended; ② Skills direct summary: In skill review mode, use server_skills_snapshot (preferred) or server_skills_ranked (fallback) to call _build_skill_inventory_summary, which generates a summary and suggested actions, and write them into markdown memory; ③ Normal summary: If OPS_LANGGRAPH_SINGLE_SUMMARY=True and it is not the last step, skip the LLM and write a placeholder summary; the final answer is generated in the last step; ④ LLM summary: Call _llm_json to generate JSON {summary, risk_level, actions}, and write the summary into long-term and short-term memory; ⑤ Knowledge accumulation: Call _queue_reflection_approval to create an OpsKnowledgeWriteRequest (status pending) and add a knowledge_write_pending event; a human then approves whether to accumulate it into a document.

[0098] D8. Final response convergence (finalize node): ① Final response generation: If OPS_FINAL_ANSWER_LLM_ENABLED is enabled and the step is not skipped, call the LLM to generate the final response JSON {answer, highlights, risk_level, disclaimer}, where answer is in Markdown format; ② Default fallback: If the LLM does not produce an answer, the summaries of each step in long_memory are concatenated as the final answer; ③ Final events: Add the final_answer_generated event (containing the answer, highlights, whether the LLM was used, etc.) and the plan_completed event; ④ Context writeback: Write the final short_memory snapshot to context["short_memory"] for external use.

[0099] D9. Termination and Timeout Protection (ensure_not_terminated): Call `ensure_not_terminated(stage)` at the entry point of the engine's critical nodes to check: ① Stage timeout: A timeout environment variable is configured for each stage, such as intent, planning, routing, sub-agent, summary, and final; ② Hard timeout: If OPS_RUN_HARD_TIMEOUT_SECONDS>0 and the run time exceeds the threshold, the run is forcibly terminated; ③ Unified processing: Upon detecting a timeout or an active termination request, write a plan_terminated event, set the plan status to canceled, and throw _PlanTerminated to terminate execution.
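The three checks can be sketched as a single guard function. This is an illustrative sketch only: the RunState fields, the stage_timeouts mapping (standing in for the per-stage environment variables), and the injected clock value are assumptions, and PlanTerminated stands in for the _PlanTerminated named above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


class PlanTerminated(Exception):
    """Raised to unwind the engine when a run is cancelled or times out."""


@dataclass
class RunState:
    started_at: float          # when the whole run started (monotonic seconds)
    stage_started_at: float    # when the current stage started
    cancel_requested: bool = False
    status: str = "running"
    events: list = field(default_factory=list)


def ensure_not_terminated(state: RunState, stage: str, stage_timeouts: dict,
                          hard_timeout: float = 0.0, now: Optional[float] = None) -> None:
    """Check cancel request, hard timeout, then stage timeout; unify the handling."""
    now = time.monotonic() if now is None else now
    reason = None
    if state.cancel_requested:
        reason = "cancel_requested"
    elif hard_timeout > 0 and now - state.started_at > hard_timeout:
        reason = "hard_timeout"
    elif stage_timeouts.get(stage, 0) > 0 and now - state.stage_started_at > stage_timeouts[stage]:
        reason = "stage_timeout"
    if reason is None:
        return
    # Unified processing: record the event, mark the plan canceled, stop execution.
    state.events.append({"type": "plan_terminated", "stage": stage, "reason": reason})
    state.status = "canceled"
    raise PlanTerminated(reason)
```

Raising an exception rather than returning a flag lets every node share one termination path, which matches the "unified processing" described in ③.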

[0100] E. Read-only snapshot module (Ops: evidence collection + compression): Read-only snapshots provide the factual foundation for the execution engine, all implemented in ops/ops_readonly_tools.py.

[0101] ① Alarm snapshot: snapshot_alerts(db,limit=40) calls list_health_alerts, parses detail_json, constructs detail_excerpt, and returns the list of most recent alarms; ② Health report snapshot: snapshot_health_reports(db,limit=50) reads from list_health_reports. For each report, the hostname is taken, llm_summary is truncated to 1200 characters, and the metrics are summarized using _health_metrics_excerpt (maximum 800 characters); ③ Document knowledge retrieval: search_doc_articles(db,query,limit=8). If query is empty, the most recent titles are returned; otherwise, _tokenize_query is used for tokenization (split on spaces, punctuation, and Chinese characters, convert to lowercase, keep tokens with a length ≥2, at most 12 tokens). Each document is checked for a token in its title or body; a matching document is added to the results, with its body summary truncated to 1500 characters; ④ Aggregate collection: collect_all_readonly(db,goal) calls the above three functions at once and returns {"alerts":...,"health":...,"docs":...}, which is then written to short-term memory by the readonly_fetch node; ⑤ Compressed summary: compact_readonly_for_llm(ro) is used for token budget control: it retains only alarm title samples, health report sample triples (such as host, status, time), the document retrieval mode, and a list of hit titles (each containing title and ID, at most a few), significantly reducing the context size passed to the LLM.
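The document retrieval in ③ can be sketched as follows. The regular expression is one possible reading of "split on spaces, punctuation, and Chinese characters" and is an assumption, as are the in-memory document list (ordered most recent first), the removal of repeated tokens, and the function names without the leading underscore.

```python
import re

# Assumption: a token is a run of ASCII letters/digits/underscores or a run of CJK characters.
_TOKEN_RE = re.compile(r"[a-z0-9_]+|[\u4e00-\u9fff]+")


def tokenize_query(query: str, max_tokens: int = 12) -> list[str]:
    """Lowercase the query, keep tokens of length >= 2, return at most max_tokens."""
    tokens: list[str] = []
    for tok in _TOKEN_RE.findall(query.lower()):
        if len(tok) >= 2 and tok not in tokens:
            tokens.append(tok)
        if len(tokens) >= max_tokens:
            break
    return tokens


def search_docs(docs: list[dict], query: str, limit: int = 8, excerpt_chars: int = 1500) -> list[dict]:
    """Return recent titles for an empty query, otherwise documents containing any token."""
    if not query.strip():
        return [{"id": d["id"], "title": d["title"]} for d in docs[:limit]]
    tokens = tokenize_query(query)
    hits: list[dict] = []
    for d in docs:
        haystack = (d["title"] + "\n" + d["body"]).lower()
        if any(t in haystack for t in tokens):
            hits.append({"id": d["id"], "title": d["title"], "excerpt": d["body"][:excerpt_chars]})
        if len(hits) >= limit:
            break
    return hits
```

Substring matching keeps the lookup dependency-free; a production system might prefer a full-text index, which the text does not specify.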

[0102] 4. Agent Management: Agent management is used to maintain the "lifecycle of clients (Agents) managed by the platform", including state maintenance, blocking control, and viewing the client's tasks and running information.

[0103] The tasks it can perform include: ① Manual registration / update of an Agent (e.g., for initial onboarding or quick supplementation); ② Query the Agent list and details: View machine information, last heartbeat, capability information, whether it is currently blocked, the application system it belongs to, etc.; ③ Block / Unblock: An Agent can be blocked when the client malfunctions or when risky actions need to be stopped; it can be unblocked after the client resumes normal operation; ④ Kick offline / delete: Used to handle Agents that have been unavailable for a long time or need to be removed; ⑤ View the task history / current task load of a given Agent: This makes it easier to track "what a certain host actually did and what the result was".

[0104] The specific implementation steps of Agent management are shown in Table 3.

[0105] Table 3. Specific Implementation Steps for Agent Management

[0106] 5. Application System: Application systems are used to categorize agents into clearer business / application domains, helping operations and maintenance teams to perform organized management and permission boundary management in multi-system scenarios.

[0107] The tasks it can perform include: ① Create / edit / delete application systems (maintain name, code, and description); ② Maintain the application system that each Agent belongs to (and allow the association to be updated on the management side); ③ Provide clearer semantic grouping in aggregate statistics, runtime planning contexts, and permissions / audit presentations.

[0108] The specific implementation steps of the application system are shown in Table 4.

[0109] Table 4 Specific Implementation Steps of the Application System

[0110] 6. Task Center: The Task Center is the core module of the platform responsible for "turning execution units into concrete tasks and tracking their status". As the bridge between the intent expressed in intelligent operation and maintenance and the actual execution on the edge, it transforms a user's manual request, or an action in an operation plan of the intelligent operation and maintenance center (Ops Hub), into task records that can be executed by the edge agent, and provides a complete closed loop from creation and distribution, through streaming output collection, to final result storage.

[0111] The Task Center provides three core capabilities: task creation and queuing, streaming task output (by segment + sequence number), and task status and final output query. The following sections will provide a detailed introduction to each function.

[0112] 6.1 Task Creation (Single Unit and Batch): The task center supports converting operational intentions into specific executable tasks and specifying the target client (Agent) for distribution.

[0113] 6.1.1 Single Task Creation: Users can create individual tasks through the management interface, specifying the task type and target client. The system will automatically assemble the execution payload based on the task type and initialize the task status to "pending," placing it in the queue.

[0114] The supported task types and their corresponding input requirements are as follows: ① Command type (command): Used to execute operating system commands on the target host. A command line string (command_line) must be provided, which the system encapsulates into an execution payload; ② Script type (script): Used to execute a script on the target host. Script content (script_body) is required, along with an optional script language (script_language, defaults to python). The system writes the script content to a temporary file before execution; ③ Agent type (agent): Used to trigger the execution of natural language instructions by the large model on the client side. A natural language description is required, which the system encapsulates as a prompt and dispatches. The client-side Agent calls the local LLM for processing and supports streaming output; ④ Workflow type: Used to trigger the execution of predefined workflows. A workflow description (workflow_description) is required, which the system uses as the basis for execution; ⑤ Skill synchronization type (skill_sync): Used to trigger client-side skill pack synchronization. This type requires no additional content; the system creates an empty-payload task, and the client-side Agent, upon receiving it, proactively pulls the latest skill list from the server and updates its local cache.

[0115] After successful creation, the system returns task details, including task ID, type, target client, status, creation time, and content summary, which facilitates subsequent tracking.

[0116] 6.1.2 Batch Task Creation: When the same task content needs to be distributed to multiple hosts, the task center supports batch creation. Users provide a list of target client IDs (automatically deduplicated), task type, and corresponding content, and the system will create an independent task record for each client in the list.

[0117] The output of batch creation consists of two parts: Successfully created list: Task details for each client; List of failed items: Includes the target client ID and the reason for the failure.

[0118] Common reasons for failure include: the target client is not registered or does not exist, or the target client has been blocked.
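Batch creation as described in 6.1.2 can be sketched as below. The in-memory agent registry, the record field names, and the failure reason strings are hypothetical; the text specifies only the deduplication, the per-client task records, and the two-part result.

```python
from dataclasses import dataclass
from itertools import count

_task_ids = count(1)  # hypothetical task ID generator


@dataclass
class Agent:
    client_id: str
    blocked: bool = False


def create_tasks_batch(agents: dict, client_ids: list, task_type: str, content: str) -> dict:
    """Create one pending task per distinct client; collect per-client failures."""
    created, failed = [], []
    for client_id in dict.fromkeys(client_ids):  # deduplicate while keeping order
        agent = agents.get(client_id)
        if agent is None:
            failed.append({"client_id": client_id, "reason": "client not registered"})
        elif agent.blocked:
            failed.append({"client_id": client_id, "reason": "client blocked"})
        else:
            created.append({
                "task_id": next(_task_ids),
                "type": task_type,
                "client_id": client_id,
                "status": "pending",  # new tasks always start in the queue
                "content": content,
            })
    return {"created": created, "failed": failed}
```

Reporting failures per client instead of aborting the whole batch lets the remaining hosts proceed when one target is blocked or unknown.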

[0119] 6.2 Task Query (List, Details, By Agent): The task center offers multiple query options, making it convenient for operations and maintenance personnel to track task execution progress from different perspectives.

[0120] 6.2.1 Full Task List: Users can retrieve a paginated list of all tasks in the system, sorted by creation time. The list returns the core fields for each task, including task ID, type, target client, target hostname, status, creation time, update time, payload summary, content summary, and final output.

[0121] 6.2.2 Query details by task ID: Users can precisely query complete information about a single task using the task ID, including the complete payload and final output. If the task does not exist, the system will return a clear "not found" message.

[0122] 6.2.3 View the task list by client: Users can view all historical tasks of a specified client (Agent), making it easy to trace all operations performed by a host and their results. If the client does not exist, the system will return a "not found" message.

[0123] 6.3 Streaming collection of task output (polling by fragment): This is one of the core capabilities of the Task Center, especially suitable for tasks that generate long outputs (such as script execution log streams, long text generated by large models, and segmented results of skill execution). The Task Center lets the front end continuously pull task output through "fragment retrieval," achieving a streaming-like experience.

[0124] 6.3.1 Streaming collection mechanism: For agent-type tasks, the system supports splitting the output into multiple ordered fragments for collection. Each fragment is assigned an incrementing sequence number (seq) to ensure that the front end can concatenate the complete output in order.

[0125] The management interface provides a streaming retrieval interface. The front end needs to pass in the maximum sequence number that has been received so far (after_seq, which defaults to -1). The system will then return all segments with sequence numbers greater than this value.

[0126] The API response includes: ①Task ID: The identifier of the currently queried task; ②Task Status: The real-time status of the current task; ③ Segment array: Each segment contains a sequence number (seq) and segment text (text); ④ Done flag: This flag is true when the task status is success or failure; ⑤ Final output (final_output): Returns the final complete output of the task when the completion flag is true; otherwise, it is null.

[0127] 6.3.2 Typical usage process: The front end maintains a maximum received sequence number (last_seq) and continuously calls the streaming fetch interface by polling. ① On the initial call, pass after_seq=-1 to obtain the first batch of fragments; ② After receiving the fragments, update last_seq to the maximum sequence number in this batch; ③ Continue calling the interface, passing in the new last_seq, to obtain the subsequent fragments; ④ When the interface returns done=true, the task has ended and final_output contains the final output of the task, so there is no need to continue polling.
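The polling process above can be sketched as a small client loop. The fetch callable stands in for the streaming retrieval interface, whose URL is not given in the text; the response key "segments" and the max_polls safety limit are assumptions, while seq, text, done, and final_output follow the fields listed in [0126].

```python
import time
from typing import Callable


def collect_task_output(fetch: Callable[[int], dict], poll_interval: float = 1.0,
                        max_polls: int = 1000) -> str:
    """Poll with after_seq until done=true, concatenating fragments in seq order."""
    last_seq = -1  # nothing received yet, so the first call asks for everything
    parts: list[str] = []
    for _ in range(max_polls):
        resp = fetch(last_seq)
        for seg in sorted(resp["segments"], key=lambda s: s["seq"]):
            if seg["seq"] > last_seq:  # ignore anything already received
                parts.append(seg["text"])
                last_seq = seg["seq"]
        if resp["done"]:
            final = resp.get("final_output")
            # Prefer the server's final output; fall back to the joined fragments.
            return final if final is not None else "".join(parts)
        time.sleep(poll_interval)
    raise TimeoutError("task did not finish within max_polls")
```

Because the server returns only segments with seq greater than after_seq, a retried request after a network error neither loses nor duplicates output.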

[0128] 6.3.3 Constraints on streaming fragment writing: Streaming fragment writing is only available to agent-type tasks. When writing fragments, the system strictly verifies whether the task status is "dispatched" or "running" to ensure that only tasks that are currently executing can produce output fragments.

[0129] 6.4 Final Result of the Task: Once the task is completed, the client-side agent will report the final result to the task center. The system will then advance the task status to the end state (success or failure) and save the final output.

[0130] The final result report includes the following information: ① Client ID: The Agent identifier for executing the task; ②Task ID: The corresponding task identifier; ③ Status: Only success or failure is allowed; ④ Output: The final output text of the task execution.

[0131] After receiving a report, the task center will update the corresponding task record: set the status to success or failure, write the output to the output field, and update the last update time.

[0132] If the reported client ID is inconsistent with the target client to which the task was originally dispatched, the system refuses the write, preventing results from being returned to the wrong task.

[0133] 6.5 Task State Tracking (Complete State Machine): The task center exposes a complete state machine to the outside world, and users can observe the changes in task status in real time through multiple entry points such as the task list, task details, and streaming pull interface.

[0134] 6.5.1 State Set: The task status includes five types as shown in Table 5.

[0135] Table 5 Task Status

[0136] 6.5.2 State Advancement Path: The complete state progression path of the task is as follows: ① Create → pending: The task enters the waiting queue immediately after creation; ② Receive → dispatched: After the Agent polls and receives the task, the status changes from pending to dispatched; ③ Start output → running: For agent-type tasks, when the Agent reports the first streaming segment, the status changes from dispatched to running; ④ Complete → success / failed: After the Agent reports the final result, the task enters a terminal state.
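The progression path can be written as a transition table. This is a sketch rather than the platform's code: the direct dispatched → success / failed transitions are an assumption covering task types that never stream fragments, which the text does not spell out.

```python
# Allowed next states for each task status.
ALLOWED_TRANSITIONS = {
    "pending": {"dispatched"},
    # Assumption: tasks without streaming output finish directly from dispatched.
    "dispatched": {"running", "success", "failed"},
    "running": {"success", "failed"},
    "success": set(),  # terminal
    "failed": set(),   # terminal
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Terminal states map to an empty set, so any late report against a finished task is rejected by the same check.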

[0137] 6.5.3 Status Observation Entry Point: Task List / Details: Displays the current status of tasks in real time, and the final output is displayed when the task is completed; Streaming pull interface: The returned fields include task_status and the done flag. When done is true, it means that the task has entered the finished state.

[0138] 6.6 Task Constraints and Consistency Guarantees: To ensure the accuracy of task assignment and result feedback, the task center has set up multiple consistency constraints.

[0139] 6.6.1 Streaming fragment write constraints: When the Agent reports streaming segments, the system verifies that: ① The task exists; ② The reported client ID exactly matches the target client ID of the task; otherwise, the write operation is rejected; ③ The task type is agent; non-agent types do not support streaming output; ④ The task status is dispatched or running; tasks that are completed (success / failed) or still pending do not accept fragment writes.

[0140] 6.6.2 Final result write constraints: When the Agent reports the final result, the system verifies that: ① The reported client ID matches the target client ID of the task; ② The result status is either success or failure; ③ If the task does not exist or the client does not match, a 404 error is returned. These constraints effectively prevent abnormal situations such as task mismatch and duplicate data transmission, ensuring the reliability of the task loop.
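The two sets of write constraints can be sketched together. The in-memory task dictionary, its field names, and the WriteRejected exception are hypothetical; the assumption that seq starts at 0 is chosen to agree with the after_seq default of -1.

```python
class WriteRejected(Exception):
    """Raised when a report from an agent violates a consistency constraint."""


def _task_for_client(tasks: dict, task_id: str, client_id: str) -> dict:
    task = tasks.get(task_id)
    if task is None or task["client_id"] != client_id:
        # For final-result reports the text maps both cases to a 404 error.
        raise WriteRejected("task not found or client mismatch")
    return task


def append_fragment(tasks: dict, task_id: str, client_id: str, text: str) -> int:
    """Validate and store one streaming fragment; return its sequence number."""
    task = _task_for_client(tasks, task_id, client_id)
    if task["type"] != "agent":
        raise WriteRejected("only agent tasks support streaming output")
    if task["status"] not in ("dispatched", "running"):
        raise WriteRejected("task is not executing")
    seq = len(task["fragments"])  # assumption: seq starts at 0 and increments by 1
    task["fragments"].append({"seq": seq, "text": text})
    task["status"] = "running"  # the first fragment moves dispatched -> running
    return seq


def report_final(tasks: dict, task_id: str, client_id: str, status: str, output: str) -> None:
    """Validate and store the final result of a task."""
    task = _task_for_client(tasks, task_id, client_id)
    if status not in ("success", "failed"):
        raise WriteRejected("status must be success or failed")
    task["status"] = status
    task["output"] = output
```

Checking the client ID before anything else means a misrouted report can never change the state of another host's task.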

[0141] 6.7 Integration with the Intelligent Operations Hub: The task center not only serves manual task assignment but also acts as the core execution layer for the operation plans of the intelligent operation and maintenance center. When Ops Hub executes an operation and maintenance plan, it automatically creates tasks based on the required steps and dispatches the edge actions.

[0142] 6.7.1 Transformation of Operation Plan Steps and Tasks: In Ops Hub's execution engine, when a step requires specific operations to be performed on the client side (such as executing a script or calling Agent capabilities), the system will invoke the task dispatch logic: Create a task of type agent for the target client. Its payload contains prompts for the operation and maintenance plan execution steps, and includes context information such as plan_id, step_id, and source to ensure that the task can be associated with the original execution plan.

[0143] If the steps involve skill execution or require the client to prepare skills in advance, the system will also create an additional task of type skill_sync to trigger the client to synchronize the latest skill pack before executing the relevant actions.

[0144] 6.7.2 Closed-loop tracking: These tasks generated by Ops Hub immediately enter a pending state, and are then picked up by the corresponding Agent through polling, undergoing a complete state transition from dispatched to running to success / failed. Streaming segments during task execution and the final result are all transmitted back and persisted through the task center's interface.

[0145] Therefore, the Task Center is essentially a traceable execution ledger of Ops Hub at the "endpoint execution layer". Operations personnel can view the actual execution status of each operation and maintenance plan step on the endpoint through the Task Center.

[0146] 6.8 Access Control and Auditing: 6.8.1 Access Control: All management interface operations in the task center are subject to RBAC access control. ① Querying the task list and details requires the tasks.view permission; ② Creating single or batch tasks requires the tasks.create permission; ③ Viewing the task list by client requires the agents.view permission (because this interface is mounted under the /api/agents path).

[0147] 6.8.2 Audit Log: Task creation triggers an audit event in the "AI / Task Assignment" category, which records the following information: ① Audit category: ai_dispatch; ② Action description: Issue a task (task type); ③ Resource identifier: /api/tasks→task_id={task ID}; ④ Detailed summary: Includes the payload summary, target client information, etc. Meanwhile, the general write-operation auditing middleware on the management side deliberately skips the task creation interfaces (/api/tasks and /api/tasks/batch) to avoid duplicate records. Other write operations during task execution (such as task termination) are still recorded normally as user operation audits.

[0148] 7. Health Inspection: Health inspection is used to continuously collect host health indicators and summarize them into health status. At the same time, it triggers alarms when anomalies occur, enabling the platform to "continuously discover problems and push them to the alarm center".

[0149] The tasks it can perform include: ① The client periodically collects key system metrics (CPU / memory / disk / network, etc.); ② The platform saves the history of health inspection reports (including indicator snapshots, inspection types, health status, and optional LLM summaries); ③ The management interface provides a health summary: summarizing the current health status and the most recent report time by Agent; ④ The management interface provides a health report list: view historical inspections with pagination or a limit on the number of reports; ⑤ Trigger an alarm: When the health status reaches critical / warning, an alarm record is generated on the server.

[0150] The specific steps for implementing health inspections are shown in Table 6.

[0151] Table 6. Steps for Implementing Health Inspections

[0152] 8. Alarm Center: The alarm center is used to centrally display alarms triggered by health inspections and provides the capability of "human-readable alarms + large model-assisted analysis".

[0153] The tasks it can perform include: ① View alarm list: Displays alarm severity level, source host, source inspection report, title, details, etc.; ② Large-model analysis of alarms: For a specific alarm, the system uses the operation and maintenance alarm analysis capability to generate structured analysis conclusions, covering a summary of the observed symptoms, possible causes, the recommended order of handling / investigation, and whether emergency human intervention is required; ③ Write back and save the analysis results: This facilitates subsequent viewing, auditing, and review.

[0154] 9. Skills Center: The Skills Center is the core module of the platform responsible for managing "capability assets." It uses skill packages organized according to the Agent Skills specification (SKILL.md + attachments) as primary assets, providing end-to-end management capabilities from database entry, versioning, deployment scope control, to client-side distribution and server-side injection. Simultaneously, the Skills Center integrates large-model-assisted generation and optimization capabilities, allowing users to create or improve skills using natural language. As a key foundation for OpenClaw-like execution capabilities, the Skills Center ensures that complex operational logic can be reused, combined, and distributed on demand in standardized "skill" form.

[0155] The core capabilities of the Skill Center can be summarized as follows: Skill package entry (parsing + verification) → Deployment scope (client / server) and activation switch → Client-side distribution to hosts (built-in / registered) → Server-side caching and platform-side intelligent agent injection → AI generation / optimization → Skill visualization and details viewing. The following provides a detailed explanation of each function.

[0156] 9.1 Skill Pack Upload and Standard Verification: The Skills Center allows users to upload ZIP-formatted skill packages through the management interface, converting them into system-manageable skill assets. The entire import process includes the following functionalities: (1) ZIP structure identification and analysis: The system automatically recognizes common layouts for skill packs: SKILL.md can be located in the ZIP root directory or in a subdirectory (in which case other files in that directory are associated with the skill as attachments). If SKILL.md is not found in the ZIP, or if multiple root-directory-level SKILL.md files exist, the system will refuse to import it, ensuring that each skill pack has exactly one entry file.

[0157] (2) Version determination strategy: The skill version is derived from multiple sources in priority order: ① The version_override parameter explicitly specified in the upload request (highest priority); ② The metadata.version field in the YAML frontmatter of SKILL.md; ③ The version field in the _meta.json file within the ZIP archive; ④ The default version 1.0.0 (fallback). This mechanism allows users to control the version flexibly, either by forcibly overriding it or by following the definitions within the skill pack.
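The priority order can be sketched as a short resolver. The function name and the convention of passing the frontmatter and _meta.json as already-parsed dictionaries are assumptions made to keep the example self-contained.

```python
from typing import Any, Optional

DEFAULT_VERSION = "1.0.0"


def resolve_skill_version(version_override: Optional[str],
                          frontmatter: dict,
                          meta_json: Optional[dict]) -> str:
    """Pick the first non-empty version: override, frontmatter metadata, _meta.json, default."""
    metadata: Any = frontmatter.get("metadata")
    candidates = [
        version_override,
        metadata.get("version") if isinstance(metadata, dict) else None,
        (meta_json or {}).get("version"),
    ]
    for candidate in candidates:
        if candidate is not None and str(candidate).strip():
            return str(candidate).strip()
    return DEFAULT_VERSION
```

Converting each candidate with str() also covers YAML values such as 2.0 that a parser may load as a number rather than a string.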

[0158] (3) Normalization of skill names: To ensure imported skills conform to the platform's unified naming conventions, the system verifies the `name` field in the `SKILL.md` frontmatter. If a third-party skill package's name does not conform to the conventions (e.g., it contains spaces, uppercase letters, special characters, etc.), the system attempts to derive a compliant `slug` from the `slug` field in `_meta.json` or the skill's display name, and then rewrites the `name` field in the frontmatter. This step prevents subsequent directory structure chaos or tool registration failures due to invalid naming.

[0159] (4) Frontmatter field validation: The SKILL.md file contains a YAML frontmatter and must have at least two fields: name (in a standard format) and description (not empty and with a length limit). The system will rigorously validate these fields to ensure that the imported skills have complete metadata that can be consumed.
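Name normalization and frontmatter validation, items (3) and (4), can be sketched together. The naming pattern (lowercase letters, digits, and hyphens) and both length limits are assumptions, since the text states only that a format and a length limit exist; the function names are hypothetical.

```python
import re
from typing import Optional

NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")  # assumed naming convention
MAX_NAME_LEN = 64            # assumed limit
MAX_DESCRIPTION_LEN = 1024   # assumed limit


def slugify(raw: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return slug[:MAX_NAME_LEN].rstrip("-")


def normalize_frontmatter(frontmatter: dict, meta_slug: Optional[str] = None,
                          display_name: Optional[str] = None) -> dict:
    """Return a copy with a compliant name and a validated description."""
    fm = dict(frontmatter)
    name = str(fm.get("name") or "")
    if not NAME_RE.match(name) or len(name) > MAX_NAME_LEN:
        # Try the _meta.json slug first, then the display name.
        for candidate in (meta_slug, display_name):
            slug = slugify(candidate or "")
            if slug:
                name = slug
                break
        else:
            raise ValueError("cannot derive a valid skill name")
    fm["name"] = name
    description = str(fm.get("description") or "").strip()
    if not description or len(description) > MAX_DESCRIPTION_LEN:
        raise ValueError("description must be non-empty and within the length limit")
    fm["description"] = description
    return fm
```

Rewriting the name before validating the description means a third-party package fails only for problems that cannot be repaired automatically.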

[0160] (5) Attachment extraction and security restrictions: Other files in the skill pack (except SKILL.md) will be extracted as attachment maps (relative path → text content). To prevent resource abuse, the system sets a limit on the size of the ZIP package (approximately 15MB by default) and a limit on the number of attachment files (approximately 500 by default). At the same time, common macOS (.DS_Store) and Windows (Thumbs.db) junk files, as well as temporary directories generated by compression tools such as __MACOSX, will be filtered out to ensure the attachment directory is clean and secure.
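ZIP structure recognition and attachment extraction, items (1) and (5), can be sketched with the standard zipfile module. For simplicity this sketch requires exactly one SKILL.md anywhere in the archive, which is stricter than the rule described in (1); the exact limit values are taken from the approximate defaults above, and the function name is hypothetical.

```python
import io
import zipfile

MAX_ZIP_BYTES = 15 * 1024 * 1024   # approximate default from the text
MAX_ATTACHMENTS = 500              # approximate default from the text
JUNK_NAMES = {".DS_Store", "Thumbs.db"}


def parse_skill_zip(data: bytes) -> tuple:
    """Return (skill_md_text, {relative_path: text}) for a skill ZIP archive."""
    if len(data) > MAX_ZIP_BYTES:
        raise ValueError("ZIP archive exceeds the size limit")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = [
            n for n in zf.namelist()
            if not n.endswith("/")                      # skip directory entries
            and "__MACOSX" not in n.split("/")          # compression tool residue
            and n.rsplit("/", 1)[-1] not in JUNK_NAMES  # OS junk files
        ]
        entries = [n for n in names if n.rsplit("/", 1)[-1] == "SKILL.md"]
        if len(entries) != 1:
            raise ValueError("expected exactly one SKILL.md entry file")
        entry = entries[0]
        prefix = entry[: -len("SKILL.md")]  # "" for the root, or "subdir/"
        attachment_names = [n for n in names if n != entry and n.startswith(prefix)]
        if len(attachment_names) > MAX_ATTACHMENTS:
            raise ValueError("too many attachment files")
        skill_md = zf.read(entry).decode("utf-8")
        attachments = {
            n[len(prefix):]: zf.read(n).decode("utf-8", errors="replace")
            for n in attachment_names
        }
    return skill_md, attachments
```

Attachment paths are stored relative to the directory that holds SKILL.md, so the same package imports identically whether it was zipped from inside or outside its folder.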

[0161] (6) Write into the skill library and distinguish the deployment scope: Validated skills will be written into the system's skill table (Skill entity) and their deployment scope will be specified: ①client: This skill will participate in client-side skill synchronization and can be distributed to the Agent. ②server: Used only by the server-side orchestration and execution side; it is not synchronized to the client. ③ In addition, skills also support an enable / disable switch (enabled), and whether they are built-in skills (is_builtin, which is only meaningful for client skills; server skills are not allowed to be marked as builtin). Through the above functions, the Skills Center ensures that any uploaded skill pack can enter the platform in a standardized, secure, and versionable form, laying the foundation for subsequent deployment and use.

[0162] 9.2 Skills Deployment and Distribution Management: Once skills are added to the skill database, they need to be deployed appropriately to where they are needed. The skill center provides a sophisticated deployment control mechanism to achieve the management goal of "distribution on demand and differentiated capabilities".

[0163] (1) Built-in skills and non-built-in skills (client-only): For skills with deployment_scope=client, the system distinguishes between two types: Built-in skills: These are automatically acquired by all connected agents without any additional steps. These skills are typically basic platform capabilities, such as common system information collection and log viewing. Non-built-in skills: These skills are explicitly "assigned" by the user to a specific client, which then acquires the skill during synchronization. This allows the platform to assign differentiated sets of capabilities based on the roles, environments, or business needs of different hosts.

[0164] (2) Registration and cancellation by client: Users can assign a skill, identified by name@version, to a specified client, adding it to that client's available skill set. After assignment, the client receives the full details of the skill during the next skill synchronization. Conversely, unassigning a skill prevents it from being distributed to that client again. Assignment and unassignment operations do not affect the original skill entities in the skill library.

[0165] (3) Skill enable / disable: Each skill entity (whether client-side or server-side) has an enable / disable switch. For client-side skills, disabling the skill will prevent it from appearing in any client's synchronized collection (including built-in skills); for server-side skills, disabling the skill will prevent it from being included in the server-side cache, and platform-side agents will be unable to call it.

[0166] (4) Server-side skill caching linkage: When a skill with deployment_scope=server changes (uploaded, enabled / disabled, deleted, or restored to default), the system automatically triggers a rebuild of the server-side cache to ensure that the skill list read by the server-side agent is always consistent with the database state. Users can also manually trigger cache synchronization.

[0167] Through the aforementioned deployment and management capabilities, the Skills Center has achieved a flexible mapping from a "global capability library" to a "single-host capability set," ensuring both the consistency of basic capabilities and meeting the needs of differentiated customization.

[0168] 9.3 Server-side skill caching and platform agent reading: Skills must not only be "issued" but also "read back" where they are executed. The Skill Center provides an efficient and secure skill retrieval mechanism for the different execution sides.

[0169] (1) Client-side skill synchronization and local caching: The client agent retrieves its required skill set by calling the control plane interface GET /skills/sync periodically or on demand. The synchronization rules are as follows: ① All skills with deployment_scope=client and is_builtin=true are issued regardless of whether they are registered; ② For skills with is_builtin=false, only skills that have been registered (assigned) to this client are issued. The synchronization result includes the complete content of SKILL.md and the attached files. The client writes this content to the local skills_cache directory and generates a manifest.json file (recording the skill name, version, and content hash) for subsequent incremental updates and tool loading.
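The synchronization rules can be sketched as a server-side selection function. The skill record fields beyond deployment_scope, is_builtin, and enabled, the assignment mapping keyed by name@version, and the use of SHA-256 over SKILL.md for the content hash are assumptions; the enabled check follows the enable / disable rule in 9.2 (3).

```python
import hashlib


def skills_for_client(skills: list, assignments: dict, client_id: str) -> list:
    """Select the skills a client should receive and build manifest entries for them."""
    assigned = assignments.get(client_id, set())
    manifest = []
    for skill in skills:
        if skill["deployment_scope"] != "client" or not skill["enabled"]:
            continue  # server-only and disabled skills are never synchronized
        key = f'{skill["name"]}@{skill["version"]}'
        if skill["is_builtin"] or key in assigned:
            # Assumption: the content hash is SHA-256 over the SKILL.md text.
            digest = hashlib.sha256(skill["skill_md"].encode("utf-8")).hexdigest()
            manifest.append({"name": skill["name"], "version": skill["version"], "hash": digest})
    return manifest
```

A client can compare the returned hashes with its local manifest.json and rewrite only the skills whose content changed.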

[0170] (2) Server-side skill caching and injection: For skills with deployment_scope=server, the system creates a cache directory called skills_cache_server on the server, writes the skill content (SKILL.md and attachments) to disk, and generates a manifest mapping (name@version → content hash prefix). When a platform-side agent (such as the execution engine of Ops Hub) needs to invoke a skill, it reads the skill content from this cache, converts it into a callable utility function, and injects it into the execution environment.

[0171] (3) Execution-side tooling encapsulation: Whether on the client or server side, skills are ultimately used by the model or executor in the form of tools. On the client side, the skill center builds a general tool (such as `read_skill_file`) for each skill, used for progressive disclosure reading (first reading `SKILL.md`, then reading the attached scripts as needed). If the skill directory contains `scripts/run.py`, the system loads it as an executable `run(**kwargs)->str` function and registers it as a real skill tool. This allows complex tasks to be called by the model through the "skill = tool function" approach, without having to cram all the logic into the prompt.
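Loading `scripts/run.py` as a callable tool can be sketched with the standard importlib machinery. The function name load_skill_tool and the generated module name are hypothetical, and the text does not say how the platform actually loads or isolates the script.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Optional


def load_skill_tool(skill_dir: Path) -> Optional[Callable[..., str]]:
    """Return the run(**kwargs) -> str function from scripts/run.py, or None if absent."""
    script = skill_dir / "scripts" / "run.py"
    if not script.is_file():
        return None  # skill has no executable entry; only file reading applies
    spec = importlib.util.spec_from_file_location(f"skill_{skill_dir.name}", script)
    if spec is None or spec.loader is None:
        return None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the skill script's top-level code
    run = getattr(module, "run", None)
    return run if callable(run) else None
```

A skill whose script defines `def run(**kwargs): ...` can then be registered under the skill name and invoked by the model as, for example, `tool(host="web01")`.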

[0172] Through caching and injection mechanisms, the Skill Center ensures that skill content is quickly readable while also providing a unified calling interface for the execution side, enabling "plug and play" of skill assets.

[0173] 9.4 AI-assisted generation and optimization skills: The Skills Center integrates large model capabilities, enabling users to create or improve skills through natural language interaction, significantly reducing the skill writing threshold and ensuring that the generated results always conform to platform specifications.

[0174] (1) AI generation of new skills: The user provides a natural language description (e.g., "Create a skill that can collect system load and generate reports"). The system then calls the management-plane skill generator (LLM), requiring the model to strictly adhere to the specifications and output a JSON object containing `skill_md` (text conforming to the `SKILL.md` format) and `files` (a map of relative paths to content for attachments). After generation, the system performs the same specification validations on the output as for manual uploads (including frontmatter format, name / description fields, etc.). Only skills that pass validation can be previewed or saved. This feature provides a synchronous interface and an SSE streaming interface, the latter allowing real-time display of the model's inference process and final results.

[0175] (2) AI optimization of existing skills: For existing skills, users can submit optimization requests (such as "change the output format to JSON" or "add error retry logic"). The system sends the original skill's `skill_md` and `files`, together with the optimization request, to the optimizer LLM. The optimization process follows two key constraints: ① Preserve the name: the `name` field in the frontmatter must not be modified without authorization, to avoid breaking the skill identifier; ② Version control: the caller can either force the use of the version number provided in the form or keep the original skill version (unless the optimization request explicitly requires an upgrade). The optimized result undergoes the same specification validation, and the optimized `skill_md` and `files` are returned so that the user can decide whether to save them as a new version or replace the original skill.

[0176] (3) User experience of streaming generation: The Skill Generation Streaming Interface (SSE) pushes reasoning fragments and content increments to the front end in real time during model output, and returns the parsed complete skill structure (or error information) at the end of the stream. This interactive approach allows users to see the model's thought process while waiting for the generated results, enhancing controllability and trust.

[0177] With AI-assisted capabilities, the Skills Center transforms the skill development process, which originally required manually writing YAML and scripts, into a highly efficient workflow of "natural language description → structured generation → one-click entry into the library," greatly accelerating the accumulation of skill assets.
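As a non-limiting illustration of the specification validation applied to generated output in item (1), the following minimal sketch checks a model response for the `skill_md` and `files` keys and for `name`/`description` in the frontmatter. The function names are hypothetical, and the frontmatter parser is a deliberate simplification that handles only flat `key: value` lines rather than full YAML.

```python
import json
from typing import Dict


def parse_frontmatter(skill_md: str) -> Dict[str, str]:
    """Parse a flat 'key: value' frontmatter block (simplified, not full YAML)."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter is not closed with '---'")


def validate_generated_skill(raw: str) -> Dict[str, object]:
    """Validate the generator's JSON output; raise ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("generator output must be a JSON object")
    skill_md = data.get("skill_md")
    files = data.get("files")
    if not isinstance(skill_md, str):
        raise ValueError("'skill_md' must be a string")
    if not isinstance(files, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in files.items()
    ):
        raise ValueError("'files' must map relative paths to text content")
    meta = parse_frontmatter(skill_md)
    for field in ("name", "description"):
        if not meta.get(field):
            raise ValueError(f"frontmatter field '{field}' is required")
    return {"name": meta["name"], "skill_md": skill_md, "files": files}
```

Running the same function on manual uploads and on model output is one way to keep the two entry paths subject to identical rules.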

[0178] 9.5 Skill visualization and details viewing: The skills center not only needs to support "management" and "use," but also needs to ensure that users can "see clearly." To this end, it provides multi-dimensional visualization and detailed query capabilities.

[0179] (1) Skills List (Overview): `GET /api/skills` returns a list of skill summaries, including skill name, version, update time, whether it is signed, whether it is a built-in skill, deployment scope, and enabled status. The visibility of this list is controlled by permissions: viewing client skills requires the `skills.view` permission, and viewing server skills requires the `skills.server.view` permission, preventing unauthorized users from accessing server-specific capabilities.

[0180] (2) Skill Details (Full Content): `GET /api/skills/detail?name=...&version=...` returns complete information about the skill, including the `skill_md` (which can be displayed or edited in the UI), attached files (a mapping of relative paths to text content), activation status, and deployment scope. If the skill does not exist or the user does not have permission to view it, a 404 error is returned in both cases, so that the response does not reveal whether the skill exists.

[0181] (3) Deployment-level visualization (by client dimension): `GET /api/skills/deployments` lists, organized by client, the skill set (including built-in skills and registered non-built-in skills) that each host will ultimately acquire, along with information such as the client's hostname and `machine_id`. This helps operations personnel quickly identify the capabilities of a particular host and spot any deployment discrepancies.

[0182] (4) Server-side caching and default skill management: Users can view the skill manifest of the server-side cache directory via `GET /api/skills/server-cache/manifest`, including the number of skills and the hash prefix of each skill, which helps troubleshoot inconsistencies between the cache and the database. In addition, the Skill Center can restore the platform's official default server skills (`POST /api/skills/server/restore-defaults`) and provides a corresponding preview interface, allowing users to reset the server-side skill set to a known and trusted state and refresh the cache.

[0183] Through these visualization capabilities, the Skills Center enables users to have a comprehensive understanding of the current status, distribution, and changes of skill assets, providing solid data support for auditing, troubleshooting, and daily management.

[0184] 10. Configuration Center: The configuration center manages the platform's "dynamically adjustable configuration set" and persists it in a unified manner, then distributes the necessary configurations to the control plane / client / smart agent execution chain.

[0185] The tasks it can perform include: ① Unified persistent configuration: store key configurations in key-value format for easy versioning and backtracking; ② Revision mechanism: each configuration change generates a new revision, and clients determine from the revision whether an update is needed; ③ Synchronization of key operating switches and parameters: such as whether the LLM is enabled, timeouts, and health inspection parameters, ensuring that platform capabilities can be stopped, switched, and rolled back.
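As a non-limiting illustration of the revision mechanism, the following minimal in-memory sketch increments a revision counter on every effective change and lets a client fetch the configuration only when its own revision is stale. The class name `ConfigStore`, the method names, and the example keys are assumptions for this example; persistence is omitted.

```python
from typing import Any, Dict, Optional, Tuple


class ConfigStore:
    """Hypothetical key-value configuration store with a global revision."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._revision = 0

    @property
    def revision(self) -> int:
        return self._revision

    def set(self, key: str, value: Any) -> int:
        """Store a value; bump the revision only if something changed."""
        if key in self._values and self._values[key] == value:
            return self._revision
        self._values[key] = value
        self._revision += 1
        return self._revision

    def fetch_if_newer(
        self, client_revision: int
    ) -> Optional[Tuple[int, Dict[str, Any]]]:
        """Return (revision, config copy) if the client is stale, else None."""
        if client_revision >= self._revision:
            return None
        return self._revision, dict(self._values)
```

A client that remembers the last revision it applied can poll `fetch_if_newer` cheaply and reload only when a tuple is returned.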

[0186] 11. Logs and Auditing: Logs and audits are used for compliance tracing and issue review, with a focus on recording: ① write operations performed on the system by management-plane users; ② key actions of AI/task dispatch (in summary form to avoid excessive storage); ③ a structured summary of the Agent's execution plan after completion. The tasks it can perform include: ① Record user actions: who did what, on which interface/resource, when, and with what outcome (success/failure/rejection); ② Record AI/task dispatch: what type of task was dispatched, the task objective, a payload summary, etc.; ③ Record Agent execution plan completion: save a structured execution summary that can be used for review (such as scenario, total number of steps, final answer summary, etc.); ④ Audit query and export: administrators can filter by category and export to CSV/JSON for compliance reports and archiving.

[0187] 12. User Management: User management is used to maintain the login subjects and status of the platform, and to allow users to be assigned roles to gain permissions.

[0188] The tasks it can perform include: ① Create a local user: Set the username, password (bcrypt hash), email / display name, and bind it to the role; ② Update user information: email, display name, status (enabled / disabled), role binding; ③ Manage user activation status: Disable users to block their access; ④ Deleting users: Usually, built-in key accounts are protected (e.g., deleting admin is prohibited); ⑤ Compatibility with LDAP scenarios: LDAP users typically do not use local password modification logic (passwords are managed by LDAP).

[0189] 13. Role Management: Role management is used to configure RBAC (role-based access control) and map a set of permission points to executable capability scopes.

[0190] The tasks it can perform include: ① Create a role: Configure a set of permission points for the role, or use the built-in default permission policy; ②Update Role: Modify name, description, and permission set; ③ Delete role: Usually restricts the deletion of built-in roles (admin / operator / viewer); ④ Permission directory query: Provides permission point classification for the front end, making it easier for administrators to select permission configurations.

[0191] 14. System Settings: The system settings are used to manage the switches and parameters of the platform's "basic operational capabilities," especially key strategies related to LLM and health inspection.

[0192] The tasks it can perform include: ① Configure the management-plane LLM: the foundation for capabilities such as skill generation/optimization, alarm analysis, and document Q&A (`base_url`/`api_key`/`model`/`timeout`/`enabled`); ② Configure the Agent client LLM: used for large model capabilities on the client side (distributed to the Agent via the control plane, together with a revision); ③ Configure the Agent health inspection strategy: control whether health inspection is enabled, the sampling interval, the LLM inspection interval, the buffer window size, etc.; ④ Deploy configuration changes via revision: ensure that clients update their configurations at appropriate times and remain consistent; ⑤ Coordinate with LDAP/authentication systems: system settings and administrator authentication configuration together affect platform availability and security boundaries.

[0193] 15. Operations and Maintenance AI Assistant: The Operations AI Assistant serves as the platform's unified entry point for "Operations Q&A." It combines factual extracts from the document center with optional task orchestration context, calling upon a large model to generate accurate and traceable answers. It supports purely document-driven Q&A and can also provide composite suggestions combining "knowledge + situation" after integrating the execution status of the operation plan, thus becoming an intelligent assistant for operations personnel to obtain information, understand the current situation, and assist in decision-making.

[0194] The following section provides a detailed explanation of this function, covering aspects such as entry capabilities, document center management, retrieval and extraction mechanisms, answer constraint rules, large model interaction, access control, and integration with the intelligent operation and maintenance center.

[0195] 15.1. Entry capabilities: The operations and maintenance AI assistant provides two types of Q&A entry points to adapt to different interaction scenarios: Synchronous Q&A: After a user submits a question, the system returns a complete answer all at once. This method is suitable for scenarios where answers are needed quickly, and the interaction is simple and direct.

[0196] Streaming Q&A: The system pushes answer content word by word or paragraph by paragraph using Server-Sent Events (SSE). Streaming output allows users to see the model generation process in real time, which is especially suitable for long answer scenarios and improves the interactive experience. At the same time, the streaming interface will push a list of document titles cited in the current answer before the answer begins, so that the front end can display the citation sources in advance.

[0197] Both types of entry points support the optional "orchestration context" parameter. When the user provides this parameter, the assistant's response will incorporate the current execution status of the running plan to provide more targeted suggestions.

[0198] 15.2. Document Center Management: The knowledge source for the operations and maintenance AI assistant is entirely based on the platform's built-in document center. The document center is an independently manageable knowledge base that allows operations and maintenance personnel to maintain documents related to platform capabilities, operations and maintenance standards, and troubleshooting.

[0199] Document Entities: Each document contains a title and body text, and supports rich text or plain text formats.

[0200] Management capabilities: Users can perform operations such as adding, deleting, modifying, and querying documents, including creating new documents, editing titles or text, and deleting documents that are no longer needed. All document operations are subject to access control, ensuring that only authorized users can modify the knowledge base.

[0201] Search scope: During question answering, the system will traverse the text of all documents and extract relevant fragments as input to the large model to ensure that the answers are always based on the latest knowledge base content.

[0202] 15.3. Retrieval and Extraction Mechanism: To ensure that the large model can answer based on accurate factual fragments, the system implements a lightweight retrieval and extraction algorithm that can stably extract relevant content without relying on a vector database.

[0203] Document slicing: The system first segments the main text of each document into paragraphs, then concatenates adjacent paragraphs into appropriately sized text segments (usually no more than 900 characters). If a paragraph is too long, it will be further broken into smaller segments. This slicing method preserves semantic coherence while controlling the length of each input to the model.
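As a non-limiting illustration of this slicing step, the following minimal sketch splits a document body into paragraphs, hard-splits overlong paragraphs, and greedily concatenates adjacent pieces up to the 900-character limit. Treating a blank line as the paragraph boundary and cutting overlong paragraphs at fixed character offsets are assumptions; the source does not specify either detail.

```python
import re
from typing import List

MAX_CHARS = 900  # segment size limit mentioned in the description


def slice_document(body: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split body text into segments of at most max_chars characters."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    segments: List[str] = []
    current = ""
    for para in paragraphs:
        # An overlong paragraph is broken into fixed-size pieces first.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # keep merging adjacent paragraphs
            else:
                segments.append(current)
                current = piece
    if current:
        segments.append(current)
    return segments
```

Because merging is greedy and order-preserving, neighbouring paragraphs stay together whenever they fit, which is what keeps each segment semantically coherent.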

[0204] Relevance scoring: For each segment, the system calculates its word overlap with the user's question. The algorithm extracts Chinese and English word sets from both the question and the segment (including the title), calculates the number of co-occurring words, normalizes the number of words based on the question's word count, and gives a slight bonus based on the segment length. This algorithm does not rely on complex semantic models but can stably match text strongly related to the question's keywords, offering advantages such as being lightweight and interpretable.

[0205] Results filtering: All segments are sorted by score, and the top 8 segments are selected. In addition, to prevent near-identical segments from the same document from appearing repeatedly, the system deduplicates segments based on their leading content.

[0206] Output Excerpts: The final selected excerpts, along with the titles of their respective documents, constitute the "Document Center Excerpts." These excerpts will be provided to the larger model as factual evidence, and the document titles will also be returned along with the answer, allowing users to clearly see which knowledge sources their answers referenced.
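As a non-limiting illustration of the scoring and filtering steps above, the following minimal sketch computes word overlap normalized by the question's word count, adds a small length bonus, and keeps the top 8 deduplicated segments. Several details are assumptions made for this example: each Chinese character is treated as one token, the bonus is capped at 0.05 and grows with segment length, zero-overlap segments are discarded, and deduplication uses the first 80 characters of a segment.

```python
import re
from typing import List, Set, Tuple

# Assumption: English/digit runs are words; each CJK character is one token.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]")


def tokens(text: str) -> Set[str]:
    return {t.lower() for t in TOKEN_RE.findall(text)}


def score(question: str, title: str, segment: str) -> float:
    """Overlap ratio against the question plus a slight length bonus."""
    q = tokens(question)
    if not q:
        return 0.0
    overlap = len(q & tokens(f"{title} {segment}"))
    if overlap == 0:
        return 0.0
    length_bonus = min(len(segment), 900) / 900 * 0.05  # assumed bonus shape
    return overlap / len(q) + length_bonus


def select_excerpts(
    question: str,
    segments: List[Tuple[str, str]],  # (document title, segment text)
    top_k: int = 8,
    prefix_len: int = 80,
) -> List[Tuple[str, str]]:
    """Return up to top_k best-scoring segments, deduplicated by prefix."""
    ranked = sorted(
        segments, key=lambda ts: score(question, ts[0], ts[1]), reverse=True
    )
    seen = set()
    picked: List[Tuple[str, str]] = []
    for title, seg in ranked:
        if score(question, title, seg) <= 0:
            break  # remaining segments share no words with the question
        key = (title, seg[:prefix_len])
        if key in seen:
            continue
        seen.add(key)
        picked.append((title, seg))
        if len(picked) == top_k:
            break
    return picked
```

The titles of the returned pairs are what would be surfaced to the user as citation sources.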

[0207] 15.4. Answer constraint rules: To ensure the accuracy, compliance, and explainability of the answers, the system defines two strict answer modes based on whether an "orchestration context" is provided, and enforces these modes through system prompts.

[0208] (1) Document extract only mode (no orchestration context provided): In this mode, the system requires the large model to answer questions strictly based on facts extracted from the document center, and it must not fabricate any content that does not exist in the documents. If the extract is insufficient to answer the user's question, the model must clearly state "This question cannot be answered based on the current document center," and briefly explain what information is missing. Answers must be in Chinese, concise and organized, and may quote key points from the extract, but should avoid stock phrases such as "As an AI."

[0209] (2) Document extract + orchestration context fusion mode (orchestration context provided): When users provide an orchestration context (such as a summary of the operation plan status, tool routing decisions, event timelines, etc.), the system activates the fusion mode. In this mode, the document center extract remains the primary factual basis, and fabricating capabilities or descriptions not found in the documents is not permitted; the orchestration context is used to supplement situational information such as execution progress, risks, and recommended next steps. When responding, the system first provides an actionable conclusion, then explains the basis for the response from both the document extract and the orchestration context. If only the orchestration context is available without a document extract, suggestions are summarized based solely on the context, and the resulting gaps are stated. Responses must also be in Chinese, concise, and well-organized.

[0210] Hard rejection: If the document search results are empty and no orchestration context is provided, the system directly refuses to answer and prompts the user to supplement the documents or associate an operation plan before asking again, so as to avoid unfounded answers.
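As a non-limiting illustration of how the two answer modes and the hard rejection relate, the following minimal sketch selects a mode from the retrieved excerpts and the optional orchestration context. The enum, exception, and function names are hypothetical and chosen for this example only.

```python
from enum import Enum
from typing import Optional, Sequence


class AnswerMode(Enum):
    DOCUMENTS_ONLY = "documents_only"  # mode (1): excerpts, no context
    FUSION = "fusion"                  # mode (2): context provided


class NoEvidenceError(Exception):
    """Raised for the hard rejection: nothing to ground an answer on."""


def choose_mode(
    excerpts: Sequence[str], orchestration_context: Optional[str]
) -> AnswerMode:
    has_context = bool(orchestration_context and orchestration_context.strip())
    if has_context:
        # Fusion mode also covers the case where no excerpt was found.
        return AnswerMode.FUSION
    if excerpts:
        return AnswerMode.DOCUMENTS_ONLY
    raise NoEvidenceError(
        "No document excerpts and no orchestration context: add documents "
        "or associate an operation plan before asking again."
    )
```

Making the rejection an exception rather than a third mode ensures that no model call can be issued without at least one source of evidence.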

[0211] 15.5. Large Model Interaction and Output Semantics: The operations and maintenance AI assistant generates dialogues through a standard large model interface and verifies and standardizes the model's output.

[0212] Message assembly: The system generates the corresponding system prompt (including the strict answer constraints) based on the current mode, and assembles the user question, document center excerpts, and orchestration context into the final user message, which is then sent to the large model.

[0213] Answer Validation: After the large model returns, the system only accepts its "body" portion as the final answer. If the model only returns the reasoning process and not the body, or if the body is empty, the system will report an error and prompt the user to change the model or retry. This ensures that the front-end always has displayable answer content.

[0214] Streaming output validation: In streaming question answering, the system pushes incremental body text in real time, while the reasoning process is not displayed to the user. If the model has not output any body text by the end of the stream, the system sends an error event explaining the reason (e.g., "The model only output reasoning, not body text"). As noted above, the citation sources are pushed before the body text begins, so that they can be displayed in advance.

[0215] Timeout and Error Handling: Large model calls have timeout control. If a timeout occurs or an HTTP error occurs, the system will return an appropriate error message. If the LLM configuration is not enabled or missing necessary parameters, the system will directly reject the call to avoid invalid requests.
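As a non-limiting illustration of the streaming behavior described in this subsection, the following minimal sketch formats Server-Sent Events frames: the cited titles are sent first, body text is forwarded incrementally, reasoning chunks are dropped, and an error event is emitted if no body text arrived. The event names (`sources`, `delta`, `done`, `error`) and the `(kind, text)` chunk shape are assumptions for this example; the source does not name them.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame (event + data + blank line)."""
    return f"event: {event}\ndata: {json.dumps(data, ensure_ascii=False)}\n\n"


def stream_answer(
    titles: List[str], chunks: Iterable[Tuple[str, str]]
) -> Iterator[str]:
    """Yield SSE frames for one answer.

    chunks yields (kind, text) pairs where kind is "reasoning" or "content";
    this pair shape is a hypothetical stand-in for the model's stream.
    """
    # Citation sources go out before any answer text.
    yield sse("sources", {"titles": titles})
    emitted = False
    for kind, text in chunks:
        if kind == "content" and text:
            emitted = True
            yield sse("delta", {"text": text})
        # Reasoning chunks are intentionally not forwarded to the user.
    if emitted:
        yield sse("done", {})
    else:
        yield sse("error", {"message": "The model only output reasoning, not body text"})
```

The frames can be written directly to any HTTP response served with the `text/event-stream` content type.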

[0216] 15.6. Access Control: All operations of the operations and maintenance AI assistant are managed by RBAC (role-based access control) to ensure that users with different roles have appropriate access scopes.

[0217] Question and Answer Permissions: Calling synchronous or streaming question and answer interfaces requires "Use Assistant" permission. Ordinary operations and maintenance personnel usually have this permission by default, but administrators can configure it.

[0218] Document management permissions: Viewing the document list requires "View Documents" permission, while creating, updating, and deleting documents requires "Manage Documents" permission. This segmentation allows document maintenance to be authorized to specific roles, while ordinary users can only query documents.

[0219] Audit logs: All document write operations (add, modify, delete) will pass through the audit middleware, which will automatically record information such as the operator, operation time, and operation content summary, which will facilitate subsequent traceability and responsibility determination.

[0220] 15.7. Integration with the intelligent operation and maintenance center: The operations and maintenance AI assistant itself does not directly perform operations and maintenance tasks, but it can work closely with the intelligent operation and maintenance center (Ops Hub) and the task orchestration layer. By receiving the execution status of the operation plan, its answers carry the dual value of "knowledge facts" and "current progress".

[0221] Connection method: When users execute operation plans, view task status, or perform multi-agent orchestration in the intelligent operation and maintenance center, the system can pass the current execution summary (such as completed steps, tool routing results, event timelines, etc.) as "orchestration context" to the assistant.

[0222] Integration Effect: When answering questions, the assistant distinguishes between "document-based" and "situation-based" responses. For example, when a user asks "How to troubleshoot a certain type of fault," the assistant can not only provide standard steps based on documentation but also, in conjunction with steps already executed in the current operational plan, suggest, "According to the operation log, this step has been completed. Next step suggestion…." This integration transforms the assistant's response from a static knowledge retrieval into a collaborative decision-making tool that integrates with real-time operational progress.

[0223] Security Guarantee: The fusion mode also strictly adheres to the principle of not fabricating information not present in the context, avoiding false alarms or speculative execution status, and ensuring the credibility of the recommendations.

[0224] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation methods of this application. Any modifications or equivalent substitutions that do not depart from the spirit and scope of this application should be covered within the protection scope of the claims of this application.

Claims

1. A natural language-driven edge-cloud collaborative intelligent operation and maintenance system, characterized in that it comprises: a management plane, used to provide the user interaction interface, permission management, auditing, and configuration management; a control plane, deployed on a different physical or virtual host from the management plane and communicatively connected to it, used for task polling and result collection with at least one end-side agent; the end-side agent, deployed on a target host, used to collect host health indicators, execute operation and maintenance tasks dispatched from the cloud, and stream execution fragments back; an intelligent operation and maintenance center, deployed on the management plane and configured to: receive operation and maintenance objectives described in natural language, use a state machine execution engine based on a directed graph to perform operation and maintenance intent understanding, read-only evidence snapshot retrieval, operation and maintenance step planning, tool/skill/sub-agent routing, sub-agent deep reasoning, phase summary, and final answer generation, and call the task center to dispatch execution tasks to the end-side agent according to the routing results; a task center, deployed on the control plane, used to create tasks and distribute them via polling, maintain the task state machine, receive streaming output fragments and final results reported by the end-side agent, and provide an interface for incrementally pulling fragments by sequence number; and a skill center, deployed on the management plane, used to manage skill packages that conform to the agent skill specification, and to realize versioned storage of skill packages, AI-assisted generation and optimization, registration and distribution by client, and server-side skill cache injection.

2. The natural language-driven edge-cloud collaborative intelligent operation and maintenance system according to claim 1, characterized in that, The intelligent operation and maintenance center includes: The read-only snapshot module is used to aggregate read-only evidence from the alarm center, health inspection module and document center and compress it into a context acceptable to the large language model. The intent understanding module is used to structure natural language goals into main intent categories, constraints, and success criteria based on heuristic rules and large language models. The planner module is used to dynamically generate an execution plan containing multiple steps based on a scenario template or a large language model. Each step includes evidence identification and confidence level. The router module is used to select and invoke read-only tools, platform skills, MCP tools, memory read / write, or sub-agents for each step, and to trigger manual review and interception when the confidence level is lower than the threshold. The sub-agent module is used to call the large language model to generate conclusions and suggestions for steps that require deep reasoning, and to perform quality checks based on the suggestions; The summary module is used to generate periodic summaries and reflections on operational experience, and write the reflections into the approval queue. The final answer generation module is used to converge the conclusions of each stage into a final answer in a user-friendly, lightweight markup language format.

3. The natural language-driven edge-cloud collaborative intelligent operation and maintenance system according to claim 1, characterized in that the task center includes: a task creation unit, which supports the creation of single tasks and batch tasks and generates standardized payloads based on task types, the task types including commands, scripts, agent instructions, and skill synchronization; a polling and claiming unit, which provides atomic state transitions, updating waiting tasks in batches to the distributed state and returning them to the end-side agent; a streaming collection unit, which receives text fragments reported by the end-side agent, automatically assigns each an incrementing sequence number, and stores them so as to prevent duplicate writes; a result storage unit, which receives the final execution status and output and updates the task status; and a pull interface, which allows the management front end to pull fragments incrementally according to the `after_seq` parameter and returns the final output when the task is completed.

4. The natural language-driven edge-cloud collaborative intelligent operation and maintenance system according to claim 1, characterized in that the skill center includes: a skill package entry module, used to parse ZIP-format skill packages, verify the YAML frontmatter of the skill Markdown file, extract attachments and perform security filtering, and determine the deployment scope and version number; a version management module, which supports incremental release and rollback of skill versions; a registration and distribution module, which controls the skill synchronization set of a specified agent by registering/unregistering non-built-in client skills; a generation and optimization module, which calls the large language model to generate a compliant skill or optimize an existing skill based on natural language descriptions, and returns the generation process through the SSE streaming interface; and a server-side caching module, which caches server-side skills to a local directory and generates a manifest file, the cached skills being dynamically loaded as tool functions by the execution engine.

5. A natural language-driven edge-cloud collaborative intelligent operation and maintenance control method based on the system described in any one of claims 1 to 4, characterized in that it includes the following steps: S1: The management plane receives the operation and maintenance goals input by the user in natural language, creates the operation plan, and sets the scenario type; S2: The intelligent operation and maintenance center calls the read-only snapshot module to pull read-only evidence from the alarm center, health inspection module, and document center, compress it, and store it in short-term memory; S3: The intent understanding module analyzes the operation and maintenance goals based on heuristic rules and a large language model, and outputs the main intent category, constraints, and success criteria; S4: The planner module dynamically generates an ordered sequence of steps based on a scenario template or a large language model, each step including expected evidence and a confidence level; S5: For each step, the router module decides the call type: if a read-only tool is called, the data is obtained directly; if a skill is called, the skill content is obtained through the skill center and executed; if execution on the client side is required, the task center is invoked to create a task and dispatch it to the specified agent; S6: The task center performs atomic polling and dispatching; after the agent receives the task, it executes it and reports incremental fragments through the streaming interface, and the task center stores the fragments and provides incremental retrieval; S7: The sub-agent module calls the large language model to generate sub-conclusions and suggestions for steps with low confidence or requiring deep reasoning, and performs quality review based on the suggestions; S8: The summary module summarizes the results of each step, generates reflection experiences, and writes them into the knowledge approval queue; S9: The final answer generation module summarizes the conclusions of all steps, generates a final answer in lightweight markup language format, and displays it to the user through the management plane; S10: After approval, the reflection experiences are written into the document center to form reusable knowledge assets.

6. The natural language-driven edge-cloud collaborative intelligent operation and maintenance control method according to claim 5, characterized in that, Step S2 specifically includes: The alarm center interface is called to retrieve the 40 most recent alarm records, with only the title, severity, and occurrence time of each alarm retained. The health inspection module is called to retrieve the 50 most recent health reports, with the large language model summary truncated to 1200 characters and the indicator summary compressed to 800 characters. The document center is called to perform word segmentation and matching using the operation and maintenance target as the query term, and the 8 document fragments with the highest relevance are selected. The three parts of evidence—alarm data, health reports, and document fragments—are concatenated into structured text. If the total length exceeds the preset token budget, the alarm sample is further compressed to 20 records, the health report sample to 30 records, and the top 5 document fragments are retained.

7. The natural language-driven edge-cloud collaborative intelligent operation and maintenance control method according to claim 5, characterized in that, The dynamically generated ordered sequence of steps specifically includes: If the scenario is an inspection and the target involves simple indicator queries, then the simple mode is enabled, directly generating two steps: obtaining a snapshot of key indicators and generating health recommendations; if the scenario is root cause analysis, then the large language model planner is invoked, the intent and read-only evidence are input, and an array of steps is required to be output, each step containing an identifier, title, role, evidence identifier, and confidence level; the steps output by the large language model are validated for legality, steps with missing confidence levels are assigned a default value of 0.5, and the step sequence is stored in the plan JSON.

8. The natural language-driven edge-cloud collaborative intelligent operation and maintenance control method according to claim 5, characterized in that, The router module decision invocation types specifically include: For the current step, first check if there is a batch route cache; if so, use it directly. Otherwise, call the routing language model, inputting the step title, read-only evidence summary, and list of available tools / skills for the current step. The output should be an array of calls, each containing type, name, reason, and confidence level. Only calls allowed by type are retained, and duplicates are removed by "type, name". A maximum of 5 calls are allowed per step. If the call type is a skill and the skill exists in the skill center, the skill is executed. If the confidence level is below the threshold and the interception switch is on, the step is marked as needing review, and subsequent automatic execution is interrupted.

9. The natural language-driven edge-cloud collaborative intelligent operation and maintenance control method according to claim 5, characterized in that step S6 specifically includes: the control plane polling interface receives the client identifier of the agent, queries the agent's waiting tasks within a database transaction, updates their status to "distributed", records the distribution time, and returns the task list; during task execution, for each output fragment generated, the agent calls the control plane's `POST /tasks/stream`, carrying the task identifier, client identifier, and fragment text; the control plane verifies that the task exists, the client matches, and the status is "distributed" or "running", queries the current maximum sequence number, sets the new sequence number to the maximum sequence number + 1, and stores the fragment; the management front end obtains the incremental fragments and the current task status through `GET /tasks/{task_id}/stream?after_seq=<last maximum sequence number>`; if the task status is success/failure, the final output is returned.

10. The natural language-driven edge-cloud collaborative intelligent operation and maintenance control method according to claim 5, characterized in that, The specific content of writing reflective experiences into the document center includes: The management interface provides a knowledge approval list interface, allowing users to review requests with a pending status. If approved, the system parses the reflection text in the JSON content, supplements metadata information, including at least the source operation plan identifier, approver, and approval time. It then calls the document center to create a document entry with an automatically generated title of "Operation Review - {Scenario} - {Timestamp}" and a body containing the reflection content and cited evidence. The approval request status is updated to "Approved," and the approved document identifier is recorded. If rejected, the status is updated to "Rejected," and the reason for rejection is recorded.