Multi-cluster data processing method and system based on large language model

By using a multi-cluster data processing system based on a large language model, the problem of insufficient decision-making capabilities in multi-cluster environments is solved, global intelligent optimization and failover are achieved, data processing efficiency and adaptability are improved, and rule maintenance costs are reduced.

CN122240415A · Pending · Publication Date: 2026-06-19 · TONGDUN NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TONGDUN NETWORK TECH CO LTD
Filing Date
2026-01-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies lack decision-making capabilities in multi-cluster environments and cannot perform global optimization in dynamic, complex multi-objective environments. They also incur high rule maintenance costs, adapt poorly, and cannot cope with unknown and complex faults.

Method used

A multi-cluster data processing system based on a large language model is adopted. Through a distributed telemetry acquisition module, a telemetry aggregation and processing module, an intelligent decision engine module, a strategy executor module, and a business objective and configuration management module, it aggregates and standardizes multi-dimensional telemetry data, performs cross-modal data association and real-time feature engineering, generates structured state feature vectors, and uses a large language model to infer and generate structured decisions for intelligent optimization and failover.

Benefits of technology

It achieves global intelligent optimization in dynamic and complex environments, improves the accuracy of failover decisions and data processing efficiency, reduces rule maintenance costs, and enhances the system's adaptability and automation capabilities.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a multi-cluster data processing method and system based on a large-scale language model. The method collects multi-dimensional telemetry data from managed clusters; inputs the multi-dimensional telemetry data, business objectives, and dynamically generated prompts into a large-scale language model to obtain structured decisions; translates the structured decisions into configuration instructions for the target platform and executes the configuration instructions. This invention achieves global intelligent optimization and accurately proposes failover decisions, thereby improving data processing efficiency.

Description

Technical Field

[0001] This invention relates to the field of computer technology applications, and in particular to a multi-cluster data processing method and system based on a large language model.

Background Technology

[0002] For reasons of high availability and disaster recovery, enterprises generally adopt multi-cluster deployment. Traffic is usually initially distributed through GSLB at the DNS layer or load balancers provided by cloud vendors.

[0003] In existing technologies, automated traffic management systems based on preset rules typically combine service meshes and automated scripts. They are deployed in each cluster through distributed monitoring agents to collect performance metrics. Data is aggregated by a central monitoring and alarm center, and alarms are triggered according to preset static threshold rules (e.g., IF latency>200ms FOR 5m THEN alert). The automated scripts / controllers listen for alarm events and execute pre-written scripts.

[0004] In application, the distributed monitoring agent reports latency data for each cluster service; the central monitoring and alarm center detects that the service latency of cluster A exceeds 200ms for 5 consecutive minutes; the alarm center triggers a webhook, calling a pre-defined automated script (such as an Ansible Playbook or a Kubernetes Operator). This script executes hard-coded logic, such as: "If the latency of cluster A is too high, modify the Istio traffic rules, reducing the traffic weight of cluster A by 50% and increasing the weight of cluster B by 50%"; the script completes the configuration distribution by calling the Kubernetes API.

[0005] However, existing technologies have several drawbacks: they suffer from a single decision-making dimension and a lack of global optimization capabilities. They execute corresponding operations based on fixed rules and single triggering conditions. When the system state becomes complex (e.g., cluster A has high latency but extremely low cost, while cluster B has low latency but near-saturation resources), simple rules cannot make optimal trade-offs. Furthermore, they are poorly adaptable to unknown and complex fault scenarios. Pre-set scripts can only handle known, pre-planned faults. When a new, complex fault arises due to multiple factors (such as code bugs, downstream dependency failures, and network jitter), the hard-coded rule base lacks a matching pattern, rendering the system ineffective and ultimately requiring manual intervention. Additionally, rule maintenance is costly, and the system is rigid. As business and architecture evolve, experts must continuously update and maintain these thresholds and rule scripts manually. Rule adjustments lag behind actual changes, leading to rigid system behavior and an inability to adapt to dynamically changing market or business needs.

[0006] There is currently no effective solution to the problem that existing technologies, which are based on static rules, lack the ability to make decisions in dynamic and complex multi-objective environments.

Summary of the Invention

[0007] To address the aforementioned technical problems, this invention aims to provide a multi-cluster data processing method and system based on a large-scale language model, thereby at least solving the problem that existing technologies, based on static rules, lack sufficient decision-making capabilities in dynamic and complex multi-objective environments.

[0008] The technical solution of this invention is implemented as follows: This invention provides a multi-cluster data processing system based on a large-scale language model, comprising: a distributed telemetry acquisition module deployed in each managed cluster for acquiring multi-dimensional telemetry data; a telemetry aggregation and processing module for aggregating, standardizing, cross-modal data association, and real-time feature engineering of the multi-dimensional telemetry data to generate structured state feature vectors; an intelligent decision engine module for obtaining business objectives based on triggering conditions, dynamically fusing the structured state feature vectors with the business objectives to generate a prompt, calling a large-scale language model to perform reasoning based on the prompt to generate structured decisions, and performing syntactic, semantic, and security logic verification on the structured decisions to obtain verified decision instructions; a policy executor module for translating verified decision instructions into configuration instructions for the target platform and executing the configuration instructions; and a business objective and configuration management module for storing business policies and providing decision-making basis.

[0009] Optionally, the multi-dimensional telemetry data includes: metric data, log data, and link data.

[0010] This invention provides a multi-cluster data processing method based on a large language model, applied to a multi-cluster data processing system based on a large language model, comprising: collecting multi-dimensional telemetry data of the managed cluster; inputting the multi-dimensional telemetry data, business objectives, and dynamically generated prompt words into a large language model to obtain structured decisions; translating the structured decisions into configuration instructions for the target platform and executing the configuration instructions.

[0011] Optionally, the collection of multi-dimensional telemetry data from the managed cluster includes: collecting the managed cluster's metric data, log data, and link data; generating multi-dimensional telemetry data based on the metric data, log data, and link data; wherein, the metric data includes: infrastructure metrics, network metrics, and application layer metrics exposed through the service mesh; the log data includes: standard output logs, error logs, and Kubernetes events; and the link data includes: call link data of distributed services.

[0012] Optionally, multi-dimensional telemetry data, business objectives, and dynamically generated prompts are input into a large-scale language model to obtain structured decisions, including: aggregating, standardizing, cross-modal data association, and performing real-time feature engineering on the multi-dimensional telemetry data to generate structured state feature vectors; obtaining business objectives based on triggering conditions, dynamically fusing the structured state feature vectors with the business objectives to generate prompts; and calling the large-scale language model to perform reasoning based on the prompts to generate structured decisions.

[0013] Further, optionally, aggregating, standardizing, performing cross-modal data association on, and performing real-time feature engineering on the multi-dimensional telemetry data to generate structured state feature vectors includes: receiving and storing all multi-dimensional telemetry data; cleaning multi-dimensional telemetry data from different sources and in different formats, and standardizing the cleaned data to obtain standardized multi-dimensional telemetry data; performing cross-modal data association on the standardized multi-dimensional telemetry data by extracting standardized metadata tags from all multi-dimensional telemetry data, using the standardized metadata tags as spatial dimension association keys, and employing a rolling time window for time dimension alignment, so that metric data and log data from different data sources are associated and fused in real time to obtain fused multi-dimensional telemetry data; and performing real-time feature engineering on the fused multi-dimensional telemetry data to obtain log semantic clustering vectors and / or unit cost SLA return features. Specifically, the log semantic clustering vectors are generated by real-time pattern recognition, log template extraction, and frequency statistics of error logs within a time window, and the unit cost SLA return is calculated using a preset formula. A structured state feature vector is then generated based on the log semantic clustering vectors and / or unit cost SLA return features.

[0014] Optionally, based on the triggering conditions, the business objectives are obtained, and the structured state feature vector is dynamically fused with the business objectives to generate a prompt word. This includes: if the triggering conditions include a scheduled task or an alarm event, obtaining the currently effective business objectives from the configuration library; and dynamically and structurally filling the business objectives and structured state feature vectors into a preset Prompt template to obtain the prompt word.

[0015] Optionally, invoking a large language model to perform reasoning based on the prompt word and generate structured decisions includes: invoking a large language model to perform a reasoning task based on the prompt word to generate structured decisions containing decision reasons and specific operation instructions; and verifying the structured decisions using syntax and schema verification rules, semantic logic verification rules, security boundary and circuit breaker verification rules, and debouncing verification rules in a secure output parser to obtain verified structured decisions.

[0016] Optionally, translating the structured decision into configuration instructions for the target platform and executing the configuration instructions includes: parsing the structured decision to obtain its type; configuring the corresponding adapter based on the type of the structured decision; translating the structured decision into configuration instructions for the target platform through the adapter; executing changes to the infrastructure in a transactional or idempotent manner according to the configuration instructions, and obtaining the execution result, wherein, for complex failover operations, multiple adapters are coordinated to ensure the atomicity of the operation; and feeding the execution result back or recording it in the audit log.
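The adapter selection and idempotent execution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `ADAPTERS`, `weight_adapter`, `applied_state`, `audit_log`, and the `set_weight` instruction shape are hypothetical names, the decision dictionary follows the output schema shown in the Prompt examples of this description, and an in-memory dictionary stands in for the target platform.

```python
from typing import Callable, Dict, List

audit_log: List[dict] = []          # hypothetical audit sink
applied_state: Dict[str, int] = {}  # stand-in for the target platform's current weights


def weight_adapter(payload: List[dict]) -> List[dict]:
    """Translate a weight_adjustment payload into platform-neutral instructions."""
    return [
        {"op": "set_weight", "target": p["cluster_id"], "value": p["weight"]}
        for p in payload
    ]


# One adapter per decision type; only one type is sketched here.
ADAPTERS: Dict[str, Callable[[List[dict]], List[dict]]] = {
    "weight_adjustment": weight_adapter,
}


def execute(decision: dict) -> dict:
    """Pick the adapter by decision type, apply idempotently, and audit the result."""
    adapter = ADAPTERS.get(decision["action_type"])
    if adapter is None:
        result = {"status": "rejected", "changed": []}
    else:
        changed = []
        for instruction in adapter(decision["payload"]):
            # Idempotent: re-applying an instruction that is already in effect is a no-op.
            if applied_state.get(instruction["target"]) != instruction["value"]:
                applied_state[instruction["target"]] = instruction["value"]
                changed.append(instruction["target"])
        result = {"status": "applied", "changed": changed}
    audit_log.append({"action_type": decision["action_type"], **result})
    return result
```

Coordinating several adapters atomically for a complex failover is not covered by this sketch; it would need a rollback or two-phase step around the loop.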

[0017] Optionally, the method further includes: storing business policies and providing a decision-making basis; wherein storing business policies includes: defining and modifying global business objectives and SLA constraints through a configured interface; and providing a decision-making basis includes: reading, at decision time, the highest principle that should currently be followed as the decision-making basis.

[0018] This invention provides a multi-cluster data processing method and system based on a large language model. The method involves collecting multi-dimensional telemetry data from managed clusters; inputting the multi-dimensional telemetry data, business objectives, and dynamically generated prompts into a large language model to obtain structured decisions; translating the structured decisions into configuration instructions for the target platform; and executing these configuration instructions. This enables global intelligent optimization and accurate failover decisions, thereby improving data processing efficiency.

Attached Figure Description

[0019] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:
Figure 1 is a schematic diagram of a multi-cluster data processing system based on a large language model provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of another multi-cluster data processing system based on a large language model provided in an embodiment of the present invention;
Figure 3 is a flowchart of a multi-cluster data processing method based on a large language model provided in an embodiment of the present invention;
Figure 4 is a flowchart of another multi-cluster data processing method based on a large language model provided in an embodiment of the present invention.

Detailed Implementation

[0020] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0021] It should be noted that the terms "first," "second," etc., in the specification, claims, and drawings of this invention are used to distinguish different objects, rather than to limit a specific order.

[0022] It should also be noted that the various embodiments of the present invention described below can be executed individually or in combination with each other, and the embodiments of the present invention do not impose specific limitations in this regard.

[0023] Technical terms used in the embodiments of this application: Multi-Cluster / Multi-Cloud: This refers to deploying application services across two or more independent computing clusters. These clusters can be located in different geographical regions, different cloud service providers (such as AWS, Azure), or on-premises data centers.

[0024] Service Mesh: An infrastructure layer (for example, Istio) used to handle communication between services. It enables fine-grained control, observation, and security management of inter-service traffic by deploying a lightweight network proxy (Sidecar) alongside each service.

[0025] GSLB (Global Server Load Balancing): A traffic scheduling mechanism primarily based on DNS technology, which resolves user access requests to different data centers or server IPs according to preset policies (such as geographical location and server response time).

[0026] RTO (Recovery Time Objective): This refers to the maximum time required for an information system or business function to recover from a shutdown to normal operation after a disaster. A shorter RTO indicates higher system availability.

[0027] AIOps (AI for IT Operations): Intelligent operations and maintenance. Applying artificial intelligence (especially machine learning) to the field of IT operations and maintenance to solve problems and improve efficiency through automation.

[0028] Prompt: A user-input instruction, question, or contextual description when interacting with a large language model (LLM). A well-designed prompt can guide the model to produce more accurate and relevant output.

[0029] CRD (Custom Resource Definition): A custom resource definition in Kubernetes that allows users to create their own resource types. It is often used to extend the functionality of Kubernetes. For example, Istio uses CRDs to define traffic rules.

[0030] This invention provides a multi-cluster data processing system based on a large language model. Figure 1 is a schematic diagram of a multi-cluster data processing system based on a large language model provided in an embodiment of the present invention. As shown in Figure 1, the multi-cluster data processing system based on a large language model provided in this application embodiment includes: the distributed telemetry acquisition module 10, deployed in each managed cluster, which is used to collect multi-dimensional telemetry data; the telemetry aggregation and processing module 12, which is used to aggregate, standardize, perform cross-modal data association on, and perform real-time feature engineering on the multi-dimensional telemetry data to generate structured state feature vectors; the intelligent decision engine module 14, which is used to obtain business objectives based on triggering conditions, dynamically fuse the structured state feature vectors with the business objectives to generate a prompt word, call a large language model to perform reasoning based on the prompt word to generate structured decisions, and perform syntactic, semantic, and security logic verification on the structured decisions to obtain verified decision instructions; the policy executor module 16, which is used to translate the verified decision instructions into configuration instructions for the target platform and execute the configuration instructions; and the business objective and configuration management module 18, which is used to store business policies and provide a decision-making basis.

[0031] Optionally, the multi-dimensional telemetry data includes: metric data, log data, and link data.

[0032] Specifically, Figure 2 is a schematic diagram of another multi-cluster data processing system based on a large language model provided in an embodiment of the present invention. As shown in Figure 2, the multi-cluster data processing system based on a large language model provided in this application embodiment can be logically divided into a data plane and a control plane. The distributed telemetry acquisition module 10 is deployed in the data plane; the telemetry aggregation and processing module 12, the intelligent decision engine module 14, the strategy executor module 16, and the business objective and configuration management module 18 are deployed in the control plane. In this embodiment, the distributed telemetry acquisition module 10 is deployed as a set of agent programs, such as Prometheus Agent, Fluentd / Vector Agent, and OpenTelemetry Collector, in each managed service cluster (i.e., the data plane); for example, in business cluster A and business cluster B in Figure 2.

[0033] The functions and roles of the distributed telemetry acquisition module 10 include: Metrics collection: Real-time capture of infrastructure metrics, network metrics, and application-layer metrics exposed through the service mesh within the cluster.

[0034] Logs and event collection: Collect standard output logs, error logs, and Kubernetes events from application services.

[0035] Link tracing collection: Capture call chain data of distributed services.

[0036] That is, the distributed telemetry acquisition module 10 is used to collect multi-dimensional telemetry data. In this embodiment, the metric data in the multi-dimensional telemetry data includes infrastructure metrics, network metrics, and application layer metrics exposed through the service mesh. Infrastructure metrics include CPU and / or memory utilization; network metrics include inter-cluster network latency and / or packet loss rate; and application layer metrics include service request rate, error rate, and P95 / P99 latency.

[0037] Log data includes: standard output logs of the application service, error logs, and Kubernetes events. In a preferred example, Kubernetes events may include: Pod OOMKilled, CrashLoopBackOff.

[0038] Link data includes: call link data of distributed services.

[0039] In this embodiment, the distributed telemetry acquisition module 10 is the sensory nerve ending of the entire system. Through a secure and encrypted channel, it continuously pushes the acquired raw telemetry data to the telemetry aggregation and processing module 12 in the control plane or waits for it to be pulled.

[0040] The telemetry aggregation and processing module 12 in this embodiment consists of a highly available central time-series database, a central log system, and a stream processing engine. In this embodiment, the central time-series database can be M3DB or VictoriaMetrics, the central log system can be Loki, and the stream processing engine can be Apache Flink.

[0041] The functions and roles of the telemetry aggregation and processing module 12 include: Data aggregation and storage: As a unified entry point for all telemetry data, it receives and stores data from all distributed telemetry acquisition modules 10.

[0042] Data standardization and correlation: Cleaning and standardizing data from different sources and in different formats. More importantly, it enables cross-domain data correlation. For example, it can correlate a metric with an error rate of 5% with "database connection timeout" logs and CPU usage spikes generated by the same service within the same time window.

[0043] Real-time feature engineering: Using Flink jobs, raw data streams are computed in real time to generate high-level features that are more valuable for decision-making. For example, calculating the "moving average of error rate over the past 5 minutes", "cost and performance weighted score", "resource saturation prediction", etc., and writing these features to a cache (such as Redis) or pushing them directly to a message bus (such as Kafka).
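As an illustration of one such feature, the "moving average of error rate over the past 5 minutes" can be sketched in plain Python. This is only a single-process stand-in for what a Flink job would compute; the class name `MovingErrorRate`, the sample layout (timestamp, request count, error count), and the choice of a request-weighted average are assumptions made for the example.

```python
from collections import deque
from typing import Deque, Tuple


class MovingErrorRate:
    """Request-weighted error rate over a sliding time horizon (default 5 minutes)."""

    def __init__(self, horizon_s: float = 300.0) -> None:
        self.horizon_s = horizon_s
        # Each sample is (timestamp in seconds, request count, error count).
        self._samples: Deque[Tuple[float, int, int]] = deque()

    def add(self, ts: float, requests: int, errors: int) -> None:
        """Record a sample and evict samples that fell out of the horizon."""
        self._samples.append((ts, requests, errors))
        while self._samples and self._samples[0][0] <= ts - self.horizon_s:
            self._samples.popleft()

    def value(self) -> float:
        """Errors divided by requests across all retained samples."""
        total = sum(requests for _, requests, _ in self._samples)
        errors = sum(errs for _, _, errs in self._samples)
        return errors / total if total else 0.0
```

The computed value would then be written to the cache or message bus mentioned above; that step is omitted here.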

[0044] The specific implementation mechanism of cross-modal data correlation in the data standardization and correlation embodiments of this application can be as follows: Real-time correlation between metrics and logs is achieved in a stream processing engine (such as Flink) through a multi-dimensional spatiotemporal alignment algorithm. The specific steps are as follows: Extraction and alignment of shared context information: Spatial Dimension: The system mandates that all telemetry data (whether Prometheus metrics or Loki logs) must carry a set of standardized metadata labels. These standardized metadata labels include: cluster_id, namespace, service_name, and pod_name. The stream processing engine uses the common label key-value pairs of these standardized metadata labels as "Join Keys" to merge data streams.

[0045] Temporal dimension: A tumbling window mechanism is employed. The system maps data streams from different data sources to the same time window. In this embodiment, the tumbling window mechanism can be configured with a window size of 1 minute.

[0046] Association logic: Triggering condition: When the error rate of a certain service in the metric stream exceeds a preset threshold (e.g., 1%) within a time window T, an "abnormal event object" is generated.

[0047] Log retrieval and matching: The stream processing engine uses the service_name and cluster_id of the object to filter out log entries marked LEVEL=ERROR and containing specific keywords in the log stream of the same time window T. In this embodiment, the specific keywords may include: “Exception”, “Timeout”, and “Refused”.

[0048] Fusion Output: The aggregated metric data and the filtered log summary are merged into a complete structured anomaly context (Contextualized Anomaly Object) as input for downstream decision-making. In a preferred example, the aggregated metric data can be an error rate of 5%, and the filtered log summary can be 50 occurrences of the log 'Connection refused'.
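The alignment and association steps above can be sketched in plain Python as a batch stand-in for the streaming join. The 1-minute tumbling window, the 1% threshold, the `LEVEL=ERROR` filter, and the keywords come from the text; the record classes, the function names, the output field names, and the case-insensitive keyword match are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

WINDOW_S = 60                # tumbling window size: 1 minute
ERROR_RATE_THRESHOLD = 0.01  # example threshold: 1%
KEYWORDS = ("Exception", "Timeout", "Refused")


@dataclass
class Metric:
    ts: int
    cluster_id: str
    service_name: str
    error_rate: float


@dataclass
class Log:
    ts: int
    cluster_id: str
    service_name: str
    level: str
    message: str


Key = Tuple[int, str, str]  # (window start, cluster_id, service_name)


def window_start(ts: int) -> int:
    """Map a timestamp to the start of its tumbling window."""
    return ts - ts % WINDOW_S


def correlate(metrics: List[Metric], logs: List[Log]) -> List[dict]:
    """Join metrics and logs on window and labels; emit anomaly context objects."""
    logs_by_key: Dict[Key, List[Log]] = defaultdict(list)
    for entry in logs:
        key = (window_start(entry.ts), entry.cluster_id, entry.service_name)
        logs_by_key[key].append(entry)

    anomalies = []
    for m in metrics:
        if m.error_rate <= ERROR_RATE_THRESHOLD:
            continue  # no abnormal event for this window
        key = (window_start(m.ts), m.cluster_id, m.service_name)
        matched = [
            entry.message
            for entry in logs_by_key.get(key, [])
            if entry.level == "ERROR"
            and any(k.lower() in entry.message.lower() for k in KEYWORDS)
        ]
        anomalies.append({
            "window_start": key[0],
            "cluster_id": m.cluster_id,
            "service_name": m.service_name,
            "error_rate": m.error_rate,
            "log_summary": dict(Counter(matched)),
        })
    return anomalies
```

In a stream processing engine the same logic would be expressed as a keyed window join, with the join key playing the role of `Key` here.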

[0049] In this embodiment of the application, Creative Real-time Feature Engineering enables the Large Model (LLM) to make decisions more efficiently. Instead of directly transmitting the raw data, the telemetry aggregation and processing module 12 generates high-dimensional decision-aid features through operator calculation.

[0050] Specific examples are as follows: Feature Example 1: Log Semantic Cluster Vector. Problem: The raw error logs are voluminous and noisy; feeding them directly into the LLM consumes a large number of tokens and interferes with inference.

[0051] Calculation logic: The stream processing engine performs real-time pattern recognition on the error logs within the window, removing specific timestamps and variable parameters, and extracting log templates (e.g., normalizing Error connecting to DB: 192.168.1.1 and Error connecting to DB: 192.168.1.2 into the template Error connecting to DB: <ip>).

[0052] Count the frequency of occurrence of each template.

[0053] Generated feature: {"DatabaseConnectionTimeout": 85, "NullPointerException": 12, "DiskFull": 3}.

[0054] Decision value: This feature allows the LLM to see through the nature of the failure at a glance—if the database timeout is the main cause, the LLM will decide to perform a "failover"; if the null pointer exception is the cause, the LLM will determine that this is a code bug, the failover will be ineffective, and thus suggest a "version rollback".
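The template extraction and frequency counting in Feature Example 1 can be sketched with regular expressions. The masking rules in `_MASKS` and the placeholders `<ts>`, `<ip>`, and `<num>` are hypothetical choices for this example; how templates are mapped to named classes such as "DatabaseConnectionTimeout" is not described in the text and is not attempted here.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical masking rules; order matters (timestamps and IPs before bare numbers).
_MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:\.\d+)?"), "<ts>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def to_template(line: str) -> str:
    """Replace variable parts of a log line with placeholders."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def template_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Count how often each log template occurs within a window."""
    return dict(Counter(to_template(line) for line in lines))
```

For the two sample lines in the text, both normalize to the single template `Error connecting to DB: <ip>` with a count of 2.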

[0055] Feature Example 2: Unit Cost SLA Yield. Problem: Looking only at "low cost" or "low latency" can easily lead to local optima and makes it impossible to evaluate cost-effectiveness.

[0056] Calculation logic: The system obtains, in real time for each cluster, the current P99 latency (normalized), the error rate, and the real-time Spot instance price returned by the cloud vendor's API, and applies the formula: Yield = (1 / (Normalized_Latency * (1 + Error_Rate))) / Normalized_Cost.

[0057] Generated characteristics: {"Cluster_AWS": 0.85, "Cluster_Azure": 1.2, "Cluster_Private_Cloud": 0.9}.

[0058] Decision Value: This feature directly quantifies "how much service quality you can buy for every penny spent". Based on this feature, LLM can prioritize traffic to the cluster with the highest "yield" (Cluster_Azure) while meeting the bottom line of SLA, rather than simply because it is the cheapest or fastest.
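The yield formula and the "highest yield subject to the SLA bottom line" selection can be sketched as follows. The formula is taken from the text; the function names, the guard against non-positive inputs, and the use of a P99 latency limit as the SLA check are assumptions for illustration. The normalization scheme is not specified in the text, so the inputs are taken as already normalized.

```python
from typing import Dict, Optional


def unit_cost_sla_yield(norm_latency: float, error_rate: float, norm_cost: float) -> float:
    """Yield = (1 / (Normalized_Latency * (1 + Error_Rate))) / Normalized_Cost."""
    if norm_latency <= 0 or norm_cost <= 0:
        raise ValueError("normalized latency and cost must be positive")
    return (1.0 / (norm_latency * (1.0 + error_rate))) / norm_cost


def pick_cluster(
    yields: Dict[str, float],
    p99_ms: Dict[str, float],
    sla_p99_ms: float,
) -> Optional[str]:
    """Return the highest-yield cluster among those that meet the latency SLA."""
    eligible = {c: y for c, y in yields.items() if p99_ms[c] < sla_p99_ms}
    return max(eligible, key=eligible.get) if eligible else None
```

With the yield values from the example above, `pick_cluster` selects Cluster_Azure as long as its P99 latency is within the limit; if it is not, the next-highest eligible cluster is chosen instead.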

[0059] In this application embodiment, the telemetry aggregation and processing module 12 is the central nervous system of the multi-cluster data processing system based on a large language model provided in this application embodiment. By receiving raw data from the distributed telemetry acquisition module 10 (i.e., the multi-dimensional telemetry data in this application embodiment), after processing, it provides the structured, more information-dense real-time feature data to the downstream intelligent decision engine module 14.

[0060] In this embodiment, the intelligent decision engine module 14 serves as the brain of the multi-cluster data processing system based on a large language model provided in this embodiment. It consists of a dynamic Prompt generator, a large language model (LLM) inference service, and a secure output parser. In this embodiment, the large language model inference service can be a public cloud API call or a locally deployed open-source model.

[0061] The functions and roles of the intelligent decision engine module 14 in this embodiment include: Context awareness: Based on the triggering conditions (timed tasks or alarm events), pull the latest global feature snapshot from the telemetry aggregation and processing module 12.

[0062] Intent understanding: Read the currently effective business objectives from the configuration library (e.g., "We are currently in a major promotional event, and the primary objective is to ensure system stability. A 30% increase in costs is acceptable").

[0063] Dynamic Prompt Construction: Real-time feature data, business objectives, historical decision records, and other contextual information are dynamically and structurally populated into a preset Prompt template.

[0064] Core reasoning: Calls the LLM service to execute reasoning tasks and generates structured JSON output containing decision reasons and specific operation instructions.

[0065] Security Validation: The secure output parser performs strict syntax, schema, and logic validation on the JSON returned by the LLM to ensure that its instructions are secure, valid, and executable, preventing model hallucinations from causing production accidents.
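A minimal sketch of such a secure output parser for a weight-adjustment decision is given below. It assumes the JSON output schema used in the Prompt examples of this description and a 20% per-step limit taken from the example constraint; `validate_decision`, `DecisionRejected`, the 0 to 100 weight range, and the rule that weights must sum to 100 are assumptions for illustration. Debouncing is not covered.

```python
import json
from typing import Dict, List

MAX_STEP = 20  # assumed safety boundary: max weight change per cluster per adjustment


class DecisionRejected(Exception):
    """Raised when the LLM output fails syntax, schema, or safety validation."""


def validate_decision(raw: str, current: Dict[str, int]) -> List[dict]:
    """Return the verified payload, or raise DecisionRejected."""
    # 1. Syntax: the output must be valid JSON.
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise DecisionRejected(f"invalid JSON: {exc}") from exc

    # 2. Schema: expected action type and a non-empty payload list.
    if not isinstance(decision, dict) or decision.get("action_type") != "weight_adjustment":
        raise DecisionRejected("unsupported or missing action_type")
    payload = decision.get("payload")
    if not isinstance(payload, list) or not payload:
        raise DecisionRejected("payload must be a non-empty list")

    seen = set()
    for item in payload:
        if not isinstance(item, dict):
            raise DecisionRejected("payload items must be objects")
        cid, weight = item.get("cluster_id"), item.get("weight")
        # 3. Semantics: only known clusters, each at most once, integer weights.
        if not isinstance(cid, str) or cid not in current or cid in seen:
            raise DecisionRejected(f"unknown or duplicate cluster: {cid!r}")
        if isinstance(weight, bool) or not isinstance(weight, int) or not 0 <= weight <= 100:
            raise DecisionRejected(f"invalid weight for {cid}: {weight!r}")
        # 4. Safety boundary: limit the size of a single adjustment.
        if abs(weight - current[cid]) > MAX_STEP:
            raise DecisionRejected(f"adjustment for {cid} exceeds {MAX_STEP}")
        seen.add(cid)

    # 5. Semantics: the resulting weights must still sum to 100 (assumed invariant).
    merged = {**current, **{i["cluster_id"]: i["weight"] for i in payload}}
    if sum(merged.values()) != 100:
        raise DecisionRejected("resulting weights must sum to 100")
    return payload
```

Rejecting the whole decision on the first violation keeps the parser fail-closed: an invalid output never reaches the executor.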

[0066] In this embodiment, the intelligent decision engine module 14 obtains the decision basis from the telemetry aggregation and processing module 12, and sends the JSON decision instruction inferred by the intelligent decision engine module 14 and verified by security to the strategy executor module 16.

[0067] In a preferred example, the preset Prompt template in the intelligent decision engine module 14 can be a dynamic Prompt template. The construction and examples of dynamic Prompt templates are as follows: The intelligent decision engine module 14 does not simply concatenate data, but employs structured prompt engineering. The system presets Prompt templates for different scenarios and renders real-time data into them at runtime through a template engine. In this embodiment, the different scenarios can be "normal optimization" and "fault emergency response", and the template engine can be Jinja2.
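The per-scenario template selection and runtime rendering can be sketched as follows. To keep the example dependency-free it uses Python's standard-library `string.Template` as a stand-in for Jinja2, which is a substitution made for this sketch only. The names `PROMPT_TEMPLATES`, `render_cluster_table`, and `build_prompt`, and the table columns, are hypothetical.

```python
from string import Template
from typing import Dict, List

# One preset template per scenario; only "normal_optimization" is sketched here.
PROMPT_TEMPLATES: Dict[str, Template] = {
    "normal_optimization": Template(
        "[System Role]\n"
        "You are a cloud-native traffic scheduling expert.\n\n"
        "[Context - Real-time Cluster State]\n"
        "$cluster_table\n\n"
        "[Context - Business Policy]\n"
        "Current effective policy ID: $policy_id\n"
        "Strategy Description: $policy_description\n"
    ),
}


def render_cluster_table(rows: List[Dict[str, object]]) -> str:
    """Render feature rows as a plain-text table (columns are assumed)."""
    header = "cluster_id | p99_ms | error_rate"
    lines = [f"{r['cluster_id']} | {r['p99_ms']} | {r['error_rate']}" for r in rows]
    return "\n".join([header, *lines])


def build_prompt(scenario: str, rows: List[Dict[str, object]], policy: Dict[str, str]) -> str:
    """Select the scenario template and fill in real-time data and the policy."""
    return PROMPT_TEMPLATES[scenario].substitute(
        cluster_table=render_cluster_table(rows),
        policy_id=policy["id"],
        policy_description=policy["description"],
    )
```

`substitute` raises `KeyError` when a placeholder is left unfilled, so an incomplete context fails before any prompt is sent to the model.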

[0068] In a preferred example, specific Prompt construction examples for the core scenarios are as follows: Example A: Multi-objective traffic optimization Prompt under normal conditions. Design logic: Emphasize data comparison and the trade-off in business intent (SLA vs. cost).

[0069] Template content: [System Role] You are a cloud-native traffic scheduling expert. Your task is to output the optimal traffic weight allocation strategy based on the current cluster status and business objectives.

[0070] [Context - Real-time Cluster State] Current time window: 2024-05-20 10:00:00 The cluster data table is shown in Table 1 (generated by Flink in real time). Table 1

[0071] [Context - Business Policy] Current effective policy ID: policy_cost_efficient_v1 Strategy Description: Minimize total traffic cost while ensuring that the P99 latency of all traffic is below 200ms.

[0072] Constraints: A single adjustment must not change a weight by more than 20%, to prevent traffic oscillation.

[0073] [Instruction] 1. Analyze the above data and calculate which weight combination can maximize the satisfaction of the strategy.

[0074] 2. Note that gcp-asia's latency has exceeded the limit (250ms > 200ms), and its traffic should be reduced.

[0075] 3. The output must be in strict JSON format.

[0076] [Output Schema] { "reasoning": "A brief analysis process", "action_type": "weight_adjustment", "payload": [ {"cluster_id": "string", "weight": int} ] }
Example B: Root cause analysis and emergency response prompt in a failure scenario.
Design logic: Introduce unstructured log summaries and the "log semantic clustering" produced by feature engineering to guide the model in root cause diagnosis.

[0077] Template content: [System Role] You are an SRE emergency commander. A critical system malfunction has been detected. Please make an immediate decision.

[0078] [Trigger Event] Alert: The error rate of the AWS-East cluster's PaymentService has surged to 15% (threshold 1%).

[0079] [Context - Log Analysis (Last 1 min)] Log semantic clustering results:
1. "Connection refused to database-primary" (Count: 450) - 90%
2. "Upstream timeout" (Count: 50) - 10% of the total
[Context - Alternate Clusters] azure-west: Healthy (Error Rate: 0.01%, Remaining Capacity: Sufficient)
[Instruction] 1. Determine the root cause of the failure. If it's a code bug (such as a null pointer exception), traffic switching may be ineffective; if it's an infrastructure / dependency failure (such as a database connection failure), switch traffic immediately.

[0080] 2. The current log shows a database connection refused error, which is likely due to an infrastructure issue. Switching to another system is recommended.

[0081] 3. Generate decision JSON.

[0082] [Output Schema] { "root_cause_analysis": "string", "action_type": "failover" | "rollback" | "no_op", "target_cluster": "string", "fallback_cluster": "string" }
In a preferred example, the specific rules for Secure Parsing & Verification are as follows: To prevent "hallucinations" of the large model from leading to unintended destructive actions, the intelligent decision engine module 14 incorporates a deterministic validator. This validator, written in conventional code (such as Python / Go), enforces a set of rules that cannot be bypassed. Only decisions that pass all the rules are issued to the executor.

[0083] The specific verification rules are as follows:
Rule 1: Syntactic & Schema Validation
Implementation: Use a JSON Schema library to parse the string returned by the LLM.

[0084] Checkpoints: It must be a valid JSON object.

[0085] It must include required fields such as action_type, payload, and reasoning.

[0086] The field type must be correct (e.g., weight must be an Integer, not a string or floating-point number).
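The Rule 1 checkpoints can be sketched as below. The text calls for a JSON Schema library; to stay self-contained this illustration hand-writes the equivalent checks with the standard `json` module, and the function name `validate_schema` and the exact field table are assumptions based on the Example A output schema.

```python
import json

# Required top-level fields and their expected JSON types (from Example A).
REQUIRED_FIELDS = {"reasoning": str, "action_type": str, "payload": list}


def validate_schema(raw: str) -> dict:
    """Parse the LLM output and enforce the Rule 1 checkpoints."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(decision, dict):
        raise ValueError("top-level value must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(decision.get(field), expected):
            raise ValueError(f"missing or mistyped field: {field}")
    for entry in decision["payload"]:
        if not isinstance(entry, dict) or not isinstance(entry.get("cluster_id"), str):
            raise ValueError("each payload entry needs a string cluster_id")
        weight = entry.get("weight")
        # bool is a subclass of int in Python, so it is excluded explicitly;
        # strings and floats fail the isinstance check.
        if isinstance(weight, bool) or not isinstance(weight, int):
            raise ValueError("weight must be an integer")
    return decision
```

Any failure raises `ValueError`, so a caller can treat every Rule 1 violation uniformly and refuse to forward the decision.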

[0087] Rule 2: Semantic Logic Validation
Implementation: Write logic code to check the rationality of numerical relationships.

[0088] Checkpoints: Weight normalization check: For traffic allocation scenarios, the sum of the weights of all target clusters must be strictly equal to 100. If the model outputs {A: 60, B: 30} (total 90), the validator will either reject it directly or automatically normalize it.

[0089] Entity existence check: The cluster_id must exist in the system's "Known Clusters Registry". This prevents the model from fabricating a non-existent cluster (such as "aliyun-mars"), which could cause downstream API calls to fail.
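A minimal sketch of the two Rule 2 checkpoints follows. The registry contents, the function name `check_semantics`, and the largest-remainder normalization are assumptions made for illustration; the text only states that the validator either rejects or normalizes a payload whose weights do not sum to 100.

```python
def check_semantics(payload, known_clusters, normalize=False):
    """Entity existence check plus weight normalization check."""
    unknown = [e["cluster_id"] for e in payload if e["cluster_id"] not in known_clusters]
    if unknown:
        raise ValueError(f"unknown clusters: {unknown}")
    if any(e["weight"] < 0 for e in payload):
        raise ValueError("weights must be non-negative")

    total = sum(e["weight"] for e in payload)
    if total == 100:
        return payload
    if not normalize or total == 0:
        raise ValueError(f"weights sum to {total}, expected 100")

    # Largest-remainder scaling in integer arithmetic, so the normalized
    # weights stay integers and sum to exactly 100.
    floors = [e["weight"] * 100 // total for e in payload]
    remainders = [e["weight"] * 100 % total for e in payload]
    leftover = 100 - sum(floors)
    by_remainder = sorted(range(len(payload)), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        floors[i] += 1
    return [{"cluster_id": e["cluster_id"], "weight": w} for e, w in zip(payload, floors)]
```

With the {A: 60, B: 30} case from the text, rejection is the default; with `normalize=True` the same payload is rescaled to 67 and 33.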

[0090] Rule 3: Safety Bounding & Circuit Breaking
Implementation: Based on preset safety threshold policies.

[0091] Checkpoints: Step Limit: In non-emergency mode, limits the magnitude of a single traffic weight adjustment (e.g., a single change cannot exceed ±20%). If the model suggests jumping directly from 0% to 80%, the validator will correct it to 20% or reject it to prevent traffic surges from overwhelming a cold-start cluster.

[0092] Minimum retention traffic limit: Prevents setting the weight of all clusters to 0, or setting the weight of the only healthy cluster to 0.

[0093] High-risk operation blocking: For high-risk commands such as action_type = "failover" (full switch) or action_type = "block_traffic", if the current confidence score (which can be output by the model) is lower than the threshold, or if it is not in "emergency alarm state", the validator will block the command and switch to "manual approval mode".
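The three Rule 3 checkpoints might be combined as in the sketch below. The function name, the return labels, and the 0.8 confidence threshold are hypothetical. The text allows an oversized step to be either corrected or rejected; this sketch rejects, because clamping a single weight would break the sum-to-100 invariant unless the other weights are rebalanced.

```python
MAX_STEP = 20                              # maximum change per adjustment, in percentage points
HIGH_RISK = {"failover", "block_traffic"}  # action types named in the text


def apply_safety_bounds(current, proposed, healthy, action_type,
                        confidence, emergency, min_confidence=0.8):
    """Return (verdict, weights). min_confidence is a hypothetical threshold."""
    # High-risk operation blocking: low confidence, or not in the
    # emergency alarm state, hands the decision to a human.
    if action_type in HIGH_RISK and (confidence < min_confidence or not emergency):
        return "manual_approval", current
    # Minimum retention traffic: at least one healthy cluster keeps traffic.
    if not any(proposed.get(c, 0) > 0 for c in healthy):
        return "rejected", current
    # Step limit applies only in non-emergency mode.
    if not emergency:
        for cluster, new_weight in proposed.items():
            if abs(new_weight - current.get(cluster, 0)) > MAX_STEP:
                return "rejected", current
    return "accepted", proposed
```

Returning the unchanged `current` weights on every non-accepted verdict keeps the caller's fallback trivial: the live configuration is simply left as it is.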

[0094] Rule 4: Anti-Flapping Validation
Implementation: Maintain a sliding window that holds the N most recent decisions.

[0095] Checkpoint: If the system is detected to be switching the same cluster back and forth ("A to B" and then "B to A", the Ping-Pong effect) within a short period (e.g., within 10 minutes), the validator forcibly pauses scheduling and maintains the current state, avoiding system oscillations caused by the model's sensitivity to values near a threshold.
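One way to sketch the sliding-window check is shown below. The class name `FlapGuard`, the limit of two switches per pair, and the history length are assumptions; the 600-second window mirrors the 10-minute example in the text. Timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
from collections import deque


class FlapGuard:
    """Blocks a switch when the same pair of clusters has already been
    switched back and forth within the window (the Ping-Pong effect)."""

    def __init__(self, window_seconds=600, max_switches=2, history=20):
        self.window = window_seconds
        self.max_switches = max_switches       # hypothetical limit per pair
        self.moves = deque(maxlen=history)     # the N most recent decisions

    def allow(self, source, target, now):
        pair = frozenset((source, target))
        recent = [
            m for m in self.moves
            if now - m[0] <= self.window and frozenset(m[1:]) == pair
        ]
        if len(recent) >= self.max_switches:
            return False                       # pause scheduling, keep current state
        self.moves.append((now, source, target))
        return True
```

After "A to B" and "B to A" inside the window, a third switch between A and B is refused until the earlier moves age out of the window.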

[0096] In this embodiment, the strategy executor module 16 consists of multiple adapters for different infrastructure platforms, such as a Kubernetes adapter, an AWS Route 53 adapter, and a Cloudflare API adapter.

[0097] The functions of the strategy executor module 16 in this embodiment include: Command parsing: Receives JSON commands from the intelligent decision engine module 14 and parses their action_type (such as update_weights, failover) and action_payload.

[0098] Instruction translation: Based on the instruction type, select the appropriate adapter to translate abstract instructions into specific API calls or configuration files that the target platform can understand. For example, {"cluster": "A", "weight": 70} is translated into a modification to the YAML of an Istio VirtualService resource.

[0099] Atomic execution: Executes changes to infrastructure in a transactional or idempotent manner. For complex failover operations, it coordinates multiple adapters (e.g., changing DNS first, then the service mesh) to ensure the atomicity of the operation.

[0100] Status feedback: After execution is complete, the execution result (success / failure) will be fed back to the intelligent decision engine module or recorded in the audit log.
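The instruction translation step can be illustrated with a small adapter function that turns abstract weights into an Istio VirtualService manifest, expressed here as a Python dictionary. The service name, the use of the cluster name as the destination subset (which presumes a matching DestinationRule), and the helper name `to_virtual_service` are assumptions; the `spec.http[].route[].destination` and `weight` fields follow the Istio VirtualService resource.

```python
def to_virtual_service(service, weights):
    """Translate [{'cluster': ..., 'weight': ...}] into a VirtualService dict."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service},
        "spec": {
            "hosts": [service],
            "http": [{
                "route": [
                    {
                        # Assumption: each cluster is exposed as a subset
                        # defined in a DestinationRule for this host.
                        "destination": {"host": service, "subset": w["cluster"]},
                        "weight": w["weight"],
                    }
                    for w in weights
                ]
            }],
        },
    }
```

An adapter for another platform (for example DNS weights) would accept the same abstract input and emit that platform's own structure, which is what lets the executor select adapters by instruction type.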

[0101] Connection relationship: In this embodiment of the application, the strategy executor module 16 acts as the hands of the multi-cluster data processing system based on a large language model. It receives commands from the intelligent decision engine module 14 and performs the actual configuration operations on the traffic control components (such as the service mesh and DNS) in the data plane.

[0102] The specific technical implementation of atomic execution in the strategy executor module 16 of this application embodiment is as follows: Because the calls involve heterogeneous platforms (such as the AWS Route 53 API and the Kubernetes API), there is no single database-level ACID transaction support. The strategy executor module 16 achieves application-layer atomicity by building a state-machine-driven transaction coordinator. This involves three core technical methods:
Method 1: Distributed Mutex Locking
Objective: To ensure isolation and prevent multiple decision commands (such as automatic scheduling and manual maintenance) from modifying the same set of resources simultaneously.

[0103] Implementation: Before making any changes, the executor first attempts to acquire a distributed lock based on Target_Service_ID in a strongly consistent store (such as etcd or Redis).

[0104] Logic: If Lock(Service_A) succeeds, the task proceeds and releases the lock when it ends; otherwise, the task is queued or discarded.

[0105] Method 2: Configuration Snapshotting & Versioning Purpose: To provide a defined "origin" for rollback.

[0106] Implementation: Before performing any write operation (WRITE), the executor first performs a read operation (READ).

[0107] Call the APIs of each platform to obtain the current configuration status, such as the current Route53 weight record and the current Istio VirtualService YAML.

[0108] These current states are serialized into "rollback snapshots" and stored in memory or a temporary database, marked as Pre_Transaction_State.

[0109] Method 3: Ordered Execution Pipeline & Dry-Run Objective: To reduce the probability of runtime errors.

[0110] Implementation: Dry-Run (pre-execution): First, call the DryRun interface of the cloud platform API (if supported) or perform local schema validation to ensure that the generated configuration is syntactically valid.

[0111] Dependency ordering: Change operations are constructed as a directed acyclic graph (DAG). For example, in a failover scenario, the system enforces a defined execution order: scale up the standby cluster, perform a health check for confirmation, and then modify traffic weights. Subsequent steps are only executed if the preceding steps are successful.
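The dependency ordering in Method 3 can be sketched with the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The step names and the `run_pipeline` helper are illustrative; each step is modeled as a callable that returns True on success.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps):
    """steps: name -> callable returning bool.
    deps:  name -> set of step names that must succeed first.
    Returns (completed_steps, failed_step_or_None)."""
    completed = []
    # static_order() yields every step after all of its prerequisites.
    for name in TopologicalSorter(deps).static_order():
        if not steps[name]():
            return completed, name   # later steps are never executed
        completed.append(name)
    return completed, None


# Failover ordering from the text: scale up, confirm health, then shift traffic.
FAILOVER_DEPS = {
    "scale_up_standby": set(),
    "health_check": {"scale_up_standby"},
    "shift_traffic": {"health_check"},
}
```

The list of completed steps returned on failure is exactly what a rollback routine needs in order to undo changes in reverse order.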

[0112] In this embodiment, the rollback triggering conditions and specific rollback operations in the policy executor module 16 are as follows: The strategy executor module 16 adopts the "Compensating Transaction" mode. Once a transaction enters the "failure" state, the reverse operation process is immediately initiated.

[0113] Rollback trigger conditions: API call failure: Any downstream API returns a 4xx (client error, such as configuration conflict) or 5xx (server error) status code.

[0114] Network timeout: The API call did not respond within a preset time (e.g., 5 seconds).

[0115] Post-Verification Failure: After the change is issued, the executor will immediately initiate a round of active probing.

[0116] For example, after modifying the weight, immediately initiate a query to the DNS server. If the DNS record is not updated to the expected value within 30 seconds (possibly due to excessive DNS propagation latency), or if modifying the Istio configuration causes an error in the service mesh control plane (Istio pushes errors), then a rollback is triggered.

[0117] Details of rollback operations: Mechanism: The executor reads the Pre_Transaction_State snapshot saved in Method 2.

[0118] Reverse execution process: Pause: Immediately halt any subsequent unexecuted change steps.

[0119] Reverse recovery: Restores the configuration to the values in the snapshot, in reverse order of execution.

[0120] Example: If the Istio configuration of Cluster A has already been modified but the modification of Cluster B fails, the rollback operation calls the Istio API to overwrite the configuration of Cluster A with the old version in the snapshot.

[0121] Clean up: Delete temporary configurations in intermediate states.

[0122] Alert: Output a critical log message "Transaction failed and rolled back" to the console and notify operators for manual intervention.
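The snapshot-then-compensate flow can be sketched as follows. The `read` and `write` callables stand in for platform API calls and are hypothetical, as is the helper name; the sketch assumes that writing the snapshot value back is a valid compensating action and does not handle a failure of the rollback write itself.

```python
def execute_with_rollback(read, write, changes):
    """changes: ordered list of (resource, new_config) pairs."""
    # READ before WRITE: capture the Pre_Transaction_State snapshot.
    snapshot = {resource: read(resource) for resource, _ in changes}
    applied = []
    try:
        for resource, config in changes:
            write(resource, config)
            applied.append(resource)
    except Exception as exc:
        # Pause remaining steps, then restore in reverse order of execution.
        for resource in reversed(applied):
            write(resource, snapshot[resource])
        return False, f"Transaction failed and rolled back: {exc}"
    return True, "committed"
```

In the Cluster A / Cluster B example from the text, the write to Cluster B raises, `applied` contains only Cluster A, and Cluster A is overwritten with its snapshot value.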

[0123] The Business Objective & Configuration Management Module 18 in this embodiment can be a simple database with API, or it can be a Git repository (GitOps mode).

[0124] The roles and functions of the business objectives and configuration management module 18 in this embodiment include: Store business strategy: Provides an interface for operations and maintenance personnel to define and modify global business objectives and SLA constraints.

[0125] Provide decision-making basis: When making decisions, the intelligent decision engine module 14 reads from the business objectives and configuration management module 18 the highest principle that should currently be followed.

[0126] Connection relationship: The business objectives and configuration management module 18 provides business-level input to the intelligent decision engine module 14 in a one-way manner.

[0127] In summary, the collaborative process of the modules in the multi-cluster data processing system based on a large language model provided in this application embodiment is as follows: data is collected by the distributed telemetry acquisition module 10 (the senses), the information is processed by the telemetry aggregation and processing module 12 (the nerve center), the intelligent decision engine module 14 (the brain) combines business objectives to reason and make decisions, the strategy executor module 16 (the hands) puts the decisions into action, and the results of the actions are observed again by the distributed telemetry acquisition module 10, forming a complete and intelligent closed feedback control loop.

[0128] The multi-cluster data processing system based on a large language model provided in this application can operate in two modes: periodic optimization (routine patrol) and event-driven emergency response (rapid response). More importantly, it constructs a rigorous priority preemption and coordination mechanism to ensure optimal performance under normal conditions and survival under fault conditions. The event-driven emergency mode is not triggered by arbitrary alarms; it is activated by the intelligent decision engine module 14 after it identifies specific high-priority signals from the telemetry aggregation and processing module 12. Specific trigger event types include:
Type A: Infrastructure Hard Failure
Source: P0-level alerts directly pushed by the monitoring system (Prometheus / CloudWatch).

[0129] Examples: Cluster_Down (Kubernetes API Server becomes unresponsive), Region_Offline (Cloud vendor availability zone is unavailable), Network_Partition (Cross-cluster network connectivity probe fails).

[0130] Type B: Soft Failure due to violation of core SLA
Source: Real-time stream computation results from the telemetry aggregation module.

[0131] Example: The global P99 latency of a core service exceeds 500ms for two consecutive time windows (e.g., 2 minutes), and the error rate rises simultaneously to exceed the threshold (e.g., 5%).

[0132] Type C: Predictive Anomaly based on Log Patterns
Source: "Cross-modal correlation" analysis of the telemetry aggregation module.

[0133] Example: Before the metrics trigger an alert, the stream processing engine detects a dramatic increase in the frequency of fatal error log patterns related to "Out of Memory" or "Deadlock" (an increase of >100 entries per second).

[0134] In order to prevent the system from being inconsistent due to the simultaneous operation of the two modes, the priority determination and conflict resolution mechanism in this application adopts a collaborative strategy of "emergency priority, high-priority preemption, and automatic cooling".

[0135] Priority is defined as follows: Level 1 (Highest): Event-driven emergency mode (goal: minimize RTO).

[0136] Level 2 (Normal): Periodic optimization mode (goal: cost / performance balance).

[0137] The mutual exclusion and preemption logic in this embodiment is as follows:
Scenario 1: Idle state. When either mode is triggered, it attempts to acquire the distributed lock (Key=Global_Traffic_Lock). If successful, it executes the command and releases the lock upon completion.

[0138] Scenario 2: An unexpected failure occurs during periodic optimization.
Preemption: When the system detects that a Level 1 event has been triggered and the lock is currently held by a Level 2 task, the decision engine immediately sends a "SIGTERM / Abort" signal to the running optimization task.

[0139] Forced termination: The periodic optimization task is forcibly terminated, and any uncommitted transactions are rolled back.

[0140] Takeover: The emergency mode immediately acquires the lock and executes the failover logic.

[0141] Reason: Optimization is for saving money, while failover is for saving lives, and saving lives is more important than saving money.

[0142] Scenario 3: A periodic task is triggered while the emergency mode is in progress.
Skip: When the periodic task scheduler detects that the lock is held by a Level 1 task, or that the system is in an "emergency cooldown period", it skips the current scheduling round and does not queue.

[0143] Reason: During fault handling or immediately after recovery, the system is in an unstable state and routine traffic fine-tuning should not be performed.

[0144] The automatic cooling mechanism in this embodiment is as follows: After the "Event-Driven Emergency Mode" is completed, the system will automatically enter a cooling-down window (e.g., 30 minutes).

[0145] During the cooldown period, the periodic optimization mode is disabled.

[0146] Objective: To prevent systems that have just taken over traffic (such as newly expanded standby clusters) from being misjudged as "poorly performing" by the periodic optimization algorithm because they have not yet warmed up (Cold Start), thus avoiding unnecessary adjustments and system instability.
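The preemption, skip, and cooldown rules can be condensed into the single-process sketch below. It stands in for the distributed lock (Global_Traffic_Lock) with an in-memory field; the class name, return labels, and the "busy" outcome for an emergency request that arrives while another emergency holds the lock are assumptions, since the text does not describe that case.

```python
EMERGENCY, PERIODIC = 1, 2   # Level 1 and Level 2 from the priority definition


class ModeArbiter:
    def __init__(self, cooldown_seconds=1800):   # 30-minute example window
        self.holder = None                       # who holds the lock
        self.cooldown = cooldown_seconds
        self.cooldown_until = 0.0

    def request(self, level, now):
        if level == EMERGENCY:
            if self.holder == EMERGENCY:
                return "busy"                    # not specified in the text
            preempted = self.holder == PERIODIC
            self.holder = EMERGENCY              # takeover
            return "preempt" if preempted else "run"
        # Periodic task: skip (never queue) if locked or cooling down.
        if self.holder is not None or now < self.cooldown_until:
            return "skip"
        self.holder = PERIODIC
        return "run"

    def release(self, now):
        if self.holder == EMERGENCY:
            self.cooldown_until = now + self.cooldown
        self.holder = None
```

A "preempt" result is the point at which a real system would abort the running optimization task and roll back its uncommitted transaction before the failover logic starts.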

[0147] The multi-cluster data processing system based on a large language model provided in this application can dynamically integrate multi-source, multi-modal real-time telemetry data with business objectives into a structured prompt for LLM reasoning; it provides a complete, closed-loop control process from data acquisition, intelligent decision-making, policy generation to automated execution. In particular, it seamlessly integrates and drives infrastructure changes by verifying the LLM's decision outputs through security checks; it can parse unstructured or semi-structured data such as logs and traces, and use them as key context for decision-making—something traditional solutions struggle to achieve.

[0148] As can be seen from the above, the multi-cluster data processing system based on a large language model provided in this application, by aggregating multi-dimensional data in real time and delivering it to LLM analysis, finds a dynamic balance point among multiple conflicting objectives such as latency, cost, and resources, achieving true global intelligent optimization, which is unattainable by existing single-rule systems. Secondly, it can understand unstructured data such as logs, thus enabling root cause inference and accurate failover decisions when facing unknown or complex faults, rather than simply responding to a superficial latency alarm. This makes fault recovery no longer dependent on contingency plans, but on real-time intelligent analysis, thereby significantly reducing RTO from hours to minutes or even seconds. Finally, through Prompt and business objective configuration, operations personnel can manage system policies as if communicating with people, greatly reducing operational complexity and labor costs, and giving the system unprecedented flexibility and adaptability.

[0149] This invention provides a multi-cluster data processing system based on a large-scale language model. A distributed telemetry acquisition module, deployed across managed clusters, collects multi-dimensional telemetry data. A telemetry aggregation and processing module aggregates, standardizes, performs cross-modal data association, and real-time feature engineering on the multi-dimensional telemetry data to generate structured state feature vectors. An intelligent decision engine module, based on triggering conditions, acquires business objectives, dynamically fuses the structured state feature vectors with the business objectives, generates a prompt, calls a large-scale language model to perform reasoning based on the prompt, generates structured decisions, and performs syntactic, semantic, and security logic verification on the structured decisions to obtain verified decision instructions. A policy executor module translates the verified decision instructions into configuration instructions for the target platform and executes the configuration instructions. A business objective and configuration management module stores business strategies and provides decision-making basis, thereby achieving global intelligent optimization and accurately proposing failover decisions, improving data processing efficiency.

[0150] This invention provides a multi-cluster data processing method based on a large language model. Figure 3 is a flowchart of a multi-cluster data processing method based on a large language model provided by an embodiment of the present invention. As shown in Figure 3, the method is applied to the multi-cluster data processing system based on a large language model shown in Figure 1 or Figure 2 and includes the following steps:
Step S302: Collect multi-dimensional telemetry data of the managed cluster; Optionally, step S302 includes: collecting indicator data, log data, and link data of the managed cluster; generating multi-dimensional telemetry data based on the indicator data, log data, and link data; wherein the indicator data includes: infrastructure indicators, network indicators, and application layer indicators exposed through the service mesh; the log data includes: standard output logs, error logs, and Kubernetes events; and the link data includes: call link data of distributed services.

[0151] In this embodiment of the application, step S302 corresponds to the distributed telemetry acquisition module 10 in the multi-cluster data processing system based on a large language model shown in Figure 1 or Figure 2.

[0152] Step S304: Input the multi-dimensional telemetry data, the business objectives, and the dynamically generated prompt into a large language model to obtain structured decisions; Optionally, step S304 includes: aggregating, standardizing, performing cross-modal data association on, and performing real-time feature engineering on the multi-dimensional telemetry data to generate structured state feature vectors; obtaining business objectives based on triggering conditions and dynamically fusing the structured state feature vectors with the business objectives to generate a prompt; and calling the large language model to perform reasoning based on the prompt to generate structured decisions.

[0153] Further, optionally, the multi-dimensional telemetry data is aggregated, standardized, cross-modal data correlated, and real-time feature engineering processed to generate structured state feature vectors. This includes: receiving and storing all multi-dimensional telemetry data; cleaning multi-dimensional telemetry data from different sources and in different formats, and standardizing the cleaned multi-dimensional telemetry data to obtain standardized multi-dimensional telemetry data; performing cross-modal data correlation on the standardized multi-dimensional telemetry data, extracting standardized metadata tags from all multi-dimensional telemetry data, using the standardized metadata tags as spatial dimension correlation keys, and employing a rolling time window for... The time dimension alignment involves real-time correlation and fusion of indicator data and log data from multi-dimensional telemetry data from different data sources to obtain fused multi-dimensional telemetry data. Real-time feature engineering is then performed on the fused multi-dimensional telemetry data to obtain log semantic clustering vectors and / or unit cost SLA return features. Specifically, the log semantic clustering vectors are generated by real-time pattern recognition, log template extraction, and frequency statistics of error logs within a time window. The unit cost SLA return is calculated using a preset formula. A structured state feature vector is then generated based on the log semantic clustering vectors and / or unit cost SLA return features.
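The log semantic clustering step (template extraction plus frequency statistics over a time window) can be approximated as below. This is a deliberately simple stand-in: it masks numeric and hexadecimal tokens with a regular expression rather than performing the real-time pattern recognition the text describes, and the function name and the `<*>` placeholder are assumptions.

```python
import re
from collections import Counter

# Variable parts of a log line: hex literals, and numbers including
# dotted forms such as IP addresses or version strings.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\b\d+(?:\.\d+)*\b")


def cluster_logs(lines):
    """Return [(template, count, percent_of_total)] for one time window,
    most frequent template first."""
    counts = Counter(_VARIABLE.sub("<*>", line).strip() for line in lines)
    total = sum(counts.values())
    return [
        (template, n, round(100 * n / total, 1))
        for template, n in counts.most_common()
    ]
```

The resulting list has the same shape as the "Log semantic clustering results" block in Example B: a message pattern, its count, and its share of the window.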

[0154] Optionally, obtaining the business objectives based on the triggering conditions and dynamically fusing the structured state feature vector with the business objectives to generate a prompt includes: if the triggering conditions include a scheduled task or an alarm event, obtaining the currently effective business objectives from the configuration library; and dynamically and structurally filling the business objectives and structured state feature vectors into a preset Prompt template to obtain the prompt.

[0155] Optionally, invoking a large language model to perform reasoning based on the prompt and generate structured decisions includes: invoking a large language model to perform a reasoning task based on the prompt to generate structured decisions containing decision reasons and specific operation instructions; and verifying the structured decisions using syntax and schema verification rules, semantic logic verification rules, security boundary and circuit breaker verification rules, and anti-flapping verification rules in a secure output parser to obtain verified structured decisions.

[0156] Specifically, in this embodiment, aggregating, standardizing, performing cross-modal data association on, and performing real-time feature engineering on the multi-dimensional telemetry data to generate structured state feature vectors corresponds to the telemetry aggregation and processing module 12 in the multi-cluster data processing system based on a large language model shown in Figure 1 or Figure 2. Obtaining a business objective based on a triggering condition, dynamically fusing the structured state feature vector with the business objective to generate a prompt, and invoking a large language model to perform inference based on the prompt to generate a structured decision corresponds to the intelligent decision engine module 14 in the system shown in Figure 1 or Figure 2.

[0157] Step S306: Translate the structured decision into configuration instructions for the target platform and execute the configuration instructions.

[0158] Optionally, in step S306, the structured decision is translated into configuration instructions for the target platform. Executing the configuration instructions includes: parsing the structured decision to obtain its type; configuring the corresponding adapter based on the type of the structured decision; translating the structured decision into configuration instructions for the target platform through the adapter; executing changes to the infrastructure in a transactional or idempotent manner according to the configuration instructions, and obtaining the execution result; wherein, for complex failover operations, multiple adapters are coordinated to ensure the atomicity of the operation; and the execution result is fed back or recorded in the audit log.

[0159] Specifically, in this embodiment, translating structured decisions into configuration instructions for the target platform and executing the configuration instructions corresponds to the strategy executor module 16 in the multi-cluster data processing system based on a large language model shown in Figure 1 or Figure 2.

[0160] Optionally, the multi-cluster data processing method based on a large language model provided in this application embodiment further includes: storing the business strategy and providing the decision basis; wherein storing the business strategy includes: defining and modifying global business objectives and SLA constraints through a configured interface; and providing the decision basis includes: reading the highest principle that should currently be followed as the decision basis when making a decision.

[0161] Specifically, in this application embodiment, storing the business strategy and providing the decision basis correspond to the business objectives and configuration management module 18 in the multi-cluster data processing system based on a large language model shown in Figure 1 or Figure 2.

[0162] In summary, combining steps S302 to S306, Figure 4 is a flowchart of another multi-cluster data processing method based on a large language model provided in an embodiment of the present invention. As shown in Figure 4, the specific details of the method are as follows. In a preferred example, "periodic optimized scheduling under normal conditions" is used for a detailed explanation:
Step 1: Real-time aggregation and characterization of multimodal telemetry data; The distributed telemetry acquisition modules within the system continuously and in parallel acquire at least two types of data from their respective clusters:
Performance metrics data: For example, the P99 latency of the payment-service in cluster A over the past 5 minutes was 150ms, with an error rate of 0.5%;
Cost metrics data: For example, the current hourly traffic cost for cluster A is 0.09 / GB, and for cluster B it is 0.05 / GB;
Resource metrics data: For example, the CPU utilization rate of cluster A is 75%, while that of cluster B is 40%.

[0163] All raw data is sent to the telemetry aggregation and processing module.

[0164] The Flink streaming engine within the telemetry aggregation and processing module consumes these data streams and performs aggregation calculations within a rolling time window (e.g., 5 minutes) to generate a global, structured State Feature Vector. This vector is a digital snapshot of the current health and cost-effectiveness of the entire multi-cluster system.

[0165] Step 2: Constructing a dynamic reasoning prompt that integrates business intent; The system's timer scheduler triggers the intelligent decision engine module at a preset frequency (e.g., every 5 minutes).

[0166] The intelligent decision engine module first obtains the currently effective business policy from the business objectives and configuration management module. This business policy may be defined as a piece of natural language text, such as "Current business objective: Minimize the total cost as much as possible while ensuring that the P99 latency of all services is less than 200ms". The dynamic Prompt generator within the intelligent decision engine module starts and performs the following operations: Obtain the latest state feature vector generated in step one.

[0167] Format the latest state feature vector into a human-readable format (such as a Markdown table).

[0168] The formatted data, along with the acquired business strategy text and system role settings, are embedded into a predefined Prompt template.
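Formatting the state feature vector as a Markdown table before it is embedded in the prompt can be sketched as follows. The helper name and the column keys are illustrative, and the row values in the usage note are partly hypothetical.

```python
def to_markdown_table(rows, columns):
    """Render a list of per-cluster dicts as a Markdown table string."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row[col]) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])
```

For example, `to_markdown_table([{"cluster": "A", "p99_ms": 150, "cpu": "75%"}], ["cluster", "p99_ms", "cpu"])` yields a three-line table whose last row is `| A | 150 | 75% |`, which can then be passed to the prompt template as the cluster state section.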

[0169] Step 3: Complex reasoning and decision generation based on large-scale language models; The intelligent decision engine module sends the Prompt, which was built in step two, to the large language model inference service via an API request.

[0170] After receiving the prompt, the LLM uses its contextual understanding and logical reasoning capabilities to perform a "thinking" process: The LLM understands the business objective as "conditional cost minimization".

[0171] The LLM analyzes the state feature vector and finds that cluster A has high latency and constrained resources, while cluster B has latency that meets the SLA requirement and advantages in both resources and cost.

[0172] The LLM weighs the pros and cons and calculates a traffic allocation scheme that both meets the SLA and reduces costs.

[0173] The LLM outputs the final decision in strict JSON format, which contains at least two pieces of information: decision_reason: A piece of natural language text explaining why this decision was made, used for auditing and enhancing credibility.

[0174] action_payload: A structured object containing specific execution parameters, such as {"traffic_weights": [{"cluster": "A", "weight": 60}, {"cluster": "B", "weight": 40}]}.

[0175] Step 4: Security verification and atomic execution of decisions. The secure output parser within the intelligent decision engine module receives the JSON returned by the LLM.

[0176] The intelligent decision engine module performs a series of security checks on the JSON, for example confirming that it conforms to the predefined schema, that the weights sum to 100, and that each cluster name is valid.
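The checks named in [0176] can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the parser of this application: the set of known clusters, the exception name DecisionRejected, the requirement that weights be integers between 0 and 100, and the rejection of duplicate cluster entries are all assumptions added for the example. Only the field names decision_reason, action_payload, and traffic_weights come from the description.

```python
import json

# Hypothetical registry of managed clusters (assumed for illustration).
KNOWN_CLUSTERS = {"A", "B"}


class DecisionRejected(ValueError):
    """Raised when the LLM output fails a schema or safety check."""


def validate_decision(raw, known_clusters=KNOWN_CLUSTERS):
    """Parse the LLM output and apply schema, weight-sum, and name checks."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise DecisionRejected(f"not valid JSON: {exc}") from exc
    if not isinstance(decision, dict):
        raise DecisionRejected("top-level value must be an object")

    reason = decision.get("decision_reason")
    if not isinstance(reason, str) or not reason.strip():
        raise DecisionRejected("decision_reason must be a non-empty string")

    payload = decision.get("action_payload")
    if not isinstance(payload, dict):
        raise DecisionRejected("action_payload must be an object")
    weights = payload.get("traffic_weights")
    if not isinstance(weights, list) or not weights:
        raise DecisionRejected("traffic_weights must be a non-empty list")

    seen, total = set(), 0
    for entry in weights:
        if not isinstance(entry, dict):
            raise DecisionRejected("each traffic weight must be an object")
        cluster, weight = entry.get("cluster"), entry.get("weight")
        if not isinstance(cluster, str) or cluster not in known_clusters:
            raise DecisionRejected(f"unknown cluster: {cluster!r}")
        if cluster in seen:
            raise DecisionRejected(f"duplicate cluster: {cluster!r}")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(weight, bool) or not isinstance(weight, int):
            raise DecisionRejected("weight must be an integer")
        if not 0 <= weight <= 100:
            raise DecisionRejected("weight must be between 0 and 100")
        seen.add(cluster)
        total += weight

    if total != 100:
        raise DecisionRejected(f"weights sum to {total}, expected 100")
    return decision


raw = json.dumps({
    "decision_reason": "Cluster B meets the SLA at lower cost.",
    "action_payload": {
        "traffic_weights": [
            {"cluster": "A", "weight": 60},
            {"cluster": "B", "weight": 40},
        ]
    },
})
decision = validate_decision(raw)
```

Raising a dedicated exception rather than returning a flag makes it harder for a caller to forward an unverified decision to the executor by accident.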

[0177] After successful verification, the decision JSON is sent to the policy executor module.

[0178] The policy executor parses the JSON and translates it into a series of API PATCH requests to the underlying infrastructure (e.g., Istio VirtualService resources across multiple clusters).

[0179] The executor submits these requests atomically, and if any request fails, it will attempt to roll back all successful changes to ensure the consistency of the system configuration.
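The all-or-nothing behaviour described in [0179] can be approximated by applying changes in order and compensating in reverse order on failure. The Python sketch below illustrates that pattern only; the function apply_all_or_rollback, the callbacks, and the in-memory weights table that stands in for real PATCH requests are hypothetical. Because each cluster is changed separately, this is a best-effort rollback rather than a true distributed transaction, which matches the wording "attempt to roll back".

```python
def apply_all_or_rollback(changes, apply, revert):
    """Apply (target, value) changes in order; undo applied ones on failure.

    apply(target, value) must return the previous value so that it can be
    restored. Returns (ok, rollback_failures), where rollback_failures lists
    the targets whose revert call itself failed.
    """
    applied = []
    try:
        for target, value in changes:
            previous = apply(target, value)
            applied.append((target, previous))
    except Exception:
        rollback_failures = []
        # Undo in reverse order so that later changes are reverted first.
        for target, previous in reversed(applied):
            try:
                revert(target, previous)
            except Exception:
                rollback_failures.append(target)
        return False, rollback_failures
    return True, []


# In-memory stand-in for per-cluster routing configuration (illustrative).
weights = {"A": 50, "B": 50}


def fake_patch(cluster, weight):
    """Simulate a PATCH request; cluster B is unreachable in this demo."""
    if cluster == "B":
        raise ConnectionError("cluster B unreachable")
    previous = weights[cluster]
    weights[cluster] = weight
    return previous


def fake_revert(cluster, previous):
    """Restore the previously recorded weight for a cluster."""
    weights[cluster] = previous


ok, failures = apply_all_or_rollback(
    [("A", 60), ("B", 40)], fake_patch, fake_revert
)
print(ok, failures, weights)
```

In this demonstration the change to cluster A succeeds, the change to cluster B fails, and the weight of cluster A is restored, so the configuration ends in its original state. Any targets reported in rollback_failures would need manual or retried recovery.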

[0180] Step 5: Closed-loop feedback and status update. Once the policy executor successfully applies the new traffic allocation policy, the traffic distribution in the data plane begins to change.

[0181] This change will be immediately reflected in the performance, resource, and cost metrics of each cluster.

[0182] The distributed telemetry acquisition module collects these new, changed data and sends them back to the control plane.

[0183] This new data will be processed in step one of the next decision cycle (i.e., 5 minutes later), becoming the input for a new round of optimization, thus forming a closed loop of intelligent control that continuously learns and optimizes.

[0184] In the multi-cluster data processing method based on a large language model provided in this application, the source data is generated in real time by various monitoring components while the multi-cluster online business system is running normally or experiencing a failure. The source data is multimodal (metrics, logs, links, etc.), highly time-sensitive (it needs to be processed within seconds or minutes after it is generated), massive (numerous data points), and distributed (originating from different clusters). The target data is a structured JSON instruction generated by the decision engine. The target data can be applied to Infrastructure as Code (IaC), where the JSON directly drives tools such as Terraform or Pulumi to update cloud resource configurations in a declarative manner, or to a GitOps process, where the JSON is submitted to a Git repository and automatically synchronized to the target cluster by GitOps tools such as ArgoCD, so that changes are traceable and auditable.

[0185] This invention provides a multi-cluster data processing method based on a large language model. The method involves collecting multi-dimensional telemetry data from managed clusters; inputting the multi-dimensional telemetry data, business objectives, and dynamically generated prompts into a large language model to obtain structured decisions; translating the structured decisions into configuration instructions for the target platform; and executing these configuration instructions. This enables global intelligent optimization and precise failover decisions, thereby improving data processing efficiency.

[0186] The above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention.

Claims

1. A multi-cluster data processing system based on a large language model, characterized in that it comprises: a distributed telemetry acquisition module, deployed in each managed cluster and used to collect multi-dimensional telemetry data; a telemetry aggregation and processing module, used to aggregate, standardize, perform cross-modal data association on, and perform real-time feature engineering on the multi-dimensional telemetry data to generate a structured state feature vector; an intelligent decision engine module, used to obtain business objectives based on triggering conditions, dynamically fuse the structured state feature vector with the business objectives to generate a prompt, call a large language model to perform reasoning based on the prompt to generate a structured decision, and perform syntactic, semantic, and security logic verification on the structured decision to obtain a decision instruction that passes the verification; a strategy executor module, used to translate the verified decision instruction into configuration instructions for the target platform and to execute the configuration instructions; and a business objectives and configuration management module, used to store business strategies and provide a decision-making basis.

2. The multi-cluster data processing system based on a large language model according to claim 1, characterized in that the multi-dimensional telemetry data includes: metric data, log data, and link data.

3. A multi-cluster data processing method based on a large language model, characterized in that it is applied to a multi-cluster data processing system based on a large language model and comprises: collecting multi-dimensional telemetry data from the managed cluster; inputting the multi-dimensional telemetry data, business objectives, and a dynamically generated prompt into a large language model to obtain a structured decision; and translating the structured decision into configuration instructions for the target platform and executing the configuration instructions.

4. The multi-cluster data processing method based on a large language model according to claim 3, characterized in that collecting the multi-dimensional telemetry data from the managed cluster comprises: collecting the metric data, log data, and link data of the managed cluster; and generating the multi-dimensional telemetry data based on the metric data, the log data, and the link data; wherein the metric data includes: infrastructure metrics, network metrics, and application layer metrics exposed through the service mesh; the log data includes: standard output logs, error logs, and Kubernetes events; and the link data includes: call link data of distributed services.

5. The multi-cluster data processing method based on a large language model according to claim 3 or 4, characterized in that inputting the multi-dimensional telemetry data, business objectives, and dynamically generated prompt into a large language model to obtain a structured decision comprises: aggregating, standardizing, performing cross-modal data association on, and performing real-time feature engineering on the multi-dimensional telemetry data to generate a structured state feature vector; obtaining the business objective based on the triggering conditions, and dynamically fusing the structured state feature vector with the business objective to generate the prompt; and invoking the large language model to perform reasoning based on the prompt to generate the structured decision.

6. The multi-cluster data processing method based on a large language model according to claim 5, characterized in that aggregating, standardizing, performing cross-modal data association on, and performing real-time feature engineering on the multi-dimensional telemetry data to generate a structured state feature vector comprises: receiving all the multi-dimensional telemetry data and storing the multi-dimensional telemetry data; cleaning the multi-dimensional telemetry data from different sources and in different formats, and standardizing the cleaned multi-dimensional telemetry data to obtain standardized multi-dimensional telemetry data; performing cross-modal data association on the standardized multi-dimensional telemetry data by extracting the standardized metadata tags of all the multi-dimensional telemetry data, using the standardized metadata tags as spatial dimension association keys, aligning the time dimension using a rolling time window, and associating and fusing in real time the metric data and log data in the multi-dimensional telemetry data from different data sources to obtain fused multi-dimensional telemetry data; performing real-time feature engineering on the fused multi-dimensional telemetry data to obtain a log semantic clustering vector and/or a unit cost SLA return feature, wherein the log semantic clustering vector is generated by real-time pattern recognition, log template extraction, and frequency statistics of error logs within a time window, and the unit cost SLA return is calculated using a preset formula; and generating the structured state feature vector based on the log semantic clustering vector and/or the unit cost SLA return feature.

7. The multi-cluster data processing method based on a large language model according to claim 5, characterized in that obtaining the business objective based on the triggering conditions and dynamically fusing the structured state feature vector with the business objective to generate the prompt comprises: if the triggering conditions include a scheduled task or an alarm event, retrieving the currently effective business objective from the configuration library; and filling the business objective and the structured state feature vector, dynamically and in structured form, into a preset Prompt template to obtain the prompt.

8. The multi-cluster data processing method based on a large language model according to claim 7, characterized in that invoking the large language model to perform reasoning based on the prompt to generate the structured decision comprises: invoking the large language model to perform a reasoning task based on the prompt, generating a structured decision that contains a decision reason and specific operation instructions; and verifying the structured decision using the syntax and schema verification rules, semantic logic verification rules, security boundary and circuit breaker verification rules, and debouncing verification rules in the secure output parser, to obtain the verified structured decision.

9. The multi-cluster data processing method based on a large language model according to claim 3, characterized in that translating the structured decision into configuration instructions for the target platform and executing the configuration instructions comprises: parsing the structured decision to obtain the type of the structured decision; selecting a corresponding adapter based on the type of the structured decision, and translating the structured decision into the configuration instructions of the target platform through the adapter; executing the changes to the infrastructure in a transactional or idempotent manner according to the configuration instructions to obtain execution results, wherein, for complex failover operations, multiple adapters are coordinated to ensure the atomicity of the operation; and feeding back the execution results or recording them in the audit log.

10. The multi-cluster data processing method based on a large language model according to claim 9, characterized in that the method further comprises: storing business strategies and providing a decision-making basis; wherein storing the business strategies comprises: defining and modifying global business objectives and SLA constraints through a configuration interface; and providing the decision-making basis comprises: when a decision is made, reading the highest principle that should currently be followed as the decision-making basis.