A behavior reasoning and AI intelligent handling method and system for communication anomalies

By constructing an ideal communication state model, generating simulation trajectories for abnormal scenarios and business tides, and calculating the residual spaces, the invention addresses the difficulty in existing technologies of distinguishing high-concurrency legitimate tides from covert communication faults or attacks. This enables accurate root cause localization and automated linkage handling, supporting system stability and business continuity.

CN122247760A · Pending · Publication Date: 2026-06-19 · ZHEJIANG SCI-TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
ZHEJIANG SCI-TECH UNIV
Filing Date
2026-05-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing communication anomaly detection methods are unable to effectively distinguish between high-concurrency legitimate surges and covert communication failures or attacks. They lack stable benchmarks, resulting in inaccurate anomaly type identification and insufficient root cause localization. Furthermore, the handling process relies on human experience, making it difficult to achieve stable and interpretable intelligent linkage handling.

Method used

An ideal communication state model based on interface contracts, protocol specifications, and queuing rules is constructed. By collecting communication interaction time-series data and protocol stack resource state data, an ideal communication trajectory benchmark is generated. Abnormal scenarios and business tide simulation trajectories are generated through parameterized perturbation injection. The residual space is calculated, and the abnormal type and root cause result are generated using topological similarity. Intelligent handling instructions are then output.

Benefits of technology

It enables precise differentiation between high-concurrency legitimate surges and real attacks under the same reference frame, improves the accuracy of anomaly root cause localization and multi-dimensional correlation capabilities, realizes automated linkage from discovery to mitigation, reduces reliance on human experience, and ensures business continuity.

✦ Generated by Eureka AI based on patent content.

Figure CN122247760A_ABST

Abstract

This invention relates to the field of network communication anomaly detection and intelligent operation and maintenance, specifically to a behavioral reasoning and AI intelligent handling method and system for communication anomalies. The method includes: collecting communication interaction time-series data and protocol stack resource status data; constructing an ideal communication state model based on interface contract parsing results, protocol specification parsing results, and queuing rules, and generating an ideal communication trajectory benchmark based on the service request sequence within a preset time window; performing fault attack knowledge injection and service load fluctuation rule injection respectively to generate anomaly scenario simulation trajectories and service tidal simulation trajectories; calculating the difference between real-world observed features and the ideal communication trajectory benchmark to obtain a real residual space, and generating anomaly theoretical residual spaces and service tidal theoretical residual spaces; generating anomaly type results and root cause results based on the topological similarity between each residual space, and outputting corresponding intelligent handling instructions. This invention accurately distinguishes between high-concurrency legitimate tidal flows and real attacks.

Description

Technical Field

[0001] This invention relates to the field of network communication anomaly detection and intelligent operation and maintenance, specifically to a behavioral reasoning and AI intelligent handling method and system for communication anomalies.

Background Technology

[0002] Distributed business systems carry a large amount of inter-service network communication interactions in scenarios such as medical collaboration, government services, and industrial internet. The stability of the communication status directly affects business continuity and processing timeliness. Therefore, accurate identification of communication anomalies, root cause judgment, and coordinated handling are the key to ensuring the reliable operation of the system.

[0003] Existing communication anomaly detection methods have several shortcomings. They typically raise alarms based on thresholds for real-time traffic, response latency, or resource usage. However, legitimate business peaks, call chain fluctuations, connection maintenance anomalies, resource leaks, and protocol state anomalies can all manifest as increased latency, deeper queues, and higher resource usage, so it is difficult to distinguish normal business surges from real faults or attacks. In addition, existing solutions usually lack an ideal communication state reference based on interface contracts, protocol specifications, and queuing rules, which makes it difficult to compare actual observations against theoretical anomaly patterns in a unified way. As a result, anomaly type identification is inaccurate, root cause localization is insufficient, and handling relies heavily on human experience, making stable, explainable, and intelligent linkage handling difficult to achieve.

Summary of the Invention

[0004] The purpose of this invention is to provide a method and system for behavioral reasoning and AI-powered intelligent handling of communication anomalies, addressing the following technical problems:

[0005] How to distinguish high-concurrency legitimate surges from covert communication failures or attacks, which are traditionally easily confused, and how to achieve explainable, traceable, and coordinated handling of anomalies with an automatic closed loop from detection to mitigation.

[0006] The objective of this invention can be achieved through the following technical solutions:

[0007] A behavioral reasoning and AI-powered intelligent handling method for communication anomalies includes:

[0008] Collect communication interaction timing data and protocol stack resource status data, wherein the communication interaction timing data characterizes the inter-service network communication interaction process in the distributed business system;

[0009] Based on the pre-acquired interface contract parsing results, protocol specification parsing results, and queuing rules, an ideal communication state model is constructed, wherein the ideal communication state model is a colored Petri net model containing several state nodes.

[0010] Based on the communication interaction timing data, extract the service request sequence within a preset time window as the current service input;

[0011] Based on the ideal communication state model and the current service input, an ideal communication trajectory benchmark is generated;

[0012] Based on the pre-defined fault attack knowledge, parameterized perturbation injection is performed on the ideal communication state model to generate abnormal scenario simulation trajectories.

[0013] Based on preset service load fluctuation rules, parameterized disturbance injection is performed on the ideal communication state model to generate service tidal simulation trajectory.

[0014] Time alignment and state node mapping are performed on the communication interaction timing data and the protocol stack resource status data to generate real observation features. The difference between the real observation features and the ideal communication trajectory benchmark is calculated under the preset time window and the state node mapping to obtain the real residual space.

[0015] Under the preset time window and the state node mapping, the difference between the abnormal scenario simulation trajectory and the ideal communication trajectory benchmark is calculated to obtain at least one abnormal theoretical residual space, and the difference between the service tide simulation trajectory and the ideal communication trajectory benchmark is calculated to obtain the service tide theoretical residual space.

[0016] Based on the topological similarity between the actual residual space and each of the anomaly theoretical residual spaces and the business tide theoretical residual space, anomaly type results and root cause results are generated.

[0017] In response to the root cause result, a corresponding intelligent handling instruction is output, and a corresponding communication linkage handling operation is performed on the distributed business system according to the intelligent handling instruction.

[0018] In one possible implementation, the communication interaction timing data includes source address, destination address, calling method identifier, payload size, request timestamp, and response timestamp;

[0019] The protocol stack resource status data includes the Transmission Control Protocol (TCP) state transition sequence, User Datagram Protocol (UDP) socket transmit / receive status sequence, handle occupancy rate, and input / output queue depth.
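
For illustration, the two data sources listed above could be held in records like the following. This is a minimal sketch: the class names `InteractionRecord` and `StackResourceSample`, all field names, and the sample addresses and method name are assumptions introduced here, not identifiers from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InteractionRecord:
    """One inter-service call from the communication interaction data (hypothetical layout)."""
    source_address: str
    destination_address: str
    method_id: str          # calling method identifier
    payload_bytes: int      # payload size
    request_ts_ms: int      # request timestamp in milliseconds
    response_ts_ms: int     # response timestamp in milliseconds

    @property
    def latency_ms(self) -> int:
        # Observed round-trip latency of this call.
        return self.response_ts_ms - self.request_ts_ms


@dataclass(frozen=True)
class StackResourceSample:
    """One sample of protocol stack resource state on a host (hypothetical layout)."""
    ts_ms: int
    tcp_state_sequence: List[str]   # e.g. ["SYN_SENT", "ESTABLISHED"]
    udp_socket_states: List[str]    # transmit / receive status sequence
    handle_occupancy: float         # fraction in [0, 1]
    io_queue_depth: int


record = InteractionRecord("10.0.0.5", "10.0.0.9", "MedicalRecord.Get", 512, 1000, 1020)
sample = StackResourceSample(1000, ["SYN_SENT", "ESTABLISHED"], [], 0.30, 4)
```

Keeping the two record types separate mirrors the text, where they come from different collection points and are only joined later by time alignment and state node mapping.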

[0020] In one possible implementation, the interface contract resolution result is obtained by parsing the interface definition file, service description file, or application interface metadata;

[0021] The protocol specification parsing result is obtained by parsing the Transmission Control Protocol specification, User Datagram Protocol specification, or Application Layer Communication Protocol specification.

[0022] The queuing rules include at least one of the following: first-come-first-served rule, priority queue rule, and connection pool allocation rule;

[0023] The current business input includes at least one of the following: business request arrival rate, call chain sequence, request type identifier, and load parameters;

[0024] The ideal communication trajectory benchmark includes an ideal state transition sequence, an ideal processing delay sequence, and an ideal resource occupancy sequence.

[0025] In one possible implementation, the fault attack knowledge consists of at least one of historical fault samples, an attack rule base, and a protocol anomaly pattern base.

[0026] The parameterized perturbation injection based on preset fault attack knowledge is performed on the ideal communication state model to generate anomaly scenario simulation trajectories, including:

[0027] The service processing rate attenuation rule is injected into the ideal communication state model to generate a fault simulation trajectory;

[0028] The connection-keeping rules and window shrinking rules are injected into the ideal communication state model to generate attack simulation trajectories;

[0029] The fault simulation trajectory and the attack simulation trajectory are combined into the abnormal scenario simulation trajectory;

[0030] The parameterized perturbation injection into the ideal communication state model based on preset service load fluctuation rules, generating a service tidal simulation trajectory, includes:

[0031] The periodic fluctuation rules or sudden increase / decrease rules of the service request arrival rate are injected into the ideal communication state model to generate the service tidal simulation trajectory.
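
The idea of parameterized perturbation injection can be sketched with a toy model. The example below is an assumption-laden stand-in: it replaces the colored Petri net with a deterministic single-queue model, and the names `ModelParams`, `simulate_queue_depth`, `inject_rate_attenuation`, and `inject_arrival_burst` are hypothetical. It models only service processing rate attenuation (a fault) and a sudden increase in arrival rate (a business tide); connection-keeping and window shrinking rules are not modeled.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ModelParams:
    arrivals_per_window: List[int]  # requests arriving in each time window
    service_rate: int               # requests completed per time window


def simulate_queue_depth(params: ModelParams) -> List[int]:
    """Queue depth at the end of each window for a deterministic FCFS queue."""
    depth, trace = 0, []
    for arrived in params.arrivals_per_window:
        depth = max(0, depth + arrived - params.service_rate)
        trace.append(depth)
    return trace


def inject_rate_attenuation(params: ModelParams, factor: float) -> ModelParams:
    """Fault-style perturbation: the service processes requests more slowly."""
    return replace(params, service_rate=max(1, int(params.service_rate * factor)))


def inject_arrival_burst(params: ModelParams, start: int, extra: int) -> ModelParams:
    """Tide-style perturbation: arrivals rise by `extra` from window `start` onward."""
    arrivals = [a + extra if i >= start else a
                for i, a in enumerate(params.arrivals_per_window)]
    return replace(params, arrivals_per_window=arrivals)


ideal = ModelParams(arrivals_per_window=[10, 10, 10, 10], service_rate=10)
ideal_trace = simulate_queue_depth(ideal)
fault_trace = simulate_queue_depth(inject_rate_attenuation(ideal, 0.5))
tide_trace = simulate_queue_depth(inject_arrival_burst(ideal, start=2, extra=4))
```

Because each perturbation returns a modified copy of the same parameters, the ideal, fault, and tide trajectories all come from one model and can be compared against the same baseline.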

[0032] In one possible implementation, the real residual space, the theoretical residual spaces of each of the aforementioned anomalies, and the theoretical residual spaces of the business tides are all multi-dimensional time series graph feature spaces composed of state transition features, latency features, and resource occupancy features.

[0033] The process of obtaining the real residual space includes: performing time alignment and state node mapping on the communication interaction timing data and the protocol stack resource state data to generate the real observation features; and generating the real residual space based on the difference between the real observation features and the ideal communication trajectory benchmark under the preset time window and the state node mapping.

[0034] The step of obtaining at least one abnormal theoretical residual space and calculating the difference between the service tidal simulation trajectory and the ideal communication trajectory benchmark to obtain the service tidal theoretical residual space includes: generating the abnormal theoretical residual space based on the difference between the abnormal scenario simulation trajectory and the ideal communication trajectory benchmark under the preset time window and the state node mapping; and generating the service tidal theoretical residual space based on the difference between the service tidal simulation trajectory and the ideal communication trajectory benchmark under the preset time window and the state node mapping.

[0035] In one possible implementation, topological similarity includes temporal similarity, structural similarity, and a coupling score determined based on the temporal similarity and the structural similarity;

[0036] The generation of anomaly type results and root cause results based on the topological similarity between the actual residual space and each of the anomaly theoretical residual spaces and the business tide theoretical residual space includes:

[0037] The temporal similarity between the actual residual space and each of the anomaly theoretical residual spaces and the business tide theoretical residual space is calculated based on dynamic time warping.

[0038] The structural similarity between the actual residual space and each of the anomaly theoretical residual spaces and the business tide theoretical residual space is calculated based on the graph edit distance.

[0039] The temporal similarity and the structural similarity are weighted and summed based on a predetermined weight allocation to obtain the coupling scores corresponding to each of the anomaly theory residual spaces and the business tide theory residual spaces.

[0040] Among the anomaly theoretical residual spaces, the one with the highest coupling score is selected as the target anomaly theoretical residual space;

[0041] Based on the coupling score corresponding to the residual space of the target anomaly theory and the coupling score corresponding to the residual space of the business tide theory, the anomaly type result and the root cause result are determined.
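
The scoring steps above can be sketched as follows, under several assumptions that the application does not specify. Distances are turned into similarities with `1 / (1 + d)`. The structural term uses a simplified graph edit distance that is valid only when every node carries a unique label (here, the state node name), so it reduces to counting node and edge insertions and deletions at unit cost. The default weight of 0.5 is arbitrary. All function names are hypothetical.

```python
from typing import Sequence, Set, Tuple

Edge = Tuple[str, str]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def labeled_graph_edit_distance(nodes_a: Set[str], edges_a: Set[Edge],
                                nodes_b: Set[str], edges_b: Set[Edge]) -> int:
    """Unit-cost edit distance for graphs with uniquely labeled nodes (simplification)."""
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


def to_similarity(distance: float) -> float:
    """Map a non-negative distance to (0, 1]; an assumed normalization."""
    return 1.0 / (1.0 + distance)


def coupling_score(temporal_sim: float, structural_sim: float,
                   w_temporal: float = 0.5) -> float:
    """Weighted sum of temporal and structural similarity."""
    return w_temporal * temporal_sim + (1.0 - w_temporal) * structural_sim


def best_template(scores: dict) -> str:
    """Name of the anomaly template with the highest coupling score."""
    return max(scores, key=scores.get)
```

For general graphs without unique node labels, exact graph edit distance is computationally hard, so a production system would likely rely on an approximation; that choice is outside what the text describes.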

[0042] In one possible implementation, the anomaly type result and the root cause result are determined based on the coupling score corresponding to the target anomaly theoretical residual space and the coupling score corresponding to the business tide theoretical residual space, including:

[0043] When the coupling score corresponding to the target anomaly theoretical residual space is greater than the preset anomaly confidence threshold and not less than the coupling score corresponding to the business tide theoretical residual space, the anomaly type result and root cause result corresponding to the target anomaly theoretical residual space are determined.

[0044] When the coupling score corresponding to the residual space of the business tide theory is greater than the preset tide confidence threshold and greater than the coupling score corresponding to the residual space of the target anomaly theory, the current state is determined to be a normal business tide.

[0045] If any of the above judgment conditions are not met, the current state is determined to be a pending confirmation state, and continued data collection and repeated judgment are triggered.

[0046] The preset anomaly confidence threshold and the preset tidal confidence threshold are predetermined through at least one of the following methods: historical sample statistical calibration, training set calibration, or empirical parameter calibration.
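
The three-way judgment described above can be written as a small function. This is a sketch that follows the stated comparisons literally; the function name `decide`, the return labels, and the example threshold values are assumptions.

```python
def decide(anomaly_score: float, tide_score: float,
           anomaly_threshold: float, tide_threshold: float) -> str:
    """Return "anomaly", "normal_tide", or "pending" from the two coupling scores."""
    # Anomaly: above its confidence threshold and not less than the tide score.
    if anomaly_score > anomaly_threshold and anomaly_score >= tide_score:
        return "anomaly"
    # Normal tide: above its confidence threshold and strictly greater than the anomaly score.
    if tide_score > tide_threshold and tide_score > anomaly_score:
        return "normal_tide"
    # Neither condition holds: keep collecting data and judge again.
    return "pending"


verdict = decide(anomaly_score=0.9, tide_score=0.4,
                 anomaly_threshold=0.7, tide_threshold=0.7)
```

Note the asymmetry in the source wording: a tie between the two scores resolves toward the anomaly branch, provided the anomaly threshold is exceeded.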

[0047] A behavioral reasoning and AI-powered intelligent handling system for communication anomalies includes:

[0048] The data acquisition module is used to collect communication interaction timing data and protocol stack resource status data;

[0049] The ideal benchmark construction module is used to construct an ideal communication state model containing several state nodes based on the pre-acquired interface contract parsing results, protocol specification parsing results and queuing rules, and to generate an ideal communication trajectory benchmark based on the service request sequence within a preset time window extracted from the communication interaction time sequence data.

[0050] The parameterized injection module is used to perform parameterized perturbation injection on the ideal communication state model based on preset fault attack knowledge, generate abnormal scenario simulation trajectory, and perform parameterized perturbation injection on the ideal communication state model based on preset service load fluctuation rules, generate service tidal simulation trajectory.

[0051] The dual-track differential module is used to perform time alignment and state node mapping on the communication interaction timing data and the protocol stack resource status data to generate real observation features. Based on the real observation features, the ideal communication trajectory benchmark, the abnormal scenario simulation trajectory, and the service tide simulation trajectory, the real residual space, at least one abnormal theoretical residual space, and the service tide theoretical residual space are determined under the preset time window and the state node mapping, respectively.

[0052] The coupling decision module is used to generate anomaly type results and root cause results based on the topological similarity between the real residual space and each of the anomaly theoretical residual spaces and the business tide theoretical residual space;

[0053] The intelligent handling module is used to output, in response to the root cause result, the corresponding intelligent handling instruction, so as to drive execution of the corresponding communication linkage handling operation;

[0054] The data acquisition module is connected to the traffic mirroring interface and the host protocol stack monitoring interface, respectively, and the intelligent processing module is connected to the operation and maintenance orchestration interface or the service governance interface.

[0055] In one possible implementation, the intelligent handling module is used to match a target script from the atomic handling script library pre-stored in the operation and maintenance orchestration system based on at least one of the abnormal protocol level, abnormal source node, abnormal resource type, abnormal cause category and abnormal severity in the root cause result, and output the corresponding handling instructions.

[0056] The handling instructions include isolating the target service instance, resetting the connection pool, and implementing dynamic rate limiting.
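
Matching a root cause result to handling scripts could be as simple as a keyed lookup. The sketch below assumes a library indexed by abnormal resource type and cause category only; the library contents, key names, and script names are hypothetical, and a real atomic handling script library would likely also consider protocol level, source node, and severity.

```python
from typing import Dict, List, Tuple

# Hypothetical atomic handling script library keyed by (resource type, cause category).
SCRIPT_LIBRARY: Dict[Tuple[str, str], List[str]] = {
    ("connection_pool", "connection_persistence_attack"): [
        "reset_connection_pool", "isolate_instance", "rate_limit_upstream",
    ],
    ("service_instance", "processing_rate_degradation"): ["isolate_instance"],
}


def match_scripts(root_cause: dict) -> List[str]:
    """Return the handling scripts for a root cause, or an empty list if none match."""
    key = (root_cause.get("resource_type"), root_cause.get("cause_category"))
    return SCRIPT_LIBRARY.get(key, [])


instructions = match_scripts({
    "source_node": "consultation-service-1",
    "resource_type": "connection_pool",
    "cause_category": "connection_persistence_attack",
    "severity": "high",
})
```

Returning an empty list for an unmatched root cause keeps the default behavior conservative: no automated action is taken unless a script is explicitly registered.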

[0057] The beneficial effects of this invention are:

[0058] 1. This invention overcomes the shortcomings of traditional threshold alarms, which easily confuse legitimate business peaks with communication failures; by injecting business load fluctuation rules and fault attack knowledge into the model to generate simulation trajectories, the system can compare real residuals under the same reference system and accurately distinguish between high-concurrency legitimate tides and real attacks.

[0059] 2. This invention overcomes the previous lack of a stable benchmark by integrating interface contracts, protocol specifications, and queuing rules to construct an ideal communication state model based on colored Petri nets. This transforms fuzzy empirical judgments into analytical, deterministic trajectory benchmarks, providing a reliable reference for subsequent residual calculations.

[0060] 3. This invention improves the accuracy of anomaly root cause localization and multidimensional correlation capability; the scheme constructs the residuals into a multidimensional time series graph and calculates temporal and structural similarity based on dynamic time warping and graph edit distance; by forming a weighted coupling score, it overcomes the bottleneck of relying on a single index and enables direct comparison between real observations and theoretical templates;

[0061] 4. This invention achieves end-to-end automated linkage from anomaly detection and reasoning to mitigation; the system can accurately match and output handling instructions from the atomic handling script library based on multidimensional root cause results; by automatically executing actions such as isolation, reset, or rate limiting, it significantly reduces reliance on human experience and ensures business continuity.

Attached Figure Description

[0062] The invention will now be further described with reference to the accompanying drawings.

[0063] Figure 1 is a flowchart illustrating a behavioral reasoning and AI intelligent handling method for communication anomalies provided in an embodiment of this application;

[0064] Figure 2 is a schematic diagram of a behavioral reasoning and AI intelligent handling system for communication anomalies provided in an embodiment of this application.

Detailed Implementation

[0065] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0066] Referring to Figure 1, a behavioral reasoning and AI-powered intelligent handling method for communication anomalies includes: collecting communication interaction time-series data and protocol stack resource status data, wherein the communication interaction time-series data characterizes the inter-service network communication interaction process in a distributed business system; and constructing an ideal communication state model based on pre-acquired interface contract parsing results, protocol specification parsing results, and queuing rules, wherein the ideal communication state model is a colored Petri net model containing several state nodes.

[0067] Based on communication interaction time sequence data, the service request sequence within a preset time window is extracted as the current service input; based on the ideal communication state model and the current service input, an ideal communication trajectory benchmark is generated; based on preset fault attack knowledge, parameterized perturbation injection is performed on the ideal communication state model to generate an abnormal scenario simulation trajectory;

[0068] Based on the preset service load fluctuation rules, parameterized disturbance injection is performed on the ideal communication state model to generate service tidal simulation trajectory; time alignment and state node mapping are performed on the communication interaction time series data and protocol stack resource state data to generate real observation features; and the difference between the real observation features and the ideal communication trajectory benchmark is calculated under the preset time window and state node mapping to obtain the real residual space.

[0069] Under a preset time window and state node mapping, the difference between the simulated trajectory of the abnormal scenario and the benchmark of the ideal communication trajectory is calculated to obtain at least one abnormal theoretical residual space. The difference between the simulated trajectory of the service tide and the benchmark of the ideal communication trajectory is also calculated to obtain the theoretical residual space of the service tide.

[0070] Based on the topological similarity between the real residual space and the theoretical residual spaces of each anomaly as well as the residual space of the business tide theory, anomaly type results and root cause results are generated; in response to the root cause results, corresponding intelligent handling instructions are output, and corresponding communication linkage handling operations are performed on the distributed business system according to the intelligent handling instructions.

[0071] This embodiment provides a behavioral reasoning and intelligent handling mechanism for communication anomalies. Specifically, this embodiment uses a distributed collaborative platform for chest pain centers as the main scenario for illustration. This platform is deployed in a private cloud within a tertiary hospital and includes at least emergency triage services, ECG upload services, laboratory services, image scheduling services, medical record services, consultation services, and a unified message gateway. The services collaborate with each other through remote procedure calls and message queues. During the morning peak hours, the influx of patients will legitimately increase the request volume.

[0072] Meanwhile, if a node experiences slow connection exhaustion, resource leakage, or abnormal protocol status, it will also manifest as increased latency and deeper queues. Therefore, this embodiment does not directly execute threshold alarms on real-time traffic, but first constructs an ideal state, then constructs an abnormal simulation state and a service tidal simulation state respectively, and finally uses the topological similarity relationship between the residual spaces to make a judgment.

[0073] Specifically, communication interaction timing data and protocol stack resource status data are collected; the communication interaction timing data comes from the traffic mirroring port or the service mesh sidecar, and the protocol stack resource status data comes from the host kernel monitoring interface.

[0074] Taking a 5-second preset time window as an example, if there are 3 key call chains within this time window (triage service to medical record service, medical record service to laboratory service, and laboratory service to consultation service), then the business request sequence of this time window can be abstracted into request fragments R1, R2, and R3. Furthermore, according to the interface contract parsing result and the protocol specification parsing result, each request is mapped to a token with a color identifier in the colored Petri net. Different colors can correspond to different request types, such as the first-visit request for newly admitted chest pain patients, the ECG result backfilling request, and the laboratory report push request.

[0075] In the ideal communication state model, places can represent states such as pending connection, service processing, response return, and resource release, while transitions can represent actions such as establishing a connection, initiating a call, business processing, writing back the result, and closing the connection. By combining queuing rules, such as first-come-first-served or connection pool allocation rules, the ideal communication trajectory baseline can be deduced under the current business input.
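
The place and transition vocabulary above can be illustrated with a very small colored Petri net. This is a deliberately reduced sketch, not the model of the application: each transition has exactly one input place and one output place, there are no guards, timing, or queuing rules, and the class name `ColoredPetriNet`, its methods, and the token color `ecg_backfill` are assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ColoredPetriNet:
    # transition name -> (input place, output place)
    transitions: Dict[str, Tuple[str, str]]
    # place name -> multiset of token colors currently in that place
    marking: Dict[str, Counter] = field(default_factory=dict)

    def add_token(self, place: str, color: str) -> None:
        self.marking.setdefault(place, Counter())[color] += 1

    def tokens(self, place: str, color: str) -> int:
        return self.marking.get(place, Counter())[color]

    def fire(self, transition: str, color: str) -> None:
        """Move one token of `color` across `transition` if it is enabled."""
        source, target = self.transitions[transition]
        if self.tokens(source, color) == 0:
            raise ValueError(f"{transition!r} is not enabled for color {color!r}")
        self.marking[source][color] -= 1
        self.add_token(target, color)


net = ColoredPetriNet(transitions={
    "initiate_call": ("pending_connection", "service_processing"),
    "write_back_result": ("service_processing", "response_return"),
    "close_connection": ("response_return", "resource_release"),
})
net.add_token("pending_connection", "ecg_backfill")
for step in ("initiate_call", "write_back_result", "close_connection"):
    net.fire(step, "ecg_backfill")
```

Recording the marking after each firing would give a state transition sequence of the kind the ideal communication trajectory benchmark is said to contain.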

[0076] To facilitate explanation, the technical principles will now be explained using specific numerical examples. Assume that within a certain time window, the ideal communication trajectory baseline outputs the ideal values for three state nodes: node N1 completes 10 connection calls, node N2 has an average processing latency of 20 milliseconds, and node N3 has a socket occupancy rate of 30%.

[0077] The actual observations within the same time window are 9 times, 55 milliseconds, and 78%, respectively; therefore, the actual residuals can be expressed as: node N1 difference is -1, node N2 difference is +35 milliseconds, and node N3 difference is +48%; after injecting the slow connection persistence attack into the ideal communication state model, a simulation trajectory for a certain abnormal scenario is obtained, and its corresponding differences can be: node N1 difference -1, node N2 difference +33 milliseconds, and node N3 difference +50%;

[0078] After injecting the morning peak tidal load of the outpatient and emergency departments into the ideal communication state model, the simulated trajectory of the business tidal load is obtained. The differences are as follows: node N1 difference 0, node N2 difference +36 milliseconds, node N3 difference +20%. It can be seen that although both abnormal scenarios and business tidal loads can cause increased latency, they present different topological forms in terms of resource consumption and state transition. The former is closer to the long-term occupation of connections and insufficient release, while the latter is closer to the increase in global queuing but the state machine continues to flow.
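
The numbers in this example can be checked directly. The sketch below recomputes the real residuals from the ideal and observed values and compares them with the two theoretical residuals. The Euclidean distance used here is only an illustrative stand-in chosen for brevity: it mixes units (counts, milliseconds, percentage points) and is not the topological similarity described in the application.

```python
import math
from typing import Dict

ideal = {"N1": 10, "N2": 20, "N3": 30}      # calls, ms, percent
observed = {"N1": 9, "N2": 55, "N3": 78}


def residual(actual: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Per-node difference between an observation and the ideal baseline."""
    return {node: actual[node] - baseline[node] for node in baseline}


def euclidean(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Illustrative distance between two residual vectors (assumed metric)."""
    return math.sqrt(sum((a[node] - b[node]) ** 2 for node in a))


real_residual = residual(observed, ideal)
attack_residual = {"N1": -1, "N2": 33, "N3": 50}   # slow connection persistence attack
tide_residual = {"N1": 0, "N2": 36, "N3": 20}      # morning peak business tide

distance_to_attack = euclidean(real_residual, attack_residual)
distance_to_tide = euclidean(real_residual, tide_residual)
```

Even with this crude metric, the real residual sits much closer to the attack template than to the tide template, and the gap comes almost entirely from the socket occupancy node N3, which matches the qualitative reading given in the text.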

[0079] After obtaining the real residual space, the theoretical residual spaces of each anomaly, and the theoretical residual space of business tides, the system extracts the latency features, resource consumption features, and call chain dependencies from each residual space, and reconstructs them into graph structure data objects with node weights and directed edge weights. Then, the topological similarity of the reconstructed graph data objects is calculated. The space here does not have to be a complex high-dimensional continuous manifold. In engineering, a multi-dimensional time series graph composed of multiple state nodes and their edge relationships can be used.

[0080] Time alignment can be based on millisecond-level timestamps, sampling periods, or time buckets; state node mapping can map real-world metrics such as socket receive queue depth, connection establishment wait count, and service processing time to the corresponding places or transitions in the colored Petri net.
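
Time-bucket alignment combined with state node mapping can be sketched as below. The metric names, the mapping table `METRIC_TO_NODE`, the node names, and the choice of averaging values within a bucket are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from raw metric names to state nodes of the model.
METRIC_TO_NODE = {
    "socket_recv_queue_depth": "pending_connection",
    "connect_wait_count": "pending_connection",
    "service_time_ms": "business_processing",
}


def align(samples: Iterable[Tuple[int, str, float]],
          bucket_ms: int) -> Dict[Tuple[int, str], float]:
    """Group (timestamp_ms, metric, value) samples by time bucket and state node.

    Returns {(bucket_start_ms, node): mean value}; unmapped metrics are skipped.
    """
    grouped = defaultdict(list)
    for ts_ms, metric, value in samples:
        node = METRIC_TO_NODE.get(metric)
        if node is None:
            continue
        bucket_start = (ts_ms // bucket_ms) * bucket_ms
        grouped[(bucket_start, node)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


features = align(
    [
        (1000, "socket_recv_queue_depth", 4),
        (1400, "socket_recv_queue_depth", 6),
        (2100, "service_time_ms", 20),
        (2200, "unmapped_metric", 1),
    ],
    bucket_ms=1000,
)
```

Here two metrics map to the same node and are averaged together, which is a simplification; keeping one feature per metric within each node would preserve more detail.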

[0081] Subsequently, if the topological similarity between the actual residual space and a certain abnormal theoretical residual space is the highest and exceeds a preset threshold, the result of the abnormal type and its root cause will be output; if the actual residual space is more similar to the business tide theoretical residual space, it will be judged as normal business tide and will not trigger attack-type handling; if neither of the two conditions is met, it will enter the pending confirmation state and continue to collect data in subsequent time windows for repeated judgment.

[0082] In one alternative implementation, if the data collected within a certain time window is incomplete, for example, only communication interaction timing data is collected but protocol stack resource status data is not collected, the system can mark the missing dimension as a null node and use a partial dimension matching method to generate a reduced-order residual space first, so as to avoid directly interrupting the entire process due to a single dimension missing.
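
One way to picture the null-node and partial dimension matching idea is shown below. The sketch marks missing observations as `None`, compares only the dimensions present on both sides, and reports how much of the space was covered. The function names, the mean absolute difference, and the coverage ratio are assumptions, not details from the application.

```python
from typing import Dict, Optional, Tuple

Residual = Dict[str, Optional[float]]


def reduced_residual(observed: Residual, ideal: Dict[str, float]) -> Residual:
    """Residual per node; nodes with no observation become null (None) nodes."""
    result: Residual = {}
    for node, ideal_value in ideal.items():
        value = observed.get(node)
        result[node] = None if value is None else value - ideal_value
    return result


def partial_match(real: Residual, theory: Dict[str, float]) -> Tuple[float, float]:
    """Mean absolute difference over shared non-null nodes, plus coverage in [0, 1]."""
    shared = [n for n, v in real.items() if v is not None and n in theory]
    if not shared:
        raise ValueError("no shared dimensions to compare")
    mean_abs_diff = sum(abs(real[n] - theory[n]) for n in shared) / len(shared)
    coverage = len(shared) / len(real)
    return mean_abs_diff, coverage


# N3 (socket occupancy) was not collected in this window.
real = reduced_residual({"N1": 9, "N2": 55, "N3": None},
                        {"N1": 10, "N2": 20, "N3": 30})
distance, coverage = partial_match(real, {"N1": -1, "N2": 33, "N3": 50})
```

Reporting coverage alongside the distance lets a downstream decision step treat a low-coverage match with more caution, for example by preferring the pending confirmation state.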

[0083] If the interface contract parsing result is inconsistent with the online version, for example because the service has undergone a canary (gray) release, the latest service description file is first used to rebuild the ideal model; if the mapping still cannot be completed, the service link is temporarily removed from the current judgment and a "model pending synchronization" status is output.

[0084] If multiple consecutive time windows remain in the pending confirmation state during repeated determination, the sampling frequency can be increased or the preset time window can be expanded to obtain a more stable determination basis.

[0085] For example, in the distributed collaboration platform of the chest pain center, at 8:20 AM, several patients with suspected acute myocardial infarction arrived at the emergency department within a short period, and the number of calls between the triage service and the medical record service increased significantly, which appeared in the actual observations as an increase in average response latency.

[0086] However, the business tide theoretical residuals generated by the ideal model after the periodic fluctuation rules of the business tide were introduced are highly consistent with the actual residuals. Therefore, the system judges this period to be a normal tide and does not trigger blocking.

[0087] Conversely, at 8:32 AM, the host hosting the consultation service experienced a connection persistence anomaly. The actual residuals on the two nodes of delayed connection release and high socket occupancy rate highly overlapped with the theoretical residuals of a slow exhaustion attack. The system then output the anomaly type as a connection persistence attack, the root cause of which was the abnormal occupancy of the consultation service node's connection pool, and triggered intelligent handling instructions, such as resetting the connection pool, isolating the abnormal instance, or implementing targeted rate limiting upstream.

[0088] The purpose of this step is to transform real-world observations into differences from ideal states, and then compare these differences with various theoretical residuals in the same reference system, thereby distinguishing traditionally easily confused high-concurrency legitimate tides from covert communication failures or attacks, thus achieving an interpretable, traceable, and interconnected abnormal reasoning process.

[0089] Furthermore, to avoid confusion caused by mixing abbreviated and full names of the same technical object in different paragraphs, in the subsequent description of this specification, the ideal trajectory benchmark and the ideal communication trajectory benchmark refer to the same object; the abnormal simulation trajectory refers to the abnormal scenario simulation trajectory obtained after the injection of fault attack knowledge.

[0090] Tidal simulation trajectories refer to business tidal simulation trajectories; actual residuals refer to node difference values, edge difference values, or local subgraph projections in the actual residual space; anomalous theoretical residuals refer to node difference values, edge difference values, or local subgraph projections in the corresponding anomalous theoretical residual space; tidal theoretical residuals refer to node difference values, edge difference values, or local subgraph projections in the business tidal theoretical residual space; whenever the aforementioned abbreviations appear in the following text, their corresponding technical objects and comparative relationships remain unchanged.

[0091] In a preferred embodiment of the present invention, the communication interaction timing data includes source address, destination address, calling method identifier, payload size, request timestamp, and response timestamp; the protocol stack resource status data includes Transmission Control Protocol state transition sequence, User Datagram Protocol socket transmit / receive status sequence, handle occupancy rate, and input / output queue depth.

[0092] This embodiment provides a mechanism for specific constraints on basic data collection fields. Specifically, in the overall process, if only network logs or system status are collected in a general manner, the field granularity may be inconsistent during engineering implementation, resulting in the inaccurate mapping of status nodes. Therefore, this embodiment further clarifies the composition of communication interaction timing data and protocol stack resource status data, so that subsequent ideal benchmarks, residual calculations, and topology similarity determination have stable inputs.

[0093] Specifically, the communication interaction timing data includes at least the source address, destination address, calling method identifier, payload size, request timestamp, and response timestamp; the source address and destination address are used to determine the direction of the call chain; the calling method identifier is used to distinguish different business semantics such as uploading an electrocardiogram, querying test results, and initiating an expert consultation.

[0094] The payload size is used to identify legitimate timeouts caused by large object transfers; request timestamps and response timestamps are used to calculate end-to-end latency and perform time alignment; protocol stack resource status data includes at least the Transmission Control Protocol state transition sequence, User Datagram Protocol socket transmit and receive state sequence, handle occupancy rate, and input / output queue depth.

[0095] Among them, the Transmission Control Protocol state transition sequence can record the state changes during the connection from establishment to closure, the User Datagram Protocol socket send and receive state sequence can reflect the congestion behavior of connectionless messages on the receiving and sending sides, the handle occupancy rate reflects the occupancy of sockets and related resources, and the input and output queue depth reflects the backlog of protocol stacks and application-side pending processing.

[0096] To illustrate this more clearly, a simplified deduction can be made as follows: Assume there are two call records: the first one comes from 10.1.1.8 to 10.1.1.21, with the method identifier "consultation submission," a payload size of 8KB, a request timestamp of 100 milliseconds, and a response timestamp of 160 milliseconds; the second one comes from 10.1.1.21 to 10.1.1.34, with the method identifier "verification query," a payload size of 2KB, a request timestamp of 115 milliseconds, and a response timestamp of 140 milliseconds.

[0097] Therefore, the call latencies can be calculated to be 60 milliseconds and 25 milliseconds, respectively. If, within the same time window, the protocol stack status of the consultation service host shows that the transmission control protocol state transition sequence stays in the established state for an abnormally long time, the handle occupancy rate increases from 40% to 85%, and the input queue depth increases from 5 to 28, it indicates that communication latency alone is insufficient to determine whether the business is busy or the connection release is abnormal. Instead, it is necessary to combine protocol stack resource data for joint modeling.
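The latency arithmetic of the two call records can be reproduced with a short sketch. The class name `CallRecord` and its field names are illustrative assumptions; only the field list and the numeric values come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    src: str
    dst: str
    method: str
    payload_kb: int
    request_ms: int
    response_ms: int

    @property
    def latency_ms(self) -> int:
        # End-to-end latency is the response timestamp minus the request timestamp.
        return self.response_ms - self.request_ms


records = [
    CallRecord("10.1.1.8", "10.1.1.21", "consultation submission", 8, 100, 160),
    CallRecord("10.1.1.21", "10.1.1.34", "verification query", 2, 115, 140),
]
latencies = [r.latency_ms for r in records]
```

The latencies alone say nothing about handle occupancy or queue depth, which is why the protocol stack samples are modeled jointly with them.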

[0098] In one alternative implementation, if some calls use connectionless transmission and lack a complete connection state transition sequence, the mapping can be supplemented by User Datagram Protocol (UDP) socket send / receive state sequence, send / receive buffer usage, and application layer retry behavior; if the source or destination address changes frequently due to container orchestration drift, it can be merged into a stable service instance identifier or logical service name to avoid the same service being misjudged as different nodes due to address changes.

[0099] If there is a cross-host clock discrepancy between the request timestamp and the response timestamp, the latency value can be corrected by using a unified clock source or by adjusting the relative order within the link.

[0100] For example, in the chest pain center platform, when the ECG upload service calls the image storage service, although the payload size increases from the usual 200KB to 1.8MB, the Transmission Control Protocol state transition sequence still progresses normally, and the handle occupancy rate and input / output queue depth do not continue to accumulate. Therefore, this large delay can be attributed to a legitimate large object transmission.

[0101] In scenarios where the consultation service malfunctions, the payload size does not increase significantly, but the connection state remains unchanged for a long time, the handle occupancy rate surges, and the queue depth accumulates. These fields then provide a more discriminative observation basis for subsequent anomaly theoretical residual matching.

[0102] The purpose of this step is to ensure comparability of the same business semantics and the same protocol state across different time windows and different hosts by clearly defining data field boundaries, thereby achieving stability and verifiability of subsequent residual construction.

[0103] In a preferred embodiment of the present invention, the interface contract parsing result is obtained by parsing the interface definition file, service description file, or application interface metadata; the protocol specification parsing result is obtained by parsing the Transmission Control Protocol specification, User Datagram Protocol specification, or Application Layer Communication Protocol specification; the queuing rule includes at least one of the following: first-come, first-served rule, priority queue rule, and connection pool allocation rule.

[0104] Current business inputs include at least one of the following: business request arrival rate, call chain sequence, request type identifier, and load parameters; ideal communication trajectory benchmarks include ideal state transition sequence, ideal processing delay sequence, and ideal resource usage sequence.

[0105] This embodiment provides a refined construction mechanism for an ideal communication trajectory benchmark. Specifically, in the basic scheme, although an ideal communication state model based on interface contracts, protocol specifications, and queuing rules has been proposed, if the sources of these inputs and the output content of the ideal trajectory are not further defined, it is easy to degenerate into manual experience modeling during actual deployment, resulting in an unstable ideal benchmark. Therefore, this embodiment further elaborates on where the model comes from and what benchmark the model outputs.

[0106] Specifically, the interface contract parsing results are preferably extracted from the interface definition file, service description file, or application interface metadata; for example, the interface definition file can be parsed to show that the consultation submission interface must carry the patient identifier, timestamp, and attachment index; the service description file can be parsed to show the call path and return code specifications.

[0107] Application programming interface metadata can be parsed to reveal timeout settings, idempotency requirements, and retry strategies; protocol specification parsing results come from Transmission Control Protocol (TCP) specifications, User Datagram Protocol (UDP) specifications, or application layer communication protocol specifications, and are used to constrain connection establishment, message sending and receiving, retransmission timing, and session lifecycle; queuing rules can use at least one of the following: first-come, first-served rule, priority queue rule, and connection pool allocation rule. For example, critical value push requests from chest pain patients can have a higher priority in the queue than ordinary historical query requests.

[0108] When constructing the current business input, the business request arrival rate, call chain sequence, request type identifier and load parameters can be extracted from the preset time window; to illustrate the data format, assuming that a total of 20 requests arrive in a 10-second time window, of which 12 are critical value pushes and 8 are regular medical record queries, the request arrival rate can be recorded as 2 per second.
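The arrival rate in this example follows directly from the request count and the window length. The sketch below assumes hypothetical request type labels and a helper named `arrival_rate`.

```python
from collections import Counter
from typing import Sequence


def arrival_rate(request_types: Sequence[str], window_seconds: float) -> float:
    """Requests per second observed in one preset time window."""
    return len(request_types) / window_seconds


# 12 critical value pushes and 8 regular medical record queries in 10 seconds.
window = ["critical_value_push"] * 12 + ["record_query"] * 8
rate = arrival_rate(window, 10.0)
mix = Counter(window)
```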

[0109] The call chain sequence can be simplified to L1: triage → medical record → consultation, and L2: triage → test → consultation; the request type identifier can be T1 or T2; the load parameter can represent the attachment size or message body size; after feeding these inputs into the ideal communication state model, the ideal state transition sequence, ideal processing delay sequence, and ideal resource usage sequence can be output.

[0110] For example, in an ideal state transition sequence, a T1 request should pass sequentially through access authentication, service processing, result write-back, and connection release; in an ideal processing latency sequence, the three stages on the L1 link correspond to 10 milliseconds, 15 milliseconds, and 12 milliseconds respectively; in an ideal resource utilization sequence, the peak connection pool utilization rate should not exceed 40%.

[0111] As a further micro-level deduction, assume that the current business input contains only two types of requests: T1 critical value pushes at 1 per second and T2 medical record queries at 2 per second.

[0112] According to the priority queue rules, T1 is processed first. If the ideal model measures the ideal processing latency of T1 at nodes A, B, and C as 8 milliseconds, 10 milliseconds, and 6 milliseconds, respectively, and that of T2 as 5 milliseconds, 7 milliseconds, and 4 milliseconds, then under the same arrival rate, ideally there will be no long queues or connection pool exhaustion. The ideal trajectory benchmark constructed in this way is not a historical average, but rather the upper limit of the benchmark that should be reached when the interface and protocol are compliant, queuing resources are sufficient, and there are no fault disturbances.
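A rough check of why no long queue forms in this deduction can be sketched as follows. It assumes, purely for illustration, that each of nodes A, B, and C is a single server and that the busy fraction of a node is the sum of arrival rate times processing time over the request types; these modeling choices and all names are not taken from the specification.

```python
# Ideal per-node processing latency in milliseconds, from the deduction above.
STAGE_MS = {
    "T1": {"A": 8, "B": 10, "C": 6},
    "T2": {"A": 5, "B": 7, "C": 4},
}
# Arrival rates in requests per second.
RATE_PER_S = {"T1": 1, "T2": 2}


def end_to_end_ms(req_type: str) -> int:
    """Ideal latency of one request across nodes A, B, and C."""
    return sum(STAGE_MS[req_type].values())


def node_utilization(node: str) -> float:
    """Fraction of each second the node is busy (single-server assumption)."""
    return sum(RATE_PER_S[t] * STAGE_MS[t][node] / 1000 for t in STAGE_MS)


utilization = {node: node_utilization(node) for node in ("A", "B", "C")}
```

Under these assumptions every node is busy for well under 5% of the time, which is consistent with the statement that the ideal trajectory shows neither long queues nor connection pool exhaustion.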

[0113] In one alternative implementation, if multiple versions of the interface are released in parallel online, the contract can be parsed separately according to the traffic tags and multiple sets of ideal trajectory benchmarks can be established to avoid deviations caused by mixing old and new interfaces; if some call chains are missing in the interface definition file, the minimum necessary path can be backfilled from the service description file or application interface metadata.

[0114] If a complete link still cannot be formed, a local ideal baseline can be generated only for known nodes; if the queuing rules are inconsistent in different services, they can be configured separately according to the service dimension. For example, the consultation service uses a priority queue, the medical record query service uses a first-come, first-served service, and the connection pool allocation rules take effect separately at the gateway layer.

[0115] For example, in the emergency consultation process of the chest pain center platform, after the emergency doctor uploads the electrocardiogram, the system needs to first call the medical record service to complete the patient's medical history, and then call the consultation service to assign a cardiology expert.

[0116] Interface contract parsing reveals that medical record completion must be completed before consultation submission; protocol specification parsing determines that the call chain uses a long connection of the Transmission Control Protocol; and queuing rules stipulate that critical value consultations take priority over ordinary consultations. Based on this, the ideal trajectory benchmark can provide the ideal state transition order, ideal processing latency, and ideal connection pool occupancy range of the link under no abnormal conditions, providing a stable reference for subsequent abnormal reasoning.

[0117] The purpose of this step is to improve the ideal benchmark from a fuzzy empirical value to a deterministic trajectory that is analyzable, reproducible, and updatable through a three-layer joint constraint of contract, protocol, and queuing, thereby providing effective support for the subsequent differencing process.

[0118] In a preferred embodiment of the present invention, the fault attack knowledge consists of at least one of historical fault samples, an attack rule base, and a protocol anomaly pattern base; based on the preset fault attack knowledge, parameterized perturbation injection is performed on the ideal communication state model to generate anomaly scenario simulation trajectories, including: injecting service processing rate attenuation rules into the ideal communication state model to generate fault simulation trajectories; injecting connection maintenance rules and window shrinking rules into the ideal communication state model to generate attack simulation trajectories; and summarizing the fault simulation trajectories and attack simulation trajectories into anomaly scenario simulation trajectories;

[0119] Based on preset service load fluctuation rules, parameterized disturbance injection is performed on the ideal communication state model to generate service tidal simulation trajectory, including: injecting the periodic fluctuation rules or sudden increase and decrease rules of service request arrival rate into the ideal communication state model to generate service tidal simulation trajectory.

[0120] This embodiment provides a parameterized perturbation injection mechanism; specifically, after the ideal baseline has been established, if the actual observation is directly subtracted from the ideal trajectory, although the actual residual can be obtained, it still cannot answer the question of which anomaly the residual resembles.

[0121] Therefore, this embodiment further introduces fault attack knowledge and business load fluctuation rules to controllably perturb the ideal model and actively synthesize multiple theoretical residual templates to solve the problem of insufficient interpretability of anomaly root causes;

[0122] Specifically, fault attack knowledge can come from historical fault samples, attack rule bases, and protocol anomaly pattern bases; for example, patterns of continuous decline in processing rate caused by memory leaks in the consultation service can be summarized from historical fault samples.

[0123] The attack rule base can summarize the slow exhaustion pattern of long-term connection occupation without release; the protocol anomaly pattern base can summarize abnormal protocol behaviors such as continuous shrinking of the receive window and repeated connection maintenance; based on this knowledge, parameterized perturbation injection is performed on the ideal communication state model.

[0124] For fault-prone scenarios, service processing rate attenuation rules can be injected. To make it easier to understand, a service that can ideally process 10 requests per second can be reduced to 8 requests per second in the first time window, 6 requests per second in the second time window, and 4 requests per second in the third time window after the attenuation rule is injected.

[0125] Even if the request arrival rate remains unchanged, the queue length and response latency will increase window by window, forming a fault simulation trajectory. For attack scenarios, connection hold rules and window shrinking rules can be injected. For example, in an ideal state, a connection is released within 50 milliseconds after completing the response, but in attack simulations, some connections remain established for 500 milliseconds. Or, after the receiving window shrinks, the data transmission efficiency gradually decreases. After summarizing the fault simulation trajectory and the attack simulation trajectory, multiple sets of abnormal scenario simulation trajectories can be formed.
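The effect of the attenuation rule on the queue can be illustrated with a simple fluid-style backlog model. The arrival rate of 9 requests per second, the one-second window, and the function name `simulate_backlog` are assumptions chosen for the sketch; only the capacity sequence 10, 8, 6, 4 comes from the description.

```python
from typing import List, Sequence


def simulate_backlog(arrival_rate: float,
                     capacities: Sequence[float],
                     window_s: float = 1.0) -> List[float]:
    """Queue backlog at the end of each window for a given capacity sequence."""
    backlog = 0.0
    trajectory = []
    for capacity in capacities:
        # Backlog grows by the excess of arrivals over capacity, never below zero.
        backlog = max(0.0, backlog + (arrival_rate - capacity) * window_s)
        trajectory.append(backlog)
    return trajectory


ideal = simulate_backlog(9, [10, 10, 10])   # no attenuation injected
faulty = simulate_backlog(9, [8, 6, 4])     # attenuation rule injected
residual = [f - i for f, i in zip(faulty, ideal)]
```

With an unchanged arrival rate the backlog rises window by window, which is the shape the fault simulation trajectory is meant to capture.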

[0126] On the other hand, in order to prevent legitimate peaks from being misjudged as abnormal, business load fluctuation rules should be injected into the ideal model to generate business tide simulation trajectories; periodic fluctuation rules can indicate that the request arrival rate increases from 8:00 to 10:00 and 14:00 to 16:00 every day, and decreases at night;

[0127] Sudden surge and drop rules can represent, for example, a short-term surge caused by a large number of patients being transferred in the emergency room, or a local burst caused by a large number of test results being written back. For instance, if there are 2 requests per second in an ideal state, injecting tidal rules can change it to 3, 5, and 3 requests per second in three consecutive time windows. Although the processing latency will increase, if the protocol state continues to flow normally and the connection can be released normally, the simulation trajectory should be closer to a normal tide than an attack or failure.
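The two kinds of tide rules can be sketched as small rate generators. The peak hours and the 2, 3, 5, 3 figures come from the description; the multipliers `peak_factor` and `night_factor`, the night hours, and the function names are hypothetical values used only to make the example concrete.

```python
from typing import List, Sequence

# Daily peak periods named in the description: 8:00 to 10:00 and 14:00 to 16:00.
PEAK_WINDOWS = [(8, 10), (14, 16)]


def periodic_rate(base_rate: float, hour: int,
                  peak_factor: float = 2.0,
                  night_factor: float = 0.5) -> float:
    """Arrival rate after applying a periodic fluctuation rule (assumed factors)."""
    if any(start <= hour < end for start, end in PEAK_WINDOWS):
        return base_rate * peak_factor
    if hour >= 22 or hour < 6:  # assumed definition of night
        return base_rate * night_factor
    return base_rate


def burst_residual(ideal_rate: float, injected: Sequence[float]) -> List[float]:
    """Arrival-rate difference between a burst trajectory and the ideal rate."""
    return [rate - ideal_rate for rate in injected]


surge = burst_residual(2, [3, 5, 3])
```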

[0128] In one alternative implementation, if there are insufficient historical fault samples, making it impossible to directly form a parameter range for a certain fault mode, then discrete disturbance levels can be constructed first based on the attack rule base or protocol anomaly pattern base, such as three levels: mild, moderate, and severe, and then gradually corrected through online feedback.

[0129] If a certain business tidal rule deviates significantly from the actual rhythm, such as when a hospital temporarily holds a large-scale free clinic event, causing the traffic peak to differ from the historical rhythm, the periodic range and amplitude of the business request arrival rate can be dynamically updated without changing the anomaly rules. If a certain anomaly has both fault and attack characteristics, multiple perturbation operators can be superimposed and injected to form a composite anomaly scenario simulation trajectory, which will then compete in the subsequent similarity judgment.

[0130] For example, in the chest pain center platform, the consultation service once experienced a window-by-window decline in processing rate caused by cache leakage. Historical samples show that its average processing capacity dropped from 12 times per second to 5 times per second. In this embodiment, this pattern can be injected into the ideal model to obtain the fault simulation trajectory.

[0131] In another scenario, a boundary device within the hospital suffers from slow connection exhaustion. The attack rule base shows that its characteristics are longer connection holding time and a continuously shrinking receiving window. Based on this, this embodiment generates an attack simulation trajectory. At the same time, the hospital's outpatient and emergency department admissions are a typical business tide every morning. This embodiment further generates a tide simulation trajectory through periodic fluctuation rules. In this way, during subsequent judgment, the actual residual can be compared with the fault template, attack template, and tide template simultaneously.

[0132] The purpose of this mechanism is to transform the originally abstract anomalous experience into injectable, simulable, and comparable parameterized trajectories, thereby enabling root cause localization to shift from subjective inference based on appearances to directly comparing real-world observation features with theoretical templates.

[0133] In specific engineering implementation, considering that the parametric extrapolation of colored Petri nets is prone to encountering computing power bottlenecks within a very short preset time window, the abnormal scenario simulation trajectories, the business tide simulation trajectories, and their corresponding theoretical residual spaces can be generated by offline concurrent simulation and cached as a preset theoretical residual template library.

[0134] When making judgments online, the system only needs to extract real-time business inputs to generate the actual residual space and match it directly against the preset theoretical residual template library, thereby effectively controlling system overhead and ensuring the real-time performance and feasibility of monitoring and judgment in high-concurrency scenarios.

[0135] In a preferred embodiment of the present invention, the actual residual space, the theoretical residual spaces of each anomaly, and the theoretical residual space of business tides are all multi-dimensional time series graph feature spaces composed of state transition features, latency features, and resource occupancy features.

[0136] Among them, obtaining the real residual space includes: performing time alignment and state node mapping on communication interaction time series data and protocol stack resource state data to generate real observation features;

[0137] A real residual space is generated based on the difference between the real observation features and the ideal communication trajectory benchmark under the preset time window and state node mapping. Obtaining at least one anomaly theoretical residual space, and calculating the difference between the business tide simulation trajectory and the ideal communication trajectory benchmark to obtain the business tide theoretical residual space, includes: generating an anomaly theoretical residual space based on the difference between the abnormal scenario simulation trajectory and the ideal communication trajectory benchmark under the preset time window and state node mapping;

[0138] and generating a business tide theoretical residual space based on the difference between the business tide simulation trajectory and the ideal communication trajectory benchmark under the preset time window and state node mapping.

[0139] This embodiment provides a residual space construction mechanism. Specifically, after multiple simulation trajectories are available, if each indicator is subtracted separately, a problem arises: latency, state transition, and resource consumption are treated independently of each other, making it difficult to reflect the linkage of the same anomaly across multiple nodes.

[0140] Therefore, in this embodiment, the actual residual space, the anomaly theoretical residual space, and the business tide theoretical residual space are uniformly represented as a multi-dimensional time series graph feature space to preserve time relationships, structural relationships, and resource relationships.

[0141] Specifically, the so-called multidimensional time sequence graph feature space can be understood as follows: the nodes in the graph represent state nodes, such as connection establishment, request queuing, service processing, response sending, and resource release; the edges in the graph represent the state transition direction and call chain relationship; each node or edge carries latency characteristics, resource usage characteristics, and statistical values.

[0142] To illustrate this, suppose that within a 5-second time window, only three state nodes S1, S2, and S3 are selected, where S1 represents connection establishment complete, S2 represents service processing in progress, and S3 represents connection release complete. In the ideal trajectory baseline, the characteristics of these three nodes are as follows: S1 takes 5 milliseconds, S2 takes 15 milliseconds, and S3 has a socket occupancy rate of 30%.

[0143] In the actual observation features, the corresponding values are: S1 takes 8 milliseconds, S2 takes 40 milliseconds, and S3 socket occupancy rate is 82%. Based on this, the node difference values in the actual residual space can be generated, that is, the differences of +3 milliseconds, +25 milliseconds and +52% are formed on the three nodes S1, S2 and S3 respectively, while retaining the state transition edge S1→S2→S3.

[0144] Similarly, if the difference between the simulated trajectory of an abnormal scenario and the ideal trajectory is +2 milliseconds, +23 milliseconds, and +55%, then an abnormal theoretical residual space can be formed; if the difference between the simulated trajectory of business tides is +4 milliseconds, +24 milliseconds, and +18%, then a business tide theoretical residual space can be formed.

[0145] This simplified example shows that both can be close to the actual residual in terms of latency characteristics, but in terms of resource consumption, the theoretical residual of anomalies is closer to the actual residual, while the theoretical residual of business tides has significantly lower resource consumption on the connection release node; therefore, using a multi-dimensional time series diagram instead of a single indicator vector can more clearly represent the anomaly pattern.
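The S1, S2, S3 example can be reproduced numerically. The sketch below uses a plain sum of absolute differences as a stand-in comparison, which is an assumption for illustration only (it mixes milliseconds and percentage points and ignores the graph edges); the specification itself compares residual spaces by topological similarity.

```python
from typing import List, Sequence


def residual(observed: Sequence[float], ideal: Sequence[float]) -> List[float]:
    """Node difference values: observed feature minus ideal feature."""
    return [o - i for o, i in zip(observed, ideal)]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two residual vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Feature order: S1 latency (ms), S2 latency (ms), S3 socket occupancy (%).
ideal = [5, 15, 30]
observed = [8, 40, 82]
actual_residual = residual(observed, ideal)

templates = {"anomaly": [2, 23, 55], "business_tide": [4, 24, 18]}
distances = {name: l1_distance(actual_residual, t) for name, t in templates.items()}
closest = min(distances, key=distances.get)
```

The two templates are nearly indistinguishable on the latency nodes; the gap comes almost entirely from the occupancy value on S3.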

[0146] In terms of time alignment, the request and response timestamps in the communication interaction time sequence data and the sampling timestamps in the protocol stack resource status data can be merged into a unified time bucket; for example, by using 100 milliseconds as a bucket, the call records, status sequences and resource sampling values in the same time bucket can be merged and then mapped to the corresponding status nodes.
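Bucketing by 100 milliseconds can be sketched with integer division on the timestamp. The event tuples and the function name `bucket_events` are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_events(events: Iterable[Tuple[int, str]],
                  bucket_ms: int = 100) -> Dict[int, List[str]]:
    """Group (timestamp_ms, payload) events by time bucket index."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for ts_ms, payload in events:
        # Events whose timestamps fall in the same 100 ms span share a bucket.
        buckets[ts_ms // bucket_ms].append(payload)
    return dict(buckets)


merged = bucket_events([
    (100, "call:request"),
    (160, "call:response"),
    (130, "stack:handle_occupancy"),
    (230, "stack:queue_depth"),
])
```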

[0147] Regarding state node mapping, a mapping table between observation indicators and model nodes can be pre-established. For example, handle occupancy rate can be mapped to resource release node, input / output queue depth can be mapped to service processing node, and the duration of waiting to close state in the transmission control protocol state transition can be mapped to connection closing node. As long as the real data and simulation data are organized according to the same mapping table, their residual space can be in the same comparison coordinate system.
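A mapping table of this kind can be as simple as a dictionary shared by the real data path and the simulation path. The metric keys and node names below are hypothetical identifiers; metrics without an entry are returned separately so that they can be handled as extension nodes or ignored.

```python
from typing import Dict, List, Tuple

# Hypothetical identifiers for the three mappings named in the description.
METRIC_TO_NODE = {
    "handle_occupancy": "resource_release",
    "io_queue_depth": "service_processing",
    "tcp_wait_close_ms": "connection_closing",
}


def map_to_nodes(sample: Dict[str, float]) -> Tuple[Dict[str, Dict[str, float]], List[str]]:
    """Attach each observed metric to its model node; collect unmapped metrics."""
    mapped: Dict[str, Dict[str, float]] = {}
    unmapped: List[str] = []
    for metric, value in sample.items():
        node = METRIC_TO_NODE.get(metric)
        if node is None:
            unmapped.append(metric)
        else:
            mapped.setdefault(node, {})[metric] = value
    return mapped, unmapped


nodes, extras = map_to_nodes(
    {"handle_occupancy": 0.85, "io_queue_depth": 28, "retry_queue_len": 3}
)
```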

[0148] In an alternative implementation, if certain state nodes have no observations in the current time window, for example, if a service does not perform a connection closure action, the node can be marked as inactive, and the node can be retained in the graph but not participate in the numerical difference calculation in this round.

[0149] If additional nodes appear in real-world observations but do not have corresponding states in the ideal trajectory, for example, if a middleware introduces a retry queue, these nodes can be attached as extension nodes to the nearest upstream state node and marked as real-world extension nodes for subsequent structural similarity algorithms to identify. If there are many anomaly theoretical residual spaces, they can be grouped according to anomaly categories, and only representative templates from each group can be retained for the first round of coarse matching to reduce computational burden.

[0150] For example, in the chest pain center platform, when the consultation service is abnormal, actual observation shows that the dwell time of requests that have entered the service processing node increases significantly, while the handle occupancy rate at the connection release completion node remains high for a long time.

[0151] During the morning peak hours in the outpatient and emergency departments, although the dwell time at the service processing node also increases, the resource occupancy at the connection release completion node usually falls in step with the processing of requests. By placing these characteristics in the multi-dimensional time series graph residual space, the system can distinguish between two different forms: a global queue increase in which the state still flows normally, and an abnormal accumulation caused by poor release at specific nodes.

[0152] The purpose of this step is to reconstruct the observations scattered across different logs and monitoring sources into a unified residual representation with structural relationships, thereby achieving complete preservation of anomaly morphology and making subsequent similarity calculations operable.

[0153] In a preferred embodiment of the present invention, topological similarity includes temporal similarity, structural similarity, and a coupling score determined based on temporal similarity and structural similarity; based on the topological similarity between the actual residual space and each anomaly theoretical residual space and the business tide theoretical residual space, anomaly type results and root cause results are generated, including: calculating the temporal similarity between the actual residual space and each anomaly theoretical residual space and the business tide theoretical residual space based on dynamic time warping;

[0154] The structural similarity between the actual residual space and each anomaly theoretical residual space and the business tide theoretical residual space is calculated based on graph edit distance. The temporal similarity and structural similarity are weighted and summed based on a predetermined weight allocation to obtain the coupling score corresponding to each anomaly theoretical residual space and to the business tide theoretical residual space.

[0155] The anomaly theoretical residual space with the highest coupling score is selected as the target anomaly theoretical residual space; based on the coupling score corresponding to the target anomaly theoretical residual space and the coupling score corresponding to the business tide theoretical residual space, the anomaly type result and root cause result are determined.

[0156] This embodiment provides a coupled scoring decision mechanism. Specifically, multiple residual spaces have been obtained, but using only a single similarity index still has bottlenecks: if only numerical sequences are compared, it is easy to ignore the differences in call chain structure; if only graph structure is compared, the time delay stretching phenomenon in different time windows may be ignored. Therefore, this embodiment further decomposes topological similarity into temporal similarity and structural similarity, and forms a coupled score through weighting.

[0157] Specifically, the temporal similarity is preferably calculated based on dynamic time warping, which is suitable for comparing temporal curves with inconsistent lengths and local stretching and compression. For example, suppose that over three consecutive preset time windows the node differences of the actual residual at the service processing node are 20, 35, and 30, those of a certain anomaly theoretical residual are 18, 33, and 29, and those of the business tide theoretical residual are 22, 36, and 15.

[0158] Although the three sequences are of the same length, if the actual response is slightly delayed in the second preset time window, dynamic time warping can still find the minimum cumulative distance through the alignment path; the smaller the cumulative distance, the higher the temporal similarity. Structural similarity is preferably calculated based on graph edit distance, that is, by counting the node additions / deletions, edge additions / deletions, or label replacements required to transform one residual graph into another.
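As an illustration only, the temporal comparison in the example above can be sketched with a textbook dynamic time warping recurrence. The absolute difference is assumed as the local cost, and the function name `dtw_distance` is hypothetical; the embodiment does not prescribe either.

```python
def dtw_distance(a, b):
    """Cumulative DTW distance between two numeric sequences.

    Assumption: the local cost is the absolute difference of node differences.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = minimum cumulative cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


actual = [20, 35, 30]    # actual residual, three time windows
anomaly = [18, 33, 29]   # anomaly theoretical residual
tide = [22, 36, 15]      # business tide theoretical residual

print(dtw_distance(actual, anomaly))  # 5.0
print(dtw_distance(actual, tide))     # 18.0
```

With these figures the anomaly template yields the smaller cumulative distance, so after normalization it would receive the higher temporal similarity.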

[0159] Considering that rigorous graph edit distance calculation is an NP-hard problem in large-scale multidimensional time series graphs, in order to meet the requirements of rapid response within the monitoring time window, in specific implementation, the calculation based on graph edit distance can adopt the restricted graph edit distance algorithm, or the heuristic approximation algorithm based on bipartite graph matching can be used to quantify structural differences.

[0160] If the real residual graph corresponding to the real residual space and the anomaly theoretical residual graph corresponding to the anomaly theoretical residual space both present a structure of S1→S2→S3 in which S3 shows high occupancy and stagnation, they can be converted into each other with few edit operations, so the structural similarity is high; if the business tide theoretical residual graph presents a structure of global queuing and diffusion along S1→S2→S3→S4, the graph edit distance is greater.
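The structural comparison can be illustrated with a deliberately restricted edit distance. The sketch below assumes that node identity is already fixed by the state node mapping, so no node matching has to be searched; this is a simplification, not a general graph edit distance. The function name `restricted_ged` and the labels `"normal"`, `"stagnant"`, and `"queued"` are hypothetical.

```python
def restricted_ged(nodes_a, edges_a, nodes_b, edges_b):
    """Edit distance between two residual graphs that share state node identifiers.

    nodes_*: dict mapping node id -> label; edges_*: set of (source, target) pairs.
    Each node insertion/deletion, edge insertion/deletion, and label replacement costs 1.
    """
    node_ops = len(set(nodes_a) ^ set(nodes_b))
    relabel_ops = sum(
        1 for n in set(nodes_a) & set(nodes_b) if nodes_a[n] != nodes_b[n]
    )
    edge_ops = len(set(edges_a) ^ set(edges_b))
    return node_ops + relabel_ops + edge_ops


real_nodes = {"S1": "normal", "S2": "normal", "S3": "stagnant"}
real_edges = {("S1", "S2"), ("S2", "S3")}

anomaly_nodes = {"S1": "normal", "S2": "normal", "S3": "stagnant"}
anomaly_edges = {("S1", "S2"), ("S2", "S3")}

tide_nodes = {"S1": "queued", "S2": "queued", "S3": "queued", "S4": "queued"}
tide_edges = {("S1", "S2"), ("S2", "S3"), ("S3", "S4")}

print(restricted_ged(real_nodes, real_edges, anomaly_nodes, anomaly_edges))  # 0
print(restricted_ged(real_nodes, real_edges, tide_nodes, tide_edges))        # 5
```

Because the node correspondence is given, the cost is computed in linear time, which is one way to stay within the monitoring time window.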

[0161] To facilitate understanding, a simplified deduction can be performed. Assume that the temporal similarity between the actual residual space and a certain anomaly theoretical residual space is 0.90 and the structural similarity is 0.86, and that the temporal similarity with the business tide theoretical residual space is 0.88 and the structural similarity is 0.60.

[0162] If the weighting rule takes a temporal similarity weight of 0.6 and a structural similarity weight of 0.4, then the coupling score of the anomaly theoretical residual space is obtained by weighting the temporal similarity of 0.90 and the structural similarity of 0.86 according to these weights, that is, 0.6 × 0.90 + 0.4 × 0.86 = 0.884. The coupling score of the business tide theoretical residual space is obtained by weighting the temporal similarity of 0.88 and the structural similarity of 0.60 according to the same weights, that is, 0.6 × 0.88 + 0.4 × 0.60 = 0.768.
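The weighted summation in this deduction can be checked with a few lines of code. The function name `coupling_score` is hypothetical, and the requirement that the two weights sum to 1 is an assumption made here so that the score stays between 0 and 1.

```python
def coupling_score(temporal, structural, w_temporal=0.6, w_structural=0.4):
    """Weighted sum of normalized temporal and structural similarity."""
    # Assumption: weights sum to 1 so the result stays in [0, 1].
    if abs(w_temporal + w_structural - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return w_temporal * temporal + w_structural * structural


print(round(coupling_score(0.90, 0.86), 3))  # 0.884 (anomaly template)
print(round(coupling_score(0.88, 0.60), 3))  # 0.768 (business tide template)
```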

[0163] Furthermore, the temporal similarity weight and structural similarity weight in the weighted summation calculation can be predetermined by the analytic hierarchy process, or obtained by pre-training based on the information gain contribution of the temporal and structural features in the historical operation and maintenance failure samples to the final root cause determination result. It also allows system administrators to customize the configuration according to the business preferences of specific microservices that are more sensitive to latency or more sensitive to resource consumption.

[0164] At this point, the anomaly theoretical residual space with the highest coupling score is selected as the target anomaly theoretical residual space. Its coupling score is then compared with that of the business tide theoretical residual space to output the anomaly type result and root cause result.

[0165] To explain further, if there are multiple anomaly theoretical residual spaces, such as A1 processing rate decay fault, A2 connection persistence attack, and A3 window shrinkage attack, the coupling scores between the actual residual and A1, A2, and A3 are first calculated respectively, and the highest one is selected.

[0166] Assuming A1 scores 0.72, A2 scores 0.91, and A3 scores 0.85, the target anomaly theoretical residual space is A2. Then, the score of 0.91 for A2 is compared with the score of the business tide theoretical residual space, for example, 0.78, to lay the foundation for subsequent threshold-based decisions.
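The selection step in this example amounts to taking the maximum over the anomaly templates. A minimal sketch, with the hypothetical helper name `select_target_template`:

```python
def select_target_template(anomaly_scores):
    """Return (template id, coupling score) of the best-matching anomaly template."""
    best = max(anomaly_scores, key=anomaly_scores.get)
    return best, anomaly_scores[best]


scores = {"A1": 0.72, "A2": 0.91, "A3": 0.85}
target, target_score = select_target_template(scores)
print(target, target_score)  # A2 0.91
```

The business tide score (0.78 in the example) is kept out of this maximum on purpose, since it is compared separately in the threshold-based decision.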

[0167] Furthermore, to avoid directly mixing distance and similarity metrics, in engineering implementation, the cumulative distance obtained from dynamic time warping and the graph edit distance can be mapped to a unified similarity interval before performing the weighted summation. That is, the dynamic time warping distance or graph edit distance itself can be used as the original difference metric, which is then converted into a similarity value between 0 and 1 through a preset normalization rule. The closer the value is to 1, the more similar the two residual spaces are, and the closer it is to 0, the less similar they are.

[0168] In this way, the temporal similarity and structural similarity used in the subsequent coupling score are on the same scale, avoiding the situation in which one quantity is better when smaller and the other is better when larger, which would make direct weighting impossible.

[0169] For example, if the dynamic time warping cumulative distance of a certain anomaly theoretical residual is less than that of another theoretical residual, then its temporal similarity after normalization is higher; if its graph edit distance is also smaller, then its structural similarity after normalization is also higher, and the two can directly participate in the same coupling score calculation.

[0170] Furthermore, the normalization rule can be preset according to the statistical boundary of the template set; for example, the maximum and minimum values of a certain distance index can be obtained from all theoretical residual templates currently participating in the comparison, and the current distance can be mapped to a similarity according to this interval.

[0171] If the maximum and minimum values are the same, the similarity corresponding to the index is taken as 1, indicating that the templates in this round are indistinguishable in this dimension, thus avoiding calculation anomalies caused by a denominator of 0; extreme distance values that exceed the existing statistical boundaries can be directly truncated to the endpoint of the similarity interval to ensure that the coupling score falls stably between 0 and 1.
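One possible form of the normalization rule is a linear min-max mapping with the two guards described above. The linear form and the function name `distance_to_similarity` are assumptions; the embodiment only requires that the distance be mapped to the interval between 0 and 1.

```python
def distance_to_similarity(distance, d_min, d_max):
    """Map a raw distance to a similarity in [0, 1]; smaller distance means higher similarity.

    Assumption: linear min-max mapping over the current template set.
    """
    if d_max == d_min:
        # Templates are indistinguishable in this dimension; avoid dividing by 0.
        return 1.0
    # Truncate values that fall outside the statistical boundary.
    clipped = min(max(distance, d_min), d_max)
    return 1.0 - (clipped - d_min) / (d_max - d_min)


print(distance_to_similarity(5, 5, 18))    # 1.0
print(distance_to_similarity(18, 5, 18))   # 0.0
print(distance_to_similarity(100, 5, 18))  # 0.0 (truncated)
print(distance_to_similarity(3, 3, 3))     # 1.0 (degenerate interval)
```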

[0172] In one alternative implementation, if the dynamic time warping calculation finds that the data length of a certain time window is too short, for example, only a single sampling point, then the Euclidean distance or Manhattan distance can be used to approximate the calculation of time series similarity.

[0173] If unmapped extended nodes are encountered during graph editing distance calculation, lower weights can be set for the extended nodes to avoid excessive influence of newly introduced secondary nodes on the overall structure judgment. If the temporal similarity is high but the structural similarity is low, it indicates that the real phenomena are similar in numerical terms but different in propagation paths. In this case, the system can output intermediate results of candidate anomalies that require manual review, instead of directly linking to high-risk handling.

[0174] For example, in the chest pain center platform, during an abnormal consultation service, the actual residual showed that the processing time was continuously prolonged, the connection release was delayed, and the handle occupancy rate remained high in multiple consecutive time windows; after the actual residual and the connection persistence attack theoretical residual were aligned by dynamic time warping, the alignment error was small.

[0175] Meanwhile, in terms of graph structure, both are characterized by high occupancy and stagnation at the consultation service node and a shrinking release path, so the coupling score is the highest. In contrast, although the business tide theoretical residual is somewhat similar in terms of latency, its graph structure is closer to global queuing and diffusion, so its overall score is lower. As a result, the system outputs the anomaly type as a connection persistence attack, whose root cause is abnormal occupancy of the consultation service connection pool.

[0176] The purpose of this mechanism is to utilize both temporal and structural information simultaneously to improve the robustness of anomaly detection, thereby achieving more stable anomaly type identification and root cause mapping.

[0177] Furthermore, to ensure the uniqueness of the symbol meaning throughout the text, in this embodiment, dynamic time warping and graph edit distance are used only as intermediate difference quantities and are not to be confused with the coupling score. If the relevant quantities are symbolically represented, the normalized temporal similarity can be denoted as (St), the normalized structural similarity can be denoted as (Ss), and the coupling score formed by the weighted sum of the two can be denoted as (Sc). Among them, (St), (Ss), and (Sc) all represent normalized similarity values ranging from 0 to 1, and the larger the value, the higher the similarity.

[0178] Correspondingly, the dynamic time warping cumulative distance and the graph edit distance only represent the original distance measures used to generate (St) and (Ss), and they are not denoted by the same symbols as (St), (Ss), and (Sc), so that no symbol represents both a distance and a similarity. Apart from denoting the two candidate coupling scores, namely the coupling score of the target anomaly theoretical residual space and the coupling score of the business tide theoretical residual space, (Sc) is not used elsewhere in this text to refer to a distance or any other unrelated quantity, thereby ensuring consistency between symbols and their technical meanings.

[0179] In a preferred embodiment of the present invention, the anomaly type result and root cause result are determined based on the coupling score corresponding to the residual space of the target anomaly theory and the coupling score corresponding to the residual space of the business tide theory, including:

[0180] When the coupling score corresponding to the target anomaly theoretical residual space is greater than the preset anomaly confidence threshold and not less than the coupling score corresponding to the business tide theoretical residual space, the anomaly type result and root cause result corresponding to the target anomaly theoretical residual space are determined; when the coupling score corresponding to the business tide theoretical residual space is greater than the preset tide confidence threshold and greater than the coupling score corresponding to the target anomaly theoretical residual space, the current state is determined to be normal business tide.

[0181] If any of the above judgment conditions are not met, the current state is determined to be a pending confirmation state, and continued collection and repeated judgment are triggered; wherein, the preset abnormal confidence threshold and the preset tidal confidence threshold are predetermined by at least one of the following methods: historical sample statistical calibration, training set calibration or empirical parameter calibration.

[0182] This embodiment provides a judgment mechanism with a boundary fallback. Specifically, after obtaining the target anomaly theoretical residual space and its coupled scores, if there is a lack of clear thresholds and branching rules, it is easy to make a mistake in engineering by only selecting the highest score when the highest score itself is not high enough or when the anomaly is close to the tidal score. Therefore, this embodiment constructs a complete judgment closed loop through anomaly confidence threshold, tidal confidence threshold and pending confirmation status.

[0183] Specifically, consider the coupling score of the target anomaly theoretical residual space and the coupling score of the business tide theoretical residual space. The determination is not simply based on the larger of the two values, but also on whether each value exceeds its respective threshold.

[0184] For example, suppose the preset anomaly confidence threshold is 0.88 and the preset tide confidence threshold is 0.83. When the coupling score of the target anomaly theoretical residual space is greater than 0.88 and not less than the coupling score of the business tide theoretical residual space, the anomaly type result and root cause result corresponding to the target anomaly theoretical residual space are determined.

[0185] If the coupling score of the business tide theoretical residual space is greater than 0.83 and greater than the coupling score of the target anomaly theoretical residual space, the current state is determined to be a normal business tide; if neither score satisfies its respective judgment condition, the current state is the pending confirmation state.
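The three-branch judgment can be written down compactly. The sketch below uses the example thresholds 0.88 and 0.83 as defaults; the function name `decide` and the returned status strings are hypothetical.

```python
def decide(anomaly_score, tide_score, anomaly_threshold=0.88, tide_threshold=0.83):
    """Three-branch judgment: confirmed anomaly, normal business tide, or pending."""
    # Anomaly branch: above the anomaly threshold and not less than the tide score.
    if anomaly_score > anomaly_threshold and anomaly_score >= tide_score:
        return "anomaly"
    # Tide branch: above the tide threshold and strictly greater than the anomaly score.
    if tide_score > tide_threshold and tide_score > anomaly_score:
        return "normal_tide"
    # Fallback: keep collecting data and judge again.
    return "pending"


print(decide(0.91, 0.78))  # anomaly
print(decide(0.76, 0.89))  # normal_tide
print(decide(0.84, 0.82))  # pending
```

The asymmetry between "not less than" in the anomaly branch and "greater than" in the tide branch means that a tie above both thresholds resolves to the anomaly branch.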

[0186] In the pending confirmation state, the system will not suspend or terminate the judgment process, but will trigger continued data collection and repeated judgment; continued data collection can be manifested by extending the time window, for example from 5 seconds to 15 seconds, or increasing the sampling frequency, for example from once per second to 5 times per second.

[0187] When making repeated judgments, a moving average can be applied to the coupling scores of multiple consecutive time windows to reduce the impact of short-term fluctuations; for example, if in three consecutive time windows the coupling score of the target anomaly theoretical residual space is 0.84, 0.87, and 0.90, respectively, and the coupling score of the business tide theoretical residual space is 0.80, 0.81, and 0.79, respectively, then although the first time window does not reach the threshold, the abnormal trend can be gradually confirmed after continued data collection.
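A minimal sketch of the smoothing step follows. The window length of 2 is an assumption chosen for illustration, as is the helper name `moving_average`; the embodiment does not fix the window length.

```python
def moving_average(scores, window):
    """Simple moving average over consecutive time-window coupling scores."""
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    return [
        sum(scores[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(scores))
    ]


anomaly_scores = [0.84, 0.87, 0.90]  # three consecutive time windows
smoothed = moving_average(anomaly_scores, window=2)
print([round(s, 3) for s in smoothed])  # [0.855, 0.885]
```

Under this assumed window, the smoothed score rises from 0.855 to 0.885 and crosses an anomaly confidence threshold of 0.88 only in the last step, which matches the idea of confirming the trend gradually rather than from a single window.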

[0188] The anomaly confidence threshold and the tidal confidence threshold can be predetermined through historical sample statistical calibration, training set calibration, or empirical parameter calibration; for example, the anomaly threshold can be obtained by subtracting the preset safety margin from the median of the coupled score distribution of historical known anomaly samples, and the tidal threshold can be obtained by taking the upper bound of the stable interval of historical tidal samples.
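The calibration described here can be sketched with the standard library. The anomaly threshold follows the text directly (median minus safety margin). For the tide threshold, the "upper bound of the stable interval" is not defined further, so the 95th percentile is used below purely as an assumed stand-in; the sample values and function names are hypothetical.

```python
import statistics


def calibrate_anomaly_threshold(anomaly_scores, safety_margin):
    """Median of historical anomaly coupling scores minus a preset safety margin."""
    return statistics.median(anomaly_scores) - safety_margin


def calibrate_tide_threshold(tide_scores):
    """Assumed stand-in for the upper bound of the stable interval: 95th percentile."""
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(tide_scores, n=20, method="inclusive")[-1]


historical_anomaly = [0.90, 0.92, 0.95]     # hypothetical samples
historical_tide = [0.80, 0.82, 0.83, 0.81]  # hypothetical samples

print(round(calibrate_anomaly_threshold(historical_anomaly, 0.04), 2))  # 0.88
print(calibrate_tide_threshold(historical_tide))
```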

[0189] In one alternative implementation, if insufficient historical samples make it difficult to calibrate the threshold, empirical parameters can be used for initial deployment, and online correction can be performed without changing the core judgment logic.

[0190] If the coupling score of the target anomaly theoretical residual space and the coupling score of the business tide theoretical residual space are both higher than their respective thresholds and very close, for example, differing by only 0.01, the system can output a prompt indicating a mixed anomaly and tide state and limit the handling actions to low-risk actions, such as expanding monitoring and local rate limiting, rather than directly isolating the service; if the system is still in the pending confirmation state after multiple rounds of repeated judgment, the state can be reported to the manual operation and maintenance terminal, while all residual spaces and matching results are retained for subsequent review and threshold recalibration.

[0191] For example, during a morning rush hour on the chest pain center platform, the coupling score between the actual residual and the business tide theoretical residual space of the outpatient and emergency departments reached 0.89, while the coupling score with the theoretical residual of the processing rate decay fault was only 0.76. Based on this, the system determined it to be a normal business tide. Similarly, when a slow connection anomaly occurs in the consultation service and the coupling score of the target anomaly theoretical residual space exceeds the anomaly confidence threshold and is not less than the coupling score of the business tide theoretical residual space, the system directly confirms the anomaly and outputs the root cause.

[0192] For example, during a network jitter event, the two coupling scores were 0.84 and 0.82 respectively, neither of which reached its threshold. The system would not rashly implement high-risk measures, but would continue to collect subsequent data windows for further judgment.

[0193] The purpose of this mechanism is to avoid making arbitrary conclusions based solely on the highest similarity result in a single instance by using threshold constraints and a fallback branch to be confirmed, thereby achieving stability and engineering safety in anomaly detection.

[0194] Referring to Figure 2, a behavioral reasoning and AI intelligent handling system for communication anomalies includes: a data acquisition module for collecting communication interaction time-series data and protocol stack resource status data; and an ideal benchmark construction module for constructing an ideal communication state model containing several state nodes based on pre-acquired interface contract parsing results, protocol specification parsing results, and queuing rules, and generating an ideal communication trajectory benchmark based on the business request sequence within a preset time window extracted from the communication interaction time-series data.

[0195] The parameterized injection module is used to perform parameterized perturbation injection on the ideal communication state model based on preset fault attack knowledge, generate abnormal scenario simulation trajectory, and perform parameterized perturbation injection on the ideal communication state model based on preset service load fluctuation rules, generate service tidal simulation trajectory.

[0196] The dual-track differential module is used to perform time alignment and state node mapping on communication interaction timing data and protocol stack resource status data, generate real observation features, and determine the real residual space, at least one abnormal theoretical residual space and the business tide theoretical residual space respectively under the preset time window and state node mapping based on the real observation features, ideal communication trajectory benchmark, abnormal scenario simulation trajectory and business tide simulation trajectory.

[0197] The coupling decision module is used to generate anomaly type results and root cause results based on the topological similarity between the real residual space and the theoretical residual spaces of each anomaly and the residual space of business tide theory.

[0198] The intelligent processing module is used to respond to the intelligent processing instructions corresponding to the root cause results output, so as to drive the execution of corresponding communication linkage processing operations.

[0199] The data acquisition module is connected to the traffic mirroring interface and the host protocol stack monitoring interface, respectively, while the intelligent processing module is connected to the operation and maintenance orchestration interface or the service governance interface.

[0200] This embodiment provides a system structure for implementing the aforementioned method. Specifically, the process from data acquisition to anomaly judgment and then to processing output has been described at the method level. However, if it is not implemented into clear modules, problems of functional overlap and unclear responsibility boundaries may easily occur during actual deployment. Therefore, this embodiment solidifies each function into a system module that cooperates with each other and explains their connection relationship.

[0201] Specifically, the data acquisition module is connected to the traffic mirroring interface and the host protocol stack monitoring interface respectively; the former is used to acquire the timing data of inter-service communication interaction, and the latter is used to acquire protocol stack resource status data such as transmission control protocol state transition, user datagram protocol socket send and receive status, handle occupancy rate and input / output queue depth.

[0202] The ideal benchmark construction module reads the interface contract parsing results, protocol specification parsing results, and queuing rules, establishes an ideal communication state model, and generates an ideal communication trajectory benchmark based on the business request sequence within a preset time window;

[0203] The parameterized injection module superimposes fault attack knowledge and business load fluctuation rules onto the ideal model, and outputs the simulation trajectory of abnormal scenarios and the simulation trajectory of business tides; the dual-track differential module receives real data and the aforementioned multiple trajectories, and calculates the real residual space, the anomaly theoretical residual space and the business tide theoretical residual space under a unified state node mapping.

[0204] The coupling decision module is responsible for performing temporal similarity, structural similarity, and coupling score calculations, and providing anomaly type and root cause results; the intelligent handling module issues handling instructions through the operation and maintenance orchestration interface or service governance interface.

[0205] To illustrate the data flow between modules, a simplified deduction can be made; assume that the data acquisition module outputs two types of data streams D1 and D2 within a certain 10-second time window, where D1 contains 20 call records and D2 contains the protocol stack sampling sequence within that time window;

[0206] The ideal baseline construction module parses the service request sequence in D1 into the current service input B1 and outputs the ideal trajectory G0; the parameter injection module generates abnormal simulation trajectories G1 and G2 and service tidal simulation trajectory G3 based on the ideal communication state model and combined with the service input conditions corresponding to B1;

[0207] G0 serves as the ideal reference for subsequent dual-track difference, while G1, G2, and G3 represent the extrapolation results after applying different perturbations to the same ideal model. The dual-track difference module takes D1, D2, G0, G1, G2, and G3 as inputs to generate the actual residual Z0, the anomalous theoretical residuals Z1 and Z2, and the tidal theoretical residual Z3, respectively. Z0 represents the actual residual result, Z1 and Z2 represent the theoretical residual results corresponding to different anomalous templates, and Z3 represents the theoretical residual result corresponding to the tidal template.

[0208] The coupled decision module compares Z0 with Z1, Z2, and Z3 and outputs the target result C1; the intelligent processing module issues instruction I1 to the orchestration system based on C1; thus, the responsibilities of each module are clear and can be deployed in parallel or in a pipelined manner; to avoid confusion with the meaning of the numbering of the request segments R1, R2, and R3 in the previous text, Z0 to Z3 in this section only represent the numbers of various residual results in this example.

[0209] Furthermore, in engineering deployment, intermediate results can be passed between modules using message queues, shared memory, or interface calls, but their data boundaries must remain consistent. Specifically, the ideal benchmark construction module should output at least the ideal communication trajectory benchmark and its corresponding state node definition; the parameter injection module should output at least the disturbance type identifier, disturbance parameters, and corresponding simulation trajectory; and the dual-track difference module should output at least the node difference values, edge relationships, and time window identifiers of each residual space.

[0210] The coupled decision module outputs at least the candidate template number, coupling score, and decision status. Through the above module boundary constraints, even if each module is deployed on different computing nodes, the input and output interfaces can be kept consistent, avoiding interface mismatches such as the previous module outputting the trajectory and the next module expecting model parameters.
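The module boundary constraints can be made concrete as typed records. The sketch below is one possible shape only: every class and field name is hypothetical, chosen to mirror the minimum outputs listed above, and `unmapped_nodes` is an illustrative consistency check between the ideal benchmark and a residual space.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class IdealBenchmark:              # output of the ideal benchmark construction module
    state_nodes: List[str]
    trajectory: Dict[str, List[float]]


@dataclass
class SimulatedTrajectory:         # output of the parameterized injection module
    perturbation_type: str
    perturbation_params: Dict[str, float]
    trajectory: Dict[str, List[float]]


@dataclass
class ResidualSpace:               # output of the dual-track differential module
    window_id: str
    node_differences: Dict[str, List[float]]
    edges: List[Tuple[str, str]]


@dataclass
class Decision:                    # output of the coupling decision module
    template_id: str
    coupling_score: float
    status: str


def unmapped_nodes(benchmark: IdealBenchmark, residual: ResidualSpace) -> List[str]:
    """Nodes present in a residual space but absent from the benchmark's state nodes."""
    known = set(benchmark.state_nodes)
    return sorted(n for n in residual.node_differences if n not in known)


g0 = IdealBenchmark(["S1", "S2", "S3"], {"S1": [1.0], "S2": [1.0], "S3": [1.0]})
z0 = ResidualSpace(
    "window-001",
    {"S2": [20, 35, 30], "S3": [5, 9, 12], "retry_queue": [3, 4, 4]},
    [("S2", "S3")],
)
print(unmapped_nodes(g0, z0))  # ['retry_queue']
```

A check of this kind would flag real-world extension nodes, such as a middleware retry queue, before they reach the similarity calculation.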

[0211] In an alternative implementation, if the data acquisition module experiences a short-term interruption, the ideal benchmark construction module and the parameterization injection module can still use the business input of the most recent valid time window for short-term extrapolation, but the dual-track differential module will mark the result of that round as low confidence.

[0212] If the parameterized injection module has not yet generated a certain abnormal template, the coupled decision module can first complete the matching in the existing template set and classify the unknown pattern as pending confirmation. If the operation and maintenance orchestration interface is temporarily unavailable, the intelligent handling module can first write the target handling instruction into the local pending execution queue and resend it after the interface is restored, or switch to the service governance interface to perform lightweight rate limiting and isolation.
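The local pending execution queue mentioned above can be sketched as follows. The class name `InstructionDispatcher` and the `send` callable, which is assumed to return `True` on successful delivery, are hypothetical; the real orchestration interface is not specified in this text.

```python
from collections import deque


class InstructionDispatcher:
    """Buffers handling instructions while the orchestration interface is unavailable."""

    def __init__(self, send):
        self._send = send          # assumed callable: instruction -> bool (True on success)
        self._pending = deque()    # local pending execution queue, FIFO order

    def dispatch(self, instruction):
        """Try to send immediately; queue the instruction if delivery fails."""
        if self._send(instruction):
            return True
        self._pending.append(instruction)
        return False

    def flush(self):
        """Resend queued instructions in order; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent


interface = {"up": False}
delivered = []


def send(instruction):
    if interface["up"]:
        delivered.append(instruction)
        return True
    return False


dispatcher = InstructionDispatcher(send)
dispatcher.dispatch("I1")   # interface down: queued locally
interface["up"] = True
dispatcher.flush()          # interface restored: resent
print(delivered)            # ['I1']
```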

[0213] For example, in the deployment of the chest pain center platform, the traffic mirroring interface can access the mirroring capability of the hospital service mesh, and the host protocol stack monitoring interface is deployed on the node carrying consultation services and medical record services; when an anomaly occurs, the data acquisition module collects data on the consultation service connection status stagnation and queue deepening.

[0214] The ideal baseline construction module provides the ideal trajectory under the same business input; the parameterized injection module quickly generates three types of simulation trajectories: processing rate decay, connection persistence attack, and morning peak tide; the dual-track differential module generates the corresponding residual space; the coupling decision module provides the root cause judgment of connection persistence attack; the intelligent handling module, based on this, links with the operation and maintenance orchestration platform to isolate abnormal consultation instances and notify the service governance component to perform upstream rate limiting.

[0215] The purpose of this system is to modularize, interface, and deploy the methods and steps, thereby achieving a closed-loop operation across the entire chain from data collection, simulation, judgment to disposal.

[0216] Furthermore, to ensure consistency in system object names, in this embodiment, the operation and maintenance orchestration system refers to the upper-level platform that carries the atomic processing script library and is responsible for executing orchestration actions, and the operation and maintenance orchestration interface is the calling interface opened by the upper-level platform to the intelligent processing module.

[0217] The service governance interface is the calling interface that the service governance component exposes to the intelligent handling module. The service governance component is the execution entity that implements rate limiting, degradation, instance isolation, or recovery. Correspondingly, where the preceding text states that instruction I1 is issued to the orchestration system, the orchestration system refers to the operation and maintenance orchestration system; where it states that the service governance component is notified to perform upstream rate limiting, the component receives the instruction through the service governance interface. All subsequent references to platforms, interfaces, and components shall be understood in accordance with the above correspondence, without changing the system connection relationship and module responsibility boundaries.

[0218] In a preferred embodiment of the present invention, the intelligent handling module is used to match a target script from the atomic handling script library pre-stored in the operation and maintenance orchestration system based on at least one of the abnormal protocol level, abnormal source node, abnormal resource type, abnormal cause category and abnormal severity in the root cause result, and output the corresponding handling instructions; wherein, the handling instructions include isolating the target service instance, resetting the connection pool and performing dynamic rate limiting.

[0219] This embodiment provides a detailed implementation mechanism for an intelligent handling module. Specifically, although it has been stated in the system that the intelligent handling module will output handling instructions based on the root cause results, if the handling logic only stops at manual judgment before execution, it will weaken the linkage of the entire system.

[0220] Therefore, this embodiment further specifies that the intelligent handling module automatically matches the target script from the atomic handling script library based on multiple dimensions in the root cause results;

[0221] It should be noted that the AI intelligent handling of this invention takes expert rule matching and knowledge graph-level association reasoning as its core logic. It replaces traditional investigation based on manual operation and maintenance experience by automatically combining deterministic and rollbackable atomic scripts, thereby achieving explainable and risk-controllable intelligent linkage handling in high-availability scenarios such as medical care.

[0222] Specifically, the root cause result may include at least one or more of the following: abnormal protocol level, abnormal source node, abnormal resource type, abnormal cause category, and abnormal severity; the abnormal protocol level may be divided into Transmission Control Protocol layer, User Datagram Protocol layer, and Application Layer Call layer.

[0223] The source node of the exception can specify a specific microservice instance, host machine or logical service; the type of the exception resource can point to a connection pool, socket handle, input / output queue or thread pool; the category of the exception cause can be distinguished as slow connection persistence, processing rate decay, window shrinkage, resource leakage, etc.

[0224] The severity of an anomaly can be graded based on the scope of impact or coupling score. The intelligent handling module matches the target script from the atomic handling script library pre-stored in the operation and maintenance orchestration system according to these dimensions. An atomic handling script is the smallest handling unit that can be executed independently, has clear boundaries, and can be rolled back if it fails. Examples include isolating the target service instance, resetting the connection pool, performing dynamic rate limiting, reclaiming the anomaly handle, and switching backup nodes.

[0225] For ease of explanation, a simplified script matching deduction is described. Assume the root cause result shows that the abnormal protocol level is the Transmission Control Protocol layer, the abnormal source node is the consultation service instance P2, the abnormal resource type is the connection pool, the abnormal cause category is connection maintenance abnormality, and the abnormal severity is high.

[0226] After searching the script library, three scripts can be matched: script J1 performs a connection pool reset on instance P2, script J2 removes instance P2 from the service registry, and script J3 applies a 20% dynamic rate limit to the upstream consultation entry. If the severity is high, J2+J1+J3 is executed according to the preset strategy combination; if the severity is medium, only J1+J3 is executed; if the severity is low, only J3 is executed first and observation continues. In this way, the handling actions and the root cause result form a clear correspondence. To avoid confusion with the local numbering of the earlier state nodes S1, S2, and S3, J1 to J3 in this section denote script numbers only.
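The severity-based combination in this example can be illustrated with a minimal sketch. The script numbers J1 to J3 and the three combinations come from the paragraph above; the dictionary layout, the action strings, and the helper name `plan_for` are assumptions made only for illustration and are not part of the disclosed system.

```python
# Script numbers follow the example in the text; the action strings are illustrative.
SCRIPTS = {
    "J1": "reset the connection pool on instance P2",
    "J2": "remove instance P2 from the service registry",
    "J3": "apply a 20% dynamic rate limit to the upstream consultation entry",
}

# Preset strategy combination per severity level, as stated in the example.
STRATEGY = {
    "high": ["J2", "J1", "J3"],
    "medium": ["J1", "J3"],
    "low": ["J3"],
}


def plan_for(severity: str) -> list:
    """Return the ordered script numbers for a severity level (hypothetical helper)."""
    if severity not in STRATEGY:
        raise ValueError(f"unknown severity: {severity!r}")
    # Return a copy so callers cannot mutate the preset strategy table.
    return list(STRATEGY[severity])
```

Calling `plan_for("medium")` yields the J1 and J3 sequence described for medium severity. Keeping the strategy in a table rather than in branching code is one way to keep the correspondence between root cause result and handling actions auditable.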

[0227] Furthermore, to ensure that target script matching is deterministic, each script in the script library can be pre-associated with applicable condition tags. The applicable condition tags include one or more of the following: protocol level tag, resource type tag, cause category tag, target node tag, and risk level tag.

[0228] After receiving the root cause result, the intelligent handling module can filter candidate scripts in the priority order of node, resource, cause, and severity: scripts whose target node tag matches the abnormal source node are preferred first; among these, scripts whose resource type tag matches the abnormal resource type are preferred; among these, scripts whose cause category tag matches the abnormal cause category are preferred; finally, single-script execution or multi-script combined execution is determined according to the abnormal severity. If a dimension is missing from the root cause result, that dimension is skipped and filtering continues on the remaining dimensions, so that an executable low-risk handling solution can be obtained even if the root cause information is incomplete.
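The filtering order described above can be sketched as follows. This is only an illustration under stated assumptions: the library entries, their identifiers, and the tag values are hypothetical, and treating each dimension as a preference that narrows the candidate set only when at least one script matches is one possible reading of "given priority", not a rule stated in the text.

```python
# Priority order of matching dimensions named in the text (severity is applied later).
PRIORITY = ("node", "resource", "cause")

# Hypothetical script library; identifiers and tag values are illustrative only.
LIBRARY = [
    {"id": "reset_pool_P2",
     "tags": {"node": "P2", "resource": "connection_pool", "cause": "connection_maintenance"}},
    {"id": "reset_threads_P2",
     "tags": {"node": "P2", "resource": "thread_pool", "cause": "processing_rate_decay"}},
    {"id": "reset_pool_P3",
     "tags": {"node": "P3", "resource": "connection_pool", "cause": "resource_leakage"}},
]


def filter_candidates(library, root_cause):
    """Narrow the candidate scripts dimension by dimension in priority order."""
    candidates = list(library)
    for dim in PRIORITY:
        wanted = root_cause.get(dim)
        if wanted is None:
            continue  # dimension missing from the root cause result: skip it
        matched = [s for s in candidates if s["tags"].get(dim) == wanted]
        if matched:  # assumption: keep the previous set if nothing matches
            candidates = matched
    return candidates
```

With a complete root cause result for instance P2 the sketch returns only `reset_pool_P2`; with only the resource type known it returns both connection pool scripts, which would then be ranked by risk and severity.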

[0229] The reason for using an atomic handling script library is that if the upper-layer solution only outputs abstract conclusions, such as a suggestion to repair the connection pool, they are difficult to implement in a timely manner in high-availability scenarios such as hospitals. With atomic scripts, complex handling can be broken down into controllable steps. For example, the abnormal instance is first isolated to prevent further spread; the connection pool is then reset to restore resources; and dynamic rate limiting is applied to prevent traffic backflow. Each step independently records its execution result, elapsed time, and rollback status; if execution fails, the step can be retried or rolled back at the script level.
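A minimal sketch of such step-wise execution is given here, assuming each atomic script exposes a run action and a rollback action. The class names, the `max_retries` parameter, and the choice to stop the plan after a failed step are illustrative assumptions rather than details taken from the embodiment.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicScript:
    name: str
    run: Callable[[], None]       # assumed to raise an exception on failure
    rollback: Callable[[], None]  # assumed to undo the effect of run


@dataclass
class StepRecord:
    name: str
    succeeded: bool
    elapsed_s: float
    rolled_back: bool


def execute_plan(plan: List[AtomicScript], max_retries: int = 1) -> List[StepRecord]:
    """Run scripts in order, recording result, elapsed time, and rollback status."""
    records = []
    for script in plan:
        start = time.monotonic()
        succeeded = False
        for _ in range(max_retries + 1):  # first attempt plus retries
            try:
                script.run()
                succeeded = True
                break
            except Exception:
                continue
        rolled_back = False
        if not succeeded:
            script.rollback()  # script-level rollback of the failed step
            rolled_back = True
        elapsed = time.monotonic() - start
        records.append(StepRecord(script.name, succeeded, elapsed, rolled_back))
        if not succeeded:
            break  # assumption: later steps are not attempted after a failure
    return records
```

The returned records give one entry per attempted step, which corresponds to the per-step execution result, elapsed time, and rollback status mentioned above.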

[0230] Furthermore, when multiple candidate scripts in the script library meet the applicable conditions, they can be sorted a second time according to the preset risk level and scope of impact, and atomic scripts with smaller impact and higher rollback capability can be executed first.

[0231] For example, in scenarios where both instance isolation and a full system restart can alleviate the problem, instance-level isolation should be prioritized; in scenarios where both 10% and 30% rate limiting are feasible, the rate limiting script with the smaller reduction should be prioritized and the changes in the subsequent time window should be observed. This avoids uncertainty in the handling action when several candidate scripts are available at the same time.
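The secondary sorting can be sketched as a sort over a composite key. The numeric encodings of risk level, impact scope, and rollback capability, as well as the order of the key fields, are assumptions chosen for illustration; the text only states that lower-impact and more easily rolled-back scripts go first.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    risk_level: int     # assumed encoding: lower means less risky
    impact_scope: int   # assumed encoding: lower means smaller affected scope
    rollback_ease: int  # assumed encoding: higher means easier to roll back


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates so that low-risk, low-impact, easily reversible scripts come first."""
    return sorted(
        candidates,
        key=lambda c: (c.risk_level, c.impact_scope, -c.rollback_ease),
    )
```

For instance, ranking a hypothetical `isolate_instance` candidate (risk 1, scope 1) against a `full_system_restart` candidate (risk 3, scope 3) places instance isolation first. Because Python's `sorted` is stable, candidates with identical keys keep their original library order, which keeps the selection deterministic.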

[0232] In an alternative implementation, if the root cause result only includes some dimensions, such as only identifying the abnormal source node and abnormal resource type, but failing to identify the cause category, the intelligent handling module can prioritize matching low-risk general scripts, such as dynamic rate limiting or gray-scale isolation, and temporarily refrain from executing more destructive reset actions.

[0233] If multiple candidate scripts in the script library have the same score, the script with the smaller impact range can be selected for execution first. If the real residual space in the subsequent time window does not decrease significantly after the handling script is executed, the intelligent handling module can escalate the handling level, add stronger actions, or trigger manual takeover. If the handling script is executed successfully but the business surge continues, the system can maintain rate limiting and leave the isolated instance unrestored until the surge subsides and the situation is reassessed.

[0234] For example, in the chest pain center platform, the system determines that the consultation service instance P2 has a connection maintenance anomaly, which is a Transmission Control Protocol layer problem; the abnormal resource type is the connection pool, and the severity is high.

[0235] Based on this, the intelligent handling module matches three atomic scripts from the script library: remove the P2 instance, reset the P2 connection pool, and apply 20% dynamic rate limiting to the consultation entry. These scripts are then executed in the order of isolation, reset, and rate limiting. If the coupling score drops from 0.93 to 0.41 within two consecutive time windows after execution, and the business processing latency returns to an acceptable range, the system can gradually restore traffic to the P2 instance according to the rollback strategy. If the score remains high, manual intervention is required, and a complete handling log is retained.
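The post-execution check in this example can be sketched as a small decision function. The recovery threshold of 0.5, the function name, and the returned labels are hypothetical; the text gives the scores 0.93 and 0.41 and the two-window observation period but does not state a numeric threshold. Continuing observation when the score has dropped but latency has not yet recovered is also an assumption.

```python
from typing import Sequence


def next_action(
    scores_after: Sequence[float],
    latency_ok: bool,
    recovery_threshold: float = 0.5,  # hypothetical value, not given in the text
    windows: int = 2,                 # two consecutive time windows, as in the example
) -> str:
    """Decide the follow-up step from coupling scores observed after execution."""
    if len(scores_after) < windows:
        return "observe"  # not enough windows observed yet
    recovered = scores_after[-1] < recovery_threshold
    if recovered and latency_ok:
        return "restore_gradually"
    if not recovered:
        return "manual_intervention"  # score remains high
    return "observe"  # assumption: score dropped but latency has not recovered
```

With post-execution scores such as 0.70 and 0.41 and acceptable latency, the sketch returns the gradual restore decision; if both windows stay near 0.93 it returns the manual intervention decision.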

[0236] The purpose of this mechanism is to directly map the anomaly identification results into executable, tiered, and reversible handling actions, thereby achieving an automatic closed loop from the discovery to the mitigation of communication anomalies.

[0237] The foregoing has provided a detailed description of one embodiment of the present invention, but this description is merely a preferred embodiment and should not be construed as limiting the scope of the invention. All equivalent variations and modifications made within the scope of the claims of this invention should still fall within the patent coverage of this invention.

Claims

1. A behavioral reasoning and AI intelligent handling method for communication anomalies, characterized in that it comprises: collecting communication interaction timing data and protocol stack resource status data, wherein the communication interaction timing data characterizes the inter-service network communication interaction process in a distributed business system; constructing an ideal communication state model based on pre-acquired interface contract parsing results, protocol specification parsing results, and queuing rules, wherein the ideal communication state model is a colored Petri net model containing several state nodes; extracting, based on the communication interaction timing data, the service request sequence within a preset time window as the current service input; generating an ideal communication trajectory benchmark based on the ideal communication state model and the current service input; performing parameterized perturbation injection on the ideal communication state model based on preset fault attack knowledge to generate an abnormal scenario simulation trajectory; performing parameterized perturbation injection on the ideal communication state model based on preset service load fluctuation rules to generate a service tide simulation trajectory; performing time alignment and state node mapping on the communication interaction timing data and the protocol stack resource status data to generate real observation features; calculating, under the preset time window and the state node mapping, the difference between the real observation features and the ideal communication trajectory benchmark to obtain a real residual space; calculating, under the preset time window and the state node mapping, the difference between the abnormal scenario simulation trajectory and the ideal communication trajectory benchmark to obtain at least one abnormal theoretical residual space, and calculating the difference between the service tide simulation trajectory and the ideal communication trajectory benchmark to obtain a service tide theoretical residual space; generating an anomaly type result and a root cause result based on the topological similarity between the real residual space and each abnormal theoretical residual space and the service tide theoretical residual space; and, in response to the root cause result, outputting a corresponding intelligent handling instruction and performing a corresponding communication linkage handling operation on the distributed business system according to the intelligent handling instruction.

2. The behavioral reasoning and AI intelligent handling method for communication anomalies according to claim 1, characterized in that: the communication interaction timing data includes source address, destination address, calling method identifier, load size, request timestamp, and response timestamp; and the protocol stack resource status data includes the Transmission Control Protocol (TCP) state transition sequence, the User Datagram Protocol (UDP) socket transmit/receive status sequence, handle occupancy rate, and input/output queue depth.

3. The behavioral reasoning and AI intelligent handling method for communication anomalies according to claim 1, characterized in that: the interface contract parsing result is obtained by parsing an interface definition file, a service description file, or application interface metadata; the protocol specification parsing result is obtained by parsing the Transmission Control Protocol specification, the User Datagram Protocol specification, or an application layer communication protocol specification; the queuing rules include at least one of the following: a first-come-first-served rule, a priority queue rule, and a connection pool allocation rule; the current service input includes at least one of the following: service request arrival rate, call chain sequence, request type identifier, and load parameters; and the ideal communication trajectory benchmark includes an ideal state transition sequence, an ideal processing delay sequence, and an ideal resource occupancy sequence.

4. The behavioral reasoning and AI intelligent handling method for communication anomalies according to claim 1, characterized in that: the fault attack knowledge includes at least one of the following: historical fault samples, an attack rule base, and a protocol anomaly pattern base; performing parameterized perturbation injection on the ideal communication state model based on the preset fault attack knowledge to generate the abnormal scenario simulation trajectory includes: injecting a service processing rate attenuation rule into the ideal communication state model to generate a fault simulation trajectory; injecting connection-keeping rules and window shrinking rules into the ideal communication state model to generate an attack simulation trajectory; and combining the fault simulation trajectory and the attack simulation trajectory into the abnormal scenario simulation trajectory; and performing parameterized perturbation injection on the ideal communication state model based on the preset service load fluctuation rules to generate the service tide simulation trajectory includes: injecting periodic fluctuation rules or sudden increase/decrease rules of the service request arrival rate into the ideal communication state model to generate the service tide simulation trajectory.

5. The behavioral reasoning and AI intelligent handling method for communication anomalies according to claim 1, characterized in that: the real residual space, each abnormal theoretical residual space, and the service tide theoretical residual space are all multi-dimensional time-series graph feature spaces composed of state transition features, latency features, and resource occupancy features; obtaining the real residual space includes: performing time alignment and state node mapping on the communication interaction timing data and the protocol stack resource status data to generate the real observation features, and generating the real residual space based on the difference between the real observation features and the ideal communication trajectory benchmark under the preset time window and the state node mapping; and obtaining the at least one abnormal theoretical residual space and the service tide theoretical residual space includes: generating the abnormal theoretical residual space based on the difference between the abnormal scenario simulation trajectory and the ideal communication trajectory benchmark under the preset time window and the state node mapping, and generating the service tide theoretical residual space based on the difference between the service tide simulation trajectory and the ideal communication trajectory benchmark under the preset time window and the state node mapping.

6. The behavioral reasoning and AI intelligent handling method for communication anomalies according to claim 1, characterized in that: the topological similarity includes temporal similarity, structural similarity, and a coupling score determined based on the temporal similarity and the structural similarity; and generating the anomaly type result and the root cause result based on the topological similarity between the real residual space and each abnormal theoretical residual space and the service tide theoretical residual space includes: calculating, based on dynamic time warping, the temporal similarity between the real residual space and each abnormal theoretical residual space and the service tide theoretical residual space; calculating, based on graph edit distance, the structural similarity between the real residual space and each abnormal theoretical residual space and the service tide theoretical residual space; performing a weighted summation of the temporal similarity and the structural similarity based on a predetermined weight allocation to obtain the coupling score corresponding to each abnormal theoretical residual space and to the service tide theoretical residual space; selecting the abnormal theoretical residual space with the highest coupling score as the target abnormal theoretical residual space; and determining the anomaly type result and the root cause result based on the coupling score corresponding to the target abnormal theoretical residual space and the coupling score corresponding to the service tide theoretical residual space.

7. The behavioral reasoning and AI intelligent handling method for communication anomalies according to claim 6, characterized in that: determining the anomaly type result and the root cause result based on the coupling score corresponding to the target abnormal theoretical residual space and the coupling score corresponding to the service tide theoretical residual space includes: when the coupling score corresponding to the target abnormal theoretical residual space is greater than a preset anomaly confidence threshold and not less than the coupling score corresponding to the service tide theoretical residual space, determining the anomaly type result and the root cause result corresponding to the target abnormal theoretical residual space; when the coupling score corresponding to the service tide theoretical residual space is greater than a preset tide confidence threshold and greater than the coupling score corresponding to the target abnormal theoretical residual space, determining that the current state is a normal service tide; and when neither of the above judgment conditions is met, determining that the current state is a pending confirmation state and triggering continued data collection and repeated judgment; wherein the preset anomaly confidence threshold and the preset tide confidence threshold are predetermined through at least one of the following methods: historical sample statistical calibration, training set calibration, or empirical parameter calibration.

8. A behavioral reasoning and AI intelligent handling system for communication anomalies, characterized in that it comprises: a data acquisition module, used to collect communication interaction timing data and protocol stack resource status data; an ideal benchmark construction module, used to construct an ideal communication state model containing several state nodes based on pre-acquired interface contract parsing results, protocol specification parsing results, and queuing rules, and to generate an ideal communication trajectory benchmark based on the service request sequence within a preset time window extracted from the communication interaction timing data; a parameterized injection module, used to perform parameterized perturbation injection on the ideal communication state model based on preset fault attack knowledge to generate an abnormal scenario simulation trajectory, and to perform parameterized perturbation injection on the ideal communication state model based on preset service load fluctuation rules to generate a service tide simulation trajectory; a dual-track differential module, used to perform time alignment and state node mapping on the communication interaction timing data and the protocol stack resource status data to generate real observation features, and to determine, based on the real observation features, the ideal communication trajectory benchmark, the abnormal scenario simulation trajectory, and the service tide simulation trajectory, a real residual space, at least one abnormal theoretical residual space, and a service tide theoretical residual space, respectively, under the preset time window and the state node mapping; a coupling decision module, used to generate an anomaly type result and a root cause result based on the topological similarity between the real residual space and each abnormal theoretical residual space and the service tide theoretical residual space; and an intelligent handling module, used to output, in response to the root cause result, a corresponding intelligent handling instruction so as to drive execution of the corresponding communication linkage handling operation; wherein the data acquisition module is connected to a traffic mirroring interface and a host protocol stack monitoring interface, respectively, and the intelligent handling module is connected to an operation and maintenance orchestration interface or a service governance interface.

9. The behavioral reasoning and AI intelligent handling system for communication anomalies according to claim 8, characterized in that: the intelligent handling module is used to match the target script from the atomic handling script library pre-stored in the operation and maintenance orchestration system based on at least one of the abnormal protocol level, abnormal source node, abnormal resource type, abnormal cause category, and abnormal severity in the root cause result, and to output the corresponding handling instructions; and the handling instructions include isolating the target service instance, resetting the connection pool, and implementing dynamic rate limiting.