A data processing method and system based on multi-agent distributed collaboration

By employing multi-agent distributed collaboration technology, utilizing the fusion of node operation indicator features and the Gossip protocol, abnormal nodes are dynamically eliminated, and communication paths are optimized. This solves the problems of insufficient state awareness and fault recovery in distributed systems under dynamic network environments, achieving high-precision fault identification and adaptive communication, and improving the robustness and availability of the system.

CN122247997A · Pending · Publication Date: 2026-06-19 · SHANGHAI WICRENET CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHANGHAI WICRENET CO LTD
Filing Date: 2026-05-25
Publication Date: 2026-06-19

IPC: H04L67/1034; H04L67/1008; H04L67/1031; H04L67/101; H04L67/1025; H04L67/1023; H04L45/12; H04L45/28; H04L45/02

Application Domain: Transmission

Abstract

This invention relates to the field of data processing and distributed computing technology, and discloses a data processing method and system based on multi-agent distributed collaboration. The method includes: acquiring node operation indicators for each agent; performing feature fusion processing based on the node operation indicators using local node evaluation to obtain a preliminary fault judgment; performing view exchange and collaborative consensus voting among adjacent nodes based on the preliminary fault judgment to obtain a list of faulty nodes; performing consistency mapping on active tasks based on the list of faulty nodes to obtain a dynamic task distribution; acquiring link transmission indicators, and performing frequency attenuation and topology routing based on the link transmission indicators and the dynamic task distribution to obtain communication links; and encoding and distributing the data to be distributed based on the communication links to obtain a disaster recovery data distribution status. This method can achieve decentralized fault consensus and adaptive degradation of communication routing, effectively improving the overall robustness and scalability of the system.

Description

Technical Field

[0001] This invention relates to the field of data processing and distributed computing technology, and in particular to a data processing method and system based on multi-agent distributed collaboration.

Background Technology

[0002] Currently, multi-agent collaborative technology significantly improves the overall processing efficiency of systems by enabling computing units distributed across different nodes to jointly accomplish complex objectives. Particularly in industrial data storage and real-time analysis scenarios involving high reliability requirements, the efficiency and stability of distributed collaborative networks directly impact the continuity of core business operations.

[0003] In existing technologies, systems typically rely on a centralized master node or a simple timed heartbeat detection mechanism to manage cluster state. When a node goes offline, the system performs fixed-logic task migration and data retransmission based on static preset strategies and a global routing table. However, this mechanism shows significant limitations in large-scale, dynamically fluctuating network environments. On the one hand, centralized state management easily becomes a communication bottleneck as the number of nodes grows, and it is prone to single-point misjudgments in complex networks. On the other hand, when local network congestion occurs or nodes experience high-frequency abnormal fluctuations, existing technologies cannot dynamically perceive the underlying link bandwidth and communication overhead, and cannot adaptively reduce their communication frequency in response. The system often blindly initiates a large number of state probe packets and cross-domain data transfers. This rigid scheduling and reallocation mechanism not only quickly exhausts already strained link bandwidth, triggering severe network storms, but also causes previously healthy computing nodes to suffer performance collapse due to the sudden increase in communication burden and redundant tasks.

[0004] In summary, existing technologies in distributed environments suffer from poor adaptive state awareness, and their fault recovery behaviors tend to cause secondary congestion when the network fluctuates dynamically or nodes exhibit high-frequency anomalies.

Summary of the Invention

[0005] This invention provides a data processing method and system based on multi-agent distributed collaboration to solve the problems of poor adaptive state awareness and secondary congestion caused by fault recovery behavior in the existing technology in a distributed environment when the network fluctuates dynamically or nodes have high-frequency anomalies.

[0006] Firstly, in order to solve the above-mentioned technical problems, the present invention provides a data processing method based on multi-agent distributed collaboration, comprising: obtaining the node operation indicators of each intelligent agent, and using local node evaluation to perform feature fusion processing on the node operation indicators to obtain a preliminary fault judgment; based on the preliminary fault judgment, performing view exchange between adjacent nodes to obtain a state update packet, performing collaborative consensus voting on the state update packet to obtain abnormal nodes, and merging the abnormal nodes to obtain a list of faulty nodes; based on the list of faulty nodes, dynamically eliminating and maintaining the pre-obtained available computing nodes to obtain a real-time node pool; based on the real-time node pool, performing consistency mapping processing on the pre-acquired active tasks to obtain a dynamic task distribution; obtaining link transmission indicators, performing frequency attenuation processing on the communication path corresponding to the dynamic task distribution based on the link transmission indicators to obtain an optimized transmission strategy, and performing topology routing based on the optimized transmission strategy to obtain the communication link and switching plan; and, based on the communication link, performing block encoding and distributed distribution processing on the pre-acquired data to be distributed to obtain the disaster recovery data distribution status.

[0007] Secondly, the present invention provides a data processing system based on multi-agent distributed collaborative processing, comprising: The state awareness module is used to acquire the node operation indicators of each intelligent agent, and to perform feature fusion processing on the node operation indicators using local node evaluation to obtain a preliminary fault judgment. The fault consensus module is used to perform view exchange processing between adjacent nodes to obtain a state update packet based on the preliminary fault judgment, and to establish a list of faulty nodes using collaborative consensus voting. The node maintenance module is used to dynamically remove and maintain pre-obtained available computing nodes based on the list of faulty nodes, thereby obtaining a real-time node pool. The task remapping module is used to perform consistent mapping processing on the pre-acquired active tasks according to the real-time node pool to obtain a dynamic task distribution. An adaptive routing module is used to obtain link transmission indicators, perform frequency attenuation processing on the communication path corresponding to the dynamic task distribution based on the link transmission indicators to obtain an optimized transmission strategy, and perform topology routing based on the optimized transmission strategy to obtain the communication link and switching plan. The disaster recovery distribution module is used to perform block encoding and distributed distribution processing on the pre-acquired data to be distributed according to the communication link to obtain the disaster recovery data distribution status.

[0008] Compared with the prior art, the present invention has the following beneficial effects: (1) This invention obtains a preliminary judgment by acquiring the node operation indicators of each agent and performing feature fusion, and triggers the Gossip protocol between adjacent nodes to exchange views, and then uses a collaborative consensus voting mechanism to establish a list of faulty nodes. This mechanism changes the defects of traditional distributed systems that rely too much on centralized single-point detection, which easily leads to communication bottlenecks or single-point misjudgments. It realizes decentralized multi-agent autonomous state consensus, and can accurately identify real abnormal nodes in complex, dynamic, and high-frequency fluctuating network environments, which greatly improves the accuracy and fault tolerance of large-scale systems in the fault perception stage.

[0009] (2) This invention dynamically maps active tasks using a consistent hashing algorithm and creatively combines link transmission metrics to prioritize dynamic frequency attenuation processing of data exchange paths, and then combines a global topology routing algorithm to search for backup communication links. This adaptive communication routing mechanism of first reducing frequency and then switching paths completely breaks the predicament of uncontrolled communication overhead caused by the static retransmission strategy in the prior art. It not only smooths out the instantaneous network peak caused by the concentrated migration of tasks, but also accurately blocks the path of local network congestion spreading to the global network, effectively avoiding network storms and system cascading failures that may be caused by large-scale task redistribution.

[0010] (3) This invention performs block encoding of the data to be distributed and executes disaster recovery distribution across fault isolation domains. In addition, during the node recovery phase, a weight increment control mechanism based on system load and hardware resource assessment is introduced to smoothly reintegrate recovered nodes into the real-time node pool. This design ensures high data availability under extreme conditions with low storage redundancy, and establishes a closed loop from fault isolation and data disaster recovery to gradual node reload. It avoids performance avalanches caused by a sudden influx of concurrent tasks onto newly recovered nodes, maintaining the elasticity and robustness of the distributed system under long-term operation and frequent node online / offline cycles.

Attached Figure Description

[0011] Figure 1 is a schematic diagram of a data processing method based on multi-agent distributed collaboration provided in the first embodiment of the present invention; Figure 2 is a schematic diagram of the structure of a data processing system based on multi-agent distributed collaboration provided in the second embodiment of the present invention.

Detailed Implementation

[0012] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0013] Referring to Figure 1, the first embodiment of the present invention provides a data processing method based on multi-agent distributed collaboration, comprising the following steps:
S11, obtain the node operation indicators of each intelligent agent, and use local node evaluation to perform feature fusion processing on the node operation indicators to obtain a preliminary fault judgment;
S12, based on the preliminary fault judgment, perform view exchange processing between adjacent nodes to obtain a state update packet, perform collaborative consensus voting on the state update packet to obtain abnormal nodes, and merge the abnormal nodes to obtain a list of faulty nodes;
S13, based on the list of faulty nodes, dynamically eliminate and maintain the pre-obtained available computing nodes to obtain a real-time node pool;
S14, perform consistency mapping processing on the pre-acquired active tasks according to the real-time node pool to obtain a dynamic task distribution;
S15, obtain link transmission indicators, perform frequency attenuation processing on the communication path corresponding to the dynamic task distribution according to the link transmission indicators to obtain an optimized transmission strategy, and perform topology routing based on the optimized transmission strategy to obtain the communication link and switching plan;
S16, based on the communication link, perform block encoding and distributed distribution processing on the pre-acquired data to be distributed to obtain the disaster recovery data distribution status;
S17, acquire the node recovery signal and system load, perform weight increment control processing based on the recovery signal and system load, and re-include the recovered node in the real-time node pool.

[0014] In step S11, node operation metrics for each agent are obtained, and feature fusion processing is performed on the node operation metrics using local node evaluation to obtain a preliminary fault judgment, including: Extract network latency data, communication overhead data, and load utilization data, and combine the network latency data, communication overhead data, and load utilization data to obtain node operation indicators; An exponentially weighted moving average is calculated on the network latency data to obtain smooth latency characteristics; The node operation index containing the smooth delay feature is input into a preset local evaluation model for multi-dimensional feature comprehensive calculation and state prediction processing to obtain the fault probability value. The fault probability value is compared with a preset warning threshold. If the fault probability value is greater than the warning threshold, the corresponding node identifier and abnormal feature dimension are extracted to form a preliminary fault judgment.

[0015] Before detailing the technical solutions of the embodiments of the present invention, the core concepts of the present invention will first be explained. It should be clarified that in the distributed system constructed in the embodiments of the present invention, each physical or logical node essentially corresponds to an intelligent agent with independent autonomy and collaborative capabilities. Specifically, each node integrates a perception unit, a decision-making unit, and an execution unit, enabling it to autonomously perceive its state and reason logically based on its local view within the distributed network, and to engage in decentralized information interaction and consensus-building with other nodes according to a preset collaborative protocol, rather than passively relying on the scheduling commands of a centralized cluster manager. Therefore, in the detailed descriptions of each step and module below, the terms "node" and "intelligent agent" are consistent in their technical logical connotations, and their behavioral patterns follow the autonomy, sociality, and responsiveness characteristics of multi-agent systems.

[0016] In one implementation, network latency data, communication overhead data, and load utilization data are extracted. A lightweight agent program is deployed on the host computing unit of the distributed system. This agent program periodically sends probe data packets of a preset fixed byte length to pre-defined monitoring endpoints. The time difference between the sending time and the time of receiving the response acknowledgment is recorded and determined as network latency data. The total number of bytes transmitted at the network interface layer within a preset time window is calculated, and this total number of bytes is divided by the duration of the preset time window to obtain communication overhead data. The preset time window is set according to the system's fault response time requirements, with a default value of sixty seconds. For high real-time scenarios requiring second-level fault switching, the time window is shortened to ten seconds; for batch processing or offline computing scenarios, the time window can be extended to three hundred seconds. The current CPU utilization rate and memory usage rate are read by calling the operating system's kernel monitoring interface. An arithmetic average is performed on these two values, and the result is determined as load utilization data. Subsequently, the network latency data, the communication overhead data, and the load utilization data at the same sampling timestamp are vectorized and concatenated to obtain multi-dimensional node operation indicators.
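The indicator assembly described above can be sketched as follows. This is only an illustrative sketch: the names `NodeIndicators` and `build_indicators` are hypothetical, one megabyte is assumed to be 1,000,000 bytes, and CPU and memory usage are assumed to be given as fractions between zero and one. The probe itself and the kernel monitoring interface are not modeled; their results are passed in as plain numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeIndicators:
    """Multi-dimensional node operation indicators at one sampling timestamp."""
    latency_ms: float          # probe round-trip time
    overhead_mb_per_s: float   # bytes transmitted / window duration
    load_utilization: float    # arithmetic mean of CPU and memory usage

    def as_vector(self) -> List[float]:
        # Concatenate the three dimensions into one feature vector.
        return [self.latency_ms, self.overhead_mb_per_s, self.load_utilization]


def build_indicators(send_ts: float, ack_ts: float, total_bytes: int,
                     window_s: float, cpu: float, mem: float) -> NodeIndicators:
    # Network latency: time between sending the probe and receiving the ack.
    latency_ms = (ack_ts - send_ts) * 1000.0
    # Communication overhead: total bytes in the window divided by its duration
    # (assumption: 1 MB = 1,000,000 bytes).
    overhead = total_bytes / window_s / 1_000_000
    # Load utilization: arithmetic average of CPU and memory usage.
    load = (cpu + mem) / 2.0
    return NodeIndicators(latency_ms, overhead, load)


if __name__ == "__main__":
    # Default 60-second window; CPU 92 % and memory 70 % as fractions.
    ind = build_indicators(0.0, 0.210, 15_000_000, 60, 0.92, 0.70)
    print(ind.as_vector())
```

Passing the window length as a parameter keeps the ten-second, sixty-second, and three-hundred-second settings mentioned above as plain configuration choices.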

[0017] It should be noted that in this embodiment, an exponentially weighted moving average is calculated on the network latency data to obtain the smoothed latency feature. The network latency data of the current sampling period is multiplied by a preset smoothing factor; the smoothed latency feature of the previous sampling period is retrieved and multiplied by the difference between one and the preset smoothing factor; and the two products are summed to obtain the smoothed latency feature of the current period. Regarding the method for determining the preset smoothing factor: latency time series samples under historically stable network conditions are extracted; a fixed iteration step size is set to traverse candidate smoothing factor values within the range of zero to one; the mean square error between the smoothed output corresponding to each candidate value and the baseline low-frequency trend component is calculated; and the candidate value that minimizes this mean square error is taken as the preset smoothing factor, thereby suppressing high-frequency noise caused by instantaneous network jitter.
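A minimal sketch of the smoothing step and the smoothing factor search is given below. The function names are hypothetical. The baseline low-frequency trend is assumed to be supplied by the caller, since the text does not say how it is derived, and the first smoothed value is assumed to be initialized to the first sample.

```python
from typing import List, Sequence


def ewma(samples: Sequence[float], alpha: float) -> List[float]:
    """Smoothed latency: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = []
    prev = samples[0]  # assumption: seed the history with the first sample
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def mean_square_error(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def choose_smoothing_factor(samples: Sequence[float],
                            baseline_trend: Sequence[float],
                            step: float = 0.05) -> float:
    """Traverse candidates strictly between 0 and 1 with a fixed step and
    return the one whose smoothed output is closest to the baseline trend."""
    n = int(round(1.0 / step))
    candidates = [i * step for i in range(1, n)]
    return min(candidates,
               key=lambda a: mean_square_error(ewma(samples, a), baseline_trend))


if __name__ == "__main__":
    latencies = [50.0, 52.0, 210.0, 51.0, 49.0]
    print(ewma(latencies, 0.3))
```

With a smoothing factor of 0.3, a single 210 ms spike moves the smoothed value only part of the way toward the spike, which is the jitter suppression the paragraph describes.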

[0018] It is worth noting that in this embodiment, the node operation indicators containing the smoothing delay feature are input into a preset local evaluation model for multi-dimensional feature comprehensive calculation and state prediction processing to obtain the fault probability value. Regarding the preset local evaluation model, a lightweight neural network architecture can be used to adapt to the low-latency inference requirements of the node end; those skilled in the art can also use decision trees or logistic regression models, all of which fall within the scope of this invention. A multi-dimensional indicator data sample set of the distributed cluster during its historical operating cycle is collected, and the actual operating status is used for binary classification labeling as normal or faulty. Network latency, communication overhead, and load utilization containing the smoothing delay feature are used as input feature vectors, linearly mapped through the weight matrix within the model, and combined with a nonlinear activation function for feature space transformation. The weight matrix is iteratively updated through a backpropagation algorithm until the loss function converges, resulting in a trained local evaluation model. The current real-time node operation indicators containing the smoothing delay feature are input into this model, which outputs a state prediction scalar between zero and one through forward propagation calculation, directly determining this as the fault probability value of the computed node.
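As a rough illustration of the local evaluation model, the sketch below uses logistic regression, which the paragraph names as an acceptable alternative to a lightweight neural network. It is not the model of the embodiment. The training samples are made up for illustration, the three features are assumed to be pre-scaled to the range zero to one, and the learning rate and epoch count are arbitrary choices.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Forward pass: linear mapping plus nonlinear activation, giving a
    fault probability between zero and one."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def train(samples: List[Sequence[float]], labels: List[int],
          lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Full-batch gradient descent on the cross-entropy loss."""
    n = len(samples[0])
    weights, bias = [0.0] * n, 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(samples, labels):
            err = predict(weights, bias, x) - y  # gradient of loss w.r.t. z
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Hypothetical scaled samples: [smoothed latency, overhead, load utilization].
NORMAL = [[0.10, 0.20, 0.30], [0.15, 0.25, 0.40], [0.20, 0.10, 0.35]]
FAULTY = [[0.70, 0.30, 0.90], [0.80, 0.20, 0.85], [0.90, 0.40, 0.95]]

if __name__ == "__main__":
    w, b = train(NORMAL + FAULTY, [0, 0, 0, 1, 1, 1])
    print(predict(w, b, [0.85, 0.30, 0.90]))
```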

[0019] This embodiment compares the fault probability value with a preset warning threshold. Regarding the method for determining the preset warning threshold, the trained local evaluation model is used to predict probabilities on an independent validation set, and the true positive rate and false positive rate are calculated under different cutoff threshold values; a receiver operating characteristic (ROC) curve is plotted, and the difference between the true positive rate and the false positive rate (the Youden index) is calculated at each discrete point on the curve. The point where this difference reaches its maximum is extracted, and the probability cutoff value corresponding to that point is set as the preset warning threshold. If the fault probability value is less than or equal to the preset warning threshold, the computing node is determined to be in a normal state; the system does not trigger the abnormal feature extraction process and continues to monitor the computing node in real time. If the fault probability value is greater than the preset warning threshold, the computing node is determined to be in an abnormal state. In that case, the physical interconnect protocol address corresponding to the node is extracted as the node identifier, and the deviation of each dimension of the node's operation indicators from its historical steady-state mean is calculated. The name of the feature parameter with the largest deviation is extracted and determined as the abnormal feature dimension. The node identifier and the abnormal feature dimension are encapsulated in a structured dictionary to form the preliminary fault judgment.
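The threshold selection and the construction of the preliminary fault judgment might look roughly like the sketch below. All names are hypothetical. Two assumptions go beyond the text: candidate cutoffs are taken from the observed validation probabilities, and deviations are compared as relative deviations from nonzero steady-state means so that dimensions with different units are comparable.

```python
from typing import Dict, List, Optional, Tuple


def youden_threshold(probs: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (cutoff, J) maximizing J = TPR - FPR; a sample counts as
    predicted faulty when its probability is strictly greater than the cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p > t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p > t and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def preliminary_judgment(node_id: str, indicators: Dict[str, float],
                         steady_means: Dict[str, float], fault_prob: float,
                         threshold: float) -> Optional[Dict[str, object]]:
    """Build the structured judgment only when the threshold is exceeded."""
    if fault_prob <= threshold:
        return None  # normal state: keep monitoring, extract nothing
    # Assumption: relative deviation from the historical steady-state mean.
    deviations = {name: abs(value - steady_means[name]) / steady_means[name]
                  for name, value in indicators.items()}
    worst = max(deviations, key=deviations.get)
    return {"node_id": node_id, "abnormal_dimension": worst,
            "fault_probability": fault_prob}


if __name__ == "__main__":
    t, j = youden_threshold([0.1, 0.3, 0.4, 0.8, 0.85, 0.9], [0, 0, 0, 0, 1, 1])
    print(t, j)
```

The sketch reports a single abnormal dimension, the one with the largest deviation; reporting several dimensions would need an additional cutoff that the text does not specify.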

[0020] For example, the agent program sends probe packets to neighboring nodes, calculates the current network latency as 210 milliseconds, analyzes the traffic over the past 60 seconds to obtain a communication overhead of 0.25 megabytes per second, reads the kernel interface to obtain a CPU utilization of 92% and a memory utilization of 70%, and calculates the average to obtain a load utilization of 81%. The system smooths the network latency based on an objectively determined preset smoothing factor of 0.3, and inputs the updated node performance indicator vector into a preset local evaluation model. After forward propagation synthesis, the model outputs a fault probability value of 0.85. The preset warning threshold determined by the ROC curve and Youden's index is 0.80. The system compares and finds that 0.85 is greater than 0.80, extracts the corresponding node identifier Node_102, and extracts the abnormal feature dimensions with the largest deviation: network latency and load utilization. The two are combined and output as a structured preliminary fault judgment.

[0021] In step S12, based on the preliminary fault judgment, a view exchange process is performed between adjacent nodes to obtain a state update packet. The state update packet is then used for collaborative consensus voting to obtain abnormal nodes, and the abnormal nodes are merged to obtain a fault node list.

[0022] The state update package, obtained by performing view exchange processing between adjacent nodes based on the preliminary fault assessment, includes: Trigger the preset Gossip protocol and randomly select neighbor nodes in the distributed system according to the preset infection probability to obtain the target receiving node; Send the local view data, which includes the preliminary fault assessment and the local version number, to the target receiving node; Extract the peer view data returned by the target receiving node, and use the vector clock algorithm to perform version conflict comparison processing between the local view data and the peer view data to obtain the status update packet.

[0023] Specifically, the abnormal nodes are obtained by performing a collaborative consensus vote on the state update packets, and the abnormal nodes are merged to obtain a list of faulty nodes, including: The state update packet is parsed to extract the local state evaluation data and corresponding trust weight values sent by neighboring nodes to the target node. The local state evaluation data are weighted and summed based on the trust weight values to obtain the comprehensive anomaly confidence of the target node. The overall anomaly confidence level is compared with a preset consensus threshold. If the overall anomaly confidence level is greater than the preset consensus threshold, the target node is marked as an anomaly node. Extract all abnormal nodes that meet the preset consensus threshold, deduplicate the abnormal nodes and merge them to obtain a list of faulty nodes.

[0024] In one implementation, a preset Gossip protocol is triggered, and neighboring nodes are randomly selected in the distributed system according to a preset infection probability to obtain the target receiving node. The method for determining the preset infection probability involves extracting the total number of currently active nodes in the distributed system, calculating the natural logarithm of this total number, and dividing the natural logarithm by the total number to obtain the corresponding probability baseline value. Subsequently, a pseudo-random number generator is invoked to generate a random floating-point number between zero and one. If this random floating-point number is less than or equal to the probability baseline value, the neighboring node is determined to be the target receiving node, thereby ensuring that the status information covers the entire network within a logical logarithmic time. This embodiment sends local view data, including the preliminary fault assessment and the local version number, to the target receiving node. The local view data is stored in a structured vector format and includes the physical identifier of the corresponding node, a fault feature vector, and a local version number composed of a monotonically increasing logical counter.
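The infection probability and the random target selection can be sketched as follows. The function names are hypothetical, and the random number generator is passed in explicitly only so that the behavior can be reproduced in tests.

```python
import math
import random
from typing import List, Sequence


def infection_probability(active_nodes: int) -> float:
    """Probability baseline: natural log of the active node count divided
    by the node count itself."""
    if active_nodes < 2:
        raise ValueError("at least two active nodes are required")
    return math.log(active_nodes) / active_nodes


def select_targets(neighbors: Sequence[str], active_nodes: int,
                   rng: random.Random) -> List[str]:
    """Draw a random float in [0, 1) per neighbor and keep the neighbor
    as a target receiving node when the draw is <= the baseline."""
    p = infection_probability(active_nodes)
    return [n for n in neighbors if rng.random() <= p]


if __name__ == "__main__":
    print(round(infection_probability(200), 4))
    peers = [f"Node_{i:03d}" for i in range(200)]
    print(select_targets(peers, 200, random.Random(42)))
```

For 200 active nodes the baseline is ln(200) / 200, roughly 0.026, so each round contacts only a handful of the 200 candidates on average.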

[0025] It should be noted that in this embodiment, the peer view data returned by the target receiving node is extracted, and a vector clock algorithm is used to perform version conflict comparison processing on the local view data and the peer view data to obtain a state update packet. Specifically, the version number sequence corresponding to each node identifier in the local and peer view data is extracted, and the counter values are compared position by position. If at least one counter value in the peer view is greater than the corresponding value in the local view and the values at all other positions are not less than the local values, the peer data is determined to be the latest state and an overwrite update is performed. If each view leads at some position (a cross-leading situation), a maximum value merging operation is performed to ensure causal consistency in a distributed environment. After the conflict processing is completed, the updated global node state set is encapsulated into the state update packet.
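A simplified sketch of the version comparison and merge follows. The view layout is an assumption made for illustration: each view maps a node identifier to a pair of (version counter, state), so the counters across all identifiers play the role of the vector clock. The function names are hypothetical.

```python
from typing import Dict, Tuple

View = Dict[str, Tuple[int, str]]  # node_id -> (version counter, state)


def compare(local: View, peer: View) -> str:
    """Element-wise comparison of the two version vectors."""
    nodes = set(local) | set(peer)

    def version(view: View, node: str) -> int:
        return view[node][0] if node in view else 0

    peer_ahead = any(version(peer, n) > version(local, n) for n in nodes)
    local_ahead = any(version(local, n) > version(peer, n) for n in nodes)
    if peer_ahead and local_ahead:
        return "concurrent"   # cross-leading: neither view dominates
    if peer_ahead:
        return "peer_newer"
    if local_ahead:
        return "local_newer"
    return "equal"


def reconcile(local: View, peer: View) -> View:
    """Produce the node state set for the state update packet."""
    relation = compare(local, peer)
    if relation == "peer_newer":
        return dict(peer)     # overwrite update
    if relation in ("local_newer", "equal"):
        return dict(local)
    # Concurrent: maximum value merge, keeping the higher version per node.
    merged = dict(local)
    for node, entry in peer.items():
        if node not in merged or entry[0] > merged[node][0]:
            merged[node] = entry
    return merged


if __name__ == "__main__":
    mine = {"A": (3, "healthy"), "B": (1, "healthy")}
    theirs = {"A": (2, "healthy"), "B": (4, "suspect")}
    print(compare(mine, theirs), reconcile(mine, theirs))
```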

[0026] It is worth noting that this embodiment parses the state update packet to extract the local state evaluation data sent by neighboring nodes about the target node and the corresponding trust weight values. Regarding the method for determining the trust weight value, the accuracy rate of the state reports issued within a preset historical time window (e.g., the past thirty days) by the neighboring node that sent the local state evaluation data is extracted; this accuracy rate is then combined in a weighted average with the neighboring node's current hardware health score, yielding a floating-point number between zero and one, which is determined as the corresponding trust weight value. This limits the influence of Byzantine-faulty or malicious nodes that send false fault reports in the distributed system.

[0027] In this embodiment, the local state evaluation data is weighted and summed based on the trust weight values to obtain the comprehensive anomaly confidence level of the target node. Specifically, all local state evaluation data for the same target node in the state update packet are traversed, each local state evaluation data is multiplied by its corresponding trust weight value, all product results are summed, and then the summation result is divided by the sum of the trust weight values of all neighboring nodes participating in the evaluation to obtain a weighted average value, which is objectively labeled as the comprehensive anomaly confidence level of the target node.

[0028] Further, the comprehensive anomaly confidence level is compared with a preset consensus threshold. Regarding the method for determining the preset consensus threshold, based on the Byzantine Fault Tolerance (BFT) quorum principle of distributed systems, and to tolerate delayed or misjudging nodes in the network, the preset consensus threshold is set to two-thirds (approximately 0.67). In financial transaction scenarios with extremely high security requirements, this threshold can be raised to 0.80. If the comprehensive anomaly confidence level is greater than the preset consensus threshold, it is confirmed that the target node has experienced a genuine failure, and it is marked as an abnormal node; if it is less than or equal to the threshold, the anomaly is determined to be an isolated false alarm caused by local network jitter, and the system clears the node's anomaly status bit. Finally, all abnormal nodes that exceed the preset consensus threshold are extracted, records with the same physical identifier are deduplicated, and the deduplicated nodes are merged to obtain a list of faulty nodes for subsequent task scheduling.

[0029] For example, the distributed system contains 200 nodes, and the calculated infection probability is approximately 0.026. The agent exchanges views with neighboring nodes via the Gossip protocol and generates a state update packet through vector clock comparison. For the target node Node_102, the state update packet contains evaluation data from three neighboring nodes: neighboring node A (trust weight 0.9) provides a local state evaluation of 0.85; neighboring node B (trust weight 0.8) provides an evaluation of 0.90; and neighboring node C (trust weight reduced to 0.4 due to recent frequent disconnections) provides an evaluation of 0.20. The system performs a weighted summation calculation: (0.9×0.85+0.8×0.90+0.4×0.20) / (0.9+0.8+0.4)≈0.745. The calculated overall anomaly confidence score of 0.745 is greater than the preset consensus threshold of 0.67. The agent ultimately marks Node_102 as an anomaly node and adds it to the list of faulty nodes.
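The weighted vote in this example can be reproduced with a short sketch. The record layout (target node, trust weight, local evaluation) and the function names are assumptions made for illustration; `Node_007` is a made-up second target added to show a rejected report.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Report = Tuple[str, float, float]  # (target node, trust weight, local evaluation)


def anomaly_confidence(reports: List[Tuple[float, float]]) -> float:
    """Trust-weighted average of the neighbors' local evaluations."""
    total_weight = sum(w for w, _ in reports)
    if total_weight == 0:
        return 0.0
    return sum(w * e for w, e in reports) / total_weight


def faulty_node_list(update_packet: List[Report],
                     threshold: float = 2.0 / 3.0) -> List[str]:
    """Group reports per target, vote, and return deduplicated abnormal nodes."""
    grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for target, weight, evaluation in update_packet:
        grouped[target].append((weight, evaluation))
    # Keys of the grouping are unique, so the result is already deduplicated.
    return sorted(t for t, r in grouped.items()
                  if anomaly_confidence(r) > threshold)


if __name__ == "__main__":
    packet = [
        ("Node_102", 0.9, 0.85),  # neighboring node A
        ("Node_102", 0.8, 0.90),  # neighboring node B
        ("Node_102", 0.4, 0.20),  # neighboring node C, reduced trust
        ("Node_007", 0.9, 0.10),  # hypothetical healthy target
        ("Node_007", 0.8, 0.20),
    ]
    print(faulty_node_list(packet))
```

Because node C carries a low trust weight, its dissenting evaluation of 0.20 pulls the weighted average down only to about 0.745, which still exceeds two-thirds.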

[0030] In step S13, the available computing nodes obtained in advance are dynamically eliminated and maintained according to the list of faulty nodes to obtain a real-time node pool.

[0031] In one implementation, available computing nodes are acquired. This is done by reading a pre-defined asset management database and extracting the physical unique identifiers, network interconnection protocol addresses, and hardware computing power parameters (such as the number of cores and clock speed) of all initially registered nodes in the distributed system. This data is then loaded into a pre-established metadata management table in memory, and the status bits of all nodes are initialized to healthy.

[0032] It should be noted that the preset asset management database is created by the administrator manually entering or automatically scanning all computing nodes' physical identifiers, network addresses, and hardware parameters during the initial deployment of the distributed system, and storing this information in a relational database table; each time a node joins or leaves the cluster, the database is incrementally updated.

[0033] It should be noted that this embodiment dynamically removes and maintains the available computing nodes based on the list of faulty nodes. The system receives the list of faulty nodes output in step S12, extracts the physical identifiers of the faulty nodes in the list, and performs a hash address search in the metadata management table. If a matching node record is found, the system performs a dynamic removal operation, changing the flag of that node in the status mapping index from healthy to faulty, synchronously updating the cluster active routing table, and intercepting communication commands sent to that node to prevent tasks from being scheduled to the identified fault source.

[0034] It is worth noting that during maintenance, this embodiment incorporates a lightweight heartbeat monitoring mechanism to refresh the node status in real time. This is achieved by listening to the periodic probe signals sent by the agent programs deployed on each computing node. If no heartbeat feedback is received from a specific node within three consecutive preset heartbeat monitoring periods (e.g., fifteen seconds), even if the node has not been confirmed as a precise fault via the Gossip protocol, the system uses dynamic maintenance logic to update its status to suspicious and perform pre-removal, removing it from the currently allocable resources to ensure the physical fidelity of the real-time node pool.

[0035] It is worth noting that the heartbeat monitoring period is set to twice the average state propagation delay of the system, with a default value of five seconds. The number of consecutive periods is set to three, meaning that a node is considered suspicious if no heartbeat is received from it for three consecutive periods. For low-latency network environments, the period can be shortened to two seconds; for high-latency or satellite communication environments, the period can be extended to fifteen seconds.
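The heartbeat rule can be illustrated with a short Python sketch. The function names and the use of plain timestamps in seconds are assumptions made for illustration; the factor of two, the default period of five seconds, and the count of three periods come from the text.

```python
def heartbeat_period(avg_propagation_delay_s: float) -> float:
    """Heartbeat monitoring period: twice the average state propagation delay."""
    return 2.0 * avg_propagation_delay_s


def is_suspicious(last_heartbeat_s: float, now_s: float,
                  period_s: float = 5.0, missed_periods: int = 3) -> bool:
    """True when no heartbeat has arrived for `missed_periods` full periods."""
    return (now_s - last_heartbeat_s) >= period_s * missed_periods
```

With the defaults, a node becomes suspicious after fifteen seconds of silence; with a two-second period in a low-latency network, the same rule triggers after six seconds.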

[0036] This embodiment obtains a real-time node pool. Data from all nodes in the metadata management table that are in a healthy state and have a normal heartbeat are aggregated to generate a structured object containing the remaining processing capacity and real-time available bandwidth weights of each node. This object is then designated as the real-time node pool, serving as the resource benchmark for subsequent task remapping and data distribution.

[0037] For example, two hundred initial computing nodes are registered in the asset management database. If the faulty node list determined in step S12 includes node identifier Node_102, the system locates the node in the metadata management table and sets its status bit to zero (removes it). Simultaneously, if heartbeat monitoring detects that Node_110 has not provided a signal for fifteen consecutive seconds, the system marks it as suspicious and performs logical removal. Finally, the real-time node pool aggregates the physical information of the remaining one hundred and ninety-eight healthy nodes, outputting a real-time resource view with topology consistency.

[0038] In step S14, a consistency mapping process is performed on the pre-acquired active tasks according to the real-time node pool to obtain a dynamic task distribution, including: parsing the task key-value characteristics of the active tasks; mapping the computing nodes in the real-time node pool to a preset hash ring using a consistent hashing algorithm, and dynamically adjusting the number of corresponding virtual nodes according to the current load utilization of the computing nodes to obtain a dynamic hash ring; and calculating the hash value of the task key-value characteristics, searching for the virtual node closest to the hash value along a preset direction on the dynamic hash ring, and assigning the active task to the computing node corresponding to the virtual node to obtain the dynamic task distribution.

[0039] In one implementation, active tasks are identified by the system monitoring a preset global task distribution queue and load balancing events within the cluster in real time. When a new data processing request is received from an external application, or when an existing task remapping instruction is triggered due to compute node failure or overload, the corresponding task records are written into the task scheduling metadata management table of the distributed system, and the tasks in the pending state are identified as active tasks. Subsequently, the task key-value characteristics of the active tasks are parsed. By querying the task scheduling metadata management table of the distributed system, the globally unique identifier, task data size, and task computation priority of the currently pending or migrated active tasks are extracted. The globally unique identifier and task logical attributes are concatenated to generate task key-value characteristics that characterize the uniqueness of the task.

[0040] It should be noted that a consistent hashing algorithm is used to map the compute nodes in the real-time node pool to a preset hash ring, and the number of virtual nodes is dynamically adjusted according to the current load utilization of the compute nodes to obtain a dynamic hash ring. First, the physical addresses of each healthy node in the real-time node pool are extracted, and the SHA-256 hash algorithm is used to perform hash calculations on the physical addresses, mapping the generated hash values to the preset hash ring space. Second, the average load utilization of all nodes in the real-time node pool is calculated and used as a comparison benchmark. If it is determined that the current load utilization of the target compute node is greater than the average load utilization, a dynamic increase in the number of virtual nodes is performed.

[0041] It is worth noting that the potential load increment is calculated using the following formula:

$$U_{\mathrm{exp}} = U_{\mathrm{cur}} + k \cdot \frac{D}{C_{\max}}$$

[0042] where $U_{\mathrm{exp}}$ indicates the expected load utilization rate after assigning the task; $U_{\mathrm{cur}}$ indicates the current load utilization rate of the node; $D$ indicates the data amount of the active task currently being parsed; $C_{\max}$ indicates the preset data processing capacity upper limit of the target computing node; and $k$ indicates the preset computing power conversion factor. The preset computing power conversion factor is determined by extracting the task execution logs of the node within its historical operating cycle, taking the amount of historical task data in the logs as the independent variable and the average increase in CPU utilization during the corresponding task execution period as the dependent variable, performing linear regression fitting by the least squares method, and calibrating the slope of the fitted line as the preset computing power conversion factor. The preset data processing capacity upper limit is determined by extracting the maximum bus transmission rate specified in the node's hardware specifications and setting it as the upper limit of the node's data processing capacity. It should be noted that the current load utilization rate in the above formula is expressed in decimal form, for example, 75% is substituted as 0.75, and the calculated result of the formula is also in decimal form, for example, 0.8 corresponds to 80%. It is worth noting that this formula assumes that the data amount of the task to be distributed is linearly related to the computing power or bandwidth resources required by the node, and that the upper limit of the node's data processing capacity is a static configuration value. For computationally intensive tasks, the upper limit of the data processing capacity in this formula should be replaced with the number of CPU cores or the floating-point operations per second of the node.

[0043] Based on the calculation results of the potential load increment, if the expected load utilization does not exceed the preset performance bottleneck threshold, the number of virtual nodes of the node is increased from a base value (e.g., 32) to a preset extended value (e.g., 40). This increases the probability of the node carrying tasks by increasing the coverage density on the hash ring, generating a dynamic hash ring. If the expected load utilization exceeds the preset performance bottleneck threshold, the mapping of the task to the current target node is canceled, the number of virtual nodes of the node remains unchanged, and the next computing node that meets the performance requirements is retrieved on the dynamic hash ring along a preset direction. The preset performance bottleneck threshold is determined by independently performing a full-load stress test on each computing node, recording the memory usage and network throughput when the CPU utilization is stable at 95%, and adding a 5% safety margin to the load utilization in this state as the performance bottleneck threshold for the node. For the node in the example, this threshold is set to 85%.

[0044] It is worth noting that the baseline number of virtual nodes is obtained by multiplying the total number of nodes in the distributed system by 8 and then dividing by the arithmetic square root of the number of nodes, and this number is not less than 16 and not more than 128; the preset expansion number is obtained by multiplying the baseline number by 1.25 and then rounding up, and the expansion number does not exceed twice the baseline number.
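The two sizing rules can be written out as a small Python sketch. The function names are hypothetical, and rounding the baseline to the nearest integer is an assumption, since the text only specifies rounding up for the expansion number.

```python
import math


def baseline_virtual_nodes(total_nodes: int) -> int:
    """Baseline count: N * 8 / sqrt(N), kept within [16, 128]."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    raw = total_nodes * 8 / math.sqrt(total_nodes)
    # Rounding to the nearest integer is an assumption; the text does not say.
    return max(16, min(128, round(raw)))


def expanded_virtual_nodes(baseline: int) -> int:
    """Expansion count: ceil(1.25 * baseline), capped at twice the baseline."""
    return min(math.ceil(baseline * 1.25), 2 * baseline)
```

For a 16-node cluster this yields a baseline of 32 and an expansion number of 40, which are the illustrative values used in paragraph [0043].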

[0045] Subsequently, the hash value of the task key-value feature is calculated. A virtual node closest to the hash value is searched along a preset direction on the dynamic hash ring. The active task is then assigned to the computing node corresponding to the virtual node, resulting in a dynamic task distribution. The same SHA-256 algorithm is used to calculate the hash value of the parsed task key-value feature, and this hash value is then located on the dynamic hash ring. The system performs a linear search along a preset direction to find the coordinates of the first virtual node with a coordinate value greater than the task's hash value, and assigns the corresponding task to the physical computing node to which that virtual node belongs.
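A minimal sketch of the lookup on the dynamic hash ring follows. The class name `DynamicHashRing`, the `address#vn<i>` labeling of virtual nodes, and the example addresses are assumptions for illustration. The sketch also swaps the linear search described above for a binary search over the sorted ring, which returns the same first virtual node whose coordinate is greater than the task's hash value.

```python
import bisect
import hashlib
from typing import Dict, List, Tuple


def sha256_int(key: str) -> int:
    """Position on the ring: the SHA-256 digest read as a big-endian integer."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")


class DynamicHashRing:
    def __init__(self, vnode_counts: Dict[str, int]) -> None:
        # vnode_counts maps a node's physical address to its virtual node count.
        ring: List[Tuple[int, str]] = []
        for address, count in vnode_counts.items():
            for i in range(count):
                ring.append((sha256_int(f"{address}#vn{i}"), address))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [owner for _, owner in ring]

    def assign(self, task_key: str) -> str:
        """Return the node owning the first virtual node past the task's hash."""
        if not self._points:
            raise ValueError("the hash ring contains no virtual nodes")
        idx = bisect.bisect_right(self._points, sha256_int(task_key))
        return self._owners[idx % len(self._owners)]  # wrap around the ring


ring = DynamicHashRing({"10.0.0.1": 32, "10.0.0.2": 40, "10.0.0.3": 32})
owner = ring.assign("task-T1|size=450GB|priority=1")
```

Giving a node more virtual nodes (40 instead of 32 here) increases the share of the ring it covers, and removing a node only remaps the tasks that the removed node previously owned.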

[0046] For example, the system obtains an active task T1 with a data volume of 450GB. The system detects that the current load utilization of node A is 75%. The system extracts the preset data processing capacity upper limit of node A as 9000GB and the preset computing power conversion factor as 1.0. Substituting into the formula, the expected load utilization after allocating T1 is calculated to be 80% (i.e., 0.75 + 1.0 × 450 / 9000 = 0.80). Since this value is below the bottleneck threshold of 85%, the number of virtual nodes for node A is dynamically increased to 40. Then, the hash value of task T1 is calculated, and the nearest virtual node, which belongs to node A, is found by searching clockwise along the hash ring. T1 is assigned to node A, and the dynamic task distribution containing this mapping is output.
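The arithmetic of this example can be checked with a short Python sketch. The function names are hypothetical, and the formula used (current utilization plus conversion factor times data amount divided by capacity upper limit) is an inference from the quantities defined in paragraph [0042] and the figures in this example, not a quotation of the original expression.

```python
from typing import Tuple


def expected_load_utilization(current_util: float, task_data: float,
                              capacity_limit: float,
                              conversion_factor: float) -> float:
    """Expected utilization after assigning the task (all values as decimals)."""
    if capacity_limit <= 0:
        raise ValueError("capacity_limit must be positive")
    return current_util + conversion_factor * task_data / capacity_limit


def virtual_node_decision(expected_util: float, bottleneck_threshold: float,
                          base: int = 32, extended: int = 40) -> Tuple[int, bool]:
    """Return (virtual node count, whether the task stays mapped to this node)."""
    if expected_util <= bottleneck_threshold:
        return extended, True   # within the threshold: raise ring coverage
    return base, False          # over the threshold: count unchanged, try next node


expected = expected_load_utilization(0.75, 450, 9000, 1.0)   # node A, task T1
vnodes, accepted = virtual_node_decision(expected, 0.85)
```

A node already at 84% utilization would reach 89% with the same task, so the second function would leave its virtual node count at 32 and signal that the next node on the ring should be tried.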

[0047] In step S15, the link transmission index is obtained, and the frequency attenuation processing of the communication path corresponding to the dynamic task distribution is performed according to the link transmission index to obtain an optimized transmission strategy. Based on the optimized transmission strategy, topology routing is performed to obtain the communication link and switching plan.

[0048] Obtaining the link transmission index and, based on it, performing frequency attenuation processing on the communication paths corresponding to the dynamic task distribution to obtain an optimized transmission strategy includes: extracting the data exchange paths involved in the dynamic task distribution, and using probe packets to calculate the current round-trip delay value of each data exchange path as the link transmission index; if the current round-trip delay value is greater than the preset delay threshold, performing a multiplicative attenuation operation on the preset reference switching frequency based on the over-limit ratio and the preset attenuation factor to obtain the target switching frequency; and replacing the execution frequency parameter of the data exchange path with the target switching frequency to obtain the optimized transmission strategy.

[0049] Performing topology routing based on the optimized transmission strategy to obtain the communication link and switching plan includes: extracting the current bandwidth utilization rate of the communication path corresponding to the optimized transmission strategy, and obtaining the link transmission index corresponding to the communication path; determining whether the current bandwidth utilization rate is greater than a preset bandwidth threshold; if the current bandwidth utilization rate is not greater than the preset bandwidth threshold, directly determining the current communication path as the communication link; if the current bandwidth utilization rate is greater than the preset bandwidth threshold, searching for a backup communication path in the global network topology map in combination with the link transmission index, and calculating the estimated communication overhead of the backup communication path based on the latency and bandwidth data contained in the link transmission index; if the estimated communication overhead is less than a preset overhead threshold, determining the backup communication path as the communication link; and aggregating the routing instructions, which include the target switching frequency and the communication link, to obtain the switching plan.

[0050] In one implementation, the data exchange paths involved in the dynamic task distribution are extracted, and the current round-trip delay value of the data exchange path is calculated using probe packets as the link transmission indicator. Specifically, the system obtains the logical connection between the source node and the target node in the dynamic task distribution, calls a network diagnostic tool to control the source node to send a 64-byte ICMP probe packet to the target node, listens for the time of receiving the acknowledgment, calculates the time difference between sending and receiving, and obtains the current round-trip delay value.

[0051] In one implementation, the network diagnostic tool can refer to the ping command-line tool built into the operating system. This tool can measure the round-trip delay between the source node and the target node by sending ICMP echo request messages and receiving echo response messages. In actual implementation, the program can execute the ping command by calling system interfaces, such as the subprocess or exec function, and parse the time value in its output to obtain the round-trip delay data.

[0052] It should be noted that if the current round-trip delay value is greater than a preset delay threshold, a multiplicative attenuation operation is performed on the preset reference switching frequency based on the over-limit ratio and a preset attenuation factor to obtain the target switching frequency. In this embodiment, the difference between the current round-trip delay value and the preset delay threshold is extracted, and the difference is divided by the preset delay threshold to obtain the over-limit ratio. Subsequently, the frequency attenuation calculation is performed: the over-limit ratio is subtracted from one, and the result is multiplied by the preset reference switching frequency and then by the preset attenuation factor to obtain the target switching frequency. The execution frequency parameter of the data exchange path is replaced with the target switching frequency to obtain an optimized transmission strategy.
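The attenuation step can be sketched in Python as below. The sketch reads the subtraction as one minus the over-limit ratio, which is the reading that reproduces the figure of 2 times per minute in the example of paragraph [0058]. The function name is hypothetical, and clamping the result at zero when the delay exceeds twice the threshold is an added assumption that the text does not address.

```python
def target_switching_frequency(rtt_ms: float, delay_threshold_ms: float,
                               reference_frequency: float,
                               attenuation_factor: float) -> float:
    """Multiplicative attenuation of the reference switching frequency."""
    if delay_threshold_ms <= 0:
        raise ValueError("delay_threshold_ms must be positive")
    if rtt_ms <= delay_threshold_ms:
        return reference_frequency  # no attenuation below the threshold
    over_limit_ratio = (rtt_ms - delay_threshold_ms) / delay_threshold_ms
    # Assumption: never return a negative frequency for very large delays.
    scale = max(0.0, 1.0 - over_limit_ratio)
    return scale * reference_frequency * attenuation_factor


# 150 ms measured, 100 ms threshold, 5 per minute reference, factor 0.8
freq = target_switching_frequency(150.0, 100.0, 5.0, 0.8)
```

Here the over-limit ratio is 0.5, so the reference frequency is scaled by 0.5 and then by the attenuation factor 0.8.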

[0053] The preset attenuation factor is determined by traversing candidate attenuation factors from 0.1 to 1.0 with a step size of 0.1 in a simulated network congestion experimental environment. The total number of control messages during the period from congestion to link recovery is recorded under each candidate factor. The candidate factor that minimizes the total number of control messages while preventing message timeout accumulation is selected as the attenuation factor for that link. For different links, the attenuation factor can be configured independently or uniformly set to 0.8. The preset delay threshold is determined by collecting historical round-trip delay values of all communication paths in the distributed system under normal operating conditions, calculating the average and standard deviation of these values, and adding twice the standard deviation to the average; the result is used as the delay threshold. For links without historical data, the threshold is initially set to 100 milliseconds. During operation, after every 100 successful data exchanges, the threshold is recalibrated based on the latest delay statistics. The preset reference switching frequency is obtained by extracting the average sending interval of control messages between nodes in the distributed system under stable operating conditions. The reciprocal of this interval is taken as the initial value of the reference switching frequency, which is set to five times per minute by default. When the system detects that no node failure or link congestion event has occurred for one consecutive hour, the reference switching frequency is reduced to 80% of its current value to further reduce communication overhead.
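The two calibration rules for the delay threshold and the reference frequency can be sketched as follows, reading the delay threshold as the average plus twice the standard deviation. The function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or sample form is meant.

```python
import statistics
from typing import Sequence

DEFAULT_DELAY_THRESHOLD_MS = 100.0  # initial value for links without history


def delay_threshold_ms(history_ms: Sequence[float]) -> float:
    """Average plus twice the standard deviation of historical round-trip delays."""
    if len(history_ms) == 0:
        return DEFAULT_DELAY_THRESHOLD_MS
    mean = statistics.mean(history_ms)
    std = statistics.pstdev(history_ms)  # population form is an assumption
    return mean + 2.0 * std


def relaxed_reference_frequency(current_frequency: float) -> float:
    """After one hour without failure or congestion: 80% of the current value."""
    return current_frequency * 0.8
```

In a running system the first function would be called again after every 100 successful data exchanges, using the most recent delay samples as `history_ms`.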

[0054] It is worth noting that the communication link and switching plan are obtained based on topology routing. The current bandwidth utilization rate of the communication path corresponding to the optimized transmission strategy is extracted, and it is determined whether it is greater than a preset bandwidth threshold (e.g., 75%). If the current bandwidth utilization rate is not greater than the preset bandwidth threshold, the current communication path is directly determined as the communication link; if the current bandwidth utilization rate is greater than the preset bandwidth threshold, a backup communication path is searched for in the global network topology map in conjunction with the link transmission index. The preset bandwidth threshold is obtained by extracting the nominal maximum bandwidth value of the network interface and multiplying it by a redundancy coefficient. The redundancy coefficient is between 0.75 and 0.85, and the specific value is adjusted according to the historical congestion frequency of the link. When the link has been congested more than three times in the past hour, the redundancy coefficient is reduced by 0.05; otherwise, it is increased by 0.05, but does not exceed 0.9.

[0055] It should be noted that the global network topology graph periodically collects the connection relationships between each switching device and computing node through link layer discovery protocol probes deployed in the network. The collected neighbor relationships are summarized to form a global directed graph, where the nodes of the graph are computing nodes or switching devices, and the edges of the graph are physical or logical links. The bandwidth utilization and latency data of each edge are updated in real time.

[0056] Subsequently, the estimated communication overhead score of the backup communication path is calculated based on the delay and bandwidth data included in the link transmission metrics. The system extracts the real-time delay values and remaining bandwidth values of all network segments traversed by the backup communication path, sums all delay values to obtain the total path delay, and performs a weighted average of the bandwidth values of each network segment to obtain the path equivalent bandwidth. The total path delay is multiplied by a first weighting coefficient, and then multiplied by the reciprocal of the path equivalent bandwidth and a second weighting coefficient to obtain a dimensionless scalar value, which is determined as the estimated communication overhead score. Regarding the first and second weighting coefficients, this embodiment extracts the routing switch log dataset of the distributed system under historical network congestion conditions, and uses a grid search algorithm to find the weight combination that minimizes the packet retransmission rate within the range of zero to one for calibration.
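The score calculation can be sketched as follows. The function names and the `(delay, remaining_bandwidth)` tuple layout are assumptions, and because the weights of the bandwidth average are not specified, the sketch uses an equal-weight mean as a stand-in.

```python
from typing import Sequence, Tuple


def estimated_overhead_score(segments: Sequence[Tuple[float, float]],
                             w_delay: float, w_bandwidth: float) -> float:
    """Score of a backup path from per-segment (delay, remaining_bandwidth)."""
    if not segments:
        raise ValueError("the backup path must contain at least one segment")
    total_delay = sum(delay for delay, _ in segments)
    # Assumption: equal weights stand in for the unspecified weighted average.
    equivalent_bandwidth = sum(bw for _, bw in segments) / len(segments)
    if equivalent_bandwidth <= 0:
        raise ValueError("equivalent bandwidth must be positive")
    return (total_delay * w_delay) * ((1.0 / equivalent_bandwidth) * w_bandwidth)


def use_backup_path(score: float, overhead_threshold: float) -> bool:
    """The backup path is adopted only when its score is below the threshold."""
    return score < overhead_threshold


score = estimated_overhead_score([(20.0, 100.0), (30.0, 300.0)], 0.5, 0.8)
```

A path with higher total delay or lower equivalent bandwidth receives a higher score and is therefore less likely to be adopted.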

[0057] If the estimated communication overhead score is determined to be less than a preset overhead threshold, the backup communication path is identified as the communication link. If the estimated communication overhead score is greater than or equal to the preset overhead threshold, the cost of switching the backup path is deemed too high, the system abandons route reselection, and continues to use the current communication link for rate-limited transmission. The routing instructions, including the target switching frequency and the communication link, are aggregated to obtain a switching plan. The preset overhead threshold is calculated by collecting the estimated communication overhead values of all communication paths under normal system operation, calculating the average and standard deviation of these values, and adding twice the standard deviation to the average value as the overhead threshold. If no historical data is available, it is initially set to 500 and dynamically calibrated during system operation based on the path switching success rate. The threshold is increased by 2% for each successful switch and decreased by 5% for each failed switch.

[0058] For example, the system detects a current round-trip delay of 150ms between node A and node C. Since the preset delay threshold is 100ms, the calculated over-limit ratio is 0.5. The system extracts a reference switching frequency of 5 times per minute and, using a preset attenuation factor of 0.8, calculates a target switching frequency of 2 times per minute (i.e., (1 − 0.5) × 5 × 0.8 = 2). Because the current bandwidth utilization of this path is 82%, exceeding the 75% threshold, the system triggers the Bellman-Ford algorithm to search for a transit path via node F. Based on real-time metrics on this path, the estimated communication overhead score is calculated to be 280. This score is less than the preset overhead threshold of 400, so the system ultimately identifies this path as the communication link and generates a switching plan containing a target switching frequency of 2 times per minute.

[0059] In step S16, the pre-acquired data to be distributed is subjected to block encoding and distributed distribution according to the communication link to obtain the disaster recovery data distribution status, including: obtaining the data to be distributed, slicing it with an erasure coding algorithm to obtain original data blocks, and calculating redundancy check blocks based on the original data blocks; extracting the physical location attribute information of the global healthy nodes, and dividing independent fault isolation domains based on the physical location attribute information; and distributing the original data blocks and the redundancy check blocks in parallel over the communication link to global healthy nodes belonging to different fault isolation domains to obtain the disaster recovery data distribution status.

[0060] In one implementation, data to be distributed is acquired, and an erasure coding algorithm is used to slice the data to obtain original data blocks. Redundancy check blocks are then calculated based on these original data blocks. This embodiment extracts the maximum tolerable number of simultaneous faulty nodes pre-set by the system and directly determines the required number of redundancy check blocks. Simultaneously, the upper limit of single-node storage throughput and the maximum transmission unit (MTU) limit of the network are extracted to calculate the number of data blocks that meet transmission efficiency requirements. This embodiment uses the Reed-Solomon algorithm as the erasure coding algorithm. According to the determined number of data blocks, the data to be distributed is divided into multiple segments of equal size to obtain original data blocks. Subsequently, an encoding generator matrix containing an identity matrix and a Vandermonde matrix is constructed on the Galois field. Matrix multiplication is performed between the vector corresponding to the original data block and the encoding generator matrix to calculate the corresponding number of redundant data segments, which are then output as redundancy check blocks.

[0061] The preset maximum number of simultaneous faulty nodes is determined from the total number of nodes in the distributed system and the data replication redundancy requirements: the square root of the total number of nodes is taken, rounded down, and divided by two. For small clusters with fewer than ten nodes, the default value is two; for large clusters with one hundred or more nodes, the default value is four.

[0062] It should be noted that this embodiment extracts the physical location attribute information of the global healthy nodes and divides independent fault isolation domains based on this information. This embodiment reads the cluster's static asset topology database and extracts the rack number, access switch identifier, and power supply circuit identifier associated with each global healthy node, encapsulating these identifiers into physical location attribute information. All global healthy nodes are traversed, and nodes with the same physical location attribute information (e.g., nodes located in the same physical rack or connected to the same switch) are grouped into the same logical group. Each logical group is defined as an independent fault isolation domain, generating multiple sets of mutually isolated physical nodes.

[0063] It is worth noting that in this embodiment, the original data blocks and the redundant check blocks are distributed in parallel to global healthy nodes belonging to different fault isolation domains using the communication link, resulting in a disaster recovery data distribution status. This embodiment merges the original data blocks and the redundant check blocks to form a set of blocks to be transmitted. The set of blocks to be transmitted is traversed, and for each data block, a random selection operation without replacement is performed in all constructed fault isolation domains to select a target isolation domain. Within the target isolation domain, a healthy node with the lowest load utilization is selected as the target storage node. This embodiment utilizes the low-overhead communication link determined in the preceding steps to establish a multi-threaded concurrent transmission channel, sending each data block to the corresponding selected target storage node, and applying a preset rate limiting threshold (e.g., 15 megabytes per second) to the transmission rate to prevent bandwidth congestion. After receiving persistent write confirmation signals from all target storage nodes, the hash value of each data block and the physical address mapping relationship of the storage node are recorded, and this mapping relationship table is output as the disaster recovery data distribution status.
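The grouping and placement logic can be sketched in Python as follows. The `Node` record, the function names, and the choice to raise an error when there are fewer isolation domains than blocks are assumptions for illustration; transmission, rate limiting, and write confirmation are not modeled.

```python
import random
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple


class Node(NamedTuple):
    node_id: str
    rack: str
    switch: str
    power_circuit: str
    load: float  # current load utilization as a decimal


DomainKey = Tuple[str, str, str]


def build_fault_domains(nodes: Sequence[Node]) -> Dict[DomainKey, List[Node]]:
    """Group nodes sharing the same physical location attributes."""
    domains: Dict[DomainKey, List[Node]] = defaultdict(list)
    for node in nodes:
        domains[(node.rack, node.switch, node.power_circuit)].append(node)
    return dict(domains)


def place_blocks(block_ids: Sequence[str], nodes: Sequence[Node],
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Map each block to the least-loaded node of a distinct isolation domain."""
    rng = rng or random.Random()
    domains = build_fault_domains(nodes)
    if len(block_ids) > len(domains):
        raise ValueError("fewer fault isolation domains than blocks")
    # Random selection without replacement: one domain per block.
    chosen = rng.sample(sorted(domains), k=len(block_ids))
    return {
        block: min(domains[domain_key], key=lambda n: n.load).node_id
        for block, domain_key in zip(block_ids, chosen)
    }
```

Because each domain is drawn at most once, the loss of any single rack, switch, or power circuit affects at most one block of a given data set.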

[0064] For example, the system acquires 200 megabytes of data to be distributed. The maximum tolerable number of simultaneous faulty nodes is set at 2, indicating the need to generate 2 redundant check blocks. Based on transmission efficiency, the number of data blocks is calculated to be 4. The system divides the 200 megabytes of data into four 50-megabyte raw data blocks and uses the Reed-Solomon algorithm's generator matrix operation to calculate two 50-megabyte redundant check blocks. Subsequently, the system reads node attributes and assigns nodes located in different physical racks to independent fault isolation domains. Utilizing low-overhead communication links, the system initiates 6 parallel transmission threads to send these 6 data blocks to healthy nodes belonging to 6 different rack fault isolation domains. After transmission and persistent confirmation, the system generates and saves the disaster recovery data distribution status, recording the block address mapping relationships.

[0065] It is worth noting that after performing block encoding and distributed distribution processing on the pre-acquired data to be distributed according to the communication link to obtain the disaster recovery data distribution status, the method further includes: obtaining the node recovery signal of an offline node, and extracting the current hardware resource margin data and performance benchmark test data of the offline node; inputting the hardware resource margin data and the performance benchmark test data into a pre-built scoring model for calculation to obtain the carrying capacity score; if the carrying capacity score matches the access requirements of the distributed system load, assigning a preset initial weight value to the offline node; and extracting the operational error rate of the offline node within a continuous monitoring period, and, if the operational error rate is continuously lower than a preset stability threshold, increasing the initial weight value by a preset step size until the normal allocation standard is reached, and updating the offline node that has reached the normal allocation standard into the real-time node pool.

[0066] In one implementation, a node recovery signal from an offline node is acquired. A monitoring agent deployed on the cluster management plane continuously listens for status change events in the offline node queue. When a packet containing a heartbeat recovery command and a hardware self-test pass flag is received from an offline node, it is identified as the node recovery signal. Subsequently, the current hardware resource availability data and performance benchmark data of the offline node are extracted. Specifically, the node's kernel monitoring interface is accessed via Remote Procedure Call (RPC) to read its current number of idle CPU cores, available memory capacity, and available network bandwidth, which are then combined to determine the hardware resource availability data. Simultaneously, a preset stress test script is executed locally on the node to measure its disk input / output operations (IOPS) per unit time, obtaining performance benchmark data, for example, a measurement of 8500 operations per second.

[0067] It should be noted that in this embodiment, the hardware resource surplus data and the performance benchmark test data are input into a pre-built scoring model for calculation to obtain a carrying capacity score. Regarding the pre-built scoring model, a multi-dimensional performance index set of the distributed system within its historical operating cycle is extracted; a max-min normalization algorithm is used to map each index to a range of zero to one hundred; and the entropy weight method is used to calculate the information entropy of each index and its corresponding weight coefficient. The calculation process involves multiplying each sub-index corresponding to the currently acquired hardware resource surplus data and performance benchmark test data by its corresponding weight coefficient and performing a weighted summation operation to output a scalar value characterizing the overall performance of the node, which is then determined as the carrying capacity score, for example, a score of 85.
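A compact sketch of the scoring model follows, using the standard form of the entropy weight method. The function names, the example history, and the handling of edge cases (constant indicators, clamping current values to the historical range) are assumptions; all indicators are treated as "larger is better".

```python
import math
from typing import List, Sequence, Tuple


def min_max_bounds(history: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    return [(min(col), max(col)) for col in zip(*history)]


def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max mapping to [0, 100]; a constant indicator maps to 0 (assumption)."""
    if hi == lo:
        return 0.0
    return min(100.0, max(0.0, 100.0 * (value - lo) / (hi - lo)))


def entropy_weights(history: Sequence[Sequence[float]]) -> List[float]:
    n = len(history)
    if n < 2:
        raise ValueError("at least two historical samples are required")
    diversity = []
    for j, (lo, hi) in enumerate(min_max_bounds(history)):
        col = [normalize(row[j], lo, hi) for row in history]
        total = sum(col)
        if total == 0:
            diversity.append(0.0)  # constant indicator carries no information
            continue
        entropy = -sum((x / total) * math.log(x / total)
                       for x in col if x > 0) / math.log(n)
        diversity.append(1.0 - entropy)
    d_sum = sum(diversity)
    if d_sum == 0:
        return [1.0 / len(diversity)] * len(diversity)
    return [d / d_sum for d in diversity]


def carrying_capacity_score(sample: Sequence[float],
                            history: Sequence[Sequence[float]]) -> float:
    """Weighted sum of the normalized sub-indicators of a recovering node."""
    bounds = min_max_bounds(history)
    weights = entropy_weights(history)
    return sum(w * normalize(v, lo, hi)
               for w, v, (lo, hi) in zip(weights, sample, bounds))


# Hypothetical history: idle CPU cores, free memory (GB), bandwidth (Mbps), IOPS
history = [(2, 4.0, 100.0, 5000.0), (4, 8.0, 200.0, 7000.0), (8, 16.0, 400.0, 9000.0)]
score = carrying_capacity_score((6, 12.0, 300.0, 8500.0), history)
```

Indicators whose historical values vary more receive lower entropy and hence larger weights, so they influence the carrying capacity score more strongly.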

[0068] If the carrying capacity score is determined to match the access requirements of the system load, this embodiment assigns a preset initial weight value to the offline node. The average carrying capacity score of all healthy nodes in the real-time node pool is calculated and used as the access requirement of the distributed system load; for example, the current average score is 72. The carrying capacity score of the offline node is compared with this average. If the score is less than the average, the recovered node is determined not to have access capability, and the system rejects its access request, keeping it in the gray-scale observation queue. If the score is greater than or equal to the average, the node is determined to have access capability; in that case, the node whitelist of the task scheduler is modified, the node is re-marked as pending activation, and the preset initial weight value is assigned. The preset initial weight value is determined as follows: the historical load fluctuation curve of the cluster during single-node failure recovery is extracted, the overshoot of the system response time under different initial weights is calculated, and the maximum weight that keeps the overshoot below 10% without causing system oscillation is selected, which is calibrated to 0.3.
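The admission check in this paragraph reduces to a comparison against the pool average. The following Python sketch illustrates it; the function name, the returned dictionary, and the state labels are hypothetical and chosen only for illustration, while the initial weight of 0.3 and the example scores come from the text.

```python
from statistics import mean

INITIAL_WEIGHT = 0.3  # calibrated value stated in the text

def admission_decision(candidate_score, healthy_scores, initial_weight=INITIAL_WEIGHT):
    """Compare a recovered node's score with the average score of healthy nodes.

    healthy_scores must be non-empty; statistics.mean raises StatisticsError otherwise.
    """
    requirement = mean(healthy_scores)
    if candidate_score >= requirement:
        # Node may rejoin: mark as pending activation with the preset initial weight.
        return {"state": "pending_activation", "weight": initial_weight, "requirement": requirement}
    # Node stays in the gray-scale observation queue and receives no traffic.
    return {"state": "gray_observation", "weight": 0.0, "requirement": requirement}
```

For instance, a node scoring 85 against healthy nodes scoring 70, 72, and 74 (average 72) would be marked as pending activation with weight 0.3.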

[0069] This embodiment extracts the operational error rate of the offline node within a continuous monitoring window. The window is set to 5 monitoring periods, each lasting 30 seconds. Within each monitoring period, the total number of task requests processed by the node and the number of abnormal requests that returned error codes or timed out are counted. The number of abnormal requests is divided by the total number of task requests to obtain the operational error rate. If the operational error rate remains below a preset stability threshold throughout the window, the initial weight value is increased by a preset step size until the normal allocation standard is reached. If the operational error rate is greater than or equal to the preset stability threshold in any monitoring period, the system immediately stops increasing the weight and either maintains the current weight or removes the node from the real-time node pool for secondary investigation.
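The monitoring rule above can be illustrated with a short Python sketch. The function names and the returned labels are hypothetical, and treating a period with zero requests as error-free is an assumption the text does not address.

```python
def operational_error_rate(total_requests, abnormal_requests):
    """Abnormal requests (error codes or timeouts) divided by all requests in one period."""
    if total_requests == 0:
        return 0.0  # assumption: an idle period counts as error-free
    return abnormal_requests / total_requests

def evaluate_window(periods, stability_threshold):
    """periods is a list of (total_requests, abnormal_requests), one entry per period.

    Returns "increase" only if every period stays strictly below the threshold;
    otherwise returns "hold_or_remove", leaving the follow-up action to the operator.
    """
    for total, abnormal in periods:
        if operational_error_rate(total, abnormal) >= stability_threshold:
            return "hold_or_remove"
    return "increase"
```

With a threshold of 0.0001 (0.01%), five periods of 100000 requests with 8 abnormal requests each (0.008%) would allow the weight increase, whereas a single period with 20 abnormal requests would stop it.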

[0070] The number of continuous monitoring cycles is the ratio of the system's average fault recovery time to the duration of a single monitoring cycle, rounded up. The default duration of a single monitoring cycle is 30 seconds, and the average fault recovery time is estimated at 150 seconds, so the number of continuous monitoring cycles is five. For nodes with faster recovery speeds, the number can be dynamically adjusted to three or four based on actual operating data.
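The cycle count in this paragraph is a rounded-up ratio, shown here as a small Python sketch. The function name is hypothetical; the 30-second default and the 150-second estimate are the values given in the text.

```python
import math

def monitoring_cycle_count(mean_recovery_seconds, cycle_seconds=30):
    """Number of monitoring cycles: average recovery time over cycle duration, rounded up."""
    if cycle_seconds <= 0:
        raise ValueError("cycle_seconds must be positive")
    return math.ceil(mean_recovery_seconds / cycle_seconds)
```

Under this formula, a 150-second average recovery time gives five cycles, and nodes that recover in about 100 or 90 seconds would need four or three cycles respectively.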

[0071] The preset stability threshold is determined by collecting error rate samples of various tasks under long-term steady-state operation of the system and extracting the value corresponding to the 99th percentile, which is calibrated as 0.01%. The preset step size is obtained by statistically analyzing, in historical data, the average number of monitoring cycles required for offline nodes to recover and reach the normal allocation standard. The initial weight value is subtracted from the normal allocation standard (a weight of 1.0), and the result is divided by this average number of cycles to obtain the preset step size. If the initial weight is 0.3 and the average number of cycles is 14, the preset step size is (1.0 - 0.3) / 14 = 0.05. When there is no historical data, the default step size is 0.05, and the maximum number of steps is set to 20. If the current error rate remains below 0.01%, the current weight is increased by 0.05 after each monitoring cycle until the weight value reaches the normal allocation standard, i.e., a value of 1.0. Finally, the status bit of offline nodes that have reached the normal allocation standard is updated to active, and they are re-included in the real-time node pool for full task allocation.
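The worked numbers in this paragraph (initial weight 0.3, 14 cycles, step 0.05, final weight 1.0) are consistent with a step size equal to the gap between the normal allocation standard and the initial weight, divided by the average number of cycles. The Python sketch below follows that reading; the function names are hypothetical, and rounding each intermediate weight to ten decimal places is an implementation choice made here to avoid floating-point drift.

```python
def preset_step(initial_weight, average_cycles, full_weight=1.0):
    """Step size that reaches full_weight from initial_weight in average_cycles steps."""
    return (full_weight - initial_weight) / average_cycles

def ramp_schedule(initial_weight, step, full_weight=1.0, max_steps=20):
    """Weights after each successful monitoring cycle, starting with the initial weight.

    Stops when full_weight is reached or when max_steps increments have been applied.
    """
    weights = [initial_weight]
    while weights[-1] < full_weight and len(weights) - 1 < max_steps:
        next_weight = round(weights[-1] + step, 10)  # rounding avoids float drift
        weights.append(min(full_weight, next_weight))
    return weights
```

Starting from 0.3 with a step of 0.05, the schedule reaches 1.0 after 14 increments, which is within the stated maximum of 20 steps.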

[0072] For example, upon receiving a recovery signal from node S, the monitoring agent extracts its hardware resource surplus data, namely 4 idle CPU cores, 16 GB of available memory, and 1 Gbps of available bandwidth, together with a measured IOPS of 8500. Calculated using the scoring model, node S's carrying capacity score is 85, higher than the cluster's average score of 72. The system re-includes node S in the allocatable pool and assigns it an initial weight of 0.3. Over the subsequent five monitoring periods, its error rate consistently remains at 0.008%, below the preset stability threshold of 0.01%. The system therefore increases the weight by 0.05 after each monitoring period, continuing as long as the error rate stays below the threshold, until the weight reaches 1.0, completing node S's full return to the real-time node pool.

[0073] The core idea of this invention is to reconstruct the computing units in a distributed network into autonomous agents with a complete closed-loop capability of perception, decision-making, and execution. During operation, each agent autonomously perceives local indicators. When an anomaly is detected, instead of reporting to a higher level, it triggers a multi-agent collaborative decision-making mechanism, initiating a consensus vote based on trust weights among neighboring agents. This decentralized local consensus mechanism completely eliminates the communication overhead of global probing. After confirming a fault, the agent autonomously performs rate limiting and local topology routing. This bottom-up adaptive collaborative approach enables dynamic backoff and optimization of task migration and communication routing, effectively preventing the cascading spread of local network congestion to the global network.
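The trust-weighted consensus vote mentioned here can be sketched as follows. This is a minimal Python illustration rather than the patented mechanism: normalizing by the total trust weight, expressing each neighbor's evaluation as a suspicion value between 0 and 1, and the threshold of 0.6 are all assumptions made for the example.

```python
def anomaly_confidence(votes):
    """votes is a list of (trust_weight, suspicion), with suspicion in [0, 1].

    Returns the trust-weighted average suspicion (assumed normalization).
    """
    total_trust = sum(weight for weight, _ in votes)
    if total_trust <= 0:
        return 0.0  # no trusted opinions available
    return sum(weight * suspicion for weight, suspicion in votes) / total_trust

def is_faulty(votes, consensus_threshold=0.6):
    """Mark the target node as abnormal when confidence exceeds the (hypothetical) threshold."""
    return anomaly_confidence(votes) > consensus_threshold
```

In this sketch, two highly trusted neighbors reporting an anomaly outweigh one low-trust neighbor reporting normal operation, so a single unreliable voter cannot block or force a fault decision on its own.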

[0074] In summary, this invention achieves dynamic optimization of communication paths and adaptive adjustment of exchange frequency during task remapping, effectively solving the problems of uncontrolled communication overhead and local network congestion caused by task redistribution due to node failure. Furthermore, by performing redundant distribution of critical data across fault isolation domains and combining it with a node weight increment control mechanism driven by the carrying capacity score, this invention enhances the system's disaster recovery capability and ensures the smooth return of faulty nodes to the real-time node pool. This invention effectively improves the robustness, scalability, and resource utilization efficiency of distributed systems under extreme and variable conditions, ensuring the continuity of task processing and the stability of the overall system operation.

[0075] Referring to Figure 2, the second embodiment of the present invention provides a data processing system based on multi-agent distributed collaboration, comprising: a state awareness module, used to acquire the node operation indicators of each intelligent agent and to perform feature fusion processing on the node operation indicators using local node evaluation to obtain a preliminary fault judgment; a fault consensus module, used to perform view exchange processing between adjacent nodes based on the preliminary fault judgment to obtain a state update packet, and to establish a list of faulty nodes using collaborative consensus voting; a node maintenance module, used to dynamically remove and maintain pre-obtained available computing nodes based on the list of faulty nodes, thereby obtaining a real-time node pool; a task remapping module, used to perform consistent mapping processing on the pre-acquired active tasks according to the real-time node pool to obtain a dynamic task distribution; an adaptive routing module, used to obtain link transmission indicators, perform frequency attenuation processing on the communication path corresponding to the dynamic task distribution based on the link transmission indicators to obtain an optimized transmission strategy, and perform topology routing based on the optimized transmission strategy to obtain the communication link and switching plan; and a disaster recovery distribution module, used to perform block encoding and distributed distribution processing on the pre-acquired data to be distributed according to the communication link to obtain the disaster recovery data distribution status.

[0076] It should be noted that the data processing system based on multi-agent distributed collaboration provided in this embodiment of the invention is used to execute all the process steps of the data processing method based on multi-agent distributed collaboration in the above embodiment. The working principles and beneficial effects of the two correspond one-to-one, so they will not be described again.

[0077] It should be noted that the system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the system embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.

[0078] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. In particular, it should be noted that any modifications, equivalent substitutions, improvements, etc., made by those skilled in the art within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A data processing method based on multi-agent distributed collaboration, characterized in that it comprises: obtaining the node operation indicators of each intelligent agent, and performing feature fusion processing on the node operation indicators using local node evaluation to obtain a preliminary fault judgment; based on the preliminary fault judgment, performing view exchange processing between adjacent nodes to obtain a state update packet, performing collaborative consensus voting on the state update packet to obtain abnormal nodes, and merging the abnormal nodes to obtain a list of faulty nodes; based on the list of faulty nodes, dynamically removing and maintaining the pre-obtained available computing nodes to obtain a real-time node pool; based on the real-time node pool, performing consistent mapping processing on the pre-acquired active tasks to obtain a dynamic task distribution; obtaining link transmission indicators, performing frequency attenuation processing on the communication path corresponding to the dynamic task distribution based on the link transmission indicators to obtain an optimized transmission strategy, and performing topology routing based on the optimized transmission strategy to obtain the communication link and switching plan; and, based on the communication link, performing block encoding and distributed distribution processing on the pre-acquired data to be distributed to obtain the disaster recovery data distribution status.

2. The data processing method based on multi-agent distributed collaboration according to claim 1, characterized in that the step of obtaining the node operation indicators of each intelligent agent and performing feature fusion processing on the node operation indicators using local node evaluation to obtain a preliminary fault judgment comprises: extracting network latency data, communication overhead data, and load utilization data, and combining the network latency data, communication overhead data, and load utilization data to obtain the node operation indicators; calculating an exponentially weighted moving average of the network latency data to obtain a smoothed latency feature; inputting the node operation indicators containing the smoothed latency feature into a preset local evaluation model for multi-dimensional feature comprehensive calculation and state prediction processing to obtain a fault probability value; and comparing the fault probability value with a preset warning threshold, and, if the fault probability value is greater than the warning threshold, extracting the corresponding node identifier and abnormal feature dimension to form the preliminary fault judgment.

3. The data processing method based on multi-agent distributed collaboration according to claim 1, characterized in that the step of performing view exchange processing between adjacent nodes based on the preliminary fault judgment to obtain a state update packet comprises: triggering the preset Gossip protocol and randomly selecting neighbor nodes in the distributed system according to the preset infection probability to obtain the target receiving node; sending the local view data, which includes the preliminary fault judgment and the local version number, to the target receiving node; and extracting the peer view data returned by the target receiving node, and using the vector clock algorithm to perform version conflict comparison processing between the local view data and the peer view data to obtain the state update packet.

4. The data processing method based on multi-agent distributed collaboration according to claim 1, characterized in that the step of performing collaborative consensus voting on the state update packet to obtain abnormal nodes and merging the abnormal nodes to obtain a list of faulty nodes comprises: parsing the state update packet to extract the local state evaluation data for the target node sent by neighboring nodes and the corresponding trust weight values; performing a weighted summation of the local state evaluation data based on the trust weight values to obtain the comprehensive anomaly confidence of the target node; comparing the comprehensive anomaly confidence with a preset consensus threshold, and, if the comprehensive anomaly confidence is greater than the preset consensus threshold, marking the target node as an abnormal node; and extracting all abnormal nodes that meet the preset consensus threshold, and deduplicating and merging the abnormal nodes to obtain the list of faulty nodes.

5. The data processing method based on multi-agent distributed collaboration according to claim 1, characterized in that the step of performing consistent mapping processing on the pre-acquired active tasks based on the real-time node pool to obtain a dynamic task distribution comprises: parsing the task key-value features of the active tasks; mapping the computing nodes in the real-time node pool to a preset hash ring using a consistent hashing algorithm, and dynamically adjusting the number of corresponding virtual nodes according to the current load utilization of the computing nodes to obtain a dynamic hash ring; and calculating the hash value of the task key-value features, searching along a preset direction on the dynamic hash ring for the virtual node closest to the hash value, and assigning the active task to the computing node corresponding to that virtual node to obtain the dynamic task distribution.

6. The data processing method based on multi-agent distributed collaboration according to claim 1, characterized in that the step of obtaining link transmission indicators and performing frequency attenuation processing on the communication path corresponding to the dynamic task distribution based on the link transmission indicators to obtain an optimized transmission strategy comprises: extracting the data exchange paths involved in the dynamic task distribution, and using probe packets to calculate the current round-trip delay value of each data exchange path as the link transmission indicator; if the current round-trip delay value is greater than the preset delay threshold, performing a multiplicative attenuation operation on the preset reference switching frequency based on the excess ratio and the preset attenuation factor to obtain the target switching frequency; and replacing the execution frequency parameter of the data exchange path with the target switching frequency to obtain the optimized transmission strategy.

7. The data processing method based on multi-agent distributed collaboration according to claim 6, characterized in that the step of performing topology routing based on the optimized transmission strategy to obtain the communication link and switching plan comprises: extracting the current bandwidth utilization rate of the communication path corresponding to the optimized transmission strategy, and obtaining the link transmission indicator corresponding to the communication path; determining whether the current bandwidth utilization rate is greater than a preset bandwidth threshold; if the current bandwidth utilization rate is not greater than the preset bandwidth threshold, directly determining the current communication path as the communication link; if the current bandwidth utilization rate is greater than the preset bandwidth threshold, searching for an alternative communication path in the global network topology map in combination with the link transmission indicator, and calculating the expected communication overhead of the alternative communication path based on the latency and bandwidth data contained in the link transmission indicator; if the expected communication overhead is less than a preset overhead threshold, determining the alternative communication path as the communication link; and summarizing the routing instructions, which include the target switching frequency and the communication link, to obtain the switching plan.

8. The data processing method based on multi-agent distributed collaboration according to claim 1, characterized in that the step of performing block encoding and distributed distribution processing on the pre-acquired data to be distributed according to the communication link to obtain the disaster recovery data distribution status comprises: obtaining the data to be distributed, using an erasure coding algorithm to slice the data to be distributed into original data blocks, and calculating and generating redundant check blocks based on the original data blocks; extracting the physical location attribute information of the global healthy nodes, and dividing independent fault isolation domains based on the physical location attribute information; and distributing the original data blocks and the redundant check blocks in parallel, using the communication link, to global healthy nodes belonging to different fault isolation domains to obtain the disaster recovery data distribution status.

9. The data processing method based on multi-agent distributed collaboration according to claim 1, characterized in that, after performing block encoding and distributed distribution processing on the pre-acquired data to be distributed according to the communication link to obtain the disaster recovery data distribution status, the method further comprises: obtaining the node recovery signal of an offline node, and extracting the current hardware resource surplus data and performance benchmark test data of the offline node; inputting the hardware resource surplus data and the performance benchmark test data into a pre-built scoring model for calculation to obtain the carrying capacity score; if the carrying capacity score matches the access requirements of the distributed system load, assigning a preset initial weight value to the offline node; and extracting the operational error rate of the offline node within a continuous monitoring period, and, if the operational error rate is continuously lower than a preset stability threshold, increasing the initial weight value by a preset step size until the normal allocation standard is reached, and updating the offline node that has reached the normal allocation standard into the real-time node pool.

10. A data processing system based on multi-agent distributed collaboration, characterized in that it comprises: a state awareness module, used to acquire the node operation indicators of each intelligent agent and to perform feature fusion processing on the node operation indicators using local node evaluation to obtain a preliminary fault judgment; a fault consensus module, used to perform view exchange processing between adjacent nodes based on the preliminary fault judgment to obtain a state update packet, and to establish a list of faulty nodes using collaborative consensus voting; a node maintenance module, used to dynamically remove and maintain pre-obtained available computing nodes based on the list of faulty nodes, thereby obtaining a real-time node pool; a task remapping module, used to perform consistent mapping processing on the pre-acquired active tasks according to the real-time node pool to obtain a dynamic task distribution; an adaptive routing module, used to obtain link transmission indicators, perform frequency attenuation processing on the communication path corresponding to the dynamic task distribution based on the link transmission indicators to obtain an optimized transmission strategy, and perform topology routing based on the optimized transmission strategy to obtain the communication link and switching plan; and a disaster recovery distribution module, used to perform block encoding and distributed distribution processing on the pre-acquired data to be distributed according to the communication link to obtain the disaster recovery data distribution status.
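As an illustration of the consistent mapping step recited in claim 5, the following Python sketch builds a hash ring with load-dependent virtual nodes and assigns tasks by clockwise lookup. It is a sketch under stated assumptions, not the claimed implementation: the SHA-256 based hash, the rule that more heavily loaded nodes receive fewer virtual nodes, and the constants `base=100` and `minimum=10` are all hypothetical choices.

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a string to a 64-bit position on the ring (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

def virtual_node_count(load, base=100, minimum=10):
    """Assumed rule: the higher the load utilization (0..1), the fewer virtual nodes."""
    return max(minimum, round(base * (1.0 - load)))

def build_ring(node_loads):
    """node_loads maps node name to load utilization; returns a sorted list of (position, node)."""
    ring = []
    for node, load in node_loads.items():
        for i in range(virtual_node_count(load)):
            ring.append((ring_hash(f"{node}#{i}"), node))
    ring.sort()
    return ring

def assign(ring, task_key):
    """Walk clockwise from the task's hash to the first virtual node, wrapping at the end."""
    index = bisect.bisect_left(ring, (ring_hash(task_key), ""))
    return ring[index % len(ring)][1]
```

A useful property of this structure is that removing one node from the pool only remaps the tasks that were assigned to that node; tasks on the remaining nodes keep their assignment, which limits migration after a fault.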