A pingmesh-based network situation awareness platform and an implementation method thereof

By using a network situation awareness platform based on pingmesh, and employing task scheduling and health score algorithms to conduct cross-ping tests across the entire network, combined with an intelligent operation and maintenance analysis system, the problem of traditional network monitoring struggling to detect and locate faults in real time in complex environments has been solved, enabling efficient fault detection and recovery.

CN122247840A · Pending · Publication Date: 2026-06-19 · JIANGSU SECURITIES

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGSU SECURITIES
Filing Date
2026-02-11
Publication Date
2026-06-19


Abstract

This invention provides a network situational awareness platform and its implementation method based on pingmesh, belonging to the field of data center network monitoring. It includes: a network situational awareness system that, based on network topology information, executes full-network cross-ping tests through a task scheduling algorithm, constructs a ping test matrix with business areas as the dimension, and performs network IP layer fault analysis; an intelligent operation and maintenance analysis system that receives fault information output by the network situational awareness system, mines fault trends, and diagnoses and analyzes abnormal links; and a business scenario linkage system that, based on the located fault points and combined with preset business scenario models, triggers business layer connectivity and performance tests to investigate business layer faults. This invention achieves minute-level coverage of all areas of the business network and switches by intelligently scheduling distributed probes to perform full-network cross-ping tests, and reduces alarm latency from minutes to seconds.

Description

Technical Field

[0001] This invention belongs to the field of data center network monitoring, specifically relating to a network situation awareness platform based on pingmesh and its implementation method.

Background Technology

[0002] In traditional network monitoring systems, ping and SNMP are two of the most basic network diagnostic technologies. Ping is suitable only for simple connectivity testing, latency detection, and basic troubleshooting; in complex network environments its detection range is limited, and it usually has to be combined with more advanced tools to give a comprehensive view of the network status. SNMP provides basic device monitoring, performance management, and fault alarm capabilities, but it suffers from weak security and from a mismatch between the collection cycle and the data cache update cycle, which easily leads to errors in the collected data that are hard to avoid.

[0003] As network architectures become increasingly complex and scale up, higher demands are placed on real-time perception of the overall operational status of the network. Traditional network monitoring technologies and methods focus primarily on individual resources such as device performance, bandwidth, links, and latency, with monitoring content including device status, traffic, bandwidth usage, and link quality. However, when network devices are operating normally but business network connectivity is compromised by unexpected changes or policy errors, these conventional monitoring methods cannot detect the problem in a timely and effective way. Furthermore, during network jitter or large-scale failures, the massive volume of alarm information is neither aggregated nor visualized, so it cannot support administrators in assessing the scope and extent of business impact in time.

[0004] Therefore, there is an urgent need for a high-efficiency network monitoring solution that can achieve real-time perception, intelligent analysis, and business linkage across the entire network to improve the efficiency of network fault detection, location, and recovery.

Summary of the Invention

[0005] To address the problems existing in the prior art, this invention proposes a network situation awareness platform based on pingmesh and its implementation method. The platform uses a task scheduling algorithm to select the corresponding servers under the access switches in each business area of the data center as probes to conduct cross-ping tests, achieving minute-level coverage of the entire business network and switches. By aggregating and analyzing massive amounts of ping test data, a health score algorithm is used to calculate the health score of the data center business areas and access switches, thereby analyzing the health status of the entire network.

[0006] To solve the above problems, the present invention adopts the following technical solution:

[0007] A network situational awareness platform based on pingmesh includes: a network situational awareness system, which, based on network topology information, organizes distributed probes to perform cross-ping tests across the entire network using a task scheduling algorithm, constructs a ping test matrix with business areas as the dimension, and performs network IP layer fault analysis and preliminary location; an intelligent operation and maintenance analysis system, which receives fault information output by the network situational awareness system, performs diagnostic analysis on abnormal links based on a host automation platform, and mines fault trends by combining historical data; a business scenario linkage system, which, based on the fault points located by the intelligent operation and maintenance analysis system and combined with a preset business scenario model, triggers business layer connectivity and performance tests to investigate business layer faults and achieve fault location; and based on the fault location conclusions, combined with the network and business correlation graph, provides decision-making basis and data foundation for fault recovery.

[0008] Furthermore, the network situational awareness system includes a controller module, a local collector module, and a data service module;

[0009] The controller module is used to select distributed probe nodes based on the network topology information of the data center and the task scheduling algorithm, and generate a list of cross-ping test tasks for the entire network, which is then sent to the local collector module.

[0010] The local collector module, in agentless mode, receives ping test tasks from the controller module, uses the fping tool to perform ping probes by calling the host automation platform interface, and uploads the test results data.

[0011] The data service module receives and aggregates all test result data uploaded by local collector modules, and calculates and records the health scores between each business area based on the preset health score algorithm, thereby obtaining the network situation quality matrix and health status of the business area.

[0012] Furthermore, the specific process of selecting distributed probe nodes and creating a task list for cross-ping testing across the entire network based on the task scheduling algorithm is as follows:

[0013] For each access switch in the data center, randomly select 2 to 5 servers connected to it as source probe nodes for the current test period.

[0014] In each ping test cycle, an IP address is randomly selected as the destination address for each access switch, forming the destination address list for the current single probe;

[0015] Based on the maximum number of concurrent tasks preset for each source probe node, the destination address list is grouped, and each group randomly selects an IP address under each access switch for probing.
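The selection and grouping steps in [0013] to [0015] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_pinglist`, the topology shape (a mapping from access switch to the server IPs under it), and the task dictionary keys are hypothetical, and "grouping" is read here as splitting the destination list into chunks no larger than the concurrency cap, which is only one possible reading of [0015].

```python
import random
from typing import Dict, List, Optional


def build_pinglist(
    topology: Dict[str, List[str]],
    max_concurrent: int = 2,
    min_probes: int = 2,
    max_probes: int = 5,
    rng: Optional[random.Random] = None,
) -> List[dict]:
    """Build one test cycle's task list from {access_switch: [server_ip, ...]}."""
    rng = rng or random.Random()
    tasks: List[dict] = []
    for switch, servers in topology.items():
        # Pick 2 to 5 source probes under this switch (fewer if it has fewer servers).
        k = min(len(servers), rng.randint(min_probes, max_probes))
        for source in rng.sample(servers, k):
            # One randomly chosen destination IP under every access switch.
            destinations = [rng.choice(ips) for ips in topology.values() if ips]
            # Assumed reading of "grouping": chunks no larger than the concurrency cap.
            for i in range(0, len(destinations), max_concurrent):
                tasks.append({
                    "source": source,
                    "source_switch": switch,
                    "targets": destinations[i:i + max_concurrent],
                })
    return tasks
```

Passing a seeded `random.Random` makes a cycle reproducible for debugging, while the default unseeded generator gives a fresh random selection each test cycle, which is what lets repeated cycles spread coverage across all servers over time.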

[0016] Furthermore, the health score algorithm preset in the data service module is as follows:

[0017] By combining the packet loss rate and latency of each link detected in each business area with preset thresholds, it is determined whether a single detected link is in a normal or abnormal state.

[0018] The health score for a single business area is calculated as follows:

[0019] For host-level and switch-level faults:

[0020] ,

[0021] For regional-level faults:

[0022] ,

[0023] Among them, the upper and lower limits of the regional fault level health score are preset according to the nature of the fault;

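[0024] The overall network health score is the minimum among all regional health scores, i.e.

[0025] $H_{\text{network}} = \min_{i} H_{i}$, where $H_{i}$ is the health score of the $i$-th business area.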
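The two parts of the algorithm that the text states fully, the per-link threshold check in [0017] and the minimum rule in [0024], can be sketched as below. The per-area formulas in [0020] and [0022] are not reproduced in this text, so the sketch takes area scores as given inputs. The function names, the default threshold values, and the choice that a link is normal only when both loss and latency are within their thresholds are assumptions made for illustration.

```python
from typing import Dict


def link_is_normal(
    loss_pct: float,
    latency_ms: float,
    max_loss_pct: float = 1.0,      # hypothetical threshold
    max_latency_ms: float = 50.0,   # hypothetical threshold
) -> bool:
    """A link counts as normal only if loss and latency are both within thresholds."""
    return loss_pct <= max_loss_pct and latency_ms <= max_latency_ms


def overall_health(area_scores: Dict[str, float]) -> float:
    """Overall network health is the minimum of the per-area health scores."""
    if not area_scores:
        raise ValueError("at least one area score is required")
    return min(area_scores.values())
```

Taking the minimum rather than an average means a single badly degraded area cannot be masked by many healthy ones.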

[0026] Furthermore, after identifying abnormal links through the health score algorithm, the network situational awareness system immediately initiates an MTR diagnostic request. By calling the host automation platform interface, the host automation platform performs MTR path tracing diagnosis to obtain diagnostic results containing fault node information. Based on the diagnostic results, the intelligent operation and maintenance analysis system combines historical data to mine and deeply analyze fault trends.

[0027] Furthermore, the network situation awareness system also includes a visualization module, which displays changes in the network situation quality matrix of the business area, changes in the health score trend, and MTR path tracing diagnostic results.

[0028] Furthermore, when the visualization module displays changes in the network status quality matrix of the service area, it uses different cell colors to represent the network status of the target service area, and the color of each cell is determined by the percentage of normal links between areas, i.e.:

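[0029] $P = \dfrac{N_{\text{normal}}}{N_{\text{total}}} \times 100\%$,

[0030] where $P$ is the percentage of normal links between the two areas, $N_{\text{normal}}$ is the number of probed links judged normal, and $N_{\text{total}}$ is the total number of probed links between them. When $P$ is between 90% and 100%, the cell is displayed in green; when it is below 90%, it is displayed in yellow.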

[0031] A large area of green cells in the matrix indicates that network access is normal.

[0032] When a single, isolated yellow cell appears in the matrix, it is determined to be a switch-level fault.

[0033] If a matrix shows a series of consecutive yellow cells in a vertical direction for a specific business area, it is considered a business area-level fault.
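The coloring rule and the two matrix patterns described in [0028] to [0033] can be sketched as follows. This is an illustration only: the function names, the convention that `colors[r][c]` is the cell for source area `r` probing target area `c`, and the choice of two consecutive yellow cells as the minimum "series" are assumptions. For simplicity the sketch inspects each column on its own and does not check horizontal neighbors when deciding that a yellow cell is isolated.

```python
from typing import Dict, List


def cell_color(normal: int, total: int, green_min_pct: float = 90.0) -> str:
    """Green when the share of normal links is at least green_min_pct, else yellow."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = 100.0 * normal / total
    return "green" if pct >= green_min_pct else "yellow"


def classify_matrix(colors: List[List[str]], areas: List[str]) -> Dict[str, str]:
    """Label each target area (column) that contains yellow cells."""
    findings: Dict[str, str] = {}
    for c, area in enumerate(areas):
        yellow_rows = [r for r in range(len(colors)) if colors[r][c] == "yellow"]
        if not yellow_rows:
            continue
        # Length of the longest run of vertically consecutive yellow cells.
        longest = run = 0
        prev = None
        for r in yellow_rows:
            run = run + 1 if prev is not None and r == prev + 1 else 1
            longest = max(longest, run)
            prev = r
        # Assumed cut-off: two or more consecutive cells count as a "series".
        findings[area] = "area-level" if longest >= 2 else "switch-level"
    return findings
```

For example, a column that is yellow in every row is reported as `area-level`, while a column with one yellow cell is reported as `switch-level`.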

[0034] This invention also protects a method for implementing a network situational awareness platform based on pingmesh, comprising the following steps:

[0035] Step 1: Based on the network topology information of the data center, select distributed probe nodes based on the task scheduling algorithm and generate a list of cross-ping test tasks for the entire network.

[0036] Step 2: Receive the ping test task issued by the controller module through the local collector module, call the host automation platform interface to perform ping detection, and upload the test result data;

[0037] Step 3: Aggregate all uploaded test result data and calculate the health score of each service area and the overall network based on the health score algorithm;

[0038] Step 4: For the abnormal links identified in Step 3, perform MTR path tracing diagnosis by calling the host automation platform interface to obtain diagnostic results containing fault node information.

[0039] Step 5: Visualize the changes in the network quality matrix status, the trend changes in health score, and the diagnostic results of MTR path tracing in the business area, and trigger graded alarms in real time based on the matrix anomaly pattern or health score threshold.

[0040] Step 6: Based on network alarm information, automatically trigger connectivity and performance tests at the business layer to confirm the scope of impact on the business layer;

[0041] Step 7: Integrate network layer diagnostic results with service layer test results to output accurate fault location conclusions; based on the fault location conclusions, combine the network and service correlation graph to provide decision-making basis and data foundation for fault recovery.
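Steps 6 and 7 rely on a network and service correlation graph whose structure the text does not specify. One minimal way to picture the lookup, assuming the graph is stored as a mapping from a business scenario name to the network nodes it depends on (a hypothetical representation, as is the function name), is:

```python
from typing import Dict, List, Set


def affected_scenarios(fault_nodes: Set[str], graph: Dict[str, List[str]]) -> List[str]:
    """Return the business scenarios that depend on at least one faulty node."""
    return sorted(name for name, deps in graph.items() if fault_nodes & set(deps))
```

The returned scenario names would then be the candidates for the business layer connectivity and performance tests triggered in Step 6.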

[0042] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0043] This invention utilizes task scheduling and distributed probes to perform cross-ping tests across the entire network, achieving minute-level coverage of all service network areas and switches, and reducing alarm latency from minutes to seconds. By constructing a network situation matrix and health scoring algorithm at the service area level, it can quickly distinguish between host-level, switch-level, and region-level faults, and accurately locate faulty nodes using MTR path diagnosis. The intelligent operation and maintenance analysis system mines fault trends based on historical data and provides in-depth diagnosis, improving the automation and intelligence of fault root cause analysis. Through a service scenario linkage system, it achieves automatic mapping and verification from network layer faults to service layer impacts, solving the problem of difficulty in assessing the service impact in change, drill, and fault scenarios. Based on the network and service correlation graph, it provides data-driven decision-making basis for fault recovery, contributing to continuous optimization of network quality.

Attached Figure Description

[0044] Figure 1 is a schematic diagram of the pingmesh-based network situation awareness platform provided by the present invention;

[0045] Figure 2 is a pingmesh scenario diagram provided by the present invention;

[0046] Figure 3 is an architecture diagram of the network situation awareness system provided by the present invention;

[0047] Figure 4 is the data retrieval flowchart of the network situation awareness system provided by the present invention;

[0048] Figure 5 is a diagram illustrating the selection strategy for source and destination nodes in pingmesh testing provided by the present invention;

[0049] Figure 6 shows the visualization results of the network situation awareness quality matrix provided by the present invention.

Detailed Implementation

[0050] To make the technical solution of the present invention clearer, the technical solution of the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.

[0051] As shown in Figure 1, this invention provides a network situational awareness platform based on pingmesh, comprising: a network situational awareness system, which, based on network topology information, organizes distributed probes to perform cross-ping tests across the entire network using a task scheduling algorithm, constructs a ping test matrix with business areas as the dimension, and performs network IP layer fault analysis and preliminary location; an intelligent operation and maintenance analysis system, which receives fault information output by the network situational awareness system, performs diagnostic analysis on abnormal links based on a host automation platform, and mines fault trends by combining historical data; a business scenario linkage system, which, based on the fault points located by the intelligent operation and maintenance analysis system and combined with a preset business scenario model, triggers business layer connectivity and performance tests to investigate business layer faults and achieve fault location; and based on the fault location conclusions, combined with the network and business correlation graph, provides decision-making basis and data foundation for fault recovery.

[0052] Specifically, as shown in Figures 2-4, the network situational awareness system includes a controller module, a local collector module, a data service module, and a visualization module, where:

[0053] The controller module, based on the network topology information of the data center, randomly selects 2 to 5 downstream servers as source probe nodes for each access switch in the data center; the server information is obtained by combining CMDB data with ARP and MAC address table scanning.

[0054] In each ping test cycle, an IP address is randomly selected for each access switch as the destination address, forming the destination address IP list for the current single probe;

[0055] Based on the maximum number of concurrent tasks preset for each source probe node, the list of destination IP addresses is grouped, and each group randomly selects one IP address under each access switch for probing, as shown in Figure 5. This generates the full-network cross-ping test task list (pinglist), which is sent to the local collector module;

[0056] The local collector module, operating in agentless mode, receives ping test tasks from the controller module. It then uses the fping detection tool to perform ping tests on all physical or virtual machines on the network by calling the host automation platform interface, and uploads the test results to Kafka. Table 1 compares the execution efficiency of the ping and fping detection tools, and Table 2 compares their impact on host performance. The comparison shows that fping, as a detection tool, is more efficient, consumes less CPU and memory, and has a smaller impact on host performance.
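To make the fping step concrete, the sketch below builds an fping command line and parses the per-host summary that fping prints when run with `-q -c N`. The host automation platform interface is not described in this text, so the sketch does not model how the command is dispatched; the function names and the result dictionary keys are hypothetical, and the probe count of 3 is an arbitrary illustration.

```python
import re
from typing import Dict, List, Optional

# Matches fping summary lines such as:
#   10.0.0.1 : xmt/rcv/%loss = 3/3/0%, min/avg/max = 0.45/0.52/0.61
#   10.0.0.2 : xmt/rcv/%loss = 3/0/100%
FPING_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*xmt/rcv/%loss = "
    r"(?P<xmt>\d+)/(?P<rcv>\d+)/(?P<loss>\d+(?:\.\d+)?)%"
    r"(?:, min/avg/max = (?P<min>[\d.]+)/(?P<avg>[\d.]+)/(?P<max>[\d.]+))?"
)


def fping_command(targets: List[str], count: int = 3) -> List[str]:
    """Command line for a quiet fping run sending `count` probes per target."""
    return ["fping", "-q", "-c", str(count), *targets]


def parse_fping_summary(text: str) -> Dict[str, Dict[str, Optional[float]]]:
    """Extract loss percentage and average latency per host from fping's summary."""
    results: Dict[str, Dict[str, Optional[float]]] = {}
    for line in text.splitlines():
        m = FPING_LINE.match(line.strip())
        if not m:
            continue
        results[m["host"]] = {
            "loss_pct": float(m["loss"]),
            # No latency figures are printed when every probe was lost.
            "avg_ms": float(m["avg"]) if m["avg"] else None,
        }
    return results
```

Because fping probes many targets in one invocation, a single command per group of destinations is enough, which is consistent with the efficiency advantage over per-target ping that this paragraph reports.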

[0057] Table 1

[0058]

[0059] Table 2

[0060]

[0061] The data service module receives and aggregates all test result data uploaded by local collector modules, calculates and records the health score between different service areas in the network based on the health score algorithm, specifically:

[0062] By combining the packet loss rate and latency of each link detected in each business area with preset thresholds, it is determined whether a single detected link is in a normal or abnormal state.

[0063] Table 3 shows the method for dividing the upper and lower limits of the health score for regional fault levels and the calculation method of the health score. The fault level of a single region is divided into host-level faults, and its health score range is as follows: This indicates that all switches in the area are functioning normally; for switch-level faults, the health score range is... A score of 0 indicates that some switches in the area are faulty, with a fault rate greater than 50%; a regional fault, with a health score of 0, indicates that all switches in the area are faulty.

[0064] The specific calculation method for the health score of a single business area is as follows:

[0065] For host-level and switch-level failures:

[0066] ,

[0067] For regional-level faults:

[0068] ,

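[0069] The overall network health score is the minimum among all regional health scores, i.e.

[0070] $H_{\text{network}} = \min_{i} H_{i}$, where $H_{i}$ is the health score of the $i$-th business area;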

[0071] When the data service module identifies a link as being in an abnormal state, it immediately initiates the MTR diagnostic process. The local collector module calls the host automation platform interface, which then executes the MTR path tracing diagnostic task and returns the diagnostic results containing information about the faulty nodes. The data service module receives the diagnostic results and stores them in the database. At the same time, the diagnostic results are sent to the visualization module, which displays the path tracing diagnostic results of the abnormal link in a visual interface. The visualization module also displays changes in the network status quality matrix and health score trends in the business area.
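The text does not say how the fault node is read off the MTR result. A common heuristic when interpreting MTR output, shown here as an assumption rather than as the patented method, is to pick the earliest hop from which packet loss persists all the way to the destination, since loss at an intermediate hop that disappears further along the path usually reflects ICMP rate limiting rather than a real fault. The `Hop` tuple layout, the function name, and the threshold are hypothetical.

```python
from typing import List, Optional, Tuple

Hop = Tuple[int, str, float]  # (hop number, host, loss percent)


def first_faulty_hop(hops: List[Hop], loss_threshold: float = 1.0) -> Optional[Hop]:
    """Return the earliest hop from which loss stays above the threshold to the end."""
    suspect: Optional[Hop] = None
    for hop in hops:
        if hop[2] > loss_threshold:
            if suspect is None:
                suspect = hop
        else:
            # Loss did not persist downstream, so the earlier hop is not the fault.
            suspect = None
    return suspect
```

A return value of `None` means no hop showed loss that carried through to the destination.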

[0072] Table 3

[0073]

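[0074] As shown in Figure 6, when the visualization module displays changes in the network quality matrix of a service area, it uses different cell colors to represent the network status of the target service area, and the color of each cell is determined by the percentage of normal links between service areas:

[0075] $P = \dfrac{N_{\text{normal}}}{N_{\text{total}}} \times 100\%$,

[0076] where $N_{\text{normal}}$ is the number of probed links judged normal and $N_{\text{total}}$ is the total number of probed links between the two areas; a value of $P$ from 90% to 100% is displayed in green, and a value below 90% is displayed in yellow;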

[0077] A large area of green cells in the matrix indicates that network access is normal.

[0078] When a single, isolated yellow cell appears in the matrix, it is determined to be a switch-level fault.

[0079] If a matrix shows a series of consecutive yellow cells in a vertical direction for a specific business area, it is considered a business area-level fault.

[0080] In one specific embodiment, the network situational awareness system provided by this invention is used to calculate the network health score at the city level. Table 4 specifically shows the upper and lower limits of the city-level health score as defined by the third-party organizations, as well as the calculation method for the health score. The fault level of a single city is classified as host-level fault, and its health score range is as follows: This indicates that all communication lines within the city are functioning normally; for line-level faults, the health score range is... A score of 0 indicates that some communication lines in the city are faulty, with a fault rate greater than 50%; a city-level fault, with a health score of 0, indicates that all communication lines in the city are faulty.

[0081] The specific calculation methods for health scores in each city are as follows:

[0082] For host-level faults and line-level faults:

[0083] ,

[0084] For city-level faults:

[0085] ,

[0086] The overall network health score is the minimum among all city health scores, i.e.

[0087] $H_{\text{network}} = \min_{i} H_{i}$, where $H_{i}$ is the health score of the $i$-th city;

[0088] Table 4

[0089]

[0090] When the visualization module displays changes in the city's network quality matrix, different line colors represent the network status of the target city, and the line color is determined by the percentage of normal links.

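[0091] $P = \dfrac{N_{\text{normal}}}{N_{\text{total}}} \times 100\%$, where $N_{\text{normal}}$ is the number of links judged normal and $N_{\text{total}}$ is the total number of links probed.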

[0092] When the percentage of normal links is between 50% and 100%, it is displayed in green; when it is below 50%, it is displayed in yellow.

[0093] In another specific embodiment, the implementation method of the network situation awareness platform based on pingmesh provided by the present invention includes the following steps:

[0094] Step 1: Based on the network topology information of the data center, select distributed probe nodes based on the task scheduling algorithm and generate a list of cross-ping test tasks for the entire network.

[0095] Step 2: Receive the ping test task issued by the controller module through the local collector module, call the host automation platform interface to perform ping detection, and upload the test result data;

[0096] Step 3: Aggregate all uploaded test result data and calculate the health score of each service area and the overall network based on the health score algorithm;

[0097] Step 4: For the abnormal links identified in Step 3, perform MTR path tracing diagnosis by calling the host automation platform interface to obtain diagnostic results containing fault node information.

[0098] Step 5: Visualize the changes in the network quality matrix status, the trend changes in health score, and the diagnostic results of MTR path tracing in the business area, and trigger graded alarms in real time based on the matrix anomaly pattern or health score threshold.

[0099] Step 6: Based on network alarm information, automatically trigger connectivity and performance tests at the business layer to confirm the scope of impact on the business layer;

[0100] Step 7: Integrate network layer diagnostic results with service layer test results to output accurate fault location conclusions; based on the fault location conclusions, combine the network and service correlation graph to provide decision-making basis and data foundation for fault recovery.

[0101] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modifications, equivalent substitutions, or improvements made by those skilled in the art within the scope of the technology disclosed in this invention, based on the technical solution and concept of the present invention, should be included within the protection scope of this invention. Therefore, the protection scope of this invention should be determined by the scope of the claims.

Claims

1. A network situational awareness platform based on pingmesh, characterized in that it comprises: a network situation awareness system, which, based on network topology information, organizes distributed probes to perform cross-ping tests across the entire network through a task scheduling algorithm, constructs a ping test matrix with business areas as the dimension, and performs fault analysis and preliminary location of network IP layer faults; an intelligent operation and maintenance analysis system, which receives fault information output by the network situation awareness system, performs diagnostic analysis on abnormal links based on a host automation platform, and mines fault trends by combining historical data; and a business scenario linkage system, which, based on the fault points located by the intelligent operation and maintenance analysis system and combined with a preset business scenario model, triggers business layer connectivity and performance tests to investigate business layer faults and achieve fault location, and which, based on the fault location conclusions and combined with the network and business correlation graph, provides a decision-making basis and data foundation for fault recovery.

2. The network situation awareness platform based on pingmesh according to claim 1, characterized in that, The network situation awareness system includes a controller module, a local data acquisition module, and a data service module; The controller module is used to select distributed probe nodes based on the network topology information of the data center and the task scheduling algorithm, and generate a list of cross-ping test tasks for the entire network, which is then sent to the local collector module. The local collector module, in agentless mode, receives ping test tasks from the controller module, uses the fping tool to perform ping probes by calling the host automation platform interface, and uploads the test results data. The data service module receives and aggregates all test result data uploaded by local collector modules, and calculates and records the health scores between each business area based on the preset health score algorithm, thereby obtaining the network situation quality matrix and health status of the business area.

3. A network situation awareness platform based on pingmesh according to claim 2, characterized in that, The specific process of selecting distributed probe nodes and creating a task list for cross-network ping testing based on the task scheduling algorithm is as follows: For each access switch in the data center, randomly select 2 to 5 servers connected to it as source probe nodes for the current test period. In each ping test cycle, an IP address is randomly selected as the destination address for each access switch, forming the destination address list for the current single probe; Based on the maximum number of concurrent tasks preset for each source probe node, the destination address list is grouped, and each group randomly selects an IP address under each access switch for probing.

4. A network situation awareness platform based on pingmesh according to claim 2, characterized in that, The health score algorithm preset in the data service module is as follows: By combining the packet loss rate and latency of each link detected in each business area with preset thresholds, it is determined whether a single detected link is in a normal or abnormal state. The health score for a single business area is calculated as follows: For host-level and switch-level faults: , For regional-level faults: , Among them, the upper and lower limits of the regional fault level health score are preset according to the nature of the fault; The overall network health score is the minimum among all regional health scores, i.e. .

5. A network situation awareness platform based on pingmesh according to claim 4, characterized in that, After identifying abnormal links through the health score algorithm, the network situational awareness system immediately initiates an MTR diagnostic request. By calling the host automation platform interface, the host automation platform performs MTR path tracing diagnosis to obtain diagnostic results containing fault node information. The intelligent operation and maintenance analysis system, based on the diagnostic results and combined with historical data, performs fault trend mining and in-depth analysis.

6. A network situational awareness platform based on pingmesh according to claim 5, characterized in that, The network situation awareness system also includes a visualization module, which displays changes in the network situation quality matrix of the business area, changes in the health score trend, and MTR path tracing diagnostic results.

7. A network situational awareness platform based on pingmesh according to claim 6, characterized in that, when the visualization module displays changes in the network quality matrix of a service area, it uses different cell colors to represent the network status of the target service area, and the color of each cell is determined by the percentage of normal links between areas, $P = \dfrac{N_{\text{normal}}}{N_{\text{total}}} \times 100\%$, where $N_{\text{normal}}$ is the number of links judged normal and $N_{\text{total}}$ is the total number of links probed between the two areas. When $P$ is between 90% and 100%, the cell is displayed in green; when it is below 90%, it is displayed in yellow. A large area of green cells in the matrix indicates that network access is normal. When a single, isolated yellow cell appears in the matrix, it is determined to be a switch-level fault. If the matrix shows a series of consecutive yellow cells in the vertical direction for a specific business area, it is considered a business area-level fault.

8. A method for implementing a network situational awareness platform based on pingmesh, characterized in that, based on the network situation awareness platform based on pingmesh as described in any one of claims 2 to 7, the method includes the following steps: Step 1: Based on the network topology information of the data center, select distributed probe nodes based on the task scheduling algorithm and generate a list of cross-ping test tasks for the entire network; Step 2: Receive the ping test task issued by the controller module through the local collector module, call the host automation platform interface to perform ping detection, and upload the test result data; Step 3: Aggregate all uploaded test result data and calculate the health score of each service area and the overall network based on the health score algorithm; Step 4: For the abnormal links identified in Step 3, perform MTR path tracing diagnosis by calling the host automation platform interface to obtain diagnostic results containing fault node information; Step 5: Visualize the changes in the network quality matrix status, the trend changes in health score, and the diagnostic results of MTR path tracing in the business area, and trigger graded alarms in real time based on the matrix anomaly pattern or health score threshold; Step 6: Based on network alarm information, automatically trigger connectivity and performance tests at the business layer to confirm the scope of impact on the business layer; Step 7: Integrate network layer diagnostic results with service layer test results to output accurate fault location conclusions; based on the fault location conclusions, combine the network and service correlation graph to provide decision-making basis and data foundation for fault recovery.