A high-performance network acceptance system and method for a data center
By employing a layered architecture and multi-protocol adaptation technology, the system addresses the issues of comprehensiveness, automation, and intelligence in network acceptance testing for intelligent computing data centers. This enables efficient and accurate network acceptance testing, supports multiple protocols and hardware from multiple vendors, and meets the high reliability requirements of intelligent computing data centers.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNICOM DIGITAL TECHNOLOGY CO LTD
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-26
Figure CN121864641B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data center network testing technology, and in particular to a high-performance network acceptance system and method for intelligent computing data centers.
Background Technology
[0002] With the rapid development of artificial intelligence and cloud computing technologies, intelligent computing data centers have become a core component of the next generation of information infrastructure. Intelligent computing data centers, by deploying high-performance computing clusters, GPU servers, and high-speed interconnect networks, support large-scale model training, inference, and scientific computing tasks, placing extremely high demands on network bandwidth, latency, stability, and scalability.
[0003] Intelligent computing data center networks differ significantly from traditional cloud computing networks in three main respects. First, the demands of high-performance computing: AI training tasks require thousands of GPU cards to work together, with continuous parameter synchronization between computing nodes, which makes them far more sensitive to network latency, bandwidth, and packet loss than traditional services. Second, the multi-rail network architecture: GPU servers are typically equipped with eight or more network cards, connected respectively to parameter networks, storage networks, and service networks, forming a complex topology that traditional network configuration and route planning methods struggle to handle. Third, the diversity of technical approaches: intelligent computing networks fall into two major camps, InfiniBand (IB) and Ethernet, with Ethernet achieving high-performance transmission through the RoCEv2 protocol, and this split increases the complexity of network acceptance testing.
[0004] Existing data center network acceptance methods are primarily designed for traditional IT data centers and have significant shortcomings:
[0005] 1. The acceptance indicator system is too narrow: it covers only basic transmission parameters such as bandwidth, latency, and packet loss rate, and omits core indicators of intelligent computing networks such as low jitter, high concurrent connection count, queue scheduling accuracy, and RDMA communication latency stability. It also lacks dedicated tests for collective communication patterns such as AllReduce and Broadcast, so it cannot reflect network performance under real business loads.
[0006] 2. Limited automation: Traditional methods rely on manual configuration of test nodes, execution of test processes, and analysis of data, which is inefficient and prone to human error; semi-automated systems have fixed test cases and cannot dynamically adapt to the flexible and ever-changing network topology and multi-vendor equipment compatibility requirements of intelligent computing data centers.
[0007] 3. Lack of an intelligent judgment mechanism: existing methods output only raw test data and offer no systematic analysis. Testers must manually compare large volumes of data to determine the acceptance result, which is time-consuming and prone to missed problems, and cannot meet the high reliability and traceability requirements of intelligent computing data centers.
[0008] Therefore, there is an urgent need for a high-performance network acceptance system and method for intelligent computing data centers to achieve standardization, automation and intelligence in the acceptance process.
Summary of the Invention
[0009] To address the shortcomings of existing technologies, the present invention aims to provide a high-performance network acceptance method and system for intelligent computing data centers, solving the problems that existing acceptance approaches cover too few indicators, have a low level of automation, and lack an intelligent judgment mechanism.
[0010] To achieve the above objectives, the technical solution of the present invention is implemented as follows:
[0011] A high-performance network acceptance system for intelligent computing data centers adopts a layered architecture that integrates a hardware layer and core software modules. The hardware layer includes an acceptance control node, a test terminal node, and a data acquisition probe.
[0012] The core software modules are integrated into the acceptance control node and include a multi-protocol adaptation indicator system module, a real-time analysis module, and a result output module, connected in sequence by signal links.
[0013] The multi-protocol adaptation indicator system module is used to automatically identify network protocol types, match manufacturer hardware models, load the corresponding configuration verification rules and acceptance thresholds, and output three standardized sets of test indicators: parameter plane, sample plane, and collective communication.
[0014] The real-time analysis module is used to receive active test data and passive monitoring data, perform dual-mode data consistency verification, indicator threshold comparison, anomaly identification and preliminary root cause location, and output analysis results; the result output module is used to summarize test data, analysis conclusions, network topology and anomalies, generate a standardized acceptance report and output optimization suggestions; wherein, the analysis conclusions are generated based on the analysis results.
[0015] The acceptance control node is used to call the multi-protocol adaptation indicator system module to complete the indicator and threshold configuration, send test tasks to the test terminal node, send monitoring instructions to the data acquisition probe, receive and integrate active test data and passive monitoring data, and forward them to the real-time analysis module for compliance judgment and anomaly location.
[0016] The test terminal node is deployed within the computing power node and is used to perform configuration checks, connectivity tests, bandwidth latency tests, and aggregated communication algorithm tests on the parameter plane and sample plane, and to upload the active test data to the acceptance control node.
[0017] The data acquisition probe is connected in series in the network link to capture real traffic of the RoCE / IB protocol, analyze latency fluctuations, packet loss rate, and ECN tag status, form passive monitoring data and upload it to the acceptance control node to verify the authenticity of the active test data.
[0018] Furthermore, the system also includes a parameter plane network and a sample plane network. The parameter plane network is used for parameter synchronization communication between GPUs across machines; the sample plane network is used for sample / model transfer between GPU nodes and AI native storage. The collective communication algorithm test is based on the OpenMPI framework, executes the all_reduce, all_to_all, broadcast, reduce, reduce_scatter, and scatter communication operations, and records bus bandwidth values at three levels: single machine, dual machine, and cluster.
[0019] Furthermore, the multi-protocol adaptation indicator system module includes a protocol identification unit, a vendor adaptation unit, an indicator library unit, and a threshold configuration unit. The protocol identification unit is used to automatically identify IB or RoCE protocols and switch the corresponding parsing and testing strategies. The vendor adaptation unit has a built-in hardware matching library of mainstream GPU and network card manufacturers. The indicator library unit stores configuration verification rules. The threshold configuration unit is used to output customized acceptance thresholds corresponding to protocols, manufacturers, and hardware models, so that the system is compatible with both IB and RoCE protocols and hardware from multiple manufacturers.
[0020] Furthermore, the test terminal node performs parameter plane and sample plane tests using the perftest tool, sends RDMA data packets based on the RoCE protocol, and tests and records one-way latency and bidirectional bandwidth data.
[0021] Furthermore, the data acquisition probe captures link traffic through port mirroring, performs protocol parsing on the data packets, and extracts latency fluctuation range, frame loss rate, and ECN tag status as passive monitoring data.
[0022] This invention also discloses a high-performance network acceptance method for intelligent computing data centers, based on the above-described system, comprising the following steps:
[0023] Preprocessing: The acceptance control node automatically discovers the network topology via the LLDP protocol; the data acquisition probe completes protocol adaptation initialization, supporting traffic capture and parsing for the corresponding protocol;
[0024] Protocol and metric configuration: configure test metrics and thresholds along three dimensions: parameter plane, sample plane, and collective communication;
[0025] Layered automated testing: execute parameter plane testing, sample plane testing, and collective communication testing in parallel or in sequence;
[0026] Real-time analysis: receive active test data and passive monitoring data, perform consistency checks and threshold comparisons, and mark anomalies;
[0027] Closed-loop evaluation: output optimization suggestions for anomalies and generate a standardized acceptance report.
[0028] Furthermore, the parameter plane and sample plane metrics include: network interface card rate negotiation mode, PFC priority configuration, DSCP flag, ECN enable status, latency, bandwidth, connectivity, and tenant isolation; the collective communication metrics include: bus bandwidth values for single-machine, dual-machine, and cluster-level operations.
[0029] Furthermore, the parameter plane and sample plane tests include: calling configuration checking tools to remotely read network card parameters of computing and storage nodes, and verifying the consistency of PFC, DSCP, and ECN configurations; performing end-to-end connectivity ping tests and tenant isolation tests; and sending RDMA data packets through the perftest tool to test bandwidth and one-way latency.
[0030] Furthermore, the collective communication test includes: single-machine test: performing collective communication operations within a single computing server and recording latency and bandwidth; dual-machine test: performing cross-node collective communication between two GPU servers of the same model to verify inter-node collaboration performance; and cluster test: performing large-scale collective communication within an intelligent computing cluster to evaluate consistency and stability.
[0031] Furthermore, real-time analysis employs a complementary verification method using both active test data and passive monitoring data; the acceptance report output by the closed-loop evaluation includes: indicators for each link / device, network topology diagram, anomalies, and optimization suggestions.
[0032] Beneficial effects: The present invention provides comprehensive acceptance coverage. It constructs three test dimensions, "parameter plane - sample plane - collective communication", covering the basic transmission indicators, high-level configuration indicators, and business scenario indicators of intelligent computing networks, while supporting both IB and RoCE protocols and adapting to hardware devices from multiple vendors, which solves the problem of overly narrow traditional acceptance indicators.
[0033] This invention automates the acceptance process: through centralized task orchestration and distributed test execution mechanisms, it achieves full automation from topology discovery, indicator configuration, test execution to report generation, without the need for manual intervention, thereby improving acceptance efficiency and avoiding human error.
[0034] This invention ensures the accuracy of acceptance results: it adopts a dual-mode verification logic of "active testing + passive monitoring", captures real traffic data through data acquisition probes, and verifies it in conjunction with active test data to ensure the authenticity of the acceptance results; at the same time, it establishes an intelligent judgment mechanism to automatically mark abnormal items and output optimization suggestions, so as to realize the traceability and closed-loop optimization of acceptance results.
Attached Figure Description
[0035] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
[0036] Figure 1 is a schematic diagram of the basic structure of the high-performance network acceptance system for intelligent computing data centers according to an embodiment of the present invention;
[0037] Figure 2 is a flowchart of the high-performance network acceptance method for intelligent computing data centers according to an embodiment of the present invention;
[0038] Figure 3 is a flowchart of the parameter plane and sample plane test in the layered automated testing phase of the high-performance network acceptance method for intelligent computing data centers according to an embodiment of the present invention;
[0039] Figure 4 is a flowchart of the collective communication test in the layered automated testing phase of the high-performance network acceptance method for intelligent computing data centers according to an embodiment of the present invention.
Detailed Implementation
[0040] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other.
[0041] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.
[0042] Embodiment 1
[0043] Referring to Figure 1, a high-performance network acceptance system for intelligent computing data centers adopts a layered architecture integrating a hardware layer and core software modules. The hardware layer includes an acceptance control node, a test terminal node, and a data acquisition probe. The core software modules are integrated into the acceptance control node and include a multi-protocol adaptation indicator system module, a real-time analysis module, and a result output module, connected in sequence by signal links.
[0044] The multi-protocol adaptation indicator system module has functions such as automatic protocol identification, vendor hardware adaptation, and indicator rule management. It includes a protocol identification unit, a vendor adaptation unit, an indicator library unit, and a threshold configuration unit. It can automatically match the corresponding configuration verification rules and acceptance thresholds according to the network link protocol type (IB / RoCE) and GPU / network card manufacturer model, and output a unified indicator set for test execution and result judgment.
[0045] The real-time analysis module receives active test data uploaded by the test terminal node and passive monitoring data uploaded by the data acquisition probe. It performs dual-mode data complementary verification, compares the measured values with preset thresholds item by item, automatically marks abnormal items and completes preliminary anomaly localization, and pushes the analysis results to the result output module in real time.
[0046] The result output module is used to summarize test data, analysis conclusions, network topology diagrams and anomalies, automatically generate standardized acceptance reports, and output actionable optimization suggestions for anomalies, forming a closed loop of acceptance, analysis, and optimization; wherein the analysis conclusions are generated based on the analysis results.
[0047] The acceptance control node is used to call the multi-protocol adaptation indicator system module to complete the configuration of indicators and thresholds, send test tasks to test terminal nodes, send monitoring instructions to data acquisition probes, receive and integrate active test data and passive monitoring data, and forward them to the real-time analysis module for compliance judgment and anomaly location.
[0048] The test terminal node is deployed within the computing power node and is used to perform configuration checks on the parameter plane and sample plane, connectivity tests, bandwidth and latency tests, and aggregated communication algorithm tests, and to upload active test data to the acceptance control node.
[0049] The data acquisition probe is connected in series in the network link to capture real traffic of the RoCE / IB protocol, analyze latency fluctuations, packet loss rate, and ECN tag status, form passive monitoring data and upload it to the acceptance control node to verify the authenticity of the active test data.
[0050] In the specific implementation, a high-performance network acceptance system for the intelligent computing data center was built. The hardware layer deployed one acceptance control node, 32 test terminal nodes, and 8 data acquisition probes. The core software modules were integrated within the acceptance control node. The acceptance control node configured the parameters and thresholds for the parameter plane, sample plane, and collective communication algorithm tests, issued test tasks to the test terminal nodes, and issued monitoring commands to the data acquisition probes. The test terminal nodes were deployed within the GPU computing nodes, performed parameter plane connectivity tests, sample plane bandwidth tests, and collective communication algorithm tests, and uploaded the data. The data acquisition probes were connected in series between the core switch and the computing nodes, captured real traffic under the RoCE protocol, analyzed latency fluctuations, packet loss rate, and ECN marking, and transmitted the data back to the acceptance control node. The acceptance control node integrated the active test and passive monitoring data, analyzed the packet loss rate of one computing node, and traced the cause to an incorrect ECN configuration on its network interface card.
[0051] This embodiment constructs a layered architecture of "hardware layer + core software module," clearly defining the responsibilities of each node to achieve centralized management and distributed execution of test tasks, adapting to the complex multi-rail network topology of intelligent computing data centers. This embodiment integrates active testing and passive monitoring modes, overcoming the limitation that traditional active testing covers only ideal conditions, and ensuring that acceptance results reflect network performance under real business loads. This embodiment supports both IB and RoCE protocols, adapting to intelligent computing networks with different technical routes and improving the system's compatibility and versatility.
[0052] In a specific example, the system also includes a parameter plane network and a sample plane network. The parameter plane network is used for parameter synchronization communication between GPUs across machines; the sample plane network is used for sample / model transfer between GPU nodes and AI native storage. The collective communication test is based on the OpenMPI framework, performs the all_reduce, all_to_all, broadcast, reduce, reduce_scatter, and scatter communication operations, and records bus bandwidth values at three levels: single machine, dual machine, and cluster.
[0053] In the specific implementation, the parameter plane network is used for cross-machine communication between the 32 GPU servers, and the sample plane network is used by GPU nodes to download training samples from AI native storage. The test terminal node, based on the OpenMPI framework, executes the six collective communication operations in sequence and records the bus bandwidth values at the single-machine, dual-machine, and cluster levels. The bus bandwidth of the cluster-level all_reduce operation reaches 850Gbps.
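The text does not say how the bus bandwidth figures are derived. One widely used convention, documented in NVIDIA's nccl-tests, scales the measured algorithm bandwidth by an operation-specific factor, which is 2(n-1)/n for all_reduce across n ranks. The sketch below assumes that convention; the function name, its parameters, and the sample numbers are illustrative and do not come from the patent.

```python
def bus_bandwidth_gbps(data_bytes: float, seconds: float, n_ranks: int,
                       op: str = "all_reduce") -> float:
    """Convert a timed collective into bus bandwidth (Gbps).

    Assumes the nccl-tests convention: busbw = algbw * factor(op, n_ranks).
    Only operations whose factor is well documented are listed here.
    """
    if n_ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    factors = {
        "all_reduce": 2 * (n_ranks - 1) / n_ranks,
        "broadcast": 1.0,
        "reduce": 1.0,
    }
    if op not in factors:
        raise ValueError(f"no factor defined in this sketch for {op!r}")
    # Algorithm bandwidth: payload size divided by elapsed time, in Gbit/s.
    alg_gbps = data_bytes * 8 / seconds / 1e9
    return alg_gbps * factors[op]


# Hypothetical measurement: 1 GB all_reduce across 8 ranks in 10 ms.
print(round(bus_bandwidth_gbps(1e9, 0.01, 8), 1))
```

Because the factor depends on the rank count, figures recorded at the single-machine, dual-machine, and cluster levels are only comparable if the same convention is applied at each level.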
[0054] This embodiment distinguishes the functional roles of the parameter plane and sample plane networks and designs dedicated tests for the core data transmission scenarios of intelligent computing services, bringing the acceptance closer to actual business needs. It covers the mainstream collective communication algorithms used in AI distributed training, comprehensively verifies network performance in high-concurrency collaborative communication scenarios, and fills the gap left by traditional acceptance, which lacks collective communication testing. It executes tests based on the OpenMPI framework to ensure the accuracy and industry-wide applicability of test results, and provides data support for the performance optimization of intelligent computing clusters.
[0055] In a specific example, the multi-protocol adaptation indicator system module includes a protocol identification unit, a vendor adaptation unit, an indicator library unit, and a threshold configuration unit. The protocol identification unit is used to automatically identify IB or RoCE protocols and switch the corresponding parsing and testing strategies. The vendor adaptation unit has a built-in hardware matching library of mainstream GPU and network card manufacturers. The indicator library unit stores configuration verification rules. The threshold configuration unit is used to output customized acceptance thresholds corresponding to protocols, manufacturers, and hardware models, so that the system is compatible with both IB and RoCE protocols and hardware from multiple manufacturers.
[0056] In its implementation, the multi-protocol adaptation indicator module incorporates customized acceptance thresholds for GPUs and network cards from two mainstream manufacturers: NVIDIA and Mellanox. For the parameter plane network of the NVIDIA H100 GPU, the one-way latency threshold is set to ≤10μs; for the sample plane network of the Mellanox CX7 network card, the bidirectional bandwidth threshold is set to ≥100Gbps. Simultaneously, adaptation rules for IB and RoCE protocols are configured. When the network protocol is detected as IB, the corresponding test data packet format and analysis algorithm are automatically switched.
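As a minimal sketch of how the threshold configuration unit and protocol identification unit described above might be organized, the following keeps thresholds in a lookup table keyed by vendor, model, and network plane, and maps a detected protocol to a parsing strategy. The two threshold values are those given in paragraph [0056]; the class, function, and strategy names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    metric: str   # name of the measured quantity
    op: str       # "<=" for upper bounds, ">=" for lower bounds
    limit: float

    def passes(self, measured: float) -> bool:
        if self.op == "<=":
            return measured <= self.limit
        return measured >= self.limit


# Keyed by (vendor, model, plane); values taken from paragraph [0056].
THRESHOLDS = {
    ("NVIDIA", "H100", "parameter"): Threshold("one_way_latency_us", "<=", 10.0),
    ("Mellanox", "CX7", "sample"): Threshold("bidirectional_bandwidth_gbps", ">=", 100.0),
}


def lookup(vendor: str, model: str, plane: str) -> Optional[Threshold]:
    """Return the customized threshold, or None if the hardware is unknown."""
    return THRESHOLDS.get((vendor, model, plane))


def select_strategy(protocol: str) -> str:
    """Map a detected protocol to a (hypothetical) parsing strategy name."""
    strategies = {"IB": "ib_parser", "ROCE": "roce_parser"}
    key = protocol.upper()
    if key not in strategies:
        raise ValueError(f"unsupported protocol: {protocol}")
    return strategies[key]


print(lookup("NVIDIA", "H100", "parameter").passes(8.0))  # True
print(select_strategy("RoCE"))                            # roce_parser
```

Returning None for unmatched hardware leaves it to the caller to decide whether to fall back to a default threshold or to flag the device as unsupported.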
[0057] This embodiment supports dual protocol adaptation of IB and RoCE, solving the acceptance complexity problem caused by the diversification of intelligent computing network technology routes and improving the cross-protocol compatibility of the system; it has built-in customized thresholds and verification rules for hardware from multiple vendors to avoid test deviations caused by hardware differences and achieve accurate matching of hardware-protocol-business scenarios; it automatically identifies network protocols and switches test strategies without manual intervention, improving the automation of the acceptance process.
[0058] In a specific instance, the test terminal node performs parameter plane and sample plane tests using the perftest tool, sends RDMA data packets based on the RoCE protocol, and tests and records one-way latency and bidirectional bandwidth data.
[0059] In the implementation, the perftest tool was installed on the test terminal nodes, and RDMA_WRITE packets were sent to the parameter plane network based on the RoCE protocol to test the network performance between each of the 32 GPU nodes, recording the one-way latency and bidirectional bandwidth data. The test results showed that more than 95% of the nodes had a one-way latency of ≤8μs and a bidirectional bandwidth of ≥110Gbps, meeting the preset threshold requirements.
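The batch test above can be sketched as enumerating node pairs and computing the share of results that meet the thresholds. This is an illustration under stated assumptions: the patent does not say whether results are tallied per node or per node pair, the node names are invented, and the thresholds used are the ≤10μs and ≥100Gbps values given elsewhere in the description.

```python
from itertools import combinations

LATENCY_MAX_US = 10.0       # one-way latency threshold
BANDWIDTH_MIN_GBPS = 100.0  # bidirectional bandwidth threshold


def node_pairs(nodes):
    """Every unordered pair of distinct nodes to be tested against each other."""
    return list(combinations(nodes, 2))


def pass_rate(results):
    """Fraction of (latency_us, bandwidth_gbps) results meeting both thresholds."""
    results = list(results)
    if not results:
        return 0.0
    ok = sum(1 for latency, bandwidth in results
             if latency <= LATENCY_MAX_US and bandwidth >= BANDWIDTH_MIN_GBPS)
    return ok / len(results)


nodes = [f"gpu-node-{i:02d}" for i in range(32)]  # hypothetical node names
pairs = node_pairs(nodes)
print(len(pairs))  # 496 pairs for 32 nodes
```

In practice each pair would be measured with the perftest tool and the resulting figures passed to pass_rate.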
[0060] This embodiment uses the perftest tool in conjunction with the RDMA protocol to accurately measure the low latency and high bandwidth performance of the intelligent computing network, meeting the performance requirements that intelligent computing services place on lossless networks; it records the core indicators of one-way latency and bidirectional bandwidth, providing a basis for quantitative evaluation of network performance and addressing the overly narrow indicators of traditional testing; it automates performance testing across batches of nodes, improving testing efficiency and avoiding the errors of manual testing.
[0061] In a specific example, the data acquisition probe captures link traffic through port mirroring, performs protocol parsing on the data packets, and extracts latency fluctuation range, frame loss rate, and ECN tag status as passive monitoring data.
[0062] In the specific implementation, the data acquisition probe is configured with port mirroring to capture the link traffic between the core switch and the computing nodes. The data packets are parsed to obtain the following parameters: the latency fluctuation range of the parameter plane network is ±1μs, the frame loss rate is 0%, and the ECN marking status is normal; the latency fluctuation range of the sample plane network is ±2μs, the frame loss rate is 0.001%, and the ECN marking status is normal. These data are compared with the active test data from the test terminal nodes to verify the authenticity of the active test results.
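The comparison between passive and active data can be sketched as a per-metric tolerance check. The patent does not specify the consistency criterion, so the relative and absolute tolerances below, the metric names, and the sample values are all assumptions chosen for illustration.

```python
import math


def cross_check(active: dict, passive: dict,
                rel_tol: float = 0.05, abs_tol: float = 0.01) -> dict:
    """Compare metrics present in both data sources.

    Returns {metric: True/False}, where True means the active and passive
    values agree within the (assumed) tolerances.
    """
    shared = active.keys() & passive.keys()
    return {
        name: math.isclose(active[name], passive[name],
                           rel_tol=rel_tol, abs_tol=abs_tol)
        for name in sorted(shared)
    }


# Hypothetical readings for one link.
active = {"latency_us": 8.0, "loss_pct": 0.0}
passive = {"latency_us": 8.2, "loss_pct": 0.001, "ecn_marked": 1.0}
print(cross_check(active, passive))
# {'latency_us': True, 'loss_pct': True}
```

Metrics reported by only one source, such as the ECN marking status here, are skipped rather than treated as mismatches; a real system would likely need per-metric tolerances, since the units differ.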
[0063] This embodiment captures real traffic through port mirroring, supplementing passive monitoring and complementing active testing data to enhance the comprehensiveness and reliability of acceptance results. It analyzes key parameters such as latency fluctuations, packet loss rate, and ECN tagging status to accurately reflect network stability and traffic control capabilities, providing detailed directions for network optimization. Monitoring data is automatically transmitted back to the acceptance control node, enabling centralized data analysis and improving the intelligence level of the acceptance process.
[0064] Embodiment 2
[0065] To achieve the above objectives, and referring to Figure 2, this embodiment also provides a high-performance network acceptance method for intelligent computing data centers, based on the system described in Embodiment 1, including the following steps:
[0066] Preprocessing: The acceptance control node automatically discovers the network topology via the LLDP protocol; the data acquisition probe completes protocol adaptation initialization, supporting traffic capture and parsing for the corresponding protocol;
[0067] Protocol and metric configuration: configure test metrics and thresholds along three dimensions: parameter plane, sample plane, and collective communication;
[0068] Layered automated testing: execute parameter plane testing, sample plane testing, and collective communication testing in parallel or in sequence;
[0069] Real-time analysis: receive active test data and passive monitoring data, perform consistency checks and threshold comparisons, and mark anomalies;
[0070] Closed-loop evaluation: output optimization suggestions for anomalies and generate a standardized acceptance report.
[0071] In the specific implementation, the intelligent computing data center network acceptance process is executed as follows: the preprocessing stage automatically identifies the connection relationship of 32 GPU nodes and 8 core switches; completes RoCE protocol adaptation initialization; configures three types of dimensional indicators and thresholds; performs layered testing; analyzes and marks 2 nodes with abnormal PFC configuration in real time; and finally outputs optimization suggestions and a standardized acceptance report.
[0072] This embodiment achieves full-process automation, requiring no manual intervention, improving acceptance efficiency and avoiding human error. It automatically discovers the network topology through the LLDP protocol, adapting to the flexible and ever-changing network structures of intelligent computing data centers and solving the problem of fixed test cases in traditional methods. It forms a closed loop of acceptance and optimization, not only outputting acceptance results but also providing optimization suggestions to improve the delivery quality of the intelligent computing network.
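The topology discovery step can be illustrated by turning LLDP neighbor records into an adjacency map. This is a sketch only: the record layout (local device, local port, remote device, remote port) and the device names are assumptions, and collecting the records from the devices is outside its scope.

```python
from collections import defaultdict


def build_topology(neighbors):
    """Build an undirected adjacency map from LLDP neighbor records.

    Each record is (local_device, local_port, remote_device, remote_port).
    Duplicate reports of the same link from both ends collapse into one edge.
    """
    adjacency = defaultdict(set)
    for local, _local_port, remote, _remote_port in neighbors:
        adjacency[local].add(remote)
        adjacency[remote].add(local)
    return {device: sorted(peers) for device, peers in adjacency.items()}


# Hypothetical records: one GPU node uplinked to two core switches.
records = [
    ("gpu-node-01", "eth0", "core-sw-1", "Eth1/1"),
    ("gpu-node-01", "eth1", "core-sw-2", "Eth1/1"),
    ("core-sw-1", "Eth1/1", "gpu-node-01", "eth0"),  # same link, seen from the switch
]
topology = build_topology(records)
print(topology["gpu-node-01"])  # ['core-sw-1', 'core-sw-2']
```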
[0073] In a specific instance, the parameter plane and sample plane metrics include: NIC rate negotiation mode, PFC priority configuration, DSCP flag, ECN enable status, latency, bandwidth, connectivity, and tenant isolation; the collective communication metrics include: bus bandwidth values for single-machine, dual-machine, and cluster-level operations.
[0074] In the specific implementation, the following parameter plane and sample plane indicators are set: automatic negotiation of network card speed, PFC priority 3, DSCP flag 46, ECN enabled, one-way latency ≤10μs, bidirectional bandwidth ≥100Gbps, connectivity 100%, and tenant isolation with no leakage; the collective communication indicators are: single machine ≥50Gbps, dual machine ≥100Gbps, and cluster ≥800Gbps.
[0075] This embodiment constructs a comprehensive indicator system covering "basic configuration, performance parameters, and service characteristics." It not only includes traditional bandwidth and latency indicators but also covers high-level configuration indicators unique to intelligent computing networks such as PFC, DSCP, and ECN, solving the problem of single traditional acceptance indicators. Differentiated thresholds are set for different dimensions to improve the accuracy of acceptance. Clearly defined thresholds provide a basis for judgment in real-time analysis, enabling automated verification.
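The automated verification against these thresholds can be sketched as a table of comparison rules applied to measured values. The limits are those listed in paragraph [0074]; the metric names, the rule-table layout, and the treatment of a missing metric as a failure are assumptions of this sketch.

```python
import operator

# metric name -> (comparison, limit), using the limits from paragraph [0074]
THRESHOLD_RULES = {
    "one_way_latency_us": (operator.le, 10.0),
    "bidirectional_bandwidth_gbps": (operator.ge, 100.0),
    "connectivity_pct": (operator.ge, 100.0),
    "busbw_single_gbps": (operator.ge, 50.0),
    "busbw_dual_gbps": (operator.ge, 100.0),
    "busbw_cluster_gbps": (operator.ge, 800.0),
}


def evaluate(measured: dict) -> list:
    """Return the names of metrics that are missing or violate their threshold."""
    failures = []
    for name, (compare, limit) in THRESHOLD_RULES.items():
        value = measured.get(name)
        if value is None or not compare(value, limit):
            failures.append(name)
    return failures


# Measured values reported in this embodiment.
measured = {
    "one_way_latency_us": 8.0,
    "bidirectional_bandwidth_gbps": 110.0,
    "connectivity_pct": 100.0,
    "busbw_single_gbps": 55.0,
    "busbw_dual_gbps": 105.0,
    "busbw_cluster_gbps": 850.0,
}
print(evaluate(measured))  # []
```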
[0076] In a specific example, referring to Figure 3, the parameter plane and sample plane tests include: calling the configuration check tool to remotely read the network card parameters of the computing and storage nodes, and verifying the consistency of PFC, DSCP, and ECN configurations; performing end-to-end connectivity ping tests and tenant isolation tests; and sending RDMA data packets through the perftest tool to test bandwidth and one-way latency.
[0077] In the actual implementation, the network card configurations of 32 computing nodes and 8 storage nodes were remotely read, and two nodes were found to have incorrect PFC priority configurations; connectivity and isolation tests were qualified; performance tests showed an average unidirectional latency of 8μs and a bidirectional bandwidth of 110Gbps.
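The configuration consistency check can be sketched as a comparison of each node's network card settings against the expected values. The expected values (PFC priority 3, DSCP 46, ECN enabled) come from the indicator settings of this embodiment; the node names, dictionary layout, and the sample misconfiguration are hypothetical, and remotely reading the settings is not shown.

```python
EXPECTED = {"pfc_priority": 3, "dscp": 46, "ecn_enabled": True}


def find_mismatches(node_configs: dict, expected: dict = EXPECTED) -> dict:
    """Return {node: {parameter: (actual, expected)}} for deviating nodes only."""
    report = {}
    for node, config in node_configs.items():
        diffs = {
            key: (config.get(key), want)
            for key, want in expected.items()
            if config.get(key) != want
        }
        if diffs:
            report[node] = diffs
    return report


# Hypothetical settings read from two nodes; the second has a wrong PFC priority.
configs = {
    "gpu-node-01": {"pfc_priority": 3, "dscp": 46, "ecn_enabled": True},
    "gpu-node-02": {"pfc_priority": 4, "dscp": 46, "ecn_enabled": True},
}
print(find_mismatches(configs))
# {'gpu-node-02': {'pfc_priority': (4, 3)}}
```

Reporting the actual and expected values side by side gives the result output module what it needs to phrase a suggestion such as changing the PFC priority to 3.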
[0078] This embodiment first verifies configuration consistency and then tests performance parameters to improve the accuracy of problem localization; it covers basic functions and security requirements to improve the comprehensiveness of acceptance; and it uses perftest+RDMA testing to accurately reflect the high bandwidth and low latency characteristics of the intelligent computing network.
[0079] In a specific example, referring to Figure 4, the collective communication test includes: single-machine test: performing collective communication operations within a single computing server and recording latency and bandwidth; dual-machine test: performing cross-node collective communication between two GPU servers of the same model to verify inter-node collaboration performance; cluster test: performing large-scale collective communication within an intelligent computing cluster to evaluate consistency and stability.
[0080] In practice, the bandwidth of all_reduce is 55Gbps for a single machine, 105Gbps for two machines, and 850Gbps for a cluster, all of which meet the threshold requirements.
[0081] This embodiment adopts a hierarchical testing approach of "single machine-dual machine-cluster" to fit the AI distributed training scenario; dual-machine testing with the same model of GPU eliminates interference from hardware differences; and cluster-level testing provides performance basis for capacity expansion.
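The description does not state how the recorded bandwidth is derived from a timed all_reduce run. One common convention, used by the nccl-tests benchmark suite, converts the measured algorithm bandwidth into a bus bandwidth by the factor 2(n-1)/n for n participating ranks. The sketch below assumes that convention; the function names and the per-level minimums passed to `levels_below_threshold` are hypothetical.

```python
from typing import Dict


def allreduce_bus_bandwidth_gbps(message_bytes: int, elapsed_seconds: float, ranks: int) -> float:
    """Bus bandwidth of one all_reduce, using the 2(n-1)/n convention (an assumption)."""
    if ranks < 2:
        raise ValueError("all_reduce needs at least two ranks")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Algorithm bandwidth: payload size divided by completion time, in Gbit/s.
    algorithm_gbps = message_bytes * 8 / elapsed_seconds / 1e9
    return algorithm_gbps * 2 * (ranks - 1) / ranks


def levels_below_threshold(
    measured_gbps: Dict[str, float], minimum_gbps: Dict[str, float]
) -> Dict[str, float]:
    """Return the test levels whose measured bandwidth is under the minimum."""
    return {
        level: value
        for level, value in measured_gbps.items()
        if value < minimum_gbps[level]
    }
```

For example, a 1 GB message reduced across 8 ranks in 0.1 s corresponds to an algorithm bandwidth of 80 Gbps and a bus bandwidth of 140 Gbps under this convention. The second function applies one minimum per level, so the single-machine, dual-machine, and cluster results can each be judged against their own threshold.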
[0082] In a specific example, real-time analysis cross-verifies active test data against passive monitoring data; the acceptance report output by the closed-loop evaluation includes: indicators for each link / device, the network topology diagram, anomalies, and optimization suggestions.
[0083] In practice, the data from active testing and passive monitoring are consistent, with no anomalies. The report includes metrics, topology, anomaly statistics, and optimization suggestions, such as changing the PFC priority of two nodes to 3.
[0084] This embodiment employs dual-mode verification to avoid the limitations of a single mode and ensure the authenticity of the results; the report is visualized and provides optimization solutions to achieve a closed loop; standardized reports enhance traceability and meet high reliability quality requirements.
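A minimal sketch of the dual-mode verification follows, assuming that "consistent" means the active and passive values of a metric agree within a relative tolerance. The description does not specify the comparison rule; the 5% tolerance, the metric names, and the function name `cross_check` are illustrative assumptions.

```python
from typing import Dict


def cross_check(
    active: Dict[str, float],
    passive: Dict[str, float],
    rel_tolerance: float = 0.05,
) -> Dict[str, str]:
    """Compare active test results with passive monitoring results per metric."""
    report = {}
    for metric in sorted(set(active) | set(passive)):
        a = active.get(metric)
        p = passive.get(metric)
        if a is None or p is None:
            # Only one mode observed this metric, so it cannot be cross-verified.
            report[metric] = "unverified"
            continue
        baseline = max(abs(a), abs(p))
        deviation = 0.0 if baseline == 0 else abs(a - p) / baseline
        report[metric] = "consistent" if deviation <= rel_tolerance else "inconsistent"
    return report
```

Metrics seen by only one mode are reported as unverified rather than treated as passing, so the acceptance report can distinguish confirmed results from results that rest on a single data source.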
[0085] Example Comparison
[0086] Case 1: Comparison of the Invention's Solution with Traditional Manual Inspection Methods
[0087] An intelligent computing data center has deployed 32 computing servers equipped with NVIDIA H100 GPUs and Mellanox CX7 network cards. A high-performance lossless network has been built using the RoCEv2 protocol, and network acceptance testing needs to be completed.
[0088] When using the technical solution of this invention, the acceptance process is as follows: The acceptance control node automatically scans and generates the network topology using the LLDP protocol, automatically identifying the connection relationship between 32 computing power nodes and 8 core switches. Then, it calls the multi-protocol adaptation indicator system module to configure acceptance indicators and thresholds for three dimensions: parameter plane, sample plane, and aggregated communication. Test tasks are then sent to the test terminal nodes of each computing power node, while monitoring commands are sent to the data acquisition probes in the link. The test terminal nodes sequentially complete network card configuration checks, connectivity and isolation tests, perftest bandwidth and latency tests, and cluster-level AllReduce operation tests based on the OpenMPI framework, and upload active test data in real time. The data acquisition probes synchronously capture real business traffic, analyze latency fluctuations, packet loss rate, and ECN tagging status, forming passive monitoring data. The acceptance control node forwards both types of data to the real-time analysis module for consistency verification and threshold comparison, completing the entire acceptance process within 2 hours. It automatically locates the PFC priority configuration errors of 2 computing power nodes and outputs an optimization suggestion of "changing the PFC priority to 3" through the result output module, generating a standardized acceptance report.
[0089] When using traditional manual acceptance methods, the acceptance process was as follows: Three senior network engineers manually wrote test scripts and logged into each of the 32 computing nodes to configure the test environment. Ping was used to test connectivity, and iPerf was used to perform bandwidth tests on each node, with test data for each node recorded manually. Due to the lack of specialized testing tools for aggregated communication, critical communication modes such as AllReduce were not verified. The entire acceptance process took 12 hours, only completing basic bandwidth and latency tests, and no PFC configuration errors were found. Furthermore, due to differences in operating habits among different engineers, bandwidth test data for the same node varied by up to 15%, making it impossible to reach a unified and traceable acceptance conclusion.
[0090] Case 2: Comparison of the Invention Solution with a Semi-Automatic Acceptance System
[0091] A certain intelligent computing cluster has completed the expansion of 8 new computing nodes. The new nodes are equipped with AMD MI300X GPUs and Huawei Kunpeng network cards. The network acceptance after the expansion needs to be completed quickly to verify the compatibility between the new nodes and the original cluster.
[0092] When using the technical solution of this invention, the acceptance process is as follows: The acceptance control node automatically identifies the topology change through the LLDP protocol, discovers the 8 newly added computing power nodes, and, through the vendor adaptation unit of the multi-protocol adaptation indicator system module, automatically loads the built-in customized acceptance thresholds for the AMD GPUs and Kunpeng network cards. No manual configuration changes are required; a new test strategy is generated within 10 minutes, and test tasks are issued simultaneously to both new and existing nodes. The test terminal nodes complete cross-node collective communication testing to verify the collaborative communication performance between the new and existing nodes; the data acquisition probes capture the changes in link traffic after the expansion to verify the load balancing effect. The acceptance is completed within 1.5 hours, confirming that all indicators of the new nodes meet the standards and that cluster-level communication bandwidth has not degraded.
[0093] When using a semi-automated acceptance testing system, technicians had to manually compile the topology information of newly added nodes, modify the system's preset test scripts, and adapt them to the new hardware models and network topology. Because the system lacked customized verification rules for multi-vendor hardware, it could only use the existing acceptance thresholds for NVIDIA GPUs, resulting in a 20% deviation in the bandwidth test data for the new nodes. The entire script modification and test execution took 3 hours, and the system only output whether the indicators met the standards, without analyzing the reasons for the deviation, making it impossible to verify the compatibility issues of the new nodes.
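The contrast in Case 2 comes down to selecting thresholds by protocol and hardware model instead of reusing one vendor's values. A hedged sketch of such a lookup is shown below; the `Thresholds` type, the `PROFILE_LIBRARY` table, and every numeric value in it are hypothetical placeholders, since the description does not disclose the contents of the hardware matching library.

```python
from typing import Dict, NamedTuple, Tuple


class Thresholds(NamedTuple):
    min_bandwidth_gbps: float
    max_latency_us: float


# Hypothetical profiles keyed by (protocol, GPU model, network card model).
PROFILE_LIBRARY: Dict[Tuple[str, str, str], Thresholds] = {
    ("RoCEv2", "NVIDIA H100", "Mellanox CX7"): Thresholds(100.0, 10.0),
    ("RoCEv2", "AMD MI300X", "Huawei Kunpeng"): Thresholds(90.0, 12.0),
}


def load_thresholds(protocol: str, gpu_model: str, nic_model: str) -> Thresholds:
    """Return the profile for this exact hardware combination.

    Raises LookupError instead of falling back to another vendor's profile,
    because a mismatched profile would skew the pass/fail judgment.
    """
    key = (protocol, gpu_model, nic_model)
    if key not in PROFILE_LIBRARY:
        raise LookupError(f"no acceptance profile for {key}")
    return PROFILE_LIBRARY[key]
```

Failing loudly on an unknown combination is a design choice of this sketch: it surfaces a missing profile during strategy generation rather than letting the new nodes be judged against thresholds tuned for different hardware.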
[0094] Case 3: Comparison of the Invention's Solution with the RoCE Network Evaluation Method for Supercomputing Clusters
[0095] Before running large-scale model training tasks, an intelligent computing center needs to conduct a comprehensive performance evaluation of its RoCEv2 lossless network, with a focus on verifying network stability in collective communication scenarios.
[0096] When using the technical solution of this invention, the evaluation process is as follows: While cluster-level AllReduce operation tests are running, a data acquisition probe captures real training traffic, and analysis reveals a latency jitter of ±1 μs during collective communication. The real-time analysis module, combining active test data and passive monitoring data, attributes the jitter to the ECN function not being enabled on some nodes. After the technicians enabled ECN according to the optimization suggestions and tested again, the latency jitter was eliminated, and large-model training efficiency on the cluster improved by 12%.
[0097] When using the traditional lossless RoCEv2 network evaluation method for supercomputing clusters, technicians used shell scripts to call the perftest tool to complete bandwidth and latency tests for a single node, two nodes, and 16 nodes, generating performance curves and compliance reports. The test results showed that the network bandwidth met the requirements, but real training traffic was not monitored, so the latency jitter in collective communication scenarios went undetected. During subsequent large-scale model training, latency jitter caused some parameter synchronization failures, resulting in multiple training interruptions. Technicians spent several days troubleshooting before locating the ECN configuration problem.
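The jitter finding in Case 3 can be illustrated with a simple heuristic. The description does not disclose the root-cause logic of the real-time analysis module; this sketch assumes jitter is expressed as half of the peak-to-peak latency range (so a spread of 7 μs to 9 μs reads as ±1 μs) and, when it exceeds a limit, lists nodes with ECN disabled as candidates for inspection. The 0.5 μs limit and both function names are hypothetical.

```python
from typing import Dict, List, Sequence


def latency_jitter_us(samples_us: Sequence[float]) -> float:
    """Half of the peak-to-peak latency range, i.e. the '±' jitter value."""
    if not samples_us:
        raise ValueError("at least one latency sample is required")
    return (max(samples_us) - min(samples_us)) / 2


def suspect_ecn_nodes(
    samples_us: Sequence[float],
    ecn_enabled: Dict[str, bool],
    jitter_limit_us: float = 0.5,
) -> List[str]:
    """If jitter exceeds the limit, return nodes with ECN off as candidate causes."""
    if latency_jitter_us(samples_us) <= jitter_limit_us:
        return []
    return sorted(node for node, enabled in ecn_enabled.items() if not enabled)
```

The latency samples would come from the passive monitoring data and the ECN states from the configuration check, which is why both data sources are needed to connect the symptom to a configuration cause. The result is a list of suspects, not a confirmed diagnosis.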
[0098] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A high-performance network acceptance system for intelligent computing data centers, characterized in that, A layered architecture integrating a hardware layer and core software modules is adopted. The hardware layer includes an acceptance control node, a test terminal node, and a data acquisition probe. The core software modules are integrated into the acceptance control node and include a multi-protocol adaptation indicator system module, a real-time analysis module, and a result output module, signal-connected in sequence. The multi-protocol adaptation indicator system module is used to automatically identify network protocol types, match manufacturer hardware models, load the corresponding configuration verification rules and acceptance thresholds, and output three types of standardized test indicator sets: parameter plane, sample plane, and collective communication. The real-time analysis module is used to receive active test data and passive monitoring data, perform dual-mode data consistency verification, indicator threshold comparison, anomaly identification, and preliminary root cause location, and output analysis results. The result output module is used to summarize test data, analysis conclusions, network topology, and anomalies, generate a standardized acceptance report, and output optimization suggestions; wherein the analysis conclusions are generated based on the analysis results. The acceptance control node is used to call the multi-protocol adaptation indicator system module to complete the indicator and threshold configuration, send test tasks to the test terminal node, send monitoring instructions to the data acquisition probe, receive and integrate active test data and passive monitoring data, and forward them to the real-time analysis module for compliance judgment and anomaly location.
The test terminal node is deployed within the computing power node and is used to perform configuration checks, connectivity tests, bandwidth and latency tests, and collective communication algorithm tests on the parameter plane and sample plane, and to upload the active test data to the acceptance control node. The data acquisition probe is connected in series in the network link to capture real RoCE / IB protocol traffic, analyze latency fluctuations, packet loss rate, and ECN tag status, form passive monitoring data, and upload it to the acceptance control node to verify the authenticity of the active test data. The system also includes a parameter plane network and a sample plane network; the parameter plane network is used for parameter synchronization communication between GPUs across machines; the sample plane network is used for sample / model transfer between GPU nodes and AI native storage; the collective communication algorithm test is based on the OpenMPI framework, executes all_reduce, all_to_all, broadcast, reduce, reduce_scatter, and scatter communication operations, and records the bus bandwidth values at the three levels of single machine, dual machine, and cluster.
2. The high-performance network acceptance system for intelligent computing data centers according to claim 1, characterized in that, The multi-protocol adaptation indicator system module includes a protocol identification unit, a vendor adaptation unit, an indicator library unit, and a threshold configuration unit; the protocol identification unit is used to automatically identify IB or RoCE protocols and switch the corresponding parsing and testing strategies. The vendor adaptation unit has a built-in hardware matching library for mainstream GPU and network card manufacturers; the indicator library unit stores configuration verification rules; the threshold configuration unit is used to output customized acceptance thresholds corresponding to protocols, manufacturers, and hardware models, so that the system is compatible with both IB and RoCE protocols and hardware from multiple manufacturers.
3. The high-performance network acceptance system for intelligent computing data centers according to claim 1, characterized in that, The test terminal node performs parameter plane and sample plane tests using the perftest tool, sends RDMA data packets based on the RoCE protocol, and tests and records one-way latency and bidirectional bandwidth data.
4. The high-performance network acceptance system for intelligent computing data centers according to claim 1, characterized in that, The data acquisition probe captures link traffic through port mirroring, performs protocol parsing on the data packets, and extracts latency fluctuation range, frame loss rate, and ECN tag status as passive monitoring data.
5. A high-performance network acceptance method for intelligent computing data centers, based on the system described in any one of claims 1-4, characterized in that, The method includes the following steps: Preprocessing: the acceptance control node automatically discovers the network topology via the LLDP protocol, and the data acquisition probe completes protocol adaptation initialization to support traffic capture and parsing for the corresponding protocol; Protocol and indicator configuration: test indicators and thresholds are configured along three dimensions: parameter plane, sample plane, and collective communication; Layered automated testing: parameter plane testing, sample plane testing, and collective communication testing are executed in parallel or sequentially; Real-time analysis: active test data and passive monitoring data are received, consistency checks and threshold comparisons are performed, and anomalies are marked; Closed-loop evaluation: optimization suggestions are output for the anomalies, and a standardized acceptance report is generated.
6. The high-performance network acceptance method for intelligent computing data centers according to claim 5, characterized in that, The parameter plane and sample plane indicators include: network interface card rate negotiation mode, PFC priority configuration, DSCP marking, ECN enable status, latency, bandwidth, connectivity, and tenant isolation; the collective communication indicators include: bus bandwidth values for single-machine, dual-machine, and cluster-level operations.
7. The high-performance network acceptance method for intelligent computing data centers according to claim 5, characterized in that, The parameter plane and sample plane tests include: calling the configuration check tool to remotely read the network card parameters of the computing and storage nodes and verifying the consistency of the PFC, DSCP, and ECN configurations; performing end-to-end connectivity ping tests and tenant isolation tests; and sending RDMA data packets with the perftest tool to test bandwidth and one-way latency.
8. The high-performance network acceptance method for intelligent computing data centers according to claim 5, characterized in that, The collective communication tests include: Single-machine test: collective communication operations are performed within a single computing server, and latency and bandwidth are recorded; Dual-machine test: cross-node collective communication is performed between two GPU servers of the same model to verify inter-node collaboration performance; Cluster test: large-scale collective communication is performed within the intelligent computing cluster to evaluate consistency and stability.
9. The high-performance network acceptance method for intelligent computing data centers according to claim 5, characterized in that, The real-time analysis cross-verifies active test data against passive monitoring data; the acceptance report output by the closed-loop evaluation includes: indicators for each link / device, the network topology diagram, anomalies, and optimization suggestions.