IB network detection method, apparatus, device, and medium
By collecting and analyzing IB network card information, constructing test groups, and running several types of IB network tests, the method addresses the difficulty of locating IB network faults, enabling comprehensive monitoring and rapid fault location of the IB network and improving network reliability and stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM CLOUD TECH CO LTD
- Filing Date
- 2024-12-04
- Publication Date
- 2026-06-16
AI Technical Summary
In the context of AI-powered intelligent computing, model training interruptions caused by IB network failures are difficult to locate, and the probability of such failures is high, affecting training efficiency and resource utilization.
Collect IB network interface card information from multiple nodes, build a test group, perform IB network testing using various commands, obtain test data, and use Grafana for visualization analysis and Alertmanager alert push.
It enables comprehensive monitoring and rapid fault location of the IB network, reduces manual intervention, improves the comprehensiveness of network detection and fault location capabilities, enhances performance monitoring and resource utilization efficiency, and improves network reliability and stability.
Figure CN119814605B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of cloud computing technology, and in particular to an IB network detection method, apparatus, device and medium.
Background Technology
[0002] In AI-oriented intelligent computing, training larger models typically requires joint training across multiple machines and multiple GPUs. The IB network interface card (NIC) is a core component of such training. Unlike an ordinary NIC, it integrates both the RDMA and IB protocols, and the proprietary nature of these protocols complicates troubleshooting. Furthermore, practical experience has shown that IB networks are prone to failure during training. When an IB network fails, the entire large-scale model training is forced to stop and must be restarted from scratch. This is time-consuming and labor-intensive, and IB network failures significantly increase the probability that model training and other tasks are interrupted.
Summary of the Invention
[0003] In view of the above, embodiments of this application provide an IB network detection method, apparatus, device, and medium to overcome or at least partially solve the above problems.
[0004] A first aspect of this application provides an IB network detection method, the method comprising:
[0005] Collect multiple node information from multiple nodes, wherein the node information includes at least the IB network card information of the IB network cards configured on the nodes;
[0006] Based on the IB network interface card information of the multiple nodes, multiple test groups are constructed, and the test groups include service nodes and client nodes;
[0007] The system iterates through the multiple test groups, performs various types of IB network tests on each test group using different commands, and obtains the test data for each test group.
[0008] Optionally, the collection of multiple node information from multiple nodes includes:
[0009] Deploy proxy services to the multiple nodes, and collect multiple node information of the multiple nodes through the proxy services;
[0010] The node information includes: IP address, host name, and IB network card information corresponding to the IB network card configured on the node. The IB network card information includes: the number of IB network cards, the LID number of the IB network card, and the device name of the IB network card.
[0011] Optionally, based on the IB network interface card information of the multiple nodes, multiple test groups are constructed, including:
[0012] Based on the node information of the multiple nodes, a hash dictionary is constructed for the multiple IB network interface cards (NICs). The key of the hash dictionary is the LID number of the IB NIC of the node, and the value of the hash dictionary is the LID number of the IB NIC of the node, the device name of the IB NIC, and the IP address.
[0013] An array is constructed based on multiple keys in the hash dictionary, and the array is shuffled using a shuffle algorithm.
[0014] The multiple keys in the shuffled array are used to construct multiple test groups, with the purpose of testing each IB network card with any other IB network card, by grouping each pair of keys together.
[0015] Optionally, the step of traversing the multiple test groups, performing various types of IB network tests on each traversed test group using various commands, and obtaining the test data for each test group includes:
[0016] Run the rping server command on the service node and the rping client command on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets;
[0017] When the rping server command and / or the rping client command fails to execute, obtain the first packet capture data;
[0018] When the rping server command and the rping client command are executed successfully, delete the first packet capture data;
[0019] For each test group, corresponding first detection data is constructed. The first detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0020] The first detection data includes at least the test results of the test group, the address of the first packet capture data, and the first timestamp.
[0021] Optionally, the step of traversing the multiple test groups, performing various types of IB network tests on each traversed test group using various commands, and obtaining the test data for each test group includes:
[0022] Run the ibping server command on the service node and the ibping client command on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets;
[0023] When the ibping server command and / or the ibping client command fails to execute, obtain the second packet capture data;
[0024] When the ibping server command and the ibping client command are executed successfully, the second packet capture data is deleted;
[0025] For each test group, corresponding second detection data is constructed. The second detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0026] The second detection data includes at least the test results of the test group, the address of the second packet capture data, and the second timestamp; wherein, when the test result indicates that the test was successful, the second detection data also includes the execution time.
[0027] Optionally, the step of traversing the multiple test groups, performing various types of IB network tests on each traversed test group using various commands, and obtaining the test data for each test group includes:
[0028] Run data transmission server commands on the service node and data transmission client commands on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets;
[0029] When the data transmission server command and / or the data transmission client command fails to execute, obtain the third packet capture data;
[0030] When the data transmission server command and the data transmission client command are executed successfully, the third packet capture data is deleted;
[0031] For each test group, a corresponding third detection data is constructed. The third detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0032] The third detection data includes at least the test results of the test group, the address of the third packet capture data, and the third timestamp; wherein, when the test result indicates that the test is successful, the third detection data also includes bandwidth information and latency information of the data transmission process during the test.
[0033] Optionally, the method further includes:
[0034] The detection data is pushed to Grafana for visualization and analysis; and / or
[0035] When the detection data indicates an IB network anomaly, the administrator terminal is prompted to handle the issue via the alertmanager alarm push function.
[0036] A second aspect of this application provides an IB network detection device, the device comprising:
[0037] The acquisition module is used to acquire node information from multiple nodes, wherein the node information includes at least the IB network card information of the IB network cards configured on the nodes;
[0038] The construction module is used to construct multiple test groups based on the IB network interface card information of the multiple nodes, wherein the test groups include service nodes and client nodes;
[0039] The detection module is used to traverse the multiple test groups, perform various types of IB network detection on each traversed test group through various commands, and obtain the detection data of each test group.
[0040] A third aspect of this application provides an electronic device including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the method as described in the first aspect.
[0041] A fourth aspect of this application provides a computer program product, including a computer program that, when executed by a processor, implements the method described in the first aspect.
[0042] A fifth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method described in the first aspect.
[0043] The beneficial effects of this application are:
[0044] This application provides an IB network detection method, apparatus, device, and medium, comprising: collecting multiple node information from multiple nodes, wherein the node information includes at least IB network card information configured on the nodes; constructing multiple test groups based on the IB network card information of the multiple nodes, wherein the test groups include service nodes and client nodes; traversing the multiple test groups, performing various types of IB network detection on each traversed test group through various commands, and obtaining detection data for each test group.
[0045] The technical solution of this application collects node information for each node, including the configuration of the IB network interface card (NIC), ensuring that the IB network detection process covers all nodes and IB NICs in the network. Based on this node information, multiple test groups are then constructed to ensure that every node and its IB NIC are detected, achieving comprehensive monitoring of the entire IB network and thorough detection of the status of each node and NIC. Each test group is then tested using various commands to quickly locate network anomalies. Simultaneously, the collected detection data, such as packet capture data during failed tests, can be used to analyze the causes of IB network failures, thereby reducing the impact of network failures on services and enabling timely detection and response to network problems. Furthermore, this automated approach to information collection, test group construction, and detection reduces manual intervention, improving the comprehensiveness of network detection and fault location capabilities. It also enhances performance monitoring and resource utilization efficiency, ultimately improving network reliability and stability.
Description of the Drawings
[0046] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments of this application and their descriptions are used to explain this application and do not constitute an undue limitation of this application.
[0047] To more clearly illustrate the technical solution of this application, the drawings used in the description of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0048] Figure 1 is a flowchart illustrating an IB network detection method according to an embodiment of this application;
[0049] Figure 2 is an overall schematic diagram of an IB network detection method provided in an embodiment of this application;
[0050] Figure 3 is a flowchart illustrating an IB network detection method according to another embodiment of this application;
[0051] Figure 4 is a schematic diagram of the framework of a network detection device provided by an embodiment of this application;
[0052] Figure 5 is a schematic diagram of an electronic device provided by an embodiment of this application.
Detailed Description
[0053] It should be noted that, without conflict, the embodiments in the present application and the features in the embodiments may be combined with each other.
[0054] InfiniBand (literally "infinite bandwidth", abbreviated as IB) is a network communication technology for high-performance computing (HPC) and AI; its adapters are commonly known as IB network cards. It offers extremely high throughput (bandwidth of up to 200 Gb/s or 400 Gb/s) and extremely low latency. A network composed of IB network cards and IB switches is simply referred to as an IB network.
[0055] K8s (Kubernetes): An open-source container cluster orchestration system that can implement functions such as automated deployment, automatic scaling, and maintenance of container clusters.
[0056] Prometheus: An open-source monitoring system.
[0057] Grafana: It is an open-source visualization dashboard system.
[0058] Alertmanager: It is mainly used to receive alert information sent by Prometheus and push the alerts to the app.
[0059] PromQL: PromQL is the query language of Prometheus. In this application, "PromQL data" refers to samples in the Prometheus data format. A sample consists of the following three parts:
[0060] Metric: the metric name and label set that describe the characteristics of the current sample;
[0061] Timestamp: a timestamp accurate to milliseconds;
[0062] Sample value (Value): a float64 floating-point number representing the value of the current sample.
[0063] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0064] One embodiment of this application provides an IB network detection method. Figure 1 is a schematic flowchart of an IB network detection method provided in one embodiment of this application, and Figure 2 is an overall schematic diagram of the method. As shown in Figure 1, the method includes:
[0065] Step S101: Collect multiple node information from multiple nodes, wherein the node information includes at least the IB network card information of the IB network card configured on the node;
[0066] Step S102: Based on the IB network card information of the multiple nodes, construct multiple test groups, wherein the test groups include service nodes and client nodes;
[0067] Step S103: Traverse the multiple test groups, perform various types of IB network detection on each traversed test group using various commands, and obtain the detection data of each test group.
[0068] First, in step S101, as shown in Figure 2, a Kubernetes cluster contains multiple nodes, and proxy services need to be deployed on each node. These proxy services are responsible for monitoring and collecting network status information of the nodes in real time. By collecting node information from multiple nodes, a comprehensive network topology view can be built, providing necessary contextual information for subsequent network detection and fault diagnosis. The node information for each node includes at least the IB network interface card (NIC) information configured on that node.
[0069] Furthermore, in step S102, multiple test groups need to be constructed based on the collected IB network card information. Each test group contains one service node and one client node for subsequent IB network performance testing. When a node contains multiple IB network cards, it corresponds to multiple service nodes or multiple client nodes. At the same time, multiple IB network cards on a node can also construct a test group where the service node and the client node are the same node. In addition, when two nodes are in a test group, these two nodes take turns being the service node and the client node.
[0070] For example, consider nodes A and B. Node A has IB network interface cards A1 and A2, and node B has IB network interface cards B1 and B2. The resulting test group would look like this:
[0071] Test group 1: Node A (A1) and Node A (A2);
[0072] Test group 2: Node A (A1) and Node B (B1);
[0073] Test group 3: Node A (A1) and Node B (B2);
[0074] Test group 4: Node A (A2) and Node B (B1);
[0075] Test group 5: Node A (A2) and Node B (B2);
[0076] Test group 6: Node B (B1) and Node B (B2);
[0077] In test group 1, when node A (A1) is a service node, node A (A2) is a client node, and when node A (A1) is a client node, node A (A2) is a service node.
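The grouping in the example above can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes every unordered pair of distinct IB network cards forms one test group and that each group is run once in each direction. The tuple layout `(node, card)` and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical labels for the NICs in the example: (node, card).
nics = [("A", "A1"), ("A", "A2"), ("B", "B1"), ("B", "B2")]


def build_test_groups(nics):
    """Return every unordered pair of distinct NICs, in input order."""
    return list(combinations(nics, 2))


def directed_runs(group):
    """Yield the (service, client) assignments for one group, both directions."""
    first, second = group
    yield first, second
    yield second, first


groups = build_test_groups(nics)
for number, group in enumerate(groups, start=1):
    for service, client in directed_runs(group):
        print(f"test group {number}: service={service[1]} client={client[1]}")
```

With four cards this yields the six groups listed above, including the two groups whose service and client cards sit on the same node.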
[0078] Finally, in step S103, the constructed test groups are traversed periodically. For each traversed test group, several IB network detection commands (such as rping, ibping, ibsend, ibread, and ibwrite) are executed to test the IB network's connectivity, bandwidth, latency, and other performance indicators. The detection data generated by each test group during detection is then collected, including test results, packet capture data, and timestamps.
[0079] Through the above embodiments, firstly, node information for each node, including the configuration of the IB network interface card (NIC), is collected to ensure that the IB network detection process covers all nodes and IB NICs in the network. Then, multiple test groups are constructed based on this node information to ensure that every node and its IB NIC are detected, achieving comprehensive monitoring of the entire IB network and detecting the status of each node and NIC without omission. Next, each test group is tested using various commands to quickly locate network anomalies. Simultaneously, the collected detection data, such as packet capture data during detection failures, can be used to analyze the causes of IB network failures, thereby reducing the impact of network failures on services and enabling timely detection and response to network problems. Furthermore, this automated approach to information collection, test group construction, and detection execution reduces manual intervention, thereby improving the comprehensiveness of network detection and fault location capabilities, enhancing performance monitoring and resource utilization efficiency, and ultimately improving network reliability and stability.
[0080] Optionally, step S101, which involves collecting information from multiple nodes, specifically includes:
[0081] Step S101-1: Deploy proxy services for the multiple nodes, and collect multiple node information of the multiple nodes through the proxy services;
[0082] The node information includes: IP address, host name, and IB network card information corresponding to the IB network card configured on the node. The IB network card information includes: the number of IB network cards, the LID number of the IB network card, and the device name of the IB network card.
[0083] Specifically, in one embodiment, the process of collecting information from multiple nodes involves first deploying a proxy service on each of the multiple nodes as a DaemonSet, which ensures automatic deployment and management on each node of the Kubernetes cluster. These proxy services are lightweight software components that run on each node and collect node information, such as network and hardware status information. They may periodically report the collected data to a central monitoring system or report anomalies immediately upon detection. Furthermore, they may execute pre-defined diagnostic scripts to verify the health status and performance metrics of the IB network interface cards (NICs).
[0084] The node information for each node specifically includes the following:
[0085] IP address: the IP address of each node's network interface, used to identify and locate the node during network communication.
[0086] Hostname: the hostname of each node, which provides a more intuitive reference during network management and troubleshooting.
[0087] IB network card information:
[0088] Number of IB network cards: the number of IB network cards installed on each node, which indicates the node's network scalability and performance potential.
[0089] LID number of the IB network card: the local identifier (LID) of each IB network card. The LID is used to uniquely identify each port in the IB network.
[0090] Device name of the IB network card: the device name of each IB network card, such as mlx5_0 or mlx5_1, which helps to accurately identify specific network devices during configuration and diagnosis.
[0091] By collecting the IP address and hostname of each node through the above embodiments, the entire network topology can be accurately mapped, providing fundamental data for network management and optimization. Furthermore, collecting the number of IB network interface cards (NICs), their LID numbers, and device names allows for in-depth monitoring of the hardware status of each node, enabling timely detection of hardware faults or configuration problems. Detailed IB NIC information helps to quickly locate the source of problems when network failures occur, accelerating the fault response and repair process. This is crucial for subsequent network testing and fault diagnosis. In this way, the comprehensiveness and accuracy of network testing can be ensured, thereby improving the reliability and stability of the entire IB network.
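The node information described above can be modelled as a small record. The sketch below is one possible way for an agent to gather it, and it rests on assumptions: the application does not say how the agent reads the LID, so reading the Linux sysfs file `/sys/class/infiniband/<device>/ports/<port>/lid` is a swapped-in technique, and the class names, function names, and the fixed port `"1"` are hypothetical.

```python
import socket
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class IBCard:
    device: str  # device name, e.g. "mlx5_0"
    lid: str     # LID as reported by the driver, e.g. "0x24"


@dataclass
class NodeInfo:
    ip: str
    hostname: str
    cards: list = field(default_factory=list)

    @property
    def card_count(self) -> int:
        """Number of IB network cards found on the node."""
        return len(self.cards)


def read_ib_cards(sysfs_root="/sys/class/infiniband", port="1"):
    """List IB devices and the LID of one port by reading sysfs (assumed source)."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # no InfiniBand stack visible on this host
    cards = []
    for dev in sorted(root.iterdir()):
        lid_file = dev / "ports" / port / "lid"
        if lid_file.is_file():
            cards.append(IBCard(device=dev.name, lid=lid_file.read_text().strip()))
    return cards


def collect_node_info(ip, sysfs_root="/sys/class/infiniband"):
    """Assemble the record an agent would report for its own node."""
    return NodeInfo(ip=ip, hostname=socket.gethostname(),
                    cards=read_ib_cards(sysfs_root))
```

Passing `sysfs_root` as a parameter keeps the sketch testable on machines without IB hardware.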
[0092] Optionally, the step S102, which involves constructing multiple test groups based on the IB network interface card information of the multiple nodes, specifically includes:
[0093] Step S102-1: Based on the node information of the multiple nodes, construct a hash dictionary for the multiple IB network cards; the key of the hash dictionary is the LID number of the IB network card of the node, and the value of the hash dictionary is the LID number of the IB network card of the node, the device name of the IB network card, and the IP address.
[0094] Step S102-2: Based on multiple keys in the hash dictionary, construct an array and shuffle the array using a shuffle algorithm;
[0095] Step S102-3: Take the multiple keys in the shuffled array and, with the purpose of testing each IB network card with any other IB network card, construct multiple test groups by grouping each pair of keys together.
[0096] Figure 3 is a flowchart illustrating an IB network detection method provided in another embodiment of this application.
[0097] Specifically, as shown in Figure 3, in one embodiment, in step S102-1, a hash dictionary is first constructed for the multiple IB network interface cards (NICs) based on the node information of the multiple nodes. Each Key-Value pair corresponds to one IB NIC. The Key of the hash dictionary is the LID number of an IB NIC of a node, while the Value includes the LID number, device name, and IP address of that IB NIC. This associates the information of each IB NIC together for subsequent IB network testing and pairing.
[0098] For example:
[0099] Suppose there are three nodes, node1, node2, and node3. The IB network interface LID numbers of node1 are 0x24 and 0x76, the IB network interface LID number of node2 is 0x80, and the IB network interface LID number of node3 is 0x82.
[0100] The IB network interface card information for each node is shown in Table 1:
[0101] Table 1
[0102]
[0103] The constructed hash dictionary is shown below:
[0104] [{
[0105] "0x24": ["10.0.2.1", "0x24", "mlx5_0"]
[0106] }, {
[0107] "0x76": ["10.0.2.1", "0x76", "mlx5_1"]
[0108] }, {
[0109] "0x80": ["10.0.2.2", "0x80", "mlx5_2"]
[0110] }, {
[0111] "0x82": ["10.0.2.3", "0x82", "mlx5_3"]
[0112] }]
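A minimal sketch of step S102-1, using the example values above. It assumes a flat Python dictionary is an acceptable stand-in for the list of single-key objects shown in the text; the input layout and the function name `build_lid_index` are hypothetical.

```python
# Node records using the example values from the text.
nodes = [
    {"ip": "10.0.2.1", "cards": [("0x24", "mlx5_0"), ("0x76", "mlx5_1")]},
    {"ip": "10.0.2.2", "cards": [("0x80", "mlx5_2")]},
    {"ip": "10.0.2.3", "cards": [("0x82", "mlx5_3")]},
]


def build_lid_index(nodes):
    """Map each NIC's LID to [IP address, LID, device name]."""
    index = {}
    for node in nodes:
        for lid, device in node["cards"]:
            if lid in index:
                # A repeated LID would make the key ambiguous, so fail loudly.
                raise ValueError(f"duplicate LID {lid}")
            index[lid] = [node["ip"], lid, device]
    return index


lid_index = build_lid_index(nodes)
print(lid_index["0x76"])
```

Keying on the LID means a later step only needs to carry the key around and can look up the device name and IP address when a command has to be issued.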
[0113] Further, in step S102-2, an array is constructed based on the Key (i.e., the LID number of each node's IB network interface card) in the hash dictionary. This array is then shuffled using a shuffle algorithm. The purpose of this step is to ensure that the pairing of multiple IB network interfaces is random when constructing the test group. This increases test coverage and fairness, avoiding the repetitive testing of the same node pairs, which could potentially miss some network issues.
[0114] For example:
[0115] Suppose the key array of the hash dictionary is [0x24, 0x76, 0x80, 0x82]. After shuffling, the array may become [0x24, 0x82, 0x80, 0x76].
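The text only says "a shuffle algorithm", so the sketch below assumes a Fisher-Yates shuffle, which is one common choice and not necessarily the one used in the application. The function name and the optional `rng` parameter are hypothetical; the parameter exists only so that a seeded generator can make runs repeatable.

```python
import random


def fisher_yates_shuffle(items, rng=None):
    """Return a shuffled copy of items; every permutation is equally likely."""
    rng = rng or random.Random()
    result = list(items)
    # Walk from the last slot down, swapping it with a random earlier slot.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # both bounds inclusive
        result[i], result[j] = result[j], result[i]
    return result


keys = ["0x24", "0x76", "0x80", "0x82"]
shuffled = fisher_yates_shuffle(keys, random.Random(7))
print(shuffled)
```

The standard library's `random.shuffle` does the same job in place; the explicit loop is shown only to make the algorithm visible.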
[0116] Furthermore, in step S102-3, the keys in the shuffled array are grouped into multiple test groups, each consisting of two keys. Each group represents a pairing of one IB network interface card (NIC) with another NIC, used for subsequent network testing. For example, if the array is [0x24, 0x82, 0x80, 0x76], then the constructed test groups might be (0x24, 0x82), (0x82, 0x80), (0x80, 0x76), (0x24, 0x80), (0x24, 0x76), (0x82, 0x76). This pairing ensures that each IB NIC has the opportunity to be tested with other IB NICs, thereby achieving comprehensive monitoring of the entire IB network.
[0117] For example:
[0118] Continuing with the example above, the shuffled array [0x24, 0x82, 0x80, 0x76] yields 6 test groups:
[0119] Test group 1: IB network card 0x24 on node1 and IB network card 0x82 on node3;
[0120] Test group 2: IB network card 0x82 on node3 and IB network card 0x80 on node2;
[0121] Test group 3: IB network card 0x80 on node2 and IB network card 0x76 on node2;
[0122] Test group 4: IB network card 0x24 on node1 and IB network card 0x80 on node2;
[0123] Test group 5: IB network card 0x24 on node1 and IB network card 0x76 on node2;
[0124] Test group 6: IB network card 0x82 on node3 and IB network card 0x76 on node2.
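Step S102-3 can be sketched as taking every two-key combination of the shuffled array. This is an assumption about how the pairs are enumerated: the set of groups matches the six listed above, but the order produced by `itertools.combinations` differs from the order in the text. The function name `pair_keys` is hypothetical.

```python
from itertools import combinations


def pair_keys(shuffled_keys):
    """Group the keys two by two so each NIC is tested with every other NIC."""
    return list(combinations(shuffled_keys, 2))


shuffled = ["0x24", "0x82", "0x80", "0x76"]
groups = pair_keys(shuffled)
for number, (first, second) in enumerate(groups, start=1):
    print(f"test group {number}: {first} <-> {second}")
```

For n keys this produces n(n-1)/2 groups, so the number of tests grows quadratically with the number of IB network cards.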
[0125] Through the above embodiments, constructing a hash dictionary and shuffling the key array ensures that every node and IB network card in the network is included in the testing scope, improving the comprehensiveness of the test. Furthermore, by shuffling the key array, the pairing of test groups is ensured to be random, which helps to discover potential network problems rather than just testing fixed node pairs. By accurately constructing test groups, network resources can be utilized more effectively, avoiding duplicate testing and improving testing efficiency. Random pairing and comprehensive testing help to discover and resolve various problems in the network, thereby improving the reliability of network detection, optimizing resource allocation and testing efficiency, and ultimately improving the stability of the entire IB network.
[0126] Optionally, step S103 includes:
[0127] Step S103-1-1: Run the rping server command on the service node and the rping client command on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets.
[0128] Step S103-1-2: When the rping server command and / or the rping client command fails to execute, obtain the first packet capture data;
[0129] Step S103-1-3: When the rping server command and the rping client command are executed successfully, delete the first packet capture data;
[0130] Step S103-1-4: For each test group, construct corresponding first detection data. The first detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0131] The first detection data includes at least the test results of the test group, the address of the first packet capture data, and the first timestamp.
[0132] Specifically, as shown in Figure 3, in one embodiment, in order to evaluate the performance of the RDMA network within each test group, the rping command is executed for each test group, and the packet capture data and detection data are processed based on the test results, thereby providing detailed data for the subsequent performance analysis of the IB network.
[0133] First, in step S103-1-1, the `rping server` command is run on the service node, and the `rping client` command is run on the client node. Simultaneously, the `ibdump` command is executed on both nodes to capture network packets. The purpose of this step is to test the network connectivity between the service node and the client node and to capture network packets during the test.
[0134] Furthermore, in step S103-1-2, when the rping server command and / or client command fails to execute, the first packet capture data obtained by executing the ibdump command is acquired. This data is saved to facilitate subsequent analysis of potential IB network problems when a test failure occurs during the execution of the rping command to test IB network connectivity.
[0135] Furthermore, in step S103-1-3, when both the rping server command and the client command are executed successfully, the first packet capture data obtained by executing the ibdump command can be deleted. In this process, since the test is successful, the first packet capture data is no longer needed for this test and can be deleted to free up storage space.
[0136] Finally, in step S103-1-4, corresponding first detection data is constructed for each test group. The first detection data includes detection data obtained with the first node as the service node and the second node as the client node, as well as detection data obtained with the reverse configuration, i.e., with the first node as the client node and the second node as the service node. The first detection data includes at least the test results of the test group, the address of the first packet capture data, and the first timestamp. The test result is either successful or failed. The address of the first packet capture data is empty when the test is successful; when the test fails, the first packet capture data can be retrieved through this address. The first timestamp records the time when the test group performed the rping test.
[0137] For example, consider two nodes, node1 (IB network interface 1) and node2 (IB network interface 2), forming test group A. They are configured as a service node and a client node, respectively. Run the `rping` server command on node1 and the `rping` client command on node2. Simultaneously, use the `ibdump` command to capture the first packet capture data between these two nodes. This first packet capture data needs to be pushed to an object storage service (e.g., S3).
[0138] When an rping test fails (e.g., due to network latency or packet loss), the first packet capture data obtained by ibdump will be saved for subsequent analysis of the cause of the failure.
[0139] When the rping test succeeds, the first packet capture data is deleted because no further analysis is needed.
[0140] For test group A, regardless of success or failure, a PromQL dataset is constructed. The PromQL dataset is used to record detection data such as: test result (success or failure), address of the first packet capture data (in the case of test failure), and the first timestamp of the test, and is pushed to Prometheus. When it is necessary to analyze the first detection data, the first packet capture data can be obtained through the address of the first packet capture data.
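The capture lifecycle in this example (keep the capture on failure, delete it on success, then record the result) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the names `run_rping_with_capture`, `run_test`, and `upload`, as well as the layout of the returned dictionary, are hypothetical. The actual rping and ibdump invocations are injected as a callable, so no command-line flags are assumed.

```python
import os
import time
from typing import Callable, Dict


def run_rping_with_capture(
    run_test: Callable[[str], bool],
    upload: Callable[[str], str],
    capture_path: str,
) -> Dict[str, str]:
    """Run one rping test and keep the packet capture only on failure.

    run_test(capture_path) is a hypothetical callable that starts ibdump,
    runs the rping server/client pair, stops ibdump, and returns True when
    both rping commands succeeded. upload(path) is a hypothetical callable
    that pushes the capture to object storage and returns its address.
    """
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    ok = run_test(capture_path)
    if ok:
        # Success: the capture is not needed, so free the storage space.
        if os.path.exists(capture_path):
            os.remove(capture_path)
        address = ""
    else:
        # Failure: preserve the capture for later analysis.
        address = upload(capture_path)
    return {
        "method": "rping",
        "status": "OK" if ok else "Failed",
        "s3URL": address,
        "timestamp": timestamp,
    }
```

Injecting `run_test` and `upload` keeps the keep-or-delete decision separate from how the commands are launched and where the capture is stored.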
[0141] For example, taking the IB network card information in Table 1 as an example, the first detection data contained in the promQL data of some test groups is listed below:
[0142] 1. Test group 5: node1 (0x24) and node2 (0x76):
[0143] In test group 5, the rping test data for node1 (0x24) as the service node and node2 (0x76) as the client node is as follows:
[0144] "rpingInfo{src={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"} dst={LID="0x76", device="mxl5_0", nodeName="node2", nodeIP="10.0.2.2"}, method="rping", status="OK", s3URL=""}timestap 2024-11-26 18:55:13";
[0145] In test group 5, the rping test data for node1 (0x24) as the client node and node2 (0x76) as the service node is as follows:
[0146] "rpingInfo{src={LID="0x76", device="mxl5_0", nodeName="node2", nodeIP="10.0.2.2"} dst={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"}, method="rping", status="OK", s3URL=""}timestap 2024-11-26 18:55:23";
[0147] 2. Test group 3: node2 (0x76) and node2 (0x80):
[0148] In test group 3, the rping test data for node2 (0x76) as the service node and node2 (0x80) as the client node is as follows:
[0149] "rpingInfo{src={LID="0x76", device="mxl5_0", nodeName="node2", nodeIP="10.0.2.2"} dst={LID="0x80", device="mxl5_1", nodeName="node2", nodeIP="10.0.2.2"}, method="rping", status="Failed", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-26 18:55:13";
[0150] In test group 3, the rping test data for node2 (0x76) as the client node and node2 (0x80) as the service node is as follows:
[0151] "rpingInfo{src={LID="0x80", device="mxl5_1", nodeName="node2", nodeIP="10.0.2.2"} dst={LID="0x76", device="mxl5_0", nodeName="node2", nodeIP="10.0.2.2"}, method="rping", status="Failed", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-26 18:55:23";
[0152] 3. Test group 1: node3 (0x82) and node1 (0x24):
[0153] In test group 1, the rping test data for node3 (0x82) as the service node and node1 (0x24) as the client node is as follows:
[0154] "rpingInfo{src={LID="0x82", device="mxl5_0", nodeName="node3", nodeIP="10.0.2.3"} dst={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"}, method="rping", status="Failed", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-25 18:55:13";
[0155] In test group 1, the rping test data for node3 (0x82) as the client node and node1 (0x24) as the service node is as follows:
[0156] "rpingInfo{src={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"} dst={LID="0x82", device="mxl5_0", nodeName="node3", nodeIP="10.0.2.3"}, method="rping", status="Failed", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-25 18:55:23";
[0157] As the first detection data in the promQL records above shows, each test group includes two sets of first detection data for the rping test. Here, src represents the IB network interface card (NIC) acting as the service node, dst represents the IB NIC acting as the client node, LID represents the LID of the IB NIC, device represents the device name of the IB NIC, nodeName represents the name of the node, nodeIP represents the IP address of the node, and timestap represents the first timestamp.
[0158] The `status` field in the `rping` command output indicates whether the rping test was successful or failed. "OK" indicates success, and "Failed" indicates failure. When the `status` field is "Failed", the first packet capture data can be obtained from the specified address `s3URL` to analyze the cause of the error.
[0159] In addition, test groups on different nodes can perform tests in parallel to improve testing efficiency.
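One way to realize this parallelism is to split the test groups into rounds in which no node takes part in more than one group, and to run each round concurrently. The sketch below takes that approach as an assumption; the greedy round planner and the names `plan_rounds`, `run_rounds`, and `test` are hypothetical and are not prescribed by this application.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple, TypeVar

Group = Tuple[str, str]  # node names hosting the two IB NICs of a test group
T = TypeVar("T")


def plan_rounds(groups: List[Group]) -> List[List[Group]]:
    """Greedily place each group in the first round whose nodes are all free."""
    rounds: List[List[Group]] = []
    busy: List[Set[str]] = []
    for group in groups:
        nodes = set(group)
        for i, used in enumerate(busy):
            if not (used & nodes):
                rounds[i].append(group)
                used |= nodes  # mutates the set stored in `busy`
                break
        else:
            rounds.append([group])
            busy.append(set(nodes))
    return rounds


def run_rounds(groups: List[Group], test: Callable[[Group], T]) -> List[T]:
    """Run the groups of each round in parallel, one round after another."""
    results: List[T] = []
    for batch in plan_rounds(groups):
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(test, batch))
    return results
```

With this planner, a group whose two NICs sit on the same node occupies only that node, so it can share a round with any group that does not touch it.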
[0160] Through the above embodiments, the connectivity of the IB network between test group nodes can be tested using the rping command. Packet capture data is saved when the rping test fails, thus accurately diagnosing network faults, including latency and packet loss. Simultaneously, packet capture data is deleted when the rping test succeeds, optimizing storage resource usage and avoiding unnecessary data saving. Furthermore, bidirectional testing within each test group (i.e., service nodes and client nodes switching roles) allows for a comprehensive evaluation of network connectivity between nodes. Recording the first timestamp ensures the time synchronization and integrity of the test data, providing an accurate time reference for network performance analysis.
[0161] Optionally, step S103 includes:
[0162] Step S103-2-1: Run the ibping server command on the service node and the ibping client command on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets.
[0163] Step S103-2-2: When the ibping server command and / or the ibping client command fails to execute, obtain the second packet capture data;
[0164] Step S103-2-3: When the ibping server command and the ibping client command are executed successfully, delete the second packet capture data;
[0165] Step S103-2-4: For each test group, construct corresponding second detection data. The second detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0166] The second detection data includes at least the test results of the test group, the address of the second packet capture data, and the second timestamp; wherein, when the test result indicates that the test was successful, the second detection data also includes the execution time.
[0167] Specifically, as shown in Figure 3, in one embodiment, IB network performance testing uses the ibping command to test the network performance between the service node and the client node, retains or discards the second packet capture data according to the test result, pushes retained second packet capture data to an object storage service (e.g., S3), and constructs second detection data. This process aims to evaluate the performance of the IB network, such as connectivity, and provide detailed data for subsequent IB network performance analysis.
[0168] First, in step S103-2-1, the ibping server command is run on the service node and the ibping client command is run on the client node. At the same time, the ibdump command is executed on both nodes to capture packets, thereby testing the IB network performance between the service node and the client node and capturing the second packet capture data during the test.
[0169] Furthermore, in step S103-2-2, when the ibping server command and / or client command fails to execute, the second packet capture data obtained by executing the ibdump command is acquired and saved, thereby enabling the analysis of possible network problems through the saved second packet capture data when the IB network performance test fails.
[0170] Furthermore, in step S103-2-3, when both the ibping server command and the client command are executed successfully, the captured second packet data is deleted. In this process, since the ibping command test is successful, the second packet data is no longer needed for this test, and storage space can be freed up by deleting the second packet data.
[0171] Furthermore, for each test group, corresponding second detection data is constructed. The second detection data includes detection data obtained with the first node as the service node and the second node as the client node, as well as detection data obtained with the first node as the client node and the second node as the service node. The second detection data includes at least the test results of the test group, the address of the second packet capture data, and the second timestamp. If the test is successful, it also includes the execution time. The test result is categorized as either successful or failed. The address of the second packet capture data is empty when the test is successful; when the test fails, the second packet capture data can be obtained through the address. The second timestamp records the moment the test group performed the ibping test.
[0172] For example, consider two nodes, node1 and node2, configured as a service node and a client node, respectively. The ibping server command is run on node1, and the ibping client command is run on node2. At the same time, the ibdump command is used to capture network packets between the two nodes (i.e., the second packet capture data).
[0173] When the ibping test fails (e.g., due to network latency or packet loss), the data captured by ibdump (i.e., the second packet capture data) will be saved for subsequent analysis of the cause of the failure.
[0174] When the ibping test succeeds, the second packet capture data is deleted because no further analysis is needed.
[0175] Furthermore, for each test group, the test result (success or failure), the storage location of the second packet capture data (the address for obtaining the second packet capture data in case of test failure), and the test timestamp are recorded, and second detection data is generated. If the test is successful, the execution time is also recorded.
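The bidirectional test and the conditional execution-time field can be sketched as follows. This is an illustrative outline only: `probe_both_directions` and `probe` are hypothetical names, and `probe` stands in for whatever code runs the ibping server and client commands and returns the measured time in milliseconds, or `None` on failure.

```python
from typing import Callable, Dict, List, Optional

# probe(service_lid, client_lid) -> execution time in ms, or None on failure
Probe = Callable[[str, str], Optional[float]]


def probe_both_directions(first: str, second: str, probe: Probe) -> List[Dict[str, str]]:
    """Test a group twice, swapping the service and client roles."""
    records: List[Dict[str, str]] = []
    for service, client in ((first, second), (second, first)):
        elapsed_ms = probe(service, client)
        ok = elapsed_ms is not None
        records.append({
            "src": service,   # NIC acting as the service node
            "dst": client,    # NIC acting as the client node
            "method": "ibping",
            "status": "OK" if ok else "Failed",
            # The execution time is recorded only when the test succeeded.
            "time": f"{elapsed_ms:.3f}ms" if ok else "",
        })
    return records
```

Each call returns two records per test group, one for each role assignment.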
[0176] For example, taking the IB network card information in Table 1 as an example, the second detection data contained in the promQL data of some test groups are listed below:
[0177] 1. Test group 5: node1 (0x24) and node2 (0x76):
[0178] The ibping test data for node1 (0x24) as the service node and node2 (0x76) as the client node in test group 5 is as follows:
[0179] "ibpingInfo{src={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"} dst={LID="0x76", device="mxl5_0", nodeName="node2", nodeIP="10.0.2.2"}, method="ibping", status="OK", time="0.012ms", s3URL=""}timestap 2024-11-26 19:01:40";
[0180] In test group 5, the ibping test data for node1 (0x24) as the client node and node2 (0x76) as the service node is as follows:
[0181] "ibpingInfo{src={LID="0x76", device="mxl5_0", nodeName="node2", nodeIP="10.0.2.2"} dst={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"}, method="ibping", status="OK", time="0.012ms", s3URL=""}timestap 2024-11-26 19:01:50";
[0182] 2. Test group 3: node3 (0x82) and node2 (0x80):
[0183] In test group 3, the ibping test data for node3 (0x82) as the service node and node2 (0x80) as the client node is as follows:
[0184] "ibpingInfo{src={LID="0x82", device="mxl5_0", nodeName="node3", nodeIP="10.0.2.3"} dst={LID="0x80", device="mxl5_1", nodeName="node2", nodeIP="10.0.2.2"}, method="ibping", status="Failed", time="", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-25 19:01:40";
[0185] In test group 3, the ibping test data for node2 (0x80) as the service node and node3 (0x82) as the client node is as follows:
[0186] "ibpingInfo{src={LID="0x80", device="mxl5_1", nodeName="node2", nodeIP="10.0.2.2"} dst={LID="0x82", device="mxl5_0", nodeName="node3", nodeIP="10.0.2.3"}, method="ibping", status="Failed", time="", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-25 19:01:50";
[0187] 3. Test group 1: node3 (0x82) and node1 (0x24):
[0188] In test group 1, the ibping test data for node3 (0x82) as the service node and node1 (0x24) as the client node is as follows:
[0189] "ibpingInfo{src={LID="0x82", device="mxl5_2", nodeName="node3", nodeIP="10.0.2.3"} dst={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"}, method="ibping", status="Failed", time="", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-20 19:01:40";
[0190] In test group 1, the ibping test data for node1 (0x24) as the service node and node3 (0x82) as the client node is as follows:
[0191] "ibpingInfo{src={LID="0x24", device="mxl5_0", nodeName="node1", nodeIP="10.0.2.1"} dst={LID="0x82", device="mxl5_2", nodeName="node3", nodeIP="10.0.2.3"}, method="ibping", status="Failed", time="", s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-20 19:01:40";
[0192] As the second detection data in the promQL records above shows, each test group includes two sets of second detection data for the ibping test. Here, src represents the IB network interface card (NIC) acting as the service node, dst represents the IB NIC acting as the client node, LID represents the LID of the IB NIC, device represents the device name of the IB NIC, nodeName represents the name of the node, nodeIP represents the IP address of the node, and timestap represents the second timestamp.
[0193] In the ibping detection data, the status field indicates whether the ibping test succeeded or failed: "OK" indicates success and "Failed" indicates failure. When the test succeeds, the time field can be used to check whether the execution time of the test group is reasonable. When the status field is "Failed", the second packet capture data can be obtained from the address specified in s3URL to analyze the cause of the error.
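A check of whether the recorded execution time is reasonable could look like the sketch below. The threshold `max_ms`, the accepted units, and the function names are assumptions for illustration; this application does not specify what counts as a reasonable time.

```python
import re
from typing import Dict, Optional

_TIME_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(us|ms|s)\s*$")
_TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}


def parse_time_ms(text: str) -> Optional[float]:
    """Convert a value such as "0.012ms" to milliseconds; None if unparsable."""
    match = _TIME_RE.match(text)
    if match is None:
        return None
    return float(match.group(1)) * _TO_MS[match.group(2)]


def time_is_reasonable(record: Dict[str, str], max_ms: float) -> bool:
    """True only for a successful test whose time is at most max_ms."""
    if record.get("status") != "OK":
        return False
    elapsed = parse_time_ms(record.get("time", ""))
    return elapsed is not None and elapsed <= max_ms
```

A failed test carries an empty time field, so it is never reported as reasonable and is left to the packet capture analysis instead.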
[0194] In addition, test groups on different nodes can perform tests in parallel to improve testing efficiency.
[0195] Through the above embodiments, the solution can save the second packet capture data when the ibping test fails, enabling accurate diagnosis of network performance issues, including insufficient bandwidth and high latency. It also deletes the packet capture data when the ibping test succeeds, optimizing storage resource usage and avoiding unnecessary data saving. By conducting bidirectional testing in each test group (i.e., the service node and client node switch roles), the network performance between nodes can be comprehensively evaluated. Furthermore, recording a second timestamp ensures the time synchronization and integrity of the test data, providing an accurate time reference for network performance analysis. This allows for timely detection and response to network performance issues, improving the overall performance and stability of the IB network.
[0196] Optionally, step S103 includes:
[0197] Step S103-3-1: Run the data transmission server command on the service node and the data transmission client command on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets.
[0198] Step S103-3-2: When the data transmission server command and / or the data transmission client command fail to execute, obtain the third packet capture data;
[0199] Step S103-3-3: When the data transmission server command and the data transmission client command are executed successfully, delete the third packet capture data;
[0200] Step S103-3-4: For each test group, construct corresponding third detection data. The third detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0201] The third detection data includes at least the test results of the test group, the address of the third packet capture data, and the third timestamp; wherein, when the test result indicates that the test is successful, the third detection data also includes bandwidth information and latency information of the data transmission process during the test.
[0202] Specifically, as shown in Figure 3, in one embodiment, each test group can also be tested using data transmission commands (i.e., data transmission server commands and data transmission client commands). For example, the server and client command sets corresponding to ib_send, ib_write, and ib_read can be used for testing.
[0203] First, in step S103-3-1, the data transmission server command is run on the service node and the data transmission client command is run on the client node, and the ibdump command is executed on both the service node and the client node to capture packets, thus obtaining the third packet capture data.
[0204] In step S103-3-2, when the data transmission server command and / or data transmission client command fail to execute, the third packet capture data is obtained; otherwise, step S103-3-3 is executed, where the third packet capture data is deleted to release storage space when the data transmission server command and data transmission client command execute successfully.
[0205] Therefore, the third detection data includes at least the test results of the test group, the address of the third packet capture data, and the third timestamp; when the test result indicates that the test is successful, the third detection data also includes bandwidth and latency information of the data transmission process during the test, and the third timestamp indicates the time information of the data transmission test process.
[0206] In step S103-3-4, for each test group, corresponding third detection data is constructed. The execution information of the data transmission command of each test group is recorded in the third detection data. The third detection data includes: the detection data obtained with the first node in the test group as the service node and the second node as the client node, and the detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0207] For example, taking the IB network card information in Table 1 and the executed data transmission command as ib_send_bw (i.e., the service node sends data to the client node and tests the bandwidth information during the data transmission process), the third detection data contained in the promQL data of some test groups are listed below:
[0208] 1. Test group 5: node1 (0x24) and node2 (0x76):
[0209] In test group 5, the test data for ib_send_bw, where node1 (0x24) acts as the service node and node2 (0x76) acts as the client node, is as follows:
[0210] "ibinfo{src={LID="0x24",device="mxl5_0",nodeName="node1",nodeIP="10.0.2.1"},dst={LID="0x76",device="mxl5_0",nodeName="node2",nodeIP="10.0.2.2"},method="ib_send_bw",status="OK",bw="25GB/s",s3URL=""}timestap 2024-11-26 19:07:17";
[0211] In test group 5, the test data for ib_send_bw, where node1 (0x24) acts as the client node and node2 (0x76) acts as the service node, is as follows:
[0212] "ibinfo{src={LID="0x76",device="mxl5_0",nodeName="node2",nodeIP="10.0.2.2"},dst={LID="0x24",device="mxl5_0",nodeName="node1",nodeIP="10.0.2.1"},method="ib_send_bw",status="OK",bw="25GB/s",s3URL=""}timestap 2024-11-26 19:08:17";
[0213] 2. Test group 3: node2 (0x76) and node2 (0x80):
[0214] In test group 3, the test data for ib_send_bw, where node2 (0x76) acts as the service node and node2 (0x80) acts as the client node, is as follows:
[0215] "ibinfo{src={LID="0x76",device="mxl5_0",nodeName="node2",nodeIP="10.0.2.2"},dst={LID="0x80",device="mxl5_1",nodeName="node2",nodeIP="10.0.2.2"},method="ib_send_bw",status="Failed",bw="",s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-26 19:07:17";
[0216] In test group 3, the test data for ib_send_bw, where node2 (0x76) is the client node and node2 (0x80) is the service node, is as follows:
[0217] "ibinfo{src={LID="0x80",device="mxl5_1",nodeName="node2",nodeIP="10.0.2.2"},dst={LID="0x76",device="mxl5_0",nodeName="node2",nodeIP="10.0.2.2"},method="ib_send_bw",status="Failed",bw="",s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-26 19:08:17";
[0218] 3. Test group 1: node3 (0x82) and node1 (0x24):
[0219] In test group 1, the test data for ib_send_bw, where node3 (0x82) acts as the service node and node1 (0x24) acts as the client node, is as follows:
[0220] "ibinfo{src={LID="0x82",device="mxl5_0",nodeName="node3",nodeIP="10.0.2.3"},dst={LID="0x24",device="mxl5_0",nodeName="node1",nodeIP="10.0.2.1"},method="ib_send_bw",status="Failed",bw="",s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-26 19:07:17";
[0221] In test group 1, the test data for ib_send_bw, where node3 (0x82) acts as the client node and node1 (0x24) acts as the service node, is as follows:
[0222] "ibinfo{src={LID="0x24",device="mxl5_0",nodeName="node1",nodeIP="10.0.2.1"},dst={LID="0x82",device="mxl5_0",nodeName="node3",nodeIP="10.0.2.3"},method="ib_send_bw",status="Failed",bw="",s3URL="http://s3endpoint/bucket/xxx.cap"}timestap 2024-11-26 19:08:17".
[0223] The third detection data from each PromQL test group shows that each test group includes two sets of third detection data for the ib_send_bw test. Here, src represents the IB network interface card (NIC) acting as the service node, dst represents the IB NIC acting as the client node, LID represents the LID of the IB NIC, device represents the device name of the IB NIC, nodeName represents the name of the node, and nodeIP represents the IP address of the node. timestap represents the third timestamp, which is used to indicate the time information of the ib_send_bw test.
[0224] In the ib_send_bw detection data, the status field indicates whether the ib_send_bw test succeeded or failed: "OK" indicates success and "Failed" indicates failure. When the status field is "Failed", the third packet capture data can be obtained from the address specified in s3URL to analyze the cause of the error.
[0225] The testing process for other data transmission commands is similar. For example, ib_send_lat can test the latency with which the client node receives data sent by the service node, and ib_read_bw can test the bandwidth with which the service node reads data from the client node.
[0226] In addition, test groups on different nodes can perform tests in parallel to improve testing efficiency.
[0227] Through the above embodiments, it can be seen from the third detection data that the execution result of ib_send_bw is either successful or failed (status field). When the status field is OK, the bandwidth of the network card can be checked from the bw field. When the status field is Failed, the packet capture result and error reason can be obtained from the specified address s3URL.
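The interpretation of the third detection data described here can be sketched as a small classifier. The expected bandwidth, the 90% acceptance ratio, the accepted units (GB/s and MB/s), and the returned labels are assumptions chosen for illustration only.

```python
import re
from typing import Dict

_BW_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(GB|MB)\s*/\s*s\s*$")


def classify_bandwidth(
    record: Dict[str, str],
    expected_gb_s: float,
    min_ratio: float = 0.9,
) -> str:
    """Return "ok", "degraded", "unknown", or a "failed: ..." hint."""
    if record.get("status") != "OK":
        # On failure the bw field is empty; point to the saved capture instead.
        return "failed: fetch capture from " + record.get("s3URL", "")
    match = _BW_RE.match(record.get("bw", ""))
    if match is None:
        return "unknown"
    gb_s = float(match.group(1))
    if match.group(2) == "MB":
        gb_s /= 1000.0
    return "ok" if gb_s >= expected_gb_s * min_ratio else "degraded"
```

For example, a record with `bw="25GB/s"` is classified as "ok" against an expected 25 GB/s, while `bw="10GB/s"` is classified as "degraded".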
[0228] Optionally, the method further includes:
[0229] Step S104: Push the detection data to Grafana for visualization analysis; and / or
[0230] Step S105: When the detection data indicates an IB network anomaly, prompt the administrator terminal to handle the issue via the Alertmanager alarm push function.
[0231] Specifically, as shown in Figure 2, in one embodiment, since the detection data of each test group includes timestamps, the detection data can be stored as Prometheus time-series data. Therefore, the detection data can be pushed to the alarm platform through Prometheus and further visualized and analyzed through Grafana. By configuring the Alertmanager alarm push function, the detection data can also be pushed as alerts to WeChat or DingTalk, thereby prompting the administrator terminal to handle the fault in a timely manner.
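As an illustration of the alerting step, the sketch below turns failed detection records into the alert objects accepted by the Alertmanager v2 API (a JSON array in which each alert carries `labels`, optional `annotations`, and an RFC 3339 `startsAt`). In a typical deployment the alerts would instead be produced by Prometheus alerting rules; the alert name, severity, label names, and record layout used here are assumptions, and the sketch only builds the payload without sending it.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def build_alerts(records: List[Dict[str, str]], now: datetime) -> List[Dict]:
    """Build one alert per failed record; successful records are skipped."""
    starts_at = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    alerts: List[Dict] = []
    for rec in records:
        if rec.get("status") == "OK":
            continue
        alerts.append({
            "labels": {
                "alertname": "IBNetworkTestFailed",  # hypothetical name
                "severity": "critical",              # hypothetical severity
                "method": rec["method"],
                "src_lid": rec["src"],
                "dst_lid": rec["dst"],
            },
            "annotations": {
                "summary": f'{rec["method"]} failed from {rec["src"]} to {rec["dst"]}',
                "capture": rec.get("s3URL", ""),
            },
            "startsAt": starts_at,
        })
    return alerts


def to_request_body(alerts: List[Dict]) -> str:
    """Serialize the alerts as the JSON array expected by the alerts endpoint."""
    return json.dumps(alerts)
```

Carrying the capture address in an annotation lets the receiving administrator open the saved packet capture directly from the notification.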
[0232] Through the technical solutions of the above embodiments, this application can be combined with the Kubernetes platform, and by utilizing Kubernetes' service discovery function, service orchestration, and rich software ecosystem (monitoring), it is possible to conveniently build an end-to-end Kubernetes-based IB network inspection method.
[0233] Specifically, an agent is first deployed on each node using a daemonSet to collect information about each node and the IB network interface card. This information is then shuffled using a shuffle algorithm, with different combinations used for each inspection, thereby increasing the probability of detecting faults.
[0234] Furthermore, by running the commands rping, ibping, ib_read_bw, and ib_read_lat sequentially on every two IB network cards, and through periodic execution of the test, the system can progressively inspect for potential IB network faults. It can also perform ibdump packet capture on the faulty IB network cards and perform in-depth analysis of the root causes of IB network card (or IB network) faults based on the fault logs and packet capture results, thereby achieving early warning and early detection.
[0235] The two core steps above address the challenge of locating IB network faults in multi-machine, multi-GPU distributed training scenarios: Regular inspections detect potential IB network faults, providing early warning and prompt intervention. When a fault is detected, packet capture using ibdump preserves the fault scene, offering valuable insights for quickly pinpointing the cause. When a fault occurs, alerts are sent to platform administrators via Prometheus, Grafana, and AlertManager for timely handling. Error logs and captured IB packets facilitate easy analysis of the root cause. These steps reduce the probability of IB network card failures during large model training, improving the stability and robustness of the IB network.
[0236] Based on the same inventive concept, another embodiment of this application also provides an IB network detection device. Figure 4 is a schematic diagram of the framework of an IB network detection device provided in one embodiment of this application. As shown in Figure 4, the device includes:
[0237] The acquisition module 11 is used to acquire multiple node information of multiple nodes, wherein the node information includes at least the IB network card information of the IB network card configured on the node;
[0238] The building module 12 is used to construct multiple test groups based on the IB network interface card information of the multiple nodes, wherein the test groups include service nodes and client nodes;
[0239] The detection module 13 is used to traverse the multiple test groups, perform various types of IB network detection on each traversed test group through various commands, and obtain the detection data of each test group.
[0240] Optionally, the acquisition module 11 includes:
[0241] The data collection unit is used to deploy proxy services on the multiple nodes and collect multiple node information of the multiple nodes through the proxy services.
[0242] The node information includes: IP address, host name, and IB network card information corresponding to the IB network card configured on the node. The IB network card information includes: the number of IB network cards, the LID number of the IB network card, and the device name of the IB network card.
[0243] Optionally, the building module 12 includes:
[0244] The hash dictionary construction unit is used to construct a hash dictionary for multiple IB network interface cards (NICs) based on the node information of the multiple nodes. The key of the hash dictionary is the LID number of the IB NIC of the node, and the value of the hash dictionary is the LID number of the IB NIC of the node, the device name of the IB NIC, and the IP address.
[0245] An array construction unit is used to construct an array based on multiple keys in the hash dictionary and to shuffle the array using a shuffle algorithm.
[0246] The test group building unit is used to construct multiple test groups by taking multiple keys in the shuffled array and using each IB network card as a group with any other IB network card.
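The three units of the building module can be sketched together as follows. The input layout (`nodes` as a list of dictionaries), the function names, and in particular the pairing rule (adjacent keys of the shuffled array form a group, with a leftover key paired with the first key) are assumptions made for illustration; this application only states that each IB network card is grouped with another one after shuffling.

```python
import random
from typing import Dict, List, Optional, Tuple


def build_card_index(nodes: List[Dict]) -> Dict[str, Dict[str, str]]:
    """Hash dictionary keyed by LID; value holds LID, device name, and IP."""
    index: Dict[str, Dict[str, str]] = {}
    for node in nodes:
        for card in node["cards"]:
            index[card["LID"]] = {
                "LID": card["LID"],
                "device": card["device"],
                "nodeIP": node["nodeIP"],
            }
    return index


def build_test_groups(
    index: Dict[str, Dict[str, str]],
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Shuffle the keys and pair them so every card appears in a group."""
    if rng is None:
        rng = random.Random()
    keys = list(index)
    rng.shuffle(keys)  # a different order, hence different pairs, on each run
    groups = [(keys[i], keys[i + 1]) for i in range(0, len(keys) - 1, 2)]
    if len(keys) > 1 and len(keys) % 2 == 1:
        # Odd number of cards: pair the leftover card with the first one.
        groups.append((keys[-1], keys[0]))
    return groups
```

Because the key order changes on every inspection, repeated runs exercise different card pairs, which is what raises the chance of hitting a faulty path.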
[0247] Optionally, the detection module 13 includes:
[0248] The first execution unit is used to run the rping server command on the service node and the rping client command on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets.
[0249] The first packet capture data acquisition unit is used to acquire first packet capture data when the rping server command and / or the rping client command fails to execute.
[0250] The first deletion unit is used to delete the first packet capture data when the rping server command and the rping client command are executed successfully;
[0251] The first detection data construction unit is used to construct corresponding first detection data for each test group. The first detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0252] The first detection data includes at least the test results of the test group, the address of the first packet capture data, and the first timestamp.
[0253] Optionally, the detection module 13 includes:
[0254] The second execution unit is used to run the ibping server command on the service node and the ibping client command on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets.
[0255] The second packet capture data acquisition unit is used to acquire second packet capture data when the ibping server command and / or the ibping client command fails to execute.
[0256] The second deletion unit is used to delete the second packet capture data when the ibping server command and the ibping client command are executed successfully;
[0257] The second detection data construction unit is used to construct corresponding second detection data for each test group. The second detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0258] The second detection data includes at least the test results of the test group, the address of the second packet capture data, and the second timestamp; wherein, when the test result indicates that the test was successful, the second detection data also includes the execution time.
[0259] Optionally, the detection module 13 includes:
[0260] The third execution unit is used to run data transmission server commands on the service node and data transmission client commands on the client node, and simultaneously execute the ibdump command on both the service node and the client node to capture packets.
[0261] The third packet capture data acquisition unit is used to acquire third packet capture data when the data transmission server command and / or the data transmission client command fails to execute.
[0262] The third deletion unit is used to delete the third packet capture data when the data transmission server command and the data transmission client command are executed successfully.
[0263] The third detection data construction unit is used to construct corresponding third detection data for each test group. The third detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node.
[0264] The third detection data includes at least the test results of the test group, the address of the third packet capture data, and the third timestamp; wherein, when the test result indicates that the test is successful, the third detection data also includes bandwidth information and latency information of the data transmission process during the test.
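As a non-limiting illustration, the cooperation of the third execution, acquisition, deletion, and construction units may be sketched as follows. The sketch makes several assumptions: the server/client command pair is abstracted as an injected callable (`run_pair`, hypothetical) so that no real command line is invoked, the ibdump capture is simulated by creating an empty file, and the metric names `bandwidth_gbps` and `latency_us` are placeholders.

```python
import os
import tempfile
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical callable: runs the server/client command pair for one direction
# and returns (succeeded, metrics). It is injected so the sketch runs without
# IB hardware or real commands.
RunPair = Callable[[str, str], Tuple[bool, Dict[str, float]]]


def check_direction(service: str, client: str, run_pair: RunPair, capture_dir: str) -> dict:
    """Test one direction and build its detection record."""
    # Stand-in for the ibdump capture: create the file a capture would write to.
    capture_path = os.path.join(capture_dir, f"{service}_to_{client}.pcap")
    open(capture_path, "wb").close()

    ok, metrics = run_pair(service, client)
    record = {
        "service_node": service,
        "client_node": client,
        "success": ok,
        "timestamp": time.time(),
        "capture_path": None,
    }
    if ok:
        # Success: the capture is not needed, so delete it and keep the metrics.
        os.remove(capture_path)
        record.update(metrics)
    else:
        # Failure: keep the capture and record where it is stored.
        record["capture_path"] = capture_path
    return record


def check_group(first: str, second: str, run_pair: RunPair, capture_dir: str) -> List[dict]:
    """Test both directions of a test group by swapping the roles."""
    return [
        check_direction(first, second, run_pair, capture_dir),
        check_direction(second, first, run_pair, capture_dir),
    ]


if __name__ == "__main__":
    # Fake runner: succeeds only when node-a is the service node.
    def fake(service: str, client: str) -> Tuple[bool, Dict[str, float]]:
        return service == "node-a", {"bandwidth_gbps": 90.0, "latency_us": 2.0}

    with tempfile.TemporaryDirectory() as d:
        for rec in check_group("node-a", "node-b", fake, d):
            print(rec["service_node"], rec["success"], rec["capture_path"])
```

Starting the capture before the commands run and deciding afterwards whether to keep it means that packet data is available for every failure, while successful runs leave no capture files behind.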
[0265] Optionally, the device further includes:
[0266] The visualization analysis module is used to push the detection data to Grafana for visualization analysis; and / or
[0267] The alarm module is used to prompt the administrator terminal, via the Alertmanager alert push function, to handle the anomaly when the detection data indicates an IB network anomaly.
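As a non-limiting illustration, the alarm module may convert failed detection records into alert objects before pushing them. The sketch below only builds the payload and performs no network call. It assumes the alert shape accepted by the Alertmanager v2 API (a list of objects with `labels`, `annotations`, and `startsAt`); the alert name `IBNetworkTestFailed`, the label names, and the record keys are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import List


def build_alerts(records: List[dict]) -> List[dict]:
    """Turn failed detection records into Alertmanager-style alert objects."""
    alerts = []
    for rec in records:
        if rec["success"]:
            continue  # only anomalies are pushed
        alerts.append({
            "labels": {
                "alertname": "IBNetworkTestFailed",  # hypothetical alert name
                "service_node": rec["service_node"],
                "client_node": rec["client_node"],
            },
            "annotations": {
                # Where the retained packet capture can be found, if any.
                "capture_path": rec.get("capture_path") or "",
            },
            # RFC 3339 timestamp derived from the record's timestamp (UTC).
            "startsAt": datetime.fromtimestamp(
                rec["timestamp"], tz=timezone.utc
            ).isoformat(),
        })
    return alerts


if __name__ == "__main__":
    sample = [
        {"service_node": "node-a", "client_node": "node-b",
         "success": True, "timestamp": 0.0, "capture_path": None},
        {"service_node": "node-b", "client_node": "node-a",
         "success": False, "timestamp": 0.0, "capture_path": "/tmp/b_a.pcap"},
    ]
    # The resulting JSON body is what would be sent to the alert endpoint.
    print(json.dumps(build_alerts(sample), indent=2))
```

Keeping the payload construction separate from the transport makes it straightforward to verify which records raise alerts without a running Alertmanager instance.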
[0268] Based on the same inventive concept, another embodiment of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the IB network detection method as described in any of the above embodiments.
[0269] Referring to Figure 5, which is a schematic diagram of an electronic device provided in an embodiment of this application, the electronic device 500 includes a memory 510 and a processor 520. The memory 510 and the processor 520 are communicatively connected via a bus. The memory 510 stores a computer program that can run on the processor 520 to implement the steps of the IB network detection method disclosed in the above embodiments of this application.
[0270] Based on the same inventive concept, another embodiment of this application also provides a computer program product, including a computer program that, when executed by a processor, implements the IB network detection method described in any of the above embodiments.
[0271] Based on the same inventive concept, another embodiment of this application provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the IB network detection method as described in any of the above embodiments.
[0272] The above provides a detailed description of the IB network detection method, apparatus, device, and medium provided in this application. Specific examples have been used to illustrate the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method and core ideas of this application. Meanwhile, those skilled in the art may, based on the ideas of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.
Claims
1. An IB network detection method, characterized in that the method includes: collecting node information from multiple nodes, wherein the node information includes at least the IB network card information of the IB network cards configured on the nodes; constructing multiple test groups based on the IB network card information of the multiple nodes, wherein each test group includes a service node and a client node; and traversing the multiple test groups, performing multiple types of IB network tests on each traversed test group using multiple commands, and obtaining the test data of each test group; wherein collecting the node information from the multiple nodes includes: deploying proxy services to the multiple nodes, and collecting the node information of the multiple nodes through the proxy services; the node information includes: an IP address, a host name, and the IB network card information corresponding to the IB network cards configured on the node; and the IB network card information includes: the number of IB network cards, the LID number of each IB network card, and the device name of each IB network card.
2. The IB network detection method according to claim 1, characterized in that constructing multiple test groups based on the IB network card information of the multiple nodes includes: constructing, based on the node information of the multiple nodes, a hash dictionary for the multiple IB network cards, wherein each key of the hash dictionary is the LID number of an IB network card of a node, and the corresponding value is the LID number of that IB network card, the device name of that IB network card, and the IP address; constructing an array from the keys in the hash dictionary, and shuffling the array using a shuffle algorithm; and constructing the multiple test groups from the keys in the shuffled array by grouping the keys in pairs, so that each IB network card can be tested against any other IB network card.
3. The IB network detection method according to claim 2, characterized in that traversing the multiple test groups, performing multiple types of IB network tests on each traversed test group using multiple commands, and obtaining the test data of each test group includes: running the rping server command on the service node and the rping client command on the client node, and simultaneously executing the ibdump command on both the service node and the client node to capture packets; when the rping server command and / or the rping client command fails to execute, obtaining the first packet capture data; when the rping server command and the rping client command are executed successfully, deleting the first packet capture data; and constructing corresponding first detection data for each test group, wherein the first detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node; and the first detection data includes at least the test result of the test group, the address of the first packet capture data, and the first timestamp.
4. The IB network detection method according to claim 2, characterized in that traversing the multiple test groups, performing multiple types of IB network tests on each traversed test group using multiple commands, and obtaining the test data of each test group includes: running the ibping server command on the service node and the ibping client command on the client node, and simultaneously executing the ibdump command on both the service node and the client node to capture packets; when the ibping server command and / or the ibping client command fails to execute, obtaining the second packet capture data; when the ibping server command and the ibping client command are executed successfully, deleting the second packet capture data; and constructing corresponding second detection data for each test group, wherein the second detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node; and the second detection data includes at least the test result of the test group, the address of the second packet capture data, and the second timestamp, wherein, when the test result indicates that the test was successful, the second detection data also includes the execution time.
5. The IB network detection method according to claim 2, characterized in that traversing the multiple test groups, performing multiple types of IB network tests on each traversed test group using multiple commands, and obtaining the test data of each test group includes: running the data transmission server command on the service node and the data transmission client command on the client node, and simultaneously executing the ibdump command on both the service node and the client node to capture packets; when the data transmission server command and / or the data transmission client command fails to execute, obtaining the third packet capture data; when the data transmission server command and the data transmission client command are executed successfully, deleting the third packet capture data; and constructing corresponding third detection data for each test group, wherein the third detection data includes: detection data obtained with the first node in the test group as the service node and the second node as the client node, and detection data obtained with the first node in the test group as the client node and the second node as the service node; and the third detection data includes at least the test result of the test group, the address of the third packet capture data, and the third timestamp, wherein, when the test result indicates that the test was successful, the third detection data also includes bandwidth information and latency information of the data transmission process during the test.
6. The IB network detection method according to any one of claims 1-5, characterized in that the method further includes: pushing the detection data to Grafana for visualization and analysis; and / or, when the detection data indicates an IB network anomaly, prompting the administrator terminal, via the Alertmanager alert push function, to handle the anomaly.
7. An IB network detection device, characterized in that the device includes: an acquisition module, used to acquire node information from multiple nodes, wherein the node information includes at least the IB network card information of the IB network cards configured on the nodes; a construction module, used to construct multiple test groups based on the IB network card information of the multiple nodes, wherein each test group includes a service node and a client node; and a detection module, used to traverse the multiple test groups, perform multiple types of IB network detection on each traversed test group using multiple commands, and obtain the detection data of each test group; wherein the acquisition module includes: an acquisition unit, used to deploy proxy services on the multiple nodes and collect the node information of the multiple nodes through the proxy services; the node information includes: an IP address, a host name, and the IB network card information corresponding to the IB network cards configured on the node; and the IB network card information includes: the number of IB network cards, the LID number of each IB network card, and the device name of each IB network card.
8. An electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the IB network detection method as described in any one of claims 1-6.
9. A computer-readable storage medium, characterized in that it stores a computer program, wherein the computer program, when executed by a processor, implements the IB network detection method as described in any one of claims 1-6.