Congestion control method and apparatus
By assigning a trust value to each data stream in the AI network and adjusting the sending window based on congestion information, the shortcomings of congestion control algorithms under packet spraying technology are addressed, achieving efficient bandwidth management and performance optimization for AI networks.
Patent Information
- Authority / Receiving Office: WO · WO
- Patent Type: Applications
- Current Assignee / Owner: NEW H3C TECH CO LTD
- Filing Date: 2024-12-09
- Publication Date: 2026-06-18
AI Technical Summary
Existing congestion control algorithms cannot effectively handle the load balancing requirements introduced by packet spraying technology in AI networks. As a result, once a path becomes congested, the communication rate of all data streams decreases, and the overall performance is severely degraded.
The receiving-end network card assigns a trust value to each data stream, and the sending-end network card determines the initial sending window from that value and then makes fine-grained adjustments based on the current congestion information of the AI network. Data packets are sent in a packet spraying manner. Under slight congestion, the sending window is refreshed a second time (secondary refresh) using probe flows; under severe congestion, it is refreshed a third time (tertiary refresh) using ECN-based congestion signals.
This enables sensitive rate adjustment under minor congestion in AI networks, which avoids overall performance degradation and improves bandwidth performance, and it provides coarse-grained control to relieve severe congestion.
Abstract
Description
A congestion control method and apparatus
Technical Field
[0001] This application relates to the field of communication technology, and in particular to a congestion control method and apparatus.
Background Technology
[0002] With the rise of large-scale artificial intelligence (AI) models, large-scale AI training tasks often require tens of thousands of network interface cards (NICs). This places higher demands on AI networks, including ultra-large-scale networking, ultra-high port bandwidth, extreme load balancing, and efficient congestion control.
[0003] Currently, packet spraying technology can meet the extreme load balancing requirements in AI networks. However, existing congestion control algorithms are all data flow-oriented. When packet spraying is used for load balancing, the packets of every data flow are spread across all paths, so if one path becomes congested, every data flow is affected and the communication rate of all data flows decreases.
Summary of the Invention
[0004] The purpose of this application is to provide a congestion control method and apparatus to improve the bandwidth performance of AI networks. The specific technical solution is as follows:
[0005] In a first aspect, embodiments of this application provide a congestion control method applied to a transmitting network interface card (NIC) in an AI network, the method comprising:
[0006] Determine the initial sending window corresponding to each data stream based on the trust value assigned to each data stream by the receiving network card;
[0007] According to the initial sending window corresponding to each data stream, the data packets included in each data stream are sent to the receiving network card in a packet spraying manner;
[0008] Detect the current congestion information of the AI network;
[0009] If the current congestion information indicates that the AI network is in a first congestion state, the initial sending window corresponding to each data stream is adjusted according to the current congestion information to obtain a secondary refresh sending window corresponding to each data stream, wherein the parameter used to characterize the first congestion state is greater than the first parameter threshold and less than the second parameter threshold.
[0010] According to the secondary refresh sending window corresponding to each data stream, the data packets included in each data stream are sent to the receiving network card in a packet spraying manner.
[0011] In some embodiments, the method further includes:
[0012] If the current congestion information indicates that the AI network is in a second congestion state, then according to the congestion signal sent by the receiving network card, the initial sending window corresponding to each data stream is adjusted to obtain a tertiary refresh sending window corresponding to each data stream, wherein the parameter used to characterize the second congestion state is greater than or equal to the second parameter threshold.
[0013] According to the tertiary refresh sending window corresponding to each data stream, the data packets included in each data stream are sent to the receiving network card in a packet spraying manner.
[0014] In some embodiments, the step of detecting the current congestion information of the AI network includes:
[0015] Detect the current congestion information of each path between the sending network card and the receiving network card in the AI network.
[0016] In some embodiments, the step of detecting the current congestion information of each path between the transmitting network interface card (NIC) and the receiving network interface card (NIC) in the AI network includes:
[0017] Based on the path information of each path between the sending network card and the receiving network card in the AI network, construct the probe flow corresponding to each path;
[0018] Along each path, the probe packets included in the probe flow corresponding to that path are sent to the receiving network card to obtain the current congestion information for each path.
[0019] In some embodiments, before constructing the probe flow corresponding to each path, the method further includes:
[0020] Receive the path information of each path between the sending network card and the receiving network card, as issued by the controller in the AI network. The controller determines the path information of each path based on the communication relationship between the network cards reported by the communication component.
[0021] In some embodiments, the step of adjusting the initial sending window corresponding to each data stream based on the current congestion information to obtain a second refresh sending window corresponding to each data stream includes:
[0022] Based on the current congestion information, determine the congestion factor of each path between the sending network card and the receiving network card in the AI network;
[0023] The real-time total bandwidth of the AI network is obtained by summing the products of the congestion factor and the path bandwidth for each path.
[0024] The initial sending window for each data stream is adjusted based on the real-time total bandwidth to obtain a secondary refresh sending window for each data stream.
[0025] In some embodiments, the step of adjusting the initial sending window corresponding to each data stream according to the real-time total bandwidth to obtain the secondary refresh sending window corresponding to each data stream includes:
[0026] Calculate the ratio of the real-time total bandwidth to the static bandwidth to obtain the window coefficient;
[0027] Calculate the product of the initial sending window corresponding to each data stream and the window coefficient to obtain the secondary refresh sending window corresponding to each data stream.
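The calculation in paragraphs [0022] to [0027] can be sketched as follows. This is only an illustration under stated assumptions: the names Path, realtime_total_bandwidth, window_coefficient and secondary_refresh_windows are hypothetical, the congestion factor is assumed to lie between 0 and 1 (1 meaning uncongested), and the static bandwidth is assumed to be the sum of the nominal path bandwidths, none of which is fixed by the text above.

```python
from dataclasses import dataclass


@dataclass
class Path:
    bandwidth: float          # nominal path bandwidth (any consistent unit)
    congestion_factor: float  # assumption: value in [0, 1], 1.0 = uncongested


def realtime_total_bandwidth(paths):
    """Sum of (congestion factor * path bandwidth) over all paths."""
    return sum(p.congestion_factor * p.bandwidth for p in paths)


def window_coefficient(paths):
    """Ratio of the real-time total bandwidth to the static bandwidth."""
    # Assumption: static bandwidth is the sum of nominal path bandwidths.
    static_bandwidth = sum(p.bandwidth for p in paths)
    return realtime_total_bandwidth(paths) / static_bandwidth


def secondary_refresh_windows(initial_windows, paths):
    """Scale every data stream's initial sending window by the coefficient."""
    coeff = window_coefficient(paths)
    return {stream: window * coeff for stream, window in initial_windows.items()}


# Four equal paths, one of which is assumed to be half congested.
paths = [Path(100.0, 1.0), Path(100.0, 1.0), Path(100.0, 1.0), Path(100.0, 0.5)]
windows = secondary_refresh_windows({"stream_a": 64.0, "stream_b": 32.0}, paths)
```

With these illustrative numbers the real-time total bandwidth is 350 against a static bandwidth of 400, so every window is scaled by 0.875 rather than being cut for all streams as a flow-oriented algorithm would do.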
[0028] In a second aspect, embodiments of this application provide a congestion control device applied to a transmitting network interface card in an AI network, the device comprising:
[0029] The initial transmission window determination module is used to determine the initial transmission window corresponding to each data stream based on the trust value assigned to each data stream by the receiving end network card;
[0030] The packet spraying module is used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the initial sending window corresponding to each data stream.
[0031] A congestion information detection module is used to detect the current congestion information of the AI network;
[0032] The secondary refresh sending window determination module is used to adjust the initial sending window corresponding to each data stream according to the current congestion information if the current congestion information indicates that the AI network is in the first congestion state, so as to obtain the secondary refresh sending window corresponding to each data stream. The parameter used to characterize the first congestion state is greater than the first parameter threshold, and the parameter used to characterize the first congestion state is less than the second parameter threshold.
[0033] The packet spraying module is also used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the secondary refresh sending window corresponding to each data stream.
[0034] In some embodiments, the apparatus further includes:
[0035] The tertiary refresh sending window determination module is used to adjust the initial sending window corresponding to each data stream according to the congestion signal sent by the receiving end network card if the current congestion information indicates that the AI network is in a second congestion state, so as to obtain a tertiary refresh sending window corresponding to each data stream, wherein the parameter used to characterize the second congestion state is greater than or equal to the second parameter threshold.
[0036] The packet spraying module is also used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the tertiary refresh sending window corresponding to each data stream.
[0037] In some embodiments, the congestion information detection module is specifically used for:
[0038] Detect the current congestion information of each path between the sending network card and the receiving network card in the AI network.
[0039] In some embodiments, the congestion information detection module is specifically used for:
[0040] Based on the path information of each path between the sending network card and the receiving network card in the AI network, a probe flow corresponding to each path is constructed; along each path, probe packets included in the probe flow corresponding to each path are sent to the receiving network card to obtain the current congestion information of each path.
[0041] In some embodiments, the apparatus further includes:
[0042] The path information receiving module is used to receive the path information of each path between the sending end network card and the receiving end network card issued by the controller in the AI network before constructing the probe flow corresponding to each path. The controller determines the path information of each path according to the communication relationship between the network cards reported by the communication component.
[0043] In some embodiments, the secondary refresh sending window determination module is specifically used for:
[0044] Based on the current congestion information, determine the congestion factor of each path between the sending network card and the receiving network card in the AI network; calculate the sum of the products of the congestion factor and the path bandwidth of each path to obtain the real-time total bandwidth of the AI network; adjust the initial sending window corresponding to each data stream according to the real-time total bandwidth to obtain the secondary refresh sending window corresponding to each data stream.
[0045] In some embodiments, the secondary refresh sending window determination module is specifically used for:
[0046] Calculate the ratio of the real-time total bandwidth to the static bandwidth to obtain the window coefficient; calculate the product of the initial sending window corresponding to each data stream and the window coefficient to obtain the secondary refresh sending window corresponding to each data stream.
[0047] In a third aspect, embodiments of this application provide a server including a network interface card (NIC) and a machine-readable storage medium, wherein the machine-readable storage medium stores machine-executable instructions that can be executed by the NIC, and the machine-executable instructions cause the NIC to implement any of the methods provided in the first aspect.
[0048] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program which, when executed by a network interface card, implements any of the methods provided in the first aspect.
[0049] In a fifth aspect, embodiments of this application provide a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the methods provided in the first aspect.
[0050] In the technical solution provided in this application embodiment, when the sending-end network card and the receiving-end network card communicate, the receiving-end network card assigns a trust value to each data stream. The sending-end network card first determines the initial transmission window corresponding to each data stream according to the trust value, and then sends the data packets. During the process of sending data packets according to the initial transmission window determined by the trust value, the sending-end network card detects the current congestion information of the AI network. If the congestion information indicates that the AI network is in a first congestion state, and the parameter used to characterize the first congestion state is greater than a first parameter threshold and less than a second parameter threshold, then it indicates that the AI network is in a slightly congested state. The sending-end network card can realize fine-grained adjustment of the initial transmission window according to the current congestion information, achieving sensitive and accurate speed adjustment, avoiding a serious decline in the overall performance of the AI network due to slight congestion, and improving the bandwidth performance of the AI network.
[0051] Of course, implementing any product or method of this application does not necessarily require achieving all of the advantages described above at the same time.
Attached Figure Description
[0052] The accompanying drawings, which are provided to further illustrate this application and form part of this application, illustrate exemplary embodiments of this application and are used to explain this application, but do not constitute an undue limitation of this application.
[0053] Figure 1 is a schematic diagram of the first type of AI network provided in an embodiment of this application;
[0054] Figure 2 is a schematic diagram of the first type of congestion control method provided in the embodiments of this application;
[0055] Figure 3 is a schematic diagram of a second type of congestion control method provided in an embodiment of this application;
[0056] Figure 4 is a schematic diagram showing the congestion control mechanism provided in the embodiment of this application taking effect;
[0057] Figure 5 is a schematic diagram of a second type of AI network provided in an embodiment of this application;
[0058] Figure 6 is a schematic diagram of probe flow forwarding based on the network shown in Figure 5;
[0059] Figure 7 is a schematic diagram of a window adjustment process provided in an embodiment of this application;
[0060] Figure 8 is a schematic diagram of the third type of congestion control method provided in the embodiments of this application;
[0061] Figure 9 is a schematic diagram of a congestion control device provided in an embodiment of this application;
[0062] Figure 10 is a schematic diagram of a server structure provided in an embodiment of this application.
Detailed Implementation
[0063] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided with reference to the accompanying drawings and embodiments. Obviously, the described embodiments are merely some embodiments of this application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments in this application are within the scope of protection of this application.
[0064] There are many existing congestion control algorithms, such as Data Center Quantized Congestion Notification (DCQCN) and High Precision Congestion Control (HPCC). These congestion control algorithms all rely on data flow-based load balancing. Taking DCQCN as an example, the congestion control method is as follows: the switching devices between the source and destination ends sense the degree of congestion and, according to the degree of congestion, randomly mark the data flow with Explicit Congestion Notification (ECN) tags in proportion. The destination end (such as the network interface card) identifies the ECN tags in the data flow and generates a corresponding Congestion Notification Packet (CNP) to notify the source end (such as the network interface card); the source end reduces the speed of the data flow corresponding to the CNP.
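The proportional ECN marking described above can be sketched as follows. This is a minimal sketch of the threshold-based marking commonly used with DCQCN, not the method of this application: the function names and the parameters k_min, k_max and p_max are assumptions introduced for illustration.

```python
import random


def ecn_mark_probability(queue_depth, k_min, k_max, p_max):
    """Marking probability as a function of queue depth (all names assumed).

    Below k_min nothing is marked, above k_max everything is marked, and in
    between the probability rises linearly from 0 to p_max.
    """
    if queue_depth <= k_min:
        return 0.0
    if queue_depth >= k_max:
        return 1.0
    return p_max * (queue_depth - k_min) / (k_max - k_min)


def should_mark(queue_depth, k_min, k_max, p_max, rng=random):
    """Randomly decide whether one packet gets an ECN tag."""
    return rng.random() < ecn_mark_probability(queue_depth, k_min, k_max, p_max)
```

The deeper the queue, the larger the share of packets that carry an ECN tag, and therefore the more CNPs the destination sends back to the source.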
[0065] The load balancing method based on data flow is as follows: perform hash calculation based on the message information of the data flow to determine the path corresponding to the data flow, and forward the data flow along the path.
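As a hedged illustration of the hash calculation above, the sketch below maps the five-tuple of a data flow to one of several paths. The choice of the five-tuple as the hashed message information, the use of SHA-256, and the function name select_path are assumptions for illustration; real switching devices use their own hardware hash functions.

```python
import hashlib


def select_path(src_ip, dst_ip, src_port, dst_port, protocol, num_paths):
    """Pick a path index for a flow by hashing its five-tuple (assumed key)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it
    # modulo the number of available paths.
    return int.from_bytes(digest[:8], "big") % num_paths


path_a = select_path("10.0.0.1", "10.0.0.2", 4791, 4791, "UDP", 4)
path_b = select_path("10.0.0.1", "10.0.0.2", 4791, 4791, "UDP", 4)
```

Every packet of the same flow hashes to the same path, which preserves packet order but means that a few large flows can land on one path while other paths stay idle.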
[0066] Flow-based load balancing cannot distribute load perfectly evenly. AI data centers rely on high-performance AI networks, which have strict performance requirements. As the number of expert models and the volume of all-to-all traffic grow, flow-based load balancing faces greater challenges and cannot meet the performance requirements of AI networks. Packet spraying technology has been proposed to address these requirements.
[0067] Packet spraying technology: Switches between source and destination devices evenly distribute data packets from a single data stream across multiple paths according to their load. This fully utilizes port bandwidth, and packet reordering issues are resolved by the network interface card (NIC). As shown in Figure 1, this network includes leaf (Leaf 1-Leaf 2) and spine (Spine 1-Spine 4) switches. Based on these switches, four paths are formed between NIC 1 and NIC 2: {Path 1: Leaf 1 → Spine 1 → Leaf 2}, {Path 2: Leaf 1 → Spine 2 → Leaf 2}, {Path 3: Leaf 1 → Spine 3 → Leaf 2}, and {Path 4: Leaf 1 → Spine 4 → Leaf 2}. When NIC 1 sends a data stream to NIC 2, packets 1-4 of this data stream are sprayed onto paths 1-4. After receiving packets 1-4, NIC 2 reorders them.
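The spraying and reordering in the Figure 1 example can be sketched as follows. This is a simplified model under assumptions: the load of a path is represented as queued bytes, the least-loaded path is chosen for each packet, and the function names spray and reorder are hypothetical. Path indices 0 to 3 stand for paths 1 to 4.

```python
def spray(packets, path_loads):
    """Assign each (sequence number, size) packet to the least-loaded path."""
    loads = list(path_loads)  # copy so the caller's list is not modified
    assignment = []
    for seq, size in packets:
        path = min(range(len(loads)), key=lambda i: loads[i])
        loads[path] += size
        assignment.append((seq, path))
    return assignment


def reorder(received):
    """Receiving NIC side: restore sequence order of (seq, path) records."""
    return sorted(received, key=lambda pkt: pkt[0])


# Packets 1-4 of one data stream, four initially idle paths.
sprayed = spray([(1, 1500), (2, 1500), (3, 1500), (4, 1500)], [0, 0, 0, 0])
# Packets may arrive out of order because the paths differ in delay.
arrived = [sprayed[2], sprayed[0], sprayed[3], sprayed[1]]
in_order = reorder(arrived)
```

Because consecutive packets of one stream travel over different paths, ordering is no longer guaranteed by the network, which is why the receiving network card has to reorder.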
[0068] Packet spraying technology is a packet-based load balancing method that can meet the extreme load balancing requirements in AI networks and is well adapted to various application scenarios.
[0069] However, once packet-by-packet load balancing is introduced, existing congestion control algorithms are no longer suitable. If they are still used and a path becomes congested, the switching devices will likely mark packets of all data flows with ECN tags, so the communication rate of all data flows decreases. As a result, once network congestion occurs, overall performance degrades more severely under packet spraying than under data flow-based load balancing.
[0070] To address the aforementioned issues, this application provides a congestion control method, as shown in Figure 2, applied to a transmitting network interface card (NIC) in an AI network. The method includes:
[0071] Step S201: Determine the initial transmission window corresponding to each data stream based on the trust value assigned to each data stream by the receiving end network card;
[0072] Step S202: According to the initial sending window corresponding to each data stream, send the data packets included in each data stream to the receiving network card in a packet spraying manner;
[0073] Step S203: Detect the current congestion information of the AI network;
[0074] Step S204: If the current congestion information indicates that the AI network is in the first congestion state, then adjust the initial sending window corresponding to each data stream according to the current congestion information to obtain the secondary refresh sending window corresponding to each data stream; wherein, the parameter used to characterize the first congestion state is greater than the first parameter threshold, and the parameter used to characterize the first congestion state is less than the second parameter threshold.
[0075] Step S205: According to the secondary refresh sending window corresponding to each data stream, send the data packets included in each data stream to the receiving network card in the packet spraying method.
[0076] In the technical solution provided in this application embodiment, when the sending-end network card and the receiving-end network card communicate, the receiving-end network card assigns a trust value to each data stream. The sending-end network card first determines the initial transmission window corresponding to each data stream according to the trust value, and then sends the data packets. During the process of sending data packets according to the initial transmission window determined by the trust value, the sending-end network card detects the current congestion information of the AI network. If the congestion information indicates that the AI network is in a first congestion state, and the parameter used to characterize the first congestion state is greater than a first parameter threshold and less than a second parameter threshold, then it indicates that the AI network is in a slightly congested state. The sending-end network card can realize fine-grained adjustment of the initial transmission window according to the current congestion information, achieving sensitive and accurate speed adjustment, avoiding a serious decline in the overall performance of the AI network due to slight congestion, and improving the bandwidth performance of the AI network.
[0077] In this embodiment, the sending network card and the receiving network card can be any network card in the AI network, with the sending network card being the source and the receiving network card being the destination.
[0078] In step S201 above, when the sending network card needs to send a data stream to the receiving network card, it requests a credit value (the trust value referred to in step S201) for the data stream from the receiving network card. The receiving network card assigns the credit value to the data stream in units of N Maximum Transmission Units (MTUs). After receiving the credit value, the sending network card determines the corresponding sending window, i.e., the initial sending window, based on the credit value, and then executes step S202: according to the initial sending window, it sends the data packets included in the data stream to the receiving network card in a packet spraying manner. That is, the switching devices between the sending and receiving network cards spray the data packets included in the data stream evenly onto multiple paths according to the load of the paths, so that the data packets reach the receiving network card in a load-balanced manner. Here, the first packet does not need to be rate-limited.
[0079] In this embodiment of the application, the sending network card may need to send one or more data streams to the receiving network card, and this method can be used to transmit the data packets of each data stream.
[0080] Transmitting data packets under a sending window derived from the trust value ensures that, in many-to-one scenarios (such as Incast), the total sending bandwidth of the multiple senders does not exceed the total bandwidth of the receiver or of the network egress, which avoids the congestion that many-to-one traffic would otherwise cause.
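One possible reading of how a credit value expressed in units of N MTUs translates into an initial sending window is sketched below. The mapping itself (window in bytes equals credit times N times MTU size), the function name, and the default MTU of 4096 bytes are all assumptions; the text above does not fix the formula.

```python
def initial_window_bytes(credit, mtus_per_credit, mtu_bytes=4096):
    """Initial sending window in bytes for one data stream.

    Assumption: one credit unit permits sending `mtus_per_credit` (the N in
    the text) packets of `mtu_bytes` each.
    """
    if credit < 0 or mtus_per_credit <= 0 or mtu_bytes <= 0:
        raise ValueError("credit must be >= 0; N and MTU must be positive")
    return credit * mtus_per_credit * mtu_bytes


# A stream granted 8 credits with N = 4 and a 4096-byte MTU.
window = initial_window_bytes(8, 4, 4096)
```

Because the receiving network card hands out the credits, it can keep the sum of all granted windows within what its own ingress can absorb.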
[0081] In step S203 above, the current congestion information is the congestion information of the current AI network. The congestion information may include, but is not limited to, parameters such as the queue depth of the switching device, the time of receiving probe packets, the time of sending probe packets, and the throughput at the destination ingress port.
[0082] During the process of sending data packets to the receiving network card according to the initial sending window, the sending network card can detect the current congestion information of the AI network in real time, so as to determine the congestion status of the AI network using the current congestion information.
[0083] For example, current congestion information includes the queue depth of switching devices. A queue depth greater than a preset depth indicates congestion. Using this current congestion information, the sending network interface card (NIC) can identify congested switching devices in the AI network. Taking the number of congested switching devices as a parameter characterizing the first congestion state as an example, when the number of congested switching devices is less than or equal to a first number (a first parameter threshold), the AI network is determined to be in a non-congested state. When the number of congested switching devices is greater than the first number but less than a second number (a second parameter threshold), the AI network is determined to be in a first congestion state, indicating slight congestion. When the number of congested switching devices is greater than or equal to the second number, the AI network is determined to be in a second congestion state, indicating severe congestion.
[0084] For example, the presence of congested switching devices along a path indicates congestion on that path. The sending network interface card (NIC) can use this congestion information to identify congested paths in the AI network. Taking the number of congested paths as a parameter representing the first congestion state as an example, when the number of congested paths is less than or equal to a first threshold (first parameter threshold), the AI network is determined to be in a non-congested state. When the number of congested paths is greater than the first threshold but less than a second threshold (second parameter threshold), the AI network is determined to be in a first congestion state, indicating slight congestion. When the number of congested paths is greater than or equal to the second threshold, the AI network is determined to be in a second congestion state, indicating severe congestion.
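The two threshold examples above can be sketched as follows, using the number of congested paths as the characterizing parameter. The names CongestionState, count_congested_paths and classify, and the concrete queue depths and thresholds, are illustrative assumptions.

```python
from enum import Enum


class CongestionState(Enum):
    NONE = "non-congested"
    FIRST = "first congestion state (slight)"
    SECOND = "second congestion state (severe)"


def count_congested_paths(path_queue_depths, preset_depth):
    """A path is congested if any switching device on it exceeds the preset depth."""
    return sum(
        1
        for depths in path_queue_depths.values()
        if any(depth > preset_depth for depth in depths)
    )


def classify(congested_count, first_threshold, second_threshold):
    """Map the congested-path count to a congestion state."""
    if first_threshold >= second_threshold:
        raise ValueError("first threshold must be below second threshold")
    if congested_count <= first_threshold:
        return CongestionState.NONE
    if congested_count < second_threshold:
        return CongestionState.FIRST
    return CongestionState.SECOND


# Queue depths of the switching devices along each of four paths (made up).
depths = {"path1": [10, 20], "path2": [90, 5], "path3": [3, 4], "path4": [120, 130]}
congested = count_congested_paths(depths, preset_depth=80)
state = classify(congested, first_threshold=0, second_threshold=3)
```

Here two of the four paths are congested, which falls strictly between the two thresholds, so the network is classified as slightly congested.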
[0085] Other methods may also be used to determine the congestion level of the AI network in this embodiment of the application, and no limitation is imposed on them.
[0086] In step S204 above, the sending network card uses the current congestion information detected in step S203. If it determines that the AI network is in the first congestion state, it precisely adjusts the initial sending window corresponding to each data stream based on the detailed information included in the current congestion information. The adjusted initial sending window is the secondary refresh sending window. Step S205 is then executed: for each data stream, the sending network card sends the data packets included in the data stream to the receiving network card in a packet spraying manner according to the secondary refresh sending window corresponding to the data stream.
[0087] If the sending network card uses the current congestion information detected in step S203 to determine that the AI network is not congested, it does not need to adjust the initial sending window and continues to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the initial sending window corresponding to each data stream.
[0088] In this embodiment, when the AI network is in the first congestion state, it indicates that the AI network is slightly congested. The sending network card uses the current congestion information to accurately adjust the initial sending window corresponding to each data stream, so as to ensure that the communication traffic between the sending network card and the receiving network card does not exceed the real-time total bandwidth, thereby achieving sensitive and accurate fine-grained speed adjustment, avoiding network congestion, and avoiding large performance degradation due to slight congestion.
[0089] In some embodiments, to further optimize congestion control, as shown in Figure 3, a congestion control method is also provided, applied to the transmitting network interface card in an AI network, the method comprising:
[0090] Step S301: Determine the initial transmission window corresponding to each data stream based on the trust value assigned to each data stream by the receiving end network card;
[0091] Step S302: According to the initial sending window corresponding to each data stream, send the data packets included in each data stream to the receiving network card in a packet spraying manner;
[0092] Step S303: Detect the current congestion information of the AI network;
[0093] Step S304: If the current congestion information indicates that the AI network is in the first congestion state, then adjust the initial sending window corresponding to each data stream according to the current congestion information to obtain the secondary refresh sending window corresponding to each data stream; wherein, the parameter used to characterize the first congestion state is greater than the first parameter threshold, and the parameter used to characterize the first congestion state is less than the second parameter threshold.
[0094] Step S305: According to the secondary refresh sending window corresponding to each data stream, send the data packets included in each data stream to the receiving network card in the packet spraying method;
[0095] Steps S301 to S305 are the same as steps S201 to S205, and will not be repeated here.
[0096] Step S306: If the current congestion information indicates that the AI network is in a second congestion state, then adjust the initial sending window corresponding to each data stream according to the congestion signal sent by the receiving network card to obtain a tertiary refresh sending window corresponding to each data stream; wherein, the parameter used to characterize the second congestion state is greater than or equal to the second parameter threshold.
[0097] Step S307: According to the tertiary refresh sending window corresponding to each data stream, send the data packets included in each data stream to the receiving network card in the packet spraying method.
[0098] In this embodiment, the switching devices in the AI network can support existing congestion control algorithms, such as DCQCN and HPCC. Different congestion control algorithms produce different congestion signals. Taking DCQCN as an example, when the queue depth on the switching device exceeds a preset depth, the switching device can mark the data packets in that queue with an ECN tag. After the receiving network card receives a data packet carrying an ECN tag, it sends a CNP to the sending network card; this CNP is the congestion signal. In other congestion control algorithms, the receiving network card also generates corresponding congestion signals and feeds them back to the sending network card; these will not be listed here.
[0099] After receiving a congestion signal, if the current congestion information indicates that the AI network is in a non-congested state or a first-level congestion state, then the sending network card sends data packets using steps S301 to S305. If the current congestion information indicates that the AI network is in a second-level congestion state, then the AI network is in a severely congested state. The sending network card can adjust the initial transmission window corresponding to the data stream corresponding to the congestion signal using existing congestion algorithms based on the congestion signal sent by the receiving network card. The adjusted initial transmission window is the three-refresh transmission window corresponding to that data stream. Then, the sending network card executes step S307, whereby, for each data stream, the sending network card sends the data packets included in that data stream to the receiving network card using a packet spraying method according to the three-refresh transmission window corresponding to that data stream.
[0100] In this embodiment of the application, when the AI network is severely congested, for example when multiple paths are congested at the same time and the overall available bandwidth drops significantly, the sending network card uses congestion signals to perform coarse-grained congestion control: it sharply reduces the packet sending rate and then raises the rate slowly, which quickly alleviates network congestion and further optimizes network congestion control.
[0101] In this embodiment of the application, ECN-based congestion control is used as an example of congestion control based on congestion signals. ECN-based congestion control (steps S306 to S307 above) takes effect later than probe-based congestion control (steps S204 to S205 above), as shown in Figure 4. ECN-based congestion control adjusts the rate more aggressively than probe-based congestion control, and it corresponds to a higher degree of congestion than probe-based congestion control does.
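The selection between the three mechanisms can be illustrated with a minimal Python sketch. It assumes a single scalar congestion parameter compared against the first and second parameter thresholds used elsewhere in this application; the names `CongestionState` and `classify`, and the rule that a parameter at or below the first threshold means "non-congested", are illustrative assumptions rather than part of the described method.

```python
from enum import Enum


class CongestionState(Enum):
    # Values name the mechanism the sending network card would use.
    NON_CONGESTED = "trust-value window"
    FIRST = "probe-based fine-grained adjustment"
    SECOND = "ECN-based coarse-grained adjustment"


def classify(param: float, first_threshold: float, second_threshold: float) -> CongestionState:
    """Map a congestion parameter to a congestion state.

    Assumption: param <= first_threshold is treated as non-congested.
    First state:  first_threshold < param < second_threshold.
    Second state: param >= second_threshold.
    """
    if not first_threshold < second_threshold:
        raise ValueError("first_threshold must be below second_threshold")
    if param >= second_threshold:
        return CongestionState.SECOND
    if param > first_threshold:
        return CongestionState.FIRST
    return CongestionState.NON_CONGESTED


# Hypothetical thresholds on a normalized parameter.
state = classify(0.4, first_threshold=0.2, second_threshold=0.6)
print(state.name, "->", state.value)
```

Because the probe-based range sits below the ECN-based range, a rising congestion parameter reaches probe-based adjustment first, which matches the ordering described for Figure 4.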
[0102] In some embodiments, step S203 can be: detecting the current congestion information of each path between the sending network card and the receiving network card in the AI network. For example, as shown in Figure 1, network card 1 can detect the current congestion information of the four paths between network card 1 and network card 2 on a path-by-path basis.
[0103] In this embodiment of the application, the transmitting network card can use any of the following methods to detect the current congestion information of each path between the transmitting network card and the receiving network card.
[0104] Method 1: Based on the path information of each path between the sending network card and the receiving network card in the AI network, construct a probe flow corresponding to each path; along each path, send probe packets included in the probe flow corresponding to each path to the receiving network card to obtain the current congestion information of each path.
[0105] The path information may include, but is not limited to, path entropy values and the number of paths between the sending and receiving network interface cards (NICs). The probe flow can be an in-band network telemetry (INT) flow, and the probe packet is an INT packet. The path entropy value is a variable used to distinguish different paths and is filled in when the sending NIC sends a packet. The specific fields used to fill the path entropy value are determined by the sending NIC and the switching equipment. The number of paths refers to the number of end-to-end paths used for packet spraying; the controller determines the number of paths for packet spraying based on the network topology of the switching equipment.
[0106] The sending network interface card (NIC) constructs a probe flow for each path based on the path information. A probe flow is forwarded along one path to perform congestion detection across the entire path. The probe packets within each probe flow carry the entropy value of that path. Upon receiving the probe packets, the switching device can record congestion information such as its queue depth within the probe packets. Based on the path entropy value carried in the probe packets, the device performs hash forwarding on the probe packets, enabling them to be forwarded along the specified path and thus achieving congestion detection along that path.
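As a rough illustration of why a fixed entropy value pins a probe flow to one path, the following Python sketch models a switching device that records its queue depth in the probe packet and then picks an uplink by hashing the entropy value. The CRC32 hash, the dictionary-based packet layout, and the names `pick_uplink` and `forward_probe` are assumptions made for the example; real switching devices use their own hash functions and packet formats.

```python
import zlib


def pick_uplink(entropy: int, uplinks: list[str]) -> str:
    """Flow-hash forwarding: the same entropy value always maps to the same uplink."""
    digest = zlib.crc32(entropy.to_bytes(4, "big"))
    return uplinks[digest % len(uplinks)]


def forward_probe(probe: dict, switch: str, queue_depth: int, uplinks: list[str]) -> str:
    """Record this switch's queue depth in the probe, then choose the next hop."""
    probe["hops"].append({"switch": switch, "queue_depth": queue_depth})
    return pick_uplink(probe["entropy"], uplinks)


spines = ["spine1", "spine2", "spine3", "spine4"]
probe = {"entropy": 0x2A, "hops": []}

# Two probe packets of the same probe flow leave leaf1 through the same uplink,
# unlike sprayed data packets, which may be spread over all uplinks.
first = forward_probe(probe, "leaf1", queue_depth=12, uplinks=spines)
second = forward_probe(probe, "leaf1", queue_depth=15, uplinks=spines)
print(first == second, probe["hops"])
```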
[0107] After receiving a probe packet, the receiving network card can send the probe packet to the controller, which will then send the congestion information carried in the probe packet to the sending network card. The receiving network card can also send the probe packet back to the sending network card along the original path, thereby enabling the probe of the specified path.
[0108] In some embodiments, to obtain current congestion information, before constructing the probe flow corresponding to each path, the sending network interface card (NIC) can receive path information for each path between the sending and receiving NICs from the controller within the AI network. The controller determines the path information for each path based on the communication relationship between NICs reported by the communication component. The communication component is a software component used to manage communication between various NICs. The communication component can be implemented based on a collection of communication libraries.
[0109] For example, the network shown in Figure 5 includes network interface cards (NICs) 1 to NIC 4, leaf 1 to leaf 4, and spine 1 to spine 4. NICs 1 to NIC 4 can reside on the same server or on different servers. The network shown in Figure 5 also includes a communication component, which resides on the same server as NICs 1 to NIC 4. Within the dashed rectangle, NIC 1 needs to communicate with NIC 3, and NIC 2 needs to communicate with NIC 3. The communication component can create a communication group that instructs NIC 1 to communicate with NIC 3, and NIC 2 to communicate with NIC 4. The communication component synchronizes the communication relationships within the communication group to the controller. The controller determines the communication paths for all data streams through path orchestration and sends the path information for each NIC's corresponding communication path to the NIC. Taking the communication relationship between NIC 1 and NIC 3, where NIC 1 acts as the sending NIC and NIC 3 acts as the receiving NIC, as an example, the controller determines paths 1 to 4 as shown in Figure 5 through path orchestration and sends the path information for paths 1 to 4 to NIC 1. Network interface card 1 obtains the path information for communication between itself and the peer network interface card 3 through the controller. The path information includes the number of paths and the path entropy value.
[0110] After obtaining the path information, NIC 1 creates a probe flow for each path, shown as Flow 1 to Flow 4 in Figure 6. Each probe flow follows a specific path. The switching devices within the AI network (such as Leaf 1 to Leaf 4 and Spine 1 to Spine 4) identify the probe flow and forward it along the corresponding path using a flow hash, rather than packet spraying. In this way, NIC 1 can obtain congestion information for paths 1 to 4.
[0111] In this embodiment of the application, the following points should be noted when using probe streams to detect congestion information: the probe stream and the data stream (such as an RDMA stream) use the same queue; the probe stream uses custom Ethernet (eth) packets, and the packets should be as short as possible to reduce the impact on the data stream; the switching device is responsible for identifying the probe stream and performing flow hash forwarding on its packets; microsecond (μs) level detection precision is used to ensure detection accuracy and response accuracy, while balancing bandwidth utilization against speed regulation efficiency, with the bandwidth consumed by probing kept as low as possible, below one ten-thousandth.
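The trade-off between probe frequency and probe bandwidth can be made concrete with a short calculation. The Python sketch below assumes a hypothetical 400 Gbit/s link and 64-byte probe packets, and ignores Ethernet preamble and inter-frame gap; these figures and the function name `min_probe_interval_us` are illustrative only.

```python
def min_probe_interval_us(link_gbps: float, probe_bytes: int, max_fraction: float = 1e-4) -> float:
    """Smallest probe spacing (in microseconds) that keeps probe traffic
    within max_fraction of the link bandwidth."""
    budget_bps = link_gbps * 1e9 * max_fraction        # bits per second allowed for probes
    probes_per_second = budget_bps / (probe_bytes * 8)  # probe packets per second
    return 1e6 / probes_per_second


interval = min_probe_interval_us(link_gbps=400, probe_bytes=64)
print(f"{interval:.1f} us between probe packets")
```

Under these assumed values, the one ten-thousandth budget is 40 Mbit/s, which allows roughly one 64-byte probe packet every 12.8 μs, so microsecond-level probing and a very small bandwidth share are compatible when probe packets are short.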
[0112] In this embodiment, optimized congestion control in packet spraying scenarios is achieved by using a combination of a "trust value-based mechanism," "fine-grained speed adjustment using INT probing during mild congestion (i.e., path detection mechanism)," and "speed reduction and adjustment using ECN marking during severe congestion (optimized congestion control mechanism)," thereby improving network bandwidth performance in AI scenarios.
[0113] Method 2: Each switching device in the AI network reports its congestion information to the controller, either periodically or when triggered by an event. The controller determines, based on the AI network topology, which switching devices lie on each path between the sending network card and the receiving network card, derives the current congestion information of each path from the congestion information of the switching devices on that path, and sends it to the sending network card.
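A minimal Python sketch of the controller-side aggregation in Method 2 is given below. It assumes that each switching device reports a single queue depth, that a path is summarized by its largest queue depth and its number of congested hops, and that a hop counts as congested above a hypothetical `hop_threshold`; none of these choices is prescribed by the application.

```python
def path_congestion(paths: dict[str, list[str]],
                    reports: dict[str, int],
                    hop_threshold: int) -> dict[str, dict[str, int]]:
    """Combine per-switch queue depths into per-path congestion information."""
    result = {}
    for path_id, switches in paths.items():
        # Switches that have not reported yet are treated as having empty queues.
        depths = [reports.get(switch, 0) for switch in switches]
        result[path_id] = {
            "max_queue_depth": max(depths),
            "congested_hops": sum(depth > hop_threshold for depth in depths),
        }
    return result


# Hypothetical topology knowledge held by the controller.
paths = {
    "path1": ["leaf1", "spine1", "leaf3"],
    "path2": ["leaf1", "spine2", "leaf3"],
}
# Latest queue depths reported by the switching devices.
reports = {"leaf1": 5, "spine1": 40, "spine2": 0, "leaf3": 10}

info = path_congestion(paths, reports, hop_threshold=8)
print(info)
```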
[0114] The sending network card can also use other methods to detect the current congestion information of each path, without limitation.
[0115] In this embodiment, the sending network card can obtain the current congestion information of the AI network by probing the path, or it can obtain the current congestion information of the AI network by other means, such as probing the throughput at the ingress port of the target device, etc., without limitation.
[0116] In some embodiments, step S204 above, adjusting the initial sending window corresponding to each data stream according to the current congestion information to obtain the secondary refresh sending window corresponding to each data stream, can be as follows: determining the congestion factor of each path between the sending network card and the receiving network card in the AI network according to the current congestion information; determining the real-time total window size of the AI network according to the congestion factor, path bandwidth, and path round-trip time of each path; adjusting the initial sending window corresponding to each data stream according to the real-time total window size to obtain the secondary refresh sending window corresponding to each data stream.
[0117] In this embodiment, the sending network interface card (NIC) can obtain information such as the congestion level, number of congested hops, round-trip time (RTT), and queue depth of each path from the current congestion information. Using these parameters, the sending NIC can calculate the congestion factor of each path. For example, the sending NIC can preset a correspondence between congestion information and congestion factors, and look up the congestion factor of each path from its congestion information. Alternatively, the sending NIC can preset a strategy for determining the congestion factor and apply this strategy to the congestion information of each path. The strategy can be set according to actual needs. For example, the strategy can be: output the congestion information of each path, and receive the congestion factor of each path entered by the user based on that information. The strategy can also be: calculate the sum of the queue depths of the switching devices on each path and normalize the sum to obtain the congestion factor of each path. Other strategies for determining the congestion factor can also be adopted, without limitation.
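One possible reading of the queue-depth strategy is sketched below in Python: the queue depths of the switching devices on a path are summed and the sum is normalized by the total queue capacity along the path. The normalization by capacity, the clamp to 1.0, and the name `congestion_factor` are assumptions; the application does not fix a particular normalization.

```python
def congestion_factor(queue_depths: list[int], queue_capacity: int) -> float:
    """Normalize the summed queue depth of a path into the range [0, 1].

    Assumption: every switching device on the path has the same queue_capacity.
    """
    if queue_capacity <= 0 or not queue_depths:
        raise ValueError("need a positive capacity and at least one hop")
    total_depth = sum(queue_depths)
    return min(1.0, total_depth / (queue_capacity * len(queue_depths)))


# Three hops with queue depths 10, 30 and 0 out of a capacity of 100 each.
factor = congestion_factor([10, 30, 0], queue_capacity=100)
print(round(factor, 3))
```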
[0118] In this embodiment, there are N paths between the sending network card and the receiving network card. Therefore, the sending network card obtains a set of congestion factors $[a_1, a_2, a_3, \dots, a_N]$. The higher the congestion factor of a path, the more severe the congestion of that path.
[0119] Given the congestion factor for each path, the sending network card can calculate the sum of the products of the congestion factor and the path bandwidth for each path to obtain the real-time total bandwidth of the AI network; then, based on the real-time total bandwidth, the initial sending window corresponding to each data stream is adjusted to obtain the secondary refresh sending window corresponding to each data stream.
[0120] In this embodiment of the application, the sending network card can obtain a second refresh of the sending window in the following manner:
[0121] Method 1: Configure static bandwidth on the sending network card; after obtaining the real-time total bandwidth of the AI network, calculate the ratio of the real-time total bandwidth to the static bandwidth to obtain the window coefficient; calculate the product of the initial sending window and the window coefficient for each data stream to obtain the secondary refresh sending window for each data stream.
[0122] For example, the sending network card obtains the congestion factor and path bandwidth of each path based on the current congestion information of the AI network. With the congestion factor and path bandwidth obtained for each path, the sending network card can use the following formula (1) to obtain the real-time total bandwidth of the AI network, and use the following formulas (2) to (3) to obtain the secondary refresh sending window corresponding to each data stream.
$$\mathrm{real\_bd} = a_1 \times BW_1 + a_2 \times BW_2 + \dots + a_N \times BW_N \quad (1)$$
$$X = \mathrm{real\_bd} / \mathrm{Sta\_bd} \quad (2)$$
$$\mathrm{Flow\_wind}_j = \mathrm{Credit\_Wind}_j \times X \quad (3)$$
[0123] Where real_bd represents the real-time total bandwidth, $a_i$ represents the congestion factor of path i, and $BW_i$ represents the bandwidth of path i; X is the window coefficient, Sta_bd is the static bandwidth, $\mathrm{Flow\_wind}_j$ is the secondary refresh sending window corresponding to data stream j, and $\mathrm{Credit\_Wind}_j$ is the initial sending window of data stream j, that is, the sending window corresponding to its trust value; i = 1, 2, ..., N, where N represents the number of paths between the sending network card and the receiving network card, and j = 1, 2, ..., M, where M represents the number of data streams.
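Formulas (1) to (3) can be traced with a small worked example. The Python sketch below takes the per-path factors as given and uses them as weights on the path bandwidth exactly as formula (1) does; the numeric values, units, and function names are illustrative assumptions.

```python
def real_time_bandwidth(factors: list[float], bandwidths: list[float]) -> float:
    """Formula (1): sum over paths of factor a_i times path bandwidth BW_i."""
    if len(factors) != len(bandwidths):
        raise ValueError("one factor is required per path")
    return sum(a * bw for a, bw in zip(factors, bandwidths))


def secondary_refresh_windows(credit_windows: list[float], real_bd: float, sta_bd: float) -> list[float]:
    """Formulas (2) and (3): scale each trust-value window by X = real_bd / Sta_bd."""
    x = real_bd / sta_bd
    return [window * x for window in credit_windows]


# Four paths of 100 Gbit/s each; static bandwidth configured as 400 Gbit/s.
factors = [1.0, 0.5, 1.0, 0.5]
bandwidths = [100.0, 100.0, 100.0, 100.0]
real_bd = real_time_bandwidth(factors, bandwidths)            # 300.0
windows = secondary_refresh_windows([64, 32, 16], real_bd, sta_bd=400.0)
print(real_bd, windows)                                       # 300.0 [48.0, 24.0, 12.0]
```

With these assumed values the window coefficient X is 0.75, so every data stream keeps the relative share given by its trust value while the total is scaled down together.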
[0124] Method 2: Configure static bandwidth on the sending network interface card (NIC). In addition to obtaining the real-time total bandwidth of the AI network, the sending NIC averages the round-trip times (RTTs) of the paths to obtain the average RTT. The real-time total window size of the AI network is the product of the real-time total bandwidth, the average RTT, and a preset coefficient. The initial sending window of each data stream is then adjusted according to the real-time total window size. The sending NIC can adjust the initial sending windows in any way, as long as the sum of the adjusted windows is less than or equal to the real-time total window size. Alternatively, the sending NIC can adjust the initial sending windows as in Method 1, that is, calculate the ratio of the real-time total bandwidth to the static bandwidth to obtain the window coefficient, and multiply the initial sending window of each data stream by the window coefficient to obtain the secondary refresh sending window of each data stream. There are no restrictions on this.
[0125] In this embodiment of the application, given the congestion factor, path bandwidth, and path round-trip time of each path, the sending network card can use formula (1) together with the following formulas (4) and (5) to obtain the real-time total window size of the AI network.
$$RTT_{avg} = (RTT_1 + RTT_2 + \dots + RTT_N) / N \quad (4)$$
$$\mathrm{real\_total\_window} = \mathrm{real\_bd} \times RTT_{avg} \times P \quad (5)$$
[0126] Where real_bd represents the real-time total bandwidth; $RTT_{avg}$ represents the average round-trip time; $RTT_i$ represents the round-trip time of path i; real_total_window represents the real-time total window size of the AI network; P is an empirical coefficient (i.e., a preset coefficient); and i = 1, 2, ..., N, where N represents the number of paths between the sending network card and the receiving network card.
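Formulas (4) and (5) amount to a bandwidth-delay product scaled by the empirical coefficient. The Python sketch below works one example; the units (bytes per microsecond and microseconds), the value P = 0.5, and the function name are assumptions chosen only to make the arithmetic easy to follow.

```python
def real_time_total_window(real_bd: float, rtts: list[float], p: float) -> float:
    """Formulas (4) and (5): real-time total bandwidth x average RTT x coefficient P."""
    if not rtts:
        raise ValueError("at least one path RTT is required")
    rtt_avg = sum(rtts) / len(rtts)   # formula (4)
    return real_bd * rtt_avg * p      # formula (5)


# 300 Gbit/s is 37,500 bytes per microsecond; four paths with RTTs in microseconds.
total_window = real_time_total_window(real_bd=37_500.0, rtts=[8.0, 10.0, 12.0, 10.0], p=0.5)
print(total_window)   # 187500.0 bytes under these assumed values
```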
[0127] The sending network interface card adjusts the initial sending window of each data stream according to the real-time total window size to obtain the secondary refresh sending window of each data stream. The sum of all secondary refresh sending windows is less than or equal to the real-time total window size, where the sum is taken over the windows adjusted from $\mathrm{Credit\_Wind}_1, \mathrm{Credit\_Wind}_2, \dots, \mathrm{Credit\_Wind}_M$. $\mathrm{Credit\_Wind}_j$ represents the sending window size corresponding to the trust value of data stream j, where j = 1, 2, ..., M, and M represents the number of data streams between the sending and receiving network cards. The specific adjustment process is shown in Figure 7. In Figure 7, the window coefficient X can be determined from the ratio of the real-time total bandwidth to the static bandwidth, so that the traffic within the communication group does not exceed the real-time total bandwidth and the sum of all secondary refresh sending windows is less than or equal to the real-time total window size. Figure 7 uses only data streams 1 to 3 as examples, where the product of $\mathrm{Credit\_Wind}_j$ and the window coefficient X is used as the adjusted actual window, that is, the secondary refresh sending window.
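Since the sending network card may adjust the windows in any way that keeps their sum within the real-time total window size, one simple option is a proportional scale-down. The Python sketch below shows that option with integer window sizes; proportional scaling and the name `fit_windows` are assumptions, not the only adjustment the application allows.

```python
def fit_windows(credit_windows: list[int], total_window: int) -> list[int]:
    """Scale trust-value windows down proportionally so that their sum
    does not exceed the real-time total window size."""
    demand = sum(credit_windows)
    if demand <= total_window:
        return list(credit_windows)   # already within budget, keep as is
    # Integer floor division guarantees the scaled sum never exceeds total_window.
    return [window * total_window // demand for window in credit_windows]


adjusted = fit_windows([64, 32, 32], total_window=96)
print(adjusted, sum(adjusted))   # [48, 24, 24] 96
```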
[0128] In this embodiment of the application, step S204 above, which adjusts the initial sending window corresponding to each data stream based on the current congestion information to obtain the secondary refresh sending window corresponding to each data stream, can also be implemented in other ways. For example, determine the percentage change between the congestion level indicated by the current congestion information and the congestion level indicated by the previous congestion information, and adjust the initial sending window corresponding to each data stream by this percentage to obtain the secondary refresh sending window. The percentage change from the initial sending window to the secondary refresh sending window then equals the percentage change in congestion level.
[0129] The congestion control method provided in this application embodiment will be described in detail below with reference to the congestion control flow shown in Figure 8. When the sending network card needs to send a data stream, the following steps are performed:
[0130] Step S801: Obtain the current congestion information of the AI network. If the current congestion information indicates that the AI network is not congested, proceed to step S802; if the current congestion information indicates that the AI network is slightly congested, proceed to step S803; if the current congestion information indicates that the AI network is severely congested, proceed to step S804. Slight congestion of the AI network indicates that the AI network is in the first congestion state, and severe congestion of the AI network indicates that the AI network is in the second congestion state.
[0131] Step S802: Send messages using a trust value mechanism, that is, set the initial sending window corresponding to each data stream to the sending window corresponding to the trust value.
[0132] Step S803: Send the message using the INT algorithm mechanism, that is, adjust the sending window using the INT algorithm. Specifically, this includes: Step S813: Determine the window coefficient X based on the congestion information; Step S823: Calculate the product of the sending window corresponding to the trust value and the window coefficient X to obtain the second refresh sending window.
[0133] Step S804: Send the message using the ECN algorithm mechanism, that is, adjust the sending window using the ECN algorithm. Specifically, this includes: Step S814: Determine the window coefficient Y based on the CNP density; Step S824: Calculate the product of the sending window corresponding to the trust value and the window coefficient Y to obtain the three-refresh sending window. The higher the CNP density, the smaller the window coefficient Y. DCQCN is used here only as an example and is not intended to be limiting.
[0134] Step S805: Determine whether the total bandwidth of all current data streams exceeds the real-time total bandwidth; if yes, repeat step S801; if no, send messages according to the adjusted window.
[0135] In this embodiment, if the total bandwidth of all current data streams exceeds the real-time total bandwidth, the sending network interface card (NIC) can re-execute step S801 to ensure that the adjusted total bandwidth of all current data streams does not exceed the real-time total bandwidth before sending the packet, thus reducing network congestion. Alternatively, the sending NIC can directly send the packet while re-executing step S801 to minimize the impact on the network. The sending NIC can also discard currently buffered packets to reduce network congestion. The sending NIC can process packets according to specific configuration strategies, and there are no limitations on this.
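The flow of steps S801 to S805 can be summarized in a compact Python sketch. The state labels, the mapping Y = 1 / (1 + CNP density), and the function names are hypothetical; the application only requires that Y decreases as the CNP density increases.

```python
def window_coefficient_y(cnp_density: float) -> float:
    """Hypothetical monotone mapping: a higher CNP density gives a smaller Y."""
    return 1.0 / (1.0 + cnp_density)


def control_step(state: str, credit_windows: list[float], flow_bandwidths: list[float],
                 real_bd: float, x: float = 1.0, cnp_density: float = 0.0):
    """One pass through S802/S803/S804 followed by the S805 check.

    Returns the adjusted windows and a flag telling the caller to repeat S801.
    """
    if state == "none":        # S802: keep the trust-value windows
        windows = list(credit_windows)
    elif state == "slight":    # S803: probe-based coefficient X
        windows = [w * x for w in credit_windows]
    elif state == "severe":    # S804: CNP-based coefficient Y
        y = window_coefficient_y(cnp_density)
        windows = [w * y for w in credit_windows]
    else:
        raise ValueError(f"unknown congestion state: {state}")
    repeat_s801 = sum(flow_bandwidths) > real_bd   # S805
    return windows, repeat_s801


print(control_step("slight", [64, 32], [200, 200], real_bd=300, x=0.75))
print(control_step("severe", [64, 32], [100, 100], real_bd=300, cnp_density=1.0))
```

In the first call the flows together exceed the real-time total bandwidth, so the flag asks for another pass through S801; in the second call the windows are halved and sending can proceed.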
[0136] The three window adjustment mechanisms mentioned above ensure that the total bandwidth of all current data streams does not exceed the real-time total bandwidth.
[0137] The congestion control method provided in this application is a global traffic rate adjustment method matched to the multipath (Spray) load sharing mode (i.e., packet spraying mode); that is, it adjusts the window and the packet sending rate according to the overall congestion information of the AI network.
[0138] Furthermore, the congestion control method provided in this application adopts different mechanisms to adjust the rate according to different levels of congestion. Specifically, it is divided into coarse-grained mechanisms (such as ECN-based congestion control) and fine-grained mechanisms (such as probe-based congestion control). Fine-grained mechanisms are suitable for maximizing bandwidth when there is slight congestion, while coarse-grained mechanisms are suitable for quickly alleviating congestion in severe congestion / fault scenarios.
[0139] By integrating different algorithms to handle various congestion models, including the N:1 congestion model at the end of the transmission network, the micro-burst congestion model of the transmission network, and the persistent congestion model caused by partial interruption of the transmission network, the method achieves congestion control with the goals of optimal total control, fastest fault recovery, and minimum overall congestion.
[0140] Corresponding to the congestion control method described above, this application also provides a congestion control device, as shown in Figure 9, applied to a transmitting network card in an AI network. The device includes:
[0141] The initial transmission window determination module 901 is used to determine the initial transmission window corresponding to each data stream based on the trust value allocated to each data stream by the receiving end network card;
[0142] The packet spraying module 902 is used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the initial sending window corresponding to each data stream.
[0143] The congestion information detection module 903 is used to detect the current congestion information of the AI network;
[0144] The secondary refresh sending window determination module 904 is used to adjust the initial sending window corresponding to each data stream according to the current congestion information if the current congestion information indicates that the AI network is in the first congestion state, so as to obtain the secondary refresh sending window corresponding to each data stream. The parameter used to characterize the first congestion state is greater than the first parameter threshold, and the parameter used to characterize the first congestion state is less than the second parameter threshold.
[0145] The packet spraying module 902 is also used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the secondary refresh sending window corresponding to each data stream.
[0146] In some embodiments, the congestion control device may further include:
[0147] The three-refresh sending window determination module is used to adjust the initial sending window corresponding to each data stream according to the congestion signal sent by the receiving network card if the current congestion information indicates that the AI network is in the second congestion state, so as to obtain the three-refresh sending window corresponding to each data stream. The parameter used to characterize the second congestion state is greater than or equal to the second parameter threshold.
[0148] The packet spraying module 902 is also used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the three-refresh sending window corresponding to each data stream.
[0149] In some embodiments, the congestion information detection module 903 may be specifically used for:
[0150] Detect current congestion information for each path between the sending and receiving network cards in the AI network.
[0151] In some embodiments, the congestion information detection module 903 may be specifically used for:
[0152] Based on the path information of each path between the sending and receiving network cards in the AI network, construct the probe flow corresponding to each path;
[0153] Along each path, send the probe packets included in the probe flow corresponding to that path to the receiving network card to obtain the current congestion information of each path.
[0154] In some embodiments, the congestion control device may further include:
[0155] The path information receiving module is used to receive the path information of each path between the sending end network card and the receiving end network card issued by the controller in the AI network before constructing the probe flow corresponding to each path. The controller determines the path information of each path according to the communication relationship between the network cards reported by the communication component.
[0156] In some embodiments, the secondary refresh sending window determination module 904 can be specifically used for:
[0157] Based on the current congestion information, determine the congestion factor of each path between the sending and receiving network cards in the AI network; calculate the sum of the products of the congestion factor and the path bandwidth for each path to obtain the real-time total bandwidth of the AI network; adjust the initial sending window corresponding to each data stream according to the real-time total bandwidth to obtain the secondary refresh sending window corresponding to each data stream.
[0158] In some embodiments, the secondary refresh sending window determination module 904 can be specifically used for:
[0159] Calculate the ratio of real-time total bandwidth to static bandwidth to obtain the window coefficient; calculate the product of the initial sending window and the window coefficient for each data stream to obtain the secondary refresh sending window for each data stream.
[0160] In the technical solution provided in this application embodiment, when the sending-end network card and the receiving-end network card communicate, the receiving-end network card assigns a trust value to each data stream. The sending-end network card first determines the initial transmission window corresponding to each data stream according to the trust value, and then sends the data packets. During the process of sending data packets according to the initial transmission window determined by the trust value, the sending-end network card detects the current congestion information of the AI network. If the congestion information indicates that the AI network is in a first congestion state, and the parameter used to characterize the first congestion state is greater than a first parameter threshold and less than a second parameter threshold, then it indicates that the AI network is in a slightly congested state. The sending-end network card can realize fine-grained adjustment of the initial transmission window according to the current congestion information, achieving sensitive and accurate speed adjustment, avoiding a serious decline in the overall performance of the AI network due to slight congestion, and improving the bandwidth performance of the AI network.
[0161] Corresponding to the above congestion control methods, this application embodiment also provides a server, as shown in Figure 10, including one or more network interface cards 1001 and a machine-readable storage medium 1002. The machine-readable storage medium 1002 stores machine-executable instructions that can be executed by the network interface cards 1001. The machine-executable instructions cause the network interface cards 1001 to implement any of the above-described congestion control methods.
[0162] Machine-readable storage media may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the machine-readable storage medium may also be at least one storage device located remotely from the aforementioned network interface card.
[0163] In another embodiment provided in this application, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a network card, implements any of the above-described congestion control methods.
[0164] In another embodiment provided in this application, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the congestion control methods described above.
[0165] In the above embodiments, implementation can be achieved entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
[0166] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0167] The various embodiments in this specification are described in a related manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the embodiments of apparatus, servers, storage media, and program products are basically similar to the method embodiments, so the descriptions are relatively simple; relevant parts can be referred to the descriptions of the method embodiments.
[0168] The above description is only a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A congestion control method, characterized in that the method is applied to a sending network card in an AI network and includes: determining the initial sending window corresponding to each data stream based on the trust value assigned to each data stream by a receiving network card; sending, according to the initial sending window corresponding to each data stream, the data packets included in each data stream to the receiving network card in a packet spraying manner; detecting the current congestion information of the AI network; if the current congestion information indicates that the AI network is in a first congestion state, adjusting the initial sending window corresponding to each data stream according to the current congestion information to obtain a secondary refresh sending window corresponding to each data stream, where the parameter used to characterize the first congestion state is greater than a first parameter threshold and less than a second parameter threshold; and sending, according to the secondary refresh sending window corresponding to each data stream, the data packets included in each data stream to the receiving network card in a packet spraying manner.
2. The method according to claim 1, characterized in that the method further includes: if the current congestion information indicates that the AI network is in a second congestion state, adjusting the initial sending window corresponding to each data stream according to the congestion signal sent by the receiving network card to obtain a tertiary refresh sending window corresponding to each data stream, where the parameter used to characterize the second congestion state is greater than or equal to the second parameter threshold; and sending, according to the tertiary refresh sending window corresponding to each data stream, the data packets included in each data stream to the receiving network card in a packet spraying manner.
3. The method according to claim 1, characterized in that the step of detecting the current congestion information of the AI network includes: detecting the current congestion information of each path between the sending network card and the receiving network card in the AI network.
4. The method according to claim 3, characterized in that the step of detecting the current congestion information of each path between the sending network card and the receiving network card in the AI network includes: constructing, based on the path information of each path between the sending network card and the receiving network card in the AI network, the probe flow corresponding to each path; and sending, along each path, the probe packets included in the probe flow corresponding to that path to the receiving network card to obtain the current congestion information of each path.
5. The method according to claim 3, characterized in that, before constructing the probe flow corresponding to each path, the method further includes: receiving the path information of each path between the sending network card and the receiving network card issued by a controller in the AI network, where the controller determines the path information of each path according to the communication relationship between network cards reported by a communication component.
6. The method according to claim 1, characterized in that the step of adjusting the initial sending window corresponding to each data stream according to the current congestion information to obtain the secondary refresh sending window corresponding to each data stream includes: determining, based on the current congestion information, the congestion factor of each path between the sending network card and the receiving network card in the AI network; calculating the sum, over all paths, of the product of each path's congestion factor and path bandwidth to obtain the real-time total bandwidth of the AI network; and adjusting the initial sending window corresponding to each data stream according to the real-time total bandwidth to obtain the secondary refresh sending window corresponding to each data stream.
7. The method according to claim 6, characterized in that the step of adjusting the initial sending window corresponding to each data stream according to the real-time total bandwidth to obtain the secondary refresh sending window corresponding to each data stream includes: calculating the ratio of the real-time total bandwidth to the static bandwidth to obtain a window coefficient; and calculating the product of the initial sending window corresponding to each data stream and the window coefficient to obtain the secondary refresh sending window corresponding to each data stream.
8. A congestion control apparatus, characterized in that the apparatus serves as a sending network card in an AI network and includes: an initial sending window determination module, used to determine the initial sending window corresponding to each data stream based on the trust value assigned to each data stream by a receiving network card; a packet spraying module, used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the initial sending window corresponding to each data stream; a congestion information detection module, used to detect the current congestion information of the AI network; and a secondary refresh sending window determination module, used to adjust, if the current congestion information indicates that the AI network is in a first congestion state, the initial sending window corresponding to each data stream according to the current congestion information, so as to obtain a secondary refresh sending window corresponding to each data stream, where the parameter used to characterize the first congestion state is greater than a first parameter threshold and less than a second parameter threshold; the packet spraying module is also used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the secondary refresh sending window corresponding to each data stream.
9. The apparatus according to claim 8, characterized in that the apparatus further includes: a tertiary refresh sending window determination module, used to adjust, if the current congestion information indicates that the AI network is in a second congestion state, the initial sending window corresponding to each data stream according to the congestion signal sent by the receiving network card, so as to obtain a tertiary refresh sending window corresponding to each data stream, where the parameter used to characterize the second congestion state is greater than or equal to the second parameter threshold; the packet spraying module is also used to send the data packets included in each data stream to the receiving network card in a packet spraying manner according to the tertiary refresh sending window corresponding to each data stream.
10. The apparatus according to claim 8, characterized in that the congestion information detection module is specifically used to: detect the current congestion information of each path between the sending network card and the receiving network card in the AI network.
11. The apparatus according to claim 10, characterized in that the congestion information detection module is specifically used to: construct, based on the path information of each path between the sending network card and the receiving network card in the AI network, the probe flow corresponding to each path; and send, along each path, the probe packets included in the probe flow corresponding to that path to the receiving network card to obtain the current congestion information of each path.
12. The apparatus according to claim 10, characterized in that the apparatus further includes: a path information receiving module, used to receive, before the probe flow corresponding to each path is constructed, the path information of each path between the sending network card and the receiving network card issued by a controller in the AI network, where the controller determines the path information of each path according to the communication relationship between network cards reported by a communication component.
13. The apparatus according to claim 8, characterized in that the secondary refresh sending window determination module is specifically used to: determine, based on the current congestion information, the congestion factor of each path between the sending network card and the receiving network card in the AI network; calculate the sum, over all paths, of the product of each path's congestion factor and path bandwidth to obtain the real-time total bandwidth of the AI network; and adjust the initial sending window corresponding to each data stream according to the real-time total bandwidth to obtain the secondary refresh sending window corresponding to each data stream.
14. The apparatus according to claim 13, characterized in that the secondary refresh sending window determination module is specifically used to: calculate the ratio of the real-time total bandwidth to the static bandwidth to obtain a window coefficient; and calculate the product of the initial sending window corresponding to each data stream and the window coefficient to obtain the secondary refresh sending window corresponding to each data stream.
15. A server, characterized in that the server includes a network interface card (NIC) and a machine-readable storage medium, the machine-readable storage medium stores machine-executable instructions that can be executed by the NIC, and the machine-executable instructions cause the NIC to implement the method of any one of claims 1-7.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a network card, implements the method of any one of claims 1-7.
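The threshold test of claims 1 and 2 and the window arithmetic of claims 6 and 7 can be illustrated with a minimal Python sketch. It is an illustration only, not the claimed implementation, and it assumes several details the claims leave open: the names `Path`, `classify_congestion`, `real_time_total_bandwidth`, and `secondary_refresh_windows` are hypothetical; the congestion factor is assumed to lie in [0, 1], with 1.0 meaning an uncongested path; the static bandwidth is supplied by the caller; and a parameter at or below the first threshold is treated as "no adjustment", a case the claims do not address.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Path:
    """One path between the sending and receiving network cards (hypothetical type)."""
    bandwidth: float          # path bandwidth, in any consistent unit
    congestion_factor: float  # ASSUMPTION: in [0, 1], where 1.0 means uncongested


def classify_congestion(parameter: float,
                        first_threshold: float,
                        second_threshold: float) -> str:
    """Map the congestion parameter to a state using the two thresholds.

    "first":  first_threshold < parameter < second_threshold (claim 1)
    "second": parameter >= second_threshold (claim 2)
    "none":   ASSUMPTION for parameter <= first_threshold (not covered by the claims)
    """
    if parameter >= second_threshold:
        return "second"
    if parameter > first_threshold:
        return "first"
    return "none"


def real_time_total_bandwidth(paths: List[Path]) -> float:
    """Sum over all paths of congestion factor times path bandwidth (claim 6)."""
    return sum(p.congestion_factor * p.bandwidth for p in paths)


def secondary_refresh_windows(initial_windows: Dict[str, float],
                              paths: List[Path],
                              static_bandwidth: float) -> Dict[str, float]:
    """Scale each stream's initial sending window by the window coefficient (claim 7).

    window coefficient = real-time total bandwidth / static bandwidth
    """
    coefficient = real_time_total_bandwidth(paths) / static_bandwidth
    return {stream: window * coefficient
            for stream, window in initial_windows.items()}


if __name__ == "__main__":
    # Two equal paths; the second is assumed to be half congested.
    paths = [Path(bandwidth=100.0, congestion_factor=1.0),
             Path(bandwidth=100.0, congestion_factor=0.5)]
    initial = {"stream_a": 64.0, "stream_b": 32.0}
    if classify_congestion(0.3, first_threshold=0.2, second_threshold=0.8) == "first":
        print(secondary_refresh_windows(initial, paths, static_bandwidth=200.0))
```

In this example the real-time total bandwidth is 100 × 1.0 + 100 × 0.5 = 150, the window coefficient is 150 / 200 = 0.75, and every stream's window shrinks by the same proportion, so a single congested path lowers all windows slightly. The second congestion state is only classified here; the adjustment driven by the receiving network card's congestion signal is not sketched, since the claims do not specify its arithmetic.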