A method and apparatus for link switching between server nodes and an electronic device

By dynamically adjusting the keep-alive duration and using probe messages, the method solves the communication link failure problem caused by a fixed keep-alive duration, maintaining the continuity of communication services between server nodes and making efficient use of network resources.

CN116846822B · Active · Publication Date: 2026-06-19 · MACROSAN TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: MACROSAN TECH
Filing Date: 2023-06-05
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, fixed keep-alive duration cannot adapt to network changes, resulting in excessively long failure detection time for communication links or frequent switching, which affects the continuity of communication services and the utilization of network bandwidth resources.

Method used

The keep-alive duration is adjusted dynamically based on the size of the target data processing message and the average message response time. Combined with probe messages, this allows link failures to be detected dynamically and communication to be switched to a backup link, ensuring the continuity of the communication link.

Benefits of technology

It enables link switching to adapt to network changes, improves the continuity of communication services and the effective utilization of network resources, reduces the number of probe packets, and avoids communication service interruptions.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a method, apparatus, and electronic device for link switching between server nodes. In this application, the keep-alive duration, determined based on the current size of the data processing message and the average message response time over a recent period, can adapt to network changes. Using a keep-alive duration adapted to network changes allows for more accurate determination of whether a communication link between server nodes is faulty, thereby ensuring the continuity of communication services between server nodes. This application also sets a shorter second keep-alive duration adapted to network changes to further detect whether a link between server nodes is faulty. This method can improve the accuracy of fault detection in communication links between server nodes, further ensuring the continuity of communication services between server nodes.

Description

Technical Field

[0001] This application relates to the field of data storage, and in particular to a method, apparatus, and electronic device for link switching between server nodes.

Background Technology

[0002] In a distributed storage system, server nodes communicate with each other via communication links. To ensure network redundancy and eliminate single points of failure, multiple physical links are typically established between server nodes. In practical applications, multiple links can be managed by software. When a link in the data transmission between server nodes fails, the communication link can be switched from the failed link to the normal link to ensure normal communication between the server nodes.

[0003] Typically, communication link failures are detected by probe messages. For example, if server node 1 sends a data packet to server node 2 and server node 1 does not receive a response message from server node 2 within the keep-alive period, it is determined that the communication link between server node 1 and server node 2 has failed.

[0004] However, in related technologies, keep-alive duration is usually set to a fixed duration. A fixed keep-alive duration cannot adapt to network changes, thus affecting the continuity of communication services. For example, if the keep-alive duration is set too long, the detection time for communication link failures is longer, leading to communication service interruptions and reducing the service processing performance of upper-layer applications (e.g., databases, file systems). If the keep-alive duration is set too short, even a slight network delay during data transmission will lead to the link being considered faulty, resulting in frequent link switching and an increased number of probe packets. A large number of probe packets will consume significant network bandwidth resources, similarly impacting the service processing performance of upper-layer applications.

Summary of the Invention

[0005] In view of this, embodiments of this application provide a method, apparatus, and electronic device for link switching between server nodes to ensure the continuity of communication services between server nodes.

[0006] According to a first aspect of the embodiments of this application, a link switching method between server nodes is provided, applied to a first server node in a distributed all-flash storage system. The first server node communicates with a second server node through a communication link, the communication link including at least a first link and a second link. The method includes: sending a target data processing message to the second server node through the first link at the current moment; determining a first keep-alive duration corresponding to the target data processing message based on the size of the target data processing message and the average message response time, wherein the average message response time is the average message response time of the data processing messages sent by the first server node within a preset duration before the sending time of the target data processing message; if the first server node does not receive the data processing response message for the target data processing message from the second server node through the first link within the first keep-alive duration, sending a probe message to the second server node through the first link; and if the first server node does not receive the probe response message for the probe message from the second server node through the first link within the second keep-alive duration, determining that the first link is faulty and switching the communication link between the first server node and the second server node from the first link to the second link. The second keep-alive duration is determined by the average message response time, and the second keep-alive duration is less than the first keep-alive duration.

[0007] According to a second aspect of the embodiments of this application, a link switching device between server nodes is provided, applied in a first server node of a distributed all-flash storage system. The first server node communicates with a second server node through a communication link, the communication link including at least a first link and a second link. The device includes: a keep-alive duration determination module, configured to send a target data processing message to the second server node through the first link at the current time, and determine a first keep-alive duration corresponding to the target data processing message based on the size of the target data processing message and the average message response time, wherein the average message response time is the average message response time of the data processing messages sent by the first server node within a preset duration before the first server node sends the target data processing message; a probe message sending module, configured to send a probe message to the second server node through the first link when the first server node does not receive a data processing response message for the target data processing message from the second server node through the first link within the first keep-alive duration; and a link switching module, configured to determine that the first link is faulty and switch the communication link between the first server node and the second server node from the first link to the second link if the first server node does not receive a probe response message for the probe message from the second server node through the first link within the second keep-alive duration, wherein the second keep-alive duration is determined by the average message response time, and the second keep-alive duration is less than the first keep-alive duration.

[0008] According to a third aspect of the embodiments of this application, an electronic device is provided, the electronic device comprising: a processor and a memory; wherein the memory is configured to store machine-executable instructions; and the processor is configured to read and execute the machine-executable instructions stored in the memory to implement the method as described in the first aspect.

[0009] According to a fourth aspect of the embodiments of this application, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the method described in the first aspect.

[0010] The technical solutions provided in this application embodiment may include the following beneficial effects:

[0011] In this embodiment, the keep-alive duration is determined based on the size of the current data processing message and the average message response time over a recent period (i.e., a preset period before the transmission time). This keep-alive duration is then used to determine whether a link between the two server nodes is faulty. The average message response time over the recent period characterizes network changes; therefore, this application uses a keep-alive duration adapted to network changes to determine whether to switch the link between server nodes. By using a keep-alive duration adapted to network changes, the presence of communication link faults can be determined more accurately, thereby ensuring the continuity of communication services.

[0012] Furthermore, if the first server node does not receive a data processing response message within the first keep-alive duration, it cannot be determined whether the communication link has failed or the data processing performance of the second server node has degraded. To address this, this application also sets a shorter second keep-alive duration to detect whether there is a link failure between the two server nodes. This method improves the accuracy of determining communication link failures between server nodes and further ensures the continuity of communication services.

Description of the Drawings

[0013] Figure 1 is a structural diagram of a distributed storage system shown in an embodiment of this application.

[0014] Figure 2 is a flowchart illustrating a link switching method between server nodes according to an embodiment of this application.

[0015] Figure 3 is an overall flowchart illustrating a link switching process between server nodes, as shown in an embodiment of this application.

[0016] Figure 4 is a block diagram illustrating a link switching device between server nodes according to an embodiment of this application.

[0017] Figure 5 is a hardware structure diagram of an electronic device containing a link switching device between server nodes in an embodiment of this application.

Detailed Implementation

[0018] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0019] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

[0020] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

[0021] The embodiments described in this specification will now be described in detail.

[0022] Before describing the methods provided in the embodiments of this application, a structural diagram of a distributed storage system capable of implementing the methods provided in the embodiments of this application is first described:

[0023] Referring to Figure 1, Figure 1 is a structural diagram of a distributed storage system capable of implementing the methods provided in the embodiments of this application.

[0024] As shown in Figure 1, the distributed storage system is equipped with multiple server nodes. There is a communication link between any two server nodes, and the two server nodes communicate with each other through the communication link. The communication link between the server nodes consists of multiple links. In the actual communication process, if a link between the server nodes fails, communication can be switched from the faulty link to another link to ensure uninterrupted communication between the server nodes.

[0025] As an example, in the storage system shown in Figure 1, data is typically transferred between server nodes to achieve data redundancy protection during data storage. For example, in Figure 1, when server node A receives a data write command, it writes the data to be written to a storage unit in its corresponding storage area and marks that storage unit as LUN-1. Simultaneously, server node A also transmits the data to be written to server node B via a communication link in the form of a message. Upon receiving the message from server node A, server node B reads the data to be written from the message and writes it to a storage unit in its corresponding storage area, marking the storage unit containing the data as LUN-1. Therefore, if server node A's storage area fails and becomes unusable, or if damage occurs leading to data loss, data can be retrieved from server node B's storage area based on the marking of the storage unit, thus achieving data redundancy protection.

[0026] In the aforementioned application scenarios, server nodes need to communicate with each other via communication links. Therefore, ensuring the redundancy of network connections between server nodes and eliminating single points of failure is of great significance. Typically, two physical links are deployed between server nodes to ensure network redundancy. However, if a faulty link between server nodes is not detected in a timely manner, it can lead to interruptions in communication services. Therefore, the method provided in this application embodiment can promptly detect faulty links between server nodes and switch links in a timely manner, avoiding communication service interruptions and improving the continuity of communication services.

[0027] In this embodiment, the distributed storage system can be a distributed all-flash storage system or a non-all-flash storage system. All-flash storage consists of independent storage arrays or devices composed of solid-state storage media, offering high processing performance and low latency fluctuations. Therefore, when multiple links are transmitting data between server nodes, a distributed all-flash storage system can quickly detect link failures and rapidly switch links, thereby improving the continuity of communication services between server nodes.

[0028] The following explains how the method provided in the embodiments of this application enables timely switching of links between server nodes.

[0029] Referring to Figure 2, Figure 2 is a flowchart illustrating a link switching method between server nodes according to an embodiment of this application. As described above, this method is applied to a distributed storage system, specifically to any server node (referred to as the first server node in this embodiment) in a distributed all-flash storage system.

[0030] As shown in Figure 2, the method includes the following steps:

[0031] Step 201: At the current moment, send the target data processing message to the second server node through the first link, and determine the first keep-alive duration corresponding to the target data processing message based on the size of the target data processing message and the average message response time.

[0032] In this embodiment, when the first server node receives a data write instruction, it stores the data to be written into the storage area corresponding to the first server node. At the same time, it constructs a data processing message (i.e., the target data processing message mentioned above) based on the data to be written, and sends the target data processing message carrying the data to be written to the second server node through the first link, so that the second server node can obtain the data to be written from the target data processing message and write it into the storage area of the second server node.

[0033] In addition, in this embodiment, the keep-alive duration is the maximum time the first server node waits to receive a response message from the second server node. That is, after the first server node sends a data processing message to the second server node, the second server node must send a response message for the data processing message within the keep-alive duration for the connection between the two server nodes to be considered intact. If the second server node fails to send a response message within the keep-alive duration, the connection between the two server nodes may be faulty.

[0034] As an example, the first keep-alive duration is variable, not fixed. Specifically, each time the first server node sends a data processing message to the second server node, the first server node calculates the first keep-alive duration corresponding to that data processing message. Alternatively, the first server node can calculate the first keep-alive duration at regular intervals (e.g., a preset duration). The first server node can also determine the calculation frequency of the first keep-alive duration based on the number of data processing messages it sends to the second server node through the first link. For example, when the number of data processing messages sent by the first server node to the second server node through the first link reaches a preset number, the first keep-alive duration is calculated. This embodiment does not specifically limit the calculation frequency of the first keep-alive duration.
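The three recalculation triggers mentioned above (every message, a fixed interval, or a message count) can be sketched as follows. This is only an illustration: the class name `RecalcPolicy`, its mode strings, and its default values are hypothetical and do not come from the application.

```python
from typing import Optional


class RecalcPolicy:
    """Decides when the first keep-alive duration is recalculated (hypothetical helper)."""

    def __init__(self, mode: str, interval_s: float = 1.0, every_n: int = 10):
        if mode not in ("per_message", "interval", "count"):
            raise ValueError("unknown mode: " + mode)
        self.mode = mode
        self.interval_s = interval_s   # used by "interval" mode
        self.every_n = every_n         # used by "count" mode
        self._last_recalc: Optional[float] = None
        self._sent_since_recalc = 0

    def on_send(self, now: float) -> bool:
        """Call once per data processing message sent; True means recalculate now."""
        self._sent_since_recalc += 1
        if self.mode == "per_message":
            return True
        if self.mode == "interval":
            if self._last_recalc is None or now - self._last_recalc >= self.interval_s:
                self._last_recalc = now
                return True
            return False
        # "count" mode: recalculate once the preset number of messages has been sent
        if self._sent_since_recalc >= self.every_n:
            self._sent_since_recalc = 0
            return True
        return False
```

With `RecalcPolicy("count", every_n=3)`, every third call to `on_send` returns `True`.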

[0035] In addition, as described above, in this embodiment, the first keep-alive duration is related to the size of the data processing message sent by the first server node to the second server node and the average message response duration of the data messages within a recent period of time. The average message response duration is the average message response duration of the data processing messages sent by the first server node within a preset period of time before the first server node sends the target data processing message.

[0036] It should be noted that the average message response time over a recent period reflects network changes. In other words, this application uses a keep-alive duration adapted to network changes to determine whether to switch links between server nodes. By using a keep-alive duration adapted to network changes, it is possible to more accurately and promptly determine whether there are link failures between server nodes, thereby ensuring the continuity of communication services.

[0037] As an example, the first server node can determine the first keep-alive duration by statistically analyzing the average message response time of the data processing response messages received by the first server node from the second server node via the first link within a preset time period before the sending time, and the average message size of the data processing messages sent by the first server node via the first link within the preset time period before the sending time. The first keep-alive duration can satisfy the following formula:

[0038]

[0039] In the above formula, T1 is the first keep-alive duration; T′ is the average message response time; D is the size of the target data processing message; D′ is the average message size; N1 is the adjustment coefficient, where N1 is a positive integer greater than or equal to 1, such as 3, 5, etc. The specific value can be set according to actual needs, and this embodiment does not impose specific limitations.
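The formula in [0038] is not reproduced in this text, so its exact form cannot be confirmed here. Purely as an illustration of how the listed symbols could combine, the sketch below assumes a proportional form, T1 = T′ × (D / D′) × N1, in which the average response time is scaled by how large the target message is relative to the average message. Both this form and the function name `first_keep_alive` are assumptions, not the application's stated formula.

```python
def first_keep_alive(avg_response: float, msg_size: float,
                     avg_msg_size: float, n1: int = 3) -> float:
    """ASSUMED form T1 = T' * (D / D') * N1; the source formula is not available.

    avg_response  -- T', average message response time (any time unit)
    msg_size      -- D, size of the target data processing message
    avg_msg_size  -- D', average message size over the same recent period
    n1            -- N1, adjustment coefficient, an integer >= 1
    """
    if avg_msg_size <= 0:
        raise ValueError("average message size must be positive")
    if n1 < 1:
        raise ValueError("N1 must be an integer >= 1")
    return avg_response * (msg_size / avg_msg_size) * n1


# T' = 2.0 ms, D = 8192 bytes, D' = 4096 bytes, N1 = 3
print(first_keep_alive(2.0, 8192, 4096, 3))  # 12.0
```

Under this assumed form, a message twice the average size is given twice the base allowance before the adjustment coefficient is applied.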

[0040] As an example, the average message response time mentioned above is the average response time of the data processing messages sent by the first server node within the most recent time period.

[0041] Specifically, firstly, based on the message transmission time of the data processing message sent by the first server node through the first link within a preset time period before the transmission time, and the message reception time of the data processing response message received by the first server node from the second server node through the first link, the message response duration of the data processing message sent by the first server node to the second server node within the preset time period is determined. For example, after the first server node sends data processing message 1 to the second server node through the first link, the first server node records the transmission time of data processing message 1 and the reception time of the response message received by the first server node from the second server node in response to data processing message 1. The interval between the transmission time and the reception time is the message response duration corresponding to data processing message 1.

[0042] Then, the first server node calculates the average message response time of all data processing messages sent by the first server node to the second server node through the first link within a preset time period before the sending time, and thus obtains the average message response time.

[0043] Similar to the average message response time, the average message size is calculated by statistically analyzing the sizes of data processing messages sent by the first server node to the second server node via the first link within a preset time period prior to the sending time, and then averaging the sizes of these data processing messages within that preset time period. For example, if the first server node sent 10 data processing messages to the second server node via the first link within a recent period (i.e., the preset time period prior to the sending time mentioned above), then the average size of these 10 data processing messages is the average message size.
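The statistics described in [0041] to [0043] can be sketched as a sliding window over recently sent messages. The class name `LinkStats` and its methods are hypothetical, and the sketch assumes that samples are recorded in send-time order.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class LinkStats:
    """Sliding-window statistics for one link (hypothetical helper)."""

    def __init__(self, window: float):
        self.window = window  # the preset duration, same unit as the timestamps
        # each sample: (send_time, response_duration, message_size)
        self._samples: Deque[Tuple[float, float, int]] = deque()

    def record(self, send_time: float, recv_time: float, size: int) -> None:
        """Store one answered data processing message (assumed to arrive in send order)."""
        self._samples.append((send_time, recv_time - send_time, size))

    def averages(self, now: float) -> Optional[Tuple[float, float]]:
        """Return (average response duration, average size) over the window, or None."""
        cutoff = now - self.window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()  # drop messages sent before the window
        if not self._samples:
            return None
        n = len(self._samples)
        avg_response = sum(s[1] for s in self._samples) / n
        avg_size = sum(s[2] for s in self._samples) / n
        return avg_response, avg_size


stats = LinkStats(window=10.0)
stats.record(send_time=0.0, recv_time=1.0, size=100)  # falls outside the window below
stats.record(send_time=5.0, recv_time=7.0, size=300)
stats.record(send_time=8.0, recv_time=9.0, size=200)
print(stats.averages(now=12.0))  # (1.5, 250.0)
```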

[0044] Step 202: If the first server node does not receive a data processing response message for the target data processing message from the second server node via the first link within the first keep-alive period, then a probe message is sent to the second server node via the first link.

[0045] In the above description, the probe message is a message that does not carry data that needs to be processed by the second server node. That is, after the second server node receives the probe message, it does not need to process the probe message in any way, but only needs to send a probe response message back to the first server node.

[0046] The data processing message carries data to be processed. This message needs to be processed by the second server node, and a response message will only be sent back to the first server node after the second server node has completed its processing. For example, the second server node may parse the data processing message, read the data to be processed from the parsing result, and store the data to be processed in its own storage area. Alternatively, the second server node may perform calculations on the data to be processed (e.g., encrypted calculations), store the calculated data in its own storage area, generate a response message, and send it back to the first server node.
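The difference between the two message types on the second server node can be sketched as follows. The `Message` fields, the `handle` function, and the dictionary standing in for the storage area are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Message:
    kind: str                   # "probe", "data", "probe_response" or "data_response"
    msg_id: int
    lun: Optional[str] = None   # storage unit marker, e.g. "LUN-1"
    payload: bytes = b""


def handle(msg: Message, storage: Dict[str, bytes]) -> Message:
    """Second-node handler: a probe is answered at once, data is processed first."""
    if msg.kind == "probe":
        # No processing at all: reply immediately.
        return Message("probe_response", msg.msg_id)
    if msg.kind == "data":
        if msg.lun is None:
            raise ValueError("data message needs a storage unit marker")
        # Processing step (here: a plain write) completes before the reply is built.
        storage[msg.lun] = msg.payload
        return Message("data_response", msg.msg_id)
    raise ValueError("unexpected message kind: " + msg.kind)
```

Because the probe path does no work, its response delay reflects the link rather than the second node's processing load, which is why a shorter timeout can be applied to it.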

[0047] In this embodiment, the fact that the first server node does not receive a data processing response message from the second server node for the target data processing message within the first keep-alive duration does not necessarily indicate a fault in the first link between the first and second server nodes. It could also be that the second server node's data processing performance is reduced, preventing it from processing the data processing message in a timely manner and thus from returning the response message within the first keep-alive duration. To determine the true reason why the first server node did not receive the data processing response message, the first server node sends a probe message to the second server node through the first link to detect whether there is a fault in the first link between the two server nodes.

[0048] Step 203: If the first server node does not receive a probe response message from the second server node via the first link within the second keep-alive period, then it is determined that the first link is faulty, and the communication link between the first server node and the second server node is switched from the first link to the second link.

[0049] In this embodiment, the second keep-alive duration is determined solely by the average message response time and is independent of the size of the target data processing message. That is, the second keep-alive duration can be determined by the following formula:

[0050] T2 = T′ × N2

[0051] In the above formula, T2 is the second keep-alive duration; T′ is the average message response time; N2 is the adjustment coefficient, where N2 is a positive integer greater than or equal to 1, such as 3, 5, etc. N2 can be the same as or different from N1. The specific value can be set according to actual needs, and this embodiment does not impose specific limitations.
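As a small worked example of the second keep-alive duration, the sketch below multiplies the average message response time by the adjustment coefficient. The function name `second_keep_alive` is hypothetical, and the formula is read as a plain product of T′ and N2.

```python
def second_keep_alive(avg_response: float, n2: int = 3) -> float:
    """T2 = T' * N2; independent of the size of the target data processing message."""
    if n2 < 1:
        raise ValueError("N2 must be an integer >= 1")
    return avg_response * n2


# T' = 2.0 ms and N2 = 3 give a probe timeout of 6.0 ms
print(second_keep_alive(2.0, 3))  # 6.0
```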

[0052] It should be noted that, as can be seen from the calculation methods of the first and second keep-alive durations, the second keep-alive duration can also characterize network changes, and the first keep-alive duration is longer than the second keep-alive duration. Furthermore, since the second server node does not need to process the probe message when responding to it, but can directly send a probe response message back to the first server node, if the communication link between the two server nodes is normal, the second server node will send back a probe response message in a shorter time. Therefore, in this application, a shorter second keep-alive duration is set to detect whether there is a fault in the communication link between the two server nodes. This method can improve the accuracy of judging the fault in the communication link between server nodes and further ensure the continuity of communication services.

[0053] Furthermore, in this embodiment, the second link can be a link between the first server node and the second server node that is in an idle state and can transmit data normally. For example, a link between the two server nodes that does not transmit any data; or a link between the two server nodes that transmits other data but is currently in an idle state.

[0054] As an example, after determining that the first link between server nodes is faulty, the first server node uses the second link that can communicate normally to communicate with the second server node. This enables timely switching between the faulty link and the normal link when a fault is detected between the server nodes, ensuring normal communication between the server nodes and avoiding interruptions in communication services between the server nodes.

[0055] This concludes the description of the process shown in Figure 2.

[0056] The process shown in Figure 2 enables timely and accurate identification of faulty links between server nodes, as well as timely switching of faulty links, ensuring the continuity of communication services between server nodes.

[0057] The method provided in this embodiment is explained below with reference to the overall link switching process between server nodes shown in Figure 3.
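The decision logic of steps 201 to 203 can be summarized in a short sketch. The names `Outcome` and `evaluate_first_link` are hypothetical, and observed delays are passed in directly (with `None` meaning no response arrived) rather than measured on a real network.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    LINK_OK = "data response arrived in time"
    PEER_SLOW = "probe answered in time; link is fine, peer is slow"
    SWITCHED = "first link faulty; switched to second link"


def evaluate_first_link(data_resp_delay: Optional[float],
                        probe_resp_delay: Optional[float],
                        t1: float, t2: float) -> Outcome:
    """Apply steps 201-203 to observed delays (None = response never arrived)."""
    # Step 201: the data processing response must arrive within T1.
    if data_resp_delay is not None and data_resp_delay <= t1:
        return Outcome.LINK_OK
    # Step 202: T1 expired, so a probe message is sent on the first link.
    # Step 203: the probe response must arrive within the shorter T2.
    if probe_resp_delay is not None and probe_resp_delay <= t2:
        return Outcome.PEER_SLOW
    return Outcome.SWITCHED


print(evaluate_first_link(None, 2.0, t1=12.0, t2=6.0).name)   # PEER_SLOW
print(evaluate_first_link(None, None, t1=12.0, t2=6.0).name)  # SWITCHED
```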

[0058] As shown in Figure 3, when the first server node sends a target data processing message to the second server node through the first link, the first server node calculates the first keep-alive duration corresponding to the target data processing message and checks whether it receives a data processing response message for the target data processing message returned by the second server node through the first link within the first keep-alive duration. If the first server node receives the data processing response message for the target data processing message from the second server node within the first keep-alive duration, it is determined that the first link between the two server nodes is in a normal state; if the first server node does not receive the data processing response message for the target data processing message from the second server node within the first keep-alive duration, the first link between the two server nodes may be faulty. To accurately determine the reason why the first server node did not receive the data processing response message within the first keep-alive duration, the first server node sends a probe message to the second server node through the first link to determine whether the failure to receive the data processing response message within the first keep-alive duration is due to a network failure.

[0059] Referring to Figure 3, before sending a probe message to the second server node via the first link, the first server node checks whether network probing has already been initiated on the first link, i.e., whether the first link is currently being probed. If so, the remaining probe duration of the first link is determined and extended to a preset duration.

[0060] Here, the remaining probe duration is the difference between the probe duration corresponding to the first link and the time for which the first link has already been probed. For example, if the probe duration corresponding to the first link is 100 ms and network probing has been running on the first link for 90 ms, the remaining probe duration for the first link is 10 ms. The probe duration corresponding to the first link can be the second keep-alive duration; for example, when the first server node initiates network probing for the second time, the probe duration corresponding to the first link is the second keep-alive duration. Alternatively, the probe duration corresponding to the first link can be the duration to which the remaining probe duration was extended. For example, if the preset duration is 1 second, the probe duration corresponding to the first link is 100 ms, and network probing has been running on the first link for 90 ms, the remaining probe duration is 10 ms. After this remaining probe duration is extended to 1 second, the probe duration corresponding to the probe message sent by the first server node this time is 1 second, i.e., the preset duration.
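The arithmetic in this example can be checked with a few lines. The function names are hypothetical, and leaving a remainder that already exceeds the preset duration unchanged is an assumption; the text only describes extending a short remainder.

```python
def remaining_probe_ms(probe_duration_ms: int, elapsed_ms: int) -> int:
    """Time left in the current probe: probe duration minus time already probed."""
    return max(probe_duration_ms - elapsed_ms, 0)


def extended_remaining_ms(probe_duration_ms: int, elapsed_ms: int,
                          preset_ms: int) -> int:
    """Remaining probe time after extension to the preset duration.

    Assumption: a remainder already longer than the preset is kept as it is.
    """
    return max(remaining_probe_ms(probe_duration_ms, elapsed_ms), preset_ms)


print(remaining_probe_ms(100, 90))           # 10
print(extended_remaining_ms(100, 90, 1000))  # 1000
```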

[0061] It should be noted that, when network probing is found to have already started on the first link, extending the remaining probe duration of the first link ensures that, whenever the first server node detects that the second server node has not returned a data processing response message within the first keep-alive duration, there is sufficient time to probe the first link. This avoids failing to detect a link failure in time because too little probing time remains.

[0062] For example, suppose the network probing duration is 100 ms. When the first server node detects that the second server node has not returned a data processing response message within the first keep-alive duration, it initiates network probing on the first link. Meanwhile, the first server node continues to send other data processing messages to the second server node. If network probing on the first link has been in progress for 99 ms when the first server node detects that the second server node has not returned a data processing response message for one of these other data processing messages within the first keep-alive duration, the first server node will not restart network probing, because the first link is already being probed. Probing would continue only for the remaining 1 ms, and with so little time left a network fault may go undetected. Therefore, in this application, when probing on the first link is about to end, the remaining probe duration is determined and extended.

[0063] Furthermore, as shown in Figure 3, when the first link is found not to be in the probing state and the first server node does not receive the data processing response message from the second server node within the first keep-alive duration, the first server node sends a probe message to the second server node through the first link to initiate network probing of the first link.

[0064] Referring to Figure 3, if the first server node receives a probe response message from the second server node within the second keep-alive duration, it determines that the first link is in a normal state. The failure of the first server node to receive a data processing response message from the second server node within the first keep-alive duration may then be due to poor data processing performance of the second server node. In this case, within a preset duration after the first server node receives the probe response message, it sends a probe message to the second server node via the first link once every second keep-alive duration. For example, if the first server node receives a probe response message from the second server node within the second keep-alive duration, it sends the next probe message to the second server node via the first link after one second keep-alive duration has elapsed, and continues in this way for the preset duration.

[0065] If, throughout the preset duration after receiving the first probe response message, the first server node receives each probe response message from the second server node within the second keep-alive duration, this indicates that the first link between the two server nodes can transmit data stably and is in a normal state, and the first server node stops sending probe messages to the second server node through the first link.

[0066] It should be noted that in practical applications, when the first server node detects that the second server node has failed to promptly return a data processing response message for the data processing message, the first server node sends a probe message to the second server node through the first link. If the first server node sends multiple data processing messages to the second server node, and the second server node fails to return the corresponding data processing response message in a timely manner for any of them, the first server node will send a probe message to the second server node for each data processing message, and the second server node will also send back a probe response message for each probe message. When the first server node sends a large number of probe messages, the second server node will also send back a large number of probe response messages, and the feedback of a large number of probe response messages will also consume network bandwidth resources.

[0067] In this embodiment, within a preset duration (e.g., 1 second) after determining that the first link is in a normal state, the first server node no longer sends a probe message for each data processing message. Instead, it sends probe messages to the second server node through the first link periodically within the preset duration. For example, within the preset duration, the first server node sends a probe message to the second server node every 100 ms (i.e., every second keep-alive duration). This reduces the number of probe messages sent, which in turn reduces the number of probe response messages sent by the second server node and avoids a large number of probe response messages crowding out network bandwidth resources.
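The periodic schedule described above can be illustrated with a short sketch. The function name is hypothetical, and whether a probe is still sent at the exact end of the preset duration is an assumption; the text only says probes are sent once every second keep-alive duration within the preset duration.

```python
from __future__ import annotations


def probe_send_offsets_ms(second_keepalive_ms: int, preset_ms: int) -> list[int]:
    """Offsets (ms after the first probe response) at which follow-up probes are sent."""
    if second_keepalive_ms <= 0:
        raise ValueError("second keep-alive duration must be positive")
    # Assumption: the end of the preset window is inclusive.
    return list(range(second_keepalive_ms, preset_ms + 1, second_keepalive_ms))


# Example values from the text: probe every 100 ms for 1 second.
offsets = probe_send_offsets_ms(100, 1000)
print(len(offsets), offsets[0], offsets[-1])  # 10 100 1000
```

Under these example values at most ten probe messages are sent during the preset duration, regardless of how many data processing messages time out in the same period.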

[0068] Referring to Figure 3, if the first server node does not receive a probe response message within the second keep-alive duration, it determines that the first link is faulty and marks it as a faulty link. The first server node then stops sending probe messages to the second server node through the first link and communicates with the second server node through the second link instead.
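The two-stage decision of Figure 3 (data response timeout, then probe timeout, then switchover) can be summarized in a small sketch. All names here are hypothetical, `None` stands for "no response arrived", and treating a response that arrives exactly at the deadline as in time is an assumption.

```python
from enum import Enum
from typing import Optional, Tuple


class LinkState(Enum):
    NORMAL = "normal"
    FAULTY = "faulty"


def evaluate_first_link(
    data_response_ms: Optional[float],
    probe_response_ms: Optional[float],
    first_keepalive_ms: float,
    second_keepalive_ms: float,
) -> Tuple[LinkState, str]:
    """Return the state of the first link and the link to use afterwards."""
    # Stage 1: the data processing response arrived within the first keep-alive duration.
    if data_response_ms is not None and data_response_ms <= first_keepalive_ms:
        return LinkState.NORMAL, "first"
    # Stage 2: a probe was sent; its response arrived within the second keep-alive duration.
    if probe_response_ms is not None and probe_response_ms <= second_keepalive_ms:
        return LinkState.NORMAL, "first"
    # Neither response arrived in time: mark the first link faulty and switch.
    return LinkState.FAULTY, "second"


print(evaluate_first_link(None, None, 200, 100))
```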

[0069] This concludes the description of the process shown in Figure 3.

[0070] The process shown in Figure 3 not only detects accurately and promptly whether the link between two server nodes is faulty, but also switches away from the faulty link promptly once a fault is determined, thus ensuring the continuity of communication services between server nodes.

[0071] Furthermore, after determining that the first link is faulty, for example, after a period of time following the determination of the fault in the first link, the first server node continues to initiate network probing on the first link to detect whether the first link has returned to normal.

[0072] Specifically, if a first link connecting the first server node and the second server node is detected, the first server node sends a probe message to the second server node through the first link once every preset link probe duration. If the first server node keeps receiving, each within the preset keep-alive duration, the probe response messages returned by the second server node through the first link for these probe messages, and this continues for longer than the preset duration, then the first link is determined to be in a normal state.

[0073] In this embodiment, after the first link is determined to be faulty, the first server node and the second server node can still exchange data through the second link, which is in a normal state, so detecting whether the faulty link has been restored is of relatively low importance. Therefore, in this embodiment, the link probe duration and the preset keep-alive duration are usually set to relatively long values. The link probe duration and the preset keep-alive duration can be the same (for example, both set to 5 seconds) or different.
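One possible reading of this recovery check is sketched below: the faulty link is considered restored once in-time probe responses have continued, without interruption, for longer than the preset duration. This streak interpretation, the function name, and the sample values are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def first_link_recovered(
    probe_response_ms: Iterable[Optional[float]],
    link_probe_interval_ms: int,
    preset_keepalive_ms: float,
    preset_ms: int,
) -> bool:
    """True once in-time probe responses have continued for longer than preset_ms."""
    streak_ms = 0
    for response in probe_response_ms:
        if response is not None and response <= preset_keepalive_ms:
            # One more probe interval covered by an in-time response.
            streak_ms += link_probe_interval_ms
            if streak_ms > preset_ms:
                return True
        else:
            # A missing or late response resets the streak (assumption).
            streak_ms = 0
    return False


# 5 s probe interval and 5 s keep-alive, as in the example values; 10 s preset is illustrative.
print(first_link_recovered([100, 100, 100], 5000, 5000, 10000))
```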

[0074] In addition, after detecting that the first link is in a normal state, the first server node will continue to test the status of the first link to ensure its stability.

[0075] This concludes the description of the method provided in the embodiments of this application.

[0076] Corresponding to the embodiments of the aforementioned methods, this specification also provides embodiments of a link switching device between server nodes, the electronic equipment used therein, and a computer-readable storage medium.

[0077] Figure 4 is a block diagram of a link switching device between server nodes according to an embodiment of this application. As shown in Figure 4, the link switching device between server nodes is applied to a first server node in a distributed all-flash storage system. The first server node communicates with a second server node through a communication link, which includes at least a first link and a second link. The device includes: a keep-alive duration determination module, a probe message sending module, and a link switching module.

[0078] The keep-alive duration determination module is used to send a target data processing message to the second server node through the first link at the current time, and determine the first keep-alive duration corresponding to the target data processing message based on the size of the target data processing message and the average message response duration. The average message response duration is the average message response duration of the data processing messages sent by the first server node within a preset time before the first server node sends the target data processing message.

[0079] The probe message sending module is used to send a probe message to the second server node through the first link when the first server node does not receive a data processing response message for the target data processing message from the second server node through the first link during the first keep-alive period.

[0080] The link switching module is used to determine that the first link is faulty if the first server node does not receive a probe response message for the probe message from the second server node through the first link within the second keep-alive duration, and then switches the communication link between the first server node and the second server node from the first link to the second link. The second keep-alive duration is determined by the average message response duration and is less than the first keep-alive duration.

[0081] Optionally, the keep-alive duration determination module is specifically used to calculate the average message response time of the data processing response messages received by the first server node through the first link within a preset time period before the sending time, and the average message size of the data processing messages sent by the first server node through the first link within a preset time period before the sending time; calculate the ratio of the target data processing message size to the average message size, and calculate the product of the average message response time and the ratio to obtain the first keep-alive duration.
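The calculation performed by this module, the average message response time multiplied by the ratio of the target message size to the average message size, can be written as a one-line function. The function name and the sample numbers are illustrative assumptions.

```python
def first_keepalive_ms(target_size: int, avg_size: float, avg_response_ms: float) -> float:
    """First keep-alive duration = average response time * (target size / average size)."""
    if avg_size <= 0:
        raise ValueError("average message size must be positive")
    return avg_response_ms * (target_size / avg_size)


# A message twice the average size gets twice the average response time as its deadline.
print(first_keepalive_ms(8192, 4096, 20.0))  # 40.0
```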

[0082] Optionally, the link switching device between server nodes further includes a duration calculation module and a message size calculation module. Specifically, the duration calculation module is used to determine the message response duration of each data processing message sent by the first server node to the second server node through the first link within a preset duration before the sending time, based on the message sending time of that data processing message and the message reception time of the corresponding data processing response message received by the first server node from the second server node through the first link. The module also calculates the average of these message response durations to obtain the average message response duration.

[0083] The message size calculation module is specifically used to count the size of the data processing messages sent by the first server node to the second server node through the first link within a preset time period before the sending time, and to calculate the average size of the data processing messages within the preset time period before the sending time to obtain the average message size.
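The two statistics modules can be sketched together as a sliding window over recently sent messages. This is a hedged illustration only: the class name `LinkStats`, the millisecond timestamps, and the assumption that records are added in sending-time order are not taken from the source.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class LinkStats:
    """Sliding-window averages of response duration and message size for one link."""

    def __init__(self, window_ms: int) -> None:
        self.window_ms = window_ms
        # Each record: (send time in ms, response duration in ms, size in bytes).
        self._records: Deque[Tuple[int, int, int]] = deque()

    def record(self, send_ms: int, recv_ms: int, size_bytes: int) -> None:
        # Response duration = reception time of the response minus sending time.
        self._records.append((send_ms, recv_ms - send_ms, size_bytes))

    def averages(self, now_ms: int) -> Optional[Tuple[float, float]]:
        """Return (average response duration, average size) over the window, or None."""
        # Drop records sent before the start of the window (assumes send-time order).
        while self._records and self._records[0][0] < now_ms - self.window_ms:
            self._records.popleft()
        if not self._records:
            return None
        n = len(self._records)
        avg_response = sum(r[1] for r in self._records) / n
        avg_size = sum(r[2] for r in self._records) / n
        return avg_response, avg_size


stats = LinkStats(window_ms=1000)
stats.record(send_ms=0, recv_ms=30, size_bytes=1000)    # falls outside the window below
stats.record(send_ms=500, recv_ms=510, size_bytes=2048)
stats.record(send_ms=900, recv_ms=930, size_bytes=6144)
print(stats.averages(now_ms=1200))  # (20.0, 4096.0)
```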

[0084] Optionally, the link switching device between server nodes also includes a probe duration processing module, which is used to determine the remaining probe duration of the first link when the first link is in the probing state, and to extend the remaining probe duration to a preset duration, wherein the remaining probe duration is the difference between the probe duration required to perform network probing on the first link and the duration for which the first link has already been in the probing state.

[0085] Optionally, the link switching device between server nodes further includes a first message processing module and a second message processing module. Specifically, the first message processing module is used to determine that the first link is in normal condition when the first server node receives the probe response message fed back by the second server node in response to the probe message during the second keep-alive duration, and to send a probe message to the second server node through the first link once every second keep-alive duration within a preset duration after the time when the first server node receives the probe response message.

[0086] The second message processing module is specifically used to stop sending probe messages to the second server node through the first link when, throughout the preset duration after the first server node receives the probe response message, the first server node receives each probe response message from the second server node within the second keep-alive duration.

[0087] Optionally, the link switching device between server nodes also includes a third message processing module, which is used to stop sending probe messages to the second server node when the first server node does not receive a probe response message within the second keep-alive period.

[0088] Optionally, the link switching device between server nodes also includes a link status detection module, which is used to send a probe message to the second server node through the first link once every preset link probe duration when a first link connecting the first server node and the second server node is detected. If the first server node keeps receiving, each within the preset keep-alive duration, the probe response messages returned by the second server node through the first link, and this continues for longer than the preset duration, then the first link is determined to be in a normal state.

[0089] The specific implementation process of the functions and roles of each module in the above-mentioned device can be found in the implementation process of the corresponding steps in the link switching method between server nodes, and will not be repeated here.

[0090] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of the solution in this specification according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0091] Correspondingly, embodiments of this application also provide a hardware structure diagram of an electronic device, shown in Figure 5. The electronic device can be a device implementing the above-described method. As shown in Figure 5, the hardware structure includes a processor and a memory.

[0092] The memory is used to store machine-executable instructions;

[0093] The processor is used to read and execute machine-executable instructions stored in the memory to implement the link switching method embodiment between the corresponding server nodes as shown above.

[0094] As one embodiment, the memory can be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, etc. For example, the memory can be volatile memory, non-volatile memory, or similar storage media. Specifically, the memory can be RAM (Random Access Memory), flash memory, storage drives (such as hard disk drives), solid-state drives, any type of storage disk (such as optical discs, DVDs, etc.), or similar storage media, or combinations thereof.

[0095] This concludes the description of the electronic device shown in Figure 5.

[0096] Based on the same inventive concept, this embodiment also provides a computer-readable storage medium. This computer-readable storage medium is used to store a computer program; when executed by a processor, the computer program implements the above-described method embodiment.

[0097] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0098] Other embodiments of this specification will readily occur to those skilled in the art upon consideration of the specification and practice of the invention claimed herein. This specification is intended to cover any variations, uses, or adaptations that follow the general principles of this specification and include common knowledge or customary techniques in the art not claimed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this specification are indicated by the following claims.

[0099] It should be understood that this specification is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this specification is limited only by the appended claims.

[0100] The above description is merely a preferred embodiment of this specification and is not intended to limit this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of protection of this specification.

Claims

1. A method of link switching between server nodes, characterized in that the method is applied to a first server node of a distributed all-flash storage system, the first server node communicates with a second server node via a communication link, the communication link includes at least a first link and a second link, and the method includes: at the current moment, sending a target data processing message to the second server node through the first link, and determining the first keep-alive duration corresponding to the target data processing message according to the size of the target data processing message and the average message response duration, wherein the average message response duration is the average message response duration of the data processing messages sent by the first server node within a preset duration before the first server node sends the target data processing message; if, within the first keep-alive duration, the first server node does not receive a data processing response message for the target data processing message from the second server node via the first link, sending a probe message to the second server node via the first link; if, within the second keep-alive duration, the first server node does not receive a probe response message from the second server node via the first link in response to the probe message, determining that the first link is faulty, and switching the communication link between the first server node and the second server node from the first link to the second link, wherein the second keep-alive duration is determined by the average message response duration, and the second keep-alive duration is less than the first keep-alive duration.

2. The method according to claim 1, characterized in that, The first keep-alive duration corresponding to the target data processing message is determined based on the size of the target data processing message and the average message response time, including: The average message response time of the data processing response messages received by the first server node through the first link within a preset time period before the sending time, and the average message size of the data processing messages sent by the first server node through the first link within a preset time period before the sending time are calculated. Calculate the ratio of the size of the target data processing message to the average message size, and calculate the product of the average message response time and the ratio to obtain the first keep-alive time.

3. The method according to claim 2, characterized in that, Before determining the first keep-alive duration corresponding to the target data processing message based on the size of the target data processing message and the average message response time, the method further includes: Based on the message sending time of the data processing message sent by the first server node through the first link within a preset time period before the sending time, and the message receiving time of the data processing response message received by the first server node through the first link from the second server node, the message response time of the data processing message sent by the first server node to the second server node within the preset time period is determined. The average message response time is obtained by calculating the average message response time of the data processing messages sent by the first server node to the second server node through the first link within a preset time period before the sending time. The average message size is obtained by statistically analyzing the size of the data processing messages sent by the first server node to the second server node through the first link within a preset time period before the sending time, and calculating the average size of the data processing messages within the preset time period before the sending time.

4. The method according to claim 1, characterized in that, before sending a probe message to the second server node via the first link, the method further includes: if the first link is in a probing state, determining the remaining probe duration of the first link, and extending the remaining probe duration to the preset duration, wherein the remaining probe duration is the difference between the probe duration required to perform network probing on the first link and the duration for which the first link has been in the probing state.

5. The method according to claim 1 or 4, characterized in that, After sending a probe message to the second server node via the first link, the method further includes: If the first server node receives a probe response message from the second server node in response to the probe message within the second keep-alive duration, it is determined that the first link is in a normal state, and within a preset duration after the first server node receives the probe response message, the first server node sends the probe message to the second server node once every second keep-alive duration through the first link. If, within a preset time period after the first server node receives the probe response message, the first server node receives the probe response message from the second server node within the second keep-alive time period, then the first server node stops sending the probe message to the second server node through the first link.

6. The method according to claim 5, characterized in that, The method further includes: If the first server node does not receive the probe response message within the second keep-alive period, it stops sending the probe message to the second server node.

7. The method according to claim 1, characterized in that, after switching the communication link between the first server node and the second server node from the first link to the second link, the method further includes: when a first link connecting the first server node and the second server node is detected, sending the probe message to the second server node through the first link once every preset link probe duration; if the first server node keeps receiving, each within a preset keep-alive duration, the probe response messages returned by the second server node via the first link in response to the probe messages, and this continues for longer than a preset duration, determining that the first link is in a normal state.

8. A link switching device between server nodes, characterized in that the device is applied to a first server node of a distributed all-flash storage system, the first server node communicates with a second server node via a communication link, the communication link includes at least a first link and a second link, and the device comprises: a keep-alive duration determination module, used to send a target data processing message to the second server node through the first link at the current time, and to determine the first keep-alive duration corresponding to the target data processing message based on the size of the target data processing message and the average message response duration, wherein the average message response duration is the average message response duration of the data processing messages sent by the first server node within a preset duration before the first server node sends the target data processing message; a probe message sending module, used to send a probe message to the second server node through the first link when the first server node does not receive a data processing response message for the target data processing message from the second server node through the first link within the first keep-alive duration; and a link switching module, used to determine that the first link is faulty if the first server node does not receive a probe response message for the probe message from the second server node via the first link within the second keep-alive duration, and to switch the communication link between the first server node and the second server node from the first link to the second link, wherein the second keep-alive duration is determined by the average message response duration, and the second keep-alive duration is less than the first keep-alive duration.

9. An electronic device, characterized in that the electronic device comprises a processor and a memory; the memory is used to store machine-executable instructions; and the processor is configured to read and execute the machine-executable instructions stored in the memory to implement the method as described in any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the method described in any one of claims 1 to 7.