[0113] This step may be implemented as follows. The following is only a preferred implementation, and the present invention is not limited to it.
[0114] This embodiment is described using the example of the CMM sending a control plane heartbeat keep-alive IPMI command to a common single board every 10 s.
[0115] The CMM sends the control plane heartbeat keep-alive IPMI command to the common single boards in its machine frame at a fixed interval of 10 s to test whether the link is faulty. In this step, the maximum detection timeout is set to 5 minutes; with 10 s as the basic unit, this corresponds to a maximum of 30 unanswered control plane heartbeat keep-alive IPMI commands (300 s / 10 s = 30). The CMM records a counter value for each common single board to be detected, according to the following table. Each time the CMM does not receive a response, the counter is decremented; if it decreases continuously from 30 to 0, a link failure is indicated.
[0116]
[0117]
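The counting scheme of step S507 can be illustrated with a minimal Python sketch. The class name `KeepAliveCounters` and the method `record` are hypothetical and used only for illustration. The sketch also assumes that a received response restores the counter to 30, which the text implies with "continuously" but does not state explicitly.

```python
HEARTBEAT_INTERVAL_S = 10          # from the text: one command every 10 s
MAX_TIMEOUT_S = 5 * 60             # from the text: 5-minute maximum timeout
MAX_MISSED = MAX_TIMEOUT_S // HEARTBEAT_INTERVAL_S  # 300 / 10 = 30


class KeepAliveCounters:
    """Per-board countdown of unanswered heartbeat commands (hypothetical helper)."""

    def __init__(self, slots):
        # One counter per common single board, keyed by slot number.
        self.remaining = {slot: MAX_MISSED for slot in slots}

    def record(self, slot, response_received):
        """Update the counter after one 10 s period; return True on link failure."""
        if response_received:
            # Assumption: a response restores the counter, so only
            # consecutive misses can drive it down to 0.
            self.remaining[slot] = MAX_MISSED
            return False
        self.remaining[slot] -= 1
        return self.remaining[slot] <= 0
```

Under these assumptions, a board is reported as failed only after 30 consecutive periods without a response, that is, after the full 5-minute timeout.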
[0118] Step S508: The CMM resets the faulty board.
[0119] Next, the detection process from the switch board to the common single board is described. Before the detection process is executed, the HOST CPUs of the common single boards and of the switch board that have been successfully powered on in the machine frame likewise set the IPMC of their own board to start the keep-alive detection operation. As shown in Figure 6, the detection process from the switch board to a common single board includes the following steps:
[0120] Step S601: The CMM checks the FRU status of the common single boards and of the switch board CPU, and obtains the common single boards and switch boards whose FRU status is the M8 state.
[0121] Step S602: The CMM starts a timer and periodically sends a control plane heartbeat keep-alive IPMI command over the IPMB bus to the switch board in the M8 state.
[0122] The control plane heartbeat keep-alive IPMI command contains the slot number and CPU number of a common single board.
[0123] Step S603: After receiving the control plane heartbeat keep-alive IPMI command, the IPMC of the switch board sends a control plane keep-alive detection request command to the HOST CPU of the board.
[0124] Step S604: After receiving the control plane keep-alive detection request command, the HOST CPU of the switch board sends a control plane keep-alive private message to all common single boards in the M8 state in the machine frame.
[0125] Step S605: After receiving the control plane keep-alive private message, the HOST CPU of the common single board sends a control plane keep-alive detection request command to the IPMC of the board.
[0126] Step S606: After receiving the control plane keep-alive detection request command, the IPMC of the common single board sends a control plane heartbeat keep-alive IPMI command response to the CMM through the IPMB bus.
[0127] The control plane heartbeat keep-alive IPMI command response message includes the board slot number, the CPU number, and a detection success message.
[0128] Step S607: The CMM determines whether the control plane heartbeat keep-alive IPMI command response sent by the common single board is received within the preset time period. If so, it records the received response message; otherwise, it determines that the one-way control plane link from the switch board to the common single board has failed, and step S608 is executed.
[0129] This step can be implemented by the detection method of step S507, which is not repeated here.
[0130] Step S608: The CMM resets the faulty board.
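The message flow of steps S602 to S607 can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class and function names are hypothetical, the IPMB bus is modeled as a plain Python list acting as the CMM's inbox, and the control plane link is modeled as a per-slot boolean. A single round without a response only marks a board as a suspect; the failure decision itself follows the timeout method of step S507.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatResponse:
    slot: int       # board slot number
    cpu: int        # CPU number
    success: bool   # detection success message


class CommonBoard:
    def __init__(self, slot, cpu, cmm_inbox):
        self.slot, self.cpu, self.cmm_inbox = slot, cpu, cmm_inbox

    def host_cpu_on_private_message(self):
        # S605: the HOST CPU asks its own IPMC to answer the CMM.
        self.ipmc_on_detection_request()

    def ipmc_on_detection_request(self):
        # S606: the IPMC responds to the CMM over the IPMB bus (a list here).
        self.cmm_inbox.append(HeartbeatResponse(self.slot, self.cpu, True))


class SwitchBoard:
    def __init__(self, link_up):
        self.link_up = link_up  # slot -> is the switch-to-board link healthy?

    def ipmc_on_heartbeat_command(self, boards):
        # S603: the IPMC forwards a detection request to the HOST CPU.
        self.host_cpu_on_detection_request(boards)

    def host_cpu_on_detection_request(self, boards):
        # S604: the HOST CPU sends the private message over the control
        # plane; a broken link silently drops it.
        for board in boards:
            if self.link_up.get(board.slot, False):
                board.host_cpu_on_private_message()


def cmm_detection_round(switch_board, boards, cmm_inbox):
    # S602: the CMM sends the heartbeat command to the switch board.
    cmm_inbox.clear()
    switch_board.ipmc_on_heartbeat_command(boards)
    # S607: boards with no recorded response are suspects for this round.
    answered = {r.slot for r in cmm_inbox if r.success}
    return [b.slot for b in boards if b.slot not in answered]
```

For example, with boards in slots 1, 2 and 3 and the link to slot 2 marked as broken, `cmm_detection_round` returns `[2]`, because only slots 1 and 3 deliver a response to the CMM.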
[0131] It should be noted that the above link failure detection process assumes by default that the IPMB bus is in good condition. Therefore, when the CMM does not receive the control plane heartbeat keep-alive IPMI command response, it determines that the link from the common single board to the switch board, or from the switch board to the common single board, is faulty. The basis for this default assumption is that the IPMB is a redundant design, and an emergency plan (existing technology) applies when one IPMB bus has a problem. Furthermore, even if the IPMB bus fails, the system learns of its status through other alarm facilities, so the present invention is never carried out while the IPMB is in a failed state.
[0132] The method provided by the embodiment of the present invention makes full use of the machine frame management module that the ATCA architecture provides specifically for hardware monitoring and management, improves the control plane detection and self-healing mechanism, improves the accuracy of fault location, and further enhances the robustness of the system.
[0133] The present invention provides a frame management module, as shown in Figure 7, including:
[0134] The single board obtaining unit 710 is configured to obtain the common single boards and switch boards in the machine frame for which software power-on has succeeded;
[0135] The IPMI command issuing unit 720 is configured to periodically send a control plane heartbeat keep-alive IPMI command to the switch board and/or the common single board after the single board obtaining unit 710 obtains the common single board and the switch board;
[0136] The fault detection unit 730 is configured to determine whether the control plane heartbeat keep-alive IPMI command response sent by the common single board or the switch board is received within a preset time period. If it is not received, the unit determines that the link from the switch board to the common single board, or from the common single board to the switch board, is faulty, and resets the faulty board.
[0137] Specifically, when the fault detection unit 730 does not receive the control plane heartbeat keep-alive IPMI command response sent by the common single board within the preset time period, it determines that the link from the switch board to the common single board is faulty, and resets the faulty board;
[0138] When the fault detection unit 730 does not receive the control plane heartbeat keep-alive IPMI command response sent by the switch board within the preset time period, it determines that the link from the common single board to the switch board is faulty, and resets the faulty board.
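The selection performed by unit 710 and the decision rule of unit 730 can be sketched in a few lines of Python. The function names, the string labels, and the dictionary of FRU states are hypothetical. Selecting boards by the M8 state follows step S601 and is an assumption about how "successfully powered on" is checked.

```python
def obtain_boards(fru_states):
    """Unit 710 (sketch): keep the slots whose FRU state is M8."""
    return sorted(slot for slot, state in fru_states.items() if state == "M8")


def faulty_link(silent_board_kind):
    """Unit 730 (sketch): map the board kind that stayed silent to the failed link."""
    if silent_board_kind == "common":
        # No response from the common single board within the preset period.
        return "switch board -> common single board"
    if silent_board_kind == "switch":
        # No response from the switch board within the preset period.
        return "common single board -> switch board"
    raise ValueError(f"unknown board kind: {silent_board_kind!r}")
```

The mapping reflects the two cases described above: the board that fails to respond sits at the receiving end of the one-way link under test.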
[0139] The invention also provides a switch board, as shown in Figure 8, including:
[0140] The first IPMI command receiving unit 810 is configured to receive the control plane heartbeat keep-alive IPMI command sent by the frame management module;
[0141] The first control plane keep-alive private message sending unit 820 is configured to send the control plane keep-alive private message to the common single board after the first IPMI command receiving unit 810 receives the control plane heartbeat keep-alive IPMI command;
[0142] The first control plane keep-alive private message receiving unit 830 is configured to receive the control plane keep-alive private message sent by the common single board;
[0143] The first IPMI command response sending unit 840 is configured to send the control plane heartbeat keep-alive IPMI command response to the frame management module after the first control plane keep-alive private message receiving unit 830 receives the control plane keep-alive private message.
[0144] The invention also provides a single board, as shown in Figure 9, including:
[0145] The second IPMI command receiving unit 910 is configured to receive the control plane heartbeat keep-alive IPMI command sent by the frame management module;
[0146] The second control plane keep-alive private message sending unit 920 is configured to send the control plane keep-alive private message to the switch board after the second IPMI command receiving unit 910 receives the control plane heartbeat keep-alive IPMI command;
[0147] The second control plane keep-alive private message receiving unit 930 is configured to receive the control plane keep-alive private message sent by the switch board;
[0148] The second IPMI command response sending unit 940 is configured to send the control plane heartbeat keep-alive IPMI command response to the frame management module after the second control plane keep-alive private message receiving unit 930 receives the control plane keep-alive private message.
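The switch board and the single board described above share the same four-unit structure, which can be sketched as one Python class used for both roles. The class name `KeepAliveBoard`, the `peer` attribute, and the callback `report_to_fmm` are hypothetical. The callback stands in for sending the IPMI command response to the frame management module.

```python
class KeepAliveBoard:
    """One board with the four units 810-840 (switch board) or 910-940 (single board)."""

    def __init__(self, name, report_to_fmm):
        self.name = name
        self.peer = None                    # board at the other end of the link
        self.report_to_fmm = report_to_fmm  # stand-in for the IPMI response path

    def on_ipmi_command(self):
        # Units 810/910 receive the heartbeat command;
        # units 820/920 then send the private message to the peer board.
        self.peer.on_private_message(sender=self.name)

    def on_private_message(self, sender):
        # Units 830/930 receive the private message;
        # units 840/940 then send the response to the frame management module.
        self.report_to_fmm((self.name, sender))
```

In this sketch, a command delivered to the switch board produces a response from the single board, and a command delivered to the single board produces a response from the switch board, so each direction of the link is exercised separately.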