Monitoring system, method, apparatus and device for satellite controller, medium, and product
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- LANGCHAO ELECTRONIC INFORMATION IND CO LTD
- Filing Date
- 2025-03-24
- Publication Date
- 2026-06-11
Abstract
Description
Satellite controller monitoring systems, methods, devices, equipment, media and products
[0001] Cross-reference to related applications
[0002] This application claims priority to Chinese Patent Application No. 202411746151.1, filed on December 2, 2024, entitled "Monitoring System, Method, Apparatus, Equipment, Medium and Product for Satellite Controller", the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This application relates to the field of server technology, and in particular to monitoring systems, methods, apparatus, equipment, media and products for satellite controllers.
Background Technology
[0004] With the rapid development of technologies such as cloud computing, big data, and artificial intelligence, data centers, as the core infrastructure supporting these technologies, are expanding rapidly. The stable operation of data centers is crucial for executing business tasks, especially the rapid location and resolution of server failures to minimize their impact on operations.
[0005] To achieve comprehensive monitoring of data center servers, current out-of-band monitoring systems have formed a hierarchical monitoring system consisting of an out-of-band management controller and satellite controllers. The out-of-band management controller is mostly a Baseboard Management Controller (BMC). The satellite controllers are like "satellites" surrounding the out-of-band management controller, responsible for monitoring and managing specific hardware components in the server. They communicate with the out-of-band management controller to centrally manage and monitor the server's health status.
[0006] It is conceivable that when the satellite controller malfunctions, the out-of-band management controller will be unable to obtain a large amount of server status information, which may lead to equipment downtime in severe cases. Currently, the only way to handle satellite controller failures is to reproduce the failure scenario after the failure to analyze the cause, which not only wastes human and material resources but also causes business interruption.
Summary of the Invention
[0007] The purpose of this application is to provide a monitoring system, method, apparatus, equipment, medium, and product for satellite controllers, so that satellite controller faults can be handled before a failure occurs.
[0008] To solve the above-mentioned technical problems, the satellite controller monitoring system provided in this application includes an out-of-band management master controller and a first bus;
[0009] The out-of-band management master controller is used to send service request commands to the satellite controller via the first bus. If it receives a response command from the satellite controller via the first bus, it determines that the satellite controller is in a normal response state. It then checks the satellite controller's execution result for the service request command based on the command information of the service request command and the response information of the response command. If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, it determines that the satellite controller is in an abnormal working state.
[0010] The first abnormal information includes at least one of abnormal task execution result information and abnormal completion code.
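The counting logic described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the class name `SatelliteMonitor`, the state strings, and the threshold value 3 are assumptions, and a response carrying both an abnormal completion code and an abnormal task execution result is counted once here, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class SatelliteMonitor:
    """Counts first abnormal information seen in responses that arrived on time."""
    threshold: int                  # abnormal number accumulation threshold (hypothetical value)
    abnormal_count: int = 0
    state: str = "normal_response"

    def observe(self, has_abnormal_code: bool, has_abnormal_result: bool) -> str:
        # A response was received, so the controller is at least in a normal response state.
        # Assumption: one response adds at most one to the count.
        if has_abnormal_code or has_abnormal_result:
            self.abnormal_count += 1
        if self.abnormal_count >= self.threshold:
            self.state = "abnormal_working"
        return self.state


monitor = SatelliteMonitor(threshold=3)
observations = [(False, False), (True, False), (False, True), (True, True)]
states = [monitor.observe(code, result) for code, result in observations]
```

With a threshold of 3, the controller is only flagged on the third abnormal response, so an isolated abnormal reply does not trigger a state change.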
[0011] In some embodiments, the types of abnormal completion codes include out-of-band management master controller side abnormal completion codes and satellite controller side abnormal completion codes;
[0012] The abnormal number accumulation threshold corresponding to the out-of-band management master controller side abnormal completion codes is greater than the abnormal number accumulation threshold corresponding to the satellite controller side abnormal completion codes.
[0013] In some embodiments, the type of out-of-band management master controller side abnormal completion code includes at least one of a first abnormal completion code and a second abnormal completion code;
[0014] The first abnormal completion code indicates that the service request command is an illegal command, and the second abnormal completion code indicates that the data of the service request command has been truncated.
[0015] In some embodiments, the type of abnormal completion code on the satellite controller side includes at least one of the following: a third abnormal completion code, a fourth abnormal completion code, a fifth abnormal completion code, and a sixth abnormal completion code.
[0016] The third abnormal completion code indicates a task timeout; the fourth abnormal completion code indicates insufficient command storage space; the fifth abnormal completion code indicates that the service request command cannot be executed due to priority issues; and the sixth abnormal completion code indicates that the current state of the satellite controller prevents the execution of the service request command.
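One way to picture the six abnormal completion codes and their two origins is a lookup table that maps each code to a side and each side to its own threshold. The symbolic names and the threshold values 5 and 3 below are hypothetical placeholders chosen for illustration; the text only requires that the master controller side threshold be the larger of the two, and it does not give numeric code values.

```python
from enum import Enum


class Side(Enum):
    MASTER = "master"        # attributed to the out-of-band management master controller
    SATELLITE = "satellite"  # attributed to the satellite controller


# Symbolic (hypothetical) names for the six abnormal completion codes.
CODE_SIDE = {
    "ILLEGAL_COMMAND": Side.MASTER,      # first: command is illegal
    "DATA_TRUNCATED": Side.MASTER,       # second: command data truncated
    "TASK_TIMEOUT": Side.SATELLITE,      # third: task timed out
    "OUT_OF_SPACE": Side.SATELLITE,      # fourth: insufficient command storage space
    "PRIORITY_BLOCKED": Side.SATELLITE,  # fifth: cannot execute due to priority
    "STATE_BLOCKED": Side.SATELLITE,     # sixth: current state prevents execution
}

# Hypothetical values; only the ordering (master > satellite) comes from the text.
THRESHOLDS = {Side.MASTER: 5, Side.SATELLITE: 3}


def threshold_for(code_name: str) -> int:
    """Return the accumulation threshold that applies to a given abnormal code."""
    return THRESHOLDS[CODE_SIDE[code_name]]
```

A larger master-side threshold gives the master controller room to correct its own command before the count is taken as evidence of a fault.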
[0017] In some embodiments, the out-of-band management controller is further configured to, if the response information contains an out-of-band management controller-side abnormal completion code and the number of occurrences has not reached the corresponding abnormal number accumulation threshold, adjust the command format of the service request command and then send the adjusted service request command.
[0018] In some embodiments, the out-of-band management controller is further configured to record an out-of-band management controller program error log if the number of times an out-of-band management controller-side abnormal completion code appears in the response information reaches the corresponding abnormal number accumulation threshold.
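The two behaviors above (adjust and resend while below the threshold, record a program error log once the threshold is reached) might be combined in a loop like the sketch below. The callables `send` and `adjust`, the log message, and the toy rule that a command shorter than four bytes is rejected are all assumptions made for the example.

```python
from typing import Callable, List


def send_with_recovery(
    send: Callable[[bytes], bool],     # hypothetical: True if reply carried a master-side abnormal code
    command: bytes,
    adjust: Callable[[bytes], bytes],  # hypothetical: returns the command in an adjusted format
    threshold: int,
    error_log: List[str],
) -> bool:
    """Resend with an adjusted format until accepted or the threshold is reached."""
    count = 0
    while True:
        if not send(command):
            return True                # no master-side abnormal code in the reply
        count += 1
        if count >= threshold:
            error_log.append("out-of-band management controller program error")
            return False
        command = adjust(command)      # below threshold: adjust format, then resend


def fake_send(cmd: bytes) -> bool:
    return len(cmd) < 4                # toy rule: short commands are rejected


def pad(cmd: bytes) -> bytes:
    return cmd + b"\x00"


log_a: List[str] = []
ok_a = send_with_recovery(fake_send, b"\x01\x02", pad, 3, log_a)
log_b: List[str] = []
ok_b = send_with_recovery(fake_send, b"\x01\x02", pad, 2, log_b)
```

In the first call the command is accepted after two adjustments; in the second the threshold of 2 is reached first, so an error log entry is written instead.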
[0019] In some embodiments, the types of abnormal task execution result information include first information and second information;
[0020] The first information is task execution result information that does not match the target information to be obtained by the service request command; the second information is task execution result information with missing data.
[0021] In some embodiments, the type of the first information includes at least one of the third information, fourth information, fifth information, and sixth information;
[0022] The third information is task execution result information with an incorrect data format; the fourth information is task execution result information with an incorrect data length; the fifth information is task execution result information whose information type does not match the type of the target information to be obtained; and the sixth information is task execution result information containing a monitoring value that exceeds the threshold range of the target information to be obtained.
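The categories of abnormal task execution result information can be illustrated with a small classifier. The four-byte payload layout used here (type tag, format identifier, signed 16-bit value) and the order of the checks are assumptions invented for the example; the text does not define a wire format.

```python
import struct
from typing import Optional


def classify_result(payload: Optional[bytes], want_type: int, want_format: int,
                    low: int, high: int) -> str:
    """Classify a task execution result against what the request asked for.

    Hypothetical layout: byte 0 = information type tag, byte 1 = data format id,
    bytes 2-3 = little-endian signed monitoring value.
    """
    if not payload:
        return "missing_data"      # second information
    if len(payload) != 4:
        return "bad_length"        # fourth information
    type_tag, format_id, value = struct.unpack("<BBh", payload)
    if format_id != want_format:
        return "bad_format"        # third information
    if type_tag != want_type:
        return "type_mismatch"     # fifth information
    if not low <= value <= high:
        return "out_of_range"      # sixth information
    return "normal"


good = classify_result(struct.pack("<BBh", 1, 1, 45), 1, 1, 0, 100)
hot = classify_result(struct.pack("<BBh", 1, 1, 300), 1, 1, 0, 100)
```

Any return value other than "normal" would count as one occurrence of first abnormal information.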
[0023] In some embodiments, the out-of-band management controller is further configured to resend the service request command to the satellite controller when the number of times the first abnormal information is received does not reach the abnormal number accumulation threshold.
[0024] In some embodiments, the out-of-band management controller is further configured to determine that the satellite controller is an abnormal satellite controller and adjust the service configuration information corresponding to the abnormal satellite controller after determining that the satellite controller is in an abnormal working state.
[0025] In some embodiments, the out-of-band management master controller adjusts the service configuration information corresponding to the abnormal satellite controller, including:
[0026] Locally generate monitoring information corresponding to the monitoring information type of the abnormal satellite controller to execute the control task corresponding to the abnormal satellite controller.
[0027] In some embodiments, the out-of-band management master controller locally generates monitoring information corresponding to the monitoring information type of the abnormal satellite controller to execute control tasks corresponding to the abnormal satellite controller, including:
[0028] The out-of-band management master controller acquires the default monitoring information corresponding to the abnormal satellite controller to execute control tasks.
[0029] In some embodiments, the out-of-band management master controller locally generates monitoring information corresponding to the monitoring information type of the abnormal satellite controller to execute control tasks corresponding to the abnormal satellite controller, including:
[0030] The out-of-band management master controller acquires historical monitoring information from historical moments at which the abnormal satellite controller had not yet been determined to be in an abnormal working state, in order to execute control tasks.
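The two local substitution options above (default monitoring information and historical monitoring information) can be sketched together. The text presents them as separate embodiments; preferring the most recent good historical value and falling back to the default is a design assumption of this example, as are the class name `FallbackSource` and the sample temperatures.

```python
from typing import List


class FallbackSource:
    """Supplies monitoring values locally once a satellite controller is flagged abnormal."""

    def __init__(self, default_value: float) -> None:
        self.default_value = default_value   # default monitoring information
        self.history: List[float] = []       # values recorded before the abnormal determination

    def record_good(self, value: float) -> None:
        # Called only while the controller has not been determined abnormal.
        self.history.append(value)

    def substitute(self) -> float:
        # Assumption: prefer the latest historical value, else use the default.
        return self.history[-1] if self.history else self.default_value


cpu_temp = FallbackSource(default_value=60.0)
before_any_history = cpu_temp.substitute()
cpu_temp.record_good(47.5)
cpu_temp.record_good(49.0)
after_history = cpu_temp.substitute()
```

The substituted value lets control tasks such as fan regulation continue while the abnormal controller is being handled.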
[0031] In some embodiments, the out-of-band management master controller includes a main processor and a coprocessor;
[0032] The main processor is used to send service request commands to the satellite controller via the first bus and to receive response commands sent by the satellite controller via the first bus, and to send the command information and response information to the coprocessor.
[0033] The coprocessor is used to detect the execution results of the satellite controller's service request commands based on command information and response information. If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state, and the information that the satellite controller is in an abnormal working state is sent to the main processor.
[0034] In some embodiments, the main processor sends command and response information to the coprocessor, including:
[0035] The main processor sends the first index number corresponding to the command information and the second index number corresponding to the response information to the coprocessor.
[0036] In some embodiments, the coprocessor sends information to the main processor that the satellite controller is in an abnormal operating state, including:
[0037] The coprocessor generates an abnormal operation flag corresponding to the abnormal operation status type of the satellite controller, writes the command information, response information and abnormal operation flag into the first data packet, and sends the first data packet to the main processor.
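The exchange between the main processor and the coprocessor can be pictured as passing index numbers one way and a data packet the other way. The shared lists, the packet class `FirstDataPacket`, the sample bytes, and the flag string are hypothetical; the text does not say where the indexed command and response information is stored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FirstDataPacket:
    command_info: bytes
    response_info: bytes
    abnormal_flag: str   # abnormal operation flag for the detected abnormal state type


# Assumption: both processors can read these buffers, so only indexes are exchanged.
shared_commands: List[bytes] = [b"\x01\x10"]
shared_responses: List[bytes] = [b"\xff"]


def coprocessor_report(first_index: int, second_index: int, flag: str) -> FirstDataPacket:
    """Build the packet the coprocessor sends back to the main processor."""
    return FirstDataPacket(
        command_info=shared_commands[first_index],
        response_info=shared_responses[second_index],
        abnormal_flag=flag,
    )


packet = coprocessor_report(0, 0, "completion_code_abnormal")
```

Sending indexes instead of full payloads keeps the message from the main processor small, while the returned packet bundles everything needed to log or act on the abnormality.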
[0038] In some embodiments, the out-of-band management controller includes a first operating system and a second operating system, wherein the response rate of the first operating system is higher than that of the second operating system;
[0039] The second operating system is used to send service request commands to the satellite controller via the first bus and to receive response commands sent by the satellite controller via the first bus, and to send command information and response information to the first operating system.
[0040] The first operating system is used to detect the execution result of the satellite controller's service request command based on the command information and response information. If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state, and the information that the satellite controller is in an abnormal working state is sent to the second operating system.
[0041] In some embodiments, the satellite controller includes at least a platform controller management engine unit.
[0042] To address the aforementioned technical problems, this application also provides a monitoring method for a satellite controller, applied to an out-of-band management master controller, comprising:
[0043] Send a service request command to the satellite controller;
[0044] If a response command is received from the satellite controller via the first bus, it is determined that the satellite controller is in a normal response state.
[0045] Detect the execution result of the satellite controller on the service request command based on the command information of the service request command and the response information of the response command.
[0046] If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state.
[0047] The first abnormal information includes at least one of abnormal task execution result information and abnormal completion code.
[0048] To address the aforementioned technical problems, this application also provides a monitoring device for a satellite controller, applied to the out-of-band management master controller of a server, comprising:
[0049] The transmitting unit is used to send service request commands to the satellite controller;
[0050] The receiving unit is used to determine that the satellite controller is in a normal response state if it receives a response command from the satellite controller via the first bus.
[0051] The detection unit is used to detect the execution result of the satellite controller in response to the service request command based on the command information of the service request command and the response information of the response command; if the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state.
[0052] The first abnormal information includes at least one of abnormal task execution result information and abnormal completion code.
[0053] To address the aforementioned technical problems, this application also provides monitoring equipment for the satellite controller, including:
[0054] Memory, used to store computer programs;
[0055] A processor is used to execute computer programs, which, when executed by the processor, implement the steps of any of the satellite controller monitoring methods described above.
[0056] To address the aforementioned technical problems, this application also provides a non-volatile storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of any of the above-described satellite controller monitoring methods.
[0057] To address the aforementioned technical problems, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the above-described satellite controller monitoring methods.
[0058] The satellite controller monitoring system provided in this application differs from related technologies, in which the out-of-band management controller assumes that the satellite controller is in normal working condition once it determines that the satellite controller is in a normal response state. Here, after determining that the satellite controller is in a normal response state, the out-of-band management controller further inspects the response information received in that state. By defining first abnormal information, it captures detailed signs of abnormality that appear before a failure occurs, from two perspectives (abnormal task execution result information and abnormal completion codes), in response information that was previously treated as feedback from a satellite controller in normal working condition. If the number of times the first abnormal information is detected reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state. This enables pre-failure handling of the satellite controller, avoids serious consequences such as equipment downtime caused by satellite controller failure, and ensures the stable operation of the data center server.
[0059] The satellite controller monitoring method, apparatus, equipment, non-volatile storage medium, computer program product, and server out-of-band monitoring system provided in this application have the aforementioned beneficial effects, which will not be elaborated further here.
Attached Figure Description
[0060] To more clearly illustrate the technical solutions of the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0061] Figure 1 is a schematic diagram of the structure of a satellite controller monitoring system provided in some embodiments of this application;
[0062] Figure 2 is a schematic diagram of the information format for updating a satellite controller abnormal state analysis and positioning strategy according to some embodiments of this application;
[0063] Figure 3 is a schematic diagram of the structure of another satellite controller monitoring system provided in some embodiments of this application;
[0064] Figure 4 is a flowchart of a satellite controller monitoring method provided in some embodiments of this application;
[0065] Figure 5 is a flowchart of another satellite controller monitoring method provided in some embodiments of this application;
[0066] Figure 6 is a schematic diagram of the structure of a satellite controller monitoring device provided in some embodiments of this application.
Detailed Implementation
[0067] The core of this application is to provide a monitoring system, method, apparatus, equipment, medium, and product for satellite controllers, so that satellite controller faults can be handled before a failure occurs.
[0068] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0069] In devices such as servers, the baseboard management controller (BMC) provides out-of-band management and monitoring capabilities. The BMC is typically mounted on the motherboard of the monitored device and uses, but is not limited to, the Intelligent Platform Management Interface (IPMI) protocol to monitor the hardware status of the server system by monitoring sensors. The BMC can communicate with internal server modules, such as the Platform Controller Hub (PCH), memory (e.g., Dual-Inline Memory Modules (DIMMs)), and power supply, using inter-integrated circuit buses (such as I2C) or the Intelligent Platform Management Bus (IPMB). The BMC can also connect to sensors within the server via the inter-integrated circuit bus or IPMB to monitor hardware status, such as temperature, humidity, power supply voltage, fan speed, communication parameters, and operating system (OS) functions. If any of these variables exceeds specified limits, the BMC will notify the system administrator. The baseboard management controller can provide web services, which have network communication capabilities and provide a web page to display the monitoring interface. Maintenance personnel can obtain the monitoring data of the baseboard management controller by connecting to the baseboard management controller of the monitored equipment at the equipment site via a network cable, or by connecting to the baseboard management controllers of multiple monitored devices in the data center over a network.
[0070] To achieve comprehensive monitoring of data center servers, current out-of-band monitoring systems have formed a hierarchical monitoring system consisting of a main out-of-band management controller and satellite controllers. The main out-of-band management controller typically uses a baseboard management controller. Satellite controllers refer to management controllers not integrated onto the same board as the main out-of-band management controller. Satellite controllers connect to the baseboard management controller via an integrated circuit bus or intelligent platform management bus. Because the baseboard management controller is the primary management controller in the server's out-of-band management system, and other management controllers are like "satellites" surrounding the main controller, they are called "satellite" controllers. Satellite controllers are responsible for monitoring and managing specific hardware components in the server, such as temperature sensors, voltage sensors, and fans. They communicate with the baseboard management controller to centrally manage and monitor the server's health status.
[0071] For example, the Management Engine (ME) in the Platform Controller Hub (PCH) can be referred to as the satellite controller of the baseboard management controller. This management engine is primarily responsible for power and energy management. By interacting with the management engine, the baseboard management controller can obtain monitoring data such as temperature and power consumption of key components like the Central Processing Unit (CPU) and memory. It can then perform operations such as fault logging, CPU frequency reduction, or shutdown based on management policies.
[0072] Therefore, when the satellite controller malfunctions, the out-of-band management controller will be unable to obtain a large amount of server status information, which may cause the equipment to crash in severe cases.
[0073] Currently, there is no feasible solution for monitoring abnormalities in the operational status of satellite controllers; only post-fault handling methods are available. This is because satellite controllers come in various types, and some (such as the platform controller management engine mentioned above) remain black boxes to out-of-band server monitoring systems. Their firmware source code is not publicly available, making it impossible to modify the satellite controller's code so that it proactively reports its operational status during operation. Furthermore, after a satellite controller malfunctions, it is impossible to pinpoint the cause of the failure by accessing the underlying layers of the controller. Therefore, under traditional satellite controller monitoring solutions, a failure is often noticed by equipment and maintenance personnel only after it has occurred. At this point, maintenance personnel can only determine the cause of the failure by deploying specialized testing equipment on the server system for protocol analysis and reproducing the failure scenario, which undoubtedly increases the investment of human and material resources. In addition, because current methods locate satellite controller failures only after the controller has malfunctioned, the services running on the server cannot operate and require a prolonged shutdown. This can cause significant losses for critical and time-sensitive services.
[0074] Therefore, it is necessary to monitor the operating status of the satellite controller during its operation in order to detect any abnormalities in a timely manner and avoid taking action only after the satellite controller malfunctions and causes equipment downtime.
[0075] In related technologies, out-of-band management controllers are typically highly sensitive to the inability to send service request commands via the first bus and the failure to receive response commands from the satellite controller within a timeout period. In such cases, the out-of-band management controller may perform actions beyond service processing, such as retransmitting the service request command, debugging the first bus, or requesting a satellite controller reset. However, once the out-of-band management controller can send a service request command to the satellite controller and receive a timely response command, it can determine that the satellite controller is in a normal response state. At this point, the out-of-band management controller considers the satellite controller to be in normal working condition and can use the response information from the satellite controller to execute service tasks. If the task execution result information carried in the response information is monitoring data, the out-of-band management controller will execute control tasks on other hardware based on this monitoring data. Under this processing method, the satellite controller may suddenly malfunction at some point, and even with the above-mentioned methods, it may not be able to recover automatically. This would cause the out-of-band management controller to be unable to execute service tasks normally, resulting in the various serious consequences described above.
[0076] Observation shows that the satellite controller does not fail without warning signs. These signs are hidden in the response information that the out-of-band management master controller receives while it believes the satellite controller is in normal working condition. However, in related technologies, the out-of-band management master controller is not sensitive to this hidden information and does not detect signs that the satellite controller is beginning to behave abnormally.
[0077] Therefore, in some embodiments of this application, the out-of-band management controller additionally inspects the response information of the satellite controller once it has determined that the satellite controller is in a normal response state. By defining the first abnormal information, the system captures detailed signs of abnormality that appear before a failure, from two perspectives (abnormal task execution result information and abnormal completion codes), in response information that was previously treated as feedback from a satellite controller in a normal working state. If the number of times the first abnormal information is detected reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state, thereby realizing pre-failure handling of the satellite controller, avoiding serious consequences such as equipment downtime caused by satellite controller failure, and ensuring the stable operation of the data center server.
[0078] The monitoring system for a satellite controller provided in some embodiments of this application will be further described below with reference to the accompanying drawings.
[0079] Figure 1 is a schematic diagram of the structure of a satellite controller monitoring system provided in some embodiments of this application.
[0080] As shown in Figure 1, the satellite controller monitoring system provided in some embodiments of this application may include an out-of-band management master controller 101 and a first bus. The out-of-band management master controller 101 is used to send service request commands to the satellite controller 102 via the first bus. If it receives a response command from the satellite controller 102 via the first bus, it determines that the satellite controller 102 is in a normal response state. It detects the execution result of the service request command by the satellite controller 102 based on the command information of the service request command and the response information of the response command. If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, it determines that the satellite controller 102 is in an abnormal working state. The first abnormal information includes at least one of abnormal task execution result information and abnormal completion code.
[0081] It should be noted that, in some embodiments of this application, "out-of-band management master controller" refers to the master controller in the server out-of-band management system, which is usually the baseboard management controller. In some embodiments of this application, "satellite controller" refers to a management controller in the server out-of-band management system that is below the out-of-band management master controller 101 and is used to assist the out-of-band management master controller 101 in performing out-of-band management of the server, so as to form a hierarchical monitoring system with the out-of-band management master controller 101.
[0082] In some embodiments of this application, "baseboard management controller" can refer to a baseboard management controller board, which, in addition to the controller, also includes various bus controllers, memory, and other components. In some embodiments of this application, "baseboard management controller" can also refer to the controller on the baseboard management controller board, which can be a single-core processor or a multi-core processor. In some embodiments of this application, the baseboard management controller or the controller on the baseboard management controller board can be an ARM processor, on which the baseboard management controller's management system runs.
[0083] In some embodiments of this application, the first bus may include, but is not limited to, an integrated circuit bus or an intelligent platform management bus (IPMB). The integrated circuit bus may include, but is not limited to, a two-wire serial bus (Inter-Integrated Circuit, I2C) or an improved inter-integrated circuit bus (I3C).
[0084] In some embodiments of this application, the satellite controller 102 can be a platform controller management engine unit. Depending on the requirements of the out-of-band management master controller 101, the service request commands sent by the out-of-band management master controller 101 to the platform controller management engine unit may include, but are not limited to, obtaining device ID, obtaining device information via original equipment manufacturer (OEM) commands, obtaining CPU and memory temperature information, obtaining system time, and platform environment control interface commands.
[0085] In some embodiments of this application, the service request command can be a data packet carrying command information sent by the out-of-band management master controller 101 to the satellite controller 102. The response command can be a data packet carrying response information sent by the satellite controller 102 to the out-of-band management master controller 101.
[0086] In some embodiments of this application, after determining that the satellite controller 102 is in a normal response state, the out-of-band management master controller 101 adds a task to detect the response information in order to capture some detailed abnormal situations that the satellite controller 102 may report through the response information when a fault occurs.
[0087] It should be noted that "first abnormal information" refers to abnormal information defined in some embodiments of this application. In related technologies, this information is considered normal information by the out-of-band management controller 101 that does not require processing outside of performing business tasks. That is, some embodiments of this application classify the response information of the satellite controller 102 to distinguish between "normal information" and "abnormal information." In some implementations of some embodiments of this application, the aforementioned "normal information" refers to information that can be used by the out-of-band management controller 101 to perform business tasks, while "abnormal information" refers to information that should not be used by the out-of-band management controller 101 to perform business tasks.
[0088] According to the communication protocol of satellite controller 102, the response information fed back by satellite controller 102 typically includes two main categories: completion codes and task execution result information. The completion code is a code that the data packet should carry as defined by the communication protocol of satellite controller 102, used to indicate the execution status of satellite controller 102 in response to the service request command. The task execution result information is the actual result of satellite controller 102's response to the service request command. For example, if the service request command is to acquire monitoring data, then the task execution result information should be the monitoring data.
[0089] In some embodiments of this application, the completion code is distinguished into normal completion code and abnormal completion code, the task execution result information is distinguished into normal task execution result information and abnormal task execution result information, and the abnormal completion code and abnormal task execution result information are collectively referred to as the first abnormal information.
[0090] Abnormal task execution result information or an abnormal completion code in the response information of the satellite controller 102 may reflect only a temporary abnormality. Therefore, some embodiments of this application set an abnormality accumulation threshold for the first abnormal information, and the abnormal working state of the satellite controller 102 is determined by counting the occurrences of the first abnormal information.
[0091] In some embodiments of this application, an abnormality accumulation threshold can be set for each satellite controller 102 individually, or a corresponding abnormality accumulation threshold can be set for the sum of abnormalities of a group of satellite controllers 102. The same abnormality accumulation threshold can be set for satellite controllers 102 of the same type or performing similar tasks. For the same group of satellite controllers 102 or the same satellite controller 102, different abnormality accumulation thresholds can be set for different types of first abnormal information.
[0092] In some embodiments of this application, the completion code can be a completion code defined by the intelligent platform management interface protocol. For example, in the intelligent platform management interface protocol, 00 is defined as a completion code indicating normal information feedback, while the rest are abnormal completion codes.
[0093] In some embodiments of this application, the types of abnormal completion codes can be divided into out-of-band management controller-side abnormal completion codes and satellite controller-side abnormal completion codes, and the cumulative threshold for the number of abnormal occurrences corresponding to the out-of-band management controller-side abnormal completion codes is set to be greater than the cumulative threshold for the number of abnormal occurrences corresponding to the satellite controller-side abnormal completion codes.
[0094] An out-of-band management master controller-side abnormal completion code is a completion code indicating that the information feedback abnormality of the satellite controller 102 was caused by an abnormality on the out-of-band management master controller side. A satellite controller-side abnormal completion code is a completion code indicating that the information feedback abnormality of the satellite controller 102 was caused by an abnormality on the satellite controller side. Since abnormalities on the out-of-band management master controller side are usually due to data format errors in service request commands, which the out-of-band management master controller 101 can repair automatically, the cumulative threshold for the number of abnormal occurrences corresponding to the out-of-band management master controller-side abnormal completion code can be set greater than the cumulative threshold corresponding to the satellite controller-side abnormal completion code.
[0095] When the out-of-band management master controller 101 counts the occurrences of the first abnormal information of the satellite controller 102, it can initialize the occurrence count of the first abnormal information of each satellite controller 102 to 0 when the server is powered on, and increment it by 1 for each occurrence. If the satellite controller 102 is subsequently restored to normal operation through recovery script or manual operation, the occurrence count of the first abnormal information of the satellite controller 102 can be reset to 0 and the count can be restarted.
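The counting scheme described above can be illustrated with a minimal sketch. The class name `AnomalyCounter`, the anomaly type labels and the threshold values are hypothetical and chosen only for illustration; the source does not prescribe a data structure.

```python
from collections import defaultdict


class AnomalyCounter:
    """Counts first-abnormal-information occurrences per satellite controller."""

    def __init__(self, thresholds, default_threshold=3):
        # thresholds maps an anomaly type to its accumulation threshold.
        # All values here are illustrative assumptions.
        self.thresholds = dict(thresholds)
        self.default_threshold = default_threshold
        # Counts implicitly start at 0, as at server power-on.
        self.counts = defaultdict(int)

    def record(self, controller_id, anomaly_type):
        """Increment the count by 1; return True once the threshold is reached."""
        key = (controller_id, anomaly_type)
        self.counts[key] += 1
        limit = self.thresholds.get(anomaly_type, self.default_threshold)
        return self.counts[key] >= limit

    def reset(self, controller_id):
        """Clear all counts of a controller after it has been recovered."""
        for key in [k for k in self.counts if k[0] == controller_id]:
            del self.counts[key]


# The master-side threshold is set greater than the satellite-side threshold.
counter = AnomalyCounter({"master_side": 5, "satellite_side": 2})
first = counter.record("me0", "satellite_side")    # below threshold
second = counter.record("me0", "satellite_side")   # threshold reached
counter.reset("me0")                               # recovery resets the count
after_reset = counter.record("me0", "satellite_side")
```

Keying the counts by controller and anomaly type allows different thresholds for different types of first abnormal information on the same satellite controller.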
[0096] In some embodiments of this application, the type of the out-of-band management master controller side abnormal completion code may include at least one of a first abnormal completion code and a second abnormal completion code; the first abnormal completion code indicates that the service request command is an illegal command, and the second abnormal completion code indicates that the data of the service request command has been truncated.
[0097] The types of abnormal completion codes on the satellite controller side may include at least one of the following: third abnormal completion code, fourth abnormal completion code, fifth abnormal completion code, and sixth abnormal completion code; the third abnormal completion code indicates that the task timed out, the fourth abnormal completion code indicates that the command storage space is insufficient, the fifth abnormal completion code indicates that the service request command cannot be executed due to priority issues, and the sixth abnormal completion code indicates that the current state of the satellite controller 102 cannot execute the service request command.
[0098] It should be noted that, depending on the specific protocol type and subsequent protocol upgrades, the abnormal completion code may include one or more of the abnormal completion codes listed above, or may not include the abnormal completion codes listed above, or may include new types of abnormal completion codes, all of which fall within the protection scope of some embodiments of this application.
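As an illustration of the classification above, the following sketch maps the six abnormal completion codes onto generic completion code values of the intelligent platform management interface protocol. The source does not give numeric values, so the mapping is an assumption; in particular, associating the fifth abnormal completion code with the insufficient-privilege code is a guess. The function name is hypothetical.

```python
NORMAL_COMPLETION_CODE = 0x00

# Assumed numeric values; the source names the meanings but not the codes.
MASTER_SIDE_CODES = {
    0xC1: "illegal (invalid) command",        # first abnormal completion code
    0xC6: "request data truncated",           # second abnormal completion code
}
SATELLITE_SIDE_CODES = {
    0xC3: "timeout while processing",         # third abnormal completion code
    0xC4: "out of storage space",             # fourth abnormal completion code
    0xD4: "insufficient privilege level",     # fifth (assumed mapping)
    0xD5: "not supported in present state",   # sixth abnormal completion code
}


def classify_completion_code(code):
    """Return the category of a completion code."""
    if code == NORMAL_COMPLETION_CODE:
        return "normal"
    if code in MASTER_SIDE_CODES:
        return "master_side"
    if code in SATELLITE_SIDE_CODES:
        return "satellite_side"
    # Codes not listed above are still abnormal, just not categorized here.
    return "other_abnormal"
```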
[0099] In some embodiments of this application, task execution result information is distinguished into normal task execution result information and abnormal task execution result information. Normal task execution result information refers to information that can be used by the out-of-band management controller 101 to perform business tasks, while abnormal task execution result information refers to information that should not be used by the out-of-band management controller 101 to perform business tasks.
[0100] In some embodiments of this application, the types of abnormal task execution result information can be divided into first information and second information; wherein, the first information is task execution result information that does not match the target information obtained by the business request command; and the second information is task execution result information with missing data.
[0101] In some embodiments of this application, the type of the first information may include at least one of the third information, the fourth information, the fifth information, and the sixth information; wherein, the third information is task execution result information with a data format error; the fourth information is task execution result information with a data length error; the fifth information is task execution result information whose information type does not match the type of information obtained by the target; and the sixth information is task execution result information containing monitoring values that exceed the threshold range of the information obtained by the target.
[0102] It should be noted that, depending on the different business tasks performed by the satellite controller 102, the type of abnormal task execution result information may include one or more of the abnormal task execution result information listed above, or may not include the abnormal task execution result information listed above, or may include new abnormal task execution result information, all of which fall within the protection scope of some embodiments of this application.
[0103] Conversely, normal task execution result information can include task execution result information that conforms to the response information data format requirements and matches the target information obtained by the business request command. The response information data format requirements can include data length requirements and data format requirements. Matching the target information obtained by the business request command can include: the information type matching the type of the target information obtained, and the monitored value being within the threshold range of the target information obtained.
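The checks on task execution result information can be sketched as follows. The payload layout (one type byte followed by one value byte) and all names are assumptions made for illustration. A data format check (third information) depends on the specific command and is omitted from this sketch.

```python
from dataclasses import dataclass


@dataclass
class ExpectedResult:
    length: int      # expected payload length in bytes
    info_type: int   # expected information type tag (assumed to be byte 0)
    low: int         # lower bound of the valid monitoring value
    high: int        # upper bound of the valid monitoring value


def check_result(payload, expected):
    """Return None for a normal result, else a label for the abnormal type."""
    if not payload:
        return "missing_data"       # second information
    if len(payload) != expected.length:
        return "length_error"       # fourth information
    if payload[0] != expected.info_type:
        return "type_mismatch"      # fifth information
    value = payload[1]
    if not expected.low <= value <= expected.high:
        return "out_of_range"       # sixth information
    return None


# Hypothetical expectation for a CPU temperature reading in degrees Celsius.
cpu_temp = ExpectedResult(length=2, info_type=0x01, low=0, high=110)
```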
[0104] In some embodiments of this application, since there is a correspondence between response information and command information, i.e., a correspondence between response commands and service request commands, the out-of-band management controller 101 detects the execution result of the service request command by the satellite controller 102 based on the command information and response information. This detection can be based on a pre-established correspondence. In some embodiments of this application, this correspondence can be a first correspondence between command information and first abnormal information. When the command information and response information corresponding to the satellite controller 102 conform to the first correspondence, the response information is considered to contain first abnormal information. In some embodiments of this application, this correspondence can also be a second correspondence between command information and standard service response information. The standard service response information can include at least one of standard task execution result information and a normal completion code. When the command information and response information corresponding to the satellite controller 102 do not conform to the second correspondence, the response information is considered to contain first abnormal information.
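The two detection modes can be illustrated with a small sketch that uses only completion codes as response information. The table contents, command names and function name are hypothetical.

```python
# First correspondence: command information -> first abnormal information.
FIRST_CORRESPONDENCE = {
    "get_device_id": {0xC1, 0xC3},
}
# Second correspondence: command information -> standard response information.
SECOND_CORRESPONDENCE = {
    "get_device_id": {0x00},
    "get_cpu_temperature": {0x00},
}


def has_first_abnormal_info(command, completion_code, use_first=False):
    """Decide whether a response contains first abnormal information."""
    if use_first:
        # Abnormal when the pair conforms to the first correspondence.
        return completion_code in FIRST_CORRESPONDENCE.get(command, set())
    # Abnormal when the pair does not conform to the second correspondence.
    return completion_code not in SECOND_CORRESPONDENCE.get(command, set())
```

The first mode flags only explicitly listed abnormal responses, while the second mode flags anything that deviates from the standard response, so the second mode also catches abnormal codes not known in advance.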
[0105] In some embodiments of this application, the out-of-band management master controller 101 can also be used to record a response timeout anomaly of the satellite controller 102 if no response command is received from the satellite controller 102 after a first preset time has elapsed after issuing a service request command. If the number of response timeout anomalies of the satellite controller 102 reaches the corresponding cumulative threshold of anomalies, the satellite controller 102 is determined to be in an abnormal working state.
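A minimal sketch of response timeout counting follows. The polling callback `poll`, the time values and the function names are assumptions; an actual implementation would depend on the bus driver in use.

```python
import time


def wait_for_response(poll, first_preset_time, interval=0.005):
    """Poll until a response arrives or the first preset time elapses."""
    deadline = time.monotonic() + first_preset_time
    while time.monotonic() < deadline:
        response = poll()
        if response is not None:
            return response
        time.sleep(interval)
    return None  # no response command received in time


def monitor_timeouts(poll, attempts, threshold, first_preset_time=0.02):
    """Return True if timeout anomalies reach the threshold within `attempts`."""
    timeouts = 0
    for _ in range(attempts):
        if wait_for_response(poll, first_preset_time) is None:
            timeouts += 1  # record one response timeout anomaly
            if timeouts >= threshold:
                return True  # satellite controller is in an abnormal state
    return False
```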
[0106] In related technologies, the out-of-band management master controller considers the satellite controller to be in a normal working state once it determines that the satellite controller is in a normal response state. In contrast, in the satellite controller monitoring system provided in some embodiments of this application, after determining that the satellite controller is in a normal response state, the out-of-band management master controller also inspects the response information received in this normal response state. By defining first abnormal information, the system captures the detailed abnormal state of the satellite controller before a failure from two perspectives, abnormal task execution result information and abnormal completion codes, within response information that was previously regarded as feedback from a satellite controller in a normal working state. If the number of times the first abnormal information is detected reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state. This enables pre-failure handling of the satellite controller, avoids serious consequences such as equipment downtime caused by satellite controller failure, and ensures the stable operation of the data center server.
[0107] Based on the above embodiments, in order to reduce the workload of maintenance personnel and avoid the problem of untimely manual response, the out-of-band management master controller 101 can also be used to process the first abnormal information of the satellite controller 102.
[0108] As described in the above embodiments, abnormal completion codes can be categorized into out-of-band management controller-side abnormal completion codes and satellite controller-side abnormal completion codes. Based on this, the out-of-band management controller 101 can also be used to adjust the command format of the service request command and send the adjusted service request command if the response information contains an out-of-band management controller-side abnormal completion code and the number of occurrences has not reached the corresponding abnormal count accumulation threshold. That is, when the response information from the satellite controller 102 contains an out-of-band management controller-side abnormal completion code but the corresponding abnormal count accumulation threshold has not been reached, the out-of-band management controller 101 can repair the situation by adjusting the command format and retransmitting the service request command.
[0109] In some embodiments of this application, the out-of-band management master controller 101 can also be used to record an out-of-band management master controller program error log if the number of times an out-of-band management master controller-side abnormal completion code appears in the response information reaches the corresponding abnormal number accumulation threshold. That is, when the response information fed back by the satellite controller 102 carries an out-of-band management master controller-side abnormal completion code and the corresponding abnormal number accumulation threshold is reached, the out-of-band management master controller 101 can record an out-of-band management master controller program error log to indicate that there may be a program error in the out-of-band management master controller 101.
[0110] Furthermore, since the satellite controller 102 is mainly used to cooperate with the out-of-band management controller 101 to complete business tasks such as monitoring server components, the out-of-band management controller 101 needs the monitoring information collected by the satellite controller 102 to perform hardware control tasks, such as performing fan control tasks based on temperature information. To ensure smooth operation, in some embodiments of this application, the out-of-band management controller 101 can also be used to resend the business request command to the satellite controller 102 when the number of times the first abnormal information is received has not reached the abnormal number accumulation threshold.
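The handling rules of the preceding paragraphs can be summarized as a small decision function. This is a sketch only; the function name and the action labels are hypothetical, and `count` is assumed to already include the current occurrence.

```python
def handle_response(kind, count, threshold):
    """Choose an action for one response.

    kind: "normal", "master_side" or "satellite_side".
    count: occurrences of this abnormality so far, including this one.
    threshold: the accumulation threshold for this kind of abnormality.
    """
    if kind == "normal":
        return "use_result"
    if count >= threshold:
        if kind == "master_side":
            # Repeated master-side codes suggest a program error in the master.
            return "log_master_program_error"
        return "mark_satellite_abnormal"
    if kind == "master_side":
        # Below threshold: repair the command format, then retransmit.
        return "adjust_format_and_resend"
    # Below threshold: resend the same service request command.
    return "resend"
```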
[0111] In some embodiments of this application, the out-of-band management master controller 101 can also be used to determine that the satellite controller 102 is an abnormal satellite controller and adjust the service configuration information corresponding to the abnormal satellite controller after determining that the satellite controller 102 is in an abnormal working state. That is to say, when it is determined that the satellite controller 102 is in an abnormal working state, the out-of-band management master controller 101 needs to adjust the service configuration information corresponding to the abnormal satellite controller in a timely manner to ensure the smooth operation of subsequent service tasks.
[0112] Since some monitoring tasks involve the out-of-band management controller 101 executing hardware control tasks based on monitoring information provided by the target satellite controller (a fixed satellite controller 102), if an abnormal satellite controller exists among the target satellite controllers, the control task cannot be executed due to the lack of some monitoring information. Therefore, the out-of-band management controller 101 adjusts the service configuration information corresponding to the abnormal satellite controller, which may include: locally generating monitoring information corresponding to the monitoring information type of the abnormal satellite controller to execute the control task corresponding to the abnormal satellite controller.
[0113] In some embodiments of this application, locally generating monitoring information corresponding to the monitoring information type of the abnormal satellite controller to execute the control task corresponding to the abnormal satellite controller may include: the out-of-band management controller 101 acquiring default monitoring information corresponding to the abnormal satellite controller to execute the control task. Alternatively, it may include: the out-of-band management controller 101 acquiring historical monitoring information from a historical moment at which the abnormal satellite controller had not been determined to be in an abnormal working state, to execute the control task.
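A sketch of locally generated monitoring information is given below. The source presents default values and historical values as alternatives; preferring the historical value when one exists is an assumption of this sketch, as are all names and values.

```python
def monitoring_value(controller_id, abnormal, latest, history, defaults):
    """Return the monitoring value to use for a control task."""
    if controller_id not in abnormal:
        return latest[controller_id]
    if controller_id in history:
        # Last value recorded before the controller was determined abnormal.
        return history[controller_id]
    # Otherwise fall back to a preconfigured default value.
    return defaults[controller_id]


# Hypothetical temperatures in degrees Celsius feeding a fan control task.
latest = {"me0": 52, "me1": 0}
history = {"me1": 47}
defaults = {"me0": 60, "me1": 60, "me2": 60}
```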
[0114] In other monitoring tasks, there is redundancy in the target satellite controller (i.e., only monitoring information from some of the satellite controllers 102 is needed), so the lack of monitoring information corresponding to the abnormal satellite controller will not affect the execution of the control task. In this case, the out-of-band management controller 101 can stop repeatedly sending service request commands to the satellite controller 102 after determining that the satellite controller 102 is in an abnormal working state, and continue to execute the control task corresponding to the abnormal satellite controller.
[0115] Furthermore, if the protocol of satellite controller 102 includes a recovery command, the out-of-band management master controller 101 can send a corresponding recovery command to satellite controller 102 after determining that satellite controller 102 is in an abnormal operating state. This recovery command can be a recovery completion code defined in the protocol, such as a completion code requiring satellite controller 102 to perform a reset operation.
[0116] In the case where multiple satellite controllers 102 are connected to the same first bus, the out-of-band management master controller 101 can also be used to obtain the working status of the neighboring satellite controllers 102 connected to the same first bus after determining that a satellite controller 102 is in an abnormal working state. If all neighboring satellite controllers 102 are in a normal working state, the abnormality is determined to be a local fault of that satellite controller 102; if any neighboring satellite controller 102 is also in an abnormal working state, a fault of the first bus is determined.
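The fault localization rule can be sketched in a few lines. The function name and the controller identifiers are hypothetical.

```python
def locate_fault(abnormal_id, bus_members, abnormal_ids):
    """Distinguish a local fault from a fault of the shared first bus."""
    neighbours = [c for c in bus_members if c != abnormal_id]
    if any(c in abnormal_ids for c in neighbours):
        # At least one neighbour on the same bus is also abnormal.
        return "first_bus_fault"
    # All neighbours are working normally (or there are none).
    return "local_fault"
```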
[0117] The satellite controller monitoring system provided in some embodiments of this application, in addition to the task of the out-of-band management master controller 101 detecting the execution result of the satellite controller 102 on the service request command based on command information and response information, also adds the task of the out-of-band management master controller 101 processing the satellite controller 102 that has the first abnormal information and even the abnormal satellite controller determined to be in an abnormal working state, thereby further ensuring the smooth operation of services.
[0118] In the out-of-band management master controller 101, the baseboard management controller already runs a large number of services and carries a heavy workload. Developing new baseboard management controller code to perform comprehensive abnormal state analysis of the satellite controller 102 would not only increase the development workload but also increase the risk of firmware failure of the baseboard management controller.
[0119] Therefore, in some embodiments of this application, the out-of-band management controller 101 may include a main processor and a coprocessor. The main processor is connected to the satellite controller 102 via a first bus to send service request commands to the satellite controller 102 and receive response commands from the satellite controller 102. The coprocessor is used to detect the execution result of the service request commands by the satellite controller 102 based on the command information and response information. If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller 102 is determined to be in an abnormal working state.
[0120] In some embodiments of this application, the coprocessor can communicate directly with the satellite controller 102. Thus, when performing online monitoring tasks on the satellite controller 102, the main processor in the out-of-band management controller 101 sends service request commands to the satellite controller 102, and the satellite controller 102 also sends the response command to the coprocessor when returning it to the main processor. However, this requires the coprocessor to be able to communicate with the satellite controller 102, for example by implementing the hardware and software configuration of the intelligent platform management bus, which calls for a high-performance processor such as a Field Programmable Gate Array (FPGA).
[0121] In some embodiments of this application, the main processor is used to send service request commands to the satellite controller 102 via the first bus and receive response commands sent by the satellite controller 102 via the first bus, and send the command information and response information to the coprocessor; the coprocessor is used to detect the execution result of the satellite controller 102 on the service request commands based on the command information and response information. If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller 102 is determined to be in an abnormal working state, and the coprocessor sends the information that the satellite controller 102 is in an abnormal working state to the main processor. Therefore, it is not necessary to develop code for abnormal state analysis of the satellite controller 102 for the main processor of the out-of-band management controller 101, nor is it necessary to implement the function of communicating with the satellite controller 102 on the coprocessor. Moreover, the coprocessor can use existing components in the out-of-band management controller 101, such as a complex programmable logic device (CPLD), so that the online status monitoring function of the satellite controller 102 can be realized at a very low cost.
[0122] In some embodiments of this application, if the coprocessor uses a complex programmable logic device, the complex programmable logic device on the out-of-band management controller 101 can be used, and it is usually connected to the main processor of the out-of-band management controller 101 via an integrated circuit bus.
[0123] In a specific implementation, after the main processor sends a service request command to the satellite controller 102, it can send an equivalent command of the service request command to the coprocessor, and after receiving the response command from the satellite controller 102, it sends the response information therein to the coprocessor.
[0124] To further improve the efficiency of the coprocessor in performing online status monitoring tasks on the satellite controller 102, in some embodiments of this application, the main processor sends command information and response information to the coprocessor, which may include: the main processor sending the first index number corresponding to the command information and the second index number corresponding to the response information to the coprocessor.
[0125] In other words, the main processor converts both request and response commands into index numbers instead of sending the command and response information directly to the coprocessor, thereby speeding up the communication rate between the main processor and the coprocessor and improving the efficiency of the online status monitoring task of the satellite controller 102.
[0126] If anomaly monitoring of the satellite controller 102 is performed by checking whether the response information contains an abnormal completion code, and the completion code has a small data size, the main processor can directly send the completion code in the response information to the coprocessor without converting it into an index number. As described in the above embodiments, the execution result of the service request command by the satellite controller 102 can be detected by presetting a first or second correspondence. Consistent with the definitions above, the first correspondence can be the correspondence between the first index number of the service request command and the abnormal completion code, and the second correspondence can be the correspondence between the first index number of the service request command and the standard completion code.
[0127] When the main processor sends information to the coprocessor, each service request command corresponds to a first index number, which can occupy only one byte. The coprocessor can then look up the corresponding service request command according to the pre-agreed command mapping relationship. For example, the mapping for a request command and its response command in the intelligent platform management bus protocol can take the following form: "Command 1 → Index 0x01 → Completion Code 0x00". Here 0x00 is the normal completion code defined by the intelligent platform management interface protocol; the completion code is a non-zero value when there is an exception.
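The index-based exchange can be sketched as follows. The command names, the index assignments and the two-byte report layout (first index number followed by completion code) are assumptions for illustration only.

```python
# Pre-agreed command mapping, known to both main processor and coprocessor.
COMMAND_INDEX = {"get_device_id": 0x01, "get_cpu_temperature": 0x02}
INDEX_COMMAND = {index: name for name, index in COMMAND_INDEX.items()}


def encode_report(command, completion_code):
    """Main processor side: pack index number and completion code in 2 bytes."""
    return bytes([COMMAND_INDEX[command], completion_code])


def decode_report(report):
    """Coprocessor side: recover the command and whether it completed normally."""
    index, completion_code = report  # a 2-byte report unpacks into two ints
    return INDEX_COMMAND[index], completion_code == 0x00
```

Sending one index byte instead of the full command keeps each transfer between the main processor and the coprocessor short.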
[0128] Depending on the application requirements, when the main processor sends a service request command, if the target is the satellite controller 102, it will follow the protocol of the first bus and use the standard protocol command; if the target is the coprocessor, it will use the aforementioned first index number to replace the service request command.
[0129] The coprocessor can send a message to the main processor indicating that the satellite controller 102 successfully executed the service request command when the analysis confirms that the execution result of the satellite controller 102 is normal; conversely, it can send a message to the main processor indicating that the satellite controller 102 failed to execute the service request command when the analysis confirms that the execution result of the satellite controller 102 is abnormal. The main processor can choose to wait for the analysis result sent by the coprocessor before allowing the next service request command to be sent to the same satellite controller 102, or it can choose not to wait for the coprocessor's analysis. After receiving the message from the coprocessor indicating that the satellite controller 102 failed to execute the service request command, the main processor can log this information.
[0130] When the coprocessor counts the number of times the satellite controller 102 has the first abnormal information and triggers the corresponding abnormal number accumulation threshold, the coprocessor sends information to the main processor that the satellite controller 102 is in an abnormal working state. This may include setting the pin on the main processor corresponding to the abnormal satellite controller to the level corresponding to the abnormal state.
[0131] Alternatively, the coprocessor may send information to the main processor that the satellite controller 102 is in an abnormal working state. This may include: the coprocessor generating a working abnormality flag corresponding to the abnormal working state type of the satellite controller 102, writing command information, response information and working abnormality flag into a first data packet, and sending the first data packet to the main processor.
[0132] The satellite controller monitoring system provided in some embodiments of this application can send the cumulative threshold of abnormal occurrences and the first or second correspondence to the coprocessor during initialization, so that the coprocessor can perform online abnormality detection on the satellite controller 102 accordingly.
[0133] Furthermore, the coprocessor can also connect to the host computer via a host computer communication interface. The coprocessor can then receive the accumulated threshold of abnormal occurrences and the first or second correspondence from the host computer. The coprocessor can also send information about the abnormal operating state of the satellite controller 102 to the host computer via the host computer interface after determining that the satellite controller 102 is in an abnormal operating state. The host computer interface can be a Universal Asynchronous Receiver / Transmitter (UART) interface.
[0134] Figure 2 is a schematic diagram of the information format for updating the satellite controller abnormal state analysis and positioning strategy according to some embodiments of this application.
[0135] During server operation, the cumulative threshold for the number of anomalies can be dynamically updated: the host computer or the main processor sends the new thresholds, and the coprocessor applies them. The data structure used to update the cumulative threshold is shown in Figure 2, and includes a data header, a data body, and a checksum byte. The data header occupies 2 bytes and is fixed as "0x55, 0xAA". The data body is of variable length, and its length must be consistent with the number of command types sent from the baseboard management controller to the satellite controller 102. The command number field (Cmd Num) represents the number of command types sent from the baseboard management controller to the satellite controller 102, and threshold 1 (Th1), threshold 2 (Th2), ..., threshold N (ThN) are the thresholds to be set for each command. Note that the order of these thresholds must match the order of the commands and index numbers described above. The last byte is the checksum (ChkSum), used to verify the data integrity of the data body. When actually updating the threshold, the host computer sends data to the coprocessor through the host computer interface according to the format shown in Figure 2. The coprocessor then parses the data according to the format shown in Figure 2 and writes it to local storage.
[0136] In addition, the aforementioned threshold dynamic update can also be implemented by the baseboard management controller through the integrated circuit bus. In this case, the baseboard management controller only needs to send data to the coprocessor through the integrated circuit bus according to the format shown in Figure 2, and the coprocessor will parse the data according to the format shown in Figure 2 and write it to the local storage.
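The update frame of Figure 2 can be sketched as below. The source does not specify the checksum algorithm or the width of each threshold, so this sketch assumes an 8-bit sum over the data body and one byte per threshold; the function names are hypothetical.

```python
HEADER = bytes([0x55, 0xAA])  # fixed 2-byte data header


def checksum(body):
    # Assumption: 8-bit sum of the data body; the source names no algorithm.
    return sum(body) & 0xFF


def build_frame(thresholds):
    """Sender side: header, then Cmd Num, Th1..ThN, then ChkSum."""
    body = bytes([len(thresholds), *thresholds])
    return HEADER + body + bytes([checksum(body)])


def parse_frame(frame):
    """Coprocessor side: validate a frame and return the threshold list."""
    if len(frame) < 4 or frame[:2] != HEADER:
        raise ValueError("bad header")
    body, received = frame[2:-1], frame[-1]
    if checksum(body) != received:
        raise ValueError("bad checksum")
    cmd_num = body[0]
    if len(body) != 1 + cmd_num:
        raise ValueError("body length does not match Cmd Num")
    # Thresholds are ordered to match the command index numbers.
    return list(body[1:])


frame = build_frame([3, 5, 2])
```

The same frame can be carried over either the host computer interface or the integrated circuit bus, since parsing depends only on the byte layout.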
[0137] Figure 3 is a schematic diagram of the structure of a monitoring system for another satellite controller provided in some embodiments of this application.
[0138] As shown in Figure 3, the platform controller management engine unit is taken as an example of the satellite controller 102. The platform controller management engine unit is part of the platform controller hub (PCH), which is connected to the host unit via a direct media interface (DMI). In the server, the central processing unit (CPU) in the host unit provides computing power support. Taking a classic dual-socket server (i.e., the host unit includes two CPUs) as an example, CPU 1 and CPU 2 are connected via an ultra-path interconnect (UPI) interface to realize interconnection and communication between the CPUs. A direct media interface is provided between CPU 1 and the platform controller hub for data interaction between them. In some embodiments of this application, the exchanged data mainly involves fault diagnosis-related data transmitted between the management engine and the CPU (e.g., information collected from the CPU's model-specific registers (MSR), control and status registers (CSR), etc., via the Raw PECI command), as well as temperature sensor-related data (e.g., CPU temperature information and memory temperature information).
[0139] Within the platform controller, the management engine possesses various functions, including but not limited to remote management and security features. It remains active during system startup and operating system runtime, and can perform certain operations even when the device is powered off. In some embodiments of this application, the management engine is connected to the integrated circuit bus controller of the out-of-band management master controller 101 via an integrated circuit bus, primarily for data interaction with the out-of-band management master controller 101, using the intelligent platform management bus protocol. Through the management engine, the out-of-band management master controller 101 can obtain more information about the host unit side, including the aforementioned system fault diagnosis information and the temperature information of the central processing unit and memory.
[0140] In some embodiments of this application, the main processor on the out-of-band management controller 101 can be a baseboard management controller, which is mainly responsible for the functions of monitoring the status of the entire server system, fault diagnosis, log recording and firmware upgrade. It is connected to the management engine through an integrated circuit bus, and can also be connected to a complex programmable logic device through the integrated circuit bus. It is mainly responsible for initiating system management commands to the management engine and interacting with the complex programmable logic device to realize the detection and analysis of abnormal status of the management engine.
[0141] In some embodiments of this application, the complex programmable logic device is connected to the baseboard management controller via an integrated circuit bus to receive command data from the baseboard management controller and send control information to the baseboard management controller. It also communicates with the host computer via a host computer interface, for example to update the strategy used to locate and analyze abnormal states of the satellite controller 102 and to upload the analysis results to the host computer.
[0142] As mentioned above, the components involved in the system architecture provided in some embodiments of this application (central processing unit, platform controller, baseboard management controller, complex programmable logic device) are all indispensable components of the server system. Therefore, some embodiments of this application can achieve abnormal state analysis and location of satellite controller 102 without introducing additional components, which has significant advantages in improving the efficiency of fault analysis and location of satellite controller 102 and saving debugging costs.
[0143] In addition to the architecture described in the above embodiments, which uses a main processor and a coprocessor to perform the monitoring tasks of the satellite controller 102, in some embodiments of this application, the out-of-band management controller 101 may include a first operating system and a second operating system, wherein the response rate of the first operating system is higher than that of the second operating system; the second operating system is used to send service request commands to the satellite controller 102 through the first bus and to receive response commands sent by the satellite controller 102 through the first bus, and to send command information and response information to the first operating system; the first operating system is used to detect the execution result of the satellite controller 102 on the service request commands according to the command information and response information, and if the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller 102 is determined to be in an abnormal working state, and the information that the satellite controller 102 is in an abnormal working state is sent to the second operating system.
[0144] It should be noted that the first and second operating systems in some embodiments of this application do not limit the type or priority of the operating systems. The first and second operating systems can both run on the core processor of the baseboard management controller in the out-of-band management master controller 101 and communicate with each other through inter-core communication.
[0145] Alternatively, one of the first operating system and the second operating system can run on the core processor of the baseboard management controller, while the other runs on the coprocessor in the out-of-band management master controller 101.
[0146] In some embodiments of this application, the first operating system can be a real-time operating system, and the second operating system can be a non-real-time operating system. It should be noted that the terms "real-time operating system" and "non-real-time operating system" in some embodiments of this application are only used to distinguish between two categories of operating systems that employ different task processing methods. For example, a non-real-time operating system can be Contiki, HeliOS, Linux, etc., while a real-time operating system can be FreeRTOS, RTLinux, etc.
[0147] In some embodiments of this application, the baseboard management controller in the out-of-band management master controller 101 can run a dual-system RTOS / Linux. When using a multi-core baseboard management controller, one core processor can run the RTOS, while the remaining core processors run the Linux system.
[0148] In some embodiments of this application, the steps executed by the first operating system can refer to the coprocessor described in the above embodiments, and the steps executed by the second operating system can refer to the main processor described in the above embodiments, which will not be repeated here.
[0149] Building on the monitoring system for a satellite controller of any of the above embodiments, the monitoring methods for satellite controllers provided in some embodiments of this application are described below with reference to the accompanying drawings.
[0150] Figure 4 is a flowchart of a satellite controller monitoring method provided in some embodiments of this application.
[0151] As shown in Figure 4, the satellite controller monitoring method provided in some embodiments of this application, applied to the out-of-band management master controller, may include:
[0152] S401: Send a service request command to the satellite controller;
[0153] S402: If a response command is received from the satellite controller via the first bus, it is determined that the satellite controller is in a normal response state;
[0154] S403: Detect the execution result of the satellite controller on the service request command based on the command information of the service request command and the response information of the response command;
[0155] S404: If the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state.
[0156] The first abnormal information includes at least one of abnormal task execution result information and abnormal completion code.
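Steps S401 to S404 can be sketched as a small loop, given here in Python as a minimal illustration rather than the actual firmware. The callables `send` and `is_abnormal`, the returned strings, and the handling of a missing response are all hypothetical; the resend below the threshold follows the retry behavior the method describes for abnormal responses that have not yet reached the threshold.

```python
from typing import Callable, Optional


def monitor_command(
    send: Callable[[bytes], Optional[bytes]],
    is_abnormal: Callable[[bytes, bytes], bool],
    command: bytes,
    threshold: int,
) -> str:
    """Return 'normal', 'abnormal_working' or 'no_response' for one command."""
    abnormal_count = 0
    while True:
        response = send(command)                # S401: send the service request
        if response is None:
            return "no_response"                # outside the scope of S401 to S404
        # S402: a response arrived, so the satellite controller is responding
        if not is_abnormal(command, response):  # S403: check the execution result
            return "normal"
        abnormal_count += 1                     # first abnormal information seen
        if abnormal_count >= threshold:         # S404: accumulation threshold reached
            return "abnormal_working"
        # below the threshold: loop again and resend the same command
```

Keeping the counter local to one command means a single bad reply never flags the satellite controller; only repeated abnormal replies to the same request do.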
[0157] In practice, the out-of-band management master controller can be connected to the satellite controller via a first bus. The satellite controller may include, but is not limited to, the platform controller management engine unit.
[0158] In some embodiments of this application, the type of abnormal completion code may include an out-of-band management master controller-side abnormal completion code and a satellite controller-side abnormal completion code; the cumulative threshold for the number of abnormalities corresponding to the out-of-band management master controller-side abnormal completion code is greater than the cumulative threshold for the number of abnormalities corresponding to the satellite controller-side abnormal completion code.
[0159] The type of abnormal completion code on the out-of-band management controller side may include at least one of a first abnormal completion code and a second abnormal completion code; the first abnormal completion code indicates that the service request command is an illegal command, and the second abnormal completion code indicates that the data of the service request command has been truncated.
[0160] The types of abnormal completion codes on the satellite controller side may include at least one of the following: third abnormal completion code, fourth abnormal completion code, fifth abnormal completion code, and sixth abnormal completion code; the third abnormal completion code indicates that the task timed out, the fourth abnormal completion code indicates that the command storage space is insufficient, the fifth abnormal completion code indicates that the service request command cannot be executed due to priority issues, and the sixth abnormal completion code indicates that the current state of the satellite controller cannot execute the service request command.
[0161] The satellite controller monitoring method provided in some embodiments of this application may further include: if there is an out-of-band management master controller side abnormal completion code in the response information and the number of times has not reached the corresponding abnormal number accumulation threshold, then after adjusting the command format of the service request command, the adjusted service request command is sent.
[0162] The satellite controller monitoring method provided in some embodiments of this application may further include: if the number of times an out-of-band management master controller side abnormal completion code exists in the response information reaches the corresponding abnormal number accumulation threshold, then record the out-of-band management master controller program error log.
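The handling of the two groups of abnormal completion codes can be illustrated as follows. This is a sketch under stated assumptions: the enum members are symbolic stand-ins (the description gives no byte values), the thresholds 5 and 3 are hypothetical and only respect the requirement that the master-side threshold be the larger one, and counting per side rather than per individual code is also an assumption.

```python
from enum import Enum, auto


class Code(Enum):
    ILLEGAL_COMMAND = auto()   # first abnormal completion code (master side)
    DATA_TRUNCATED = auto()    # second abnormal completion code (master side)
    TASK_TIMEOUT = auto()      # third abnormal completion code (satellite side)
    OUT_OF_SPACE = auto()      # fourth abnormal completion code (satellite side)
    PRIORITY_BLOCKED = auto()  # fifth abnormal completion code (satellite side)
    STATE_NOT_READY = auto()   # sixth abnormal completion code (satellite side)


MASTER_SIDE = {Code.ILLEGAL_COMMAND, Code.DATA_TRUNCATED}
# Hypothetical values; the source only requires master-side > satellite-side.
THRESHOLDS = {"master": 5, "satellite": 3}


def next_action(code: Code, counts: dict[str, int]) -> str:
    """Update the per-side counter and return the action for this code."""
    side = "master" if code in MASTER_SIDE else "satellite"
    counts[side] = counts.get(side, 0) + 1
    reached = counts[side] >= THRESHOLDS[side]
    if side == "master":
        # The request itself is suspect: fix its format first, log once exhausted.
        return "log_master_program_error" if reached else "adjust_format_and_resend"
    return "mark_satellite_abnormal" if reached else "resend"
```

The larger master-side threshold gives the out-of-band management master controller more attempts to correct its own request before the fault is attributed to its program.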
[0163] In some embodiments of this application, the type of abnormal task execution result information may include first information and second information; wherein the first information is task execution result information that does not match the target information to be obtained by the service request command, and the second information is task execution result information with missing data.
[0164] The type of the first information may include at least one of third information, fourth information, fifth information, and sixth information; wherein the third information is task execution result information with an incorrect data format; the fourth information is task execution result information with an incorrect data length; the fifth information is task execution result information whose information type does not match the type of the target information; and the sixth information is task execution result information containing a monitoring value that exceeds the threshold range of the target information.
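A possible way to classify a task execution result against the target information is sketched below in Python. The `Expectation` fields, the little-endian integer payload, and the returned labels are assumptions made for illustration; the data format check (third information) is left out because it depends on an encoding the description does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Expectation:
    info_type: str  # type of the target information, e.g. "cpu_temp"
    length: int     # expected payload length in bytes
    low: int        # lower bound of the plausible monitoring value
    high: int       # upper bound of the plausible monitoring value


def classify_result(
    info_type: str, payload: Optional[bytes], exp: Expectation
) -> Optional[str]:
    """Return None for a normal result, else a label for the abnormal information."""
    if not payload:
        return "second_information:missing_data"
    if info_type != exp.info_type:
        return "fifth_information:type_mismatch"
    if len(payload) != exp.length:
        return "fourth_information:wrong_length"
    value = int.from_bytes(payload, "little")  # assumed unsigned little-endian value
    if not exp.low <= value <= exp.high:
        return "sixth_information:out_of_range"
    return None
```

For example, with `Expectation("cpu_temp", 1, 0, 110)` a one-byte reading of 45 passes, while a reading of 200 is reported as out of range.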
[0165] The satellite controller monitoring method provided in some embodiments of this application may further include: when the number of times the first abnormal information is received does not reach the abnormal number accumulation threshold, resending the service request command to the satellite controller.
[0166] The satellite controller monitoring method provided in some embodiments of this application may further include: after determining that the satellite controller is in an abnormal working state, determining that the satellite controller is an abnormal satellite controller and adjusting the service configuration information corresponding to the abnormal satellite controller.
[0167] In some embodiments of this application, the out-of-band management master controller adjusts the service configuration information corresponding to the abnormal satellite controller, which may include: generating monitoring information locally corresponding to the monitoring information type of the abnormal satellite controller to execute the control task corresponding to the abnormal satellite controller.
[0168] In some embodiments of this application, the out-of-band management master controller generates monitoring information locally corresponding to the monitoring information type of the abnormal satellite controller to execute the control task corresponding to the abnormal satellite controller. This may include: the out-of-band management master controller obtaining the default monitoring information corresponding to the abnormal satellite controller to execute the control task.
[0169] In some embodiments of this application, the out-of-band management master controller generates monitoring information locally corresponding to the monitoring information type of the abnormal satellite controller to execute control tasks corresponding to the abnormal satellite controller. This may include: the out-of-band management master controller acquiring historical monitoring information of historical moments when the abnormal satellite controller was not diagnosed as an abnormal working state to execute control tasks.
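The two local substitutes described above (default monitoring information and historical monitoring information) can be combined in one small cache, sketched here in Python. The class and method names are hypothetical, and preferring the last healthy reading over the default is an assumption; the description presents the two options as separate embodiments without ranking them.

```python
class MonitoringCache:
    """Local source of monitoring values for an abnormal satellite controller."""

    def __init__(self, defaults: dict[str, float]):
        self.defaults = defaults                 # default monitoring information
        self.last_good: dict[str, float] = {}    # historical monitoring information

    def record(self, info_type: str, value: float) -> None:
        """Store a reading taken while the satellite controller was healthy."""
        self.last_good[info_type] = value

    def local_value(self, info_type: str) -> float:
        """Value used for control tasks once the controller is marked abnormal."""
        if info_type in self.last_good:
            return self.last_good[info_type]
        return self.defaults[info_type]
```

A control task such as fan speed regulation could then keep running on `cache.local_value("cpu_temp")` while the satellite controller is unavailable.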
[0170] In some embodiments of this application, the out-of-band management controller may include a main processor and a coprocessor. The steps S401 (sending a service request command to the satellite controller) and S402 (receiving a response command from the satellite controller) include: the main processor sending a service request command to the satellite controller and receiving a response command from the satellite controller, and sending the command information and response information to the coprocessor. The steps S403 (detecting the execution result of the service request command by the satellite controller based on the command information of the service request command and the response information of the response command) and S404 (determining the satellite controller is in an abnormal operating state if the number of times the first abnormal information is detected in the response information reaches an abnormal number accumulation threshold) include: the coprocessor detecting the execution result of the service request command by the satellite controller based on the command information and response information; if the number of times the first abnormal information is detected in the response information reaches an abnormal number accumulation threshold, determining the satellite controller is in an abnormal operating state, and sending information indicating that the satellite controller is in an abnormal operating state to the main processor.
[0171] In some embodiments of this application, the sending of command information and response information from the main processor to the coprocessor may include: the main processor sending the first index number corresponding to the command information and the second index number corresponding to the response information to the coprocessor.
[0172] In some embodiments of this application, the coprocessor sending information about the satellite controller being in an abnormal working state to the main processor may include: the coprocessor generating a working abnormality flag corresponding to the abnormal working state type of the satellite controller, writing command information, response information and the working abnormality flag into a first data packet, and sending the first data packet to the main processor.
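The exchange between the main processor and the coprocessor can be pictured with index numbers and a compact packet, as in the Python sketch below. The lookup tables, the three-byte layout, and the function names are all hypothetical; the description states only that index numbers are sent and that the first data packet carries the command information, the response information, and the working abnormality flag.

```python
import struct

# Hypothetical lookup tables shared by both processors.
COMMANDS = {0: "get_cpu_temp", 1: "get_mem_temp"}
RESPONSES = {0: "ok", 1: "task_timeout", 2: "data_missing"}

# Assumed layout: command index, response index, working abnormality flag.
PACKET = struct.Struct("<BBB")


def main_to_coprocessor(command: str, response: str) -> tuple[int, int]:
    """Main processor: replace both items with their index numbers."""
    first = next(i for i, name in COMMANDS.items() if name == command)
    second = next(i for i, name in RESPONSES.items() if name == response)
    return first, second


def build_first_packet(first: int, second: int, flag: int) -> bytes:
    """Coprocessor: write command info, response info and the flag into one packet."""
    return PACKET.pack(first, second, flag)
```

Sending indices instead of full command and response contents keeps the traffic between the two processors short, which matters on a slow bus.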
[0173] In some embodiments of this application, the out-of-band management controller may include a first operating system and a second operating system, with the first operating system having a higher response rate than the second operating system. Therefore, S401 sending a service request command to the satellite controller and S402 receiving a response command from the satellite controller include: the second operating system sending a service request command to the satellite controller and receiving a response command from the satellite controller, and sending the command information and response information to the first operating system. S403 detecting the execution result of the service request command by the satellite controller based on the command information of the service request command and the response information of the response command, and S404 determining that the satellite controller is in an abnormal working state if the number of times the first abnormal information is detected in the response information reaches an abnormal number accumulation threshold, include: the first operating system detecting the execution result of the service request command by the satellite controller based on the command information and response information; if the number of times the first abnormal information is detected in the response information reaches an abnormal number accumulation threshold, determining that the satellite controller is in an abnormal working state, and sending information indicating that the satellite controller is in an abnormal working state to the second operating system.
[0174] Since the embodiments of the method section correspond to the embodiments of the system section, please refer to the description of the embodiments of the system section for the embodiments of the method section, and they will not be repeated here.
[0175] Figure 5 is a flowchart of another satellite controller monitoring method provided in some embodiments of this application.
[0176] Taking the architecture shown in Figure 3 as an example, the steps of the monitoring method for the platform controller management engine unit can be shown in Figure 5, including the following steps S501 to S517.
[0177] S501: The baseboard management controller is powered on.
[0178] S502: The platform controller management engine is powered on.
[0179] S503: The complex programmable logic device is powered on.
[0180] S504: The baseboard management controller sends service request commands to the platform controller management engine and simultaneously sends equivalent commands to the complex programmable logic device.
[0181] S505: The complex programmable logic device receives and stores the equivalent commands.
[0182] S506: The platform controller management engine sends a response command to the baseboard management controller.
[0183] S507: The baseboard management controller receives and parses the response command, and extracts the completion code from it.
[0184] S508: The baseboard management controller sends the completion code to the complex programmable logic device.
[0185] S509: The complex programmable logic device receives and saves the completion code.
[0186] S510: The complex programmable logic device (CPLD) processes the equivalent command and the completion code to obtain the working status of the management engine.
[0187] S511: If the management engine is in an abnormal operating state, the complex programmable logic device sets the management engine operating state abnormal flag.
[0188] S512: The complex programmable logic device sends the equivalent command, the completion code, and the management engine operating state abnormal flag to the host computer.
[0189] S513: The complex programmable logic device sends the equivalent command, the completion code, and the management engine operating state abnormal flag to the baseboard management controller.
[0190] S514: The baseboard management controller logs abnormal operating status of the management engine.
[0191] S515: If the management engine is operating normally, the complex programmable logic device sets the normal operating status flag.
[0192] S516: The complex programmable logic device sends the normal operating status flag to the baseboard management controller.
[0193] S517: The baseboard management controller continues to send the next service request command to the management engine, and jumps to S504.
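The interplay of steps S504 to S517 can be simulated in a few lines of Python. This is a simplified sketch, not the flow of Figure 5 in full: the `Cpld` class and `run_bmc` function are hypothetical, treating completion code 0x00 as normal is an assumption, the non-zero code used in the usage note is an arbitrary placeholder, and the upload to the host computer (S512) is omitted.

```python
NORMAL = 0x00  # assumed value of a normal completion code


class Cpld:
    """Stand-in for the complex programmable logic device."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.abnormal_count = 0
        self.command = None
        self.code = None

    def store_command(self, command: int) -> None:  # S505
        self.command = command

    def store_code(self, code: int) -> None:        # S509
        self.code = code

    def evaluate(self) -> bool:                     # S510, S511, S515
        """Return True when the abnormal flag is set for the stored pair."""
        if self.code == NORMAL:
            return False
        self.abnormal_count += 1
        return self.abnormal_count >= self.threshold


def run_bmc(commands, management_engine, cpld, log):
    """Baseboard management controller loop over service request commands."""
    for command in commands:               # S517: continue with the next command
        cpld.store_command(command)        # S504: equivalent command to the CPLD
        code = management_engine(command)  # S504, S506, S507: request and reply
        cpld.store_code(code)              # S508: forward the completion code
        if cpld.evaluate():                # S513 (abnormal) or S516 (normal)
            log.append((command, code))    # S514: log the abnormal state
```

Calling `run_bmc([1, 2, 2], lambda c: {1: 0x00, 2: 0x01}[c], Cpld(threshold=2), log)` with an empty list `log` leaves one entry, `(2, 0x01)`, because only the second abnormal reply reaches the threshold.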
[0194] It should be noted that in the embodiments of the satellite controller monitoring methods of this application, some steps or features may be omitted or not executed. The division of hardware or software functional modules is for ease of explanation and is not the only implementation of the satellite controller monitoring methods provided in some embodiments of this application.
[0195] The above details some embodiments of the satellite controller monitoring method. Based on this, this application also discloses a satellite controller monitoring device, equipment, non-volatile storage medium, and computer program product corresponding to the above method.
[0196] Applied to the out-of-band management controller of a server, the monitoring device for the satellite controller provided in some embodiments of this application may include:
[0197] The sending unit is used to send service request commands to the satellite controller;
[0198] The receiving unit is used to determine that the satellite controller is in a normal response state if it receives a response command from the satellite controller via the first bus;
[0199] The detection unit is used to detect the execution result of the satellite controller in response to the service request command based on the command information of the service request command and the response information of the response command; if the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, the satellite controller is determined to be in an abnormal working state.
[0200] The first abnormal information includes at least one of abnormal task execution result information and abnormal completion code.
[0201] The sending unit can also be used to resend the service request command to the satellite controller when the number of times the first abnormal information is received has not reached the abnormal number accumulation threshold.
[0202] The sending unit can also be used to send the adjusted service request command after adjusting the command format if there is an out-of-band management controller side abnormal completion code in the response information and the number of times has not reached the corresponding abnormal number accumulation threshold.
[0203] The monitoring device for the satellite controller provided in some embodiments of this application may further include:
[0204] The log unit is used to record an error log for the out-of-band management controller program if the number of times an out-of-band management controller-side exception completion code appears in the response information reaches the corresponding cumulative exception threshold.
[0205] The monitoring device for the satellite controller provided in some embodiments of this application may further include:
[0206] The service adjustment unit is used to, after determining that the satellite controller is in an abnormal working state, determine that the satellite controller is an abnormal satellite controller and adjust the service configuration information corresponding to the abnormal satellite controller.
[0207] The monitoring device for the satellite controller provided in some embodiments of this application may further include:
[0208] The recovery unit is used to obtain the corresponding recovery command when the satellite controller's operating state is determined to be abnormal; and to send the recovery command to the satellite controller so that the satellite controller can perform the abnormal recovery task.
[0209] It should be noted that in the various embodiments of the satellite controller monitoring device provided in this application, the division of units is only a logical functional division, and other division methods can be used. The connection between different units can be electrical, mechanical, or other connection methods. Separate units can be located in the same physical location or distributed across multiple network nodes. Each unit can be implemented in hardware or as a software functional unit. That is, according to actual needs, some or all of the units provided in some embodiments of this application can be selected, and corresponding connection or integration methods can be used to achieve the purpose of some embodiments of this application.
[0210] Since the embodiments of the apparatus and the embodiments of the method correspond to each other, please refer to the description of the embodiments of the method for the embodiments of the apparatus, which will not be repeated here.
[0211] Figure 6 is a schematic diagram of the structure of the monitoring equipment for a satellite controller provided in some embodiments of this application.
[0212] As shown in Figure 6, the monitoring equipment for a satellite controller provided in some embodiments of this application includes: a memory 610 for storing a computer program 611; and a processor 620 for executing the computer program 611, wherein the computer program 611, when executed by the processor 620, implements the steps of the monitoring method for a satellite controller provided in any of the above embodiments.
[0213] The processor 620 may include one or more processing cores, such as a 3-core processor or an 8-core processor. The processor 620 may be implemented using at least one hardware form selected from a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 620 may also include a main processor and a coprocessor. The main processor, also known as the Central Processing Unit (CPU), is used to process data in the wake-up state; the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, the processor 620 may integrate a Graphics Processing Unit (GPU) responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, the processor 620 may also include an Artificial Intelligence (AI) processor for handling computational operations related to machine learning.
[0214] The memory 610 may include one or more non-volatile storage media, which may be non-transitory. The memory 610 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In this embodiment, the memory 610 is used to store at least the following computer program 611, which, after being loaded and executed by the processor 620, can implement the relevant steps in the monitoring method of the satellite controller disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 610 may also include an operating system 612 and data 613, and the storage method may be temporary storage or permanent storage. The operating system 612 may be Windows or other types of operating systems. The data 613 may include, but is not limited to, the data involved in the above methods.
[0215] In some embodiments, the monitoring equipment of the satellite controller may further include a display screen 630, a power supply 640, a communication interface 650, an input / output interface 660, a sensor 670, and a communication bus 680.
[0216] Those skilled in the art will understand that the structure shown in Figure 6 does not constitute a limitation on the monitoring equipment of the satellite controller and may include more or fewer components than shown.
[0217] The monitoring equipment for a satellite controller provided in some embodiments of this application includes a memory and a processor. When the processor executes the program stored in the memory, it can implement the steps of the satellite controller monitoring method provided in the above embodiments, with the same effects as described above.
[0218] Some embodiments of this application provide a non-volatile storage medium storing a computer program thereon, which, when executed by a processor, can implement the steps of the satellite controller monitoring method provided in any of the above embodiments.
[0219] The non-volatile storage medium may include: USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks or optical disks, and other media that can store program code.
[0220] For a description of the non-volatile storage medium provided in some embodiments of this application, please refer to the above method embodiments. The effects of the medium are the same as those of the satellite controller monitoring method provided in some embodiments of this application, and will not be repeated here.
[0221] Some embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the steps of a monitoring method for a satellite controller as provided in any of the above embodiments.
[0222] For a description of the computer program products provided in some embodiments of this application, please refer to the above method embodiments. The effects they achieve are the same as those of the satellite controller monitoring method provided in some embodiments of this application, and will not be repeated here.
[0223] The monitoring system, method, apparatus, device, medium, and product of the satellite controller provided in this application have been described in detail above. The various embodiments in the specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments; for similar or identical parts, the embodiments can be referred to one another. Because the methods, apparatus, devices, non-volatile storage media, and computer program products disclosed in the embodiments correspond to the systems disclosed in the embodiments, their descriptions are relatively brief, and the description of the system section can be consulted for the relevant parts. It should be noted that those skilled in the art can make several improvements and modifications to this application without departing from the principles of this application, and these improvements and modifications also fall within the protection scope of this application.
[0224] It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
Claims
1. A monitoring system for a satellite controller, characterized in that it comprises an out-of-band management controller and a first bus; the out-of-band management controller is configured to: send a service request command to the satellite controller via the first bus; if a response command is received from the satellite controller via the first bus, determine that the satellite controller is in a normal response state; detect the execution result of the satellite controller on the service request command based on the command information of the service request command and the response information of the response command; and if the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, determine that the satellite controller is in an abnormal working state; wherein the first abnormal information includes at least one of abnormal task execution result information and abnormal completion code.
2. The monitoring system for the satellite controller according to claim 1, characterized in that, The types of abnormal completion codes include out-of-band management master controller-side abnormal completion codes and satellite controller-side abnormal completion codes; The cumulative threshold for the number of anomalies corresponding to the anomaly completion code on the out-of-band management master controller side is greater than the cumulative threshold for the number of anomalies corresponding to the anomaly completion code on the satellite controller side.
3. The monitoring system for the satellite controller according to claim 2, characterized in that the types of the out-of-band management controller-side abnormal completion code include at least one of a first abnormal completion code and a second abnormal completion code; the first abnormal completion code indicates that the service request command is an illegal command, and the second abnormal completion code indicates that the data of the service request command has been truncated.
4. The monitoring system for the satellite controller according to claim 2, characterized in that the types of the satellite controller-side abnormal completion code include at least one of a third abnormal completion code, a fourth abnormal completion code, a fifth abnormal completion code, and a sixth abnormal completion code; the third abnormal completion code indicates a task timeout, the fourth abnormal completion code indicates insufficient command storage space, the fifth abnormal completion code indicates that the service request command cannot be executed because of its priority, and the sixth abnormal completion code indicates that the satellite controller cannot execute the service request command in its current state.
5. The monitoring system for the satellite controller according to claim 2, characterized in that the out-of-band management controller is further configured to, if the response information contains the out-of-band management controller-side abnormal completion code and its number of occurrences has not reached the corresponding abnormal number accumulation threshold, adjust the command format of the service request command and send the adjusted service request command.
6. The monitoring system for the satellite controller according to claim 2, characterized in that the out-of-band management controller is further configured to record an out-of-band management controller program error log if the number of times the out-of-band management controller-side abnormal completion code appears in the response information reaches the corresponding abnormal number accumulation threshold.
7. The monitoring system for the satellite controller according to claim 1, characterized in that the types of the abnormal task execution result information include first information and second information; the first information is task execution result information that does not match the target information that the service request command is intended to obtain; and the second information is task execution result information with missing data.
8. The monitoring system for the satellite controller according to claim 7, characterized in that the types of the first information include at least one of third information, fourth information, fifth information, and sixth information; the third information is task execution result information with an incorrect data format; the fourth information is task execution result information with an incorrect data length; the fifth information is task execution result information whose information type does not match the type of the target information to be obtained; and the sixth information is task execution result information containing a monitoring value that exceeds the threshold range of the target information to be obtained.
9. The monitoring system for the satellite controller according to claim 1, characterized in that the out-of-band management controller is further configured to resend the service request command to the satellite controller if the number of times the first abnormal information is detected has not reached the abnormal number accumulation threshold.
10. The monitoring system for the satellite controller according to claim 1, characterized in that the out-of-band management controller is further configured to, after determining that the satellite controller is in an abnormal working state, identify the satellite controller as an abnormal satellite controller and adjust the service configuration information corresponding to the abnormal satellite controller.
11. The monitoring system for the satellite controller according to claim 10, characterized in that the out-of-band management controller adjusting the service configuration information corresponding to the abnormal satellite controller includes: locally generating monitoring information corresponding to the monitoring information type of the abnormal satellite controller, so as to execute the control task corresponding to the abnormal satellite controller.
12. The monitoring system for the satellite controller according to claim 11, characterized in that the out-of-band management controller locally generating monitoring information corresponding to the monitoring information type of the abnormal satellite controller, so as to execute the control task corresponding to the abnormal satellite controller, includes: the out-of-band management controller acquiring default monitoring information corresponding to the abnormal satellite controller in order to execute the control task.
13. The monitoring system for the satellite controller according to claim 11, characterized in that the out-of-band management controller locally generating monitoring information corresponding to the monitoring information type of the abnormal satellite controller, so as to execute the control task corresponding to the abnormal satellite controller, includes: the out-of-band management controller acquiring historical monitoring information from a historical moment at which the abnormal satellite controller had not yet been determined to be in an abnormal working state, in order to execute the control task.
14. The monitoring system for the satellite controller according to claim 1, characterized in that the out-of-band management controller includes a main processor and a coprocessor; the main processor is configured to send the service request command to the satellite controller via the first bus, receive the response command fed back by the satellite controller via the first bus, and send the command information and the response information to the coprocessor; and the coprocessor is configured to detect the satellite controller's execution result for the service request command based on the command information and the response information, and, if the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, determine that the satellite controller is in an abnormal working state and send information indicating that the satellite controller is in an abnormal working state to the main processor.
15. The monitoring system for the satellite controller according to claim 14, characterized in that the main processor sending the command information and the response information to the coprocessor includes: the main processor sending a first index number corresponding to the command information and a second index number corresponding to the response information to the coprocessor.
16. The monitoring system for the satellite controller according to claim 14, characterized in that the coprocessor sending information indicating that the satellite controller is in an abnormal working state to the main processor includes: the coprocessor generating an abnormal operation flag corresponding to the abnormal working state type of the satellite controller, writing the command information, the response information, and the abnormal operation flag into a first data packet, and sending the first data packet to the main processor.
17. The monitoring system for the satellite controller according to claim 1, characterized in that the out-of-band management controller includes a first operating system and a second operating system, wherein the response speed of the first operating system is higher than that of the second operating system; the second operating system is configured to send the service request command to the satellite controller via the first bus, receive the response command fed back by the satellite controller via the first bus, and send the command information and the response information to the first operating system; and the first operating system is configured to detect the satellite controller's execution result for the service request command based on the command information and the response information, and, if the number of times the first abnormal information is detected in the response information reaches the abnormal number accumulation threshold, determine that the satellite controller is in an abnormal working state and send information indicating that the satellite controller is in an abnormal working state to the second operating system.
18. The monitoring system for the satellite controller according to claim 1, characterized in that the satellite controller includes at least a platform controller management engine unit.
19. The monitoring system for the satellite controller according to claim 1, characterized in that the out-of-band management controller is a baseboard management controller; and the satellite controller is a management controller that is not integrated into the board on which the out-of-band management controller is located.
20. A monitoring method for a satellite controller, characterized in that it is applied to an out-of-band management controller and comprises: sending a service request command to the satellite controller; if a response command fed back by the satellite controller via a first bus is received, determining that the satellite controller is in a normal response state; detecting the satellite controller's execution result for the service request command based on the command information of the service request command and the response information of the response command; and, if the number of times first abnormal information is detected in the response information reaches an abnormal number accumulation threshold, determining that the satellite controller is in an abnormal working state; wherein the first abnormal information includes at least one of abnormal task execution result information and an abnormal completion code.
21. A monitoring apparatus for a satellite controller, characterized in that it is applied to an out-of-band management controller of a server and comprises: a transmitting unit configured to send a service request command to the satellite controller; a receiving unit configured to determine that the satellite controller is in a normal response state if a response command fed back by the satellite controller via a first bus is received; and a detection unit configured to detect the satellite controller's execution result for the service request command based on the command information of the service request command and the response information of the response command, and, if the number of times first abnormal information is detected in the response information reaches an abnormal number accumulation threshold, determine that the satellite controller is in an abnormal working state; wherein the first abnormal information includes at least one of abnormal task execution result information and an abnormal completion code.
22. A monitoring device for a satellite controller, characterized in that it comprises: a memory configured to store a computer program; and a processor configured to execute the computer program, wherein the computer program, when executed by the processor, implements the steps of the monitoring method for the satellite controller as described in claim 20.
23. A non-volatile storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the monitoring method for the satellite controller as described in claim 20.
24. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the monitoring method for the satellite controller as described in claim 20.
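
For illustration only, the threshold-based detection recited in claims 1 to 6 and 9 may be sketched as follows. This is a minimal sketch under stated assumptions and does not limit the claims: the symbolic completion code names, the threshold values (5 and 3), the use of one counter per abnormality category, the data-length check standing in for the abnormal task execution result checks, and the names `Response`, `SatelliteMonitor`, `classify`, and `observe` are all hypothetical and do not appear in this application.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical symbolic completion codes. The claims state what each code
# means (illegal command, truncated data, timeout, ...), not its value.
CONTROLLER_SIDE_CODES = {"ILLEGAL_COMMAND", "DATA_TRUNCATED"}
SATELLITE_SIDE_CODES = {"TIMEOUT", "NO_COMMAND_SPACE",
                        "PRIORITY_BLOCKED", "STATE_BLOCKED"}


@dataclass
class Response:
    """Response information carried by a response command (assumed shape)."""
    completion_code: str = "OK"
    data: Optional[bytes] = None


@dataclass
class SatelliteMonitor:
    expected_length: int  # assumed length of the target information
    # Assumed values; the controller-side threshold is the larger one (claim 2).
    thresholds: dict = field(default_factory=lambda: {
        "controller_side": 5, "satellite_side": 3, "bad_result": 3})
    counts: Counter = field(default_factory=Counter)

    def classify(self, resp: Response) -> Optional[str]:
        """Return the abnormality category of a response, or None if normal."""
        if resp.completion_code in CONTROLLER_SIDE_CODES:
            return "controller_side"
        if resp.completion_code in SATELLITE_SIDE_CODES:
            return "satellite_side"
        # Missing data or wrong length stands in for the result checks.
        if resp.data is None or len(resp.data) != self.expected_length:
            return "bad_result"
        return None

    def observe(self, resp: Response) -> str:
        """Accumulate abnormal detections and return the action to take."""
        kind = self.classify(resp)
        if kind is None:
            return "ok"
        self.counts[kind] += 1  # cumulative; not reset by normal responses
        if self.counts[kind] < self.thresholds[kind]:
            # Below threshold: resend, adjusting the format for
            # controller-side codes (claims 5 and 9).
            return "adjust_and_resend" if kind == "controller_side" else "resend"
        if kind == "controller_side":
            return "log_controller_error"  # claim 6
        return "satellite_abnormal"        # claim 1


monitor = SatelliteMonitor(expected_length=2)
for resp in (Response("OK", b"\x00\x01"), Response("TIMEOUT"),
             Response("TIMEOUT"), Response("TIMEOUT")):
    print(monitor.observe(resp))
```

In this sketch the third timeout reaches the assumed satellite-side threshold of 3, so the last call reports `satellite_abnormal`. Giving controller-side codes the larger threshold reflects claim 2: such codes point to a malformed request rather than to a satellite controller fault, so more retries with an adjusted command format are allowed before an error log is recorded.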