Fault diagnosis method, device, computer equipment, storage medium and system
By acquiring and analyzing the register data of the OCP network card under the server operating system, the problem of BMC's inability to detect UCE and CE faults was solved, enabling more accurate and efficient fault diagnosis and reducing the risk of server downtime.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2023-06-13
- Publication Date
- 2026-06-30
AI Technical Summary
In traditional servers, the BMC cannot obtain register data related to fault types such as UCE and CE of the OCP network card, so many fault types go undetected and can cause the server system to crash.
The register data of the target network card is acquired under the server operating system, and its data bits are analyzed to determine the fault diagnosis result. The method includes collecting operating data, storing register data in the baseboard management controller, and using an interactive document for data transmission to improve security and efficiency.
It improves the accuracy and efficiency of fault diagnosis, reduces the probability of server system downtime, and ensures timely detection and management of target network card faults.
Figure CN116610481B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of communication technology, and more specifically to fault diagnosis methods, devices, computer equipment, storage media, and systems.
Background Technology
[0002] Traditional servers inevitably experience various equipment failures, such as network card failures, memory failures, PCIe (Peripheral Component Interconnect Express) bus hangs, GPU (Graphics Processing Unit) failures, and OCP (Open Compute Project) network card failures. Ultimately, equipment failures may cause the server system to crash.
[0003] In related technologies, the BMC (Baseboard Management Controller) typically obtains device status information of internal server devices based on MCTP (Management Component Transport Protocol) and acquires CPU (Central Processing Unit) register data out-of-band via PECI (Platform Environment Control Interface). It then performs fault diagnosis based on the device status information and CPU register data. Out-of-band access refers to the terminal remotely accessing the server via standard commands. However, the BMC cannot obtain register data related to fault types such as UCE (uncorrectable error) and CE (corrected error) from internal server devices (e.g., OCP network cards). Therefore, many fault types of these devices may go undetected, leading to server system crashes.
Summary of the Invention
[0004] In view of this, the present invention provides a fault diagnosis method, apparatus, computer equipment, storage medium and system, which can avoid the situation where many fault types in OCP network cards cannot be detected, thereby reducing the probability of server system downtime.
[0005] In a first aspect, the present invention provides a fault diagnosis method, comprising:
[0006] Send a data acquisition command to the server under test to obtain the register data of the target network card in the server under test. The register data of the target network card is collected under the operating system of the server under test.
[0007] Analyze the data bits in the register data to determine the fault diagnosis results of the target network card.
[0008] The fault diagnosis method provided in this invention allows the terminal to send a data acquisition command to the server under test, enabling it to acquire the register data of the target network card in the server under test in-band. This register data is collected under the operating system of the server under test. Then, by analyzing the data bits in the register data, the fault diagnosis result of the target network card is determined. In other words, the terminal can indirectly acquire the register data of the target network card under the operating system of the server under test, and analyze the data bits in the register data to determine the fault diagnosis result of the target network card. This avoids the situation where many fault types of OCP network cards cannot be detected, thereby reducing the probability of server system downtime.
[0009] In some optional implementations, the data bits in the register data are analyzed to determine the fault diagnosis result of the target network card, including:
[0010] Analyze the data bits in the register data to determine whether the first register data bit in the data bits is valid;
[0011] When the first register data bit is determined to be valid, the fault diagnosis result of the target network card is determined based on the second register data bit in the data bits.
[0012] The fault diagnosis method provided in this invention analyzes the data bits in the register data to determine whether the first register data bit is valid, and thereby whether the component of the target network card corresponding to the first register data bit is faulty. When the first register data bit is determined to be valid, that component is faulty, and the fault diagnosis result of the target network card can then be determined according to the second register data bit in the data bits, which improves the accuracy of fault location for the target network card. It also helps testing personnel manage the target network card according to the fault diagnosis result, reducing the probability of server system downtime.
[0013] In some optional implementations, there are multiple second register data bits. The fault diagnosis result of the target network card is determined based on the second register data bits, including:
[0014] Determine from among multiple second register data bits whether there is a set register data bit;
[0015] When it is determined that there is a set register data bit, the set register data bit is used as the target register data bit;
[0016] Based on the target register data bits, the fault diagnosis result of the target network card corresponding to the target register data bits is determined from the preset fault diagnosis list. The fault diagnosis list includes the correspondence between candidate register data bits and candidate diagnosis results.
[0017] The fault diagnosis method provided in this invention determines whether each part of the faulty component in the target network card is faulty by checking whether any of the multiple second register data bits is set. The target register data bit is then used to look up the corresponding fault diagnosis result of the target network card in a preset fault diagnosis list, which can improve the diagnostic efficiency for the target network card.
[0018] In some alternative implementations, register data is collected in the following manner:
[0019] Under the server operating system, obtain the operating data of the target network card in the server to be tested;
[0020] If there are abnormalities in the running data, collect the register data of the target network card;
[0021] Register data is stored in the baseboard management controller.
[0022] The fault diagnosis method provided in this invention allows the terminal to automatically acquire the operating data of the target network card in the server under test under the server operating system. The terminal can detect whether the target network card is faulty by using the operating data of the target network card. After determining that the target network card is faulty, the terminal can further determine the specific faulty component in the target network card by collecting the register data of the target network card, thereby improving the accuracy of fault diagnosis.
[0023] In some alternative implementations, register data is stored in the baseboard management controller, including:
[0024] Based on the interactive document, the script in the server under test is called to store the register data to the baseboard management controller. The interactive document stores the protocol agreed upon by the script and the baseboard management controller for data transmission.
[0025] The fault diagnosis method provided in this invention improves data transmission security by using an interactive document for data interaction. Moreover, because the register data is stored in the baseboard management controller in advance, it can be collected directly from the baseboard management controller when an anomaly of the target network card is detected, thereby improving data processing efficiency.
[0026] In some optional implementations, a data acquisition command is sent to the server under test to obtain the register data of the target network interface card in the server under test, including:
[0027] A data acquisition command is sent to the server under test. The data acquisition command calls the script in the server under test so that the script can obtain the register data of the target network card in the server under test from the baseboard management controller based on the interactive document.
[0028] The fault diagnosis method provided in this invention can ensure the normal collection of register data of the target network card under the operating system of the server under test. Data interaction through the interactive document can improve the security of data transmission, and directly obtaining the register data of the target network card from the baseboard management controller can improve the efficiency of data collection.
[0029] In some alternative implementations, register data is stored in the baseboard management controller, including:
[0030] Generate a diagnostic log based on register data;
[0031] The diagnostic logs are stored in the baseboard management controller.
[0032] The fault diagnosis method provided in this embodiment of the invention generates a diagnostic log from register data and stores the diagnostic log in the baseboard management controller, which facilitates the subsequent direct acquisition of the corresponding register data by the terminal and improves the efficiency of register data collection.
[0033] In a second aspect, the present invention provides a fault diagnosis device, comprising:
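The log generation step can be illustrated with a minimal Python sketch. The source does not specify a log format, so the JSON layout, the field names, and the register name `AER_UNCORR_STATUS` used in the example are all assumptions made for illustration only.

```python
import json


def build_diagnostic_log(register_data: dict[str, int], collected_at: str) -> str:
    """Serialize collected register values into one diagnostic log entry.

    The field names ("collected_at", "registers") are hypothetical; the
    patent text does not define the log layout.
    """
    entry = {
        "collected_at": collected_at,
        # Store each register as a fixed-width hex string so that
        # individual data bits are easy to read back later.
        "registers": {
            name: f"0x{value:08X}" for name, value in sorted(register_data.items())
        },
    }
    return json.dumps(entry, sort_keys=True)


# Example with a hypothetical register name and value.
log_line = build_diagnostic_log({"AER_UNCORR_STATUS": 0x4000}, "2023-06-13T00:00:00Z")
print(log_line)
```

Storing hex strings rather than decoded fault names keeps the log independent of the fault diagnosis list, so the terminal can re-analyze old entries if the list is updated.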
[0034] The sending module is used to send a data acquisition command to the server under test in order to obtain the register data of the target network card in the server under test. The register data of the target network card is collected under the operating system of the server under test.
[0035] The diagnostic result acquisition module is used to analyze the data bits in the register data to determine the fault diagnosis result of the target network card.
[0036] In some optional implementations, the diagnostic result acquisition module specifically includes:
[0037] The first determining unit is used to analyze the data bits in the register data to determine whether the first register data bit in the data bits is valid;
[0038] The second determining unit is used to determine the fault diagnosis result of the target network card based on the second register data bit in the data bit when the first register data bit is determined to be valid.
[0039] In some optional implementations, there are multiple second register data bits, and the second determining unit specifically includes:
[0040] The third determining subunit is used to determine from multiple second register data bits whether there is a set register data bit;
[0041] The register data bit determination subunit is used to take the set register data bit as the target register data bit when it is determined that there is a set register data bit;
[0042] The diagnosis result determination subunit is used to determine the fault diagnosis result of the target network card corresponding to the target register data bit from a preset fault diagnosis list based on the target register data bit. The fault diagnosis list includes the correspondence between candidate register data bits and candidate diagnosis results.
[0043] In some optional implementations, the sending module further includes:
[0044] The acquisition unit is used to acquire the operating data of the target network card in the server under the server operating system.
[0045] The collection unit is used to collect register data of the target network card if there are abnormalities in the running data;
[0046] The storage unit is used to store register data to the baseboard management controller.
[0047] In some alternative implementations, the storage unit specifically includes:
[0048] The first storage subunit is used to call the script in the server to be tested to store register data to the baseboard management controller based on the interactive document. The interactive document stores the protocol agreed upon by the script and the baseboard management controller for data transmission.
[0049] In some optional implementations, the sending module further includes:
[0050] The sending unit is used to send a data acquisition command to the server under test. The data acquisition command calls the script in the server under test so that the script can obtain the register data bits of the target network card in the server under test from the baseboard management controller based on the interactive document.
[0051] In some alternative implementations, the storage unit further includes:
[0052] The log generation subunit is used to generate diagnostic logs based on register data;
[0053] The second storage subunit is used to store diagnostic logs to the baseboard management controller.
[0054] In a third aspect, the present invention provides a computer device, comprising: a memory and a processor, wherein the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes the computer instructions to perform the fault diagnosis method of the first aspect or any corresponding embodiment described above.
[0055] In a fourth aspect, the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to perform the fault diagnosis method of the first aspect or any corresponding embodiment thereof.
Attached Figure Description
[0056] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0057] Figure 1 is a schematic diagram of an optional fault diagnosis system provided in an embodiment of this application;
[0058] Figure 2 is a flowchart of a fault diagnosis method according to an embodiment of the present invention;
[0059] Figure 3 is a flowchart of another fault diagnosis method according to an embodiment of the present invention;
[0060] Figure 4 is a schematic diagram of a fault diagnosis device provided in an embodiment of this application;
[0061] Figure 5 is a schematic diagram of the structure of an optional computer device provided in an embodiment of the present invention.
Detailed Implementation
[0062] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0063] Referring to Figure 1, Figure 1 is a schematic diagram of an optional fault diagnosis system provided in an embodiment of this application. The fault diagnosis system includes a terminal 101 and a server 102, wherein the terminal 101 and the server 102 are connected through a communication network.
[0064] For example, terminal 101 sends a data acquisition command to server 102 to obtain the register data of the target network card in server 102. The register data of the target network card is collected under the operating system of server 102 and stored in the baseboard management controller. Then, terminal 101 analyzes the data bits in the register data to determine the fault diagnosis result of the target network card, so that subsequent testing personnel can repair or replace the target network card according to the fault diagnosis result of terminal 101, thereby reducing the probability of server 102 system downtime.
[0065] There can be multiple servers 102, and each server 102 is connected to the terminal 101. The specific implementation method of the terminal 101 detecting multiple servers 102 is the same as the method of the terminal 101 detecting one server 102 described above, and will not be repeated here.
[0066] Server 102 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. Additionally, the server can also be a node server in a blockchain network.
[0067] Terminal 101 can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, vehicle terminal, etc., but is not limited to these. The terminal and server can be directly or indirectly connected via wired or wireless communication, and this application embodiment does not impose any limitations on this.
[0068] In related technologies, the BMC typically relies on MCTP (Management Component Transport Protocol) to obtain device status information of internal server devices, acquires CPU register data out-of-band via PECI (Platform Environment Control Interface), and performs fault diagnosis based on the device status information and CPU register data. Out-of-band refers to the terminal remotely accessing the server via standard commands. However, the BMC cannot obtain register data related to fault types such as UCE and CE from internal server devices (e.g., OCP network cards). Therefore, many fault types of the server's internal devices may go undetected, leading to server system crashes.
[0069] Based on this, embodiments of the present invention provide a fault diagnosis method, apparatus, computer equipment, storage medium, and system, which can avoid the situation where many fault types in OCP network cards cannot be detected, thereby reducing the probability of server system downtime.
[0070] According to an embodiment of the present invention, a fault diagnosis method embodiment is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.
[0071] This embodiment provides a fault diagnosis method that can be used in the aforementioned terminals, such as mobile phones and tablet computers. Figure 2 is a flowchart of a fault diagnosis method according to an embodiment of the present invention. As shown in Figure 2, the process includes the following steps:
[0072] Step S201: Send a data acquisition command to the server under test to obtain the register data of the target network card in the server under test.
[0073] In this step, the register data of the target network card is collected under the operating system of the server being tested. The target network card can be an OCP network card or other types of network cards, which will not be listed here.
[0074] The data acquisition command instructs the server under test to send the target network card's register data to the terminal. This register data includes AER (Advanced Error Reporting) and MCA (Machine Check Architecture) data, both of which contain the target network card's operating data. The operating data includes the UCE count in the L1 cache, the UCE count in the L2 cache, fault information for the Arithmetic Logic Unit (ALU) and Floating Point Unit (FPU), and UCE and CE information for the memory (MEM). L1 refers to Level 1, and L2 to Level 2. The L2 cache is slower than the L1 cache but has a larger storage capacity. The UCE count indicates the number of times a UCE has been triggered. MEM here denotes the memory.
[0075] In some optional implementations, the specific method for collecting register data can be to obtain the operating data of the target network card in the server under test under the server operating system. If there is an anomaly in the operating data, the register data of the target network card is collected, and finally the register data is stored in the baseboard management controller. That is to say, the terminal can automatically obtain the operating data of the target network card in the server under test under the server operating system, detect whether the target network card is faulty through the operating data of the target network card, and after determining that the target network card is faulty, further determine the specific faulty component in the target network card by collecting the register data of the target network card, thereby improving the accuracy of fault diagnosis.
[0076] In some optional implementations, the method for determining that the operating data is abnormal can be as follows: if the fault corresponding to the operating data is triggered and the number of times it is triggered is greater than or equal to a preset threshold, then the operating data is determined to be abnormal. For example, taking the operating data as UCE and CE of memory MEM as an example, if the faults corresponding to UCE and CE are both triggered, and the number of times the fault corresponding to UCE is triggered is greater than or equal to a first preset threshold, and the number of times the fault corresponding to CE is triggered is greater than or equal to a second preset threshold, then it indicates that the operating data is abnormal. In this case, the register data of the target network card can be collected. Alternatively, if the fault corresponding to UCE is triggered and the number of times the fault corresponding to UCE is triggered is greater than or equal to the first preset threshold, then it indicates that the operating data is abnormal. In this case, the register data of the target network card can be collected. Alternatively, if the fault corresponding to CE is triggered and the number of times the fault corresponding to CE is triggered is greater than or equal to the second preset threshold, then it indicates that the operating data is abnormal. In this case, the register data of the target network card can be collected. The first and second preset thresholds can be set according to the actual situation and are not specifically limited here.
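The threshold check described above can be sketched in Python as follows. The function and parameter names are illustrative assumptions, and the sketch treats "triggered" as a count greater than zero. The three cases listed in the paragraph (both UCE and CE, UCE only, CE only) reduce to a logical OR of the two per-fault checks.

```python
def is_operating_data_abnormal(
    uce_count: int,
    ce_count: int,
    uce_threshold: int,
    ce_threshold: int,
) -> bool:
    """Return True when the operating data should be treated as abnormal.

    uce_threshold and ce_threshold stand for the first and second preset
    thresholds; their values are chosen by the operator.
    """
    # A fault must have been triggered at least once and must have
    # reached its preset threshold.
    uce_abnormal = uce_count > 0 and uce_count >= uce_threshold
    ce_abnormal = ce_count > 0 and ce_count >= ce_threshold
    # Either condition alone is enough to start register collection.
    return uce_abnormal or ce_abnormal


# Example with hypothetical thresholds: 3 UCE events or 5 CE events.
print(is_operating_data_abnormal(uce_count=3, ce_count=0, uce_threshold=3, ce_threshold=5))
print(is_operating_data_abnormal(uce_count=2, ce_count=4, uce_threshold=3, ce_threshold=5))
```

When the function returns True, the register data of the target network card would be collected as the next step.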
[0077] In some optional implementations, register data is stored in the baseboard management controller. Specifically, based on an interactive document, a script in the server under test is invoked to store the register data in the baseboard management controller. The interactive document stores a protocol agreed upon by the script and the baseboard management controller for data transmission. This protocol can be configured according to actual needs and is not specifically limited here. In this embodiment, data interaction via an interactive document improves data transmission security. Furthermore, because the register data is stored in the baseboard management controller in advance, it can be collected directly from the baseboard management controller when an anomaly is detected in the target network card, thus improving data processing efficiency.
[0078] In some optional implementations, the server under test (DUT) is communicatively connected to the terminal. The DUT stores register data in the baseboard management controller (BMC). The terminal can send data acquisition commands to the DUT in real time to obtain the register data of the target network interface card (NIC) in the BMC. The register data of the target NIC is collected under the operating system of the DUT, facilitating subsequent fault diagnosis of the target NIC based on this data. It is understood that, compared to out-of-band acquisition of register data in related technologies, this embodiment can collect the register data of the target NIC under the operating system of the DUT, thus avoiding the inability to perform fault detection on the internal devices of the server due to the inability to obtain their register data.
[0079] In some optional implementations, after sending a data acquisition command to the server under test, the terminal can call a script on the server under test through the data acquisition command under the operating system of the server under test. This allows the script to obtain the register data of the target network card in the server under test from the baseboard management controller based on the interactive document. In this way, the normal collection of the register data of the target network card can be ensured under the operating system of the server under test. Moreover, data interaction through the interactive document can improve the security of data transmission.
[0080] Step S202: Analyze the data bits in the register data to determine the fault diagnosis result of the target network card.
[0081] In this step, there are multiple data bits to be analyzed. The specific data bits to be analyzed can be determined according to the actual situation. For example, it can be determined according to the type of the target network card, the data type of the register data, the fault type, or the type of the server, etc.
[0082] Specifically, analyzing the data bits in the register data to determine the fault diagnosis result of the target network card can proceed as follows. First, determine whether the corresponding data bit is valid. After confirming that it is valid, determine whether any other data bits are set. A set data bit indicates that the component of the target network card corresponding to that bit has a fault. The corresponding fault diagnosis result can then be found in the fault diagnosis list based on the set data bit.
[0083] The fault diagnosis method provided in this embodiment allows the terminal to send a data acquisition command to the server under test, enabling it to acquire the register data of the target network card in the server under test in-band. The register data of the target network card is collected under the operating system of the server under test. Then, by analyzing the data bits in the register data, the fault diagnosis result of the target network card is determined. In other words, the terminal can indirectly acquire the register data of the target network card under the operating system of the server under test, and analyze the data bits in the register data to determine the fault diagnosis result of the target network card. This avoids the situation where many fault types of OCP network cards cannot be detected, thereby reducing the probability of server system downtime.
[0084] This embodiment provides a fault diagnosis method that can be used in the aforementioned terminals, such as mobile phones and tablet computers. Figure 3 is a flowchart of a fault diagnosis method according to an embodiment of the present invention. As shown in Figure 3, the process includes the following steps:
[0085] Step S301: Send a data acquisition command to the server under test to obtain the register data of the target network card in the server under test.
[0086] For details, see step S201 of the embodiment shown in Figure 2, which will not be described again here.
[0087] Step S302: Analyze the data bits in the register data to determine the fault diagnosis result of the target network card.
[0088] Specifically, step S302 includes:
[0089] Step S3021: Analyze the data bits in the register data to determine whether the first register data bit in the data bits is valid.
[0090] In this step, the data bits in the register data include a first register data bit and a second register data bit. The first register data bit is used to indicate the specific component of the target network card that is malfunctioning, and the second register data bit is used to indicate the specific location of the malfunctioning component in the target network card. Different register data correspond to different first register data bits and different second register data bits.
[0091] It is understandable that by determining whether the first register data bit in the data bits is valid, it is possible to determine whether the component corresponding to the target network card is faulty. If the first register data bit is valid, it indicates that the component corresponding to the target network card is faulty; conversely, if the first register data bit is invalid, it indicates that the component corresponding to the target network card is not faulty.
[0092] Step S3022: When the first register data bit is determined to be valid, the fault diagnosis result of the target network card is determined according to the second register data bit in the data bits.
[0093] In this step, the first register data bit is valid, meaning it is set. The second register data bit refers to the register data bits other than the first register data bit. The fault diagnosis result is used to characterize the faulty component of the target network card and the corresponding severity of the fault.
[0094] In some optional implementations, when abnormalities exist in the operating data, i.e., when a fault corresponding to the operating data is triggered and the number of triggers is greater than or equal to a preset threshold, the first register data bit corresponding to the operating data in the register data is determined to be valid. At this time, the first register data bit is set. That is to say, when a component in the target network card malfunctions, the first register data bit corresponding to that component will be set. The preset threshold can be set according to the actual situation.
[0095] In some optional implementations, after obtaining the fault diagnosis results, the target network card can be repaired or replaced based on the fault diagnosis results, which can be determined according to the actual situation.
[0096] In this embodiment, by analyzing the data bits in the register data, it is determined whether the first register data bit in the data bits is valid, and then it is determined whether the component corresponding to the first register data bit in the target network card is faulty. When it is determined that the first register data bit is valid, it can be determined that the component corresponding to the first register data bit in the target network card is faulty. At this time, the fault diagnosis result of the target network card can be determined according to the second register data bit in the data bits, thereby improving the accuracy of fault location of the target network card. At the same time, it is beneficial for the testing personnel to manage the target network card according to the fault diagnosis result, so as to reduce the probability of server system downtime.
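Steps S3021 and S3022 amount to a two-stage bit test, which the following Python sketch illustrates. The bit positions are assumptions: the source does not say which bit of a register is the first register data bit or which bits are second register data bits, so bit 31 and bits 0 to 3 are hypothetical.

```python
def find_set_second_bits(
    register_value: int,
    first_bit: int,
    second_bits: tuple[int, ...],
) -> list[int]:
    """Return the positions of set second register data bits.

    An empty list means either the first register data bit is not valid
    (no fault reported) or no second register data bit is set.
    """
    # Step S3021: the first register data bit is valid when it is set.
    if not (register_value >> first_bit) & 1:
        return []
    # Step S3022: inspect only the second register data bits.
    return [bit for bit in second_bits if (register_value >> bit) & 1]


# Hypothetical layout: bit 31 is the first register data bit,
# bits 0 to 3 are the second register data bits.
FIRST_BIT = 31
SECOND_BITS = (0, 1, 2, 3)

value = (1 << FIRST_BIT) | (1 << 2)
print(find_set_second_bits(value, FIRST_BIT, SECOND_BITS))  # prints [2]
```

Checking the first bit before anything else means stale second bits are ignored whenever the register does not currently report a fault.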
[0097] In some alternative implementations, when there are multiple second register data bits, step S3022 includes:
[0098] Step a1: Determine from the multiple second register data bits whether there is a set register data bit.
[0099] The register data includes a first register data bit and multiple second register data bits. Different first register data bits correspond to different components of the target network card, and different second register data bits correspond to different parts of the faulty component. Therefore, by determining whether any of the multiple second register data bits is set, it is possible to determine whether each part of each component in the target network card has a fault.
[0100] Step a2: When it is determined that there is a set register data bit, the set register data bit is used as the target register data bit.
[0101] In this step, the target register data bit is a valid (set) second register data bit, which indicates that the part corresponding to the target register data bit has a fault.
[0102] Step a3: Based on the target register data bits, determine the fault diagnosis result of the target network card corresponding to the target register data bits from the preset fault diagnosis list.
[0103] In this step, the fault diagnosis list includes the correspondence between candidate register data bits and candidate diagnosis results. There are multiple candidate register data bits and multiple candidate diagnosis results. The number of candidate register data bits is equal to the number of candidate diagnosis results.
[0104] In some optional implementations, a preset fault diagnosis list can be updated in real time. This preset fault diagnosis list is formed based on specific faults and fault types (such as UCE, CE) of various parts of the target network card. Additionally, candidate register data bits can be obtained based on the register data bits corresponding to components that have been diagnosed or failed, and candidate diagnostic results can be obtained based on historical diagnostic results obtained from the diagnosed or failed register data bits.
[0105] For example, assuming the register data is MCerrlogReg, the first register data bit in this register data is the eighth data bit, i.e., MCerrlogReg Bit8, and the field corresponding to this first register data bit is FirstMCerrSrcFromCore. Taking the running data as UCE and CE as an example, when the faults corresponding to UCE and CE are both triggered, and the number of times the fault corresponding to UCE is triggered is greater than or equal to a first preset threshold, and the number of times the fault corresponding to CE is triggered is greater than or equal to a second preset threshold, the field corresponding to the first register data bit, FirstMCerrSrcFromCore, is valid, that is, the first register data bit is set, indicating that the core of the network card's internal chip has a fault. At this time, the specific part of the core of the network card's internal chip can be located based on the second register data bit and the preset fault diagnosis list.
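The example above can be sketched as follows, purely for illustration. The sketch assumes that "Bit8" denotes zero-based bit index 8 of an integer MCerrlogReg value and that "triggered" means a trigger count greater than zero; the function names and the threshold values used in the usage note are hypothetical.

```python
FIRST_MCERR_SRC_FROM_CORE_BIT = 8  # MCerrlogReg Bit8 (assumed zero-based index)


def thresholds_reached(uce_count: int, ce_count: int,
                       uce_threshold: int, ce_threshold: int) -> bool:
    """True when both UCE and CE faults were triggered and each trigger
    count is greater than or equal to its own preset threshold."""
    both_triggered = uce_count > 0 and ce_count > 0
    return both_triggered and uce_count >= uce_threshold and ce_count >= ce_threshold


def core_fault_flagged(mcerrlog_reg: int) -> bool:
    """True when the FirstMCerrSrcFromCore field (Bit8) is set."""
    return bool(mcerrlog_reg & (1 << FIRST_MCERR_SRC_FROM_CORE_BIT))
```

For instance, with hypothetical thresholds of 1 for UCE and 10 for CE, `thresholds_reached(3, 10, 1, 10)` holds, and a register value of `0x100` has Bit8 set, so `core_fault_flagged(0x100)` holds as well.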
[0106] In this embodiment of the application, by determining whether there is a set register data bit among multiple second register data bits, it is determined whether there is a fault in each part of each component in the target network card; then, by using the target register data bit, the fault diagnosis result of the target network card corresponding to the target register data bit is determined from a preset fault diagnosis list, which can improve the diagnostic efficiency of the target network card.
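Steps a1 to a3 can be sketched as follows, as a minimal illustration only. The contents of the fault diagnosis list, the bit positions, and the function name are hypothetical placeholders; the actual correspondence between candidate register data bits and candidate diagnosis results is not given in this application.

```python
from typing import Dict, List

# Hypothetical preset fault diagnosis list:
# candidate register data bit -> candidate diagnosis result.
FAULT_DIAGNOSIS_LIST: Dict[int, str] = {
    0: "part A: uncorrectable error (UCE)",
    1: "part A: corrected error (CE)",
    2: "part B: uncorrectable error (UCE)",
}


def diagnose(register_value: int, second_bits: List[int],
             fault_list: Dict[int, str] = FAULT_DIAGNOSIS_LIST) -> Dict[int, str]:
    # Step a1: determine which second register data bits are set.
    # Step a2: each set bit becomes a target register data bit.
    target_bits = [b for b in second_bits if (register_value >> b) & 1]
    # Step a3: look up each target bit in the preset fault diagnosis list.
    return {b: fault_list.get(b, "unlisted fault") for b in target_bits}
```

A dictionary keyed by bit position makes the lookup in step a3 a constant-time operation and allows the list to be updated without changing the scanning logic.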
[0107] In some optional implementations, diagnostic logs can be generated based on register data and then stored in the baseboard management controller. This allows the terminal to directly obtain the corresponding register data and improves the efficiency of register data collection.
[0108] To illustrate the register data collection process more clearly, a specific example will be provided below.
[0109] Under the operating system of the server under test, the terminal executes the script in the server under test in-band. The script determines whether there are any abnormalities in the operating data of the target network card in the server under test. When it is determined that there are abnormalities in the operating data, the terminal automatically collects the register data of the target network card through the collection module of the script. After the register data collection is completed, the terminal calls the sending module of the script to transmit the register data of the target network card to the BMC. The transmission of register data between the script and the BMC is based on an interactive document, thereby ensuring that the BMC accurately identifies the corresponding register data.
[0110] The BMC contains a register data receiving module. After receiving the register data, the register data receiving module uses the register data to update the diagnostic log. The diagnostic log stores the data that testers use for debugging, as well as the register data of the target network card. The testers can obtain the register data of the target network card directly from the diagnostic log on the terminal.
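The exchange between the script's sending module and the BMC's register data receiving module can be sketched as follows. This is only an assumed model: the application does not disclose the contents of the interactive document, so the JSON encoding, the field names, the version number, and both function names are hypothetical, and an in-memory list stands in for the diagnostic log.

```python
import json
from typing import Dict, List

# Hypothetical "interactive document": the format agreed by script and BMC.
INTERACTIVE_DOCUMENT = {
    "version": 1,
    "fields": ["card_id", "register_name", "value"],
}


def encode_register_data(card_id: str, register_name: str, value: int) -> str:
    """Script side: serialize one register reading in the agreed format."""
    message = {
        "version": INTERACTIVE_DOCUMENT["version"],
        "card_id": card_id,
        "register_name": register_name,
        "value": value,
    }
    return json.dumps(message)


def receive_register_data(payload: str, diagnostic_log: List[Dict]) -> Dict:
    """BMC side: validate the payload against the agreed format, then
    append the reading to the diagnostic log."""
    message = json.loads(payload)
    if message.get("version") != INTERACTIVE_DOCUMENT["version"]:
        raise ValueError("unsupported protocol version")
    missing = [f for f in INTERACTIVE_DOCUMENT["fields"] if f not in message]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    entry = {f: message[f] for f in INTERACTIVE_DOCUMENT["fields"]}
    diagnostic_log.append(entry)
    return entry
```

Validating against a shared description of the format is one way the receiving side could reject data it cannot identify, which corresponds to the stated purpose of the interactive document.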
[0111] In some optional implementations, the register data also includes data such as the presence status, temperature, voltage, and current of the target network card. For example, the voltage and current of the target network card can be used to determine whether the target network card is overheating or experiencing overcurrent. Specifically, when the current of the target network card is greater than or equal to a preset current threshold, it is determined that the target network card has an overcurrent problem; when the voltage of the target network card is greater than or equal to a preset voltage threshold, and / or when the current of the target network card is greater than or equal to the preset current threshold, it is determined that the target network card has an overheat problem.
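The threshold comparisons in this paragraph can be sketched as follows. The sketch reads "and / or" as an inclusive OR; the function name is hypothetical, and the units and threshold values are left to the caller because they are not specified in this application.

```python
from typing import Dict


def check_power_state(voltage: float, current: float,
                      voltage_threshold: float,
                      current_threshold: float) -> Dict[str, bool]:
    """Compare voltage and current readings against preset thresholds."""
    # Overcurrent: current is greater than or equal to its threshold.
    overcurrent = current >= current_threshold
    # Overheat: voltage and/or current reaches its threshold (inclusive OR).
    overheat = voltage >= voltage_threshold or current >= current_threshold
    return {"overcurrent": overcurrent, "overheat": overheat}
```

Under this reading, any overcurrent condition also implies an overheat determination, since the current comparison appears in both rules.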
[0112] Referring to Figure 4, an embodiment of this application also provides a fault diagnosis device, comprising:
[0113] The sending module 401 is used to send a data acquisition command to the server under test in order to obtain the register data of the target network card in the server under test. The register data of the target network card is collected under the operating system of the server under test.
[0114] The diagnostic result acquisition module 402 is used to analyze the data bits in the register data to determine the fault diagnosis result of the target network card.
[0115] In some optional implementations, the diagnostic result acquisition module 402 specifically includes:
[0116] The first determining unit is used to analyze the data bits in the register data to determine whether the first register data bit in the data bits is valid;
[0117] The second determining unit is used to determine the fault diagnosis result of the target network card based on the second register data bit in the data bit when the first register data bit is determined to be valid.
[0118] In some optional implementations, there are multiple second register data bits, and the second determining unit specifically includes:
[0119] The third determining subunit is used to determine from multiple second register data bits whether there is a set register data bit;
[0120] The register data bit determination subunit is used to take the set register data bit as the target register data bit when it is determined that there is a set register data bit;
[0121] The diagnosis result determination subunit is used to determine the fault diagnosis result of the target network card corresponding to the target register data bit from a preset fault diagnosis list based on the target register data bit. The fault diagnosis list includes the correspondence between candidate register data bits and candidate diagnosis results.
[0122] In some optional implementations, the sending module 401 further includes:
[0123] The acquisition unit is used to acquire the operating data of the target network card in the server under the server operating system.
[0124] The collection unit is used to collect register data of the target network card if there are abnormalities in the running data;
[0125] The storage unit is used to store register data to the baseboard management controller.
[0126] In some alternative implementations, the storage unit specifically includes:
[0127] The first storage subunit is used to call the script in the server under test to store register data to the baseboard management controller based on the interactive document. The interactive document stores the protocol agreed upon by the script and the baseboard management controller for data transmission.
[0128] In some optional implementations, the sending module 401 further includes:
[0129] The sending unit is used to send a data acquisition command to the server under test. The data acquisition command calls the script in the server under test so that the script can obtain the register data of the target network card in the server under test from the baseboard management controller based on the interactive document.
[0130] In some alternative implementations, the storage unit further includes:
[0131] The log generation subunit is used to generate diagnostic logs based on register data;
[0132] The second storage subunit is used to store diagnostic logs to the baseboard management controller.
[0133] In this embodiment, the fault diagnosis device is presented in the form of functional modules. Here, a module refers to an application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, and / or other devices that can provide the above functions.
[0134] Further functional descriptions of the above modules and units are the same as those in the corresponding embodiments described above, and will not be repeated here.
[0135] The aforementioned fault diagnosis device and method are based on the same inventive concept. The terminal can send a data acquisition command to the server under test and can acquire the register data of the target network card in the server under test in-band. The register data of the target network card is collected under the operating system of the server under test. Then, by analyzing the data bits in the register data, the fault diagnosis result of the target network card is determined. That is to say, the terminal can indirectly acquire the register data of the target network card under the operating system of the server under test and analyze the data bits in the register data to determine the fault diagnosis result of the target network card. This avoids the situation where many fault types corresponding to OCP network cards cannot be detected, thereby reducing the probability of server system downtime.
[0136] This invention also provides a computer device having the fault diagnosis device shown in Figure 4 above.
[0137] Please refer to Figure 5, which is a schematic diagram of the structure of a computer device provided in an optional embodiment of the present invention. As shown in Figure 5, the computer device includes one or more processors 10, memory 20, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components communicate with each other via different buses and can be mounted on a common motherboard or otherwise installed as needed. The processors can process instructions executed within the computer device, including instructions stored in or on memory to display graphical information of a GUI on external input / output devices (such as display devices coupled to the interfaces). In some alternative implementations, multiple processors and / or multiple buses can be used with multiple memories and multiple memory modules, if desired. Similarly, multiple computer devices can be connected, each providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). Figure 5 takes one processor 10 as an example.
[0138] Processor 10 may be a central processing unit, a network processor, or a combination thereof. Processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The programmable logic device may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
[0139] The memory 20 stores instructions executable by at least one processor 10 to cause at least one processor 10 to perform the method shown in the above embodiments.
[0140] The memory 20 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created through the use of the computer device. Furthermore, the memory 20 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory remotely located relative to the processor 10, which can be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0141] The memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk or solid-state drive; the memory 20 may also include a combination of the above types of memory.
[0142] The computer device also includes an input device 30 and an output device 40. The processor 10, memory 20, input device 30, and output device 40 can be connected via a bus or other means. Figure 5 takes connection via a bus as an example.
[0143] Input device 30 can receive input numerical or character information and generate key signal inputs related to user settings and function control of the computer device. Examples include a touchscreen, keypad, mouse, trackpad, touchpad, one or more mouse buttons, trackball, and joystick. Output device 40 may include display devices, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors). The aforementioned display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays, and plasma displays. In some alternative embodiments, the display device may be a touchscreen.
[0144] This invention also provides a computer-readable storage medium. The methods described above according to embodiments of the invention can be implemented in hardware or firmware, or implemented as computer code that can be recorded on a storage medium, or implemented as computer code downloaded via a network and originally stored on a remote storage medium or a non-transitory machine-readable storage medium and then stored on a local storage medium. Thus, the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, optical disk, read-only memory, random access memory, flash memory, hard disk, or solid-state drive, etc.; further, the storage medium can also include combinations of the above types of memory. It is understood that computers, processors, microprocessor controllers, or programmable hardware include storage components capable of storing or receiving software or computer code, which, when accessed and executed by the computer, processor, or hardware, implements the methods shown in the above embodiments.
[0145] Although embodiments of the invention have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.
Claims
1. A fault diagnosis method, characterized in that the method includes: sending a data acquisition command to the server under test to obtain the register data of the target network card in the server under test, wherein the register data of the target network card is collected under the operating system of the server under test; and analyzing the data bits in the register data to determine the fault diagnosis result of the target network card; wherein analyzing the data bits in the register data to determine the fault diagnosis result of the target network card includes: analyzing the data bits in the register data to determine whether the first register data bit in the data bits is valid; and, once the first register data bit is determined to be valid, determining the fault diagnosis result of the target network card based on the second register data bit in the data bits; wherein there are multiple second register data bits, and determining the fault diagnosis result of the target network card based on the second register data bits includes: determining from among the plurality of second register data bits whether there is a set register data bit; when it is determined that there is a set register data bit, using the set register data bit as the target register data bit; and, based on the target register data bit, determining the fault diagnosis result of the target network card corresponding to the target register data bit from a preset fault diagnosis list, wherein the fault diagnosis list includes the correspondence between candidate register data bits and candidate diagnosis results; wherein when the first register data bit is set, it is determined that the first register data bit is valid; when the first register data bit is valid, it is determined that the component in the target network card corresponding to the first register data bit has a fault; and when the second register data bit is set, it is determined that the part of that component corresponding to the second register data bit has a fault; wherein the operating data of the target network card includes the UCE and CE of the memory (MEM), and sending the data acquisition command to the server under test includes: when both the fault corresponding to UCE and the fault corresponding to CE are triggered, the number of times the fault corresponding to UCE is triggered is greater than or equal to the first preset threshold, and the number of times the fault corresponding to CE is triggered is greater than or equal to the second preset threshold, sending the data acquisition command to the server under test.
2. The method according to claim 1, characterized in that the register data is collected in the following manner: under the server operating system, obtaining the operating data of the target network card in the server under test; if the operating data is abnormal, collecting the register data of the target network card; and storing the register data to the baseboard management controller.
3. The method according to claim 2, characterized in that the step of storing the register data to the baseboard management controller includes: based on the interactive document, invoking the script in the server under test to store the register data to the baseboard management controller, wherein the interactive document stores the protocol agreed upon by the script and the baseboard management controller for data transmission.
4. The method according to claim 3, characterized in that the step of sending a data acquisition command to the server under test to obtain the register data of the target network card in the server under test includes: sending a data acquisition command to the server under test, wherein the data acquisition command calls the script in the server under test so that the script obtains the register data of the target network card in the server under test from the baseboard management controller based on the interactive document.
5. The method according to claim 2 or 3, characterized in that the step of storing the register data to the baseboard management controller includes: generating a diagnostic log based on the register data; and storing the diagnostic log to the baseboard management controller.
6. A fault diagnosis device, characterized in that it includes: a sending module, used to send a data acquisition command to the server under test in order to obtain the register data of the target network card in the server under test, wherein the register data of the target network card is collected under the operating system of the server under test; and a diagnostic result acquisition module, used to analyze the data bits in the register data to determine the fault diagnosis result of the target network card; wherein analyzing the data bits in the register data to determine the fault diagnosis result of the target network card includes: analyzing the data bits in the register data to determine whether the first register data bit in the data bits is valid; and, once the first register data bit is determined to be valid, determining the fault diagnosis result of the target network card based on the second register data bit in the data bits; wherein there are multiple second register data bits, and determining the fault diagnosis result of the target network card based on the second register data bits includes: determining from among the plurality of second register data bits whether there is a set register data bit; when it is determined that there is a set register data bit, using the set register data bit as the target register data bit; and, based on the target register data bit, determining the fault diagnosis result of the target network card corresponding to the target register data bit from a preset fault diagnosis list, wherein the fault diagnosis list includes the correspondence between candidate register data bits and candidate diagnosis results; wherein when the first register data bit is set, it is determined that the first register data bit is valid; when the first register data bit is valid, it is determined that the component in the target network card corresponding to the first register data bit has a fault; and when the second register data bit is set, it is determined that the part of that component corresponding to the second register data bit has a fault; wherein the operating data of the target network card includes the UCE and CE of the memory (MEM), and sending the data acquisition command to the server under test includes: when both the fault corresponding to UCE and the fault corresponding to CE are triggered, the number of times the fault corresponding to UCE is triggered is greater than or equal to the first preset threshold, and the number of times the fault corresponding to CE is triggered is greater than or equal to the second preset threshold, sending the data acquisition command to the server under test.
7. A computer device, characterized in that it includes: a memory and a processor that are communicatively connected, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the fault diagnosis method according to any one of claims 1 to 5.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a computer to perform the fault diagnosis method according to any one of claims 1 to 5.
9. A fault diagnosis system, characterized in that it includes: a terminal and a server, wherein the terminal is used to execute the fault diagnosis method according to any one of claims 1 to 5, and the server is used to collect the register data of the target network card and store it in the baseboard management controller.