Fault processing method and device based on multi-source log data and computer device

By receiving log fault handling requests, querying the log database and business monitoring information, determining the fault location and eliminating it, the problem of low fault location efficiency in multi-terminal processing systems is solved, and fault handling efficiency is improved.

CN119652730B · Active · Publication Date: 2026-06-12 · CHINA TELECOM CLOUD TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TELECOM CLOUD TECH CO LTD
Filing Date
2024-12-02
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In multi-terminal processing business systems, existing technologies suffer from low efficiency in fault location and troubleshooting, requiring significant human and material resources.

Method used

By receiving log fault handling requests, the system determines the target log identifier and collection method, queries the log database, log gateway, and data integration and processing components, obtains log query results, and combines them with business monitoring information to determine the location of the log fault and troubleshoot it.

Benefits of technology

This reduces the number of components that need to be checked and investigated, shortens the troubleshooting process, improves the efficiency of fault location and troubleshooting, and ensures the stability and availability of business systems.

Generated by Eureka AI based on patent content.


Abstract

The application relates to a fault processing method and device based on multi-source log data, a computer device, a storage medium and a program product. The method comprises the following steps: receiving a log fault processing request, determining a target log identifier and a log collection mode corresponding to the log fault processing request, querying at least one of a log database, a log gateway and a data integration processing component according to the target log identifier, obtaining a log query result, and determining a log fault range according to the log query result; if it is determined that the log collection mode is real-time collection, obtaining service monitoring information, determining a log fault position according to the log fault range and the service monitoring information, and performing fault elimination processing based on the log fault position. The method can reduce the link length of log fault elimination processing, and improve the efficiency of fault positioning and elimination processing.

Description

Technical Field

[0001] This application relates to the field of cloud computing technology, and in particular to a fault handling method, apparatus, computer equipment, computer-readable storage medium, and computer program product based on multi-source log data.

Background Technology

[0002] With the development of cloud computing technology, and as business volume and business complexity grow across different business systems, the amount of data and business information involved in business processing has increased. To keep the business data and business information in a business system secure, and to locate and eliminate faults that occur during business processing, the log data that the business system generates during business processing must be collected and stored.

[0003] In traditional technologies, since different business systems usually involve multi-terminal processing, it is necessary to set up multiple different data collection terminals to collect log data associated with the business systems in real time or on a regular basis. This allows for comprehensive collection of log data and fault detection and handling based on the log data, thereby ensuring stable system operation and efficient maintenance.

[0004] However, because the current business system involves multi-terminal processing, different data collection terminals need to collect their respective log data. The log data collection and processing chain is long and complex, involving the collaborative work of multiple components and modules. When the business system fails, a lot of human and material resources are required to check and troubleshoot different components and modules separately, and there is still a problem of low efficiency in fault location and troubleshooting.

Summary of the Invention

[0005] Therefore, it is necessary to provide a fault handling method, apparatus, computer equipment, computer-readable storage medium, and computer program product based on multi-source log data that can quickly locate the fault location when a business system fails, reduce detection and troubleshooting operations, and improve the efficiency of fault location and troubleshooting.

[0006] In a first aspect, this application provides a fault handling method based on multi-source log data, comprising: receiving a log fault handling request; determining a target log identifier and log collection method corresponding to the log fault handling request; querying at least one of a log database, a log gateway, and a data integration and processing component according to the target log identifier to obtain a log query result, and determining the log fault range according to the log query result; if the log collection method is determined to be real-time collection, obtaining business monitoring information; determining the log fault location according to the log fault range and the business monitoring information, and performing fault troubleshooting based on the log fault location.

[0007] In a second aspect, this application also provides a fault handling device based on multi-source log data, comprising: a log fault handling request receiving module, configured to receive a log fault handling request and determine a target log identifier and log collection method corresponding to the log fault handling request; a log fault range determination module, configured to query at least one of a log database, a log gateway, and a data integration and processing component according to the target log identifier, obtain a log query result, and determine the log fault range according to the log query result; and a log fault location determination module, configured to, if the log collection method is determined to be real-time collection, obtain business monitoring information, determine the log fault location according to the log fault range and the business monitoring information, and perform fault troubleshooting based on the log fault location.

[0008] In a third aspect, this application also provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to perform the following steps: receiving a log fault handling request, determining a target log identifier and a log collection method corresponding to the log fault handling request; querying at least one of a log database, a log gateway, and a data integration and processing component according to the target log identifier to obtain a log query result, and determining the log fault range according to the log query result; if the log collection method is determined to be real-time collection, obtaining business monitoring information, determining the log fault location according to the log fault range and the business monitoring information, and performing fault troubleshooting based on the log fault location.

[0009] In a fourth aspect, this application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the following steps: receiving a log fault handling request, determining a target log identifier and log collection method corresponding to the log fault handling request; querying at least one of a log database, a log gateway, and a data integration and processing component based on the target log identifier to obtain a log query result, and determining the log fault range based on the log query result; if the log collection method is determined to be real-time collection, obtaining business monitoring information, determining the log fault location based on the log fault range and the business monitoring information, and performing fault troubleshooting based on the log fault location.

[0010] In a fifth aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps: receiving a log fault handling request; determining a target log identifier and log collection method corresponding to the log fault handling request; querying at least one of a log database, a log gateway, and a data integration and processing component based on the target log identifier to obtain a log query result, and determining the log fault range based on the log query result; if the log collection method is determined to be real-time collection, obtaining business monitoring information; determining the log fault location based on the log fault range and the business monitoring information, and performing fault troubleshooting based on the log fault location.

[0011] In the aforementioned fault handling method, apparatus, computer device, computer-readable storage medium, and computer program product based on multi-source log data, a log fault handling request is received, and the target log identifier and log collection method corresponding to the request are determined. At least one of the log database, the log gateway, and the data integration processing component is then queried based on the target log identifier to obtain log query results, and the scope of the log fault is determined from those results. This reduces the number of components that need to be viewed and investigated during the query process, shortens the link length of log fault troubleshooting, and improves the efficiency of fault location and troubleshooting. Furthermore, if the log collection method is determined to be real-time collection, business monitoring information is obtained, the location of the log fault is determined based on the log fault scope and the business monitoring information, and fault troubleshooting is performed based on that location. Combining business monitoring information from different dimensions with the determined log fault scope allows the log fault location to be determined accurately, avoids missed faults, and allows the fault to be addressed promptly, ensuring the stable availability of the business system.

Attached Figure Description

[0012] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is an application environment diagram of a fault handling method based on multi-source log data in one embodiment;

[0014] Figure 2 is a flowchart illustrating a fault handling method based on multi-source log data in one embodiment;

[0015] Figure 3 is a schematic diagram of the log collection and storage process based on multi-source log data in one embodiment;

[0016] Figure 4 is a schematic diagram of the log collection module in one embodiment;

[0017] Figure 5 is a schematic diagram of the process for obtaining log query results in one embodiment;

[0018] Figure 6 is a schematic diagram illustrating the process of determining the scope of log faults in one embodiment;

[0019] Figure 7 is a flowchart illustrating the process of determining the location of a log fault in one embodiment;

[0020] Figure 8 is a schematic diagram illustrating the distribution of service monitoring information in one embodiment;

[0021] Figure 9 is a flowchart illustrating a fault handling method based on multi-source log data in another embodiment;

[0022] Figure 10 is a flowchart illustrating a fault handling method based on multi-source log data in another embodiment;

[0023] Figure 11 is a schematic diagram illustrating the traditional troubleshooting process for log collection in one embodiment;

[0024] Figure 12 is a schematic diagram of the overall processing procedure of a fault handling method based on multi-source log data in one embodiment;

[0025] Figure 13 is a structural block diagram of a fault handling device based on multi-source log data in one embodiment;

[0026] Figure 14 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0027] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0028] The fault handling method based on multi-source log data provided in this application embodiment can be applied in the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system can store the data that server 104 needs to process. The data storage system can be integrated onto server 104 or placed on the cloud or other network servers. Terminal 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, portable wearable devices, and aircraft. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, and projection devices. Portable wearable devices can include smartwatches, smart bracelets, and head-mounted devices. Head-mounted devices can be virtual reality (VR) devices, augmented reality (AR) devices, smart glasses, etc. Server 104 can be a standalone physical server, a server cluster consisting of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Terminal 102 and server 104 can be connected directly or indirectly through wired or wireless communication, and this embodiment does not impose any restrictions on this.

[0029] Both terminal 102 and server 104 can be used independently to execute the fault handling method based on multi-source log data provided in this embodiment, or they can work together to execute the same method. For example, taking the collaborative execution of the fault handling method based on multi-source log data by terminal 102 and server 104, when server 104 receives a log fault handling request from terminal 102, it determines the target log identifier and log collection method corresponding to the request. Based on the target log identifier, it queries at least one of the log database, log gateway, and data integration processing component to obtain log query results, thereby determining the scope of the log fault. Further, if server 104 determines that the log collection method is real-time collection, it obtains business monitoring information, determines the location of the log fault based on the scope of the log fault and the business monitoring information, and performs fault troubleshooting based on the location of the log fault.

[0030] In one exemplary embodiment, as shown in Figure 2, a fault handling method based on multi-source log data is provided. Taking its application to server 104 in Figure 1 as an example, the method includes the following steps 202 to 206:

[0031] Step 202: Receive a log fault handling request and determine the target log identifier and log collection method corresponding to the request.

[0032] Specifically, if a user cannot find log data on the log console page, indicating a log collection failure, a log fault handling request can be triggered on the terminal where the log console is located. After detecting the log fault handling request, the terminal sends it to the server. Upon receiving the request, the server parses it to determine the corresponding target log identifier and log collection method.

[0033] The target log identifier can be understood as the identifier corresponding to the log data that the user needs to query. Users can enter this target log identifier on the log console page to query log data. The log collection method refers to how the log collection module collects logs, which can include real-time collection and log collector collection. Real-time collection specifically includes: API (Application Programming Interface) collection, SDK (Software Development Kit) collection, and Kafka (a distributed message queue based on a publish / subscribe pattern) protocol collection, etc.
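
As a minimal sketch of step 202, the parsed request can be modeled as a target log identifier plus a collection method. The patent does not specify a request format, so the names `CollectionMode`, `LogFaultRequest`, and `parse_request`, the dictionary keys `log_id` and `collection_mode`, and the string values of the enum are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CollectionMode(Enum):
    """Collection methods named in the description (string values are assumed)."""
    API = "api"
    SDK = "sdk"
    KAFKA = "kafka"
    COLLECTOR = "collector"  # log collector installed on the user's machine


# Per the description, API, SDK, and Kafka protocol collection are real-time.
REALTIME_MODES = {CollectionMode.API, CollectionMode.SDK, CollectionMode.KAFKA}


@dataclass(frozen=True)
class LogFaultRequest:
    target_log_id: str
    mode: CollectionMode

    @property
    def is_realtime(self) -> bool:
        """True when business monitoring information should be consulted (step 206)."""
        return self.mode in REALTIME_MODES


def parse_request(payload: dict) -> LogFaultRequest:
    """Extract the target log identifier and collection method (hypothetical keys)."""
    return LogFaultRequest(
        target_log_id=str(payload["log_id"]),
        mode=CollectionMode(payload["collection_mode"]),
    )
```

Keeping the real-time check on the request object means the later monitoring step only needs to test `is_realtime` rather than re-inspect the raw payload.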

[0034] Step 204: Based on the target log identifier, query at least one of the log database, log gateway, and data integration processing component to obtain the log query results, and determine the scope of the log fault based on the log query results.

[0035] Specifically, if a user uploads logs but the log data is not displayed on the log console page, a log collection failure has occurred and must be diagnosed and troubleshot. The components to be examined are the log database, the log gateway, and the data integration processing component. At least one of them is queried, in the order log database → log gateway → data integration processing component. If the log database query returns the target log data, the log gateway and the data integration processing component need not be queried. Similarly, if the log database query returns nothing but the log gateway query yields a result, the data integration processing component need not be queried.
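
The ordered, short-circuiting query described above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: each component is represented by a hypothetical probe function that returns a result dictionary, or `None` when that component yields nothing for the identifier, and `query_in_order` is an invented name.

```python
from typing import Callable, Optional

# A probe takes the target log identifier and returns a result,
# or None when the component yields nothing for that identifier.
Probe = Callable[[str], Optional[dict]]


def query_in_order(
    log_id: str, probes: list[tuple[str, Probe]]
) -> tuple[Optional[str], Optional[dict], list[str]]:
    """Query components in the given order and stop at the first result.

    Returns (component name, result, list of components actually queried).
    """
    visited: list[str] = []
    for name, probe in probes:
        visited.append(name)
        result = probe(log_id)
        if result is not None:
            return name, result, visited
    return None, None, visited
```

Passing the probes as `[("log_database", ...), ("log_gateway", ...), ("data_integration", ...)]` reproduces the order in the text, and the returned `visited` list shows which components were skipped.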

[0036] Furthermore, after at least one of the log database, log gateway, and data integration processing component has been queried based on the target log identifier and the log query results have been obtained, the scope of the log fault is determined from those results. For example, the scope of the log fault may be the log console, the log query process, or the data collection link. The data collection link may include the log gateway receiving or forwarding log data, the data integration processing component receiving or forwarding log data, and the delivery or consumption of log data in the Kafka message middleware.

[0037] Step 206: If the log collection method is determined to be real-time collection, then obtain the business monitoring information, determine the log fault location based on the log fault range and the business monitoring information, and perform fault troubleshooting based on the log fault location.

[0038] Specifically, if the log collection method is determined to be real-time collection, such as API collection, SDK collection, or Kafka protocol collection, business monitoring information is obtained. This includes service component monitoring information for each service component, container component monitoring information for each container component, and host node monitoring information for the host nodes on which container components are deployed. Based on the scope of the log fault together with the service component, container component, and host node monitoring information, the location of the log fault can be determined, specifically whether it is a component problem, a machine resource problem, or a network transmission problem.

[0039] For example, the log fault location may be a container anomaly, host node anomaly, or service component anomaly in the log console or the log query process. It may also be a container anomaly, host node anomaly, or service component anomaly in the data collection link, occurring while the log gateway receives or forwards log data, while the data integration and processing component receives or forwards log data, or during delivery or consumption of log data in the Kafka middleware.

[0040] Furthermore, after determining the location of the log fault, the server performs further troubleshooting based on the location of the log fault, including subsequent processing such as replacing faulty components and resolving fault issues based on the specific location of the log fault.
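
Step 206 can be illustrated by intersecting the fault range with per-component monitoring data. The sketch below assumes monitoring information has already been reduced to three boolean health flags per component; the `Monitoring` shape, the component names, and `locate_fault` are hypothetical, and a real system would derive these flags from metrics the patent does not enumerate here.

```python
from dataclasses import dataclass


@dataclass
class Monitoring:
    """Health flags for one component (hypothetical shape)."""
    service_ok: bool
    container_ok: bool
    host_ok: bool


def locate_fault(
    fault_range: list[str], monitoring: dict[str, Monitoring]
) -> list[tuple[str, str]]:
    """Return (component, anomaly kind) pairs for components inside the fault range."""
    locations: list[tuple[str, str]] = []
    for component in fault_range:
        info = monitoring.get(component)
        if info is None:
            continue  # no monitoring data available for this component
        if not info.host_ok:
            locations.append((component, "host node anomaly"))
        if not info.container_ok:
            locations.append((component, "container anomaly"))
        if not info.service_ok:
            locations.append((component, "service component anomaly"))
    return locations
```

Only components already inside the fault range are inspected, which mirrors the text: the earlier query narrows the range, and monitoring data then pins down the anomaly within it.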

[0041] In one exemplary embodiment, as shown in Figure 3, a log collection and storage process based on multi-source log data is provided. As Figure 3 shows, the process involves multiple modules, including a log collection module, a log gateway, a Kafka message middleware, a data cleaning and transformation module, local columnar storage, and remote object storage:

[0042] P1. For the log collection module: The log collection module collects logs in multiple ways, and its collection methods are multi-source, which may include API collection, SDK collection, log collector collection, and Kafka protocol collection, etc.

[0043] For example, as shown in Figure 4, a schematic diagram of a log collection module is provided. As Figure 4 shows, the log collection module supports API collection, SDK collection, log collector collection, Kafka protocol collection, and other collection methods. API / SDK collection uses real-time log push, which flexibly pushes the logs that users need to upload. The log collector automatically pushes logs that conform to collection rules; this method requires the user to install the collector on their local machine and configure the collection rules. The log collector then loads the collection rules and collects logs from specific directories or logs that match specific rules. Logs collected via API / SDK and via the log collector are compressed and sent to the log gateway in HTTPS requests. Kafka protocol collection uses Kafka (a distributed message queue based on a publish / subscribe pattern) as an intermediary for uploading logs.

[0044] P2. Log gateway: The log gateway extracts certain fields and identifiers from the HTTPS log upload request for subsequent authentication and packaging. Specifically, the log gateway authenticates the accessKey field (access key) and the temporary credential token, which typically requires calling other modules to obtain user information. If a match is found and the status is correct, authentication succeeds; otherwise, authentication fails and an error message is returned to the log collection module. If authentication succeeds, the log gateway packages the log packet, adding key information and identifiers from the log gateway, and then sends the log packet to the Kafka message middleware.
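
The gateway's authenticate-then-package behavior might look like the sketch below. The user lookup is stubbed with an in-memory dictionary in place of the "other modules" the text mentions, and `User`, `AuthError`, `authenticate`, `package`, and the envelope fields `gateway_id` and `access_key` are all hypothetical; the patent does not state which identifiers the gateway adds.

```python
import hmac
from dataclasses import dataclass


@dataclass
class User:
    access_key: str
    token: str
    active: bool  # stands in for "the status is correct"


class AuthError(Exception):
    """Reported back to the log collection module when authentication fails."""


def authenticate(access_key: str, token: str, users: dict[str, User]) -> User:
    """Match the access key and temporary token against stored user information."""
    user = users.get(access_key)
    if user is None or not user.active:
        raise AuthError("unknown or inactive access key")
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(user.token.encode(), token.encode()):
        raise AuthError("token mismatch")
    return user


def package(log_packet: bytes, user: User, gateway_id: str) -> dict:
    """Wrap the packet with gateway identifiers before it is sent to Kafka."""
    return {"gateway_id": gateway_id, "access_key": user.access_key, "payload": log_packet}
```

Stamping a gateway identifier onto each packet is one way to make the later "did the gateway forward this log?" query answerable from the packet itself.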

[0045] P3. Kafka Message Middleware: Kafka message middleware stores log packets in message queues or partitions of specific topics, waiting for the ETL (Extract Transform Load, which can be understood as data cleaning and transformation) module to consume the log packets.

[0046] P4. ETL (data cleaning and transformation) module: This module processes log packets consumed from the Kafka message middleware. Upon receiving a log packet, it unpacks it to obtain the raw logs. It then verifies and transforms the raw logs, checking that the format is correct and that the MD5 hash is consistent. If verification passes, the logs are converted back to their original state, and predefined fields are added for later display on the log console. Finally, the raw logs with the added predefined fields are sent to the database for storage.
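
The unpack, verify, and enrich sequence can be sketched with the Python standard library. The packet layout here (a gzip-compressed JSON body plus a hex MD5 digest of the uncompressed body) is an assumption, since the patent does not define the wire format; `make_packet` and `etl_process` are hypothetical names, and `make_packet` exists only to build input for the example.

```python
import gzip
import hashlib
import json


def make_packet(raw_log: dict) -> dict:
    """Build a packet in the assumed layout (helper for this example only)."""
    body = json.dumps(raw_log, sort_keys=True).encode("utf-8")
    return {"payload": gzip.compress(body), "md5": hashlib.md5(body).hexdigest()}


def etl_process(packet: dict, predefined: dict) -> dict:
    """Unpack a packet, verify its MD5 and format, then add predefined fields."""
    body = gzip.decompress(packet["payload"])
    # MD5 is used here as an integrity check only, not as a security control.
    if hashlib.md5(body).hexdigest() != packet["md5"]:
        raise ValueError("MD5 mismatch: packet corrupted in transit")
    raw_log = json.loads(body)  # malformed JSON raises a ValueError subclass
    if not isinstance(raw_log, dict):
        raise ValueError("unexpected log format")
    # Predefined fields are merged in for later display on the log console.
    return {**raw_log, **predefined}
```

A packet that fails either check never reaches storage, which is why a missing log in the database can point back to this stage.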

[0047] P5. Local columnar storage and remote object storage: The database uses columnar storage, commonly ClickHouse or ByConity. Data is stored in the columnar database and then further stored in remote object storage. When the log console checks whether logs have been successfully uploaded, it typically queries the local columnar storage database first. If the log data is not found there, it retrieves the data from the remote object storage. If the log data is not found in the remote object storage either, a message indicating that the log data retrieval was unsuccessful is displayed on the log console, and a system fault report is filed. The system then diagnoses and troubleshoots the last location the logs reached, identifies the cause of the collection failure, and monitors key metrics of the entire log collection system.
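
The two-tier lookup reads naturally as a fallback chain. In this sketch both stores are stand-in mappings rather than real ClickHouse or object storage clients, and `find_log` and its return labels are hypothetical.

```python
from typing import Mapping, Optional


def find_log(
    log_id: str, local: Mapping[str, str], remote: Mapping[str, str]
) -> tuple[Optional[str], str]:
    """Look in local columnar storage first, then fall back to remote object storage."""
    if log_id in local:
        return local[log_id], "local"
    if log_id in remote:
        return remote[log_id], "remote"
    # Neither tier has the log: the caller shows the failure message
    # on the console and files a fault report.
    return None, "not_found"
```

Returning the tier alongside the data lets the caller distinguish a log that has already aged out of local storage from one that was never stored.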

[0048] In the aforementioned fault handling method based on multi-source log data, a log fault handling request is received, and the target log identifier and log collection method corresponding to the request are determined. Based on the target log identifier, at least one of the log database, log gateway, and data integration processing components is queried to obtain log query results. The scope of the log fault is then determined based on these results. This reduces the number of components that need to be viewed and investigated during the query process, shortens the log fault troubleshooting process, and improves the efficiency of fault location and troubleshooting. Furthermore, if the log collection method is determined to be real-time collection, business monitoring information is obtained. Based on the log fault scope and the business monitoring information, the location of the log fault is determined, and fault troubleshooting is performed based on this location. This allows for further integration of different dimensions of business monitoring information and the determined log fault scope to accurately pinpoint the log fault location, preventing any omissions and enabling timely troubleshooting to ensure the stable availability of the business system.

[0049] In one exemplary embodiment, as shown in Figure 5, the step of querying at least one of the log database, log gateway, and data integration processing component based on the target log identifier to obtain log query results specifically includes steps 502 to 512:

[0050] Step 502: Query the log database based on the target log identifier.

[0051] If a user cannot find log data based on the target log identifier on the log console page, a log fault handling request is triggered based on the terminal where the log console is located. The terminal sends the log fault handling request back to the server, and the server parses the log fault handling request to determine the target log identifier corresponding to the log fault handling request.

[0052] Specifically, after obtaining the target log identifier, the server performs a log query in the order of log database → log gateway → data integration and processing component. First, it queries the log database based on the target log identifier to determine if target log data matching the identifier exists. Querying the log database helps pinpoint whether the problem lies in the log collection process of the log system or the log query process of the log console. If the target log data is present in the log database (i.e., the query result is successful), the log database is functioning correctly, and the log fault can be traced to the log console or the log query process. Conversely, if the target log data is not present in the database, a problem exists in the log collection process of the log system, and the log fault can be traced back to the log collection process.

[0053] Step 504: If it is determined that there is target log data in the log database that matches the target log identifier, then perform a log display query on the log console based on the target log identifier to obtain the log query results.

[0054] Specifically, if it is determined that there is target log data in the log database that matches the target log identifier, that is, the query result for the log database is a successful query for the log database, it indicates that there is no fault in the log database. The server needs to further query the log display in the log console according to the target log identifier to determine whether there is a problem with the log display in the log console and obtain the corresponding log query results.

[0055] When the server queries the logs displayed in the log console, it can specifically check the log display results and the log query logic. If the log display result indicates successful log display, it means there is no problem with the log display in the log console; further checks can then be performed on the log query logic to determine if there are any errors. Conversely, if the log display result indicates a failure to display logs, it means there is a problem with the log display in the log console.

[0056] Step 506 (performed after step 502): If it is determined that there is no target log data matching the target log identifier in the log database, then query the log gateway based on the target log identifier to determine the first forwarding status of the target log data by the log gateway.

[0057] Specifically, if the server determines that there is no target log data matching the target log identifier in the log database, that is, the log data has not been stored in the log database, then there is a problem in the log system's collection link, and the log gateway must be queried based on the target log identifier to determine its first forwarding status for the target log data.

[0058] Specifically, the first forwarding status of the target log data by the log gateway includes whether the log gateway successfully forwards the target log data or fails to forward the target log data.

[0059] Step 508: If the first forwarding status is that the log gateway has not successfully forwarded the target log data, then determine the first receiving status of the log gateway for the target log data, and determine the log query result based on the first receiving status.

[0060] Specifically, if the server determines that the first forwarding state is that the log gateway has failed to forward the target log data, then it determines the first receiving state of the log gateway for the target log data, that is, it further determines whether the log gateway has successfully received the target log data. The first receiving state of the log gateway for the target log data includes both successful reception of the target log data and unsuccessful reception of the target log data.

[0061] Furthermore, when determining the log query result based on the first reception status, if the first reception status of the log gateway for the target log data is determined to be that the log gateway has successfully received the target log data, then the log query result can be determined to be that the log gateway has successfully queried the target log data, but the log gateway has not successfully forwarded the target log data, thus it can be determined that the log gateway is faulty.

[0062] Conversely, if the first reception status of the log gateway for the target log data is determined to be that the log gateway has not successfully received the target log data, then the log query result can be determined to be that the log gateway query failed, that is, the log gateway has not received the target log data. Thus, it can be determined that there is a fault in the log collection module, which may be due to a fault when collecting log data using the API collection method or when collecting log data using a log collector.
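
The gateway branch in steps 506 and 508 reduces to a small decision over two flags. The sketch below assumes the gateway query has already produced booleans for "received" and "forwarded"; `GatewayFinding` and `classify_gateway` are hypothetical names.

```python
from enum import Enum


class GatewayFinding(Enum):
    FORWARDED = "gateway forwarded the log; continue with the next component"
    GATEWAY_FAULT = "gateway received the log but did not forward it"
    COLLECTION_FAULT = "gateway never received the log; suspect the collection module"


def classify_gateway(received: bool, forwarded: bool) -> GatewayFinding:
    """Map the gateway's receiving and forwarding status to a finding."""
    if forwarded:
        return GatewayFinding.FORWARDED
    if received:
        return GatewayFinding.GATEWAY_FAULT
    return GatewayFinding.COLLECTION_FAULT
```

Checking the forwarding status first matches the order of the steps: the receiving status only matters once forwarding is known to have failed.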

[0063] After executing step 506, execute step 510. If the first forwarding status is that the log gateway has successfully forwarded the target log data, then query the data integration and processing component according to the target log identifier to determine the second forwarding status of the data integration and processing component for the target log data.

[0064] Specifically, if the server determines that the first forwarding status is that the log gateway has successfully forwarded the target log data, it further queries the data integration and processing component to determine whether the data integration and processing component has successfully forwarded the target log data.

[0065] Furthermore, if the server determines that the log gateway has successfully forwarded the target log data—that is, the log gateway has successfully received and forwarded the target log data, and there are no faults in the log gateway and its associated components—the server further queries the data integration and processing component based on the target log identifier to determine the second forwarding status of the data integration and processing component for the target log data, i.e., whether the data integration and processing component has successfully forwarded the target log data. Specifically, the second forwarding status of the data integration and processing component for the target log data includes both cases where the data integration and processing component has failed to forward the target log data and cases where the data integration and processing component has successfully forwarded the target log data.

[0066] If the server determines that the second forwarding status is that the data integration and processing component has successfully forwarded the target log data, it indicates that the data integration and processing component has successfully received and forwarded the target log data. Therefore, it can be determined that the data integration and processing component and its associated components are not faulty, and thus the log storage is faulty.

[0067] Step 512: If the second forwarding status is that the data integration and processing component has failed to forward the target log data, then determine the second receiving status of the data integration and processing component for the target log data, and determine the log query result based on the second receiving status.

[0068] Specifically, if the server determines that the second forwarding status is that the data integration and processing component has failed to forward the target log data, then it determines the second receiving status of the data integration and processing component for the target log data, that is, it determines whether the data integration and processing component has successfully received the target log data. The second receiving status covers two cases: the data integration and processing component has successfully received the target log data, or it has failed to receive the target log data.

[0069] Furthermore, when determining the log query result based on the second receiving status, if the server determines that the data integration and processing component has successfully received the target log data, the log query result is that the query of the data integration and processing component succeeded. Because the component received the target log data but did not successfully forward it, it can be determined that the data integration and processing component is faulty.

[0070] Conversely, if the server determines that the data integration and processing component has failed to receive the target log data, then the query result indicates that the data integration and processing component has failed to query, and the Kafka message middleware has not successfully forwarded the target log data to the data integration and processing component. In this case, it can be determined that there is a fault in the Kafka message middleware.

[0071] In one exemplary embodiment, as shown in Figure 6, a process for determining the scope of log faults is provided. Referring to Figure 6, the process of determining the scope of log faults specifically includes:

[0072] 1) Determine whether target log data matching the target log identifier exists in the log database.
2) If target log data matching the target log identifier exists in the log database, perform a log display query on the log console to obtain the log query results.
3) If the log query results on the log console are abnormal, the fault is located in the log console or the log query.
4) If target log data matching the target log identifier does not exist in the log database, query the log gateway to determine whether the log gateway successfully forwarded the target log data.
5) If the log gateway failed to forward the target log data, determine whether the log gateway successfully received the target log data.
6) If the log gateway successfully received the target log data, the fault is located in the log gateway. Specifically, if the log gateway successfully received the target log data but failed to forward it, the log gateway has malfunctioned; this could be due to incorrect request parameters or an authentication failure that prevents the log gateway from proceeding to the next forwarding step.
7) If the log gateway failed to receive the target log data, the fault is located in the log collection module. If the log gateway fails to receive or forward the target log data, the problem lies in the log collection environment, causing the logs to fail to be transmitted.
8) If the log gateway successfully forwarded the target log data, query the data integration and processing component to determine whether it successfully forwarded the target log data.
9) If the data integration and processing component failed to forward the target log data, determine whether it successfully received the target log data.
10) If the data integration and processing component successfully received the target log data, the fault is located in the data integration and processing component. Specifically, if the component successfully received the target log data but failed to forward it, this could be due to a log unpacking failure or non-compliant logs, causing the logs to be discarded and not forwarded further.
11) If the data integration and processing component failed to receive the target log data, the fault is located in the Kafka message middleware. If the data integration and processing component fails to receive or forward the target log data, the problem lies with the Kafka message middleware. A Kafka message middleware failure could be due to a Kafka cluster malfunction or problems with the Kafka data delivery and consumption environment.
12) If the data integration and processing component successfully forwarded the target log data, the fault is located in the log storage. If the data integration and processing component successfully receives and forwards the target log data, but the data cannot be retrieved, the problem lies with the log storage. Specifically, log storage failures may include database write failures or network anomalies, which may prevent the target log data from being written to the database correctly, resulting in the log console being unable to query the target log data.
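The twelve checks above form a decision tree, which can be sketched as follows. This is only an illustrative sketch: the names `FaultScope`, `PipelineObservation`, and `determine_fault_scope` are hypothetical, and the sketch assumes the receive and forward states of each component have already been looked up for the target log identifier, whereas the described method queries each component only when the previous check requires it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FaultScope(Enum):
    """Possible log fault scopes named in steps 1) to 12)."""
    CONSOLE_OR_QUERY = "log console or log query"
    LOG_GATEWAY = "log gateway"
    COLLECTION_MODULE = "log collection module"
    DATA_INTEGRATION = "data integration and processing component"
    KAFKA = "Kafka message middleware"
    LOG_STORAGE = "log storage"


@dataclass
class PipelineObservation:
    """Hypothetical snapshot of what each component reports for one log identifier."""
    in_database: bool
    console_query_ok: bool = True
    gateway_forwarded: bool = False
    gateway_received: bool = False
    integration_forwarded: bool = False
    integration_received: bool = False


def determine_fault_scope(obs: PipelineObservation) -> Optional[FaultScope]:
    # Steps 1) to 3): the data is stored, so only the console or query path can be at fault.
    if obs.in_database:
        return None if obs.console_query_ok else FaultScope.CONSOLE_OR_QUERY
    # Steps 4) to 7): the data is missing, so check the log gateway first.
    if not obs.gateway_forwarded:
        if obs.gateway_received:
            return FaultScope.LOG_GATEWAY
        return FaultScope.COLLECTION_MODULE
    # Steps 8) to 11): the gateway forwarded the data, so check the next hop.
    if not obs.integration_forwarded:
        if obs.integration_received:
            return FaultScope.DATA_INTEGRATION
        return FaultScope.KAFKA
    # Step 12): every hop forwarded the data, yet it is not in the database.
    return FaultScope.LOG_STORAGE
```

For instance, an observation in which the database has no matching data and the gateway received but did not forward the data yields `FaultScope.LOG_GATEWAY`, matching step 6).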

[0073] In this embodiment, the log database is queried based on the target log identifier. If it is determined that target log data matching the target log identifier exists in the log database, the log display query is performed on the log console based on the target log identifier to obtain the log query result. If it is determined that no target log data matching the target log identifier exists in the log database, the log gateway is queried based on the target log identifier to determine the first forwarding status of the log gateway for the target log data. If the first forwarding status is determined to be that the log gateway has failed to forward the target log data, the first receiving status of the log gateway for the target log data is determined, and the log query result is determined based on the first receiving status. If the first forwarding status indicates that the log gateway has successfully forwarded the target log data, the data integration and processing component is queried based on the target log identifier to determine the second forwarding status of the data integration and processing component for the target log data. If the second forwarding status indicates that the data integration and processing component has failed to forward the target log data, the second receiving status of the data integration and processing component for the target log data is determined, and the log query result is determined based on the second receiving status. This allows for log data querying of at least one of the log database, log gateway, and data integration and processing component based on the target log identifier. Based on the query results, it is determined step by step whether different components have faults, rather than using a unified query and investigation method for all components. This allows for targeted and orderly fault investigation, quickly locating the fault location, resolving the fault in a timely manner, and improving the stability and availability of the system.

[0074] In one exemplary embodiment, as shown in Figure 7, the step of determining the location of a log fault, namely obtaining business monitoring information and determining the location of the log fault based on the scope of the log fault and the business monitoring information, specifically includes steps 702 to 706, wherein:

[0075] Step 702: Obtain the service component monitoring information corresponding to the service component, the container component monitoring information corresponding to the container component, and the host node monitoring information of the host node on which the container component is deployed.

[0076] Specifically, when the server monitors the business system and obtains business monitoring information, it needs to monitor the host nodes, the container components deployed on the host nodes, and the service components corresponding to the underlying resources. It obtains service component monitoring information corresponding to the service components, container component monitoring information corresponding to the container components, and host node monitoring information of the host nodes with deployed container components.

[0077] For host nodes, node-exporter is used to collect the host node's operational status. node-exporter is a metric collector in the service monitoring system that collects and exposes hardware and operating-system metrics; it runs as an HTTP server and provides various system monitoring parameters, such as CPU load, memory usage, and network status, and this data can be actively pulled for monitoring and alerting. The metrics that need to be monitored for host nodes include: CPU, memory, and disk usage; network socket usage status; and disk and network read/write speeds and volumes. These monitored metrics reflect the underlying state of the entire system; abnormalities in these metrics will affect other upper-layer services.
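As an illustration of how a host node metric might be derived from such output, the following sketch parses a small sample and computes a memory usage ratio. It assumes that node-exporter here is the Prometheus Node Exporter, that its output uses the Prometheus text exposition format without timestamps, and that the Linux gauges `node_memory_MemTotal_bytes` and `node_memory_MemAvailable_bytes` are present; the helper names and the sample values are hypothetical.

```python
from typing import Dict

# Hypothetical sample of node-exporter output (values are made up).
SAMPLE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8e+09
node_memory_MemAvailable_bytes 2e+09
node_load1 0.5
"""


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse 'series value' lines into a dict, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def memory_usage_ratio(samples: Dict[str, float]) -> float:
    """Fraction of host memory in use, derived from total and available bytes."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return 1.0 - available / total
```

With the sample above, `memory_usage_ratio(parse_exposition(SAMPLE))` evaluates to 0.75, which a monitoring rule could compare against an alert threshold.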

[0078] For example, for container components, the container performance monitoring component corresponding to the container orchestration platform is used. This component can monitor the container's running status metrics, specifically including container CPU and memory usage, disk partition usage, and network read/write I/O. Since containers are prone to fluctuations during scaling up/down or update iterations, and various abnormal events can easily occur that affect internal services when resources within the container are nearly exhausted, it is necessary to monitor the containers themselves that host the services to determine if any faults exist.

[0079] Optionally, for service components, the service monitoring components corresponding to the container orchestration platform are used to monitor major internal events and current operating status of the service, and the monitoring indicators corresponding to the service components specifically include: network data inbound and outbound traffic, number of network requests, number of request errors, abnormal events, etc.

[0080] Step 704: Determine the target business monitoring information that matches the scope of log failures from the service component monitoring information, container component monitoring information, and host node monitoring information.

[0081] Specifically, based on the scope of the log fault, the target business monitoring information matching that scope is determined from the service component monitoring information, container component monitoring information, and host node monitoring information. The scope of the log fault may be the log console or log query stage, or the collection link stage. The collection link stage may specifically include: the log gateway receiving or forwarding log data, the data integration and processing component receiving or forwarding log data, and the delivery or consumption of log data in the Kafka message middleware. Therefore, when determining the location of the log fault based on the scope of the log fault, service component monitoring information, container component monitoring information, and host node monitoring information, it is determined whether the fault is a component problem, a machine resource problem, or a network transmission problem.

[0082] For example, the target business monitoring information that matches the scope of log faults can specifically be the container component monitoring information that matches the log console or log query stage, or the service component monitoring information that matches the log gateway receiving or forwarding log data stage in the data collection link stage, or the container component monitoring information that matches the data integration and processing component receiving or forwarding log data stage, etc.
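The matching in step 704 can be sketched as a lookup from fault scope to monitoring layer. The three explicit entries mirror the examples given above; the scope keys, the layer names, and the fallback of returning all three layers for any other scope are assumptions of this sketch, not details stated in the text.

```python
from typing import Dict, Tuple

ALL_LAYERS: Tuple[str, ...] = ("service", "container", "host")

# Hypothetical scope keys mapped to the monitoring layers named in the examples.
SCOPE_TO_LAYERS: Dict[str, Tuple[str, ...]] = {
    "console_or_query": ("container",),
    "log_gateway": ("service",),
    "data_integration": ("container",),
}


def select_target_monitoring(
    scope: str, monitoring: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Return only the monitoring layers that match the given log fault scope."""
    # Scopes without an explicit entry fall back to every layer (sketch assumption).
    layers = SCOPE_TO_LAYERS.get(scope, ALL_LAYERS)
    return {layer: monitoring[layer] for layer in layers if layer in monitoring}
```

Narrowing the monitoring data this way means the subsequent location step only has to inspect indicators for the layer where the fault scope actually lies.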

[0083] Step 706: Determine the location of the log fault based on the scope of the log fault and the target business monitoring information.

[0084] Specifically, when determining the location of a log fault based on the scope of the log fault and the target business monitoring information, the server needs to identify the possible location of the fault in the corresponding environment for the specific scope of the log fault, including the log console, the log query process, or the collection link. For example, if the log database contains the target log data, that is, the log database is not faulty, the scope of the log fault can be located in the log console or the log query process. However, if the target log data is not found in the database, it indicates that there is a problem in the collection link of the log system, and the scope of the log fault can be located in the log collection link.

[0085] For example, if the target log data is not present in the database, indicating a fault in the data collection link, the log gateway is queried to determine whether it successfully forwarded the target log data. If it is determined that the log gateway failed to forward the target log data, it is further determined whether the log gateway successfully received the target log data. If the log gateway successfully received the target log data, a fault is identified in the log gateway; conversely, if the log gateway failed to receive the target log data, a fault is identified in the log collection module.

[0086] Furthermore, if it is determined that the log gateway successfully forwarded the target log data, the data integration and processing component is queried based on the target log identifier to determine whether the data integration and processing component successfully forwarded the target log data. If it is determined that the data integration and processing component failed to forward the target log data, it is further checked whether the data integration and processing component successfully received the target log data. If it is determined that the data integration and processing component successfully received the target log data, a fault is located in the data integration and processing component. If it is determined that the data integration and processing component failed to receive the target log data, a fault is located in the Kafka message middleware. If it is determined that the data integration and processing component successfully forwarded the target log data, a fault is located in the log storage.

[0087] Understandably, by sequentially querying and locating faults in components such as the log database, log gateway, and data integration and processing component during the data collection process, it is possible to determine which specific component among these components is likely to fail. In other words, based on the faulty component among the log database, log gateway, and data integration and processing component, the scope of the log fault can be determined. By combining service component monitoring information, container component monitoring information, and host node monitoring information, the actual location of the log fault can be further determined.

[0088] In one exemplary embodiment, as shown in Figure 8, a schematic diagram of the distribution of business monitoring information is provided. Referring to Figure 8, the business monitoring information specifically includes: service component monitoring information, container component monitoring information, and host node monitoring information. This business monitoring information can be stored in a time-series database through the service monitoring system, and specific business monitoring information can be viewed through the monitoring panel of the service monitoring system.

[0089] Specifically, referring to Figure 8, host node monitoring information is obtained by using node-exporter (i.e., the metric collector in the service monitoring system) to collect the host node's running status and obtain specific metric information, including CPU, memory, and disk usage, network socket usage status, and disk and network read/write speeds and volumes. For container components, the container performance monitoring component corresponding to the container orchestration platform is used to monitor container running status metrics, specifically including container CPU and memory usage, disk partition usage, and network read/write I/O. For service components, the service monitoring component corresponding to the container orchestration platform is used to monitor critical internal events and the current running status of the service. The monitoring metrics for service components specifically include: network data inflow and outflow traffic, number of network requests, number of request errors, and abnormal events.

[0090] For example, when monitoring a service component, such as when a log packet is transmitted to the ETL module (i.e., the data cleaning and transformation module) and the log packet decompression fails, it may be due to an incorrect log compression method or the log packet being unparseable. In this case, the data cleaning and transformation module will discard the log packet and generate an indicator to represent the basic information of the discarded log packet. At this time, by using the network request count, request error count, abnormal events, and other indicators in the service component monitoring information, it can be determined whether the log packet was discarded by the data cleaning and transformation module.
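A minimal sketch of this behavior follows, assuming gzip as the compression method and a simple in-process counter standing in for the service component indicators; the function name `unpack_log_packet` and the counter names are hypothetical.

```python
import gzip
import zlib
from collections import Counter
from typing import Optional

# Hypothetical in-process indicators; a real service would export these to monitoring.
metrics: Counter = Counter()


def unpack_log_packet(packet: bytes) -> Optional[bytes]:
    """Decompress one log packet, or discard it and record the drop."""
    metrics["packets_received"] += 1
    try:
        return gzip.decompress(packet)
    except (OSError, EOFError, zlib.error):
        # Wrong compression method or unparseable packet: discard it and
        # leave a trace in the indicators so the drop can be observed later.
        metrics["packets_discarded"] += 1
        return None
```

Comparing `metrics["packets_discarded"]` against `metrics["packets_received"]` then shows whether packets reached the module and were dropped there, as opposed to never arriving.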

[0091] In this embodiment, by acquiring service component monitoring information corresponding to the service component, container component monitoring information corresponding to the container component, and host node monitoring information of the host node where the container component is deployed, the target business monitoring information matching the log fault range can be determined from the service component monitoring information, container component monitoring information, and host node monitoring information. Thus, the log fault location can be determined based on the log fault range and the target business monitoring information, enabling the problem to be identified in a short time. At the same time, by combining the monitoring of service components, container components, and host nodes, the system operating status can be monitored in all aspects, which can effectively reduce fault maintenance time, improve fault diagnosis efficiency, and help maintain the stability of the log collection system.

[0092] In one exemplary embodiment, as shown in Figure 9, a fault handling method based on multi-source log data is provided. Taking the application of the method to the server 104 in Figure 1 as an example, the method includes steps 902 to 908, wherein:

[0093] Step 902: If the log collection method is determined to be log collector collection, then obtain the collector status corresponding to the log collector.

[0094] When determining the scope of log failures, if the log collection module is found to be malfunctioning, further analysis is needed for different collection sources. For API / SDK collection methods, error messages are typically returned, allowing users to determine if the log upload method is correct. If the log collection method is a log collector, the log collector's operational evaluation results need to be determined to narrow down the scope of the log failure.

[0095] Specifically, if the log collection method is determined to be log collector collection, it is necessary to check the log collector's liveness status. This can be done by obtaining the log collector's ID and then querying monitoring based on that ID to determine the collector's status. If the log collector is down, log collection will not occur. If the log collector is alive, it is necessary to further check if its heartbeat is normal.

[0096] Step 904: If the collector is alive, determine the heartbeat data of the collector corresponding to the log collector.

[0097] Specifically, if it is determined that the log collector is alive, then the heartbeat data of the corresponding log collector is further obtained to determine whether the heartbeat of the log collector is normal.

[0098] If the log collector's periodic heartbeat is abnormal, or the interval since its last heartbeat is too long, the log collector can be determined to be in an abnormal state based on the heartbeat data.
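One way to express such a heartbeat check is sketched below. The 30-second reporting interval, the tolerance of three missed intervals, and the function name are illustrative assumptions; the text does not specify concrete values.

```python
def heartbeat_is_normal(
    last_heartbeat_ts: float,
    now_ts: float,
    interval_s: float = 30.0,
    tolerance: float = 3.0,
) -> bool:
    """Return True if the last heartbeat is no older than `tolerance` intervals."""
    age = now_ts - last_heartbeat_ts
    # A negative age (heartbeat stamped in the future) is also treated as abnormal.
    return 0 <= age <= interval_s * tolerance
```

Allowing a few missed intervals before declaring the collector abnormal avoids false alarms from a single delayed heartbeat.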

[0099] Step 906: If the log collector is determined to be in normal operation based on the collector's heartbeat data, then determine the log collection rule distribution status.

[0100] Specifically, if the log collector is determined to be operating normally based on its heartbeat data, the distribution status of the log collection rules is further determined based on the log console, i.e., whether the log collection rules have been distributed correctly. It is crucial to verify the correctness of the collection rules; if the collection rules distributed by the log console are incorrect, the target log data to be uploaded may not be successfully collected.

[0101] Step 908: Determine the operational evaluation result of the log collector based on the log collection rule distribution status, so as to determine the scope of log faults based on the operational evaluation result of the log collector.

[0102] Specifically, the server determines the operational evaluation result of the log collector based on the distribution status of the log collection rules. This includes determining whether the collection directory in the log collector exists, whether the regular expression of the collection rule matches, and whether the number of collected files meets the requirements (i.e., the number of collected files cannot exceed the number threshold). By obtaining the operational evaluation result of the log collector, the server can further determine the scope of log failure based on the operational evaluation result of the log collector, that is, determine whether the log collector has failed, and the stage and cause of the failure.
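The three checks named above can be sketched as follows. The sketch assumes that the collection rule's regular expression is matched against file names in the collection directory and that the threshold is a simple maximum file count; the function name and the problem messages are hypothetical.

```python
import os
import re
from typing import List


def evaluate_collection_rule(
    directory: str, filename_pattern: str, max_files: int
) -> List[str]:
    """Return a list of problems found for one collection rule (empty if none)."""
    # Check 1: the collection directory must exist on the collector's host.
    if not os.path.isdir(directory):
        return ["collection directory does not exist"]
    problems: List[str] = []
    regex = re.compile(filename_pattern)
    matched = [name for name in os.listdir(directory) if regex.fullmatch(name)]
    # Check 2: the rule's regular expression must match at least one file.
    if not matched:
        problems.append("collection rule regex matches no files")
    # Check 3: the number of collected files must not exceed the threshold.
    if len(matched) > max_files:
        problems.append("number of collected files exceeds threshold")
    return problems
```

An empty result suggests the rule itself is sound, so the investigation can move on; a non-empty result gives the stage and likely cause of the collector failure.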

[0103] In the aforementioned fault handling method based on multi-source log data, if the log collection method is determined to be log collector collection, the collector status corresponding to the log collector is obtained. If the collector status is determined to be alive, the collector heartbeat data corresponding to the log collector is further obtained. If the collector heartbeat data indicates that the log collector is in normal operation, the log collection rule distribution status is determined, and the log collection rule distribution status is used to determine the operational evaluation result of the log collector. Based on the log collector operational evaluation result, the scope of log faults is determined. Therefore, targeted querying and troubleshooting methods can be used according to different log collection methods, enabling targeted and orderly fault troubleshooting, quickly locating the fault location, resolving the fault in a timely manner, and improving system stability and availability.

[0104] In one exemplary embodiment, as shown in Figure 10, a fault handling method based on multi-source log data is provided. Taking the application of the method to the server 104 in Figure 1 as an example, the method includes the following steps 1001 to 1015, wherein:

[0105] Step 1001: Receive a log fault handling request and determine the target log identifier and log collection method corresponding to the log fault handling request.

[0106] Specifically, after receiving a log fault handling request, the server parses the log fault handling request to determine the target log identifier and log collection method corresponding to the log fault handling request.

[0107] Step 1002: Query the log database based on the target log identifier.

[0108] Specifically, after obtaining the target log identifier, the server performs a log query in the order of log database → log gateway → data integration and processing component. Specifically, it first needs to query the log database based on the target log identifier to determine whether there is target log data in the log database that matches the target log identifier.

[0109] Step 1003: If it is determined that there is target log data in the log database that matches the target log identifier, then perform a log display query on the log console according to the target log identifier to obtain the log query results.

[0110] Specifically, if it is determined that there is target log data in the log database that matches the target log identifier, that is, the query result for the log database is a successful query for the log database, the server needs to further query the log display in the log console based on the target log identifier to determine whether there is a problem with the log display in the log console and obtain the corresponding log query results.

[0111] After executing step 1002, execute step 1004. If it is determined that there is no target log data matching the target log identifier in the log database, then query the log gateway according to the target log identifier to determine the first forwarding status of the target log data by the log gateway.

[0112] Specifically, if the server determines that there is no target log data in the log database that matches the target log identifier, it further queries the log gateway based on the target log identifier to determine the first forwarding status of the target log data by the log gateway, including whether the log gateway successfully forwards the target log data or fails to forward the target log data.

[0113] Step 1005: If the first forwarding status is that the log gateway has not successfully forwarded the target log data, then determine the first receiving status of the log gateway for the target log data, and determine the log query result based on the first receiving status.

[0114] Specifically, if the server determines that the first forwarding status is that the log gateway has failed to forward the target log data, it further determines the first receiving status of the log gateway for the target log data, that is, whether the log gateway has successfully received the target log data. If the first receiving status of the log gateway for the target log data is determined to be that the log gateway has successfully received the target log data, then it can be determined that the log gateway is faulty. Conversely, if the first receiving status of the log gateway for the target log data is determined to be that the log gateway has failed to receive the target log data, then it can be determined that the log gateway has not received the target log data, that is, it can be determined that the log collection module is faulty.

[0115] After executing step 1004, execute step 1006. If the first forwarding status is that the log gateway has successfully forwarded the target log data, then query the data integration and processing component according to the target log identifier to determine the second forwarding status of the data integration and processing component for the target log data.

[0116] Specifically, if the server determines the first forwarding status as the log gateway successfully forwarding the target log data, it further queries the data integration and processing component to determine the second forwarding status of the target log data by the data integration and processing component, including whether the data integration and processing component has failed to forward the target log data or has successfully forwarded the target log data.

[0117] Step 1007: If the second forwarding status is that the data integration and processing component has failed to forward the target log data, then determine the second receiving status of the data integration and processing component for the target log data, and determine the log query result based on the second receiving status.

[0118] Specifically, if the server determines the second forwarding status—that the data integration and processing component has failed to forward the target log data—then it determines the second receiving status of the data integration and processing component for the target log data, i.e., whether the data integration and processing component has successfully received the target log data. If the server determines that the data integration and processing component has successfully received the target log data, but has failed to forward it, then the data integration and processing component is faulty. Conversely, if the server determines that the data integration and processing component has failed to receive the target log data, meaning the Kafka message middleware has not successfully forwarded the target log data to the data integration and processing component, then the Kafka message middleware is faulty.

[0119] After executing steps 1003, 1005, and 1007, execute step 1008 to determine the scope of the log fault based on the log query results.

[0120] After executing step 1001, execute step 1009. If it is determined that the log collection method is log collector collection, then obtain the collector status corresponding to the log collector.

[0121] Specifically, if the log collection method is determined to be log collector collection, it is necessary to check the log collector's liveness status. This can be done by obtaining the log collector number corresponding to the log collector and querying the monitoring based on the log collector number to determine the collector status of the log collector.

[0122] Step 1010: If the collector status is alive, determine the collector heartbeat data corresponding to the log collector.

[0123] Specifically, if it is determined that the log collector is alive, then the heartbeat data of the corresponding log collector is further obtained to determine whether the heartbeat of the log collector is normal.

[0124] Step 1011: If the log collector is determined to be in normal operation based on the collector's heartbeat data, then determine the status of the log collection rule distribution.

[0125] Specifically, if the log collector is determined to be in normal operation based on the collector's heartbeat data, then the distribution status of the log collection rules is further determined based on the log console, that is, whether the distribution status of the log collection rules is normal.

[0126] Step 1012: Determine the operational evaluation result of the log collector based on the log collection rule distribution status, so as to determine the scope of log faults based on the operational evaluation result of the log collector.

[0127] Specifically, after determining the operational evaluation result of the log collector based on the log collection rule distribution status, the server further determines the scope of log failure based on the log collector's operational evaluation result, that is, whether the log collector has failed, and the stage and cause of the log collector failure.
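As a non-limiting illustration, steps 1009 to 1012 can be sketched in Python as a chain of checks in which each later check runs only if the earlier one passes. The three probe functions are passed in as parameters because this application does not specify how the monitoring system or the log console is queried. All names below are hypothetical.

```python
from enum import Enum
from typing import Callable


class CollectorEvaluation(Enum):
    """Hypothetical operational evaluation results for a log collector."""
    NOT_ALIVE = "collector is not alive"
    HEARTBEAT_ABNORMAL = "collector heartbeat is abnormal"
    RULES_NOT_DISTRIBUTED = "log collection rules were not distributed normally"
    NORMAL = "collector and rule distribution are normal"


def evaluate_collector(
    collector_number: str,
    is_alive: Callable[[str], bool],
    heartbeat_normal: Callable[[str], bool],
    rules_distributed: Callable[[str], bool],
) -> CollectorEvaluation:
    """Run the liveness, heartbeat, and rule distribution checks in order."""
    # Step 1009: collector status, looked up by collector number.
    if not is_alive(collector_number):
        return CollectorEvaluation.NOT_ALIVE
    # Step 1010: heartbeat data, consulted only for a live collector.
    if not heartbeat_normal(collector_number):
        return CollectorEvaluation.HEARTBEAT_ABNORMAL
    # Step 1011: rule distribution status, consulted only when the
    # heartbeat indicates normal operation.
    if not rules_distributed(collector_number):
        return CollectorEvaluation.RULES_NOT_DISTRIBUTED
    # Step 1012: every check passed.
    return CollectorEvaluation.NORMAL
```

The returned value identifies the stage at which the collector failed, which is the information the server uses to determine the scope of the log fault.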

[0128] After executing step 1008 or step 1012, execute step 1013 to obtain the service component monitoring information corresponding to the service component, the container component monitoring information corresponding to the container component, and the host node monitoring information of the host node with the container component deployed.

[0129] Step 1014: Determine the target business monitoring information that matches the scope of log failures from the service component monitoring information, container component monitoring information, and host node monitoring information.

[0130] Step 1015: Determine the location of the log fault based on the scope of the log fault and the target business monitoring information, and perform troubleshooting based on the location of the log fault.
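As a non-limiting illustration, steps 1013 to 1015 can be sketched in Python as a filter over monitoring records followed by a selection of the unhealthy ones. The record layout, the component names, and the use of a boolean `healthy` flag are assumptions made for the example. This application does not define how monitoring information is structured or how a match with the fault scope is computed.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class MonitorRecord:
    """Hypothetical monitoring record for one component in one dimension."""
    dimension: str   # "service", "container", or "host"
    component: str   # component the record belongs to
    healthy: bool    # simplified monitoring verdict (assumption)


def select_target_monitoring(
    fault_scope: Set[str], records: List[MonitorRecord]
) -> List[MonitorRecord]:
    """Step 1014: keep only records for components inside the fault scope."""
    return [r for r in records if r.component in fault_scope]


def locate_fault(
    fault_scope: Set[str], records: List[MonitorRecord]
) -> List[Tuple[str, str]]:
    """Step 1015: report (component, dimension) pairs that look abnormal."""
    targets = select_target_monitoring(fault_scope, records)
    return [(r.component, r.dimension) for r in targets if not r.healthy]
```

Restricting the monitoring data to the fault scope first means an abnormal record for a component outside the scope does not affect the result.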

[0131] In one exemplary embodiment, Figure 11 shows a traditional process for troubleshooting log collection. Referring to Figure 11, the traditional process is: user log collection failure → querying the log database → excluding intermediate modules → querying the transmission network → querying logs or simple monitoring → determining the scope of the problem → reviewing problem logs → identifying the cause of the problem. The traditional process is lengthy because it requires investigating numerous intermediate modules, including log gateways, Kafka message middleware, ETL data cleaning and transformation modules, log storage, and multiple collection endpoints at different times. Checking these sequentially is time-consuming and may require reviewing component logs, which are often limited and incomplete. Relying solely on simple monitoring can easily overlook the true cause, ultimately leading to inefficient troubleshooting.

[0132] Furthermore, Figure 12 shows the overall processing procedure of the fault handling method based on multi-source log data. Referring to Figure 12, the process includes: user log collection failure → querying the log database → querying the log gateway → querying the ETL data cleaning and transformation module → determining the scope of the problem → querying multi-dimensional monitoring → determining the cause of the problem. In other words, compared to traditional log collection troubleshooting methods, the fault handling method based on multi-source log data in this embodiment replaces the original component-by-component check with a three-step check: only the log database, the log gateway, and the data cleaning and transformation module need to be queried to determine the scope of the problem. Furthermore, the multi-dimensional monitoring covers the host, container, and service components, which allows the system to be monitored effectively and the problem to be diagnosed more comprehensively and accurately.

[0133] In the aforementioned fault handling method based on multi-source log data, a log fault handling request is received, and the target log identifier and log collection method corresponding to the request are determined. Based on the target log identifier, at least one of the log database, log gateway, and data integration processing components is queried to obtain log query results. The scope of the log fault is then determined based on these results. This reduces the number of components that need to be viewed and investigated during the query process, shortens the log fault troubleshooting process, and improves the efficiency of fault location and troubleshooting. Furthermore, if the log collection method is determined to be real-time collection, business monitoring information is obtained. Based on the log fault scope and the business monitoring information, the location of the log fault is determined, and fault troubleshooting is performed based on this location. This allows for further integration of different dimensions of business monitoring information and the determined log fault scope to accurately pinpoint the log fault location, preventing any omissions and enabling timely troubleshooting to ensure the stable availability of the business system.

[0134] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0135] Based on the same inventive concept, this application also provides a fault handling device based on multi-source log data for implementing the fault handling method based on multi-source log data described above. The solution provided by this device is similar to the solution described in the above method. Therefore, the specific limitations in one or more embodiments of the fault handling device based on multi-source log data provided below can be found in the limitations of the fault handling method based on multi-source log data described above, and will not be repeated here.

[0136] In one exemplary embodiment, as shown in Figure 13, a fault handling device based on multi-source log data is provided, including: a log fault handling request receiving module 1302, a log fault range determination module 1304, and a log fault location determination module 1306, wherein:

[0137] The log fault handling request receiving module 1302 is used to receive log fault handling requests, determine the target log identifier and log collection method corresponding to the log fault handling request; the log fault range determination module 1304 is used to query at least one of the log database, log gateway, and data integration processing component according to the target log identifier, obtain log query results, and determine the log fault range according to the log query results; the log fault location determination module 1306 is used to obtain business monitoring information if the log collection method is determined to be real-time collection, determine the log fault location according to the log fault range and business monitoring information, and perform fault troubleshooting based on the log fault location.

[0138] In the aforementioned fault handling device based on multi-source log data, at least one of the log database, log gateway, and data integration processing components is queried according to the target log identifier to obtain log query results. These results are then used to determine the scope of the log fault. This reduces the number of components that need to be viewed and investigated during the query process, shortens the link length for log fault troubleshooting, and improves the efficiency of fault location and troubleshooting. Furthermore, after determining the location of the log fault based on the log fault scope and business monitoring information, fault troubleshooting is performed based on the log fault location. This allows for further integration of business monitoring information from different dimensions with the determined log fault scope, accurately pinpointing the log fault location, avoiding omissions of faults, and enabling timely troubleshooting to ensure the stable availability of the business system.

[0139] In an exemplary embodiment, the log fault range determination module is further configured to: query the log database according to the target log identifier; if it is determined that there is no target log data matching the target log identifier in the log database, then query the log gateway according to the target log identifier to determine the first forwarding status of the log gateway for the target log data; if the first forwarding status is that the log gateway has not successfully forwarded the target log data, then determine the first receiving status of the log gateway for the target log data, and determine the log query result according to the first receiving status.

[0140] In an exemplary embodiment, the log fault range determination module is further configured to: if the first forwarding state is that the log gateway has successfully forwarded the target log data, then query the data integration processing component according to the target log identifier to determine the second forwarding state of the data integration processing component for the target log data; if the second forwarding state is that the data integration processing component has not successfully forwarded the target log data, then determine the second receiving state of the data integration processing component for the target log data, and determine the log query result according to the second receiving state.

[0141] In an exemplary embodiment, the log fault range determination module is further configured to: if it is determined that there is target log data in the log database that matches the target log identifier, then perform a log display query on the log console according to the target log identifier to obtain the log query result.
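As a non-limiting illustration, the query order that the log fault range determination module follows in the three embodiments above can be sketched in Python. Each query is passed in as a function because this application does not specify the query interfaces. All names are hypothetical. The final branch, where both hops report a successful forward but the log database still has no data, is not addressed in these embodiments, so the sketch marks it as unresolved as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Probe = Callable[[str], bool]  # hypothetical: takes a target log identifier


@dataclass(frozen=True)
class LogQueryResult:
    stopped_at: str                  # where the query chain stopped
    received: Optional[bool] = None  # receiving status at that hop, if any


def run_query_chain(
    log_id: str,
    in_database: Probe,
    gateway_forwarded: Probe,
    gateway_received: Probe,
    integration_forwarded: Probe,
    integration_received: Probe,
) -> LogQueryResult:
    """Query the database, then the gateway, then the integration component."""
    if in_database(log_id):
        # Data reached storage: continue with a display query on the console.
        return LogQueryResult("log_console")
    if not gateway_forwarded(log_id):
        # First forwarding status failed: use the first receiving status.
        return LogQueryResult("log_gateway", gateway_received(log_id))
    if not integration_forwarded(log_id):
        # Second forwarding status failed: use the second receiving status.
        return LogQueryResult(
            "data_integration_component", integration_received(log_id)
        )
    # Not covered by the embodiments above (assumption).
    return LogQueryResult("unresolved")
```

Because each branch returns as soon as a failed hop is found, later components are never queried once the result is known, which corresponds to the reduced number of components to be checked.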

[0142] In an exemplary embodiment, the log fault range determination module is further configured to: obtain service component monitoring information corresponding to the service component, container component monitoring information corresponding to the container component, and host node monitoring information of the host node on which the container component is deployed; determine the target business monitoring information that matches the log fault range from the service component monitoring information, container component monitoring information, and host node monitoring information; and determine the log fault location based on the log fault range and the target business monitoring information.

[0143] In an exemplary embodiment, a fault handling device based on multi-source log data is provided, including a log collector operation evaluation module, configured to: if the log collection method is determined to be log collector collection, obtain the collector status corresponding to the log collector; if the collector status is alive, determine the collector heartbeat data corresponding to the log collector; if the log collector is determined to be in normal operation based on the collector heartbeat data, determine the log collection rule issuance status; and determine the log collector operation evaluation result based on the log collection rule issuance status, so as to determine the log fault range based on the log collector operation evaluation result.

[0144] Each module in the aforementioned fault handling device based on multi-source log data can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.

[0145] In one exemplary embodiment, a computer device is provided, which can be a server or a terminal. Taking the computer device as a server as an example, its internal structure may be as shown in Figure 14. The computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores data such as target log identifiers, log collection methods, log databases, log gateways, log fault ranges, data integration and processing components, log query results, real-time acquisition, business monitoring information, and log fault locations. The I/O interfaces are used for information exchange between the processor and external devices. The communication interface is used for communication with external terminals via a network connection. When executed by the processor, the computer program implements a fault handling method based on multi-source log data.

[0146] Those skilled in the art will understand that the structure shown in Figure 14 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0147] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0148] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0149] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0150] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0151] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0152] The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as the combination of these technical features does not contradict each other, it should be considered within the scope of this application. The above embodiments only illustrate several implementation methods of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of this application. It should be noted that for those skilled in the art, several modifications and improvements can be made without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A fault handling method based on multi-source log data, characterized in that the method includes: receiving a log fault handling request and determining the target log identifier and log collection method corresponding to the log fault handling request; based on the target log identifier, performing the following operations to obtain the log query result: querying the log database; if no target log data matching the target log identifier exists, querying the log gateway to determine the first forwarding status of the target log data; if the first forwarding status indicates that the log gateway has not successfully forwarded the target log data, determining the first receiving status of the target log data, and determining the log query result based on the first receiving status; if the first forwarding status indicates that the log gateway has successfully forwarded the target log data, querying the data integration processing component based on the target log identifier to determine the second forwarding status of the target log data; if the second forwarding status indicates that the data integration processing component has not successfully forwarded the target log data, determining the second receiving status of the target log data, and determining the log query result based on the second receiving status; determining the log fault range based on the log query result, wherein the log fault range includes the log console query environment, the log query process, and the collection link process; if it is determined that the log collection method is real-time collection, obtaining business monitoring information, wherein the business monitoring information includes service component monitoring information corresponding to the service component, container component monitoring information corresponding to the container component, and host node monitoring information of the host node where the container component is deployed; determining, from the service component monitoring information, the container component monitoring information, and the host node monitoring information, target business monitoring information that matches the log fault range; and determining the log fault location based on the log fault range and the business monitoring information, and performing fault troubleshooting based on the log fault location.

2. The method according to claim 1, characterized in that the method further includes: if it is determined that there is target log data in the log database that matches the target log identifier, performing a log display query on the log console based on the target log identifier to obtain the log query result.

3. The method according to claim 1 or 2, characterized in that the method further includes: if it is determined that the log collection method is log collector collection, obtaining the collector status corresponding to the log collector; if the collector status is alive, determining the collector heartbeat data corresponding to the log collector; if the log collector is determined to be in normal operation based on the collector heartbeat data, determining the log collection rule distribution status; and determining the operational evaluation result of the log collector based on the log collection rule distribution status, and determining the log fault range based on the operational evaluation result of the log collector.

4. The method according to claim 1, characterized in that the data collection process includes: the log gateway receiving or forwarding log data, the data integration and processing component receiving or forwarding log data, and the log data delivery or consumption process in the Kafka middleware.

5. The method according to claim 1, characterized in that the log fault locations include: container anomalies, host node anomalies, and service component anomalies in the log console query stage, log query stage, or data collection link stage.

6. A fault handling device based on multi-source log data, characterized in that the device includes: a log fault handling request receiving module, used to receive log fault handling requests and determine the target log identifier and log collection method corresponding to the log fault handling request; a log fault range determination module, used to perform the following operations based on the target log identifier to obtain log query results: query the log database; if no target log data matching the target log identifier exists, query the log gateway to determine the first forwarding status of the target log data; if the first forwarding status is that the log gateway has not successfully forwarded the target log data, determine the first receiving status of the target log data and determine the log query result based on the first receiving status; if the first forwarding status is that the log gateway has successfully forwarded the target log data, query the data integration and processing component based on the target log identifier to determine the second forwarding status of the target log data; if the second forwarding status is that the data integration and processing component has not successfully forwarded the target log data, determine the second receiving status of the target log data and determine the log query result based on the second receiving status; and determine the log fault range based on the log query results, wherein the log fault range includes the log console query environment, the log query stage, and the collection link stage; and a log fault location determination module, used to obtain business monitoring information if the log collection method is determined to be real-time collection, wherein the business monitoring information includes service component monitoring information corresponding to the service component, container component monitoring information corresponding to the container component, and host node monitoring information of the host node where the container component is deployed; determine, from the service component monitoring information, the container component monitoring information, and the host node monitoring information, the target business monitoring information that matches the log fault range; and determine the log fault location based on the log fault range and the business monitoring information, and perform fault troubleshooting based on the log fault location.

7. The device according to claim 6, characterized in that the log fault range determination module is further configured to: if it is determined that there is target log data in the log database that matches the target log identifier, perform a log display query on the log console based on the target log identifier to obtain the log query result.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 5.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.