Problem positioning method and device, equipment and storage medium
By parsing natural language input and combining it with knowledge base and log data, the system automatically locates application anomalies, solving the problem of high reliance on manual intervention in existing technologies and achieving efficient and accurate troubleshooting.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING DUYOU INFORMATION TECH CO LTD
- Filing Date
- 2026-04-09
- Publication Date
- 2026-06-12
Figure CN122197900A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to problem localization technology and artificial intelligence technology, specifically to a problem localization method, apparatus, device and storage medium.
Background Technology
[0002] In application operation and management scenarios, log retrieval and anomaly localization are crucial components of daily operations and maintenance. Related technologies typically require analysis combining static knowledge documents with dynamic system log data. This process largely relies on manual operation, with operations personnel independently retrieving relevant information from knowledge bases and log platforms, and then manually comparing and correlating the data.
Summary of the Invention
[0003] This disclosure provides a problem localization method, apparatus, device, and storage medium.
[0004] According to one aspect of this disclosure, a problem localization method is provided, comprising: obtaining natural language input that describes an abnormal problem occurring during application runtime; parsing the natural language input to obtain a parsing result that contains problem semantic information corresponding to the abnormal problem; retrieving, based on the parsing result, operation and maintenance knowledge fragments corresponding to the problem semantic information from a pre-built knowledge base; obtaining target log data based on the parsing result and the retrieved operation and maintenance knowledge fragments; and generating a problem localization result based on the retrieved operation and maintenance knowledge fragments and the target log data.
[0005] According to another aspect of this disclosure, a problem localization apparatus is provided, comprising: an input acquisition unit configured to acquire natural language input that describes an abnormal problem occurring during application runtime; a parsing unit configured to parse the natural language input to obtain a parsing result that contains problem semantic information corresponding to the abnormal problem; a retrieval unit configured to retrieve, based on the parsing result, operation and maintenance knowledge fragments corresponding to the problem semantic information from a pre-built knowledge base; a log acquisition unit configured to acquire target log data based on the parsing result and the retrieved operation and maintenance knowledge fragments; and a problem localization unit configured to generate a problem localization result based on the retrieved operation and maintenance knowledge fragments and the target log data.
[0006] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described in the embodiments of this disclosure.
[0007] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are configured to cause the computer to perform the methods described in embodiments of this disclosure.
[0008] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the methods described in the embodiments of this disclosure.
[0009] This disclosure obtains and parses natural language input describing application runtime anomalies to obtain semantic information about the problem. Based on this parsing result, it retrieves corresponding operational knowledge fragments, combines the parsing result with the operational knowledge fragments to accurately obtain target log data, and then combines the operational knowledge fragments with the target log data to generate a problem localization result. This solution significantly reduces the reliance on expert experience for problem troubleshooting, avoids the tedious manual association of knowledge and logs, significantly improves the efficiency and accuracy of problem localization, and lowers the barrier to entry for non-professional users.
[0010] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0011] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure.
[0012] Figure 1 is a flowchart of the problem localization method provided in this disclosure.
[0013] Figure 2 is a schematic diagram of one implementation of the problem localization method provided in this disclosure.
[0014] Figure 3 is a schematic block diagram of the problem localization apparatus provided in this disclosure.
[0015] Figure 4 is a block diagram of an electronic device used to implement the problem localization method of the embodiments of this disclosure.
Detailed Implementation
[0016] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0017] To quickly pinpoint the cause of application runtime anomalies, the common approach is for operators to leverage their experience to first translate user-reported natural language problem descriptions (e.g., "user cannot receive SMS verification codes") into a series of possible keywords or technical terms (e.g., "SMS service," "sending failure," "timeout"). These keywords are then input into a separate log query platform to search through massive amounts of system logs. Simultaneously, operators may need to consult another static operations and maintenance knowledge base or standard operating procedure (SOP) documents to recall or find historical troubleshooting steps and possible causes for such problems.
[0018] However, this approach performs poorly in high-frequency, highly variable troubleshooting scenarios. The fundamental problem is that it separates static knowledge from dynamic logs, which lowers troubleshooting efficiency and creates heavy reliance on human experience. For example, when a payment failure involves multiple microservice modules and the logs are scattered across different servers in different formats, troubleshooting becomes lengthy and error-prone, significantly increasing the average fault recovery time. Operators must constantly translate and guess between business semantics ("payment failure") and technical log characteristics (various error codes, timeout records), and repeatedly switch between fragmented logging tools and knowledge documents. The entire process depends heavily on the operator's personal experience and condition at the time.
[0019] In view of this, the present disclosure provides a problem localization method. This problem localization method can be executed by an electronic device with data processing capabilities, wherein the electronic device includes, but is not limited to, mobile phones, tablets, personal computers, servers, smart terminals, or dedicated computing devices.
[0020] Figure 1 shows a flowchart of the problem localization method provided in the embodiments of this disclosure. As shown in Figure 1, the method may include the following steps: Step 101: Obtain natural language input, which is used to describe the abnormal problems encountered during application runtime.
[0021] Step 102: Parse the natural language input to obtain the parsing result, which contains the semantic information of the problem corresponding to the abnormal problem.
[0022] Step 103: Based on the parsing results, retrieve the operation and maintenance knowledge fragments corresponding to the semantic information of the problem from the pre-built knowledge base.
[0023] Step 104: Based on the parsing results and the retrieved operation and maintenance knowledge fragments, obtain the target log data.
[0024] Step 105: Based on the retrieved operation and maintenance knowledge fragments and target log data, generate the problem location result.
[0025] As can be seen from the above process, this disclosure obtains natural language input describing application runtime anomalies, parses it to obtain parsing results containing semantic information about the problem, retrieves corresponding operational knowledge fragments based on these parsing results, accurately obtains target log data by combining the parsing results and operational knowledge fragments, and then generates problem localization results by combining the operational knowledge fragments and target log data. This solution significantly reduces the reliance on expert personal experience in problem troubleshooting, avoids the tedious manual association of knowledge and logs, significantly improves the efficiency and accuracy of problem localization, and lowers the barrier to entry for non-professional users.
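As a non-limiting illustration, the flow of steps 101 to 105 can be sketched as a chain of four pluggable stages. The function name `localize_problem`, the `LocalizationResult` type, and the four callables are hypothetical names introduced here for illustration only; the disclosure does not prescribe any particular interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LocalizationResult:
    """Bundle of what steps 103 to 105 produce (illustrative structure)."""
    fragments: List[dict]
    logs: List[str]
    summary: str


def localize_problem(
    text: str,
    parse: Callable[[str], Dict],
    retrieve: Callable[[Dict], List[dict]],
    fetch_logs: Callable[[Dict, List[dict]], List[str]],
    conclude: Callable[[List[dict], List[str]], str],
) -> LocalizationResult:
    # Step 101 is the caller handing over the natural language input `text`.
    parsed = parse(text)                     # step 102: parsing result
    fragments = retrieve(parsed)             # step 103: knowledge fragments
    logs = fetch_logs(parsed, fragments)     # step 104: target log data
    summary = conclude(fragments, logs)      # step 105: localization result
    return LocalizationResult(fragments, logs, summary)
```

Each stage can then be replaced independently, for example by swapping a rule-based parser for a model-based one, without changing the overall flow.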
[0026] The following describes in detail each step of the above process and the effects that can be further produced, with reference to the embodiments.
[0027] First, the above step 101, namely "obtaining natural language input, which is used to describe abnormal problems during application runtime", will be described in detail with reference to the embodiments.
[0028] In the embodiments of this disclosure, natural language input is content in which the user describes functional abnormalities, interaction failures, business errors and other phenomena that occur while the application is running, using everyday business language, unstructured spoken language or written sentences. This input does not need to follow professional search syntax or technical specifications, so non-professionals can describe the problem freely in their own words. It only needs to clearly reflect the abnormal scenario and the way the fault manifests during application runtime.
[0029] Natural language input is not limited to plain text: it supports both text input directly by the user and voice input by the user; the system can automatically convert the collected voice input into standard text through the speech recognition module, and use it as the raw input for subsequent processing.
[0030] As a specific implementation method, the system is configured with a visual user interface. This interface can be a chatbot dialogue window, an online problem submission form, or a built-in interactive component of the operation and maintenance platform. The interface provides both text input and voice input entry points, and is open to end users, front-line customer service personnel, or operation and maintenance R&D personnel. Operators can enter a description of the anomaly through the text editing box, or click the voice button to verbally describe the fault to complete the entry.
[0031] For example, the user can input text: "When trying to reset the password, the user has not received the email verification code, and the page does not show a success message after clicking the send button"; or the user can directly speak the same content, and the system will collect the speech and automatically transcribe it into the corresponding text.
[0032] The system monitors text submission events and voice capture trigger commands on the interactive interface in real time, and obtains the raw input after the input is completed. If it is in the form of voice, it is first converted into structured text through speech recognition, and the complete semantics of the user are preserved before being sent to the subsequent parsing process.
[0033] The following describes step 102, namely "parse the natural language input to obtain the parsing result, which contains the semantic information of the problem corresponding to the abnormal problem", in detail with reference to the embodiments.
[0034] In the embodiments of this disclosure, problem semantic information refers to an abstract representation that accurately characterizes the core semantics of the abnormal problem. Its core function is to strip away the ambiguity and business-specific phrasing of the natural language description, extract the key information needed for subsequent knowledge retrieval and log matching, and thereby convert "business language" into "technical analysis language".
[0035] The semantic information of the problem can be presented in any form, including structured classification labels, key-value pairs, or unstructured representations such as dense semantic vectors, as long as it can fully convey the core intent and key features of the problem.
[0036] In addition to semantic information about the problem, the parsing results may also include at least one of the following: the functional module identifier corresponding to the abnormal problem, the user identifier, and the time range in which the problem occurred.
[0037] The functional module identifier is a unique identifier used to distinguish each business function unit within the application. Examples include the names or codes of service modules such as user login, password reset, and payment settlement, used to locate the business function scope to which an anomaly belongs. This functional module identifier can be actively selected by the user through input in the interactive interface, or it can be automatically determined by the system based on the parsed semantic information of the problem.
[0038] User identifiers are identity information that can distinguish the object of an anomaly. They can be user accounts, user types, user terminal device identifiers, or user group tags. They can be extracted directly from natural language content or supplemented by combining the conversation context.
[0039] The time range of the problem occurrence refers to the time interval corresponding to the occurrence, continuation or recurrence of the abnormal phenomenon, including the start time, end time and duration of the abnormality. If no time range is stated in the natural language input, the system can fall back to a default lookback interval measured back from the current time.
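The fallback for a missing time range can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `resolve_time_range` is hypothetical, and the one-hour default is an assumed value chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Assumed default lookback window; the disclosure does not fix a value.
DEFAULT_LOOKBACK = timedelta(hours=1)


def resolve_time_range(
    start: Optional[datetime],
    end: Optional[datetime],
    now: datetime,
    lookback: timedelta = DEFAULT_LOOKBACK,
) -> Tuple[datetime, datetime]:
    """Fill in whichever bound the user did not state."""
    if end is None:
        end = now                  # an open-ended problem runs up to now
    if start is None:
        start = end - lookback     # look back a default interval from the end
    return start, end
```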
[0040] As one specific implementation, embodiments of this disclosure can employ a pre-trained large language model to perform the parsing task. For example, the natural language input obtained in step 101 is supplied to the large language model together with a preset prompt. The prompt guides the model to extract the parsing result from the natural language input according to the technical analysis requirements.
[0041] For example, for the natural language input "Users are unable to receive email verification codes when trying to reset their passwords", the model may output a structured parsing result, such as: {"Problem semantic information": "Verification code sending failed", "Function module identifier": "User Center - Password Reset Service", "User identifier": XXX, "Problem occurrence time range": "Last 1 hour (default)"}.
[0042] This approach leverages the powerful language understanding capabilities of large language models to accurately extract the core elements of technical concern from vague descriptions.
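A prompt-driven parser of this kind might look like the sketch below. The model is represented by a hypothetical callable `llm` that takes a prompt string and returns the model's text reply; the prompt wording and the JSON key names are likewise assumptions made for illustration, not the prompt used by the disclosure.

```python
import json
from typing import Callable, Dict

# Hypothetical prompt; field names are illustrative.
PROMPT_TEMPLATE = (
    "Extract the following fields from the problem description and answer "
    "with a single JSON object: problem_semantics, functional_module, "
    "user_identifier, time_range. Use null for anything not stated.\n\n"
    "Problem description: {text}"
)


def parse_input(text: str, llm: Callable[[str], str]) -> Dict:
    """Ask the model for a structured parsing result and validate it."""
    raw = llm(PROMPT_TEMPLATE.format(text=text))
    parsed = json.loads(raw)  # raises ValueError on malformed JSON
    # Problem semantic information is the one mandatory part of the result.
    if not isinstance(parsed, dict) or not parsed.get("problem_semantics"):
        raise ValueError("model output lacks problem semantic information")
    return parsed
```

Validating the reply before passing it on keeps a malformed model output from silently propagating into retrieval and log filtering.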
[0043] More generally, parsing natural language input can be achieved in various ways. For example, it can include, but is not limited to: employing a dedicated natural language processing pipeline to sequentially perform word segmentation, part-of-speech tagging, named entity recognition, and intent classification; or using a vectorization model to convert the entire input text into a semantic vector, which itself serves as a dense representation of the problem semantic information; or combining rule templates with keyword matching for parsing. All these approaches can extract or generate problem semantic information from natural language input for subsequent retrieval and analysis. This disclosure does not specifically limit the parsing method.
[0044] The following describes in detail step 103, namely, "based on the parsing results, retrieve the operation and maintenance knowledge fragments corresponding to the semantic information of the problem from the pre-built knowledge base", with reference to the embodiments.
[0045] In embodiments of this disclosure, the pre-built knowledge base can be a vector database, pre-built in the following manner: First, obtain unstructured original operation and maintenance troubleshooting documents. These documents are usually SOPs, fault review reports, etc., in formats such as Word, PDF, Markdown, or Confluence pages.
[0046] Secondly, based on predefined rules, the original operation and maintenance troubleshooting documents are segmented to obtain multiple structured operation and maintenance knowledge fragments. The predefined rules here define how to identify the boundaries and internal structure of a complete knowledge fragment. For example, a fragment is defined as starting with the title "Problem Phenomenon" and ending with the title "Solution," and the sub-parts such as "Problem Location Method" and "Reference Log" are identified in between.
[0047] Finally, the operation and maintenance knowledge fragments are vectorized and stored to construct a knowledge base. Vectorization enables semantic similarity retrieval of operation and maintenance knowledge fragments.
[0048] This method segments unstructured documents using predefined rules, cutting and organizing messy, inconsistently formatted original documents into structured knowledge fragments containing specific fields, thus achieving knowledge standardization. Furthermore, by vectorizing and storing these structured fragments, a knowledge base supporting semantic retrieval is constructed, enabling the rapid and accurate retrieval of relevant knowledge based on the semantics of natural language questions.
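The rule-based segmentation step can be illustrated with a small sketch. It assumes, for simplicity, that each section title sits alone on its own line and that the titles are exactly the ones named in the example rule; the function name `segment_document` is hypothetical. Vectorization and storage are not shown.

```python
from typing import Dict, List

# Section titles taken from the example rule; a fragment starts at the first one.
SECTION_TITLES = (
    "Problem Phenomenon",
    "Problem Location Method",
    "Reference Log",
    "Solution",
)
FIRST_TITLE = SECTION_TITLES[0]


def segment_document(text: str) -> List[Dict[str, str]]:
    """Split a troubleshooting document into structured knowledge fragments."""
    fragments: List[Dict[str, str]] = []
    current: Dict[str, List[str]] = {}
    section = None
    for line in text.splitlines():
        title = line.strip()
        if title in SECTION_TITLES:
            # A new "Problem Phenomenon" title closes the previous fragment.
            if title == FIRST_TITLE and current:
                fragments.append({k: "\n".join(v).strip() for k, v in current.items()})
                current = {}
            section = title
            current[section] = []
        elif section is not None:
            current[section].append(line)
    if current:
        fragments.append({k: "\n".join(v).strip() for k, v in current.items()})
    return fragments
```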
[0049] To segment unstructured operation and maintenance documents automatically and adapt to different document formats, a natural language processing model can be driven to perform the segmentation based on the predefined rules, thereby obtaining multiple structured operation and maintenance knowledge fragments.
[0050] As a specific implementation method, a prompt can be designed to drive the large language model. For example, the original document content and the predefined rules (such as "Please extract and segment the following document content according to the structure of 'problem description,' 'location steps,' 'root cause,' and 'solution,' and output each complete problem and its corresponding content as an independent fragment in JSON format") are input into the large language model. The model uses its language understanding and content parsing capabilities to identify the different parts of the document, extract them, and format them into structured JSON fragments.
[0051] This approach reduces the strict dependence on the original document format. Even if the wording of the document titles is slightly different, the model can understand its semantics and make correct classifications, thereby improving the automation and adaptability of knowledge base construction.
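A model-driven segmenter along these lines could be sketched as follows. The callable `llm` stands in for whichever model is used and is a hypothetical interface; the sketch also assumes the model replies with a JSON array of objects (or a single object), which is an assumption about the output format rather than something the disclosure specifies.

```python
import json
from typing import Callable, Dict, List

# The rule text follows the example above; the trailing format hint is an assumption.
SEGMENT_RULE = (
    "Please extract and segment the following document content according to "
    "the structure of 'problem description', 'location steps', 'root cause', "
    "and 'solution', and output each complete problem and its corresponding "
    "content as an independent fragment in JSON format (a JSON array of objects)."
)


def segment_with_llm(document: str, llm: Callable[[str], str]) -> List[Dict]:
    """Send rule plus document to the model and keep only well-formed fragments."""
    raw = llm(f"{SEGMENT_RULE}\n\n{document}")
    fragments = json.loads(raw)
    if isinstance(fragments, dict):      # tolerate a single-object reply
        fragments = [fragments]
    if not isinstance(fragments, list):
        return []
    return [f for f in fragments if isinstance(f, dict)]
```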
[0052] Those skilled in the art will understand that the natural language processing model used is not limited to a large language model, but may also be a finely tuned sequence labeling model or text classification model, etc.
[0053] When retrieving operation and maintenance knowledge fragments from a knowledge base, one specific implementation method is to perform vector transformation on the parsed results (or their core semantic information) to obtain a retrieval vector. For example, the semantic description of "user login failed" can be converted into a 768-dimensional floating-point vector using an embedding model. Then, the retrieval vector is matched with the vectorized representations of each operation and maintenance knowledge fragment in the vector database, and the cosine similarity score between each pair of vectors is calculated. Finally, based on the similarity matching results, a predetermined number of operation and maintenance knowledge fragments (e.g., the top 3 with the highest similarity) are selected as the corresponding operation and maintenance knowledge fragments.
[0054] This method transforms both the parsing results representing the current problem and knowledge fragments in the knowledge base into vector representations, and then performs similarity calculations in the vector space, achieving deep semantic-based matching retrieval. This overcomes the failure of traditional keyword matching in cases of lexical differences and synonym substitutions, enabling it to more accurately find historical fault knowledge and solutions that are semantically most relevant to the current problem.
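The similarity matching described above can be sketched in plain Python. The embedding step is omitted: the sketch assumes the retrieval vector and the fragment vectors already exist, and the names `cosine_similarity` and `top_k_fragments` are hypothetical. A production system would typically delegate this search to the vector database itself.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_fragments(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored fragment against the query and keep the k best."""
    scored = [(frag_id, cosine_similarity(query, vec)) for frag_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```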
[0055] It should also be noted that retrieving corresponding operation and maintenance knowledge fragments from the knowledge base can be achieved in various ways. For example, this can include, but is not limited to: if the knowledge base is a relational database, the parsed intent, functional modules, etc., can be used as SQL query conditions for retrieval; if the knowledge base is a graph database, the parsed entities can be used to traverse related nodes in the graph structure to find relevant knowledge. All these solutions can achieve the function of finding relevant knowledge fragments based on the semantic information of the problem.
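For the relational database variant, a minimal sketch using Python's standard `sqlite3` module is given below. The table name `knowledge`, its columns, and the sample rows are all hypothetical and serve only to show parsed fields being used as SQL query conditions.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory knowledge table with illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE knowledge (intent TEXT, module TEXT, location TEXT, solution TEXT)"
    )
    conn.executemany(
        "INSERT INTO knowledge VALUES (?, ?, ?, ?)",
        [
            ("Login Failed", "Authentication Service",
             "Search for 'AuthError'", "Check token expiry"),
            ("Upload Failed", "UploadService",
             "Search for 'FileSizeLimitExceededException'", "Raise the size limit"),
        ],
    )
    return conn


def find_fragments(conn: sqlite3.Connection, intent: str, module: str) -> List[Tuple[str, str]]:
    """Use the parsed intent and functional module as query conditions."""
    cur = conn.execute(
        "SELECT location, solution FROM knowledge WHERE intent = ? AND module = ?",
        (intent, module),  # parameter binding avoids SQL injection
    )
    return cur.fetchall()
```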
[0056] The following describes step 104, namely "obtaining target log data based on the parsing results and retrieved operation and maintenance knowledge fragments", in detail with reference to the embodiments.
[0057] In the embodiments of this disclosure, the retrieved operation and maintenance knowledge fragments include at least one problem location field. This field is used to solidify and carry the log retrieval rules and targeted investigation basis accumulated by operation and maintenance experts, so as to guide the system to accurately locate the log features that match the current abnormal problem.
[0058] After obtaining the parsing results and operation and maintenance knowledge fragments, the target log data can be further obtained from the full system logs.
[0059] Specifically, first, based on the semantic information of the problem in the parsing results, the log retrieval criteria are determined. Second, the content of the problem location field is extracted from the retrieved operation and maintenance knowledge fragments. This approach concretizes the expert experience (i.e., "how to locate") in static knowledge and applies it to this investigation. Finally, based on the log retrieval criteria and the content of the problem location field, target log data is filtered from the log data.
[0060] This approach combines broad semantic information about the problem with specific technical troubleshooting methods, enabling precise guidance from "what is the problem" to "how to find the logs." This allows for more refined and targeted selection of log entries most relevant to specific failure modes from a massive log dataset, effectively narrowing the scope of analysis.
[0061] Those skilled in the art will understand that the content of the problem location field can take many forms. For example, it may include, but is not limited to: indicating the specific API interface name or error code to be checked; a regular expression describing the log patterns to be monitored; a rule listing the interfering log types to be excluded; or, pointing to a specific log analysis script or command to be executed. For example, for the problem of "user file upload failure," a problem location field in an operations and maintenance knowledge fragment might be "Search application logs for entries containing 'UploadService' and 'FileSizeLimitExceededException'." After obtaining this content, the system will convert it into specific log filtering conditions, and submit it to the log system for execution along with conditions such as module and time determined based on the parsing results.
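One way to turn the content of a problem location field into concrete filtering conditions is sketched below. The `LocationField` structure, with required keywords, an optional regular expression, and exclusion terms, is a hypothetical representation chosen for illustration; the disclosure leaves the form of the field open.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class LocationField:
    keywords: List[str] = field(default_factory=list)  # every keyword must appear
    pattern: Optional[str] = None                       # optional regex that must match
    exclude: List[str] = field(default_factory=list)    # any of these drops the line


def filter_logs(lines: Iterable[str], loc: LocationField) -> List[str]:
    """Keep only the log lines that satisfy the location field."""
    regex = re.compile(loc.pattern) if loc.pattern else None
    selected = []
    for line in lines:
        if any(term in line for term in loc.exclude):
            continue  # interfering log type, skip
        if not all(word in line for word in loc.keywords):
            continue
        if regex is not None and not regex.search(line):
            continue
        selected.append(line)
    return selected
```

For the file upload example, `LocationField(keywords=["UploadService", "FileSizeLimitExceededException"])` would select only entries containing both strings.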
[0062] To further refine the log filtering criteria and better match them with historically known fault phenomena, thereby more efficiently identifying critical logs, this disclosure also provides another feasible solution. In this solution, the operational knowledge fragment also includes reference log information corresponding to the problem location field. The reference log information provides concrete, referential examples for the abstract descriptions in the location field.
[0063] At this point, the operation of "filtering target log data based on log retrieval conditions and the content of the problem location field" in step 104 is further optimized to: filtering target log data from the log data based on log retrieval conditions, the content of the problem location field, and the corresponding reference log information. This limitation aims to provide richer matching criteria. When filtering, the system not only relies on abstract location descriptions but also compares them with specific reference log patterns, thereby improving the accuracy and reliability of log matching.
[0064] Those skilled in the art will understand that the reference log information can be a complete log text, or a key feature pattern or vectorized representation extracted from it.
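Comparison against a reference log can be sketched with a simple text similarity measure. Here the standard library's `difflib.SequenceMatcher` stands in for whatever matching technique an implementation actually uses (pattern matching or vector similarity would be equally valid); the function name `rank_by_reference` and the 0.6 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


def rank_by_reference(
    lines: Iterable[str],
    reference: str,
    min_ratio: float = 0.6,  # assumed cut-off, tune per deployment
) -> List[Tuple[str, float]]:
    """Keep log lines that resemble the reference log, most similar first."""
    scored = [(line, SequenceMatcher(None, line, reference).ratio()) for line in lines]
    kept = [item for item in scored if item[1] >= min_ratio]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```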
[0065] To effectively narrow the search scope when determining log retrieval criteria based on natural language input and avoid inefficient searches from excessively large log sets, this disclosure further proposes that the parsed results, in addition to containing semantic information about the problem, may also include at least one of the following: the functional module identifier corresponding to the abnormal problem, the user identifier, and the time range in which the problem occurred. This information can typically be parsed directly or indirectly from the problem description.
[0066] Accordingly, the operation of "determining log retrieval conditions based on the problem semantic information in the parsing results" in step 104 can specifically include: extracting at least one of the functional module identifier, user identifier, and problem occurrence time range, and combining them with the problem semantic information as log retrieval conditions.
[0067] This method parses clear objective identifiers or ranges such as functional modules, users, and time from natural language input, and uses these specific constraints together with the semantic information of the problem as filtering conditions for log retrieval. It can significantly limit the log dataset to be searched from multiple dimensions such as time, space (module), and subject (user), thereby improving the speed and efficiency of log retrieval.
[0068] For example, the parsing result might be {"Intent": "Login Failed", "Functional Module": "Authentication Service", "User Identifier": "user_12345", "Time Range": "2023-10-27 14:00 to 14:30"}. Based on this, the system will generate very specific log retrieval conditions, potentially only querying logs generated by the authentication service for that user within that time period. This significantly reduces the amount of log data that needs to be processed, thereby improving the speed and efficiency of log retrieval and laying the foundation for subsequent accurate analysis. The functional module identifier can be a service name, subsystem number, code package path, etc. The user identifier can be a user ID, account, device ID, session ID, etc. The time range can be an absolute timestamp or a relative time (e.g., "last 5 minutes").
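The multi-dimensional narrowing described above can be sketched as a query object whose unset fields impose no constraint. The `LogRecord` and `LogQuery` types and the `select_logs` function are hypothetical; a real deployment would usually push these conditions down to the log platform's own query language instead of filtering in memory.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class LogRecord:
    timestamp: datetime
    module: str
    user_id: str
    message: str


@dataclass
class LogQuery:
    module: Optional[str] = None      # functional module identifier
    user_id: Optional[str] = None     # user identifier
    start: Optional[datetime] = None  # time range, inclusive
    end: Optional[datetime] = None

    def matches(self, rec: LogRecord) -> bool:
        """A field left as None does not restrict the search."""
        if self.module is not None and rec.module != self.module:
            return False
        if self.user_id is not None and rec.user_id != self.user_id:
            return False
        if self.start is not None and rec.timestamp < self.start:
            return False
        if self.end is not None and rec.timestamp > self.end:
            return False
        return True


def select_logs(records: Iterable[LogRecord], query: LogQuery) -> List[LogRecord]:
    return [rec for rec in records if query.matches(rec)]
```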
[0069] The following describes step 105, namely "generating problem location results based on retrieved operation and maintenance knowledge fragments and target log data", in detail with reference to the embodiments.
[0070] In the embodiments of this disclosure, after obtaining the operation and maintenance knowledge fragments and target log data, the two can be analyzed collaboratively to ultimately generate problem location results that can be used by users for reference.
[0071] As one possible approach, operations and maintenance knowledge fragments can also include problem cause fields and / or solution fields. By pre-setting these fields in structured operations and maintenance knowledge fragments, the retrieved knowledge fragments themselves carry root cause analysis and handling recommendations.
[0072] Those skilled in the art will understand that the Problem Cause field and Solution field can be plain text descriptions, links to more detailed documentation, or structured lists of steps.
[0073] Based on the above-mentioned operational knowledge fragments containing fields for problem causes and / or solutions, the analysis process for generating problem location results can include the following two stages.
[0074] The first stage involves analyzing the target log data based on the content of the problem location fields, filtering out abnormal log entries that match the content of the problem location fields. This stage focuses on finding conclusive evidence in the target log data. For example, using regular expressions or keyword lists from the problem location fields, the target logs are scanned and matched to filter out log lines that meet the fault characteristics and mark them as abnormal log entries.
[0075] The second stage involves analyzing the abnormal log entries based on the problem cause and / or solution fields to generate problem localization results, which include the root cause of the fault and / or the handling solution. This stage focuses on interpretation and conclusion generation.
[0076] The system combines the specific anomaly logs selected in the first stage with the abstract knowledge about causes and solutions from the knowledge fragments. For example, if the system finds a matching anomaly log "Database connection timeout after 30s," and the knowledge fragment states the cause as "network latency or database overload," and the solution as "optimize queries or expand capacity," the system can generate the following result: "Database connection timeout log found (evidence), inferred cause is network or database load issue (root cause analysis), recommended to optimize related queries or contact operations and maintenance to check database status (handling solution)."
[0077] In this approach, the first stage uses the location guidance in the operation and maintenance knowledge fragments to accurately locate abnormal records from the target logs, ensuring conclusive evidence. The second stage combines the causal and solution knowledge pre-stored in the operation and maintenance knowledge fragments to interpret and reason about the locked abnormal logs, generating conclusive outputs containing deep root causes and feasible solutions, thus realizing a complete deduction from log phenomena to business conclusions.
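The two stages described above can be illustrated with a minimal Python sketch. It is not part of the disclosed embodiments: the class name `KnowledgeFragment`, its attribute names, and the function names are assumptions chosen for illustration, and the second stage here simply pairs the evidence with the stored cause and solution, whereas an actual implementation might pass them to a language model for reasoning.

```python
import re
from dataclasses import dataclass


@dataclass
class KnowledgeFragment:
    # Hypothetical attribute names; the disclosure only names the field roles.
    location_patterns: list  # regexes or keywords from the problem location field
    cause: str               # content of the problem cause field
    solution: str            # content of the solution field


def find_abnormal_entries(log_lines, fragment):
    """Stage 1: keep the log lines that match any location pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in fragment.location_patterns]
    return [line for line in log_lines if any(c.search(line) for c in compiled)]


def build_result(abnormal_entries, fragment):
    """Stage 2: pair the evidence with the pre-stored cause and solution."""
    if not abnormal_entries:
        return None  # no evidence, so no conclusion is produced
    return {
        "evidence": abnormal_entries,
        "root_cause": fragment.cause,
        "solution": fragment.solution,
    }


# Example data taken from the database timeout illustration above.
fragment = KnowledgeFragment(
    location_patterns=[r"connection timeout"],
    cause="network latency or database overload",
    solution="optimize queries or expand capacity",
)
logs = [
    "INFO request served in 12ms",
    "ERROR Database connection timeout after 30s",
]
abnormal = find_abnormal_entries(logs, fragment)
result = build_result(abnormal, fragment)
```

Keeping the two stages as separate functions mirrors the separation between evidence gathering and conclusion generation described in the text.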
[0078] To enhance the robustness and evolvability of the system, this disclosure also proposes a mechanism for handling unknown problems. Specifically, when the system cannot retrieve, based on the parsing result, operation and maintenance knowledge fragments corresponding to the problem semantic information from the pre-built knowledge base (e.g., even the fragment with the highest similarity scores below a certain threshold), the system records the natural language input and the parsing result in a pending queue.
[0079] This method provides a buffer and recording channel for new problems that the system cannot handle automatically by setting up a pending queue mechanism. This avoids process interruptions or erroneous outputs due to the inability to retrieve knowledge, while collecting cases of these "unknown problems" to provide a valuable data source for subsequent manual analysis, rule expansion, model training, or knowledge base supplementation, supporting the continuous iteration and expansion of the entire system's capabilities.
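A minimal sketch of this fallback is given below, assuming that retrieval returns (similarity, fragment) pairs. The threshold value 0.75, the in-memory `deque`, and the function name `select_or_enqueue` are illustrative assumptions; the disclosure specifies neither a concrete threshold nor a queue implementation.

```python
from collections import deque

# Assumed value; the disclosure only refers to "a certain threshold".
SIMILARITY_THRESHOLD = 0.75

# In-memory stand-in for the pending queue (a real system might persist it).
pending_queue = deque()


def select_or_enqueue(natural_language_input, parsing_result, scored_fragments):
    """Return the best fragment, or queue the case when nothing is similar enough.

    scored_fragments is a list of (similarity, fragment) pairs.
    """
    best = max(scored_fragments, key=lambda pair: pair[0], default=None)
    if best is None or best[0] < SIMILARITY_THRESHOLD:
        # Unknown problem: record it for later manual analysis instead of failing.
        pending_queue.append(
            {"input": natural_language_input, "parsing_result": parsing_result}
        )
        return None
    return best[1]
```

Returning `None` rather than raising an error lets the caller end the automated flow gracefully while the queued case remains available for knowledge base supplementation.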
[0080] The following example, a payment scenario on an online e-commerce platform, illustrates the specific implementation of the technical solution disclosed herein.
[0081] As shown in Figure 2, when a user encounters a payment problem on the platform, the user submits a problem description through the customer service system or the platform's built-in feedback window: "After submitting the order, clicking the payment button had no effect; the page kept spinning, and finally, a payment failure message appeared." The platform integrates the problem localization system of this disclosure, and its server triggers an automated localization process immediately upon receiving the problem description.
[0082] First, the system's input acquisition unit captures the natural language input and passes it to the parsing unit for processing. The parsing unit invokes a pre-trained large language model to perform deep semantic parsing of the input text. The model not only identifies the core "problem semantic information" as "payment process interruption or failure," but also accurately extracts contextual information from multiple dimensions: the core functional modules involved are identified as "front-end payment page," "payment gateway service," and "order service"; the user identifier field can be extracted from the text description or context; the time range of the problem can be automatically associated with "recent time" based on the submission timestamp (e.g., the default setting is the last 30 minutes), or await further clarification from the user. The parsing result is structured into a machine-readable object, for example: {"Intent": "payment process interruption", "Functional module": ["front-end payment page", "payment gateway service", "order service"], "User identifier": null, "Time range": "last 30 minutes"}. This step transforms the ambiguous user language into structured information that the system can understand, containing business semantics and technical dimensions.
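The structured parsing result can be illustrated with the following Python sketch. It only shows how the dictionary produced by the model might be turned into a typed object with the default 30-minute window; the class name `ParsingResult`, the function `from_model_output`, and the submission timestamp used in the example are assumptions, and the call to the large language model itself is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Default window from the example above ("last 30 minutes").
DEFAULT_WINDOW = timedelta(minutes=30)


@dataclass
class ParsingResult:
    intent: str
    functional_modules: list
    user_identifier: Optional[str]
    start_time: datetime
    end_time: datetime


def from_model_output(model_output, submitted_at):
    """Build a typed parsing result from the model's dictionary output.

    This sketch always applies the default window ending at the submission
    timestamp; handling of an explicit user-supplied time range is omitted.
    """
    return ParsingResult(
        intent=model_output["Intent"],
        functional_modules=list(model_output.get("Functional module", [])),
        user_identifier=model_output.get("User identifier"),
        start_time=submitted_at - DEFAULT_WINDOW,
        end_time=submitted_at,
    )


parsed = from_model_output(
    {
        "Intent": "payment process interruption",
        "Functional module": [
            "front-end payment page",
            "payment gateway service",
            "order service",
        ],
        "User identifier": None,
        "Time range": "last 30 minutes",
    },
    datetime(2023, 11, 1, 14, 10),  # assumed submission time for illustration
)
```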
[0083] Based on the above parsing result, the system's retrieval unit begins its operation. It transforms the key semantic information "payment process interruption" into a high-dimensional semantic vector (i.e., the retrieval vector) using an embedding model. This vector is then used to query a pre-built vector knowledge base, which stores structured fault knowledge fragments extracted and vectorized from a large number of payment-related SOPs and historical fault reports. By calculating cosine similarity, the system quickly retrieves the fault knowledge fragment with the highest similarity. The structured content of this fragment is as follows: Problem field: "Transaction failed due to timeout when payment gateway calls third-party payment channel".
[0084] Location method field: "1. Query the application logs of the payment gateway service and filter out entries that call third-party payment channel APIs (such as 'ChannelX_Pay_API'). 2. Pay special attention to records with a response time greater than 5 seconds and records with a non-successful response status code (such as TIMEOUT, ERROR, 5xx). 3. Compare whether the error message contains keywords such as 'quota', 'rate limiting', and 'connection failure'." Reference log field (an example log): "2023-11-01 14:05:22.123 [INFO]PaymentGateway - Called ChannelX payment interface, order number: ORDER_789, time taken: 7450ms, return status: TIMEOUT, error message: Read timed out".
[0085] Problem cause field: "1. Congestion, overload, or temporary failure of the third-party payment channel server. 2. High latency or jitter in the network link between the payment gateway service and the third-party channel. 3. The payment gateway service's own thread pool is exhausted and cannot process concurrent requests in time. 4. The call frequency limit of the third-party payment channel has been reached, or its quota has been exhausted." Solution field: "1. Immediate Verification: Monitor the status page or health metrics of the third-party payment channel. 2. Emergency Operation: In the payment gateway management backend, temporarily switch the transaction route to the backup 'ChannelY' payment channel. 3. Capacity Check: Check the instance load and thread pool usage of the payment gateway service, and perform emergency scaling if necessary. 4. Follow-up Coordination: Contact the technical support of the third-party payment channel, synchronize the fault information, and request assistance in troubleshooting."
Next, the log acquisition unit takes over. It receives structured conditions from the parsing unit (functional module: "Payment Gateway Service"; time range: "last 30 minutes") and location methods rich in expert experience from the retrieval unit. The system first uses "module + time" as the condition to retrieve all log entries for the Payment Gateway Service within that time period from the distributed log system, forming an initial log set. Then, instead of performing a full analysis of this set, it applies specific rules from the "location method field" in the knowledge fragment to filter the initial set. For example, it searches the logs for call records containing "ChannelX_Pay_API" or similar interface identifiers and selects the entries whose response_time field value is greater than 5000 milliseconds or whose status field contains values such as "TIMEOUT" or "ERROR". Simultaneously, patterns provided by the "reference log field" are used to assist in matching, improving filtering accuracy. Through this dual filtering mechanism combining objective conditions and subjective knowledge rules, the system quickly narrows potentially millions of raw logs down to dozens of highly suspicious entries, the "target log data". These data clearly show that in the last 20 minutes, a large number of calls to the "ChannelX" payment channel have timed out, with each call taking more than 7 seconds.
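The dual filtering described above can be sketched in Python as follows. The in-memory list of dictionaries stands in for the distributed log system, and the record keys (`module`, `time`, `api`, `response_time`, `status`), the sample records, and the function names are assumptions for illustration; only the 5000 millisecond limit, the status values, and the "ChannelX_Pay_API" identifier come from the example.

```python
from datetime import datetime, timedelta

# Hypothetical log records; a real system would query a log platform instead.
logs = [
    {"module": "payment gateway service", "time": datetime(2023, 11, 1, 14, 5, 22),
     "api": "ChannelX_Pay_API", "response_time": 7450, "status": "TIMEOUT"},
    {"module": "payment gateway service", "time": datetime(2023, 11, 1, 14, 6, 1),
     "api": "ChannelX_Pay_API", "response_time": 320, "status": "SUCCESS"},
    {"module": "order service", "time": datetime(2023, 11, 1, 14, 6, 30),
     "api": "CreateOrder", "response_time": 6100, "status": "ERROR"},
    {"module": "payment gateway service", "time": datetime(2023, 11, 1, 12, 0, 0),
     "api": "ChannelX_Pay_API", "response_time": 9000, "status": "TIMEOUT"},
]


def initial_set(records, module, start, end):
    """First pass: objective conditions from the parsing result (module + time)."""
    return [r for r in records if r["module"] == module and start <= r["time"] <= end]


def apply_location_rules(records, api_id, max_ms, bad_statuses):
    """Second pass: rules taken from the fragment's location method field."""
    return [
        r for r in records
        if r["api"] == api_id
        and (r["response_time"] > max_ms or r["status"] in bad_statuses)
    ]


end = datetime(2023, 11, 1, 14, 10)  # assumed time at which the problem was reported
candidates = initial_set(logs, "payment gateway service", end - timedelta(minutes=30), end)
target_logs = apply_location_rules(candidates, "ChannelX_Pay_API", 5000, {"TIMEOUT", "ERROR"})
```

Running the cheap module and time filter first keeps the knowledge-driven rules from being evaluated against the full log volume.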
[0086] Finally, the problem localization unit (which can be viewed as an intelligent agent dedicated to log analysis) is triggered. Its input consists of retrieved fault knowledge fragments containing causal chains, and filtered target log data reflecting real-time facts. The agent first performs evidence verification: it analyzes each target log entry to confirm that its pattern (ChannelX interface call, high latency, TIMEOUT status) perfectly matches the descriptions in the "reference logs" and "localization method" within the knowledge fragment. Next, it performs reasoning and synthesis: it correlates the conclusive log evidence ("what happened") with the "problem cause field" ("why it happened") in the knowledge fragment. Given that the error type is "timeout" rather than "connection rejection," and that it occurred in batches within a short period, the agent prioritizes inferring the root cause as "regional congestion or performance degradation on the third-party payment channel ChannelX server," while also not ruling out the possibility of local network fluctuations. Then, it combines the step-by-step suggestions from the "solution field" and, based on the context of the current peak business period, generates a problem localization result in natural language format that directly guides action.
[0087] A few seconds later, the problem localization result is sent to the operations engineer, who can carry out subsequent system maintenance based on it.
[0088] This example demonstrates that, starting from receiving a simple natural language description from a user, the disclosed solution automatically connects the entire process of semantic parsing, knowledge retrieval, log filtering, and intelligent reasoning. It successfully links static SOP knowledge ("What should be checked and switched when a payment timeout occurs") with dynamic system logs ("A large number of timeouts have indeed occurred") in real time, and outputs a location result containing specific evidence, root cause analysis, and actionable solutions. This significantly reduces the time from problem discovery to initiating the correct response, achieving an intelligent upgrade of operational response capabilities.
[0089] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0090] The foregoing has described specific embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0091] According to another embodiment, a problem location device is provided. Figure 3 shows a schematic block diagram of the problem location device according to one embodiment. As shown in Figure 3, the device 300 includes an input acquisition unit 301, a parsing unit 302, a retrieval unit 303, a log acquisition unit 304, and a problem location unit 305. The main functions of each component are as follows: the input acquisition unit 301 is configured to acquire natural language input, which is used to describe abnormal problems during application runtime; the parsing unit 302 is configured to parse the natural language input to obtain a parsing result, the parsing result containing the semantic information of the problem corresponding to the abnormal problem; the retrieval unit 303 is configured to retrieve, based on the parsing result, operation and maintenance knowledge fragments corresponding to the semantic information of the problem from a pre-built knowledge base; the log acquisition unit 304 is configured to acquire target log data based on the parsing result and the retrieved operation and maintenance knowledge fragments; and the problem location unit 305 is configured to generate a problem location result based on the retrieved operation and maintenance knowledge fragments and the target log data.
[0092] The operation and maintenance knowledge fragment includes at least a problem location field. The log acquisition unit 304 is specifically configured to: determine log retrieval conditions based on the problem semantic information in the parsing result; extract the content of the problem location field from the operation and maintenance knowledge fragment; and filter the target log data from the log data based on the log retrieval conditions and the content of the problem location field.
[0093] The operation and maintenance knowledge fragment also includes reference log information corresponding to the problem location field. The log acquisition unit 304 is specifically configured to filter the target log data from the log data based on the log retrieval conditions, the content of the problem location field, and the corresponding reference log information.
[0094] The parsing result also includes at least one of the following: a functional module identifier, a user identifier, and a time range in which the problem occurred. The log acquisition unit 304 is specifically configured to extract at least one of the functional module identifier, the user identifier, and the problem occurrence time range, and combine it with the problem semantic information as the log retrieval conditions.
[0095] The operation and maintenance knowledge fragment also includes a problem cause field and / or a solution field.
[0096] The problem location unit 305 is specifically configured to: analyze the target log data based on the content of the problem location field to select the abnormal log entries that match that content; and analyze the abnormal log entries based on the problem cause field and / or the solution field to generate the problem location result, which includes the root cause of the fault and / or the handling solution.
[0097] The device may further include a knowledge base construction unit configured to: obtain unstructured original operation and maintenance troubleshooting documents; segment the original operation and maintenance troubleshooting documents based on predefined rules to obtain multiple structured operation and maintenance knowledge fragments; and vectorize and store the operation and maintenance knowledge fragments to construct the knowledge base.
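The segmentation step can be illustrated with a simple rule-based Python sketch. The predefined rule assumed here (each fragment starts with a "Problem:" label and fields are introduced by fixed labels) is hypothetical, as are the label names and the function name `segment`; the disclosure does not fix a concrete rule, and the vectorization and storage steps are omitted.

```python
import re

# Assumed rule: every fragment starts at "Problem:" and uses these field labels.
FIELD_LABELS = ["Problem", "Location method", "Reference log", "Cause", "Solution"]
_LABEL_RE = re.compile("^(" + "|".join(map(re.escape, FIELD_LABELS)) + r"):\s*(.*)$")


def segment(document):
    """Split a labelled troubleshooting document into structured fragments."""
    fragments, current, last_label = [], None, None
    for raw in document.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _LABEL_RE.match(line)
        if match:
            label, text = match.groups()
            if label == "Problem":
                current = {}  # a "Problem:" label opens a new fragment
                fragments.append(current)
            if current is None:
                continue  # skip labelled text that appears before the first "Problem:" label
            current[label] = text
            last_label = label
        elif current is not None and last_label is not None:
            current[last_label] += " " + line  # continuation of the previous field
    return fragments


doc = """Payment troubleshooting handbook
Problem: Payment gateway timeout
Location method: Filter ChannelX_Pay_API calls
slower than 5 seconds
Cause: Channel congestion
Solution: Switch to ChannelY

Problem: Order service error
Solution: Restart the worker
"""
fragments = segment(doc)
```

Each resulting dictionary corresponds to one structured fragment that could subsequently be vectorized and stored.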
[0098] The knowledge base construction unit is specifically configured to drive a natural language processing model, based on the predefined rules, to segment the original operation and maintenance troubleshooting documents and obtain multiple structured operation and maintenance knowledge fragments.
[0099] The retrieval unit 303 is specifically configured to: vectorize the parsing result to obtain a retrieval vector; perform similarity matching between the retrieval vector and the vectorized representation of each operation and maintenance knowledge fragment in the vector database; and select, based on the similarity matching results, a preset number of operation and maintenance knowledge fragments as the corresponding operation and maintenance knowledge fragments.
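The similarity matching performed by the retrieval unit can be sketched with cosine similarity over plain Python lists. The three-dimensional vectors and fragment identifiers below are made-up illustrative values rather than the output of any embedding model, and the dictionary stands in for the vector database; the function names are likewise assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, knowledge_base, k=1):
    """Return the k (score, fragment_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vector, vector), fragment_id)
        for fragment_id, vector in knowledge_base.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-in for the vector database: fragment id -> stored vector.
knowledge_base = {
    "gateway-timeout": [0.9, 0.1, 0.0],
    "order-duplicate": [0.1, 0.9, 0.2],
    "login-failure": [0.0, 0.2, 0.9],
}
best = top_k([0.8, 0.2, 0.1], knowledge_base, k=2)
```

Here `k` plays the role of the preset number of fragments to select; a production system would typically delegate this search to a vector database rather than scanning every fragment.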
[0100] The device also includes a recording unit configured to record the natural language input and the parsing result in the pending queue when no operation and maintenance knowledge fragment corresponding to the problem semantic information can be retrieved.
[0101] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0102] Figure 4 shows a schematic block diagram of an example electronic device 400 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0103] As shown in Figure 4, device 400 includes a computing unit 401, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 402 or a computer program loaded from storage unit 408 into random access memory (RAM) 403. RAM 403 may also store various programs and data required for the operation of device 400. The computing unit 401, ROM 402, and RAM 403 are interconnected via bus 404. Input / output (I / O) interface 405 is also connected to bus 404.
[0104] Multiple components in device 400 are connected to I / O interface 405, including: input unit 406, such as keyboard, mouse, etc.; output unit 407, such as various types of monitors, speakers, etc.; storage unit 408, such as disk, optical disk, etc.; and communication unit 409, such as network card, modem, wireless transceiver, etc. Communication unit 409 allows device 400 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0105] The computing unit 401 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the various methods and processes described above, such as problem localization methods. For example, in some embodiments, the problem localization method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program may be loaded and / or installed on device 400 via ROM 402 and / or communication unit 409. When the computer program is loaded into RAM 403 and executed by the computing unit 401, one or more steps of the problem localization method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the problem localization method by any other suitable means (e.g., by means of firmware).
[0106] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0107] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0108] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0109] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0110] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0111] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.
[0112] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.
[0113] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A problem localization method, comprising: obtaining natural language input, the natural language input being used to describe an abnormal problem during application runtime; parsing the natural language input to obtain a parsing result, the parsing result containing problem semantic information corresponding to the abnormal problem; retrieving, based on the parsing result, operation and maintenance knowledge fragments corresponding to the problem semantic information from a pre-built knowledge base; obtaining target log data based on the parsing result and the retrieved operation and maintenance knowledge fragments; and generating a problem location result based on the retrieved operation and maintenance knowledge fragments and the target log data.
2. The problem localization method according to claim 1, wherein the operation and maintenance knowledge fragment includes at least a problem location field; and obtaining the target log data based on the parsing result and the retrieved operation and maintenance knowledge fragments comprises: determining log retrieval conditions based on the problem semantic information in the parsing result; extracting the content of the problem location field from the operation and maintenance knowledge fragment; and filtering the target log data from log data based on the log retrieval conditions and the content of the problem location field.
3. The problem localization method according to claim 2, wherein the operation and maintenance knowledge fragment also includes reference log information corresponding to the problem location field; and filtering the target log data from the log data based on the log retrieval conditions and the content of the problem location field comprises: filtering the target log data from the log data based on the log retrieval conditions, the content of the problem location field, and the corresponding reference log information.
4. The problem localization method according to claim 2 or 3, wherein the parsing result also includes at least one of the following: a functional module identifier, a user identifier, and a time range in which the problem occurred; and determining the log retrieval conditions based on the problem semantic information in the parsing result comprises: extracting at least one of the functional module identifier, the user identifier, and the problem occurrence time range, and combining it with the problem semantic information as the log retrieval conditions.
5. The problem localization method according to any one of claims 2 to 4, wherein, The operation and maintenance knowledge fragment also includes a problem cause field and / or a solution field.
6. The problem localization method according to claim 5, wherein generating the problem location result based on the retrieved operation and maintenance knowledge fragments and the target log data comprises: analyzing the target log data based on the content of the problem location field to select abnormal log entries that match the content of the problem location field; and analyzing the abnormal log entries based on the problem cause field and / or the solution field to generate the problem location result, which includes the root cause of the fault and / or the handling solution.
7. The problem localization method according to claim 1, wherein the knowledge base is pre-built by: obtaining unstructured original operation and maintenance troubleshooting documents; segmenting the original operation and maintenance troubleshooting documents based on predefined rules to obtain multiple structured operation and maintenance knowledge fragments; and vectorizing and storing the operation and maintenance knowledge fragments to construct the knowledge base.
8. The problem localization method according to claim 7, wherein segmenting the original operation and maintenance troubleshooting documents based on predefined rules to obtain multiple structured operation and maintenance knowledge fragments comprises: driving a natural language processing model, based on the predefined rules, to segment the original operation and maintenance troubleshooting documents to obtain multiple structured operation and maintenance knowledge fragments.
9. The problem localization method according to claim 7, wherein retrieving, based on the parsing result, the corresponding operation and maintenance knowledge fragments from the pre-built knowledge base comprises: vectorizing the parsing result to obtain a retrieval vector; performing similarity matching between the retrieval vector and the vectorized representation of each operation and maintenance knowledge fragment in the vector database; and selecting, based on the similarity matching results, a preset number of operation and maintenance knowledge fragments as the corresponding operation and maintenance knowledge fragments.
10. The problem localization method according to claim 1, further comprising: When the operation and maintenance knowledge fragment corresponding to the semantic information of the problem cannot be retrieved, the natural language input and the parsing result are recorded in the queue to be processed.
11. A problem location device, comprising: an input acquisition unit configured to acquire natural language input, which is used to describe abnormal problems during application runtime; a parsing unit configured to parse the natural language input to obtain a parsing result, the parsing result containing the semantic information of the problem corresponding to the abnormal problem; a retrieval unit configured to retrieve, based on the parsing result, operation and maintenance knowledge fragments corresponding to the semantic information of the problem from a pre-built knowledge base; a log acquisition unit configured to acquire target log data based on the parsing result and the retrieved operation and maintenance knowledge fragments; and a problem localization unit configured to generate a problem localization result based on the retrieved operation and maintenance knowledge fragments and the target log data.
12. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1-10.
13. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.
14. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-10.