Method, system and computer program for optimizing retrieval of similar tickets in a ticket management system
Generative AI transforms ticket data into structured formats for enhanced accuracy in identifying similar tickets, addressing the inefficiencies of manual processing and text-based limitations in ticket management systems.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NEC LAB EURO GMBH
- Filing Date
- 2025-02-06
- Publication Date
- 2026-06-25
Abstract
Description
[0001] METHOD, SYSTEM AND COMPUTER PROGRAM FOR OPTIMIZING RETRIEVAL OF SIMILAR TICKETS IN A TICKET MANAGEMENT SYSTEM
[0002] The present invention relates to a computer-implemented method, system and computer program or non-transitory computer-readable medium for optimizing retrieval of similar tickets in a ticket management system.
[0003] Ticket management systems have the issue of duplicate tickets, where identical or highly similar issues are reported multiple times by different users. This problem is challenging due to the technical nature of many tickets, which often involve many details like error messages, code snippets, and file attachments. Manually identifying these duplicates is time-consuming and error-prone. Hence, a significant issue in the current ticket management pipeline is the manual processing of redundant / repeating tickets. Manual search for duplicates is impractical, both because it is too time-consuming (considering that it needs to happen for every ticket) and because of the decentralized nature of ticket assignment.
[0004] Other approaches in duplicate ticket (or bug report) detection often have limitations such as being restricted to text-based tickets only, using predefined vocabularies for matching tickets which are created for each new business domain, requiring time and cost expensive human triaging of all existing tickets for mapping incoming tickets, availability of monitoring and logging systems to provide runtime information in a high-availability fashion, and searching tickets only based on lexical patterns derived via basic natural language processing.
[0005] It is therefore an objective of the present invention to improve and further develop a method of the initially described type for handling duplicate tickets.
[0006] In accordance with the invention, the objective is accomplished by a method comprising the features of claim 1. According to this claim, a computer-implemented method for optimizing retrieval of similar tickets in a ticket management system using machine learning comprises obtaining a ticket being input into the ticket management system. The method comprises extracting first textual data from a description portion of the ticket. The method comprises determining, if the ticket comprises at least one attachment, second textual data representing the at least one attachment. The method comprises determining combined textual data based on the first textual data and, if determined (i.e., if the ticket comprises an attachment), based on the second textual data. The method comprises performing a similarity search based on the combined textual data and a repository of existing tickets using a lexical search and / or a semantic search to determine one or more existing tickets being similar to the ticket. The method comprises providing information on the one or more existing tickets being similar to the ticket.
[0007] The proposed concept is based on the insight that the handling of tickets in ticket management systems can be improved by identifying existing tickets (and, if possible, the measures taken to resolve these existing tickets) that most closely resemble the ticket at hand. For the purpose of identifying the most similar tickets, the ticket is processed to not only extract information being contained in the description being provided by the user having submitted the ticket, but also ancillary information that is included in attachments that are included with the ticket. This ancillary information is extracted from screenshots, videos or spoken messages attached to the ticket and used to augment the description that is included in the ticket. The combined information is then used to perform a similarity search, to identify the existing tickets that are most similar to the ticket being submitted, and which may thus contain a pointer to the solution to the underlying problem. If no solution is known, the ticket can at least be linked with the existing tickets to avoid a duplication of effort to resolve the issue. In summary, a concept is proposed that leverages generative AI (Artificial Intelligence), e.g., when determining the second textual data, to automatically detect duplicate tickets by transforming free-text descriptions and attachments, such as error codes, stack traces, database queries, and even visual elements into a structured representation to accurately identify reliable similarities between tickets. The proposed concept may thus provide a generative AI-based approach for detecting duplicate tickets through ticket representation transformation. Various examples of the present disclosure may thus overcome the limitations of other ticket management approaches by optimizing a modular pipeline that enables advanced processing of complex technical tickets. The proposed concept is based on the usage of generative AI and (large) language models.
Generative AI refers to artificial intelligence systems designed to produce new content, such as text, images, or other forms of media based on an input prompt. These systems use patterns and structures learned from existing data to generate novel outputs that resemble the examples they were trained on. Language models, a subset of generative AI, are systems specifically trained to understand and generate human language. Large language models, or LLMs, like GPT-3 or GPT-4, are trained on vast amounts of text data and are capable of producing human-like text by predicting the next word in a sequence based on the context of the input. Small language models, in contrast, are trained on a smaller scale, which can make them more suitable for specific tasks where resources are limited, or where interpretability and energy efficiency are desired. They are often used in applications where quick responses are essential, or where deploying an extensive LLM may not be feasible due to computational constraints. Multi-modal models expand on the capabilities of language models by integrating and understanding multiple types of data, such as image, text, and audio, and enabling interaction between these modalities. These models can process and generate content across different forms of media, offering more versatile applications. Generative AI models, such as LLMs, SLMs (Small Language Models), Speech-to-Text models or Multi-Modal models are available off-the-shelf from various providers, either through an API (Application Programming Interface) or as open-source models for self-hosting, alleviating the burden of training the respective models from scratch.
[0008] In the present context, generative AI is used to transform the ticket from the format in which it is input into a format that is more suitable for the similarity search and that supports the user in better understanding the content of the ticket. In particular, the second textual data, which is derived from the attachment(s) of the ticket, may be generated using generative AI. In the act of determining the second textual data, the second textual data may be generated using a language model or a multi-modal model. This enables the system to leverage advanced linguistic analysis and machine learning capabilities to extract meaningful text from diverse attachment types, ensuring that the subsequent similarity search is as comprehensive as possible. This results in enhanced accuracy in identifying similar tickets due to the sophisticated handling of language nuances and context, leading to better matching between tickets.
[0009] Moreover, machine learning, and in particular machine learning models, such as speech-to-text models, optical character recognition models or multi-modal models, may be used to extract the information contained in the attachment(s) to the ticket.
[0010] In particular, the act of determining the second textual data may comprise, if the at least one attachment comprises audio data and / or video data, generating, using a speech-to-text model or a multi-modal model, a transcript of spoken content included in the audio data and / or video data, with the second textual data being based on the transcript. This enables the method to extract meaningful information from non-textual attachments, expanding the scope of searchable data. By leveraging speech-to-text models, the system can uncover relevant details that might be missed by solely relying on textual descriptions, ultimately leading to more comprehensive search results and thus better-informed human operators.
[0011] Additionally, or alternatively, the act of determining the second textual data may comprise, if the at least one attachment comprises video data and / or image data, generating, using a multi-modal model, a textual description of at least one image frame included in the video data and / or image data, with the second textual data being based on the textual description. This capability allows the system to interpret visual content, making it possible to determine the similarity based on the content of images or videos, which is particularly useful in scenarios where descriptions are inadequate or incomplete. Thus, the most similar tickets can be identified more accurately, even when traditional text-based searches are insufficient, thereby enhancing the overall accuracy of the matching.
[0012] Additionally, or alternatively, the act of determining the second textual data may comprise, if the at least one attachment comprises video data and / or image data, performing optical character recognition (OCR) on at least one image frame included in the video data and / or image data, with the second textual data being based on the optical character recognition. This feature enables the extraction of text embedded within images, further enriching the searchable dataset with information that might not be immediately apparent from descriptions alone. By applying OCR, the system can uncover specific details such as error codes mentioned in images, which can be crucial for accurate ticket matching and resolution.
[0013] In general, while the description included in the description portion of the ticket is often concise, the information contained in the attachment(s), such as stack traces, screenshots, screencasts, videos etc., are often unstructured, with a large amount of extraneous information. Therefore, a language model (e.g., the same language model being used to output the second textual data) may be used to summarize the content of the attachment(s). In other words, the act of determining the second textual data may comprise summarizing information contained in the at least one attachment. This helps condense lengthy or complex attachments into concise, searchable summaries, making it easier to identify relevant tickets. This makes the search process more accurate by focusing on the essential information.
[0014] In many cases, the attachment(s) may be used to support or elaborate on the description provided in the description portion of the ticket. Thus, in most cases, there is some redundancy between the information contained in the description portion and the information contained in the attachment. This redundancy may be reduced or eliminated by processing both the first and second textual data, or by processing the resulting combined textual data. In other words, the method may comprise removing duplicate information from at least one of the first textual data, the second textual data and the combined textual data. This ensures that the dataset used for similarity searches is improved by eliminating redundant information, which can otherwise dilute search results with less relevant matches. By reducing duplication, the system can provide more precise and relevant ticket suggestions, ultimately aiding in a more precise matching.
[0015] There are different approaches for eliminating redundancy from the textual data at hand. For example, a semantic embedding-based approach may be used to identify redundant information based on the similarity of the semantic embeddings of chunks of data, enabling removal of chunks of redundant text. For example, the act of removing duplicate information may comprise dividing at least one of the first textual data, the second textual data and the combined textual data into chunks of text. The act of removing duplicate information may further comprise generating, using an embedding model, embeddings representing the chunks of text. The act of removing duplicate information may further comprise identifying redundant chunks of text based on the embeddings representing the chunks of text. The act of removing duplicate information may further comprise removing at least one redundant chunk of text. The act of removing duplicate information may further comprise rewriting, using a language model, at least one of the first textual data, the second textual data and the combined textual data into a coherent text without the removed at least one redundant chunk of text. By using a semantic matching based on the similarity of embeddings, redundant information can be removed that is only redundant in content, not in language. Moreover, by only removing redundant chunks of text, it may be ensured that each aspect of the original ticket is maintained. Rewriting the textual data may ensure that the ticket remains legible for human operators reading the ticket.
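The embedding-based removal of redundant chunks described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, sentence-level chunking and the threshold of 0.8 are assumptions chosen for illustration, and the final rewriting by a language model is omitted.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str) -> list[str]:
    """Split text into sentence-level chunks (a simplifying assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(chunk: str) -> Counter:
    """Hypothetical stand-in for an embedding model: sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def remove_redundant_chunks(text: str, threshold: float = 0.8) -> list[str]:
    """Keep a chunk only if it is not too similar to a chunk already kept."""
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for chunk in split_into_chunks(text):
        vector = toy_embed(chunk)
        if any(cosine(vector, other) >= threshold for other in kept_vectors):
            continue  # redundant in content: drop this chunk
        kept.append(chunk)
        kept_vectors.append(vector)
    return kept
```

In a real deployment, `toy_embed` would be replaced by calls to an embedding model, so that chunks that are redundant in content but differently worded are also detected.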
[0016] For example, the act of identifying and removing duplicate / redundant chunks of text can be done by semantic clustering: Given the textual data (first, second and / or combined), the textual data can be split into sentences (i.e., chunks, which may be sentences or larger portions of text) and arranged in a sequential fashion. Then, each of the sentences / chunks can be embedded using an embedding model (e.g., E5 or GPT-3) to obtain embeddings of the respective chunks. Next, the sentences / chunks may be clustered based on their semantic similarity into a variable number of clusters. Clusters may be sorted based on the number of sentences / chunks or their quality and then organized as a coherent text which forms the (new) combined textual data. Alternatively, or additionally, the act of identifying and removing duplicate / redundant chunks of text may include deduplication. Given the textual data (first, second and / or combined), redundant content can be identified using a lexical or semantic overlap on a chunk / sentence level. The first occurrence of duplicate content may be removed and the deduplicated content may be organized to form the reformulated combined textual data. In this case, if only the lexical overlap is considered, computing the embeddings of the chunks of text is unnecessary, and identification and removal of redundant chunks of text can be performed on the text itself. In an alternative approach to reducing the redundancy within the textual data, a language model may be used to summarize the textual data, thereby removing duplicate information. In other words, the act of removing duplicate information may comprise summarizing, using a language model, at least one of the first textual data, the second textual data and the combined textual data. This requires less effort for the removal of redundant information, at an increased likelihood of retaining duplicate information.
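The purely lexical deduplication variant, which needs no embeddings, could look like the following sketch. The Jaccard overlap measure, the threshold of 0.7 and the `keep` parameter are assumptions made for illustration; with `keep="last"` the earlier occurrence of a near-duplicate is dropped, with `keep="first"` the later one.

```python
import re


def tokens(sentence: str) -> set[str]:
    """Lower-cased word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Lexical overlap: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_dedup(sentences: list[str], threshold: float = 0.7,
                  keep: str = "last") -> list[str]:
    """Drop sentences whose token overlap with a kept sentence reaches the threshold."""
    ordered = sentences if keep == "first" else list(reversed(sentences))
    kept: list[str] = []
    for sentence in ordered:
        if all(jaccard(tokens(sentence), tokens(k)) < threshold for k in kept):
            kept.append(sentence)
    # Restore the original reading order of the surviving sentences.
    return kept if keep == "first" else list(reversed(kept))
```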
[0017] Once the combined textual data is determined and, optionally, redundant information has been removed, the combined textual data may be processed further to bring the combined textual data into a standardized format. For example, a structured information extraction operation may accept the combined textual data and extract information using instruction-following language models (such as LLMs) to unify the ticket into a single, structured, domain-agnostic format for streamlined processing. Here, a predefined schema (i.e., the domain-agnostic structured format) may be defined and used based on domain or product expertise. For example, for tickets used in software development, Fig. 4 shows a schema that captures possible aspects of a typical software bug. This structured representation summarizes the ticket in a systematic manner by capturing relevant aspects of the ticket for accelerated handling, which the human operators may already use for processing an incoming ticket. Moreover, tickets from customers may be written in varying styles or contain different types of attachments which can be unified via structured information extraction, resulting in a common schema (i.e., the domain-agnostic structured format) that all tickets conform to. This accelerates the subsequent steps such as automatic ticket assignment (routing), aggregating errors, platforms, among other aspects with lower cognitive effort.
[0018] In particular, the method may comprise transforming, using a language model (e.g., an instruction-tuned LLM), the combined textual data into structured data according to a domain-agnostic structured format. From this structured, domain-agnostic format, a further natural language text may be derived that takes into account the specifics of the domain. Thus, the method may comprise generating domain-specific textual data representing the ticket based on the structured data. The method may comprise providing the domain-specific textual data representing the ticket alongside the information on the one or more tickets being similar to the ticket. Additionally, or alternatively, the method may comprise using the domain-specific textual data representing the ticket for the similarity search. By transforming the textual data into structured data, the foundation is laid for a subsequent transformation of the ticket into a template-based textual format, such as the domain-specific textual data. This approach bridges the gap between raw text and domain-specific understanding, making the subsequent similarity search and representation of the ticket more accurate and meaningful.
[0019] For example, the method may comprise transforming, using a language model and a repository of domain-specific context, the structured data into domain-specific structured data according to a domain-specific structured format. The method may comprise generating the domain-specific textual data representing the ticket based on the domain-specific structured data and based on a pre-defined or generated domain-specific template. By transforming the structured data into domain-specific structured data, additional context can be added according to the specific domain, which can improve the similarity search. By outputting the ticket according to the pre-defined or generated domain-specific template, a consistent format is maintained that helps human operators in quickly understanding the ticket at hand. The template can be pre-defined according to knowledge on the domain, or it can be derived (i.e., generated), e.g., by a language model, from the domain-specific structured format.
[0020] In the present disclosure, a distinction is made between domain-agnostic and domain-specific formats. In this context, a domain is a specific area of knowledge, expertise, or activity where particular concepts, terminology, practices, and problems are well-defined and focused. It encompasses the set of subjects or fields that are typically grouped together because they share common characteristics, objectives, or methods. A domain serves as a framework within which tasks are executed and decisions are made, guiding the application of techniques, tools, and best practices relevant to that specific area. For example, in the context of software development, domains such as “backend development”, “frontend development”, “data layer”, “user experience”, “scaling” etc. may be defined. In the context of medical tickets, domains such as “oncology”, “cardiology”, “gastro-intestinal”, “gynecology” etc. may be defined. With respect to the formats being used, a domain-agnostic format is designed to be versatile and applicable across various fields or areas of knowledge, without being tailored to the specific needs or conventions of any one domain. Thus, the domain-agnostic structured format is broad and adaptable to many areas, while domain-specific formats, such as the format the domain-specific textual data is in, are narrowly focused to efficiently and precisely serve the particular needs of a specific domain.
[0021] The goal of the proposed concept is to identify and present similar tickets, which may enable the human operator (user) to quickly identify solutions or courses of action based on the knowledge contained in the similar, existing tickets. To retrieve these similar tickets, one or more queries may be generated that are targeted at identifying the relevant tickets in one or more databases. Thus, the method may comprise formulating a query for performing the similarity search based on the combined textual data and / or based on domain-specific textual data representing the ticket, by determining keywords representing the ticket and formulating the query based on the determined keywords. This allows for more precise and targeted search execution, enhancing the effectiveness and relevance of retrieved similar tickets.
[0022] There are various approaches for extracting suitable keywords from the textual data. For example, keywords may be extracted from the combined textual data and / or domain-specific textual data using an LLM, which can then be used as queries. In other words, the process of determining the keywords representing the tickets may comprise extracting, using a language model, the keywords from the combined textual data or the domain-specific textual data representing the ticket. This provides an easy extraction of relevant keywords. However, this approach may introduce the problem of having too many distinct keywords (sparsity) if the ticket contains detailed technical information such as error descriptions, software components, etc.
[0023] According to an example, extracted information from the tickets may be transformed into structured representations, e.g., domain-specific contextualized schemas that convey the different components of the ticket such as error messages, observed behavior, ticket summary, possibly among other essential information. In this case, informative queries may be formulated based on natural language templates that are generated based on such contextualized schemas. In this context, it is emphasized that the term “keywords” is not limited just to a list, but may also refer, e.g., to an entire contextualized schema that can be used as a query.
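As a hedged illustration of formulating a query from a contextualized schema via a natural language template, consider the sketch below. The `TicketSchema` class, its field names (taken loosely from the components named above) and the template wording are hypothetical; the actual schemas are those shown in the figures.

```python
from dataclasses import dataclass, field


@dataclass
class TicketSchema:
    """Hypothetical contextualized schema; the field names are assumptions."""
    summary: str
    observed_behavior: str = ""
    error_messages: list[str] = field(default_factory=list)


def formulate_query(ticket: TicketSchema) -> str:
    """Fill a simple natural language template, skipping empty fields."""
    parts = [f"Ticket summary: {ticket.summary}."]
    if ticket.observed_behavior:
        parts.append(f"Observed behavior: {ticket.observed_behavior}.")
    if ticket.error_messages:
        parts.append("Error messages: " + "; ".join(ticket.error_messages) + ".")
    return " ".join(parts)
```

For instance, a ticket with summary "Login fails" and the error messages "HTTP 500" and "timeout" yields the query "Ticket summary: Login fails. Error messages: HTTP 500; timeout."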
[0024] Additionally, or alternatively, clustering of the keywords (or key phrases) identified from the ticket description may be performed based on their semantic embeddings (meaning) and labels may be generated for each cluster using a language model, such as an LLM or SLM. These cluster labels ideally capture the different facets of the ticket, for example: “database connection error”, “javascript blocked by client”, “transaction verification timeout”. These queries are more meaningful and avoid the sparsity issue described above. Thus, the process of determining the keywords representing the tickets may comprise clustering keywords or key phrases included in the combined textual data or the domain-specific textual data representing the ticket based on a semantic embedding of the respective keywords or key phrases, and generating, using a language model, labels for the clusters of keywords or key phrases, with the labels being used as the keywords representing the ticket. This way, fewer, more meaningful keywords can be extracted.
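The clustering of key phrases and the labelling of clusters can be sketched as follows. This is only an illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a semantic embedding model, the greedy threshold clustering is one of many possible clustering methods, and `placeholder_label` merely picks the shortest phrase where the disclosure would call a language model.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(phrase: str) -> Counter:
    """Hypothetical stand-in for a semantic embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", phrase.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_phrases(phrases: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose seed phrase is similar enough."""
    clusters: list[list[str]] = []
    for phrase in phrases:
        vector = toy_embed(phrase)
        for cluster in clusters:
            if cosine(vector, toy_embed(cluster[0])) >= threshold:
                cluster.append(phrase)
                break
        else:
            clusters.append([phrase])  # no similar cluster found: start a new one
    return clusters


def placeholder_label(cluster: list[str]) -> str:
    """Stand-in for a language-model call that would generate a cluster label."""
    return min(cluster, key=len)


def keywords_from_clusters(
    phrases: list[str],
    label_fn: Callable[[list[str]], str] = placeholder_label,
) -> list[str]:
    """One label per cluster; the labels serve as the keywords of the ticket."""
    return [label_fn(cluster) for cluster in cluster_phrases(phrases)]
```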
[0025] Additionally, or alternatively, the domain-specific repository used in transforming the tickets can be used to expand the queries by adding technical words that best represent the target domain in addition to keywords from the ticket. Thus, the process of determining the keywords representing the tickets may comprise adding keywords from a repository of domain-specific context to represent the domain associated with the domain-specific textual data. This way, it may be ensured that the tickets being searched are relevant for the specific domain the ticket belongs to.
[0026] In the proposed concept, a hybrid search, i.e., a lexical search and a semantic search, may be used. In this context, a lexical search finds matches based on the exact words or phrases used in the input query, e.g., relying on keyword matching without understanding the meaning behind the words. In contrast, a semantic search uses techniques such as embeddings to represent the contextual meaning of the search query, which allows retrieving information that is relevant based on the concepts and relationships involved rather than just matching specific words. In a hybrid search, the similarity search may thus be performed using the lexical search and the semantic search, and respective search results of the lexical search and of the semantic search may be ranked, e.g., using one of reciprocal rank fusion of the search results and weighting-based ranking of the search results. Performing a similarity search using both lexical and semantic methods with ranked results improves overall search precision and recall. Combining these techniques ensures that related tickets are identified accurately based on both exact wording and contextual meaning.
[0027] In some examples, reciprocal rank fusion may be used to rank the results provided by the semantic search and by the lexical search. Reciprocal Rank Fusion (RRF) is a method used to combine the ranked results from multiple search systems, such as semantic and lexical searches, into a single, more comprehensive ranking. In RRF, each result from the different search systems is assigned a score based on its rank position, usually calculated as the reciprocal of its rank offset by a constant, i.e., 1 / (k + rank), where k is a constant that moderates the effect of rank. These reciprocal scores from the different systems are then summed for each result, providing a combined score that reflects its relative ranking across all systems. The final list is then sorted based on these combined scores, synthesizing the strengths of both semantic and lexical searches to deliver a more accurate and robust ranked result set.
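The scoring rule 1 / (k + rank) can be made concrete with a short Python sketch. The function name and the ticket identifiers are illustrative; k = 60 is a value frequently used in practice and is an assumption here, since the text only states that k is a constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of ticket IDs into one list sorted by summed RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, ticket_id in enumerate(ranking, start=1):
            scores[ticket_id] += 1.0 / (k + rank)  # ranks start at 1
    # Highest combined score first; ties are broken by ticket ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


lexical_results = ["T1", "T2", "T3"]
semantic_results = ["T2", "T4", "T1"]
fused = reciprocal_rank_fusion([lexical_results, semantic_results])
```

Here T2 ends up first because it is ranked highly by both searches, followed by T1, T4 and T3.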
[0028] Alternatively, a weighting-based ranking approach may be used, in which weights are assigned for lexical and semantic rankings. Subsequently, the rankings are fused by sorting their combined weights. The top-k ranked results may then be presented as potential similar tickets.
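One possible reading of the weighting-based ranking is sketched below. The text does not specify how a ranking is turned into a score, so the linear rank-to-score mapping, the weights 0.4 and 0.6, and the function name are all assumptions made for illustration.

```python
def weighted_fusion(lexical: list[str], semantic: list[str],
                    w_lexical: float = 0.4, w_semantic: float = 0.6,
                    top_k: int = 3) -> list[str]:
    """Combine two rankings by a weighted sum of rank-based scores; return the top-k IDs."""

    def rank_scores(ranking: list[str]) -> dict[str, float]:
        # Assumed mapping: the first result scores 1.0, the last scores 1/n.
        n = len(ranking)
        return {tid: (n - pos) / n for pos, tid in enumerate(ranking)}

    lex, sem = rank_scores(lexical), rank_scores(semantic)
    combined = {
        tid: w_lexical * lex.get(tid, 0.0) + w_semantic * sem.get(tid, 0.0)
        for tid in set(lex) | set(sem)
    }
    ranked = sorted(combined, key=lambda tid: (-combined[tid], tid))
    return ranked[:top_k]
```

A result that appears in only one ranking simply contributes a score of zero for the other, so results found by both searches tend to rise to the top.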
[0029] In some examples, the semantic search may be performed using a vector database containing semantic embeddings of the existing tickets. For example, each new ticket (and existing ticket) may be transformed into a specific vector representation and stored in the vector database. Performing the semantic search using a vector database containing embeddings from existing tickets ensures that searches can efficiently identify semantically similar documents, leading to more accurate and relevant retrieval of ticket matches. The lexical search may be performed using the word-based indexing of the textual data of the existing tickets.
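To illustrate what a vector database does conceptually, the following sketch stores ticket embeddings in memory and retrieves the nearest ones by cosine similarity. The class `InMemoryVectorStore` is a hypothetical toy, not the API of any real vector database, and the two-dimensional vectors stand in for embeddings produced by an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Toy stand-in for a vector database holding ticket embeddings."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, ticket_id: str, embedding: list[float]) -> None:
        self._vectors[ticket_id] = list(embedding)

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        """Return the top_k most similar tickets as (ticket_id, similarity) pairs."""
        scored = [(tid, cosine(query, vec)) for tid, vec in self._vectors.items()]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:top_k]


store = InMemoryVectorStore()
store.add("T1", [1.0, 0.0])
store.add("T2", [0.0, 1.0])
store.add("T3", [1.0, 1.0])
hits = store.search([1.0, 0.1], top_k=2)
```

A production vector database would use an approximate nearest-neighbour index instead of the exhaustive scan shown here.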
[0030] For example, the method may comprise providing a representation of the ticket alongside a representation of the one or more existing tickets being similar to the ticket to a user via a user interface. For example, if someone submits a new ticket request to the ticket management system, the proposed concept searches for existing tickets containing solutions or courses of action for handling the ticket. If successful, the solution or course of action is returned to the human user. The solution, i.e., the cognitive content of the ticket solution presented to the user, relates to an internal state prevailing in a technical system and enables the user to properly operate this technical system. Thus, the human operator of the ticket management system is guided by the proposed concept to control the ticket management system. This way, the user is guided and supported to identify a solution or course of action for the ticket based on the knowledge inherent to the existing tickets.
[0031] Another aspect of the present disclosure relates to a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the above method.
[0032] Another aspect of the present disclosure relates to a non-transitory, computer-readable medium comprising a program code that, when the program code is executed on a processor, a computer, or a programmable hardware component, causes the processor, computer, or programmable hardware component to perform the above method.
[0033] Another aspect of the present disclosure relates to a system comprising interface circuitry, machine-readable instructions, and processor circuitry to execute the machine-readable instructions to carry out the above method.
[0034] There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end it is to be referred to the dependent claims on the one hand and to the following explanation of preferred embodiments of the invention by way of example, illustrated by the figure on the other hand. In connection with the explanation of the preferred embodiments of the invention by the aid of the figure, generally preferred embodiments and further developments of the teaching will be explained. In the drawing
[0035] Figs. 1a and 1b show flow charts of examples of a method for optimizing retrieval of similar tickets in a ticket management system using machine learning;
[0036] Fig. 1c shows a block diagram of an example of a system for optimizing retrieval of similar tickets in a ticket management system using machine learning;
[0037] Fig. 2 shows a schematic diagram of an example architecture of a duplicate detection system;
[0038] Fig. 3 shows a schematic overview of three major operations in an example of the proposed pipeline;
[0039] Fig. 4 shows an example of a unified ticket schema for structured information extraction;
[0040] Fig. 5 shows a concrete example of extracted information based on a unified schema; and
[0041] Fig. 6 shows an example of a contextualized transformation tailored to the domain of “Web Banking III Development” applied to the output of a unified schema.
[0042] Embodiments of the present disclosure relate to a method, system and computer program for optimizing retrieval of similar tickets in a ticket management system using machine learning. The method for optimizing retrieval of similar tickets in a ticket management system using machine learning is discussed in connection with Figs. 1a and 1b, while the system is discussed in connection with Fig. 1c. The computer program comprises a program code or machine-readable instructions to perform, when executed by a computer system or other processor-based device, the method of Figs. 1a and / or 1b. Subsequently, in connection with Figs. 2 to 6, an example implementation of such a method, system and computer program is discussed.
[0043] Figs. 1a and 1b show flow charts of examples of a computer-implemented method for optimizing retrieval of similar tickets in a ticket management system using machine learning. The method comprises obtaining 110 a ticket being input into the ticket management system. The method comprises extracting 120 first textual data from a description portion of the ticket. The method comprises determining 130, if the ticket comprises at least one attachment, second textual data representing the at least one attachment. The method comprises determining 150 combined textual data based on the first textual data and, if determined, based on the second textual data. The method comprises performing 185 a similarity search based on the combined textual data and a repository of existing tickets using a lexical search and / or a semantic search to determine one or more existing tickets being similar to the ticket. The method comprises providing 190 information on the one or more existing tickets being similar to the ticket. In particular, the method may comprise providing 190 a representation of the ticket alongside a representation of the one or more existing tickets being similar to the ticket to a user via a user interface, e.g., a web-browser-based user interface.
[0044] The method of Figs. 1a and / or 1b may be performed by the corresponding system 10 shown in Fig. 1c. Fig. 1c shows a block diagram of a system 10 for optimizing retrieval of similar tickets in a ticket management system using machine learning. The system 10 shown in Fig. 1c is a computer system comprising interface circuitry 12, processor circuitry 14, machine-readable instructions, and, optionally, memory and / or storage circuitry 16. The interface circuitry 12 may be used to facilitate communication with other components, such as sensors, or with other computer systems. For example, the system 10 may be a server computer system, with client computer systems connecting to the system 10, e.g., to provide the ticket and / or to retrieve a user interface with a representation of the ticket and of the one or more similar existing tickets. The communication may occur via a bus or a network, for example. The processor circuitry 14 may be used to provide the functionality of the system, for example, in conjunction with the interface circuitry 12 (for exchanging information) and / or optional memory or storage circuitry 16 (for storing information, such as machine-readable instructions). The processor circuitry 14 is therefore coupled with the interface circuitry 12 and, optionally, with the memory or storage circuitry 16. For instance, the system 10 may comprise machine-readable instructions, i.e., a computer program, which may prompt the one or more processors to execute at least one of the methods introduced in connection with Figs. 1a and / or 1b. In accordance with the aforementioned, the method introduced in connection with Figs. 1a and / or 1b may be carried out by the one or more processors executing the machine-readable instructions.
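The sequence of acts 110 to 190 can be summarized in a skeletal Python sketch. All names (`Ticket`, `Attachment`, `find_similar_tickets`) are hypothetical, and the two callables are placeholders: `attachment_to_text` stands for whichever model (speech-to-text, multi-modal, OCR) turns an attachment into text, and `similarity_search` stands for the lexical and / or semantic search over the repository of existing tickets.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Attachment:
    kind: str       # hypothetical labels such as "audio", "video", "image"
    payload: bytes


@dataclass
class Ticket:
    description: str
    attachments: list[Attachment] = field(default_factory=list)


def find_similar_tickets(
    ticket: Ticket,                                   # obtained ticket (110)
    attachment_to_text: Callable[[Attachment], str],
    similarity_search: Callable[[str], list[str]],
) -> list[str]:
    """Return identifiers of existing tickets similar to the given ticket."""
    first_text = ticket.description.strip()                              # extracting (120)
    second_texts = [attachment_to_text(a) for a in ticket.attachments]   # determining (130)
    combined = "\n".join([first_text, *second_texts])                    # combining (150)
    return similarity_search(combined)                                   # searching (185), providing (190)
```

If the ticket has no attachments, the combined textual data is simply the first textual data, which mirrors the "if determined" condition in the method.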
[0045] For example, the interface circuitry 12 may include or correspond to a network interface circuitry and / or a device interface circuitry configured to be communicatively coupled to one or more other devices, such as the one or more processors. For example, the interface circuitry may include a transmitter, a receiver, or a combination thereof (e.g., a transceiver), and may enable wired communication, wireless communication, or a combination thereof. For example, the processor circuitry 14 may include or correspond to one or more of a digital signal processor circuitry (DSP), a graphical processing unit (GPU), and / or a central processing unit (CPU). For example, the memory and / or storage circuitry 16 may include or correspond to volatile or nonvolatile storage circuitry, such as Random Access Memory (RAM), magnetic disks, optical disks, or flash memory devices. The one or more memory / storage devices may include both removable and nonremovable memory devices.
[0046] In the following, an example implementation of the method, system and computer program of Figs. 1a to 1c is presented.
[0047] Embodiments of the present disclosure address the problem of detecting duplicate tickets in a ticket management system. The proposed concept searches for existing solutions in a vector database and distributes the electronic file from the central server to the applicant. Potential use cases are software bug tracking systems or medical record management systems. In other words, the ticket management system may be a software bug tracking system or a medical record management system.
[0048] At the core of the concepts disclosed herein is transforming generic bug reports (or more generally, “tickets”) from customers into domain-specific, information-rich, and structured natural language templates (i.e., domain-specific textual data) for optimizing the search for duplicate reports (i.e., existing tickets being similar to the ticket) from a repository of existing reports. This transformation is a multi-operation process as outlined in Fig. 2 with several components combined in a novel way.
[0049] Fig. 2 shows a schematic diagram of an example architecture of a duplicate detection system. As shown in Fig. 2, a customer ticket 210 (with attachment(s)) is first processed by a Multi-modal Document Processing operation M1, comprising sub-operations File Understanding M1.1 and Ticket Rewriting M1.2. The Multi-modal Document Processing operation M1 outputs a Reformulated Ticket Description 220, which is processed by a Structured Information Extraction operation M2. The Structured Information Extraction operation M2 outputs a version 230 of the ticket according to a Unified Ticket Schema ({Problem, Version, Priority, ... }), e.g., structured data. This version of the ticket is provided to a Context-specific Transformation operation M3, which uses a domain knowledge repository, bounded context aspects, and a (pre-defined) Schema Filtering sub-operation M3.1 and an (automatic) Schema Synthesis sub-operation M3.2. The output of operation M3 is a version 240 of the ticket according to a Contextualized Ticket Schema ({Problem, Interface, Error Code}), e.g., domain-specific structured data. A Natural Language Ticket Template Generation operation M4 processes this version of the ticket to output a version 250 of the ticket according to a Contextualized Ticket Description, e.g., the domain-specific textual data. This version 250 is used by a Query Formulation operation M5 to generate Contextualized Search Queries 260, which are in turn used for a Hybrid Search operation M6 being performed on a ticket repository to identify duplicate (i.e., similar) tickets 270.
[0050] In the following description of the proposed concept, the bug report created by a customer of a software product is referred to as a “Ticket” and the bug as an “Issue”. A ticket may not only contain textual description but also images (screenshots) or videos (screen recordings) demonstrating an issue faced by the customer.
[0051] The transformation of the ticket to a domain-specific template facilitates creating customized search queries for finding duplicate tickets easily, as well as better utilization of resources for automatic triaging of the tickets in large software products and platforms. Starting from a single long piece of text as the ticket description, embodiments of the concepts disclosed herein add metadata and structured information to the ticket object, which allows a ticket to be mapped easily to its correct context in the internal business operations and ticket resolution systems.
[0052] To accomplish this, in various examples, the proposed concept is composed of three major operations as shown in Fig. 3. Fig. 3 shows a schematic overview of three major operations in an example of the proposed pipeline, with corresponding modules to accomplish each operation. First, the “Ticket Processing” operation uses the Multi-modal Document Processing operation M1 on the entire ticket’s text and its multi-modal attachments, such as images or videos, to synthesize a “complete” ticket description, and then uses the Structured Information Extraction operation M2 to extract structured information broadly covering the various aspects of a software bug report. Next, the “Ticket Transformation” operation accepts the generic information extracted in the previous operation and, using the Context-specific Transformation operation M3 and the Natural Language Template Generation operation M4, applies context-specific transformations to tailor the ticket representation to a specific domain / area within the business. Finally, the “Ticket Retrieval” operation formulates, using the Query Formulation operation M5, custom queries from the contextualized representation of the ticket and uses these queries, in the Hybrid Search operation M6, to search for duplicate tickets in a repository of existing tickets. These operations are explained in detail below.
[0053] Ticket Processing (operations M1 and M2): This section describes a first operation of an example of the proposed concept, which is used to process the customer tickets into a complete, information-rich, structured format. This operation comprises or consists of two tasks or operations: (M1) Multi-modal Document Processing, and (M2) Structured Information Extraction.
[0054] M1 Multi-modal Document Processing is responsible for thoroughly analyzing the customer ticket along with its multimedia attachments such as images or videos. Providing images of an issue (as screenshots) or video recordings of the workflow leading to the issue (as screencasts) is a relatively fast way of describing the issue compared to writing it in text. Therefore, in tickets that contain such attachments, the file attachments are processed first (M1.1) to convert them into text documents. In particular, with respect to the method of Figs. 1a and / or 1b, the act of determining 130 the second textual data may comprise, if the at least one attachment comprises audio data and / or video data, generating 132, using a speech-to-text model or a multi-modal model, a transcript of spoken content included in the audio data and / or video data, with the second textual data being based on the transcript. Similarly, the act of determining 130 the second textual data may comprise, if the at least one attachment comprises video data and / or image data, generating 134, using a multi-modal model, a textual description of at least one image frame included in the video data and / or image data, with the second textual data being based on the textual description, and / or performing 136 optical character recognition on at least one image frame included in the video data and / or image data, with the second textual data being based on the optical character recognition.
[0055] Next, these descriptive documents are reformulated into a coherent ticket via a rewriting module (M1.2). For example, with respect to the method of Figs. 1a and / or 1b, in the act of determining 130 the second textual data, the second textual data may be generated using a language model or a multi-modal model. In particular, the act of determining 130 the second textual data may comprise summarizing 138 information contained in the at least one attachment (using the language model). This has two benefits: (1) if a ticket partially describes the issue and refers to the attachments for the remaining information, a “complete” ticket is derived; (2) if a ticket contains only attachments without any text description, essentially, a “new” ticket containing a coherent description of the issue based on the provided attachments is synthesized. The ticket rewriting module may also account for redundancy in the descriptive documents and organize the text in a coherent fashion. Thus, the method of Figs. 1a and / or 1b may comprise removing 140 duplicate information from at least one of the first textual data, the second textual data and the combined textual data. Additionally, or alternatively, the method of Figs. 1a and / or 1b may comprise rewriting the combined textual data, e.g., based on the first and second textual data or based on an initial version of the combined textual data, into a coherent text. A specialized LLM may be used to perform content deduplication and reorganization of key details of the ticket, given multiple attachments.
[0056] The ticket rewriting module can be implemented in the following ways. For example, the ticket rewriting module may be implemented using semantic clustering: Given the text descriptions derived from multimedia attachments (and from the description portion), the text descriptions may be split into sentences (or chunks), e.g., divided 141 into chunks in Fig. 1b, and arranged in a sequential fashion. Then, each of the sentences / chunks may be embedded using an embedding model (e.g. E5 or GPT3) to obtain sentence embeddings (e.g., generating 142 the embeddings in Fig. 1b). Next, the sentences / chunks may be clustered based on their semantic similarity into a variable number of clusters. Clusters may be sorted based on the number of sentences / chunks or their quality and then organized as a coherent text which forms the reformulated ticket description. This results in the identification 143 and removal 144 of at least one chunk of text, and in rewriting 145 the remaining chunks into a coherent text.
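By way of non-limiting illustration, the semantic clustering variant may be sketched as follows. The sketch rests on several assumptions that are not part of the disclosure: the function names are hypothetical, a bag-of-words count vector stands in for a real embedding model such as E5, a simple greedy clustering with a similarity threshold of 0.5 is used, and the "rewriting" is reduced to joining one representative chunk per cluster.

```python
import math
import re
from collections import Counter


def embed(chunk):
    """Stand-in embedding (assumption): bag-of-words counts, not a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_chunks(chunks, threshold=0.5):
    """Greedy clustering: a chunk joins the first cluster whose seed chunk is similar enough."""
    vectors = [embed(c) for c in chunks]
    clusters = []  # each cluster is a list of chunk indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def reformulate(chunks, threshold=0.5):
    """Sort clusters by size (largest first) and keep one representative chunk per cluster."""
    clusters = cluster_chunks(chunks, threshold)
    clusters.sort(key=len, reverse=True)  # stable sort: ties keep their original order
    return " ".join(chunks[c[0]] for c in clusters)


chunks = [
    "Login button does nothing on click.",
    "The login button does nothing on click.",
    "Error 500 appears in the console.",
]
print(reformulate(chunks))
```

In this sketch, the two near-identical chunks fall into one cluster and only the first of them is retained, which corresponds to identifying 143 and removing 144 a redundant chunk. In practice, the final composition of the text may instead be delegated to a language model.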
[0057] Alternatively, the ticket rewriting module may be implemented using deduplication: Given the text descriptions from multimedia attachments and the text description in the ticket, redundant content may be identified using lexical or semantic overlap on chunk / sentence level. Thus, similar to semantic clustering, if the overlap is determined at the semantic level, the method of Fig. 1b may further comprise dividing 141 the respective textual data into chunks of text, generating 142 embeddings for the chunks of text, and identifying 143 and removing 144 duplicate chunks of text based on the overlap at the semantic level (between embeddings of the respective chunks). If lexical overlap is used, the act of generating 142 the embeddings can be omitted, and the duplicates can be identified 143 and removed 144 based on the lexical overlap of the respective text. For example, the first occurrence of duplicate content may be removed, and the deduplicated contents may be organized to form the reformulated ticket description. For example, the remaining chunks may be rewritten 145 into a coherent text.
[0058] In an alternative implementation, an LLM may be employed to automatically summarize the content of the derived text descriptions from multimedia attachments and / or the ticket description. This summarized document is considered as the reformulated ticket description. In other words, with respect to Figs. 1a and / or 1b, the act of removing duplicate information may comprise summarizing 146, using a language model, at least one of the first textual data, the second textual data and the combined textual data.
[0059] M2 Structured Information Extraction accepts the reformulated ticket description (from M1) and extracts information, e.g., using instruction-following LLMs, to unify the customer tickets into a single format for streamlined processing. Here, a predefined schema may be provided based on domain or product expertise. For instance, it is possible to outline a schema which captures all possible aspects of a typical software bug as shown in Fig. 4. Fig. 4 shows an example of a unified ticket schema for structured information extraction. This extracted information summarizes the ticket in a systematic manner by capturing relevant aspects of the ticket for accelerated handling which the ticket handlers may already use for processing an incoming ticket. Moreover, tickets from customers may be written in varying styles or contain different types of attachments which can be unified via structured information extraction resulting in a common schema that all tickets conform to. This common schema is domain-agnostic, i.e., it is not tailored to a specific domain at hand, but is applicable to different domains. With respect to the method of Figs. 1a and / or 1b, the method may comprise transforming 160, using a language model, the combined textual data into structured data according to a domain-agnostic structured format (i.e., the common schema). From this common schema, domain-specific textual data representing the ticket may be generated (as shown in connection with operations M3 and M4). The resulting domain-specific textual data may be provided 190 alongside the information on the one or more similar tickets, and / or may be used for performing the similarity search 185, e.g., for formulating 180 the query(s) for the similarity search 185. The use of a common schema accelerates the subsequent operations such as automatic ticket assignment (routing), aggregating errors, platforms, among other aspects with lower cognitive effort.
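By way of non-limiting illustration, the lexical deduplication variant may be sketched as follows. The function names, the use of the Jaccard overlap of word sets as the lexical overlap measure, and the threshold of 0.8 are assumptions made for this sketch only. The sketch keeps the earliest chunk of each group of duplicates, which is one possible policy; removing the first occurrence instead, as mentioned above, would be an equally valid variation.

```python
import re


def split_into_chunks(text):
    """Divide text into sentence-level chunks at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def jaccard(a, b):
    """Lexical overlap of two chunks: shared words divided by all distinct words."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def deduplicate(chunks, threshold=0.8):
    """Keep a chunk only if it does not overlap strongly with an already kept chunk."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, other) < threshold for other in kept):
            kept.append(chunk)
    return kept


description = "The export fails with error E42. Please see the screenshot."
attachment_text = "The export fails with error E42! A dialog shows a red banner."
chunks = split_into_chunks(description) + split_into_chunks(attachment_text)
print(" ".join(deduplicate(chunks)))
```

No embeddings are generated in this variant, in line with the observation that the act of generating 142 the embeddings can be omitted when lexical overlap is used.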
[0060] Second major operation: Ticket Transformation. This section outlines a second operation of a concept according to an example of the present disclosure, which comprises or consists of two tasks / operations: (M3) Context-specific Transformation and (M4) Natural Language Template Generation.
[0061] M3 Context-specific Transformation converts the unified or common ticket schema (Fig. 4) to a more specific one that contains information relevant to the corresponding functional areas / divisions (i.e., domains) within a business that are responsible for handling a customer’s ticket. These internal divisions (also referred to as bounded contexts in domain driven design) have specific aspects pertaining to their components in the overall software product. For instance, a platform may have functional areas such as “UI Development”, “Transaction Management”, “Web & Mobile Banking”, “Payment Authentication” etc. Customer tickets may pertain to one or more of these functional areas and may be routed accordingly to efficiently manage resources.
[0062] The approach disclosed herein employs an LLM-based (or other language-model-based) transformation of the unified schema that is augmented with relevant knowledge of a specific functional area within a business domain. This knowledge may be retrieved from curated document databases (domain knowledge repository), or from pattern recognition applied on an existing repository of resolved tickets. Thus, with respect to the method of Figs. 1a and / or 1b, the method may comprise transforming 170, using a language model and a repository of domain-specific context, the structured data into domain-specific structured data according to a domain-specific structured format. When a customer creates a new ticket targeted to a specific functional area, it may be first transformed into the unified schema. Next, relevant aspects of the target functional area that are to be included in the ticket may be determined using an LLM (or other type of language model) with access to a domain knowledge repository (i.e., the repository of domain-specific context). For example, a ticket targeted at functional area “UI Development” may include aspects such as “browser”, “platform”, “web component”, “user interaction”, “labels” etc. while a ticket for the functional area “Transaction Management” may include “transaction date”, “payment method”, “sender”, “receiver”, “message” as aspects.
[0063] Given a functional area and its explanatory documents, the LLM (or other type of language model) may first identify a list of aspects that a ticket is to contain to be assigned to this functional area for resolution. Then, given the unified ticket schema created from M2, the LLM (or other type of language model) may extract relevant information pertaining to these aspects and create a new (contextualized) ticket schema tailored to this functional area (M3.2). A contextualized schema may also be predefined (as a template) for each functional area, which is then filled by the LLM (or other type of language model, M3.1) using the unified schema. This can be seen as transforming a generic schema to a domain-specific one. Fig. 5 exemplifies a unified schema extracted for a new ticket. Fig. 6 shows how this schema is transformed to fit the domain of “Web Banking and UI (User Interface) Development” by not only creating new keys but also rewriting the information from the unified schema. Fig. 5 shows a concrete example of extracted information based on a unified schema, and Fig. 6 shows an example of a contextualized transformation tailored to the domain of “Web Banking UI Development” applied to the output of a unified schema.
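By way of non-limiting illustration, the filling of a predefined contextualized schema (M3.1) may be sketched as follows. In the approach described above this filling is performed by a language model, which may also rewrite the information; the sketch replaces the model with a plain key mapping purely to show the data flow. The function name, the mapping, and all field names and values beyond those named in connection with Fig. 2 are hypothetical.

```python
def fill_contextualized_schema(unified, template):
    """Fill a predefined contextualized schema from the unified schema.

    `template` maps each contextualized key to the unified-schema key it draws
    from (an assumption of this sketch). Keys without a source value are kept
    with the value None so that missing aspects remain visible.
    """
    return {ctx_key: unified.get(src_key) for ctx_key, src_key in template.items()}


# Hypothetical unified schema extracted by M2 (values invented for illustration).
unified_ticket = {
    "Problem": "Login page does not load",
    "Version": "2.4.1",
    "Priority": "High",
    "Affected Component": "Web login form",
    "Error Message": "HTTP 500",
}

# Hypothetical predefined template for one functional area.
ui_template = {
    "Problem": "Problem",
    "Interface": "Affected Component",
    "Error Code": "Error Message",
    "Browser": "Browser",
}

print(fill_contextualized_schema(unified_ticket, ui_template))
```

Aspects that the unified schema does not cover, such as the hypothetical “Browser” key here, remain empty in this sketch; a language model with access to the domain knowledge repository may instead infer or rewrite such values.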
[0064] M4 Natural Language Template Generation is the next task in the ticket transformation operation which generates a natural language template based on the contextualized ticket schema from the previous operation (M3). Thus, the method may comprise generating 175 the domain-specific textual data representing the ticket based on the domain-specific structured data and based on a pre-defined or generated domain-specific template (i.e., the natural language template). Generating such a template from structured information can be done in several ways. In the proposed concept, a query language syntax is employed as the basis for creating semi-structured but self-contained templates of tickets. For instance, SQL syntax provides operators such as SELECT, WHERE, HAVING, LIKE, FROM which enable ordering contents from the structured information into a coherent template statement. An LLM (or other type of language-model) may be employed to synthesize (e.g., fill out) such templates using query language syntax. An alternative is to skip the SQL (Structured Query Language) operators and directly concatenate the values from the contextualized schema JSON (JavaScript Object Notation) (Fig. 6) via a separator (e.g. ‘|’) to create a readable template that only consists of the values of the relevant attributes. The idea is to enrich the embeddings of tickets with structured information that might otherwise be ambiguous / difficult to contextualize solely from the textual description of the ticket. Moreover, a human readable template serves as a summary of the extracted and contextualized ticket information which can be used for skimming by the support team. This summary presents key information extracted from different parts of the ticket in a coherent fashion.
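By way of non-limiting illustration, both template variants may be sketched as follows. The exact template layout is not specified above, so the function names, the chosen arrangement of the SELECT, FROM, WHERE and LIKE keywords, and the example schema values are assumptions of this sketch. The query-style output is a readable statement for embedding and skimming; it is not meant to be executed against a database.

```python
def pipe_template(schema, separator=" | "):
    """Concatenate the non-empty values of the contextualized schema via a separator."""
    return separator.join(str(v) for v in schema.values() if v not in (None, ""))


def query_style_template(schema, domain):
    """Arrange the schema contents into one statement using SQL-like keywords."""
    conditions = " AND ".join(
        f"{key} LIKE '{value}'"
        for key, value in schema.items()
        if key != "Problem" and value
    )
    statement = f"SELECT '{schema.get('Problem', '')}' FROM {domain}"
    return f"{statement} WHERE {conditions}" if conditions else statement


# Hypothetical contextualized schema (keys as in Fig. 2, values invented).
contextualized = {
    "Problem": "Login page does not load",
    "Interface": "Web login form",
    "Error Code": "HTTP 500",
}

print(pipe_template(contextualized))
print(query_style_template(contextualized, "web_banking"))
```

The separator-based variant contains only the values of the relevant attributes, whereas the query-style variant also keeps the attribute names, which may help an embedding model to disambiguate short values such as error codes.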
[0065] The third major operation or group of operations is ticket retrieval. The final operation after transforming the ticket is to use it for retrieving duplicates from the ticket database. To accomplish this, the ticket retrieval operation may first create suitable search queries from the natural language description (M5 Query Formulation) and then use these queries to perform a search (M6 Hybrid Search) which combines lexical and semantic search over the ticket database.
[0066] M5 Query Formulation is the task of constructing meaningful and unique search queries from a piece of text that optimize the coverage of the retrieved result set. In particular, with respect to the method of Figs. 1a and / or 1b, the method may comprise formulating 180 a query for performing the similarity search based on the combined textual data and / or based on domain-specific textual data representing the ticket, by determining keywords representing the ticket and formulating the query based on the determined keywords.
[0067] One approach is to extract keywords from the ticket description using an LLM which can then be used as queries. In other words, the process of determining the keywords representing the tickets may comprise extracting, using a language model, the keywords from the combined textual data or the domain-specific textual data representing the ticket. However, this introduces the problem of having too many distinct keywords (sparsity) if the ticket contains detailed technical information such as error descriptions, software components, etc. To mitigate this, clustering of the keywords (or key phrases) identified from the ticket description may be performed based on their semantic embeddings (meaning) and labels for each cluster may be generated using an LLM (or other type of language model). In other words, the process of determining the keywords representing the tickets may comprise clustering keywords or key phrases included in the combined textual data or the domain-specific textual data representing the ticket based on a semantic embedding of the respective keywords or key phrases, and generating, using a language model, labels for the clusters of keywords or key phrases, with the labels being used as the keywords representing the ticket. These cluster labels ideally capture the different facets of the ticket, for example: “database connection error”, “javascript blocked by client”, “transaction verification timeout”. These queries are more meaningful and avoid the sparsity issue described above.
[0068] Alternatively, or additionally, the domain-specific databases used in transforming the tickets (M3) can be used to expand the queries by adding technical words that best represent the target domain in addition to keywords from the ticket. In other words, the process of determining the keywords representing the tickets may comprise adding keywords from a repository of domain-specific context to represent the domain associated with the domain-specific textual data.
[0069] M6 Hybrid Search uses these new queries (or even the complete ticket description) to retrieve relevant candidates for potential duplicates by combining lexical search (e.g. BM25) with semantic search (e.g. contextual embeddings of the tickets from LLMs), e.g., using a vector database containing semantic embeddings of the existing tickets for the semantic search. In other words, the similarity search 185 may be performed using the lexical search and the semantic search. In scenarios with diverse functional areas each having many tickets, the semantic search can be further optimized to fetch candidates for a new ticket from the same functional area (i.e., domain). This reduces the number of comparisons the algorithm needs to make, thereby increasing the overall efficiency. To accomplish this, the proposed system may create multiple indexes and route the search query to the correct index based on the functional area (for instance) of the incoming ticket. Such indexes can be created for different aspects of the ticket such as software component, platform, etc.
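By way of non-limiting illustration, the routing of a search to a per-functional-area index may be sketched as follows. The class name, its methods and the in-memory list used as an "index" are hypothetical; a real deployment would typically hold one lexical index and one vector index per functional area. The fallback to all indexes for an unknown functional area is likewise an assumption of this sketch.

```python
from collections import defaultdict


class RoutedTicketIndex:
    """Keeps one candidate list per functional area and routes lookups to it."""

    def __init__(self):
        self._indexes = defaultdict(list)  # functional area -> [(ticket_id, text)]

    def add(self, area, ticket_id, text):
        """Store an existing ticket in the index of its functional area."""
        self._indexes[area].append((ticket_id, text))

    def candidates(self, area=None):
        """Return candidates from the matching index, or from all indexes as a fallback."""
        if area is not None and area in self._indexes:
            return list(self._indexes[area])
        return [entry for entries in self._indexes.values() for entry in entries]


index = RoutedTicketIndex()
index.add("UI Development", "T-1", "Login button does nothing on click")
index.add("Transaction Management", "T-2", "Transfer stuck in pending state")

# Only tickets of the same functional area are compared against the new ticket.
print(index.candidates("UI Development"))
```

Restricting the candidate set in this way reduces the number of comparisons per query, at the cost of missing duplicates that were filed under a different functional area.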
[0070] Since hybrid search comprises two search engines (lexical, semantic), there are multiple ways to combine the rankings from each search engine for a given query. One possibility is to apply reciprocal rank fusion for combining the rankings. In other words, respective search results of the lexical search and of the semantic search may be ranked using reciprocal rank fusion of the search results. An alternative strategy is to assign weights for lexical and semantic rankings and then fuse the rankings by sorting their combined weights. In other words, respective search results of the lexical search and of the semantic search may be ranked using a weighting-based ranking of the search results. The top-k ranked results are then presented as potential duplicates.
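By way of non-limiting illustration, the two fusion strategies may be sketched as follows. Reciprocal rank fusion sums, for each ticket, the term 1 / (k + rank) over all rankings in which the ticket appears; the constant k = 60 is a commonly used default and is an assumption here. The weighting-based variant is one possible reading of the strategy described above: it assumes that each search engine provides scores normalized to the range 0 to 1, and the weights of 0.4 and 0.6 are arbitrary example values. Function names and ticket identifiers are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ticket ids; ranks are 1-based."""
    scores = {}
    for ranking in rankings:
        for rank, ticket_id in enumerate(ranking, start=1):
            scores[ticket_id] = scores.get(ticket_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ticket id for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


def weighted_fusion(lexical, semantic, w_lexical=0.4, w_semantic=0.6):
    """Fuse two score dictionaries (ticket id -> normalized score) by a weighted sum."""
    ticket_ids = set(lexical) | set(semantic)
    combined = {
        t: w_lexical * lexical.get(t, 0.0) + w_semantic * semantic.get(t, 0.0)
        for t in ticket_ids
    }
    return sorted(combined, key=lambda t: (-combined[t], t))


lexical_ranking = ["T-7", "T-3", "T-9"]   # e.g. from a BM25 search
semantic_ranking = ["T-3", "T-5", "T-7"]  # e.g. from a vector database
top_k = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])[:3]
print(top_k)
```

Reciprocal rank fusion needs only the rank positions and therefore works even when the scores of the two search engines are not comparable, whereas the weighting-based variant allows one search engine to be favored explicitly.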
[0071] The proposed concept may be applied to various use cases. For example, the proposed concept may be applied, in the field of information processing, for detecting duplicate tickets in a ticket management system. A ticket management system faces the challenge of duplicate tickets, which occur when identical or highly similar tickets are created and processed multiple times. This leads to increased workload, inefficiency, and potential errors. Detecting duplicates is complex due to the technical nature of tickets, often involving file attachments, code snippets, and detailed descriptions. To address this issue, an approach is needed to automatically identify and merge duplicate tickets, saving time and resources for both the system and the ticket handlers.
[0072] In this use case, as data source, free text descriptions detailing the nature of the issue, along with relevant error codes, stack traces, and database queries for technical analysis may be used. Additionally, metadata such as the software version involved, may be used. Further, file attachments like screenshots and video recordings offer visual representations of the problem.
[0073] The proposed concept may effectively identify duplicate tickets by first transforming a new ticket into a structured representation. This involves parsing free text descriptions, error codes, stack traces, database queries, and even visual elements like screenshots and video recordings into a unified, structured format. By leveraging domain knowledge, weights may be assigned to different components of the structured representation, prioritizing critical information for accurate duplicate detection. Finally, the structured representation may be converted into a vector representation, enabling efficient comparison and identification of duplicates based on semantic similarity.
[0074] The proposed concept may return a ranked list of potential duplicate tickets (received from the ticket database), each assigned a probability score reflecting the likelihood of a true match.
[0075] The proposed concept provides a technical contribution in the fields related to the processing of tickets. For example, if someone submits a new ticket request to the ticket management system, the proposed solution searches for existing solutions. If the search is successful, the existing solution is returned to the human user. The distribution of the electronic files from the central server to the applicant could be technically implemented by enabling download of individual files directly from the central database to the computer on request of a customer.
[0076] Moreover, the solution presented to the requester can have a technical effect. The solution, i.e., the cognitive content of the ticket solution presented to the user relates to an internal state prevailing in a technical system and enables the user to properly operate this technical system (e.g., resolving database issues).
[0077] Furthermore, electronic patient records, containing physical attributes like blood pressure, heart rate, and temperature, can be considered a type of ticket. Processing these records, which represent physical entities, has a technical effect as it involves the manipulation and analysis of digital data that corresponds to real-world measurements.
[0078] Embodiments of the present disclosure further manage an internal vector database of the existing tickets. Each new ticket is transformed into a specific vector representation and stored in the vector database. An index structure used for searching a record in a database produces a technical effect since it controls the way the computer performs the search operation.
[0079] For example, the proposed concept may be applied to the Medical and Healthcare industry (e.g., Electronic Health Records, medical ticket system), IT and Technology (Help Desk and Customer Support), Manufacturing (track production issues, quality control, and maintenance requests), Government (manage citizen complaints, permit applications), Retail (manage customer returns, exchange and complaints), Digital Finance applications (e.g., banking ticketing system). In other words, the ticket may comprise, represent or be associated with an electronic health record. The ticket may be an IT or technology helpdesk ticket or customer support ticket. The ticket may be a ticket for tracking a production issue, a ticket being used as part of quality control, or a maintenance request ticket. The ticket may be a ticket related to a citizen complaint or to a government permit. The ticket may be a ticket associated with a customer return or an exchange or a ticket associated with a customer complaint.
[0080] Various examples of the proposed concept are based on integrating multi-modal and text generation models to transform customer tickets containing file attachments such as images and videos into focused descriptions. Relevant information from images and videos may be identified by transforming them into text or structured information. Redundant information may be identified from the ticket’s textual description if it is already presented in the provided file attachments, or non- informative descriptions may be replaced with key information from the attachments. A coherent textual description may be composed, combining information from the previous two operations.
[0081] Various examples of the proposed concept are further based on transforming extracted information from the tickets into domain-specific schemas. This can involve one of predefined schemas designed based on domain knowledge, and synthesizing custom schemas based on external knowledgebases. The proposed concept may further include generating a natural language template based on the contextualized schema that clearly conveys the different components of the ticket such as error messages, observed behavior, ticket summary, among other information. The proposed concept may further include formulating informative queries based on the generated natural language template.
[0082] The present disclosure provides methods and systems for detecting duplicate customer service tickets from a repository of existing tickets, comprising one or more of the following operations / components: (1) The proposed concept comprises identifying if a ticket has additional modalities besides text, such as images or video recordings demonstrating the issue. (2) The proposed concept comprises processing the text, image and video to extract clean text combining all the information (M1). (3) The proposed concept comprises using the extracted text description to find similar tickets in the repository of existing tickets via lexical or semantic search (M6). (4) Optionally, the proposed concept comprises providing a general schema capturing the various aspects of the ticket such as affected systems, error messages, versions, etc. (5) Optionally, the proposed concept comprises applying information extraction with the general schema as basis for transforming the ticket into a structured format (M2). (6) Optionally, the proposed concept comprises integrating domain information to contextualize the schema (M3). (7) Optionally, the proposed concept comprises transforming the domain-contextualized schema into a natural language template (M4). (8) Optionally, the proposed concept comprises formulating representative search queries from this template for finding duplicate tickets from the repository. Operations 5 to 8 may be applied on all tickets in the repository (see M5). (9) Optionally, the proposed concept comprises employing a hybrid search combining lexical search and semantic search using contextual embeddings (M6). (10) Optionally, the proposed concept comprises fusing the rankings from several search engines or re-ranking the results from a single search engine based on relevance criteria or metadata (M6). (11) The proposed concept comprises presenting the top search results to the ticket handler alongside the current ticket.
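The fusing of rankings from several search engines mentioned in operation (10) may, for example, be realized by reciprocal rank fusion. The following is a minimal sketch under stated assumptions: the ticket identifiers are invented, and the constant k = 60 is a commonly used default for reciprocal rank fusion rather than a value taken from the present disclosure.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of ticket IDs; each list is ordered best match first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, ticket_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ticket it contains.
            scores[ticket_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ticket ID for determinism.
    return sorted(scores, key=lambda tid: (-scores[tid], tid))


lexical_results = ["T-7", "T-3", "T-9"]   # hypothetical output of a lexical search
semantic_results = ["T-3", "T-5", "T-7"]  # hypothetical output of a semantic search
print(reciprocal_rank_fusion([lexical_results, semantic_results]))
# prints ['T-3', 'T-7', 'T-5', 'T-9']
```

Tickets that appear near the top of both lists accumulate the highest fused score, so that a ticket retrieved by both the lexical and the semantic search is ranked ahead of a ticket retrieved by only one of them.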
[0083] Incident-aware duplicate ticket aggregation (NPL 1, Non-Patent Literature 1, Jinyang et al.: “Incident-aware Duplicate Ticket Aggregation for Cloud Systems.”) is tailored to high-availability systems and assumes that a monitoring system that periodically fires incident alerts with runtime information for linking the tickets is present in the pipeline. This can be prohibitively costly for small and medium-scale organizations to set up. Compared to NPL 1, the proposed concept does not require any runtime information and can work with only the information provided in the tickets, given a repository of previously resolved tickets. Moreover, NPL 1 also does not link to previously resolved duplicate tickets and merely forwards the cluster of tickets to human handlers. The proposed concept can directly link to resolved tickets to enable human handlers to quickly dispatch solutions. Finally, NPL 1 only works with text-based tickets while the proposed concept works with multimodal tickets containing images and file attachments.
[0084] NPL 2 (Z. Xu et al.: “Retrieval-augmented generation with knowledge graphs for customer service question answering”) builds knowledge graphs to “generate” answers to frequently asked questions from the customers using retrieval augmented generation (RAG). A key limitation of this approach is that it does not exactly identify duplicate issues given a repository of previously resolved issues. At most, it generates information from related issues, and it is unsuitable for highly technical tickets containing error logs, code, and attachments as this approach is limited to textual tickets only.
[0085] Compared to NPL 2, the proposed concept overcomes these limitations in two ways: by creating complete tickets through extracting information from multimedia attachments, and by creating structured templates capturing the different aspects of the technical tickets such as error logs, expected behavior, affected components, platform versions, etc. Moreover, automatically generating answers may introduce hard-to-verify hallucinations. The proposed concept avoids such errors by directly linking to the relevant tickets.
[0086] Cupid (NPL 3, Ting Zhang et al.: “Leveraging ChatGPT for More Accurate Duplicate Bug Report Detection”) uses ChatGPT to identify keywords from bug reports which are used as queries for lexical search to retrieve duplicate reports. This is suboptimal since solely using a sparse set of keywords may result in many candidates that share generic aspects such as software platform versions or browser versions across functionally different business domains. Compared to NPL 3, various examples of the proposed concept employ hybrid search combining lexical search and embedding-based semantic search which matches tickets by meaning in addition to lexical patterns. In addition, the proposed concept may create contextualized queries (instead of simple keywords) augmented via domain-specific information to optimize retrieval of duplicates from similar business domains.
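To illustrate why matching by meaning in addition to lexical patterns may help, a weighting-based combination of a lexical score and a semantic score may be sketched as follows. All names, the weight `alpha` and the toy vectors are assumptions made for illustration: Jaccard overlap stands in for a lexical ranker, and the short hand-written vectors stand in for embeddings produced by an embedding model.

```python
import math


def lexical_score(query_tokens: set[str], doc_tokens: set[str]) -> float:
    """Jaccard overlap of token sets, a simple stand-in for a lexical ranker."""
    union = query_tokens | doc_tokens
    return len(query_tokens & doc_tokens) / len(union) if union else 0.0


def semantic_score(query_vec: list[float], doc_vec: list[float]) -> float:
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.hypot(*query_vec) * math.hypot(*doc_vec)
    return dot / norm if norm else 0.0


def hybrid_score(lexical: float, semantic: float, alpha: float = 0.5) -> float:
    """Weighted combination; alpha is the weight given to the lexical score."""
    return alpha * lexical + (1.0 - alpha) * semantic


# Two candidates share the same generic keywords with the query, but only the
# first one has a (toy) embedding pointing in the same direction as the query.
query_tokens = {"login", "error", "chrome"}
candidate_tokens = {"login", "error", "firefox"}
query_vec = [1.0, 0.0]
same_issue = hybrid_score(lexical_score(query_tokens, candidate_tokens),
                          semantic_score(query_vec, [1.0, 0.0]))
other_issue = hybrid_score(lexical_score(query_tokens, candidate_tokens),
                           semantic_score(query_vec, [0.0, 1.0]))
print(same_issue > other_issue)
```

In this sketch both candidates obtain the same lexical score, and only the semantic component separates the candidate describing the same issue from the one that merely shares generic keywords.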
[0087] NPL 4 (Chengnian Sun: “Towards more accurate retrieval of duplicate bug reports”) organizes the repository of bug reports in buckets with each bucket having at least one distinct master report. Subsequently filed bug reports are then mapped to these buckets by computing similarity with the master report of each bucket. A major limitation of this pipeline is that a human triage of all the existing bug reports must be performed where duplicates have been labeled and assigned to a master report in a bucket. The proposed concept is more efficient as it eliminates the need for such labor-intensive triaging of all reports and only relies on the ticket contents to index them for hybrid search.
[0088] NPL 5 (Per Runeson et al.: “Detection of Duplicate Defect Reports Using Natural Language Processing”) employs a basic natural language pipeline consisting of tokenization, stemming, and count vectors for words to find duplicate tickets. In addition, NPL 5 also limits the scope of tickets to a specific time frame to find duplicates. Such a surface-level approach may be deemed insufficient for finding complex tickets containing software code, error logs, and attachments. Various examples of the proposed concept overcome this by utilizing contextual embeddings of the complete ticket content, including those from attachments extracted via multimodal LLMs. These embeddings provide semantic understanding of the tickets enabling more accurate retrieval of duplicate tickets with the same underlying issue but described differently.
[0089] For encapsulating domain-specific information in the pipeline, NPL 6 (Abram Hindle et al.: “A contextual approach towards more accurate duplicate bug report detection and ranking”) uses a predefined vocabulary consisting of architecture and platform-specific word lists and topics from LDA modelling as features for classification and similarity matching for detecting duplicates. This requires a new vocabulary to be created for each new product or platform where ticket management is needed. Also, when a new ticket arrives that does not contain these features, the classification model cannot match it to an existing duplicate. The proposed concept provides more flexibility in that LLMs are leveraged as core components of the proposed processing pipeline, which implicitly captures these features via semantic search when matching tickets. Thus, the proposed concept can be used out of the box with minimal preprocessing of the existing tickets.
[0090] The proposed concept may be implemented as a computer-implemented method, computer system (comprising one or more processors and one or more storage devices) configured to perform the computer-implemented method and / or as a computer program for performing the computer-implemented method. For example, the computer-implemented method may include one or more steps and / or operations discussed above.
[0091] Various aspects of the proposed concept relate to machine learning. In particular, the model mentioned above may be a machine learning model.
[0092] Machine learning is a branch of artificial intelligence that involves the development of algorithms and models that allow computers to learn and make predictions or decisions without being explicitly programmed. It focuses on creating systems that can improve their performance over time by learning from data.
[0093] Training a machine-learning model refers to the process of teaching the model to make accurate predictions or decisions. During training, the model is exposed to a large amount of data, which is used to adjust the model's internal parameters or weights. The model learns patterns, relationships, or rules from the training data, allowing it to generalize and make predictions on new, unseen data.
[0094] Training data is the set of examples or instances that is used to teach a machine-learning model. It is often labeled data, meaning that each example is associated with a known outcome or target value. The training data consists of both input features and the corresponding output or target variable. The model learns from this data by analyzing the patterns and relationships between the input features and the target variable. Training algorithms, such as supervised learning, semi-supervised learning, unsupervised learning or reinforcement learning may be used for training the machine-learning model.
[0095] Machine-learning models, such as the machine-learning model being trained in the present disclosure, are often implemented as Artificial Neural Networks (ANNs), and in particular Deep Neural Networks, Support Vector Machines, Decision Tree models, or Random Forest models.
[0096] Examples may involve or relate to computer programs, including program codes to execute one or more of the mentioned methods when the program is executed on a computer, processor, or other programmable hardware component. As a result, steps, operations, or processes from various methods described above can also be executed by computers, processors, or other programmable hardware components. Examples may additionally cover program storage devices, such as digital data storage media, which are machine-, processor-, or computer-readable and encode and / or contain machine-executable, processor-executable, or computer-executable programs and instructions. These devices may include or be digital storage devices, magnetic storage media like magnetic disks and tapes, hard disk drives, or optically readable digital data storage media, for instance. Other examples encompass computers, processors, control units, field programmable logic arrays (FPLAs), field programmable gate arrays (FPGAs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), integrated circuits (ICs), or system-on-a-chip (SoC) systems that are programmed to carry out the steps of the aforementioned methods. In simpler terms, examples may involve computer programs and storage media comprising computer programs, as well as hardware components like processors and control units, which can be programmed to execute the methods described above.
[0097] When certain aspects are mentioned in relation to a device or system, they should also be considered as descriptions of the corresponding methods. For example, a block, component, or functional aspect of the device or system may correspond to a method step or feature of the related method. Therefore, aspects described regarding a method should also be understood as depicting a corresponding element, property, or functional feature of the corresponding device or system. In simpler terms, if something is described in relation to a device or system, it can also be applied to the corresponding method, and vice versa. Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
[0098] List of reference signs
[0099] System
[0100] Interface
[0101] Processor
[0102] Memory / storage
[0103] Obtaining a ticket
[0104] Extracting first textual data
[0105] Extracting second textual data
[0106] Generating a transcript
[0107] Generating a textual description of an image frame
[0108] Performing OCR
[0109] Summarizing information
[0110] Removing duplicate information
Dividing textual data into chunks
Generating embeddings
[0111] Identifying redundant chunks
[0112] Removing a redundant chunk
[0113] Rewriting textual data
[0114] Summarizing textual data
[0115] Determining combined textual data
[0116] Transforming the combined textual data into structured data according to domain-agnostic format
Transforming the structured data into domain-specific structured data according to domain-specific format
[0117] Generating domain-specific textual data
[0118] Formulating a query
[0119] Performing a similarity search
[0120] Providing information on the ticket and on one or more existing tickets
[0121] Customer ticket
[0122] Reformulated ticket description
230 Unified ticket schema
[0123] 240 Contextualized ticket schema
[0124] 250 Contextualized ticket description
[0125] 260 Contextualized search queries
[0126] M1 Multi-modal document processing
[0127] M1.1 File understanding
[0128] M1.2 Ticket rewriting
[0129] M2 Structured information extraction
[0130] M3 Context-specific transformation
[0131] M3.1 Schema filtering
[0132] M3.2 Schema synthesis
[0133] M4 Natural language ticket template generation
[0134] M5 Query formulation
[0135] M6 Hybrid search
Claims
1. Computer-implemented method for optimizing retrieval of similar tickets in a ticket management system using machine learning, the method comprising: obtaining (110) a ticket being input into the ticket management system; extracting (120) first textual data from a description portion of the ticket; determining (130), if the ticket comprises at least one attachment, second textual data representing the at least one attachment; determining (150) combined textual data based on the first textual data and, if determined, based on the second textual data; performing (185) a similarity search based on the combined textual data and a repository of existing tickets using a lexical search and / or a semantic search to determine one or more existing tickets being similar to the ticket; and providing (190) information on the one or more existing tickets being similar to the ticket.
2. The method according to claim 1, wherein the act of determining (130) the second textual data comprises, if the at least one attachment comprises audio data and / or video data, generating (132), using a speech-to-text model or a multi-modal model, a transcript of spoken content included in the audio data and / or video data, with the second textual data being based on the transcript.
3. The method according to one of the claims 1 or 2, wherein the act of determining (130) the second textual data comprises, if the at least one attachment comprises video data and / or image data, generating (134), using a multi-modal model, a textual description of at least one image frame included in the video data and / or image data, with the second textual data being based on the textual description, and / or wherein the act of determining (130) the second textual data comprises, if the at least one attachment comprises video data and / or image data, performing (136) optical character recognition on at least one image frame included in the video data and / or image data, with the second textual data being based on the optical character recognition.
4. The method according to one of the claims 1 to 3, wherein the act of determining (130) the second textual data comprises summarizing (138) information contained in the at least one attachment.
5. The method according to one of the claims 1 to 4, wherein, in the act of determining (130) the second textual data, the second textual data is generated using a language model or a multi-modal model.
6. The method according to one of the claims 1 to 5, wherein the method comprises removing (140) duplicate information from at least one of the first textual data, the second textual data and the combined textual data.
7. The method according to claim 6, wherein the act of removing duplicate information comprises dividing (141) at least one of the first textual data, the second textual data and the combined textual data into chunks of text, generating (142), using an embedding model, embeddings representing the chunks of text, identifying (143) redundant chunks of text based on the embeddings representing the chunks of text, removing (144) at least one redundant chunk of text, and rewriting (145), using a language model, at least one of the first textual data, the second textual data and the combined textual data into a coherent text without the removed at least one redundant chunk of text, or wherein the act of removing duplicate information comprises summarizing (146), using a language model, at least one of the first textual data, the second textual data and the combined textual data.
8. The method according to one of the claims 1 to 7, wherein the method comprises transforming (160), using a language model, the combined textual data into structured data according to a domain-agnostic structured format, generating (175) domain-specific textual data representing the ticket based on the structured data, and providing (190) the domain-specific textual data representing the ticket alongside the information on the one or more tickets being similar to the ticket and / or using the domain-specific textual data representing the ticket for the similarity search (185).
9. The method according to claim 8, wherein the method comprises transforming (170), using a language model and a repository of domain-specific context, the structured data into domain-specific structured data according to a domain-specific structured format, and generating (175) the domain-specific textual data representing the ticket based on the domain-specific structured data and based on a pre-defined or generated domain-specific template.
10. The method according to one of the claims 1 to 9, wherein the method comprises formulating (180) a query for performing the similarity search based on the combined textual data and / or based on domain-specific textual data representing the ticket, by determining keywords representing the ticket and formulating the query based on the determined keywords.
11. The method according to claim 10, wherein the process of determining the keywords representing the tickets comprises extracting, using a language model, the keywords from the combined textual data or the domain-specific textual data representing the ticket, and / or wherein the process of determining the keywords representing the tickets comprises clustering keywords or key phrases included in the combined textual data or the domain-specific textual data representing the ticket based on a semantic embedding of the respective keywords or key phrases, and generating, using a language model, labels for the clusters of keywords or key phrases, with the labels being used as the keywords representing the ticket, and / or wherein the process of determining the keywords representing the tickets comprises adding keywords from a repository of domain-specific context to represent the domain associated with the domain-specific textual data.
12. The method according to one of the claims 1 to 11, wherein the similarity search (185) is performed using the lexical search and the semantic search, wherein respective search results of the lexical search and of the semantic search are ranked using one of reciprocal rank fusion of the search results and weighting-based ranking of the search results, and / or wherein the semantic search is performed using a vector database containing semantic embeddings of the existing tickets, and / or wherein the lexical search is performed using the word-based indexing of the textual data of the existing tickets.
13. The method according to one of the claims 1 to 12, wherein the method comprises providing (190) a representation of the ticket alongside a representation of the one or more existing tickets being similar to the ticket to a user via a user interface.
14. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to one of the claims 1 to 13.
15. A system (10) comprising interface circuitry (12), machine-readable instructions, and processor circuitry (14) to execute the machine-readable instructions to carry out the method according to one of the claims 1 to 13.