Data enrichment techniques using large language models

Large language models are used for prompt engineering to address the limitations of existing data enrichment methods, enhancing data quality and model performance by generating accurate vertical predictions and reducing high cardinality, thus improving AI/ML model efficiency and security.

WO2026128384A1 · PCT designated stage · Publication Date: 2026-06-18 · EQUIFAX INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
EQUIFAX INC
Filing Date
2025-12-08
Publication Date
2026-06-18


Abstract

In one example, a computer-implemented method includes receiving, by a processor, a query comprising an entity name, where the entity name is representative of an entity. The method includes retrieving, by the processor and from one or more repositories, an entity data set comprising records associated with the entity and prompt engineering a large language model (LLM) with the entity data set. The method includes generating, by the processor and using the prompt engineered LLM, a vertical prediction based at least in part on the entity name and assigning the vertical prediction to an entity record corresponding to the entity. The method also includes storing, by the processor, the entity record in an enriched entity data set.

Description

Attorney Docket No. 096923-1530227EFX-208WO

DATA ENRICHMENT TECHNIQUES USING LARGE LANGUAGE MODELS

Cross-Reference to Related Applications

[0001] This application claims the benefit of U.S. Provisional Application No. 63/729,573, filed December 9, 2024, and entitled “DATA ENRICHMENT TECHNIQUES USING LARGE LANGUAGE MODELS,” and U.S. Provisional Application No. 63/745,129, filed January 14, 2025, and entitled “FRAUD DETECTION USING MACHINE MODELS,” the entire contents of each of which is incorporated herein by reference in its entirety for all purposes.

Technical Field

[0002] The present disclosure relates generally to artificial intelligence. More specifically, but not by way of limitation, this disclosure relates to systems and methods for enriching data using large language models.

Background

[0003] The increasing prevalence of artificial intelligence (“AI”) and machine-learning (“ML”) models across various industries has created a growing demand for high-quality training data. ML models require large amounts of diverse and relevant data to learn patterns, relationships, and decision boundaries. But collecting, labeling, and curating such datasets can be a time-consuming, expensive, and often impractical task.

[0004] A significant challenge in AI / ML model development is the scarcity of relevant and diverse training data, which can result in models that are biased, overfit, or underfit, leading to poor performance on unseen data. To address this issue, data augmentation techniques have been employed to enhance the efficacy of training datasets. One such technique is data enrichment, which involves adding new features, attributes, or information to existing data to improve the quality of the existing data as employed within the AI / ML model through increased relevance and usefulness. However, some data enrichment methods, such as manual annotation, feature engineering, or rule-based approaches, have limitations including being labor-intensive, error-prone, and unable to capture complex relationships or generalize key features.

[0005] The emergence of large language models (“LLMs”) has transformed the field of natural language processing and has significant implications for AI / ML models. LLMs are deep learning models trained on vast amounts of text data to generate human-like language. By learning patterns, relationships, and structures of language, LLMs can generate coherent, context-specific text that is often indistinguishable from human-written text. Leveraging LLMs, researchers have developed various techniques, including data imputation, data transformation, data synthesis, feature extraction, and feature generation.

Summary

[0006] Various aspects of the present disclosure provide systems and methods for implementing data enrichment techniques. In one example, a computer-implemented method includes receiving, by a processor, a query comprising an entity name, where the entity name is representative of an entity. The method includes retrieving, by the processor and from one or more repositories, an entity data set comprising records associated with the entity and prompt engineering a large language model (“LLM”) with the entity data set. The method includes generating, by the processor and using the prompt engineered LLM, a vertical prediction based on the entity name and assigning the vertical prediction to an entity record corresponding to the entity. The method then includes storing, by the processor, the entity record in an enriched entity data set.

[0007] In another example, a system includes a processing device and a memory device in which instructions executable by the processing device are stored for causing the processing device to perform operations. The operations include receiving a query comprising an entity name, where the entity name is representative of an entity and retrieving, from one or more repositories, an entity data set comprising records associated with the entity and prompt engineering an LLM with the entity data set. The operations include generating, using the prompt engineered LLM, a vertical prediction based on the entity name and assigning the vertical prediction to an entity record corresponding to the entity. The operations then include storing the entity record in an enriched entity data set.

[0008] In yet another example, a non-transitory computer-readable storage medium having program code that is executable by a processor to cause a computing device to perform operations is described. The operations include receiving a query comprising an entity name, where the entity name is representative of an entity and retrieving, from one or more repositories, an entity data set comprising records associated with the entity and prompt engineering an LLM with the entity data set. The operations include generating, using the prompt engineered LLM, a vertical prediction based on the entity name and assigning the vertical prediction to an entity record corresponding to the entity. The operations then include storing the entity record in an enriched entity data set.

[0009] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, any or all drawings, and each claim.

[0010] The foregoing, together with other features and examples, will become more apparent upon referring to the following specification, claims, and accompanying drawings.

Brief Description of the Drawings

[0011] FIG. 1 is a block diagram of an operating environment that implements a data enrichment computing system according to certain examples.

[0012] FIG. 2 is a flow diagram of a process for implementing a data enrichment computing system according to certain examples.

[0013] FIG. 3 is a flow diagram of a process for training a machine-learning model on enriched data according to certain examples.

[0014] FIG. 4 is a flow diagram of a process for executing a machine-learning model trained on enriched data according to certain examples.

[0015] FIG. 5 is a block diagram illustrating an example of a computing device 500, which can be used to implement the data enrichment computing system according to certain examples.

Detailed Description

[0016] Certain aspects and features of the present disclosure address the above limitations of data enrichment techniques to provide improvements in artificial intelligence (“AI”) and machine-learning (“ML”) model development. The described techniques provide a variety of improvements including: (1) improving the quality and accuracy of generated text data by using a combination of large language models (“LLMs”) and other techniques; (2) increasing the diversity and novelty of generated text data by using techniques such as text augmentation and data imputation; and (3) improving the scalability and efficiency of data enrichment by using techniques such as parallel processing and distributed computing.

[0017] Certain aspects of the disclosure relate to data science applied to a variety of contexts including business-industry assignments. Aspects can be implemented to address problems of enriching data through feature engineering powered by LLMs. The context and prompt for a new feature may be provided through another known and factual categorical feature in a given dataset. Such implementations provide improvements over other feature engineering methods, such as aggregations or one-hot encoding, by reducing manual effort expenditures while also harnessing the semantic analysis capabilities of LLMs. The contextual information from other records can be leveraged to enhance the existing LLM output, enabling the generation of structured data that can be seamlessly integrated into the dataset using the processes outlined according to the described examples.

[0018] In some examples, a curated dataset for AI / ML modeling can include a variety of corpora related to entities such as commercial entities and company names. Each entity may demonstrate similar or different patterns depending on its vertical. As used herein, “vertical” can refer to an industry type or subsection of a larger market. Verticals can include, for example, one or multiple restaurant sector verticals, healthcare verticals, software services verticals, and the like. Challenges arise in attempting to assign verticals based on the lack of identifiers and ambiguity in identifiers that may be used to assign verticals. For example, if purchases were made at a location named “Mark’s”, it may be difficult to identify what vertical to assign to the location without more information. Particularly in ambiguous cases, an unconstrained model for assigning verticals may assign a large number of unique verticals to entities, leading to high cardinality with respect to industry vertical classifications. As an example, higher cardinality data sets can include multiple overlapping classifications, such as “restaurant”, “dining”, “eatery”, and the like, while lower cardinality data sets may truncate such classifications simply to “restaurant”.
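
As a non-limiting illustration of the cardinality concern described above, the following Python sketch counts the distinct labels in a small categorical column before and after overlapping labels are collapsed to a single vertical. The synonym map, the label values other than “restaurant”, “dining”, and “eatery”, and the function names are hypothetical and are not taken from the disclosure.

```python
# Hypothetical synonym map; a real deployment would define its own vertical list.
CANONICAL_VERTICAL = {
    "restaurant": "restaurant",
    "dining": "restaurant",
    "eatery": "restaurant",
    "healthcare": "healthcare",
    "clinic": "healthcare",
}


def cardinality(labels):
    """Return the number of distinct category values in a feature column."""
    return len(set(labels))


def collapse(labels, mapping=CANONICAL_VERTICAL, default="other"):
    """Map each raw label to its canonical vertical (case-insensitive)."""
    return [mapping.get(label.strip().lower(), default) for label in labels]


raw = ["restaurant", "Dining", "eatery", "clinic", "healthcare", "Eatery"]
collapsed = collapse(raw)

print(cardinality(raw), "distinct labels before collapsing")
print(cardinality(collapsed), "distinct labels after collapsing")
```

In this sketch the raw column carries six distinct labels, while the collapsed column carries two, which is the kind of reduction a constrained vertical list is intended to achieve.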

[0019] AI / ML models may be unable to include high cardinality features (e.g., features that have a large number of unique values or categories). High cardinality features in an AI / ML model can be problematic for several reasons such as overfitting, computational complexity, or lack of generalizability. Further, it may not be feasible to create too many segmentations that fit for each entity. Therefore, certain examples relate to assigning industry type information to a given entity regardless of whether the entity’s information is publicly available. By using verticals as a new feature in AI / ML modeling, it is possible to improve model performance lift as well as to better generalize the AI / ML model instead of including high cardinality features directly. By and large, LLMs leverage advanced encoding techniques to tap into vast amounts of training data, enabling the association of entity names with industries based on publicly available information and uncovering valuable context that would otherwise remain hidden.

[0020] Thus, aspects of the disclosure are directed to reducing the excessive cardinality of a highly predictive and entity-relevant categorical feature for AI / ML models. In various applications, a large number of entities may be associated with a high-risk profile, and may utilize lesser-known service providers, resulting in a lack of information on their respective categories. Other attempts to append category information to such entities have been met with limited success, defined by limitations such as low match rates using commercial databases; unavailability of category information in existing databases; limited success using alternative data sources, such as industry-specific lists; and unreliable or missing information such as addresses or ownership data that could be used to improve matching. Such limitations reflect the benefits of a more effective and efficient method for assigning categories to entities, particularly in situations where information is limited or unreliable.

[0021] The described data enrichment computing system provides for many advantages of using LLMs and prompt engineering to impute missing information and enrich data sets for subsequent use by AI / ML models. Examples of advantages include but are not limited to improvements in comprehensiveness, scalability, accuracy and potential for further advancement and refining. Regarding comprehensiveness, the described system is able to assign verticals based purely on entity name. Thus, regardless of whether the entity is known or unknown, or if the entity can be identified through other means (e.g., online searching via address, phone or other identifiable information), verticals may be assigned utilizing the predictive capabilities of the LLM prompt engineered on entity data sets. Regarding scalability, the predictive capabilities of the LLM can scale larger computing systems to accommodate new entities being input into repositories. The described techniques are able to quickly assign verticals for a variety of additional companies including newly formed companies and unknown small businesses, particularly where preexisting methods may be bottlenecked due to a lack of available information.

[0022] The described methodology also provides improved accuracy in vertical assignment. The LLM prompt engineered on entity data may be able to accurately assign verticals. The described data validation techniques of exact matching searching and fuzzy matching searching to produce validation scores, in addition to threshold confidence values used to ensure accurate assignment, can provide further means for ensuring the accurate assignment of entity verticals in the data enrichment process.
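
As a non-limiting illustration, the following Python sketch shows one way exact matching and fuzzy matching could be combined into a validation score that is compared against a threshold. The disclosure does not specify the matching algorithm, the reference data, or the threshold value; the use of the standard-library `difflib.SequenceMatcher`, the reference dictionary of known entity-vertical pairs, the 0.85 threshold, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def validation_score(entity_name, reference_names):
    """Return (score, matched_name): 1.0 on an exact match, else the best fuzzy ratio."""
    name = entity_name.strip().lower()
    best_score, best_match = 0.0, None
    for ref in reference_names:
        candidate = ref.strip().lower()
        if candidate == name:
            return 1.0, ref  # exact match short-circuits the fuzzy search
        score = SequenceMatcher(None, name, candidate).ratio()
        if score > best_score:
            best_score, best_match = score, ref
    return best_score, best_match


def validate_prediction(entity_name, predicted_vertical, reference, threshold=0.85):
    """Accept a prediction only when the closest reference entity clears the
    threshold and carries the same vertical as the prediction."""
    score, match = validation_score(entity_name, reference.keys())
    accepted = (
        match is not None
        and score >= threshold
        and reference[match] == predicted_vertical
    )
    return accepted, score


# Hypothetical reference pairs used only for this example.
reference = {"Mark's Diner": "restaurant", "Acme Health Clinic": "healthcare"}
print(validate_prediction("Marks Diner", "restaurant", reference))
```

Predictions that fail this check could be routed to manual review or re-prompted rather than stored, although the disclosure leaves that handling open.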

[0023] As a further benefit, the LLM model user interface provides for further advancements and refining. Static model parameters and limited context windows can be applied in a base implementation. In further implementations, coding agents may be employed to optimize more parameters such as maximum tokens and temperature to further streamline the process. Fine-tuning procedures such as Parameter-Efficient Fine-Tuning (“PEFT”) and Low-Rank Adaptation (“LoRA”) can be implemented to update the base implementation which may provide for additional improvements in accuracy.

[0024] Certain examples described herein improve the operations of various machine-learning models by describing particular rules for improving the training data used to produce the machine-learning models. High cardinality in training data contributes both to excess data that must be stored in training data repositories and also contributes to excessive noise that impairs the performance of any corresponding machine-learning models. Excess cardinality in training data can cause machine-learning models to learn patterns that do not generalize well, leading to overfitting and reduced interpretability. High cardinality also leads to computational inefficiency by requiring machine-learning models to iterate training on data that does not lead to improved interpretability. Particular rules described herein address the limitations of high-cardinality data by generating and assigning vertical predictions for given entity data to produce an enriched entity data set which may then be used to train machine-learning models in a manner that improves machine-learning models compared to prior techniques.

[0025] Certain examples described herein improve the security of various network environments based on particular rules for identifying anomalous behavior in the network environments. As described herein, electronically facilitated dispute processes may rely on server-hosted environments to review particular transactions. Particular computer environments, including merchant service networks, may be required to provide access to electronically hosted dispute services. However, required access to such networks presents security risks, where fraudulent entities may initiate bad-faith disputes to various transactions. During routine operations, such networks lack security measures in place to respond to bad-faith actors who may otherwise exploit the generally accessible network. Rules described here, including training a machine-learning model to detect anomalous behavior in disputes, may then be used to restrict or limit access to secure computing networks, including those hosting dispute services. Additionally or alternatively, automated authentication requests may be triggered responsive to detected anomalous behavior. In such a manner, the techniques described herein can improve network security and stability via analyzing network patterns and enriched data.

Example of a Data Enrichment Computing System

[0026] Referring now to the drawings, FIG. 1 is a block diagram of an operating environment configured to implement a data enrichment computing system according to certain examples. In the operating environment 100, the data enrichment computing system 130 trains or tunes models that can be used to enrich data for subsequent use according to a variety of configurations of AI / ML models. The data enrichment computing system 130 can further apply one or more algorithms involving several models to enrich data. FIG. 1 depicts examples of hardware components of the data enrichment computing system 130, according to some aspects. The data enrichment computing system 130 is a specialized computing system that may be used for processing large amounts of data using a large number of computer processing cycles. The data enrichment computing system 130 can include a model training server 110 for retrieving, training, and / or tuning machine-learning models such as LLMs 120. The data enrichment computing system 130 can further include a data enrichment server 118 for enriching data for use in AI / ML models through vertical prediction based on entity data 124 (including data corresponding to the candidate entity and several other entities) and attribute data 142.

[0027] The model training server 110 can include one or more processing devices that execute program code, such as a model training application 112. The program code is stored on a non-transitory computer-readable medium. The model training application 112 can execute one or more processes to execute and / or retrain an LLM for predicting an entity vertical based on entity data 124, attribute data 142, and model training samples 126. In some examples, model training server 110 can retrieve a prebuilt third party LLM 105 via the public data network 108 which is then subsequently trained according to certain examples.

[0028] In some aspects, the model training application 112 can train a machine-learning model including an LLM 120 utilizing model training samples 126 (e.g., including training entity data and training attributes). The model training samples 126 can be stored in one or more network-attached storage units on which various repositories, databases, or other structures are stored. An example of these data structures is the data repository 122. In the same or other examples, the model training application 112 can use model training samples 126 to prompt engineer, or “tune,” a prebuilt LLM subsequently employed by the data enrichment computing system 130.

[0029] Network-attached storage units may store a variety of different types of data organized in a variety of different ways and from a variety of different sources. For example, the network-attached storage unit may include storage other than primary storage located within the model training server 110 that is directly accessible by processors located therein. In some aspects, the network-attached storage unit may include secondary, tertiary, or auxiliary storage, such as large hard drives, servers, virtual memory, among other types. Storage devices may include portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing and containing data. A machine-readable storage medium or computer-readable storage medium may include a non-transitory medium in which data can be stored and that does not include carrier waves or transitory electronic signals. Examples of a non-transitory medium may include, for example, a magnetic disk or tape, optical storage media such as a compact disk or digital versatile disk, flash memory, memory, or memory devices.

[0030] The data enrichment server 118 can include one or more processing devices that execute program code, such as a data enrichment application 114. The program code is stored on a non-transitory computer-readable medium. The data enrichment application 114 can execute one or more processes to utilize the LLM 120 trained by the model training application 112 based on entity data 124 to predict and assign entity verticals. For example, the data enrichment application 114 can apply one or more algorithms and models configured to predict verticals for a given entity and assign the entity the corresponding vertical.

[0031] Furthermore, the data enrichment computing system 130 can communicate with various other computing systems, such as client computing systems 104. For example, client computing systems 104 may send entity queries to the data enrichment server 118 to determine a vertical prediction for a given entity or may send signals to the data enrichment server 118 that control or otherwise influence different aspects of the data enrichment computing system 130. The client computing systems 104 may also interact with user computing systems 106 via one or more public data networks 108 to facilitate interactions between users of the user computing systems 106 and interactive computing environments provided by the client computing systems 104.

[0032] Each client computing system 104 may include one or more third-party devices, such as individual servers or groups of servers operating in a distributed manner. A client computing system 104 can include any computing device or group of computing devices operated by a seller, lender, or other providers of products or services. The client computing system 104 can include one or more server devices. The one or more server devices can include or can otherwise access one or more non-transitory computer-readable media. The client computing system 104 can also execute instructions that provide an interactive computing environment accessible to user computing systems 106. Examples of the interactive computing environment include a mobile application specific to a particular client computing system 104, a web-based application accessible via a mobile device, etc. The executable instructions are stored in one or more non-transitory computer-readable media.

[0033] The client computing system 104 can further include one or more processing devices that are capable of providing the interactive computing environment to perform operations described herein. The interactive computing environment can include executable instructions stored in one or more non-transitory computer-readable media. The instructions providing the interactive computing environment can configure one or more processing devices to perform operations described herein. In some aspects, the executable instructions for the interactive computing environment can include instructions that provide one or more graphical interfaces. The graphical interfaces are used by a user computing system 106 to access various functions of the interactive computing environment. For instance, the interactive computing environment may transmit data to and receive data from a user computing system 106 to shift between different states of the interactive computing environment, where the different states enable one or more electronic transactions between the user computing system 106 and the client computing system 104 to be performed.

[0034] In some examples, a client computing system 104 may have other computing resources associated therewith (not shown in FIG. 1), such as server computers hosting and managing virtual machine instances for providing cloud computing services, server computers hosting and managing online storage resources for users, server computers for providing database services, and others. The interaction between the user computing system 106 and the client computing system 104 may be performed through graphical user interfaces presented by the client computing system 104 to the user computing system 106, or through application programming interface (“API”) calls or web service calls.

[0035] A user computing system 106 can include any computing device or other communication device operated by a user, such as a consumer or a customer. The user computing system 106 can include one or more computing devices, such as laptops, smartphones, and other personal computing devices. A user computing system 106 can include executable instructions stored in one or more non-transitory computer-readable media. The user computing system 106 can also include one or more processing devices that are capable of executing program code to perform operations described herein. In various examples, the user computing system 106 can enable a user to access certain online services from a client computing system 104 or other computing resources, to engage in mobile commerce with a client computing system 104, to obtain controlled access to electronic content hosted by the client computing system 104, etc.

[0036] In some examples, the data enrichment computing system 130 can cause the user computing system 106, the client computing system 104, or a combination thereof to execute one or more actions in accordance with the generated, enriched data. For instance, as described in the example above, the data enrichment computing system 130 can generate enriched data for implementation in a subsequent AI / ML model 132, where the subsequent AI / ML model 132 can communicate with a user computing system to automatically cause one or more components of the user computing system 106 to reject a transaction based on the enriched data.

[0037] Each communication within the operating environment 100 may occur over one or more data networks, such as a public data network 108, a network 116 such as a private data network, or some combination thereof. A data network may include one or more of a variety of different types of networks, including a wireless network, a wired network, or a combination of a wired and wireless network. Examples of suitable networks include the Internet, a personal area network, a local area network (“LAN”), a wide area network (“WAN”), or a wireless local area network (“WLAN”). A wireless network may include a wireless interface or a combination of wireless interfaces. A wired network may include a wired interface. The wired or wireless networks may be implemented using routers, access points, bridges, gateways, or the like, to connect devices in the data network.

[0038] The number of devices depicted in FIG. 1 is provided for illustrative purposes. Different numbers of devices may be used. For example, while certain devices or systems are shown as single devices in FIG. 1, multiple devices may instead be used to implement these devices or systems. Similarly, devices or systems that are shown as separate, such as the model training server 110 and the data enrichment server 118, may be instead implemented in a single device or system.

Example of a Process for Enriching Data

[0039] FIG. 2 is a flow diagram of a set of operations 200 for implementing a data enrichment computing system (e.g., data enrichment computing system 130) according to certain examples. One or more computing devices (e.g., the data enrichment server 118) implement operations depicted in FIG. 2 by executing suitable program code (e.g., the data enrichment application 114). For illustrative purposes, the operations 200 are described with reference to certain examples depicted in the figures. Other implementations, however, are possible. While the blocks of the operations 200 are described in the temporal order below for illustrative purposes, it may be appreciated that the blocks can occur in any order, and some blocks may occur simultaneously.

[0040] At block 202 the operations 200 include receiving a query comprising an entity name where the entity name is representative of an entity. Generally, the process for data enrichment relates to assigning entities previously unassigned to a vertical to a proper vertical. Thus, the entities queried, via entry of the entity name, can correspond to entities without assigned verticals. In some examples, each entity within a repository may be queried, and as a threshold analysis, if the entity is determined to already be assigned a vertical, the operations 200 may terminate. In other examples, the initial repository queried can include only those entities already determined to not have a corresponding vertical. In further examples, entities already having a vertical assigned can be queried and processed per operations 200 to perform further validation.

[0041] At block 204 the operations 200 involve retrieving, from one or more repositories, an entity data set including records associated with the entity. Generally, the records in each repository can include entity-vertical pairing data. One repository of the one or more repositories can be an inquiry log table including transaction inquiry records for respective transactions. The inquiry log table can provide for a large set of industry codes (e.g., 100+ industry codes) to provide an initial reference point for vertical assignment. The inquiry log table may thus provide an initial data set for subsequent evaluation and refinement according to additional repositories providing smaller sets of industry codes for vertical assignment.

[0042] The one or more repositories storing entity data can include a repository storing transaction data. Unlike the inquiry log table, the repository including transaction data can include a reduced set of industry codes (e.g., on the order of 30 industry codes). As a greater number of industry codes would on its own contribute to high data cardinality, databases storing a truncated set of industry codes (e.g., a database storing 30 industry codes as opposed to 100+ industry codes), such as that provided by the repository including transaction data, may be preferred over other repositories such as the inquiry log table in generating the final set of enriched data.

[0043] The initial data and records retrieved from the one or more repositories can be filtered to eliminate records with null or missing values to refine the initial data and form a curated dataset. The data retrieved from the one or more repositories can thus represent an initial set of entity data. Records with null values, such as those lacking an entity name entry, a vertical entry, or an entity-record pairing, can then be identified and the identified records removed to generate the entity data set for use in prompt engineering (also referred to as “tuning”) an LLM. In such a way, incomplete data which would not assist in prompt engineering the LLM to assign verticals can be removed from the entity data set to improve prompt engineering efficiency.
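
As a non-limiting illustration of the filtering described above, the following Python sketch removes records with null or blank values from an initial set of records to form the entity data set. The field names `entity_name` and `vertical`, the sample records, and the function names are hypothetical; the disclosure does not specify a record schema.

```python
REQUIRED_FIELDS = ("entity_name", "vertical")  # hypothetical field names


def is_complete(record):
    """Return True when every required field is present and non-blank."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            return False
    return True


def curate(records):
    """Drop incomplete records to form the entity data set used for prompting."""
    return [record for record in records if is_complete(record)]


initial = [
    {"entity_name": "Mark's", "vertical": "restaurant"},
    {"entity_name": None, "vertical": "healthcare"},    # missing entity name
    {"entity_name": "Acme Software", "vertical": ""},   # blank vertical
    {"entity_name": "City Clinic", "vertical": "healthcare"},
]
entity_data_set = curate(initial)
print(len(initial), "records retrieved;", len(entity_data_set), "retained")
```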

[0044] At block 206 the operations 200 include prompt engineering an LLM with the entity data set. The LLM employed can generally include any pre-trained LLM capable of feature inference based on received prompts. Examples can include Google Gemini, Databricks Llama, Amazon Web Services (“AWS”) SageMaker and the like. As the LLM will generally be pre-trained, block 206 refers to a process of tuning the LLM through prompt engineering based on the entity data set. Thus, the LLM, as pre-trained, can use a preexisting corpus of training data to predict verticals, further augmented by the entity data set. The entity data set may be used to establish a predefined scope of verticals for assignment to a representative sample of entities. The use of the entity data set can serve as a reference point, providing a set of established vertical-industry pairing examples for further tuning and refining the accuracy of a prebuilt LLM at block 206. In some examples, the entity data set may also be used to train a new LLM that may then be further prompt engineered per block 206.

[0045] According to certain examples, prompt engineering the LLM based on the entity data can include prompt structuring and dialog tuning. Prompt structuring refers to the process of generating prompts for input into the LLM (e.g., specifying the inclusion of the entity data) to further train the LLM to predict verticals for a given entity. One example format of prompts used to assign verticals can include “[industry options] + [instructions] + [Entity List]” where the entity list represents the entity data set. In some cases, such prompts may be altered to achieve a desired output format or to achieve additional outputs. Dialog tuning refers to the process of refining the prompts to ensure extraction of the relevant data and ensure that the LLM is outputting accurate vertical predictions. For instance, if the entity data is too large, the LLM may only output, per block 208, a partial set of results due to token limits and other limitations of the LLM. Additional examples of dialog tuning can include refining instructions for output formatting, adding language that suppresses commentary beyond “entity: industry” pairs to conserve tokens, or adding prompts that direct the LLM to continue assigning verticals for long lists of entities should a first prompt not return the complete list.
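As a non-limiting illustration, the “[industry options] + [instructions] + [Entity List]” prompt format could be assembled as sketched below, with the entity list split into batches so that a single prompt is less likely to exceed a token limit. The function names, the batch size, the instruction wording, and the “entity: industry” response format parsed here are assumptions made for illustration only; no call to any particular LLM service is shown.

```python
def build_prompts(industry_options, instructions, entities, batch_size=50):
    """Build one prompt per batch of entities in the order options, instructions, list."""
    prompts = []
    for start in range(0, len(entities), batch_size):
        batch = entities[start:start + batch_size]
        prompts.append(
            "Industry options: " + ", ".join(industry_options) + "\n"
            + instructions + "\n"
            + "Entities:\n" + "\n".join(batch)
        )
    return prompts


def parse_pairs(response_text):
    """Parse lines of the assumed form 'entity: industry'; ignore any other lines."""
    pairs = {}
    for line in response_text.splitlines():
        entity, separator, industry = line.rpartition(":")
        if separator and entity.strip() and industry.strip():
            pairs[entity.strip()] = industry.strip()
    return pairs


# Hypothetical inputs for illustration.
prompts = build_prompts(
    industry_options=["Grocery", "Restaurants"],
    instructions="Assign one industry to each entity. Reply only with 'entity: industry' lines.",
    entities=["Jane's Provisions", "John's", "Main Street Diner"],
    batch_size=2,
)
print(len(prompts))
print(parse_pairs("Jane's Provisions: Grocery\nJohn's: Restaurants"))
```

Batching is one possible response to partial outputs; a follow-up prompt asking the LLM to continue, as described above, is another.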

[0046] At block 208 the operations 200 include generating, using the prompt engineered LLM, a vertical prediction based on the entity name. The vertical prediction represents a confidence of the LLM, based on the entity data used to further tune the LLM and the entity name input into the LLM, that the entity is to be assigned to a given vertical. As an example, an entity name entered into the LLM such as “Jane’s Provisions” may lead the LLM to output a confidence score of 0.99 that the entity is to be assigned to the vertical representing grocers. In another example, the entity name “John’s” may, depending on the entity data used to prompt engineer the LLM, yield a corresponding vertical prediction that John’s is to be assigned to a vertical defining restaurants with a 0.6 confidence score. As more entity data is provided to prompt engineer the LLM per blocks 204-206, the confidence scores and accuracies of the LLM may be improved.

[0047] At block 210 the operations 200 include assigning the vertical prediction to the entity. In some examples, the entity refers to a two-dimensional record including entity name and the assigned vertical prediction. Additional dimensions for each record can be included representing other dimensions of the entity. In some examples, the confidence score associated with the vertical prediction may also be stored as a dimension of data with the entity name and assigned vertical. In some examples, the vertical prediction may be assigned to the entity only if a threshold confidence is met or exceeded. For instance, returning to the examples discussed at block 208, the “Jane’s Provisions” entity may be assigned the grocer vertical prediction for meeting the confidence threshold (e.g., >= 0.8 confidence), while “John’s” remains unassigned because no corresponding vertical prediction meets the threshold confidence.
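As a non-limiting illustration, the threshold-based assignment could be sketched as follows, using the 0.8 threshold and the confidence scores from the examples above. The function name, the record layout, and the representation of predictions as a mapping of vertical to confidence are assumptions made for illustration.

```python
def assign_vertical(entity_name, predictions, threshold=0.8):
    """Assign the highest-confidence vertical only when it meets the threshold.

    `predictions` maps each candidate vertical to a confidence score.
    """
    if not predictions:
        return {"entity_name": entity_name, "vertical": None, "confidence": None}
    best_vertical, best_score = max(predictions.items(), key=lambda item: item[1])
    assigned = best_vertical if best_score >= threshold else None
    return {"entity_name": entity_name, "vertical": assigned, "confidence": best_score}


# Scores taken from the examples discussed at block 208.
print(assign_vertical("Jane's Provisions", {"Grocery": 0.99}))
print(assign_vertical("John's", {"Restaurants": 0.6}))
```

In this sketch the confidence score is retained as a further dimension of the record even when no vertical is assigned, which is one possible design choice.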

[0048] At block 212 the operations 200 include storing the entity in an enriched data set. The entity data, with the assigned vertical per block 210, may be stored in an enriched data set, where the enriched data set can be implemented for subsequent use according to a variety of AI / ML model configurations. The enriched data set, storing entity data with assigned verticals, can store the entity data with reduced cardinality due to the vertical assignments. As used herein, cardinality can refer to the number of unique values in a data set, particularly related to classifications. The reduced cardinality leads to an improved implementation of the subsequent AI / ML models, contributing to increased efficiencies in larger computing infrastructures.
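As a non-limiting illustration of cardinality as the number of unique values in a field, the following sketch compares the cardinality of an entity name field with that of an assigned vertical field. The records and field names are hypothetical.

```python
def cardinality(records, field):
    """Count the unique values that a field takes across the records."""
    return len({record[field] for record in records})


# Hypothetical enriched records: many entity names map to few verticals.
enriched_data_set = [
    {"entity_name": "Jane's Provisions", "vertical": "Grocery"},
    {"entity_name": "Corner Market", "vertical": "Grocery"},
    {"entity_name": "John's", "vertical": "Restaurants"},
    {"entity_name": "Main Street Diner", "vertical": "Restaurants"},
]

print(cardinality(enriched_data_set, "entity_name"))
print(cardinality(enriched_data_set, "vertical"))
```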

[0049] In some examples, the operations 200 can be supplemented with techniques for validating the accuracy of the vertical assignments, and for ensuring the accuracy of the enriched data set. For instance, techniques including exact matching (e.g., VLOOKUP) and fuzzy matching can be applied to compare entity data as stored across the various databases to ensure proper processing by the LLM.
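As a non-limiting illustration, exact and fuzzy matching between two sets of entity names, and a resulting percentage of matched records, could be sketched as follows. The sketch uses `difflib.SequenceMatcher` from the Python standard library as one possible similarity measure; the function names, the normalization step, and the 0.8 tolerance are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and collapse whitespace before fuzzy comparison."""
    return " ".join(name.lower().split())


def is_match(query, candidate, tolerance=1.0):
    """Exact match when tolerance is 1.0; otherwise a similarity-ratio match."""
    if tolerance >= 1.0:
        return query == candidate
    ratio = SequenceMatcher(None, normalize(query), normalize(candidate)).ratio()
    return ratio >= tolerance


def match_percentage(entity_names, enriched_names, tolerance=1.0):
    """Percentage of entity names with at least one match among enriched names."""
    if not entity_names:
        return 0.0
    matched = sum(
        any(is_match(name, candidate, tolerance) for candidate in enriched_names)
        for name in entity_names
    )
    return 100.0 * matched / len(entity_names)


# Hypothetical names for illustration.
entity_names = ["Jane's Provisions", "John's", "Acme Corp"]
enriched_names = ["Jane's Provisions", "Johns", "Other"]

print(match_percentage(entity_names, enriched_names))                 # exact matching
print(match_percentage(entity_names, enriched_names, tolerance=0.8))  # fuzzy matching
```

With these sample names, exact matching finds one of the three entities, while the fuzzy comparison also accepts “Johns” as a match for “John's”.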

[0050] For instance, a validation score can be generated representing the degree of confidence that the enriched data set accurately reflects the entity data used to generate the entity data set. In other words, such matching techniques may be applied to identify the overlap between entities in the enriched data set and those present in the training data. The validation score can be generated by performing, for each record in the entity data set, an exact matching search within the enriched entity data set to determine whether the entity record matches with an enriched entity record within the enriched entity data set. The validation score may then be based on the number of determined matches, such as a percentage of records that were determined to match. Similar techniques may be applied per a fuzzy matching search procedure where tolerances may be applied to allow for minor variations between the query entity record and a candidate matching enriched entity record.

Example of a Process for Training a Machine Learning Model on Enriched Data

[0051] As noted above, the data enrichment operations described according to operations 200 can facilitate the improved training of machine-learning models. The described data enrichment computing system can include a model training server 110 hosting a model training application 112, which can communicate with the data enrichment server 118. The operations 200 described in FIG. 2 can be used to generate enriched training data, where the enriched training data includes a data set with fewer verticals and lower cardinality compared to prior training data sets. The enriched training data can be stored in an enriched data set. The following described operations 300 of FIG. 3 describe further techniques for training a machine-learning model using the enriched data set.

[0052] FIG. 3 is a flow diagram of operations 300 for training a machine-learning model on enriched data according to certain examples. One or more computing devices (e.g., the data enrichment server 118) implement operations depicted in FIG. 3 by executing suitable program code (e.g., the data enrichment application 114). For illustrative purposes, the operations 300 are described with reference to certain examples depicted in the figures. Other implementations, however, are possible. While the blocks of the operations 300 are described in the temporal order below for illustrative purposes, it may be appreciated that the blocks can occur in any order, and some blocks may occur simultaneously.

[0053] At block 302 the operations 300 include accessing a repository including multiple data sets including a first data set and the enriched entity data set. The repository (e.g., repository 122) can accordingly store multiple sets of data. The first data set can correspond to the enriched data set, but without the assigned vertical predictions generated according to the operations 200 described with respect to FIG. 2. The enriched entity data set can include records storing at least two-dimensional data including an entity identifier and an assigned vertical prediction. The records can further include additional dimensions. In an example, each dimension, including the entity name, the assigned vertical prediction, and each further dimension, relates to chargeback disputes. Chargeback disputes refer to interactions where an entity may wish to reverse a transaction that the entity alleges was unauthorized. While certain chargeback disputes are valid, not all may be, and particular machine-learning models may be implemented to aid in distinguishing valid disputes from invalid disputes. Thus, in some examples, operations 300 can relate to training a chargeback dispute machine-learning model, trained to assess the validity of a received chargeback dispute.

[0054] At block 304 the operations 300 include receiving selected features to train a machine-learning model. The selected features can shape the functionality of the machine-learning model and will vary from application to application. Generally, however, one or more of the selected features can capture the enriched data according to the operations 200. In the chargeback dispute example, all features can characterize an associated entity, and at least one of the features can correspond to the assigned vertical prediction for the given entity. The chargeback dispute example can include other features such as timestamp features and temporal features. Timestamp features can include the date and time of a dispute interaction, such as the date of a transaction and / or date of a submitted chargeback. Temporal features can be built from timestamp features and relate to a difference in days between key dates in a transaction dispute, such as the number of days between an order or transaction being placed, and the date of the associated chargeback.
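As a non-limiting illustration, a temporal feature could be derived from two timestamp features as sketched below. The function name and the feature names in the returned dictionary are hypothetical.

```python
from datetime import datetime


def build_temporal_features(transaction_at, chargeback_at):
    """Derive a day-difference feature from two timestamp features."""
    transaction_date = transaction_at.date()
    chargeback_date = chargeback_at.date()
    return {
        "transaction_date": transaction_date.isoformat(),
        "chargeback_date": chargeback_date.isoformat(),
        # Number of calendar days between the transaction and its chargeback.
        "days_to_chargeback": (chargeback_date - transaction_date).days,
    }


features = build_temporal_features(
    datetime(2025, 1, 10, 9, 30),
    datetime(2025, 2, 9, 17, 0),
)
print(features)
```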

[0055] At block 306 the operations 300 include generating reduced vertical cardinality training data based on a subset of the enriched entity data set. The reduced vertical cardinality training data can provide an optimization compared to other training data, such as any training data acquired from the first data set. For example, excess cardinality in training data can result in the machine-learning models learning patterns that do not generalize well, leading to overfitting and reduced interpretability. High cardinality also leads to computational inefficiency by requiring machine-learning models to iterate training on data that does not lead to improved interpretability. Thus, the reduced cardinality training data can mitigate inefficiencies resulting from accessing another data set to generate the training data. The reduced cardinality training data may still include a large number of records in excess of a suitable number for model training. In the chargeback dispute example, refining the training data to more accurately represent different features can yield further improvement in model performance. Generating the training data based on a subset of the enriched entity data set can then be executed according to various sampling techniques. In an example, the model training application 112 can execute a stratified random sampling of the enriched entity data set to produce the training data. The stratified random sampling can include dividing the enriched entity data set into strata based on timestamp features (e.g., month-year categories identifying the month and year of a dispute). After dividing the enriched entity data set into strata, the model training application 112 can perform a random sampling of each stratum to generate the training data.
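As a non-limiting illustration, stratified random sampling over month-year strata could be sketched as follows. The field name `dispute_date`, the sampling fraction, the fixed random seed, and the rule of drawing at least one record per stratum are assumptions made for illustration.

```python
import random
from collections import defaultdict
from datetime import date


def stratified_sample(records, fraction, seed=0):
    """Group records by month-year of the dispute, then sample within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record["dispute_date"].strftime("%Y-%m")].append(record)

    sample = []
    for key in sorted(strata):
        stratum = strata[key]
        # Draw the same fraction from every stratum, with at least one record each.
        size = min(len(stratum), max(1, round(fraction * len(stratum))))
        sample.extend(rng.sample(stratum, size))
    return sample


# Hypothetical records: ten disputes in January 2025 and four in February 2025.
records = (
    [{"id": i, "dispute_date": date(2025, 1, 1 + i)} for i in range(10)]
    + [{"id": 10 + i, "dispute_date": date(2025, 2, 1 + i)} for i in range(4)]
)
training_data = stratified_sample(records, fraction=0.5)
print(len(training_data))
```

Sampling the same fraction from each stratum keeps the month-year proportions of the sample close to those of the full data set.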

[0056] At block 308 the operations 300 include training a machine-learning model based on the training data and selected features. Machine-learning model architectures to be trained can generally include supervised or semi-supervised classification models, including logistic regression models, support vector machines, and the like. Various machine-learning model architectures may be chosen as part of the training operations but will generally include supervised or semi-supervised training. In the chargeback dispute example, the machine-learning model can be trained according to supervised classification techniques trained to classify the likelihood that an entity is fraudulently executing a chargeback dispute. In similar chargeback dispute examples, unsupervised training techniques can be executed for exploratory analysis, such as in identifying clustered features that may be more predictive of fraudulent behavior. The identified clustered features may then be selected during a subsequent execution of operations 300 in retraining or tuning a machine-learning model for classifying the likelihood of fraud during a chargeback dispute.
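As a non-limiting illustration of one supervised classification model of the kind mentioned above, the following sketch fits a logistic regression model by batch gradient descent using only the Python standard library. The function names, the learning rate, the stopping loss, and the toy single-feature data are assumptions made for illustration and do not represent the training procedure of any particular library or of the model training application 112.

```python
import math


def predict_proba(features, weights, bias):
    """Forward pass: logistic (sigmoid) output for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, learning_rate=0.5, max_epochs=5000, target_loss=0.1):
    """Fit weights and bias by gradient descent until the loss target or epoch limit."""
    weights = [0.0] * len(rows[0])  # initialize parameters
    bias = 0.0
    n = len(rows)
    for _ in range(max_epochs):
        preds = [predict_proba(x, weights, bias) for x in rows]
        # Mean log loss evaluates the forward pass.
        loss = -sum(
            y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(preds, labels)
        ) / n
        if loss <= target_loss:
            break
        errors = [p - y for p, y in zip(preds, labels)]
        for j in range(len(weights)):
            gradient = sum(e * x[j] for e, x in zip(errors, rows)) / n
            weights[j] -= learning_rate * gradient
        bias -= learning_rate * sum(errors) / n
    return weights, bias


# Toy data: one scaled feature per row and a binary label.
rows = [[0.1], [0.2], [0.8], [0.9]]
labels = [0, 0, 1, 1]
weights, bias = train(rows, labels)
print(predict_proba([0.1], weights, bias), predict_proba([0.9], weights, bias))
```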

[0057] Depending on the selected machine-learning model type, various specific model training operations can be executed, including among others, initializing parameters including weights and biases, batch processing, and forward passing to generate initial predictions. Loss functions may then evaluate the forward pass so that the model training application 112 can execute gradient descent operations for reducing loss via further parameter tuning. The training process can then iterate until threshold machine-learning model performance is achieved.

Example Operations for Executing a Machine Learning Model Trained on Enriched Data

[0058] FIG. 4 is a flow diagram of a set of operations for executing a machine-learning model trained on enriched data according to certain examples. One or more computing devices (e.g., the data enrichment server 118) implement operations depicted in FIG. 4 by executing suitable program code (e.g., the data enrichment application 114). For illustrative purposes, the operations 400 are described with reference to certain examples depicted in the figures. Other implementations, however, are possible. While the blocks of the operations 400 are described in the temporal order below for illustrative purposes, it may be appreciated that the blocks can occur in any order, and some blocks may occur simultaneously. The machine-learning model executed according to operations 400 relates to a chargeback dispute trained model, described according to certain implementations of operations 300. It is to be appreciated that in other examples, different models trained according to operations 300 may be executed according to similar operations.

[0059] At block 402, the operations 400 include receiving an interaction inquiry associated with an interaction (for example, the interaction being a transaction and the interaction inquiry being a chargeback inquiry). The interaction inquiry can include messages received from a variety of client computing systems 104 including merchant computing systems. For instance, a user with access to a merchant computing system can select a specific interaction they would like to dispute and submit an interaction inquiry associated with that interaction. In the same or other examples, settings can be configured such that each interaction on a given computing system is received by the data enrichment computing system 130. According to some settings, specific entities (via associated accounts) can be identified and flagged such that each interaction associated with that entity automatically generates an interaction inquiry. According to the above examples, interaction inquiries can be generated in real-time, near real-time, or can be collected and transmitted according to a variety of other schedules.

[0060] At block 404, the operations 400 include retrieving interaction data associated with the interaction. The interaction data can be retrieved within the same message or data structure as the interaction inquiry. Thus, a given client computing system 104 may transmit both the interaction inquiry into an interaction and the associated interaction data. In the same and other examples, interaction data can also be retrieved via a data repository 122 communicatively coupled to the dispute computing system. As an example, the interaction inquiry received at block 402 may also be received with interaction data such as the name and date of the interaction and an identifier of the entity associated with the interaction. The data enrichment computing system 130 can then perform a search within the data repository 122 based on the interaction data, such as the entity identifier, to retrieve additional interaction data stored within the data repository 122, such as data associated with the entity.

[0061] At block 406, the operations 400 include applying the interaction data to a machine-learning model trained on dispute training data. The machine-learning model can generally include one or more of decision trees, random forests, logistic regression models or other classifier models. The machine-learning model, according to some examples, can be trained according to a variety of libraries such as eXtreme Gradient Boosting (“XGBoost”), Light Gradient Boosting Machine (“LightGBM”) and the like. The machine-learning model hyperparameters may also be tuned according to a variety of techniques including through application of optimization algorithms such as Tree-structured Parzen Estimator (“TPE”).

[0062] The dispute training data can be grouped according to feature engineering techniques to improve the accuracy of the model. Thus, the data enrichment computing system 130 can generate a set of features from the dispute training data, where the set of features is then used to train the machine-learning model. The sets of features can include timestamp features, temporal features, and vertical features. Timestamp features can include the date and time of the interaction, such as the date of a transaction and / or date of a submitted chargeback. Temporal features can be built from timestamp features and relate to a difference in days between key dates in an interaction procedure, such as the number of days between an order or transaction being placed, and the date of the associated chargeback. Vertical features can relate to the industry associated with the interaction.

[0063] In some examples, the various features and associated data can be enriched to reduce the cardinality (i.e., number of unique values in the dataset) and further improve machine-learning operations. For instance, verticals may be assigned through data enrichment techniques using large language models (“LLMs”). The verticals, or industry, associated with a given interaction can be assigned by training or tuning an LLM through prompt engineering. In such ways, data sparsity issues can be resolved (i.e., where the interaction data associated with the dispute does not provide the associated vertical), while also reducing the cardinality of the data set.

[0064] Additional examples of specific features that may be employed to train the model can include: refunded amount values, representing the amount disputed within the interaction; the provider name, representing the payment provider of the transaction; the bank identification number (“BIN”) of the bank that issued the payment card underlying the transaction in dispute; the reason code, representing the reason provided for the dispute; the portal name, representing the portal through which the dispute was initiated; and the vertical assignment, representing the industry in which the dispute occurred. Additional examples of the machine-learning model structure, and techniques for training the machine-learning model, are described according to the examples of FIG. 3.

[0065] At block 408, the operations 400 include generating, via the machine-learning model, a dispute score. The dispute score represents the “winnability” or likelihood of success that a dispute will result in an interaction being reversed. A greater dispute score can represent a greater likelihood that the interaction will be reversed. A lower dispute score can represent a greater likelihood that the interaction will be sustained. Thus, according to some examples, the generated dispute score can be used and analyzed by users to determine techniques for maximizing the win rate. In the same and other examples, users such as merchants can apply the dispute score to determine how to manage a chargeback dispute, including by leading to faster resolution in the dispute process. Additionally, the dispute score can be used for fraud analysis measures. For instance, if a given entity is determined to have repeatedly submitted interactions, each with a low credibility and high dispute score, then the entity can be assigned a risk score, or otherwise flagged as an entity likely to submit reversible interactions with low credibility.

[0066] At block 410, the operations 400 include transmitting the dispute score. The dispute score can be transmitted back to the same device which submitted the interaction inquiry or can be transmitted to another device. Transmission of the dispute score can be accompanied by various warnings, flags, alerts, and the like. For instance, exceeding a threshold dispute value can trigger the transmission of an alert along with the dispute score. In some examples, each dispute score is transmitted in real time (i.e., in immediate response to the dispute inquiry). In other examples, only those dispute scores triggering a threshold dispute value will be transmitted. For instance, in cases where user settings are configured to transmit an interaction inquiry with each interaction, the data enrichment computing system 130 can be configured to generate alerts and report dispute scores only with those dispute scores exceeding the threshold dispute value.
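As a non-limiting illustration, the two transmission configurations described above (transmitting every dispute score, or transmitting only those exceeding a threshold dispute value) could be sketched as follows. The function name, the message layout, and the 0.9 threshold are hypothetical.

```python
def build_transmission(dispute_score, threshold, only_above_threshold=False):
    """Return the message to transmit, or None when the score is suppressed."""
    exceeds = dispute_score > threshold
    if only_above_threshold and not exceeds:
        return None  # configured to report only scores exceeding the threshold
    return {"dispute_score": dispute_score, "alert": exceeds}


# Every score is transmitted; an alert accompanies scores above the threshold.
print(build_transmission(0.95, threshold=0.9))
print(build_transmission(0.40, threshold=0.9))

# Only scores exceeding the threshold are transmitted.
print(build_transmission(0.40, threshold=0.9, only_above_threshold=True))
```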

[0067] Additionally or alternatively to operations at block 410, the operations 400 can include controlling access to a secure computing environment. For example, the dispute score, representing the confidence in disputing a transaction, can be indicative of fraudulent behavior. Particularly low scores (e.g., below an exceedingly low threshold) can indicate that a given dispute may be fraudulent. A repeated pattern of disputes below the threshold can more strongly indicate fraudulent behavior. A set quantity of disputes below the dispute threshold may then be used to identify fraudulent behavior within a dispute service. In response, operations 400 can include flagging the entity initiating the disputes as a fraudulent entity and responding by restricting or limiting the entity’s access to one or more secure computing environments, including the dispute service. Further, responsive to such determined anomalous behavior, the data enrichment service may automatically initiate further authentication requests to establish the identity of the disputing entity.
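As a non-limiting illustration, identifying entities whose disputes repeatedly fall below a low score threshold could be sketched as follows. The function name, the 0.05 threshold, the set quantity of three disputes, and the entity identifiers are hypothetical; the restriction of access itself is not shown.

```python
from collections import Counter


def entities_to_restrict(disputes, low_score_threshold=0.05, set_quantity=3):
    """Return entities with at least `set_quantity` disputes scored below the threshold.

    `disputes` is an iterable of (entity_id, dispute_score) pairs.
    """
    low_counts = Counter(
        entity_id for entity_id, score in disputes if score < low_score_threshold
    )
    return {entity_id for entity_id, count in low_counts.items() if count >= set_quantity}


# Hypothetical dispute history for two entities.
disputes = [
    ("entity-a", 0.01), ("entity-a", 0.02), ("entity-a", 0.03),
    ("entity-b", 0.01), ("entity-b", 0.70),
]
print(entities_to_restrict(disputes))
```

An entity returned by this sketch could then be flagged, restricted from the dispute service, or sent a further authentication request, as described above.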

[0068] As noted above, similar techniques may be used to train other types of models based on enriched data with assigned verticals. For example, models trained for synthetic identity detection or risks of credit washing may be trained on enriched data with vertical assignments with reduced cardinality. Such types of risk assessment models may then be executed according to operations similar to those described with respect to operations 400.

Example of a Computing System for Implementing Data Enrichment Techniques

[0069] Any suitable computing system or group of computing systems can be used to perform the operations described herein. For example, FIG. 5 is a block diagram illustrating an example of a computing device 500, which can be used to implement the data enrichment computing system described above. The computing device 500 can include various devices for communicating with other devices in a computing environment.

[0070] The computing device 500 can include a processor 502 that is communicatively coupled to a memory 504. The processor 502 can execute computer-executable program code stored in the memory 504, can access information stored in the memory 504, or both. Program code may include machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc., may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, among others.

[0071] Examples of a processor 502 can include a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or any other suitable processing device. The processor 502 can include any suitable number of processing devices, including one. The processor 502 can include or communicate with a memory 504. The memory 504 can store program code that, when executed by the processor 502, causes the processor 502 to perform the operations described herein.

[0072] The memory 504 can include any suitable non-transitory computer-readable medium. The computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable program code or other program code. Non-limiting examples of a computer-readable medium can include a magnetic disk, memory chip, optical storage, flash memory, storage class memory, ROM, RAM, an ASIC, magnetic storage, or any other medium from which a computer processor can read and execute program code. The program code may include processor-specific program code generated by a compiler or an interpreter from code written in any suitable computer-programming language. Examples of suitable programming languages can include Hadoop, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, ActionScript, etc.

[0073] The computing device 500 may also include a number of external or internal devices such as input or output devices. For example, the computing device 500 is illustrated with an input / output interface 508 that can receive input from input devices or provide output to output devices. A bus 506 can also be included in the computing device 500. The bus 506 can communicatively couple one or more components of the computing device 500.

[0074] The computing device 500 can execute program code 514 that can include the presently described models, testing operations, and the like. The program code 514 may be resident in any suitable computer-readable medium and may be executed on any suitable processing device. Executing the program code 514 can configure the processor 502 to perform one or more of the operations described herein.

[0075] In some aspects, the computing device 500 can include one or more output devices. One example of an output device can be the network interface device 510 depicted in FIG. 1. A network interface device 510 can include any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks described herein. Non-limiting examples of the network interface device 510 can include an Ethernet network adapter, a modem, etc.

[0076] Another example of an output device can include the presentation device 512 depicted in FIG. 1. A presentation device 512 can include any device or group of devices suitable for providing visual, auditory, or other suitable sensory output. Non-limiting examples of the presentation device 512 can include a touchscreen, a monitor, a speaker, a separate mobile computing device, etc. In some aspects, the presentation device 512 can include a remote client-computing device that can communicate with the computing device 500 using one or more data networks described herein. In other aspects, the presentation device 512 can be optional.

[0077] The foregoing description of some examples has been presented only for the purpose of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Numerous modifications and adaptations thereof will be apparent to those skilled in the art without departing from the spirit and scope of the disclosure.

Claims

1. A computer-implemented method for data enrichment comprising: receiving, by a processor, a query comprising an entity name, the entity name representative of an entity; retrieving, by the processor and from one or more repositories, an entity data set comprising records associated with the entity; prompt engineering, by the processor, a large language model (LLM) with the entity data set; generating, by the processor and using the prompt engineered LLM, a vertical prediction based on the entity name; assigning, by the processor, the vertical prediction to an entity record corresponding to the entity; and storing, by the processor, the entity record in an enriched entity data set.

2. The computer-implemented method of claim 1, wherein the one or more repositories includes a first repository including an inquiry log table and a second repository including transaction data, wherein each of the first repository and the second repository include entity-vertical pairing data.

3. The computer-implemented method of claim 1, wherein retrieving the entity data set comprises: retrieving, by the processor, an initial set of entity data; identifying, by the processor and within the initial set of entity data, records that lack an entity name entry or a vertical entry; and removing, by the processor, the identified records to generate the entity data set.

4. The computer-implemented method of claim 1, further comprising generating a validation score through operations comprising: for each record in the entity data set, performing an exact matching search or fuzzy matching search within the enriched entity data set, and generating the validation score based on a number of determined matches.

5. The computer-implemented method of claim 1, further comprising: accessing a repository including multiple data sets including a first data set and the enriched entity data set, wherein the enriched data set has a reduced vertical cardinality compared to the first data set; generating reduced cardinality training data including the assigned vertical predictions based at least in part on a subset of the enriched entity data set; receiving selected features to train a machine-learning model; and training the machine-learning model based at least in part on the reduced cardinality training data and selected features.

6. The computer-implemented method of claim 5, wherein generating the training data based at least in part on a subset of the enriched entity data set comprises stratified random sampling the enriched entity data set to generate the training data.

7. The computer-implemented method of claim 5, further comprising: receiving a chargeback inquiry associated with an interaction; retrieving interaction data associated with the interaction; applying the interaction data to the machine-learning model; generating, via the machine-learning model, a dispute score; and responsive to the dispute score falling below a threshold, controlling access to a secure computing environment.

8. A system comprising: a processing device; and a memory device in which instructions executable by the processing device are stored for causing the processing device to perform operations comprising: receiving a query comprising an entity name, where the entity name is representative of an entity; retrieving, from one or more repositories, an entity data set comprising records associated with the entity; prompt engineering an LLM with the entity data set; generating, using the prompt engineered LLM, a vertical prediction based at least in part on the entity name; assigning the vertical prediction to an entity record corresponding to the entity; and storing the entity record in an enriched entity data set.

9. The system of claim 8, wherein the one or more repositories includes a first repository including an inquiry log table, and a second repository including transaction data, wherein each of the first repository and the second repository include entity-vertical pairing data.

10. The system of claim 8, wherein retrieving the entity data set comprises: retrieving an initial set of entity data; identifying within the initial set of entity data, records that lack an entity name entry or a vertical entry; and removing the identified records to generate the entity data set.

11. The system of claim 8, wherein the operations further comprise: for each record in the entity data set, performing an exact matching search or fuzzy matching search within the enriched entity data set, and generating a validation score based at least in part on a number of determined matches.

12. The system of claim 8, wherein the operations further comprise: accessing a repository including multiple data sets including a first data set and the enriched entity data set, wherein the enriched data set has a reduced vertical cardinality compared to the first data set; generating reduced cardinality training data including the assigned vertical predictions based at least in part on a subset of the enriched entity data set; receiving selected features to train a machine-learning model; and training the machine-learning model based at least in part on the reduced cardinality training data and selected features.

13. The system of claim 12, wherein generating the training data based at least in part on a subset of the enriched entity data set comprises stratified random sampling the enriched entity data set to generate the training data.

14. The system of claim 12, wherein the operations further comprise: receiving a chargeback inquiry associated with an interaction; retrieving interaction data associated with the interaction; applying the interaction data to the machine-learning model; generating, via the machine-learning model, a dispute score; and responsive to the dispute score falling below a threshold, controlling access to a secure computing environment.

15. A non-transitory computer-readable storage medium having program code that is executable by a processor to cause a computing device to perform operations, the operations comprising: receiving a query comprising an entity name, where the entity name is representative of an entity; retrieving, from one or more repositories, an entity data set comprising records associated with the entity; prompt engineering an LLM with the entity data set; generating, using the prompt engineered LLM, a vertical prediction based at least in part on the entity name; assigning the vertical prediction to an entity record corresponding to the entity; and storing the entity record in an enriched entity data set.

16. The non-transitory computer-readable storage medium of claim 15, wherein the one or more repositories includes a first repository including an inquiry log table, and a second repository including transaction data, wherein each of the first repository and the second repository include entity-vertical pairing data.

17. The non-transitory computer-readable storage medium of claim 15, wherein retrieving the entity data set comprises: retrieving an initial set of entity data; identifying within the initial set of entity data, records that lack an entity name entry or a vertical entry; and removing the identified records to generate the entity data set.

18. The non-transitory computer-readable storage medium of claim 15, wherein the operations further comprise: for each record in the entity data set, performing an exact matching search or fuzzy matching search within the enriched entity data set, and generating a validation score based at least in part on a number of determined matches.

19. The non-transitory computer-readable storage medium of claim 15, wherein the operations further comprise: accessing a repository including multiple data sets including a first data set and the enriched entity data set, wherein the enriched data set has a reduced vertical cardinality compared to the first data set; generating reduced cardinality training data including the assigned vertical predictions based at least in part on a subset of the enriched entity data set; receiving selected features to train a machine-learning model; and training the machine-learning model based at least in part on the reduced cardinality training data and selected features.

20. The non-transitory computer-readable storage medium of claim 19, wherein generating the training data based at least in part on a subset of the enriched entity data set comprises stratified random sampling the enriched entity data set to generate the training data.
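
Illustrative example (not part of the claims). The record-removal, validation-score, and stratified-sampling operations recited above could be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the field names `entity_name` and `vertical`, the function names, the 0.85 fuzzy-matching threshold, the use of Python's `difflib.SequenceMatcher` for fuzzy matching, and the choice to express the validation score as a fraction of matched records are all hypothetical.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher


def drop_incomplete(records):
    """Remove records that lack an entity name entry or a vertical entry."""
    return [r for r in records if r.get("entity_name") and r.get("vertical")]


def validation_score(entity_records, enriched_records, fuzzy_threshold=0.85):
    """Score based on the number of entity records matched in the enriched set.

    Assumption: a match is an exact (case-insensitive) name match or a fuzzy
    match whose similarity ratio meets the hypothetical threshold, and the
    score is the matched fraction of the entity data set.
    """
    if not entity_records:
        return 0.0
    enriched_names = [r["entity_name"].lower() for r in enriched_records]
    matches = 0
    for record in entity_records:
        name = record["entity_name"].lower()
        exact = name in enriched_names
        fuzzy = any(
            SequenceMatcher(None, name, other).ratio() >= fuzzy_threshold
            for other in enriched_names
        )
        if exact or fuzzy:
            matches += 1
    return matches / len(entity_records)


def stratified_sample(records, fraction, seed=0):
    """Randomly sample the same fraction from each vertical (stratum)."""
    rng = random.Random(seed)
    by_vertical = defaultdict(list)
    for record in records:
        by_vertical[record["vertical"]].append(record)
    sample = []
    for group in by_vertical.values():
        # Keep at least one record per vertical so no stratum disappears.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical data, for illustration only.
raw = [
    {"entity_name": "Acme Bank", "vertical": "banking"},
    {"entity_name": "", "vertical": "retail"},
    {"entity_name": "Zed Telecom", "vertical": None},
    {"entity_name": "Bolt Motors", "vertical": "auto"},
]
enriched = [
    {"entity_name": "ACME Bank", "vertical": "banking"},
    {"entity_name": "Bolt Motor", "vertical": "auto"},
]
entity_data_set = drop_incomplete(raw)
score = validation_score(entity_data_set, enriched)
```

Sampling within each vertical, rather than across the whole enriched data set, keeps the proportions of the assigned verticals roughly intact in the resulting training data, which is the usual motivation for stratified random sampling.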