Graph-merge split techniques for entity resolution

Graph-merge-split techniques effectively address database fragmentation and overcombination by merging and splitting entities based on matching scores and clique constraints, enhancing accuracy and efficiency in Entity Resolution.

WO2026127948A1 · PCT designated stage · Publication Date: 2026-06-18 · EQUIFAX INC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: EQUIFAX INC
Filing Date: 2024-12-10
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Existing database management systems struggle with Entity Resolution due to issues like file fragmentation and overcombination, where multiple entities represent a single consumer or multiple consumers are combined as one, respectively, and traditional hard-coded rules fail to address larger variations in data entries.

Method used

Implementing graph-merge-split techniques that involve generating an entity level graph, identifying connected components, applying a maximal graph matching algorithm, and using a graph-splitter process to merge and split entities based on entity matching scores and clique constraints, thereby generating a merge-split graph.

Benefits of technology

This approach enhances the accuracy of Entity Resolution by correctly matching records, reducing database size, and optimizing computational resources by detecting and resolving fragmentation and overcombination, thus improving search efficiency and reducing storage needs.

✦ Generated by Eureka AI based on patent content.


Abstract

In some aspects, a graph-merge-split computing system is provided. The graph-merge-split computing system is capable of identifying candidate records for merging and generating an entity level graph from the list of candidate records comprising nodes with matching score edges. A graph-merge process is performed for each connected component, which includes determining the matching score for each edge. Edges falling below a threshold are removed. A maximal graph matching algorithm is applied to the updated graph, where the maximal graph matching algorithm identifies paired nodes. For each of the paired nodes, the graph-merge process then determines a clique score and, in response to determining the clique score exceeds a delta-clique constraint, applies a graph-splitter process to the paired nodes prior to forming the merged entity. The graph-merge process is repeated until no merge occurs. Once all merges have been performed, an output graph is generated.

Description

Attorney Docket No. 096923-1461100 (EFX-199WO)
PATENT APPLICATION
GRAPH MERGE-SPLIT TECHNIQUES FOR ENTITY RESOLUTION

Technical Field

[0001] This disclosure relates generally to digital data processing systems. More specifically, but not by way of limitation, this disclosure relates to systems and methods for Entity Resolution through graph-merge-split techniques.

Background

[0002] Databases often store data in records. Each record may represent an entity, where the entity is representative of a person (also referred to as a consumer), place or thing. Each record also includes multiple identities (e.g., the entity name, address, social security number, date of birth, and the like). Entity Resolution refers to a process of managing records within a database such that records in the database correctly identify the corresponding entity. Entity Resolution can include identifying mismatching records within the database. Issues arising from mismatch include file fragmentation and file overcombination (also referred to as overcombines). With fragmentation, a single consumer is represented by multiple entities. Fragmentation can arise due to variations in how data is entered and managed as part of the record. For instance, if an entity is determined based on an address identity, user error in entering data, or simply entering the data in a different format (e.g., in one instance address is entered as “street” while in a second record is entered as “st.”), can result in the consumer having multiple corresponding entities in the database. With overcombines, a similar issue arises where multiple consumers are represented as a single entity. In such cases consumers with identities yielding similar entries, such as the consumer having the same name, or residing at the same address, can lead to multiple consumers being represented as a single entity.

[0003] Approaches to database management and Entity Resolution rely on hard-coded rules, such as data normalization (e.g., changing every input of “Dr.” to “Drive” in address fields), to standardize records and prevent mismatch. But such hard-coded rules can only prevent issues that are directly accounted for. Moreover, such traditional Entity Resolution tools may not account for larger variations within a preexisting database.

Summary

[0004] Various embodiments of the present disclosure provide systems and methods for Entity Resolution through graph-merge-split techniques. In one example, a method includes one or more processing devices performing operations. The operations include identifying a list of candidate records from a set of data records stored in a data repository for merging and generating an entity level graph from the list of candidate records, where the entity level graph includes entity nodes including entities and entity edges including entity matching scores. The operations include identifying one or more connected components within the entity level graph and performing a graph-merge process, where the graph-merge process is performed for each connected component of the one or more connected components. The graph-merge process includes determining the entity matching score for each entity edge, removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph, and applying a maximal graph matching algorithm to the updated entity level graph, where the maximal graph matching algorithm identifies one or more maximally paired entity nodes. For each of the maximally paired entity nodes identified by the maximal graph matching algorithm, the graph-merge process then determines an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity and, in response to determining the entity clique score exceeds an entity level delta-clique constraint, the graph-merge process applies a graph-splitter process to the maximally paired entity nodes prior to merging the maximally paired entity nodes to generate a merge-split graph. The graph-merge process is repeated until no merge occurs. Once all merges have been performed, an output graph including the merge-split graph is generated.

[0006] In another example, a non-transitory computer-readable storage medium has program code executable by a processing device to perform operations. The operations include identifying a list of candidate records from a set of data records stored in a data repository for merging and generating an entity level graph from the list of candidate records, where the entity level graph includes entity nodes including entities and entity edges including entity matching scores. The operations include identifying one or more connected components within the entity level graph and performing a graph-merge process, where the graph-merge process is performed for each connected component of the one or more connected components. The graph-merge process includes determining the entity matching score for each entity edge, removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph, and applying a maximal graph matching algorithm to the updated entity level graph, where the maximal graph matching algorithm identifies one or more maximally paired entity nodes. For each of the maximally paired entity nodes identified by the maximal graph matching algorithm, the graph-merge process then determines an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity and, in response to determining the entity clique score exceeds an entity level delta-clique constraint, applies a graph-splitter process to the maximally paired entity nodes prior to merging the maximally paired entity nodes to generate a merge-split graph. The graph-merge process is repeated until no merge occurs. Once all merges have been performed, an output graph including the merge-split graph is generated.

[0007] In yet another example, a computing system includes a processing device and a data repository for storing data records. Each data record represents an entity based on a set of one or more identifiers. The computing system further includes a non-transitory computer-readable storage medium having program code executable by the processing device to perform operations. The operations include identifying a list of candidate records from a set of data records stored in the data repository for merging and generating an entity level graph from the list of candidate records, where the entity level graph includes entity nodes including entities and entity edges including entity matching scores. The operations include identifying one or more connected components within the entity level graph and performing a graph-merge process, where the graph-merge process is performed for each connected component of the one or more connected components. The graph-merge process includes determining the entity matching score for each entity edge, removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph, and applying a maximal graph matching algorithm to the updated entity level graph, where the maximal graph matching algorithm identifies one or more maximally paired entity nodes. For each of the maximally paired entity nodes identified by the maximal graph matching algorithm, the graph-merge process then determines an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity and, in response to determining the entity clique score exceeds an entity level delta-clique constraint, applies a graph-splitter process to the maximally paired entity nodes prior to merging the maximally paired entity nodes to generate a merge-split graph. The graph-merge process is repeated until no merge occurs. Once all merges have been performed, an output graph including the merge-split graph is generated.

[0008] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, any or all drawings, and each claim.

[0009] The foregoing, together with other features and examples, will become more apparent upon referring to the following specification, claims, and accompanying drawings.

Brief Description of the Drawings

[0010] FIG. 1 is a block diagram depicting an example of a computing environment in which a merge-split computing system can identify data fragmentation and overcombination in a database according to certain aspects of the present disclosure.

[0011] FIG. 2 is a diagram depicting an example of a data flow for identifying candidate records, according to certain aspects of the present disclosure.

[0012] FIG. 3 is a flowchart depicting an example of a process for training a machine learning matching model to determine a matching decision for a reference record and a query record, according to certain aspects of the present disclosure.

[0013] FIG. 4 is a diagram illustrating the data flow in the training of the machine learning model, according to certain aspects of the present disclosure.

[0014] FIG. 5 is a flowchart depicting an example of a graph-merge-split process for Entity Resolution based on identified candidate records within a database, according to certain aspects of the present disclosure.

[0015] FIG. 6 is a flowchart depicting an example of a graph-splitter process for Entity Resolution based on identifiers within a merged entity, according to certain aspects of the present disclosure.

[0016] FIG. 7 is a diagram showing an example of a graph including connected components, nodes, and edges, according to certain aspects of the present disclosure.

[0017] FIG. 8 is a diagram depicting an example implementation of the maximal graph matching algorithm, according to certain aspects of the present disclosure.

[0018] FIG. 9 is a diagram depicting examples of cliques with varying clique scores representing the degree of connectivity between nodes within a graph, according to certain aspects of the present disclosure.

[0019] FIG. 10 is a block diagram depicting an example of a computing system suitable for implementing aspects of the techniques and technologies presented herein.

Detailed Description

[0020] Certain aspects and features of the present disclosure involve Entity Resolution through graph-merging and graph-splitting techniques. Entity Resolution refers to the process of managing records within a database such that each record accurately relates back to the correct corresponding entity, and such that duplicative records (e.g., records with similar identifiers referring to the same entity) are removed. One approach to perform Entity Resolution is through graph analysis, where associations between records, and identifiers within records, can be analyzed based on strength of similarity. Specifically, the described processes include graph-merging, where entities determined to be the same are merged, and where the similarity determination is based on the strength of the entities’ associated records. A graph-splitting technique is also described, where identifiers of a single entity are determined to more accurately relate to another entity.

[0021] A merge-split computing system can identify candidate records for merging and splitting from a set of data records stored in a data repository. Identifying candidate records can include, by the merge-split computing system, performing a search in a database for records that match a query record based on one or more identifiers. To perform the matching, the merge-split computing system can generate an identifier score for each identifier based on the values of the identifiers in the query record and a reference record to be compared. In some examples, a machine learning model trained to predict a matching decision from identifier scores and other identifier attributes generated for a pair of records can be used to identify candidate records. The merge-split computing system can then perform a merge process to defragment the candidate records and a split process to prevent overcombination of the candidate records. The merge-split computing system can be implemented in an offline batch process that can be run to produce a report of all merges and splits performed within the data repository.

[0022] The following non-limiting example is provided to introduce certain embodiments. In this example, a merge-split computing system can perform Entity Resolution by performing a merge process to overcome issues related to fragmentation, where multiple entities refer to a single consumer, and, in tandem, performing a split process to overcome issues related to overcombination, where multiple consumers are represented as a single entity (i.e., where identifiers corresponding to a given entity must be split into separate entities such that each entity properly corresponds to a single consumer).

[0023] The merge-split computing system may first identify a list of candidate records for merging and splitting, where the candidate records are identified as representing one or more entities based on a set of identifiers. To identify the list of candidate records, the merge-split computing system can search a database, or a subset of the database, for records that match a query record based on one or more identifiers. Searching may be streamlined through generation of optimized search keys used to traverse the database or subset of the database. The merge-split computing system can generate similarity scores based on similarities between identifiers compared between a query record and a reference record. Different similarity score techniques may be applied based on the identifiers being compared. For instance, numeric and string identifiers can be evaluated relative to a probability distribution of errors to confirm whether differences between numeric or string identifiers were accidental or intentional. Corresponding scores can be generated based on the evaluations relative to the probability distribution. In the example of string identifier scoring, phonetic algorithms for matching components of names based on similar pronunciation, as well as distance measures such as Levenshtein distance or Jaccard distance, can be used to generate the name identifier score. The sum of the identifier scores (e.g., a name identifier score, an address similarity score, a date of birth identifier score, and a social security number identifier score) may then provide the similarity score defining the similarity between candidate records.
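As an illustration of the distance-based scoring described in paragraph [0023], the following is a minimal sketch rather than the disclosed implementation. The field names (`name`, `address`, `dob`, `ssn`), the normalization of edit distance into a 0-to-1 similarity, the use of Jaccard similarity (one minus Jaccard distance) over address tokens, and the unweighted sum are all assumptions made for the example; the phonetic matching and error-distribution evaluation mentioned above are not modeled.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for maximally different."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_score(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_score(query: dict, reference: dict) -> float:
    """Unweighted sum of per-identifier scores (hypothetical field names)."""
    return (
        string_score(query["name"], reference["name"])
        + jaccard_score(query["address"], reference["address"])
        + (1.0 if query["dob"] == reference["dob"] else 0.0)
        + string_score(query["ssn"], reference["ssn"])
    )


query = {"name": "Jon Smith", "address": "12 main street",
         "dob": "1980-01-01", "ssn": "123456789"}
reference = {"name": "John Smith", "address": "12 main st",
             "dob": "1980-01-01", "ssn": "123456789"}
print(similarity_score(query, reference))
```

In this toy pair the "street" versus "st" variation lowers only the address component, so the overall score stays high even though an exact-match rule on the address would have treated the records as different.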

[0024] In some examples, the merge-split computing system can employ a machine learning model to generate similarity scores and matching decisions between query records and reference records when generating the list of candidate records. The input to the machine learning model can include the identifier scores. In addition, the merge-split computing system can also generate other attributes (also referred to as “matching attributes” or “identifier attributes”) for each of the identifiers as input to the machine learning model. These attributes can include, for example, a numerical identifier attribute measuring the total number of positions matched between the numerical identifier in the query record and the numerical identifier in the reference record, an address attribute generated based on a geographical distance between the query address and the reference address, an address frequency attribute indicating the number of records in the data records having a same address as the reference address, a name frequency attribute indicating the frequency of the name in the reference record, and so on.
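Two of the identifier attributes listed in paragraph [0024] can be sketched as follows. This is an illustrative example under assumptions: the record layout (dictionaries with `ssn` and `address` keys) and the function names are hypothetical, and positions are compared digit by digit after stripping separators, which the disclosure does not specify.

```python
from collections import Counter


def positions_matched(query_id: str, reference_id: str) -> int:
    """Count digit positions that agree between two numerical identifiers."""
    digits_q = [c for c in query_id if c.isdigit()]
    digits_r = [c for c in reference_id if c.isdigit()]
    return sum(1 for q, r in zip(digits_q, digits_r) if q == r)


def address_frequency(records: list, reference_address: str) -> int:
    """Number of records sharing the reference address."""
    counts = Counter(r["address"] for r in records)
    return counts[reference_address]


def matching_attributes(query: dict, reference: dict, records: list) -> dict:
    """Assemble a small attribute dictionary that a matching model could consume."""
    return {
        "ssn_positions_matched": positions_matched(query["ssn"], reference["ssn"]),
        "address_frequency": address_frequency(records, reference["address"]),
    }


records = [
    {"ssn": "123-45-6789", "address": "12 main st"},
    {"ssn": "987-65-4321", "address": "12 main st"},
    {"ssn": "555-11-2222", "address": "9 oak ave"},
]
query = {"ssn": "123-45-6780", "address": "12 main street"}
attributes = matching_attributes(query, records[0], records)
print(attributes)
```

A high address frequency (for example, an apartment building) suggests that a shared address is weaker evidence of a match, which is one reason such attributes can be useful as model inputs alongside the identifier scores.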

[0025] Once the list of candidate records is identified, the merge-split computing system can generate an entity level graph. The entity level graph includes nodes, representing entities within the list of candidate records, and edges, representing the similarity scores between the entities. The similarity score may be determined as described above (e.g., via a machine learning model trained to compare records and generate similarity scores defining the strength of association between a pair of records based on their identifiers).
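The entity level graph of paragraph [0025] can be represented as a weighted adjacency map. The sketch below assumes the candidate-generation step has already produced scored entity pairs; the triple format `(entity_a, entity_b, score)` and the function name are hypothetical.

```python
def build_entity_graph(scored_pairs):
    """Build an undirected weighted graph: node -> {neighbor: similarity score}."""
    graph = {}
    for a, b, score in scored_pairs:
        graph.setdefault(a, {})[b] = score
        graph.setdefault(b, {})[a] = score
    return graph


scored_pairs = [("E1", "E2", 0.93), ("E2", "E3", 0.88), ("E4", "E5", 0.71)]
graph = build_entity_graph(scored_pairs)
print(graph["E2"])
```

Storing each edge in both directions makes neighbor lookups cheap, at the cost of keeping the two copies consistent whenever a score is updated.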

[0026] The merge-split computing system can then identify connected components in the entity level graph. Connected components represent entity nodes connected via edges having a sufficiently high similarity score. Fragmentation of data records can then be found by identifying connected components with a high degree of connectivity in the entity level graph.
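Connected components, as used in paragraph [0026], can be found with a standard breadth-first traversal. The following sketch assumes the adjacency-map representation `node -> {neighbor: score}`, which is an illustrative choice rather than something the disclosure prescribes.

```python
from collections import deque


def connected_components(graph):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


graph = {
    "E1": {"E2": 0.93},
    "E2": {"E1": 0.93, "E3": 0.88},
    "E3": {"E2": 0.88},
    "E4": {"E5": 0.71},
    "E5": {"E4": 0.71},
    "E6": {},
}
components = connected_components(graph)
print(components)
```

Because no edge crosses between components, each component can be processed independently in the later merge steps.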

[0027] For each identified connected component within the entity level graph, the score of each edge (e.g., the similarity score between entities based on their identifiers) can be generated. The same match model and principles may be used to generate the scores for each edge in a given connected component. On a first instance, the same scores used to generate the entity level graph may be used to score each edge of the first analyzed connected component. However, as the merge-split computing system may merge entities for each identified connected component, the structure of the entity level graph may be adjusted in response to each merge performed. Thus, the merge-split computing system will iteratively score each edge in the connected component (for instance, using a trained match model), to account for modifications to the entity level graph in response to previous entity merges on subsequent iterations.
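The iterative rescoring in paragraph [0027] can be sketched as a loop that rescores every entity pair, merges the best non-overlapping pairs, and stops when an iteration produces no merge. This is a simplified sketch under several assumptions: the function and variable names are hypothetical, a toy average-linkage function stands in for the trained match model, pairs are picked greedily by score, and the delta-clique check and graph-splitter process are omitted.

```python
def merge_until_stable(entities, score_fn, threshold):
    """entities: entity id -> frozenset of record ids. Returns the merged mapping."""
    entities = dict(entities)
    while True:
        # Rescore every pair, because earlier merges change the entities.
        ids = sorted(entities)
        edges = []
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                score = score_fn(entities[a], entities[b])
                if score >= threshold:  # edges below the threshold are dropped
                    edges.append((score, a, b))
        # Greedily pick non-overlapping pairs, highest score first.
        edges.sort(key=lambda e: (-e[0], e[1], e[2]))
        matched, pairs = set(), []
        for _, a, b in edges:
            if a not in matched and b not in matched:
                pairs.append((a, b))
                matched.update((a, b))
        if not pairs:
            return entities  # no merge occurred in this iteration
        for a, b in pairs:
            entities[a] = entities[a] | entities.pop(b)


# Toy record-level scores standing in for a trained match model.
PAIR_SCORES = {
    frozenset(("r1", "r2")): 0.9,
    frozenset(("r2", "r3")): 0.9,
    frozenset(("r1", "r3")): 0.5,
}


def average_linkage(records_a, records_b):
    scores = [PAIR_SCORES.get(frozenset((x, y)), 0.0)
              for x in records_a for y in records_b]
    return sum(scores) / len(scores)


start = {"A": frozenset({"r1"}), "B": frozenset({"r2"}), "C": frozenset({"r3"})}
result = merge_until_stable(start, average_linkage, threshold=0.8)
print(result)
```

In this toy run, B and C score 0.9 before any merge, but once B is merged into A the rescored edge between the merged entity and C drops to 0.7 and no longer qualifies. That is the kind of change the iterative rescoring is meant to capture.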

[0028] The merge-split computing system can then remove edges that fall below an entity level merge score threshold. Removing edges between entity nodes subsequently prevents the merging of the removed entity nodes in subsequent merge steps. To avoid the issue of overcombining entities (e.g., inadvertently merging two dissimilar entity nodes representing two different consumers), the entity level merge score threshold may have a higher threshold value, particularly when compared to an identity merge threshold as discussed with respect to the splitter process which may be later called by the merge-split computing system.

[0029] After the edges falling below the entity level merge score threshold are removed from the connected component, the merging, de-fragmentation process can then determine a maximal graph match with respect to the entity graph. The maximal graph matching algorithm can adjust the entity graph such that no entity node has multiple edges (i.e., connections to other entity nodes). In this way, the graph is reduced so that each entity node is either paired with exactly one other entity node or is not connected to any other entity node.
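A maximal matching is a set of edges in which no node appears twice and to which no further edge can be added. One simple way to obtain one is a greedy pass over the edges; the sketch below visits edges in descending score order, which is an assumption made for illustration, since this section does not state which maximal matching algorithm or edge ordering is used.

```python
def greedy_maximal_matching(graph):
    """graph: node -> {neighbor: score}. Returns a list of matched (a, b) pairs."""
    # Collapse the symmetric adjacency entries into one tuple per edge.
    edges = {
        (min(a, b), max(a, b), score)
        for a, neighbors in graph.items()
        for b, score in neighbors.items()
        if a != b
    }
    matched = set()
    pairs = []
    for a, b, _ in sorted(edges, key=lambda e: (-e[2], e[0], e[1])):
        if a not in matched and b not in matched:
            pairs.append((a, b))
            matched.update((a, b))
    return pairs


# A path E1 - E2 - E3 - E4 whose edges already passed the merge threshold.
graph = {
    "E1": {"E2": 0.99},
    "E2": {"E1": 0.99, "E3": 0.90},
    "E3": {"E2": 0.90, "E4": 0.95},
    "E4": {"E3": 0.95},
}
pairs = greedy_maximal_matching(graph)
print(pairs)
```

Here the middle edge between E2 and E3 is left out because both of its endpoints are already paired, so every entity node ends up in at most one pair.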

[0030] For each pair of entity nodes that are part of the maximal graph matching algorithm, the merge-split computing system can determine if the entity node pair satisfies an entity level delta-clique constraint prior to merging. The entity level delta-clique constraint evaluates a threshold connectivity between entity nodes within the entity level graph. As entity node pairs are merged back together, a merge may otherwise violate the entity level delta-clique constraint, leading to low-connectivity entities (indicative of mismatching entities) being merged during the merge process. Therefore, the entity level delta-clique constraint may be used to prevent such low-connectivity entities from being merged during the merge process. Because the entity merge process has a relatively high entity level merge score threshold, a lower entity level delta-clique constraint can be used, as in effect the entity level merge score threshold has already filtered out low similarity entities from the graph.
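This section does not give a formula for the clique score. One common way to quantify how close a group of nodes is to a full clique is edge density, the fraction of possible node pairs that are actually connected. The sketch below uses that measure purely as an assumption, with hypothetical names and a hypothetical `delta` value, to show how a connectivity check could gate a proposed merge.

```python
from itertools import combinations


def edge_density(nodes, edges):
    """Fraction of node pairs joined by an edge; 1.0 means a full clique."""
    nodes = sorted(nodes)
    if len(nodes) < 2:
        return 1.0
    possible = len(nodes) * (len(nodes) - 1) // 2
    present = sum(1 for a, b in combinations(nodes, 2)
                  if frozenset((a, b)) in edges)
    return present / possible


def merge_keeps_connectivity(group_a, group_b, edges, delta):
    """Assumed check: the merged group must be at least delta-dense."""
    return edge_density(set(group_a) | set(group_b), edges) >= delta


# Records r1 and r2 belong to one entity, r3 to another.
edges = {frozenset(("r1", "r2")), frozenset(("r2", "r3"))}
print(edge_density({"r1", "r2", "r3"}, edges))
print(merge_keeps_connectivity({"r1", "r2"}, {"r3"}, edges, delta=0.6))
print(merge_keeps_connectivity({"r1", "r2"}, {"r3"}, edges, delta=0.8))
```

With two of three possible edges present, the merged group passes a 0.6 requirement but fails a 0.8 requirement, which illustrates how the strictness of the constraint controls which merges are allowed.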

[0031] For each merged entity, the merge-split computing system may then perform a graph splitter process to resolve data record overcombination. Similar analyses described with respect to the merge process may be applied within the graph splitter process. Additionally, the graph splitter process can involve determining a centrality score for merging identities which may further determine whether to merge such identities in view of a centrality threshold. The centrality threshold may provide further means for preventing the overcombination of data files within the data repository.

[0032] Certain aspects described herein overcome the limitations of previous techniques and provide improvements to database technology by detecting record fragmentation and record overcombination within larger record databases. Detection of fragmentation and overcombination through the described merge-split process thus enables more accurate retrieval of matching records than traditional searching techniques, thereby increasing the accuracy of the search results. In addition, the fragmentation and overcombine detection can reduce the size of the database and thus reduce the storage space used to store the database. Reducing the size of the database also reduces the computational complexity of searching the database for a given record, thereby reducing the consumption of computing resources, such as CPU time and memory space. Furthermore, the merge-split techniques presented herein also enable the accurate detection of fragmented and overcombined files in the database and thus increase the efficiency of the fragmentation and overcombination detection.

[0033] These illustrative examples are given to introduce the reader to the general subject matter discussed here and are not intended to limit the scope of the disclosed concepts. The following sections describe various additional features and examples with reference to the drawings in which like numerals indicate like elements, and directional descriptions are used to describe the illustrative examples but, like the illustrative examples, should not be used to limit the present disclosure.

Operating Environment Example for Record Matching and Fragmented File Detection

[0034] FIG. 1 is a block diagram depicting an example of a computing environment in which a merge-split computing system can identify data fragmentation and overcombination in a database according to certain aspects of the present disclosure. FIG. 1 depicts examples of hardware components of a graph-merge-split computing system 100, according to some aspects. The graph-merge-split computing system 100 is a specialized computing system that may be used for processing large amounts of data using a large number of computer processing cycles. The number of devices depicted in FIG. 1 is provided for illustrative purposes. Different numbers of devices may be used. For example, while certain devices or systems are shown as single devices in FIG. 1, multiple devices may instead be used to implement these devices or systems.

[0035] As shown in FIG. 1, the graph-merge-split computing system 100 can include a graph-merge-split server 106, a model training server 108, a private data network 132, data repository 120 storing data records 122, firewall 130, a client external-facing subsystem 128, and a public data network 104 communicatively coupled to one or more client computing systems 102.

[0036] The data repository 120 can include internal databases or other data sources that are stored at or otherwise accessible via the private data network 132. The data repository 120 can include data records 122, and each data record 122 includes one or more identifiers 124. An identifier 124 can include any information that can be used alone or in combination with other identifiers to uniquely identify a data record 122. For example, if the data records 122 represent data associated with an individual or entity, the identifiers 124 in each data record 122 can include information that can be used on its own to identify an individual or entity. Non-limiting examples of such information include one or more of a legal name, a company name, a social security number, a credit card number, a date of birth, an e-mail address, etc. In other aspects, the identifiers 124 can include information that can be used in combination with other information to identify an individual or entity. Non-limiting examples of such consumer identification data include a street address or other geographical location, etc.

[0037] In some examples, the identifiers 124 can be classified into four categories: numerical identifiers, such as the social security number or credit card number; name identifiers, such as the legal name of the individual or the company name; address identifiers, such as the street address of the individual or entity; and date identifiers, such as the date of birth of an individual. Depending on the nature of data stored in the data records 122, not all four categories of identifiers may be available for the data record 122. For example, if the data records 122 represent data associated with products or other types of physical items, the numerical identifier in each data record 122 can include a serial number of a product or a MAC address of a network component; the name identifier can include the name of the product or item; the address identifier can include the address or location where the product or item is manufactured or produced; and the date-based identifier can include the manufacturing date of the product or item. If the data records 122 represent data associated with digital items such as a webpage or a digital file, the numerical identifier in each data record 122 can include an IP address of the webpage; the name identifier can include the domain name of the webpage or the name of the digital file; and the date-based identifier can include the date when the webpage or digital file is created, accessed, or modified. The data record 122 can include other information about the associated entity or item, such as the employment data of the individual, description, and specification of the product, and so on.
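The four identifier categories in paragraph [0037] suggest a record structure in which each category is optional. The following is a minimal sketch of such a structure; the class name, field names, and the `other` dictionary for additional information are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataRecord:
    """A record with up to four categories of identifiers (all optional)."""
    record_id: str
    numerical_id: Optional[str] = None  # e.g., SSN, serial number, IP address
    name_id: Optional[str] = None       # e.g., legal name, product name, domain
    address_id: Optional[str] = None    # e.g., street address, place of manufacture
    date_id: Optional[str] = None       # e.g., date of birth, manufacturing date
    other: dict = field(default_factory=dict)  # e.g., employment data

    def available_categories(self) -> List[str]:
        """List the identifier categories this record actually carries."""
        categories = {
            "numerical": self.numerical_id,
            "name": self.name_id,
            "address": self.address_id,
            "date": self.date_id,
        }
        return [name for name, value in categories.items() if value is not None]


person = DataRecord("R1", numerical_id="123-45-6789", name_id="Jane Doe",
                    address_id="12 Main Street", date_id="1980-01-01")
webpage = DataRecord("R2", numerical_id="203.0.113.7", name_id="example.com",
                     date_id="2024-12-10")
print(person.available_categories())
print(webpage.available_categories())
```

Knowing which categories a record carries lets a comparison step score only the identifiers that both records have in common.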

[0038] The graph-merge-split server 106 can include a graph-merge-split service 110 capable of detecting both fragmented records and overcombined records in the data records 122. The graph-merge-split service 110 can examine pairs of candidate data records generated by a candidate generation model 113 to determine a matching decision and associated matching score between the pair of records. The matching score can be the compound scores determined based on the attribute scores generated for the pair or the confidence score output by a matching model 114 when determining the classification for the pair of records. Based on the matching decisions and matching scores between pairs of data records, the graph-merge-split service 110 can build graphs with nodes representing entities and edges representing the matching decisions. Fragmented data records can be found by identifying connected components with a high degree of connectivity in an entity level graph. Overcombined records can be found by generating an identity level graph between merged entity records.

[0039] To train the matching model 114, the graph-merge-split computing system 100 can include the model training server 108 for operating a model training service 116 for training the matching model 114 for use by a record matching service and the graph-merge-split service 110. The model training service 116 can train the matching model 114 using an initial set of training samples 126 and further determine predicted classifications for the sets of training samples 126 using the initially trained matching model. Based on the predicted classifications, the graph-merge-split computing system 100 identifies misclassified training samples based on the predicted classifications of the set of training samples being different from the respective matching labels in the training samples 126.

[0040] To correct and update the matching labels of the misclassified training samples, the model training server 108 can refine classifications for each of the misclassified training samples using multiple auxiliary models 118. The auxiliary model(s) 118 can be trained to determine whether and how to correct the labels of the misclassified training samples. The training samples 126 with the updated or corrected matching labels can be used to re-train the matching model 114. This training process can be repeated until there are no misclassified training samples in the training samples 126. In this way, ground truth matching labels for the training samples 126 can be obtained in conjunction with training the matching model 114. The graph-merge-split computing system 100 can communicate with various other computing systems such as client computing systems 102. For example, the graph-merge-split computing system 100 may include one or more provider external-facing devices that communicate with data provider systems for receiving the data regarding entities or other items to be stored in data records in the data repository 120. The graph-merge-split server 106 may also communicate with the client computing system 102 by way of a client external-facing subsystem 128.

[0041] The client computing systems 102 may interact, via one or more public data networks 104, with various external-facing subsystems of the graph-merge-split computing system 100. For instance, an individual can use a client computing system 102 to attempt to search in the data records 122 for a match to a query record. The client computing system 102 may generate the query record and send it to the graph-merge-split server 106. Alternatively, the client computing system 102 can send data to be used for the search in any format and the graph-merge-split server 106 can generate the query record based on the received information. To request the search, the client computing system 102 can communicate with the client external-facing subsystem 128. The client external-facing subsystem 128 can selectively prevent the client computing system 102 from accessing or searching in the data repository 120. For example, the client external-facing subsystem 128 can determine whether the client computing system 102 can access or search in the databases based on an identifier of the client computing system and a record stored in a secure location in the client external-facing subsystem 128, such as a memory in a basic input-output system (BIOS) of the client external-facing subsystem 128. The record can indicate the access permission of a client computing device and can be determined based on various factors such as whether the client computing system is an authorized system to access a certain database, whether the timing of the access is within an authorized window, and so on.

[0042] To determine if a client computing system 102 can access a certain database, the client external-facing subsystem 128 can retrieve the record associated with the client computing system 102 from the secure location and encrypt the record and other associated data using a cryptographic key. Similarly, the client external-facing subsystem 128 can encrypt the record submitted by the client external-facing subsystem 128 using the same cryptographic key to determine a match. A match indicates that the client computing system 102 can access the database. The client external-facing subsystem 128 can prevent the client computing system 102 from accessing the databases if there is no match.
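By way of illustration only, the compare-under-a-shared-key check described above can be sketched as follows. The disclosure does not name a cipher, so this sketch substitutes a keyed HMAC-SHA256 digest for the unspecified encryption step; the function names, record layout, and key are hypothetical.

```python
import hashlib
import hmac


def keyed_digest(record: bytes, key: bytes) -> bytes:
    """Transform a record under a secret key (HMAC-SHA256 stands in for the
    unspecified encryption in the description; this is an assumption)."""
    return hmac.new(key, record, hashlib.sha256).digest()


def access_allowed(stored_record: bytes, submitted_record: bytes, key: bytes) -> bool:
    """Return True when the stored and submitted records match under the key."""
    stored = keyed_digest(stored_record, key)
    submitted = keyed_digest(submitted_record, key)
    # Constant-time comparison avoids leaking where the digests differ.
    return hmac.compare_digest(stored, submitted)


# Hypothetical key and records, for illustration only.
KEY = b"example-shared-key"
granted = access_allowed(b"client-102|database-A", b"client-102|database-A", KEY)
denied = access_allowed(b"client-102|database-A", b"client-999|database-A", KEY)
```

In this sketch a match grants access and a mismatch denies it, mirroring the behavior described for the client external-facing subsystem 128.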

[0043] The client external-facing subsystem 128 can be communicatively coupled, via a firewall 130, to one or more computing devices forming the private data network 132. The firewall 130, which can include one or more devices, can create a secured part of the graph-merge-split computing system 100 that includes various devices in communication via the private data network 132. In some aspects, by using the private data network 132, the graph-merge-split computing system 100 can house the data repository 120 in an isolated network (i.e., the private data network 132) that has no direct accessibility via the Internet or another public data network 104.

[0044] Each client computing system 102 may include one or more third-party devices, such as individual servers or groups of servers operating in a distributed manner. Client computing system 102 can include any computing device or group of computing devices operated by a seller, lender, or other provider of products or services. Client computing system 102 can include one or more server devices. The one or more server devices can include or can otherwise access one or more non-transitory computer-readable media. The client computing system 102 can also execute an online service. The online service can include executable instructions stored in one or more non-transitory computer-readable media.

[0045] Each communication within or with the graph-merge-split computing system 100 may occur over one or more data networks, such as the public data network 104, the private data network 132, or some combination thereof. A data network may include one or more of a variety of different types of networks, including a wireless network, a wired network, or a combination of a wired and wireless network. Examples of suitable networks include the Internet, a personal area network, a local area network (“LAN”), a wide area network (“WAN”), or a wireless local area network (“WLAN”). A wireless network may include a wireless interface or a combination of wireless interfaces. A wired network may include a wired interface. The wired or wireless networks may be implemented using routers, access points, bridges, gateways, or the like, to connect devices in the data network.

[0046] A data network may include network computers, sensors, databases, or other devices that may transmit or otherwise provide data to the graph-merge-split computing system 100. For example, a data network may include local area network devices, such as routers, hubs, switches, or other computer networking devices. The data networks depicted in FIG. 1 can be incorporated entirely within (or can include) an intranet, an extranet, or a combination thereof. In one example, communications between two or more systems or devices can be achieved by a secure communications protocol, such as secure Hypertext Transfer Protocol (“HTTPS”) communications that use secure sockets layer (“SSL”) or transport layer security (“TLS”). In addition, data or transactional details communicated among the various computing devices may be encrypted. For example, data may be encrypted in transit and at rest.

[0047] The graph-merge-split computing system 100 can include one or more graph-merge-split servers 106 and one or more model training servers 108. The graph-merge-split server 106 or the model training servers 108 may be a specialized computer or other machine that processes the data received at the graph-merge-split computing system 100. The graph-merge-split server 106 or the model training servers 108 may include one or more other systems. For example, the graph-merge-split server 106 or the model training servers 108 may include a database system for accessing the network-attached storage unit, a communications grid, or both. A communications grid may be a grid-based computing system for processing large amounts of data.

[0048] The graph-merge-split server 106 or the model training servers 108 can include one or more processing devices that execute program code, such as graph-merge-split service 110, the matching model 114, or the model training service 116. The program code can be stored on a non-transitory computer-readable medium. While FIG. 1 shows that the graph-merge-split server 106 and the model training server 108 are two separate servers, the function of these two servers can be implemented in a single server or a group of servers.

[0049] The graph-merge-split computing system 100 may also include one or more network-attached storage units on which various repositories, databases, or other data structures are stored. An example of these data structures is the data repository 120. Network-attached storage units may store a variety of different types of data organized in a variety of different ways and from a variety of different sources. For example, the network-attached storage unit may include storage other than the primary storage located within the graph-merge-split server 106 or the model training server 108 that is directly accessible by processors located therein. In some aspects, the network-attached storage unit may include secondary, tertiary, or auxiliary storage, such as large hard drives, servers, virtual memory, among other types. Storage devices may include portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing and containing data. A machine-readable storage medium or computer-readable storage medium may include a non-transitory medium in which data can be stored and that does not include carrier waves or transitory electronic signals. Examples of a non-transitory medium may include, for example, a magnetic disk or tape, optical storage media such as compact disk or digital versatile disk, flash memory, memory, or memory devices.

[0050] In some aspects, the graph-merge-split computing system 100 can implement one or more procedures to secure communications between the graph-merge-split computing system 100 and other client systems. Non-limiting examples of features provided to protect data and transmissions between the graph-merge-split computing system 100 and other client systems include secure web pages, encryption, firewall protection, network behavior analysis, intrusion detection, etc. In some aspects, transmissions with client systems can be encrypted using public-key cryptography algorithms using a minimum key size of 128 bits. In additional or alternative aspects, website pages or other data can be delivered through HTTPS, secure file-transfer protocol (“SFTP”), or other secure server communications protocols. In additional or alternative aspects, electronic communications can be transmitted using Secure Sockets Layer (“SSL”) technology or other suitable secure protocols. Extended Validation SSL certificates can be utilized to clearly identify a website’s organization identity. In another non-limiting example, physical, electronic, and procedural measures can be utilized to safeguard data from unauthorized access and disclosure.

Example Candidate Record Identification and Generation

[0051] The described graph-merge-split service 110, used to perform Entity Resolution in addressing fragmentation and overcombines for data records within a database, generates graphs for analysis by evaluating candidate records. FIG. 2 is a diagram depicting an example of a data flow 200 for identifying candidate records, according to certain aspects of the present disclosure.

[0052] FIG. 2 shows a data repository 120 and a subset of the data repository, referred to as a delta 202. Delta 202 can be used to facilitate quicker searching and candidate record identification by preventing excessive searching of the larger data repository 120. The graph-merge-split computing system 100, via the matching model 114, can access inquiry datasets that include relationships between prior inquiries and returned records via the data repository 120 and delta 202. The prior inquiries include search queries previously submitted to the database system by one or more client devices. The returned records include those records returned from the database system in response to the prior inquiries. Each prior inquiry can be correlated in the inquiry dataset to a corresponding set of returned records from the database system. Each set of returned records may include one or more returned records. For example, if the search query involves a particular entity (e.g., a consumer), the returned set of records may include one or more returned records involving that particular entity. For instance, the returned set of records can include a first record with a current address of the entity and a second record with a prior address of the entity.

[0053] The candidate generation model 113 can then generate optimized search keys per a search key optimizer 204 based on the relationships between prior inquiries and returned records. In an example, Boolean indexes can be generated based on the inquiry dataset and then deduplicated by identifying multiple Boolean indexes that correspond to the same returned record. Aspects of the multiple Boolean indexes can then be combined together to form a single Boolean index for that returned record. In such a way, there may be only one Boolean index corresponding to each returned record in the inquiry dataset. Frequent indexes may be identified within the set of deduplicated Boolean indexes, where the frequent indexes satisfy at least one criterion and where the frequent indexes occur at least a threshold number of times in the set of deduplicated Boolean indexes. One example of the criterion may be that the frequent indexes have estimated candidate sizes that are less than a maximum size. The maximum size can be customizable and selected to avoid frequent indexes that return an excessively large number of results. The frequent indexes with the highest frequency (e.g., occurring the greatest number of times) in the set of Boolean indexes may then be selected as the optimized search keys. The generated optimized search keys may provide for efficient traversal of the data repository or a subset of the data repository, including in instances where the data records are sparse (e.g., without specific identifiers such as a social security number) or contain variations in each of the identifiers. The optimized search keys can be tailored to the frequency of the identifiers such that uncommon identifiers may be found more easily.
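By way of illustration only, the deduplicate, count, filter, and select steps described above can be sketched as follows. The representation of a Boolean index as a tuple of flags over the hypothetical fields (name, date of birth, address, SSN), the element-wise OR used to combine indexes, and the `estimate_size` callable are all assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import Counter

# Hypothetical field order for each Boolean index tuple.
FIELDS = ("name", "dob", "address", "ssn")


def combine(indexes):
    """Combine several Boolean indexes for one record (element-wise OR, assumed)."""
    return tuple(any(bits) for bits in zip(*indexes))


def optimized_search_keys(rows, estimate_size, min_count, max_size, top_k):
    """rows: iterable of (returned_record_id, boolean_index) pairs."""
    per_record = {}
    for record_id, index in rows:
        per_record.setdefault(record_id, []).append(index)
    # One Boolean index per returned record after deduplication.
    deduped = [combine(ix) for ix in per_record.values()]
    counts = Counter(deduped)
    # Keep indexes that are frequent enough and not too broad.
    frequent = [
        (ix, c) for ix, c in counts.items()
        if c >= min_count and estimate_size(ix) < max_size
    ]
    # Highest frequency first; ties broken by the index tuple for determinism.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [ix for ix, _ in frequent[:top_k]]


rows = [
    ("r1", (True, True, False, False)),
    ("r1", (True, False, False, False)),
    ("r2", (True, True, False, False)),
    ("r3", (True, False, True, False)),
    ("r4", (True, False, True, False)),
    ("r5", (True, False, False, False)),
    ("r6", (True, False, False, False)),
    ("r7", (True, False, False, False)),
]
# Toy size estimate: fewer fields in the key means more candidates returned.
keys = optimized_search_keys(
    rows, lambda ix: 10 ** (4 - sum(ix)), min_count=2, max_size=500, top_k=2
)
```

In this toy data the name-only index is the most frequent but is rejected because its estimated candidate size exceeds the maximum, which is the behavior the size criterion is meant to produce.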

[0054] After generating the optimized search keys via the search key optimizer 204, the candidate generation model 113 identifies a preliminary candidate record set 206. The preliminary candidate records in some instances can include the final candidate set analyzed per subsequent procedures via the graph-merge-split service 110. However, even when optimized, the search keys used to generate the preliminary candidate record set 206 can still yield significant numbers of candidate records when queried against the data repository 120 and / or delta 202. Further processing of the preliminary candidate records to reduce the candidate record set may be applied.

[0055] The candidate generation model 113 may apply a filter model 208 to filter the preliminary candidate record set 206 to produce the filtered preliminary candidate record set 210, further reducing the number of records required for analysis by the graph-merge-split service 110. The filter model 208 can leverage information gathered from the optimized search keys and determine which of the optimized keys match and which do not match for a pair of records. Each of the identifiers between a set of records may match or not match corresponding identifiers. For instance, a first optimized search key may capture name and date of birth fields, while a second key may capture name and address fields, and the like. Each of the keys may capture one or more identifiers. Filtering may include determining whether a threshold number of keys match between a query record and a reference record and removing the query-record reference-record pair from the candidate set if the threshold number of keys do not match.
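By way of illustration only, the key-count filter described above can be sketched as follows. The record layout (dictionaries of string fields), the case-insensitive exact comparison, and the rule that a key with a missing field does not count as a match are assumptions made for this sketch.

```python
def key_value(record, key_fields):
    """Normalized tuple of the field values that make up one search key."""
    return tuple(str(record.get(f, "")).strip().lower() for f in key_fields)


def count_matching_keys(query, reference, keys):
    """Count the search keys on which the two records agree."""
    matched = 0
    for key_fields in keys:
        q = key_value(query, key_fields)
        # A key only counts when every field is present and equal (assumed rule).
        if all(q) and q == key_value(reference, key_fields):
            matched += 1
    return matched


def filter_pairs(pairs, keys, min_matching_keys):
    """Drop query/reference pairs that match on fewer than the threshold of keys."""
    return [
        (q, r) for q, r in pairs
        if count_matching_keys(q, r, keys) >= min_matching_keys
    ]


# Two hypothetical keys: (name, dob) and (name, address).
KEYS = [("name", "dob"), ("name", "address")]
query = {"name": "Ann Lee", "dob": "1990-01-01", "address": "1 Main St"}
ref_a = {"name": "ann lee", "dob": "1990-01-01", "address": "9 Oak Ave"}
ref_b = {"name": "Ann Lee", "dob": "1985-05-05", "address": "2 Elm St"}
kept = filter_pairs([(query, ref_a), (query, ref_b)], KEYS, min_matching_keys=1)
```

Here the first reference record agrees on the name and date of birth key and is kept, while the second agrees on no complete key and is removed from the candidate set.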

[0056] The filtered preliminary candidate record set 210, and / or the preliminary candidate record set 206 produced by the search key optimizer 204 may then be applied to the matching model 114 to generate matching decisions and matching scores between pairs of data records. In some examples, the matching model 114 is a machine learning model employed to determine matching decisions and scores between data records.

[0057] Applying the matching model 114 can include receiving a query record that contains one or more identifiers that can be used for matching. The graph-merge-split server 106 may receive the query record from a client computing system 102 in a request to find matching records in the data repository 120 or generate the query record based on the information contained in the request received from the client computing system 102. In other examples, the query record may be generated when the graph-merge-split computing system 100 receives new data, for example from an external data source, to be stored as a data record 122 in the data repository 120. The graph-merge-split server 106 generates the query record based on the information in the received new data. If the graph-merge-split server 106 finds no match for the query record in the data records, a new data record 122 can be created in the data repository 120 to store the new data; otherwise, the received new data will be used to update the data record 122 that matches the query record. In this way, fragmented records can be avoided.

[0058] In some examples, the query record can include a set of query records, such as the delta 202. Thus, the delta 202, as a subset of the data repository 120, can be queried against the data repository 120 to identify candidate records for merging and splitting per the graph-merge-split procedure (discussed with respect to FIG. 5). Querying the delta 202 against the data repository 120 to identify candidate records can thus improve the efficiency of candidate record identification as compared to querying the data repository 120 against itself.

[0059] Applying the matching model 114 can include receiving a query record that contains one or more identifiers that can be used for retrieving, from the data records 122, a reference record that contains the one or more identifiers to be used for matching. Identifier attributes can be generated for each of the identifiers in the query record and reference record that are to be used for matching.

[0060] The matching model 114 may then generate and output a matching decision based on the identifier attributes previously generated. In some examples, the matching model 114 can be a model that is explainable and exportable as a rule set, such as a decision tree model, a random forest, or a repeated incremental pruning to produce error reduction (RIPPER)-based model. The matching model 114 can be trained using training data to accept a set of identifier attributes generated for the pair of query record and reference record as input and output a matching decision. The matching model 114 may also generate and output a matching score indicating the confidence level associated with the matching decision. Depending on the type of the record matching model 114, the matching score may be generated based on the prediction errors of leaf nodes in a decision tree model or prediction errors of trees in a random forest model.

[0061] The matching decisions and matching scores produced by the matching model 114 may then be used to produce the final candidate records 212 which may then be used per the subsequent procedures discussed with respect to FIGS. 5-9 as the candidate record set for graph-merge-split analysis.

[0062] FIG. 3 is a flowchart depicting an example of a process 300 for training a machine learning matching model 114 trained to determine a matching decision for a reference record and a query record, according to certain aspects of the present disclosure. FIG. 3 will be described in conjunction with FIG. 4. FIG. 4 is a diagram illustrating the data flow in the training of the machine learning model, according to certain aspects of the present disclosure. For illustrative purposes, the process 300 is described with reference to implementations described above with respect to one or more examples described herein. Other implementations, however, are possible. In some aspects, the operations in FIG. 3 may be implemented in program code that is executed by one or more computing devices such as the model training server 108 depicted in FIG. 1. In some aspects of the present disclosure, one or more operations shown in FIG. 3 may be omitted or performed in a different order. Similarly, additional operations not shown in FIG. 3 may be performed.

[0063] At block 302, the process 300 involves obtaining the training samples for the record matching model 114. As shown in FIG. 4, each of the training samples 126 can include input identifier attributes 404 generated for a corresponding pair of training data records and a matching label 406 for the pair. The input identifier attributes 404 can include the identifier attributes described above with respect to block 306 of FIG. 3. In some examples, the training samples 126 include the pairs of training data records and the input identifier attributes 404 are generated based on the training data records according to the method described above with respect to block 306 of FIG. 2. The matching label 406 indicates whether the pair of training data records match or not. In some examples, the matching label 406 may be inaccurate and thus cannot serve as the ground truth for the training. As such, the training process 300 may also be used to identify ground truth matching labels for the training samples 126.

[0064] In some examples, the training samples 126 can be selected from the data records 122 and the respective associated labels based on stratified sampling. In the data records 122, some patterns of the identifier values may be rare compared to others. The model training server 108 can first perform random sampling in the data records 122 by the type of matches indicated by the label, such as a match or no match. If the labels have flags other than match or no match, those flags can be mapped to match or no match. A stratified sample by scores is then extracted from the randomly selected samples. In some examples, the score attributes, such as identifier scores, along with the compound scores (area scores and volume scores) are used for extracting the stratified samples. The scores or compound scores can be rounded to the nearest integer before a stratified sample is extracted. The sampling also ensures that each attribute value is represented n times, with n being a positive integer.
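By way of illustration only, stratifying by mapped label and rounded score can be sketched as follows. The flag names in `FLAG_MAP`, the single `score` field, and the fixed per-stratum quota are assumptions made for this sketch; the disclosure does not specify them.

```python
import random
from collections import defaultdict

# Hypothetical mapping of label flags onto match / no_match.
FLAG_MAP = {"match": "match", "no_match": "no_match", "probable_match": "match"}


def stratified_sample(samples, per_stratum, seed=0):
    """Draw up to `per_stratum` samples from each (label, rounded score) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        label = FLAG_MAP[sample["label"]]
        # Scores are rounded to an integer before stratifying (Python's round
        # sends exact .5 values to the nearest even integer).
        strata[(label, round(sample["score"]))].append(sample)
    picked = []
    for key in sorted(strata):
        group = strata[key]
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked


samples = (
    [{"id": f"m{i}", "label": "match", "score": 3.2} for i in range(10)]
    + [{"id": f"n{i}", "label": "no_match", "score": 3.2} for i in range(10)]
    + [{"id": "rare", "label": "probable_match", "score": 7.8}]
)
picked = stratified_sample(samples, per_stratum=2)
```

Because every stratum contributes up to the same quota, the single rare high-score sample is retained even though it makes up under five percent of the input.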

[0065] At block 304, the process 300 involves training the record matching model 114 and two or more auxiliary models 408 using the training samples 126. As discussed above with respect to FIG. 1, the record matching model 114 may be a decision tree model, a random forest, a RIPPER model, or any other model that is explainable and exportable as a rule set. The training can involve supervised training using the input attributes and the current matching labels in the training samples 126. In some examples, the auxiliary models 408 are employed in order to correct the matching label 406 in the misclassified training samples. The auxiliary models 408 can operate under different principles of classification and each can be trained to generate a classification of match or no-match based on attributes associated with a pair of records. Examples of the auxiliary models 408 can include a naive Bayes model, a multilayer perceptron model, a random forest model, and a support vector machine (SVM).

[0066] Each of the auxiliary models 408 can be trained using the training samples 126 further used to train the matching model 114. In some examples, the attributes input to each of the auxiliary models 408 can include the input identifier attributes 404 for the matching model 114. In other examples, the attributes input to each of the auxiliary models 408 include a subset of the input identifier attributes 404, such as the identifier scores and the compound scores. By using a subset of the input identifier attributes 404, the computational complexity of training the auxiliary models 408, and thus training the record matching model 114, can be significantly reduced.

[0067] At block 306, the process 300 involves determining predicted classifications for the training samples using the initially trained record matching model 114. In other words, the input identifier attributes 404 in each training sample 126 are input to the initially trained record matching model 114 to generate the respective predicted classifications 402.

[0068] At block 308, the process 300 involves identifying misclassified training samples. The misclassified training samples can include training samples that are mistakenly labeled. In other words, the matching label 406 in a training sample for a pair of matched records is incorrectly marked as no-match, or the matching label 406 in a training sample for a pair of unmatched records is incorrectly marked as a match. The graph-merge-split server 106 can identify a set of the training samples as misclassified training samples 412 if the predicted classifications 402 of the set of training samples 126 are different from the respective matching labels 406.

[0069] At block 310, the process 300 involves the graph-merge-split server 106 determining if there are any misclassified training samples 412. If so, the process 300 involves generating, at block 312, predicted classifications for each of the misclassified training samples 412 using the auxiliary models 408, referred to as auxiliary classifications 410. At block 314, the process 300 involves updating the misclassified training samples 412 based on the auxiliary classifications 410 generated by the auxiliary models 408.

[0070] In some examples, the auxiliary classifications 410 are compared with each other to determine if the misclassified training samples need to be corrected. Because the auxiliary models 408 have different underlying principles to predict the classifications, if a pair of records is a genuine match, the auxiliary models 408 should agree on the classification. But if the auxiliary models 408 do not agree on the predicted classifications, the pair of records should be further analyzed to determine the accurate label. For example, for a mismatched training sample, if the auxiliary classifications 410 are consistent with the predicted classification by the matching model 114, the graph-merge-split server 106 can change the matching label 406 of the mismatched training sample to be consistent with the classification output by the matching model 114. If the auxiliary classifications 410 include conflicting classifications, the graph-merge-split server 106 can determine the matching label for the mismatched training sample based on a combination of the original matching label, the classification by the matching model, and the auxiliary classifications 410 by the auxiliary matching models, such as through majority voting. Alternatively, or additionally, the record-matching computing system can output the mismatched training sample to another system for further analysis to determine the correct matching label. The mismatched training samples whose matching labels are corrected can then be used to update the corresponding training sample 126.
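By way of illustration only, the label-correction rule described above can be sketched as follows for a single misclassified training sample. The string labels, the function name, and the choice to return `None` on a tied vote (to signal that the sample should be sent elsewhere for further analysis) are assumptions made for this sketch.

```python
from collections import Counter


def corrected_label(original, model_pred, aux_preds):
    """Return the updated label for one misclassified sample, or None to escalate.

    original   -- current matching label of the training sample
    model_pred -- classification predicted by the matching model
    aux_preds  -- classifications predicted by the auxiliary models
    """
    # Case 1: every auxiliary model agrees with the matching model.
    if all(pred == model_pred for pred in aux_preds):
        return model_pred
    # Case 2: conflicting auxiliary classifications; take a majority vote over
    # the original label, the model prediction, and the auxiliary predictions.
    votes = Counter([original, model_pred, *aux_preds])
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return None  # tie: hand off for further analysis (assumed policy)
    return top_label


agree = corrected_label("no_match", "match", ["match", "match", "match"])
conflict = corrected_label("no_match", "match", ["no_match", "no_match", "match"])
tie = corrected_label("no_match", "match", ["match", "no_match"])
```

In the second call the vote is three to two in favor of the original label, so the label is left unchanged even though the matching model disagreed with it.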

[0071] The record matching model 114 can be re-trained using the updated training samples 126 at block 304 and the operations in blocks 306-314 can be repeated until the graph-merge-split server 106 determines, at block 310, that there are no misclassified training samples. The process 300 then involves, at block 316, the graph-merge-split server 106 outputting the trained record matching model 114 and the training samples 126. At this stage, the training samples 126 include the corrected matching labels 406, which can be used as ground truth matching labels 406.

Example Merge-Split Operations

[0072] FIG. 5 is a flowchart depicting an example of a graph-merge-split process 500 for Entity Resolution based on identified candidate records within a database, according to certain aspects of the present disclosure. For illustrative purposes, the process 500 is described with reference to implementations described above with respect to one or more examples described herein. Other implementations, however, are possible. In some aspects, the operations in FIG. 5 may be implemented in program code that is executed by one or more computing devices such as the graph-merge-split server 106 depicted in FIG. 1. In some aspects of the present disclosure, one or more operations shown in FIG. 5 may be omitted or performed in a different order. Similarly, additional operations not shown in FIG. 5 may be performed.

[0073] At block 502, the process 500 involves identifying a list of candidate records to be evaluated for merging. In some examples, every pair of data records 122 within a data repository 120 is evaluated for a possible merge. But for a data repository 120 containing a large number of data records, such as tens of millions of data records, the computational complexity of examining each pair of data records is prohibitively high. As such, additional techniques for identifying the list of candidate records to be evaluated for merging can include the steps as described in FIG. 2.

[0074] At block 504, the process 500 involves generating an entity level graph from the list of candidate records. FIG. 7 shows an example of a graph. In the graph, each node 702 represents an entity and an edge 704 between the two nodes indicates a match between the entities represented by the two nodes according to a matching decision. The values associated with edges 704 indicate the matching score for the paired entities. For example, the edge 704 connecting nodes B and C indicates that the two entities represented by these two nodes match with each other according to a matching decision for this pair. The value .95 associated with the edge 704 is the matching score associated with the matching decision, which indicates relatively high confidence in the matching decision. Similarly, the edge 706 connecting nodes A and E indicates that the entities represented by these two nodes match with each other but with a relatively low confidence score of .65. Pairs of nodes that do not have an edge connecting them are not considered as matching entities according to the matching decision, such as nodes B and F, or nodes C and E.

[0075] Challenges arise when nodes of a graph are not fully connected. For example, for three nodes A, B, and F, A is connected to B by an edge; A is further connected to F by another edge; but B and F are not connected. In this case, merging A and B or A and F can be problematic because B and F do not match. Either merging A and B or merging A and F would violate the matching decision between B and F. To address this kind of scenario and increase the precision of the merging, connected components are utilized. At block 506, the process 500 involves identifying connected components in the graph. A connected component of a graph is a subgraph in which any two nodes are connected to each other through one or more edges. In the example shown in FIG. 7, nodes A-F form a connected component 700. FIG. 7 further shows a process of breaking apart the connected component 700 by removing edges that fail to satisfy a minimum entity level merge score threshold as described in subsequent blocks of process 500.
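By way of illustration only, building an entity level graph from scored pairs and identifying its connected components can be sketched as follows. The B-C score of .95 and the A-E score of .65 come from the description of FIG. 7; every other edge, score, and node (including the separate X-Y pair) is hypothetical and chosen only so that the example has two components.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph as a list of sets.

    edges: iterable of (node_u, node_v, matching_score) triples.
    """
    adjacency = defaultdict(set)
    for u, v, _score in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


edges = [
    ("B", "C", 0.95),  # score given in the description of FIG. 7
    ("A", "E", 0.65),  # score given in the description of FIG. 7
    ("A", "B", 0.97),  # hypothetical score
    ("A", "F", 0.90),  # hypothetical score
    ("C", "D", 0.96),  # hypothetical edge and score
    ("X", "Y", 0.99),  # hypothetical second component
]
components = connected_components(edges)
```

Nodes B and F end up in the same component even though no edge joins them directly, which is the situation the paragraph above identifies as problematic for naive merging.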

[0076] Referring now back to FIG. 5, block 508 of process 500 indicates that, once the one or more connected components within the entity level graph are identified, blocks 510 - 524 may be iterated per a graph-merge process, where entities may be merged. Merging entities into a single, merged entity results in all the identifiers within those entities being merged into the single merged entity. Blocks 510 - 524 may be iteratively performed for each identified connected component such that each identified connected component is reduced by potential merges to ultimately generate an output graph per block 528.

[0077] At block 510, the process 500 involves determining an entity matching score for each entity edge. For the first identified connected component, the entity matching score may be the entity matching score generated at block 504 in forming the entity level graph. However, as process 500 involves iteratively merging specific entity nodes in each identified connected component, the entity level graph may be updated between analyzing each identified connected component. Thus, on each subsequent iteration of block 510, the process 500 may determine new entity matching scores for each entity edge of a given connected component. Determining the entity matching scores can follow a similar procedure to block 504, where each entity edge in a connected component is scored based on the similarity of its identifiers, for instance, via the matching model 114.

[0078] At block 512, the process 500 involves removing entity edges that fall below an entity level merge score threshold to generate an updated entity level graph. Removing entity edges between the entity nodes indicates that the entity nodes are insufficiently similar for merging. Because the entity level merge score threshold prevents merging of entities, where each entity represents a consumer, a relatively high (e.g., >0.95 similarity between entities) entity level merge score threshold may be used to decide whether to merge the two entity nodes. Removing entity level edges that fail to meet the entity level merge score threshold can thus lead to the entity level graph being updated.
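
The edge removal of block 512 can be sketched as a simple filter. The helper name `prune_edges`, the dictionary representation of the graph, and the example scores are assumptions made for illustration; only the 0.95 example threshold is taken from the text.

```python
def prune_edges(scored_edges, threshold=0.95):
    """Drop every edge whose matching score falls below `threshold`."""
    return {pair: score for pair, score in scored_edges.items() if score >= threshold}

# Hypothetical entity matching scores keyed by (node, node) pairs.
entity_edges = {
    ("A", "B"): 0.98,
    ("A", "F"): 0.97,
    ("B", "C"): 0.72,  # below the threshold, so this edge is removed
    ("C", "D"): 0.99,
}
updated_edges = prune_edges(entity_edges)
```

In this toy input, removing the B-C edge separates nodes A, B, and F from nodes C and D, which is the kind of break-up of a connected component described for FIG. 7.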

[0079] At block 514, the process 500 involves applying a maximal graph matching algorithm to the updated entity level graph. The maximal graph matching algorithm can further restructure the updated entity level graph by breaking down entity edges such that no entity node has multiple entity level edges. Thus, each entity will either be disconnected (having no entity edges connecting to another entity node), or, at most, paired with a single other entity node via a single entity edge. In breaking apart the entity nodes, the maximal graph matching algorithm optimizes the break down such that the remaining entity edges are those with the greatest entity matching scores. Each of the remaining pairs of entity nodes with the maximally matched entity edges may be referred to as the maximally paired entity nodes.

[0080] FIG. 8 is a diagram depicting an example implementation of the maximal graph matching algorithm, according to certain aspects of the present disclosure. The simplified example of FIG. 8 illustrates a first graph 802 broken down via the maximal graph matching algorithm into a second graph 804 including a set of maximally paired nodes and a lone node with no edges. The maximal graph matching algorithm, also referred to as maximal weight matching, can break up a graph, such as graph 802, such that no node has multiple edges. The maximal graph matching algorithm is further optimized such that weights are maximized between nodes with the single connected edge.

[0081] Graph 802 is shown including five nodes, where nodes 2 and 3 each have three edges connecting to adjacent nodes. The maximal graph matching algorithm can proceed by identifying the nodes having multiple edges, such as nodes 2 and 3, and then identifying the maximal edge for each of these nodes to remove the lesser scored edges, ensuring that each of nodes 2 and 3 is connected to at most one other node. In the example of FIG. 8, the maximal graph matching algorithm identifies the edge connecting nodes 2 and 4 as the maximal edge for node 2, and the edge connecting nodes 3 and 5 as the maximal edge for node 3. The maximal graph matching algorithm identifies the corresponding nodes as the maximally paired nodes and removes the lower scored edges of the remaining nodes. As a result, graph 804 can be generated where each node has, at most, one edge connecting to another node, but where the edge is the maximal edge of those identified from graph 802.
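
One way to approximate the pairing step is a greedy matching that visits edges from highest to lowest score and keeps an edge only when both of its nodes are still unpaired. This is a sketch under stated assumptions: the edge weights below are hypothetical because FIG. 8 does not give them here, and a greedy pass is not guaranteed to maximize the total weight, so an exact maximum weight matching routine (for example, NetworkX's `max_weight_matching`) could be substituted.

```python
def greedy_matching(scored_edges):
    """Pair nodes so that no node keeps more than one edge, strongest edges first."""
    matched, pairs = set(), []
    ranked = sorted(scored_edges.items(), key=lambda item: item[1], reverse=True)
    for (u, v), _score in ranked:
        if u not in matched and v not in matched:
            pairs.append((u, v))
            matched.update((u, v))
    return pairs

# Hypothetical weights for a five-node graph in which nodes 2 and 3
# each start with three edges.
graph_802 = {
    (1, 2): 0.96,
    (1, 3): 0.955,
    (2, 3): 0.97,
    (2, 4): 0.99,
    (3, 5): 0.98,
}
pairs = greedy_matching(graph_802)  # [(2, 4), (3, 5)]
unmatched = [n for n in range(1, 6) if all(n not in p for p in pairs)]  # [1]
```

With these weights the result has two pairs and one lone node, matching the shape of graph 804.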

[0082] Referring now back to FIG. 5, per determination 516, the process 500 determines whether applying the maximal graph matching algorithm at block 514 identified a set of maximally paired entity nodes (e.g., nodes 2-4 and 3-5 of graph 804). If not, the process exits, indicating that each of the entity nodes is accurately merged. Per block 526, exiting the process can include evaluating the next connected component of the identified connected components. If no additional connected components are identified, the process 500 may then terminate per block 528 where an output graph is generated based on the previous modifications made to the entity level graph.

[0083] At block 518, the process 500 involves, for each of the maximally paired entity nodes identified by the matching algorithm, determining an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity. In other words, for each pair of entity nodes that are part of the maximal matching, the process 500 determines an entity clique score.

[0084] FIG. 9 is a diagram depicting examples of cliques with varying clique scores representing the degree of connectivity between nodes within a graph, according to certain aspects of the present disclosure. The principles described with respect to FIG. 9 can be applied with respect to entity level graphs, as per block 518, and additionally with respect to identity level graphs per block 618 of process 600. In the example of FIG. 9, three connected components 902, 904, and 906 are shown, each with a varying clique score.

[0085] The connected component 902 is shown as fully connected, having a clique score of 1 indicating that each node is connected to every other node. A perfect clique score of 1 would indicate that the connected component is completely connected and would thus satisfy every delta-clique constraint applied to the connected component.

[0086] Connected component 904 is shown having intermediate connectivity. Some of the nodes are fully connected to every other node, while other nodes have lesser connectivity. The connected component 904 would thus have a medium clique score, or delta. The clique score for connected component 904, e.g., a normalized score of 0.7 on a range of 0-1, may be sufficient to satisfy certain delta-clique constraints (e.g., a lower delta-clique constraint for a merge procedure, entity level approach), while failing other delta-clique constraints (e.g., a higher delta-clique constraint applied for a split procedure, identity level approach) depending on the values set.

[0087] Connected component 906 is shown having low connectivity. Each of the nodes is connected, at most, to two other nodes while nodes on the edges of the connected component 906 are connected only to a single other node. The connected component 906 would thus have a low clique score. Connected component 906 may generally be filtered out by various delta-clique constraints to prevent merging of the nodes within the connected component 906.
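
The text does not give a formula for the clique score. One plausible reading, assumed here purely for illustration, is edge density: the fraction of all possible node pairs that are actually connected, which is 1 for a fully connected component and falls toward 0 as edges are missing. The names `clique_score`, `MERGE_DELTA`, and `SPLIT_DELTA` and the two delta values are hypothetical.

```python
from itertools import combinations

def clique_score(nodes, edges):
    """Edge density in [0, 1]: connected pairs divided by all possible pairs."""
    nodes = set(nodes)
    if len(nodes) < 2:
        return 1.0
    present = {frozenset(edge) for edge in edges}
    possible = [frozenset(pair) for pair in combinations(nodes, 2)]
    return sum(pair in present for pair in possible) / len(possible)

full = clique_score("ABCD", combinations("ABCD", 2))       # every pair linked
partial = clique_score("ABCD", ["AB", "AC", "AD", "BC", "BD"])  # C-D missing
chain = clique_score("ABCD", ["AB", "BC", "CD"])            # a simple path

# Hypothetical constraints: a lower one for merging, a higher one for splitting.
MERGE_DELTA, SPLIT_DELTA = 0.6, 0.9
passes_merge = partial > MERGE_DELTA
passes_split = partial > SPLIT_DELTA
```

Under this assumed definition the partially connected graph satisfies the lower constraint but fails the higher one, mirroring the behavior described for connected component 904.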

[0088] Referring now back to FIG. 5, per block 520, the entity clique score may be compared against an entity level delta-clique constraint, where the clique score represents a normalized score from 0 to 1 indicating how well the nodes in a set are connected to each other. Each set of maximally paired entity nodes may be evaluated for merging by evaluating the strength of the associations between the maximally paired entity nodes via the entity level delta-clique constraint. In the entity merge case, a relatively low entity level delta-clique constraint may be used relative to a higher identity level delta-clique constraint applied in a subsequent split process. Because the entity level merge score threshold is used per block 512, the use of a lower entity level delta-clique constraint may be motivated. If the entity clique score fails to exceed the entity level delta-clique constraint, indicating that the maximally paired entity nodes are not to be merged, the process may then terminate by turning to block 526. Per block 526, exiting the process can include evaluating the next connected component of the identified connected components. If no additional connected components are identified, the process 500 may then terminate per block 528 where an output graph is generated based on the previous modifications made to the entity level graph.

[0089] At block 522, the maximally paired entity nodes are merged. Merging is thus performed between entities, and all identities within those entities are merged into the single merged entity. Thus, according to the process occurring prior to merge, each set of entity level nodes in a given graph is first filtered by the entity level merge score threshold per block 512, reduced to single or paired entity nodes through a maximal graph matching algorithm per block 514, and evaluated against an entity level delta-clique constraint per block 520. Each of these sub-procedures is used to further determine the accuracy of merging a pair of entity nodes prior to the merge.
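
The sequence of filtering, pairing, checking, and merging, repeated until no merge occurs, can be sketched as a small loop. This is a simplified illustration and not the disclosed implementation: it scores every pair rather than only candidate edges, uses a toy Jaccard similarity in place of a matching model, uses a toy threshold of 0.6 so that the small example merges, pairs greedily, and reduces the delta-clique check to a hypothetical `accept` hook that approves everything by default. The graph-splitter step is left out.

```python
from itertools import combinations

def jaccard(a, b):
    """Toy similarity between two non-empty identifier sets."""
    return len(a & b) / len(a | b)

def merge_until_stable(entities, score=jaccard, threshold=0.6,
                       accept=lambda left, right: True):
    """Repeat score, filter, pair, and merge until an iteration merges nothing."""
    entities = {name: set(ids) for name, ids in entities.items()}
    while True:
        # Score each pair and keep only edges that meet the threshold.
        kept = {}
        for a, b in combinations(sorted(entities), 2):
            s = score(entities[a], entities[b])
            if s >= threshold:
                kept[(a, b)] = s
        # Greedy pairing: strongest edges first, each entity used at most once.
        used, merged_any = set(), False
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        for (a, b), _s in ranked:
            if a in used or b in used:
                continue
            used.update((a, b))
            if accept(entities[a], entities[b]):
                # All identifiers of b move into the single merged entity a.
                entities[a] |= entities.pop(b)
                merged_any = True
        if not merged_any:
            return entities

records = {
    "e1": {"a", "b"},
    "e2": {"a", "b", "c"},
    "e3": {"a", "b", "d"},
    "e4": {"a", "b", "c", "d"},
    "e5": {"x", "y"},
}
resolved = merge_until_stable(records)
```

In this toy run the first iteration forms two merged entities, the second iteration merges those two into one, and the third finds nothing to merge, so the loop stops with `e5` untouched.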

[0090] At block 524, a graph-splitter process is applied to the merged nodes. The graph-splitter process entails a procedure, discussed with respect to FIG. 6, similar to that of the graph-merge process described with respect to FIG. 5. Generally, the graph-splitter process includes generating an identity level graph from the merged entity nodes generated at block 522, identifying connected components within the identity level graph, comparing identity matching scores for each identity edge against an identity merge threshold to modify the identity level graph, applying the maximal graph matching algorithm to the identity level graph, comparing an identity clique score against an identity level delta-clique threshold, and merging those identities that satisfy the threshold. By merging identities that satisfy such conditions, a merge-split graph is produced, where the merged entity (merged per the graph-merge process) is split (per the graph-split process) to reduce overcombines such that each consumer, or person, is represented by at most a single entity. After completion of the graph-merge process and the graph-split process, an output graph comprising the final merge-split graph can be generated.

[0091] At block 528, the process 500 includes generating an output graph including the merge-split graph. The output graph, generated by iterating blocks 508-526 of process 500, may thus include sets of merged entities, where the merged entities resolve issues of fragmentation in which sets of previously unmerged entities referred to the same consumer or person. Similarly, each time entities are merged, a graph-splitter procedure is called in order to resolve issues related to overcombination such that each merged entity relates to only one respective consumer. In some applications, the generated output graph may be displayed or output via a monitoring report to enable database administrators to perform further analyses and provide for additional resolution and oversight of unresolved entities as stored in an entity repository.

[0092] FIG. 6 is a flowchart depicting an example of a graph-splitter process 600 for Entity Resolution based on identifiers within a merged entity, according to certain aspects of the present disclosure. The graph-splitter process may be performed between identities of a single entity (e.g., the merged entity). As the splitting is performed between identities, the process 600 is referred to as being applied at the identity level, where the generated graphs for implementing the graph-splitter map relationships of identities for splitting. The identities can be split into 1 to n entities after the split. For illustrative purposes, the process 600 is described with reference to implementations described above with respect to one or more examples described herein. Other implementations, however, are possible. In some aspects, the operations in FIG. 6 may be implemented in program code that is executed by one or more computing devices such as the graph-merge-split server 106 depicted in FIG. 1. In some aspects of the present disclosure, one or more operations shown in FIG. 6 may be omitted or performed in a different order. Similarly, additional operations not shown in FIG. 6 may be performed.

[0093] At block 602, the process 600 begins with a merged entity. The merged entity can be the merged entity formed by merging nodes at block 522 per process 500. In such a way, the process 600 is iteratively performed for each connected component in an entity level graph that satisfied the conditions of process 500 for merging, such as the merging entities being identified as satisfying an entity merge threshold and satisfying the entity level delta-clique constraint. The merged entity per block 602 can include multiple identifiers linked to the merged node. As the entity nodes are now merged into one merged entity node, the identities may be evaluated to determine whether the identities each belong to the same entity or instead should be split into separate entities representing different consumers.

[0094] At block 604, the process 600 involves generating an identity level graph. The identity level graph includes the identifiers of the merged entity as nodes, referred to as identity nodes, where the identity nodes are connected by identity edges, representing the strength of the association between each identity node in the identity level graph. The identity level graph is similar to the entity level graph but represents the connections between identities of a merged entity. Per the process 500 of FIG. 5, scores such as entity matching scores for each entity edge were previously generated. The same matching scores may be used to generate the identity edge scores.

[0095] With the identity edge scores generated, process 600 involves identifying one or more identity level connected components per block 606. A similar process for identifying identity level connected components within the identity level graph may be applied as was applied to identify the connected components of the entity level graph per block 506 of process 500. Generally, the identity level graph generated per block 604 may already be a fully connected network of identity nodes. Thus, identifying the one or more identity level connected components per block 606 can entail identifying the fully connected identity level graph, generated per block 604, as the single connected component at the identity level.
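
A minimal sketch of building an identity level graph follows, assuming each identity is a small record of fields and that every pair of identities in the merged entity receives a scored edge. The scorer `field_agreement`, the field names, and the example records are hypothetical stand-ins; in the disclosure the scores would come from the matching model rather than from this toy function.

```python
from itertools import combinations

def field_agreement(left, right):
    """Toy score: share of common fields whose values are identical."""
    fields = left.keys() & right.keys()
    return sum(left[f] == right[f] for f in fields) / len(fields)

def build_identity_graph(identities, score=field_agreement):
    """Return {(id_a, id_b): score} for every pair of identities."""
    return {
        (a, b): score(identities[a], identities[b])
        for a, b in combinations(sorted(identities), 2)
    }

# Hypothetical identities inside one merged entity.
merged_entity = {
    "i1": {"name": "ann lee", "dob": "1990-01-02", "zip": "30301"},
    "i2": {"name": "ann b lee", "dob": "1990-01-02", "zip": "30301"},
    "i3": {"name": "bob ray", "dob": "1985-07-09", "zip": "10001"},
}
identity_graph = build_identity_graph(merged_entity)
```

Here `i1` and `i2` agree on two of three fields while `i3` agrees with neither, so `i3` is the kind of identity a later split step would be expected to separate.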

[0096] Block 608 of process 600 indicates that, once the one or more connected components within the identity level graph are identified, blocks 610 - 628 may be iterated per the graph-splitter process, where identities may be merged so as to split the merged entity. Blocks 610 - 628 may be iteratively performed for each identified connected component such that each identified connected component is reduced by potential merges to ultimately generate the updated merged entity per block 626. Per blocks 604 and 606, only a single iteration may be performed (i.e., only one connected component was identified per block 606, as the connected component is effectively identical to the graph generated per block 604).

[0097] At block 610, the process 600 involves determining an identity matching score for each identity edge. Block 610 is similar to block 510 but applied at an identity level. As in block 510, in a first instance, determining the identity matching score for each identity edge may include retrieving the scores used to generate the identity level connected component. On subsequent iterations and for each additional connected component, a new identity matching score for each identity edge may be calculated to account for modifications made to the identity level graph, for instance, when identities within a connected component are merged.

[0098] At block 612, the process 600 involves removing identity edges that fall below an identity level merge threshold. Since all identities within an entity may not score high enough against each other, a low identity level merge threshold is used, but one high enough to identify outliers. Thus, the identity level merge threshold may have a lower value compared to the entity level merge threshold used per block 512. Otherwise, blocks 512 and 612 may be similarly applied to remove respective edges (entity or identity level) so as to adjust the respective entity and identity level graphs.
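
To illustrate how a lower identity level threshold can still expose outliers, the sketch below prunes weak identity edges and reports identities left without any edge. The function name `find_outliers`, the 0.5 threshold, and the scores are assumptions for illustration only.

```python
def find_outliers(identities, identity_edges, threshold=0.5):
    """Prune weak identity edges and list identities left with no edge."""
    kept = {pair: s for pair, s in identity_edges.items() if s >= threshold}
    linked = {node for pair in kept for node in pair}
    outliers = [i for i in identities if i not in linked]
    return kept, outliers

# Hypothetical identity matching scores within one merged entity.
identity_edges = {("i1", "i2"): 0.8, ("i1", "i3"): 0.1, ("i2", "i3"): 0.2}
kept_edges, outliers = find_outliers(["i1", "i2", "i3"], identity_edges)
```

The pair `i1` and `i2` would not survive the 0.95 entity level example threshold, yet it is retained here, while `i3` loses all of its edges and stands out as a candidate for splitting.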

[0099] At block 614, the process 600 involves applying the maximal graph matching algorithm to the identity level graph. As in block 514, the maximal graph matching algorithm can further restructure the updated identity level graph by breaking down identity edges such that no identity node has multiple identity edges. The maximal graph matching algorithm can identify maximally paired identity nodes within the updated identity level graph for merging. Per decision 616, if a set of maximally paired identity nodes was identified, the process 600 can proceed to an identity level delta-clique constraint analysis to further determine whether to merge the maximally paired identity nodes. Otherwise, the process 600 proceeds to analyze additional connected components within the identity level graph.

[0100] At block 618, the process 600 involves determining an identity clique score based on the maximally paired identity nodes identified by the maximal graph matching algorithm. Similar to block 518 of process 500, block 618 involves determining an identity clique score for each pair of identity nodes that are part of the maximal matching.

[0101] Per block 620, the identity clique score may be compared against an identity level delta-clique constraint, where the clique score represents a normalized score from 0 to 1 indicating how well the nodes in a set are connected to each other. Each set of maximally paired identity nodes may be evaluated for merging by evaluating the strength of the associations between the maximally paired identity nodes via the identity level delta-clique constraint. In the identity level, split case, a relatively higher identity level delta-clique constraint may be applied compared to the entity level delta-clique constraint applied per block 520. Because a lower identity level merge score threshold is applied per block 612, the higher identity level delta-clique constraint may be used to weed out outliers (e.g., identities that are insufficiently similar per their assigned identity edge similarity scores). If the identity clique score fails to exceed the identity level delta-clique constraint, the process may then terminate by turning to decision 628. Per decision 628, exiting the process can include evaluating the next connected component of the identified connected components. If no additional connected components are identified, the process 600 may then terminate per block 630 where updates to a merge-split graph are completed.

[0102] Per block 620, if the identity clique score exceeds the identity level delta-clique constraint, the maximally paired identity nodes may be merged per block 626, producing an updated merge-split graph. Subsequent to the merge, the process 600 may return to block 608 where the next identity level connected component is evaluated per a similar procedure. The process of blocks 608-624 may be iterated until no additional identity level connected components are identified, leading to block 626, where an updated identity level graph is applied to update the merged entity within the entity level graph. The updated entity level graph, including updates made via the split process, represents the merge-split graph, which may form the output graph.
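
The relationship between the two levels (a higher merge score threshold and a lower delta-clique constraint at the entity level, and the reverse at the identity level) can be captured in a small configuration object that rejects inconsistent settings. The class name `MergeSplitConfig`, its field names, and all default values other than the 0.95 example are hypothetical, as is the assumption that every value lies between 0 and 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MergeSplitConfig:
    entity_merge_threshold: float = 0.95    # example value from the text
    identity_merge_threshold: float = 0.50  # hypothetical
    entity_delta_clique: float = 0.60       # hypothetical
    identity_delta_clique: float = 0.90     # hypothetical

    def __post_init__(self):
        values = (self.entity_merge_threshold, self.identity_merge_threshold,
                  self.entity_delta_clique, self.identity_delta_clique)
        if not all(0.0 <= v <= 1.0 for v in values):
            raise ValueError("thresholds are assumed to lie in [0, 1]")
        if self.entity_merge_threshold <= self.identity_merge_threshold:
            raise ValueError("entity merge threshold should exceed the identity one")
        if self.entity_delta_clique >= self.identity_delta_clique:
            raise ValueError("entity delta-clique should be below the identity one")

    def passes_entity_clique(self, clique_score):
        """True when the score exceeds the entity level delta-clique constraint."""
        return clique_score > self.entity_delta_clique

    def passes_identity_clique(self, clique_score):
        """True when the score exceeds the identity level delta-clique constraint."""
        return clique_score > self.identity_delta_clique

config = MergeSplitConfig()
merge_ok = config.passes_entity_clique(0.7)
split_ok = config.passes_identity_clique(0.7)
```

With these assumed values a clique score of 0.7 is enough to merge at the entity level but not at the identity level, which is the asymmetry the two constraints are meant to provide.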

[0103] In some examples, special consideration may be applied in the graph-splitter process to avoid inadvertently overcombining two data records. In such cases, the process 600 may optionally include determining a centrality score for merging identities and comparing the centrality score against a centrality threshold to determine whether to merge. Optional blocks 622 and 624 illustrate an example of such a process applied during the graph-splitter procedure to prevent overcombination of two data records.
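
The optional centrality check can be sketched with eigenvector centrality, one possible centrality measure, computed here by power iteration on a weighted identity graph. Everything named below is an assumption for illustration: the function names, the 0.5 threshold, the example weights, the unit-length normalization, and the use of the shifted matrix A + I, which leaves the leading eigenvector unchanged while avoiding oscillation on some graphs.

```python
import math

def eigenvector_centrality(nodes, weighted_edges, iterations=200, tol=1e-12):
    """Power iteration on (A + I); returns unit-length centrality per node."""
    neighbors = {n: {} for n in nodes}
    for (u, v), w in weighted_edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    x = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: x[n] + sum(w * x[m] for m, w in neighbors[n].items())
               for n in nodes}
        norm = math.sqrt(sum(value * value for value in nxt.values()))
        nxt = {n: value / norm for n, value in nxt.items()}
        converged = max(abs(nxt[n] - x[n]) for n in nodes) < tol
        x = nxt
        if converged:
            break
    return x

def allow_merge(pair, centrality, threshold=0.5):
    """Block the merge when the pair's average centrality is below threshold."""
    a, b = pair
    return (centrality[a] + centrality[b]) / 2 >= threshold

# Hypothetical identity graph: a tight triangle a-b-c plus a weakly tied d.
edges = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9, ("c", "d"): 0.4}
centrality = eigenvector_centrality("abcd", edges)
merge_ab = allow_merge(("a", "b"), centrality)
merge_cd = allow_merge(("c", "d"), centrality)
```

In this toy graph the pair inside the triangle clears the threshold, while the pair involving the weakly tied node does not, so that merge would be prevented.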

[0104] At block 622, the process 600 can include determining a centrality score for the maximally paired identity nodes, prior to merging the maximally paired identity nodes. The centrality score may include an average eigenvector centrality score determined from evaluating the weights of the two identity nodes being merged. Other centrality scores may be used in addition to or alternatively to average eigenvector centrality such as Katz centrality and the like. At block 624, the process may then compare the centrality score for the maximally paired identity nodes against a centrality threshold. The centrality threshold may be a configurable value representing a tolerance of the potential risk of data record overcombination during the graph-splitter process. In response to determining the centrality score is less than the centrality threshold, the process 600 can terminate the merging of the two maximally paired identity nodes per block 626, and instead transition towards analyzing any additional connected components in the identity level graph per decision 628 or updating the merged identity having not merged the maximally paired identity nodes which failed to reach the centrality score threshold.

Example of a Computing Environment for Record Matching and Fragmented File Detection

[0105] Any suitable computing system or group of computing systems can be used to perform the operations for record matching and fragmented file detection described herein. For example, FIG. 10 is a block diagram depicting an example of a computing device 1000 which can be the graph-merge-split server 106 or the model training server 108. The example of the computing device 1000 can include various devices for communicating with other devices in the graph-merge-split computing system 100, as described with respect to FIG. 1. The computing device 1000 can include various devices for performing one or more operations described above with respect to FIGS. 1-9.

[0106] The computing device 1000 can include a processor 1002 that is communicatively coupled to a memory 1004. The processor 1002 executes computer-executable program code stored in the memory 1004, accesses information stored in the memory 1004, or both. Program code may include machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, among others.

[0107] Examples of a processor 1002 include a microprocessor, an application-specific integrated circuit, a field-programmable gate array, or any other suitable processing device. The processor 1002 can include any number of processing devices, including one. The processor 1002 can include or communicate with a memory 1004. The memory 1004 stores program code that, when executed by the processor 1002, causes the processor to perform the operations described in this disclosure.

[0108] The memory 1004 can include any suitable non-transitory computer-readable medium. The computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable program code or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, memory chip, optical storage, flash memory, storage class memory, ROM, RAM, an ASIC, magnetic storage, or any other medium from which a computer processor can read and execute program code. The program code may include processor-specific program code generated by a compiler or an interpreter from code written in any suitable computer-programming language. Examples of suitable programming languages include Hadoop, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, ActionScript, etc.

[0109] The computing device 1000 may also include a number of external or internal devices such as input or output devices. For example, the computing device 1000 is shown with an input / output interface 1008 that can receive input from input devices or provide output to output devices. A bus 1006 can also be included in the computing device 1000. The bus 1006 can communicatively couple one or more components of the computing device 1000.

[0110] The computing device 1000 can execute program code that includes graph-merge-split service 110, matching model 114, or model training service 116. The program code for the graph-merge-split service 110, matching model 114, or model training service 116 may be resident in any suitable computer-readable medium and may be executed on any suitable processing device. For example, as depicted in FIG. 10, the program code for the graph-merge-split service 110, matching model 114, or model training service 116 can reside in the memory 1004 at the computing device 1000. Executing the graph-merge-split service 110, matching model 114, or model training service 116 can configure the processor 1002 to perform the operations described herein.

[0111] In some aspects, the computing device 1000 can include one or more output devices. One example of an output device is the network interface device 1010 depicted in FIG. 10. A network interface device 1010 can include any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks described herein. Non-limiting examples of the network interface device 1010 include an Ethernet network adapter, a modem, etc.

[0112] Another example of an output device is the presentation device 1012 depicted in FIG. 10. A presentation device 1012 can include any device or group of devices suitable for providing visual, auditory, or other suitable sensory output. Non-limiting examples of the presentation device 1012 include a touchscreen, a monitor, a speaker, a separate mobile computing device, etc. In some aspects, the presentation device 1012 can include a remote client-computing device that communicates with the computing device 1000 using one or more data networks described herein. In other aspects, the presentation device 1012 can be omitted.

[0113] The foregoing description of some examples has been presented only for the purpose of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Numerous modifications and adaptations thereof will be apparent to those skilled in the art without departing from the spirit and scope of the disclosure.

Claims

1. A computer-implemented method that includes one or more processing devices performing operations comprising: identifying a list of candidate records from a set of data records stored in a data repository for merging; generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes comprising entities and entity edges comprising entity matching scores; identifying one or more connected components within the entity level graph; performing a graph-merge process wherein the graph-merge process is performed for each connected component of the one or more connected components and the graph-merge process comprises: determining the entity matching score for each entity edge; removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph; applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes; for each of the maximally paired entity nodes identified by the maximal graph matching algorithm: determining an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity; and in response to determining the entity clique score exceeds an entity level delta-clique constraint, merging the maximally paired entity nodes to form the merged entity; applying a graph-splitter process to the merged entity to generate a merge-split graph; repeating the graph-merge process until no merge occurs; and generating an output graph comprising the merge-split graph.

2. The method of claim 1, wherein applying the graph-splitter process comprises: generating an identity level graph from the merged entity, where the identity level graph includes identity nodes comprising identities and identity edges comprising identity matching scores; removing each identity edge that falls below an identity level merge score threshold to generate an updated identity level graph; applying the maximal graph matching algorithm to the updated identity level graph where the maximal graph matching algorithm identifies one or more maximally paired identity nodes; for each of the maximally paired identity nodes identified by the maximal graph matching algorithm: determining an identity clique score based on effects of merging the maximally paired identity nodes together to form a merged identity; and in response to determining the identity clique score exceeds an identity level delta-clique constraint, merging the maximally paired identity nodes to generate the merge-split graph.

3. The method of claim 2, wherein the entity level merge score threshold is greater than the identity level merge score threshold.

4. The method of claim 2, wherein the entity level delta-clique constraint is less than the identity level delta-clique constraint.

5. The method of claim 2, wherein the graph-splitter process further comprises, prior to merging the maximally paired identity nodes: determining a centrality score for the maximally paired identity nodes; and in response to determining the centrality score is less than a centrality threshold, preventing merging of the maximally paired identity nodes.

6. The method of claim 1, wherein identifying the list of candidate records comprises: accessing an inquiry data set that includes relationships between prior inquiries and returned records; generating a set of optimized search keys based on the relationships between prior inquiries and returned records; grouping a subset of records in a record database; querying the subset of records by the set of optimized search keys to generate a preliminary candidate set; and filtering the preliminary candidate set to identify the list of candidate records.

7. The method of claim 1, wherein the entity matching scores are generated by a machine learning model, the machine learning model being trained by steps comprising: obtaining a plurality of training samples, each training sample of the plurality of training samples comprising a set of training matching attributes generated for a pair of data records and a matching label indicating a match or a no-match between the pair of data records; training the machine learning model using the plurality of training samples; determining predicted classifications for the plurality of training samples by inputting the sets of training matching attributes to the machine learning model; identifying a set of the training samples as misclassified training samples based on a set of the predicted classifications being different from the respective matching labels in the training samples; generating two or more auxiliary classifications for each of the misclassified training samples using two or more auxiliary models; updating the matching labels of the misclassified training samples based on the two or more auxiliary classifications; and re-training the machine learning model using the plurality of training samples with the updated matching labels.

8. A non-transitory computer-readable storage medium storing instructions executable by a processing device to perform operations comprising: identifying a list of candidate records from a set of data records stored in a data repository for merging; generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes comprising entities and entity edges comprising entity matching scores; identifying one or more connected components within the entity level graph; performing a graph-merge process wherein the graph-merge process is performed for each connected component of the one or more connected components and the graph-merge process comprises: determining the entity matching score for each entity edge; removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph; applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes; for each of the maximally paired entity nodes identified by the maximal graph matching algorithm: determining an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity; and in response to determining the entity clique score exceeds an entity level delta-clique constraint, merging the maximally paired entity nodes to form the merged entity; applying a graph-splitter process to the merged entity to generate a merge-split graph; repeating the graph-merge process until no merge occurs; and generating an output graph comprising the merge-split graph.

9. The non-transitory computer-readable storage medium of claim 8, wherein the operation of applying the graph-splitter process comprises: generating an identity level graph from the merged entity, where the identity level graph includes identity nodes comprising identities and identity edges comprising identity matching scores; removing each identity edge that falls below an identity level merge score threshold to generate an updated identity level graph; applying the maximal graph matching algorithm to the updated identity level graph where the maximal graph matching algorithm identifies one or more maximally paired identity nodes; for each of the maximally paired identity nodes identified by the maximal graph matching algorithm: determining an identity clique score based on effects of merging the maximally paired identity nodes together to form a merged identity; and in response to determining the identity clique score exceeds an identity level delta-clique constraint, merging the maximally paired identity nodes to generate the merge-split graph.

10. The non-transitory computer-readable storage medium of claim 9, wherein the entity level merge score threshold is greater than the identity level merge score threshold.

11. The non-transitory computer-readable storage medium of claim 9, wherein the entity level delta-clique constraint is less than the identity level delta-clique constraint.

12. The non-transitory computer-readable storage medium of claim 9, wherein the graph-splitter process further comprises, prior to merging the maximally paired identity nodes: determining a centrality score for the maximally paired identity nodes; and in response to determining the centrality score is less than a centrality threshold, preventing merging of the maximally paired identity nodes.

13. 
The non-transitory computer-readable storage medium of claim 8, wherein the operation of identifying the list of candidate records comprises: accessing an inquiry data set that includes relationships between prior inquiries and returned records; generating a set of optimized search keys based on the relationships between prior inquiries and returned records; grouping a subset of records in a record database; querying the subset of records by the set of optimized search keys to generate a preliminary candidate set; and filtering the preliminary candidate set to identify the list of candidate records.14. The non-transitory computer-readable storage medium of claim 8, wherein the entity matching scores are generated by a machine learning model, the machine learning model being trained by steps comprising: obtaining a plurality of training samples, each training sample of the plurality of training samples comprising a set of training matching attributes generated for a pair of data records and a matching label indicating a match or a no-match between the pair of data records; training the machine learning model using the plurality of training samples; determining predicted classifications for the plurality of training samples by inputting the sets of training matching attributes to the machine learning model;Attorney Docket No. 096923-1461100 (EFX-199WO)PATENT APPLICATION identifying a set of the training samples as misclassified training samples based on a set of the predicted classifications being different from the respective matching labels in the training samples; generating two or more auxiliary classifications for each of the misclassified training samples using two or more auxiliary models; updating the matching labels of the misclassified training samples based on the two or more auxiliary classifications; and re-training the machine learning model using the plurality of training samples with the updated matching labels.15. 
A computing system comprising: a processing device; a data repository for storing data records, wherein each data record comprises one or more identifiers; and a non-transitory computer-readable storage medium storing instructions executable by the processing device to perform operations comprising: identifying a list of candidate records from a set of data records stored in a data repository for merging; generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes comprising entities and entity edges comprising entity matching scores; identifying one or more connected components within the entity level graph; performing a graph-merge process wherein the graph-merge process is performed for each connected component of the one or more connected components and the graph-merge process comprises: determining the entity matching score for each entity edge; removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph; applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes; for each of the maximally paired entity nodes identified by the maximal graph matching algorithm: determining an entity clique score based on effects of mergingAttorney Docket No. 096923-1461100 (EFX-199WO)PATENT APPLICATION the maximally paired entity nodes together to form a merged entity; and in response to determining the entity clique score exceeds an entity level delta-clique constraint, merging the maximally paired entity nodes to form the merged entity; applying a graph-splitter process to the merged entity to generate a merge-split graph; repeating the graph-merge process until no merge occurs; and generating an output graph comprising the merge-split graph.16. 
The computing system of claim 15, wherein the operation of applying the graph-splitter process comprises: generating an identity level graph from the merged entity, where the identity level graph includes identity nodes comprising identities and identity edges comprising identity matching scores; removing each identity edge that falls below an identity level merge score threshold to generate an updated identity level graph; applying the maximal graph matching algorithm to the updated identity level graph where the maximal graph matching algorithm identifies one or more maximally paired identity nodes; for each of the maximally paired identity nodes identified by the maximal graph matching algorithm: determining an identity clique score based on effects of merging the maximally paired identity nodes together to form a merged identity; and in response to determining the identity clique score exceeds an identity level delta-clique constraint, merging the maximally paired identity nodes to generate the merge-split graph.17. The computing system of claim 16, wherein the entity level merge score threshold is greater than the identity level merge score threshold.18. The computing system of claim 16, wherein the entity level delta-clique constraint is less than the identity level delta-clique constraint.19. The computing system of claim 16, wherein the graph-splitter process furtherAttorney Docket No. 096923-1461100 (EFX-199WO)PATENT APPLICATION comprises, prior to merging the maximally paired identity nodes; determining a centrality score for the maximally paired identity nodes; and in response to determining the centrality score is less than a centrality threshold, preventing merging of the maximally paired identity nodes.20. 
The computing system of claim 15, wherein identifying the list of candidate records comprises: accessing an inquiry data set that includes relationships between prior inquiries and returned records; generating a set of optimized search keys based on the relationships between prior inquiries and returned records; grouping a subset of records in a record database; querying the subset of records by the set of optimized search keys to generate a preliminary candidate set; and filtering the preliminary candidate set to identify the list of candidate records.Attorney Docket No. 096923-1461100 (EFX-199WO)PATENT APPLICATIONGRAPH MERGE-SPLIT TECHNIQUES FOR ENTITY RESOLUTIONTechnical Field

[0001] This disclosure relates generally to digital data processing systems. More specifically, but not by way of limitation, this disclosure relates to systems and methods for Entity Resolution through graph-merge-split techniques.

Background

[0002] Databases often store data in records. Each record may represent an entity, where the entity is representative of a person (also referred to as a consumer), place, or thing. Each record also includes multiple identities (e.g., the entity name, address, social security number, date of birth, and the like). Entity Resolution refers to a process of managing records within a database such that records in the database correctly identify the corresponding entity. Entity Resolution can include identifying mismatching records within the database. Issues arising from mismatch include file fragmentation and file overcombination (also referred to as overcombines). With fragmentation, a single consumer is represented by multiple entities. Fragmentation can arise due to variations in how data is entered and managed as part of the record. For instance, if an entity is determined based on an address identity, a user error in entering the data, or simply entering the data in a different format (e.g., the address is entered as “street” in one record and as “st.” in a second record), can result in the consumer having multiple corresponding entities in the database. With overcombines, a related issue arises where multiple consumers are represented as a single entity. In such cases, consumers with identities yielding similar entries, such as consumers having the same name or residing at the same address, can be represented as a single entity.

[0003] Approaches to database management and Entity Resolution rely on hard-coded rules, such as data normalization (e.g., changing every input of “Dr.” to “Drive” in address fields), to standardize records and prevent mismatch. But such hard-coded rules can only prevent issues that are directly accounted for. Moreover, such traditional Entity Resolution tools may not account for larger variations within a preexisting database.

Summary

[0004] Various embodiments of the present disclosure provide systems and methods for Entity Resolution through graph-merge-split techniques. In one example, a method includes one or more processing devices performing operations. The operations include identifying a list of candidate records from a set of data records stored in a data repository for merging and generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes including entities and entity edges including entity matching scores. The operations include identifying one or more connected components within the entity level graph and performing a graph-merge process where the graph-merge process is performed for each connected component of the one or more connected components. The graph-merge process includes determining the entity matching score for each entity edge, removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph, and applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes. For each of the maximally paired entity nodes identified by the maximal graph matching algorithm, the graph-merge process then determines an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity and, in response to determining the entity clique score exceeds an entity level delta-clique constraint, the graph-merge process applies a graph-splitter process to the maximally paired entity nodes prior to merging the maximally paired entity nodes to generate a merge-split graph. The graph-merge process is repeated until no merge occurs. Once all merges have been performed, an output graph including the merge-split graph is generated.

[0006] In another example, a non-transitory computer-readable storage medium has program code executable by a processing device to perform operations. The operations include identifying a list of candidate records from a set of data records stored in a data repository for merging and generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes including entities and entity edges including entity matching scores. The operations include identifying one or more connected components within the entity level graph and performing a graph-merge process where the graph-merge process is performed for each connected component of the one or more connected components. The graph-merge process includes determining the entity matching score for each entity edge, removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph, and applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes. For each of the maximally paired entity nodes identified by the maximal graph matching algorithm, the graph-merge process then determines an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity and, in response to determining the entity clique score exceeds an entity level delta-clique constraint, applies a graph-splitter process to the maximally paired entity nodes prior to merging the maximally paired entity nodes to generate a merge-split graph. The graph-merge process is repeated until no merge occurs. Once all merges have been performed, an output graph including the merge-split graph is generated.

[0007] In yet another example, a computing system includes a processing device and a data repository for storing data records. Each data record represents an entity based on a set of one or more identifiers. The computing system further includes a non-transitory computer-readable storage medium having program code executable by the processing device to perform operations. The operations include identifying a list of candidate records from a set of data records stored in a data repository for merging and generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes including entities and entity edges including entity matching scores. The operations include identifying one or more connected components within the entity level graph and performing a graph-merge process where the graph-merge process is performed for each connected component of the one or more connected components. The graph-merge process includes determining the entity matching score for each entity edge, removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph, and applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes. For each of the maximally paired entity nodes identified by the maximal graph matching algorithm, the graph-merge process then determines an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity and, in response to determining the entity clique score exceeds an entity level delta-clique constraint, applies a graph-splitter process to the maximally paired entity nodes prior to merging the maximally paired entity nodes to generate a merge-split graph. The graph-merge process is repeated until no merge occurs. Once all merges have been performed, an output graph including the merge-split graph is generated.

[0008] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, any or all drawings, and each claim.

[0009] The foregoing, together with other features and examples, will become more apparent upon referring to the following specification, claims, and accompanying drawings.

Brief Description of the Drawings

[0010] FIG. 1 is a block diagram depicting an example of a computing environment in which a merge-split computing system can identify data fragmentation and overcombination in a database according to certain aspects of the present disclosure.

[0011] FIG. 2 is a diagram depicting an example of a data flow for identifying candidate records, according to certain aspects of the present disclosure.

[0012] FIG. 3 is a flowchart depicting an example of a process for training a machine learning matching model to determine a matching decision for a reference record and a query record, according to certain aspects of the present disclosure.

[0013] FIG. 4 is a diagram illustrating the data flow in the training of the machine learning model, according to certain aspects of the present disclosure.

[0014] FIG. 5 is a flowchart depicting an example of a graph-merge-split process for Entity Resolution based on identified candidate records within a database, according to certain aspects of the present disclosure.

[0015] FIG. 6 is a flowchart depicting an example of a graph-splitter process for Entity Resolution based on identifiers within a merged entity, according to certain aspects of the present disclosure.

[0016] FIG. 7 is a diagram showing an example of a graph including connected components, nodes, and edges, according to certain aspects of the present disclosure.

[0017] FIG. 8 is a diagram depicting an example implementation of the maximal graph matching algorithm, according to certain aspects of the present disclosure.

[0018] FIG. 9 is a diagram depicting examples of cliques with varying clique scores representing the degree of connectivity between nodes within a graph, according to certain aspects of the present disclosure.

[0019] FIG. 10 is a block diagram depicting an example of a computing system suitable for implementing aspects of the techniques and technologies presented herein.

Detailed Description

[0020] Certain aspects and features of the present disclosure involve Entity Resolution through graph-merging and graph-splitting techniques. Entity Resolution refers to the process of managing records within a database such that each record accurately relates back to the correct corresponding entity, and such that duplicative records (e.g., records with similar identifiers referring to the same entity) are removed. One approach to perform Entity Resolution is through graph analysis, where associations between records, and identifiers within records, can be analyzed based on strength of similarity. Specifically, the described processes include graph-merging, where entities determined to be the same are merged, where the similarity determination is based on the strength of the entities’ associated records. A graph-splitting technique is also described where identifiers of a single entity are determined to more accurately relate to another entity.

[0021] A merge-split computing system can identify candidate records for merging and splitting from a set of data records stored in a data repository. Identifying candidate records can include, by the merge-split computing system, performing a search in a database for records that match a query record based on one or more identifiers. To perform the matching, the merge-split computing system can generate an identifier score for each identifier based on the values of the identifiers in the query record and a reference record to be compared. In some examples, a machine learning model trained to predict a matching decision from identifier scores and other identifier attributes generated for a pair of records can be used to identify candidate records. The merge-split computing system can then perform a merge process to defragment the candidate records and a split process to prevent overcombination of the candidate records. The merge-split computing system can be implemented in an offline batch process that can be run to produce a report of all merges and splits performed within the data repository.

[0022] The following non-limiting example is provided to introduce certain embodiments. In this example, a merge-split computing system can perform Entity Resolution by performing a merge process to overcome issues related to fragmentation, where multiple entities refer to a single consumer, and, in tandem, performing a split process to overcome issues related to overcombination, where multiple consumers are represented as a single entity (i.e., where identifiers of a given entity must be split into separate entities such that each entity properly corresponds to a single consumer).

[0023] The merge-split computing system may first identify a list of candidate records for merging and splitting, where the candidate records are identified as representing one or more entities based on a set of identifiers. To identify the list of candidate records, the merge-split computing system can search a database, or a subset of the database, for records that match a query record based on one or more identifiers. Searching may be streamlined through generation of optimized search keys used to traverse the database or subset of the database. The merge-split computing system can generate similarity scores based on similarities between identifiers compared between a query record and a reference record. Different similarity score techniques may be applied based on the identifiers being compared. For instance, numeric and string identifiers can be evaluated relative to a probability distribution of errors to confirm whether differences between numeric or string identifiers were accidental or intentional. Corresponding scores can be generated based on the evaluations relative to the probability distribution. In the example of string identifier scoring, phonetic algorithms for matching components of names based on similar pronunciation, as well as distance measures such as Levenshtein distance or Jaccard distance, can be used to generate the name identifier score. The sum of the identifier scores (e.g., a name identifier score, an address similarity score, a date of birth identifier score, and a social security number identifier score) may then provide the similarity score defining the similarity between candidate records.
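
As an illustration only, the summed identifier scoring described above might be sketched in Python as follows. The field names (`name`, `address`, `dob`, `ssn`), the scaling of edit distance to a value between 0 and 1, and the token-based Jaccard measure are assumptions made for this sketch; the disclosure does not prescribe these formulas, and phonetic matching and the error probability distribution are omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed scaling: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_score(query: dict, reference: dict) -> float:
    """Sum of per-identifier scores; the field names are hypothetical."""
    return (edit_similarity(query["name"], reference["name"])
            + jaccard_similarity(query["address"], reference["address"])
            + edit_similarity(query["dob"], reference["dob"])
            + edit_similarity(query["ssn"], reference["ssn"]))
```

In this sketch, two records that differ only in "12 main street" versus "12 main st" lose half of the address score, which shows how a formatting difference lowers, but does not eliminate, the overall similarity.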

[0024] In some examples, the merge-split computing system can employ a machine learning model to generate similarity scores and matching decisions between query records and reference records when generating the list of candidate records. The input to the machine learning model can include the identifier scores. In addition, the merge-split computing system can also generate other attributes (also referred to as “matching attributes” or “identifier attributes”) for each of the identifiers as input to the machine learning model. These attributes can include, for example, a numerical identifier attribute measuring the total number of positions matched between the numerical identifier in the query record and the numerical identifier in the reference record, an address attribute generated based on a geographical distance between the query address and the reference address, an address frequency attribute indicating the number of records in the data records having a same address as the reference address, a name frequency attribute indicating the frequency of the name in the reference record, and so on.
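
A few of the matching attributes listed above can be sketched in Python as follows. This is a minimal illustration under assumed record fields (`name`, `address`); the geographical distance attribute is omitted because it would require coordinate data that the passage does not describe.

```python
from collections import Counter


def positions_matched(query_id: str, reference_id: str) -> int:
    """Count character positions at which two numerical identifiers agree."""
    return sum(1 for q, r in zip(query_id, reference_id) if q == r)


def frequency_attributes(records: list, reference: dict) -> dict:
    """Address and name frequency of the reference record within the data records.

    The dictionary keys and record field names are hypothetical.
    """
    address_counts = Counter(rec["address"] for rec in records)
    name_counts = Counter(rec["name"] for rec in records)
    return {
        "address_frequency": address_counts[reference["address"]],
        "name_frequency": name_counts[reference["name"]],
    }
```

Attributes such as these could be concatenated with the identifier scores to form the input vector for a matching model.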

[0025] Once the list of candidate records is identified, the merge-split computing system can generate an entity level graph. The entity level graph includes nodes, representing entities within the list of candidate records, and edges, representing the similarity scores between the entities. The similarity score may be determined as described above (e.g., via a machine learning model trained to compare records and generate similarity scores defining the strength of association between a pair of records based on their identifiers).

[0026] The merge-split computing system can then identify connected components in the entity level graph. Connected components represent entity nodes connected via edges having a sufficiently high similarity score. Fragmentation of data records can then be found by identifying connected components with a high degree of connectivity in the entity level graph.
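
The graph construction and connected component identification described above can be sketched as follows. The sketch assumes that pairwise similarity scores are already available as a dictionary keyed by entity pairs, and that `min_score` (a hypothetical parameter) decides which pairs receive an edge; a breadth-first search then collects each component.

```python
from collections import defaultdict, deque


def build_entity_graph(scored_pairs: dict, min_score: float) -> dict:
    """Adjacency map holding only edges whose score is at least min_score."""
    adjacency = defaultdict(dict)
    for (a, b), score in scored_pairs.items():
        if score >= min_score:
            adjacency[a][b] = score
            adjacency[b][a] = score
    return adjacency


def connected_components(nodes: list, adjacency: dict) -> list:
    """Return the connected components as a list of node sets (BFS)."""
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency.get(node, {}):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components
```

Each returned component can then be processed independently by the merge process, which keeps the pairwise work bounded by the component size rather than the size of the whole candidate list.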

[0027] For each identified connected component within the entity level graph, the score of each edge (e.g., the similarity score between entities based on their identifiers) can be generated. The same match model and principles may be used to generate the scores for each edge in a given connected component. In a first instance, the same scores used to generate the entity level graph may be used to score each edge of the first analyzed connected component. However, as the merge-split computing system may merge entities for each identified connected component, the structure of the entity level graph may be adjusted in response to each merge performed. Thus, on subsequent iterations, the merge-split computing system will iteratively score each edge in the connected component (for instance, using a trained match model) to account for modifications to the entity level graph in response to previous entity merges.
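
The iterative re-scoring can be illustrated with a simplified loop that repeats until no merge occurs. In this sketch, entities are sets of record identifiers, the entity-level score is assumed to be the best record-to-record score between two entities (a stand-in for the trained match model), and pairing is done greedily by score. The clique check and the splitter process are deliberately left out to keep the loop readable.

```python
from itertools import combinations


def entity_score(a: frozenset, b: frozenset, record_scores: dict) -> float:
    """Assumed entity score: highest record-level score across the two entities."""
    return max(record_scores.get(frozenset((x, y)), 0.0) for x in a for y in b)


def merge_until_stable(entities: list, record_scores: dict, merge_threshold: float) -> list:
    """Repeat score, prune, pair, merge until an iteration performs no merge."""
    entities = [frozenset(e) for e in entities]
    while True:
        # Re-score every edge against the current, possibly merged, entities.
        edges = {}
        for a, b in combinations(entities, 2):
            score = entity_score(a, b, record_scores)
            if score >= merge_threshold:  # edges below the threshold are dropped
                edges[(a, b)] = score
        matched, merged_any = set(), False
        # Pair entities greedily so that each entity merges at most once per pass.
        for (a, b), _ in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
            if a in matched or b in matched:
                continue
            matched.update((a, b))
            entities.remove(a)
            entities.remove(b)
            entities.append(a | b)
            merged_any = True
        if not merged_any:
            return entities
```

With record scores of 0.95 for r1 and r2, 0.90 for r2 and r3, and 0.30 for r3 and r4 at a threshold of 0.8, the first pass merges r1 with r2, the second pass re-scores the merged entity against r3 and merges them, and the third pass finds nothing left to merge.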

[0028] The merge-split computing system can then remove edges that fall below an entity level merge score threshold. Removing the edge between two entity nodes prevents those entity nodes from being merged with each other in subsequent merge steps. To avoid overcombining entities (e.g., inadvertently merging two dissimilar entity nodes representing two different consumers), the entity level merge score threshold may be set to a relatively high value, particularly when compared to the identity level merge score threshold discussed with respect to the splitter process, which may be later called by the merge-split computing system.

[0029] After the edges falling below the entity level merge score threshold are removed from the connected component, the merge (de-fragmentation) process can then determine a maximal graph match with respect to the entity level graph. The maximal graph matching algorithm can adjust the entity level graph such that no entity node has multiple edges (i.e., connections to other entity nodes). In this way, the graph is reduced so that each entity node is paired with at most one other entity node or is not connected to any other entity node.
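
The pruning and pairing steps can be sketched as follows. The passage does not specify which maximal matching algorithm is used, so this sketch assumes a greedy strategy that visits edges from highest to lowest score; the result is a maximal matching in the graph-theoretic sense (no further edge can be added without reusing a node), but it is only one of several possible choices.

```python
def prune_edges(edges: dict, merge_threshold: float) -> dict:
    """Drop every edge whose score falls below the merge score threshold."""
    return {pair: score for pair, score in edges.items() if score >= merge_threshold}


def greedy_maximal_matching(edges: dict) -> list:
    """Pair nodes so that each node appears in at most one pair.

    Assumption: edges are considered in descending score order.
    """
    matched, pairs = set(), []
    for (u, v), _ in sorted(edges.items(), key=lambda item: item[1], reverse=True):
        if u not in matched and v not in matched:
            pairs.append((u, v))
            matched.update((u, v))
    return pairs
```

For a path A, B, C, D with scores 0.95, 0.90, and 0.85, the greedy pass pairs A with B, skips the edge from B to C because B is already paired, and then pairs C with D.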

[0030] For each pair of entity nodes identified by the maximal graph matching algorithm, the merge-split computing system can determine whether the entity node pair satisfies an entity level delta-clique constraint prior to merging. The entity level delta-clique constraint evaluates a threshold connectivity between entity nodes within the entity level graph. Without this check, merging an entity node pair could combine entities with low connectivity, which is indicative of mismatching entities. Therefore, the entity level delta-clique constraint may be used to prevent such low connectivity entities from being merged during the merge process. Because the entity merge process has a relatively high entity level merge score threshold, a lower entity level delta-clique constraint can be used, as in effect the entity level merge score threshold has already filtered out low similarity entities from the graph.
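
One possible reading of the delta-clique check is sketched below. The passage does not give a formula for the clique score, so this sketch assumes it is the edge density of the would-be merged group: the fraction of member pairs that are directly linked, where 1.0 corresponds to a full clique. The names `clique_score`, `should_merge`, and `delta` are hypothetical.

```python
from itertools import combinations


def clique_score(members: set, linked_pairs: set) -> float:
    """Assumed clique score: share of member pairs that are directly linked."""
    ordered = sorted(members)
    if len(ordered) < 2:
        return 1.0
    pairs = list(combinations(ordered, 2))
    linked = sum(1 for pair in pairs if frozenset(pair) in linked_pairs)
    return linked / len(pairs)


def should_merge(group_a: set, group_b: set, linked_pairs: set, delta: float) -> bool:
    """Merge only when the merged group's clique score exceeds delta."""
    return clique_score(set(group_a) | set(group_b), linked_pairs) > delta
```

Under this assumption, adding a record that is linked to only one member of a fully connected three-record group lowers the score from 1.0 to 4/6, so the merge is allowed for a delta of 0.6 but rejected for a delta of 0.7.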

[0031] For each merged entity, the merge-split computing system may then perform a graph-splitter process to resolve data record overcombination. Analyses similar to those described with respect to the merge process may be applied within the graph-splitter process. Additionally, the graph-splitter process can involve determining a centrality score for merging identities, which may further determine whether to merge such identities in view of a centrality threshold. The centrality threshold may provide a further means for preventing the overcombination of data files within the data repository.
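
The centrality gate in the splitter process might look like the following sketch. The passage does not say which centrality measure is used, so degree centrality is assumed here, and the pair is scored by its less central node; both choices, along with the function names, are illustrative assumptions.

```python
def degree_centrality(node: str, nodes: list, edges: list) -> float:
    """Degree of the node divided by the maximum possible degree."""
    if len(nodes) < 2:
        return 0.0
    degree = sum(1 for u, v in edges if node in (u, v))
    return degree / (len(nodes) - 1)


def allow_identity_merge(pair: tuple, nodes: list, edges: list,
                         centrality_threshold: float) -> bool:
    """Prevent the merge when the pair's centrality score is below the threshold.

    Assumption: the pair's score is the lower of its two node centralities.
    """
    score = min(degree_centrality(node, nodes, edges) for node in pair)
    return score >= centrality_threshold
```

In an identity level graph where one identity is linked to every other identity and another is linked to only one, the weakly linked identity pulls the pair's score down, so a pair containing it is kept apart when the threshold is higher than its centrality.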

[0032] Certain aspects described herein overcome the limitations of previous techniques and provide improvements to database technology by detecting record fragmentation and record overcombination within larger record databases. Detection of fragmentation and overcombination through the described merge-split process thus enables matching records to be retrieved more accurately than with traditional searching techniques, thereby increasing the accuracy of the search results. In addition, the fragmentation and overcombine detection can reduce the size of the database and thus reduce the storage space used to store the database. Reducing the size of the database also reduces the computational complexity of searching the database for a given record, thereby reducing the consumption of computing resources, such as CPU time and memory space. Furthermore, the merge-split techniques presented herein also enable the accurate detection of fragmented and overcombined files in the database and thus increase the efficiency of the fragmentation and overcombination detection.

[0033] These illustrative examples are given to introduce the reader to the general subject matter discussed here and are not intended to limit the scope of the disclosed concepts. The following sections describe various additional features and examples with reference to the drawings in which like numerals indicate like elements, and directional descriptions are used to describe the illustrative examples but, like the illustrative examples, should not be used to limit the present disclosure.

Operating Environment Example for Record Matching and Fragmented File Detection

[0034] FIG. 1 is a block diagram depicting an example of a computing environment in which a merge-split computing system can identify data fragmentation and overcombination in a database according to certain aspects of the present disclosure. FIG. 1 depicts examples of hardware components of a graph-merge-split computing system 100, according to some aspects. The graph-merge-split computing system 100 is a specialized computing system that may be used for processing large amounts of data using a large number of computer processing cycles. The number of devices depicted in FIG. 1 is provided for illustrative purposes. Different numbers of devices may be used. For example, while certain devices or systems are shown as single devices in FIG. 1, multiple devices may instead be used to implement these devices or systems.

[0035] As shown in FIG. 1, the graph-merge-split computing system 100 can include a graph-merge-split server 106, a model training server 108, a private data network 132, data repository 120 storing data records 122, firewall 130, a client external-facing subsystem 128, and a public data network 104 communicatively coupled to one or more client computing systems 102.

[0036] The data repository 120 can include internal databases or other data sources that are stored at or otherwise accessible via the private data network 132. The data repository 120 can include data records 122, and each data record 122 includes one or more identifiers 124. An identifier 124 can include any information that can be used alone or in combination with other identifiers to uniquely identify a data record 122. For example, if the data records 122 represent data associated with an individual or entity, the identifiers 124 in each data record 122 can include information that can be used on its own to identify an individual or entity. Non-limiting examples of such information include one or more of a legal name, a company name, a social security number, a credit card number, a date of birth, an e-mail address, etc. In other aspects, the identifiers 124 can include information that can be used in combination with other information to identify an individual or entity. Non-limiting examples of such consumer identification data include a street address or other geographical location, etc.

[0037] In some examples, the identifiers 124 can be classified into four categories: numerical identifiers such as the social security number, credit card number, name identifiers such as the legal name of the individuals or company name, address identifiers such as the street address of the individual or entity, and date identifiers such as the date of birth of an individual. Depending on the nature of data stored in the data records 122, not all four categories of identifiers are available for the data record 122. For example, if the data records 122 represent data associated with products or other types of physical items, the numerical identifier in each data record 122 can include a serial number of a product, a MAC address of a network component; the name identifier can include the name of the product or item; the address identifier can include the address or location where the product or item is manufactured or produced; the date-based identifier can include the manufacturing date of the product or item. If the data records 122 represent data associated with digital items such as a webpage or a digital file, the numerical identifier in each data record 122 can include an IP address of the webpage; the name identifier can include the domain name of the webpage or the name of the digital file; the date-based identifier can include the date when the webpage or digital file is created, accessed, or modified. The data record 122 can include other information about the associated entity or item, such as the employment data of the individual, description, and specification of the product, and so on.

[0038] The graph-merge-split server 106 can include a graph-merge-split service 110 capable of detecting both fragmented records and overcombined records in the data records 122. The graph-merge-split service 110 can examine pairs of candidate data records generated by a candidate generation model 113 to determine a matching decision and associated matching score between the pair of records. The matching score can be the compound scores determined based on the attribute scores generated for the pair or the confidence score output by a matching model 114 when determining the classification for the pair of records. Based on the matching decisions and matching scores between pairs of data records, the graph-merge-split service 110 can build graphs with nodes representing entities and edges representing the matching decisions. Fragmented data records can be found by identifying connected components with a high degree of connectivity in an entity level graph. Overcombined records can be found by generating an identity level graph between merged entity records.
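As a minimal sketch of how connected components might be found in such an entity level graph, the following groups entities whose pairwise matching scores meet a cutoff. The function name `connected_components`, the `(entity_a, entity_b, score)` tuple layout, the union-find approach, and the 0.8 threshold are illustrative assumptions rather than details from the disclosure, and the sketch does not attempt to measure the degree of connectivity within a component.

```python
from collections import defaultdict


def connected_components(pairs, threshold):
    """Group entity ids into connected components (hypothetical helper).

    pairs: iterable of (entity_a, entity_b, matching_score) tuples.
    threshold: assumed cutoff; edges scoring below it are ignored.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # both nodes are registered here
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical scored candidate pairs.
pairs = [
    ("e1", "e2", 0.95),
    ("e2", "e3", 0.91),
    ("e3", "e4", 0.40),  # below the cutoff, so e4 stays on its own
    ("e5", "e6", 0.88),
]
components = connected_components(pairs, threshold=0.8)
print(components)
```

Components with more than one entity are the candidates for fragmentation review under these assumptions; singletons need no merge.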

[0039] To train the matching model 114, the graph-merge-split computing system 100 can include the model training server 108 for operating a model training service 116 for training the matching model 114 for use by a record matching service and the graph-merge-split service 110. The model training service 116 can train the matching model 114 using an initial set of training samples 126 and further determine predicted classifications for the sets of training samples 126 using the initially trained matching model. Based on the predicted classifications, the graph-merge-split computing system 100 identifies misclassified training samples based on the predicted classifications of the set of training samples being different from the respective matching labels in the training samples 126.

[0040] To correct and update the matching labels of the misclassified training samples, the model training server 108 can refine classifications for each of the misclassified training samples using multiple auxiliary models 118. The auxiliary model(s) 118 can be trained to determine whether and how to correct the labels of the misclassified training samples. The training samples 126 with the updated or corrected matching labels can be used to re-train the matching model 114. This training process can be repeated until there are no misclassified training samples in the training samples 126. In this way, ground truth matching labels for the training samples 126 can be obtained in conjunction with training the matching model 114. The graph-merge-split computing system 100 can communicate with various other computing systems such as client computing systems 102. For example, the graph-merge-split computing system 100 may include one or more provider external-facing devices that communicate with data provider systems for receiving the data regarding entities or other items to be stored in data records in the data repository 120. The graph-merge-split server 106 may also communicate with the client computing system 102 by way of a client external-facing subsystem 128.

[0041] The client computing systems 102 may interact, via one or more public data networks 104, with various external-facing subsystems of the graph-merge-split computing system 100. For instance, an individual can use a client computing system 102 to attempt to search in the data records 122 for a match to a query record. The client computing system 102 may generate the query record and send it to the graph-merge-split server 106. Alternatively, the client computing system 102 can send data to be used for the search in any format and the graph-merge-split server 106 can generate the query record based on the received information. To request the search, the client computing system 102 can communicate with the client external-facing subsystem 128. The client external-facing subsystem 128 can selectively prevent the client computing system 102 from accessing or searching in the data repository 120. For example, the client external-facing subsystem 128 can determine whether the client computing system 102 can access or search in the databases based on an identifier of the client computing system and a record stored in a secure location in the client external-facing subsystem 128, such as a memory in a basic input-output system (BIOS) of the client external-facing subsystem 128. The record can indicate the access permission of a client computing device and can be determined based on various factors such as whether the client computing system is an authorized system to access a certain database, whether the timing of the access is within an authorized window, and so on.

[0042] To determine if a client computing system 102 can access a certain database, the client external-facing subsystem 128 can retrieve the record associated with the client computing system 102 from the secure location and encrypt the record and other associated data using a cryptographic key. Similarly, the client external-facing subsystem 128 can encrypt the record submitted by the client external-facing subsystem 128 using the same cryptographic key to determine a match. A match indicates that the client computing system 102 can access the database. The client external-facing subsystem 128 can prevent the client computing system 102 from accessing the databases if there is no match.

[0043] The client external-facing subsystem 128 can be communicatively coupled, via a firewall 130, to one or more computing devices forming the private data network 132. The firewall 130, which can include one or more devices, can create a secured part of the graph-merge-split computing system 100 that includes various devices in communication via the private data network 132. In some aspects, by using the private data network 132, the graph-merge-split computing system 100 can house the data repository 120 in an isolated network (i.e., the private data network 132) that has no direct accessibility via the Internet or another public data network 104.

[0044] Each client computing system 102 may include one or more third-party devices, such as individual servers or groups of servers operating in a distributed manner. Client computing system 102 can include any computing device or group of computing devices operated by a seller, lender, or other provider of products or services. Client computing system 102 can include one or more server devices. The one or more server devices can include or can otherwise access one or more non-transitory computer-readable media. The client computing system 102 can also execute an online service. The online service can include executable instructions stored in one or more non-transitory computer-readable media.

[0045] Each communication within or with the graph-merge-split computing system 100 may occur over one or more data networks, such as the public data network 104, the private data network 132, or some combination thereof. A data network may include one or more of a variety of different types of networks, including a wireless network, a wired network, or a combination of a wired and wireless network. Examples of suitable networks include the Internet, a personal area network, a local area network (“LAN”), a wide area network (“WAN”), or a wireless local area network (“WLAN”). A wireless network may include a wireless interface or a combination of wireless interfaces. A wired network may include a wired interface. The wired or wireless networks may be implemented using routers, access points, bridges, gateways, or the like, to connect devices in the data network.

[0046] A data network may include network computers, sensors, databases, or other devices that may transmit or otherwise provide data to the graph-merge-split computing system 100. For example, a data network may include local area network devices, such as routers, hubs, switches, or other computer networking devices. The data networks depicted in FIG. 1 can be incorporated entirely within (or can include) an intranet, an extranet, or a combination thereof. In one example, communications between two or more systems or devices can be achieved by a secure communications protocol, such as secure Hypertext Transfer Protocol (“HTTPS”) communications that use secure sockets layer (“SSL”) or transport layer security (“TLS”). In addition, data or transactional details communicated among the various computing devices may be encrypted. For example, data may be encrypted in transit and at rest.

[0047] The graph-merge-split computing system 100 can include one or more graph-merge-split servers 106 and one or more model training servers 108. The graph-merge-split server 106 or the model training servers 108 may be a specialized computer or other machine that processes the data received at the graph-merge-split computing system 100. The graph-merge-split server 106 or the model training servers 108 may include one or more other systems. For example, the graph-merge-split server 106 or the model training servers 108 may include a database system for accessing the network-attached storage unit, a communications grid, or both. A communications grid may be a grid-based computing system for processing large amounts of data.

[0048] The graph-merge-split server 106 or the model training servers 108 can include one or more processing devices that execute program code, such as graph-merge-split service 110, the matching model 114, or the model training service 116. The program code can be stored on a non-transitory computer-readable medium. While FIG. 1 shows that the graph-merge-split server 106 and the model training server 108 are two separate servers, the function of these two servers can be implemented in a single server or a group of servers.

[0049] The graph-merge-split computing system 100 may also include one or more network-attached storage units on which various repositories, databases, or other data structures are stored. An example of these data structures is the data repository 120. Network-attached storage units may store a variety of different types of data organized in a variety of different ways and from a variety of different sources. For example, the network-attached storage unit may include storage other than the primary storage located within the graph-merge-split server 106 or the model training server 108 that is directly accessible by processors located therein. In some aspects, the network-attached storage unit may include secondary, tertiary, or auxiliary storage, such as large hard drives, servers, virtual memory, among other types. Storage devices may include portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing and containing data. A machine-readable storage medium or computer-readable storage medium may include a non-transitory medium in which data can be stored and that does not include carrier waves or transitory electronic signals. Examples of a non-transitory medium may include, for example, a magnetic disk or tape, optical storage media such as compact disk or digital versatile disk, flash memory, memory, or memory devices.

[0050] In some aspects, the graph-merge-split computing system 100 can implement one or more procedures to secure communications between the graph-merge-split computing system 100 and other client systems. Non-limiting examples of features provided to protect data and transmissions between the graph-merge-split computing system 100 and other client systems include secure web pages, encryption, firewall protection, network behavior analysis, intrusion detection, etc. In some aspects, transmissions with client systems can be encrypted using public-key cryptography algorithms using a minimum key size of 128 bits. In additional or alternative aspects, website pages or other data can be delivered through HTTPS, secure file transfer protocol (“SFTP”), or other secure server communications protocols. In additional or alternative aspects, electronic communications can be transmitted using Secure Sockets Layer (“SSL”) technology or other suitable secure protocols. Extended Validation SSL certificates can be utilized to clearly identify a website’s organization identity. In another non-limiting example, physical, electronic, and procedural measures can be utilized to safeguard data from unauthorized access and disclosure.

Example Candidate Record Identification and Generation

[0051] The described graph-merge-split service 110, used to perform Entity Resolution in addressing fragmentation and overcombines for data records within a database, generates graphs for analysis by evaluating candidate records. FIG. 2 is a diagram depicting an example of a data flow 200 for identifying candidate records, according to certain aspects of the present disclosure.

[0052] FIG. 2 shows a data repository 120 and a subset of the data repository, referred to as a delta 202. Delta 202 can be used to facilitate quicker searching and candidate record identification by preventing excessive searching of the larger data repository 120. The graph-merge-split computing system 100, via the matching model 114, can access inquiry datasets that include relationships between prior inquiries and returned records via the data repository 120 and delta 202. The prior inquiries include search queries previously submitted to the database system by one or more client devices. The returned records include those records returned from the database system in response to the prior inquiries. Each prior inquiry can be correlated in the inquiry dataset to a corresponding set of returned records from the database system. Each set of returned records may include one or more returned records. For example, if the search query involves a particular entity (e.g., a consumer), the returned set of records may include one or more returned records involving that particular entity. For instance, the returned set of records can include a first record with a current address of the entity and a second record with a prior address of the entity.

[0053] The candidate generation model 113 can then generate optimized search keys per a search key optimizer 204 based on the relationships between prior inquiries and returned records. In an example, Boolean indexes can be generated based on the inquiry dataset and then deduplicated by identifying multiple Boolean indexes that correspond to the same returned record. Aspects of the multiple Boolean indexes can then be combined together to form a single Boolean index for that returned record. In such a way, there may be only one Boolean index corresponding to each returned record in the inquiry dataset. Frequent indexes may be identified within the set of deduplicated Boolean indexes, where the frequent indexes satisfy at least one criterion and where the frequent indexes occur at least a threshold number of times in the set of deduplicated Boolean indexes. One example of the criterion may be that the frequent indexes have estimated candidate sizes that are less than a maximum size. The maximum size can be customizable and selected to avoid frequent indexes that return an excessively large number of results. The frequent indexes with the highest frequency (e.g., occurring the greatest number of times) in the set of Boolean indexes may then be selected as the optimized search keys. The generated optimized search keys may provide for efficient traversal of the data repository or a subset of the data repository, including in instances where the data records are sparse (e.g., without specific identifiers such as a social security number) or contain variations in each of the identifiers. The optimized search keys can be tailored to the frequency of the identifiers such that uncommon identifiers may be found more easily.
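The selection of frequent indexes described above might be sketched as follows. This is an illustration under stated assumptions: each deduplicated Boolean index is represented as a `frozenset` of identifier field names, the estimated candidate sizes are supplied as a lookup table, and the names `select_search_keys`, `max_size`, `min_count`, and `top_k` are hypothetical. The disclosure does not specify these data structures.

```python
from collections import Counter


def select_search_keys(indexes, candidate_size, max_size, min_count, top_k):
    """Pick the most frequent indexes that satisfy the size criterion.

    indexes: one deduplicated Boolean index per returned record, each an
        assumed frozenset of identifier field names.
    candidate_size: mapping from index to its estimated candidate size.
    max_size: indexes whose estimated size is not below this are dropped.
    min_count: minimum number of occurrences to count as frequent.
    top_k: number of optimized search keys to return.
    """
    counts = Counter(indexes)
    frequent = [
        (index, n)
        for index, n in counts.items()
        if n >= min_count and candidate_size.get(index, float("inf")) < max_size
    ]
    # Highest frequency first; ties broken by field names for determinism.
    frequent.sort(key=lambda item: (-item[1], sorted(item[0])))
    return [index for index, _ in frequent[:top_k]]


# Hypothetical inquiry-derived indexes and size estimates.
name_dob = frozenset({"name", "dob"})
name_addr = frozenset({"name", "address"})
name_only = frozenset({"name"})
ssn_only = frozenset({"ssn"})

indexes = [name_dob] * 3 + [name_addr] * 2 + [name_only] * 4 + [ssn_only]
sizes = {name_dob: 40, name_addr: 120, name_only: 50_000, ssn_only: 1}

keys = select_search_keys(indexes, sizes, max_size=1000, min_count=2, top_k=2)
print(keys)
```

In this example the name-only index is the most frequent but is rejected because its estimated candidate size exceeds the maximum, and the SSN-only index is rejected because it occurs only once.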

[0054] After generating the optimized search keys via the search key optimizer 204, the candidate generation model 113 identifies a preliminary candidate record set 206. The preliminary candidate records in some instances can include the final candidate set analyzed per subsequent procedures via the graph-merge-split service 110. However, even when optimized, the search keys used to generate the preliminary candidate record set 206 can still yield significant numbers of candidate records when queried against the data repository 120 and/or delta 202. Further processing of the preliminary candidate records to reduce the candidate record set may be applied.

[0055] The candidate generation model 113 may apply a filter model 208 to filter the preliminary candidate record set 206 to produce the filtered preliminary candidate record set 210, further reducing the number of records required for analysis by the graph-merge-split service 110. The filter model 208 can leverage information gathered from the optimized search keys and determine which of the optimized keys match and which do not match for a pair of records. Each of the identifiers between a set of records may match or not match corresponding identifiers. For instance, a first optimized search key may capture name and date of birth fields, while a second key may capture name and address fields, and the like. Each of the keys may capture one or more identifiers. Filtering may include determining whether a threshold number of keys match between a query record and a reference record and removing the query-record reference-record pair from the candidate set if the threshold number of keys do not match.
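The key-count filter described above can be illustrated with a short sketch. It assumes that records are plain dictionaries, that a search key is a tuple of field names, and that a key "matches" when every one of its fields is present and exactly equal in both records. The exact-equality test and the names `matching_key_count`, `filter_candidates`, and `min_matching_keys` are assumptions made for illustration; the disclosure does not define how a key match is evaluated.

```python
def matching_key_count(query, reference, search_keys):
    """Count the search keys whose fields all agree between two records."""
    count = 0
    for key in search_keys:
        if all(
            query.get(field) is not None and query.get(field) == reference.get(field)
            for field in key
        ):
            count += 1
    return count


def filter_candidates(pairs, search_keys, min_matching_keys):
    """Keep only pairs with at least `min_matching_keys` matching keys."""
    return [
        (query, reference)
        for query, reference in pairs
        if matching_key_count(query, reference, search_keys) >= min_matching_keys
    ]


# Two hypothetical optimized search keys, mirroring the example in the text.
search_keys = [("name", "dob"), ("name", "address")]

query = {"name": "ann lee", "dob": "1990-01-02", "address": "1 main st"}
ref_a = {"name": "ann lee", "dob": "1990-01-02", "address": "9 oak ave"}
ref_b = {"name": "ann lee", "dob": "1985-05-05", "address": "7 elm rd"}

kept = filter_candidates([(query, ref_a), (query, ref_b)], search_keys, min_matching_keys=1)
print(len(kept))
```

Here the first reference record shares the name and date of birth key with the query and is kept, while the second matches on name alone, satisfies neither key, and is removed.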

[0056] The filtered preliminary candidate record set 210 and/or the preliminary candidate record set 206 produced by the search key optimizer 204 may then be applied to the matching model 114 to generate matching decisions and matching scores between pairs of data records. In some examples, the matching model 114 is a machine learning model employed to determine matching decisions and scores between data records.

[0057] Applying the matching model 114 can include receiving a query record that contains one or more identifiers that can be used for matching. The graph-merge-split server 106 may receive the query record from a client computing system 102 in a request to find matching records in the data repository 120 or generate the query record based on the information contained in the request received from the client computing system 102. In other examples, the query record may be generated when the graph-merge-split computing system 100 receives new data, for example from an external data source, to be stored as a data record 122 in the data repository 120. The graph-merge-split server 106 generates the query record based on the information in the received new data. If the graph-merge-split server 106 finds no match for the query record in the data records, a new data record 122 can be created in the data repository 120 to store the new data; otherwise, the received new data will be used to update the data record 122 that matches the query record. In this way, fragmented records can be avoided.
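The create-or-update behavior for newly received data can be sketched as below. The repository is modeled as an in-memory dictionary, and the matcher is a stand-in that compares a single identifier exactly; both are assumptions for illustration, as are the names `ingest` and `match_on_ssn`. In the described system the matching decision would come from the matching model 114 rather than an exact comparison.

```python
def match_on_ssn(repository, data):
    """Stand-in matcher (assumption): exact match on an 'ssn' field."""
    for record_id, record in repository.items():
        if data.get("ssn") is not None and record.get("ssn") == data.get("ssn"):
            return record_id
    return None


def ingest(repository, new_data, find_match):
    """Update the matching record if one exists; otherwise create a new one.

    repository: dict mapping integer record ids to record dicts.
    find_match: callable returning a matching record id or None.
    """
    record_id = find_match(repository, new_data)
    if record_id is None:
        record_id = max(repository, default=0) + 1
        repository[record_id] = dict(new_data)
    else:
        repository[record_id].update(new_data)
    return record_id


# Hypothetical repository with one existing record (dummy identifier values).
repository = {1: {"ssn": "000-00-0001", "address": "1 main st"}}

first = ingest(repository, {"ssn": "000-00-0001", "address": "9 oak ave"}, match_on_ssn)
second = ingest(repository, {"ssn": "000-00-0002", "address": "7 elm rd"}, match_on_ssn)
print(first, second, len(repository))
```

The first call updates the existing record in place instead of creating a second record for the same entity, which is the fragmentation the paragraph describes avoiding.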

[0058] In some examples, the query record can include a set of query records, such as the delta 202. Thus, the delta 202, as a subset of the data repository 120, can be queried against the data repository 120 to identify candidate records for merging and splitting per the graph-merge-split procedure (discussed with respect to FIG. 5). Querying the delta 202 against the data repository 120 to identify candidate records can thus improve the efficiency of candidate record identification as compared to querying the data repository 120 against itself.

[0059] Applying the matching model 114 can include receiving a query record that contains one or more identifiers that can be used for retrieving, from the data records 122, a reference record that contains the one or more identifiers to be used for matching. Identifier attributes can be generated for each of the identifiers in the query record and reference record that are to be used for matching.

[0060] The matching model 114 may then generate and output a matching decision based on the identifier attributes previously generated. In some examples, the matching model 114 can be a model that is explainable and exportable as a rule set, such as a decision tree model, a random forest, or a repeated incremental pruning to produce error reduction (RIPPER)-based model. The matching model 114 can be trained using training data to accept a set of identifier attributes generated for the pair of query record and reference record as input and output a matching decision. The matching model 114 may also generate and output a matching score indicating the confidence level associated with the matching decision. Depending on the type of the record matching model 114, the matching score may be generated based on the prediction errors of leaf nodes in a decision tree model or prediction errors of trees in a random forest model.

[0061] The matching decisions and matching scores produced by the matching model 114 may then be used to produce the final candidate records 212 which may then be used per the subsequent procedures discussed with respect to FIGS. 5-9 as the candidate record set for graph-merge-split analysis.

[0062] FIG. 3 is a flowchart depicting an example of a process 300 for training a machine learning matching model 114 trained to determine a matching decision for a reference record and a query record, according to certain aspects of the present disclosure. FIG. 3 will be described in conjunction with FIG. 4. FIG. 4 is a diagram illustrating the data flow in the training of the machine learning model, according to certain aspects of the present disclosure. For illustrative purposes, the process 300 is described with reference to implementations described above with respect to one or more examples described herein. Other implementations, however, are possible. In some aspects, the operations in FIG. 3 may be implemented in program code that is executed by one or more computing devices such as the model training server 108 depicted in FIG. 1. In some aspects of the present disclosure, one or more operations shown in FIG. 3 may be omitted or performed in a different order. Similarly, additional operations not shown in FIG. 3 may be performed.

[0063] At block 302, the process 300 involves obtaining the training samples for the record matching model 114. As shown in FIG. 4, each of the training samples 126 can include input identifier attributes 404 generated for a corresponding pair of training data records and a matching label 406 for the pair. The input identifier attributes 404 can include the identifier attributes described above with respect to block 306 of FIG. 3. In some examples, the training samples 126 include the pairs of training data records and the input identifier attributes 404 are generated based on the training data records according to the method described above with respect to block 306 of FIG. 2. The matching label 406 indicates whether the pair of training data records match or not. In some examples, the matching label 406 may be inaccurate and thus cannot serve as the ground truth for the training. As such, the training process 300 may also be used to identify ground truth matching labels for the training samples 126.

[0064] In some examples, the training samples 126 can be selected from the data records 122 and the respective associated labels based on stratified sampling. In the data records 122, some patterns of the identifier values may be rare compared to others. The model training server 108 can first perform random sampling in the data record 122 by the type of matches indicated by the label, such as a match or no match. If the labels have flags other than match or no match, those flags can be mapped to match or no match. A stratified sample by scores is extracted on the randomly selected samples. In some examples, the score attributes, such as identifier scores, along with the compound scores (area scores and volume scores) are used for extracting out the stratified samples. The scores or compound scores can be rounded to the nearest integer before a stratified sample is extracted. Samples are also ensured to have each attribute value represented n times with n being a positive integer.
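A simplified sketch of score-based stratified sampling is shown below. It collapses the two described steps into one by treating each combination of mapped label and rounded score as a stratum, which is an assumption made for brevity. The names `stratified_sample` and `per_stratum`, the dictionary layout of a sample, and the use of a single score per sample are also illustrative. Note that Python's built-in `round` rounds exact halves to the nearest even integer.

```python
import random
from collections import defaultdict


def stratified_sample(samples, per_stratum, seed=0):
    """Draw up to `per_stratum` samples from each (label, rounded score) stratum.

    samples: list of dicts with a 'label' ('match' or 'no_match') and a
        numeric 'score' (assumed layout).
    seed: fixed seed so the draw is repeatable.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        strata[(sample["label"], round(sample["score"]))].append(sample)

    chosen = []
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        chosen.extend(group[:per_stratum])
    return chosen


# Hypothetical labeled pairs with compound scores.
samples = [
    {"label": "match", "score": 9.2},
    {"label": "match", "score": 8.8},
    {"label": "match", "score": 9.4},
    {"label": "no_match", "score": 2.1},
    {"label": "no_match", "score": 1.9},
    {"label": "no_match", "score": 5.6},
]
picked = stratified_sample(samples, per_stratum=2)
print(len(picked))
```

Capping each stratum keeps common score patterns from crowding out rare ones, which is the motivation the paragraph gives for stratifying.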

[0065] At block 304, the process 300 involves training the record matching model 114 and two or more auxiliary models 408 using the training samples 126. As discussed above with respect to FIG. 1, the record matching model 114 may be a decision tree model, a random forest, a RIPPER model, or any other model that is explainable and exportable as a rule set. The training can involve supervised training using the input attributes and the current matching labels in the training samples 126. In some examples, the auxiliary models 408 are employed in order to correct the matching label 406 in the misclassified training samples. The auxiliary models 408 can operate under different principles of classification and each can be trained to generate a classification of match or no-match based on attributes associated with a pair of records. Examples of the auxiliary models 408 can include a naive Bayes model, a multilayer perceptron model, a random forest model, and a support vector machine (SVM).

[0066] Each of the auxiliary models 408 can be trained using the same training samples 126 that are used to train the matching model 114. In some examples, the attributes input to each of the auxiliary models 408 can include the input identifier attributes 404 for the matching model 114. In other examples, the attributes input to each of the auxiliary models 408 include a subset of the input identifier attributes 404, such as the identifier scores and the compound scores. By using a subset of the input identifier attributes 404, the computational complexity of training the auxiliary models 408, and thus training the record matching model 114, can be significantly reduced.

[0067] At block 306, the process 300 involves determining predicted classifications for the training samples using the initially trained record matching model 114. In other words, the input identifier attributes 404 in each training sample 126 are input to the initially trained record matching model 114 to generate the respective predicted classifications 402.

[0068] At block 308, the process 300 involves identifying misclassified training samples. The misclassified training samples can include training samples that are mistakenly labeled. In other words, the matching label 406 in a training sample for a pair of matched records is incorrectly marked as no-match, or the matching label 406 in a training sample for a pair of unmatched records is incorrectly marked as a match. The graph-merge-split server 106 can identify a set of the training samples as misclassified training samples 412 if the predicted classifications 402 of the set of training samples 126 are different from the respective matching labels 406.

[0069] At block 310, the process 300 involves the graph-merge-split server 106 determining if there are any misclassified training samples 412. If so, the process 300 involves generating, at block 312, predicted classifications for each of the misclassified training samples 412 using the auxiliary models 408, also referred to as auxiliary classifications 410. At block 314, the process 300 involves updating the misclassified training samples 412 based on the auxiliary classifications 410 generated by the auxiliary models 408.

[0070] In some examples, the auxiliary classifications 410 are compared with each other to determine if the misclassified training samples need to be corrected. Because the auxiliary models 408 have different underlying principles to predict the classifications, if a pair of records is a genuine match, the auxiliary models 408 should agree on the classification. But if the auxiliary models 408 do not agree on the predicted classifications, the pair of records should be further analyzed to determine the accurate label. For example, for a mismatched training sample, if the auxiliary classifications 410 are consistent with the predicted classification by the matching model 114, the graph-merge-split server 106 can change the matching label 406 of the mismatched training sample to be consistent with the classification output by the matching model 114. If the auxiliary classifications 410 include conflicting classifications, the graph-merge-split server 106 can determine the matching label for the mismatched training sample based on a combination of the original matching label, the classification by the matching model, and the auxiliary classifications 410 by the auxiliary matching models, such as through majority voting. Alternatively, or additionally, the record-matching computing system can output the mismatched training sample to another system for further analysis to determine the correct matching label. The mismatched training samples whose matching labels are corrected can then be used to update the corresponding training sample 126.
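
The label-correction logic of this paragraph can be illustrated with a short sketch. The function name `resolve_label`, the 0/1 label encoding, and the rule that a tied vote is escalated are assumptions made for illustration; the disclosure does not specify them.

```python
from collections import Counter


def resolve_label(original_label, model_prediction, auxiliary_predictions):
    """Return a corrected label for one misclassified training sample.

    Labels are assumed to be 1 (match) or 0 (no-match). Returns None when
    the vote is tied, meaning the sample should be sent for further analysis.
    """
    # If every auxiliary model agrees with the matching model, adopt the
    # matching model's classification as the corrected label.
    if auxiliary_predictions and all(
        p == model_prediction for p in auxiliary_predictions
    ):
        return model_prediction

    # Otherwise combine the original label, the matching model's output,
    # and the auxiliary classifications, and take the majority.
    votes = Counter([original_label, model_prediction, *auxiliary_predictions])
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return None  # tie: escalate to another system
    return top_label


unanimous = resolve_label(0, 1, [1, 1, 1])   # auxiliaries back the model
conflicting = resolve_label(0, 1, [0, 1, 0])  # majority keeps the original
tied = resolve_label(0, 1, [0, 1])            # no majority
```

In this sketch the original label counts as one vote alongside the models, which is one possible reading of "a combination of the original matching label, the classification by the matching model, and the auxiliary classifications".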

[0071] The record matching model 114 can be re-trained using the updated training samples 126 at block 304 and the operations in blocks 306-314 can be repeated until the graph-merge-split server 106 determines, at block 310, that there are no misclassified training samples. The process 300 then involves, at block 316, the graph-merge-split server 106 outputting the trained record matching model 114 and the training samples 126. At this stage, the training samples 126 include the corrected matching labels 406, which can be used as ground truth matching labels 406.

Examples of Merge-Split Operations

[0072] FIG. 5 is a flowchart depicting an example of a graph-merge-split process 500 for Entity Resolution based on identified candidate records within a database, according to certain aspects of the present disclosure. For illustrative purposes, the process 500 is described with reference to implementations described above with respect to one or more examples described herein. Other implementations, however, are possible. In some aspects, the operations in FIG. 5 may be implemented in program code that is executed by one or more computing devices such as the graph-merge-split server 106 depicted in FIG. 1. In some aspects of the present disclosure, one or more operations shown in FIG. 5 may be omitted or performed in a different order. Similarly, additional operations not shown in FIG. 5 may be performed.

[0073] At block 502, the process 500 involves identifying a list of candidate records to be evaluated for merging. In some examples, every pair of data records 122 within a data repository 120 is evaluated for a possible merge. But for a data repository 120 containing a large number of data records, such as tens of millions of data records, the computational complexity for examining each pair of data records is prohibitively high. As such, additional techniques for identifying the list of candidate records to be evaluated for merging can include the steps as described in FIG. 2.
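
To make the scale of exhaustive pairwise comparison concrete, the following arithmetic counts the unordered record pairs for an assumed repository of 10 million records. The figure of 10 million is an illustrative assumption at the low end of "tens of millions".

```python
from math import comb

# Assumed repository size, chosen only for illustration.
num_records = 10_000_000

# Number of unordered record pairs: n * (n - 1) / 2.
num_pairs = comb(num_records, 2)
print(f"{num_pairs:,}")  # 49,999,995,000,000
```

Roughly fifty trillion comparisons for a single pass is what motivates restricting the evaluation to a candidate list.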

[0074] At block 504, the process 500 involves generating an entity level graph from the list of candidate records. FIG. 7 shows an example of a graph. In the graph, each node 702 represents an entity and an edge 704 between the two nodes indicates a match between the entities represented by the two nodes according to a matching decision. The values associated with edges 704 indicate the matching scores for the paired entities. For example, the edge 704 connecting nodes B and C indicates that the two entities represented by these two nodes match with each other according to a matching decision for this pair. The value .95 associated with the edge 704 is the matching score associated with the matching decision, which indicates relatively high confidence in the matching decision. Similarly, the edge 706 connecting nodes A and E indicates that the entities represented by these two nodes match with each other but with a relatively low confidence score of .65. Pairs of nodes that do not have an edge connecting them, such as nodes B and F or nodes C and E, are not considered matching entities according to the matching decision.

[0075] Challenges arise when nodes of a graph are not fully connected. For example, for three nodes A, B, and F, A is connected to B by an edge; A is further connected to F by another edge; but B and F are not connected. In this case, merging A and B or A and F can be problematic because B and F do not match. Either merging A and B or merging A and F would violate the matching decision between B and F. To address this kind of scenario and increase the precision of the merging, connected components are utilized. At block 506, the process 500 involves identifying connected components in the graph. A connected component of a graph is a subgraph in which any two nodes are connected to each other through one or more edges. In the example shown in FIG. 7, nodes A-F form a connected component 700. FIG. 7 further shows a process of breaking apart the connected component 700 by removing edges that fail to satisfy a minimum entity level merge score threshold as described in subsequent blocks of process 500.
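
A minimal sketch of block 506 follows, using a breadth-first search over a dictionary of scored edges. The data layout, the function name `connected_components`, and every edge other than B-C (.95) and A-E (.65) are hypothetical; the actual edge set of FIG. 7 is not reproduced here.

```python
from collections import deque


def connected_components(edges, nodes=()):
    """Group nodes into connected components.

    `edges` maps a (u, v) tuple to a matching score. `nodes` may list
    additional nodes that have no edges at all.
    """
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(component)
    return components


edges = {
    ("B", "C"): 0.95,  # score given in the text
    ("A", "E"): 0.65,  # score given in the text
    ("A", "B"): 0.90,  # hypothetical
    ("A", "F"): 0.80,  # hypothetical
    ("G", "H"): 0.97,  # hypothetical second component
}
components = connected_components(edges)
```

Here nodes A, B, C, E, and F land in one component even though B and F share no edge, which is the situation the paragraph describes.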

[0076] Referring now back to FIG. 5, block 508 of process 500 indicates that, once the one or more connected components within the entity level graph are identified, blocks 510 - 524 may be iterated per a graph-merge process, where entities may be merged. Merging entities into a single, merged entity results in all the identifiers within those entities being merged into the single merged entity. Blocks 510 - 524 may be iteratively performed for each identified connected component such that each identified connected component is reduced by potential merges to ultimately generate an output graph per block 528.

[0077] At block 510, the process 500 involves determining an entity matching score for each entity edge. For the first identified connected component, the entity matching score may be the entity matching score generated at block 504 in forming the entity level graph. However, as process 500 involves iteratively merging specific entity nodes in each identified connected component, the entity level graph may be updated between analyzing each identified connected component. Thus, on each subsequent iteration of block 510, the process 500 may determine new entity matching scores for each entity edge of a given connected component. Determining the entity matching scores can follow a similar procedure to block 504, where each entity edge in a connected component is scored based on the similarity of its identifiers, for instance, via the matching model 114.

[0078] At block 512, the process 500 involves removing entity edges that fall below an entity level merge score threshold to generate an updated entity level graph. Removing entity edges between the entity nodes indicates that the entity nodes are insufficiently similar for merging. Because the entity level merge score threshold gates the merging of entities, each of which represents a consumer, a relatively high entity level merge score threshold (e.g., >.95 similarity between entities) may be used to decide whether to merge the two entity nodes. Removing entity level edges that fail to meet the entity level merge score threshold can thus lead to the entity level graph being updated.
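
The edge filter of block 512 can be sketched as a dictionary comprehension. The function name `remove_weak_edges` and the edge scores other than B-C (.95) and A-E (.65) are hypothetical. Treating a score exactly equal to the threshold as passing is also an assumption, since the text only says that edges falling below the threshold are removed.

```python
def remove_weak_edges(edges, threshold):
    """Keep only the edges whose matching score is at or above `threshold`."""
    return {pair: score for pair, score in edges.items() if score >= threshold}


edges = {
    ("B", "C"): 0.95,  # score given in the text
    ("A", "E"): 0.65,  # score given in the text
    ("A", "B"): 0.90,  # hypothetical
    ("G", "H"): 0.97,  # hypothetical
}
kept = remove_weak_edges(edges, threshold=0.95)
```

With a threshold of .95, only the B-C and G-H edges survive, so the original connected component breaks apart.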

[0079] At block 514, the process 500 involves applying a maximal graph matching algorithm to the updated entity level graph. The maximal graph matching algorithm can further restructure the updated entity level graph by breaking down entity edges such that no entity node has multiple entity level edges. Thus, each entity will either be disconnected (having no entity edges connecting to another entity node), or, at most, paired with a single other entity node via a single entity edge. In breaking apart the entity nodes, the maximal graph matching algorithm optimizes the break down such that the remaining entity edges are those with the greatest entity matching scores. Each of the remaining pairs of entity nodes with the maximally matched entity edges may be referred to as the maximally paired entity nodes.

[0080] FIG. 8 is a diagram depicting an example implementation of the maximal graph matching algorithm, according to certain aspects of the present disclosure. The simplified example of FIG. 8 illustrates a first graph 802 broken down via the maximal graph matching algorithm into a second graph 804 including a set of maximally paired nodes and a lone node with no edges. The maximal graph matching algorithm, also referred to as maximal weight matching, can break up a graph, such as graph 802, such that no node has multiple edges. The maximal graph matching algorithm is further optimized such that weights are maximized between nodes with the single connected edge.

[0081] Graph 802 is shown including five nodes, where nodes 2 and 3 each have three edges connecting to adjacent nodes. The maximal graph matching algorithm can proceed by identifying the nodes having multiple edges, such as nodes 2 and 3, and then identifying the maximal edge for each of these nodes to remove the lesser scored edges, ensuring that each of nodes 2 and 3 is connected to at most one other node. In the example of FIG. 8, the maximal graph matching algorithm identifies the edge connecting nodes 2 and 4 as the maximal edge for node 2, and the edge connecting nodes 3 and 5 as the maximal edge for node 3. The maximal graph matching algorithm identifies the corresponding nodes as the maximally paired nodes and removes the lower scored edges of the remaining nodes. As a result, graph 804 can be generated where each node has, at most, one edge connecting to another node, but where the edge is the maximal edge of those identified from graph 802.
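
The following sketch finds a maximum-weight matching by exhaustive search, which is practical only for small connected components. It is not the procedure of FIG. 8, which is described in terms of picking a maximal edge per node; it is one way to obtain a set of disjoint pairs with the greatest total score. The function name and all edge weights are hypothetical, chosen so that the result mirrors the pairs named in the text (2-4 and 3-5, with node 1 left alone).

```python
def max_weight_matching(edges):
    """Return (total_score, pairs) for the best set of node-disjoint edges.

    `edges` maps a (u, v) tuple to a score. Runs in exponential time.
    """
    items = sorted(edges.items())

    def best(index, used):
        if index == len(items):
            return 0.0, []
        (u, v), score = items[index]
        # Option 1: leave this edge out of the matching.
        skip_total, skip_pairs = best(index + 1, used)
        # Option 2: take it, provided neither endpoint is already paired.
        if u in used or v in used:
            return skip_total, skip_pairs
        take_total, take_pairs = best(index + 1, used | {u, v})
        take_total += score
        if take_total > skip_total:
            return take_total, [(u, v)] + take_pairs
        return skip_total, skip_pairs

    return best(0, frozenset())


# Five nodes; nodes 2 and 3 each have three edges (weights are hypothetical).
graph_802 = {
    (1, 2): 0.5,
    (2, 3): 0.6,
    (2, 4): 0.9,
    (3, 4): 0.4,
    (3, 5): 0.8,
}
total, pairs = max_weight_matching(graph_802)
print(pairs)  # [(2, 4), (3, 5)]
```

For larger components, a polynomial-time maximum-weight matching routine from a graph library would be the more realistic choice.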

[0082] Referring now back to FIG. 5, per determination 516, the process 500 determines whether, in applying the maximal graph matching algorithm at block 514, a set of maximally paired entity nodes was identified (e.g., nodes 2-4 and 3-5 of graph 804). If not, the process exits, indicating that each of the entity nodes is accurately merged. Per block 526, exiting the process can include evaluating the next connected component of the identified connected components. If no additional connected components are identified, the process 500 may then terminate per block 528 where an output graph is generated based on the previous modifications made to the entity level graph.

[0083] At block 518, the process 500 involves, for each of the maximally paired entity nodes identified by the matching algorithm, determining an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity. In other words, for each pair of entity nodes that are part of the maximal matching, the process 500 determines an entity clique score.

[0084] FIG. 9 is a diagram depicting examples of cliques with varying clique scores representing the degree of connectivity between nodes within a graph, according to certain aspects of the present disclosure. The principles described with respect to FIG. 9 can be applied with respect to entity level graphs, as per block 518, and additionally with respect to identity level graphs per block 618 of process 600. In the example of FIG. 9, three connected components 902, 904, and 906 are shown, each with a varying clique score.

[0085] The connected component 902 is shown as fully connected, having a clique score of 1 indicating that each node is connected to every other node. A perfect clique score of 1 would indicate that the connected component is completely connected and would thus satisfy every delta-clique constraint applied to the connected component.

[0086] Connected component 904 is shown having intermediate connectivity. Some of the nodes are fully connected to every other node, while other nodes have lesser connectivity. The connected component 904 would accordingly have a medium clique score, or delta. The clique score for connected component 904, e.g., a normalized score of .7 on a range of 0-1, may be sufficient to satisfy certain delta-clique constraints (e.g., a lower delta-clique constraint for a merge procedure, entity level approach), while failing other delta-clique constraints (e.g., a higher delta-clique constraint applied for a split procedure, identity level approach) depending on the values set.

[0087] Connected component 906 is shown having low connectivity. Each of the nodes is connected, at most, to two other nodes while nodes on the edges of the connected component 906 are connected only to a single other node. The connected component 906 would accordingly have a low clique score. The connected component 906 may generally be filtered out by various delta-clique constraints to prevent merging of the nodes within the connected component 906.
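
The disclosure describes the clique score only as a normalized value from 0 to 1 reflecting how well a set of nodes is connected. One plausible reading, assumed here purely for illustration, is edge density: the fraction of all possible node pairs that actually share an edge. The function name `clique_score` and the example graphs are hypothetical.

```python
from itertools import combinations


def clique_score(nodes, edges):
    """Fraction of possible node pairs that are joined by an edge (0 to 1)."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 1.0  # a single node is trivially a clique
    present = {frozenset(pair) for pair in edges}
    possible = list(combinations(nodes, 2))
    linked = sum(1 for pair in possible if frozenset(pair) in present)
    return linked / len(possible)


four = ["a", "b", "c", "d"]
# Fully connected, in the spirit of component 902.
complete = list(combinations(four, 2))
# A chain, in the spirit of component 906.
chain = [("a", "b"), ("b", "c"), ("c", "d")]

full_score = clique_score(four, complete)
chain_score = clique_score(four, chain)
```

Under this assumption the fully connected graph scores 1 and the four-node chain scores 0.5, so a delta-clique constraint set between those values would admit the first and filter out the second.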

[0088] Referring now back to FIG. 5, per block 520, the entity clique score may be compared against an entity level delta-clique constraint, where the clique score is a normalized score from 0 to 1 indicating how well a set of nodes is connected to each other. Each set of maximally paired entity nodes may be evaluated for merging by evaluating the strength of the associations between the maximally paired entity nodes via the entity level delta-clique constraint. In the entity merge case, a relatively low entity level delta-clique constraint may be used relative to a higher identity level delta-clique constraint applied in a subsequent split process. Because the relatively high entity level merge score threshold is already applied per block 512, the use of a lower entity level delta-clique constraint may be justified. If the entity clique score fails to exceed the entity level delta-clique constraint, indicating that the maximally paired entity nodes are not to be merged, the process may then terminate by turning to block 526. Per block 526, exiting the process can include evaluating the next connected component of the identified connected components. If no additional connected components are identified, the process 500 may then terminate per block 528 where an output graph is generated based on the previous modifications made to the entity level graph.

[0089] At block 522, the maximally paired entity nodes are merged. Merging is thus performed between entities, and all identities within those entities are merged into the single merged entity. Thus, according to the process occurring prior to merge, each set of entity level nodes in a given graph is first filtered by the entity level merge score threshold per block 512, reduced to single or paired entity nodes through a maximal graph matching algorithm per block 514, and evaluated against an entity level delta-clique constraint per block 520. Each of these sub-procedures is used to further determine the accuracy of merging a pair of entity nodes prior to the merge.

[0090] At block 524, a graph-splitter process is applied to the merged nodes. The graph-splitter process entails a procedure, discussed with respect to FIG. 6, similar to that of the graph-merge process described with respect to FIG. 5. Generally, the graph-splitter process includes generating an identity level graph from the merged entity nodes generated at block 522, identifying connected components within the identity level graph, comparing identity matching scores for each identity edge against an identity merge threshold to modify the identity level graph, applying the maximal graph matching algorithm to the identity level graph, comparing an identity clique score against an identity level delta-clique threshold, and merging those identities that satisfy the threshold. By merging identities that satisfy such conditions, a merge-split graph is produced, where the merged entity (merged per the graph-merge process) is split (per the graph-split process) to reduce overcombines such that each consumer, or person, is represented by at most a single entity. After completion of the graph-merge process and the graph-split process, an output graph comprising the final merge-split graph can be generated.

[0091] At block 528, the process 500 includes generating an output graph including the merge-split graph. The output graph, generated by iterating blocks 508-526 of process 500, may thus include sets of merged entities where the merged entities resolve issues of fragmentation where otherwise sets of previously unmerged entities referred to the same consumer or person. Similarly, each time entities are merged, a graph-splitter procedure is called to resolve issues related to overcombination, such that each merged entity relates to only one respective consumer. In some applications, the generated output graph may be displayed or output via a monitoring report to enable database administrators to perform further analyses and provide for additional resolution and oversight of unresolved entities as stored in an entity repository.
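
The repeat-until-no-merge character of process 500 can be sketched as a simplified loop. This sketch makes several assumptions that are not in the disclosure: it scores all entity pairs rather than working per connected component, it uses a greedy highest-score-first pairing in place of the maximal graph matching algorithm, it omits the delta-clique check and the graph-splitter call, and it uses Jaccard similarity of identifier sets as a stand-in for the matching model 114. All names and data are hypothetical.

```python
def jaccard(a, b):
    """Stand-in scorer: overlap of two identifier sets, from 0 to 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def graph_merge(entities, score_fn, merge_threshold, max_rounds=100):
    """Merge qualifying entity pairs repeatedly until no merge occurs.

    `entities` maps an entity id to a set of identifiers.
    """
    entities = {key: set(value) for key, value in entities.items()}
    for _ in range(max_rounds):
        # Blocks 510-512: score pairs and drop edges below the threshold.
        ids = sorted(entities)
        edges = []
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                score = score_fn(entities[a], entities[b])
                if score >= merge_threshold:
                    edges.append((score, a, b))
        # Block 514 (simplified): greedy pairing, highest score first,
        # so that each entity takes part in at most one pair.
        edges.sort(reverse=True)
        paired, matching = set(), []
        for _score, a, b in edges:
            if a not in paired and b not in paired:
                paired.update((a, b))
                matching.append((a, b))
        if not matching:
            break  # determination 516: nothing left to merge
        # Block 522: fold the identifiers of b into a.
        for a, b in matching:
            entities[a] |= entities.pop(b)
    return entities


records = {
    "e1": {"id:1", "name:ann"},
    "e2": {"id:1", "name:ann", "addr:x"},
    "e3": {"id:9", "name:bob"},
}
merged = graph_merge(records, jaccard, merge_threshold=0.6)
```

The threshold of 0.6 suits only this toy data; the text suggests a much higher entity level merge score threshold in practice.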

[0092] FIG. 6 is a flowchart depicting an example of a graph-splitter process 600 for Entity Resolution based on identifiers within a merged entity, according to certain aspects of the present disclosure. The graph-splitter process may be performed between identities of a single entity (e.g., the merged entity). As the splitting is performed between identities, the process 600 is referred to as being applied at the identity level, where the graphs generated for implementing the graph-splitter map relationships of identities for splitting. The identities can be split into 1 to n entities after the split. For illustrative purposes, the process 600 is described with reference to implementations described above with respect to one or more examples described herein. Other implementations, however, are possible. In some aspects, the operations in FIG. 6 may be implemented in program code that is executed by one or more computing devices such as the graph-merge-split server 106 depicted in FIG. 1. In some aspects of the present disclosure, one or more operations shown in FIG. 6 may be omitted or performed in a different order. Similarly, additional operations not shown in FIG. 6 may be performed.

[0093] At block 602, the process 600 begins with a merged entity. The merged entity can be the merged entity formed by merging nodes at block 522 per process 500. In such a way, the process 600 is iteratively performed for each connected component in an entity level graph that satisfied the conditions of process 500 for merging, such as the merging entities being identified as satisfying an entity merge threshold and satisfying the entity level delta-clique constraint. The merged entity per block 602 can include multiple identifiers linked to the merged node. As the entity nodes are now merged into one merged entity node, the identities may be evaluated to determine whether the identities each belong to the same entity or instead should be split into separate entities representing different consumers.

[0094] At block 604, the process 600 involves generating an identity level graph. The identity level graph includes the identifiers of the merged entity as nodes, referred to as identity nodes, where the identity nodes are connected by identity edges representing the strength of the association between each pair of identity nodes in the identity level graph. The identity level graph is similar to the entity level graph but represents the connections between identities of a merged entity. Per the process 500 of FIG. 5, scores such as entity matching scores for each entity edge were previously generated. The same matching scores may be used to generate the identity edge scores.

[0095] With the identity edge scores generated, process 600 involves identifying one or more identity level connected components per block 606. A similar process for identifying identity level connected components within the identity level graph may be applied as was applied to identify the connected components of the entity level graph per block 506 of process 500. Generally, the identity level graph generated per block 604 may already be a fully connected network of identity nodes. Thus, identifying the one or more identity level connected components per block 606 can entail identifying the fully connected identity level graph, generated per block 604, as the single connected component at the identity level.

[0096] Block 608 of process 600 indicates that, once the one or more connected components within the identity level graph are identified, blocks 610 - 628 may be iterated per the graph-splitter process, where identities may be merged so as to split the merged entity. Blocks 610 - 628 may be iteratively performed for each identified connected component such that each identified connected component is reduced by potential merges to ultimately generate the updated merged entity per block 626. Per blocks 604 and 606, only a single iteration may be performed (i.e., only one connected component was identified per block 606, as the connected component is effectively identical to the graph generated per block 604).

[0097] At block 610, the process 600 involves determining an identity matching score for each identity edge. Block 610 is similar to block 510 but applied at an identity level. As in block 510, in a first instance, determining the identity matching score for each identity edge may include retrieving the scores used to generate the identity level connected component. On subsequent iterations and for each additional connected component, a new identity matching score for each identity edge may be calculated to account for modifications made to the identity level graph, for instance, when identities within a connected component are merged.

[0098] At block 612, the process 600 involves removing identity edges that fall below an identity level merge threshold. Since not all identities within an entity may score highly against each other, a low identity level merge threshold is used, though one still high enough to identify outliers. Thus, the identity level merge threshold may have a lower value compared to the entity level merge threshold used per block 512. Otherwise, blocks 512 and 612 may be similarly applied to remove respective edges (entity or identity level) so as to adjust the respective entity and identity level graphs.

[0099] At block 614, the process 600 involves applying the maximal graph matching algorithm to the identity level graph. As in block 514, the maximal graph matching algorithm can further restructure the updated identity level graph by breaking down identity edges such that no identity node has multiple identity edges. The maximal graph matching algorithm can identify maximally paired identity nodes within the updated identity level graph for merging. Per decision 616, if maximally paired identity nodes are identified, the process 600 can proceed to an identity level delta-clique constraint analysis to further determine whether to merge the maximally paired identity nodes. Otherwise, the process 600 proceeds to analyze additional connected components within the identity level graph.

[0100] At block 618, the process 600 involves determining an identity clique score based on the maximally paired identity nodes identified by the maximal graph matching algorithm. Similar to block 518 of process 500, block 618 involves determining an identity clique score for each pair of identity nodes that are part of the maximal matching.

[0101] Per block 620, the identity clique score may be compared against an identity level delta-clique constraint, where the clique score is a normalized score from 0 to 1 indicating how well a set of nodes is connected to each other. Each set of maximally paired identity nodes may be evaluated for merging by evaluating the strength of the associations between the maximally paired identity nodes via the identity level delta-clique constraint. In the identity level, split case, a relatively higher identity level delta-clique constraint may be applied compared to the entity level delta-clique constraint applied per block 520. Because a lower identity level merge score threshold is applied per block 612, the higher identity level delta-clique constraint may be used to weed out outliers (e.g., identities that are insufficiently similar per their assigned identity edge similarity scores). If the identity clique score fails to exceed the identity level delta-clique constraint, the process may then terminate by turning to decision 628. Per decision 628, exiting the process can include evaluating the next connected component of the identified connected components. If no additional connected components are identified, the process 600 may then terminate per block 630 where updates to a merge-split graph are completed.
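
The contrast between the two levels (a higher merge score threshold with a lower delta-clique constraint for entities, and the reverse for identities) can be captured in a small configuration sketch. The class and function names are hypothetical, and every numeric value other than the .95 entity merge score example is an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelThresholds:
    merge_score: float   # minimum edge score kept (blocks 512 / 612)
    delta_clique: float  # clique score a pair must exceed (blocks 520 / 620)


def follows_described_relationship(entity, identity):
    """Check the ordering the text describes between the two levels."""
    return (
        identity.merge_score < entity.merge_score
        and identity.delta_clique > entity.delta_clique
    )


# 0.95 comes from the text's example; the other values are assumed.
ENTITY_LEVEL = LevelThresholds(merge_score=0.95, delta_clique=0.60)
IDENTITY_LEVEL = LevelThresholds(merge_score=0.80, delta_clique=0.90)

consistent = follows_described_relationship(ENTITY_LEVEL, IDENTITY_LEVEL)
```

A check of this kind could catch a misconfiguration in which the identity level is made both easier to connect and easier to merge than the entity level.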

[0102] Per block 620, if the identity clique score exceeds the identity level delta-clique constraint, the maximally paired identity nodes may be merged per block 626, producing an updated merge-split graph. Subsequent to the merge, the process 600 may return to block 608 where the next identity level connected component is evaluated per a similar procedure. The process of blocks 608-624 may be iterated until no additional identity level connected components are identified, leading to block 626, where an updated identity level graph is applied to update the merged entity within the entity level graph. The updated entity level graph, including updates made via the split process, represents the merge-split graph, which may form the output graph.

[0103] In some examples, special consideration may be applied in the graph-splitter process to avoid inadvertently overcombining two data records. In such cases, the process 600 may optionally include determining a centrality score for merging identities and comparing the centrality score against a centrality threshold to determine whether to merge. Optional blocks 622 and 624 illustrate an example of such a process applied during the graph-splitter procedure to prevent overcombination of two data records.
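
The optional centrality check of blocks 622 and 624 could be sketched as follows. Paragraph [0104] names average eigenvector centrality as one option; here it is computed by power iteration on a weighted, undirected, connected identity graph. The function names, the shifted iteration used for stable convergence, the example graph, and the threshold values are all assumptions for illustration.

```python
def eigenvector_centrality(nodes, edges, iterations=200, tol=1e-12):
    """Approximate eigenvector centrality by power iteration.

    `edges` maps a (u, v) tuple to a non-negative weight. `nodes` must be
    non-empty and the graph is assumed to be connected.
    """
    neighbors = {n: {} for n in nodes}
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    score = {n: 1.0 for n in neighbors}
    for _ in range(iterations):
        # Iterating with (A + I) keeps the same principal eigenvector as A
        # but avoids oscillation on bipartite graphs such as simple chains.
        raw = {
            n: score[n] + sum(w * score[m] for m, w in neighbors[n].items())
            for n in neighbors
        }
        norm = sum(value * value for value in raw.values()) ** 0.5
        updated = {n: value / norm for n, value in raw.items()}
        converged = max(abs(updated[n] - score[n]) for n in neighbors) < tol
        score = updated
        if converged:
            break
    return score


def passes_centrality_check(pair, centrality, threshold):
    """Allow the merge unless the pair's average centrality is below threshold."""
    u, v = pair
    return (centrality[u] + centrality[v]) / 2 >= threshold


# Hypothetical three-identity chain a - b - c with unit weights.
centrality = eigenvector_centrality(
    ["a", "b", "c"], {("a", "b"): 1.0, ("b", "c"): 1.0}
)
allowed = passes_centrality_check(("a", "b"), centrality, threshold=0.6)
blocked = passes_centrality_check(("a", "b"), centrality, threshold=0.7)
```

In this chain the middle identity has the highest centrality (about 0.707 against 0.5 for the ends), so the a-b pair averages about 0.60 and passes a threshold of 0.6 but not 0.7.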

[0104] At block 622, the process 600 can include determining a centrality score for the maximally paired identity nodes, prior to merging the maximally paired identity nodes. The centrality score may include an average eigenvector centrality score determined from evaluating the weights of the two identity nodes being merged. Other centrality scores may be used in addition to or alternatively to average eigenvector centrality such as Katz centrality and the like. At block 624, the process may then compare the centrality score for the maximally paired identity nodes against a centrality threshold. The centrality threshold may be a configurable value representing a tolerance of the potential risk of data record overcombination during the graph-splitter process. In response to determining the centrality score is less than the centrality threshold, the process 600 can terminate the merging of the two maximally paired identity nodes per block 626, and instead transition towards analyzing any additional connected components in the identity level graph per decision 628 or updating the merged identity having not merged the maximally paired identity nodes which failed to reach the centrality score threshold.

Example of a Computing Environment for Record Matching and Fragmented File Detection

[0105] Any suitable computing system or group of computing systems can be used to perform the operations for record matching and fragmented file detection described herein. For example, FIG. 10 is a block diagram depicting an example of a computing device 1000 which can be the graph-merge-split server 106 or the model training server 108. The example of the computing device 1000 can include various devices for communicating with other devices in the graph-merge-split computing system 100, as described with respect to FIG. 1. The computing device 1000 can include various devices for performing one or more operations described above with respect to FIGS. 1-9.

[0106] The computing device 1000 can include a processor 1002 that is communicatively coupled to a memory 1004. The processor 1002 executes computer-executable program code stored in the memory 1004, accesses information stored in the memory 1004, or both. Program code may include machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, among others.

[0107] Examples of a processor 1002 include a microprocessor, an application-specific integrated circuit, a field-programmable gate array, or any other suitable processing device. The processor 1002 can include any number of processing devices, including one. The processor 1002 can include or communicate with a memory 1004. The memory 1004 stores program code that, when executed by the processor 1002, causes the processor to perform the operations described in this disclosure.

[0108] The memory 1004 can include any suitable non-transitory computer-readable medium. The computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable program code or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, memory chip, optical storage, flash memory, storage class memory, ROM, RAM, an ASIC, magnetic storage, or any other medium from which a computer processor can read and execute program code. The program code may include processor-specific program code generated by a compiler or an interpreter from code written in any suitable computer-programming language. Examples of suitable programming languages include Hadoop, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, ActionScript, etc.

[0109] The computing device 1000 may also include a number of external or internal devices such as input or output devices. For example, the computing device 1000 is shown with an input / output interface 1008 that can receive input from input devices or provide output to output devices. A bus 1006 can also be included in the computing device 1000. The bus 1006 can communicatively couple one or more components of the computing device 1000.

[0110] The computing device 1000 can execute program code that includes graph-merge-split service 110, matching model 114, or model training service 116. The program code for the graph-merge-split service 110, matching model 114, or model training service 116 may be resident in any suitable computer-readable medium and may be executed on any suitable processing device. For example, as depicted in FIG. 10, the program code for the graph-merge-split service 110, matching model 114, or model training service 116 can reside in the memory 1004 at the computing device 1000. Executing the graph-merge-split service 110, matching model 114, or model training service 116 can configure the processor 1002 to perform the operations described herein.

[0111] In some aspects, the computing device 1000 can include one or more output devices. One example of an output device is the network interface device 1010 depicted in FIG. 10. A network interface device 1010 can include any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks described herein. Non-limiting examples of the network interface device 1010 include an Ethernet network adapter, a modem, etc.

[0112] Another example of an output device is the presentation device 1012 depicted in FIG. 10. A presentation device 1012 can include any device or group of devices suitable for providing visual, auditory, or other suitable sensory output. Non-limiting examples of the presentation device 1012 include a touchscreen, a monitor, a speaker, a separate mobile computing device, etc. In some aspects, the presentation device 1012 can include a remote client-computing device that communicates with the computing device 1000 using one or more data networks described herein. In other aspects, the presentation device 1012 can be omitted.

[0113] The foregoing description of some examples has been presented only for the purpose of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Numerous modifications and adaptations thereof will be apparent to those skilled in the art without departing from the spirit and scope of the disclosure.

Claims

1. A computer-implemented method that includes one or more processing devices performing operations comprising: identifying a list of candidate records from a set of data records stored in a data repository for merging; generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes comprising entities and entity edges comprising entity matching scores; identifying one or more connected components within the entity level graph; performing a graph-merge process wherein the graph-merge process is performed for each connected component of the one or more connected components and the graph-merge process comprises: determining the entity matching score for each entity edge; removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph; applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes; for each of the maximally paired entity nodes identified by the maximal graph matching algorithm: determining an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity; and in response to determining the entity clique score exceeds an entity level delta-clique constraint, merging the maximally paired entity nodes to form the merged entity; applying a graph-splitter process to the merged entity to generate a merge-split graph; repeating the graph-merge process until no merge occurs; and generating an output graph comprising the merge-split graph.
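As an illustrative, non-limiting sketch, the merge loop recited in claim 1 (prune edges, match, check the clique score, merge, repeat until no merge occurs) could look like the following. The claims do not define how the entity matching score or the entity clique score is computed, so `entity_score` (mean of cross-entity record scores) and `clique_score` (share of record pairs in the would-be merged entity that reach the threshold) are assumptions made only for this example, as are all function names. A greedy matching stands in for the maximal graph matching algorithm. The connected-component partitioning and the graph-splitter step are omitted for brevity; because pruned edges never cross components, matching over the whole pruned graph selects the same pairs.

```python
from itertools import combinations

def pair_score(scores, a, b):
    """Look up a symmetric record-level score; missing pairs score 0.0."""
    return scores.get((a, b), scores.get((b, a), 0.0))

def entity_score(scores, e1, e2):
    """ASSUMED entity matching score: mean score over cross-entity record pairs."""
    pairs = [(a, b) for a in e1 for b in e2]
    return sum(pair_score(scores, a, b) for a, b in pairs) / len(pairs)

def clique_score(scores, e1, e2, threshold):
    """ASSUMED clique score: share of record pairs in the would-be merged
    entity whose score reaches the threshold (1.0 means a full clique)."""
    pairs = list(combinations(sorted(e1 | e2), 2))
    strong = sum(1 for a, b in pairs if pair_score(scores, a, b) >= threshold)
    return strong / len(pairs)

def greedy_maximal_matching(edges):
    """Greedy maximal matching: take edges by descending score, skipping
    any edge that touches an already-paired node."""
    used, matching = set(), []
    for _score, u, v in sorted(edges, key=lambda e: -e[0]):
        if u not in used and v not in used:
            matching.append((u, v))
            used.update((u, v))
    return matching

def graph_merge(entities, scores, threshold, delta):
    """Repeat prune -> match -> clique check -> merge until no merge occurs."""
    entities = set(entities)
    while True:
        edges = []
        for u, v in combinations(sorted(entities, key=sorted), 2):
            s = entity_score(scores, u, v)
            if s >= threshold:  # edges below the threshold are removed
                edges.append((s, u, v))
        merged_any = False
        for u, v in greedy_maximal_matching(edges):
            if clique_score(scores, u, v, threshold) > delta:
                entities -= {u, v}
                entities.add(u | v)  # the merged entity holds both record sets
                merged_any = True
        if not merged_any:
            return entities

# Hypothetical record-level scores for four records.
scores = {("a", "b"): 0.9, ("a", "c"): 0.85, ("b", "c"): 0.8, ("c", "d"): 0.2}
singletons = [frozenset([r]) for r in "abcd"]
result = graph_merge(singletons, scores, threshold=0.7, delta=0.6)
```

In this example, records `a`, `b`, and `c` end up in one entity over two passes, while `d` stays separate because its only edge falls below the threshold.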

2. The method of claim 1, wherein applying the graph-splitter process comprises: generating an identity level graph from the merged entity, where the identity level graph includes identity nodes comprising identities and identity edges comprising identity matching scores; removing each identity edge that falls below an identity level merge score threshold to generate an updated identity level graph; applying the maximal graph matching algorithm to the updated identity level graph where the maximal graph matching algorithm identifies one or more maximally paired identity nodes; for each of the maximally paired identity nodes identified by the maximal graph matching algorithm: determining an identity clique score based on effects of merging the maximally paired identity nodes together to form a merged identity; and in response to determining the identity clique score exceeds an identity level delta-clique constraint, merging the maximally paired identity nodes to generate the merge-split graph.

3. The method of claim 2, wherein the entity level merge score threshold is greater than the identity level merge score threshold.

4. The method of claim 2, wherein the entity level delta-clique constraint is less than the identity level delta-clique constraint.
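By way of a hedged example, the orderings recited in claims 3 and 4 can be expressed as a configuration check. The class name `MergeSplitConfig`, its field names, and the numeric values are hypothetical and chosen only for illustration; the claims specify the relative ordering of the parameters, not their values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MergeSplitConfig:
    """Hypothetical container for the four tuning parameters."""
    entity_merge_threshold: float
    identity_merge_threshold: float
    entity_delta_clique: float
    identity_delta_clique: float

    def __post_init__(self):
        # Claim 3: entity level threshold is greater than identity level threshold.
        if not self.entity_merge_threshold > self.identity_merge_threshold:
            raise ValueError(
                "entity merge threshold must exceed identity merge threshold")
        # Claim 4: entity level delta-clique is less than identity level delta-clique.
        if not self.entity_delta_clique < self.identity_delta_clique:
            raise ValueError(
                "entity delta-clique must be below identity delta-clique")

# Placeholder values only; they are not taken from the disclosure.
config = MergeSplitConfig(
    entity_merge_threshold=0.9,
    identity_merge_threshold=0.7,
    entity_delta_clique=0.5,
    identity_delta_clique=0.8,
)
```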

5. The method of claim 2, wherein the graph-splitter process further comprises, prior to merging the maximally paired identity nodes: determining a centrality score for the maximally paired identity nodes; and in response to determining the centrality score is less than a centrality threshold, preventing merging of the maximally paired identity nodes.
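The centrality gate of claim 5 can be sketched as follows. The claim does not name a centrality measure or say how a score is assigned to a pair of nodes, so this example assumes degree centrality on the pruned identity level graph and takes the lower of the two nodes' values as the pair's score. The function names and the adjacency data are hypothetical.

```python
def degree_centrality(adjacency):
    """Degree centrality: number of neighbours divided by the maximum
    possible number of neighbours (n - 1)."""
    n = len(adjacency)
    if n <= 1:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}

def allow_merge(adjacency, pair, centrality_threshold):
    """Return False (merge prevented) when the pair's centrality score,
    ASSUMED here to be the lower of the two node centralities, is less
    than the threshold."""
    centrality = degree_centrality(adjacency)
    u, v = pair
    return min(centrality[u], centrality[v]) >= centrality_threshold

# Hypothetical identity level graph after low-score edges were removed.
identity_graph = {
    "i1": {"i2", "i3"},
    "i2": {"i1", "i3"},
    "i3": {"i1", "i2", "i4"},
    "i4": {"i3"},
}
core_pair_ok = allow_merge(identity_graph, ("i1", "i2"), 0.5)
fringe_pair_ok = allow_merge(identity_graph, ("i3", "i4"), 0.5)
```

Here the pair `("i1", "i2")` passes the gate, while `("i3", "i4")` is blocked because `i4` is attached to the rest of the graph by a single edge.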

6. The method of claim 1, wherein identifying the list of candidate records comprises: accessing an inquiry data set that includes relationships between prior inquiries and returned records; generating a set of optimized search keys based on the relationships between prior inquiries and returned records; grouping a subset of records in a record database; querying the subset of records by the set of optimized search keys to generate a preliminary candidate set; and filtering the preliminary candidate set to identify the list of candidate records.

7. The method of claim 1, wherein the entity matching scores are generated by a machine learning model, the machine learning model being trained by steps comprising: obtaining a plurality of training samples, each training sample of the plurality of training samples comprising a set of training matching attributes generated for a pair of data records and a matching label indicating a match or a no-match between the pair of data records; training the machine learning model using the plurality of training samples; determining predicted classifications for the plurality of training samples by inputting the sets of training matching attributes to the machine learning model; identifying a set of the training samples as misclassified training samples based on a set of the predicted classifications being different from the respective matching labels in the training samples; generating two or more auxiliary classifications for each of the misclassified training samples using two or more auxiliary models; updating the matching labels of the misclassified training samples based on the two or more auxiliary classifications; and re-training the machine learning model using the plurality of training samples with the updated matching labels.
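The train, relabel, and re-train sequence of claim 7 can be illustrated with a deliberately small sketch. Several simplifications are assumed here: each training sample carries a single similarity feature, the primary model is a toy threshold classifier rather than a real machine learning model, and a misclassified sample's label is replaced only when all auxiliary models agree (the claim says only that the update is "based on" the auxiliary classifications). All names and data are hypothetical.

```python
def fit_threshold(samples):
    """Toy primary model: cut at the midpoint of the two class means of a
    single similarity feature. Both classes must be present in samples."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= cut else 0

def correct_labels(samples, model, auxiliary_models):
    """Relabel misclassified samples when the auxiliary models are unanimous
    (ASSUMED update rule)."""
    corrected = []
    for x, y in samples:
        if model(x) != y:  # misclassified by the primary model
            votes = [aux(x) for aux in auxiliary_models]
            if len(set(votes)) == 1:
                y = votes[0]
        corrected.append((x, y))
    return corrected

def train_with_label_correction(samples, auxiliary_models):
    """Train, find misclassified samples, update labels, then re-train."""
    model = fit_threshold(samples)
    corrected = correct_labels(samples, model, auxiliary_models)
    return fit_threshold(corrected), corrected

# Hypothetical (feature, label) pairs; two labels look suspicious.
samples = [(0.9, 1), (0.8, 1), (0.85, 0), (0.2, 0), (0.1, 0), (0.15, 1)]
auxiliary_models = [
    lambda x: 1 if x >= 0.6 else 0,
    lambda x: 1 if x >= 0.4 else 0,
]
model, corrected = train_with_label_correction(samples, auxiliary_models)
```

In this data, the samples at 0.85 and 0.15 disagree with the primary model and both auxiliary models vote against their original labels, so both labels are flipped before re-training.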

8. A non-transitory computer-readable storage medium storing instructions executable by a processing device to perform operations comprising: identifying a list of candidate records from a set of data records stored in a data repository for merging; generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes comprising entities and entity edges comprising entity matching scores; identifying one or more connected components within the entity level graph; performing a graph-merge process wherein the graph-merge process is performed for each connected component of the one or more connected components and the graph-merge process comprises: determining the entity matching score for each entity edge; removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph; applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes; for each of the maximally paired entity nodes identified by the maximal graph matching algorithm: determining an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity; and in response to determining the entity clique score exceeds an entity level delta-clique constraint, merging the maximally paired entity nodes to form the merged entity; applying a graph-splitter process to the merged entity to generate a merge-split graph; repeating the graph-merge process until no merge occurs; and generating an output graph comprising the merge-split graph.

9. The non-transitory computer-readable storage medium of claim 8, wherein the operation of applying the graph-splitter process comprises: generating an identity level graph from the merged entity, where the identity level graph includes identity nodes comprising identities and identity edges comprising identity matching scores; removing each identity edge that falls below an identity level merge score threshold to generate an updated identity level graph; applying the maximal graph matching algorithm to the updated identity level graph where the maximal graph matching algorithm identifies one or more maximally paired identity nodes; for each of the maximally paired identity nodes identified by the maximal graph matching algorithm: determining an identity clique score based on effects of merging the maximally paired identity nodes together to form a merged identity; and in response to determining the identity clique score exceeds an identity level delta-clique constraint, merging the maximally paired identity nodes to generate the merge-split graph.

10. The non-transitory computer-readable storage medium of claim 9, wherein the entity level merge score threshold is greater than the identity level merge score threshold.

11. The non-transitory computer-readable storage medium of claim 9, wherein the entity level delta-clique constraint is less than the identity level delta-clique constraint.

12. The non-transitory computer-readable storage medium of claim 9, wherein the graph-splitter process further comprises, prior to merging the maximally paired identity nodes: determining a centrality score for the maximally paired identity nodes; and in response to determining the centrality score is less than a centrality threshold, preventing merging of the maximally paired identity nodes.

13. The non-transitory computer-readable storage medium of claim 8, wherein the operation of identifying the list of candidate records comprises: accessing an inquiry data set that includes relationships between prior inquiries and returned records; generating a set of optimized search keys based on the relationships between prior inquiries and returned records; grouping a subset of records in a record database; querying the subset of records by the set of optimized search keys to generate a preliminary candidate set; and filtering the preliminary candidate set to identify the list of candidate records.

14. The non-transitory computer-readable storage medium of claim 8, wherein the entity matching scores are generated by a machine learning model, the machine learning model being trained by steps comprising: obtaining a plurality of training samples, each training sample of the plurality of training samples comprising a set of training matching attributes generated for a pair of data records and a matching label indicating a match or a no-match between the pair of data records; training the machine learning model using the plurality of training samples; determining predicted classifications for the plurality of training samples by inputting the sets of training matching attributes to the machine learning model; identifying a set of the training samples as misclassified training samples based on a set of the predicted classifications being different from the respective matching labels in the training samples; generating two or more auxiliary classifications for each of the misclassified training samples using two or more auxiliary models; updating the matching labels of the misclassified training samples based on the two or more auxiliary classifications; and re-training the machine learning model using the plurality of training samples with the updated matching labels.

15. A computing system comprising: a processing device; a data repository for storing data records, wherein each data record comprises one or more identifiers; and a non-transitory computer-readable storage medium storing instructions executable by the processing device to perform operations comprising: identifying a list of candidate records from a set of data records stored in a data repository for merging; generating an entity level graph from the list of candidate records where the entity level graph includes entity nodes comprising entities and entity edges comprising entity matching scores; identifying one or more connected components within the entity level graph; performing a graph-merge process wherein the graph-merge process is performed for each connected component of the one or more connected components and the graph-merge process comprises: determining the entity matching score for each entity edge; removing each entity edge that falls below an entity level merge score threshold to generate an updated entity level graph; applying a maximal graph matching algorithm to the updated entity level graph where the maximal graph matching algorithm identifies one or more maximally paired entity nodes; for each of the maximally paired entity nodes identified by the maximal graph matching algorithm: determining an entity clique score based on effects of merging the maximally paired entity nodes together to form a merged entity; and in response to determining the entity clique score exceeds an entity level delta-clique constraint, merging the maximally paired entity nodes to form the merged entity; applying a graph-splitter process to the merged entity to generate a merge-split graph; repeating the graph-merge process until no merge occurs; and generating an output graph comprising the merge-split graph.

16. The computing system of claim 15, wherein the operation of applying the graph-splitter process comprises: generating an identity level graph from the merged entity, where the identity level graph includes identity nodes comprising identities and identity edges comprising identity matching scores; removing each identity edge that falls below an identity level merge score threshold to generate an updated identity level graph; applying the maximal graph matching algorithm to the updated identity level graph where the maximal graph matching algorithm identifies one or more maximally paired identity nodes; for each of the maximally paired identity nodes identified by the maximal graph matching algorithm: determining an identity clique score based on effects of merging the maximally paired identity nodes together to form a merged identity; and in response to determining the identity clique score exceeds an identity level delta-clique constraint, merging the maximally paired identity nodes to generate the merge-split graph.

17. The computing system of claim 16, wherein the entity level merge score threshold is greater than the identity level merge score threshold.

18. The computing system of claim 16, wherein the entity level delta-clique constraint is less than the identity level delta-clique constraint.

19. The computing system of claim 16, wherein the graph-splitter process further comprises, prior to merging the maximally paired identity nodes: determining a centrality score for the maximally paired identity nodes; and in response to determining the centrality score is less than a centrality threshold, preventing merging of the maximally paired identity nodes.

20. The computing system of claim 15, wherein identifying the list of candidate records comprises: accessing an inquiry data set that includes relationships between prior inquiries and returned records; generating a set of optimized search keys based on the relationships between prior inquiries and returned records; grouping a subset of records in a record database; querying the subset of records by the set of optimized search keys to generate a preliminary candidate set; and filtering the preliminary candidate set to identify the list of candidate records.