Information matching using subgraphs

By identifying the central node and adjacent node groups in the subgraph, and using Hausdorff distance and the set of best matching node pairs, the problem of long matching time and high resource consumption in the prior art is solved, and the accuracy of matching records is improved, especially the accuracy of identifying duplicate records in first-order and second-order matching.

CN116806337B · Active · Publication Date: 2026-06-19 · INTERNATIONAL BUSINESS MACHINE CORPORATION

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INTERNATIONAL BUSINESS MACHINE CORPORATION
Filing Date
2022-01-11
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies require significant time and resources to match large numbers of records, and the matching results are not accurate enough, especially when identifying duplicate records.

Method used

By identifying the central node and adjacent node groups in the subgraph, the total distance between the central nodes is determined using the Hausdorff distance and the set of best-matching node pairs to determine whether a match exists.

Benefits of technology

It reduces the time and resources required for matching records and improves the accuracy of matching information fragments, especially improving the accuracy of identifying duplicate records in first-order and second-order matching.



Abstract

A method for matching information is provided. A first central node in a first subgraph and a second central node in a second subgraph are identified. Adjacent node groups having adjacent nodes from both subgraphs are identified, where each adjacent node group has adjacent nodes of the same node type. The best matching node pairs of adjacent nodes in each cluster are identified. Each best matching node pair includes a first node from the first subgraph and a second node from the second subgraph. Whether the central nodes match is determined based on the total distance between them, computed using the first central node, the second central node, and the best matching node pairs.

Description

Technical Field

[0001] This disclosure generally relates to improved computer systems, and more specifically to methods, apparatus, systems, and computer program products for matching subgraphs.

Background Technology

[0002] Companies and other organizations have numerous data sources. These data sources contain records of people, organizations, suppliers, products, marketing plans, or other types of items. These records are typically maintained across multiple operational systems that process the company's daily transactions. Analytical systems move or access these records to generate reports. These reports include customer revenue, product revenue, sales trends, usage reports, or other types of reports. When reports are generated in analytical systems, duplicate records can lead to inaccurate analysis and reporting. As a result, duplicate records in the data must be identified and reconciled to meet reporting requirements.

[0003] Software matching algorithms are used to identify duplicate records within or across different datasets. These algorithms implement processes such as deterministic matching, fuzzy probabilistic matching, and other types of matching. These software matching algorithms focus on the relational and columnar data structures of records to determine whether duplicate records exist. As the number of records compared increases, the time and resource usage can increase significantly.

[0004] Therefore, it is desirable to implement a method and apparatus that takes into account at least some of the above-mentioned problems, as well as other possible problems. For example, it is desirable to implement a method and apparatus that overcomes the technical problem of the time and resource requirements for matching a large number of records.

Summary of the Invention

[0005] According to one embodiment of the present invention, a method for matching information is provided. A computer system identifies a first central node in a first subgraph and a second central node in a second subgraph. The computer system identifies adjacent node groups having adjacent nodes from both the first and second subgraphs. Each adjacent node group in the adjacent node groups has adjacent nodes of the same node type. The computer system identifies the best matching node pairs in each adjacent node group to form a set of best matching node pairs in a cluster set, wherein each best matching node pair includes a first adjacent node from the first subgraph and a second adjacent node from the second subgraph. The computer system uses the first central node, the second central node, and the set of best matching node pairs in the cluster set to determine whether the first central node and the second central node match.
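The grouping step in this embodiment can be illustrated with a minimal Python sketch. The representation of a neighbor as a `(node_id, node_type)` tuple, the function name `group_neighbors_by_type`, and the choice to drop groups that lack nodes from one of the subgraphs are assumptions made for illustration; the text only states that each group holds adjacent nodes of one node type drawn from both subgraphs.

```python
from collections import defaultdict


def group_neighbors_by_type(first_neighbors, second_neighbors):
    """Group adjacent nodes of two central nodes by node type.

    Each neighbor is a (node_id, node_type) tuple (an assumed shape).
    Only groups with nodes from BOTH subgraphs are returned (assumption).
    """
    groups = defaultdict(lambda: {"first": [], "second": []})
    for node_id, node_type in first_neighbors:
        groups[node_type]["first"].append(node_id)
    for node_id, node_type in second_neighbors:
        groups[node_type]["second"].append(node_id)
    # Keep a group only if both subgraphs contributed at least one node.
    return {t: g for t, g in groups.items() if g["first"] and g["second"]}


# Hypothetical neighbors of two person records.
first = [("a1", "address"), ("e1", "employer"), ("v1", "vehicle")]
second = [("a2", "address"), ("a3", "address"), ("e2", "employer")]
groups = group_neighbors_by_type(first, second)
```

In this hypothetical input the "vehicle" group is discarded because only the first subgraph has a vehicle neighbor, so no cross-subgraph pair could be formed from it.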

[0006] According to another embodiment of the present invention, a method for matching information is provided. A computer system assigns adjacent nodes of two central nodes in two subgraphs into groups based on node type, wherein each group includes adjacent nodes from both subgraphs. The computer system uses Hausdorff distance to select the best-matching node pairs for each group of adjacent nodes, forming a set of best-matching node pairs for each group of adjacent nodes, wherein the best-matching node pairs in the set have adjacent nodes from each of the two subgraphs. The computer system uses the two central nodes and the set of best-matching node pairs to determine the total distance between the two central nodes. The total distance between the two central nodes takes into account the set of best-matching node pairs for each of the two central nodes. The computer system determines whether a match exists between the two central nodes based on the total distance between them.
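As a rough sketch of how this embodiment could be computed, the Python below treats every node as a numeric feature vector and uses Euclidean distance. Those choices, the closest-pair selection rule in `best_matching_pair`, the weighted combination in `total_distance`, and the `neighbor_weight` and `threshold` parameters are all assumptions for illustration; the text names Hausdorff distance and a total distance but does not fix these details here.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


def best_matching_pair(A, B):
    """Closest cross-subgraph pair in one group, as (distance, a, b).

    Picking the minimum-distance pair is an assumed selection rule.
    """
    return min(((math.dist(a, b), a, b) for a in A for b in B),
               key=lambda t: t[0])


def total_distance(center_a, center_b, groups, neighbor_weight=0.5):
    """Blend the central-node distance with the mean best-pair distance.

    The weighted average is an assumed way to combine the two terms.
    """
    center_d = math.dist(center_a, center_b)
    pair_ds = [best_matching_pair(A, B)[0] for A, B in groups]
    neighbor_d = sum(pair_ds) / len(pair_ds) if pair_ds else 0.0
    return (1 - neighbor_weight) * center_d + neighbor_weight * neighbor_d


def centers_match(center_a, center_b, groups, threshold=0.5):
    """Declare a match when the total distance is within the threshold."""
    return total_distance(center_a, center_b, groups) <= threshold


# One group of same-type neighbors: nodes from subgraph 1 and subgraph 2.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(A, B), best_matching_pair(A, B)[0])
print(centers_match((0.0, 0.0), (0.2, 0.0), [(A, B)]))
```

The Hausdorff distance summarizes how far apart two whole neighbor sets are, while the best matching pair isolates the single closest correspondence in a group; a real implementation might use either or both when scoring a group.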

[0007] According to another embodiment of the present invention, an information management system includes a computer system that executes program instructions to identify a first central node in a first subgraph and a second central node in a second subgraph. The computer system executes program instructions to identify adjacent node groups having adjacent nodes from both the first and second subgraphs. Each adjacent node group in the adjacent node groups has adjacent nodes of the same node type. The computer system executes program instructions to identify best-matching node pairs in each adjacent node group to form a set of best-matching node pairs. Each best-matching node pair includes a first adjacent node from the first subgraph and a second adjacent node from the second subgraph. The computer system executes program instructions to use the first central node, the second central node, and the set of best-matching node pairs to determine whether the first central node and the second central node match.

[0008] According to another embodiment of the present invention, an information management system includes a computer system that executes program instructions to group adjacent nodes of two central nodes in two subgraphs according to node type. Each group includes adjacent nodes from both subgraphs. The computer system executes program instructions to select the best-matching node pairs for each group of adjacent nodes using Hausdorff distance, forming a set of best-matching node pairs for a set of clusters. The best-matching node pairs in the set of best-matching node pairs have adjacent nodes from each of the two subgraphs. The computer system executes program instructions to determine the total distance between the two central nodes using the two central nodes and the set of best-matching node pairs of adjacent nodes. The total distance between the two central nodes takes into account the set of best-matching node pairs for each of the two central nodes. The computer system executes program instructions to determine whether a match exists between the two central nodes based on the total distance between them.

[0009] According to another embodiment of the present invention, a computer program product for matching information includes a computer-readable storage medium having program instructions embodied therein. The program instructions are executable by a computer system to cause the computer system to perform a method comprising: identifying a first central node in a first subgraph and a second central node in a second subgraph; identifying adjacent node groups having adjacent nodes from both the first and second subgraphs, wherein each adjacent node group in the adjacent node groups has adjacent nodes of the same node type; identifying best matching node pairs of adjacent nodes in each adjacent node group to form a set of best matching node pairs in a cluster set, wherein the adjacent nodes in each best matching node pair include a first adjacent node from the first subgraph and a second adjacent node from the second subgraph; and determining whether the first central node and the second central node match using the first central node, the second central node, and the set of best matching node pairs in the cluster set.

[0010] Therefore, compared to current techniques that do not compare subgraphs, the different illustrative embodiments can reduce at least one of the time or resources used to determine whether information fragments match. Furthermore, the different illustrative embodiments can also improve the accuracy of matching information fragments in at least one of first-order matching or first- and second-order matching.

Attached Figure Description

[0011] Figure 1 is a diagram of a cloud computing environment in which illustrative embodiments can be implemented;

[0012] Figure 2 is a set of functional abstraction layers provided by the cloud computing environment 50 of Figure 1, according to an illustrative embodiment;

[0013] Figure 3 is a graphical representation of a network of a data processing system in which illustrative embodiments can be implemented;

[0014] Figure 4 is a block diagram of an information environment according to an illustrative embodiment;

[0015] Figure 5 is an illustration of two subgraphs having adjacent nodes assigned to a group, according to an illustrative embodiment;

[0016] Figure 6 is a diagram of adjacent node groups according to an illustrative embodiment;

[0017] Figure 7 is an illustration of a cluster created from adjacent entity groups according to an illustrative embodiment;

[0018] Figure 8 is an illustration of adjacent information segments according to an illustrative embodiment;

[0019] Figure 9 is a flowchart of a process for managing information according to an illustrative embodiment;

[0020] Figure 10 is a flowchart of a process for matching a central node according to an illustrative embodiment;

[0021] Figure 11 is a flowchart of a process for identifying adjacent node groups according to an illustrative embodiment;

[0022] Figure 12 is a flowchart of a process for creating a cluster set according to an illustrative embodiment;

[0023] Figure 13 is a flowchart of a process for identifying the best matching pair of adjacent nodes according to an illustrative embodiment;

[0024] Figure 14 is a flowchart of a process for determining whether a first central node matches a second central node, according to an illustrative embodiment;

[0025] Figure 15 is a flowchart of a process for determining whether a first central node and a second central node match, according to an illustrative embodiment;

[0026] Figure 16 is a flowchart of a process for matching subgraphs according to an illustrative embodiment;

[0027] Figure 17 is a flowchart of a process for assigning adjacent nodes to a group according to an illustrative embodiment;

[0028] Figure 18 is a flowchart of a process for selecting the best matching node pair of adjacent nodes for each cluster, according to an illustrative embodiment;

[0029] Figure 19 is a flowchart of a process for generating feature vectors according to an illustrative embodiment;

[0030] Figure 20 is a flowchart of a process for matching a central node according to an illustrative embodiment; and

[0031] Figure 21 is a block diagram of a data processing system according to an illustrative embodiment.

Detailed Implementation

[0032] This invention can be a system, method, and/or computer program product at any possible technical detail level of integration. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to perform aspects of the invention.

[0033] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, but not limited to, electronic storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of computer-readable storage media includes the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or recessed structures with instructions recorded thereon, and any suitable combination of the foregoing. As used herein, computer-readable storage media should not be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0034] The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to a suitable computing/processing device, or via a network, such as the Internet, a local area network (LAN), a wide area network (WAN), and/or a wireless network, to an external computer or external storage device. The network may include copper cables, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them to a computer-readable storage medium within the respective computing/processing device.

[0035] Computer-readable program instructions used to perform the operations of this invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages (e.g., Smalltalk, C++, etc.) and procedural programming languages (e.g., the "C" programming language or similar programming languages). The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, to perform aspects of this invention, electronic circuits, including, for example, programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), may execute computer-readable program instructions by using the state information of the computer-readable program instructions to personalize the electronic circuits.

[0036] Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0037] These computer-readable program instructions may be provided to a processor of a computer or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of a flowchart and/or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to operate in a particular manner, such that the computer-readable storage medium in which the instructions are stored includes an article of manufacture comprising instructions for implementing aspects of the functions/actions specified in one or more blocks of a flowchart and/or block diagram.

[0038] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions, which execute on the computer, other programmable apparatus, or other device, perform the functions/actions specified in one or more blocks of a flowchart and/or block diagram.

[0039] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions comprising one or more executable instructions for implementing a specified logical function. In some alternative embodiments, the functions indicated in the blocks may occur in a different order than indicated in the figures. For example, two blocks shown consecutively may actually be implemented as a single step, executed simultaneously, substantially simultaneously, or with partial or complete time overlap, or the blocks may sometimes be executed in reverse order, depending on the functions involved. It will also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions or that carries out combinations of dedicated hardware and computer instructions.

[0040] The illustrative embodiments recognize and take into account a number of different considerations. For example, the illustrative embodiments recognize and take into account that current matching algorithms do not consider the network of relationships between records and data represented as a graph. For example, the illustrative embodiments recognize and take into account that when two records of a person are compared, if the two records have the same relationships with adjacent nodes in the graph, then the records may be for the same person. The illustrative embodiments recognize and take into account that comparing subgraphs can provide a stronger indication that records are duplicates than determining the similarity of names in the records themselves. Therefore, the illustrative embodiments recognize and take into account that subgraph comparisons can improve the results of the matching process.

[0041] Therefore, illustrative embodiments provide methods, apparatus, systems, and computer program products for matching information. In one illustrative example, a first central node in a first subgraph and a second central node in a second subgraph are identified. A computer system identifies adjacent node groups having adjacent nodes from both the first and second subgraphs. Each adjacent node group in the adjacent node groups has adjacent nodes of the same node type. The computer system creates a set of clusters from each adjacent node group, such that each cluster in the set of clusters has adjacent nodes from both the first and second subgraphs. The computer system identifies the best matching node pairs of adjacent nodes in each cluster in the set of clusters to form a set of best matching node pairs in the set of clusters, wherein the adjacent nodes in each best matching node pair include a first node from the first subgraph and a second node from the second subgraph. Using the first central node, the second central node, and the set of best matching node pairs in the set of clusters, the computer system determines whether the first and second central nodes match based on the total distance between the first and second central nodes.
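The cluster-creation step can be sketched as follows. This passage does not say which clustering method is used, so the Python below swaps in a simple single-linkage grouping: two nodes fall in the same cluster when a chain of nodes links them with each hop no longer than `eps`. The feature-vector node representation, the `"first"` and `"second"` subgraph labels, the `eps` parameter, and the function name `cluster_group` are all hypothetical.

```python
import math
from itertools import combinations


def cluster_group(nodes, eps):
    """Cluster one adjacent node group and keep mixed clusters only.

    nodes: list of (subgraph_label, feature_vector) tuples (assumed shape).
    Single-linkage clustering via union-find is an assumed stand-in for
    whatever clustering the method actually uses.
    """
    parent = list(range(len(nodes)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of nodes that lies within eps of each other.
    for i, j in combinations(range(len(nodes)), 2):
        if math.dist(nodes[i][1], nodes[j][1]) <= eps:
            parent[find(i)] = find(j)

    clusters = {}
    for i, node in enumerate(nodes):
        clusters.setdefault(find(i), []).append(node)

    # Keep only clusters holding nodes from both subgraphs.
    return [c for c in clusters.values()
            if {label for label, _ in c} >= {"first", "second"}]


group = [
    ("first", (0.0, 0.0)), ("second", (0.2, 0.0)),
    ("first", (5.0, 5.0)), ("second", (5.1, 5.0)),
    ("first", (9.0, 9.0)),  # isolated node, no counterpart nearby
]
clusters = cluster_group(group, eps=0.5)
```

With these hypothetical points, two clusters survive and the isolated node is dropped, since a cluster with nodes from only one subgraph cannot yield a cross-subgraph pair.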

[0042] As used herein, a “set,” when used with reference to items, means one or more items. For example, a “cluster set” is one or more clusters. Similarly, a “group,” when used with reference to items, also means one or more items. For example, a “neighboring node group” is one or more neighboring nodes.

[0043] Referring now to Figure 1, a diagram of a cloud computing environment 50 is illustrated. As shown, the cloud computing environment 50 includes one or more cloud computing nodes 10 with which local computing devices used by cloud consumers can communicate, such as personal digital assistants (PDAs) or cellular phones 54A, desktop computers 54B, laptop computers 54C, and/or automotive computer systems 54N. The cloud computing nodes 10 can communicate with each other. They can be physically or virtually grouped (not shown) in one or more networks, such as private clouds, community clouds, public clouds, or hybrid clouds, or combinations thereof. This allows the cloud computing environment 50 to provide infrastructure, platform, and/or software as a service, without requiring cloud consumers to maintain resources on their local computing devices. It should be understood that the types of computing devices 54A-N shown in Figure 1 are for illustrative purposes only, and the cloud computing nodes 10 in the cloud computing environment 50 can communicate with any type of computing device via any type of network and/or network-addressable connection (e.g., using a web browser).

[0044] Referring now to Figure 2, a set of functional abstraction layers provided by the cloud computing environment 50 of Figure 1 is shown. It should be understood in advance that the components, layers, and functions shown in Figure 2 are for illustrative purposes only, and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided.

[0045] The hardware and software layer 60 includes hardware and software components. Examples of hardware components include: a host 61; a server 62 based on a RISC (Reduced Instruction Set Computer) architecture; a server 63; a blade server 64; a storage device 65; and a network and network components 66. In some embodiments, software components include network application server software 67 and database software 68.

[0046] The virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities can be provided: virtual server 71; virtual storage 72; virtual network 73, including virtual private network; virtual application and operating system 74; and virtual client 75.

[0047] In one example, management layer 80 can provide the following functionalities: Resource Provisioning 81 provides dynamic procurement of computing resources and other resources used to perform tasks within the cloud computing environment. Metering and Pricing 82 provides cost tracking when utilizing resources in the cloud computing environment, as well as billing or invoicing for consuming these resources. In one example, these resources may include application software licenses. Security provides authentication for cloud consumers and tasks, and protection for data and other resources. User Portal 83 provides access to the cloud computing environment for consumers and system administrators. Service Level Management 84 provides cloud resource allocation and management to ensure that required service levels are met. Service Level Agreement (SLA) Planning and Fulfillment 85 provides pre-scheduling and procurement of cloud resources, where future needs are anticipated according to the SLA.

[0048] Workload layer 90 provides examples of functionalities for which a cloud computing environment can be used. Examples of workloads and functionalities that can be provided from this layer include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics and processing 94; transaction processing 95; and data management 96. Data management 96 provides a service for managing data in the cloud computing environment 50 of Figure 1 or in a network in a physical location that accesses the cloud computing environment 50 of Figure 1.

[0049] For example, data management 96 can be implemented as a master data management service or within a data management service, in which at least one of consistency, accuracy, semantic consistency, or accountability can be added to the management of information. This management of information by data management 96 can be useful when more than one copy of information exists. Data management 96 can maintain a single version of the truth across all copies of information. In an illustrative example, data management 96 can be used to manage information such as records located in multiple operational systems. In an illustrative example, data management 96 can identify duplicate records. Data management 96 can also reconcile identified duplicate records. In an illustrative example, data management 96 can employ a matching process when processing information such as records to identify duplicate pieces of information.

[0050] Referring now to Figure 3, a graphical representation of a network of data processing systems in which illustrative embodiments may be implemented is depicted. The network data processing system 300 is a computer network in which illustrative embodiments may be implemented. The network data processing system 300 includes a network 302, which is a medium for providing communication links between various devices and computers connected together within the network data processing system 300. The network 302 may include connections such as wired communication links, wireless communication links, or fiber optic cables.

[0051] In the illustrated example, server computers 304 and 306 are connected to network 302 along with storage unit 308. Additionally, client devices 310 are connected to network 302. As shown, client devices 310 include client computers 312, 314, and 316. Client devices 310 can be, for example, computers, workstations, or network computers. In the illustrated example, server computer 304 provides client devices 310 with information such as boot files, operating system images, and applications. Furthermore, client devices 310 may also include other types of client devices, such as mobile phone 318, tablet computer 320, and smart glasses 322. In this illustrative example, server computer 304, server computer 306, storage unit 308, and client devices 310 are network devices connected to network 302, where network 302 is the communication medium for these network devices. Some or all of client devices 310 can form an Internet of Things (IoT), in which these physical devices can connect to network 302 and exchange information with each other through network 302.

[0052] In this example, client devices 310 are clients of server computer 304. Network data processing system 300 may include additional server computers, client computers, and other devices (not shown). Client devices 310 are connected to network 302 using at least one of wired, fiber optic, or wireless connections.

[0053] The program code located in the network data processing system 300 can be stored on a computer-recordable storage medium and downloaded to a data processing system or other device for use. For example, the program code can be stored on a computer-recordable storage medium on server computer 304 and downloaded to client devices 310 via network 302 for use on client devices 310.

[0054] In the illustrated example, network data processing system 300 is the Internet, where network 302 represents a worldwide collection of networks and gateways that communicate with each other using the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational, and other computer systems that route data and messages. Of course, network data processing system 300 can also be implemented using multiple different types of networks. For example, network 302 can include at least one of the following: the Internet, an intranet, a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN). Figure 3 is intended as an example, not as an architectural limitation on the different illustrative embodiments.

[0055] As used herein, "multiple," when used with reference to items, means one or more items. For example, "multiple different types of networks" means one or more different types of networks.

[0056] Furthermore, when the phrase "at least one" is used with a list of items, it means that different combinations of one or more of the listed items can be used, and only one of each item in the list may be needed. In other words, "at least one" means that any combination of items and any number of items from the list can be used, but not all items in the list are required. An item can be a particular object, thing, or category.

[0057] For example, but not limited to, "at least one of item A, item B, or item C" can include item A; item A and item B; or item B. This example can also include item A, item B, and item C, or item B and item C. Of course, any combination of these items is possible. In some illustrative examples, "at least one" can be, for example, but not limited to, two of item A, one of item B, and ten of item C; four of item B and seven of item C; or other suitable combinations.

[0058] In this illustrative example, information manager 330 is located in server computer 304. Information manager 330 can manage copies of information in the form of records 332 located in storage 334. For example, information manager 330 can identify duplicate records 336 in records 332. In the depicted example, records 332 can be for objects selected from at least one of individuals, companies, organizations, suppliers, agents, households, products, services, or other suitable types of objects.

[0059] When a match is identified in record 332, reconciliation can be performed. This reconciliation may include removing duplicate copies of the record, merging records, or other appropriate actions. In this illustrative example, duplicate record 336 may be an exact match or a sufficient match to represent the same object. In other words, in some examples, for two records to match and be designated as duplicate record 336, a 100% match between the two records may not be necessary.

[0060] For example, two records for a person can be considered duplicates even if the names are not spelled exactly the same. For instance, one record could be for "John Smith" and another for "Jon Smith." Other information in the records might be close enough that the records are considered a match even if the names are not an exact match. As another example, "144 River Lane" and "144 River Ln." can be considered a match of addresses in the records.
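As a minimal sketch of how near-matches such as these might be scored, the following Python uses the standard library's difflib.SequenceMatcher. The function names and the 0.85 threshold are assumptions made for illustration only; the description does not specify a particular approximate matching technique or threshold value.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_close_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # The threshold is a hypothetical tuning value, not one taken from the text.
    return similarity(a, b) >= threshold


print(is_close_match("John Smith", "Jon Smith"))
print(is_close_match("144 River Lane", "144 River Ln."))
```

In practice, a field-specific comparison (for example, expanding "Ln." to "Lane" before comparing addresses) would likely be more robust than a single generic ratio.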

[0061] In this illustrative example, the information manager 330 may use subgraphs to perform comparisons of records 332. For example, the information manager 330 may identify two central nodes 338 in two subgraphs 340, each of the two central nodes 338 being in one of the two subgraphs 340. As depicted, the two subgraphs 340 also include adjacent nodes 342. Each of the two subgraphs 340 may include a portion of adjacent nodes 342.

[0062] In this illustrative example, each adjacent node in adjacent nodes 342 may represent a record in records 332. For example, the two central nodes 338 may each represent a record about a person. Adjacent nodes 342 may be records or other data structures representing objects connected or linked to the two central nodes 338. Objects may be selected from at least one of the following: friends, employers, residences, contracts, vehicles, neighbors, relatives, business partners, buildings, workplaces, or some other suitable object having connections to one or more of the two central nodes 338.

[0063] In this illustrative example, two subgraphs 340 are compared to determine if a match exists between records 332 of the two central nodes 338. In this illustrative example, the information manager 330 may use any currently available matching technique to identify the two central nodes 338. Information from the two central nodes 338 may be compared to generate a feature result 344. A feature is a characteristic of the comparison of information from the central nodes.

[0064] For example, information can be extracted from various fields in a record. This information could include name, last name, first name, business address, vehicle information, phone number, postal code, area code, or other information found in the record.

[0065] Features can be characteristics in information comparison. For example, features can be exact matches, partial matches, missing information, mismatches, or other types of features. These feature results 344 can be represented as numbers or scores in a vector. These feature results 344 can also be used to identify candidate records for analysis by the information manager 330. Feature results 344 can also be features based on the distance between two nodes (e.g., two central nodes 338).

[0066] In this example, feature result 344 can be used to determine which records in record 332 can be further processed by information manager 330. In other words, feature result 344 can be used to reduce the number of records compared when identifying duplicate records 336.

[0067] By identifying two central nodes 338 in two subgraphs 340, the information manager 330 can determine the similarity 348 of the two subgraphs 340 when determining whether the record 332 represented by the two central nodes 338 is a duplicate record 336. In this illustrative example, the similarity 348 can be based on the distance between the two subgraphs 340 as described below. As a result, a score 350 can be generated using either the similarity 348 or both the similarity 348 and the feature result 344 to determine whether the two central nodes 338 represent a duplicate record 336.

[0068] In this illustrative example, the information manager 330 can make this determination by comparing the score 350 with multiple thresholds 352. These thresholds can be upper limit thresholds, or a range can be defined for comparing the score 350 to determine whether two center nodes 338 represent duplicate records 336.

[0069] Therefore, the information manager 330 can increase the accuracy of identifying duplicate records 336. Furthermore, this accuracy can be improved in first-order matching for entities such as individuals, organizations, institutions, or other individual entities. Additionally, accuracy can be improved in second-order matching for entities such as families. Determining the similarity 348 of two central nodes 338 in two subgraphs 340 when analyzing relational information in the two subgraphs 340 can have improved accuracy for second-order matching.

[0070] As depicted, the information manager 330 can use the two central nodes 338 and the adjacent nodes 342 from the two subgraphs 340 of the two central nodes 338 as input to determine the similarity 348 of the two central nodes 338. As shown, the information manager 330 assigns adjacent nodes 342 to groups 354. Each group in groups 354 represents a different node type and has adjacent nodes 342 from both subgraphs 340. Clustering can be performed to determine clusters 356 within groups 354. In other words, each cluster in clusters 356 is a cluster of adjacent nodes 342 of the same type.

[0071] This clustering can be performed using any suitable clustering procedure. For example, density-based clustering can be performed on adjacent nodes 342 from groups of two subgraphs 340.

[0072] As depicted, each cluster in cluster 356 includes adjacent nodes 342 from both subgraphs 340. In other words, each cluster includes at least one adjacent node from each of the two subgraphs 340.

[0073] Information manager 330 can identify the best matching node pair for each cluster in clusters 356 to form best matching node pairs 358. This determination can be made by determining the Hausdorff distance, where the adjacency distance between two adjacent nodes from each subgraph in the cluster is calculated. This adjacency distance can be based on comparing adjacent nodes, the links of the compared adjacent nodes, and the indices of the compared adjacent nodes. Different distances can be used to determine the total distance 360, which can indicate the similarity 348 between the two central nodes 338. The total distance 360 is the distance between the two central nodes 338 taking into account adjacent nodes 342. In other words, the distance between the two central nodes 338 can change when adjacent nodes 342 are considered. In this example, the adjacent nodes 342 considered are those in the best matching node pairs 358 for the two central nodes 338. The total distance 360 between the two central nodes 338 can be used to determine whether the records 332 of the two central nodes 338 are similar enough to be considered duplicate records 336.

[0074] With reference now to Figure 4, a block diagram of an information environment is depicted according to an illustrative embodiment. In this illustrative example, information environment 400 includes components that can be implemented in hardware, such as the hardware shown in network data processing system 300 in Figure 3.

[0075] As depicted, information environment 400 is an environment in which information 402 can be managed. In this illustrative example, the management of information 402 may include coordinating information 402 located in one or more datasets 404. These datasets may reside in one or more repositories. These repositories may include at least one of, for example, a data warehouse, a data lake, a data mart, a database, or some other suitable data storage entity.

[0076] Information 402 can take various forms. For example, information 402 can take the form of record 406. Records in record 406 are data structures used to organize information 402. For example, a record can be a collection of fields of different data types. Record 406 can be stored in a database, table, or other suitable structure.

[0077] Information management system 408 in information environment 400 is operable to manage information 402. This management of information 402 may include storing, adding, removing, modifying, or performing other operations on information 402. For example, information management system 408 may find duplicate information in one or more datasets 404. These duplicates can then be coordinated, where actions such as deduplication, merging duplicate information, or other actions can be performed.

[0078] In this illustrative example, the information management system 408 includes several different components. As shown in the figure, the information management system 408 includes a computer system 410 and an information manager 412.

[0079] The information manager 412 can be implemented using software, hardware, firmware, or a combination thereof. When using software, the operations performed by the information manager 412 can be implemented in program code configured to run on hardware (such as a processor unit). When using firmware, the operations performed by the information manager 412 can be implemented using program code and data, and stored in permanent memory to run on the processor unit. When using hardware, the hardware may include circuitry that operates to perform the operations in the information manager 412.

[0080] In the illustrative example, the hardware may take the form of at least one selected from circuit systems, integrated circuits, application-specific integrated circuits (ASICs), programmable logic devices, or some other suitable type of hardware configured to perform multiple operations. Using a programmable logic device, the device can be configured to perform multiple operations. The device can be reconfigured at a later time or can be permanently configured to perform multiple operations. Programmable logic devices include, for example, programmable logic arrays, programmable array logic, field-programmable logic arrays, field-programmable gate arrays, and other suitable hardware devices. Furthermore, the method can be implemented in organic components integrated with inorganic components and can be entirely composed of organic components other than humans. For example, the method can be implemented as a circuit in an organic semiconductor.

[0081] Computer system 410 is a physical hardware system and includes one or more data processing systems. When there is more than one data processing system in computer system 410, these data processing systems communicate with each other using a communication medium. The communication medium may be a network. The data processing system may be selected from at least one of a computer, a server computer, a tablet computer, or some other suitable data processing system.

[0082] In this illustrative example, the information manager 412 in computer system 410 identifies a first center node 414 in a first subgraph 416 and a second center node 418 in a second subgraph 420. This identification can be performed in a variety of different ways. For example, currently available comparison algorithms for comparing multiple pieces of information, such as record 406, with each other can be used to identify the first center node 414 and the second center node 418 from information 402. These comparison algorithms include, for example, approximate string matching, record linking, or other processes. In one illustrative example, each of these center nodes may have a record in record 406. The information manager 412 can use this initial matching process to identify candidate center nodes for analysis.

[0083] Additionally, in this example, the information manager 412 identifies a first subgraph 416 and a second subgraph 420. Adjacent nodes 422 in these two subgraphs are linked to one of the first central node 414 and the second central node 418.

[0084] As depicted, the information manager 412 identifies a group 424 of adjacent nodes 422, which has adjacent nodes 422 from the first subgraph 416 and the second subgraph 420 that have the same node type 428 in node type 430. Node type 430 can be structured metadata and includes metadata for different fields of information fragments within the node. This metadata can include field names, data types, granularity, and other information. For example, the node type can be metadata for individuals, organizations, agents, sellers, families, houses, vehicles, contracts, insurance, warranties, services, or other suitable types.

[0085] In this illustrative example, a node is a collection of information of node type 430. A node can be, for example, a record or some other suitable fragment of information 402.

[0086] In creating groups 424, information manager 412 can place neighboring nodes 422 from each subgraph into initial groups 432 based on the node type 430 of neighboring nodes 422. Information manager 412 can then select each initial group in initial groups 432 that has neighboring nodes 422 from both the first subgraph 416 and the second subgraph 420 to form groups 424.

[0087] In this illustrative example, information manager 412 creates a cluster set 434 based on each group of neighboring nodes 422, such that each cluster in cluster set 434 has neighboring nodes 422 from both the first subgraph 416 and the second subgraph 420. When creating cluster set 434, information manager 412 may create candidate clusters 436 within each group of neighboring nodes 422 in the group 424 of neighboring nodes 422. Information manager 412 may select from candidate clusters 436 each cluster having neighboring nodes 422 from both the first subgraph 416 and the second subgraph 420 of neighboring nodes 422 to form cluster set 434.

[0088] In the illustrative example, the information manager 412 identifies the best-matching node pairs 438 of the neighboring nodes 422 of each cluster in the cluster set 434 to form a set 440 of best-matching node pairs in the cluster set 434. The two neighboring nodes in the best-matching node pair 438 include a first neighboring node 442 from the neighboring nodes 422 of the first subgraph 416 and a second neighboring node 444 from the neighboring nodes 422 of the second subgraph 420.

[0089] When identifying the best matching node pair 438, the information manager 412 can determine the adjacency distance 450 of the neighboring nodes 422 being compared in the cluster. This comparison can be based on the neighboring nodes 422 being compared, the links between the neighboring nodes 422, and the depth of the neighboring nodes 422. The information manager 412 can identify the best matching node pair 438 for each cluster in the cluster set 434 as two nodes in the cluster with the shortest adjacency distance 452, forming the best matching node pair set 440 in the cluster set 434.

[0090] As depicted in this example, the information manager 412 uses the first central node 414, the second central node 418, and the best matching node pair set 440 in the cluster set 434 to determine whether the first central node 414 and the second central node 418 match based on the total distance 446 between the first central node 414 and the second central node 418.

[0091] Furthermore, the information manager 412 can use the feature results 448 to identify candidate center nodes for analysis. If two center nodes are close enough to each other, additional steps can be performed to determine the total distance 446.

[0092] In this illustrative example, feature result 448 may include features relating to information about comparing the first central node 414 and the second central node 418. Feature result 448 may also include features based on the distance between the first central node 414 and the second central node 418. Feature result 448 may also be the sum of features obtained by comparing the information between the first central node 414 and the second central node 418. In other words, features are characteristics of interest that may exist in the information being compared.

[0093] For example, the presence of a feature can be determined by comparing information between two central nodes, such as first name, last name, contract name, vehicle manufacturer, vehicle model, or other types of information. Features can be, for example, exact matches, partial matches, similar names, left-out names, non-matching names, the number of exact words, the number of similar words, the number of left-out words, the number of non-matching words, and other types of features of interest. These types of features are comparison features. Feature result 448 can include at least one of individual scores for different features or a total score based on all features. These scores can be organized in the form of a feature vector, where each element in the feature vector represents the presence of a particular feature. In one example, the currently available comparison algorithms used to identify the first central node 414 and the second central node 418 can also be used to determine feature result 448.

[0094] If the two central nodes match, the information manager 412 can perform a set of actions 454 on the information 402 fragments of the first central node 414 and the second central node 418. The set of actions 454 includes, for example, deduplication, combining information 402, correcting information 402, or other suitable actions.

[0095] In one illustrative example, there are one or more technical solutions that overcome the technical problems related to the amount of time and resources required to match a large number of records. As a result, one or more technical solutions can provide the technical effect of reducing at least one of the time or resource requirements for processing information 402 to determine whether duplicate information 402 fragments exist. In one illustrative example, one or more technical solutions are proposed that enable the comparison of subgraphs to provide a stronger indication of whether information fragments (such as records represented as the central node of a subgraph) are duplicates, compared to determining the similarity of the records themselves. In one illustrative example, one or more technical solutions are proposed in which subgraph comparisons are performed to improve the accuracy of the results of matching records.

[0096] Computer system 410 can be configured to use software, hardware, firmware, or a combination thereof to perform at least one of the steps, operations, or actions described in the various illustrative examples. As a result, computer system 410 operates as a dedicated computer system, wherein information manager 412 in computer system 410 enables the determination of whether information fragments 402 match using at least one of less time or fewer resources compared to current technologies. Specifically, information manager 412 transforms computer system 410 into a dedicated computer system compared to currently available general-purpose computer systems that do not have information manager 412.

[0097] In the illustrative example, the use of information manager 412 in computer system 410 integrates processes into a practical application for managing information 402 that improves the performance of computer system 410. In other words, information manager 412 in computer system 410 is directed to a practical application of processes, integrated into computer system 410, that use subgraph analysis to determine whether pieces of information match. In this illustrative example, information manager 412 can identify two central nodes and the subgraphs containing these two central nodes and their adjacent nodes. Information manager 412 identifies groups of adjacent nodes of the two central nodes from the two subgraphs based on the node type of the adjacent nodes. In other words, each group for a specific node type contains at least one adjacent node from each of the subgraphs. Information manager 412 identifies one or more clusters of adjacent nodes in each group. In this illustrative example, each of these clusters includes at least one adjacent node from each of the two subgraphs. Information manager 412 identifies the best matching pair of adjacent nodes for each cluster. This identification can be performed by determining the distance between node pairs and selecting the node pair with the shortest distance as the best matching pair within the cluster. Information manager 412 can determine the total distance between the two central nodes using the two central nodes and the best-matching node pairs identified for the clusters. Information manager 412 can determine whether a match exists between the two central nodes based on the total distance 446 between them. The total distance 446 is the distance between the first central node 414 and the second central node 418 that takes into account neighboring nodes 422, such as the set 440 of best-matching node pairs 438 between the first central node 414 and the second central node 418.

[0098] In this way, it is determined whether two information fragments (such as two records corresponding to two central nodes) match. In this way, the information manager 412 in computer system 410 provides a practical application for matching information, thereby improving the functionality of computer system 410. For example, by matching subgraphs, the information manager 412 in computer system 410 can provide increased accuracy in determining whether a match exists between two information fragments. In an illustrative example, information manager 412 can use the total distance 446 between the two central nodes to determine whether a match exists.

[0099] The illustration of information environment 400 in Figure 4 does not imply any physical or architectural limitations on the manner in which the illustrative embodiments can be implemented. Other components in addition to, or in place of, those shown may be used. Some components may be unnecessary. Furthermore, the blocks are presented to illustrate some functional components. When implemented in an illustrative embodiment, one or more of these blocks may be combined, divided, or combined and divided into different blocks. For example, although datasets 404 are shown as residing outside computer system 410, one or more of datasets 404 may reside within computer system 410. Furthermore, when computer system 410 includes multiple data processing systems, information manager 412 may be distributed and include components residing in multiple data processing systems. In another example, first subgraph 416 may not include any adjacent nodes 422, while second subgraph 420 includes all adjacent nodes 422.

[0100] Figures 5-7 are illustrations of subgraphs that can be processed by information manager 412 in Figure 4. With reference to Figure 5, an illustration of two subgraphs with adjacent nodes grouped together is depicted according to an illustrative embodiment. In this illustrative example, the first subgraph 500 includes a first center node CN1 502 and adjacent nodes 504, 506, 508, 510, 512, 514, 516, and 518. The second subgraph 520 includes a second center node CN2 522 and adjacent nodes 524, 526, 528, 530, 532, 534, 536, and 538. As depicted, each adjacent node has a node type. These two subgraphs are example implementations of the first subgraph 416 and the second subgraph 420 in Figure 4.

[0101] Turning now to Figure 6, an illustration of groups of adjacent nodes is depicted according to an illustrative embodiment. In the illustrative examples, the same reference numeral may be used in more than one figure. This reuse of a reference numeral in different figures represents the same element in the different figures.

[0102] As depicted in the diagram, adjacent entities in the first subgraph 500 and the second subgraph 520 are assigned or placed into multiple groups based on node type. In other words, all adjacent nodes in a group are of the same node type.

[0103] As shown in the figure, group 600 includes neighboring nodes 512, 514, and 516 from the first subgraph 500 and neighboring node 534 from the second subgraph 520. Group 602 includes neighboring nodes 504 and 506 from the first subgraph 500 and neighboring nodes 524, 526, and 528 from the second subgraph 520. Group 604 includes neighboring nodes 508 and 510 from the first subgraph 500 and neighboring nodes 530 and 532 from the second subgraph 520.

[0104] In this illustrative example, group 606 includes neighboring nodes 536 and 538 from the second subgraph 520. Group 606 does not include any neighboring nodes from the first subgraph 500. Group 608 includes neighboring node 518 from the first subgraph 500. This group does not include any neighboring nodes from the second subgraph 520.

[0105] Groups whose neighboring nodes come from both subgraphs are selected for further processing. In this example, the selected groups are group 600, group 602, and group 604. Groups 606 and 608 are not selected because neither includes neighboring nodes from both subgraphs. Therefore, these groups are not used for comparing distances or features between the two subgraphs.
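The grouping and selection steps described for Figure 6 can be sketched in Python as follows. The node numbers and group memberships are taken from the description above; the type labels "A" through "E" and the function names are placeholders invented for this sketch, since the description does not name the node types of groups 600 through 608.

```python
from collections import defaultdict

# (node id, subgraph number, node type). Type labels are hypothetical:
# "A" = group 600, "B" = group 602, "C" = group 604, "D" = group 606, "E" = group 608.
NEIGHBORS = [
    (512, 1, "A"), (514, 1, "A"), (516, 1, "A"), (534, 2, "A"),
    (504, 1, "B"), (506, 1, "B"), (524, 2, "B"), (526, 2, "B"), (528, 2, "B"),
    (508, 1, "C"), (510, 1, "C"), (530, 2, "C"), (532, 2, "C"),
    (536, 2, "D"), (538, 2, "D"),
    (518, 1, "E"),
]


def group_by_type(neighbors):
    """Place every neighboring node into an initial group keyed by node type."""
    groups = defaultdict(list)
    for node_id, subgraph, node_type in neighbors:
        groups[node_type].append((node_id, subgraph))
    return dict(groups)


def keep_shared_groups(groups):
    """Keep only groups that contain nodes from both subgraph 1 and subgraph 2."""
    return {
        node_type: members
        for node_type, members in groups.items()
        if {subgraph for _, subgraph in members} == {1, 2}
    }


shared = keep_shared_groups(group_by_type(NEIGHBORS))
print(sorted(shared))  # ['A', 'B', 'C']
```

Groups "D" and "E" drop out because each draws its nodes from only one subgraph, mirroring the exclusion of groups 606 and 608.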

[0106] Turning next to Figure 7, an illustration of clusters created from groups of adjacent nodes is depicted according to an illustrative embodiment. In this illustrative example, clusters are created from each group of adjacent nodes in which the adjacent nodes come from both subgraphs. Clustering is performed to group adjacent nodes such that the adjacent nodes in a cluster are more similar to each other than to adjacent nodes in other clusters.

[0107] This clustering can be performed using algorithms or machine learning models. Various clustering techniques can be used. For example, density-based spatial clustering of applications with noise (DBSCAN), k-means clustering, distribution-based clustering, density-based clustering, or other types of clustering can be used.

[0108] As depicted, clustering results in the creation of clusters 700 and 702 in group 600; clusters 704, 706, and 708 in group 602; and cluster 710 in group 604. In this illustrative example, the clusters selected for further processing are those that include neighboring nodes from both subgraphs. As depicted, clusters 702 and 708 are removed because they only include nodes from one of the two subgraphs. The result of clustering can be one or more clusters, where each cluster holds a set of neighboring nodes of the same type from each subgraph. In this example, four clusters remain (clusters 700, 704, 706, and 710), each containing neighboring nodes of the same type from both subgraphs.
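To illustrate cluster creation and filtering for group 600, the sketch below uses a simple stand-in technique: nodes closer than a threshold are linked, and each connected component becomes a cluster. This is not DBSCAN or any method named in the description; it is only the smallest self-contained substitute. The distances 0.1 and 0.6 come from the example for cluster 700; every other distance, the eps value, and the function names are assumptions, as is the inference that cluster 702 holds only node 512.

```python
from itertools import combinations

# Subgraph membership for the nodes of group 600.
SUBGRAPH = {512: 1, 514: 1, 516: 1, 534: 2}

# Pairwise distances. Only 0.1 and 0.6 appear in the description;
# the remaining values are invented for illustration.
DIST = {
    frozenset((516, 534)): 0.1,
    frozenset((514, 534)): 0.6,
    frozenset((514, 516)): 0.2,  # assumed
    frozenset((512, 514)): 0.9,  # assumed
    frozenset((512, 516)): 0.9,  # assumed
    frozenset((512, 534)): 0.9,  # assumed
}


def threshold_clusters(nodes, dist, eps):
    """Connected components of the graph that links nodes at distance <= eps."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        if dist[frozenset((a, b))] <= eps:
            parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


def spans_both_subgraphs(cluster, subgraph_of):
    """True when the cluster has at least one node from each subgraph."""
    return {subgraph_of[n] for n in cluster} == {1, 2}


clusters = threshold_clusters(sorted(SUBGRAPH), DIST, eps=0.3)
kept = [sorted(c) for c in clusters if spans_both_subgraphs(c, SUBGRAPH)]
print(kept)  # [[514, 516, 534]]
```

Under these assumed distances, node 512 ends up alone in its own cluster and is discarded, while the cluster containing nodes 514, 516, and 534 is kept.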

[0109] From these clusters, optimal matching node pairs can be determined. Optimal matching node pairs can be determined for each cluster containing adjacent nodes from two subgraphs. An optimal matching node pair in a cluster is a pair of nodes from different subgraphs that have the shortest distance. In other words, an optimal matching node pair includes a first adjacent node from the first subgraph 500 and a second adjacent node from the second subgraph 520, wherein these two adjacent nodes in the cluster have the shortest distance compared to other adjacent node pairs in the cluster.

[0110] For example, when the distance between neighboring node 516 and neighboring node 534 in cluster 700 is 0.1 and the distance between neighboring node 514 and neighboring node 534 is 0.6, the best matching pair is neighboring node 516 and neighboring node 534.

[0111] As another example, in cluster 704, the best-matching node pair is neighboring node 504 and neighboring node 524. These are the only two nodes in the cluster. Neighboring node 506 and neighboring node 526 are the best-matching node pair in cluster 706.

[0112] In cluster 710, the distance between neighboring nodes 510 and 532 is 0.2; the distance between neighboring nodes 510 and 530 is 0.3; the distance between neighboring nodes 508 and 532 is 0.6; and the distance between neighboring nodes 508 and 530 is 0.4. In this example, the best-matching node pair in cluster 710 includes neighboring nodes 510 and 532. As can be seen, the distance between node pairs is calculated, where each node pair includes neighboring nodes from each of the two subgraphs.
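The selection of the best-matching node pair in cluster 710 can be sketched as a minimum over the cross-subgraph pair distances. The distances below are those given in the example; the variable and function names are illustrative only, and how each adjacency distance is computed is outside the scope of this sketch.

```python
# Distances between each node from the first subgraph (508, 510)
# and each node from the second subgraph (530, 532) in cluster 710.
PAIR_DISTANCES = {
    (510, 532): 0.2,
    (510, 530): 0.3,
    (508, 532): 0.6,
    (508, 530): 0.4,
}


def best_matching_pair(pair_distances):
    """Return the cross-subgraph pair with the shortest distance and that distance."""
    pair = min(pair_distances, key=pair_distances.get)
    return pair, pair_distances[pair]


pair, distance = best_matching_pair(PAIR_DISTANCES)
print(pair, distance)  # (510, 532) 0.2
```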

[0113] These identified minimum distances can be Hausdorff distances applied to different subsets of a node cluster. Mathematically, the Hausdorff distance measures how far apart two subsets of a metric space are from each other. The Hausdorff distance is also known as the Hausdorff metric. For example, the Hausdorff distance for cluster 700 could be dH = min(0.1, 0.6) = 0.1. The Hausdorff distance for cluster 704 is dH = min(0.2) = 0.2, and for cluster 706 it is dH = min(0.5) = 0.5. The Hausdorff distance for cluster 710 is dH = min(0.2, 0.3, 0.6, 0.4) = 0.2.

[0114] As a result, the set of Hausdorff distances is [0.1, 0.2, 0.5, 0.2], where each of these values is the minimum distance of the best matching node pair in a cluster identified for the groups from the first subgraph 500 and the second subgraph 520.

[0115] In this illustrative example, a distance feature vector based on the distances between neighboring nodes can be determined from the counts of distances falling within various thresholds or ranges. For example, the distance feature vector can be determined as follows: fv(i) = [count of dH ≤ 0.3, count of 0.3 < dH < 0.7, count of dH ≥ 0.7]. Therefore, the distance feature vector in this example is fv(i) = [3, 1, 0].
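The binning of distances into a distance feature vector can be sketched as follows. The first two ranges are those stated in the example. The third range is taken here to be dH ≥ 0.7, which is an assumption inferred from the other two ranges and from the resulting count of zero. The function and parameter names are illustrative.

```python
def distance_feature_vector(distances, low=0.3, high=0.7):
    """Count distances per range: [d <= low, low < d < high, d >= high]."""
    close = sum(1 for d in distances if d <= low)
    middle = sum(1 for d in distances if low < d < high)
    far = sum(1 for d in distances if d >= high)  # assumed third range
    return [close, middle, far]


hausdorff_distances = [0.1, 0.2, 0.5, 0.2]  # clusters 700, 704, 706, 710
print(distance_feature_vector(hausdorff_distances))  # [3, 1, 0]
```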

[0116] The comparison feature vector can be determined based on the information in the comparison center nodes. For example, if the first center node 502 is [John Smith Jr.] and the second center node 522 is [Johnny Smith], features can be identified based on the comparison of information between these two center nodes. Features based on information comparison can be, for example, [name_exact, name_similar, name_leftout, name_unmatched]. In this example, the comparison feature vector of the center nodes is fv(i) = [1, 1, 1, 0]. In this specific example, the first 1 is the count of [Smith relative to Smith], the second 1 is the count of [John relative to Johnny], and the third 1 is the count of [Jr. relative to none].
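One way the four name features could be counted is sketched below. The token rules are assumptions chosen so that the sketch reproduces the example: identical tokens count as exact, tokens whose difflib.SequenceMatcher ratio is at least 0.7 count as similar, surplus tokens with no counterpart on the other side count as left out, and remaining tokens on both sides that could not be paired count as unmatched. The description does not define these rules, so the function name, the threshold, and the pairing strategy are all hypothetical.

```python
from difflib import SequenceMatcher


def compare_names(name_a, name_b, similar_at=0.7):
    """Return [name_exact, name_similar, name_leftout, name_unmatched]."""
    a = name_a.lower().split()
    b = name_b.lower().split()

    # Pass 1: pair off identical tokens.
    exact = 0
    for tok in list(a):
        if tok in b:
            a.remove(tok)
            b.remove(tok)
            exact += 1

    # Pass 2: greedily pair each remaining token with its most similar counterpart.
    similar = 0
    for tok in list(a):
        if not b:
            break
        best = max(b, key=lambda other: SequenceMatcher(None, tok, other).ratio())
        if SequenceMatcher(None, tok, best).ratio() >= similar_at:
            a.remove(tok)
            b.remove(best)
            similar += 1

    # Tokens left on both sides are mismatches; any surplus on one side is left out.
    unmatched = min(len(a), len(b))
    leftout = abs(len(a) - len(b))
    return [exact, similar, leftout, unmatched]


print(compare_names("John Smith Jr.", "Johnny Smith"))  # [1, 1, 1, 0]
```

Here "Smith" pairs exactly, "John" and "Johnny" pair as similar, and "Jr." has no counterpart, which gives the vector [1, 1, 1, 0] from the example.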

[0117] As a result, the overall feature vector, which includes the comparison features of the center node and the adjacent features of the distance feature, is fv(i) = [1, 1, 1, 0, 3, 1, 0]. This feature vector can be used to determine the similarity between the first subgraph 500 and the second subgraph 520, where the similarity considers the first center node 502, the second center node 522, and the best matching node pair.

[0118] In this example, similarity can be measured by the total distance between the first center node 502 and the second center node 522. In this specific example, the distance can be calculated using the feature vector fv and the coefficient vector cv as follows:

[0119]

[0120] Where cv(i) is the coefficient vector, fv(i) is the feature vector including comparison features and distance features, max(cv) is the element with the maximum value in the coefficient vector, min(cv) is the element with the minimum value in the coefficient vector, i is the index value, and n is the number of elements in the feature vector.

[0121] In this example, the feature vector, which includes comparison features from the comparison feature vector and distance features from the distance feature vector, can be used to determine the total distance between the first center node 502 and the second center node 522. Furthermore, weights can be applied to different feature vectors using feature vector coefficients. These coefficients can be predetermined. Subject matter experts or machine learning models can be used to determine the coefficients. For example, when determining the similarity between two center nodes, a higher feature vector coefficient can be used for a specific element in the feature vector, giving that specific element greater importance.

[0122] In the example depicted in Figures 5-7, for the feature vector [1, 1, 1, 0, 3, 1, 0] and the coefficient vector [10, 7, -5, -10, 5, 2, 0.5], the total distance between the first center node and the second center node can be determined as:

[0123]

[0124] This is a more accurate distance comparison compared to comparing the two center nodes without considering their neighboring nodes in their subgraphs:

[0125]

[0126] In this illustrated example, comparing the subgraphs of the central nodes provides increased accuracy and granularity in determining the similarity between records or information of the central nodes, compared to comparing only the records of the central nodes. In other words, subgraph comparison can be performed by determining the distance between the central nodes and adjusting the determined distance between the central nodes based on adjacent nodes in the subgraph, where the adjusted distance is the total distance between the two central nodes.
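The equations for the total distance are not reproduced in this text, so the sketch below does not attempt to recompute them. It only illustrates, under an explicit assumption, how a coefficient vector weights individual features: a plain weighted sum of cv(i) and fv(i). The function name `weighted_score` is hypothetical, and the normalization involving max(cv) and min(cv) is deliberately left out because its exact form is not given here.

```python
def weighted_score(feature_vector, coefficient_vector):
    """Plain weighted sum of features (an assumed ingredient, not the patent's formula)."""
    if len(feature_vector) != len(coefficient_vector):
        raise ValueError("feature and coefficient vectors must have the same length")
    return sum(c * f for c, f in zip(coefficient_vector, feature_vector))


cv = [10, 7, -5, -10, 5, 2, 0.5]   # coefficient vector from paragraph [0122]
fv = [1, 1, 1, 0, 3, 1, 0]         # overall feature vector from paragraph [0117]

# Score using the center nodes and their neighbors.
with_neighbors = weighted_score(fv, cv)

# Score using only the four comparison features of the center nodes.
center_only = weighted_score(fv[:4], cv[:4])
```

The two scores differ because the three distance features of the adjacent nodes contribute their own weighted terms, which is the effect the paragraph above describes.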

[0127] The illustrations of the two center nodes and the adjacent nodes of the two subgraphs in Figures 5 to 7 are presented for the purpose of illustrating one way in which different operations can be performed on the subgraphs in the illustrative example, and are not intended to limit the ways in which other illustrative examples can be implemented. For example, eight adjacent nodes are shown for each subgraph. In other illustrative examples, other numbers of adjacent nodes may exist. For example, each subgraph may have 3, 25, 300, or some other number of adjacent nodes. A subgraph may not have the same number of adjacent nodes as another subgraph analyzed subsequently. As another example, adjacent nodes are shown as having only a depth of 1 from the center node. In other illustrative examples, adjacent nodes may have other depths, such as 2, 3, 6, or some other depth in the subgraph. For example, a particular adjacent node may have a depth of 2 from the center node. In other words, a particular adjacent node may have a link to another adjacent node that links to the center node. In another illustrative example, the feature vector may only include the distance feature vector of the adjacent nodes.

[0128] In another illustrative example, feature vectors can be generated directly from comparison features and distance features, without having to generate separate comparison and distance feature vectors. In some illustrative examples, feature vectors may include distance features but not comparison features. In yet another illustrative example, feature vectors can be generated based on a comparison of the two center nodes, where the feature vectors include both comparison and distance features. In this example, the distance feature is based on the distance calculated between the two center nodes.

[0129] Referring next to Figure 8, an illustration of information fragments in adjacent nodes is depicted according to an illustrative embodiment. In this illustrative example, Table 800 shows the information that may exist for adjacent nodes.

[0130] As shown in the figure, Table 800 includes several rows. In this example, the rows are for neighboring nodes 516 and 534, which are of the same node type.

[0131] In this illustrative example, Table 800 has several different columns that identify information about neighboring nodes. These columns include neighboring nodes 802, subgraph 804, link type 806, depth 808, neighbor 810, and address 812.

[0132] Neighbor node 802 is the identifier of the neighbor node. In this example, the neighbor node in row 814 corresponds to neighbor node 516, and the neighbor node in row 816 corresponds to neighbor node 534.

[0133] Subgraph 804 identifies the subgraph to which the adjacent nodes belong in this example. Link type 806 is an identifier for a specific type of link that connects an adjacent node to another node. This other node can be another adjacent node or the center node. The value in link type 806 indicates what type of structural metadata exists that contains information about the relationship between the two adjacent node types. In this illustrative example, link type 806 indicates a link to a neighboring node. Depth 808 identifies the number of links that connect an adjacent node to the center node. In this example, the depth of both adjacent nodes is 1.

[0134] In this illustrative example, neighbor 810 is a bucket group. The hash value in neighbor 810 is a hash value generated by hashing the names of the neighbors. Address 812 is a bucket for the addresses of the neighbors identified in neighbor 810. The hash value in address 812 is generated by hashing the address of each neighbor. Other examples of bucket categories include phone numbers, business addresses, car models, cities, countries, or other suitable categories.

[0135] In this illustrative example, hashes can be generated for fields or attributes. Different hashes can be generated to account for known or acceptable variations of a specific category, such as a name. In this way, partial matches can be identified to account for data entry errors. This type of multi-bucket hash generation for a single attribute can be applied to data such as phone numbers, birthdays, or other suitable information.
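The bucket hashing described above can be sketched as follows. This is an illustration under stated assumptions: the helper names `bucket_hash` and `name_bucket_hashes`, the SHA-256 digest truncated to 16 hex characters, and the small `NAME_VARIANTS` table of acceptable variations are all hypothetical and not taken from the patent.

```python
import hashlib

# Hypothetical table of known or acceptable name variations.
NAME_VARIANTS = {"johnny": "john", "jon": "john", "bob": "robert"}


def bucket_hash(value):
    """Hash a field value after normalizing case and whitespace."""
    normalized = " ".join(value.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def name_bucket_hashes(name):
    """Return several hashes for one name so that known variants share a bucket."""
    tokens = name.lower().replace(".", "").split()
    canonical = [NAME_VARIANTS.get(tok, tok) for tok in tokens]
    return {bucket_hash(" ".join(tokens)), bucket_hash(" ".join(canonical))}


# Two neighbors whose names differ only by a known variation share a bucket hash.
shared = name_bucket_hashes("Johnny Smith") & name_bucket_hashes("John Smith")
```

Generating more than one hash per attribute is what allows a partial match: the two names above hash differently as written, but their canonical forms land in the same bucket.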

[0136] Table 800 depicts data of a limited number of types, used to illustrate different characteristics in an illustrative example. The illustrative example can be implemented with more buckets or other information in adjacent nodes. Additionally, buckets can include more than one category. For example, a bucket could be a name and a region code. As another example, buckets could be Contracts, Jones, and Seattle.

[0137] Turning next to Figure 9, a flowchart of a process for managing information is depicted according to an illustrative embodiment. The process in Figure 9 can be implemented in hardware, software, or both. When implemented in software, the process can take the form of program code executed by one or more processor units in one or more hardware devices located in one or more computer systems. The process can be implemented in Data Management 96 in Figure 2. In the illustrative example, the process can also be implemented in the information manager 330 of the network data processing system 300 in Figure 3 and in the information manager 412 of the computer system 410 in Figure 4. This process can be used to manage information fragments. In this example, the information fragments take the form of records, but in a particular implementation they may take other forms.

[0138] The process begins by identifying records in one or more datasets that are sufficiently similar to serve as center nodes for determining the similarity of the subgraphs of those center nodes (step 900). In step 900, comparisons can be made between records to obtain feature results, such as the feature result 448 in Figure 4. The results of these comparisons can be used to identify which center nodes are close enough or similar enough to warrant further processing. In other words, step 900 can be performed as an initial process for identifying candidate center nodes from the records. In this example, these comparisons do not consider adjacent nodes in the subgraph. For example, the distance between center nodes can be determined based solely on the center nodes themselves.

[0139] In step 900, identifying matches between center nodes reduces the number of comparisons required. As a result, detailed comparisons between the subgraphs of each center node and the subgraphs of every other center node are unnecessary.
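The pre-filtering role of step 900 can be sketched as a cheap pairwise comparison that keeps only candidate pairs for the more expensive subgraph comparison. The record layout, the function names, the string-similarity measure from `difflib`, and the 0.4 cut-off are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def cheap_distance(record_a, record_b):
    """Distance in [0, 1] based only on the records themselves (assumed: name similarity)."""
    ratio = SequenceMatcher(None, record_a["name"].lower(), record_b["name"].lower()).ratio()
    return 1.0 - ratio


def candidate_pairs(records, max_distance=0.4):
    """Keep only the record pairs close enough to justify a subgraph comparison."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if cheap_distance(a, b) <= max_distance
    ]


records = [
    {"id": "r1", "name": "John Smith Jr."},
    {"id": "r2", "name": "Johnny Smith"},
    {"id": "r3", "name": "Alice Wong"},
]
pairs = candidate_pairs(records)
```

Of the three possible pairs, only the pair of similar names survives, so only one subgraph comparison would be needed instead of three.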

[0140] After two central nodes are identified as sufficiently similar for further processing, comparing the contexts of the two central nodes, that is, the independent networks around them, can increase or decrease the overall confidence in inferring whether the two central nodes are similar or different. These networks are the subgraphs of the two central nodes.

[0141] The process identifies the subgraph of the identified central nodes (step 902). The process determines the overall similarity between the central nodes (step 904). In step 904, the process can determine the overall similarity between central nodes by considering the central node and its adjacent nodes within its subgraph. For example, comparing two central nodes “John Smith”, they may be somewhat similar. If the first central node is only associated with the entity “Canada ABC Company” with which it has an employment relationship, and the second central node is only associated with “XYZ” with which it has a cooperative relationship, it can be interpreted that the central nodes are unlikely to be similar. However, if the second central node has an additional employment relationship with “ABC Company”, which may or may not be a different node from “Canada ABC Company” associated with the first node, this situation could lead to the inference that the two central nodes are more likely to be similar.

[0142] The process determines whether a record pair matches based on the overall similarity of its subgraph pairs (step 906). In this illustrative example, the determination may also include analysis of the feature results determined by the initial analysis of the records to identify the central node. In step 906, the record may be the central node.

[0143] The process then performs a set of actions based on the existence of a match (step 908). The process then terminates. In step 908, the actions may include deduplication, merging at least one of the matching records, or other suitable actions. In this way, consistency between information in different datasets can be achieved to perform operations such as reporting, transactions, or other suitable actions that require at least one of accuracy or consistency in the records found in one or more datasets.

[0144] Turning next to Figure 10, a flowchart of a process for matching central nodes is depicted according to an illustrative embodiment. The process in Figure 10 can be implemented in hardware, software, or both. When implemented in software, the process can take the form of program code executed by one or more processor units in one or more hardware devices located in one or more computer systems. The process can be implemented in Data Management 96 in Figure 2. In the illustrative example, the process can also be implemented in the information manager 330 of the network data processing system 300 in Figure 3, or in the information manager 412 of the computer system 410 in Figure 4. The process in this figure can be used to implement step 908 of the process in Figure 9.

[0145] The process begins by identifying the first center node in the first subgraph and the second center node in the second subgraph (step 1000). The process then identifies adjacent node groups that have adjacent nodes from both the first and second subgraphs, wherein each adjacent node group in the adjacent node groups has adjacent nodes of the same node type (step 1002).

[0146] The process creates a cluster set from each group of adjacent nodes, such that each cluster in the cluster set has adjacent nodes from both the first and second subgraphs (step 1004). The process identifies the best-matching node pairs among the adjacent nodes in each cluster in the cluster set to form a set of best-matching node pairs in the cluster set (step 1006). In step 1006, the adjacent nodes in the best-matching node pairs include a first adjacent node from the first subgraph and a second adjacent node from the second subgraph.

[0147] The process uses the first central node in the first subgraph, the second central node in the second subgraph, and the set of best-matching node pairs in the cluster set to determine whether the first central node in the first subgraph and the second central node in the second subgraph match based on the total distance between them (step 1008). In step 1008, this total distance is different from the distance between two central nodes without considering adjacent nodes in the subgraph. The process then terminates.

[0148] Referring to Figure 11, a flowchart of a process for identifying adjacent node groups is depicted according to an illustrative embodiment. The process in this figure is an example of an implementation of step 1002 in Figure 10.

[0149] The process begins by placing neighboring nodes from each subgraph into initial groups based on the node type of the neighboring nodes (step 1100). The process then selects each initial group that contains neighboring nodes from both the first subgraph and the second subgraph to form the groups of neighboring nodes (step 1102). The process then terminates.
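Steps 1100 and 1102 can be sketched as a group-then-filter operation. The dictionary layout of a neighboring node (`id`, `subgraph`, `node_type`) and the function name are assumptions for this example.

```python
from collections import defaultdict


def group_by_node_type(neighbors):
    """Group neighbors by node type and keep groups that span both subgraphs."""
    initial_groups = defaultdict(list)
    for node in neighbors:                      # step 1100: initial groups per node type
        initial_groups[node["node_type"]].append(node)

    return {                                    # step 1102: keep groups with both subgraphs
        node_type: nodes
        for node_type, nodes in initial_groups.items()
        if {n["subgraph"] for n in nodes} >= {1, 2}
    }


neighbors = [
    {"id": "a1", "subgraph": 1, "node_type": "person"},
    {"id": "b1", "subgraph": 2, "node_type": "person"},
    {"id": "a2", "subgraph": 1, "node_type": "contract"},
]
groups = group_by_node_type(neighbors)
```

Here the "contract" group is dropped because it has no neighboring node from the second subgraph to compare against.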

[0150] Turning to Figure 12, a flowchart of a process for creating a cluster set is depicted according to an illustrative embodiment. The process in this figure is an example of an implementation of step 1004 in Figure 10.

[0151] The process begins by creating candidate clusters within each of the neighboring node groups (step 1200). The process then selects each candidate cluster that has neighboring nodes from both the first subgraph and the second subgraph to form the cluster set (step 1202). The process then terminates.
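Steps 1200 and 1202 can be sketched in the same style. How candidate clusters are formed is not specified in this paragraph, so the sketch assumes that nodes sharing a bucket value (for example a bucket hash) fall into the same candidate cluster. The field name `bucket` and the function name are hypothetical.

```python
from collections import defaultdict


def build_cluster_set(groups, key="bucket"):
    """Cluster each group by a shared key and keep clusters spanning both subgraphs."""
    cluster_set = []
    for nodes in groups.values():
        candidates = defaultdict(list)
        for node in nodes:                      # step 1200: candidate clusters (assumed: by bucket)
            candidates[node[key]].append(node)
        for members in candidates.values():     # step 1202: keep clusters with both subgraphs
            if {m["subgraph"] for m in members} >= {1, 2}:
                cluster_set.append(members)
    return cluster_set


groups = {
    "person": [
        {"id": "a1", "subgraph": 1, "bucket": "h1"},
        {"id": "b1", "subgraph": 2, "bucket": "h1"},
        {"id": "b2", "subgraph": 2, "bucket": "h2"},
    ]
}
clusters = build_cluster_set(groups)
```

The candidate cluster for bucket "h2" is discarded because it contains no node from the first subgraph.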

[0152] Referring to Figure 13, a flowchart of a process for identifying the best matching pair of neighboring nodes is depicted according to an illustrative embodiment. The process in this figure is an example of an implementation of step 1006 in Figure 10.

[0153] The process begins by determining the adjacency distances of the compared neighboring nodes in the cluster based on the compared neighboring nodes, the links between the compared neighboring nodes, and the depth of the compared neighboring nodes (step 1300). In step 1300, adjacency distances can be determined in several different ways. For example, breadth-first search, Dijkstra's algorithm, or the Bellman-Ford algorithm are examples of algorithms that can be used to determine these distances.

[0154] In this example, one of the following equations is used to calculate the neighbor distance of neighboring nodes in the cluster based on the compared neighboring node, the links between the compared neighboring nodes, and the depth of the compared neighboring nodes:

[0155] d(x, y) = 1 - e^(log(1 - distance(x, y)) + log(1 - distance(link(x), link(y))) + log(const^depth(x, y)))

[0156] Where distance(x, y) is the distance between node x and node y in the cluster, depth(x, y) is the average of the depth of node x and the depth of node y, and const is a constant value greater than 0 and less than or equal to 1. The depth of node x is the count of links on the shortest path from node x to the central node of node x. In this example, depth(x, y) can also be described as the average of the following: (1) the number of links on the shortest path between node x and the first central node, and (2) the number of links on the shortest path between node y and the second central node.

[0157] d(x, y) = 1 - ((1 - distance(x, y)) * (1 - distance(link(x), link(y))) * const^depth(x, y))

[0158] Where distance(x, y) is the distance between node x and node y in the cluster, depth(x, y) is the average of the depth of node x and the depth of node y, and const is a constant value greater than 0 and less than or equal to 1. The depth of node x is the count of links on the shortest path from node x to the central node of node x.
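The depth used in these equations is a shortest-path link count, which a breadth-first search can compute. The sketch below assumes the subgraph is available as an adjacency dictionary; the function name and the sample graph are hypothetical.

```python
from collections import deque


def depth_from_center(links, center, node):
    """Count the links on the shortest path from the center node to the given node."""
    seen = {center}
    queue = deque([(center, 0)])
    while queue:
        current, depth = queue.popleft()
        if current == node:
            return depth
        for nxt in links.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # the node is not reachable from the center


# Hypothetical subgraph: "d" links to "a", which links to the center "c".
links = {"c": ["a", "b"], "a": ["c", "d"], "b": ["c"], "d": ["a"]}
depth_x = depth_from_center(links, "c", "a")
depth_y = depth_from_center(links, "c", "d")
average_depth = (depth_x + depth_y) / 2   # depth(x, y) as defined above
```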

[0159] The process identifies the best-matching node pair for each cluster in the cluster set as the two nodes in the cluster with the shortest adjacent distance, thus forming the set of best-matching node pairs for the cluster set (step 1302). The process then terminates.
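Steps 1300 and 1302 can be sketched together. The distance function below follows the product form of the equation, read as one minus the product of the three factors; that reading is an assumption, because the operator after the leading 1 is not legible in the extracted text. The node layout, the two callback parameters, and the default `const=0.9` are also assumptions.

```python
from itertools import product


def adjacency_distance(node_distance, link_distance, depth_x, depth_y, const=0.9):
    """Assumed reading: d = 1 - (1 - distance) * (1 - link distance) * const ** depth."""
    if not 0 < const <= 1:
        raise ValueError("const must be greater than 0 and at most 1")
    average_depth = (depth_x + depth_y) / 2
    similarity = (1 - node_distance) * (1 - link_distance) * const ** average_depth
    return 1 - similarity


def best_matching_pair(cluster, node_distance, link_distance, const=0.9):
    """Return (distance, id_x, id_y) for the closest cross-subgraph pair in a cluster."""
    firsts = [n for n in cluster if n["subgraph"] == 1]
    seconds = [n for n in cluster if n["subgraph"] == 2]
    scored = [
        (
            adjacency_distance(
                node_distance(x, y), link_distance(x, y), x["depth"], y["depth"], const
            ),
            x["id"],
            y["id"],
        )
        for x, y in product(firsts, seconds)
    ]
    return min(scored)  # step 1302: shortest adjacency distance wins


cluster = [
    {"id": "x1", "subgraph": 1, "depth": 1, "name": "ann", "link": "neighbor"},
    {"id": "y1", "subgraph": 2, "depth": 1, "name": "ann", "link": "neighbor"},
    {"id": "y2", "subgraph": 2, "depth": 1, "name": "bob", "link": "neighbor"},
]
best = best_matching_pair(
    cluster,
    node_distance=lambda x, y: 0.0 if x["name"] == y["name"] else 0.8,
    link_distance=lambda x, y: 0.0 if x["link"] == y["link"] else 1.0,
)
```

Because const is at most 1, deeper pairs are penalized: the same node and link distances produce a larger adjacency distance as the average depth grows.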

[0160] In Figure 14, a flowchart of a process for determining whether a first central node and a second central node match is depicted according to an illustrative embodiment. The process in this figure is an example of an implementation of step 1008 in Figure 10.

[0161] The process begins by determining the total distance between the first and second central nodes using the first central node, the second central node, and the set of best-matching node pairs in the cluster set as follows:

[0162]

[0163] Where distance(CenterNode1,CenterNode2) is the distance between the first and second center nodes, dH(x,y) is the distance between neighboring nodes x and y in the best-matching node pair, and M is the number of node types in the group that have best-matching neighboring node pairs (step 1400). In this illustrative example, the distance represented by dH(x,y) is a value between 0 and 1. Furthermore, distance(CenterNode1,CenterNode2) is a value between 0 and 1. As a result, in this illustrative example, the total distance is a value between 0 and 1. In this example, a value of 0 indicates an exact match between the compared data, while a value of 1 indicates that the compared data are completely different. In some cases, there may be some neighboring nodes of a given node type in the first subgraph, while there are no neighboring nodes of the same node type in the second subgraph. These node types that do not match between the two subgraphs are not included in M.

[0164] In this example, neighboring node x can be connected to CenterNode1, and neighboring node y can be connected to CenterNode2. This connection can be direct or indirect, using intermediate nodes. In this example, dH(x, y) is the minimum distance that can be determined for different combinations of neighboring nodes (neighboring node x and neighboring node y) in the cluster.

[0165] The process determines whether the first and second subgraphs match based on the total distance calculated between the first and second central nodes (step 1402). The process then terminates.

[0166] Turning now to Figure 15, a flowchart of a process for determining whether a first center node and a second center node match is depicted according to an illustrative embodiment. The process in this figure is an example of an implementation of step 1008 in Figure 10.

[0167] The process begins by determining the comparison features between the first and second central nodes to determine the comparison feature vector of the first and second central nodes (step 1500). Features are characteristics of interest between the information being compared. This type of feature is a comparison feature. For example, when comparing names in the central nodes, the features of interest for name comparison could be [number of exact words, number of similar words, number of omitted words, number of non-matching words]. When “John Smith Jr.” is compared with “Johnny Smith” against these features, the first element (number of exact words) is 1 because of [Smith, Smith]. The second element (number of similar words) is 1 because of [John, Johnny]. The third element (number of omitted words) is 1 because of [Jr., none]. Because a match exists, the fourth element, the number of non-matching words, is 0. As a result, the comparison feature vector in this example is fv = [1, 1, 1, 0].

[0168] The process determines a distance feature based on the lowest distance between each cluster in the cluster set (step 1502). In this example, the distance feature may be based on whether a specific distance is within a threshold range specified for that distance feature. For example, the distance feature could be [distance less than 0.3, distance between 0.3 and 0.7, and distance greater than 0.7]. In this example, there are three distance features, and the distance feature vector indicates the count of how many nodes exist for each of the specific features.
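The distance feature vector of step 1502 is a count of best-matching pair distances per range. The sketch below assumes the three ranges named in the paragraph and treats the boundary values 0.3 and 0.7 as belonging to the middle range, which the text does not specify. The sample distances are invented so that the result matches the [3, 1, 0] used earlier in the description.

```python
def distance_feature_vector(best_pair_distances, low=0.3, high=0.7):
    """Count distances below low, between low and high, and above high."""
    counts = [0, 0, 0]
    for d in best_pair_distances:
        if d < low:
            counts[0] += 1
        elif d <= high:      # boundary handling is an assumption
            counts[1] += 1
        else:
            counts[2] += 1
    return counts


# Hypothetical lowest distances, one per cluster in the cluster set.
distance_features = distance_feature_vector([0.1, 0.2, 0.25, 0.5])

# Appending the distance features to the comparison features of the center nodes
# gives the overall feature vector.
overall = [1, 1, 1, 0] + distance_features
```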

[0169] This process uses comparison feature vectors and distance feature vectors to determine the total distance between the first center node and the second center node (step 1504). In step 1504, the comparison feature vectors are used for the center nodes, and the distance feature vectors are determined for the adjacent nodes. In step 1504, considering the adjacent nodes of the two center nodes in the form of the best-matched node pair, the total distance between the two center nodes is determined as follows:

[0170]

[0171] Where cv(i) is the element at index i in the coefficient vector, fv(i) is the element at index i in the feature vector, including the comparison feature vector and the distance feature vector, max(cv) is the element with the maximum value in the coefficient vector, min(cv) is the element with the minimum value in the coefficient vector, i is the index value, and n is the number of elements in the feature vector. In this specific example, the feature vector fv includes the comparison feature of the center node and the distance feature of the cluster.

[0172] In this example, the feature vector contains elements for the comparative features at the center node and the distance features to neighboring nodes. The coefficient vector contains elements used when applying weights to the corresponding features in the feature vector. These coefficient vectors can be used to show the importance of each feature in the feature vector to the overall computation. Machine learning models can be used to predetermine or generate the coefficient vectors.

[0173] The process determines whether the total distance is within the threshold for matching the first and second center nodes (step 1506). The process then terminates.

[0174] Referring now to Figure 16, a flowchart of a process for matching subgraphs is depicted according to an illustrative embodiment. The process in Figure 16 can be implemented in hardware, software, or both. When implemented in software, the process can take the form of program code executed by one or more processor units in one or more hardware devices located in one or more computer systems. The process can be implemented in Data Management 96 in Figure 2. In the illustrative example, the process can also be implemented in the information manager 330 of the network data processing system 300 in Figure 3 and in the information manager 412 of the computer system 410 in Figure 4. The process in this figure can be used to implement step 908 of the process in Figure 9.

[0175] The process begins by identifying two central nodes in two subgraphs, each of which is in one of the two subgraphs (step 1600). The process then groups the adjacent nodes of the two central nodes in the two subgraphs according to node type, wherein each group includes adjacent nodes from both subgraphs (step 1602). The process further clusters adjacent nodes of the same node type within each group to form a set of clusters, where each cluster in the set of clusters has at least one adjacent node from each of the two subgraphs (step 1604).

[0176] The process uses the Hausdorff distance to select the best-matching node pairs for each cluster, forming a set of best-matching node pairs for the cluster set (step 1606). In this example, the best-matching node pairs in the set of best-matching node pairs have neighboring nodes from each of the two subgraphs.
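For reference, the textbook Hausdorff distance between two finite sets can be sketched as below, next to the minimum pair distance that the description uses for dH(x, y) within a cluster. The function names are hypothetical, and the numeric sets with an absolute-difference metric stand in for neighboring nodes and their adjacency distance. How the patent combines these quantities is not shown here.

```python
def directed_hausdorff(set_a, set_b, dist):
    """Largest distance from a member of set_a to its nearest member of set_b."""
    return max(min(dist(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b, dist):
    """Symmetric Hausdorff distance (standard definition for finite sets)."""
    return max(directed_hausdorff(set_a, set_b, dist), directed_hausdorff(set_b, set_a, dist))


def min_pair_distance(set_a, set_b, dist):
    """Smallest distance over all cross-set pairs, as used to pick a best-matching pair."""
    return min(dist(a, b) for a in set_a for b in set_b)


first = [0.0, 1.0]     # stand-ins for neighbors from the first subgraph
second = [0.1, 3.0]    # stand-ins for neighbors from the second subgraph
metric = lambda a, b: abs(a - b)

h = hausdorff(first, second, metric)
m = min_pair_distance(first, second, metric)
```

The two values differ by design: the Hausdorff distance is driven by the worst-placed member of either set, while the minimum pair distance reflects only the closest pair.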

[0177] The process uses the two center nodes and the set of best-matching node pairs of their neighbors to determine the total distance between the two center nodes (step 1608). In step 1608, the total distance between the two center nodes takes into account the set of best-matching node pairs for the two center nodes. The process then determines whether a match exists between the two center nodes based on the total distance between them (step 1610). The process then terminates.

[0178] In Figure 17, a flowchart of a process for assigning neighboring nodes to groups is depicted according to an illustrative embodiment. The process in this figure is an example of an implementation of step 1602 in Figure 16.

[0179] The process begins by placing neighboring nodes from each of the two subgraphs into an initial group based on the node type of the neighboring nodes (step 1700). The process then selects each initial group containing neighboring nodes from both subgraphs to form a group (step 1702). After this, the process terminates.

[0180] Referring next to Figure 18, a flowchart of a process for selecting the best matching node pairs for each cluster is depicted according to an illustrative embodiment. The process in this figure is an example of an implementation of step 1604 in Figure 16.

[0181] The process begins by determining the adjacency distance of the compared adjacent nodes in the cluster based on the compared adjacent nodes, the links between the compared adjacent nodes, and the depth of the compared adjacent nodes (step 1800). The process then identifies the best-matching node pair for each cluster in the cluster set as the two nodes with the shortest adjacency distance in the cluster set, forming the best-matching node pair set for the cluster set (step 1802). The process then terminates.

[0182] Turning next to Figure 19, a flowchart of a process for generating feature vectors is depicted according to an illustrative embodiment. The process in Figure 19 can be implemented in hardware, software, or both. When implemented in software, the process can take the form of program code executed by one or more processor units in one or more hardware devices located in one or more computer systems. The process can be implemented in Data Management 96 in Figure 2. In the illustrative example, the process can also be implemented in the information manager 330 of the network data processing system 300 in Figure 3 and in the information manager 412 of the computer system 410 in Figure 4.

[0183] The process begins by determining the comparison features of the two central nodes (step 1900). In step 1900, features are characteristics of interest present in the information being compared between the two central nodes. The process then determines a comparison feature vector of the comparison features (step 1902). In step 1902, each element in the comparison feature vector identifies the frequency of occurrence of a particular feature.

[0184] For example, when comparing names in the central nodes, the features of interest used for name comparison could be [exact name, similar name, omitted name, non-matching name]. When "John Smith Jr." is compared with "Johnny Smith" against these features, the first element (exact name) is 1 because of [Smith, Smith]. The second element (similar name) is 1 because of [John, Johnny]. The third element (omitted name) is 1 because of [Jr., none]. Because there is a match, the fourth element (non-matching name) is 0. As a result, in this example, the comparison feature vector is fv = [1, 1, 1, 0].

[0185] The process then determines the distance features of the cluster identified for the central node (step 1904). In step 1904, the features are based on the minimum distance between neighboring nodes in the cluster. In other words, the features are based on the distance determined between two neighboring nodes in the best-matching pair. The process generates a distance feature vector based on the distance features (step 1906). Each element in the distance feature vector indicates the number of times a particular feature occurs. The feature can be a threshold or range of distances between neighboring nodes.

[0186] For example, distance features could be [distance less than 0.3, distance between 0.3 and 0.7, and distance greater than 0.7]. In this example, there are three distance features, and the distance feature vector indicates the count of how many nodes exist for each of the specific features.

[0187] The process then generates a feature vector, which includes comparison features from the comparison feature vector and distance features from the distance feature vector (step 1908). The process then terminates. This feature vector can be used in one method for determining the total distance between central nodes.

[0188] Turning next to Figure 20, a flowchart of a process for matching central nodes is depicted according to an illustrative embodiment. The process in Figure 20 can be implemented in hardware, software, or both. When implemented in software, the process can take the form of program code executed by one or more processor units in one or more hardware devices located in one or more computer systems. The process can be implemented in Data Management 96 in Figure 2. In the illustrative example, the process can also be implemented in the information manager 330 of the network data processing system 300 in Figure 3, or in the information manager 412 of the computer system 410 in Figure 4. The process in this figure can be used to implement step 908 of the process in Figure 9.

[0189] The process is similar to the steps performed in the flowchart in Figure 10; in this illustrative example, creating a cluster set is an optional step.

[0190] The process begins by identifying the first central node in the first subgraph and the second central node in the second subgraph (step 2000). The process then identifies adjacent node groups that have adjacent nodes from both the first and second subgraphs, wherein each adjacent node group in the adjacent node groups has adjacent nodes of the same node type (step 2002).

[0191] The process identifies the best-matching node pairs in each group of neighboring nodes to form a set of best-matching node pairs in the cluster set (step 2004). In step 2004, the neighboring nodes in each best-matching node pair include a first neighboring node from the first subgraph and a second neighboring node from the second subgraph.

[0192] The process uses the first central node, the second central node, and the best-matching node pair set in the cluster set to determine whether the first central node and the second central node are a match based on the total distance between them (step 2006). The process then terminates.

[0193] The flowcharts and block diagrams in the various described embodiments illustrate the architecture, functionality, and operation of some possible implementations of the apparatus and methods in the illustrative embodiments. In this regard, each block in a flowchart or block diagram may represent at least one of the following: a module, segment, function, or part of an operation or step. For example, one or more blocks may be implemented as program code, hardware, or a combination of program code and hardware. When implemented in hardware, the hardware may, for example, take the form of an integrated circuit manufactured or configured to perform one or more operations in the flowchart or block diagram. When implemented as a combination of program code and hardware, the implementation may take the form of firmware. Each block in a flowchart or block diagram may be implemented using a dedicated hardware system performing different operations, or a combination of dedicated hardware and program code executed by the dedicated hardware.

[0194] In some alternative implementations of the illustrative embodiments, one or more functions marked in the boxes may occur in a different order than those marked in the figures. For example, in some cases, depending on the functions involved, two boxes shown consecutively may be executed substantially simultaneously, or these boxes may sometimes be executed in reverse order. Furthermore, in addition to the boxes shown in the flowchart or block diagram, other boxes may be added.

[0195] Turning now to Figure 21, a block diagram of a data processing system is depicted according to an illustrative embodiment. The data processing system 2100 can be used to implement cloud computing node 10 in Figure 1 and the hardware components in the hardware and software layer 60 in Figure 2. The data processing system 2100 can also be used to implement server computer 304, server computer 306, and client device 310 in Figure 3, as well as the computer system 410 in Figure 4. In this illustrative example, the data processing system 2100 includes a communication framework 2102 that provides communication between a processor unit 2104, a memory 2106, persistent storage 2108, a communication unit 2110, an input/output (I/O) unit 2112, and a display 2114. In this example, the communication framework 2102 takes the form of a bus system.

[0196] Processor unit 2104 is used to execute instructions for software that can be loaded into memory 2106. Processor unit 2104 includes one or more processors. For example, processor unit 2104 may be selected from at least one of the following: a multi-core processor, a central processing unit (CPU), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a network processor, or some other suitable type of processor. Furthermore, processor unit 2104 may be implemented using one or more heterogeneous processor systems, wherein the main processor and auxiliary processors reside together on a single chip. As another illustrative example, processor unit 2104 may be a symmetric multiprocessor system containing multiple processors of the same type on a single chip.

[0197] Memory 2106 and persistent storage 2108 are examples of storage device 2116. A storage device is any hardware capable of storing information, such as, but not limited to, at least one of data, program code in a functional form, or other suitable information, which may be temporary, permanent, or both. In these illustrative examples, storage device 2116 may also be referred to as a computer-readable storage device. In these examples, memory 2106 may be, for example, random access memory or any other suitable volatile or non-volatile storage device. Persistent storage 2108 may take various forms depending on the specific implementation.

[0198] For example, persistent storage 2108 may include one or more components or devices. For example, persistent storage 2108 may be a hard disk drive, a solid-state drive (SSD), flash memory, a rewritable optical disc, a rewritable magnetic tape, or a combination of the above. The media used in persistent storage 2108 may also be removable. For example, a removable hard disk drive may be used for persistent storage 2108.

[0199] In these illustrative examples, communication unit 2110 provides communication with other data processing systems or devices. In these illustrative examples, communication unit 2110 is a network interface card.

[0200] Input / output unit 2112 allows data input and output to other devices that can be connected to data processing system 2100. For example, input / output unit 2112 can provide a connection for user input via at least one of a keyboard, mouse, or some other suitable input device. Furthermore, input / output unit 2112 can send output to a printer. Display 2114 provides a mechanism for displaying information to the user.

[0201] Instructions for at least one of an operating system, application, or program may be located in storage device 2116, which is in communication with processor unit 2104 via communication framework 2102. Processes of the different embodiments may be executed by processor unit 2104 using computer-implemented instructions that may be located in memory (e.g., memory 2106).

[0202] These instructions are program instructions, and are also referred to as program code, computer-usable program code, or computer-readable program code, which can be read and executed by a processor in processor unit 2104. The program code in different embodiments may be implemented on different physical or computer-readable storage media, such as memory 2106 or persistent storage 2108.

[0203] Program code 2118 is functionally located on computer-readable medium 2120, which can be selectively removed and loaded onto or transferred to data processing system 2100 for execution by processor unit 2104. In these illustrative examples, program code 2118 and computer-readable medium 2120 form computer program product 2122. In the illustrative examples, computer-readable medium 2120 is computer-readable storage medium 2124.

[0204] Computer-readable storage medium 2124 is a physical or tangible storage device for storing program code 2118, and not a medium for propagating or transmitting program code 2118. As used herein, computer-readable storage medium 2124 should not be construed as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0205] Alternatively, program code 2118 may be transmitted to data processing system 2100 using a computer-readable signal medium. The computer-readable signal medium is a signal and may be, for example, a propagated data signal containing program code 2118. For example, the computer-readable signal medium may be at least one of an electromagnetic signal, an optical signal, or any other suitable type of signal. These signals may be transmitted via a connection such as a wireless connection, fiber optic cable, coaxial cable, wire, or any other suitable type of connection.

[0206] Furthermore, as used herein, "computer-readable medium 2120" can be singular or plural. For example, program code 2118 may be located in computer-readable medium 2120 in the form of a single storage device or system. In another example, program code 2118 may be located in computer-readable media 2120 distributed across multiple data processing systems. In other words, some instructions in program code 2118 may be located in one data processing system, while other instructions in program code 2118 may be located in another data processing system. For example, a portion of program code 2118 may be located in a computer-readable medium 2120 in a server computer, while another portion of program code 2118 may be located in a computer-readable medium 2120 in a group of client computers.

[0207] The different components shown for data processing system 2100 do not imply an architectural limitation on how different embodiments can be implemented. In some illustrative examples, one or more components may be incorporated into or otherwise form part of another component. For example, in some illustrative examples, memory 2106 or a portion thereof may be incorporated into processor unit 2104. Different illustrative embodiments may be implemented in data processing systems that include components in addition to, or in place of, those described for data processing system 2100. Other components shown in Figure 21 may differ from the illustrative example shown. Different embodiments can be implemented using any hardware device or system capable of running program code 2118.

[0208] Therefore, the illustrative examples provide a computer-implemented method, computer system, and computer program product for matching information. The computer system identifies a first central node in a first subgraph and a second central node in a second subgraph. The computer system identifies adjacent node groups having adjacent nodes from both the first and second subgraphs. Each adjacent node group in the adjacent node groups has adjacent nodes of the same node type. The computer system creates a set of clusters from each adjacent node group, such that each cluster in the set of clusters has adjacent nodes from both the first and second subgraphs. The computer system identifies the best matching node pair in each cluster in the set of clusters to form a set of best matching node pairs, wherein the adjacent nodes in each best matching node pair include a first adjacent node from the first subgraph and a second adjacent node from the second subgraph. The computer system uses the first central node, the second central node, and the set of best matching node pairs to determine whether the first central node and the second central node match based on the total distance between them.
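The flow summarized in this paragraph can be illustrated with a minimal Python sketch. It is not the patented implementation: the string distance (`value_distance`, built on `difflib.SequenceMatcher`), the plain-mean aggregation, the `threshold` value, and all function names are assumptions chosen for illustration, and each node-type group is treated as a single cluster for brevity.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def value_distance(a: str, b: str) -> float:
    """Hypothetical node distance: 0.0 for identical strings, 1.0 for disjoint ones."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_by_type(neighbors_1, neighbors_2):
    """Group (node_type, value) neighbors by type; keep types present in both subgraphs."""
    groups = defaultdict(lambda: ([], []))
    for node_type, value in neighbors_1:
        groups[node_type][0].append(value)
    for node_type, value in neighbors_2:
        groups[node_type][1].append(value)
    return {t: g for t, g in groups.items() if g[0] and g[1]}


def best_pair(values_1, values_2):
    """Return (distance, value_1, value_2) for the closest cross-subgraph pair."""
    return min((value_distance(a, b), a, b) for a, b in product(values_1, values_2))


def centers_match(center_1, center_2, neighbors_1, neighbors_2, threshold=0.25):
    """Decide a match from the central nodes plus the best pair of each group."""
    distances = [value_distance(center_1, center_2)]
    for values_1, values_2 in group_by_type(neighbors_1, neighbors_2).values():
        distances.append(best_pair(values_1, values_2)[0])
    total = sum(distances) / len(distances)  # assumed aggregation: plain mean
    return total <= threshold, total


if __name__ == "__main__":
    n1 = [("phone", "555-0100"), ("address", "12 Main Street"), ("email", "js@example.com")]
    n2 = [("phone", "555-0100"), ("address", "12 Main St"), ("employer", "Acme")]
    print(centers_match("John Smith", "Jon Smith", n1, n2))
```

In this sketch the `email` and `employer` neighbors are ignored because each appears in only one subgraph, which mirrors the requirement that every group contain adjacent nodes from both subgraphs.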

[0209] As a result, compared to current techniques that do not compare a central node together with the neighboring nodes in its subgraph, the different illustrative examples can reduce at least one of the time or resources used in determining whether information matches. Furthermore, the different illustrative examples can improve the accuracy of matching pieces of information in at least one of first-order matching or second-order matching.

[0210] For purposes of illustration and description, descriptions of various illustrative embodiments are given and are not intended to be exhaustive or limited to the embodiments disclosed. The various illustrative examples describe components that perform actions or operations. In the illustrative embodiments, components may be configured to perform the described actions or operations. For example, a component may have a configuration or design for a structure that provides the component with the ability to perform the actions or operations described in the illustrative examples as being performed by the component. Furthermore, with regard to the terms “comprising,” “including,” “having,” “containing,” and variations thereof as used herein, these terms are intended to be inclusive in a manner similar to the term “comprising” as an open-ended transitional term, without excluding any additional or other elements.

[0211] Various embodiments of the invention have been described for illustrative purposes, but are not exhaustive or limited to the disclosed embodiments. Not all embodiments include all the features described in the illustrative examples. Furthermore, different illustrative embodiments may provide different features compared to other illustrative embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or improvements to existing technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for matching information, the method comprising: the computer system identifies a first central node in a first subgraph and a second central node in a second subgraph; the computer system identifies adjacent node groups having adjacent nodes from both the first subgraph and the second subgraph, wherein each adjacent node group in the adjacent node groups has adjacent nodes of the same node type; the computer system creates a set of clusters from each group of adjacent nodes, such that each cluster in the set of clusters has adjacent nodes from both the first subgraph and the second subgraph; the computer system identifies the best matching node pair of the neighboring nodes in each cluster of the cluster set to form a best matching node pair set, wherein each best matching node pair includes a first neighboring node from the first subgraph and a second neighboring node from the second subgraph; the computer system uses the first central node, the second central node, and the best matching node pair set to determine whether the first central node and the second central node match; an information manager uses the subgraphs to perform record comparisons, wherein the information manager manages copies of information in record form located in a repository; and when a match is identified in the records, coordination is performed.

2. The method of claim 1, wherein the computer system identifying adjacent node groups having adjacent nodes from both the first subgraph and the second subgraph, wherein each adjacent node group has adjacent nodes of the same node type, includes: the computer system places the adjacent nodes into initial groups based on the node type of the adjacent nodes from each subgraph; and the computer system selects, from the initial groups, each initial group having adjacent nodes from both the first subgraph and the second subgraph, to form the adjacent node groups having adjacent nodes from both the first subgraph and the second subgraph.

3. The method of claim 1, wherein the computer system creating a set of clusters from each group of adjacent nodes, such that each cluster in the set of clusters has adjacent nodes from both the first subgraph and the second subgraph, includes: the computer system creates candidate clusters within each adjacent node group in the adjacent node groups; and the computer system selects, from the candidate clusters, each cluster that has adjacent nodes from both the first subgraph and the second subgraph, to form the cluster set.
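The two steps of this claim (forming candidate clusters, then keeping only those that span both subgraphs) can be sketched as follows. The claim does not say how candidate clusters are formed, so the greedy threshold clustering, the `max_distance` value, the string distance, and the `(origin, value)` node representation are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def value_distance(a: str, b: str) -> float:
    """Hypothetical node distance in [0, 1] based on string similarity."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_clusters(nodes, max_distance=0.4):
    """Greedy clustering: a node joins the first cluster that already holds
    a member within max_distance; otherwise it starts a new cluster.

    nodes is a list of (origin, value) where origin is 1 or 2 (the subgraph).
    """
    clusters = []
    for origin, value in nodes:
        for cluster in clusters:
            if any(value_distance(value, other) <= max_distance for _, other in cluster):
                cluster.append((origin, value))
                break
        else:
            clusters.append([(origin, value)])
    return clusters


def clusters_spanning_both(clusters):
    """Keep only clusters that contain nodes from subgraph 1 and subgraph 2."""
    return [c for c in clusters if {origin for origin, _ in c} == {1, 2}]


if __name__ == "__main__":
    address_group = [
        (1, "12 Main Street"),
        (2, "12 Main St"),
        (1, "90 Harbor Road"),
        (2, "7 Elm Avenue"),
    ]
    print(clusters_spanning_both(candidate_clusters(address_group)))
```

Clusters that contain nodes from only one subgraph are dropped because they cannot yield a cross-subgraph node pair.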

4. The method of claim 1, wherein, The computer system identifies the best-matching node pair in each cluster of the cluster set, including: The computer system determines the adjacency distance of the compared adjacent nodes in the cluster based on the compared adjacent nodes, the links between the compared adjacent nodes, and the depth of the compared adjacent nodes; and The computer system identifies the best-matching node pair in each cluster of the cluster set as the two nodes with the shortest adjacent distance in the cluster, so as to form the set of best-matching node pairs in the cluster set.

5. The method according to claim 4, wherein, The adjacency distance of the neighboring nodes in the cluster is calculated using one of the following equations based on the compared neighboring nodes, the links of the compared neighboring nodes, and the depth of the compared neighboring nodes: Where distance(x, y) is the distance between node x and node y in the cluster, depth(x, y) is the average depth of the first depth of node x and the second depth of node y, and const is a constant value greater than 0 and less than or equal to 1; and Where distance(x, y) is the distance between node x and node y in the cluster, depth(x, y) is the average depth of the first depth of node x and the second depth of node y, and const is a constant value greater than 0 and less than or equal to 1.

6. The method according to claim 1, wherein, The computer system determines whether the first central node and the second central node match using the first central node, the second central node, and the set of best-matching nodes, including: The computer system uses the first central node, the second central node, and the best-matching node pair set in the cluster set to determine the total distance between the first central node and the second central node as follows: Where distance(CenterNode1, CenterNode2) is the distance between the first center node and the second center node, dH(x, y) is the distance between adjacent nodes x and y in the best-matching node pair, and M is the number of node types with best-matching adjacent node pairs in the group; and The computer system determines whether the first central node and the second central node match based on the total distance calculated between the first central node and the second central node.

7. The method according to claim 1, wherein, The computer system determines whether the first central node and the second central node match using the first central node, the second central node, and the set of best-matching nodes, including: The computer system compares the first central node and the second central node to determine the comparison characteristics of the first central node and the second central node; The computer system determines the distance feature based on the minimum distance between adjacent nodes in each cluster of the cluster set; The computer system uses the comparison feature and the distance feature to determine the total distance between the first center node and the second center node; and The computer system determines whether the total distance is within the threshold for matching the first central node and the second central node.
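The decision described in this claim (combine a comparison feature for the central nodes with per-cluster distance features, then test the total against a threshold) can be sketched as below. The weighted mean used here is an assumed stand-in and is not the coefficient-vector formula referred to in claim 8, which is not reproduced in this text; the function names, weights, and threshold are likewise hypothetical.

```python
def total_distance(comparison_feature, distance_features, weights=None):
    """Combine the central-node comparison feature with the per-cluster
    minimum distances using an assumed weighted mean."""
    features = [comparison_feature, *distance_features]
    if weights is None:
        weights = [1.0] * len(features)  # assumption: equal weights by default
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features)) / sum(weights)


def is_match(total, threshold=0.3):
    """Central nodes are treated as matching when the total is within the threshold."""
    return total <= threshold


if __name__ == "__main__":
    total = total_distance(0.4, [0.0, 0.2], weights=[2.0, 1.0, 1.0])
    print(total, is_match(total))
```

Giving the comparison feature a larger weight, as in the usage example, lets the central nodes themselves dominate the decision while the neighboring clusters act as supporting evidence.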

8. The method according to claim 7, wherein, The total distance between the first center node and the second center node is determined as follows: Where cv(i) is the coefficient vector, fv(i) is the feature vector including the comparison feature and the distance feature, max(cv) is the element with the maximum value in the coefficient vector, min(cv) is the element with the minimum value in the coefficient vector, i is the index value, and n is the number of elements in the feature vector.

9. A method for matching information, the method comprising: the computer system assigns adjacent nodes of two central nodes in two subgraphs into groups based on node type, wherein the groups include adjacent nodes from both of the two subgraphs; the computer system uses the Hausdorff distance to select the best matching node pair for each neighboring node group to form a set of best matching node pairs for the neighboring node groups, wherein the best matching node pairs in the set of best matching node pairs have neighboring nodes from each of the two subgraphs; the computer system determines the total distance between the two central nodes using the two central nodes and the set of best matching node pairs of the adjacent nodes, wherein the total distance between the two central nodes takes into account each of the two central nodes and the set of best matching node pairs; the total distance between the two central nodes is used to determine whether a match exists between them; an information manager uses the subgraphs to perform record comparisons, wherein the information manager manages copies of information in record form located in a repository; and when a match is identified in the records, coordination is performed.
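For reference, the standard Hausdorff distance between two finite sets is the larger of the two directed distances, where each directed distance is the greatest nearest-neighbor distance from one set to the other. The sketch below implements that textbook definition and, separately, picks the closest cross-set pair. How the claimed method couples the two is not spelled out in this text, so `best_matching_pair` and the caller-supplied `dist` function are illustrative assumptions.

```python
from itertools import product


def hausdorff_distance(set_a, set_b, dist):
    """Standard Hausdorff distance between two non-empty finite collections.

    dist is any caller-supplied pairwise distance function (an assumption here).
    Raises ValueError if either collection is empty.
    """
    def directed(xs, ys):
        # Farthest that any x is from its nearest neighbor in ys.
        return max(min(dist(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))


def best_matching_pair(set_a, set_b, dist):
    """Return the (a, b) pair, one node from each set, with the smallest distance."""
    return min(product(set_a, set_b), key=lambda pair: dist(*pair))


if __name__ == "__main__":
    def absolute_difference(x, y):
        return abs(x - y)

    a, b = [1, 5], [2, 9]
    print(hausdorff_distance(a, b, absolute_difference))
    print(best_matching_pair(a, b, absolute_difference))
```

A small Hausdorff distance means every node in each set has a close counterpart in the other set, while the best matching pair identifies only the single closest counterpart.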

10. The method of claim 9, further comprising: the computer system clusters adjacent nodes of the same node type in the groups to form a set of clusters, wherein each cluster in the set of clusters has at least one adjacent node from each of the two subgraphs; wherein the computer system using the Hausdorff distance to select the best matching node pair for each neighboring node group, to form the set of best matching node pairs in which the best matching node pairs have neighboring nodes from each of the two subgraphs, includes: the computer system uses the Hausdorff distance to select the best matching node pair of the neighboring nodes for each cluster, to form the set of best matching node pairs of the neighboring nodes for the set of clusters, wherein the best matching node pairs in the set of best matching node pairs have neighboring nodes from each of the two subgraphs.

11. The method according to claim 10, wherein, The computer system assigns adjacent nodes of the two central nodes in the two subgraphs into groups according to the node type, wherein the groups include adjacent nodes from both subgraphs, including: The computer system places the neighboring nodes from the neighboring nodes into an initial group based on the node type of the neighboring nodes from each of the two subgraphs; and The computer system selects each initial group from the initial group, which has adjacent nodes from both of the two subgraphs.

12. An information management system, comprising: A computer system that executes program instructions, used for: Identify the first center node in the first subgraph and the second center node in the second subgraph; Identify adjacent node groups having adjacent nodes from both the first subgraph and the second subgraph, wherein adjacent node groups in the adjacent node groups have adjacent nodes of the same node type; Create a cluster set from each group of adjacent nodes, such that each cluster in the cluster set has adjacent nodes from both the first subgraph and the second subgraph; Identify the best matching node pairs of the neighboring nodes in each cluster of the cluster set to form the best matching node pair set, wherein each best matching node pair includes a first neighboring node from the first subgraph and a second neighboring node from the second subgraph; The first center node, the second center node, and the best matching node pair set are used to determine whether the first center node and the second center node match. The information manager uses a subgraph to perform record comparisons, wherein the information manager manages copies of information in record form located in a repository; and When a match is identified in the record, coordination is performed.

13. The information management system according to claim 12, wherein, Identify adjacent node groups from both the first subgraph and the second subgraph, wherein the adjacent node groups have adjacent nodes of the same node type, and the computer system executes the program instructions to: The adjacent nodes are placed in the initial group based on the node type of the adjacent nodes from each subgraph; and Each initial group is selected from the initial group having adjacent nodes from both the first subgraph and the second subgraph of the adjacent node, to form the adjacent node group having adjacent nodes from both the first subgraph and the second subgraph.

14. The information management system according to claim 12, wherein, in creating a cluster set from each group of adjacent nodes such that each cluster in the cluster set has adjacent nodes from both the first subgraph and the second subgraph, the computer system executes the program instructions to: create candidate clusters within each of the adjacent node groups; and select, from the candidate clusters, each cluster that has adjacent nodes from both the first subgraph and the second subgraph, to form the cluster set.

15. The information management system of claim 12, wherein the best-matching node pair in each cluster of the cluster set is identified, and the computer system executes the program instructions to: Based on the compared neighboring nodes in the cluster, the links between the compared neighboring nodes, and the depth of the compared neighboring nodes, determine the adjacent distances of the compared neighboring nodes in the cluster; and The best-matching node pair for each cluster in the cluster set is identified as the two nodes in the cluster with the shortest adjacent distance, to form the set of best-matching node pairs for the cluster set.

16. The information management system according to claim 15, wherein, The adjacency distance of the neighboring nodes in the cluster is calculated using one of the following equations based on the compared neighboring nodes, the links of the compared neighboring nodes, and the depth of the compared neighboring nodes: Where distance(x, y) is the distance between node x and node y in the cluster, depth(x, y) is the average depth of the first depth of node x and the second depth of node y, and const is a constant value greater than 0 and less than or equal to 1; and Where distance(x, y) is the distance between node x and node y in the cluster, depth(x, y) is the average depth of the first depth of node x and the second depth of node y, and const is a constant value greater than 0 and less than or equal to 1.

17. The information management system according to claim 12, wherein, The computer system uses the first central node, the second central node, and the set of best-matching nodes to determine whether the first central node and the second central node match, and executes the program instructions to: The total distance between the first central node and the second central node is determined using the first central node, the second central node, and the best-matching node pair in the cluster set as follows: Where distance(CenterNode1, CenterNode2) is the distance between the first center node and the second center node, dH(x, y) is the distance between adjacent nodes x and y in the best-matching node pair, and M is the number of node types with best-matching adjacent node pairs in the group; and Whether the first center node and the second center node match is determined based on the total distance calculated between the first center node and the second center node.

18. The information management system according to claim 17, wherein, The computer system uses the first central node, the second central node, and the set of best-matching nodes to determine whether the first central node and the second central node match, and executes the program instructions to: The first central node and the second central node are compared to determine the comparison characteristics of the first central node and the second central node; The distance feature is determined based on the minimum distance between the adjacent nodes of each cluster in the cluster set; The total distance between the first center node and the second center node is determined using the comparison feature and the distance feature; and Determine whether the total distance is within the threshold for matching the first center node and the second center node.

19. The information management system according to claim 18, wherein, The total distance between the first center node and the second center node is determined as follows: Where cv(i) is the coefficient vector, fv(i) is the feature vector including the comparison feature and the distance feature, max(cv) is the element with the maximum value in the coefficient vector, min(cv) is the element with the minimum value in the coefficient vector, i is the index value, and n is the number of elements in the feature vector.

20. An information management system, comprising: A computer system that executes program instructions, used for: Based on node type, the adjacent nodes of the two central nodes in the two subgraphs are grouped together, wherein the group includes adjacent nodes from both of the two subgraphs; The Hausdorff distance is used to select the best matching node pair for each neighboring node group to form a set of best matching node pairs for the neighboring node groups, wherein the best matching node pairs in the set of best matching node pairs have neighboring nodes from each of the two subgraphs. The total distance between the two center nodes is determined using the set of best-matching node pairs of the two center nodes and the adjacent nodes, wherein the total distance between the two center nodes takes into account the set of best-matching node pairs of each of the two center nodes. The existence of a match between the two central nodes is determined based on the total distance between them; and The information manager uses a subgraph to perform record comparisons, wherein the information manager manages copies of information in the form of records located in a repository; When a match is identified in the record, coordination is performed.

21. The information management system according to claim 20, wherein the computer system executes the program instructions to: cluster adjacent nodes of the same node type in the group to form a set of clusters, wherein each cluster in the cluster set has at least one neighboring node from each of the two subgraphs; and wherein, in using the Hausdorff distance to select the best matching node pair for each neighboring node group to form the set of best matching node pairs for the neighboring node groups, in which the best matching node pairs have neighboring nodes from each of the two subgraphs, the computer system executes the program instructions to: use the Hausdorff distance to select the best matching node pair of the neighboring nodes for each cluster to form the set of best matching node pairs of the neighboring nodes for the cluster set, wherein the best matching node pairs in the set of best matching node pairs have neighboring nodes from each of the two subgraphs.

22. The information management system according to claim 20, wherein, The computer system assigns adjacent nodes of the two central nodes in the two subgraphs to groups according to the node type, wherein the groups include adjacent nodes from both subgraphs, and executes program instructions to: Based on the node type of the adjacent nodes, the adjacent nodes from each of the two subgraphs are placed in an initial group; and Select each initial group from the initial groups that has adjacent nodes from both of the two subgraphs.

23. A computer program product for matching information, the computer program product comprising a computer-readable storage medium having program instructions embodied therein, the program instructions being executable by a computer system to cause the computer to perform a method, the method comprising: The computer system identifies the first central node in the first subgraph and the second central node in the second subgraph. The computer system identifies adjacent node groups having adjacent nodes from both the first subgraph and the second subgraph, wherein adjacent node groups in the adjacent node groups have adjacent nodes of the same node type; The computer system creates a set of clusters from each group of adjacent nodes, such that each cluster in the set of clusters has adjacent nodes from both the first subgraph and the second subgraph; The computer system identifies the best matching node pairs of the neighboring nodes in each cluster of the cluster set to form the best matching node pair set, wherein each best matching node pair includes a first neighboring node from the first subgraph and a second neighboring node from the second subgraph; The computer system uses the first central node, the second central node, and the set of best matching nodes to determine whether the first central node and the second central node match. The information manager uses a subgraph to perform record comparisons, wherein the information manager manages copies of information in record form located in a repository; and When a match is identified in the record, coordination is performed.

Citation Information

Patent Citations

  • Sub-graph isomorphic matching result merging method, electronic equipment and storage medium

    CN111382315A

  • Multi-source transfer learning method based on graph structure

    CN112085085A