Record matching in database systems
By processing the attribute values of unstructured objects associated with records in the database system, the method addresses the low efficiency and insufficient accuracy of data matching in existing technologies and achieves better matching results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INTERNATIONAL BUSINESS MACHINE CORPORATION
- Filing Date
- 2022-03-28
- Publication Date
- 2026-06-16
AI Technical Summary
In existing master data management systems, the data matching process suffers from inefficiency and inaccuracy, especially when dealing with unstructured data objects, where it is difficult to effectively identify and link records of the same entity.
By processing unstructured objects in database records, a set of unstructured attribute values is identified, and a weighted matching rule based on occurrence characteristics and contribution weights is used to compare the similarity between records to determine whether they represent the same entity.
The approach improves the accuracy and efficiency of record matching, especially when dealing with unstructured data, enabling more precise identification and merging of records that represent the same entity.
Figure CN115221936B_ABST
Abstract
Description
Technical Field
[0001] This invention generally relates to the field of digital computer systems, and more specifically, to a method for record matching in a database system.
Background Technology
[0002] Enterprise data matching processing matches and links customer data received from various sources, creating a single version of the truth. Master Data Management (MDM) based solutions work with enterprise data, performing indexing, matching, and linking of the data. The MDM system provides access to this data. However, there is an ongoing need to improve the matching of data within the MDM system.
Summary of the Invention
[0003] Various embodiments provide methods, computer systems, and computer program products as described in the independent claims. Advantageous embodiments are described in the dependent claims. Embodiments of the invention may be freely combined with each other if they are not mutually exclusive. In one aspect, the invention relates to a computer-implemented method for matching records in a database system, wherein the records represent entities and are associated with one or more unstructured data objects. The method includes: processing unstructured objects of each record in the records of a database (e.g., a database system) to identify a set of one or more attribute values (hereinafter referred to as unstructured attribute values) in the unstructured object of each record; comparing the sets of unstructured attribute values of two records in the database to determine the degree of similarity between the two sets; and determining, based on the comparison result, whether the two records represent the same entity.
[0004] In another aspect, the present invention relates to a computer program product comprising a computer-readable storage medium having computer-readable program code configured to implement all the steps of the method according to the foregoing embodiments. In another aspect, the present invention relates to a computer system for record matching, wherein the records represent entities and are associated with one or more unstructured data objects. The computer system is configured to: process the unstructured objects of each record in a database to identify a set of one or more attribute values (hereinafter referred to as unstructured attribute values) in the unstructured objects of each record; compare the sets of unstructured attribute values of two records in the database to determine the degree of similarity between the two sets; and, based on the comparison result, determine whether the two records represent the same entity.
Attached Figure Description
[0005] The embodiments of the present invention will now be explained in more detail by way of example and with reference to the accompanying drawings, wherein:
[0006] Figure 1 is a block diagram of a database device according to an example of the present invention.
[0007] Figure 2 is a flowchart illustrating a method for record matching in a database system according to an example of the present invention.
[0008] Figure 3 is a flowchart illustrating a method for record matching in a database system according to an example of the present invention.
[0009] Figure 4 is a flowchart illustrating a method for record matching in a database system according to an example of the present invention.
[0010] Figure 5A is a flowchart illustrating a method for comparing two records according to an example of the present invention.
[0011] Figure 5B shows the records associated with the unstructured object and the resulting set of unstructured attribute values.
[0012] Figure 5C shows the records associated with the unstructured object and the resulting set of unstructured attribute values.
[0013] Figure 5D shows the comparison result between two sets of unstructured attribute values.
[0014] Figure 6 shows a computerized system adapted to implement one or more method steps as described in this invention.
Detailed Implementation
[0015] The description of various embodiments of the present invention is presented for illustrative purposes and is not intended to be exhaustive or limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or improvements to existing technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
[0016] Service provider computer systems typically include storage for information related to customers and services. Information is provided when customers fill out registration forms, service request forms, or other forms (such as contract documents, bank transactions, certificates, etc.). This may result in unstructured objects being stored in these systems. Unstructured objects can be objects that include attribute values in an unstructured form. These attributes are named unstructured attributes to distinguish them from the structured attributes of a record. Unstructured objects allow for the association of attributes with their corresponding attribute values. Unstructured objects can be files, documents, or objects that include free-form text or embedded values. Examples of unstructured objects may include word processing documents (e.g., Microsoft Word documents in native formats), Adobe Acrobat documents, emails, image files, video files, audio files, and other files in native formats relative to the software applications that created them. Furthermore, such computer systems may store customer and service data in a structured format as records in a database. This may result in the same entity, such as a person, being associated with information in different formats within the system. For example, a record representing a particular person may be associated with unstructured information about that particular person. A data record, or record, is a collection of related data items (such as name, date of birth, and category) for a specific user. A record represents an entity, where an entity is a user, object, or concept whose information is stored in that record. The terms "data record" and "record" are used interchangeably.
[0017] Therefore, such a computer system can include records with attributes that are named structured attributes and unstructured objects with unstructured attribute values. The value of an unstructured attribute can be the full value or a part of the full value of the unstructured attribute. For example, the value "street" can be the value of the unstructured attribute "address"; however, the value "street" is only a part of the address, and other values (such as the city name) can also constitute the full or complete value of the address.
[0018] Most operations performed in these computer systems involve matching records. Record matching involves comparing the structured attribute values of records. Matching records (mergeable records) are records that represent the same entity. The degree of matching between two records indicates the similarity of their attribute values.
[0019] This invention can be advantageous because it can improve the record matching process by utilizing existing unstructured objects, such as documents. The invention can draw on insights from unstructured documents when making matching decisions. This can be particularly advantageous because many industrial systems store large amounts of unstructured information attached to master data. For example, in the insurance industry, many insurance contracts are attached to customer records. In the manufacturing and utility industries, repair and maintenance manuals are attached to product records.
[0020] According to one embodiment, the method further includes: evaluating one or more occurrence characteristics for each identified unstructured attribute value, wherein the occurrence characteristics of a particular unstructured attribute value identified in one or more unstructured objects of a particular record include any one of the following: the frequency of occurrence of the particular unstructured attribute value in the unstructured objects of the particular record (referred to as a first occurrence characteristic), and an indication of other identified unstructured attribute values collocated with the particular unstructured attribute value in the unstructured objects of the particular record (referred to as a second occurrence characteristic), wherein comparing two sets of unstructured attribute values includes: comparing the occurrence characteristics of the unstructured attribute values of one set of the two evaluated sets with the occurrence characteristics of the unstructured attribute values of the other set of the two evaluated sets.
[0021] For example, if identical values exist in both sets and have the same frequency of occurrence in both sets, this can provide a strong indication of the similarity between the records associated with the two sets. Similarly, if identical values exist in both sets and are collocated with the same other values in both sets, this can provide a strong indication of the similarity between the records associated with the two sets. Therefore, the occurrence characteristics can further improve the accuracy of comparisons between the two sets, and thus improve the accuracy of record matching processing.
[0022] According to one embodiment, the records have values of attributes (hereinafter referred to as structured attributes), wherein determining whether the two records represent the same entity includes: assigning an initial contribution weight to each of the structured attributes; selecting, based on the comparison result, similar unstructured attribute values that exist in both sets; if the value of a structured attribute does not match any of the selected unstructured attribute values, replacing the contribution weight of that structured attribute with a weight indicating the similarity between the two sets; if the value of a structured attribute fully or partially matches the selected unstructured attribute values, increasing the contribution weight of that structured attribute; and using the contribution weights to compare the two records.
[0023] This embodiment enables a weighted matching rule that assigns weights (e.g., integer weights) to each structured attribute of the compared records. For each structured attribute, the associated contribution weight can be multiplied by the similarity score of the two records' values for that attribute, and the weighted scores can be summed. If the sum equals or exceeds a threshold, the two compared records are considered a match. This embodiment can be particularly advantageous when comparing a large number of attributes, because it prevents a single attribute that differs between the two records from causing a mismatch.
[0024] According to one embodiment, the method further includes: executing an aggregation algorithm to aggregate the values constituting the full values of the selected unstructured attribute values, thereby producing zero or more aggregated values, wherein the processed selected unstructured attribute values are used to perform a comparison with the structured attribute values.
[0025] For example, the selected unstructured attribute values may include values v1, v2, ..., vr. Each value is the value of the corresponding unstructured attribute. However, some values may not be the full value of the corresponding unstructured attribute. For example, v1 may be a first name value, and v2 may be a last name value; both are values of the unstructured attribute "full name". However, v1 and v2 are not full values. The aggregation of v1 and v2 may be the full value of the unstructured attribute "full name". Each selected unstructured attribute value can be processed to determine whether it is the full value of the corresponding unstructured attribute, and if it is not a full value, other non-full values to be aggregated with it can be determined. A second occurrence characteristic evaluated for the selected unstructured attribute values can be used to determine which values can be aggregated together, for example, if the values are juxtaposed in the same sentence or paragraph and are not the full values of the same unstructured attribute, then these values can be aggregated. Alternatively or additionally, according to one embodiment, the aggregation includes: grouping the unstructured attribute values of each set into groups based on the category of the unstructured attribute, wherein aggregation is performed on values belonging to the same group. Aggregation is performed on values that belong to the same group and also based on the second occurrence characteristic of that group.
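The aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `aggregate`, the representation of the selected values as `(attribute, value)` pairs in reading order, and the representation of the second occurrence characteristic as a set of collocated value pairs are all hypothetical and not taken from the description.

```python
from collections import defaultdict

def aggregate(selected, collocated):
    """Join partial values of the same attribute that are collocated.

    selected   -> list of (attribute, value) pairs, in reading order
    collocated -> set of frozenset({value_a, value_b}) pairs that were
                  mentioned together in a sentence or paragraph
    Returns zero or more (attribute, aggregated_value) pairs.
    """
    # Group the values by the category of their unstructured attribute.
    groups = defaultdict(list)
    for attribute, value in selected:
        groups[attribute].append(value)

    aggregated = []
    for attribute, values in groups.items():
        # Aggregate only groups with several values that all co-occur pairwise.
        if len(values) > 1 and all(
            frozenset((a, b)) in collocated
            for i, a in enumerate(values)
            for b in values[i + 1:]
        ):
            aggregated.append((attribute, " ".join(values)))
    return aggregated

selected = [("full name", "John"), ("full name", "Baker"), ("country", "USA")]
collocated = {frozenset(("John", "Baker"))}
full_values = aggregate(selected, collocated)
```

Here "John" and "Baker" are joined into one full value for "full name", while "USA" is left alone because its group has a single value.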
[0026] According to one embodiment, comparing two records includes: comparing the values of structured attributes of the two records to obtain an individual matching score for each structured attribute of the records; and combining the individual matching scores using contribution weights. The matching score can be, for example, a value ranging from 0 to 100, representing the degree of similarity between the two values. A value of 100 indicates that the two values are identical, and a value of zero indicates no similarity.
[0027] According to one embodiment, if the two records represent the same entity, the two records are merged into a single record; otherwise, the two records are kept separate.
[0028] Figure 1 depicts an exemplary computer system 100. The computer system 100 may be configured, for example, to perform master data management and/or data warehousing; for instance, the computer system 100 may implement a deduplication system. The computer system 100 includes a data integration system 101 and one or more client systems or data sources 105. The client system 105 may include a computer system (e.g., as described with reference to Figure 6). Client system 105 can communicate with data integration system 101 via a network connection, including, for example, a wireless local area network (WLAN) connection, a WAN (wide area network) connection, a LAN (local area network) connection, the Internet, or a combination thereof. Data integration system 101 can control access (read access and write access, etc.) to a database or repository 103, which is referred to herein as a structured repository because it includes structured records 107. Data integration system 101 can control access (read access and write access, etc.) to another repository 110, which is referred to herein as an unstructured repository because it includes unstructured objects 111.
[0029] As shown in Figure 1, each structured record 107 stored in the structured repository 103 may have values for a set of attributes a_1, ..., a_N (N ≥ 1) (such as name attributes). Although this example is described with few attributes, more or fewer attributes may be used. Each record 107 may represent an entity, such as a person. Data records 107 stored in the central repository 103 may be received from client systems 105 and processed by the data integration system 101 (e.g., to transform them into a uniform structure) before being stored in the central repository 103. For example, records received from client systems 105 may have a different structure than those stored in the central repository 103. In another example, the data integration system 101 may import the data records of the central repository 103 from client systems 105 using one or more Extract Transform Load (ETL) batch processes, via Hypertext Transfer Protocol ("HTTP") communication, or via other types of data exchange.
[0030] Unstructured objects 111 may include, for example, scanned documents or forms. Unstructured objects 111 may be received, for example, from client system 105. As shown by the dashed lines, each entity or record in structured repository 103 may be associated with one or more unstructured objects in unstructured repository 110. For example, each record R_i of at least a portion of the records 107 in structured repository 103 may be associated with m_i unstructured objects OB_1, OB_2, ..., OB_(m_i) in unstructured repository 110, where m_i ≥ 1. For example, an employee who has a record in structured repository 103 describing their name, SSN, etc., may also have their employment contract scanned and stored in unstructured repository 110. Unstructured objects 111 may be provided by a CMIS-enabled system. For example, in an MDM system (such as IBM MDM), an out-of-the-box (OOTB) connector to a content management system (such as Filenet) may enable access to unstructured objects. For example, master data records in an MDM system can be associated with unique resource locators to allow the retrieval of associated unstructured documents within a content management system based on standards such as CMIS.
[0031] Data integration system 101 can be configured to process record 107 and unstructured object 111 using one or more algorithms (such as algorithm 120) that implement at least a portion of this method. For example, data integration system 101 can use algorithm 120 to process data record 107 and unstructured object 111 to identify duplicate records in structured repository 103. Although shown as separate components, in another example, repository 103 and / or repository 110 may be part of data integration system 101.
[0032] In one example, algorithm 120 may include a matching engine 121 for matching records. Algorithm 120 further includes a token package comparator 122 for comparing the sets or packages of unstructured attribute values associated with the records to be compared. Algorithm 120 further includes a token package manager 123 for managing the packages determined by a token extractor 124.
[0033] Figure 2 is a flowchart illustrating a method for record matching in a system according to an example of the present invention. For illustrative purposes, the method of Figure 2 may be implemented in the system shown in Figure 1, but is not limited to this implementation. The method of Figure 2 can be performed, for example, by the data integration system 101.
[0034] In step 201, the unstructured objects of each record in at least a portion of the records in database 103 can be processed to identify a set of values of one or more unstructured attributes in the unstructured objects of each record. Each record in the at least a portion of the records can be associated with one or more unstructured objects.
[0035] In one example, at least a portion of the records may include all records in database 103. For each record in the database, it can be determined whether the record is associated with one or more unstructured objects. The associated unstructured objects can then be processed to identify the values of unstructured attributes. If a record is not associated with any unstructured object, the next unprocessed record can be processed, and so on, until all records have been processed. Processing all records in the database can be advantageous because it allows for the preparation of all information that can be readily used in later stages.
[0036] In one example, the at least a portion of the records may include a subset of the records in database 103. For each record in the subset, it can be determined whether the record is associated with one or more unstructured objects. The associated unstructured objects can be processed to identify the values of unstructured attributes. The subset of records may, for example, include only records that need to be processed by a user. This example can be advantageous because it enables on-demand processing of records. This can save the resources that would otherwise be spent processing records whose results are not used.
[0037] For example, the identification of values of unstructured attributes in unstructured objects can be performed by parsing unstructured objects and performing data mining analysis to identify the values of attributes.
[0038] Therefore, processing the unstructured objects of each record R_i in step 201 can produce a package or set (named package_i) of values for unstructured attributes b_1, b_2, ..., b_(M_i), where M_i ≥ 1. The unstructured attributes b_1, b_2, ..., b_(M_i) of each set package_i may or may not include some of the structured attributes a_1, ..., a_N. For example, a record representing a student may include structured attributes such as "student ID," "class," "age," "name," etc., while a document 111 associated with the student may include values of different attributes (such as "address") and/or values of the same attributes (such as "name"). The unstructured attributes of a set package_i may or may not include unstructured attributes of another set; that is, they may or may not share unstructured attributes with another set package_j. For example, two student records may be associated with completely different documents, one with an insurance contract document and the other with a resume, resulting in different identified unstructured attributes for the two students.
[0039] Each unstructured attribute can have at least one value in the corresponding set package_i. This at least one value can include duplicate values. For example, an employee's record can be associated with documents that have been processed to identify the values of unstructured attributes, resulting in a package (package_x) with identified values for the unstructured attributes "car type," "phone number," and "address." Package package_x can include multiple values for the attribute "car type," for example, because employee "X" has several cars listed in the documents. Because the same number appears in several documents for employee "X," package package_x can include five duplicate values for the attribute "phone number." In other words, the set associated with employee X's record has three unstructured attributes, but can include multiple values for each unstructured attribute.
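A package that keeps duplicate values, as in the employee "X" example, can be sketched as a multiset. This is a minimal illustration, not the method of the description: the description only states that values are identified by parsing and data mining, so the regular expressions in `PATTERNS` and the helper names `extract_values` and `build_package` are hypothetical stand-ins for a real extraction step.

```python
import re
from collections import Counter

# Hypothetical patterns standing in for a real entity-detection step.
PATTERNS = {
    "phone number": re.compile(r"\b\d{3}-\d{4}\b"),
    "car type": re.compile(r"\b(?:sedan|coupe|pickup)\b", re.IGNORECASE),
}

def extract_values(text):
    """Return the (attribute, value) pairs found in one unstructured object."""
    pairs = []
    for attribute, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            pairs.append((attribute, match.lower()))
    return pairs

def build_package(documents):
    """Build the package of a record: a multiset of (attribute, value) pairs."""
    package = Counter()
    for text in documents:
        package.update(extract_values(text))
    return package

documents_x = [
    "Employee X owns a sedan and a pickup. Call 555-0100.",
    "Contact: 555-0100. Vehicle on file: sedan.",
]
package_x = build_package(documents_x)
```

Using a `Counter` keeps one entry per distinct value together with its number of occurrences, so duplicates such as the repeated phone number are preserved rather than discarded.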
[0040] The sets of unstructured attribute values obtained in step 201 can be used to determine whether records 107 are duplicates. For example, in step 203, the two sets package_i and package_j of the two records R_i and R_j, respectively, can be compared to determine the similarity between the two sets package_i and package_j. That is, the values of unstructured attributes b_1, b_2, ..., b_(M_i) can be compared with the values of unstructured attributes b_1, b_2, ..., b_(M_j). In one example, this comparison can be a pairwise comparison between all possible pairs of values of the two sets, or it can be a pairwise comparison between pairs of values of the same unstructured attribute. This comparison can produce individual similarity scores, which are combined to obtain a similarity score between the compared sets. In another example, the Jaccard similarity algorithm can be used to compare sets package_i and package_j. Figure 3 provides an exemplary implementation of comparison step 203. The comparison as described herein is performed between two records, but it is not limited thereto; more than two records can be compared by comparing their respective sets as described for the example of two records.
[0041] Therefore, the comparison result of step 203 can be used in step 205 to determine whether the two records represent the same entity. The degree of similarity between the two sets package_i and package_j can indicate the similarity between the two records R_i and R_j, respectively; for example, if the two packages are very similar, this indicates that the two records represent the same entity.
[0042] Figure 3 is a flowchart illustrating a method for comparing records according to an example of the present invention. For illustrative purposes, the method of Figure 3 may be implemented in the system shown in Figure 1, but is not limited to this implementation. The method of Figure 3 can be performed, for example, by the data integration system 101. The method of Figure 3 provides an exemplary implementation of comparison step 203 of Figure 2. For example, each record 107 to be compared can be associated with a corresponding package or set of unstructured attribute values, as described with reference to Figure 2.
[0043] In step 301, one or more occurrence characteristics can be evaluated for each identified unstructured attribute value. For example, each set of unstructured attribute values package_i can be processed to evaluate the occurrence characteristics of each unstructured attribute value in that set. That is, the occurrence characteristics of each value of attribute b_1 can be evaluated, the occurrence characteristics of each value of attribute b_2 can be evaluated, and so on.
[0044] In a first example, the occurrence characteristic can be the frequency of occurrence of an unstructured attribute value within the unstructured objects of a specific record. In this case, the frequency of occurrence of the value of attribute b_1 in package_i of record R_i can be determined. Similarly, the frequency of occurrence of the value of attribute b_2 in package_i of record R_i can be determined, and so on. Continuing with the example of employee X's record, the frequency of occurrence of the value of the unstructured attribute "phone number" is 5 because it appears 5 times in the documents associated with employee X.
[0045] In a second example, the occurrence characteristic of each value in the set package_i of record R_i can be an indication of the other values in the same set package_i that are collocated with that value in the unstructured objects. For example, for each record R_i, the values of the unstructured attributes of the corresponding set package_i can be processed to identify which attribute values are mentioned together in the same sentence or paragraph, and how frequently.
[0046] Therefore, in step 303, a comparison of the two records R_i and R_j can be performed by comparing the occurrence characteristics of the unstructured attribute values of the evaluated set package_i with the occurrence characteristics of the unstructured attribute values of the evaluated set package_j. For example, if the same values exist in both sets package_i and package_j and have the same frequency of occurrence in both sets, this can provide a strong indication of the similarity between records R_i and R_j.
[0047] Figure 4 is a flowchart illustrating a method for comparing records according to an example of the present invention. For illustrative purposes, the method of Figure 4 may be implemented in the system shown in Figure 1, but is not limited to this implementation. The method of Figure 4 can be performed, for example, by the data integration system 101. The method of Figure 4 can compare two records R_i and R_j.
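The Jaccard comparison mentioned for step 203 can be sketched as follows. The function name `jaccard_similarity` and the sample values are illustrative assumptions; the sketch compares only the distinct values of the two packages and ignores their frequencies, which is one of several possible readings of the description.

```python
from collections import Counter

def jaccard_similarity(package_i, package_j):
    """Jaccard similarity of the distinct values in two packages (0.0 to 1.0)."""
    values_i, values_j = set(package_i), set(package_j)
    union = values_i | values_j
    if not union:
        return 0.0  # two empty packages carry no evidence of similarity
    return len(values_i & values_j) / len(union)

# Illustrative packages: value -> frequency of occurrence.
package_i = Counter({"john": 3, "usa": 2, "baker": 1})
package_j = Counter({"john": 1, "usa": 4, "street": 2})

# Two shared values out of four distinct values overall.
similarity = jaccard_similarity(package_i, package_j)
```

A threshold on this score (the value of which the description does not specify) could then feed the decision of step 205.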
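Steps 301 and 303 can be sketched as follows, assuming sentence-level collocation, single-word values, and a list of already identified values. The function names `occurrence_characteristics` and `shared_characteristics` and the sample documents are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def occurrence_characteristics(documents, known_values):
    """Return (frequency, collocation) counters for the given values.

    frequency[v]        -> how often value v occurs across the documents
    collocation[(a, b)] -> number of sentences mentioning both a and b
    """
    frequency, collocation = Counter(), Counter()
    for text in documents:
        for sentence in re.split(r"[.!?]+", text.lower()):
            words = re.findall(r"\w+", sentence)
            present = sorted(v for v in known_values if v in words)
            for value in present:
                frequency[value] += words.count(value)
            collocation.update(combinations(present, 2))
    return frequency, collocation

def shared_characteristics(chars_i, chars_j):
    """Values with equal frequency, and value pairs collocated in both records."""
    (freq_i, coll_i), (freq_j, coll_j) = chars_i, chars_j
    same_frequency = {v for v in freq_i if freq_i[v] == freq_j.get(v)}
    same_collocation = set(coll_i) & set(coll_j)
    return same_frequency, same_collocation

values = ["john", "baker", "street", "usa"]
chars_i = occurrence_characteristics(
    ["John lives on Baker street. John works in the USA."], values)
chars_j = occurrence_characteristics(
    ["Baker street is where John lives. The USA office is new."], values)
same_frequency, same_collocation = shared_characteristics(chars_i, chars_j)
```

In this example "john" occurs twice for the first record and once for the second, so it is excluded from `same_frequency`, while the three values mentioned together in the address sentence are collocated in both records.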
[0048] In step 401, initial contribution weights can be assigned to each of the structured attributes a_1, ..., a_N. For example, to compare two employee records, the attribute "Employee ID" can be assigned a higher weight than the attribute "Name," because two employees may have the same name but are unlikely to have the same employee ID. Therefore, the employee ID can advantageously contribute more to the matching decision. For example, integer weights can be assigned to each of the structured attributes a_1, ..., a_N of the records R_i and R_j being compared.
[0049] In step 403, similar unstructured attribute values that exist in the two sets package_i and package_j of the two records R_i and R_j can be selected. This can be done, for example, by intersecting the two sets package_i and package_j. Figures 5C to 5D provide an example implementation of step 403. Step 403 can produce a collection denoted package_i ∩ package_j, which includes the selected unstructured attribute values.
[0050] In step 405, the initial contribution weights can be adjusted. This adjustment can be performed by comparing the unstructured attribute values in the intersection package_i ∩ package_j with the values of the structured attributes a_1, ..., a_N in the two records R_i and R_j. For example, if the value of a structured attribute does not match any of the selected unstructured attribute values, the contribution weight of that structured attribute can be replaced with a weight indicating the similarity between the two sets. If the value of a structured attribute fully or partially matches the selected unstructured attribute values, the contribution weight of that structured attribute can be increased by a predefined value.
[0051] In step 407, the contribution weights can be used to compare the two records R_i and R_j. The value pairs of each of the structured attributes a_1, ..., a_N can be compared, generating N similarity scores. The N similarity scores can be combined, for example, using a weighted sum, by multiplying the adjusted contribution weights by the corresponding similarity scores and summing the results. The resulting score can be compared to a threshold to determine whether the two records R_i and R_j are duplicates or unique.
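Steps 405 and 407 can be sketched as follows. Several details are assumptions made only for illustration: the 0 to 100 value score is computed with Python's `difflib.SequenceMatcher`, a partial match is treated as substring containment, the replacement weight is derived as `round(package_similarity * scale)`, and the names `boost`, `scale`, and the threshold value are hypothetical. The description itself does not fix any of these choices.

```python
from difflib import SequenceMatcher

def value_score(a, b):
    """Similarity of two attribute values on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def matches_selected(value, selected):
    """True if the value fully or partially matches a selected unstructured value."""
    v = value.lower()
    if not v:
        return False
    return any(s in v or v in s for s in selected)

def adjust_weights(initial, rec_i, rec_j, selected, package_similarity,
                   boost=1, scale=4):
    """Step 405: increase or replace each contribution weight."""
    adjusted = {}
    for attr, weight in initial.items():
        if (matches_selected(rec_i[attr], selected)
                or matches_selected(rec_j[attr], selected)):
            adjusted[attr] = weight + boost  # increase by a predefined value
        else:
            # Assumed mapping from set similarity (0..1) to an integer weight.
            adjusted[attr] = round(package_similarity * scale)
    return adjusted

def is_match(rec_i, rec_j, weights, threshold):
    """Step 407: weighted sum of per-attribute scores against a threshold."""
    total = sum(w * value_score(rec_i[a], rec_j[a]) for a, w in weights.items())
    return total >= threshold, total

r_i = {"name": "John Baker", "city": "Austin"}
r_j = {"name": "John Baker", "city": "Rome"}
initial = {"name": 2, "city": 2}
selected = {"john", "baker"}  # values in package_i ∩ package_j, lowercased

adjusted = adjust_weights(initial, r_i, r_j, selected, package_similarity=0.5)
matched, total = is_match(r_i, r_j, adjusted, threshold=250)
```

In this example the "name" weight is raised because the structured name is confirmed by the shared unstructured values, so the differing "city" attribute alone does not prevent the two records from matching.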
[0052] Figure 5A This is a flowchart illustrating a method for comparing records according to an example of the present invention. For illustrative purposes, Figure 5A The method described in [the document] can be used in [the following context] Figure 1 The system shown in the diagram is implemented, but is not limited to this implementation. Figure 5A The method can be performed, for example, by the data integration system 101. Figure 5B and Figure 5C The diagram shows two records (e.g., MDM records) to be compared, R_1 and R_2. Figure 5B and Figure 5CAs shown, record R_1 is associated with a set of unstructured objects 〖OB〗_1, 〖OB〗_2, ... 〖OB〗_(m_1), and record R_2 is associated with a set of unstructured objects 〖OB〗_1, 〖OB〗_2, ... 〖OB〗_(m_2). For example, record R_1 may be associated with 14 documents, while R_2 may be associated with 17 documents. The two records R_1 and R_2 represent people, such as employees. The two records R_1 and R_2 have values for structured attributes (e.g., "Name", "Address", "Date of Birth" ("DOB"), "Gender", "Marital Status", and "SSN"). Each structured attribute can be assigned a contribution weight as follows: Name weight: Medium, Address weight: Medium, DOB weight: High, Gender weight: Very Low, Marital Status weight: High, SSN weight: Very High. The values "High", "Medium", and "Very High" can be represented by corresponding integers that can be used to perform a weighted sum.
[0053] In step 501, documents associated with each of the two records can be processed to identify the values of unstructured attributes. This can result, for example, in the set of unstructured values package_1 (also called an entity token package) for record R_1, as shown in Figure 5B, and the set of unstructured values package_2 for record R_2, as shown in Figure 5C. Step 501 enables the use of the entity detection module to analyze relevant unstructured content of the MDM record to detect person names, addresses, sensitive personal information, and other entities of interest. As shown in Figure 5B and Figure 5C, each of the two sets package_1 and package_2 includes values such as "John" and "USA", where "John" is a value of the unstructured attribute "name" and "USA" is a value of the unstructured attribute "country".
[0054] Each value in the two sets package_1 and package_2 can be associated with an occurrence characteristic. As shown in Figure 5B and Figure 5C, each value is associated with its frequency of occurrence. For example, the value "street" appears 4 times in the 14 documents associated with record R_1, and 3 times in the 17 documents associated with record R_2. All extracted values can be stored in a so-called entity token package along with their frequency and entity relation score (which indicates how frequently entities are mentioned together in the same sentence or paragraph). The entity relation score can be a second occurrence characteristic as defined herein.
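As a hedged sketch of how an entity token package with its two occurrence characteristics might be built, the following Python example counts the frequency of each extracted value and how often pairs of values are mentioned in the same sentence. The function name `build_token_package`, the input layout (documents as lists of sentences, each sentence a list of already-extracted entity values), and the use of a simple pair count as the entity relation score are assumptions; entity detection itself is not shown.

```python
from collections import Counter
from itertools import combinations


def build_token_package(documents):
    """Return (frequency, co_occurrence) counters for the extracted values.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of entity values already detected upstream.
    """
    frequency = Counter()
    co_occurrence = Counter()
    for doc in documents:
        for sentence in doc:
            frequency.update(sentence)
            # Count each unordered pair of distinct values once per sentence.
            for a, b in combinations(sorted(set(sentence)), 2):
                co_occurrence[(a, b)] += 1
    return frequency, co_occurrence


# Hypothetical extraction result for the documents of one record.
docs = [
    [["John", "USA"], ["John", "street"]],
    [["street", "USA"]],
]
freq, relation = build_token_package(docs)
print(freq)
print(relation)
```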
[0055] In step 503, the two sets package_1 and package_2 can be compared to calculate a similarity score for the packages as a whole. For example, similar packages may have a large number of identical values and very similar entity relation scores. Therefore, in step 503, the intersection, or intersecting package, can be determined. The intersecting package can be determined by including in it only the values that exist in both package_1 and package_2. The resulting intersecting package package_1 ∩ package_2 is shown in Figure 5D. As shown in Figure 5D, values written in normal font appear with the same frequency in both package_1 and package_2. Values written in italics appear in both package_1 and package_2, but with different frequencies. Values written in bold appear in both package_1 and package_2, but not in the structured attributes of all records. For example, the intersection value "Baker" is not part of record R_1; therefore, it is written in bold.
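The determination of the intersecting package in step 503 can be sketched, under stated assumptions, as follows. Packages are modeled as `collections.Counter` objects mapping values to frequencies; the function name `intersect_packages` and all sample values other than the "street" frequencies (4 and 3) are hypothetical.

```python
from collections import Counter


def intersect_packages(package_1: Counter, package_2: Counter):
    """Split the values present in both packages by whether frequencies agree."""
    common = package_1.keys() & package_2.keys()
    same_frequency = {v for v in common if package_1[v] == package_2[v]}
    different_frequency = common - same_frequency
    return same_frequency, different_frequency


package_1 = Counter({"John": 5, "street": 4, "Baker": 2, "Paris": 1})
package_2 = Counter({"John": 5, "street": 3, "Baker": 2, "USA": 7})
same, different = intersect_packages(package_1, package_2)
print(sorted(same), sorted(different))
```

Values that occur in only one package ("Paris" and "USA" in this sketch) are excluded from the intersecting package.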
[0056] In step 505, the intersecting package can be used to adjust the weights of the structured attributes assigned to records R_1 and R_2.
[0057] For example, if the intersecting package contains a value of the unstructured attribute IATT that occurs with the same frequency in both packages, and that value does not match the value of the structured attribute ATT corresponding to the unstructured attribute IATT, then the weight of the structured attribute ATT can be replaced with the weight of the token package. These types of values (e.g., the value of IATT) are written in bold to indicate that a match based on the structured attribute ATT may be incorrect.
[0058] If the intersecting package contains a value of IATT that occurs with the same frequency in both packages, and that value partially or completely matches a structured attribute, then the weight of the structured attribute ATT can be increased proportionally. These types of values (e.g., IATT values) are written in normal font to indicate that matching based on the attribute can be strengthened.
[0059] If the intersecting package contains a value that occurs in both packages but with different frequencies, and that value only partially matches a structured attribute (e.g., address or name), then the weight of that structured attribute can be increased proportionally. These types of values (e.g., IATT values) are written in italics to indicate that matching based on that attribute can achieve a partially correct match.
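The three adjustment cases of paragraphs [0057] to [0059] can be illustrated with the following Python sketch. It is one possible reading, not a definitive implementation: the names `match_kind` and `adjust_weight`, the substring test used for a partial match, the requirement that a value match the structured attribute in both records, and the 10% `boost` factor are all assumptions introduced for illustration.

```python
def match_kind(token: str, structured_values) -> str:
    """Classify how a token matches the attribute values of both records."""
    t = token.casefold()
    values = [v.casefold() for v in structured_values]
    if all(t == v for v in values):
        return "full"
    if all(t in v for v in values):  # assumed: substring counts as partial match
        return "partial"
    return "none"


def adjust_weight(weight, token, structured_values, same_frequency,
                  package_weight, boost=0.1):
    """Adjust one structured attribute's weight for one intersecting value."""
    kind = match_kind(token, structured_values)
    if same_frequency:
        if kind == "none":
            return package_weight       # [0057]: replace with package weight
        return weight * (1 + boost)     # [0058]: full or partial match
    if kind == "partial":
        return weight * (1 + boost)     # [0059]: different frequency, partial
    return weight                       # otherwise leave the weight unchanged


# "Baker" is missing from the address of the first record: weight is replaced.
print(adjust_weight(3.0, "Baker", ("221 Main Street", "221 Baker Street"), True, 1.5))
```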
[0060] In step 507, based on comparing the two records R_1 and R_2 using adjusted weights, it can be determined whether the two records represent the same entity.
[0061] Figure 6 shows a general computerized system 600 suitable for implementing at least a portion of the method steps as disclosed herein.
[0062] It should be understood that the methods described herein are at least partially non-interactive and are automated by computerized systems such as servers or embedded systems. However, in exemplary embodiments, the methods described herein can be implemented in (partially) interactive systems. These methods can also be implemented in software 612, 622 (including firmware 622), hardware (processor) 605, or combinations thereof. In exemplary embodiments, the methods described herein are implemented in software as an executable program and executed by a dedicated or general-purpose digital computer such as a personal computer, workstation, minicomputer, or mainframe computer. Thus, the most general system 600 includes a general-purpose computer 601.
[0063] In an exemplary embodiment, as shown in FIG. 6, the computer 601 includes a processor 605, a memory (main memory) 610 coupled to a memory controller 615, and one or more input and / or output (I / O) devices (or peripherals) 10, 645 communicatively coupled via a local input / output controller 635. The input / output controller 635 may be, but is not limited to, one or more buses or other wired or wireless connections, as known in the art. The input / output controller 635 may have additional elements (omitted for simplicity) to enable communication, such as controllers, buffers (caches), drivers, repeaters, and receivers. Further, the local interface may include address, control, and / or data connections to enable proper communication between the aforementioned components. As described herein, the I / O devices 10, 645 may generally include any general-purpose encryption card or smart card known in the art.
[0064] Processor 605 is a hardware device for executing software, specifically software stored in memory 610. Processor 605 can be any custom or commercially available processor, central processing unit (CPU), auxiliary processor among several processors associated with computer 601, semiconductor-based microprocessor (in the form of a microchip or chipset), or any device generally used for executing software instructions.
[0065] Memory 610 may include any one or a combination of volatile storage elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and non-volatile storage elements (e.g., ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM)). Note that memory 610 may have a distributed architecture, in which various components are geographically separated but can be accessed by processor 605.
[0066] The software in memory 610 may include one or more individual programs, each including an ordered list of executable instructions for implementing logical functions (particularly those relevant to embodiments of the invention). In the example of FIG. 6, the software in memory 610 includes instructions 612, such as instructions for managing a database, for example a database management system.
[0067] The software in memory 610 typically also includes a suitable operating system (OS) 611. OS 611 essentially controls the execution of other computer programs, such as possible software 612 for implementing the methods described herein.
[0068] The methods described herein can take the form of source program 612, executable program 612 (object code), script, or any other entity including instruction set 612 to be executed. When a source program is used, the program needs to be translated by a compiler, assembler, interpreter, etc. (which may or may not be included in memory 610) in order to operate correctly in conjunction with OS 611. Furthermore, the described methods can be written in an object-oriented programming language with classes of data and methods, or a procedural programming language with routines, subroutines, and / or functions.
[0069] In an exemplary embodiment, a conventional keyboard 650 and mouse 655 may be coupled to an input / output controller 635. Other output devices, such as I / O device 645, may include input devices, such as, but not limited to, printers, scanners, microphones, etc. Finally, I / O devices 10, 645 may also include devices that communicate both inputs and outputs, such as, but not limited to, network interface cards (NICs) or modulators / demodulators (for accessing other files, devices, systems, or networks), radio frequency (RF) or other transceivers, telephone interfaces, bridges, routers, etc. I / O devices 10, 645 may be any general-purpose encryption card or smart card known in the art. System 600 may also include a display controller 625 coupled to a display 630. In an exemplary embodiment, system 600 may also include a network interface for coupling to a network 665. Network 665 may be an IP-based network for communication between computer 601 and any external server, client, etc., via a broadband connection. Network 665 sends and receives data between computer 601 and external system 30, which may be involved in performing some or all of the steps of the methods discussed herein. In an exemplary embodiment, network 665 may be a managed IP network managed by a service provider. Network 665 may be implemented wirelessly, for example using wireless protocols and technologies such as WiFi, WiMax, etc. Network 665 may also be a packet-switched network, such as a local area network (LAN), wide area network (WAN), metropolitan area network (MAN), the Internet, or other similar network environments. Network 665 may be a fixed wireless network, wireless local area network (WLAN), wireless wide area network (WWAN), personal area network (PAN), virtual private network (VPN), intranet, or other suitable network system, and includes devices for receiving and transmitting signals.
[0070] If the computer 601 is a PC, workstation, intelligent device, etc., the software in the memory 610 may also include a Basic Input / Output System (BIOS) 622. The BIOS is a set of basic software routines that initialize and test the hardware at startup, boot the OS 611, and support data transfer between hardware devices. The BIOS is stored in ROM so that it can be executed when the computer 601 is activated.
[0071] When computer 601 is running, processor 605 is configured to execute software 612 stored in memory 610, transfer data to and from memory 610, and generally control the operation of computer 601 according to the software. The methods and OS 611 described herein are read, in whole or in part (but usually the latter), by processor 605, may be buffered within processor 605, and then executed.
[0072] When the systems and methods described herein are implemented in software 612, as shown in Figure 6, these methods can be stored on any computer-readable medium (such as storage device 620) for use by or in conjunction with any computer-related system or method. Storage device 620 may include disk storage devices, such as HDD storage devices.
[0073] This invention can be a system, method, and / or computer program product at any possible technical detail level of integration. The computer program product may include one or more computer-readable storage media having computer-readable program instructions thereon for causing a processor to perform aspects of the invention.
[0074] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, but not limited to, electronic storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of computer-readable storage media includes the following: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in a groove on which instructions are recorded, and any suitable combination of the foregoing. As used herein, computer-readable storage media should not be construed as transient signals themselves, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.
[0075] The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to a suitable computing / processing device, or downloaded via a network (e.g., the Internet, a local area network, a wide area network, and / or a wireless network) to an external computer or external storage device. The network may include copper cables, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to a computer-readable storage medium within the respective computing / processing device.
[0076] Computer-readable program instructions used to perform the operations of this invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages (including object-oriented programming languages such as Smalltalk, C++, etc.) and procedural programming languages (such as the "C" programming language or similar programming languages). The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer via any type of network (including a local area network (LAN) or a wide area network (WAN)) or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuits including, for example, programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs) may execute computer-readable program instructions by utilizing state information from the computer-readable program instructions to personalize the electronic circuits in order to perform aspects of this invention.
[0077] Various aspects of the present invention are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.
[0078] These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions / actions specified in one or more blocks of a flowchart and / or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and / or other device to operate in a particular manner, such that the computer-readable storage medium in which the instructions are stored includes an article of manufacture comprising instructions for implementing aspects of the functions / actions specified in one or more blocks of a flowchart and / or block diagram.
[0079] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus or other device perform the functions / actions specified in one or more blocks of a flowchart and / or block diagram.
[0080] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions comprising one or more executable instructions for implementing a specified logical function. In some alternative embodiments, the functions indicated in the blocks may occur out of the order shown in the figures. For example, two blocks shown consecutively may actually be executed substantially simultaneously, or these blocks may sometimes be executed in reverse order, depending on the functions involved. It will also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or carries out combinations of dedicated hardware and computer instructions.
Claims
1. A computer-implemented method for record matching in a database system, wherein a record represents an entity, the record includes structured attributes and is associated with one or more unstructured data objects, the method including: assigning initial contribution weights to the structured attributes in the record; processing the one or more unstructured data objects to identify the unstructured attribute values of the records; comparing two records in the database system, based on the contribution weights of the structured attributes of the two records, to determine the degree of similarity between them; selecting, based on the comparison results, the unstructured attribute values that exist in both records; and in response to determining that the values of the structured attribute of the two records do not match any of the selected unstructured attribute values, replacing the contribution weight of the structured attribute with a contribution weight indicating the similarity between the two records.
2. The method according to claim 1, further comprising: evaluating one or more occurrence characteristics of the identified unstructured attribute values, wherein the occurrence characteristics of a specific unstructured attribute value identified in one or more unstructured data objects of a specific record include any of the following: the frequency of occurrence of the specific unstructured attribute value in the unstructured data objects of the specific record, and an indication of other identified unstructured attribute values juxtaposed with the specific unstructured attribute value in the unstructured data objects of the specific record, wherein comparing the two sets of unstructured attribute values of the two records includes: comparing the evaluated occurrence characteristics of the unstructured attribute values of one of the two sets with the evaluated occurrence characteristics of the unstructured attribute values of the other of the two sets.
3. The method according to claim 1, further comprising: grouping the unstructured attribute values into groups based on the category of the unstructured attribute values, wherein the comparison between the two records is performed by comparing groups of the same category.
4. The method according to claim 1, further comprising: in response to determining that the value of the structured attribute matches the value of the selected unstructured attribute, increasing the contribution weight of the structured attribute.
5. The method according to claim 4, wherein the selection includes: intersecting the two sets of unstructured attribute values, thereby creating an intersection.
6. The method according to claim 5, wherein the unstructured attribute value is the full value of the unstructured attribute or a portion of the full value of the unstructured attribute, and the selection further includes: executing an aggregation algorithm to aggregate the selected unstructured attribute values to form the full value of the corresponding unstructured attribute, thereby producing zero or more aggregated values, wherein the comparison with the structured attribute value is performed using the processed selected unstructured attribute values.
7. The method according to claim 6, wherein the aggregation includes: grouping the unstructured attribute values into groups based on the category of the unstructured attribute values, wherein the aggregation is performed on values belonging to the same group.
8. The method according to claim 6, wherein the selected unstructured attribute values exist with the same frequency in each of the two sets of unstructured attribute values.
9. The method according to claim 6, wherein comparing the two records includes: comparing the values of the structured attributes of the two records to obtain an individual matching score for each structured attribute of the record, combining the individual matching scores using the contribution weights, and comparing the combined scores with a predefined threshold.
10. The method according to claim 1, further comprising: merging the two records into a single record, wherein the two records represent the same entity.
11. The method according to claim 1, wherein the method is performed in response to receiving a corresponding request for matching the record.
12. The method according to claim 1, further comprising: repeating the method to compare other records in the database until all records in the database have been compared.
13. The method according to claim 1, wherein the method is executed by a Master Data Management (MDM) system, and the records being compared are MDM records, wherein an entity detection module of the master data management system performs the processing of the one or more unstructured data objects.
14. The method according to claim 1, wherein the unstructured object corresponds to a document.
15. The method according to claim 14, wherein the unstructured object corresponds to a scanned document.
16. The method according to claim 1, further comprising: providing information to the person associated with the compared records, the information indicating the unstructured objects associated with the two records.
17. The method according to claim 1, wherein the method is performed in response to storing the compared records.
18. The method according to claim 5, wherein the intersection includes the selected unstructured attribute values.
19. A computer program product for record matching in a database system, wherein a record represents an entity, the record includes structured attributes and is associated with one or more unstructured data objects, and the computer program product includes instructions embodied therein, the instructions being executable by a processor to cause the processor to perform the method according to any one of claims 1 to 18.
20. A computer system for record matching, wherein a record represents an entity, the record includes structured attributes and is associated with one or more unstructured data objects, and the computer system includes: one or more computer processors; and program instructions, including instructions that, when executed by the processor, perform the method of any one of claims 1 to 18.