A multi-source business data automatic cleaning and completion method

By constructing an entity association network based on quality degradation assessment and domain knowledge graph, the problem of inconsistency and missing data from multiple sources of business data is solved, and highly reliable automated cleaning and completion are achieved, thereby improving the accuracy and intelligence of data processing.

CN122309926A · Pending · Publication Date: 2026-06-30 · CHONGQING HIKE NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CHONGQING HIKE NETWORK TECH CO LTD
Filing Date: 2026-04-03
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies lack dynamic and refined measurement of the overall quality degradation of the data source and its internal fields when dealing with inconsistencies and missing data from multiple sources. This results in inaccurate cleaning and completion decisions, and the construction of relationships is carried out in isolation without adaptive adjustment, leading to insufficient reliability of the completion results.

Method used

By guiding the construction of entity association networks based on quality degradation assessment and combining domain knowledge graphs for verification and completion path exploration, a reliable completion path is generated, enabling automated and highly reliable cleaning and completion of multi-source business data.

Benefits of technology

It improves the accuracy and intelligence of the data cleaning and completion process, ensures the business rationality and high credibility of the completion results, and can flexibly deal with different types of missing or conflicting scenarios, outputting high-quality and highly complete business entity data streams.

Abstract

This invention discloses an automatic cleaning and completion method for multi-source business data, belonging to the field of computer data processing technology. It includes acquiring the original multi-source business data stream and performing a quality degradation assessment on the original multi-source business data stream to generate quality degradation parameters; based on the quality degradation parameters, identifying and constructing the association relationships between cross-source business data to generate an entity association network; using a pre-set domain knowledge graph to verify and explore completion paths for the entity association network, generating verification results containing reliable completion paths; and performing collaborative cleaning and completion operations on the original multi-source business data stream according to the verification results, outputting the cleaned and completed business entity data stream. By guiding the construction of the entity association network based on quality degradation assessment and combining it with the collaborative processing of verification and completion path exploration using a domain knowledge graph, it is possible to achieve automated and highly reliable cleaning and completion of inconsistencies and missing data in multi-source business data.

Description

Technical Field

[0001] This invention relates to the field of computer data processing technology, and in particular to a method for automatic cleaning and completion of multi-source business data.

Background Technology

[0002] In enterprise IT operations, business data typically originates from multiple independent systems or channels, such as sales, customer service, and warehousing. When describing the same business entity, this multi-source data often suffers from quality issues such as conflicting field values and missing information due to differences in input standards, update sequences, or system failures. Effectively cleaning and completing this type of multi-source, heterogeneous business data to form complete, consistent, and reliable data assets is a crucial foundation for supporting accurate analysis, intelligent decision-making, and business process optimization, and represents an important application area of computer data processing technology.

[0003] Existing technologies often employ data cleaning methods based on predefined rules or simple completion strategies based on the quality assessment of a single data source when dealing with inconsistencies and missing data from multiple sources. For example, they might select data from specific sources by setting conflict resolution rules, or fill in fields based on the static reliability ranking of the data sources. Additionally, some methods attempt to establish relationships by calculating the similarity between records and then use these relationships to pass information and complete missing values.

[0004] However, existing technical solutions have significant drawbacks. First, most methods lack dynamic and refined measurement of the overall data source and the degree of quality degradation of its internal fields, resulting in insufficient precision in the basis for cleaning and completion decisions. Second, the construction of relationships is often carried out in isolation, failing to deeply integrate with data quality assessment results and lacking the use of domain knowledge to semantically verify and enhance relationships. Furthermore, the completion process typically employs a single, fixed strategy, unable to adaptively adjust according to specific data quality conditions and the semantic type of the relationship path, leading to insufficient reliability of the completion results and making it difficult to meet the high data credibility requirements of complex business scenarios.

Summary of the Invention

[0005] To address the aforementioned issues, this invention provides an automatic cleaning and completion method for multi-source business data. By guiding the construction of an entity association network based on quality degradation assessment and combining it with domain knowledge graphs for verification and completion path exploration, this method can achieve automated and highly reliable cleaning and completion of inconsistencies and missing data in multi-source business data.

[0006] The above objectives can be achieved through the following approach:

[0007] An automatic cleaning and completion method for multi-source business data includes: acquiring original multi-source business data streams; evaluating the quality degradation of the original multi-source business data streams to generate quality degradation parameters; identifying and constructing the association relationships between cross-source business data based on the quality degradation parameters to generate an entity association network; using a preset domain knowledge graph to verify and explore completion paths for the entity association network to generate verification results containing credible completion paths; and performing collaborative cleaning and completion operations on the original multi-source business data streams according to the verification results to output the cleaned and completed business entity data streams.

[0008] Optionally, the step of acquiring the original multi-source business data stream and performing a quality degradation assessment on the original multi-source business data stream to generate quality degradation parameters includes: receiving original data packets from at least two independent business data sources and parsing them to obtain a structured business record stream; performing de-identification processing on the structured business record stream to generate a business data stream to be evaluated; performing multi-dimensional quality measurement on the business data stream to be evaluated, and calculating quality degradation parameters that characterize the degree of inconsistency and incompleteness at the data source and field levels.

[0009] Optionally, the step of performing multi-dimensional quality measurement on the business data stream to be evaluated and calculating quality decay parameters characterizing the degree of inconsistency and incompleteness at the data source and field levels includes: identifying conflicting field values describing the same business entity in the business data stream to be evaluated, and calculating a field-level inconsistency score based on the conflict frequency and the credibility weight of the conflict source; detecting missing fields in the business entity records in the business data stream to be evaluated, and calculating a field-level incompleteness score based on the business criticality of the field and the source distribution of the missing records; aggregating the field-level inconsistency scores and field-level incompleteness scores of all fields under the same data source, and combining them with the real-time availability status of the corresponding data source to generate quality decay parameters, wherein the quality decay parameters include a source decay coefficient and a field decay vector.

[0010] Optionally, the step of identifying and constructing the association relationship between cross-source business data based on the quality attenuation parameter and generating an entity association network includes: locating the attenuation field set according to the field attenuation vector in the quality attenuation parameter; using the attenuation field set as the focus of association detection, filtering cross-source candidate association pairs containing association confidence in the original multi-source business data stream; and performing weighted correction on the association confidence of the cross-source candidate association pairs based on the source attenuation coefficient, filtering and constructing an entity association network with business entities as nodes, wherein the edge weights in the entity association network are the corrected association confidence.

[0011] Optionally, the step of using the decay field set as the focus of association detection and filtering cross-source candidate association pairs containing association confidence in the original multi-source business data stream includes: for each field in the decay field set, extracting business records involving the corresponding field from all data sources to form a field focus record set; calculating the similarity between the field focus record sets in terms of numerical values, text, or classification to generate a preliminary similarity relationship set; analyzing the historical data synchronization time sequence between the source data sources of the record pairs in the preliminary similarity relationship set, and combining the quality decay parameter to infer and filter out false associations caused by data delays to obtain cross-source candidate association pairs containing association confidence.

[0012] Optionally, the step of using a preset domain knowledge graph to verify and complete the path exploration of the entity association network and generate a verification result containing credible completion paths includes: mapping the nodes and edges in the entity association network to the preset domain knowledge graph, searching for corresponding entity relationship paths that already exist in the domain knowledge graph; performing reasoning to complete the relationship edges that exist in the entity association network but are missing or weakened in the domain knowledge graph, generating potential completion paths; fusing the corresponding entity relationship paths and the potential completion paths; and performing feasibility verification on the fused paths according to the constraint rules of entity attributes in the domain knowledge graph, generating a verification result containing credible completion paths, wherein the verification result identifies the status of each relationship edge as verified, pending completion, or conflicting.

[0013] Optionally, the step of performing feasibility verification on the fused path based on the constraint rules of entity attributes in the domain knowledge graph and generating a verification result containing a credible completion path includes: extracting business rule constraints and attribute value range constraints related to the entities at both ends of the associated edge from the domain knowledge graph; checking whether the changes in entity attributes or relationships of the potential completion path violate the business rule constraints and attribute value range constraints; marking the paths that pass the verification as credible completion paths and injecting them into the verification result; marking the paths that fail the verification as conflicts and recording the conflict constraint information into the verification result.

[0014] Optionally, the step of performing collaborative cleaning and completion operations on the original multi-source business data stream based on the verification results, and outputting the cleaned and completed business entity data stream, includes: parsing the verification results; performing consistency analysis on the business data corresponding to the verified association edges using data from a preset confidence source; for the association edges in the pending-completion state and their associated missing or conflicting fields, calculating completion values from a preset source or through path derivation based on the credible completion path in the verification results; and integrating the data after consistency analysis with the completion values to reconstruct and generate the business entity data stream.

[0015] Optionally, the step of calculating the completed value from a preset source or through path derivation based on the credible completion path in the verification result for the associated edges in the state of pending completion and the associated missing or conflicting fields includes: for the abnormal business entity corresponding to the associated edge marked as pending completion, constructing a completion decision tree with the abnormal entity as the root node according to the entity association network and the verification result; traversing the completion decision tree, dynamically selecting a completion operator of value transfer, model derivation or knowledge graph query according to the quality decay parameter associated with the node and the credible completion path type corresponding to the edge; executing the completion operator to complete the assignment of all missing or conflicting fields of the abnormal business entity to obtain the completed value.

[0016] Based on the same inventive concept, this invention also provides an automatic cleaning and completion system for multi-source business data. The system includes: a quality degradation assessment module, used to acquire original multi-source business data streams and perform quality degradation assessment on the original multi-source business data streams to generate quality degradation parameters; an association network construction module, used to identify and construct association relationships between cross-source business data based on the quality degradation parameters, generating an entity association network; a knowledge verification and path exploration module, used to verify and explore completion paths for the entity association network using a preset domain knowledge graph, generating verification results containing credible completion paths; and a collaborative cleaning and completion execution module, used to perform collaborative cleaning and completion operations on the original multi-source business data streams according to the verification results, outputting a cleaned and completed business entity data stream.

[0017] Compared with the prior art, the present invention has the following advantages:

[0018] This invention, by introducing a quantitative assessment of quality degradation parameters, can locate inconsistencies and incompleteness in multi-source business data, and drive subsequent association construction and completion decisions. This makes the data cleaning and completion process no longer blind or based on fixed rules, but on a dynamic and quantitative data quality perception basis, thereby improving the accuracy of the processing target and the intelligence level of the processing process.

[0019] This invention constructs an entity association network guided by a quality decay parameter and uses a domain knowledge graph to verify and explore its paths. By combining data-driven association discovery with prior knowledge-driven semantic verification, it can identify and filter pseudo-associations and discover credible completion paths that conform to domain logic, thus ensuring the business rationality and high credibility of the final completion results.

[0020] When performing a completion operation, this invention can dynamically select the most suitable completion operator based on the specific quality degradation and the type of reliable completion path. This enables the system to flexibly cope with different types of missing or conflicting scenarios. By comprehensively utilizing multiple methods such as value passing, model derivation, and knowledge query, it achieves intelligent and adaptive data repair capabilities, thereby outputting high-quality and highly complete business entity data streams.

[0021] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the invention. The objects and other advantages of the invention may be realized and obtained by means of the structures pointed out in the description, claims and drawings.

Attached Figure Description

[0022] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a flowchart of an automatic cleaning and completion method for multi-source business data according to an embodiment of the present invention.

[0024] Figure 2 is a schematic diagram of the entity association network according to an embodiment of the present invention.

[0025] Figure 3 is a schematic structural diagram of an automatic cleaning and completion system for multi-source business data according to an embodiment of the present invention.

Detailed Implementation

[0026] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0027] Referring to Figure 1, one embodiment of the present invention proposes an automatic cleaning and completion method for multi-source business data. By guiding the construction of the entity association network with a quality degradation assessment and combining it with a domain knowledge graph for verification and completion path exploration, the method achieves automated and highly reliable cleaning and completion of inconsistent and missing data in multi-source business data.

[0028] The method described in this embodiment specifically includes:

[0029] S1. Obtain the original multi-source business data stream and perform a quality degradation assessment on it to generate quality degradation parameters.

[0030] In one embodiment of the present invention, step S1 includes the following steps:

[0031] Receive raw data packets from at least two independent business data sources and parse them to obtain a structured business record stream;

[0032] The structured business record stream is de-identified to generate a business data stream to be evaluated;

[0033] Identify conflicting field values describing the same business entity in the business data stream to be evaluated, and calculate a field-level inconsistency score based on the conflict frequency and the credibility weight of the conflict source.

[0034] The missing fields of business entity records in the business data stream to be evaluated are detected, and a field-level incompleteness score is calculated based on the business criticality of the fields and the source distribution of the missing records.

[0035] The field-level inconsistency score and field-level incompleteness score of all fields under the same data source are aggregated, and combined with the real-time availability status of the corresponding data source, a quality decay parameter is generated. The quality decay parameter includes the source decay coefficient and the field decay vector.

[0036] Specifically, step S1 begins by receiving raw data packets from at least two independent business data sources, which may have different encapsulation formats or communication protocols. Through parsing, each raw data packet is converted into a business record with uniform field definitions; this collection of records constitutes a structured business record stream. Each business record represents information about a business entity, and its fields include an entity identifier and various attribute values. Subsequently, the structured business record stream undergoes de-identification processing. This process aims to remove or replace identifier fields in the records that can directly locate a specific natural person or organization, such as ID card numbers or mobile phone numbers, while retaining other business attributes used for entity association. The processed data forms the business data stream to be evaluated.
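The de-identification step described above can be illustrated with a minimal sketch. It assumes records are plain dictionaries and that identifying fields are replaced by a salted hash, so that equal values still compare equal for later association; the field names `id_card_number` and `mobile_phone`, the salt, and the 16-character truncation are hypothetical choices, not taken from the source.

```python
import hashlib

# Hypothetical names for directly identifying fields; the text only gives
# ID card numbers and mobile phone numbers as examples.
IDENTIFIER_FIELDS = {"id_card_number", "mobile_phone"}


def de_identify(record, salt="demo-salt"):
    """Return a copy of record with identifying fields replaced by a salted hash."""
    cleaned = {}
    for field, value in record.items():
        if field in IDENTIFIER_FIELDS and value is not None:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            # Shortened pseudonym: equal inputs still map to equal outputs.
            cleaned[field] = digest[:16]
        else:
            # Other business attributes are retained unchanged.
            cleaned[field] = value
    return cleaned


record = {"entity_id": "E1", "mobile_phone": "13800138001", "account_status": "Normal"}
deidentified = de_identify(record)
```

Replacing rather than deleting the value is one possible reading of "remove or replace"; a deployment that only needs the business attributes could drop the identifying fields entirely.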

[0037] Next, conflicting field values describing the same business entity are identified in the business data stream to be evaluated. Conflicting field values are different values for the same field that appear, across data sources, in multiple records identified by the entity identifier as belonging to the same business entity. The conflict frequency is calculated for a specific business entity and a specific field i, and is defined as the number of independent data sources in which that entity has distinct values for that field, minus one. The conflict source credibility weight is a preset historical accuracy rate of the data source, determined by comparing 200 sets of industrial sensor measurement data against manual verification results. The field-level inconsistency score is derived from a weighted sum of the conflict frequency and the conflict source credibility weights:

[0038] ,

[0039] where the conflict source set is the set of data sources in which field i of this entity has conflicting values; N is the total number of data sources; M is the total number of records describing this entity; and the moderating coefficient, which lies between 0 and 1, balances the influence of source credibility against that of conflict frequency and is set to 0.6 based on statistical analysis of historical data. The formula is normalized, and all calculation results are dimensionless scalars. The final score also ranges from 0 to 1, with larger values indicating more severe inconsistency.
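As an illustration of the conflict frequency defined above, the following sketch counts the distinct values that different sources hold for one entity and field and subtracts one. Reading "sources with distinct values" as "distinct values observed across sources" is an interpretation, and the record layout (`entity_id`, `source`, and field keys) is hypothetical. The sketch does not attempt the inconsistency score itself, whose formula is not reproduced in this text.

```python
def conflict_frequency(records, entity_id, field):
    """Return (conflict frequency, set of conflicting sources) for one entity and field."""
    sources_by_value = {}
    for rec in records:
        if rec["entity_id"] == entity_id and rec.get(field) is not None:
            sources_by_value.setdefault(rec[field], set()).add(rec["source"])
    if len(sources_by_value) <= 1:
        # Zero or one distinct value: no conflict.
        return 0, set()
    conflict_sources = set().union(*sources_by_value.values())
    return len(sources_by_value) - 1, conflict_sources


records = [
    {"entity_id": "E1", "source": "A", "account_status": "Normal"},
    {"entity_id": "E1", "source": "B", "account_status": "Frozen"},
    {"entity_id": "E2", "source": "A", "account_status": "Normal"},
    {"entity_id": "E2", "source": "B", "account_status": "Normal"},
]
freq_e1, sources_e1 = conflict_frequency(records, "E1", "account_status")
freq_e2, sources_e2 = conflict_frequency(records, "E2", "account_status")
```

For E1, sources A and B disagree, so the frequency is 1; for E2 the sources agree and the frequency is 0.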

[0040] At the same time, missing fields are detected in the business entity records of the business data stream to be evaluated. A missing field is a field that, for a given business entity, should exist in a record from a given data source but is actually empty. The business criticality of a field, $k_i$, is predefined according to how necessary the field is to the business process; it has three levels, critical, important, and general, quantified as 1.0, 0.7, and 0.3, respectively. The source distribution of missing records is measured by the information entropy $H_i$ of the data sources from which the records lacking the field originate:

[0041] $$H_i = -\sum_{s} p_s \log p_s,$$

[0042] where $p_s$ is the proportion of missing records originating from data source s. The field-level incompleteness score $I_i$ is calculated as:

[0043] $$I_i = \beta\, k_i + (1-\beta)\,\frac{H_i}{\log N},$$

[0044] where $\beta$ is the adjustment factor, set to 0.5, and N is the total number of data sources. Dividing by the maximum entropy $\log N$ normalizes the second term to the range 0 to 1. The formula as a whole guarantees that $I_i$ is a dimensionless value between 0 and 1; the higher the value, the more severe the incompleteness.
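The entropy-based incompleteness score can be sketched as follows. The sketch assumes the score is a convex combination of the field's business criticality and the entropy of the missing-source distribution normalized by its maximum, log N. That form is an assumption that agrees with the stated ranges and with the worked example in the text (criticality 0.7, field missing in one source only, score 0.35). Returning 0 when nothing is missing is a further assumption, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter


def missing_source_entropy(missing_sources):
    """Shannon entropy of the distribution of missing records over data sources."""
    counts = Counter(missing_sources)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


def incompleteness_score(criticality, missing_sources, n_sources, beta=0.5):
    """Assumed form: beta * criticality + (1 - beta) * normalized entropy."""
    if n_sources < 2:
        raise ValueError("the method assumes at least two data sources")
    if not missing_sources:
        return 0.0  # assumption: a field that is never missing scores zero
    normalized = missing_source_entropy(missing_sources) / math.log(n_sources)
    return beta * criticality + (1.0 - beta) * normalized


# "Registered Capital" (criticality 0.7) is missing only in source A, out of two sources.
score_single = incompleteness_score(0.7, ["A"], n_sources=2)
# A critical field missing equally in both sources reaches the upper bound.
score_spread = incompleteness_score(1.0, ["A", "B"], n_sources=2)
```

Because the entropy is divided by log N, the base of the logarithm does not affect the result.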

[0045] The field-level inconsistency scores and field-level incompleteness scores of all fields under the same data source are aggregated and combined with the real-time availability status of the corresponding data source to generate the quality degradation parameters. For a data source s, the source attenuation coefficient is calculated as a weighted average of the scores of all fields under this source:

[0046] ,

[0047] where the field set is the collection of fields under the source; the two score terms are the inconsistency score and the incompleteness score of the data source on field i; and the dynamic adjustment factor is based on the real-time availability status of the data source (such as response time and error rate), taking one value when availability is normal and another when delays or errors occur. The field decay vector of the data source is a multidimensional vector in which each dimension corresponds to a field i, whose value is the sum of that field's inconsistency score and incompleteness score. The quality degradation parameter ultimately includes the source attenuation coefficient and the field decay vector of each data source.

[0048] For example, suppose there are two independent business data sources, source A and source B. After parsing, a structured business record stream is obtained that contains records for business entities E1 and E2; after de-identification, it becomes the business data stream to be evaluated. For the "Account Status" field of entity E1, the record from source A is "Normal" while the record from source B is "Frozen", which is identified as a conflict. Assume that the historical credibility weight of source A is 0.9 and that of source B is 0.8, that the total number of data sources is N=2, and that the total number of records describing E1 is M=2. Two sources hold distinct values, so the conflict frequency is 2 - 1 = 1, and the inconsistency score is calculated from these values. Meanwhile, entity E2 is detected to be missing the "Registered Capital" field in source A, and the business criticality of this field is 0.7. Since the field is missing only in source A, the information entropy of the missing-source distribution is 0, and the incompleteness score is 0.35.

[0049] During aggregation, assume that source A has only two fields, with inconsistency scores of 0 and 0 and incompleteness scores of 0 and 0.35, and that its real-time availability is normal. The source attenuation coefficient of source A is then calculated from these scores. In its field decay vector, the "Account Status" component is 0 and the "Registered Capital" component is 0.35. The calculation for source B is similar, ultimately generating a quality decay parameter that contains the source attenuation coefficient and the field decay vector of both source A and source B.

[0050] S2. Based on the quality attenuation parameters, identify and construct the correlation between cross-source business data, and generate an entity association network;

[0051] In one embodiment of the present invention, step S2 includes the following steps:

[0052] Locate the attenuation field set based on the field attenuation vector in the quality attenuation parameter;

[0053] For each field in the attenuation field set, extract all business records from all data sources that involve the corresponding field to form a field focus record set;

[0054] Calculate the similarity between the focus record sets of the field in terms of numerical value, text, or category, and generate a preliminary similarity relationship set;

[0055] Analyze the historical data synchronization time sequence between the source data sources of each record pair in the preliminary similarity relationship set and, in combination with the quality decay parameter, infer and filter out false associations caused by data delay to obtain cross-source candidate association pairs containing association confidence;

[0056] Apply a weighted correction to the association confidence of the cross-source candidate association pairs based on the source attenuation coefficient, then filter the pairs and construct an entity association network with business entities as nodes. The edge weights in the entity association network are the corrected association confidences.

[0057] Specifically, the input for step S2 is the quality decay parameter generated in the previous step. First, the set of decaying fields is located based on the field decay vector in the quality decay parameter. The field decay vector is a multi-dimensional vector, where each dimension corresponds to a business field and is assigned a value. This value is obtained by adding the field-level inconsistency score and the field-level incompleteness score, representing the overall quality decay degree of that field in the corresponding data source. The set of decaying fields is defined as the union of fields from all data sources whose field decay vector values exceed a preset threshold, which is set to 0.3 based on historical data quality benchmark analysis.
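Locating the decay field set as described above amounts to a thresholded union over the per-source field decay vectors. A minimal sketch follows, assuming each vector is a dictionary keyed by field name. The values for source A follow the earlier worked example; the values for source B are hypothetical.

```python
def field_decay_value(inconsistency, incompleteness):
    """Field decay value: inconsistency score plus incompleteness score."""
    return inconsistency + incompleteness


def decay_field_set(field_decay_vectors, threshold=0.3):
    """Union of fields whose decay value exceeds the threshold in any data source."""
    selected = set()
    for vector in field_decay_vectors.values():
        for field, value in vector.items():
            if value > threshold:
                selected.add(field)
    return selected


vectors = {
    "A": {
        "Account Status": field_decay_value(0.0, 0.0),
        "Registered Capital": field_decay_value(0.0, 0.35),
    },
    # Hypothetical values for source B, for illustration only.
    "B": {"Account Status": 0.1, "Registered Capital": 0.2},
}
focus_fields = decay_field_set(vectors)
```

Only "Registered Capital" exceeds 0.3 (in source A), so it alone becomes a focus of association detection.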

[0058] For each field in the decay field set, extract all records from all data sources that relate to that field in the business records. This set of records constitutes the field focus record set for that field. Whether a record is included depends on whether it contains a valid value for that field.

[0059] The similarity between any two records in the field focus record set is calculated on the corresponding field. The calculation method depends on the data type of the field. For numeric fields, the similarity $\mathrm{sim}_{num}$ is calculated as:

[0060] $$\mathrm{sim}_{num}(v_a, v_b) = 1 - \frac{|v_a - v_b|}{\max(V) - \min(V)},$$

[0061] where $v_a$ and $v_b$ are the field values of the two records and $V$ is the set of all values of this field in the field focus record set. Range normalization constrains the result to between 0 and 1. For text fields, the similarity is calculated with an edit-distance-based method. For categorical fields, the similarity is determined directly by whether the categories match: 1 if they match, 0 otherwise. From all calculated results, the record pairs whose similarity exceeds a threshold are selected to form the preliminary similarity relationship set. The threshold was set to 0.8 after statistical analysis of 1000 samples with known matching relationships.
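The three type-dependent similarity measures can be sketched as below. The numeric measure uses range normalization as the text describes. The text similarity normalizes a Levenshtein edit distance by the longer string length, which is an assumption, since the text only says "based on edit distance". All function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def text_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def numeric_similarity(va, vb, all_values):
    """Range-normalized similarity over the values in the field focus record set."""
    span = max(all_values) - min(all_values)
    return 1.0 if span == 0 else 1.0 - abs(va - vb) / span


def categorical_similarity(a, b):
    """1 if the categories match, otherwise 0."""
    return 1.0 if a == b else 0.0


SIMILARITY_THRESHOLD = 0.8
phone_sim = text_similarity("13800138001", "13800138000")
keeps_pair = phone_sim > SIMILARITY_THRESHOLD
```

Under this normalization, two 11-digit strings that differ in one digit score 1 - 1/11, about 0.91, which clears the 0.8 threshold.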

[0062] Next, we analyze the historical data synchronization timeline between the source data sources for each record pair in the preliminary similarity set. The historical data synchronization timeline records the chronological order and time intervals of data update events between data sources. For a pair of records from different sources A and B, we examine whether the update of the corresponding entity in source A is always earlier than that in source B within the most recent synchronization period T. If such a stable chronological order exists, the current high similarity may be due to a delay in the data synchronization process from A to B, rather than a genuine business relationship. Therefore, a timeline penalty factor is defined. The calculation formula is as follows:

[0063] ,

[0064] where the first time term is the current time and the second is the timestamp of the last complete synchronization from source A to source B. If the resulting value is less than a set tolerance ratio, a data delay is suspected and the record pair is filtered out of the preliminary similarity relationship set. The tolerance ratio is typically set to 0.2 based on business tolerance. The record pairs remaining after filtering are the cross-source candidate association pairs.
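The stable-ordering check that precedes the penalty factor can be illustrated with a short sketch. It assumes the synchronization history is available as pairs of update timestamps (one in source A, one in source B) for the same entity, expressed as plain numbers; the function name and data layout are hypothetical. The penalty factor itself is not computed here, because its formula is not reproduced in this text.

```python
def has_stable_order(update_pairs, now, period):
    """True if, within the last `period`, every update reached A before B.

    update_pairs: list of (t_a, t_b) timestamps of the same update in sources A and B.
    """
    recent = [(t_a, t_b) for t_a, t_b in update_pairs if now - period <= t_b <= now]
    # With no recent updates there is no evidence of a stable order.
    return bool(recent) and all(t_a < t_b for t_a, t_b in recent)


history = [(100, 130), (200, 225), (300, 340)]
a_always_first = has_stable_order(history, now=400, period=350)
mixed_order = has_stable_order(history + [(390, 380)], now=400, period=350)
```

A pair flagged by this check is only a candidate for delay-induced similarity; the tolerance-ratio test described above decides whether it is actually filtered out.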

[0065] For each cross-source candidate association pair, an initial association confidence is calculated; its value is taken directly from the pair's field similarity. The association confidence is then weighted and corrected using the source attenuation coefficients in the quality attenuation parameter. The source attenuation coefficient of a data source s reflects the overall quality degradation level of that source; a higher value indicates that the source data is less reliable. The correction formula is:

[0066] ,

[0067] where the two source terms denote the data sources to which the two records in the association pair belong. This formula ensures that the confidence of associations involving high-attenuation data sources is reduced. The corrected association confidence remains between 0 and 1. Finally, the association pairs whose corrected confidence is greater than the filtering threshold, which is set to 0.6, are selected. An entity association network is constructed with business entities as nodes and the filtered association pairs as edges. The weight of each edge is its corrected association confidence.

[0068] For example, suppose the quality attenuation parameter indicates that the field attenuation vector value of the "Contact Number" field is 0.4 in source C and 0.5 in source D, both exceeding the threshold of 0.3. Therefore, "Contact Number" is included in the attenuation field set. All records containing "Contact Number" are extracted from source C and source D to form a field focus record set, for example, containing records R1 (source C, phone number 13800138001, entity A), R2 (source D, phone number 13800138000, entity B), and R3 (source D, phone number 13800138001, entity C).

[0069] The similarity of the phone numbers is calculated after numerical processing, such as converting the phone numbers to numbers. R1 and R3 have the same phone number, so their similarity is 1.0, which exceeds the threshold of 0.8, and the pair enters the preliminary similarity set. R1 and R2 differ only in the last digit of their phone numbers; assuming a normalized similarity of 0.91, this pair also enters the preliminary similarity set.

[0070] Analysis of the historical data synchronization timeline shows that the data synchronization period T from source C to source D is 24 hours, and that only 2 hours have passed since the last successful synchronization. Therefore, for the record pair R1 (source C) and R2 (source D), the timeline penalty factor is less than the tolerance ratio, so the high similarity is inferred to be possibly caused by data latency, and this record pair is filtered out. R1 and R3 come from source C and source D respectively, but the synchronization record from source C to source D does not contain information about entity C, so there is no delayed synchronization relationship. This record pair is retained as a cross-source candidate association pair.

[0071] Assume the source attenuation coefficient of source C is [value missing] and that of source D is [value missing]. For the candidate association pair (R1, R3), the initial confidence is its field similarity of 1.0, and the corrected confidence is 0.85, which is greater than the screening threshold of 0.6. Therefore, as shown in Figure 2, nodes "Entity A" and "Entity C" are created in the entity association network, and an edge with a weight of 0.85 is established between them.
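The correction and network-construction step can be illustrated with the sketch below. The correction formula itself is not reproduced in this text, so the form used here (scaling the initial confidence by one minus the mean of the two source attenuation coefficients) is purely an assumption that merely satisfies the stated properties: the result stays in [0, 1] and higher attenuation lowers confidence. The coefficient values 0.1 and 0.2 are likewise hypothetical, chosen only so that the edge weight comes out near 0.85.

```python
def corrected_confidence(initial, alpha_a, alpha_b):
    """ASSUMED form: attenuate by the mean of the two source coefficients."""
    return initial * (1.0 - (alpha_a + alpha_b) / 2.0)

def build_network(candidates, alpha, threshold=0.6):
    """Build an undirected entity association network.

    candidates: iterable of (entity_a, source_a, entity_b, source_b, confidence)
    alpha:      mapping from data source to its source attenuation coefficient
    Returns (nodes, edges), where edges maps frozenset({a, b}) to the edge weight.
    """
    edges = {}
    for ent_a, src_a, ent_b, src_b, confidence in candidates:
        weight = corrected_confidence(confidence, alpha[src_a], alpha[src_b])
        if weight > threshold:  # keep only pairs above the filtering threshold
            edges[frozenset((ent_a, ent_b))] = weight
    nodes = set().union(*edges) if edges else set()
    return nodes, edges

# Hypothetical attenuation coefficients for sources C and D.
alpha = {"C": 0.1, "D": 0.2}
nodes, edges = build_network([("Entity A", "C", "Entity C", "D", 1.0)], alpha)
```

Using a `frozenset` as the edge key keeps the network undirected, so the pair (A, C) and the pair (C, A) refer to the same edge.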

[0072] S3. Use a preset domain knowledge graph to verify and complete the path exploration of the entity association network, and generate a verification result containing a credible completion path;

[0073] In one embodiment of the present invention, step S3 includes the following steps:

[0074] Map the nodes and edges in the entity association network to a preset domain knowledge graph, and search for the corresponding entity relationship paths that already exist in the domain knowledge graph;

[0075] For the associated edges that exist in the entity association network but are missing or weakened in the domain knowledge graph, reasoning is performed to complete them and generate potential completion paths.

[0076] Merge the corresponding entity relationship path with the potential completion path;

[0077] Extract business rule constraints and attribute value range constraints related to the entities at both ends of the associated edge from the domain knowledge graph;

[0078] Check whether the changes in entity attributes or relationships in the potential completion path violate the business rule constraints and attribute value range constraints;

[0079] Mark the paths that pass the verification as trusted completion paths and write them into the verification result;

[0080] Paths that fail the verification are marked as conflicts, and conflict constraint information is recorded in the verification result;

[0081] The verification results indicate the status of each associated edge as verified, pending completion, or conflict.

[0082] Specifically, the input to step S3 is the entity association network generated in the previous steps and the quality decay parameters. The implementation process begins by mapping the nodes and edges in the entity association network to a predefined domain knowledge graph. The domain knowledge graph is a large-scale semantic network containing entities, concepts, and rich relationships within a specific industry domain. The mapping operation aims to establish a correspondence between the nodes in the entity association network and the entity types defined in the domain knowledge graph, for example, mapping the "Customer A" node in the network to the "Individual Customer" entity type in the knowledge graph.

[0083] After mapping, the entity relationship paths corresponding to each edge in the entity association network are searched in the domain knowledge graph. An entity relationship path refers to one or more relationship chains connecting two entities in the knowledge graph. For example, if there is an edge in the network connecting "Company A" and "Company B", then the knowledge graph is searched to see if there are known relationship paths such as "Company A-Holding-Company B" or "Company A-Counterpartner-Company B".

[0084] For association edges that exist in the entity association network but lack a corresponding relationship in the domain knowledge graph, or whose relationship confidence is below the weakening threshold, the system initiates an inference completion process to generate potential completion paths. This process is based on the neighbor topology of the entities at both ends of the association edge in the knowledge graph and on the edge weights. One inference method is to calculate the completion confidence $C_{comp}$ based on path connectivity. The formula is:

[0085] $$C_{comp} = w \cdot \frac{1}{|P|} \sum_{p \in P} s(p),$$

[0086] where $w$ is the weight of the association edge in the entity association network; $P$ is the set of all intermediate paths of length 2 that connect the two entities in the knowledge graph, such as "entity A-relation 1-intermediate entity C-relation 2-entity B"; $s(p)$ is the connectivity strength of a single path $p$, defined as the product of the historical co-occurrence frequencies of the two relationships along the path in the knowledge graph; and $|P|$ is the total number of paths. The formula yields a completion confidence between 0 and 1 by averaging the connectivity strength of the intermediate paths and multiplying the average by the network edge weight; a higher value indicates a more reliable completion path. If $C_{comp}$ is above the completion threshold, the corresponding "entity A-relation 1-intermediate entity C-relation 2-entity B" is generated as a potential completion path. The completion threshold is set to 0.5 based on statistical analysis of 5000 known strongly related entity pairs in the knowledge graph.

[0087] The corresponding entity relationship paths found in the previous step are merged with the potential completion paths generated by reasoning to form a set of paths to be verified. Business rule constraints and attribute value range constraints related to the entities at both ends of each path in the path set are extracted from the domain knowledge graph. Business rule constraints define the logical conditions that must be met between entities or between entity attributes, such as "the age of an individual customer must be greater than or equal to 18 years old". Attribute value range constraints define the legal value range or data type of a certain attribute of an entity, such as "the account balance is a floating-point number greater than or equal to 0".

[0088] The system checks each potential completion path to see whether it violates the extracted constraints. For attribute value range constraints, it checks whether the changes in entity attributes implied or caused by the path exceed the range, and calculates the degree of violation $D$:

[0089] $$D = \frac{|v - b|}{v_{max} - v_{min}},$$

[0090] where $v$ is the attribute value inferred from the path, $b$ is the point on the boundary of the value range that is closest to $v$, and the denominator $v_{max} - v_{min}$ is the width of the value range. Business rule constraints are transformed into logical expressions, and the inferred values are substituted into them for verification; any violations are marked. Potential completion paths that pass all constraint verifications and whose $D$ is less than the tolerance (set to 0.1) are marked as trusted completion paths and written into the verification results.

[0091] Paths that fail constraint validation are marked as conflict paths. The specific constraint information that caused the conflict, such as the violated rule entries and the calculated degree of violation, is recorded in the verification result.

[0092] The final verification result is a structured document that clearly identifies the state of each association edge in the entity association network. The state is divided into three categories: "verified" indicates that the edge has a strong corresponding relationship path in the domain knowledge graph; "to be completed" indicates that the edge has generated a credible completion path through reasoning; and "conflict" indicates that the path corresponding to the edge cannot pass the constraint verification of the knowledge graph.

[0093] For example, suppose there is an edge in the entity association network connecting nodes "Company X" and "Individual Y" with a weight of 0.75. The preset domain knowledge graph is a financial risk control knowledge graph. After mapping, "Company X" corresponds to the "Corporate Client" entity, and "Individual Y" corresponds to the "Individual Client" entity.

[0094] A search of the knowledge graph reveals no direct relationship between "Company X" and "Individual Y," but finds an intermediate path of length 2: "Company X - Legal Representative - Individual Z" and "Individual Z - Relative - Individual Y." The connectivity strength of this intermediate path is then calculated. Assuming the historical co-occurrence frequency of "Company X - Legal Representative - Individual Z" is 0.9 and that of "Individual Z - Relative - Individual Y" is 0.8, the connectivity strength is 0.9 × 0.8 = 0.72. Since this is the only intermediate path, the path set contains 1 path. The completion confidence is 0.75 × 0.72 = 0.54, which is higher than the completion threshold of 0.5. Therefore, a potential completion path is generated: "Company X - Legal Representative - Individual Z - Relative - Individual Y".
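The arithmetic of this example can be reproduced with a short sketch. It follows the description of the completion confidence as the network edge weight multiplied by the average connectivity strength of the length-2 intermediate paths; the function names and the choice to return 0.0 when no intermediate path exists are assumptions.

```python
from math import prod

def path_strength(frequencies):
    """Connectivity strength: product of the co-occurrence frequencies on the path."""
    return prod(frequencies)

def completion_confidence(edge_weight, paths):
    """Edge weight times the mean connectivity strength of the intermediate paths.

    paths: list of tuples, one tuple of relationship frequencies per path.
    ASSUMPTION: with no intermediate path, the confidence is taken to be 0.0.
    """
    if not paths:
        return 0.0
    mean_strength = sum(path_strength(p) for p in paths) / len(paths)
    return edge_weight * mean_strength

# Values from the example: edge weight 0.75, relationship frequencies 0.9 and 0.8.
confidence = completion_confidence(0.75, [(0.9, 0.8)])
is_potential_path = confidence > 0.5  # completion threshold from the text
```

With several intermediate paths, each additional tuple contributes its own product to the average, so a single weak path lowers the overall confidence.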

[0095] Constraints related to "individual customers" are extracted from the knowledge graph, such as the business rule "individual customers must be at least 18 years old" and the attribute value range constraint "individual customers' ages are integers between 18 and 100". A potential completion path is examined; it does not directly generate an age value, therefore it does not violate the above constraints. Suppose the knowledge graph also contains the rule "the legal representative of an enterprise cannot be a minor," and the path implies that "individual Z" may be an adult relative of "individual Y," this rule is satisfied. Therefore, this path passes the validation and is marked as a reliable completion path, and the state of the corresponding edge is marked as "to be completed" in the validation results.

[0096] If another reasoning path leads to the conclusion that a customer is 16 years old, the degree of violation of the range constraint is calculated as |16 − 18| / (100 − 18) ≈ 0.024. Although this value is small, the business rule "age must be ≥18 years old" is violated. Therefore, the path is marked as "conflict", and the ID of the violated rule is recorded in the verification result.
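The two kinds of constraint check can be combined in a small sketch. It assumes the degree of violation is the distance from the inferred value to the nearest range boundary divided by the width of the range, and zero inside the range; the rule identifier `RULE_MIN_AGE`, the dictionary layout, and the function names are hypothetical.

```python
def range_violation(value, low, high):
    """Distance to the nearest boundary over the range width; 0.0 inside the range."""
    if low <= value <= high:
        return 0.0
    nearest = low if value < low else high
    return abs(value - nearest) / (high - low)

def validate_path(inferred, ranges, rules, tolerance=0.1):
    """Check inferred attribute values against range and business-rule constraints.

    inferred: attribute values implied by the path, e.g. {"age": 16}
    ranges:   attribute -> (low, high)
    rules:    rule id -> predicate over the inferred values
    """
    violated = [rule_id for rule_id, rule in rules.items() if not rule(inferred)]
    degrees = {
        attr: range_violation(inferred[attr], low, high)
        for attr, (low, high) in ranges.items()
        if attr in inferred
    }
    passed = not violated and all(d < tolerance for d in degrees.values())
    return {
        "status": "trusted" if passed else "conflict",
        "violated_rules": violated,
        "violation_degrees": degrees,
    }

ranges = {"age": (18, 100)}
# Hypothetical rule id; the rule passes when the path implies no age at all.
rules = {"RULE_MIN_AGE": lambda a: "age" not in a or a["age"] >= 18}

conflict_result = validate_path({"age": 16}, ranges, rules)
trusted_result = validate_path({}, ranges, rules)
```

The age-16 case shows why both checks are needed: the range violation alone is under the 0.1 tolerance, yet the path is still rejected because the business rule fails.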

[0097] S4. Based on the verification results, perform collaborative cleaning and completion operations on the original multi-source business data stream, and output the cleaned and completed business entity data stream.

[0098] In one embodiment of the present invention, step S4 includes the following steps:

[0099] The verification results are parsed, and for the business data corresponding to the associated edges with a verified status, consistency resolution is performed using data from a preset confidence source.

[0100] For abnormal business entities corresponding to the associated edges marked as to be completed, a completion decision tree with the abnormal entity as the root node is constructed based on the entity association network and the verification result.

[0101] Traverse the completion decision tree and dynamically select completion operators such as value passing, model derivation, or knowledge graph query based on the quality decay parameter associated with the node and the reliable completion path type corresponding to the edge.

[0102] The completion operator is executed to assign values to all missing or conflicting fields of the abnormal business entity, and the completed values are obtained.

[0103] Integrate the data after consistency resolution with the completed values to reconstruct and generate the business entity data stream.

[0104] Specifically, the inputs to step S4 are the verification results, the entity association network, and the quality degradation parameters. The implementation process begins with parsing the verification results, which identify the state of each edge in the entity association network as verified, to be completed, or conflict. For two business entities linked by an association edge marked as verified, their corresponding original business data are considered reliable and consistent. During processing, the system selects a preset confidence source from all data source records involved in these two entities as the benchmark for resolution and retention. The preset confidence source is determined automatically by comparing the source degradation coefficients of the data sources to which the entities at both ends of the association edge belong and selecting the data source with the smaller coefficient, since its data quality degradation is relatively lower. The consistency resolution operation directly uses the complete record value of the corresponding business entity in the selected data source, thereby ensuring the consistency of the output data within the verified association.
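Selecting the confidence source amounts to taking the data source with the smallest source degradation coefficient. The sketch below illustrates this under assumed names and an assumed record layout; the coefficient values are hypothetical, and ties are broken by insertion order, which the text does not specify.

```python
def select_confidence_source(decay_by_source):
    """Return the data source with the smallest source degradation coefficient.

    ASSUMPTION: on a tie, the first source in insertion order wins.
    """
    return min(decay_by_source, key=decay_by_source.get)

def resolve_verified(records_by_source, decay_by_source):
    """Take the complete record from the selected confidence source.

    records_by_source: data source -> the record it holds for the entity
    """
    eligible = {s: decay_by_source[s] for s in records_by_source}
    chosen = select_confidence_source(eligible)
    return chosen, dict(records_by_source[chosen])

# Hypothetical coefficients and records for two sources.
decay = {"E": 0.1, "F": 0.3}
records = {
    "E": {"name": "Customer M", "phone": "13800138001"},
    "F": {"name": "Customer M", "phone": "13800138000"},
}
source, resolved = resolve_verified(records, decay)
```

Copying the record with `dict(...)` keeps the resolved output independent of the source data, so later completion steps cannot mutate the original stream.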

[0105] For the edges marked as needing completion in the verification results, one or more corresponding business entities are identified as anomalous business entities with missing fields or conflicting values. The system constructs a completion decision tree for each anomalous business entity based on the entity association network and the reliable completion paths recorded in the verification results. This tree, with the anomalous entity as the root node, connects to other related entities as child nodes through edges in the entity association network with weights higher than the traversal threshold, forming a tree-like reasoning structure for deriving the completion value. Each edge in the tree is associated with a reliable completion path provided in the verification results.
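One way to picture the completion decision tree is as a tree grown from the anomalous entity over the edges whose weight exceeds the traversal threshold. The sketch below is an illustration only: the edge representation, the threshold value of 0.5, and the function names are assumptions, and a depth-first walk is included as one possible way to visit the resulting nodes.

```python
def build_completion_tree(root, edges, traversal_threshold):
    """Grow a tree from `root` over edges whose weight exceeds the threshold.

    edges: mapping (entity_u, entity_v) -> weight, treated as undirected
    Returns a mapping node -> list of child nodes.
    """
    adjacency = {}
    for (u, v), weight in edges.items():
        if weight > traversal_threshold:
            adjacency.setdefault(u, []).append(v)
            adjacency.setdefault(v, []).append(u)
    tree, visited, stack = {root: []}, {root}, [root]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, []):
            if neighbour not in visited:  # each entity appears once in the tree
                visited.add(neighbour)
                tree[node].append(neighbour)
                tree[neighbour] = []
                stack.append(neighbour)
    return tree

def depth_first(tree, root):
    """Visit the tree in depth-first (pre-order) sequence."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(tree[node]))
    return order

# Hypothetical network around the anomalous entity "Company P".
edges = {
    ("Company P", "Company Q"): 0.8,
    ("Company Q", "Company R"): 0.7,
    ("Company P", "Company S"): 0.3,  # below the threshold, so excluded
}
tree = build_completion_tree("Company P", edges, traversal_threshold=0.5)
visit_order = depth_first(tree, "Company P")
```

The `visited` set guarantees that cycles in the network do not produce repeated nodes, which keeps the structure a tree.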

[0106] The system employs a depth-first strategy to traverse the completion decision tree. For each field to be completed, the most suitable completion operator needs to be dynamically selected. The selection is based on a scoring function. This function comprehensively considers the quality degradation parameters of the data source where the current node entity resides, as well as the type of the trusted completion path corresponding to the current edge. The scoring function is defined as follows:

[0107] ,

[0108] where the first quantity is the source decay coefficient of the data source to which the current node entity belongs; the second is the confidence of the trusted completion path corresponding to the current edge, as recorded in the verification results, which comes either from the completion confidence calculated in step S3 or from the inherent confidence of a known path; and the two weighting coefficients are pre-configured based on the type of the completion path. For example, for a "family" relationship path, the values of the related entities are preferred, so the corresponding weight is set relatively high. Both terms of this formula are dimensionless, and the result is used to choose among the three completion operators. When the score is high, the value-passing operator is selected, which directly adopts the values of the corresponding fields of adjacent entities with high relevance. When the score is medium, the model derivation operator is selected, which uses a pre-trained regression or classification model to predict the missing field from the other complete fields of the entity. When the score is low, the knowledge graph query operator is selected, which queries typical or default values that meet the constraints from the domain knowledge graph based on the semantics of the trusted completion path.
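The three-way choice among completion operators can be sketched as a simple dispatch on the score. The text does not give the cut-off values that separate high, medium, and low scores, so the thresholds 0.7 and 0.4 below are hypothetical, as are the function names and the callables standing in for the model and the knowledge graph query.

```python
def select_operator(score, high=0.7, medium=0.4):
    """Map a score to a completion operator. The cut-offs are ASSUMED values."""
    if score >= high:
        return "value_passing"
    if score >= medium:
        return "model_derivation"
    return "knowledge_graph_query"

def complete_field(score, field, neighbour, predict, query_default):
    """Run the selected operator and return the completed value with its origin.

    neighbour:     record of the adjacent, highly relevant entity
    predict:       callable standing in for a pre-trained model
    query_default: callable standing in for a knowledge graph query
    """
    operator = select_operator(score)
    if operator == "value_passing":
        value = neighbour[field]
    elif operator == "model_derivation":
        value = predict(field)
    else:
        value = query_default(field)
    return {"field": field, "value": value, "operator": operator}

# Hypothetical neighbour record and stand-in callables.
result = complete_field(
    0.9,
    "industry",
    {"industry": "Manufacturing"},
    predict=lambda field: "predicted",
    query_default=lambda field: "default",
)
```

Recording the operator alongside the value makes it possible to arbitrate between completed values from different operators later on.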

[0109] The selected completion operator is executed immediately. The value-passing operator directly assigns the source value to the target field. The model derivation operator calls a pre-trained random forest model for prediction. The knowledge graph query operator constructs a query statement based on path semantics; for example, for the path "customer-owner-product", it queries the default risk level for that product type. After the operator is executed, a completion value is generated for the target field, and the data source and generation confidence of the value are also recorded.

[0110] After values have been assigned to all missing or conflicting fields of all abnormal business entities, the system enters the data integration phase. This phase merges the verified data obtained after consistency resolution with the completed values generated by all completion operations. During merging, arbitration is performed based on the hierarchy of the data origin, in the following priority order: consistency-resolved data takes precedence over completed data; within the completed data, the result of the value-passing operator takes precedence over the model derivation result, and the model derivation result takes precedence over the knowledge graph query result. Finally, the system reconstructs the complete record of each entity at the granularity of the original business entity, sorts all records by time or business identifier, and outputs a cleaned, completed, and internally consistent business entity data stream.
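The arbitration order described above lends itself to a rank table. The sketch below is a minimal illustration with assumed names and an assumed candidate format of (field, value, origin) tuples; it keeps, for each field, the value whose origin has the highest priority.

```python
# Lower rank means higher priority, following the order stated in the text.
PRIORITY = {
    "consistency_resolution": 0,
    "value_passing": 1,
    "model_derivation": 2,
    "knowledge_graph_query": 3,
}

def merge_entity(candidates):
    """Pick one value per field according to the origin priority.

    candidates: iterable of (field, value, origin) tuples
    """
    best = {}
    for field, value, origin in candidates:
        rank = PRIORITY[origin]
        if field not in best or rank < best[field][0]:
            best[field] = (rank, value)
    return {field: value for field, (_, value) in best.items()}

def build_stream(records, sort_key):
    """Sort reconstructed entity records by a time or business-identifier key."""
    return sorted(records, key=lambda record: record[sort_key])

# Hypothetical candidate values for one entity.
merged = merge_entity([
    ("industry", "Retail", "model_derivation"),
    ("industry", "Manufacturing", "value_passing"),
    ("risk", "low", "knowledge_graph_query"),
    ("name", "Company P", "consistency_resolution"),
])
stream = build_stream([{"id": 2, **merged}, {"id": 1, "name": "Customer M"}], "id")
```

Because the comparison is strict, the first candidate seen wins when two values share the same origin; the text does not say how such ties should be handled.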

[0111] For example, suppose the verification results indicate that the state of the edge connecting "Customer M" and "Customer N" in the entity association network is "verified", that the source decay coefficient of data source E, to which "Customer M" belongs, is [value missing], and that the source decay coefficient of data source F, to which "Customer N" belongs, is [value missing]. The system selects source E as the preset confidence source; therefore, in the output stream, all field values of "Customer M" are taken from records in data source E.

[0112] For another anomalous entity, "Company P", marked as "to be completed", the "Industry Classification" field is missing. According to the entity association network, there is an edge between it and the entity "Company Q" with a weight of 0.8, and the verification result shows that this edge provides a reliable completion path "Company P - Supplier - Company Q", with a path confidence of [value missing]. A completion decision tree is constructed with "Company P" as the root node and "Company Q" as the child node. When the traversal reaches the "Industry Classification" field, assume the source attenuation coefficient of data source G, to which "Company P" belongs, is [value missing]. The completion path type is "Supplier", and the configured weights are [not specified]. The score of the completion operator is then calculated. The score is high, so the "value passing" operator is dynamically selected. If the "Industry Classification" field value of "Company Q" is "Manufacturing", this value is assigned to "Company P" as the completion value.

[0113] Suppose the score calculated for another missing field is 0.45, so the field needs to be derived by the model; the pre-trained random forest model is then invoked. Other known attributes of "Company P," such as "Registered Location" and "Registered Capital," are input, and the model outputs a predicted value. During integration, the records for "Customer M" use source E data, while the records for "Company P" incorporate its original data and the "Industry Classification" field completed through value passing. This ultimately generates a cleaned and completed business entity data stream.

[0114] Based on the same inventive concept, as shown in Figure 3, the present invention also provides an automatic cleaning and completion system for multi-source business data, the system comprising:

[0115] The quality degradation assessment module is used to acquire the original multi-source service data stream, perform quality degradation assessment on the original multi-source service data stream, and generate quality degradation parameters.

[0116] The association network construction module is used to identify and construct the association relationships between cross-source business data based on the quality attenuation parameter, and generate an entity association network;

[0117] The knowledge verification and path exploration module is used to verify and complete the path exploration of the entity association network using a preset domain knowledge graph, and generate a verification result containing a credible completed path.

[0118] The collaborative cleaning and completion execution module is used to perform collaborative cleaning and completion operations on the original multi-source business data stream based on the verification results, and output the cleaned and completed business entity data stream.
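The four modules can be pictured as stages of a single pipeline in which each stage consumes the outputs of the earlier ones. The sketch below only shows that wiring: the class name, attribute names, and stage signatures are hypothetical, and the stage bodies are left to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class CleaningPipeline:
    """Wires the four modules together; each stage is an injected callable."""
    assess_quality: Callable[[Any], Any]                  # raw stream -> quality parameters
    build_network: Callable[[Any, Any], Any]              # raw stream, parameters -> network
    verify_and_explore: Callable[[Any, Any], Any]         # network, parameters -> verification
    clean_and_complete: Callable[[Any, Any, Any, Any], Any]

    def run(self, raw_stream):
        """Execute the four stages in order and return the cleaned entity stream."""
        decay = self.assess_quality(raw_stream)
        network = self.build_network(raw_stream, decay)
        verification = self.verify_and_explore(network, decay)
        return self.clean_and_complete(raw_stream, verification, network, decay)

# Stand-in stages that only record what they were given.
pipeline = CleaningPipeline(
    assess_quality=lambda raw: {"decay": len(raw)},
    build_network=lambda raw, decay: {"edges": decay["decay"]},
    verify_and_explore=lambda network, decay: {"verified": network["edges"]},
    clean_and_complete=lambda raw, verification, network, decay: list(raw),
)
output = pipeline.run(["record-1", "record-2"])
```

Passing the stages in as callables keeps each module replaceable and testable on its own.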

[0119] It should be noted that the electrical connections between the various units described above are not necessarily direct connections. Any form of indirect connection is also applicable to the embodiments of the present invention as long as it achieves the purpose of the present invention. The above descriptions are merely exemplary embodiments of the present invention and should not be construed as limiting the scope of the present invention.

[0120] All equivalent changes and modifications made in accordance with the teachings of this invention remain within the scope of this invention. Those skilled in the art will readily conceive of other embodiments of this invention upon considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this invention that follow the general principles of this invention and include common knowledge or conventional techniques in the art not described herein.

Claims

1. A method for automatic cleaning and completion of multi-source business data, characterized in that, The method includes: Acquire the original multi-source service data stream and perform a quality degradation assessment on the original multi-source service data stream to generate quality degradation parameters; Based on the quality attenuation parameter, the correlation between cross-source business data is identified and constructed, and an entity association network is generated; The entity association network is verified and the path completion is explored using a preset domain knowledge graph, generating a verification result containing a credible completion path; Based on the verification results, a collaborative cleaning and completion operation is performed on the original multi-source business data stream, and the cleaned and completed business entity data stream is output.

2. The method for automatic cleaning and completion of multi-source business data according to claim 1, characterized in that, The process of acquiring the original multi-source service data stream and performing a quality degradation assessment on the original multi-source service data stream to generate quality degradation parameters includes: Receive raw data packets from at least two independent business data sources and parse them to obtain a structured business record stream; The structured business record stream is de-identified to generate a business data stream to be evaluated; The business data stream to be evaluated is subjected to multi-dimensional quality measurement, and the quality decay parameters that characterize the degree of inconsistency and incompleteness at the data source and field levels are calculated.

3. The method for automatic cleaning and completion of multi-source business data according to claim 2, characterized in that the multi-dimensional quality measurement of the business data stream to be evaluated, and the calculation of quality degradation parameters that characterize the degree of inconsistency and incompleteness at the data source and field levels, include: identifying conflicting field values describing the same business entity in the business data stream to be evaluated, and calculating a field-level inconsistency score based on the conflict frequency and the credibility weight of the conflict source; detecting the missing fields of business entity records in the business data stream to be evaluated, and calculating a field-level incompleteness score based on the business criticality of the fields and the source distribution of the missing records; and aggregating the field-level inconsistency score and field-level incompleteness score of all fields under the same data source and, in combination with the real-time availability status of the corresponding data source, generating a quality decay parameter, wherein the quality decay parameter includes the source decay coefficient and the field decay vector.

4. The method for automatic cleaning and completion of multi-source business data according to claim 3, characterized in that the step of identifying and constructing the association relationships between cross-source business data based on the quality attenuation parameter, and generating an entity association network, includes: locating the attenuation field set based on the field attenuation vector in the quality attenuation parameter; using the attenuation field set as the focus of association detection, filtering cross-source candidate association pairs containing association confidence in the original multi-source business data stream; and performing weighted correction on the association confidence of the cross-source candidate association pairs based on the source attenuation coefficient, then selecting pairs and constructing an entity association network with business entities as nodes, wherein the edge weights in the entity association network are the corrected association confidence.

5. The method for automatic cleaning and completion of multi-source business data according to claim 4, characterized in that, The step of using the attenuation field set as the focus of association detection, and filtering cross-source candidate association pairs containing association confidence in the original multi-source business data stream, includes: For each field in the attenuation field set, extract all business records from all data sources that involve the corresponding field to form a field focus record set; Calculate the similarity between the focus record sets of the field in terms of numerical value, text, or category, and generate a preliminary similarity relationship set; By analyzing the historical data synchronization time sequence between the source data sources of the preliminary similarity relationship set record pairs, and combining the quality decay parameter, false associations caused by data delays are inferred and filtered to obtain cross-source candidate association pairs containing association confidence.

6. The method for automatic cleaning and completion of multi-source business data according to claim 4, characterized in that, The step of using a preset domain knowledge graph to verify and complete the path exploration of the entity association network, and generating a verification result containing a credible completion path, includes: Map the nodes and edges in the entity association network to a preset domain knowledge graph, and search for the corresponding entity relationship paths that already exist in the domain knowledge graph; For the associated edges that exist in the entity association network but are missing or weakened in the domain knowledge graph, reasoning is performed to complete them and generate potential completion paths. Merge the corresponding entity relationship path with the potential completion path; The feasibility of the fused path is verified based on the constraint rules of entity attributes in the domain knowledge graph, and a verification result containing the credible completion path is generated. The verification result identifies the status of each associated edge as verified, pending completion, or conflict.

7. The method for automatic cleaning and completion of multi-source business data according to claim 6, characterized in that, The step of verifying the feasibility of the fused path based on the constraint rules of entity attributes in the domain knowledge graph, and generating a verification result containing a credible completed path, includes: Extract business rule constraints and attribute value range constraints related to the entities at both ends of the associated edge from the domain knowledge graph; Check whether the changes in entity attributes or relationships in the potential completion path violate the business rule constraints and attribute value range constraints; Mark the paths that pass the verification as trusted completion paths and inject the verification results; Paths that fail the verification are marked as conflicts, and conflict constraint information is recorded in the verification result.

8. The method for automatic cleaning and completion of multi-source business data according to claim 7, characterized in that the step of performing collaborative cleaning and completion operations on the original multi-source business data stream based on the verification results, and outputting the cleaned and completed business entity data stream, includes: parsing the verification results and, for the business data corresponding to the associated edges with a verified status, performing consistency resolution using data from a preset confidence source; for associated edges in the state of "to be completed" and the associated missing or conflicting fields, calculating the completion value from a preset source or by path deduction based on the reliable completion path in the verification result; and integrating the data after consistency resolution with the completed values to reconstruct and generate the business entity data stream.

9. The method for automatic cleaning and completion of multi-source business data according to claim 8, characterized in that, for associated edges in the state of "to be completed" and the associated missing or conflicting fields, calculating the completion value from a preset source or by path deduction based on the reliable completion path in the verification result includes: for abnormal business entities corresponding to the associated edges marked as to be completed, constructing a completion decision tree with the abnormal entity as the root node based on the entity association network and the verification result; traversing the completion decision tree and dynamically selecting completion operators such as value passing, model derivation, or knowledge graph query based on the quality decay parameter associated with the node and the type of reliable completion path corresponding to the edge; and executing the completion operator to assign values to all missing or conflicting fields of the abnormal business entity, resulting in the completed value.

10. A multi-source business data automatic cleaning and completion system, characterized in that, The system includes: The quality degradation assessment module is used to acquire the original multi-source service data stream, perform quality degradation assessment on the original multi-source service data stream, and generate quality degradation parameters. The association network construction module is used to identify and construct the association relationships between cross-source business data based on the quality attenuation parameter, and generate an entity association network; The knowledge verification and path exploration module is used to verify and complete the path exploration of the entity association network using a preset domain knowledge graph, and generate a verification result containing a credible completed path. The collaborative cleaning and completion execution module is used to perform collaborative cleaning and completion operations on the original multi-source business data stream based on the verification results, and output the cleaned and completed business entity data stream.