Data quality rule generation and dataset production methods and systems
By automatically inferring data quality rules using industry knowledge graphs and combining them with business task requirements and performance feedback, a closed-loop system is formed, solving the problem of low efficiency in manually formulating rules in existing technologies and achieving efficient and self-optimizing dataset generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- CHINA STATE CONSTR OVERSEAS DEV CO LTD
- Filing Date
- 2026-04-16
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, the formulation of data quality rules relies heavily on manual labor, which is inefficient, makes it difficult to cope with massive and rapidly changing business data, and lacks industry semantic understanding. The rule generation is disconnected from the dataset production, resulting in high thresholds, low efficiency, and no ability to dynamically optimize.
By employing industry knowledge graphs for semantic analysis, data quality candidate rules are automatically generated through reasoning. These rules are then progressively governed in conjunction with business task requirements. Performance feedback is used to dynamically optimize the rule generation strategy, forming a collaborative learning closed loop of generation, production, and optimization.
The solution significantly lowers the barrier to rule formulation and improves its efficiency, gives the rules industry semantic understanding so that they can detect problems traditional rules miss, and enables self-optimization and continuous improvement of high-quality datasets.
Figure CN122242705A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data governance technology, and specifically to a method and system for generating data quality rules and producing datasets.
Background Technology
[0002] Currently, the construction of high-quality datasets faces the following core technical bottlenecks: the formulation of data quality rules relies heavily on manual curation and coding by business experts and data engineers. This approach is inefficient, costly, and unable to cope with massive and rapidly changing business data, resulting in a high barrier to rule formulation, difficulty for business personnel to participate, and low efficiency in building high-quality datasets.
[0003] Existing rule generation technologies are mostly based on statistical data features, lacking an understanding of the industry business semantics behind the data. Rule generation is disconnected from the dataset production process, and the generated rules cannot be dynamically optimized based on the performance of the dataset in specific tasks.
[0004] Therefore, how to lower the threshold for formulating data quality rules, enable the rules to have industry semantic understanding capabilities, and connect rule generation with dataset production to form a self-optimizing closed-loop system are technical problems that urgently need to be solved in this field.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method and system for generating data quality rules and producing datasets, solving the problems of difficulty in formulating data quality rules, difficulty in integrating industry knowledge, and disconnect between rules and data production in the prior art.
[0006] To achieve the above objectives, this invention provides a method for generating data quality rules and producing datasets, comprising the following steps: S1. Based on the industry knowledge graph, perform semantic analysis and mapping on the schema of the data tables to be governed, and automatically infer candidate data quality rules, each with a confidence assessment. S2. Receive specific business task requirements, filter a matching rule subset from the candidate rules, and use this rule subset to progressively govern the original data to produce a high-quality dataset. S3. Monitor the performance of the high-quality dataset in subsequent business applications, evaluate the effectiveness of the rules that flagged data problems based on performance feedback, and dynamically optimize the constraint confidence in the industry knowledge graph and the rule generation strategy accordingly.
[0007] By adopting this technical solution, candidate rules for data quality are automatically generated based on industry knowledge graphs, transforming rule formulation from manual to automated, and shifting business personnel from rule writers to reviewers. This significantly lowers the threshold and labor costs for rule formulation, solving the problem of low efficiency in rule formulation relying on human experts. Rules are selected based on specific business task requirements to produce high-quality datasets, achieving targeted data generation and addressing the issue of low efficiency in constructing high-quality datasets. Based on performance feedback, the constraint confidence and rule generation strategies in the industry knowledge graph are dynamically optimized, forming a collaborative learning closed loop of generation-production-optimization. This enables the system to evolve, avoids a fixed rule base, and continuously improves the quality of generated data.
[0008] Furthermore, in step S1, inferring the candidate data quality rules with confidence assessment includes: mapping the fields of the data table to be governed to corresponding entities or attributes in the industry knowledge graph through entity linking, discovering the associated atomic business constraints, and instantiating the atomic business constraints into executable data quality detection rules for the respective fields; wherein the industry knowledge graph includes industry entities, relationships between entities, entity attributes, and atomic business constraints abstracted from industry standards and historical practices.
[0009] By adopting this technical solution, the generated rules are equipped with industry-specific semantic understanding capabilities, enabling them to discover semantic data problems that traditional rules based on statistical data features cannot detect.
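The three sub-steps described above (entity linking, constraint discovery, rule instantiation) can be sketched as follows. This is a minimal illustration only: the in-memory dictionaries `ENTITY_LINKS` and `CONSTRAINTS`, the `CandidateRule` class, the field names, and all confidence values are hypothetical stand-ins for a real knowledge graph, and the "non-negative amount" constraint is invented for the example.

```python
from dataclasses import dataclass

# Hypothetical entity-link results: table field -> (graph node, link confidence).
ENTITY_LINKS = {
    "cost_item": ("CostItem", 0.97),
    "budget_amount": ("Amount", 0.93),
}

# Hypothetical atomic constraints attached to graph nodes:
# (rule type, logic template, constraint confidence).
CONSTRAINTS = {
    "CostItem": [("value_range", "{field} IN (SELECT code FROM wbs_dict)", 0.95)],
    "Amount": [("non_negative", "{field} >= 0", 0.90)],
}


@dataclass
class CandidateRule:
    field: str
    rule_type: str
    logic: str
    link_confidence: float
    constraint_confidence: float


def generate_candidate_rules(fields):
    """Link each field, discover its constraints, and instantiate rules."""
    rules = []
    for field_name in fields:
        link = ENTITY_LINKS.get(field_name)
        if link is None:
            continue  # fields without an entity link yield no semantic rules
        node, link_conf = link
        for rule_type, template, constraint_conf in CONSTRAINTS.get(node, []):
            logic = template.format(field=field_name)
            rules.append(
                CandidateRule(field_name, rule_type, logic, link_conf, constraint_conf)
            )
    return rules


rules = generate_candidate_rules(["project_code", "cost_item", "budget_amount"])
for rule in rules:
    print(rule.field, "->", rule.logic)
```

In this sketch the unlinked field `project_code` produces no rule, which mirrors the idea that semantic rules exist only where a field can be tied to the graph.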
[0010] Furthermore, in step S2, the progressive governance includes: first, executing basic syntax cleaning rules to produce a baseline dataset; then, executing complex rules based on industry semantics, providing correction suggestions based on the industry knowledge graph for triggered data anomalies, and triggering a human-machine collaborative decision-making process.
[0011] By adopting this technical solution, a gradual combination of automated governance and manual review has been achieved.
[0012] Furthermore, the results of human decision-making in human-machine collaborative decision-making are fed back into the industry knowledge graph as new samples to update the confidence of relevant constraints or supplement new constraints.
[0013] By adopting this technical solution, the experience of human-machine collaboration has been continuously accumulated, enabling the industry knowledge graph to be continuously enriched and improved with business practices.
[0014] Furthermore, in step S3, dynamically optimizing the rule generation strategy includes: for rules that correctly intercept erroneous data, increasing the confidence of the corresponding atomic constraints in the industry knowledge graph; for rules that generate false positives, reducing the confidence of their corresponding atomic constraints, and abstracting new potential constraints from false-negative problem patterns and adding them to the industry knowledge graph.
[0015] By adopting this technical solution, differentiated assessment of rule effectiveness and refined evolution of industry knowledge graphs have been achieved.
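The feedback-driven adjustment described above can be illustrated with a small sketch. The text does not specify an update formula, so the fixed additive step of 0.05, the clamping to [0, 1], the outcome labels, and the initial confidence of a newly proposed constraint are all assumptions made for this example.

```python
def update_confidence(confidence, outcome, step=0.05):
    """Raise confidence for a correct interception, lower it for a false positive.

    The additive step and the clamping are illustrative assumptions.
    """
    if outcome == "true_positive":
        confidence += step
    elif outcome == "false_positive":
        confidence -= step
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    return round(min(1.0, max(0.0, confidence)), 4)


def propose_potential_constraint(missed_pattern, initial_confidence=0.5):
    """Turn a false-negative pattern into a potential constraint awaiting evidence."""
    return {
        "description": missed_pattern,
        "confidence": initial_confidence,
        "status": "potential",
    }


print(update_confidence(0.88, "true_positive"))
print(update_confidence(0.60, "false_positive"))
print(propose_potential_constraint("supplier credit rating is empty"))
```

A real system would more likely weight each update by the amount of evidence behind it; the fixed step is used here only to keep the mechanism visible.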
[0016] This invention also provides a system for implementing the method for generating data quality rules and producing datasets, comprising: an industry knowledge graph module, used to store and manage industry entities, relationships, attributes, and atomic business constraints, and supporting dynamic adjustment of constraint confidence; a rule generation engine, connected to the industry knowledge graph module, used to automatically generate candidate data quality rules from the input data table schema by querying and reasoning over the industry knowledge graph; a dataset production engine, connected to the rule generation engine, used to invoke relevant rules according to business task requirements, govern the raw data lake, and produce high-quality datasets through an interactive processing flow; and a collaborative learning closed-loop module, connected to the dataset production engine and the industry knowledge graph module, used to collect performance feedback on the high-quality dataset in business applications, analyze the effectiveness of rules, and drive the industry knowledge graph module and the rule generation engine to perform adaptive optimization.
[0017] By adopting this technical solution, a complete closed-loop system of generation, production, and optimization was constructed, thus connecting the rule generation and dataset production links.
[0018] Furthermore, the system operates according to a rule-data-task triplet collaborative model, where the effectiveness of business tasks serves as the optimization objective, updating the internal parameters of the knowledge graph and the rule generation strategy.
[0019] By adopting this technical solution, the system is able to learn from the data production results and continuously evolve itself.
[0020] Furthermore, the dataset production engine has a built-in human-machine collaborative decision-making interface. When data anomalies triggered by rules cannot be automatically repaired, a work order containing a problem description, knowledge graph derivation basis, and repair suggestions is generated, pushed to the designated business personnel, and the business personnel's processing decisions and results are synchronously fed back to the industry knowledge graph module.
[0021] By adopting this technical solution, the standardization and streamlining of human-machine collaborative decision-making processes and the continuous accumulation of governance experience have been achieved.
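A work order of the kind described above might be modeled as a simple record. The sketch below is an assumption-laden illustration: the `WorkOrder` class, its field names, the `resolve` helper, and the `feedback_log` list (standing in for the industry knowledge graph module) are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkOrder:
    record_id: str
    problem_description: str
    derivation_basis: str       # the knowledge-graph constraint that flagged the record
    repair_suggestion: str
    assignee: str
    decision: Optional[str] = None  # filled in by the business reviewer


def resolve(order, decision, feedback_log):
    """Record the reviewer's decision and feed it back as a new sample."""
    order.decision = decision
    feedback_log.append(
        {
            "constraint": order.derivation_basis,
            "record_id": order.record_id,
            "decision": decision,
        }
    )
    return order


feedback_log = []  # stand-in for the feedback channel to the knowledge graph
order = WorkOrder(
    record_id="r2",
    problem_description="suspected overpayment, please check the contract and payment slip",
    derivation_basis="actual amount <= contract amount + change amount",
    repair_suggestion="verify whether a supplementary agreement exists",
    assignee="cost-control-team",
)
resolve(order, "special_settlement", feedback_log)
print(feedback_log)
```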
[0022] Furthermore, the rules generated by the rule generation engine include a comprehensive confidence score calculated from the entity link confidence score and the source constraint confidence score.
[0023] By adopting this technical solution, a quantitative credibility basis is provided for ranking and screening the candidate rules.
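The text states that the comprehensive confidence is calculated from the entity link confidence and the source constraint confidence, but does not give the formula. As one possible reading, the sketch below multiplies the two values and uses the result to rank and screen candidates. The product, the 0.8 threshold, and the rule names are assumptions for illustration only.

```python
def combined_confidence(link_confidence, constraint_confidence):
    """Combine the two scores by multiplication (an assumed formula)."""
    for value in (link_confidence, constraint_confidence):
        if not 0.0 <= value <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
    return link_confidence * constraint_confidence


def rank_rules(rules, threshold=0.8):
    """Score (name, link_conf, constraint_conf) tuples, filter, sort descending."""
    scored = [(name, combined_confidence(link, cons)) for name, link, cons in rules]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


candidates = [
    ("rule_a", 0.9, 0.9),
    ("rule_b", 1.0, 0.95),
    ("rule_c", 0.5, 0.9),
]
ranked = rank_rules(candidates)
print([name for name, _ in ranked])
```

A product penalizes a rule when either the link or the constraint is doubtful; a weighted average or a minimum would be equally plausible choices under the same description.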
[0024] Furthermore, business task requirements include the need to build AI model training datasets, with high-quality datasets including data quality profiles, versions of the applied rule sets, and data governance lineage information.
[0025] By adopting this technical solution, the dataset carries complete quality assurance information and traceability for subsequent use, facilitating problem tracing and version management.
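One way to attach the data quality profile, rule set version, and governance lineage to a produced dataset is a small manifest stored alongside it. The `DatasetManifest` class, its field names, the version string, and the step labels below are hypothetical; only the three kinds of information come from the text, and the two quality figures are those reported in the embodiment.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetManifest:
    dataset_name: str
    rule_set_version: str
    quality_profile: dict
    lineage: list = field(default_factory=list)

    def record_step(self, step, rule_ids):
        """Append one governance step and the rules it applied to the lineage."""
        self.lineage.append({"step": step, "rule_ids": list(rule_ids)})


manifest = DatasetManifest(
    dataset_name="Project Cost Details Table_HQ",
    rule_set_version="v1",  # hypothetical version label
    quality_profile={"completeness": 0.998, "association_consistency": 1.0},
)
manifest.record_step("baseline_cleaning", ["not_null", "format"])
manifest.record_step("semantic_rules", ["rule_1", "rule_2"])

manifest_json = json.dumps(asdict(manifest), ensure_ascii=False, indent=2)
print(manifest_json)
```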
[0026] Compared with the prior art, the present invention has the following advantages: 1. By automatically generating rules based on industry knowledge graphs, business personnel are transformed from rule writers to reviewers, significantly improving the efficiency of rule formulation.
[0027] 2. By introducing industry knowledge graphs, semantic layer data problems that traditional rules cannot discover can be found.
[0028] 3. Through a collaborative learning loop of "generation-production-optimization", the rule base and knowledge graph are continuously optimized with application feedback, and the quality of the generated datasets is constantly improved.
[0029] 4. The resulting industry knowledge graph and rule package can be quickly replicated to new projects, enabling the accumulation and reuse of knowledge assets.
Description of the Drawings
[0030] Figure 1 is a schematic diagram of the data quality rule generation and dataset production method of this invention.
Detailed Implementation
[0031] The present invention is further described below with reference to Figure 1 and specific embodiments.
[0032] Example: Production of high-quality datasets for project cost forecasting in the construction engineering field.
[0033] This embodiment uses the task of project cost prediction in the field of construction engineering as an example to provide a detailed description of the technical solution of the present invention.
System configuration and knowledge graph initialization
[0034] Hardware environment: cloud server configured with an 8-core CPU and 32 GB of memory.
[0035] Software environment: Neo4j graph database (for storing the industry knowledge graph KG), Python 3.9+ (for running the rule generation engine RE, the dataset production engine DPE, and the collaborative learning CL algorithm), and Apache Spark (for processing large-scale data).
[0036] Knowledge graph construction: import national and industry standard documents and use NLP technology to extract entities and constraints (such as the concrete-related clauses in GB 50666-2011); import historical project data and mine common data problem patterns to form constraints; manually enter the core business terminology table. The initial industry knowledge graph contains approximately 5,000 entities, 20,000 relationships, and 1,500 atomic constraints.
Step S1: Receive the task and trigger rule generation.
[0037] The business task is "cost overrun prediction", the target model is XGBoost, and the quality requirements are that the completeness of key fields exceeds 99% and the consistency of numerical logic is 100%.
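The two quality requirements stated for this task can be expressed as a simple acceptance check. The sketch below assumes that completeness is measured per cell over the key fields, which the text does not specify; the function names and sample rows are hypothetical.

```python
def completeness(rows, key_fields):
    """Share of key-field cells that are filled (cell-level definition, assumed)."""
    total = len(rows) * len(key_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for row in rows for name in key_fields if row.get(name) not in (None, "")
    )
    return filled / total


def meets_requirements(completeness_ratio, consistency_ratio):
    """Task thresholds: completeness above 99% and consistency of exactly 100%."""
    return completeness_ratio > 0.99 and consistency_ratio == 1.0


sample = [
    {"project_code": "P0001", "actual_amount": 100.0},
    {"project_code": "P0002", "actual_amount": None},
]
print(completeness(sample, ["project_code", "actual_amount"]))
print(meets_requirements(0.998, 1.0))
```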
[0038] The input data table is the "Project Cost Details Table", which includes the following fields: Project Code, Cost Item, Budget Amount, Actual Amount, Contract Number, and Supplier.
[0039] The rule generation process is as follows: First, establish entity links: link the "Cost Item" field to the "Cost Item" entity in the industry knowledge graph, and link the "Budget Amount" field to the "Amount" attribute.
[0040] Then, perform constraint discovery: starting from the "Cost Item" entity, the constraint "Account code must conform to the company's WBS dictionary" is found; starting from the "Amount" attribute and the "Contract" relationship, the associated constraint "Actual amount ≤ corresponding contract amount + change amount" is found.
[0041] Finally, instantiate the rules: the above atomic business constraints are instantiated into executable rules for specific fields. For example, rule 1 has the target field "Cost Item", the type value range check, the logic "account code IN (SELECT code FROM wbs_dict)", and a confidence of 0.92; rule 2 has the target field "Actual Amount", the type association consistency, the logic of checking whether the actual amount exceeds the contract amount plus the change amount in the associated contract table, and a confidence of 0.88.
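To show how an instantiated rule such as rule 1 could be executed, the following sketch runs its logic against an in-memory SQLite database. The table `cost_details`, the column `account_code`, the sample rows, and the `find_violations` helper are assumptions; only `wbs_dict`, `code`, and the confidence of 0.92 appear in the embodiment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE wbs_dict (code TEXT PRIMARY KEY);
    CREATE TABLE cost_details (project_code TEXT, account_code TEXT);
    INSERT INTO wbs_dict VALUES ('WBS-01'), ('WBS-02');
    INSERT INTO cost_details VALUES
        ('P1', 'WBS-01'), ('P1', 'WBS-99'), ('P2', 'WBS-02');
    """
)

RULE_1 = {
    "target_field": "account_code",
    "type": "value_range",
    "logic": "account_code IN (SELECT code FROM wbs_dict)",
    "confidence": 0.92,
}


def find_violations(connection, table, rule):
    """Return the rows for which the rule's logic evaluates to false.

    The rule logic is system-generated SQL; it must never be built from
    untrusted input, since it is interpolated into the query text.
    """
    query = f"SELECT rowid, * FROM {table} WHERE NOT ({rule['logic']})"
    return connection.execute(query).fetchall()


violations = find_violations(conn, "cost_details", RULE_1)
print(violations)
```

Rows whose `account_code` is NULL are not reported by this query, because the predicate evaluates to NULL rather than false; in the staged flow of the embodiment such rows would be handled by the earlier non-empty checks.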
[0042] The final output consists of 12 candidate rules.
Step S2: Produce high-quality datasets in a targeted manner.
[0043] First, task matching is performed: based on the business task of "cost overrun prediction", 8 rules related to "cost" and "contract" are selected.
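Task matching of this kind can be sketched as a filter over topic tags. The source does not describe how rules are tagged or matched, so the `topics` sets, the `match_rules` function, and the extra `rule_x` entry are hypothetical; the confidences of rule 1 and rule 2 are those given in the embodiment.

```python
def match_rules(candidate_rules, task_topics, min_confidence=0.0):
    """Keep rules that share at least one topic with the task."""
    topics = set(task_topics)
    return [
        rule
        for rule in candidate_rules
        if topics & set(rule["topics"]) and rule["confidence"] >= min_confidence
    ]


candidates = [
    {"id": "rule_1", "topics": {"cost"}, "confidence": 0.92},
    {"id": "rule_2", "topics": {"cost", "contract"}, "confidence": 0.88},
    {"id": "rule_x", "topics": {"schedule"}, "confidence": 0.90},  # invented example
]

selected = match_rules(candidates, {"cost", "contract"})
print([rule["id"] for rule in selected])
```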
[0044] Then, progressive governance is carried out. In the first round, basic syntax cleaning rules, including non-empty checks and format checks, are executed; they process about 5% of the records and produce a baseline dataset.
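The first-round syntax cleaning can be illustrated with a non-empty check and a format check. In this sketch, records that fail are set aside rather than repaired, which is an assumption, as are the `P` plus four digits project code format, the field names, and the sample rows.

```python
import re

# Hypothetical format for project codes: the letter P followed by four digits.
PROJECT_CODE_PATTERN = re.compile(r"P\d{4}")


def baseline_clean(rows, required_fields):
    """Split rows into a baseline set and a rejected set with reasons."""
    baseline, rejected = [], []
    for row in rows:
        reasons = [
            f"missing:{name}"
            for name in required_fields
            if row.get(name) in (None, "")
        ]
        code = row.get("project_code") or ""
        if code and not PROJECT_CODE_PATTERN.fullmatch(code):
            reasons.append("format:project_code")
        if reasons:
            rejected.append((row, reasons))
        else:
            baseline.append(row)
    return baseline, rejected


rows = [
    {"project_code": "P0001", "actual_amount": 100.0},
    {"project_code": "0002", "actual_amount": 50.0},
    {"project_code": "P0003", "actual_amount": None},
]
baseline, rejected = baseline_clean(rows, ["project_code", "actual_amount"])
print(len(baseline), [reasons for _, reasons in rejected])
```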
[0045] The second round of execution used complex rules based on industry semantics. When executing rule 2, it was found that the "actual amount" of 15 records exceeded the contract amount plus the change amount. The system automatically linked the contract table and found that 10 of these records had change orders, so the calculation was automatically corrected. The remaining 5 records did not have change orders, so the system generated an intelligent suggestion work order, which included the problem description "suspected overpayment, please check the contract and payment slip" and the reasoning based on the knowledge graph, and pushed it to the business department.
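The triage described in this round can be sketched as follows. This is one interpretation of the passage, not a confirmed algorithm: a record is first compared with the contract amount alone, then re-checked with linked change orders, and sent for human review only if the overrun remains unexplained. The function name, record layout, and sample amounts are hypothetical.

```python
def triage_overruns(records, contract_amounts, change_orders):
    """Classify overrun records as auto-resolved or needing human review."""
    auto_resolved, needs_review = [], []
    for record in records:
        contract_no = record["contract_no"]
        base = contract_amounts[contract_no]
        if record["actual_amount"] <= base:
            continue  # no overrun, nothing to do
        changes = change_orders.get(contract_no, [])
        if changes and record["actual_amount"] <= base + sum(changes):
            # The linked change orders explain the overrun.
            auto_resolved.append(record["id"])
        else:
            needs_review.append(
                {
                    "record_id": record["id"],
                    "problem": "suspected overpayment, please check the "
                               "contract and payment slip",
                    "basis": "actual amount <= contract amount + change amount",
                }
            )
    return auto_resolved, needs_review


contract_amounts = {"C1": 100.0, "C2": 200.0, "C3": 50.0}
change_orders = {"C1": [20.0]}
records = [
    {"id": "r1", "contract_no": "C1", "actual_amount": 110.0},
    {"id": "r2", "contract_no": "C2", "actual_amount": 250.0},
    {"id": "r3", "contract_no": "C3", "actual_amount": 40.0},
]
auto_resolved, needs_review = triage_overruns(records, contract_amounts, change_orders)
print(auto_resolved, [order["record_id"] for order in needs_review])
```

Unlike the embodiment, where the remaining records simply had no change orders, this sketch also routes records whose change orders are insufficient to cover the overrun to human review.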
[0046] The business department confirmed in the work order system that 3 records were data entry errors, which were corrected, and that 2 records were special settlements (the supplementary agreements had not been entered). The correction results and the "special settlement" tag were synchronously fed back to the industry knowledge graph as exception samples of the "actual amount ≤ contract amount + change amount" constraint, and triggered a new potential constraint: "needs to be checked against the associated supplementary agreement".
[0047] The final output is the governed "Project Cost Details Table_HQ" together with its quality report, which shows a completeness of 99.8% and an association consistency of 100%.
Step S3: Feedback and optimization.
[0048] The cost prediction model was trained using the "Project Cost Details Table_HQ" and achieved an AUC value of 0.89 on the test set.
[0049] Analysis of the samples with prediction errors revealed that some errors were related to null values in the "Supplier" credit rating field, while the current rule set did not contain strong constraints on this field.
[0050] The collaborative optimization process is as follows: in the industry knowledge graph, the confidence of the association constraint between the "Supplier" entity and the "credit rating" attribute was originally 0.6. This feedback serves as negative evidence, so the confidence was lowered slightly to 0.55 and the constraint was marked as needing more evidence. The system prompted the data governance officer: "The missing supplier credit rating data may be related to cost risks. It is recommended to assess whether to include it in the governance scope." After the governance officer confirmed, a high-confidence constraint, "The core supplier credit rating should not be empty", was manually added to the industry knowledge graph, and the generation of governance rules for this field was initiated.
In this embodiment, the rule-making effort was reduced from 2-3 person-weeks in the traditional model to 2 person-days; the error rate of key business logic was reduced from about 8% initially to below 0.1%; with the high-quality dataset produced, the AUC of the cost prediction model improved from 0.82 to 0.89; and this task added 2 effective constraints and 3 rule exceptions to the industry knowledge graph, which can be reused in other projects.
[0051] The present invention has been described in detail above with reference to the accompanying drawings and embodiments. Those skilled in the art can make various modifications to the present invention based on the above description. Therefore, certain details in the embodiments should not be construed as limiting the present invention, and the scope of protection of the present invention shall be defined by the appended claims.
Claims
1. A method for generating data quality rules and producing datasets, characterized in that it includes the following steps: S1. Based on the industry knowledge graph, perform semantic analysis and mapping on the schema of the data tables to be governed, and automatically infer candidate data quality rules, each with a confidence assessment. S2. Receive specific business task requirements, select a matching subset of rules from the candidate rules, and use this subset of rules to progressively govern the original data to produce a high-quality dataset. S3. Monitor the performance of the high-quality dataset in subsequent business applications, evaluate the effectiveness of the rules that flagged data problems based on performance feedback, and dynamically optimize the constraint confidence in the industry knowledge graph and the rule generation strategy accordingly.
2. The data quality rule generation and dataset production method according to claim 1, characterized in that, in step S1, inferring the candidate data quality rules with confidence assessment includes: mapping the fields of the data table to be governed to corresponding entities or attributes in the industry knowledge graph through entity linking, discovering the associated atomic business constraints, and instantiating them into executable data quality detection rules for the respective fields; wherein the industry knowledge graph includes industry entities, relationships between entities, entity attributes, and atomic business constraints abstracted from industry standards and historical practices.
3. The data quality rule generation and dataset production method according to claim 1, characterized in that, in step S2, the progressive governance includes: first, executing basic syntax cleaning rules to produce a baseline dataset; then, executing complex rules based on industry semantics, providing correction suggestions based on the industry knowledge graph for triggered data anomalies, and triggering a human-machine collaborative decision-making process.
4. The data quality rule generation and dataset production method according to claim 3, characterized in that the results of human decision-making in the human-machine collaborative decision-making process are fed back into the industry knowledge graph as new samples to update the confidence of relevant constraints or to supplement new constraints.
5. The data quality rule generation and dataset production method according to claim 2, characterized in that, in step S3, dynamically optimizing the rule generation strategy includes: for rules that correctly intercept erroneous data, increasing the confidence of the corresponding atomic constraints in the industry knowledge graph; for rules that generate false positives, reducing the confidence of their corresponding atomic constraints, and abstracting new potential constraints from false-negative problem patterns and adding them to the industry knowledge graph.
6. A system for implementing the data quality rule generation and dataset production method as described in any one of claims 1-5, characterized in that it includes: an industry knowledge graph module, used to store and manage industry entities, relationships, attributes, and atomic business constraints, and supporting dynamic adjustment of constraint confidence; a rule generation engine, connected to the industry knowledge graph module, used to automatically generate candidate data quality rules from the input data table schema by querying and reasoning over the industry knowledge graph; a dataset production engine, connected to the rule generation engine, used to invoke relevant rules according to business task requirements, govern the raw data lake, and produce high-quality datasets through an interactive processing flow; and a collaborative learning closed-loop module, connected to the dataset production engine and the industry knowledge graph module, used to collect performance feedback on the high-quality dataset in business applications, analyze the effectiveness of rules, and drive the industry knowledge graph module and the rule generation engine to perform adaptive optimization.
7. The data quality rule generation and dataset production system according to claim 6, characterized in that the system operates according to a rule-data-task triplet collaborative model, where the effectiveness of business tasks serves as the optimization objective, updating the internal parameters of the knowledge graph and the rule generation strategy.
8. The data quality rule generation and dataset production system according to claim 6, characterized in that the dataset production engine has a built-in human-machine collaborative decision-making interface; when data anomalies triggered by rules cannot be automatically repaired, a work order containing a problem description, knowledge graph derivation basis, and repair suggestions is generated and pushed to the designated business personnel, and the business personnel's processing decisions and results are then synchronously fed back to the industry knowledge graph module.
9. The data quality rule generation and dataset production system according to claim 6, characterized in that the rules generated by the rule generation engine include a comprehensive confidence score calculated from the entity link confidence score and the source constraint confidence score.
10. The data quality rule generation and dataset production system according to claim 6, characterized in that the business task requirements include the need to build AI model training datasets, and the high-quality datasets include data quality profiles, versions of the applied rule sets, and data governance lineage information.