AI-based data warehouse quality automatic monitoring method and system
By automatically monitoring data warehouse change events and combining AI analysis to generate SQL data packets and execute them in an isolated environment, the problem of automatically generating rules and SQL in data quality inspection systems has been solved, achieving efficient and secure data warehouse quality monitoring and adapting to automated quality control of large-scale data warehouses.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING QINGSONG YIKANG INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2026-05-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing data quality inspection systems cannot automatically generate personalized quality inspection rules and SQL, and lack self-verification and self-optimization capabilities, resulting in low rule generation efficiency and poor accuracy, making it difficult to meet the needs of automated quality control in large-scale data warehouse environments.
By automatically monitoring new change events in the data warehouse, acquiring multi-source data and calling AI for matching analysis, generating SQL data packages, and executing them in an isolated or sandbox environment, the system automatically generates revised SQL until acceptance criteria are met, thus achieving automated quality monitoring.
Significantly reduce manual investigation costs, improve the timeliness of change response, ensure that SQL rules are compatible with business needs, safeguard operational security, shorten optimization cycles, improve anomaly handling efficiency, and adapt to the quality control needs of large-scale data warehouses.
Smart Images

Figure CN122240602A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of data warehouse data quality management technology, and in particular to an AI-based automatic monitoring method and system for data warehouse quality. Background Technology
[0002] Data warehouses are the core infrastructure for enterprise data management, and their data quality directly determines the accuracy of business decisions. Traditional data quality management relies on manually formulating detection rules, writing SQL, and connecting to a scheduling system for periodic execution. As the number of data warehouse tables increases and the complexity of the process chain rises, the manual approach has gradually exposed prominent problems such as insufficient rule coverage, low deployment efficiency, and high maintenance costs.
[0003] Currently, mainstream data quality inspection solutions include predefined rule template detection, statistical indicator threshold detection, and anomaly detection model distribution offset identification. Among them, predefined rule template systems have shortcomings such as limited template coverage, lack of semantic understanding, and static rules that cannot self-correct; machine learning-based anomaly detection systems rely on a large amount of historical data, have insufficient coverage of structured quality problems, weak interpretability of results, and require manual configuration and deployment.
[0004] Existing solutions lack automation and reliability in scenarios involving newly launched tables, lack of historical data, and the need for rapid rule implementation. They cannot automatically generate personalized detection rules and SQL, lack self-verification and self-optimization capabilities, and are difficult to adapt to the automated quality control requirements of large-scale data warehouse environments. Summary of the Invention
[0005] In view of this, the present disclosure provides an AI-based method and system for automatic monitoring of data warehouse quality, which can solve the problems of existing data quality detection systems being unable to automatically generate personalized quality detection rules and SQL for newly launched data warehouse tables, lacking self-verification and self-optimization capabilities, resulting in low rule generation efficiency, poor accuracy, complex deployment, and difficulty in meeting the needs of automated quality control in large-scale data warehouse environments.
[0006] In a first aspect, embodiments of this disclosure provide an AI-based automatic monitoring method for data warehouse quality, including: Automatically monitor new change events in the target data warehouse and obtain the object identifier of the change event; The target table corresponding to the object identifier is determined based on a preset association table; Obtain multi-source data corresponding to the target table, wherein the multi-source data includes at least table metadata, output task cycle information, and data lineage information; The AI is invoked to match and analyze the multi-source data with a preset rule base, determine the target fields and their corresponding target rules in the multi-source data, and generate an SQL data package based on a preset output structure and the target rules; wherein, the target field is the field that matches the preset rule base; In an isolated or sandbox environment, execute the SQL statements in the SQL data packet one by one and collect the execution results. If the execution results do not meet the preset acceptance conditions, generate a data correction request form and send it back to AI to generate a corrected version of the SQL based on AI. After replacing the original entries with the revised SQL, re-execute until the corresponding entry passes or the maximum number of iterations is reached, and mark the corresponding entry as a passed entry. The deployment method obtains the quality results after scheduling and execution by using the SQL corresponding to the entry. If the quality results are abnormal, the abnormal information will be automatically identified, and the corresponding strategy will be executed according to the level of abnormality.
[0007] Optionally, obtaining the multi-source data corresponding to the target table includes: Call the metadata service to obtain the table metadata corresponding to the target table. The table metadata includes at least the latest DDL text of the table, the field list, the partition key and partition granularity, and the table-level attributes. Obtain the output task cycle information corresponding to the target table from the task scheduling system. The output task cycle information includes at least the scheduling cycle of the output task, the output partitioning rule, and the upstream task ID it depends on. The data lineage information corresponding to the target table is obtained from the lineage service. The data lineage information includes at least the upstream table identifier and the downstream table identifier of the target table.
[0008] Optionally, the step of calling the AI to perform matching analysis between the multi-source data and a preset rule base, determining the target fields and their corresponding target rules in the multi-source data, and generating an SQL data package based on a preset output structure and the target rules, includes: The target field is identified from the table metadata according to the preset rule base, and the data management rule corresponding to the target field is obtained from the preset rule base; Based on all the target fields and their corresponding data management rules, a standardized quality rule list is generated. Based on the output task cycle information and the data lineage information, partition filtering conditions and data time window constraints are injected into each rule in the standardized quality rule list, and a single rule SQL is generated according to the preset unified output structure. Package all the individual rule SQL statements corresponding to all the rules in the standardized quality rule list into batches to generate an SQL data package that includes at least the rule's unique identifier and the output structure version number.
[0009] Optionally, the step of identifying the target field from the table metadata according to a preset rule base and obtaining the data management rule type corresponding to the target field from the preset rule base includes: Extract basic information from the table metadata, which includes at least table hierarchy, field types, field comments, and table name characteristics; Based on the field type and the table name characteristics, the target field is identified from the preset rule base and marked with a business semantic tag. The business semantic tag includes at least a primary key candidate, a business unique key, an amount field, a date field, and an encoding field. According to the table hierarchy and the field type, a double match is performed in the preset rule base to filter out suitable data management rules, and the data management rules are bound to the corresponding target fields.
[0010] Optionally, based on the output task cycle information and the data lineage information, partition filtering conditions and data time window constraints are injected into each rule in the standardized quality rule list, and a single rule SQL is generated according to a preset unified output structure, including: The partitioning filtering conditions are determined based on the data lineage information corresponding to the target table of each rule in the standardized quality rule list. The partitioning filtering conditions include the partitioning field and partitioning granularity of the data. The data time window constraint is determined based on the output task cycle information corresponding to the target table of each rule in the standardized quality rule list. Based on the partition filtering conditions, the data time window constraints, and the preset unified output structure, generate a single rule SQL corresponding to each rule.
[0011] Optionally, the SQL statements in the SQL data packet are executed one by one in an isolated or sandbox environment, and the execution results are collected. Construct an isolated or sandbox execution environment consistent with the target data warehouse environment, and import the SQL data package; The SQL statements in the SQL data packet are parsed and executed one by one in sequence. The execution status, execution log, execution time, number of bytes scanned, return result set and execution exception information of each SQL statement are collected in real time to form a standardized execution result.
[0012] Optionally, if the execution result does not meet the preset acceptance conditions, a data correction request form is generated and sent back to the AI. Based on the AI, a corrected version of the SQL is generated, including: The execution results are graded and judged based on three preset acceptance conditions of different levels. The three types of preset acceptance conditions at different levels include basic operational compliance, business result validity, and execution performance compliance. If the execution result does not meet any of the three preset acceptance conditions, extract the abnormal SQL statement, execution abnormal information, verification failure items and corresponding acceptance rules, and automatically generate a data correction request form containing a problem description, abnormal details and correction requirements. The data correction request form, along with the corresponding original SQL, execution log, and acceptance rules, are sent back to the AI model to generate a corrected version of the SQL.
[0013] Optionally, the deployment of obtaining the quality result after scheduling execution through the SQL corresponding to the entry includes: Retrieve all the SQL statements corresponding to the passed entries and record them as the target SQL data package. The target SQL data package shall include at least the SQL statements that have passed the verification and the corresponding data quality rule information. According to the preset scheduling configuration, the target SQL data packet is encapsulated into a task definition that the scheduling system can recognize, and the responsible person and online status are written and set through the scheduling system API to obtain the quality results after scheduling execution.
[0014] Optionally, it also includes updating the association information corresponding to the SQL of the entry to the preset rule base.
[0015] Secondly, this disclosure also provides an AI-based automatic monitoring system for data warehouse quality, comprising: The monitoring unit is used to automatically monitor new change events in the target data warehouse and obtain the object identifier of the change event; The target table determination unit is used to determine the target table corresponding to the object identifier based on a preset association table. A multi-source data acquisition unit is used to acquire multi-source data corresponding to the target table. The multi-source data includes at least table metadata, output task cycle information, and data lineage information. The SQL data packet generation unit is used to call AI to perform matching analysis between the multi-source data and the preset rule base, determine the target fields and their corresponding target rules in the multi-source data, and generate an SQL data packet based on the preset output structure and the target rules; wherein, the target field is the field that matches the preset rule base; The correction unit is used to execute the SQL in the SQL data packet line by line in an isolated or sandbox environment and collect the execution results; if the execution results do not meet the preset acceptance conditions, a data correction request form is generated and sent back to the AI to generate a corrected version of the SQL based on the AI. The re-execution unit is used to replace the original entries with the revised SQL and re-execute it until the corresponding entry passes or the maximum number of iterations is reached, and then mark the corresponding entry as a passed entry. The deployment unit is used to deploy the SQL corresponding to the entry to obtain the quality results after scheduling and execution; An anomaly handling unit is used to automatically identify anomaly information and execute corresponding strategies according to the anomaly level if the quality result is abnormal.
[0016] This application discloses an AI-based automatic data warehouse quality monitoring method that automatically monitors newly emerging change events in the data warehouse. It quickly locates the corresponding target table through a pre-set relational table, completing change identification and target locking without manual intervention, significantly reducing manual investigation costs. Secondly, by integrating multi-source data such as table metadata, output task cycles, and data lineage, and combining AI models with a pre-set rule base for matching analysis, it can accurately identify target fields and corresponding rules, automatically generating SQL data packages. Furthermore, SQL execution employs an isolated or sandbox environment, physically isolated from the production data warehouse, effectively preventing abnormal SQL from polluting and damaging online business and production data, ensuring data warehouse operation and maintenance security. For SQL that fails to meet acceptance criteria, AI automatically generates a corrected version and iteratively verifies it, forming a closed loop of generation, execution, correction, and retesting, significantly shortening the SQL optimization cycle and reducing manual correction and debugging costs. By deploying SQL entries, we can achieve routine monitoring of the quality results after scheduling and execution. It can automatically identify abnormal information and execute differentiated processing strategies according to the level of abnormality. At the same time, combined with the deduplication mechanism, it avoids duplicate alarms, improves the pertinence and efficiency of abnormal handling, reduces ineffective operation and maintenance investment, and provides strong support for the stable operation and efficient maintenance of data warehouses. It is suitable for the large-scale and refined quality control needs of various data warehouses.
[0017] The above description is merely an overview of the technical solution disclosed herein. In order to better understand the technical means of this disclosure and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this disclosure more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings. Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 This is a flowchart illustrating an AI-based automatic monitoring method for data warehouse quality provided in an embodiment of this disclosure.
[0020] Figure 2 This is a flowchart illustrating the method for generating SQL data packets provided in this embodiment of the disclosure.
[0021] Figure 3 A flowchart illustrating the method for obtaining target fields and their corresponding data management rules provided in this embodiment of the disclosure.
[0022] Figure 4 This is a flowchart illustrating the method for generating a single rule SQL statement provided in an embodiment of this disclosure.
[0023] Figure 5 A flowchart illustrating the method for collecting execution results provided in this embodiment of the disclosure.
[0024] Figure 6 A flowchart illustrating the method for generating modified SQL provided in this embodiment of the disclosure. Detailed Implementation
[0025] The embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.
[0026] It should be understood that the following specific examples illustrate the implementation of this disclosure, and those skilled in the art can easily understand other advantages and effects of this disclosure from the content disclosed in this specification. Obviously, the described embodiments are only a part of the embodiments of this disclosure, and not all of them. This disclosure can also be implemented or applied through other different specific implementation methods, and the details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this disclosure. It should be noted that, in the absence of conflict, the following embodiments and features in the embodiments can be combined with each other. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0027] It should be noted that various aspects of embodiments within the scope of the appended claims are described below. It will be apparent that the aspects described herein can be embodied in a wide variety of forms, and any particular structure and / or function described herein is merely illustrative. Based on this disclosure, those skilled in the art will understand that one aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number of aspects set forth herein can be used to implement the device and / or practice the method. Additionally, this device and / or method can be implemented using structures and / or functionalities other than one or more of the aspects set forth herein.
[0028] It should also be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of this disclosure. The drawings only show the components related to this disclosure and are not drawn according to the number, shape and size of the components in actual implementation. In actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.
[0029] Furthermore, specific details are provided in the following description to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects can be practiced without these specific details.
[0030] Reference Figure 1 This application discloses an AI-based automatic monitoring method for data warehouse quality, including: S100 automatically monitors new change events in the target data warehouse and obtains the object identifier of the change event; it then determines the target table corresponding to the object identifier based on a preset association table.
[0031] These change events include scheduling system event streams and metadata platform events, specifically new task deployment and task change events in the scheduling system, and new table creation and table structure change (DDL) events in the metadata platform. Object identifiers must include at least the database name, table name, partition key, and task ID.
[0032] S200: Obtain multi-source data corresponding to the target table.
[0033] Among them, multi-source data includes at least table metadata, output task cycle information, and data lineage information.
[0034] S300 calls AI to match and analyze multi-source data with a preset rule base, determines the target fields in the multi-source data and their corresponding target rules, and generates SQL data packets based on the preset output structure and target rules.
[0035] The target field is the field that matches the preset rule base.
[0036] S400 executes the SQL statements in the SQL data packet line by line in an isolated or sandbox environment and collects the execution results.
[0037] S500: If the execution result does not meet the preset acceptance conditions of the target data warehouse, a data correction request form is generated and sent back to AI to generate a corrected version of SQL based on AI. S600 replaces the original entries with the revised SQL and re-executes it until the corresponding entry passes or the maximum number of iterations is reached, then marks the corresponding entry as a passed entry. S700 deployment obtains the quality results after scheduling execution through the SQL corresponding to the entry; S800 automatically identifies abnormal information if the quality results are abnormal and executes the corresponding strategy according to the level of abnormality.
[0038] This application discloses an AI-based automatic data warehouse quality monitoring method that automatically monitors newly emerging change events in the data warehouse. It quickly locates the corresponding target table through a pre-set association table, completing change identification and target locking without manual intervention, significantly reducing manual investigation costs, improving change response timeliness, and avoiding quality risks caused by missed changes. Secondly, by integrating multi-source data such as table metadata, output task cycles, and data lineage, and combining AI models with a pre-set rule base for matching analysis, it can accurately identify target fields and corresponding rules, automatically generating SQL data packages. This replaces the tedious process of manually writing SQL, reducing human error, and ensuring that SQL rules are highly compatible with business needs and data characteristics, improving the accuracy and efficiency of rule formulation. Furthermore, SQL execution uses an isolated or sandbox environment, physically isolated from the production data warehouse, effectively preventing abnormal SQL from polluting and damaging online business and production data, ensuring data warehouse operation and maintenance security. For SQL that does not meet acceptance criteria, AI automatically generates a corrected version and iteratively verifies it, forming a closed loop of generation, execution, correction, and retesting, significantly shortening the SQL optimization cycle and reducing manual correction and debugging costs. By deploying SQL entries, routine monitoring of quality results after scheduling execution is achieved. It can automatically identify abnormal information and execute differentiated processing strategies according to the abnormality level. At the same time, combined with a deduplication mechanism, it avoids duplicate alarms, improves the pertinence and efficiency of abnormal handling, and reduces ineffective operation and maintenance investment. The fully automated closed-loop monitoring and handling not only ensures the accuracy, integrity and consistency of data warehouse data, but also effectively improves the standardization and intelligence of data quality monitoring, providing strong support for the stable operation and efficient maintenance of data warehouses, and adapting to the large-scale and refined quality control needs of various data warehouses.
[0039] Specifically, S200 includes: calling the metadata service to obtain the table metadata corresponding to the target table, which includes at least the latest DDL plaintext, field list (field name, data type, comments), partition key and partition granularity, and table-level attributes; obtaining the output task cycle information corresponding to the target table from the task scheduling system, which includes at least the scheduling cycle of the output task (day / hour, etc.), the output partitioning rule (e.g., dt=business date), and the upstream task ID it depends on; and obtaining the data lineage information corresponding to the target table from the lineage service, which includes at least the upstream table identifier and downstream table identifier of the target table.
[0040] Reference Figure 2 Regarding the S300 method of "calling AI to match and analyze multi-source data with a preset rule base, determining the target fields and their corresponding target rules in the multi-source data, and generating SQL data packets based on the preset output structure and target rules," the specific methods for generating SQL data packets include: S310 identifies the target field from the table metadata based on the preset rule base and obtains the data management rule corresponding to the target field from the preset rule base.
[0041] Further reference Figure 3 The methods for obtaining the S310 target fields and their corresponding data management rules specifically include: S311, Extract basic information from table metadata. The basic information includes at least table hierarchy, field types, field comments, and table name characteristics. S312, Identify the target field from the preset rule base according to the field type and table name characteristics, and mark it with business semantic tags. The business semantic tags include at least primary key candidates, business unique keys, amount fields, date fields, and code fields. S313 performs a double match in the preset rule base according to table hierarchy and field type to filter out suitable data management rules and bind the data management rules to the corresponding target fields.
[0042] Specifically, by combining semantic tags and field annotations, AI automatically binds rules to fields; the data management rules corresponding to the target field include rules for null values, uniqueness, data volume, partition integrity, and format / range validation.
[0043] The preset rule base includes built-in rule templates, semantic mapping dictionaries, verification thresholds, and execution specifications categorized by data warehouse layering, field type, and data quality dimension.
[0044] S320 generates a standardized list of quality rules based on all target fields and their corresponding data management rules.
[0045] Each rule in the standardized quality rule list contains at least several of the following: unique identifier, rule type, target object, indicator definition, threshold, priority, and data window attributes.
[0046] S330, based on the output task cycle information and data lineage information, injects partition filtering conditions and data time window constraints into each rule in the standardized quality rule list, and generates a single rule SQL according to the preset unified output structure.
[0047] S340 packages all individual rule SQL statements corresponding to all rules in the standardized quality rule list into batches, generating an SQL data package that includes at least the rule's unique identifier and the output structure version number.
[0048] Reference Figure 4 The specific methods for generating a single rule SQL statement in S330 include: S331, determine the partition filtering conditions based on the data lineage information corresponding to the target table of each rule in the standardized quality rule list.
[0049] The partitioning filtering conditions include the partitioning field of the data and the partitioning granularity.
[0050] Specifically, each rule in the standardized quality rule list is analyzed, the target table and related data lineage information corresponding to the current rule are extracted, the upstream data source, partition key definition and partition field format of the target table are traced, and the partition field name and partition granularity of the target table, such as day, hour, week, etc., are determined.
[0051] S332, determine the data time window constraint based on the output task cycle information corresponding to the target table of each rule in the standardized quality rule list.
[0052] Specifically, the output task cycle type, output time node, and data backtracking cycle corresponding to the target table are extracted. Combined with the data window attribute of each rule in the rule list, the specific range of the data time window is determined. If the data window attribute in the list explicitly specifies a time range, such as the last 7 days or the last 3 cycles, then the data window attribute is converted into a specific date range based on the output task cycle type. For example, for daily tasks, the last 7 days is 7 days prior to the current date. If the data window attribute is not explicitly specified, the most recent complete cycle of the output task is used by default as the data time window to ensure that the data covers the complete data range of the task output.
[0053] Furthermore, if the task is produced daily, the partition filter conditions are associated with the date field, and a filter statement is generated by combining the partition field, start date, and end date with the data window, or the partition field and target date set, to ensure that the filter conditions are strictly aligned with the task output cycle and the data window.
[0054] S333 generates a single rule SQL corresponding to each rule based on the partition filtering conditions, data time window constraints, and a preset unified output structure.
[0055] Specifically, the partition filtering conditions and data time window constraints are injected into the SQL generation logic corresponding to each rule in the standardized quality rule list; a preset unified output structure is loaded to clarify the fixed fields, field order, and field meanings of each rule SQL, ensuring that the field names, order, and semantics are unbiased and that no arbitrary additions, deletions, or adjustments are allowed, and then the single rule SQL corresponding to each rule is generated.
[0056] Furthermore, the rule id As a fixed field, it is directly populated; dt_or_partition is associated with the target table's partition field; metric value Is obtained through indicator calculation logic such as the number of null values, the number of unique values, and data volume statistics. pass By metrics value With threshold expr The comparison logic is generated, and the detail field has a preset exception prompt template. The SQL's FROM clause is supplemented, along with the WHERE clause containing partition filtering conditions, data window constraints, and rule-related filtering conditions. SQL syntax is optimized, generating a single-rule SQL statement for each rule to ensure compliance with the target data warehouse engine's execution specifications and avoid syntax errors.
[0057] Furthermore, this application also includes: validating the generated single rule SQL, verifying the validity of the partition filtering conditions, the rationality of the data window range, the completeness of the output fields, and the correctness of the field order. If the verification passes, it is determined to be the executable single rule SQL corresponding to the rule; if the verification fails, the association information is re-parsed and the filtering conditions or SQL logic is adjusted.
[0058] Reference Figure 5 The S400 method for "executing SQL statements in SQL data packets one by one in an isolated or sandbox environment and collecting the execution results" specifically includes the following methods for collecting execution results: S410: Build an isolated or sandboxed execution environment consistent with the target data warehouse environment and import the SQL data package; The S420 parses and executes each SQL statement in the SQL data packet sequentially, collecting the execution status, execution log, execution time, number of bytes scanned, return result set, and execution exception information of each SQL statement in real time, forming standardized execution results.
[0059] Reference Figure 6The S500 method of "if the execution result does not meet the preset acceptance conditions corresponding to the target data warehouse, a data correction request form is generated and sent back to AI, and a corrected version of SQL is generated based on AI" specifically includes the following: S510 uses three preset acceptance conditions of different levels to classify and judge the execution results.
[0060] The three categories of pre-set acceptance conditions at different levels include basic operational compliance, business result validity, and execution performance compliance.
[0061] The tiered acceptance and judgment of execution results includes: first, judging the basic operational compliance of the execution results; after passing the basic operational compliance, judging the validity of the business results; and after passing the validity of the business results, judging the execution performance compliance.
[0062] In this embodiment, the basic operational compliance judgment (i.e., the conditions for running the acceptance test) includes: correct SQL syntax and engine dialect, no execution errors, and the return columns containing the rule. id / metric value / is pass The ` / detail` field matches the type, and `dt_or_partition` can be parsed.
[0063] The validity judgment of business results (i.e., whether the acceptance criteria are met) includes: the metrics in the returned results. value The values must conform to the value range requirements of the corresponding rule type, where the null value rate / repetition rate ranges from 0 to 1, the count value is ≥0, and the partition existence value is 0 or 1; and the data cross-consistency check must be satisfied, where the number of duplicate rows is ≤ the total number of rows in the corresponding statistics.
[0064] Performance compliance (i.e. fast running acceptance criteria) includes: SQL execution time and scanned data volume do not exceed the corresponding configuration thresholds.
[0065] S520: If the execution result does not meet any of the three preset acceptance conditions, extract the abnormal SQL statement, execution exception information, verification failure items and corresponding acceptance rules, and automatically generate a data correction request form containing a problem description, exception details and correction requirements.
[0066] The data correction request form includes the abnormal SQL number, the original SQL statement, the execution error code, the error message, three categories of acceptance failures (running successfully / running fast), the target data warehouse version, the environment configuration, and the acceptance criteria.
[0067] S530 sends the data correction request form, along with the corresponding original SQL, execution log, and acceptance rules, back to the AI model and generates a corrected version of the SQL.
[0068] Specifically, the data correction request form, along with the corresponding original SQL, execution log, and acceptance rules, is sent back to the AI model. The AI model then generates a corrected version of the SQL that can pass three types of acceptance conditions based on the exception information, business rules, and the target data warehouse syntax specifications.
[0069] Furthermore, when the AI model generates the revised SQL, it automatically completes syntax repair, return field repair, logic repair, and performance optimization based on the target data warehouse dialect syntax, return field specifications, metric_value retrieval rules, execution performance thresholds, and historical correction samples. After the revised SQL is generated, it automatically performs secondary execution and acceptance verification in an isolated sandbox according to three criteria: able to run, run correctly, and run quickly. If the verification passes, the final usable SQL is output; if the verification fails, the correction process is iteratively executed until the criteria are met.
[0070] This embodiment achieves secure, efficient, and accurate governance of SQL data packets by building an isolated sandbox environment adapted to the target data warehouse and combining it with a three-tiered verification system for basic operational compliance, business result validity, and execution performance compliance, resulting in significant overall benefits.
[0071] Among these features, the isolation or sandbox environment can be physically and logically isolated from the formal data warehouse, restricting high-risk operations and effectively preventing abnormal statements during testing from polluting online business and formal data, thus ensuring the security of data warehouse operations and maintenance. Meanwhile, the three-tiered verification system comprehensively covers syntax engine adaptation, indicator value range constraints, cross-logic consistency, and resource consumption control. Compared to a single verification mode, it can accurately identify various hidden defects and achieve end-to-end risk management.
[0072] In this embodiment, the isolated or sandbox environment is a read-only environment, lacking production data write permissions and the ability to override privileges. It is also configured with restricted resource allocation, supporting only data query SQL execution. Furthermore, the sandbox execution environment interfaces with the target data warehouse, is configured with a dedicated read-only execution account, limits CPU, memory, and concurrency, disables write operations for Data Definition Language and Data Manipulation Language, allows only query SQL execution, and includes an execution timeout protection mechanism.
[0073] In this embodiment, by automatically collecting all SQL execution data, it is possible to clearly distinguish between three types of problems: syntax, logic, and performance, thus clarifying the cause of defects and solving the pain point of vague localization in traditional manual troubleshooting. Simultaneously, for scenarios where verification fails, a standardized correction request form is automatically generated and linked with an AI model to complete targeted repair and performance optimization, forming an automated closed-loop governance process that significantly reduces labor costs and shortens the SQL deployment cycle.
[0074] Specifically, S700 includes: obtaining all SQL statements corresponding to the passed entries and recording them as the target SQL data packet. The target SQL data packet includes at least the verified SQL statements and the corresponding data quality rule information (the corresponding rule). id (Target table, partitioning parameter rules, thresholds, etc.); based on the preset scheduling configuration, the target SQL data packet is encapsulated into a task definition that the scheduling system can recognize, and written through the scheduling system (a platform that automatically runs tasks on a timed basis) API (and the owner and online status are set), and the quality results after scheduling execution are obtained.
[0075] Specifically, S800 includes: obtaining the quality results after daily scheduling execution, filtering the data that has passed deployment, and extracting core fields such as the unique identifier of the quality result, table name, partition identifier, rule ID, indicator value, threshold expression, scheduling execution time and result storage time; based on the obtained quality results, combined with anomaly judgment strategies and such as preset alarm rules, to realize the full-process implementation of automatic anomaly identification, level judgment, alarm push and status write-back.
[0076] The automatic anomaly identification includes: combining preset rule thresholds and optional historical trend analysis to determine the indicator values of each quality record, outputting anomaly identifiers and anomaly levels, such as severe, major, and minor, and generating an anomaly event table, where duplicate keys (table + partition + rule) are removed. id The logic of "+date" is implicitly included in the exception event generation, providing a foundation for subsequent deduplication.
[0077] Based on the anomaly level, corresponding preset alarm policies are executed. Severe anomalies require immediate notification to the preset group and a separate mention of the responsible person; if unresolved within 15 minutes, a second notification is sent and copied to the operations and maintenance group. Major anomalies are notification to the business preset group and the responsible person is mentioned; if unresolved within 30 minutes, a reminder is issued. Minor anomalies are only notification to the business preset group without mentioning any personnel; they are only logged and summarized daily.
[0078] Simultaneously, time-window deduplication can be performed. A fixed time window, such as 1 hour, can be defined, and anomalies with the same deduplication key will only be pushed once within the window to avoid duplicate alarms. When an alarm is pushed, it is associated with the corresponding recipient or group chat according to the anomaly level, and a standardized message body containing the table name, partition, rule ID, indicator value, threshold, anomaly reason and handling suggestions is assembled. The preset interface is then called to complete the push.
[0079] After the push is completed, information such as sending status, push time, recipient, and anomaly level must be written back to the preset alarm record table for subsequent investigation and statistics. Finally, after the anomaly alarm is pushed, the relevant person in charge must update the processing status of the anomaly event table and the preset alarm record table after handling it, forming a complete process of identification, push, handling, and closure.
[0080] Furthermore, this application also includes learning the rule base; specifically, for SQL that has been validated and deployed, the final version of the SQL text and rules are extracted. id The system includes the following steps: Correction rounds and reasons for each correction (e.g., dialect errors, missing partition filtering, performance exceeding limits), thresholds, and execution effects (time consumption, scan volume, whether alarms were triggered). For false positives / false negatives, the system parses the rule ID, table, partition, and suggestions (e.g., thresholds too strict / too lenient, rule caliber deviation). This structured content is then written back to the rule base: adding or updating rule templates and applicable conditions, field semantic mappings, default threshold strategies, SQL snippets, and historical cases. Counterexamples generated during correction (e.g., syntax not supported by a certain engine) are written to a counterexample database for future generation. The rule base includes version numbers or timestamps for incremental updates and traceability, ensuring more accurate rule generation and fewer correction rounds for subsequent similar tables.
[0081] Addressing the issues of insufficient rule coverage, low deployment efficiency, and high maintenance costs resulting from manual rule formulation, SQL writing, and scheduling in traditional data quality management, this application automatically monitors new change events in the data warehouse and locates target tables, eliminating the need for manual investigation of changes and identification of control objects, thus reducing manual intervention costs. It integrates multi-source data such as table metadata, output task cycles, and data lineage to provide comprehensive data support for rule formulation, avoiding the tediousness and omissions of manual data analysis. AI is used to match and analyze multi-source data with a pre-set rule base, automatically identifying target fields, matching target rules, and generating SQL data packages, completely replacing the manual process of writing SQL and formulating rules, significantly improving deployment efficiency. Simultaneously, through AI semantic analysis and multi-source data linkage, the rule coverage is expanded, reducing manual maintenance costs and solving the pain point that manual methods cannot adapt to large-scale data warehouses.
[0082] To address the shortcomings of mainstream predefined rule template detection solutions, such as limited template coverage, lack of semantic understanding, and the inability of static rules to self-correct, this solution achieves breakthroughs through AI empowerment and dynamic optimization. Specifically, AI combines multi-source data (including table metadata and data lineage) for semantic analysis, rather than simply applying fixed templates. It can generate personalized detection rules and SQL based on the business characteristics and data structure of the target table, solving the problems of limited template coverage and lack of semantic understanding. At the same time, it constructs a closed loop of execution, verification, correction, and retesting. If the SQL execution result does not meet the acceptance criteria, a correction request form is automatically generated and sent back to AI. AI then generates a corrected SQL and iteratively verifies it until it passes, achieving dynamic self-correction of rules and SQL, and completely solving the defect that static rules cannot adapt to data changes.
[0083] To address the issues of machine learning-based anomaly detection systems relying on large amounts of historical data, insufficient coverage of structured quality problems, weak interpretability of results, and the need for manual configuration and deployment, this solution optimizes adaptability through multi-source data fusion and AI rule matching. The collected multi-source data does not rely on large amounts of historical data. Even in scenarios with newly launched tables and no historical data, detection rules and SQL can be generated using basic information such as table metadata and data lineage, combined with a pre-defined rule base, thus resolving the dependence on historical data. Based on the pre-defined rule base and the AI-generated SQL, clear rules are formulated for structured quality problems (such as null values, uniqueness, and format compliance), accurately covering structured scenarios. Furthermore, the rules and SQL are traceable and interpretable, addressing the issue of weak interpretability of results. Simultaneously, the entire process requires no manual configuration or deployment; from change identification, rule generation, SQL execution to anomaly handling, everything is automated, solving the problem of cumbersome manual configuration and deployment.
[0084] The basic principles of this disclosure have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this disclosure are merely examples and not limitations, and should not be considered as essential features of each embodiment of this disclosure. Furthermore, the specific details disclosed above are for illustrative and facilitative purposes only, and are not limitations. These details do not limit the scope of this disclosure to the necessity of employing the aforementioned specific details for implementation.
[0085] In this disclosure, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. The block diagrams of devices, apparatuses, devices, and systems involved in this disclosure are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, devices, and systems can be connected, arranged, and configured in any manner. Words such as "comprising," "including," "having," etc., are open-ended terms meaning "including but not limited to," and are used interchangeably with them. The terms "or" and "and" as used herein refer to the terms "and / or," and are used interchangeably with them unless the context clearly indicates otherwise. The term "such as" as used herein refers to the phrase "such as but not limited to," and is used interchangeably with it.
[0086] Additionally, as used herein, the "or" used in a list of items beginning with "at least one" indicates a separate list, such that a list of, for example, "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not imply that the described example is preferred or better than other examples.
[0087] It should also be noted that in the systems and methods of this disclosure, the components or steps can be decomposed and / or recombined. These decompositions and / or recombinations should be considered as equivalent solutions to this disclosure.
[0088] Various changes, substitutions, and modifications can be made to the technology described herein without departing from the teachings defined by the appended claims. Furthermore, the scope of the claims of this disclosure is not limited to the specific aspects of the processes, machines, manufactures, events, means, methods, and actions described above. Currently existing or later-developed processes, machines, manufactures, events, means, methods, or actions that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein can be utilized. Therefore, the appended claims include such processes, machines, manufactures, events, means, methods, or actions within their scope.
[0089] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use this disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of this disclosure. Therefore, this disclosure is not intended to be limited to the aspects shown herein, but rather to be carried out within the widest scope consistent with the principles and novel features disclosed herein.
[0090] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this disclosure to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations therein.
Claims
1. An AI-based automatic monitoring method for data warehouse quality, characterized in that, include: Automatically monitor new change events in the target data warehouse and obtain the object identifier of the change event; The target table corresponding to the object identifier is determined based on a preset association table; Obtain multi-source data corresponding to the target table, wherein the multi-source data includes at least table metadata, output task cycle information, and data lineage information; The AI is invoked to match and analyze the multi-source data with a preset rule base, determine the target fields and their corresponding target rules in the multi-source data, and generate an SQL data package based on a preset output structure and the target rules; wherein, the target field is the field that matches the preset rule base; In an isolated or sandbox environment, execute the SQL statements in the SQL data packet one by one and collect the execution results. If the execution results do not meet the preset acceptance conditions, generate a data correction request form and send it back to AI to generate a corrected version of the SQL based on AI. After replacing the original entries with the revised SQL, re-execute until the corresponding entry passes or the maximum number of iterations is reached, and mark the corresponding entry as a passed entry. The deployment method obtains the quality results after scheduling and execution by using the SQL corresponding to the entry. If the quality results are abnormal, the abnormal information will be automatically identified, and the corresponding strategy will be executed according to the level of abnormality.
2. The AI-based automatic data warehouse quality monitoring method according to claim 1, characterized in that, The process of obtaining the multi-source data corresponding to the target table includes: Call the metadata service to obtain the table metadata corresponding to the target table. The table metadata includes at least the latest DDL text of the table, the field list, the partition key and partition granularity, and the table-level attributes. Obtain the output task cycle information corresponding to the target table from the task scheduling system. The output task cycle information includes at least the scheduling cycle of the output task, the output partitioning rule, and the upstream task ID it depends on. The data lineage information corresponding to the target table is obtained from the lineage service. The data lineage information includes at least the upstream table identifier and the downstream table identifier of the target table.
3. The AI-based automatic monitoring method for data warehouse quality according to claim 2, characterized in that, The AI is invoked to match and analyze the multi-source data with a preset rule base, determine the target fields and their corresponding target rules in the multi-source data, and generate an SQL data package based on a preset output structure and the target rules, including: The target field is identified from the table metadata according to the preset rule base, and the data management rule corresponding to the target field is obtained from the preset rule base; Based on all the target fields and their corresponding data management rules, a standardized quality rule list is generated. Based on the output task cycle information and the data lineage information, partition filtering conditions and data time window constraints are injected into each rule in the standardized quality rule list, and a single rule SQL is generated according to the preset unified output structure. Package all the individual rule SQL statements corresponding to all the rules in the standardized quality rule list into batches to generate an SQL data package that includes at least the rule's unique identifier and the output structure version number.
4. The AI-based automatic data warehouse quality monitoring method according to claim 3, characterized in that, The step of identifying the target field from the table metadata according to a preset rule base and obtaining the data management rule type corresponding to the target field from the preset rule base includes: Extract basic information from the table metadata, which includes at least table hierarchy, field types, field comments, and table name characteristics; Based on the field type and the table name characteristics, the target field is identified from the preset rule base and marked with a business semantic tag. The business semantic tag includes at least a primary key candidate, a business unique key, an amount field, a date field, and an encoding field. According to the table hierarchy and the field type, a double match is performed in the preset rule base to filter out suitable data management rules, and the data management rules are bound to the corresponding target fields.
5. The AI-based automatic data warehouse quality monitoring method according to claim 3, characterized in that, Based on the output task cycle information and the data lineage information, partition filtering conditions and data time window constraints are injected into each rule in the standardized quality rule list, and a single rule SQL is generated according to a preset unified output structure, including: The partitioning filtering conditions are determined based on the data lineage information corresponding to the target table of each rule in the standardized quality rule list. The partitioning filtering conditions include the partitioning field and partitioning granularity of the data. The data time window constraint is determined based on the output task cycle information corresponding to the target table of each rule in the standardized quality rule list. Based on the partition filtering conditions, the data time window constraints, and the preset unified output structure, generate a single rule SQL corresponding to each rule.
6. The AI-based automatic data warehouse quality monitoring method according to claim 1, characterized in that, The SQL statements in the SQL data packet are executed one by one in an isolated or sandbox environment, and the execution results are collected. Construct an isolated or sandbox execution environment consistent with the target data warehouse environment, and import the SQL data package; The SQL statements in the SQL data packet are parsed and executed one by one in sequence. The execution status, execution log, execution time, number of bytes scanned, return result set and execution exception information of each SQL statement are collected in real time to form a standardized execution result.
7. The AI-based automatic monitoring method for data warehouse quality according to claim 6, characterized in that, If the execution result does not meet the preset acceptance conditions, a data correction request form is generated and sent back to the AI. Based on the AI, a corrected version of SQL is generated, including: The execution results are graded and judged based on three types of preset acceptance conditions at different levels. The three types of preset acceptance conditions at different levels include basic operational compliance, business result validity, and execution performance compliance. If the execution result does not meet any of the three preset acceptance conditions, extract the abnormal SQL statement, execution exception information, verification failure items and corresponding acceptance rules, and automatically generate a data correction request form containing a problem description, exception details and correction requirements. The data correction request form, along with the corresponding original SQL, execution log, and acceptance rules, are sent back to the AI model to generate a corrected version of the SQL.
8. The AI-based automatic data warehouse quality monitoring method according to claim 1, characterized in that, The deployment process, which obtains the quality results after scheduling execution through the SQL corresponding to the entry, includes: Retrieve all the SQL statements corresponding to the passed entries and record them as the target SQL data package. The target SQL data package shall include at least the SQL statements that have passed the verification and the corresponding data quality rule information. According to the preset scheduling configuration, the target SQL data packet is encapsulated into a task definition that the scheduling system can recognize, and the responsible person and online status are written and set through the scheduling system API to obtain the quality results after scheduling execution.
9. The AI-based automatic monitoring method for data warehouse quality according to claim 1, characterized in that, Also includes: Update the associated information corresponding to the SQL of the entry to the preset rule base.
10. An AI-based automatic monitoring system for data warehouse quality, characterized in that, include: The monitoring unit is used to automatically monitor new change events in the target data warehouse and obtain the object identifier of the change event; The target table determination unit is used to determine the target table corresponding to the object identifier based on a preset association table. A multi-source data acquisition unit is used to acquire multi-source data corresponding to the target table. The multi-source data includes at least table metadata, output task cycle information, and data lineage information. The SQL data packet generation unit is used to call AI to perform matching analysis between the multi-source data and the preset rule base, determine the target fields and their corresponding target rules in the multi-source data, and generate an SQL data packet based on the preset output structure and the target rules; wherein, the target field is the field that matches the preset rule base; The correction unit is used to execute the SQL in the SQL data packet line by line in an isolated or sandbox environment and collect the execution results; if the execution results do not meet the preset acceptance conditions, a data correction request form is generated and sent back to the AI to generate a corrected version of the SQL based on the AI. The re-execution unit is used to replace the original entries with the revised SQL and re-execute it until the corresponding entry passes or the maximum number of iterations is reached, and then mark the corresponding entry as a passed entry. The deployment unit is used to deploy the SQL corresponding to the entry to obtain the quality results after scheduling and execution; An anomaly handling unit is used to automatically identify anomaly information and execute corresponding strategies according to the anomaly level if the quality result is abnormal.