[0044] Example 1
[0045] An embodiment of the data quality monitoring method of the present invention is described below with reference to Figures 1 to 3.
[0046] Figure 1 is a flowchart of an example of the data quality monitoring method of the present invention. As shown in Figure 1, the method includes the following steps.
[0047] Step S101: create a data quality monitoring task and store the monitoring task information in a document-type database, where the monitoring task information includes monitoring configuration information and alarm configuration information.
[0048] Step S102: submit the monitoring task to a cluster-based data warehouse to monitor the data in the data warehouse, which is in an offline state.
[0049] Step S103: assemble the monitoring result obtained by the monitoring task with the corresponding monitoring task information, and store the assembled result in a relational database.
[0050] Step S104: detect the relational database and raise an alarm according to the detection result, where the detection result includes the monitoring result obtained by the monitoring task and the alarm configuration information.
[0051] In this example, the method is used to monitor the data quality of the offline data warehouse. The method of the present invention will be more specifically described below.
[0052] First, in step S101, the data quality monitoring task is created and the monitoring task information is stored in the document-type database; the monitoring task information includes monitoring configuration information and alarm configuration information.
[0053] Specifically, a data quality monitoring task is created, for example using a PySpark program, and the monitoring task information is stored in the document-type database by way of parameters. The monitoring task information includes the monitoring configuration information, the alarm configuration information, and the like. The document-type database is, for example, a MongoDB database.
[0054] It should be noted that PySpark is the API that Spark provides for Python developers. Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, that originated at the UC Berkeley AMP Lab; in the present invention, the term refers to a Spark cluster.
[0055] Specifically, the data quality monitoring task includes monitoring content corresponding to the data in the offline data warehouse.
[0056] Further, the monitoring configuration information includes monitoring parameters, monitoring types, monitoring methods, and comparison thresholds. The monitoring parameters include the table name, field name, primary key duplication, data consistency, null rate, accuracy, volatility, and the like. The monitoring types include the type of check corresponding to each monitoring parameter, and also include scheduled monitoring, real-time monitoring, and periodic monitoring at specific time intervals.
[0057] Specifically, the alarm configuration information includes the alarm information corresponding to the data quality monitoring task and the alarm mode; the alarm mode includes telephone alarm, SMS alarm, email alarm, alarms through other social tools, and the like.
[0058] Thereby, data quality monitoring tasks can be configured flexibly, and the flexibly configured tasks can monitor a table for accuracy, null values, enumeration values, duplication, and volatility. This increases the diversity of monitoring types and also enables monitoring of data integrity, accuracy, and consistency.
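As a non-limiting illustration, the monitoring task information of step S101 could be represented as a single JSON-like document before it is written to the document-type database. The sketch below is an assumption, not the claimed implementation: the function name build_task, the field names, the allowed check names, and the example table dw.orders are all hypothetical, and the actual database write is omitted.

```python
import json

# Illustrative check and channel names; none of these are fixed by the method.
ALLOWED_CHECKS = {"consistency", "primary_key_duplication", "enum_values",
                  "null_rate", "volatility"}
ALLOWED_CHANNELS = {"phone", "sms", "email"}


def build_task(task_id, check, params, threshold, schedule, channels, recipients):
    """Return a JSON-serializable task document (all field names are hypothetical)."""
    if check not in ALLOWED_CHECKS:
        raise ValueError(f"unknown check: {check}")
    unknown = set(channels) - ALLOWED_CHANNELS
    if unknown:
        raise ValueError(f"unknown alarm channels: {sorted(unknown)}")
    return {
        "task_id": task_id,
        # Monitoring configuration: what to check, on which table, against what threshold.
        "monitoring": {"check": check, "params": params,
                       "threshold": threshold, "schedule": schedule},
        # Alarm configuration: how and whom to notify.
        "alarm": {"channels": list(channels), "recipients": list(recipients)},
    }


task = build_task(
    task_id="task-2",
    check="primary_key_duplication",
    params={"table_name": "dw.orders", "primary_key": "order_id"},
    threshold=1,
    schedule="daily",
    channels=["sms", "email"],
    recipients=["data-owner@example.com"],
)
print(json.dumps(task, indent=2))
```

Keeping both configurations in one document means a later step can look up the comparison threshold and the alarm mode with a single read keyed by the task identifier.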
[0059] It should be noted that the above is described only as an example and is not to be understood as limiting the present invention.
[0060] Next, in step S102, the monitoring task is submitted to a cluster-based data warehouse to monitor the data in the offline data warehouse.
[0061] In this example, the data quality monitoring task of step S101 is submitted to a cluster-based data warehouse, wherein the cluster is a Spark cluster.
[0062] Specifically, the data in the offline data warehouse is monitored in accordance with the data quality monitoring task (in this example, also referred to as a monitoring task).
[0063] Further, the data includes data tables and specific data.
[0064] Specifically, the data tables include at least one of the following tables in an offline state: a list table, a fact table, a dimension table, a summary table, a wide table, a fusion table, a data model, a transaction flow table, and an intermediate table.
[0065] More specifically, the specific data comprises data at different field levels.
[0066] For example, for monitoring task 1, which checks data consistency, the name of the source data table to be monitored and the name of the target data table, both in the offline data warehouse, are monitored when monitoring task 1 is executed.
[0067] For example, for monitoring task 2, which checks primary key duplication, the name and the primary key field of the data table to be monitored are monitored when monitoring task 2 is executed.
[0068] As another example, for a monitoring task that checks volatility, the name and the time field of the data table to be monitored are monitored when the monitoring task is executed.
[0069] Thus, the monitoring tasks enable more efficient and more timely data monitoring, effectively ensure data quality, and effectively avoid wasting subsequent computing resources on erroneous or problematic data.
[0070] It should be noted that the above is described only as an optional example and is not to be understood as limiting the present invention.
[0071] Next, in step S103, the monitoring result obtained by the monitoring task is assembled with the corresponding monitoring task information and the assembled result is stored in the relational database.
[0072] In this example, the monitoring task is executed to obtain the monitoring result.
[0073] Specifically, the obtained monitoring result is dynamically assembled with the corresponding monitoring task information to generate an operation instruction for the relational database.
[0074] Further, the parameters used in the assembly processing include the table name, the field name, the verification indicators, and the calculation indicators.
[0075] Specifically, the verification indicators include whether the following are given: the data table name, the primary key field, the enumeration value field, the null-value field, the time field, and the source table name and target table name. The calculation indicators include the enumeration values, the null rate, and the volatility over the time field, where the volatility is expressed as the year-over-year ratio and the period-over-period ratio.
[0076] For example, the monitoring result obtained by executing monitoring task 1 for data consistency is as follows: the name of the source data table is a.cnt and the name of the target data table is b.cnt. The obtained monitoring result is then dynamically assembled with the corresponding monitoring task information to generate an operation instruction for the relational database, for example SQL1: SELECT a.cnt, b.cnt FROM (SELECT COUNT(1) AS cnt FROM source_table_name) AS a CROSS JOIN (SELECT COUNT(1) AS cnt FROM target_table_name) AS b. The operation instruction can then be used in subsequent processing such as verification or calculation.
[0077] For example, the monitoring result obtained by executing monitoring task 2 for primary key duplication is as follows: the name A1 of the data table and the primary key field S1. After the obtained monitoring result is dynamically assembled with the corresponding monitoring task information, the following operation instruction SQL2 is generated: SELECT primary_key, COUNT(1) AS cnt FROM table_name GROUP BY primary_key HAVING cnt > K1, where K1 can be set to a value appropriate to the state of the data table.
[0078] As a further example, monitoring task 3 for enumeration values is executed. After dynamic assembly processing, the following operation instruction SQL3 is generated: SELECT DISTINCT enum_key FROM table_name.
[0079] For example, monitoring task 4 is executed. After dynamic assembly processing, the following operation instruction SQL4 is generated: SELECT (a.cnt - b.cnt) / b.cnt FROM (SELECT COUNT(1) AS cnt FROM table_name WHERE period_key = 'current period') a CROSS JOIN (SELECT COUNT(1) AS cnt FROM table_name WHERE period_key = 'previous period / same period') b.
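As a non-limiting illustration, the dynamic assembly of SQL1 to SQL4 from the monitoring task information can be sketched as follows. This is a sketch under stated assumptions rather than the claimed implementation: the function assemble_sql, the parameter keys, and the orders table are hypothetical, and an in-memory SQLite database stands in for the data warehouse. Two deviations from the statements above are deliberate: the HAVING clause repeats COUNT(1) instead of using the alias, for portability across SQL dialects, and the ratio is multiplied by 1.0 to avoid integer division.

```python
import re
import sqlite3

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def ident(name):
    """Accept only plain identifiers, since table and column names cannot be bound."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def assemble_sql(check, p):
    """Return (sql, bound_values) for one task; check names and keys are hypothetical."""
    if check == "consistency":  # SQL1: row counts of source and target
        return (f"SELECT a.cnt, b.cnt FROM "
                f"(SELECT COUNT(1) AS cnt FROM {ident(p['source'])}) AS a CROSS JOIN "
                f"(SELECT COUNT(1) AS cnt FROM {ident(p['target'])}) AS b", ())
    if check == "primary_key_duplication":  # SQL2: keys occurring more than K1 times
        t, k = ident(p["table"]), ident(p["key"])
        return (f"SELECT {k}, COUNT(1) AS cnt FROM {t} "
                f"GROUP BY {k} HAVING COUNT(1) > ?", (p["k1"],))
    if check == "enum_values":  # SQL3: distinct enumeration values
        return (f"SELECT DISTINCT {ident(p['key'])} FROM {ident(p['table'])}", ())
    if check == "volatility":  # SQL4: relative change between two periods
        t, c = ident(p["table"]), ident(p["period_key"])
        return (f"SELECT (a.cnt - b.cnt) * 1.0 / b.cnt FROM "
                f"(SELECT COUNT(1) AS cnt FROM {t} WHERE {c} = ?) AS a CROSS JOIN "
                f"(SELECT COUNT(1) AS cnt FROM {t} WHERE {c} = ?) AS b",
                (p["current"], p["previous"]))
    raise ValueError(f"unknown check: {check}")


# Small stand-in table: order_id 2 is duplicated; period p1 has 2 rows, p2 has 3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, period TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "paid", "p1"), (2, "paid", "p1"), (2, "open", "p2"),
                  (3, "open", "p2"), (4, "paid", "p2")])

sql, args = assemble_sql("primary_key_duplication",
                         {"table": "orders", "key": "order_id", "k1": 1})
duplicates = conn.execute(sql, args).fetchall()

sql, args = assemble_sql("volatility", {"table": "orders", "period_key": "period",
                                        "current": "p2", "previous": "p1"})
change = conn.execute(sql, args).fetchone()[0]
print(duplicates, change)
```

Identifiers are validated and values are bound as parameters because the table and field names come from stored configuration and are interpolated into the statement text.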
[0080] It should be noted that the above is described only as an example and is not to be understood as limiting the present invention.
[0081] Next, in step S104, the relational database is detected and an alarm is raised according to the detection result, where the detection result includes the monitoring result obtained by the monitoring task and the alarm configuration information.
[0082] Specifically, the relational database written in step S103 is detected, for example using a predetermined detection program.
[0083] Optionally, detecting the relational database includes a step of detecting all data in the relational database (including newly generated or changed data) in real time or on a schedule, and/or a step of detecting whether execution of the monitoring task in step S102 has failed. See Figure 2 (in which step S104 is split into steps S104 and S201).
[0084] Specifically, when any data in the relational database is added, deleted, replaced, or otherwise modified, detection information is generated in real time and serves as part of the detection result for immediate alarm processing; in other words, subsequent operations related to the problematic or erroneous data are ended or stopped immediately. Thereby, problematic or erroneous data can be identified more efficiently and in a more timely manner.
[0085] Further, in the step of detecting whether execution of the monitoring task in step S102 has failed, the monitoring task is automatically re-executed when a failure is detected (e.g., a failure caused by a cluster resource problem), to ensure that execution succeeds. When the monitoring task is re-executed, the corresponding detection result is still generated, and the alarm processing step automatically notifies the corresponding business personnel. This addresses the problem in the prior art that a monitoring task cannot be retried after its failure is detected, and ensures that the monitoring task executes successfully, so that problematic or erroneous data is identified (or discovered) more efficiently and in a more timely manner.
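As a non-limiting illustration, the automatic re-execution of a failed monitoring task can be sketched as a retry loop that also emits a notification for every failed attempt. The names run_with_retry and flaky_task are hypothetical, and the bounded number of attempts is an assumption made here so that the sketch terminates; the description above only states that the task is re-executed until it succeeds.

```python
import time


def run_with_retry(task, max_attempts=3, delay_seconds=0.0, notify=print):
    """Run task() until it succeeds or attempts run out; notify on every failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Each failure still produces a detection result for the alarm step.
            notify(f"attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


calls = {"n": 0}


def flaky_task():
    """Stand-in for a monitoring task that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("cluster resources unavailable")
    return {"status": "ok"}


messages = []
result = run_with_retry(flaky_task, notify=messages.append)
print(result, messages)
```

In practice the delay between attempts would be non-zero so that a transient cluster resource shortage has time to clear.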
[0086] Specifically, the detection result includes a verification indicator or a calculation indicator, according to which the verification type or the calculation type is determined.
[0087] More specifically, the verification type includes primary key duplication and data consistency, and the calculation type includes the types corresponding to the enumeration value, the null rate, and the volatility.
[0088] For example, in the above examples, monitoring task 2 and monitoring task 3 are of the verification type, and monitoring task 1 and monitoring task 4 are of the calculation type.
[0089] Further, the alarm is raised according to the detection result. Specifically, when a detection result is present, it is compared with the preset threshold (i.e., the comparison threshold in the monitoring configuration information); when it exceeds the threshold, the corresponding alarm file is determined and the alarm is executed.
[0090] When an indicator of the calculation type is present in the monitored content, the calculation result is used as the alarm content, and the corresponding alarm file is determined for alarm notification.
[0091] Specifically, the alarm file includes the transmission mode, the alarm time, and the user to be notified; the transmission mode includes telephone, SMS, email, or other social tools.
[0092] For example, for the calculation processing of data consistency, the value Z1 of the source data table a.cnt and the value Z2 of the target data table b.cnt are calculated, and whether Z1 and Z2 are equal is checked to determine whether the data volumes of the source data table and the target data table are consistent, and thus whether an alarm is raised.
[0093] For example, for the primary key duplication check, the relational database is queried with the assembled SQL, and whether the query result is non-empty determines whether duplicate primary keys exist, and thus whether an alarm is raised.
[0094] As a further example, for the enumeration value check, the enumeration value M1 is calculated and compared with the preset comparison threshold to determine whether an alarm is raised.
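As a non-limiting illustration, the three alarm decisions just described can be sketched as one function. The function should_alarm and the check names are hypothetical. The description does not define M1 precisely; the sketch assumes that M1 is the number of distinct enumeration values returned by the query.

```python
def should_alarm(check, result, threshold=None):
    """Return True when the result of one check warrants an alarm."""
    if check == "consistency":
        z1, z2 = result              # Z1, Z2: row counts of source and target
        return z1 != z2
    if check == "primary_key_duplication":
        return len(result) > 0       # a non-empty result means duplicated keys exist
    if check == "enum_values":
        m1 = len(result)             # assumption: M1 = number of distinct values
        return m1 > threshold
    raise ValueError(f"unknown check: {check}")


print(should_alarm("consistency", (100, 100)))
print(should_alarm("primary_key_duplication", [(2, 2)]))
print(should_alarm("enum_values", ["open", "paid", "unknown"], threshold=2))
```

The first call reports no alarm because the two counts match; the other two report an alarm because a duplicated key was returned and because three distinct values exceed the threshold of two.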
[0095] Further, the present invention also includes monitoring and alarming for accuracy, null rate, volatility, and the like. Through the above detection processing, it is possible to accurately detect that a table or a particular field has a problem, and the subsequent flow can then be stopped or ended. For example, a fault that causes the data to double can be detected, which effectively avoids inaccurate data in dependent tables and wasted cluster resources caused by the erroneous data.
[0096] Thus, with active monitoring and alarming, the detection program can run immediately after the monitoring task completes and then notify the corresponding user of the detection result. This ensures the timeliness of the alarm information (or notification), effectively avoids execution errors in downstream dependent tables caused by erroneous table results, and thereby ensures the timeliness and reliability of the monitoring alarm.
[0097] Figure 3 is a flowchart of another example of the data quality monitoring method of the first embodiment of the present invention.
[0098] As shown in Figure 3, the data quality monitoring method of the present invention further includes step S304 of determining the alarm priority.
[0099] It should be noted that, because steps S301, S302, S303, and S305 in Figure 3 are substantially the same as steps S101, S102, S103, and S104 in Figure 1, the description of steps S301, S302, S303, and S305 is omitted.
[0100] In step S304, the alarm priority is determined in order to further determine the transmission order of the alarm files.
[0101] In this example, a coefficient corresponding to each monitoring parameter is determined according to the monitoring parameters, and the verification type and the calculation type are ranked according to data resource consumption. For example, given a first monitoring task of the verification type and a second monitoring task of the calculation type, the second monitoring task is executed first, the alarm file of the second monitoring task is sent before the alarm file of the first monitoring task, and so on.
[0102] Optionally, the alarm priority is determined based on the data security requirements of the business project, the importance of the business project, and the like.
[0103] It should be noted that the business projects include resource security input projects, resource allocation projects, and the like.
[0104] Specifically, the transmission mode corresponding to each alarm priority is set.
[0105] Further, the alarm priorities include a first priority, a second priority, and a third priority, corresponding to telephone, SMS, and email, respectively.
[0106] Optionally, the transmission order of the alarm files is further determined according to the determined alarm priority.
[0107] Specifically, the alarm file includes the transmission mode, the alarm priority, the alarm time, and the user to be notified.
[0108] Further, the alarm files are sent based on the determined alarm priority and the corresponding transmission order. Thus, by setting alarm priorities, the sending of alarm files can be effectively controlled, alarm files with a high alarm priority can be sent more efficiently and in a more timely manner, and the burying of information caused by excessive alarms can be effectively avoided.
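As a non-limiting illustration, the transmission order of the alarm files can be derived by sorting on the alarm priority. The class AlarmFile, its field names, and the function send_order are hypothetical. The mapping of telephone, SMS, and email to the first, second, and third priority follows the description above; breaking ties by alarm time is an assumption made here.

```python
from dataclasses import dataclass

# First, second, and third priority correspond to telephone, SMS, and email.
CHANNEL_PRIORITY = {"phone": 1, "sms": 2, "email": 3}


@dataclass(frozen=True)
class AlarmFile:
    channel: str      # transmission mode
    alarm_time: str   # ISO 8601 timestamp, so string order equals time order
    recipient: str    # user to be notified
    message: str

    @property
    def priority(self):
        return CHANNEL_PRIORITY[self.channel]


def send_order(alarms):
    """Order alarm files by priority, then by alarm time (tie-break is an assumption)."""
    return sorted(alarms, key=lambda a: (a.priority, a.alarm_time))


queue = [
    AlarmFile("email", "2024-01-01T08:00:00", "owner", "null rate above threshold"),
    AlarmFile("phone", "2024-01-01T08:05:00", "on-call", "row counts differ"),
    AlarmFile("sms", "2024-01-01T08:01:00", "owner", "duplicate primary keys"),
]
ordered = send_order(queue)
print([a.channel for a in ordered])
```

Sending in this order means that a burst of low-priority email alarms cannot delay a telephone alarm.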
[0109] It should be noted that the above is described only as an example and is not to be understood as limiting the present invention.
[0110] The above method flow is for illustrative purposes only, and the order and number of the steps are not particularly limited. Further, a step in the above method may be split into two, three, or more steps, or several steps may be combined into one step, adjusted according to the actual case.
[0111] Compared with the prior art, the present invention enables more efficient and more timely data monitoring, ensures the timeliness of the alarm information (or notification), and effectively ensures data quality.
[0112] Further, flexibly configuring monitoring tasks for accuracy, null rate, enumeration values, primary key duplication, volatility, and the like increases the diversity of monitoring types and enables monitoring of data integrity, accuracy, and consistency. With active monitoring and alarming, the detection program can run immediately after the monitoring task completes and then notify the corresponding user of the detection result, which effectively avoids wasting subsequent computing resources on erroneous or problematic data (or execution errors in downstream dependent tables caused by erroneous table results) and thereby ensures the timeliness and reliability of the monitoring alarm. By setting alarm priorities, the sending of alarm files can be effectively controlled, alarm files with a high priority can be sent more effectively and in a more timely manner, and problems such as information being buried by excessive alarms can be avoided.