Data quality monitoring method and system and computer equipment

A data quality monitoring technology, applied in the Internet field, that addresses problems such as the inability to verify data quality in time, the inability to notify business personnel of problematic or erroneous data, and invalid data processing. It achieves monitoring of data integrity, avoids alarm information being buried, and ensures effective data quality monitoring.

Pending Publication Date: 2021-08-27
上海淇馥信息技术有限公司
5 Cites 1 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0006] In order to solve at least one of the following technical problems: the data quality cannot be verified in a timely, effective and accurate manner, the problem data or wrong data cannot be notified to the corresponding business personnel in ...

Method used

Compared with the prior art, the present invention achieves more effective and timely data monitoring by flexibly configuring data quality monitoring tasks, ensures the timeliness of alarm information (or notifications), and effectively ensures data quality monitoring.
Further, by flexibly configuring detection tasks covering tables, field accuracy, null rate, enumeration values, primary key duplication, volatility, and so on, the diversity of monitoring types can be increased, and data integrity, accuracy, and consistency can also be monitored. Through active monitoring and alarming, the detection program can be executed as soon as a monitoring task completes, and the detection results are then promptly sent to the corresponding users. This effectively avoids wasting subsequent operation resources on erroneous or problematic data (or execution errors in downstream dependent tables caused by wrong table results) and ensures the timeliness and reliability of monitoring alarms. By setting alarm priorities, alarm files can be effectively controlled, so that alarm files with high alarm priority are sent more effectively and promptly,...

Abstract

The invention provides a data quality monitoring method, system, and computer equipment for monitoring the data quality of an offline data warehouse. The method comprises the following steps: creating a data quality monitoring task and storing monitoring task information in a document database, the monitoring task information comprising monitoring configuration information and alarm configuration information; submitting the monitoring task to a cluster-based data warehouse so as to monitor data in the data warehouse in an offline state; processing the monitoring result obtained by executing the monitoring task together with the corresponding monitoring task information and then storing them in a relational database; and checking the relational database and raising an alarm according to the detection result, the detection result comprising the monitoring result and the alarm configuration information. By flexibly configuring the data quality monitoring task, more effective and timely data monitoring can be realized, the timeliness of alarm information (or notification) can be ensured, and data quality monitoring can be effectively ensured.

Application Domain

Multi-dimensional databases; Special data processing applications

Technology Topic

Monitoring data; Data monitoring

Examples

Example Embodiment

[0044] Example 1
[0045] Below, an embodiment of the data quality monitoring method of the present invention is described with reference to Figures 1 to 3.
[0046] Figure 1 is a flowchart of an example of the data quality monitoring method of the present invention. As shown in Figure 1, the method includes the following steps.
[0047] Step S101: create a data quality monitoring task and store the monitoring task information in a document database; the monitoring task information includes monitoring configuration information and alarm configuration information.
[0048] Step S102: submit the monitoring task to a cluster-based data warehouse to monitor data in the data warehouse in an offline state.
[0049] Step S103: process the monitoring result obtained by executing the monitoring task together with the corresponding monitoring task information, and then store them in a relational database.
[0050] Step S104: check the relational database and raise an alarm according to the detection result; the detection result includes the monitoring result obtained by executing the monitoring task and the alarm configuration information.
[0051] In this example, the method is used to monitor the data quality of the offline data warehouse. The method of the present invention will be more specifically described below.
[0052] First, in step S101, a data quality monitoring task is created, and the monitoring task information, which includes monitoring configuration information and alarm configuration information, is stored in a document database.
[0053] Specifically, a data quality monitoring task is created, for example using a PySpark program, and the monitoring task information is stored in the document database via parameters. The monitoring task information includes monitoring configuration information, alarm configuration information, and so on; the document database is, for example, a MongoDB database.
[0054] It should be noted that PySpark is the API that Spark provides for Python developers. Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, that originated at the UC Berkeley AMP Lab; in the present invention, it refers to a Spark cluster.
[0055] Specifically, the data quality monitoring task includes monitoring content corresponding to data in a data warehouse in an offline state.
[0056] Further, the monitoring configuration information includes monitoring parameters, monitoring types, monitoring methods, and comparison thresholds. The monitoring parameters include the table name, field name, primary key duplication, data consistency, null rate, accuracy, volatility, and so on. The monitoring type includes the type of detection corresponding to each monitoring parameter, and also covers scheduled monitoring, real-time monitoring, and periodic monitoring at specific time intervals.
[0057] Specifically, the alarm configuration information includes the alarm information corresponding to the data quality monitoring task and the alarm mode; alarm modes include telephone alarms, SMS alarms, email alarms, alarms via other social tools, and so on.
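As a non-limiting illustration, the monitoring task information described in [0053] to [0057] could be represented as a document like the one below. This is a minimal sketch: the schema and all field names (`task_id`, `monitor_config`, `alarm_config`, `params`, and so on) are hypothetical, since the text does not specify them, and a plain dict stands in for the MongoDB collection so that the example runs without a database.

```python
import json


def build_task(task_id, table, monitor_type, threshold, alarm_mode,
               recipients, schedule="daily", **params):
    """Build one monitoring-task document (hypothetical schema)."""
    return {
        "task_id": task_id,
        "monitor_config": {
            "table_name": table,
            "monitor_type": monitor_type,  # e.g. "null_rate", "pk_duplicate"
            "schedule": schedule,          # scheduled / real-time / periodic
            "threshold": threshold,        # comparison threshold
            "params": params,              # type-specific fields
        },
        "alarm_config": {
            "mode": alarm_mode,            # "phone", "sms", "email", ...
            "recipients": recipients,
        },
    }


# A dict keyed by task_id stands in for the document database (assumption).
task_store = {}


def save_task(task):
    # Round-trip through JSON to mimic storing a serialisable document.
    task_store[task["task_id"]] = json.loads(json.dumps(task))


task = build_task("t2", "dw.orders", "pk_duplicate", 1, "sms",
                  ["owner@example.com"], primary_key="order_id")
save_task(task)
```

Keeping the monitoring configuration and the alarm configuration in one document means that a later step can read both with a single lookup by `task_id`.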
[0058] In this way, data quality monitoring tasks can be configured flexibly, and the flexibly configured monitoring tasks can monitor tables, accuracy, null rate, enumeration values, primary key duplication, volatility, and so on. This increases the diversity of monitoring types and also enables monitoring of data integrity, accuracy, and consistency.
[0059] It should be noted that the above is described only as an example and should not be understood as limiting the present invention.
[0060] Next, in step S102, the monitoring task is submitted to a cluster-based data warehouse to monitor data in the data warehouse in an offline state.
[0061] In this example, the data quality monitoring task of step S101 is submitted to a cluster-based data warehouse, wherein the cluster is a Spark cluster.
[0062] Specifically, data in the data warehouse in an offline state is monitored in accordance with the data quality monitoring task (in this example, also referred to as a monitoring task).
[0063] Further, the data includes data tables and specific data.
[0064] Specifically, the data tables include at least one of the following data tables in an offline state: a list, a fact table, a dimension table, a summary table, a wide table, a fusion table, a data model, a transaction flow table, an intermediate table.
[0065] More specifically, the specific data comprises data at different field levels.
[0066] For example, for monitoring task 1, which checks data consistency, what is monitored when monitoring task 1 is executed is the name of the source data table and the name of the target data table, taken from the data in the offline state.
[0067] As another example, for monitoring task 2, which checks primary key duplication, the name and primary key field of the data table to be monitored are monitored when monitoring task 2 is executed.
[0068] As a further example, for a monitoring task that checks volatility, the name and time field of the data table to be monitored are monitored when the monitoring task is executed.
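The three examples in [0066] to [0068] each name the inputs a monitoring type depends on. One way to make that explicit is a small lookup that reports which required parameters a task is missing. The sketch below is illustrative only; the type keys and parameter names (`source_table`, `primary_key`, `time_field`, and so on) are assumptions, not names taken from the text.

```python
# Required inputs per monitoring type, following [0066] to [0068].
# All keys and parameter names are hypothetical.
REQUIRED_PARAMS = {
    "consistency": ("source_table", "target_table"),
    "pk_duplicate": ("table_name", "primary_key"),
    "volatility": ("table_name", "time_field"),
}


def missing_params(monitor_type, params):
    """Return the required parameter names that are absent or empty."""
    required = REQUIRED_PARAMS[monitor_type]
    return [name for name in required if not params.get(name)]
```

For instance, `missing_params("consistency", {"source_table": "a"})` reports that `target_table` is still needed before the task can be submitted.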
[0069] Thus, by executing the monitoring tasks, more efficient and more timely data monitoring can be achieved, data quality monitoring can be effectively ensured, and the waste of subsequent operational resources caused by erroneous or problematic data is effectively avoided.
[0070] It should be noted that the above is described only as an optional example and should not be understood as limiting the present invention.
[0071] Next, in step S103, the monitoring result obtained by executing the monitoring task is processed together with the corresponding monitoring task information and then stored in the relational database.
[0072] In this example, the monitoring task is executed to obtain the corresponding monitoring result.
[0073] Specifically, the obtained monitoring result is dynamically assembled with the corresponding monitoring task information to generate an operation instruction for the relational database.
[0074] Further, the parameters for the assembly processing include the table name, field name, verification indicators, and calculation indicators.
[0075] Specifically, the verification indicators include whether the following are given: the data table name, primary key field, enumeration value field, null-value field, time field, and source and target table names. The calculation indicators include the enumeration value, the null rate, and the volatility of the time field, where volatility is expressed as a year-on-year ratio and a period-over-period ratio.
[0076] For example, the monitoring result obtained by executing monitoring task 1 for data consistency is as follows: the source data table corresponds to a.cnt and the target data table corresponds to b.cnt. The obtained monitoring result is then dynamically assembled with the corresponding monitoring task information to generate an operation instruction for the relational database, for example SQL1: SELECT a.cnt, b.cnt FROM (SELECT COUNT(1) AS cnt FROM source_table_name) AS a CROSS JOIN (SELECT COUNT(1) AS cnt FROM target_table_name) AS b. The operation instruction can then be used for subsequent processing such as verification or calculation.
[0077] As another example, the monitoring result obtained by executing monitoring task 2 for primary key duplication is as follows: the data table name A1 and the primary key field S1. After the obtained monitoring result is dynamically assembled with the corresponding monitoring task information, the following operation instruction is generated, SQL2: SELECT primary_key, COUNT(1) AS cnt FROM table_name GROUP BY primary_key HAVING cnt > K1, where K1 can be set to a suitable value according to the state of the data table.
[0078] As a further example, monitoring task 3 for enumeration values is executed; after dynamic assembly processing, the following operation instruction is generated, SQL3: SELECT DISTINCT enum_key FROM table_name.
[0079] For example, monitoring task 4 is executed; after dynamic assembly processing, the following operation instruction is generated, SQL4: SELECT (a.cnt - b.cnt) / b.cnt FROM (SELECT COUNT(1) AS cnt FROM table_name WHERE period_key = 'this period') a CROSS JOIN (SELECT COUNT(1) AS cnt FROM table_name WHERE period_key = 'the previous period / the same period') b.
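The dynamic assembly of SQL1 to SQL4 can be sketched as template filling. The example below is an illustration under stated assumptions rather than the patented implementation: the template keys, the `assemble_sql` helper, and the sample tables `src` and `dst` are hypothetical, and an in-memory SQLite database stands in for the data warehouse. Two small deviations from the SQL in the text are deliberate: `HAVING COUNT(1) > k` is used instead of the column alias for portability, and the volatility ratio is multiplied by 1.0 because SQLite would otherwise perform integer division.

```python
import sqlite3

# SQL templates modelled on SQL1 to SQL4 (keys are hypothetical).
TEMPLATES = {
    "consistency": (
        "SELECT a.cnt, b.cnt FROM "
        "(SELECT COUNT(1) AS cnt FROM {source_table}) AS a CROSS JOIN "
        "(SELECT COUNT(1) AS cnt FROM {target_table}) AS b"
    ),
    "pk_duplicate": (
        "SELECT {primary_key}, COUNT(1) AS cnt FROM {table_name} "
        "GROUP BY {primary_key} HAVING COUNT(1) > {k}"
    ),
    "enum_values": "SELECT DISTINCT {enum_key} FROM {table_name}",
    "volatility": (
        "SELECT (a.cnt - b.cnt) * 1.0 / b.cnt FROM "
        "(SELECT COUNT(1) AS cnt FROM {table_name} "
        "WHERE {period_key} = '{current}') AS a CROSS JOIN "
        "(SELECT COUNT(1) AS cnt FROM {table_name} "
        "WHERE {period_key} = '{previous}') AS b"
    ),
}


def assemble_sql(monitor_type, **params):
    # Values are interpolated directly, so they must come from trusted
    # task configuration, never from untrusted input.
    return TEMPLATES[monitor_type].format(**params)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INTEGER, dt TEXT);
CREATE TABLE dst (id INTEGER, dt TEXT);
INSERT INTO src VALUES (1,'2021-07'),(2,'2021-07'),(2,'2021-08'),
                       (3,'2021-08'),(4,'2021-08');
INSERT INTO dst VALUES (1,'2021-07'),(2,'2021-07'),(3,'2021-08');
""")

consistency = conn.execute(
    assemble_sql("consistency", source_table="src", target_table="dst")
).fetchone()
duplicates = conn.execute(
    assemble_sql("pk_duplicate", table_name="src", primary_key="id", k=1)
).fetchall()
volatility = conn.execute(
    assemble_sql("volatility", table_name="src", period_key="dt",
                 current="2021-08", previous="2021-07")
).fetchone()[0]
```

With this sample data the row counts of `src` and `dst` differ, the key `2` appears twice in `src`, and the row count grows from two to three between the two periods.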
[0080] It should be noted that the above is described only as an example and should not be understood as limiting the present invention.
[0081] Next, in step S104, the relational database is checked, and an alarm is raised according to the detection result; the detection result includes the monitoring result obtained by executing the monitoring task and the alarm configuration information.
[0082] Specifically, the relational database written in step S103 is checked, for example using a predetermined detection program.
[0083] Optionally, checking the relational database includes a step of detecting all data in the relational database (including newly generated or changed data) in real time or on a schedule, and/or a step of detecting whether execution of the monitoring task in step S102 has failed. See Figure 2 (in which step S104 is split into steps S104 and S201).
[0084] Specifically, when any data in the relational database is added, deleted, replaced, or otherwise modified, detection information is generated in real time and used as part of the detection result for immediate alarm processing; in other words, subsequent operations related to the problematic or erroneous data are ended or stopped immediately. In this way, problematic or erroneous data can be identified more efficiently and in a more timely manner.
[0085] Further, in the step of detecting whether execution of the monitoring task in step S102 has failed, the monitoring task is automatically re-executed when a failure is detected (for example, a failure caused by a cluster resource problem), to make sure that execution succeeds. When the monitoring task is re-executed, the corresponding detection result is still generated, and the alarm processing step automatically notifies the corresponding business personnel. This addresses the problem in the prior art that a monitoring task could not be retried after its failure was detected, and ensures that the monitoring task executes successfully, so that problematic or erroneous data is identified (or discovered) more efficiently and in a more timely manner.
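The automatic re-execution described in [0085] can be illustrated with a simple retry wrapper. This is a sketch under assumptions: the text does not say how many retries are attempted or how long to wait, so `max_attempts` and `delay_seconds` are hypothetical parameters, and `flaky_task` is a made-up task that fails once to simulate a cluster resource problem.

```python
import time


def run_with_retry(task, max_attempts=3, delay_seconds=0.0):
    """Run task(); on an exception, re-run it up to max_attempts in total."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "success", "attempts": attempt,
                    "result": task()}
        except Exception as exc:  # e.g. a cluster resource problem
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: report the failure so an alarm can be raised.
    return {"status": "failed", "attempts": max_attempts,
            "error": str(last_error)}


# Hypothetical task that fails on the first call, then succeeds.
calls = {"n": 0}


def flaky_task():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("cluster resources unavailable")
    return 42


outcome = run_with_retry(flaky_task)
```

Returning a status record in both cases, rather than raising, lets the alarm step notify business personnel whether the retry eventually succeeded or not.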
[0086] Specifically, when the detection result includes a verification indicator or a calculation indicator, the type corresponding to the indicator is determined to be the verification type or the calculation type.
[0087] More specifically, the verification type includes primary key duplication and data consistency, and the calculation type includes the types corresponding to enumeration values, null rate, and volatility.
[0088] For example, in the above example, monitoring task 2 and monitoring task 3 are of the verification type, and monitoring task 1 and monitoring task 4 are of the calculation type.
[0089] Further, an alarm is raised according to the detection result. Specifically, when the detection result contains an indicator of the verification type, the detection result is compared with the preset threshold (that is, the comparison threshold in the monitoring configuration information); when the set threshold is exceeded, the corresponding alarm file is determined and the alarm file is executed.
[0090] When the monitored content contains an indicator of the calculation type, the calculation result is used as the alarm content, and the corresponding alarm file is determined for alarm notification.
[0091] Specifically, the alarm file includes the sending mode, the alarm time, and the users to be notified; the sending mode includes telephone, SMS, email, or other social tools.
[0092] For example, for the calculation processing of data consistency, the value Z1 of a.cnt for the source data table and the value Z2 of b.cnt for the target data table are determined. By determining whether Z1 and Z2 are equal, it is determined whether the data volumes of the source data table and the target data table are consistent, and then whether to raise an alarm.
[0093] As another example, for the primary key duplication check, when the relational database is checked with the assembled SQL, whether primary key duplication exists is determined by checking whether the query result is non-empty, and from that it is further determined whether to raise an alarm.
[0094] As a further example, for the enumeration value check, the enumeration value M1 is calculated and compared with the preset comparison threshold to further determine whether to raise an alarm.
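The three decisions in [0092] to [0094] can be summarised in a single function. The sketch below is illustrative: the function name, the type keys, and the shape of each `result` argument are assumptions, and the enumeration value M1 is treated as a plain number because the text does not define it further.

```python
def should_alarm(monitor_type, result, threshold=None):
    """Decide whether a detection result warrants an alarm (illustrative)."""
    if monitor_type == "consistency":
        z1, z2 = result            # source count Z1, target count Z2
        return z1 != z2            # unequal counts: data is inconsistent
    if monitor_type == "pk_duplicate":
        return len(result) > 0     # non-empty query result: duplicates exist
    if monitor_type == "enum_values":
        return result > threshold  # M1 compared with the comparison threshold
    raise ValueError(f"unknown monitor type: {monitor_type}")
```

For example, `should_alarm("consistency", (5, 3))` is true because the source and target row counts differ, while an empty duplicate query result does not trigger an alarm.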
[0095] Further, the present invention also includes monitoring alarms for accuracy, null rate, volatility, and the like. Through the above detection processing, it is possible to accurately detect that a table or a particular field has a problem, and the subsequent flow can be stopped or ended. For example, when a faulty table causes the data to be doubled, this effectively avoids inaccurate data in dependent tables caused by the wrong data, as well as wasted cluster resources.
[0096] Thus, through active monitoring and alarming, the detection program can be executed immediately when the monitoring task completes, and the detection result is then promptly sent to the corresponding user. This ensures the timeliness of the alarm information (or notification), effectively avoids execution errors in downstream dependent tables caused by wrong table results, and ensures the timeliness and reliability of monitoring alarms.
[0097] Figure 3 is a flowchart of another example of the data quality monitoring method of the first embodiment of the present invention.
[0098] As shown in Figure 3, the data quality monitoring method of the present invention further includes step S304 of determining the alarm priority.
[0099] It should be noted that, because steps S301, S302, S303, and S305 in Figure 3 are the same as steps S101, S102, S103, and S104 in Figure 1, the description of steps S301, S302, S303, and S305 is omitted.
[0100] In step S304, the alarm priority is determined in order to further determine the sending order of the alarm files.
[0101] In this example, the coefficient corresponding to each monitoring parameter is determined according to the monitoring parameters, and the priority of the verification type and the calculation type is determined according to data resource consumption. For example, if the second monitoring task, corresponding to the calculation type, is executed before the first monitoring task, corresponding to the verification type, then the alarm file of the second monitoring task is sent before the alarm file of the first monitoring task, and so on.
[0102] Optionally, the alarm priority is determined based on the data security requirements of the business project, the importance of the business project, and so on.
[0103] It should be noted that the business projects include resource security input projects, resource allocation projects, and the like.
[0104] Specifically, the sending mode corresponding to each alarm priority is set.
[0105] Further, the alarm priorities include a first priority, a second priority, and a third priority, corresponding to telephone, SMS, and email respectively.
[0106] Optionally, the sending order of the alarm files is further determined according to the determined alarm priority.
[0107] Specifically, the alarm file includes the sending mode, the alarm priority, the alarm time, and the users to be notified.
[0108] Further, alarm files are sent based on the determined alarm priority and sending order. Thus, by setting alarm priorities, alarm files can be effectively controlled, alarm files with high alarm priority can be sent more effectively and promptly, and the burying of information caused by excessive alarms can be effectively avoided.
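The priority scheme of [0104] to [0108] can be sketched as a sort over alarm files. This is an illustration under assumptions: the `AlarmFile` class and its field names are hypothetical, the mapping of telephone, SMS, and email to priorities 1, 2, and 3 follows [0105], and breaking ties by alarm time is an added assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime

# First, second, and third priority per [0105].
MODE_PRIORITY = {"phone": 1, "sms": 2, "email": 3}


@dataclass
class AlarmFile:
    mode: str               # sending mode
    alarm_time: datetime
    recipients: list        # users to be notified
    content: str
    priority: int = field(init=False)

    def __post_init__(self):
        # Derive the alarm priority from the sending mode.
        self.priority = MODE_PRIORITY[self.mode]


def send_order(alarms):
    """Order alarm files by priority, then by alarm time (assumed tie-break)."""
    return sorted(alarms, key=lambda a: (a.priority, a.alarm_time))


alarms = [
    AlarmFile("email", datetime(2021, 8, 27, 9, 0), ["a"], "null rate high"),
    AlarmFile("phone", datetime(2021, 8, 27, 9, 5), ["b"], "duplicate keys"),
    AlarmFile("sms", datetime(2021, 8, 27, 9, 1), ["c"], "row count mismatch"),
]
ordered_modes = [a.mode for a in send_order(alarms)]
```

Here the telephone alarm is sent first even though it was raised last, which is the behaviour that keeps high-priority alarms from being buried under lower-priority ones.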
[0109] It should be noted that the above is described only as an example and should not be understood as limiting the present invention.
[0110] The process of the above method is for illustrative purposes only; the order and number of the steps are not particularly limited. Further, a step in the above method can be split into two or three steps, or some steps can be combined into one step, adjusted according to the actual example.
[0111] Compared to the prior art, the present invention enables more efficient, more timely data monitoring, and ensures the timeliness of the alarm information (or notification), and can effectively ensure data quality monitoring.
[0112] Further, by flexibly configuring detection tasks related to accuracy, null rate, enumeration values, primary key duplication, volatility, and so on, the diversity of monitoring types can be increased, and data integrity, accuracy, and consistency can also be monitored. Through active monitoring and alarming, the detection program can be executed immediately when a monitoring task completes, and the detection results are then sent to the corresponding users, which effectively avoids wasting subsequent run resources on erroneous or problematic data (or errors in downstream dependent tables caused by wrong table results) and thereby ensures the timeliness and reliability of monitoring alarms. By setting alarm priorities, alarm files can be effectively controlled, alarm files with high priority can be sent more effectively and promptly, and problems such as information being buried by excessive alarms can be avoided.

Example Embodiment

[0113] Example 2
[0114] The system embodiment of the present invention is described below; the system can be used to perform the method embodiment of the invention. Details described in the system embodiment should be regarded as supplementing the above method embodiment; for details not disclosed in the system embodiment, refer to the above method embodiment.
[0115] Referring to Figures 4, 5, and 6, the present invention also provides a data quality monitoring system 400 for monitoring the data quality of an offline data warehouse. The data quality monitoring system 400 includes: a creation module 401, configured to create a data quality monitoring task and store the monitoring task information, which includes monitoring configuration information and alarm configuration information, in a document database; a monitoring module 402, configured to submit the monitoring task to a cluster-based data warehouse to monitor data in the data warehouse in an offline state; a storage module 403, configured to process the monitoring result obtained by executing the monitoring task together with the corresponding monitoring task information and then store them in a relational database; and an alarm module 404, configured to check the relational database and raise an alarm according to the detection result, which includes the monitoring result obtained by executing the monitoring task and the alarm configuration information.
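How modules 401 to 404 hand data to one another can be shown with a toy wiring. This is a sketch under strong simplifying assumptions, not the system of the invention: a dict stands in for the document database, one in-memory SQLite database stands in for the offline data warehouse and another for the relational database, and the class name, method names, and the `monitor_result` table are hypothetical.

```python
import json
import sqlite3


class DataQualityMonitor:
    """Toy wiring of modules 401 to 404 (all names are hypothetical)."""

    def __init__(self, warehouse):
        self.tasks = {}             # stands in for the document database
        self.warehouse = warehouse  # stands in for the offline warehouse
        self.results = sqlite3.connect(":memory:")  # relational database
        self.results.execute(
            "CREATE TABLE monitor_result "
            "(task_id TEXT, value REAL, threshold REAL, alarm_config TEXT)"
        )

    def create(self, task_id, sql, threshold, alarm_config):  # module 401
        self.tasks[task_id] = {"sql": sql, "threshold": threshold,
                               "alarm_config": alarm_config}

    def monitor(self, task_id):                               # module 402
        sql = self.tasks[task_id]["sql"]
        return self.warehouse.execute(sql).fetchone()[0]

    def store(self, task_id, value):                          # module 403
        task = self.tasks[task_id]
        self.results.execute(
            "INSERT INTO monitor_result VALUES (?, ?, ?, ?)",
            (task_id, value, task["threshold"],
             json.dumps(task["alarm_config"])),
        )

    def alarm(self):                                          # module 404
        rows = self.results.execute(
            "SELECT task_id, value, alarm_config FROM monitor_result "
            "WHERE value > threshold"
        ).fetchall()
        return [(tid, value, json.loads(cfg)["mode"])
                for tid, value, cfg in rows]


warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
CREATE TABLE users (id INTEGER, email TEXT);
INSERT INTO users VALUES (1, 'a@example.com'), (2, NULL),
                         (3, NULL), (4, 'd@example.com');
""")

dq = DataQualityMonitor(warehouse)
# Null-rate check: the fraction of rows whose email is NULL.
dq.create("null_rate_email", "SELECT AVG(email IS NULL) FROM users",
          0.25, {"mode": "sms"})
dq.store("null_rate_email", dq.monitor("null_rate_email"))
alarms = dq.alarm()
```

Storing the threshold and the alarm configuration next to each result row is what lets the alarm step work from the relational database alone, without going back to the task store.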
[0116] As shown in Figure 5, the data quality monitoring system 400 further includes a processing module 501, configured to process the monitoring result obtained by executing the monitoring task together with the corresponding monitoring task information before they are stored in the relational database, wherein the monitoring result is dynamically assembled with the corresponding monitoring task information to generate an operation instruction for the relational database.
[0117] In particular, the parameters of the assembly processing include the table name, field name, verification indicators, and calculation indicators. The verification indicators include whether the following are given: the data table name, primary key field, enumeration value field, null-value field, time field, and source and target table names. The calculation indicators include the enumeration value, the null rate, and the volatility of the time field, where volatility is expressed as a year-on-year ratio and a period-over-period ratio.
[0118] In this example, checking the relational database and raising an alarm according to the detection result includes: when the detection result includes a verification indicator or a calculation indicator, determining the type corresponding to the indicator to be the verification type or the calculation type. The verification type includes primary key duplication and data consistency, and the calculation type includes the types corresponding to enumeration values, null rate, and volatility.
[0119] In particular, raising an alarm according to the detection result includes: when the detection result contains an indicator of the verification type, comparing the detection result with a preset threshold and, when the set threshold is exceeded, determining the corresponding alarm file and executing the alarm file.
[0120] As shown in Figure 6, the data quality monitoring system 400 further includes a determining module 601, configured to determine the corresponding alarm file. When the monitored content contains an indicator of the calculation type, the determining module 601 uses the calculation result as the alarm content and determines the corresponding alarm file for alarm notification.
[0121] Specifically, the alarm file includes the sending mode, the alarm priority, the alarm time, and the users to be notified; the alarm priorities include a first priority, a second priority, and a third priority, corresponding to telephone, SMS, and email respectively.
[0122] Note that in Example 2, descriptions of the parts that are the same as in Example 1 are omitted.
[0123] Those skilled in the art will appreciate that the modules in the above system embodiment can be distributed in the system as described in the embodiment, or can be changed accordingly and located in one or more systems different from the above embodiment. The modules of the above embodiments can be combined into one module or further split into a plurality of sub-modules.
[0124] Compared to the prior art, the present invention enables more efficient, more timely data monitoring, and ensures the timeliness of the alarm information (or notification), and can effectively ensure data quality monitoring.
[0125] Further, by flexibly configuring detection tasks related to accuracy, null rate, enumeration values, primary key duplication, volatility, and so on, the diversity of monitoring types can be increased, and data integrity, accuracy, and consistency can also be monitored. Through active monitoring and alarming, the detection program can be executed immediately when a monitoring task completes, and the detection results are then sent to the corresponding users, which effectively avoids wasting subsequent run resources on erroneous or problematic data (or errors in downstream dependent tables caused by wrong table results) and thereby ensures the timeliness and reliability of monitoring alarms. By setting alarm priorities, alarm files can be effectively controlled, alarm files with high priority can be sent more effectively and promptly, and problems such as information being buried by excessive alarms can be avoided.

Example Embodiment

[0126] Example 3
