Data quality detection method and device and storage medium
A data quality detection technology, applied in the field of data processing, that addresses the problems of detection efficiency being difficult to guarantee, limited test case coverage, and missed detection of abnormal data. It improves coverage and accuracy, reduces labor and time costs, and improves detection efficiency.
Pending Publication Date: 2019-04-19
PING AN TECH (SHENZHEN) CO LTD
0 Cites 10 Cited by
AI-Extracted Technical Summary
Problems solved by technology
Because this method involves human participation and the coverage rate of test cases is limited, there is a possibility of missed detection of abnor...
Method used
When the acquisition module 110 classifies the data to be detected that belong to the same field into one data group to be detected, the matching module 120 is also used to uniformly match detection rules for the data in the same data group according to the data characteristics of that group. For example, when all the input variables of a certain neural network model are used as the data to be detected, the different values of the same input variable form one data group to be detected. Assuming that the data type of the input variable is String, the detection rules matched by the data group may include rules for detecting the proportion of zero values, the average value, the standard deviation, the maximum value, the minimum value, and the like. Grouping the data to be detected in the same field into one data group increases the speed of matching detection rules, thereby improving the efficiency of data quality detection.
[0056] The matching module 120 is configured to match at least one detection rule for each piece of data to be detected in a pre-established detection rule library according to preset matching rules. In one embodiment, the detection rul...
Abstract
The invention relates to big data technology, and provides a data quality detection method and device and a computer-readable storage medium. The method comprises the steps of acquiring at least one piece of to-be-detected data from a data source, wherein each piece of to-be-detected data comprises content data and metadata; matching at least one detection rule for each piece of to-be-detected data in a pre-established detection rule base according to a preset matching rule; and detecting the to-be-detected data by using the matched detection rule to obtain a data quality detection result. According to the invention, data quality detection can be automated, the data quality detection efficiency is improved, and the labor and time costs are reduced.
Application Domain
Digital data information retrieval; Software testing/debugging +1
Technology Topic
Metadata; Automation +4
Example Embodiment
[0035] In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with several drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, but not to limit the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
[0036] The invention provides an electronic device. Referring to Figure 1, a schematic diagram of an embodiment of the electronic device 1 of the present invention is shown. In this embodiment, the electronic device 1 may be a terminal device with storage and computing functions, such as a server, a smart phone, a tablet computer, a portable computer, a desktop computer, and the like. In an embodiment, when the electronic device 1 is a server, the server may be one or more of a rack server, a blade server, a tower server, or a cabinet server.
[0037] The electronic device 1 includes a memory 11, a processor 12 and a network interface 13.
[0038] The memory 11 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the readable storage medium may also be an external memory of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
[0039] In this embodiment, the readable storage medium of the memory 11 is generally used to store the acquired data to be detected, the pre-established detection rule library, the data quality detection program 10 and so on. The memory 11 can also be used to temporarily store data that has been output or will be output.
[0040] The processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run the program code or process the data stored in the memory 11, for example, to execute the data quality detection program 10.
[0041] The network interface 13 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface). It is usually used to establish a communication connection between the electronic device 1 and other electronic devices or systems, such as establishing a communication connection with a data source.
[0042] Figure 1 shows only the electronic device 1 having the components 11-13 and the data quality detection program 10, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
[0043] Optionally, the electronic device 1 may also include an input unit such as a keyboard (Keyboard), a voice input device such as a microphone (Microphone) and other devices with a voice recognition function, and a voice output device such as a speaker, earphone, etc. Wherein, the input unit can be used to add a new data detection rule to the pre-established detection rule library.
[0044] Optionally, the electronic device 1 may also include a display, which may also be called a display screen or a display unit. In some embodiments, it may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) display, and the like. The display is used for displaying the information processed in the electronic device 1 and for displaying a visualized user interface, for example, displaying the data quality inspection result.
[0045] The electronic device 1 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which will not be repeated here.
[0046] In the foregoing embodiment, the processor 12 may implement the following steps when executing the data quality detection program 10 stored in the memory 11:
[0047] Obtaining step: obtaining at least one piece of data to be detected from the data source, where each piece of data to be detected includes content data and metadata;
[0048] Matching step: matching at least one detection rule for each piece of data to be detected in a pre-established detection rule library according to a preset matching rule; and
[0049] Detection step: detecting the data to be detected using the matching detection rule to obtain the data quality detection result.
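The three steps above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the function names `acquire`, `match`, and `detect`, the dictionary layout of the rules, and the two sample rules are hypothetical and are not taken from the disclosure.

```python
def acquire(source):
    """Obtaining step: return (content, metadata) pairs from a data source."""
    return list(source)


def match(metadata, rule_library):
    """Matching step: select the rules tagged with the data type in the metadata."""
    return [rule for rule in rule_library if metadata["data_type"] in rule["tags"]]


def detect(content, rules):
    """Detection step: apply each matched rule and report pass (True) or fail (False)."""
    return {rule["name"]: rule["check"](content) for rule in rules}


# Hypothetical rule library: each rule carries a name, type tags, and a check.
RULES = [
    {"name": "non_negative", "tags": {"Bigint", "Double"}, "check": lambda v: v >= 0},
    {"name": "max_length_18", "tags": {"String"}, "check": lambda v: len(v) <= 18},
]

# Hypothetical data source: each item is content data plus its metadata.
source = [
    (-3, {"field": "age", "data_type": "Bigint"}),
    ("x" * 18, {"field": "id_number", "data_type": "String"}),
]

results = [detect(content, match(meta, RULES)) for content, meta in acquire(source)]
print(results)
```

In this sketch the negative age fails its rule while the 18-character identifier passes, which mirrors the kinds of abnormality the embodiments describe.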
[0050] For a detailed description of the above steps, please refer to the description below of Figure 2, the program module diagram of an embodiment of the data quality detection program 10, and of Figure 3, the flowchart of the first embodiment of the data quality detection method.
[0051] In other embodiments, the data quality detection program 10 may be divided into multiple modules, and the multiple modules are stored in the memory 11 and executed by the processor 12 to complete the present invention. A module, as referred to in the present invention, is a series of computer program instruction segments capable of completing a specific function.
[0052] Referring to Figure 2, a program module diagram of an embodiment of the data quality detection program 10 of Figure 1 is shown. In this embodiment, the data quality detection program 10 can be divided into: an acquisition module 110, a matching module 120, and a detection module 130.
[0053] The acquisition module 110 is configured to acquire at least one piece of data to be detected from a data source. The data source includes a local database and an external database. Generally, the acquisition module 110 obtains the data to be detected from the data source in batches. For example, it obtains all the input variables or sample data of a certain neural network model as the data to be detected. As another example, it obtains the real-time data of a data interface by monitoring that interface. As a further example, it receives a table name input by the user and obtains the data table corresponding to that table name. The acquisition module 110 may also receive the partition selected by the user for detection, obtain the partition table corresponding to the partition, and use it as the data to be detected. A partition table is a subset of a data table; that is, a data table can be divided into multiple partitions, and each partition is a partition table. In short, the acquisition module 110 can acquire at least one piece of data to be detected, including content data and metadata, from any data source. The metadata is data used to describe the content data; it mainly describes data properties and supports functions such as indicating storage location, historical data, resource search, and file recording. In an embodiment, the metadata includes information such as the importance of the data to be detected, the default value, the timestamp, the field to which it belongs, and the data type of the field. The data types include but are not limited to Boolean type (Boolean), string type (String), date type (Datetime), double-precision floating-point type (Double), integer type (Bigint), and precise value type (Decimal). The importance levels include high, medium, and low. The default value is the expected or reasonable value of each piece of data to be detected, and is usually used for numerical data. The timestamp is used to prove that a piece of data to be detected already existed, complete and verifiable, before a certain time. It is usually a character sequence that uniquely identifies a moment in time.
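One way to picture a piece of data to be detected is as content data paired with a metadata record. The following is a minimal sketch, assuming Python dataclasses; the class names `Importance`, `Metadata`, and `Record`, and the sample field `body_temp`, are hypothetical and only mirror the metadata items listed in the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Optional


class Importance(Enum):
    """Importance levels named in the description."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Metadata:
    field: str                            # field the value belongs to
    data_type: str                        # e.g. "Boolean", "String", "Double", "Bigint"
    importance: Importance
    default_value: Optional[Any] = None   # expected or reasonable value, if any
    timestamp: Optional[datetime] = None  # when the data is known to have existed


@dataclass(frozen=True)
class Record:
    content: Any        # the content data itself
    metadata: Metadata  # data describing the content data


# Hypothetical example record.
record = Record(
    content=37.5,
    metadata=Metadata(
        field="body_temp",
        data_type="Double",
        importance=Importance.HIGH,
        default_value=36.8,
    ),
)
print(record.metadata.field, record.metadata.data_type)
```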
[0054] The acquisition module 110 is further configured to determine the acquisition frequency of the data to be detected according to the metadata, and to classify the data to be detected that belong to the same field into one data group to be detected. For example, assuming that the input variables of a certain neural network model are the data to be detected and the importance of an input variable is high, the acquisition module 110 monitors the data interface of the neural network model in real time, takes the real-time data input to the neural network model as the data to be detected, and classifies the data belonging to the same field into one data group to be detected. Similarly, for candidate variables whose importance is medium or low, the acquisition module 110 may acquire historical data from the data interface of the neural network model at a preset acquisition frequency according to actual needs.
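The two behaviors described here, choosing an acquisition frequency from the importance level and grouping data by field, can be sketched as follows. The polling intervals are assumed values chosen for illustration (the description leaves them to actual needs), and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed polling intervals in seconds; None stands for real-time monitoring.
POLL_INTERVAL_BY_IMPORTANCE = {"high": None, "medium": 3600, "low": 86400}


def acquisition_interval(importance):
    """Return the polling interval for an importance level (None = real time)."""
    return POLL_INTERVAL_BY_IMPORTANCE[importance]


def group_by_field(records):
    """Group (field, value) pairs into one to-be-detected data group per field."""
    groups = defaultdict(list)
    for field, value in records:
        groups[field].append(value)
    return dict(groups)


records = [("age", 31), ("income", 5200.0), ("age", 45), ("income", 0.0)]
groups = group_by_field(records)
print(groups)
```

Grouping first means that rule matching can later be done once per field rather than once per value.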
[0055] In an embodiment, the acquisition module 110 is further configured to obtain authoritative data corresponding to the data to be detected from officially released data. The officially released data includes data released by relevant government departments and data released by data sources widely recognized by relevant users. Data released by government departments is usually treated by default as the highest-quality authoritative data. For enterprise data, the data released on the official website of the company is authoritative.
[0056] The matching module 120 is configured to match at least one detection rule for each piece of data to be detected in a pre-established detection rule library according to a preset matching rule. In one embodiment, the detection rule is a parallelized detection rule based on MapReduce, and the electronic device 1 improves the efficiency of data quality detection by parallelizing the detection rule. Among them, each detection rule includes rule name, rule description and expected result. It is understandable that for a detection rule that does not have a clear expected result or does not require a clear expected result, the corresponding expected result can be N/A. For example, suppose the data to be detected is data representing the high or low water level. Generally, there is no clear requirement for the value of the water level data. Therefore, the expected result of the detection rule matching the water level numerical data can be set to N/A.
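A detection rule with a name, a description, and an optional expected result could be represented as below. This is a sketch under assumptions: the `DetectionRule` class, its `compute` and `expected` fields, and the 0.5 threshold are hypothetical, and an expected result of `None` is used here to model the N/A case.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class DetectionRule:
    name: str
    description: str
    compute: Callable[[Sequence[float]], float]         # statistic to calculate
    expected: Optional[Callable[[float], bool]] = None  # None models "N/A"

    def check(self, values):
        """Return (result, passed); passed is None when no expectation is defined."""
        result = self.compute(values)
        passed = None if self.expected is None else self.expected(result)
        return result, passed


zero_rule = DetectionRule(
    name="zero_value_ratio",
    description="Proportion of zero values must not exceed 50% (assumed threshold)",
    compute=lambda vs: sum(1 for v in vs if v == 0) / len(vs),
    expected=lambda ratio: ratio <= 0.5,
)

# Water level data has no clear requirement, so the expected result is N/A.
level_rule = DetectionRule(
    name="water_level_max",
    description="Maximum observed water level (expected result: N/A)",
    compute=max,
)

print(zero_rule.check([0, 1, 2, 0]))
print(level_rule.check([1.5, 3.2]))
```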
[0057] When the acquisition module 110 classifies the to-be-detected data of the same field into one to-be-detected data group, the matching module 120 is also used to uniformly match detection rules for the to-be-detected data in the same data group according to the data characteristics of that group. For example, when all input variables of a neural network model are used as the data to be detected, the different values of the same input variable are regarded as one data group to be detected. Assuming that the data type of the input variable is String, the detection rules matched by the data group to be detected may include rules for detecting the proportion of zero values, the average value, the standard deviation, the maximum value, the minimum value, and the like. By classifying the to-be-detected data in the same field into one to-be-detected data group, the speed of matching detection rules for the to-be-detected data can be improved, thereby improving the efficiency of data quality detection.
[0058] The preset matching rules may be obtained by performing statistical analysis in advance on test cases of commonly used data quality detection. For example, it is possible to count which detection rules are needed for which data types in the data quality detection process, and then add labels of the corresponding data types to the detection rules. Therefore, after a piece of data to be detected is obtained, all detection rules carrying the label of its data type can be found in the pre-established detection rule library and used as the detection rules matching that piece of data.
[0059] The preset matching rules can also be customized by developers according to actual needs. For example, developers can set a matching rule such as: the input variables of a neural network model match the detection rules for data missing rate, zero value proportion, standard deviation, and so on. The label of the model variables of the neural network model is then added to the detection rules for data missing rate, zero value proportion, and standard deviation. Therefore, after the acquisition module 110 acquires the input variables of the neural network model, the matching module 120 automatically finds the detection rules for data missing rate, zero value proportion, standard deviation, and so on, and uses them as the detection rules that match the input variables of the neural network model.
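Label-based matching of this kind can be sketched with a mapping from rule names to label sets. The rule names, the `model_variable` label, and the `match_rules` function are hypothetical; the sketch only shows that data-type labels and developer-defined labels can be handled by the same lookup.

```python
# Hypothetical rule library: rule name -> set of labels attached to the rule.
RULE_LIBRARY = {
    "missing_rate": {"Double", "Bigint", "String", "model_variable"},
    "zero_value_ratio": {"Double", "Bigint", "model_variable"},
    "standard_deviation": {"Double", "Bigint", "model_variable"},
    "max_length": {"String"},
}


def match_rules(labels, library=RULE_LIBRARY):
    """Return the names of all rules carrying at least one of the given labels."""
    wanted = set(labels)
    return sorted(name for name, rule_labels in library.items() if rule_labels & wanted)


print(match_rules(["String"]))          # matching by data type
print(match_rules(["model_variable"]))  # matching by a developer-defined label
```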
[0060] The detection module 130 is configured to perform quality detection on each piece of data to be detected by using the matched detection rules to obtain a data quality detection result. Specifically, for each piece of data to be detected or each data group to be detected, the content data is checked with the matched detection rules and compared with the expected results of those rules. If every item of content data meets the expected results, the piece of data or data group to be detected is not abnormal; otherwise, it is abnormal. For example, if the data type of a piece of data to be detected is defined as Double in the metadata but the data type of the acquired data is Bigint, or if the length of the data to be detected is greater than the expected result (for example, an ID number longer than 18 characters), the data quality detection result is abnormal, and the detection module 130 may record the data quality detection result and specific description information.
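The two example checks in this paragraph, a declared type that does not match the acquired value and a length beyond the expected maximum, can be sketched as below. The mapping from the declared type names to Python types and the `detect` function are assumptions made for illustration.

```python
# Assumed mapping from the declared data types to Python types.
PYTHON_TYPES = {"Double": float, "Bigint": int, "String": str, "Boolean": bool}


def detect(value, declared_type, max_length=None):
    """Return a list of problem descriptions; an empty list means no abnormality."""
    problems = []
    if type(value) is not PYTHON_TYPES[declared_type]:
        problems.append(
            f"type mismatch: declared {declared_type}, got {type(value).__name__}"
        )
    if max_length is not None and len(str(value)) > max_length:
        problems.append(f"length {len(str(value))} exceeds {max_length}")
    return problems


print(detect(42, "Double"))                       # integer where Double is declared
print(detect("1" * 19, "String", max_length=18))  # identifier longer than 18 characters
print(detect(3.14, "Double"))                     # no abnormality
```

Returning descriptions rather than a bare flag corresponds to recording the detection result together with specific description information.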
[0061] When the acquisition module 110 obtains the authoritative data corresponding to the data to be detected from officially released data, the detection module 130 is also used to compare the data to be detected with the authoritative data to obtain a data comparison detection result, and to combine the data quality detection result with the data comparison detection result to obtain a comprehensive evaluation result of the data quality. The officially released data includes data released by relevant government departments and data released by data sources widely recognized by relevant users. Data released by government departments is usually treated by default as the highest-quality authoritative data. For enterprise data, the data released on the official website of the company is authoritative. For example, when the data to be detected is motor vehicle violation data, the data released by the traffic safety service management platform is the authoritative data. When an employer investigates an employee's traffic violations, the violation data entered by the employee is used as the data to be detected, and the data retrieved from the local government's traffic safety service management platform based on the employee's ID number or motor vehicle driver's license number is used as the authoritative data. The data to be detected is compared with the authoritative data, and the consistency of the data is verified according to the resulting data comparison detection result. Combining the data quality detection result with the data comparison detection result makes it possible to judge more accurately whether the data to be detected is abnormal, and yields the comprehensive evaluation result of its data quality.
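Comparison against authoritative data and the combination of the two results can be sketched as follows. The function names, the field names `violations` and `points`, and the rule that any problem makes the overall result abnormal are assumptions; the description does not specify how the two results are combined.

```python
def compare_with_authority(submitted, authoritative):
    """Return the fields whose submitted value differs from the authoritative value."""
    return sorted(
        key for key in authoritative if submitted.get(key) != authoritative[key]
    )


def comprehensive_result(rule_problems, mismatched_fields):
    """Combine rule-based problems and comparison mismatches (assumed: any issue is abnormal)."""
    return "abnormal" if rule_problems or mismatched_fields else "normal"


# Hypothetical violation data entered by an employee vs. data from the official platform.
submitted = {"violations": 0, "points": 0}
authoritative = {"violations": 2, "points": 6}

mismatches = compare_with_authority(submitted, authoritative)
print(mismatches, comprehensive_result([], mismatches))
```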
[0062] In one embodiment, an alarm threshold can be set for data quality detection. For example, when the missing rate of the data to be detected is higher than 30%, a warning is issued for abnormal problems in a preset manner. The preset method includes highlighting the abnormal problem in the data quality inspection result and sending the abnormal problem to the inspector in the form of email or short message.
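The alarm threshold on the missing rate can be sketched as below, assuming that a missing value is represented by `None`. The function names are hypothetical, and the notification itself (highlighting, email, or short message) is left out.

```python
def missing_rate(values):
    """Fraction of values that are missing (None); 0.0 for an empty group."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def check_alarm(values, threshold=0.30):
    """Return (alarm, rate); the alarm fires when the rate is higher than the threshold."""
    rate = missing_rate(values)
    return rate > threshold, rate


print(check_alarm([1, None, 3, None, 5]))  # 2 of 5 missing
print(check_alarm([1, 2, 3, None]))        # 1 of 4 missing
```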
[0063] In addition, the present invention also provides a data quality detection method. Referring to Figure 3, a flowchart of the first embodiment of the data quality detection method of the present invention is shown. The processor 12 of the electronic device 1 implements the following steps of the data quality detection method when executing the data quality detection program 10 stored in the memory 11:
[0064] In step S300, the acquisition module 110 obtains at least one piece of data to be detected from the data source. Each piece of data to be detected includes content data and metadata. In an embodiment, the metadata includes one or more of the importance of a piece of data to be detected, a default value, a timestamp, the field to which it belongs, and the data type of the field. When the metadata of the data to be detected includes information about its importance: if the importance is high, the acquisition module 110 monitors the data source in real time and obtains real-time data from the data interface of the data source; if the importance is medium, the frequency of obtaining data from the data source is set according to actual needs; if the importance is low, the frequency of obtaining data from the data source can reasonably be reduced according to the specific situation, so as to reduce the impact of data quality detection on the performance of related systems.
[0065] Step S301: The matching module 120 matches at least one detection rule for each piece of data to be detected in a pre-established detection rule library according to a preset matching rule. Specifically, the detection rules are parallelized detection rules based on MapReduce, and the efficiency of data quality detection can be improved by processing the detection rules in parallel.
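The map-and-reduce structure of a parallelized rule can be illustrated with a single-process stand-in. This sketch does not use a MapReduce framework; it uses a thread pool and `functools.reduce` only to show how a zero-value-proportion rule splits into a per-partition map phase and a combining reduce phase. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(values):
    """Map phase: summarize one partition as (zero_count, total_count)."""
    return sum(1 for v in values if v == 0), len(values)


def combine(a, b):
    """Reduce phase: merge two partial summaries."""
    return a[0] + b[0], a[1] + b[1]


def zero_value_ratio(partitions):
    """Compute the overall proportion of zero values across all partitions."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    zeros, total = reduce(combine, partials, (0, 0))
    return zeros / total if total else 0.0


partitions = [[0, 1, 2], [0, 0, 5], [7, 8]]
print(zero_value_ratio(partitions))
```

Because each partition is summarized independently and the summaries merge associatively, the same rule could be distributed across workers without changing its result.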
[0066] For example, the preset matching rules may be obtained by performing statistical analysis in advance on test cases of commonly used data quality detection. It is possible to count which detection rules are needed for which data types in the data quality detection process, and then add labels of the corresponding data types to the detection rules. Therefore, after a piece of data to be detected is obtained, all detection rules carrying the label of its data type can be found in the pre-established detection rule library and used as the detection rules matching that piece of data. The matching rules can also be obtained by other methods, such as setting a mapping relationship, which will not be repeated here.
[0067] In this embodiment, the detection rules in the pre-established detection rule library can be flexibly added or deleted. When a new data detection rule needs to be added, the script code of the new detection rule can be added according to the principle of maintaining the independence of the existing detection rule.
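A rule library in which rules can be added or deleted without touching the others can be sketched as a simple registry. The `RuleLibrary` class and its methods are hypothetical; rejecting duplicate names is one assumed way of keeping existing rules independent of new ones.

```python
class RuleLibrary:
    """Minimal registry of independent detection rules keyed by name."""

    def __init__(self):
        self._rules = {}

    def add(self, name, func):
        """Register a new rule; existing rules are never overwritten."""
        if name in self._rules:
            raise ValueError(f"rule {name!r} already exists")
        self._rules[name] = func

    def remove(self, name):
        """Delete a rule if present; other rules are unaffected."""
        self._rules.pop(name, None)

    def names(self):
        return sorted(self._rules)

    def run(self, name, values):
        return self._rules[name](values)


library = RuleLibrary()
library.add("maximum", max)
library.add("minimum", min)
library.remove("minimum")
print(library.names(), library.run("maximum", [3, 9, 4]))
```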
[0068] Step S302: The detection module 130 uses the matched detection rules to perform quality detection on the data to be detected and obtains a data quality detection result. Specifically, for each piece of data to be detected, at least one detection result can be obtained according to the at least one detection rule that matches the piece of data, and each detection result is compared with the expected result of the matching detection rule. If every detection result meets the expected result, the piece of data to be detected is not abnormal; otherwise, it is abnormal. For example, the data quality detection result is abnormal if data that should not be negative (such as age or amount) has negative values, if the length of the data to be detected is greater than the expected result (for example, an ID number longer than 18 characters), if the distribution of the data to be detected is abnormal, or if the data type of the data to be detected does not match the data type defined in the metadata. The detection module 130 may record the data quality detection result and specific description information, including the storage location of the data to be detected.
[0069] Referring to Figure 4, a flowchart of the second embodiment of the data quality detection method of the present invention is shown. In this embodiment, the data to be detected that belong to the same field are classified into one data group to be detected, and the specific steps are as follows:
[0070] In step S400, the acquisition module 110 obtains at least one piece of data to be detected from the data source and classifies the data to be detected belonging to the same field into one data group to be detected. In this embodiment, each piece of data to be detected includes content data and metadata, and the metadata includes information such as the field of the data to be detected, the data type of the field, the importance of the data to be detected, the default value, and the timestamp.
[0071] Step S401: The matching module 120 uniformly matches at least one detection rule for the to-be-detected data in each of the to-be-detected data groups in a pre-established detection rule library according to a preset matching rule.
[0072] Step S402: The detection module 130 uses the matched detection rule to perform quality detection on the data to be detected to obtain a data quality detection result.
[0073] In this embodiment, an alarm threshold can be set for data quality detection. For example, when the missing rate of the data to be detected is higher than 30%, a warning is issued for abnormal problems in a preset manner. The preset manner includes highlighting the abnormal problem in the data quality detection result and sending the abnormal problem to the inspector by email or short message. After an abnormality is detected in the data to be detected and a warning is sent, the inspector can judge, based on the actual situation, whether the abnormality is reasonable. After the inspector processes the abnormal data, steps S400-S402 are automatically repeated to detect the processed data again.
[0074] In this embodiment, by classifying the to-be-detected data belonging to the same field into a to-be-detected data group, the speed of matching the detection rule for the to-be-detected data is improved, thereby improving the efficiency of data quality detection.
[0075] In addition, the present invention also provides a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium stores a data quality detection program 10, and when the data quality detection program 10 is executed by the processor 12, the following operations are implemented:
[0076] Obtaining step: obtaining at least one piece of data to be detected from a data source, where each piece of data to be detected includes content data and metadata;
[0077] Matching step: matching at least one detection rule for each piece of data to be detected in a pre-established detection rule library according to a preset matching rule; and
[0078] Detection step: detecting the data to be detected using the matching detection rule to obtain the data quality detection result.
[0079] The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the specific implementation of the above-mentioned data quality detection method and the electronic device 1, and will not be repeated here.
[0080] It should be noted that in this document, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article, or method. In addition, the technical solutions of the various embodiments can be combined with each other, but only on the basis of what can be achieved by those of ordinary skill in the art. When a combination of technical solutions is contradictory or cannot be achieved, such a combination should be considered not to exist and is not within the protection scope of the present invention.
[0081] Through the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform. It can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above and includes several instructions used to make the electronic device execute the method described in each embodiment of the present invention.
[0082] The above are only the preferred embodiments of the present invention and do not limit its scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present invention, or applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present invention.