Large-scale data quality anomaly detection method based on data features

A data-feature-based technique for large-scale data, applied in unstructured text data retrieval, electronic digital data processing, text database querying, etc. It addresses the poor versatility, low efficiency, and limited scope of rule-driven detection, enabling large-scale, automated data quality checking and improving detection efficiency.

Pending Publication Date: 2021-10-29
STATE GRID CORP OF CHINA +1
6 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

The prior-art method verifies the model of the data warehouse against standard terms: when a field's definition matches the standard definition but its type does not match the standard type, the field type is changed to the standard type, yielding a model consistent with the standard.

Traditional data quality anomaly detection is rule-driven and targets specific fields of specific tables. Business experts design a set of quality anomaly detection methods based on business specifications and empirical knowledge, then carry out the corresponding special governance work. A detection method built this way has specific detection objects and uses, so it is not very versatile. For large-scale data quality anomaly detection it is inefficient and limited in scope, and its rules must be specified one by one, so large-scale data quality anomaly detection cannot be realized.

Examples

Embodiment 1

[0019] A large-scale data quality anomaly detection method based on data characteristics, comprising the following steps:

[0020] Step S1: Build a data anomaly detection method library: set a corresponding detection method for each data feature, and collect these methods to form the data anomaly detection method library.

[0021] The data anomaly detection method library is stored as a dictionary. Python's dictionary type holds key-value pairs and is used to store the data features and their anomaly detection methods: the key is a tuple consisting of the data feature name and its feature parameters, and the value is the anomaly detection method corresponding to that data feature, in which th...
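The dictionary layout described in [0021] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the feature names (`fixed_length`, `numeric_range`), their parameters, the detector functions, and the `detect` helper are all hypothetical.

```python
# Hypothetical detectors: each takes the field values and the feature
# parameters, and returns the indexes of the anomalous values.
def length_anomalies(values, params):
    (expected_len,) = params
    return [i for i, v in enumerate(values) if len(v) != expected_len]


def range_anomalies(values, params):
    low, high = params
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


# Key: (data feature name, feature parameters) tuple.
# Value: the anomaly detection method for that data feature.
METHOD_LIBRARY = {
    ("fixed_length", (18,)): length_anomalies,
    ("numeric_range", (0, 100)): range_anomalies,
}


def detect(feature_name, feature_params, values):
    """Look up the method for a data feature and run it on the values.

    Returns None when the library has no method for the feature.
    """
    key = (feature_name, tuple(feature_params))
    method = METHOD_LIBRARY.get(key)
    if method is None:
        return None
    return method(values, key[1])
```

Because tuples are hashable, the (name, parameters) pair can serve directly as a dictionary key, so matching a known feature to its method is a single lookup.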

Embodiment 2

[0031] This embodiment is generally consistent with Embodiment 1, except for the large-scale data feature traversal process. In this embodiment, the traversal process includes: scaling the value of each dimension of the word vector to be matched to the range 0 to 255; dividing the range 0 to 255 into several levels; replacing the value of each dimension with the middle value of the level it falls in, which generates a new special word vector; and using the special word vector to calculate the cosine similarity, reducing the computational load under large-scale data volumes. This solution is still based on fuzzified word vectors to reduce the amount of calculation under large-scale data.
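The fuzzification in [0031] can be sketched as below. The source does not say how values are scaled or how many levels are used, so the input range of -1 to 1, the choice of 8 equal-width levels, and the function names are assumptions made for illustration only.

```python
import math


def quantize(vec, levels=8, lo=-1.0, hi=1.0):
    """Scale each dimension to 0..255 and snap it to the middle of its level.

    Assumes components lie in [lo, hi]; values outside are clamped.
    """
    width = 256 / levels  # width of one level on the 0..255 scale
    out = []
    for x in vec:
        x = min(max(x, lo), hi)                # clamp to the assumed range
        scaled = (x - lo) / (hi - lo) * 255    # now in [0, 255]
        level = min(int(scaled // width), levels - 1)
        out.append(level * width + width / 2)  # middle value of the level
    return out


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two nearby vectors fall into the same levels, so their special
# word vectors are identical.
special_a = quantize([0.10, -0.52])
special_b = quantize([0.12, -0.50])
similarity = cosine_similarity(special_a, special_b)
```

With only a few distinct values per dimension, many nearby word vectors collapse to the same special word vector, which is one plausible reading of how the coarsening cuts repeated similarity work.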

[0032] The substantive effects of the above-mentioned embodiments include: transforming the method of anomaly detection from being driven by detection rules to being driven by data characteristics, generating corresponding outlier detection methods based o...

Abstract

The invention discloses a large-scale data quality anomaly detection method based on data features. The method comprises the following steps: constructing a data anomaly detection method library by setting a corresponding detection method for each data feature and collecting these methods into the library; matching an anomaly detection method to each data feature and performing detection with the matched method; and traversing the data features of large-scale data, matching and detecting each data feature in turn. The substantive effects are that anomaly detection changes from being driven by detection rules to being driven by data features, a corresponding outlier detection method is generated from the feature information of the data in each field, and a special fuzzification mechanism is provided for large-scale data. Large-scale, automated data quality checking is thereby realized, and the efficiency of detecting data quality problems is improved.
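The three steps of the abstract (build the library, match, traverse) can be sketched end to end as follows. This is an illustrative outline under stated assumptions: the field names, the feature labels attached to each field, the detectors, and the `traverse` helper are hypothetical, and the sketch takes each field's data feature as given rather than deriving it.

```python
# Hypothetical detectors returning the indexes of anomalous values.
def empty_values(values, params):
    return [i for i, v in enumerate(values) if v is None or v == ""]


def outside_enumeration(values, params):
    allowed = set(params)
    return [i for i, v in enumerate(values) if v not in allowed]


# Step 1: the method library, keyed by (feature name, feature parameters).
LIBRARY = {
    ("non_empty", ()): empty_values,
    ("enumeration", ("M", "F")): outside_enumeration,
}


def traverse(fields, library):
    """Steps 2 and 3: for every field, match its feature and run detection.

    `fields` maps a field name to (feature key, values). Fields whose
    feature has no matching method are reported as None.
    """
    report = {}
    for name, (feature, values) in fields.items():
        method = library.get(feature)
        report[name] = None if method is None else method(values, feature[1])
    return report


fields = {
    "customer_id": (("non_empty", ()), ["A1", "", "A3", None]),
    "gender": (("enumeration", ("M", "F")), ["M", "X", "F"]),
    "remark": (("free_text", ()), ["ok"]),
}
report = traverse(fields, LIBRARY)
```

The loop contains no field-specific rules; covering a new field only requires that its data feature has an entry in the library.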

Description

Technical field

[0001] The invention relates to the field of data anomaly detection, in particular to a large-scale data quality anomaly detection method based on data characteristics.

Background technique

[0002] With the development of the digital economy, all walks of life no longer blindly pursue data volume, and the requirements for data quality in data applications keep rising. Facing massive data resources, how to discover and locate data quality problems faster, more efficiently, more accurately, and more intelligently, and carry out the corresponding governance work, is the focus and core of current enterprise-level data asset management.

[0003] For example, the invention with the publication number CN108256074A discloses a method for verification processing, including obtaining the model of the data warehouse to be verified, where each model includes a plurality of field information, and the field information includes f...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/33, G06F40/242
CPC: G06F16/3344, G06F40/242
Inventor: 葛俊, 梁云丹, 黄建平, 张旭东, 张建松, 陈浩
Owner: STATE GRID CORP OF CHINA