Method, device and storage medium for detecting quality of time series data

By adopting targeted detection strategies based on the time-series data type, the problem of low accuracy of conventional methods in banking scenarios is solved, and accurate detection and anomaly identification of time-series data are achieved.

CN117131028B · Active · Publication Date: 2026-06-23 · CHINA MERCHANTS BANK

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA MERCHANTS BANK
Filing Date
2023-08-22
Publication Date
2026-06-23


Abstract

The application discloses a time series data quality detection method and device and a storage medium thereof, and belongs to the technical field of time series data detection. The time series data quality detection method comprises the following steps: acquiring time series data to be detected; determining the data type of the time series data to be detected; and performing quality detection on the time series data to be detected according to a quality detection strategy corresponding to the data type. The application solves the technical problem of low accuracy of the time series data quality detection method in the conventional technology.

Description

Technical Field

[0001] This application relates to the field of time series data inspection technology, and in particular to a time series data quality inspection method, device and storage medium.

Background Technology

[0002] Time series data is a continuous sequence of data arranged in chronological order, and it is one of the most common data types in production, business activities, and other fields. During the collection, transmission, and storage of data, software and hardware failures often lead to erroneous data. Quality inspection of time series data and marking outliers are crucial steps in correcting erroneous data, improving data integrity, and analyzing anomalous events.

[0003] Conventional time series data quality inspection involves analyzing and comparing time series data with predicted baseline data to identify abnormal states in the series, triggering alarms for anomalies, indicating fault information, and monitoring and ensuring the normal operation of various businesses. However, due to the diverse scenarios in banking, the accuracy of conventional time series data quality inspection methods is relatively low.

[0004] The above content is only used to help understand the technical solution of this application and does not represent an admission that the above content is prior art.

Summary of the Invention

[0005] The main objective of this application is to provide a time-series data quality inspection method, device, and storage medium thereof, which aims to solve the technical problem of low accuracy in conventional time-series data quality inspection methods.

[0006] To achieve the above objectives, this application provides a time-series data quality detection method, which includes:

[0007] Obtain the time series data to be detected;

[0008] Determine the data type of the time series data to be detected;

[0009] Perform quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type.

[0010] Optionally, the step of performing quality inspection on the time series data to be inspected according to the quality inspection strategy corresponding to the data type includes:

[0011] If the data type is periodic, then the residual terms of the time series data to be detected are obtained as the residual sequence;

[0012] Determine the first detection point in the residual sequence, and use the first detection point as the last time point to obtain a first subsequence with a preset window length;

[0013] Based on the median, median deviation, and the first detection point of the first subsequence, the periodic feature value of the first subsequence is determined;

[0014] Determine whether the periodic feature value is greater than a preset first threshold;

[0015] If so, the first detection point is determined to be abnormal, and the next time point of the first detection point is taken as the next first detection point, and the step of obtaining the first sub-sequence of the preset window length is executed;

[0016] If not, the first detection point is determined to be normal, and the next time point of the first detection point is taken as the next first detection point, and the step of obtaining the first sub-sequence of the preset window length is executed.
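
As a non-limiting illustration, the periodic strategy above can be sketched as follows. The source names the inputs of the periodic feature value (median, median deviation, detection point) but not the formula, so this sketch assumes the score |x - median| / MAD, where MAD is the median absolute deviation of the window. The residual sequence is assumed to be already available from some seasonal decomposition; the function names, window length, and threshold are hypothetical.

```python
from statistics import median


def mad_score(window):
    """Robust deviation score of the last point in `window`.

    Assumption: score = |x - median| / MAD, with MAD the median absolute
    deviation of the window. The source does not give the exact formula.
    """
    point = window[-1]
    med = median(window)
    mad = median(abs(v - med) for v in window)
    if mad == 0:
        # Degenerate window: every departure from the median is extreme.
        return 0.0 if point == med else float("inf")
    return abs(point - med) / mad


def detect_periodic(residuals, window_len=5, threshold=3.0):
    """Flag residuals whose score exceeds `threshold` (the first threshold).

    Each detection point is the last point of its window; points without a
    full window behind them are left unflagged.
    """
    flags = [False] * len(residuals)
    for i in range(window_len - 1, len(residuals)):
        window = residuals[i - window_len + 1 : i + 1]
        flags[i] = mad_score(window) > threshold
    return flags


residuals = [0.1, -0.2, 0.0, 0.2, -0.1, 5.0, 0.1, 0.0]
flags = detect_periodic(residuals)
```

In this example only the residual 5.0 is flagged, because the median and MAD of its window are barely moved by a single extreme value.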

[0017] Optionally, the step of performing quality inspection on the time-series data to be inspected according to the quality inspection strategy corresponding to the data type further includes:

[0018] If the data type is skewed, then the second detection point in the time series data to be detected is determined, and the second detection point is used as the last time point to obtain the second subsequence with a preset window length;

[0019] Determine model parameters based on the first and second exponential smoothing values of the second subsequence;

[0020] The model parameters are input into a preset quadratic exponential smoothing model for prediction to obtain the predicted value;

[0021] Determine whether the absolute value of the deviation ratio between the second detection point and the predicted value is greater than a preset second threshold.

[0022] If so, the second detection point is determined to be abnormal, and the next time point of the second detection point is taken as the next second detection point, and the step of obtaining the second sub-sequence with a preset window length is executed;

[0023] If not, the second detection point is determined to be normal, and the next time point of the second detection point is taken as the next second detection point, and the step of obtaining the second sub-sequence with a preset window length is executed.
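
As a non-limiting illustration, the skewed strategy above can be sketched with Brown's linear (double) exponential smoothing. The source names a quadratic exponential smoothing model but gives no formulas, so the following are assumptions: the smoothing coefficient `alpha`, initialisation with the first observation, forecasting from the window points that precede the detection point, and the deviation ratio defined as (actual - predicted) / predicted. All names and default values are hypothetical.

```python
def brown_forecast(history, alpha=0.5):
    """One-step-ahead forecast using Brown's double exponential smoothing."""
    s1 = s2 = history[0]  # assumed initialisation: first observation
    for x in history[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first exponential smoothing value
        s2 = alpha * s1 + (1 - alpha) * s2  # second exponential smoothing value
    level = 2 * s1 - s2                      # model parameter a
    trend = alpha / (1 - alpha) * (s1 - s2)  # model parameter b
    return level + trend                     # forecast one step ahead


def detect_skewed(series, window_len=6, alpha=0.5, threshold=0.3):
    """Flag points whose absolute deviation ratio exceeds `threshold`."""
    flags = [False] * len(series)
    for i in range(window_len - 1, len(series)):
        window = series[i - window_len + 1 : i + 1]
        predicted = brown_forecast(window[:-1], alpha)
        actual = window[-1]
        if predicted == 0:
            ratio = 0.0 if actual == 0 else float("inf")
        else:
            ratio = (actual - predicted) / predicted
        flags[i] = abs(ratio) > threshold
    return flags


trend_with_spike = [10, 12, 14, 16, 18, 20, 22, 60]
flags = detect_skewed(trend_with_spike)
```

Because the forecast follows the trend, the steadily rising values are accepted and only the final jump to 60 is flagged.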

[0024] Optionally, the step of performing quality inspection on the time-series data to be inspected according to the quality inspection strategy corresponding to the data type further includes:

[0025] If the data type is stationary, then the third detection point in the time series data to be detected is determined, and the third subsequence of a preset window length is obtained with the third detection point as the last time point.

[0026] Based on the median, median deviation, and third detection point of the third subsequence, the stationary characteristic value of the third subsequence is determined;

[0027] Determine whether the stationary feature value is greater than a preset third threshold;

[0028] If so, the third detection point is determined to be abnormal, and the next time point of the third detection point is taken as the next third detection point, and the step of obtaining the third sub-sequence of the preset window length is executed;

[0029] If not, the third detection point is determined to be normal, and the next time point of the third detection point is taken as the next third detection point, and the step of obtaining the third sub-sequence of the preset window length is executed.

[0030] Optionally, after the step of determining that the point to be detected is abnormal, the method further includes:

[0031] Based on the previous time point of the third detection point, the third detection point is corrected, and the next time point of the third detection point is used as the next third detection point to obtain a third subsequence of a preset window length.
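
As a non-limiting illustration, the stationary strategy with correction can be sketched as follows. The sketch assumes the stationary characteristic value is |x - median| / MAD (median absolute deviation), and that "correcting" an abnormal point means overwriting it with the value at the previous time point so that later windows are not contaminated. The source states neither detail explicitly; names and defaults are hypothetical.

```python
from statistics import median


def detect_stationary(series, window_len=5, threshold=3.0):
    """Return (flags, corrected_series) for a stationary-type series."""
    work = list(series)  # corrections are applied to a copy
    flags = [False] * len(work)
    for i in range(window_len - 1, len(work)):
        window = work[i - window_len + 1 : i + 1]
        med = median(window)
        mad = median(abs(v - med) for v in window)
        point = work[i]
        if mad == 0:
            abnormal = point != med
        else:
            abnormal = abs(point - med) / mad > threshold
        flags[i] = abnormal
        if abnormal:
            # Assumed correction rule: carry the previous value forward.
            work[i] = work[i - 1]
    return flags, work


flags, corrected = detect_stationary([100, 101, 99, 100, 102, 150, 101, 100])
```

Here the value 150 is flagged and replaced by 102, so the two points that follow are judged against a window that no longer contains the outlier.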

[0032] Optionally, the step of performing quality inspection on the time-series data to be inspected according to the quality inspection strategy corresponding to the data type further includes:

[0033] If the data type is irregular, the time series data to be detected is sorted in ascending order to obtain sorted data;

[0034] The time series data range is determined based on the upper quartile, the lower quartile, and the interquartile range of the sorted data;

[0035] If a time point in the time series data to be detected is outside the time series data range, then the time point is determined to be abnormal;

[0036] If a time point in the time series data to be detected is within the time series data range, then the time point is determined to be normal.
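
As a non-limiting illustration, the irregular strategy can be sketched with the interquartile range. The source does not say how the range is derived from the quartiles, so this sketch assumes the common fences [Q1 - k * IQR, Q3 + k * IQR] with k = 1.5, and computes quartiles with the "inclusive" method of Python's `statistics.quantiles`. Both choices, and the function name, are assumptions.

```python
from statistics import quantiles


def detect_irregular(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(series)  # ascending order, as in the method
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Flags are reported in the original time order, not the sorted order.
    return [not (low <= v <= high) for v in series]


flags = detect_irregular([12, 15, 14, 13, 16, 90, 14, 15, 13])
```

For this input the quartiles are 13 and 15, the range is [10, 18], and only the value 90 is flagged.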

[0037] Optionally, the step of determining the data type of the time-series data to be detected includes:

[0038] Multi-dimensional feature extraction is performed on the time series data to be detected to obtain time series features;

[0039] Based on the time series features, the data type of the time series data to be detected is determined.

[0040] Optionally, after the step of acquiring the time series data to be detected, the method further includes:

[0041] The time series data to be detected is preprocessed, wherein the preprocessing includes one or more of the following: interpolation, dimensionality reduction, normalization, and encoding.

[0042] This application also provides an electronic device, the electronic device comprising: a memory, a processor, and a time series data quality detection program stored in the memory and executable on the processor, the time series data quality detection program being configured to implement the steps of the time series data quality detection method described above.

[0043] This application also provides a storage medium, which is a computer-readable storage medium, on which a time-series data quality detection program is stored, and the time-series data quality detection program is executed by a processor to implement the steps of the above-described time-series data quality detection method.

[0044] This application discloses a method for time-series data quality inspection, which involves acquiring time-series data to be inspected; determining the data type of the time-series data; and then performing quality inspection on the time-series data according to the quality inspection strategy corresponding to the data type. By accurately matching a targeted data quality inspection strategy based on the data type of the time-series data, precise inspection of the time-series data is achieved, thereby improving the accuracy of data quality inspection.

Attached Figure Description

[0045] Figure 1 is a schematic diagram of the structure of an electronic device in the hardware operating environment involved in the embodiments of this application;

[0046] Figure 2 is a flowchart of the time-series data quality detection method involved in the embodiments of this application;

[0047] Figure 3 is a schematic diagram of the data classification model involved in the embodiments of this application;

[0048] Figure 4 is a schematic diagram of a scenario of the first embodiment of the solution described in this application;

[0049] Figure 5 is a schematic diagram of another scenario of the first embodiment of the solution described in this application;

[0050] Figure 6 is a schematic diagram of the time-series data quality detection system involved in the embodiments of this application.

[0051] The realization of the purpose, functional features and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0052] It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit this application.

[0053] Furthermore, the use of terms such as "first" and "second" in this application is for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include at least one of those features. Additionally, the term "and / or" throughout the text includes three solutions; taking A and / or B as an example, it includes technical solution A, technical solution B, and a technical solution that simultaneously satisfies A and B. Furthermore, the technical solutions of various embodiments can be combined with each other, but this must be based on the ability of a person skilled in the art to implement them. When the combination of technical solutions is contradictory or impossible to implement, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed in this application.

[0054] Referring to Figure 1, Figure 1 is a schematic diagram of the electronic device structure of the hardware operating environment involved in the embodiments of this application.

[0055] As shown in Figure 1, the electronic device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen or an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk drive. The memory 1005 may also optionally be a storage device independent of the aforementioned processor 1001.

[0056] Those skilled in the art will understand that the structure shown in Figure 1 does not constitute a limitation on the electronic device, which may include more or fewer components than shown, combine certain components, or have a different component arrangement.

[0057] As shown in Figure 1, the memory 1005, which serves as a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a time series data quality detection program.

[0058] In the electronic device shown in Figure 1, the network interface 1004 is mainly used for data communication with other devices; the user interface 1003 is mainly used for data interaction with the user; the processor 1001 and memory 1005 of this application can be disposed in the electronic device, and the electronic device calls the time series data quality detection program stored in the memory 1005 through the processor 1001 and performs the following operations:

[0059] Obtain the time series data to be detected;

[0060] Determine the data type of the time series data to be detected;

[0061] Perform quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type.

[0062] Furthermore, the operation of performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type includes:

[0063] If the data type is periodic, then the residual terms of the time series data to be detected are obtained as the residual sequence;

[0064] Determine the first detection point in the residual sequence, and use the first detection point as the last time point to obtain a first subsequence with a preset window length;

[0065] Based on the median, median deviation, and the first detection point of the first subsequence, the periodic feature value of the first subsequence is determined;

[0066] Determine whether the periodic feature value is greater than a preset first threshold;

[0067] If so, the first detection point is determined to be abnormal, and the next time point of the first detection point is taken as the next first detection point, and the operation of obtaining the first sub-sequence of the preset window length is executed;

[0068] If not, the first detection point is determined to be normal, and the next time point of the first detection point is taken as the next first detection point, and the operation of obtaining the first sub-sequence of the preset window length is performed.

[0069] Furthermore, the operation of performing quality detection on the time-series data to be detected according to the quality detection strategy corresponding to the data type further includes:

[0070] If the data type is skewed, then the second detection point in the time series data to be detected is determined, and the second detection point is used as the last time point to obtain the second subsequence with a preset window length;

[0071] Determine model parameters based on the first and second exponential smoothing values of the second subsequence;

[0072] The model parameters are input into a preset quadratic exponential smoothing model for prediction to obtain the predicted value;

[0073] Determine whether the absolute value of the deviation ratio between the second detection point and the predicted value is greater than a preset second threshold.

[0074] If so, the second detection point is determined to be abnormal, and the next time point of the second detection point is taken as the next second detection point, and the operation of obtaining the second sub-sequence with a preset window length is performed;

[0075] If not, the second detection point is determined to be normal, and the next time point of the second detection point is taken as the next second detection point, and the operation of obtaining the second sub-sequence with a preset window length is performed.

[0076] Furthermore, the operation of performing quality detection on the time-series data to be detected according to the quality detection strategy corresponding to the data type further includes:

[0077] If the data type is stationary, then the third detection point in the time series data to be detected is determined, and the third subsequence of a preset window length is obtained with the third detection point as the last time point.

[0078] Based on the median, median deviation, and third detection point of the third subsequence, the stationary characteristic value of the third subsequence is determined;

[0079] Determine whether the stationary feature value is greater than a preset third threshold;

[0080] If so, the third detection point is determined to be abnormal, and the next time point of the third detection point is taken as the next third detection point, and the operation of obtaining the third sub-sequence of the preset window length is performed;

[0081] If not, the third detection point is determined to be normal, and the next time point of the third detection point is taken as the next third detection point, and the operation of obtaining the third sub-sequence of the preset window length is performed.

[0082] Furthermore, the processor 1001 can call the time series data quality detection program stored in the memory 1005 and also perform the following operations:

[0083] After the operation of determining that the point to be detected is abnormal, the method further includes:

[0084] Based on the previous time point of the third detection point, the third detection point is corrected, and the operation of obtaining a third sub-sequence of a preset window length is performed with the next time point of the third detection point as the next third detection point.

[0085] Furthermore, the operation of performing quality detection on the time-series data to be detected according to the quality detection strategy corresponding to the data type further includes:

[0086] If the data type is irregular, the time series data to be detected is sorted in ascending order to obtain sorted data;

[0087] The time series data range is determined based on the upper quartile, the lower quartile, and the interquartile range of the sorted data;

[0088] If a time point in the time series data to be detected is outside the time series data range, then the time point is determined to be abnormal;

[0089] If a time point in the time series data to be detected is within the time series data range, then the time point is determined to be normal.

[0090] Furthermore, the operation of determining the data type of the time-series data to be detected includes:

[0091] Multi-dimensional feature extraction is performed on the time series data to be detected to obtain time series features;

[0092] Based on the time series features, the data type of the time series data to be detected is determined.

[0093] Furthermore, the processor 1001 can call the time series data quality detection program stored in the memory 1005 and also perform the following operations:

[0094] After the operation of acquiring the time series data to be detected, the operations further include:

[0095] The time series data to be detected is preprocessed, wherein the preprocessing includes one or more of the following: interpolation, dimensionality reduction, normalization, and encoding.

[0096] Based on the above structure, various embodiments of the time series data quality detection method are proposed.

[0097] Referring to Figure 2, Figure 2 is a flowchart of the first embodiment of the time-series data quality detection method of this application.

[0098] In this embodiment, the execution subject of the time-series data quality detection method can be an electronic device, which can be a local device or a network device. No limitation is imposed in this embodiment. For ease of description, the execution subject is omitted from the following description of each embodiment. In this embodiment, the time-series data quality detection method includes:

[0099] Step S10: Obtain the time series data to be detected;

[0100] Acquire the time series data on which quality detection is to be performed (hereinafter referred to as the time series data to be detected, for distinction).

[0101] Optionally, time series data is a sequence of numbers composed of the values of the same statistical indicator arranged in chronological order of their occurrence. It can be period data or point-in-time data. Different time series data have different development patterns. The prediction of time series data is based on the analysis of historical data trends to predict future performance. Different time series data require different prediction methods. Therefore, before prediction, it is necessary to first perform quality checks on the time series data to avoid the impact of abnormal time series data on the prediction data.

[0102] Optionally, time series data can be a collection of time series, including multiple time series. The time series in the collection can have the same or different sources. Different time series can correspond to the same statistical indicator or different statistical indicators.

[0103] In one feasible implementation, after step S10, which involves acquiring the time-series data to be detected, the method further includes:

[0104] Step S11: Preprocess the time series data to be detected, wherein the preprocessing includes one or more of the following: interpolation, dimensionality reduction, normalization, and encoding.

[0105] Preprocessing is performed on the time series data to be detected. Preprocessing may include one or more of the following: interpolation, dimensionality reduction, normalization, and encoding.

[0106] Optionally, interpolation is performed by using linear interpolation to fill in missing values in time series data, thereby solving the problem of data loss in time series data and ensuring data integrity.
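
As a non-limiting illustration, linear interpolation of missing values can be sketched as follows. Missing values are assumed to be represented by `None`, time points are assumed to be evenly spaced, and gaps at the start or end of the series are assumed to be filled with the nearest known value; the source does not specify any of these details, and the function name is hypothetical.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between known neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    out = list(values)
    # Assumed edge rule: leading and trailing gaps copy the nearest known value.
    for i in range(known[0]):
        out[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        out[i] = values[known[-1]]
    # Interior gaps: straight line between the two surrounding known points.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


filled = fill_linear([1.0, None, None, 4.0, None])
```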

[0107] Optionally, dimensionality reduction is performed by using the Piecewise Aggregate Approximation (PAA) method to reduce the length of the time series data, thereby reducing computational load and improving data quality detection efficiency; for example, the sampling rate of dimensionality reduction is 0.1.
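
As a non-limiting illustration, Piecewise Aggregate Approximation can be sketched as follows: the series is cut into contiguous segments and each segment is replaced by its mean. The sketch assumes that the "sampling rate" is the ratio of output length to input length, which the source does not define; the function name is hypothetical.

```python
def paa(series, rate=0.1):
    """Reduce `series` to about len(series) * rate segment means."""
    n = len(series)
    if n == 0:
        raise ValueError("series must not be empty")
    # At least one segment, and never more segments than points.
    segments = min(n, max(1, round(n * rate)))
    reduced = []
    for k in range(segments):
        start = k * n // segments
        end = (k + 1) * n // segments
        chunk = series[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced


reduced = paa(list(range(20)), rate=0.1)
```

With 20 points and a rate of 0.1, the result has two values, the means of the first and second halves of the series.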

[0108] Optionally, normalization is performed on the time series data using a global normalization method or a windowed normalization method, so that each data point in the time series data has the same distribution, which is beneficial for model prediction and training and avoids the gradient vanishing phenomenon in backpropagation.
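
As a non-limiting illustration, the difference between global and windowed normalization can be sketched as follows. The source does not name the normalization formula or the windowing scheme, so this sketch assumes z-score normalization and non-overlapping windows; all names are hypothetical.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # constant input carries no spread
    return [(v - mu) / sigma for v in values]


def normalize_global(series):
    """One mean and one deviation for the whole series."""
    return zscore(series)


def normalize_windowed(series, window_len):
    """Separate statistics for each non-overlapping window (assumed scheme)."""
    out = []
    for start in range(0, len(series), window_len):
        out.extend(zscore(series[start : start + window_len]))
    return out


series = [1.0, 3.0, 10.0, 30.0]
global_norm = normalize_global(series)
windowed_norm = normalize_windowed(series, window_len=2)
```

Windowed normalization maps both halves of this example to the same values, whereas global normalization preserves the difference in level between them.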

[0109] Optionally, the encoding can be carried out using one-hot encoding, which uses an N-bit state register to encode N states. Each state has its own independent register bit, and at any given time only one bit is valid, so that each data item has exactly one active state; for example, the one-hot encoding of periodic data is [1, 0, 0].
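
As a non-limiting illustration, one-hot encoding of the data type label can be sketched as follows. The three-element example [1, 0, 0] suggests three encoded classes; the sketch assumes these are periodic, skewed, and stationary, in that order, with the irregular type handled separately. The source confirms only the position of the periodic type, so the rest of the ordering is an assumption.

```python
# Assumed label order; only "periodic" -> [1, 0, 0] is given in the source.
LABELS = ("periodic", "skewed", "stationary")


def one_hot(label):
    """Return a list with a single 1 at the position of `label`."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if name == label else 0 for name in LABELS]


encoded = one_hot("periodic")
```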

[0110] In this embodiment, the efficiency and accuracy of data quality inspection are improved by preprocessing the time series data to be inspected before classifying and / or inspecting the quality of the time series data.

[0111] Step S20: Determine the data type of the time series data to be detected;

[0112] The time series data to be tested is classified in order to determine the data type of the time series data to be tested.

[0113] Optionally, the time series data to be detected is input into a preset data classification model for prediction to determine the data type of the time series data to be detected, wherein the data types include: periodic, skewed, stationary and irregular.

[0114] Optionally, the data classification model can be a one-dimensional convolutional neural network (1D-CNN). Referring to Figure 3, the model consists of four convolutional layers, two pooling layers, one fully connected layer, and one softmax (normalized exponential) activation layer. The data classification model is trained on a training set, and after parameter tuning and iterative backpropagation training, the model converges.

[0115] To aid in understanding the above technical solution, a scenario of the first embodiment of the time-series data quality detection method is described below with reference to Figure 4: the time series data to be detected is preprocessed and then input into the data classification model to obtain the classification result. Negative samples that fail to be classified can be regarded as the irregular type.

[0116] Step S30: Perform quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type.

[0117] Based on the data type, determine the corresponding quality inspection strategy, and use the corresponding quality inspection strategy to perform quality inspection on the time series data to be inspected.

[0118] To aid in understanding the above technical solution, another scenario of the first embodiment of the time-series data quality detection method is described below with reference to Figure 5. The system constructs a decision tree based on the CART (Classification and Regression Trees) pruning algorithm to adjust the quality detection strategy. For example, based on the classification result, it determines whether the probability that the time series data to be detected is periodic is greater than a preset threshold. If so, it calls the periodic quality detection strategy to perform quality detection. This allows an appropriate algorithm to be assigned to each type of time series data, thereby improving the accuracy of data quality checks.
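
As a non-limiting illustration, the threshold check in the example above can be sketched as a simple strategy selector. This sketch does not build or prune a CART decision tree; it only shows the probability test, and it assumes that a series whose highest class probability does not exceed the threshold falls back to the irregular strategy. The function name, the dictionary input, and the threshold value are hypothetical.

```python
def choose_strategy(probabilities, threshold=0.5):
    """Pick the quality detection strategy from classifier probabilities.

    `probabilities` maps a data type name to its predicted probability.
    """
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    # Assumed fallback: low-confidence results are treated as irregular.
    return label if prob > threshold else "irregular"


confident = choose_strategy({"periodic": 0.8, "skewed": 0.1, "stationary": 0.1})
uncertain = choose_strategy({"periodic": 0.4, "skewed": 0.3, "stationary": 0.3})
```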

[0119] In this embodiment, time-series data to be tested is acquired; then the data type of the time-series data to be tested is determined; and then, quality testing is performed on the time-series data to be tested according to the quality testing strategy corresponding to the data type. By accurately matching a targeted data quality testing strategy based on the data type of the time-series data, precise testing of the time-series data is achieved, thereby improving the accuracy of data quality testing.

[0120] Furthermore, based on the first embodiment described above, a second embodiment of the time-series data quality detection method of this application is proposed. In this embodiment, step S20, the step of determining the data type of the time-series data to be detected, includes:

[0121] Step S21: Perform multi-dimensional feature extraction on the time series data to be detected to obtain time series features;

[0122] Multi-dimensional feature extraction is performed on the time series data to be detected to obtain time series features.

[0123] Optionally, multiple time-series features can be extracted from the same time-series data, thus corresponding to multiple different data types. For example, in the same time-series data, the time-series values of different time periods may have different patterns and characteristics. Therefore, the same time-series data can be divided into multiple subsequences of different types for separate classification.

[0124] Optionally, the time-series features include: difference mean, window extreme difference, fit deviation distance, and fluctuation factor.

[0125] Step S22: Determine the data type of the time series data to be detected based on the time series characteristics.

[0126] Based on the temporal characteristics, determine the data type of the temporal data to be detected.

[0127] In one feasible implementation, the time series data to be detected is input into a preset data classification model, and continuous difference calculation is performed on the time series data to be detected based on the data classification model; that is, all differences S_t - S_{t-1} in the time series data to be detected are calculated, where S_t is the time series value at time t, and the mean of all the differences S_t - S_{t-1} is taken as the difference mean. It is then determined whether the difference mean is less than a preset difference threshold. If it is, the data type of the time series data to be detected is determined to be periodic; if not, the data type of the time series data to be detected is determined to be non-periodic. Based on the characteristics of periodic time series data, the difference mean is calculated for the time series data to be detected, and whether the time series data is periodic is then determined from the difference mean. This determination has high accuracy and strong anti-interference ability.
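
As a non-limiting illustration, the difference mean test can be sketched as follows. The comparison is implemented literally as described (difference mean less than the threshold); the threshold value and function names are hypothetical.

```python
def difference_mean(series):
    """Mean of all consecutive differences S_t - S_{t-1}."""
    diffs = [curr - prev for prev, curr in zip(series, series[1:])]
    return sum(diffs) / len(diffs)


def looks_periodic(series, diff_threshold=0.1):
    """Literal reading of the test: periodic if the mean is below the threshold."""
    return difference_mean(series) < diff_threshold


wave = [0, 1, 0, -1, 0, 1, 0, -1, 0]
ramp = [1, 2, 3, 4]
wave_is_periodic = looks_periodic(wave)
ramp_is_periodic = looks_periodic(ramp)
```

Note that the consecutive differences telescope, so the difference mean equals (last value - first value) / (n - 1). The feature therefore measures net drift over the series: a wave that returns to its starting level scores near zero, while a steady ramp does not.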

[0128] In one feasible implementation, the time-series data to be detected is input into a preset data classification model. Based on the data classification model, the time-series data is subjected to low-pass filtering to remove waveforms below a preset frequency threshold, resulting in multiple low-pass filtered sequences. A target stationary magnitude threshold for each low-pass filtered sequence is determined based on its time-series values. Sliding window processing is then performed on each low-pass filtered sequence, centered on each of its time-series values, to obtain the window sequences of each low-pass filtered sequence, the window lengths of all window sequences being the same. The window extreme value difference of each window sequence is determined, the window extreme value difference being the difference between the maximum time-series value and the minimum time-series value of the window sequence. Each window sequence whose window extreme value difference is greater than the target stationary magnitude threshold is taken as a fluctuating window sequence. The fluctuation ratio, that is, the proportion of fluctuating window sequences in the low-pass filtered sequence, is then determined: if the fluctuation ratio is less than or equal to a preset ratio threshold, the data type of the time-series data to be detected is determined to be stationary; if the fluctuation ratio is greater than the preset ratio threshold, the data type of the time-series data to be detected is determined to be not stationary.

[0129] Optionally, the step of determining the target stationary magnitude threshold of each low-pass filtered sequence based on its time series values includes: determining the time series mean of the low-pass filtered sequence based on its time series values; determining the weighted time series mean based on a preset weighting coefficient; and determining whether the weighted time series mean is greater than the preset stationary magnitude threshold. If so, the weighted time series mean is used as the target stationary magnitude threshold; if not, the preset stationary magnitude threshold is used as the target stationary magnitude threshold. For example, if the preset stationary magnitude threshold is 200 and the preset weighting coefficient is 0.6, then the target stationary magnitude threshold = max(200, time series mean * 0.6).

[0130] Optionally, the step of performing sliding window processing on each low-pass filtered sequence, centered on each time series value in that sequence, to obtain its window sequences includes: performing sliding window processing on the low-pass filtered sequence according to a preset window length, centered on the position of each time series value in the sequence, to obtain multiple window sequences. The preset window length can be set according to actual needs, and this embodiment does not limit it.
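The stationary-type check in paragraphs [0128] to [0130] can be sketched as below. The sketch assumes the input has already been low-pass filtered upstream, that the window length is odd, and that edge positions without a full window are skipped so that all windows have the same length. The function names, the window length of 5, and the ratio threshold of 0.1 are hypothetical; only the values 200 and 0.6 come from the example in the text.

```python
from statistics import fmean


def target_magnitude_threshold(series, preset_threshold=200.0, weight=0.6):
    """max(preset threshold, time series mean * weighting coefficient)."""
    return max(preset_threshold, fmean(series) * weight)


def fluctuation_ratio(series, window_length, magnitude_threshold):
    """Share of centered windows whose (max - min) exceeds the threshold."""
    if window_length % 2 == 0:
        raise ValueError("this sketch assumes an odd window length")
    half = window_length // 2
    # Assumption: only positions with a full window are used, so lengths match.
    windows = [series[i - half:i + half + 1]
               for i in range(half, len(series) - half)]
    if not windows:
        raise ValueError("series is shorter than the window length")
    fluctuating = sum(1 for w in windows if max(w) - min(w) > magnitude_threshold)
    return fluctuating / len(windows)


def is_stationary(filtered, window_length=5, ratio_threshold=0.1):
    """Stationary when the fluctuation ratio does not exceed the ratio threshold."""
    threshold = target_magnitude_threshold(filtered)
    return fluctuation_ratio(filtered, window_length, threshold) <= ratio_threshold


# Hypothetical example data.
calm = [1000, 1010, 990, 1005, 995, 1002, 998, 1001, 1004, 996]
jumpy = [1000, 100, 1900, 50, 2000, 120, 1800, 90, 1950, 80]

print(is_stationary(calm))   # True
print(is_stationary(jumpy))  # False
```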

[0131] In one feasible implementation, polynomial fitting is performed on the time series data to be detected to obtain a fitted straight line. A first time series value and a second time series value in the time series data to be detected are determined, and the sum of the distances from the first time series value and the second time series value to the fitted straight line is determined as the fitting deviation distance. The first time series value and the second time series value are located on opposite sides of the fitted straight line, and each is the time series value with the largest distance to the fitted straight line on its side. It is then determined whether the fitting deviation distance is greater than a preset deviation threshold. If it is, the data type of the time series data to be detected is determined to be not skewed; otherwise, the data type is determined to be skewed.
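A minimal sketch of the fitting deviation distance follows. It assumes an ordinary least-squares line fitted against the sample index (0, 1, 2, ...) and uses the perpendicular point-to-line distance; the source does not say which distance is meant, so both choices, along with the function names and the threshold of 5.0, are assumptions.

```python
import math


def fit_line(values):
    """Least-squares line y = slope * x + intercept with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two time series values")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def fitting_deviation_distance(values):
    """Sum of the largest distances to the line on each of its two sides."""
    slope, intercept = fit_line(values)
    norm = math.hypot(slope, 1.0)
    # Signed perpendicular distance: positive above the line, negative below.
    signed = [(y - (slope * x + intercept)) / norm for x, y in enumerate(values)]
    above = max((d for d in signed if d > 0), default=0.0)
    below = max((-d for d in signed if d < 0), default=0.0)
    return above + below


def is_skewed(values, deviation_threshold):
    """Skewed when the fitting deviation distance does not exceed the threshold."""
    return fitting_deviation_distance(values) <= deviation_threshold


print(is_skewed([1, 3, 5, 7, 9], deviation_threshold=5.0))     # True
print(is_skewed([0, 10, 0, 10, 0], deviation_threshold=5.0))   # False
```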

[0132] In one feasible implementation, the time series data to be detected is subjected to sliding window processing according to a preset first window length to obtain multiple first window sequences of the first window length. A first distance is calculated for each first window sequence, the first distance being the maximum absolute difference between the time series values of that first window sequence. The standard deviation of the time series data to be detected is determined, and a deviation distance threshold is determined according to the standard deviation and a preset deviation coefficient. The proportion of first window sequences whose first distance is greater than the deviation distance threshold is determined as a first feature. Likewise, the time series data to be detected is subjected to sliding window processing according to a preset second window length, which is not equal to the first window length, to obtain multiple second window sequences of the second window length. A second distance is calculated for each second window sequence, the second distance being the maximum absolute difference between the time series values of that second window sequence. The deviation distance threshold is again determined based on the standard deviation of the time series data to be detected and the preset deviation coefficient, and the proportion of second window sequences whose second distance is greater than the deviation distance threshold is determined as a second feature. A third feature, a fourth feature, and so on can be calculated in the same way; the more features obtained, the higher the accuracy of the judgment of the data type of the time series data. The difference between the first feature and the second feature is then calculated as a fluctuation factor; alternatively, the average difference between the features can be calculated as the fluctuation factor. If the fluctuation factor is greater than a preset fluctuation factor threshold, the data type of the time series data to be detected is determined to be irregular.
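The two-window fluctuation factor can be sketched as follows. This is an illustration under stated assumptions: the population standard deviation is used, the fluctuation factor is taken as the absolute difference of the two features, and the window lengths (3 and 6), deviation coefficient (1.0), and function names are hypothetical.

```python
from statistics import pstdev


def window_feature(series, window_length, deviation_coefficient):
    """Proportion of sliding windows whose (max - min) exceeds coefficient * std."""
    distance_threshold = pstdev(series) * deviation_coefficient
    windows = [series[i:i + window_length]
               for i in range(len(series) - window_length + 1)]
    if not windows:
        raise ValueError("window_length exceeds the series length")
    exceeding = sum(1 for w in windows if max(w) - min(w) > distance_threshold)
    return exceeding / len(windows)


def fluctuation_factor(series, first_window=3, second_window=6,
                       deviation_coefficient=1.0):
    """Assumption: absolute difference between the first and second features."""
    first = window_feature(series, first_window, deviation_coefficient)
    second = window_feature(series, second_window, deviation_coefficient)
    return abs(first - second)


def is_irregular(series, factor_threshold, **kwargs):
    """Irregular when the fluctuation factor exceeds the preset threshold."""
    return fluctuation_factor(series, **kwargs) > factor_threshold


# Hypothetical example: a flat series with a single spike.
spiky = [0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0]
print(window_feature(spiky, 3, 1.0))
print(window_feature(spiky, 6, 1.0))
print(is_irregular(spiky, factor_threshold=0.5))
```

In this example the spike falls in 3 of the 10 short windows but in 6 of the 7 long windows, so the two features differ markedly.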

[0133] In another feasible implementation, if the data type of the time series data to be detected is not periodic, stationary, or skewed / trend-type, then the data type of the time series data to be detected is determined to be irregular.

[0134] In this embodiment, by extracting multi-dimensional features from time-series data, time-series features are obtained. Then, by combining the time-series features, the time-series data is classified, which effectively improves the classification accuracy of time-series data. This enables the classification accuracy of time-series data to meet the prediction needs of complex business scenarios such as the financial field.

[0135] Furthermore, based on the first and / or second embodiments described above, a third embodiment of the time-series data quality detection method of this application is proposed. In this embodiment, step S30, the step of performing quality detection on the time-series data to be detected according to the quality detection strategy corresponding to the data type, includes:

[0136] Step A31: If the data type is periodic, then obtain the residual terms of the time series data to be detected as the residual sequence;

[0137] If the data type of the time series data to be tested is periodic, the STL (Seasonal and Trend decomposition using Loess) algorithm is used to decompose the time series data into periodic terms, trend terms, and residual terms, and the residual terms are extracted as residual sequences.

[0138] Optionally, the STL algorithm is a time series decomposition method that uses robust locally weighted regression as its smoothing approach. The algorithm consists of an inner loop and an outer loop. The inner loop primarily performs trend fitting and periodic component calculation, with each pass comprising six steps: detrending, smoothing of the periodic subsequences, low-pass filtering of the smoothed periodic subsequences, detrending of the smoothed periodic subsequences, removal of the periodic component, and trend smoothing. The outer loop is mainly used to adjust the robustness weights. If there are outliers in the time series, the residual term will be relatively large.

[0139] Step A32: Determine the first detection point in the residual sequence, and use the first detection point as the last time point to obtain the first subsequence with a preset window length;

[0140] The first detection point is determined from the residual sequence, and the first subsequence with the first detection point as the last time point is obtained with a preset window length.

[0141] For example, the residual sequence is $S = \{S_1, S_2, S_3, S_4, S_5, \ldots, S_t\}$, the preset window length is 5, and the first detection point is $S_x$. Then, with $S_x$ as the last time point, the first subsequence $S' = \{S_{x-4}, S_{x-3}, S_{x-2}, S_{x-1}, S_x\}$ is obtained.

[0142] Step A33: Determine the periodic feature value of the first subsequence based on the median, median deviation, and the first detection point of the first subsequence;

[0143] Calculate the median and median deviation of the first subsequence, and determine the periodic characteristic value of the first subsequence based on the median, median deviation, and the first detection point.

[0144] Optionally, the median, also known as the middle value, is the number in the middle of a set of data arranged in order.

[0145] Optionally, the deviation between each time point in the first subsequence and the median is determined, and then the median of the absolute value of the deviation is determined, which is the median deviation.

[0146] Optionally, the periodic feature value = |(first detection point - median) / median deviation|.

[0147] Step A34: Determine whether the periodic feature value is greater than a preset first threshold;

[0148] Step A35: If yes, then determine that the first detection point is abnormal, and take the next time point of the first detection point as the next first detection point, and execute the step of obtaining the first sub-sequence of the preset window length;

[0149] Step A36: If not, determine that the first detection point is normal, take the next time point of the first detection point as the next first detection point, and execute the step of obtaining the first sub-sequence of the preset window length.

[0150] Determine whether the periodic feature value is greater than a preset first threshold. If so, it indicates that the first detection point deviates significantly, and the first detection point is determined to be abnormal. The next time point of the first detection point is taken as the next first detection point, and the step of obtaining a first sub-sequence of a preset window length is executed until the first detection point is the last time point in the time series data to be detected. If not, it indicates that the first detection point does not deviate significantly, and the first detection point is determined to be normal. The next time point of the first detection point is taken as the next first detection point, and the step of obtaining a first sub-sequence of a preset window length is executed until the first detection point is the last time point in the time series data to be detected.

[0151] For example, if the first detection point is $S_t$, the next time point $S_{t+1}$ is taken as the next first detection point, and the step of obtaining the first subsequence of the preset window length is executed.

[0152] Optionally, the window length can be set according to actual needs, and the first threshold can also be set according to actual needs. This embodiment does not limit this. For example, the first threshold is 3.5.
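Steps A32 to A36 can be sketched as a rolling median / median-deviation check over the residual sequence. The sketch assumes the residuals have already been extracted (for example by an STL decomposition, which is not shown). The source does not say what to do when the median deviation is zero, so the `min_deviation` floor is a hypothetical guard, and the function names and example data are likewise illustrative.

```python
from statistics import median


def feature_value(window, min_deviation=1e-9):
    """|(detection point - median) / median deviation| for one subsequence.

    The detection point is the last element of the window.
    """
    center = median(window)
    deviation = median(abs(v - center) for v in window)
    # Hypothetical guard: avoid dividing by a zero median deviation.
    return abs((window[-1] - center) / max(deviation, min_deviation))


def detect_abnormal_points(residuals, window_length=5, threshold=3.5):
    """Indices of detection points whose feature value exceeds the threshold."""
    abnormal = []
    for x in range(window_length - 1, len(residuals)):
        window = residuals[x - window_length + 1:x + 1]
        if feature_value(window) > threshold:
            abnormal.append(x)
    return abnormal


# Hypothetical residual sequence with one spike at index 8.
residuals = [1, 2, 3, 2, 3, 2, 1, 2, 20, 2]
print(feature_value([1, 2, 3, 2, 20]))    # 18.0
print(detect_abnormal_points(residuals))  # [8]
```

For the window [1, 2, 3, 2, 20] the median is 2 and the median deviation is 1, so the feature value is |20 - 2| / 1 = 18, well above the example threshold of 3.5.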

[0153] Optionally, after the step of determining that the first detection point is abnormal, the method further includes: correcting the first detection point ($S_t$) based on its previous time point ($S_{t-1}$), and then taking the next time point of the first detection point as the next first detection point to obtain the first subsequence of the preset window length, in order to avoid the impact of abnormal time series points on the quality detection of subsequent time series data.

[0154] In this embodiment, based on the characteristics of periodic time series data, the residual terms in the time series data to be tested, i.e. random noise, are obtained. Since the residuals of normal time series data should also fluctuate within a reasonable range, data quality testing can be carried out efficiently and accurately by performing quality testing on the residual terms of periodic time series data.

[0155] In one feasible implementation, step S30, the step of performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type, further includes:

[0156] Step B31: If the data type is skewed, then determine the second detection point in the time series data to be detected, and use the second detection point as the last time point to obtain the second subsequence with a preset window length;

[0157] If the data type of the time series data to be detected is skewed, then the second detection point is determined from the time series data to be detected, and the second detection point is used as the last time point to obtain the second subsequence with a preset window length.

[0158] For example, the time series data to be detected is $A = \{A_1, A_2, A_3, A_4, A_5, \ldots, A_t\}$, the preset window length is 5, and the second detection point is $A_x$. Then, with $A_x$ as the last time point, the second subsequence $A' = \{A_{x-4}, A_{x-3}, A_{x-2}, A_{x-1}, A_x\}$ is obtained.

[0159] Step B32: Determine the model parameters based on the first and second exponential smoothing values of the second subsequence;

[0160] Step B33: Input the model parameters into a preset second-order exponential smoothing model for prediction to obtain the predicted value;

[0161] Calculate the first and second exponential smoothing values of the second subsequence, and calculate the model parameters based on the first and second exponential smoothing values; then input the model parameters into the preset second-order exponential smoothing model for prediction to obtain the predicted value.

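[0162] For example, the second-order exponential smoothing model is $F_{t+T} = a_t + b_t T$, where $a_t$ and $b_t$ are the model parameters, $a_t = 2S_t^{(1)} - S_t^{(2)}$ and $b_t = \frac{\alpha}{1-\alpha}\left(S_t^{(1)} - S_t^{(2)}\right)$. Here $S_t^{(1)}$ and $S_t^{(2)}$ are the first and second exponential smoothing values at time $t$, $\alpha$ is the smoothing coefficient, $T$ is the time interval from $t$ to the predicted time point, and $F_{t+T}$ is the predicted value.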

[0163] Step B34: Determine whether the absolute value of the deviation ratio between the second detection point and the predicted value is greater than a preset second threshold.

[0164] Step B35: If yes, then determine that the second detection point is abnormal, and take the next time point of the second detection point as the next second detection point, and execute the step of obtaining the second sub-sequence of the preset window length;

[0165] Step B36: If not, the second detection point is determined to be normal, and the next time point of the second detection point is taken as the next second detection point, and the step of obtaining the second sub-sequence with a preset window length is executed.

[0166] Calculate the absolute value of the deviation ratio between the second detection point and the predicted value, where the deviation ratio is (second detection point - predicted value) / predicted value, and determine whether the absolute value of the deviation ratio is greater than a preset second threshold. If yes, there is a large deviation between the second detection point and the predicted value, so the second detection point is determined to be abnormal. If no, the second detection point is close to the predicted value, so the second detection point is determined to be normal. In either case, the next time point of the second detection point is taken as the next second detection point, and the step of obtaining a second subsequence of a preset window length is executed, until the second detection point is the last time point in the time series data to be detected.

[0167] Optionally, the window length can be set according to actual needs, and the second threshold can also be set according to actual needs. This embodiment does not limit this. For example, the second threshold is 0.2.
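Steps B31 to B36 can be sketched with Brown's double (second-order) exponential smoothing in its textbook form. Several details here are assumptions because the source leaves them open: the smoothing coefficient `alpha` of 0.5, initializing both smoothing values with the first window value, fitting on the window without its last point and forecasting one step ahead (T = 1), and skipping a point when the predicted value is zero. The function names are hypothetical.

```python
def double_exponential_forecast(history, alpha=0.5, horizon=1):
    """Forecast F_{t+T} = a_t + b_t * T from first and second smoothing values."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not history:
        raise ValueError("history must not be empty")
    s1 = s2 = history[0]  # assumption: both smoothing values start at the first value
    for value in history[1:]:
        s1 = alpha * value + (1 - alpha) * s1  # first exponential smoothing value
        s2 = alpha * s1 + (1 - alpha) * s2     # second exponential smoothing value
    a = 2 * s1 - s2
    b = alpha / (1 - alpha) * (s1 - s2)
    return a + b * horizon


def detect_abnormal_points(series, window_length=5, alpha=0.5, threshold=0.2):
    """Indices whose |deviation ratio| against the forecast exceeds the threshold."""
    if window_length < 2:
        raise ValueError("window_length must be at least 2")
    abnormal = []
    for x in range(window_length - 1, len(series)):
        window = series[x - window_length + 1:x + 1]
        # Assumption: predict the detection point from the points before it.
        predicted = double_exponential_forecast(window[:-1], alpha)
        if predicted == 0:
            continue  # deviation ratio undefined; assumption: skip this point
        ratio = (window[-1] - predicted) / predicted
        if abs(ratio) > threshold:
            abnormal.append(x)
    return abnormal


# Hypothetical rising series with one spike at index 7.
series = [100, 110, 120, 130, 140, 150, 160, 300, 180]
print(double_exponential_forecast([100, 110, 120, 130]))  # 135.0
print(detect_abnormal_points(series))                     # [7, 8]
```

In this example index 8 is flagged as well, because the spike at index 7 is still inside the window used to predict it. That is the kind of knock-on effect the optional correction of abnormal points is meant to avoid.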

[0168] Optionally, after determining that the second detection point is abnormal, the method further includes: correcting the second detection point ($S_t$) based on its previous time point ($S_{t-1}$), and then taking the next time point of the second detection point as the next second detection point to obtain a second subsequence of a preset window length, in order to avoid the impact of abnormal time series points on the quality detection of subsequent time series data.

[0169] In this embodiment, based on the characteristics of skewed time series data, an approximate model is fitted to predict the future. Then, by comparing the predicted value with the actual value, it can be determined whether the actual value is abnormal. Conventional first-order exponential smoothing methods are only applicable to stationary data, while skewed data with trend properties require second-order exponential smoothing methods to ensure that the predicted time series includes the trend of previous data, thereby improving the accuracy of time series data quality detection.

[0170] In one feasible implementation, step S30, the step of performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type, further includes:

[0171] Step C31: If the data type is stationary, then determine the third detection point in the time series data to be detected, and use the third detection point as the last time point to obtain the third subsequence of a preset window length.

[0172] If the data type of the time series data to be detected is stationary, then the third detection point in the time series data to be detected is determined, and the third subsequence of the preset window length is obtained with the third detection point as the last time point.

[0173] For example, the time series data to be detected is $B = \{B_1, B_2, B_3, B_4, B_5, \ldots, B_t\}$, the preset window length is 5, and the third detection point is $B_x$. Then, with $B_x$ as the last time point, the third subsequence $B' = \{B_{x-4}, B_{x-3}, B_{x-2}, B_{x-1}, B_x\}$ is obtained.

[0174] Step C32: Determine the stationary feature value of the third subsequence based on the median, median deviation, and the third detection point of the third subsequence;

[0175] Calculate the median and median deviation of the third subsequence, and determine the stationary feature value of the third subsequence based on the median, median deviation, and the third detection point.

[0176] Optionally, the median, also known as the middle value, is the number in the middle of a set of data arranged in order.

[0177] Optionally, the deviation between each time point in the third subsequence and the median is determined, and then the median of the absolute values of the deviations is determined, which is the median deviation.

[0178] Optionally, the stationary feature value = |(third detection point - median) / median deviation|.

[0179] Step C33: Determine whether the stationary feature value is greater than a preset third threshold;

[0180] Step C34: If yes, then determine that the third detection point is abnormal, and take the next time point of the third detection point as the next third detection point, and execute the step of obtaining the third sub-sequence of the preset window length;

[0181] Step C35: If not, the third detection point is determined to be normal, and the next time point of the third detection point is taken as the next third detection point, and the step of obtaining the third sub-sequence of the preset window length is executed.

[0182] Determine whether the stationary feature value is greater than a preset third threshold. If yes, it indicates that the third detection point deviates significantly, and the third detection point is determined to be abnormal. The next time point of the third detection point is then used as the next third detection point, and the step of obtaining a third sub-sequence of a preset window length is executed until the third detection point is the last time point in the time series data to be detected. If no, it indicates that the third detection point does not deviate significantly, and the third detection point is determined to be normal. The next time point of the third detection point is then used as the next third detection point, and the step of obtaining a third sub-sequence of a preset window length is executed until the third detection point is the last time point in the time series data to be detected.

[0183] For example, if the third detection point is $S_t$, the next time point $S_{t+1}$ is taken as the next third detection point, and the step of obtaining the third subsequence with a preset window length is executed.

[0184] Optionally, the window length can be set according to actual needs, and the third threshold can also be set according to actual needs. This embodiment does not limit this. For example, the third threshold is 3.5.

[0185] In one feasible implementation, after step C34, which determines that the third detection point is abnormal, the method further includes:

[0186] Step C341: Based on the previous time point of the third detection point, correct the third detection point, and execute the step of taking the next time point of the third detection point as the next third detection point to obtain a third sub-sequence of a preset window length.

[0187] The third detection point ($S_t$) is corrected according to its previous time point ($S_{t-1}$), and the next time point of the third detection point is then taken as the next third detection point to obtain a third subsequence of a preset window length, in order to avoid the impact of abnormal time series points on the quality detection of subsequent time series data.
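Step C341 can be sketched as the median / median-deviation check with an in-place correction. The source only says the abnormal point is corrected "according to the previous time point"; replacing it with the previous (already corrected) value is an assumption, as are the function name, the example data, and the `min_deviation` floor, a hypothetical guard against a zero median deviation once corrected values repeat.

```python
from statistics import median


def detect_and_correct(series, window_length=5, threshold=3.5, min_deviation=1.0):
    """Flag abnormal points and overwrite each with its previous value.

    Returns (abnormal indices, corrected copy of the series).
    """
    corrected = list(series)
    abnormal = []
    for x in range(window_length - 1, len(corrected)):
        # The window is read from the corrected data, so earlier anomalies
        # no longer influence the median and median deviation.
        window = corrected[x - window_length + 1:x + 1]
        center = median(window)
        deviation = median(abs(v - center) for v in window)
        score = abs((window[-1] - center) / max(deviation, min_deviation))
        if score > threshold:
            abnormal.append(x)
            corrected[x] = corrected[x - 1]  # assumption: copy the previous value
    return abnormal, corrected


# Hypothetical stationary series with two consecutive spikes.
series = [10, 11, 9, 10, 11, 50, 52, 10, 9]
abnormal, corrected = detect_and_correct(series)
print(abnormal)   # [5, 6]
print(corrected)  # [10, 11, 9, 10, 11, 11, 11, 10, 9]
```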

[0188] In this embodiment, based on the characteristics of stationary time series data, namely, an approximately normal distribution and small data fluctuations, the median is used instead of the conventional average, and the median deviation is used instead of the conventional variance. This avoids interference from long-term time series data on data quality detection, as well as interference from an abnormal time series point on the average and variance of the time series data, thereby improving the accuracy of time series data quality detection.

[0189] In one feasible implementation, step S30, the step of performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type, further includes:

[0190] Step D31: If the data type is irregular, then the time series data to be detected is arranged in ascending order to obtain sorted data;

[0191] If the data type of the time series data to be tested is irregular, then the time series data to be tested will be arranged in ascending order to obtain sorted data.

[0192] Step D32: Determine the time series data range based on the upper quartile, lower quartile, and quartile difference of the sorted data;

[0193] Determine the upper quartile, lower quartile, and quartile difference in the sorted data, and then determine the range of the time series data based on the upper quartile, lower quartile, and quartile difference.

[0194] Optionally, the upper quartile refers to the number at the 75% position when all data are arranged from smallest to largest; the lower quartile refers to the number at the 25% position when all data are arranged from smallest to largest; and the quartile difference (interquartile range) refers to the difference between the upper quartile and the lower quartile.

[0195] Optionally, the time series data range includes an upper limit and a lower limit, where the upper limit = upper quartile + 1.5 * quartile difference; and the lower limit = lower quartile - 1.5 * quartile difference.

[0196] Step D33: If a time series point in the time series data to be detected is outside the time series data range, then the time series point is determined to be abnormal.

[0197] Step D34: If a time series point in the time series data to be detected is within the time series data range, then the time series point is determined to be normal.

[0198] Determine whether each time series point in the time series data to be detected is within the time series data range. If a time series point is outside the time series data range, the time series point is determined to be abnormal; if a time series point is within the time series data range, the time series point is determined to be normal.
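Steps D31 to D34 can be sketched with the quartile rule from paragraph [0195]. The quartile interpolation method is an assumption (Python's `statistics.quantiles` with `method="inclusive"`), since the source does not specify how quartile positions between two data points are resolved; the function names and example data are hypothetical.

```python
from statistics import quantiles


def time_series_data_range(series, factor=1.5):
    """(lower limit, upper limit) from the quartiles of the sorted data."""
    ordered = sorted(series)  # ascending order, as in step D31
    lower_q, _, upper_q = quantiles(ordered, n=4, method="inclusive")
    quartile_difference = upper_q - lower_q
    return (lower_q - factor * quartile_difference,
            upper_q + factor * quartile_difference)


def detect_abnormal_points(series, factor=1.5):
    """Indices of time series points that fall outside the data range."""
    lower, upper = time_series_data_range(series, factor)
    return [i for i, value in enumerate(series) if value < lower or value > upper]


# Hypothetical irregular series with one extreme value at index 5.
series = [12, 15, 11, 14, 13, 90, 12, 16, 10]
print(time_series_data_range(series))  # (7.5, 19.5)
print(detect_abnormal_points(series))  # [5]
```

Here the lower and upper quartiles are 12 and 15, the quartile difference is 3, and the limits are 12 - 4.5 = 7.5 and 15 + 4.5 = 19.5.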

[0199] In this embodiment, based on the characteristics of irregular time series data, namely that irregular time series data fluctuates irregularly, the quality detection is performed by relying on the actual data of the time series data to be detected. The dependence on the distribution of the time series data to be detected is low. It reflects the original data distribution from the data itself, thereby improving the accuracy of time series data quality detection.

[0200] Furthermore, based on the first, second, and / or third embodiments described above, a fourth embodiment of the time series data quality detection method of this application is proposed, referring to Figure 6. In this embodiment, the time series data quality detection method is applied to a time series data quality detection system. The data quality detection system preprocesses each time series in the time series data, the preprocessing including one or more of interpolation, dimensionality reduction, normalization, and encoding, to ensure the real-time performance, standardization, and integrity of the time series data. It then classifies the data using a one-dimensional CNN sequence classifier to determine the data type, which is one of stationary, skewed, periodic, and irregular. Next, based on an anomaly algorithm decision tree built from expert rules, it calls the quality detection strategy adapted to each data type to perform data quality checks, i.e., anomaly diagnosis, and outputs the diagnostic results. Anomalies can be displayed and trigger warnings, and the one-dimensional CNN sequence classifier can be updated through manual labeling.

[0201] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.

[0202] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0203] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

Claims

1. A method for detecting quality of time series data, characterized in that the time series data quality detection method comprises the following steps: acquiring time series data to be detected; preprocessing the time series data to be detected, wherein the preprocessing comprises one or more of interpolation processing, dimension reduction processing, normalization processing and encoding processing; extracting multi-dimensional features from the time series data to be detected to obtain time series features, wherein the time series features comprise differential mean, window extreme difference, fitting deviation distance and fluctuation factor; determining the data type of the time series data to be detected according to the time series features by using a one-dimensional convolutional neural network, wherein the one-dimensional convolutional neural network is composed of four convolutional layers, two pooling layers, one fully connected layer and one softmax activation layer, and wherein the data type comprises periodic type, skewed type, stationary type and irregular type; performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type; when the data type is skewed type, the quality detection strategy comprises: determining a second detection point in the time series data to be detected, and acquiring a second subsequence of a preset window length with the second detection point as the last time point; determining model parameters according to the first exponential smoothing value and the second exponential smoothing value of the second subsequence; inputting the model parameters into a preset second-order exponential smoothing model for prediction to obtain a predicted value; judging whether the absolute value of the deviation ratio between the second detection point and the predicted value is greater than a preset second threshold value; if yes, determining that the second detection point is abnormal, taking the next time point of the second detection point as the next second detection point, and performing the step of acquiring the second subsequence of the preset window length; if no, determining that the second detection point is normal, taking the next time point of the second detection point as the next second detection point, and performing the step of acquiring the second subsequence of the preset window length.

2. The time series data quality detection method of claim 1, wherein the step of performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type comprises: if the data type is periodic type, acquiring a residual term of the time series data to be detected as a residual sequence; determining a first detection point in the residual sequence, and acquiring a first subsequence of a preset window length with the first detection point as the last time point; determining a periodic feature value of the first subsequence according to the median and median deviation of the first subsequence and the first detection point; judging whether the periodic feature value is greater than a preset first threshold value; if yes, determining that the first detection point is abnormal, taking the next time point of the first detection point as the next first detection point, and performing the step of acquiring the first subsequence of the preset window length; if no, determining that the first detection point is normal, taking the next time point of the first detection point as the next first detection point, and performing the step of acquiring the first subsequence of the preset window length.

3. The time series data quality detection method of claim 1, wherein the step of performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type further comprises: if the data type is stationary type, determining a third detection point in the time series data to be detected, and acquiring a third subsequence of a preset window length with the third detection point as the last time point; determining a stationary feature value of the third subsequence according to the median and median deviation of the third subsequence and the third detection point; judging whether the stationary feature value is greater than a preset third threshold value; if yes, determining that the third detection point is abnormal, taking the next time point of the third detection point as the next third detection point, and performing the step of acquiring the third subsequence of the preset window length; if no, determining that the third detection point is normal, taking the next time point of the third detection point as the next third detection point, and performing the step of acquiring the third subsequence of the preset window length.

4. The time series data quality detection method of claim 3, wherein after the step of determining that the third detection point is abnormal, the method further comprises: correcting the third detection point according to the previous time point of the third detection point, and performing the step of taking the next time point of the third detection point as the next third detection point and acquiring the third subsequence of the preset window length.

5. The time series data quality detection method of claim 1, wherein the step of performing quality detection on the time series data to be detected according to the quality detection strategy corresponding to the data type further comprises: if the data type is irregular type, arranging the time series data to be detected in ascending order to obtain sorted data; determining a time series data range according to the upper quartile, lower quartile and interquartile range of the sorted data; if a time series point in the time series data to be detected is outside the time series data range, determining that the time series point is abnormal; if a time series point in the time series data to be detected is within the time series data range, determining that the time series point is normal.

6. An electronic device, characterized in that the device comprises a memory, a processor and a time series data quality detection program stored on the memory and executable on the processor, and the time series data quality detection program is configured to implement the steps of the time series data quality detection method according to any one of claims 1 to 5.

7. A storage medium, characterized in that the storage medium stores a time series data quality detection program, and the time series data quality detection program, when executed by a processor, implements the steps of the time series data quality detection method according to any one of claims 1 to 5.