A multi-level data backup method and system

By employing a multi-level data backup method, combined with deep learning and machine learning technologies, and dynamically adjusting backup strategies, the problems of resource waste and data loss risks in existing technologies are solved, achieving efficient data backup.

CN122240398A · Pending · Publication Date: 2026-06-19 · GUANGZHOU SHANGZHIJIE NETWORK SAFETY TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GUANGZHOU SHANGZHIJIE NETWORK SAFETY TECH CO LTD
Filing Date: 2026-03-17
Publication Date: 2026-06-19

Patent Timeline

17 Mar 2026: Application
19 Jun 2026: Publication (CN122240398A)

IPC: G06F11/1446; G06F18/213; G06F18/20; G06N20/00; G06N3/096; G06F18/10; G06F18/15; G06N3/049; G06N3/045; G06N3/0442; G06N3/0464; G06N20/20; G06F18/2433; G06F18/243; G06F18/25; G06F16/906; G06F123/02

AI Tagging

Application Domain

Ensemble learning; Error detection/correction

AI Technical Summary

Technical Problem

Existing data backup systems struggle to accurately match backup strategies to the actual changes and operational status of data at different levels, leading to resource waste or an increased risk of data loss.

Method used

A multi-level data backup method is adopted, which uses a deep learning model to predict changes in data volume, a machine learning model to identify anomalies, and a business rule base to match backup instructions, thereby achieving differentiated backup processing for data at different levels.

Benefits of technology

It improves the execution efficiency of backup tasks, reduces system resource consumption, and enhances the stability and reliability of the data backup process.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a multi-level data backup method and system, relating to the field of data processing. The method includes: acquiring multi-source backup data; layering the multi-source backup data to obtain multi-level backup data and constructing a multi-level backup dataset; extracting local temporal features of the multi-level backup dataset using a deep learning model and predicting the change in data volume at each level over a preset time period; identifying known and unknown anomalies in the multi-level backup dataset using a machine learning model and determining anomaly information at each level; matching multi-level backup instructions based on the change results and anomaly information at each level using a preset layered business rule base; performing layered backup of the multi-level backup data based on the multi-level backup instructions and outputting backup execution logs. This application, used in data backup processes, solves the technical problem that existing technologies cannot improve the execution efficiency of backup tasks while reducing unnecessary resource consumption.

Description

Technical Field

[0001] This application relates to the field of data processing, and in particular to a multi-level data backup method and system.

Background Technology

[0002] Existing data backup solutions typically include full backup, incremental backup, and differential backup. In real-world large-scale information system environments, different types of data exhibit significant differences in access frequency, business importance, and growth rate. Therefore, differentiated backup strategies are necessary to ensure system reliability and the rational utilization of storage resources. However, existing data backup systems generally struggle to accurately match backup strategies to the actual changes and operational status of data at different levels. This leads to excessively high backup frequencies for some data, wasting storage and computing resources, or untimely backups of critical data, increasing the risk of data loss. The reasons for this problem are twofold: firstly, existing systems typically treat backup objects as a whole, lacking systematic analysis of the characteristics of different data levels, making it difficult to finely match backup strategies to different data types; secondly, data scale and business access behavior exhibit significant temporal variations. For example, peak business access, centralized data writing, or batch processing tasks can cause short-term fluctuations in data volume and system load, while traditional backup mechanisms often rely on fixed periods or static rules, making it difficult to reflect these changes in a timely manner. Therefore, how to formulate an effective backup strategy and improve the execution efficiency of backup tasks while reducing unnecessary resource consumption is a technical problem that urgently needs to be solved.

Summary of the Invention

[0003] This application provides a multi-level data backup method and system, which solves the technical problem that existing technologies cannot improve the execution efficiency of backup tasks while reducing unnecessary resource consumption.

[0004] To achieve the above objectives, this application adopts the following technical solution: In a first aspect, a multi-level data backup method is provided, comprising: acquiring multi-source backup data, including: hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data; hierarchically dividing the multi-source backup data to obtain multi-level backup data and constructing a multi-level backup dataset; extracting local temporal features of the multi-level backup dataset using a deep learning model and predicting the change in data volume at each level over a preset time period; identifying known and unknown anomalies in the multi-level backup dataset using a machine learning model and determining the anomaly information at each level; matching multi-level backup instructions based on the change results and the anomaly information at each level using a preset hierarchical business rule base; and backing up the multi-level backup data hierarchically based on the multi-level backup instructions and outputting backup execution logs.

[0005] In conjunction with the first aspect mentioned above, in one possible implementation, after backing up the multi-level backup data in layers, the method further includes: comparing the number of bytes and data entries of the successfully backed-up data at each level with the multi-level backup data before backup, calculating the data restoration rate, and determining that the integrity verification has passed when the data restoration rate reaches a preset threshold; verifying the foreign key constraints of relational data in the successfully backed-up data at each level, calculating the logical consistency pass rate, and determining that the logical consistency verification has passed when the logical consistency pass rate reaches a preset standard; performing random read tests on the multi-level backup data, determining that the storage validity verification has passed when the read is successful, and determining the storage validity verification result; when the data integrity verification, logical consistency verification, and storage validity verification all pass, outputting the multi-level backup verification result, and identifying and classifying the corresponding backup data for storage.
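As an illustrative sketch only, the three verification checks described above could be combined as follows. The names `LevelStats`, `restoration_rate`, `consistency_pass_rate`, and `verify_level` are hypothetical, as are the 0.999 thresholds; taking the smaller of the byte ratio and the entry ratio as the data restoration rate is an assumption, since the application does not give the formula.

```python
from dataclasses import dataclass


@dataclass
class LevelStats:
    """Byte and entry counts for one data level (hypothetical structure)."""
    bytes_count: int
    entry_count: int


def restoration_rate(source: LevelStats, backup: LevelStats) -> float:
    """Assumed definition: the smaller of the byte ratio and the entry ratio."""
    if source.bytes_count == 0 or source.entry_count == 0:
        return 1.0  # nothing to back up, treated as fully restored
    byte_ratio = min(backup.bytes_count / source.bytes_count, 1.0)
    entry_ratio = min(backup.entry_count / source.entry_count, 1.0)
    return min(byte_ratio, entry_ratio)


def consistency_pass_rate(child_keys, parent_keys) -> float:
    """Share of foreign-key values that resolve to an existing parent key."""
    child_keys = list(child_keys)
    if not child_keys:
        return 1.0
    parents = set(parent_keys)
    return sum(1 for k in child_keys if k in parents) / len(child_keys)


def verify_level(source, backup, child_keys, parent_keys, read_ok,
                 rate_threshold=0.999, consistency_threshold=0.999) -> dict:
    """Run the three checks; the level is verified only if all of them pass."""
    rate = restoration_rate(source, backup)
    consistency = consistency_pass_rate(child_keys, parent_keys)
    result = {
        "restoration_rate": rate,
        "integrity_passed": rate >= rate_threshold,
        "consistency_pass_rate": consistency,
        "consistency_passed": consistency >= consistency_threshold,
        "storage_valid": bool(read_ok),  # outcome of the random read test
    }
    result["verified"] = (result["integrity_passed"]
                          and result["consistency_passed"]
                          and result["storage_valid"])
    return result
```

Here `read_ok` stands in for the outcome of the random read test, which in practice would be produced by reading sampled blocks back from the backup medium.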

[0006] In conjunction with the first aspect mentioned above, in one possible implementation, after the corresponding backup data is identified and categorized for storage, the method further includes: collecting multi-level backup execution process data and multi-level backup verification results; the multi-level backup execution process data includes backup execution time, resource utilization rate, data transmission rate, exception handling records, and task completion status; the backup verification results include data restoration rate, logical consistency pass rate, and storage validity verification results; based on the multi-level backup execution process data, backup verification results, and multi-level backup data, constructing a full-process running dataset; inputting the full-process running dataset into a deep learning model and a machine learning model for incremental training, updating the model parameters of the deep learning model and the machine learning model, and obtaining optimized model parameters; based on the full-process running dataset, adjusting or supplementing the threshold parameters and matching conditions in the preset hierarchical business rule base, obtaining an updated preset hierarchical business rule base; and storing the optimized model parameters and the updated preset hierarchical business rule base for use in the next cycle of multi-level data backup processing.

[0007] In conjunction with the first aspect mentioned above, one possible implementation involves layering multi-source backup data to obtain multi-level backup data and constructing a multi-level backup dataset. This includes: dividing the multi-source backup data into real-time layer data, near-line layer data, and offline layer data according to preset data layering rules; wherein the preset data layering rules are determined based on data access frequency, business importance level, RTO threshold, and RPO threshold; and performing missing value processing and outlier processing on each layer of data; specifically, for real-time layer data, missing values are filled using a mean imputation method based on adjacent preset time windows, and data deviating from the normal range by a first preset proportion are removed; for near-line layer data, missing values are filled using a daily average imputation method, and data deviating from the normal range by a first preset proportion are removed. Data deviating from the normal range by a second preset proportion is filtered; missing values are filled in for offline layer data using a weekly average-based imputation method, and missing data in archives are marked with missing tags; data at each level is deduplicated based on timestamps and data identifiers, and the hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data are associated using timestamps as the association key to form a multi-dimensional data structure containing time dimension features, business dimension features, system dimension features, hierarchy dimension features, and backup dimension features; the integrated data at each level is converted into a unified columnar storage format, and hierarchy labels and feature labels are added to each data entry to generate real-time layer datasets, near-line layer datasets, and offline layer datasets, forming a multi-level backup dataset.
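The mean imputation over adjacent time windows mentioned above for real-time layer data might look like the following minimal sketch. The function name and the `window` parameter are hypothetical, and using only originally observed neighbours (not previously filled values) is an assumption.

```python
from typing import List, Optional


def fill_missing_with_window_mean(values: List[Optional[float]],
                                  window: int = 2) -> List[Optional[float]]:
    """Fill None entries with the mean of observed neighbours within `window` steps.

    Entries with no observed neighbour in range are left as None.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbours = [x for x in values[lo:hi] if x is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled
```

The daily and weekly average methods described for the near-line and offline layers could reuse the same shape with a window measured in days or weeks.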

[0008] In conjunction with the first aspect mentioned above, one possible implementation involves using a deep learning model to extract local temporal features from a multi-level backup dataset and predict the changes in data volume at each level over a preset time period. This includes: constructing features for the time-series data in the multi-level backup dataset, using data growth, data change frequency, peak business access, and system resource usage as input features, and serializing the data at each level according to a preset time window to obtain serialized data at each level; inputting the serialized data at each level into a TCN network, extracting local burst change features and short-period fluctuation features through a multi-layer dilated convolutional structure to obtain a first temporal feature representation; inputting the first temporal feature representation into an LSTM network, extracting long-period dependencies and cross-time period association features through recurrent units to obtain a second temporal feature representation; performing a fully connected mapping on the second temporal feature representation to output predicted values for data growth, data change frequency, and peak business access at each level within a preset time period; calculating the error based on historical real data, and outputting the changes in data volume at each level when the error meets a preset threshold condition.
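Serializing each level's data by a preset time window, as described above, can be sketched as a sliding-window split. The function name `make_windows` and the `horizon` parameter are assumptions for illustration; each row is taken to hold the input features (for example data growth, change frequency, peak access, and resource usage) for one time step.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[Sequence[float]], window: int, horizon: int = 1
                 ) -> List[Tuple[List[List[float]], List[float]]]:
    """Split a multivariate series into (input window, target row) pairs.

    The target is the row `horizon` steps after the end of the input window.
    """
    samples = []
    for start in range(len(series) - window - horizon + 1):
        x = [list(row) for row in series[start:start + window]]
        y = list(series[start + window + horizon - 1])
        samples.append((x, y))
    return samples
```

For example, a four-step series with `window=2` yields two samples, each pairing two consecutive rows with the row that follows them.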

[0009] In conjunction with the first aspect mentioned above, in one possible implementation, a machine learning model is used to identify known and unknown anomalies in a multi-level backup dataset and determine the anomaly information for each level. This includes: extracting anomaly judgment features from the multi-level backup dataset, including system resource utilization indicators, data change frequency indicators, synchronization latency indicators, storage read / write performance indicators, and historical fault marking information; inputting the anomaly judgment features into an XGBoost model to determine the known anomaly categories and corresponding anomaly probability values, and identifying known anomalies; wherein, the XGBoost model is trained using historically labeled fault sample data; inputting the anomaly judgment features into an Isolation Forest model, calculating the path length of samples by constructing multiple random subsampling trees to obtain anomaly score values, and identifying unknown anomalies not appearing in historical samples; fusing and judging known and unknown anomalies, and determining the corresponding data as anomalous data when the anomaly probability value or anomaly score value reaches a preset threshold; and generating anomaly information for each level based on the anomalous data, including anomaly type, anomaly level, anomaly impact range, and anomaly occurrence time identifier, and outputting the corresponding anomaly results for real-time, near-line, and offline layers respectively.

[0010] In conjunction with the first aspect mentioned above, in one possible implementation, based on the change results and anomaly information at each level, a multi-level backup command is matched using a pre-set hierarchical business rule base. This includes: retrieving the corresponding real-time layer rule table, near-line layer rule table, and offline layer rule table from the pre-set hierarchical business rule base. Each rule table includes data increment threshold conditions, anomaly status conditions, backup type parameters, resource scheduling thresholds, and anomaly response action parameters. The predicted data growth rate, predicted data change frequency, and predicted peak business access value from the change results are compared and matched with the data increment threshold conditions in the corresponding hierarchical rule tables. The anomaly type, anomaly level, and anomaly impact range in each level of anomaly information are matched with the anomaly status conditions in the corresponding level's rule table. When both the change result and the anomaly information meet the corresponding rule conditions, the backup type, execution time parameters, resource allocation parameters, and anomaly handling action parameters corresponding to that rule are determined, and corresponding multi-level backup instructions are generated for real-time, near-line, and offline layers respectively. The backup instructions include operation type fields, execution time fields, resource scheduling fields, and anomaly response fields. When no matching rule is found, the default backup instruction is generated by calling the preset basic backup rule of the corresponding level.

[0011] In conjunction with the first aspect mentioned above, in one possible implementation, multi-level backup data is backed up layer by layer based on multi-level backup instructions, and backup execution logs are output. This includes: performing real-time data synchronization backup via API calls according to real-time layer backup instructions; performing incremental or differential backup and storage resource scheduling operations via scripts according to near-line layer backup instructions; performing data writing and copy generation operations on offline storage media via device control instructions according to offline layer backup instructions; and recording task execution time, execution device, resource utilization, execution results, and exception information during the backup execution process at each level, thereby generating a multi-level backup execution log.

[0012] In a second aspect, a multi-level data backup system is provided to implement any of the methods in the first aspect. The system includes: a data acquisition device and an electronic device; wherein, the data acquisition device is used to acquire multi-source backup data, including: hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data; the electronic device is used to hierarchically process the multi-source backup data to obtain multi-level backup data and construct a multi-level backup dataset; through a deep learning model, local temporal features of the multi-level backup dataset are extracted, and the change results of the data volume of each level within a preset time period are predicted; through a machine learning model, known and unknown anomalies in the multi-level backup dataset are identified, and the anomaly information of each level is determined; based on the change results and the anomaly information of each level, multi-level backup instructions are matched through a preset hierarchical business rule base; based on the multi-level backup instructions, the multi-level backup data is backed up hierarchically, and a backup execution log is output.

[0013] In conjunction with the second aspect mentioned above, in one possible implementation, the electronic device is further used to: compare the number of bytes and data entries of the successfully backed-up data at each level with the multi-level backup data before backup, calculate the data restoration rate, and determine that the integrity verification has passed when the data restoration rate reaches a preset threshold; verify the foreign key constraints of relational data in the successfully backed-up data at each level, calculate the logical consistency pass rate, and determine that the logical consistency verification has passed when the logical consistency pass rate reaches a preset standard; perform random read tests on the multi-level backup data, determine that the storage validity verification has passed when the read is successful, and determine the storage validity verification result; when the data integrity verification, logical consistency verification, and storage validity verification all pass, output the multi-level backup verification result, and perform result identification and classification storage on the corresponding backup data.

[0014] This application provides a multi-level data backup method and system. By constructing a multi-level data backup processing mechanism, it integrates multi-source backup data in layers and combines a deep learning model to predict changes in data volume at each level. Simultaneously, it utilizes a machine learning model to identify known and unknown anomalies. Based on this, it matches corresponding backup instructions through a hierarchical business rule base, achieving differentiated backup processing for data at different levels. Compared to traditional unified backup methods, this invention can dynamically adjust backup strategies according to data changes and system operating status, thereby improving the execution efficiency of backup tasks, reducing system resource consumption, and enhancing the stability and reliability of the data backup process. It solves the technical problem that existing technologies cannot improve the execution efficiency of backup tasks while reducing unnecessary resource consumption.

Attached Figure Description

[0015] Figure 1 is a system architecture diagram of a multi-level data backup system provided in an embodiment of this application;
Figure 2 is a flowchart of a multi-level data backup method provided in an embodiment of this application;
Figure 3 is a flowchart of another multi-level data backup method provided in an embodiment of this application;
Figure 4 is a flowchart of another multi-level data backup method provided in an embodiment of this application;
Figure 5 is a flowchart of another multi-level data backup method provided in an embodiment of this application.

Detailed Implementation

[0016] In the description of this application, unless otherwise stated, " / " means "or," for example, A / B can mean A or B. The "and / or" in this document is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, and B alone. Furthermore, "at least one" means one or more, and "multiple" means two or more. The terms "first," "second," etc., do not limit the quantity or order of execution, and "first," "second," etc., do not necessarily imply differences.

[0017] It should be noted that, in this application, the terms "exemplary" or "for example" are used to indicate that something is being described as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as being more preferred or advantageous than other embodiments or design solutions. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a concrete manner.

[0018] The multi-level data backup method provided in this application can be applied to, for example, the multi-level data backup system shown in Figure 1. As shown in Figure 1, the system includes: a data acquisition device 101 and an electronic device 102.

[0019] The data acquisition device 101 is used to acquire multi-source backup data, including: hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data. The electronic device 102 is used to layer the multi-source backup data to obtain multi-level backup data and construct a multi-level backup dataset; it uses a deep learning model to extract local temporal features of the multi-level backup dataset and predict the change in data volume of each level within a preset time period; it uses a machine learning model to identify known and unknown anomalies in the multi-level backup dataset and determine the anomaly information of each level; based on the change results and the anomaly information of each level, it matches multi-level backup instructions through a preset hierarchical business rule base; based on the multi-level backup instructions, it performs layered backup of the multi-level backup data and outputs backup execution logs.

[0020] To address the technical problem that existing technologies cannot improve the execution efficiency of backup tasks while reducing unnecessary resource consumption, this application provides a multi-level data backup method.

[0021] Figure 2 is a flowchart of the multi-level data backup method provided in an embodiment of this application. As shown in Figure 2, the method includes: S201. Obtain multi-source backup data.

[0022] Multi-source backup data includes hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data. Hierarchical backup historical data refers to the record data generated by data backup tasks at each level within a historical period, including backup time, backup type, backup data volume, and backup success status. Business scenario data refers to information related to business operations, such as access logs, business request volume, and data change records. System status data refers to the operational status information during backup execution, such as server CPU utilization, memory usage, network bandwidth utilization, and storage read / write performance indicators. Data hierarchy attribute data refers to attribute information describing the data's hierarchy, access frequency, business importance level, and data recovery requirements.

[0023] In one possible implementation, data acquisition devices deployed in the business system, database system, and backup management platform collect corresponding data source information and assign a unified timestamp to the collected data. The data is then categorized and stored according to its source type: historical backup records are extracted from the backup management system logs, business scenario data is extracted from business access logs and data change logs, system status data is obtained from the system monitoring interface, and data hierarchy attribute data is retrieved from the data asset management system. Finally, the aforementioned multi-source data is integrated according to a unified data structure to form a multi-source backup data set, which is then output to the subsequent processing module.

[0024] This step involves uniformly collecting and integrating data source information from different systems, enabling backup strategy analysis to simultaneously consider business access behavior, system operating status, and historical backup status, thereby improving the accuracy of subsequent data analysis and decision-making processes.

[0025] S202. Layer the multi-source backup data to obtain multi-level backup data, and construct a multi-level backup dataset.

[0026] Multi-level backup data refers to a collection of data at different levels, which is divided according to indicators such as data access frequency, business importance and data recovery needs. It includes real-time data, near-line data and offline data.

[0027] In one possible implementation, data objects in multi-source backup data are hierarchically classified according to a preset data hierarchy division rule. Data with high access frequency and strict recovery time requirements are classified as real-time layer data, data with medium access frequency and moderate recovery time requirements are classified as near-line layer data, and data with low access frequency and mainly used for archiving are classified as offline layer data. Missing value completion, outlier removal, and duplicate data removal operations are performed on the data at each level, and the data of each type are associated and integrated by timestamps. Finally, the processed data at each level are organized according to a unified field structure, and real-time layer dataset, near-line layer dataset, and offline layer dataset are generated respectively to form a multi-level backup dataset.
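A preset data hierarchy division rule of the kind described above could be sketched as follows. The `DataObject` fields, the `classify_tier` function, and every numeric threshold are hypothetical; the application only states that the rule depends on access frequency, business importance level, and recovery requirements.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    name: str
    daily_accesses: int   # access frequency
    importance: int       # business importance level, 1 (low) to 3 (high)
    rto_minutes: float    # recovery time objective
    rpo_minutes: float    # recovery point objective


def classify_tier(obj: DataObject) -> str:
    """Return 'real-time', 'near-line', or 'offline' using illustrative thresholds."""
    # High access frequency with strict recovery requirements.
    if (obj.daily_accesses >= 1000 and obj.importance >= 3
            and obj.rto_minutes <= 15 and obj.rpo_minutes <= 5):
        return "real-time"
    # Medium access frequency with moderate recovery requirements.
    if obj.daily_accesses >= 10 and obj.rto_minutes <= 24 * 60:
        return "near-line"
    # Low access frequency, mainly archival data.
    return "offline"
```

In a deployment the thresholds would be read from configuration so that the rule can be tuned per business system.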

[0028] This step involves structuring and layering multi-source backup data, enabling data of different importance and access characteristics to be managed separately, thereby improving the targeting of subsequent backup strategy matching and resource scheduling.

[0029] S203. Using a deep learning model, extract local temporal features of multi-level backup datasets and predict the changes in data volume at each level over a preset time period.

[0030] The deep learning model is a TCN-LSTM combination model, where the TCN network is used to extract local change features in the time series, and the LSTM network is used to extract long-term time dependency features.

[0031] In one possible implementation, time-related feature data, including data growth, data change frequency, peak business access, and system resource utilization, is extracted from a multi-level backup dataset. Time series samples are then constructed according to a preset time window. The time series samples are input into a TCN network, where short-cycle change features are extracted through a multi-level dilated convolutional structure to obtain a first time series feature representation. Next, the first time series feature representation is input into an LSTM network, where long-term time dependencies are modeled through recurrent units to obtain a second time series feature representation. Finally, the second time series feature is mapped through a fully connected layer to output the predicted data growth and data change trends for each level within a preset future time period, thus forming the data volume change results for each level.
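To make the dilated convolution step more concrete, the sketch below shows the core operation of one TCN layer in plain Python. It is not the model of this application: a real TCN would use learned weights, several channels, and residual connections in a deep learning framework. The function names are hypothetical, and the receptive-field formula assumes one convolution per layer.

```python
from typing import List


def dilated_causal_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples contribute
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of time steps visible after stacking one conv layer per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

Stacking layers with dilations 1, 2, and 4 and a kernel of size 2 lets each output see 8 time steps, which is how a multi-layer dilated structure can cover short-period fluctuations while keeping the layer count small.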

[0032] It should be noted that historical backup data and real data growth records should be used as training samples during model training, and the prediction results should be evaluated through an error function to ensure that the model has stable predictive capabilities.
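The error evaluation could, for instance, gate the forecast on a simple error metric. The application does not name the error function, so the mean absolute percentage error used here, the function names, and the 0.1 default threshold are all assumptions.

```python
def mean_absolute_percentage_error(actual, predicted) -> float:
    """Average of |actual - predicted| / |actual| over non-zero actual values."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to compare against")
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def accept_forecast(actual, predicted, max_error: float = 0.1) -> bool:
    """Accept the forecast only when the error is within the preset threshold."""
    return mean_absolute_percentage_error(actual, predicted) <= max_error
```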

[0033] This step enables the system to identify potential data growth peaks in advance by predicting data change trends, thus providing a basis for selecting subsequent backup strategies.

[0034] S204. Using machine learning models, identify known and unknown anomalies in multi-level backup datasets and determine the anomaly information for each level.

[0035] The machine learning model is a combination of XGBoost and Isolation Forest, where XGBoost is used to identify known anomaly types and Isolation Forest is used to identify unknown anomalies.

[0036] In one possible implementation, anomaly detection features are extracted from a multi-level backup dataset, including system resource utilization metrics, data change frequency metrics, synchronization latency metrics, and storage read / write performance metrics. The extracted features are then input into an XGBoost model, which calculates the probability of each sample belonging to different anomaly categories and identifies anomaly types that already exist in historical samples. Simultaneously, the same feature data is input into an Isolation Forest model, which calculates the sample path length using random subsampling trees and obtains anomaly scores, thereby identifying unknown anomalies deviating from the normal data distribution. The outputs of the two models are then fused for judgment; when the anomaly probability value or anomaly score reaches a preset threshold, the data is determined to be anomalous. Finally, anomaly information for the corresponding level is generated based on the anomalous data.
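The fusion judgment can be sketched as a simple either-or rule over the two model outputs. The names `AnomalyResult` and `fuse` and both threshold values are hypothetical, and the sketch assumes that a higher anomaly score means a more anomalous sample. Some libraries use the opposite sign convention (scikit-learn's `IsolationForest.score_samples` returns lower values for more abnormal samples), so a score from such a library would need to be negated or rescaled first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnomalyResult:
    is_anomalous: bool
    kind: Optional[str]  # "known", "unknown", or None


def fuse(known_probability: float, unknown_score: float,
         prob_threshold: float = 0.8, score_threshold: float = 0.6) -> AnomalyResult:
    """Flag data as anomalous when either model output reaches its threshold."""
    if known_probability >= prob_threshold:
        return AnomalyResult(True, "known")      # classifier recognises a known fault type
    if unknown_score >= score_threshold:
        return AnomalyResult(True, "unknown")    # outlier not seen in labeled history
    return AnomalyResult(False, None)
```

Checking the known-anomaly probability first is a design choice made here so that a sample matching a labeled fault type is reported with that type instead of as an unknown outlier.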

[0037] This step can automatically identify abnormal situations in the backup environment, thereby avoiding the execution of unreasonable backup strategies under abnormal conditions.

[0038] S205. Based on the change results and the anomaly information at each level, multi-level backup instructions are matched through a preset hierarchical business rule base.

[0039] The hierarchical business rule base is a set of backup strategies that should be executed for different data levels under different operating conditions. It is a pre-configured set of condition-action mapping rules based on different data levels (real-time, near-line, offline) and different operating conditions (data growth, anomalies, system resource status, etc.), used to automatically generate corresponding backup strategy parameters when specific conditions are met. The rule base is usually stored in the form of a structured rule table or a strategy configuration table. Each rule includes at least a trigger condition field, a backup strategy field, and a resource scheduling field. Rules in the preset hierarchical business rule base are built according to actual task requirements and can be added, deleted, modified, and queried at any time. Table 1 below shows some examples of rules in the preset hierarchical business rule base.

[0040] Table 1. Example of Rules

In one possible implementation, the system retrieves the corresponding real-time, near-line, and offline rule tables from the preset hierarchical business rule base. Each rule table includes data increment threshold conditions, anomaly state conditions, backup type parameters, resource scheduling thresholds, and anomaly response action parameters. The system compares the predicted data growth, data change frequency, and peak business access values from the change results against the data increment threshold conditions in the rule table of the corresponding level. It then matches the anomaly type, anomaly level, and anomaly impact range in each level's anomaly information against the anomaly state conditions in the same rule table. When the change results and the anomaly information both satisfy the conditions of a rule, the system determines the backup type, execution time parameters, resource allocation parameters, and anomaly handling action parameters corresponding to that rule, and generates the corresponding multi-level backup instructions for the real-time, near-line, and offline layers. These backup instructions include operation type fields, execution time fields, resource scheduling fields, and anomaly response fields. When no matching rule is found, the system calls the predefined basic backup rule for the corresponding level to generate a default backup instruction.
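As an illustrative, non-limiting sketch, the condition-action matching with a default fallback could look like the following. The `Rule` fields, the `match_rule` helper, the growth unit, and the default backup types are hypothetical names and values chosen for illustration; they are not taken from Table 1 or from the rule base of this application.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    level: str                # "real-time", "near-line", or "offline"
    min_growth: float         # data increment threshold condition (hypothetical unit: GB)
    anomaly_types: frozenset  # anomaly state condition: anomaly types this rule covers
    backup_type: str          # backup type parameter produced when the rule fires


# Hypothetical basic (default) rule for each level, used when nothing matches.
DEFAULT_BACKUP_TYPE = {
    "real-time": "continuous_sync",
    "near-line": "incremental",
    "offline": "full_archive",
}


def match_rule(rules, level, predicted_growth, anomaly_type):
    """Return the backup type of the first rule whose growth condition AND
    anomaly condition are both satisfied; otherwise fall back to the default."""
    for rule in rules:
        if rule.level != level:
            continue
        growth_ok = predicted_growth >= rule.min_growth
        anomaly_ok = anomaly_type in rule.anomaly_types
        if growth_ok and anomaly_ok:
            return rule.backup_type
    return DEFAULT_BACKUP_TYPE[level]
```

In this sketch a rule fires only when both conditions hold, which mirrors the requirement that the change results and the anomaly information satisfy the rule at the same time.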

[0041] Based on the above steps, this step can automatically select a suitable backup strategy according to data change trends and system operating status, thereby improving the flexibility of backup task execution.

[0042] S206. Based on multi-level backup instructions, perform layered backup of multi-level backup data and output backup execution logs.

[0043] In one possible implementation, the system parses the operation type, execution time, and resource scheduling parameters in the multi-level backup instructions and schedules the corresponding backup execution operations according to the data level. Based on the real-time layer backup instructions, real-time data synchronization backup is performed via API calls; based on the near-line layer backup instructions, incremental or differential backup and storage resource scheduling operations are performed via scripts; based on the offline layer backup instructions, data writing and copy generation operations on offline storage media are performed via device control commands. During the backup execution at each level, the system records task execution time, execution device, resource utilization, execution results, and exception information, generating a multi-level backup execution log.
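A minimal sketch of the per-level dispatch and logging is given below for illustration. The three handler functions are stand-ins for the API call, script, and device control command mentioned above; their names, the instruction keys (`level`, `operation_type`), and the log entry fields are assumptions made for this example only.

```python
from datetime import datetime, timezone


def sync_via_api(instruction):
    """Stand-in for a real-time synchronization API call."""
    return "ok"


def run_backup_script(instruction):
    """Stand-in for an incremental or differential backup script."""
    return "ok"


def write_offline_media(instruction):
    """Stand-in for device control commands writing to offline media."""
    return "ok"


HANDLERS = {
    "real-time": sync_via_api,
    "near-line": run_backup_script,
    "offline": write_offline_media,
}


def execute(instructions):
    """Run each instruction with the handler of its level and log the outcome."""
    log = []
    for instruction in instructions:
        entry = {
            "level": instruction["level"],
            "operation": instruction["operation_type"],
            "started_at": datetime.now(timezone.utc).isoformat(),
        }
        try:
            entry["result"] = HANDLERS[instruction["level"]](instruction)
            entry["exception"] = None
        except Exception as exc:  # record the failure instead of aborting the batch
            entry["result"] = "failed"
            entry["exception"] = repr(exc)
        log.append(entry)
    return log
```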

[0044] As an example, in an enterprise data center, real-time data can be synchronized to backup nodes in real time via a database replication interface, while offline data can be periodically written to a tape library for long-term archiving.

[0045] Based on the above steps, this step can perform corresponding backup operations according to different data levels and record the backup execution process completely, thereby facilitating subsequent management of backup tasks.

[0046] In one possible implementation, after backing up the multi-level backup data in layers, the method further includes: comparing the number of bytes and the number of data entries of the successfully backed-up data at each level with those of the multi-level backup data before backup, calculating the data restoration rate, and determining that the integrity verification has passed when the data restoration rate reaches a preset threshold; verifying the foreign key constraints of relational data in the successfully backed-up data at each level, calculating the logical consistency pass rate, and determining that the logical consistency verification has passed when the logical consistency pass rate reaches a preset standard; performing random read tests on the multi-level backup data, and determining that the storage validity verification has passed when the reads succeed; and, when the data integrity verification, logical consistency verification, and storage validity verification all pass, outputting the multi-level backup verification result, and identifying and classifying the corresponding backup data for storage.
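The three checks above can be sketched as follows. This is an illustration under stated assumptions: the text does not specify how the byte count and entry count are combined into one restoration rate, so the sketch takes the smaller of the two ratios; the function name, parameter names, and default thresholds are likewise hypothetical.

```python
def verify_backup(src_bytes, src_rows, bak_bytes, bak_rows,
                  fk_checked, fk_passed, read_ok,
                  restore_threshold=0.999, consistency_threshold=1.0):
    """Combine integrity, logical consistency, and storage validity checks."""
    # Assumption: restoration rate = the weaker of the byte ratio and the row ratio.
    restoration_rate = min(bak_bytes / src_bytes, bak_rows / src_rows)
    # Share of foreign key checks that passed (1.0 when there is nothing to check).
    consistency_rate = fk_passed / fk_checked if fk_checked else 1.0

    result = {
        "restoration_rate": restoration_rate,
        "integrity_passed": restoration_rate >= restore_threshold,
        "consistency_rate": consistency_rate,
        "consistency_passed": consistency_rate >= consistency_threshold,
        "storage_valid": bool(read_ok),  # outcome of the random read test
    }
    # The overall verification passes only when all three checks pass.
    result["verified"] = (
        result["integrity_passed"]
        and result["consistency_passed"]
        and result["storage_valid"]
    )
    return result
```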

[0047] It should be noted that after the corresponding backup data is identified and categorized for storage, multi-level backup execution process data and multi-level backup verification results are collected. The multi-level backup execution process data includes backup execution time, resource utilization, data transfer rate, anomaly handling records, and task completion status. The multi-level backup verification results include the data restoration rate, the logical consistency pass rate, and the storage validity verification result. Based on the multi-level backup execution process data, the multi-level backup verification results, and the multi-level backup data, a full-process runtime dataset is constructed. This full-process runtime dataset is then input into the deep learning model and the machine learning model for incremental training, updating the parameters of both models to obtain optimized model parameters. Based on the full-process runtime dataset, the threshold parameters and matching conditions in the preset hierarchical business rule base are adjusted or supplemented to obtain an updated preset hierarchical business rule base. The optimized model parameters and the updated preset hierarchical business rule base are then persisted for use in the next cycle of multi-level data backup processing.

[0048] This embodiment of the application acquires multi-source backup data and divides the data into hierarchical levels to construct a multi-level backup dataset. It then uses a deep learning model to predict the data volume change trend at each level and a machine learning model to identify possible anomalies at each level. Based on the change results and the anomaly information, it matches the corresponding backup instructions through a preset hierarchical business rule base. Finally, it performs differentiated backup operations according to the data levels and generates backup execution logs. This enables dynamic adjustment of backup strategies based on data characteristics and operating status, improving the targeting and execution efficiency of backup tasks. It thereby addresses the technical problem in the prior art that the execution efficiency of backup tasks could not be improved while also reducing unnecessary resource consumption.

[0049] In one possible implementation of the embodiments of this application, with reference to Figure 2 and as shown in Figure 3, the above S202 can be implemented through the following S301 to S304, which are explained in detail below:

S301. According to the preset data hierarchy division rules, the multi-source backup data is divided into real-time layer data, near-line layer data, and offline layer data.

[0050] The preset data hierarchy classification rules are determined based on data access frequency, business importance level, RTO threshold, and RPO threshold. Data access frequency refers to the number of times data is read or written by the business system per unit time; business importance level is a classification based on the degree of dependence of the business system on the data; RTO threshold (Recovery Time Objective) represents the maximum allowable recovery time after a system failure; and RPO threshold (Recovery Point Objective) represents the allowable data loss time range after a system failure.

[0051] In one possible implementation, business access logs, data modification records, and system operation monitoring information are obtained from the multi-source backup data, and the access frequency and change frequency per unit time are calculated for each type of data. The data importance level defined in the business configuration file is read, and a composite score is computed together with the RTO and RPO thresholds corresponding to each type of data. According to the preset data hierarchy division rules, when the data access frequency is high and the RTO and RPO requirements are strict, the corresponding data is classified as real-time layer data; when the access frequency is medium and the recovery requirements are moderate, it is classified as near-line layer data; when the access frequency is low and the recovery time requirements are lenient, it is classified as offline layer data, thereby completing the hierarchical division of the multi-source backup data.
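By way of a simplified, non-limiting example, the division rule could be expressed as a threshold function. All numeric thresholds and the `classify_tier` name are hypothetical, and the composite score that also weighs the business importance level is omitted here for brevity.

```python
def classify_tier(access_per_hour, rto_minutes, rpo_minutes):
    """Assign a data category to a layer from its access frequency and
    recovery objectives. Threshold values are illustrative assumptions."""
    # High access frequency and strict RTO/RPO: real-time layer.
    if access_per_hour >= 1000 and rto_minutes <= 15 and rpo_minutes <= 5:
        return "real-time"
    # Medium access frequency and moderate recovery requirements: near-line layer.
    if access_per_hour >= 10 and rto_minutes <= 240:
        return "near-line"
    # Low access frequency and lenient recovery requirements: offline layer.
    return "offline"
```

Because the rules are plain data-driven comparisons, the thresholds could be read from a policy file and adjusted at run time rather than hard-coded.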

[0052] It should be noted that the preset data hierarchy division rules can be pre-configured in the system policy file and can be dynamically adjusted according to the operation of the business system to avoid the problem of unreasonable data hierarchy division due to fixed rules.

[0053] As an example, in an enterprise business system, order transaction records can be classified as real-time data due to their high access frequency and strict requirements for recovery time; user historical behavior logs can be classified as near-line data; and long-term archived audit logs or historical statistical reports can be classified as offline data.

[0054] Based on the above steps, this step categorizes data by combining data access characteristics and business recovery needs, enabling different types of data to be processed with differentiated strategies during subsequent backups, thereby improving the targeting of data backup management and the efficiency of resource utilization.

[0055] S302. Perform missing value processing and outlier processing on data at each level.

[0056] Specifically, for real-time layer data, missing values are filled using the mean of adjacent preset time windows, and data that deviates from the normal range by a first preset proportion is removed; for near-line layer data, missing values are filled using the daily mean, and data that deviates from the normal range by a second preset proportion is filtered; for offline layer data, missing values are filled using the weekly mean, and missing data in archived categories are marked with missing tags.
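The following sketch illustrates the real-time layer treatment under simple assumptions: a missing value is represented by `None`, the "adjacent preset time window" is taken to be a fixed number of neighbouring positions, and the "normal range" is given as explicit bounds. The function names and parameters are hypothetical.

```python
from statistics import mean


def fill_missing(values, window=2):
    """Replace each None with the mean of the non-missing values found within
    `window` positions on either side; leave it as None if there are none."""
    filled = []
    for i, value in enumerate(values):
        if value is None:
            neighbours = [
                x for x in values[max(0, i - window): i + window + 1]
                if x is not None
            ]
            value = mean(neighbours) if neighbours else None
        filled.append(value)
    return filled


def drop_outliers(values, low, high, proportion):
    """Remove values that fall outside the normal range [low, high] by more
    than `proportion` of the range width. None entries are removed as well."""
    margin = (high - low) * proportion
    return [
        v for v in values
        if v is not None and low - margin <= v <= high + margin
    ]
```

The near-line and offline layers would follow the same pattern with a daily or weekly mean and, for archived categories, a missing tag instead of a fill value.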

[0057] It should be noted that the anomaly handling strategies for different levels of data can be configured according to business characteristics. For example, real-time data focuses on timely removal of outliers, while offline data focuses more on retaining the original records and performing labeling processing.

[0058] S303. Based on timestamps and data identifiers, deduplication is performed on data at each level. Using timestamps as the association key, hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data are associated to form a multi-dimensional data structure containing time dimension features, business dimension features, system dimension features, hierarchy dimension features, and backup dimension features.

[0059] Among them, the timestamp refers to the time identifier of the data generation time or collection time; the data identifier refers to the data number or primary key field that can uniquely identify a data record.

[0060] In one possible implementation, the system sorts the data at each level according to the timestamp, then checks for duplicate records based on the data identifier. When records with the same data identifier and the same timestamp exist, only the latest version of the data is retained. After deduplication, the timestamp is used as a unified association key to associate the hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data. A unified data record structure is then constructed through a data integration program to generate a multi-dimensional data structure containing time dimension features, business dimension features, system dimension features, hierarchy dimension features, and backup dimension features.
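As a minimal sketch of this step, records can be deduplicated on the pair (data identifier, timestamp) and then joined to system status rows by timestamp. The `version` field used to decide which duplicate is "latest", and the dictionary keys, are assumptions introduced for illustration.

```python
def deduplicate(records):
    """Keep one record per (data_id, timestamp): the one with the highest
    version. Output is ordered by timestamp."""
    latest = {}
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        key = (rec["data_id"], rec["timestamp"])
        if key not in latest or rec["version"] > latest[key]["version"]:
            latest[key] = rec
    return list(latest.values())


def associate(records, system_status):
    """Attach the system status row that shares each record's timestamp
    (None when no status row exists for that timestamp)."""
    status_by_ts = {row["timestamp"]: row for row in system_status}
    return [
        {**rec, "system": status_by_ts.get(rec["timestamp"])}
        for rec in records
    ]
```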

[0061] It should be noted that during the data association process, the timestamp format must be consistent, and the times of different source systems must be synchronized using a unified time standard to avoid data association errors caused by time discrepancies.

[0062] As an example, in an enterprise data center environment, the system can associate the timestamps in the database change log with the timestamps in the server operation monitoring data to form a comprehensive record that includes data change information and system operation status information.

[0063] Based on the above steps, this step achieves unified integration of multi-dimensional data information by deduplicating multi-source data and associating it with the time dimension, thereby improving the consistency and analyzability of the data structure.

[0064] S304. Convert the integrated data of each level into a unified columnar storage format, and add level labels and feature labels to each data entry to generate real-time layer datasets, near-line layer datasets, and offline layer datasets, forming a multi-level backup dataset.

[0065] Among them, columnar storage format refers to a data organization method that stores data in units of columns, which is suitable for large-scale data analysis and feature extraction scenarios.

[0066] In one possible implementation, the multidimensional data structure is reorganized columnarly according to fields and converted into a unified data storage format, such as Parquet or ORC. Hierarchical labels are added to each data record to identify its real-time, near-line, or offline layer, and corresponding feature labels are added according to the feature information contained in the data. After the labeling is completed, the real-time layer data, near-line layer data, and offline layer data are written into the corresponding dataset files to form a dataset with a unified structure.

[0067] It should be noted that during the data conversion process, field names, field types, and encoding formats need to be uniformly processed to ensure compatibility of data from different sources within the same dataset.

[0068] Based on the above steps, this step generates a structured dataset by performing unified format conversion and labeling on multi-source data, thereby improving data reading and processing efficiency and making the subsequent model analysis and backup strategy generation process more efficient and stable.

[0069] In one possible implementation of the embodiments of this application, with reference to Figure 2 and as shown in Figure 4, the above S203 can be implemented through the following S401 to S405, which are explained in detail below:

S401. Construct features for the time series data in the multi-level backup dataset, taking data growth, data change frequency, business access peak, and system resource usage indicators as input features, and serialize the data of each level according to the preset time window to obtain the serialized data of each level.

[0070] Among them, data growth refers to the change in the total amount of data per unit time; data change frequency refers to the number of times data is added, modified, or deleted per unit time; business access peak refers to the maximum number of access requests to the business system within a preset time window; and system resource usage indicators include CPU utilization, memory usage, network bandwidth utilization, and disk I/O utilization.

[0071] In one possible implementation, the system extracts timestamped data records from a multi-level backup dataset and sorts them by time according to the real-time layer, near-line layer, and offline layer respectively. Based on a preset time window (e.g., 5 minutes, 1 hour, or 1 day), the system aggregates and calculates the data within a continuous time period to obtain the data growth, number of data changes, maximum business access value, and average system resource utilization rate within the corresponding time window. Then, the above features are constructed into a multi-dimensional time series vector in chronological order, and multiple consecutive time windows are combined into an input sequence using a sliding window method to obtain the serialized data at each level.
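The aggregation and sliding-window construction can be sketched as below. The record keys (`ts` in seconds, `bytes_added`, `changes`, `requests`, `cpu`) and the function names are hypothetical and serve only to make the example self-contained.

```python
from collections import defaultdict


def aggregate(records, window_seconds):
    """Bucket records into fixed time windows and build one feature vector
    per window: [data growth, change count, peak access, mean CPU]."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["ts"] // window_seconds].append(rec)

    vectors = []
    for key in sorted(buckets):  # chronological order
        rows = buckets[key]
        vectors.append([
            sum(r["bytes_added"] for r in rows),      # data growth
            sum(r["changes"] for r in rows),          # number of data changes
            max(r["requests"] for r in rows),         # business access peak
            sum(r["cpu"] for r in rows) / len(rows),  # average utilization
        ])
    return vectors


def sliding_windows(vectors, seq_len, step=1):
    """Combine `seq_len` consecutive window vectors into one input sequence,
    advancing by `step` windows each time."""
    return [
        vectors[i:i + seq_len]
        for i in range(0, len(vectors) - seq_len + 1, step)
    ]
```

Each element returned by `sliding_windows` is one input sequence for the time series model; running the two functions separately per layer yields the serialized data of each level.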

[0072] This step constructs a unified structure for multidimensional time series features, enabling data from different levels to be input into the deep learning model in a standardized sequence format, thereby improving the accuracy and stability of subsequent time series feature extraction.

[0073] S402. Input the serialized data of each level into the TCN network, and extract local sudden change features and short period fluctuation features through a multi-layer dilated convolutional structure to obtain the first temporal feature representation.

[0074] TCN refers to Temporal Convolutional Network, while dilated convolution refers to a convolution method that expands the receptive field of the convolution by setting a dilation rate during convolution computation.

[0075] In one possible implementation, the serialized time series data of each level is fed to the input layer of the TCN network, and features are extracted through multiple one-dimensional convolutional layers. Each convolutional layer uses a causal convolution structure, so that the output at a given time step depends only on the current and earlier time steps. The temporal receptive field is expanded by increasing the dilation rate layer by layer (e.g., 1, 2, 4, 8), thereby capturing local change patterns at different time scales. In addition, a residual connection is introduced after each convolutional layer to mitigate the vanishing gradient problem. The final output is the first temporal feature representation, which contains short-period change features.
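To make the causal, dilated structure concrete, the following single-channel sketch in plain Python may help. It is a simplified illustration, not the network of this application: it uses one shared kernel, one convolution per layer, a ReLU activation, and no learned parameters, all of which are assumptions. (Practical TCN residual blocks often stack two convolutions per block, which enlarges the receptive field further.)

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation].
    Positions before the start of the sequence are treated as zero, so the
    output at time t never depends on inputs later than t."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input steps seen by one output step when one causal
    convolution of the given kernel size is applied per dilation rate."""
    return 1 + (kernel_size - 1) * sum(dilations)


def tcn_stack(x, weights, dilations=(1, 2, 4, 8)):
    """Apply the dilated layers in turn, each followed by ReLU and a
    residual (skip) connection that adds the layer input back."""
    h = [float(v) for v in x]
    for d in dilations:
        conv = causal_dilated_conv(h, weights, d)
        h = [max(0.0, c) + r for c, r in zip(conv, h)]
    return h
```

With a kernel size of 2 and dilation rates 1, 2, 4, 8, `receptive_field` gives 16 time steps, which shows how doubling the dilation rate widens the temporal context without adding many layers.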

[0076] This step can effectively extract short-cycle fluctuation characteristics and sudden change characteristics of multi-level data, thereby improving the ability to capture data change trends.

[0077] S403. Input the first temporal feature representation into the LSTM network, and extract long-period dependencies and cross-time period association features through recurrent units to obtain the second temporal feature representation.

[0078] LSTM network refers to Long Short-Term Memory Network, which controls the transmission and updating of information in a time series through input gates, forget gates, and output gates.

[0079] In one possible implementation, the system uses the first temporal feature representation as the input sequence of the LSTM network and processes the time series data step by step through multiple LSTM recurrent units. In each time step, the input gate is used to control the writing of the current feature information, the forget gate is used to control the retention or forgetting of historical information, and the output gate is used to generate the hidden state of the current time step. Through recursive computation across multiple time steps, the model is able to learn the long-term dependencies between different time periods and finally outputs a second temporal feature representation containing information on long-term trends.
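For reference, the gate behavior described above corresponds to the standard LSTM cell update, written out below. The notation is conventional and is not taken from this application: $x_t$ is the input at time step $t$ (here, the first temporal feature representation), $h_t$ the hidden state, $c_t$ the cell state, $W$, $U$, and $b$ the learned weights and biases, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The additive form of the cell state update is what allows information to be carried across many time steps, which is the basis for the long-period dependencies mentioned above.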

[0080] This step can extract long-term dependencies and cross-time period trends from time series, making the prediction results more stable and accurate.

[0081] S404. Perform a fully connected mapping on the second time-series feature representation and output the predicted values of data growth, data change frequency, and peak business access for each level within the preset time period.

[0082] In one possible implementation, the system inputs the second temporal feature representation into the fully connected layer and converts the high-dimensional feature vector into multiple prediction indicators through weight matrix mapping. Specifically, the system predicts the data growth, data change frequency, and business access peak at each level through multiple output nodes, and outputs the prediction results in chronological order as a prediction sequence for a future preset time period.

[0083] It should be noted that the output dimension of the fully connected layer is consistent with the number of prediction metrics to ensure that each prediction metric corresponds to an independent output node.

[0084] As an example, when the prediction period is the next 24 hours, the fully connected layer can output the predicted data growth and business access peak for each hour in the future, thus forming a complete sequence of prediction results.

[0085] This step converts the temporal features extracted by the deep learning model into quantifiable predictive metrics.

[0086] S405. Calculate the error of the prediction results based on historical real data, and output the change results of the data volume at each level when the error meets the preset threshold condition.

[0087] Error calculation can employ evaluation indicators such as mean absolute error, mean square error, or mean absolute percentage error.

[0088] In one possible implementation, the system compares the model prediction results with the historical real data of the corresponding time period and calculates the error value of each prediction indicator. Then, it compares the calculated error value with a preset error threshold. When the error is lower than the threshold, the prediction result is deemed valid, and the data volume change results at each level are output. When the error is higher than the threshold, the model parameter adjustment or re-prediction process is triggered to ensure the reliability of the prediction results.
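The error check can be sketched with the three metrics named above. The `validate` helper and its return shape are assumptions for illustration; note that the percentage error is undefined when an actual value is zero, which the sketch does not handle.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.05 means 5%).
    Assumes no actual value is zero."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def validate(actual, predicted, threshold, metric=mape):
    """Return (is_valid, error). The prediction is accepted only when the
    error is below the threshold; otherwise the caller would trigger
    parameter adjustment or re-prediction."""
    error = metric(actual, predicted)
    return error < threshold, error
```

A per-layer threshold can be passed in by the caller, for example a tighter value for the real-time layer than for the offline layer.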

[0089] It should be noted that the error threshold can be set differently depending on the data level. For example, a lower error threshold can be used for the real-time layer to ensure prediction accuracy, while the error range can be appropriately widened for the offline layer.

[0090] As an example, during the historical data verification process, if the mean absolute percentage error between the predicted data growth and the actual growth is less than 5%, the system considers the prediction result to meet the accuracy requirements and outputs the predicted change result.

[0091] Based on the above steps, this step can verify the effectiveness of the model prediction results, thereby ensuring the accuracy and reliability of the output data change results.

[0092] In one possible implementation of the embodiments of this application, with reference to Figure 2 and as shown in Figure 5, the above S204 can be implemented through the following S501 to S505, which are explained in detail below:

S501. Extract anomaly detection features from the multi-level backup dataset.

[0093] The anomaly detection features include system resource utilization metrics, data change frequency metrics, synchronization latency metrics, storage read/write performance metrics, and historical fault labeling information. System resource utilization metrics include CPU utilization, memory usage, and disk I/O utilization; data change frequency metrics represent the number of data addition, update, or deletion operations per unit time; synchronization latency metrics represent the data synchronization time difference between the primary data node and the backup node; storage read/write performance metrics include disk read/write throughput and response latency; and historical fault labeling information refers to the labels of abnormal events recorded during historical operation.

[0094] In one possible implementation, the system reads timestamped operational monitoring data from the multi-level backup dataset and extracts the corresponding system resource monitoring metrics, data change records, and storage performance logs for the real-time, near-line, and offline layers, respectively. The system then aggregates these within a unit time window to calculate the average resource utilization rate, number of data changes, average synchronization latency, and storage read/write performance metrics. Finally, it adds fault markers to the corresponding time windows based on historical fault records, thereby forming an anomaly detection feature vector.

[0095] It should be noted that, in order to ensure the accuracy of the anomaly identification results, the anomaly detection features need to be normalized or standardized during extraction to eliminate the influence of differences in the units and value ranges of different indicators.
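Two common ways of putting features on a comparable scale are sketched below, applied to one feature column at a time. The function names are illustrative, and the choice between the two methods is not specified in this application.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score standardization: subtract the mean and divide by the
    population standard deviation. A constant column maps to zeros."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def min_max(column):
    """Min-max normalization to the interval [0, 1].
    A constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

After either transformation, a CPU utilization expressed as a percentage and a synchronization latency expressed in milliseconds contribute on a similar numeric scale.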

[0096] As an example, in an enterprise data center scenario, the system can collect metrics such as server CPU utilization, database write count, backup node synchronization latency, and disk read/write speed from the monitoring platform, and combine them into an anomaly detection feature vector for anomaly detection.

[0097] Based on the above steps, this step can construct multi-dimensional anomaly detection features that include system operating status and data change behavior, providing reliable input data for the anomaly identification model.

[0098] S502. Input the anomaly detection features into the XGBoost model to determine the known anomaly categories and their corresponding anomaly probability values, and identify the known anomalies.

[0099] The XGBoost model is trained using historically labeled fault sample data; known anomaly categories include data synchronization anomalies, storage performance anomalies, resource overload anomalies, and data change anomalies.

[0100] In one possible implementation, the system inputs the extracted anomaly detection feature vector into the trained XGBoost model, and performs classification prediction on the features through multiple gradient boosting decision trees; in each decision tree, classification is performed based on the feature splitting node, and the model prediction results are continuously optimized through gradient boosting; finally, the prediction results of all tree models are weighted and summed to output the probability value of the corresponding anomaly category, and the known anomaly type is determined according to the principle of maximizing probability.
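The final scoring step can be illustrated without the library itself. In multi-class gradient boosting, each boosting round contributes one leaf value per class, the contributions are summed per class, and a softmax turns the sums into probabilities. The sketch below shows only that principle; it does not use the XGBoost API, the leaf values are supplied by the caller (assumed to be already scaled by the learning rate), and the category identifiers are shorthand for the categories listed above.

```python
import math

CATEGORIES = [
    "data_sync",
    "storage_performance",
    "resource_overload",
    "data_change",
]


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tree_outputs):
    """tree_outputs holds, for each boosting round, one leaf value per
    category. Sum per category, apply softmax, and return the category
    with the highest probability together with that probability."""
    raw = [
        sum(round_values[k] for round_values in tree_outputs)
        for k in range(len(CATEGORIES))
    ]
    probabilities = softmax(raw)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CATEGORIES[best], probabilities[best]
```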

[0101] It should be noted that the XGBoost model uses historical fault sample data for supervised learning during the training phase and determines the model parameters through cross-validation to improve the accuracy of anomaly classification.

[0102] As an example, when the system resource utilization rate consistently exceeds 80%, the data change frequency increases abnormally, and the synchronization delay increases significantly, the XGBoost model can determine that there is a "data synchronization anomaly" and output the corresponding anomaly probability value.

[0103] Based on the above steps, this step can accurately identify the anomaly types that have appeared in historical samples, thereby quickly locating common fault types in system operation.

[0104] S503. Input the anomaly detection features into the Isolation Forest model, construct multiple randomly subsampled trees to calculate the path length of each sample, obtain the anomaly score, and identify unknown anomalies that have not appeared in historical samples.

[0105] Among them, the Isolation Forest model is an anomaly detection algorithm based on randomly partitioning the feature space. It determines whether a sample is anomalous by calculating the path length of the sample in the random tree structure.

[0106] In one possible implementation, the system performs random subsampling of the anomaly detection features and constructs an isolation tree on each of multiple random subsamples. In each isolation tree, the samples are recursively partitioned by randomly selecting a feature and a random split threshold, so that anomalous samples are isolated more easily and have shorter path lengths. The average path length of each sample over all isolation trees is computed, and the anomaly score is calculated from it using an anomaly scoring function.

[0107] It should be noted that when the average path length of a sample over the isolation trees is significantly shorter than that of normal samples, the sample is more easily isolated and is usually judged to be anomalous.
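One widely used anomaly scoring function, from the original Isolation Forest formulation, is s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of sample x and c(n) is the average path length of an unsuccessful binary search tree lookup over n samples. Whether this application uses exactly this function is not stated, so the sketch below is offered only as an assumption; the harmonic number is approximated with the Euler-Mascheroni constant.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c(n):
    """Average path length of an unsuccessful BST search over n samples:
    c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) ~ ln(i) + gamma."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(avg_path_length, subsample_size):
    """Score in (0, 1]: close to 1 for very short paths (likely anomalous),
    about 0.5 when the path length equals the expected average."""
    return 2.0 ** (-avg_path_length / c(subsample_size))
```

Under this function, a shorter average path length always yields a higher score, which matches the observation that easily isolated samples are more likely to be anomalous.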

[0108] Based on the above steps, this step can discover abnormal patterns that have not appeared in historical samples, thereby improving the system's ability to detect unknown abnormal events.

[0109] S504. Perform a fusion judgment on known anomalies and unknown anomalies. When the anomaly probability value or anomaly score value reaches a preset threshold, the corresponding data is determined to be anomaly data.

[0110] In one possible implementation, the system obtains the anomaly probability value output by the XGBoost model and the anomaly score value output by the Isolation Forest model, and compares them with preset anomaly thresholds respectively. When the anomaly probability value is higher than the first preset threshold, it is directly determined as known anomaly data. When the anomaly score value is higher than the second preset threshold, it is determined as unknown anomaly data. If both reach the threshold conditions, it is determined as high-risk anomaly data.
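The fusion judgment reduces to two threshold comparisons, as in the sketch below. The function name and the returned labels are hypothetical; the default thresholds of 0.8 and 0.65 are the example values used in this description, not fixed values.

```python
def fuse(anomaly_probability, anomaly_score,
         prob_threshold=0.8, score_threshold=0.65):
    """Combine the supervised probability and the unsupervised score."""
    known = anomaly_probability > prob_threshold   # first preset threshold
    unknown = anomaly_score > score_threshold      # second preset threshold
    if known and unknown:
        return "high_risk"
    if known:
        return "known"
    if unknown:
        return "unknown"
    return "normal"
```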

[0111] It should be noted that the first and second preset thresholds can be set based on the statistical results of historical operation data to ensure the accuracy of anomaly detection results.

[0112] As an example, when the XGBoost model outputs an anomaly probability of 0.85, which exceeds the 0.8 threshold, the data can be directly identified as a known anomaly; when the anomaly score of the Isolation Forest model reaches 0.7, which exceeds the 0.65 threshold, the data can be identified as an unknown anomaly.

[0113] Based on the above steps, this step improves the accuracy and comprehensiveness of anomaly detection by integrating the judgment results of supervised learning and unsupervised learning models.

[0114] S505. Generate anomaly information at each level based on the anomalous data.

[0115] The anomaly information includes the anomaly type, anomaly level, anomaly impact range, and anomaly occurrence time identifier, and outputs corresponding anomaly results for real-time, near-line, and offline layers respectively. The anomaly level can be divided into high-level anomalies, medium-level anomalies, and low-level anomalies according to the degree of anomaly impact; the anomaly impact range refers to the data layer, business system, or storage node that the anomaly event may affect.

[0116] In one possible implementation, the system extracts the corresponding abnormal data records from the anomaly identification results and calculates the severity of the anomaly by weighting the anomaly probability value or anomaly score value; it generates structured anomaly records based on the anomaly type, anomaly level, and data hierarchy information, and classifies, stores, and outputs them according to the real-time layer, near-line layer, and offline layer respectively, thereby forming a complete set of hierarchical anomaly information.
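A possible shape for the structured anomaly records is sketched below. The weights, the level cut-offs, the `AnomalyRecord` fields, and the input dictionary keys are all assumptions made for illustration; this application does not specify them.

```python
from dataclasses import dataclass


@dataclass
class AnomalyRecord:
    layer: str          # "real-time", "near-line", or "offline"
    anomaly_type: str
    level: str          # "high", "medium", or "low"
    impact_range: list  # affected data layers, systems, or storage nodes
    occurred_at: str    # anomaly occurrence time identifier


def severity(probability, score, w_prob=0.6, w_score=0.4):
    """Weighted combination of the anomaly probability and anomaly score."""
    return w_prob * probability + w_score * score


def to_level(value, high=0.75, medium=0.5):
    """Map a severity value to an anomaly level."""
    if value >= high:
        return "high"
    if value >= medium:
        return "medium"
    return "low"


def build_records(detections):
    """Group structured anomaly records by layer."""
    grouped = {"real-time": [], "near-line": [], "offline": []}
    for d in detections:
        level = to_level(severity(d["probability"], d["score"]))
        grouped[d["layer"]].append(
            AnomalyRecord(d["layer"], d["type"], level, d["impact"], d["time"])
        )
    return grouped
```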

[0117] It should be noted that after the anomaly information is generated, it can be recorded in both the system monitoring log and the backup management system for anomaly tracking and subsequent processing.

[0118] Based on the above steps, this step can convert the anomaly detection results into structured anomaly information, thereby facilitating subsequent backup strategy matching and system operation and maintenance management.

[0119] Although this application has been described in conjunction with specific features and embodiments, it is obvious that various modifications and combinations can be made thereto without departing from the spirit and scope of this application. Accordingly, this specification and the drawings are merely exemplary illustrations of this application as defined by the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Those skilled in the art can make various alterations and variations to this application without departing from its spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application is also intended to include them.

[0120] This embodiment of the application extracts anomaly judgment features from multi-level backup datasets, uses the XGBoost model to identify known anomalies and the Isolation Forest model to detect unknown anomalies, and then generates anomaly information at each level through a fusion judgment mechanism. This enables the simultaneous identification of historically occurring anomaly patterns and novel anomalies, improving the comprehensiveness and accuracy of anomaly detection. Furthermore, by providing structured output of anomaly types, anomaly levels, and impact ranges, the system can more accurately perceive the operational status of data at each level.
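
The fusion judgment mechanism can be illustrated with a short sketch. It assumes the known-anomaly probability (for example, from a trained XGBoost classifier) and the Isolation Forest anomaly score have already been computed and scaled to [0, 1]. The threshold values, the `FusedResult` type, and the rule that a confident known-anomaly prediction takes precedence over an unknown-anomaly flag are hypothetical choices made for illustration, not details given in this application.

```python
from typing import NamedTuple, Optional

# Assumed thresholds; the application only refers to "a preset threshold".
PROB_THRESHOLD = 0.7    # known-anomaly probability from the supervised model
SCORE_THRESHOLD = 0.6   # Isolation Forest anomaly score, assumed scaled to [0, 1]


class FusedResult(NamedTuple):
    is_anomaly: bool
    kind: Optional[str]  # "known:<label>", "unknown", or None


def fuse(known_label: str, known_prob: float, iso_score: float) -> FusedResult:
    """Flag a sample when either model's output reaches its threshold.

    A confident supervised prediction is reported as a known anomaly with its
    label; otherwise a high isolation score is reported as an unknown anomaly.
    """
    if known_prob >= PROB_THRESHOLD:
        return FusedResult(True, f"known:{known_label}")
    if iso_score >= SCORE_THRESHOLD:
        return FusedResult(True, "unknown")
    return FusedResult(False, None)
```

For example, `fuse("disk_full", 0.2, 0.8)` flags the sample as an unknown anomaly because only the isolation score reaches its threshold, which corresponds to a pattern absent from the labeled fault history.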

Claims

1. A multi-level data backup method, characterized in that it comprises: acquiring multi-source backup data, the multi-source backup data including hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data; layering the multi-source backup data to obtain multi-level backup data, and constructing a multi-level backup dataset; using a deep learning model, extracting local temporal features of the multi-level backup dataset and predicting the changes in data volume at each level over a preset time period; using a machine learning model, identifying known and unknown anomalies in the multi-level backup dataset and determining anomaly information for each level; based on the changes and the anomaly information at each level, matching multi-level backup instructions using a preset hierarchical business rule base; and based on the multi-level backup instructions, backing up the multi-level backup data layer by layer and outputting a backup execution log.

2. The method according to claim 1, characterized in that, After backing up the multi-level backup data in layers, the method further includes: Compare the number of bytes and data entries of successful backup data at each level with the multi-level backup data before backup, calculate the data restoration rate, and determine that the integrity verification is passed when the data restoration rate reaches a preset threshold. Verify the foreign key constraints of relational data in the successfully backed-up data at each level, calculate the logical consistency pass rate, and determine that the logical consistency check has passed when the logical consistency pass rate reaches the preset standard. Perform random read tests on multi-level backup data. If the read is successful, the storage validity check is deemed to have passed, and the storage validity check result is determined. When data integrity verification, logical consistency verification, and storage validity verification all pass, multi-level backup verification results are output, and the corresponding backup data is identified and categorized for storage.

3. The method according to claim 2, characterized in that, After identifying and classifying the corresponding backup data, the method further includes: Collect multi-level backup execution process data and multi-level backup verification results; the multi-level backup execution process data includes backup execution time, resource utilization rate, data transmission rate, exception handling records, and task completion status; the multi-level backup verification results include data restoration rate, logical consistency pass rate, and storage validity verification results. Based on the multi-level backup execution process data, the multi-level backup verification results, and the multi-level backup data, a full-process running dataset is constructed; The full-process running dataset is input into the deep learning model and the machine learning model for incremental training, and the model parameters of the deep learning model and the machine learning model are updated to obtain optimized model parameters. Based on the full-process running dataset, the threshold parameters and matching conditions in the preset hierarchical business rule base are adjusted or supplemented to obtain an updated preset hierarchical business rule base. The optimized model parameters and the updated preset hierarchical business rule base are persistently stored for multi-level data backup processing in the next cycle.

4. The method according to claim 1, characterized in that, The process of stratifying the multi-source backup data to obtain multi-level backup data and constructing a multi-level backup dataset includes: According to the preset data hierarchy division rules, the multi-source backup data is divided into real-time layer data, near-line layer data, and offline layer data; wherein, the preset data hierarchy division rules are determined based on data access frequency, business importance level, RTO threshold, and RPO threshold. Missing value processing and outlier processing are performed on data at each level. Specifically, for real-time data, missing values are filled using the mean of adjacent preset time windows, and data deviating from the normal range by a first preset proportion are removed. For near-line data, missing values are filled using the daily mean, and data deviating from the normal range by a second preset proportion are filtered. For offline data, missing values are filled using the weekly mean, and missing data in archived categories are marked as missing. Based on timestamps and data identifiers, data at each level is deduplicated. Using timestamps as the association key, hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data are associated to form a multi-dimensional data structure that includes time dimension features, business dimension features, system dimension features, hierarchy dimension features, and backup dimension features. The integrated data from each level is converted into a unified columnar storage format, and level labels and feature labels are added to each data entry to generate real-time layer datasets, near-line layer datasets, and offline layer datasets, which together constitute the multi-level backup dataset.

5. The method according to claim 1, characterized in that, The step of extracting local temporal features of the multi-level backup dataset using a deep learning model and predicting the change in data volume at each level over a preset time period includes: Feature construction is performed on the time series data in the multi-level backup dataset, with data growth, data change frequency, business access peak, and system resource consumption indicators used as input features. The data of each level is serialized according to a preset time window to obtain the serialized data of each level. The serialized data at each level is input into a TCN network, and local sudden change features and short-period fluctuation features are extracted through a multi-layer dilated convolutional structure to obtain a first temporal feature representation. The first temporal feature representation is input into an LSTM network, and long-period dependencies and cross-time period association features are extracted through recurrent units to obtain a second temporal feature representation; A fully connected mapping is performed on the second temporal feature representation to output the predicted values of data growth, data change frequency, and business access peak for each level within the preset time period. The prediction error is calculated by comparing the prediction results with historical real data, and the changes in data volume at each level are output when the error meets a preset threshold condition.

6. The method according to claim 1, characterized in that, The step of using a machine learning model to identify known and unknown anomalies in the multi-level backup dataset and determining anomaly information for each level includes: Extract anomaly detection features from the multi-level backup dataset. These anomaly detection features include system resource utilization indicators, data change frequency indicators, synchronization latency indicators, storage read/write performance indicators, and historical fault marking information. The anomaly detection features are input into the XGBoost model to determine the known anomaly categories and their corresponding anomaly probability values, and to identify the known anomalies; wherein, the XGBoost model is trained using historically labeled fault sample data; The anomaly detection features are input into the Isolation Forest model, and the path length of the samples is calculated by constructing multiple random sub-sampling trees to obtain anomaly scores and identify unknown anomalies that have not appeared in historical samples. The known anomalies and the unknown anomalies are fused and judged. When the anomaly probability value or anomaly score value reaches a preset threshold, the corresponding data is determined to be anomalous data. Based on the anomalous data, anomaly information at each level is generated. The anomaly information includes the anomaly type, anomaly level, anomaly impact range, and anomaly occurrence time identifier. The corresponding anomaly results are output for the real-time layer, near-line layer, and offline layer, respectively.

7. The method according to claim 1, characterized in that, Based on the changes and the anomaly information at each level, the system matches multi-level backup instructions using a pre-defined hierarchical business rule base, including: Retrieve the corresponding real-time layer rule table, near-line layer rule table, and offline layer rule table from the preset hierarchical business rule base. Each rule table includes data increment threshold conditions, abnormal state conditions, backup type parameters, resource scheduling thresholds, and abnormal response action parameters. The predicted values of data growth, data change frequency, and peak business access in the change results are compared and matched with the data increment threshold conditions in the corresponding level rule table. Match the anomaly type, anomaly level, and anomaly impact range in the anomaly information of each level with the anomaly state conditions in the corresponding level rule table; When the change result and the anomaly information simultaneously meet the corresponding rule conditions, the backup type, execution time parameter, resource allocation parameter, and anomaly handling action parameter corresponding to the rule are determined, and corresponding multi-level backup instructions are formed according to the real-time layer, near-line layer, and offline layer respectively; the backup instructions include operation type field, execution time field, resource scheduling field, and anomaly response field; When no matching rule is found, the default backup instruction is generated by calling the preset basic backup rule of the corresponding level.

8. The method according to claim 1, characterized in that, The backup process, based on the multi-level backup instructions, backs up the multi-level backup data in layers and outputs a backup execution log, including: Real-time data synchronization backup is performed via API calls based on the real-time layer backup instruction of the multi-level backup instruction. Based on the near-line layer backup instructions, perform incremental or differential backups and storage resource scheduling operations via scripts. Based on the offline layer backup instructions, data writing and copy generation operations on the offline storage medium are executed via device control commands. During the backup process at each level, the execution time, execution device, resource utilization, execution results, and exception information are recorded to generate a hierarchical backup execution log.

9. A multi-level data backup system for implementing the method according to any one of claims 1-8, characterized in that, The system includes: a data acquisition device and an electronic device; The data acquisition device is used to acquire multi-source backup data, including: hierarchical backup historical data, business scenario data, system status data, and data hierarchy attribute data. The electronic device is used to layer the multi-source backup data to obtain multi-level backup data and construct a multi-level backup dataset; extract local temporal features of the multi-level backup dataset using a deep learning model and predict the changes in data volume at each level over a preset time period; identify known and unknown anomalies in the multi-level backup dataset using a machine learning model and determine the anomaly information of each level; match multi-level backup instructions based on the changes and the anomaly information of each level using a preset hierarchical business rule base; perform layered backup of the multi-level backup data based on the multi-level backup instructions and output a backup execution log.

10. The system according to claim 9, characterized in that, The electronic device is also used for: Compare the number of bytes and data entries of successful backup data at each level with the multi-level backup data before backup, calculate the data restoration rate, and determine that the integrity verification is passed when the data restoration rate reaches a preset threshold. Verify the foreign key constraints of relational data in the successfully backed-up data at each level, calculate the logical consistency pass rate, and determine that the logical consistency check has passed when the logical consistency pass rate reaches the preset standard. Perform random read tests on multi-level backup data. If the read is successful, the storage validity check is deemed to have passed, and the storage validity check result is determined. When data integrity verification, logical consistency verification, and storage validity verification all pass, multi-level backup verification results are output, and the corresponding backup data is identified and categorized for storage.