A data processing method, related device, equipment and readable storage medium

By optimizing data storage and processing through the storage management module and task link elimination unit in the data processing system, the problem of rising costs caused by the explosive growth of data has been solved, and systematic cost control and resource optimization have been achieved.

CN117453833B · Active · Publication Date: 2026-06-26 · DUXIAOMAN TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DUXIAOMAN TECH (BEIJING) CO LTD
Filing Date
2023-10-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing data processing and storage technologies cannot effectively control the rising storage and computing costs caused by the explosive growth of data. Furthermore, the processing methods are not systematic, have poor applicability, high technical barriers, and high operation and maintenance costs.

Method used

A data processing system is adopted that includes a data storage management and control module and a task link elimination unit. Storage spaces are classified using data path information and timestamp information; data that is no longer used is deleted or moved to cold backup storage, and tasks with no downstream tasks are terminated. A task layered computing module selects an efficient data processing engine for each task, and resource utilization is further improved by combining an offline / online hybrid module with an upgrade of the big data engine architecture.

Benefits of technology

It effectively reduces storage space redundancy, lowers storage and management costs, improves data processing efficiency, and achieves systematic cost control and resource optimization.

Abstract

The application provides a data processing method, related apparatus, device, and readable storage medium. The method comprises: obtaining first information according to data generated and / or called by a data processing engine in a data processing engine layer, wherein the first information comprises at least input path information and / or output path information, and task identification information; generating timestamp information of the input path and / or the output path according to the data generated and / or called by the data processing engine; determining a first storage space and / or a second storage space based on the input path information and / or the output path information and the timestamp information, wherein the data in the first storage space is to be deleted and the data in the second storage space is to be placed in cold backup storage; deleting the data in the first storage space if the first storage space exists; and cold-backing up the data in the second storage space if the second storage space exists.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a data processing method, related apparatus, device, and computer-readable storage medium.

Background Technology

[0002] With the continuous advancement of technology, big data is becoming an increasingly important field. Over the past few decades, the volume of data has exploded. According to IDC, the global data volume reached over 44 zettabytes (ZB) in 2020 and is projected to grow to 175 ZB by 2025. Various devices and applications generate massive amounts of data daily, including structured, semi-structured, and unstructured data. This data grows far faster than traditional data processing and storage technologies can handle, which puts significant pressure on those technologies and poses a major challenge to the operating costs of large enterprises. How to control data processing costs while coping with the explosive growth of data and business is an increasingly important issue for technical personnel.

Summary of the Invention

[0003] This application provides a data processing method, related apparatus, device, and readable storage medium, which solves the problem of excessive data increasing enterprise management costs.

[0004] In a first aspect, embodiments of this application provide a data processing method, characterized in that it is applied to a data processing system, the data processing system including a data storage management module, the method comprising: obtaining first information based on data generated and / or invoked by a data processing engine in a data processing engine layer, the first information including at least input path information and / or output path information, and task identifier information; generating timestamp information of the input path and / or output path based on the data generated and / or invoked by the data processing engine; determining a first storage space and / or a second storage space based on the input path information and / or output path information and the timestamp information, wherein the data in the first storage space is data to be deleted, and the data in the second storage space is data to be cold-backed up; deleting the data in the first storage space if the first storage space exists; and cold-backing up the data in the second storage space if the second storage space exists.

[0005] In the above embodiments, after obtaining the first information and timestamp information, the data storage management module determines the first storage space and / or the second storage space based on the timestamp information. It then sends the path information of the first and / or second storage spaces to the corresponding data processing unit (data deletion unit or cold backup storage unit). In this way, data in storage spaces that have not been accessed for a long time can be identified as unimportant data, and thus deleted or cold-backed up. This reduces data redundancy in storage spaces while ensuring normal operation, avoiding excessive data in storage spaces and thus reducing storage and management costs.

[0006] In conjunction with the first aspect, in one possible implementation, a first storage space and / or a second storage space are determined based on input path information and / or output path information and timestamp information. Specifically, this includes: calculating the time difference between the current time and the first timestamp information of the input path information and / or the time difference between the current time and the first timestamp information of each output path information; the first timestamp information is the latest timestamp information of the input path information and / or output path information; determining the storage space indicated by the input path information and / or output path information with a time difference greater than or equal to a first threshold as the first storage space; determining the storage space indicated by the input path information and / or output path information with a time difference less than the first threshold and greater than a second threshold as the second storage space; the first threshold is greater than the second threshold.
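
The time-difference rule in this implementation can be illustrated with a minimal Python sketch. The function name, the example paths, and the threshold values (180 days and 30 days) are assumptions chosen for illustration; the application does not specify them.

```python
from datetime import datetime, timedelta


def classify_by_idle_time(latest_access, now, first_threshold, second_threshold):
    """Split paths into (to_delete, to_cold_backup) by time since last access.

    latest_access maps a path to its latest timestamp (the "first timestamp
    information"). All names here are hypothetical.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    to_delete, to_cold_backup = [], []
    for path, last_seen in latest_access.items():
        idle = now - last_seen
        if idle >= first_threshold:
            # Idle for at least the first threshold: first storage space.
            to_delete.append(path)
        elif idle > second_threshold:
            # Idle between the two thresholds: second storage space.
            to_cold_backup.append(path)
    return to_delete, to_cold_backup


now = datetime(2023, 10, 26)
latest_access = {
    "/warehouse/a": now - timedelta(days=200),
    "/warehouse/b": now - timedelta(days=60),
    "/warehouse/c": now - timedelta(days=2),
}
to_delete, to_cold_backup = classify_by_idle_time(
    latest_access, now, timedelta(days=180), timedelta(days=30)
)
```

In this sketch, a path whose idle time is at or below the second threshold falls into neither list and is left untouched.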

[0007] In conjunction with the first aspect, in one possible implementation, the first storage space and / or the second storage space are determined based on the input path information and / or the output path information and the timestamp information. Specifically, this includes: counting the number of timestamps for each input path information and / or output path information within a first time interval; determining the storage space indicated by the input path information and / or output path information with a timestamp count less than a third threshold as the first storage space; and determining the storage space indicated by the input path information and / or output path information with a timestamp count greater than the third threshold and less than a fourth threshold as the second storage space; wherein the third threshold is less than the fourth threshold.
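
The access-count rule can be sketched in the same way. This is an illustrative Python sketch only: the function name, the integer timestamps, and the threshold values (3 and 10) are assumptions. A count exactly equal to the third threshold is assigned to neither space, mirroring the wording above.

```python
from collections import Counter


def classify_by_access_count(known_paths, access_events, window,
                             third_threshold, fourth_threshold):
    """Split paths into (to_delete, to_cold_backup) by accesses in a window.

    access_events is a list of (path, timestamp) pairs; window is an
    inclusive (start, end) pair. All names here are hypothetical.
    """
    if third_threshold >= fourth_threshold:
        raise ValueError("third_threshold must be less than fourth_threshold")
    start, end = window
    counts = Counter(path for path, ts in access_events if start <= ts <= end)
    to_delete, to_cold_backup = [], []
    for path in known_paths:
        n = counts[path]  # Counter returns 0 for paths never accessed
        if n < third_threshold:
            to_delete.append(path)
        elif third_threshold < n < fourth_threshold:
            to_cold_backup.append(path)
    return to_delete, to_cold_backup


events = (
    [("/a", 5)]
    + [("/b", t) for t in range(5)]
    + [("/c", t) for t in range(20)]
)
to_delete, to_cold_backup = classify_by_access_count(
    ["/a", "/b", "/c"], events, (0, 100), third_threshold=3, fourth_threshold=10
)
```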

[0008] In a second aspect, embodiments of this application provide a data processing method applied to a data processing system. The data processing system includes a data storage management and control module and a task link elimination unit. The method includes: the task link elimination unit receiving first information sent from the data storage management and control module, the first information including at least input path information and / or output path information and task identification information; the task link elimination unit constructing a task relationship information network based on the first information, the task relationship information network being used to represent the upstream and downstream relationships between tasks; and the task link elimination unit deleting a first task from the task relationship information network, the first task being a task that has no downstream tasks in the task relationship information network.

[0009] In this way, tasks without downstream counterparts can be terminated. The data generated by these tasks is often not read or used by other tasks. Deleting these tasks prevents them from generating excessive amounts of useless data, thus avoiding storage pressure.
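
One way to picture the task relationship information network is the following Python sketch. It assumes, for illustration only, that task A is upstream of task B whenever one of A's output paths is also one of B's input paths; the function name and the example tasks are hypothetical.

```python
def find_tasks_without_downstream(first_info):
    """Build upstream/downstream relationships and list tasks with no downstream.

    first_info maps a task id to {"inputs": set_of_paths, "outputs": set_of_paths}.
    Returns (sorted task ids with no downstream task, downstream map).
    """
    downstream = {task_id: set() for task_id in first_info}
    for producer, produced in first_info.items():
        for consumer, consumed in first_info.items():
            # A shared path between one task's outputs and another's inputs
            # is treated as an upstream -> downstream edge.
            if producer != consumer and produced["outputs"] & consumed["inputs"]:
                downstream[producer].add(consumer)
    leaves = sorted(task for task, succ in downstream.items() if not succ)
    return leaves, downstream


first_info = {
    "etl": {"inputs": {"/raw"}, "outputs": {"/clean"}},
    "report": {"inputs": {"/clean"}, "outputs": {"/report"}},
    "orphan": {"inputs": {"/raw"}, "outputs": {"/tmp/unused"}},
}
leaves, downstream = find_tasks_without_downstream(first_info)
```

In this example both "orphan" and "report" have no downstream task. A real deployment would presumably need a way to exempt tasks whose output is consumed outside the task network; that exemption is an assumption and is not described in the text above.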

[0010] In conjunction with the second aspect, in one possible implementation, the task link elimination unit deleting the first task from the task relationship information network includes interrupting the process executing the first task.

[0011] In conjunction with the second aspect, in one possible implementation, the task link elimination unit deleting the first task from the task relationship information network includes deleting the data generated during the execution of the first task.

[0012] In conjunction with the second aspect, in one possible implementation, the data processing system further includes a task-layered computation module. The method further includes: the task-layered computation module receiving task information sent by a task submission platform, the task information including task identification information, task type information, and task resource consumption, where the task identification information identifies a first target task, the task type information represents the type of the first target task, and the task resource consumption represents the amount of resources required to execute the first target task; the task-layered computation module sending a task test message to a target data processing engine in the data processing engine layer, the target data processing engine being a data processing engine capable of executing tasks of the type indicated by the task type information, and the task test message including the task resource consumption and a task test instruction that instructs the target data processing engine to calculate the predicted duration required to execute the first target task and / or the predicted computational resources consumed in executing it; the task-layered computation module receiving the predicted duration and / or the predicted computational resources from the target data processing engine; and the task-layered computation module sending a first prompt message to the task submission platform, the first prompt message instructing the task submission platform to send a target task instruction to a first data processing engine, the target task instruction instructing the first data processing engine to execute the first target task. The first data processing engine is the target data processing engine with the shortest predicted duration and / or the smallest predicted computational resources.
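
The final selection step can be sketched as follows in Python. The `Prediction` record, the numbers, and the tie-breaking rule (shortest duration first, then fewest resources) are assumptions made for illustration; the text above allows either criterion or both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical reply from a target data processing engine."""
    engine: str
    duration_s: float   # predicted duration, in seconds
    resources: float    # predicted computational resources, arbitrary units


def pick_engine(predictions):
    """Return the engine with the shortest predicted duration.

    Ties on duration are broken by the smaller predicted resource usage.
    """
    if not predictions:
        raise ValueError("no engine can execute this task type")
    best = min(predictions, key=lambda p: (p.duration_s, p.resources))
    return best.engine


chosen = pick_engine([
    Prediction("Hive", duration_s=120.0, resources=40.0),
    Prediction("Spark", duration_s=45.0, resources=64.0),
    Prediction("Presto", duration_s=45.0, resources=32.0),
])
```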

[0013] In a third aspect, embodiments of this application provide a data processing device, which includes a memory and a processor;

[0014] The memory is used to store program code, and the processor is used to call the program code stored in the memory to execute the data processing method in the first aspect and its various possible implementations, or to execute the data processing method in the second aspect and its various possible implementations.

[0015] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the data processing method in the first aspect and its various possible implementations, or implements the data processing method in the second aspect and its various possible implementations.

[0016] In a fifth aspect, embodiments of this application provide a computer program including instructions that, when executed by a computer, cause a data processing device to execute the processes performed by the data processing device in the first aspect and its various possible implementations, or the processes performed by the data processing device in the second aspect and its various possible implementations.

Attached Figure Description

[0017] The accompanying drawings used in the embodiments of this application are described below.

[0018] Figure 1 is a system architecture diagram of a data processing method provided in an embodiment of this application;

[0019] Figure 2 is an example diagram of a data processing flow provided in an embodiment of this application;

[0020] Figure 3 is an example diagram of a data processing flow provided in an embodiment of this application;

[0021] Figure 4 is a schematic diagram of a task relationship information network provided in an embodiment of this application;

[0022] Figure 5 is a schematic diagram of the structure of a data processing device 50 provided in an embodiment of this application;

[0023] Figure 6 is a schematic diagram of the structure of a data processing device 60 provided in an embodiment of this application.

Detailed Implementation

[0024] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. The term "embodiment" as used herein means that a specific feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in different places in the specification does not necessarily indicate the same embodiment, nor is it an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein can be combined with other embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0025] The terms "first," "second," "third," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish different objects and not to describe a particular order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units that are not listed, or other steps or units inherent to that process, method, product, or device.

[0026] The accompanying drawings show only the portions relevant to this application, not all of them. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts depict operations (or steps) as sequential processes, many of these operations may be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operation is completed, but may also have additional steps not included in the drawings. The process may correspond to a method, function, procedure, subroutine, subprogram, etc.

[0027] The terms “component,” “module,” “system,” “unit,” etc., used in this specification refer to computer-related entities: hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a unit can be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, or a program, and a unit may be distributed between two or more computers. Furthermore, these units can be executed from various computer-readable media on which various data structures are stored. Units can communicate, for example, via local and / or remote processes based on a signal having one or more data packets (e.g., data from one unit interacting with another unit in a local system, in a distributed system, and / or across a network such as the Internet, where the interaction with other systems takes place by means of the signal).

[0028] The explosive growth in the field of big data can be addressed through the following data processing technologies:

[0029] The first method is to compress data using data compression technology. Data compression technology reduces storage and transmission space by removing or replacing redundant information in data with smaller codes, thereby lowering costs and improving efficiency. In the field of big data, data compression technology is a commonly used technique that can effectively address the problem of explosive growth.

[0030] The second approach is to build a data lake for data governance. By using data lake technology, different types of data can be stored in a unified data warehouse, enabling centralized management and analysis. This effectively improves data utilization and value, thereby addressing the problem of explosive data growth.

[0031] The third approach is to perform distributed storage and computation on the data. By using distributed storage and computation technologies, data can be distributed across multiple servers, enabling parallel processing and storage. This effectively improves the efficiency of data processing and storage, thereby addressing the problem of explosive data growth.

[0032] However, processing data using the three methods described above also brings some problems. For example, using data compression technology may result in compression ratios that don't meet expectations, longer compression and decompression times, and the potential generation of noise during compression. Using data lake technology may introduce issues related to data quality, data security, and data governance. Using distributed storage and computing technologies may raise concerns about data synchronization, data consistency, and data security.

[0033] In summary, the above-mentioned data processing techniques mainly have the following four problems:

[0034] 1. Inability to completely address explosive growth and lack of continuous cost control. While data compression technology or allowing users to delete data can reduce storage pressure and meet short-term growth needs, the future data volume will continue to increase, making it impossible to continuously control user storage usage, and costs will continue to rise. Furthermore, data compression technology only focuses on storage growth while ignoring the explosive growth of computing and business operations, failing to fundamentally address the explosive growth problem of data and resulting in a lack of continuous cost control.

[0035] 2. The solutions are not systematic and cannot comprehensively address storage, computing, and business growth issues. The data processing methods mentioned above only focus on controlling storage growth, without addressing the fact that computing and business volumes are also growing rapidly. The lack of a systematic approach, focusing only on one aspect and using a single method for simple control, is far from keeping pace with the growth of storage, computing, and business.

[0036] 3. High technical barriers and high operation and maintenance costs do not align with the goal of cost reduction. As data volume continues to increase, technologies such as distributed storage, data processing, and computing also need constant upgrades and expansions. This requires significant investment of human, material, and financial resources, thus increasing costs. Furthermore, as data volume continues to grow, these technologies will encounter bottlenecks and fail to meet demands, leading to reduced efficiency in data processing and storage.

[0037] 4. It lacks universality and has poor applicability.

[0038] Therefore, to solve the above problems, this application proposes a data processing method. The system architecture of the data processing method proposed in this application is described below with reference to the accompanying drawings. Please refer to Figure 1, which is a system architecture diagram of a data processing method provided in an embodiment of this application.

[0039] As shown in Figure 1, the data processing system comprises a data processing engine layer, a data storage management and control layer, a data computation management and control layer, and a technology efficiency improvement layer. The data processing engine layer includes multiple data processing engines, such as the Hive engine, Spark engine, Hadoop engine, Presto engine, and Flink engine. These data processing engines are used to process data during task execution (e.g., human-computer interaction) or process execution (e.g., generating data resources required for task execution or process execution, retrieving data required for task execution or process execution from the storage space corresponding to the storage path, etc.).

[0040] The data storage management and control layer includes a data storage management module, which in turn includes a data acquisition unit, a data authentication service unit, and a task lineage construction unit. In some embodiments, the data storage management module may also include a data deletion unit and a data cold backup storage unit.

[0041] The data acquisition unit is used to acquire the first information generated by each data processing engine during task execution. The first information includes input path information (the path of the storage space holding the resources required for task execution), output path information (the path of the storage space for data generated during task execution), task identification information, and a task type identifier. The task type identifier is assigned per data processing engine: the first information of tasks executed by the same data processing engine carries the same task type identifier.

[0042] The data authentication service unit is used to record the timestamp information of each piece of first information. This timestamp information is used to characterize the time when the storage space corresponding to the input path information and / or output path information in the first information was most recently accessed.

[0043] The task lineage construction unit is used to filter the first information based on the timestamp information and filtering rules, thereby determining the first storage space and / or the second storage space. The data in the first storage space is the data to be deleted, and the data in the second storage space is the data to be cold-backed up.

[0044] In some embodiments, the data deletion unit is used to delete data in the first storage space after receiving a data deletion instruction from the task lineage construction unit.

[0045] In some embodiments, the data cold backup storage unit is used to perform cold backup storage of data in the second storage space after receiving a data cold backup storage instruction from the task lineage construction unit.

[0046] The data computation management and control layer includes a task-layered computing module and a task link elimination module. The task-layered computing module is used to predict the time and / or computing resources required for each data processing engine to execute a task, and then selects the data processing engine with the shortest computing time and / or the least computing resources to process the task.

[0047] The task link elimination module is used to collect the input and output paths of each offline task. It then performs data cleaning, data fusion, and data analysis on the collected path information to obtain the expected high-value task data path information. The input and output paths are linked into chains, and each path node is checked for child nodes in the overall chain. If a path node has no child nodes, the output at that path is not consumed by any task and is considered worthless, and the task that produces it is terminated.

[0048] The technology efficiency improvement layer includes an offline / online hybrid module, a big data engine architecture upgrade module, and a storage and computing middleware layer.

[0049] The offline / online hybrid module is used to deploy online and offline cluster machines together on a single Kubernetes cluster. Machines can be scheduled flexibly as cluster traffic fluctuates, so that online and offline workloads share resources. This significantly improves CPU utilization on offline machines and greatly reduces intermittent resource waste, thereby lowering costs.

[0050] The big data engine architecture upgrade module is used to upgrade engines such as Hive, Spark, Hadoop, Presto, and Flink from lower versions to higher community versions (industry experience and actual test results indicate that higher engine versions perform far better than lower versions), thereby improving task execution efficiency and reducing task computing and operation costs.

[0051] The storage-compute middleware layer is used to integrate Alluxio on top of HDFS or BOS storage. Frequently accessed hot data for offline business is cached in Alluxio. When an offline task accesses hot data, it will directly read the data in the cache without having to fetch the underlying HDFS or BOS storage data again. This can greatly reduce data read and fetch IO, speed up task execution, and enable the same-sized computing cluster to support more business operations. To a certain extent, this reduces the overall cluster cost.
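
The caching behaviour described here follows the general read-through pattern, which can be sketched in plain Python. This is not the Alluxio API; the class name, the dictionary standing in for HDFS or BOS storage, and the read counter are all assumptions used only to show why repeated reads of hot data avoid the underlying storage.

```python
class ReadThroughCache:
    """Illustrative read-through cache in front of a slower backing store."""

    def __init__(self, backing_read):
        self._backing_read = backing_read  # callable: path -> bytes
        self._cache = {}
        self.backing_reads = 0  # how often the underlying storage was hit

    def read(self, path):
        if path not in self._cache:
            # Cache miss: fetch once from the underlying storage and keep it.
            self.backing_reads += 1
            self._cache[path] = self._backing_read(path)
        return self._cache[path]


store = {"/hot/table": b"rows"}  # stand-in for HDFS or BOS storage
cache = ReadThroughCache(store.__getitem__)
first = cache.read("/hot/table")   # served from the backing store
second = cache.read("/hot/table")  # served from the cache
```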

[0052] The data processing flow of the data storage management module in an embodiment of this application is described below with reference to the accompanying drawings. Please refer to Figure 2, which is an example diagram of a data processing flow provided in an embodiment of this application. The data storage management module includes at least a data authentication service unit, a data acquisition unit, and a task lineage construction unit. The data processing flow in which these units interact is as follows:

[0053] S201: The data acquisition unit obtains first information based on the data generated and / or called by the data processing engine in the data processing engine layer. The first information includes at least input path information and / or output path information, and task identification information.

[0054] Specifically, the data processing engine layer includes multiple data processing engines, which can be Hive, Spark, Flink, Hadoop, or Presto engines. While executing tasks (e.g., keeping a process running normally), these engines may generate or access data. The data acquisition unit monitors the data generation and / or access of each data processing engine in the data processing engine layer in real time. Upon detecting data generation or access by a data processing engine, the data acquisition unit can obtain the first information about this data. This first information includes at least input path information and / or output path information, task identifier information, and engine identifier information.

[0055] The input path information represents the path of the storage space holding the first data, which is the data called by the data processing engine during task execution; the output path information represents the path of the storage space for the second data, which is the data generated by the data processing engine during task execution; the task identifier information represents the task corresponding to the input path information and / or the output path information; and the engine identifier information represents the data processing engine corresponding to the input path information and / or the output path information.

[0056] To facilitate a better understanding of the first information by those skilled in the art, it will be explained below with examples. Assume the current task is to run video software to play a historical video (Video 1). Running this task requires first reading and loading data from the device's first storage space (assuming the first storage space's storage path on the device is IQYI / Vedio / run) to ensure the normal operation of the video software. It also requires reading the previous playback progress data (progress data 1) from the device's second storage space (assuming the second storage space's storage path on the device is IQYI / Vedio / run / history / 102). During the playback of Video 1, the current playback progress data (progress data 2) of Video 1 needs to be periodically written to the device's third storage space (assuming the third storage space's storage path on the device is IQYI / Vedio / run / history / 102). The loading of data, the reading of the previous playback progress data of Video 1, and the writing of the current playback progress data of Video 1 are all executed by the Hadoop engine. Therefore, the first data can be loading data and progress data 1, the second data can be progress data 2, the input path information can include "IQYI / Vedio / run" and "IQYI / Vedio / run / history / 102", the output path information can include IQYI / Vedio / run / history / 102, the task identification information is used to indicate that the currently executed task is a video playback task, and the engine identification information is used to indicate that the data processing engine executing this task is the Hadoop engine.
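
The first information for the video playback example can be written down as a small record, sketched here in Python. The class name, field names, and the identifier strings "video-playback" and "hadoop" are assumptions; the paths are taken from the example above and written without the spaces around the slashes.

```python
from dataclasses import dataclass, field


@dataclass
class FirstInformation:
    """Hypothetical container for one piece of first information."""
    task_id: str      # task identification information
    engine_id: str    # engine identification information
    input_paths: list = field(default_factory=list)   # storage read by the task
    output_paths: list = field(default_factory=list)  # storage written by the task


record = FirstInformation(
    task_id="video-playback",
    engine_id="hadoop",
    input_paths=["IQYI/Vedio/run", "IQYI/Vedio/run/history/102"],
    output_paths=["IQYI/Vedio/run/history/102"],
)
```

Note that in this example the progress path appears as both an input path and an output path, because the task reads the previous progress from it and writes the current progress to it.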

[0057] S202: The data authentication service unit generates timestamp information for the input path and / or output path based on the data generated and / or called by the data processing engine.

[0058] Specifically, the data acquisition unit monitors in real time the generation and / or retrieval of data by each data processing engine in the data processing engine layer. Upon detecting that a data processing engine is retrieving data from a storage space, the data authentication service unit generates a timestamp for the input path (the path of the storage space containing the data retrieved by the data processing engine). This timestamp indicates the time the storage space corresponding to the input path was accessed. Similarly, upon detecting that a data processing engine is writing generated data to a storage space, the data authentication service unit generates a timestamp for the output path (the path of the storage space where the data generated during task execution is stored). This timestamp indicates the time the storage space corresponding to the output path was accessed.

[0059] In some embodiments, the data authentication service unit may save only the timestamp of the most recent access to the storage space of the input path and the timestamp of the most recent access to the storage space of the output path.
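
Keeping only the most recent access time per path can be sketched as below. The function name and the integer timestamps are assumptions for illustration; the sketch also tolerates events that arrive out of order.

```python
def record_access(latest, path, timestamp):
    """Store only the most recent access time for each path."""
    previous = latest.get(path)
    if previous is None or timestamp > previous:
        latest[path] = timestamp
    return latest


latest = {}
events = [("/in/a", 100), ("/out/b", 105), ("/in/a", 90), ("/in/a", 130)]
for path, ts in events:
    record_access(latest, path, ts)
```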

[0060] It should be understood that S201 can be executed before S202, after S202, or simultaneously with S202. This application embodiment does not limit this.

[0061] S203: The data acquisition unit sends the first information to the task lineage construction unit.

[0062] S204: The data authentication service unit sends the timestamp information to the task lineage construction unit.

[0063] It should be understood that S203 can be executed before S204, after S204, or simultaneously with S204. This application embodiment does not limit this.

[0064] S205: The task lineage construction unit determines a first storage space and / or a second storage space based on the first information and the timestamp information. The first storage space is the storage space indicated by the first input path information and / or the first output path information in the first information. The second storage space is the storage space indicated by the second input path information and / or the second output path information in the first information. The data in the first storage space is data to be deleted, and the data in the second storage space is data to be cold-stored.

[0065] Specifically, after receiving the timestamp information from the data authentication service unit and the first information from the data acquisition unit, the task lineage construction unit can determine a first storage space and / or a second storage space based on the first information and the timestamp information. The first storage space is the storage space indicated by the first input path information and / or the first output path information in the first information, and the second storage space is the storage space indicated by the second input path information and / or the second output path information in the first information. The data in the first storage space is data to be deleted, and the data in the second storage space is data to be cold-backed up. Cold-backed up storage can be understood as a storage method that reduces data storage performance; for example, reading data from cold-backed up storage is slower than reading data from non-cold-backed up storage. The task lineage construction unit can determine the first storage space and / or the second storage space based on the first information and the timestamp information in the following ways:

[0066] Optionally, the task lineage construction module can build a data warehouse based on the input path information and output path information in the first information it receives. This involves processes such as data exploration, data cleaning (e.g., deleting fields in the input and output path information that do not conform to analysis rules), and data fusion (e.g., in some embodiments, the input and output path information are on one data table, while the attribute information such as the size of the storage space indicated by the input and output path information is on another data table; data fusion refers to integrating two related pieces of information into one data table, which can be understood as integrating the input path information and the attribute information of the storage space indicated by that path information into one data table, and integrating the output path information and the attribute information of the storage space indicated by that path information into one data table). This process is used to construct a data warehouse about the input and output path information.
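The data fusion step described above, in which path information and storage-space attributes from two tables are combined into one, can be illustrated with plain dictionaries. This is a minimal sketch: the column names (`path`, `direction`, `task_id`, `size_bytes`) and the sample paths are assumptions made for illustration.

```python
from typing import Dict, List


def fuse_path_attributes(path_rows: List[Dict], attribute_rows: List[Dict]) -> List[Dict]:
    """Merge each path row with the attribute row that shares its 'path' key."""
    attrs_by_path = {row["path"]: row for row in attribute_rows}
    # Rows with no matching attributes are kept unchanged (an assumption).
    return [{**row, **attrs_by_path.get(row["path"], {})} for row in path_rows]


paths = [
    {"path": "/warehouse/a", "direction": "input", "task_id": "t1"},
    {"path": "/warehouse/b", "direction": "output", "task_id": "t1"},
]
attributes = [{"path": "/warehouse/a", "size_bytes": 1024}]
fused = fuse_path_attributes(paths, attributes)
```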

[0067] After the data warehouse is built, the task lineage construction unit can determine the first storage space and / or the second storage space based on the timestamp information of the input path information and output path information in the first information. The input path information corresponding to the first storage space is the first input path information, the output path information corresponding to the first storage space is the first output path information, and the data in the first storage space is the data to be deleted. Based on the timestamp information of the input path information and / or the timestamp information of the output path information, the task lineage construction unit can obtain the time (the first time) of the most recent access to the storage space indicated by the input path information and / or the output path information, and calculate the time difference between the current time and the first time. Input path information and / or output path information whose time difference is greater than or equal to a first threshold is determined as the first input path information and / or the first output path information, and the data in the first storage space indicated by the first input path information and / or the first output path information is the data to be deleted. Input path information and / or output path information whose time difference is less than the first threshold and greater than or equal to a second threshold is determined as the second input path information and / or the second output path information. The data in the second storage space indicated by the second input path information and / or the second output path information is the data to be stored in cold backup. The first threshold is greater than the second threshold.
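The time-difference rule of this paragraph can be sketched as follows. The function name, the argument names, and the choice of a plain number as the time unit are assumptions; only the two comparisons against the first and second thresholds come from the text.

```python
from typing import Dict, List, Tuple


def classify_by_age(
    latest_access: Dict[str, float],
    now: float,
    first_threshold: float,
    second_threshold: float,
) -> Tuple[List[str], List[str]]:
    """Return (paths to delete, paths to cold-backup) from last-access times."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second")
    to_delete: List[str] = []
    to_cold_backup: List[str] = []
    for path, first_time in latest_access.items():
        time_difference = now - first_time
        if time_difference >= first_threshold:
            to_delete.append(path)       # first storage space
        elif time_difference >= second_threshold:
            to_cold_backup.append(path)  # second storage space
        # otherwise the path was accessed recently and is left untouched
    return to_delete, to_cold_backup
```

For instance, with times measured in days, `now=100`, a first threshold of 90 and a second threshold of 30, a path last accessed on day 0 falls in the delete list and a path last accessed on day 60 falls in the cold backup list.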

[0068] S206: If it is determined that a first storage space exists, the task lineage construction unit sends a first instruction message to the data deletion unit. The deletion instruction in the first instruction message is used to instruct the data deletion unit to delete the data in the first storage space. The first instruction message includes the first input path information and / or the first output path information.

[0069] S207: If it is determined that a second storage space exists, the task lineage construction unit sends a second instruction message to the data cold backup storage unit. The cold backup storage instruction in the second instruction message is used to instruct the data cold backup storage unit to perform cold backup storage of the data in the second storage space. The second instruction message includes the second input path information and / or the second output path information.

[0070] In this embodiment, after obtaining the first information and timestamp information, the task lineage construction unit determines the first storage space and / or the second storage space based on the timestamp information. It then sends the path information of the first and / or second storage spaces to the corresponding data processing unit (the data deletion unit or the data cold backup storage unit). In this way, data in storage spaces that have not been accessed for a long time can be identified as unimportant data, and thus deleted or cold-backed up. This reduces data redundancy in storage spaces while ensuring normal execution, and avoids the accumulation of excessive data that would increase data storage and management costs.

[0071] It should be understood that the execution order of the steps in the embodiment of Figure 2 is merely illustrative and may be varied. The execution order of the steps in the embodiment of Figure 2 may be adjusted, and / or one or more steps may be deleted, to obtain different embodiments, and the resulting embodiments still fall within the protection scope of the embodiments of this application.

[0072] The following describes, with reference to the accompanying drawings, the data processing flow of the task hierarchical calculation module and the task link elimination module in the embodiments of this application. Please refer to Figure 3, which is an example diagram of a data processing flow provided in an embodiment of this application. The specific flow is as follows:

[0073] S301: The data acquisition unit sends the first information to the task link elimination module.

[0074] S302: The task link elimination module constructs a task relationship information network based on the first information, and the task relationship information network is used to represent the upstream and downstream relationships between tasks.

[0075] Optionally, the task link elimination module constructs a data warehouse based on the input path information and output path information in the first information it receives. It performs data exploration, data cleaning (e.g., deleting fields in the input and output path information that do not conform to analysis rules), and data fusion (e.g., in some embodiments, the input and output path information are on one data table, while the attribute information such as the size of the storage space indicated by the input and output path information is on another data table; data fusion refers to integrating two related pieces of information into one data table, which can be understood as integrating the input path information and the attribute information of the storage space indicated by that path information into one data table, and integrating the output path information and the attribute information of the storage space indicated by that path information into one data table) and other processes to construct a data warehouse about the input and output path information.

[0076] The task link elimination module can determine the relationship between the input and output path information of each task based on the input path information and / or output path information in the first information and the task identification information, and construct a task relationship information network. This task relationship information network indicates the upstream and downstream relationships between tasks. In the task relationship information network, if the input path information of the first task is the output path information of the second task, then the first task is a downstream task of the second task, and the second task is an upstream task of the first task. During its execution, the first task needs to use the data generated by the second task.
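The construction rule in this paragraph (a task is downstream of any task whose output path it reads) can be sketched as below. The function name and the shape of the input mapping are hypothetical; skipping a task that reads its own output path is an assumption added here so that a task is never recorded as its own downstream.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

# task id -> (input paths, output paths)
TaskPaths = Dict[str, Tuple[Sequence[str], Sequence[str]]]


def build_task_network(tasks: TaskPaths) -> Dict[str, Set[str]]:
    """Return a mapping from each task id to the ids of its downstream tasks."""
    producers = defaultdict(set)  # output path -> tasks that write it
    for task_id, (_, outputs) in tasks.items():
        for path in outputs:
            producers[path].add(task_id)

    downstream: Dict[str, Set[str]] = {task_id: set() for task_id in tasks}
    for task_id, (inputs, _) in tasks.items():
        for path in inputs:
            for upstream_id in producers.get(path, ()):
                if upstream_id != task_id:  # assumption: ignore self-loops
                    downstream[upstream_id].add(task_id)
    return downstream
```

For example, if a hypothetical task `second` writes `/a` and a task `first` reads `/a`, the result maps `second` to `{"first"}` and `first` to an empty set.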

[0077] For example, Figure 4 shows a task relationship information network provided in an embodiment of this application. Figure 4 contains Tasks 1 through 8, where the tail of an arrow indicates an upstream task and the head of the arrow indicates a downstream task. In Figure 4, the output path information of Task 1 and Task 4 is the input path information of Task 2, the output path information of Task 2 is the input path information of Task 7, the output path information of Task 7 is the input path information of Task 5, the output path information of Task 5 is the input path information of Task 4, Task 8 and Task 6, the output path information of Task 4 is the input path information of Task 6, and the output path information of Task 3 is the input path information of Task 1 and Task 4.

[0078] S303: The task link elimination module determines the first task in the task relationship information network, and the first task is a task without downstream tasks.

[0079] Specifically, after constructing the task relationship information network, the task link elimination module determines the first task based on the task relationship information network. The first task is a task without downstream tasks.

[0080] For example, the first task can be Task 8 in Figure 4 described above.
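Finding tasks without downstream tasks amounts to finding nodes that never appear as the upstream end of an edge. The sketch below encodes the relationships listed for Figure 4 as (upstream, downstream) pairs; the function name and the edge-list representation are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# (upstream task, downstream task) pairs transcribed from the Figure 4 description.
FIGURE_4_EDGES = [
    (1, 2), (4, 2), (2, 7), (7, 5),
    (5, 4), (5, 8), (5, 6), (4, 6),
    (3, 1), (3, 4),
]


def tasks_without_downstream(edges: Iterable[Tuple[int, int]]) -> List[int]:
    """Return, in ascending order, the tasks that are never an upstream task."""
    all_tasks = set()
    has_downstream = set()
    for upstream, downstream in edges:
        all_tasks.update((upstream, downstream))
        has_downstream.add(upstream)
    return sorted(all_tasks - has_downstream)
```

With the edges as transcribed here, the function returns Task 6 as well as Task 8, since neither task's output path is listed as the input path of another task.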

[0081] S304: The task link elimination module terminates the first task.

[0082] Specifically, the task link elimination module can terminate the first task in the following ways: interrupting the execution of the first task or deleting the data generated during the execution of the first task.

[0083] S305: The task layering calculation module receives task information sent by the task submission platform. The task information includes task identification information, task type information, and task resource consumption. The task identification information is used to represent the first target task, the task type information is used to represent the type of the first target task, and the task resource consumption is used to represent the amount of resources required to execute the first target task.

[0084] Specifically, the task submission platform sends task instructions to the data processing engine, which instructs the engine to execute the corresponding task. Before sending the task to the data processing engine, the task submission platform can first send task information to the task hierarchical computing module. This task information includes task identification information, task type information, and task resource consumption.

[0085] Among them, task type information is used to characterize the type of task, and task resource consumption is used to characterize the total amount of processing resources required to execute the task.

[0086] S306: The task layered calculation module sends a task test message to the target data processing engine in the data processing engine layer. The task test instruction in the task test message is used to instruct the target data processing engine to calculate the predicted computation time required to execute the first target task and / or the predicted value of the computational resources required to execute the first target task. The target data processing engine is a data processing engine that can execute the task corresponding to the task type information. The task test message also includes the task resource consumption.

[0087] S307: The target data processing engine preprocesses the task to be executed according to the task resource consumption to obtain the predicted computation time required for each target data processing engine to execute the first target task and / or the predicted value of the computational resources required to execute the first target task.

[0088] S308: The target data processing engine sends the computation prediction duration and / or the computation resource prediction value to the task hierarchical computation module.

[0089] S309: The task hierarchical calculation module selects the target data processing engine with the shortest calculation prediction time and / or the smallest calculation resource prediction value as the first data processing engine, and the first data processing engine is the data processing engine that executes the first target task.

[0090] S310: The task hierarchical calculation module sends a first prompt message to the task submission platform. The first prompt information in the first prompt message is used to instruct the task submission platform to send a target task instruction to the first data processing engine. The target task instruction is used to instruct the first data processing engine to execute the first target task.

[0091] In this way, the data processing engine that takes the shortest time to execute the task can be selected, thereby improving task execution efficiency; or the data processing engine that consumes the fewest computing resources to execute the task can be selected, thereby reducing the computing and management costs required by the data processing engine.
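The selection in S309 can be sketched as a minimum over the predictions returned by the candidate engines. The names `Prediction` and `select_engine`, the `prefer` parameter, and the use of the other metric as a tie-breaker are assumptions; the text only states that the engine with the shortest predicted time and / or the smallest predicted resource value is chosen.

```python
from typing import Dict, NamedTuple


class Prediction(NamedTuple):
    duration_s: float  # predicted computation time
    resources: float   # predicted computational resources


def select_engine(predictions: Dict[str, Prediction], prefer: str = "duration") -> str:
    """Pick the engine with the smallest preferred metric; ties use the other metric."""
    if not predictions:
        raise ValueError("no target data processing engine returned a prediction")
    if prefer == "duration":
        return min(predictions, key=lambda e: (predictions[e].duration_s, predictions[e].resources))
    if prefer == "resources":
        return min(predictions, key=lambda e: (predictions[e].resources, predictions[e].duration_s))
    raise ValueError("prefer must be 'duration' or 'resources'")


# Hypothetical engines and predicted values.
candidates = {
    "engine_a": Prediction(duration_s=120.0, resources=8.0),
    "engine_b": Prediction(duration_s=300.0, resources=4.0),
}
fastest = select_engine(candidates)
cheapest = select_engine(candidates, prefer="resources")
```

Here `fastest` is `engine_a` and `cheapest` is `engine_b`, which shows that the two criteria can point to different engines and that a deployment has to decide which one takes priority.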

[0092] In some embodiments, the data computing control layer may further include a data reading volume control module, which is used to obtain the amount of data read by the task on the data processing engine side. When the reading volume of the task during task execution is detected to be greater than a set threshold, the data reading volume control module terminates the task process.
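The read-volume check can be sketched as a running counter compared against the set threshold. The class name, the byte unit, and the boolean return convention are hypothetical; how the task process is actually terminated is outside this sketch.

```python
class ReadVolumeGuard:
    """Tracks the amount of data read by one task against a set threshold."""

    def __init__(self, threshold_bytes: int) -> None:
        self.threshold_bytes = threshold_bytes
        self.bytes_read = 0

    def record_read(self, nbytes: int) -> bool:
        """Add nbytes to the total; return True if the task should be terminated."""
        self.bytes_read += nbytes
        return self.bytes_read > self.threshold_bytes
```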

[0093] It should be understood that the execution order of the steps in the embodiment of Figure 3 is merely illustrative and may be varied. The execution order of the steps in the embodiment of Figure 3 may be adjusted, and / or one or more steps may be deleted, to obtain different embodiments, and the resulting embodiments still fall within the protection scope of the embodiments of this application.

[0094] The methods of the embodiments of this application have been described in detail above. The related devices, computer-readable storage media, and computer programs of the embodiments of this application are described below.

[0095] Please see Figure 5, which is a schematic diagram of the structure of a data processing device 50 provided in an embodiment of this application. The data processing device 50 may include a memory 501 and a processor 502; the detailed description of each unit is as follows:

[0096] Memory 501 is used to store program code.

[0097] Processor 502 is used to call program code stored in memory to perform the following steps:

[0098] The first information is obtained based on the data generated and / or called by the data processing engine in the data processing engine layer. The first information includes at least input path information and / or output path information, and task identification information.

[0099] Generate timestamp information for input and / or output paths based on the data generated and / or invoked by the data processing engine;

[0100] The first storage space and / or the second storage space are determined based on the input path information and / or the output path information and the timestamp information. The data in the first storage space is the data to be deleted, and the data in the second storage space is the data to be stored in cold backup.

[0101] If the first storage space exists, delete the data in the first storage space;

[0102] If a second storage space exists, the data in the second storage space will be cold-backed up.

[0103] In one possible implementation, the first storage space and / or the second storage space are determined based on input path information and / or output path information and timestamp information, specifically including:

[0104] Calculate the time difference between the current time and the first timestamp of the input path information and / or the time difference between the current time and the first timestamp of each output path information;

[0105] The first timestamp information is the latest timestamp information of the input path information and / or output path information;

[0106] The storage space indicated by the input path information and / or output path information with a time difference greater than or equal to the first threshold is determined as the first storage space;

[0107] The storage space indicated by input path information and / or output path information with a time difference less than a first threshold and greater than a second threshold is determined as the second storage space, where the first threshold is greater than the second threshold.

[0108] In one possible implementation, the first storage space and / or the second storage space are determined based on input path information and / or output path information, specifically including:

[0109] Count the number of timestamps for each input path information and / or output path information within the first time interval;

[0110] The storage space indicated by the input path information and / or output path information where the number of timestamp information is less than the third threshold is determined as the first storage space;

[0111] The storage space indicated by the input path information and / or output path information where the number of timestamp information is greater than the third threshold but less than the fourth threshold is determined as the second storage space; the third threshold is less than the fourth threshold.
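The count-based variant in the preceding steps can be sketched as follows. The function and argument names are hypothetical, and the interval is assumed to be closed on both ends. A count exactly equal to the third threshold matches neither rule as written, so this sketch leaves such a path untouched.

```python
from typing import Dict, List, Sequence, Tuple


def classify_by_access_count(
    access_times: Dict[str, Sequence[float]],
    interval_start: float,
    interval_end: float,
    third_threshold: int,
    fourth_threshold: int,
) -> Tuple[List[str], List[str]]:
    """Return (paths to delete, paths to cold-backup) from access counts."""
    if third_threshold >= fourth_threshold:
        raise ValueError("the third threshold must be less than the fourth")
    to_delete: List[str] = []
    to_cold_backup: List[str] = []
    for path, times in access_times.items():
        # Number of timestamps that fall within the first time interval.
        count = sum(1 for t in times if interval_start <= t <= interval_end)
        if count < third_threshold:
            to_delete.append(path)       # first storage space
        elif third_threshold < count < fourth_threshold:
            to_cold_backup.append(path)  # second storage space
    return to_delete, to_cold_backup
```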

[0112] Please see Figure 6, which is a schematic diagram of the structure of a data processing device 60 provided in an embodiment of this application. The data processing device 60 may include a memory 601, a processor 602, and a communication module 603; the detailed description of each unit is as follows:

[0113] The memory 601 is used to store program code.

[0114] The processor 602 is used to call the program code stored in memory to perform the following steps:

[0115] The communication module 603 receives first information sent from the data storage management module. The first information includes at least input path information and / or output path information, and task identification information.

[0116] A task relationship information network is constructed based on the first information. The task relationship information network is used to represent the upstream and downstream relationships between tasks.

[0117] Delete the first task in the task relationship information network. The first task is a task that has no downstream tasks in the task relationship information network.

[0118] In one possible implementation, the first task in the task relationship information network is deleted, including:

[0119] The process executing the first task is interrupted.

[0120] In one possible implementation, the first task in the task relationship information network is deleted, including:

[0121] Delete the data generated during the execution of the first task.

[0122] In one possible implementation, processor 602 is further used to call the program code stored in memory to perform the following steps:

[0123] The task layering calculation module receives task information sent by the task submission platform. This task information includes task identification information, task type information, and task resource consumption. The task identification information is used to identify the first target task, the task type information is used to identify the type of the first target task, and the task resource consumption is used to identify the amount of resources required to execute the first target task.

[0124] The communication module 603 sends a task test message to the target data processing engine in the data processing engine layer. The task test instruction in the task test message is used to instruct the target processing engine to calculate the predicted computation time required to execute the first target task and / or the predicted value of the computational resources required to execute the first target task. The target data processing engine is a data processing engine that can execute the task corresponding to the task type information. The task test message also includes the task resource consumption.

[0125] The communication module 603 receives the prediction duration and / or computational resource prediction values sent from the target data processing engine.

[0126] The communication module 603 sends a first prompt message to the task submission platform. The first prompt information in the first prompt message is used to instruct the task submission platform to send a target task instruction to the first data processing engine. The target task instruction is used to instruct the first data processing engine to execute the first target task. The first data processing engine is the target data processing engine with the shortest prediction time and / or the smallest predicted value of computing resources.

[0127] This application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the data processing methods described in the above embodiments and their various possible implementations.

[0128] This application provides a computer program that includes instructions that, when executed by a computer, enable a data processing device to execute the processes performed by the data processing device in the above embodiments and their various possible implementations.

[0129] It should be noted that the memory in the above embodiments can be read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, random access memory (RAM) or other types of dynamic storage devices capable of storing information and instructions, or electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto. The memory can exist independently and be connected to the processor via a bus. The memory can also be integrated with the processor.

[0130] The processor in the above embodiments may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the above scheme program.

[0131] For the foregoing method embodiments, in order to simplify the description, they are all expressed as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, because according to this application, some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0132] In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of the units described above is merely a logical functional division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.

[0133] The units described above as separate components may or may not be physically separate. Similarly, the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0134] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The aforementioned integrated unit can be implemented in hardware or as a software functional unit.

[0135] If the aforementioned integrated units are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in software form. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, server, or network device, specifically a processor in the computer device) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium may include various media capable of storing program code, such as a USB flash drive, portable hard drive, magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM).

[0136] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A data processing method, characterized in that the method is applied to a data processing system, which includes a data processing engine layer, a data storage management layer, a data computation management layer, and a technology efficiency improvement layer. The data processing engine layer includes multiple data processing engines; the data storage management layer includes a data storage management module; the data computation management layer includes a task-layered computation module and a task link elimination module; and the technology efficiency improvement layer includes an offline / online hybrid module, a big data engine architecture upgrade module, and a storage-computing intermediate layer. The data storage management module obtains first information based on data generated and / or invoked by the data processing engine in the data processing engine layer. This first information includes at least input path information and / or output path information, and task identification information. It generates timestamp information for the input path and / or output path based on the data generated and / or invoked by the data processing engine. Based on the input path information and / or the output path information and the timestamp information, it determines a first storage space and / or a second storage space. The data in the first storage space is data to be deleted, and the data in the second storage space is data to be cold-backed up. If the first storage space exists, the data in the first storage space is deleted. If the second storage space exists, the data in the second storage space is cold-backed up. The task link elimination module constructs a task relationship information network based on the first information and deletes the first task in the task relationship information network. The task relationship information network is used to represent the upstream and downstream relationships between tasks, and the first task is a task that has no downstream tasks in the task relationship information network.

2. The method as described in claim 1, characterized in that determining the first storage space and / or the second storage space based on the input path information and / or the output path information and the timestamp information specifically includes: calculating the time difference between the current time and the first timestamp of each input path information, and / or the time difference between the current time and the first timestamp of each output path information, where the first timestamp is the latest timestamp of the input path information and / or the output path information; determining the storage space indicated by the input path information and / or output path information whose time difference is greater than or equal to the first threshold as the first storage space; and determining the storage space indicated by the input path information and / or output path information whose time difference is less than the first threshold and greater than the second threshold as the second storage space, where the first threshold is greater than the second threshold.

3. The method as described in claim 1, characterized in that determining the first storage space and / or the second storage space based on the input path information and / or the output path information specifically includes: counting the number of timestamps for each input path information and / or output path information within the first time interval; determining the storage space indicated by the input path information and / or output path information whose number of timestamps is less than the third threshold as the first storage space; and determining the storage space indicated by the input path information and / or output path information whose number of timestamps is greater than the third threshold and less than the fourth threshold as the second storage space, where the third threshold is less than the fourth threshold.

4. The method as described in claim 1, characterized in that deleting the first task from the task relationship information network includes: interrupting the process that is executing the first task.

5. The method as described in claim 1, characterized in that deleting the first task from the task relationship information network includes: deleting the data generated during the execution of the first task.
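Identifying and removing a first task, that is, a task with no downstream tasks in the task relationship information network, can be sketched as follows. The adjacency-dictionary representation and the function names `find_tasks_without_downstream` and `remove_task` are assumptions for illustration. Interrupting the running process (claim 4) and deleting the generated data (claim 5) depend on the scheduler and storage system and are deliberately left out.

```python
def find_tasks_without_downstream(downstream):
    """Return the ids of tasks that have no downstream tasks, sorted by id.

    downstream: dict mapping each task id to the set of its downstream task
    ids (a hypothetical encoding of the task relationship information network).
    """
    return sorted(task for task, children in downstream.items() if not children)


def remove_task(downstream, task_id):
    """Return a new network with task_id removed from keys and from all edges."""
    return {
        task: {child for child in children if child != task_id}
        for task, children in downstream.items()
        if task != task_id
    }
```

Removing one task can leave its upstream task without any downstream, so a caller that wants to prune a whole dead branch would repeat the search after each removal.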

6. The method as described in claim 1, characterized in that the method further includes:
the task layered computing module receives task information sent by the task submission platform, wherein the task information includes task identification information, task type information, and task resource consumption; the task identification information is used to represent the first target task, the task type information is used to represent the type of the first target task, and the task resource consumption is used to represent the amount of resources required to execute the first target task;
the task layered computing module sends a task test message to the target data processing engine in the data processing engine layer, wherein the task test instruction in the task test message is used to instruct the target data processing engine to calculate the predicted duration required to execute the first target task and / or the predicted computing resource value required to execute the first target task; the target data processing engine is a data processing engine that can execute the task corresponding to the task type information; and the task test message also includes the task resource consumption;
the task layered computing module receives the predicted duration and / or the predicted computing resource value sent by the target data processing engine; and
the task layered computing module sends a first prompt message to the task submission platform, wherein the first prompt information in the first prompt message is used to instruct the task submission platform to send a target task instruction to the first data processing engine; the target task instruction is used to instruct the first data processing engine to execute the first target task; and the first data processing engine is the target data processing engine with the shortest predicted duration and / or the smallest predicted computing resource value.
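The final selection step of claim 6 can be sketched as below. The claim allows selection by shortest predicted duration and / or smallest predicted computing resource value without fixing how the two are combined; this sketch assumes duration is compared first and the resource value breaks ties. The `Prediction` type, the function `select_engine`, and the engine names used in any call are hypothetical placeholders.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    """One target engine's reply to the task test message (hypothetical shape)."""
    engine: str
    duration_s: float   # predicted duration to execute the first target task
    resources: float    # predicted computing resource value


def select_engine(predictions):
    """Pick the first data processing engine from the candidates' predictions.

    Assumption: the shortest predicted duration wins, and the smallest
    predicted resource value breaks ties between equal durations.
    """
    if not predictions:
        raise ValueError("no target data processing engine returned a prediction")
    best = min(predictions, key=lambda p: (p.duration_s, p.resources))
    return best.engine
```

A deployment that cares more about cost than latency could swap the order of the tuple in the `key` function, which stays within the "and / or" wording of the claim.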

7. A data processing apparatus, characterized in that it includes a unit for performing the data processing method as described in any one of claims 1-6.

8. A data processing device, characterized in that it includes a memory and a processor, wherein: the memory is used to store a computer program, the computer program including program instructions; and the processor is used to invoke the program instructions, causing the data processing device to perform the method as described in any one of claims 1-6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method as described in any one of claims 1-6.