A data lineage generation method and apparatus, electronic device, and storage medium
By constructing a target directed acyclic graph, a finer-grained data lineage is generated, solving the problem that existing technologies cannot track the data generation process, and enabling accurate characterization and visualization of data lineage.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
- Filing Date
- 2023-10-08
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies cannot effectively depict the fine-grained relationships of data, especially during data production and processing, where they cannot track the generation process and dependencies of specific data fields.
By constructing a target directed acyclic graph, the dependencies between target data nodes and task nodes are determined, generating a more granular data lineage, including the paths and assigned values of data nodes and task nodes, and executing a data generation process to generate the data lineage.
It enables a more granular characterization of data lineage, allowing for the tracking of dependencies and data generation paths at each node during the data generation process, thereby improving the accuracy and visualization capabilities of data lineage.
Figure CN117370630B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to the fields of data mining, data production and processing, and information flow technology, specifically to a data lineage generation method, apparatus, electronic device, and storage medium.
Background Technology
[0002] Data production and processing refers to the process of processing raw data to produce final, displayable data. Taking a search engine as an example, when a user searches, the search engine performs data generation processes such as web crawling, web page parsing, and content extraction on the raw data, obtaining data such as images, text, and links to be displayed on the search results page. In this case, the raw data can be a URL (Uniform Resource Locator).
[0003] During data production and processing, data lineage is particularly important when it is necessary to understand how a given piece of data was produced and processed, for example, which data or processing steps the data was obtained from, or which data can be generated from it.
Summary of the Invention
[0004] This disclosure provides a method, apparatus, electronic device, and storage medium for generating data lineage.
[0005] According to a first aspect of this disclosure, a method for generating data lineage is provided, comprising:
[0006] In response to receiving a data lineage generation request, the target data nodes to be generated in the target directed acyclic graph are determined; wherein, the target directed acyclic graph includes data nodes and task nodes, each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node.
[0007] Determine the target path in the target directed acyclic graph; wherein, the target path is the data generation path containing data nodes and task nodes that is required when generating the data of the target data node;
[0008] Obtain the assigned data of the specified data node in the target path;
[0009] Based on the assigned data, a data generation process is executed according to the target path to obtain the target data;
[0010] Based on the target data, the assigned data, and the target path, a data lineage for the target data node is generated.
[0011] According to a second aspect of this disclosure, a data lineage generation apparatus is provided, comprising:
[0012] The first determining module is used to determine the target data node to be generated in the target directed acyclic graph in response to receiving a data lineage generation request; wherein the target directed acyclic graph includes data nodes and task nodes, each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node.
[0013] The second determining module is used to determine the target path in the target directed acyclic graph; wherein, the target path is a data generation path containing data nodes and task nodes that is required when generating the data of the target data node;
[0014] The acquisition module is used to acquire the assigned data of a specified data node in the target path;
[0015] The execution module is used to perform a data generation process according to the assigned data and the target path to obtain the target data;
[0016] The generation module is used to generate a data lineage for the target data node based on the target data, the assigned data, and the target path.
[0017] According to a third aspect of this disclosure, an electronic device is provided, comprising:
[0018] At least one processor; and
[0019] A memory communicatively connected to the at least one processor; wherein,
[0020] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform any of the data lineage generation methods described above.
[0021] According to a fourth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are configured to cause the computer to perform any of the data lineage generation methods described above.
[0022] According to a fifth aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements any of the data lineage generation methods described above.
[0023] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Description of the Drawings
[0024] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:
[0025] Figure 1 is a flowchart of a data lineage generation method provided in this disclosure;
[0026] Figure 2 is a schematic diagram of the structural relationship between strategies and tasks provided in this disclosure;
[0027] Figure 3 is a schematic diagram of a directed acyclic graph provided in this disclosure;
[0028] Figure 4 is a schematic diagram of a directed acyclic graph containing only task nodes, as provided in this disclosure;
[0029] Figure 5 is another schematic diagram of a directed acyclic graph provided in this disclosure;
[0030] Figure 6 is yet another schematic diagram of a directed acyclic graph provided in this disclosure;
[0031] Figure 7 is a further schematic diagram of a directed acyclic graph provided in this disclosure;
[0032] Figure 8 is a schematic diagram of a data lineage generation apparatus provided in this disclosure;
[0033] Figure 9 is a block diagram of an electronic device provided in this disclosure.
Detailed Implementation
[0034] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0035] Regarding the data production and processing process, taking a search engine as an example, when a user searches, the search engine crawls web pages based on the user's input, such as a URL, to obtain HTML (Hypertext Markup Language) web pages. It then parses these HTML pages, extracts multimedia content such as images and videos, extracts features from this multimedia content, and builds an index to provide features for online ranking, ultimately providing the data to be displayed to the user. Meanwhile, offline data processing systems can process the data, generating multimedia content and features from the raw data that can be directly used by online systems.
[0036] Understanding the generation and processing of a particular data field is an application scenario of data lineage. For example, in a search scenario, what steps does the title in the search results go through, what other data fields does it depend on for generation, and how are those other data fields produced and processed?
[0037] Machine learning methods are now commonly used for feature extraction. This requires a large amount of raw input data, which may itself be generated in other ways; multiple raw data sets are processed to ultimately obtain high-quality features. Iterating on upstream data affects downstream features, and it must be known which processing steps those affected downstream features influence in turn. When another input feature is added during the processing of downstream features, does the data processing need to be re-executed from scratch given the existing data? Questions like these place high demands on data lineage.
[0038] Furthermore, data is a valuable asset and a core element in business logic. The demands for data digitization and intelligentization also place higher requirements on the accuracy of data lineage.
[0039] Some offline data processing systems exist in related technologies, such as Flink and Spark. These systems are usually geared towards the data production and processing process itself and can combine its steps in a certain topological order. Generally, a Directed Acyclic Graph (DAG) can be used to characterize the data production and processing process, but these systems do not have the ability to characterize data lineage.
[0040] Related technologies also include workflow scheduling platforms or systems that can characterize coarse-grained data lineage, such as at file-level or dataset-level granularity. For example, Compass (a site management system) schedules Hadoop (a distributed computing framework) tasks, orchestrating multiple Hadoop tasks using a DAG. Then, based on dependencies and execution order, each Hadoop task is executed. Each Hadoop task places its output data file in a specific location, which the next task may then read, ultimately completing the workflow. Another example is DolphinScheduler, a distributed workflow task scheduling system that allows developers to create workflows, which also use a DAG to characterize the generation process.
[0041] The above methods describe data lineage only at a coarse granularity. They can only describe the overall output of a task, such as at file granularity or dataset granularity, and cannot describe the fine-grained data lineage of data fields or specific features.
[0042] Based on this, embodiments of the present disclosure provide a data lineage generation method, apparatus, electronic device, and storage medium to generate finer-grained data lineages.
[0043] The following section first introduces a data lineage generation method provided in this disclosure.
[0044] The data lineage generation method disclosed herein can be applied to electronic devices, which may be terminal devices or servers. For example, terminal devices may include mobile phones, computers, etc. This disclosure does not limit the specific form of the electronic device. Furthermore, the method can be applied to any scenario in which data lineage is needed to view how the upstream data, downstream data, or other dependent data of a given piece of data is generated. This disclosure does not limit the specific scenario.
[0045] Specifically, the entity executing this data lineage generation method can be a data lineage generation apparatus. For example, when applied to a terminal device, the apparatus can be functional software running on the terminal device, such as software used to generate data lineages; it can also be a plugin for an existing client, such as a plugin in a client used for data management, in which case it can generate corresponding data lineages for the managed data. When applied to a server, the apparatus can be a computer program running on the server, such as a functional module in the server-side program corresponding to the client used for data management.
[0046] This disclosure provides a data lineage generation method, which may include the following steps:
[0047] In response to receiving a data lineage generation request, the target data nodes to be generated in the target directed acyclic graph are determined; wherein, the target directed acyclic graph includes data nodes and task nodes, each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node.
[0048] Determine the target path in the target directed acyclic graph; wherein, the target path is the data generation path containing data nodes and task nodes that is required when generating the data of the target data node;
[0049] Obtain the assigned data of the specified data node in the target path;
[0050] Based on the assigned data, a data generation process is executed according to the target path to obtain the target data;
[0051] Based on the target data, the assigned data, and the target path, a data lineage for the target data node is generated.
[0052] In this scheme, a target directed acyclic graph (DAG) is constructed in advance. It includes data nodes and task nodes; each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node. This disclosure can therefore generate the data lineage of a data node based on the target DAG. Specifically, in response to receiving a data lineage generation request, the target data node in the target DAG for which the data lineage is to be generated is determined, together with the target path, that is, the data generation path containing the data nodes and task nodes required to generate the data of the target data node. The assigned data of the specified data node in the target path is then obtained, and the data generation process is executed along the target path based on the assigned data to obtain the target data. Here, the target path represents the production relationship between task nodes and data nodes when the data of the target data node is generated, and the target data and the assigned data can be specific numerical values or parameters. The data lineage of the target data node is generated based on the target data, the assigned data, and the target path. The resulting data lineage thus includes the data of the target data node, the specific data used in the data generation process, and the production relationship between data nodes and task nodes when that data is generated, which makes it finer-grained.
[0053] The following description, in conjunction with the accompanying drawings, provides an exemplary method for generating data lineage according to this disclosure.
[0054] As shown in Figure 1, the data lineage generation method provided in this disclosure may include the following steps:
[0055] S101: In response to receiving a data lineage generation request, determine the target data node in the target directed acyclic graph for which a data lineage is to be generated;
[0056] The target directed acyclic graph includes data nodes and task nodes. Each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node.
[0057] It can be understood that the data assigned to the upstream data node of each task node can be used as the data assigned to the input parameters of that task node, and that the task execution result of each task node can be used as the data assigned to its downstream data node.
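The structure described in S101 can be pictured with a small sketch. The class name `LineageGraph` and its methods are hypothetical and do not come from the disclosure; the rule that two task nodes are never linked directly is an assumption of this sketch, not a stated requirement.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Hypothetical container for a DAG with data nodes and task nodes."""
    kinds: dict = field(default_factory=dict)  # node name -> "data" or "task"
    edges: set = field(default_factory=set)    # (upstream, downstream) pairs

    def add_node(self, name, kind):
        if kind not in ("data", "task"):
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind

    def add_edge(self, upstream, downstream):
        # Assumption of this sketch: task nodes never connect directly to
        # each other; a data node always sits between them.
        if self.kinds[upstream] == "task" and self.kinds[downstream] == "task":
            raise ValueError("task nodes must be linked through a data node")
        self.edges.add((upstream, downstream))

    def upstream(self, name):
        return sorted(u for u, d in self.edges if d == name)

    def downstream(self, name):
        return sorted(d for u, d in self.edges if u == name)


graph = LineageGraph()
graph.add_node("title", "data")
graph.add_node("tid_1", "task")
graph.add_node("title_size", "data")
graph.add_edge("title", "tid_1")       # title feeds the input of tid_1
graph.add_edge("tid_1", "title_size")  # the result of tid_1 is assigned to title_size
```

Here `graph.upstream("tid_1")` lists the data nodes whose assigned data become the task's inputs, and `graph.downstream("tid_1")` lists the data nodes that receive its result.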
[0058] In this disclosure, for example, a user can issue a data lineage generation request by clicking a data node in the target directed acyclic graph or by entering the name of a data node. Then, in response to receiving the data lineage generation request, this disclosure can determine the target data node in the target directed acyclic graph for which a data lineage is to be generated. Accordingly, the data node indicated in the data lineage generation request in the target directed acyclic graph can be used as the target data node. For example, the data node in the target directed acyclic graph clicked by the user can be used as the target data node, or the data node in the target directed acyclic graph with the name of the data node entered by the user can be used as the target data node.
[0059] Of course, the data lineage generation request can also be a request automatically sent by other devices or modules. This request can contain the name of a data node, which can then be identified as the target data node in the target directed acyclic graph.
[0060] It should be noted that the above method for determining the target data node to be generated in the target directed acyclic graph is merely an example and should not constitute a limitation of this disclosure.
[0061] Optionally, the target directed acyclic graph can be pre-constructed according to user requirements, and the construction methods of the target directed acyclic graph include:
[0062] Obtain configuration information about data nodes and task nodes based on the input from the configuration interface; wherein, the configuration information includes: the formal parameters of the data represented by each data node to be constructed, the execution strategy of the processing task represented by each task node to be constructed, and the dependency relationship between each data node and task node.
[0063] Based on the formal parameters and execution strategy in the configuration information, the data node generation interface and the task node generation interface are called to generate data nodes and task nodes respectively. For any data node and any task node, according to the dependency relationship in the configuration information, an edge connecting the data node and the task node is generated to obtain the target directed acyclic graph.
[0064] When constructing the target directed acyclic graph, configuration information about data nodes and task nodes can be obtained first based on the configuration interface. Users can input configuration information about data nodes and task nodes through the configuration interface, such as the formal parameters of the data represented by each data node to be constructed, the execution strategy of the processing task represented by each task node to be constructed, and the dependencies between each data node and task node. Of course, the electronic device executing the data lineage generation method provided in this disclosure can also obtain this configuration information from other platforms, such as from a scheduling system.
[0065] This disclosure also provides an interface for generating a target directed acyclic graph. After obtaining the configuration information of data nodes and task nodes, the data node generation interface and task node generation interface can be called based on the formal parameters and execution strategy in the configuration information to generate data nodes and task nodes respectively. According to the dependency relationship in the configuration information, an edge connecting any data node and any task node is generated to obtain the target directed acyclic graph.
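A minimal sketch of this construction step follows, assuming the configuration information arrives as a plain dictionary. The functions `make_data_node`, `make_task_node`, and `build_graph` and the configuration keys are hypothetical stand-ins for the generation interfaces, whose actual signatures the disclosure does not give. The acyclicity check uses Python's standard `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter


def make_data_node(name, formal_param):
    # Hypothetical stand-in for the data node generation interface.
    return {"kind": "data", "name": name, "formal_param": formal_param}


def make_task_node(name, strategy):
    # Hypothetical stand-in for the task node generation interface.
    return {"kind": "task", "name": name, "strategy": strategy}


def build_graph(config):
    """Create nodes from the configuration, add edges, and verify acyclicity."""
    nodes = {}
    for name, formal_param in config["data_nodes"].items():
        nodes[name] = make_data_node(name, formal_param)
    for name, strategy in config["task_nodes"].items():
        nodes[name] = make_task_node(name, strategy)

    predecessors = {name: set() for name in nodes}
    for upstream, downstream in config["dependencies"]:
        if upstream not in nodes or downstream not in nodes:
            raise KeyError(f"edge refers to an unknown node: {upstream}->{downstream}")
        predecessors[downstream].add(upstream)

    # static_order() raises graphlib.CycleError if the edges form a cycle.
    order = list(TopologicalSorter(predecessors).static_order())
    return nodes, predecessors, order


config = {
    "data_nodes": {"title": "str", "title_size": "int"},
    "task_nodes": {"tid_1": "TextSize"},
    "dependencies": [("title", "tid_1"), ("tid_1", "title_size")],
}
nodes, predecessors, order = build_graph(config)
```

Rejecting cyclic configurations at build time keeps the later path search and execution steps from looping.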
[0066] Furthermore, the data types represented by different data nodes to be constructed can be different. The target directed acyclic graph (DAG) can contain data types including source data, ordinary data, state data, and abstract data, and may also include concrete data, etc., which will be described in detail in subsequent embodiments and will not be elaborated here. The data node generation interface provided in this disclosure can include data node generation interfaces for multiple data types. This disclosure also provides other interfaces for generating the target DAG, such as interfaces for deleting data nodes, deleting task nodes, deleting current graph content, etc., to meet different graph construction requirements. This disclosure also provides other interfaces for operating on the target DAG, which will be described in detail in subsequent embodiments and will not be elaborated here.
[0067] The solution provided in this disclosure can obtain configuration information about data nodes and task nodes input through a configuration interface. Based on the formal parameters and execution strategies in the configuration information, it calls the data node generation interface and the task node generation interface to generate data nodes and task nodes respectively, and generates edges connecting the data nodes and task nodes according to the dependencies in the configuration information. However, the directed acyclic graphs constructed in existing technologies are usually only directed acyclic graphs about task nodes, excluding data nodes. The data lineages generated in existing technologies are also only coarse-grained, failing to meet the requirements for finer data lineages. In this disclosure, the target directed acyclic graph includes data nodes and task nodes, and edges representing dependencies connect the data nodes and task nodes, allowing for the subsequent generation of finer-grained data lineages based on the task nodes and data nodes.
[0068] The directed acyclic graph provided in this disclosure, as shown in Figure 3, includes the data nodes title, a, and title_size. tid_1 is the task node connected downstream of the title node. TextSize(text:title) assigns title to the input parameter text of tid_1 and executes TextSize; once the count has been calculated, it is assigned to title_size, the data node downstream of tid_1. At this point, a and title_size are the data nodes connected upstream of tid_2. Add(num1:a,num2:title_size) assigns a to the input parameter num1 of task node tid_2 and title_size to the input parameter num2, and executes Add to obtain the sum. The data nodes connected downstream of tid_2 have an observation condition equal(four); when the title node satisfies this observation condition, the data nodes connected downstream of tid_2 can be assigned the sum title_size + a. TextSize and Add are the execution logic configured for task nodes tid_1 and tid_2, respectively.
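The Figure 3 example can be traced by hand with a few lines of code. This is only an illustration under assumptions: TextSize is taken to count characters, the sample values for `title` and `a` are made up, the data node downstream of tid_2 is not named in the text and is called `total` here, and the observation condition equal(four) is left out because its exact semantics are not spelled out.

```python
def text_size(text):
    # Assumed behaviour of TextSize: count the characters of the text.
    return len(text)


def add(num1, num2):
    # Assumed behaviour of Add: sum the two inputs.
    return num1 + num2


# Made-up assigned data for the two data nodes that need initial values.
values = {"title": "data", "a": 3}

# tid_1: TextSize(text:title) -> title_size
values["title_size"] = text_size(text=values["title"])

# tid_2: Add(num1:a, num2:title_size) -> total (hypothetical node name)
values["total"] = add(num1=values["a"], num2=values["title_size"])
```

With these sample inputs, `title_size` is 4 and `total` is 7, so every intermediate value along the path is available for the lineage, not just the final output.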
[0069] Of course, it should be noted that the target directed acyclic graph can also display the running status of each task, such as: completed startup, not triggered execution, skipped execution, successfully triggered and completed execution, etc. Furthermore, the edge types in the target directed acyclic graph can vary depending on their data type and the types of the two connected nodes. The representation types of data nodes and task nodes in the target directed acyclic graph can also be different; for example, data nodes can be circles, and task nodes can be rectangles. Different types of data nodes can also be represented by double circles or ellipses. In different states, data nodes and task nodes can have the color corresponding to that state. Specific details will be provided in subsequent embodiments and will not be elaborated here.
[0070] Of course, when not executed, data nodes in the target directed acyclic graph can have their formal parameters, task nodes can have their configured execution logic, and so on. When the target directed acyclic graph is executed, values can be assigned to data nodes. Any task node can use the assigned data of its upstream data node as input to execute the processing task represented by the task node. Of course, formal parameters can also be classified, such as: required input formal parameters, optional input formal parameters, etc. Whether optional input formal parameters are assigned values does not affect the execution of the task node that uses the optional input formal parameter as input.
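The distinction between required and optional input formal parameters can be sketched as follows. `FormalParam`, `can_run`, `bind_arguments`, and the optional parameter `scale` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormalParam:
    name: str
    required: bool = True


def can_run(params, assigned):
    """A task may run once every required formal parameter has a value."""
    return all(p.name in assigned for p in params if p.required)


def bind_arguments(params, assigned):
    """Collect the actual arguments; unassigned optional parameters are skipped."""
    return {p.name: assigned[p.name] for p in params if p.name in assigned}


params = [
    FormalParam("num1"),
    FormalParam("num2"),
    FormalParam("scale", required=False),  # hypothetical optional input
]
ready = can_run(params, {"num1": 3, "num2": 4})
blocked = can_run(params, {"num1": 3, "scale": 2})
arguments = bind_arguments(params, {"num1": 3, "num2": 4})
```

In this sketch the missing `scale` value does not block execution, while a missing `num2` does.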
[0071] S102: Determine the target path in the target directed acyclic graph;
[0072] The target path is the data generation path that includes data nodes and task nodes, which is required when generating the data of the target data node.
[0073] In this disclosure, a finer-grained data lineage of the target data node is generated by executing the data production and processing process that generates the data of the target data node. To that end, after the target data node is determined, the data generation path in the target directed acyclic graph that contains the data nodes and task nodes required to generate the data of the target data node, that is, the target path, can be determined.
[0074] For example, for a target data node, the data generation path containing the data nodes and task nodes it requires can be derived backward through the target directed acyclic graph. Adjacent data nodes and task nodes on this path are connected by edges that represent their dependencies. Subsequently, the target path can be executed, that is, the data production and processing process for the target data node can be executed, thereby generating the data lineage of the target data node at the data granularity of the actual parameters.
[0075] Specifically, starting from the target data node, a reverse traversal can be performed in the target directed acyclic graph until a data node with existing data is reached. This data node is the specified data node. At this point, the upstream data nodes and task nodes of the specified data node can be omitted, and the data production process can start from the specified data node, thereby obtaining the data of the target data node.
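The reverse traversal can be sketched as a depth-first walk over upstream edges. The function `find_target_path` and the nodes `raw` and `tid_0` are hypothetical. The sketch also assumes that a data node with no upstream nodes counts as a specified data node, since it can only receive a value by assignment; the disclosure does not state this case explicitly.

```python
def find_target_path(predecessors, kinds, target, existing):
    """Walk upstream from the target until data nodes with existing data are hit.

    Returns the set of nodes on the target path and the sorted list of
    specified data nodes (the points where the walk stops).
    """
    path, specified = set(), []
    stack = [target]
    while stack:
        node = stack.pop()
        if node in path:
            continue
        path.add(node)
        upstream = predecessors.get(node, set())
        stops_here = (
            kinds[node] == "data"
            and node != target
            and (node in existing or not upstream)
        )
        if stops_here:
            specified.append(node)  # everything above this node is omitted
            continue
        stack.extend(sorted(upstream))
    return path, sorted(specified)


kinds = {
    "raw": "data", "tid_0": "task", "title": "data", "tid_1": "task",
    "title_size": "data", "a": "data", "tid_2": "task", "total": "data",
}
predecessors = {
    "tid_0": {"raw"}, "title": {"tid_0"}, "tid_1": {"title"},
    "title_size": {"tid_1"}, "tid_2": {"a", "title_size"}, "total": {"tid_2"},
}
path, specified = find_target_path(predecessors, kinds, "total", existing={"title"})
```

Because `title` already has data, the walk stops there and the upstream nodes `raw` and `tid_0` are left out of the path.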
[0076] S103: Obtain the assigned data of the specified data node in the target path;
[0077] Wherein, the specified data node is the data node that needs to be initially assigned a value when generating the data of the target data node;
[0078] After determining the target path, there are data nodes in the target path that need to be initially assigned values when generating data for the target data nodes, i.e., specified data nodes. In order to execute the data generation process of the target path, the assigned data of the specified data nodes in the target path can also be obtained.
[0079] For example, the assigned data of the specified data node can be included in the data lineage generation request, in which case it can be obtained directly from the request. The assigned data can also be obtained interactively: for example, a prompt can be displayed so that the user inputs the assigned data of the specified data node, or a request can be sent to other platforms or devices that hold the assigned data of the specified data node, so that they return it.
[0080] It should be noted that any method that can obtain the assigned data of a specified data node in the target path is applicable to this disclosure, and this disclosure does not limit it.
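One way to combine the sources of assigned data mentioned in S103 is to read the request first and fall back to a callback. This is a sketch only: `resolve_assignments`, the `assignments` key of the request, and the `ask` callback (standing in for a user prompt or a query to another platform) are all hypothetical.

```python
def resolve_assignments(specified, request, ask=None):
    """Gather assigned data for each specified data node."""
    provided = request.get("assignments", {})
    resolved = {}
    for name in specified:
        if name in provided:
            resolved[name] = provided[name]  # carried in the request itself
        elif ask is not None:
            resolved[name] = ask(name)       # obtained interactively or remotely
        else:
            raise KeyError(f"no assigned data available for {name}")
    return resolved


request = {"target": "total", "assignments": {"title": "data"}}
assigned = resolve_assignments(["a", "title"], request, ask=lambda name: 3)
```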
[0081] S104: Based on the assigned data, execute the data generation process according to the target path to obtain the target data;
[0082] After obtaining the assigned data for the specified data node, the data generation process can be executed according to the target path. For example, the assigned data can be used to assign values to the specified data node, and the processing tasks represented by the task nodes can be executed according to the node order in the target path to obtain the target data.
[0083] Specifically, the specified data node includes the starting node in the target path;
[0084] The step of performing a data generation process according to the assigned data and the target path to obtain the target data includes:
[0085] The starting node in the target path is assigned the value of the assigned data. Based on the assigned starting node, the execution of the processing task represented by the task node in the target path or the assignment of the data node in the target path is triggered according to the node order in the target path to perform data generation processing and obtain the target data.
[0086] The data used by any processing task during execution includes: data assigned by the connected upstream data node; the data used by any data node during assignment includes: the task execution result of the connected upstream task node or data assigned by the connected upstream data node.
[0087] The specified data node can include the starting node in the target path. During the data generation process, the starting node in the target path can be assigned the obtained assigned data. Based on the assigned starting node, the execution of the processing task represented by the task node in the target path or the assignment of data nodes in the target path can be triggered according to the node order in the target path to generate data and obtain the target data. Specifically, after the starting node is assigned a value according to the node order in the target path, the assigned data can be used as the input to the downstream task node of the starting node. That is, the data used by any processing task during execution is the data obtained by assigning a value to the upstream data node connected to it, thereby executing the processing task represented by that task node. The assignment of data nodes or the execution of processing tasks represented by task nodes are triggered sequentially according to the node order in the target path. Of course, there can also be other data nodes in the target path besides the starting node and the target data node. The data used by these other data nodes during assignment can be the execution result of the task of the upstream task node connected to them, or the data obtained by assigning a value to the upstream data node connected to them. The target data is the data obtained by the target data node. After the other nodes in the target path have completed their execution, the task execution results of the task nodes connected upstream of the target data node can be used as the target data. Of course, there can also be data nodes connected upstream of the target data node. In this case, the data obtained by assigning values to the data nodes connected upstream can be used as the target data.
[0088] It is understandable that the specified data node includes the starting node in the target path, but the starting node in the target path is not necessarily the source node in the target directed acyclic graph.
[0089] When executing the data generation process, the specified data nodes can include the starting node in the target path. The starting node can be assigned values, such as assigning specific actual parameters to the formal parameters of the starting node. Based on the assigned starting node, the data generation process is executed according to the node order of the target path. Through the assignment of data nodes or the execution of the processing tasks represented by task nodes, the target data can be generated. Subsequently, based on the data generation process performed on the target path and its specific data, such as the target data and the assigned values, a more granular data lineage about the target data node can be generated.
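The execution along the target path described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function `run_target_path`, the task registry layout, and the node names `url`, `parse`, `html`, `extract`, and `title` are all hypothetical, and the sketch only covers the case in which each data node is fed by a task node.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical task registry: task name -> (input data nodes, output data node, function).
Task = Tuple[List[str], str, Callable[..., object]]


def run_target_path(
    path: List[str],
    tasks: Dict[str, Task],
    assigned: Dict[str, object],
) -> Dict[str, object]:
    """Walk the target path in node order and return every data node's value."""
    values = dict(assigned)  # assigned data of the specified data node(s)
    for node in path:
        if node in tasks:  # task node: execute it using its upstream data
            inputs, output, fn = tasks[node]
            values[output] = fn(*(values[name] for name in inputs))
        # data node: its value was either assigned up front or just produced
    return values


# Assumed toy tasks standing in for web parsing and content extraction.
tasks: Dict[str, Task] = {
    "parse": (["url"], "html", lambda url: f"<title>page at {url}</title>"),
    "extract": (["html"], "title", lambda html: html[len("<title>"):-len("</title>")]),
}
path = ["url", "parse", "html", "extract", "title"]

values = run_target_path(path, tasks, {"url": "example.com"})
target_data = values["title"]
```

Here `url` plays the role of the starting node, `html` is an intermediate data node, and `title` is the target data node whose value becomes the target data.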
[0090] Optionally, the designated data node further includes: intermediate nodes in the target path;
[0091] The data used when assigning values to intermediate nodes in the target path is: the obtained assignment data of the intermediate nodes.
[0092] It is understandable that the data nodes that need to be initially assigned values when generating the data of the target data node can also be intermediate data nodes in the target path. That is, the specified data nodes also include intermediate nodes in the target path. For example, suppose an intermediate task node in the target path has two upstream data nodes, a first data node and a second data node, whose values are needed when the processing task represented by this task node is executed. The data used for assigning a value to the first data node is the obtained assignment data of the first data node. The data used for assigning a value to the second data node is the task execution result of the upstream task node connected to the second data node. In this case, the first data node is a data node that needs to be initially assigned a value, that is, a specified data node, and the first data node is an intermediate node in the target path.
[0093] In this disclosure, the specified data node may also include intermediate nodes in the target path. In this case, the intermediate node can be assigned a value based on the obtained assignment data of the intermediate node, and the target path can be executed. This can ensure the smooth execution of the target path and obtain the target data under the needs of different scenarios.
[0094] Optionally, the starting node in the target path is the source node in the target directed acyclic graph or a data node other than the source node;
[0095] The source node is a data node that has no upstream node.
[0096] The starting node in the target path can be the source node in the target directed acyclic graph, such as the first data node. The starting node can also be other data nodes besides the source node in the target directed acyclic graph, such as intermediate data nodes between the source node and the target data node in the target directed acyclic graph, etc. This disclosure does not limit this.
[0097] When the starting node in the target path is a data node other than the source node, the data generation process can begin from the starting node in the target directed acyclic graph (DAG) to obtain the target data. In other words, this disclosure supports starting from a data node in the middle of the target DAG and executing a partial data generation process for that DAG, instead of always starting from the source node in the target DAG, thus improving the efficiency of subsequent data lineage generation.
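As an illustrative sketch of partial execution, the set of tasks that must run can be found by walking upstream from the target data node and stopping at any data node whose value is already known. The function `tasks_needed`, the `producers` mapping, and the task names below are hypothetical and are not part of the disclosure.

```python
from typing import Dict, List, Set, Tuple


def tasks_needed(
    target: str,
    producers: Dict[str, Tuple[str, List[str]]],
    known: Set[str],
) -> List[str]:
    """Return the tasks to execute, ordered from upstream to downstream.

    producers maps a data node to (producing task, input data nodes of that task).
    known is the set of data nodes that already have assigned data.
    """
    ordered: List[str] = []
    visited: Set[str] = set()

    def visit(data: str) -> None:
        if data in known or data in visited:
            return  # already assigned: nothing upstream of it needs to run
        visited.add(data)
        task, inputs = producers[data]
        for upstream in inputs:
            visit(upstream)
        if task not in ordered:
            ordered.append(task)

    visit(target)
    return ordered


producers = {
    "html": ("crawl", ["url"]),
    "text": ("parse", ["html"]),
    "title": ("extract", ["text"]),
}

from_source = tasks_needed("title", producers, known={"url"})
from_middle = tasks_needed("title", producers, known={"text"})
```

Starting from the source node `url` requires all three tasks, while starting from the intermediate data node `text` requires only `extract`, which is where the efficiency gain described above comes from.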
[0098] S105: Generate a data lineage for the target data node based on the target data, the assigned data, and the target path;
[0099] The target path can characterize the data generation process for the target data node. After obtaining the target data, a more granular data lineage can be generated for the specific data of the target data node based on the target data, the assigned data, and the target path.
[0100] Specifically, generating the data lineage of the target data node based on the target data, the assigned data, and the target path includes:
[0101] Determine the subgraph representing the target path in the target directed acyclic graph, and obtain the target subgraph;
[0102] At least the specified data nodes in the target subgraph are assigned values based on the assigned data, and the target data nodes in the target subgraph are assigned values based on the target data, to obtain a lineage graph representing the data lineage of the target data nodes.
[0103] This disclosure can generate data lineages of target data nodes based on a target directed acyclic graph (DAG). The generated data lineages can be represented by a lineage graph, which can be a DAG. When generating the data lineages of target data nodes, a subgraph representing the target path within the target DAG can be determined to obtain the target subgraph. That is, from the target DAG, a portion of the DAG representing the target path is selected to obtain the target subgraph. To generate more granular data lineages, values can be assigned to specified data nodes in the target subgraph based on assigned data, and values can be assigned to target data nodes in the target subgraph based on target data, thus obtaining a lineage graph representing the data lineages of the target data nodes.
[0104] As can be seen, this disclosure can first determine the target subgraph representing the target path in the target directed acyclic graph, and then assign values to the corresponding data nodes in the target subgraph using the assigned data and the target data, thereby obtaining a more granular lineage graph representing the data lineage of the target data node.
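The two steps above (select the target subgraph, then assign values to its data nodes) can be sketched as follows. This is only an illustration under stated assumptions: the function `build_lineage_graph`, the dictionary layout of the result, and the choice to take the subgraph induced by the path nodes are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def build_lineage_graph(
    dag_edges: List[Edge],
    path: List[str],
    assigned: Dict[str, object],
    target_node: str,
    target_data: object,
) -> Dict[str, object]:
    """Select the subgraph for the target path and attach concrete values to it."""
    on_path = set(path)
    # Keep only the edges whose two endpoints both lie on the target path.
    sub_edges = [(u, v) for (u, v) in dag_edges if u in on_path and v in on_path]
    values = dict(assigned)           # values of the specified data node(s)
    values[target_node] = target_data  # value of the target data node
    return {"nodes": list(path), "edges": sub_edges, "values": values}


dag_edges = [
    ("url", "parse"), ("parse", "html"),
    ("html", "extract"), ("extract", "title"),
    ("html", "count_links"), ("count_links", "link_count"),  # not on the target path
]
path = ["url", "parse", "html", "extract", "title"]

lineage = build_lineage_graph(dag_edges, path, {"url": "example.com"}, "title", "Example")
```

The branch through `count_links` is dropped because it does not contribute to the target data node, and the remaining graph carries the concrete values of `url` and `title`.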
[0105] Optionally, the target subgraph may also contain other data nodes besides the target data node and the specified data node, and the method further includes:
[0106] Determine the data assigned to data nodes other than the specified data node and the target data node used in the target data generation process;
[0107] The step of assigning values to the specified data nodes in the target subgraph based at least on the assigned data, and assigning values to the target data nodes in the target subgraph based on the target data, to obtain a lineage graph representing the data lineage of the target data nodes, includes:
[0108] The specified data node in the target subgraph is assigned a value based on the assigned data, the target data node in the target subgraph is assigned a value based on the target data, and the other data nodes are assigned a value based on the determined data to obtain a lineage graph representing the data lineage of the target data node.
[0109] When generating the lineage graph of data lineage, it is also possible to determine the data assigned to other data nodes besides the specified data node and the target data node used in the target data generation process. This data can be understood as the intermediate data generated when the target data is generated using the assigned data of the specified data node. When generating the lineage graph of the target data node, the specified data node in the target subgraph can be assigned a value based on the assigned data, the target data node in the target subgraph can be assigned a value based on the target data, and other data nodes can be assigned a value based on the determined data, i.e., the intermediate data, to obtain the lineage graph representing the data lineage of the target data node.
[0110] In this way, when the target subgraph contains data nodes other than the target data node and the specified data node, the data lineage generation method provided in this disclosure can also determine the data assigned to those other data nodes and assign it to the corresponding data nodes in the target subgraph. Because the target data node, the specified data node, and the other data nodes are all assigned values, every data node in the target subgraph carries a value. This yields a complete, finer-grained lineage graph that contains specific values and represents the data lineage of the target data node.
[0111] For the target subgraph shown in Figure 3, values can be assigned to its data nodes, and a lineage graph corresponding to the data lineage of that target subgraph can be rendered. The resulting lineage graph is shown in Figure 5.
[0112] The technical solutions disclosed herein involve the collection, storage, use, processing, transmission, provision, and publication of directed acyclic graphs, all of which comply with relevant laws and regulations and do not violate public order and good morals.
[0113] It should be noted that the data and tasks in this embodiment are from publicly available datasets.
[0114] In this scheme, a target directed acyclic graph (DAG) is pre-constructed, which includes data nodes and task nodes. Each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node. Therefore, this disclosure can generate the data lineage of data nodes based on the target DAG. Specifically, in response to receiving a data lineage generation request, the target data node in the target DAG for which the data lineage is to be generated can be determined, together with the target path, i.e., the data generation path containing the data nodes and task nodes involved in generating the data of the target data node. Then, the assigned data of the specified data node to be assigned in the target path can be obtained, and the data generation process can be executed along the target path based on the assigned data to obtain the target data. The target path represents the production relationship between task nodes and data nodes when the data of the target data node is generated, and the target data and the assigned data can be specific numerical values or parameters. The data lineage of the target data node can then be generated based on the target data, the assigned data, and the target path. As can be seen, the data lineage generated by the method provided in this disclosure includes the data of the target data node, the specific data used in the data generation process, and the production relationship between data nodes and task nodes when the data of the target data node is generated. This solution can therefore generate a more granular data lineage.
[0115] Optionally, in another embodiment of this disclosure, the method provided by this disclosure further includes:
[0116] During the data generation process, for any data node, in response to detecting that a first observation condition related to a first specified node is set for that data node, the assignment of that data node is triggered when the first specified node satisfies the first observation condition; wherein, the first specified node includes: the upstream data node or task node to which the data node is connected.
[0117] And / or,
[0118] For any task node, in response to detecting that a second observation condition is set for the task node regarding a second specified node, the execution of the processing task represented by the task node is triggered when the second specified node satisfies the second observation condition; wherein, the second specified node includes: an upstream data node or task node to which the task node is connected.
[0119] In the target directed acyclic graph, data nodes and task nodes can also be configured with observation conditions. During data generation and processing, for any data node, if a first observation condition is set regarding a first specified node, then the assignment of that data node can be triggered when the first specified node satisfies the first observation condition. For any task node, if a second observation condition is set regarding a second specified node, then the execution of the processing task represented by that task node can be triggered when the second specified node satisfies the second observation condition. It can be understood that the first observation condition can be interpreted as the trigger condition for assigning a value to a data node, and the second observation condition can be interpreted as the trigger condition for whether a task node is executed.
[0120] The first specified node can be an upstream data node or task node connected to the data node, and the second specified node can be an upstream data node or task node connected to the task node. In some scenarios, the first specified node and the second specified node can be the same data node. In this case, the first observation condition and the second observation condition can also be the same. For example, both can be equal(four), that is, satisfied when the first specified node and the second specified node are assigned the value 4; this triggers both the assignment of the data node and the execution of the task represented by the task node. This disclosure does not limit the types of the first specified node and the second specified node, and the specific content of the first observation condition and the second observation condition can be set according to the actual situation.
[0121] As shown in Figure 3, the data node `title_size+a` has an observation condition. As indicated by the dashed arrow in the figure, the first specified node for `title_size+a` is its upstream data node `title`. The first observation condition is `equal(four)`, meaning that the assignment of the data node `title_size+a` is triggered only when the data node `title` satisfies `equal(four)`, i.e., when the data node `title` is assigned the value `four`. As shown in Figure 5, when the data node `title` is assigned the value `four`, the assignment of the data node `title_size+a` is triggered, and the data node `title_size+a` is assigned the value 7.
[0122] By using a first observation condition and a second observation condition, the assignment of a data node is triggered only when the first designated node of any data node in the target directed acyclic graph satisfies the first observation condition, and the execution of the task represented by any task node is triggered only when the second designated node of any task node satisfies the second observation condition. By using observation conditions, useless assignment of data nodes in the target directed acyclic graph or useless execution of tasks represented by task nodes can be avoided, thus avoiding waste of computing resources and improving the efficiency of data lineage generation.
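A first observation condition gating the assignment of a data node can be sketched as below. The helper names `equal` and `assign_if_observed` are hypothetical, and the constant `A = 3` together with the length-based computation are assumptions chosen only so that the example reproduces the value 7 mentioned for `title_size+a`; the source does not state how that value is computed.

```python
from typing import Callable, Dict

Condition = Callable[[object], bool]


def equal(expected: object) -> Condition:
    """Build an observation condition that holds when the watched value equals `expected`."""
    return lambda value: value == expected


def assign_if_observed(
    values: Dict[str, object],
    node: str,
    watched: str,
    condition: Condition,
    compute: Callable[[Dict[str, object]], object],
) -> bool:
    """Assign `node` only if the watched upstream node satisfies the condition."""
    if watched in values and condition(values[watched]):
        values[node] = compute(values)
        return True
    return False  # condition not met: the assignment is not triggered


A = 3  # assumption: hypothetical constant, not given in the source


def title_size_plus_a(v: Dict[str, object]) -> int:
    return len(str(v["title"])) + A


values_hit: Dict[str, object] = {"title": "four"}
triggered = assign_if_observed(values_hit, "title_size+a", "title", equal("four"), title_size_plus_a)

values_miss: Dict[str, object] = {"title": "five"}
not_triggered = assign_if_observed(values_miss, "title_size+a", "title", equal("four"), title_size_plus_a)
```

A second observation condition on a task node could be handled the same way, with the gated action being the execution of the task instead of the assignment of a data node.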
[0123] Optionally, in another embodiment of this disclosure, the method provided by this disclosure further includes:
[0124] During the data generation and processing, for any task node, in response to the detection of a historical data signature containing an input signature and an output signature for that task node, it is checked whether the input signature in the historical data signature matches the signature corresponding to the current input scenario, and whether the historical assignment data of the upstream data node of that task node is the same as the currently assigned data.
[0125] If they are the same, the processing task represented by the task node is skipped, and the task execution result of the task node is determined based on the output formal parameters represented by the output signature in the historical data signature and their historical assignment data.
[0126] If they are different, then the processing task represented by that task node will be executed;
[0127] The input signature is used to represent the signature corresponding to the historical input scenario, and the output signature is used to represent the output parameters of the task node under the historical input scenario.
[0128] In this disclosure, during the data generation process, when certain conditions are met, the execution of the processing task represented by a task node can be skipped, and the task execution result of the task node can be determined directly.
[0129] It is understandable that for the same task node, when the input data and the task execution scenario are the same, the output is also the same (i.e., if the current input data and task execution scenario are the same as the historical ones, the current output data of the task is the same as the historical output data). For any task node, if it is detected that the task node has a historical data signature containing an input signature and an output signature, two checks can be made. First, whether the input signature in the historical data signature matches the signature corresponding to the current input scenario, for example whether the input signature of the historical data signature is the same as the input signature of the current data signature; this checks whether the current execution scenario of the task represented by the task node is the same as its historical execution scenario. Second, whether the historical assignment data of the upstream data node of the task node is the same as the currently assigned data. If both are the same, the processing task represented by the task node can be skipped. The output signature in the historical data signature represents the output parameters of the task node in the historical input scenario, that is, which output parameters could be calculated to specific values in that scenario. Therefore, the task execution result of the task node can be determined based on the output parameters represented by the output signature in the historical data signature and the historical assignment data of those output parameters. For example, the historical assignment data of the output parameters represented by the output signature can be used as the task execution result of the task node.
[0130] The historical data signature can be the content carried in the data lineage generation request, which can be the historical cached content of the device that issued the data lineage generation request.
[0131] Specifically, the historical data signature can be called a process sign, which includes input_sign and output_sign: input_sign is the input signature, and output_sign is the output signature. When the historical data signature matches the current data signature, the historical execution scenario of the task is the same as the current execution scenario, and the historical output parameters of the task node are the same as the current output parameters. If, in addition, the current input parameters of the task node are the same as the historical input parameters, the task represented by the task node can be skipped, and its execution result can be determined directly from the output parameters represented by output_sign and their assigned values. Of course, the data signature of any task node can be calculated and cached in real time, which can be achieved through a signature algorithm and a cached signature factor. The specific implementation will be described in detail in subsequent embodiments.
[0132] The solution provided in this disclosure allows any task node to be configured with a data signature. If the task node has historical data signatures, it can detect whether the input signature in the historical data signature matches the signature corresponding to the current input scenario, and whether the historical assignment data of the upstream data node connected to the task node is the same as the current assignment data. In other words, it detects whether the historical input data of the task node is the same as the current input data, and whether the historical execution scenario is the same as the current execution scenario. If they are the same, the execution of the processing task represented by the task node can be skipped, and the task execution result of the task node can be determined directly based on the output parameters represented by the historical data signature and its historical assignment data. By using historical data signatures, processing tasks that do not need to be executed repeatedly can be skipped reasonably, and the execution result of the processing task can be determined directly using the historical data of the processing task. This can save computing resources, quickly generate target data, and improve the efficiency of generating data lineage.
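The skip decision described above can be sketched as follows. This is a hedged illustration, not the disclosed signature scheme: the helper `sign` (SHA-256 over canonical JSON), the function `run_or_skip`, the layout of the `history` record, and the representation of `output_sign` as a list of output parameter names are all assumptions.

```python
import hashlib
import json
from typing import Callable, Dict, List, Optional, Tuple


def sign(payload: object) -> str:
    """Assumed signature function: SHA-256 over a canonical JSON encoding."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_or_skip(
    task_fn: Callable[..., Dict[str, object]],
    scenario: Dict[str, object],
    inputs: Dict[str, object],
    history: Optional[Dict[str, object]],
) -> Tuple[Dict[str, object], bool]:
    """Return (task outputs, skipped flag)."""
    if (
        history is not None
        and history["input_sign"] == sign(scenario)  # same execution scenario
        and history["inputs"] == inputs              # same upstream assignment data
    ):
        names: List[str] = history["output_sign"]    # output parameters recorded historically
        return {name: history["outputs"][name] for name in names}, True
    return task_fn(**inputs), False


calls: List[str] = []


def extract_title(html: str) -> Dict[str, object]:
    calls.append(html)  # record real executions so skipping is observable
    return {"title": html.upper()}


scenario = {"task": "extract_title", "strategy_version": 1}
first, skipped_first = run_or_skip(extract_title, scenario, {"html": "hello"}, None)
history = {
    "input_sign": sign(scenario),
    "inputs": {"html": "hello"},
    "output_sign": ["title"],
    "outputs": first,
}
second, skipped_second = run_or_skip(extract_title, scenario, {"html": "hello"}, history)
third, skipped_third = run_or_skip(extract_title, scenario, {"html": "bye"}, history)
```

The second call reuses the historical output without invoking `extract_title`, while the third call executes the task because the upstream assignment data differs from the historical data.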
[0133] Optionally, in another embodiment of this disclosure, it further includes:
[0134] In response to the detection of an update instruction for the target directed acyclic graph, the target directed acyclic graph is updated, and a graph signature of the currently updated target directed acyclic graph is generated based on the topology of the target directed acyclic graph; wherein, the topology of the target directed acyclic graph includes: the order of data nodes in the target directed acyclic graph, the order of task nodes, the order of edges between any data node and any task node, the execution strategy configured for the task node, the observation conditions set for the task node and / or the observation conditions set for the data node.
[0135] In response to the detection of an output command for the target directed acyclic graph, the currently updated target directed acyclic graph and its graph signature are output.
[0136] This disclosure also allows for the generation of graph signatures for the target directed acyclic graph (DAG), with different updated versions of the DAG each possessing its corresponding graph signature. Specifically, users can update the target DAG as needed by issuing update commands. The solution provided in this disclosure responds to the detection of an update command for the target DAG, updates the target DAG, and generates a graph signature for the currently updated target DAG based on its topology. Subsequently, the currently updated target DAG and its graph signature can be output. Furthermore, this disclosure does not limit the signature algorithm used in generating the graph signature.
[0137] The graph signature of any directed acyclic graph (DAG) can be generated based on the topology of the DAG. If two DAGs have the same topology, their graph signatures are also the same. For example, the topology of the target DAG includes: the order of data nodes, the order of task nodes, the order of edges between any data node and any task node, the execution strategy configured for the task nodes, the observation conditions set for the task nodes and / or the observation conditions set for the data nodes. Of course, other topologies may also be included, which are not limited in this disclosure.
[0138] It is understandable that both the target directed acyclic graph (DAG) and the currently updated target DAG can have their corresponding graph signatures. That is, different versions of the target DAG each have a corresponding graph signature. The graph signature of any version of the target DAG can be generated based on its topological structure. Graph signatures can be used to identify and distinguish different versions of the target DAG, and they can also represent the topological structure of the DAG. For example: the current page displays the target DAG. If an update is needed, it can be performed in the background. The page still displays the target DAG. The background can update the target DAG, obtaining the currently updated target DAG and its graph signature. This updated target DAG and its graph signature can then be output and used to replace the target DAG and its graph signature on the page. After the replacement, the page will display the currently updated target DAG and its graph signature.
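One possible way to derive a graph signature from the topology is to hash a canonical serialization of it. The disclosure does not limit the signature algorithm, so the use of SHA-256, the function `graph_signature`, and the dictionary layout of the topology below are assumptions made only for illustration.

```python
import copy
import hashlib
import json
from typing import Dict


def graph_signature(topology: Dict[str, object]) -> str:
    """Hash a canonical JSON form of the topology (assumed scheme: SHA-256)."""
    # sort_keys makes dictionary key order irrelevant; list order is preserved,
    # so the order of nodes and edges still affects the signature.
    canonical = json.dumps(topology, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


topology_v1: Dict[str, object] = {
    "data_nodes": ["url", "html", "title"],
    "task_nodes": ["parse", "extract"],
    "edges": [["url", "parse"], ["parse", "html"], ["html", "extract"], ["extract", "title"]],
    "strategies": {"parse": {"timeout_s": 5}, "extract": {}},
    "observations": {"title": {"watch": "html", "condition": "not_empty"}},
}

same_topology = copy.deepcopy(topology_v1)

topology_v2 = copy.deepcopy(topology_v1)
topology_v2["data_nodes"].append("title_size")
topology_v2["edges"].append(["title", "title_size"])

sig_v1 = graph_signature(topology_v1)
sig_same = graph_signature(same_topology)
sig_v2 = graph_signature(topology_v2)
```

Two graphs with the same topology produce the same signature, and the updated version `topology_v2` produces a different one, so the signature can be used to distinguish versions of the target DAG.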
[0139] The following describes a data lineage generation method provided in this disclosure based on another embodiment.
[0140] Offline data processing systems in related technologies can orchestrate the data processing process, but they either cannot characterize data lineage or characterize it at too coarse a granularity. This disclosure provides a DDC (data-driven computing) system that is compatible with the orchestration of the data processing process and can characterize data lineage at a fine granularity.
[0141] In this disclosure, the data processing logic generally consists of data and tasks. Users can define data and tasks, i.e., data nodes and task nodes, and form a directed acyclic graph (DAG) through the dependencies between data and tasks. This process can also be called the modeling process of a DDC system.
[0142] The DDC system can generate the data lineage of a data node in a DAG; that is, the data lineage generation method disclosed herein can be implemented through the DDC system. Users can provide assigned data for a specified data node, i.e., the original data set `datas`, and determine the target data node for which a data lineage is to be generated, i.e., define the set of `datas` to be driven for generation. Based on the data generation path in the DAG related to the target data node, the DAG computation is driven to obtain the target data of the target data node, thereby generating the data lineage of the target data node. Because the data lineage generation process is driven by data, the system can be called DDC: data-driven computing. This corresponds to the steps described above: in response to receiving a data lineage generation request, determining the target data node for which a data lineage is to be generated in the target directed acyclic graph; determining the target path in the target directed acyclic graph; obtaining the assigned data for a specified data node in the target path; executing the data generation process according to the target path based on the assigned data to obtain the target data; and generating the data lineage of the target data node based on the target data, the assigned data, and the target path.
[0143] The DDC system supports computation starting from the data source at the source node. It also supports computation starting from a specific intermediate data node in the DAG, using intermediate data provided by the user for that node; this corresponds to the case described above in which the specified data node also includes intermediate nodes in the target path. During computation, caching can be used to skip the execution of a task node when its conditions are met. This corresponds to checking whether the input signature in the historical data signature matches the signature corresponding to the current input scenario, and whether the historical assignment data of the upstream data node of the task node is the same as the currently assigned data; if they are the same, the processing task represented by that task node is skipped, and the task execution result of the task node is determined based on the output parameters represented by the output signature in the historical data signature and their historical assignment data.
[0144] The DDC system provided in this disclosure can achieve the following functions:
[0145] Based on user-defined data and tasks, a DAG is generated, and the data lineage of a data node in the DAG is characterized.
[0146] During the data generation process, the data lineage generation request specifies the target data node for which the data lineage is to be generated, and assigns values to some or all data nodes in the DAG. The DDC system triggers only the tasks required to generate the data of the target data node, and completes the derivation and characterization of the data lineage of the target data node.
[0147] During the data generation process, calculations can be performed based on some known data, starting from an intermediate data node of the DAG, instead of starting from the source node of the DAG's data nodes.
[0148] The data lineage generation request carries cached data for the task node (such as the historical execution results and historical inputs of the task node) and the historical data signature of the task node. During data production, if the cached data, the historical data signature, the current data signature, and the input data show that the current input and execution scenario of the task node are the same as the historical input and execution scenario, the cached data can be reused (for example, by directly using the historical execution result as the execution result of the task) and the execution of the task can be skipped.
[0149] This disclosure provides a data-oriented data lineage generation method. The generation process of data can be clearly queried through a DAG composed of data and tasks, and convenient querying can be achieved by using a graph query language.
[0150] In offline scenarios, this solution is designed to enable computation from a single intermediate data node in the DAG, which is friendly to data iteration. During data iteration, values can be assigned to the data of a single intermediate node to test the impact of other data on the target data.
[0151] Offline computing often involves a large amount of computation and is costly. This solution presents a general cache design that can skip the execution of the processing tasks represented by task nodes, significantly reducing the cost of offline computing.
[0152] The DDC system disclosed herein will now be described in detail.
[0153] 1. data
[0154] Data, or data nodes, are the basic units of DDC system architecture and also the computational targets. The data in the data set can be categorized into the following four types:
[0155] SOURCE data: This is the raw data provided by the user before the DDC system performs calculations and executions, and it does not depend on other tasks.
[0156] NORMAL data: Ordinary data, produced in relation to tasks. It represents the output of a processing task once it has been completed. When creating NORMAL data, you can specify a task and which output parameter of that task the NORMAL data will map to. The actual arguments of that output parameter can then be used as NORMAL data, based on the execution of the processing task represented by the task and the specified output parameter.
[0157] STATUS data: Status data, which is the result of the execution of the strategy within a task. When creating STATUS data, a task can be specified. The STATUS data represents the result of the execution of the strategy configured within the task.
[0158] ABSTRACT data: Abstract data. This data is abstract and can define the concrete data it depends on. Concrete data can be of any data type, and the data type of ABSTRACT data is consistent with that of the concrete data. For example, the abstract data is "title", and the concrete data it depends on are "title1", "title2", "title3", ..., "title100". When retrieving the value of the abstract data "title", it can be "title = title1", "title = title2", ..., or "title = title100". The concrete data can be pre-configured with priorities, and the value of the abstract data can be retrieved according to those priorities: if a higher-priority concrete data does not have a specific value, the specific value of the next lower-priority concrete data can be used as the value of the abstract data. For example, if the concrete data are "title1", "title2", "title3", and "title4" in descending order of priority, and "title1" does not have a specific value but "title2" does, then the abstract data "title" takes the specific value of "title2". In addition, abstract data can have a pre-built mapping relationship with the concrete data it depends on. Alternatively, in a DAG, the upstream nodes connected to the data node of the abstract data are the data nodes of the concrete data that the abstract data depends on. That is, in a DAG, data nodes can also be connected to each other through reference edges, and data nodes and task nodes can be connected through dependency edges.
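The priority-based retrieval of an ABSTRACT data value can be illustrated with a short sketch. The function `resolve_abstract` and the use of `None` to mean "no specific value" are hypothetical choices for this example.

```python
from typing import Dict, List, Optional


def resolve_abstract(
    concrete_by_priority: List[str],
    values: Dict[str, Optional[str]],
) -> Optional[str]:
    """Return the value of the highest-priority concrete data that has a value."""
    for name in concrete_by_priority:  # listed in descending order of priority
        value = values.get(name)
        if value is not None:
            return value
    return None  # no concrete data has a specific value yet


priorities = ["title1", "title2", "title3", "title4"]

# "title1" has no specific value, so the abstract data falls back to "title2".
title = resolve_abstract(priorities, {"title2": "Second title", "title3": "Third title"})
unresolved = resolve_abstract(priorities, {})
```

This mirrors the example in the text: with "title1" unset and "title2" set, the abstract data "title" takes the specific value of "title2".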
[0159] Additionally, multiple Watch Data Value rules can be added to the data, which is the first observation condition mentioned above. The data assignment behavior will only be triggered when all Watch Data Value rules are met. However, this does not affect the execution behavior of the tasks affected by the data. For example, the tasks affected by the data can be executed according to the formal parameters of the data.
[0160] 2. strategy
[0161] A strategy is an abstract interface for strategy operators in a DDC system. When a task executes, it can call the `process` method of the strategy. A strategy can also be understood as the execution logic of a task. Any directed acyclic graph created by a DDC can reuse a strategy with the same configuration. The DDC system also supports user-defined new strategies. Of course, the same strategy can be applied to different tasks; in this case, the configuration content of the same strategy can be different in different tasks.
[0162] When adding a custom strategy, the strategy can only get the value of the input parameter and set the value of the output parameter. When the task runs, the actual value of the data node of the input parameter, i.e. the actual parameter, can be passed to the input parameter, and the actual parameter of the output parameter can be passed to its bound data node, so as to assign values to the input and output parameters of the task.
[0163] There are three types of formal parameter definitions:
[0164] Required input parameter refers to the formal parameter that must be selected. When the DAG is executed, if the actual parameter bound to the required input parameter does not exist, the task affected by the required input parameter will not be scheduled and executed by the DDC system.
[0165] An optional input parameter is a formal parameter that can be selected for execution. The execution of the task affected by this optional input parameter does not require the actual parameter bound to the optional input parameter to exist. If the optional input parameter does not have a corresponding actual parameter, the task affected by the optional input parameter can still be executed normally.
[0166] An output parameter is a formal parameter that the strategy is not required to assign a value to. Only when the strategy returns SUCC, meaning the strategy executed successfully, are the output parameters assigned values, and only then are the actual values of the output parameters propagated to the data nodes corresponding to those output parameters.
[0167] The return status (i.e., execution status) of a strategy has two possibilities:
[0168] SUCC indicates that the operator was executed successfully. At this point, the output parameters can be assigned values using the output arguments to ensure the consistency of the strategy output.
[0169] FAIL indicates that the operator failed to execute. The failure of the operator will not directly affect the execution of downstream tasks. Downstream tasks can determine whether the execution conditions are met based on the Watch Data Value rules of the task and whether the required input parameters have values, and thus whether to execute the task.
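The interplay of required inputs, optional inputs, output parameters and the SUCC/FAIL return status can be sketched as follows. This is only an illustration under stated assumptions: `run_task`, the dictionary-based bindings and the two sample operators are hypothetical and do not reproduce the DDC interfaces.

```python
from typing import Any, Callable, Dict, Optional, Tuple

SUCC, FAIL = "SUCC", "FAIL"
ProcessFn = Callable[[Dict[str, Any]], Tuple[str, Dict[str, Any]]]


def run_task(process: ProcessFn,
             required: Dict[str, str],   # formal parameter -> bound data node
             optional: Dict[str, str],
             outputs: Dict[str, str],
             data: Dict[str, Any]) -> Optional[str]:
    """Run one task; return its strategy status, or None if it is not scheduled."""
    # A required input without an actual parameter prevents scheduling.
    if any(data.get(node) is None for node in required.values()):
        return None
    args = {param: data[node] for param, node in required.items()}
    # Optional inputs are passed only when their bound data exists.
    args.update({param: data[node] for param, node in optional.items()
                 if data.get(node) is not None})
    status, produced = process(args)
    # Output values reach their data nodes only when the strategy returns SUCC.
    if status == SUCC:
        for param, node in outputs.items():
            if param in produced:
                data[node] = produced[param]
    return status


def text_size(args: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    return SUCC, {"count": len(args["text"])}


def always_fail(args: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    return FAIL, {"count": -1}


data = {"title": "four"}
status = run_task(text_size, {"text": "title"}, {}, {"count": "title_size"}, data)
print(status, data)
```

In this sketch a FAIL status leaves the bound data node unassigned, so downstream tasks decide for themselves whether their own execution conditions are met.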
[0170] 3. task
[0171] A task, or process node (corresponding to the task node mentioned above), is the basic unit of the DDC system graph. The composition of a task is as follows:
[0172] strategy, the main body responsible for executing this task.
[0173] config, strategy configuration
[0174] Mapping relationship between data and strategy input parameters
[0175] The cache factor, or cache signature factor, consists of multiple {input parameters, signature algorithms}. The DDC engine calculates the current process signature of each task in real time based on the cache factor, i.e., the signature of the task node. For a given task, if the cache data (e.g., the actual values of the task's input parameters) and process signature match the historical cache data and process signature, then the task can be skipped, and the historical execution result of the task can be used as the current execution result.
[0176] The relationship between strategy and task is shown in Figure 2. The task is configured with input parameters data1...dataX, output parameters dataY...dataZ, and a strategy. The strategy has three types of interfaces. When a strategy is applied to a task, connection relationships can be set for the corresponding interfaces. The three types of interfaces are: connection interfaces for output parameters, configuration interfaces, and connection interfaces for return values. Output parameters connect to the downstream data of this task. Configuration interfaces can provide configuration content during task definition. The return value connects to the STATUS data (i.e., the datas in the figure).
[0177] When creating a DAG, each newly created task is unique (due to the uniqueness of a strategy, the same strategy can be used by one task at a time; therefore, each task has a unique strategy). The execution conditions for a task are as follows:
[0178] The required input parameters have specific numerical values for their corresponding actual parameters.
[0179] All Watch Data Value rules configured for the task are satisfied;
[0180] If the task has cache enabled but the user has not provided enough cache data (e.g., only the historical execution results of the task are provided), the task cannot be skipped and is still executed.
[0181] The DDC system can record the execution status of each task, which can include:
[0182] OS_STARTED indicates that the task has been started but is still in progress and has not yet returned.
[0183] OS_NOT_TRIGGERED indicates that, according to path deduction from the set of datas that the current request needs to drive (i.e., the data of the target data nodes), the execution of this task node does not need to be triggered.
[0184] OS_WATCH_MISS means that the Watch Data Value rules related to the current task are not met, and the task will not be executed.
[0185] OS_NOT_SATISFIED: The data bound to the required input parameter of the strategy related to the current task does not exist, that is, the actual parameter of the required input parameter does not have a specific value, and the task will not be executed.
[0186] OS_CACHE indicates that the current task has executed cache logic, meaning the task is skipped during execution. The underlying strategy's process method is not executed, but the task's output parameters are correctly assigned, and the status data dependent on this task is also assigned the value SUCC. Since the strategy has not been executed, the status data for the strategy seen in the dot file is UNKNOWN, indicating that the strategy has not been executed.
[0187] OS_DONE indicates that the task was successfully triggered and completed. At this point, the strategy's status data may be SUCC or FAIL.
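The way these execution statuses relate to the execution conditions of a task can be sketched as a small decision function. The function `task_status` and the order in which the checks are applied are assumptions made for illustration; the source does not specify the order. OS_STARTED is a transient state and is therefore not returned here.

```python
from typing import Any, Callable, Dict, List


def task_status(triggered: bool,
                watch_rules: List[Callable[[Dict[str, Any]], bool]],
                required_nodes: List[str],
                cache_hit: bool,
                data: Dict[str, Any]) -> str:
    """Return the final execution status of a task (hypothetical check order)."""
    if not triggered:
        # Path deduction found that no target data needs this task.
        return "OS_NOT_TRIGGERED"
    if not all(rule(data) for rule in watch_rules):
        return "OS_WATCH_MISS"
    if any(data.get(node) is None for node in required_nodes):
        return "OS_NOT_SATISFIED"
    if cache_hit:
        # The strategy's process method is skipped; outputs come from the cache.
        return "OS_CACHE"
    return "OS_DONE"


def title_is_four(d: Dict[str, Any]) -> bool:
    return d.get("title") == "four"


print(task_status(True, [title_is_four], ["title"], False, {"title": "four"}))
```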
[0188] It should be noted that users can implement custom strategies through the `init` and `process` interfaces. The `init` interface can accept configuration information as input, initialize the strategy's properties, and determine the `process` method.
[0189] In all directed acyclic graphs created within the same DDC system, if the configuration of a strategy is consistent when applied to multiple tasks, the same strategy instance can be reused.
[0190] 4. DDC graph
[0191] A DDC graph is a directed acyclic graph consisting of nodes and edges.
[0192] In a DDC system, there are two types of nodes in a directed acyclic graph (DAG): data and task, as mentioned above.
[0193] data node (abbreviated as data): Data node.
[0194] Task node (or simply task): A task node represents a computation process. The core of a task is a strategy. When the execution conditions of a task are met, the `process` method of the strategy is executed to complete the computation logic. A strategy can output return values, set output parameters, etc.
[0195] Nodes are connected by edges, and edges can be of the following types:
[0196] REQUIRED_IN_PARAM: The data node points to the task node. The data is bound to a required input parameter of the strategy on the task node. The data associated with the required input parameter exists, that is, the actual parameter of the required input parameter has a specific value. It is a necessary condition for the task node to run the strategy.
[0197] OPTIONAL_IN_PARAM: The data node points to the task node, and the data is bound to an optional input parameter of the strategy on the task node. Whether the data associated with the optional input parameter exists or not does not affect the execution of the strategy on the task node.
[0198] OUT_PARAM: The task node points to the data node, binding the output parameters of the task node's strategy to a data node.
[0199] DATA_WATCH_DATA: A data node (watched data) points to a data node (watch data), where the latter is the observer and the former is the observed object. The assignment operation of the watch data will only be performed when the watched data meets certain conditions.
[0200] TASK_WATCH_DATA: The data node (watched data) points to the task node (watch task). The task node's processing task will only be executed when the watched data meets certain conditions.
[0201] ABSTRACT_DATA: The data node (concrete data) points to the data node (abstract data). The value of the abstract data comes from the concrete data. There can be multiple concrete data. The concrete data with a value is selected as the value of the abstract data according to the configuration order or priority.
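The six edge types above differ in which kinds of node they connect. A minimal sketch of that constraint is given below; the table `EDGE_ENDPOINTS` and the function `check_edge` are hypothetical helpers, not DDC interfaces.

```python
from typing import Dict, Tuple

# Edge type -> (kind of source node, kind of destination node), as listed above.
EDGE_ENDPOINTS: Dict[str, Tuple[str, str]] = {
    "REQUIRED_IN_PARAM": ("data", "task"),
    "OPTIONAL_IN_PARAM": ("data", "task"),
    "OUT_PARAM": ("task", "data"),
    "DATA_WATCH_DATA": ("data", "data"),   # watched data -> watch data
    "TASK_WATCH_DATA": ("data", "task"),   # watched data -> watch task
    "ABSTRACT_DATA": ("data", "data"),     # concrete data -> abstract data
}


def check_edge(edge_type: str, src_kind: str, dst_kind: str) -> bool:
    """Return True if an edge of this type may connect the two node kinds."""
    return EDGE_ENDPOINTS.get(edge_type) == (src_kind, dst_kind)


print(check_edge("OUT_PARAM", "task", "data"))
```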
[0202] DDC graph provides the following graph construction-related interfaces, through which users can construct directed acyclic graphs (DAGs), and also perform secondary development to add strategies and implement configuration-based graph construction:
[0203] `gen_normal_data` is an interface that registers normal data;
[0204] `gen_source_data` is an interface that registers source data;
[0205] `gen_status_data` is an interface that registers status data;
[0206] `gen_abstract_data` is an interface that registers abstract data;
[0207] `gen_task` is an interface that registers a task;
[0208] `clear` is an interface that clears the current graph content;
[0209] `delete_data` is an interface that deletes registered data. Deletion may cause a connected graph to become a disconnected graph; it will not delete upstream or downstream data.
[0210] `delete_task` is an interface that deletes a registered task. Deletion may cause a connected graph to become a disconnected graph, but it will not delete the upstream or downstream content of the task.
[0211] DDC graph provides the following interfaces for graph operations:
[0212] `reload` loads the current graph content but does not initiate execution. After a successful reload, a root graph will also be generated. Figure 1 A graph signature (graph_sign) represents a version of a graph. Graphs may be constructed in different orders, but if their topological meaning is the same, their graph signatures (graph_sign) will be identical. For example, a page might display a target directed acyclic graph (DAG). If the target DAG needs updating, this can be done in the background. The page will still display the target DAG, but the background can update it, obtaining the updated target DAG and its graph signature (graph_sign). However, the updated target DAG is not yet loaded and displayed; this is called reloading.
[0213] `migrate` switches to the loaded graph; for example, it switches to and displays a graph that has been reloaded.
[0214] `stop` stops the currently running graph. For example, it prevents the currently running directed acyclic graph from being displayed.
[0215] `export` outputs the current graph content to a file, allowing the DDC system to import a graph from the file next time.
[0216] The `import` function reads a graph structure from a file, and then calls `reload` and `migrate` to run the graph.
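The property stated for `reload`, that graphs constructed in different orders share the same graph_sign when their topology is the same, can be illustrated with a minimal sketch. The actual signature algorithm is not described in the source; canonical sorting of the edges followed by SHA-256 is an assumption chosen here only to show order independence, and the edge tuples are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

Edge = Tuple[str, str, str]  # (source node, edge type, destination node)


def graph_sign(edges: Iterable[Edge]) -> str:
    """Hash a canonical (sorted, de-duplicated) edge list of the graph."""
    canonical = "\n".join("|".join(edge) for edge in sorted(set(edges)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


built_first = [
    ("title", "REQUIRED_IN_PARAM", "tid_1"),
    ("tid_1", "OUT_PARAM", "title_size"),
    ("title_size", "REQUIRED_IN_PARAM", "tid_2"),
]
# Same topology, registered in a different order.
built_second = list(reversed(built_first))

print(graph_sign(built_first) == graph_sign(built_second))
```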
[0217] DDC graph also provides several convenient methods for viewing graph content and execution results:
[0218] `render()` renders a graph that has been successfully reloaded. The parameters control whether to output the currently running graph or a graph that has been reloaded but not yet migrated (i.e., a graph that has not been published yet). An example of a successfully reloaded graph is shown in Figure 3. The `title` and `a` data nodes are the specified data nodes mentioned above. `tid_1` is the task node connected downstream of the `title` node. `TextSize(text:title)` means that `title` is passed to the input parameter `text` of `tid_1` and `TextSize` is executed. After `count` is calculated, it is assigned to `title_size`, the data node downstream of `tid_1`. At this point, `a` and `title_size` are the data nodes connected upstream of `tid_2`. `Add(num1:a,num2:title_size)` means that `a` is passed to the input parameter `num1`, `title_size` is passed to the input parameter `num2`, and `Add` is executed. After `sum` is obtained, the data node connected downstream of `tid_2` (the data node `title_size+a`) has the observation condition `equal(four)`; when this observation condition is met, that data node can be assigned the result `sum`. The graph signature of this directed acyclic graph is `ddc_graph_sign[7033080076767472568]Board[]`, where Board represents the data lineage generation request. `Board[]` can represent no request, that is, the directed acyclic graph is a graph that has been reloaded but not yet migrated. When there is a specific request in `Board[]`, the directed acyclic graph can be a migrated graph. `TextSize` and `Add` are the strategies of `tid_1` and `tid_2`, respectively.
[0219] Where title and a are the specified data nodes mentioned above, a is the intermediate node mentioned above, and title is the starting node mentioned above; title_size is the other data nodes mentioned above besides the specified data nodes and target data nodes; the data node title_size+a is the target data node mentioned above; the data generation path contained in this directed acyclic graph is the target path, and this directed acyclic graph is the target subgraph mentioned above; equal(four) is the first observation condition for the data node title_size+a, and title is the first specified node for the data node title_size+a. The dashed line in the figure represents the first observation condition.
[0220] `rill_info()` is used to view the DAG consisting only of tasks. For Figure 3, the DAG consisting only of tasks is shown in Figure 4. `0-TextSize(text:title):itemwise` represents the 0th task, whose strategy is TextSize and whose input parameter is text, which can be assigned the value of title. `1-Add(num1:a,num2:title_size):itemwise` represents the 1st task, whose strategy is Add and whose input parameters are num1 and num2, which can be assigned the values of a and title_size respectively. The signature of this DAG consisting only of tasks can be JobRill DAG (machine generated).
[0221] The `render(const Board&)` method renders the execution result of a requested board, outputting a graph in Graphviz format, specifically a lineage diagram of the target data nodes. The output can be rendered and viewed using Graphviz software. For Figure 3, the rendered execution result for the request Board[readme_demo] is shown in Figure 5.
[0222] As can be seen, the specified data nodes `title` and `a` are assigned the values `four` and `3` respectively. The execution states of `tid_1` and `tid_2` are both `OS_DONE`, and their respective strategy results are both `SUCC`. For the target data node `title_size+a`, its first specified node `title` satisfies the first observation condition `equal(four)`, meaning `title` is assigned the value `four`. Therefore, the target data node `title_size+a` can be assigned the value `7`. For any data node, it has a corresponding data signature. For example, for the specified data nodes `title` and `a`, the data signature is `ps:0:0`. For other data nodes `title_size` and the target data node, their data signatures are `ps:0:15020800` and `ps:0:f8d65d77` respectively. At this time, the signature of the target subgraph is: `ddc_graph_sign[7033080076767472568]Board[readme_demo]`.
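The values in this walkthrough can be reproduced with a minimal sketch. It assumes that `TextSize` counts the characters of its input, which is consistent with `title` = `four` and the result `7` but is not stated in the source; the function `run_demo` and the dictionary representation of data nodes are hypothetical.

```python
from typing import Any, Dict


def text_size(text: str) -> int:
    """Assumed behaviour of the TextSize strategy: count characters."""
    return len(text)


def add(num1: int, num2: int) -> int:
    """Behaviour of the Add strategy: return the sum of its two inputs."""
    return num1 + num2


def run_demo(title: str, a: int) -> Dict[str, Any]:
    data: Dict[str, Any] = {"title": title, "a": a}
    data["title_size"] = text_size(data["title"])   # tid_1: TextSize(text:title)
    total = add(data["a"], data["title_size"])      # tid_2: Add(num1:a,num2:title_size)
    # Observation condition equal(four): assign the target only if title == "four".
    if data["title"] == "four":
        data["title_size+a"] = total
    return data


result = run_demo("four", 3)
print(result["title_size+a"])  # 7
```

When `title` does not satisfy the observation condition, `tid_2` still runs in this sketch but the target data node stays unassigned, matching the rule that a Watch Data Value rule on data gates the assignment rather than the task execution.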
[0223] The target directed acyclic graph rendered in this disclosure is shown in Figure 7. For ease of understanding, the rendering rules of the DDC graph are explained below:
[0224] Data node format rules:
[0225] Data nodes are represented by circles. The first line represents the data name, the second line is the runtime debug string (the specific value), and of course, if the data is an image, it can also be the image name, etc. The third line is the data signature of the data node.
[0226] Source data is represented by double-circle nodes.
[0227] The data (anchor data) that users provide in the board is represented by filled nodes (light gray in the figure).
[0228] Data generated in DDC (including cached content assigned to data) is represented by green circles (i.e., gray ellipses in the figure); if generation fails, it is represented by red circles (e.g., node p1 in the middle of Figure 7).
[0229] If data is the original target of the DDC system-driven computation, it is represented by a bold circle.
[0230] If the data node has a value, its debug string will be output.
[0231] If the data content can be cached to prevent execution by related tasks (such as tasks connected upstream of this data node), a non-zero process sign, ps, will be output. The process sign format is hexadecimal "input_sign:output_sign", such as: ps:332335edb:40021412. Here, input_sign is the input signature mentioned above, which represents the context in which the data was generated, and output_sign is the output signature mentioned above, used to help determine which output parameters can be calculated in the context of input_sign.
[0232] Additionally, the white data nodes in the diagram represent data nodes that have not yet been assigned values.
[0233] Task format rules:
[0234] Tasks are represented by rectangles. The first row represents the task name (task id), the second row is the readable label of the configured strategy, the third row is the task running status, the fourth row is the strategy return value, and if a fifth row exists, it can be the execution time of the task.
[0235] In DDC, tasks that are triggered to execute are represented in green (tasks that are triggered to execute are represented in gray in the graph, and tasks that are not triggered to execute are represented in white in the graph).
[0236] The relationship between data:
[0237] Data watch data is indicated by blue arrows, and brief Watch Data Value rule information is output next to it (i.e., the dashed line in the figure);
[0238] Abstract Data depends on one or more Concrete Datas, indicated by a flat arrow;
[0239] The relationship between task and data:
[0240] The actual parameter of data, which is a required input parameter of task, is represented by a solid line and an arrow.
[0241] The actual argument of data as an optional input parameter of task is indicated by dashed lines and arrows;
[0242] The data parameter, which is the actual parameter corresponding to the output parameter of the task, is represented by a solid line and an arrow, and the parameter name is indicated on the side.
[0243] Status data is represented by a dotted arrow pointing from the task to the status data;
[0244] Task watch data is indicated by blue arrows, and brief Watch Data Value rule information is output next to it (i.e., the dashed line in the figure).
[0245] It should be noted that Figure 7 is merely intended to illustrate the target directed acyclic graph and should not be construed as limiting the scope of this disclosure.
[0246] 5. Watch Data Value Rules
[0247] Whether data is assigned a value is determined by configuring the Watch Data Value rule, that is, any data node can be configured with the first observation condition of the first specified node.
[0248] Suppose Data D is produced by Task S. If the user wants the assignment of Data D to depend on the value of another data item, the Watch Data Value rule can be used. For example, in Figure 5, whether the title_size+a data node is assigned a value depends on whether the title data node is assigned the value four.
[0249] Multiple Watch Data Value rules can be configured for a single piece of data. A value will only be assigned to that data if all conditions are met simultaneously. However, multiple rules cannot be configured for the same watched data.
[0250] Whether a task is executed is determined by configuring Watch Data Value rules, meaning that any task node can be configured with a second observation condition for a second specified node.
[0251] If you want to prevent a task from being executed, you can configure a Watch Data Value rule for the task. If the Watch Data Value rule is not met, the task will not be executed.
[0252] If a task watches multiple data (i.e., the task has Watch Data Value rules configured for multiple data nodes), the task can be executed only when all watched data meet their conditions, that is, when all of its Watch Data Value rules are met. When configuring Watch Data Value rules for a task, multiple rules cannot be configured for the same watched data.
[0253] Using the Task Watch Data Value rule, it is possible to implement a DAG that is compatible with traditional procedural orchestration, so that tasks are executed or not executed appropriately.
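Two properties of Watch Data Value rules described in this section, that all rules must be met and that at most one rule may be configured per watched data, can be sketched as follows. The names `add_watch_rule` and `rules_satisfied` and the predicate-based rule representation are hypothetical.

```python
from typing import Any, Callable, Dict

# Watched data name -> predicate on its value (one rule per watched data).
WatchRules = Dict[str, Callable[[Any], bool]]


def add_watch_rule(rules: WatchRules, watched: str,
                   predicate: Callable[[Any], bool]) -> None:
    """Register a rule; a second rule for the same watched data is rejected."""
    if watched in rules:
        raise ValueError(f"a rule for {watched!r} is already configured")
    rules[watched] = predicate


def rules_satisfied(rules: WatchRules, data: Dict[str, Any]) -> bool:
    """The watching data or task proceeds only if every rule is met."""
    return all(name in data and predicate(data[name])
               for name, predicate in rules.items())


demo_rules: WatchRules = {}
add_watch_rule(demo_rules, "title", lambda value: value == "four")
add_watch_rule(demo_rules, "a", lambda value: value > 0)
print(rules_satisfied(demo_rules, {"title": "four", "a": 3}))
```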
[0254] 6. cache
[0255] When the following conditions are met: input remains unchanged, program remains unchanged (i.e., strategy remains unchanged), and output remains unchanged, caching can be used to skip task execution. Users pre-provide cached content of task output arguments (the cached content can include the values of the task's output arguments). When the cache is active, the task is not executed; instead, the cached content is used to assign values to the task's output arguments. Specifically, when the task's input parameters, corresponding arguments, and strategy remain unchanged (i.e., the current value is the same as the historical value), and the task's output also remains unchanged, the historical values of the task's output arguments can be used to assign values to the current task's output arguments, and the task execution can be skipped.
[0256] Tasks can only use caching if the strategy is cacheable. Cacheability indicates whether the task's computation process is stable, meaning that if the input and the program remain unchanged, the output also remains unchanged. This is determined by the user when configuring the strategy. Some scenarios that do not meet the cacheable condition include:
[0257] There are random factors inside, such as a random number generator;
[0258] The system may access other systems internally, and these other systems involve random factors.
[0259] A task built from a cacheable strategy needs to explicitly enable caching. This allows the same strategy to be used on different tasks, with some tasks having caching enabled and others not. Additionally, tasks use the `add_cache_factor` interface to add one or more cache factors. The cache factors indicate which input parameters the task is sensitive to for cache reuse. If a task has caching enabled (`enable_cache`) but `add_cache_factor` is not called, it is only sensitive to the operator name, operator version, and configuration. Sensitivity to a specific input parameter means that the cache factor is calculated using that input parameter; in other words, the cache factor is calculated using the sensitive content.
[0260] When a task calls the add_cache_factor method to add a cache factor, the cache function will be enabled by default.
[0261] The steps to use the cache feature are as follows:
[0262] The strategy that the task depends on is cacheable, and the code that depends on the strategy implements the cache definition.
[0263] When constructing a directed acyclic graph (DAG), task caching needs to be explicitly enabled, and cache factors added as needed. Different cache factors can be set when the same strategy is applied to different tasks. Cache factors will be discussed in detail later.
[0264] Retrieve historical data, including cache values and input_sign (input_sign will be explained in detail later). This means retrieving the historical cache data for this task (such as the values of input arguments, output arguments, etc.), as well as the historical input_sign. Of course, you can also retrieve the historical output_sign.
[0265] When making a request, the Board's cache_set interface is called to set the cache value and process sign of some data. In other words, the data lineage can be generated by making a request later. Before making a request, the interface for making the request can be set, and the historical cache value and process sign to be obtained can be set. The request made can contain the historical cache value and process sign.
[0266] Initiate a request; that is, use the content obtained above to initiate a data lineage generation request. Based on this request, the execution of this task can be skipped. That is, the above steps can realize the cache function and use the values of the output parameters of the previous task as the values of the current output parameters of this task.
[0267] Process Signature
[0268] The process sign represents the process signature, which is the data signature of the data node mentioned above. It consists of input_sign and output_sign. input_sign represents the context state of task execution, that is, the execution scenario of the task. input_sign is only related to the following values (i.e., the cache signature factors):
[0269] strategy operator name
[0270] strategy implementation version
[0271] Strategy configuration content
[0272] The task's cache factors, that is, the specific values taken by the input parameters and the method for calculating the checksum from those values.
[0273] Special cases regarding the value of input_sign:
[0274] If an optional input parameter of the strategy is not mapped when the task is generated, that parameter does not affect input_sign, and the cache factor of that parameter is treated as invalid.
[0275] If an optional input parameter of the strategy is mapped when the task is generated but the corresponding actual parameter does not exist during execution, this is equivalent to not setting the cache factor of that optional parameter.
[0276] The process sign will only be calculated if the strategy is cacheable and the task also has caching enabled.
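How input_sign could be derived from the operator name, implementation version, configuration content and cache factors is sketched below. The concrete hashing scheme (SHA-256 over a joined string) and the names `input_sign` and `text_checksum` are assumptions for illustration; the source only states which values input_sign depends on.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def text_checksum(value: Any) -> str:
    """Hypothetical checksum method for one cache factor."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def input_sign(op_name: str, op_version: str, config: Dict[str, Any],
               cache_factors: Dict[str, Callable[[Any], str]],
               args: Dict[str, Any]) -> str:
    """Combine operator name, version, config and cache factors into one sign."""
    parts = [op_name, op_version, json.dumps(config, sort_keys=True)]
    for param in sorted(cache_factors):
        # A factor whose actual parameter is absent is skipped, which is
        # equivalent to not setting that cache factor.
        if args.get(param) is not None:
            parts.append(param + "=" + cache_factors[param](args[param]))
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


factors = {"text": text_checksum}
sign_now = input_sign("TextSize", "v1", {"unit": "char"}, factors, {"text": "four"})
sign_history = input_sign("TextSize", "v1", {"unit": "char"}, factors, {"text": "four"})
print(sign_now == sign_history)
```

Inputs that are not registered as cache factors do not enter the signature in this sketch, so a task is insensitive to them for cache reuse.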
[0277] In addition to recording the signature of the task execution state context, it can also record the values of which output parameters the strategy can produce in that state, represented by output_sign.
[0278] Some strategies, under a certain state context, only have values for some of their output parameters. If the data driving the computation comes from output parameters that have no values, theoretically, the cache can be reused; that is, if it couldn't be computed last time, it can't be computed this time either.
[0279] Therefore, output_sign can be used to help determine whether a given formal parameter can produce a value under this process signature.
[0280] In implementation, a bloom filter is used. For each output parameter that can be computed under the current process signature, multiple hash indices are calculated and the corresponding bits are set to 1. When determining whether an output parameter can be computed, it is judged computable only if all of its corresponding bits are 1 (that is, the hash functions are used to check which output parameters can be computed under the current input values; a parameter whose bits are all 1 is judged computable).
[0281] Even if the check returns that a parameter can be computed, the parameter may not actually be computable, because of hash collisions between multiple parameters. If no corresponding cached content is provided for the parameter (i.e., no historical data related to the parameter is provided, such as historical assignments, historical data signatures, etc.), then the strategy needs to be executed again. For the caching function, executing it again is acceptable.
[0282] If the check returns that a parameter cannot be computed, the formal parameter definitely cannot be computed. If the data driving the computation corresponds to this formal parameter and no cached content is provided for it, the lack of cached content for this formal parameter does not by itself require the strategy to execute, and the caching function can still be used.
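The bloom-filter behaviour of output_sign can be sketched as follows. The filter width, the number of hash indices and the use of SHA-256 to derive them are assumptions; the source only states that multiple hash indices are computed per parameter and that all corresponding bits must be 1.

```python
import hashlib
from typing import Iterable, Iterator

BITS = 32     # assumed width of output_sign
HASHES = 3    # assumed number of hash indices per parameter


def _indices(param: str) -> Iterator[int]:
    """Yield the bit positions that belong to one output parameter."""
    for i in range(HASHES):
        digest = hashlib.sha256(f"{i}:{param}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:4], "big") % BITS


def make_output_sign(computable_params: Iterable[str]) -> int:
    """Set the bits of every output parameter that produced a value."""
    sign = 0
    for param in computable_params:
        for idx in _indices(param):
            sign |= 1 << idx
    return sign


def may_be_computable(sign: int, param: str) -> bool:
    """False is definite; True may be a false positive caused by collisions."""
    return all((sign >> idx) & 1 for idx in _indices(param))


sign = make_output_sign(["count", "sum"])
print(may_be_computable(sign, "count"))
```

A negative answer is always reliable, while a positive answer only means that the parameter might be computable, which is why a missing cache value after a positive answer leads to re-executing the strategy.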
[0283] cache factor
[0284] The cache factor can be understood as the cache signature factor mentioned above, consisting of multiple {input parameters, signature algorithm}, specifically {input parameter name, checksum calculation method}. DDC uses the checksum method of all cache factors to calculate the checksum value of the actual parameter, i.e., the signature, and then calculates the input_sign. If the input_sign matches the input_sign of the cached content provided by the user, the cached content can be used. That is, using the signature calculation method, the current signature of any task is calculated. If the current task's input_sign matches the historical input_sign, and the input parameters are consistent, the cache function can be used to skip executing that task.
[0285] The definition of the checksum calculation method depends on the data type of the formal parameters. In the DDC system, data types support user-defined extensions.
[0286] The impact of cache misses and cache hits
[0287] When a task uses caching, and the user provides at least one valid cache value in the request board (and the historical input_sign provided by the user is consistent with the current input_sign, and the historical input parameters are consistent with the current input parameters), then the task cache hits, and the task will skip execution. This is because providing a valid cache value indicates that the strategy previously returned SUCC under the current input_sign condition. In this case, the task's return status is OS_CACHE. If a Status data in the DAG topology depends on the return value of a task, then it is not necessary to explicitly set the cache content of the Status data; the DDC system will propagate SUCC to the Status data by default. However, if a strategy has no output parameters, and the user wants to use caching, then at least the historical cache content of the Status Data (i.e., the historical input and output parameters of the task) and the input sign at that time must be provided. The DDC system can use the input_sign to infer whether the current execution environment is consistent with the history.
[0288] If an output parameter of a task is mapped to multiple actual parameter data nodes, setting the cache value of any one of those data nodes will propagate that cache value to all the data nodes corresponding to that output parameter, that is, the output parameter of the task is propagated to multiple actual parameter data nodes.
[0289] If the output parameter corresponding to the data of the data node driving the computation can produce a value under the current input_sign, but the user has not provided the corresponding cache value, it will cause a task cache miss, and the process method of the strategy of the upstream task of the data node needs to be re-triggered.
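The hit and miss conditions described in this subsection can be condensed into a small decision function. This is a simplified sketch: `cache_hit` is a hypothetical name, the comparison of historical and current input parameters is reduced to comparing input_sign values, and `can_produce` stands in for the output_sign check.

```python
from typing import Any, Callable, Dict, List, Optional


def cache_hit(current_sign: str,
              cached_sign: Optional[str],
              cached_values: Dict[str, Any],
              driven_outputs: List[str],
              can_produce: Callable[[str], bool]) -> bool:
    """Decide whether a task may be skipped and served from the cache."""
    # The historical context must match the current one.
    if cached_sign is None or cached_sign != current_sign:
        return False
    # At least one valid cache value must be provided.
    if not cached_values:
        return False
    for param in driven_outputs:
        # A needed output that could be produced but has no cache value
        # forces the strategy to run again (cache miss).
        if param not in cached_values and can_produce(param):
            return False
    return True


def produces_count(param: str) -> bool:
    return param == "count"


print(cache_hit("s1", "s1", {"count": 4}, ["count"], produces_count))
```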
[0290] 7. Start the data generation process from the DAG cross section.
[0291] Users can set the initial value of the Source data and then define the target data node to drive the production of data for the target data node. They can also set Status data, Abstract data, and Normal data, starting the calculation from a DAG section, that is, from the intermediate process of the DAG, rather than from the source data node of the DAG.
[0292] When driving the computation of a target data node, the DDC system starts from the target data node and performs a reverse depth-first traversal. It stops when it encounters a data node that already contains data, meaning it does not execute the upstream content of that existing data node. This function can be simply referred to as: execution starting from the data anchor point.
[0293] The data generation process starting from a DAG cross section is shown in Figure 6. The user provides initial values for six data nodes (a, b, c, d, ab, and cd), driving the computation of data node abcd and data node ab-status.
[0294] In Figure 6, the gray ellipses represent data generated during the DDC system's computation, the light gray circles represent initial values assigned by the user, and the double circles represent source data. There are two tasks, with task IDs tid_1 and tid_2, both using the Add strategy, which implements an addition operation.
[0295] The user explicitly assigned a value to `cd`, so task `tid_2`, which generated `cd`, was not scheduled. The user explicitly assigned `ab` to 10, but task `tid_1` was still scheduled due to the data node `ab-status`. However, the value of `ab` was not overwritten; after the task was executed, it still contained the value assigned by the user: 10. Ultimately, the value of `abcd` is 110.
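The behavior in the Figure 6 example can be approximated with the short sketch below. It is an illustration under stated assumptions, not the DDC system's code: the names `Task` and `drive` are hypothetical, the source values for a, b, c, d and the value 100 for cd are invented for the example (only ab = 10 and abcd = 110 come from the text), and a third task `tid_3` producing abcd is assumed because the text names only tid_1 and tid_2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    tid: str
    inputs: list                  # names of upstream data nodes
    output: str                   # data node that receives the result
    status: Optional[str] = None  # optional Status data node


def drive(targets, tasks, assigned):
    """Reverse depth-first traversal starting from each target data node."""
    values = dict(assigned)
    producer = {}
    for task in tasks:
        producer[task.output] = task
        if task.status:
            producer[task.status] = task
    executed = []

    def resolve(name):
        if name in values:
            return  # data anchor point: do not visit anything upstream
        execute(producer[name])

    def execute(task):
        if task.tid in executed:
            return
        for name in task.inputs:
            resolve(name)
        result = sum(values[name] for name in task.inputs)  # Add strategy
        executed.append(task.tid)
        values.setdefault(task.output, result)  # user-assigned value prevails
        if task.status:
            values.setdefault(task.status, "SUCC")

    for target in targets:
        resolve(target)
    return values, executed


tasks = [
    Task("tid_1", ["a", "b"], "ab", status="ab-status"),
    Task("tid_2", ["c", "d"], "cd"),
    Task("tid_3", ["ab", "cd"], "abcd"),  # assumed, not named in the text
]
assigned = {"a": 1, "b": 2, "c": 3, "d": 4, "ab": 10, "cd": 100}
values, executed = drive(["abcd", "ab-status"], tasks, assigned)
```

With these inputs `tid_2` is never scheduled, `tid_1` runs only because `ab-status` is requested, `ab` keeps the user value 10, and `abcd` evaluates to 110.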
[0296] When a user assigns a value to non-source data, the calculation can start from the DAG cross section, that is, from the middle of the DAG. In this case, there are several special design considerations:
[0297] The production tasks of user-assigned non-source data, and the upstream tasks of those production tasks, will not be triggered by the DDC system unless driven by other data. When other data drives these tasks to execute, the initial values assigned by the user to the non-source data will not be overwritten. In other words, the user's explicit settings are respected; if a user assigns a value to a data node, the user's value prevails regardless of whether the upstream production task of that data node is executed.
[0298] If a data node is configured with a Watch Data Value rule, the corresponding watched data will not be generated by default unless other data drives the generation. When other data drives the generation of watched data, it no longer affects the user's assignment of values to non-source data. In other words, the data assignment behavior has already occurred, which is an objective fact.
[0299] This disclosure provides a data-oriented data lineage generation method. The generation process of data can be clearly queried through a DAG composed of data and tasks, and a graph query language can be used for convenient querying.
[0300] In offline scenarios, this solution is designed to enable computation from a single intermediate data node in the DAG, which is friendly to data iteration. During data iteration, values can be assigned to the data of a single intermediate node to test the impact of other data on the target data.
[0301] Offline computing often involves a large amount of computation and is costly. This solution presents a general cache design approach that can skip the processing tasks represented by the execution task nodes, significantly optimizing the cost of offline computing.
[0302] Based on the above method embodiments, this disclosure also provides a data lineage generation apparatus. As shown in Figure 8, the apparatus includes:
[0303] The first determining module 810 is used to determine the target data node to be generated in the target directed acyclic graph in response to receiving a data lineage generation request; wherein the target directed acyclic graph includes data nodes and task nodes, each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node.
[0304] The second determining module 820 is used to determine the target path in the target directed acyclic graph; wherein, the target path is a data generation path containing data nodes and task nodes that is required when generating the data of the target data node;
[0305] The acquisition module 830 is used to acquire the assigned data of a specified data node in the target path;
[0306] Execution module 840 is used to perform a data generation process according to the target path based on the assigned data to obtain target data;
[0307] The generation module 850 is used to generate a data lineage for the target data node based on the target data, the assigned data, and the target path.
[0308] In this scheme, a target directed acyclic graph (DAG) is pre-constructed, which includes data nodes and task nodes. Each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node. Therefore, this disclosure can generate the data lineage of data nodes based on the target DAG. Specifically, in response to receiving a data lineage generation request, the target data node in the target DAG to which the data lineage is to be generated can be determined, and the data generation path containing data nodes and task nodes, i.e., the target path, can be determined when generating the data of the target data node. Then, the assigned data of the specified data node to be assigned in the target path can be obtained, and the data generation process can be executed according to the target path based on the assigned data to obtain the target data. At this point, the target path can represent the production relationship between task nodes and data nodes when generating the data of the target data node. The target data and the assigned data can be specific numerical values or parameters. The data lineage of the target data node can be generated based on the target data, the assigned data, and the target path. As can be seen, the data lineage generation device provided in this disclosure generates a data lineage that includes data for the target data node and the specific data used in the data generation process. It also includes the production relationship between the data node and the task node when the data of the target data node is generated. This solution can generate a more granular data lineage.
[0309] Optionally, the generation module includes:
[0310] The first determining submodule is used to determine the subgraph representing the target path in the target directed acyclic graph, thereby obtaining the target subgraph;
[0311] The assignment submodule is used to assign values to the specified data nodes in the target subgraph based at least on the assignment data, and to assign values to the target data nodes in the target subgraph based on the target data, so as to obtain a lineage graph representing the data lineage of the target data nodes.
[0312] Optionally, the generation module further includes:
[0313] The second determining submodule is used to determine the data assigned to other data nodes besides the specified data node and the target data node during the target data generation process;
[0314] The assignment submodule is specifically used for:
[0315] The specified data node in the target subgraph is assigned a value based on the assigned data, the target data node in the target subgraph is assigned a value based on the target data, and the other data nodes are assigned a value based on the determined data to obtain a lineage graph representing the data lineage of the target data node.
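The generation module's two steps (extracting the target subgraph, then assigning values to its data nodes) might look like the sketch below. The function names, the edge-list representation, and the example values are assumptions made for illustration; the disclosure does not prescribe a data structure.

```python
def target_subgraph(edges, target):
    """Collect the target node and everything upstream of it."""
    upstream = {}
    for src, dst in edges:
        upstream.setdefault(dst, []).append(src)
    nodes, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node in nodes:
            continue
        nodes.add(node)
        stack.extend(upstream.get(node, []))
    sub_edges = [(s, d) for s, d in edges if s in nodes and d in nodes]
    return nodes, sub_edges


def lineage_graph(edges, target, assigned, target_value, others=None):
    """Attach values to the data nodes of the target subgraph.

    assigned:     values of the specified (initially assigned) data nodes.
    target_value: the generated target data.
    others:       optional values of other data nodes produced on the way.
    Task nodes carry no value and are stored as None.
    """
    nodes, sub_edges = target_subgraph(edges, target)
    values = {**(others or {}), **assigned, target: target_value}
    return {
        "nodes": {n: values.get(n) for n in sorted(nodes)},
        "edges": sub_edges,
    }


# Hypothetical topology: two addition tasks feeding a third.
edges = [
    ("a", "tid_1"), ("b", "tid_1"), ("tid_1", "ab"),
    ("c", "tid_2"), ("d", "tid_2"), ("tid_2", "cd"),
    ("ab", "tid_3"), ("cd", "tid_3"), ("tid_3", "abcd"),
]
graph = lineage_graph(edges, "ab", {"a": 1, "b": 2}, 3)
```

Requesting the lineage of `ab` yields only `a`, `b`, `tid_1`, and `ab`; the branch that produces `cd` is left out because it is not on the target path.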
[0316] Optionally, the specified data node includes the starting node in the target path;
[0317] The execution module is specifically used for:
[0318] The starting node in the target path is assigned the value of the assigned data. Based on the assigned starting node, the execution of the processing task represented by the task node in the target path or the assignment of the data node in the target path is triggered according to the node order in the target path to perform data generation processing and obtain the target data.
[0319] The data used by any processing task during execution includes: data assigned by the connected upstream data node; the data used by any data node during assignment includes: the task execution result of the connected upstream task node or data assigned by the connected upstream data node.
[0320] Optionally, the designated data node further includes: intermediate nodes in the target path;
[0321] The data used when assigning values to intermediate nodes in the target path is: the obtained assignment data of the intermediate nodes.
[0322] Optionally, the starting node in the target path is the source node in the target directed acyclic graph or a data node other than the source node;
[0323] The source node is a data node that has no upstream node.
[0324] Optionally, it also includes:
[0325] The first detection module is used, during the data generation process, for any data node, in response to detecting that a first observation condition related to a first specified node is set for that data node, to trigger the assignment of the data node when the first specified node meets the first observation condition; wherein, the first specified node includes: the upstream data node or task node connected to the data node;
[0326] And / or,
[0327] The second detection module is configured to, for any task node, in response to detecting that a second observation condition is set for the task node regarding a second specified node, trigger the execution of the processing task represented by the task node when the second specified node satisfies the second observation condition; wherein the second specified node includes: an upstream data node or task node to which the task node is connected.
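As a rough illustration of the first detection module, the sketch below triggers the assignment of a data node once the node it observes satisfies its observation condition. The rule layout (`node`, `observed`, `condition`, `assign`), the function name, and the example node names are all hypothetical; the second detection module would apply the same check before executing a task.

```python
def trigger_watchers(values, watch_rules):
    """Assign each watching data node whose observation condition is met.

    watch_rules: list of dicts with the (assumed) keys
        node      - data node that carries the observation condition
        observed  - upstream node being observed
        condition - predicate over the observed node's value
        assign    - function producing the value to assign
    Returns the names of the data nodes that were assigned.
    """
    triggered = []
    for rule in watch_rules:
        observed = rule["observed"]
        if observed in values and rule["condition"](values[observed]):
            values[rule["node"]] = rule["assign"](values[observed])
            triggered.append(rule["node"])
    return triggered


values = {"ab": 10}
rules = [
    {"node": "ab_is_large", "observed": "ab",
     "condition": lambda v: v > 5, "assign": lambda v: True},
    {"node": "ab_is_negative", "observed": "ab",
     "condition": lambda v: v < 0, "assign": lambda v: True},
]
triggered = trigger_watchers(values, rules)
```

Only the first rule fires here, so `ab_is_large` is assigned while `ab_is_negative` stays unset.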
[0328] Optionally, a skip execution module is also included for:
[0329] During the data generation and processing, for any task node, in response to the detection of a historical data signature containing an input signature and an output signature for that task node, it is checked whether the input signature in the historical data signature matches the signature corresponding to the current input scenario, and whether the historical assignment data of the upstream data node of that task node is the same as the currently assigned data.
[0330] If they are the same, the processing task represented by the task node is skipped, and the task execution result of the task node is determined based on the output formal parameters represented by the output signature in the historical data signature and their historical assignment data.
[0331] If they are different, then the processing task represented by that task node will be executed;
[0332] The input signature is used to represent the signature corresponding to the historical input scenario, and the output signature is used to represent the output parameters of the task node under the historical input scenario.
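The skip-or-execute decision made by the skip execution module can be sketched as follows. The class `HistoricalSignature`, the function `run_task`, and the signature strings are hypothetical; the status names OS_CACHE and SUCC are taken from the description, and the sketch assumes that a strategy which actually runs always succeeds.

```python
from dataclasses import dataclass


@dataclass
class HistoricalSignature:
    input_sign: str  # signature of the historical input scenario
    inputs: dict     # historical assignment data of upstream data nodes
    outputs: dict    # output formal parameter -> historical value


def run_task(strategy, current_sign, current_inputs, history=None):
    """Skip the task when both the signature and the inputs match history."""
    if (history is not None
            and history.input_sign == current_sign
            and history.inputs == current_inputs):
        # Cache hit: reuse the historical outputs without running the task.
        return "OS_CACHE", dict(history.outputs)
    # Otherwise execute the processing task.
    return "SUCC", strategy(**current_inputs)


calls = []


def add(a, b):
    calls.append((a, b))
    return {"sum": a + b}


history = HistoricalSignature("sign-v1", {"a": 1, "b": 2}, {"sum": 3})
cached = run_task(add, "sign-v1", {"a": 1, "b": 2}, history)
fresh = run_task(add, "sign-v2", {"a": 1, "b": 2}, history)
```

The first call reuses the historical output and never invokes `add`; the second call runs the strategy because the input signature differs.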
[0333] Optionally, an output module is also included for:
[0334] In response to the detection of an update instruction for the target directed acyclic graph, the target directed acyclic graph is updated, and a graph signature of the currently updated target directed acyclic graph is generated based on the topology of the target directed acyclic graph; wherein, the topology of the target directed acyclic graph includes: the order of data nodes in the target directed acyclic graph, the order of task nodes, the order of edges between any data node and any task node, the execution strategy configured for the task node, the observation conditions set for the task node and / or the observation conditions set for the data node.
[0335] In response to the detection of an output command for the target directed acyclic graph, the currently updated target directed acyclic graph and its graph signature are output.
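One possible way to derive a graph signature from the topology is to hash a canonical serialization of it, as in the sketch below. The disclosure does not name a hash function or a serialization format; SHA-256 over JSON, the field names, and the `Mul` strategy are assumptions chosen for illustration. List order is preserved, so reordering nodes or edges changes the signature.

```python
import hashlib
import json


def graph_signature(topology):
    """Hash the topology; list order matters, dictionary key order does not."""
    payload = json.dumps(topology, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


topology = {
    "data_nodes": ["a", "b", "ab"],
    "task_nodes": ["tid_1"],
    "edges": [["a", "tid_1"], ["b", "tid_1"], ["tid_1", "ab"]],
    "strategies": {"tid_1": "Add"},
    "watch_conditions": {},
}

sig = graph_signature(topology)
sig_same = graph_signature(dict(topology))
sig_new_strategy = graph_signature(dict(topology, strategies={"tid_1": "Mul"}))
sig_reordered = graph_signature(dict(topology, data_nodes=["b", "a", "ab"]))
```

An unchanged topology reproduces the same signature, while a different execution strategy or a different node order produces a new one.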
[0336] Optionally, the method for constructing the target directed acyclic graph includes:
[0337] Obtain configuration information about data nodes and task nodes based on the input from the configuration interface; wherein, the configuration information includes: the formal parameters of the data represented by each data node to be constructed, the execution strategy of the processing task represented by each task node to be constructed, and the dependency relationship between each data node and task node.
[0338] Based on the formal parameters and execution strategy in the configuration information, the data node generation interface and the task node generation interface are called to generate data nodes and task nodes respectively. For any data node and any task node, according to the dependency relationship in the configuration information, an edge connecting the data node and the task node is generated to obtain the target directed acyclic graph.
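The construction steps above can be sketched as follows. The configuration layout and the function name `build_dag` are hypothetical stand-ins for the configuration interface and the node generation interfaces; the acyclicity check uses the Python standard library's `graphlib.TopologicalSorter`, which is an implementation choice of this sketch rather than something stated in the disclosure.

```python
from graphlib import TopologicalSorter


def build_dag(config):
    """Create data nodes, task nodes, and edges from configuration info."""
    nodes = {}
    for d in config["data_nodes"]:
        nodes[d["name"]] = {"kind": "data", "formal_param": d["formal_param"]}
    for t in config["task_nodes"]:
        nodes[t["id"]] = {"kind": "task", "strategy": t["strategy"]}

    predecessors = {name: set() for name in nodes}
    edges = []
    for upstream, downstream in config["dependencies"]:
        if upstream not in nodes or downstream not in nodes:
            raise KeyError(f"unknown node in {upstream!r} -> {downstream!r}")
        predecessors[downstream].add(upstream)
        edges.append((upstream, downstream))

    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = list(TopologicalSorter(predecessors).static_order())
    return nodes, edges, order


config = {
    "data_nodes": [
        {"name": "a", "formal_param": "lhs"},
        {"name": "b", "formal_param": "rhs"},
        {"name": "ab", "formal_param": "sum"},
    ],
    "task_nodes": [{"id": "tid_1", "strategy": "Add"}],
    "dependencies": [("a", "tid_1"), ("b", "tid_1"), ("tid_1", "ab")],
}
nodes, edges, order = build_dag(config)
```

The returned `order` is one valid execution order, with each node appearing after all of its upstream nodes.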
[0339] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0340] This disclosure provides an electronic device, including:
[0341] At least one processor; and
[0342] A memory communicatively connected to the at least one processor; wherein,
[0343] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform any of the data lineage generation methods described above.
[0344] This disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause the computer to perform any of the data lineage generation methods described above.
[0345] This disclosure provides a computer program product, including a computer program that, when executed by a processor, implements any of the data lineage generation methods described above.
[0346] Figure 9 shows a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0347] As shown in Figure 9, device 900 includes a computing unit 901, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 902 or a computer program loaded from storage unit 908 into random access memory (RAM) 903. RAM 903 may also store various programs and data required for the operation of device 900. The computing unit 901, ROM 902, and RAM 903 are interconnected via bus 904. Input / output (I / O) interface 905 is also connected to bus 904.
[0348] Multiple components in device 900 are connected to I / O interface 905, including: input unit 906, such as keyboard, mouse, etc.; output unit 907, such as various types of monitors, speakers, etc.; storage unit 908, such as disk, optical disk, etc.; and communication unit 909, such as network card, modem, wireless transceiver, etc. Communication unit 909 allows device 900 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0349] The computing unit 901 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as the data lineage generation method. For example, in some embodiments, the data lineage generation method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and / or installed on device 900 via ROM 902 and / or communication unit 909. When the computer program is loaded into RAM 903 and executed by the computing unit 901, one or more steps of the data lineage generation method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the data lineage generation method by any other suitable means (e.g., by means of firmware).
[0350] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0351] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0352] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0353] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0354] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0355] Computer systems can include clients and servers. Clients and servers are generally remote from each other and typically interact via communication networks. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.
[0356] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed in this regard.
[0357] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A method for generating data lineage, comprising: In response to receiving a data lineage generation request, the target data node to be generated in the target directed acyclic graph is determined; wherein, the target directed acyclic graph includes data nodes and task nodes, each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node. Determine the target path in the target directed acyclic graph; wherein, the target path is the data generation path containing data nodes and task nodes that is required when generating the data of the target data node; Obtain the assigned data of a specified data node in the target path; wherein, the specified data node is the data node that needs to be initially assigned a value when generating the data of the target data node; Based on the assigned data, a data generation process is executed according to the target path to obtain the target data; Determine the subgraph representing the target path in the target directed acyclic graph, and obtain the target subgraph; At least the specified data node in the target subgraph is assigned a value based on the assigned data, and the target data node in the target subgraph is assigned a value based on the target data, to obtain a lineage graph representing the data lineage of the target data node.
2. The method according to claim 1, further comprising: Determine the data assigned to data nodes other than the specified data node and the target data node used in the target data generation process; The step of assigning values to the specified data nodes in the target subgraph based at least on the assigned data, and assigning values to the target data nodes in the target subgraph based on the target data, to obtain a lineage graph representing the data lineage of the target data nodes, includes: The specified data node in the target subgraph is assigned a value based on the assigned data, the target data node in the target subgraph is assigned a value based on the target data, and the other data nodes are assigned a value based on the determined data to obtain a lineage graph representing the data lineage of the target data node.
3. The method according to any one of claims 1-2, wherein, The specified data node includes the starting node in the target path; The step of performing a data generation process according to the assigned data and the target path to obtain the target data includes: The starting node in the target path is assigned the value of the assigned data. Based on the assigned starting node, the execution of the processing task represented by the task node in the target path or the assignment of the data node in the target path is triggered according to the node order in the target path to perform data generation processing and obtain the target data. The data used by any processing task during execution includes: data assigned by the connected upstream data node; the data used by any data node during assignment includes: the task execution result of the connected upstream task node or data assigned by the connected upstream data node.
4. The method according to claim 3, wherein, The specified data node also includes: intermediate nodes in the target path; The data used when assigning values to intermediate nodes in the target path is: the obtained assignment data of the intermediate nodes.
5. The method according to claim 3, wherein, The starting node in the target path is either the source node in the target directed acyclic graph or a data node other than the source node. The source node is a data node that has no upstream node.
6. The method according to claim 3, wherein, Also includes: During the data generation process, for any data node, in response to detecting that a first observation condition related to a first specified node is set for that data node, the assignment of that data node is triggered when the first specified node satisfies the first observation condition; wherein, the first specified node includes: the upstream data node or task node to which the data node is connected. And / or, For any task node, in response to detecting that a second observation condition is set for the task node regarding a second specified node, the execution of the processing task represented by the task node is triggered when the second specified node satisfies the second observation condition; wherein, the second specified node includes: an upstream data node or task node to which the task node is connected.
7. The method according to claim 3, further comprising: During the data generation and processing, for any task node, in response to the detection of a historical data signature containing an input signature and an output signature for that task node, it is checked whether the input signature in the historical data signature matches the signature corresponding to the current input scenario, and whether the historical assignment data of the upstream data node of that task node is the same as the currently assigned data. If they are the same, the processing task represented by the task node is skipped, and the task execution result of the task node is determined based on the output formal parameters represented by the output signature in the historical data signature and their historical assignment data. If they are different, then the processing task represented by that task node will be executed; The input signature is used to represent the signature corresponding to the historical input scenario, and the output signature is used to represent the output parameters of the task node under the historical input scenario.
8. The method according to any one of claims 1-2, further comprising: In response to the detection of an update instruction for the target directed acyclic graph, the target directed acyclic graph is updated, and a graph signature of the currently updated target directed acyclic graph is generated based on the topology of the target directed acyclic graph; wherein, the topology of the target directed acyclic graph includes: the order of data nodes in the target directed acyclic graph, the order of task nodes, the order of edges between any data node and any task node, the execution strategy configured for the task node, the observation conditions set for the task node and / or the observation conditions set for the data node. In response to the detection of an output command for the target directed acyclic graph, the currently updated target directed acyclic graph and its graph signature are output.
9. The method according to any one of claims 1-2, wherein, The methods for constructing the target directed acyclic graph include: Obtain configuration information about data nodes and task nodes based on the input from the configuration interface; wherein, the configuration information includes: the formal parameters of the data represented by each data node to be constructed, the execution strategy of the processing task represented by each task node to be constructed, and the dependency relationship between each data node and task node. Based on the formal parameters and execution strategy in the configuration information, the data node generation interface and the task node generation interface are called to generate data nodes and task nodes respectively. For any data node and any task node, according to the dependency relationship in the configuration information, an edge connecting the data node and the task node is generated to obtain the target directed acyclic graph.
10. A data lineage generation device, comprising: The first determining module is used to determine the target data node to be generated in the target directed acyclic graph in response to receiving a data lineage generation request; wherein the target directed acyclic graph includes data nodes and task nodes, each task node is connected to a data node, and the edge between any data node and any task node represents the dependency relationship between the data represented by the data node and the processing task represented by the task node. The second determining module is used to determine the target path in the target directed acyclic graph; wherein, the target path is a data generation path containing data nodes and task nodes that is required when generating the data of the target data node; The acquisition module is used to acquire the assigned data of a specified data node in the target path; wherein, the specified data node is the data node that needs to be initially assigned a value when generating the data of the target data node; The execution module is used to perform a data generation process according to the assigned data and the target path to obtain the target data; A generation module is used to determine a subgraph in the target directed acyclic graph that represents the target path, thereby obtaining a target subgraph; at least based on the assigned data, assigning values to the specified data nodes in the target subgraph, and based on the target data, assigning values to the target data nodes in the target subgraph, thereby obtaining a lineage graph representing the data lineage of the target data nodes.
11. The apparatus according to claim 10, wherein the generating module further comprises: The second determining submodule is used to determine the data assigned to other data nodes besides the specified data node and the target data node during the target data generation process; The assignment submodule is specifically used for: The specified data node in the target subgraph is assigned a value based on the assigned data, the target data node in the target subgraph is assigned a value based on the target data, and the other data nodes are assigned a value based on the determined data to obtain a lineage graph representing the data lineage of the target data node.
12. The apparatus according to any one of claims 10-11, wherein the specified data node includes the starting node in the target path; and the execution module is specifically configured to: assign the assigned data to the starting node in the target path; and, based on the assigned starting node, trigger, in the node order of the target path, the execution of the processing task represented by each task node in the target path or the assignment of each data node in the target path, so as to perform the data generation process and obtain the target data; wherein the data used by any processing task during execution includes the data assigned to the connected upstream data node, and the data used by any data node during assignment includes the task execution result of the connected upstream task node or the data assigned to the connected upstream data node.
13. The apparatus according to claim 12, wherein the specified data node further includes an intermediate node in the target path; and the data used when assigning a value to the intermediate node in the target path is the acquired assigned data of that intermediate node.
14. The apparatus according to claim 12, wherein the starting node in the target path is either a source node in the target directed acyclic graph or a data node other than a source node; and a source node is a data node that has no upstream node.
15. The apparatus according to claim 12, further comprising: a first detection module, configured to, during the data generation process and for any data node, in response to detecting that a first observation condition concerning a first specified node is set for that data node, trigger the assignment of the data node when the first specified node satisfies the first observation condition; wherein the first specified node includes an upstream data node or task node connected to the data node; and/or a second detection module, configured to, for any task node, in response to detecting that a second observation condition concerning a second specified node is set for that task node, trigger the execution of the processing task represented by the task node when the second specified node satisfies the second observation condition; wherein the second specified node includes an upstream data node or task node connected to the task node.
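The gating behaviour of claim 15 can be pictured with a short Python sketch. It assumes, purely for illustration, that an observation condition is a predicate over the value currently held by the observed upstream node; the claim itself does not fix the form of the condition, and `maybe_trigger` is a hypothetical helper.

```python
from typing import Any, Callable, Dict, Optional

# Assumption: an observation condition is a predicate on the observed node's value.
Condition = Callable[[Any], bool]


def maybe_trigger(node: str,
                  observed: str,
                  condition: Optional[Condition],
                  values: Dict[str, Any],
                  action: Callable[[Dict[str, Any]], Any]) -> bool:
    """Assign or execute `node` only when the observed upstream node satisfies the condition.

    Returns True if the node was triggered, False if it was held back.
    """
    if condition is not None:
        if observed not in values or not condition(values[observed]):
            return False  # condition set but not (yet) satisfied
    values[node] = action(values)
    return True


# Usage: only derive "title" once the upstream "html" node holds non-empty data.
values: Dict[str, Any] = {"html": ""}
first = maybe_trigger("title", "html", bool, values, lambda v: v["html"].upper())
values["html"] = "page"
second = maybe_trigger("title", "html", bool, values, lambda v: v["html"].upper())
```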
16. The apparatus according to claim 12, further comprising a skip execution module, configured to: during the data generation process, for any task node, in response to detecting that a historical data signature containing an input signature and an output signature exists for that task node, check whether the input signature in the historical data signature matches the signature corresponding to the current input scenario, and whether the historical assigned data of the upstream data node of that task node is the same as the currently assigned data; if so, skip the processing task represented by the task node, and determine the task execution result of the task node based on the output formal parameters represented by the output signature in the historical data signature and their historical assigned data; otherwise, execute the processing task represented by the task node; wherein the input signature represents the signature corresponding to a historical input scenario, and the output signature represents the output formal parameters of the task node under that historical input scenario.
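The skip logic of claim 16 resembles memoisation keyed on a signature. The sketch below is one possible reading, with several assumptions: the input signature is taken to be a SHA-256 hash of a JSON description of the input scenario, the output signature is taken to be the list of output formal parameter names, and `signature` and `run_or_skip` are hypothetical helpers.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


def signature(scenario: Dict[str, Any]) -> str:
    """Assumed signature scheme: SHA-256 over a canonical JSON encoding."""
    payload = json.dumps(scenario, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_or_skip(task: str,
                fn: Callable[..., Dict[str, Any]],
                scenario: Dict[str, Any],
                inputs: Dict[str, Any],
                history: Dict[str, Dict[str, Any]]) -> Tuple[Dict[str, Any], bool]:
    """Return (task result, skipped?) for one task node."""
    record = history.get(task)
    sig = signature(scenario)
    if (record is not None
            and record["input_signature"] == sig
            and record["inputs"] == inputs):
        # Both checks pass: rebuild the result from the output formal
        # parameters and their historical assigned data, without running fn.
        result = {name: record["output_values"][name]
                  for name in record["output_signature"]}
        return result, True
    result = fn(**inputs)
    history[task] = {
        "input_signature": sig,
        "inputs": dict(inputs),
        "output_signature": list(result),   # output formal parameter names
        "output_values": dict(result),      # their assigned data
    }
    return result, False


# Usage: the second call with identical scenario and inputs is skipped.
calls = []

def parse(html: str) -> Dict[str, Any]:
    calls.append(html)
    return {"title": html.upper()}

history: Dict[str, Dict[str, Any]] = {}
r1, skipped1 = run_or_skip("parse", parse, {"mode": "web"}, {"html": "a"}, history)
r2, skipped2 = run_or_skip("parse", parse, {"mode": "web"}, {"html": "a"}, history)
r3, skipped3 = run_or_skip("parse", parse, {"mode": "web"}, {"html": "b"}, history)
```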
17. The apparatus according to any one of claims 10-11, further comprising an output module, configured to: in response to detecting an update instruction for the target directed acyclic graph, update the target directed acyclic graph and generate a graph signature of the target directed acyclic graph based on the topology of the updated target directed acyclic graph; wherein the topology of the target directed acyclic graph includes: the order of the data nodes in the target directed acyclic graph, the order of the task nodes, the order of the edges between data nodes and task nodes, the execution strategies configured for the task nodes, and the observation conditions set for the task nodes and/or the data nodes; and, in response to detecting an output instruction for the target directed acyclic graph, output the currently updated target directed acyclic graph and its graph signature.
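One way to derive a graph signature from the topology listed in claim 17 is to hash a canonical serialisation of it. The following Python sketch assumes SHA-256 over JSON; the claim does not name a hashing scheme, and `graph_signature` and its parameters are hypothetical. List order is preserved on purpose, since the claim treats node and edge order as part of the topology.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def graph_signature(data_nodes: List[str],
                    task_nodes: List[str],
                    edges: List[Tuple[str, str]],
                    strategies: Dict[str, str],
                    conditions: Dict[str, str]) -> str:
    """Hash the topology; any change to order, edges, or settings changes the result."""
    topology = {
        "data_nodes": list(data_nodes),         # order is significant
        "task_nodes": list(task_nodes),         # order is significant
        "edges": [list(e) for e in edges],      # order is significant
        "strategies": strategies,               # execution strategy per task node
        "conditions": conditions,               # observation condition per node
    }
    # sort_keys only normalises dictionary keys; list order is kept as given.
    payload = json.dumps(topology, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Usage: recompute the signature after each update to the graph.
before = graph_signature(
    ["url", "html"], ["crawl"],
    [("url", "crawl"), ("crawl", "html")],
    {"crawl": "retry"}, {},
)
after = graph_signature(
    ["url", "html", "title"], ["crawl", "parse"],
    [("url", "crawl"), ("crawl", "html"), ("html", "parse"), ("parse", "title")],
    {"crawl": "retry", "parse": "once"}, {"parse": "html is non-empty"},
)
```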
18. The apparatus according to any one of claims 10-11, wherein the target directed acyclic graph is constructed by: obtaining configuration information about data nodes and task nodes based on input from a configuration interface; wherein the configuration information includes: the formal parameters of the data represented by each data node to be constructed, the execution strategy of the processing task represented by each task node to be constructed, and the dependencies between the data nodes and the task nodes; calling a data node generation interface and a task node generation interface, based on the formal parameters and the execution strategies in the configuration information, to generate the data nodes and the task nodes respectively; and, for any data node and any task node, generating an edge connecting the data node and the task node according to the dependencies in the configuration information, thereby obtaining the target directed acyclic graph.
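The construction steps of claim 18 can be sketched as follows. The configuration layout, `make_data_node`, `make_task_node`, and `build_dag` are hypothetical stand-ins for the configuration interface and the node generation interfaces; the acyclicity check uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), which is an addition of this sketch rather than a step recited in the claim.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Dict, List


def make_data_node(name: str, formal_params: List[str]) -> Dict[str, Any]:
    """Stand-in for the data node generation interface."""
    return {"kind": "data", "name": name, "formal_params": formal_params}


def make_task_node(name: str, strategy: str) -> Dict[str, Any]:
    """Stand-in for the task node generation interface."""
    return {"kind": "task", "name": name, "strategy": strategy}


def build_dag(config: Dict[str, Any]) -> Dict[str, Any]:
    """Generate nodes from the configuration, then add one edge per dependency."""
    nodes: Dict[str, Dict[str, Any]] = {}
    for name, params in config["data_nodes"].items():
        nodes[name] = make_data_node(name, params)
    for name, strategy in config["task_nodes"].items():
        nodes[name] = make_task_node(name, strategy)

    predecessors = {name: set() for name in nodes}
    edges = []
    for src, dst in config["dependencies"]:
        # Every edge must join exactly one data node and one task node.
        if {nodes[src]["kind"], nodes[dst]["kind"]} != {"data", "task"}:
            raise ValueError(f"edge {src!r} -> {dst!r} must join a data node and a task node")
        predecessors[dst].add(src)
        edges.append((src, dst))

    # static_order() raises CycleError if the dependencies contain a cycle.
    order = list(TopologicalSorter(predecessors).static_order())
    return {"nodes": nodes, "edges": edges, "order": order}


# Usage with an assumed configuration layout.
config = {
    "data_nodes": {"url": ["url"], "html": ["body"], "title": ["text"]},
    "task_nodes": {"crawl": "retry", "parse": "once"},
    "dependencies": [("url", "crawl"), ("crawl", "html"),
                     ("html", "parse"), ("parse", "title")],
}
graph = build_dag(config)
```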
19. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions, when executed by the at least one processor, enabling the at least one processor to perform the method according to any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-9.
21. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-9.