Rollback recovery with data lineage capture for data pipelines

By recording and synchronizing the input and output log states of intermediate operators in a distributed data pipeline, the accuracy and efficiency issues of rollback recovery and data lineage capture in a distributed data pipeline are resolved, enabling automatic recovery and error analysis after system failures.

CN115220941B · Active · Publication Date: 2026-06-19 · SAP SE

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SAP SE
Filing Date: 2021-11-30
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve accurate and repeatable rollback recovery in distributed data pipelines, especially in effectively restoring the data pipeline to its correct state during system failures and efficiently capturing fine-grained data lineage information.

Method used

By recording input events at intermediate operators in the data pipeline and updating the state in the input and output logs of intermediate operators, and using an asynchronous handshake mechanism to synchronize the process state, efficient rollback recovery and data lineage capture are achieved, ensuring that the state of the data pipeline can be accurately restored after a failure.

Benefits of technology

It enables efficient and accurate rollback recovery and fine-grained data lineage capture in distributed data pipelines, supports automatic recovery and error analysis after system failures, reduces redundant processing, and improves the response time and reliability of data pipelines.

✦ Generated by Eureka AI based on patent content.


Abstract

Computer-readable media, methods, and systems are disclosed for performing rollback recovery with data lineage capture for data pipelines. An intermediate operator receives input events from a source operator, which ingests them by reading data from an external input data source. The intermediate operator logs information about these intermediate input events in its intermediate operator input log, designating each logged event with an incomplete logging status. The intermediate operator then processes the data associated with the intermediate input events and updates the corresponding input log entries, setting to a completed logging status those entries whose input events have been consumed to generate one or more intermediate output events. The intermediate operator then transmits the intermediate output events to subsequent operators. Garbage collection is performed to remove completed entries from the intermediate operator output log. Finally, based on a recovery message received from a subsequent operator, the corresponding intermediate output events are retransmitted.

Description

Technical Field

[0001] The embodiments generally relate to accurate and repeatable rollback recovery in a data pipeline. More specifically, the embodiments relate to rollback recovery using efficient, fine-grained data lineage capture performed in conjunction with a distributed data pipeline.

Background Technology

[0002] Data pipelines support the processing of large amounts of bounded and unbounded data. Typically, a data pipeline ingests data from a data source, transforms the data, and supports subsequent data storage or further processing. Ideally, from the user's perspective, a distributed data pipeline can be viewed as a single application entity. Therefore, hiding the technical details associated with the distributed execution of the data pipeline (including potential failures of components within the pipeline) becomes the responsibility of the execution infrastructure surrounding the data pipeline. One requirement is the ability to recover from system failures related to communication between processes, where failures typically involve message loss or the loss of some execution state of a process during execution. When a failure occurs, rollback recovery protocols must be applied to restore the correct state of the data pipeline corresponding to an earlier point in time, allowing the data pipeline to continue execution. If subsequent execution of the data pipeline will yield the same result as fault-free execution, then the correct state has been restored.

[0003] To establish accurate rollback recovery for a data pipeline, it is necessary to determine the state of the data pipeline at specific points in time during execution. Two main decisions influence the design of the rollback recovery algorithm: (i) constructing the state of the data pipeline, which involves several questions, such as: what state to capture, when and where to capture it, and (ii) how to restore the correct state when recovering from a failure. The effectiveness of the algorithm depends on parameters such as the space overhead required to capture the state, the latency incurred by storing the state, potential bottlenecks to the entire pipeline execution, and the amount of redundant processing that must be completed after recovery. The amount of redundant processing affects the overall response time of the data pipeline when a given failure occurs. A fundamental requirement of the recovery protocol is that recovery of a failed data pipeline should not require intervention from the data pipeline developer or application: the system automatically maintains the state of pipeline execution according to some predefined policies that each operator must adhere to, enabling the data pipeline to automatically recover from failures.

[0004] Another issue associated with distributed data pipelines is the computational complexity and potentially large storage requirements of capturing fine-grained data lineage during pipeline execution. Data lineage describes the relationships between individual input and output data items in a computation. Data items can be as granular as records in a table. For example, given an erroneous output record in a data pipeline, it is helpful to retrieve the intermediate or input records that were used to generate the erroneous record. This helps in investigating the root cause of the error (e.g., bad input data to the data pipeline, or an incorrect computation in an operator). Similarly, identifying the output records that were affected by corrupted input records can help prevent erroneous computations. Therefore, there is a need for an accurate and repeatable rollback recovery mechanism that also provides efficient fine-grained data lineage capture in conjunction with distributed data pipeline execution, thereby addressing the aforementioned issues.

Summary of the Invention

[0005] The disclosed embodiments address the aforementioned problems by providing one or more non-transitory computer-readable media storing computer-executable instructions that, when executed by a processor, perform a method for performing rollback recovery with data lineage capture for a data pipeline. The method includes: at an intermediate operator, receiving one or more intermediate input events from a source operator, the events having been ingested by the source operator through a read operation on an external input data source; logging information about the one or more intermediate input events to an intermediate operator input log associated with the intermediate operator, wherein the one or more intermediate input events are logged with an incomplete logging status specification; processing data associated with the one or more intermediate input events; updating one or more intermediate input log entries by setting them to a completed logging status specification, the updated entries corresponding to the consumed subset of the one or more intermediate input events that were consumed to generate one or more intermediate output events; transmitting the one or more intermediate output events to one or more subsequent operators; and, based on a recovery message received from one or more subsequent operators, retransmitting the corresponding intermediate output events retained in the intermediate output log.

[0006] This summary is provided to introduce a series of concepts in a simplified form, which will be further described in the detailed description below. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter. Other aspects and advantages of this teaching will become clear from the following detailed description of embodiments and drawings.

Attached Figure Description

[0007] The embodiments are described in detail below with reference to the accompanying drawings, wherein:

[0008] Figure 1 depicts an exemplary hardware platform for certain embodiments;

[0009] Figure 2 depicts components of a system for carrying out certain embodiments;

[0010] Figure 3A depicts an exemplary data pipeline according to various embodiments;

[0011] Figure 3B depicts an exemplary block diagram illustrating data lineage capture according to various embodiments;

[0012] Figure 3C depicts an exemplary block diagram illustrating components involved in processing in a data pipeline according to various embodiments;

[0013] Figure 4A depicts an exemplary data pipeline illustrating data lineage paths and analysis points according to various embodiments;

[0014] Figure 4B depicts an exemplary data pipeline illustrating the injection of a monitoring agent into a data pipeline according to various embodiments;

[0015] Figure 4C depicts a partial data pipeline illustrating the execution of a data pipeline employing a monitoring agent according to various embodiments;

[0016] Figure 5 depicts an exemplary block diagram illustrating the operation of a data lineage application according to various embodiments;

[0017] Figure 6 depicts an exemplary data flow diagram illustrating the operation of exemplary rollback recovery mechanisms according to various embodiments; and

[0018] Figure 7 depicts an exemplary flowchart illustrating the operation of a method according to various embodiments.

[0019] The accompanying drawings are not intended to limit the invention to the specific embodiments disclosed and described herein. The drawings are not necessarily drawn to scale, but are intended to clearly illustrate the principles of this disclosure.

Detailed Implementation

[0020] In some embodiments, a data platform is disclosed that enables the provisioning and execution of applications in the form of data pipelines within a large-scale, scalable distributed architecture. In some embodiments, the distributed architecture is provided in conjunction with a serverless cloud service environment. The programming concepts associated with the data pipelines disclosed in connection with the present embodiments are based on a stream-based programming paradigm. As described herein, a “data pipeline” can be represented as a directed graph of black-box components, hereinafter referred to as “operators,” where “operators” exchange information packets (also interchangeably referred to as “messages” or “events”) by associating the “output ports” of operators with the “input ports” of operators. Operators represent asynchronous processes that execute in a data-driven mode (i.e., execute whenever their necessary inputs are available on their input ports). Operators can be grouped to execute together in an execution environment (e.g., within the same application container). A set of operators can be configured to run with dedicated multiplicity, meaning that the set of operators can be replicated across multiple instances, each running in its own node or execution environment.

[0021] Several types of operators are described herein: (i) source operators, which ingest data into a data pipeline and have no preceding operator (that is, no input connections); (ii) reader operators, which read data from an external system and output that data on their output ports (they may have input ports); (iii) intermediate operators, which take intermediate results as input and produce intermediate results; and (iv) writer operators, which write data to an external system. A source operator can be a reader operator, but some reader operators are not source operators because they have one or more connected input ports. An intermediate operator can be a writer operator, but some writer operators are not intermediate operators because they have no output connections.

[0022] In the various embodiments used herein, messages exchanged over the connection have headers containing metadata information, and datasets transmitted via the messages have a logical table format. Thus, in various embodiments, each dataset has a table structure and consists of a set of records. For the purposes of this teaching, no particular granularity of messages is imposed; these messages may consist of a single record or a set of records. In various embodiments, datasets may be bounded (i.e., having a fixed size) or unbounded (i.e., infinite), the latter being referred to herein as a “stream”.

[0023] Each of the operator types enumerated above behaves differently with respect to the creation of the logs used for rollback recovery and data lineage capture. Source operators ingest data into the data pipeline either by generating their own data or by reading data from external systems such as databases, file systems, or the queues of a publish-subscribe system, as specified by their configuration parameters. Source operators output events that carry table records. They also maintain a log of the events they ingest into the data pipeline, known as the output log.

[0024] All other operators share common logging behavior. For example, if operator A sends event e to operator B, the following steps are performed. First, A records in its output log that event e has been sent to B, with the status "Incomplete," which corresponds to the incomplete logging status specification. Then, B records event e in its input log with the status "Incomplete." When B produces an output event, B uses a single atomic transaction to: (i) record the output event in its output log with a system-generated ID (e.g., a sequence number), mark the status of the output event as "Incomplete," and retain a reference to the set of input events that were used to produce the output event; and (ii) mark the status of the corresponding input events in its input log as "Completed." The "Completed" status corresponds to the completed logging status specification. Asynchronous "garbage collection" tasks run in the background. In one background task, operator B informs operator A of the input events that have the status "Completed," and A sets these events to "Completed" in its output log. Then, in another background task, operator A instructs operator B to ignore, expunge, or otherwise discard the events marked "Completed" in A's output log, and B accordingly removes them from its input log.
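
The logging steps in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the class name Operator, the table layouts, the helper garbage_collect, and the use of an in-memory SQLite database as a stand-in for durable log storage are all hypothetical. It only shows that the output-log insert and the input-log status update happen in one transaction, and that the two background tasks then clear completed entries.

```python
import json
import sqlite3

INCOMPLETE, COMPLETED = "incomplete", "completed"


class Operator:
    """Hypothetical operator holding an input log and an output log."""

    def __init__(self, name):
        self.name = name
        self.db = sqlite3.connect(":memory:")  # assumption: stand-in for durable storage
        self.db.execute(
            "CREATE TABLE input_log (sender TEXT, event_id INTEGER, "
            "payload TEXT, status TEXT, PRIMARY KEY (sender, event_id))"
        )
        self.db.execute(
            "CREATE TABLE output_log (event_id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "receiver TEXT, payload TEXT, status TEXT, inputs TEXT)"
        )

    def receive(self, sender, event_id, payload):
        # The receiver logs the incoming event as incomplete.
        with self.db:
            self.db.execute(
                "INSERT OR IGNORE INTO input_log VALUES (?, ?, ?, ?)",
                (sender, event_id, payload, INCOMPLETE),
            )

    def send(self, receiver, payload, inputs=()):
        # One atomic transaction: log the output event as incomplete together
        # with references to its inputs, and mark those inputs as completed.
        with self.db:
            cur = self.db.execute(
                "INSERT INTO output_log (receiver, payload, status, inputs) "
                "VALUES (?, ?, ?, ?)",
                (receiver.name, payload, INCOMPLETE, json.dumps(list(inputs))),
            )
            event_id = cur.lastrowid  # system-generated sequence number
            for sender, in_id in inputs:
                self.db.execute(
                    "UPDATE input_log SET status = ? WHERE sender = ? AND event_id = ?",
                    (COMPLETED, sender, in_id),
                )
        receiver.receive(self.name, event_id, payload)
        return event_id

    def statuses(self, table):
        rows = self.db.execute(f"SELECT status FROM {table} ORDER BY event_id")
        return [status for (status,) in rows]


def garbage_collect(a, b):
    # Background task 1: B reports its completed inputs; A marks them completed.
    done = [
        eid
        for (eid,) in b.db.execute(
            "SELECT event_id FROM input_log WHERE sender = ? AND status = ?",
            (a.name, COMPLETED),
        )
    ]
    with a.db:
        a.db.executemany(
            "UPDATE output_log SET status = ? WHERE event_id = ?",
            [(COMPLETED, eid) for eid in done],
        )
    # Background task 2: A drops completed entries and B drops the same events.
    with a.db:
        a.db.execute("DELETE FROM output_log WHERE status = ?", (COMPLETED,))
    with b.db:
        b.db.executemany(
            "DELETE FROM input_log WHERE sender = ? AND event_id = ?",
            [(a.name, eid) for eid in done],
        )


a, b, c = Operator("A"), Operator("B"), Operator("C")
e = a.send(b, "row-1")                 # A's output log and B's input log: incomplete
b.send(c, "ROW-1", inputs=[("A", e)])  # B completes its input atomically
before = (a.statuses("output_log"), b.statuses("input_log"), b.statuses("output_log"))
garbage_collect(a, b)
after = (a.statuses("output_log"), b.statuses("input_log"), b.statuses("output_log"))
```

In this sketch, B's own output event stays incomplete after garbage collection because the downstream operator C has not yet consumed it. Note that dropping completed entries removes the input references that lineage queries would need, which is the tension the description raises later.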

[0025] In various embodiments, the corresponding rollback recovery process operates as follows. After a failure occurs, each failed process recovers from its persistent input and output logs. For each recovering process A, the following steps are performed. First, A sends a "recovery" message to all receivers of A's output events. When a receiver process B receives the "recovery" message, B sends A an "acknowledgment" message containing the latest event ID it received from A. A then sends to B all of A's output events with the status "Incomplete" that follow that ID. Next, A sends a "recovery" message to every sender of an event that has the status "Incomplete" in A's input log; the message contains the latest event ID that A received from that sender. When a sender process B receives such a "recovery" message from A, B resends to A all of B's output events with the status "Incomplete" that follow that ID. When A receives an event, A checks whether the event is already in its input log before logging it. All recovered input events with the status "Incomplete" are received before the corresponding events are processed in sequence.
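
The recovery handshake can be illustrated with a small in-memory simulation. This is a sketch under assumptions, not the patented protocol: the Process class, its dictionary-based logs, and the recover function are hypothetical, messages are modeled as direct method calls rather than asynchronous network messages, and event IDs are assumed to be per-sender sequence numbers starting at 1.

```python
from dataclasses import dataclass, field

INCOMPLETE, COMPLETED = "incomplete", "completed"


@dataclass
class Process:
    """Hypothetical process whose logs are assumed to survive the failure."""

    name: str
    # event_id -> (receiver name, payload, status)
    output_log: dict = field(default_factory=dict)
    # (sender name, event_id) -> (payload, status)
    input_log: dict = field(default_factory=dict)

    def latest_id_from(self, sender):
        # Latest event ID logged from the given sender (0 if none).
        return max((eid for (s, eid) in self.input_log if s == sender), default=0)

    def incomplete_outputs(self, receiver, after_id):
        # Incomplete output events for a receiver that follow a given ID.
        return [
            (eid, payload)
            for eid, (rcv, payload, status) in sorted(self.output_log.items())
            if rcv == receiver and status == INCOMPLETE and eid > after_id
        ]

    def accept(self, sender, event_id, payload):
        # Check the input log before logging, so duplicates are ignored.
        if (sender, event_id) in self.input_log:
            return False
        self.input_log[(sender, event_id)] = (payload, INCOMPLETE)
        return True


def recover(a, receivers, senders):
    """Run both halves of the handshake for recovering process `a`."""
    resent = []
    # Downstream: "recovery" to each receiver, which acknowledges with the
    # latest ID it holds; `a` then resends its later incomplete outputs.
    for b in receivers:
        ack_id = b.latest_id_from(a.name)
        for eid, payload in a.incomplete_outputs(b.name, ack_id):
            if b.accept(a.name, eid, payload):
                resent.append((a.name, b.name, eid))
    # Upstream: "recovery" to senders of incomplete inputs, carrying the
    # latest ID `a` holds from each; they resend later incomplete outputs.
    pending = {s for (s, _), (_, status) in a.input_log.items() if status == INCOMPLETE}
    for s in senders:
        if s.name not in pending:
            continue
        last_id = a.latest_id_from(s.name)
        for eid, payload in s.incomplete_outputs(a.name, last_id):
            if a.accept(s.name, eid, payload):
                resent.append((s.name, a.name, eid))
    return resent


# Scenario: S -> A -> B. A failed after logging inputs 1..3 from S; event 4
# from S and A's output 2 to B were lost in transit.
s = Process("S", output_log={i: ("A", f"x{i}", INCOMPLETE) for i in range(1, 5)})
a = Process(
    "A",
    output_log={1: ("B", "y1", INCOMPLETE), 2: ("B", "y2", INCOMPLETE)},
    input_log={("S", 1): ("x1", COMPLETED), ("S", 2): ("x2", COMPLETED),
               ("S", 3): ("x3", INCOMPLETE)},
)
b = Process("B", input_log={("A", 1): ("y1", INCOMPLETE)})
resent = recover(a, [b], [s])
```

Running recover a second time on the same processes resends nothing, because each side's latest logged ID has caught up and the duplicate check in accept filters anything already logged.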

[0026] The handshake described above serves two purposes. First, it synchronizes A and B on the latest events received or sent. Second, it supports multi-node failures: when A and its related processes fail, subsequent processing can proceed independently, and the processes that did not fail can continue execution.

[0027] The subject matter of this disclosure is described in detail below to satisfy statutory requirements; however, the specification itself is not intended to limit the scope of the claims. Rather, the claimed subject matter may be embodied in other ways in combination with other current or future techniques to include different steps or combinations of steps similar to those described in this document. Those skilled in the art will understand minor variations to the description below, and these variations are intended to be included within the scope of the claims. Unless the order of the steps is explicitly described, terminology should not be construed as implying any particular order of the described steps.

[0028] The following detailed description of embodiments is taken with reference to the accompanying drawings, which illustrate specific embodiments in which the present teachings may be practiced. The described embodiments are intended to illustrate aspects of the disclosed invention in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized, and changes may be made without departing from the scope of the invention as claimed. Therefore, the following detailed description should not be construed as limiting. The scope of the embodiments is defined only by the appended claims and the full scope of their equivalents.

[0029] In this specification, references to "one embodiment," "embodiment," or "multiple embodiments" mean that one or more features referred to are included in at least one embodiment of the present technology. Individual references to "one embodiment," "embodiment," or "multiple embodiments" in this specification do not necessarily refer to the same embodiment and are not mutually exclusive unless so stated and / or unless readily apparent from the specification to those skilled in the art. For example, features, structures, or actions described in one embodiment may be included in other embodiments, but are not necessarily included. Therefore, the present technology can include various combinations and / or integrations of the embodiments described herein.

[0030] Operating environment for the embodiments

[0031] Turning first to Figure 1, an exemplary hardware platform for certain embodiments is depicted. Computer 102 may be a desktop computer, a laptop computer, a server computer, a mobile device such as a smartphone or tablet, or any other form factor of general-purpose or special-purpose computing device containing at least one processor. For illustrative purposes, computer 102 is depicted with several components. In some embodiments, certain components may be arranged differently or be absent, and additional components may be present. Included in computer 102 is system bus 104, through which the other components of computer 102 can communicate with each other. In certain embodiments, multiple buses may be present, or components may communicate with each other directly. Connected to system bus 104 is a central processing unit (CPU) 106. Also attached to system bus 104 are one or more random access memory (RAM) modules 108 and a graphics card 110. In some embodiments, graphics card 110 may not be a physically separate card but may instead be integrated into the motherboard or CPU 106. In some embodiments, graphics card 110 has a separate graphics processing unit (GPU) 112, which can be used for graphics processing or for general-purpose computing (GPGPU). GPU memory 114 is also present on graphics card 110. Connected (directly or indirectly) to graphics card 110 is a display 116 for user interaction. In some embodiments no display is present, while in others it is integrated into computer 102. Similarly, peripheral devices such as a keyboard 118 and a mouse 120 are connected to system bus 104. Like display 116, these peripheral devices may be integrated into computer 102 or absent. Also connected to system bus 104 is local storage 122, which may be any form of computer-readable medium (such as a non-transitory computer-readable medium) and may be installed inside computer 102 or removably attached to it externally.

[0032] Computer-readable media include both volatile and non-volatile media, removable and non-removable media, and media intended to be readable by a database. For example, computer-readable media include (but are not limited to) RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD), holographic media or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage, and other magnetic storage devices. These technologies can store data temporarily or permanently. However, unless explicitly stated otherwise, the term "computer-readable media" should not be construed as including physical, but transitory, forms of signal transmission, such as radio broadcasts, electrical signals over wires, or light pulses over optical fibers. Examples of stored information include computer-usable instructions, data structures, program modules, and other data representations.

[0033] Finally, a network interface card (NIC) 124 is also attached to system bus 104, allowing computer 102 to communicate over a network such as network 126. NIC 124 can be any form of network interface known in the art, such as Ethernet, ATM, fiber optic, Bluetooth, or Wi-Fi (i.e., the IEEE 802.11 series of standards). NIC 124 connects computer 102 to local network 126, which may also include one or more other computers (such as computer 128) and network storage (such as data storage 130). Generally, data storage (such as data storage 130) can be any repository in which information can be stored and from which it can be retrieved as needed. Examples of data storage include relational or object-oriented databases, spreadsheets, file systems, flat files, directory services (such as LDAP and Active Directory), and email storage systems. Data storage can be accessed via complex APIs (such as Structured Query Language), simple APIs that provide only read, write, and seek operations, or APIs of any complexity in between. Some data storage may also provide management functions for the datasets stored therein, such as backup or version control. Data storage may be local to a single computer (such as computer 128), accessible on a local network (such as local network 126), or remotely accessible over the public internet 132. Local network 126 is in turn connected to the public internet 132, which connects many networks, such as local network 126 and remote network 134, as well as directly attached computers such as computer 136. In some embodiments, computer 102 itself may be directly connected to the public internet 132.

[0034] Turning now to Figure 2, an exemplary schematic diagram illustrating components of a system for carrying out embodiments is depicted and referred to generally by reference numeral 200. System 200 provides a platform for building, deploying, running, monitoring, and maintaining data pipelines. System 200 includes any number of client devices, such as end-user client devices 204 and developer client devices 202. An individual user can connect to components of system 200 using a single client device or multiple client devices, either concurrently or sequentially. Similarly, in some embodiments, multiple users can share (concurrently or sequentially) a single client device to access analytics associated with the data pipeline. As depicted in Figure 2, a client device can be any form of computing device discussed above with respect to Figure 1. Specifically, users may use desktop computers, laptops, or mobile devices to access components of system 200. Components of system 200 may be accessible via dedicated software on a specific client device or via a web browser associated with the client device. In some embodiments, developers and application hosting system administrators may access management functions via any client device. In other embodiments, management functions may be accessible only by a limited subset of client devices (e.g., only via developer client device 202). In some embodiments, on-premises data source 210 is an enterprise application that includes application server 206 and application data source 208. On-premises data source 210 may also be a data center, data mart, data lake, relational database server, or database server that does not adhere to relational database principles. On-premises data source 210 may provide data in a structured or unstructured manner. The data associated with on-premises data source 210 may be finite in size or may be provided as an unbounded stream.

[0035] In some embodiments, the on-premises data source 210 is used in conjunction with the application server 206 to provide services. The on-premises data source 210 may be a dedicated server, a shared server, a virtual machine instance in a cloud computing environment, or any other form of computing device discussed above with respect to Figure 1. While a single application server 206 is depicted, embodiments with multiple such application servers are also envisioned to provide scalability, redundancy, and / or isolation between different instances of applications and data sources.

[0036] Cloud service provider 212 refers to an on-demand cloud computing platform that uses dedicated servers, shared servers, virtual machine instances in a cloud computing environment, or any other form of computing device discussed above with respect to Figure 1 to provide data storage and computing resources. Cloud service provider 212 can provide services as Software as a Service (SaaS), Infrastructure as a Service (IaaS), or Platform as a Service (PaaS), including serverless execution in event-driven serverless execution environments. Serverless execution environments can allow the deployment of application containers built for a specific execution environment. Broadly speaking, an application container is an isolated instance of a specific application, including application code, application configuration resources, and specific associated libraries and application dependencies that allow for rapid and independent deployment of the application.

[0037] An exemplary application server 206 is communicatively coupled to client devices 202 and 204 and a cloud data provider 214 via a network 216. The network 216 can be a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), or the Internet. Broadly speaking, any type of network is envisioned for providing communication between the various components of system 200. The application server 206 can provide web server functionality to support web-based clients and can provide non-web server functionality to support clients using dedicated applications. Alternatively, both web-based clients and dedicated application clients can use a single web server, or the web server can be a gateway providing web-based access to the dedicated application server. Other technologies are also envisioned for enabling communication between various types of client applications.

[0038] Application data source 208 is communicatively connected to application server 206. As shown, application data source 208 is directly connected to application server 206; however, any form of communication connection can be used (e.g., network-attached storage (NAS), network file system (NFS), or cloud-based storage). Broadly speaking, application data source 208 essentially stores all persistent information used by application server 206. As previously discussed, multiple application servers can exist in system 200. In such embodiments, application servers can have their own copies of application data source 208. Alternatively, multiple group-based communication system servers can share a single network-attached application data source. Alternatively or additionally, in any of these embodiments, data can be sharded across multiple application data sources.

[0039] Operation of the Embodiments

[0040] Turning now to Figure 3A, an exemplary data pipeline 300 is depicted according to various embodiments. Data pipeline 300 consists of five operators: operator 302 labeled "W1", operator 304 labeled "W2", operator 306 labeled "R2", operator 308 labeled "R1", and operator 310 labeled "M". Each of the depicted operators has ports, represented by black dots, and connections, represented by directed links. As shown, operators R1 and R2 are both reader operators and source operators, since they have no input connections. Operator M is an intermediate operator, and operators W1 and W2 are writer operators without output connections. Finally, operator M is encapsulated in a group with a multiplicity of 2, meaning that operator M is instantiated on two nodes that may operate in parallel.
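
As a rough illustration, data pipeline 300 can be written down as a directed graph and its operators classified by their connections. The edge list below is an assumption: the text names the five operators and their roles but does not enumerate every connection, so the wiring R1 and R2 into M, and M into W1 and W2, is a plausible reading rather than a statement of what Figure 3A shows. The function names and the "terminal" label are hypothetical. Whether an operator is a reader or a writer depends on its interaction with external systems and cannot be derived from topology alone, so only connectivity-based roles are computed.

```python
from collections import defaultdict

# Assumed wiring for pipeline 300 (not spelled out edge by edge in the text).
EDGES = [("R1", "M"), ("R2", "M"), ("M", "W1"), ("M", "W2")]
# Group multiplicity: operator M is instantiated twice.
MULTIPLICITY = {"M": 2}


def classify(edges):
    """Label each operator by connectivity only."""
    preds, succs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        succs[src].add(dst)
        preds[dst].add(src)
    roles = {}
    for op in sorted(set(preds) | set(succs)):
        if not preds[op]:
            roles[op] = "source"        # no input connections
        elif not succs[op]:
            roles[op] = "terminal"      # no output connections (the writers here)
        else:
            roles[op] = "intermediate"
    return roles


def instances(multiplicity, operators):
    """Expand each operator into its replicas, one per multiplicity unit."""
    return {
        op: [f"{op}#{i}" for i in range(multiplicity.get(op, 1))]
        for op in operators
    }


roles = classify(EDGES)
replicas = instances(MULTIPLICITY, roles)
```

Here roles maps R1 and R2 to "source", M to "intermediate", and W1 and W2 to "terminal", while replicas["M"] holds two instance names that a deployment could place on separate nodes.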

[0041] When data pipelines are deployed on a distributed system architecture, as described below in conjunction with various example implementations of exemplary pipeline engine platforms, each operator is transformed into a process that runs independently or within a generic process called a sub-engine. In some embodiments, each group of operators is executed on different processing nodes of the distributed system. Nodes can be as generic as a collection of physical machines, virtual machines, machine processors, or containerized applications (e.g., Kubernetes pods). In some embodiments, if the multiplicity of a group is greater than 1, a replica of each group is executed on a different processing node. Within a processing node, operators communicate using local inter-process or inter-thread communication, and typically, communication between nodes is performed via remote inter-process communication. In various embodiments, inter-process communication is performed using an asynchronous messaging framework, which can be implemented, for example, via a publish-subscribe message distribution model, a logically global associative memory, or using low-level communication primitives within containerized UNIX pipes or sockets. Each processing node provides shared persistent storage accessible by processes running on that particular node. Therefore, there is not necessarily a unique globally shared storage service across all processing nodes, although such a unique globally shared storage service has specific advantages as described below.

[0042] Turning now to Figure 3B, an exemplary block diagram 330 illustrating data lineage capture according to various embodiments is depicted. Block diagram 330 illustrates data lineage capture for operator M (represented by 340) described above in conjunction with data pipeline 300. Input sets 336 and 334 represent all input events consumed by operator M at its respective input ports, and output set 332 represents all output events produced by M. Given a record 338 labeled "r" in output set 332, a reverse data lineage query returns, for each input of M, the set of records (represented by record sets 342 and 344) that were responsible for generating record 338. Data lineage queries are explained further below.
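
A reverse lineage lookup over the logs of a single operator can be sketched as follows. This is a simplified illustration under assumptions: the log layouts, port names, record values, and the function reverse_lineage are hypothetical, and lineage is resolved at event granularity, so the query returns every record of each contributing input event rather than only the individual records that produced "r".

```python
from collections import defaultdict

# Hypothetical input log of operator M: (input port, event ID) -> records.
input_log = {
    ("in1", 1): ["a1", "a2"],
    ("in1", 2): ["a3"],
    ("in2", 1): ["b1"],
    ("in2", 2): ["b2"],
}

# Hypothetical output log of M: each output event keeps references to the
# input events that were consumed to produce it.
output_log = {
    1: {"records": ["r0"], "inputs": [("in1", 1), ("in2", 1)]},
    2: {"records": ["r"], "inputs": [("in1", 2), ("in2", 1), ("in2", 2)]},
}


def reverse_lineage(record, output_log, input_log):
    """Return, per input port, the records of events that contributed to `record`."""
    contributing = defaultdict(list)
    for entry in output_log.values():
        if record not in entry["records"]:
            continue
        for port, event_id in entry["inputs"]:
            contributing[port].extend(input_log[(port, event_id)])
    return dict(contributing)


lineage_of_r = reverse_lineage("r", output_log, input_log)
```

With these sample logs, lineage_of_r groups "a3" under port "in1" and "b1", "b2" under port "in2", which mirrors the per-input record sets described for Figure 3B. The lookup only works while the referenced log entries are retained, so it interacts directly with log garbage collection.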

[0043] Turning now to Figure 3C, an exemplary block diagram 360 is depicted, illustrating components involved in the processing of a data pipeline according to various embodiments. In various embodiments, the components depicted in block diagram 360 provide a platform for implementing a pipeline execution engine. A log manager 362 is responsible for supporting the atomicity, consistency, isolation, and durability (ACID) properties of transactions on the input and output logs of operators. In some embodiments, a single shared log manager 362 is not required for managing the logs associated with all operators; that is, the logs can be distributed and managed by multiple log managers 362. As described above, the remaining components are various processes 364 deployed on the processing nodes. In some embodiments, in conjunction with the data pipeline engine, each processing node includes a component that manages the lifecycle of the processes implementing operators within the node. Such components (not shown) may be referred to as "group managers" or "job managers." Messages are passed on a message bus 366.

[0044] Turning now to Figure 4A, an exemplary data pipeline 400 is depicted, illustrating data tracing paths and analysis points according to various embodiments. The present teachings disclose fine-grained data tracing processing that incorporates data tracing capture, which occurs in conjunction with the logging required to implement the presently disclosed rollback recovery mechanisms. The fine-grained data tracing state of each operator is captured by the log-based recovery protocol. However, there is a tension between the need to garbage collect the logs and the need to retain log contents for data tracing queries. In various embodiments, this trade-off is managed by extending various rollback recovery embodiments to support the two main data tracing use cases described below.

[0045] The main benefits of the fine-grained data tracing queries disclosed in the present teachings are as follows. First, for some events observed at the output port of a downstream operator, it may be useful to identify the root cause of an error or inconsistent computation based on events upstream in the pipeline. Second, it may be useful to identify the impact of data events at the input port of an operator on the output of some downstream operators in the pipeline. It is therefore useful to define an "analysis start point" in the data pipeline as the point from which data tracing analysis begins. An analysis point corresponds to the input or output port of an operator whose corresponding input or output events, stored in a specific log, can be inspected. Next, an "analysis target point" is defined as a point in the data pipeline at which the results of forward or reverse data tracing analysis, i.e., the results of the data tracing query, can be observed. In various embodiments, an analysis start point and an analysis target point together define a set of connected paths in the data pipeline, which are used to formulate and process forward or reverse data tracing queries depending on whether the target point is downstream or upstream. As depicted in data pipeline 400, the input and output ports are labeled with corresponding names. Analysis points are represented by gray diamonds, where the starting point is labeled "s" and the target point is labeled "t". These two points define two reverse data tracing paths (represented by dashed lines): one path runs from the input in2 of OP3 (operator 410) to the output out2 of OP1 (operator 406), and the other path runs from the input in1 of OP3 to the output out1 of OP1. Reader 402, designated R2, provides input to operator 406. In Figure 4A, reader 401, designated R1, provides input to operator 408, designated OP2. Operator 408 has an output that provides additional input to operator 410, designated OP3. Finally, the starting point "s" of the reverse data tracing analysis is the input to OP4 (operator 412). To work backward from the starting point "s" to the target point "t", the data tracing query system, consistent with the present teachings, traverses the rollback recovery logs to determine the specific input and output values that led to the corresponding upstream computations.

[0046] In some embodiments, supporting fine-grained data tracing in data pipelines that process streaming (i.e., unbounded) data presents a significant challenge for data tracing capture. In practice, since it is impossible to know in advance when a data tracing query will be issued, arbitrary cutoff times would have to be established for how long the contents of the rollback recovery logs are retained. To address this issue, a data tracing mode is defined for the streaming use case in which events generated by the data pipeline that represent critical or anomalous situations must be detected and alerts issued for them (such events are hereinafter referred to as alert events). In these cases, it is useful to retain the upstream events that were responsible for generating the alert events (potentially all the way back to the source operators that ingested some of those events). This ability to find the "root cause" of alert events is referred to herein as fine-grained streaming data provenance.

[0047] In various embodiments, a monitoring agent is introduced into the data pipeline at a specific connection between operators to check conditions on the events flowing through that connection. In some embodiments, the monitoring agent has a single input port corresponding to the monitored connection. However, it should be understood that the associated mechanisms disclosed herein can be generalized to support different numbers of input ports. Having multiple input ports allows the monitoring agent to use additional data (e.g., data accessed from external systems) to check conditions on the events arriving on the connection. The monitoring agent may also have two output ports: a "good" output for events that satisfy the associated conditions, and a "bad" output for events that do not satisfy them (i.e., alert events). In some embodiments, the logic of the monitoring agent process checks a condition (stateless or stateful) on one or more input events and outputs these input events on one of its two output ports. An example of a stateless condition is checking that an attribute value of an event is not out of bounds and is not null. An example of a stateful condition is detecting outliers in a fixed-size set of consecutive input events.
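By way of a non-limiting illustration, the routing behavior of a monitoring agent may be sketched as follows in Python. The class and function names, the event layout (a dictionary with a "value" attribute), the bounds, and the outlier rule (absolute deviation from the mean of the preceding window) are all assumptions introduced for this sketch and are not prescribed by the disclosure.

```python
from collections import deque
from statistics import mean


def not_null_and_in_bounds(event, low=0.0, high=100.0):
    """Stateless condition: the attribute is present and lies within [low, high]."""
    value = event.get("value")
    return value is not None and low <= value <= high


class WindowOutlierCondition:
    """Stateful condition over a fixed-size window of consecutive input events."""

    def __init__(self, size=5, max_deviation=3.0):
        self.window = deque(maxlen=size)
        self.max_deviation = max_deviation

    def __call__(self, event):
        value = event["value"]
        history = list(self.window)       # events seen before this one
        self.window.append(value)
        if len(history) < self.window.maxlen:
            return True                   # not enough history to judge yet
        return abs(value - mean(history)) <= self.max_deviation


class MonitoringAgent:
    """Forwards each input event to either the 'good' or the 'bad' output port."""

    def __init__(self, condition):
        self.condition = condition
        self.good = []                    # events that satisfy the condition
        self.bad = []                     # alert events

    def on_event(self, event):
        port = self.good if self.condition(event) else self.bad
        port.append(event)
```

In this sketch the output ports are modeled as plain lists; in a deployed pipeline they would instead be connections to downstream operators.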

[0048] Turning now to Figure 4B, an exemplary data pipeline 430 is depicted, illustrating the injection of a monitoring agent into a data pipeline according to various embodiments. Data pipeline 430 corresponds to the previous example of data pipeline 400, in which a monitoring agent 442, designated MA, has been injected into the connection between an operator 410 designated OP3 and an operator 444 designated OP4 to detect alert events. A new operator 445, designated OP5, is introduced to manage the alert events returned by monitoring agent 442. The "bad" output port of monitoring agent 442 is associated with a starting analysis point "s". Setting a target analysis point "t" (e.g., on the input port of OP1) defines a reverse data tracing path that can be used to generate a reverse data tracing query.

[0049] In various embodiments, the output ports of monitoring agents are associated with possible starting analysis points, so that all events necessary to provide the underlying data associated with alert events generated at those analysis points are retained in the rollback recovery logs. This has at least two benefits. First, the scope of data tracing queries is limited to the data tracing paths from the "bad" output port of a monitoring agent to the target analysis points. Therefore, only the logs of operators located on these paths are involved in data tracing capture. Second, the set of events that must be retained in these logs is determined by the occurrence of alert events. Therefore, if the monitoring agent does not detect an alert event, upstream events can be removed from the logs. The embodiments described below modify the disclosed rollback recovery protocol to implement the associated changes to the log entry retention rules.

[0050] Suppose an event "e" is created on the "bad" output port of a monitoring agent. Normally, this event is recorded in the output log, and the status of the input events that were used to generate it is set to "complete." In some such embodiments, however, all input events that were used to generate the alert event are marked "frozen" instead of "complete." The "frozen" status means that the event is kept in the log and is prevented from being garbage collected later by the associated background tasks.

[0051] Turning now to Figure 4C, a partial data pipeline 460 employing a monitoring agent is depicted, illustrating partial data pipeline execution according to various embodiments. In partial data pipeline 460, an intermediate operator 462 designated OP1 (operators preceding OP1 are not shown) has an output port connected to a monitoring agent 464 designated MA, and MA outputs its events to two recipients 466 and 468 designated OP2 and OP3, respectively. In Figure 4C, the "bad" output is connected to OP3, and the condition check of the MA process is stateless (i.e., each output event references itself as its input event). The logs built during normal execution of the associated data pipeline are modified by the additional steps described above. The input log of MA contains two events, e1 and e2. The MA operator generates event e2 on its "bad" output port and event e1 on its "good" output port. Therefore, e2 is an alert event, and its status in the output log of MA is "frozen". Since e2 depends on itself, event e2 must also be retained in the input log of MA, so its status there is also set to "frozen", which preserves the recorded value in the rollback recovery log for future data tracing queries. Finally, since e1 is not an alert event and e1 depends on itself, its status is set to "complete" in the input log of MA.
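As a minimal sketch of the example above, the log statuses of a stateless monitoring agent may be modeled in Python as shown below. The dictionary layout and the function name are hypothetical, and the "incomplete" status given to a non-alert event in the output log is an assumption made for this sketch, since the example above only states the output log status of the alert event.

```python
def run_stateless_agent(input_event_ids, is_alert):
    """Process events one by one and return (input_log, output_log).

    Each output event references itself as its only input event, as in the
    stateless case described above.
    """
    input_log, output_log = {}, {}
    for event_id in input_event_ids:
        input_log[event_id] = "incomplete"            # recorded on arrival
        alert = is_alert(event_id)
        output_log[event_id] = {
            "port": "bad" if alert else "good",
            # Assumption: non-alert outputs stay "incomplete" until acknowledged.
            "status": "frozen" if alert else "incomplete",
            "inputs": [event_id],
        }
        # Alert inputs are frozen instead of completed so they survive cleanup.
        input_log[event_id] = "frozen" if alert else "complete"
    return input_log, output_log


input_log, output_log = run_stateless_agent(["e1", "e2"], lambda eid: eid == "e2")
```

With e2 as the only alert event, e2 ends up "frozen" in both logs while e1 ends up "complete" in the input log.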

[0052] In various embodiments, the logic of the log garbage collection task is adapted such that, if an input event e has a "frozen" status in the input log of a process, the task sends a "freeze" event (instead of an "acknowledge" event) carrying the ID of frozen event e to the sender of the event. For all input events with a "complete" status, the background garbage collection task operates as previously described.

[0053] Because garbage collection is performed independently for each process, it is essential to ensure that a process does not begin discarding events that are actually needed to explain the root cause of an alert event that may occur downstream. Indeed, in the previous example, the garbage collection task might read the input log of an intermediate operator (such as OP1), send an "acknowledge" event to the sender process, and finally discard all of its "complete" events, potentially discarding events needed for future data tracing queries. This would have the undesirable effect of losing information, namely the events that were used to generate e2 and that might ultimately be needed to explain the alert event.

[0054] In one embodiment, a background task of each upstream process along a reverse data tracing path of the monitoring agent reads the input log and performs the following operations. First, if the status of event e is "complete" and the status of all output events associated with e is "complete", the task sends an "acknowledge" event with the ID of the completed event e to the sender of the event. Second, if the status of event e is "frozen", the task sends a "freeze" event with the ID of event e to the sender of the event.

[0055] In this embodiment, a background task of each process that is not on a reverse data tracing path of the monitoring agent reads the input log and performs the following operation: if the status of event e is "complete", the task sends an "acknowledge" event with the ID of event e to the sender of the event. When a process receives a "freeze" event for event e, the process sets the status of e to "frozen" in its output log and sets the status of all input events associated with e to "frozen" in its input log. When a process receives an "acknowledge" event for event e, the process performs the following operations: if the event ID exists in the output log of the process, the process sets the status of the event to "complete"; otherwise, the process sends an "ignore" event with the same event ID to the process that sent the "acknowledge" event. In this embodiment, another background task of each process reads its output log and performs the following operations: if the status of an event is "complete", the task sends an "ignore" event with the event ID to the recipient of the event (such "ignore" events may be sent in batches) and deletes the event from its output log. When a process receives an "ignore" event for an event ID, the process removes the event from its input log.
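The control message exchange described above can be illustrated with the following in-memory Python sketch. It is an illustration under stated assumptions rather than a definitive implementation: messages are modeled as direct method calls instead of asynchronous messages, the class name Process and its field names are hypothetical, and the source process SRC and the input event a1 (taken here to have produced e1) are invented for the usage example.

```python
class Process:
    """A pipeline process holding an input log and an output log in memory."""

    def __init__(self, name, on_tracing_path=True):
        self.name = name
        self.on_tracing_path = on_tracing_path
        self.input_log = {}   # event_id -> {"status", "sender"}
        self.output_log = {}  # event_id -> {"status", "receiver", "inputs"}

    def scan_input_log(self):
        """Background task: report complete or frozen input events to the sender."""
        for eid, entry in list(self.input_log.items()):
            if entry["status"] == "frozen":
                entry["sender"].on_freeze(eid)
            elif entry["status"] == "complete":
                outputs = [o for o in self.output_log.values() if eid in o["inputs"]]
                # On a tracing path, wait until all dependent outputs are complete.
                if not self.on_tracing_path or all(o["status"] == "complete" for o in outputs):
                    entry["sender"].on_acknowledge(eid, self)

    def scan_output_log(self):
        """Background task: delete complete output events and notify the receiver."""
        for eid, entry in list(self.output_log.items()):
            if entry["status"] == "complete":
                entry["receiver"].on_ignore(eid)
                del self.output_log[eid]

    def on_acknowledge(self, eid, acknowledger):
        if eid in self.output_log:
            self.output_log[eid]["status"] = "complete"
        else:
            acknowledger.on_ignore(eid)

    def on_freeze(self, eid):
        entry = self.output_log.get(eid)
        if entry is None:
            return
        entry["status"] = "frozen"
        for input_id in entry["inputs"]:
            if input_id in self.input_log:
                self.input_log[input_id]["status"] = "frozen"

    def on_ignore(self, eid):
        self.input_log.pop(eid, None)


# Usage: SRC -> OP1 -> MA -> (OP2 on "good", OP3 on "bad"); e2 is the alert event.
src, op1, ma = Process("SRC"), Process("OP1"), Process("MA")
op2, op3 = Process("OP2", on_tracing_path=False), Process("OP3")
for eid in ("a1", "b1", "b2"):
    src.output_log[eid] = {"status": "incomplete", "receiver": op1, "inputs": []}
    op1.input_log[eid] = {"status": "complete", "sender": src}
op1.output_log["e1"] = {"status": "incomplete", "receiver": ma, "inputs": ["a1"]}
op1.output_log["e2"] = {"status": "incomplete", "receiver": ma, "inputs": ["b1", "b2"]}
ma.input_log["e1"] = {"status": "complete", "sender": op1}
ma.input_log["e2"] = {"status": "frozen", "sender": op1}
ma.output_log["e1"] = {"status": "incomplete", "receiver": op2, "inputs": ["e1"]}
ma.output_log["e2"] = {"status": "frozen", "receiver": op3, "inputs": ["e2"]}
op2.input_log["e1"] = {"status": "complete", "sender": ma}

op2.scan_input_log()   # OP2 acknowledges e1 to MA
ma.scan_input_log()    # MA acknowledges e1 and sends freeze for e2 to OP1
op1.scan_output_log()  # OP1 drops e1 and tells MA to ignore it
op1.scan_input_log()   # OP1 acknowledges a1 and sends freeze for b1, b2 to SRC
```

After these four scans, e2 and the events b1 and b2 that produced it remain frozen along the whole path, while e1 has been removed from the output log of OP1 and from the input log of MA.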

[0056] In some embodiments, the garbage collection task of intermediate operator OP1 is blocked from executing until the status of its output events is set to "complete" or "frozen," meaning that OP1 has received an "acknowledge" or "freeze" event from the MA process. If MA receives an "acknowledge" event from process OP2 for event e1, MA sends an "acknowledge" event to process OP1 for event e1 once MA starts its garbage collection task. MA then sends a "freeze" event to OP1 for event e2. When OP1 has received both the "acknowledge" and "freeze" events from process MA, OP1 sets the status of e1 and e2 to "complete" and "frozen", respectively, in its output log, and OP1 sets the status of input events b1 and b2, which generated e2, to "frozen." Next, the garbage collection task of OP1 starts and proceeds according to the event statuses described above. This state is safe because all events required to explain event e2 are marked "frozen" and will not be garbage collected.

[0057] The present disclosure also describes an alternative mode of data lineage capture applicable to data pipeline development use cases. In this mode, designers use test data to check the correct behavior of the data pipeline under construction. While the tests are executed, the data pipeline runs for a limited time, and designers can inspect the data output by some operators to check its correctness. It is therefore known at which operator output ports the data can be inspected for data lineage, but not which data will be inspected. Consequently, the previous technique, which selectively freezes log content using knowledge of the alert events output by monitoring agents, cannot be employed.

[0058] To overcome this, in some embodiments, the data pipeline designer initializes a starting analysis point and one or more associated target analysis points, wherein one or more associated target analysis points are enabled when the pipeline is run in debug mode. During runtime, all logs along the reverse data traversal path from the analysis starting point to the target analysis point are retained. Similarly, all logs along the forward data traversal path from the analysis starting point to the target analysis point are retained. When analysis points are statically set, the pre-computation phase can identify all logs that must be retained intact, thus identifying processes whose garbage collection tasks must be disabled. In such test or debug modes, the data pipeline runs for a limited time, therefore the size of the frozen logs is limited.
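The pre-computation phase mentioned above can be sketched as a graph reachability problem. The following Python example is a simplified illustration: it works at operator granularity rather than port granularity, the function name is hypothetical, and the example pipeline is only loosely modeled on the readers and operators of Figure 4A.

```python
def operators_on_paths(edges, upstream, downstream):
    """Return the operators lying on at least one directed path
    from `upstream` to `downstream` in the pipeline graph."""
    successors, predecessors = {}, {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
        predecessors.setdefault(dst, set()).add(src)

    def reachable(start, neighbours):
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # An operator is on a path if it is downstream of the upstream point
    # and upstream of the downstream point.
    return reachable(upstream, successors) & reachable(downstream, predecessors)


# Hypothetical pipeline: target analysis point on OP1, start analysis point on OP4.
edges = [("R2", "OP1"), ("R1", "OP2"), ("OP1", "OP3"), ("OP2", "OP3"), ("OP3", "OP4")]
retained = operators_on_paths(edges, upstream="OP1", downstream="OP4")
```

The operators in the returned set are those whose logs would be retained intact and whose garbage collection tasks would be disabled for the debug run; all other operators can clean up their logs as usual.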

[0059] Because data storage is finite, a policy is defined for how the log contents retained for data tracing capture eventually expire. In the streaming monitoring scenario described above, the trigger for discarding events from the logs can be the deletion of an alert event. After an alert event is deleted, all associated "frozen" events that were used solely to explain the data tracing of that alert event can also be removed from the logs. This can be computed incrementally, starting from the alert event: all "frozen" events that were used to generate alert event e and were not used to generate another alert event can be discarded. To perform this test efficiently, a counter can be maintained for each input event in the "frozen" state, where the counter indicates the number of output events that reference that input event. This can also be performed in conjunction with the log merging technique described below. The same process is then repeated iteratively along the reverse data tracing paths of the monitoring agent and terminates when the source processes are reached.
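One possible way to realize the per-event counters is sketched below in Python for a single process. The class name FrozenLog and its methods are hypothetical, and the sketch assumes that each frozen output event lists the IDs of the input events it was generated from.

```python
class FrozenLog:
    """Tracks, for one process, how many frozen output events reference
    each frozen input event."""

    def __init__(self):
        self.ref_counts = {}    # frozen input event id -> number of referencing outputs
        self.frozen_outputs = {}  # frozen output event id -> list of input event ids

    def freeze(self, output_id, input_ids):
        """Record a frozen output event and count a reference on each input."""
        self.frozen_outputs[output_id] = list(input_ids)
        for input_id in input_ids:
            self.ref_counts[input_id] = self.ref_counts.get(input_id, 0) + 1

    def delete(self, output_id):
        """Delete a frozen output event and return the input events that are
        no longer referenced by any frozen output event."""
        released = []
        for input_id in self.frozen_outputs.pop(output_id):
            self.ref_counts[input_id] -= 1
            if self.ref_counts[input_id] == 0:
                del self.ref_counts[input_id]
                released.append(input_id)
        return released


log = FrozenLog()
log.freeze("alert1", ["b1", "b2"])
log.freeze("alert2", ["b2", "b3"])
first = log.delete("alert1")    # b2 is still needed by alert2
second = log.delete("alert2")
```

The IDs returned by delete are the events that the upstream sender may in turn delete from its own output log, which repeats the same step one hop further along the reverse path.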

[0060] In various other embodiments, a configurable expiration time can be used for alert events based on the timestamp of when the alert event was generated. The expiration time can be configured differently for each monitoring agent. When an alert event expires, the alert is scheduled to be deleted. However, the alert event and its data lineage can be extracted from the logs and loaded into some third-party storage for later analysis when needed.

[0061] In development and debugging scenarios, the execution of a data pipeline terminates at a certain point. A period for retaining events in the log must be defined. Alternatively, an expiration strategy based on timestamps defining the start and end times of the data pipeline can be used. The contents of the logs for older data pipeline executions can be discarded. This can be achieved using graph identifiers associated with all events in the log. In various implementations, the contents of the logs for a specific data pipeline execution are extracted and persistently stored, for example, in a third-party persistent storage.

[0062] In some embodiments, the method used to process data tracing queries depends on the specific data tracing use case. In streaming processing mode, the data pipeline runs continuously and generates alert events. As described above, each “bad” output port of the monitoring agent is associated with a starting analysis point. Before running a reverse data tracing query, the analyst must select one or more alert events and set the analysis target point for the specific query. This determines the set of reverse data tracing paths used to process the query. In some embodiments, such queries run concurrently with the data pipeline that continues its normal execution.

[0063] In development and debug mode, data pipeline designers can set the analysis start point and one or more associated analysis target points. This is done before deploying and executing the data pipeline. Unlike streaming mode, data tracing queries begin when the data pipeline stops. Before running a reverse data tracing query, the analyst must select one or more events in the output log associated with the analysis start point and choose one of the predefined target points. This determines the set of reverse data tracing paths used to process the query. The process for generating reverse and forward data tracing queries is similar. For reverse data tracing queries, the user selects a specific output event in the output log associated with the analysis start point. The result of the user's selection is a simple initial filter on the output log that takes into account data tracing processing.

[0064] Turning now to Figure 5, an exemplary block diagram 500 is depicted, illustrating the operation of a data tracing application according to various embodiments. Block diagram 500 illustrates the architecture for a specific pipeline with two groups 508 and 510 and three sub-engines 512, 514, and 516. Operators are represented by circles, and the rectangles to the left and right of the circles represent the threads that process and log input and output events, respectively. Dashed black arrows indicate connections between these threads and the log backends assigned to them (some of these connections are omitted for simplicity). Solid arrows indicate data connections between operators and communication between the garbage collector threads responsible for cleaning up each log.

[0065] In the illustrated example, separate log storage backends 504 and 506 are provided for each group. The number of different backend stores used for logging can be configured by the user or determined automatically by the system by attempting to minimize some cost function (e.g., communication cost). The only limitation imposed by the exemplary protocol is that the input and output logs of a given operator must reside in the same backend storage, as the protocol requires atomic operations that write to the output log and then change the state of some event in the input log.
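The reason for co-locating both logs of an operator can be illustrated with a minimal sketch that uses the Python standard library module sqlite3 as a stand-in for a log storage backend. The table layout, the status values of the output log, and the function name emit_output are assumptions made for this sketch; the disclosure does not prescribe a particular storage engine or schema.

```python
import sqlite3

backend = sqlite3.connect(":memory:")
backend.executescript("""
    CREATE TABLE input_log  (event_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE output_log (event_id TEXT PRIMARY KEY,
                             input_id TEXT NOT NULL,
                             status   TEXT NOT NULL);
""")
backend.executemany("INSERT INTO input_log VALUES (?, 'incomplete')", [("e1",), ("e2",)])
backend.commit()


def emit_output(conn, input_id, output_id):
    """Write the output event, then mark its input event complete,
    both inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO output_log VALUES (?, ?, 'incomplete')",
            (output_id, input_id),
        )
        conn.execute(
            "UPDATE input_log SET status = 'complete' WHERE event_id = ?",
            (input_id,),
        )


emit_output(backend, "e1", "o1")
```

Because both tables live in the same database, a failure between the two statements leaves neither change visible. If the logs resided in different backends, a distributed commit protocol would be needed to obtain the same guarantee.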

[0066] In various embodiments, the source and target analysis points for data tracing can be specified at design time (before running the graph). This ensures that the subgraph description sent to each sub-engine during pipeline startup contains sufficient information to perform data tracing capture without centralized coordination. In streaming monitoring mode, data tracing application 502 can present all alert events returned by the monitoring agent and capture user selections for these events. The user is also asked to define the analysis target point. The user selection and analysis target point are then considered during the generation of data tracing queries. After the generated data tracing query is executed directly on the relevant log storage backend, data tracing application 502 returns the query results to the user. In some embodiments, data tracing application 502 can poll the source logs for problematic events. Data tracing application 502 operates similarly in development and debug modes.

[0067] Turning now to Figure 6, an exemplary data flow diagram 600 is depicted, illustrating the operation of exemplary rollback recovery mechanisms according to various embodiments. In some embodiments, data flow diagram 600 illustrates a rollback recovery mechanism that provides a unified solution for both rollback recovery and fine-grained data tracing capture during the execution of a distributed data pipeline. The general idea is to maintain a persistent state of the data pipeline that is sufficient to enable proper recovery after a failure while also supporting data tracing queries. The expected benefit is that data tracing capture is obtained as a side effect of the rollback recovery protocol, with very little additional overhead.

[0068] Data flow diagram 600 is based on a pipeline similar to pipeline 300 of Figure 3A, except that there is a single reader operator R and a single replica of operator M. Each operator is deployed on a separate processing node. Figure 6 illustrates the processing of the data pipeline. Source operator 604 executes two consecutive read operations (e.g., two database queries, one after the other), each resulting in the ingestion of a sequence of events consisting of two parts (i.e., event parts 614 and 616, or event parts 628 and 630). That is, event parts 614 and 616 are the results of the first query against input data source 602, and event parts 628 and 630 are the results of the second query against input data source 602. Intermediate operator 606 (process M) is stateless and processes input events sequentially: when process M receives event e, it generates two new multipart events that are sent to W1 and W2, respectively. When event 1 arrives at process M, it is recorded as "incomplete" in the input log associated with process M. The associated data is processed according to the logic of process M, and the output events are then generated within an atomic transaction in which the status of the input event is also set to "complete," as indicated by a black dot.

[0069] Each writer process accumulates received events originating from the same original read action and uses these received events to issue a single write transaction to the external system. Labeled gray diamonds indicate points where write transactions (represented by gray dots) have been issued using previously received events from M, marked "complete". Therefore, transactions marked "t1" in the gray diamond are formed at W1 using events 618 and 622 from the intermediate operator, and another transaction marked "t2" is formed at W1 using events 632 and 636 from the intermediate operator.

[0070] However, reliable communication protocols cannot guarantee the delivery of events when process failures occur. For example, if a sent event is lost due to a failure of the intended receiver, the communication protocol may generate a timeout and notify the sender that the event could not be delivered. The disclosed rollback recovery protocol, however, ultimately makes all sent events available to the intended receiver(s) after successful recovery, ensuring a consistent state of the data pipeline execution.

[0071] In the working example, the node hosting process W2 fails, while all other processes remain active. All processes other than W2 then continue running, and process W2 is restarted. When this happens, process W2 performs the following steps: W2 sends a "recovery" message to M containing the ID of event 622, where event 622 is the last event successfully received by W2. M then resends events 644, 646, and 648, which are indicated by dashed arrows in Figure 6.
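A minimal Python sketch of the resend step is given below. The class name, the event IDs o1 through o4, and the single-receiver output log are hypothetical simplifications; in addition, resending the entire retained log when the reported ID is unknown is an assumption of this sketch and is not stated in the disclosure.

```python
class SendingProcess:
    """Keeps an ordered output log so that events can be resent on recovery."""

    def __init__(self):
        self.output_log = []  # ordered list of (event_id, payload)

    def send(self, event_id, payload):
        self.output_log.append((event_id, payload))

    def on_recovery(self, last_received_id):
        """Return every logged output event that follows `last_received_id`."""
        ids = [event_id for event_id, _ in self.output_log]
        if last_received_id in ids:
            start = ids.index(last_received_id) + 1
        else:
            start = 0  # assumption: unknown ID means resend everything retained
        return self.output_log[start:]


m = SendingProcess()
for event_id in ("o1", "o2", "o3", "o4"):
    m.send(event_id, payload=event_id.upper())

resent = m.on_recovery("o2")  # the restarted receiver last saw o2
```

The sender can answer the recovery message purely from its output log, which is why output events must not be garbage collected before the receiver has acknowledged them.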

[0072] After W2 has received all recovery events, events 620 and 644 are used to form the first write transaction, and the status of the input events is updated to "complete" in the input log of W2. Next, events 646 and 648 are used to form another write transaction. While these steps are being performed, the input log of M and the output log of R can be cleaned up by the background asynchronous garbage collection tasks.

[0073] In addition to maintaining the state necessary for accurate rollback recovery of the pipeline, data lineage is captured. First, an overview is provided of solutions for capturing data lineage in the logs created by the mechanisms of the disclosed rollback recovery protocol. The general principle is to associate a reference with each output event of an operator A, where the reference indicates the input events that A used to generate that output event. Different types of references are possible. A reference can be the ID of a single input event, an offset interval of consecutive events in the input log, or an ID assigned to a set of input events when the events were received (such as a window ID defined in a streaming system).
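The three reference types can be modeled, for illustration only, with the following Python data classes. All names, the inclusive interpretation of the offset interval, and the layout of the input log (a list of dictionaries ordered by offset) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRef:
    """Reference to a single input event by its ID."""
    event_id: str


@dataclass(frozen=True)
class IntervalRef:
    """Reference to consecutive input events by offset, both ends inclusive."""
    first: int
    last: int


@dataclass(frozen=True)
class WindowRef:
    """Reference to all input events that were assigned the same window ID."""
    window_id: str


def resolve(ref, input_log):
    """Return the IDs of the input events designated by `ref`.

    `input_log` is a list of dicts with keys 'event_id' and 'window_id',
    ordered by offset.
    """
    if isinstance(ref, EventRef):
        return [e["event_id"] for e in input_log if e["event_id"] == ref.event_id]
    if isinstance(ref, IntervalRef):
        return [e["event_id"] for e in input_log[ref.first:ref.last + 1]]
    if isinstance(ref, WindowRef):
        return [e["event_id"] for e in input_log if e["window_id"] == ref.window_id]
    raise TypeError(f"unsupported reference type: {type(ref).__name__}")


input_log = [
    {"event_id": "i0", "window_id": "w1"},
    {"event_id": "i1", "window_id": "w1"},
    {"event_id": "i2", "window_id": "w2"},
]
```

The interval and window forms keep an output log entry compact when an output event depends on many input events, at the cost of an extra lookup when the reference is resolved.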

[0074] A major challenge associated with data tracing capture is determining how long the log contents should be retained. In the context of streaming data processing applications (where the data pipeline ingests a continuous stream of events and can run indefinitely), it is clearly impossible to retain the log contents indefinitely. The first solution addresses the monitoring of event processing in the pipeline, in which alerts are issued when certain events meet specific conditions. The associated data tracing approach minimizes the number of events to be retained in the logs such that reverse data tracing queries can still be answered efficiently for alert events. More specifically, the disclosed method first marks alert events as "frozen," and then uses control messages exchanged between operators to ensure that the events that recursively contributed to alert events are not garbage collected by the background tasks of the rollback recovery protocol. Ultimately, the logs consist only of events in the "frozen" state, and these events are guaranteed to be the minimum set of events that must be retained in the logs to support the processing of reverse data tracing queries.

[0075] The present disclosure describes another embodiment for scenarios in which data pipelines run for a limited time, either because the input data ingested into the pipeline is bounded (e.g., the input data is a file) or because pipeline execution is intentionally stopped at some point. Such scenarios are well suited to the development and debugging phases of data pipelines, in which tests are run using limited test input data. This embodiment involves setting a starting analysis point on the output port of an operator to indicate that all of its output events should be retained in the output log. Events in the log can be inspected later, and one or more of these events can be used to initiate data traversal processing. To this end, a target analysis point can be set on the input port of another operator, and the paths connecting the starting and target analysis points define the scope of a forward or reverse data traversal query. All analysis points are set before the data pipeline is executed, which makes it possible to scope the events in the logs that need to be retained for future data traversal queries associated with one or more test or debug sessions. The disclosed embodiment strikes a balance between retaining the log contents used to answer future data traversal queries and discarding events from the logs by means of the background tasks that garbage collect the logs.

[0076] Data traversal queries can be performed in various ways. In some embodiments, for a given operator, a reverse data traversal query is expressed as a single join between the output log and the input log of the operator. Additional joins are used to retrieve the data referenced by the unique event identifiers computed by the first join. The join expression between the input and output logs of the operator depends on the method by which an output event references the input events that contributed to it. Two distinct scenarios are described: (i) a single data traversal path; and (ii) multiple data traversal paths.
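For a single operator, the join between the output log and the input log can be sketched as follows, again using the Python standard library module sqlite3 as a stand-in log store. The schema is hypothetical and assumes the simplest referencing method, in which each output log row carries the ID of one contributing input event, so that an output event generated from several inputs occupies several rows.

```python
import sqlite3


def build_logs():
    """Create a small in-memory input log and output log for one operator."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE input_log  (event_id TEXT PRIMARY KEY, payload TEXT);
        CREATE TABLE output_log (event_id TEXT, input_id TEXT, payload TEXT);
    """)
    db.executemany(
        "INSERT INTO input_log VALUES (?, ?)",
        [("i1", "a"), ("i2", "b"), ("i3", "c")],
    )
    db.executemany(
        "INSERT INTO output_log VALUES (?, ?, ?)",
        [("r", "i1", "ab"), ("r", "i2", "ab"), ("s", "i3", "c")],
    )
    db.commit()
    return db


def reverse_query(db, output_event_id):
    """Return the input events that contributed to the given output event."""
    cursor = db.execute(
        """
        SELECT i.event_id, i.payload
        FROM output_log AS o
        JOIN input_log AS i ON o.input_id = i.event_id
        WHERE o.event_id = ?
        ORDER BY i.event_id
        """,
        (output_event_id,),
    )
    return cursor.fetchall()


lineage_of_r = reverse_query(build_logs(), "r")
```

With interval or window references, the equality in the ON clause would be replaced by a range predicate on offsets or by an equality on the window ID, while the overall shape of the query stays the same.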

[0077] For a single data path, query generation can be described in two steps. First, the single data path Φ = (out.op1, in.op1, ..., out.opN, in.opN) is taken as input, and a query Q is generated over the logs of operators "op1" to "opN". More specifically, the query takes the form Q = (I, O, project, join), where I refers to the input log of operator "opN", O refers to the output log of "op1", project contains the log fields that uniquely identify the output events in O and the input events in I, and join is the conjunction of the join predicates that link the logs of "op1" through "opN". Consider first a data path of length 1, that is, a path that runs from the output port "out" of operator OP1 to the input port "in" of OP1, so that Φ = (out.op1, in.op1). Assuming the log structure is consistent with the logs described above for the various rollback recovery protocols, the input and output logs of OP1 are denoted I1 and O1, respectively. The query for a single operator (OP1 in this case) is defined as follows. The form of the join expression depends on the method used to reference input events within an output event.

[0078]

[0079] Table 1

[0080] Given a data path $\Phi = (out.op_1, in.op_1, \ldots, out.op_n, in.op_n)$, let $Q_i(I_i, O_i, project, join)$ be the query associated with the subpath $\Phi_i = (out.op_i, in.op_i)$ of $\Phi$, and let $Q_i.join$ be the join clause of $Q_i$. The bridging query between two operators is defined as follows. Let $Q_i(I_i, O_i, project, join)$ and $Q_{i+1}(I_{i+1}, O_{i+1}, project, join)$ be the two queries associated with the subpaths $\Phi_i = (out.op_i, in.op_i)$ and $\Phi_{i+1} = (out.op_{i+1}, in.op_{i+1})$ of $\Phi$, respectively. The bridging query $Q_{i,i+1}(O_{i+1}, I_i, project, join)$ is as follows:

[0081]

[0082] Table 2

[0083] Then, in the query $Q(I_n, O_1, project, join)$ for the entire path $\Phi$, the join expression is constructed by combining and interleaving the join expressions of the queries for paths of length 1 with those of the bridging queries, as shown below:

[0084]

[0085] Table 3

[0086] Referring back to Figure 4A, consider an exemplary data path: Φ1 = (out.op3, in1.op3, out.op2, in2.op2, out1.op1, in.op1). Each individual query fragment is computed as follows. Each join expression below is indicated by a symbol representing the join operation, and the names of the joined logs are abbreviated. The queries below alternate between a query for an operator and a bridge query between two operators.

[0087]

[0088] Table 4

[0089] In this example, the final join expression of the query for Φ1 is the conjunction of the individual join expression fragments, i.e.: Q(Φ1).join = Q1.join and Q_{1,2}.join and Q2.join and Q_{3,2}.join and Q3.join.
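The interleaving of operator fragments and bridge fragments can be sketched as a small query builder. The two predicates below are assumptions for illustration: `Input_Event_ID` is a hypothetical column, the bridge predicate assumes that a downstream input log and the upstream output log record the same event identifier, and port identifiers are ignored for brevity.

```python
def operator_join(i):
    # Assumed predicate for the length-1 query of operator i
    # (hypothetical column name Input_Event_ID).
    return f"O{i}.Input_Event_ID = I{i}.Event_ID"


def bridge_join(i, j):
    # Assumed predicate for the bridging query between operator i and its
    # upstream neighbor j: the event logged in I_i is the event that
    # operator j logged in O_j.
    return f"I{i}.Event_ID = O{j}.Event_ID"


def path_join(operators):
    """Interleave operator and bridge predicates along a backward path."""
    parts = []
    for pos, op in enumerate(operators):
        parts.append(operator_join(op))
        if pos + 1 < len(operators):
            parts.append(bridge_join(op, operators[pos + 1]))
    return " AND ".join(parts)


print(path_join([1, 2, 3]))
```

For a path over three operators the builder emits five fragments (three operator fragments and two bridge fragments), which mirrors the structure of the join expression for Φ1.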

[0090] In various embodiments, a separate query generation process is used for multiple data traversal paths. For example, consider two data traversal paths Φ1 = (out.op_1, ..., in.op_n) and Φ2 = (out.op_1, ..., in.op_n) such that the two paths share a common operator `op`. That is, there is a subpath (out.op_i, in.op_i) in Φ1 and a subpath (out′.op_j, in′.op_j) in Φ2 such that op_i = op_j = op.

[0091] In this case, using the same method as before for a single path, two queries are first computed, one for each of the subpaths (out.op_1, ..., in.op_i) and (out.op_1, ..., in′.op_j). Note that the results of the two queries have exactly the same schema (the same fields), as defined by the `project` clause of each query. Therefore, with `I` denoting the input log of the operator `op`, the schema of each query is: `(I.Event_ID, I.Input_ID, O1.Event_ID, O1.Output_ID)`. Next, a duplicate-free set union is performed on the two query result sets projected onto the fields of the input log `I`. The result, denoted I_r, is used to compute a subset of the output logs, and this subset serves as the initial output log when constructing the queries associated with each of the remaining subpaths (out.op_{i+1}, ..., in.op_n) and (out.op_{j+1}, ..., in.op_n). Therefore, I_r is used to define the join expressions of the bridging query Q_{i,i+1} and the bridging query Q_{j,j+1} for the paths Φ1 and Φ2, respectively, as follows:

[0092]

[0093] Table 5

[0094]

[0095] Table 6

[0096] The bridging queries define the distinct initial output logs that must be considered when constructing the queries associated with each of the subpaths (out.op_{i+1}, ..., in.op_n) and (out.op_{j+1}, ..., in.op_n).
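The duplicate-free set union that produces I_r can be sketched as follows. The rows are invented sample data, and the helper name `project_on_input_log` is hypothetical; only the four-field result schema comes from the description.

```python
def project_on_input_log(rows):
    """Keep only the (I.Event_ID, I.Input_ID) fields of each result row."""
    return {(event_id, input_id) for event_id, input_id, _, _ in rows}


# Hypothetical results of the queries for the two subpaths; each row follows
# the schema (I.Event_ID, I.Input_ID, O1.Event_ID, O1.Output_ID).
result_path_1 = [("e7", "in", "o1", "out"), ("e8", "in", "o1", "out")]
result_path_2 = [("e8", "in", "o1", "out2"), ("e9", "in", "o1", "out2")]

# Set union removes the duplicate ("e8", "in") contributed by both paths.
i_r = project_on_input_log(result_path_1) | project_on_input_log(result_path_2)

print(sorted(i_r))
# [('e7', 'in'), ('e8', 'in'), ('e9', 'in')]
```

Because the projection drops the output-side fields before the union, an input event reached through both paths appears only once in I_r.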

[0097] The same method can be applied to any number of data traversal paths with common operators. The remaining subpaths are then evaluated using the same method, and the process continues until no subpaths remain. Referring back to Figure 4A, consider the two paths Φ1 and Φ2 as an example. The only common operator on these two paths is OP3. Therefore, what is needed is to compute the corresponding query for each path up to OP3. The query for Φ2 = (out.op3, in2.op3, out2.op1, in.op1) is computed in the same way as the query for Φ1 above, using the following query fragments:

[0098]

[0099] Table 7

[0100] Therefore, the final query for Φ2 has the join expression: Q(Φ2).join = Q1.join and Q_{3,1}.join and Q3.join. Next, the union of the two query results is computed to find the input events in I3 that are responsible for the event in O1. Since OP3 is the last operator on the data traversal paths, the data traversal processing is complete.

[0101] The exemplary data lineage processing described above retrieves input events identified by their corresponding unique identifiers. This enables highly efficient join operations on small tables. However, in some embodiments, the data associated with each event is stored in the output log of the operator that generated it. Therefore, in some embodiments, event data is accessed by performing supplementary joins with the output logs containing the corresponding generated events.

[0102] Several optimizations can be made in conjunction with the disclosed embodiments. For source operators that are not readers, if the sequence of generated events is stateful, an alternative option is to store the process state in persistent storage and recover that state in case of process failure. If the process is a reader and the external system can replay atomic read actions in a past observable state, it allows the output events of atomic actions to be logged and sent before the action is completed. The process only needs to retain a log of the atomic actions sent along with information related to the portion of the observed state. When the action is completed, the entire effect of the atomic action is logged. If a process failure occurs before the atomic action is completed, the process can recover its action log and continue with the unfinished action. Furthermore, the same technique can be applied if the source process is a reader and the source process accesses immutable state of an external system.

[0103] Stateless operators read and process input events to generate one or more output events. In this case, since the process has access to the associated event identifier and the identifier of the port receiving the event, it can also obtain the output port identifier of the sender's connection. Therefore, the process can associate this information with each generated output event. In this case, no input log entries are needed, and associated writes to the input log can be avoided. The corresponding background tasks and associated recovery steps for garbage collection are adjusted accordingly.

[0104] In various embodiments, logs can be merged for any connection between two operators. Given the availability of reliable centralized storage for logs, the output and input logs can be merged for a single connection between operators. In the merged log, for each output event, the sender port identifier and the receiver port identifier of the connection are stored together with a single status for the event, carrying a value of "complete" or "incomplete". Using the merged log, the above process functions as previously described. However, both the garbage collection and the recovery protocols are simplified because there is no need to exchange messages between processes. Instead, each process can access the shared log to determine when events can be removed from the log and which events must be replayed after recovery. For example, the status of a new event might initially be set to "incomplete" and then changed to "complete" after that event has been used to generate an output event on another connection. A background task simply removes the "complete" events.
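A merged log for one connection can be sketched as a small in-memory structure. The class and method names are hypothetical, and a plain dictionary stands in for the reliable centralized storage that the description assumes.

```python
from dataclasses import dataclass, field


@dataclass
class MergedLog:
    """Shared log for one connection (hypothetical illustrative structure)."""
    sender_port: str
    receiver_port: str
    events: dict = field(default_factory=dict)  # event_id -> status

    def append(self, event_id):
        # A new event starts out as "incomplete".
        self.events[event_id] = "incomplete"

    def mark_complete(self, event_id):
        # Called once the event has been used to generate an output event.
        self.events[event_id] = "complete"

    def collect_garbage(self):
        """Remove every "complete" event and return how many were removed."""
        done = [e for e, s in self.events.items() if s == "complete"]
        for event_id in done:
            del self.events[event_id]
        return len(done)

    def events_to_replay(self):
        """Events that must be replayed after a recovery."""
        return [e for e, s in self.events.items() if s == "incomplete"]


log = MergedLog("op1.out", "op2.in")
for event_id in ("e1", "e2", "e3"):
    log.append(event_id)
log.mark_complete("e1")
removed = log.collect_garbage()
print(removed, log.events_to_replay())
# 1 ['e2', 'e3']
```

Both the garbage collection and the replay decision read the same shared state, which is why no messages between the sender and the receiver are needed in this variant.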

[0105] Figure 7 depicts an exemplary flowchart illustrating the operation of a method according to various embodiments. In step 702, one or more input events are received from another process. In some embodiments, the input events are ingested by the source process through a read operation on an external input data source. In some embodiments, the external data source may be a data center, data mart, data lake, relational database server, or a database server that does not adhere to relational database principles. In these embodiments, if the source process is a reader that sends an atomic read action to an external system, the source process records the effect of the atomic action in its output log as "incomplete" before sending any of these output events to another process. If the recording fails to complete successfully, no event is recorded for the atomic read action. If the source process is not a reader and the data generation process is stateless, output events do not need to be recorded before the generated events are sent. When the reader process sends an atomic read action, the reader process records the effect of the read action in its input log before any of the corresponding events are used to generate an output event on any of the reader's output ports. If the recording fails to complete successfully, no event is recorded for the atomic read action.

[0106] Next, in step 704, information about one or more intermediate input events is logged to the intermediate operator input log associated with the intermediate operator, wherein the one or more intermediate input events are logged with an incomplete logging status. In some embodiments, when a process receives an event, the process logs events with an "incomplete" status in the process's input log before processing the event. Optional preprocessing stages may be provided for assigning input events to "groups" (e.g., windows) or for calculating incrementing statuses. If a "group" is assigned to an event, the "group" can be updated in the corresponding input log. Next, in step 706, the data associated with one or more intermediate input events is processed according to the operation associated with the operator.

[0107] Next, in step 708, one or more intermediate input log entries are updated by setting a completed logging status for the subset of the one or more intermediate input events that have been consumed to generate one or more intermediate output events. In some embodiments, associated sequence numbers are used to process input events sequentially. When an output event is generated for one or more output ports, the process uses an atomic transaction to record the associated output event in the output log as "incomplete" and to set the log status to "complete" for the input events in the input log that were consumed to generate that output event. When the writer process creates an atomic write action, the writer process records the atomic write action in the writer process's output log upon completion and then sends the atomic write action to the appropriate external system. After sending the atomic write action, if the action is successful, the status of the corresponding output event is set to "complete." Otherwise, if the action fails before completion, the process must undo the write action (unless it has already been committed by the appropriate external system) and attempt to execute the action again. Next, in step 710, one or more intermediate output events are transferred to one or more subsequent operators.
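The logging discipline of steps 704 to 708 can be sketched with an in-memory SQLite database standing in for the operator's logs. The table layout and function names are assumptions for illustration; the point shown is that the output event and the status change of its consumed input events are written in one atomic transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE input_log  (event_id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE output_log (event_id TEXT PRIMARY KEY, status TEXT);
""")


def log_input(event_id):
    # Step 704: record the received event as "incomplete" before processing.
    with conn:
        conn.execute(
            "INSERT INTO input_log VALUES (?, 'incomplete')", (event_id,)
        )


def emit_output(output_id, consumed_input_ids):
    # Step 708: one atomic transaction logs the output event as "incomplete"
    # and marks the consumed input events as "complete". The connection
    # context manager commits on success and rolls back on an exception.
    with conn:
        conn.execute(
            "INSERT INTO output_log VALUES (?, 'incomplete')", (output_id,)
        )
        conn.executemany(
            "UPDATE input_log SET status = 'complete' WHERE event_id = ?",
            [(i,) for i in consumed_input_ids],
        )


def statuses(table):
    query = f"SELECT event_id, status FROM {table} ORDER BY event_id"
    return dict(conn.execute(query).fetchall())


for event_id in ("e1", "e2", "e3"):
    log_input(event_id)
emit_output("o1", ["e1", "e2"])
print(statuses("input_log"), statuses("output_log"))
```

After the call, `e1` and `e2` are "complete", `e3` is still "incomplete", and `o1` is "incomplete" until the downstream side confirms it.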

[0108] Next, in step 712, background garbage collection is performed on the intermediate operator output log, wherein updated intermediate input log events that have been updated to reflect the completion status are removed from the intermediate operator output log. In some embodiments, the background asynchronous "garbage collection" task is performed as follows: The background task reads the input log, and if the status of an event is specified as "complete," the background task sends an "acknowledgment" event with an identifier corresponding to the completed event to the sender of the event. When a process receives an "acknowledgment" event for a specific event identifier from another process, the process performs the following operation: If the event identifier exists in the corresponding output log, the process sets the corresponding status to "complete"; otherwise, the process sends an "ignore" event with the same event identifier to the process that sent the "acknowledgment" event. In some embodiments, another background task reads the output log and performs the following operation: If the status of an event is "complete," the task sends an "ignore" event with the event identifier to the recipient of the event and deletes the event from its output log. When a process receives an "ignore" event for a specific event identifier, the process removes the event from the corresponding input log. Next, in test 714, it is determined whether a recovery message has been received from one or more subsequent operators, in which case the corresponding output events since the last completed log entry are to be transmitted. Finally, in step 716, the corresponding intermediate output events retained in the intermediate output log are retransmitted, as described above in connection with Figure 6.
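The "acknowledgment" and "ignore" exchange can be sketched as a toy simulation in which direct method calls stand in for messages between processes. The class and method names are hypothetical, and a real implementation would run the two background tasks asynchronously.

```python
class Process:
    """Toy operator process holding an input log and an output log."""

    def __init__(self, name):
        self.name = name
        self.input_log = {}   # event_id -> (status, sender process)
        self.output_log = {}  # event_id -> (status, receiver process)

    def gc_input_log(self):
        # Background task 1: acknowledge every "complete" input event.
        for event_id, (status, sender) in list(self.input_log.items()):
            if status == "complete":
                sender.on_acknowledgment(event_id, self)

    def on_acknowledgment(self, event_id, acker):
        if event_id in self.output_log:
            _, receiver = self.output_log[event_id]
            self.output_log[event_id] = ("complete", receiver)
        else:
            # Unknown event: tell the acknowledging process to drop it.
            acker.on_ignore(event_id)

    def gc_output_log(self):
        # Background task 2: drop "complete" output events, notify receiver.
        for event_id, (status, receiver) in list(self.output_log.items()):
            if status == "complete":
                receiver.on_ignore(event_id)
                del self.output_log[event_id]

    def on_ignore(self, event_id):
        self.input_log.pop(event_id, None)

    def on_recovery_message(self, requester):
        # Steps 714/716: events still retained in the output log are resent.
        return [e for e, (_, r) in self.output_log.items() if r is requester]


sender, receiver = Process("op1"), Process("op2")
sender.output_log = {"e1": ("incomplete", receiver), "e2": ("incomplete", receiver)}
receiver.input_log = {"e1": ("complete", sender), "e2": ("incomplete", sender)}

receiver.gc_input_log()   # e1 is acknowledged to the sender
sender.gc_output_log()    # sender drops e1 and tells the receiver to ignore it
print(list(sender.output_log), list(receiver.input_log))
# ['e2'] ['e2']
```

Only the event that the receiver has fully consumed disappears from both logs; `e2` stays in the sender's output log and would be resent if the receiver issued a recovery message.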

[0109] Example execution environment.

[0110] Pipeline engine implementations can be employed in various execution environments. In one embodiment, operators are implemented based on the runtime environment and corresponding libraries. For example, operators executing Python code require a Python interpreter and libraries. Runtime environment requirements are defined by the application container file and deployed in an application execution environment (such as a cloud-based serverless execution environment). Operator definitions, libraries, and application container files are stored in a repository. Tags are associated with operators and application container files, thus establishing one or more dependencies: all required tags must match one or more application container files to satisfy the associated dependencies.

[0111] During deployment, operators in the pipeline are translated into threads that run independently or within so-called sub-engine processes. Sub-engines can use their specific operators to interpret and execute portions of the graph. Sub-engines have associated predefined labels. When deploying the data pipeline, for each operator, the image compositor searches for one or more suitable application container files that match the operator's desired label. The image compositor then automatically groups the operators in such a way that each group of operators can be implemented by a single application container file. User-defined groups within the data pipeline remain unchanged, and the associated pipeline engine only checks for the existence of one or more matching application container files for the group. The resulting application container files are then built and deployed on a container execution environment (such as Kubernetes), with each group of operators assigned to a different container and pod. Control events that change the state of the graph are communicated via NATS using the publisher-subscriber paradigm. For example, when the graph needs to stop, a stop event is sent to all Kubernetes pods. Furthermore, when an error causes some pods to fail, all other pods belonging to the same pipeline are notified of the event, which triggers the graph to stop.
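The matching of operators to application container files can be sketched as a set-containment check. The file names, tags, and the naive grouping rule (take the first matching file in alphabetical order) are assumptions for illustration; the description does not specify how the grouping is chosen.

```python
def matching_container_files(required_tags, container_files):
    """Return names of container files whose tags cover all required tags."""
    return sorted(
        name
        for name, tags in container_files.items()
        if set(required_tags) <= set(tags)
    )


def group_operators(operators, container_files):
    """Naively assign each operator to its first matching container file."""
    groups = {}
    for op, tags in operators.items():
        candidates = matching_container_files(tags, container_files)
        if not candidates:
            raise LookupError(f"no container file satisfies {op}")
        groups.setdefault(candidates[0], []).append(op)
    return groups


# Hypothetical container files and operators, each described by its tags.
container_files = {
    "python-base": {"python3"},
    "python-ml": {"python3", "numpy"},
    "go-base": {"go"},
}
operators = {
    "reader": {"python3"},
    "scorer": {"python3", "numpy"},
    "writer": {"go"},
}

print(group_operators(operators, container_files))
# {'python-base': ['reader'], 'python-ml': ['scorer'], 'go-base': ['writer']}
```

A production grouping would likely try to minimize the number of groups (here `reader` could also share `python-ml` with `scorer`); the sketch only shows the dependency check that every required tag must be provided by the chosen file.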

[0112] Within each pod, there exists a group manager process responsible for managing the lifecycle of operators and sub-engines within its subgraph. During graph startup, connections between operators need to be established, operator initialization methods run, and finally, operators are started. The group manager process listens for stop events issued by the API server, and if one of its operators fails, the group manager process must issue stop events for the other pods. The group manager is also responsible for serializing and deserializing messages exchanged between different groups.

[0113] Data is transmitted from operator to operator in a generic message format, which can be refined by structured metadata. The transmission medium can be an in-process queue, or, depending on whether the message crosses a sub-engine boundary or a group boundary, other low-level communication primitives. In the latter case, when crossing a sub-engine boundary, the message is serialized and delivered via inter-process communication; or when crossing a group, the message is serialized and delivered using an internal messaging system built on top of the TCP protocol.
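The choice of transmission medium can be sketched as a function of where the two operators are placed. The `Placement` type and the returned labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    group: str       # operator group (one pod per group)
    sub_engine: str  # sub-engine process within the group


def choose_transport(sender, receiver):
    """Pick the transmission medium based on the boundaries a message crosses."""
    if sender.group != receiver.group:
        # Crossing a group boundary: serialized, internal TCP-based messaging.
        return "tcp-messaging"
    if sender.sub_engine != receiver.sub_engine:
        # Crossing a sub-engine boundary: serialized, inter-process communication.
        return "ipc"
    # Same sub-engine: in-process queue.
    return "in-process-queue"


a = Placement("group-1", "python")
b = Placement("group-1", "python")
c = Placement("group-1", "main")
d = Placement("group-2", "main")
print(choose_transport(a, b), choose_transport(a, c), choose_transport(c, d))
# in-process-queue ipc tcp-messaging
```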

[0114] The pipeline engine (API server) keeps track of the running graphs and stores metadata about these graphs in a database instance. The pipeline engine is a user application; that is, each user runs their own engine instance. Therefore, modifications to artifacts in the repository can be performed at the user's level (i.e., without exposing modifications to other users in the pipeline execution environment).

[0115] Various arrangements of the depicted components, as well as components not shown, are possible without departing from the scope of the appended claims. The embodiments of the invention described are intended to be illustrative rather than restrictive. Alternative embodiments will become apparent to readers of this disclosure after and because of reading it. Alternative methods for implementing the above embodiments can be accomplished without departing from the scope of the appended claims. Specific features and sub-combinations are useful and can be employed without reference to other features and sub-combinations, and are contemplated within the scope of the claims. Although the invention has been described with reference to embodiments shown in the accompanying drawings, it should be noted that equivalents and substitutions may be used herein without departing from the scope of the invention as set forth in the claims.

[0116] Having thus described various embodiments of the invention, what is claimed as new and desired to be protected by patent includes the appended claims.

Claims

1. One or more non-transitory computer-readable media storing computer-executable instructions that, when executed by a processor, perform a method for performing rollback recovery with data lineage capture on a data pipeline, the method comprising: receiving, at an intermediate operator, one or more input events ingested by a source operator through a read operation on an external input data source, wherein the source operator records, in its output log, the events sent to the intermediate operator with an incomplete logging status specification; logging information about one or more intermediate input events to an intermediate operator input log associated with the intermediate operator, wherein the one or more intermediate input events are logged with an incomplete logging status specification; processing data associated with the one or more intermediate input events; updating one or more intermediate input log entries by setting a completed logging status specification corresponding to a consumed subset of the one or more intermediate input events that have been consumed to generate one or more intermediate output events, wherein, when an output event is generated, the output event is recorded in an intermediate output log with an incomplete logging status specification; transmitting the one or more intermediate output events to one or more subsequent operators, and setting the status of the corresponding output events to complete; reading the input log by a background task, wherein, if an event has a status specified as complete, the background task sends an event with an identifier corresponding to the completed event to the source operator, and the source operator sets the corresponding log status to complete; and based on a recovery message received from the one or more subsequent operators, resending the corresponding intermediate output events from the intermediate output log.

2. The non-transitory computer-readable medium according to claim 1, further comprising: A background garbage collection is performed on the intermediate operator output log, wherein updated intermediate input log events that have been updated to reflect the completion status are removed from the intermediate operator input log, and completed output events are removed from the intermediate operator output log.

3. The non-transitory computer-readable medium according to claim 1, further comprising: establishing one or more starting points and one or more target points for data lineage analysis.

4. The non-transitory computer-readable medium according to claim 3, wherein, Updating one or more intermediate input log entries by setting one or more intermediate input log entries to a completed logging status specification includes setting one or more intermediate input log entries to a completed log retention status specification.

5. The non-transitory computer-readable medium according to claim 4, wherein, Performing background garbage collection on intermediate operator output logs includes retaining one or more intermediate input log entries with a completed log retention status specified.

6. The non-transitory computer-readable medium of claim 3, further comprising: inserting a monitoring agent operator into the data pipeline to separate output events that meet the association criteria from those that do not; and establishing one or more data lineage analysis start points downstream of the inserted monitoring agent operator.

7. The non-transitory computer-readable medium of claim 6, further comprising: traversing the intermediate operator input log and the intermediate operator output log to identify intermediate input and output values at intermediate operators between the one or more data lineage analysis start points and the one or more data lineage analysis target points, in order to determine the initial values of the one or more data lineage analysis target points.

8. A method for performing rollback recovery with data lineage capture on a data pipeline, the method comprising: receiving, at an intermediate operator, one or more input events ingested by a source operator through a read operation on an external input data source, wherein the source operator records, in its output log, the events sent to the intermediate operator with an incomplete logging status specification; logging information about one or more intermediate input events to an intermediate operator input log associated with the intermediate operator, wherein the one or more intermediate input events are logged with an incomplete logging status specification; processing data associated with the one or more intermediate input events; updating one or more intermediate input log entries by setting a completed logging status specification corresponding to a consumed subset of the one or more intermediate input events that have been consumed to generate one or more intermediate output events, wherein, when an output event is generated, the output event is recorded in an intermediate output log with an incomplete logging status specification; transmitting the one or more intermediate output events to one or more subsequent operators, and setting the status of the corresponding output events to complete; reading the input log by a background task, wherein, if an event has a status specified as complete, the background task sends an event with an identifier corresponding to the completed event to the source operator, and the source operator sets the corresponding log status to complete; and based on a recovery message received from the one or more subsequent operators, resending the corresponding intermediate output events from the intermediate output log.

9. The method according to claim 8, further comprising: A background garbage collection is performed on the intermediate operator output log, wherein updated intermediate input log events that have been updated to reflect the completion status are removed from the intermediate operator input log, and completed output events are removed from the intermediate operator output log.

10. The method according to claim 8, further comprising: establishing one or more starting points and one or more target points for data lineage analysis.

11. The method according to claim 10, wherein, Updating one or more intermediate input log entries by setting one or more intermediate input log entries to a completed logging status specification includes setting one or more intermediate input log entries to a completed log retention status specification.

12. The method according to claim 11, wherein, Performing background garbage collection on intermediate operator output logs includes retaining one or more intermediate input log entries with a completed log retention status specified.

13. The method according to claim 10, further comprising: inserting a monitoring agent operator into the data pipeline to separate output events that meet the association criteria from those that do not; and establishing one or more data lineage analysis start points downstream of the inserted monitoring agent operator.

14. The method according to claim 13, further comprising: traversing the intermediate operator input log and the intermediate operator output log to identify intermediate input and output values at intermediate operators between the one or more data lineage analysis start points and the one or more data lineage analysis target points, in order to determine the initial values of the one or more data lineage analysis target points.

15. A system comprising at least one processor and at least one non-volatile memory storing computer-executable instructions that, when executed by the processor, cause the system to perform actions comprising: receiving, at an intermediate operator, one or more input events ingested by a source operator through a read operation on an external input data source, wherein the source operator records, in its output log, the events sent to the intermediate operator with an incomplete logging status specification; logging information about one or more intermediate input events to an intermediate operator input log associated with the intermediate operator, wherein the one or more intermediate input events are logged with an incomplete logging status specification; processing data associated with the one or more intermediate input events; updating one or more intermediate input log entries by setting a completed logging status specification corresponding to a consumed subset of the one or more intermediate input events that have been consumed to generate one or more intermediate output events, wherein, when an output event is generated, the output event is recorded in an intermediate output log with an incomplete logging status specification; transmitting the one or more intermediate output events to one or more subsequent operators, and setting the status of the corresponding output events to complete; reading the input log by a background task, wherein, if an event has a status specified as complete, the background task sends an event with an identifier corresponding to the completed event to the source operator, and the source operator sets the corresponding log status to complete; performing background garbage collection on the intermediate operator output log, wherein updated intermediate input log events that have been updated to reflect the completion status are removed from the intermediate operator input log, and completed output events are removed from the intermediate operator output log; and based on a recovery message received from the one or more subsequent operators, resending the corresponding intermediate output events from the intermediate output log.

16. The system according to claim 15, wherein, The intermediate operator input log and the intermediate operator output log are combined to form an intermediate operator merge log.

17. The system according to claim 15, further comprising: establishing one or more starting points and one or more target points for data lineage analysis.

18. The system according to claim 17, wherein, Updating one or more intermediate input log entries by setting one or more intermediate input log entries to a completed logging status specification includes setting one or more intermediate input log entries to a completed log retention status specification.

19. The system according to claim 18, wherein, Performing background garbage collection on intermediate operator output logs includes retaining one or more intermediate input log entries with a completed log retention status specified.

20. The system according to claim 17, further comprising: inserting a monitoring agent operator into the data pipeline to separate output events that meet the association criteria from those that do not; and establishing one or more data lineage analysis start points downstream of the inserted monitoring agent operator.

Citation Information

Patent Citations

  • Method for guaranteeing processing of messages in a continuous processing system

    US7818757B1