End-to-end resumability of inter-region replication using new replication
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- ORACLE INT CORP
- Filing Date
- 2023-06-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing file system replication technologies struggle to restart replication efficiently after failures, interruptions, or infrastructure changes, particularly in cross-region setups. They lack effective checkpoint mechanisms and coordination between the source and target file systems, which can lead to corruption and wasted resources.
The method synchronizes resource cleanup operations in both the source and target file systems using distinct sets of states. It allows inter-region replication to restart seamlessly by tracking resource management and job ownership, ensuring atomic transactions, and using state machines and inter-region APIs to maintain consistency and resumability.
The approach enables efficient and reliable restart of cross-region file system replication. Resuming from the most recent common snapshot minimizes resource consumption and cost while preserving data integrity and consistency across regions.
Abstract
Description
Technical Field
Cross-Reference to Related Applications
[0001] This application claims priority to U.S. Non-Provisional Patent Application No. 18/332,462, entitled "END-TO-END RESTARTABILITY OF CROSS-REGION REPLICATION USING A NEW REPLICATION", filed on June 9, 2023, and U.S. Non-Provisional Patent Application No. 18/332,475, entitled "END-TO-END RESTARTABILITY OF CROSS-REGION REPLICATION USING A COMMON SNAPSHOT", filed on June 9, 2023. All of these U.S. Non-Provisional Patent Applications claim the benefit and priority under 35 U.S.C. 119(e) to U.S. Provisional Patent Application No. 63/352,992, filed on June 16, 2022, U.S. Provisional Patent Application No. 63/357,526, filed on June 30, 2022, U.S. Provisional Patent Application No. 63/412,243, filed on September 30, 2022, and U.S. Provisional Patent Application No. 63/378,486, filed on October 5, 2022. The disclosures of these applications are hereby incorporated by reference in their entirety for all purposes.
Field
[0002] The present disclosure generally relates to file systems. More particularly, but not by way of limitation, techniques for performing various types of restart operations for file storage replication between file systems within different cloud infrastructure regions are described.
Background Art
Background
[0003] Replication processes for disaster recovery may need to restart file system replication due to failures, interruptions, or infrastructure changes. It is important to restart file system replication properly and efficiently. Therefore, there is a need to improve the restartability of file system replication.
Summary of the Invention
Brief Summary
[0004] The present disclosure generally relates to file systems. More particularly, but not by way of limitation, techniques are described for performing various types of restart operations for file storage replication between file systems within different cloud infrastructure regions. Various embodiments are described herein, including methods, systems, programs, code, or instructions executable by one or more processors, and non-transitory computer-readable media storing the same.
Means for Solving the Problems
[0005] In one embodiment, a method is provided in which a computing system performs inter-region replication between a source file system and a target file system, where the source file system and the target file system are in different regions. The computing system receives a request to terminate the inter-region replication between the source file system and the target file system, and synchronizes operations within the source file system and operations within the target file system by using a first set of states and a second set of states. The operations within the source file system include performing resource cleanup within the source file system, and the operations within the target file system include performing resource cleanup within the target file system. After the resource cleanup within the source file system and the resource cleanup within the target file system, the computing system starts a new inter-region replication between the source file system and the target file system.
[0006] In yet another embodiment, the first set of states tracks the management and utilization of resources and is visible to the customer.
[0007] In yet another embodiment, the second set of states tracks the ownership of replication-related jobs for components of the source file system and the target file system and is invisible to the customer.
[0008] In yet another embodiment, performing resource cleanup within the source file system and the target file system uses the first set of states. It additionally uses a first subset of the second set of states when the request to end inter-region replication is initiated by the source file system, and a second subset of the second set of states when the request is initiated by the target file system.
[0009] In yet another embodiment, a request to end inter-region replication is initiated by the source file system, and resource cleanup within the source file system and resource cleanup within the target file system are performed simultaneously.
[0010] In yet another embodiment, a request to end inter-region replication is initiated by the target file system, and resource cleanup within the target file system is performed and completed before resource cleanup within the source file system is initiated.
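The two orderings described in paragraphs [0009] and [0010] can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `Initiator`, `cleanup`, and `terminate_and_restart` are hypothetical, and the cleanup itself is reduced to appending an entry to an event log.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum


class Initiator(Enum):
    SOURCE = "source"
    TARGET = "target"


def cleanup(side: str, log: list) -> None:
    """Stand-in for per-region resource cleanup (metadata, job queues, ...)."""
    log.append(f"{side}:cleanup-done")


def terminate_and_restart(initiator: Initiator) -> list:
    """Return the ordered event log for one terminate-then-restart cycle."""
    log: list = []
    if initiator is Initiator.SOURCE:
        # Source-initiated: both cleanups may run at the same time.
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(cleanup, side, log) for side in ("source", "target")]
            for future in futures:
                future.result()  # propagate any cleanup failure
    else:
        # Target-initiated: target cleanup completes before source cleanup starts.
        cleanup("target", log)
        cleanup("source", log)
    # A new replication starts only after both cleanups have completed.
    log.append("new-replication-started")
    return log


target_initiated = terminate_and_restart(Initiator.TARGET)
source_initiated = terminate_and_restart(Initiator.SOURCE)
```

In the target-initiated case the log order is fixed. In the source-initiated case the two cleanup entries may appear in either order, but the new replication is always logged last.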
[0011] In various embodiments, a system is provided that includes one or more data processors and a non-transitory computer-readable medium having instructions that, when executed on the one or more data processors, cause the one or more data processors to perform some or all of one or more of the methods disclosed herein.
[0012] In various embodiments, the non-transitory computer-readable medium stores computer-executable instructions that, when executed by one or more processors, cause one or more processors of a computer system to perform one or more of the methods disclosed herein.
[0013] In various embodiments, a computer program product includes computer programs/instructions that, when executed by a processor, cause the processor to perform any of the methods disclosed herein.
[0014] The techniques described above and below can be implemented in a number of ways and in a number of contexts. Several exemplary implementations and contexts are provided with reference to the following figures, which are described in more detail below. However, these implementations and contexts are only a few of many.
[0015] The features, embodiments, and advantages of the present disclosure will be better understood when the following detailed description is read with reference to the accompanying drawings.
Brief Description of the Drawings
[0016] Figures 1 to 5, Figures 6A and 6B, and Figures 7 to 27.
Detailed Description
[0017] In the following description, for purposes of explanation, specific details are set forth in order to provide a thorough understanding of an embodiment. It will be evident, however, that various embodiments may be practiced without these specific details. The figures and description are not intended to be restrictive. The word "exemplary" as used herein means "serving as an example, instance, or illustration." Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs.
[0018] When a problem occurs in a file system, the file system needs a way to restart its running jobs, resume from exactly the correct point, and ensure a deterministic result. Failures during replication can include, for example, system crashes, inability to obtain KMS keys, or the need for an upgrade. Because either the source file system or the target file system may fail during replication, the system may need to restart from either side.
[0019] Existing techniques without a checkpoint mechanism have no background cleanup, so they may have to wait until everything is complete and then clean up a huge amount of information. Even when a checkpoint mechanism exists, the checkpoint resides in the data plane and the control plane is not aware of it. Therefore, checkpoints alone are of limited use for restarting replication between regions.
[0020] An additional issue related to the restart process is the coordination between the source file system and the target file system because the source file system cannot be restarted until the target file system finishes the cleanup. Otherwise, there may be corruption in the file system.
[0021] A scale-out distributed system that includes machines and databases distributed across different geographical regions poses additional challenges due to network latency or congestion. Such a system may require a mechanism to guarantee atomic transactions between different regions in order to maintain consistency between databases when a failure or update occurs.
[0022] The techniques disclosed in this disclosure target two types of restart operations for existing inter-region replication processes: replication deletion and restart from a previous snapshot. Replication deletion terminates the current replication process, exits that process, and performs resource cleanup by cleaning up all data (e.g., metadata, checkpoint-related records, job/processing queues, etc.) in both the source file system and the target file system, after which a new inter-region replication can be started. This technique can be used when a permanent failure has occurred or the customer wishes to switch to a different region. Restart from a previous snapshot restarts (or resumes) the existing replication process from a previous common snapshot between the source file system and the target file system, without cleaning up all data in both file systems or terminating the replication process. This technique can be used when a recoverable failure event, such as a software problem, occurs or when the customer wishes to resume the replication process from a previous snapshot because of a problem with the current snapshot. Restart from a previous snapshot can proceed in the same data flow direction as the current replication or in the opposite direction. An example of restarting with the data flow reversed is a customer who wishes to use the original source file system again after inter-region replication between the source file system and the target file system.
[0023] The source region and the target region may each have their own database (e.g., a shared database called SDB) for communication between the data plane (DP) and the control plane (CP) within each region. There is no connection between these two databases, and the objects within them are independent. Each database can be of a different type, such as relational or NoSQL. The techniques disclosed in the present disclosure use the inter-region API and state machines within the control planes (CPs) of both the source region and the target region to track the replication processes within both regions and to confirm that they are synchronized. There is one state machine per region, and a state within one region can trigger a state transition within the other region. Sequence numbers within the database, together with reservation and distributed in-region lock mechanisms using new tables, help provide the required atomicity guarantees. Thus, the disclosed techniques help synchronize asynchronous operations, such as differential uploads and differential downloads, in both the source file system and the target file system by using two sets of states.
[0024] The disclosed techniques provide the additional advantage of isolating a failed job in order to identify the root cause and then resuming it without affecting other running jobs within the same region or the same file system. The disclosed techniques also support deterministic re-application and retry that guarantee the same result. This property is sometimes referred to as idempotency.
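As a rough illustration of one state machine per region, with sequence numbers used to make a retried cross-region request harmless, consider the following Python sketch. The state names (`ACTIVE`, `DELETING`, `CLEANED`), the class `RegionStateMachine`, and the function `cross_region_call` are assumptions made for this example; the text does not enumerate the actual states or API.

```python
from dataclasses import dataclass, field

# Hypothetical state names; the source text does not enumerate the states.
TRANSITIONS = {
    "ACTIVE": {"DELETING"},
    "DELETING": {"CLEANED"},
    "CLEANED": set(),
}


@dataclass
class RegionStateMachine:
    region: str
    state: str = "ACTIVE"
    last_seq: int = 0  # highest sequence number applied so far
    history: list = field(default_factory=list)

    def apply(self, seq: int, new_state: str) -> bool:
        """Apply a transition once; stale or duplicate requests are ignored."""
        if seq <= self.last_seq:
            return False  # already applied, e.g. a retried cross-region call
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.state, self.last_seq = new_state, seq
        self.history.append((seq, new_state))
        return True


def cross_region_call(peer: RegionStateMachine, seq: int, new_state: str) -> bool:
    """Stand-in for an inter-region API call that drives the peer's state."""
    return peer.apply(seq, new_state)


source = RegionStateMachine("source")
target = RegionStateMachine("target")

# A transition in the source region triggers the matching one in the target.
if source.apply(1, "DELETING"):
    cross_region_call(target, 1, "DELETING")

# A retry after a network timeout carries the same sequence number.
retried = cross_region_call(target, 1, "DELETING")
```

Because the retry reuses sequence number 1, the target ignores it and its history records the transition only once.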
[0025] Finally, instead of having to restart the entire inter-region replication from the beginning just because a snapshot was corrupted during the replication process, the disclosed technology can help customers save a lot of resources (e.g., bandwidth, computing power) and costs by restarting from the most recent common snapshot between the source file system and the target file system.
[0026] The terms source region and source file system can be used in the same sense when referring to the inter-region replication process because the replication process is executed by the source file system within the source region. Similarly, the terms target region and target file system can be used in the same sense when referring to the inter-region replication process because the replication process is executed by the target file system within the target region.
Explanation of terms in certain embodiments
[0027] In certain embodiments, "Recovery Time Objective" (RTO) refers to the length of time within which a user requires the replica to be available in the secondary (or target) region after a failure occurs within the availability domain (AD) of the primary (or source) region, regardless of whether the failure is planned or unplanned.
[0028] In certain embodiments, "Recovery Point Objective" (RPO) refers to the maximum tolerable data loss, expressed as a length of time, between a failure in the primary region (usually due to an unplanned outage) and the availability of the secondary region.
[0029] In certain embodiments, a "replicator" can refer to a component (e.g., a virtual machine (VM)) within the data plane of a file system that uploads differences to a remote object store (i.e., an object storage service) when the component is located in a source region, or downloads differences from the object storage for applying the differences when the component is located in a target region. The replicator is formed as a fleet (i.e., a plurality of VMs or replicator threads) called a replicator fleet and can execute the inter-region (or cross-region) replication process (e.g., uploading of differences to a target region) in parallel.
[0030] In certain embodiments, a "delta generator" (DG) can refer to a component within the data plane of a file system that extracts the differences (i.e., changes) between the keys and values of two snapshots when the component is located in a source region, or applies differences to the latest snapshot within the B-tree of the file system when the component is located in a target region. The delta generator in the source region can use multiple threads (referred to as delta generator threads or range threads for multiple split B-tree key ranges) to execute the extraction of differences (or B-tree scanning) in parallel. The delta generator in the target region can use multiple threads to apply the downloaded differences to the latest snapshot in parallel.
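A simplified Python sketch of both roles of the delta generator follows. It assumes that each snapshot can be modeled as a plain dictionary of keys and values and that key ranges are half-open string intervals; the real component scans a B-tree. The names `range_delta`, `generate_delta`, `apply_delta`, and the `DELETED` marker are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

DELETED = object()  # marker for a key that exists only in the older snapshot


def range_delta(old: dict, new: dict, lo: str, hi: str) -> dict:
    """Delta for keys in [lo, hi): added or changed values, plus deletions."""
    delta = {}
    for key in set(old) | set(new):
        if not (lo <= key < hi):
            continue
        if key not in new:
            delta[key] = DELETED
        elif key not in old or old[key] != new[key]:
            delta[key] = new[key]
    return delta


def generate_delta(old: dict, new: dict, ranges: list) -> dict:
    """Source side: one worker per key range, results merged afterwards."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda r: range_delta(old, new, r[0], r[1]), ranges))
    merged = {}
    for part in parts:
        merged.update(part)
    return merged


def apply_delta(snapshot: dict, delta: dict) -> dict:
    """Target side: apply the delta to the latest snapshot to get a new one."""
    result = dict(snapshot)
    for key, value in delta.items():
        if value is DELETED:
            result.pop(key, None)
        else:
            result[key] = value
    return result


snap1 = {"a": 1, "b": 2, "m": 3, "z": 4}
snap2 = {"a": 1, "b": 20, "n": 5, "z": 4}
delta = generate_delta(snap1, snap2, [("a", "m"), ("m", "zz")])
rebuilt = apply_delta(snap1, delta)
```

Only the changed key `b`, the removed key `m`, and the added key `n` appear in the delta, and applying it to the older snapshot reproduces the newer one.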
[0031] For the purposes of the present disclosure, in certain embodiments, a "shared database" (SDB) can refer to a key-value store that components (e.g., a replicator fleet) within both the control plane and the data plane of a file system can read from and write to in order to communicate with each other. In certain embodiments, the SDB can be part of a B-tree.
[0032] In some embodiments, a "File System Communicator" (FSC) may refer to the file manager layer that runs on storage nodes within the data plane of a file system. This service handles file creation, deletion, read, and write requests, and works with an NFS server (e.g., Orca) to service I/O to clients. The replicator fleet can communicate with multiple storage nodes, thereby distributing file system data read/write operations across the storage nodes.
[0033] In some embodiments, a "blob" may refer to a data type for storing information (e.g., a formatted binary file) in a database. Blobs are generated during replication by a source region and uploaded to an object store (i.e., object storage) within a target region. A blob may contain B-tree keys and values as well as file data. Blobs within an object store are called objects. The key-value pairs of the B-tree and the data associated with them are packed together into the blob that is uploaded to the object store within the target region.
[0034] In certain embodiments, a "manifest" may refer to information transmitted by a file system in a source region (referred to herein as the source file system) to a file system in a target region (referred to herein as the target file system) to facilitate an inter-region replication process. There are two types of manifest files: a master manifest and a checkpoint manifest. A range manifest file (or master manifest file) is created by the source file system at the start of the replication process and describes information (e.g., B-tree key ranges) required by the target file system. A checkpoint manifest file is created after a checkpoint in the source file system and notifies the target file system of the number of blobs included in the checkpoint and uploaded to the object store. In response, the target file system can then download that number of blobs.
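The role of the two manifest types can be sketched in Python as follows. The field names, the object naming scheme `ckpt-<id>/blob-<n>`, and the use of a dictionary as the object store are all assumptions for illustration; the text only states that the master manifest carries information such as B-tree key ranges and that the checkpoint manifest carries the number of uploaded blobs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MasterManifest:
    key_ranges: tuple  # B-tree key ranges the target file system needs


@dataclass(frozen=True)
class CheckpointManifest:
    checkpoint_id: int
    blob_count: int  # number of blobs uploaded for this checkpoint


def blobs_to_download(manifest: CheckpointManifest, object_store: dict) -> list:
    """Target side: list exactly the blobs announced by the checkpoint manifest."""
    names = [
        f"ckpt-{manifest.checkpoint_id}/blob-{i}"  # hypothetical naming scheme
        for i in range(manifest.blob_count)
    ]
    missing = [name for name in names if name not in object_store]
    if missing:
        raise LookupError(f"not yet uploaded: {missing}")
    return names


object_store = {"ckpt-7/blob-0": b"...", "ckpt-7/blob-1": b"..."}
master = MasterManifest(key_ranges=(("a", "m"), ("m", "zz")))
checkpoint = CheckpointManifest(checkpoint_id=7, blob_count=2)
names = blobs_to_download(checkpoint, object_store)
```

The target never guesses how many blobs belong to a checkpoint; it downloads the count stated in the manifest and treats a shortfall as an error.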
[0035] In certain embodiments, a "difference" may refer to the differences identified between two specific snapshots after a replicator has recursively visited all nodes of a B-tree (also referred to herein as scanning the B-tree). The delta generator identifies the B-tree key-value pairs that make up the differences, traverses the B-tree nodes, and retrieves the file data associated with the B-tree keys. The difference between two snapshots may include multiple blobs. The term "difference" may include blobs and manifests when used in the context of a source file system uploading information to an object store and a target file system downloading it from the object store.
[0036] An "object", in certain embodiments, may refer to a partial set of information representing the entire difference during an inter-region replication cycle, and is stored in an object store. An object may have a size of several megabytes and is stored at a specific location within a bucket of the object store. An object may contain many differences (i.e., blobs and manifests). A blob uploaded and stored in the object store is called an object.
[0037] A "bucket", in certain embodiments, may refer to a container that stores objects in compartments within an object storage namespace (tenancy). In the present disclosure, a bucket is used by a source replicator to store differences protected using server-side encryption (SSE), and is also used by a target replicator to download changes and apply them to a snapshot.
[0038] "Difference application", in certain embodiments, may refer to the process of applying differences downloaded by a target file system to the latest snapshot to create a new snapshot. Difference application may include analyzing a manifest file, applying snapshot metadata, inserting B-tree keys and values into a B-tree, and storing data associated with B-tree keys (i.e., file data or the data portion of a blob) in local storage. Snapshot metadata is created and applied at the start of a replication cycle.
[0039] A "region", in certain embodiments, may refer to a logical abstraction corresponding to a geographical area. Each region can include one or more connected data centers. A region is independent of other regions and can be separated by vast distances.
End-to-End Inter-Region Replication Architecture
[0040] An end-to-end inter-region replication architecture provides new techniques for end-to-end file storage replication and security between file systems within different cloud infrastructure regions. In certain embodiments, a file storage service generates differences between snapshots within a source file system and, for disaster recovery, transfers the differences and associated data via high-throughput object storage to recreate new snapshots in a target file system located in a different region. The file storage service uses new techniques to achieve scalable, reliable, and restartable end-to-end replication. New techniques for ensuring secure transfer and consistency of information during end-to-end replication are also described.
[0041] In the context of the cloud, a realm refers to a logical collection of one or more regions. Realms are typically separated from each other and do not share data. Within a region, the data centers within the region can be organized into one or more availability domains (ADs). Availability domains are separated from each other, are fault tolerant, and have a very low probability of failing simultaneously. An AD is configured such that a failure in one AD within a region is unlikely to affect the availability of other ADs within the same region.
[0042] Current disaster recovery practices can include taking periodic snapshots and resynchronizing those snapshots to another file system within a different availability domain (AD) or region. Resynchronization is manageable and maintained by the customer but lacks a user interface to display progress, is a slow serialized process, and is not easy to manage as data grows over time.
[0043] Therefore, different approaches are needed to address these and other challenges. The file storage replication of a cloud service provider (e.g., Oracle Cloud Infrastructure (OCI)) disclosed in this disclosure is based on incremental snapshots and provides a consistent point-in-time view of the entire file system by propagating the differences in changed data from the primary AD within a region to a secondary AD within the same or a different region. As used herein, the primary site (or source side) may refer to the location where the file system is located and where the replication process for disaster recovery is initiated (e.g., an AD or a region). The secondary site (or target side) may refer to the location where the file system receives information from the file system within the primary site during the replication process and becomes the new operational file system after disaster recovery (e.g., an AD or a region). The file system located at the primary site is called the source file system, and the file system located at the secondary site is called the target file system. Therefore, the primary site, source side, source region, primary file system, or source file system (referring to one of the file systems on the source side) may be used interchangeably. Similarly, the secondary site, target side, target region, secondary file system, or target file system (referring to one of the file systems on the target side) may be used interchangeably.
[0044] The file storage service (FSS) of the present disclosure supports complete disaster recovery for failover or failback with minimal management effort. Failover is a series of actions that make the secondary/target site the primary/source (i.e., it starts providing services for the workload), and may be planned or unplanned. A planned failover (sometimes called a planned migration) is initiated by the user to move from the source side (e.g., source region) to the target side (e.g., target region) without data loss. An unplanned failover is the case where, for example, due to a disaster, the source side stops unexpectedly and is lost, so the user needs to start using the target side. Failback restores the primary/source side to its role before the failover so that it becomes the primary/source again. Failback may occur when the user wants to reuse the source side as the primary AD by reversing the failover process, either after a planned failover or after an unplanned failover once the trigger event (e.g., a power outage) has ended. The user can resume either from the last point in time on the source side before the trigger event or from the latest changes on the target side. The replication process described in the present disclosure can maintain the identity of the file system after round-trip replication. In other words, the source file system can resume providing services for the workload after a failover followed by a failback.
[0045] The techniques disclosed in this disclosure (e.g., methods, computer-readable media, and systems) include region-to-region replication of file system data and/or metadata. They use consistent snapshot information to replicate differences between snapshots from a source region to multiple remote (or target) regions, and then scan (or recursively visit) all keys and values within one or more file trees (e.g., B-trees) of the source file system (sometimes referred to herein as "scanning the B-tree" or "scanning the keys") to construct consistent information (e.g., the differences or discrepancies between the keys and values of two snapshots created at different times). The constructed consistent information is in blob format and is transferred to the remote side (e.g., the target region) using an object interface, such as an object store (described later), so that the target file system on the remote side can immediately detect the information transferred through the object interface and start downloading and applying it. This process is realized using a control plane and can be scaled to thousands of file systems and hundreds of replication machines. Both the source file system and the target file system can operate simultaneously and asynchronously. Operating simultaneously means that the data upload process by the source file system and the data download process by the target file system can occur at the same time. Operating asynchronously means that the source file system and the target file system can each operate at their own pace without waiting for each other at every stage, e.g., with different start times, end times, and processing speeds.
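The simultaneous and asynchronous behavior described in paragraph [0045] resembles a producer and a consumer sharing a queue. The Python sketch below models the object store as an in-memory `queue.Queue`, which is an assumption made purely for illustration; the function names are hypothetical.

```python
import queue
import threading


def source_upload(store: queue.Queue, blobs: list) -> None:
    """Source side: upload blobs at its own pace."""
    for blob in blobs:
        store.put(blob)
    store.put(None)  # sentinel: the upload for this cycle is finished


def target_download(store: queue.Queue, applied: list) -> None:
    """Target side: download and apply blobs as soon as they appear."""
    while True:
        blob = store.get()
        if blob is None:
            break
        applied.append(blob)


store: queue.Queue = queue.Queue()
applied: list = []

downloader = threading.Thread(target=target_download, args=(store, applied))
uploader = threading.Thread(target=source_upload, args=(store, ["blob-0", "blob-1", "blob-2"]))

# The downloader may start before any data exists; neither side waits for
# the other to finish a whole stage before making progress.
downloader.start()
uploader.start()
uploader.join()
downloader.join()
```

Both threads run at the same time, and the target begins applying the first blob without waiting for the source to finish uploading the rest.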
[0046] In one embodiment, multiple file systems may exist in the same region and are represented by the same B-tree. Each of these file systems within the same region can be replicated independently across regions. For example, file system A may have a set of parallel execution replicator threads that scan the B-tree to perform replication of file system A. File system B, represented by the same B-tree, may have another set of such parallel execution replicator threads that scan the same B-tree to perform replication of file system B.
[0047] Cross-region replication is designed to be secure: information is transferred securely and applied securely. The disclosed technology provides separation between the source region and the target region such that keys are not shared between the two without being encrypted. Thus, if the source key is compromised, the target is not affected. Further, the disclosed technology includes ways to read keys, convert those keys into a certain format, and upload and download those keys securely. Since different keys are created and used in different regions, separate keys are created at the target and applied to the information with a target-centric security mechanism. For example, FSS generates a session key that is only valid during one replication cycle or session to encrypt data uploaded from the source region to the object store and decrypt data downloaded from the object store to the target region. Separate keys are used locally within the source region and the target region.
[0048] In the disclosed technology, each upload process and download process through the object store during replication has different pipeline stages. For example, the upload process has multiple pipeline stages including scanning a B-tree to generate differences, accessing storage I/O, and uploading data (or blobs) to the object store. The download process has multiple pipeline stages including downloading data, applying differences to a snapshot, and storing data in storage. Each of these pipelines also includes parallel processing threads to improve the throughput and performance of the replication process. Further, a parallel processing thread can take over from a failed processing thread and resume the replication process from the point of failure without restarting from the beginning. Thus, the replication process is highly scalable and reliable.
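Taking over a failed worker and resuming from the point of failure can be sketched in Python with a per-range progress record. The function `process_range`, the `checkpoints` dictionary, and the `fail_at` parameter used to simulate a crash are hypothetical; a real system would persist progress durably rather than in memory.

```python
def process_range(range_id: str, items: list, checkpoints: dict,
                  uploaded: list, fail_at=None) -> None:
    """Upload the items of one key range, recording progress after each item."""
    start = checkpoints.get(range_id, 0)  # resume where the last worker stopped
    for index in range(start, len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError(f"worker failed on {range_id} at item {index}")
        uploaded.append(items[index])
        checkpoints[range_id] = index + 1  # progress survives the worker


items = ["k1", "k2", "k3", "k4"]
checkpoints: dict = {}
uploaded: list = []

# The first worker fails after uploading two items.
try:
    process_range("r0", items, checkpoints, uploaded, fail_at=2)
except RuntimeError:
    pass
progress_after_failure = checkpoints["r0"]

# A second worker takes over the same range and continues from item 2.
process_range("r0", items, checkpoints, uploaded)
```

Each item is uploaded exactly once even though two workers handled the range, because the second worker starts at the recorded progress instead of at the beginning.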
[0049] Figure 1 shows an exemplary concept of the recovery point objective (RPO) and recovery time objective (RTO) for an unplanned failover according to an embodiment. The RPO is the maximum tolerable data loss between a primary site failure and the availability of a secondary site (usually specified in minutes). As shown in Figure 1, the primary site A102 encounters an unplanned incident at time 110 and triggers the failover replication process by copying the latest snapshot and its delta to the secondary site B104. The first copied information reaches the secondary site B104 at time 112. The primary site A102 completes the copy of information to the secondary site B104 at time 114, and the secondary site B104 completes the replication process at time 116. Thus, the secondary site B104 becomes fully operational at time 116. As a result, the user's data is not accessible within the primary site A102 from time 110 until time 116, when the data becomes available again. Thus, the RPO is the time between time 110 and time 116. For example, if the user can tolerate losing 10 minutes of data, the RPO is 10 minutes. If the data loss exceeds 10 minutes, the RPO is not met. An RPO of 0 means synchronous replication.
[0050] RTO is the time it takes for the secondary to become fully operational after a failure, so that the user can access the data again (usually specified in minutes). RTO is considered from the perspective of the secondary site. Referring again to Figure 1, the primary site A102 starts the failover replication process at time 120. However, the secondary site B104 remains operational until time 122 when it recognizes the incident (or power outage) at the primary site A102. Therefore, the secondary site B104 stops its service at time 122. The secondary site B104 becomes fully operational at time 126 using the same failover replication process as described for RPO. Therefore, RTO is the time between 122 and 126. Here, the secondary site B104 becomes able to take over the role of the primary site. However, for customers using the primary site A102, the loss of service is between times 120 and 126.
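The intervals described for Figure 1 can be worked through with concrete numbers. In the Python sketch below, all timestamps and objectives are invented for illustration (the reference numerals 110 to 126 in the text are figure labels, not clock times), and the helper `elapsed` is hypothetical.

```python
from datetime import datetime, timedelta


def elapsed(start: datetime, end: datetime) -> timedelta:
    """Length of the interval between two events on the timeline."""
    if end < start:
        raise ValueError("end must not precede start")
    return end - start


# RPO timeline: incident at the primary (time 110) until the secondary is
# fully operational (time 116). Timestamps are made up for this example.
incident = datetime(2023, 6, 9, 12, 0)
secondary_operational = datetime(2023, 6, 9, 12, 8)
rpo_window = elapsed(incident, secondary_operational)

# RTO timeline: secondary stops its service (time 122) until it is fully
# operational (time 126).
service_stopped = datetime(2023, 6, 9, 14, 2)
service_restored = datetime(2023, 6, 9, 14, 8)
rto_window = elapsed(service_stopped, service_restored)

rpo_met = rpo_window <= timedelta(minutes=10)  # objective: 10 minutes
rto_met = rto_window <= timedelta(minutes=5)   # objective: 5 minutes
```

With these assumed values the 8-minute window satisfies a 10-minute RPO, while the 6-minute recovery misses a 5-minute RTO.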
[0051] The primary (or source) site is where the action is taking place, and the secondary (or target) site is inactive and cannot be used until a disaster occurs. However, the customer may be provided with a point in time to continue using for test-related activities at the secondary site. This relates to how the customer sets up replication, how the customer can start using the target if any problems occur, and how the customer can return to the source after the source has failed over.
[0052] Figure 2 is a simplified block diagram showing an architecture for inter-region remote replication according to an embodiment. In FIG. 2, the end-to-end replication architecture shown includes two regions: a source region 290 and a target region 292. Each region may include one or more file systems. In one embodiment, the end-to-end replication architecture includes data planes 202 and 212, a control plane (only control APIs 208a-n and 218a-n are shown), local storages 204 and 214, an object store 260, and a key management service (KMS) 250 for both the source region 290 and the target region 292. FIG. 2 shows only one file system 280 in the source region 290 and one file system 282 in the target region 292 for simplicity. If there are two or more file systems in one region, the same replication architecture is applied to each pair of the source file system and the target file system. In one embodiment, by using parallel processing threads, multiple inter-region replications can occur simultaneously between each pair of the source file system and the target file system. In some embodiments, one source file system can be replicated to different target file systems located in the same target region. Further, file systems within a region may share resources. For example, some resources in the KMS 250, the object store 260, and the data plane may be shared by many file systems within the same region depending on the implementation.
[0053] The data plane within the architecture includes local storage nodes 204a - n and 214a - n, as well as replicators (or replicator fleets) 206a - n and 216a - n. The control API hosts within each region perform all orchestration between different regions. The FSS receives a request from a customer to set up replication between a source file system 280 and a target file system 282 where the customer's data will be moved. The control plane 208 obtains the request, performs resource allocation, and notifies the replicator fleet 206a - n within the source data plane 202 to start uploading data 230a from different snapshots to the object storage 260 (or sometimes called uploading the differences). An API is available to assist the customer in setting the target time and recovery time objective (RTO) for the replication. The replication model disclosed in this disclosure is a "push - based" model based on snapshot differences, that is, the source region initiates the replication.
[0054] As used herein, the data 230a and 230b transferred between the source file system 280 and the target file system 282 are general terms and may include an initial snapshot, keys and values of different B - trees between two snapshots, file data (e.g., fmap), snapshot metadata (i.e., a set of B - tree keys of snapshots reflecting different snapshots taken within the source file system), and other information (e.g., manifest files) that helps facilitate the replication process.
[0055] Regarding the data plane of the inter-region replication architecture, the replicator is a component within the data plane of the file system. The replicator performs differential generation or differential application on the file system according to the region where the file system is located. For example, the replicator fleet 206 within the file system 280 of the source region performs the generation and replication of the difference 230a. The replicator fleet 216 within the file system 282 of the target region downloads the differences 230b and applies them to the latest snapshot within the file system 282 of the target region. The file system 282 of the target region can also use the control plane and workflow to guarantee end-to-end transfer.
[0056] All incremental operations are based on snapshots, which are existing resources within file storage as a service. A snapshot is a point in time, data point, or image of what is happening within the file system and is executed periodically within the file system 280 of the source region. For the very first replication (e.g., no replication has been obtained so far), FSS acquires a base snapshot, which is a snapshot of all the contents of the source file system, and transfers all of that content to the target system. In other words, the replicator reads from the storage layer of that specific file system and stores all the data in the object storage bucket.
[0057] After the data plane 202 of the source file system 280 uploads all the data 230a to the object storage (or object store) 260, the source-side control plane 208 notifies the target-side control plane 218 that there is new work to be done on the target side, and then this notification is relayed to the target-side replicator. Thereafter, the target-side replicators 216a~n start downloading objects (e.g., initial snapshots and deltas) from the object storage bucket 260 and applying the deltas captured on the source side.
[0058] In the case of a base copy (e.g., the entire contents of the file system up to a point in time ranging from the past 5 days to 5 years), the upload process may take time. To assist in meeting service level goals regarding time and performance, the source system 280 can take replication snapshots at specific intervals such as one hour. The source side 280 can then transfer all the data within that one hour to the target side 282 and take new snapshots every hour. If there is any cache with many changes, the replication can be set to a shorter replication interval.
[0059] To illustrate the above, consider the situation where a first snapshot is created on the file system within the source region (referred to as the source file system). Replication is performed periodically, and thus the first snapshot is replicated to the file system within the target region (referred to as the target file system). Thereafter, when some update is performed within the source file system, a second snapshot is created. If an unplanned power outage occurs after the second snapshot is created, the source file system attempts to replicate the second snapshot to the target file system. During failover, the source file system may well identify the difference (i.e., the delta) between the first snapshot and the second snapshot, which includes the keys and values of the B-tree and the file data associated therewith within the B-tree representing both the first snapshot and the second snapshot. Next, deltas 230a and 230b are transferred from the source file system to the target file system via the object store 260 within the target region, and the target file system recreates the second snapshot by applying the deltas to the first snapshot previously established within the target region. When the second snapshot is created in the target file system, the failover replication process is complete and the target file system is ready to operate.
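As a non-limiting illustration of the example above, the following Python sketch computes a delta between two snapshots and applies it to a copy of the first snapshot. It assumes, for simplicity, that a snapshot can be modeled as a flat mapping from B-tree keys to values. The names `Snapshot`, `Delta`, `compute_delta`, and `apply_delta` are hypothetical and do not appear in the embodiments.

```python
from typing import Dict, Optional

Snapshot = Dict[str, bytes]         # hypothetical model: B-tree key -> value
Delta = Dict[str, Optional[bytes]]  # None marks a key absent from the newer snapshot


def compute_delta(old: Snapshot, new: Snapshot) -> Delta:
    """Collect keys that were added, changed, or removed between two snapshots."""
    delta: Delta = {}
    for key, value in new.items():
        if old.get(key) != value:   # new key, or same key with a new value
            delta[key] = value
    for key in old:
        if key not in new:          # key removed after the first snapshot
            delta[key] = None
    return delta


def apply_delta(base: Snapshot, delta: Delta) -> Snapshot:
    """Recreate the newer snapshot on the target from the older one plus the delta."""
    result = dict(base)
    for key, value in delta.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result
```

Applying `compute_delta(first, second)` to a target that already holds `first` yields `second`, which corresponds to the target recreating the second snapshot.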
[0060] Regarding the control plane and its application programming interfaces (APIs), the control plane issues instructions to the data plane, whose replicators act as executors of those instructions. The storage (204 and 214) and the replicator fleets (206 and 216) are both within the data plane. The control plane is not shown in Figure 2. As used herein, a "cycle" may refer to a period that starts when the source file system 280 begins to transfer data 230a to the target file system 282 and ends when the target file system 282 has received all the data 230b and completed the application of the received data. The data 230a - b is captured on the source side and then applied on the target side. When all changes for the cycle have been applied on the target side, the source file system 280 takes another snapshot and starts another cycle.
[0061] The control APIs (208a - n and 218a - n) are a set of hosts within the overall architecture of the control plane and execute the configuration of the file system. The control APIs are responsible for communicating state information between different regions. State machines that track various state activities within a region, such as the progress of a job, the location of keys, and future tasks to be executed, are distributed across multiple regions. All this information is stored in the control plane of each region and communicated between regions via the control APIs. In other words, the state information relates to the details of the life cycle, the details of the differences, and the life cycle of the resources. The state machine can also help track the progress of replication and cooperate with the data plane to estimate the time taken for replication. Therefore, the state machine can provide the user with a status regarding whether the replication is proceeding as expected and the normality of the job.
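As a non-limiting illustration, a state machine that tracks the progress of one replication cycle may be sketched as follows. The state names other than DONE, the transition table `ALLOWED`, and the class `ReplicationCycle` are assumptions made for this sketch and are not taken from the embodiments.

```python
from enum import Enum


class CycleState(Enum):
    IDLE = "IDLE"
    SNAPSHOTTING = "SNAPSHOTTING"
    UPLOADING = "UPLOADING"
    APPLYING = "APPLYING"
    DONE = "DONE"


# Hypothetical transition table: each state lists the states it may move to.
ALLOWED = {
    CycleState.IDLE: {CycleState.SNAPSHOTTING},
    CycleState.SNAPSHOTTING: {CycleState.UPLOADING},
    CycleState.UPLOADING: {CycleState.APPLYING},
    CycleState.APPLYING: {CycleState.DONE},
    CycleState.DONE: {CycleState.SNAPSHOTTING},  # a new cycle starts only after DONE
}


class ReplicationCycle:
    def __init__(self) -> None:
        self.state = CycleState.IDLE
        self.history = [CycleState.IDLE]

    def advance(self, new_state: CycleState) -> None:
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        self.history.append(new_state)

    def progress(self) -> float:
        """Rough fraction of the cycle completed, usable for status reporting."""
        order = list(CycleState)
        return order.index(self.state) / (len(order) - 1)
```

Rejecting illegal transitions is one way to make the recorded state trustworthy when it is communicated between regions. Other designs are possible.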
[0062] Furthermore, communication between the control APIs (208a - n) of the source file system 280 and the control APIs (218a - n) of the target file system 282 in different regions includes the transfer of snapshots and metadata for creating an exact copy from the source to the target. For example, when a customer periodically takes snapshots within the source file system, the control plane can ensure that snapshots of the same user, including metadata tracking, transfer, and recreation, are created in the target file system.
[0063] The object store 260 in FIG. 2 (also referred to as an "object" in this specification) is an object storage service (e.g., Oracle's object storage service) that enables reading blobs and writing files for archival purposes. Using an object store has three advantages: first, it is easy to configure; second, it is easy to stream data to; and third, it serves as a reliable repository for securely streamed information, because there is no network loss, the data can be downloaded immediately, and the data exists permanently. Direct communication between replicators within the source region and the target region is possible, but it requires the configuration of an inter-region network, which is not scalable and is difficult to manage.
[0064] For example, if there is a large amount of data being moved from a source to a target, the source can upload the data to the object store 260, and the target 282 does not need to wait for all the information uploaded to the object store 260 to start downloading. Thus, both the source 280 and the target 282 can operate continuously and simultaneously. The use of the object store enables the system to scale and achieve higher throughput. Further, the Key Management Service (KMS) 250 can control access to the object store 260 to ensure security. In other words, the source tries to move the data out of the source region as fast as possible and hold the data somewhere so that it is not lost before the data can be applied to the target.
[0065] Compared to using a network pipe with packet loss and recovery issues, the use of the object store 260 between the source region and the target region enables continuous data streaming where hundreds of file systems can be written from the source region to the object store, and at the same time, the target region can apply hundreds of files simultaneously. Thus, data streaming via the object store can achieve high throughput. Further, both the source region and the target region can operate at their own speeds for uploading and downloading.
[0066] Whenever a user changes some data in the source file system 280, a snapshot is taken and the difference before and after the change is updated. These changes are accumulated in the source file system 280 and can be streamed to the object store 260. The target file system 282 can detect that the data is available in the object store 260 and immediately download the changes and apply them to that file system. In some embodiments, only the differences are uploaded to the object storage after the base snapshot.
[0067] In some embodiments, the replicator can communicate with many different regions (e.g., from Phoenix to Ashburn and further to other remote regions), and the file system can manage many different endpoints on the replicator. Each replicator 206 within the source file system 280 can maintain a cache of these object storage endpoints and, in further cooperation with the KMS 250, generate a transfer key (e.g., a session key) for encrypting the data addresses of the data within the object storage 260 (e.g., server-side encryption or SSE) to protect the data stored in the bucket. There is one master bucket for each AD within the target region. A bucket is a container that stores objects in compartments within the object storage namespace (tenancy). Since all remote clients can communicate with the bucket and write information in a specific format, the information of each file system can be uniquely identified, preventing the mixing of data from different customers or file systems.
[0068] The object store 260 is a high-throughput system, and the techniques disclosed in this disclosure can utilize the object store. In one embodiment, the replication process includes multiple pipeline stages, a B-tree scan within the source file system 280, storage I / O access, data upload to the object store 260, data download from the object store 260, and differential application within the target file system 282. Each stage includes parallel processing threads that participate in improving the performance of data streaming from the source region 290 to the target region 292 via the object store 260.
[0069] In one embodiment, each file system within the source region may include a set of replicator threads 206a - n that are executed in parallel to upload the differences to the object store 260. Each file system within the target region may also include a set of replicator threads 216a - n that are executed in parallel to download the differences from the object store 260. Since both the source side and the target side operate asynchronously simultaneously, the source can upload as fast as possible, while the target can start downloading after detecting that the differences are available in the object store. Thereafter, the target file system applies the differences to the latest snapshot and deletes the differences in the object store after the application. Thus, the FSS consumes very little space in the object store, and the object store has a very high throughput (e.g., gigabyte - scale transfers).
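As a non-limiting illustration of the asynchronous upload and download described above, the following Python sketch uses an in-memory stand-in for the object store. The class `ObjectStore` and the functions `upload`, `download_and_apply`, and `replicate` are hypothetical. For brevity, the sketch uses several uploader threads but a single downloader thread, and "applying" a difference is reduced to storing it in a dictionary.

```python
import threading
import time


class ObjectStore:
    """In-memory stand-in for an object storage bucket (hypothetical)."""

    def __init__(self) -> None:
        self._objects = {}
        self._lock = threading.Lock()

    def put(self, name, data) -> None:
        with self._lock:
            self._objects[name] = data

    def names(self):
        with self._lock:
            return list(self._objects)

    def get(self, name):
        with self._lock:
            return self._objects[name]

    def delete(self, name) -> None:
        with self._lock:
            del self._objects[name]

    def __len__(self) -> int:
        with self._lock:
            return len(self._objects)


def upload(store, deltas) -> None:
    """Source side: push differences as fast as possible."""
    for name, data in deltas:
        store.put(name, data)


def download_and_apply(store, target, uploads_done) -> None:
    """Target side: apply differences as soon as they appear, then delete them."""
    while True:
        finished = uploads_done.is_set()  # read the flag BEFORE listing objects
        names = store.names()
        if not names:
            if finished:
                return                    # nothing left and no more uploads
            time.sleep(0.001)
            continue
        for name in names:
            target[name] = store.get(name)  # "apply" the difference
            store.delete(name)              # free space after application


def replicate(deltas, num_uploaders=4):
    store, target, uploads_done = ObjectStore(), {}, threading.Event()
    downloader = threading.Thread(
        target=download_and_apply, args=(store, target, uploads_done))
    uploaders = [
        threading.Thread(target=upload, args=(store, deltas[i::num_uploaders]))
        for i in range(num_uploaders)
    ]
    downloader.start()
    for t in uploaders:
        t.start()
    for t in uploaders:
        t.join()
    uploads_done.set()
    downloader.join()
    return target, store
```

Because each difference is deleted after it is applied, the stand-in store is empty when `replicate` returns, mirroring the small object store footprint described above.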
[0070] In one embodiment, multiple threads are also executed in parallel for storage I / O access (e.g., DASD) 204a - n and 214a - n. Thus, all processes related to the replication process, including accessing storage, uploading snapshots and data 230a from the source file system 280 to the object store 260, and downloading snapshots and data 230b to the target file system 282, include multiple threads that are executed in parallel to perform data streaming.
[0071] File storage is a local service of AD. When a file system is created, that file system is within a specific AD. When a customer transfers or replicates data from one file system to another file system within the same region or a different region, artifact (also called manifest) transfer may be required.
[0072] As an alternative to using an object store to transfer data, a network connection between remote machines (e.g., between source and target replicator nodes) can be set up and VCN peering can be used to enable Classless Inter-Domain Routing (CIDR) on a per-region basis.
[0073] Referring again to FIG. 2, the Key Management Service (KMS) 250 provides security for replication and provides storage services to a cloud service provider (e.g., OCI). In some embodiments, the file system 280 on the source (or primary) side and the file system 282 on the target (or secondary) side use separate KMS keys, and key management is hierarchical. The reason for using separate keys is that, if the source is compromised, an attacker cannot decrypt the target using the same key. The FSS has a three-tier key architecture. Since the source and target use different keys, during data transfer the source must first decrypt the data and re-encrypt it using an intermediate key, and the data is then re-encrypted on the target side. The FSS defines a session, and each session is one data cycle. A key for transferring data in that session is created. In other words, a new key is used for each new session. In other embodiments, a key can be used for two or more sessions (e.g., two or more data transfers) before another key is created. The key is not transferred via the object store 260; it is available only on the source side and is not visible from outside the source for security reasons.
[0074] The replication cycle (also called a session) is periodic and adjustable. For example, the replicators (206a~n and 216a~n) execute replication once every hour. The cycle starts when a new snapshot is created on the source side 280 and ends when all the differences 230b have been applied to the target side 282 (i.e., the target has reached the DONE state). Each session is completed before another session starts. Therefore, there is always only one session and no overlap between sessions.
[0075] Secret management (i.e., replication using the KMS) handles the transfer of confidential materials between the source (or primary) file system 280 and the target (or secondary) file system 282 using the KMS 250. The source file system 280 calculates the differences, reads the file data, and then decrypts the file data in cooperation with the key management service using the encryption key of the local file system. Next, the source file system 280 generates a session key (referred to as a delta encryption key (DEK)), encrypts it to become an encrypted session key (referred to as a delta transfer key (DTK)), and transfers the DTK to the target file system 282 via the respective control planes 208 and 218. The source file system 280 further encrypts the data 230a using the DEK and uploads the encrypted data 230a to the object store 260 via the Transport Layer Security (TLS) protocol. Next, the object store 260 uses server-side encryption (SSE) to ensure the security of the stored data (e.g., differences, manifests, and metadata) 230a.
[0076] The target file system 282 securely obtains the encrypted session key (DTK) via the control plane 218 (using HTTPS via inter-region API communication), decrypts the DTK via the KMS 250 to obtain the DEK, and places the DEK at a location within the target region 292. When a replication job is scheduled within the target file system 282, the DEK is provided to a replicator (one of the replicator fleet 216a - n), and the replicator uses this key to decrypt the data (e.g., a delta including file data) 230b downloaded from the object store 260 for application and re-encrypts the file data using the local file system key.
[0077] Replication between the source file system 280 and the target file system 282 is a parallel process, and both the source file system 280 and the target file system 282 operate at their own paces. When the source side completes an upload (which can occur before the target download process), the source side cleans up its memory and removes all keys. When the target completes the application of deltas to the latest snapshot, it similarly cleans up its memory and removes all keys. The FSS service also releases the KMS key. In other words, there are two copies of the session key, one within the source file system 280 and another within the target file system 282. Both copies are deleted at the end of each session, and a new session key is generated for the next replication cycle. This process ensures that the same key is not used for different purposes. Further, the session key is encrypted by the file system key, creating a two-fold protection. This is to ensure that only a specific file system can use this session key.
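As a non-limiting illustration of the session key handling described above, the following Python sketch walks through one session: generating a DEK, wrapping it into a DTK, encrypting the data with the DEK, and recovering both on the target side. The class `ToyKMS`, the function `run_session`, and the helper `_keystream_xor` are hypothetical. The cipher is a toy stand-in built from SHA-256 and is NOT secure. It is used only so that the sketch runs with the standard library, and a single KMS object is assumed for both sides.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (SHA-256 keystream XOR). NOT secure; it only
    stands in for real encryption so the key flow can be shown end to end."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Hypothetical KMS stand-in: the wrapping key never leaves this object."""

    def __init__(self) -> None:
        self._wrapping_key = secrets.token_bytes(32)

    def wrap(self, dek: bytes) -> bytes:
        return _keystream_xor(self._wrapping_key, dek)   # DEK -> DTK

    def unwrap(self, dtk: bytes) -> bytes:
        return _keystream_xor(self._wrapping_key, dtk)   # DTK -> DEK


def run_session(kms: ToyKMS, delta: bytes) -> dict:
    # Source side: a new session key (DEK) for this replication cycle.
    dek = secrets.token_bytes(32)
    dtk = kms.wrap(dek)                       # sent via the control planes
    uploaded = _keystream_xor(dek, delta)     # sent via the object store
    del dek                                   # source discards its copy

    # Target side: recover the DEK from the DTK, then decrypt the delta.
    target_dek = kms.unwrap(dtk)
    applied = _keystream_xor(target_dek, uploaded)
    del target_dek                            # target discards its copy
    return {"dtk": dtk, "uploaded": uploaded, "applied": applied}
```

In this sketch only the DTK and the encrypted data leave the source, and each call to `run_session` uses a fresh DEK, which corresponds to one key per session.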
[0078] Figure 3 is a simplified schematic diagram of components involved in inter-region remote replication according to an embodiment. In one embodiment, components called differential generators (DGs), 310 in the source region A302 and 330 in the target region B304, are part of the replicator fleet 318 and operate on thousands of storage nodes within the fleet. The replicator 318 in the source region A makes remote procedure calls (RPCs) to the differential generator 310 (e.g., obtaining a set of keys and values, locking a block, etc.) and collects the keys, values, and data pages of the B-tree from the direct-access storage device (DASD) 314, which is a replication storage service for accessing storage and is regarded as a data server. The DG 310 in the source region A is a helper for the replicator 318: it divides the key range of the differences and packs all the keys / values in a specific range into a blob to be returned to the replicator 318. Both regions have multiple storage nodes 322 and 342 connected to DASDs 314 and 334, and each node contains a large number of disks (e.g., 10TB or more).
[0079] In one embodiment, the file system communicators (FSCs) 312 and 332 in both regions are metadata servers that help update the source file system for user updates to the system. The FSCs 312 and 332 are used for file system communication, and the differential generator 310 is used for replication. Both the DGs 310 and 330 and the FSCs 312 and 332 are metadata servers. User traffic passes through the FSCs 312 and 332 and the DASDs 314 and 334, while replication traffic passes through the DG. In an alternative embodiment, the functions of the FSC can be merged into the functions of the DG.
[0080] In one embodiment, the shared databases (SDBs) 316 and 336 of both regions are key-value stores, and through these components, both the control plane and the data plane (e.g., the replicator fleet) can read and write for themselves to communicate with each other. The control planes 320 and 340 of both regions can place new jobs into queues within their respective shared databases 316 and 336, and the replicator fleets 318 and 338 continuously read the queues within the shared databases 316 and 336, and when the replicator fleets 318 and 338 detect a job request, they can initiate file system replication. In other words, the shared databases 316 and 336 are conduits between the replicator fleet and the control plane. Further, the shared databases 316 and 336 are resources distributed across different regions, and the IO traffic between the shared databases 316 and 336 should be minimized. Similarly, the IO traffic with the DASD needs to be minimized so as not to affect the user's performance. However, the replication process may be adjusted because it is a secondary service compared to the primary service.
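As a non-limiting illustration of the shared database acting as a conduit, the following Python sketch lets a control plane place a job into a queue and a replicator pick it up. The class `SharedDatabase`, the functions `control_plane_request` and `replicator_poll_once`, and the status strings are hypothetical and chosen only for this sketch.

```python
from collections import deque
from typing import Callable, Optional


class SharedDatabase:
    """Hypothetical key-value store with a job queue, shared by the
    control plane and the replicator fleet of one region."""

    def __init__(self) -> None:
        self._jobs = deque()
        self._status = {}

    def enqueue_job(self, job: dict) -> None:
        self._jobs.append(job)

    def poll_job(self) -> Optional[dict]:
        return self._jobs.popleft() if self._jobs else None

    def set_status(self, job_id: str, status: str) -> None:
        self._status[job_id] = status

    def get_status(self, job_id: str) -> Optional[str]:
        return self._status.get(job_id)


def control_plane_request(sdb: SharedDatabase, job_id: str, file_system: str) -> None:
    """Control plane side: place a new replication job into the queue."""
    sdb.enqueue_job({"id": job_id, "file_system": file_system})
    sdb.set_status(job_id, "QUEUED")


def replicator_poll_once(sdb: SharedDatabase,
                         replicate: Callable[[str], None]) -> bool:
    """Replicator side: read the queue once and run a job if one is present."""
    job = sdb.poll_job()
    if job is None:
        return False
    sdb.set_status(job["id"], "RUNNING")
    replicate(job["file_system"])
    sdb.set_status(job["id"], "DONE")
    return True
```

A replicator would call `replicator_poll_once` repeatedly. Writing the status back to the same store is one way for the control plane to observe progress without contacting the replicator directly.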
[0081] The replicator fleet 318 within the source region A can cooperate with the DG310 to start scanning the B-tree in the file system within the source region A, collect keys and values, and convert those keys and values into a flat file or blob to be uploaded to the object store. When the data blob (including keys and values and actual data) is uploaded, the target can immediately apply those data blobs without waiting for a large number of blobs to be present in the object store 360. The object store 360 is located in the target region B for disaster recovery reasons. The goal is to push from the source to the target region B as quickly as possible and keep the data safe.
[0082] One goal is to optimize space by using lower-cost machines with a smaller footprint and to schedule as many replications as possible while ensuring fair bandwidth allocation among those machines. To replicate thousands of file systems, there are multiple replicators. The replicator fleets 318 and 338 in both regions are run on virtual machines that can be automatically scaled up and down to build the entire fleet for running replications. The replicators and replication services can adapt dynamically based on capacity to support each job. If the load on one replicator is high, another replicator can be selected to share the load. Different replicators in the fleet can balance the load among each other to ensure that jobs can continue and do not stop due to overloading individual replicators.
[0083] FIG. 4 is a simplified flowchart showing steps executed during inter-region remote replication according to an embodiment.
[0084] Step S1: When the customer sets up replication, the customer provides a source (or primary) file system (A) 402, a target (or secondary) file system (B) 404, and an RPO. The file systems are uniquely identified by file system identification information (e.g., Oracle Cloud ID or OCID), which is globally unique for the file system. The data is stored in the file storage service ("FSS") control plane database.
[0085] Step S2: The source (A) control plane (CP-A) 410 coordinates to periodically create system snapshots at regular intervals (less than the RPO), and notifies the data plane (including the replicator / uploader 412) of the latest snapshot and the last snapshot successfully copied to the target (B) file system 404.
[0086] Step S3: CP-A410 notifies the replicator 412 (or uploader), which is a component within the data plane, to copy the latest snapshot.
[0087] S3a: The replicator 412 of the source (A) scans the B-tree to calculate the difference between two specific snapshots. The existing key infrastructure is used to decrypt the file system data.
[0088] S3b: These differences 414 are uploaded to the object store 430 within the target (B) region (the data can be compressed and / or deduplicated during copying). This upload can be performed in parallel by multiple replicator threads 412.
[0089] Step S4: CP-A410 notifies the target (B) control plane (CP-B) 450 of the completion of the upload.
[0090] Step S5: CP-B450 calls the target replicator B452 (or downloader) to apply the differences.
[0091] S5a: The replicator B452 downloads the data 454 from the object store 430.
[0092] S5b: The replicator B452 applies these differences to the target file system (B).
[0093] Step S6: After the difference application is completed, CP-A410 is notified of the new snapshot currently available at the target (B).
[0094] Step S7: The inter-region remote replication process repeats from Step S2 to Step S6.
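As a non-limiting illustration, Steps S2 through S6 can be sketched as a single function that is called once per cycle. The sketch assumes that file systems, snapshots, and the object store can be modeled as dictionaries. The function `run_cycle` and the log strings are hypothetical, and the notifications of Steps S4 and S6 are represented only as log entries.

```python
def run_cycle(source: dict, last_replicated: dict, target: dict,
              object_store: dict, log: list) -> dict:
    """Run one replication cycle and return the snapshot now held by the target."""
    # S2: take a new snapshot of the source file system.
    snapshot = dict(source)
    log.append("S2 snapshot")

    # S3a: compute the difference against the last replicated snapshot.
    changed = {k: v for k, v in snapshot.items() if last_replicated.get(k) != v}
    removed = [k for k in last_replicated if k not in snapshot]

    # S3b: upload the difference to the object store.
    object_store["delta"] = (changed, removed)
    log.append("S3 upload")

    # S4: the source control plane notifies the target control plane.
    log.append("S4 notify CP-B")

    # S5a / S5b: download the difference and apply it to the target.
    changed, removed = object_store.pop("delta")
    target.update(changed)
    for key in removed:
        target.pop(key, None)
    log.append("S5 apply")

    # S6: the source control plane learns of the new snapshot on the target.
    log.append("S6 notify CP-A")
    return snapshot
```

Calling `run_cycle` again with the returned snapshot as `last_replicated` corresponds to the repetition of the process in the next cycle.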
[0095] Figure 5 is a simplified diagram showing a high-level concept of B-tree scanning according to an embodiment. The B-tree structure can be used within a file system. The differential generator scans the B-tree and ensures the consistency of the scan. In other words, the scan confirms that the keys and values are as expected at the end of the scan so that data corruption cannot occur, and captures all information between any two snapshots. The file system is a transactional file system that may be changed, and since another user may update the same transaction or data, the user needs to know about the changes and re-do the transaction.
[0096] The keys, values, and snapshots are immutable (i.e., cannot be changed except that a garbage collector may remove them). As shown in Figure 5, there are many snapshots (Snapshot 1 to Snapshot N) in the file system. When the differential generator is scanning the B-tree keys (510 to 560) in the source file system, the garbage collector 580 may come in and clean up the keys of the snapshots that it considers garbage, so the snapshots may be removed. When the differential generator scans the B-tree keys, the differential generator needs to ensure that the keys associated with the remaining snapshots (e.g., keys not removed by the garbage collector) are copied. When keys, such as 540 and 550, are removed by the garbage collector 580, the B-tree page can be shrunk, for example, from 2 pages before garbage collection to 1 page after garbage collection. A way for the differential generator to ensure consistency when scanning the B-tree keys is to confirm that the garbage collector 580 has not changed or deleted any of the keys in the pages (or sections between two snapshots) that the differential generator has just scanned (e.g., between two keys). Once consistency is confirmed, the differential generator collects the keys and sends them to the replicator for processing and uploading.
[0097] The B-tree key can indicate what has changed. The techniques disclosed in this disclosure can determine which B-tree keys are new and what has been updated between two snapshots. The differential generator can collect metadata parts, keys and values, and related data, and then send them to the target. The target can understand that the received information is within the range of two snapshots and applies it to the target file system. The differential generator (or a thread of the differential generator) scans the section between two keys, confirms its consistency, and then uses the last end key as the next start key for the next scan. This process is repeated until all keys are checked, and the differential generator collects related data every time the consistency is confirmed.
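As a non-limiting illustration of the section-by-section scan described above, the following Python sketch scans a sorted key-value store in sections, rescans a section if the store changed during the scan, and uses the last end key as the next start key. The class `KeyStore`, the function `scan_all`, and the `on_section` hook are hypothetical. A single store-wide modification counter is assumed in place of per-page consistency checks, and keys are assumed to be non-empty strings.

```python
class KeyStore:
    """Sorted key-value store with a modification counter
    (a simplified stand-in for B-tree pages)."""

    def __init__(self, items: dict) -> None:
        self._items = dict(items)
        self.version = 0

    def garbage_collect(self, key: str) -> None:
        """Remove a key, as a garbage collector might, and bump the counter."""
        self._items.pop(key, None)
        self.version += 1

    def read_section(self, start: str, limit: int) -> list:
        """Return up to `limit` (key, value) pairs with key > start, in key order."""
        keys = sorted(k for k in self._items if k > start)[:limit]
        return [(k, self._items[k]) for k in keys]


def scan_all(store: KeyStore, section_size: int, on_section=None) -> list:
    collected = []
    start = ""  # exclusive lower bound of the next section
    while True:
        version_before = store.version
        section = store.read_section(start, section_size)
        if on_section is not None:
            on_section(section)       # hook used to simulate concurrent activity
        if store.version != version_before:
            continue                  # inconsistent: scan the same section again
        if not section:
            break                     # all keys have been checked
        collected.extend(section)     # consistency confirmed: collect the data
        start = section[-1][0]        # last end key becomes the next start key
    return collected
```

If a key is garbage collected while its section is being scanned, the section is read again and the removed key is no longer collected.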
[0098] For example, when a file is changed (e.g., created, deleted, and then recreated) within a file system, this process creates multiple versions of the corresponding file directory entry. During the replication process, the garbage collector may clean up (or remove) the version of the file directory entry corresponding to the deleted file, which may cause a consistency problem called a whiteout. A whiteout occurs when there is a mismatch between the source file system and the target file system, because the target file system may fail to reconstruct the original snapshot chain containing the changed file. The disclosed techniques can detect whiteout files (i.e., changed files affected by the garbage collector) during B-tree scanning, extract the version of the changed file that is not affected, and provide related information to the target file system during the same replication cycle to ensure the consistency between the source file system and the target file system by properly reconstructing the correct snapshot chain.
[0099] Figures 6A and 6B are diagrams showing the pipeline stages of inter-region replication according to an embodiment. The inter-region replication of the source file system disclosed in the present disclosure includes four pipeline stages, namely, the start of inter-region replication, the B-tree scan in the source file system (i.e., the differential generation pipeline stage), the storage IO access for retrieving data (i.e., the data read pipeline stage), and the data upload to the object store (i.e., the data upload pipeline stage), which are included within the source file system. The target file system includes four pipeline stages in a similar but reverse order, namely, the preparation for inter-region replication, the download of data from the object store, the application of the difference within the target file system, and the storage IO access for storing data. Figure 6A shows the four pipeline stages within the source file system, and the same concept applies to the target file system. Figure 6B shows the processes and interactions between the components involved in the pipeline stages. These pipeline stages can all operate in parallel. Each pipeline stage operates independently and can pass information to the next pipeline stage when the processing at the current stage is completed. Each pipeline stage receives a portion of the total bandwidth and is guaranteed not to use more than necessary. In other words, resources are fairly allocated among all jobs. If no other jobs are operating within the system, the operating job can acquire as many resources as possible.
[0100] Threads within each pipeline stage also execute tasks independently of each other in parallel (or simultaneously) within the same pipeline stage (i.e., if a thread fails, it does not affect other threads). Further, the tasks (or replication jobs) executed by threads in each pipeline stage are restartable, i.e., if a thread fails, a new thread (also called a replacement thread) can take over from the failed thread and continue the original task from the last successful point.
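As a non-limiting illustration of a restartable task, the following Python sketch records a checkpoint after each successfully processed item so that a replacement thread can continue from the last successful point. The functions `worker` and `run_with_replacement`, the checkpoint layout, and the `fail_at` parameter used to simulate a failure are hypothetical.

```python
import threading


def worker(items, checkpoint, fail_at=None):
    """Process items from the checkpoint onward. Stopping early at index
    `fail_at` simulates a thread that fails before finishing its task."""
    for i in range(checkpoint["next"], len(items)):
        if i == fail_at:
            return                        # simulated failure
        checkpoint["applied"].append(items[i])
        checkpoint["next"] = i + 1        # record the last successful point


def run_with_replacement(items, fail_at=None):
    checkpoint = {"next": 0, "applied": [], "threads_used": 0}
    first_attempt = True
    while checkpoint["next"] < len(items):
        thread = threading.Thread(
            target=worker,
            args=(items, checkpoint, fail_at if first_attempt else None))
        thread.start()
        thread.join()
        checkpoint["threads_used"] += 1
        first_attempt = False             # the replacement thread does not fail
    return checkpoint
```

Because the replacement thread starts at `checkpoint["next"]`, no item is processed twice and none is skipped.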
[0101] In some embodiments, the B-tree scan may be performed using parallel processing threads within the source file system 280. The B-tree may be divided into a plurality of key ranges between the first key and the last key in the file system. The number of key ranges may be determined by the customer. A plurality of (e.g., about 8 to 16) range threads may be used for the B-tree scan for each file system. One range thread can perform a B-tree scan of one key range, and all range threads operate simultaneously in parallel. The number of threads used varies depending on factors such as the size of the file system, the availability of resources, and the bandwidth for balancing resource and traffic congestion. Usually, the number of key ranges is more than the number of available range threads for fully utilizing the range threads. Therefore, the B-tree scan is scalable and can be processed by simultaneous parallel scans (e.g., using multiple threads).
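As a rough illustration of the key-range scan, the sketch below splits a sorted key list into contiguous ranges and lets a small thread pool work through them. The function names, the choice of 32 ranges and 8 threads, and the use of a plain list in place of a B-tree are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_ranges(sorted_keys, num_ranges):
    """Split a sorted key list into contiguous, near-equal key ranges."""
    size, extra = divmod(len(sorted_keys), num_ranges)
    ranges, start = [], 0
    for r in range(num_ranges):
        end = start + size + (1 if r < extra else 0)
        ranges.append(sorted_keys[start:end])
        start = end
    return ranges


def scan_range(keys):
    """Stand-in for a B-tree scan of one key range (here it just counts keys)."""
    return len(keys)


keys = [f"key{i:04d}" for i in range(100)]
key_ranges = split_into_ranges(keys, num_ranges=32)   # more ranges than threads
with ThreadPoolExecutor(max_workers=8) as pool:       # 8 range threads
    scanned = list(pool.map(scan_range, key_ranges))
```

Having more ranges than threads means a thread that finishes a small range can immediately pick up another, which keeps all threads busy.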
[0102] After the difference generator scans the pages, if some keys are missing and thus some keys are inconsistent, the system may remove the ongoing uncommitted transactions and return to the starting point for scanning again. During the repetition of the B-tree scan due to the inconsistency, the difference generator may ignore the missing keys and the data associated with them by not collecting them because these associated data are considered garbage in order to minimize the amount of information to be processed or uploaded to the target side. Therefore, the B-tree scan and data transfer can be made more efficient. Further, the difference generator does not need to wait for the garbage collector to remove the information to be deleted before scanning the B-tree keys. For example, keys have dependencies on each other. If a key or an iNode points to a block that has been deleted by the garbage collector or is to be deleted, the system (or the difference generator) can itself grasp that the specific block is garbage and the difference generator does not need to carry it.
[0103] The differential generator usually makes no changes on the source side (e.g., does not delete keys or blocks of data considered as garbage), and simply does not copy them to the target side. The B-tree traversal process and garbage collection are asynchronous processes. For example, when the block of data pointed to by a key no longer exists, the file system can flag that key as garbage, notify that the key should not be modified (e.g., is immutable), and only the garbage collector can remove that key. The differential generator can continue scanning the next key without waiting for the garbage collector. In other words, the differential generator and the garbage collector can proceed at their own pace.
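The behavior of leaving garbage in place while omitting it from the transfer might be sketched as follows. The dictionary of keys, the set of live blocks, and the function name `collect_diff` are hypothetical simplifications.

```python
def collect_diff(entries, live_blocks):
    """Return keys to replicate, skipping any whose block no longer exists.

    The source mapping is never modified; garbage is only left out of the result.
    """
    diff = {}
    for key, block_id in entries.items():
        if block_id not in live_blocks:
            continue              # garbage: leave it for the garbage collector
        diff[key] = block_id
    return diff


source = {"inode:1": "blk-a", "inode:2": "blk-b", "inode:3": "blk-c"}
live = {"blk-a", "blk-c"}          # blk-b was already deleted
diff = collect_diff(source, live)
```

The scan neither waits for nor performs deletion, so the two processes stay decoupled.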
[0104] In FIG. 6A, when the source region starts an inter-region replication process that can include multiple file systems, the main threads 610a - n select a replication job (one job per file system). The main thread of the file system (e.g., 610a or 610 for later use) within the source region (i.e., the source file system) communicates with the differential generator 620 (shown in FIG. 6B) to obtain the number of key ranges requested by the customer and update the corresponding records in the SDB 622. Once the main thread 610 of the source file system knows the number of key ranges required, it further creates a set of range threads 612a - n based on the number of key ranges required. These range threads 612a - n are executed by the differential generator 620. These range threads 612a - n initialize the GETKEYVAL buffer 640 (shown in FIG. 6B), update the checkpoint record 642 in the SDB 622 (shown in FIG. 6B), and perform storage I / O access 644 by interacting with the DASD I / O threads 614a - n.
[0105] In one embodiment, each main thread 610 is responsible for monitoring all the range threads 612a - n that it creates. During replication, the main thread 610 may generate a master manifest file that summarizes the entire replication. The range threads 612a - n generate a range manifest file that includes the number of key ranges (i.e., the subdivision of the entire replication), and then generate a checkpoint manifest (CM) file for each range to provide updates to the target file system regarding the number of blobs per checkpoint, where the checkpoints are created during B - tree traversal. One checkpoint is created by the range thread 612. When the main thread 610 determines that all range threads 612a - n are complete, it creates a final checkpoint manifest (CM) file that includes an end - of - file marker, and then uploads the CM file to the object store so that the target file system can understand the progress within the source file system. The CM file includes an overview of all individual ranges, such as the number of ranges, the final state of the checkpoint records, and other information.
[0106] Range threads 612a-n are used for parallel processing to significantly reduce the time for B-tree traversal of a large source file system. In one embodiment, the B-tree keys are divided into ranges of approximately equal size. One range thread can perform a B-tree traversal of one key range. The number of range threads 612a-n used varies depending on factors such as the size of the file system, resource availability, the bandwidth available for balancing resources, the amount of data generated, and traffic congestion. Typically, the number of key ranges is about two to four times the number of available range threads 612a-n so that the range threads are fully utilized. Each of the range threads 612a-n has a dedicated buffer (GETKEYVAL) 640 that contains jobs available for work. Each range thread 612 operates independently of other range threads and periodically updates checkpoint records 642 in the SDB 622.
[0107] Range threads 612a - n may need to collect file data (e.g., FMAP) associated with B - tree keys and request IO access 644 to storage when traversing the B - tree (i.e., when recursively visiting all nodes of the B - tree). These IO requests are queued by each range thread 612 so that DASD IO threads 614a - n (i.e., the data - read pipeline stage) can handle those IO requests. These DASD IO threads 614a - n are common threads shared by all range threads 612a - n. After the DASD IO threads 614a - n acquire the requested data, the data is placed in an output buffer 646 to serialize the data into blobs so that the replica object threads 616a - n (i.e., the data - upload pipeline stage) can upload it to the object store located in the target region. Each object thread selects an upload job that can include a portion of all the data to be uploaded, and all object threads execute the uploads in parallel.
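The hand-off between the data-read stage and the data-upload stage can be sketched with two queues and two small worker pools. This is a simplified model under stated assumptions: dictionaries stand in for storage and the object store, and the worker counts and the `STOP` sentinel are illustrative choices.

```python
import queue
import threading

STOP = object()                      # sentinel that tells a worker to exit


def io_worker(io_q, out_q, storage):
    """Data-read stage: fetch the requested block and pass it downstream."""
    while (req := io_q.get()) is not STOP:
        out_q.put((req, storage[req]))


def upload_worker(out_q, object_store, lock):
    """Data-upload stage: store each fetched block as an object."""
    while (item := out_q.get()) is not STOP:
        name, data = item
        with lock:
            object_store[name] = data


storage = {f"fmap{i}": f"data{i}".encode() for i in range(20)}
io_q, out_q = queue.Queue(), queue.Queue()
object_store, lock = {}, threading.Lock()

io_threads = [threading.Thread(target=io_worker, args=(io_q, out_q, storage))
              for _ in range(3)]
up_threads = [threading.Thread(target=upload_worker, args=(out_q, object_store, lock))
              for _ in range(2)]
for t in io_threads + up_threads:
    t.start()

for name in storage:                 # range threads would enqueue these requests
    io_q.put(name)
for _ in io_threads:
    io_q.put(STOP)
for t in io_threads:
    t.join()                         # all reads finish before uploaders are stopped
for _ in up_threads:
    out_q.put(STOP)
for t in up_threads:
    t.join()
```

Both stages run at the same time; the queues let a fast stage get ahead without blocking a slower one.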
[0108] FIG. 7 is a diagram showing a hierarchical structure in the FSS data plane according to an embodiment. In FIG. 7, the replicator fleet 710 includes four layers: a job layer 712, a differential generator client layer 714, an encryption / DASD IO layer 716, and an object layer 718. The replicator fleet 710 is a single process that serves to exchange information with a storage fleet 720, a KMS 730, and an object storage 740. In one embodiment, the job layer 712 polls the SDB 704 for a job 706 that is queued as either an upload job or a download job. The replicator fleet 710 includes VMs (or threads) that select enqueued replication jobs up to maximum capacity. A replicator thread may own a part of a replication job, but coordinates with another replicator thread that owns the remaining part of the same replication job so that the two complete the entire replication job concurrently. The replication jobs executed by the replicator fleet 710 are restartable in that if a replicator thread fails during replication, another replicator thread can take over and continue from the last successful point to complete the job that the failed replicator thread originally owned. If a stray replicator thread (e.g., a replicator thread that fails and restarts) competes with another replicator thread, the FSS can avoid the conflict by using a mechanism called a generation number to cause the two replicator threads to update different records.
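One way to picture the generation-number mechanism is a job table whose progress records are keyed by both job and generation. The sketch below is an assumption about how such a table could look; `JobTable` and its methods are hypothetical and not taken from the disclosure.

```python
class JobTable:
    """Hypothetical job table keyed by (job_id, generation)."""

    def __init__(self):
        self.generation = {}     # job_id -> current generation
        self.progress = {}       # (job_id, generation) -> checkpoint value

    def take_over(self, job_id):
        """Claim a job under a fresh generation number and return it."""
        gen = self.generation.get(job_id, 0) + 1
        self.generation[job_id] = gen
        return gen

    def checkpoint(self, job_id, gen, value):
        """Write progress to the record owned by this generation only."""
        self.progress[(job_id, gen)] = value

    def current_progress(self, job_id):
        """Read progress from the record of the current generation."""
        return self.progress.get((job_id, self.generation[job_id]))


table = JobTable()
old = table.take_over("job-1")        # first owner
table.checkpoint("job-1", old, 10)
new = table.take_over("job-1")        # replacement after the owner stalls
table.checkpoint("job-1", new, 12)
table.checkpoint("job-1", old, 11)    # stray write lands in the old record
```

The late write from the stray thread goes to the record of its own generation, so it cannot overwrite the progress of the current owner.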
[0109] The differential generator client layer 714 performs a B-tree scan by accessing a differential generator server 724 in which a B-tree exists within the storage fleet 720. The encryption / DASD IO layer 716 is responsible for security and storage access roles. After the B-tree scan, the replicator fleet 710 may request IO access via the encryption / DASD IO layer 716 to access the DASD range 722 of file data associated with the differentials identified during the B-tree scan. The replicator fleet 710 and the storage fleet 720 both periodically update the status of the control API 702 (e.g., checkpoint and lease of the replicator fleet 710) via the SDB 704, enabling the control API 702 to trigger an alarm or execute an action if necessary.
[0110] During the inter-region replication process, the encryption / DASD IO layer 716 exchanges information with the KMS and FSK fleet 730 on the target side to create a session key (or snapshot encryption key), and uses the FSK for encryption and decryption of the session key. Finally, the object layer 718 is responsible for uploading differential and file data from the source file system to the object store 740 and downloading them from the object store 740 to the target file system.
[0111] The data plane of the FSS is responsible for differential generation. The data plane stores FSS data using a B-tree, which includes various types of key-value pairs including, but not limited to, a leader block, a superblock, an iNode, a file name key, a cookie map (cookies associated with directory entries), and a block map (also called an FMAP in the case of file content data).
[0112] These B-tree keys are processed together by the replicator and the diff generator within the data plane. An algorithm for calculating the pairs of keys and values (i.e., part of the diff) that have changed between two specific snapshots within the file system continuously reads the keys, returns the keys to the replicator using the transaction budget, and finally ensures that the transaction is committed to obtain a consistent pair of keys and values for processing.
[0113] In other embodiments, the diff generation and calculation can be extensible. An extensible approach can calculate the diff (i.e., the change in the pairs of keys and values) between two snapshots by utilizing multiple threads to divide the B-tree into many key ranges. A pool of threads (i.e., diff generators) can execute a scan of the B-tree (i.e., traversal of the B-tree) and calculate the diffs in parallel.
[0114] Figure 8 shows a simplified exemplary binary large object (BLOB) format according to an embodiment. A blob is a data type for storing information (e.g., binary data) in a database. Blobs are generated by the source region during replication and uploaded to the object store. The target region needs to download and apply the blobs. Blobs and objects can be used interchangeably depending on the context.
[0115] During the B-tree scan, when the differential generator encounters the iNode of a specific file (i.e., data content) and its block map (also called FMAP, data associated with the B-tree key), the differential generator cooperates with the replicator to traverse all pages within the blocks (FMAP blocks) within the DASD range pointed to by the FMAP, reads them into the data buffer, decrypts the data using the local encryption file key, puts it into the output buffer, and serializes it into a blob for the replicator to upload to the object store. In other words, the differential generator needs to collect all FMAPs of the identified differences in order to obtain all data related to the differences between the two snapshots.
[0116] The snapshot differences stored in the object store may span multiple blobs (or objects if stored in the object store). The blob format of these blobs includes a key, a value, and, if present, data associated with the key. For example, in Figure 8, the snapshot difference 800 includes at least three blobs 802, 804, and 806. The first blob 802 includes a prefix 810 indicating the types of the key and value, the length of the key, and the length of the value, followed by a key 812 (key 1) and a value 814 (value 1). The second blob 804 includes a prefix 820 (types of the key and value, length of the key, and length of the value), a key 822 (key 2), a value 824 (value 2), a data length 826, and data 828 (data 2). In the prefix 820 of this second blob 804, since this blob includes additional data 828 associated with the key 822, the types of the key and value are fmap. The third blob 806 has a format similar to that of the first blob 802, for example, a prefix 830, a key 832 (key 3), and a value 834 (value 3).
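The blob layout of Figure 8 can be approximated by a small encoder and decoder. The field widths (a 1-byte type and 4-byte lengths), the byte order, and the numeric type codes are assumptions chosen for illustration; the disclosure does not specify them.

```python
import struct

PREFIX = struct.Struct(">BII")   # assumed widths: type, key length, value length
DATA_LEN = struct.Struct(">I")   # assumed width of the optional data length
TYPE_PLAIN, TYPE_FMAP = 0, 1     # hypothetical type codes


def encode_blob(key, value, data=None):
    """Serialize prefix + key + value, plus data length + data for FMAP blobs."""
    kind = TYPE_PLAIN if data is None else TYPE_FMAP
    out = PREFIX.pack(kind, len(key), len(value)) + key + value
    if data is not None:
        out += DATA_LEN.pack(len(data)) + data
    return out


def decode_blob(buf, offset=0):
    """Parse one blob; return (key, value, data_or_None, next_offset)."""
    kind, klen, vlen = PREFIX.unpack_from(buf, offset)
    offset += PREFIX.size
    key = buf[offset:offset + klen]
    offset += klen
    value = buf[offset:offset + vlen]
    offset += vlen
    data = None
    if kind == TYPE_FMAP:
        (dlen,) = DATA_LEN.unpack_from(buf, offset)
        offset += DATA_LEN.size
        data = buf[offset:offset + dlen]
        offset += dlen
    return key, value, data, offset


stream = (encode_blob(b"key1", b"value1")
          + encode_blob(b"key2", b"value2", data=b"file bytes")
          + encode_blob(b"key3", b"value3"))
blobs, pos = [], 0
while pos < len(stream):
    key, value, data, pos = decode_blob(stream, pos)
    blobs.append((key, value, data))
```

The type in the prefix tells the reader whether a data length and data follow, which is why only the second blob carries the extra fields.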
[0117] The data is decrypted, collected, and then written to a blob. All processes are executed in parallel. Multiple blobs can be processed and updated simultaneously. When all processes are complete, the data is written in blob format (shown in Figure 8) and can then be uploaded to an object store in the format (shown in Figure 9) or path name.
[0118] Figure 9 shows an exemplary replication bucket format according to an embodiment. A "bucket" can refer to a container that stores objects in a compartment within an object storage namespace. In one embodiment, a bucket is used by a source replicator to store data protected using server-side encryption (SSE) technology and is also used by a target replicator to download changes and apply them to a snapshot. Replication data for all file systems in a target region can share a bucket within that region.
[0119] The data layout of a bucket in the object store has a directory structure that includes, but is not limited to, a file system ID (e.g., Oracle Cloud ID), a difference including a start snapshot number and an end snapshot number, a manifest that describes the content of the information in the object's layout, and blobs. For example, the bucket in FIG. 9 includes two objects 910 and 930. The first object 910 includes two differences 912 and 920. This object starts with a path name 911 (e.g., ocid1.filesystem.oc1.iad...) that uses the source file system ID as a prefix, followed by a first difference 912 generated from snapshot 1 and snapshot 2, and a second difference 920 generated from snapshot 2 and snapshot 3. Each difference includes one or more blobs that represent the content of that difference. The first difference 912 stores two blobs 914 and 916 in the order of generation. The second difference 920 includes only one blob 922. Each difference also includes a manifest that describes the content of the information in the layout of this difference, for example, manifest 918 of the first difference 912 and manifest 924 of the second difference 920. The manifest in the bucket is content that describes the difference, such as the file system number and snapshot range. The manifest can be a master manifest, a range manifest, or a checkpoint manifest depending on the stage of the replication process.
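A bucket layout of this kind could be expressed as object names built from the file system ID and the snapshot range. The exact path syntax below (`diff-<start>-<end>`, `blob-<sequence>`, `manifest`) and the file system ID are hypothetical; only the ordering of the components follows the description.

```python
def diff_prefix(fs_id, start_snap, end_snap):
    """Prefix for one difference, keyed by source file system and snapshot range."""
    return f"{fs_id}/diff-{start_snap}-{end_snap}"


def blob_name(fs_id, start_snap, end_snap, seq):
    """Zero-padded sequence numbers keep blobs sorted in generation order."""
    return f"{diff_prefix(fs_id, start_snap, end_snap)}/blob-{seq:06d}"


def manifest_name(fs_id, start_snap, end_snap):
    """Name of the manifest that describes one difference."""
    return f"{diff_prefix(fs_id, start_snap, end_snap)}/manifest"


def blobs_of(bucket, fs_id, start_snap, end_snap):
    """List the blob names of one difference in generation order."""
    prefix = diff_prefix(fs_id, start_snap, end_snap) + "/blob-"
    return sorted(name for name in bucket if name.startswith(prefix))


FS = "fs-example-iad"               # hypothetical file system ID
bucket = {
    blob_name(FS, 1, 2, 1): b"...",
    blob_name(FS, 1, 2, 0): b"...",
    manifest_name(FS, 1, 2): b"...",
    blob_name(FS, 2, 3, 0): b"...",
    manifest_name(FS, 2, 3): b"...",
}
first_diff = blobs_of(bucket, FS, 1, 2)
```

Prefixing every name with the source file system ID is what lets replication data for many file systems share one bucket without collisions.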
[0120] The second object 930 also includes two differences 932 and 940 in a similar format starting with the path name 931. The two objects 910 and 930 in the bucket come from different source regions, namely IAD for object 910 and PHX for object 930. After the blobs are applied, the corresponding information in the layout can be removed to reduce space utilization.
[0121] The final manifest object (i.e., checkpoint manifest, CM file) is uploaded from the source region to the object store, and the source file system indicates to the target region that the upload of the snapshot delta of a specific object has been completed. The source CP transmits this event to the target CP, and the target CP can notify the target DP via the SDB to trigger the download process of that object by the target replicator.
[0122] The control plane within the source region or the target region coordinates all of the replication workflow and drives the replication of data. The control plane performs the following functions: (1) creates the underlying system snapshot for creating the delta, (2) determines when such a snapshot needs to be created, (3) initiates replication based on the snapshot, (4) monitors the replication, (5) triggers the download of the delta by the secondary (or target side), and (6) indicates to the primary (or source) side that the snapshot has reached the secondary.
[0123] The file system has several operations for processing its resources, including but not limited to creating, reading, updating, and deleting (CRUD). These operations are typically synchronized within the same region, starting a workflow when the file system receives an HTTPS request from the API server, making changes in the backend to create resources, and returning responses to customers. Resources are split into a source region and a target region. The state is maintained for the same resource between the source region and the target region. Therefore, there is asynchronous communication between the source region and the target region. Customers can interact with the source region to create or update resources, and these creations or updates can be automatically reflected in secondary or auxiliary resources within the target region. The state machine in the control plane also targets recovery in many aspects, including but not limited to fleet failures, key management failures, disk failures, and object failures.
[0124] Regarding the application programming interfaces (APIs) within the control plane, there are various APIs for users to configure replication. The control API for any new resource only functions within the region where the object was created. In the target file system, a field named "IsTargetable" can be set in the API to ensure that the target file system receiving replication cannot be accidentally used by consumers. In other words, setting this field to false means that consumers can view the target file system, but no one can export the target file system or access any data within the live system. Since export is not just a read-only permission but a read / write permission for export, any export has the potential to modify the data. Therefore, during the replication process, exports are not permitted to prevent any changes to the target file system. Consumers can only access the data within the old snapshots that have already been replicated. Any newly created or replicated file system can have this field set to true. The reason is that the target can only obtain data from a single source. Otherwise, collisions may occur when data is written or deleted. The system needs to know whether the target file system in use is already part of some replication. Setting the "IsTargetable" field to "true" means that replication is not in progress, and setting it to "false" means that the target file system cannot be used.
[0125] Regarding inter-region communication between components of the control plane, the primary resource on the source file system is called an application, and the auxiliary (or secondary) resource on the target file system is called an application target. When source and target objects are created, they have a single replication relationship. Both objects can be updated only from the source side, including changes to compartments, editing of details, or deletion. If the user wants to delete the target side, the replication itself can be deleted. In the case of a planned failover, it is possible to delete the source side, and both the source-side and target-side replications are deleted. In the case of an unplanned failover, the source side is not available, so only the target replication can be deleted. In other words, there are two resources for a single replication, and those resources should be kept in a synchronized state. There are various workflows for updating metadata on both the source and target sides. Additionally, inter-region APIs for retries, fault handling, and failover are also part of the inter-region communication process.
[0126] When creating the necessary security and other related artifacts, the source uploads the security and artifacts to the object store, starts a job at the target (i.e., notifies the target that the job is available), and the target can start downloading the artifacts (e.g., snapshots or deltas). Thereafter, the target continues to search for an end-of-file marker (also referred to herein as a checkpoint manifest (CM) file) within the object store. The CM file is used as a mechanism for the source side and the target side to communicate the completion of the upload of an object during the replication process. At every checkpoint, the source side uploads this CM file containing information such as the number of blobs uploaded up to this checkpoint, enabling the target side to download this number of blobs and apply them to the current snapshot. This CM file is a mechanism for the source side to communicate to the target side that the upload of an object to the object store has been completed, and for the target to start working on that object. In other words, the target continues to download until there are no more objects in the object storage. Thus, this approach enables concurrent processing on both the source side and the target side.
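The checkpoint manifest handshake can be modeled as a target that applies only the blobs announced by the latest CM file and stops once the end-of-file flag is seen. In this sketch a dictionary stands in for the object store, and the object names (`cm`, `blob-<n>`) and manifest fields (`blob_count`, `eof`) are assumptions.

```python
def apply_announced(object_store, applied, next_blob):
    """Apply blobs announced by the newest checkpoint manifest.

    Returns (next_blob, done), where done is True once the manifest carrying
    the end-of-file flag has been seen and its blobs have been applied.
    """
    manifest = object_store.get("cm")
    if manifest is None:
        return next_blob, False
    while next_blob < manifest["blob_count"]:
        applied.append(object_store[f"blob-{next_blob}"])
        next_blob += 1
    return next_blob, manifest["eof"]


store, applied, pos = {}, [], 0

pos, done = apply_announced(store, applied, pos)      # nothing uploaded yet
store.update({"blob-0": b"a", "blob-1": b"b"})
store["cm"] = {"blob_count": 2, "eof": False}         # first checkpoint
pos, done = apply_announced(store, applied, pos)
store["blob-2"] = b"c"                                # uploaded, not yet announced
pos, mid_done = apply_announced(store, applied, pos)
store["cm"] = {"blob_count": 3, "eof": True}          # final CM file
pos, done = apply_announced(store, applied, pos)
```

The target never touches a blob that the source has not announced, so a partially uploaded checkpoint is never applied.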
[0127] FIG. 10 is a flowchart showing a state machine for simultaneous source upload and target download according to an embodiment. As previously explained, both the source file system and the target file system can perform replication simultaneously and thus can have their own state machines. In one embodiment, each file system can have its own state machine while sharing some common job-level states. In FIG. 10, the source file system has states 1030-1034 for session key generation and transfer in addition to states 1002-1018 for performing data upload. The target file system has states 1050-1068 related to data download. The session key can be generated at any time within the source file system while differences are being uploaded to the object storage. Thus, the session key transfer has its own state sequence 1030-1034. In FIG. 10, the target file system cannot start the replication download process (i.e., Ready_to_Reconcile state 1050) until it receives an indication that at least one object has been uploaded to the object storage by the source file system (i.e., Manifest_Copied state 1014) and that it is ready to download the session key (i.e., Copied_DTK state 1034).
[0128] In the source file system, multiple functional blocks such as a snapshot generator, a control API, and a differential monitor are part of the CP. The replicator fleet is part of the DP. The snapshot generator is responsible for periodically generating snapshots. The differential monitor periodically monitors the progress of the replicator in replication-related tasks, including the creation of snapshots and the replication schedule. When the differential monitor detects that the replicator has completed a replication job, it transitions the state to a copied state on the source side (e.g., the Manifest_Copied state 1014) or a replicated state on the target side (e.g., the Replicated state 1058). In certain embodiments, multiple file systems can concurrently execute replication from a source region to a target region.
[0129] Referring to FIG. 10, in certain embodiments, in the concurrent-mode state machine of the source file system, the snapshot generator, after creating a snapshot, signals to the differential monitor that a snapshot has been generated. The differential monitor that executes the CP replication state (CpRpSt) workflow is responsible for initiating the upload of snapshot metadata to the object store on the target side. The snapshot metadata can include the type of snapshot, snapshot identification information, the time of the snapshot, etc. The CpRpSt workflow sets the Ready_to_Copy_Metadata state 1002 for the replicator fleet to start copying the metadata. When the replicator obtains a replication job, it creates a copy of the snapshot metadata (i.e., the Snapshot_Metadata_Copying state 1004) and uploads those copies to the object store. When all replicators have completed the upload of the snapshot metadata, the state is set to the Snapshot_Metadata_Copied state 1006. Thereafter, the CpRpSt workflow continues to poll the source SDB for the session key.
[0130] Here, the CpRpSt workflow returns control to the differential monitor to monitor the differential upload process, which transitions to the Ready_to_Copy state 1008 indicating that the differential calculation is scheduled. Next, the source CP API sends a request to the replicator to start the next stage of replication by uploading the differential and creating a copy of the manifest. The replicator that selects the replication job can start creating a copy of the manifest (i.e., the Manifest_Copying state 1010). When the source file system completes the copy of the manifest, it transitions to the Manifest_Copied state 1014 and simultaneously notifies the target file system that it can start its internal state (the Ready_to_Reconcile state 1050).
[0131] As described above, the session key can be generated by the source file system during the upload of data. The replicator of the source file system communicates with the target KMS vault to obtain the master key that can be provided by the customer and creates a session key (referred to herein as the differential encryption key or DEK). Next, the replicator encrypts the session key using the local file system key (FSK: file system key) (which becomes the encrypted DEK, also referred to herein as the differential transfer key (DTK)). Thereafter, the DTK is stored in the SDB within the source region and reused by the replicator thread during the replication cycle. The state machine transitions to the Ready_to_Copy_DTK state 1030.
[0132] The source file system transfers the resource identification information of DTK and KMS to the target API, and then the target API puts those resource identification information into the SDB within the target region. During this transfer process, the state machine is set to the Copying_DTK state 1032. The CpRpSt workflow within the source file system, when it finishes polling the source SDB for the session key, sends a notification to the target side that the target file system has downloaded the session key (DTK) and is ready to use that session key to decrypt the downloaded differences for application. Then, the state machine transitions to the Copied_DTK state 1034. The replicator on the target side retrieves the DTK from the SDB and requests the KMS API to decrypt the DTK into the plaintext DEK (i.e., the decrypted session key).
[0133] When the source file system completes the upload of data for a specific replication cycle including session key transfer, the difference monitor notifies the target control API of the status such as validation information and transitions to the X-region_Copied_Done state 1016. This can occur before the target file system completes the download and application of the data. The source file system further cleans up the memory and removes all keys. Then, the source file system transitions to the Awaiting_Target_Response state 1018 and waits for a response from the target file system to start a new replication cycle.
[0134] As described above, the target file system cannot start the replication download process until it receives an indication that at least one object has been uploaded to the object storage by the source file system (i.e., the Manifest_Copied state 1014) and that it is ready to download the session key (i.e., the Copied_DTK state 1034). When these two conditions are met, the state machine transitions to the Ready_To_Reconcile state 1050. Next, in the Reconciling state 1052, the target file system starts an adjustment process with the source side, such as synchronizing snapshots of the source file system and the target file system, obtaining snapshots, and generating statistical values, and also performs some internal CP management operations, including communication within the target file system between the delta monitor and the CP API.
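The two-condition gate for entering the Ready_To_Reconcile state can be written as a small predicate over the source-side states. The enum values reuse the reference numerals of FIG. 10 purely as ordering keys; treating any upload state at or beyond Manifest_Copied as satisfying the first condition is an assumption of this sketch.

```python
from enum import IntEnum


class Upload(IntEnum):
    """Source upload states, ordered by their FIG. 10 reference numerals."""
    READY_TO_COPY_METADATA = 1002
    SNAPSHOT_METADATA_COPYING = 1004
    SNAPSHOT_METADATA_COPIED = 1006
    READY_TO_COPY = 1008
    MANIFEST_COPYING = 1010
    MANIFEST_COPIED = 1014
    X_REGION_COPIED_DONE = 1016
    AWAITING_TARGET_RESPONSE = 1018


class KeyTransfer(IntEnum):
    """Session key transfer states, which advance independently of the upload."""
    READY_TO_COPY_DTK = 1030
    COPYING_DTK = 1032
    COPIED_DTK = 1034


def target_ready_to_reconcile(upload, key):
    """True when both preconditions for Ready_To_Reconcile are satisfied."""
    return upload >= Upload.MANIFEST_COPIED and key == KeyTransfer.COPIED_DTK


blocked_on_key = target_ready_to_reconcile(Upload.MANIFEST_COPIED, KeyTransfer.COPYING_DTK)
blocked_on_data = target_ready_to_reconcile(Upload.MANIFEST_COPYING, KeyTransfer.COPIED_DTK)
ready = target_ready_to_reconcile(Upload.MANIFEST_COPIED, KeyTransfer.COPIED_DTK)
```

Keeping the key transfer in its own state sequence lets either condition be reached first without forcing an order between them.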
[0135] After the adjustment process is completed, the replication job is passed to the target replicator (i.e., the Ready_to_Replicate state 1054). The target replicator monitors the checkpoint manifest (CM) file uploaded by the source file system. The CM file is marked by the target. Then, the target replicator thread starts to download the manifest and apply the downloaded and decrypted deltas (i.e., the Replicating state 1056). The target replicator thread also reads the FMAP data blocks from the blobs downloaded from the object store, communicates with the local FSK service to obtain the file system key FSK, and the FSK is used to re-encrypt each FMAP data block and store it in local storage.
[0136] When the source file system finishes uploading data, it updates the final CM file by setting the end-of-file (eof) field to true and uploads it to the object store. As soon as the target file system detects this final CM file, it ends the download of the blobs and applies them, and the state machine transitions to the Replicated state 1058.
[0137] After the target file system applies all the deltas (or blobs), it continues to download the snapshot metadata from the object store and inputs the information of the source file system's snapshot into the target file system's snapshot (i.e., the Snapshot_metadata_Populating state 1060). When the target file system's snapshot is input, the state machine transitions to the Snapshot_Metadata_Populated state 1062.
[0138] In the Snapshot_Deleting state 1064, the target file system deletes from the object store all the blobs that were downloaded and applied to the latest snapshot. Then, the target control API notifies the target delta monitor when the blobs in the object store are deleted and proceeds to the Snapshot_Deleted state 1066. The target file system further cleans up the memory and removes all the keys. The FSS service also releases the KMS key.
[0139] When the target DP finishes applying the differences and cleaning up, it uses the target control API to verify the validity regarding the status of the source file system and whether it has received the X-region_Copied_Done notification from the source file system. If the notification has been received, the target difference monitor transitions to the X-region DONE state 1068 and sends the X-region DONE notification to the source file system. In some embodiments, the target file system checks whether the end of the file exists for all key ranges and all upload processing threads because all objects uploaded to the object store have special markers such as file end markers in the CM file, so as to detect whether the source file system has completed the upload.
[0140] Referring back to the state machine of the source file system, while the source file system is in the Awaiting_Target_Response state 1018, it checks whether the status of the target CP has changed to completed, indicating that all the differences downloaded by the target have been applied and the file data has been stored locally. If the status of the target CP changes to completed, this marks the end of the replication cycle.
[0141] The source side and the target side operate asynchronously. When the source file system completes the replication upload, it sends the X-region_Copied_Done notification to the target control API. Then, when the target file system completes the replication process, the target difference monitor communicates in the reverse direction with the source control API using the X-region DONE notification. The source file system returns to the Ready_to_Copy_Metadata state 1002 and starts another replication cycle.
[0142] FIG. 11 is an exemplary flow diagram showing the exchange of information between a data plane and a control plane within a source region according to an embodiment. The data plane components and the control plane components communicate with each other using a shared database (SDB), such as SDBa 1106. The SDB is a key-value store that both the control plane components and the data plane components can read from and write to. The data plane components include a replicator and a delta generator. The exchange of information between the components within source region A 1101 and target region B 1102 is also shown.
[0143] In FIG. 11, at step S1, the source control plane (CPa) 1103 requests the object store within the target region B (OSb) 1112 to create a bucket. At step S2, the source replicator (REPLICATORa) 1108 periodically updates its heartbeat status in the source SDB (SDBa) 1106. The heartbeat is used to track the progress of the replication executed by the replicator. The heartbeat can use a mechanism called a lease: the heartbeat is renewed each time the replicator works on a job, which lets the control plane see the overall lease information, for example, that the byte count of the job keeps advancing. If the replicator cannot function properly, the heartbeat becomes stale, and another replicator can detect this and take over the remaining work. Therefore, if the system crashes midway, it can restart exactly from the last recorded point based on the checkpoint mechanism. The checkpoint tells the system where the last point of progress was, so that it can continue from that point without re-executing the entire job.
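The lease-style heartbeat described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FSS implementation: a plain dictionary stands in for the shared database, and the names `Lease`, `LEASE_TTL`, `heartbeat`, and `try_take_over` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_TTL = 30.0  # seconds a heartbeat stays fresh (assumed value)


@dataclass
class Lease:
    owner: str             # replicator currently holding the job
    last_heartbeat: float  # time of the most recent heartbeat
    bytes_done: int        # progress recorded together with the heartbeat


sdb = {}  # stand-in for the shared database: job_id -> Lease


def heartbeat(job_id: str, owner: str, bytes_done: int, now: float) -> bool:
    """Renew the lease; return False if another replicator owns the job."""
    lease = sdb.get(job_id)
    if lease is not None and lease.owner != owner:
        return False
    sdb[job_id] = Lease(owner, now, bytes_done)
    return True


def try_take_over(job_id: str, candidate: str, now: float) -> Optional[int]:
    """Take over a job whose heartbeat is stale.

    Returns the recorded progress to resume from, or None if the lease
    is still fresh or the job is unknown.
    """
    lease = sdb.get(job_id)
    if lease is None or now - lease.last_heartbeat <= LEASE_TTL:
        return None
    sdb[job_id] = Lease(candidate, now, lease.bytes_done)
    return lease.bytes_done
```

In this sketch, a second replicator that calls `try_take_over` after the lease has expired becomes the owner and continues from the recorded byte count, while a late heartbeat from the original owner is rejected.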
[0144] In step S3, CPa 1103 further requests the File System Service Workflow (FSW_CPa) 1104 to create snapshots periodically. In step S4, FSW_CPa 1104 notifies CPa 1103 about the new snapshot. In step S5, CPa 1103 stores the snapshot information in SDBa 1106. In step S6, REPLICATORa 1108 polls SDBa 1106 for any changes to the existing snapshots. If a change is detected, in step S7, it retrieves the job specification, and the detected change initiates the replication process. In step S8, REPLICATORa 1108 provides information about two snapshots (SNa and SNb), including the changes between the snapshots, to the delta generator (DGa) 1110. In step S9, REPLICATORa 1108 enters work item information, such as the number of key ranges, into SDBa 1106. In step S10, REPLICATORa 1108 checks the replication job queue in SDBa 1106 to obtain work items. In step S11, it assigns those work items to the delta generator (DGa) 1110, which scans the B-tree keys of the snapshot (i.e., traverses the B-tree) and calculates the differences and the corresponding key-value pairs. In step S12, REPLICATORa 1108 decrypts the file data associated with the identified B-tree keys and packs it together with the key-value pairs into blobs. In step S13, REPLICATORa 1108 encrypts the blobs using the session key and uploads them as objects to OSb 1112. In step S14, REPLICATORa 1108 performs a checkpoint and stores the checkpoint record in SDBa 1106. This replication process (S8 - S14) repeats as a loop until all differences are identified and the data is uploaded to OSb 1112. In step S15, REPLICATORa 1108 then notifies SDBa 1106 of the details of the replication job, and these details are passed to CPa 1103 in step S16 and further relayed to CPb 1114 as the final CM file in step S17. In step S18, CPb 1114 stores the details of the job in SDBb 1116.
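The loop of steps S8 to S14 and its restart behavior can be illustrated with a small sketch. It assumes that work items are key ranges and that a set stands in for the checkpoint records in the shared database; the function name `run_replication` and the `fail_after` parameter (used only to simulate a crash) are hypothetical.

```python
def run_replication(key_ranges, checkpoints, upload, fail_after=None):
    """Process key ranges in order, skipping ranges already checkpointed.

    key_ranges  -- ordered work items (hypothetical key-range labels)
    checkpoints -- set of completed ranges; stands in for SDB checkpoint records
    upload      -- callable standing in for diff + pack + encrypt + upload
    fail_after  -- simulate a crash after this many ranges (testing aid only)
    """
    processed = 0
    for key_range in key_ranges:
        if key_range in checkpoints:
            continue  # finished before an earlier crash; do not redo
        if fail_after is not None and processed == fail_after:
            raise RuntimeError("simulated replicator crash")
        upload(key_range)
        checkpoints.add(key_range)  # checkpoint only after a successful upload
        processed += 1


# Usage: the first run crashes after two ranges; the second run resumes.
uploaded = []
checkpoints = set()
ranges = ["a-f", "g-m", "n-z"]
try:
    run_replication(ranges, checkpoints, uploaded.append, fail_after=2)
except RuntimeError:
    pass
run_replication(ranges, checkpoints, uploaded.append)
```

Because the checkpoint is written only after the upload of a range succeeds, the second run uploads just the remaining range and no range is uploaded twice.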
[0145] The exchange of information between the data plane and the control plane within target region B is similar. At the end of applying the difference to the target file system, the control plane within target region B notifies the control plane within source region A that the snapshot has been successfully applied. Thereby, the control plane within source region A can start over using the new snapshot.
[0146] Authentication is performed for all components. An authentication mechanism based on the replication ID and the file system number governs access by the replicator to the file system key (FSK). The key is given to the replicator only if the replicator provides the appropriate information. Thus, the authentication mechanism can prevent unauthorized parties from obtaining the decryption key. Other security mechanisms include blocking network ports. A component called the file system key server (FSKS) acts as a gatekeeper that validates the requester by checking metadata, such as the job executed by the requester, and other information. For example, assume that a replicator requests the key to a file system. In that case, the FSKS can check whether the replicator is associated with a specific job (e.g., whether the replication is actually associated with that file system) to confirm the validity of the requester.
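The gatekeeper check performed by the FSKS can be sketched as below. This is only an illustration of the idea that a key is released when the replication ID is actually associated with the requested file system; the class, its method, and the in-memory tables are hypothetical and do not reflect the actual FSKS interface.

```python
class KeyRequestDenied(Exception):
    """Raised when a requester is not associated with the file system."""


class FileSystemKeyServer:
    def __init__(self, jobs, keys):
        self._jobs = jobs  # replication ID -> file system number it replicates
        self._keys = keys  # file system number -> key material

    def request_key(self, replication_id, fs_number):
        # Release the key only if this replication job really belongs
        # to the file system whose key is being requested.
        if self._jobs.get(replication_id) != fs_number:
            raise KeyRequestDenied(
                f"replication {replication_id!r} is not associated "
                f"with file system {fs_number!r}"
            )
        return self._keys[fs_number]
```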
[0147] Availability addresses situations where a machine can automatically restart after going down, or where services remain available while software deployment is in progress. For example, since all replicators are stateless, losing a replicator is transparent to the customer because another replicator can pick up and continue the job's work. The job's state is kept not locally, but in a shared database and other reliable locations. The shared database is a service like the one the control plane uses to maintain information about the file system and is based on a B-tree.
[0148] The system has thousands of storage nodes that allow any storage node to perform differential replication, so storage availability in the FSS of this disclosure is high. By using many machines that can take over from each other in case of some failure, the availability of the control plane is high. For example, the progress of replication is not simply blocked by the failure of a single control plane. Thus, there is no single point of failure. Network access availability uses congestion management, including various types of throttling, to prevent source nodes from becoming overloaded.
[0149] The state of replication is written to the shared database. Because the replicator is stateless, replication is made durable by checkpointing. The replication process should also be idempotent. Idempotency here refers to deterministic reapplication: if an operation fails, retrying the same operation, e.g., with the same key, upload process, or scan process, should produce the same outcome.
[0150] Operations across multiple regions should be idempotent. In the control plane, the actions taken should be recorded. For example, when an HTTP request is repeated, an idempotency cache can be useful for remembering that a particular operation has already been performed and that the repeated request is the same operation. In the data plane, for example, when a block is allocated, the block and the file map key of the file system are written together, so if allocation is attempted again, the block can be identified. If the block is sealed, the write operation fails; the idempotency mechanism knows that the block was sealed earlier and that there is no need to retry the write. In yet another example, the idempotency mechanism remembers the chain of steps that need to be executed to process a particular key and value. In other words, the idempotency mechanism enables every operation to be checked to ensure that it is in the correct state, so the system can simply proceed to the next step without repeating completed work.
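The control-plane cache described above, which remembers that a repeated request has already been executed, can be sketched as follows. The class name, the `run_once` method, and the use of a request identifier as the cache key are assumptions made for illustration.

```python
class IdempotencyCache:
    """Remember the result of each request so that a retry is not re-executed."""

    def __init__(self):
        self._results = {}  # request ID -> result of the first execution

    def run_once(self, request_id, operation):
        if request_id in self._results:
            # Same request seen before: return the stored outcome.
            return self._results[request_id]
        result = operation()
        self._results[request_id] = result
        return result


# Usage: the repeated request does not run the operation a second time.
calls = []
cache = IdempotencyCache()


def create_bucket():
    calls.append("create")
    return "bucket-created"


first = cache.run_once("req-42", create_bucket)
second = cache.run_once("req-42", create_bucket)
```

A production cache would also need persistence and expiry, which this sketch leaves out.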
[0151] Atomic replay enables the application of differences to start as soon as the first difference object reaches the object store when a snapshot is rolled back, e.g., when going back from snapshot 10 to snapshot 5. To make the replay atomic, the entire difference needs to be maintained in the object store before the differences can be applied.
[0152] Regarding replication scaling, the FSS of the present disclosure enables adding as many replication machines (e.g., replicator virtual machines ("VMs")) as are required to support many file systems. The number of replicators can be dynamically increased or decreased in view of bandwidth requirements and resource availability. Regarding storage scaling, thousands of storage nodes can be used to parallelize the process and improve throughput. Regarding inter-region bandwidth, the bandwidth allocation is adjusted automatically, for example by detecting an increase in latency and reducing the request rate, to ensure that no workload is overloaded or exceeds its predefined throughput limit. All replicator processes (or threads) have this function.
[0153] Regarding the scaling of checkpoint storage, the uploader and downloader checkpoint their progress to persistent storage, and the shared storage is used as a work queue for dividing up the key ranges. If the checkpoint workload overburdens the shared database, checkpoint storage can be added to the delta generator for scaling purposes. The current workload on the shared database may consume less than 10 IOPS.
End-to-end resumability of inter-region replication
Synchronization between the source region and the target region
[0154] In inter-region replication, the source file system and the target file system operate asynchronously: data is uploaded from the source FS to the object store and downloaded from the object store to the target FS. Further, the upload and download operations are executed by parallel threads that run asynchronously. The techniques of this disclosure attempt to synchronize these asynchronous replication-related operations, such as delta generation, upload, download, delta application, and resource cleanup, by using two sets of states within two state machines. In certain embodiments, an inter-region API and a state machine are used in both the source region and the target region to ensure synchronization between the two regions. The state machine flow described in FIG. 10 pertains to the concurrent upload in the source region and download in the target region during normal inter-region operation. The state machine described in FIG. 10 is referred to as the delta state machine. The delta state machine is an internal FSS structure that is invisible to the customer. It is used for functions such as job ownership between the CPs and DPs of multiple microservices (excluding the object store, which is regarded as a resource) and for jobs related to delta application in both the source file system and the target file system. The other state machine, called the lifecycle state machine, is a resource-level structure that is visible to the customer and covers resource management and resource utilization, including resource creation, deletion, and suspension. All resources in a cloud infrastructure such as Oracle Cloud Infrastructure (OCI) may have the same standard lifecycle states. Only the CP of a region (or file system) maintains the lifecycle state machine.
[0155] FIG. 12 is a flowchart showing a state machine of a control plane of a file system according to an embodiment. In FIG. 12, customer 1202 may issue a request 1240 to create an inter-region replication between a source file system and a target file system, thereby triggering a transition of the lifecycle state to the CREATING state 1204. The FSS may initiate a replication creation process 1242, which includes allocating auxiliary resources within the target region to create the target file system. After the resource allocation within the target region is completed and the FSS is able to execute the inter-region replication, the state becomes the ACTIVE state 1212. In some embodiments, during the replication creation process, if the source file system and the target file system cannot identify a common snapshot as the base snapshot (e.g., 1244), and if the target file system is not empty, the replication creation process may not be able to proceed. The state within the source file system may change from CREATING to FAILED 1210.
[0156] In an embodiment, during the inter-region replication, if some problem occurs during the replication, such as an inter-region connectivity problem (e.g., 1250), the FSS may change to the NEED_ATTENTION state 1214. When the problem is resolved, the state may return to the ACTIVE state 1212.
[0157] In some embodiments, if the replication process stops for some reason (e.g., 1252) and cannot proceed, the lifecycle state can change from the ACTIVE state 1212 to the FAILED state 1210. This can occur in the source region when the target file system deletes the inter-region replication (i.e., replication deletion). Thus, the customer may need to clean up the source file system accordingly, and the source file system can change its state to FAILED. In another embodiment, if the customer invalidates the KMS key within the vault, the source file system may set its state to FAILED.
[0158] In an embodiment, when the customer requests to delete the existing inter-region replication of a file system within a specific region (e.g., 1260), the CP of that file system can transition to the DELETING state 1220. The replication deletion request for the file system in the source region can trigger resource clean-up in both the source region and the target region. When the clean-up is complete (e.g., 1262), depending on the region involved, the lifecycle state of the file system affected by that region can change to the DELETED state 1222. If the source file system is currently in the FAILED state 1210, the customer may request to clean up the resources within the source file system. As a result, the state can change from the FAILED state 1210 to the DELETING 1220, and after the completion of the clean-up, it can change to the DELETED 1222.
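The lifecycle transitions described for FIG. 12 can be summarized as a transition table. The sketch below encodes only the transitions named in the text above; FIG. 12 itself may contain others, and the names `Lifecycle`, `ALLOWED`, and `transition` are hypothetical.

```python
from enum import Enum


class Lifecycle(Enum):
    CREATING = "CREATING"
    ACTIVE = "ACTIVE"
    NEED_ATTENTION = "NEED_ATTENTION"
    FAILED = "FAILED"
    DELETING = "DELETING"
    DELETED = "DELETED"


# Only the transitions described in the text are listed here.
ALLOWED = {
    Lifecycle.CREATING: {Lifecycle.ACTIVE, Lifecycle.FAILED},
    Lifecycle.ACTIVE: {
        Lifecycle.NEED_ATTENTION,
        Lifecycle.FAILED,
        Lifecycle.DELETING,
    },
    Lifecycle.NEED_ATTENTION: {Lifecycle.ACTIVE},
    Lifecycle.FAILED: {Lifecycle.DELETING},
    Lifecycle.DELETING: {Lifecycle.DELETED},
    Lifecycle.DELETED: set(),
}


def transition(current: Lifecycle, new: Lifecycle) -> Lifecycle:
    """Return the new state, or raise if the transition is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

For example, a replication that is created, briefly needs attention, recovers, and is later deleted passes through CREATING, ACTIVE, NEED_ATTENTION, ACTIVE, DELETING, and DELETED, while an attempt to leave DELETED is rejected.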
[0159] Each region has a CP, a DP, checkpoints, and metadata. The CP and DP within a region may need to be synchronized first, which can be done by using a shared database (SDB). As shown in FIG. 11, the source CP (CPa, which may be represented as a control API) and the source DP, including a replicator (REPLICATORa) and a delta generator (DGa), can communicate with each other via the source SDB (SDBa) to synchronize on the progress of replication, for example, the status of delta generation and delta upload. A similar mechanism applies to the target region: the target CP and the target DP can communicate with each other via the target SDB (SDBb) to synchronize on the progress of replication, for example, delta download and delta application.
History ID
[0160] To efficiently identify common snapshots between a source file system and a target file system, a technique called the history ID (also referred to in this disclosure as the provenance ID, ProvID, or PID) can be used. The history ID is a special identifier that uniquely identifies a snapshot across regions, regardless of whether the snapshot is a system snapshot or a user snapshot. If two file systems hold a snapshot with the same history ID, the contents of the two file systems are very similar up to that point: both share a common ancestor, that is, the same known point in time, and the snapshot can be used as the base snapshot for cross-region (or x-region) replication. Because a common base snapshot does not have to be transferred again, the history ID technique saves valuable cloud resources and reduces the network traffic and IO traffic of cross-region replication.
[0161] A snapshot is a point-in-time picture of a file system and is immutable (i.e., write-protected). Two types of copies can be created from a snapshot: clones and replicas. A clone is sometimes referred to as a writable snapshot and is typically created in the same region. Each clone can be written to independently using I/O, and all clones of a snapshot have the same lineage. When a clone is created between two file systems, both file systems share the same copy of the snapshot for reading; a separate copy is created only if one of the file systems needs to write to the clone. A replica is a copy of a snapshot that is created in a different region (i.e., across regions or data centers) by a replication process.
[0162] Replication and cloning can differ in that replication can be achieved by first copying all data from the source region to the target region and then copying the differences between the snapshots. Cloning, on the other hand, copies only the data necessary to create the clone. Intra-region cloning is much faster than inter-region replication because cloning does not involve additional encryption / decryption, object storage transfer, and many stages of the pipeline required for replication. Since clones do not receive further changes after they are created, they capture only point-in-time snapshots.
[0163] In one embodiment, all snapshots may have three pieces of information associated with the snapshot, namely, a snapshot number (snapNum), a provenance ID (ProvID or PID), and a resource ID (e.g., OCID). Since the snapshot consumes resources, the resource ID is a globally unique ID for identifying the resource. The snapshot number is for internal housekeeping use and for tracking purposes within the file system. The provenance ID is for external use and is unique among all snapshots either within or across regions. The provenance ID is set at the time the snapshot is created and does not change when the snapshot is cloned or replicated. These three pieces of information together uniquely identify the history of the snapshot (e.g., the parent-child relationship between all snapshots) and can distinguish the snapshot from other resources within the cloud infrastructure. Additionally, a file system number (FS number) helps track clones within a region and replicas across regions. Across different regions, the provenance ID helps track the history of the snapshot by holding the provenance ID of the original parent snapshot.
[0164] In one embodiment, before replication begins, the source FS and the target FS can compare the history IDs of their respective snapshots to detect matching pairs of snapshots. If a particular pair of snapshots has the same history ID, the source FS and the target FS can start replication from the identified pair without having to transfer the entire copy of the base snapshot from the source FS to the target FS at the start of replication. This saves resources and avoids the traffic associated with the data transfer. For example, assume that a previous replication between the source FS and the target FS replicated snapshots S1 - S100 and then stopped. After a while, the two file systems plan to perform another replication and need to find the starting point for this new replication. Assume that the source FS is already at snapshot S200. In that case, the source FS can trace back from S200 toward S1, comparing the history ID of each snapshot with the history ID of the last snapshot of the target FS (this comparison process is also referred to as a trace in this specification), and may detect that S100 in both the source FS and the target FS is a matching pair. At that point, S100 can be used as the starting point (i.e., the base snapshot) in both the source file system and the target file system for the new replication process. The source FS can calculate the difference between snapshot S100 (i.e., the base snapshot) and snapshot S200 (i.e., the new snapshot) and then transfer those differences to the target FS, and the target FS can apply those differences to S100 to create S200 in the target FS. The source FS does not need to transfer snapshot S100 to the target FS again as a copy of the base snapshot for the replication process to start. This saves a large amount of data transfer and IO.
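The trace in the S1 - S200 example can be sketched as a backward scan over the source snapshots. The function name, the `(snapshot number, identifier)` tuple layout, and the identifier strings are hypothetical; only the backward comparison against the last snapshot of the target FS comes from the text.

```python
def trace_base_snapshot(source_snapshots, target_latest_id):
    """Walk the source snapshots from newest to oldest and return the
    number of the first one whose identifier matches the latest snapshot
    of the target, or None if no common snapshot exists.

    source_snapshots -- list of (snapshot_number, snapshot_id), oldest first
    target_latest_id -- identifier of the target's most recent snapshot
    """
    for snap_num, snap_id in reversed(source_snapshots):
        if snap_id == target_latest_id:
            return snap_num
    return None


# Usage: the source is at S200, the target stopped at S100.
source = [(n, f"ID-{n}") for n in range(1, 201)]
base = trace_base_snapshot(source, "ID-100")
```

With `base` equal to 100, only the difference between S100 and S200 has to be generated and transferred.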
[0165] In some embodiments, the provenance ID can be useful for all file systems within the same region by cloning a snapshot from another file system within the same target region to the target FS when a snapshot replicated from the source region already exists in the target region but does not exist in the target FS. This can be shown in FIG. 13.
[0166] FIG. 13 is a diagram showing an exemplary use of a provenance ID according to an embodiment. In FIG. 13, the FSS creates clones of three snapshots, snapNum 1 / ProvID S1 / OCID S1, snapNum 2 / ProvID S2 / OCID S2, and snapNum 3 / ProvID S3 / OCID S3, of the file system FS1 within the same region 1, such that they become snapshots of the file system FS2, snapNum 1 / ProvID S1 / OCID K1, snapNum 2 / ProvID S2 / OCID K2, and snapNum 3 / ProvID S3 / OCID K3 (step 1310). Further, a new snapshot snapNum 5 / ProvID K5 / OCID K5 is also created in FS2. The clones within FS2 have different resource IDs (S* becomes K*) because they use different resources within the same region. Note that snapshot 4 of FS1 is not cloned.
[0167] Next, the FSS creates replicas of snapshots 1, 2, 3, and 5 of the file system FS2 such that they become snapNum 1 / ProvID S1 / OCID M1, snapNum 2 / ProvID S2 / OCID M2, snapNum 3 / ProvID S3 / OCID M3, and snapNum 5 / ProvID K5 / OCID M5 of the file system FS3 in region 2 (i.e., step 1320). After snapNum 5 is replicated, the replication is then deleted (i.e., step 1322), that is, regions 1 and 2 no longer communicate with each other. Further, snapshots snapNum 6 / ProvID G6 / OCID M6 and snapNum 7 / ProvID G7 / OCID M7 are then created on FS3 in region 2.
[0168] After a while, the FSS attempts to replicate snapshots 1, 2, 3, and 7 of FS3 in region 2 to FS4 in region 1 (i.e., attempts to create replicas at step 1330). Since FS4 (i.e., the target FS) does not exist in region 1 but FS1 (i.e., the non-target FS) already exists in the same region, FS3 in region 2 and FS1 in region 1 compare the history IDs of their snapshots before replication (i.e., step 1340). This comparison may detect that snapshots 1, 2, and 3 of FS3 have the same history IDs (S1, S2, and S3) as snapshots 1, 2, and 3 of FS1 in region 1. Thus, to save resources and network bandwidth, FS1, which is located in the same region 1 as FS4, can first create clones of its snapshots 1, 2, and 3 (snapNum 1 / ProvID S1 / OCID S1, snapNum 2 / ProvID S2 / OCID S2, and snapNum 3 / ProvID S3 / OCID S3), which become snapshots (snapNum 1 / ProvID S1 / OCID P1, snapNum 2 / ProvID S2 / OCID P2, and snapNum 3 / ProvID S3 / OCID P3) of FS4 in the same region 1 and serve as the base copy of the snapshot (i.e., step 1342). Thereafter, FS3 only needs to replicate snapshot 7 (snapNum 7 / ProvID G7 / OCID M7) of FS3 in region 2 to become snapshot 7 (snapNum 7 / ProvID G7 / OCID P4) of FS4 in region 1 by transferring the difference between snapshot 3 (ProvID S3) and snapshot 7 (ProvID G7) (i.e., step 1344). In other words, the inter-region replication of the four snapshots 1, 2, 3, and 7 from FS3 in region 2 to FS4 in region 1 is reduced to three in-region clones of snapshots 1, 2, and 3 between FS1 and FS4 in the same region, plus the inter-region replication of snapshot 7 between FS3 in region 2 and FS4 in region 1. As a result, the use of history IDs saves resources, data transfer traffic (i.e., network traffic or IO traffic), and time.
[0169] FIG. 14 is a flowchart showing a process of identifying a base snapshot for inter-region replication using a history ID according to an embodiment. As shown in FIG. 14, in step 1401, a source FS in a source region may periodically generate a system snapshot or may generate a user snapshot upon a user's request. In step 1402, a unique history ID and other identification information (e.g., a snapshot ID and a resource ID) may be assigned to each snapshot. In step 1404, the source FS may receive a request to perform inter-region replication between the source FS and a target FS due to either a power outage or a planned failover. In step 1408, as described above, in some embodiments, both the source FS in the source region and the file system in the target region compare the history IDs of their respective snapshots to identify a base snapshot (i.e., matching snapshots having the same or matching history IDs) for the purpose of inter-region replication or in response to a request to perform inter-region replication. For example, in FIG. 13, FS3 (i.e., the source FS) in source region 2 compares the history IDs of its snapshots with the history IDs of the snapshots of both the target FS (i.e., FS4) and the non-target FS (i.e., FS1) (i.e., step 1340). In other embodiments, the comparison of the history IDs may first be performed between the source FS and the target FS in the target region. If no match is detected, the source FS may then compare its history IDs with those of the non-target FS in the target region.
[0170] In step 1410, if no matching history ID is detected between the source FS and the file system in the target region, then in step 1412, the inter-region replication process may use the latest snapshot of the source FS as the selected base snapshot. In other words, as shown in step 1420, the source FS may need to transfer a copy of the entire base snapshot (i.e., the selected base snapshot) to the target FS and then perform any necessary differential transfer to the target FS. In step 1410, if a matching history ID is detected between the source FS and the file system in the target region, then in step 1414, the process further determines to which snapshot of either the target FS or the non-target FS in the target region the matching history ID belongs.
[0171] In step 1414, if a matching history ID (i.e., matching snapshots with the same history ID) does not belong to the snapshots of the target FS (i.e., belongs to the snapshots of a non-target FS), in step 1416, the non-target FS may perform in-region cloning of the snapshots with the matching history ID to the target FS to create a base snapshot. Thereafter, at 1420, inter-region replication can use the cloned base snapshot of the target FS as the selected base snapshot. In other words, the source FS can generate the difference between the latest snapshot and the selected base snapshot with the matching history ID, and transfer only this difference to the target FS via the object store. This eliminates the need to transfer a complete copy of the base snapshot. For example, in FIG. 13, the non-target FS1 may clone snapshots S1, S2, and S3 to the target FS4 within the same region 1 (i.e., step 1342). Since the three snapshots (S1, S2, and S3) have matching history IDs, all three snapshots can be used as base snapshots. In one embodiment, the source FS can use the latest of the three snapshots (i.e., S3) as the selected base snapshot to generate the difference between snapshot S3 and G7 for inter-region replication (i.e., step 1344).
[0172] In step 1414, if the matching history ID belongs to the snapshots of the target FS, in step 1418, both the source FS and the target FS use the snapshot with the matching history ID as the selected base snapshot. In step 1420, the source FS can generate the difference between its latest snapshot and the selected base snapshot, and transfer the difference to the target FS for difference application during inter-region replication.
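The decision flow of FIG. 14 (steps 1410 to 1420) can be sketched as follows, using the embodiment in which the target FS is checked first and the non-target FS second. The function names and the action labels are hypothetical; the identifiers in the usage example are the ProvID values given for FIG. 13, and snapshot 4 of FS1 is omitted because its identifier is not stated in the text.

```python
def latest_match(source_ids, candidate_ids):
    """Return the newest source identifier also present in candidate_ids."""
    known = set(candidate_ids)
    for snap_id in reversed(source_ids):  # newest source snapshot first
        if snap_id in known:
            return snap_id
    return None


def select_base(source_ids, target_ids, non_target_ids):
    """Choose the base snapshot; all lists are ordered oldest first.

    Returns (action, base_id), where action is one of the hypothetical
    labels 'use_target_snapshot' (step 1418), 'clone_from_non_target'
    (step 1416), or 'full_copy' (step 1412).
    """
    match = latest_match(source_ids, target_ids)
    if match is not None:
        return ("use_target_snapshot", match)
    match = latest_match(source_ids, non_target_ids)
    if match is not None:
        return ("clone_from_non_target", match)
    return ("full_copy", source_ids[-1])


# Usage with the FIG. 13 identifiers: FS3 is the source, FS4 the (empty)
# target, and FS1 the non-target file system in the target region.
fs3 = ["S1", "S2", "S3", "K5", "G6", "G7"]
fs4 = []
fs1 = ["S1", "S2", "S3"]
decision = select_base(fs3, fs4, fs1)
```

Here the result is an in-region clone from the non-target FS with S3 as the selected base, after which only the difference between S3 and G7 crosses regions.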
[0173] In addition to selecting a base snapshot for inter-region replication, in some embodiments, the provenance ID may also be useful for resumability in the event that a replication fails or is accidentally deleted. For example, as described above, multiple inter-region replications may take place between regions. If one inter-region replication fails partway through, the corresponding source file system and target file system can use the provenance ID to search for a matching snapshot in the target file system or in a non-target file system within the target region, and use it as the base snapshot to resume the inter-region replication. Since the FSS performs replication using incremental deltas, the more easily and quickly the FSS can identify a starting point common to both the source file system and the target file system, the sooner it can resume the replication process and recover from the failure. The provenance ID avoids the need for a complete base copy every time a failure occurs.
Intra-region lock
[0174] To delete or update resources in either the source region or the target region, an in-region locking mechanism is used to ensure that no one else can delete, use, or update resources that are pending deletion or update, thereby preventing corruption. Consider a situation where two different users issue deletion requests simultaneously, one in the source region and one in the target region. Such a situation may cause a race condition in both regions. For example, User 1 issues a deletion request to the source region that can affect the target region. User 2 also issues a deletion request directly to the target region. In one embodiment, the FSS may first resolve which region receives its request first. If the target region receives the deletion request directed to it first, the target region may set an in-region lock on the target resource. The source region, which has received a deletion request from the other user, may forward this request to the target region via the inter-region API, but since the target region is already in the Deleting state of the lifecycle state machine, the source region receives a response (or error message) from the target region indicating that it cannot lock the resource for deletion.
[0175] Region locking can help separate the target region from overlapping requests from the source region to prevent inconsistencies and race conditions. Then, when the target region changes from the Deleting state to the Deleted state after executing a previously received deletion request, the target region notifies the source region of the state change. As a result, the source region may receive two notifications from the target region for the purpose of inter-region synchronization. One notification (i.e., an error message) indicates that the deletion request from the source region cannot be executed in the target region. Another notification indicates that the target region has completed the deletion request. These notifications help the source region understand potential race conditions resulting from overlapping requests, and the source region can continue with the appropriate process.
[0176] When the target region first receives a deletion request from the source region, a region lock can be set on the target region to prevent another overlapping request based on this request. After the target region executes the deletion request initiated by the source region, it can notify the source region of the completion of the target request as described above regarding inter-region synchronization. In some embodiments, regardless of whether the target region first receives a request from User 1 or User 2, after the source region receives a request from User 1, it changes the state of the CP from the Active state to the Deleting state and may remain in the Deleting state until it receives an inter-region synchronization notification from the target region. After receiving this notification, the source region changes the state of the CP from the Deleting state to the Deleted state and may notify User 1 of the completion of the request. Since the databases of the source region and the target region may exist in different regions that are far apart, the inter-region synchronization process may take some time depending on the IO traffic or network traffic.
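The overlapping deletion requests described above can be illustrated with a small sketch of the target side. The class, its methods, and the returned strings are hypothetical; the sketch only shows that the first request takes the in-region lock, the overlapping request is refused, and a completion notification follows.

```python
class TargetRegion:
    """Target-side view of an in-region lock on a replication resource."""

    def __init__(self):
        self.state = "ACTIVE"
        self.lock_owner = None  # requester currently holding the lock

    def request_delete(self, requester):
        if self.lock_owner is not None:
            # Resource is already pending deletion: refuse the overlap.
            return f"error: resource locked for deletion by {self.lock_owner}"
        self.lock_owner = requester
        self.state = "DELETING"
        return "accepted"

    def finish_delete(self):
        # Clean-up done; this result stands in for the notification
        # that the target region sends to the source region.
        self.state = "DELETED"
        return "deleted"


# Usage: User 2 reaches the target first; the request forwarded by the
# source region on behalf of User 1 arrives while the lock is held.
target = TargetRegion()
direct = target.request_delete("user-2")
forwarded = target.request_delete("source-region")
done = target.finish_delete()
```

The source region thus sees two messages, the error for its forwarded request and the later completion notification, matching the two notifications described above.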
Replication Creation
[0177] FIG. 15 is a flowchart showing a replication creation process according to an embodiment. To create a replication that can enable a file system within a source region (referred to as the source file system) to participate in inter-region replication with a file system within a target region (referred to as the target file system), the FSS may need to allocate auxiliary resources (or objects) within the target region. The auxiliary object can be information within the target file system that needs to be synchronized with any changes within the source file system, such as the last applied snapshot, the name of the resource, etc.
[0178] In one embodiment, at step S1 of FIG. 15, the customer may issue a replication creation request to the source control API 1510. At step S2, the source identity and key management service (KMS) related components 1516 may verify the user's permissions and security validity. At step S3, the source SDB 1514 may check whether the customer's request contains a tag indicating that the same request has been received previously, so as to prevent duplicate execution. If so, the FSS may return the status of the previous request. At step S4, after both the user's identity and security have been verified, the control API may return a success status and change its lifecycle state to the CREATING state.
[0179] In certain embodiments, in some of the following steps, the source region and the target region may attempt to identify a common snapshot that both regions can use as a starting point for inter-region replication. In step S5, the source control API 1510 may communicate with the target control API 1530 via an inter-region API call to obtain the history ID of the latest snapshot in the target file system. The target control API may confirm that the target file system has not been previously exported for reading or writing by anything other than the source file system. Otherwise, replication creation between the source file system and the target file system may become unreliable. In step S6, the target control API may check with the target KMS for security purposes and verify the validity of the source KMS key. In step S7, if the target file system successfully passes the security validation for the source request, it may return the requested history ID of the latest snapshot to the source control API. In step S8, the source control API may check the history ID information in the source SDB to confirm whether the source SDB contains a snapshot with the same history ID from the target file system. Based on the result of this detection, the FSS may proceed to either step S9 or S11.
[0180] If the target history ID does not exist in the source file system, the FSS may proceed to step S9, and the source control API 1510 may notify the customer regarding this failure. In step S10, the source control API may set the lifecycle state to FAILED in the source SDB. As a result, the FSS may need to perform a copy of the base snapshot from the source FS to the target FS.
[0181] When the source file system identifies a snapshot having the same history ID as the one returned by the target file system, the FSS may proceed to step S11 and notify the target control API 1530 to create an auxiliary replication object in the target region. In step S12, the target control API may place a job for creating the auxiliary object in the job queue within the target SDB. In step S13, the target control API may respond to the source control API 1510 with resource identification information that references the newly created auxiliary object in the target region. In step S14, the source control API may change the lifecycle state to ACTIVE in the source SDB.
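The decision logic of steps S3 through S14 can be summarized in a short sketch. The function name, its parameters, and the in-memory dictionary standing in for the source SDB are assumptions made for illustration. Identity and KMS checks (S2, S6) are omitted.

```python
from typing import Dict, Optional, Set


def create_replication(
    request_token: str,
    previous_requests: Dict[str, str],
    source_history_ids: Set[str],
    target_latest_history_id: Optional[str],
) -> str:
    """Return the resulting lifecycle state for a replication creation request."""
    # S3: a repeated request token returns the status of the earlier request.
    if request_token in previous_requests:
        return previous_requests[request_token]

    # S4: the request is accepted and the replication enters CREATING.
    state = "CREATING"

    # S8: look for the target's latest history ID among the source snapshots.
    if target_latest_history_id in source_history_ids:
        # S11-S14: the auxiliary object is created and the source goes ACTIVE.
        state = "ACTIVE"
    else:
        # S9-S10: no common snapshot, so the request is marked FAILED.
        state = "FAILED"

    previous_requests[request_token] = state
    return state


seen: Dict[str, str] = {}
ok = create_replication("token-1", seen, {"h1", "h2"}, "h2")
dup = create_replication("token-1", seen, {"h1"}, "h9")   # duplicate token
bad = create_replication("token-2", seen, {"h1", "h2"}, "h7")
```

Here `dup` returns the stored result of the first request rather than re-evaluating it, which mirrors the duplicate-execution guard in step S3.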
Replication Deletion
[0182] Customers may want to move inter-region replication to a different source region, a different target region, or both. To do so, the customer may need to end the current replication and start a new inter-region replication. The FSS can perform replication deletion for this purpose. On the other hand, if a failure occurs in the current replication and the failure is persistent, such as when the differential application job stops in the target region and keeps retrying for a long time, the FSS may mark such a failure as a system failure that requires the operator's attention to identify the root cause. Another possible persistent failure is a failed source region. When the persistent failure is resolved, the file system may also need to perform replication deletion so that a new replication can begin from a clean start.
[0183] The replication deletion process can be initiated from either the source FS in the source region (which may also be referred to herein as an end request starting from the source) or the target FS in the target region (which may also be referred to herein as an end request starting from the target). For example, initiating replication deletion from the source region may be suitable when a permanent failure occurs in the target region (e.g., the differential application job has stopped). Initiating replication deletion from the target region may be suitable when a permanent failure occurs in the source region (e.g., the source region is not responding).
[0184] As previously described with respect to the differential state machine, in some embodiments, the replication deletion process may have other differential states in addition to the states described in FIG. 10. The additional differential states may include, but are not limited to, the ABORT_COPY state and the SNAPSHOT_METADATA_DELETE state for the source region only, the ABORT_REPLICATION state for the target region only, and the TERMINATE state for both the source region and the target region.
[0185] The differential state ABORT_COPY may indicate that the source CP of the source file system attempts to stop the differential generation and upload process, followed by resource cleanup. The SNAPSHOT_METADATA_DELETE state may indicate that the source file system in the source region attempts to delete the snapshot metadata stored in the source SDB and the object storage. The ABORT_REPLICATION state may indicate that the target CP of the target region attempts to stop the differential application and download process, followed by resource cleanup.
[0186] Finally, the differential state TERMINATE is used in either the source region or the target region and may then trigger an inter-region call to the other region. In the case of the source region, the TERMINATE state may indicate that the source file system attempts to delete an unused system snapshot that was recently created and then notifies the target file system to change its state accordingly (e.g., change the lifecycle state from DELETING to DELETED). In the case of the target region, the TERMINATE state may indicate that the target region attempts to end the replication process by converting the last snapshot to a user snapshot if the last snapshot is a system snapshot, then performs a cleanup (e.g., of the contents of various job / processing queues and of entries in the differential monitor queue (DMQ)), and then notifies the source file system to change its state accordingly (e.g., change the lifecycle state from the ACTIVE state to the FAILED state). System snapshots are periodically generated by the FSS and cannot be deleted by the customer, whereas user snapshots are created by the user and can be deleted at any time.
[0187] The differential states ABORT_REPLICATION and TERMINATE can perform similar cleanup functions within the target region, but the different differential state names help the source region distinguish whether the target cleanup operation was initiated (or caused) by the source region or the target region. The differential state ABORT_REPLICATION within the target region is used for replication deletion initiated at the source and can trigger the source region to change the source's final lifecycle state to the DELETED state. On the other hand, the differential state TERMINATE within the target region is used for replication deletion initiated at the target and can trigger the source region to change the source's final lifecycle state to the FAILED state. In other words, replication deletion (or termination request) initiated at the source and replication deletion initiated at the target can use the same set of lifecycle states (sometimes called the first set of states), but can use different subsets of differential states (sometimes called the second set of states). For example, subsets of differential states such as ABORT_COPY and ABORT_REPLICATION are used for replication deletion initiated at the source, while another subset of differential states such as TERMINATE is used for replication deletion initiated at the target.
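The relationship between the two sets of states can be made concrete with a small sketch. The state names come from the description above, while the enum classes and the helper function are hypothetical and only illustrate how the target's differential state selects the source's final lifecycle state.

```python
from enum import Enum


class Lifecycle(Enum):
    """First set of states, shared by both deletion paths."""
    CREATING = "CREATING"
    ACTIVE = "ACTIVE"
    DELETING = "DELETING"
    DELETED = "DELETED"
    FAILED = "FAILED"


class Differential(Enum):
    """Second set of states used during replication deletion."""
    ABORT_COPY = "ABORT_COPY"                              # source only
    SNAPSHOT_METADATA_DELETE = "SNAPSHOT_METADATA_DELETE"  # source only
    ABORT_REPLICATION = "ABORT_REPLICATION"                # target only
    TERMINATE = "TERMINATE"                                # source and target


def source_final_lifecycle(target_state: Differential) -> Lifecycle:
    """Map the target's cleanup state to the source's final lifecycle state."""
    if target_state is Differential.ABORT_REPLICATION:
        # Deletion was initiated at the source.
        return Lifecycle.DELETED
    if target_state is Differential.TERMINATE:
        # Deletion was initiated at the target.
        return Lifecycle.FAILED
    raise ValueError(f"{target_state.name} is not a target cleanup state")
```

Keeping the two cleanup states distinct, even though they trigger similar work in the target region, is what allows the source side to tell the two deletion paths apart.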
[0188] FIG. 16 is a flowchart showing a replication deletion process starting from a source according to an embodiment. The components involved in the replication deletion process started by the source region are the control API 1610, the differential monitor 1612, and the replicator 1614 within the source region, as well as the control API 1624, the differential monitor 1622, and the replicator 1620 within the target region. As described above, the control APIs (1610 and 1624) can be a set of hosts within the CP that are responsible for transmitting state information between different regions. The differential monitors (DMs 1612 and 1622) can be threads in the control plane API (control API) service that are periodically launched to monitor the progress of replication, including scheduled snapshot creation and replication jobs. The DM also records snapshot metadata such as name, status, and tags. Each DM also has a differential monitor queue (DMQ), and each replicator thread can work on one DMQ entry at a time. The DMQ entries can be cleaned up at the end of the cleanup process in either the source file system or the target file system. The replicators (1614 and 1620), which include the differential generator, can be responsible for generating and uploading differences within the source region and for downloading and applying differences within the target region.
[0189] In step S1 of FIG. 16, depending on the current stage of inter-region replication, the target FS may be in an idle state (before the source FS has uploaded any differences), or it may be downloading the manifest file and differences from the object store and applying the differences. In step S2, the customer may request replication deletion from the source control API 1610. In step S3, the source control API host receives from the customer a request to delete the existing replication process. After verification of validity, the source CP may set its lifecycle state to DELETING and return a response to the customer (not shown in FIG. 16). The source CP (including the control API) may send a cross-region (x-region) request to the target CP (including the control API) to abort the current replication (i.e., stop applying differences). If the target FS is applying differences, the target FS may wait until the current differential application is complete and then act on the replication deletion request.
[0190] In step S4, the target CP may change its internal state to the Abort_Replication state and notify the target differential monitor to stop replication. In step S5, when the target replicator detects a state change via the target differential monitor, it may perform some cleanup within the target region, such as records related to the purpose of the checkpoint, file data associated with B-tree keys, and the contents of various job / processing queues used for differential application. In an alternative embodiment, after each checkpoint, blob cleanup is performed asynchronously at regular intervals, which can shorten the time for future cleanup if necessary. The target replicator may clean up blobs stored in an object store, for example, an object storage path that stores objects uploaded by the source FS for all key ranges. In step S6, after the cleanup including the cleanup of the DMQ in the differential monitor is successfully completed, the target replicator may notify the differential monitor existing in the target SDB to set the differential state to the Abort_Replication_Done state. The lifecycle state may be changed to the DELETED state.
[0191] In step S7, the target CP (including both the differential monitor 1622 and the control API 1624) can then notify the source CP of the target's cleanup status via an inter-region API call, so that the source file system can perform its own cleanup, starting with snapshot metadata such as the history ID, snapshot type, and snapshot time. In step S8, the control API sets the differential state in the source differential monitor to the Snapshot_Metadata_Delete state. The source CP may then delete the snapshot metadata. In step S9, the source differential monitor can call a workflow for deleting replication snapshots, i.e., a cleanup task that deletes replication snapshots that are no longer needed at the end of a replication cycle (i.e., after differential generation in the source FS and differential application in the target FS have completed). In step S10, the workflow may delete the snapshots and their associated metadata.
[0192] In step S11, after the source region has completed deleting the metadata, the source control API may change the differential state in the source differential monitor existing in the source SDB to Abort_Copy. In step S12, when the source replicator detects the change in the differential state to Abort_Copy, the source replicator performs a cleanup of the records related to the purpose of the checkpoint, the snapshot including the file data associated with the B-tree key, and the contents of various job / processing queues used for differential generation. In step S13, after the source replicator has successfully completed the cleanup, it may notify the source CP by changing the differential state in the differential monitor to Abort_Copy_Done and changing the lifecycle state to DELETED. In step S14, the differential monitor may record in the source SDB that the requested cleanup transaction has been completed. The cleanup in both the source FS and the target FS, as well as in the object store, is important to prevent data corruption of the file system when starting a new replication.
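The state changes in steps S3 through S13 of FIG. 16 can be laid out as an ordered trace. The sketch below is illustrative only: the tuple layout and function name are assumptions, the initial ACTIVE lifecycle of the target is assumed, and `UNSET` is a placeholder for whatever differential state precedes deletion.

```python
from typing import Dict, List, Tuple

# (step, region, kind of state, new value)
Transition = Tuple[str, str, str, str]


def source_initiated_deletion() -> Tuple[List[Transition], Dict[str, Dict[str, str]]]:
    """Replay the state changes of a source-initiated replication deletion."""
    regions = {
        "source": {"lifecycle": "ACTIVE", "differential": "UNSET"},
        "target": {"lifecycle": "ACTIVE", "differential": "UNSET"},
    }
    plan: List[Transition] = [
        ("S3", "source", "lifecycle", "DELETING"),
        ("S4", "target", "differential", "Abort_Replication"),
        ("S6", "target", "differential", "Abort_Replication_Done"),
        ("S6", "target", "lifecycle", "DELETED"),
        ("S8", "source", "differential", "Snapshot_Metadata_Delete"),
        ("S11", "source", "differential", "Abort_Copy"),
        ("S13", "source", "differential", "Abort_Copy_Done"),
        ("S13", "source", "lifecycle", "DELETED"),
    ]
    # Apply each transition in order, as the two control planes would.
    for _step, region, kind, value in plan:
        regions[region][kind] = value
    return plan, regions


plan, final_states = source_initiated_deletion()
```

The trace makes the ordering visible: in this sequential variant the target reaches DELETED before the source begins its metadata and copy cleanup.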
[0193] Figure 16 shows that the source FS does not start its cleanup process until the target's cleanup process is completed, but the source's cleanup process can be executed in parallel with or simultaneously with the target's cleanup process depending on the stage of the existing inter-region replication process. In other words, both the source FS and the target FS can be aborted simultaneously after receiving the customer's request to delete the current replication. There may be three possible situations (or stages) regarding the interaction between the source FS and the target FS at the time of receiving the customer's request in the inter-region replication process: (1) both the source FS and the target FS are in an idle state, (2) the source FS is executing difference generation and the target FS is in an idle state, and (3) the source FS is executing difference generation and the target FS is executing difference application.
[0194] In the first situation, when both file systems are in an idle state, after receiving a customer request for replication deletion, the source FS may request the target FS to abort the replication process, and then the source FS may abort it immediately thereafter. Both the source FS and the target FS can execute their respective cleanup processes either simultaneously or in parallel. In the second situation, when the source FS is performing differential generation but the target FS is in an idle state, after the source FS notifies the target FS to abort the replication process, the source FS may abort after the replication process reaches a safe point (e.g., a checkpoint has been completed). The cleanup processes of both the source FS and the target FS may overlap. In the third situation, when both file systems are performing replication, the target FS does not have to abort until it has completed applying the differential. As shown in FIG. 16, the source FS may wait for a notification from the target file system regarding the completion of the target's cleanup process and then abort.
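The three situations reduce to a small decision function. The following sketch uses hypothetical names and return strings. The fourth combination (source idle while the target is applying differences) is not covered by the description, so the sketch rejects it instead of guessing.

```python
def source_abort_timing(source_generating: bool, target_applying: bool) -> str:
    """Decide when the source FS may abort after a deletion request."""
    if not source_generating and not target_applying:
        # Situation 1: both idle, so cleanups can run in parallel.
        return "immediately after notifying the target"
    if source_generating and not target_applying:
        # Situation 2: source waits for a safe point such as a finished checkpoint.
        return "at the next safe point"
    if source_generating and target_applying:
        # Situation 3: source waits for the target's cleanup notification.
        return "after the target reports cleanup completion"
    raise ValueError("combination not described: source idle, target applying")
```

For example, `source_abort_timing(True, False)` selects the safe-point behavior of the second situation.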
[0195] After the replication deletion process is complete, the customer may need to request replication creation in order to start a new inter-region replication after the customer has selected valid source and target file systems, as described in FIG. 15. The valid source and target file systems may be the same as the original source and target file systems or may be different from either the original source file system or the original target file system. The source FS within the source region may create a system snapshot and follow the parallel-mode state machine for the new replication process described in FIG. 10. During the differential state Reconciling (1052 in FIG. 10), both the source region and the target region may cross-check and identify any common history IDs and adjust the snapshots to start a new replication process.
[0196] Figure 17 is a flowchart showing a replication deletion process started by a target. As described above, the replication deletion process may be started by a target region. The components used in the replication deletion process started by a source are also applicable to the replication deletion process started by a target. In some embodiments, the replication deletion process started by a target can be executed only when the target FS is in an idle state. When a customer requests replication deletion on the target FS and the target FS is performing differential application, the target FS can respond to the customer using a contention signal.
[0197] In step S1 of FIG. 17, the customer may send a replication deletion request to the target control API 1724 when the target FS is in an idle state. In step S2, the control API may set the lifecycle state to DELETING and the differential state to TERMINATE, and this differential state is reflected in the target differential monitor. The TERMINATE differential state may trigger the target file system to end its replication process and may convert the last snapshot to a user snapshot for deletion. In step S3, when the target replicator detects a state change via the target differential monitor, it may perform a cleanup within the target region, such as records related to the purpose of the checkpoint, file data associated with the B-tree keys, and the contents of various job / processing queues used for differential application. In step S6, after the cleanup including the cleanup of the DMQ in the differential monitor is successfully completed, the target replicator may notify the differential monitor present in the target SDB to set the lifecycle state to DELETED. The differential state remains in the TERMINATE state.
[0198] In step S7, since the target file system may no longer be able to download the differences uploaded by the source file system, the target differential monitor of the target CP can notify the source differential monitor of the source CP, via an inter-region API call, of the change in the status of the target file system and of the fact that the target file system has ended its replication process, so that the replication process within the source file system can be marked as failed. At this point, the source lifecycle state may still be ACTIVE. In step S8, the source FS may delete the previous replication snapshot and all unused replication snapshots. In step S9, the source CP (DM and control API) changes its differential state to TERMINATE and its lifecycle state to FAILED. This reflects the fact that the replication process within the source FS may fail once the target FS is detached (i.e., its resources are cleaned up and differences from the source FS can no longer be received). This change in the lifecycle state within the source FS can alert the customer who owns the source file system.
[0199] Referring to FIG. 17 again, at step S10, after the customer receives a notification indicating that the life cycle state has changed to the FAILED state from the source FS, the customer may need to issue a deletion request to clean up the resources in the source FS. At step S11, the source control API may change the life cycle state of the source in the differential monitor to DELETING. At step S12, when the source replicator detects the change in the life cycle state to DELETING, the source replicator performs cleanup on the records related to the purpose of the checkpoint, the snapshot including the file data associated with the B-tree key, and the content of various job / processing queues used for differential generation. At step S13, after the source replicator completes the cleanup, the source replicator may notify the source CP by changing the life cycle state to DELETED and the differential state to TERMINATED, thereby indicating that this process is a replication deletion process that is started at the target.
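The target-initiated flow of FIG. 17 can be traced in the same way. In this sketch the `ConflictError` class, the function name, and the history format are hypothetical. The initial ACTIVE states are assumed, and the idle precondition follows the embodiment in which deletion from the target is allowed only when the target FS is idle.

```python
from typing import List, Tuple


class ConflictError(Exception):
    """Raised when the target FS is busy applying differences (hypothetical)."""


def target_initiated_deletion(target_idle: bool) -> List[Tuple[str, str, str]]:
    """Return (step, source lifecycle, target lifecycle) after each phase."""
    if not target_idle:
        raise ConflictError("target FS is applying differences")

    source, target = "ACTIVE", "ACTIVE"
    history: List[Tuple[str, str, str]] = []

    target = "DELETING"   # S2: target differential state becomes TERMINATE
    history.append(("S2", source, target))
    target = "DELETED"    # S3-S6: target cleanup completes
    history.append(("S6", source, target))
    source = "FAILED"     # S7-S9: source is notified and marked FAILED
    history.append(("S9", source, target))
    source = "DELETING"   # S10-S11: customer issues a delete on the source
    history.append(("S11", source, target))
    source = "DELETED"    # S12-S13: source cleanup completes
    history.append(("S13", source, target))
    return history


trace = target_initiated_deletion(target_idle=True)
```

Unlike the source-initiated path, the source passes through FAILED before DELETED, and a second customer request is needed to finish the source-side cleanup.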
[0200] After the replication deletion process is completed, the customer may need to request replication creation in order to start a new inter-region replication after the customer selects a valid source file system and a target file system, as described in FIG. 15. The source FS in the source region may create a system snapshot and follow the parallel-mode state machine for the new replication process described in FIG. 10.
[0201] Figure 18 is a flowchart showing a high-level process flow of replication deletion according to an embodiment. In step 1810 of FIG. 18, the FSS may receive a request for inter-region file system replication between a source file system and a target file system, where the source file system and the target file system are in different regions. In step 1812, the FSS performs the requested inter-region replication between the source file system and the target file system. In step 1820, the FSS may receive a request to end the current inter-region replication between the source file system and the target file system and then resume using a new inter-region replication. In some embodiments, the request to end the current inter-region replication (i.e., to start the replication deletion process) may be sent / issued to or received by either the source file system or the target file system, but not both. In other embodiments, if two requests to end the current inter-region replication are issued to both the source file system and the target file system respectively, the file system that received the request first may acquire and set an in-region lock so that only one file system can start the replication deletion process.
[0202] In step 1822, both the source FS and the target FS may synchronize the operations of these FSs by using at least two sets of states belonging to two or more state machines respectively. For example, as described above, one set of states may be a lifecycle state (or the first set of states), and another set of states may be a differential state (or the second set of states). These operations may include, but are not limited to, resource cleanup in both the source FS and the target FS.
[0203] In step 1824, each file system may perform resource cleanup within the region depending on whether the replication deletion process is initiated by the source FS or the target FS. For example, the cleanup sequence may vary depending on whether the process starts at the source or the target. In a replication deletion process starting at the source, the cleanup operations may be performed in parallel or simultaneously for both the source file system and the target file system. However, in a replication deletion process starting at the target, the cleanup operations within the target FS may be performed before the cleanup operations within the source FS. The cleanup operations within the source file system may include deleting checkpoint records, file data, the contents within various processing queues used for differential generation, and metadata, as described above in relation to FIGS. 16 and 17. The cleanup operations within the target file system may include deleting checkpoint records, file data, the contents within various processing queues used for differential application, and metadata, as described above in relation to FIGS. 16 and 17.
[0204] After the replication deletion process is complete, in step 1826, the customer may request to start a new inter-region replication between the source FS and the target FS, or between different pairs of the source FS and the target FS, as described above in relation to FIGS. 15, 16, and 17.
Resumption of a Prior Snapshot of Replication
[0205] The customer may want to resume inter-region replication between the source file system and the target file system without a full resource cleanup in either the source region or the target region, resuming instead from a previous (or prior) common snapshot that has the same history ID. This resumption process without a full resource cleanup is sometimes referred to as resuming a replication's prior snapshot. In some embodiments, the prior-snapshot resume process may keep the data flow direction of the current replication. In other embodiments, it may reverse the data flow direction of the current replication.
[0206] Resuming a replication's prior snapshot using the same data flow may be initiated by either an operator or a customer. An operator-initiated resumption may occur if a software bug causes problems in the current snapshot of the replication process (i.e., the snapshot for which differential generation and differential application are being performed), or if a customer error occurs, such as accidentally invalidating a KMS key. If a customer accidentally invalidates one or more KMS keys, the source FS or target FS associated with those keys may become unusable for reading or writing. Subsequently, the replication process may not be able to proceed even after retries, and the lifecycle state may change from ACTIVE to FAILED, alerting the customer. In both situations (i.e., software bugs or customer errors), the operator may need to discard the current snapshot of the replication and resume from a previous suitable snapshot that has been replicated successfully.
[0207] On the other hand, a resumption initiated by a customer using the same data flow can occur at any time when the customer desires to resume from a previous snapshot. For example, a customer may accidentally delete an application or introduce a software bug that corrupts the current snapshot, and may need to find a suitable earlier snapshot and continue transferring content from it without going through a new inter-region replication again with the same source FS and target FS (i.e., without copying the base snapshot from the source FS to the target FS). Resuming from a previous snapshot of replication using the same data flow helps the customer save substantial resources (e.g., bandwidth, computing power) and costs.
[0208] Regarding the resumption of a previous snapshot of replication using a reverse data flow, a customer may desire to reuse the original source file system after an inter-region replication (i.e., a failover to the target file system) between the source file system and the target file system. For example, after a power outage, the original source file system (primary site) may be stopped for only a short period of time, or the operating cost of the original source file system may be lower than that of the target file system (secondary site). These can be possible reasons for the customer to return to the original source file system and use the source region as the primary region.
[0209] The resumption of a previous snapshot of replication using a reverse data flow is sometimes called the failback mode. There are two options: the last point in time within the source file system before the failover trigger event, or the latest change within the target file system. These two options will be explained in more detail later.
[0210] FIG. 19 is a flowchart showing a high-level process flow of a replication prior snapshot resume process that uses the same data flow as existing inter-region replication according to an embodiment. At step 1910, a source FS in a source region and a target FS in a target region may perform inter-region replication. At step 1912, the source FS may receive a request from either an operator or a customer to resume the current inter-region replication from a previous common snapshot and continue in the same data flow direction.
[0211] At step 1920, the source FS and the target FS may compare the history IDs of the snapshots in the source FS with those of the snapshots in the target FS to detect matching history IDs. This comparison may start from the latest snapshot in the target FS and go back to older snapshots. If no matching history IDs are detected between the snapshots of the source file system and the target file system, the process may proceed to step 1924 and the resume process may be aborted. In that case, the target FS may be in one of two situations: a non-empty target FS or an empty target FS. If the target FS is non-empty, i.e., the target FS may have been cloned from another file system, copying anything from the source FS to the target FS may be risky because it could damage the target FS. In this situation, replication deletion may be a better option. If the target FS is empty, the source FS may need to copy the base snapshot to the target FS. This copy of the base snapshot may take much longer than a simple prior snapshot resume process.
[0212] If one or more matching history IDs are detected between snapshots of both the source file system and the target file system, the process may proceed to step 1926. At step 1926, both file systems may continue current inter-region replication using the latest snapshot of the matching history ID in the target FS as the base snapshot.
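The search in steps 1920 through 1926 amounts to walking the target's snapshots from newest to oldest and stopping at the first history ID the source also holds. A minimal sketch follows, assuming history IDs are plain strings and that the target list is ordered oldest first. The function name is hypothetical.

```python
from typing import Iterable, Optional, Sequence


def latest_common_history_id(
    source_history_ids: Iterable[str],
    target_history_ids_oldest_first: Sequence[str],
) -> Optional[str]:
    """Return the newest target history ID also present in the source, if any."""
    source_ids = set(source_history_ids)
    # Start from the latest target snapshot and go back to older ones.
    for history_id in reversed(target_history_ids_oldest_first):
        if history_id in source_ids:
            return history_id  # step 1926: resume from this base snapshot
    return None                # step 1924: no match, abort the resume


base = latest_common_history_id(["h1", "h2", "h3", "h5"], ["h1", "h2", "h3", "h4"])
none_found = latest_common_history_id(["h8"], ["h1", "h2"])
```

A result of `None` corresponds to the abort path, where the choice between replication deletion and a base snapshot copy depends on whether the target FS is empty.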
[0213] Regarding resuming a prior snapshot of replication using reverse data flow, FIG. 20 is a schematic diagram showing a failback mode according to an embodiment. The failback mode restores the original primary (source) side to its role as the primary (source), as it was before the failover. As shown in FIG. 20, the primary availability domain (AD) 2002 includes the source file system 2006, and the secondary AD 2004 includes the target file system 2008. The secondary AD 2004 may exist in the same region as the primary AD 2002 or in a different region.
[0214] In FIG. 20, snapshot 1 2020 and snapshot 2 2022 in the source file system 2006 exist before the failover due to a power outage event. Similarly, snapshot 1 2040 and snapshot 2 2042 in the target file system 2008 exist before the failover. When a power outage occurs at snapshot 3 2024 in the primary AD 2002, the FSS performs an unplanned failover 2050, and snapshot 3 2024 in the source file system 2006 is replicated to the target file system 2008 to become a new snapshot 3 2044 (i.e., a replica of snapshot 3 2024). After the target file system 2008 starts operating, the customer can make changes to create snapshot 4 2046 for the target file system 2008.
[0215] If the customer decides to reuse the source file system again, the FSS service may perform a failback. The customer has two options: (1) a failback by using the last point in time within the source file system before trigger event 2051, or (2) a failback with reverse replication by using the latest changes within the target file system 2052.
[0216] In the case of the first option (failback only), the user can resume from the last point in time within the source file system 2006 prior to the trigger event (i.e., snapshot 3 2024). In other words, snapshot 3 2024 becomes the snapshot for use after the failback because it was previously successfully failed over to the target file system 2008. To execute the failback 2051, the state of the source file system 2006 is changed to inaccessible. Next, the FSS service identifies the last point in time within the source file system 2006, snapshot 3 2024, before the failover 2050 succeeded. A successful failover may refer to, for example, completing the differential generation within the source FS based on source snapshot 3 2024 and source snapshot 2 2022, and completing the differential application within the target FS to create a replica of source snapshot 3 2024 (i.e., target snapshot 3 2044). The FSS may execute a clone (i.e., replication within the same region) of source snapshot 3 2024 within the primary AD 2002. Now, the primary AD 2002 returns to its initial settings before the power outage, and the user can reuse the source file system 2006. Since snapshot 3 2024 already exists in the file system being used, no data transfer from the secondary AD 2004 to the primary AD 2002 is required.
[0217] In the case of the second option (failback with reverse replication), the user wants to reuse the source file system 2006 with the latest changes in the target file system 2008. In other words, since target snapshot 4 2046 contains the latest changes in the target file system 2008, it becomes the snapshot for use after the failback. The failback process 2052 for this option includes reverse replication (i.e., reversing the roles of the source and target file systems for the replication process), and the FSS performs the following steps. Step 1. The state of the source file system 2006 is changed to inaccessible. Step 2. The FSS service identifies the latest successfully replicated snapshot in the target file system 2008, e.g., target snapshot 3 2044. Step 3. The FSS service also locates the corresponding source snapshot 3 2024 in the source file system 2006 and performs a clone (i.e., a replication within the same region). Step 4. The FSS service starts the reverse replication 2052 with a region-to-region replication process similar to that described in relation to FIG. 4, but in the reverse direction. In other words, the source file system 2006 and the target file system 2008 first need to be synchronized, and then the target file system 2008 can upload the differences to the object store within the primary AD 2002 (i.e., the original source region). The source file system 2006 can download the differences from the object store and apply them to source snapshot 3 2024, creating a new source snapshot 4 2026 that is a replica of target snapshot 4 2046 in the target file system 2008.
[0218] At this point, the primary AD 2002 has returned to its initial settings before the power outage, and the user can reuse the source file system 2006 without transferring data that already exists in both the source file system 2006 and the target file system 2008, for example, snapshots 1 to 3 (2020 to 2024) in the source file system 2006. This saves time and avoids unnecessary bandwidth consumption.
[0219] FIG. 21 is a flowchart showing a process flow in a failback mode according to an embodiment. In step 2110 of FIG. 21, the FSS may receive a customer's request to reuse the source FS as the primary region (i.e., part of the primary AD) after failover to the target FS (i.e., inter-region replication). In step 2112, the FSS may determine which of the two options, failback only or failback with reverse replication, is specified in the customer's request. If the request is for failback only, the process may proceed to step 2120. If the request is for failback with reverse replication, the process may proceed to step 2130.
[0220] In step 2120, source FS can identify the last point-in-time snapshot in source FS prior to a successful failover, which is a snapshot that has been copied from source FS to target FS (i.e., the generation of the difference in source FS and the application of the difference in target FS are completed). In some embodiments, identifying the last point-in-time snapshot in source FS may include, for example, checking the replication identification information (replication ID or the ID of the job executing the replication) of a successful inter-region replication and the history ID of the snapshot associated with the replication ID in both source FS and target FS. The replication ID can be used to identify a specific replication job between source FS and target FS. For example, source snapshot 3 2024 in source FS may be associated with replication job 2050 (i.e., failover from source FS to target FS), which can be identified by a specific replication ID. Similarly, target snapshot 3 2044 in target FS may be associated with replication job 2050. Since target snapshot 3 2044 in target FS is a replica of source snapshot 3 2024 in source FS, these snapshots should have the same history ID (i.e., unique identification information of the snapshot). Since target snapshot 3 2044 in target FS has been successfully created by replication job 2050, source snapshot 3 2024 in source FS can be used as the last point-in-time snapshot in source FS for the purpose of failback.
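The lookup in step 2120 can be illustrated with a minimal Python sketch. It assumes each file system exposes its snapshots as a list ordered from oldest to newest, and that each snapshot record carries a history ID and the ID of the replication job that copied it. The `Snapshot` class, the function name, and the ID strings such as `job-2050` are hypothetical and chosen only to mirror FIG. 20; they are not part of the described embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    name: str
    history_id: str                # shared by a snapshot and its replica
    replication_id: Optional[str]  # job that copied it, None if never replicated


def last_replicated_snapshot(
    source: List[Snapshot], target: List[Snapshot], replication_id: str
) -> Optional[Snapshot]:
    """Return the newest source snapshot that the given job replicated to the target."""
    # History IDs of target snapshots created by the replication job.
    replicated = {
        s.history_id for s in target if s.replication_id == replication_id
    }
    # Walk the source snapshots from newest to oldest.
    for snap in reversed(source):
        if snap.replication_id == replication_id and snap.history_id in replicated:
            return snap
    return None


# Hypothetical data mirroring FIG. 20 (lists are ordered oldest to newest).
source_fs = [
    Snapshot("source snapshot 1 2020", "h1", "job-1"),
    Snapshot("source snapshot 2 2022", "h2", "job-2"),
    Snapshot("source snapshot 3 2024", "h3", "job-2050"),
]
target_fs = [
    Snapshot("target snapshot 1 2040", "h1", "job-1"),
    Snapshot("target snapshot 2 2042", "h2", "job-2"),
    Snapshot("target snapshot 3 2044", "h3", "job-2050"),
    Snapshot("target snapshot 4 2046", "h4", None),
]

base = last_replicated_snapshot(source_fs, target_fs, "job-2050")
print(base.name if base else "no replicated snapshot found")
```

With this sample data the sketch selects source snapshot 3 2024, since its history ID matches that of target snapshot 3 2044 and both are associated with the same replication job.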
[0221] In step 2122, the source FS may execute a clone of the last point-in-time snapshot within the source region. The clone may be referred to as a writable snapshot and is typically created in the same region. In step 2124, the source FS may use the cloned snapshot to perform normal operations without requiring the target FS.
[0222] In step 2130, for a customer request for a failback with reverse replication, the FSS may identify the most recent common snapshot between the source file system and the target file system, i.e., the most recent pair of snapshots having the same provenance ID (provID). In some embodiments, the source FS may request the provenance IDs of the target FS's snapshots and, starting from the most recent snapshot in each of the two file systems, compare them to the provenance IDs of its own snapshots. For example, in FIG. 20, the source FS may compare the provID of its most recent snapshot (source snapshot 3 2024) to the provID of the most recent snapshot of the target file system (target snapshot 4 2046). Since target snapshot 4 2046 of the target FS contains a new update, no match is detected. Next, the source FS may compare the provID of source snapshot 3 2024 to the provID of the next most recent target snapshot, target snapshot 3 2044. Since target snapshot 3 2044 was copied from (or is a replica of) source snapshot 3 2024, a match is detected. Thus, source snapshot 3 2024 is the most recent common snapshot on the source side, and target snapshot 3 2044 is the most recent common snapshot on the target side. The two common snapshots should be identical.
[0223] After the latest common snapshot with the same provenance ID is detected, inter-region reverse replication (steps 2132 to 2138) can be performed, beginning at step 2132, by reversing the roles of the original source FS and the target FS. In some embodiments, the source FS may clone the identified common snapshot for the purpose of inter-region reverse replication.
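The comparison in step 2130 can be sketched in Python as follows. The sketch assumes each file system can list the provIDs of its snapshots from oldest to newest. It uses an index lookup rather than the pairwise comparison described above, which yields the same result; the function name and the provID strings are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def most_recent_common_snapshot(
    source_prov_ids: Sequence[str], target_prov_ids: Sequence[str]
) -> Optional[Tuple[int, int]]:
    """Return (source_index, target_index) of the newest snapshot pair
    sharing a provID, or None if the file systems share no snapshot.

    Both sequences are assumed to be ordered from oldest to newest.
    """
    # Map each target provID to its position for constant-time lookup.
    target_pos = {prov_id: i for i, prov_id in enumerate(target_prov_ids)}
    # Scan the source snapshots from newest to oldest.
    for i in range(len(source_prov_ids) - 1, -1, -1):
        j = target_pos.get(source_prov_ids[i])
        if j is not None:
            return i, j
    return None


# Hypothetical provIDs mirroring FIG. 20: target snapshot 4 has no
# counterpart in the source file system.
source_ids = ["prov-1", "prov-2", "prov-3"]
target_ids = ["prov-1", "prov-2", "prov-3", "prov-4"]

print(most_recent_common_snapshot(source_ids, target_ids))  # (2, 2)
```

Index 2 on both sides corresponds to source snapshot 3 2024 and target snapshot 3 2044 in FIG. 20.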
[0224] As part of the reverse replication process, at step 2134, the target FS may perform a difference generation between its latest snapshot (e.g., snapshot 4 2046 in FIG. 20) and the latest common snapshot (e.g., snapshot 3 2044 in FIG. 20) that is identified. In some embodiments, the identified common snapshot may not be the latest common snapshot within each file system. For example, snapshot 2 (2022 in the source FS) and snapshot 2 (2042 in the target FS) may exist based on a previous replication between the two file systems. The target's latest snapshot may include new changes that are not available in the source FS after the current inter-region replication. At step 2136, the target FS may transfer the generated difference and other replication-related information (e.g., manifest file, metadata) to the source FS via the object store located in the source region. At step 2138, the source FS may download the difference and apply this difference to the identified latest common snapshot to create a new snapshot (e.g., 2026 in FIG. 20). At step 2140, the source FS may perform normal operations using the new snapshot without requiring or depending on the target FS.
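The difference generation and application of steps 2134 to 2138 can be illustrated with a simplified Python sketch. It models a snapshot as a mapping from file path to file content, which is an assumption made purely for illustration; the actual on-disk representation, the manifest file, and the transfer through the object store are not modeled. All names and sample data are hypothetical.

```python
from typing import Dict, Optional

Snapshot = Dict[str, bytes]
Delta = Dict[str, Optional[bytes]]  # None marks a deleted path


def generate_delta(base: Snapshot, latest: Snapshot) -> Delta:
    """Difference generation (step 2134): changes from base to latest."""
    # New or modified files.
    delta: Delta = {
        path: data for path, data in latest.items() if base.get(path) != data
    }
    # Files present in the base but removed in the latest snapshot.
    delta.update({path: None for path in base if path not in latest})
    return delta


def apply_delta(base: Snapshot, delta: Delta) -> Snapshot:
    """Difference application (step 2138): build a new snapshot from the base."""
    result = dict(base)
    for path, data in delta.items():
        if data is None:
            result.pop(path, None)
        else:
            result[path] = data
    return result


# Common snapshot (snapshot 3), identical on both sides.
common: Snapshot = {"a.txt": b"1", "b.txt": b"2", "old.log": b"x"}
# Latest target snapshot (snapshot 4) with post-failover changes.
target_snapshot_4: Snapshot = {"a.txt": b"1", "b.txt": b"22", "c.txt": b"3"}

delta = generate_delta(common, target_snapshot_4)
source_snapshot_4 = apply_delta(common, delta)
print(source_snapshot_4 == target_snapshot_4)  # True
```

Only the changed, added, and deleted paths appear in the delta, so the unchanged `a.txt` is never transferred. This reflects why starting from the common snapshot avoids copying data that already exists in both file systems.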
[0225] FIG. 22 is a flowchart showing a high-level process flow of a prior-snapshot restart process for replication that uses a reverse data flow, according to an embodiment. At step 2210, after the FSS encounters a trigger event such as a power outage or a system failure, it may perform an inter-region replication (i.e., failover) between the source FS (within the source region) and the target FS (within the target region). The source region and the target region are different regions. The data flow of the inter-region replication may generate a difference within the source FS, then transfer this difference from the source FS to the target FS via the object store located in the target region, and finally apply it to the target FS.
[0226] At step 2212, after the completion of the inter-region replication (i.e., failover), the FSS may receive a customer request to reuse the source FS as the primary region. The primary region is the region that was operating before the trigger event that triggered the failover. At step 2214, both the source FS and the target FS may communicate replication-related information to each other for the purpose of resuming. The replication-related information may include, but is not limited to, the identification information of the job executing the inter-region replication (replication ID, or identification information of the inter-region replication), the history ID of the snapshot within the source FS, and the history ID of the snapshot within the target FS.
[0227] At step 2220, the FSS may identify a restartable base snapshot within the source FS, and the restartable base snapshot may enable the source FS to operate properly after the trigger event. In other words, the source FS can continue to perform normal operations, such as reading information from and writing updates to the restartable base snapshot, without depending on the target FS.
[0228] In step 2222, the FSS may need to determine the type of restartable base snapshot: either the last point-in-time snapshot in the source FS prior to a successful failover to the target FS (e.g., 2024 in FIG. 20) (i.e., the failback-only option), or a replica in the source FS created by inter-region reverse replication between the source FS and the target FS (e.g., 2026 in FIG. 20) (i.e., the failback option with reverse replication). The reverse data flow means that, instead of making the target region the new primary region after inter-region replication, the FSS returns the primary role to the original source region (i.e., the original primary region).
[0229] In step 2224, if the type of restartable base snapshot is determined to be the last point-in-time snapshot in the source FS, the FSS may execute the failback process described in steps 2120 - 2124 of FIG. 21 to prepare the restartable base snapshot for use. If the type of restartable base snapshot is determined to be a replica in the source FS (e.g., 2026 in FIG. 20), the FSS may execute the failback process with reverse replication described in steps 2130 - 2138 of FIG. 21 to prepare the restartable base snapshot for use. Finally, in step 2226, the source FS may operate independently (i.e., without depending on the target region) using the restartable base snapshot.
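The branch taken in steps 2222 and 2224 can be summarized with a small Python sketch. The enum, the function, and the returned dictionary are hypothetical illustration devices; only the step numbers and the reference numerals 2024 and 2026 come from FIGS. 20 and 21.

```python
from enum import Enum
from typing import Dict, List, Union


class BaseSnapshotType(Enum):
    LAST_POINT_IN_TIME = "failback only"
    REVERSE_REPLICA = "failback with reverse replication"


def plan_failback(kind: BaseSnapshotType) -> Dict[str, Union[int, List[int]]]:
    """Map the restartable base snapshot type to the FIG. 21 steps to run
    and to the FIG. 20 snapshot that serves as the base afterwards."""
    if kind is BaseSnapshotType.LAST_POINT_IN_TIME:
        # Clone the last replicated snapshot inside the source region.
        return {"steps": [2120, 2122, 2124], "base_snapshot": 2024}
    if kind is BaseSnapshotType.REVERSE_REPLICA:
        # Find the common snapshot, then replicate target changes back.
        return {"steps": [2130, 2132, 2134, 2136, 2138], "base_snapshot": 2026}
    raise ValueError(f"unknown base snapshot type: {kind!r}")


print(plan_failback(BaseSnapshotType.LAST_POINT_IN_TIME))
print(plan_failback(BaseSnapshotType.REVERSE_REPLICA))
```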
Exemplary Cloud Architecture
[0230] As described above, infrastructure as a service (IaaS) is a specific type of cloud computing. IaaS can be configured to provide virtualized computing resources via a public network (e.g., the Internet). In the IaaS model, a cloud computing provider can host infrastructure components (e.g., servers, storage devices, network nodes (e.g., hardware), deployment software, platform virtualization (e.g., hypervisor layer), etc.). In some cases, the IaaS provider may provide various services that accompany those infrastructure components (examples of services include billing software, monitoring software, logging software, load balancing software, clustering software, etc.). Thus, since these services can be policy-driven, IaaS users may be able to implement policies to drive load balancing and maintain application availability and performance.
[0231] In some cases, IaaS customers may access resources and services via a wide area network (WAN), such as the Internet, and use the cloud provider's services to install the remaining elements of the application stack. For example, a user can log in to the IaaS platform, create virtual machines (VMs), install an operating system (OS) on each VM, deploy middleware such as a database, create storage buckets for workloads and backups, and install enterprise software on the VM. The customer can then use the provider's services to perform various functions, including balancing network traffic, troubleshooting application problems, monitoring performance, managing disaster recovery, etc.
[0232] In most cases, cloud computing models require the participation of a cloud provider. A cloud provider can be a third-party service that specializes in providing (e.g., offering, lending, selling) IaaS, but it doesn't have to be. An entity may choose to deploy a private cloud and become its own provider of infrastructure services.
[0233] In some examples, the deployment of IaaS is the process of placing a new application or a new version of an application on a prepared application server, etc. This process may include preparing the server (e.g., installing libraries, daemons, etc.). It is often managed by the cloud provider below the hypervisor layer (e.g., servers, storage, network hardware, and virtualization). Thus, the customer may be responsible for handling the deployment of the operating system (OS), middleware, and / or applications (e.g., on self-service virtual machines that can be spun up on demand).
[0234] In some examples, the provisioning of IaaS can also refer to acquiring a computer or virtual host for use and installing the required libraries or services on those computers or virtual hosts. In most cases, deployment does not include provisioning, and provisioning may need to be done first.
[0235] In some cases, there are two different challenges in IaaS provisioning. First, there is the initial challenge of provisioning an initial set of infrastructure before anything is executed. Second, after everything is provisioned, there is the challenge of evolving the existing infrastructure (e.g., adding new services, changing services, removing services, etc.). In some cases, these two challenges can be addressed by enabling the infrastructure configuration to be defined declaratively. In other words, the infrastructure (e.g., which components are required and how those components exchange information) can be defined by one or more configuration files. In this way, the entire infrastructure topology (e.g., which resources depend on which resources and how each of those resources interact) can be described declaratively. In some cases, after the topology is defined, a workflow can be generated to create and / or manage the various components described in the configuration file.
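As a minimal illustration of generating a workflow from a declaratively defined topology, the following Python sketch derives a provisioning order from a dependency mapping using the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later). The resource names and the dictionary-based configuration format are hypothetical and are not taken from any particular provisioning tool.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Declarative topology: each resource maps to the resources it depends on.
topology: Dict[str, List[str]] = {
    "vcn": [],
    "subnet": ["vcn"],
    "load_balancer": ["subnet"],
    "database": ["subnet"],
    "vm": ["subnet"],
}


def provisioning_order(config: Dict[str, List[str]]) -> List[str]:
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(config).static_order())


order = provisioning_order(topology)
for step, resource in enumerate(order, start=1):
    print(f"{step}. provision {resource}")
```

Evolving the infrastructure then amounts to editing the mapping and regenerating the order; `TopologicalSorter` raises `graphlib.CycleError` if the edited configuration contains a circular dependency.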
[0236] In some examples, the infrastructure can include many interconnected elements. For example, there may be one or more virtual private clouds (VPCs), also known as a core network (e.g., a configurable and / or shared pool of computing resources, possibly on-demand). In some examples, there may be one or more inbound traffic / outbound traffic group rules provisioned to define how inbound and / or outbound traffic of the network is set up, and one or more virtual machines (VMs). Other infrastructure elements such as load balancers, databases, etc. may be provisioned. As more infrastructure elements are desired and / or added, the infrastructure can evolve gradually.
[0237] In some cases, continuous deployment techniques may be employed to enable the deployment of infrastructure code across various virtual computing environments. Further, the techniques described can enable infrastructure management within these environments. In some examples, a service team may write code that is desirably deployed to one or more, but in many cases a large number of, different production environments (e.g., across different geographical locations and sometimes across the world). However, in some examples, the infrastructure to which the code is deployed must first be provisioned. In some cases, provisioning can be done manually, provisioning tools may be utilized to provision resources, and / or deployment tools may be utilized to deploy the code after the infrastructure has been provisioned.
[0238] FIG. 23 is a block diagram 2300 showing an exemplary pattern of an IaaS architecture according to at least one embodiment. A service operator 2302 can be communicatively coupled to a secure host tenancy 2304 that can include a virtual cloud network (VCN) 2306 and a secure host subnet 2308. In some examples, the service operator 2302 may use one or more client computing devices, which can be portable handheld devices (e.g., iPhone®, mobile phone, iPad®, computing tablet, personal digital assistant (PDA)) or wearable devices (e.g., Google® Glass head-mounted display) that run software such as Microsoft Windows Mobile®, and / or various mobile operating systems such as iOS, Windows Phone, Android, BlackBerry 8, Palm OS, etc., and have Internet, email, short message service (SMS), BlackBerry®, or other communication protocols enabled. Alternatively, the client computing device can be a general-purpose personal computer, including, for example, personal computers and / or laptop computers that run various versions of Microsoft Windows®, Apple Macintosh®, and / or Linux® operating systems. The client computing device can be a workstation computer that runs any of various commercially available UNIX® or UNIX-like operating systems, including, but not limited to, various GNU / Linux operating systems such as Google Chrome OS. Alternatively or in addition, the client computing device can be any other electronic device, such as a thin client computer, an Internet-enabled gaming system (e.g., a Microsoft Xbox gaming console with or without a Kinect® gesture input device), and / or a personal messaging device, that can communicate via a network and / or the Internet that has access to the VCN 2306.
[0239] The VCN 2306 can include a local peering gateway (LPG) 2310 and can be communicatively coupled to a secure shell (SSH) VCN 2312 via an LPG 2310 included in the SSH VCN 2312. The SSH VCN 2312 can include an SSH subnet 2314 and can be communicatively coupled to a control plane VCN 2316 via an LPG 2310 included in the control plane VCN 2316. Also, the SSH VCN 2312 can be communicatively coupled to a data plane VCN 2318 via an LPG 2310. The control plane VCN 2316 and the data plane VCN 2318 can be included in a service tenancy 2319 that can be owned and / or operated by an IaaS provider.
[0240] The control plane VCN 2316 can include a control plane demilitarized zone (DMZ) layer 2320 that functions as a boundary network (e.g., a part of the enterprise network between the enterprise intranet and the external network). DMZ-based servers have limited responsibilities, which can help keep a breach contained. Further, the control plane VCN 2316 can include one or more load balancer (LB) subnets 2322 in the DMZ layer 2320, a control plane application layer 2324 that can include an app subnet 2326, and a control plane data layer 2328 that can include a database (DB) subnet 2330 (e.g., a front-end DB subnet and / or a back-end DB subnet). The LB subnet 2322 included in the control plane DMZ layer 2320 can be communicatively coupled to the app subnet 2326 included in the control plane application layer 2324 and to the Internet gateway 2334 that can be included in the control plane VCN 2316, and the app subnet 2326 can be communicatively coupled to the DB subnet 2330 included in the control plane data layer 2328 as well as the service gateway 2336 and the network address translation (NAT) gateway 2338. The control plane VCN 2316 can include the service gateway 2336 and the NAT gateway 2338.
[0241] The control plane VCN 2316 can include a data plane mirror app layer 2340 that can include an app subnet 2326. The app subnet 2326 included in the data plane mirror app layer 2340 can include a virtual network interface controller (VNIC) 2342 that can execute a compute instance 2344. The compute instance 2344 can communicatively couple the app subnet 2326 of the data plane mirror app layer 2340 to an app subnet 2326 that may be included in the data plane app layer 2346.
[0242] The data plane VCN 2318 can include a data plane app layer 2346, a data plane DMZ layer 2348, and a data plane data layer 2350. The data plane DMZ layer 2348 can include an LB subnet 2322 that can be communicatively coupled to the app subnet 2326 of the data plane app layer 2346 and the internet gateway 2334 of the data plane VCN 2318. The app subnet 2326 can be communicatively coupled to the service gateway 2336 and the NAT gateway 2338 of the data plane VCN 2318. The data plane data layer 2350 can also include a DB subnet 2330 that can be communicatively coupled to the app subnet 2326 of the data plane app layer 2346.
[0243] The internet gateways 2334 of the control plane VCN 2316 and the data plane VCN 2318 can be communicatively coupled to a metadata management service 2352 that can be communicatively coupled to the public internet 2354. The public internet 2354 can be communicatively coupled to the NAT gateways 2338 of the control plane VCN 2316 and the data plane VCN 2318. The service gateways 2336 of the control plane VCN 2316 and the data plane VCN 2318 can be communicatively coupled to cloud services 2356.
[0244] In some examples, the service gateway 2336 of the control plane VCN 2316 or the data plane VCN 2318 can make application programming interface (API) calls to the cloud service 2356 without going through the public Internet 2354. The API calls from the service gateway 2336 to the cloud service 2356 can be one-way: the service gateway 2336 can make an API call to the cloud service 2356, and the cloud service 2356 can send the requested data to the service gateway 2336. However, the cloud service 2356 may not initiate API calls to the service gateway 2336.
[0245] In some examples, the secure host tenancy 2304 can be directly connected to the service tenancy 2319, which may otherwise be isolated. The secure host subnet 2308 can communicate with the SSH subnet 2314 via the LPG 2310, which can enable two-way communication over an otherwise isolated system. Connecting the secure host subnet 2308 to the SSH subnet 2314 can give the secure host subnet 2308 access to other entities within the service tenancy 2319.
[0246] The control plane VCN 2316 may enable users of service tenancy 2319 to set or otherwise provision desired resources. Desired resources provisioned within the control plane VCN 2316 may be deployed or otherwise used in the data plane VCN 2318. In some examples, the control plane VCN 2316 may be separable from the data plane VCN 2318, and the data plane mirror app layer 2340 of the control plane VCN 2316 may communicate with the data plane app layer 2346 of the data plane VCN 2318 via VNICs 2342 that may be included in the data plane mirror app layer 2340 and the data plane app layer 2346.
[0247] In some examples, a user or customer of the system may make requests, such as create, read, update, or delete (CRUD) operations, via the public internet 2354, which can communicate the requests to the metadata management service 2352. The metadata management service 2352 may communicate the requests to the control plane VCN 2316 via the internet gateway 2334. The requests may be received by the LB subnet 2322 included in the control plane DMZ layer 2320. The LB subnet 2322 may determine that the requests are valid, and in response, the LB subnet 2322 may send the requests to the app subnet 2326 included in the control plane app layer 2324. If the requests are validated and require calls to the public internet 2354, those calls may be sent to the NAT gateway 2338, which can make the calls to the public internet 2354. Metadata that the requests need stored may be stored within the DB subnet 2330.
[0248] In some examples, the data plane mirror application layer 2340 can facilitate direct communication between the control plane VCN 2316 and the data plane VCN 2318. For example, it may be desirable for changes, updates, or other appropriate modifications to the configuration to be applied to the resources included in the data plane VCN 2318. Through the VNIC 2342, the control plane VCN 2316 can communicate directly with the resources included in the data plane VCN 2318, thereby enabling changes, updates, or other appropriate modifications to the configuration of the resources.
[0249] In some embodiments, the control plane VCN 2316 and the data plane VCN 2318 may be included in the service tenancy 2319. In this case, the user or customer of the system does not have to own or operate either the control plane VCN 2316 or the data plane VCN 2318. Instead, the IaaS provider may own or operate both the control plane VCN 2316 and the data plane VCN 2318, which may both be included in the service tenancy 2319. This embodiment can enable network isolation that can prevent a user or customer from exchanging information with the resources of other users or other customers. Also, this embodiment can enable a user or customer of the system to store databases privately without having to rely on the public Internet 2354, which may not have the desired level of threat prevention for storage.
[0250] In other embodiments, the LB subnet 2322 included in the control plane VCN 2316 may be configured to receive signals from the service gateway 2336. In this embodiment, the control plane VCN 2316 and the data plane VCN 2318 may be configured to be invoked by a customer of the IaaS provider without invoking the public Internet 2354. A customer of the IaaS provider may desire this embodiment because the databases used by the customer may be controlled by the IaaS provider and may be stored in a service tenancy 2319 that may be separated from the public Internet 2354.
[0251] FIG. 24 is a block diagram 2400 showing another exemplary pattern of an IaaS architecture according to at least one embodiment. A service operator 2402 (e.g., service operator 2302 of FIG. 23) can be communicatively coupled to a secure host tenancy 2404 (e.g., secure host tenancy 2304 of FIG. 23) that can include a virtual cloud network (VCN) 2406 (e.g., VCN 2306 of FIG. 23) and a secure host subnet 2408 (e.g., secure host subnet 2308 of FIG. 23). The VCN 2406 can be communicatively coupled to a Secure Shell (SSH) VCN 2412 (e.g., SSH VCN 2312 of FIG. 23) via a local peering gateway (LPG) 2410 (e.g., LPG 2310 of FIG. 23) included in the SSH VCN 2412. The SSH VCN 2412 can include an SSH subnet 2414 (e.g., SSH subnet 2314 of FIG. 23), and the SSH VCN 2412 can be communicatively coupled to a control plane VCN 2416 (e.g., control plane VCN 2316 of FIG. 23) via the LPG 2410 included in the control plane VCN 2416. The control plane VCN 2416 can be included in a service tenancy 2419 (e.g., service tenancy 2319 of FIG. 23), and a data plane VCN 2418 (e.g., data plane VCN 2318 of FIG. 23) can be included in a customer tenancy 2421 that can be owned or operated by a user or customer of the system.
[0252] The control plane VCN 2416 can include a control plane DMZ layer 2420 (e.g., the control plane DMZ layer 2320 in FIG. 23) that can include an LB subnet 2422 (e.g., the LB subnet 2322 in FIG. 23), a control plane application layer 2424 (e.g., the control plane application layer 2324 in FIG. 23) that can include an application subnet 2426 (e.g., the application subnet 2326 in FIG. 23), and a control plane data layer 2428 (e.g., the control plane data layer 2328 in FIG. 23) that can include a database (DB) subnet 2430 (e.g., similar to the DB subnet 2330 in FIG. 23). The LB subnet 2422 included in the control plane DMZ layer 2420 is communicatively coupled to the application subnet 2426 included in the control plane application layer 2424, and to an Internet gateway 2434 (e.g., the Internet gateway 2334 in FIG. 23) that can be included in the control plane VCN 2416. The application subnet 2426 is communicatively coupled to the DB subnet 2430 included in the control plane data layer 2428, and to a service gateway 2436 (e.g., the service gateway 2336 in FIG. 23) and a network address translation (NAT) gateway 2438 (e.g., the NAT gateway 2338 in FIG. 23). The control plane VCN 2416 can include the service gateway 2436 and the NAT gateway 2438.
[0253] The control plane VCN 2416 can include a data plane mirror application layer 2440 (e.g., the data plane mirror application layer 2340 of FIG. 23) that can include an application subnet 2426. The application subnet 2426 included in the data plane mirror application layer 2440 can include a virtual network interface controller (VNIC) 2442 (e.g., the VNIC 2342) that can execute a compute instance 2444 (e.g., similar to the compute instance 2344 of FIG. 23). The compute instance 2444 can facilitate communication between the application subnet 2426 of the data plane mirror application layer 2440 and an application subnet 2426 that can be included in the data plane application layer 2446 (e.g., the data plane application layer 2346 of FIG. 23) via the VNIC 2442 included in the data plane mirror application layer 2440 and the VNIC 2442 included in the data plane application layer 2446.
[0254] The internet gateway 2434 included in the control plane VCN 2416 can be communicatively coupled to a metadata management service 2452 (e.g., the metadata management service 2352 of FIG. 23) that can be communicatively coupled to the public internet 2454 (e.g., the public internet 2354 of FIG. 23). The public internet 2454 can be communicatively coupled to the NAT gateway 2438 included in the control plane VCN 2416. The service gateway 2436 included in the control plane VCN 2416 can be communicatively coupled to a cloud service 2456 (e.g., the cloud service 2356 of FIG. 23).
[0255] In some examples, the data plane VCN 2418 may be included in the customer's tenancy 2421. In this case, the IaaS provider may provide a control plane VCN 2416 for each customer, and the IaaS provider may configure the specific compute instances 2444 included in the service tenancy 2419 for each customer. Each compute instance 2444 may enable communication between the control plane VCN 2416 included in the service tenancy 2419 and the data plane VCN 2418 included in the customer's tenancy 2421. The compute instance 2444 may enable the resources provisioned within the control plane VCN 2416 included in the service tenancy 2419 to be deployed or otherwise used in the data plane VCN 2418 included in the customer's tenancy 2421.
[0256] In other examples, a customer of an IaaS provider may have a database that persists in the customer's tenancy 2421. In this example, the control plane VCN 2416 can include a data plane mirror app layer 2440 that can include an app subnet 2426. The data plane mirror app layer 2440 can exist in the data plane VCN 2418, but the data plane mirror app layer 2440 does not have to persist in the data plane VCN 2418. That is, the data plane mirror app layer 2440 can have access rights to the customer's tenancy 2421, but the data plane mirror app layer 2440 does not have to exist in the data plane VCN 2418 and does not have to be owned or operated by the customer of the IaaS provider. The data plane mirror app layer 2440 can be configured to make calls to the data plane VCN 2418, but does not have to be configured to make calls to any entity included in the control plane VCN 2416. The customer may wish to deploy or otherwise use resources within the data plane VCN 2418 that are provisioned within the control plane VCN 2416, and the data plane mirror app layer 2440 can facilitate the desired deployment or other use of the customer's resources.
[0257] In some embodiments, a customer of an IaaS provider can apply a filter to the data plane VCN 2418. In this embodiment, the customer can determine what the data plane VCN 2418 can access, and the customer can restrict access from the data plane VCN 2418 to the public Internet 2454. The IaaS provider does not have to be able to apply a filter or otherwise control access of the data plane VCN 2418 to any external network or database. Filters and controls applied by the customer to the data plane VCN 2418 included in the customer's tenancy 2421 can help to isolate the data plane VCN 2418 from other customers and from the public Internet 2454.
[0258] In some embodiments, cloud service 2456 may be invoked by service gateway 2436 to access services that may not exist on any of public Internet 2454, control plane VCN 2416, or data plane VCN 2418. The connection between cloud service 2456 and control plane VCN 2416 or data plane VCN 2418 may not be operational or continuous. Cloud service 2456 may exist on a different network owned or operated by an IaaS provider. Cloud service 2456 may be configured to receive calls from service gateway 2436 and may be configured not to receive calls from public Internet 2454. Some cloud services 2456 may be isolated from other cloud services 2456, and control plane VCN 2416 may be isolated from cloud services 2456 that may not exist in the same region as control plane VCN 2416. For example, control plane VCN 2416 may be located in "Region 1," and "Deployment 23" of the cloud service may be located in Region 1 and "Region 2." When a call to Deployment 23 is made by service gateway 2436 included in control plane VCN 2416 located in Region 1, this call may be sent to Deployment 23 within Region 1. In this example, control plane VCN 2416, or Deployment 23 within Region 1, may or may not be communicatively coupled to Deployment 23 within Region 2.
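As an illustration only, the region-local routing in the example above can be sketched as follows. The `DEPLOYMENTS` table, the `route_call` function, and the returned endpoint string are hypothetical names introduced for this sketch and do not appear in the disclosure. The sketch also assumes that no cross-region fallback is attempted, which the paragraph leaves open.

```python
# Hypothetical sketch: a call made by a service gateway is delivered to the
# copy of the deployment that lives in the caller's own region.

# Assumed mapping of deployment name -> regions in which it is located.
DEPLOYMENTS = {
    "Deployment 23": {"Region 1", "Region 2"},
}


def route_call(deployment: str, caller_region: str) -> str:
    """Return a label for the regional deployment that receives the call."""
    regions = DEPLOYMENTS.get(deployment, set())
    if caller_region not in regions:
        # Assumption: the control plane VCN is isolated from cloud services
        # outside its own region, so the call is rejected rather than rerouted.
        raise LookupError(f"{deployment} is not located in {caller_region}")
    return f"{deployment} @ {caller_region}"


if __name__ == "__main__":
    print(route_call("Deployment 23", "Region 1"))
```

A call issued from Region 1 resolves to the Region 1 copy of the deployment, and the Region 2 copy is never consulted.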
[0259] FIG. 25 is a block diagram 2500 showing another exemplary pattern of an IaaS architecture according to at least one embodiment. A service operator 2502 (e.g., service operator 2302 of FIG. 23) can be communicatively coupled to a secure host tenancy 2504 (e.g., secure host tenancy 2304 of FIG. 23) that can include a virtual cloud network (VCN) 2506 (e.g., VCN 2306 of FIG. 23) and a secure host subnet 2508 (e.g., secure host subnet 2308 of FIG. 23). The VCN 2506 can be communicatively coupled to an SSH VCN 2512 (e.g., SSH VCN 2312 of FIG. 23) via an LPG 2510 included in the SSH VCN 2512, and the VCN 2506 can include an LPG 2510 (e.g., LPG 2310 of FIG. 23). The SSH VCN 2512 can include an SSH subnet 2514 (e.g., SSH subnet 2314 of FIG. 23), and the SSH VCN 2512 can be communicatively coupled to a control plane VCN 2516 (e.g., control plane VCN 2316 of FIG. 23) via an LPG 2510 included in the control plane VCN 2516 and to a data plane VCN 2518 (e.g., data plane 2318 of FIG. 23) via an LPG 2510 included in the data plane VCN 2518. The control plane VCN 2516 and the data plane VCN 2518 can be included in a service tenancy 2519 (e.g., service tenancy 2319 of FIG. 23).
[0260] The control plane VCN 2516 can include a control plane DMZ layer 2520 (e.g., the control plane DMZ layer 2320 of FIG. 23) that can include a load balancer (LB) subnet 2522 (e.g., the LB subnet 2322 of FIG. 23), a control plane application layer 2524 (e.g., similar to the control plane application layer 2324 of FIG. 23) that can include an application subnet 2526 (e.g., similar to the application subnet 2326 of FIG. 23), and a control plane data layer 2528 (e.g., the control plane data layer 2328 of FIG. 23) that can include a DB subnet 2530. The LB subnet 2522 included in the control plane DMZ layer 2520 can be communicatively coupled to the application subnet 2526 included in the control plane application layer 2524 that can be included in the control plane VCN 2516, and to an Internet gateway 2534 (e.g., the Internet gateway 2334 of FIG. 23). The application subnet 2526 can be communicatively coupled to the DB subnet 2530 included in the control plane data layer 2528, as well as to a service gateway 2536 (e.g., the service gateway of FIG. 23) and a network address translation (NAT) gateway 2538 (e.g., the NAT gateway 2338 of FIG. 23). The control plane VCN 2516 can include the service gateway 2536 and the NAT gateway 2538.
[0261] The data plane VCN 2518 can include a data plane application layer 2546 (e.g., the data plane application layer 2346 of FIG. 23), a data plane DMZ layer 2548 (e.g., the data plane DMZ layer 2348 of FIG. 23), and a data plane data layer 2550 (e.g., the data plane data layer 2350 of FIG. 23). The data plane DMZ layer 2548 can include a trusted application subnet 2560 and an untrusted application subnet 2562 of the data plane application layer 2546, and an LB subnet 2522 communicatively coupled to an Internet gateway 2534 included in the data plane VCN 2518. The trusted application subnet 2560 can be communicatively coupled to a service gateway 2536 included in the data plane VCN 2518, a NAT gateway 2538 included in the data plane VCN 2518, and a DB subnet 2530 included in the data plane data layer 2550. The untrusted application subnet 2562 can be communicatively coupled to a service gateway 2536 included in the data plane VCN 2518 and a DB subnet 2530 included in the data plane data layer 2550. The data plane data layer 2550 can include a DB subnet 2530 communicatively coupled to a service gateway 2536 included in the data plane VCN 2518.
[0262] The untrusted application subnet 2562 can include one or more primary VNICs 2564(1)-(N) communicatively coupled to tenant virtual machines (VMs) 2566(1)-(N). Each tenant VM 2566(1)-(N) can be communicatively coupled to respective application subnets 2567(1)-(N) that can be included in respective container egress VCNs 2568(1)-(N) that can be included in respective customer tenancies 2570(1)-(N). Each secondary VNIC 2572(1)-(N) can facilitate communication between the untrusted application subnet 2562 included in the data plane VCN 2518 and the application subnets included in the container egress VCNs 2568(1)-(N). Each container egress VCN 2568(1)-(N) can include a NAT gateway 2538 communicatively coupled to the public internet 2554 (e.g., the public internet 2354 of FIG. 23).
[0263] The internet gateway 2534 included in the control plane VCN 2516 and in the data plane VCN 2518 can be communicatively coupled to a metadata management service 2552 (e.g., the metadata management system 2352 of FIG. 23) communicatively coupled to the public internet 2554. The public internet 2554 can be communicatively coupled to the NAT gateway 2538 included in the control plane VCN 2516 and in the data plane VCN 2518. The service gateway 2536 included in the control plane VCN 2516 and in the data plane VCN 2518 can be communicatively coupled to cloud services 2556.
[0264] In some embodiments, the data plane VCN 2518 can be integrated with the customer's tenancy 2570. This integration can, in some cases, be useful or desirable for the customers of the IaaS provider, such as when they may want support when running code. A customer may provide code to run that can be disruptive, communicate with the resources of other customers, or otherwise cause unwanted effects. In response, the IaaS provider can determine whether to execute the code provided to the IaaS provider by the customer.
[0265] In some examples, a customer of an IaaS provider may grant the IaaS provider temporary network access rights and request a function that connects to the data plane application layer 2546. The code for performing this function may be executed in the VMs 2566(1)-(N), and this code need not be configured to execute elsewhere on the data plane VCN 2518. Each of the VMs 2566(1)-(N) may be connected to the tenancy 2570 of one customer. Each container 2571(1)-(N) included in the VMs 2566(1)-(N) may be configured to execute the code. In this case, dual isolation can exist (for example, the code runs in the containers 2571(1)-(N), and the containers 2571(1)-(N) are in turn included in at least the VMs 2566(1)-(N) included in the untrusted application subnet 2562), which can help prevent incorrect or otherwise undesirable code from damaging the IaaS provider's network or the networks of other customers. The containers 2571(1)-(N) may be communicatively coupled to the customer's tenancy 2570 and may be configured to send data to or receive data from the customer's tenancy 2570. The containers 2571(1)-(N) need not be configured to send data to or receive data from any other entity within the data plane VCN 2518. Upon completion of the execution of the code, the IaaS provider may terminate or otherwise dispose of the containers 2571(1)-(N).
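The run-then-dispose lifecycle described above can be sketched, under stated assumptions, with a context manager. The `Container` class, its `run` and `kill` methods, and the `ephemeral_container` helper are hypothetical names introduced for illustration. They stand in for whatever container runtime an implementation would use and are not part of the disclosure.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class Container:
    """Hypothetical stand-in for a container bound to one customer tenancy."""

    def __init__(self, tenancy: str) -> None:
        self.tenancy = tenancy
        self.alive = True

    def run(self, code: Callable[[], object]) -> object:
        if not self.alive:
            raise RuntimeError("container has already been disposed of")
        return code()

    def kill(self) -> None:
        self.alive = False


@contextmanager
def ephemeral_container(tenancy: str) -> Iterator[Container]:
    """Yield a container and dispose of it when the code finishes or fails."""
    container = Container(tenancy)
    try:
        yield container
    finally:
        # The provider terminates the container once execution completes.
        container.kill()


if __name__ == "__main__":
    with ephemeral_container("customer tenancy 2570") as c:
        print(c.run(lambda: "customer code ran"))
    print(c.alive)
```

Placing the disposal in a `finally` clause means the container is also discarded when the customer code raises an exception, which matches the goal of limiting the effect of undesirable code.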
[0266] In some embodiments, the trusted application subnet 2560 can execute code that can be owned or operated by an IaaS provider. In this embodiment, the trusted application subnet 2560 may be communicatively coupled to the DB subnet 2530 and may be configured to execute CRUD operations within the DB subnet 2530. The untrusted application subnet 2562 may be communicatively coupled to the DB subnet 2530, but in this embodiment, the untrusted application subnet may be configured to execute read operations within the DB subnet 2530. The containers 2571(1)-(N) that can execute customer code, which may be included in each customer's VMs 2566(1)-(N), may not be communicatively coupled to the DB subnet 2530.
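The access rules in this embodiment can be summarized as a small permission table. This is a minimal sketch: the `Caller` enum, the `ALLOWED_OPS` table, and the `is_allowed` function are hypothetical names, and the sketch assumes that the untrusted application subnet is limited to read operations only, which the paragraph implies but does not state outright.

```python
from enum import Enum


class Caller(Enum):
    TRUSTED_APP_SUBNET = "trusted application subnet 2560"
    UNTRUSTED_APP_SUBNET = "untrusted application subnet 2562"
    CUSTOMER_CONTAINER = "container 2571"


# Operations each caller may perform within the DB subnet 2530.
ALLOWED_OPS = {
    Caller.TRUSTED_APP_SUBNET: {"create", "read", "update", "delete"},
    Caller.UNTRUSTED_APP_SUBNET: {"read"},  # assumption: read-only
    Caller.CUSTOMER_CONTAINER: set(),       # not coupled to the DB subnet
}


def is_allowed(caller: Caller, operation: str) -> bool:
    """Return True if the caller may perform the operation on the DB subnet."""
    return operation in ALLOWED_OPS[caller]


if __name__ == "__main__":
    print(is_allowed(Caller.TRUSTED_APP_SUBNET, "delete"))
    print(is_allowed(Caller.UNTRUSTED_APP_SUBNET, "update"))
```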
[0267] In other embodiments, the control plane VCN 2516 and the data plane VCN 2518 may not be directly communicatively coupled. In this embodiment, there may be no direct communication between the control plane VCN 2516 and the data plane VCN 2518. However, communication can occur indirectly by at least one method. For example, the LPG 2510 may be established by the IaaS provider to facilitate communication between the control plane VCN 2516 and the data plane VCN 2518. In another example, the control plane VCN 2516 or the data plane VCN 2518 can make calls to the cloud service 2556 via the service gateway 2536. For example, a call from the control plane VCN 2516 to the cloud service 2556 can include a request for a service that can communicate with the data plane VCN 2518.
[0268] FIG. 26 is a block diagram 2600 showing another exemplary pattern of an IaaS architecture according to at least one embodiment. A service operator 2602 (e.g., the service operator 2302 of FIG. 23) can be communicatively coupled to a secure host tenancy 2604 (e.g., the secure host tenancy 2304 of FIG. 23) that can include a virtual cloud network (VCN) 2606 (e.g., the VCN 2306 of FIG. 23) and a secure host subnet 2608 (e.g., the secure host subnet 2308 of FIG. 23). The VCN 2606 can be communicatively coupled to an SSH VCN 2612 (e.g., the SSH VCN 2312 of FIG. 23) via an LPG 2610 (e.g., the LPG 2310 of FIG. 23) included in the SSH VCN 2612, and can include the LPG 2610. The SSH VCN 2612 can include an SSH subnet 2614 (e.g., the SSH subnet 2314 of FIG. 23), and the SSH VCN 2612 can be communicatively coupled to a control plane VCN 2616 (e.g., the control plane VCN 2316 of FIG. 23) via the LPG 2610 included in the control plane VCN 2616, and to a data plane VCN 2618 (e.g., the data plane 2318 of FIG. 23) via the LPG 2610 included in the data plane VCN 2618. The control plane VCN 2616 and the data plane VCN 2618 can be included in a service tenancy 2619 (e.g., the service tenancy 2319 of FIG. 23).
[0269] The control plane VCN 2616 can include a control plane DMZ layer 2620 (e.g., the control plane DMZ layer 2320 in FIG. 23) that can include an LB subnet 2622 (e.g., the LB subnet 2322 in FIG. 23), a control plane application layer 2624 (e.g., the control plane application layer 2324 in FIG. 23) that can include an app subnet 2626 (e.g., the app subnet 2326 in FIG. 23), and a control plane data layer 2628 (e.g., the control plane data layer 2328 in FIG. 23) that can include a DB subnet 2630 (e.g., the DB subnet 2530 in FIG. 25). The LB subnet 2622 included in the control plane DMZ layer 2620 can be communicatively coupled to the app subnet 2626 included in the control plane application layer 2624 that can be included in the control plane VCN 2616, and to an Internet gateway 2634 (e.g., the Internet gateway 2334 in FIG. 23). The app subnet 2626 can be communicatively coupled to the DB subnet 2630 included in the control plane data layer 2628, as well as to a service gateway 2636 (e.g., the service gateway in FIG. 23) and a network address translation (NAT) gateway 2638 (e.g., the NAT gateway 2338 in FIG. 23). The control plane VCN 2616 can include the service gateway 2636 and the NAT gateway 2638.
[0270] The data plane VCN 2618 can include a data plane application layer 2646 (e.g., the data plane application layer 2346 of FIG. 23), a data plane DMZ layer 2648 (e.g., the data plane DMZ layer 2348 of FIG. 23), and a data plane data layer 2650 (e.g., the data plane data layer 2350 of FIG. 23). The data plane DMZ layer 2648 can include a trusted application subnet 2660 (e.g., the trusted application subnet 2560 of FIG. 25) and an untrusted application subnet 2662 (e.g., the untrusted application subnet 2562 of FIG. 25) of the data plane application layer 2646, and an LB subnet 2622 communicatively coupled to an Internet gateway 2634 included in the data plane VCN 2618. The trusted application subnet 2660 can be communicatively coupled to a service gateway 2636 included in the data plane VCN 2618, a NAT gateway 2638 included in the data plane VCN 2618, and a DB subnet 2630 included in the data plane data layer 2650. The untrusted application subnet 2662 can be communicatively coupled to a service gateway 2636 included in the data plane VCN 2618, and a DB subnet 2630 included in the data plane data layer 2650. The data plane data layer 2650 can include a DB subnet 2630 communicatively coupled to a service gateway 2636 included in the data plane VCN 2618.
[0271] The untrusted application subnet 2662 can include primary VNICs 2664(1) to (N) communicatively coupled to tenant virtual machines (VMs) 2666(1) to (N) existing within the untrusted application subnet 2662. Each tenant VM 2666(1) to (N) can execute code within respective containers 2667(1) to (N) and can be communicatively coupled to an application subnet 2626 that can be included in a data plane application layer 2646 that can be included in a container egress VCN 2668. Each secondary VNIC 2672(1) to (N) can facilitate communication between the untrusted application subnet 2662 included in the data plane VCN 2618 and the application subnet included in the container egress VCN 2668. The container egress VCN can include a NAT gateway 2638 communicatively coupled to a public internet 2654 (e.g., the public internet 2354 of FIG. 23).
[0272] The internet gateway 2634 included in the control plane VCN 2616 and in the data plane VCN 2618 can be communicatively coupled to a metadata management service 2652 (e.g., the metadata management system 2352 of FIG. 23) communicatively coupled to the public internet 2654. The public internet 2654 can be communicatively coupled to a NAT gateway 2638 included in the control plane VCN 2616 and in the data plane VCN 2618. The service gateway 2636 included in the control plane VCN 2616 and in the data plane VCN 2618 can be communicatively coupled to a cloud service 2656.
[0273] In some examples, the pattern shown by the architecture of block diagram 2600 in FIG. 26 may be regarded as an exception to the pattern shown by the architecture of block diagram 2500 in FIG. 25, which may be desirable for an IaaS provider's customers when the IaaS provider cannot communicate directly with a customer (e.g., a disconnected region). Each container 2667(1) to (N) included in VMs 2666(1) to (N) for each customer may be accessed in real time by the customer. Containers 2667(1) to (N) may be configured to make calls to respective secondary VNICs 2672(1) to (N) included in the application subnet 2626 of the data plane application layer 2646 that may be included in the container egress VCN 2668. Secondary VNICs 2672(1) to (N) may be able to send calls to the NAT gateway 2638, and the NAT gateway 2638 may send the calls to the public Internet 2654. In this example, containers 2667(1) to (N) that may be accessed in real time by the customer can be separated from the control plane VCN 2616 and may be separated from other entities included in the data plane VCN 2618. Containers 2667(1) to (N) may also be separated from the resources of other customers.
[0274] In another example, a customer can call cloud service 2656 using containers 2667(1) to (N). In this example, the customer may execute the code within containers 2667(1) to (N) that requests a service from cloud service 2656. Containers 2667(1) to (N) can send this request to secondary VNICs 2672(1) to (N), and secondary VNICs 2672(1) to (N) can send this request to a NAT gateway, and the NAT gateway can send this request to public internet 2654. Public internet 2654 can send this request to LB subnet 2622 included in control plane VCN 2616 via internet gateway 2634. In response to determining that this request is valid, the LB subnet can send this request to app subnet 2626, and app subnet 2626 can send this request to cloud service 2656 via service gateway 2636.
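The hop sequence in this example can be traced with a short sketch. The `HOPS` list and the `trace_request` function are hypothetical, and the caller-supplied `is_valid` predicate stands in for whatever validation the LB subnet performs. The disclosure does not say what happens to an invalid request; the sketch assumes it stops at the LB subnet.

```python
from typing import Callable, List

# Ordered hops taken by a request from a container to the cloud service.
HOPS = [
    "container 2667",
    "secondary VNIC 2672",
    "NAT gateway 2638",
    "public internet 2654",
    "internet gateway 2634",
    "LB subnet 2622",
    "app subnet 2626",
    "service gateway 2636",
    "cloud service 2656",
]


def trace_request(request: dict, is_valid: Callable[[dict], bool]) -> List[str]:
    """Return the hops the request visits, stopping at the LB subnet if invalid."""
    path = []
    for hop in HOPS:
        path.append(hop)
        if hop == "LB subnet 2622" and not is_valid(request):
            # Assumption: an invalid request is not forwarded any further.
            break
    return path


if __name__ == "__main__":
    for hop in trace_request({"service": "example"}, lambda r: "service" in r):
        print(hop)
```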
[0275] It should be understood that the IaaS architectures 2300, 2400, 2500, 2600 shown in the figures may include components other than the components shown. Further, the embodiments shown in the figures are merely some examples of cloud infrastructure systems that can incorporate embodiments of the present disclosure. In some other embodiments, the IaaS system may include more or fewer components than the components shown in the figures, combine two or more components, or may have different configurations or arrangements of components.
[0276] In one embodiment, the IaaS systems described herein may include the provision of a series of application, middleware, and database services that are delivered to customers in a self-service, subscription-based, elastically scalable, reliable, highly available, and secure manner. An example of such an IaaS system is Oracle Cloud Infrastructure (OCI) provided by the present assignee.
[0277] FIG. 27 shows an exemplary computer system 2700 in which various embodiments may be implemented. System 2700 may be used to implement any of the computer systems described above. As shown in the figure, computer system 2700 includes a processing unit 2704 that communicates with a plurality of peripheral subsystems via a bus subsystem 2702. These peripheral subsystems may include a processing acceleration unit 2706, an I / O subsystem 2708, a storage subsystem 2718, and a communication subsystem 2724. Storage subsystem 2718 includes a tangible computer-readable storage medium 2722 and system memory 2710.
[0278] The bus subsystem 2702 provides a mechanism for the various components and subsystems of the computer system 2700 to communicate with each other as intended. The bus subsystem 2702 is shown schematically as a single bus, although alternative embodiments of the bus subsystem may utilize multiple buses. The bus subsystem 2702 can be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus that uses any of a variety of bus architectures. For example, such architectures can include an ISA (Industry Standard Architecture) bus, an MCA (Micro Channel Architecture) bus, an EISA (Enhanced ISA) bus, a VESA (Video Electronics Standards Association) local bus, and a PCI (Peripheral Component Interconnect) bus implemented as a mezzanine bus manufactured to the IEEE P1386.1 standard.
[0279] The processing unit 2704, which may be implemented as one or more integrated circuits (e.g., conventional microprocessors or microcontrollers), controls the operation of the computer system 2700. One or more processors may be included in the processing unit 2704. These processors can include single-core processors or multi-core processors. In certain embodiments, the processing unit 2704 may be implemented as one or more independent processing units 2732 and / or 2734, with a single-core processor or multi-core processor included in each processing unit. In other embodiments, the processing unit 2704 may be implemented as a quad-core processing unit formed by integrating two dual-core processors on a single chip.
[0280] In various embodiments, the processing unit 2704 can execute various programs according to program code and can maintain multiple programs or processes running simultaneously. At any given time, some or all of the program code being executed can be present in the processing unit 2704 and / or the storage subsystem 2718. With appropriate programming, the processing unit 2704 can provide the various functions described above. The computer system 2700 can further include a processing acceleration unit 2706 that can include a digital signal processor (DSP), an application specific processor, and / or the like.
[0281] The I / O subsystem 2708 may include user interface input devices and user interface output devices. User interface input devices may include a keyboard, a pointing device such as a mouse or trackball, a touchpad or touch screen incorporated in a display, a scroll wheel, a click wheel, a dial, a button, a switch, a keypad, a voice input device having a voice command recognition system, a microphone, and other types of input devices. The user interface input devices may include motion detection devices and / or gesture recognition devices, such as a Microsoft Kinect (registered trademark) motion sensor, that enable a user to control and interact with an input device such as a Microsoft Xbox (registered trademark) 360 game controller via a natural user interface using gestures and spoken commands. The user interface input devices may also include eye gesture recognition devices such as a Google Glass (registered trademark) blink detector that detects a user's eye activity (e.g., a "blink" when taking a photo and / or selecting a menu) and converts the eye gesture into an input to an input device (e.g., Google Glass (registered trademark)). Further, the user interface input devices may include a voice recognition sensing device that enables a user to interact with a voice recognition system (e.g., a Siri (registered trademark) navigator) via voice commands.
[0282] The user interface input devices may also include, but are not limited to, 3D mice, joysticks or pointing sticks, game pads, graphics tablets, and audio / visual devices such as speakers, digital cameras, digital video cameras, portable media players, web cameras, image scanners, fingerprint scanners, barcode readers, 3D scanners, 3D printers, laser rangefinders, and eye tracking devices. Further, the user interface input devices may include, for example, medical imaging input devices such as computed tomography, magnetic resonance imaging, positron emission tomography, and medical ultrasonography devices. The user interface input devices may also include, for example, audio input devices such as MIDI keyboards and digital musical instruments.
[0283] The user interface output devices may include, but are not limited to, a display subsystem, indicator lights, or non-visual displays such as audio output devices. The display subsystem may be a cathode ray tube (CRT), a flat panel device such as one using a liquid crystal display (LCD) or a plasma display, a projection device, a touch screen, or the like. Generally, the use of the term "output device" is intended to include all possible types of devices and mechanisms for outputting information from the computer system 2700 to the user or another computer. For example, the user interface output devices may include, but are not limited to, various display devices for visually transmitting text information, graphics information, and audio / video information, such as monitors, printers, speakers, headphones, car navigation systems, plotters, audio output devices, and modems.
[0284] Computer system 2700 may include a storage subsystem 2718 that provides a tangible, non-transitory computer-readable storage medium for storing software and data structures that provide the functionality of the embodiments described in this disclosure. The software can include programs, code modules, instructions, scripts, etc., and when executed by one or more cores or processors of processing unit 2704, provides the aforementioned functionality. The storage subsystem 2718 may also provide a repository for storing data used in accordance with this disclosure.
[0285] As shown in the example of FIG. 27, the storage subsystem 2718 can include various components including a system memory 2710, a computer-readable storage medium 2722, and a computer-readable storage medium reader 2720. The system memory 2710 may store program instructions that are readable and executable by the processing unit 2704. The system memory 2710 may also store data used during the execution of the instructions and / or data generated during the execution of the program instructions. Various different types of programs may be loaded into the system memory 2710 including, but not limited to, client applications, web browsers, middle-tier applications, relational database management systems (RDBMS), virtual machines, containers, etc.
[0286] System memory 2710 may also store an operating system 2716. Examples of operating systems 2716 include Microsoft Windows®, Apple Macintosh®, and / or Linux operating systems, various commercially available UNIX® or UNIX-like operating systems (including, but not limited to, various GNU / Linux operating systems, Google Chrome® OS, etc.), and / or various versions of mobile operating systems such as iOS, Windows® Phone, Android® OS, BlackBerry® OS, and Palm® OS. In some implementations where computer system 2700 runs one or more virtual machines, the virtual machines may be loaded into system memory 2710 along with guest operating systems (GOS) and executed by one or more processors or cores of processing unit 2704.
[0287] System memory 2710 may be provided in different configurations depending on the type of computer system 2700. For example, system memory 2710 may be volatile memory (such as random access memory (RAM)) and / or non-volatile memory (such as read-only memory (ROM), flash memory, etc.). Various types of RAM configurations may be provided, including static random access memory (SRAM), dynamic random access memory (DRAM), etc. In some implementations, system memory 2710 may include a basic input / output system (BIOS) that contains basic routines that help transfer information between elements within computer system 2700 during startup and the like.
[0288] The computer-readable storage medium 2722 temporarily and / or more persistently contains and stores computer-readable information for use by the computer system 2700, including instructions executable by the processing unit 2704 of the computer system 2700. In addition to a storage medium for storing this information, the computer-readable storage medium 2722 can represent a remote storage device, a local storage device, a fixed storage device, and / or a removable storage device.
[0289] The computer-readable storage medium 2722 can include any suitable medium known or used in the art, including storage media and communication media, such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing and / or transmitting information. The computer-readable storage medium 2722 can include tangible computer-readable storage media such as RAM, ROM, electrically erasable programmable ROM (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disk (DVD), or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage, or other magnetic storage devices, or other tangible computer-readable media.
[0290] As an example, computer-readable storage medium 2722 can include a hard disk drive that reads from or writes to a non-removable non-volatile magnetic medium, a magnetic disk drive that reads from or writes to a removable non-volatile magnetic disk, and an optical disk drive that reads from or writes to a removable non-volatile optical disk such as a CD ROM, DVD, and Blu-ray (registered trademark) disk, or other optical medium. Computer-readable storage medium 2722 can include, but is not limited to, Zip (registered trademark) drives, flash memory cards, universal serial bus (USB) flash drives, secure digital (SD) cards, DVD disks, digital video tapes, etc. Computer-readable storage medium 2722 can also include solid-state drives (SSDs) based on non-volatile memory, such as flash memory-based SSDs, enterprise flash drives, and solid-state ROMs; SSDs based on volatile memory, such as solid-state RAM, dynamic RAM, static RAM, and DRAM-based SSDs; magnetoresistive RAM (MRAM) SSDs; and hybrid SSDs that use a combination of DRAM and flash memory-based SSDs. Disk drives and associated computer-readable media can provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data of computer system 2700.
[0291] Machine-readable instructions executable by one or more processors or cores of processing unit 2704 may be stored on a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium can include a physically tangible memory or storage device, including a volatile memory storage device and / or a non-volatile storage device. Examples of non-transitory computer-readable storage media include magnetic storage media (e.g., disks or tapes), optical storage media (e.g., DVDs, CDs), various types of RAM, ROM, or flash memory, hard drives, floppy (registered trademark) drives, removable memory drives (e.g., USB drives), or other types of storage devices.
[0292] The communication subsystem 2724 provides an interface to other computer systems and networks. The communication subsystem 2724 functions as an interface for receiving data from other systems and for transmitting data from the computer system 2700 to other systems. For example, the communication subsystem 2724 may enable the computer system 2700 to connect to one or more devices via the Internet. In some embodiments, the communication subsystem 2724 can include radio frequency (RF) transceiver components for accessing wireless voice and / or data networks (e.g., using cellular phone technology, advanced data network technologies such as 3G, 4G, or EDGE (enhanced data rates for global evolution), WiFi (registered trademark) (IEEE 802.11 family of standards), or other mobile communication technologies, or any combination thereof), global positioning system (GPS) receiver components, and / or other components. In some embodiments, the communication subsystem 2724 can provide a wired network connection (such as Ethernet (registered trademark)) in addition to, or instead of, the wireless interface.
[0293] In some embodiments, the communication subsystem 2724 may also receive input communications on behalf of one or more users who may use the computer system 2700, in the form of structured and / or unstructured data feeds 2726, event streams 2728, event updates 2730, and the like.
[0294] As an example, the communication subsystem 2724 can be configured to receive the data feed 2726 in real time from social networks such as Twitter (registered trademark) feeds, Facebook (registered trademark) updates, and / or other communication services, web feeds such as Rich Site Summary (RSS) feeds, and / or real-time updates from one or more third-party information sources.
[0295] Furthermore, the communication subsystem 2724 may be configured to receive data in the form of a continuous data stream, which can include an event stream 2728 and / or event updates 2730 of real-time events that have no explicit end, are essentially continuous, or need not have boundaries. Examples of applications that generate continuous data can include, for example, sensor data applications, financial tickers, network performance measurement tools (e.g., network monitoring and traffic management applications), clickstream analysis tools, automotive traffic monitoring, and the like.
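Because such a stream has no explicit end, a consumer typically processes events one at a time instead of waiting for the whole input. The following is a minimal sketch of that idea using Python generators. The `sensor_events` source and the `rolling_mean` consumer are hypothetical examples introduced here, not components of the communication subsystem 2724.

```python
import itertools
from collections import deque
from typing import Iterable, Iterator


def sensor_events() -> Iterator[float]:
    """Hypothetical unbounded source that yields readings forever."""
    for n in itertools.count():
        yield float(n % 5)


def rolling_mean(events: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` events as each one arrives."""
    recent = deque(maxlen=window)  # older readings fall out automatically
    for value in events:
        recent.append(value)
        yield sum(recent) / len(recent)


if __name__ == "__main__":
    # The stream never ends, so the consumer decides how much of it to take.
    for mean in itertools.islice(rolling_mean(sensor_events(), window=3), 5):
        print(mean)
```

Memory use stays bounded by the window size regardless of how long the stream runs, which is why a fixed-length `deque` is used here in place of an ever-growing list.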
[0296] The communication subsystem 2724 can also be configured to output structured and / or unstructured data feeds 2726, event streams 2728, event updates 2730, etc. to one or more databases that can communicate with one or more streaming data source computers coupled to the computer system 2700.
[0297] The computer system 2700 can be one of various types, including a handheld portable device (e.g., an iPhone (registered trademark) cellular phone, an iPad (registered trademark) computing tablet, a PDA), a wearable device (e.g., a Google Glass (registered trademark) head-mounted display), a PC, a workstation, a mainframe, a ticket vending machine, a server rack, or any other data processing system.
[0298] Due to the constantly changing nature of computers and networks, the description of the computer system 2700 shown in the figures is merely intended to be a specific example. Many other configurations are possible that include more or fewer components than the system shown in the figures. For example, customized hardware may be used and / or certain elements may be implemented in hardware, firmware, software (including applets), or combinations thereof. Additionally, connections to other computing devices such as network input / output devices may be employed. Based on the disclosure and teachings provided herein, one of ordinary skill in the art will understand other methods and / or ways to implement various embodiments.
[0299] Although specific embodiments have been described, various modifications, changes, alternative structures, and equivalents are also included within the scope of the present disclosure. Embodiments are not limited to operating within a particular data processing environment and can operate freely within multiple data processing environments. Further, although embodiments have been described using a particular series of transactions and steps, it should be apparent to one of ordinary skill in the art that the scope of the present disclosure is not limited to the series of transactions and steps described. The various features and aspects of the foregoing embodiments may be used individually or together.
[0300] Furthermore, although embodiments have been described using specific combinations of hardware and software, it should be recognized that other combinations of hardware and software are within the scope of the present disclosure. Embodiments may be implemented using only hardware, only software, or combinations thereof. The various processes described herein may be implemented on the same processor or different processors in any combination. Thus, when a component or service is described as being configured to perform an operation, such a configuration may be realized, for example, by designing an electronic circuit to perform this operation, by programming a programmable electronic circuit (such as a microprocessor) to perform this operation, or by any combination thereof. Processes can communicate using a variety of techniques including, but not limited to, conventional techniques for interprocess communication, and different pairs of processes may use different techniques, or the same pair of processes may use different techniques at different times.
[0301] Embodiments may be implemented by using a computer program product that includes computer programs / instructions, which, when executed by a processor, cause the processor to perform any of the methods described in the present disclosure.
[0302] Accordingly, the specification and drawings are to be regarded as illustrative rather than restrictive. However, it is clear that additions, deletions, omissions, as well as other modifications and changes can be made without departing from the broader ideas and scope as set forth in the claims. Thus, while specific embodiments of the disclosure have been described, these are not intended to be limiting. Various changes and equivalents are within the scope of the appended claims.
[0303] In the context of describing the disclosed embodiments, particularly in the context of the appended claims, the terms "a," "an," and "the," and the use of similar referents should be interpreted to cover both the singular and the plural, unless specifically indicated otherwise herein or clearly contradicted by the context. The terms "comprising," "having," "including," and "containing" should be construed as open-ended terms (i.e., meaning "including, but not limited to") unless otherwise noted. The term "connected" should be interpreted as partly or wholly contained within, attached to, or joined together, even in the presence of intervening elements. The recitation of a range of values herein is merely intended to serve as a convenient way to refer individually to each separate value that falls within the range, and each separate value is incorporated herein as if it were individually recited herein. All methods described herein can be performed in any suitable order, unless specifically indicated otherwise herein or clearly contradicted by the context. The use of any examples, or exemplary language (e.g., "such as") provided herein is merely intended to better clarify the embodiments and does not impose a limitation on the scope of the disclosure, unless otherwise claimed. No language in this specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.
[0304] Disjunctive language, such as the phrase "at least one of X, Y, or Z," is generally intended, unless otherwise explicitly stated, to convey that items, conditions, etc. can be any one of X, Y, or Z, or any combination thereof (e.g., X, Y, and / or Z) within the context in which it is used. Thus, such disjunctive language is not generally intended, and should not be taken, to mean that a particular embodiment requires at least one of X, at least one of Y, and at least one of Z to each be present.
[0305] In this specification, preferred embodiments of the present disclosure are described, including the best mode known to the applicant for carrying out the present disclosure. Variations of such preferred embodiments may become apparent to those skilled in the art when reading the foregoing description. Those skilled in the art should be able to adopt such variations as needed, and the present disclosure may be practiced otherwise than as specifically described herein. Accordingly, the present disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Further, any combination of the foregoing elements in all possible variations of the embodiments is included in the present disclosure unless otherwise specifically indicated herein.
[0306] All references, including publications, patent applications, and patents cited herein, are hereby incorporated by reference in their entirety to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
[0307] In the foregoing specification, aspects of the present disclosure have been described with reference to specific embodiments. Those skilled in the art will recognize that the present disclosure is not limited thereto. The various features and aspects of the foregoing disclosure may be used individually or together. Furthermore, embodiments may be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the present specification. Accordingly, the present specification and drawings are to be regarded as illustrative rather than restrictive.
[0308] Item 1: A method comprising: performing, by a computing system after encountering a trigger event, inter-region replication between a source file system in a source region and a target file system in a target region, the source region and the target region being different regions; receiving, by the computing system after the inter-region replication, a request to reuse the source file system as a primary region, the primary region being the region that was operating before the trigger event occurred; transmitting, by the computing system, replication-related information between the source file system and the target file system; identifying, by the computing system, a resumable base snapshot in the source file system, the resumable base snapshot being configured to enable the source file system to operate properly after the trigger event; and performing, by the source file system of the computing system, operations using the resumable base snapshot without depending on the target file system.
[0309] Item 2: The method according to Item 1, wherein the replication-related information includes identification information for inter-region replication, unique identification information for a snapshot in the source file system, and unique identification information for a snapshot in the target file system.
[0310] Item 3: The method according to Item 1 or Item 2, wherein the resumable base snapshot in the source file system is the source snapshot that has successfully created a replica in the target file system through inter-region replication.
[0311] Item 4: The method according to Item 3, wherein the replica in the target file system has the same unique identification information as the source snapshot in the source file system.
[0312] Item 5: The method according to any of the preceding items, wherein the resumable base snapshot in the source file system is a replica of the latest snapshot in the target file system after inter-region reverse replication between the source file system and the target file system.
[0313] Item 6: The method according to Item 5, wherein the inter-region reverse replication includes: identifying the latest common snapshot between the source file system and the target file system; the target file system generating a difference within the target file system based at least in part on the difference between the latest snapshot and the latest common snapshot; transferring the difference from the target file system to the source file system; and the source file system applying the difference to the latest common snapshot in the source file system to generate a resumable base snapshot within the source file system.
[0314] Item 7: The method according to Item 6, wherein transferring the difference from the target file system to the source file system further comprises uploading the difference from the target file system to an object storage located in the source region and downloading the difference from the object storage to the source file system.
[0315] Item 8: A non-transitory computer-readable medium storing computer-executable instructions which, when executed by one or more processors of a computing system, cause the one or more processors to perform operations comprising: performing, by the computing system after encountering a trigger event, inter-region replication between a source file system in a source region and a target file system in a target region, the source region and the target region being different regions; receiving, by the computing system after the inter-region replication, a request to reuse the source file system as a primary region, the primary region being the region that was operating before the trigger event occurred; transmitting, by the computing system, replication-related information between the source file system and the target file system; identifying, by the computing system, a resumable base snapshot in the source file system, the resumable base snapshot being configured to enable the source file system to operate properly after the trigger event; and performing, by the source file system of the computing system, operations using the resumable base snapshot without depending on the target file system.
[0316] Item 9: The non-transitory computer-readable medium according to Item 8, wherein the replication-related information includes identification information for inter-region replication, unique identification information for a snapshot in the source file system, and unique identification information for a snapshot in the target file system.
[0317] Item 10: The non-transitory computer-readable medium according to Item 8 or Item 9, wherein the resumable base snapshot in the source file system is the source snapshot that has successfully created a replica in the target file system by inter-region replication.
[0318] Item 11: The non-transitory computer-readable medium according to Item 10, wherein the replica in the target file system has the same unique identification information as the source snapshot in the source file system.
[0319] Item 12: The non-transitory computer-readable medium according to any one of Items 8 to 11, wherein the resumable base snapshot in the source file system is a replica of the latest snapshot in the target file system after inter-region reverse replication between the source file system and the target file system.
[0320] Item 13: The non-transitory computer-readable medium according to Item 12, wherein the inter-region reverse replication includes: identifying the latest common snapshot between the source file system and the target file system; the target file system generating a difference within the target file system based at least in part on the difference between the latest snapshot and the latest common snapshot in the target file system; transferring the difference from the target file system to the source file system; and the source file system applying the difference to the latest common snapshot in the source file system to generate a resumable base snapshot within the source file system.
[0321] Item 14: The non-transitory computer-readable medium according to item 13, wherein transferring the differences from the target file system to the source file system further includes uploading the differences from the target file system to an object storage located in the source region and downloading the differences from the object storage to the source file system.
[0322] Item 15: A system comprising: one or more processors; and one or more computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the system to: perform, after encountering a trigger event, inter-region replication between a source file system in a source region and a target file system in a target region, the source region and the target region being different regions; receive, after the inter-region replication, a request to reuse the source file system as a primary region, the primary region being the region that was operating before the trigger event occurred; transmit replication-related information between the source file system and the target file system; identify a resumable base snapshot within the source file system, the resumable base snapshot being configured to enable the source file system to operate properly after the trigger event; and perform, by the source file system, operations using the resumable base snapshot without depending on the target file system.
[0323] Item 16: The system according to Item 15, wherein the replication-related information includes identification information for inter-region replication, unique identification information for a snapshot in the source file system, and unique identification information for a snapshot in the target file system.
[0324] Item 17: The system according to Item 15 or 16, wherein the resumable base snapshot in the source file system is a source snapshot that has successfully created a replica in the target file system by inter-region replication.
[0325] Item 18: The system according to any one of Items 15 to 17, wherein the resumable base snapshot in the source file system is a replica of the latest snapshot in the target file system after inter-region reverse replication between the source file system and the target file system.
[0326] Item 19: The system according to Item 18, wherein the inter-region reverse replication includes: identifying the latest common snapshot between the source file system and the target file system; the target file system generating a difference within the target file system based at least in part on the difference between the latest snapshot and the latest common snapshot; transferring the difference from the target file system to the source file system; and the source file system applying the difference to the latest common snapshot in the source file system to generate a resumable base snapshot in the source file system.
[0327] Item 20: The system according to Item 19, wherein transferring the difference from the target file system to the source file system further includes uploading the difference from the target file system to an object storage located in the source region and downloading the difference from the object storage to the source file system.
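As a non-limiting illustration of the inter-region reverse replication recited in Items 6, 7, 13, 14, 19, and 20, the steps may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: snapshots are modeled as in-memory path-to-content mappings, the object storage in the source region is modeled as a plain dictionary, and the names `Snapshot`, `FileSystem`, `generate_delta`, `apply_delta`, and `reverse_replicate` are hypothetical. Reusing the target snapshot's identifier for the resulting base snapshot is likewise an assumption, made by analogy with Item 4.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    snap_id: str   # unique identification information (hypothetical format)
    files: dict    # path -> content; a stand-in for real snapshot data


@dataclass
class FileSystem:
    snapshots: list = field(default_factory=list)  # ordered oldest to newest

    def by_id(self, snap_id):
        return next(s for s in self.snapshots if s.snap_id == snap_id)


def latest_common_snapshot_id(source, target):
    """Return the newest snapshot ID present in both file systems, or None."""
    source_ids = {s.snap_id for s in source.snapshots}
    for snap in reversed(target.snapshots):
        if snap.snap_id in source_ids:
            return snap.snap_id
    return None


def generate_delta(base, latest):
    """Difference between the latest snapshot and the common base snapshot."""
    upserts = {p: c for p, c in latest.files.items() if base.files.get(p) != c}
    deletes = sorted(p for p in base.files if p not in latest.files)
    return {"upserts": upserts, "deletes": deletes}


def apply_delta(base, delta, new_id):
    """Apply a difference to a base snapshot, producing a new snapshot."""
    files = dict(base.files)
    files.update(delta["upserts"])
    for path in delta["deletes"]:
        files.pop(path, None)
    return Snapshot(new_id, files)


def reverse_replicate(source, target, object_store):
    """Build a resumable base snapshot in the source from the target's latest."""
    common_id = latest_common_snapshot_id(source, target)
    if common_id is None:
        raise ValueError("no common snapshot; a full copy would be required")
    latest = target.snapshots[-1]
    # The target generates the difference against the latest common snapshot.
    delta = generate_delta(target.by_id(common_id), latest)
    # Upload to object storage in the source region, then download from it.
    key = f"delta/{common_id}/{latest.snap_id}"
    object_store[key] = delta
    downloaded = object_store[key]
    # The source applies the difference to its own copy of the common snapshot.
    base = apply_delta(source.by_id(common_id), downloaded, latest.snap_id)
    source.snapshots.append(base)
    return base
```

In this sketch only the difference crosses the region boundary, which reflects the stated benefit of resuming from the most recent common snapshot instead of copying the whole file system.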
Claims
1. A method comprising: performing, by a computing system, inter-region replication between a source file system and a target file system, wherein the source file system and the target file system are located in different regions; receiving, by the computing system, a request to terminate the inter-region replication between the source file system and the target file system; synchronizing, by the computing system, operations in the source file system and operations in the target file system by using a first set of states and a second set of states, wherein the operations in the source file system include performing resource cleanup in the source file system, and the operations in the target file system include performing resource cleanup in the target file system; and initiating, by the computing system, a new inter-region replication between the source file system and the target file system after the resource cleanup in the source file system and the resource cleanup in the target file system.
2. The method according to claim 1, wherein the first set of states tracks the management and use of resources and is visible to a customer.
3. The method according to claim 1, wherein the second set of states tracks ownership of replication-related jobs for the components of the source file system and the target file system, and is invisible to a customer.
4. The method according to claim 1, wherein performing the resource cleanup in the source file system and the target file system uses the first set of states, and wherein the resource cleanup in the source file system and the target file system uses a first subset of the second set of states when the request to terminate the inter-region replication is initiated by the source file system, and uses a second subset of the second set of states when the request to terminate the inter-region replication is initiated by the target file system.
5. The method according to claim 1, wherein the request to terminate the inter-region replication is initiated by the source file system.
6. The method according to claim 5, wherein the resource cleanup in the source file system and the resource cleanup in the target file system are performed simultaneously.
7. The method according to claim 1, wherein the request to terminate the inter-region replication is initiated by the target file system.
8. The method according to claim 7, wherein the resource cleanup in the target file system is performed and completed before the resource cleanup in the source file system is started.
9. A program that causes one or more processors of a computing system to execute the method according to any one of claims 1 to 8.
10. A system comprising: one or more processors; and one or more computer-readable media storing executable instructions that, when executed by the one or more processors, cause the system to: perform inter-region replication between a source file system and a target file system, wherein the source file system and the target file system are located in different regions; receive a request to terminate the inter-region replication between the source file system and the target file system; synchronize operations in the source file system and operations in the target file system by using a first set of states and a second set of states, wherein the operations in the source file system include performing resource cleanup in the source file system, and the operations in the target file system include performing resource cleanup in the target file system; and initiate a new inter-region replication between the source file system and the target file system after the resource cleanup in the source file system and the resource cleanup in the target file system.
11. The system according to claim 10, wherein the first set of states tracks the management and use of resources and is visible to a customer.
12. The system according to claim 10 or 11, wherein the second set of states tracks ownership of replication-related jobs for the components of the source file system and the target file system, and is invisible to a customer.
13. The system according to claim 10 or 11, wherein performing the resource cleanup in the source file system and the target file system uses the first set of states, and wherein the resource cleanup in the source file system and the target file system uses a first subset of the second set of states when the request to terminate the inter-region replication is initiated by the source file system, and uses a second subset of the second set of states when the request to terminate the inter-region replication is initiated by the target file system.
14. The system according to claim 10 or 11, wherein the request to terminate the inter-region replication is initiated by the source file system, and the resource cleanup in the source file system and the resource cleanup in the target file system are performed simultaneously.
15. The system according to claim 10 or 11, wherein the request to terminate the inter-region replication is initiated by the target file system, and the resource cleanup in the target file system is performed and completed before the resource cleanup in the source file system is started.
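By way of a non-limiting illustration, the coordination recited in claims 1 to 8 may be sketched as follows. The sketch assumes a customer-visible lifecycle enumeration for the first set of states and an internal job-ownership enumeration for the second set. All state names, the `Side` class, and `terminate_and_restart` are hypothetical, and threads within one process stand in for the two regions. A source-initiated termination cleans up both sides concurrently; a target-initiated termination completes the target cleanup before the source cleanup starts.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum


class Lifecycle(Enum):
    """First set of states: customer-visible (names are hypothetical)."""
    ACTIVE = "active"
    DELETING = "deleting"
    DELETED = "deleted"


class JobOwner(Enum):
    """Second set of states: internal job ownership, hidden from the customer."""
    REPLICATING = "replicating"
    SOURCE_AND_TARGET_CLEANUP = "source_and_target_cleanup"  # source-initiated subset
    TARGET_CLEANUP = "target_cleanup"                        # target-initiated subset
    SOURCE_CLEANUP = "source_cleanup"                        # target-initiated subset
    IDLE = "idle"


class Side:
    """One end of the replication (source or target file system)."""

    def __init__(self, name):
        self.name = name
        self.lifecycle = Lifecycle.ACTIVE
        self.resources = ["snapshots", "delta objects"]

    def cleanup(self, log):
        self.lifecycle = Lifecycle.DELETING
        self.resources.clear()
        self.lifecycle = Lifecycle.DELETED
        log.append(self.name)  # records the order in which cleanups finish


def terminate_and_restart(source, target, initiator):
    """Clean up both sides, then start a new replication."""
    log, trace = [], [JobOwner.REPLICATING]
    if initiator == "source":
        # Source-initiated: both cleanups run at the same time.
        trace.append(JobOwner.SOURCE_AND_TARGET_CLEANUP)
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(side.cleanup, log) for side in (source, target)]
            for future in futures:
                future.result()  # re-raise any cleanup error
    elif initiator == "target":
        # Target-initiated: target cleanup completes before source cleanup starts.
        trace.append(JobOwner.TARGET_CLEANUP)
        target.cleanup(log)
        trace.append(JobOwner.SOURCE_CLEANUP)
        source.cleanup(log)
    else:
        raise ValueError(f"unknown initiator: {initiator!r}")
    if not all(s.lifecycle is Lifecycle.DELETED for s in (source, target)):
        raise RuntimeError("cleanup incomplete; refusing to start a new replication")
    trace.append(JobOwner.IDLE)
    # Only after both cleanups finish is the new replication initiated.
    for side in (source, target):
        side.lifecycle = Lifecycle.ACTIVE
    trace.append(JobOwner.REPLICATING)
    return log, trace
```

Keeping the two enumerations separate mirrors the distinction in claims 2 and 3: a customer would observe only the `Lifecycle` values, while the `JobOwner` trace records which component owns the work at each step.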