Node failure detection and resolution in distributed databases

A leaderless fault resolution process in distributed databases maintains data integrity by identifying and shutting down non-essential nodes, ensuring only the largest connected component remains operational, addressing node and network failures without data loss.

JP7872827B2 · Active · Publication Date: 2026-06-10 · NUODB INC

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: NUODB INC
Filing Date: 2024-11-25
Publication Date: 2026-06-10

Application Information

IPC: H04L41/0677; H04L43/02; H04L41/142; H04L41/085

CPC: G01R31/08; H04L41/0631; H04L69/40; G06F11/181; G06F16/27; H04L12/1877; H04L41/0686; H04L43/0811

AI Tagging

Application Domain

Error detection/correction Database distribution/replication

AI Technical Summary

Technical Problem

Distributed database systems face communication interruptions due to node failures, network failures, or inconsistent states, leading to potential data inconsistency and loss of connectivity between nodes.

Method used

A leaderless fault resolution process in distributed databases that identifies the largest group of directly connected nodes and shuts down nodes not part of this group, maintaining data integrity and connectivity by ensuring only the largest connected component remains operational.

Benefits of technology

This process ensures data integrity and maintains connectivity without blocking or requiring a leader node, effectively handling various network failure scenarios, including partial connectivity and membership changes, while avoiding data loss.

Abstract

PROBLEM TO BE SOLVED: To provide methods and systems to detect and resolve failure in a distributed database system.

SOLUTION: A first node in a distributed database system can detect an interruption in communication with at least one other node in the distributed database system. This indicates a network failure. In response to detecting this failure, the first node starts a failure resolution protocol. This invokes coordinated broadcasts of respective lists of suspicious nodes among neighbor nodes. Each node compares its own list of suspicious nodes with its neighbors' lists of suspicious nodes to determine which nodes are still directly connected to each other. Each node determines the largest group of these directly connected nodes and whether or not it is in that group. If a node is not in that group, it fails itself to resolve the network failure.

SELECTED DRAWING: Figure 1

Description

[Technical Field]

[0001] Cross-reference to related applications: This application claims priority under 35 U.S.C. § 119(e) to U.S. Patent Application No. 62/800,009, entitled “Node Failure Detection and Resolution,” filed on 1 February 2019, the disclosure of which is incorporated herein by reference in its entirety.

[Background Art]

[0002] In a distributed database, data and metadata are stored across multiple nodes that communicate with each other. However, communication between nodes can be interrupted. For example, a node in a distributed database system may crash, or may stop itself if it finds that it is in an inconsistent state. In another example, a virtual machine or process running on a node in a distributed database system may crash or cease to function. In yet another example, a communication link between a first node and a second node in a distributed database system may fail. For example, a network (e.g., a local area network, a wide area network, Ethernet, etc.) connecting two or more nodes in a distributed database system may fail, interrupting communication between the nodes.

[Summary of the Invention]

[0003] This specification describes a distributed database system. The distributed database system may include a plurality of nodes. Each node in the plurality of nodes may include a corresponding processor and a corresponding memory, and may be connected to every other node in the plurality of nodes. The processor of a first node in the plurality of nodes may be configured to resolve a failure in the distributed database system by: identifying a suspected node in the plurality of nodes; broadcasting a first list of suspected nodes to neighboring nodes in the plurality of nodes; receiving a second list of suspected nodes from at least one other neighboring node; determining, based on connection information, whether the first node is in a winning fully connected component of the distributed database; continuing operation of the first node in response to determining that the first node is in the winning fully connected component; and shutting down the first node to resolve the failure in response to determining that the first node is not in the winning fully connected component. A suspected node may be a node in the plurality of nodes that has lost its connection to the first node as a result of a failure in the distributed database system. The first list of suspected nodes may include the suspected node. Neighboring nodes may be nodes in the plurality of nodes that remain directly connected to the first node after a network failure. A winning fully connected component may contain more than half of the nodes in the plurality of nodes, with each node in the winning fully connected component directly connected to every other node in that component.

[0004] This specification also describes a method for resolving a failure in a distributed database. The distributed database may include a plurality of nodes, and each node in the plurality of nodes may be directly connected to every other node in the plurality of nodes. The method may include: detecting a communication interruption between a first node in the plurality of nodes and a second node in the plurality of nodes; in response to detecting the interruption, initiating a coordinated broadcast of respective lists of suspected nodes among neighboring nodes in the plurality of nodes; determining connection information based on the lists of suspected nodes; and resolving the failure based at least in part on the connection information. Neighboring nodes may be nodes in the plurality of nodes that remain directly connected to the first node. The list of suspected nodes of the first node includes the second node.

[0005] This specification further describes a method for resolving a failure in a distributed database. The distributed database may include a plurality of nodes, and each node in the plurality of nodes may be connected to every other node in the plurality of nodes. In response to detecting the failure, the method may include: determining whether a first node in the plurality of nodes is directly connected to at least half of the nodes in the plurality of nodes; in response to determining that the first node is directly connected to fewer than half of the nodes, shutting down the first node to at least partially resolve the failure; in response to determining that the first node is directly connected to at least half of the nodes, broadcasting a first list of suspected nodes to neighboring nodes in the plurality of nodes; receiving a second list of suspected nodes from at least one of the neighboring nodes; determining whether the first list of suspected nodes matches the second list of suspected nodes; and, in response to determining that the first list matches the second list, keeping the first node operational to at least partially resolve the failure. In response to determining that the first list does not match the second list, the method may include: broadcasting a first updated list of suspected nodes to the neighboring nodes based on the first list and the second list; receiving at least one second updated list of suspected nodes from at least one of the neighboring nodes; determining connection information for the plurality of nodes based at least in part on the first updated list and the second updated list; determining a winning fully connected component of the distributed database based on the connection information; determining whether the first node is in the winning fully connected component; in response to determining that the first node is in the winning fully connected component, allowing the first node to continue operating to at least partially resolve the failure; and, in response to determining that the first node is not in the winning fully connected component, shutting down the first node to at least partially resolve the failure. The first list of suspected nodes may include nodes that are not directly connected to the first node. Neighboring nodes may be nodes that remain directly connected to the first node after the failure. The winning fully connected component includes more than half of the nodes in the plurality of nodes, and each node in the winning fully connected component is directly connected to every other node in that component.

[0006] All combinations of the aforementioned concepts and further concepts are discussed in more detail below (where such concepts are not mutually contradictory), and they constitute part of the subject matter of the invention disclosed herein. More specifically, all combinations of claimed subject matter at the end of this disclosure constitute part of the subject matter of the invention disclosed herein. Terms used herein, which may also appear in any disclosure incorporated by reference, are given meanings that best correspond to the specific concepts disclosed herein.

[0007] As those skilled in the art will understand, the drawings are primarily for illustrative purposes and are not intended to limit the scope of the subject matter of the invention described herein. The drawings are not necessarily to scale, and in some cases, different aspects of the subject matter of the invention disclosed herein may be exaggerated or enlarged in the drawings to facilitate the understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally similar and/or structurally similar elements).

[Brief Description of the Drawings]

[0008]
[Figure 1] This figure illustrates a process for resolving network failures and restoring full connectivity between nodes in a distributed database.
[Figure 2] This figure illustrates a typical case of network failure in a distributed database system, where a network partition event splits the distributed database system into two disconnected groups of fully connected nodes.
[Figure 3] This figure shows an exemplary distributed database system with three nodes and partial connectivity that can be resolved by the process shown in Figure 1.
[Figure 4] This figure illustrates an exemplary distributed database system with five nodes and two link failures that can be resolved by the process shown in Figure 1.
[Figure 5] This figure illustrates an exemplary distributed database system with five nodes and four link failures that can be resolved by the process shown in Figure 1.
[Figure 6] This figure illustrates the partial connectivity case shown in Figure 5.
[Figure 7] This figure illustrates an exemplary distributed database system with five nodes and three link failures that can be resolved by the process shown in Figure 1.
[Figure 8] This figure illustrates an exemplary distributed database system with five nodes and five link failures that can be resolved by the process shown in Figure 1.
[Figure 9] This figure illustrates an example of a special partial connectivity case with a unidirectional link failure that can be resolved by the process shown in Figure 1.
[Figure 10] This figure shows an example of a network failure during a membership change that can be resolved by the process shown in Figure 1.
[Figure 11] This figure shows an example of the process shown in Figure 1.
[Figure 12] This flowchart shows an extended process for resolving network failures.
[Figure 13] This figure illustrates a membership change in a distributed database system when a network partition separates a new node and an entry node from the remaining nodes.
[Figure 14] This figure shows a variation of the scenario in Figure 13, where a network partition separates a new node and an entry node from the remaining nodes in a distributed database system.
[Figure 15] This figure shows a membership change in a distributed database system when a network partition separates a new node, an entry node, and some peers from the remaining peers.
[Figure 16] This figure shows a membership change in a distributed database system when a network partition separates an entry node from the remaining nodes.
[Figure 17] This figure shows a membership change in a distributed database system when a network partition separates a new node, an entry node, and some peers from the remaining nodes.
[Figure 18] This figure shows another membership change in a distributed database system when a network partition separates a new node, an entry node, and some peers from the remaining nodes.
[Figure 19] This figure shows nodes exchanging failure detection messages to resolve a network failure event.
[Figure 20] This figure shows how node failures are handled while failure detection messages are being exchanged.
[Figure 21] This figure shows how node failures are handled while a failover is being performed.

Best Mode for Carrying Out the Invention

[0009] A distributed database system includes multiple nodes that store fragments of data and / or metadata for a distributed database. All nodes in a distributed database system are directly connected to each other so that they can communicate with one another. However, one or more nodes in a distributed database system may experience communication disruptions due to network failures. These communication disruptions may be caused by failures in communication links between two or more nodes, or by failures of one or more nodes. These failures can be resolved by identifying the nodes that are still directly connected to each other, identifying the largest group of directly connected nodes, and disabling the nodes that are not part of that group, as will be explained in more detail below.

Distributed database system

[0010] A distributed database system can include two types of nodes: transaction engine (TE) nodes that provide user access to the distributed database, and storage manager (SM) nodes that maintain respective disk archives of the entire distributed database. Each storage manager node typically stores a copy of the entire distributed database, whereas a single transaction engine node may contain only the portion of the distributed database necessary to support the transactions currently running on that transaction engine node.

[0011] Each node in a distributed database system has its own processor, memory, and communication interface(s), and can communicate directly with all other nodes in the distributed database system via the database system network. Communication between any two nodes may include the transmission of serialized messages. Serialized messages can follow the Transmission Control Protocol (TCP) or any other suitable messaging protocol.

[0012] Each node in a distributed database system has a unique identifier (e.g., a lexicographical ID) and stores a list of all other nodes in the distributed database system in order of their unique identifiers. Each node uses this list to track the status of all transaction engine nodes and storage manager nodes in the distributed database system. Furthermore, each node can track all database transactions and the location of all database records (i.e., which node stores which data fragments). A node may store this node and transaction information in its respective copy of the master catalog, which contains metadata about the distributed database system and is replicated to all nodes in the database. When a new node joins the distributed database system, it receives a copy of the master catalog from another node, which is called an entry node.

[0013] Tracking database transactions and the locations of database fragments helps the distributed database system maintain atomicity, consistency, isolation, and durability (commonly known as the ACID properties) to ensure the accuracy, completeness, and integrity of data in the distributed database.

Network failure and failure detection

[0014] Each node in a distributed database system sends "heartbeat" messages to all other nodes in the distributed database system at frequent intervals. For example, each node sends a heartbeat message to all other nodes every second or every few seconds. (Optionally, a node that receives a heartbeat message can send an acknowledgment message to the node that sent it.) If there is no communication interruption, all nodes in the distributed database system continue to send heartbeat messages directly to, and receive heartbeat messages directly from, all other nodes in the distributed database system. However, such communication can be interrupted by a network failure. A node that detects a communication interruption (for example, one that has not received a heartbeat message from another node within a given time) initiates a fault resolution protocol to resolve the network failure.

Resolving network failures

[0015] In the fault resolution process described here, nodes in a distributed database regroup themselves in response to a network failure and shut themselves down if they are not part of the largest fully connected group of nodes of majority size, with ties between equally sized groups broken by lexicographical ID ordering. If the largest fully connected group contains fewer than half of the nodes in the distributed database system, all nodes may shut down. Shutting down disconnected or partially connected nodes reduces the likelihood that part or all of the database becomes invalid. The fault resolution process can be performed in a leaderless manner without blocking or aborting ongoing database transactions.

[0016] Figure 1 shows a process 100 for resolving a network failure. Any node in the distributed database system can initiate process 100 in response to detecting a network failure (for example, failure to receive heartbeat messages from other nodes within a given period). At 102, the first node detects the network failure and initiates the failure resolution process 100 by creating a list of "suspected nodes," i.e., nodes that the first node suspects of having failed. For example, the suspect list of the first node is a list of nodes that satisfy one or both of the following conditions: (a) the first node has not received a heartbeat message from the node within a given timeout interval (e.g., pingTimeout seconds), and (b) the operating system has closed the connection(s) between the first node and that node. At this point, if the suspect list of the first node includes all other nodes in the distributed database system, the first node may shut itself down to at least partially resolve the network failure.
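The suspect-list construction at 102 can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: the names PeerState, build_suspect_list, and should_shut_down_immediately, and the 5-second default for pingTimeout, are assumptions introduced here.

```python
from dataclasses import dataclass

PING_TIMEOUT = 5.0  # seconds; hypothetical value for the pingTimeout interval


@dataclass
class PeerState:
    """What the first node tracks about one other node (hypothetical structure)."""
    last_heartbeat: float         # time at which the last heartbeat arrived
    connection_open: bool = True  # False once the OS has closed the connection


def build_suspect_list(peers, now, ping_timeout=PING_TIMEOUT):
    """Return the sorted IDs of peers that satisfy condition (a) or (b)."""
    suspects = []
    for node_id, state in peers.items():
        timed_out = (now - state.last_heartbeat) > ping_timeout  # condition (a)
        closed = not state.connection_open                       # condition (b)
        if timed_out or closed:
            suspects.append(node_id)
    return sorted(suspects)


def should_shut_down_immediately(peers, suspects):
    """True if the node suspects every other node in the system."""
    return len(peers) > 0 and len(suspects) == len(peers)


peers = {
    "TE2": PeerState(last_heartbeat=100.0),
    "SM1": PeerState(last_heartbeat=90.0),
    "SM2": PeerState(last_heartbeat=99.0, connection_open=False),
}
suspects = build_suspect_list(peers, now=101.0)
print(suspects)  # ['SM1', 'SM2']
```

Here SM1 is suspected because its last heartbeat is older than the timeout, and SM2 because its connection was closed; TE2 remains a neighbor, so the node does not shut down at this step.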

[0017] In 104, the first node (i.e., the node that initiated process 100) broadcasts its list of suspected nodes to its neighboring nodes, which are nodes that the first node can still communicate with directly after a network failure. (In the absence of a network failure, in a distributed database, every node is a neighbor of every other node.) The neighboring nodes receive this list of suspected nodes and broadcast their own list of suspected nodes to their neighbors. Depending on the nature of the network failure, the neighboring nodes' list of suspected nodes may be identical to or different from the first node's list of suspected nodes.

[0018] At 106, the first node receives suspect lists from its neighboring nodes and uses them, along with its own suspect list, to construct a connectivity graph. The connectivity graph shows which nodes in the distributed database system the first node is actually directly connected to (i.e., which nodes are actually neighboring nodes of the first node). Other nodes also construct connectivity graphs. Depending on the nature of the network failure, these connectivity graphs may be the same as or different from the connectivity graph of the first node. Similarly, each connectivity graph may be the complement of the corresponding node's suspect list.
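One way to picture the graph construction at 106 is the sketch below. It assumes a link is treated as working unless one of its endpoints reports the other as a suspect; links between two nodes whose lists were never received are therefore assumed to be up. That rule, and the name build_connectivity_graph, are assumptions made for this illustration and are not taken from the patent text.

```python
from itertools import combinations


def build_connectivity_graph(all_nodes, suspect_lists):
    """Build an undirected graph as {node: set of directly connected nodes}.

    suspect_lists maps a node ID to the set of nodes that node suspects.
    Nodes whose list was never received are simply absent from the mapping.
    """
    graph = {node: set() for node in all_nodes}
    for u, v in combinations(sorted(all_nodes), 2):
        u_suspects_v = v in suspect_lists.get(u, set())
        v_suspects_u = u in suspect_lists.get(v, set())
        # Keep the link only if neither endpoint suspects the other.
        if not (u_suspects_v or v_suspects_u):
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Suspect lists for a five-node system in which the SM1-SM2 and
# TE3-SM2 links have failed.
nodes = ["TE1", "TE2", "TE3", "SM1", "SM2"]
suspect_lists = {
    "TE1": set(),
    "TE2": set(),
    "TE3": {"SM2"},
    "SM1": {"SM2"},
    "SM2": {"SM1", "TE3"},
}
graph = build_connectivity_graph(nodes, suspect_lists)
print(sorted(graph["SM2"]))  # ['TE1', 'TE2']
```

In this example SM2 keeps only two direct neighbors, while TE1 and TE2 remain connected to every other node.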

[0019] Each node uses its connectivity graph to identify groups of nodes that remain directly connected to each other after a network failure. Each group of directly connected nodes is called a “fully connected component.” Within a fully connected component, each node continues to communicate with all other nodes within that fully connected component even after a network failure. Once a node has identified the fully connected components in the distributed database system, it determines whether it is part of the “winning fully connected component” (110). If it is not part of the winning fully connected component, the node shuts itself down to resolve the network failure (112). If it is part of the winning fully connected component, it continues to operate (114).

[0020] A winning fully connected component can, but is not required to, contain all data in the database (for example, it does not need to contain a storage manager node). Process 100 does not consider the types of nodes that make up the winning fully connected component. (However, in some cases, the process can be modified to take the types of nodes in each fully connected component into account when determining the winning fully connected component.) If the winning fully connected component does not contain all data in the distributed database, the user may intervene to ensure proper operation.

[0021] Each node can determine whether it is part of the winning fully connected component as follows. First, each node can determine whether it is part of a fully connected component based on its connectivity graph. If it is not part of a fully connected component, it shuts down. However, if the node is part of a fully connected component (or possibly two or more fully connected components), the node determines the size of its fully connected component(s) based on its connectivity graph. If the node determines that it is not part of the largest fully connected component (based on its connectivity graph and the information each node has stored about other nodes in the distributed database system), the node shuts down (112). If the node is part of the largest fully connected component, and that fully connected component includes more than half of the total number of nodes that were in the distributed database system before the network failure, the node remains operational (114). This fully connected component is called the "winning fully connected component" because, at the end of failure resolution process 100, it includes all operational nodes in the distributed database system.

[0022] If a node determines that there are two or more fully connected components that are the same size, that each contain more than half of the nodes in the distributed database, and that are larger than all of the other fully connected components, it performs a tie-breaking process to identify the winning fully connected component. The tie-breaking process may involve sorting the nodes within each fully connected component in order of their unique identifiers. Once the unique identifiers are sorted, the node selects the winning fully connected component based on the lexicographical order of the unique identifiers. For example, a node might select the fully connected component with the smallest node ID following a common prefix as the winning fully connected component.
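The selection and tie-breaking rules described above can be sketched for small systems with a brute-force search. This is an illustrative reading only: the function names are hypothetical, and comparing sorted ID lists lexicographically and taking the smallest is one possible interpretation of the tie-break example, not a rule stated in those words by the patent.

```python
from itertools import combinations


def is_fully_connected(graph, group):
    """True if every pair of nodes in the group is directly connected."""
    return all(v in graph[u] for u, v in combinations(group, 2))


def winning_component(graph, total_nodes):
    """Return the winning fully connected component as a sorted list, or None.

    Only groups with more than half of total_nodes qualify. Among the largest
    qualifying groups, the one whose sorted ID list is lexicographically
    smallest wins (assumed tie-break).
    """
    nodes = sorted(graph)
    # Try sizes from largest down to the smallest strict majority.
    for size in range(len(nodes), total_nodes // 2, -1):
        candidates = [list(group) for group in combinations(nodes, size)
                      if is_fully_connected(graph, group)]
        if candidates:
            return min(candidates)
    return None


def should_stay_up(node, graph, total_nodes):
    """Each node applies the same rule locally and decides for itself."""
    winner = winning_component(graph, total_nodes)
    return winner is not None and node in winner


# Three nodes with the TE1-TE2 link down: two candidate pairs of equal size.
graph = {"TE1": {"SM1"}, "TE2": {"SM1"}, "SM1": {"TE1", "TE2"}}
print(winning_component(graph, total_nodes=3))  # ['SM1', 'TE1']
print(should_stay_up("TE2", graph, total_nodes=3))  # False
```

The exhaustive search over subsets grows exponentially with the number of nodes, so it is suitable only for illustrating the rule on small examples.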

Technical advantages over other failure resolution processes

[0023] The fault resolution process shown in Figure 1 has several differences and advantages compared to other methods for resolving faults in distributed databases. First, unlike blocking processes, the fault resolution process shown in Figure 1 restores complete connectivity by excluding one or more nodes in the distributed database after a network failure. Blocking is undesirable because it can roll back updates made to data in the distributed database. Unlike other methodologies, the process described herein does not involve any kind of blocking mechanism.

[0024] Furthermore, the fault resolution process shown in Figure 1 does not require or use a leader node. In contrast, other methodologies for resolving faults in distributed databases implement a strong leadership model; that is, they rely on a leader node to make fault resolution decisions. The process described herein has no leader node that makes fault resolution decisions. Instead, as described above with respect to Figure 1, any node can initiate the fault resolution process, and each node decides, as part of the process, whether it should shut down or remain operational without instructions from a leader node.

[0025] Unlike blocking and leader-based fault resolution processes, the non-blocking, leaderless fault resolution process described herein can consistently address partial connectivity network failures. In a partial connectivity network failure, network partitioning within a distributed database system can cause a node or set of nodes to communicate only with a subset of nodes within the distributed database system. To address the partial connectivity case, other processes apply a rotational leader model, requiring leaders and informers to use explicit message acknowledgments. In some cases, leadership may constantly shift between nodes experiencing communication disruptions, potentially delaying the resolution of the network failure (sometimes indefinitely).

Various cases of network failures

[0026] Process 100 will not allow two or more unconnected node groups (i.e., different fully connected components) to remain operational after a network failure event. To avoid trivial solutions (e.g., shutting down all nodes), process 100 will, where possible, allow a single node group to remain operational.

[0027] Furthermore, if the user chooses to shut down more than half of the surviving nodes in the distributed database system, process 100 does not necessarily have to shut down the remaining nodes. Process 100 can also handle slow links (i.e., communication paths between two or more nodes with slow connections) in addition to link failures. In other words, process 100 handles slow links and link failures in the same way.

[0028] Figures 2 to 10 illustrate various types of network failures that can be resolved using process 100 in Figure 1.

[0029] Case A: Figure 2 illustrates a typical failure scenario in which a network partition event splits the distributed database system 200 into two or more unconnected groups of fully connected nodes. As seen on the left side of Figure 2, the distributed database system 200 includes three transaction engine nodes TE1, TE2, and TE3, and two storage manager nodes SM1 and SM2, all of which are interconnected. These nodes communicate with each other via their respective communication links 212a to 212j. TE1 communicates with TE2 via link 212a, TE2 communicates with TE3 via link 212d, TE3 communicates with SM2 via link 212e, SM2 communicates with SM1 via link 212f, TE2 communicates with SM1 via link 212b, TE1 communicates with SM1 via link 212c, TE1 communicates with SM2 via link 212h, TE3 communicates with SM1 via link 212j, and TE2 communicates with SM2 via link 212i.

[0030] In the center of Figure 2, a network partition divides the chorus into two unconnected node groups (two fully connected components 202' and 202''). (A chorus or chorus group is the set of all nodes in a distributed database system.) In this example, the first fully connected component 202' includes {TE1, TE2, SM1}, and the second fully connected component 202'' includes {TE3, SM2}. Next, process 100 determines that the first fully connected component 202' is larger than the second fully connected component 202'' and includes more than half of the nodes in the distributed database 200, and therefore is the winning fully connected component 204'. The nodes {TE1, TE2, SM1} in the winning fully connected component 204' continue to operate, while nodes TE3 and SM2 shut down in response to realizing they are no longer in the winning fully connected component 204'.
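The outcome of Case A can be checked with a few lines of arithmetic. The sketch below assumes a clean partition, in which every group is internally fully connected and no links cross between groups; the function name survivors_after_partition is hypothetical.

```python
def survivors_after_partition(groups, total_nodes):
    """Return the nodes that keep running after a clean partition.

    Assumes each group is internally fully connected. A group wins only if
    it holds a strict majority of the nodes that existed before the failure.
    """
    largest = max(groups, key=len)
    if len(largest) * 2 > total_nodes:
        return set(largest)
    return set()  # no majority group: every node shuts down


# The Figure 2 partition: {TE1, TE2, SM1} versus {TE3, SM2}.
groups = [{"TE1", "TE2", "SM1"}, {"TE3", "SM2"}]
survivors = survivors_after_partition(groups, total_nodes=5)
shut_down = set().union(*groups) - survivors
print(sorted(survivors))  # ['SM1', 'TE1', 'TE2']
print(sorted(shut_down))  # ['SM2', 'TE3']
```

With five nodes, the three-node group holds a strict majority and continues to operate. An even split, such as two groups of two in a four-node system, leaves no majority group.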

[0031] Case B: Figures 3 to 8 illustrate various examples of partial connectivity. In each of these examples, a network partition or (bidirectional) link failure(s) causes a node or set of nodes to communicate only with a subset of other nodes in the distributed database system. In the case of partial connectivity, the connections between nodes do not satisfy the transitive property; for example, node TE1 may be able to communicate directly with node TE2, and node TE2 may be able to communicate directly with node SM1, but node SM1 cannot communicate directly with node TE1.

[0032] Example B1: Figure 3 shows a distributed database system 300 having three nodes TE1, TE2, and SM1. As seen in Figure 3, the three nodes TE1, TE2, and SM1 form a chorus group in which TE1 communicates with TE2 via link 212a, TE2 communicates with SM1 via link 212b, and SM1 communicates with TE1 via link 212c. A failure of link 212a between TE1 and TE2 (or, assuming TE1 and TE2 are in different data centers, a network partition between the data centers of TE1 and TE2) creates two fully connected components 202'{SM1,TE1} and 202''{SM1,TE2}, leaving nodes TE1 and TE2 partially connected. Since the fully connected components 202' and 202'' are the same size and each contain more than half of the nodes that were operational before the link failure, the nodes perform a tie-breaking process, such as the lexicographical ordering discussed above, to determine the winning fully connected component 204'. In Figure 3, {SM1,TE1} is the winning fully connected component 204', so SM1 and TE1 continue to operate, while TE2 shuts itself down.

[0033] Example B2: Figure 4 shows a distributed database system 400 having a chorus group of five nodes TE1, TE2, TE3, SM1, and SM2. In this example, two link failures occur, one between SM1 and SM2 (link 212f) and the other between SM2 and TE3 (link 212e). These failures create fully connected components 402'{TE1,TE2,TE3,SM1} and 402''{TE1,TE2,SM2}, where node SM2 is partially connected to the other nodes and the other nodes remain directly connected to each other. The first fully connected component 402'{TE1,TE2,TE3,SM1} includes more than half of the nodes and is larger than the other fully connected component 402''; it is therefore the winning fully connected component 404'. Node SM2 shuts down, while the other nodes continue to operate.

[0034] Example B3: Figure 5 shows a distributed database system 500 having a chorus group of five nodes TE1, TE2, TE3, SM1, and SM2 that experiences four link failures. In this example, the four link failures occur between TE1 and SM1 (link 212c), between TE1 and SM2 (link 212h), between TE2 and TE3 (link 212g), and between TE3 and SM1 (link 212j). These failures result in several fully connected components, but only one of them, {TE2, SM1, SM2}, has at least three nodes, as shown on the right. Nodes TE1 and TE3 remain partially connected to the distributed database but cannot communicate directly with all other nodes in the distributed database. As a result, nodes TE1 and TE3 shut down, and {TE2, SM1, SM2} remains as the winning fully connected component 404'.

[0035] Figure 6 illustrates why the partial connection case in Figure 5 cannot be addressed using the rotational leader model methodology. As shown in Figure 6, the five nodes TE1, TE2, TE3, SM1, and SM2 form a group in Step 1 (left). In Step 1, all of these nodes can communicate with each other without interruption. However, as shown in Step 2 (right), communication links fail between TE1 and SM1 (link 212c), between TE1 and SM2 (link 212h), between TE2 and TE3 (link 212g), and between TE3 and SM1 (link 212j). These failures interrupt direct communication between TE1 and SM1, between TE1 and SM2, between TE2 and TE3, and between TE3 and SM1.

[0036] SM1 is the current leader immediately before the network failure. Following the link failures, based on the rotational leader methodology, SM1 receives heartbeat messages from TE2 and SM2, and therefore continues to assume it is the leader. Due to the link failure between TE1 and SM1 (link 212c), TE1 does not receive heartbeat messages from SM1, and therefore rotates leadership to TE2. Similarly, due to the link failure between TE3 and SM1 (link 212j), TE3 rotates leadership to TE1. Thus, SM1, TE2, and TE1 take on leadership in succession (not necessarily in this order), but since TE1 is connected to neither SM1 nor SM2, it does not even know whether SM1 is connected to SM2. This rotational leadership makes it difficult to resolve the failure(s).

[0037] Conceptually, as described above, a leader node may not be connected to all other nodes (and consequently, the leader may not know the connection information of all other nodes), making it difficult for a centralized, leader-based solution to adequately handle partial connectivity cases. However, the leaderless fault resolution process described herein is an improvement over leader-based methods because it reliably handles all of these partial connectivity cases.

[0038] Example B4: Figure 7 shows a distributed database system 700 with a chorus group of five nodes having three link failures: one between TE1 and SM1 (link 212c), another between TE2 and TE3 (link 212g), and another between SM1 and TE3 (link 212j). These failures create fully connected components {TE1,TE2,SM2}, {TE2,SM1,SM2}, and {TE1,SM2,TE3}, with nodes TE1 and TE2 being partially connected. Each of these three fully connected components contains more than half the number of nodes in the chorus group before the link failures. Furthermore, these three fully connected majority groups are the same size. Therefore, the nodes perform tie-breaking operations such as lexicographical ordering to identify the winning fully connected component 704'. In this example, {TE2,SM1,SM2} is the winning fully connected component 704' (determined by the tie-breaking operation). Therefore, nodes TE1 and TE3 will shut themselves down to resolve the network failure.

[0039] Example B5: Figure 8 shows a distributed database system 800 with a chorus group of five nodes: TE1, TE2, TE3, SM1, and SM2. In this example, five link failures occur: between TE1 and SM1 (link 212c), between TE1 and SM2 (link 212h), between TE2 and TE3 (link 212g), between TE2 and SM2 (link 212i), and between TE3 and SM1 (link 212j). As can be seen from Figure 8, after these failures, there are five fully connected node groups, each the size of two nodes. This is less than half the number of nodes in the chorus group before the link failures. Therefore, since there is no fully connected majority group after the link failures, all nodes cease to function.

[0040] Case C: Figure 9 illustrates a special case of partial connectivity where a node or set of nodes can communicate with a subset of other nodes in one direction but not in the other due to a (unidirectional) link failure(s). As seen in Figure 9, three nodes TE1, TE2, and SM1 in the distributed database system 900 form a chorus group. However, a unidirectional link failure occurs between TE1 and TE2 (link 212a''), allowing TE2 to send messages to TE1 (link 212a'), but preventing TE1 from sending messages to TE2 (link 212a''). This unidirectional link failure (similar to a bidirectional link failure between TE1 and TE2) creates fully connected components 902'{TE1,SM1} and 902''{TE2,SM1}. Since the two fully connected components are the same size and each contains more than half of the nodes that were operational before the link failure (i.e., two out of a total of three nodes), the nodes perform a tie-breaking process to determine the winning fully connected component 904'. In this example, {TE1,SM1} is the winning fully connected component 904' (determined by the tie-breaking process described above). Therefore, nodes TE1 and SM1 continue to operate, while node TE2 shuts itself down.

[0041] Case D: Process 100 also prevents the distributed database system from splitting into multiple majority groups due to a network failure during a membership change. A membership change refers to a new node joining the distributed database system or an existing node leaving the distributed database system. Figure 10 shows an example of Case D. In this example, Chorus 1000 starts with three nodes TE1, SM1, and SM2. Two nodes, TE2 and TE3, attempt to join Chorus 1000. While they are joining, a network partition occurs, causing the distributed database to split into fully connected components 1002'{TE2,TE3,TE1} and 1002''{SM1,SM2}. Without a safeguard, both groups could continue operating: the members of group {TE2,TE3,TE1} consider themselves a majority of the chorus {TE2,TE3,TE1,SM1,SM2}, and the members of group {SM1,SM2} consider themselves a majority of the chorus {TE1,SM1,SM2}. Process 100 ensures that only one group continues operating. In other words, process 100 ensures that {TE2,TE3,TE1} and {SM1,SM2} do not both continue operating simultaneously.

Collection and sharing of information about suspected nodes

[0042] The fault resolution process disclosed herein (for example, process 100) is a leaderless process. In response to a network failure event, each node identifies its suspect list, exchanges connectivity information (its own and optionally that of other nodes) with other nodes, and then makes a fault resolution decision. This process causes nodes to communicate and exchange connectivity information in such a manner that, at the end of the communication phase of this process, each node has sufficient connectivity information about other nodes in its partition to reach the same fault resolution decision(s). Any new network failure event occurring during the progress of the protocol causes all nodes to restart the protocol.

[0043] Generally, the fault resolution process of the present invention may include two phases: Phase 1, in which each node collects information about the suspect list / connections of other nodes; and Phase 2, in which each node makes a fault resolution decision (for example, by shutting down itself) based on the information collected during Phase 1.

[0044] During Phase 1, each node participates in a maximum of two rounds of coordinated broadcasts. These coordinated broadcasts of the suspect list involve the exchange of connection information / suspect lists between nodes within the partition. In Case A above, each node performs one coordinated broadcast. In Cases B and C above, each node performs two coordinated broadcasts. Two rounds of coordinated broadcasts are sufficient for all nodes to agree on the group membership change in Cases A, B, and C.

[0045] To make this process intuitively understandable, the unoptimized connection information exchange process is described first. It includes (n-1) rounds of broadcasting during Phase 1, where n is the number of nodes in the chorus. The optimized version of the connection information exchange process is described afterward; in it, each node participates in a maximum of two rounds of broadcasting, regardless of the number of nodes in the chorus.

[0046] For clarity and simplicity, we assume that during the connection information exchange process, there are no new network failure events, no new nodes joining, and no failures of chorus member nodes. However, the connection information exchange process described herein can be extended to include all of these events. These assumptions and / or limitations will be removed in a later section after the description of the core process.

Distribution of an unoptimized list of suspected nodes

[0047] Initially, the chorus consists of n fully connected nodes. Suppose a network failure event occurs. Each node executes the following protocol to resolve the network failure event.

[0048] Each node prepares its own suspect list (this suspect list can be empty, which can happen if, after a network failure event, the node is fully connected to (or at least believes itself to be connected to) all other nodes).

[0049] Phase 1: Each node performs (n-1) rounds of coordinated broadcasting to collect information about other nodes' suspect lists / connections. In Round 1, each node sends its own suspect list to its neighbors and waits until it receives a suspect list from its neighbors. From Round 2 through Round (n-1), each node sends the suspect lists of other nodes received in the previous round to its neighbors and waits until it receives such information from its neighbors.
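The (n-1) rounds of Phase 1 can be pictured with a small simulation. The following is only a sketch under stated assumptions: the function name `flood_suspect_lists`, the representation of links as two-element frozensets, and the simplification that a node forwards everything it knows each round (rather than only what it received in the previous round) are illustrative choices, not taken from the specification. The example data replays the four link failures of Figure 5.

```python
from itertools import combinations

def flood_suspect_lists(nodes, links):
    """Simulate (n-1) rounds of coordinated broadcast (hypothetical helper).

    links: set of frozenset({a, b}) pairs that can still talk directly.
    Returns {node: {origin: suspect_set}}: the suspect lists each node has learned.
    """
    neighbors = {v: {u for u in nodes if u != v and frozenset((u, v)) in links}
                 for v in nodes}
    # Before Round 1 each node knows only its own suspect list.
    known = {v: {v: frozenset(set(nodes) - neighbors[v] - {v})} for v in nodes}
    for _ in range(len(nodes) - 1):              # rounds 1 .. n-1
        snapshot = {v: dict(known[v]) for v in nodes}
        for v in nodes:
            for u in neighbors[v]:
                # v receives what u knew at the start of this round
                known[v].update(snapshot[u])
    return known

# Figure 5: five nodes, four failed links.
NODES = ["TE1", "TE2", "TE3", "SM1", "SM2"]
FAILED = {frozenset(p) for p in [("TE1", "SM1"), ("TE1", "SM2"),
                                 ("TE2", "TE3"), ("TE3", "SM1")]}
LINKS = {frozenset(p) for p in combinations(NODES, 2)} - FAILED
known = flood_suspect_lists(NODES, LINKS)
```

In this run every node ends up holding the suspect list of all five nodes, which is the precondition that Phase 2 relies on.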

[0050] Phase 2: At this point, each node has received connection information for all other nodes in its partition (since the chorus contains n nodes, each node obtains connection information for all other nodes in its partition by performing (n-1) rounds of broadcasting in the manner described above). Each node prepares a connection graph for its partition and finds the fully connected component of the largest size (the largest clique) in the connection graph. If there are two or more such fully connected components, the node selects one of them as the winning fully connected component, determined by a tie-breaking process (for example, based on the lexicographical order of the unique identifiers of the nodes in the fully connected component). If the size of the winning fully connected component is at least (n / 2+1) and the node is a member of the winning fully connected component, the node decides to continue operating (and terminates the protocol); otherwise, the node shuts itself down.
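The Phase 2 decision can be sketched as a brute-force clique search. Several points here are assumptions rather than statements of the specification: (n / 2+1) is read with integer division (so three of five nodes form a majority, as in Examples B3 and B4), the tie-break picks the lexicographically smallest sorted tuple of node identifiers, and the names `links_after`, `winning_component`, and `decide` are hypothetical. Exhaustive search is only practical for small choruses.

```python
from itertools import combinations

def links_after(nodes, failed_pairs):
    """All pairwise links minus the failed ones (hypothetical helper)."""
    failed = {frozenset(p) for p in failed_pairs}
    return {frozenset(p) for p in combinations(nodes, 2)} - failed

def winning_component(nodes, links):
    """Return the winning fully connected component, or None if no clique
    reaches the majority size."""
    n = len(nodes)
    majority = n // 2 + 1                        # assumed reading of (n / 2+1)
    for size in range(n, majority - 1, -1):      # largest size first
        cliques = [c for c in combinations(sorted(nodes), size)
                   if all(frozenset(p) in links for p in combinations(c, 2))]
        if cliques:
            return min(cliques)                  # assumed lexicographic tie-break
    return None

def decide(node, nodes, links):
    """A node continues only if it belongs to the winning component."""
    winner = winning_component(nodes, links)
    return "continue" if winner and node in winner else "shutdown"

NODES = ["TE1", "TE2", "TE3", "SM1", "SM2"]
FIG5 = links_after(NODES, [("TE1", "SM1"), ("TE1", "SM2"),
                           ("TE2", "TE3"), ("TE3", "SM1")])
FIG7 = links_after(NODES, [("TE1", "SM1"), ("TE2", "TE3"), ("SM1", "TE3")])
FIG8 = links_after(NODES, [("TE1", "SM1"), ("TE1", "SM2"), ("TE2", "TE3"),
                           ("TE2", "SM2"), ("TE3", "SM1")])
```

Under these assumptions the sketch reproduces the outcomes described for Figures 5, 7, and 8: {TE2, SM1, SM2} wins in Figures 5 and 7, and no component reaches a majority in Figure 8, so every node shuts down.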

[0051] The following is an optimization to ensure that nodes agree on a membership change after a maximum of two rounds of broadcasting.

[0052] Optimization 1: This is an optimization applicable to the scenario covered in Case A (in the section above). This is based on the observation that when a network failure event splits the database into unconnected groups of fully connected nodes, all nodes in the group / partition will have the same suspect list. Consider, for example, Figure 2. In Figure 2, nodes TE1, TE2, and SM1 suspect TE3 and SM2, and nodes TE3 and SM2 suspect TE1, TE2, and SM1. After the first round of coordinated broadcasting in Phase 1, if a node's suspect list matches the suspect list of all its neighbors, the node can infer that (a) it is part of the fully connected component and (b) it can determine the size of the fully connected component (which is equal to the size of the chorus minus the size of its own suspect list). Thus, all nodes can agree on a membership change after the first round of broadcasting in Phase 1.
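As a minimal sketch of Optimization 1, the check a node performs after Round 1 might look as follows. The function name `round_one_shortcut` and its argument layout are illustrative assumptions; the size formula (chorus size minus the size of the node's own suspect list) is the one stated above.

```python
def round_one_shortcut(own_suspects, neighbor_suspects, chorus_size):
    """Return the size of the fully connected component if every neighbor
    reported the same suspect list as this node; otherwise return None,
    meaning Round 1 alone is not enough to decide.

    neighbor_suspects: {neighbor_id: iterable of suspected node ids}
    """
    own = frozenset(own_suspects)
    if all(frozenset(s) == own for s in neighbor_suspects.values()):
        return chorus_size - len(own)
    return None

# Figure 2, seen from TE1: TE1, TE2 and SM1 all suspect TE3 and SM2.
size_te1 = round_one_shortcut({"TE3", "SM2"},
                              {"TE2": {"TE3", "SM2"}, "SM1": {"TE3", "SM2"}},
                              chorus_size=5)
# Figure 3, seen from TE1: TE1 suspects TE2, but SM1 suspects nobody.
size_partial = round_one_shortcut({"TE2"}, {"SM1": set()}, chorus_size=3)
```

In the Figure 2 call the lists match and the component size is 5 - 2 = 3; in the Figure 3 call the lists differ, so the function returns `None`.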

[0053] Optimization 2: This is an optimization that is mainly applicable to cases B and C above, and partially applicable to case A. In the unoptimized process, all nodes participate in (n-1) rounds of coordinated broadcasting. This allows each node to know the connectivity information of all other nodes in its partition. However, is it really necessary for each node to know the connectivity information of all other nodes in its partition in order to arrive at the best fault resolution decision? Consider dividing nodes into two categories based on their suspect list after a network failure event. Category (M) includes nodes that suspect fewer than n / 2 other nodes, and Category (N) includes nodes that suspect more than n / 2 nodes. Nodes that suspect more than n / 2 nodes may immediately shut themselves down rather than broadcast their suspect list because they cannot be part of the winning fully connected component.

[0054] For example, consider Figure 11. After a network failure event, nodes TE2, SM1, and SM2 enter category (M), while nodes TE1 and TE3 enter category (N). Consider category (M) first. Do nodes in category (M) need to know the connectivity information of other nodes in category (M) in order to make the best fault resolution decision? Yes. This is because a node in category (M), together with other nodes in category (M), can form a fully connected component of at least (n / 2+1) size, and knowing the connectivity information of other nodes in category (M) helps the node in (a) determining whether it is part of a fully connected component of at least (n / 2+1) size, (b) identifying all fully connected components of at least (n / 2+1) size, and (c) determining whether it is part of the winning fully connected component. Do nodes in category (M) need to know the connectivity information of nodes in category (N) in order to make the best fault resolution decision? No. This is because the nodes of category (M) will never form a fully connected component of size (n / 2+1) together with the nodes of category (N), because the nodes of category (N) suspect more than (n / 2) other nodes.

[0055] Now consider category (N). Does a node in category (N) need to know the connectivity information of nodes in categories (M) and (N) in order to make the best fault resolution decision? No. This is because a node in category (N) suspects more than (n / 2) other nodes, and therefore cannot form a fully connected component of at least (n / 2+1) size with any other nodes. Having connectivity information for all other nodes helps a node in category (N) know which other nodes are still operational, but it does not change the fact that the node cannot form a fully connected component of at least (n / 2+1) size with the other nodes.

[0056] Therefore, for all nodes in a distributed database system to agree on the optimal fault resolution outcome, a few rounds of coordinated broadcasting suffice, provided they make each node in category (M) aware of the connectivity information of every other node in category (M). Thus, as a modification of the unoptimized process, the optimized process begins by disabling the nodes in category (N) before the start of phase 1 while retaining them as members of the chorus. In other words, the nodes in category (M) keep the nodes in category (N) on their node lists until phase 2, even though the category (N) nodes were disabled before the start of phase 1. Retaining the disabled nodes as members of the chorus until phase 2 ensures correctness: the fault resolution outcome is a fully connected set with at least (n / 2+1) nodes, where n includes the nodes that were disabled as an optimization before phase 1. (If the category (N) nodes (or any other nodes) were removed from the chorus, the value of n (the group size) and the size of the majority would change, making it more difficult to prove the correctness of the result.)

[0057] Disabling a Category (N) node does not affect the connections between Category (M) nodes (i.e., a Category (N) node failure does not cause a Category (M) node to disconnect), because any two Category (M) nodes are either directly connected to each other or connected through other Category (M) nodes. Therefore, disabling a Category (N) node should not affect the optimality of the fault resolution outcome.

[0058] Conceptually, the optimization leads Category (M) nodes to reach an agreement on the outcome of fault resolution, and Category (N) nodes to follow that outcome. The optimization ensures that each node that initiated Phase 1 is connected to at least (n / 2) other nodes, so the diameter of the connection graph (i.e., the maximum distance between any two nodes in the connection graph) is at most 2; that is, any two nodes are either directly connected or at most one node apart. Therefore, only two rounds of broadcasting are needed for each node that initiated Phase 1 to learn the connections of every other node that initiated Phase 1.
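The diameter bound can be checked with a short counting argument. The derivation below is a sketch, assuming that (n / 2) is read as $\lfloor n/2 \rfloor$ and that the neighbors counted are themselves nodes taking part in Phase 1; the symbols $V$, $N(\cdot)$, and $d(\cdot,\cdot)$ are introduced here for illustration.

```latex
Let $V$ be the set of $n$ chorus members, and let $u \neq v$ be two nodes that
entered Phase 1 and are not directly connected. Write $N(u)$ and $N(v)$ for
their sets of direct neighbors. Since neither $u$ nor $v$ lies in
$N(u) \cup N(v)$,
\[
  |N(u) \cup N(v)| \le n - 2 ,
  \qquad
  |N(u)| + |N(v)| \ge 2\lfloor n/2 \rfloor \ge n - 1 .
\]
By inclusion-exclusion,
\[
  |N(u) \cap N(v)| = |N(u)| + |N(v)| - |N(u) \cup N(v)|
                   \ge (n - 1) - (n - 2) = 1 ,
\]
so $u$ and $v$ share at least one neighbor $w$, and $d(u, v) \le 2$.
```

For n = 5, for instance, each participating node has at least two neighbors among the three remaining candidates, so two non-adjacent nodes must have one in common.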

Distribution of an optimized list of suspected nodes

[0059] Consider a chorus containing n fully connected nodes. Suppose a network failure occurs. Each node performs the following protocol to resolve the network failure: Each node prepares its own suspect list (Note: This suspect list can be empty, which can happen if, after the network failure event, the node is fully connected to (or believes itself to be fully connected to) all other nodes).

[0060] Phase 0: Each node checks if it suspects more than (n-1)/2 other nodes. If it does, the node shuts itself down. (Other nodes may hear about this shutdown during Phase 1. If they do, they restart the protocol and start again from Phase 0.)
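The Phase 0 test reduces to one integer comparison. The sketch below assumes the threshold is (n-1)/2 and that the majority size is n // 2 + 1 (integer division); the function name is hypothetical. A node with s suspects can belong to a fully connected component of at most n - s nodes, which is why exceeding the threshold rules out membership in any majority component.

```python
def should_shut_down_in_phase0(suspect_count, chorus_size):
    """True if the node suspects more than (n-1)/2 of the other nodes.

    The comparison is written as 2*s > n-1 to stay in integer arithmetic.
    """
    return 2 * suspect_count > chorus_size - 1

# Five-node chorus: two suspects is still acceptable, three is not.
ok_with_two = not should_shut_down_in_phase0(2, 5)
down_with_three = should_shut_down_in_phase0(3, 5)
```

With these assumptions the test is equivalent to asking whether n - s falls below the majority size, so a node that passes Phase 0 can still, in principle, be part of a winning component.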

[0061] Phase 1, Round 1: Each node sends its suspect list to its neighbors and waits until it receives its neighbors' suspect lists. As described above, if one or more of a node's neighbors fail in Phase 0, that node may hear about those failures while waiting for its neighbors' suspect lists. Upon hearing about such failures, the node restarts the protocol and begins again from Phase 0. This causes other nodes to restart the protocol as well. Similarly, if a neighbor node restarts the protocol, the node begins again from Phase 0. Also, as described above, this node does not initiate failover for any failed nodes in this stage (i.e., it keeps all nodes in its chorus for the purpose of determining the winning all-connected component). This also applies to multiple rounds of Phase 0.

[0062] Each node checks whether its suspect list matches the suspect lists of all its neighboring nodes. If a node's suspect list matches the suspect lists of all its neighboring nodes, this indicates that the node is fully connected to its neighboring nodes. This scenario is covered in Case A above (for example, Figure 2). Since each node that initiated Phase 1 is connected to at least (n / 2) other nodes, the node's list of neighbors contains at least (n / 2) nodes (the node, together with its neighboring nodes, forms a group containing at least (n / 2+1) nodes). The node decides to continue operating and terminates the protocol.

[0063] If a node's suspect list does not match the suspect list of at least one of its neighboring nodes, this indicates that the node is not fully connected to all other nodes within its partition. This scenario is covered in cases B and C above (for example, Figures 3-9). Such a node cannot determine whether it should continue operating based on the information received in Round 1. Therefore, Round 2 of Phase 1 is performed.

[0064] Phase 1, Round 2: Each node sends the suspect lists of other nodes that it received in Round 1 to its neighbors and waits until it receives, from its neighbors, the suspect lists of its neighbors' neighbors.

[0065] Phase 2: At this point, each node has received connection information for all other nodes in its partition. Each node prepares a connection graph for its partition and finds the largest fully connected component (the largest clique) in the connection graph that has at least (n / 2+1) nodes. If there are two or more such fully connected components (for example, Figure 7), the node selects one of them as the winning fully connected component, determined by a tie-breaking process (e.g., lexicographical order), in order to make fault resolution deterministic. If the node is a member of the winning fully connected component, the node decides to continue operating (and terminates the protocol); otherwise, the node shuts itself down.
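To illustrate why two rounds are enough, the sketch below replays the three link failures of Figure 7 and lets each node build its own view from Round 1 and Round 2 data only. It is a simplified model under stated assumptions: Phase 0 shutdowns and protocol restarts are left out (no node exceeds the Phase 0 threshold in this example), (n / 2+1) is read with integer division, the tie-break is the lexicographically smallest sorted tuple of identifiers, and all function names are hypothetical.

```python
from itertools import combinations

def two_round_views(nodes, links):
    """Return {node: {origin: suspect_set}} after Round 1 and Round 2."""
    nbrs = {v: {u for u in nodes if u != v and frozenset((u, v)) in links}
            for v in nodes}
    own = {v: frozenset(set(nodes) - nbrs[v] - {v}) for v in nodes}
    # Round 1: every node learns the suspect lists of its direct neighbors.
    round1 = {v: {u: own[u] for u in nbrs[v]} for v in nodes}
    views = {}
    for v in nodes:
        view = {v: own[v], **round1[v]}
        for u in nbrs[v]:
            view.update(round1[u])   # Round 2: u relays what it got in Round 1
        views[v] = view
    return views

def winner_from_view(view, n):
    """Largest clique of at least n // 2 + 1 nodes in one node's view."""
    def connected(a, b):
        return b not in view[a] and a not in view[b]
    known = sorted(view)
    for size in range(len(known), n // 2, -1):
        cliques = [c for c in combinations(known, size)
                   if all(connected(a, b) for a, b in combinations(c, 2))]
        if cliques:
            return min(cliques)      # assumed lexicographic tie-break
    return None

NODES = ["TE1", "TE2", "TE3", "SM1", "SM2"]
FAILED = {frozenset(p) for p in [("TE1", "SM1"), ("TE2", "TE3"), ("SM1", "TE3")]}
LINKS = {frozenset(p) for p in combinations(NODES, 2)} - FAILED
views = two_round_views(NODES, LINKS)
winners = {v: winner_from_view(views[v], len(NODES)) for v in NODES}
```

In this run every node, working only from its own view, selects {SM1, SM2, TE2}, so TE1 and TE3 would shut themselves down, matching Example B4.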

[0066] If a new network failure event occurs while the distributed database system is resolving a network failure event, the protocol causes the nodes to backtrack, re-examine their connectivity in light of the new network event, and then make a fault resolution decision.

[0067] In addition to new network failure events, node failures (for example, those caused by manual shutdown of a node) can also occur while nodes in the distributed database system are resolving network failures. In response to a node failure, the protocol restarts the nodes from phase 0 while retaining the failed node as a member of the chorus until phase 2 (by not performing a failover for the failed node, which prevents the remaining nodes from removing the failed node from their list of nodes). As described above, retaining the failed node as a member of the chorus until phase 2 ensures correctness: the result of fault resolution is a fully connected set with at least (n / 2+1) nodes, where n includes the failed node, and therefore there can be only one such set that continues to operate after phase 2.

[0068] Figure 12 is a flowchart of the optimized process 1200 for resolving network failures. Since each node follows the same process, the flowchart shows process 1200 from a single node's perspective. Process 1200 is described in detail from a stage perspective.

[0069] Stage 0: Initial stage. In 1202, the node is fully connected to all other nodes in the chorus. When a suspected node is detected locally or remotely, the node moves to Stage 1.

[0070] Stage 1: At 1210, the node waits one additional ping (heartbeat) cycle for further ping (heartbeat) timeouts, prepares its suspect list, processes any suspect list messages it has received, and then proceeds to Stage 2.

[0071] Stage 2: At 1220, the node checks if it suspects more than (n-1)/2 other nodes (where n is the number of nodes in the chorus). If it does, at 1299, the node shuts itself down. Otherwise, the node checks whether any new suspects have appeared since it prepared its suspect list in Stage 1. The node also checks if any of its neighbors have restarted the protocol because they detected a new suspect. Each node may assign a number called protocolIterationNumber to each iteration of process 1200 it performs. Each node sets this number in the suspect list message it sends and compares its local protocolIterationNumber to the protocolIterationNumber in the suspect list received from other nodes. If the node determines that its protocolIterationNumber is smaller than the protocolIterationNumber of its neighbor, it determines that its neighbor has restarted the process and returns to Stage 1. Otherwise, the node enters Stage 3. (If a node's protocolIterationNumber is greater than that of its neighbor, that node has restarted its protocol (presumably because it has found a new suspect), which should cause the neighbor to restart its protocol as well.)

[0072] Stage 3: At 1230, the node broadcasts its Round 1 suspect list to its neighboring nodes. At 1232, the node may detect a new suspect or hear that one or more of its neighbors have detected a new suspect while waiting for Round 1 suspect list messages. In this case, the node stops waiting for further responses and returns to Stage 1. At 1234, the node receives Round 1 suspect list messages from all of its neighboring nodes. If the node does not receive a response from any of its neighbors in a timely manner (e.g., within a given period), at 1236, the node marks such neighbors as suspects and returns to Stage 1. If the node receives a Round 1 suspect list with a protocolIterationNumber greater than its own protocolIterationNumber, at 1238, the node returns to the beginning of Stage 1. Once the node has received Round 1 responses from all of its neighbors, it enters Stage 4.

[0073] Stage 4: At 1240, if a node's suspect list matches the suspect list of all its neighbors, the node determines that it is fully connected to all of its neighbors (for example, as shown in Figure 2). Since each node that started Stage 3 is connected to at least (n / 2) other nodes, the node's list of neighbors contains at least (n / 2) nodes (i.e., the node and its neighbors form a fully connected component or group containing at least (n / 2+1) nodes). At 1201, the node decides to continue operating, eliminates the suspected nodes, and terminates process 1200.

[0074] If a node's suspect list does not match the suspect list of at least one of its neighboring nodes, the node is not fully connected to all other nodes in its partition (for example, as shown in Figures 3-9). Since the node cannot determine whether to continue operating or shut down based on the information received in Round 1, the node enters Stage 5, which includes broadcasting the Round 2 suspect list message at 1250.

[0075] Stage 5: At 1250, the node broadcasts its Round 2 suspect list, which includes its original suspects plus the suspects of its neighboring nodes, to its neighboring nodes and waits until it receives Round 2 suspect list messages from all of its neighboring nodes. After broadcasting its Round 1 suspect list message at 1230, the node can receive Round 2 suspect list messages from other nodes at any time. The node accumulates these Round 2 suspect list messages. At 1252, if a new network failure occurs, or if the node receives a Round 1 message from another node, or if the node hears about another node's failure, the node returns to Stage 1. Upon returning to Stage 1, the node discards all accumulated Round 2 suspect list messages. However, if the other nodes also return to Stage 1 and send new messages, those new messages are retained. The node distinguishes between Round 1 and Round 2 suspect list messages based on the protocolIterationNumber and the round number carried in each message. In other words, every suspect list message includes both the protocolIterationNumber and the round number.

[0076] At 1254, a node enters Stage 6 when it has received Round 2 suspect list messages from all of its neighboring nodes. If a new network event occurs or the node hears of another node's failure after broadcasting its Round 2 suspect list message, the fault resolution decision may become suboptimal. There are at least two possible cases: case (a) the node has already received a Round 2 message from the new suspect or failed node, and case (b) the node has not received a Round 2 message from the new suspect or failed node.

[0077] In case (a), the node may proceed to stage 6, perform fault resolution for the current network event, and then address the new network event by restarting the protocol, or return to stage 1 (without resolving the current network event) and restart process 1200 (which then resolves both the current and new network faults). In case (b), the node returns to stage 1 because it does not receive a round 2 message from either the new suspect or the failed node. However, there is no guarantee that the other nodes will also return to stage 1 before completing stage 6 (because they may have received a round 2 message from the new suspect or the failed node). In this case, the fault resolution outcome may be suboptimal (i.e., the survivor set is smaller than it should have been, but there is still only one survivor set). However, even if this node moves to stage 1, it does not prevent the other nodes from progressing because it has already sent its own round 2 message.

[0078] Stage 6: At 1260, the node prepares a connection graph for its partition and finds the largest fully connected component (the largest clique) of at least (n / 2+1) size in the connection graph. If there are two or more such components, the node selects one of them, determined by the tie-breaking process, as the winning fully connected component. If the node is a member of the winning fully connected component, at 1201, the node decides to continue operating and eliminates any nodes that are not part of the winning fully connected component. Otherwise, at 1299, the node shuts itself down.

Protocol Iteration Number

[0079] As discussed above, any node in a distributed database system can initiate a fault resolution protocol (e.g., process 1200 in Figure 12) in response to detecting one or more suspected nodes. Furthermore, any new network failure event occurring during the execution of the fault resolution protocol can trigger a restart of the protocol. To enable a node to determine whether a suspect list message it receives (Round 1 or Round 2) belongs to the current invocation of the protocol, or to the next invocation resulting from a protocol restart (or, in the case of a node that restarted the protocol, to a previous invocation), the node associates a number called protocolIterationNumber with each invocation of the fault resolution protocol.

[0080] Each node maintains its own local protocolIterationNumber and sets this number in the suspect list messages it sends. Each node compares its own local protocolIterationNumber with the protocolIterationNumber in the received suspect list message. If the numbers match, the node infers that the received suspect list message corresponds to the current invocation of the protocol. If the protocolIterationNumber in the received suspect list message is greater than its own protocolIterationNumber, the node infers that the sender has restarted the protocol (and therefore restarts the protocol itself). If the protocolIterationNumber in the received suspect list message is less than its own protocolIterationNumber, the node infers that the sender is still running a previous iteration of the protocol and ignores the message.
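The three-way comparison described above can be sketched as follows. The message layout (`SuspectListMessage` and its field names) and the returned labels are illustrative assumptions; the specification only states that a suspect list message carries the protocolIterationNumber and the round number.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuspectListMessage:
    """Hypothetical wire format for a suspect list message."""
    sender: str
    protocol_iteration_number: int
    round_number: int                # 1 or 2
    suspects: frozenset

def classify(local_iteration, msg):
    """Decide how a receiver treats an incoming suspect list message."""
    if msg.protocol_iteration_number == local_iteration:
        return "current"             # belongs to the invocation in progress
    if msg.protocol_iteration_number > local_iteration:
        return "restart"             # sender restarted; receiver restarts too
    return "ignore"                  # sender is on an older iteration

msg = SuspectListMessage("TE2", 4, 1, frozenset({"TE3"}))
```

A receiver whose local number is 4 would process `msg` as part of the current invocation, a receiver at 3 would restart, and a receiver at 5 would drop it.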

[0081] Each node can maintain its own local protocolIterationNumber in the following way:
(a) The protocolIterationNumber is set to zero on the first node during database initialization and database restart.
(b) The protocolIterationNumber is serialized as part of the master catalog. A new node fetches the master catalog when it joins the distributed database system and thereby receives the current protocolIterationNumber from the master catalog chairman.
(c) If a node that previously suspected no other node detects a suspected node, has not received any suspect list messages from other nodes, and invokes the fault resolution protocol, the node increments its protocolIterationNumber.
(d) If a node that previously suspected no other node detects a suspected node, has received one or more suspect list messages from other nodes, and invokes the fault resolution protocol, the node sets its protocolIterationNumber to the largest protocolIterationNumber in the suspect list messages it has received.
(e) If a node has not detected any suspected nodes, has received one or more suspect list messages from other nodes, and invokes the fault resolution protocol, the node sets its protocolIterationNumber to the largest protocolIterationNumber in the suspect list messages received from other nodes.
(f) If a node is running the fault resolution protocol, has not received any suspect list messages with a protocolIterationNumber greater than its local number, and detects a new network failure event, the node increments its protocolIterationNumber (and restarts the protocol).
(g) If a node is running the fault resolution protocol, receives a suspect list message with a protocolIterationNumber greater than its local number, and detects a new network failure event, the node sets its local protocolIterationNumber to the number in the received suspect list message (and restarts the protocol).
(h) If a node is running the fault resolution protocol, receives a suspect list message with a protocolIterationNumber greater than its local number, and does not detect any new network failure events, the node sets its local protocolIterationNumber to the number in the received suspect list message (and restarts the protocol).
(i) If a node is running the fault resolution protocol and receives a suspect list message with a protocolIterationNumber smaller than its own local number, the node ignores the message.

[0082] These points can be summarized as follows: (A) The protocolIterationNumber is set to zero on the first node during database initialization and database restart. (B) The protocolIterationNumber is serialized as part of the master catalog, and new nodes joining the database receive the current protocolIterationNumber from the master catalog chairman when they fetch the master catalog. (C) If a node is invoking the fault resolution protocol (because it has detected a suspect node and / or received a suspect list message from another node), the node checks whether it has received a suspect list message with a protocolIterationNumber greater than its own local protocolIterationNumber. If it has, the node sets its own local protocolIterationNumber to the largest protocolIterationNumber in the received suspect list message(s); otherwise, it increments its own protocolIterationNumber.
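Rule (C) of the summary can be sketched as a single function. This is an illustrative Python sketch under the assumption that the iteration numbers of all received suspect list messages are available as a collection; the function name `next_protocol_iteration` is hypothetical.

```python
from typing import Iterable


def next_protocol_iteration(local_iteration: int, received_iterations: Iterable[int]) -> int:
    """Return the iteration number a node adopts when it invokes the protocol.

    If any received suspect list message carries a number greater than the
    local one, adopt the largest such number; otherwise increment the local one.
    """
    higher = [number for number in received_iterations if number > local_iteration]
    if higher:
        return max(higher)
    return local_iteration + 1
```

A node at iteration 4 that has received messages tagged 5 and 6 moves to 6, while a node at iteration 4 that has received nothing newer moves to 5.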

Dealing with unidirectional link failures

[0083] Unidirectional link failures, such as in Case D above (Figure 10), can be resolved by treating them as bidirectional link failures (i.e., by making the nodes on both sides of the failed link suspect each other). For example, consider two nodes, Node A and Node B, in a distributed database system. Assume that Node A can send messages to Node B, but Node B cannot send messages to Node A. Node A can send a ping message to Node B, but does not receive an acknowledgment message from Node B, so Node A begins to suspect Node B. At this point, Node B does not yet suspect Node A. However, since Node A now suspects Node B, it stops sending ping messages to Node B. This causes Node B to suspect Node A, and the unidirectional link failure is transformed into a bidirectional link failure.

[0084] In the process described herein, a node sends a MsgPing message (e.g., a ping message) and sets Node::lastPingTime for a particular node only if that node has acknowledged the previous MsgPing message. This ensures that in the event of a unidirectional link failure, nodes on both sides of the link will suspect each other. Thus, the above protocol can resolve unidirectional link failures, or a mixture of unidirectional and bidirectional link failures.
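The effect of gating pings on acknowledgments can be illustrated with a small simulation. This Python sketch rests on assumptions that are not stated in the source: suspicion is modeled as silence from the peer for at least `timeout` ticks, time advances in discrete ticks, and only the A to B direction of the link delivers messages. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(gate_on_ack: bool, ticks: int = 10, timeout: int = 3):
    """Return (a_suspects_b, b_suspects_a) for a link where only A -> B delivers."""
    last_heard = {"A": 0, "B": 0}  # tick at which each node last heard from its peer
    a_awaiting_ack = False         # True once A has an unacknowledged ping outstanding
    for tick in range(1, ticks + 1):
        # With gating, A sends a new ping only after the previous one was acknowledged.
        if not (gate_on_ack and a_awaiting_ack):
            last_heard["B"] = tick  # A -> B works, so B hears the ping
            a_awaiting_ack = True   # B's acknowledgment travels B -> A and is lost
        # Nothing sent by B ever reaches A, so last_heard["A"] never advances.
    a_suspects_b = ticks - last_heard["A"] >= timeout
    b_suspects_a = ticks - last_heard["B"] >= timeout
    return a_suspects_b, b_suspects_a
```

Under these assumptions, `simulate(gate_on_ack=True)` ends with both nodes suspecting each other, whereas `simulate(gate_on_ack=False)` leaves only Node A suspecting Node B, which is the asymmetric situation the gating is meant to avoid.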

Chorus membership change

[0085] If a network failure event occurs while a new node (or set of new nodes) is joining the chorus, this process must be handled in a way that prevents the chorus from splitting into multiple majority groups. In Figure 10, for example, a network partition splits the chorus into a majority group {SM1,SM2} and a minority group {TE1}. However, the minority group {TE1}, along with the new nodes {TE2,TE3}, forms a majority group {TE1,TE2,TE3}, resulting in two “majority” groups {TE1,TE2,TE3} and {SM1,SM2}.

[0086] One way to resolve issues related to adding new nodes to a chorus is to deactivate the new node(s) if a network failure event occurs while they are in the process of joining the chorus. This prevents the minority set of nodes in the current chorus from forming a majority group with the new node(s). In Figure 10, the new nodes TE2 and TE3 (which are still in the process of joining the chorus) can be deactivated, which also deactivates TE1, leaving a single majority group {SM1,SM2} in the database. This process does not limit the number of nodes that can join a chorus simultaneously. However, since the new node(s) may be deactivated while some nodes in the current chorus are already aware of them, this process can affect system availability. The effect depends on whether the number of nodes in the current chorus is odd or even, on the number of nodes trying to join the chorus, and on the number of nodes in the chorus that are aware of the new node(s) at the time of the network failure.

[0087] This process can also piggyback on the process of requesting data fragments from a distributed database to get current chorus members to agree on a new node joining the chorus (the originator sends available fragments, peers send acknowledgments to the originator, and the originator sends the complete data to the requester). This process includes the following modifications to the fault resolution process 1200 in Figure 12 to allow nodes to agree on the chorus size during the node joining process.

[0088] During Round 1 and Round 2 broadcasts, nodes exchange their complete connectivity information (i.e., their list of neighboring nodes and their list of suspected nodes). In response to receiving Round 1 / Round 2 messages, nodes compare their list of suspected nodes and neighboring nodes with the list of suspected nodes and neighboring nodes of their neighbors. If a node finds that its neighbors are aware of n_j nodes that it is not itself aware of, it increments its chorus size by n_j and restarts the process.

[0089] This process ensures correctness: if the new node(s) cannot be added to the node list of all nodes in the chorus because of a network partition, the new node(s) shut down during fault resolution. Let n be the number of nodes in the chorus and n_j the number of nodes that are trying to join the chorus at the same time but cannot enter the node list of all n nodes because of the network partition. Then each of the n_j new nodes, regardless of its partition, shuts down during the process. Therefore, at most n nodes check after Round 1 whether they are in the majority partition to determine whether they should continue operating. Nodes within each partition operate with a chorus size s (n ≤ s ≤ n + n_j), and after Round 1 there are at most n nodes in the chorus, so at most one partition can form the majority group, which ensures correctness.

[0090] However, what happens if all nodes in a partition add a new node(s) to their node list after the fault resolution protocol has started? (Note that nodes prepare their suspect node list and neighbor node list during Stage 1 at the start of the protocol and cache this information.) None of the nodes will be able to detect that a new node(s) has been added to their node list. As a result, the master catalog of the new node(s) may transition to a complete state, the new node(s) may participate in the fault resolution process, and this may result in multiple majority groups.

[0091] For example, consider this scenario. A chorus consists of nodes A, B, and C, where A is the chairman / leader of a distributed database fragment (e.g., the "master catalog" fragment). New nodes D and E attempt to join the chorus simultaneously. Node A sends available messages for D and E to B and C. B and C, not receiving ping messages from A, suspect A and start the protocol. Since B and C have not (yet) applied the available messages from A, they start the protocol with chorus members {A, B, C}. Subsequently, B and C apply the available messages and send acknowledgment messages to A, after which a network partition occurs. The master catalogs of D and E become complete, so A, D, and E start the protocol with chorus members {A, B, C, D, E}. Both groups {A, D, E} and {B, C} believe they can form a majority group.

[0092] The following extension can prevent such a situation: After applying available messages (or, in the case of a chairman node, after sending the master catalog to the new node), the node restarts the fault resolution protocol (if one is in progress), which causes the node to invalidate its cached suspect and neighbor lists and recalculate them with a larger chorus size.

[0093] Figures 13 to 18 illustrate several exemplary failure scenarios and how the failure resolution process of the present invention addresses them.

[0094] Scenario (A): A network partition occurs, separating the new node and the entry node (master catalog originator) from the remaining nodes.

[0095] In Figure 13, SM3 requests the master catalog from TE1 (the chairman of the master catalog) and receives it from TE1. A network partition occurs before TE1 sends MsgObjectAvailable (for example, a message notifying the receiving nodes that the source node has joined the distributed database system) to SM1 and SM2. All nodes, including SM3, initiate the resolution protocol. SM3 and TE1 suspect nodes SM1 and SM2, and SM1 and SM2 suspect TE1 (SM1 and SM2 are unaware of SM3). SM3 is still in the process of joining the chorus (and has not received completion from TE1), so it fails. TE1 also fails (in phase 0) because it suspects two nodes in the chorus {SM1, SM2, SM3, TE1}. SM1 and SM2 continue to operate because they form the majority of the chorus {SM1, SM2, TE1}.

[0096] Scenario (B): A variation of Scenario (A). A network partition occurs, and the new node and entry node (master catalog originator) are separated from the remaining nodes.

[0097] In Figure 14, SM3 requests the master catalog from TE1 (the chairman of the master catalog) and receives it from TE1, SM1 receives MsgObjectAvailable from TE1, and a network partition occurs before SM2 receives MsgObjectAvailable from TE1. SM3 and TE1 suspect SM1 and SM2, SM1 suspects SM3 and TE1, and SM2 suspects TE1 (SM2 is unaware of SM3). SM3 is still in the process of joining the chorus (has not received final confirmation of joining from TE1) and therefore fails, and TE1 and SM1 fail (in phase 0) because they suspect two nodes in the chorus {SM1, SM2, SM3, TE1}. SM2 initially suspects only TE1 (with n = 3, one suspect is fewer than n / 2 nodes), so it does not fail in phase 0 and sends the round 1 message of phase 1 to SM1. After learning of SM1's failure, however, it restarts the resolution protocol and then fails itself.

[0098] Scenario (D): A network partition occurs, and a new node, an entry node, and some peers are separated from the remaining peers.

[0099] In Figure 15, SM3 requests the master catalog from TE1 (the chairman of the master catalog) and receives it from TE1. SM1 receives MsgObjectAvailable from TE1, but before SM2 receives MsgObjectAvailable from TE1, SM2 is isolated from the remaining nodes due to a network partition. SM3 is still in the process of joining the chorus (has not yet received completion from TE1) and therefore fails. SM2 is in the minority partition of the chorus {SM1, SM2, TE1} and therefore fails. TE1 and SM1 initiate the protocol, do not receive a response from SM3 in round 1, and, after finally suspecting SM3, shut themselves down.

[0100] Scenario (E): Network partitioning causes the entry node (chairman of the master catalog) to become separated from the remaining nodes.

[0101] In Figure 16, SM3 requests the master catalog from TE1 (the chairman of the master catalog) and receives it from TE1. SM1 and SM2 receive MsgObjectAvailable from TE1, and due to network partitioning, entry node TE1 is separated from the remaining nodes. SM3 is still in the process of joining the chorus and therefore fails. TE1 is in the minority partition of the chorus {SM1, SM2, SM3, TE1} and therefore fails. SM1 and SM2 start the fault resolution protocol, do not receive a response from SM3, and ultimately suspect SM3 before shutting themselves down.

[0102] Scenario (H): Network partitioning isolates the new node, entry node, and some peers from the remaining nodes.

[0103] In Figure 17, SM4 requests the master catalog from TE1 (the chairman of the master catalog) and receives it from TE1. SM1 and SM3 receive MsgObjectAvailable from TE1, a network partition occurs, and SM2 is isolated from the remaining nodes. SM4 is still in the process of joining the chorus and therefore ceases to function. SM2 is in the minority partition of the chorus {SM1,SM2,SM3,TE1} and therefore ceases to function. TE1, SM1, and SM3 form the majority group in the chorus {SM1,SM2,SM3,SM4,TE1} and continue to operate. In this case, the group {TE1,SM1,SM3} was the majority in the original chorus {TE1,SM1,SM2,SM3} and is still the majority in the new chorus {TE1,SM1,SM2,SM3,SM4}. This allows TE1, SM1, and SM3 to continue operating even after SM4 is added to their node list. Generally, this behavior occurs whenever the chorus size changes from an even number of nodes to an odd number of nodes (for example, when one node attempts to join the chorus during a network failure).

[0104] Scenario (I): Network partitioning causes new nodes, entry nodes, and some peers to be separated from the remaining nodes.

[0105] In Figure 18, SM4 and SM5 request the master catalog from TE1 (the chairman of the master catalog) and receive it from TE1. SM1 and SM3 receive MsgObjectAvailable from TE1 for both SM4 and SM5, and SM2 is isolated from the remaining nodes due to network partitioning. SM4 and SM5 cease to function because they are still in the process of joining the chorus, and SM2 ceases to function because it is in the minority group of the chorus {SM1, SM2, SM3, TE1}. TE1, SM1, and SM3 also cease to function because they form the minority group in the chorus {SM1, SM2, SM3, SM4, SM5, TE1}. Nodes TE1, SM1, and SM3, which continued to operate in scenario (H), cease to function here because there are two nodes trying to join the chorus, and these nodes become the minority group in the new chorus.

[0106] Conceptually, a chorus with n nodes can tolerate a network partition (or simultaneous failure of up to (n-(n / 2+1)) nodes) in which at most (n-(n / 2+1)) nodes are separated from the rest of the chorus, and can still continue to operate. If a single node is trying to join the chorus, and n is odd, the chorus can tolerate the separation of (n-(n / 2+1)-1) nodes and can still continue to operate. If a single new node is trying to join, and n is even, the chorus can tolerate the separation of (n-(n / 2+1)) nodes and can still continue to operate.

[0107] Define the fault tolerance of the chorus as the maximum number of node failures that the chorus can tolerate without all the nodes in the chorus stopping functioning. In a chorus with n nodes, when there are no new nodes joining the chorus, the fault tolerance of the chorus is (n - (n / 2 + 1)) (column 1 of Table 1). When there is one node attempting to join the chorus, the fault tolerance of the chorus decreases to (n - (n / 2 + 1) - 1) if n is odd and remains (n - (n / 2 + 1)) if n is even (column 2 of Table 1). When the number of new nodes attempting to join the chorus (simultaneously) is greater than 1, the fault tolerance of the chorus can further decrease. Table 1 summarizes the fault tolerance of the chorus for various numbers of nodes (n) in the chorus and various numbers of nodes (n_j) attempting to join the chorus simultaneously. In Table 1 below, there are n_j nodes attempting to join the chorus simultaneously, and at least one node in the majority partition has received MsgObjectAvailable for all n_j nodes.

[Table 1]

[0108] The fault resolution for Scenarios B, D, and F (above) is incorporated into the table entries for n = 3 and n_j = 1. Since the fault tolerance of the chorus for this configuration is zero, the entire chorus stops functioning due to a network partition (or any node failure) occurring while a new node is joining (with at least one node receiving MsgObjectAvailable). Scenario A is not incorporated into Table 1 because none of the nodes in the majority group of Scenario A received MsgObjectAvailable. Scenario H is incorporated into the entry for n = 4 and n_j = 1. The fault tolerance of the chorus for Scenario H is 1. Since the chorus has a single node in the minority partition, the chorus continues to operate. Scenario I is incorporated into the entry for n = 4 and n_j = 2. Since the fault tolerance of the chorus in this configuration is zero, a network partition while the nodes are joining causes the entire chorus to cease functioning.
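The fault tolerance values quoted in the text can be reproduced with a short calculation. The following Python sketch is an assumption-laden illustration: Table 1 itself is not reproduced here, and the general expression for n_j > 1 is inferred from the single-join rule and from the entries quoted for Scenarios H and I, not taken from the table. It also assumes that "n / 2" denotes integer division, that joining nodes always shut down, and that the survivors must form a majority of n + n_j. The names `majority` and `fault_tolerance` are hypothetical.

```python
def majority(chorus_size: int) -> int:
    """Smallest group size that is a majority, using integer division."""
    return chorus_size // 2 + 1


def fault_tolerance(n: int, n_joining: int = 0) -> int:
    """Node failures an n-node chorus can absorb while n_joining nodes are joining.

    Assumption: the n existing nodes must still form a majority of the
    enlarged chorus of size n + n_joining, and the result is clamped at zero.
    """
    return max(0, n - majority(n + n_joining))
```

With these assumptions, `fault_tolerance(3, 1)` is 0, `fault_tolerance(4, 1)` is 1, and `fault_tolerance(4, 2)` is 0, matching the values the text gives for those configurations.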

Handling node failures

[0109] This section discusses how a distributed database system handles one or more node failures (or shutdowns) while resolving a network failure. As discussed above, the process of resolving a network failure event involves nodes exchanging failure detection messages, nodes deciding whether to continue operating based on the exchanged messages, and nodes that decide to continue operating performing a failover of the suspected node. This process is illustrated in Figure 19.

[0110] In Figure 19, the chorus includes members {A, B, C, D}. Network partitioning separates {A, B, C} from D. Nodes A, B, and C suspect node D, exchange fault detection messages, decide to continue operating, and perform a failover of D. Node D suspects nodes A, B, and C, initiates a fault resolution protocol, and shuts itself down.

[0111] If ping (heartbeat) timeouts are enabled, a node failure will cause the neighbors of the failed node to initiate (or restart) a fault resolution protocol, agree to remove the failed node, and remove it from the chorus. If a node failure occurs while a distributed database system is resolving a network failure event, the failed node may appear as a new suspect to its neighbors. This allows the neighbors to restart the protocol. Therefore, there is no special mechanism to deal with node failures during partition resolution. Instead, the processing described herein ensures that nodes that initiate / restart a fault resolution protocol in response to a node failure agree on chorus membership.

Handling node failures during fault detection message exchange

[0112] If a node fails while fault detection messages are being exchanged, that node is not yet on its neighbors' suspect lists. As a result, the neighbor nodes agree on chorus membership / chorus size, as in the process discussed above. In response to detecting a new suspect node resulting from the node failure, the neighbors restart the fault resolution process using the updated suspect list. This updated suspect list is the union of the suspect nodes resulting from the network failure and the node that failed. Neighbors continue operation if they form a majority group based on the updated suspect list.

[0113] In Figure 20, a network partition separates {A, B, C} from D, and while the nodes are exchanging messages, node C fails. Nodes A and B, suspecting C, restart the protocol. Since A and B do not form a majority group in the chorus {A, B, C, D}, they cease to function.
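The decisions in Figures 19 and 20 follow from a simple majority test on the chorus and the suspect list. The following Python sketch assumes that a node continues only if the nodes it does not suspect (itself included) number at least n / 2 + 1 under integer division; the function name `continues_operating` is hypothetical.

```python
def continues_operating(chorus: set, suspects: set) -> bool:
    """True if the non-suspected nodes form a majority of the chorus."""
    reachable = len(chorus - suspects)
    return reachable >= len(chorus) // 2 + 1


chorus = {"A", "B", "C", "D"}

# Figure 19: a partition separates {A, B, C} from D.
figure_19_a = continues_operating(chorus, {"D"})            # A, B, C suspect D
figure_19_d = continues_operating(chorus, {"A", "B", "C"})  # D suspects the rest

# Figure 20: C fails during the exchange, so A and B now suspect C and D.
figure_20_a = continues_operating(chorus, {"C", "D"})
```

Here `figure_19_a` is true while `figure_19_d` and `figure_20_a` are false: three of four nodes are a majority, but two of four are not.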

Handling node failures during failover

[0114] If a node fails while it is performing a failover (while removing the non-functioning node from the chorus membership list), its neighbors may have started or completed the failover of other suspected nodes. As a result, neighbors may have removed one or more suspected nodes from their own node list, which may cause mismatches in chorus membership / chorus size when the protocol is started / restarted.

[0115] In Figure 21, the chorus contains members {A, B, C, D}. A network partition separates {A, B, C} from D. Nodes A, B, and C initiate the protocol, exchange failure detection messages, and decide to continue operating. Nodes A, B, and C then initiate the failover of node D. After A completes the failover of D (and removes D from its node list), node C fails while B is still performing the failover of D. As a result, A suspects C and initiates node failure processing with chorus {A, B, C} and suspect list {C}, whereas B initiates node failure processing with chorus {A, B, C, D} and suspect list {C, D}. A and B therefore do not agree on chorus membership.

[0116] In this case, the nodes come to agree on the chorus size as follows:

[0117] Nodes that have not failed exchange their complete connectivity information (i.e., their list of neighboring nodes and their list of suspected nodes) during rounds 1 and 2 of the broadcast. After receiving the round 1 / round 2 messages, a node compares its list of suspected nodes and neighboring nodes with the list of suspected nodes and neighboring nodes of its neighbors. If a node finds that its neighbors are aware of n_j nodes that it is not itself aware of, it increments its chorus size by n_j and restarts the fault resolution process.

[0118] Therefore, if n is the number of nodes in the majority partition, f is the number of failed nodes, and e is the number of nodes removed by failover, the nodes in the partition continue to operate if (n - f) ≥ (s / 2 + 1), where n ≤ s ≤ n + e.

[0119] However, what happens if a failover is completed on a node while the node is performing fault resolution? To increase the likelihood of keeping chorus members operational, the following changes can be made to the fault resolution process: After the node has completed the failover of the node being removed, the fault resolution process is restarted (if one is in progress), which will cause this process to run with a smaller chorus size.

[0120] To ensure all nodes agree on chorus size when a node restarts with a smaller chorus size, this process can be further extended as follows: Nodes exchange their complete connectivity information (i.e., their neighbor list and their suspect list) during rounds 1 and 2 of the broadcast. Nodes then compare their suspect list and neighbor list with the suspect list and neighbor list of their neighbors. If a node finds that its neighbors are aware of n_j nodes that it is not itself aware of, it increments its chorus size by n_j and restarts the process. If the node's neighbors later remove r_j nodes from their chorus lists and restart the process, the node decrements its chorus size by r_j and restarts the process.

[0121] Although this makes the nodes agree on chorus size, the nodes do not actually need to agree on chorus membership (or chorus size), as long as the nodes with questionable membership and the new nodes are down. In other words, each node can perform the fault resolution process based on the chorus membership determined by its own master catalog node list. This process ensures that all nodes reach the correct result, as long as the nodes whose membership is not agreed upon go down before the process starts or during the process.

[0122] To understand why this holds, let n + n_j be the number of nodes in the chorus. Here n is the number of nodes whose master catalog is complete, and n_j is the sum of the number of nodes that have failed and the number of nodes that will fail (as in the case of node failure, where the master catalogs of these nodes may or may not be complete at the time of failure), or the number of new nodes that will fail when the fault resolution protocol is started (as in the case of node joining, where the master catalogs of these nodes are not complete at the time of failure).

[0123] If s is the size of the master catalog node list of a node participating in the fault resolution protocol, then n ≤ s ≤ n + n_j. Note that s may not be the same for all nodes participating in the fault resolution protocol.

[0124] If each node runs the fault resolution protocol with the chorus size set to the size of its own master catalog node list, is it guaranteed that the nodes in at most one partition remain operational? Yes, because n ≤ s ≤ n + n_j, so the majority group size calculated by each node is at least (n / 2 + 1). If every node in a partition can conclude that it is in the majority group (n / 2 + 1 ≤ majority group size ≤ (n + n_j) / 2 + 1), then that partition has at least (n / 2 + 1) nodes. Since only n nodes participate in the protocol, there can be at most one such partition. Therefore, the nodes in at most one partition can successfully complete the fault resolution protocol and continue to operate.
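The counting argument in this paragraph can be written out as a short derivation. The sketch below assumes that "n / 2" denotes integer division (written as a floor) and uses P_1 and P_2, symbols introduced here for illustration only, for two disjoint partitions drawn from the n participating nodes.

```latex
\begin{align*}
n \le s \le n + n_j
  &\;\Longrightarrow\;
  \left\lfloor \tfrac{s}{2} \right\rfloor + 1 \;\ge\; \left\lfloor \tfrac{n}{2} \right\rfloor + 1, \\
|P_1| \ge \left\lfloor \tfrac{n}{2} \right\rfloor + 1
  \ \text{and}\
|P_2| \ge \left\lfloor \tfrac{n}{2} \right\rfloor + 1
  &\;\Longrightarrow\;
  |P_1| + |P_2| \;\ge\; 2\left\lfloor \tfrac{n}{2} \right\rfloor + 2 \;\ge\; n + 1 \;>\; n .
\end{align*}
```

Two disjoint partitions cannot together contain more than the n participating nodes, so at most one partition can meet the threshold that every one of its nodes computes.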

[0125] Not all nodes within a partition need to conclude that they are in the majority for that partition to become the winning fully connected component. A subset of nodes within a partition may conclude that they are not in the majority group, depending on the size of their master catalog node list. These nodes will cease to function during stage 2 of the process (Figure 12), and the remaining nodes in that partition will restart the process. However, if the remaining nodes can conclude that they are in the majority group when the process restarts, that may be sufficient for their fully connected component to become the winning fully connected component.

Achieving additional performance targets

[0126] If a user chooses to shut down more than half of the chorus member nodes, fault detection should not be triggered. This is achieved by modifying the fault resolution process so that manually shut-down nodes are not treated as suspected nodes.

Identifying nodes that are shutting down

[0127] When a node receives a shutdown request from the management layer, it broadcasts a node state message (MsgNodeState) indicating that it is shutting down (for example, with the node state NODE_STATE_SHUTTING_DOWN). The management layer in a distributed database system is the layer of nodes through which users interact with the distributed database. The management layer can track nodes in the distributed database system and facilitate communication between users and nodes in the distributed database system. For example, if a user wants to shut down a node, the user can issue a shutdown command to the management layer, which then sends a shutdown message to the node specified by the user. This process relies on at least one chorus member receiving this node state message from the node that is shutting down.

Changes to the fault resolution protocol

[0128] The following changes can be made to the fault resolution protocol. (a) Nodes that are known to be shutting down during Stage 1 of the process (Figure 12) are not considered suspect nodes. (b) Chorus members gossip about the nodes that are shutting down (for example, by exchanging fault detection messages between stages 3 and 4 of the process).

[0129] The following is an example of how these changes can satisfy the need to identify nodes that are being manually shut down. Consider a chorus with nodes A, B, C, and D. Assume that the user shuts down nodes C and D at approximately the same time. Assume that only node A receives a node state message from C, and only node B receives a node state message from D. Node A initiates fault resolution with chorus {A,B,C,D}, suspect list {D}, and shutting-down node list {C}, and sends a round 1 fault detection message to B. Node B initiates the protocol with chorus {A,B,C,D}, suspect list {C}, and shutting-down node list {D}, and sends a round 1 fault detection message to A. In response to receiving the fault detection message, node A updates its shutting-down node list to {C,D}, updates its suspect list to {}, and restarts the protocol. Node B does the same. After Round 1, Nodes A and B conclude that they are in the majority partition based on chorus size = 4 and suspect node list size = 0, and continue operating.
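The list updates in this example can be traced with a few lines of Python. This is a sketch under assumptions, not the claimed implementation: the majority test uses integer division on the chorus size and the suspect list size, as the example states, and the names `merge_gossip` and `in_majority` are hypothetical.

```python
def merge_gossip(suspects, shutting_down, peer_shutting_down):
    """Merge a peer's shutting-down list and drop those nodes from the suspect list."""
    merged = set(shutting_down) | set(peer_shutting_down)
    return set(suspects) - merged, merged


def in_majority(chorus_size, suspect_count):
    """Majority test on chorus size and suspect list size, using integer division."""
    return chorus_size - suspect_count >= chorus_size // 2 + 1


# Node A: suspects {D}, knows C is shutting down, and learns from B that D is too.
suspects_a, shutting_down_a = merge_gossip({"D"}, {"C"}, {"D"})
# Node B: suspects {C}, knows D is shutting down, and learns from A that C is too.
suspects_b, shutting_down_b = merge_gossip({"C"}, {"D"}, {"C"})

a_continues = in_majority(4, len(suspects_a))
b_continues = in_majority(4, len(suspects_b))
```

After the exchange both suspect lists are empty and both shutting-down lists are {C, D}, so A and B each evaluate the majority test with chorus size 4 and zero suspects and continue operating.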

[0130] However, if a network partition or link failure occurs while a node is shutting down, how will the modified protocol arrive at the correct solution? Consider this scenario. The chorus includes nodes A, B, C, D, and E. The user shuts down node E, and almost simultaneously, a network partition separates {A,B} from {C,D}. Assume all nodes receive node state messages from E. Node A initiates the protocol with chorus {A,B,C,D,E}, suspect list {C,D}, and shutting-down node list {E}, and sends a round 1 failure detection message to B. Node B also initiates the protocol with chorus {A,B,C,D,E}, suspect list {C,D}, and shutting-down node list {E}, and sends a round 1 failure detection message to A. Upon receiving the failure detection messages, nodes A and B conclude (based on chorus size = 5 and suspect node list size = 2) that they are in the majority partition and continue operating. Nodes C and D also continue operating by the same logic, so two groups each believe they are the majority. The following approach can ensure that the protocol reaches the correct outcome in this scenario: If a network partition (or link failure) occurs while the node(s) are shutting down, treat the shutting-down node(s) as suspect nodes.
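The difference the final rule makes in this scenario can be checked numerically. In the Python sketch below, the flag `network_failure` is a hypothetical input standing for "a partition or link failure was detected while nodes were shutting down"; how a node establishes this is not specified here. The names `effective_suspects` and `continues` are also hypothetical, and the majority test assumes integer division.

```python
def effective_suspects(suspects, shutting_down, network_failure):
    """Suspect list used for the majority test.

    If a network failure coincides with the shutdown, shutting-down nodes are
    counted as suspects; otherwise they are excluded from the suspect list.
    """
    if network_failure:
        return set(suspects) | set(shutting_down)
    return set(suspects) - set(shutting_down)


def continues(chorus_size, suspects):
    """Majority test on chorus size and suspect list size, using integer division."""
    return chorus_size - len(suspects) >= chorus_size // 2 + 1


# View of node A (node B is identical; C and D are symmetric with suspects {A, B}).
without_rule = continues(5, effective_suspects({"C", "D"}, {"E"}, network_failure=False))
with_rule = continues(5, effective_suspects({"C", "D"}, {"E"}, network_failure=True))
```

Without the rule the test passes on both sides of the partition (three non-suspects out of five), which is the split the paragraph describes. With the rule each side counts three suspects and stops, so no two groups continue independently.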

[0131] In summary, if a user shuts down more than half of the chorus member nodes (call this set of nodes SD), this process results in the remaining nodes (call this set of nodes NSD) continuing to operate if the following conditions are met. (A) At least one node of the NSD receives the node state message from each node of the SD. (B) Each node in the NSD receives a node state message from at least one node in the SD. (This is because an NSD node that does not receive a node state message from any node in the SD suspects all nodes in the SD and shuts itself down in Stage 2 before it learns about any other nodes shutting down.) (C) No network partition or link failure occurs until the NSD nodes complete the fault resolution protocol and remove the SD nodes from the chorus.

Conclusion

[0132] While various embodiments of the present invention have been described and illustrated herein, it is likely that those skilled in the art will readily imagine various other means and / or structures for performing the functions described herein and / or obtaining one or more of the results and / or benefits, and each of such variations and / or modifications will be considered to fall within the scope of the embodiments of the present invention described herein. More generally, all parameters, dimensions, materials and configurations described herein are intended to be exemplary, and those skilled in the art will readily understand that the actual parameters, dimensions, materials and / or configurations will depend on one or more specific applications in which the teachings of the present invention are used. Those skilled in the art will be able to recognize or confirm many equivalents to specific embodiments of the present invention described herein through mere routine experimentation. Therefore, it should be understood that the embodiments described herein are merely examples, and embodiments of the present invention may be carried out in ways other than those specifically described and claimed, within the scope of the appended claims and their equivalents. Embodiments of the invention in this disclosure are directed toward each individual feature, system, article, material, kit and / or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and / or methods is included in the scope of the invention of this disclosure, provided that such features, systems, articles, materials, kits, and / or methods are not contradictory to each other.

[0133] Furthermore, various inventive concepts can be embodied in one or more methods, and examples are provided. The actions performed as part of a method can be arranged in any suitable manner. Thus, embodiments can be constructed in which the actions are performed in a different order than those shown, which may include performing some actions simultaneously, even if they are shown as sequential actions in the exemplary embodiments.

[0134] All definitions defined and used herein should be understood to supersede dictionary definitions, definitions in references incorporated by citation, and / or the common meanings of the terms defined.

[0135] The indefinite articles "a" and "an," as used herein and in the claims, should be understood to mean "at least one" unless explicitly indicated otherwise.

[0136] As used herein and in the claims, the phrase "and/or" should be understood to mean "either or both" of the elements so conjoined, that is, elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with "and/or" should be construed in the same way, that is, as "one or more" of the elements so conjoined. Elements other than those specifically identified by the "and/or" clause may optionally be present, whether related or unrelated to those specifically identified elements. As a non-limiting example, a reference to "A and/or B," when used in conjunction with open-ended language such as "comprising," may refer in one embodiment to A only (optionally including elements other than B), in another embodiment to B only (optionally including elements other than A), in yet another embodiment to both A and B (optionally including other elements), and so on.

[0137] As used herein and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” should be interpreted as inclusive, that is, as including at least one, but also more than one, of a number or list of elements and, optionally, additional unlisted items. Only terms that clearly indicate the contrary, such as “only one of” or “exactly one of” or, when used in the claims, “consisting of,” refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein should be interpreted as indicating exclusive alternatives (i.e., “one or the other, but not both”) only when accompanied by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, has its ordinary meaning as used in the field of patent law.

[0138] As used herein and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in that list, but not necessarily including at least one of each and every element specifically listed in that list, and not excluding any combination of elements in that list. This definition also allows that elements other than those specifically identified within the list of elements to which the phrase “at least one” refers may optionally be present, whether related or unrelated to those specifically identified elements. Thus, as a non-limiting example, "at least one of A and B" (or, equivalently, "at least one of A or B," or, equivalently, "at least one of A and/or B") may refer, in one embodiment, to at least one A, optionally including more than one A, with no B present (and optionally including elements other than B); in another embodiment, to at least one B, optionally including more than one B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one A, optionally including more than one A, and at least one B, optionally including more than one B (and optionally including other elements); and so on.

[0139] In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” and “composed of” are to be understood as open-ended, that is, as meaning “including but not limited to.” Only the transitional phrases “consisting of” and “consisting essentially of” are closed or semi-closed transitional phrases, respectively, as set forth in Section 2111.03 of the United States Patent Office Manual of Patent Examining Procedures.

Claims

1. A method for resolving a failure affecting nodes in a distributed database, wherein each node in the distributed database is configured to communicate directly with each other node in the distributed database, the method comprising: in response to the failure, exchanging, among the non-failed nodes in the distributed database, lists of nodes suspected of no longer being configured to communicate directly with each other node in the distributed database as a result of the failure; determining, by each non-failed node and based on the lists sent and received in the exchange, whether that node remains configured to communicate directly with more than half of the nodes in the distributed database despite the failure; and, to resolve the failure, deactivating the nodes that are no longer configured to communicate directly with more than half of the nodes in the distributed database.
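
As a non-limiting illustration only, and not part of the claims, the majority test recited in claim 1 might be sketched in Python as follows. The function name `surviving_nodes`, the dictionary layout, and the choices to treat a link as intact only when neither endpoint suspects the other and to count a node toward its own total are all assumptions made for this sketch.

```python
from typing import Dict, Set


def surviving_nodes(all_nodes: Set[str], suspects: Dict[str, Set[str]]) -> Set[str]:
    """Return the nodes that still reach more than half of all nodes.

    `suspects` maps each node that took part in the exchange to the set of
    nodes it suspects. Nodes absent from `suspects` are treated as failed.
    """
    survivors = set()
    for node, suspected in suspects.items():
        # Assumption: a link is up only if neither endpoint suspects the other.
        reachable = {
            peer
            for peer in all_nodes
            if peer != node
            and peer in suspects
            and peer not in suspected
            and node not in suspects[peer]
        }
        # Assumption: the node counts itself toward the majority.
        if len(reachable) + 1 > len(all_nodes) / 2:
            survivors.add(node)
    return survivors


# Five nodes split 3/2 by a network partition: only the larger side survives.
cluster = {"A", "B", "C", "D", "E"}
lists = {
    "A": {"D", "E"}, "B": {"D", "E"}, "C": {"D", "E"},
    "D": {"A", "B", "C"}, "E": {"A", "B", "C"},
}
print(sorted(surviving_nodes(cluster, lists)))
```

Nodes left out of the returned set correspond to the nodes that the claim deactivates.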

2. The method according to claim 1, wherein exchanging the list includes broadcasting the list in a coordinated broadcast of up to two rounds among the non-failed nodes in the distributed database.

3. The method according to claim 1, wherein exchanging the list includes one of the non-failed nodes broadcasting a protocol iteration number representing the failure resolution protocol invoked to resolve the failure.

4. The method according to claim 3, wherein exchanging the list further includes at least one of the non-failed nodes incrementing a local protocol iteration number in response to the protocol iteration number included in at least one of the lists.
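
As a non-limiting illustration only, and not part of the claims, the handling of protocol iteration numbers in claims 3 and 4 might be sketched as below. The class `ResolverState` and its methods are hypothetical names, and advancing the local number directly to a larger received number is an assumption about how the increment could be realized.

```python
from dataclasses import dataclass


@dataclass
class ResolverState:
    """Per-node bookkeeping for the failure resolution protocol (hypothetical)."""

    iteration: int = 0

    def start_resolution(self) -> int:
        """Begin a new protocol run and return the number to broadcast."""
        self.iteration += 1
        return self.iteration

    def on_list_received(self, remote_iteration: int) -> bool:
        """Return True if the received list belongs to the current run.

        Assumption: a larger remote number means this node is behind, so it
        advances its local number to match before accepting the list.
        """
        if remote_iteration > self.iteration:
            self.iteration = remote_iteration
            return True
        # Lists tagged with an older number are ignored as stale.
        return remote_iteration == self.iteration


state = ResolverState()
state.start_resolution()           # local iteration is now 1
print(state.on_list_received(3))   # newer run seen: the node catches up
print(state.iteration)
```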

5. The method according to claim 1, wherein determining whether each node that is not malfunctioning remains configured to communicate directly with more than half of the nodes in the distributed database despite the failure includes determining the connection information of the nodes.

6. The method according to claim 1, wherein determining whether each node that is not malfunctioning remains configured to communicate directly with more than half of the nodes in the distributed database despite the failure includes comparing lists from different nodes.

7. The method according to claim 1, wherein determining whether each non-failed node remains configured to communicate directly with more than half of the nodes in the distributed database despite the failure includes: identifying groups of nodes that remain configured to communicate directly with each other despite the failure; determining the number of nodes in each of the groups of nodes; and identifying, based on the numbers of nodes, one of the groups of nodes as a winning group.

8. The method according to claim 7, wherein identifying the winning group includes selecting the group of nodes with the largest number of nodes as the winning group.

9. The method according to claim 7, wherein identifying the winning group includes: determining that two of the groups of nodes contain (i) the same number of nodes and (ii) more nodes than any other group of nodes; and selecting, using a tie-break process, one of the two groups of nodes as the winning group.
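
As a non-limiting illustration only, and not part of the claims, the selection of a winning group in claims 7 to 9 might be sketched as a brute-force search for the largest group of mutually connected nodes. The function name `winning_group`, the representation of links as unordered pairs, and the lexicographic tie-break are assumptions; the claims do not specify a particular tie-break process. The search is exponential in the number of nodes and is meant only for small examples.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def winning_group(nodes: Set[str], links: Set[FrozenSet[str]]) -> List[str]:
    """Return the largest group in which every pair of nodes is still linked."""
    ordered = sorted(nodes)
    # Try the largest candidate groups first.
    for size in range(len(ordered), 0, -1):
        # combinations() yields groups in lexicographic order, so the first
        # fully connected group of a given size also wins any tie (assumed rule).
        for group in combinations(ordered, size):
            if all(frozenset(pair) in links for pair in combinations(group, 2)):
                return list(group)
    return []


# Two groups of equal size: {A, B} and {C, D}. The tie-break picks {A, B}.
links = {frozenset(("A", "B")), frozenset(("C", "D"))}
print(winning_group({"A", "B", "C", "D"}, links))
```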

10. The method according to claim 1, further comprising, before exchanging the list: in response to the failure, determining, by each node in the distributed database, whether that node suspects that more than half of the nodes, or less than half of the nodes, are no longer configured to communicate directly with the other nodes in the distributed database as a result of the failure; and deactivating each node that suspects that more than half of the nodes are no longer configured to communicate directly with the other nodes in the distributed database as a result of the failure.
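
As a non-limiting illustration only, and not part of the claims, the check that claim 10 performs before any lists are exchanged might be sketched as below. The function name `should_deactivate_early` is hypothetical. The claim does not address the case where exactly half of the nodes are suspected, so this sketch assumes such a node continues to the exchange.

```python
def should_deactivate_early(suspected_count: int, cluster_size: int) -> bool:
    """Return True if this node suspects strictly more than half of all nodes."""
    # Integer comparison avoids floating-point division.
    return 2 * suspected_count > cluster_size


print(should_deactivate_early(3, 5))  # suspects 3 of 5 nodes
print(should_deactivate_early(2, 5))  # suspects 2 of 5 nodes
```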

11. A distributed database system comprising a plurality of nodes, each node in the plurality of nodes being configured to communicate directly with each other node in the distributed database, the plurality of nodes including a first node, the first node being configured to: in response to a failure, exchange, with the other nodes in the plurality of nodes, a list of nodes suspected of no longer being configured to communicate directly with each other node in the distributed database as a result of the failure; determine, based on the lists transmitted and received in the exchange, whether the first node remains configured to communicate directly with more than half of the nodes in the distributed database despite the failure; and shut down, to resolve the failure, if the first node is no longer configured to communicate directly with more than half of the nodes in the distributed database.

12. The distributed database system according to claim 11, wherein the first node is configured to participate in up to two rounds of a coordinated broadcast of the list.

13. The distributed database system according to claim 11, wherein the first node is configured to broadcast a protocol iteration number representing the failure resolution protocol invoked to resolve the failure.

14. The distributed database system according to claim 13, wherein the first node is further configured to increment the protocol iteration number in response to receiving a larger protocol iteration number from another node among the nodes in the plurality of nodes.

15. The distributed database system according to claim 11, wherein the first node is configured to determine a connection graph based on the list.

16. The distributed database system according to claim 11, wherein the first node is configured to compare lists from different nodes.

17. The distributed database system according to claim 11, wherein the first node is configured to determine whether each non-failed node remains configured to communicate directly with more than half of the nodes in the distributed database despite the failure by: identifying groups of nodes that remain configured to communicate directly with each other despite the failure; determining the number of nodes in each of the groups of nodes; and identifying, based on the numbers of nodes, one of the groups of nodes as a winning group.

18. The distributed database system according to claim 17, wherein the first node is configured to identify the winning group by selecting the group of nodes with the largest number of nodes as the winning group.

19. The distributed database system according to claim 17, wherein the first node is configured to identify the winning group by: determining that two of the groups of nodes contain (i) the same number of nodes and (ii) more nodes than any other group of nodes; and selecting, using a tie-break process, one of the two groups of nodes as the winning group.

20. The distributed database system according to claim 11, wherein the first node is further configured to, before exchanging the list: determine, in response to the failure, whether the first node is configured to communicate directly with more than half, or with less than half, of the nodes in the plurality of nodes despite the failure; and shut itself down if it is configured to communicate directly with less than half of the nodes in the plurality of nodes.

21. A distributed database comprising a plurality of nodes, each node in the plurality of nodes being directly connected to each other node in the plurality of nodes, the plurality of nodes including a first node, the first node being configured to: detect an interruption in communication with a second node in the plurality of nodes caused by a failure in the distributed database; in response to detecting the interruption, initiate a coordinated broadcast of lists of suspected nodes among neighboring nodes in the plurality of nodes, wherein the neighboring nodes are nodes in the plurality of nodes that remain directly connected to the first node, and the list of suspected nodes of the first node includes the second node; determine connection information based on the list of suspected nodes of the first node and the lists of suspected nodes received from other nodes during the coordinated broadcast, wherein the connection information indicates which nodes in the plurality of nodes remain directly connected to each other after the failure; and resolve the failure based at least in part on the connection information.

22. The distributed database according to claim 21, wherein the first node is configured to initiate the coordinated broadcast by broadcasting a failure resolution protocol for resolving the failure.
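
As a non-limiting illustration only, and not part of the claims, deriving the connection information of claim 21 from the collected lists of suspected nodes might be sketched as below. The function name `connection_info` and the rule that a direct connection is intact only when neither endpoint lists the other as suspected are assumptions made for this sketch.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def connection_info(suspect_lists: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """Return the set of node pairs that remain directly connected.

    `suspect_lists` holds the first node's own list plus the lists received
    from other nodes during the coordinated broadcast.
    """
    links: Set[FrozenSet[str]] = set()
    for a, b in combinations(sorted(suspect_lists), 2):
        # Assumption: one-sided suspicion is enough to treat the link as down.
        if b not in suspect_lists[a] and a not in suspect_lists[b]:
            links.add(frozenset((a, b)))
    return links


# n1 has lost contact with n2; n3 can still reach both of them.
collected = {"n1": {"n2"}, "n2": set(), "n3": set()}
for link in sorted(sorted(pair) for pair in connection_info(collected)):
    print(link)
```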

23. The distributed database according to claim 21, wherein the coordinated broadcast includes: a first round in which the first node transmits the list of suspected nodes of the first node to the neighboring nodes and receives a list of suspected nodes of a third node from the third node, the third node being one of the neighboring nodes; and a second round in which the first node transmits the list of suspected nodes of the first node to a fourth node, the fourth node being one of the neighboring nodes.

24. The distributed database according to claim 23, wherein the list of suspected nodes of the first node includes nodes in the plurality of nodes that the first node suspects have lost direct connection to the first node as a result of the failure.

25. The distributed database according to claim 23, wherein the first node is configured not to participate in the second round if the list of suspected nodes of the first node matches the list of suspected nodes received by the first node during the first round.

26. The distributed database according to claim 23, wherein the first node is further configured to update the list of suspected nodes of the first node in response to determining that another node among the plurality of nodes has failed to function.

27. The distributed database according to claim 26, wherein the first node is further configured to determine, in response to updating the first node's list of suspected nodes, whether the first node's list of suspected nodes includes more than half of the nodes in the plurality of nodes, and to shut down itself if the first node's list of suspected nodes includes more than half of the nodes in the plurality of nodes.

28. The distributed database according to claim 21, wherein the first node is further configured to shut down itself if the list of suspected nodes of the first node includes more than half of the nodes in the plurality of nodes.

29. The distributed database according to claim 21, wherein the first node is configured to determine the connection information based on up to two rounds of the coordinated broadcast.

30. The distributed database according to claim 21, wherein the first node is further configured to deactivate itself if the connection information indicates that the first node is directly connected to less than half of the nodes in the plurality of nodes.