Demystifying Distributed Consensus
Last updated on Feb 12, 2025


1. What am I reading?

As a software engineer, you've likely encountered terms like distributed consensus in distributed systems, Raft in etcd, KRaft in Kafka, or RedisRaft in Redis. Some of you may already know that distributed consensus is a very important concept that allows a group of nodes to agree on a shared state, ensuring data consistency and enabling fault tolerance. Some of you may even have implemented Raft once.

But,

  • Why do we need our systems to be fault-tolerant?
  • How can we tolerate faults?
  • Why do we need distributed consensus in the first place?
  • Can we really achieve consensus in distributed systems?
  • How do modern distributed systems use distributed consensus?
  • Is there any relationship between distributed consensus and distributed transactions?

This post is primarily intended for mid-level engineers or above who are looking to further their knowledge.

2. Why do we need distributed consensus?

Let's begin by breaking down this definition of distributed consensus:

In a dependable and secure distributed system, distributed consensus enables a group of processes to agree on a shared state even in the presence of faults.

2.1. A dependable and secure distributed system

This section explains the phrase a dependable and secure distributed system in the definition of distributed consensus above.

A distributed system satisfies four criteria:

  1. Multiple processes: The system contains more than one sequential process. These processes can be either system or user processes, but each process has an independent thread of control. This definition of process will be used for the rest of this post.

  2. Interprocess communication: Processes communicate via messages with a finite delivery time from one process to another. The delay depends on the physical characteristics of the message links. These message links are called channels.

  3. Disjoint address space: Processes have disjoint address spaces. Thus, a shared-memory multiprocessor is not a representation of a distributed system.

  4. Collective goal: Processes must interact to achieve a common goal.

A dependable distributed system is a distributed system that satisfies four major requirements:

  • Availability: The probability that the system is running correctly at any given moment and is available to perform its functions for users.
  • Reliability: How long the system can run continuously without interruption. For example, a system that experiences an outage of a few milliseconds every hour may have high availability (99.9999%) but low reliability. Conversely, a system that never crashes but is shut down for two weeks each year has high reliability but only about 96% availability (see the small calculation sketch below).
  • Safety: The system avoids catastrophic consequences even during temporary failures. For example, a temporary failure in a spacecraft control system could have disastrous results.
  • Maintainability: The ease with which a failed system can be repaired.
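
To make these numbers concrete, here is a small Go sketch (the outage figures are illustrative assumptions, not measurements) that derives availability from total downtime over a year:

```go
package main

import "fmt"

// availability returns uptime as a percentage, given total downtime
// and the observation window, both in seconds.
func availability(downtimeSec, windowSec float64) float64 {
	return 100 * (1 - downtimeSec/windowSec)
}

func main() {
	secondsPerYear := 365.0 * 24 * 3600

	// Scenario 1: a few milliseconds of outage every hour
	// (high availability, low reliability).
	outagesPerYear := 365.0 * 24        // one outage per hour
	downtime1 := outagesPerYear * 0.003 // roughly 3 ms each
	fmt.Printf("Scenario 1: %.5f%% available\n", availability(downtime1, secondsPerYear))

	// Scenario 2: never crashes, but shut down two weeks per year
	// (high reliability, roughly 96% availability).
	downtime2 := 14.0 * 24 * 3600
	fmt.Printf("Scenario 2: %.2f%% available\n", availability(downtime2, secondsPerYear))
}
```

It prints roughly 99.9999% for the first scenario and about 96.2% for the second, which is exactly the distinction the Reliability example above makes.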

A secure distributed system is a distributed system that protects its resources from unauthorized access, modification, or destruction. For now, this is all we need to know about security to understand distributed consensus.

In conclusion, due to the fundamental characteristics of distribution (partial failures, network communication, shared resources), dependability and security are not optional features but rather intrinsic concerns that must be addressed in every conceivable distributed system.

2.2. Fault, error, failure


This section explains the word faults in the definition of distributed consensus above.

  • A fault is the root cause of a potential or actual deviation from the intended behavior of a system.
  • An error is an internal state of the system that deviates from the expected state. A fault is the cause of an error.
  • A failure is an event that occurs when the delivered service deviates from the intended service. An error leads to a failure.

Examples:

  • Faults can be:
    • Hardware faults: Physical defects in hardware components
    • Software faults: Bugs in the software code
    • Human faults: Mistakes made by system operators
    • Environmental faults: Power outage
  • Errors can be:
    • A hardware fault may cause a memory location to store incorrect data (error)
    • A software fault may cause a variable to have an incorrect value (error)
  • Failures can be:
    • The incorrect data in memory (error) may cause the program to crash (failure)
    • The incorrect variable value (error) may cause the system to return wrong results to a user (failure)

The lists below classify common types of faults and failures.

Fault types:
  • Transient faults: Occur once and then disappear. For example, if a transmission times out and is retried, it will probably work again.
  • Intermittent faults: Occur, disappear, then reappear, and so on. A loose contact on a connector is an example.
  • Permanent faults: Exist until the faulty component is replaced. Hard drive crashes and broken network cards are some examples.

Failure types (described by the server's behavior):
  1. Crash failure: Halts, but is working correctly until it halts.
  2. Omission failure: Fails to respond to incoming requests.
    2.1. Receive omission: Fails to receive incoming messages.
    2.2. Send omission: Fails to send messages.
  3. Timing failure: The response lies outside a specified time interval.
  4. Response failure: The response is incorrect.
    4.1. Value failure: The response's value is wrong.
    4.2. State-transition failure: Deviates from the correct flow of control.
  5. Arbitrary failure: May produce arbitrary responses at arbitrary times.

Receive omission and send omission are two sub-types of omission failure. Value failure and state-transition failure are two sub-types of response failure. Arbitrary failures are also known as Byzantine failures.

We will talk more about crash failure and Byzantine failure in the context of distributed consensus below.

2.3. Achieving dependability and security in the presence of faults

This section introduces methods to build dependable and secure distributed systems in the presence of faults.

Why not "without the presence of faults"?
This question is equivalent to Why do we need a dependable and secure distributed system? As previously discussed in the section above, the concept of "no faults" implies a perfect system, which is unattainable in practice due to the complexity of real-world systems and the unpredictable nature of the environments they operate in.

There are four major categories of methods to achieve dependability and security:

  1. Fault prevention: To prevent the occurrence or introduction of faults.
  2. Fault tolerance: To avoid failures in the presence of faults. It aims at failure avoidance, typically via error detection and system recovery.
  3. Fault removal: To reduce the number and severity of faults.
  4. Fault forecasting: To estimate the present number, the future incidence, and the likely consequences of faults.

In this post, we only focus on fault tolerance as it's closely related to distributed consensus. For other categories, please check reference #3.

If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes. The key technique for hiding faults is to use redundancy.

No fault-tolerant system can be designed without some form of redundancy. - Reference #1, page 266.

There are three types of redundancy:

  1. Information redundancy: Extra bits are added to allow recovery from corrupted bits. For example, a checksum is sent along with the original data so that the receiver can validate the received data (a small sketch follows this list).
  2. Time redundancy: An action is repeated. For example, in TCP, the sender retransmits a message if it's lost or corrupted.
  3. Physical redundancy: Extra equipment or processes are added to tolerate the loss or malfunctioning of some components. For example, when running multiple identical servers that provide the same service, if one server fails, the others can continue to serve requests.
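
As a small illustration of information and time redundancy, here is a hedged Go sketch (helper names such as sendWithRetry are made up for this example): a CRC32 checksum travels with the payload so the receiver can detect corruption, and the operation is simply retried when verification fails.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// message carries the payload plus a checksum: information redundancy.
type message struct {
	payload  []byte
	checksum uint32
}

func encode(payload []byte) message {
	return message{payload: payload, checksum: crc32.ChecksumIEEE(payload)}
}

// verify lets the receiver detect a corrupted payload.
func verify(m message) error {
	if crc32.ChecksumIEEE(m.payload) != m.checksum {
		return errors.New("checksum mismatch: payload corrupted")
	}
	return nil
}

// sendWithRetry repeats a failing action, which is time redundancy.
func sendWithRetry(send func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = send(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	msg := encode([]byte("hello"))
	// In a real system the message would cross a network; here we only
	// demonstrate that a valid checksum passes verification on "receive".
	fmt.Println(sendWithRetry(func() error { return verify(msg) }, 3))
}
```

Physical redundancy is what the next section's group of processes provides.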

To conclude, to build dependable and secure distributed systems in the presence of faults, we must use some forms of redundancy.

2.4. A group of processes

This section explains the phrase a group of processes in the definition of distributed consensus above.

A group of processes, possibly spread across several machines, is an implementation of process (or service) replication. It leverages physical redundancy to attain fault tolerance, and thus dependability and security, in distributed systems.

This group of processes has a unique characteristic: any message sent to the group is delivered to all members. This redundancy ensures that if one member fails, others can take over its responsibilities. By creating a group of identical processes, we can mask the failure of individual members. Essentially, we're replacing a single, potentially vulnerable process with a fault-tolerant group of replicated processes.
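
Here is a minimal in-memory Go sketch of that idea: the request is delivered to every member, the caller takes any successful reply, and the failure of individual members is masked (a real implementation would of course do this over the network).

```go
package main

import (
	"errors"
	"fmt"
)

// replica is one member of the process group; an in-memory stand-in
// for a server reachable over the network.
type replica struct {
	id    int
	alive bool
}

func (r replica) handle(req string) (string, error) {
	if !r.alive {
		return "", errors.New("replica down")
	}
	return fmt.Sprintf("replica %d processed %q", r.id, req), nil
}

// callGroup delivers the request to every member of the group and returns
// one successful reply, masking individual failures from the caller.
func callGroup(group []replica, req string) (string, error) {
	var reply string
	ok := false
	for _, r := range group {
		// The request is delivered to all members, faulty or not.
		if resp, err := r.handle(req); err == nil && !ok {
			reply, ok = resp, true
		}
	}
	if !ok {
		return "", errors.New("all replicas failed")
	}
	return reply, nil
}

func main() {
	group := []replica{{1, false}, {2, true}, {3, true}} // replica 1 has crashed
	fmt.Println(callGroup(group, "read x"))
}
```

A real group-communication layer would also handle membership changes and message ordering; the sketch only shows how the group masks an individual failure from the caller.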

Important note: Replicating a process as a group is one of many techniques for achieving dependability and security in distributed systems. We focus on this method because it's related to distributed consensus.

2.5. Agreement of a group of processes

This section explains the phrase a group of processes to agree on a shared state in the definition of distributed consensus above.

State refers to the collective data or configuration that the group of processes maintains. It represents the overall condition or status of the group at a given point in time.

A group of processes to agree on a shared state means that all non-faulty processes in the group eventually reach the same state.

Why must they reach the same state?

When introducing a group of processes, a process $P$ can send a message to a group $Q = \{Q_1, Q_2, \ldots, Q_N\}$ of servers without having to know who they are, how many there are, or where they are, all of which may change from one call to the next. To $P$, the group $Q$ appears to be a single, logical process.

By contradiction, if processes $Q_i$, with $i \in \{1, 2, \ldots, N\}$, don't reach the same state:

  • Disagreement on outputs: If the processes $Q_i$ produce different outputs in response to the same request from process $P$, then $P$ would not know which response to trust, leading to inconsistent system behavior, which violates the system's reliability.

  • Disagreement on inputs: If some of the $Q_i$ process inconsistent inputs, the group $Q$ may fail to respond cohesively, leaving $P$ without a reliable answer, which also violates the system's reliability.

To conclude:

  • To achieve fault tolerance in a dependable and secure distributed system, a single process $Q$ can be replaced by a group of processes $Q_i$ without other processes being aware of the substitution.
  • This requires all processes $Q_i$ to behave as a single logical process, which in turn requires them to maintain a consistent shared state; in other words, to reach consensus among themselves.

3. On reaching consensus


We should now understand what distributed consensus is and when it's necessary in distributed systems. This section discusses when processes $Q_i$ can reach consensus in the group $Q$, given different system characteristics, such as synchronous or asynchronous, faulty or non-faulty.

3.1. Consensus in asynchronous systems

Distributed consensus algorithms aim to achieve agreement among processes within a finite number of steps. The specific solution depends on the system's characteristics, for example:

  1. Synchronous versus asynchronous systems
  2. Communication delay is bounded or not
  3. Message delivery is ordered in real time or not
  4. Message transmission is done through unicasting or multicasting

In this section, we discuss synchronous and asynchronous systems. For the other characteristics, please read reference #2.

3.1.1. Synchronous systems

Synchronous systems have five major behaviors:

  1. Synchronous clocks: the drift rate of every process's local clock from real time has a known upper bound.
  2. Synchronous processes: there is a known lower bound on each process's instruction execution speed.
  3. Synchronous channels: there is a known upper bound on the message propagation delay along each channel.
  4. Synchronous message order: the receiver receives messages in the same order in which the sender sent them.
  5. Synchronous communication: a sender sends a message only when the receiver is ready to receive it, and vice versa.

Based on the five behaviors above, in a synchronous system, process execution speeds and message-delivery times are bounded. This means that when $Q$ shows no more activity, process $P$ can conclude that $Q$ has crashed.
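
The Go sketch below shows why this conclusion is sound only under such bounds: maxDelay is an assumed upper bound on processing plus message delay, which a synchronous system guarantees but a real network does not.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pingQ stands in for a request/response exchange with process Q.
// Here we simulate Q having crashed: it never replies.
func pingQ(reply chan<- struct{}) {}

// detectCrash assumes a known upper bound on processing plus message delay,
// which is exactly what a synchronous system provides.
func detectCrash(maxDelay time.Duration) error {
	reply := make(chan struct{}, 1)
	go pingQ(reply)

	select {
	case <-reply:
		return nil // Q answered in time: it is alive
	case <-time.After(maxDelay):
		// In a synchronous system this conclusion is sound;
		// in an asynchronous one, Q might just be slow.
		return errors.New("no reply within the bound: Q has crashed")
	}
}

func main() {
	fmt.Println(detectCrash(100 * time.Millisecond))
}
```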

However, purely synchronous systems exist only in theory. In reality, physical limitations, network congestion, operating system overhead, and hardware variations make it impossible to guarantee such perfect bounds of process execution speeds and message delivery times.

3.1.2. Asynchronous systems

An asynchronous system is a system where no assumptions about process execution speeds or message-delivery times are made. Consequently, when process $P$ no longer perceives any actions from $Q$, it cannot conclude that $Q$ crashed. $Q$ may just be slow, or its messages may have been lost.

Imagine a group of people $Q_i$ trying to decide where to eat. One person $Q_1$ is about to select a restaurant. If $Q_1$ selects "Italian", they'll all eat Italian. If $Q_1$ selects "Chinese", they'll all eat Chinese.

What if $Q_1$ suddenly leaves (crashes) before selecting anything? The others are left unsure. They might talk among themselves, but they don't know what $Q_1$ would have selected. Eventually, they might get to a point where everyone is ready to select but still needs $Q_1$'s input.

In an asynchronous system, a group of processes may never reach consensus as long as one member postpones its decision indefinitely or does not respond (or has crashed). This makes consensus impossible to guarantee even with a single potential crash.

Great! We've established that perfectly synchronized systems are theoretical and that designing a consensus algorithm to handle all possible failures in a fully asynchronous system is impossible! So, how do we achieve consensus in practice?

3.1.3. Partially synchronous systems

Classifying all distributed systems as purely asynchronous is overly pessimistic and doesn't reflect real-world behavior. Instead, it's more realistic to consider them partially synchronous: most of the time they behave as synchronous systems, but periods of unbounded asynchrony can occur.

In other words, asynchronous behavior is treated as an exception rather than the norm. This assumption is critical for building practical distributed systems that can achieve consensus. From now on, we will assume distributed systems are partially synchronous. The next section discusses consensus in faulty and non-faulty systems.

3.2. Consensus in faulty systems

In a perfect world without any failures, achieving consensus would be trivial. Imagine a single coordinator, like a traffic controller, assigning a unique number to each request or command. Everyone simply follows the order dictated by these numbers. This method is called Lamport's totally ordered multicasting.

However, real-world systems are prone to failure. Machines crash, networks go down, and messages get lost. These potential failures make reaching an agreement among multiple processes significantly more challenging. The difficulty lies in ensuring everyone agrees on the same outcome even when some components might be unreliable. Hence, to reach consensus in dependable and secure distributed systems, we need to develop fault-tolerant distributed consensus algorithms.

A system is said to be $k$-fault tolerant if it can survive faults in $k$ processes and still meet its specifications. If $k$ processes fail silently, then having another $k+1$ processes is enough to provide $k$-fault tolerance (a total of $2k+1$ processes): if $k$ processes simply stop, the answers from the remaining $k+1$ processes can be used.
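
The quorum arithmetic behind this, as used by majority-based consensus algorithms, can be written down in a few lines of Go:

```go
package main

import "fmt"

// replicasNeeded returns the group size required to tolerate k crash
// failures while still keeping a majority of live, agreeing replicas.
func replicasNeeded(k int) int {
	return 2*k + 1
}

// quorum returns the smallest majority of an n-process group.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for k := 1; k <= 3; k++ {
		n := replicasNeeded(k)
		fmt.Printf("tolerate %d crashes: %d replicas, quorum of %d, %d survivors\n",
			k, n, quorum(n), n-k)
	}
}
```

For example, tolerating one crash requires three replicas with a quorum of two, which is why Raft-based clusters such as etcd are typically sized at three or five nodes.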

There are several distributed consensus algorithms designed for faulty systems, such as Paxos and Raft. Since understanding these algorithms takes some time, we will not discuss how they work in this post. These algorithms assume that processes may exhibit crash failures, but not Byzantine failures, nor do processes collude.

To reach consensus in the presence of Byzantine failures, we need Byzantine Fault Tolerance (BFT) algorithms. While extensions of Paxos and Raft exist to provide BFT, they are more complex than the original algorithms.

Despite theoretical importance, Byzantine fault tolerance is not widely used in practice due to its complexity, high overhead, and the relatively low probability of Byzantine failures in many systems. Crash fault tolerance mechanisms are often sufficient for most applications.

3.3. What can we do on reaching consensus?

We can do state replication: replicating a sequence of states across the processes in a group. The term state machine replication is more precise (a minimal sketch follows the list of examples below).

Examples of states that people usually want to replicate:

  • The current leader of the group: Establishing a consistent leader among a group of processes
  • Access rights to critical regions: Implementing distributed locking mechanisms to control access to shared resources
  • Cluster membership: Maintaining a consistent view of cluster members for management purposes
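
Here is a minimal Go sketch of what state machine replication looks like once consensus has already produced the same ordered log on every replica (the consensus layer itself is assumed and not shown): applying that log deterministically leaves every replica in the same state.

```go
package main

import "fmt"

// command is one deterministic operation agreed on by consensus.
type command struct {
	key, value string
}

// stateMachine is the replicated state: here, a simple key-value map.
type stateMachine struct {
	data map[string]string
}

func newStateMachine() *stateMachine {
	return &stateMachine{data: map[string]string{}}
}

// apply executes commands in log order. Because every replica applies
// the same commands in the same order, every replica ends up identical.
func (s *stateMachine) apply(log []command) {
	for _, c := range log {
		s.data[c.key] = c.value
	}
}

func main() {
	// The agreed log, e.g. the output of a Raft-style replicated log.
	log := []command{
		{"leader", "node-2"},
		{"lock:/jobs/42", "held-by-node-7"},
		{"members", "node-1,node-2,node-3"},
	}

	a, b := newStateMachine(), newStateMachine()
	a.apply(log)
	b.apply(log)
	fmt.Println(a.data)
	fmt.Println(b.data)
}
```

Either replica can now answer a question about the current leader or a lock holder with the same answer, which is exactly the behavior the group $Q$ needed in section 2.5.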

Why do we need state machine replication?

There are two primary reasons for replicating data:

  1. To increase reliability: Similar to using a group of processes, data replication introduces redundancy for fault tolerance.

  2. To improve performance: Replication enhances performance, particularly for scaling in terms of either size or geographical distribution.

    • Scaling regarding size: When a single server is overwhelmed due to increasing client requests, replicating the server and distributing the workload improves performance.
    • Scaling regarding a geographical area: Placing a copy of data closer to clients reduces access latency, thus improving perceived performance.

While replication enhances reliability and performance, it introduces the challenge of maintaining consistency across multiple copies of data. Each modification to one copy makes it differ from the others, requiring updates to all copies. This is exactly where distributed consensus algorithms become essential: they ensure that all modifications, and their order, are consistent across all copies; in other words, every process that holds a copy reaches the same state.

The power of distributed consensus lies in its ability to provide reliable agreement in unreliable environments. While state machine replication is the main use case, the underlying need for agreement is much broader. Any distributed system where we need multiple independent components to act in a coordinated fashion, especially in the face of failures, can potentially benefit from a distributed consensus mechanism. In short, it's about moving beyond just making copies of data and thinking about coordination, decision-making, and ensuring consistency of action in a decentralized setting.

To conclude:

  • In asynchronous distributed systems, it's impossible to design a consensus algorithm that will tolerate the crash failure of a single process.
  • In practice, we assume distributed systems are partially synchronous systems, so that a group of processes can reach consensus.
  • Paxos and Raft are commonly used in practice as distributed consensus algorithms.
  • Distributed consensus algorithms are usually used for state machine replication.

4. Distributed consensus vs. distributed transactions


This section goes deeper into distributed consensus by exploring its relationship with distributed transactions. Many people find these concepts confusing, so we will clarify the distinctions and connections between them within distributed systems.

A transaction is a sequence of operations that must be carried out atomically, which means either all operations must be executed or none of them will be executed at all.

Commonly, a transaction complies with the ACID properties (a small sketch illustrating atomicity follows this list):

  • Atomicity: Either all operations are completed or none of them is executed.

  • Consistency: Regardless of what happens, a transaction must not violate the system's invariants. In other words, all transactions must follow internal rules, such as foreign key constraints, unique constraints, business constraints, and other invariants.

  • Isolation: If multiple transactions run concurrently, it must appear as if they were executed in some sequential order. The updates of a transaction must not be visible to another transaction until it commits. In short, concurrent transactions do not interfere with each other.

  • Durability: Once a transaction commits, its effect will be permanent.
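
To make atomicity concrete, here is a small in-memory Go sketch (not a real database): operations run against a copy of the state, and the result is only published if every step succeeds, so a failed debit leaves both balances untouched.

```go
package main

import (
	"errors"
	"fmt"
)

type accounts map[string]int

// operation is one step of the transaction, e.g. a debit or a credit.
type operation func(accounts) error

// runTransaction applies all operations or none of them (atomicity):
// it works on a copy and only publishes the result if every step succeeds.
func runTransaction(state accounts, ops []operation) (accounts, error) {
	working := accounts{}
	for k, v := range state {
		working[k] = v
	}
	for _, op := range ops {
		if err := op(working); err != nil {
			return state, fmt.Errorf("transaction aborted: %w", err)
		}
	}
	return working, nil
}

func debit(name string, amount int) operation {
	return func(a accounts) error {
		if a[name] < amount {
			return errors.New("insufficient funds")
		}
		a[name] -= amount
		return nil
	}
}

func credit(name string, amount int) operation {
	return func(a accounts) error { a[name] += amount; return nil }
}

func main() {
	state := accounts{"alice": 100, "bob": 20}
	// Transfer 150 from Alice to Bob: the debit fails, so nothing is applied.
	newState, err := runTransaction(state, []operation{debit("alice", 150), credit("bob", 150)})
	fmt.Println(newState, err)
}
```

A real database additionally provides isolation and durability, which this in-memory toy ignores.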

A distributed transaction is a transaction that involves objects distributed over a set of distinct processes.

The relationship between distributed consensus and distributed transactions is not causal; it's not about one containing the other. Distributed consensus can be used to implement distributed transactions, and distributed transactions can be used to implement distributed consensus.

One example is that a traditional two-phase commit (2PC) protocol can be made more robust by using a consensus algorithm to manage the commit/abort decision. Instead of a single coordinator that can become a single point of failure, multiple processes can participate in a consensus protocol to decide whether to commit or abort the transaction.

Another example is that distributed transactions can simulate consensus by ensuring that operations across multiple processes either succeed entirely or fail completely. For example, imagine multiple servers updating their databases with a new product price. If any server fails, the entire update is rolled back, ensuring all servers maintain the same price. This mimics consensus, where all processes must agree before the change is applied.

The preceding examples merely illustrate that there's no inherent causal relationship between distributed consensus and distributed transactions. In reality, I'm not sure whether these examples have ever been implemented.

What is the relationship between consensus algorithms and two-phase commit?

Two-phase commit (2PC) can be used to implement atomicity in distributed transactions. 2PC and consensus algorithms solve different problems.

2PC causes different processes to do different things (e.g. Alice's bank debits Alice, Bob's bank credits Bob), and causes them all to do their thing, or none of them.

Consensus algorithms cause a majority of the processes to all do the same thing (so they remain replicas). A consensus algorithm waits only for a majority of replicas.
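
A hedged Go sketch of the coordinator side of 2PC as described above (the participant interface and the toy bank type are invented for illustration): the coordinator collects prepare votes and only orders a commit when every participant voted yes.

```go
package main

import "fmt"

// participant is one party in the distributed transaction, e.g. a bank.
// In practice every call below would be a network round trip with a timeout.
type participant interface {
	Prepare() bool // phase 1: vote yes/no on being able to commit
	Commit()       // phase 2: make the changes permanent
	Abort()        // phase 2: roll the changes back
}

// twoPhaseCommit reports whether the transaction committed at every participant.
func twoPhaseCommit(parts []participant) bool {
	// Phase 1: collect votes.
	for _, p := range parts {
		if !p.Prepare() {
			// A single "no" vote aborts the whole transaction.
			for _, q := range parts {
				q.Abort()
			}
			return false
		}
	}
	// Phase 2: everyone voted yes, so everyone commits.
	for _, p := range parts {
		p.Commit()
	}
	return true
}

// bank is a toy participant used only for this sketch.
type bank struct {
	name   string
	canPay bool
}

func (b bank) Prepare() bool { return b.canPay }
func (b bank) Commit()       { fmt.Println(b.name, "committed") }
func (b bank) Abort()        { fmt.Println(b.name, "aborted") }

func main() {
	parts := []participant{bank{"alice-bank", true}, bank{"bob-bank", true}}
	fmt.Println("committed:", twoPhaseCommit(parts))
}
```

Note how this differs from consensus: here every participant must agree, whereas a consensus algorithm waits only for a majority of replicas.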

The figure below demonstrates the coordination of 2PC between two banks where each bank runs consensus algorithms internally. The transaction coordinator can be managed by one of the two banks, or by a 3rd-party service. Note that in reality, 2PC may not be a practical solution for banking transactions.

[Figure: 2PC coordination between Alice's bank and Bob's bank, with a transaction coordinator in between and each bank running a consensus group internally]

5. Distributed consensus applications

This section discusses some applications of distributed consensus in modern distributed systems.

etcd uses Raft for:

  • Data replication: Raft is used to ensure consistency of data replicated across etcd nodes.
  • Leader election: Raft is used to elect a leader among etcd nodes, to ensure the system always has a single leader responsible for handling client requests and coordinating updates on data.

TiDB (via its storage layer, TiKV) uses Raft for:

  • Data replication: Raft is used to ensure consistency of data replicated across TiKV nodes.

Google Spanner uses Paxos for:

  • Data replication: A variation of Paxos is used to ensure consistency of data replicated across Spanner nodes.
  • Note: Similar to TiDB, Spanner has an internal service called TrueTime which provides global timestamps for distributed transactions using atomic clocks. Actually, TiDB's design is inspired by Spanner.

Google Chubby uses Paxos for:

  • Data replication: A variation of Paxos is used to ensure consistency of data (primarily configuration and lock state) replicated across Chubby replicas.
  • Leader election: Paxos is used to elect a leader among Chubby replicas, responsible for handling client requests, managing locks, and coordinating updates to the Chubby file system.

The Qdrant database uses Raft for:

  • Cluster topology: Raft is used to ensure a consistent view of the cluster topology on every Qdrant node. For example, which nodes are part of the cluster, their roles, and their status.
  • Note: Qdrant provides a data replication feature, but Raft is not used for this.
    • To control data consistency, Qdrant provides some configurations, such as write_consistency_factor, which represents the number of replicas that must acknowledge a write operation before responding to the client (a generic sketch of this idea follows this list). This is the trade-off Qdrant made to prioritize availability and maximum throughput over consistency.
    • Distributed consensus algorithms are designed for strong consistency and fault tolerance, which comes with some performance overhead. For vector search operations, which require high throughput and low latency, this overhead can be significant.
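
The sketch below illustrates that write-acknowledgment idea generically in Go; it is not Qdrant's implementation, just the shape of waiting for write_consistency_factor acknowledgments out of all replicas before answering the client.

```go
package main

import (
	"errors"
	"fmt"
)

// writeToReplica stands in for replicating a record to one node over the
// network and reports whether that node acknowledged the write.
func writeToReplica(id int, record string, acks chan<- bool) {
	acks <- true // assume this replica accepted the write
}

// write acknowledges the client once writeConsistencyFactor replicas have
// confirmed, trading stronger consistency for latency when the factor is low.
func write(record string, replicas, writeConsistencyFactor int) error {
	acks := make(chan bool, replicas)
	for i := 0; i < replicas; i++ {
		go writeToReplica(i, record, acks)
	}
	confirmed := 0
	for i := 0; i < replicas; i++ {
		if <-acks {
			confirmed++
		}
		if confirmed >= writeConsistencyFactor {
			return nil // enough replicas confirmed; respond to the client now
		}
	}
	return errors.New("not enough replicas acknowledged the write")
}

func main() {
	// 3 replicas, acknowledge the client after 2 confirmations.
	fmt.Println(write("vector-123", 3, 2))
}
```

With a factor lower than the number of replicas, a write can succeed even while some replicas are lagging, which is precisely the availability-over-consistency trade-off described above.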

Besides Paxos and Raft, there are other ways to reach consensus in modern applications.

6. My experience with distributed consensus

I recently worked on setting up disaster recovery for a project with data centers spread across various locations: within a country, across countries in the same region, and across continents.

Disaster recovery involves replicating critical services, such as MySQL, Redis, Elasticsearch, Kafka, etc., to multiple locations. If one location suffers a disaster, data remains available elsewhere. The decision of which services to replicate depends on the specific business needs.

Replicating data across data centers is similar to a group of processes communicating and coordinating as we discussed before. We can use consensus algorithms to make sure data is synchronously replicated across all locations. However, consensus algorithms can be slow and expensive due to several factors. Firstly, they often involve at least one round-trip time between nodes across data centers, leading to increased latency. Secondly, maintaining strong consistency across a large number of nodes can be computationally demanding, requiring significant processing power and resources. Finally, implementing and maintaining a robust consensus system can be complex, requiring specialized expertise and potentially increasing operational costs.

Another approach is to asynchronously replicate data to other locations after it's been updated. This is faster, but there's a risk of losing data if the main server fails before the copy is complete. My team's approach was to design a Data Fix solution for each service to tolerate up to $X$ minutes of data loss, where "$X$ minutes" is the recovery point objective (RPO). Formally, RPO is the maximum amount of data, measured in time, that can be lost after a recovery from a disaster before the loss exceeds what is acceptable to an organization. $X = 5$ minutes is a common value.
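
As a small operational illustration (the function and constant names are hypothetical), a monitor can compare the asynchronous replication lag against the agreed $X = 5$ minute RPO:

```go
package main

import (
	"fmt"
	"time"
)

// rpo is the maximum tolerable data loss, measured in time.
const rpo = 5 * time.Minute

// checkRPO compares how far the secondary site is behind the primary
// (the data that would be lost if the primary failed right now) against the RPO.
func checkRPO(lastReplicatedAt, now time.Time) error {
	lag := now.Sub(lastReplicatedAt)
	if lag > rpo {
		return fmt.Errorf("replication lag %v exceeds RPO of %v", lag, rpo)
	}
	return nil
}

func main() {
	now := time.Now()
	fmt.Println(checkRPO(now.Add(-7*time.Minute), now)) // violates the 5-minute RPO
	fmt.Println(checkRPO(now.Add(-1*time.Minute), now)) // within the RPO
}
```

If this check fails at the moment of a disaster, the potential data loss exceeds what the Data Fix solution was designed to tolerate.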

Asynchronous replication can significantly reduce latency compared to synchronous replication, where changes must be acknowledged by the secondary server before being committed on the primary. This makes asynchronous replication suitable for applications prioritizing availability and speed, even at the risk of minor data loss. For critical systems requiring zero tolerance for inconsistencies such as financial transactions, we should still prioritize synchronous replication where data is acknowledged by the secondary site before being committed on the primary.

Ultimately, I believe the most effective approach often involves a combination of techniques. Reliable but slow consensus algorithms can be used alongside faster but potentially less reliable methods to create a system that is both fast and resilient.

7. Conclusion

This post started by reasoning through why we need distributed consensus, together with definitions and classifications of the important terminology of dependable and secure distributed systems.

We then discussed when a group of processes can reach consensus in synchronous and asynchronous systems and what we can do when they reach consensus. However, we haven't yet discussed how we can reach consensus under other conditions, such as bounds on communication delay, the order of message delivery, and the method of message transmission.

This post didn't discuss how Paxos and Raft work, nor other ways to make our distributed systems fault-tolerant, for instance, via reliable client-server communication and reliable group communication. They all deserve dedicated posts.

For optimal comprehension, I recommend reading all the documents in the References section below. That may take weeks or months.

8. Acknowledgments and references

Special thanks to Dr. Cuong Bui for patiently explaining to me the naive causality between distributed consensus and the rest of the distributed universe. And as always, thanks Xinran for face-punches of knowledge through unstoppable questions. My face is now accepting donations for reconstructive surgery.

Here is the list of references:

  1. Distributed Systems: An Algorithmic Approach - Sukumar Ghosh (2nd Edition, 2014)
  2. Distributed Systems - Maarten van Steen (4th Edition, 2024)
  3. Basic Concepts and Taxonomy of Dependable and Secure Computing - Avizienis A., Laprie J.-C., Randell B., and Landwehr C. (2004)
  4. 6.5840: Distributed Systems - MIT (2024)