Problems in Distributed Systems

Last updated on Jan 15, 2023

failure

Partial failure

In distributed systems, services communicate over asynchronous networks, which means:

A service must wait for a response from another service after sending a request
Transmission time can vary due to network congestion, dynamic packet routing, and network connection failures
Data loss is possible due to packet corruption in wireless networks or an unavailable receiving service
Services are not synchronized because they do not have identical internal clocks

The fundamental issue with the situations described above is determining whether and when a response is received, which is referred to as handling partial failures.

Consider two services, A and B. If A cannot send a request to B, it is straightforward to handle. However, if A can send the request to B, several scenarios can occur:

B receives the request but is unable to process it, resulting in no response being sent
B takes an excessive amount of time to process the request and the request times out at A's side
B returns a response, but A is unable to receive it

When a request times out on A's side, it may be resent, causing duplication of the request. However, in some cases, such as withdrawing $1000 from a bank account, duplicate requests must not be applied. To prevent this, a system property called idempotence ensures that duplicate requests are only executed once.

Implement idempotence

Read requests are naturally idempotent since they do not make any state changes, but write requests are not. Idempotence can be implemented for write requests as follows:

Include a unique idempotency key, such as an UUID, in the request
When the request is received by the server, it checks if the key is already stored in the database
If the key is not present, the server processes the request and stores the key in the database, both of which are done in a transaction to ensure consistency
If the key is already present, the server simply returns the response of the previous request, avoiding any duplicate processing

To implement the database for storing idempotency keys, we can create a separate table in the database for storing application data. The database could be a key-value store to support low-latency lookups. Note that there is no need to keep idempotency keys forever.

Types of delivery guarantees

The property of idempotency guarantees that each message is delivered exactly once, a.k.a exactly-once delivery, which is one of the three delivery guarantees shown in the table below. When considering reliability and performance, we must weigh the trade-offs among the delivery guarantees.

	At-most-once	At-least-once	Exactly-once
Definition	A message is sent <= 1 times	A message is sent >= 1 times	A message is sent 1 times
Transmission ack	Fire and forget	Retransmission ack	Retransmission ack, no duplicates
Reliability	Unreliable	Moderate	Reliable
Performance	Fast	Moderate	Slow
Example	UDP messages (network layer)	TCP messages (network layer)	Database transactions (application layer)

Consensus

election

Consensus is a fundamental problem in distributed systems where a group of nodes need to reach an agreement on a single value or decision. The problem arises due to the possibility of node failures or network partitions, which can result in different nodes having different views of the system state. Consensus is essential for many distributed systems applications such as database replication, leader election, and atomic transactions.

There are several consensus algorithms designed to solve the consensus problem, including Paxos, Raft, and Zab. These algorithms differ in their design and complexity, but they share the same goal of ensuring agreement among nodes in the presence of failures. These algorithms typically involve a leader node that proposes a value, and the other nodes vote on whether to accept or reject the value.

Achieving consensus in a distributed system is not an easy task, and it comes with trade-offs. For instance, consensus algorithms often sacrifice system performance and availability for safety. In addition, consensus algorithms can be challenging to implement and verify, which can lead to bugs and other issues that can affect the system's correctness.

Time

To determine the order of events in distributed systems, we can compare their timestamps. However, clocks on different nodes can vary due to temperature or voltage changes. To address this, time services are available to provide accurate time sources. The most commonly used one is the Network Time Protocol (NTP).

One issue with NTP synchronization is the periodic resetting of clocks, which can affect the local node time. This can result in inaccurate measurements of event response times.

It is recommended to use NTP services to periodically resynchronize clocks, with intervals ranging from 1 hour to 1 day. For applications that require precise ordering of events across different nodes, it is better to use Chrony (opens in a new tab), which has higher accuracy and scalability compared to NTP. Another option for AWS customers is Amazon Time Sync Service (opens in a new tab). However, applications should not depend solely on timestamps from different nodes to determine the actual event order.

Keynotes

Partial failure

Services in distributed systems communicate over asynchronous networks
Handling partial failures is the core problem
Idempotence ensures duplicated requests are only applied once

Consensus

Consensus algorithms prioritize safety over performance and availability
Implementation and verification of consensus algorithms can be challenging and may introduce bugs

Time

Clocks on different nodes are different
Applications should not rely on timestamps to calculate the actual order of events

Distributed Systems Architecture Application Services