ML Skills
Scalable Distributed Systems
Problems In Distributed Systems
Last updated on Jan 15, 2023

failure

Partial failure

In distributed systems, services communicate over asynchronous networks, which means:

  • A service must wait for a response from another service after sending a request
  • Transmission time can vary due to network congestion, dynamic packet routing, and network connection failures
  • Data loss is possible due to packet corruption in wireless networks or an unavailable receiving service
  • Services are not synchronized because they do not have identical internal clocks

The fundamental issue with the situations described above is determining whether and when a response is received, which is referred to as handling partial failures.

Consider two services, A and B. If A cannot send a request to B, it is straightforward to handle. However, if A can send the request to B, several scenarios can occur:

  • B receives the request but is unable to process it, resulting in no response being sent
  • B takes an excessive amount of time to process the request and the request times out at A's side
  • B returns a response, but A is unable to receive it

When a request times out on A's side, it may be resent, causing duplication of the request. However, in some cases, such as withdrawing $1000 from a bank account, duplicate requests must not be applied. To prevent this, a system property called idempotence ensures that duplicate requests are only executed once.

Implement idempotence

Read requests are naturally idempotent since they do not make any state changes, but write requests are not. Idempotence can be implemented for write requests as follows:

  • Include a unique idempotency key, such as an UUID, in the request
  • When the request is received by the server, it checks if the key is already stored in the database
  • If the key is not present, the server processes the request and stores the key in the database, both of which are done in a transaction to ensure consistency
  • If the key is already present, the server simply returns the response of the previous request, avoiding any duplicate processing

To implement the database for storing idempotency keys, we can create a separate table in the database for storing application data. The database could be a key-value store to support low-latency lookups. Note that there is no need to keep idempotency keys forever.

Types of delivery guarantees

The property of idempotency guarantees that each message is delivered exactly once, a.k.a exactly-once delivery, which is one of the three delivery guarantees shown in the table below. When considering reliability and performance, we must weigh the trade-offs among the delivery guarantees.

At-most-onceAt-least-onceExactly-once
DefinitionA message is sent <= 1 timesA message is sent >= 1 timesA message is sent 1 times
Transmission ackFire and forgetRetransmission ackRetransmission ack, no duplicates
ReliabilityUnreliableModerateReliable
PerformanceFastModerateSlow
ExampleUDP messages (network layer)TCP messages (network layer)Database transactions (application layer)

Consensus

election

Consensus is a fundamental problem in distributed systems where a group of nodes need to reach an agreement on a single value or decision. The problem arises due to the possibility of node failures or network partitions, which can result in different nodes having different views of the system state. Consensus is essential for many distributed systems applications such as database replication, leader election, and atomic transactions.

There are several consensus algorithms designed to solve the consensus problem, including Paxos, Raft, and Zab. These algorithms differ in their design and complexity, but they share the same goal of ensuring agreement among nodes in the presence of failures. These algorithms typically involve a leader node that proposes a value, and the other nodes vote on whether to accept or reject the value.

Achieving consensus in a distributed system is not an easy task, and it comes with trade-offs. For instance, consensus algorithms often sacrifice system performance and availability for safety. In addition, consensus algorithms can be challenging to implement and verify, which can lead to bugs and other issues that can affect the system's correctness.

Time

To determine the order of events in distributed systems, we can compare their timestamps. However, clocks on different nodes can vary due to temperature or voltage changes. To address this, time services are available to provide accurate time sources. The most commonly used one is the Network Time Protocol (NTP).

One issue with NTP synchronization is the periodic resetting of clocks, which can affect the local node time. This can result in inaccurate measurements of event response times.

It is recommended to use NTP services to periodically resynchronize clocks, with intervals ranging from 1 hour to 1 day. For applications that require precise ordering of events across different nodes, it is better to use Chrony (opens in a new tab), which has higher accuracy and scalability compared to NTP. Another option for AWS customers is Amazon Time Sync Service (opens in a new tab). However, applications should not depend solely on timestamps from different nodes to determine the actual event order.

Keynotes

Partial failure

  • Services in distributed systems communicate over asynchronous networks
  • Handling partial failures is the core problem
  • Idempotence ensures duplicated requests are only applied once

Consensus

  • Consensus algorithms prioritize safety over performance and availability
  • Implementation and verification of consensus algorithms can be challenging and may introduce bugs

Time

  • Clocks on different nodes are different
  • Applications should not rely on timestamps to calculate the actual order of events