ML Skills
Scalable Distributed Systems
Scalable Databases
Last updated on Feb 19, 2023


Scaling relational databases

Relational databases work well for small to moderate data volumes on a single machine. As data grows, however, there are several approaches to scaling them.

Scaling up

One approach to scaling relational databases is to deploy them on more robust hardware, but this has certain drawbacks.

| Drawback | Description |
| --- | --- |
| Cost | Hardware expenses grow steeply as more powerful machines are required |
| Availability | Relying on a single machine creates a single point of failure |
| Migration | As data continues to grow, further migrations to even more powerful hardware become necessary |
| Distant clients | Clients located far from the server experience increased latency |

Scaling out: Read replicas

A common approach to scaling out is to use read replicas: the main database node serves as the primary node, and the read replicas act as secondary nodes. Writes are directed to the primary node, and changes are asynchronously replicated to the secondary nodes, which can be located in different geographical regions. This architecture is often referred to as read replication or a primary/secondary database architecture.

[Figure: distribution through read replication]

This approach is effective for applications with a high read load. However, there may be a slight delay of a few milliseconds between when data is written to the primary node and when it is replicated to the secondary nodes. As a result, clients may potentially read outdated data from the secondary nodes.
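The write-to-primary, read-from-replica flow and the stale-read window can be illustrated with a toy in-memory sketch (all names here are illustrative, not a real database API):

```python
import random

class ReplicatedStore:
    """Toy primary/replica store illustrating read replication.

    Writes go to the primary node; replicas catch up asynchronously,
    so reads may observe stale data until replication completes.
    """

    def __init__(self, num_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]

    def write(self, key, value):
        # Writes always hit the primary node.
        self.primary[key] = value

    def replicate(self):
        # In a real system this happens asynchronously, a few
        # milliseconds (or more) after the write.
        for replica in self.replicas:
            replica.update(self.primary)

    def read(self, key):
        # Reads are load-balanced across secondary nodes.
        replica = random.choice(self.replicas)
        return replica.get(key)
```

A read issued between `write` and `replicate` returns the old value (here `None`), which is exactly the replication-lag effect described above.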

Scaling out: Partitioning data

Splitting or partitioning data is another approach for distributing a database across multiple independent disk partitions and database engines. There are two main strategies for this: horizontal partitioning and vertical partitioning.

| Strategy | Description | Example |
| --- | --- | --- |
| Horizontal partitioning | Dividing rows into multiple partitions based on certain values | Hashing the primary key to split the data |
| Vertical partitioning | Dividing a table by columns | Storing read-only columns and the remaining columns in separate tables |

A database engine can support partitioning tables on disk or across nodes to achieve horizontal scalability. However, this technique also has some challenges. For example, when a request needs to join data from multiple partitioned tables, it may require high-bandwidth network traffic and joining coordination, which could potentially impact performance.
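The hash-the-primary-key example of horizontal partitioning can be sketched in a few lines. A stable hash (rather than Python's process-randomized `hash()`) is used so every node maps the same key to the same partition:

```python
import hashlib

def shard_for(primary_key: str, num_shards: int) -> int:
    """Map a row's primary key to one of `num_shards` partitions.

    Uses a stable cryptographic hash so the mapping is identical
    across processes and machines.
    """
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Note the trade-off mentioned above: a query that touches keys on different shards (for example, a join) must contact several nodes and merge the results over the network.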

Scaling non-relational databases - NoSQL

Non-relational databases, also known as NoSQL databases, are often characterized by their schemaless nature, which includes the following properties:

  • Simplified data models that can be easily changed
  • Limited or no support for joins in proprietary query languages
  • Ability to horizontally scale on low-cost hardware

When designing relational databases, the focus is on modeling the problem domain, and SQL and joins are used to reference the data. However, when designing with NoSQL databases, the focus is on modeling the solution domain. This involves designing common data access patterns that the application supports. When reading data, the data must be pre-joined to serve requests, which can be thought of as creating a table for each use case instead of for each object. This trade-off between flexibility and efficiency is inherent in the solution domain modeling approach.
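The "table per use case instead of per object" idea can be made concrete with a toy comparison (the data and key names are illustrative):

```python
# Problem-domain (relational) modeling: normalized tables,
# combined with a join at query time.
users  = {1: {"name": "Alice"}}
orders = [{"user_id": 1, "item": "book"},
          {"user_id": 1, "item": "pen"}]

def orders_for_user(user_id):
    # The join: combine rows from two tables on every request.
    name = users[user_id]["name"]
    items = [o["item"] for o in orders if o["user_id"] == user_id]
    return {"name": name, "items": items}

# Solution-domain (NoSQL) modeling: one pre-joined document per
# access pattern -- the "orders for a user" page reads a single key.
user_orders_doc = {
    "user:1:orders": {"name": "Alice", "items": ["book", "pen"]}
}
```

The pre-joined document serves the read in one lookup, but every write to `users` or `orders` must now also update the duplicated document: the flexibility/efficiency trade-off described above.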

NoSQL data models

There are 4 main NoSQL data models.

| Data model | Description | Example |
| --- | --- | --- |
| Key-value (KV) | Represents a hash map where values can be strings, JSON, images, etc. | Redis, Oracle NoSQL |
| Document | Similar to the KV model and relational tables, but without a requirement for a uniform document format | MongoDB, Couchbase |
| Wide column | Represents a 2-dimensional hash map that allows sorting by columns, with flexibility in column usage | Apache Cassandra, Google Bigtable |
| Graph | Designed for highly connected data with relationships between database objects | Neo4j, Amazon Neptune |

Data distribution

NoSQL databases (with the exception of graph databases, which are not discussed here) are designed for horizontal scaling across distributed nodes, which eliminates single points of failure and improves performance, scalability, and availability.

Partitioning or sharding is a technique used to evenly distribute data objects across different nodes. Sharding involves using a shard or partition key that maps the object to a specific partition on a node.

[Figure: data partitioning]

There are 3 main techniques for selecting the partition in sharding.

| Sharding technique | Description |
| --- | --- |
| Hash key | Apply a hash function to the shard key, using modulus or consistent hashing |
| Value-based | Use the value of the shard key directly, such as country |
| Range-based | Define the partition based on the range of values of the shard key |
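The three techniques differ only in how the shard key is turned into a partition. A minimal sketch of each (shard names, country mapping, and ID ranges are made up for illustration):

```python
import hashlib

def hash_shard(key: str, n: int) -> int:
    # Hash key: stable hash of the shard key, modulo partition count.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % n

def value_shard(country: str) -> str:
    # Value-based: the shard key's value picks the partition directly.
    mapping = {"US": "shard-us", "DE": "shard-eu", "FR": "shard-eu"}
    return mapping.get(country, "shard-default")

def range_shard(user_id: int) -> str:
    # Range-based: contiguous key ranges map to partitions.
    if user_id < 1_000_000:
        return "shard-0"
    elif user_id < 2_000_000:
        return "shard-1"
    return "shard-2"
```

Hashing spreads load evenly but destroys key ordering; value- and range-based schemes keep related rows together at the cost of possible hot partitions.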

When data is distributed across nodes, if a partition becomes unavailable, the corresponding data chunk cannot be accessed. This can be addressed by replicating data in each partition to multiple replicas hosted on different nodes, typically more than 2 replicas. However, update requests would require all replicas to be updated.

There are 2 fundamental architectures for managing replication in distributed databases.

| Architecture | Description |
| --- | --- |
| Leader-follower | The leader replica handles write operations, while the followers are read-only replicas. Suitable for read-heavy applications |
| Leaderless | Any replica can handle both write and read operations. The replica receiving a request must coordinate updates with the other replicas. Suitable for write-heavy applications |

Keeping replicas consistent in a distributed system is challenging: when and how should updates propagate between replicas? If all replicas always hold the same value, the system provides strong consistency, where clients must wait until all replicas are updated before a write is considered successful. If clients can tolerate a delay before updates become visible, the database can instead provide eventual consistency.
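One standard way databases position themselves between these two extremes is quorum replication: with N replicas, a write waits for W acknowledgements and a read consults R replicas; choosing W + R > N guarantees that the read set overlaps the latest write. This is not covered in the text above, so the sketch below is a toy illustration of that technique, not any specific database's implementation:

```python
class QuorumStore:
    """Toy quorum replication over N in-memory replicas.

    A write succeeds after W replicas acknowledge it; a read queries
    R replicas and keeps the highest-versioned answer. W + R > N
    ensures the read quorum overlaps the write quorum.
    """

    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "quorum overlap requires W + R > N"
        self.replicas = [dict() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0

    def write(self, key, value):
        # Only the first W replicas acknowledge; the rest lag behind.
        self.version += 1
        for replica in self.replicas[:self.w]:
            replica[key] = (self.version, value)

    def read(self, key):
        # Query the last R replicas; the overlap with the write quorum
        # guarantees at least one of them has the newest version.
        answers = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(answers)[1] if answers else None
```

With W = N the scheme degenerates to the strong-consistency case above (every replica updated before success); smaller W trades consistency guarantees for lower write latency.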

The CAP theorem

The CAP theorem states that a distributed data store can deliver only 2 out of the following 3 guarantees.

[Figure: CAP theorem. Source: Wikipedia]

| Guarantee | Description |
| --- | --- |
| Consistency | Every read obtains either the most recent write or an error |
| Availability | Every request receives a non-error response, without the guarantee that it contains the most recent write |
| Partition tolerance | The system remains operational despite arbitrary messages being dropped or delayed by the network between nodes |

For instance, when an update occurs and the network between replicas fails, the database can either

  • Return an error to maintain replica consistency (CP), which results in unavailability
  • Or apply the update to the reachable subset of replicas (AP), allowing for inconsistency
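The CP/AP choice in the two bullets above boils down to one branch in the write path. A toy sketch (the class and mode names are illustrative, not a real database interface):

```python
class PartitionedReplica:
    """Toy illustration of the CP vs AP choice during a network partition."""

    def __init__(self, mode="CP"):
        self.mode = mode           # "CP" or "AP"
        self.local = {}
        self.peer_reachable = True  # becomes False during a partition

    def write(self, key, value):
        if not self.peer_reachable:
            if self.mode == "CP":
                # Consistency over availability: refuse the write
                # rather than let replicas diverge.
                raise RuntimeError("unavailable: cannot reach peer replica")
            # AP: accept the write locally; replicas now diverge and
            # must be reconciled once the partition heals.
        self.local[key] = value
```

During a partition the CP replica returns an error (unavailable but consistent), while the AP replica keeps accepting writes (available but temporarily inconsistent).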

You can read more about the CAP theorem here.

Keynotes

  • Scaling up: involves increasing the resources of a single database instance to improve its performance
  • Scaling out using read replicas: involves creating multiple copies of the database and distributing read requests across them to improve query response times
  • Scaling out using partitioning data: involves dividing the database into smaller partitions that can be distributed across multiple servers to improve write performance and increase storage capacity
  • NoSQL data models: are designed for horizontal scaling and can be more efficient than relational databases for the access patterns they are modeled around
  • CAP theorem: states that a distributed system can only guarantee two out of three properties: consistency, availability, and partition tolerance