Scalable Databases - AIEngineer.net

Last updated on Feb 19, 2023

databases

Scaling relational databases

Relational databases are typically suitable for handling relatively small amounts of data. However, as data grows, there are various approaches to scaling them.

Scaling up

One approach to scaling relational databases is to deploy them on more robust hardware, but this has certain drawbacks.

Drawback	Description
Cost	The expenses associated with hardware increase exponentially as more powerful hardware is required
Availability	Relying on a single machine creates a single point of failure
Migration	As data continues to grow, migrating to more powerful hardware becomes necessary
Distant serving	Serving services that are far from the server can lead to increased latency

Scaling out: Read replicas

A common approach to scaling out is to utilize read replicas, where the main database node serves as the primary node and the read replicas act as secondary nodes. Writes are directed to the primary node, and changes are asynchronously replicated to the secondary nodes, which could be located in different geographical locations. This type of architecture is often referred to as read replication or primary/secondary-based database architectures.

distribution-through-read-replication

This approach is effective for applications with a high read load. However, there may be a slight delay of a few milliseconds between when data is written to the primary node and when it is replicated to the secondary nodes. As a result, clients may potentially read outdated data from the secondary nodes.

Scaling out: Partitioning data

Splitting or partitioning data is another approach for distributing a database across multiple independent disk partitions and database engines. There are two main strategies for this: horizontal partitioning and vertical partitioning.

Strategy	Description	Example
Horizontal partitioning	Dividing rows into multiple partitions based on certain values	Hashing the primary key to split the data
Vertical partitioning	Dividing a table by columns	Storing read-only columns and others in separate tables

A database engine can support partitioning tables on disk or across nodes to achieve horizontal scalability. However, this technique also has some challenges. For example, when a request needs to join data from multiple partitioned tables, it may require high-bandwidth network traffic and joining coordination, which could potentially impact performance.

Scaling non-relational databases - NoSQL

Non-relational databases, also known as NoSQL databases, are often characterized by their schemaless nature, which includes the following properties:

Simplified data models that can be easily changed
Limited or no support for joins in proprietary query languages
Ability to horizontally scale on low-cost hardware

When designing relational databases, the focus is on modeling the problem domain, and SQL and joins are used to reference the data. However, when designing with NoSQL databases, the focus is on modeling the solution domain. This involves designing common data access patterns that the application supports. When reading data, the data must be pre-joined to serve requests, which can be thought of as creating a table for each use case instead of for each object. This trade-off between flexibility and efficiency is inherent in the solution domain modeling approach.

NoSQL data models

There are 4 main NoSQL data models.

Data model	Description	Example
Key value (KV)	Represents a hash map where values can be strings, JSON, images, etc.	Redis, Oracle NoSQL
Document	Similar to KV model and relational tables, but without a requirement for uniform document format	MongoDB, Couchbase
Wide column	Represents a 2-dimensional hash map that allows sorting by columns, with flexibility in column usage	Apache Cassandra, Google Bigtable
Graph	Designed for highly connected data with relationships between database objects	Neo4j, Amazon Neptune

Data distribution

NoSQL databases are specifically designed for horizontal scaling across distributed nodes, which eliminates single points of failure and improves performance, scalability, and availability (except for graph databases, which will not be discussed here).

Partitioning or sharding is a technique used to evenly distribute data objects across different nodes. Sharding involves using a shard or partition key that maps the object to a specific partition on a node.

data-partitioning

There are 3 main techniques for selecting the partition in sharding.

Sharding technique	Description
Hash key	Apply a hash function to the shard key using modulus or consistent hashing
Value-based	Use the value of the shard key, such as `country`
Range-based	Define the partition based on the range of values of the shard key

When data is distributed across nodes, if a partition becomes unavailable, the corresponding data chunk cannot be accessed. This can be addressed by replicating data in each partition to multiple replicas hosted on different nodes, typically more than 2 replicas. However, update requests would require all replicas to be updated.

There are 2 fundamental architectures for managing replication in distributed databases.

Architecture	Description
Leader-follower	In this architecture, the leader replica handles write operations, while the followers are read-only replicas. It is suitable for read-heavy applications
Leaderless	In this architecture, any replica can handle both write and read operations. Requested replicas must coordinate updates on other replicas. It is suitable for write-heavy applications

The solution for replica consistency in distributed systems presents challenges. For instance, determining when and how updates should be sent between replicas to ensure data consistency. If all replicas always have the same value, it results in strong consistency, where clients must wait until all replicas are updated before an update is considered successful. On the other hand, if clients allow for a delay in data updates, the database can provide eventual consistency.

The CAP theorem

The CAP theorem dictates that a distributed data store can only deliver 2 out of 3 guarantees.

cap-theorem Source: Wikipedia (opens in a new tab)

Guarantee	Description
Consistency	Every read obtains either the most recent write or an error
Availability	Every request receives a response without errors, but without the guarantee of containing the most recent write
Parition tolerance	The system remains operational despite arbitrary messages being dropped/delayed by the network between nodes

For instance, when an update occurs and the network between replicas fails, the database can either

Return an error to maintain replica consistency (CP), which results in unavailability
Or apply the update to the reachable subset of replicas (AP), allowing for inconsistency

You can read more about CAP theorem here (opens in a new tab).

Keynotes

Scaling up: involves increasing the resources of a single database instance to improve its performance
Scaling out using read replicas: involves creating multiple copies of the database and distributing read requests across them to improve query response times
Scaling out using partitioning data: involves dividing the database into smaller partitions that can be distributed across multiple servers to improve write performance and increase storage capacity
NoSQL data models: are designed for horizontal scaling and can be more efficient than traditional relational databases
CAP theorem: states that a distributed system can only guarantee two out of three properties: consistency, availability, and partition tolerance

Microservices Eventual Consistency