Ml Systems Design Overview

Last updated on Aug 05, 2021

In this post, we will discuss the goals of designing an ML system. We will also get an idea about different types of ML systems.

Goals of ML systems design

ML system design is the process of defining the interface, algorithms, data, infrastructure, and hardware for an ML system to satisfy specified requirements.

ml-system

There're four main characteristics that an ML system should have, including:

Reliability
Scalability
Maintainability
Adaptability

Reliablity

The system should continue to perform the correct function at the desired level of performance even in adversity (hardware or software faults, and even human error).

For example, what if your predict() function is called correctly, but the prediction is wrong? How do we know if it is wrong if we don't have the ground truth labels to compare? You probably want to think about a way to handle this case in your system.

Scalablity

As the system grows in data, traffic, etc., there should be reasonable ways to expand and reduce resources when needed.

Maintanability

ML engineers, DevOps/MLOps engineers, and Subject Matter Experts (SMEs) should be able to work productively together.

Besides the data labeling phase, you'd want SMEs to be involved in steps such as:

Problem forming
Feature engineering
Error analysis
Model evaluation
Reranking predictions
UI design

There're some known problems when collaborating among cross-functional teams. For instance, how do you explain an ML algorithm's limitations and capabilities to SMEs who might not have engineering or statistical backgrounds? How do you version SMEs' domain expertise (e.g., if the user's shoulder is tilted a little bit like this and that while his back is tilted a little bit like that and this, then it might be a sign of back pain).

Many companies developed node-code/low-code platforms that allow people to make changes without writing code.

Adaptability

To adapt to the change in data distribution and business requirements, the system should be able to both discover aspects for performance improvement and allow updates without service interruption.

For example, an ML system should have the functionalities such as:

Data drift detection
Auto model retraining triggering
Auto A/B testing between model versions
Auto shifting data and allowing rollback in model deployment

Types of ML Systems

Batch prediction vs. Online prediction

Below is the table to compare these two types of ML systems.

	Batch prediction	Online prediction
Frequency	Periodical	As soon as the request comes
Useful for	Processing accumulated data when we don't need immediate results	When predictions are required as soon as a data sample is generated
Optimized	High throughput	Low latency
Input space	Finite - Need to know how many predictions to generate	Can be infinite
Examples	Generate recommendations for users every few hours	Want to see a transaction is fraudulent as soon as possible

They can be used in one hybrid solution, not necessarily be mutually exclusive.

When the online prediction is not cheap enough or is not fast enough, batch prediction is a workaround solution.

When you make a batch prediction, you need to consider the solution for storing and retrieving the pre-generated predictions.

Edge computing vs. Cloud computing

For edge computing, end devices should have enough memory and battery to store models and to make the inference.

The benefits of edge computing include:

Might be able to work without an internet connection or with unreliable internet
No need to worry about network latency
Fewer concerns about privacy
Save on server cost

For cloud computing, the network connections should be fast and reliable.

Online learning vs. Offline learning

Below is the summary table to compare these two types of ML systems.

	Online learning	Offline learning
Interaction cycle	Continual (minutes)	Periodical (months)
Batch size	Hundreds	Thousands to millions
Data usage	Each sample seen at most once	Each sample seen multiple times (epochs)
Evaluation	Offline evaluation as the sanity check. Mostly relying on online evaluation (A/B Testing)	Mostly offline evaluation

If the infrastructure is set up correctly, there's no difference between online and offline learning. We just need to tune the batch size or how often to evaluate models.

Companies also train their models offline in parallel and then combine offline with online versions.

Ending

Let's keep this post not too long. In the next post, we will see the iterative process of an ML system, especially the questions you need to discuss with your teammates to scope your ML system.

Planning An ML Project ML Systems Design - The ML Lifecycle