ML In Production
Last updated on Jul 20, 2021

This post discusses the differences between ML in research and ML in production and between traditional software and ML systems, along with some ML challenges in production and some ML deployment myths. This information will help you set the right expectations for your ML project.

Research vs. Production

The table below summarizes five key aspects in which ML systems in research and in production differ.

| | Research | Production |
|---|---|---|
| Objectives | Model performance | Different stakeholders have different objectives |
| Computational priority | Fast training, high throughput | Fast inference, low latency |
| Data | Static | Constantly shifting |
| Fairness | Good to have (sadly) | Important |
| Interpretability | Good to have | Important |

Objective

In academia, the objective of an ML system is usually model performance. Researchers want to achieve state-of-the-art results on benchmark datasets, and the resulting models are often too complicated to be useful in real-life applications.

In production, different stakeholders have different objectives. For example, suppose Facebook wants to train a model that recommends ads in users' news feeds:

  • ML engineers want the model to show ads that users are most likely to click.
  • The sales team wants the model to show the ads that pay the highest advertising fees.
  • The manager wants to maximize profit, maybe by sacking somebody.

Users won't see the difference between a model with 98% accuracy and one with 98.2% accuracy. At Google's scale, however, that 0.2% can save millions of dollars.

If a simple model can do a reasonable job, complex models must perform significantly better to justify the complexity.

Computational priority

In research, we want training to be fast. In production, we want inference to be fast.

In research, we want the training process to handle as many samples per second as possible (high throughput). In production, latency matters a lot. If you can type your next word faster than your iPhone can predict it, would you ever wait to tap the predicted word?
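The tension between throughput and latency can be illustrated with a toy cost model. This is a minimal sketch, not a benchmark: the function names and the assumed costs (a fixed 10 ms overhead per call plus 0.5 ms per sample) are made up for illustration.

```python
def batch_latency_ms(batch_size, overhead_ms=10.0, per_sample_ms=0.5):
    """Time to process one batch under the assumed (hypothetical) linear cost model."""
    return overhead_ms + per_sample_ms * batch_size


def throughput_per_second(batch_size, overhead_ms=10.0, per_sample_ms=0.5):
    """Samples processed per second when batches run back to back."""
    latency_s = batch_latency_ms(batch_size, overhead_ms, per_sample_ms) / 1000.0
    return batch_size / latency_s


if __name__ == "__main__":
    # Larger batches amortize the fixed overhead (better throughput),
    # but every request in the batch waits longer (worse latency).
    for batch_size in (1, 8, 64):
        print(
            batch_size,
            batch_latency_ms(batch_size),
            round(throughput_per_second(batch_size)),
        )
```

Under these assumptions, a batch of 64 is far better for training-style throughput, while a batch of 1 gives the lowest latency for a single user request.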

Data

In research, data is clean and well formatted. Datasets stay unchanged, so people can use them as benchmarks for evaluation. The work of preparing the data and feeding it to your model has usually already been done by somebody else.

In production, data is messy. You have to clean it and reformat it. Splitting it into training, validation, and test sets is not easy because the data is often biased, imbalanced, or outdated. Sometimes you have to add more label classes or merge two existing ones. This is a nightmare!
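One common way to handle imbalanced labels when splitting is a stratified split, which keeps each class's share roughly the same in every set. Below is a minimal sketch using only the standard library; the helper name `stratified_split` and the toy labels are illustrative assumptions, not taken from any particular library.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Put at least one sample of every class into the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy, made-up labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
```

With these toy labels, both sets keep the original 9:1 class ratio, which a naive random split does not guarantee.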

In research, the data was usually created a long time ago. In production, the data can be historical, streaming, or both. In production, you also need to care about data privacy and regulations.

| Research | Production |
|---|---|
| Clean | Messy |
| Static | Constantly shifting |
| Mostly historical data | Historical + streaming data |
| | Privacy + regulatory concerns |

Fairness

You might be a victim of biased ML algorithms. Your resume might be ranked very low because your name is not common: the ranking model has picked up the name as an important feature :)

ML algorithms don't predict the future but encode the past, perpetuating the biases in the data and more.

Minority groups can be harmed badly because wrong predictions on them have only a minor effect on the model's overall performance, so these errors easily go unnoticed.
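A small numeric example shows how a good overall metric can hide poor performance on a minority group. This is a sketch on synthetic, made-up evaluation results; the helper `accuracy_by_group` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(groups, correct):
    """Return overall accuracy and a dict of accuracy per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in zip(groups, correct):
        totals[group] += 1
        hits[group] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {group: hits[group] / totals[group] for group in totals}
    return overall, per_group


# Synthetic results: 95 majority samples (all predicted correctly)
# and 5 minority samples (only 1 predicted correctly).
groups = ["majority"] * 95 + ["minority"] * 5
correct = [True] * 95 + [True, False, False, False, False]

overall, per_group = accuracy_by_group(groups, correct)
```

Here the overall accuracy is 96%, while accuracy on the minority group is only 20%. Reporting metrics per group is one way to surface this kind of gap.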

Interpretability

Model interpretability is important for understanding why a model makes a particular prediction or decision. Otherwise, we might feel uncomfortable trusting it. Interpretability also makes debugging, monitoring, and improving the model easier.

While most of us are comfortable using a microwave without understanding how it works, many don't feel the same way about AI yet, especially if that AI makes important decisions about their lives.

Addons

Most companies cannot pursue pure research unless it leads to short-term profitable applications.

Nowadays, state-of-the-art models are easily accessible, so more people and organizations in different fields want to find applications for them. That's why the majority of ML-related jobs are in ML production.

Traditional software vs. ML systems

ML production would be a better place if ML experts were better software engineers. Many traditional software engineering tools can be used to develop and deploy ML applications.

However, many challenges are unique to ML applications and require their own tools. The table below compares traditional software and ML systems.

| | Traditional software | ML systems |
|---|---|---|
| Code & data | Are separated | Part code, part data |
| Testing & versioning | Test and version code | Test and version code, data, and models |
| Size | Data & code are not too big | Model size might be a challenge |
| Monitor & debug | A good logging system might be enough | Non-trivial |
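Versioning data alongside code can start as simply as recording a content hash of the dataset next to each trained model. Below is a minimal sketch with the standard library, assuming the dataset fits in memory as rows of simple values; `dataset_fingerprint` is a hypothetical helper and not the API of DVC or any other tool.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest over all rows, in order."""
    digest = hashlib.sha256()
    for row in rows:
        for field in row:
            encoded = str(field).encode("utf-8")
            # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
            digest.update(len(encoded).to_bytes(8, "big"))
            digest.update(encoded)
        digest.update(b"\n")  # mark the end of the row
    return digest.hexdigest()


# Made-up example data: the second version differs in a single label.
v1 = [("alice", 34, "click"), ("bob", 27, "no_click")]
v2 = [("alice", 34, "click"), ("bob", 27, "click")]

fingerprint_v1 = dataset_fingerprint(v1)
fingerprint_v2 = dataset_fingerprint(v2)
```

Storing the fingerprint with the model checkpoint makes it possible to tell later which exact version of the data a model was trained on.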

ML production challenges

The table below shows some common challenges in ML production.

| # | Challenge | Description | Example |
|---|---|---|---|
| 1 | Data labeling | How to quickly label new data or re-label existing data for a new model? | Snorkel |
| 2 | Data testing | How to test the usefulness and correctness of data? Is a sample good or bad for your system? | |
| 3 | Data and model versioning | How to version datasets and checkpoints? How to merge different versions of data? | DVC |
| 4 | Data format | To read only a subset of features, use a column-based data format (e.g., Parquet, ORC). Row-based data formats (e.g., CSV) require loading all features | |
| 5 | Data manipulation | DataFrames designed for parallelization and compatible with GPUs, as pandas doesn't work on GPUs | dask |
| 6 | Monitoring | Has the data distribution shifted? Do we need to retrain? | Dessa |
| 7 | Model compression | How to compress a model to fit onto consumer devices? | Xnor.ai |
| 8 | Deployment | How to package and deploy a new model or replace an existing model? | OctoML |
| 9 | CI/CD test | How to run tests after each change to a model? | Argo |
| 10 | Inference optimization | How to speed up inference time? Can we fuse operations? Can we use lower precision? | TensorRT |
| 11 | Edge device | Hardware designed to run ML algorithms fast and cheap? | Coral SOM |
| 12 | Privacy | How to use user data while preserving users' privacy? How to make your process GDPR-compliant? | PySyft |
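To make the "lower precision" question in row 10 more concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It only illustrates the general idea of trading precision for size and speed; it is an assumption-laden toy and not how TensorRT or any specific tool implements it. The weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.031, 1.0]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four (for 32-bit floats), at the cost of a small rounding error per weight.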

ML deployment myths

The table below summarizes some common ML deployment myths.

| # | Myth | Description |
|---|---|---|
| 1 | Deploying is hard | Deploying is easy; deploying reliably is hard. Making the model available to millions of users with a latency of milliseconds and 99% uptime is hard |
| 2 | Only deploy one or two ML models at a time | Companies have many ML models. Each feature of an application may require its own model |
| 3 | If we don't do anything, model performance remains the same | Concept drift: the data your model runs inference on drifts further and further away from the data it was trained on. An ML system performs best right after training |
| 4 | No need to update models as much | Since a model's performance decays over time, we want to update it as fast as possible |
| 5 | No need to worry about scale | Scale matters sooner than you might think, e.g., for a system that serves hundreds of queries per second or millions of users per month |
| 6 | ML can transform the business overnight | Magically: possible. Overnight: no. The longer you've adopted ML, the faster your development cycle will run, and the higher your return on investment (ROI) will be |
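The drift described in myth 3 can be monitored by comparing the distribution of a feature at training time with its distribution at serving time. One simple, widely used measure is the Population Stability Index (PSI). The sketch below uses equal-width bins and made-up data; the function name, the bin count, and the `eps` smoothing value are illustrative assumptions, and any alerting threshold would have to be chosen per application.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature, using equal-width bins."""
    low = min(min(expected), min(actual))
    high = max(max(expected), max(actual))
    width = (high - low) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        counts = [0] * n_bins
        for value in values:
            index = min(int((value - low) / width), n_bins - 1)
            counts[index] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(count / len(values), eps) for count in counts]

    expected_frac = bin_fractions(expected)
    actual_frac = bin_fractions(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_frac, actual_frac)
    )


# Made-up feature values: training data in [0, 1), serving data shifted by 0.5.
training = [i / 100 for i in range(100)]
serving_same = [i / 100 for i in range(100)]
serving_shifted = [0.5 + i / 100 for i in range(100)]

psi_same = population_stability_index(training, serving_same)
psi_shifted = population_stability_index(training, serving_shifted)
```

A PSI of zero means the binned distributions are identical; the larger the value, the further the serving data has moved away from the training data, which is a signal to investigate or retrain.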

Case studies

To end this post, here are some case studies that show how real teams deal with different deployment requirements and constraints.

  1. Using Machine Learning to Predict Value of Homes On Airbnb
  2. Using Machine Learning to Improve Streaming Quality at Netflix
  3. 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com
  4. How we grew from 0 to 4 million women on our fashion app, with a vertical machine learning approach
  5. Machine Learning-Powered Search Ranking of Airbnb Experiences
  6. From shallow to deep learning in fraud
  7. Space, Time and Groceries
  8. Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning
  9. Scaling Machine Learning at Uber with Michelangelo
  10. Spotify's Discover Weekly: How machine learning finds your new music