ML Skills
The Ultimate Machine Learning System
Data Engineering - What To Consider
Last updated on Aug 25, 2021


Data engineering is the second phase of the ML lifecycle. In this phase, we perform steps such as:

  • Collect data
  • Understand data
  • Verify data quality
  • Prepare data for training

Before diving into each step, this post first introduces some important aspects of data that we should consider, including:

  • Data sources
  • Data formats
  • Structured vs. Unstructured data
  • OLTP vs. OLAP databases
  • Extract - Transform - Load processes
  • Batch processing vs. Stream processing

Data sources

Data can come from many different sources. The table below shows four common ones.

| # | Source | Example |
|---|--------|---------|
| 1 | User-generated data | Users book a trip, ignore a popup, spend x seconds on page y |
| 2 | System-generated data | Logs, predictions |
| 3 | Enterprise application data | Data generated from enterprise apps used for managing company assets (inventory, customer relationships, users, etc.) |
| 4 | Third-party data | Data collected on the general public who aren't your customers |

First-party data is the data that your company already collects about your users or customers.

Second-party data is the data collected by another company on their own customers.

Third-party data is the data that companies collect on the general public who aren't their customers.

Data formats

Once we have data, we will serialize and store it. Data serialization is the process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later. There are many, many data serialization formats. This section compares different types of data formats.
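As a quick illustration, the sketch below serializes the same record to JSON (a text format) and to pickle (a binary format); the record's fields are made up for illustration.

```python
import json
import pickle

# A single ride-share transaction as a Python dict
record = {"ride_id": 101, "price": 12.5, "city": "Singapore"}

# Text serialization: human-readable, widely interoperable
json_bytes = json.dumps(record).encode("utf-8")

# Binary serialization: compact, Python-specific
pickle_bytes = pickle.dumps(record)

# Both can be stored or transmitted, then reconstructed later
assert json.loads(json_bytes) == record
assert pickle.loads(pickle_bytes) == record
```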

Row-based vs. column-based

| | Row-based | Column-based |
|---|---|---|
| Better for | Accessing samples | Accessing features |
| Example | A ride-share transaction with all its info | The price of a ride-share transaction |
| Use cases | Workloads that require lots of writes | Workloads that require lots of reads with almost no writes |

Note: NumPy arrays are row-major by default, while a Pandas DataFrame is columnar.
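A quick sketch of what this means in code (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

arr = np.zeros((1000, 3))  # C order (row-major) by default
sample = arr[0]            # one row/sample: contiguous in memory

df = pd.DataFrame(arr, columns=["price", "duration", "distance"])
prices = df["price"]       # one column/feature: stored as a single block
```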

Text vs. binary format

| Format | Text/Binary | Human-readable | Example |
|---|---|---|---|
| JSON | Text | Yes | Everywhere |
| CSV | Text | Yes | Everywhere |
| Parquet | Binary | No | Hadoop, Amazon Redshift |
| Avro | Binary (primarily) | No | Hadoop |
| Protobuf | Binary (primarily) | No | Google, TensorFlow (TFRecord) |
| Pickle | Text, binary | No | Python, PyTorch serialization |

Note: AWS recommends using the Parquet format because "the Parquet format is up to 2x faster to unload and consumes up to 6x less storage in Amazon S3 than text formats."
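For instance, Pandas can write and read Parquet directly. This sketch assumes the optional pyarrow (or fastparquet) engine is installed; the file name is made up.

```python
import pandas as pd

df = pd.DataFrame({"ride_id": [1, 2], "price": [12.5, 8.0]})

# Columnar binary format: smaller on disk and faster to scan by column
df.to_parquet("rides.parquet")
subset = pd.read_parquet("rides.parquet", columns=["price"])  # read one column only
```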

Structured vs. Unstructured data

As data schemas evolved and the number of possible data sources expanded, many companies simply stored all their data in a data lake rather than deal with schema changes. Loading data into storage first and processing it later is sometimes called ELT (extract, load, transform).

As data keeps growing, companies switch to running applications on the cloud, and infrastructure becomes standardized, data structures also become standardized, which leads some companies to store data in a structured way.

| | Structured | Unstructured |
|---|---|---|
| Schema | Clearly defined | Whatever |
| Advantage | Easy to search and analyze | Arrives fast (no need to clean up first) |
| Data source | Only data with the specific schema | Any source |
| Schema changes | Need to update all the data | No need to worry about schema changes |
| Stored in | Data warehouse | Data lake |
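To make the "clearly defined schema" idea concrete, here is a minimal sketch that accepts a record into a structured store only if it matches an expected schema; the field names are hypothetical.

```python
# Hypothetical schema for a structured "rides" table
SCHEMA = {"ride_id": int, "price": float, "city": str}

def matches_schema(record):
    """Return True only if the record has exactly the expected fields and types."""
    return record.keys() == SCHEMA.keys() and all(
        isinstance(record[key], typ) for key, typ in SCHEMA.items()
    )

matches_schema({"ride_id": 1, "price": 12.5, "city": "Singapore"})  # True
matches_schema({"ride_id": 1, "note": "free-form text"})            # False
```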

OLTP vs. OLAP databases

This section compares two types of databases to store data. The table below shows the traits and use cases for OnLine Transaction Processing and OnLine Analytical Processing databases.

| | OLTP | OLAP |
|---|---|---|
| Traits | Most operations insert, delete, or update an existing transaction | Most operations aggregate data in columns across multiple rows, e.g., compute the average price of all rides in September in Singapore |
| Use case | Designed to process queries fast | Designed for queries that let you look at data from different viewpoints |
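The OLAP example above (average price of all rides in September in Singapore) boils down to a column aggregation across many rows. Here is a sketch with made-up column names, using Pandas in place of a real OLAP database:

```python
import pandas as pd

rides = pd.DataFrame({
    "city": ["Singapore", "Singapore", "Hanoi"],
    "month": ["Sep", "Sep", "Sep"],
    "price": [12.5, 8.0, 5.0],
})

# OLAP-style query: aggregate one column across multiple rows
mask = (rides["city"] == "Singapore") & (rides["month"] == "Sep")
avg_price = rides.loc[mask, "price"].mean()  # 10.25
```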

OLTP databases also require:

  • Isolation controls: two transactions that happen at the same time behave as if they were isolated; two users accessing the same data can't change it simultaneously. For example, you don't want two users to book the same driver at the same time.
  • Atomicity controls: all the steps in a transaction are completed successfully as a group. If any step in the transaction fails, all the other steps must fail too. For example, if a user's payment fails, you don't want to still assign a driver to that user.

Atomicity and isolation are part of ACID (Atomicity, Consistency, Isolation, Durability), a standard set of properties that guarantee database transactions are processed reliably.
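As an illustration of atomicity, the sketch below wraps the ride-booking example in a single SQLite transaction: when the payment step fails, the driver assignment is rolled back with it. The table layout is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (user_id INTEGER, driver_id INTEGER)")
conn.execute("CREATE TABLE payments (user_id INTEGER, amount REAL)")

def book_trip(user_id, driver_id, amount, payment_ok):
    try:
        with conn:  # one atomic transaction: commit on success, roll back on error
            conn.execute("INSERT INTO bookings VALUES (?, ?)", (user_id, driver_id))
            if not payment_ok:
                raise RuntimeError("payment failed")
            conn.execute("INSERT INTO payments VALUES (?, ?)", (user_id, amount))
    except RuntimeError:
        pass  # the booking insert was rolled back along with the failed payment

book_trip(1, 42, 12.5, payment_ok=False)
assert conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0] == 0
```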

Extract - Transform - Load

ETL processes and aggregates data from OLTP databases to generate OLAP databases.

  • Extract: extracting data from data sources.
  • Transform: processing data into the target format, e.g., cleaning and validating it, joining data from multiple sources, splitting it into different parts, and applying operations such as transposing, deduplicating, and aggregating.
  • Load: loading the result into the target destination, e.g., a file or a data warehouse.
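Put together, a toy ETL job might look like the sketch below; the file names and columns are assumptions.

```python
import pandas as pd

# Extract: pull raw data from a source (here, a hypothetical CSV export)
raw = pd.read_csv("raw_rides.csv")

# Transform: clean, deduplicate, and aggregate into the target format
transformed = (
    raw.dropna(subset=["price"])
       .drop_duplicates(subset=["ride_id"])
       .groupby("city", as_index=False)["price"]
       .mean()
)

# Load: write the result to the target destination (a file, a warehouse, ...)
transformed.to_parquet("avg_price_by_city.parquet")
```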

Batch processing vs. Stream processing


The table below compares these two types of processing.

| | Batch processing | Stream processing |
|---|---|---|
| Definition | Data is processed in batches | Data is processed as soon as each sample arrives, which enables online prediction |
| Data type | Usually static features, such as age, gender, job, city, income, when the account was created, or rating | Usually dynamic features, such as locations in the last 10 minutes, what the user is watching, or what the user just liked |
| Data amount | Bounded: we know when a job finishes | Unbounded: we don't know when the stream ends |
| Tools | Processed in batches, e.g., SQL, MapReduce (to process large amounts of data) | Processed as events arrive, e.g., Apache Flink, Apache Samza |

Stream processing is more complicated and comes with requirements such as the following.

  • Real-time pipeline: process features, feed them into the model, and return a prediction in milliseconds.
  • Manage streaming data: keep users' events in memory as long as they're helpful (e.g., seven days), then move them to permanent storage (e.g., S3) or discard them. Examples of such events: recent history, recent trips, recent transactions, choosing a location, booking a trip, canceling a trip, adding a credit card, removing a credit card, etc.
  • Process data: join features from various data sources.
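As a rough, engine-free illustration of these requirements (a real system would use something like Flink), the sketch below keeps only recent events in memory and computes a simple windowed feature; the window size and event fields are assumptions.

```python
import time
from collections import deque

WINDOW_SECONDS = 600  # keep roughly the last 10 minutes of events in memory
events = deque()      # (timestamp, location) pairs, oldest first

def on_event(location, now=None):
    """Process an event as soon as it arrives; return the locations in the window."""
    now = time.time() if now is None else now
    events.append((now, location))
    # Evict old events; a real system would move them to permanent storage (e.g., S3)
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()
    return [location for _, location in events]

recent = on_event("District 1")  # dynamic feature: locations in the last 10 minutes
```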

Apache Kafka offers stream processing, but it is limited in its ability to deal with various data sources. The most popular tool for stream processing is Apache Flink, which also has native support for batch processing.

There are some barriers to stream processing adoption, such as:

  • Companies don't see the benefits of streaming: the system isn't at scale, batch predictions work fine, or online predictions would work better but they don't know it
  • High initial investment in infrastructure
  • Mental shift
  • Python incompatibility

Ending

Data engineering is an important phase of the ML lifecycle. In this post, we discussed some essential things to consider when you reach this phase. In reality, there are many more things you might need to think about beyond the list above. Be ready for change in a data-centric world.