Thursday, February 16, 2023

Architect and build the full machine learning lifecycle with AWS: end-to-end with SageMaker - Part 2 Tool Mapping

 Data wrangling – We use SageMaker Data Wrangler for cleaning, normalizing, transforming and encoding data, as well as joining datasets. The output of SageMaker Data Wrangler is data transformation code that works with SageMaker Processing, SageMaker Pipelines, SageMaker Feature Store, or with Pandas in a plain Python script. Feature engineering can now be done faster and easier, with SageMaker Data Wrangler where we have a GUI-based environment and can generate code that can be used for the subsequent phases of the ML lifecycle.


Detecting bias – With SageMaker Clarify, in the data prep or training phases, we can detect pre-training (data bias) and post-training bias (model bias). At the inference phase, SageMaker Clarify gives us the ability to provide interpretability and explainability of the predictions by providing insight into which factors were most influential in coming up with the prediction.


Feature Store (offline) – After we complete our feature engineering, encoding, and transformations, we can standardize features offline in SageMaker Feature Store, to be used as input features for training models.

SageMaker Feature Store allows you to create offline feature groups that keep all the historical data and can be used as inputs to training.

Note that Features can be ingested from a feature processing pipeline into the online feature store and will then get replicated to the offline store. The offline store could be used to run batch inference as well. Thus, the online feature store can also be used as input for training.


Artifact lineage: We can use SageMaker ML Lineage Tracking to associate all the artifacts (such as data, models, and parameters) with a trained model to produce metadata that is stored in a model registry. In addition, tracking human in the loop actions such as model approvals and deployments further facilitates the process of ML governance.


Model Registry: The SageMaker Model Registry stores the metadata around all the artifacts that you include in the process of creating your models, along with the trained models themselves in a model registry. Later, we can use human approval to note that the model is ready for production. This feeds into the next phase of deploy and monitor.


Inference and Feature Store (online): SageMaker Feature Store provides for low latency (up to single digit milliseconds) and high throughput reads for serving our model with new incoming data.


Pipelines: After we experiment and decide on the various options in the lifecycle (such as which transforms to apply to our features, determine imbalance or bias in the data, which algorithms to choose to train with, or which hyperparameters are giving us the best performance metrics), we can automate the various tasks across the lifecycle using SageMaker Pipelines.


This lets us streamline the otherwise cumbersome manual processes into an automated ML pipeline. To build this pipeline, we will prepare some data (customers and claims) by ingesting the data into SageMaker Data Wrangler and apply various transformations in SageMaker Data Wrangler within SageMaker Studio.  SageMaker Data Wrangler creates .flow files. We will use these transformation definitions as a starting point for our automated pipeline and go through the ML Lifecycle all the way to deploying the model to a SageMaker Hosted Endpoint. Note that some use cases may require one, larger, end-to-end pipeline, that does everything. Other use cases may require multiple pipelines, such as the following:


A pipeline for all data prep steps.

A pipeline for training, tuning, lineage, and depositing into the model registry (which we show in the code associated with this post).

Possibly another pipeline for specific inference scenarios (such as real time vs. batch).

A pipeline for triggering retraining by using SageMaker Model Monitor to detect model drift or data drift and trigger retraining using, for example, an AWS Lambda

references:

https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/


No comments:

Post a Comment