Thursday, February 16, 2023

Architect and build the full machine learning lifecycle with AWS: end-to-end with SageMaker - Part 4 Solution Architecture

This is a deep dive into the solution architecture for each of the four workflows for data prep, train and tune, deploy, and finally a pipeline that ties everything together in an automated fashion up to storing the models in a registry.

Manual workflow

Before we automate parts of the lifecycle, we often conduct investigative data science work. This is often carried out in the exploratory data analysis and visualization phases, where we use SageMaker Data Wrangler to figure out what we want to do with our data (visualize, understand, clean, transform, or featurize) to prepare it for training. The following diagram illustrates the flow for the two datasets on SageMaker Data Wrangler.

One of the outputs you can choose in SageMaker Data Wrangler is a Python notebook that distills these activities into a set of functions. The .flow file output contains a set of transformations that provide SageMaker Processing with guidance on what transformations to apply to features. The following screenshot shows the export options from SageMaker Data Wrangler.

We can send this code to SageMaker Processing to create a preprocessing job that prepares our datasets for training in a scalable and reproducible way.

Data prep

The data can be visualized to check for class imbalance. If an imbalance exists, one option is to apply transformations that correct it.
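As a rough illustration of that check, the sketch below computes the share of each class in a label column. The label values and the 97/3 split are made up for the example; they are not taken from the claims dataset.

```python
from collections import Counter

def class_distribution(labels):
    """Return each class's share of the labels as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical fraud labels: 1 = fraud, 0 = no fraud.
labels = [0] * 97 + [1] * 3
shares = class_distribution(labels)
print(shares)
```

A heavily skewed distribution like this one is the kind of result that would prompt a balancing transformation before training.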

We loaded the raw data from the S3 bucket and created 10 transforms for claims and 6 for customers.

Some of the transformed columns are shown below (pandas DataFrame.info() output).

Claims:

 23  collision_type_rear              5000 non-null   float64
 24  collision_type_side              5000 non-null   float64
 25  collision_type_na                5000 non-null   float64
 26  authorities_contacted_police     5000 non-null   float64
 27  authorities_contacted_none       5000 non-null   float64
 28  authorities_contacted_fire       5000 non-null   float64
 29  authorities_contacted_ambulance  5000 non-null   float64
 30  event_time                       5000 non-null   float64

Customers:

 12  policy_state_ca            5000 non-null   float64
 13  policy_state_wa            5000 non-null   float64
 14  policy_state_az            5000 non-null   float64
 15  policy_state_or            5000 non-null   float64
 16  policy_state_nv            5000 non-null   float64
 17  policy_state_id            5000 non-null   float64
 18  event_time                 5000 non-null   float64
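Column names such as policy_state_ca and collision_type_rear look like the result of one-hot encoding a categorical column into float indicator columns. That is our reading of the names, not something the Data Wrangler flow is shown to do here. A minimal pure-Python sketch of that kind of transform, with made-up rows:

```python
def one_hot(rows, column, categories):
    """Replace `column` with one float indicator column per category."""
    encoded = []
    for row in rows:
        # Copy every field except the categorical column being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1.0 if row[column] == cat else 0.0
        encoded.append(new_row)
    return encoded

# Hypothetical input rows; only the column naming follows the listing above.
rows = [
    {"policy_id": 1, "policy_state": "ca"},
    {"policy_id": 2, "policy_state": "wa"},
]
states = ["ca", "wa", "az", "or", "nv", "id"]
encoded = one_hot(rows, "policy_state", states)
print(encoded[0])
```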

The following diagram shows the data prep architecture.

It is assumed that SageMaker Data Wrangler has already been run and that its output is available in the /data folder.

You can provide an S3 bucket that contains the results of the SageMaker Data Wrangler job that has output two files: claims.csv and customer.csv. If you want to move on and assume the data prep has been conducted, you can access the preprocessed data in the /data folder containing the files claims_preprocessed.csv (31 features) and customers_preprocessed.csv (19 features). The policy_id and event_time columns in customers_preprocessed.csv are necessary when creating a feature store, which requires a unique identifier for each record and a timestamp.

Ingesting the preprocessed data into SageMaker Feature Store

After SageMaker Processing finishes the preprocessing, we have our two CSV data files for claims and customers ready. By ingesting them into SageMaker Feature Store, we contribute to the standardization of these features and make them discoverable and reusable.

SageMaker Feature Store is a centralized store for features and their associated metadata, allowing features to be easily discovered and reused across your organization or team. You have the option of creating an offline store (kept in Amazon S3), an online store (kept in a low-latency store), or both. Offline data is stored in your S3 bucket using a prefixing scheme based on event time. The offline feature store is append-only, which enables you to maintain a historical record of all feature values. Data is stored in the offline store in Parquet format for optimized storage and query access. SageMaker Feature Store supports combining data to produce training, validation, and test datasets, and allows you to extract data at different points in time.
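To make the point-in-time idea concrete, here is a small in-memory sketch. It is not the Feature Store API: it only shows how an append-only history keyed by a record identifier and an event time lets us recover the values that were current at a given cutoff. The num_claims feature and all values are hypothetical.

```python
def latest_as_of(records, cutoff):
    """Return the newest record per policy_id with event_time <= cutoff."""
    latest = {}
    for rec in records:
        if rec["event_time"] > cutoff:
            continue  # Written after the cutoff, so not visible yet.
        current = latest.get(rec["policy_id"])
        if current is None or rec["event_time"] >= current["event_time"]:
            latest[rec["policy_id"]] = rec
    return latest

# Append-only history: policy 1 was updated at t=200 and the old row remains.
history = [
    {"policy_id": 1, "event_time": 100.0, "num_claims": 0},
    {"policy_id": 1, "event_time": 200.0, "num_claims": 1},
    {"policy_id": 2, "event_time": 150.0, "num_claims": 3},
]
snapshot = latest_as_of(history, cutoff=160.0)
print(snapshot[1]["num_claims"], snapshot[2]["num_claims"])
```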

To store features, we first need to define their feature group. A feature group is the main feature store resource that contains the metadata for all the data stored in Amazon SageMaker Feature Store. A feature group is a logical grouping of features, defined in the feature store, to describe records. A feature group’s definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store.
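The pieces of a feature group definition can be sketched as a plain dictionary whose keys follow the CreateFeatureGroup request shape. The helper below only builds that dictionary and does not call AWS. The group name, bucket, role ARN, and feature types are placeholders we made up for illustration.

```python
def feature_group_request(name, feature_types, s3_uri, role_arn, enable_online=True):
    """Build a CreateFeatureGroup-style request dict (no AWS call is made)."""
    return {
        "FeatureGroupName": name,
        # Unique identifier and timestamp required for every record.
        "RecordIdentifierFeatureName": "policy_id",
        "EventTimeFeatureName": "event_time",
        "FeatureDefinitions": [
            {"FeatureName": feature, "FeatureType": ftype}
            for feature, ftype in feature_types.items()
        ],
        "OnlineStoreConfig": {"EnableOnlineStore": enable_online},
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": s3_uri}},
        "RoleArn": role_arn,
    }

# All values below are hypothetical placeholders.
request = feature_group_request(
    name="customers-feature-group",
    feature_types={
        "policy_id": "Integral",
        "policy_state_ca": "Fractional",
        "event_time": "Fractional",
    },
    s3_uri="s3://example-bucket/feature-store",
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
)
print(request["FeatureDefinitions"][0])
```

With real values, a dictionary like this could be unpacked into the boto3 SageMaker client's create_feature_group call.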

The online database is optional, but very useful if you need supplemental features to be available at inference. In this section, we create two feature groups for our claims and customers datasets. After inserting the claims and customers data into their respective feature groups, you need to query the offline store with Amazon Athena to build the training dataset.
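One plausible shape for that Athena query is a join of the two offline-store tables on policy_id. The helper below just assembles the SQL string. The table names are hypothetical, since Feature Store generates the actual Athena table names for each feature group.

```python
def training_query(claims_table, customers_table):
    """Build a SQL string joining claims and customers on policy_id."""
    return (
        f'SELECT DISTINCT * FROM "{claims_table}" claims '
        f'LEFT JOIN "{customers_table}" customers '
        "ON claims.policy_id = customers.policy_id"
    )

# Hypothetical table names for illustration only.
query = training_query("claims_table", "customers_table")
print(query)
```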

To ingest data, we first designate a feature group for each type of feature, in this case, one per CSV file. You can ingest data into feature groups in SageMaker Feature Store in one of two ways: streaming or batch. For this post, we use the batch method.
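For batch ingestion, each CSV row has to become a record of feature name and value pairs. The sketch below converts rows into the list-of-dicts shape that the Feature Store runtime's PutRecord operation expects, without sending anything to AWS. The inline CSV is a made-up three-column sample, not the real preprocessed file.

```python
import csv
import io

def to_record(row):
    """Convert one CSV row (a dict) into PutRecord-style feature values."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
    ]

# Hypothetical sample standing in for customers_preprocessed.csv.
csv_text = (
    "policy_id,policy_state_ca,event_time\n"
    "1,1.0,1676505600.0\n"
    "2,0.0,1676505600.0\n"
)
records = [to_record(row) for row in csv.DictReader(io.StringIO(csv_text))]
print(records[0])
```

Each element of records could then be passed as the Record argument of put_record, one call per row, or handed to a higher-level batch helper.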

Training and tuning

The following diagram illustrates the workflow for the bias check, training, tuning, lineage, and model registry stages.

We write the train and test split datasets to our designated S3 bucket, and create an XGBoost estimator to train our fraud detection model with a binary logistic target (fraud or no fraud). Prior to starting the SageMaker training job using the built-in XGBoost algorithm, we set the XGBoost hyperparameters.
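The split and the hyperparameters can be sketched without any AWS calls. The 80/20 ratio, the seed, and every hyperparameter value below are illustrative assumptions, not the settings used in the original notebooks. Only the binary:logistic objective follows from the fraud or no fraud target.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows deterministically and split off a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical values; the built-in XGBoost algorithm takes them as strings.
hyperparameters = {
    "objective": "binary:logistic",  # Outputs a fraud probability.
    "max_depth": "3",
    "eta": "0.2",
    "num_round": "100",
}

rows = list(range(100))  # Stand-in for the joined training rows.
train, test = train_test_split(rows)
print(len(train), len(test))
```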

We take the opportunity to track all the artifacts or entities involved with the training job so we can track the lineage of the model. This is done by importing several sagemaker.lineage components. See the following code:

from sagemaker.lineage import context, artifact, association, action

Lineage Tracking provides us with visibility into the code, training data, and model artifacts, which we link using association_type='Produced' and association_type='ContributedTo'. These associations record what contributed to, and what produced, a given artifact in the process.
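To show what these associations buy us, the sketch below models them as plain tuples and walks them backwards from the model. It is an in-memory illustration, not the sagemaker.lineage API, and all entity names are hypothetical. It uses ContributedTo, which is the spelling the lineage API accepts as an association type.

```python
# Each association is (source, destination, association_type).
associations = [
    ("code:train.py", "job:xgboost-training", "ContributedTo"),
    ("data:train.csv", "job:xgboost-training", "ContributedTo"),
    ("job:xgboost-training", "model:fraud-xgboost", "Produced"),
]

def upstream(entity, edges):
    """Return every entity that directly or transitively led to `entity`."""
    found = set()
    stack = [entity]
    while stack:
        node = stack.pop()
        for source, destination, _ in edges:
            if destination == node and source not in found:
                found.add(source)
                stack.append(source)
    return found

lineage = upstream("model:fraud-xgboost", associations)
print(sorted(lineage))
```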

We also assess degrees of pre-training and post-training bias using SageMaker Clarify. Pre-training metrics show a variety of possible preexisting bias in our dataset. Post-training metrics show bias in the predictions resulting from the model. We use analysis_config.json to specify which groups we want to check bias across and which metrics we want to show.
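A minimal sketch of what such an analysis_config.json might contain is shown below, built as a Python dictionary and serialized to JSON. The label column, facet column, model name, and instance type are placeholders we invented. The real file in the project may differ in both fields and values.

```python
import json

# Hypothetical column and model names; adjust to the actual dataset.
analysis_config = {
    "dataset_type": "text/csv",
    "label": "fraud",
    "label_values_or_threshold": [1],
    "facet": [
        {"name_or_index": "customer_gender_female", "value_or_threshold": [1]}
    ],
    "methods": {
        "pre_training_bias": {"methods": ["CI"]},
        "post_training_bias": {"methods": ["DPPL"]},
    },
    "predictor": {
        "model_name": "example-fraud-model",
        "instance_type": "ml.m5.xlarge",
        "initial_instance_count": 1,
    },
}

config_json = json.dumps(analysis_config, indent=2)
print(config_json)
```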

We assess two metrics: the difference in positive proportions in predicted labels (DPPL) and class imbalance (CI) in the data. For our use case, we measure both on the gender feature; CI indicates whether we have more male customers than female customers. Results indicate a slight bias in our model as measured by the DPPL metric.
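Both metrics have simple definitions that can be checked by hand. Class imbalance compares the sizes of two groups, and DPPL compares the share of positive predictions each group receives. The counts and predictions below are invented for illustration, and treating male customers as group a and female customers as group d is our assumption.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means the groups are equal in size."""
    return (n_a - n_d) / (n_a + n_d)

def dppl(predictions_a, predictions_d):
    """Difference in the proportion of positive predictions between groups."""
    q_a = sum(predictions_a) / len(predictions_a)
    q_d = sum(predictions_d) / len(predictions_d)
    return q_a - q_d

# Hypothetical data: 60 male vs 40 female customers, 0/1 model predictions.
ci = class_imbalance(n_a=60, n_d=40)
gap = dppl(predictions_a=[1, 0, 0, 0], predictions_d=[0, 0, 0, 0])
print(ci, gap)
```

Values near zero indicate little imbalance or disparity, and the sign shows which group is favored.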

Deploying and serving the model

Creating an automated workflow using SageMaker Pipelines

After we complete a few iterations of our manual exploratory data science and are happy with the outcomes of our cleansing, transformations, and featurizations, we may want to create an automated workflow using SageMaker Pipelines, so we can scale and don’t have to go through this manual process every time.

The following diagram shows our end-to-end automated MLOps pipeline, which includes eight steps:

Preprocess the claims data with SageMaker Data Wrangler.
Preprocess the customers data with SageMaker Data Wrangler.
Create a dataset and train/test split.
Train the XGBoost algorithm.
Create the model.
Run bias metrics with SageMaker Clarify.
Register the model.
Deploy the model.
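A pipeline like this is a directed acyclic graph of steps. The sketch below uses Python's standard graphlib (Python 3.9+) rather than the SageMaker Pipelines SDK to show how the eight steps could be ordered. The step names and the dependencies between them are our assumptions based on the list above, not the pipeline's actual definition.

```python
from graphlib import TopologicalSorter

# Map each step to the steps it depends on (assumed, for illustration).
dependencies = {
    "PreprocessClaims": set(),
    "PreprocessCustomers": set(),
    "CreateDataset": {"PreprocessClaims", "PreprocessCustomers"},
    "Train": {"CreateDataset"},
    "CreateModel": {"Train"},
    "ClarifyBias": {"CreateModel"},
    "RegisterModel": {"ClarifyBias"},
    "DeployModel": {"RegisterModel"},
}

# static_order yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two preprocessing steps have no dependencies on each other, so a pipeline engine is free to run them in parallel.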

