Thursday, February 16, 2023

Architect and build the full machine learning lifecycle with AWS: end-to-end with SageMaker - Part 3 Use case analysis

Fraud Detection is the use case

Aim is to use Amazon SageMaker to predict the probability that an incoming auto claim may be fraudulent.

Wrangling and preprocessing the dataset

We use SageMaker Data Wrangler to ingest, analyze, prepare, and transform each dataset. You can do this in the GUI-based feature available in SageMaker Studio.

Second, we use SageMaker Data Wrangler to export the transformed data as two CSV files that can be picked up in an Amazon Simple Storage Service (Amazon S3) bucket by SageMaker Processing, in order to conduct scalable data preparation and preprocessing.

Storing the features

After SageMaker Processing applies the transformations defined in SageMaker Data Wrangler, we store the normalized features in an offline feature store so the features can be shared and reused consistently across an organization among collaborating data scientists. This standardization is often key to creating a normalized, reusable set of features that can be created, shared, and managed as input into training ML models. You can use this feature consistency across the ML maturity spectrum, whether you are a startup or an advanced organization with a ML Center of Excellence.

Assessing and Mitigating bias, training and tuning

The issues relating to bias detection and fairness in AI have taken a prominent role in ML. Data bias is often inadvertently injected during the data labeling and collection process, and may often be overlooked in the significance of its impact on training a model. SageMaker Clarify is a fully-managed toolkit to identify potential bias within a training dataset or model, explain individual inference results, aggregate these explanations for an entire dataset, integrate with built-in monitoring capabilities to assess production performance, and provide these capabilities across modeling frameworks.


You can use SageMaker Clarify to assess various types of bias. For example, assessing pre-training bias (data) can focus on determining if class imbalance or a variety of other factors are beyond a threshold and therefore may bias the model we seek to train. SageMaker Clarify helps improve your ML models by detecting potential biases prior to training (data bias) and after training, assess post-training bias (model bias) and can also help explain the predictions that models make during inference.


After we implement our bias mitigation strategy, the next step is often to choose a training algorithm and experiment with various ways of tuning it so as to obtain acceptable ML performance metrics such as F1, AUC, or accuracy. For this post, we use the XGBoost algorithm for training our model using the data in the feature store, and evaluate F1 metrics.


We can also check the resulting model’s post-training bias and, when satisfied with both the performance and transparency (bias) metrics, tune the model to get the most out of its performance through hyperparameter optimization.


We can track the lineage of these experiments using Lineage Tracking to track various aspects of the evolution of our experiments including answering questions related to the following:


Data – Which dataset did we use?

Prep – How did we clean, transform and featurize the data?

Training – Which model and training job configuration did we use?

Tuning – Which hyperparameters did we use?



During our experimentation, we may have trained many models, from different datasets, prepared with different transformations, each with their own performance metrics and bias metrics. If we like a result, we can look at the artifact lineage associated with it so we can reproduce those results or improve them.


Capturing artifact lineage in experiments

Not only do we want to store our trained models themselves, but also the specific datasets, feature  transformations, preprocessing mechanisms, algorithms, and hyperparameter configurations that were used to produce and optimize the models for governance and reproducibility purposes. We can store that metadata, which tracks the experiment and lineage of the model, with a reference to the data and the model in the SageMaker Model Registry.

Deploying the model to a SageMaker hosted endpoint

After we decide which models should be approved for deployment, we can deploy them to a SageMaker hosted endpoint, where they are ready for serving predictions.

Running predictions on the model using the online feature store

We create models so we can run predictions on them. We can invoke an endpoint directly, since Amazon SageMaker endpoints have load balancers behind them to balance incoming load.


Another common invocation pattern for running inference is the ML Gateway Pattern, where we expose the inference as a service endpoint and invoke it using an Amazon API Gateway. This pattern also allows the benefits of a service oriented architecture exposing a set of ML services as RESTful endpoints. Incoming service requests benefit from being load balanced, cached, and monitored using Amazon API Gateway. Amazon API Gateway then calls an AWS Lambda function which can call the SageMaker endpoint.


Explaining the model’s predictions

We can then inspect why this decision was made and present an explainable narrative to inquisitive parties. For this, we use the explainability features of SageMaker Clarify.

references:

https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/

No comments:

Post a Comment