Saturday, February 18, 2023

Dissecting a reference MLOps architecture

The components of the reference architecture diagram are:



A secured development environment implemented using an Amazon SageMaker notebook instance deployed into a custom virtual private cloud (VPC), secured with security groups and by routing the notebook's internet traffic through the custom VPC.


The development environment also has two Git repositories (AWS CodeCommit) attached: one for the Exploratory Data Analysis (EDA) code and the other for developing the custom Amazon SageMaker Docker container images.
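As a rough sketch of how such a notebook instance could be provisioned, the following boto3 call deploys it into the custom VPC with direct internet access disabled and attaches the two CodeCommit repositories. The subnet, security group, role, and repository names are placeholders, not values from the reference architecture.

import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical identifiers -- substitute your own VPC subnet, security group,
# execution role, and CodeCommit repositories.
sagemaker.create_notebook_instance(
    NotebookInstanceName="mlops-dev-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    SubnetId="subnet-0abc1234",                    # subnet inside the custom VPC
    SecurityGroupIds=["sg-0abc1234"],              # security group restricting traffic
    DirectInternetAccess="Disabled",               # force traffic through the VPC
    DefaultCodeRepository="eda-repo",              # EDA code (CodeCommit)
    AdditionalCodeRepositories=["sagemaker-images-repo"],  # custom container images
)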


An ML CI/CD pipeline made up of three sub-components:

A data validation step implemented using AWS Lambda and triggered by AWS CodeBuild.

A model training/retraining pipeline implemented using Amazon SageMaker Pipelines (pipeline-as-code) and executed using CodeBuild (a pipeline-as-code sketch follows this list).

A model deployment pipeline that natively supports model rollbacks, implemented using AWS CloudFormation.

Finally, AWS CodePipeline is used to orchestrate the pipeline.
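To make the pipeline-as-code idea concrete, here is a minimal sketch of a SageMaker Pipelines definition with a single training step. The container image URI, S3 paths, role ARN, and pipeline name are illustrative assumptions, not the exact pipeline from the reference architecture.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.session.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# Hypothetical custom container image and data location.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/custom-sagemaker:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-mlops-bucket/models",
    sagemaker_session=session,
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://my-mlops-bucket/data/train")},
)

# The pipeline definition is plain Python, so it can live in CodeCommit
# and be executed from CodeBuild as part of the CI/CD pipeline.
pipeline = Pipeline(
    name="mlops-training-pipeline",
    steps=[train_step],
    sagemaker_session=session,
)
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
pipeline.start()                 # kick off an execution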


A CI/CD pipeline for developing and deploying the custom Amazon SageMaker Docker container image. This pipeline automatically triggers the ML pipeline when you successfully push a new version of the SageMaker container image (a sketch of this trigger wiring follows the list below), providing the following benefits:

Developers and data scientists can thoroughly test and get immediate feedback on the ML pipeline’s performance after publishing a new version of the Docker image. This helps ensure ML pipelines are adequately tested before promoting them to production.

Developers and data scientists don't have to manually update the ML pipeline to use the latest version of the customized Amazon SageMaker image when working on the develop git branch. They can branch off the develop branch if they want to use an older version or start developing a new version, which they merge back into the develop branch once approved.
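One way to wire the automatic trigger is an EventBridge rule that starts the ML CI/CD pipeline whenever a new image is successfully pushed to Amazon ECR. This is only a sketch; the rule, repository, pipeline, and role names are assumptions.

import json
import boto3

events = boto3.client("events")

# Fire whenever a new image is successfully pushed to the (hypothetical)
# custom SageMaker image repository in ECR.
events.put_rule(
    Name="trigger-ml-pipeline-on-image-push",
    EventPattern=json.dumps({
        "source": ["aws.ecr"],
        "detail-type": ["ECR Image Action"],
        "detail": {
            "action-type": ["PUSH"],
            "result": ["SUCCESS"],
            "repository-name": ["custom-sagemaker-image"],
        },
    }),
    State="ENABLED",
)

# Target the ML CI/CD pipeline; the role must allow codepipeline:StartPipelineExecution.
events.put_targets(
    Rule="trigger-ml-pipeline-on-image-push",
    Targets=[{
        "Id": "ml-cicd-pipeline",
        "Arn": "arn:aws:codepipeline:us-east-1:123456789012:ml-cicd-pipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeCodePipelineRole",
    }],
)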


A model monitoring solution implemented using Amazon SageMaker Model Monitor to continuously monitor the quality of the production models.

This provides monitoring for the following: data drift, model quality drift, bias drift in the models' predictions, and feature attribution drift. You can start with the default model monitor, which requires no coding.
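As a minimal sketch of the default model monitor, the following baselines the training data and then attaches an hourly data-quality schedule to a deployed endpoint. The role ARN, S3 URIs, and endpoint name are placeholders.

from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Compute baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-mlops-bucket/data/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-mlops-bucket/monitoring/baseline",
)

# Check the production endpoint for data drift every hour.
monitor.create_monitoring_schedule(
    monitor_schedule_name="prod-data-quality-schedule",
    endpoint_input="prod-endpoint",
    output_s3_uri="s3://my-mlops-bucket/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)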


A model retraining implementation based on the metric-based model retraining strategy. There are three main strategies available for your model retraining implementation:

Scheduled: This kicks off the model retraining process at a scheduled time and can be implemented using an Amazon EventBridge scheduled event.

Event-driven: This kicks off the model retraining process when a new model retraining dataset is made available and can be implemented using an EventBridge event.

Metric-based: This is implemented by creating a data drift CloudWatch alarm (as seen in Figure 1 above) that triggers your model retraining process when it fires, fully automating the corrective action for model drift.
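A sketch of how the metric-based strategy could be wired: a CloudWatch alarm on a drift metric published by Model Monitor, plus a Lambda handler (invoked via an EventBridge rule on the alarm's state change) that starts the retraining pipeline. The metric name, dimensions, threshold, and pipeline name are assumptions.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a hypothetical drift metric emitted by the monitoring schedule.
cloudwatch.put_metric_alarm(
    AlarmName="data-drift-alarm",
    Namespace="aws/sagemaker/Endpoints/data-metrics",   # assumed namespace
    MetricName="feature_baseline_drift_total",          # assumed metric name
    Dimensions=[
        {"Name": "Endpoint", "Value": "prod-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "prod-data-quality-schedule"},
    ],
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.2,
    ComparisonOperator="GreaterThanThreshold",
)

# Lambda handler (triggered by an EventBridge rule on the alarm's state change)
# that starts the retraining pipeline.
def handler(event, context):
    sagemaker = boto3.client("sagemaker")
    sagemaker.start_pipeline_execution(PipelineName="mlops-training-pipeline")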

A data platform implemented using Amazon S3 buckets with versioning enabled.
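Enabling versioning on a data bucket is a single API call; the bucket name below is a placeholder.

import boto3

s3 = boto3.client("s3")

# Keep every object version so training datasets can be traced and reproduced.
s3.put_bucket_versioning(
    Bucket="my-mlops-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)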


A model governance framework, which is not obvious from the architecture diagram and consists of the following components:

A model registry for versioning and tracking the trained model artefacts, implemented using Amazon SageMaker Model Registry (a registration sketch follows this list).

Dataset versioning implemented using Amazon S3 bucket versioning.

Auditability, visibility, and reproducibility of the ML workflow steps, implemented using Amazon SageMaker ML Lineage Tracking.

Trained model artefacts secured using AWS Identity and Access Management (IAM) roles to ensure only authorized individuals have access.
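To illustrate the model registry component, here is a minimal registration sketch using boto3; the model package group, container image, and artefact location are assumptions.

import boto3

sagemaker = boto3.client("sagemaker")

# Create the group once; subsequent registrations add new versions to it.
sagemaker.create_model_package_group(
    ModelPackageGroupName="churn-model",
    ModelPackageGroupDescription="Versioned churn models",
)

# Register a trained model artefact as a new, pending-approval version.
sagemaker.create_model_package(
    ModelPackageGroupName="churn-model",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [{
            "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/custom-sagemaker:latest",
            "ModelDataUrl": "s3://my-mlops-bucket/models/model.tar.gz",
        }],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)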


References:

https://aws.amazon.com/blogs/apn/taming-machine-learning-on-aws-with-mlops-a-reference-architecture/

