Saturday, February 18, 2023

Taming Machine Learning on AWS with MLOps: A Reference Architecture

There is no one-size-fits-all approach to implementing an MLOps solution on Amazon Web Services (AWS). Like any other technical solution, MLOps should be implemented to meet the project's requirements.


Data science and analytics teams are often squeezed between rising business expectations and sandbox environments that have evolved into complex solutions. This makes it challenging to consistently turn data into solid answers for stakeholders.


Key Components of an MLOps Solution


1. A version control system to store, track, and version changes to your ML code.

2. A version control system to track and version changes to your training datasets.
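Dataset versioning can start as simply as fingerprinting the data that went into each training run, so any change to the dataset produces a new version identifier. Purpose-built tools (DVC, Git LFS) do this far more robustly; the helper below is a minimal sketch, and its name is ours, not from any AWS service:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(data_dir):
    """Return a deterministic SHA-256 digest over every file in a dataset
    directory. A changed, added, or renamed file yields a new digest, which
    can be recorded alongside the code commit that trained on the data."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

Storing this digest with each training run's metadata lets you tie a deployed model back to the exact data it was trained on.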

3. A network layer that implements the necessary network resources to ensure the MLOps solution is secured.

4. An ML-based workload to execute machine learning tasks. AWS offers a three-layered ML stack to choose from based on your organization’s skill level.

We describe the three layers briefly here:

AI services: A fully managed set of services that lets you quickly add ML capabilities to your workloads using API calls. Examples of these AWS services are Amazon Rekognition and Amazon Comprehend.

ML services: AWS provides managed services and resources (Amazon SageMaker suite, for example) to enable you to label your data and build, train, deploy, and operate your ML models.
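To make the ML services layer concrete, here is a hedged sketch of the request body SageMaker's CreateTrainingJob API expects. The job name, image URI, role ARN, and S3 URIs below are placeholders you would substitute with your own values:

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.xlarge"):
    """Assemble a request body for SageMaker's CreateTrainingJob API.
    All ARNs and S3 URIs passed in are illustrative placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

The resulting dict can be passed to `boto3.client("sagemaker").create_training_job(**request)` in an account with the appropriate permissions.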

ML frameworks and infrastructure: This level is intended for expert ML practitioners who use open-source frameworks such as TensorFlow, PyTorch, and Apache MXNet; Deep Learning AMIs for Amazon EC2 P3 and P3dn instances; and Deep Learning Containers to implement their own tools and workflows for building, training, and deploying ML models.


It’s important to note that ML-based workloads can also be implemented by combining services and infrastructure from the different levels of the AWS ML stack.

5. Infrastructure as code (IaC) to automate the provisioning and configuration of your cloud-based ML workloads and other IT infrastructure resources.
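As one illustration, IaC for an MLOps stack can start as small as a CloudFormation template declaring a versioned bucket for model artifacts. The sketch below builds such a template as a Python dict; the bucket name is a placeholder:

```python
import json

def mlops_bucket_template(bucket_name):
    """Return a minimal CloudFormation template (as a dict) declaring a
    versioned S3 bucket for ML artifacts. bucket_name is a placeholder."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ModelArtifactBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "BucketName": bucket_name,
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }

template_body = json.dumps(mlops_bucket_template("example-mlops-artifacts"))
```

The serialized template can then be deployed with the CloudFormation console, `aws cloudformation deploy`, or boto3's `create_stack`.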


6. An ML (training/retraining) pipeline to automate the steps required to train/retrain and deploy your ML models.
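At its core, a training pipeline is an ordered set of stages that hand results to one another. The toy runner below sketches that idea locally; a real pipeline would use a managed service such as SageMaker Pipelines, and the stage names in the usage note are hypothetical:

```python
def run_pipeline(steps, context=None):
    """Run named pipeline stages in order, threading a shared context dict
    from one stage to the next and recording which stages completed.
    A stand-in for a real workflow engine, for illustration only."""
    context = dict(context or {})
    for name, stage in steps:
        context = stage(context)
        context.setdefault("completed", []).append(name)
    return context
```

Usage would look like `run_pipeline([("preprocess", preprocess), ("train", train), ("evaluate", evaluate)])`, where each stage reads its inputs from the context and writes its outputs back.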

7. An orchestration tool to orchestrate and execute your automated ML workflow steps.
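On AWS, such workflows are often expressed in AWS Step Functions' Amazon States Language. The sketch below chains a training task to a model-creation task using real SageMaker service-integration resource ARNs, but it omits the Parameters blocks a deployable definition would need:

```python
def training_workflow_definition():
    """Return a minimal Amazon States Language definition (as a dict) that
    runs a SageMaker training job, then registers the resulting model.
    Incomplete by design: each Task would also need a Parameters block."""
    return {
        "StartAt": "TrainModel",
        "States": {
            "TrainModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Next": "DeployModel",
            },
            "DeployModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createModel",
                "End": True,
            },
        },
    }
```

The `.sync` suffix makes Step Functions wait for the training job to finish before moving to the next state.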

8. A model monitoring solution to monitor production models’ performance to protect against both model and data drift. You can also use the performance metrics as feedback to help improve the models’ future development and training.
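One common data-drift signal is the Population Stability Index (PSI) between a feature's binned distribution at training time and its distribution in production traffic. A minimal, dependency-free sketch:

```python
import math

def population_stability_index(expected, actual):
    """Population Stability Index between two binned distributions
    (sequences of bin proportions that each sum to 1). Values above
    roughly 0.2 are commonly read as significant drift."""
    eps = 1e-6  # guard against log(0) for empty bins
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

Identical distributions score 0; the larger the shift between training and production bins, the larger the PSI.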

9. A model governance framework to make it easier to track, compare, and reproduce your ML experiments and secure your ML models.

10. A data platform like Amazon Simple Storage Service (Amazon S3) to store your datasets.


Implementing an MLOps Solution


Let’s look at the three main options Amazon SageMaker provides when it comes to choosing your training algorithm:

1. Use a built-in Amazon SageMaker algorithm or framework. With this option, a training dataset is the only input developers and data scientists have to provide when training their models, and the trained model artefacts are the only input they need to deploy the models. This is the least flexible of the available options and a good fit for scenarios where an off-the-shelf solution meets your needs.


2. Use pre-built Amazon SageMaker container images. For this option, you need to provide two inputs to train your models, and they are your training scripts and datasets. Likewise, the inputs for deploying the trained models are your serving scripts and the trained model artefacts.
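With a pre-built container, your training script communicates with SageMaker through conventional paths and environment variables such as SM_CHANNEL_TRAINING, SM_MODEL_DIR, and SM_HPS. A small sketch of resolving them, falling back to the paths SageMaker conventionally mounts:

```python
import json
import os

def load_sagemaker_paths(environ=os.environ):
    """Resolve the input/output locations a pre-built SageMaker container
    exposes to a user training script via environment variables, with the
    container's conventional mount points as defaults."""
    return {
        "train": environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
        "model_dir": environ.get("SM_MODEL_DIR", "/opt/ml/model"),
        "hyperparameters": json.loads(environ.get("SM_HPS", "{}")),
    }
```

Your training script reads data from `train`, and anything it writes to `model_dir` becomes the trained model artefacts SageMaker uploads to Amazon S3.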


3. Extend a pre-built Amazon SageMaker container image, or adapt an existing container image. This is for more advanced use cases. You are responsible for developing and maintaining those container images. Therefore, you may want to consider implementing a CI/CD pipeline to automate the building, testing, and publishing of the customized Amazon SageMaker container images and then integrate the pipeline with your MLOps solution.
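A customized image of this kind typically starts from a pre-built framework image and layers your dependencies and code on top. A hedged Dockerfile sketch — the base image URI and file names are placeholders, while SAGEMAKER_PROGRAM is the conventional variable telling the container which script to run:

```dockerfile
# Base image URI is a placeholder; use the pre-built SageMaker framework
# image for your framework, version, and region.
FROM <prebuilt-sagemaker-image-uri>

# Layer in extra dependencies your training code needs.
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt

# Add the custom training code and tell the container to run it.
COPY train.py /opt/ml/code/train.py
ENV SAGEMAKER_PROGRAM=train.py
```

Your CI/CD pipeline would build this image, run its tests, and push it to Amazon ECR for the training jobs to pull.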

References:

https://aws.amazon.com/blogs/apn/taming-machine-learning-on-aws-with-mlops-a-reference-architecture/

