MLOps using Airflow, MLFlow and Kafka
Apache Kafka is a distributed messaging platform that allows you to sequentially log streaming data into topic-specific feeds, which other applications in turn can tap into.
Apache Airflow is a task scheduling platform that allows you to create, orchestrate and monitor data workflows.
MLFlow is an open-source tool that enables you to keep track of your ML experiments, among other things by logging the parameters, results, models and data of each trial.
In this hypothetical example, the following containers are required:
an Airflow container with your typical data science toolkit installed (in our case Pandas, NumPy and Keras), used to create and update the model as well as to schedule those tasks
a PostgreSQL container which serves as Airflow’s underlying metadata database
a Kafka container, which handles the streaming data (a short sketch of how data flows through it follows this list)
a Zookeeper container, which among other things is responsible for keeping track of Kafka topics, partitions and the like (more on this later!)
an MLFlow container, which keeps track of the results of the update runs and the characteristics of the resulting models
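To illustrate the Kafka piece: below is a minimal sketch of how new samples could be pushed onto a topic and read back by an update task, assuming the kafka-python client. The broker address, topic name and message format are made up for the example and are not prescribed by the setup above.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push one "new" sample (flattened pixels plus label) onto a topic.
# The broker address and topic name are hypothetical.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("mnist_stream", {"pixels": [0] * 784, "label": 7})
producer.flush()

# Consumer side: the model-update task taps into the same topic-specific feed.
consumer = KafkaConsumer(
    "mnist_stream",
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating once no new messages arrive
)
new_samples = [message.value for message in consumer]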
A typical folder structure can look like the one below.
project_folder
├── dags
│ └── src
│ ├── data
│ ├── models
│ └── preprocessing
├── data
│ ├── to_use_for_training
│ └── used_for_training
├── models
│ ├── current_model
│ └── archive
├── airflow_docker
├── mlflow_docker
└── docker_compose.yml
This example utilises the MNIST data set. Task 1 of the Airflow DAG fetches the data, splits it into train, test and streaming sets (the streaming set simulates the dynamic data that keeps coming in after the initial model is put into action), and puts them in the right format for training the CNN.
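A rough sketch of what this first task could look like is below; the file name, the size of the streaming slice and the use of a single .npz file are assumptions made for illustration, not the exact code from the referenced post.

import numpy as np
from tensorflow.keras.datasets import mnist

def fetch_and_split(data_dir="data/to_use_for_training", stream_size=10_000):
    """Fetch MNIST, carve out a 'streaming' slice and store everything on disk."""
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Reserve part of the training data to simulate data arriving later via Kafka.
    x_stream, y_stream = x_train[-stream_size:], y_train[-stream_size:]
    x_train, y_train = x_train[:-stream_size], y_train[:-stream_size]

    # Put the images in the right format for the CNN: add a channel axis and scale to [0, 1].
    def prepare(x):
        return (x.astype("float32") / 255.0)[..., np.newaxis]

    np.savez(
        f"{data_dir}/mnist_split.npz",
        x_train=prepare(x_train), y_train=y_train,
        x_test=prepare(x_test), y_test=y_test,
        x_stream=prepare(x_stream), y_stream=y_stream,
    )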
Construct and fit the model: Task 2, among other things, fetches the train and test sets produced in the previous step. It then constructs and fits the CNN and stores it in the current_model folder.
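A sketch of this second task is given below; the network architecture, the hyperparameters, the MLFlow tracking URI and the experiment name are illustrative assumptions.

import mlflow
import numpy as np
from tensorflow import keras

def construct_and_fit(data_dir="data/to_use_for_training",
                      model_dir="models/current_model",
                      epochs=3):
    """Build a small CNN, fit it on the prepared data and log the run to MLFlow."""
    data = np.load(f"{data_dir}/mnist_split.npz")

    model = keras.Sequential([
        keras.layers.Input(shape=(28, 28, 1)),
        keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
        keras.layers.MaxPooling2D(pool_size=2),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    mlflow.set_tracking_uri("http://mlflow:5000")  # hypothetical address of the MLFlow container
    mlflow.set_experiment("mnist_cnn")
    with mlflow.start_run():
        model.fit(data["x_train"], data["y_train"], epochs=epochs, batch_size=128)
        loss, accuracy = model.evaluate(data["x_test"], data["y_test"])
        mlflow.log_param("epochs", epochs)
        mlflow.log_metric("test_accuracy", accuracy)

    # Store the fitted model where the other tasks expect to find the current model.
    model.save(f"{model_dir}/model.keras")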
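Finally, the two tasks have to be wired together in an Airflow DAG so the scheduler can run them in order. Below is a minimal sketch; the DAG id, schedule, start date and import paths are invented for the example.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module layout under dags/src, mirroring the folder structure above.
from src.data.fetch import fetch_and_split
from src.models.train import construct_and_fit

with DAG(
    dag_id="mnist_training_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch_task = PythonOperator(task_id="fetch_and_split_data",
                                python_callable=fetch_and_split)
    train_task = PythonOperator(task_id="construct_and_fit_model",
                                python_callable=construct_and_fit)

    # The model is only (re)fitted once the data from the first task is in place.
    fetch_task >> train_task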
References
https://www.vantage-ai.com/en/blog/keeping-your-ml-model-in-shape-with-kafka-airflow-and-mlflow