This is a deep dive into the solution architecture for each of the four workflows: data prep, train and tune, deploy, and finally a pipeline that ties everything together in an automated fashion, up to storing the models in a registry.
Manual workflow
Before we automate parts of the lifecycle, we often conduct investigative data science work. This is often carried out in the exploratory data analysis and visualization phases, where we use SageMaker Data Wrangler to figure out what we want to do with our data (visualize, understand, clean, transform, or featurize) to prepare it for training. The following diagram illustrates the flow for the two datasets on SageMaker Data Wrangler.
One of the outputs you can choose in SageMaker Data Wrangler is a Python notebook that distills these activities into a set of functions. The .flow file output contains a set of transformations that provide SageMaker Processing with guidance on what transformations to apply to features. The following screenshot shows the export options from SageMaker Data Wrangler.
We can send this code to SageMaker Processing to create a preprocessing job that prepares our datasets for training in a scalable and reproducible way.
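As a rough sketch, the exported preprocessing code could be run as a SageMaker Processing job with the SageMaker Python SDK. The script name, S3 prefixes, and instance settings below are placeholders rather than values from the original flow:

import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # script exported from the Data Wrangler notebook (placeholder name)
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw/",  # raw claims and customers data (placeholder)
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/preprocessed/",  # placeholder
    )],
)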
Data prep
We can visualize the data to check for class imbalance. If there is any, one option is to add transformations that address it.
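For example, a quick check with pandas could look like the following; the file path and the fraud label column are assumptions for illustration:

import pandas as pd

claims = pd.read_csv("data/claims.csv")  # raw claims data (path is an assumption)
# A heavily skewed split here indicates class imbalance in the target label.
print(claims["fraud"].value_counts(normalize=True))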
We loaded the raw data from the S3 bucket and created 10 transforms for claims and 6 for customers.
Some of the transformations one-hot encode categorical columns and add an event_time column, which SageMaker Feature Store requires later. For example, the claims data ends up with columns such as the following (a pandas sketch of the equivalent encoding follows these listings):

 23  collision_type_rear              5000 non-null  float64
 24  collision_type_side              5000 non-null  float64
 25  collision_type_na                5000 non-null  float64
 26  authorities_contacted_police     5000 non-null  float64
 27  authorities_contacted_none       5000 non-null  float64
 28  authorities_contacted_fire       5000 non-null  float64
 29  authorities_contacted_ambulance  5000 non-null  float64
 30  event_time                       5000 non-null  float64

And the customers data with columns such as:

 12  policy_state_ca                  5000 non-null  float64
 13  policy_state_wa                  5000 non-null  float64
 14  policy_state_az                  5000 non-null  float64
 15  policy_state_or                  5000 non-null  float64
 16  policy_state_nv                  5000 non-null  float64
 17  policy_state_id                  5000 non-null  float64
 18  event_time                       5000 non-null  float64
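Here is the promised pandas sketch of an equivalent one-hot encoding. In the actual workflow, Data Wrangler records this kind of transform in the .flow file; the source column names are assumptions inferred from the listing above:

import pandas as pd

claims = pd.read_csv("data/claims.csv")  # raw claims data (path is an assumption)

# One-hot encode the categorical columns into float indicator columns,
# producing features such as collision_type_rear and authorities_contacted_police.
claims = pd.get_dummies(
    claims,
    columns=["collision_type", "authorities_contacted"],
    dtype=float,
)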
The following diagram shows the data prep architecture.
It is assumed that SageMaker Data Wrangler has already been run and that its output is available in the /data folder.
You can provide an S3 bucket that contains the results of the SageMaker Data Wrangler job that has output two files: claims.csv and customer.csv. If you want to move on and assume the data prep has been conducted, you can access the preprocessed data in the /data folder containing the files claims_preprocessed.csv (31 features) and customers_preprocessed.csv (19 features). The policy_id and event_time columns in customers_preprocessed.csv are necessary when creating a feature store, which requires a unique identifier for each record and a timestamp.
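As a quick sanity check, you could load the preprocessed files and confirm the identifier and timestamp columns are present (paths assume the /data folder mentioned above):

import pandas as pd

claims_df = pd.read_csv("data/claims_preprocessed.csv")
customers_df = pd.read_csv("data/customers_preprocessed.csv")

print(claims_df.shape[1], customers_df.shape[1])  # expect 31 and 19 feature columns
# Feature Store (next section) needs a unique record identifier and an event time.
assert {"policy_id", "event_time"}.issubset(customers_df.columns)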
Ingesting the preprocessed data into SageMaker Feature Store
After SageMaker Processing finishes the preprocessing, we have our two CSV data files for claims and customers ready. We contribute to the standardization of these features by ingesting them into SageMaker Feature Store, which makes them discoverable and reusable.
SageMaker Feature Store is a centralized store for features and their associated metadata, allowing features to be easily discovered and reused across your organization or team. You can create an offline feature store (stored in Amazon S3), an online store kept in a low-latency store, or both. Data is stored in your S3 bucket using a prefixing scheme based on event time. The offline feature store is append-only, which enables you to maintain a historical record of all feature values, and it stores data in Parquet format for optimized storage and query access. SageMaker Feature Store supports combining data to produce training, validation, and test datasets, and allows you to extract data as it existed at different points in time.
To store features, we first need to define their feature group. A feature group is the main Feature Store resource: a logical grouping of features, defined in the feature store, that describes records and contains the metadata for the data stored in Amazon SageMaker Feature Store. A feature group’s definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store.
The online store is optional, but very useful if you need supplemental features to be available at inference time. In this section, we create two feature groups, one each for the claims and customers datasets. After inserting the claims and customers data into their respective feature groups, we query the offline store with Amazon Athena to build the training dataset.
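The following is a minimal sketch of creating the two feature groups with the SageMaker Python SDK. The feature group names, S3 prefix, and role handling are placeholders for this walkthrough:

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker

claims_df = pd.read_csv("data/claims_preprocessed.csv")
customers_df = pd.read_csv("data/customers_preprocessed.csv")

# Placeholder names; use names that fit your own naming convention.
claims_fg = FeatureGroup(name="claims-feature-group", sagemaker_session=session)
customers_fg = FeatureGroup(name="customers-feature-group", sagemaker_session=session)

# Infer the feature definitions (names and types) from the DataFrames.
claims_fg.load_feature_definitions(data_frame=claims_df)
customers_fg.load_feature_definitions(data_frame=customers_df)

for fg in (claims_fg, customers_fg):
    fg.create(
        s3_uri="s3://my-bucket/feature-store",  # offline store location (placeholder)
        record_identifier_name="policy_id",
        event_time_feature_name="event_time",
        role_arn=role,
        enable_online_store=True,  # set False if you only need the offline store
    )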
To ingest data, we first designate a feature group for each type of feature, in this case, one per CSV file. You can ingest data into feature groups in SageMaker Feature Store in one of two ways: streaming or batch. For this post, we use the batch method.
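Continuing from the sketch above, batch ingestion and the subsequent Athena query could look like the following. The join condition and output location are illustrative assumptions:

# Batch-ingest each preprocessed DataFrame into its feature group.
claims_fg.ingest(data_frame=claims_df, max_workers=3, wait=True)
customers_fg.ingest(data_frame=customers_df, max_workers=3, wait=True)

# Once the records land in the offline store, query it with Athena
# to assemble a training dataset that joins claims and customers features.
claims_query = claims_fg.athena_query()
customers_query = customers_fg.athena_query()

query_string = (
    f'SELECT * FROM "{claims_query.table_name}" c '
    f'JOIN "{customers_query.table_name}" cu ON c.policy_id = cu.policy_id'
)

claims_query.run(
    query_string=query_string,
    output_location="s3://my-bucket/athena-results/",  # placeholder
)
claims_query.wait()
train_df = claims_query.as_dataframe()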
Training and tuning