Saturday, February 25, 2023

Why we should do one-hot encoding

Many machine learning algorithms require all input and output variables to be numeric, although some, like decision trees, can work directly on categorical data.


One-hot encoding helps here because it transforms categorical data into numerical data; in other words, it turns strings into numbers so that we can apply our machine learning algorithms without any problems.


animals = ['dog', 'cat', 'mouse'] 

one-hot encoding will create as many new columns as there are unique values in the “animals” column, and the new columns will be filled with 0s and 1s. So, if you have 100 kinds of animals in your “animals” column, one-hot encoding will create 100 new columns, all filled with 1s and 0s.



This process can lead to some trouble: in this case, the so-called “Dummy Variable Trap”.



The Dummy Variable Trap is a scenario in which the resulting variables become highly correlated with each other. In other words, one-hot encoding can lead to multicollinearity, so we always have to analyze the new features (the new columns) and decide whether some of them should be dropped.



There is a much simpler way to perform one-hot encoding, and it can be done directly in pandas. Consider a data frame, df, like the one created in the scikit-learn example further below. To encode it we can simply write the following line of code:



#one-hot encoding

df3 = pd.get_dummies(df, dtype=int)

#showing new head

df3.head()
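
To guard against the Dummy Variable Trap mentioned above, get_dummies can also drop one level per categorical column. A small sketch using the same df (drop_first is a standard pandas parameter):

#one-hot encoding without the redundant first level of each category
df4 = pd.get_dummies(df, dtype=int, drop_first=True)

#showing new head
df4.head()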



A more involved way of doing this is to use scikit-learn.



scikit-learn provides a OneHotEncoder for this:


import pandas as pd

from sklearn.preprocessing import OneHotEncoder


# initializing values

data = {'Name':['Tom', 'Jack', 'Nick', 'John',

                'Tom', 'Jack', 'Nick', 'John',

                'Tom', 'Jack', 'Nick', 'John',],

        'Time':[20, 21, 19, 18,

                20, 100, 19, 18,

                21, 22, 21, 20]

}

#creating dataframe

df = pd.DataFrame(data)

#showing head

df.head()


encoder = OneHotEncoder(handle_unknown='ignore')


encoder_df = pd.DataFrame(encoder.fit_transform(df[['Name']]).toarray())


#merge one-hot encoded columns back with original DataFrame

df2 = df.join(encoder_df)

#drop columns with strings

df2.drop('Name', axis=1, inplace=True)

#showing new head

df2.head()
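
If you want readable column names instead of 0, 1, 2, ..., newer versions of scikit-learn (1.0 and later) can return the generated names. A small, hedged addition to the snippet above:

#name the encoded columns after the original categories (requires scikit-learn >= 1.0)
encoder_df.columns = encoder.get_feature_names_out(['Name'])

df2 = df.drop('Name', axis=1).join(encoder_df)
df2.head()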



references:

https://towardsdatascience.com/how-and-why-performing-one-hot-encoding-in-your-data-science-project-a1500ec72d85#:~:text=In%20these%20cases%2C%20one%2Dhot,Learning%20algorithms%20without%20any%20problems.

Wednesday, February 22, 2023

Docker-compose down gives an "active endpoints" error.

ERROR: network docker_default has active endpoints


docker network inspect <network>

docker network disconnect -f <network> <endpoint>


references:

https://stackoverflow.com/questions/42842277/docker-compose-down-default-network-error

Sunday, February 19, 2023

MLOps - simulating a streaming new data set

The approach is below. It assumes the use of the MNIST dataset.

import os
import pickle

from keras.datasets import mnist

img_rows, img_cols = 28, 28

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Set train samples apart that will serve as streaming data later on
x_stream = x_train[:20000]
y_stream = y_train[:20000]
x_train = x_train[20000:]
y_train = y_train[20000:]

stream_sample = [x_stream, y_stream]

The stream_sample list is the sample that will be used for streaming.


pickle.dump(stream_sample, open(os.getcwd() + kwargs['path_stream_sample'], "wb"))


The stream sample is now written to the pickle file 


Now let's construct the model.



import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# assumed from the earlier setup: MNIST images are 28x28 grayscale, with 10 digit classes
input_shape = (img_rows, img_cols, 1)
num_classes = 10

# note: x_train/x_test are assumed to be reshaped to (-1, 28, 28, 1) and the labels one-hot
# encoded with keras.utils.to_categorical before fitting, since categorical_crossentropy expects that

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])


Fitting the model is done as below:



model.fit(x_train, y_train,

          batch_size=kwargs['batch_size'],

          epochs=kwargs['epochs'],

          verbose=1,

          validation_data=(x_test, y_test))



# now evaluate the model 


score = model.evaluate(x_test, y_test, verbose=0)


logging.info('Test - loss: %s', score[0])

logging.info('Test - accuracy: %s', score[1])


model.save(os.getcwd() + kwargs['initial_model_path'])



Now that we have both the stream set and the trained model, let's feed in additional data.


For this, we use Kafka.

 

Kafka is one of the go-to platforms when you have to deal with streaming data. Its framework basically consists of three players: 1) brokers, 2) producers, and 3) consumers.


A broker is an instance of a Kafka server (also known as a Kafka node) that hosts named streams of records, which are called topics. A broker takes in messages from producers and stores them to a topic. It in turn enables consumers to fetch messages from a topic.



In its simplest form, you have a single producer pushing messages to one end of a topic, whilst a single consumer fetches messages from the other end of the topic (for example, an app). In our case, where Kafka runs locally, a simple setup like this (shown below) does the trick.


With the help of the Kafka-Python API we can now simulate a data stream by constructing a Producer that publishes messages to the topic. 



from json import dumps
from time import sleep
import logging
import os
import pickle
import random

from kafka import KafkaProducer


def generate_stream(**kwargs):

    # set up the Producer (the Kafka broker is assumed to be reachable at kafka:9092)
    producer = KafkaProducer(bootstrap_servers=['kafka:9092'],
                             value_serializer=lambda x: dumps(x).encode('utf-8'))

    # load the stream sample file written earlier
    stream_sample = pickle.load(open(os.getcwd() + kwargs['path_stream_sample'], "rb"))

    # the stream sample holds 20000 observations - select 200 of them at random
    rand = random.sample(range(0, 20000), 200)

    x_new = stream_sample[0]
    y_new = stream_sample[1]

    logging.info('Partitions: %s', producer.partitions_for('TopicA'))

    for i in rand:
        json_comb = encode_to_json(x_new[i], y_new[i])     # pick an observation and encode it to JSON
        producer.send('TopicA', value=json_comb)           # send the encoded observation to the Kafka topic
        logging.info("Sent number: {}".format(y_new[i]))
        sleep(1)

    producer.close()
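
The encode_to_json and decode_json helpers used by the producer above and the consumer below are not shown here. A minimal sketch, assuming each message simply carries the image as a nested list plus its label (the exact format in the original project may differ):

import numpy as np

def encode_to_json(x, y):
    # hypothetical helper: make the numpy image and label JSON-serializable
    return {'x': x.tolist(), 'y': int(y)}

def decode_json(message):
    # hypothetical helper: restore the numpy image and label from the JSON payload
    return np.array(message['x'], dtype=np.uint8), message['y']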




Now the task will be to receive the stream data 


To fetch the data from the Kafka topic, we turn again to the Kafka-Python API to construct a Consumer. This Consumer is wrapped in a function that sequentially retrieves observations from the topic, which it in turn converts back from JSON to its original format and groups together in a NumPy array which is stored (in pickle format) in the to_use_for_training folder. 



from json import loads
import logging
import os
import pickle
import time

import numpy as np
from kafka import KafkaConsumer


def get_data_from_kafka(**kwargs):


    consumer = KafkaConsumer(

        kwargs['topic'],                                # specify topic to consume from

        bootstrap_servers=[kwargs['client']],

        consumer_timeout_ms=3000,                       # stop iterating if the consumer hasn't fetched anything for 3 secs (e.g. in case of an empty topic)

        auto_offset_reset='earliest',                   # automatically reset the offset to the earliest offset (should the current offset be deleted or anything)

        enable_auto_commit=True,                        # offsets are committed automatically by the consumer

        #group_id='my-group',

        value_deserializer=lambda x: loads(x.decode('utf-8')))



    logging.info('Consumer constructed')


    try:


        xs = []

        ys = []


        for message in consumer:                            # loop over messages


            logging.info( "Offset: ", message.offset)

            message = message.value

            x, y = decode_json(message)            # decode JSON


            xs.append(x)

            ys.append(y)


            logging.info('Image retrieved from topic')


        xs = np.array(xs).reshape(-1, 28, 28, 1)            # put Xs in the right shape for our CNN

        ys = np.array(ys).reshape(-1)                       # put ys in the right shape for our CNN


        new_samples = [xs, ys]


        pickle.dump(new_samples, open(os.getcwd()+kwargs['path_new_data']+str(time.strftime("%Y%m%d_%H%M"))+"_new_samples.p", "wb"))     # write data


        logging.info(str(xs.shape[0])+' new samples retrieved')


        consumer.close()


    except Exception as e:

        print(e)

        logging.info('Error: %s', e)



The update_model function in update_functions.py does most of the heavy lifting:


it takes in the data we fetched from the Kafka topic


it loads the current model and gauges how it scores on the test set*


it does a number of epochs of gradient descent with the new data and accordingly adjusts the weights of the model**


it then tests whether the adjusted model scores better on the test set than the current version; if it does, it replaces the current version and moves the latter to a model archive. If it doesn't, it sticks to the current version of the model


in addition, it moves the data it used for updating the model to the used_for_training folder and logs a set of metrics corresponding to each update run to MLFlow
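
The update_model code itself is not reproduced in this note. A rough, hedged skeleton of the logic described above (path and parameter names such as current_model_path, path_test_set and update_epochs are illustrative, not from the source):

import glob
import os
import pickle

from keras.models import load_model

def update_model(**kwargs):
    # load the current model and measure its baseline accuracy on the test set
    model = load_model(os.getcwd() + kwargs['current_model_path'])
    x_test, y_test = pickle.load(open(os.getcwd() + kwargs['path_test_set'], "rb"))
    baseline = model.evaluate(x_test, y_test, verbose=0)[1]

    # run a few epochs of gradient descent on the data fetched from the Kafka topic
    for f in glob.glob(os.getcwd() + kwargs['path_new_data'] + "*_new_samples.p"):
        x_new, y_new = pickle.load(open(f, "rb"))
        model.fit(x_new, y_new, epochs=kwargs['update_epochs'], verbose=0)

    # keep the updated model only if it beats the current version on the test set
    updated = model.evaluate(x_test, y_test, verbose=0)[1]
    if updated > baseline:
        model.save(os.getcwd() + kwargs['current_model_path'])
    # else: keep the current version; moving the used data and logging metrics to MLFlow are omitted here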


references:

https://www.vantage-ai.com/en/blog/keeping-your-ml-model-in-shape-with-kafka-airflow-and-mlflow

 

MLOps using AirFlow, MLFlow and Kafka 

Apache Kafka is a distributed messaging platform that allows you to sequentially log streaming data into topic-specific feeds, which other applications in turn can tap into.
Apache Airflow is a task scheduling platform that allows you to create, orchestrate and monitor data workflows
MLFlow is an open-source tool that enables you to keep track of your ML experiments, amongst others by logging the parameters, results, models and data of each trial.

In this hypothetical example, the following are required:

a container which has Airflow and your typical data science toolkit installed (in our case Pandas, NumPy and Keras) in order to create and update the model, and to schedule such tasks
a PostgreSQL container which serves as Airflow’s underlying metadata database
a Kafka container, which handles streaming data
a Zookeeper container, which amongst others is responsible for keeping track of Kafka topics, partitions, and the like (more on this later!)
a MLFlow container, which keeps track of the results of the update runs and the characteristics of the resulting models




A typical folder structure can be as below.
project_folder
├── dags
│ └── src
│ ├── data
│ ├── models
│ └── preprocessing
├── data
│ ├── to_use_for_training
│ ├── used_for_training
├── models
│ ├── current_model
│ └── archive
├── airflow_docker
├── mlflow_docker
└── docker_compose.yml


This example utilises the MNIST data set. One of the Airflow DAG tasks is to fetch the data, split it into test, train, and streaming sets, and put them in the right format for training the CNN. The streaming set simulates the dynamic data that comes in after the initial model is put into action.

 Construct & fit the model - Task 2 amongst others fetches the train and test set from the previous step above. 

It then constructs and fits the CNN and stores it in the current_model folder






References 
https://www.vantage-ai.com/en/blog/keeping-your-ml-model-in-shape-with-kafka-airflow-and-mlflow


What is MNIST dataset?

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.


references

https://paperswithcode.com/dataset/mnist

Saturday, February 18, 2023

What is GxP

 





GxP is an acronym that refers to the regulations and guidelines applicable to life sciences organizations that make food and medical products such as drugs, medical devices, and medical software applications. The overall intent of GxP requirements is to ensure that food and medical products are safe for consumers and to ensure the integrity of data used to make product-related safety decisions.


The term GxP encompasses a broad range of compliance-related activities such as Good Laboratory Practices (GLP), Good Clinical Practices (GCP), Good Manufacturing Practices (GMP), and others, each of which has product-specific requirements that life sciences organizations must implement based on the 1) type of products they make and 2) country in which their products are sold. When life sciences organizations use computerized systems to perform certain GxP activities, they must ensure that the computerized GxP system is developed, validated, and operated appropriately for the intended use of the system.


References:

https://aws.amazon.com/compliance/gxp-part-11-annex-11/

What are the benefits of using multiple AWS accounts

Below are the main benefits involved:


Group workloads based on business purpose and ownership

Apply distinct security controls by environment

Constrain access to sensitive data

Promote innovation and agility

Limit scope of impact from adverse events

Support multiple IT operating models

Manage costs

Distribute AWS Service Quotas and API request rate limits



Group workloads based on business purpose and ownership

You can group workloads with a common business purpose in distinct accounts. As a result, you can align the ownership and decision making with those accounts and avoid dependencies and conflicts with how workloads in other accounts are secured and managed.



Different business units or product teams might have different processes. Depending on your overall business model, you might choose to isolate distinct business units or subsidiaries in different accounts. Isolation of business units can help them operate with greater decentralized control, but still provides the ability for you to provide overarching guardrails. This approach might also ease divestment of those units over time.


Guardrails are governance rules for security, operations, and compliance that you can define and apply to align with your overall requirements.


Apply distinct security controls by environment

Workloads often have distinct security profiles that require separate control policies and mechanisms to support them. For example, it’s common to apply different security and operational policies for the non-production and production environments of a given workload. By using separate accounts for the non-production and production environments, by default, the resources and data that make up a workload environment are separated from other environments and workloads.


Constrain access to sensitive data

When you limit sensitive data stores to an account that is built to manage it, you can more easily constrain the number of people and processes that can access and manage the data store. This approach simplifies the process of achieving least privilege access. Limiting access at the coarse-grained level of an account helps contain exposure to highly sensitive data.


For example, designating a set of accounts to house publicly accessible Amazon S3 buckets enables you to implement policies for all your other accounts to expressly forbid making S3 buckets publicly available.



Promote innovation and agility

At AWS, we refer to your technologists as builders because they are all responsible for building value using AWS products and services. Your builders likely represent diverse roles, such as application developers, data engineers, data scientists, data analysts, security engineers, and infrastructure engineers.


In the early stages of a workload’s lifecycle, you can help promote innovation by providing your builders with separate accounts in support of experimentation, development, and early testing. These environments often provide greater freedom than more tightly controlled production-like test and production environments by enabling broader access to AWS services while using guardrails to help prohibit access to and use of sensitive and internal data.


Sandbox accounts are typically disconnected from your enterprise services and do not provide access to your internal data, but offer the greatest freedom for experimentation.


Development accounts typically provide limited access to your enterprise services and development data, but can more readily support day-to-day experimentation with your enterprise approved AWS services, formal development, and early testing work.


In both cases, we recommend security guardrails and cost budgets so that you limit risks and proactively manage costs.




Limit scope of impact from adverse events

An AWS account provides security, access, and billing boundaries for your AWS resources that can help you achieve resource independence and isolation. By design, all resources provisioned within an account are logically isolated from resources provisioned in other accounts, even within your own AWS environment.


This isolation boundary provides you with a way to limit the risks of an application-related issue, misconfiguration, or malicious actions. If an issue occurs within one account, impacts to workloads contained in other accounts can be either reduced or eliminated.


Manage Costs 

An account is the default means by which AWS costs are allocated. Because of this fact, using different accounts for different business units and groups of workloads can help you more easily report, control, forecast, and budget your cloud expenditures.



In addition to cost reporting at the account level, AWS has built-in support to consolidate and report costs across your entire set of accounts. When you require fine-grained cost allocation, you can apply cost allocation tags to individual resources in each of your accounts.


Distribute AWS Service Quotas and API request rate limits

AWS Service Quotas, also known as limits, are the maximum number of service resources or operations that apply to an account. For example, the number of Amazon Simple Storage Service (Amazon S3) buckets that you can create for each account.


You can use Service Quotas to help protect you from unexpected excessive provisioning of AWS resources and malicious actions that could dramatically impact your AWS costs.


AWS services can also throttle or limit the rate of requests made to their API operations.


Because Service Quotas and request rate limits are allocated for each account, use of separate accounts for workloads can help distribute the potential impact of the quotas and limits.



references:

https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/benefits-of-using-multiple-aws-accounts.html


What is Amazon Keyspaces

 


Amazon Keyspaces (for Apache Cassandra) is a scalable, highly available, and managed Apache Cassandra–compatible database service. With Amazon Keyspaces, you can run your Cassandra workloads on AWS using the same Cassandra application code and developer tools that you use today. You don’t have to provision, patch, or manage servers, and you don’t have to install, maintain, or operate software. Amazon Keyspaces is serverless, so you pay for only the resources you use and the service can automatically scale tables up and down in response to application traffic. You can build applications that serve thousands of requests per second with virtually unlimited throughput and storage. Data is encrypted by default and Amazon Keyspaces enables you to back up your table data continuously using point-in-time recovery. Amazon Keyspaces gives you the performance, elasticity, and enterprise features you need to operate business-critical Cassandra workloads at scale.



What is Apache Cassandra

 Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

Hybrid: Masterless architecture and low latency means Cassandra will withstand an entire data center outage with no data loss—across public or private clouds and on-premises.

Fault Tolerant: Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages. Failed nodes can be replaced with no downtime.


Focus on Quality : To ensure reliability and stability, Cassandra is tested on clusters as large as 1,000 nodes and with hundreds of real world use cases and schemas tested with replay, fuzz, property-based, fault-injection, and performance tests.


Performant : Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.


You’re In Control: Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.


Security and Observability: The audit logging feature for operators tracks the DML, DDL, and DCL activity with minimal impact to normal workload performance, while the fqltool allows the capture and replay of production workloads for analysis.


Distributed: Cassandra is suitable for applications that can’t afford to lose data, even when an entire data center goes down. There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.


Scalable: Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.


Elastic: Cassandra streams data between nodes during scaling operations such as adding a new node or datacenter during peak traffic times. Zero Copy Streaming makes this up to 5x faster without vnodes for a more elastic architecture particularly in cloud and Kubernetes environments.


Dissecting a reference MLOps architecture

The components of the reference architecture diagram are:



A secured development environment was implemented using an Amazon SageMaker Notebook Instance deployed to a custom virtual private cloud (VPC), and secured by implementing security groups and routing the notebook’s internet traffic via the custom VPC.


Also, the development environment has two Git repositories (AWS CodeCommit) attached: one for the Exploratory Data Analysis (EDA) code and the other for developing the custom Amazon SageMaker Docker container images.


An ML CI/CD pipeline made up of three sub-components:

Data validation step implemented using AWS Lambda and triggered using AWS CodeBuild.

Model training/retraining pipeline implemented using Amazon SageMaker Pipelines (pipeline-as-code) and executed using CodeBuild.

Model deployment pipeline that natively supports model rollbacks was implemented using AWS CloudFormation.

Finally, AWS CodePipeline is used to orchestrate the pipeline.


A CI/CD pipeline for developing and deploying the custom Amazon SageMaker Docker container image. This pipeline automatically triggers the ML pipeline when you successfully push a new version of the SageMaker container image, providing the following benefits:

Developers and data scientists can thoroughly test and get immediate feedback on the ML pipeline’s performance after publishing a new version of the Docker image. This helps ensure ML pipelines are adequately tested before promoting them to production.

Developers and data scientists don’t have to manually update the ML pipeline to use the latest version of the customized Amazon SageMaker image when working on the develop git branch. They can branch off the develop branch if they want to use an older version or start developing a new version, which they will merge back to develop branch once approved.


A model monitoring solution implemented using Amazon SageMaker Model Monitor to monitor the production models’ quality continuously.


This provides monitoring for the following: data drift, model drift, the bias in the models’ predictions, and drifts in feature attributes. You can start with the default model monitor, which requires no coding.


A model retraining implementation that is based on the metric-based model retraining strategy. There are three main retraining strategies available for your model retraining implementation:

Scheduled: This kicks off the model retraining process at a scheduled time and can be implemented using an Amazon EventBridge scheduled event.

Event-driven: This kicks off the model retraining process when a new model retraining dataset is made available and can be implemented using an EventBridge event.

Metric-based: This is implemented by creating a Data Drift CloudWatch Alarm (as seen in Figure 1 above) that triggers your model retraining process once it goes off, fully automating your correction action for a model drift.

A data platform implemented using Amazon S3 buckets with versioning enabled.


A model governance framework, which is not obvious from the architectural diagram and is made of the following components:

A model registry for versioning and tracking the trained model artefacts, implemented using Amazon SageMaker Model Registry.

Dataset versioning implemented using Amazon S3 bucket versioning.

ML workflow steps auditability, visibility, and reproducibility implemented using Amazon SageMaker Lineage Tracking.

Secured trained model artefacts implemented using AWS Identity and Access Management (IAM) roles to ensure only authorized individuals have access.


references:

https://aws.amazon.com/blogs/apn/taming-machine-learning-on-aws-with-mlops-a-reference-architecture/


Taming Machine Learning on AWS with MLOps: A Reference Architecture

There is no one size fits all when it comes to implementing an MLOps solution on Amazon Web Services (AWS). Like any other technical solution, MLOps should be implemented to meet the project requirements.


Data science and analytics teams are often squeezed between increasing business expectations and sandbox environments evolving into complex solutions. This makes it challenging to transform data into solid answers for stakeholders consistently.


Key Components of an MLOps Solution


1. A version control system to store, track, and version changes to your ML code.

2. A version control system to track and version changes to your training datasets.

3. A network layer that implements the necessary network resources to ensure the MLOps solution is secured.

4. An ML-based workload to execute machine learning tasks. AWS offers a three-layered ML stack to choose from based on your organization’s skill level.

We describe the three layers briefly here 

AI services: They are a fully managed set of services that enable you to quickly add ML capabilities to your workloads using API calls. Examples of these AWS services are Amazon Rekognition and Amazon Comprehend.

ML services: AWS provides managed services and resources (Amazon SageMaker suite, for example) to enable you to label your data and build, train, deploy, and operate your ML models.

ML frameworks and infrastructure: This is a level intended for expert ML practitioners using open-source frameworks like TensorFlow, PyTorch, and Apache MXNet; Deep Learning AMI for Amazon EC2 P3 and P3dn instances; and Deep Learning Containers to implement your own tools and workflows to build, train, and deploy the ML models.


It’s important to note the ML-based workloads can also be implemented by combining services and infrastructure from the different levels of the AWS ML stack.

5. Use infrastructure as code (IaC) to automate the provisioning and configuration of your cloud-based ML workloads and other IT infrastructure resources.


6. An ML (training/retraining) pipeline to automate the steps required to train/retrain and deploy your ML models.

7. An orchestration tool to orchestrate and execute your automated ML workflow steps.

8. A model monitoring solution to monitor production models’ performance to protect against both model and data drift. You can also use the performance metrics as feedback to help improve the models’ future development and training.

9. A model governance framework to make it easier to track, compare, and reproduce your ML experiments and secure your ML models.

10. A data platform like Amazon Simple Storage Service (Amazon S3) to store your datasets.


Implementing an MLOps Solution


 Let’s look at the three main options Amazon SageMaker provides when it comes to choosing your training algorithm:

1. Use a built-in Amazon SageMaker algorithm or framework. With this option, a training dataset is the only input developers and data scientists have to provide when training their models. On the other hand, the trained model artefacts are the only input they need to deploy the models. This is the least flexible of the options available and a good fit for scenarios where off-the-shelf solutions meet your need.


2. Use pre-built Amazon SageMaker container images. For this option, you need to provide two inputs to train your models, and they are your training scripts and datasets. Likewise, the inputs for deploying the trained models are your serving scripts and the trained model artefacts.


3. Extend a pre-built Amazon SageMaker container image, or adapt an existing container image. This is for more advanced use cases. You are responsible for developing and maintaining those container images. Therefore, you may want to consider implementing a CI/CD pipeline to automate the building, testing, and publishing of the customized Amazon SageMaker container images and then integrate the pipeline with your MLOps solution.

references:

https://aws.amazon.com/blogs/apn/taming-machine-learning-on-aws-with-mlops-a-reference-architecture/


What is DPPL metric

The difference in positive proportions in predicted labels (DPPL) metric determines whether the model predicts outcomes differently for each facet. It is defined as the difference between the proportion of positive predictions (y’ = 1) for facet a and the proportion of positive predictions (y’ = 1) for facet d. For example, if the model predictions grant loans to 60% of a middle-aged group (facet a) and 50% of other age groups (facet d), it might be biased against facet d. In this example, you need to determine whether the 10% difference is material to a case for bias. A comparison of DPL with DPPL assesses whether bias initially present in the dataset increases or decreases in the model predictions after training.


The formula for the difference in proportions of predicted labels:


        DPPL = q'a - q'd


Where:


q'a = n'a(1)/na is the predicted proportion of facet a who get a positive outcome of value 1. In our example, this is the proportion of the middle-aged facet predicted to be granted a loan. Here n'a(1) represents the number of members of facet a who get a positive predicted outcome of value 1, and na is the number of members of facet a.


q'd = n'd(1)/nd is the predicted proportion of facet d who get a positive outcome of value 1. In our example, this is the facet of older and younger people predicted to be granted a loan. Here n'd(1) represents the number of members of facet d who get a positive predicted outcome, and nd is the number of members of facet d.


If DPPL is close enough to 0, it means that post-training demographic parity has been achieved.
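
As a quick illustration (not from the AWS documentation), a small NumPy sketch of the DPPL computation for a binary facet column:

import numpy as np

def dppl(y_pred, facet_a):
    # y_pred: array of 0/1 model predictions
    # facet_a: boolean array, True for members of facet a, False for facet d
    y_pred = np.asarray(y_pred)
    facet_a = np.asarray(facet_a, dtype=bool)
    q_a = y_pred[facet_a].mean()      # predicted positive proportion for facet a
    q_d = y_pred[~facet_a].mean()     # predicted positive proportion for facet d
    return q_a - q_d

# facet a has 3/5 positive predictions, facet d has 2/5, so DPPL = 0.2
print(dppl([1, 1, 1, 0, 0, 1, 0, 1, 0, 0], [True] * 5 + [False] * 5))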


references:

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-dppl.html

Thursday, February 16, 2023

Architect and build the full machine learning lifecycle with AWS: end-to-end with SageMaker - Part 4 Solution Architecture

This is a deep dive into the solution architecture for each of the four workflows for data prep, train and tune, deploy, and finally a pipeline that ties everything together in an automated fashion up to storing the models in a registry.


Manual workflow

Before we automate parts of the lifecycle, we often conduct investigative data science work. This is often carried out in the exploratory data analysis and visualization phases, where we use SageMaker Data Wrangler to figure out what we want to do with our data (visualize, understand, clean, transform, or featurize) to prepare it for training. The following diagram illustrates the flow for the two datasets on SageMaker Data Wrangler.


One of the outputs you can choose in SageMaker Data Wrangler is a Python notebook that distills these activities into a set of functions. The .flow file output contains a set of transformations that provide SageMaker Processing with guidance on what transformations to apply to features. The following screenshot shows the export options from SageMaker Data Wrangler.


We can send this code to SageMaker Processing to create a preprocessing job that prepares our datasets for training in a scalable and reproducible way.


Data prep


The data can be visualised to see if there is any class imbalance. If there is, one option is to apply transformations.


We loaded the raw data from the S3 bucket and created 10 transforms for claims and 6 for customers.

Some of the resulting transformed features are:


 23  collision_type_rear              5000 non-null   float64

 24  collision_type_side              5000 non-null   float64

 25  collision_type_na                5000 non-null   float64

 26  authorities_contacted_police     5000 non-null   float64

 27  authorities_contacted_none       5000 non-null   float64

 28  authorities_contacted_fire       5000 non-null   float64

 29  authorities_contacted_ambulance  5000 non-null   float64

 30  event_time                       5000 non-null   float64



 12  policy_state_ca            5000 non-null   float64

 13  policy_state_wa            5000 non-null   float64

 14  policy_state_az            5000 non-null   float64

 15  policy_state_or            5000 non-null   float64

 16  policy_state_nv            5000 non-null   float64

 17  policy_state_id            5000 non-null   float64

 18  event_time                 5000 non-null   float64


The following diagram shows the data prep architecture.


It is assumed that SageMaker Data Wrangler has been run and its output is available in the /data folder.


You can provide an S3 bucket that contains the results of the SageMaker Data Wrangler job that has output two files: claims.csv and customer.csv. If you want to move on and assume the data prep has been conducted, you can access the preprocessed data in the /data folder containing the files claims_preprocessed.csv (31 features) and customers_preprocessed.csv (19 features). The policy_id and event_time columns in customers_preprocessed.csv are necessary when creating a feature store, which requires a unique identifier for each record and a timestamp.


Ingesting the preprocessed data into SageMaker Feature Store

After SageMaker Processing finishes the preprocessing, we have our two CSV data files for claims and customers ready. We contribute to the standardization of these features, making them discoverable and reusable, by ingesting them into SageMaker Feature Store.


SageMaker Feature Store is a centralized store for features and their associated metadata, allowing features to be easily discovered and reused across your organization or team. You have the option of creating an offline feature store (stored in Amazon S3) or an online component stored in a low-latency store, or both. Data is stored in your S3 bucket using a prefixing scheme based on event time. The offline feature store is append-only, which enables you to maintain a historical record of all feature values. Data is stored in the offline store in Parquet format for optimized storage and query access. SageMaker Feature Store supports combining data to produce, train, validate, and test datasets, and allows you to extract data at different points in time.


To store features, we first need to define their feature group. A feature group is the main feature store resource that contains the metadata for all the data stored in Amazon SageMaker Feature Store. A feature group is a logical grouping of features, defined in the feature store, to describe records. A feature group’s definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store.


The online database is optional, but very useful if you need supplemental features to be available at inference. In this section, we create two feature groups for our claims and customers datasets. After inserting the claims and customers data into their respective feature groups, you need to query the offline store with Amazon Athena to build the training dataset.


To ingest data, we first designate a feature group for each type of feature, in this case, one per CSV file. You can ingest data into feature groups in SageMaker Feature Store in one of two ways: streaming or batch. For this post, we use the batch method.
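
The ingestion code is not included at this point in the post. A minimal sketch of batch ingestion with the SageMaker Python SDK, assuming claims_preprocessed is the DataFrame loaded from claims_preprocessed.csv and that an execution role is available (the feature group name and S3 prefix are illustrative):

import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

claims_fg = FeatureGroup(name="claims-feature-group", sagemaker_session=session)

# infer the feature definitions from the DataFrame's columns and dtypes
claims_fg.load_feature_definitions(data_frame=claims_preprocessed)

claims_fg.create(
    s3_uri="s3://" + session.default_bucket() + "/feature-store",   # offline store location
    record_identifier_name="policy_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)

# batch-ingest the preprocessed claims into the feature group
claims_fg.ingest(data_frame=claims_preprocessed, max_workers=3, wait=True)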


Training and tuning

The following diagram illustrates the workflow for the bias check, training, tuning, lineage, and model registry stages.





We write the train and test split datasets to our designated S3 bucket, and create an XGBoost estimator to train our fraud detection model with a fraud or no fraud logistic target. Prior to starting the SageMaker training job using the built-in XGBoost algorithm, we set the XGBoost hyperparameters.
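
The estimator setup is not reproduced in this summary. A hedged sketch using the built-in XGBoost container, where the instance type, hyperparameter values and the train_s3_uri / validation_s3_uri inputs are placeholders:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# built-in XGBoost container image for the current region
xgb_image = image_uris.retrieve("xgboost", region=session.boto_region_name, version="1.2-1")

xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://" + session.default_bucket() + "/fraud-detect/output",
    sagemaker_session=session,
)

# logistic objective for the fraud / no-fraud target
xgb.set_hyperparameters(objective="binary:logistic", max_depth=5, eta=0.2, num_round=100)

xgb.fit({
    "train": TrainingInput(train_s3_uri, content_type="text/csv"),
    "validation": TrainingInput(validation_s3_uri, content_type="text/csv"),
})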

We take the opportunity to track all the artifacts or entities involved with the training job so we can track the lineage of the model. This is done by importing several sagemaker.lineage components. See the following code:

from sagemaker.lineage import context, artifact, association, action

Lineage Tracking provides us with visibility into the code, training data, and model artifacts that we then associate with association_type='Produced' and association_type='ContributesTo', which links what contributed to and what produced a given artifact in the process.

We also assess degrees of pre-training and post-training bias using SageMaker Clarify. Pre-training metrics show a variety of possible preexisting bias in our dataset. Post-training metrics show bias in the predictions resulting from the model. We use analysis_config.json to specify which groups we want to check bias across and which metrics we want to show.

We assess two metrics: the difference in positive proportions in predicted labels (DPPL) and if a class imbalance exists in the data. For our use case, we measure this on the gender feature, which indicates if we have more male customers than female customers. Results indicate a slight bias in our model measured by the DPPL metric.

Deploying and serving the model


Creating an automated workflow using SageMaker Pipelines

After we complete a few iterations of our manual exploratory data science and are happy with the outcomes of our cleansing, transformations, and featurizations, we may want to create an automated workflow using SageMaker Pipelines, so we can scale and don’t have to go through this manual process every time.

The following diagram shows our end-to-end automated MLOps pipeline, which includes eight steps:

Preprocess the claims data with SageMaker Data Wrangler.
Preprocess the customers data with SageMaker Data Wrangler.
Create a dataset and train/test split.
Train the XGBoost algorithm.
Create the model.
Run bias metrics with SageMaker Clarify.
Register the model.
Deploy the model.








references:
https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/


Architect and build the full machine learning lifecycle with AWS: end-to-end with SageMaker - Part 1

Amazon SageMaker provides a rich set of capabilities that enable data scientists, machine learning engineers, and developers to prepare, build, train, and deploy ML models rapidly and with ease.

The example is Fraud Detection 

To get started, data scientists use an experimental process to explore various data preparation tasks, in some cases engineering features, and eventually settle on a standard way of doing so. Then they embark on a more repeatable and scalable process of automating stages of this process, until the model provides the necessary levels of performance (such as accuracy, F1 score, and precision). Then they package this process in a repeatable, automated, and scalable ML pipeline.

Below is an overall diagram for this 




The general phases of the ML lifecycle are data preparation, train and tune, and deploy and monitor, with inference being when we actually serve the model up with new data for inference.

Below is a very detailed view of the MLOps lifecycle.


The red boxes represent comparatively newer concepts and tasks that are now deemed important to include in, and run in, a scalable, operational, and production-oriented (vs. research-oriented) environment.


references:

https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/



What is Parquet?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.


Characteristics of Parquet

Free and open source file format.

Language agnostic.

Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.

Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.

Highly efficient data compression and decompression.

Supports complex data types and advanced nested data structures.


Benefits of Parquet

Good for storing big data of any kind (structured data tables, images, videos, documents).

Saves on cloud storage space by using highly efficient column-wise compression, and flexible encoding schemes for columns with different data types.

Increased data throughput and performance using techniques like data skipping, whereby queries that fetch specific column values need not read the entire row of data.
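
As a quick, hedged illustration of working with Parquet from Python (using pandas with the pyarrow engine installed):

import pandas as pd

df = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "state": ["ca", "wa", "az"],
    "premium": [120.5, 98.0, 110.25],
})

# write a columnar, compressed Parquet file
df.to_parquet("policies.parquet", engine="pyarrow", compression="snappy")

# read back only the columns we need - column pruning is one of Parquet's main advantages
subset = pd.read_parquet("policies.parquet", columns=["policy_id", "premium"])
print(subset)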

references:

https://www.databricks.com/glossary/what-is-parquet#:~:text=What%20is%20Parquet%3F,handle%20complex%20data%20in%20bulk.

Architect and build the full machine learning lifecycle with AWS: end-to-end with SageMaker - Part 3 Use case analysis

Fraud Detection is the use case

Aim is to use Amazon SageMaker to predict the probability that an incoming auto claim may be fraudulent.

Wrangling and preprocessing the dataset

We use SageMaker Data Wrangler to ingest, analyze, prepare, and transform each dataset. You can do this in the GUI-based feature available in SageMaker Studio.

Second, we use SageMaker Data Wrangler to export the transformed data as two CSV files that can be picked up in an Amazon Simple Storage Service (Amazon S3) bucket by SageMaker Processing, in order to conduct scalable data preparation and preprocessing.

Storing the features

After SageMaker Processing applies the transformations defined in SageMaker Data Wrangler, we store the normalized features in an offline feature store so the features can be shared and reused consistently across an organization among collaborating data scientists. This standardization is often key to creating a normalized, reusable set of features that can be created, shared, and managed as input into training ML models. You can use this feature consistency across the ML maturity spectrum, whether you are a startup or an advanced organization with a ML Center of Excellence.

Assessing and Mitigating bias, training and tuning

The issues relating to bias detection and fairness in AI have taken a prominent role in ML. Data bias is often inadvertently injected during the data labeling and collection process, and may often be overlooked in the significance of its impact on training a model. SageMaker Clarify is a fully-managed toolkit to identify potential bias within a training dataset or model, explain individual inference results, aggregate these explanations for an entire dataset, integrate with built-in monitoring capabilities to assess production performance, and provide these capabilities across modeling frameworks.


You can use SageMaker Clarify to assess various types of bias. For example, assessing pre-training bias (data) can focus on determining if class imbalance or a variety of other factors are beyond a threshold and therefore may bias the model we seek to train. SageMaker Clarify helps improve your ML models by detecting potential biases prior to training (data bias) and after training, assess post-training bias (model bias) and can also help explain the predictions that models make during inference.


After we implement our bias mitigation strategy, the next step is often to choose a training algorithm and experiment with various ways of tuning it so as to obtain acceptable ML performance metrics such as F1, AUC, or accuracy. For this post, we use the XGBoost algorithm for training our model using the data in the feature store, and evaluate F1 metrics.


We can also check the resulting model’s post-training bias and, when satisfied with both the performance and transparency (bias) metrics, tune the model to get the most out of its performance through hyperparameter optimization.


We can use Lineage Tracking to track various aspects of the evolution of our experiments, including answering questions related to the following:


Data – Which dataset did we use?

Prep – How did we clean, transform and featurize the data?

Training – Which model and training job configuration did we use?

Tuning – Which hyperparameters did we use?



During our experimentation, we may have trained many models, from different datasets, prepared with different transformations, each with their own performance metrics and bias metrics. If we like a result, we can look at the artifact lineage associated with it so we can reproduce those results or improve them.


Capturing artifact lineage in experiments

Not only do we want to store our trained models themselves, but also the specific datasets, feature  transformations, preprocessing mechanisms, algorithms, and hyperparameter configurations that were used to produce and optimize the models for governance and reproducibility purposes. We can store that metadata, which tracks the experiment and lineage of the model, with a reference to the data and the model in the SageMaker Model Registry.

Deploying the model to a SageMaker hosted endpoint

After we decide which models should be approved for deployment, we can deploy them to a SageMaker hosted endpoint, where they are ready for serving predictions.

Running predictions on the model using the online feature store

We create models so we can run predictions on them. We can invoke an endpoint directly, since Amazon SageMaker endpoints have load balancers behind them to balance incoming load.


Another common invocation pattern for running inference is the ML Gateway Pattern, where we expose the inference as a service endpoint and invoke it using an Amazon API Gateway. This pattern also allows the benefits of a service oriented architecture exposing a set of ML services as RESTful endpoints. Incoming service requests benefit from being load balanced, cached, and monitored using Amazon API Gateway. Amazon API Gateway then calls an AWS Lambda function which can call the SageMaker endpoint.


Explaining the model’s predictions

We can then inspect why this decision was made and present an explainable narrative to inquisitive parties. For this, we use the explainability features of SageMaker Clarify.

references:

https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/

Architect and build the full machine learning lifecycle with AWS: end-to-end with SageMaker - Part 2 Tool Mapping

 Data wrangling – We use SageMaker Data Wrangler for cleaning, normalizing, transforming and encoding data, as well as joining datasets. The output of SageMaker Data Wrangler is data transformation code that works with SageMaker Processing, SageMaker Pipelines, SageMaker Feature Store, or with Pandas in a plain Python script. Feature engineering can now be done faster and easier, with SageMaker Data Wrangler where we have a GUI-based environment and can generate code that can be used for the subsequent phases of the ML lifecycle.


Detecting bias – With SageMaker Clarify, in the data prep or training phases, we can detect pre-training (data bias) and post-training bias (model bias). At the inference phase, SageMaker Clarify gives us the ability to provide interpretability and explainability of the predictions by providing insight into which factors were most influential in coming up with the prediction.


Feature Store (offline) – After we complete our feature engineering, encoding, and transformations, we can standardize features offline in SageMaker Feature Store, to be used as input features for training models.

SageMaker Feature Store allows you to create offline feature groups that keep all the historical data and can be used as inputs to training.

Note that Features can be ingested from a feature processing pipeline into the online feature store and will then get replicated to the offline store. The offline store could be used to run batch inference as well. Thus, the online feature store can also be used as input for training.


Artifact lineage: We can use SageMaker ML Lineage Tracking to associate all the artifacts (such as data, models, and parameters) with a trained model to produce metadata that is stored in a model registry. In addition, tracking human in the loop actions such as model approvals and deployments further facilitates the process of ML governance.


Model Registry: The SageMaker Model Registry stores the metadata around all the artifacts that you include in the process of creating your models, along with the trained models themselves in a model registry. Later, we can use human approval to note that the model is ready for production. This feeds into the next phase of deploy and monitor.


Inference and Feature Store (online): SageMaker Feature Store provides for low latency (up to single digit milliseconds) and high throughput reads for serving our model with new incoming data.


Pipelines: After we experiment and decide on the various options in the lifecycle (such as which transforms to apply to our features, determine imbalance or bias in the data, which algorithms to choose to train with, or which hyperparameters are giving us the best performance metrics), we can automate the various tasks across the lifecycle using SageMaker Pipelines.


This lets us streamline the otherwise cumbersome manual processes into an automated ML pipeline. To build this pipeline, we will prepare some data (customers and claims) by ingesting the data into SageMaker Data Wrangler and apply various transformations in SageMaker Data Wrangler within SageMaker Studio.  SageMaker Data Wrangler creates .flow files. We will use these transformation definitions as a starting point for our automated pipeline and go through the ML Lifecycle all the way to deploying the model to a SageMaker Hosted Endpoint. Note that some use cases may require one, larger, end-to-end pipeline, that does everything. Other use cases may require multiple pipelines, such as the following:


A pipeline for all data prep steps.

A pipeline for training, tuning, lineage, and depositing into the model registry (which we show in the code associated with this post).

Possibly another pipeline for specific inference scenarios (such as real time vs. batch).

A pipeline for triggering retraining by using SageMaker Model Monitor to detect model drift or data drift and trigger retraining using, for example, an AWS Lambda

references:

https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/


Wednesday, February 15, 2023

Text analytics on AWS: implementing a data lake architecture with OpenSearch ( Part 1)

Text data is a common type of unstructured data found in analytics. It is often stored without a predefined format and can be hard to obtain and process.

Web pages, for example, contain text data that data analysts collect through web scraping and pre-process using lowercasing, stemming, and lemmatisation. After pre-processing, the cleaned text is analyzed by data scientists and analysts to extract relevant insights.

We can handle text data using a data lake architecture on Amazon Web Services 

Below is a reference architecture from the AWS blog for creating an end-to-end text analytics solution, starting from data collection and ingestion up to data consumption in OpenSearch.


The detailed flow is below:

1. Collect data from various sources, such as SaaS applications, edge devices, logs, streaming media, and social networks.
2. Use tools like AWS Database Migration Service (AWS DMS), AWS DataSync, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), AWS IoT Core, and Amazon AppFlow to ingest the data into the AWS data lake, depending on the data source type.
3. Store the ingested data in the raw zone of the Amazon Simple Storage Service (Amazon S3) data lake—a temporary area where data is kept in its original form.
4. Validate, clean, normalize, transform, and enrich the data through a series of pre-processing steps using AWS Glue or Amazon EMR.
5. Place the data that is ready to be indexed in the indexing zone.
6. Use AWS Lambda to index the documents into OpenSearch and store them back in the data lake with a unique identifier (see the sketch after this list).
7. Use the clean zone as the source of truth for teams to consume the data and calculate additional metrics.
8. Develop, train, and generate new metrics using machine learning (ML) models with Amazon SageMaker or artificial intelligence (AI) services like Amazon Comprehend.
9. Store the new metrics in the enriching zone along with the identifier of the OpenSearch document.
10. Use the identifier column from the initial indexing phase to identify the correct documents and update them in OpenSearch with the newly calculated metrics using AWS Lambda.
11. Use OpenSearch to search through the documents and visualize them with metrics using OpenSearch Dashboards.
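
Step 6 above (indexing with AWS Lambda) could look roughly like the following hedged sketch using the opensearch-py client; the domain endpoint, credentials, index name, and event shape are assumptions:

import uuid

from opensearchpy import OpenSearch

# assumed OpenSearch domain endpoint and credentials
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

def index_document(doc):
    # index one cleaned text document and return the id that is stored back in the data lake
    doc_id = str(uuid.uuid4())
    client.index(index="documents", id=doc_id, body=doc)
    return doc_id

def lambda_handler(event, context):
    # 'event' is assumed to carry the pre-processed documents from the indexing zone
    return [index_document(d) for d in event["documents"]]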

references:
https://aws.amazon.com/blogs/architecture/text-analytics-on-aws-implementing-a-data-lake-architecture-with-opensearch/




Text analytics on AWS: implementing a data lake architecture with OpenSearch ( Part 2)

This architecture allows data teams to work independently on text documents at different stages of their lifecycles. The data engineering team manages the raw and indexing zones and also handles data ingestion and preprocessing for indexing in OpenSearch.

The cleaned data is stored in the clean zone, where data analysts and data scientists generate insights and calculate new metrics. These metrics are stored in the enrich zone and indexed as new fields in the OpenSearch documents by the data engineering team 

Consider a company that periodically retrieves blog site comments and performs sentiment analysis using Amazon Comprehend. In this case:

The comments are ingested into the raw zone of the data lake.

The data engineering team processes the comments and stores them in the indexing zone.

A Lambda function indexes the comments into OpenSearch, enriches the comments with the OpenSearch document ID, and saves it in the clean zone.

The data science team consumes the comments and performs sentiment analysis using Amazon Comprehend.

The sentiment analysis metrics are stored in the metrics zone of the data lake. A second Lambda function updates the comments in OpenSearch with the new metrics.
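A minimal sketch of the sentiment analysis step with boto3 and Amazon Comprehend is shown below; the comment text and region are illustrative placeholders.

import boto3

comprehend = boto3.client("comprehend", region_name="eu-west-1")

# Illustrative comment; in the architecture above it would come from the clean zone
comment = "This blog post was really helpful, thanks!"
result = comprehend.detect_sentiment(Text=comment, LanguageCode="en")

print(result["Sentiment"])        # e.g. POSITIVE
print(result["SentimentScore"])   # per-class confidence scores

The returned sentiment and scores are the metrics that end up in the metrics zone and, later, in OpenSearch.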

Schema evolution

As your data progresses through data lake stages, the schema changes and gets enriched accordingly. Continuing with our previous example, Figure 3 explains how the schema evolves.


Figure 3: Schema evolution through the data lake stages


In the raw zone, there is a raw text field received directly from the ingestion phase. It’s best practice to keep a raw version of the data as a backup, or in case the processing steps need to be repeated later.

In the indexing zone, the clean text field replaces the raw text field after being processed.

In the clean zone, we add a new ID field that is generated during indexing and identifies the OpenSearch document of the text field.

In the enrich zone, the ID field is required. Other fields with metric names are optional and represent new metrics calculated by other teams that will be added to OpenSearch.
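To make this concrete, the illustrative records below show how a single comment might look in each zone; the field names follow the description above and the values are made up.

# Illustrative records for one comment as it moves through the zones
raw_zone_record      = {"raw_text": "GREAT post!!  loved it :)"}
indexing_zone_record = {"clean_text": "great post loved it"}
clean_zone_record    = {"id": "a1b2c3", "clean_text": "great post loved it"}
enrich_zone_record   = {"id": "a1b2c3", "sentiment": "POSITIVE"}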


Consumption layer with OpenSearch

In OpenSearch, data is organized into indices, which can be thought of as tables in a relational database. Each index consists of documents—similar to table rows—and multiple fields, similar to table columns. You can add documents to an index by indexing and updating them using various client APIs for popular programming languages.
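For example, once a metric such as sentiment has been calculated, the corresponding document can be partially updated by ID and then searched on the new field. The sketch below assumes the opensearch-py client; the endpoint, index, document ID, and field names are placeholders.

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.eu-west-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

# partial update: add the newly calculated metric to an existing document
client.update(index="comments", id="a1b2c3", body={"doc": {"sentiment": "POSITIVE"}})

# the new field is then searchable
hits = client.search(index="comments", body={"query": {"match": {"sentiment": "POSITIVE"}}})
print(hits["hits"]["total"])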


Data flow automation

As architectures evolve towards automation, the data flow between data lake stages becomes event-driven. Following our previous example, we can automate the processing steps applied to the data as it moves from the raw zone to the indexing zone.


With Amazon EventBridge and AWS Step Functions, we can automatically trigger our pre-processing AWS Glue jobs so our data gets pre-processed without manual intervention.

The same approach can be applied to the other data lake stages to achieve a fully automated architecture. Explore this implementation for an automated language use case.

references:

https://aws.amazon.com/blogs/architecture/text-analytics-on-aws-implementing-a-data-lake-architecture-with-opensearch/

What is AWS EventBridge

EventBridge receives an event, an indicator of a change in environment, and applies a rule to route the event to a target. Rules match events to targets based on either the structure of the event, called an event pattern, or on a schedule. For example, when an Amazon EC2 instance changes from pending to running, you can have a rule that sends the event to a Lambda function.
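As a small sketch of that EC2 example, the boto3 calls below create a rule on the default event bus and attach a Lambda target; the rule name, region, and Lambda ARN are placeholders.

import json
import boto3

events = boto3.client("events", region_name="eu-west-1")

# rule matching EC2 instances entering the "running" state
events.put_rule(
    Name="ec2-running-rule",
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["running"]},
    }),
    State="ENABLED",
)

# route matching events to a Lambda function
# (the function also needs a resource-based permission allowing events.amazonaws.com to invoke it)
events.put_targets(
    Rule="ec2-running-rule",
    Targets=[{"Id": "lambda-target",
              "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:my-handler"}],
)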


All events that come to EventBridge are associated with an event bus. Rules are tied to a single event bus, so they can only be applied to events on that event bus. Your account has a default event bus which receives events from AWS services, and you can create custom event buses to send or receive events from a different account or Region.


When an AWS Partner wants to send events to an AWS customer account, they set up a partner event source. Then the customer must associate an event bus with the partner event source.


EventBridge API destinations are HTTP endpoints that you can set as the target of a rule, in the same way that you would send event data to an AWS service or resource. By using API destinations, you can use REST API calls to route events between AWS services, integrated SaaS applications, and your applications outside of AWS. When you create an API destination, you specify a connection to use for it. Each connection includes the details about the authorization type and parameters to use to authorize with the API destination endpoint.


What is AWS Step Functions

 AWS Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.
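Tying this back to the automation described earlier, a state machine that runs a pre-processing AWS Glue job could be defined roughly as in the sketch below; the state machine name, job name, and role ARN are placeholders.

import json
import boto3

# Amazon States Language definition with a single task state that runs a Glue job
# synchronously via the optimized Glue integration.
definition = {
    "Comment": "Pre-process raw text before indexing",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "preprocess-comments"},
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions", region_name="eu-west-1")
sfn.create_state_machine(
    name="text-preprocessing",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",
)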




What is AWS Fargate?


AWS Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. AWS Fargate is compatible with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS).


Web apps, APIs, and microservices:

Build and deploy your applications, APIs, and microservices architectures with the speed and immutability of containers. Fargate removes the need to own, run, and manage the lifecycle of a compute infrastructure so that you can focus on what matters most: your applications.


Run and scale container workloads

Use Fargate with Amazon ECS or Amazon EKS to easily run and scale your containerized data processing workloads. Fargate also enables you to migrate and run your Amazon ECS Windows containers without refactoring or rearchitecting your legacy applications.


Support AI and ML training applications

Create an AI and ML development environment that is flexible and portable. With Fargate, you get the scalability you need to boost server capacity without over-provisioning, so you can train, test, and deploy your machine learning (ML) models.


With AWS Fargate there are no upfront expenses; you pay only for the resources you use. You can further optimize costs with Compute Savings Plans and Fargate Spot, and use Graviton2-powered Fargate for up to 40% better price performance.


What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and data ops tooling for authoring, running jobs, and implementing business workflows.


With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.


AWS Glue consolidates major data integration capabilities into a single service. These include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's also serverless, which means there's no infrastructure to manage. With flexible support for ETL, ELT, and streaming workloads in one service, AWS Glue serves many different types of users and workloads.


With the ability to scale on demand, AWS Glue helps you focus on high-value activities that maximize the value of your data. It scales for any data size, and supports all data types and schema variances. To increase agility and optimize costs, AWS Glue provides built-in high availability and pay-as-you-go billing.


AWS Glue Studio

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor data integration jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on the Apache Spark–based serverless ETL engine in AWS Glue
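A minimal Glue ETL script of the kind Glue Studio generates might look roughly like the sketch below; the catalog database, table name, and S3 output path are placeholders.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# read a table registered in the Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(database="text_lake", table_name="raw_comments")

# simple clean-up step on the underlying Spark DataFrame
df = dyf.toDF().dropna(subset=["raw_text"])

# write the cleaned data to the next zone of the data lake
df.write.mode("overwrite").parquet("s3://my-data-lake/indexing/comments/")

job.commit()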

AWS Glue features fall into three major categories:

Discover and organize data

Transform, prepare, and clean data for analysis

Build and monitor data pipelines


References

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

Tuesday, February 14, 2023

What is CRF Model?

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account.

CRFs find their applications in named entity recognition, part of speech tagging, gene prediction, noise reduction and object detection problems, to name a few.

A Conditional Random Field is a special case of a Markov random field wherein the graph satisfies the property: "When we condition the graph on X globally, i.e. when the values of the random variables in X are fixed or given, all the random variables in the set Y follow the Markov property p(Yᵤ | X, Yᵥ, u ≠ v) = p(Yᵤ | X, Yₓ, Yᵤ ~ Yₓ), where Yᵤ ~ Yₓ signifies that Yᵤ and Yₓ are neighbors in the graph." A variable's neighboring nodes or variables are also called the Markov blanket of that variable.
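In practice, linear-chain CRFs for sequence labelling are often trained with third-party libraries such as sklearn-crfsuite; the toy named entity example below is purely illustrative, with hand-written features and made-up sentences.

import sklearn_crfsuite  # pip install sklearn-crfsuite

def word_features(sentence, i):
    # features for the i-th token, including its neighbours (the "context" a CRF can use)
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

sentences = [["John", "lives", "in", "Paris"], ["Mary", "visited", "London"]]
labels = [["PER", "O", "O", "LOC"], ["PER", "O", "LOC"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

test = ["Nick", "flew", "to", "Rome"]
print(crf.predict([[word_features(test, i) for i in range(len(test))]]))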


references:

https://towardsdatascience.com/conditional-random-fields-explained-e5b8256da776

Monday, February 13, 2023

What is MLFlow

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components:


MLflow Tracking: Record and query experiments: code, data, config, and results

MLflow Projects: Package data science code in a format to reproduce runs on any platform

MLflow Models: Deploy machine learning models in diverse serving environments

Model Registry: Store, annotate, discover, and manage models in a central repository


# Install MLflow

pip install mlflow


# Install MLflow with extra ML libraries and 3rd-party tools

pip install mlflow[extras]


# Install a lightweight version of MLflow

pip install mlflow-skinny
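A minimal tracking example is shown below; the parameter and metric names are arbitrary, and by default runs are written to a local ./mlruns directory.

import mlflow

# record one run with a parameter and a metric
with mlflow.start_run(run_name="demo-run"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.93)

The recorded runs can then be browsed locally with the mlflow ui command.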

references:

https://mlflow.org/docs/latest/tutorials-and-examples/index.html 

Sunday, February 5, 2023

Mongo - Update field of records matching criteria

The main point is that we need to set multi: true; without it, update only modifies the first matching document.

db.getCollection("notificationcontent").find({})

db.getCollection("notificationcontent").update({"_id" : ObjectId("63be8e174826db2b41dc352b")},{"$set" : {"tags" : ["Inspirational" ] }})

db.getCollection("notificationcontent").update({"status" : "ready"},{"$set" : {"tags" : ["Inspirational" ] }})

db.getCollection("notificationcontent").update({"status" : "ready"},{"$set" : {"tags" : ["Inspirational" ] }},{upsert:false, multi:true})

references: