Friday, June 30, 2023

Difference between OpenAI Embedding and Transformer Embedding

OpenAI Embeddings and Transformer Embeddings refer to different approaches for generating word or text representations.


OpenAI Embeddings:

OpenAI Embeddings, here referring to the representations produced by OpenAI's GPT-family models such as GPT-3 or GPT-4, come from deep neural networks based on the Transformer architecture. These models are pre-trained on a large corpus of text data and are designed to generate contextualized embeddings. OpenAI Embeddings capture semantic and syntactic information by considering the surrounding words in the context, and they can be used for various natural language processing (NLP) tasks, such as text generation, language translation, sentiment analysis, and more.


Transformer Embeddings:

Transformer Embeddings, on the other hand, refer to the embeddings generated by the Transformer model architecture itself. The Transformer model is a neural network architecture that has revolutionized various NLP tasks, including machine translation, text classification, and sequence generation. Transformer models consist of self-attention mechanisms that allow the model to capture dependencies between words or tokens in a sequence. The embeddings produced by the Transformer model are typically used as input features for downstream tasks or as representations for further analysis.


In summary, OpenAI Embeddings specifically refer to the contextualized word embeddings generated by OpenAI's GPT models, while Transformer Embeddings refer to the embeddings generated by the Transformer model architecture, which can be used in various NLP tasks. The key difference lies in the specific implementation and pre-training process of the models, with OpenAI Embeddings being a specific instance of Transformer-based embeddings.
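
As a concrete illustration of the second case, a Transformer embedding can be pulled straight from a pre-trained model's hidden states. Below is a minimal sketch using the Hugging Face transformers library with mean pooling; the checkpoint name and the pooling strategy are illustrative assumptions, not something prescribed above.

import torch
from transformers import AutoTokenizer, AutoModel

# Load a generic pre-trained Transformer (model choice is an assumption)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Your text string goes here"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Average the token-level hidden states to get a single sentence vector
embedding = outputs.last_hidden_state.mean(dim=1).squeeze()
print(embedding.shape)  # e.g. torch.Size([768]) for bert-base-uncased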

What Are Embeddings in OpenAI

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:


Search (where results are ranked by relevance to a query string)

Clustering (where text strings are grouped by similarity)

Recommendations (where items with related text strings are recommended)

Anomaly detection (where outliers with little relatedness are identified)

Diversity measurement (where similarity distributions are analyzed)

Classification (where text strings are classified by their most similar label)



An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.


To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.


curl https://api.openai.com/v1/embeddings \

  -H "Content-Type: application/json" \

  -H "Authorization: Bearer $OPENAI_API_KEY" \

  -d '{

    "input": "Your text string goes here",

    "model": "text-embedding-ada-002"

  }'


{

  "data": [

    {

      "embedding": [

        -0.006929283495992422,

        -0.005336422007530928,

        ...

        -4.547132266452536e-05,

        -0.024047505110502243

Docker how to custom build a package when building an image

FROM centos:8

# Install the compiler toolchain and headers needed to build libksba from source
RUN yum update -y && \
    yum install -y gcc make bzip2 zlib-devel bzip2-devel libgpg-error-devel


WORKDIR /tmp

RUN curl -OL ftp://ftp.gnupg.org/gcrypt/libksba/libksba-1.6.3.tar.bz2 && \

    tar xjf libksba-1.6.3.tar.bz2 && \

    rm libksba-1.6.3.tar.bz2


WORKDIR /tmp/libksba-1.6.3

RUN ./configure && \

    make && \

    make install


WORKDIR /

RUN rm -rf /tmp/libksba-1.6.3


Docker how to check which version of a library a Docker image contains

 Inspect the Docker Image Layers: You can use the docker history command to view the layers of a Docker image. Each layer represents a step in the image's construction, and you can inspect the commands executed in each layer. By examining the commands, you might find information about the library versions installed.



docker history <image_name>

However, this method may not provide detailed information about the specific library versions unless they were explicitly mentioned in the Dockerfile.


Run a Container from the Image and Check the Version: You can run a container from the Docker image and execute commands to check the version of a specific library.



docker run -it <image_name> <command_to_check_version>

Replace <image_name> with the name or ID of the Docker image and <command_to_check_version> with the command specific to the library you want to check. For example, for Python libraries, you can use pip show <library_name> to get the version.



docker run -it <image_name> pip show <library_name>

This method allows you to execute commands inside the container and inspect the installed libraries directly.


Check the Dockerfile: If you have access to the Dockerfile used to build the image, you can inspect it to find the specific versions of the installed libraries. Look for RUN commands that install or update the libraries, and check if the versions are explicitly specified.



RUN pip install <library_name>==<version>

The Dockerfile is the most reliable source to determine the specific library versions used during image creation.


By using these methods, you can obtain information about specific library versions installed in a Docker image. The approach you choose depends on the availability of resources and your specific requirements.
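
As a programmatic variant of the second method above, the Docker SDK for Python (pip install docker) can run a one-off container and capture the version command's output. This is only a sketch; the image name, the library, and the use of pip are placeholder assumptions.

import docker

client = docker.from_env()
output = client.containers.run(
    "my-image:latest",      # hypothetical image name
    "pip show requests",    # hypothetical version-check command
    remove=True,            # clean up the container afterwards
)
print(output.decode())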

What does rsyslog do?

Most modern Linux distributions actually use a new-and-improved daemon called rsyslog. rsyslog is capable of forwarding logs to remote servers. The configuration is relatively simple and makes it possible for Linux admins to centralize log files for archiving and troubleshooting.

Tuesday, June 27, 2023

How to do an item-item Similarity recommendation?

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity


# Sample ratings data (user-item matrix)

ratings = np.array([

    [5, 3, 4, 4, 0],  # User 1

    [1, 0, 5, 0, 4],  # User 2

    [0, 3, 0, 4, 0],  # User 3

    [5, 0, 4, 3, 5]   # User 4

])


# Calculate item-item similarity matrix using cosine similarity

item_similarity = cosine_similarity(ratings.T)


# Function to generate item recommendations for a given item

def get_item_recommendations(item_id, top_n=3):

    item_scores = item_similarity[item_id]

    # Sort ascending, drop the last entry (the item itself, similarity 1.0), then reverse to descending order
    top_items = np.argsort(item_scores)[-top_n-1:-1][::-1]

    return top_items


# Example usage:

item_id = 2  # Item ID for which recommendations are needed

recommendations = get_item_recommendations(item_id, top_n=3)

print(f"Top recommendations for Item {item_id}: {recommendations}")


Monday, June 26, 2023

Sample code for Analysing the user behavior from the logs

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import IsolationForest


# Load log data from a CSV file

log_data = pd.read_csv('log_file.csv')


# Preprocessing: Encode categorical variables

label_encoder = LabelEncoder()

log_data['user_id'] = label_encoder.fit_transform(log_data['user_id'])

log_data['action'] = label_encoder.fit_transform(log_data['action'])


# User Session Identification: group log entries into per-user sessions
# (a new session starts when the user changes or the gap between events exceeds 30 minutes)
session_duration = pd.Timedelta(minutes=30)
log_data['timestamp'] = pd.to_datetime(log_data['timestamp'])
log_data = log_data.sort_values(['user_id', 'timestamp'])
new_session = (log_data['user_id'] != log_data['user_id'].shift()) | \
              (log_data['timestamp'].diff() > session_duration)
log_data['session_id'] = new_session.cumsum()


# Behavioral Metrics Calculation: Calculate session duration and action frequency

session_metrics = log_data.groupby('session_id').agg({

    'user_id': 'first',

    'timestamp': ['min', 'max'],

    'action': 'count'

})

session_metrics.columns = ['user_id', 'start_time', 'end_time', 'action_count']


# Anomaly Detection: Identify anomalous user sessions

model = IsolationForest(contamination=0.05)  # Adjust contamination based on expected anomaly rate

session_metrics['is_anomaly'] = model.fit_predict(session_metrics[['action_count']])


# Visualization: Plot session duration and action count

plt.scatter(session_metrics['action_count'],
            (session_metrics['end_time'] - session_metrics['start_time']).dt.total_seconds())
plt.xlabel('Action Count')
plt.ylabel('Session Duration (seconds)')

plt.title('User Session Duration vs. Action Count')

plt.show()


Sample csv file is as below 


timestamp,user_id,action

2023-06-01 10:00:00,user1,login

2023-06-01 10:01:00,user1,browse

2023-06-01 10:02:00,user1,purchase

2023-06-01 10:03:00,user2,login

2023-06-01 10:04:00,user2,browse

2023-06-01 10:05:00,user2,add_to_cart

2023-06-01 10:06:00,user3,login

2023-06-01 10:07:00,user3,browse

2023-06-01 10:08:00,user3,browse

2023-06-01 10:09:00,user3,checkout

Firebase hosting how to restore the file from cloud if they are lost from the local machine

Did a lot of searching on this, but finally found the approach below and it works perfectly!

You should of course add your own version control (e.g. git) to manage your revisions and backups so this doesn't occur.

There is a script you can run to download all the assets using the CLI. You can find the script and instructions here. It can also be run using npx:

npx https://gist.github.com/mbleigh/9c8680cf319ace2f506f57380da66e7d <site_name>

Note that this only returns the compiled/rendered content from the specified public folder and not any precompiled source you may have had on the development machine.

Since your files are static assets, you could also scrape them using wget. This is inferior for advanced apps as you'll get the rendered content and not the source:

wget -r -np https://<YOURAPPNAME>.firebaseapp.com

Read more on scraping web sites here: https://apple.stackexchange.com/questions/100570/getting-files-all-at-once-from-a-web-page-using-curl

references:

https://stackoverflow.com/questions/26286339/pull-lost-code-from-firebase-hosting-deployment


How can the Log file Visualisation can be done using ML methodologies

Log file visualization using ML models typically involves transforming log data into a format suitable for visualization and applying ML techniques to extract patterns or insights from the logs. Here's a high-level overview of the process:

Log Data Preprocessing:

Clean and preprocess the log data, including handling missing values, removing irrelevant information, and standardizing the log format if necessary.

Perform feature engineering to extract meaningful features from the log data. This may include extracting timestamps, log levels, error codes, or any other relevant information.

Dimensionality Reduction:


Apply dimensionality reduction techniques like Principal Component Analysis (PCA) or t-SNE to reduce the dimensionality of the log data.

This step is useful when dealing with large amounts of log data with numerous features, making it easier to visualize and interpret the patterns.

Clustering or Anomaly Detection:


Apply clustering algorithms like K-means, DBSCAN, or hierarchical clustering to group similar log patterns together.

Alternatively, use anomaly detection algorithms like Isolation Forest or Autoencoders to identify unusual or anomalous log patterns.

Visualization Techniques:


Utilize various visualization techniques to represent the log data and the output of clustering or anomaly detection.

Common visualization methods include scatter plots, line charts, bar graphs, heatmaps, or network graphs.

Consider interactive visualization tools or libraries that allow users to explore the log data and drill down into specific patterns or anomalies.

ML Model Interpretation:


Interpret and analyze the results obtained from the ML models.

Identify and visualize important features or clusters within the log data.

Extract insights or patterns that can help in understanding system behavior, identifying errors, or detecting anomalies.

It's important to note that log file visualization using ML models is a complex task, and the specific techniques and tools used may vary based on the nature of the log data and the desired visualization goals. Customization and experimentation are often required to tailor the visualization approach to your specific use case and data characteristics.
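
A minimal end-to-end sketch of the pipeline described above, assuming a log_file.csv with a log_message column (as in the other examples on this page); the choice of TF-IDF, PCA, and K-means, and the cluster count, are illustrative assumptions.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load and vectorize the log messages
log_data = pd.read_csv('log_file.csv')
X = TfidfVectorizer().fit_transform(log_data['log_message'])

# Reduce to 2 dimensions for plotting
coords = PCA(n_components=2).fit_transform(X.toarray())

# Cluster similar log patterns (cluster count is an assumption)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Scatter plot of log messages colored by cluster
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title('Log messages projected with PCA, colored by K-means cluster')
plt.show()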




What are steps involved in Log Summarisation

Log summarization using ML models typically involves using natural language processing (NLP) techniques to analyze and summarize log data. Here's a general approach to performing log summarization using ML models:


Data Preparation:


Collect the log data that you want to summarize. Log data can be in various formats such as text files, CSV files, or database entries.

Preprocess the log data by cleaning and formatting it. This may involve removing irrelevant information, normalizing text, removing punctuation, and handling special characters or symbols.

Data Annotation:


Annotate the log data by manually creating summaries for a subset of log entries. This step involves reading each log entry and writing a concise summary that captures the key information.

Dataset Creation:


Split the annotated data into a training set and a validation/test set. The training set will be used to train the ML model, while the validation/test set will be used to evaluate the model's performance.

Feature Extraction:


Convert the log data into numerical or vector representations that ML models can understand. Common techniques include tokenization, vectorization (e.g., using TF-IDF or word embeddings), and feature engineering.

Model Training:


Select an appropriate ML model for log summarization, such as sequence-to-sequence models, transformer models, or recurrent neural networks (RNNs).

Train the ML model using the annotated log data. This typically involves feeding the log data and corresponding summaries as input-output pairs to the model and optimizing its parameters.

Model Evaluation:


Evaluate the trained model's performance on the validation/test set. Common evaluation metrics for text summarization include ROUGE scores, which measure the quality of the generated summaries compared to the reference summaries.

Model Deployment:


Once the ML model performs well on the validation/test set, you can deploy it for log summarization tasks. This may involve integrating the model into your existing log processing pipeline or creating a dedicated API or service for log summarization.

Continuous Improvement:


Monitor and evaluate the model's performance in production. Collect feedback from users and use it to iteratively improve the model's accuracy and usefulness.

It's important to note that log summarization is a complex task, and the specific implementation details and choice of ML model may vary depending on your specific requirements and the characteristics of your log data. It's recommended to explore existing research and libraries related to text summarization and adapt them to your specific use case.
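
If a pre-trained summarization model is acceptable instead of training one from scratch as outlined above, a quick sketch with the Hugging Face transformers pipeline looks like this. The model name, the sample log lines, and the way they are joined into one document are assumptions.

from transformers import pipeline

# Pre-trained summarization pipeline (model choice is an assumption)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

log_entries = [
    "[ERROR] Database connection failed",
    "[WARNING] Low disk space",
    "[INFO] Processing completed",
]

# Join the log lines into one document and summarize it
text = " ".join(log_entries)
summary = summarizer(text, max_length=40, min_length=5, do_sample=False)
print(summary[0]["summary_text"])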



Sunday, June 25, 2023

Creating BPMN multi instance with external task

How to create a BPMN file with an external task configured as a multi-instance activity, and verify it (a small worker sketch for verification follows the XML below).

<?xml version="1.0" encoding="UTF-8"?>

<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL"

             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

             xmlns:camunda="http://camunda.org/schema/1.0/bpmn"

             xsi:schemaLocation="http://www.omg.org/spec/BPMN/20100524/MODEL BPMN20.xsd"

             id="definitions"

             targetNamespace="http://bpmn.io/schema/bpmn">


  <process id="external_task_multi_instance" isExecutable="true">

  

    <startEvent id="startEvent" />

    

    <serviceTask id="externalTask" name="External Task" camunda:type="external" camunda:topic="sampleTopic">

      <multiInstanceLoopCharacteristics isSequential="true">

        <loopCardinality xsi:type="tFormalExpression">5</loopCardinality>

      </multiInstanceLoopCharacteristics>

    </serviceTask>

    

    <endEvent id="endEvent" />

    

    <sequenceFlow id="flow1" sourceRef="startEvent" targetRef="externalTask" />

    <sequenceFlow id="flow2" sourceRef="externalTask" targetRef="endEvent" />

    


    

  </process>

  

</definitions>
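
To verify the deployed process, an external task worker can poll the sampleTopic topic and complete each of the five instances. A rough sketch against the Camunda 7 REST API using the requests library; the base URL and worker id are assumptions about a local setup.

import time
import requests

BASE_URL = "http://localhost:8080/engine-rest"  # assumed local Camunda 7 engine
WORKER_ID = "demo-worker"                       # arbitrary worker identifier

while True:  # runs until interrupted
    # Fetch and lock up to one external task on the topic defined in the BPMN above
    tasks = requests.post(
        f"{BASE_URL}/external-task/fetchAndLock",
        json={
            "workerId": WORKER_ID,
            "maxTasks": 1,
            "topics": [{"topicName": "sampleTopic", "lockDuration": 10000}],
        },
    ).json()

    if not tasks:
        time.sleep(2)
        continue

    for task in tasks:
        print("Completing external task instance:", task["id"])
        requests.post(
            f"{BASE_URL}/external-task/{task['id']}/complete",
            json={"workerId": WORKER_ID},
        )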


How to configure Jenkins to Skip builds based on previous Commit ID

 To configure Jenkins Job Cacher for comparing commit IDs and skipping builds based on changes, you can follow these steps:

Install Jenkins Job Cacher Plugin: In the Jenkins dashboard, navigate to "Manage Jenkins" > "Manage Plugins." In the "Available" tab, search for "Jenkins Job Cacher" and install the plugin.

Configure Global Settings: Once the plugin is installed, go to "Manage Jenkins" > "Configure System." Scroll down to the "Jenkins Job Cacher" section and configure the global settings as follows:


Enable the "Enable job cacher" checkbox.

Set the "Max build history" to an appropriate value to limit the number of builds to compare against.

Configure the "Commit ID extraction" pattern to extract the commit ID from the relevant SCM (Source Control Management) field. This pattern should match the commit ID format used in your SCM.

Configure Job-Level Settings: Open the configuration of the specific Jenkins job for which you want to compare commit IDs and skip builds.


Enable Job Cacher: Scroll down to the "Build Environment" section in the job configuration. Check the "Enable Job Cacher" checkbox.


Configure Build Skip Criteria: In the "Job Cacher Settings" section, configure the build skip criteria:


Select the appropriate SCM field that contains the commit ID from the dropdown.

Specify the build skip behavior based on commit ID changes. You can choose to skip the build when the commit ID is the same as the previous build or when it is different.

Save the Job Configuration: After configuring the build skip criteria, save the job configuration.


Now, when a build is triggered for the configured job, Jenkins will compare the commit ID with the previous build's commit ID. Depending on the configured criteria, the build will be skipped when the commit ID is unchanged (or when it differs), avoiding unnecessary builds.


Note: Ensure that your Jenkins job is configured to trigger builds automatically, such as on SCM changes or a schedule, for the commit ID comparison and build skipping mechanism to come into effect.

Sample code for doing predictive maintenance using ML Models

 import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score


# Load the dataset

dataset = pd.read_csv('maintenance_data.csv')


# Split the dataset into features (X) and target variable (y)

X = dataset.drop('failure', axis=1)

y = dataset['failure']


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Create a Random Forest Classifier

model = RandomForestClassifier()


# Train the model

model.fit(X_train, y_train)


# Make predictions on the test set

y_pred = model.predict(X_test)


# Evaluate the model's accuracy

accuracy = accuracy_score(y_test, y_pred)

print('Accuracy:', accuracy)

In this code snippet:

The dataset is loaded using pd.read_csv() from a CSV file named maintenance_data.csv. This dataset should contain historical data related to equipment, including features and the target variable indicating whether a failure occurred or not.

The dataset is split into features (X) and the target variable (y).


The data is further split into training and testing sets using train_test_split() from sklearn.model_selection. Adjust the test_size parameter as needed.

A Random Forest Classifier is created using RandomForestClassifier() from sklearn.ensemble.

The model is trained using the training data with model.fit(X_train, y_train).

Predictions are made on the test set using model.predict(X_test).


The accuracy of the model is evaluated by comparing the predicted values (y_pred) with the actual values (y_test) using accuracy_score() from sklearn.metrics.


Note that this is a basic example using a Random Forest Classifier. Depending on your specific requirements, you might need to explore other ML algorithms, perform feature engineering, hyperparameter tuning, or handle imbalanced data to improve the accuracy of the predictive maintenance model.


Ensure that you have a suitable dataset containing relevant features and the target variable indicating failures or maintenance needs for your equipment. Adjust the code accordingly to fit your dataset and consider additional preprocessing steps or model enhancements based on your specific use case.


the sample data for this is as below 


temperature,humidity,vibration,failure

30,60,0.2,0

28,55,0.3,0

32,58,0.5,1

26,50,0.4,0

34,65,0.6,1


How to do predictive maintenance using AI/ML models

 



Predictive maintenance using AI/ML models involves leveraging machine learning techniques to forecast equipment failures or maintenance needs. Here's a high-level overview of the steps involved:


Data Collection: Gather historical data related to the equipment you want to perform predictive maintenance on. This data can include sensor readings, maintenance logs, operational parameters, environmental conditions, and any other relevant information.


Data Preprocessing: Clean and preprocess the collected data. This may involve handling missing values, outliers, and noise, as well as normalizing or scaling the data for modeling purposes.


Feature Engineering: Extract meaningful features from the raw data that can help in predicting equipment failures. This can involve aggregating sensor readings over time, creating statistical features, deriving time-based features, or incorporating domain knowledge to engineer informative features.


Labeling: Identify and label the instances in the historical data that represent equipment failures or maintenance events. This will serve as the target variable for training the predictive model.


Model Selection: Choose an appropriate machine learning model for predictive maintenance. This can include techniques such as regression, classification, time series forecasting, or anomaly detection, depending on the nature of the problem and the available data.


Training: Split the preprocessed data into training and testing sets. Train the chosen model using the labeled data, allowing it to learn the patterns and relationships between the features and the target variable.


Model Evaluation: Evaluate the trained model's performance using appropriate evaluation metrics such as accuracy, precision, recall, or mean squared error. Assess how well the model predicts equipment failures or maintenance needs.


Deployment and Monitoring: Once satisfied with the model's performance, deploy it in a production environment to monitor equipment and make real-time predictions. Continuously collect new data to update the model periodically and improve its accuracy over time.


Maintenance Planning: Utilize the predictions from the model to plan proactive maintenance activities, such as scheduling inspections or component replacements, before the equipment failure occurs. This can help minimize downtime and optimize maintenance efforts.


It's important to note that the specific implementation details will vary based on the type of equipment, the available data, and the chosen machine learning techniques. It may require experimentation, tuning hyperparameters, and incorporating domain expertise to achieve accurate and reliable predictions for predictive maintenance.


Remember to continuously monitor and update the model as new data becomes available to ensure its effectiveness and adaptability to changing conditions.
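
For the Feature Engineering step above, here is a small sketch of rolling-window sensor aggregation with pandas. The column names follow the maintenance_data.csv sample shown earlier; the window size is an assumption.

import pandas as pd

# Sample sensor readings (columns follow the maintenance_data.csv example above)
data = pd.read_csv('maintenance_data.csv')

# Rolling-window statistics over the last 3 readings as extra features
window = 3  # assumed window size
for col in ['temperature', 'humidity', 'vibration']:
    data[f'{col}_rolling_mean'] = data[col].rolling(window, min_periods=1).mean()
    data[f'{col}_rolling_std'] = data[col].rolling(window, min_periods=1).std().fillna(0.0)

# The engineered features can then be fed to the classifier shown in the previous post
print(data.head())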


Sample code for doing Log Event correlation using ML model


import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import DBSCAN


# Load log data into a pandas DataFrame

log_data = pd.read_csv('log_file.csv')


# Extract relevant features for correlation (e.g., log message content)

features = log_data['log_message']


# Convert log messages to feature vectors using TF-IDF vectorization

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(features)


# Perform clustering to identify correlated log events

clustering_model = DBSCAN(eps=0.5, min_samples=2)

clusters = clustering_model.fit_predict(X)


# Add cluster labels to the log data

log_data['cluster_label'] = clusters


# Print the correlated log events

unique_clusters = log_data['cluster_label'].unique()

for cluster in unique_clusters:

    cluster_data = log_data[log_data['cluster_label'] == cluster]

    print("Cluster Label:", cluster)

    print(cluster_data['log_message'])

    print("------------------------------------")



Sample data for this is 


log_message

Application started

Invalid input detected

Processing completed

Low disk space

Connection timed out

Database connection failed


Saturday, June 24, 2023

Code to extract timestamp and error type from log using ML

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression


# Load log data into a pandas DataFrame

log_data = pd.read_csv('log_file.csv')


# Define the features and labels for log parsing

features = log_data['log_message']

timestamps = log_data['timestamp']

error_codes = log_data['error_code'].fillna('NO_ERROR')  # rows without an error code get an explicit label


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(features, timestamps, test_size=0.2, random_state=42)


# Convert log messages to feature vectors using CountVectorizer

vectorizer = CountVectorizer()

X_train_vectors = vectorizer.fit_transform(X_train)

X_test_vectors = vectorizer.transform(X_test)


# Train a logistic regression model for timestamp parsing

timestamp_model = LogisticRegression()

timestamp_model.fit(X_train_vectors, y_train)


# Predict the timestamps for test data

timestamp_predictions = timestamp_model.predict(X_test_vectors)


# Evaluate the timestamp model

timestamp_accuracy = np.mean(timestamp_predictions == y_test)

print("Timestamp Accuracy:", timestamp_accuracy)


# Split the data into training and testing sets for error code parsing

X_train, X_test, y_train, y_test = train_test_split(features, error_codes, test_size=0.2, random_state=42)


# Convert log messages to feature vectors using CountVectorizer

X_train_vectors = vectorizer.fit_transform(X_train)

X_test_vectors = vectorizer.transform(X_test)


# Train a logistic regression model for error code parsing

error_code_model = LogisticRegression()

error_code_model.fit(X_train_vectors, y_train)


# Predict the error codes for test data

error_code_predictions = error_code_model.predict(X_test_vectors)


# Evaluate the error code model

error_code_accuracy = np.mean(error_code_predictions == y_test)

print("Error Code Accuracy:", error_code_accuracy)


A sample file is 


log_message,timestamp,error_code

[INFO] Application started,2023-06-01 10:23:45,

[ERROR] Invalid input detected,2023-06-02 14:57:21,ERR001

[INFO] Processing completed,2023-06-03 09:12:34,

[WARNING] Low disk space,2023-06-04 16:35:02,

[DEBUG] Connection timed out,2023-06-05 11:45:10,

[ERROR] Database connection failed,2023-06-06 08:21:56,ERR002





Code snippet for Logistic regression for parsing files

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression


# Load log data into a pandas DataFrame

log_data = pd.read_csv('log_file.csv')


# Define the features and labels for log parsing

features = log_data['log_message']

labels = log_data['parsed_info']


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)


# Convert log messages to feature vectors using CountVectorizer

vectorizer = CountVectorizer()

X_train_vectors = vectorizer.fit_transform(X_train)

X_test_vectors = vectorizer.transform(X_test)


# Train a logistic regression model

model = LogisticRegression()

model.fit(X_train_vectors, y_train)


# Predict the parsed information for test data

predictions = model.predict(X_test_vectors)


# Evaluate the model

accuracy = np.mean(predictions == y_test)

print("Accuracy:", accuracy)


A sample csv file could be like this 


log_message,parsed_info

[INFO] Application started,START_EVENT

[ERROR] Invalid input detected,INPUT_ERROR

[INFO] Processing completed,PROCESS_COMPLETE

[WARNING] Low disk space,DISK_WARNING

[DEBUG] Connection timed out,CONNECTION_TIMEOUT

[ERROR] Database connection failed,DB_CONNECTION_ERROR

Isolation Forest in anomaly detection

Isolation Forest is a technique for identifying outliers in data, first introduced by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. The approach employs binary trees to detect anomalies, resulting in linear time complexity and low memory usage that is well-suited for processing large datasets.


Since its introduction, Isolation Forest has gained popularity as a fast and reliable algorithm for anomaly detection in various fields such as cybersecurity, finance, and medical research.


Isolation Forests are built on the observation that anomalies are the data points that are “few and different”.

How do Isolation Forests work?

When given a dataset, a random sub-sample of the data is selected and assigned to a binary tree.

Branching of the tree starts by selecting a random feature (from the set of all N features) first. And then branching is done on a random threshold ( any value in the range of minimum and maximum values of the selected feature).

If the value of a data point is less than the selected threshold, it goes to the left branch else to the right. And thus a node is split into left and right branches.

This process from step 2 is continued recursively till each data point is completely isolated or till max depth(if defined) is reached.

The above steps are repeated to construct random binary trees.


import numpy as np

import pandas as pd

import seaborn as sns

from sklearn.ensemble import IsolationForest

data = pd.read_csv('marks.csv')

random_state = np.random.RandomState(42)

model=IsolationForest(n_estimators=100,max_samples='auto',contamination=float(0.2),random_state=random_state)

model.fit(data[['marks']])

print(model.get_params())


data['scores'] = model.decision_function(data[['marks']])

data['anomaly_score'] = model.predict(data[['marks']])

data[data['anomaly_score']==-1].head()


Model Evaluation:

# anomaly_count should hold the number of anomalies known to be in marks.csv;
# here it is taken from the model's own flags purely for illustration
anomaly_count = len(data[data['anomaly_score'] == -1])
accuracy = 100 * list(data['anomaly_score']).count(-1) / anomaly_count

print("Accuracy of the model:", accuracy)

Output:


Accuracy of the model: 100.0


Limitations of Isolation Forest:

Isolation Forests are computationally efficient and have been proven to be very effective in anomaly detection. Despite these advantages, there are a few limitations, as mentioned below.


The final anomaly score depends on the contamination parameter, provided while training the model. This implies that we should have an idea of what percentage of the data is anomalous beforehand to get a better prediction.

Also, the model suffers from a bias due to the way the branching takes place.


References:

https://www.analyticsvidhya.com/blog/2021/07/anomaly-detection-using-isolation-forest-a-complete-guide/

Tuesday, June 20, 2023

Camunda conditional flow and Connector task

In Camunda BPM, a conditional flow with a connector behaves in a specific way. When a conditional flow with a connector is encountered during process execution, Camunda will wait for the connector to finish executing before evaluating the condition and determining the next path to take.


Here's a step-by-step explanation of how it works:


Process Execution Reaches the Conditional Flow with Connector: During the execution of a process instance, when it reaches a conditional flow with a connector, the execution pauses at that point.


Connector Execution: The connector associated with the conditional flow is then executed. A connector is typically used to integrate with external systems or perform some complex logic. It could be a REST call, a database operation, a message queue interaction, or any other custom logic implemented using connectors in Camunda.


Connector Execution Completion: Once the connector execution is complete, the process execution resumes.


Condition Evaluation: After the connector execution, Camunda evaluates the condition associated with the conditional flow. The condition could be expressed using Camunda's Expression Language or a script. The condition determines which outgoing path the process should take based on the connector's result or other process variables.


Next Path Determination: Based on the evaluation of the condition, Camunda determines the next path to be taken in the process. If the condition evaluates to true, the process follows the outgoing path connected to the conditional flow with a "true" condition. If the condition evaluates to false, the process follows the outgoing path connected to the conditional flow with a "false" condition.


By waiting for the connector execution to finish before evaluating the condition, Camunda ensures that any data or results obtained from the connector are available for the condition evaluation. This allows for dynamic and data-driven routing within a BPMN process.

Camunda Conditional flow behavior in Camunda External Task

 In Camunda BPM, a conditional flow with an external task does not wait for the external task to finish before evaluating the condition. This behavior is by design and is important to understand in order to model your processes correctly.

Here's an explanation of how the conditional flow with an external task works:


Process Execution Reaches the Conditional Flow with External Task: During the execution of a process instance, when it reaches a conditional flow with an external task, the execution pauses at that point.


External Task Creation: Camunda creates an external task and sends it to an external task worker or a specific external system for processing. The external task represents a unit of work that needs to be performed externally.


External Task Execution: The external system or external task worker performs the required work associated with the external task. This work can involve interacting with external systems, performing calculations, or any other business logic specific to your use case.


External Task Completion: Once the external task is completed by the external system or task worker, the result is reported back to Camunda.


Condition Evaluation: After the external task is completed, Camunda evaluates the condition associated with the conditional flow. The condition could be expressed using Camunda's Expression Language or a script. The condition determines which outgoing path the process should take based on the result of the external task or other process variables.


Next Path Determination: Based on the evaluation of the condition, Camunda determines the next path to be taken in the process. If the condition evaluates to true, the process follows the outgoing path connected to the conditional flow with a "true" condition. If the condition evaluates to false, the process follows the outgoing path connected to the conditional flow with a "false" condition.


As you can see, the evaluation of the condition happens after the external task is completed. This means that the outcome of the external task can be considered in the condition evaluation, but the process does not wait for the external task to finish before evaluating the condition.

Sunday, June 18, 2023

Jenkins Build cache - Various ways to speed up the build process

Reducing build time for a project with multiple microservices can be achieved through various strategies. Here are some approaches you can consider:


Parallelize the build process: Instead of building each microservice sequentially, you can build them in parallel. This can be accomplished by using build tools or scripts that allow concurrent execution of build tasks. By utilizing multiple CPU cores or even distributing the build across different machines, you can significantly reduce the overall build time.


Enable incremental builds: Configure your build system to support incremental builds, which means only rebuilding the modules that have changed since the last build. This approach avoids rebuilding the entire project every time, resulting in faster build times. Tools like Make, Gradle, or Maven support incremental builds.


Optimize dependencies: Analyze the dependencies between your microservices and ensure that unnecessary dependencies are minimized. Reducing dependencies can help avoid rebuilding unrelated modules when changes occur in a specific microservice, thus reducing build time.


Utilize caching: Set up a build cache to store compiled artifacts, dependencies, or intermediate build results. By reusing cached components, subsequent builds can skip redundant compilation tasks, resulting in faster build times. Tools like Gradle and Maven offer built-in support for caching.


Implement build profiles: Create build profiles or configuration options that allow you to build only the necessary microservices for a particular scenario. This approach enables selective building based on specific requirements, reducing build times by excluding unnecessary services.


Use modular builds: If your microservices are designed to be independent modules, consider adopting a modular build approach. This allows you to build and test each microservice separately, minimizing the scope of a full project build. Modular builds also enable faster deployment of individual services without impacting the entire system.


Optimize build scripts: Review your build scripts or configuration files to identify any performance bottlenecks or inefficient processes. Consider optimizing build steps, reducing unnecessary file copies, or eliminating redundant operations. This can lead to noticeable improvements in build time.


Leverage distributed build systems: If your project is large and requires extensive computing resources, you can consider using distributed build systems. These systems distribute the build process across multiple machines or a network of computers, allowing for faster builds by harnessing more computational power.


Employ build caching services: Explore tools or services that specialize in build caching, such as Gradle Enterprise or BuildBee. These services can cache build artifacts and dependencies, improving build times by avoiding redundant compilation or download tasks.


Upgrade hardware: If your build machine's hardware is outdated or inadequate, upgrading components like CPU, memory, or storage can provide a significant boost to build performance.


Remember, the effectiveness of these strategies depends on your specific project, build system, and infrastructure. It's advisable to analyze and experiment with different approaches to identify the most effective optimizations for your microservice architecture.

Friday, June 16, 2023

What is an AutoTokenizer in Hugging Face Transformers

A tokenizer is responsible for preprocessing text into an array of numbers that serve as inputs to a model. Multiple rules govern the tokenization process, including how to split a word and at what level words should be split.

AutoClass can help you automatically retrieve the relevant model given the provided pretrained weights/vocabulary. AutoTokenizer is a generic tokenizer class that will be instantiated as one of the base tokenizer classes when created with the AutoTokenizer.from_pretrained() classmethod.
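
A short sketch of loading and using a tokenizer this way; the checkpoint name and the sample sentence are assumptions.

from transformers import AutoTokenizer

# AutoTokenizer picks the right tokenizer class for the given checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("A tokenizer turns text into model inputs")
print(encoded["input_ids"])  # token ids fed to the model
print(tokenizer.tokenize("A tokenizer turns text into model inputs"))  # subword pieces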

Some useful pip commands


To upgrade pip itself:

pip install --upgrade pip

To view the version of an installed package:

pip show tensorflow

To uninstall a package and install a specific version instead:

pip uninstall urllib3
pip install urllib3==1.26.16

Friday, June 9, 2023

How To Fix the “Warning: Remote Host Identification Has Changed” Error

From here, run the nano ~/.ssh/known_hosts command in your terminal. This will open a new nano instance and display the keys within your known_hosts file, where the stale entry for the offending host can be removed.


Saturday, June 3, 2023

Camunda what is a process

 In Camunda, a process refers to a specific instance of a business process that is modeled using the Business Process Model and Notation (BPMN) standard. It represents a sequence of activities, tasks, decisions, and events that are designed to achieve a particular business objective or outcome.


A process in Camunda consists of a set of interconnected elements that define the flow of work and the logic of the business process. These elements include:


Activities: Activities represent the individual tasks or steps within the process. They can be manual tasks, automated tasks, or user tasks that require human interaction.


Gateways: Gateways define the branching and merging behavior of the process flow. They determine which path the process should follow based on conditions or rules.


Events: Events represent points in the process that trigger specific actions or responses. They can be start events that initiate the process, intermediate events that occur during the process, or end events that indicate the completion of the process.


Sequence Flows: Sequence flows connect the various elements of the process and define the order in which activities, gateways, and events are executed.


Process Variables: Process variables are data elements that store information relevant to the process. They can be used to pass data between different elements of the process and can be accessed and modified during the execution of the process.


When a process is initiated, an instance of that process is created. This instance represents the execution of the process in a specific context or scenario. The process instance progresses through the various elements and activities of the process until it reaches a predefined end condition or outcome.


Camunda provides a process engine that executes and manages the process instances based on the BPMN models. It allows for the automation, monitoring, and control of business processes, enabling organizations to streamline their operations, improve efficiency, and achieve their business goals.

Camunda what is external Task and Connector

Whether to use a Connector or an External Task in Camunda depends on the specific requirements and characteristics of your integration scenario. Here's a comparison to help you understand the differences and make an informed decision:

Connectors:

Connectors are built-in components in Camunda that provide a standardized way to integrate with external systems or services.

Connectors offer a simplified and declarative approach to interact with external systems without the need for custom code.

They allow you to define input/output parameters, configure connection details, and handle error scenarios.

Connectors support synchronous communication, where the process waits for the response from the external system before continuing.

They are suitable for simpler integration scenarios where direct interaction with external systems is required within the process flow.

Connectors are typically used when the integration logic is relatively straightforward and can be accomplished using the available connectors (e.g., REST, SOAP, JMS).

External Tasks:

External Tasks provide a more flexible and scalable approach for integrating with external systems or services.

External Tasks allow you to decouple the process engine from the external systems, enabling asynchronous communication and parallel processing.

With External Tasks, you define tasks in your BPMN model and implement worker applications that handle the execution of these tasks.

Worker applications can be developed in any programming language and provide custom logic to interact with the external systems.

External Tasks are suitable for complex integration scenarios, long-running processes, or when dealing with multiple external systems concurrently.

They support asynchronous processing, where the process engine delegates the task to the worker application and continues with the process flow.

External Tasks provide advanced features like task retries, long polling, and topic-based subscription to handle failures and distribute workloads.

In summary, you should consider using a Connector when the integration logic is relatively simple and can be accomplished using the available connectors provided by Camunda. On the other hand, if you have more complex integration requirements, asynchronous communication, or need scalability and fault tolerance, External Tasks provide a more flexible and powerful approach.


Evaluate your specific integration needs, the complexity of the integration logic, and the desired communication patterns to determine whether a Connector or an External Task is the best fit for your use case.




Camunda what is Synchronous and Asynchronous processing

 In Camunda, the behavior of moving to the next step in the workflow before an External Task finishes depends on how you model and configure your process. By default, Camunda supports both synchronous and asynchronous processing, offering flexibility based on your requirements. Let's explore the two possibilities:


Synchronous Processing:


In synchronous processing, the workflow execution will wait for the completion of the External Task before proceeding to the next step in the workflow.

When an External Task is reached during the execution of the process, the process engine will pause and delegate the task to an external worker (implemented by your application).

The external worker performs the necessary work for the task and returns the result to the process engine.

Only after the External Task is completed and the response is received, the workflow execution will continue to the next step.

Synchronous processing ensures that each step of the workflow is executed sequentially, waiting for the completion of each task before moving forward.

Asynchronous Processing:


In asynchronous processing, the workflow execution can proceed to the next step immediately after the External Task is started, without waiting for its completion.

When an External Task is reached during the execution of the process, the process engine triggers the task and continues to the next step without waiting for the task to finish.

The External Task is then processed by an external worker asynchronously, which means the work is performed outside the scope of the process engine.

Once the external worker completes the task, it reports the result back to the process engine, which can then continue the workflow execution from that point.

Asynchronous processing allows for parallelism, scalability, and the ability to perform long-running tasks without blocking the workflow execution.

The choice between synchronous and asynchronous processing depends on your use case and the desired behavior of your workflow. If you require sequential execution and want to ensure each task completes before moving to the next step, synchronous processing is suitable. On the other hand, if you need parallelism, scalability, or support for long-running tasks, asynchronous processing with External Tasks is recommended.


You can configure the behavior of the External Tasks, such as the concurrency settings and polling intervals, to further control how the workflow interacts with the external workers.