Saturday, June 24, 2023

Isolation Forest in anomaly detection

Isolation Forest is a technique for identifying outliers in data that was first introduced by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. The approach uses binary trees to isolate anomalies, giving it linear time complexity and a low memory footprint that make it well suited to processing large datasets.


Since its introduction, Isolation Forest has gained popularity as a fast and reliable algorithm for anomaly detection in various fields such as cybersecurity, finance, and medical research.


Isolation Forests are built on the observation that anomalies are the data points that are “few and different”.

How do Isolation Forests work?

1. Given a dataset, a random sub-sample of the data is selected and assigned to a binary tree.

2. Branching of the tree starts by selecting a random feature from the set of all N features.

3. Branching is then done on a random threshold: any value in the range between the minimum and maximum of the selected feature.

4. If a data point's value is less than the selected threshold it goes to the left branch, otherwise to the right; the node is thus split into left and right branches.

5. Steps 2 to 4 are continued recursively until each data point is completely isolated, or until the maximum depth (if defined) is reached.

6. The above steps are repeated to construct multiple random binary trees; a toy sketch of the procedure follows below.
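The sketch below is a minimal illustration of these steps, not the scikit-learn implementation: a hypothetical helper isolation_depth() follows a single point down one random tree and reports how many splits it takes to isolate it. Anomalies tend to be isolated after only a few splits.

import numpy as np

def isolation_depth(X, point, depth=0, max_depth=10):
    # Stop when the point is alone in the node or the depth limit is reached.
    if depth >= max_depth or len(X) <= 1:
        return depth
    feature = np.random.randint(X.shape[1])        # step 2: pick a random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                   # feature is constant; cannot split further
        return depth
    threshold = np.random.uniform(lo, hi)          # step 3: pick a random threshold
    if point[feature] < threshold:                 # step 4: follow the left branch ...
        subset = X[X[:, feature] < threshold]
    else:                                          # ... or the right branch
        subset = X[X[:, feature] >= threshold]
    return isolation_depth(subset, point, depth + 1, max_depth)

np.random.seed(0)
X = np.array([[10.0], [11.0], [12.0], [13.0], [95.0]])
# The obvious outlier (95) is typically isolated at a much smaller depth.
print(isolation_depth(X, X[4]), isolation_depth(X, X[1]))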


import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import IsolationForest

# Load the data; 'marks.csv' is expected to contain a 'marks' column.
data = pd.read_csv('marks.csv')

random_state = np.random.RandomState(42)

# 100 trees, automatic sub-sample size, and an assumed 20% contamination rate.
model = IsolationForest(n_estimators=100, max_samples='auto',
                        contamination=0.2, random_state=random_state)
model.fit(data[['marks']])

print(model.get_params())


# Anomaly score from decision_function: negative values indicate likely outliers.
data['scores'] = model.decision_function(data[['marks']])

# Predicted label: -1 for anomalies, 1 for normal points.
data['anomaly_score'] = model.predict(data[['marks']])

# Inspect the rows flagged as anomalous.
data[data['anomaly_score']==-1].head()
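In scikit-learn, predict() is simply a threshold on decision_function(): points with a negative score are the ones labelled -1. The quick checks below (using the columns created above; the seaborn plot is an optional extra) confirm this and show that the flagged fraction roughly tracks the contamination setting of 0.2:

# predict() == -1 exactly where decision_function() < 0.
print(((data['scores'] < 0) == (data['anomaly_score'] == -1)).all())

# Roughly 20% of rows should be flagged, matching contamination=0.2.
print(data['anomaly_score'].value_counts(normalize=True))

# Optional visual check with the seaborn import from above.
sns.histplot(data=data, x='marks', hue='anomaly_score')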


Model Evaluation:

# anomaly_count is assumed to already hold the number of known (true) outliers
# in the dataset; it is not defined in the snippets above.
accuracy = 100 * list(data['anomaly_score']).count(-1) / anomaly_count

print("Accuracy of the model:", accuracy)

Output:


Accuracy of the model: 100.0


Limitations of Isolation Forest:

Isolation Forests are computationally efficient and have proven to be very effective at anomaly detection. Despite these advantages, they have a few limitations, as mentioned below.


The final anomaly predictions depend on the contamination parameter provided while training the model. This implies that we should have an idea beforehand of what percentage of the data is anomalous in order to get good predictions (see the sketch after these points).

Also, the model suffers from a bias due to the way the branching takes place: every split is made on a single feature, so the cuts are always axis-parallel, which can distort the anomaly scores.
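To make the first limitation concrete, here is a small sketch on synthetic data (the values are illustrative, not from the marks dataset): the number of points flagged as anomalous is driven largely by the contamination setting rather than by the data itself.

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 1-D data: 95 "normal" points plus 5 obvious outliers.
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(50, 5, 95), rng.normal(120, 2, 5)]).reshape(-1, 1)

for contamination in (0.05, 0.2, 0.4):
    preds = IsolationForest(contamination=contamination, random_state=0).fit_predict(X)
    # The count of -1 labels closely follows the contamination setting.
    print(contamination, (preds == -1).sum())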


References:

https://www.analyticsvidhya.com/blog/2021/07/anomaly-detection-using-isolation-forest-a-complete-guide/
