3 Key Lessons in Anomaly Detection
by Odin Berre, AI Developer
1. Importance of Feature Selection
One of the key lessons we learned is that feature selection plays a critical role in anomaly detection. Not all features contribute equally to identifying anomalies, and irrelevant features can dilute the model’s effectiveness. It is vital to select features that actually capture abnormal behavior in the data.
Here’s an example using sklearn's IsolationForest for anomaly detection. We first select a subset of meaningful features.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
# Example data
data = pd.DataFrame({
    'feature_1': np.random.normal(size=100),
    'feature_2': np.random.normal(size=100),
    'feature_3': np.random.normal(size=100),
})
# Selecting meaningful features
X = data[['feature_1', 'feature_2']]
# Applying Isolation Forest for anomaly detection
model = IsolationForest(contamination=0.1)
model.fit(X)
anomalies = model.predict(X)
# -1 represents anomaly, 1 represents normal
anomalies
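To make the dilution effect concrete, the following is a minimal sketch on hypothetical synthetic data in which only two columns carry the anomaly signal and twenty more are pure noise. The helper `anomaly_auc`, the data sizes, and the seeds are assumptions made for illustration. It compares the ROC AUC of the Isolation Forest scores with and without the noise columns; on data like this, the score with noise added would typically be expected to drop.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 980 normal points and 20 anomalies.
# Only the two informative columns separate anomalies from normal points;
# the 20 noise columns are identically distributed for both groups.
n_normal, n_anomalous, n_noise = 980, 20, 20
normal = rng.normal(0.0, 1.0, size=(n_normal, 2))
signs = rng.choice([-1.0, 1.0], size=(n_anomalous, 2))
anomalous = signs * rng.uniform(5.0, 7.0, size=(n_anomalous, 2))

informative = np.vstack([normal, anomalous])
noise = rng.normal(0.0, 1.0, size=(n_normal + n_anomalous, n_noise))
is_anomaly = np.r_[np.zeros(n_normal, dtype=int), np.ones(n_anomalous, dtype=int)]


def anomaly_auc(X):
    """Fit an Isolation Forest on X and return the ROC AUC of its scores."""
    forest = IsolationForest(random_state=0).fit(X)
    # score_samples is higher for normal points, so negate it
    # to get a score that increases with "anomalousness".
    return roc_auc_score(is_anomaly, -forest.score_samples(X))


auc_selected = anomaly_auc(informative)
auc_all = anomaly_auc(np.hstack([informative, noise]))

print(f"AUC with the two informative features: {auc_selected:.3f}")
print(f"AUC with 20 noise features added:      {auc_all:.3f}")
```

The ground-truth labels are used here only to measure the effect; the model itself is still fitted without them.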
2. Dealing with Imbalanced Data
Anomalies are rare, so most anomaly detection datasets are highly imbalanced. This imbalance hurts supervised classifiers in particular, because they tend to favor the majority class (normal data) at the expense of the rare anomalies. To tackle this, we experimented with different methods and found that algorithms such as One-Class SVM and IsolationForest are well suited to imbalanced data: they need no labeled anomalies and instead flag points that deviate from the bulk of the data.
Here’s an example of using One-Class SVM for anomaly detection with imbalanced data:
from sklearn.svm import OneClassSVM
# Using One-Class SVM for anomaly detection
# nu is an upper bound on the fraction of training points treated as outliers
ocsvm = OneClassSVM(gamma='auto', nu=0.1)
ocsvm.fit(X)
anomaly_labels = ocsvm.predict(X)
# -1 represents anomaly, 1 represents normal
anomaly_labels
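Imbalance also affects how a detector should be evaluated. The sketch below assumes a small labeled hold-out set is available, which is not always the case in practice, and uses hypothetical synthetic data in which 2% of the hold-out points are anomalies. A baseline that never raises an alert reaches 98% accuracy on this set while catching nothing, so precision and recall on the anomaly class are the more informative metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Hypothetical data: fit on normal points only, then evaluate on a labeled
# hold-out set with 490 normal points and 10 anomalies.
X_train = rng.normal(0.0, 1.0, size=(500, 2))
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(490, 2)),  # normal points
    rng.uniform(5.0, 7.0, size=(10, 2)),  # anomalies, far from the normal cluster
])
# Same label coding as predict(): 1 = normal, -1 = anomaly
y_true = np.r_[np.ones(490, dtype=int), -np.ones(10, dtype=int)]

ocsvm_eval = OneClassSVM(gamma='scale', nu=0.05).fit(X_train)
y_pred = ocsvm_eval.predict(X_test)

# Baseline that labels every point as normal
y_all_normal = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_normal)
baseline_recall = recall_score(y_true, y_all_normal, pos_label=-1, zero_division=0)
model_precision = precision_score(y_true, y_pred, pos_label=-1, zero_division=0)
model_recall = recall_score(y_true, y_pred, pos_label=-1, zero_division=0)

print(f"Baseline accuracy: {baseline_accuracy:.2f}, anomaly recall: {baseline_recall:.2f}")
print(f"One-Class SVM anomaly precision: {model_precision:.2f}, recall: {model_recall:.2f}")
```

Passing `pos_label=-1` tells scikit-learn to treat the anomaly class as the positive class when computing precision and recall.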
3. Monitoring Anomalies in Real-Time
Anomaly detection becomes especially valuable when used in real-time. By continuously monitoring data streams, businesses can quickly identify and respond to potential issues before they escalate. Whether it's detecting fraud or monitoring a system for unusual behavior, real-time anomaly detection helps maintain operational stability.
Note that sklearn's IsolationForest does not provide a partial_fit method, so the model cannot be updated one data point at a time. To simulate real-time anomaly detection, we instead score each incoming data point with the model trained previously:
from sklearn.ensemble import IsolationForest
# Simulate real-time data streaming
real_time_data = pd.DataFrame({
    'feature_1': np.random.normal(size=20),
    'feature_2': np.random.normal(size=20),
})
# Use the same Isolation Forest model trained previously
for index in real_time_data.index:
    # A one-row DataFrame keeps the feature names the model was fitted with
    point = real_time_data.loc[[index]]
    # Predict if the new data point is an anomaly (-1) or normal (1)
    prediction = model.predict(point)[0]
    print(f"Data point {index}: {'Anomaly' if prediction == -1 else 'Normal'}")
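On a long-running stream, a model that was trained once can go stale as the data drifts. Since IsolationForest is a batch estimator in scikit-learn, one option is to refit it periodically on a sliding window of recent points. The following is a minimal sketch under that assumption; `detect_stream`, `window_size`, and `refit_every` are hypothetical names and values chosen for illustration, not part of scikit-learn.

```python
from collections import deque

import numpy as np
from sklearn.ensemble import IsolationForest


def fit_forest(points):
    """Fit a fresh Isolation Forest on the given sequence of points."""
    forest = IsolationForest(contamination=0.1, random_state=0)
    return forest.fit(np.array(list(points)))


def detect_stream(warmup, stream, window_size=200, refit_every=50):
    """Score each streamed point, refitting on a sliding window periodically.

    Returns a list of labels: -1 for anomaly, 1 for normal.
    """
    # Keep only the most recent `window_size` points
    window = deque(warmup, maxlen=window_size)
    forest = fit_forest(window)
    labels = []
    for i, point in enumerate(stream, start=1):
        # Score the point with the current model before it enters the window
        labels.append(int(forest.predict(point.reshape(1, -1))[0]))
        window.append(point)  # the oldest point drops out automatically
        if i % refit_every == 0:
            forest = fit_forest(window)
    return labels


rng = np.random.default_rng(7)
warmup = rng.normal(size=(200, 2))    # hypothetical warm-up batch
incoming = rng.normal(size=(120, 2))  # hypothetical incoming stream
incoming[10] = [8.0, 8.0]             # an obvious outlier injected for illustration

labels = detect_stream(warmup, incoming)
print(f"Flagged {labels.count(-1)} of {len(labels)} points; point 10 -> {labels[10]}")
```

Refitting from scratch is more expensive than a true incremental update, so the window size and refit interval trade freshness against compute. Flagged points also enter the window in this sketch; excluding them is a design choice that depends on whether anomalies should influence what the model considers normal.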