🚩 Anomaly detection

The currently implemented conformal anomaly detectors are listed on this page.

Each of these wrappers calibrates the decision threshold of the anomaly detector passed as an argument to its constructor. The wrapped model needs to implement the fit() and predict() methods. The prediction module from the API ensures that models from various ML/DL libraries (such as Keras and scikit-learn) comply with puncc.
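
As an illustration of this wrapping pattern, the sketch below (not taken from the library's examples) wraps scikit-learn's OneClassSVM in a BasePredictor subclass and hands it to SplitCAD; the choice of OneClassSVM, the toy data and the negation of score_samples are illustrative assumptions mirroring the IsolationForest example further down this page.

import numpy as np
from sklearn.svm import OneClassSVM

from deel.puncc.anomaly_detection import SplitCAD
from deel.puncc.api.prediction import BasePredictor

# Toy data: training points clustered around the origin, test points spread out
rng = np.random.RandomState(0)
z_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
z_test = rng.uniform(low=-4, high=4, size=(100, 2))

# Redefine predict so that higher outputs mean "more anomalous":
# OneClassSVM.score_samples is high for inliers, hence the negation.
class OCSVMPredictor(BasePredictor):
    def predict(self, X):
        return -self.model.score_samples(X)

ocsvm_cad = SplitCAD(OCSVMPredictor(OneClassSVM(nu=0.1)), train=True, random_state=0)
ocsvm_cad.fit(z=z_train, fit_ratio=0.8)
is_anomaly = ocsvm_cad.predict(z_test, alpha=0.05)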

class deel.puncc.anomaly_detection.SplitCAD(predictor, *, train=True, random_state=None)

Split conformal anomaly detection method based on Laxhammar's algorithm. The detection decision relies on a threshold, calibrated through conformal prediction, applied to the scores of the underlying anomaly detection model. For more details, we refer the user to the theory overview page. A conceptual sketch of the decision rule is given below, just before the example.

Parameters:
  • predictor (BasePredictor) – a predictor implementing fit and predict.

  • train (bool) – if False, prediction model(s) will not be (re)trained. Defaults to True.

  • random_state (float) – random seed used when the user does not provide a custom fit/calibration split in the fit method.
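
To make the calibrated threshold concrete, the following is a conceptual numpy sketch of the split conformal decision rule: the calibration and test scores are synthetic placeholders, and the snippet illustrates the principle rather than puncc's internal implementation.

import numpy as np

# Placeholder nonconformity scores, standing in for the wrapped detector's
# outputs on the calibration set and on one test point (higher = more anomalous)
rng = np.random.RandomState(0)
calib_scores = rng.normal(size=1000)
test_score = 2.5
alpha = 0.01

# Conformal p-value: proportion of calibration scores at least as extreme as
# the test score, with the usual +1 finite-sample correction
p_value = (np.sum(calib_scores >= test_score) + 1) / (len(calib_scores) + 1)

# Flag the point as anomalous when its p-value is below alpha, which controls
# the rate of false detections on exchangeable data
is_anomaly = p_value <= alpha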

Example:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

from deel.puncc.anomaly_detection import SplitCAD
from deel.puncc.api.prediction import BasePredictor

# We generate the two moons dataset
X_moons, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)
dataset = 4 * X_moons - np.array([0.5, 0.25])

# We generate uniformly new (test) data points
rng = np.random.RandomState(42)
z_test = rng.uniform(low=-6, high=6, size=(150, 2))


# The nonconformity scores are defined as the IF scores (anomaly scores).
# By default, score_samples returns the opposite of the IF scores.
# We need to redefine predict to output the nonconformity scores.
class ADPredictor(BasePredictor):
    def predict(self, X):
        return -self.model.score_samples(X)

# Instantiate the Isolation Forest (IF) anomaly detection model
# and wrap it in a predictor
if_predictor = ADPredictor(IsolationForest(random_state=42))

# Instantiate CAD on top of IF predictor
if_cad = SplitCAD(if_predictor, train=True, random_state=0)

# Fit the IF on the proper fitting dataset and
# calibrate it using the calibration dataset.
# The two datasets are sampled randomly with a ratio of 7:3,
# respectively.
if_cad.fit(z=dataset, fit_ratio=0.7)

# We set the maximum false detection rate to 1%
alpha = 0.01

# The method `predict` is called on the new data points
# to test which are anomalous and which are not
results = if_cad.predict(z_test, alpha=alpha)

anomalies = z_test[results]
not_anomalies = z_test[np.invert(results)]

# Plot results
plt.scatter(dataset[:, 0], dataset[:, 1], s=10, label="Inliers")
plt.scatter(
    anomalies[:, 0],
    anomalies[:, 1],
    marker="x",
    color="red",
    s=40,
    label="Anomalies",
)
plt.scatter(
    not_anomalies[:, 0],
    not_anomalies[:, 1],
    marker="x",
    color="blue",
    s=40,
    label="Normal",
)
plt.xticks(())
plt.yticks(())
plt.legend()

fit(*, z=None, fit_ratio=0.8, z_fit=None, z_calib=None, **kwargs)

This method fits the model on the fit data and computes nonconformity scores on the calibration data. If z is provided, the data are randomly split into fit and calibration subsets according to fit_ratio. If z_fit and z_calib are provided instead, conformalization is performed on the user-defined fit and calibration sets; both calling conventions are illustrated in the sketch below.

Note

If z is provided, fit ignores any user-defined fit/calib split.

Parameters:
  • z (Iterable) – data points from the training dataset.

  • fit_ratio (float) – the proportion of samples assigned to the fit subset.

  • z_fit (Iterable) – data points from the fit dataset.

  • z_calib (Iterable) – data points from the calibration dataset.

  • kwargs (dict) – fit configuration to be passed to the model's fit method.

Raises:

RuntimeError – no dataset provided.
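
As a usage sketch, reusing the if_cad detector and the dataset array from the class example above, fit can be called either with an automatic random split or with a user-defined split:

# Option 1: let SplitCAD split the data randomly (80% fit / 20% calibration)
if_cad.fit(z=dataset, fit_ratio=0.8)

# Option 2: provide an explicit, user-defined fit/calibration split
z_fit, z_calib = dataset[:700], dataset[700:]
if_cad.fit(z_fit=z_fit, z_calib=z_calib)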

predict(z_test, alpha)

Predict whether each example is an anomaly or not. The decision is based on a threshold, calibrated through conformal prediction, applied to the underlying anomaly detection scores.

Parameters:
  • z_test (Iterable) – new data points.

  • alpha (float) – target maximum false detection rate (FDR).

Returns:

Outlier tags: True if outlier, False otherwise.

Return type:

Iterable[bool]
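
As a final usage sketch, assuming the fitted if_cad detector and the z_test points from the class example above, the boolean output can be aggregated to see how detections vary with the target FDR:

import numpy as np

# A larger alpha tolerates more false detections, so more test points get flagged
for alpha in (0.01, 0.05, 0.1):
    mask = np.asarray(if_cad.predict(z_test, alpha=alpha))
    print(f"alpha={alpha}: {int(mask.sum())}/{len(z_test)} points flagged as anomalies")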