A Python implementation of the Sliced-Wasserstein Filter developed by Julien Pallage and Antoine Lesage-Landry.
In this work, we present a new unsupervised anomaly (outlier) detection (AD) method based on the sliced-Wasserstein metric. This filtering technique is conceptually interesting for integration into MLOps pipelines that deploy trustworthy machine learning models in critical sectors such as energy. We also propose an approximation of our methodology using a Fast Euclidian variation. The code follows scikit-learn's API and is called like other scikit-learn AD methods, e.g., Isolation Forest and Local Outlier Factor.
We use the Python implementation of the sliced-Wasserstein distance from the POT library, label candidate samples as outliers or inliers through a voting system, and parallelize the procedure with joblib.
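The voting idea can be illustrated with a small standard-library sketch. It is not the package's implementation, which relies on POT and joblib. The function names, the `voter_size` parameter, and the choice to compare each random subset with and without the candidate sample are assumptions made for illustration only; the parameters `eps`, `n`, `n_projections`, and `p` mirror the meanings given in the usage example.

```python
import math
import random


def wasserstein_1d(u, v, power=2):
    """Return W_p^p between two 1-D empirical distributions with uniform weights."""
    u, v = sorted(u), sorted(v)
    m, k = len(u), len(v)
    i = j = pos = 0
    total = 0.0
    # Walk both quantile functions; positions are integers in units of 1/(m*k).
    while i < m and j < k:
        end_u, end_v = (i + 1) * k, (j + 1) * m
        nxt = min(end_u, end_v)
        total += (nxt - pos) * abs(u[i] - v[j]) ** power
        pos = nxt
        if end_u == nxt:
            i += 1
        if end_v == nxt:
            j += 1
    return total / (m * k)


def sliced_wasserstein(xs, ys, n_projections, rng, power=2):
    """Monte Carlo sliced-Wasserstein distance using random unit directions."""
    dim = len(xs[0])
    acc = 0.0
    for _ in range(n_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both point sets onto the direction, then compare in 1-D.
        px = [sum(a * b for a, b in zip(x, direction)) for x in xs]
        py = [sum(a * b for a, b in zip(y, direction)) for y in ys]
        acc += wasserstein_1d(px, py, power)
    return (acc / n_projections) ** (1.0 / power)


def vote_filter(data, eps, n, n_projections, p, voter_size, seed=0):
    """Label each sample -1 (outlier) or 1 (inlier) by majority-style voting.

    ASSUMPTION: each voter is a random subset of `voter_size` other samples
    (a hypothetical parameter) and votes "outlier" when adding the candidate
    moves the subset by more than `eps` in sliced-Wasserstein distance.
    """
    rng = random.Random(seed)
    labels = []
    for idx, x in enumerate(data):
        others = data[:idx] + data[idx + 1:]
        votes = 0
        for _ in range(n):
            subset = rng.sample(others, voter_size)
            if sliced_wasserstein(subset, subset + [x], n_projections, rng) > eps:
                votes += 1
        labels.append(-1 if votes / n >= p else 1)
    return labels


# Toy data: 50 points in the unit square plus one distant point.
demo_rng = random.Random(42)
data = [(demo_rng.random(), demo_rng.random()) for _ in range(50)]
data.append((100.0, 100.0))
labels = vote_filter(data, eps=1.0, n=15, n_projections=10, p=0.6, voter_size=5)
```

In this sketch a voter whose subset happens to contain an outlier can vote against an inlier, which is one reason to require a fraction `p` of votes rather than a single one.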
For large datasets, we recommend using SmartSplitSlicedWassersteinFilter or FastEuclidianFilter to speed up computations.
from swfilter import SlicedWassersteinFilter
eps = 0.01 # the threshold of the SW distance
n = 30 # the number of voters
n_projections = 50 # the number of projections used in the SW computations
p = 0.6 # the threshold percentage of voters required to label as outlier
n_jobs = -1 # the number of workers to call in the parallelization (-1 = max)
model = SlicedWassersteinFilter(eps=eps, n=n, n_projections=n_projections, p=p, n_jobs=n_jobs, swtype='original')
preds, vote = model.fit_predict(dataset)
mask = preds == 1 # keep the samples labeled 1, i.e., those not flagged as outliers
filtered_dataset = dataset[mask]
pip install swfilter
See our tutorial page!
@article{pallage2024sliced,
title={Sliced-Wasserstein-based Anomaly Detection and Open Dataset for Localized Critical Peak Rebates},
author={Pallage, Julien and Scherrer, Bertrand and Naccache, Salma and B{\'e}langer, Christophe and Lesage-Landry, Antoine},
journal={arXiv preprint arXiv:2410.21712},
year={2024}
}