A lightweight library, with NumPy as its only dependency, that combines isolation forests with survival analysis for anomaly detection in time-to-event data.
pip install desolate

from desolate import DesolateForest
import numpy as np
# Create synthetic data
n_samples = 1000
features = np.random.normal(size=(n_samples, 5))
times = np.random.exponential(50, size=n_samples)
censoring = np.random.exponential(30, size=n_samples)
observed = np.minimum(times, censoring)
events = (times <= censoring).astype(int)
# Fit model
model = DesolateForest(contamination=0.1)
model.fit(observed, events, features)
# Get predictions
predictions = model.predict(observed, events, features)
scores = model.score_samples(observed, events, features)

Desolate combines isolation forests with Kaplan-Meier survival analysis by augmenting the feature space with survival information:

$$\mathbf{X}_{aug} = [\mathbf{X}|\hat{S}(t)|t|\delta]$$
where:
- $\mathbf{X}$ is the original feature matrix
- $\hat{S}(t)$ is the Kaplan-Meier survival probability
- $t$ is the observed time
- $\delta$ is the event indicator
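
As a rough illustration of the augmentation described above, the sketch below stacks the survival probability, observed time, and event indicator next to the original features. The helper name `augment_features` is hypothetical and is not part of Desolate's documented API; the column order is an assumption that follows the $[x|\hat{S}(t)|t|\delta]$ notation used in this document.

```python
import numpy as np

def augment_features(features, survival, times, events):
    """Stack [X | S_hat(t) | t | delta] column-wise (hypothetical helper)."""
    features = np.asarray(features, dtype=float)
    # One extra column each for survival probability, time, and event flag.
    extra = np.column_stack([survival, times, events]).astype(float)
    if extra.shape[0] != features.shape[0]:
        raise ValueError("all inputs must have the same number of rows")
    return np.hstack([features, extra])

# Toy inputs: 3 samples with 2 original features each.
X = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
S = np.array([0.9, 0.6, 0.3])      # survival probabilities at each observed time
t = np.array([5.0, 12.0, 20.0])    # observed times
delta = np.array([1, 0, 1])        # 1 = event observed, 0 = censored

X_aug = augment_features(X, S, t, delta)
print(X_aug.shape)  # (3, 5)
```

An isolation forest fitted on `X_aug` rather than `X` can then split on the survival columns as well as on the original features.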
The survival function estimate is:

$$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

where:
- $t_i$ are the distinct event times
- $d_i$ is the number of events at time $t_i$
- $n_i$ is the number of subjects at risk at time $t_i$
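
The Kaplan-Meier (product-limit) estimate can be sketched in a few lines of NumPy. This is a minimal illustration written from the definitions of $t_i$, $d_i$, and $n_i$ above, not Desolate's internal implementation; the function names `kaplan_meier` and `survival_at` are assumptions made for this example.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return the distinct event times t_i and S_hat(t_i) at each of them."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])  # distinct, sorted event times
    # d_i: number of events at t_i; n_i: subjects still at risk (time >= t_i).
    d = np.array([np.sum((times == ti) & events) for ti in event_times])
    n = np.array([np.sum(times >= ti) for ti in event_times])
    survival = np.cumprod(1.0 - d / n)
    return event_times, survival

def survival_at(query, event_times, survival):
    """Evaluate the step function S_hat at arbitrary times (1.0 before the first event)."""
    idx = np.searchsorted(event_times, query, side="right")
    return np.concatenate([[1.0], survival])[idx]

times = np.array([2.0, 3.0, 3.0, 5.0, 8.0])
events = np.array([1, 1, 0, 1, 0])   # 1 = event, 0 = censored

t_i, S_i = kaplan_meier(times, events)
print(t_i)  # [2. 3. 5.]
print(S_i)  # [0.8 0.6 0.3]
```

In the toy data, the censored subject at time 3 still counts as at risk at $t_i = 3$, which is why the second factor is $1 - 1/4$ rather than $1 - 2/4$.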
The anomaly score for a point $x$ is:

$$s(x) = 2^{-\frac{E[h(x)]}{c(n)}}$$

where:
- $h(x)$ is the path length for point $x$
- $E[h(x)]$ is the average path length across trees
- $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$, where $H(i)$ is the harmonic number
- $n$ is the number of samples
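
To make the normalisation concrete, the sketch below computes $c(n)$ from the definition above and applies the standard isolation forest score $2^{-E[h(x)]/c(n)}$ from Liu et al. (2008). The function names are assumptions for this example, and the harmonic number is computed as an exact sum rather than with the usual logarithmic approximation.

```python
import numpy as np

def average_path_length(n):
    """c(n) = 2 H(n-1) - 2 (n-1) / n, with H(n-1) as an exact harmonic sum."""
    if n <= 1:
        return 0.0
    harmonic = np.sum(1.0 / np.arange(1, n))  # H(n-1) = 1 + 1/2 + ... + 1/(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Isolation forest score: 2 ** (-E[h(x)] / c(n)); needs n >= 2."""
    if n < 2:
        raise ValueError("n must be at least 2 for the score to be defined")
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256
c_n = average_path_length(n)
print(anomaly_score(c_n, n))      # 0.5: path length equal to the average
print(anomaly_score(0.0, n))      # 1.0: isolated immediately
print(anomaly_score(2 * c_n, n))  # 0.25: deeper than average, less anomalous
```

Scores close to 1 indicate points that are isolated after very few splits, while scores well below 0.5 indicate points that sit deep in the trees.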
- Censoring-Aware Detection: $P(\text{anomaly}|x, t, \delta) = P(\text{anomaly}|x, t, \delta, \hat{S}(t))$
- Temporal Consistency: If $t_1 < t_2$ and $\hat{S}(t_1) > \hat{S}(t_2)$, then $s([x|\hat{S}(t_1)|t_1|\delta]) \leq s([x|\hat{S}(t_2)|t_2|\delta])$
- Feature-Survival Interaction: $s([x|\hat{S}(t)|t|\delta]) \neq s(x) + s([\hat{S}(t)|t|\delta])$
- Minimal Dependencies: Only requires numpy
- Efficient: O(log n) average path length per tree, so scoring a sample is fast
- Flexible: Works with any feature type
- Interpretable: Decomposable anomaly scores
from desolate.datasets import DatasetLoader
# Load built-in benchmark dataset
loader = DatasetLoader()
features, durations, events = loader.load_dataset("turbofan")
# Available datasets:
# - turbofan: NASA Turbofan Engine Degradation
# - gbsg2: German Breast Cancer Study
# - bearing: IMS Bearing Dataset
# - valve: Industrial Control Valve
# - support: Study to Understand Prognoses
# - pbc: Primary Biliary Cirrhosis
# - semiconductor: SECOM Manufacturing
# - software: Software Project Survival

from desolate.preprocessing import Preprocessor
# Apply dataset-specific preprocessing
preprocessor = Preprocessor()
features_proc, durations_proc, events_proc = preprocessor.preprocess_turbofan(
    features, durations, events
)

from desolate.anomalies import LocalOutlierInjector
# Inject synthetic anomalies
injector = LocalOutlierInjector()
features_anom, durations_anom, events_anom = injector.inject(
    features, durations, events,
    contamination=0.1
)

The expected path length in the augmented space admits a decomposition, which shows that the model captures:
- Standard feature anomalies
- Survival pattern anomalies
- Joint anomalies in both spaces
Under regularity conditions:
- Consistency of Anomaly Detection: As $n \to \infty$, $P(|s(x_{aug}) - s^*(x_{aug})| > \epsilon) \to 0$ for every $\epsilon > 0$
- Consistency of Survival Estimation: As $n \to \infty$, $\sup_t|\hat{S}(t) - S(t)| \to 0$ in probability
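
The second statement can be checked informally by simulation. The sketch below reuses the exponential event and censoring distributions from the quick-start example, computes a Kaplan-Meier estimate with a hypothetical helper `km_on_grid` (not part of Desolate), and reports the largest gap to the true survival curve $S(t) = e^{-t/50}$ on a fixed grid. It assumes continuous times with no ties, and it illustrates the claim rather than proving it.

```python
import numpy as np

def km_on_grid(times, events, grid):
    """Kaplan-Meier estimate evaluated on a grid (assumes no tied times)."""
    order = np.argsort(times)
    t_sorted = times[order]
    e_sorted = events[order].astype(float)
    at_risk = len(times) - np.arange(len(times))   # n, n-1, ..., 1
    steps = np.cumprod(1.0 - e_sorted / at_risk)   # S_hat just after each time
    idx = np.searchsorted(t_sorted, grid, side="right")
    return np.concatenate([[1.0], steps])[idx]

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 40.0, 81)
true_survival = np.exp(-grid / 50.0)   # event times ~ Exponential(mean 50)

sup_errors = {}
for n in (50, 500, 5000, 50000):
    event_time = rng.exponential(50.0, size=n)
    censor_time = rng.exponential(30.0, size=n)
    observed = np.minimum(event_time, censor_time)
    delta = event_time <= censor_time
    estimate = km_on_grid(observed, delta, grid)
    sup_errors[n] = np.max(np.abs(estimate - true_survival))
    print(n, round(float(sup_errors[n]), 4))
```

The printed gap should generally shrink as `n` grows, roughly at a $1/\sqrt{n}$ rate, although any single run can fluctuate. The grid stops at $t = 40$ because few subjects remain at risk beyond that point under this censoring rate, which makes the tail estimate noisy at small `n`.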
Contributions welcome! Please read our Contributing Guide.
MIT License - see LICENSE for details.
If you use Desolate in your research, please cite:
@software{desolate2024,
  title = {Desolate: Anomaly Detection with Isolation Forests and Survival Analysis},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/desolate}
}

- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining.
- Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association.