This Python class (strangeness.py) is an anomaly/change detector based on the concept of martingales. It is designed to work on unlabelled data (unsupervised anomaly detection). An example of an unlabelled data set is the number of steps a user takes each day. The detector can point out when the number of steps taken on a particular day is out of the ordinary.
I implemented one of the many possible implementations of the concept explained in the paper "Detecting Changes in Unlabeled Data Streams using Martingale" by Shen-Shyang Ho and Harry Wechsler. The basic premise is that, given a list of values, the properties (joint probability) of the set should not change if the elements in the list are permuted. I use the cluster mean for cluster representation.
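As a rough illustration of the idea (a sketch, not the code in strangeness.py), the strangeness of a point can be taken as its distance from the cluster mean, and the newest point can then be ranked against the points seen so far to get a randomised p-value. The function names `strangeness_scores` and `randomized_p_value` are made up for this example.

```python
import math
import random

def strangeness_scores(points):
    """Euclidean distance of each point (a tuple) from the mean of all points."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    return [math.sqrt(sum((p[d] - mean[d]) ** 2 for d in range(dim)))
            for p in points]

def randomized_p_value(scores, rng=random):
    """p-value of the newest score (last element) against all scores so far.

    Counts the scores strictly greater than the newest one, plus a random
    fraction of the ties (the newest score always ties with itself).
    """
    newest = scores[-1]
    greater = sum(1 for s in scores if s > newest)
    equal = sum(1 for s in scores if s == newest)
    theta = rng.random()  # uniform in [0, 1)
    return (greater + theta * equal) / len(scores)

# The last point is far from the rest, so its p-value is small.
points = [(1.0,), (2.0,), (3.0,), (10.0,)]
scores = strangeness_scores(points)
p = randomized_p_value(scores, random.Random(0))
```

If permuting the values really does not change their joint distribution, these p-values should look uniformly spread over [0, 1]. A run of very small p-values suggests the data has changed.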
The input to the code is a file with a header row and data in the following format:
<row Label/ID>,<value1>,<value2>,<value3>...,<valueN>
The row label can be anything that represents the set of values. A timestamp can be a label. The 'value' fields represent the state at that particular label. For a heart rate monitor, the label represents the time of measurement and the values are the light intensity detected by the heart rate sensor. Check the sample data file in the 'test-data' folder. I combined data from 4 different normal distributions to generate this dataset.
Also, the order of the rows is important. Please maintain the original order of records when generating an input dataset.
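A dataset in this format could be generated along the following lines. This is only a sketch: the means, standard deviations, segment sizes, and column names are made-up values, not the ones used for the file in 'test-data'.

```python
import csv
import io
import random

def make_rows(segments, seed=0):
    """Build (label, value) rows from consecutive normal distributions.

    segments is a list of (mean, stddev, count) tuples. The rows are kept
    in generation order, because the detector depends on row order.
    """
    rng = random.Random(seed)
    rows = []
    for mean, stddev, count in segments:
        for _ in range(count):
            rows.append((len(rows), rng.gauss(mean, stddev)))
    return rows

def write_dataset(rows, out):
    """Write a header row followed by one '<label>,<value1>' line per row."""
    writer = csv.writer(out)
    writer.writerow(["label", "value1"])  # hypothetical column names
    writer.writerows(rows)

# Four hypothetical segments of 1000 points each.
segments = [(0.0, 1.0, 1000), (5.0, 1.0, 1000), (2.0, 0.5, 1000), (8.0, 2.0, 1000)]
buffer = io.StringIO()
write_dataset(make_rows(segments), buffer)
```

To write to disk instead of a buffer, pass a file opened with `open(path, "w", newline="")`.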
Once the data file is ready, the code can be run as follows:
python usageExample.py <dataset> <threshold> <minQueueLen> <epsilon>
With sample parameters:
python usageExamples.py ./test-data/martingalesDataWithLabels.csv 10 50 0.92
The three values after the input dataset are as follows:
- threshold: This value decides the sensitivity of the algorithm. A lower value causes more detections.
- minQueueLen: This value decides the minimum number of input values needed before change detection starts. A lower value causes more detections.
- epsilon: This value also decides the sensitivity of the algorithm. It is the epsilon parameter of the randomised power martingale.
The output consists of <row Label/ID> <M Value>. If the M value is greater than the threshold, the algorithm has detected an anomaly.
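To show how epsilon and threshold interact, here is a minimal sketch of the randomised power martingale update, M_n = M_(n-1) * epsilon * p_n^(epsilon - 1), where p_n is the p-value of the n-th point. The function name, the guard against p = 0, and the p-values in the loop are assumptions made for this example. The class itself may compute or reset M differently.

```python
def update_martingale(m_prev, p_value, epsilon):
    """One step of the randomised power martingale.

    A small p-value (an unusual point) multiplies M by a factor above 1.
    A p-value close to 1 shrinks it slightly.
    """
    p_value = max(p_value, 1e-12)  # assumption: guard against p == 0
    return m_prev * epsilon * p_value ** (epsilon - 1.0)

# Made-up p-values: one ordinary point followed by three unusual ones.
m = 1.0
history = []
for p in [0.5, 0.01, 0.02, 0.005]:
    m = update_martingale(m, p, 0.92)
    history.append(m)
```

An anomaly is reported once M rises above the threshold, which is why a lower threshold leads to more detections.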
1. Download the class file strangeness.py into your code folder.
2. Add the import statement:
       from strangeness import Strangeness
3. Create the class object:
       strangeness = Strangeness(threshold, minQueueLen, epsilon)
4. Next, pass the tuple of values (no label) to get the M value:
       strangeness.getMValue(valuesTuple)
   Step 4 is to be repeated for each data point.
5. If the value returned by getMValue() is greater than the threshold, then the algorithm has detected an anomaly.
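Put together, the steps above might look like the sketch below. It assumes that getMValue() returns a plain number and that the input follows the file format described earlier. The helper name `find_anomalies` is made up for this example.

```python
import csv

def find_anomalies(csv_lines, detector, threshold):
    """Feed each row to the detector and collect rows whose M value exceeds threshold.

    csv_lines: an open file or any iterable of CSV text lines (header first).
    detector:  any object with a getMValue(valuesTuple) method.
    """
    reader = csv.reader(csv_lines)
    next(reader)  # skip the header row
    flagged = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        label = row[0]
        values = tuple(float(v) for v in row[1:])  # values only, no label
        m_value = detector.getMValue(values)
        if m_value > threshold:
            flagged.append((label, m_value))
    return flagged

# Intended use with the real class (not run here):
# from strangeness import Strangeness
# with open("./test-data/martingalesDataWithLabels.csv", newline="") as f:
#     flagged = find_anomalies(f, Strangeness(10, 50, 0.92), 10)
```

Rows are processed in file order, which matches the requirement that the original order of records is maintained.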
For the above parameters, the code takes around 6 seconds to process 4000 points.
For each new point, the cluster mean and distances are recalculated for minQueueLen points, so keeping the queue length short will reduce runtime.
For each new problem, we need to find the optimum parameters (threshold, minQueueLen, and epsilon) that result in the lowest number of false positives (false detections) and false negatives (missed detections). I have put up an example of how parameter tuning can be performed using an example dataset consisting of S&P500 opening values for each day. The date serves as the label for this dataset. More details are here
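One simple way to search for parameters is a grid search that scores each combination by false positives plus false negatives. The sketch below is not the tuning procedure from the S&P500 example. It assumes you already know the row indices of the true changes, and that you supply a hypothetical function `run(threshold, minQueueLen, epsilon)` that returns the row indices where the detector fired.

```python
from itertools import product

def count_errors(detected, actual, tolerance):
    """Return (false_positives, false_negatives) for two lists of row indices.

    A detection matches a true change if it occurs at the change or up to
    `tolerance` rows after it.
    """
    false_pos = sum(1 for d in detected
                    if not any(0 <= d - a <= tolerance for a in actual))
    false_neg = sum(1 for a in actual
                    if not any(0 <= d - a <= tolerance for d in detected))
    return false_pos, false_neg

def grid_search(run, actual, thresholds, queue_lens, epsilons, tolerance=20):
    """Try every parameter combination and return (total_errors, best_params)."""
    best = None
    for params in product(thresholds, queue_lens, epsilons):
        false_pos, false_neg = count_errors(run(*params), actual, tolerance)
        total = false_pos + false_neg
        if best is None or total < best[0]:
            best = (total, params)
    return best
```

Weighting false positives and false negatives equally is a design choice made for brevity. If missed detections are costlier for your problem, weight them more heavily in the total.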
The code is my personal work and does not in any way represent my employer