udayankumar/anomaly-detection


Introduction

This Python class (strangeness.py) is an anomaly/change detector based on the concept of martingales. It is designed to work on unlabelled data (unsupervised anomaly detection). An example of an unlabelled dataset is the number of steps a user takes each day. The detector can point out when the number of steps taken on a particular day is out of the ordinary.

I implemented one of the many possible implementations of the concept explained in the paper Detecting Changes in Unlabeled Data Streams using Martingale by Shen-Shyang Ho and Harry Wechsler. The basic premise is that, given a list of values, the properties (joint probability) of the set should not change if the elements of the list are permuted. I use the cluster mean for cluster representation.
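The idea can be illustrated with a minimal sketch of a randomised power martingale that uses distance from the cluster mean as the strangeness measure. This is not the code in strangeness.py: the function names are hypothetical, the sketch scores every point seen so far, and it leaves out the minimum queue length and the reset after a detection.

```python
import math
import random


def strangeness_scores(points):
    """Distance of every point from the mean of all points (the cluster centre)."""
    dim = len(points[0])
    centre = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [math.dist(p, centre) for p in points]


def randomised_p_value(scores, rng):
    """Share of scores stranger than the newest one, with random tie-breaking."""
    newest = scores[-1]
    greater = sum(1 for s in scores if s > newest)
    equal = sum(1 for s in scores if s == newest)  # counts the newest score itself
    return (greater + rng.random() * equal) / len(scores)


def power_martingale_step(m_prev, p_value, epsilon):
    """One update: M_n = M_(n-1) * epsilon * p_n ** (epsilon - 1)."""
    p_value = max(p_value, 1e-12)  # guard against a p-value of exactly zero
    return m_prev * epsilon * p_value ** (epsilon - 1.0)


def martingale_values(points, epsilon=0.92, seed=0):
    """Return the martingale value after each point, starting from M_0 = 1."""
    rng = random.Random(seed)
    m_value = 1.0
    seen, history = [], []
    for point in points:
        seen.append(point)
        p_value = randomised_p_value(strangeness_scores(seen), rng)
        m_value = power_martingale_step(m_value, p_value, epsilon)
        history.append(m_value)
    return history
```

While the data stay exchangeable the p-values are roughly uniform and the martingale stays small; a run of unusually strange points produces small p-values, each of which multiplies the martingale by a factor greater than one.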

Usage

The input to the code is a file with a header row and data in the following format

<row Label/ID>,<value1>,<value2>,<value3>...,<valueN>

The row label can be anything that represents the set of values; a timestamp, for example, can be a label. The 'value' fields represent the state at that particular label. For a heart rate monitor, the label represents the time of measurement and the values are the light intensities detected by the heart rate sensor. Check the sample data file in the 'test-data' folder. I combined data from 4 different normal distributions to generate this dataset.

Also, the order of the rows is important; please maintain the original order of records when generating an input dataset.
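As an illustration of this format, the sketch below reads such a file in order, skips the header row, and splits each record into a label and a tuple of float values. The function name `read_rows` and the inline sample data are made up for the example; the repository's own reader may work differently.

```python
import csv
import io


def read_rows(handle):
    """Yield (label, values) pairs in file order, skipping the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # discard the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        label, *raw_values = row
        yield label, tuple(float(v) for v in raw_values)


# Hypothetical two-row dataset with a header, standing in for a real file.
sample = io.StringIO("time,v1,v2\nt0,1.0,2.5\nt1,0.5,3.0\n")
rows = list(read_rows(sample))
```

For a file on disk, the same function can be given the object returned by `open(path, newline="")`.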

Concrete usage example

Once the data file is ready, the code can be run as follows:

python usageExample.py <dataset> <threshold> <minQueueLen> <epsilon>

With sample parameters:

python usageExamples.py ./test-data/martingalesDataWithLabels.csv 10 50 0.92

The 3 values after the input dataset are as follows:

  • threshold - This value decides the sensitivity of the algorithm. A lower value causes more detections.

  • minQueueLen - This value decides the minimum number of input values required before change detection starts. A lower value causes more detections.

  • epsilon - This value is the parameter of the randomised power martingale (between 0 and 1) and also affects the sensitivity of the algorithm.

The output consists of <row Label/ID> <M Value> for each row. If the M value is greater than the threshold, the algorithm has detected an anomaly.

Adding in your python code

  1. Download the class file strangeness.py into your code folder

  2. Add the import statement from strangeness import Strangeness

  3. Create the class object - strangeness = Strangeness(threshold, minQueueLen, epsilon)

  4. Next, pass the tuple of values (no label) to get the M value - strangeness.getMValue(valuesTuple)

  5. Repeat step 4 for each data point, in the original order.

  6. If the value returned by getMValue() is greater than the threshold then the algorithm has detected an anomaly.

Performance

For the above parameters, the code takes around 6 seconds to process 4000 points. For each new point, the cluster mean and the distances are recalculated for minQueueLen points, so keeping the queue length short will reduce runtime.

Parameter Tuning

For each new problem, we need to find the optimum parameters (threshold, minQueueLen, and epsilon) that result in the lowest number of false positives (false detections) and false negatives (missed detections). I have put up an example of how parameter tuning can be performed, using an example dataset of S&P500 opening values for each day. The date serves as the label for this dataset. More details are here
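One simple way to approach such a search is an exhaustive grid over candidate values, scoring each combination by false positives plus false negatives against known change points. The sketch below is a generic illustration and not the procedure used in the linked example. `run_detector` is a hypothetical callable that runs the detector with the given parameters and returns the labels it flagged, and matching labels exactly is a simplifying assumption.

```python
from itertools import product


def grid_search(run_detector, true_changes, thresholds, queue_lens, epsilons):
    """Return (errors, params) for the combination with the fewest FP + FN."""
    truth = set(true_changes)
    best = None
    for params in product(thresholds, queue_lens, epsilons):
        detected = set(run_detector(*params))
        false_positives = len(detected - truth)   # flagged but not a real change
        false_negatives = len(truth - detected)   # real change that was missed
        errors = false_positives + false_negatives
        if best is None or errors < best[0]:
            best = (errors, params)
    return best
```

In practice a detection usually lags the true change by a few points, so a tolerance window around each known change point may be more useful than exact matching.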

Disclaimer

The code is my personal work and does not in any way represent my employer.

About

Anomaly detection in less than 60 loc!
