fringewidth/raven

RAVEN: Redundancy Analysis via Elimination Networks

An ultra-fast tool for reducing the attributes (features) of an insanely large dataset without hurting dataset quality. It identifies clusters of linearly related (and therefore redundant) features and, within each cluster, keeps only the feature that is 'nearest' to all the others.

Tested on huge datasets, and mathematically sound. Read the unfinished draft here.
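
To make the idea above concrete, here is a minimal sketch of one way such a cluster-and-keep step could look. It is not the code in raven.py: the function name `sketch_redundant_features`, the 0.95 threshold, the use of absolute Pearson correlation as the measure of "linearly related", and the use of mean correlation as the measure of "nearness" are all assumptions made for illustration.

```python
import networkx as nx
import numpy as np
import pandas as pd


def sketch_redundant_features(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Illustrative only (hypothetical helper, not raven's implementation)."""
    # Absolute Pearson correlation between every pair of numeric columns.
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)

    # One node per feature; an edge wherever |r| meets the assumed threshold.
    graph = nx.Graph()
    graph.add_nodes_from(cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                graph.add_edge(a, b)

    redundant = []
    # Each connected component is treated as a cluster of related features.
    for cluster in nx.connected_components(graph):
        if len(cluster) < 2:
            continue  # an isolated feature is not redundant with anything
        members = sorted(cluster)
        # Keep the member with the highest mean |r| to the rest of its cluster
        # (an assumed stand-in for "nearest to all other features").
        keep = corr.loc[members, members].mean(axis=1).idxmax()
        redundant.extend(m for m in members if m != keep)
    return redundant


if __name__ == "__main__":
    # Small synthetic dataset: y and z are (near-)linear functions of x.
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    demo = pd.DataFrame({
        "x": x,
        "y": 2 * x + 1,
        "z": 3 * x + rng.normal(scale=0.1, size=200),
        "w": rng.normal(size=200),
    })
    print(sketch_redundant_features(demo))
```

The pairwise loop is quadratic in the number of features, so this sketch is meant for understanding the approach rather than for the very large datasets Raven targets.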

Dependencies

Make sure you have pandas, NumPy and NetworkX installed. You can install them with pip:

pip install pandas numpy networkx

Usage

To use Raven, download the raw raven.py file and import it:

from raven import raven

Once it is imported, you can identify redundant features. Here's an example:

import pandas as pd

really_huge_dataset = pd.read_csv('./really_huge_dataset.csv')

redundant_features = raven(really_huge_dataset)

smaller_dataset = really_huge_dataset.drop(columns=redundant_features)

About

A graph-based alternative to PCA for feature selection
