An ultra-fast tool that reduces the number of attributes (features) in a very large dataset without degrading dataset quality. It does this by identifying clusters of linearly related (and therefore redundant) features and keeping only the feature 'nearest' to all the other features in each cluster.
Tested on huge datasets, and mathematically sound. Read the unfinished draft here.
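To make the clustering idea above concrete, here is a minimal sketch using pandas and NetworkX. It is only an illustration of the approach, not Raven's actual implementation: the 0.95 correlation threshold, the function name redundant_feature_sketch, and the use of total correlation as the 'nearness' measure are assumptions made for the example.

import pandas as pd
import networkx as nx

def redundant_feature_sketch(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Illustrative only: find clusters of strongly correlated features and
    mark every feature except each cluster's most central one as redundant."""
    corr = df.corr().abs()                       # pairwise |correlation|; assumes numeric columns
    graph = nx.Graph()
    graph.add_nodes_from(corr.columns)
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] >= threshold:      # linearly related -> redundant pair
                graph.add_edge(a, b)

    redundant = []
    for cluster in nx.connected_components(graph):
        if len(cluster) < 2:
            continue
        # keep the feature with the highest total correlation to the rest of its cluster
        keeper = max(cluster, key=lambda f: corr.loc[f, list(cluster - {f})].sum())
        redundant.extend(cluster - {keeper})
    return redundant

This brute-force version is quadratic in the number of features; Raven's whole point is to perform this kind of reduction quickly on very large datasets.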
Make sure you have Pandas, NumPy and NetworkX installed. You can install these packages with pip:
pip install pandas numpy networkx
To use Raven, simply download the raw raven.py file and import it:
from raven import raven
Once you have it imported, you can identify redundant features. Here's an example usage:
import pandas as pd

really_huge_dataset = pd.read_csv('./really_huge_dataset.csv')
redundant_features = raven(really_huge_dataset)   # labels of the redundant columns
smaller_dataset = really_huge_dataset.drop(columns=redundant_features)
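As a quick sanity check (raven returns column labels, which is what the .drop(columns=...) call above expects), you can confirm that only the feature count shrinks while the row count stays the same:

print(really_huge_dataset.shape)   # (rows, original feature count)
print(smaller_dataset.shape)       # same rows, fewer features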