Abstract: In this project, we analyze and compare two sparsity-inducing techniques, the Least Absolute Shrinkage and Selection Operator (LASSO) and the Relevance Vector Machine (RVM), for selecting strong predictors in noisy environments. In financial systems, where signal-to-noise ratios are very low and collinearity is typically high, finding a stable subset of features is an essential step toward model generalization. We begin with an implementation of LASSO on synthetic data in which we plant collinearity ($f1 \approx 0.75 f0$) to show that it recovers the sparse ground truth, and we use Stability Selection to assess how robust the selected features are. While RVM offers a richer probabilistic framework through Automatic Relevance Determination, we find that LASSO's convex optimization path provides the deterministic stability needed for reliable inference.
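The setup described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual `synthData.py`: the feature count, noise level, true coefficients, and the `alpha=0.1` penalty are all assumptions chosen to mirror the planted collinearity $f1 \approx 0.75 f0$ and the sparse support {f0, f2, f4}.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.standard_normal((n, p))
# Plant collinearity: f1 is a scaled, slightly noisy copy of f0.
X[:, 1] = 0.75 * X[:, 0] + 0.05 * rng.standard_normal(n)

# Sparse ground truth on {f0, f2, f4}; coefficient values are illustrative.
true_coef = np.zeros(p)
true_coef[[0, 2, 4]] = [2.0, -1.5, 1.0]
y = X @ true_coef + 0.5 * rng.standard_normal(n)  # noisy target

model = Lasso(alpha=0.1).fit(X, y)
selected = sorted(int(j) for j in np.flatnonzero(np.abs(model.coef_) > 1e-6))
print("selected features:", [f"f{j}" for j in selected])
```

With a penalty of this size, the pure-noise features are driven to exactly zero while the true support survives, which is the recovery behavior the abstract refers to.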
The true signal f0 (green) is selected early, while the correlated noise f1 (red) is suppressed until the penalty is negligible.
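The path behavior in that figure can be reproduced with a minimal sketch (illustrative two-feature data, not the project's script): along the LASSO regularization path, f0 becomes nonzero at a much larger penalty than its collinear copy f1.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n = 500
f0 = rng.standard_normal(n)
f1 = 0.75 * f0 + 0.05 * rng.standard_normal(n)  # planted collinearity
X = np.column_stack([f0, f1])
y = 2.0 * f0 + 0.5 * rng.standard_normal(n)     # only f0 carries signal

# lasso_path returns alphas in decreasing order and one coefficient
# path per feature (shape: n_features x n_alphas).
alphas, coefs, _ = lasso_path(X, y)

enter_alpha = []
for c in coefs:
    nz = np.flatnonzero(np.abs(c) > 1e-6)
    # Entry point = largest penalty at which the coefficient is nonzero
    # (0.0 if the feature never enters the path).
    enter_alpha.append(float(alphas[nz].max()) if nz.size else 0.0)

print("f0 enters at alpha =", enter_alpha[0])
print("f1 enters at alpha =", enter_alpha[1])
```

Because f1 is only a noisy rescaling of f0, it explains almost nothing once f0 is in the model, so its entry point sits far down the path, if it enters at all.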
Model successfully recovers the sparse ground truth {f0, f2, f4}.
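The Stability Selection step mentioned in the abstract could look like the sketch below: refit LASSO on many random subsamples and keep only features whose selection frequency clears a threshold. The subsample size, the 0.8 threshold, and the data-generating details are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 500, 6
X = rng.standard_normal((n, p))
X[:, 1] = 0.75 * X[:, 0] + 0.05 * rng.standard_normal(n)  # f1 ~ 0.75 * f0
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + X[:, 4] + 0.5 * rng.standard_normal(n)

B = 100
freq = np.zeros(p)
for _ in range(B):
    # Half-size subsample without replacement, as in stability selection.
    idx = rng.choice(n, size=n // 2, replace=False)
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    freq += np.abs(coef) > 1e-6
freq /= B

stable = sorted(int(j) for j in np.flatnonzero(freq >= 0.8))  # 0.8 threshold
print("stable features:", [f"f{j}" for j in stable])
```

Features that are only selected opportunistically on particular subsamples (pure noise, or the redundant collinear copy) fail to reach a high selection frequency, which is what makes the surviving subset stable.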
- Navigate to the code directory:

      cd sparse-feature-selection

- Install dependencies:

      python3 -m pip install numpy pandas scikit-learn matplotlib

- Generate synthetic data:

      python3 src/synthData.py

- Run the demo notebook:

      jupyter notebook notebook/demo_lasso.ipynb