This repository contains the code for
- two papers,
- (parts of) a dissertation,
- and the Python package alfese.
This document provides:
- An overview of the related publications.
- An outline of the repo structure.
- Steps for setting up a virtual environment and reproducing the experiments.
Bach, Jakob, and Klemens Böhm (2024): "Alternative feature selection with user control"
is published in the International Journal of Data Science and Analytics.
You can find the paper here.
You can find the corresponding complete experimental data (inputs as well as results) on RADAR4KIT.
Use the tags run-2023-06-23 and evaluation-2024-03-19 for reproducing the experiments.
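For example, after cloning this repository, you could check out the corresponding code states with git (each tag marks the code version used for the respective stage):

```
git checkout run-2023-06-23         # code state used to run the experiments
git checkout evaluation-2024-03-19  # code state used to evaluate the results
```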
Bach, Jakob (2023): "Finding Optimal Diverse Feature Sets with Alternative Feature Selection"
is published on arXiv.
You can find the paper here.
You can find the corresponding complete experimental data (inputs as well as results) on RADAR4KIT.
Use the tags run-2023-06-23 and evaluation-2023-07-04 for reproducing the experimental data for v1 of the paper.
Use the tags run-2024-01-23 and evaluation-2024-02-01 for reproducing the experimental data for v2 of the paper.
Use the tags run-2024-09-28-arXiv-v3 and evaluation-2024-12-08-arXiv-v3 for reproducing the experimental data for v3 of the paper.
Bach, Jakob (2025): "Leveraging Constraints for User-Centric Feature Selection"
is a dissertation at the Department of Informatics of the Karlsruhe Institute of Technology.
You can find the dissertation here.
You can find the corresponding complete experimental data (inputs as well as results) on RADAR4KIT.
Use the tags run-2024-09-28-dissertation and evaluation-2024-11-02-dissertation for reproducing the experiments.
Currently, the repository contains six Python files and four non-code files. The non-code files are:
- `.gitignore`: For Python development.
- `LICENSE`: The software is MIT-licensed, so feel free to use the code.
- `README.md`: You are here 🙃
- `requirements.txt`: To set up an environment with all necessary dependencies; see below for details.
The code files comprise our experimental pipeline (see below for details):
- `prepare_datasets.py`: First stage of the experiments (download prediction datasets).
- `run_experiments.py`: Second stage of the experiments (run feature selection, search for alternatives, and make predictions).
- `run_evaluation_(arxiv|dissertation|journal).py`: Third stage of the experiments (compute statistics and create plots).
- `data_handling.py`: Functions for working with prediction datasets and experimental data.
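For orientation, the three pipeline stages are invoked in this order (the `journal` evaluation is used here just as an example; see below for environment setup and details on each step):

```
python -m prepare_datasets
python -m run_experiments
python -m run_evaluation_journal
```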
Additionally, we have organized the (alternative) feature-selection methods for our experiments
as the standalone Python package alfese, located in the directory alfese_package/.
See the corresponding README for more information.
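For example, if you want to use the feature-selection methods independently of our experimental pipeline, a local install with pip should work (assuming the package directory ships standard packaging metadata, i.e., a setup.py or pyproject.toml):

```
python -m pip install ./alfese_package/
```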
Before running the scripts to reproduce the experiments, you should
- Set up an environment (optional but recommended).
- Install all necessary dependencies.
Our code is implemented in Python (version 3.8; other versions, including lower ones, might work as well).
If you use conda, you can directly install the correct Python version into a new conda environment
and activate the environment as follows:
```
conda create --name <conda-env-name> python=3.8
conda activate <conda-env-name>
```

Choose <conda-env-name> as you like.
To leave the environment, run
```
conda deactivate
```

We used virtualenv (version 20.4.7; other versions might work as well)
to create an environment for our experiments.
First, you need to install the correct Python version yourself.
Let's assume the Python executable is located at <path/to/python>.
Next, you install virtualenv with
```
python -m pip install virtualenv==20.4.7
```

To set up an environment with virtualenv, run

```
python -m virtualenv -p <path/to/python> <path/to/env/destination>
```

Choose <path/to/env/destination> as you like.
Activate the environment in Linux with
```
source <path/to/env/destination>/bin/activate
```

Activate the environment in Windows (note the back-slashes) with

```
<path\to\env\destination>\Scripts\activate
```

To leave the environment, run

```
deactivate
```

After activating the environment, you can use python and pip as usual.
To install all necessary dependencies for this repo, run
```
python -m pip install -r requirements.txt
```

If you make changes to the environment and you want to persist them, run

```
python -m pip freeze > requirements.txt
```

After setting up and activating an environment, you are ready to run the code. Run
```
python -m prepare_datasets
```

to download and pre-process the input data for the experiments (prediction datasets from PMLB).
Alternatively, you can use the folder datasets/ from the experimental data (for the paper version of your choice) linked above.
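For example, assuming you extracted the downloaded experimental data to a (hypothetical) directory <path/to/downloaded/data>, copying the input datasets into this repository on Linux could look like this:

```
cp -r <path/to/downloaded/data>/datasets/ .
```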
Next, start the experimental pipeline with
```
python -m run_experiments
```

Depending on your hardware, this might take several days.
For the last pipeline run, we had a runtime of 141 hours on a server with an AMD EPYC 7551
CPU (32 physical cores, base clock of 2.0 GHz).
In case the pipeline is nearly finished but doesn't make progress anymore,
the solver might have silently crashed (which happened in the past with Cbc as the solver, though
we didn't encounter the phenomenon with the current solver SCIP).
In this case, or if you had to abort the experimental run for other reasons, you could re-start the
experimental pipeline by calling the same script again; it automatically detects existing results
and only runs the remaining tasks.
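If you run the pipeline on a remote Linux machine, you might additionally want a (re-)started multi-day run to survive closing your shell, e.g., by detaching it with nohup (just a suggestion; a plain re-invocation works as well):

```
nohup python -m run_experiments &
```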
To print statistics and create the plots, run
```
python -m run_evaluation_<<version>>
```

with <<version>> being one of arxiv, dissertation, or journal.
(The evaluation length differs between versions, as does the plot formatting. The arXiv version has the longest and most detailed evaluation.)
All scripts have a few command-line options, which you can see by running the scripts like
```
python -m prepare_datasets --help
```