This repository contains the code and instructions to reproduce the experiments and results presented in the paper Fair-OBNC: Correcting Label Noise for Fairer Datasets.
This section details how to replicate the experiments and obtain the results presented in the paper.
The first step is to install the Aequitas Flow package:
pip install git+https://github.com/dssg/aequitas.git
Then, one can download the necessary data by running:
# To store the necessary data
>>> from generate_data import generate_data
>>> generate_data({"BankAccountFraud": ["TypeII"]})
Finally, this repository includes the configuration files used in our experiments, so the only step left is to run the fairobnc_experiment.py script:
# To run the experiments with the multiple injected noise scenarios
$ python -m fairobnc_experiment baf typeii noise_injection_experiment --noise_injection
# To run the experiments without noise injection
$ python -m fairobnc_experiment baf typeii noise_injection_experiment
If you wish to test our method in additional scenarios, our framework can be used to set up more cases.
The generate_data function loads the desired datasets from Aequitas, generates their IID versions, and injects noise into the labels, storing the files needed by the IIDDataset and NoisyDataset classes.
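As a rough illustration of the label-noise step, the sketch below flips binary labels with a group-dependent probability, restricted to a chosen class. It is not the repository's implementation: the function name inject_label_noise, the toy data, and the per-instance random flipping are assumptions made for this example (the actual code may, for instance, flip an exact fraction of instances instead).

```python
import random


def inject_label_noise(labels, groups, flip_rates, noisy_classes, seed=0):
    """Return a copy of `labels` with group-dependent label flips.

    Hypothetical helper, not part of this repository.
    Only instances whose label is in `noisy_classes` can be flipped; each
    such instance is flipped with probability `flip_rates[group]`.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    for i, (y, g) in enumerate(zip(labels, groups)):
        if y in noisy_classes and rng.random() < flip_rates.get(g, 0.0):
            noisy[i] = 1 - y  # binary labels: 0 <-> 1
    return noisy


# Toy data (assumption): random binary labels and sensitive-group membership
data_rng = random.Random(42)
labels = [data_rng.randint(0, 1) for _ in range(10_000)]
groups = [data_rng.randint(0, 1) for _ in range(10_000)]

# Flip only negative-class labels: 5% in group 0 and 20% in group 1
noisy_labels = inject_label_noise(labels, groups, {0: 0.05, 1: 0.20}, [0])
```

With these rates, roughly 5% of the negative-class instances in group 0 and 20% of those in group 1 end up with a positive label, while positive-class instances are left untouched.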
# To store the necessary data
>>> from generate_data import generate_data
>>> generate_data({"BankAccountFraud": ["TypeII"]})
# To load an IID dataset
>>> from datasets import IIDDataset
>>> iid_dataset = IIDDataset("BankAccountFraud", "TypeII")
>>> iid_dataset.load_data()
>>> iid_dataset.create_splits()
# To load a noisy dataset where noise is applied only to instances of the negative class,
# flipping 5% of the instances in the negative sensitive group and 20% of those in the positive group
>>> from datasets import NoisyDataset
>>> noisy_dataset = NoisyDataset("BankAccountFraud", "TypeII", {0:0.05, 1:0.20}, [0])
>>> noisy_dataset.load_data()
>>> noisy_dataset.create_splits()
The configs folder is organized into two subfolders, following the Aequitas experiment logic:
- methods, which contains the config files for each of the preprocessing methods being analyzed
- datasets, which contains the config files for each noisy version of the datasets used. These configs can be generated automatically by calling the generate_dataset_configs function:
>>> from generate_configs import generate_dataset_configs
>>> generate_dataset_configs({"BankAccountFraud":["TypeII"]})
Each specific type of injected noise must be run as a separate experiment so that the same hyperparameters are sampled in each trial.
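The reasoning behind separate experiments can be sketched with a seeded sampler. This is only an illustration, assuming the experiment framework seeds its hyperparameter sampler at the start of each experiment; that assumption is not confirmed here. The sample_trials helper, the scenario names, the seed, and the parameter ranges are all hypothetical.

```python
import random


def sample_trials(seed, n_trials):
    """Sample hyperparameter sets from a freshly seeded generator (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        {
            "learning_rate": 10 ** rng.uniform(-3, -1),  # illustrative range
            "num_leaves": rng.randint(8, 128),           # illustrative range
        }
        for _ in range(n_trials)
    ]


scenarios = ["no_noise", "group_dependent_noise"]  # hypothetical scenario names

# Separate experiments: each one restarts the sampler from the same seed,
# so trial k uses the same hyperparameters in every noise scenario.
separate = {name: sample_trials(seed=42, n_trials=3) for name in scenarios}

# Single combined run: the second scenario continues the same random stream,
# so its trials receive different hyperparameters from the first scenario.
stream = sample_trials(seed=42, n_trials=6)
combined = {scenarios[0]: stream[:3], scenarios[1]: stream[3:]}
```

In the separate setup, differences in results between scenarios can be attributed to the injected noise rather than to different sampled hyperparameters.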
The experiment config files can be generated using the generate_experiment_files function:
>>> from generate_configs import generate_experiment_files
>>> generate_experiment_files(
... methods = ["lightgbm", "OBNC", "Fair-OBNC", "PrevalenceSampling"],
... variants = {"BankAccountFraud":["TypeII"]},
... noise_injection = True,
... n_trials = 50,
... )
After setting up all the data and config files, one can run the fairobnc_experiment.py script to run the experiments:
$ python -m fairobnc_experiment baf typeii noise_injection_experiment --noise_injection
The result_analysis.py file defines the functions used to analyze the obtained results and generate the plot presented in the paper.