
Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction

Access the dataset through the Air Quality Sensor Data Repository (AQ-SDR) here.

If you use either this work or the AQ-SDR dataset, please make sure to cite:

@misc{Yahia2025Veli,
      title={Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction}, 
      author={Yahia Dalbah and Marcel Worring and Yen-Chia Hsu},
      year={2025},
      eprint={2508.02724},
      archivePrefix={arXiv},
      primaryClass={eess.SP},
      url={https://arxiv.org/abs/2508.02724}, 
}

Basic information:

For the training and testing scripts, make sure to specify the required data directory: train_veli.py requires the EU data directory, fine_tune_taiwan.py requires the Taiwanese out-of-distribution data, and test_veli.py works with either.

The requirements.txt file has all the libraries required to run this model. We also provide the packages through a pyproject.toml file (Poetry) for your convenience.

The dataset is large, but the model is lightweight, so you can run it as long as you have at least 500 MB of VRAM.

Make sure to download the dataset and follow the instructions in the AQ-SDR repository; its shell script creates the dataset in the format required for this repository. Here we highlight the part that matters for Veli:

The dataset paths should have the following two directory trees (disregarding all other preprocessing):

EU Data:

final_dir_eudata
|
.
.
.

└── final_dataset
    ├── prepared_lcs_bulk
    └── pre_prepared_datasets_unfiltered

final_dataset contains the data that we used in modeling Veli. prepared_lcs_bulk contains the LCS data without reference station data for unsupervised training. pre_prepared_datasets_unfiltered contains files with reference stations used for verification purposes.

OOD Data:

final_dir_ood
|
.
.
.

└── final_dataset
    ├── ood_data_bulk
    └── prepared_ood_datasets

final_dataset contains the data that we used in modeling Veli. ood_data_bulk contains the LCS data without reference station data for unsupervised training. prepared_ood_datasets contains files with reference stations used for verification purposes.
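
Before training, it can help to verify that a dataset directory matches one of the two layouts above. The following is a minimal sketch (the helper check_dataset_dir is ours for illustration, not part of the repository's scripts):

from pathlib import Path

# Expected immediate children of final_dataset for each layout
EXPECTED = {
    "eu": {"prepared_lcs_bulk", "pre_prepared_datasets_unfiltered"},
    "ood": {"ood_data_bulk", "prepared_ood_datasets"},
}

def check_dataset_dir(path):
    """Return 'eu' or 'ood' if path matches a known layout, else raise."""
    children = {p.name for p in Path(path).iterdir() if p.is_dir()}
    for kind, expected in EXPECTED.items():
        if expected <= children:  # expected folders are a subset of what is on disk
            return kind
    raise FileNotFoundError(f"{path} is missing the expected subfolders; found {sorted(children)}")

print(check_dataset_dir("/path_to_dataset/final_dir_eudata/final_dataset"))  # -> "eu"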

The scripts train_veli.py and test_veli.py are what you will need to run and experiment with the models.

Below you will find a detailed explanation of how to set up and run the environment, for additional help.

The following premade config files can be used as follows:

  • ./configs/Veli.json corresponds to the pre-trained model in our paper, with the same hyperparameters
  • ./configs/Veli_train_from_scratch.json can be used to re-train the model with the same hyperparameters
  • ./configs/Veli_taiwan.json has the fine-tuned model for out-of-distribution (OOD) data from Taiwan.
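
Since the configs are plain JSON, you can inspect a premade one before launching a run, for example (a minimal sketch; the keys are documented in the config reference at the end of this README):

import json

# Load one of the premade configs and print a few hyperparameters
with open("./configs/Veli_train_from_scratch.json") as f:
    cfg = json.load(f)

for key in ("hidden_dim", "latent_dim", "beta_z", "beta_y", "learning_rate"):
    print(key, "=", cfg[key])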

Environment Setup

All packages and dependencies used are available via two options:

  • requirements.txt
  • pyproject.toml (Poetry)

This guide assumes you are using a bash-based shell (Linux/macOS). You can install all necessary packages EXACTLY as used in our work with the following commands:

1. Using Python Virtual Environment

We recommend approach (a) because it is the easiest and fastest. For the strongest guarantee of reproducibility, approach (b) is recommended but not necessary. Approach (c) is untested but easy as well.

a) With requirements.txt

# Create and activate a virtual environment
python3 -m venv veli
source veli/bin/activate   # On Windows: veli\Scripts\activate
pip install -r requirements.txt

b) With pyproject.toml (Poetry)

Install Poetry if it is not installed (feel free to create a virtual environment beforehand):

pip install poetry

Create the virtual environment and install the dependencies, then activate the shell:

poetry install

source $(poetry env info --path)/bin/activate

Sourcing the activate file will start the virtual environment. Alternatively, you can run poetry env list --full-path and activate the environment from there.

c) Using Conda

conda create -n veli python=3.10

conda activate veli

pip install -r requirements.txt
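
Whichever option you choose, a quick check that PyTorch sees your GPU can save a failed run later. This sketch assumes only that PyTorch is installed by the pinned dependencies (it is, since the model is implemented in PyTorch):

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # The model is lightweight; ~500 MB of free VRAM is enough (see above)
    free, total = torch.cuda.mem_get_info()
    print(f"free VRAM: {free / 2**20:.0f} MiB of {total / 2**20:.0f} MiB")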

Details

Inside the /model/ folder you will find:

  • The PyTorch model and model building files are all available in Veli.py. u in the model building refers to $\psi$ in our paper.
  • The loss functions among other model building supports are in the model_utils.py file.
  • The functions to create dataset tensors and dataloaders are in the tensor_creation.py file.
  • Support functions for model builders and learning utilities are in model_builders.py and learning_utils.py, respectively.

Training

Training from scratch

To run the training, use the train_veli.py script. You must specify a configuration file from ./configs/. If you wish to train from scratch, you can use the ./configs/Veli_train_from_scratch.json config file. An explanation of the config files is given below.

You can use the following template on bash to run it:

python train_veli.py --config_path "./configs/Veli_train_from_scratch.json"

You can also pass the additional arguments stored_models_path and logs_path to specify where the trained models and log files are stored. By default they go in the repository directory under /trained_models/ and /logs/, respectively. This repository contains a pre-trained weight in ./trained_models/Veli.pth and a fine-tuned weight in ./trained_models/Veli_taiwan.pth.

You have to define the directory of the data inside the config file at ./configs/Veli.json under the key dataset_path. The dataset path should be the folder that contains the following two folders: prepared_lcs_bulk and pre_prepared_datasets_unfiltered.
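
If you prefer to set this programmatically rather than editing the JSON by hand, something like the following works (a minimal sketch using the dataset_path key described in the config reference below):

import json

config_path = "./configs/Veli.json"
with open(config_path) as f:
    cfg = json.load(f)

# Point dataset_path at the folder holding prepared_lcs_bulk and
# pre_prepared_datasets_unfiltered
cfg["dataset_path"] = "/path_to_dataset/final_dir_eudata/final_dataset"

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=4)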

Fine-tuning for out of distribution data

To fine-tune the model on out-of-distribution data, use:

python fine_tune_taiwan.py --config_path "./configs/Veli_taiwan.json" --ood "/path_to_dataset/final_dir_ood/final_dataset"

Inside the config file, you have to define the key pretrained_config_path so the script can find the pre-trained weights that you want to fine-tune.

Similarly, you can define stored_models_path and logs_path, as in training.

Testing

To test a pre-trained model, use:

python test_veli.py --config_path "./configs/Veli.json"

You can also use the argument stored_models_path to indicate where the weights are stored; otherwise the script will look in ./trained_models/.

IMPORTANT: The script relies on finding a model with the same name as the config file. Make sure the config file name matches the name of the model you are seeking.
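
In other words, the weights are looked up by the config's base name, e.g. Veli.json pairs with ./trained_models/Veli.pth. The exact lookup lives in test_veli.py; the sketch below (expected_weight_path is a hypothetical helper of ours) just illustrates the convention:

from pathlib import Path

def expected_weight_path(config_path, stored_models_path="./trained_models"):
    # Veli.json -> ./trained_models/Veli.pth, Veli_taiwan.json -> ./trained_models/Veli_taiwan.pth
    return Path(stored_models_path) / (Path(config_path).stem + ".pth")

print(expected_weight_path("./configs/Veli_taiwan.json"))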

If you want to test on out-of-distribution (OOD) data, make sure to use the following arguments: test_ood and ood_data. test_ood is a flag that tells the script you want to test on OOD data, and ood_data is the path to the OOD data. Both arguments are required.

The ood_data path should be the folder that contains the folders ood_data_bulk and prepared_ood_datasets.

Example bash:

python test_veli.py --config_path "./configs/Veli_taiwan.json" --test_ood --ood_data "/path/to/ood_data"

The following is a sample output from the testing script (on in-distribution data):

================================================================= FINAL RESULTS SUMMARY ======================================================================
           MAE Relative to original LCS  No-model MAE (mean)  No-model RMSE (mean)  Veli MAE (mean)  Veli MAE (single)  Veli RMSE (mean)  Veli RMSE (single)
City                                                                                                                                                        
rotterdam                       22.8099              21.0497               42.0942           3.4937             3.4937            4.8272              4.8272
utrecht                         32.6555              24.0222               39.4749           4.2683             4.2683            6.0409              6.0409
amsterdam                       12.3012              11.6981               20.2150           3.6057             3.6057            5.3456              5.3456
groningen                        2.9895               3.1105                4.5315           3.1477             3.1477            4.4146              4.4146
hague                            3.0981               3.3686                4.5101           3.2790             3.2790            4.3581              4.3581
ijmuiden                         3.3657               4.1239                7.0861           3.4724             3.4724            4.3717              4.3717
nijmegen                         3.0571               5.4688                7.4308           5.3286             5.3286            6.9867              6.9867
==============================================================================================================================================================

Configuration file arguments

The following describes the meaning and usage of each config file argument.

{
    "dataset_path": "/home/yahia/final_dataset_replication/final_dataset", #the path to the dataset (in distribution)

    "pretrained_config_path": "/home/yahia/Veli/configs/Veli.json", #for out of distribution data fine-tuning: define the config file for the pre-trained model to fine-tune on. required.

    "load_pretrained": false, #start from a pretrained file

    
    "train_split": "all", #train on all dataset with no validation?
    "train_suffix": "_lcs", suffix for the training data (lcs for low cost sensor as per AQ-SDR design)
    "test_suffix": "_ref", suffix for testing data (ref for reference station as per AQ-SDR design)
    "shuffle_data_cols": true, #shuffle the columns to make sure the model doesn't overfit on column position, true will shuffle
    "shuffle_cities_order": true, #shuffle cities loading order, true will shuffle
    "full_shuffle": true, #shuffle after loading everything, true will shuffle
    "mask_value": 0, #what value to replace the NA values with


    "fill_hour_rows": false, #bring back hours that have ALL NA values, false will not fill
    
    
    "test": true, #do a testing at the end of the training, true will test

    "model_name": "Veli", #name of the architecture to load. Only Veli is available.
    "loss": "huber", #implementation for the reconstruction loss. available: huber, SE (squared errors), studentT
    "distribution": "gaussian", #assumption of distribution as per VAE, only gaussian is supported


    "no_relu": false, #do not include relu. false WILL include relu
    "hidden_dim": 32, #size of hidden dimensions for all feed forward layers (aside from input/output/encoding/decoding)
    "latent_dim": 1, #size of the latent dimension for all blocks
    "beta_z": 10, #weight of beta_z as per the loss function
    "beta_y": 0.1, #weight of beta_y as per the loss function
    "reconstruction_factor": 1, #weight of the reoconstruction factor as per the loss function


    "shuffle_time_order": true, #shuffle time order (lose temporal information). true WILL shuffle
    "shuffle_sensor_order": true, #randomly shuffle columns (again). true WILL shuffle.
    "reorder_probability": 0.5, #probability at which to shuffle sensor order (above)

    "batch_size": 64, 
    "learning_rate": 1e-6, #starting learning rate
    "optimizer_step_size": 10, #for the learning rate scheduler
    "optimizer_gamma": 0.5, #for the learning rate scheduler
    "num_epochs": 100,
    "clip_gradient": true, #ensure stability - true WILL clip the gradient

    "epoch_print_divider": 1, #print out information every n epochs
    "print_loss": true, #print out loss information during training
    "device": "cuda", #device to use for training/testing
    "model_save_period": 33 #store an additional weight every n epoch


}
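
For orientation on the three loss weights (beta_z, beta_y, reconstruction_factor): based on the comments above and the usual $\beta$-VAE convention, the objective is presumably a weighted sum of the form

$\mathcal{L} = \text{reconstruction\_factor} \cdot \mathcal{L}_{\text{rec}} + \beta_z \cdot \mathcal{L}_{z} + \beta_y \cdot \mathcal{L}_{y}$

where $\mathcal{L}_{\text{rec}}$ is the huber/SE/studentT reconstruction term and $\mathcal{L}_z$, $\mathcal{L}_y$ regularize the two latent variables. This is our reading, not a quote from the paper; the exact definitions are in model_utils.py.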
