Access the dataset through the Air Quality Sensor Data Repository (AQ-SDR) here.
If you use either this work or the AQ-SDR dataset, please make sure to cite:
@misc{Yahia2025Veli,
title={Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction},
author={Yahia Dalbah and Marcel Worring and Yen-Chia Hsu},
year={2025},
eprint={2508.02724},
archivePrefix={arXiv},
primaryClass={eess.SP},
url={https://arxiv.org/abs/2508.02724},
}
For the training and testing scripts, make sure to specify the required directory: train_veli.py requires the EU data directory, fine_tune_taiwan.py requires the Taiwanese out-of-distribution data, and test_veli.py works on either of them.
The requirements.txt file lists all the libraries required to run this model. We also provide the packages through a pyproject.toml file (Poetry) for your convenience.
The dataset is large, but the model is lightweight, so you can run it with as little as 500 MB of VRAM.
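If you want to confirm your GPU clears that bar before launching anything, a minimal PyTorch sketch such as the following (ours, not part of the repository scripts) reports the free VRAM on the current device:

```python
# Minimal sketch (not part of the repository scripts): report free GPU memory
# so you can confirm the ~500 MB VRAM requirement before training.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    print(f"Free VRAM: {free_bytes / 1024**2:.0f} MB of {total_bytes / 1024**2:.0f} MB")
else:
    print("CUDA is not available; training would have to fall back to the CPU.")
```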
Make sure to download the dataset and follow the instructions in the AQ-SDR repository. The shell script in that repository creates the dataset in the format required here. Below we highlight the important part for Veli:
The dataset paths should have the following two directory trees (disregarding all other preprocessing):
EU Data:
final_dir_eudata
|
.
.
.
└── final_dataset
    ├── prepared_lcs_bulk
    └── pre_prepared_datasets_unfiltered
final_dataset contains the data that we used in modeling Veli.
prepared_lcs_bulk contains the LCS data without reference station data for unsupervised training.
pre_prepared_datasets_unfiltered contains files with reference stations used for verification purposes.
OOD Data:
final_dir_ood
|
.
.
.
└── final_dataset
    ├── ood_data_bulk
    └── prepared_ood_datasets
final_dataset contains the data that we used in modeling Veli.
ood_data_bulk contains the LCS data without reference station data for unsupervised training.
prepared_ood_datasets contains files with reference stations used for verification purposes.
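Before running any of the scripts, it can help to verify that the AQ-SDR output matches the two layouts above. The sketch below is ours (not a repository utility), and the two root paths are placeholders; adjust them to wherever you stored the data:

```python
# Sanity-check sketch (not a repository utility): confirm the AQ-SDR output
# matches the directory layout Veli expects. The root paths are placeholders.
from pathlib import Path

EXPECTED = {
    "/path_to_dataset/final_dir_eudata/final_dataset": [
        "prepared_lcs_bulk",
        "pre_prepared_datasets_unfiltered",
    ],
    "/path_to_dataset/final_dir_ood/final_dataset": [
        "ood_data_bulk",
        "prepared_ood_datasets",
    ],
}

for root, subdirs in EXPECTED.items():
    for name in subdirs:
        path = Path(root) / name
        print(f"{'ok' if path.is_dir() else 'MISSING':>7}  {path}")
```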
The scripts train_veli.py and test_veli.py are what you need to run and experiment with the models. Below you will find a detailed explanation of how to set up the environment and run them.
The following premade config files can be used as follows:
- `./configs/Veli.json` has the pre-trained model in our paper with the same hyperparameters.
- `./configs/Veli_train_from_scratch.json` can be used to re-train the model with the same hyperparameters.
- `./configs/Veli_taiwan.json` has the fine-tuned model for out-of-distribution (OOD) data from Taiwan.
All packages and dependencies used are available via two options:
- `requirements.txt`
- `pyproject.toml` (Poetry)
This guide assumes you are using a bash-based shell (Linux/macOS). You can install all necessary packages EXACTLY following our work using the commands below. We recommend approach (a) (pip + venv) because it is the easiest and fastest. For the highest guarantee of reproducibility, approach (b) (Poetry) is recommended but not necessary. Approach (c) (conda) is untested but easy as well.
# Create and activate a virtual environment
python3 -m venv veli
source veli/bin/activate # On Windows: veli\Scripts\activate
pip install -r requirements.txt
Install Poetry if not installed (feel free to create a virtual environment beforehand):
pip install poetry
Create and activate the virtual environment and install the dependencies, then activate the shell:
poetry install
source $(poetry env info --path)/bin/activate
Sourcing this file will automatically start the virtual environment. Alternatively, you can run poetry env list --full-path and activate the environment from there.
conda create -n veli python=3.10
conda activate veli
pip install -r requirements.txt
Inside the /model/ folder you will find:
- The PyTorch model and model building files are all available in `Veli.py`. `u` in the model building refers to $\psi$ in our paper.
- The loss functions, among other model building supports, are in the `model_utils.py` file.
- The functions to create dataset tensors and dataloaders are in the `tensor_creation` file.
- Support functions for model builders and learning utilities are in `model_builders.py` and `learning_utils.py`, respectively.
To run the training, use the train_veli.py script. You must specify a configuration file from ./configs/. If you wish to train from scratch, you can use the ./configs/Veli_train_from_scratch.json config file. An explanation of the config files is provided further below.
You can use the following template on bash to run it:
python train_veli.py --config_path "./configs/Veli_train_from_scratch.json"
You can also define the additional arguments stored_models_path and logs_path to specify where you want to store the trained models and log files. The defaults are the repository directory under /trained_models/ and /logs/, respectively. This repository contains pre-trained weights in ./trained_models/Veli.pth and fine-tuned weights in ./trained_models/Veli_taiwan.pth.
You have to define the directory of the data inside the config file at ./configs/Veli.json under the key dataset_path.
The dataset path should be the folder that contains the following two folders: prepared_lcs_bulk and pre_prepared_datasets_unfiltered.
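If you prefer to set this key programmatically rather than editing the file by hand, a small sketch along these lines (ours, assuming the shipped configs are plain JSON and using a placeholder dataset path) does the job:

```python
# Sketch (not a repository utility): point "dataset_path" in a config at your
# local copy of the EU data. The dataset path below is a placeholder.
import json

config_file = "./configs/Veli_train_from_scratch.json"

with open(config_file) as f:
    config = json.load(f)

config["dataset_path"] = "/path_to_dataset/final_dir_eudata/final_dataset"

with open(config_file, "w") as f:
    json.dump(config, f, indent=4)
```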
To fine-tune the model on out-of-distribution data, use:
python fine_tune_taiwan.py --config_path "./configs/Veli_taiwan.json" --ood "/path_to_dataset/final_dir_ood/final_dataset"
Inside the config file, you have to define the key pretrained_config_path so the script can find the pre-trained weights that you want to fine-tune. Similarly, you can define the paths stored_models_path and logs_path, as in training. You need to also have
To test a pre-trained model, use:
python test_veli.py --config_path "./configs/Veli.json"
You can also use the argument stored_models_path to indicate where the weights are stored; otherwise the script will look in ./trained_models/.
IMPORTANT: The script relies on finding a model with the same name as the config file. Make sure the config file name matches the name of the model you are seeking.
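A quick way to check that pairing before launching a test is sketched below (our own helper, not part of test_veli.py); it assumes the script looks for <config name>.pth inside the weights directory, which matches the shipped Veli.json/Veli.pth and Veli_taiwan.json/Veli_taiwan.pth pairs:

```python
# Sketch (not part of test_veli.py): confirm that a weights file matching the
# config's base name exists, since the test script looks the model up by name.
from pathlib import Path

config_path = Path("./configs/Veli_taiwan.json")
stored_models_path = Path("./trained_models")

weights = stored_models_path / f"{config_path.stem}.pth"
if weights.is_file():
    print(f"Found weights: {weights}")
else:
    print(f"Missing weights: {weights} -- rename the config or checkpoint so the names match.")
```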
If you want to test on out-of-distribution (OOD) data, make sure to use the following arguments: test_ood and ood_data. test_ood is a flag that tells the script you want to test on OOD data, and ood_data is the path to the OOD data. Both arguments are required.
The ood_data path should be the folder that contains the folders ood_data_bulk and prepared_ood_datasets.
Example bash:
python test_veli.py --config_path "./configs/Veli_taiwan.json" --test_ood --ood_data "/path/to/ood_data"
The following is a sample output from the testing script (on in-distribution data):
================================================================= FINAL RESULTS SUMMARY ======================================================================
MAE Relative to original LCS No-model MAE (mean) No-model RMSE (mean) Veli MAE (mean) Veli MAE (single) Veli RMSE (mean) Veli RMSE (single)
City
rotterdam 22.8099 21.0497 42.0942 3.4937 3.4937 4.8272 4.8272
utrecht 32.6555 24.0222 39.4749 4.2683 4.2683 6.0409 6.0409
amsterdam 12.3012 11.6981 20.2150 3.6057 3.6057 5.3456 5.3456
groningen 2.9895 3.1105 4.5315 3.1477 3.1477 4.4146 4.4146
hague 3.0981 3.3686 4.5101 3.2790 3.2790 4.3581 4.3581
ijmuiden 3.3657 4.1239 7.0861 3.4724 3.4724 4.3717 4.3717
nijmegen 3.0571 5.4688 7.4308 5.3286 5.3286 6.9867 6.9867
==============================================================================================================================================================
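For reference, the MAE and RMSE columns follow their standard definitions. The toy sketch below (made-up numbers, and under our reading that "No-model" compares the raw LCS readings against the reference station while "Veli" compares the corrected readings) shows how such values are computed:

```python
# Toy sketch with made-up numbers: MAE and RMSE as reported in the summary table.
# "No-model" is read here as raw LCS vs. reference, "Veli" as corrected vs. reference.
import numpy as np

reference = np.array([10.0, 12.0, 9.0, 15.0])   # reference station values (made up)
raw_lcs   = np.array([14.0, 18.0, 13.0, 22.0])  # uncorrected LCS readings (made up)
corrected = np.array([11.0, 12.5, 9.5, 14.0])   # corrected readings (made up)

def mae(pred, ref):
    return float(np.mean(np.abs(pred - ref)))

def rmse(pred, ref):
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

print(f"No-model MAE: {mae(raw_lcs, reference):.4f}, RMSE: {rmse(raw_lcs, reference):.4f}")
print(f"Veli MAE:     {mae(corrected, reference):.4f}, RMSE: {rmse(corrected, reference):.4f}")
```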
The following describes the config file keys and their usage:
{
"dataset_path": "/home/yahia/final_dataset_replication/final_dataset", #the path to the dataset (in distribution)
"pretrained_config_path": "/home/yahia/Veli/configs/Veli.json", #for out of distribution data fine-tuning: define the config file for the pre-trained model to fine-tune on. required.
"load_pretrained": false, #start from a pretrained file
"train_split": "all", #train on all dataset with no validation?
"train_suffix": "_lcs", suffix for the training data (lcs for low cost sensor as per AQ-SDR design)
"test_suffix": "_ref", suffix for testing data (ref for reference station as per AQ-SDR design)
"shuffle_data_cols": true, #shuffle the columns to make sure the model doesn't overfit on column position, true will shuffle
"shuffle_cities_order": true, #shuffle cities loading order, true will shuffle
"full_shuffle": true, #shuffle after loading everything, true will shuffle
"mask_value": 0, #what value to replace the NA values with
"fill_hour_rows": false, #bring back hours that have ALL NA values, false will not fill
"test": true, #do a testing at the end of the training, true will test
"model_name": "Veli", #name of the architecture to load. Only Veli is available.
"loss": "huber", #implementation for the reconstruction loss. available: huber, SE (squared errors), studentT
"distribution": "gaussian", #assumption of distribution as per VAE, only gaussian is supported
"no_relu": false, #do not include relu. false WILL include relu
"hidden_dim": 32, #size of hidden dimensions for all feed forward layers (aside from input/output/encoding/decoding)
"latent_dim": 1, #size of the latent dimension for all blocks
"beta_z": 10, #weight of beta_z as per the loss function
"beta_y": 0.1, #weight of beta_y as per the loss function
"reconstruction_factor": 1, #weight of the reoconstruction factor as per the loss function
"shuffle_time_order": true, #shuffle time order (lose temporal information). true WILL shuffle
"shuffle_sensor_order": true, #randomly shuffle columns (again). true WILL shuffle.
"reorder_probability": 0.5, #probability at which to shuffle sensor order (above)
"batch_size": 64,
"learning_rate": 1e-6, #starting learning rate
"optimizer_step_size": 10, #for the learning rate scheduler
"optimizer_gamma": 0.5, #for the learning rate scheduler
"num_epochs": 100,
"clip_gradient": true, #ensure stability - true WILL clip the gradient
"epoch_print_divider": 1, #print out information every n epochs
"print_loss": true, #print out loss information during training
"device": "cuda", #device to use for training/testing
"model_save_period": 33 #store an additional weight every n epoch
}
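If you create your own config from this template, a minimal check like the one below (our own sketch; which keys are strictly required is not documented, so the list is just an illustrative subset of the template above, and my_config.json is a hypothetical file name) can catch missing keys before you launch train_veli.py:

```python
# Sketch (not a repository utility): check a custom config against a subset of
# the keys documented in the template above. "my_config.json" is hypothetical.
import json

TEMPLATE_KEYS = {
    "dataset_path", "load_pretrained", "train_split", "train_suffix",
    "test_suffix", "mask_value", "model_name", "loss", "distribution",
    "hidden_dim", "latent_dim", "beta_z", "beta_y", "batch_size",
    "learning_rate", "num_epochs", "device",
}

with open("./configs/my_config.json") as f:  # hypothetical custom config
    config = json.load(f)

missing = TEMPLATE_KEYS - config.keys()
print("Missing keys:", sorted(missing) if missing else "none")
```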