The aim of this benchmark is to compare several frameworks that manage DataFrames on common data preparation operations.
- Clone this GitHub repository on your machine;
- Run pip install -r requirements.txt;
- Run python install.py to build all the algorithms inside Docker containers*.
*Note: you will need Docker installed on your machine. If you want to run the algorithms locally, skip this step.
The command python run_algorithm.py --algorithm <algorithm_name> --dataset <dataset_name> will run an algorithm on the specified dataset.
By default, an algorithm runs inside its Docker container; to run it locally, add the parameter --locally.
The results of a run are stored in results/<dataset_name>/<algorithm_name>.csv.
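As a small illustration of the layout described above, the following sketch builds the path of a result file and loads it with the standard csv module. The helper names results_path and load_results are hypothetical and not part of the repository, and nothing is assumed about the columns of the CSV: each row is returned as a plain dictionary keyed by whatever header the file contains.

```python
import csv
from pathlib import Path


def results_path(dataset_name, algorithm_name, root="results"):
    """Return results/<dataset_name>/<algorithm_name>.csv as a Path."""
    return Path(root) / dataset_name / f"{algorithm_name}.csv"


def load_results(dataset_name, algorithm_name, root="results"):
    """Read a result file into a list of dicts, one per CSV row."""
    path = results_path(dataset_name, algorithm_name, root)
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))
```

The root argument defaults to the results folder named above and only needs to change if the script is not run from the repository root.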
run_algorithm.py takes as input the following parameters:
- --algorithm <algorithm_name>, mandatory, the name of the algorithm to run.
- --dataset <dataset_name>, mandatory, the dataset on which to run the algorithm.
- --locally, optional, if set the algorithm will run locally, otherwise it will run inside its Docker container.
- --cpu_limit <cpu_number>, optional, maximum number of CPUs that the Docker container can use.
- --mem_limit <memory_limit>, optional, maximum memory that the Docker container can use.
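The parameters listed above can also be assembled programmatically, for example to launch several runs from a script. The sketch below is only an illustration: build_command and run_benchmark are hypothetical helpers, the values of --cpu_limit and --mem_limit are passed through unchanged because their accepted formats are not specified here, and the script is assumed to be launched from the repository root.

```python
import subprocess
import sys


def build_command(algorithm, dataset, locally=False, cpu_limit=None, mem_limit=None):
    """Assemble the argument list for run_algorithm.py from the documented flags."""
    command = [sys.executable, "run_algorithm.py",
               "--algorithm", algorithm, "--dataset", dataset]
    if locally:
        command.append("--locally")
    if cpu_limit is not None:
        command += ["--cpu_limit", str(cpu_limit)]
    if mem_limit is not None:
        command += ["--mem_limit", str(mem_limit)]
    return command


def run_benchmark(algorithm, dataset, **options):
    """Run one benchmark; raises CalledProcessError if the script exits with an error."""
    subprocess.run(build_command(algorithm, dataset, **options), check=True)
```

For instance, build_command("<algorithm_name>", "<dataset_name>", locally=True) yields the same invocation as the command shown earlier with --locally appended.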
- Create a new folder named after the dataset inside the dataset folder;
- Place the new dataset file inside your folder;
- Copy the file dataset/tests_template.json into your folder, rename it to <your_dataset_name>_template.json, and edit it;
- Edit the file dataset/datasets.json by adding the new dataset.
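The first three steps above are plain file operations and could be scripted roughly as follows. This is a sketch under stated assumptions: add_dataset is a hypothetical helper, not part of the repository, and it is meant to be run from the repository root. Editing the copied template and registering the dataset in dataset/datasets.json are left as manual steps, because the structure of those files is not described here.

```python
import shutil
from pathlib import Path


def add_dataset(dataset_file, dataset_name, dataset_root="dataset"):
    """Create dataset/<name>/ and copy the data file and the test template into it."""
    root = Path(dataset_root)
    target = root / dataset_name
    target.mkdir(parents=True, exist_ok=True)

    # Place the dataset file inside the new folder, keeping its file name.
    shutil.copy(dataset_file, target / Path(dataset_file).name)

    # Copy the template under its dataset-specific name.
    template = target / f"{dataset_name}_template.json"
    if not template.exists():  # do not overwrite a template that was already edited
        shutil.copy(root / "tests_template.json", template)
    return template
```

The function returns the path of the copied template, which is the file to edit next.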
- Create a Dockerfile for your algorithm named Dockerfile.your_algo inside the install folder. It must contain all the instructions needed to install the required libraries (see Dockerfile.pandas for an example);
- Create a Python class in a file named your_algo.py inside the folder df_benchmark/algorithms. The class must extend the base class contained in df_benchmark/algorithms/base.py and implement all of its methods;
- Add your algorithm definition to df_benchmark/algorithms/algorithms.json using the following pattern:
{
"name": "algorithm_name",
"module": "df_benchmark.algorithms.algorithm_name",
"constructor": "className",
"constructor_args": []
}
- name: the name of your algorithm.
- module: the name of the module that contains your class.
- constructor: the name of your class.
- constructor_args: arguments that have to be passed to the constructor when the class is instantiated
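To make the role of these fields concrete, the sketch below shows one way such a definition could be turned into an object with importlib. It is an assumption about how a loader might work, not the benchmark's actual code: the function instantiate is hypothetical, constructor_args is assumed to be passed positionally, and a standard-library class stands in for a real algorithm so that the example runs on its own.

```python
import importlib


def instantiate(definition):
    """Import definition["module"] and call its constructor with constructor_args."""
    module = importlib.import_module(definition["module"])
    constructor = getattr(module, definition["constructor"])
    return constructor(*definition.get("constructor_args", []))


# Demonstration with a standard-library class instead of a real benchmark algorithm.
example = {
    "name": "three_quarters",
    "module": "fractions",
    "constructor": "Fraction",
    "constructor_args": [3, 4],
}
print(instantiate(example))  # prints 3/4
```

With a real entry, module would point to df_benchmark.algorithms.your_algo and constructor to the class defined in that file.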