Daedalus is a novel dynamic spatial microsimulation pipeline that allows users to produce (custom) population projections for policy intervention analysis. Currently, it provides simulation utilities for the whole of the United Kingdom at the local authority (LA) level.
Daedalus builds on code written by Andrew Smith, available here https://github.com/nismod/microsimulation/
Daedalus is being developed in collaboration between Leeds Institute for Data Analytics and the Alan Turing Institute as part of the SPENSER (Synthetic Population Estimation and Scenario Projection Model) project.
We strongly recommend installation via Anaconda:
-
Create a new environment for Daedalus:
conda create -n daedalus python=3.7- Activate the environment:
conda activate daedalus- Clone Daedalus source code:
git clone https://github.com/alan-turing-institute/daedalus.git- Install Daedalus and its dependencies:
cd /path/to/my/daedalus
pip install -v -e .
Daedalus can be run via command line. The following command displays all available options:
python scripts/run.py --helpOutput:
usage: run.py [-h] -c config-file [--location LOCATION]
[--input_data_dir INPUT_DATA_DIR]
[--persistent_data_dir PERSISTENT_DATA_DIR]
[--output_dir OUTPUT_DIR]
Dynamic Microsimulation
optional arguments:
-h, --help show this help message and exit
-c config-file, --config config-file
the model config file (YAML)
--location LOCATION LAD code
--input_data_dir INPUT_DATA_DIR
directory where the input data is
--persistent_data_dir PERSISTENT_DATA_DIR
directory where the persistent data is
--output_dir OUTPUT_DIR
directory where the output data is savedFor example, to run a simulation for LAD E08000032:
python scripts/run.py -c config/default_config.yaml --location E08000032 --input_data_dir data --persistent_data_dir persistent_data --output_dir outputIn the above command:
- -c: the model config file in YAML format. For more information on the configuration file, refer to section: Configuration file.
- --location: target LAD code, here
E08000032. Note that, Daedalus can be run in parallel for several LADs, refer to section: Speeding up simulations over several LADs by parallelization - --input_data_dir: the parent directory where population file is stored, e.g.,
datawheressm_E08000032_MSOA11_ppp_2011.csvis located. - --persistent_data_dir: the parent directory that contains all persistent data, e.g., rates, OD matrices and etc, refer to section: Preparing datasets for details.
- --output_dir: directory where the output files will be stored.
As an example, when running the above command, Daedalus store the results in the following directory structure:
output
└── E08000032
├── config_file_E08000032.yml
├── ssm_E08000032_MSOA11_ppp_2011_processed.csv
├── ssm_E08000032_MSOA11_ppp_2011_simulation.csv
├── year_1
│ └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_1.csv
└── year_2
└── ssm_E08000032_MSOA11_ppp_2011_simulation_year_2.csvwith the following messages on the terminal:
❯ python scripts/run.py -c config/default_config.yaml --location E08000032 --input_data_dir data --persistent_data_dir persistent_data --output_dir output ─╯
Start Population Size: 524213
Write config file successful
Write the dataset at: output_single/E08000032/ssm_E08000032_MSOA11_ppp_2011_processed.csv
Computing immigration OD matrices...
Computing internal migration rate table...
Caching rate table...
Cached to persistent_data/internal_migration_rate_table_1.csv
Computing mortality rate table...
Caching rate table...
Cached to persistent_data/mortality_rate_table_1.csv
Computing fertility rate table...
Caching rate table...
Cached to persistent_data/fertility_rate_table_1.csv
Computing emigration rate table...
Caching rate table...
Cached to persistent_data/emigration_rate_table_1.csv
Computing immigration rate table...
Caching rate table...
Cached to persistent_data/immigration_rate_table_E08000032_1.csv
Computing total immigration number for location E08000032
Start simulation setup
2020-10-30 10:18:26
2020-10-30 10:18:26.363 | DEBUG | vivarium.framework.values:register_value_modifier:373 - Registering metrics.1.population_manager.metrics as modifier to metrics
2020-10-30 11:05:54.951 | DEBUG | vivarium.framework.values:_register_value_producer:323 - Registering value pipeline int_outmigration_rate
.
.
.
2020-10-30 13:58:07.094 | DEBUG | vivarium.framework.engine:step:140 - 2012-12-01 01:30:00
Finished running simulation for year: 2
2020-10-30 14:01:03
In year: 2013
alive 539883
dead 8797
emigrated 2984
internal migration 551664
New children 18547
Immigrants 8904
Finished running the full simulation
alive 539883
dead 8797
emigrated 2984
internal migration 551664
New children 18547
Immigrants 8904
In the previous section, we ran the simulation over one LAD (specified by --location E08000032).
The simulation took around 2 to 3 hours to finish.
To speed up the simulations over severals LADs, Daedalus can be run in parallel.
For example, the following command runs various LAD codes (specified by --path_pop_files "data/ssm_*ppp*csv", wildcard accepted)
on five processes in parallel (specified by --process_np 5):
python scripts/parallel_run.py -c config/default_config.yaml --path_pop_files "data/ssm_*ppp*csv" --input_data_dir data --persistent_data_dir persistent_data --output_dir output --process_np 5In this command:
- -c: the model config file in YAML format. For more information on the configuration file, refer to section: Configuration file.
- --path_pop_files: path to population files, wildcard accepted.
LAD codes are extracted from the filenames specified in this argument, e.g.,
in the example,
--path_pop_files "data/ssm_*ppp*csv", LAD codes of all filesssm_*ppp*csvwill be used. - --input_data_dir: the parent directory where population file is stored, e.g.,
datawheressm_*ppp*csvare located. - --persistent_data_dir: the parent directory that contains all persistent data, e.g., rates, OD matrices and etc, refer to section: Preparing datasets for details.
- --output_dir: directory where the output files will be stored.
- --process_np: number of processors to be used. All detected LAD codes will be distributed over the requested number of processes.
The following command displays all available options:
python scripts/parallel_run.py --helpAfter running the simulation in section: Run Daedalus via command line,
the results are stored in a directory specified by --output_dir, e.g., output in the command above.
In our example, it contains the following dirs/files because we ran the simulation for 2 years:
output
└── E08000032
├── config_file_E08000032.yml
├── ssm_E08000032_MSOA11_ppp_2011_processed.csv
├── ssm_E08000032_MSOA11_ppp_2011_simulation.csv
├── year_1
│ └── ssm_E08000032_MSOA11_ppp_2011_simulation_year_1.csv
└── year_2
└── ssm_E08000032_MSOA11_ppp_2011_simulation_year_2.csvTo evaluate the results, we need to:
- reassign the migrants to the correct LADs.
For example, people who migrated from
LAD_code_1 ---> LAD_code_2should be added to the population file ofLAD_code_2. This step is required since Daedalus works and stores the results at LAD level.
output directory.
- run validation code on the resulting population files.
The above two steps can be run via one command line:
python scripts/validation.py --simulation_dir output --persistent_data_dir persistent_data- --simulation_dir: directory where the simulated population files are stored, i.e., output directory of a Daedalus simulation.
- --persistent_data_dir: the parent directory that contains the following ONS files:
- MYEB2_detailed_components_of_change_series_EW_(2019_geog20).csv
- MYEB3_summary_components_of_change_series_UK_(2019_geog20).csv
output
└── E08000032
├── config_file_E08000032.yml
├── ssm_E08000032_MSOA11_ppp_2011_processed.csv
└── ssm_E08000032_MSOA11_ppp_2011_simulation.csv
└── year_1
└── ssm_E08000032_MSOA11_ppp_2011_simulation_year_1.csv
└── year_2
└── ssm_E08000032_MSOA11_ppp_2011_simulation_year_2.csvThe following command displays all available options:
python scripts/validation.py --helpNext, we plot the results in this notebook.
In another notebook,
the results are plotted on maps.
We use the cartopy library to plot maps in this notebook.
cartopy is not installed by default. Please follow the instructions here:
https://scitools.org.uk/cartopy/docs/latest/installing.html
The following directories/files are needed to run a Daedalus simulation:
- Configuration file. Refer to section: Configuration file.
- Directory that contains persistent datasets, e.g.,
rate files (mortality, fertility, ...), OD matrices. See
persistent_datadirectory on the repo. - For validation, the following files are needed:
- MYEB2_detailed_components_of_change_series_EW_(2019_geog20).csv
- MYEB3_summary_components_of_change_series_UK_(2019_geog20).csv
They are also stored in the
persistent_datadirectory on the repo.
- Directory that contains population files. For example, see
datadirectory on the repo.
If you are planning to run the microsimulation pipeline on the LADs
E09000001, E09000033, E06000052 and E06000053 beware that
the rates of these LADs are merged in the following way:
E09000001+E09000033E06000052+E06000053
(For all other LADs, the rates are at individual level).
It is still possible to run the simulations for E09000001, E09000033, E06000052 and E06000053 individually,
but the pipeline will use the combined rates and immigrated values as specified above.
The most appropriate way to deal with this is to run the microsimulation from a combined LAD starting file,
instead of individually. For example, to run the simulation for the LADs E09000001+E09000033,
- Create a file named:
ssm_E09000001+E09000033_MSOA11_ppp_2011.csvthat contains the starting population from bothE09000001andE09000033. - Run the pipeline in the following way:
python scripts/run.py -c config/default_config.yaml --location E09000001+E09000033 --input_data_dir data --persistent_data_dir persistent_data --output_dir outputSimilarly, this should be done for E06000052+E06000053
Daedalus reads a config file specified by -c flag
(see section: Run Daedalus via command line).
An example config file is provided on the repo, see: config/default_config.yaml
This config file contains the following options:
- Start/end time, number of years and step size (in days) for a simulation as well as the min/max ages of the simulated population.
randomness:
key_columns: ['entrance_time', 'age']
input_data:
location: 'UK'
time:
start: {year: 2011, month: 1, day: 1}
end: {year: 2012, month: 1, day: 1}
step_size: 30.4375 # Days
num_years: 2
population:
age_start: 0
age_end: 100- File/dir-names of the rates (different components), conversion from MSOA and LAD, ethnicity lookup table and OD matrices.
mortality_file: 'Mortality2011_LEEDS1_2.csv'
fertility_file: 'Fertility2011_LEEDS1_2.csv'
emigration_file: 'Emig_2011_2012_LEEDS2.csv'
immigration_file: 'Immig_2011_2012_LEEDS2.csv'
total_population_file: 'MY2011AGEN.csv'
msoa_to_lad: 'Middle_Layer_Super_Output_Area__2011__to_Ward__2016__Lookup_in_England_and_Wales.csv'
OD_matrix_dir: 'od_matrices'
OD_matrix_index_file: 'MSOA_to_OD_index.csv'
internal_outmigration_file: 'InternalOutmig2011_LEEDS2.csv'
immigration_MSOA : 'Immigration_MSOA_M_F.csv'
ethnic_lookup: 'ethnic_lookup.csv'- Components to be used in simulation. For a realistic simulation, all components should be included.
components : [TestPopulation(),InternalMigration(), Mortality(), Emigration(), FertilityAgeSpecificRates(),Immigration()]- In this part of the config file, rate tables of each component can be scaled by a constant factor. This can be used for sensitivity analysis and hypothesis-driven tests, e.g., how the population would change if the rates (of one or more components) would be increased or decreased by a constant factor.
scale_rates:
# methods:
# constant: all rates regardless of age/sex/... will be multiplied by the specified factor
# if 1, the original rates will be usd
method: "constant"
constant:
mortality: 1
fertility: 1
emigration: 1
immigration: 1
internal_migration: 1

