Partition data by country #359
Replies: 5 comments 12 replies
I believe this is something that others have asked about in the past. I think one of the issues, though, is that Overture doesn't want to be in the position of defining country boundaries. Others have created country-partitioned Overture data themselves in the past, though. You could try this using DuckDB or, more likely, Apache Sedona.
Hi @mtravis,
Thanks for your interest and for getting back to me! Glad to see @cholmes post about this topic. A lot of great thoughts there! I would love to use the published datasets, but they're naturally not updated with the latest Overture releases.
I understand the foundation's hesitation around country boundary definitions, but since the data already contains country information, this approach simply exposes that existing data in a more accessible format (in my opinion). The country-partitioned approach makes it much easier to query data for country subsets using Hive partitioning for filter pushdown, as described here: https://duckdb.org/docs/stable/guides/performance/file_formats.html#hive-partitioning-for-filter-pushdown
If many people are implementing similar solutions, there would indeed be merit to having it done once by the foundation 🤔
Here are the details of my approach (if anyone has tips for making this better, or has another implementation, please feel free to share). I am sure it could be improved 😅
My implementation is based on a dbt-duckdb project, which takes advantage of the great spatial extension. Everything is configured to materialize externally into S3 as GeoParquet. The nice thing about this is that it allows a clean separation of config vs. transformation logic (here there is not much going on, but it's possible), and it is all powered by the fast DuckDB processing engine 🦆
My approach uses Overture's divisions data to extract country polygons, then applies those (plus a static bounding box for performance) to filter other themes and types. The spatial extension lets me filter for features like buildings within specific country boundaries.
The key concept is simple: I first extract country boundaries from the divisions data, then use spatial queries with st_intersects() to find objects within those boundaries. The bounding box filter provides an initial performance boost before applying the more expensive spatial operations.
The project is structured so that I have one model/file per country, and I configure the output with this dbt_project.yml:
# dbt_project.yml
name: 'dbt_project'
version: '1.0.0'
config-version: 2
profile: 'dbt_project'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
clean-targets:
  - "target"
  - "dbt_packages"
vars:
  # Overture Maps versions its data by release
  # https://docs.overturemaps.org/release/latest/
  overture_maps_release: '2025-02-19.0'
  # List of countries (country codes) to process
  country_list: ['AD', 'AT']
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
models:
  dbt_project:
    +materialized: external
    overture_maps:
      +tags: ['overture']
      buildings:
        +options:
          # The partition_by value is passed down to the DuckDB COPY TO command, which uses it as an option.
          # It must be a comma-separated string (no whitespace) of columns present in the model/table.
          # https://duckdb.org/docs/data/partitioning/partitioned_writes.html#partitioned-writes
          partition_by: 'country'
          overwrite_or_ignore: 1
        # Setting the same location and partition_by key ensures all the models in the `overture_maps/buildings` subdir
        # write to `{{ target.external_root }}/overture/buildings/country=XX/*.parquet`
        +location: '{{ target.external_root }}/overture/buildings'
# models/sources.yml
version: 2
sources:
  ##############################################################################
  # OVERTURE MAPS
  ##############################################################################
  - name: overture
    description: |
      Overture Maps is a rich dataset of geospatial data. It includes divisions, buildings, roads, land use, addresses and more 🌍
      The data combines multiple sources ([OpenStreetMap](https://www.openstreetmap.org), [OpenAddresses](https://openaddresses.io/), [Meta places](https://dataforgood.facebook.com/dfg/tools/places), [and more](https://docs.overturemaps.org/attribution/)) and attempts to provide a consistent and high-quality dataset of the whole planet, normalized to a [unified schema](https://docs.overturemaps.org/schema/) 🍱
      The data is stored in multiple [geoparquet](https://guide.cloudnativegeo.org/geoparquet/) files inside an S3 bucket, following a hive partitioning scheme which allows for efficient filtering and retrieval based on theme and type 💾🪣
      e.g. `s3://overturemaps-us-west-2/release/2024-12-18.0/theme=base/type=water/*` will only contain data for water bodies, e.g. `Oceans`, `Rivers`, `Swimming pools`, etc. 🌊
      The [geoparquet](https://guide.cloudnativegeo.org/geoparquet/) format is optimized for the cloud and structured to allow efficient filtering. If you can filter the data based on the partition keys, only the files that contain the data you are interested in will be downloaded and processed. The "world size" dataset is massive! Storing it as geoparquet means the whole planet of data is accessible, but can be processed per region of interest without downloading the whole dataset 🌍🔍
      For more information on this source, visit the Overture Maps Documentation [https://docs.overturemaps.org](https://docs.overturemaps.org) 📚
      Note: This implementation is made possible thanks to the [dbt-duckdb](https://github.com/duckdb/dbt-duckdb) adapter and the [external_location](https://github.com/duckdb/dbt-duckdb?tab=readme-ov-file#reading-and-writing-external-files) parameter.
    meta:
      owner: "Overture Foundation"
      external_location: 's3://overturemaps-us-west-2/release/{{ var("overture_maps_release") }}/theme=*/type=*/*'
      # https://duckdb.org/docs/data/partitioning/hive_partitioning.html#filter-pushdown
      # Filters on the partition keys are automatically pushed down into the files. This way we skip reading files when they contain no data that will be used.
      # Overture Maps data follows a hive scheme where the structure is partitioned by theme and type, so we can use the `theme` and `type` keys to filter the data.
      # e.g. "s3://overturemaps-us-west-2/release/2024-12-18.0/theme=buildings/type=building/*" will only read files from the buildings theme and building type.
    tags:
      - "overture"
      - "geoparquet"
    tables:
      - name: divisions
        description: |
          Divisions are administrative areas, such as countries, states, provinces, counties, cities, and neighborhoods. 🗺️
          A `division_area` combines properties of each division with a **polygon** GEOMETRY that represents the land or maritime area associated with a division.
          [https://docs.overturemaps.org/schema/reference/divisions/division_area/](https://docs.overturemaps.org/schema/reference/divisions/division_area/) 📚
        meta:
          # This is only for documentation purposes; the path used in compilation is the one defined in `config`
          external_location: 's3://overturemaps-us-west-2/release/{{ var("overture_maps_release") }}/theme=divisions/type=division_area/*'
        config:
          external_location: 'read_parquet(''s3://overturemaps-us-west-2/release/{{ var("overture_maps_release") }}/theme=divisions/type=division_area/*'', filename=true, hive_partitioning=1)'
        external:
          location: 's3://overturemaps-us-west-2/release/{{ var("overture_maps_release") }}/theme=divisions/type=division_area/*'
      - name: buildings
        description: |
          In OpenStreetMap, a building is a man-made structure with a roof, standing more or less permanently in one place. 🏠
          A `building` combines known properties of a building with a **polygon** GEOMETRY that represents the footprint of a building.
          [https://docs.overturemaps.org/schema/reference/buildings/building/](https://docs.overturemaps.org/schema/reference/buildings/building/) 📚
        meta:
          # This is only for documentation purposes; the path used in compilation is the one defined in `config`
          external_location: 's3://overturemaps-us-west-2/release/{{ var("overture_maps_release") }}/theme=buildings/type=building/*'
        config:
          external_location: 'read_parquet(''s3://overturemaps-us-west-2/release/{{ var("overture_maps_release") }}/theme=buildings/type=building/*'', filename=true, hive_partitioning=1)'
-- models/overture_maps/buildings/ad_buildings.sql
with
    divisions as (
        select country, "geometry"
        from {{ ref("divisions") }} as divisions
        where divisions.country = 'AD' and divisions.subtype = 'region'
    ),
    ad_buildings as (
        select
            id,
            names."primary" as primary_name,
            sources[1].dataset as "source",
            sources[1].record_id as source_ref,
            class,
            subtype,
            "level",
            num_floors,
            height,
            (
                select country
                from divisions
                -- Qualify the building geometry: an unqualified `geometry` inside this
                -- subquery would resolve to divisions.geometry and always intersect itself.
                where st_intersects(buildings.geometry, divisions.geometry)
                limit 1
            ) as country,
            bbox,
            buildings.geometry as "geometry"
        from {{ source("overture", "buildings") }} as buildings
        -- Static bounding box: a cheap prefilter before the spatial test
        where
            bbox.xmin > 1.4135781
            and bbox.xmax < 1.7863837
            and bbox.ymin > 42.4288238
            and bbox.ymax < 42.6559357
    )
select distinct on (id) *
from ad_buildings
where country is not null
-- models/overture_maps/divisions.sql
select
    id,
    subtype,
    country,
    names."primary" as primary_name,
    sources[1].dataset as primary_source,
    bbox,
    "geometry"
from {{ source("overture", "divisions") }}
where "type" = 'division_area' and country in {{ var("country_list") }}
Hope this helps! Sorry it's so long 🙊 I'd be very interested to hear about your own plans for country partitioning and any alternative approaches you've considered!
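The two-stage filter described in this comment (a cheap bounding-box check first, then the more expensive geometry test) can be sketched in plain Python. This is only an illustration under stated assumptions: the toy triangle, the feature records, and the helper names `bbox_within`, `point_in_polygon`, and `filter_features` are all hypothetical, and the exact test here is a simple centre-point-in-polygon check standing in for `st_intersects()`, not the DuckDB spatial implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox_within(inner: BBox, outer: BBox) -> bool:
    """Cheap prefilter: is `inner` strictly inside `outer`?"""
    return (
        inner[0] > outer[0]
        and inner[1] > outer[1]
        and inner[2] < outer[2]
        and inner[3] < outer[3]
    )


def point_in_polygon(pt: Point, ring: List[Point]) -> bool:
    """Ray-casting test against a single ring (no holes)."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_features(
    features: List[Dict], country_bbox: BBox, country_ring: List[Point]
) -> List[str]:
    """Return ids of features that pass the bbox prefilter and the exact test."""
    kept = []
    for feature in features:
        if not bbox_within(feature["bbox"], country_bbox):
            continue  # rejected cheaply, no geometry work needed
        xmin, ymin, xmax, ymax = feature["bbox"]
        centre = ((xmin + xmax) / 2, (ymin + ymax) / 2)
        # Stand-in for st_intersects(): test the bbox centre only.
        if point_in_polygon(centre, country_ring):
            kept.append(feature["id"])
    return kept


# Hypothetical toy "country": a triangle and its bounding box.
COUNTRY_RING = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
COUNTRY_BBOX = (0.0, 0.0, 10.0, 10.0)

features = [
    {"id": "a", "bbox": (1.0, 1.0, 2.0, 2.0)},      # inside the triangle
    {"id": "b", "bbox": (8.0, 8.0, 9.0, 9.0)},      # inside the bbox, outside the triangle
    {"id": "c", "bbox": (20.0, 20.0, 21.0, 21.0)},  # outside the bbox
]

kept = filter_features(features, COUNTRY_BBOX, COUNTRY_RING)
```

Feature `b` shows why the bounding box alone is not enough: it passes the prefilter but lies outside the polygon, so only the exact test removes it.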
Also somewhat related to the discussion here... Geofabrik is a great company offering free daily continent, country, and subregion extracts of OpenStreetMap data through https://download.geofabrik.de/. I think I can speak for many when I say this service is a huge benefit and we owe a big thanks to the people running it! It eliminates repetitive work, allowing the entire community to benefit from a single, shared effort. Geofabrik, if you're reading this, thank you 🙇
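@mberg Regarding:
> Does anyone know the division_ids are stable? I'm also interested in mapping the division areas to pcodes where possible.
Yes, division IDs are meant to be stable. cc: @DavidKarlas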
Thanks Victor that's very helpful!
Sorry follow-up question. Is there a version where the divisions are
clipped to land? I get the value of the current approach if you want to
set the bounds for points within a country (and water) but it's not ideal
if you want to use this data for visualizing.
On Fri, May 9, 2025 at 3:05 PM Victor Schappert wrote:
> @mberg Regarding:
> > Does anyone know the division_ids are stable? I'm also interested in mapping the division areas to pcodes where possible.
> Yes, division IDs are meant to be stable.
> cc: @DavidKarlas
👋 Feature Request: Country-Level Partitioning
Dear Overture team,
I'm a regular user and admirer of your project. The work you're doing with Overture Maps provides tremendous value to the geospatial community.
🔍 Request: Country-Level Data Partitioning
I'd like to respectfully propose adding country-level partitioning as an additional dimension alongside the existing `theme` and `type` partitions.
Currently, data is structured like:
s3://overturemaps-us-west-2/release/2025-03-19.0/theme={foo}/type={bar}/*
With country-level partitioning, it might look like:
s3://overturemaps-us-west-2/release/2025-03-19.0/theme={foo}/type={bar}/country=DE/*
s3://overturemaps-us-west-2/release/2025-03-19.0/theme={foo}/type={bar}/country=PL/*
💡 Use Case
I'm currently implementing country-level partitioning in my own workflows for several reasons:
I believe other community members might benefit from this feature as well, with upstream implementation preventing duplicated efforts.
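To illustrate why an extra `country=` path segment helps, here is a minimal sketch of partition pruning over Hive-style paths. It assumes the proposed layout; the file names (`part-0.parquet`) and the helpers `parse_hive_partitions` and `prune` are hypothetical, and real engines such as DuckDB do this pruning internally rather than through code like this.

```python
from typing import Dict, List


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Extract key=value partition segments from a Hive-style path."""
    partitions = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions


def prune(paths: List[str], **filters: str) -> List[str]:
    """Keep only paths whose partition values match every filter."""
    return [
        p
        for p in paths
        if all(parse_hive_partitions(p).get(k) == v for k, v in filters.items())
    ]


# Hypothetical object keys following the proposed layout.
paths = [
    "release/2025-03-19.0/theme=buildings/type=building/country=DE/part-0.parquet",
    "release/2025-03-19.0/theme=buildings/type=building/country=PL/part-0.parquet",
    "release/2025-03-19.0/theme=base/type=water/country=DE/part-0.parquet",
]

german_buildings = prune(paths, theme="buildings", country="DE")
```

With the country key in the path, a query for one country's buildings only has to open the matching files, and the others are skipped before any data is read.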
Technical Considerations 🤔
I understand this may present several challenges:
Benefits
If implemented at the source level, this could:
I'm open to discussing this further if helpful. Has this been considered previously by the team? If so, I'd appreciate any pointers to relevant discussions.
Thank you for considering this suggestion and for all your contributions to the mapping community 😁