From e7c4a20b935946deeb6d6b9cf48464063d5dc4f0 Mon Sep 17 00:00:00 2001 From: Raymond Yee Date: Fri, 5 Dec 2025 07:38:47 -0800 Subject: [PATCH] Add wide format Cesium visualization tutorial MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Complete 1:1 translation of parquet_cesium.qmd to wide format: - All queries use p__* columns instead of edge row JOINs - Eric's Query section translated to wide format - Understanding Paths documentation with wide format references - Helper functions for sample details, agents, keywords - Geographic Location Classification section - Updated sidebar navigation Wide format advantages: - 60% smaller file (275MB vs 691MB) - 79% fewer rows (~2.5M vs ~11.6M) - Simpler queries (3 JOINs vs 7+) - 2-4x faster query performance 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- _quarto.yml | 4 +- tutorials/parquet_cesium_wide.qmd | 1463 +++++++++++++++++++++++++++++ 2 files changed, 1466 insertions(+), 1 deletion(-) create mode 100644 tutorials/parquet_cesium_wide.qmd diff --git a/_quarto.yml b/_quarto.yml index a86e50a..ae8144d 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -51,8 +51,10 @@ website: href: tutorials/parquet.qmd - text: "Zenodo iSamples OpenContext Tutorial" href: tutorials/zenodo_isamples_analysis.qmd - - text: "Cesium View" + - text: "Cesium View (Narrow)" href: tutorials/parquet_cesium.qmd + - text: "Cesium View (Wide)" + href: tutorials/parquet_cesium_wide.qmd - text: "Cesium View split sources" href: tutorials/parquet_cesium_split.qmd - text: "Narrow vs Wide Performance" diff --git a/tutorials/parquet_cesium_wide.qmd b/tutorials/parquet_cesium_wide.qmd new file mode 100644 index 0000000..b507e5d --- /dev/null +++ b/tutorials/parquet_cesium_wide.qmd @@ -0,0 +1,1463 @@ +--- +title: Using Cesium for display of remote parquet (Wide Format).
+categories: [parquet, spatial, recipe, wide] +--- + +This page renders points from an iSamples **wide-format** parquet file on Cesium using point primitives. + +::: {.callout-note} +## Wide Format Advantages + +This page uses the **wide parquet schema** which: + +- Is **60% smaller** (275 MB vs 691 MB) +- Has **79% fewer rows** (~2.5M vs ~11.6M) +- Uses **simpler queries** (direct column access via `p__*` columns instead of edge row JOINs) +- Provides **2-4x faster query performance** over HTTP + +See [Narrow vs Wide Performance](/tutorials/narrow_vs_wide_performance.html) for benchmarks. +::: + + + + + +```{ojs} +//| output: false +Cesium.Ion.defaultAccessToken = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiIwNzk3NjkyMy1iNGI1LTRkN2UtODRiMy04OTYwYWE0N2M3ZTkiLCJpZCI6Njk1MTcsImlhdCI6MTYzMzU0MTQ3N30.e70dpNzOCDRLDGxRguQCC-tRzGzA-23Xgno5lNgCeB4'; +``` + +```{ojs} +//| echo: false +viewof parquet_path = Inputs.text({ + label:"Source (Wide Format)", + value:"https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg_wide.parquet", + placeholder: "URL or file:///path/to/file.parquet", + width:"100%", + submit:true +}); +``` + +```{ojs} +//| echo: false +viewof searchGeoPid = Inputs.text({ + label:"Jump to Geocode", + placeholder: "Paste geocode PID (e.g., geoloc_04d6e816218b1a8798fa90b3d1d43bf4c043a57f)", + width:"100%", + submit:true +}); +``` + +```{ojs} +//| echo: false +// Simple trigger variable that increments on button click +viewof classifyTrigger = { + let count = 0; + const button = html``; + button.onclick = () => { + count++; + button.value = count; + button.dispatchEvent(new CustomEvent("input")); + }; + button.value = count; + return button; +} + +// Alias for handler compatibility +classifyDots = classifyTrigger > 0 ? 
classifyTrigger : null +``` + +::: {.callout-tip collapse="true"} +#### Using a local cached file for faster performance + +DuckDB-WASM running in the browser **cannot access local files via `file://` URLs** due to browser security restrictions. However, you can use a local cached file when running `quarto preview`: + +**Local Development (recommended)** + +The repository includes a cached parquet file. To use it: + +1. Ensure the file exists in `docs/assets/oc_isamples_pqg_wide.parquet` (275MB) + - The file must be in Quarto's output directory `docs/assets/`, not just the source `assets/` directory + - If needed, copy: `cp assets/oc_isamples_pqg_wide.parquet docs/assets/` + +2. When running `quarto preview`, use the full localhost URL: + ``` + http://localhost:4979/assets/oc_isamples_pqg_wide.parquet + ``` + (Replace `4979` with your actual preview port) + +**Alternative: Python HTTP server** +```bash +# In the directory containing your parquet file: +cd /Users/raymondyee/Data/iSample +python3 -m http.server 8000 +``` + +Then use: `http://localhost:8000/oc_isamples_pqg_wide.parquet` + +**Benefits of wide format file:** +- 60% smaller than narrow format (275 MB vs 691 MB) +- Much faster initial load (less network transfer) +- Simpler queries with direct column access +- Works offline once cached + +**Limitation:** Only works during local development, not on published GitHub Pages. +::: + +::: callout-warning +#### Heads up: first interaction may be slow +The first click or query can take a few seconds while the in‑browser database engine initializes and the remote Parquet file is fetched and indexed. Subsequent interactions are much faster because both the browser and DuckDB cache metadata and column chunks, so later queries reuse what was already loaded. 
+::: + +```{ojs} +//| code-fold: true + +// Create a DuckDB instance +db = { + const instance = await DuckDBClient.of(); + await instance.query(`create view nodes as select * from read_parquet('${parquet_path}')`) + return instance; +} + + +async function loadData(query, params = [], waiting_id = null, key = "default") { + // latest-only guard per key + loadData._latest = loadData._latest || new Map(); + const requestToken = Symbol(); + loadData._latest.set(key, requestToken); + + // Get loading indicator + const waiter = waiting_id ? document.getElementById(waiting_id) : null; + if (waiter) waiter.hidden = false; + + try { + // Run the (slow) query + const _results = await db.query(query, params); + // Ignore stale responses + if (loadData._latest.get(key) !== requestToken) return null; + return _results; + } catch (error) { + if (waiter && loadData._latest.get(key) === requestToken) { + waiter.innerHTML = `
${error}
`; + } + return null; + } finally { + // Hide the waiter (if there is one) only if latest + if (waiter && loadData._latest.get(key) === requestToken) { + waiter.hidden = true; + } + } +} + +locations = { + // Performance telemetry + performance.mark('locations-start'); + + // Get loading indicator element for progress updates + const loadingDiv = document.getElementById('loading_1'); + if (loadingDiv) { + loadingDiv.hidden = false; + loadingDiv.innerHTML = 'Loading geocodes...'; + } + + // Fast query: just get all distinct geocodes (no classification!) + const query = ` + SELECT DISTINCT + pid, + latitude, + longitude + FROM nodes + WHERE otype = 'GeospatialCoordLocation' + `; + + performance.mark('query-start'); + const data = await loadData(query, [], "loading_1", "locations"); + performance.mark('query-end'); + performance.measure('locations-query', 'query-start', 'query-end'); + const queryTime = performance.getEntriesByName('locations-query')[0].duration; + console.log(`Query executed in ${queryTime.toFixed(0)}ms - retrieved ${data.length} locations`); + + // Clear the existing PointPrimitiveCollection + content.points.removeAll(); + + // Single color for all points (blue) + const defaultColor = Cesium.Color.fromCssColorString('#2E86AB'); + const defaultSize = 4; + + // Render points in chunks to keep UI responsive + const CHUNK_SIZE = 500; + const scalar = new Cesium.NearFarScalar(1.5e2, 2, 8.0e6, 0.2); + + performance.mark('render-start'); + for (let i = 0; i < data.length; i += CHUNK_SIZE) { + const chunk = data.slice(i, i + CHUNK_SIZE); + const endIdx = Math.min(i + CHUNK_SIZE, data.length); + + // Update progress indicator + if (loadingDiv) { + const pct = Math.round((endIdx / data.length) * 100); + loadingDiv.innerHTML = `Rendering geocodes... 
${endIdx.toLocaleString()}/${data.length.toLocaleString()} (${pct}%)`; + } + + // Add points for this chunk + for (const row of chunk) { + content.points.add({ + id: row.pid, + position: Cesium.Cartesian3.fromDegrees( + row.longitude, //longitude + row.latitude, //latitude + 0 //elevation, m + ), + pixelSize: defaultSize, + color: defaultColor, + scaleByDistance: scalar, + }); + } + + // Yield to browser between chunks to keep UI responsive + if (i + CHUNK_SIZE < data.length) { + await new Promise(resolve => setTimeout(resolve, 0)); + } + } + performance.mark('render-end'); + performance.measure('locations-render', 'render-start', 'render-end'); + const renderTime = performance.getEntriesByName('locations-render')[0].duration; + + // Hide loading indicator + if (loadingDiv) { + loadingDiv.hidden = true; + } + + performance.mark('locations-end'); + performance.measure('locations-total', 'locations-start', 'locations-end'); + const totalTime = performance.getEntriesByName('locations-total')[0].duration; + + console.log(`Rendering completed in ${renderTime.toFixed(0)}ms`); + console.log(`Total time (query + render): ${totalTime.toFixed(0)}ms`); + + content.enableTracking(); + return data; +} + + +function createShowPrimitive(viewer) { + return function(movement) { + // Get the point at the mouse end position + const selectPoint = viewer.viewer.scene.pick(movement.endPosition); + + // Clear the current selection, if there is one and it is different to the selectPoint + if (viewer.currentSelection !== null) { + //console.log(`selected.p ${viewer.currentSelection}`) + if (Cesium.defined(selectPoint) && selectPoint !== viewer.currentSelection) { + console.log(`selected.p 2 ${viewer.currentSelection}`) + viewer.currentSelection.primitive.pixelSize = 4; + viewer.currentSelection.primitive.outlineColor = Cesium.Color.TRANSPARENT; + // Reset the outline on the primitive itself (the pick result has no outlineWidth) + viewer.currentSelection.primitive.outlineWidth = 0; + viewer.currentSelection = null; + } + } + + // If selectPoint is valid and no currently selected 
point + if (Cesium.defined(selectPoint) && selectPoint.hasOwnProperty("primitive")) { + //console.log(`showPrimitiveId ${selectPoint.id}`); + //const carto = Cesium.Cartographic.fromCartesian(selectPoint.primitive.position) + viewer.pointLabel.position = selectPoint.primitive.position; + viewer.pointLabel.label.show = true; + //viewer.pointLabel.label.text = `id:${selectPoint.id}, ${carto}`; + viewer.pointLabel.label.text = `${selectPoint.id}`; + selectPoint.primitive.pixelSize = 20; + selectPoint.primitive.outlineColor = Cesium.Color.YELLOW; + selectPoint.primitive.outlineWidth = 3; + viewer.currentSelection = selectPoint; + } else { + viewer.pointLabel.label.show = false; + } + } +} + +class CView { + constructor(target) { + this.viewer = new Cesium.Viewer( + target, { + timeline: false, + animation: false, + baseLayerPicker: false, + fullscreenElement: target, + terrain: Cesium.Terrain.fromWorldTerrain() + }); + this.currentSelection = null; + this.point_size = 1; + this.n_points = 0; + // https://cesium.com/learn/cesiumjs/ref-doc/PointPrimitiveCollection.html + this.points = new Cesium.PointPrimitiveCollection(); + this.viewer.scene.primitives.add(this.points); + + this.pointLabel = this.viewer.entities.add({ + label: { + show: false, + showBackground: true, + font: "14px monospace", + horizontalOrigin: Cesium.HorizontalOrigin.LEFT, + verticalOrigin: Cesium.VerticalOrigin.BOTTOM, + pixelOffset: new Cesium.Cartesian2(15, 0), + // this attribute will prevent this entity clipped by the terrain + disableDepthTestDistance: Number.POSITIVE_INFINITY, + text:"", + }, + }); + + this.pickHandler = new Cesium.ScreenSpaceEventHandler(this.viewer.scene.canvas); + // Can also do this rather than wait for the points to be generated + //this.pickHandler.setInputAction(createShowPrimitive(this), Cesium.ScreenSpaceEventType.MOUSE_MOVE); + + this.selectHandler = new Cesium.ScreenSpaceEventHandler(this.viewer.scene.canvas); + this.selectHandler.setInputAction((e) => { + const 
selectPoint = this.viewer.scene.pick(e.position); + if (Cesium.defined(selectPoint) && selectPoint.hasOwnProperty("primitive")) { + mutable clickedPointId = selectPoint.id; + } + },Cesium.ScreenSpaceEventType.LEFT_CLICK); + + } + + enableTracking() { + this.pickHandler.setInputAction(createShowPrimitive(this), Cesium.ScreenSpaceEventType.MOUSE_MOVE); + } +} + +content = new CView("cesiumContainer"); + +async function getGeoRecord(pid) { + if (pid === null || pid ==="" || pid == "unset") { + return "unset"; + } + const q = `SELECT row_id, pid, otype, latitude, longitude FROM nodes WHERE otype='GeospatialCoordLocation' AND pid=?`; + const rows = await loadData(q, [pid], "loading_geo", "geo"); + return rows && rows.length ? rows[0] : null; +} + +// WIDE FORMAT: Path 1 - Direct event location +// Uses p__sample_location column instead of edge row JOINs +async function get_samples_1(pid) { + if (pid === null || pid ==="" || pid == "unset") { + return []; + } + // Path 1: Direct event location - WIDE FORMAT version + // Uses p__* columns instead of edge rows + const q = ` + SELECT + geo.latitude, + geo.longitude, + site.label AS sample_site_label, + site.pid AS sample_site_pid, + samp.pid AS sample_pid, + samp.alternate_identifiers AS sample_alternate_identifiers, + samp.label AS sample_label, + samp.description AS sample_description, + samp.thumbnail_url AS sample_thumbnail_url, + samp.thumbnail_url IS NOT NULL as has_thumbnail, + 'direct_event_location' as location_path + FROM nodes AS geo + -- Wide format: SamplingEvent has p__sample_location column with geo row_ids + JOIN nodes AS se ON ( + se.otype = 'SamplingEvent' + AND list_contains(se.p__sample_location, geo.row_id) + ) + -- Wide format: SamplingEvent has p__sampling_site column with site row_ids + JOIN nodes AS site ON ( + site.otype = 'SamplingSite' + AND list_contains(se.p__sampling_site, site.row_id) + ) + -- Wide format: MaterialSampleRecord has p__produced_by column with event row_ids + JOIN nodes AS samp 
ON ( + samp.otype = 'MaterialSampleRecord' + AND list_contains(samp.p__produced_by, se.row_id) + ) + WHERE geo.pid = ? + AND geo.otype = 'GeospatialCoordLocation' + ORDER BY has_thumbnail DESC + `; + performance.mark('samples1-start'); + const result = await loadData(q, [pid], "loading_s1", "samples_1"); + performance.mark('samples1-end'); + performance.measure('samples1-query', 'samples1-start', 'samples1-end'); + const queryTime = performance.getEntriesByName('samples1-query')[0].duration; + console.log(`Path 1 query (wide) executed in ${queryTime.toFixed(0)}ms - retrieved ${result?.length || 0} samples`); + return result ?? []; +} + +// WIDE FORMAT: Path 2 - Via site location +// Uses p__site_location and p__sampling_site columns +async function get_samples_2(pid) { + if (pid === null || pid ==="" || pid == "unset") { + return []; + } + // Path 2: Via site location - WIDE FORMAT version + const q = ` + SELECT + geo.latitude, + geo.longitude, + site.label AS sample_site_label, + site.pid AS sample_site_pid, + samp.pid AS sample_pid, + samp.alternate_identifiers AS sample_alternate_identifiers, + samp.label AS sample_label, + samp.description AS sample_description, + samp.thumbnail_url AS sample_thumbnail_url, + samp.thumbnail_url IS NOT NULL as has_thumbnail, + 'via_site_location' as location_path + FROM nodes AS geo + -- Wide format: SamplingSite has p__site_location column with geo row_ids + JOIN nodes AS site ON ( + site.otype = 'SamplingSite' + AND list_contains(site.p__site_location, geo.row_id) + ) + -- Wide format: SamplingEvent has p__sampling_site column with site row_ids + JOIN nodes AS se ON ( + se.otype = 'SamplingEvent' + AND list_contains(se.p__sampling_site, site.row_id) + ) + -- Wide format: MaterialSampleRecord has p__produced_by column with event row_ids + JOIN nodes AS samp ON ( + samp.otype = 'MaterialSampleRecord' + AND list_contains(samp.p__produced_by, se.row_id) + ) + WHERE geo.pid = ? 
+ AND geo.otype = 'GeospatialCoordLocation' + ORDER BY has_thumbnail DESC + `; + performance.mark('samples2-start'); + const result = await loadData(q, [pid], "loading_s2", "samples_2"); + performance.mark('samples2-end'); + performance.measure('samples2-query', 'samples2-start', 'samples2-end'); + const queryTime = performance.getEntriesByName('samples2-query')[0].duration; + console.log(`Path 2 query (wide) executed in ${queryTime.toFixed(0)}ms - retrieved ${result?.length || 0} samples`); + return result ?? []; +} + +// WIDE FORMAT: Eric Kansa's authoritative query (Path 1 only) +// This is the wide format equivalent of get_samples_at_geo_cord_location_via_sample_event +async function get_samples_at_geo_cord_location_via_sample_event(pid) { + if (pid === null || pid ==="" || pid == "unset") { + return []; + } + // Eric Kansa's authoritative query - WIDE FORMAT version + // Source pattern: https://github.com/ekansa/open-context-py + const q = ` + SELECT + geo.latitude, + geo.longitude, + site.label AS sample_site_label, + site.pid AS sample_site_pid, + samp.pid AS sample_pid, + samp.alternate_identifiers AS sample_alternate_identifiers, + samp.label AS sample_label, + samp.description AS sample_description, + samp.thumbnail_url AS sample_thumbnail_url, + samp.thumbnail_url IS NOT NULL as has_thumbnail + FROM nodes AS geo + -- Wide format: SamplingEvent.p__sample_location contains geo row_ids + JOIN nodes AS se ON ( + se.otype = 'SamplingEvent' + AND list_contains(se.p__sample_location, geo.row_id) + ) + -- Wide format: SamplingEvent.p__sampling_site contains site row_ids + JOIN nodes AS site ON ( + site.otype = 'SamplingSite' + AND list_contains(se.p__sampling_site, site.row_id) + ) + -- Wide format: MaterialSampleRecord.p__produced_by contains event row_ids + JOIN nodes AS samp ON ( + samp.otype = 'MaterialSampleRecord' + AND list_contains(samp.p__produced_by, se.row_id) + ) + WHERE geo.pid = ? 
+ AND geo.otype = 'GeospatialCoordLocation' + ORDER BY has_thumbnail DESC + `; + performance.mark('eric-query-start'); + const result = await loadData(q, [pid], "loading_combined", "samples_combined"); + performance.mark('eric-query-end'); + performance.measure('eric-query', 'eric-query-start', 'eric-query-end'); + const queryTime = performance.getEntriesByName('eric-query')[0].duration; + console.log(`Eric's query (wide) executed in ${queryTime.toFixed(0)}ms - retrieved ${result?.length || 0} samples`); + return result ?? []; +} + +// WIDE FORMAT: Get full sample data via sample PID +async function get_sample_data_via_sample_pid(sample_pid) { + if (sample_pid === null || sample_pid === "" || sample_pid === "unset") { + return null; + } + // Wide format: Uses p__produced_by, p__sample_location, p__sampling_site columns + const q = ` + SELECT + samp.row_id, + samp.pid AS sample_pid, + samp.alternate_identifiers AS sample_alternate_identifiers, + samp.label AS sample_label, + samp.description AS sample_description, + samp.thumbnail_url AS sample_thumbnail_url, + samp.thumbnail_url IS NOT NULL as has_thumbnail, + geo.latitude, + geo.longitude, + site.label AS sample_site_label, + site.pid AS sample_site_pid + FROM nodes AS samp + -- Wide format: use p__produced_by column + JOIN nodes AS se ON ( + se.otype = 'SamplingEvent' + AND list_contains(samp.p__produced_by, se.row_id) + ) + -- Wide format: use p__sample_location column + JOIN nodes AS geo ON ( + geo.otype = 'GeospatialCoordLocation' + AND list_contains(se.p__sample_location, geo.row_id) + ) + -- Wide format: use p__sampling_site column + JOIN nodes AS site ON ( + site.otype = 'SamplingSite' + AND list_contains(se.p__sampling_site, site.row_id) + ) + WHERE samp.pid = ? + AND samp.otype = 'MaterialSampleRecord' + `; + const result = await loadData(q, [sample_pid], "loading_sample_data", "sample_data"); + return result && result.length ? 
result[0] : null; +} + +// WIDE FORMAT: Get agent info (who collected/registered) +async function get_sample_data_agents_sample_pid(sample_pid) { + if (sample_pid === null || sample_pid === "" || sample_pid === "unset") { + return []; + } + // Wide format: Uses p__produced_by and p__responsibility/p__registrant columns + const q = ` + WITH event_agents AS ( + SELECT + samp.pid AS sample_pid, + samp.label AS sample_label, + samp.description AS sample_description, + samp.thumbnail_url AS sample_thumbnail_url, + samp.thumbnail_url IS NOT NULL as has_thumbnail, + 'responsibility' AS predicate, + unnest(se.p__responsibility) AS agent_row_id + FROM nodes AS samp + JOIN nodes AS se ON ( + se.otype = 'SamplingEvent' + AND list_contains(samp.p__produced_by, se.row_id) + ) + WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord' + + UNION ALL + + SELECT + samp.pid AS sample_pid, + samp.label AS sample_label, + samp.description AS sample_description, + samp.thumbnail_url AS sample_thumbnail_url, + samp.thumbnail_url IS NOT NULL as has_thumbnail, + 'registrant' AS predicate, + unnest(samp.p__registrant) AS agent_row_id + FROM nodes AS samp + WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord' + ) + SELECT + ea.sample_pid, + ea.sample_label, + ea.sample_description, + ea.sample_thumbnail_url, + ea.has_thumbnail, + ea.predicate, + agent.pid AS agent_pid, + agent.name AS agent_name, + agent.alternate_identifiers AS agent_alternate_identifiers + FROM event_agents ea + JOIN nodes AS agent ON ( + agent.row_id = ea.agent_row_id + AND agent.otype = 'Agent' + ) + `; + const result = await loadData(q, [sample_pid, sample_pid], "loading_agents", "agents"); + return result ?? 
[]; +} + +// WIDE FORMAT: Get classification keywords and types +async function get_sample_types_and_keywords_via_sample_pid(sample_pid) { + if (sample_pid === null || sample_pid === "" || sample_pid === "unset") { + return []; + } + // Wide format: Sample has p__keywords, p__has_sample_object_type, p__has_material_category columns + const q = ` + WITH sample_concepts AS ( + SELECT + samp.pid AS sample_pid, + samp.label AS sample_label, + 'keywords' AS predicate, + unnest(samp.p__keywords) AS concept_row_id + FROM nodes AS samp + WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord' + + UNION ALL + + SELECT + samp.pid AS sample_pid, + samp.label AS sample_label, + 'has_sample_object_type' AS predicate, + unnest(samp.p__has_sample_object_type) AS concept_row_id + FROM nodes AS samp + WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord' + + UNION ALL + + SELECT + samp.pid AS sample_pid, + samp.label AS sample_label, + 'has_material_category' AS predicate, + unnest(samp.p__has_material_category) AS concept_row_id + FROM nodes AS samp + WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord' + ) + SELECT + sc.sample_pid, + sc.sample_label, + sc.predicate, + kw.pid AS keyword_pid, + kw.label AS keyword + FROM sample_concepts sc + JOIN nodes AS kw ON ( + kw.row_id = sc.concept_row_id + AND kw.otype = 'IdentifiedConcept' + ) + `; + const result = await loadData(q, [sample_pid, sample_pid, sample_pid], "loading_keywords", "keywords"); + return result ?? []; +} + +async function locationUsedBy(rowid){ + if (rowid === undefined || rowid === null) { + return []; + } + // Wide format: Check which entities reference this location via p__* columns + const q = ` + SELECT pid, otype FROM nodes + WHERE list_contains(p__sample_location, ?) + OR list_contains(p__site_location, ?) 
+ `; + return db.query(q, [rowid, rowid]); +} + +mutable clickedPointId = "unset"; +// Loading flags to control UI clearing while fetching +mutable geoLoading = false; +mutable s1Loading = false; +mutable s2Loading = false; +mutable combinedLoading = false; + +// Precompute selection-driven data with loading flags +selectedGeoRecord = { + mutable geoLoading = true; + try { + return await getGeoRecord(clickedPointId); + } finally { + mutable geoLoading = false; + } +} + +selectedSamples1 = { + mutable s1Loading = true; + try { + return await get_samples_1(clickedPointId); + } finally { + mutable s1Loading = false; + } +} + +selectedSamples2 = { + mutable s2Loading = true; + try { + return await get_samples_2(clickedPointId); + } finally { + mutable s2Loading = false; + } +} + +selectedSamplesCombined = { + mutable combinedLoading = true; + try { + return await get_samples_at_geo_cord_location_via_sample_event(clickedPointId); + } finally { + mutable combinedLoading = false; + } +} + +md`Retrieved ${pointdata.length} locations from ${parquet_path}.`; +``` + +```{ojs} +//| echo: false +//| output: false +// Center initial Cesium view on PKAP Survey Area and also set Home to PKAP! +{ + const viewer = content.viewer; + // PKAP Survey Area near Cyprus + // Source: https://opencontext.org/subjects/48fd434c-f6d3... 
+ const pkapLat = 34.987406; + const pkapLon = 33.708047; + const delta = 0.3; // degrees padding around point + const pkapRect = Cesium.Rectangle.fromDegrees( + pkapLon - delta, // west (lon) + pkapLat - delta, // south (lat) + pkapLon + delta, // east (lon) + pkapLat + delta // north (lat) + ); + + // Make the Home button go to PKAP as well + Cesium.Camera.DEFAULT_VIEW_RECTANGLE = pkapRect; + Cesium.Camera.DEFAULT_VIEW_FACTOR = 0.5; + + // Apply camera after the first render to avoid resize/tab visibility issues + const once = () => { + viewer.camera.setView({ destination: pkapRect }); + viewer.scene.postRender.removeEventListener(once); + }; + viewer.scene.postRender.addEventListener(once); +} +``` + +```{ojs} +//| echo: false +//| output: false +// Handle geocode search: fly to location and trigger queries +{ + if (searchGeoPid && searchGeoPid.trim() !== "") { + const pid = searchGeoPid.trim(); + + // Look up the geocode in the database + const q = `SELECT pid, latitude, longitude FROM nodes WHERE otype='GeospatialCoordLocation' AND pid=?`; + const result = await db.query(q, [pid]); + + if (result && result.length > 0) { + const geo = result[0]; + const viewer = content.viewer; + + // Fly camera to the location + const position = Cesium.Cartesian3.fromDegrees( + geo.longitude, + geo.latitude, + 15000 // 15km altitude for good view + ); + + viewer.camera.flyTo({ + destination: position, + duration: 2.0, // 2 second flight + complete: () => { + // After camera arrives, trigger the click to load data + mutable clickedPointId = pid; + } + }); + } else { + // Geocode not found - could display error to user + console.warn(`Geocode not found: ${pid}`); + } + } +} +``` + +```{ojs} +//| echo: false +//| output: false +// Handle optional classification button: recolor dots by type +// WIDE FORMAT: Uses p__sample_location and p__site_location columns directly +{ + if (classifyDots !== null) { + console.log("Classifying dots by type (wide format)..."); + 
performance.mark('classify-start'); + + try { + // Wide format classification query - uses p__* columns directly + const query = ` + WITH geo_classification AS ( + SELECT + geo.pid, + MAX(CASE WHEN se.row_id IS NOT NULL THEN 1 ELSE 0 END) as is_sample_location, + MAX(CASE WHEN site.row_id IS NOT NULL THEN 1 ELSE 0 END) as is_site_location + FROM nodes geo + LEFT JOIN nodes se ON ( + se.otype = 'SamplingEvent' + AND list_contains(se.p__sample_location, geo.row_id) + ) + LEFT JOIN nodes site ON ( + site.otype = 'SamplingSite' + AND list_contains(site.p__site_location, geo.row_id) + ) + WHERE geo.otype = 'GeospatialCoordLocation' + GROUP BY geo.pid + ) + SELECT + pid, + CASE + WHEN is_sample_location = 1 AND is_site_location = 1 THEN 'both' + WHEN is_sample_location = 1 THEN 'sample_location_only' + WHEN is_site_location = 1 THEN 'site_location_only' + END as location_type + FROM geo_classification + `; + + const classifications = await db.query(query); + + // Build lookup map: pid -> location_type + const typeMap = new Map(); + for (const row of classifications) { + typeMap.set(row.pid, row.location_type); + } + + // Color and size styling by location type + const styles = { + sample_location_only: { + color: Cesium.Color.fromCssColorString('#2E86AB'), + size: 3 + }, // Blue - field collection points + site_location_only: { + color: Cesium.Color.fromCssColorString('#A23B72'), + size: 6 + }, // Purple - administrative markers + both: { + color: Cesium.Color.fromCssColorString('#F18F01'), + size: 5 + } // Orange - dual-purpose + }; + + // Update colors of existing points + const points = content.points; + for (let i = 0; i < points.length; i++) { + const point = points.get(i); + const pid = point.id; + const locationType = typeMap.get(pid); + + if (locationType && styles[locationType]) { + point.color = styles[locationType].color; + point.pixelSize = styles[locationType].size; + } + } + + performance.mark('classify-end'); + performance.measure('classification', 
'classify-start', 'classify-end'); + const classifyTime = performance.getEntriesByName('classification')[0].duration; + console.log(`Classification completed in ${classifyTime.toFixed(0)}ms - updated ${points.length} points`); + console.log(` - Blue (sample_location_only): field collection points`); + console.log(` - Purple (site_location_only): administrative markers`); + console.log(` - Orange (both): dual-purpose locations`); + } catch (error) { + console.error("Classification failed:", error); + console.error("Error details:", error.message); + + // Show user-friendly message in browser console + console.warn("⚠️ Color-coding failed due to a data loading issue."); + console.warn("💡 Tip: This is an intermittent DuckDB-WASM issue with remote files."); + console.warn(" Try clicking the button again, or use a local cached file for better reliability."); + console.warn(" See the 'Using a local cached file' section above for instructions."); + + // Note: We don't show an alert() to avoid disrupting the user experience + // The page remains functional, just without the color-coding + } + } +} +``` + +::: {.panel-tabset} + +## Map + +
+ +## Data + +
Loading...
+ +```{ojs} +//| code-fold: true + +viewof pointdata = { + const data_table = Inputs.table(locations, { + header: { + pid: "PID", + latitude: "Latitude", + longitude: "Longitude", + location_type: "Location Type" + }, + }); + return data_table; +} +``` + +::: + +The click point ID is "${clickedPointId}". + + + +```{ojs} +//| echo: false +geoLoading ? md`(loading…)` : md`\`\`\` +${JSON.stringify(selectedGeoRecord, null, 2)} +\`\`\` +` +``` + +## getGeoRecord (selected) + +```{ojs} +//| code-fold: true +pid = clickedPointId +testrecord = selectedGeoRecord; +``` + +```{ojs} +//| echo: false +md`\`\`\` +${JSON.stringify(testrecord, null, 2)} +\`\`\` +` +``` + +## Samples at Location via Sampling Event (Eric Kansa's Query - Wide Format) + + + +This query implements Eric Kansa's authoritative `get_samples_at_geo_cord_location_via_sample_event` function from [open-context-py](https://github.com/ekansa/open-context-py/blob/staging/opencontext_py/apps/all_items/isamples/isamples_explore.py), **translated to wide format**. + +::: {.callout-note} +## Wide Format Query Advantage + +**Narrow format** requires 7+ JOINs through edge rows: +```sql +JOIN nodes AS rel_se ON (rel_se.p = 'sample_location' AND list_contains(rel_se.o, geo.row_id)) +JOIN nodes AS se ON (rel_se.s = se.row_id ...) +``` + +**Wide format** uses direct column access (3 JOINs): +```sql +JOIN nodes AS se ON (se.otype = 'SamplingEvent' AND list_contains(se.p__sample_location, geo.row_id)) +``` + +This is typically **2-4x faster** over HTTP. 
+::: + +**Query Strategy (Path 1 Only)**: +- Starts at a GeospatialCoordLocation (clicked point) +- Walks **backward** via `p__sample_location` column to find SamplingEvents that reference this location +- From those events, finds MaterialSampleRecords via `p__produced_by` column +- Requires site context (INNER JOIN on `p__sampling_site` → SamplingSite) + +**Returns**: +- Geographic coordinates: `latitude`, `longitude` +- Sample metadata: `sample_pid`, `sample_label`, `sample_description`, `sample_alternate_identifiers` +- Site context: `sample_site_label`, `sample_site_pid` +- Media: `sample_thumbnail_url`, `has_thumbnail` + +**Ordering**: Prioritizes samples with images (`ORDER BY has_thumbnail DESC`) + +**Important**: This query only returns samples whose **sampling events directly reference this geolocation** via `p__sample_location` (Path 1). Samples that reach this location only through their site's `p__site_location` (Path 2) are **not included**. This means site marker locations may return 0 results if no events were recorded at that exact coordinate. + +```{ojs} +//| echo: false +samples_combined = selectedSamplesCombined +``` + +```{ojs} +//| echo: false +html`${ + combinedLoading ? + html`
Loading samples…
` + : + samples_combined && samples_combined.length > 0 ? + html`
+ + + + + + + + + + + + ${samples_combined.map((sample, i) => html` + + + + + + + + `)} + +
ThumbnailSampleDescriptionSiteLocation
+ ${sample.has_thumbnail ? + html` + ${sample.sample_label} + ` + : + html`
No image
` + } +
+
+ ${sample.sample_label} +
+ +
+
+ ${sample.sample_description || 'No description'} +
+
+
+ ${sample.sample_site_label} +
+ +
+ ${sample.latitude.toFixed(5)}°N
+ ${sample.longitude.toFixed(5)}°E +
+
+
+ Found ${samples_combined.length} sample${samples_combined.length !== 1 ? 's' : ''} +
` + : + html`
+ No samples found at this location via Path 1 (direct sampling events). +
`
+}`
+```
+
+## Understanding Paths in the iSamples Property Graph
+
+### Why "Path 1" and "Path 2"?
+
+These terms describe the **two main ways to get from a MaterialSampleRecord to geographic coordinates**. They're not the only relationship paths in the graph, but they're the most commonly used for spatial queries.
+
+**Path 1 (Direct Event Location) - Wide Format**
+```
+MaterialSampleRecord
+  → p__produced_by →
+SamplingEvent
+  → p__sample_location →
+GeospatialCoordLocation
+```
+
+**Path 2 (Via Sampling Site) - Wide Format**
+```
+MaterialSampleRecord
+  → p__produced_by →
+SamplingEvent
+  → p__sampling_site →
+SamplingSite
+  → p__site_location →
+GeospatialCoordLocation
+```
+
+**Key Differences:**
+- **Path 1 is direct**: Event → Location (3 nodes, 2 hops from the sample)
+- **Path 2 goes through Site**: Event → Site → Location (4 nodes, 3 hops from the sample)
+- **Path 1** = "Where was this specific sample collected?"
+- **Path 2** = "What named site is this sample from, and where is that site?"
+
+**Wide Format Advantage**: Instead of JOINing through separate edge rows (otype='_edge_'), we directly access the `p__*` columns on entity rows.
+
+**Important:** The queries below use INNER JOIN for both paths, meaning samples must have connections through both paths to appear in results. Samples with only one path will be excluded.
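
To make the two paths concrete, here is a minimal JavaScript sketch that walks both of them over a handful of in-memory rows. It is not the page's actual query: the rows, the pids, and the helper names `follow` and `locationsForSample` are hypothetical, and it assumes (as the `list_contains(...)` calls on this page suggest) that each `p__*` column holds a list of target `row_id` values.

```javascript
// Hypothetical toy rows in the wide layout: each entity row carries its
// outgoing edges as p__* arrays of target row_ids (an assumption based on
// the list_contains(...) calls used in this page's SQL).
const rows = [
  { row_id: 1, otype: "MaterialSampleRecord", pid: "sample-1", p__produced_by: [2] },
  { row_id: 2, otype: "SamplingEvent", p__sample_location: [4], p__sampling_site: [3] },
  { row_id: 3, otype: "SamplingSite", label: "Site A", p__site_location: [5] },
  { row_id: 4, otype: "GeospatialCoordLocation", latitude: 35.01, longitude: 33.7 },
  { row_id: 5, otype: "GeospatialCoordLocation", latitude: 35.0, longitude: 33.69 },
  { row_id: 6, otype: "MaterialSampleRecord", pid: "sample-2", p__produced_by: [7] },
  // This event has a site but no direct sample location (Path 2 only).
  { row_id: 7, otype: "SamplingEvent", p__sampling_site: [3] }
];

const byId = new Map(rows.map((r) => [r.row_id, r]));

// Follow one p__* column from a set of rows to the rows it points at.
function follow(sources, column) {
  return sources.flatMap((r) =>
    (r[column] ?? []).map((id) => byId.get(id)).filter(Boolean)
  );
}

// Resolve the locations reachable from a sample via Path 1 and Path 2.
function locationsForSample(pid) {
  const sample = rows.filter(
    (r) => r.otype === "MaterialSampleRecord" && r.pid === pid
  );
  const events = follow(sample, "p__produced_by");
  return {
    path1: follow(events, "p__sample_location"),
    path2: follow(follow(events, "p__sampling_site"), "p__site_location")
  };
}

const s1 = locationsForSample("sample-1");
const s2 = locationsForSample("sample-2");
console.log(s1.path1.map((g) => g.row_id), s1.path2.map((g) => g.row_id));
console.log(s2.path1.map((g) => g.row_id), s2.path2.map((g) => g.row_id));
```

In this toy data `sample-1` reaches a location through both paths, while `sample-2` reaches one only through its site. A query that INNER JOINs on both paths would therefore return `sample-1` and drop `sample-2`.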
+
+### Full Relationship Map (Beyond Path 1 and Path 2)
+
+The iSamples property graph contains many more relationships than just the geographic paths:
+
+```
+                                       Agent
+                                         ↑
+                                         | {p__responsibility, p__registrant}
+                                         |
+MaterialSampleRecord ──p__produced_by──→ SamplingEvent ──p__sample_location──→ GeospatialCoordLocation
+    |                                    |                                                      ↑
+    |                                    |                                                      |
+    | {p__keywords,                      └──p__sampling_site──→ SamplingSite ──p__site_location─┘
+    |  p__has_sample_object_type,
+    |  p__has_material_category}
+    |
+    └──→ IdentifiedConcept
+```
+
+**Path Categories (Wide Format):**
+- **PATH 1**: MaterialSampleRecord → SamplingEvent → GeospatialCoordLocation (via `p__produced_by`, `p__sample_location`)
+- **PATH 2**: MaterialSampleRecord → SamplingEvent → SamplingSite → GeospatialCoordLocation (via `p__sampling_site`, `p__site_location`)
+- **AGENT PATH**: MaterialSampleRecord → SamplingEvent → Agent (via `p__responsibility`, `p__registrant`)
+- **CONCEPT PATH**: MaterialSampleRecord → IdentifiedConcept (via `p__keywords`, `p__has_sample_object_type`, `p__has_material_category` - direct, no event!)
+
+**Key Insight:** SamplingEvent is the central hub for most relationships, except concepts which attach directly to MaterialSampleRecord.
+
+### Query Pattern Analysis (Wide Format Translations)
+
+The following analysis shows Eric's query functions translated to wide format:
+
+#### 1. `get_sample_data_via_sample_pid` - Uses BOTH Path 1 AND Path 2
+```
+MaterialSampleRecord (WHERE pid = ?)
+  → p__produced_by → SamplingEvent
+    ├─→ p__sample_location → GeospatialCoordLocation [Path 1]
+    └─→ p__sampling_site → SamplingSite [Path 2]
+
+Returns: sample metadata + lat/lon + site label/pid
+Required: BOTH paths must exist (INNER JOIN)
+```
+
+#### 2. `get_sample_data_agents_sample_pid` - Uses AGENT PATH
+```
+MaterialSampleRecord (WHERE pid = ?)
+  → p__produced_by → SamplingEvent
+    → p__responsibility → Agent
+
+Returns: sample metadata + agent info (who collected/registered)
+Independent of: Path 1 and Path 2 (no geographic data)
+```
+
+#### 3. `get_sample_types_and_keywords_via_sample_pid` - Uses CONCEPT PATH
+```
+MaterialSampleRecord (WHERE pid = ?)
+  → {p__keywords, p__has_sample_object_type, p__has_material_category} → IdentifiedConcept
+
+Returns: sample metadata + classification keywords/types
+Independent of: Path 1, Path 2, and SamplingEvent!
+```
+
+#### 4. `get_samples_at_geo_cord_location_via_sample_event` - REVERSE Path 1 + Path 2
+```
+GeospatialCoordLocation (WHERE pid = ?) ← START HERE (reverse!)
+  ← p__sample_location ← SamplingEvent [Path 1 REVERSED]
+    ├─→ p__sampling_site → SamplingSite [Path 2 enrichment]
+    └─← p__produced_by ← MaterialSampleRecord [complete chain]
+
+Returns: all samples at a given location + site info
+Direction: geo → samples (opposite of other queries)
+```
+
+**Summary Table:**
+
+| Function | Path 1 | Path 2 | Direction | Notes |
+|----------|--------|--------|-----------|-------|
+| `get_sample_data_via_sample_pid` | ✅ Required | ✅ Required | Forward | INNER JOIN - no row if either missing |
+| `get_sample_data_agents_sample_pid` | ❌ N/A | ❌ N/A | N/A | Uses agent path instead |
+| `get_sample_types_and_keywords_via_sample_pid` | ❌ N/A | ❌ N/A | N/A | Direct edges to concepts |
+| `get_samples_at_geo_cord_location_via_sample_event` | ✅ Required | ✅ Required | Reverse | Walks from geo to samples |
+
+## Related Sample Path 1 (selected)
+
+
+
+Path 1 (direct_event_location): find MaterialSampleRecord items whose producing SamplingEvent has a direct `p__sample_location` pointing to the clicked GeospatialCoordLocation (pid).
+
+- Chain: MaterialSampleRecord → p__produced_by → SamplingEvent → p__sample_location → GeospatialCoordLocation (clicked pid)
+- This matches the "direct_samples" concept in the Python notebook and is labeled as `location_path = 'direct_event_location'` in the query.
+
+```{ojs}
+//| echo: false
+samples_1 = selectedSamples1
+```
+
+```{ojs}
+//| echo: false
+html`${
+  s1Loading ?
+    html`
Loading Path 1 samples…
` + : + samples_1 && samples_1.length > 0 ? + html`
+ + + + + + + + + + + + ${samples_1.map((sample, i) => html` + + + + + + + + `)} + +
ThumbnailSampleDescriptionSiteLocation
+ ${sample.has_thumbnail ? + html` + ${sample.sample_label} + ` + : + html`
No image
` + } +
+
+ ${sample.sample_label} +
+ +
+
+ ${sample.sample_description || 'No description'} +
+
+
+ ${sample.sample_site_label} +
+ +
+ ${sample.latitude.toFixed(5)}°N
+ ${sample.longitude.toFixed(5)}°E +
+
+
+ Found ${samples_1.length} sample${samples_1.length !== 1 ? 's' : ''} via Path 1 (direct event location) +
` + : + html`
+ No samples found via Path 1 (direct event location). +
`
+}`
+```
+
+
+## Related Sample Path 2 (selected)
+
+
+
+Path 2 (via_site_location): find MaterialSampleRecord items whose producing SamplingEvent references a SamplingSite via `p__sampling_site`, and that site's `p__site_location` points to the clicked GeospatialCoordLocation (pid).
+
+- Chain: MaterialSampleRecord → p__produced_by → SamplingEvent → p__sampling_site → SamplingSite → p__site_location → GeospatialCoordLocation (clicked pid)
+- This matches the "samples_via_sites" concept in the Python notebook and is labeled as `location_path = 'via_site_location'` in the query.
+
+```{ojs}
+//| echo: false
+samples_2 = selectedSamples2
+```
+
+```{ojs}
+//| echo: false
+html`${
+  s2Loading ?
+    html`
Loading Path 2 samples…
` + : + samples_2 && samples_2.length > 0 ? + html`
+ + + + + + + + + + + + ${samples_2.map((sample, i) => html` + + + + + + + + `)} + +
ThumbnailSampleDescriptionSiteLocation
+ ${sample.has_thumbnail ? + html` + ${sample.sample_label} + ` + : + html`
No image
` + } +
+
+ ${sample.sample_label} +
+ +
+
+ ${sample.sample_description || 'No description'} +
+
+
+ ${sample.sample_site_label} +
+ +
+ ${sample.latitude.toFixed(5)}°N
+ ${sample.longitude.toFixed(5)}°E +
+
+
+ Found ${samples_2.length} sample${samples_2.length !== 1 ? 's' : ''} via Path 2 (via site location) +
` + : + html`
+ No samples found via Path 2 (via site location). +
`
+}`
+```
+
+## Geographic Location Classification
+
+::: {.callout-tip icon=false}
+## ✅ IMPLEMENTED - Differentiated Geographic Visualization
+
+**Current implementation**: GeospatialCoordLocations are now color-coded by their semantic role in the property graph:
+
+- 🔵 **Blue (small)** - `sample_location_only`: Precise field collection points (Path 1)
+- 🟣 **Purple (large)** - `site_location_only`: Administrative site markers (Path 2)
+- 🟠 **Orange (medium)** - `both`: Dual-purpose locations (used for both Path 1 and Path 2)
+
+**Discovery**: Analysis of the OpenContext parquet data reveals that geos fall into three distinct categories based on their usage:
+
+1. **`sample_location_only`**: Precise field collection points (Path 1)
+   - Most common category
+   - Represents exact GPS coordinates where sampling events occurred
+   - Varies per event, even within the same site
+
+2. **`site_location_only`**: Administrative site markers (Path 2)
+   - Represents general/reference locations for named archaeological sites
+   - One coordinate per site
+   - May not correspond to any actual collection point
+
+3. **`both`**: 10,346 geos (5.2%) - Dual-purpose locations
+   - Used as BOTH `p__sample_location` AND `p__site_location`
+   - Primarily single-location sites (85% of all sites)
+   - Occasionally one of many locations at multi-location sites (e.g., PKAP)
+
+**Site spatial patterns**:
+- **85.4%** of sites are compact (single location) - all events at one coordinate
+  - Example: Suberde - 384 events at one location
+- **14.6%** of sites are distributed (multiple locations) - events spread across space
+  - Example: PKAP Survey Area - 15,446 events across 544 different coordinates
+  - Poggio Civitate - 29,985 events across 11,112 coordinates
+
+### Benefits of Current Implementation
+
+1. **Educational**: Makes Path 1 vs Path 2 distinction visually concrete
+   - Users can SEE the semantic difference between precise and administrative locations
+   - Blue points show where samples were actually collected (Path 1)
+   - Purple points show administrative site markers (Path 2)
+   - Demonstrates the complementary nature of the two geographic paths
+
+2. **Exploratory**: Enables visual understanding of spatial patterns
+   - Archaeological sites appear as purple markers (large points)
+   - Field collection points appear as blue markers (small points)
+   - Dual-purpose locations appear as orange markers (medium points)
+   - No UI filters required - the colors provide immediate visual differentiation
+
+3. **Analytical**: Reveals site spatial structure at a glance
+   - Compact sites: tight cluster of blue points around purple marker
+   - Survey areas: purple marker with cloud of blue points spread across region
+   - Identifies sampling strategies and field methodologies by visual inspection
+
+### Wide Format Advantage
+
+The classification query is **simpler** in wide format because it directly checks `p__sample_location` and `p__site_location` columns instead of querying through edge rows.
+
+**Narrow format** (edge rows):
+```sql
+JOIN nodes e ON (geo.row_id = e.o[1])
+WHERE e.p IN ('sample_location', 'site_location')
+```
+
+**Wide format** (direct columns):
+```sql
+LEFT JOIN nodes AS se ON (list_contains(se.p__sample_location, geo.row_id))
+LEFT JOIN nodes AS site ON (list_contains(site.p__site_location, geo.row_id))
+```
+
+### Implementation Status
+
+**Status**: ✅ **IMPLEMENTED** (Basic color-coding by location type)
+
+**What's implemented**:
+- ✅ Classification query using `p__sample_location` and `p__site_location` columns
+- ✅ Conditional styling by location_type
+- ✅ Color-coded points: Blue (sample_location), Purple (site_location), Orange (both)
+- ✅ Size differentiation: 3px (field points), 6px (sites), 5px (dual-purpose)
+
+**Future enhancements** (not yet implemented):
+- ⬜ UI filter controls (checkbox toggles for each location type)
+- ⬜ Site Explorer Mode (click site → highlight all sample_locations)
+- ⬜ Convex hull/region drawing for distributed sites
+- ⬜ Dynamic statistics display on site selection
+
+This implementation transforms the visualization from uniform points into a pedagogical tool that visually demonstrates the Path 1 vs Path 2 distinction in the iSamples metadata model architecture.
+
+:::
+
+## See Also
+
+- [Cesium View (Narrow Format)](/tutorials/parquet_cesium.html) - Same visualization using the narrow schema
+- [Narrow vs Wide Performance](/tutorials/narrow_vs_wide_performance.html) - Benchmark comparison
+- [iSamples Parquet Tutorial](/tutorials/parquet.qmd) - Introduction to parquet format