
feat(geo): Geospatial indexing with geo.* SQL functions (LSM-Tree native storage)#3510

Open
robfrank wants to merge 44 commits into main from lsmtree-geospatial

Conversation

@robfrank
Collaborator

@robfrank robfrank commented Feb 22, 2026

Geospatial Indexing Support (LSM-Tree native storage, geo.* SQL functions)

Overview

Adds full geospatial indexing to ArcadeDB using the native LSM-Tree engine as storage backend and a geo.* SQL function namespace consistent with ArcadeDB's existing dot-namespace convention (e.g. vector.neighbors, vector.cosineSimilarity). The design mirrors the existing LSMTreeFullTextIndex pattern: a thin wrapper that tokenizes geometry into GeoHash cells via Apache Lucene's lucene-spatial-extras library, stored in the LSM-Tree — inheriting ACID, WAL, HA, and compaction for free.
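To illustrate the GeoHash tokenization idea, here is a minimal standalone encoder. This is a sketch for illustration only: the PR actually derives cells through Lucene's lucene-spatial-extras prefix-tree strategy, not hand-rolled code like this.

```java
// Minimal GeoHash encoder, for illustration only. The PR derives cells via
// Lucene's lucene-spatial-extras prefix-tree strategy; this standalone version
// just shows how a (lat, lon) pair maps to a hierarchical base-32 cell token.
public class GeoHashSketch {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  public static String encode(final double lat, final double lon, final int precision) {
    double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
    final StringBuilder hash = new StringBuilder();
    boolean evenBit = true; // bits alternate: lon, lat, lon, ...
    int bit = 0, ch = 0;
    while (hash.length() < precision) {
      if (evenBit) {
        final double mid = (lonMin + lonMax) / 2;
        if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; } else { ch = ch << 1; lonMax = mid; }
      } else {
        final double mid = (latMin + latMax) / 2;
        if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; } else { ch = ch << 1; latMax = mid; }
      }
      evenBit = !evenBit;
      if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
    }
    return hash.toString();
  }

  public static void main(String[] args) {
    // Classic test vector: (57.64911, 10.40744) encodes to "u4pruydqqvj"
    System.out.println(encode(57.64911, 10.40744, 11));
  }
}
```

The key property this buys: a coarser cell token is always a prefix of the finer one, which is what makes hierarchical cell lookups in an ordered key store possible.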


New index type: GEOSPATIAL

Create a geospatial index on any STRING property that stores WKT geometry:

```sql
CREATE DOCUMENT TYPE Location;
CREATE PROPERTY Location.coords STRING;
CREATE INDEX ON Location (coords) GEOSPATIAL;
```

Configurable GeoHash precision (default 11, ~2.4 m resolution; range 1–12). Precision is persisted in the schema JSON and survives database reopen.


Automatic query optimizer integration

No search_index() call required. Any WHERE clause using a geo.* spatial predicate on an indexed field is automatically routed through the geospatial index:

```sql
-- Uses GEOSPATIAL index automatically
SELECT FROM Location
WHERE geo.within(coords, geo.geomFromText('POLYGON ((10 38, 16 38, 16 44, 10 44, 10 38))')) = true

-- Same query falls back to a full scan transparently when no index exists
SELECT FROM Location
WHERE geo.within(coords, geo.geomFromText('POLYGON ((10 38, 16 38, 16 44, 10 44, 10 38))')) = true
```

The index returns a GeoHash-cell superset of candidates; the exact Spatial4j/JTS predicate is applied as a post-filter (shouldExecuteAfterSearch = true).
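The two-phase pattern (coarse index superset, then exact post-filter) can be sketched as follows, with a toy in-memory map standing in for the real index and a lat/lon rectangle test in place of the real Spatial4j/JTS predicate. All names here are hypothetical, not ArcadeDB APIs.

```java
import java.util.*;

// Toy illustration of the two-phase lookup: the index returns a SUPERSET of
// candidates keyed by GeoHash cell prefix, and the exact geometric predicate
// runs as a post-filter (analogous to shouldExecuteAfterSearch = true).
public class SupersetPostFilter {
  public record Point(String id, double lat, double lon) {}

  /** Phase 1: prefix lookup over the cell token returns a superset of candidates. */
  public static List<Point> candidates(Map<String, Point> index, String cellToken) {
    final List<Point> out = new ArrayList<>();
    for (Map.Entry<String, Point> e : index.entrySet())
      if (e.getKey().startsWith(cellToken))
        out.add(e.getValue());
    return out;
  }

  /** Phase 2: exact predicate as post-filter (a lat/lon box stands in for JTS). */
  public static List<Point> postFilter(List<Point> candidates,
      double minLat, double maxLat, double minLon, double maxLon) {
    final List<Point> out = new ArrayList<>();
    for (Point p : candidates)
      if (p.lat() >= minLat && p.lat() <= maxLat && p.lon() >= minLon && p.lon() <= maxLon)
        out.add(p);
    return out;
  }
}
```

The cell lookup may return records whose geometry only shares a GeoHash cell with the search shape; the post-filter is what guarantees exact semantics.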


12 geo.* constructor / accessor functions

| Function | Description |
|---|---|
| `geo.geomFromText(wkt)` | Parse any WKT string → Shape |
| `geo.point(x, y)` | Returns `POINT (x y)` WKT |
| `geo.lineString(pts)` | Returns `LINESTRING (...)` WKT |
| `geo.polygon(pts)` | Returns `POLYGON ((…))` WKT, auto-closes ring |
| `geo.buffer(geom, dist)` | OGC buffer via JTS `Geometry.buffer()` |
| `geo.envelope(geom)` | Bounding rectangle as WKT |
| `geo.distance(g1, g2 [,unit])` | Haversine; units: `m` (default), `km`, `mi`, `nmi` |
| `geo.area(geom)` | Area in square degrees via Spatial4j |
| `geo.asText(geom)` | Shape → WKT string |
| `geo.asGeoJson(geom)` | Shape → GeoJSON string |
| `geo.x(point)` | Extract longitude |
| `geo.y(point)` | Extract latitude |
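Since `geo.distance` is described as Haversine with unit conversion, a self-contained version of that formula looks roughly like the sketch below. This is hypothetical illustration code, not the PR's implementation; it assumes a mean Earth radius of 6,371,000 m.

```java
// Haversine great-circle distance with unit conversion, mirroring the
// documented behaviour of geo.distance (m default, km, mi, nmi).
// Illustration only; assumes a mean Earth radius of 6,371,000 m.
public class HaversineSketch {
  private static final double EARTH_RADIUS_M = 6_371_000.0;

  public static double distance(double lat1, double lon1, double lat2, double lon2, String unit) {
    final double dLat = Math.toRadians(lat2 - lat1);
    final double dLon = Math.toRadians(lon2 - lon1);
    final double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
        + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
          * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    final double meters = EARTH_RADIUS_M * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    return switch (unit) {
      case "m"   -> meters;
      case "km"  -> meters / 1_000.0;
      case "mi"  -> meters / 1_609.344;   // statute mile
      case "nmi" -> meters / 1_852.0;     // nautical mile
      default    -> throw new IllegalArgumentException("Unknown unit: " + unit);
    };
  }
}
```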

9 geo.* spatial predicate functions

All implement IndexableSQLFunction for automatic optimizer integration. Predicates that are semantically incompatible with a GeoHash intersection superset correctly opt out of indexed execution.

| Function | Indexed | Notes |
|---|---|---|
| `geo.within(g, shape)` | yes | g fully within shape |
| `geo.intersects(g, shape)` | yes | g and shape share any point |
| `geo.contains(g, shape)` | no | containment direction flips index semantics |
| `geo.dWithin(g, shape, dist)` | no | requires bounding-circle expansion (future work) |
| `geo.disjoint(g, shape)` | no | disjoint records are absent from index result |
| `geo.equals(g, shape)` | no | requires exact coordinate match |
| `geo.crosses(g, shape)` | no | DE-9IM; full scan with JTS post-filter |
| `geo.overlaps(g, shape)` | no | DE-9IM; full scan with JTS post-filter |
| `geo.touches(g, shape)` | no | DE-9IM; full scan with JTS post-filter |

All predicates return null when either argument is null (three-valued SQL logic).


SQL parser: geo.* namespace support

The ANTLR4 SQL grammar does not natively support dotted function names. A FUNCTION_NAMESPACES set in SQLASTBuilder.visitIdentifierChain rewrites geo.function(args) identifier-chain patterns into proper FunctionCall AST nodes at visitor level, keeping the grammar rule ordering intact and avoiding regressions with field.method() patterns.
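The visitor-level rewrite can be sketched as below. This is a hypothetical simplification; the real logic lives in `SQLASTBuilder.visitIdentifierChain` and builds `FunctionCall` AST nodes rather than returning strings.

```java
import java.util.*;

// Sketch of the namespace rewrite described above: when an identifier chain
// starts with a registered namespace and is followed by a call, treat
// "<ns>.<name>(args)" as a single function name instead of field access.
// Hypothetical simplification, not the actual SQLASTBuilder code.
public class NamespaceRewriteSketch {
  static final Set<String> FUNCTION_NAMESPACES = Set.of("geo", "vector");

  /** Returns the dotted function name, or null if this is ordinary field access. */
  public static String rewrite(List<String> identifierChain, boolean followedByCall) {
    if (followedByCall && identifierChain.size() == 2
        && FUNCTION_NAMESPACES.contains(identifierChain.get(0)))
      return identifierChain.get(0) + "." + identifierChain.get(1);
    return null; // leave field.method() and plain field chains untouched
  }
}
```

Gating on the namespace allowlist is what avoids regressions with `field.method()` patterns: only chains whose head is a known namespace are rewritten.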

Note: The JavaCC SQL parser does not support unquoted dotted names. Use backtick syntax for JavaCC contexts: `geo.point`(x, y).


Breaking changes

The old non-standard geo functions are removed and replaced by geo.* equivalents:

| Removed | Replacement |
|---|---|
| `point(x, y)` | `geo.point(x, y)` |
| `distance(p1, p2)` | `geo.distance(p1, p2)` |
| `circle(c, r)` | `geo.buffer(geom, dist)` |
| `polygon(pts)` | `geo.polygon(pts)` |
| `lineString(pts)` | `geo.lineString(pts)` |
| `rectangle(pts)` | `geo.envelope(geom)` |

Cypher point(lat, lon) and distance(p1, p2) are preserved via CypherFunctionFactory and continue to work unchanged in Cypher queries.


New dependency

org.apache.lucene:lucene-spatial-extras (version ${lucene.version}, Apache 2.0). Lucene core is already a transitive dependency; this is a sibling module. Attribution added to ATTRIBUTIONS.md.


Tests

| Test class | Coverage |
|---|---|
| `GeoIndexMetadataTest` | precision default, round-trip serialization |
| `LSMTreeGeoIndexTest` | put/get/remove, null WKT, transaction replay path |
| `LSMTreeGeoIndexSchemaTest` | DDL creation, persistence across database reopen, non-default precision round-trip |
| `SQLGeoFunctionsTest` | all 12 accessor + 9 predicate functions, null inputs |
| `SQLGeoIndexedQueryTest` | end-to-end indexed SQL queries for all 9 predicates, no-index fallback |

79 tests total. Full engine suite (4978 tests) passes with no regressions.

@gemini-code-assist
Contributor

Summary of Changes

Hello @robfrank, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and standardized geospatial indexing system to ArcadeDB. It leverages a native LSM-Tree implementation and GeoHash tokenization to efficiently store and query spatial data. By adopting OGC/PostGIS ST_* SQL function conventions, the database's geospatial capabilities are aligned with industry standards, offering a more familiar and powerful API. The new system integrates seamlessly with the query optimizer, automatically utilizing indexes for spatial predicates, and replaces older, non-standard functions for a cleaner and more consistent experience.

Highlights

  • Native Geospatial Indexing: Introduced a new GEOSPATIAL index type utilizing the native LSM-Tree engine for efficient storage and retrieval of spatial data.
  • Standardized SQL Functions: Replaced existing non-standard geospatial SQL functions with a comprehensive set of OGC/PostGIS ST_* functions for constructors, accessors, and spatial predicates.
  • Automatic Query Optimization: Integrated geospatial indexing with the query optimizer, allowing ST_* spatial predicate functions on indexed fields to automatically leverage the new geospatial index without explicit calls.
  • GeoHash-based Tokenization: Implemented geometry tokenization into GeoHash cells using Apache Lucene's lucene-spatial-extras library, enabling hierarchical spatial indexing and querying.
  • Configurable Precision: Added support for configurable precision in geospatial indexes, allowing users to define the resolution of GeoHash cells (defaulting to ~2.4m resolution).
  • Breaking Changes: Removed old non-standard geo functions (point(), distance(), circle(), polygon(), lineString(), rectangle()) in favor of their ST_* equivalents, while preserving Cypher point() and distance() via a bridge.


Changelog
  • .claude/settings.local.json
    • Added permissions for various bash commands.
  • ATTRIBUTIONS.md
    • Added lucene-spatial-extras to the list of third-party attributions.
  • docs/plans/2026-02-22-geospatial-design.md
    • Added a design document detailing the architecture and goals of the geospatial indexing feature.
  • docs/plans/2026-02-22-geospatial-implementation.md
    • Added an implementation plan outlining the step-by-step process for developing the geospatial indexing.
  • engine/pom.xml
    • Added lucene-spatial-extras as a new dependency.
  • engine/src/main/java/com/arcadedb/function/geo/CypherPointFunction.java
    • Added a new Cypher-specific point function.
  • engine/src/main/java/com/arcadedb/function/sql/DefaultSQLFunctionFactory.java
    • Removed old geospatial functions and registered new ST_* standard and predicate functions.
  • engine/src/main/java/com/arcadedb/function/sql/geo/GeoUtils.java
    • Extended with utility methods for parsing and converting geometries to and from WKT and JTS formats.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionDistance.java
    • Removed the old distance SQL function.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionLineString.java
    • Removed the old linestring SQL function.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionPolygon.java
    • Removed the old polygon SQL function.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionRectangle.java
    • Removed the old rectangle SQL function.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Area.java
    • Renamed SQLFunctionCircle.java to SQLFunctionST_Area.java and refactored to implement ST_Area.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_AsGeoJson.java
    • Added the ST_AsGeoJson function to convert geometries to GeoJSON strings.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_AsText.java
    • Added the ST_AsText function to convert geometries to WKT strings.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Buffer.java
    • Added the ST_Buffer function to create a buffer around a geometry.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Contains.java
    • Added the ST_Contains predicate function, which does not use indexed execution.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Crosses.java
    • Added the ST_Crosses predicate function, which does not use indexed execution.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_DWithin.java
    • Added the ST_DWithin predicate function, which does not use indexed execution.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Disjoint.java
    • Added the ST_Disjoint predicate function, which does not use indexed execution.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Distance.java
    • Added the ST_Distance function to calculate the Haversine distance between points.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Envelope.java
    • Added the ST_Envelope function to return the bounding box of a geometry.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Equals.java
    • Added the ST_Equals predicate function, which does not use indexed execution.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_GeomFromText.java
    • Renamed SQLFunctionPoint.java to SQLFunctionST_GeomFromText.java and refactored to implement ST_GeomFromText.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Intersects.java
    • Added the ST_Intersects predicate function with indexed execution support.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_LineString.java
    • Added the ST_LineString function to construct WKT linestrings.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Overlaps.java
    • Added the ST_Overlaps predicate function, which does not use indexed execution.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Point.java
    • Added the ST_Point function to construct WKT points.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Polygon.java
    • Added the ST_Polygon function to construct WKT polygons with auto-closing rings.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Predicate.java
    • Added an abstract base class for ST_* predicate functions, providing common logic for indexable SQL functions.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Touches.java
    • Added the ST_Touches predicate function, which does not use indexed execution.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Within.java
    • Added the ST_Within predicate function with indexed execution support.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_X.java
    • Added the ST_X function to extract the X coordinate from a point.
  • engine/src/main/java/com/arcadedb/function/sql/geo/SQLFunctionST_Y.java
    • Added the ST_Y function to extract the Y coordinate from a point.
  • engine/src/main/java/com/arcadedb/index/geospatial/LSMTreeGeoIndex.java
    • Added the core LSMTreeGeoIndex implementation for geospatial indexing.
  • engine/src/main/java/com/arcadedb/query/opencypher/executor/CypherFunctionFactory.java
    • Updated to register CypherPointFunction and bridge distance to SQLFunctionST_Distance.
  • engine/src/main/java/com/arcadedb/query/sql/parser/CreateIndexStatement.java
    • Updated to support the new GEOSPATIAL index type in SQL CREATE INDEX statements.
  • engine/src/main/java/com/arcadedb/schema/GeoIndexMetadata.java
    • Added metadata class for geospatial indexes, including precision configuration.
  • engine/src/main/java/com/arcadedb/schema/LocalSchema.java
    • Registered the GEOSPATIAL index factory and updated index loading logic.
  • engine/src/main/java/com/arcadedb/schema/Schema.java
    • Added GEOSPATIAL to the INDEX_TYPE enum.
  • engine/src/test/java/com/arcadedb/function/sql/geo/SQLGeoFunctionsTest.java
    • Updated existing tests and added new tests for ST_* functions, removing tests for old geo functions.
  • engine/src/test/java/com/arcadedb/index/geospatial/GeoIndexMetadataTest.java
    • Added tests for GeoIndexMetadata default and custom precision.
  • engine/src/test/java/com/arcadedb/index/geospatial/LSMTreeGeoIndexSchemaTest.java
    • Added tests for geospatial index creation via SQL and persistence across database reopens.
  • engine/src/test/java/com/arcadedb/index/geospatial/LSMTreeGeoIndexTest.java
    • Added tests for LSMTreeGeoIndex indexing and querying, including transaction replay scenarios.
  • engine/src/test/java/com/arcadedb/index/geospatial/SQLGeoIndexedQueryTest.java
    • Added integration tests for ST_* predicate functions with and without geospatial indexing.

@mergify
Contributor

mergify bot commented Feb 22, 2026

🧪 CI Insights

Here's what we observed from your CI run for 87a37d3.

🟢 All jobs passed!

But CI Insights is watching 👀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces comprehensive geospatial indexing support to ArcadeDB, leveraging the native LSM-Tree engine and OGC-compliant ST_* functions. This is a significant architectural improvement that replaces previous non-standard geo functions with a more robust, Lucene-backed tokenization strategy. The integration with the query optimizer via IndexableSQLFunction is well-designed, allowing for automatic index usage in spatial predicates. However, there are a few critical issues that need to be addressed: the index build process currently bypasses tokenization, which will result in unusable indexes after a rebuild; there are performance bottlenecks in spatial predicates due to eager record loading; and a breaking change in the default unit for ST_Distance (from kilometers to meters in SQL context) might affect existing users. Additionally, some input validation and optimization opportunities in the geohash lookup logic were identified.

Comment on lines 433 to 435
```java
public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
  return underlyingIndex.build(buildIndexBatchSize, callback);
}
```

critical

The build() method currently delegates directly to the underlying LSM-Tree index. This is a critical bug because underlyingIndex.build() uses its own put() method, which indexes the WKT string as a literal key instead of tokenizing it into geohash cells. Consequently, after a rebuild, the index will be corrupted and spatial queries will return no results. This method must be implemented locally to scan the bucket and call this.put() to ensure proper tokenization.

return List.of();

// Query each per-bucket geo index and collect the results
final List<Record> results = new ArrayList<>();

high

The searchFromTarget implementation eagerly loads all documents from disk into a List<Record>. This is a significant performance bottleneck and memory risk for queries returning many candidates. It should instead return a lazy Iterable that loads records only as they are iterated. Additionally, id.getRecord() can return null if a record was deleted but the index is not yet updated; the implementation should filter out these nulls to avoid downstream issues.

Comment on lines +234 to +246
```java
final CellIterator cellIter = grid.getTreeCellIterator(searchShape, detailLevel);
final LinkedHashSet<RID> seen = new LinkedHashSet<>();
while (cellIter.hasNext()) {
  final Cell cell = cellIter.next();
  if (cell.getShapeRel() == null)
    continue;
  final String token = cell.getTokenBytesNoLeaf(null).utf8ToString();
  if (token.isEmpty())
    continue;
  final IndexCursor cursor = underlyingIndex.get(new Object[]{token});
  while (cursor.hasNext())
    seen.add(cursor.next().getIdentity());
}
```

medium

The geohash lookup logic currently collects all matching RIDs from all covering cells into a LinkedHashSet before applying the limit. For large search shapes or high-density areas, this can lead to excessive memory usage and slow performance. It is recommended to check the limit inside the cell iteration loop and stop early once enough unique RIDs have been collected.
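The suggested early exit could look roughly like the following sketch, where fake nested lists stand in for the real `Cell` iterator and `IndexCursor` (hypothetical code, not the actual LSMTreeGeoIndex).

```java
import java.util.*;

// Sketch of the suggested fix: stop scanning covering cells as soon as the
// requested number of unique RIDs has been collected. Fake data structures
// stand in for the real Cell iterator and IndexCursor.
public class EarlyExitSketch {
  public static Set<String> collect(List<List<String>> ridsPerCell, int limit) {
    final LinkedHashSet<String> seen = new LinkedHashSet<>();
    for (List<String> cellRids : ridsPerCell) {
      for (String rid : cellRids) {
        seen.add(rid);
        if (limit > -1 && seen.size() >= limit)
          return seen; // early exit: enough unique candidates collected
      }
    }
    return seen;
  }
}
```

Note the check must run on the set size, not a raw counter, because the same RID can appear under several covering cells.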

Comment on lines +202 to +213
```java
private static String extractFieldName(final Expression expr) {
  if (expr == null)
    return null;
  // toString() on a plain identifier expression yields the field name
  final String text = expr.toString();
  if (text == null || text.isBlank())
    return null;
  // Reject if this looks like a function call or contains operators
  if (text.contains("(") || text.contains(" ") || text.contains("."))
    return null;
  return text;
}
```

medium

The extractFieldName method is too restrictive as it rejects any identifier containing a space. In ArcadeDB, property names can contain spaces if they are escaped with backticks (e.g., `my property`). This restriction prevents the query optimizer from using geospatial indexes on such properties. A more robust check against the AST expression type (e.g., checking if it is a simple identifier) would be preferable.

final double[] p1 = extractPointCoords(iParams[0]);
final double[] p2 = extractPointCoords(iParams[1]);

final String unit = (iParams.length >= 3 && iParams[2] != null) ? iParams[2].toString() : "m";

medium

The new ST_Distance implementation always defaults to meters ('m'). This is a breaking change for SQL users, as the previous distance() function defaulted to kilometers ('km') in a SQL context (while using meters for Cypher). To maintain backward compatibility and follow the PR description's intent, the default unit should be determined based on the argument count or the query context.

return null;

@SuppressWarnings("unchecked")
final List<Object> points = (List<Object>) iParams[0];

medium

The code performs an unchecked cast of the first parameter to List<Object>. If the user provides a different type (e.g., a single point or a string), this will throw a ClassCastException. It is safer to check the type using instanceof List before casting.

    if (!(iParams[0] instanceof List))
      throw new IllegalArgumentException("ST_LineString() requires a list of points as the first parameter");

    final List<Object> points = (List<Object>) iParams[0];

if (iParams == null || iParams.length < 1 || iParams[0] == null)
return null;

@SuppressWarnings("unchecked")

medium

Similar to ST_LineString, this unchecked cast to List<Object> can lead to a ClassCastException if the input is not a list. A type check should be added.

Suggested change:

```diff
-    @SuppressWarnings("unchecked")
+    if (!(iParams[0] instanceof List))
+      throw new IllegalArgumentException("ST_Polygon() requires a list of points as the first parameter");
     final List<Object> points = (List<Object>) iParams[0];
```

@claude

claude bot commented Feb 22, 2026

Code Review: Geospatial Indexing (#3510)

This is a substantial and well-structured feature addition. The overall design - mirroring LSMTreeFullTextIndex with GeoHash tokenization via Lucene's RecursivePrefixTreeStrategy - is sound and integrates neatly into the existing architecture. Below are issues ranging from blockers to minor suggestions.


Blockers

1. .claude/settings.local.json must not be committed

This file (.claude/settings.local.json) is a local Claude Code AI assistant configuration specific to the developer's machine. It has no place in the repository. It should be removed from this PR and added to .gitignore. Committing it exposes internal AI toolchain configuration and will cause noise for all other contributors.

2. docs/plans/ AI planning artifacts should be removed

docs/plans/2026-02-22-geospatial-design.md (237 lines) and docs/plans/2026-02-22-geospatial-implementation.md (1,456 lines) appear to be AI-generated implementation plan documents. These are development process artifacts, not user-facing documentation. They add ~1,700 lines of noise to the repository. If design rationale is worth preserving, it belongs in a condensed form in docs/ or as an ADR, not as verbatim AI session outputs.


Correctness Issues

3. ST_DWithin uses degrees, not real-world units - inconsistent with ST_Distance

SpatialContext.calcDistance() returns degrees in geographic (non-Euclidean) mode. ST_Distance, by contrast, uses Haversine and returns metres/km/mi/nmi based on a unit parameter. This inconsistency will surprise users who pass a value in metres to ST_DWithin expecting the same semantics as ST_Distance. The docstring says 'Distance in degrees (Spatial4j coordinate system)' but the function signature gives no hint of this to SQL users.

At minimum, the unit must be clearly communicated at the SQL level (e.g., a unit parameter or explicit naming like ST_DWithinDegrees). Ideally, accept the same unit parameter as ST_Distance and convert.

4. ST_DWithin measures center-to-center distance, not geometry distance

Using geom1.getCenter() for polygon or linestring inputs violates OGC semantics. ST_DWithin(g1, g2, d) should return true if any point of g1 is within distance d of any point of g2. For a polygon g1, the center can be far from an edge that is actually within the threshold. This produces silently wrong results for non-point geometries.

5. looksLikeGeoHashToken fallback in put() is architecturally fragile

This heuristic exists to handle WAL/commit replay, where already-tokenized geohash strings are passed back through put(). Relying on character-pattern matching to distinguish raw WKT from internal index tokens is fragile and error-prone. Consider:

  • A WKT string like 'bc' (a valid 2-char abbreviation in some systems) would be treated as a geohash token.
  • The check s.length() > precision means tokens longer than the configured precision are silently dropped during replay.

A cleaner design would wrap the internal token differently (e.g., a GeoHashToken value object) or expose a dedicated internal putToken() method that bypasses WKT parsing, rather than inferring intent from string content.
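The proposed separation could be as simple as a dedicated replay entry point that bypasses WKT parsing entirely. A hypothetical sketch (`put`/`putToken` names and the stub tokenizer are illustrative, not the actual API):

```java
import java.util.*;

// Sketch of the proposed design: a dedicated replay path that stores geohash
// tokens verbatim, instead of inferring intent from string content.
// Hypothetical API; the tokenizer is stubbed for brevity.
public class TokenDispatchSketch {
  final List<String> storedTokens = new ArrayList<>();

  /** Normal path: parse WKT and tokenize into geohash cells (stubbed here). */
  public void put(String wkt) {
    // real code would run Lucene's prefix-tree tokenizer over the parsed shape
    storedTokens.add("token-of:" + wkt);
  }

  /** Replay path: the key is already a geohash token; store it verbatim. */
  public void putToken(String geoHashToken) {
    storedTokens.add(geoHashToken);
  }
}
```

With two explicit entry points, WAL/commit replay calls `putToken` and no character-pattern heuristic is needed.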

6. Memory: get() loads all matching RIDs into a LinkedHashSet before streaming

For queries covering large areas (low-precision geohash cells covering many documents), this materialises the entire candidate set in memory. The existing LSMTreeFullTextIndex has the same pattern, but geospatial queries can realistically return millions of candidates for coarse shapes. Consider deferring materialisation or at least implementing an early-exit once limit is reached (currently the limit is only applied after full collection).


Design / API Issues

7. Precision is hardcoded to DEFAULT_PRECISION in the file-ID constructor

This constructor is used during database open (before LocalSchema.readConfiguration() wraps the index with the correct precision). The wrap in readConfiguration() fixes it, but between construction and wrapping, any code that touches the index object from the component registry uses the wrong precision. Worth adding a comment explaining this lifecycle dependency, or restructuring so the wrap happens at construction time.

8. ST_Area returns square degrees with no documentation of units

ST_Area returns area via Spatial4j in square degrees (geographic CRS). This is non-intuitive - most users would expect square metres or at least an explicit unit parameter. The getSyntax() return value should document the unit, and ideally an optional unit parameter should be added for consistency with ST_Distance.

9. ST_DWithin docstring contradicts OGC standard

The Javadoc says 'returns true if geometry g is within the given distance of shape' (correct OGC definition) but the implementation measures center-to-center distance (see issue #4 above). Align implementation with spec or explicitly document the deviation.


Testing Issues

10. LSMTreeGeoIndexTest uses reflection to bypass the public API

This couples the test to private implementation details (LocalSchema.indexMap field name). If the field is renamed or the map structure changes, the test breaks silently at runtime rather than compile time. Use the public database.command("sql", "CREATE INDEX ON ...") API instead, which is what LSMTreeGeoIndexSchemaTest and SQLGeoIndexedQueryTest already do correctly.

11. No regression test for precision persistence across database reopen

GeoIndexMetadata and LocalSchema.readConfiguration() include logic to persist and restore the precision value. There is no test that verifies: create index with precision=7, close database, reopen, query - and confirm the precision was restored. LSMTreeGeoIndexSchemaTest tests schema serialisation, but not the full reopen lifecycle.

12. No test for the remove() / update path

There are tests for put() and get(), but none for remove(Object[], Identifiable). Removing a document from a geospatial index involves re-tokenizing the WKT and removing each token - a distinct code path from insertion that deserves its own test.


Minor Issues

13. GeoUtils.parseEnvelopeWkt coordinate ordering comment needed

Spatial4j's ENVELOPE format is ENVELOPE(minX, maxX, maxY, minY) - i.e., (lon_min, lon_max, lat_max, lat_min). This is counter-intuitive. A comment explaining the ordering (or a reference to the Spatial4j spec) would prevent future bugs.
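A parser with the ordering spelled out in code might look like this hypothetical sketch (the field order follows Spatial4j's documented `ENVELOPE(minX, maxX, maxY, minY)` convention):

```java
// Sketch of parsing Spatial4j's ENVELOPE WKT extension. Note the
// counter-intuitive ordering: ENVELOPE(minX, maxX, maxY, minY),
// i.e. (lon_min, lon_max, lat_max, lat_min). Hypothetical illustration code.
public class EnvelopeParseSketch {
  public record Envelope(double minX, double maxX, double maxY, double minY) {}

  public static Envelope parse(String wkt) {
    final String body = wkt.substring(wkt.indexOf('(') + 1, wkt.lastIndexOf(')'));
    final String[] parts = body.split(",");
    if (parts.length != 4)
      throw new IllegalArgumentException("ENVELOPE requires 4 values: " + wkt);
    // Spatial4j order: minX, maxX, maxY, minY
    return new Envelope(Double.parseDouble(parts[0].trim()),
                        Double.parseDouble(parts[1].trim()),
                        Double.parseDouble(parts[2].trim()),
                        Double.parseDouble(parts[3].trim()));
  }
}
```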

14. SQLFunctionST_AsGeoJson uses manual GeoJSON serialisation

The implementation hand-rolls GeoJSON for Point, LineString, Polygon, Multi-geometries, and GeometryCollection. JTS has a GeoJsonWriter class in org.locationtech.jts.io.geojson that handles all geometry types correctly. Given JTS is already a transitive dependency via lucene-spatial-extras, using GeoJsonWriter would eliminate a large block of error-prone manual code and ensure edge cases (holes in polygons, mixed geometry collections) are handled correctly.

15. ST_Buffer returns WKT string, not a Shape

The function returns a String. If the result is passed to another spatial function (e.g., ST_Within(coords, ST_Buffer(geom, 1.0))), the outer function must re-parse it. All other constructor functions (ST_Point, ST_GeomFromText) also return WKT strings, so this is consistent - but it means every chained call re-parses. Internally returning a Shape object and stringifying only at serialisation would be more efficient.


Summary

The core indexing approach is solid. Most of the changes in LSMTreeGeoIndex, GeoIndexMetadata, LocalSchema, and SQLFunctionST_Predicate are well-structured. The main blockers are the committed developer files (.claude/settings.local.json, docs/plans/), the correctness issues with ST_DWithin (degrees vs. real units, center vs. geometry distance), and the fragile looksLikeGeoHashToken heuristic.

@lvca
Copy link
Contributor

lvca commented Feb 22, 2026

Should we rename all these new functions into the "geo" package?

ST_Equals() -> geo.equals()

I don't think anybody is converting whole queries from Postgres to ArcadeDB. WDYT? That way we keep things in order; right now we have hundreds of functions ;-)

@codacy-production

codacy-production bot commented Feb 22, 2026

Coverage summary from Codacy

See diff coverage on Codacy

| Coverage variation | Diff coverage |
|---|---|
| Report missing for 8f261ca¹ | 73.15% |

Coverage variation details

| Commit | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor (8f261ca) | Report Missing | Report Missing | Report Missing |
| Head (601f145) | 120110 | 75378 | 62.76% |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
| --- | --- | --- | --- |
| Pull request (#3510) | 756 | 553 | 73.15% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


Footnotes

  1. Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.

@codecov

codecov bot commented Feb 22, 2026

Codecov Report

❌ Patch coverage is 63.60202% with 289 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.53%. Comparing base (b1d5780) to head (87a37d3).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...com/arcadedb/index/geospatial/LSMTreeGeoIndex.java | 57.14% | 70 Missing and 17 partials ⚠️ |
| ...dedb/function/sql/geo/SQLFunctionGeoPredicate.java | 43.95% | 24 Missing and 27 partials ⚠️ |
| ...n/java/com/arcadedb/function/sql/geo/GeoUtils.java | 50.00% | 24 Missing and 7 partials ⚠️ |
| ...dedb/function/sql/geo/SQLFunctionGeoAsGeoJson.java | 48.27% | 26 Missing and 4 partials ⚠️ |
| ...va/com/arcadedb/query/sql/antlr/SQLASTBuilder.java | 47.91% | 19 Missing and 6 partials ⚠️ |
| ...cadedb/function/sql/geo/SQLFunctionGeoPolygon.java | 72.97% | 4 Missing and 6 partials ⚠️ |
| ...adedb/function/sql/geo/SQLFunctionGeoDistance.java | 76.66% | 4 Missing and 3 partials ⚠️ |
| ...com/arcadedb/function/geo/CypherPointFunction.java | 44.44% | 3 Missing and 2 partials ⚠️ |
| ...rcadedb/function/sql/geo/SQLFunctionGeoAsText.java | 66.66% | 2 Missing and 2 partials ⚠️ |
| .../arcadedb/function/sql/geo/SQLFunctionGeoArea.java | 70.00% | 1 Missing and 2 partials ⚠️ |

... and 21 more
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3510      +/-   ##
==========================================
- Coverage   65.76%   65.53%   -0.23%
==========================================
  Files        1471     1492      +21
  Lines      100150   100811     +661
  Branches    20874    20968      +94
==========================================
+ Hits        65860    66069     +209
- Misses      25305    25660     +355
- Partials     8985     9082      +97
```

☔ View full report in Codecov by Sentry.

skofra0 pushed a commit to skofra0/arcadedb that referenced this pull request Feb 23, 2026
Bumps [pg](https://github.com/brianc/node-postgres/tree/HEAD/packages/pg) from 8.16.3 to 8.17.1.
@claude

claude bot commented Feb 23, 2026

Geospatial Indexing Review

This is a substantial and well-structured feature. The architecture cleanly mirrors LSMTreeFullTextIndex, the IndexableSQLFunction integration is correctly wired, and the test coverage is thorough. Below are issues ranging from blockers to minor suggestions.


Blockers

1. .claude/settings.local.json must not be committed

.claude/settings.local.json is a local Claude Code settings file (similar to .idea/ or .vscode/ local files). It should be added to .gitignore and removed from this PR. It contains tool permissions that are specific to the developer's AI tool session and have no meaning in the repository.

2. Design/implementation plan docs contain tool-internal instructions

docs/plans/2026-02-22-geospatial-implementation.md contains a line that reads: "For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task."

These are AI tool directives that have no place in a committed design doc. If design documentation should be kept, the internal instructions need to be stripped before committing.


Correctness Issues

3. PR description says ST_*, code uses geo.* — pick one

The PR title and every function table in the description reference ST_* naming (ST_Within, ST_GeomFromText, etc.). The actual implementation uses geo.* (geo.within, geo.geomFromText). This inconsistency will confuse users reading the changelog. The implementation choice is fine either way, but the PR description must match the code.

4. geo.dWithin uses center-to-center distance, not geometry distance

In SQLFunctionGeoDWithin.evaluate():

final double actualDistance = GeoUtils.getSpatialContext()
    .calcDistance(geom1.getCenter(), geom2.getCenter());
return actualDistance <= distance;

For two points this works. For a POLYGON and a POINT, this computes the distance from the polygon's centroid to the point, not the minimum distance from the polygon boundary to the point. A point may be very close to the boundary but far from the centroid, producing wrong results. The correct approach for mixed geometry types is JTS geometry1.distance(geometry2). Additionally, the distance parameter is in degrees (Spatial4j units) — this needs to be clearly documented, as users will expect meters or km by analogy with geo.distance.
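A minimal JTS sketch of the difference (helper names are illustrative, not ArcadeDB code):

```java
import org.locationtech.jts.io.WKTReader;

public final class DWithinSketch {
  // Minimum distance between the geometries, in coordinate units (degrees here)
  public static double minDistance(String wkt1, String wkt2) throws Exception {
    final WKTReader reader = new WKTReader();
    return reader.read(wkt1).distance(reader.read(wkt2));
  }

  // Centre-to-centre distance, analogous to what the current implementation computes
  public static double centroidDistance(String wkt1, String wkt2) throws Exception {
    final WKTReader reader = new WKTReader();
    return reader.read(wkt1).getCentroid().distance(reader.read(wkt2).getCentroid());
  }

  public static void main(String[] args) throws Exception {
    final String square = "POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))";
    final String point  = "POINT (11 5)";
    System.out.println(minDistance(square, point));      // 1.0 (to the boundary)
    System.out.println(centroidDistance(square, point)); // 6.0 (to the centroid at (5 5))
  }
}
```

With a threshold of, say, 2.0, the boundary distance matches while the centroid distance does not, which is exactly the wrong-result case described above.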

5. geo.equals may not correctly determine geometric equality

In SQLFunctionGeoEquals.evaluate():

return jts1.norm().equals(jts2.norm());

JTS Geometry defines two equality overloads: equals(Object) has equalsExact semantics, while equals(Geometry) performs topological comparison. Which one binds depends on the static type of the argument, so the intent of .equals() here is ambiguous and easy to break. Prefer the explicit jts1.norm().equalsExact(jts2.norm()) or jts1.equalsTopo(jts2); as written, the code risks returning false for two geometrically identical shapes that are not structurally identical after normalization.
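A hedged sketch of the explicit comparison (illustrative helper, not the actual SQLFunctionGeoEquals code):

```java
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.WKTReader;

public final class GeomEqualsSketch {
  // equalsTopo compares the geometries as point sets; vertex order
  // and ring start point are irrelevant
  public static boolean topoEquals(String wkt1, String wkt2) throws Exception {
    final WKTReader reader = new WKTReader();
    final Geometry g1 = reader.read(wkt1);
    final Geometry g2 = reader.read(wkt2);
    return g1.equalsTopo(g2);
  }
}
```

For example, the same square written with rings starting at different vertices compares equal under equalsTopo.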

6. looksLikeGeoHashToken() heuristic is fragile and needs verification

The transaction-replay scenario this detects should be verified against the actual TransactionIndexContext code path. If LSM-Tree replay calls put() with original WKT values (not pre-tokenized geohash strings), the heuristic is unnecessary. If it does call with pre-tokenized strings, a WKT string that happens to be short and composed only of geohash-alphabet characters (e.g., "bcde") would be misclassified. The intended code path should be documented with a reference to the specific transaction machinery that triggers this.


Performance Concerns

7. searchFromTarget() materializes all candidates into memory

In SQLFunctionGeoPredicate.searchFromTarget():

final List<Record> results = new ArrayList<>();
...
while (cursor.hasNext())
    results.add(id.getRecord());
return results;

For a large geospatial index, the GeoHash intersection can return a very large candidate set before the post-filter runs. Loading all matching records into a List before post-filtering could cause significant heap pressure. Consider returning a lazy Iterable<Record> backed by the cursors so the post-filter can process records as they stream in.
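One possible shape for such a lazy wrapper, sketched with generic stand-ins for IndexCursor and Record (none of these names are ArcadeDB's API):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.function.Predicate;

// Maps and filters elements as the consumer iterates, so no intermediate
// List of all candidates is ever materialised in memory.
public final class LazyFilteringIterable<S, T> implements Iterable<T> {
  private final Iterable<S> source;       // e.g. RIDs from the index cursor
  private final Function<S, T> loader;    // e.g. RID -> Record load
  private final Predicate<T> postFilter;  // e.g. the exact spatial predicate

  public LazyFilteringIterable(Iterable<S> source, Function<S, T> loader, Predicate<T> postFilter) {
    this.source = source;
    this.loader = loader;
    this.postFilter = postFilter;
  }

  @Override
  public Iterator<T> iterator() {
    final Iterator<S> it = source.iterator();
    return new Iterator<T>() {
      private T next;

      @Override
      public boolean hasNext() {
        // Advance the source until a candidate passes the post-filter
        while (next == null && it.hasNext()) {
          final T candidate = loader.apply(it.next());
          if (postFilter.test(candidate))
            next = candidate;
        }
        return next != null;
      }

      @Override
      public T next() {
        if (!hasNext())
          throw new NoSuchElementException();
        final T result = next;
        next = null;
        return result;
      }
    };
  }
}
```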

8. WKTReader/WKTWriter allocated on every function call

GeoUtils.parseJtsGeometry(), buildPolygonFromRect(), and jtsToWKT() all allocate new WKTReader()/new WKTWriter() on every invocation. Since these are not thread-safe, a ThreadLocal<WKTReader> would eliminate repeated allocation overhead in bulk-query/insert scenarios.
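A minimal sketch of the ThreadLocal pattern (class name is illustrative):

```java
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

public final class WktParsers {
  // One reader per thread: WKTReader is not thread-safe, but it is
  // reusable, so this avoids a fresh allocation per call
  private static final ThreadLocal<WKTReader> WKT_READER =
      ThreadLocal.withInitial(WKTReader::new);

  public static Geometry parse(final String wkt) throws ParseException {
    return WKT_READER.get().read(wkt);
  }
}
```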


Minor Issues

9. No precision validation in GeoIndexMetadata.setPrecision()

The javadoc says range 1–12, but the setter accepts any integer silently. GeohashPrefixTree will fail at runtime. Add:

if (precision < 1 || precision > 12)
    throw new IllegalArgumentException("Precision must be 1–12, got: " + precision);

10. getAssociatedBucketId() fallback needs a better comment

return bucketId >= 0 ? bucketId : underlyingIndex.getFileId();

Returning a file ID where a bucket ID is expected is a category mismatch. The comment should explain which code path produces bucketId == -1, why the file ID is safe here, and whether it could collide with a real bucket ID.

11. geo.buffer distance unit is silently in degrees

geometry.buffer(distance) operates in degrees (WGS-84 coordinate space). A user calling geo.buffer('POINT (12 41)', 10) expecting a 10-km buffer gets a ~1,100-km buffer. The syntax string gives no hint. Either add unit conversion (like geo.distance does), add an explicit unit parameter, or add a clear warning in the javadoc and an exception message.

12. geo.dWithin distance unit is inconsistent with geo.distance

geo.distance accepts a unit string (m, km, mi, nmi). geo.dWithin takes raw degrees with no conversion. This inconsistency is surprising for users of the function family.

13. Reflection in LSMTreeGeoIndexTest to access private indexMap is fragile

final Field indexMapField = LocalSchema.class.getDeclaredField("indexMap");
indexMapField.setAccessible(true);

This breaks if the field is renamed. The createAndRegisterIndex() test helper should use DDL instead (CREATE INDEX ON ...), as LSMTreeGeoIndexSchemaTest already demonstrates it works end-to-end.

14. JavaCC parser incompatibility not surfaced as an actionable error

The comment in SQLASTBuilder notes that geo.* dotted function syntax only works through the ANTLR parser; the JavaCC parser requires backtick syntax. If a user writes geo.within(...) in a JavaCC-parsed context, they get a confusing parse error. This should produce a clear diagnostic message.

15. estimate() returning -1 — verify optimizer handling

Confirm that SelectExecutionPlanner treats -1 from IndexableSQLFunction.estimate() as "unknown/not-estimated" rather than as an enormous cardinality (-1 reinterpreted as an unsigned 64-bit value is 2^64 - 1, the largest possible value). If -1 causes the optimizer to always prefer a full scan, the geo index will never be used for cost-based planning.


Positive Notes

  • Wrapping LSMTreeIndex with GeoHash tokenisation is clean and correctly inherits ACID/WAL/HA/compaction.
  • Correctly disabling indexed execution for geo.disjoint, geo.contains, geo.equals, geo.crosses, geo.overlaps, and geo.touches — each with a clear, accurate explanation — is thorough and correct.
  • GeoIndexMetadata correctly mirrors FullTextIndexMetadata for schema persistence.
  • ATTRIBUTIONS.md is correctly updated for lucene-spatial-extras.
  • The CONTAINS keyword addition to the SQL grammar identifier list is a correct fix for the parser conflict.
  • Test coverage across index unit tests, schema persistence tests, and end-to-end SQL query tests is solid.

@robfrank robfrank changed the title Lsmtree geospatial feat(geo): Geospatial indexing with geo.* SQL functions (LSM-Tree native storage) Feb 23, 2026
@claude

claude bot commented Feb 23, 2026

Code Review — feat(geo): Geospatial indexing with geo.* SQL functions

Overall this is a well-structured, ambitious feature that follows existing project patterns closely (mirroring LSMTreeFullTextIndex). The design doc and test coverage are thorough. Below are my findings, grouped by severity.


Critical

1. Fragile transaction-replay detection in LSMTreeGeoIndex.put()

looksLikeGeoHashToken() uses a heuristic (short, lowercase-alphanumeric string matching the GeoHash alphabet) to distinguish pre-tokenized GeoHash strings (written by TransactionIndexContext on commit replay) from original WKT values. This is fragile:

  • A user-stored value like 'bc', 'ef7', or 'u0mh' would silently pass straight to the underlying index instead of being rejected as invalid WKT.
  • The heuristic hard-codes character exclusions (a, i, l, o) from the GeoHash alphabet — an implementation detail of Lucene's GeohashPrefixTree. Any change in the tokenizer could break detection silently.

A cleaner approach: expose a package-private putToken() method that receives already-tokenized strings, so the main put() path never needs to guess. The TransactionIndexContext replay would call putToken() directly, eliminating the heuristic entirely.
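A rough sketch of that split, with placeholder names and a stubbed tokenizer standing in for the Lucene GeoHash machinery (nothing here is ArcadeDB's actual API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GeoIndexSketch {
  private final Map<String, Set<Object>> underlying = new HashMap<>();

  // Normal write path: always treats the key as WKT and tokenizes it
  public void put(final String wkt, final Object rid) {
    for (final String token : tokenize(wkt))
      putToken(token, rid);
  }

  // Replay path: the transaction machinery would call this directly with
  // pre-tokenized keys, so put() never needs to guess key types
  void putToken(final String token, final Object rid) {
    underlying.computeIfAbsent(token, k -> new HashSet<>()).add(rid);
  }

  Set<Object> get(final String token) {
    return underlying.getOrDefault(token, Set.of());
  }

  // Stand-in tokenizer; the real index derives GeoHash cells via Lucene
  private List<String> tokenize(final String wkt) {
    return List.of("u0mh1", "u0mh4"); // placeholder cells
  }
}
```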

2. GeoIndexMetadata.toJSON() / fromJSON() asymmetry

toJSON() only serialises 'precision' — it does NOT call super.toJSON(), so typeName and propertyNames are never written. fromJSON() works around this with a conditional 'if (metadata.has("typeName")) super.fromJSON(metadata)'. On a fresh write the parent fields are absent; on reload the condition is false and they cannot be reconstructed. Fix by always calling super.toJSON() / super.fromJSON(), matching the FullTextIndexMetadata pattern.


Important

3. Hard breaking changes with no deprecation path

The PR removes point(), distance(), circle(), polygon(), lineString(), and rectangle() entirely. Any existing database with queries or scripts using these functions will break at upgrade with no migration path. Consider keeping the old names as thin aliases that delegate to their geo.* equivalents and emit a deprecation log message, or document the upgrade migration explicitly in release notes.

4. allowsIndexedExecution() ignores the comparison operator

The method receives a BinaryCompareOperator but never inspects it. If someone writes WHERE geo.within(coords, shape) != true, the optimizer may still route through the index and produce wrong results. Return false when the operator is not an equality operator.

5. JavaCC parser limitation creates a two-tier user experience

The ANTLR4 parser supports geo.point(x, y); the JavaCC parser requires backtick-quoted geo.point(x, y). The project uses both parsers, so users hitting the JavaCC path will see confusing parse errors with no actionable guidance. User-visible error messages should hint at the backtick alternative.

6. CellIterator null shapeRel filtering may miss cells

In get():
if (cell.getShapeRel() == null)
continue;

shapeRel is null for unresolved interior nodes, not for definitively disjoint cells. Skipping them may cause false negatives at fine precision levels. Consider skipping only DISJOINT cells rather than null-rel cells.

7. docs/plans/ implementation planning files committed to the repo

The four docs/plans/ markdown files are internal step-by-step implementation checklists (they include git commit instructions). These do not belong in the production repository. A single architecture document summarising the final design would be more appropriate.


Minor

8. Test database name collision

geoManualIndexPoints and geoManualIndexBoundingBoxes both pass "GeoDatabase" to TestHelper.executeInNewDatabase(). Parallel test runs will collide. Use unique names per test method.

9. geo.distance default unit change is a silent breaking change

The old distance() function used different default units. geo.distance now defaults to meters. Any query relying on the numeric output of the old function will produce values 1000x different without warning. Document this explicitly in the breaking-changes section.

10. .claude/settings.local.json in .gitignore

Fine to add, but it should be grouped with other IDE entries (.idea/) with a brief comment, or placed in a global gitignore instead of the project-level file.

11. extractFieldName() silently rejects dotted field paths

The check 'text.contains(".")' prevents geo.within(record.nested.coords, shape) from using the index. Acceptable for v1, but a comment documenting this limitation would help future maintainers.

12. estimate() always returns -1

The optimizer always treats the geo index as having unknown cardinality. Even a rough estimate based on precision level and search shape area would improve query plan selection when combining a geo filter with other indexed conditions.


Positive observations

  • Clean wrapper pattern following LSMTreeFullTextIndex — ACID, WAL, HA, and compaction are inherited for free.
  • The shouldExecuteAfterSearch = true post-filter architecture correctly handles the GeoHash superset property.
  • Correctly opting out of indexed execution for geo.disjoint, geo.dWithin, and geo.contains with clear rationale.
  • The ANTLR4 visitIdentifierChain rewriting for geo.* namespaces avoids grammar changes.
  • Test coverage across metadata, index, schema DDL, SQL functions, and end-to-end indexed queries is comprehensive (79 tests).
  • Attribution for lucene-spatial-extras properly added to ATTRIBUTIONS.md.

@robfrank robfrank added the invalid This doesn't seem right label Feb 23, 2026
@claude

claude bot commented Feb 23, 2026

Code Review: PR #3510 — Geospatial Indexing

This is a substantial, well-structured feature PR. The design (GeoHash-tokenised LSM-Tree wrapping the existing LSMTreeFullTextIndex pattern) is sound. Below are the issues I found, grouped by severity.


Critical / Potential Bugs

1. Breaking change with no migration path

point(), distance(), circle(), polygon(), lineString(), and rectangle() are deleted outright. Any existing database query using these functions will fail immediately after upgrading. Consider keeping the old registrations in DefaultSQLFunctionFactory as deprecated stubs that delegate to the new geo.* equivalents or throw a clear CommandExecutionException pointing users to the replacement. A changelog note alone is insufficient for a runtime breakage.

2. searchFromTarget() eagerly loads all candidates into List<Record> — double-load + OOM risk

In SQLFunctionGeoPredicate.searchFromTarget():

results.add(id.getRecord());  // loads full document from disk

Because shouldExecuteAfterSearch() returns true, the query executor re-evaluates the predicate on every returned record — loading each candidate from disk a second time. For a broad spatial query (e.g. entire continent), this could load millions of records twice and exhaust heap. The full-text implementation returns an IndexCursor for lazy iteration; the same approach should be applied here. Return RIDs and let the executor load records on demand.

3. Loading-time constructor hardcodes DEFAULT_PRECISION

public LSMTreeGeoIndex(final DatabaseInternal database, final String name, final String filePath,
    final int fileId, final ComponentFile.MODE mode, final int pageSize, final int version) {
  this.precision = DEFAULT_PRECISION;  // precision persisted in JSON is ignored here

If this constructor is ever invoked during index loading (e.g. by a future refactoring or external caller), the persisted precision will be silently lost and incorrect GeoHash cells will be used for querying. Either add a precision parameter to this constructor, or make it private/package-private with a clear warning.

4. No range validation in GeoIndexMetadata.setPrecision()

Valid GeoHash precision is 1–12. Values outside this range will silently produce a broken index. Add validation:

if (precision < 1 || precision > 12)
  throw new IllegalArgumentException("GeoHash precision must be 1-12, got: " + precision);

Design / Correctness Issues

5. looksLikeGeoHashToken() heuristic is fragile

The transaction-commit-replay detection in put() relies on a character-set heuristic. A WKT value like "bc" or "u0n" would be a valid WKT parse failure and match the GeoHash alphabet, causing it to be inserted as a raw token rather than tokenised WKT. The root cause is that the LSM-Tree replay path passes raw tokens back to the wrapper. Consider storing a sentinel prefix on GeoHash tokens (e.g. a \0 byte) to distinguish them unambiguously from WKT strings, eliminating the need for the heuristic entirely.
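A minimal sketch of the sentinel idea (names are hypothetical; the real index would encode and decode at its storage boundary):

```java
public final class GeoTokenCodec {
  // '\0' can never appear in a WKT string, so a prefixed key is
  // unambiguously a GeoHash token, with no heuristic needed
  private static final char SENTINEL = '\0';

  public static String encode(final String geoHashCell) {
    return SENTINEL + geoHashCell;
  }

  public static boolean isToken(final String key) {
    return !key.isEmpty() && key.charAt(0) == SENTINEL;
  }

  public static String decode(final String token) {
    if (!isToken(token))
      throw new IllegalArgumentException("Not a geohash token");
    return token.substring(1);
  }
}
```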

6. extractFieldName() rejects dotted property paths silently

if (text.contains("."))
  return null;

This means geo.within(nested.coords, ...) can never use the index. The function silently falls back to a full scan with no warning, which could mislead users who expect the index to be used. Add a log warning when indexed execution is declined due to a non-simple field reference.

7. Distance and area units are undocumented at the usage level

  • geo.buffer(geom, dist): dist is in degrees (JTS internal unit) but users will naturally assume meters or km.
  • geo.dWithin(g, shape, distanceDegrees): the function name says nothing about degrees.
  • geo.area(geom): returns square degrees.

These need explicit unit documentation in getSyntax() and ideally a clear error for obviously invalid inputs (e.g. negative distance).

8. Plan/design documents committed to the repository tree

docs/plans/2026-02-22-geospatial-design.md, docs/plans/2026-02-22-geospatial-implementation.md, and the two rename plan files are implementation artefacts. They make the source tree harder to navigate and will not age well. They should not be committed to the repository.


Performance

9. LinkedHashSet<RID> in get() materialises all candidates before applying limit

For a query covering a large area at high precision, thousands of GeoHash cells may be returned, each with many RIDs. The full LinkedHashSet is materialised before any post-filtering occurs. Consider applying the limit parameter earlier (stop cell iteration once limit unique RIDs are collected) and use a plain HashSet if insertion order is not required.
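A simplified sketch of limit-aware collection, with String standing in for RID and nested lists standing in for per-cell cursor results:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class LimitedCollector {
  // Stop iterating cells as soon as `limit` unique candidates are
  // collected (-1 means no limit), instead of materialising every match
  public static Set<String> collect(final List<List<String>> cellResults, final int limit) {
    final Set<String> rids = new HashSet<>();
    for (final List<String> cell : cellResults) {
      for (final String rid : cell) {
        rids.add(rid);
        if (limit > -1 && rids.size() >= limit)
          return rids; // early termination: remaining cells are never scanned
      }
    }
    return rids;
  }
}
```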


Tests

10. Test uses reflection to access indexMap

final Field indexMapField = LocalSchema.class.getDeclaredField("indexMap");
indexMapField.setAccessible(true);

This tightly couples the test to an internal field name. If LocalSchema is refactored, the test will fail with a NoSuchFieldException rather than a meaningful assertion. The DDL SQL path (as LSMTreeGeoIndexSchemaTest already does correctly) should be used instead.

11. No regression test for the CONTAINS grammar change

Adding CONTAINS as a valid identifier token in SQLParser.g4 could silently affect existing SQL queries. At minimum, add a test that a property named contains can be created and queried without parse errors.


Cleanup

12. .claude/settings.local.json added to project .gitignore

This is a tool-specific local file and should be in the global ~/.gitignore_global, not in the project's .gitignore.


Minor

  • LSMTreeGeoIndex imports com.arcadedb.database.Document and java.util.concurrent.atomic.AtomicLong, neither of which appears to be used.
  • TypeIndexBuilder.metadata is set as a public field directly in LSMTreeGeoIndexSchemaTest (builder.metadata = geoMeta). A builder.withMetadata() setter would be cleaner and more stable against future refactoring.
  • geo.geomFromText(wkt) returns a Spatial4j Shape object. If a user stores the result directly into a document property, the stored value will be a Shape.toString() representation, not a WKT string. This behaviour should be documented.

Overall the architecture is well thought-out and the test coverage is good. Issue #2 (searchFromTarget eager loading) is the most important to fix before merge — it can silently cause out-of-memory failures on production datasets with broad spatial queries.

@robfrank robfrank removed the invalid This doesn't seem right label Feb 24, 2026
@claude

claude bot commented Feb 24, 2026

Code Review — Geospatial Indexing (PR #3510)

This is a substantial and well-structured feature addition. The design closely mirrors LSMTreeFullTextIndex, the documentation is thorough, and the test coverage is broad. Below are the issues I found, organized by severity.


Critical / Correctness Bugs

1. LSMTreeGeoIndex.toJSON() calls getBucketById() on the raw underlying bucket ID

The toJSON() method calls underlyingIndex.getAssociatedBucketId() directly, bypassing the wrapper's own getAssociatedBucketId() override which provides a fallback to fileId when the bucket ID is -1. If the underlying bucket ID is -1 (no associated bucket), getBucketById(-1) will throw. Change toJSON() to call this.getAssociatedBucketId() instead.

2. looksLikeGeoHashToken() is a fragile heuristic for WAL replay detection

The method tries to distinguish already-tokenized GeoHash strings from raw WKT strings during WAL replay by checking that all characters are in the GeoHash alphabet (0-9, b-z excluding i, l, o) and length <= precision. A short WKT value like "bc" or "sp3e" satisfies these conditions and would be stored verbatim rather than tokenized, silently corrupting the index. The heuristic's correctness depends on WKT strings never matching the GeoHash alphabet + length constraint, which is not guaranteed.

A more robust approach: store tokens with a distinguishing prefix (e.g., "ghx:" + token) so raw WKT and tokenized forms are unambiguously distinguishable. Or review how LSMTreeFullTextIndex handles the analogous WAL replay path to see if there is an established pattern to reuse.

3. remove() re-derives tokens from WKT at deletion time

remove() calls extractTokens(keys[0]) to re-derive GeoHash tokens from the WKT at deletion time. If the WKT passed at removal differs from the WKT used at insertion (e.g., whitespace normalization, coordinate ordering, or Spatial4j round-trip differences), the derived tokens will differ and old tokens remain as orphaned entries, causing false positives in future queries. Verify that the engine always passes the original stored WKT on removal, or document this as a precondition.


High Priority

4. No validation of the precision parameter range

GeohashPrefixTree only supports precision levels 1–12. Neither GeoIndexMetadata.setPrecision() nor GeoIndexFactoryHandler.create() validates this. Providing precision 0 or 99 will cause an unhelpful exception deep in Lucene. Add an explicit guard with a clear error message in both places.

5. geo.buffer() distance is in degrees, not meters

GeoUtils.parseJtsGeometry() returns a JTS Geometry in the WGS84 coordinate system (degrees), and Geometry.buffer(distance) interprets that distance in the same coordinate units — degrees, not meters. A user calling geo.buffer(point, 1000) expecting 1000 m will get a buffer of 1000 degrees, wrapping the globe many times. Document this constraint clearly in getSyntax(), or add a unit conversion if the intent is meters. This is also inconsistent with geo.distance() which defaults to meters.
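For reference, a rough degrees-per-meter approximation on WGS-84 (illustration only; reprojecting to a metric CRS before buffering is the robust fix):

```java
public final class DegreeApprox {
  // One degree of latitude is roughly 111,320 m everywhere;
  // a degree of longitude shrinks with cos(latitude)
  private static final double METERS_PER_DEGREE_LAT = 111_320.0;

  public static double metersToDegreesLat(final double meters) {
    return meters / METERS_PER_DEGREE_LAT;
  }

  public static double metersToDegreesLon(final double meters, final double latitudeDeg) {
    return meters / (METERS_PER_DEGREE_LAT * Math.cos(Math.toRadians(latitudeDeg)));
  }
}
```

Under this approximation, a user's intended 1 km buffer near the equator is about 0.009 degrees, which makes the magnitude of the degrees/meters confusion obvious.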

6. docs/plans/ AI planning artifacts committed to the repository

The four new markdown files under docs/plans/ are AI-generated implementation plans (design doc, implementation plan, rename design, rename plan). These are internal development artifacts, not user-facing documentation. They inflate the repository history and expose process details that belong in a developer wiki or the PR description itself. Please remove them.


Medium Priority

7. GeoUtils.toWKT(Shape) may produce Spatial4j's non-standard ENVELOPE(...) format

Spatial4j uses ENVELOPE(minX, maxX, maxY, minY) for rectangles — this is not valid ISO/OGC WKT. Any downstream code (JTS, external clients) that receives this from geo.asText() or geo.envelope() and parses it as standard WKT will fail. The private parseEnvelopeWkt() already handles the reverse, so the gap is on output only. Consider always converting Rectangle to a POLYGON WKT before returning.
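A small sketch of the suggested output conversion, assuming Spatial4j's ENVELOPE(minX, maxX, maxY, minY) argument order:

```java
public final class EnvelopeToPolygon {
  // Rewrite a rectangle into standard OGC WKT: a closed counter-clockwise ring
  public static String convert(final double minX, final double maxX,
                               final double maxY, final double minY) {
    return String.format("POLYGON ((%s %s, %s %s, %s %s, %s %s, %s %s))",
        fmt(minX), fmt(minY), fmt(maxX), fmt(minY),
        fmt(maxX), fmt(maxY), fmt(minX), fmt(maxY),
        fmt(minX), fmt(minY)); // ring closed on the starting vertex
  }

  // Drop the trailing ".0" on whole-number coordinates for readability
  private static String fmt(final double v) {
    return v == Math.rint(v) ? String.valueOf((long) v) : String.valueOf(v);
  }
}
```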

8. All candidates are collected in memory in get() before the limit is applied

For coarse precision levels or spatially large queries, a single GeoHash cell can match a large fraction of the dataset. All matching RIDs are buffered into a LinkedHashSet before the limit is applied. Check whether the limit can be threaded through underlyingIndex.get() for early termination, or at minimum accumulate into the set only up to limit entries.
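The second option can be as small as a bounded accumulator; a stdlib sketch with strings standing in for RIDs (names hypothetical, not the actual patch):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BoundedCollect {
  // Accumulate candidate RIDs from per-cell result lists, stopping as soon
  // as `limit` distinct entries have been collected instead of buffering
  // every match first. A limit of -1 means "no limit".
  static Set<String> collect(final List<List<String>> perCellResults, final int limit) {
    final Set<String> out = new LinkedHashSet<>();
    for (final List<String> cell : perCellResults) {
      for (final String rid : cell) {
        out.add(rid);
        if (limit > -1 && out.size() >= limit)
          return out;
      }
    }
    return out;
  }
}
```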

9. RecursivePrefixTreeStrategy shared instance — verify thread safety

strategy is a single instance shared across all concurrent put() and get() calls. RecursivePrefixTreeStrategy.createIndexableFields() creates TokenStreams internally; verify from the Lucene spatial docs that this is safe for concurrent invocation. If not, the strategy should be created per-call or per-thread.

10. Test method names misleading for non-indexed predicates

stCrossesWithIndex, stOverlapsWithIndex, and stTouchesWithIndex in SQLGeoIndexedQueryTest test predicates whose allowsIndexedExecution() returns false — they always fall back to a full scan. The names imply index usage. Rename to stCrossesFallbackWithExistingIndex etc., consistent with stDWithinFallbackWithExistingIndex.

11. GeoIndexMetadata.fromJSON() conditionally skips super.fromJSON()

super.fromJSON(metadata) is only called when "typeName" is present in the JSON. Verify that the schema JSON always includes "typeName" at this deserialization call site; if not, base-class fields (bucket ID, property names) will not be restored correctly on database reopen.


Minor / Style

12. extractFieldName() silently rejects dotted field paths — add a comment

The check text.contains(".") returns null (no index) for dotted field paths like location.coords. This is intentional but looks like a bug without a comment explaining why.

13. The strategy field could be final

strategy is assigned exactly once in every constructor and never reassigned. Mark it final for clarity.

14. Hard removal of old geo functions — consider a deprecation cycle

Removing point(), distance(), circle(), polygon(), lineString(), and rectangle() in a single PR is a hard break for existing user queries. Consider keeping old names as deprecated aliases for one release.

15. Cell.getShapeRel() == null skip condition needs a comment

The continue when cell.getShapeRel() == null silently skips cells with an unresolved spatial relationship. A brief comment explaining when this can occur and why it is safe to skip would help future maintainers.

@claude

claude bot commented Feb 24, 2026

Code Review

This is a substantial and well-structured addition. The overall architecture (thin wrapper over LSMTreeIndex, mirroring LSMTreeFullTextIndex) is sound and the geo.* namespace is consistent with the existing vector.* convention. Below are findings grouped by severity.


Critical / Correctness Issues

1. geo.dWithin uses center-to-center distance — semantically wrong for non-point geometries

SQLFunctionGeoDWithin.evaluate() computes SpatialContext.calcDistance(geom1.getCenter(), geom2.getCenter()). For polygons or lines this computes the distance between centroids, not the minimum separation between the geometries, which is what ST_DWithin means everywhere else in the GIS world. A polygon containing a point returns non-zero distance even though the geometries overlap. For the initial version either restrict the function to point inputs and document that, or implement the correct minimum-distance check via JTS Geometry.distance().
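The distinction is easy to see with a small planar sketch (pure stdlib geometry, no JTS): the quantity ST_DWithin-style predicates are defined over is the minimum point-to-geometry distance, shown here for a segment:

```java
// Illustrative only: minimum distance from point (px,py) to the
// segment (x1,y1)-(x2,y2).
class MinDistanceSketch {
  static double pointToSegment(double px, double py,
                               double x1, double y1, double x2, double y2) {
    final double dx = x2 - x1, dy = y2 - y1;
    final double len2 = dx * dx + dy * dy;
    // Projection parameter onto the infinite line, clamped to the segment.
    double t = len2 == 0 ? 0 : ((px - x1) * dx + (py - y1) * dy) / len2;
    t = Math.max(0, Math.min(1, t));
    return Math.hypot(px - (x1 + t * dx), py - (y1 + t * dy));
  }
}
```

For the point (0, 3) against the segment (0 0, 10 0) the minimum distance is 3, while the distance to the segment's centroid (5, 0) is about 5.83, so a centroid-based dWithin with threshold 4 would wrongly report the point as outside. JTS Geometry.distance() computes the minimum-distance version for arbitrary geometry pairs.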

2. Fragile GeoHash token detection in LSMTreeGeoIndex.put()

The looksLikeGeoHashToken() heuristic is used to distinguish "commit replay" (pre-tokenised GeoHash strings coming back through put()) from raw WKT. This is a leaky abstraction:

  • The check s.length() > precision is wrong: tokens shorter than precision characters are valid (parent cells in the prefix tree are always shorter than the leaf level).
  • GeoHash tokens at different precision levels genuinely look like lowercase alphanumeric strings, so any short ASCII string that happens to fail WKT parsing will be passed through silently.

A cleaner design: introduce a typed key wrapper (e.g. a GeoHashToken record) in the transaction log rather than trying to reverse-engineer the type from the string value. Alternatively, adopt the same approach as LSMTreeFullTextIndex and verify exactly how it handles transaction replay — if it stores the tokens directly in the WAL there should be no need for this heuristic at all.

3. geo.buffer distance is in degrees, not meters

JTS Geometry.buffer() works in the coordinate system's native units. For WGS-84 (degrees) that is degrees, not meters. The syntax geo.buffer(geom, 1000) will expand by 1000 degrees — effectively the whole planet. Either convert from meters at the function boundary or clearly document (and validate) that the unit is degrees.


Design / Architecture Issues

4. searchFromTarget() loads all candidate records eagerly into a List<Record>

Both SQLFunctionGeoPredicate.searchFromTarget() implementations (the base-class version and the per-predicate helper) collect results into an ArrayList<Record> before returning. For large geospatial datasets (millions of points in a bounding box) this causes high GC pressure, contradicting the project's performance guidelines. LSMTreeFullTextIndex returns a lazy IndexCursor — the same pattern should be followed here. Return an Iterable<Record> backed by a cursor that reads from the index on demand.
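A stdlib sketch of that shape, lazily chaining per-bucket sources instead of pre-filling a list (names are hypothetical; the real version would wrap IndexCursors and yield Records):

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustrative only: flatten several sources lazily. Nothing is buffered;
// each next() pulls exactly one element from the current source.
class LazyUnion implements Iterable<String> {
  private final List<? extends Iterable<String>> sources;

  LazyUnion(final List<? extends Iterable<String>> sources) {
    this.sources = sources;
  }

  @Override
  public Iterator<String> iterator() {
    final Iterator<? extends Iterable<String>> outer = sources.iterator();
    return new Iterator<>() {
      private Iterator<String> inner = Collections.emptyIterator();

      @Override
      public boolean hasNext() {
        // Advance to the next non-empty source on demand.
        while (!inner.hasNext() && outer.hasNext())
          inner = outer.next().iterator();
        return inner.hasNext();
      }

      @Override
      public String next() {
        if (!hasNext())
          throw new NoSuchElementException();
        return inner.next();
      }
    };
  }
}
```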

5. extractFieldName() is a text-based heuristic

if (!text.contains("(") && !text.contains(" "))
    return text;

This will fail silently for:

  • Backtick-quoted identifiers: WHERE geo.within(`my field`, ...)
  • Dotted field access: WHERE geo.within(location.coords, ...)
  • Any identifier that AST serialises with surrounding whitespace

Use the AST node type directly (BaseExpression → BaseIdentifier → LevelZeroIdentifier → Identifier) rather than round-tripping through toString().

6. Precision not validated in GeoIndexMetadata.setPrecision()

Valid GeoHash precision is 1–12. Any integer is accepted now. A precision of 0 or 13+ will be silently passed to GeohashPrefixTree, which may throw at runtime or behave unexpectedly. Add a simple range check.

7. getAssociatedBucketId() fallback is a hack

return bucketId >= 0 ? bucketId : underlyingIndex.getFileId();

Returning the index file ID as a substitute for a bucket ID when there is no associated bucket conflates two unrelated identifiers. If the locking machinery misbehaves without a bucket ID, the fix should be in the locking machinery or in a null-safe check there, not by returning a misleading value here.


Minor Issues

8. Design/plan documents committed to the repo

docs/plans/2026-02-22-geospatial-design.md, docs/plans/2026-02-22-geospatial-implementation.md, docs/plans/2026-02-23-geo-rename-*.md are AI-generated implementation notes, not user-facing documentation. They clutter the repository and expose internal development tooling. These should be removed before merging (or moved to a private wiki / PR description).

9. Test uses reflection to access LocalSchema.indexMap

LSMTreeGeoIndexTest.createAndRegisterIndex() uses Field.setAccessible(true) to insert directly into the schema's index map. This makes the test brittle against internal refactoring and signals that the API doesn't expose a clean enough registration path for testing. Consider exposing a package-visible helper or test utility, or restructure the test to create the index via SQL DDL as the other schema tests do.

10. Breaking changes need a migration note in the changelog / docs

The removal of point(), distance(), circle(), polygon(), lineString(), rectangle() is a hard breaking change for anyone on the existing release. The PR description mentions this briefly but there is no migration guide in the docs and no deprecation period. At minimum, add a note to CHANGELOG.md and to the migration section of the docs.

11. Missing test for multi-bucket (partitioned) types

All integration tests use single-bucket document types. The searchFromTarget() loop iterates getIndexesOnBuckets() but there is no test verifying that results from multiple buckets are correctly unioned. Adding one test with a type that has a non-default bucket count would close this gap.

12. Duplicate shouldExecuteAfterSearch / allowsIndexedExecution logic

The base class SQLFunctionGeoPredicate implements allowsIndexedExecution, canExecuteInline, shouldExecuteAfterSearch, and searchFromTarget. Subclasses that disable indexing (e.g. GeoContains, GeoDWithin) override only allowsIndexedExecution. But the base-class searchFromTarget is still callable and would return results if invoked, even when allowsIndexedExecution returns false. This should be safe in practice because the planner won't call searchFromTarget when allowsIndexedExecution is false, but it is a subtle invariant worth documenting or enforcing (e.g. searchFromTarget in the base class could assert allowsIndexedExecution).


Positive Aspects

  • The delegation pattern in LSMTreeGeoIndex is clean and consistent with LSMTreeFullTextIndex.
  • Correct decision to opt out of indexed execution for disjoint, contains, crosses, overlaps, touches, and dWithin — the comments explaining why are good.
  • Three-valued null logic (return null when either argument is null) is correct.
  • ATTRIBUTIONS.md updated, license check passed.
  • Test coverage is comprehensive for the happy paths (79 tests).
  • The ANTLR visitor rewrite for geo.* namespace is clearly explained and well-isolated.

@codacy-production

codacy-production bot commented Feb 24, 2026

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: -10.24%    Diff coverage: 75.44%

Coverage variation details

                                   Coverable lines   Covered lines    Coverage
Common ancestor commit (b1d5780)   100150            74874            74.76%
Head commit (87a37d3)              131322 (+31172)   84732 (+9858)    64.52% (-10.24%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
                       Coverable lines   Covered lines   Diff coverage
Pull request (#3510)   794               599             75.44%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


@claude

claude bot commented Feb 24, 2026

Code Review: feat(geo) — Geospatial indexing with geo.* SQL functions

Overall this is a well-structured feature that follows existing project patterns (notably LSMTreeFullTextIndex). The ACID/WAL/HA reuse via LSMTree is the right call, and the query-optimizer integration via IndexableSQLFunction.searchFromTarget() is clean. Below are issues worth addressing before merge.


Breaking changes without deprecation path

The PR removes point(), distance(), circle(), polygon(), lineString(), and rectangle() functions outright. These are public SQL APIs. Any user code calling them will fail with no migration path. Either:

  • Keep the old names registered as deprecated aliases delegating to the geo.* variants, or
  • Provide a clear migration guide in the release notes with a deprecation warning in the old functions before the next major version removes them.

The Cypher functions are preserved — the same respect should apply to SQL.


looksLikeGeoHashToken() heuristic is fragile

// LSMTreeGeoIndex.java
private boolean looksLikeGeoHashToken(String s) { ... }

This method distinguishes "real user data" from "GeoHash tokens written during put()" by pattern-matching the Base-32 alphabet. The problem: a user could legitimately store a 5-character alphanumeric string (e.g. a product code like "b3c9f") that passes this heuristic, causing the WAL replay path to skip re-tokenization and store the raw string instead of the correct GeoHash cells.

A more robust approach: use a dedicated prefix or sentinel in stored tokens (e.g. \x00 + token, or a type marker in the LSMTree value byte), rather than inferring intent from the key characters.
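A minimal version of the sentinel idea (hypothetical helper, not the actual patch): no valid WKT string or ordinary user value starts with a NUL character, so a one-character prefix makes replay detection exact instead of heuristic.

```java
// Illustrative only: tag GeoHash cell tokens before handing them to the
// underlying LSM-Tree, so transaction replay can recognise them without
// guessing from the Base-32 alphabet.
class GeoTokenMarker {
  private static final char SENTINEL = '\0';

  static String mark(final String geohashToken) {
    return SENTINEL + geohashToken;
  }

  static boolean isMarked(final String key) {
    return !key.isEmpty() && key.charAt(0) == SENTINEL;
  }

  static String unmark(final String key) {
    return isMarked(key) ? key.substring(1) : key;
  }
}
```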


get() collects the full result set in memory

For a query covering a large region at high precision (up to 12 levels = millions of GeoHash cells), or a type with a large number of records per cell, this materialises the entire candidate set into a heap-allocated LinkedHashSet. This is a GC pressure hotspot and could OOM on production datasets.

Consider a lazy/streaming cursor that queries cells on demand, similar to how the full-text index chains individual cursors.


build() bypasses LSMTree batch path

// LSMTreeGeoIndex.java
@Override
public void build(...) {
    // scans bucket and calls this.put() per record
}

Calling this.put() per record forces a WKT parse + GeoHash tokenization + N individual LSMTree put() calls per document. For an initial load of millions of documents this will be measurably slower. Consider batching the GeoHash tokens before handing them to underlyingIndex.


GeoIndexMetadata.fromJSON() — suspicious conditional

@Override
public void fromJSON(JSONObject metadata) {
    if (metadata.has("typeName"))
        super.fromJSON(metadata);        // why conditional?
    this.precision = metadata.getInt("precision", DEFAULT_PRECISION);
}

IndexMetadata.fromJSON() reads typeName, propertyNames, bucketId, etc. Skipping it silently when typeName is absent means those fields stay at their constructor-default values. This will cause a broken metadata object. The condition should either be removed (always call super) or, if there's a legitimate case for absent typeName, throw or log an error rather than silently proceeding.


getAssociatedBucketId() workaround needs documentation

// LSMTreeGeoIndex.java
@Override
public int getAssociatedBucketId() {
    return bucketId >= 0 ? bucketId : underlyingIndex.getFileId();
}

When is bucketId legitimately -1 for a GEOSPATIAL index? If this is a type-level index (no specific bucket), returning the file ID is semantically different from returning a bucket ID. Document the invariant, or fix the upstream caller that expects -1 to mean "no bucket".


Test uses reflection to access private schema internals

// LSMTreeGeoIndexTest.java
final Field indexMapField = LocalSchema.class.getDeclaredField("indexMap");
indexMapField.setAccessible(true);
final Map<String, Index> indexMap = (Map<String, Index>) indexMapField.get(schema);
indexMap.put(idx.getName(), idx);

Using reflection to bypass encapsulation in a test signals that there is no public (or package-visible) API for programmatic index registration outside of DDL. The comment should explain why this is needed. Alternatively, expose a package-private registration method if this pattern is used in multiple test classes.


docs/plans/ directory should not be in the main source tree

The PR adds six dated design/planning documents under docs/plans/. These are development-time planning artifacts. They add noise to the repository, won't be maintained once the feature lands, and are not user-facing documentation. Remove them before merge, or convert the relevant content into proper user documentation under docs/src/.


JavaCC parser inconsistency is not documented for users

The PR description notes that the JavaCC SQL parser requires backtick syntax for geo.* functions. This means two syntaxes exist for the same function depending on parser context, with no indication to users of when each applies. Add a code comment and user documentation describing when backtick escaping is required.


CONTAINS added as valid identifier in ANTLR grammar

Adding CONTAINS as a valid identifier means it can no longer be reliably used as a reserved SQL keyword. If ArcadeDB ever wants to add a SQL CONTAINS operator (e.g. WHERE tags CONTAINS 'geo'), this grammar entry will create a conflict. Document why CONTAINS specifically needed to be added, or scope it more narrowly.


Minor observations

  1. .gitignore: Adding .claude/settings.local.json is fine but consider adding the whole .claude/ directory rather than a specific file.

  2. Test type naming: SQLGeoIndexedQueryTest creates types named Location, Location2, etc. per test method. If tests share a database instance, names may collide. Use @BeforeEach/@AfterEach with DROP TYPE cleanup or generate unique names.

  3. extractTokens() resource cleanup: Verify that the Lucene TokenStream from RecursivePrefixTreeStrategy.createIndexableFields() is properly closed after token extraction to avoid resource leaks.

  4. dWithin and contains future work: Consider throwing an explicit message rather than silently falling back to full scan, so users know index support is not yet available.

  5. SingletonSet.java: If this is a new class, Java 21 already provides Set.of(element). Verify it adds something that Set.of() does not.


Summary

Category                Issues
Must fix before merge   Breaking API removal without deprecation, fragile WAL-replay heuristic
Should address          Memory pressure in get(), fromJSON conditional, docs/plans removal, getAssociatedBucketId documentation
Nice to have            JavaCC docs, test isolation, TokenStream cleanup, CONTAINS grammar note

The core implementation is solid and the test coverage is comprehensive. Addressing the breaking-change strategy and the looksLikeGeoHashToken reliability issue are the most important items before this lands.

robfrank and others added 8 commits February 24, 2026 22:06
Design for porting OrientDB geospatial indexing to ArcadeDB using
LSM-Tree as storage backend (following the LSMTreeFullTextIndex pattern)
and lucene-spatial-extras for GeoHash decomposition. Covers ST_*
SQL functions, IndexableSQLFunction integration for automatic query
optimizer usage, and WKT as geometry storage format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Apply the limit ceiling to results returned from the deduplication map,
matching the behavior of LSMTreeFullTextIndex.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- toJSON(): add missing "bucket" field to prevent schema reload failure
- extractTokens(): consolidate duplicated token-stream loop from indexShape()
  so that IOException warning is emitted consistently in both put and remove paths
- get(): replace LinkedHashMap<RID,Integer> with LinkedHashSet<RID> for clarity
- getType(): remove internal task reference from exception message

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add GEOSPATIAL to Schema.INDEX_TYPE enum
- Register GeoIndexFactoryHandler in LocalSchema
- Fix LSMTreeGeoIndex.getType() to return Schema.INDEX_TYPE.GEOSPATIAL
- Add GEOSPATIAL case to LocalSchema readConfiguration() for schema reload
- Add GEOSPATIAL to CreateIndexStatement validate() and executeDDL() mapping
- Add LSMTreeGeoIndexSchemaTest to verify DDL creation via SQL

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
robfrank and others added 28 commits February 24, 2026 22:06
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update all predicate subclass extends clauses to reference the new base class name.
Rename 12 geo function classes (GeomFromText, Point, LineString, Polygon,
Buffer, Envelope, Distance, Area, AsText, AsGeoJson, X, Y) from ST_* prefix
to geo.* namespace, updating class names, NAME constants, getSyntax() strings,
Javadoc references, factory imports and registrations accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename the 9 spatial predicate function classes from SQLFunctionST_*
to SQLFunctionGeo* and update their NAME constants from ST_Xxx to
geo.xxx, matching the geo.* naming convention established in Task 2.
Update DefaultSQLFunctionFactory imports and register calls accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ter() calls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace all ST_* function calls in SQL string literals in
  SQLGeoFunctionsTest and SQLGeoIndexedQueryTest with the new
  geo.* namespace-qualified names (e.g. ST_GeomFromText → geo.geomFromText,
  ST_Within → geo.within, ST_Contains → geo.contains, etc.).

- Extend the SQL grammar (SQLParser.g4) to support namespace-qualified
  function calls: identifier DOT identifier LPAREN ... (e.g. geo.point(x,y)).
  functionCall now accepts an optional 'namespace DOT' prefix before the
  function name.

- Add CONTAINS to the identifier rule so that geo.contains(...) parses
  correctly (CONTAINS was a reserved keyword not usable as an identifier).

- Update SQLASTBuilder.visitFunctionCall to combine the two identifiers
  into a single dot-separated function name when the qualified form is used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…) syntax

Revert the grammar ordering change from commit d0aff24 that placed
functionCallExpr before identifierChain in baseExpression. That ordering
caused field.method() patterns like decimal.format('%.1f') and
name.toLowerCase() to be incorrectly parsed as namespace-qualified function
calls.

The fix uses Option A:
- Restore identifierChain before functionCallExpr in baseExpression so that
  field.method() patterns parse correctly as field access + method call.
- Revert functionCall to accept a single identifier (no DOT namespace prefix),
  matching its original form.
- Add FUNCTION_NAMESPACES detection in visitIdentifierChain: when the
  identifierChain matches exactly one base identifier that is a known function
  namespace (currently "geo") followed by exactly one methodCall, the visitor
  rewrites the node as a namespace-qualified FunctionCall (e.g. "geo.point",
  "geo.within") instead of a field access with a method modifier.

This preserves full support for:
  geo.within(coords, geo.point(1, 2))  — namespace-qualified function call
  decimal.format('%.1f')               — field with method call
  name.toLowerCase()                   — field with method call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… JavaCC limitation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tokenization

- Remove .claude/settings.local.json from tracking and add to .gitignore
- Strip internal AI tool directive from geospatial implementation plan doc
- Fix LSMTreeGeoIndex.build() to scan bucket and call this.put() via the
  indexer, instead of delegating to underlyingIndex.build() which bypassed
  GeoHash tokenization and stored raw WKT keys in the index

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…error paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lid WKT test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rror paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…entFunctionsTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ror paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…classes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ying comments

- Restore point(), distance(), linestring(), polygon() as aliases for the
  geo.* equivalents via getAlias() on each function class; the factory
  template auto-registers both names from a single instance
- Re-add SQLFunctionCircle and SQLFunctionRectangle as @deprecated wrappers
  preserving the original Spatial4j-based behaviour (no direct geo.* equivalent
  exists); GeoUtils.parseGeometry() already handles Shape objects so they
  remain compatible with all new geo.* predicates
- Add TODO in LSMTreeGeoIndex.get() documenting the LinkedHashSet<RID> v1
  memory trade-off and what a streaming cursor alternative would look like
- Add comment in LSMTreeGeoIndexTest explaining why reflection is required
  to inject the index into the schema's indexMap
- Document the CONTAINS-as-identifier grammar trade-off in SQLParser.g4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude

claude bot commented Feb 24, 2026

Code Review: Geospatial Indexing (PR #3510)

This is a substantial feature addition. The overall architecture is sound — mirroring LSMTreeFullTextIndex keeps it consistent with existing patterns, and inheriting ACID/WAL/compaction from the LSM backend is the right call. Below is a detailed review.


Breaking Changes

The PR removes SQLFunctionPoint, SQLFunctionDistance, SQLFunctionPolygon, and SQLFunctionLineString entirely, while keeping SQLFunctionCircle and SQLFunctionRectangle as @Deprecated. This is inconsistent. All six removed functions should either be kept as @Deprecated wrappers for at least one release cycle, or removed together with a clear migration path. Silently deleting point(), distance(), polygon(), and lineString() while keeping the other two will confuse upgrading users.


Bugs / Correctness Issues

1. Scientific notation in WKT generation (GeoUtils.buildPolygonFromRect)

Using %s for double values in String.format delegates to Double.toString(), which produces scientific notation for magnitudes at or above 1e7 and below 1e-3 (e.g., 1.0E10). The JTS WKTReader does not accept scientific notation — this would cause a parse failure. The already-present GeoUtils.formatCoord() should be used instead of %s. The same issue exists in parseEnvelopeWkt.
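The failure mode is easy to reproduce, and a BigDecimal-based formatter (one possible shape for formatCoord(); illustrative, not the actual code) keeps the output in plain decimal:

```java
// Double.toString(1.0E10) yields "1.0E10"; BigDecimal.toPlainString()
// never switches to scientific notation.
class CoordFormat {
  static String coord(final double d) {
    return java.math.BigDecimal.valueOf(d).toPlainString();
  }
}
```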

2. Inconsistent distance units between geo.distance and geo.dWithin

geo.distance returns metres/km/mi via Haversine. geo.dWithin works with Spatial4j native units (degrees of great-circle arc by default). A user calling geo.dWithin(coords, point, 1000) expecting metres will get wrong results. This needs either explicit documentation or a unit parameter added to geo.dWithin.

3. Fragile looksLikeGeoHashToken heuristic

The transaction-replay detection relies on character-set checks to distinguish WKT strings from pre-tokenised GeoHash strings. A short string that passes the alphabet check (e.g., "e" or "b32") could be misidentified as a GeoHash token and passed through to the underlying index unchanged. A safer design would have TransactionIndexContext tag replayed entries explicitly, rather than relying on a character-set heuristic.


Performance Concerns

4. searchFromTarget materialises all candidates in memory

SQLFunctionGeoPredicate.searchFromTarget eagerly loads all candidate records into a List<Record> before returning. For large datasets and wide search areas this could be a significant memory hit. The existing LSMTreeFullTextIndex path returns an IndexCursor iterator — the same streaming approach should be used here. The TODO comment in LSMTreeGeoIndex.get() acknowledges this; it should be resolved before production use.

5. GeoUtils.formatCoord allocates two String objects per coordinate

The method calls String.format then substring. On the hot path for every polygon vertex during index writes and query construction, this adds GC pressure. Double.toString() already strips trailing zeros for values in the normal coordinate range.

6. WKTReader created per call

GeoUtils.parseJtsGeometry, buildPolygonFromRect, and parseEnvelopeWkt each instantiate new WKTReader(). A ThreadLocal<WKTReader> would eliminate repeated allocation on the hot path.
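The pattern, shown here with java.text.DecimalFormat as a stdlib stand-in for the non-thread-safe parser (illustrative only; the real change would hold a ThreadLocal<WKTReader> in GeoUtils):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// One instance per thread, created lazily on first use: avoids both the
// per-call allocation and sharing a non-thread-safe object across threads.
class ThreadLocalParser {
  private static final ThreadLocal<DecimalFormat> FORMAT =
      ThreadLocal.withInitial(() -> new DecimalFormat("0.######",
          DecimalFormatSymbols.getInstance(Locale.ROOT)));

  static String format(final double d) {
    return FORMAT.get().format(d);
  }
}
```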


Design Concerns

7. Plan documents in docs/plans/ should not be committed

The PR adds four docs/plans/*.md files that are implementation-tracking artifacts, not user-facing documentation. Remove them before merging.

8. CONTAINS grammar change risks future SQL compatibility

Adding CONTAINS as a valid identifier in the ANTLR grammar (to support geo.contains(...)) blocks its future use as an infix operator. ArcadeDB SQL is OrientDB-compatible, and OrientDB SQL uses CONTAINS as a collection operator. The inline comment acknowledges this — worth filing as a follow-up issue.

9. Missing range validation in GeoIndexMetadata.setPrecision

The valid range is documented as 1–12 but no validation is performed. Values outside this range would be silently stored and cause GeohashPrefixTree to throw at query time.

10. geo.buffer does not validate the distance argument type

GeoUtils.getDoubleValue casts to Number directly. If a non-numeric argument is passed, this throws ClassCastException instead of a meaningful SQL error.
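A defensive version is short (hypothetical shape; the real method is GeoUtils.getDoubleValue()):

```java
// Illustrative only: convert a SQL function argument to double with a
// meaningful error instead of a bare ClassCastException.
class NumericArg {
  static double toDouble(final Object value, final String argName) {
    if (value instanceof Number)
      return ((Number) value).doubleValue();
    throw new IllegalArgumentException("Argument '" + argName
        + "' must be numeric, got "
        + (value == null ? "null" : value.getClass().getSimpleName()));
  }
}
```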


Minor / Code Style

  • SQLGeoIndexedQueryTest creates Location, Location2, Location3, etc. TestHelper provides per-test database isolation, so counter suffixes are unnecessary.
  • LSMTreeGeoIndex.DEFAULT_PRECISION just delegates to GeoIndexMetadata.DEFAULT_PRECISION — superfluous.
  • FunctionReferenceGeneratorTest: widening throws IOException to throws Exception is a style regression.

What Works Well

  • Architecture closely mirrors LSMTreeFullTextIndex — consistent with codebase patterns
  • ACID/WAL/HA inheritance is correct by design
  • shouldExecuteAfterSearch = true post-filter for the GeoHash superset is correct
  • Thoughtful opt-out of indexed execution for DE-9IM predicates (touches, overlaps, crosses, equals, disjoint)
  • Cypher point() and distance() preserved via CypherFunctionFactory — good backward compatibility for Cypher users
  • Test coverage is broad, test names are clear, and attribution/licensing are handled correctly
  • Precision metadata survives database reopen round-trips

Summary

The core implementation is solid but there are correctness bugs (scientific notation in WKT generation, distance unit mismatch between geo.distance and geo.dWithin), a real memory concern (searchFromTarget full materialisation), and plan docs that should be removed. The hard removal of legacy geo functions should follow the same deprecation approach used for circle(). Addressing items 1–4 is recommended before merge.


Development

Successfully merging this pull request may close these issues.

feat: Geospatial indexing with geo.* SQL functions (LSM-Tree native storage)

2 participants