feat(geo): Geospatial indexing with geo.* SQL functions (LSM-Tree native storage)#3510
Conversation
Summary of Changes — Hello @robfrank, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a robust and standardized geospatial indexing system to ArcadeDB. It leverages a native LSM-Tree implementation and GeoHash tokenization to efficiently store and query spatial data, and adopts OGC/PostGIS-compatible function naming to align the spatial API with industry standards.
🧪 CI Insights: here's what we observed from your CI run for 87a37d3. 🟢 All jobs passed!
Code Review
The pull request introduces comprehensive geospatial indexing support to ArcadeDB, leveraging the native LSM-Tree engine and OGC-compliant ST_* functions. This is a significant architectural improvement that replaces previous non-standard geo functions with a more robust, Lucene-backed tokenization strategy. The integration with the query optimizer via IndexableSQLFunction is well-designed, allowing for automatic index usage in spatial predicates. However, there are a few critical issues that need to be addressed: the index build process currently bypasses tokenization, which will result in unusable indexes after a rebuild; there are performance bottlenecks in spatial predicates due to eager record loading; and a breaking change in the default unit for ST_Distance (from kilometers to meters in SQL context) might affect existing users. Additionally, some input validation and optimization opportunities in the geohash lookup logic were identified.
```java
public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
  return underlyingIndex.build(buildIndexBatchSize, callback);
}
```
The build() method currently delegates directly to the underlying LSM-Tree index. This is a critical bug because underlyingIndex.build() uses its own put() method, which indexes the WKT string as a literal key instead of tokenizing it into geohash cells. Consequently, after a rebuild, the index will be corrupted and spatial queries will return no results. This method must be implemented locally to scan the bucket and call this.put() to ensure proper tokenization.
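The fix hinges on the same tokenization that put() performs: each WKT value must be decomposed into GeoHash cell strings before insertion, rather than stored as a literal key. As a self-contained illustration of what those cell keys look like, here is a minimal GeoHash encoder using the standard public-domain algorithm; this is a sketch, not ArcadeDB's actual tokenizer (which goes through Lucene's GeohashPrefixTree), and the class name is hypothetical.

```java
// Minimal GeoHash encoder sketch: interleaves longitude/latitude bits and maps
// each 5-bit group to the GeoHash base-32 alphabet. Every prefix of the result
// is a coarser covering cell of the same point, which is what makes prefix-tree
// indexing work.
public final class GeoHashSketch {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  public static String encode(final double lat, final double lon, final int precision) {
    double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
    final StringBuilder hash = new StringBuilder(precision);
    boolean evenBit = true; // longitude first, per the GeoHash spec
    int bit = 0, ch = 0;
    while (hash.length() < precision) {
      if (evenBit) {
        final double mid = (lonMin + lonMax) / 2;
        if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; } else { ch <<= 1; lonMax = mid; }
      } else {
        final double mid = (latMin + latMax) / 2;
        if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; } else { ch <<= 1; latMax = mid; }
      }
      evenBit = !evenBit;
      if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
    }
    return hash.toString();
  }

  public static void main(final String[] args) {
    System.out.println(encode(57.64911, 10.40744, 11)); // canonical example point
  }
}
```

A rebuild that calls this.put() would store tokens like these as index keys, so a later spatial lookup by covering cell finds the record again.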
```java
return List.of();

// Query each per-bucket geo index and collect the results
final List<Record> results = new ArrayList<>();
```
The searchFromTarget implementation eagerly loads all documents from disk into a List<Record>. This is a significant performance bottleneck and memory risk for queries returning many candidates. It should instead return a lazy Iterable that loads records only as they are iterated. Additionally, id.getRecord() can return null if a record was deleted but the index is not yet updated; the implementation should filter out these nulls to avoid downstream issues.
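The lazy-iteration idea can be sketched with a generic wrapper that resolves each identity to a record only when the caller asks for it, skipping entries whose record resolves to null. This is a standalone illustration under assumed generic types, not the ArcadeDB Record/RID API.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Lazy, null-filtering iterable: ids are resolved to records one at a time,
// and ids whose loader returns null (e.g. deleted records) are skipped.
public final class LazyRecords<I, R> implements Iterable<R> {
  private final Iterable<I> ids;
  private final Function<I, R> loader; // stand-in for id -> id.getRecord()

  public LazyRecords(final Iterable<I> ids, final Function<I, R> loader) {
    this.ids = ids;
    this.loader = loader;
  }

  @Override
  public Iterator<R> iterator() {
    final Iterator<I> it = ids.iterator();
    return new Iterator<R>() {
      private R next;

      @Override
      public boolean hasNext() {
        while (next == null && it.hasNext())
          next = loader.apply(it.next()); // load on demand; drop nulls
        return next != null;
      }

      @Override
      public R next() {
        if (!hasNext())
          throw new NoSuchElementException();
        final R r = next;
        next = null;
        return r;
      }
    };
  }
}
```

With this shape, memory stays bounded by one record at a time, and stale index entries never surface as nulls downstream.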
```java
final CellIterator cellIter = grid.getTreeCellIterator(searchShape, detailLevel);
final LinkedHashSet<RID> seen = new LinkedHashSet<>();
while (cellIter.hasNext()) {
  final Cell cell = cellIter.next();
  if (cell.getShapeRel() == null)
    continue;
  final String token = cell.getTokenBytesNoLeaf(null).utf8ToString();
  if (token.isEmpty())
    continue;
  final IndexCursor cursor = underlyingIndex.get(new Object[]{token});
  while (cursor.hasNext())
    seen.add(cursor.next().getIdentity());
}
```
The geohash lookup logic currently collects all matching RIDs from all covering cells into a LinkedHashSet before applying the limit. For large search shapes or high-density areas, this can lead to excessive memory usage and slow performance. It is recommended to check the limit inside the cell iteration loop and stop early once enough unique RIDs have been collected.
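The recommended early exit can be sketched generically: deduplicate while iterating the per-cell cursors and stop as soon as the ceiling is reached, instead of draining every cell first. Names here are illustrative, not the actual ArcadeDB API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Collect up to `limit` unique ids across several token cursors, stopping as
// soon as the ceiling is reached. A limit of -1 means "no limit".
public final class LimitedDedup {
  public static <T> Set<T> collect(final Iterable<? extends Iterable<T>> cursors, final int limit) {
    final LinkedHashSet<T> seen = new LinkedHashSet<>();
    for (final Iterable<T> cursor : cursors) {
      for (final T id : cursor) {
        seen.add(id); // set semantics deduplicate ids seen in multiple cells
        if (limit > -1 && seen.size() >= limit)
          return seen; // early exit: remaining cells never visited
      }
    }
    return seen;
  }
}
```

For a coarse search shape covering thousands of cells, this bounds both memory and lookup work by the query's own limit.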
```java
private static String extractFieldName(final Expression expr) {
  if (expr == null)
    return null;
  // toString() on a plain identifier expression yields the field name
  final String text = expr.toString();
  if (text == null || text.isBlank())
    return null;
  // Reject if this looks like a function call or contains operators
  if (text.contains("(") || text.contains(" ") || text.contains("."))
    return null;
  return text;
}
```
The extractFieldName method is too restrictive as it rejects any identifier containing a space. In ArcadeDB, property names can contain spaces if they are escaped with backticks (e.g., `my property`). This restriction prevents the query optimizer from using geospatial indexes on such properties. A more robust check against the AST expression type (e.g., checking if it is a simple identifier) would be preferable.
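Until an AST-based check is in place, even the string heuristic can be made backtick-aware. The following is a hypothetical string-level helper sketching that intermediate step; a real fix would inspect the Expression node type rather than its rendered text.

```java
// Sketch of a less restrictive field-name check: accept plain identifiers and
// backtick-quoted names (which may legally contain spaces), reject anything
// that looks like a function call, operator expression, or dotted path.
public final class FieldNames {
  public static String extractFieldName(final String text) {
    if (text == null || text.isBlank())
      return null;
    // Backtick-quoted identifier: strip the quotes, allow inner spaces
    if (text.length() > 2 && text.startsWith("`") && text.endsWith("`")) {
      final String inner = text.substring(1, text.length() - 1);
      return inner.isBlank() ? null : inner;
    }
    // Unquoted: reject function calls, operators, and dotted paths
    if (text.contains("(") || text.contains(" ") || text.contains("."))
      return null;
    return text;
  }
}
```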
```java
final double[] p1 = extractPointCoords(iParams[0]);
final double[] p2 = extractPointCoords(iParams[1]);

final String unit = (iParams.length >= 3 && iParams[2] != null) ? iParams[2].toString() : "m";
```
The new ST_Distance implementation always defaults to meters ('m'). This is a breaking change for SQL users, as the previous distance() function defaulted to kilometers ('km') in a SQL context (while using meters for Cypher). To maintain backward compatibility and follow the PR description's intent, the default unit should be determined based on the argument count or the query context.
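Isolating the unit handling in one conversion helper keeps the default-unit decision a one-line policy choice per calling context. The following is a self-contained sketch of Haversine distance with an explicit unit parameter; constants and names are illustrative, not ArcadeDB's actual implementation.

```java
// Haversine great-circle distance with a unit parameter. The caller picks the
// default unit ("m" vs the legacy "km"), so backward compatibility becomes a
// policy decision at the call site rather than a hardcoded constant.
public final class GeoDistance {
  private static final double EARTH_RADIUS_M = 6_371_000.0; // mean Earth radius

  public static double distance(final double lat1, final double lon1,
                                final double lat2, final double lon2, final String unit) {
    final double dLat = Math.toRadians(lat2 - lat1);
    final double dLon = Math.toRadians(lon2 - lon1);
    final double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
        + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
        * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    final double meters = 2 * EARTH_RADIUS_M * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    switch (unit) {
      case "m":   return meters;
      case "km":  return meters / 1_000.0;
      case "mi":  return meters / 1_609.344;
      case "nmi": return meters / 1_852.0;
      default:    throw new IllegalArgumentException("Unknown unit: " + unit);
    }
  }
}
```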
```java
return null;

@SuppressWarnings("unchecked")
final List<Object> points = (List<Object>) iParams[0];
```
The code performs an unchecked cast of the first parameter to List<Object>. If the user provides a different type (e.g., a single point or a string), this will throw a ClassCastException. It is safer to check the type using instanceof List before casting.
Suggested change:

```java
if (!(iParams[0] instanceof List))
  throw new IllegalArgumentException("ST_LineString() requires a list of points as the first parameter");
final List<Object> points = (List<Object>) iParams[0];
```

```java
if (iParams == null || iParams.length < 1 || iParams[0] == null)
  return null;

@SuppressWarnings("unchecked")
```
Similar to ST_LineString, this unchecked cast to List<Object> can lead to a ClassCastException if the input is not a list. A type check should be added.
Suggested change:

```java
if (!(iParams[0] instanceof List))
  throw new IllegalArgumentException("ST_Polygon() requires a list of points as the first parameter");
@SuppressWarnings("unchecked")
final List<Object> points = (List<Object>) iParams[0];
```
Code Review: Geospatial Indexing (#3510)

This is a substantial and well-structured feature addition. The overall design - mirroring LSMTreeFullTextIndex with GeoHash tokenization via Lucene's RecursivePrefixTreeStrategy - is sound and integrates neatly into the existing architecture. Below are issues ranging from blockers to minor suggestions.

Blockers

1. .claude/settings.local.json must not be committed

This file is a local Claude Code AI assistant configuration specific to the developer's machine. It has no place in the repository. It should be removed from this PR and added to .gitignore. Committing it exposes internal AI toolchain configuration and will cause noise for all other contributors.

2. docs/plans/ AI planning artifacts should be removed

docs/plans/2026-02-22-geospatial-design.md (237 lines) and docs/plans/2026-02-22-geospatial-implementation.md (1,456 lines) appear to be AI-generated implementation plan documents. These are development process artifacts, not user-facing documentation, and they add ~1,700 lines of noise to the repository. If design rationale is worth preserving, it belongs in a condensed form in docs/ or as an ADR, not as verbatim AI session outputs.

Correctness Issues

3. ST_DWithin uses degrees, not real-world units - inconsistent with ST_Distance

SpatialContext.calcDistance() returns degrees in geographic (non-Euclidean) mode. ST_Distance, by contrast, uses Haversine and returns metres/km/mi/nmi based on a unit parameter. This inconsistency will surprise users who pass a value in metres to ST_DWithin expecting the same semantics as ST_Distance. The docstring says 'Distance in degrees (Spatial4j coordinate system)', but the function signature gives no hint of this to SQL users. At minimum, the unit must be clearly communicated at the SQL level (e.g., a unit parameter or explicit naming like ST_DWithinDegrees). Ideally, accept the same unit parameter as ST_Distance and convert.

4. ST_DWithin measures center-to-center distance, not geometry distance

Using geom1.getCenter() for polygon or linestring inputs violates OGC semantics. ST_DWithin(g1, g2, d) should return true if any point of g1 is within distance d of any point of g2. For a polygon g1, the center can be far from an edge that is actually within the threshold. This produces silently wrong results for non-point geometries.

5. looksLikeGeoHashToken fallback in put() is architecturally fragile

This heuristic exists to handle WAL/commit replay, where already-tokenized geohash strings are passed back through put(). Relying on character-pattern matching to distinguish raw WKT from internal index tokens is fragile and error-prone. A cleaner design would wrap the internal token differently (e.g., a GeoHashToken value object) or expose a dedicated internal putToken() method that bypasses WKT parsing, rather than inferring intent from string content.

6. Memory: get() loads all matching RIDs into a LinkedHashSet before streaming

For queries covering large areas (low-precision geohash cells covering many documents), this materialises the entire candidate set in memory. The existing LSMTreeFullTextIndex has the same pattern, but geospatial queries can realistically return millions of candidates for coarse shapes. Consider deferring materialisation, or at least implementing an early exit once the limit is reached (currently the limit is only applied after full collection).

Design / API Issues

7. Precision is hardcoded to DEFAULT_PRECISION in the file-ID constructor

This constructor is used during database open (before LocalSchema.readConfiguration() wraps the index with the correct precision). The wrap in readConfiguration() fixes it, but between construction and wrapping, any code that touches the index object from the component registry uses the wrong precision. Worth adding a comment explaining this lifecycle dependency, or restructuring so the wrap happens at construction time.

8. ST_Area returns square degrees with no documentation of units

ST_Area returns area via Spatial4j in square degrees (geographic CRS). This is non-intuitive - most users would expect square metres, or at least an explicit unit parameter. The getSyntax() return value should document the unit, and ideally an optional unit parameter should be added for consistency with ST_Distance.

9. ST_DWithin docstring contradicts OGC standard

The Javadoc says 'returns true if geometry g is within the given distance of shape' (the correct OGC definition), but the implementation measures center-to-center distance (see issue #4 above). Align the implementation with the spec or explicitly document the deviation.

Testing Issues

10. LSMTreeGeoIndexTest uses reflection to bypass the public API

This couples the test to private implementation details (the LocalSchema.indexMap field name). If the field is renamed or the map structure changes, the test breaks silently at runtime rather than at compile time. Use the public database.command("sql", "CREATE INDEX ON ...") API instead, which is what LSMTreeGeoIndexSchemaTest and SQLGeoIndexedQueryTest already do correctly.

11. No regression test for precision persistence across database reopen

GeoIndexMetadata and LocalSchema.readConfiguration() include logic to persist and restore the precision value. There is no test that verifies: create an index with precision=7, close the database, reopen, query - and confirm the precision was restored. LSMTreeGeoIndexSchemaTest tests schema serialisation, but not the full reopen lifecycle.

12. No test for the remove() / update path

There are tests for put() and get(), but none for remove(Object[], Identifiable). Removing a document from a geospatial index involves re-tokenizing the WKT and removing each token - a distinct code path from insertion that deserves its own test.

Minor Issues

13. GeoUtils.parseEnvelopeWkt coordinate ordering comment needed

Spatial4j's ENVELOPE format is ENVELOPE(minX, maxX, maxY, minY) - i.e., (lon_min, lon_max, lat_max, lat_min). This is counter-intuitive. A comment explaining the ordering (or a reference to the Spatial4j spec) would prevent future bugs.

14. SQLFunctionST_AsGeoJson uses manual GeoJSON serialisation

The implementation hand-rolls GeoJSON for Point, LineString, Polygon, Multi-geometries, and GeometryCollection. JTS has a GeoJsonWriter class in org.locationtech.jts.io.geojson that handles all geometry types correctly. Given JTS is already a transitive dependency via lucene-spatial-extras, using GeoJsonWriter would eliminate a large block of error-prone manual code and ensure edge cases (holes in polygons, mixed geometry collections) are handled correctly.

15. ST_Buffer returns WKT string, not a Shape

The function returns a String. If the result is passed to another spatial function (e.g., ST_Within(coords, ST_Buffer(geom, 1.0))), the outer function must re-parse it. All other constructor functions (ST_Point, ST_GeomFromText) also return WKT strings, so this is consistent - but it means every chained call re-parses. Internally returning a Shape object and stringifying only at serialisation would be more efficient.

Summary

The core indexing approach is solid. Most of the changes in LSMTreeGeoIndex, GeoIndexMetadata, LocalSchema, and SQLFunctionST_Predicate are well-structured. The main blockers are the committed developer files (.claude/settings.local.json, docs/plans/), the correctness issues with ST_DWithin (degrees vs. real units, center vs. geometry distance), and the fragile looksLikeGeoHashToken heuristic.
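On the ST_DWithin degrees-vs-metres issue above: accepting the same unit parameter as ST_Distance only requires converting the threshold into the degree units that SpatialContext.calcDistance() returns in geographic mode. The sketch below writes the conversion constant out explicitly (one degree of great-circle arc over the mean Earth radius is about 111.195 km); Spatial4j ships similar constants in DistanceUtils, and class/method names here are hypothetical.

```java
// Conversion between real-world metres and the angular degrees returned by
// geodesic distance calculations. Using this, an ST_DWithin threshold given in
// metres can be compared against a degree-valued calcDistance() result.
public final class DegreeUnits {
  // kilometres spanned by one degree of great-circle arc (mean Earth radius)
  public static final double KM_PER_DEGREE = 2 * Math.PI * 6371.0087714 / 360.0;

  public static double metersToDegrees(final double meters) {
    return meters / 1000.0 / KM_PER_DEGREE;
  }

  public static double degreesToMeters(final double degrees) {
    return degrees * KM_PER_DEGREE * 1000.0;
  }
}
```

Note this is only a unit fix; the separate center-vs-geometry semantics problem (issue 4) still needs a true geometry-to-geometry distance.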
Should we rename all these new functions into the package "geo"?
I don't think anybody is converting whole queries from Postgres to ArcadeDB. WDYT? That way we keep things in order; right now we have hundreds of functions ;-)
Coverage summary from Codacy: see diff coverage on Codacy.
Geospatial Indexing Review

This is a substantial and well-structured feature. The architecture cleanly mirrors LSMTreeFullTextIndex.

Blockers

1.

2. Design/implementation plan docs contain tool-internal instructions

These are AI tool directives that have no place in a committed design doc. If design documentation should be kept, the internal instructions need to be stripped before committing.

Correctness Issues

3. PR description mismatch: the PR title and every function table in the description reference geo.* names.

4. In ST_DWithin:

```java
final double actualDistance = GeoUtils.getSpatialContext()
    .calcDistance(geom1.getCenter(), geom2.getCenter());
return actualDistance <= distance;
```

For two points this works. For a polygon or linestring, center-to-center distance does not implement OGC ST_DWithin semantics.

5. In ST_Equals:

```java
return jts1.norm().equals(jts2.norm());
```

JTS norm().equals() performs a structural comparison of normalised geometries; OGC ST_Equals is defined as topological equality (equalsTopo()), which can differ.

6. The transaction-replay scenario this detects should be verified against the actual TransactionIndexContext replay path.

Performance Concerns

7. In searchFromTarget():

```java
final List<Record> results = new ArrayList<>();
...
while (cursor.hasNext())
  results.add(id.getRecord());
return results;
```

For a large geospatial index, the GeoHash intersection can return a very large candidate set before the post-filter runs. Loading all matching records into a List up front is a memory risk.

8.

Minor Issues

9. No precision validation in the setter

The javadoc says range 1–12, but the setter accepts any integer silently.

```java
if (precision < 1 || precision > 12)
  throw new IllegalArgumentException("Precision must be 1–12, got: " + precision);
```

10. getAssociatedBucketId():

```java
return bucketId >= 0 ? bucketId : underlyingIndex.getFileId();
```

Returning a file ID where a bucket ID is expected is a category mismatch. The comment should explain which code path produces a negative bucket ID.

11.

12.

13. Reflection in the test:

```java
final Field indexMapField = LocalSchema.class.getDeclaredField("indexMap");
indexMapField.setAccessible(true);
```

This breaks if the field is renamed.

14. JavaCC parser incompatibility not surfaced as an actionable error

15.

Positive Notes
Code Review — feat(geo): Geospatial indexing with geo.* SQL functions

Overall this is a well-structured, ambitious feature that follows existing project patterns closely (mirroring LSMTreeFullTextIndex). The design doc and test coverage are thorough. Below are my findings, grouped by severity.

Critical

1. Fragile transaction-replay detection in LSMTreeGeoIndex.put()

looksLikeGeoHashToken() uses a heuristic (short, lowercase-alphanumeric string matching the GeoHash alphabet) to distinguish pre-tokenized GeoHash strings (written by TransactionIndexContext on commit replay) from original WKT values. This is fragile. A cleaner approach: expose a package-private putToken() method that receives already-tokenized strings, so the main put() path never needs to guess. The TransactionIndexContext replay would call putToken() directly, eliminating the heuristic entirely.

2. GeoIndexMetadata.toJSON() / fromJSON() asymmetry

toJSON() only serialises 'precision' — it does NOT call super.toJSON(), so typeName and propertyNames are never written. fromJSON() works around this with a conditional 'if (metadata.has("typeName")) super.fromJSON(metadata)'. On a fresh write the parent fields are absent; on reload the condition is false and they cannot be reconstructed. Fix by always calling super.toJSON() / super.fromJSON(), matching the FullTextIndexMetadata pattern.

Important

3. Hard breaking changes with no deprecation path

The PR removes point(), distance(), circle(), polygon(), lineString(), and rectangle() entirely. Any existing database with queries or scripts using these functions will break at upgrade with no migration path. Consider keeping the old names as thin aliases that delegate to their geo.* equivalents and emit a deprecation log message, or document the upgrade migration explicitly in the release notes.

4. allowsIndexedExecution() ignores the comparison operator

The method receives a BinaryCompareOperator but never inspects it. If someone writes WHERE geo.within(coords, shape) != true, the optimizer may still route through the index and produce wrong results. Return false when the operator is not an equality operator.

5. JavaCC parser limitation creates a two-tier user experience

The ANTLR4 parser supports geo.point(x, y); the JavaCC parser requires the backtick-quoted form `geo.point`(x, y). The project uses both parsers, so users hitting the JavaCC path will see confusing parse errors with no actionable guidance. User-visible error messages should hint at the backtick alternative.

6. CellIterator null shapeRel filtering may miss cells

In get(): shapeRel is null for unresolved interior nodes, not for definitively disjoint cells. Skipping them may cause false negatives at fine precision levels. Consider skipping only DISJOINT cells rather than null-rel cells.

7. docs/plans/ implementation planning files committed to the repo

The four docs/plans/ markdown files are internal step-by-step implementation checklists (they include git commit instructions). These do not belong in the production repository. A single architecture document summarising the final design would be more appropriate.

Minor

8. Test database name collision

geoManualIndexPoints and geoManualIndexBoundingBoxes both pass "GeoDatabase" to TestHelper.executeInNewDatabase(). Parallel test runs will collide. Use unique names per test method.

9. geo.distance default unit change is a silent breaking change

The old distance() function used different default units. geo.distance now defaults to meters. Any query relying on the numeric output of the old function will produce values 1000x different without warning. Document this explicitly in the breaking-changes section.

10. .claude/settings.local.json in .gitignore

Fine to add, but it should be grouped with other IDE entries (.idea/) with a brief comment, or placed in a global gitignore instead of the project-level file.

11. extractFieldName() silently rejects dotted field paths

The check 'text.contains(".")' prevents geo.within(record.nested.coords, shape) from using the index. Acceptable for v1, but a comment documenting this limitation would help future maintainers.

12. estimate() always returns -1

The optimizer always treats the geo index as having unknown cardinality. Even a rough estimate based on precision level and search shape area would improve query plan selection when combining a geo filter with other indexed conditions.

Positive observations
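The dual-entry-point design suggested for the replay heuristic can be sketched in isolation: the public put() always tokenizes, while a package-private putToken() accepts an already-tokenized key, so commit replay never has to guess whether a string is WKT or a geohash cell. The tokenizer and storage below are stand-ins, not the real LSM-Tree API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: two explicit write paths remove the need for looksLikeGeoHashToken().
public final class GeoIndexSketch {
  private final Map<String, List<Integer>> storage = new HashMap<>(); // token -> rids
  private final Function<String, List<String>> tokenizer; // e.g. WKT -> geohash cells

  public GeoIndexSketch(final Function<String, List<String>> tokenizer) {
    this.tokenizer = tokenizer;
  }

  // Public path: raw value in, tokens out, then stored per token.
  public void put(final String rawValue, final int rid) {
    for (final String token : tokenizer.apply(rawValue))
      putToken(token, rid);
  }

  // Internal path for commit replay: the key is already a token, no guessing.
  void putToken(final String token, final int rid) {
    storage.computeIfAbsent(token, k -> new ArrayList<>()).add(rid);
  }

  public List<Integer> get(final String token) {
    return storage.getOrDefault(token, List.of());
  }
}
```

Replay code calls putToken() directly, so the type of the key encodes the caller's intent instead of a string heuristic.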
Code Review: PR #3510 — Geospatial Indexing

This is a substantial, well-structured feature PR. The design (GeoHash-tokenised LSM-Tree wrapping the existing index engine) is sound.

Critical / Potential Bugs

1. Breaking change with no migration path

2. Eager record loading:

```java
results.add(id.getRecord()); // loads full document from disk
```

Because every candidate RID is resolved to a full record up front, large candidate sets are fully materialised in memory.

3. Loading-time constructor hardcodes DEFAULT_PRECISION

```java
public LSMTreeGeoIndex(final DatabaseInternal database, final String name, final String filePath,
    final int fileId, final ComponentFile.MODE mode, final int pageSize, final int version) {
  this.precision = DEFAULT_PRECISION; // precision persisted in JSON is ignored here
```

If this constructor is ever invoked during index loading (e.g. by a future refactoring or external caller), the persisted precision will be silently lost and incorrect GeoHash cells will be used for querying.

4. No range validation for precision

Valid GeoHash precision is 1–12. Values outside this range will silently produce a broken index. Add validation:

```java
if (precision < 1 || precision > 12)
  throw new IllegalArgumentException("GeoHash precision must be 1-12, got: " + precision);
```

Design / Correctness Issues

5. The transaction-commit-replay detection in put() is fragile.

6. extractFieldName() rejects dotted paths:

```java
if (text.contains("."))
  return null;
```

This means nested property paths can never use the index.

7. Distance and area units are undocumented at the usage level

These need explicit unit documentation in getSyntax().

8. Plan/design documents committed to the repository tree

Performance

9. For a query covering a large area at high precision, thousands of GeoHash cells may be returned, each with many RIDs. The full candidate set is collected before any limit is applied.

Tests

10. Test uses reflection to access LocalSchema.indexMap:

```java
final Field indexMapField = LocalSchema.class.getDeclaredField("indexMap");
indexMapField.setAccessible(true);
```

This tightly couples the test to an internal field name.

11. No regression test for the index reopen lifecycle

Cleanup

12. .claude/settings.local.json

This is a tool-specific local file and should be in the global gitignore.

Minor

Overall the architecture is well thought-out and the test coverage is good. Issue #2 (eager record loading) is the most important one to address.
Force-pushed from 601f145 to bf88d51 (compare)
Code Review — Geospatial Indexing (PR #3510)

This is a substantial and well-structured feature addition. The design closely mirrors LSMTreeFullTextIndex.

Critical / Correctness Bugs

1.
Code Review

This is a substantial and well-structured addition. The overall architecture (thin wrapper over the existing LSM-Tree index engine) is sound.

Critical / Correctness Issues

1.

2. Fragile GeoHash token detection in put()

A cleaner design: introduce a typed key wrapper (e.g. a GeoHashToken value object) instead of inferring intent from string content.

3. JTS geometry equality

Design / Architecture Issues

4.

5. extractFieldName() relies on string matching:

```java
if (!text.contains("(") && !text.contains(" "))
  return text;
```

This will fail silently for backtick-quoted and other non-trivial identifiers. Use the AST node type directly (a simple-identifier check) rather than the rendered string.

6. Precision not validated in the setter

Valid GeoHash precision is 1–12. Any integer is accepted now. A precision of 0 or 13+ will be silently passed through to the GeoHash grid.

7. getAssociatedBucketId():

```java
return bucketId >= 0 ? bucketId : underlyingIndex.getFileId();
```

Returning the index file ID as a substitute for a bucket ID when there is no associated bucket conflates two unrelated identifiers. If the locking machinery misbehaves without a bucket ID, the fix should be in the locking machinery or in a null-safe check there, not by returning a misleading value here.

Minor Issues

8. Design/plan documents committed to the repo

9. Test uses reflection to access LocalSchema.indexMap

10. Breaking changes need a migration note in the changelog / docs

The removal of the legacy spatial functions should be called out explicitly.

11. Missing test for multi-bucket (partitioned) types

All integration tests use single-bucket document types.

12. Duplicated logic relative to the base class

Positive Aspects
Force-pushed from 7778525 to d6d8444 (compare)
Code Review: feat(geo) — Geospatial indexing with geo.* SQL functions

Overall this is a well-structured feature that follows existing project patterns (notably LSMTreeFullTextIndex).

Breaking changes without deprecation path

The PR removes the legacy spatial functions outright.

The Cypher functions are preserved; the same respect should apply to SQL.
| Category | Issues |
|---|---|
| Must fix before merge | Breaking API removal without deprecation, fragile WAL-replay heuristic |
| Should address | Memory pressure in get(), fromJSON conditional, docs/plans removal, getAssociatedBucketId documentation |
| Nice to have | JavaCC docs, test isolation, TokenStream cleanup, CONTAINS grammar note |
The core implementation is solid and the test coverage is comprehensive. Addressing the breaking-change strategy and the `looksLikeGeoHashToken` reliability issue are the most important items before this lands.
Design for porting OrientDB geospatial indexing to ArcadeDB using LSM-Tree as storage backend (following the LSMTreeFullTextIndex pattern) and lucene-spatial-extras for GeoHash decomposition. Covers ST_* SQL functions, IndexableSQLFunction integration for automatic query optimizer usage, and WKT as geometry storage format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Apply the limit ceiling to results returned from the deduplication map, matching the behavior of LSMTreeFullTextIndex. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- toJSON(): add missing "bucket" field to prevent schema reload failure
- extractTokens(): consolidate duplicated token-stream loop from indexShape() so that IOException warning is emitted consistently in both put and remove paths
- get(): replace LinkedHashMap<RID,Integer> with LinkedHashSet<RID> for clarity
- getType(): remove internal task reference from exception message

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add GEOSPATIAL to Schema.INDEX_TYPE enum
- Register GeoIndexFactoryHandler in LocalSchema
- Fix LSMTreeGeoIndex.getType() to return Schema.INDEX_TYPE.GEOSPATIAL
- Add GEOSPATIAL case to LocalSchema readConfiguration() for schema reload
- Add GEOSPATIAL to CreateIndexStatement validate() and executeDDL() mapping
- Add LSMTreeGeoIndexSchemaTest to verify DDL creation via SQL

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update all predicate subclass extends clauses to reference the new base class name.
Rename 12 geo function classes (GeomFromText, Point, LineString, Polygon, Buffer, Envelope, Distance, Area, AsText, AsGeoJson, X, Y) from ST_* prefix to geo.* namespace, updating class names, NAME constants, getSyntax() strings, Javadoc references, factory imports and registrations accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename the 9 spatial predicate function classes from SQLFunctionST_* to SQLFunctionGeo* and update their NAME constants from ST_Xxx to geo.xxx, matching the geo.* naming convention established in Task 2. Update DefaultSQLFunctionFactory imports and register calls accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ter() calls Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace all ST_* function calls in SQL string literals in SQLGeoFunctionsTest and SQLGeoIndexedQueryTest with the new geo.* namespace-qualified names (e.g. ST_GeomFromText → geo.geomFromText, ST_Within → geo.within, ST_Contains → geo.contains, etc.).
- Extend the SQL grammar (SQLParser.g4) to support namespace-qualified function calls: identifier DOT identifier LPAREN ... (e.g. geo.point(x,y)). functionCall now accepts an optional 'namespace DOT' prefix before the function name.
- Add CONTAINS to the identifier rule so that geo.contains(...) parses correctly (CONTAINS was a reserved keyword not usable as an identifier).
- Update SQLASTBuilder.visitFunctionCall to combine the two identifiers into a single dot-separated function name when the qualified form is used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…) syntax

Revert the grammar ordering change from commit d0aff24 that placed functionCallExpr before identifierChain in baseExpression. That ordering caused field.method() patterns like decimal.format('%.1f') and name.toLowerCase() to be incorrectly parsed as namespace-qualified function calls.

The fix uses Option A:

- Restore identifierChain before functionCallExpr in baseExpression so that field.method() patterns parse correctly as field access + method call.
- Revert functionCall to accept a single identifier (no DOT namespace prefix), matching its original form.
- Add FUNCTION_NAMESPACES detection in visitIdentifierChain: when the identifierChain matches exactly one base identifier that is a known function namespace (currently "geo") followed by exactly one methodCall, the visitor rewrites the node as a namespace-qualified FunctionCall (e.g. "geo.point", "geo.within") instead of a field access with a method modifier.

This preserves full support for:

- geo.within(coords, geo.point(1, 2)) — namespace-qualified function call
- decimal.format('%.1f') — field with method call
- name.toLowerCase() — field with method call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… JavaCC limitation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tokenization

- Remove .claude/settings.local.json from tracking and add to .gitignore
- Strip internal AI tool directive from geospatial implementation plan doc
- Fix LSMTreeGeoIndex.build() to scan bucket and call this.put() via the indexer, instead of delegating to underlyingIndex.build() which bypassed GeoHash tokenization and stored raw WKT keys in the index

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…error paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lid WKT test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rror paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…entFunctionsTest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ror paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…classes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ying comments

- Restore point(), distance(), linestring(), polygon() as aliases for the geo.* equivalents via getAlias() on each function class; the factory template auto-registers both names from a single instance
- Re-add SQLFunctionCircle and SQLFunctionRectangle as @deprecated wrappers preserving the original Spatial4j-based behaviour (no direct geo.* equivalent exists); GeoUtils.parseGeometry() already handles Shape objects so they remain compatible with all new geo.* predicates
- Add TODO in LSMTreeGeoIndex.get() documenting the LinkedHashSet<RID> v1 memory trade-off and what a streaming cursor alternative would look like
- Add comment in LSMTreeGeoIndexTest explaining why reflection is required to inject the index into the schema's indexMap
- Document the CONTAINS-as-identifier grammar trade-off in SQLParser.g4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from d6d8444 to 87a37d3
### Code Review: Geospatial Indexing (PR #3510)

This is a substantial feature addition. The overall architecture is sound — mirroring ….

#### Breaking Changes

The PR removes ….

#### Bugs / Correctness Issues

1. **Scientific notation in WKT generation** (…)

   Using …

2. **Inconsistent distance units between …**
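The scientific-notation problem in item 1 is easy to reproduce: Java's default double-to-string conversion switches to exponent form for small magnitudes, which produces coordinate text that WKT parsers reject. A minimal sketch of the failure and one possible fix (the `formatCoordinate` helper is hypothetical, not the PR's code):

```java
import java.math.BigDecimal;

// Demonstrates why Double.toString is unsafe for WKT coordinate output:
// small magnitudes are rendered in scientific notation ("1.0E-7"),
// which is not valid WKT number syntax. BigDecimal.toPlainString is
// one way to force plain decimal form.
public class WktNumberFormat {

  // Hypothetical helper; the PR's actual formatting code may differ.
  static String formatCoordinate(double value) {
    return new BigDecimal(Double.toString(value)).toPlainString();
  }

  public static void main(String[] args) {
    double lon = 0.0000001;                    // 1e-7 degrees
    System.out.println(Double.toString(lon));  // 1.0E-7 -> breaks WKT
    System.out.println(formatCoordinate(lon)); // plain decimal -> parseable
  }
}
```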
3. **Fragile …**

   The transaction-replay detection relies on character-set checks to distinguish WKT strings from pre-tokenised GeoHash strings. A short string that passes the alphabet check (e.g., …) ….

#### Performance Concerns

4. …
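To illustrate why an alphabet check alone cannot reliably separate GeoHash tokens from other strings: the GeoHash base-32 alphabet (digits plus letters except a, i, l, o) happens to accept many ordinary words. A sketch of such a heuristic and a false positive — the method below is hypothetical, illustrating the review's point, not the PR's actual detection code:

```java
// Sketch of an alphabet-based GeoHash detector and why it is fragile.
public class GeoHashHeuristic {
  private static final String GEOHASH_ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz";

  // Hypothetical heuristic: short string, all characters in the alphabet.
  static boolean looksLikeGeoHashToken(String text) {
    if (text.isEmpty() || text.length() > 12)
      return false;
    for (int i = 0; i < text.length(); i++)
      if (GEOHASH_ALPHABET.indexOf(Character.toLowerCase(text.charAt(i))) < 0)
        return false;
    return true;
  }

  public static void main(String[] args) {
    // A real 7-character GeoHash cell passes, as intended:
    System.out.println(looksLikeGeoHashToken("u4pruyd")); // true
    // But "empty" also passes the alphabet check (e, m, p, t, y are all
    // valid GeoHash characters) -> false positive:
    System.out.println(looksLikeGeoHashToken("empty"));   // true
  }
}
```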
5. The method calls ….

6. …
#### Design Concerns

7. **Plan documents in …**

   The PR adds four ….

8. **Adding …**

9. **Missing range validation in …**

   The valid range is documented as 1–12 but no validation is performed. Values outside this range would be silently stored and cause ….

10. …
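The fix for item 9 is a one-line guard at index-creation time. A sketch with hypothetical names (the PR's own validation, once added, may live elsewhere):

```java
// Range guard the review asks for: reject precision values outside the
// documented GeoHash range of 1..12 at creation time instead of storing
// them silently. Hypothetical helper, not the PR's code.
public class GeoHashPrecision {
  static final int MIN_PRECISION = 1;
  static final int MAX_PRECISION = 12;

  static int validatePrecision(int precision) {
    if (precision < MIN_PRECISION || precision > MAX_PRECISION)
      throw new IllegalArgumentException("GeoHash precision must be in ["
          + MIN_PRECISION + ", " + MAX_PRECISION + "], got " + precision);
    return precision;
  }

  public static void main(String[] args) {
    System.out.println(validatePrecision(11)); // default precision, accepted
    try {
      validatePrecision(13);                   // out of range
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```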
#### Minor / Code Style
#### What Works Well
#### Summary

The core implementation is solid but there are correctness bugs (scientific notation in WKT generation, distance unit mismatch between …).
## Geospatial Indexing Support (LSM-Tree native storage, geo.* SQL functions)

### Overview

Adds full geospatial indexing to ArcadeDB using the native LSM-Tree engine as storage backend and a `geo.*` SQL function namespace consistent with ArcadeDB's existing dot-namespace convention (e.g. `vector.neighbors`, `vector.cosineSimilarity`). The design mirrors the existing `LSMTreeFullTextIndex` pattern: a thin wrapper that tokenizes geometry into GeoHash cells via Apache Lucene's `lucene-spatial-extras` library, stored in the LSM-Tree — inheriting ACID, WAL, HA, and compaction for free.

### New index type: GEOSPATIAL

Create a geospatial index on any `STRING` property that stores WKT geometry. Configurable GeoHash precision (default 11, ~2.4 m resolution; range 1–12). Precision is persisted in the schema JSON and survives database reopen.
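For intuition about how GeoHash precision maps to index tokens, here is a minimal self-contained encoder using the standard base-32 bit-interleaving scheme. This is an illustration only: the PR delegates tokenization to `lucene-spatial-extras`, not to this sketch. Each additional character adds 5 bits of resolution, and a lower-precision hash is always a prefix of the higher-precision hash for the same point, which is what makes prefix range scans over the sorted LSM-Tree keys work:

```java
// Minimal GeoHash encoder (standard bit-interleaving base-32 scheme).
public class GeoHashEncode {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  static String encode(double lat, double lon, int precision) {
    double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
    StringBuilder out = new StringBuilder(precision);
    boolean lonTurn = true; // longitude and latitude bits alternate
    int bits = 0, ch = 0;
    while (out.length() < precision) {
      if (lonTurn) {
        double mid = (minLon + maxLon) / 2;
        if (lon >= mid) { ch = ch << 1 | 1; minLon = mid; } else { ch <<= 1; maxLon = mid; }
      } else {
        double mid = (minLat + maxLat) / 2;
        if (lat >= mid) { ch = ch << 1 | 1; minLat = mid; } else { ch <<= 1; maxLat = mid; }
      }
      lonTurn = !lonTurn;
      if (++bits == 5) { out.append(BASE32.charAt(ch)); bits = 0; ch = 0; }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Classic test vector: 57.64911 N, 10.40744 E at precision 11.
    System.out.println(encode(57.64911, 10.40744, 11)); // u4pruydqqvj
    // Lower precision = shorter prefix = coarser cell:
    System.out.println(encode(57.64911, 10.40744, 5));  // u4pru
  }
}
```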
### Automatic query optimizer integration

No `search_index()` call required. Any `WHERE` clause using a `geo.*` spatial predicate on an indexed field is automatically routed through the geospatial index. The index returns a GeoHash-cell superset of candidates; the exact Spatial4j/JTS predicate is applied as a post-filter (`shouldExecuteAfterSearch = true`).
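The two-phase pattern described above (the index yields a superset of candidates keyed by GeoHash cell, then the exact geometry predicate prunes false positives) can be sketched independently of ArcadeDB's internals. All names below are hypothetical, and a simple bounding-box test stands in for the real Spatial4j/JTS predicate:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Candidate-superset + exact post-filter (shouldExecuteAfterSearch = true):
// phase 1 scans the sorted index by GeoHash-cell prefix, phase 2 applies
// the exact predicate to drop false positives from the superset.
public class SupersetPostFilter {
  // token -> record id, sorted so a prefix maps to a contiguous key range
  final NavigableMap<String, Integer> index = new TreeMap<>();
  final double[][] points = new double[16][]; // rid -> {lat, lon}

  void put(String geoHashToken, int rid, double lat, double lon) {
    index.put(geoHashToken, rid);
    points[rid] = new double[] { lat, lon };
  }

  List<Integer> query(String cellPrefix, double minLat, double maxLat,
                      double minLon, double maxLon) {
    List<Integer> hits = new ArrayList<>();
    // Phase 1: cheap superset via prefix range scan.
    for (int rid : index.subMap(cellPrefix, cellPrefix + "\uffff").values()) {
      double[] p = points[rid];
      // Phase 2: exact check (bounding box here; JTS in the real index).
      if (p[0] >= minLat && p[0] <= maxLat && p[1] >= minLon && p[1] <= maxLon)
        hits.add(rid);
    }
    return hits;
  }

  public static void main(String[] args) {
    SupersetPostFilter idx = new SupersetPostFilter();
    idx.put("u4pruyd", 0, 57.649, 10.407); // inside the query box
    idx.put("u4pruyc", 1, 57.100, 10.100); // same prefix, outside box -> filtered
    idx.put("gbsuv7z", 2, 48.669, -4.329); // different cell, never scanned
    System.out.println(idx.query("u4pru", 57.5, 57.7, 10.3, 10.5)); // [0]
  }
}
```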
### 12 geo.* constructor / accessor functions

| Function | Result |
|---|---|
| `geo.geomFromText(wkt)` | |
| `geo.point(x, y)` | `POINT (x y)` WKT |
| `geo.lineString(pts)` | `LINESTRING (...)` WKT |
| `geo.polygon(pts)` | `POLYGON ((…))` WKT, auto-closes ring |
| `geo.buffer(geom, dist)` | `Geometry.buffer()` |
| `geo.envelope(geom)` | |
| `geo.distance(g1, g2 [,unit])` | `m` (default), `km`, `mi`, `nmi` |
| `geo.area(geom)` | |
| `geo.asText(geom)` | |
| `geo.asGeoJson(geom)` | |
| `geo.x(point)` | |
| `geo.y(point)` | |
### 9 geo.* spatial predicate functions

All implement `IndexableSQLFunction` for automatic optimizer integration. Predicates that are semantically incompatible with a GeoHash intersection superset correctly opt out of indexed execution.

| Predicate | Meaning |
|---|---|
| `geo.within(g, shape)` | `g` fully within `shape` |
| `geo.intersects(g, shape)` | `g` and `shape` share any point |
| `geo.contains(g, shape)` | |
| `geo.dWithin(g, shape, dist)` | |
| `geo.disjoint(g, shape)` | |
| `geo.equals(g, shape)` | |
| `geo.crosses(g, shape)` | |
| `geo.overlaps(g, shape)` | |
| `geo.touches(g, shape)` | |

All predicates return `null` when either argument is `null` (three-valued SQL logic).

### SQL parser: geo.* namespace support
The ANTLR4 SQL grammar does not natively support dotted function names. A `FUNCTION_NAMESPACES` set in `SQLASTBuilder.visitIdentifierChain` rewrites `geo.function(args)` identifier-chain patterns into proper `FunctionCall` AST nodes at visitor level, keeping the grammar rule ordering intact and avoiding regressions with `field.method()` patterns.
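The visitor-level rewrite can be pictured with plain strings: the chain is treated as a namespace-qualified function call only when its base identifier is a registered namespace and exactly one call follows; otherwise it stays field access plus method call. The sketch below mirrors that decision with hypothetical names, not the actual `SQLASTBuilder` code:

```java
import java.util.Set;

// Decision logic behind the identifier-chain rewrite: "geo.point(...)"
// collapses into the single function name "geo.point", while
// "name.toLowerCase()" stays a field access with a method call.
public class NamespaceRewrite {
  static final Set<String> FUNCTION_NAMESPACES = Set.of("geo");

  // Returns the combined function name if the chain should become a
  // namespace-qualified function call, or null to keep field.method().
  static String rewrite(String baseIdentifier, int methodCallCount, String methodName) {
    if (FUNCTION_NAMESPACES.contains(baseIdentifier.toLowerCase()) && methodCallCount == 1)
      return baseIdentifier.toLowerCase() + "." + methodName;
    return null;
  }

  public static void main(String[] args) {
    System.out.println(rewrite("geo", 1, "point"));        // geo.point
    System.out.println(rewrite("decimal", 1, "format"));   // null -> field.method()
    System.out.println(rewrite("name", 1, "toLowerCase")); // null -> field.method()
  }
}
```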
### Breaking changes

The old non-standard geo functions are removed and replaced by `geo.*` equivalents:

| Removed | Replacement |
|---|---|
| `point(x, y)` | `geo.point(x, y)` |
| `distance(p1, p2)` | `geo.distance(p1, p2)` |
| `circle(c, r)` | `geo.buffer(geom, dist)` |
| `polygon(pts)` | `geo.polygon(pts)` |
| `lineString(pts)` | `geo.lineString(pts)` |
| `rectangle(pts)` | `geo.envelope(geom)` |

Cypher `point(lat, lon)` and `distance(p1, p2)` are preserved via `CypherFunctionFactory` and continue to work unchanged in Cypher queries.

### New dependency
`org.apache.lucene:lucene-spatial-extras` (version `${lucene.version}`, Apache 2.0). Lucene core is already a transitive dependency; this is a sibling module. Attribution added to `ATTRIBUTIONS.md`.

### Tests
- `GeoIndexMetadataTest`
- `LSMTreeGeoIndexTest`
- `LSMTreeGeoIndexSchemaTest`
- `SQLGeoFunctionsTest`
- `SQLGeoIndexedQueryTest`

79 tests total. Full engine suite (4978 tests) passes with no regressions.