(feat: persistence) Add schema-metrics-v1.sql for metrics tables (#3337) #3523

Open
obelix74 wants to merge 7 commits into apache:main from obelix74:feat-3337-schema-v4
Conversation

@obelix74
Contributor

@obelix74 obelix74 commented Jan 24, 2026

Add new schema version 4 with tables for storing scan and commit metrics reports as first-class entities.

New tables:

  • scan_metrics_report: Stores scan metrics with trace correlation
  • scan_metrics_report_roles: Junction table for principal roles
  • commit_metrics_report: Stores commit metrics with trace correlation
  • commit_metrics_report_roles: Junction table for principal roles

Key design decisions:

  • PRIMARY KEY (realm_id, report_id) for multi-tenancy
  • Junction tables with CASCADE DELETE for roles
  • Timestamp index for retention cleanup
  • JSONB metadata column for extensibility (Postgres), TEXT for H2
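
As a rough illustration of the design decisions above, the sketch below holds PostgreSQL-flavoured DDL for the scan tables and applies it over a plain JDBC connection. Only `report_id`, `realm_id`, `catalog_id` and the JSONB `metadata` column are taken from this PR; `table_id`, `timestamp_ms`, `role_name`, the index name and the `MetricsSchemaSketch` class are assumptions made for the example, and the actual schema file may well differ.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

/** Illustrative only: several column names below are assumptions, not the PR's schema. */
final class MetricsSchemaSketch {

  // PostgreSQL flavour; an H2 variant would declare metadata as TEXT instead of JSONB.
  static final List<String> STATEMENTS = List.of(
      """
      CREATE TABLE IF NOT EXISTS scan_metrics_report (
          report_id    TEXT   NOT NULL,
          realm_id     TEXT   NOT NULL,
          catalog_id   BIGINT NOT NULL,
          table_id     BIGINT NOT NULL,   -- assumed column
          timestamp_ms BIGINT NOT NULL,   -- assumed column
          metadata     JSONB,
          PRIMARY KEY (realm_id, report_id)
      )""",
      """
      CREATE TABLE IF NOT EXISTS scan_metrics_report_roles (
          realm_id  TEXT NOT NULL,
          report_id TEXT NOT NULL,
          role_name TEXT NOT NULL,        -- assumed column
          PRIMARY KEY (realm_id, report_id, role_name),
          FOREIGN KEY (realm_id, report_id)
              REFERENCES scan_metrics_report (realm_id, report_id)
              ON DELETE CASCADE
      )""",
      """
      CREATE INDEX IF NOT EXISTS idx_scan_metrics_report_ts
          ON scan_metrics_report (timestamp_ms)""");

  /** Runs each DDL statement in order on the given connection. */
  static void apply(Connection connection) throws SQLException {
    try (Statement statement = connection.createStatement()) {
      for (String ddl : STATEMENTS) {
        statement.execute(ddl);
      }
    }
  }

  public static void main(String[] args) {
    // No database driver is needed just to inspect the statements.
    STATEMENTS.forEach(ddl -> System.out.println(ddl + ";\n"));
  }
}
```

With a layout like this, a retention job could delete rows from `scan_metrics_report` whose `timestamp_ms` is older than a cutoff, and the cascade would remove the matching role rows.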

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

@singhpk234 singhpk234 requested a review from dimas-b January 24, 2026 00:23
@dimas-b
Contributor

dimas-b commented Jan 26, 2026

@obelix74 @singhpk234 : WDYT about starting an RFC doc + dev thread on this? I believe a structured overview of this feature would be good to set the stage for PRs :) (apologies if I missed it)

@singhpk234
Contributor

@dimas-b there is already a dev thread, please see: https://lists.apache.org/thread/c83jnkvlwc2k3swm65cmvl4t0mt7p799
Thanks @obelix74 for writing this up!

@obelix74
Contributor Author

> @obelix74 @singhpk234 : WDYT about starting an RFC doc + dev thread on this? I believe a structured overview of this feature would be good to set the stage for PRs :) (apologies if I missed it)

I am trying to solve two sets of asks from my product folks with this.

  1. Metrics - what tables were accessed by a client principal
  2. Auditing - which user accessed what data and why

From the metrics perspective, today with 1.3.0, I want to be able to report on table metrics along the following dimensions.

For table scan operations:

  • by table
  • by snapshot
  • by time range
  • by realm
  • by user principal
  • by engine

For commit report queries:

  • by operation type
  • data growth
  • file churn
  • storage analysis

I also want to support operational dashboards with filtering by user, realm, engine name, version, etc.
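
To make the scan-report dimensions above concrete, here is a minimal sketch of a helper that builds a parameterized "scans per table" query from optional filters. The table name is the one proposed in this PR, but `table_id`, `principal_name`, `engine_name`, `timestamp_ms` and the `ScanReportQuerySketch` class are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Illustrative only: column names are assumptions, not the PR's schema. */
final class ScanReportQuerySketch {

  /** SQL text plus the positional parameters to bind, in order. */
  record Query(String sql, List<Object> params) {}

  /** Optional filters; a null field means "no filter on this dimension". */
  record Filter(String principalName, String engineName, Long tableId, Long fromMs, Long toMs) {}

  /** Counts scans per table inside one realm, narrowed by whichever filters are set. */
  static Query scansPerTable(String realmId, Filter filter) {
    StringJoiner where = new StringJoiner(" AND ");
    List<Object> params = new ArrayList<>();

    // The realm is always bound so one tenant never sees another's reports.
    where.add("realm_id = ?");
    params.add(realmId);

    if (filter.tableId() != null) {
      where.add("table_id = ?");
      params.add(filter.tableId());
    }
    if (filter.principalName() != null) {
      where.add("principal_name = ?");
      params.add(filter.principalName());
    }
    if (filter.engineName() != null) {
      where.add("engine_name = ?");
      params.add(filter.engineName());
    }
    if (filter.fromMs() != null) {
      where.add("timestamp_ms >= ?");
      params.add(filter.fromMs());
    }
    if (filter.toMs() != null) {
      where.add("timestamp_ms < ?");
      params.add(filter.toMs());
    }

    String sql = "SELECT table_id, COUNT(*) AS scans FROM scan_metrics_report WHERE "
        + where
        + " GROUP BY table_id ORDER BY scans DESC";
    return new Query(sql, List.copyOf(params));
  }

  public static void main(String[] args) {
    Query query = scansPerTable("realm-a", new Filter("alice", null, null, 1_000L, 2_000L));
    System.out.println(query.sql());
    System.out.println(query.params());
  }
}
```

The returned SQL and parameter list could then be bound to a `java.sql.PreparedStatement`. A time-range filter like this is also the kind of predicate a timestamp index would serve.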

I have not thought about roles in this flow at all; perhaps they will be useful. @singhpk234 recommended adding roles, so I added them. I normalized the roles tables from an RDBMS perspective, but I didn't realize that other similar fields are already stored as JSON.

@dimas-b
Contributor

dimas-b commented Feb 4, 2026

@obelix74 : please rebase to fix CI

@dimas-b
Contributor

dimas-b commented Feb 4, 2026

Let's hold final review until #3616 is resolved... Intermediate comments are welcome, of course :)

@obelix74 obelix74 force-pushed the feat-3337-schema-v4 branch 2 times, most recently from 5238123 to 472f6f3 Compare February 5, 2026 02:09
…tion

- Create schema-metrics-v1.sql for H2 and PostgreSQL with independent version tracking
- Add --include-metrics CLI option to BootstrapCommand
- Add openMetricsSchemaResource() to DatabaseType for loading metrics schema
- Update JdbcMetaStoreManagerFactory to optionally load metrics schema during bootstrap

This allows metrics schema to evolve independently from the entity schema.
@obelix74 obelix74 force-pushed the feat-3337-schema-v4 branch from 7771396 to 65e8cb4 Compare February 5, 2026 23:44
@obelix74 obelix74 requested a review from singhpk234 February 5, 2026 23:50
dimas-b previously approved these changes Feb 6, 2026
Contributor

@dimas-b dimas-b left a comment

LGTM. Let's give this PR a couple more days in review for other people to comment if they want.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Feb 6, 2026
Replace junction tables (scan_metrics_report_roles, commit_metrics_report_roles)
with a denormalized principal_role_ids JSON array column in both scan_metrics_report
and commit_metrics_report tables.

This simplifies the schema and reduces the number of tables from 5 to 3.
Anand Kumar Sankaran added 2 commits February 6, 2026 06:33
- Added unit tests for JdbcBootstrapUtils.shouldIncludeMetrics() method:
  - Test when schemaOptions is null (returns false)
  - Test when includeMetrics is true (returns true)
  - Test when includeMetrics is false (returns false)
- Added integration tests in RelationalJdbcBootstrapCommandTest:
  - testBootstrapWithIncludeMetrics: verifies --include-metrics flag works
  - testBootstrapWithoutIncludeMetrics: verifies default behavior

Per PR review comment from singhpk234
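
The three unit-test cases listed above amount to a null-safe, opt-in check. A minimal sketch of that behavior follows, assuming a hypothetical `SchemaOptions` record with a single `includeMetrics` flag; the real `JdbcBootstrapUtils.shouldIncludeMetrics()` signature and options type are not shown in this conversation and may differ.

```java
/** Illustrative only: SchemaOptions here is a hypothetical stand-in for the real options type. */
final class BootstrapMetricsSketch {

  /** Carries only the flag that matters for this example. */
  record SchemaOptions(boolean includeMetrics) {}

  /** Metrics tables are opt-in: missing options or includeMetrics=false means skip them. */
  static boolean shouldIncludeMetrics(SchemaOptions schemaOptions) {
    return schemaOptions != null && schemaOptions.includeMetrics();
  }

  public static void main(String[] args) {
    System.out.println(shouldIncludeMetrics(null));                     // false
    System.out.println(shouldIncludeMetrics(new SchemaOptions(true)));  // true
    System.out.println(shouldIncludeMetrics(new SchemaOptions(false))); // false
  }
}
```

Defaulting to false keeps an existing bootstrap unchanged unless `--include-metrics` is passed.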

The schema-v4 version number fixes have been extracted to a separate
branch (fix-schema-v4-version-number) for an independent PR.

This reverts the version changes back to match main branch.

Per PR review feedback from dimas-b: Polaris supports external IdP
and PDP (e.g. Keycloak and OPA), and the roles stored in metrics
tables may not be aligned with AuthZ decisions.

Removed principal_role_ids column from both scan_metrics_report and
commit_metrics_report tables in H2 and PostgreSQL schemas.
@dimas-b dimas-b changed the title from "(feat: persistence) Add schema-v4.sql for metrics tables (#3337)" to "(feat: persistence) Add schema-metrics-v1.sql for metrics tables (#3337)" Feb 6, 2026
Address PR review comment - remove namespace since tableId uniquely
identifies the table.
report_id TEXT NOT NULL,
realm_id TEXT NOT NULL,
catalog_id BIGINT NOT NULL,
namespace TEXT NOT NULL,
Contributor Author

• Removed namespace TEXT NOT NULL column from both H2 and Postgres schema files

dimas-b previously approved these changes Feb 6, 2026
Contributor

@dimas-b dimas-b left a comment

LGTM 👍

@singhpk234 : WDYT?

3 participants