SKU/molecule phase 2 elasticsearch #1462

zaychenko-sergei · 2025-11-24T19:07:03Z

Description

Closes: #issue

Checklist before requesting a review

ElasticSearch first steps:
- followed the footsteps of the natural language search components
- makefile: ElasticSearch start, stop, and clean actions
- achieved connectivity to a started ElasticSearch cluster
- achieved connectivity to a self-launched child container service
- test GQL endpoint, routing cluster health info for now
- indexer: scaffold, no real indexing yet
- search service API: scaffold, no real implementation yet

Full text search:
- support framework for entity schema registration (no versioning/migrations yet)
- ES: first client calls for index check, registration, document counting
- feeding schemas for accounts and datasets from the corresponding domains (very simplified fields for now)
- the entire process is plugin-style: particular domains register provider components that are accessed by a shared template process

ElasticSearch code structure extended:
- separated low-level operations in `ElasticSearchClient`: deals with engine connection building, sending queries in the right format, interpreting responses, and dealing with API errors
- `ElasticSearchIndexMappings` handles creation of index mappings for the given schema and hashing of its content; this will be the future place to apply complex column properties depending on configs
- `ElasticSearchVersionedEntityIndex` manages indexes for entities and aliases, auto-registers indices, validates schema metadata, automatically detects drifts without version modification, automatically applies breaking or reindexable upgrades
- main repository code stays at a very high level

Basic shape of full ElasticSearch index re-indexing + sketched simplest indexing procedure for Datasets

Indexing owner-id in datasets index (for filters)

ElasticSearch indexing added for Accounts

Indexing creation time for Accounts/Datasets

Indexing dataset documents similarly to natural language search: added schema fields, description, keywords, and attachments

More realistic field roles: hierarchical identifiers (account name, dataset name, alias, schema field), prose (description, attachments), keywords (owner_id, keyword, dataset_kind), with corresponding analyzers and properties for ElasticSearch

Added "Title" field role, which sits between Prose and Identifier; used for account display names for now. Identifier fields get inner-ngrams (3..6) and wider edge-ngrams (2..10).

Account life cycle events update ElasticSearch index:
- massaged events format a bit to satisfy new needs
- new outbox event handler for account search index updates
- reorganized account schema code to encapsulate single-document operations, while the indexer and update handler use its helpers
- issuing bulk insert, update, delete operations in ES for account events

Dummy implementation in e2e tests (until a better solution is found, as containerized ES takes over 20s to start for each command, and that is affected by account/dataset lifecycle events)

Implemented updates to ElasticSearch for dataset-related events: lifecycle, reference update, parent account rename/delete. Fixed account deletion handler in datasets domain: no ReBAC/dangling checks should be executed during system event handling.

Datasets schema: better incremental re-indexing for partial updates

Hotfix: improved detection of invalid intervals in case of breaking changes in the dataset, when the expected tail is ahead of head

First sketch of a search function:
- simplest querying: query_string vs match_all, depending on whether a non-empty query was received
- support specifying a list of indexes vs defaulting to all schemas
- ES: sending search request, decoding response
- trivial GQL endpoint support

Naive pagination support (size/from). Requesting source fields in multiple modes: None, All, Particular, Complex (include+exclude patterns)

Next ElasticSearch steps:
- Search schema constants moved and published by domains, so that GQL can reference those fields.
- Support flexible sorting of search results: N criteria, by field or relevance score, configurable direction.
- Each schema now provides a field that can be used for universal alphabetical sorting ("title" alias)

Support basic search filters (keyword = value, keyword in {values}) and compounds of those (and, or, not). Added convenience macros to specify compound filters and sort specifications.

Search highlights for textual fields: displays the best fragments explaining why a certain document's field matched the query

On-demand "explain" option: outputs the low-level ElasticSearch scoring computation formula
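To make the identifier n-gram ranges mentioned above more concrete, here is a minimal standalone sketch of what inner n-grams (3..6) and edge n-grams (2..10) yield for a single token. The function names `edge_ngrams` and `inner_ngrams` are hypothetical and not part of this PR; in the real setup the tokens are produced by Elasticsearch analyzers, and their emission order may differ from this sketch.

```rust
/// Edge n-grams: prefixes of `token` whose length (in characters)
/// falls within `min..=max`. Illustrative only; assumes `min >= 1`.
fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let upper = max.min(chars.len());
    // The range is empty when the token is shorter than `min`.
    (min..=upper)
        .map(|n| chars[..n].iter().collect::<String>())
        .collect()
}

/// Inner n-grams: every contiguous substring of `token` whose length
/// (in characters) falls within `min..=max`.
fn inner_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let upper = max.min(chars.len());
    let mut grams = Vec::new();
    // `windows(0)` would panic, so the lower bound is clamped to 1.
    for n in min.max(1)..=upper {
        for window in chars.windows(n) {
            grams.push(window.iter().collect::<String>());
        }
    }
    grams
}
```

For the token `kamu`, `edge_ngrams("kamu", 2, 10)` gives `ka`, `kam`, `kamu`, while `inner_ngrams("kamu", 3, 6)` gives `kam`, `amu`, `kamu`. A query fragment such as `amu` can therefore only match through the inner n-grams, whereas short prefixes such as `ka` are covered by the wider edge range.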

…unt, to use for background lazy processes like the on-demand search indexer. Generalized Molecule reindexing template algorithm: start from `KamuBackgroundCatalog`, attach the Molecule org account as subject, initiate a separate transaction, then run indexing on behalf of Molecule's account. Drafted data room entries reindexing. Relaxed compatibility requirements, so that v1 data room datasets don't crash when loaded.

* Renamed existing natural language service stuff
* ES: support unprocessed objects field (stored, but not indexed or searched)
* Supporting boolean fields + added a generic banning feature for ES indices (filter is auto-attached to "read alias"). Fix: search should always be directed to "read alias", never to "writable index".
* Merge corrections
* Corrections in ES client: use single bulk update operation with encoded commands, so that heterogeneous ops are possible in one bulk. Simplified search context: we don't need account for now.
* Lock correction
* Backported ElasticSearch-focused changes from Molecule branch
* Deps correction
* Simplifying renames
* Unified account/dataset schemas to the style in Molecule branch
* Prototyped framework for integration tests with ElasticSearch involved:
  - EsTestContext: main facility, lazily initializes a reusable ES client
  - a test proc-macro hiding the plumbing of the context
  - each test receives a dill::Catalog prefilled with an ES client and an ES repository impl, with a unique randomly generated index prefix
  - a successful test automatically cleans its own indices, while a failing test keeps the indices available for inspection
  - on first ES client initialization, potentially abandoned test indices from previous sessions are discarded automatically
  - wrote the first couple of tests for Accounts indexing
* Correct spelling of "Elasticsearch" brand name
* ElasticSearch test group on CI/CD
* More account indexing tests
* Initial test suite for datasets indexing. Hardening es_client against async waiting issues: a created index must be reachable, an assigned alias must be reachable. Makefile: automated cleaning of abandoned test artifacts from previous experiments. Not doing this in fixtures, because `cargo nextest run` executes every test in a separate process, which creates races around it.
* Common searching harness + aligning dataset use case harness to be more pluggable
* Shared fixture for account use case tests + indexing tests. Accounts indexing: testing predefined indexer
* Tests: predefined datasets indexing
* MT version of predefined datasets indexing test
* MT version of incremental indexing
* Tests: renaming or deleting account affects index of its datasets
* Udeps fixed
* Improvements and tests for detailed dataset content indexing (schema, setInfo, attachments). Not indexing default vocabulary fields in schema, as those do not contribute to search relevance. Indexing tests run with real outbox to maximize realism: forcing sync when necessary between test steps
* Added basic test suite for datasets searching: checking analyzers, filters, stemmers, ...
* Merged useful testability changes from Molecule prototype
* Elasticsearch: ability to set up, connect to, and test with a server using HTTPS/TLS
* udeps fix
* Admin:
  - endpoint to force reset search indices
  - temporarily hidden querying full text GQL endpoint under admin's guard
* minor rename
* Makefile actions to start Qdrant container
* Qdrant parity with ES approaches:
  - dummy implementation
  - lazy init via background catalog
  - passing explicit context for queries
* changelog
* DEVELOPER.md notes on how to use Elasticsearch
* v0.256.0
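As an illustration of the "heterogeneous ops in one bulk" item above, the following is a rough sketch of how index, update, and delete commands could be encoded into a single newline-delimited body for the Elasticsearch `_bulk` endpoint. The `BulkOp` enum and the `action_line` and `encode_bulk_body` helpers are hypothetical names, not the PR's actual client code; the sketch also assumes that documents arrive as pre-serialized JSON objects and that index names and ids need no JSON escaping.

```rust
/// One bulk command. Document bodies are assumed to be pre-serialized
/// JSON objects (hypothetical shape, for illustration only).
enum BulkOp {
    Index { id: String, doc_json: String },
    Update { id: String, partial_doc_json: String },
    Delete { id: String },
}

/// Builds the action metadata line, e.g.
/// {"delete":{"_index":"accounts","_id":"a1"}}
/// Assumes `index` and `id` contain no characters that need escaping.
fn action_line(action: &str, index: &str, id: &str) -> String {
    format!(r#"{{"{action}":{{"_index":"{index}","_id":"{id}"}}}}"#)
}

/// Encodes mixed operations into one NDJSON body for `_bulk`.
fn encode_bulk_body(index: &str, ops: &[BulkOp]) -> String {
    let mut body = String::new();
    for op in ops {
        match op {
            BulkOp::Index { id, doc_json } => {
                // Action line followed by the full document source.
                body.push_str(&action_line("index", index, id));
                body.push('\n');
                body.push_str(doc_json);
                body.push('\n');
            }
            BulkOp::Update { id, partial_doc_json } => {
                // Action line followed by a {"doc": ...} partial update.
                body.push_str(&action_line("update", index, id));
                body.push('\n');
                body.push_str(&format!(r#"{{"doc":{partial_doc_json}}}"#));
                body.push('\n');
            }
            BulkOp::Delete { id } => {
                // Delete has an action line only, no source line.
                body.push_str(&action_line("delete", index, id));
                body.push('\n');
            }
        }
    }
    // Every line, including the last one, ends with a newline,
    // as the bulk API requires.
    body
}
```

The resulting string would be sent with the `application/x-ndjson` content type. Keeping all three operation kinds in one enum is what lets a single request carry inserts, partial updates, and deletions together.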

zaychenko-sergei and others added 30 commits November 24, 2025 19:59

Merge branch 'prototype/elasticsearch' into sku/molecule_phase_2_elasticsearch

dfd9d7e

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

4855a4c

Sketched Molecule service crate + 4 Molecule search index schemas (projects, data room entries, announcements, activity events)

26219f0

ES: support unprocessed objects field (stored, but not indexed or searched)

e8e3456

Merge branch 'prototype/elasticsearch' into sku/molecule_phase_2_elasticsearch

4ec7c06

Schema: using UnprocessedObject for activity body JSON

38d5be4

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

354d072

Merged previous adapter into new sku/molecule/domain|services

492b77c

Automatic reindexing of Molecule projects in ElasticSearch

5ace8d3

Flaky hell on flow e2e fixed

7bcd434

Reacting to MoleculeProjectMessageCreated, adding new ElasticSearch document

3ed11e0

Merge branch 'master' into prototype/elasticsearch

9e1e6b6

Supporting boolean fields + added a generic banning feature for ES indices (filter is auto-attached to "read alias"). Fix: search should always be directed to "read alias", never to "writable index".

6d07597

Merge branch 'prototype/elasticsearch' into sku/molecule_phase_2_elasticsearch

133fe80

Added Molecule project banning reaction in search: handling outbox "project disabled" and "project reenabled" messages and setting "is_banned" attribute

940b126

Merge branch 'master' into prototype/elasticsearch

40bc2f3

Merge corrections

8ced425

Merge branch 'prototype/elasticsearch' into sku/molecule_phase_2_elasticsearch

a31aa93

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

5c5ada1

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

1a1d3b7

Forgotten fields in data rooms schema

58f32c2

Bulk-based indexing for projects and data room entries

5f5ba14

Indexing speedup: loading entries from N projects in parallel

a5cf7f1

Corrections in ES client: use single bulk update operation with encoded commands, so that heterogeneous ops are possible in one bulk. Simplified search context: we don't need account for now.

9a2408f

Merge branch 'prototype/elasticsearch' into sku/molecule_phase_2_elasticsearch

fc0db4d

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

b416539

Incremental indexing of data rooms.

daed4f7

Merge branch 'master' into prototype/elasticsearch

1f07418

zaychenko-sergei and others added 30 commits December 26, 2025 17:04

Tests: predefined datasets indexing

a430040

MT version of predefined datasets indexing test

42de0dd

MT version of incremental indexing

5e18415

Tests: renaming or deleting account affects index of its datasets

a23a01d

Fix a few failing datasets

3d8b999

Udeps fixed

1c49458

Added basic test suite for datasets searching: checking analyzers, filters, stemmers, ...

16713f1

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

9263d1a

Merge branch 'prototype/elasticsearch' into sku/molecule_phase_2_elasticsearch

5cdae53

Elasticsearch basically plugged into v2 Molecule tests, runs indexing, writes data, but is not yet queried

6ea5743

Stabilized Molecule Elasticsearch filter/search tests.

60f257d

Problem with E2E test persists for now

Temp fix: e2e + elasticsearch

6bb90fc

Stabilized e2e vs integration test vs full manual indexing test with regards to catalogs

d2f90ae

Removed Postgres + Elasticsearch e2e combo

ba7872a

Upgrade to dill 0.15.0 (#1538)

c85d271

* Upgrade to dill 0.15.0
* Fix correct mock expectations in tests (#1540)

Co-authored-by: Roman Boiko <roman.bv20@gmail.com>

v0.255.1 + minor deps

c89dd13

Elasticsearch: ability to set up, connect to, and test with a server using HTTPS/TLS

307a7f3

GQL: Allow to specify schema format in metadata event

ce9775a

Fix dependencies lint

3e1cc11

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

25bd15f

Elasticsearch support for access level rules

ae410b0

test_molecule_v2_activity_change_by_for_remove: left only SQLite version, our CI is not ready for elasticsearch+postgres combo

9c57fe7

Merge branch 'master' into sku/molecule

0361b52

Merge corrections

7af467a

Merge branch 'sku/molecule' into sku/molecule_phase_2

0a60989

Merge branch 'sku/molecule_phase_2' into sku/molecule_phase_2_elasticsearch

15305dc

Search indexer config:

78c1b18

- affects both Qdrant and Elasticsearch
- support "clear_on_start" and dataset indexing filters in Elasticsearch
- support disabling incremental search index updates

Feature flag: enable/disable Molecule APIs to read from projection in Elasticsearch

6b04d27

3 participants
