@zaychenko-sergei

Description

Closes: #issue

Checklist before requesting a review

zaychenko-sergei and others added 30 commits November 24, 2025 19:59
ElasticSearch first steps:
 - followed footsteps of natural language search components
 - makefile: ElasticSearch start, stop, and clean actions
 - achieved connectivity to started ElasticSearch cluster
 - achieved connectivity to self-launched child container service
 - test GQL endpoint, routing cluster health info for now
 - indexer: scaffold, no real indexing yet
 - search service API: scaffold, no real implementation yet

Full text search:
 - support framework for entity schemas registration (no versioning/migrations yet)
 - ES: first client calls for index check, registration, documents counting
 - feeding schemas for accounts and datasets from corresponding domains (very simplified fields now)
 - the entire process is plugin-style: particular domains register provider components that are accessed by a shared template process

ElasticSearch code structure extended:
- separated low level operations in `ElasticSearchClient`: deals with engine connection building, sending queries in the right format, interpreting responses, and dealing with API errors
- `ElasticSearchIndexMappings` handles creation of index mappings for the given schema and hashes its content; this will be the future place to apply complex column properties depending on configs
- `ElasticSearchVersionedEntityIndex` manages indexes for entities and aliases, auto-registers indices, validates schema metadata, automatically detects drifts without version modification, automatically applies breaking or reindexable upgrades
- main repository code stays at very high level
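
To illustrate the drift detection described above, here is a minimal sketch. It assumes the index stores a schema version and a hash of its mappings as metadata; `StoredMeta`, `IndexAction`, and `decide_action` are hypothetical names, not the PR's actual types, and the real upgrade rules (breaking vs reindexable) are not modeled.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical metadata kept alongside an existing index.
struct StoredMeta {
    version: u32,
    mappings_hash: u64,
}

/// Hypothetical outcomes of comparing stored metadata with the current schema.
#[derive(Debug, PartialEq, Eq)]
enum IndexAction {
    Create,                          // no index registered yet
    UpToDate,                        // same version, same mappings
    DriftDetected,                   // mappings changed, version was not bumped
    Upgrade { from: u32, to: u32 },  // version bumped, upgrade needed
    StoredIsNewer,                   // index was written by a newer schema
}

/// Hashes the serialized mappings. DefaultHasher is used only to keep the
/// sketch std-only; its output is not guaranteed stable across Rust releases,
/// so persisted metadata would need a stable hash instead.
fn mappings_hash(mappings_json: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    mappings_json.hash(&mut hasher);
    hasher.finish()
}

fn decide_action(stored: Option<&StoredMeta>, version: u32, hash: u64) -> IndexAction {
    match stored {
        None => IndexAction::Create,
        Some(m) if m.version == version && m.mappings_hash == hash => IndexAction::UpToDate,
        Some(m) if m.version == version => IndexAction::DriftDetected,
        Some(m) if m.version < version => IndexAction::Upgrade { from: m.version, to: version },
        Some(_) => IndexAction::StoredIsNewer,
    }
}
```

The key case is `DriftDetected`: the mappings hash differs while the version is unchanged, which is the situation the commit calls "drifts without version modification".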

Basic shape of full ElasticSearch index re-indexing + sketched simplest indexing procedure for Datasets

Indexing owner-id in datasets index (for filters)

ElasticSearch indexing added for Accounts

Indexing creation time for Accounts/Datasets

Indexing dataset documents similarly to natural language search: added schema fields, description, keywords, and attachments

More realistic field roles: hierarchical identifiers (account name, dataset name, alias, schema field), prose (description, attachments), keywords (owner_id, keyword, dataset_kind) - with corresponding analyzers and properties for ElasticSearch

Added "Title" field role, which sits between Prose and Identifier; used for accounts' display names for now.
Identifier fields get inner-ngrams (3..6) and wider edge-ngrams (2..10).
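
As a rough illustration of what those n-gram ranges produce, the sketch below generates the grams in plain Rust. It assumes "(3..6)" and "(2..10)" are inclusive minimum and maximum gram lengths, as with Elasticsearch's `min_gram`/`max_gram` settings; the function names are hypothetical, and the real tokenization happens inside Elasticsearch analyzers, not in application code.

```rust
/// Edge n-grams: prefixes of `token` whose length (in characters) is within
/// `min..=max`. Lengths beyond the token length are skipped.
fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    (min.max(1)..=max.min(chars.len()))
        .map(|n| chars[..n].iter().collect())
        .collect()
}

/// Inner n-grams: every contiguous substring of `token` whose length
/// (in characters) is within `min..=max`, shortest grams first.
fn inner_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let mut grams = Vec::new();
    for n in min.max(1)..=max.min(chars.len()) {
        for window in chars.windows(n) {
            grams.push(window.iter().collect());
        }
    }
    grams
}
```

For the token `kamu`, `edge_ngrams("kamu", 2, 10)` yields `ka`, `kam`, `kamu`, and `inner_ngrams("kamu", 3, 6)` yields `kam`, `amu`, `kamu`. The wider edge range lets short prefixes match, while inner grams allow matches in the middle of an identifier.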

Account life cycle events update ElasticSearch index:
 - massaged events format a bit to satisfy new needs
 - new outbox event handler for account search index updates
 - reorganized account schema code to encapsulate single-document operations, while indexer and update handler use its helpers
 - issuing bulk insert, update, delete operations in ES for account events

Dummy implementation in e2e tests (until a better solution is found, as containerized ES takes over 20s to start per command, and that's affected by account/dataset lifecycle events)

Implemented updates to ElasticSearch for dataset-related events: lifecycle, reference update, parent account rename/delete.
Fixed account deletion handler in datasets domain, no ReBAC/dangling checks should be executed during system event handling.

Datasets schema: better incremental re-indexings for partial updates

Hotfix: improved detection of invalid intervals in case of breaking changes in the dataset, when expected tail is ahead of head

First sketching of a search function:
 - simplest querying: query_string vs match_all, depending on whether a non-empty query was received
 - support specifying list of indexes vs defaulting to all schemas
 - ES: sending search request, decoding response
 - trivial GQL endpoint support

Naive pagination support (size/from).
Requesting source fields in multiple modes: None, All, Particular, Complex (include+exclude patterns)
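
The request shape described in the last two commits could look roughly like the sketch below. It builds the search body by hand to stay std-only. `SourceSpec`, `search_body`, and the field names in the usage note are hypothetical; treating a whitespace-only query as empty is an assumption. The JSON keys (`query_string`, `match_all`, `from`, `size`, `_source`, `includes`, `excludes`) are standard Elasticsearch search API fields.

```rust
/// Hypothetical model of the four source-field modes.
enum SourceSpec {
    None,
    All,
    Particular(Vec<String>),
    Complex { include: Vec<String>, exclude: Vec<String> },
}

/// Minimal JSON string escaping, enough for this sketch.
fn json_str(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn json_array(items: &[String]) -> String {
    let parts: Vec<String> = items.iter().map(|s| json_str(s)).collect();
    format!("[{}]", parts.join(","))
}

/// Renders the value of the `_source` key for each mode.
fn source_json(spec: &SourceSpec) -> String {
    match spec {
        SourceSpec::None => "false".to_string(),
        SourceSpec::All => "true".to_string(),
        SourceSpec::Particular(fields) => json_array(fields),
        SourceSpec::Complex { include, exclude } => format!(
            r#"{{"includes":{},"excludes":{}}}"#,
            json_array(include),
            json_array(exclude)
        ),
    }
}

/// Picks `query_string` for a non-empty query, `match_all` otherwise,
/// and adds naive from/size pagination.
fn search_body(query: Option<&str>, from: usize, size: usize, source: &SourceSpec) -> String {
    let query_json = match query.map(str::trim).filter(|q| !q.is_empty()) {
        Some(q) => format!(r#"{{"query_string":{{"query":{}}}}}"#, json_str(q)),
        None => r#"{"match_all":{}}"#.to_string(),
    };
    format!(
        r#"{{"query":{},"from":{},"size":{},"_source":{}}}"#,
        query_json,
        from,
        size,
        source_json(source)
    )
}
```

For example, `search_body(Some("kamu"), 20, 10, &SourceSpec::Particular(vec!["dataset_name".to_string()]))` requests the third page of ten hits and returns only the (illustrative) `dataset_name` field of each document.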

Next ElasticSearch steps:
- Search schemas constants moved and published by domains, so that GQL can reference those fields.
- Support flexible sorting of search results: N criteria, by field or relevance score, configurable direction.
- Each schema now provides a field that can be used for universal alphabetical sorting ("title" alias)

Support basic search filters (keyword = value, keyword in {values}) and compound from those (and, or, not).

Added convenience macros to specify compound filters and for sort specifications.
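
A minimal sketch of how such compound filters and convenience macros might be shaped is shown below. Everything here (`Filter`, `all_of!`, `any_of!`, the helper functions) is hypothetical and not the PR's actual API. The in-memory `filter_matches` only illustrates the intended semantics; in the real system Elasticsearch evaluates the filters.

```rust
use std::collections::HashMap;

/// Hypothetical filter model: two basic keyword filters plus and/or/not.
#[derive(Debug, Clone, PartialEq)]
enum Filter {
    Eq(String, String),
    In(String, Vec<String>),
    And(Vec<Filter>),
    Or(Vec<Filter>),
    Not(Box<Filter>),
}

/// Convenience macros for compound filters (accept a trailing comma).
macro_rules! all_of {
    ($($f:expr),+ $(,)?) => { Filter::And(vec![$($f),+]) };
}
macro_rules! any_of {
    ($($f:expr),+ $(,)?) => { Filter::Or(vec![$($f),+]) };
}

fn field_eq(field: &str, value: &str) -> Filter {
    Filter::Eq(field.to_string(), value.to_string())
}

fn field_in(field: &str, values: &[&str]) -> Filter {
    Filter::In(field.to_string(), values.iter().map(|v| v.to_string()).collect())
}

fn negate(filter: Filter) -> Filter {
    Filter::Not(Box::new(filter))
}

/// Evaluates a filter against a flat keyword document.
/// A missing field never satisfies `Eq` or `In`.
fn filter_matches(filter: &Filter, doc: &HashMap<String, String>) -> bool {
    match filter {
        Filter::Eq(field, value) => doc.get(field) == Some(value),
        Filter::In(field, values) => doc.get(field).map_or(false, |v| values.contains(v)),
        Filter::And(filters) => filters.iter().all(|f| filter_matches(f, doc)),
        Filter::Or(filters) => filters.iter().any(|f| filter_matches(f, doc)),
        Filter::Not(inner) => !filter_matches(inner, doc),
    }
}
```

With these helpers, a filter such as "owned by `acc-1`, of one of two kinds, and not owned by `acc-2`" reads as `all_of![field_eq("owner_id", "acc-1"), field_in("dataset_kind", &["root", "derivative"]), negate(field_eq("owner_id", "acc-2"))]`; the field values are illustrative.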

Search highlights for textual fields: displays the best fragments explaining why a document's field matched the query

On-demand "explain" option: outputs low-level ElasticSearch scoring computation formula
…ojects, data room entries, announcements, activity events)
…dices (filter is auto-attached to "read alias").

Fix: search should always be directed to "read alias", never to "writable index".
…roject disabled" and "project reenabled" messages and setting "is_banned" attribute
…unt, to use for background lazy processes like on demand search indexer.

Generalized Molecule reindexing template algorithm: start from `KamuBackgroundCatalog`, attach Molecule org account as subject, initiate separate transaction, then run indexing on Molecule's account behalf.

Drafted data room entries reindexing. Relaxed compatibility requirements, so that v1 data room datasets don't crash when loaded.
…ed commands, so that heterogeneous ops are possible in one bulk.

Simplified search context: we don't need account for now.
zaychenko-sergei and others added 30 commits December 26, 2025 17:04
… setInfo, attachments).

Not indexing default vocabulary fields in schema, as those do not contribute to search relevance.

Indexing tests run with real outbox to maximize realism: forcing sync when necessary between test steps
Problem with E2E test persists for now
* Upgrade to dill 0.15.0

* Fix correct mock expectations in tests (#1540)

---------

Co-authored-by: Roman Boiko <roman.bv20@gmail.com>
* Renamed existing natural language service stuff

* ES: support unprocessed objects field (stored, but not indexed or searched)

* Supporting boolean fields + added a generic banning feature for ES indices (filter is auto-attached to "read alias").
Fix: search should always be directed to "read alias", never to "writable index".

* Merge corrections

* Corrections in ES client: use single bulk update operation with encoded commands, so that heterogeneous ops are possible in one bulk.
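
For reference, a mixed bulk request of this kind can be sketched as below. The body follows the newline-delimited format of the Elasticsearch `_bulk` API (action line, then an optional payload line); `BulkOp` and `bulk_body` are hypothetical names, and documents are assumed to arrive already serialized as single-line JSON.

```rust
/// Hypothetical encoded commands that can be mixed within one bulk request.
enum BulkOp {
    Index { id: String, doc_json: String },
    Update { id: String, partial_doc_json: String },
    Delete { id: String },
}

/// Minimal JSON string escaping, enough for document ids in this sketch.
fn json_str(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Renders the NDJSON body for `POST /<index>/_bulk`.
/// Every line, including the last one, ends with a newline.
fn bulk_body(ops: &[BulkOp]) -> String {
    let mut body = String::new();
    for op in ops {
        match op {
            BulkOp::Index { id, doc_json } => {
                body.push_str(&format!(r#"{{"index":{{"_id":{}}}}}"#, json_str(id)));
                body.push('\n');
                body.push_str(doc_json);
                body.push('\n');
            }
            BulkOp::Update { id, partial_doc_json } => {
                body.push_str(&format!(r#"{{"update":{{"_id":{}}}}}"#, json_str(id)));
                body.push('\n');
                // Partial updates are wrapped in a "doc" object.
                body.push_str(&format!(r#"{{"doc":{}}}"#, partial_doc_json));
                body.push('\n');
            }
            BulkOp::Delete { id } => {
                // Delete actions carry no payload line.
                body.push_str(&format!(r#"{{"delete":{{"_id":{}}}}}"#, json_str(id)));
                body.push('\n');
            }
        }
    }
    body
}
```

Because each action line names its own operation, inserts, updates, and deletes can share one round trip, which is the point of the correction above.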

Simplified search context: we don't need account for now.

* Lock correction

* Backported ElasticSearch-focused changes from Molecule branch

* Deps correction

* Simplifying renames

* Unified account/dataset schemas to the style in Molecule branch

* Prototyped framework for integration tests with ElasticSearch involved:
 - EsTestContext: main facility, lazily initializes reusable ES client
 - a test proc-macro hiding the plumbing of the context
 - each test receives a dill::Catalog prefilled with ES client, ES repository impl, with unique randomly generated index prefix
 - a successful test automatically cleans its own indices, while a failing test keeps the indices available for inspection
 - on first ES client initialization, the potentially abandoned test indices from previous sessions are discarded automatically
 - written first couple tests for Accounts indexing
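
The "clean on success, keep on failure" behavior can be sketched with a drop guard, as below. This is only an illustration: `TestIndexGuard` and `unique_index_prefix` are hypothetical names, the prefix is derived from process id, time, and a counter instead of a random generator (to stay std-only), and the `cleaned` log stands in for the actual index deletion calls.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::time::{SystemTime, UNIX_EPOCH};

static PREFIX_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Builds a prefix that is unique within this process and very unlikely
/// to collide with other processes.
fn unique_index_prefix() -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let seq = PREFIX_COUNTER.fetch_add(1, Ordering::Relaxed);
    format!("test-{}-{:x}-{}", std::process::id(), nanos, seq)
}

/// Cleans up the test's indices when dropped, unless the test is panicking.
struct TestIndexGuard {
    prefix: String,
    /// Stand-in for the ES client: records prefixes whose indices were removed.
    cleaned: Arc<Mutex<Vec<String>>>,
}

impl Drop for TestIndexGuard {
    fn drop(&mut self) {
        if std::thread::panicking() {
            // Failing test: leave the indices in place for inspection.
            return;
        }
        self.cleaned.lock().unwrap().push(self.prefix.clone());
    }
}
```

A test would create the guard first and let it fall out of scope at the end; an assertion failure unwinds through the guard with `std::thread::panicking()` returning true, so nothing is deleted.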

* Correct spelling of "Elasticsearch" brand name

* ElasticSearch test group on CI/CD

* More account indexing tests

* Initial test suite for datasets indexing.

Hardening es_client against async waiting issues: created index must be reachable, assigned alias must be reachable.

Makefile: automated cleaning of abandoned test artifacts from previous experiments. Not doing this in fixtures, as `cargo nextest run` creates races around it executing every test in separate process.

* Common searching harness + aligning dataset use case harness to be more pluggable

* Shared fixture for account use case tests + indexing tests.
Accounts indexing: testing predefined indexer

* Tests: predefined datasets indexing

* MT version of predefined datasets indexing test

* MT version of incremental indexing

* Tests: renaming or deleting account affects index of its datasets

* Udeps fixed

* Improvements and tests for detailed dataset content indexing (schema, setInfo, attachments).

Not indexing default vocabulary fields in schema, as those do not contribute to search relevance.

Indexing tests run with real outbox to maximize realism: forcing sync when necessary between test steps

* Added basic test suite for datasets searching: checking analyzers, filters, stemmers, ...

* Merged useful testability changes from Molecule prototype

* Elasticsearch: ability to set up, connect to, and test with a server using HTTPS/TLS

* udeps fix

* Admin:
 - endpoint to force reset search indices
 - temporarily hidden querying full text GQL endpoint under admin's guard

* minor rename

* Makefile actions to start Qdrant container

* Qdrant parity with ES approaches:
 - dummy implementation
 - lazy init via background catalog
 - passing explicit context for queries

* changelog

* DEVELOPER.md notes on how to use Elasticsearch

* v0.256.0
…ion, our CI is not ready for elasticsearch+postgres combo
- affects both Qdrant and Elasticsearch
- support "clear_on_start" and dataset indexing filters in Elasticsearch
- support disabling incremental search index updates

3 participants