Parallel implementation of the (batch-based) bind join algorithm #499

Merged: hartig merged 8 commits into main from ParallelBindJoin on Feb 12, 2026

hartig (Member) commented Feb 10, 2026

Our current implementation performs all bind-join requests sequentially. This PR introduces an implementation that can perform them all in parallel.

More specifically, the new implementation issues the bind-join requests without blocking, handling the processing of their responses in parallel (in the threads that the federation access manager uses to perform the requests).

The algorithm works as follows. For every sequence of solution mappings from the input, the algorithm splits the sequence into batches, each of which is then used for a separate bind-join request. A batch itself consists of versions of the input solution mappings that are already restricted to the join variables (and that contain no blank nodes, see below), and it is associated with the sub-multiset of input solution mappings that it covers. Hence, while the number of restricted solution mappings per batch is fixed (see the batchSize argument of the constructor), the sub-multiset of input solution mappings associated with a batch may be larger than the batch size, because several input solution mappings may have the same restriction to the join variables.
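The relationship between a batch and the input solution mappings it covers can be sketched as follows. This is only an illustration, not the code of this PR: solution mappings are modeled as plain `Map<String, String>` objects (variable name to term), and the names `BatchingSketch`, `Batch`, `restrict`, and `split` are hypothetical. The sketch also ignores the blank-node handling and the carry-over of the last batch, both described further down.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; a solution mapping is modeled as Map<variable, term>.
final class BatchingSketch {

    // One batch: the distinct restricted mappings (keys) and, for each of them,
    // the input solution mappings that it covers.
    static final class Batch {
        final Map<Map<String, String>, List<Map<String, String>>> covered = new LinkedHashMap<>();

        // Number of distinct restricted mappings; this is what batchSize bounds.
        int size() { return covered.size(); }

        // Number of input mappings covered; may exceed size().
        int coveredInputCount() {
            return covered.values().stream().mapToInt(List::size).sum();
        }
    }

    // Restricts a solution mapping to the join variables.
    static Map<String, String> restrict(Map<String, String> sm, Set<String> joinVars) {
        Map<String, String> restricted = new TreeMap<>();
        for (String v : joinVars) {
            if (sm.containsKey(v)) restricted.put(v, sm.get(v));
        }
        return restricted;
    }

    // Splits the input into batches of at most batchSize distinct restricted mappings.
    static List<Batch> split(List<Map<String, String>> input, Set<String> joinVars, int batchSize) {
        List<Batch> batches = new ArrayList<>();
        Batch current = new Batch();
        for (Map<String, String> sm : input) {
            Map<String, String> key = restrict(sm, joinVars);
            // Start a new batch only if this key is new and the current batch is full.
            if (!current.covered.containsKey(key) && current.size() == batchSize) {
                batches.add(current);
                current = new Batch();
            }
            current.covered.computeIfAbsent(key, k -> new ArrayList<>()).add(sm);
        }
        if (current.size() > 0) batches.add(current);
        return batches;
    }
}
```

With a batch size of 2 and `?x` as the only join variable, three input mappings that bind `?x` to `a`, `a`, and `b` end up in a single batch of size 2 that covers three input mappings.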

After splitting the current sequence of input solution mappings into batches, the last batch may not be full, in which case it is kept and will be populated further once the next sequence of input solution mappings is passed to the operator. The full batches are used to create bind-join requests, one per batch. The response to such a request is the subset of the solutions for the query/pattern of this operator that are join partners for at least one of the solutions that were used for creating the request.
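The carry-over of a partially filled last batch might look roughly like the sketch below. The class `CarryOverBatcher` and its methods are hypothetical names, and the final `flush` step (sending whatever is left once the input is exhausted) is an assumption of this sketch rather than something stated above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: accumulates items across input sequences and
// hands out only the batches that have become full.
final class CarryOverBatcher<T> {
    private final int batchSize;
    private List<T> pending = new ArrayList<>();   // the batch that is not yet full

    CarryOverBatcher(int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be positive");
        this.batchSize = batchSize;
    }

    // Consumes one input sequence and returns the batches that became full.
    // A non-full remainder stays in 'pending' until the next call.
    List<List<T>> add(List<T> sequence) {
        List<List<T>> fullBatches = new ArrayList<>();
        for (T item : sequence) {
            pending.add(item);
            if (pending.size() == batchSize) {
                fullBatches.add(pending);
                pending = new ArrayList<>();
            }
        }
        return fullBatches;
    }

    // Assumed final step: returns the remaining (possibly empty) partial batch.
    List<T> flush() {
        List<T> rest = pending;
        pending = new ArrayList<>();
        return rest;
    }
}
```

Each list returned by `add` would be turned into one bind-join request; the list returned by `flush` would become the last request if it is non-empty.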

Each of the requests is issued using the asynchronous functionality of the federation access manager, which results in a CompletableFuture. The algorithm connects this future to an internal response processor to process the response once it arrives (joining the solution mappings from the response with the solution mappings covered by the corresponding batch). All these futures are collected such that the algorithm can wait for their completion after the child operator has stopped producing input for this operator.
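The pattern of issuing requests without blocking, attaching a response processor to each future, and waiting for all futures at the end can be sketched with the standard `CompletableFuture` API. Everything else here is an assumption for illustration: the federation access manager is stood in for by a plain `Function` that returns a future, batches and responses are lists of strings, and the "join" is reduced to an equality check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

// Hypothetical sketch of the non-blocking request handling.
final class AsyncRequestsSketch {
    // Stand-in for the asynchronous request method of the access manager.
    private final Function<List<String>, CompletableFuture<List<String>>> issueRequest;
    private final List<CompletableFuture<Void>> futures = new ArrayList<>();
    // Response processing runs in the request threads, so the output must be thread-safe.
    private final ConcurrentLinkedQueue<String> output = new ConcurrentLinkedQueue<>();

    AsyncRequestsSketch(Function<List<String>, CompletableFuture<List<String>>> issueRequest) {
        this.issueRequest = issueRequest;
    }

    // Issues the request for one batch and returns immediately.
    void submit(List<String> batch) {
        CompletableFuture<Void> done =
            issueRequest.apply(batch).thenAccept(response -> process(batch, response));
        futures.add(done);
    }

    // Simplified response processor: keeps response items that match a batch item.
    private void process(List<String> batch, List<String> response) {
        for (String r : response) {
            if (batch.contains(r)) output.add(r);
        }
    }

    // Called after the child operator has stopped producing input.
    List<String> awaitAll() {
        CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).join();
        return new ArrayList<>(output);
    }
}
```

Because the processors may complete in any order, the order of the collected output is not deterministic in this sketch.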

This implementation is also capable of separating out each input solution mapping that assigns a blank node to any of the join variables. Then, such solution mappings are not even considered when creating the requests because they cannot have any join partners in the results obtained from the federation member. Of course, in case the algorithm is used with outer-join semantics, these solution mappings are still returned to the output (without joining them with anything).
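A minimal sketch of this separation step, under simplifying assumptions: solution mappings are `Map<String, String>` objects, blank nodes are recognized by the `_:` prefix used in Turtle and N-Triples, and the names `BlankNodeFilterSketch`, `bindsBlankNode`, and `separate` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; terms are strings and blank nodes start with "_:".
final class BlankNodeFilterSketch {

    static boolean isBlankNode(String term) {
        return term != null && term.startsWith("_:");
    }

    // True if the mapping assigns a blank node to at least one join variable.
    static boolean bindsBlankNode(Map<String, String> sm, Set<String> joinVars) {
        for (String v : joinVars) {
            if (isBlankNode(sm.get(v))) return true;
        }
        return false;
    }

    // Returns the mappings that may be used for requests. Mappings with a blank
    // node on a join variable are never sent; with outer-join semantics they are
    // passed to the output unchanged, otherwise they are dropped.
    static List<Map<String, String>> separate(List<Map<String, String>> input,
                                              Set<String> joinVars,
                                              boolean outerJoin,
                                              List<Map<String, String>> output) {
        List<Map<String, String>> forRequests = new ArrayList<>();
        for (Map<String, String> sm : input) {
            if (!bindsBlankNode(sm, joinVars)) {
                forRequests.add(sm);
            } else if (outerJoin) {
                output.add(sm);
            }
        }
        return forRequests;
    }
}
```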

Another feature of this implementation is that it switches into a full-retrieval mode as soon as there is an input solution mapping that binds none of the join variables (which may happen only in cases in which none of the join variables is a certain variable). Such an input solution mapping is compatible with (and, thus, can be joined with) every solution mapping that the federation member has for the query/pattern of this bind-join operator. Therefore, when switching into full-retrieval mode, this implementation performs a single request to retrieve the complete set of these solution mappings and then uses this set to find join partners for the current and all future batches of input solution mappings. With the complete set available locally, there is no need to issue further bind-join requests.
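The mode switch can be sketched as a small state machine. This is a simplified illustration with hypothetical names (`FullRetrievalSketch`, `consume`, `fetchAll`): it models solution mappings as `Map<String, String>` objects, represents the full-retrieval request as a `Supplier`, and leaves out the handling of a batch that is still pending at the moment of the switch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of switching into full-retrieval mode.
final class FullRetrievalSketch {
    private final Set<String> joinVars;
    // Stand-in for the request that retrieves all solutions of the query/pattern.
    private final Supplier<List<Map<String, String>>> fetchAll;
    private List<Map<String, String>> fullResult = null;   // non-null once switched

    FullRetrievalSketch(Set<String> joinVars, Supplier<List<Map<String, String>>> fetchAll) {
        this.joinVars = joinVars;
        this.fetchAll = fetchAll;
    }

    boolean inFullRetrievalMode() { return fullResult != null; }

    // Two mappings are compatible if they agree on every variable bound in both.
    static boolean compatible(Map<String, String> a, Map<String, String> b) {
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) return false;
        }
        return true;
    }

    // Returns false if the mapping should go into a bind-join batch as usual,
    // true if it was joined locally against the fully retrieved result.
    boolean consume(Map<String, String> sm, List<Map<String, String>> output) {
        if (fullResult == null && Collections.disjoint(sm.keySet(), joinVars)) {
            fullResult = fetchAll.get();   // the switch happens once
        }
        if (fullResult == null) return false;
        for (Map<String, String> r : fullResult) {
            if (compatible(sm, r)) {
                Map<String, String> merged = new HashMap<>(r);
                merged.putAll(sm);
                output.add(merged);
            }
        }
        return true;
    }
}
```

Once `inFullRetrievalMode()` returns true, every later input mapping is joined locally, including mappings that do bind join variables.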

hartig (Member, Author) commented Feb 10, 2026

@AdrianaConcha there are still a few things to be done here, but can you please give it an initial try in your experiment setup? Just for a handful of queries first, only to see whether it works (it should ;) and whether it has an effect. I tried with one query on my machine and saw the execution time drop to about half, for a query in which the bind-join had to process five batches.

To enable the parallel version, edit the config file. There is now a new entry for PhysicalOpParallelBindJoinWithVALUES. That entry needs to be uncommented, and the other bind-join related entries before it need to be commented out.

hartig marked this pull request as draft February 10, 2026 11:57
hartig marked this pull request as ready for review February 12, 2026 21:30
hartig changed the title from "WIP: Parallel implementation of the (batch-based) bind join algorithm" to "Parallel implementation of the (batch-based) bind join algorithm" Feb 12, 2026
hartig merged commit 391350e into main Feb 12, 2026
1 check passed
hartig deleted the ParallelBindJoin branch February 12, 2026 21:32