Parallel implementation of the (batch-based) bind join algorithm #499
Merged
…VALUES-based variation so far
@AdrianaConcha there are still a few things to be done here, but can you please give it an initial try in your experiment setup? Just for a handful of queries first, only to see whether it works (it should ;) and whether it has an effect. I tried with one query on my machine and saw the execution time drop to about half, for a query in which the bind-join had to process five batches. To enable the parallel version, edit the config file. There is a new entry now for |
… to PhysicalOpRegistry, and removes almost all the functions from PhysicalPlanFactory that create plans with a specific type of root operator
…ts code directly into PhysicalOpBindJoinSPARQL)
Our current implementation performs all bind-join requests sequentially. This PR introduces an implementation that can perform them all in parallel.
More specifically, the new implementation issues the bind-join requests without blocking, handling the processing of their responses in parallel (in the threads that the federation access manager uses to perform the requests).
The algorithm works as follows: For every sequence of solution mappings from the input, the algorithm splits this sequence into batches where each such batch will then be used for a separate bind-join request. Each such batch is associated with a sub-multiset of the input solution mappings that are covered by the batch, whereas the batch itself consists of versions of these input solution mappings that are already restricted to the join variables (and that contain no blank nodes, see below). Hence, while the number of such already-restricted solution mappings per batch is fixed (see the `batchSize` argument of the constructor), the size of the sub-multiset of input solution mappings associated with each batch may be greater than the batch size.
After splitting the current sequence of input solution mappings into batches, the last batch may not be full, in which case it is kept and will be populated further once the next sequence of input solution mappings is passed to the operator. The full batches are used to create bind-join requests, one per batch. The response to such a request is the subset of the solutions for the query/pattern of this operator that are join partners for at least one of the solutions that were used for creating the request.
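The batching step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of this PR: the class `BatchingSketch`, its methods, and the representation of solution mappings as `Map<String, String>` are all hypothetical. As a simplification, a batch is handed out as soon as it holds `batchSize` distinct restricted mappings, so a later duplicate starts a new batch.

```java
import java.util.*;

// Hypothetical sketch; none of these names are taken from the PR.
final class BatchingSketch {
    private final Set<String> joinVars;
    private final int batchSize;
    // Current (possibly incomplete) batch: each restricted solution mapping
    // points to the input solution mappings that it covers.
    private Map<Map<String, String>, List<Map<String, String>>> current = new LinkedHashMap<>();

    BatchingSketch(Set<String> joinVars, int batchSize) {
        this.joinVars = joinVars;
        this.batchSize = batchSize;
    }

    // Consumes one sequence of input solution mappings and returns the batches
    // that became full; an incomplete last batch is kept for the next call.
    List<Map<Map<String, String>, List<Map<String, String>>>> add(List<Map<String, String>> input) {
        List<Map<Map<String, String>, List<Map<String, String>>>> full = new ArrayList<>();
        for (Map<String, String> sm : input) {
            // Restrict the input solution mapping to the join variables.
            Map<String, String> restricted = new TreeMap<>(sm);
            restricted.keySet().retainAll(joinVars);
            current.computeIfAbsent(restricted, k -> new ArrayList<>()).add(sm);
            // The batch size counts distinct restricted mappings, not inputs.
            if (current.size() == batchSize) {
                full.add(current);
                current = new LinkedHashMap<>();
            }
        }
        return full;
    }

    int pendingSize() { return current.size(); }

    // Returns {full batches, restricted mappings in the first batch,
    //          input mappings covered by the first batch, pending size}.
    static int[] demo() {
        BatchingSketch s = new BatchingSketch(Set.of("x"), 2);
        List<Map<Map<String, String>, List<Map<String, String>>>> full = s.add(List.of(
                Map.of("x", "a", "y", "1"), Map.of("x", "a", "y", "2"),
                Map.of("x", "b", "y", "3"), Map.of("x", "c", "y", "4")));
        int covered = 0;
        for (List<Map<String, String>> inputs : full.get(0).values()) covered += inputs.size();
        return new int[] { full.size(), full.get(0).size(), covered, s.pendingSize() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

In the demo, the first batch holds two restricted mappings but covers three input solution mappings, which illustrates why the associated sub-multiset can be larger than the batch size.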
Each of the requests is issued using the asynchronous functionality of the federation access manager, which results in a `CompletableFuture`. The algorithm connects this future to an internal response processor to process the response once it arrives (joining the solution mappings from the response with the solution mappings covered by the corresponding batch). All these futures are collected such that the algorithm can wait for their completion after the child operator has stopped producing input for this operator.
This implementation is also capable of separating out each input solution mapping that assigns a blank node to any of the join variables. Then, such solution mappings are not even considered when creating the requests because they cannot have any join partners in the results obtained from the federation member. Of course, in case the algorithm is used with outer-join semantics, these solution mappings are still returned to the output (without joining them with anything).
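The non-blocking request handling and the blank-node separation could look roughly like the sketch below. It is not the implementation of this PR: `issueRequest` is a hypothetical stand-in for the federation access manager that filters an in-memory list, solution mappings are plain `Map<String, String>` objects, blank nodes are recognized by a `_:` prefix, and a single join variable `x` that is bound in every input is assumed.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch; the names and the in-memory "federation member" are assumptions.
final class AsyncBindJoinSketch {
    static final String JOIN_VAR = "x";

    static boolean hasBlankNode(Map<String, String> sm) {
        String v = sm.get(JOIN_VAR);
        return v != null && v.startsWith("_:");
    }

    // Stand-in for an asynchronous request: returns the member's solutions
    // that are join partners for at least one of the given values.
    static CompletableFuture<List<Map<String, String>>> issueRequest(
            Set<String> values, List<Map<String, String>> memberData, Executor pool) {
        return CompletableFuture.supplyAsync(() -> {
            List<Map<String, String>> result = new ArrayList<>();
            for (Map<String, String> m : memberData)
                if (values.contains(m.get(JOIN_VAR))) result.add(m);
            return result;
        }, pool);
    }

    // Response processor: joins the response with the inputs covered by the batch.
    static void join(List<Map<String, String>> inputs, List<Map<String, String>> response,
                     Queue<Map<String, String>> out, boolean outerJoin) {
        for (Map<String, String> in : inputs) {
            boolean found = false;
            for (Map<String, String> r : response) {
                if (in.get(JOIN_VAR).equals(r.get(JOIN_VAR))) {
                    Map<String, String> merged = new TreeMap<>(in);
                    merged.putAll(r);
                    out.add(merged);
                    found = true;
                }
            }
            if (!found && outerJoin) out.add(in);
        }
    }

    static List<Map<String, String>> run(List<List<Map<String, String>>> batches,
                                         List<Map<String, String>> memberData, boolean outerJoin) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Queue<Map<String, String>> output = new ConcurrentLinkedQueue<>();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        try {
            for (List<Map<String, String>> batch : batches) {
                List<Map<String, String>> usable = new ArrayList<>();
                Set<String> values = new HashSet<>();
                for (Map<String, String> sm : batch) {
                    if (hasBlankNode(sm)) {
                        // Never sent to the member; passed through under outer-join semantics.
                        if (outerJoin) output.add(sm);
                    } else {
                        usable.add(sm);
                        values.add(sm.get(JOIN_VAR));
                    }
                }
                if (usable.isEmpty()) continue;
                // Issue the request without blocking and attach the response processor.
                futures.add(issueRequest(values, memberData, pool)
                        .thenAccept(resp -> join(usable, resp, output, outerJoin)));
            }
            // Wait for all outstanding requests once no more input arrives.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(output);
    }

    static int demo(boolean outerJoin) {
        List<List<Map<String, String>>> batches = List.of(
                List.of(Map.of("x", "a", "y", "1"), Map.of("x", "_:b0", "y", "2")),
                List.of(Map.of("x", "c", "y", "3")));
        List<Map<String, String>> memberData = List.of(
                Map.of("x", "a", "z", "10"), Map.of("x", "a", "z", "11"), Map.of("x", "d", "z", "12"));
        return run(batches, memberData, outerJoin).size();
    }

    public static void main(String[] args) {
        System.out.println(demo(false) + " " + demo(true));
    }
}
```

The output queue is a concurrent collection because the response processors may run in the threads that complete the futures, so several of them can write results at the same time.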
Another feature of this implementation is that it switches into a full-retrieval mode as soon as there is an input solution mapping that does not have a binding for any of the join variables (which may happen only in cases in which none of the join variables is a certain variable). Such an input solution mapping is compatible with (and, thus, can be joined with) every solution mapping that the federation member has for the query/pattern of this bind-join operator. Therefore, when switching into full-retrieval mode, this implementation performs a request to retrieve the complete set of all these solution mappings and, then, uses this set to find join partners for the current and the future batches of input solution mappings (because, with the complete set available locally, there is no need anymore to issue further bind-join requests).
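The switch into full-retrieval mode could be sketched as below. Again, this is an illustration under assumptions rather than the code of this PR: `FullRetrievalSketch`, its counters, and the in-memory `memberData` list standing in for the federation member are hypothetical, and the simulated bind-join request simply reads the same list instead of sending a restricted query.

```java
import java.util.*;

// Hypothetical sketch; names and the in-memory "federation member" are assumptions.
final class FullRetrievalSketch {
    private final Set<String> joinVars;
    private final List<Map<String, String>> memberData;
    private List<Map<String, String>> fullResult = null; // non-null once in full-retrieval mode
    int bindJoinRequests = 0;
    int fullRequests = 0;

    FullRetrievalSketch(Set<String> joinVars, List<Map<String, String>> memberData) {
        this.joinVars = joinVars;
        this.memberData = memberData;
    }

    // Two solution mappings are compatible if they agree on all shared variables.
    static boolean compatible(Map<String, String> a, Map<String, String> b) {
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) return false;
        }
        return true;
    }

    List<Map<String, String>> processBatch(List<Map<String, String>> batch) {
        if (fullResult == null) {
            for (Map<String, String> sm : batch) {
                // No binding for any join variable: switch to full-retrieval mode.
                if (Collections.disjoint(sm.keySet(), joinVars)) {
                    fullResult = new ArrayList<>(memberData); // one request for everything
                    fullRequests++;
                    break;
                }
            }
        }
        List<Map<String, String>> candidates;
        if (fullResult != null) {
            candidates = fullResult; // join locally, no further requests
        } else {
            bindJoinRequests++;      // a real request would return join partners only
            candidates = memberData;
        }
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> sm : batch) {
            for (Map<String, String> c : candidates) {
                if (compatible(sm, c)) {
                    Map<String, String> merged = new TreeMap<>(sm);
                    merged.putAll(c);
                    out.add(merged);
                }
            }
        }
        return out;
    }

    // Returns {results of batch 1, batch 2, batch 3, bind-join requests, full requests}.
    static int[] demo() {
        FullRetrievalSketch s = new FullRetrievalSketch(Set.of("x"),
                List.of(Map.of("x", "a", "z", "1"), Map.of("x", "b", "z", "2")));
        int r1 = s.processBatch(List.of(Map.of("x", "a", "y", "1"))).size();
        int r2 = s.processBatch(List.of(Map.of("y", "2"))).size(); // triggers the switch
        int r3 = s.processBatch(List.of(Map.of("x", "b", "y", "3"))).size();
        return new int[] { r1, r2, r3, s.bindJoinRequests, s.fullRequests };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

In the demo, the second batch contains a solution mapping without a binding for `x`, which joins with every solution of the member; the third batch is then answered from the locally available set without another request.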