-
Notifications
You must be signed in to change notification settings - Fork 2
Description
Hash sets/maps are common collections, and their current implementation in the standard library naturally fit Paralight's model, as they are represented by a slice of buckets. Conceptually, implementing a parallel source over hash tables could be as simple as buckets.par_iter().filter(/* is bucket occupied? */).
The main challenge is that hash table internals are not exposed in public APIs (unlike Vec and its into_raw_parts()/from_raw_parts() APIs), which allows future evolvability of the stdlib implementation but prevents optimizations.
Another possibility would be to work with the hashbrown crate (the underlying implementation of the standard library) to peek into the internals and provide ParallelSource adaptors for hashbrown::HashMap. This wouldn't fully solve it for std::collections::HashMap though.
Pushing to the frontier of what's possible, a terrible hack would be to use unsafe code to convert between std::hashmap::HashMap and hashbrown::HashMap, using something like rustversion or the upcoming cfg(version(...)) to allow this only on Rust versions where the standard library implementation is known to be a specific version of hashbrown. This is a whole can of worms, might be implemented via separate support crates, etc.
At the end of the day, there is no free lunch: the standard library isn't committing too much to its internals to allow future performance improvements, but this prevents crates from making assumptions on standard library types to implement some optimizations.
The other side of the puzzle is parallel collection into a hash table (akin to #4). One approach would be to have per-thread temporary tables to collect into, and a merge phase at the end. Another approach would be to integrate further into the internals, having a single hash table for all threads and locking selectively part of it when reading/writing into a bucket.