-
Notifications
You must be signed in to change notification settings - Fork 7
feat: implement T-Digest #23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
|
Basic functions are ready. Anyone can drop a review now. Other features may or may not included in this PR and they will be new commits if any. |
|
cc @notfilippo @freakyzoidberg cc @leerho based on #2 (comment), I made this impl as a combination of C++ & Java version. Welcome to give it a review. I'll add serde support today or tomorrow. But I'm still not quite sure what CDF and PMF are. It's possible to convey C++'s impl as is but I wonder a real world use case to understand its definition and usage. |
Signed-off-by: tison <wander4096@gmail.com>
Without the x-serde this is quite hard to confirm if the synopsis are compatible Note I am no mathematician - but my humble understanding PMF -> Probability Mass FunctionIt returns the approximate fraction of data points (mass) that fall into specific "bins" or intervals. it's a histogram generator of sort CDF -> Cumulative Distribution FunctionIt returns the approximate fraction of data points that are less than (or equal to) each split point. |
notfilippo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good!
Shall we support tdigest over f32? Java implements only TDigestDouble while C++ has both for float and double. I think f64 is enough to cover most cases.
Is there any way we can have a TDigest<T: num::Float>? I think in some cases having less precision for half of the memory is desirable.
get_rank and get_quantile now takes &mut self because they call compress internally. I think we can add a "freezed" tdigest struct that will never updated then and already compressed, so that the freezed struct can be shared anywhere to test ranks and quantities without mut.
I like this approach.
Possible. The tricky part is around overflow/underflow during computation and serde. And the more tricky part is that it's not only about the value, but also the weight (u64 for f64, u32 for f32, GenericTypeConfig to get them associated, damn). But I'm considering why not just use f64/f32 for weight - we're now often casting weight to f64 and while representing u64 as f64 is lossless. |
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Added |
|
The remaining task in my mind:
I may not include this feature in this PR, because:
That said, I'll try to see if the self.weight = total_weight;
let total_weight = self.weight + other.weight;cc @leerho @AlexanderSaydakov - any reason we must use an integer for |
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
| check_non_nan(min, "min")?; | ||
| check_non_nan(max, "max")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min & max should have other certia like min <= max and should be consistent with centroid + buffer. But I'd leave it later since the non_nan check is complex enough here. I don't make this PR too unreviewable (
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
|
Here is a little tutorial on CDFs and PMFs. It is on the website:
The beauty of using these quantile sketches for histogramming big data is:
|
|
WRT +/-inf. This is truely bizarre. The KLL, Classic, REQ sketches reject NaNs (except in one very special case) As for +-Inf, I don't recall that they ever subtract points so the (inf - inf) should never happen. T-Digest is another animal entirely, and I can see where the subtraction of centroids would create a problem. i would reject all +/-Inf on the input, they really mess up data analysis IMHO. |
Good point to follow. Let me try to push a commit for doing so later. |
the total weight is the number of input items. weight of a centroid is the number of items this centroid represents. |
Yes. I mean, is there some downside if I'm using f64 (double) for weight? Because to support type generic over f32/f64 like what datasketches-cpp's tdigest, reducing type combo into one generic type would help. It's quite wordy to implement something like below in Rust: using W = typename std::conditional<std::is_same<T, double>::value, uint64_t, uint32_t>::type; |
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
All filter out in the final commit. @leerho @freakyzoidberg @notfilippo This PR should be considered finished now. Welcome to drop a review. |
| /// Checks the sequential validity of the given array of double values. | ||
| /// They must be unique, monotonically increasing and not NaN. | ||
| #[track_caller] | ||
| fn check_split_points(split_points: &[f64]) { | ||
| let len = split_points.len(); | ||
| if len == 1 && split_points[0].is_nan() { | ||
| panic!("split_points must not contain NaN values: {split_points:?}"); | ||
| } | ||
| for i in 0..len - 1 { | ||
| if split_points[i] < split_points[i + 1] { | ||
| // we must use this positive condition because NaN comparisons are always false | ||
| continue; | ||
| } | ||
| panic!("split_points must be unique and monotonically increasing: {split_points:?}"); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While C++ defines check_split_points as a private method, Java provides it as a public util: https://apache.github.io/datasketches-java/9.0.0/org/apache/datasketches/quantilescommon/QuantilesUtil.html#checkDoublesSplitPointsOrder(double%5B%5D)
Given that we require split_points to be unique, monotonically increasing and not NaN in cdf/pmf contract, I'd prefer to expose a fallible version of this method for users to check by themselves easily.
The remaining issue is where we should put these utilities. And this can be a follow-up anyway.
Signed-off-by: tison <wander4096@gmail.com>
leerho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is overly big because of all the serialized test *.sk files. We don't check those files in as they are created dynamically. That being said, I keep a copy locally (cause I'm lazy and impatient) but even then I do try to update them frequently.
I would request that these *.sk files be removed from this PR.
Xuanwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you for working on this!
|
See https://github.com/apache/datasketches-rust/tree/4f5f266/tests/serialization_test_data and @notfilippo may prefer check in the snapshot files as in (#21 (comment)) and what datasketches-go does now (https://github.com/apache/datasketches-go/tree/f7bc4b1d/serialization_test_data). But I'd prefer Java/C++'s approach so let me prepare a patch for review (#10 (comment)). |
Prerequisite implemented in #29. After #29 merged I can rebase this one onto that. |
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
|
@leerho Rebased and all sk files are removed. Now it takes about 5 mins to generate the snapshots in CI. I'm considering have a shared repo for pregenerating the snapshots and only download when we need it. The snapshot can be kept updated in a daily manner and each language developers maintainer they own language's generator. But that would be another initiative. |
|
Planning to take a look at this tomorrow. Thanks for your work! |
This closes #14
Some open questions:
get_rank and get_quantile now takesSee 7175d98&mut selfbecause they call compress internally. I think we can add a "freezed" tdigest struct that will never updated then and already compressed, so that the freezed struct can be shared anywhere to test ranks and quantities without mut.