
Conversation

@tisonkun
Member

@tisonkun tisonkun commented Dec 16, 2025

This closes #14

Some open questions:

  • Shall we support tdigest over f32? Java implements only TDigestDouble while C++ has both for float and double. I think f64 is enough to cover most cases.
  • get_rank and get_quantile now take &mut self because they call compress internally. I think we can add a "frozen" tdigest struct that is already compressed and will never be updated, so that the frozen struct can be shared anywhere to query ranks and quantiles without mut. See 7175d98

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun marked this pull request as ready for review December 16, 2025 07:00
@tisonkun
Member Author

Basic functions are ready. Anyone can drop a review now.

Other features may or may not be included in this PR; if they are, they will land as new commits.

@tisonkun
Member Author

cc @notfilippo @freakyzoidberg

cc @leerho based on #2 (comment), I made this impl a combination of the C++ and Java versions. Reviews are welcome. I'll add serde support today or tomorrow. But I'm still not quite sure what CDF and PMF are. It's possible to port C++'s impl as is, but I'd like a real-world use case to understand their definition and usage.

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@freakyzoidberg
Member

cc @notfilippo @freakyzoidberg

cc @leerho based on #2 (comment), I made this impl a combination of the C++ and Java versions. Reviews are welcome. I'll add serde support today or tomorrow. But I'm still not quite sure what CDF and PMF are. It's possible to port C++'s impl as is, but I'd like a real-world use case to understand their definition and usage.

Without cross-language serde (x-serde), it is quite hard to confirm whether the synopses are compatible.
But that can be done later as the implementation progresses.

Note that I am no mathematician, but here is my humble understanding:

PMF -> Probability Mass Function

It returns the approximate fraction of data points (mass) that fall into specific "bins" or intervals.
You give it an array of split points and it returns an array of mass fractions that sum to 1.

It's a histogram generator of sorts.

CDF -> Cumulative Distribution Function

It returns the approximate fraction of data points that are less than (or equal to) each split point.
CDF ~= running sum of the PMF
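
To make the relationship concrete, here is a minimal sketch that computes an exact CDF and PMF over a small sorted sample rather than over a sketch. The function names `cdf` and `pmf` are hypothetical, and the convention that `m` split points yield `m + 1` values with "less than or equal to" semantics is an assumption for illustration, not necessarily what this PR implements.

```rust
/// Exact CDF over a non-empty, sorted in-memory sample (a sketch would approximate this).
/// For `m` split points it returns `m + 1` values; the last one is always 1.0.
fn cdf(sorted: &[f64], split_points: &[f64]) -> Vec<f64> {
    let n = sorted.len() as f64;
    let mut out: Vec<f64> = split_points
        .iter()
        // partition_point counts the values <= sp in the sorted sample
        .map(|&sp| sorted.partition_point(|&v| v <= sp) as f64 / n)
        .collect();
    out.push(1.0);
    out
}

/// PMF as the difference between consecutive CDF values, so the masses sum to 1.
fn pmf(sorted: &[f64], split_points: &[f64]) -> Vec<f64> {
    let mut prev = 0.0;
    cdf(sorted, split_points)
        .iter()
        .map(|&c| {
            let mass = c - prev;
            prev = c;
            mass
        })
        .collect()
}
```

With the sample 1.0 through 8.0 and split points 2.5 and 6.5, the three bins are "up to 2.5", "2.5 to 6.5", and "above 6.5", and the running sum of the PMF reproduces the CDF.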

Contributor

@notfilippo notfilippo left a comment

Implementation looks good!

Shall we support tdigest over f32? Java implements only TDigestDouble while C++ has both for float and double. I think f64 is enough to cover most cases.

Is there any way we can have a TDigest<T: num::Float>? I think in some cases having less precision for half of the memory is desirable.

get_rank and get_quantile now take &mut self because they call compress internally. I think we can add a "frozen" tdigest struct that is already compressed and will never be updated, so that the frozen struct can be shared anywhere to query ranks and quantiles without mut.

I like this approach.

@tisonkun
Member Author

tisonkun commented Dec 16, 2025

Is there any way we can have a TDigest<T: num::Float>? I think in some cases having less precision for half of the memory is desirable.

Possible. The tricky part is around overflow/underflow during computation and serde.

And the trickier part is that it's not only about the value but also the weight (u64 for f64, u32 for f32, with a GenericTypeConfig to associate them, damn). But I'm wondering: why not just use f64/f32 for the weight? We already cast the weight to f64 often, and representing a u64 count as f64 is lossless up to 2^53.

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Member Author

get_rank and get_quantile now take &mut self because they call compress internally. I think we can add a "frozen" tdigest struct that is already compressed and will never be updated, so that the frozen struct can be shared anywhere to query ranks and quantiles without mut.

I like this approach.

Added TDigest and TDigestMut for this purpose. See 7175d98 for more details. Also an internal TDigestView<'a> for sharing core logic.
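
The split between a mutable builder and a frozen, read-only form can be sketched as follows. This is not the PR's code: `DigestMut`, `Digest`, `freeze`, and `rank` are hypothetical names, and the "digest" here simply keeps every value and sorts once, standing in for the real compress step.

```rust
/// Hypothetical mutable builder: buffers raw values until frozen.
struct DigestMut {
    buffer: Vec<f64>,
}

/// Hypothetical frozen form: already "compressed" (sorted) and never updated again.
struct Digest {
    sorted: Vec<f64>,
}

impl DigestMut {
    fn new() -> Self {
        DigestMut { buffer: Vec::new() }
    }

    /// Updates need `&mut self`; non-finite values are ignored in this sketch.
    fn update(&mut self, value: f64) {
        if value.is_finite() {
            self.buffer.push(value);
        }
    }

    /// Consumes the builder and does the one-time compress step (here just a sort).
    fn freeze(mut self) -> Digest {
        self.buffer.sort_by(|a, b| a.total_cmp(b));
        Digest { sorted: self.buffer }
    }
}

impl Digest {
    /// Read-only query: `&self` is enough because nothing is left to compress.
    fn rank(&self, value: f64) -> f64 {
        if self.sorted.is_empty() {
            return f64::NAN;
        }
        self.sorted.partition_point(|&x| x <= value) as f64 / self.sorted.len() as f64
    }
}
```

Because `rank` takes `&self`, a frozen value can be borrowed immutably from several places at once, which is the point of the frozen type.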

@tisonkun
Member Author

tisonkun commented Dec 16, 2025

The remaining task in my mind:

  • Implement cdf/pmf functions
  • Support reading format of the reference implementation and snapshot tests See 88837bf
  • Figure out whether to cover NaN edge cases

Shall we support tdigest over f32? Java implements only TDigestDouble while C++ has both for float and double. I think f64 is enough to cover most cases.

Is there any way we can have a TDigest<T: num::Float>? I think in some cases having less precision for half of the memory is desirable.

I may not include this feature in this PR, because:

  1. The Java impl has only TDigestDouble.
  2. Generally, f64 is what you need.
  3. Switching between f32 and f64 can introduce a lot of type-tuning issues, and perhaps abstractions that exist only to make this work.

That said, I'll try to see if the weight field can be an f64 so we don't need a type config to associate (f64, u64)/(f32, u32) anyway. The Java/C++ impls do use an integer type for weight, but storing a u64 count in an f64 should be lossless (up to 2^53), since the only arithmetic on it here is:

let total_weight = self.weight + other.weight;
self.weight = total_weight;
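
As a sanity check on the "lossless" claim, the sketch below shows where exactness ends. An f64 has a 53-bit significand, so every integer count up to 2^53 is represented exactly and additions that stay in that range are exact; beyond it, increments can be silently lost. The helper name `is_exact_in_f64` and the constant `MAX_EXACT` are made up for this illustration.

```rust
/// Largest count up to which every integer is exactly representable in f64.
const MAX_EXACT: u64 = 1 << 53;

/// Returns true if `w` converts to f64 without rounding.
/// The comparison is done in u128 so that saturating casts cannot hide a mismatch.
fn is_exact_in_f64(w: u64) -> bool {
    (w as f64) as u128 == w as u128
}

/// Adds two weights stored as f64, as a float-weighted digest would.
fn add_weights(a: f64, b: f64) -> f64 {
    a + b
}
```

In practice 2^53 is roughly 9 * 10^15 items, so the limit matters mainly as a documented assumption rather than as an everyday constraint.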

cc @leerho @AlexanderSaydakov - any reason we must use an integer for weight?

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Comment on lines +480 to +481
check_non_nan(min, "min")?;
check_non_nan(max, "max")?;
Member Author

min & max should meet other criteria, like min <= max, and should be consistent with the centroids + buffer. But I'd leave that for later since the non_nan check is complex enough here. I don't want to make this PR too unreviewable.

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@leerho
Contributor

leerho commented Dec 17, 2025

Here is a little tutorial on CDFs and PMFs.

It is on the website:

  • Click on Documentation
  • On the left ToC open Sketch Families
  • Open Quantiles and Histograms
  • Click on Classic Quantiles Sketches
  • At the top of the page click on Quantiles StreamA Study

The beauty of using these quantile sketches for histogramming big data is:

  • You only scan the raw data once.
  • After that, you can query the sketch with a set of points that define the boundaries of your bins. The result is returned to you as two arrays of points. You choose your own plotting tool. If you don't like what you see, you can query the sketch again with a different set of points. Most histogram plotting software requires you to define your bins before you scan the raw data. These sketches allow you to play with your data in nearly real time.

@leerho
Contributor

leerho commented Dec 17, 2025

WRT +/-inf: this is truly bizarre. The KLL, Classic, and REQ sketches reject NaNs (except in one very special case). As for +/-Inf, I don't recall that they ever subtract points, so (inf - inf) should never happen. T-Digest is another animal entirely, and I can see where the subtraction of centroids would create a problem. I would reject all +/-Inf on the input; they really mess up data analysis IMHO.
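
The (inf - inf) problem is easy to reproduce. The sketch below uses a hypothetical input guard `accept` and a generic linear interpolation between two centroid means; the interpolation formula is an assumption chosen for illustration, not the exact expression used in this PR.

```rust
/// Hypothetical input guard: accept only finite values, rejecting NaN and +/-Inf.
fn accept(value: f64) -> Option<f64> {
    if value.is_finite() {
        Some(value)
    } else {
        None
    }
}

/// Linear interpolation between two centroid means.
/// The subtraction `right - left` is where two infinite means turn into NaN.
fn interpolate(left: f64, right: f64, t: f64) -> f64 {
    left + (right - left) * t
}
```

Filtering at the input, as suggested above, keeps every later subtraction between centroids well defined.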

@tisonkun
Member Author

I would reject all +/-Inf on the input; they really mess up data analysis IMHO.

Good point to follow. Let me try to push a commit for doing so later.

@AlexanderSaydakov

AlexanderSaydakov commented Dec 17, 2025

any reason we must use an integer for weight?

the total weight is the number of input items. weight of a centroid is the number of items this centroid represents.

@tisonkun
Member Author

tisonkun commented Dec 17, 2025

any reason we must use an integer for weight?

the total weight is the number of input items. weight of a centroid is the number of items this centroid represents.

Yes. I mean, is there some downside if I use f64 (double) for weight? To support being generic over f32/f64 like datasketches-cpp's tdigest does, reducing the type combination to a single generic type would help.

It's quite wordy to implement something like the following in Rust:

using W = typename std::conditional<std::is_same<T, double>::value, uint64_t, uint32_t>::type;
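
For comparison, one way to express that association in Rust is a trait with an associated type. This is only a sketch of the idea: the trait `DigestValue`, the struct `Centroid`, and the function `total_weight` are hypothetical names, not items from this PR.

```rust
/// Hypothetical trait tying each value type to its integer weight type,
/// playing the role of `std::conditional` in the C++ snippet above.
trait DigestValue: Copy {
    type Weight: Copy + Default + std::ops::Add<Output = Self::Weight>;
}

impl DigestValue for f64 {
    type Weight = u64;
}

impl DigestValue for f32 {
    type Weight = u32;
}

/// A centroid whose weight type follows from its value type.
struct Centroid<T: DigestValue> {
    mean: T,
    weight: T::Weight,
}

/// Sums centroid weights in the weight type associated with `T`.
fn total_weight<T: DigestValue>(centroids: &[Centroid<T>]) -> T::Weight {
    centroids
        .iter()
        .fold(<T::Weight as Default>::default(), |acc, c| acc + c.weight)
}
```

Every generic function then has to carry the `DigestValue` bound and whatever arithmetic bounds it needs on `T::Weight`, which is the wordiness mentioned above; a single float weight type would avoid the extra trait at the cost of integer exactness beyond 2^53.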

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Member Author

tisonkun commented Dec 17, 2025

I would reject all +/-Inf on the input; they really mess up data analysis IMHO.

All filtered out in the final commit.

@leerho @freakyzoidberg @notfilippo This PR should be considered finished now. Welcome to drop a review.

Comment on lines +1059 to +1074
/// Checks the sequential validity of the given array of double values.
/// They must be unique, monotonically increasing and not NaN.
#[track_caller]
fn check_split_points(split_points: &[f64]) {
    // A single point has no neighbor to compare against, so check it for NaN directly.
    if split_points.len() == 1 && split_points[0].is_nan() {
        panic!("split_points must not contain NaN values: {split_points:?}");
    }
    // `windows(2)` yields nothing for empty or single-element slices, which avoids
    // the `len - 1` underflow on an empty slice.
    for pair in split_points.windows(2) {
        if pair[0] < pair[1] {
            // we must use this positive condition because NaN comparisons are always false
            continue;
        }
        panic!("split_points must be unique and monotonically increasing: {split_points:?}");
    }
}
Member Author

While C++ defines check_split_points as a private method, Java provides it as a public util: https://apache.github.io/datasketches-java/9.0.0/org/apache/datasketches/quantilescommon/QuantilesUtil.html#checkDoublesSplitPointsOrder(double%5B%5D)

Given that we require split_points to be unique, monotonically increasing and not NaN in cdf/pmf contract, I'd prefer to expose a fallible version of this method for users to check by themselves easily.

The remaining issue is where we should put these utilities. And this can be a follow-up anyway.
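
A fallible variant of the check could look roughly like the sketch below. The name `try_check_split_points` and the error enum `SplitPointsError` are hypothetical; the crate may well choose a different error type and location for such a utility.

```rust
/// Hypothetical error type for invalid split points.
#[derive(Debug, PartialEq)]
enum SplitPointsError {
    /// At least one split point is NaN.
    Nan,
    /// Two neighboring split points are equal or decreasing.
    NotIncreasing,
}

/// Returns `Ok(())` if the split points are NaN-free, unique, and strictly increasing.
fn try_check_split_points(split_points: &[f64]) -> Result<(), SplitPointsError> {
    if split_points.iter().any(|v| v.is_nan()) {
        return Err(SplitPointsError::Nan);
    }
    // NaN has been ruled out above, so `>=` is a reliable comparison here.
    if split_points.windows(2).any(|pair| pair[0] >= pair[1]) {
        return Err(SplitPointsError::NotIncreasing);
    }
    Ok(())
}
```

Callers could validate user-supplied split points up front and report an error, while the cdf/pmf functions keep their panicking contract.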

Signed-off-by: tison <wander4096@gmail.com>
Contributor

@leerho leerho left a comment

This is overly big because of all the serialized test *.sk files. We don't check those files in as they are created dynamically. That being said, I keep a copy locally (cause I'm lazy and impatient) but even then I do try to update them frequently.

I would request that these *.sk files be removed from this PR.

@tisonkun
Member Author

I would request that these *.sk files be removed from this PR.

@leerho Sure. That is how we have done it so far, so I was originally considering a follow-up to build the snapshots in CI with a script, as I'm planning in #10.

But if this is a blocker, let me try to implement it beforehand :D

Member

@Xuanwo Xuanwo left a comment

LGTM, thank you for working on this!

@tisonkun
Member Author

tisonkun commented Dec 19, 2025

See https://github.com/apache/datasketches-rust/tree/4f5f266/tests/serialization_test_data. @notfilippo may prefer checking in the snapshot files, as in (#21 (comment)) and as datasketches-go does now (https://github.com/apache/datasketches-go/tree/f7bc4b1d/serialization_test_data).

But I'd prefer Java/C++'s approach so let me prepare a patch for review (#10 (comment)).

@tisonkun
Member Author

This is overly big because of all the serialized test *.sk files. We don't check those files in as they are created dynamically. That being said, I keep a copy locally (cause I'm lazy and impatient) but even then I do try to update them frequently.

I would request that these *.sk files be removed from this PR.

Prerequisite implemented in #29. After #29 is merged, I can rebase this one onto it.

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Member Author

tisonkun commented Dec 19, 2025

@leerho Rebased and all sk files are removed.

It now takes about 5 minutes to generate the snapshots in CI. I'm considering a shared repo that pregenerates the snapshots, so that we only download them when needed. The snapshots could be refreshed daily, with each language's developers maintaining their own language's generator.

But that would be another initiative.

@notfilippo
Contributor

Planning to take a look at this tomorrow. Thanks for your work!

Development

Successfully merging this pull request may close these issues.

Implement tdigest

6 participants