feat: implement T-Digest #23

tisonkun · 2025-12-16T05:15:25Z

This closes #14

Some open questions:

Shall we support tdigest over f32? Java implements only TDigestDouble while C++ has both for float and double. I think f64 is enough to cover most cases.
get_rank and get_quantile now take &mut self because they call compress internally. I think we can add a "frozen" tdigest struct that is already compressed and will never be updated, so that the frozen struct can be shared anywhere to test ranks and quantiles without mut. See 7175d98

Signed-off-by: tison <wander4096@gmail.com>

src/tdigest/sketch.rs

tisonkun · 2025-12-16T07:00:58Z

Basic functions are ready. Anyone can drop a review now.

Other features may or may not be included in this PR; if any are, they will be new commits.

tisonkun · 2025-12-16T07:04:18Z

cc @notfilippo @freakyzoidberg

cc @leerho based on #2 (comment), I made this impl as a combination of the C++ & Java versions. You're welcome to give it a review. I'll add serde support today or tomorrow. But I'm still not quite sure what CDF and PMF are. It's possible to port C++'s impl as is, but I'd like a real-world use case to understand their definition and usage.

Signed-off-by: tison <wander4096@gmail.com>

src/tdigest/serialization.rs

Signed-off-by: tison <wander4096@gmail.com>

src/tdigest/serialization.rs

freakyzoidberg · 2025-12-16T09:29:59Z

cc @notfilippo @freakyzoidberg

cc @leerho based on #2 (comment), I made this impl as a combination of the C++ & Java versions. You're welcome to give it a review. I'll add serde support today or tomorrow. But I'm still not quite sure what CDF and PMF are. It's possible to port C++'s impl as is, but I'd like a real-world use case to understand their definition and usage.

Without cross-language serde it is quite hard to confirm whether the synopses are compatible.
But that can be done later as the implementation progresses.

Note that I am no mathematician, but here is my humble understanding:

PMF -> Probability Mass Function

It returns the approximate fraction of data points (mass) that fall into specific "bins" or intervals.
You give it an array of split points and it returns an array of mass fractions that sum to 1.

It's a histogram generator of sorts.

CDF -> Cumulative Distribution Function

It returns the approximate fraction of data points that are less than (or equal to) each split point.
CDF ~= running sum of the PMF
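To make the relationship concrete, here is a minimal sketch computed over exact sorted data rather than over a sketch. The helper names `cdf_exact` and `pmf_exact` are hypothetical and are not part of this PR's API, and the inclusive (`<=`) convention is an assumption.

```rust
/// Fraction of `sorted` values that are <= each split point, plus a final 1.0.
/// `sorted` must be ascending and non-empty; `split_points` must be ascending.
fn cdf_exact(sorted: &[f64], split_points: &[f64]) -> Vec<f64> {
    let n = sorted.len() as f64;
    let mut out: Vec<f64> = split_points
        .iter()
        .map(|&sp| sorted.partition_point(|&v| v <= sp) as f64 / n)
        .collect();
    out.push(1.0);
    out
}

/// Mass in each of the `split_points.len() + 1` bins:
/// the successive differences of the CDF.
fn pmf_exact(sorted: &[f64], split_points: &[f64]) -> Vec<f64> {
    let mut prev = 0.0;
    cdf_exact(sorted, split_points)
        .into_iter()
        .map(|c| {
            let mass = c - prev;
            prev = c;
            mass
        })
        .collect()
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let splits = [2.0, 6.0];
    println!("cdf = {:?}", cdf_exact(&data, &splits));
    println!("pmf = {:?}", pmf_exact(&data, &splits));
}
```

Two split points produce three bins, so both arrays have one more entry than the split-point array.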

notfilippo

Implementation looks good!

Shall we support tdigest over f32? Java implements only TDigestDouble while C++ has both for float and double. I think f64 is enough to cover most cases.

Is there any way we can have a TDigest<T: num::Float>? I think in some cases having less precision for half of the memory is desirable.

get_rank and get_quantile now take &mut self because they call compress internally. I think we can add a "frozen" tdigest struct that is already compressed and will never be updated, so that the frozen struct can be shared anywhere to test ranks and quantiles without mut.

I like this approach.

src/tdigest/mod.rs

src/tdigest/serialization.rs

src/tdigest/sketch.rs

tisonkun · 2025-12-16T10:17:08Z

Is there any way we can have a TDigest<T: num::Float>? I think in some cases having less precision for half of the memory is desirable.

Possible. The tricky part is around overflow/underflow during computation and serde.

And the trickier part is that it's not only about the value, but also the weight (u64 for f64, u32 for f32, a GenericTypeConfig to get them associated, damn). But I'm considering why not just use f64/f32 for weight: we're already often casting weight to f64, and representing u64 as f64 is lossless.

Signed-off-by: tison <wander4096@gmail.com>

src/tdigest/sketch.rs

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2025-12-16T15:45:11Z

get_rank and get_quantile now take &mut self because they call compress internally. I think we can add a "frozen" tdigest struct that is already compressed and will never be updated, so that the frozen struct can be shared anywhere to test ranks and quantiles without mut.

I like this approach.

Added TDigest and TDigestMut for this purpose. See 7175d98 for more details. Also an internal TDigestView<'a> for sharing core logic.
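The split between a mutable builder and a frozen, query-only form can be sketched as follows. `SketchMut` and `SketchFrozen` are hypothetical, heavily simplified stand-ins (a sorted vector instead of centroids); they are not the `TDigestMut`/`TDigest` types from 7175d98.

```rust
/// Mutable accumulator: updates are cheap, but queries would need sorted data.
#[derive(Default)]
struct SketchMut {
    buffer: Vec<f64>,
}

/// Frozen form: already sorted ("compressed"), so every query takes `&self`.
struct SketchFrozen {
    sorted: Vec<f64>,
}

impl SketchMut {
    fn update(&mut self, value: f64) {
        // Skip NaN and infinities so sorting and queries stay well defined.
        if value.is_finite() {
            self.buffer.push(value);
        }
    }

    /// Consumes the mutable sketch and does the one-time work up front.
    fn freeze(mut self) -> SketchFrozen {
        self.buffer.sort_by(f64::total_cmp);
        SketchFrozen { sorted: self.buffer }
    }
}

impl SketchFrozen {
    /// Fraction of values <= `value`; `None` when the sketch is empty.
    fn rank(&self, value: f64) -> Option<f64> {
        if self.sorted.is_empty() {
            return None;
        }
        let below = self.sorted.partition_point(|&v| v <= value);
        Some(below as f64 / self.sorted.len() as f64)
    }
}

fn main() {
    let mut sketch = SketchMut::default();
    for v in [3.0, 1.0, 2.0, 4.0] {
        sketch.update(v);
    }
    let frozen = sketch.freeze();
    // Shared references are enough for queries.
    let (a, b) = (&frozen, &frozen);
    println!("{:?} {:?}", a.rank(2.0), b.rank(10.0));
}
```

Because `freeze` takes `self` by value, the compiler rules out further updates once the frozen form exists.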

tisonkun · 2025-12-16T15:52:09Z

The remaining tasks in my mind:

Implement cdf/pmf functions
~~Support reading format of the reference implementation and snapshot tests~~ See 88837bf
Figure out whether to cover NaN edge cases

Shall we support tdigest over f32? Java implements only TDigestDouble while C++ has both for float and double. I think f64 is enough to cover most cases.

Is there any way we can have a TDigest<T: num::Float>? I think in some cases having less precision for half of the memory is desirable.

I may not include this feature in this PR, because:

The Java impl has only TDigestDouble.
Generally, f64 is what you need.
Switching between f32 and f64 can introduce a lot of type tuning issues, and perhaps abstractions that exist only to make this work.

That said, I'll try to see if the weight field can be an f64 so we don't need a type config to associate (f64, u64)/(f32, u32) anyway. The Java/C++ impls do use an integer type for weight, but using f64 for u64 should be lossless since the only assignments here are:

self.weight = total_weight;
let total_weight = self.weight + other.weight;

cc @leerho @AlexanderSaydakov - any reason we must use an integer for weight?
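One caveat when weighing this option: an f64 has a 53-bit significand, so converting a u64 count to f64 is exact only up to 2^53. The snippet below is a standalone illustration of that boundary, not code from this PR; `round_trips_through_f64` is a hypothetical helper.

```rust
/// True when `n` survives a round trip through f64 unchanged.
fn round_trips_through_f64(n: u64) -> bool {
    (n as f64) as u64 == n
}

fn main() {
    let limit: u64 = 1u64 << 53;

    // Every integer up to 2^53 is exactly representable as an f64.
    assert!(round_trips_through_f64(limit));

    // 2^53 + 1 is not representable and rounds to a neighbor.
    assert!(!round_trips_through_f64(limit + 1));

    // At the limit, adding a weight of 1 is silently lost.
    let w = limit as f64;
    assert_eq!(w + 1.0, w);

    println!("u64 -> f64 is exact up to {limit} (2^53)");
}
```

Whether 2^53 items (about 9 * 10^15) is a practical limit is a separate judgment call.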

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2025-12-17T02:04:31Z

src/tdigest/sketch.rs

+        check_non_nan(min, "min")?;
+        check_non_nan(max, "max")?;


min & max should satisfy other criteria, like min <= max, and should be consistent with centroid + buffer. But I'd leave that for later since the non_nan check is complex enough here. I don't want to make this PR too unreviewable.
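A follow-up check along those lines might look like the sketch below. Both functions are hypothetical: the `check_non_nan` here only imitates the call shape in the diff, and `String` stands in for whatever error type the crate actually uses.

```rust
/// Hypothetical stand-in for the PR's helper: rejects NaN and names the field.
fn check_non_nan(value: f64, tag: &'static str) -> Result<(), String> {
    if value.is_nan() {
        return Err(format!("malformed data: {tag} must not be NaN"));
    }
    Ok(())
}

/// Validates deserialized bounds: NaN checks first, then the ordering criterion.
fn check_bounds(min: f64, max: f64) -> Result<(), String> {
    check_non_nan(min, "min")?;
    check_non_nan(max, "max")?;
    if min > max {
        return Err(format!("malformed data: min ({min}) must be <= max ({max})"));
    }
    Ok(())
}

fn main() {
    println!("{:?}", check_bounds(1.0, 2.0));
    println!("{:?}", check_bounds(f64::NAN, 2.0));
    println!("{:?}", check_bounds(3.0, 2.0));
}
```

Running the NaN checks first matters because `min > max` is false whenever either side is NaN.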

Signed-off-by: tison <wander4096@gmail.com>

leerho · 2025-12-17T06:46:32Z

Here is a little tutorial on CDFs and PMFs.

It is on the website:

Click on Documentation
On the left ToC open Sketch Families
Open Quantiles and Histograms
Click on Classic Quantiles Sketches
At the top of the page click on Quantiles StreamA Study

The beauty of using these quantile sketches for histogramming big data is:

You only scan the raw data once.
After that, you can query the sketch with a set of points that define the boundaries of your bins. The result is returned to you as two arrays of points. You choose your own plotting tool. If you don't like what you see, you can query the sketch again with a different set of points. Most histogram plotting software requires you to define your bins before you scan the raw data. These sketches allow you to play with your data in nearly real time.

leerho · 2025-12-17T07:14:07Z

WRT +/-inf. This is truly bizarre. The KLL, Classic, REQ sketches reject NaNs (except in one very special case). As for +/-Inf, I don't recall that they ever subtract points, so the (inf - inf) should never happen. T-Digest is another animal entirely, and I can see where the subtraction of centroids would create a problem. I would reject all +/-Inf on the input, they really mess up data analysis IMHO.
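The failure mode is easy to reproduce in isolation. The sketch below assumes centroids are merged with the usual incremental weighted-mean update and uses f64 weights for brevity; `merge_means` and `is_valid_input` are hypothetical names, not functions from this PR.

```rust
/// Incremental weighted-mean update for merging centroid `b` into centroid `a`.
fn merge_means(mean_a: f64, w_a: f64, mean_b: f64, w_b: f64) -> f64 {
    mean_a + (mean_b - mean_a) * w_b / (w_a + w_b)
}

/// Input filter: accept only finite values (rejects NaN and +/-Inf).
fn is_valid_input(value: f64) -> bool {
    value.is_finite()
}

fn main() {
    // Finite inputs behave as expected.
    assert_eq!(merge_means(1.0, 1.0, 3.0, 1.0), 2.0);

    // inf - inf is NaN, so merging two infinite centroids poisons the mean.
    assert!(merge_means(f64::INFINITY, 1.0, f64::INFINITY, 1.0).is_nan());

    // Filtering at the input keeps only finite values.
    let inputs = [1.0, f64::INFINITY, f64::NAN, -2.5, f64::NEG_INFINITY];
    let accepted: Vec<f64> = inputs
        .iter()
        .copied()
        .filter(|v| is_valid_input(*v))
        .collect();
    assert_eq!(accepted, vec![1.0, -2.5]);
    println!("accepted = {accepted:?}");
}
```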

tisonkun · 2025-12-17T08:47:02Z

I would reject all +/-Inf on the input, they really mess up data analysis IMHO.

Good point to follow. Let me try to push a commit for doing so later.

AlexanderSaydakov · 2025-12-17T19:23:32Z

any reason we must use an integer for weight?

The total weight is the number of input items. The weight of a centroid is the number of items this centroid represents.

tisonkun · 2025-12-17T23:10:29Z

any reason we must use an integer for weight?

The total weight is the number of input items. The weight of a centroid is the number of items this centroid represents.

Yes. I mean, is there some downside if I use f64 (double) for weight? To support being generic over f32/f64, as datasketches-cpp's tdigest does, reducing the type combos to a single generic type would help.

It's quite wordy to implement something like the following in Rust:

using W = typename std::conditional<std::is_same<T, double>::value, uint64_t, uint32_t>::type;
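For comparison, one way to express that mapping in Rust is a trait with an associated type. This is only a sketch under assumed names (`TDigestValue`, `Centroid`, `total_weight` are hypothetical), not the design adopted in this PR.

```rust
use std::ops::Add;

/// Maps a floating-point value type to its weight type,
/// mirroring the C++ `std::conditional` line above.
trait TDigestValue: Copy {
    type Weight: Copy + Default + Add<Output = Self::Weight>;
}

impl TDigestValue for f64 {
    type Weight = u64;
}

impl TDigestValue for f32 {
    type Weight = u32;
}

struct Centroid<T: TDigestValue> {
    mean: T,
    weight: T::Weight,
}

/// Sums centroid weights using whichever integer type `T` selects.
fn total_weight<T: TDigestValue>(centroids: &[Centroid<T>]) -> T::Weight {
    centroids
        .iter()
        .fold(<T::Weight as Default>::default(), |acc, c| acc + c.weight)
}

fn main() {
    let wide = [
        Centroid { mean: 1.0_f64, weight: 2_u64 },
        Centroid { mean: 2.0_f64, weight: 3_u64 },
    ];
    let narrow = [Centroid { mean: 1.0_f32, weight: 4_u32 }];
    println!("first means: {} {}", wide[0].mean, narrow[0].mean);
    println!("total weights: {} {}", total_weight(&wide), total_weight(&narrow));
}
```

Every arithmetic operation the digest needs on weights has to appear as a bound on `Weight`, which is where the wordiness comes from.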

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2025-12-17T23:53:57Z

I would reject all +/-Inf on the input, they really mess up data analysis IMHO.

All filtered out in the final commit.

@leerho @freakyzoidberg @notfilippo This PR should be considered finished now. Welcome to drop a review.

tisonkun · 2025-12-17T23:58:50Z

src/tdigest/sketch.rs

+/// Checks the sequential validity of the given array of double values.
+/// They must be unique, monotonically increasing and not NaN.
+#[track_caller]
+fn check_split_points(split_points: &[f64]) {
+    let len = split_points.len();
+    if len == 1 && split_points[0].is_nan() {
+        panic!("split_points must not contain NaN values: {split_points:?}");
+    }
+    for i in 0..len - 1 {
+        if split_points[i] < split_points[i + 1] {
+            // we must use this positive condition because NaN comparisons are always false
+            continue;
+        }
+        panic!("split_points must be unique and monotonically increasing: {split_points:?}");
+    }
+}


While C++ defines check_split_points as a private method, Java provides it as a public util: https://apache.github.io/datasketches-java/9.0.0/org/apache/datasketches/quantilescommon/QuantilesUtil.html#checkDoublesSplitPointsOrder(double%5B%5D)

Given that we require split_points to be unique, monotonically increasing and not NaN in cdf/pmf contract, I'd prefer to expose a fallible version of this method for users to check by themselves easily.

The remaining issue is where we should put these utilities. And this can be a follow-up anyway.
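A fallible variant could look roughly like the sketch below. The name `try_check_split_points` and the `SplitPointsError` enum are hypothetical; the crate's real error type and module layout are still open, as noted above.

```rust
/// Hypothetical error type describing why a split-point array was rejected.
#[derive(Debug, PartialEq)]
enum SplitPointsError {
    /// A NaN was found at `index`.
    ContainsNan { index: usize },
    /// The value at `index` is not greater than its predecessor.
    NotStrictlyIncreasing { index: usize },
}

/// Returns `Ok(())` when `split_points` are NaN-free, unique and increasing.
fn try_check_split_points(split_points: &[f64]) -> Result<(), SplitPointsError> {
    if let Some(index) = split_points.iter().position(|v| v.is_nan()) {
        return Err(SplitPointsError::ContainsNan { index });
    }
    // `windows(2)` yields nothing for slices shorter than two elements,
    // so empty and single-element inputs pass without index arithmetic.
    for (i, pair) in split_points.windows(2).enumerate() {
        // NaN was excluded above, so this comparison is reliable.
        if pair[0] >= pair[1] {
            return Err(SplitPointsError::NotStrictlyIncreasing { index: i + 1 });
        }
    }
    Ok(())
}

fn main() {
    println!("{:?}", try_check_split_points(&[1.0, 2.0, 3.0]));
    println!("{:?}", try_check_split_points(&[1.0, 1.0]));
    println!("{:?}", try_check_split_points(&[1.0, f64::NAN]));
}
```

The panicking version could then be a thin wrapper that formats the error, keeping a single copy of the validation logic.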

Signed-off-by: tison <wander4096@gmail.com>

leerho

This is overly big because of all the serialized test *.sk files. We don't check those files in as they are created dynamically. That being said, I keep a copy locally (cause I'm lazy and impatient) but even then I do try to update them frequently.

I would request that these *.sk files be removed from this PR.

tisonkun · 2025-12-19T02:25:11Z

I would request that these *.sk files be removed from this PR.

@leerho Sure. We have been checking such files in so far, so originally I was considering adding a follow-up to build the snapshots in CI with a script, as I'm planning in #10.

But if this can be a blocker, let me try to implement it beforehand :D

Xuanwo

LGTM, thank you for working on this!

tisonkun · 2025-12-19T02:27:09Z

See https://github.com/apache/datasketches-rust/tree/4f5f266/tests/serialization_test_data. @notfilippo may prefer checking in the snapshot files, as in (#21 (comment)) and as datasketches-go does now (https://github.com/apache/datasketches-go/tree/f7bc4b1d/serialization_test_data).

But I'd prefer Java/C++'s approach, so let me prepare a patch for review (#10 (comment)).

tisonkun · 2025-12-19T03:49:07Z

This is overly big because of all the serialized test *.sk files. We don't check those files in as they are created dynamically. That being said, I keep a copy locally (cause I'm lazy and impatient) but even then I do try to update them frequently.

I would request that these *.sk files be removed from this PR.

Prerequisite implemented in #29. After #29 is merged I can rebase this one onto it.

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2025-12-19T05:15:52Z

@leerho Rebased and all sk files are removed.

Now it takes about 5 mins to generate the snapshots in CI. I'm considering a shared repo for pregenerating the snapshots, so that we only download them when we need them. The snapshots can be kept updated on a daily basis, with each language's maintainers responsible for their own language's generator.

But that would be another initiative.

notfilippo · 2025-12-19T09:22:07Z

Planning to take a look at this tomorrow. Thanks for your work!

tisonkun added 3 commits December 16, 2025 11:45

feat: implement T-Digest

ed48865

Signed-off-by: tison <wander4096@gmail.com>

impl merge and compress

26ee955

Signed-off-by: tison <wander4096@gmail.com>

impl get_rank

88ac87e

Signed-off-by: tison <wander4096@gmail.com>

tisonkun marked this pull request as ready for review December 16, 2025 07:00

tisonkun added 3 commits December 16, 2025 15:23

impl merge and add tests

09afcc9

Signed-off-by: tison <wander4096@gmail.com>

demo iter

a3271d7

Signed-off-by: tison <wander4096@gmail.com>

impl ser

81ba5af

Signed-off-by: tison <wander4096@gmail.com>

tisonkun added 2 commits December 16, 2025 16:50

impl de

a213242

Signed-off-by: tison <wander4096@gmail.com>

fine tune deserialize tags

5cc6e21

Signed-off-by: tison <wander4096@gmail.com>

tisonkun mentioned this pull request Dec 16, 2025

chore: check InsufficientData before index access #24

Merged

notfilippo reviewed Dec 16, 2025

tisonkun added 2 commits December 16, 2025 18:23

define code in one place

d90491d

Signed-off-by: tison <wander4096@gmail.com>

centralize compare logics

9497d24

Signed-off-by: tison <wander4096@gmail.com>

Xuanwo reviewed Dec 16, 2025

tisonkun added 2 commits December 16, 2025 23:14

finish serde

53b74ee

Signed-off-by: tison <wander4096@gmail.com>

enable freeze TDigestMut

7175d98

Signed-off-by: tison <wander4096@gmail.com>

add serde compat test files

b37f08b

Signed-off-by: tison <wander4096@gmail.com>

further tidy

11fee5f

Signed-off-by: tison <wander4096@gmail.com>

tisonkun mentioned this pull request Dec 17, 2025

Compress TDigest can generate centroid with NaN mean apache/datasketches-java#702

Open

best effort avoid NaN

bebd87c

Signed-off-by: tison <wander4096@gmail.com>

tisonkun added 2 commits December 17, 2025 10:13

fixup! best effort avoid NaN

243dc28

Signed-off-by: tison <wander4096@gmail.com>

concrete tag

2a4ad3d

Signed-off-by: tison <wander4096@gmail.com>

Merge branch 'main' into tdigests

ab73d58

tisonkun mentioned this pull request Dec 17, 2025

Serde incompatible with C++ TDigest apache/datasketches-java#701

Closed

tisonkun added 2 commits December 18, 2025 07:19

filter invalid inputs

ddbe0e2

Signed-off-by: tison <wander4096@gmail.com>

weight nonzero and should not overflow

2f61d4f

Signed-off-by: tison <wander4096@gmail.com>

other_mean - self_mean may produce inf

b35cdb2

Signed-off-by: tison <wander4096@gmail.com>

leerho requested changes Dec 19, 2025

Xuanwo approved these changes Dec 19, 2025

tisonkun added 3 commits December 19, 2025 13:07

Merge branch 'main' into tdigests

caffa5a

no need for checking in sk files now

743ede9

Signed-off-by: tison <wander4096@gmail.com>

reuse test data loading logics

1f0ce3e

Signed-off-by: tison <wander4096@gmail.com>

tisonkun force-pushed the tdigests branch from fbeb28e to 1f0ce3e Compare December 19, 2025 05:15

tisonkun requested a review from leerho December 19, 2025 05:15
