Add likelihood-informed data processors by ArneBouillon · Pull Request #376 · CliMA/CalibrateEmulateSample.jl

ArneBouillon · 2025-08-20T15:53:48Z

Implement the likelihood-informed data processor from our Overleaf

This PR makes the following changes.

- We add a LikelihoodInformed data processor, which combines input data, output data and the actual inverse problem to find good reduced spaces.
- We add an (undocumented and inconvenient to access) option to construct a Decorrelator that uses a fixed reduced-space dimension instead of calculating it from a variance threshold. This helps primarily when testing or doing comparisons.
- We add a machine learning tool that simply calls a user-defined function on the decoded input and encodes the result again (closes #380, "Add a machine learning tool for user-defined functions").
  - This currently lives in examples/DimensionReduction/emulate_sample_linlinexp.jl. I feel like we need to make it a bit more versatile and less hacky before extracting it from there.
  - @odunbar what do you think?
- We add an example comparing Decorrelator + LikelihoodInformed to just Decorrelator.
  - TODO: I'm working on more experiments.
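
To illustrate the difference between the two Decorrelator criteria mentioned above, here is a minimal sketch. The helper `reduced_dim` and its keywords `retain_var` and `dim` are hypothetical names chosen for this example; they are not the Decorrelator's actual API.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper (not the package API): choose how many principal
# directions to keep, either from a variance threshold or from a fixed count.
function reduced_dim(singular_values::AbstractVector; retain_var = 0.99, dim = nothing)
    if !isnothing(dim)
        # Fixed dimension: ignore the spectrum, but never exceed what is available.
        return min(dim, length(singular_values))
    end
    # Variance threshold: smallest k whose cumulative variance fraction reaches retain_var.
    var_fraction = cumsum(singular_values .^ 2) ./ sum(singular_values .^ 2)
    return something(findfirst(>=(retain_var), var_fraction), length(singular_values))
end

# Columns are data samples.
data = [1.0 2.0 3.0 4.0; 2.0 4.1 5.9 8.0; 0.5 0.4 0.6 0.5]
centered = data .- mean(data; dims = 2)
sv = svd(centered).S

k_threshold = reduced_dim(sv; retain_var = 0.99)  # depends on the data
k_fixed = reduced_dim(sv; dim = 2)                # always 2 here
```

A fixed dimension makes runs comparable across processors, which is why it is mainly useful for tests and comparisons.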

Big TODO: The error estimate for output space reduction when α ≠ 0 is way off, which would result in high errors when using retain_KL as a criterion to find a reduced space. I'm still looking into why this is.

There are some test failures that I still have to look at, but I think this PR is ready to review.

codecov · 2025-08-20T16:26:13Z

Codecov Report

❌ Patch coverage is 7.76699% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.99%. Comparing base (b102f1f) to head (63829f4).

Files with missing lines	Patch %	Lines
src/Utilities/likelihood_informed.jl	0.00%	93 Missing ⚠️
src/Utilities.jl	50.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #376      +/-   ##
==========================================
- Coverage   93.98%   88.99%   -5.00%     
==========================================
  Files          10       11       +1     
  Lines        1630     1717      +87     
==========================================
- Hits         1532     1528       -4     
- Misses         98      189      +91


odunbar

So far this looks great.

I can see where you ran up against missing features/API. I'm happy to have these resolved here:

- The dim criterion for the Decorrelator
- The No-Emulator MachineLearningTool
- encode_data/decode_data for Vector types
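
As a rough illustration of the No-Emulator tool described in the PR (decode the input, call a user-defined function, encode the result), consider the sketch below. All names (`ToyFunctionTool`, `decode_input`, `encode_output`, `toy_predict`) and the linear toy encodings are assumptions for this example, not the PR's implementation.

```julia
using LinearAlgebra

# Hypothetical stand-in: wraps a user function that acts on decoded inputs.
struct ToyFunctionTool{F}
    f::F
end

# Toy linear encodings; in the package these would come from the encoder schedule.
decode_input(z, A_in) = A_in \ z
encode_output(y, A_out) = A_out * y

# Decode the encoded input, apply the user function, and encode the result again.
function toy_predict(tool::ToyFunctionTool, encoded_input, A_in, A_out)
    decoded = decode_input(encoded_input, A_in)
    return encode_output(tool.f(decoded), A_out)
end

A_in = [2.0 0.0; 0.0 4.0]
A_out = [0.5 0.0; 0.0 0.5]
tool = ToyFunctionTool(x -> x .^ 2)
prediction = toy_predict(tool, [2.0, 8.0], A_in, A_out)
```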

These can maybe be resolved later:

- Parallel MCMC: I have created an issue to expand this so you don't have to reach into internals to do parallel MCMC. We can resolve this in another PR if that's useful.
- The EKP scheduler with a checkpoint: we can create a PR for this in EKP (which may avoid the issue below).

Regarding the examples:

- I had to rebase this branch onto main to get the MCMC working again.
- It is a little hard to interpret what the final result of emulate_sample_linlinexp.jl shows. It also seemed that, to run emulate_sample, you always had to run calibrate in the same session (I wonder if this has something to do with the CheckpointScheduler definition).
- I haven't yet run lorenz.

odunbar · 2025-12-03T21:38:28Z

examples/DimensionReduction/emulate_sample_linlinexp.jl

+        end
+
+        all_errs[dim_i, :] =
+            [norm(post_means[:, 2] - v) / norm(post_means[:, 2]) for v in eachcol(post_means[:, 3:end])]'


Why is the reference emulator not used here? As written, this just shows the size of the difference to a PCA-truncation mean, rather than comparing both the truncated-PCA and likelihood-informed approaches to a true reference.

PS. Running the example gave:

0.771104 0.724079 0.686967 0.675671 0.720269 0.687709 0.397877 0.567556 0.629621 0.656653 0.666858 0.66542 0.40432 0.393665 0.350307 0.287113 0.297794 0.238406 0.436975 0.370324 0.34126 0.278414 0.262382 0.233595 0.376547 0.295315 0.151808 0.147098 0.094317 0.118197
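
For illustration, a comparison against a reference could look like the sketch below. The column layout (reference in column 1, reduced methods after it) and the toy numbers are assumptions made for this example; they are not taken from emulate_sample_linlinexp.jl.

```julia
using LinearAlgebra

# Toy posterior means, one column per method.
# Assumed layout: column 1 = reference, remaining columns = reduced methods.
post_means = [1.0 1.1 0.9; 2.0 2.2 2.0; 2.0 1.8 2.1]

reference = post_means[:, 1]

# Relative error of every reduced method with respect to the reference mean.
rel_errs = [norm(reference - v) / norm(reference) for v in eachcol(post_means[:, 2:end])]
```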

odunbar · 2025-12-03T21:47:09Z

examples/DimensionReduction/emulate_sample_linlinexp.jl

+                chain_type = Chains,
+                stepsize = new_step,
+                discard_initial = 5_000,
+            )


I see that our framework lacks full support for parallel chains; perhaps we can address this in a future PR.

It seems we just need to allow passing args... into optimize_stepsize, which are then forwarded to sample.

I have created an issue to this effect: #389.
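
The forwarding idea can be sketched with keyword arguments as follows. `toy_sample` and `toy_optimize_stepsize` are hypothetical stand-ins; the real `sample` and `optimize_stepsize` signatures may differ.

```julia
# Hypothetical stand-in for `sample`: just reports the settings it received.
function toy_sample(n_samples::Int; stepsize = 1.0, n_chains = 1, discard_initial = 0)
    return (; n_samples, stepsize, n_chains, discard_initial)
end

# Hypothetical stand-in for `optimize_stepsize`: any keyword it does not
# consume itself is collected in `sample_kwargs` and forwarded to `toy_sample`.
function toy_optimize_stepsize(; init_stepsize = 1.0, n_trial = 100, sample_kwargs...)
    return toy_sample(n_trial; stepsize = init_stepsize, sample_kwargs...)
end

result = toy_optimize_stepsize(; init_stepsize = 0.5, n_chains = 4, discard_initial = 10)
```

With this pattern, options for parallel chains would reach the sampler without the caller touching internals.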

odunbar · 2025-12-03T21:56:08Z

src/Utilities/canonical_correlation.jl

+Apply the `CanonicalCorrelation` encoder, on a columns-are-data matrix or a data vector
 """
-function encode_data(cc::CanonicalCorrelation, data::MM) where {MM <: AbstractMatrix}
+function encode_data(cc::CanonicalCorrelation, data::MorV) where {MorV <: Union{AbstractMatrix, AbstractVector}}


In general, in the EKP / CES ecosystem, data is internally treated as columns and will therefore always be provided as a matrix. However, as this may be used as part of the external API, I guess we can allow for vectors!
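
A minimal sketch of why one method can serve both input types: broadcasting the shift handles a single vector and a columns-are-data matrix alike. `ToyEncoder` and `toy_encode` are hypothetical and only mirror the shape of the signature in the diff.

```julia
using LinearAlgebra

# Hypothetical encoder, not the package's CanonicalCorrelation type.
struct ToyEncoder{FT <: Real}
    shift::Vector{FT}
    encoder_mat::Matrix{FT}
end

# Broadcasting `.-` subtracts the shift from a vector directly,
# and from every column when `data` is a matrix.
function toy_encode(enc::ToyEncoder, data::MorV) where {MorV <: Union{AbstractMatrix, AbstractVector}}
    return enc.encoder_mat * (data .- enc.shift)
end

enc = ToyEncoder([1.0, 2.0, 3.0], [1.0 0.0 0.0; 0.0 1.0 1.0])
x = [2.0, 2.0, 5.0]
encoded_vec = toy_encode(enc, x)                # vector in, vector out
encoded_mat = toy_encode(enc, hcat(x, x .+ 1))  # matrix in, matrix out
```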

odunbar · 2025-12-03T22:04:29Z

src/Utilities/likelihood_informed.jl

+# Fields
+$(TYPEDFIELDS)
+"""
+mutable struct LikelihoodInformed{FT <: Real} <: PairedDataContainerProcessor


In general we have never used mutable structs in the package, only immutable ones. That is perhaps annoying, but for consistency could we stick with immutable for now and make a larger change later (sorry)?

I also saw a couple of issues like this where the Union{Nothing, X} fields do weird things before mutation. Not a big deal, but it happens quite commonly.

I'm not going to die on this hill, however: if it is a pain to make this immutable, then I wouldn't worry too much about it. My guess is that it just turns the stored objects into vectors that you push to.
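
The "vectors that you push to" pattern could look roughly like this. `ToyProcessor`, `initialize!` and `get_encoder_mat` are hypothetical names for this sketch and do not reflect the fields of LikelihoodInformed.

```julia
# Hypothetical processor: the struct is immutable, but its vector field can
# still be filled in after construction.
struct ToyProcessor{FT <: Real}
    retain_fraction::FT
    # Empty until the processor has been initialized; replaces a Union{Nothing, Matrix} field.
    encoder_mat::Vector{Matrix{FT}}
end

ToyProcessor(retain_fraction::FT) where {FT <: Real} = ToyProcessor(retain_fraction, Matrix{FT}[])

function initialize!(p::ToyProcessor{FT}, mat::AbstractMatrix) where {FT}
    empty!(p.encoder_mat)                  # allow re-initialization
    push!(p.encoder_mat, Matrix{FT}(mat))  # store the computed matrix
    return p
end

get_encoder_mat(p::ToyProcessor) = isempty(p.encoder_mat) ? nothing : first(p.encoder_mat)

p = ToyProcessor(0.9)
before = get_encoder_mat(p)
initialize!(p, [1 0; 0 1])
after = get_encoder_mat(p)
```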

odunbar · 2025-12-03T22:09:24Z

src/Utilities/likelihood_informed.jl

+            grad = (samples_out .- mean(samples_out; dims = 2)) / (samples_in .- mean(samples_in; dims = 2))
+            fill(grad, size(samples_in, 2))
+        else
+            @assert li.grad_type == :localsl


Could we have useful throw(ArgumentError(...))s for the user instead of asserts? In our experience it is much better to tell the user how to provide the correct args than to have asserts.
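
A sketch of what such a check might look like is given below. The helper `check_grad_type` is hypothetical; `:localsl` appears in the snippet under review, while `:linreg` is a made-up placeholder for the other supported option.

```julia
# Hypothetical validation helper. `:linreg` is a placeholder name, not
# necessarily a value the PR supports.
function check_grad_type(grad_type::Symbol; supported = (:linreg, :localsl))
    if !(grad_type in supported)
        throw(ArgumentError(
            "Unsupported grad_type :$(grad_type). Expected one of $(supported).",
        ))
    end
    return grad_type
end

ok = check_grad_type(:localsl)

# An invalid value produces an actionable error instead of a bare assertion failure.
err = try
    check_grad_type(:foo)
    nothing
catch e
    e
end
```

Unlike `@assert`, which the Julia documentation notes may be disabled at some optimization levels, an explicit `throw` always runs and carries a message aimed at the user.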

odunbar · 2025-12-03T22:09:41Z

src/Utilities/likelihood_informed.jl

+                    sortby = (-),
+                )
+            else
+                @assert apply_to == "out" && α ≈ 0 && obs_whitened


Likewise for the assert statement above, and the others below.

odunbar · 2025-12-03T22:19:18Z

examples/DimensionReduction/emulate_sample_linlinexp.jl

+
+num_trials = 1
+for trial in 1:num_trials
+    loaded = load("datafiles/ekp_linlinexp_$(trial).jld2")


I found that loading this in a new session fails; it seems the only time it worked was when I ran both calibrate and emulate_sample in the same session.

I wonder if this is because the CheckpointScheduler must be defined in the session where you load the EKP object.

ArneBouillon changed the title ~~Ab/likelihood informed~~ Add likelihood-informed data processors Aug 20, 2025

Base automatically changed from ab/struct-mat-dict to main August 25, 2025 20:10

ArneBouillon force-pushed the ab/likelihood-informed branch from 11f527a to 63829f4 Compare October 17, 2025 15:21

ArneBouillon added 10 commits November 20, 2025 10:19

Add first draft of likelihood-informed data processor

0770a29

Fix bugs

fd275cd

Separate spatial-dep Lorenz into its own file

c2adc44

Start adding example

73b42db

Fix typo

6ba5028

Add option to pass reduced dimension to Decorrelator

500b740

Improve docs and variable names

7f5e643

Pass encoder schedule to MLTs to enable NoEmulation

143ce90

Fix likelihood-informed bugs

00968d0

Update tests

208a7e0

ArneBouillon force-pushed the ab/likelihood-informed branch from 01687ce to 208a7e0 Compare November 20, 2025 09:19

ArneBouillon added 6 commits November 20, 2025 10:32

Use correct matrix

4b2abdf

Add/update documentation

f44b3d4

Receive encoder_schedule in all MLTs

9d46c8b

Update and add tests

2108780

Fix bugs in likelihood-informed processor for the output space

cc6ca15

Format

0ff1038

ArneBouillon requested a review from odunbar November 20, 2025 16:48

ArneBouillon marked this pull request as ready for review November 20, 2025 16:48

ArneBouillon added 3 commits November 21, 2025 16:35

Fix minor bugs

7424ac1

Don't do implicit whitening

fb46967

Fix gradient

5b6e6f9

odunbar reviewed Dec 3, 2025

Add likelihood-informed data processors #376
ArneBouillon wants to merge 19 commits into main from ab/likelihood-informed


2 participants
