Add likelihood-informed data processors #376

Open
ArneBouillon wants to merge 19 commits into main from ab/likelihood-informed

@ArneBouillon (Collaborator) commented Aug 20, 2025

Implement the likelihood-informed data processor from our Overleaf.

This PR makes the following changes.

  • We add a LikelihoodInformed data processor, which combines input data, output data and the actual inverse problem to find good reduced spaces.
  • We add an (undocumented and inconvenient to access) option to construct a Decorrelator that uses a fixed reduced-space dimension, instead of calculating it based on a variance threshold. This helps primarily when testing or doing comparisons.
  • We add a machine learning tool that simply calls a user-defined function on the decoded input and encodes the result again (closes #380, "Add a machine learning tool for user-defined functions").
    • This is currently in examples/DimensionReduction/emulate_sample_linlinexp.jl. I feel like we need to make it a bit more versatile and less hacky before extracting it from there.
    • @odunbar what do you think?
  • We add an example comparing Decorrelator + LikelihoodInformed to just Decorrelator.
    • TODO: I'm working on more experiments.

Big TODO: The error estimate for output space reduction when α ≠ 0 is way off, which would result in high errors when using retain_KL as a criterion to find a reduced space. I'm still looking into why this is.

There are some test failures that I still have to look at, but I think this PR is ready to review.

@ArneBouillon ArneBouillon changed the title Ab/likelihood informed Add likelihood-informed data processors Aug 20, 2025
codecov bot commented Aug 20, 2025

Codecov Report

❌ Patch coverage is 7.76699% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.99%. Comparing base (b102f1f) to head (63829f4).

Files with missing lines               Patch %   Lines
src/Utilities/likelihood_informed.jl   0.00%     93 Missing ⚠️
src/Utilities.jl                       50.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #376      +/-   ##
==========================================
- Coverage   93.98%   88.99%   -5.00%     
==========================================
  Files          10       11       +1     
  Lines        1630     1717      +87     
==========================================
- Hits         1532     1528       -4     
- Misses         98      189      +91     

Base automatically changed from ab/struct-mat-dict to main August 25, 2025 20:10
@ArneBouillon ArneBouillon force-pushed the ab/likelihood-informed branch from 11f527a to 63829f4 Compare October 17, 2025 15:21
@ArneBouillon ArneBouillon force-pushed the ab/likelihood-informed branch from 01687ce to 208a7e0 Compare November 20, 2025 09:19
@ArneBouillon ArneBouillon requested a review from odunbar November 20, 2025 16:48
@ArneBouillon ArneBouillon marked this pull request as ready for review November 20, 2025 16:48
@odunbar (Member) left a comment
So far this looks great.

I can see where you butted up against missing features/API. I'm happy to have these resolved here:

  • The dim criterion for decorrelator
  • The No-Emulator MachineLearningTool
  • encode_data/decode_data for Vector types

These can maybe be resolved later:

  • Parallel MCMC: I have created an issue to expand this so you don't have to reach into internals to do parallel MCMC. We can resolve this in another PR if that's useful.
  • The EKP scheduler with a checkpoint: we can create a PR for this in EKP (which may avoid the issue below).

Regarding the examples:

  • I had to rebase this branch onto main to get the MCMC working again.
  • It is a little hard to interpret what the final result of emulate_sample_linlinexp.jl shows. It also seemed that to run emulate_sample you always had to run calibrate in the same session (I wonder if this has something to do with the CheckpointScheduler definition).
  • I haven't yet run lorenz.

end

all_errs[dim_i, :] =
[norm(post_means[:, 2] - v) / norm(post_means[:, 2]) for v in eachcol(post_means[:, 3:end])]'
@odunbar (Member) commented Dec 3, 2025
Why is the reference emulator not used here? As written, this only shows the size of the difference from a PCA-truncation mean, rather than comparing both the truncated-PCA and likelihood approaches to a true reference.

PS. Running the example gave:

 0.771104  0.724079  0.686967  0.675671  0.720269  0.687709
 0.397877  0.567556  0.629621  0.656653  0.666858  0.66542
 0.40432   0.393665  0.350307  0.287113  0.297794  0.238406
 0.436975  0.370324  0.34126   0.278414  0.262382  0.233595
 0.376547  0.295315  0.151808  0.147098  0.094317  0.118197
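
A minimal sketch of the comparison asked for above, assuming the posterior mean of a reference run is available as a vector and each reduced-space result is a column of a matrix. The names `relative_errors`, `reference` and `candidates` are illustrative and do not come from the example script.

```julia
using LinearAlgebra

# Relative error of each candidate column with respect to one reference vector.
# All names here are hypothetical, chosen only for illustration.
function relative_errors(reference::AbstractVector, candidates::AbstractMatrix)
    ref_norm = norm(reference)
    return [norm(reference - c) / ref_norm for c in eachcol(candidates)]
end

# Toy data: the first candidate equals the reference, the second is all zeros.
reference = [3.0, 4.0]
candidates = [3.0 0.0; 4.0 0.0]
errs = relative_errors(reference, candidates)
```

With this layout, both the truncated-PCA and the likelihood-informed posterior means would be passed as candidate columns and measured against the same reference.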

chain_type = Chains,
stepsize = new_step,
discard_initial = 5_000,
)
Member commented:

I see that our framework lacks full support for parallel chains; perhaps we can address this in a future PR.

It seems we just need to allow passing args... into optimize_stepsize so that they are forwarded into sample.

Member commented:

I have created an issue to this effect: #389.

Apply the `CanonicalCorrelation` encoder, on a columns-are-data matrix or a data vector
"""
- function encode_data(cc::CanonicalCorrelation, data::MM) where {MM <: AbstractMatrix}
+ function encode_data(cc::CanonicalCorrelation, data::MorV) where {MorV <: Union{AbstractMatrix, AbstractVector}}
Member commented:

In general, in the EKP / CES ecosystem, data is treated internally as columns and so will always be provided as a matrix. However, since this may be used as part of the external API, we can allow vectors too!
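
One way to support both input shapes is to keep the matrix method as the single implementation and route vectors through it as one-column matrices. The sketch below assumes a toy encoder that is just a fixed matrix; `ToyEncoder` and `toy_encode` are hypothetical names, not the package's `CanonicalCorrelation` or `encode_data`.

```julia
# Hypothetical stand-in for an encoder: a fixed linear map.
struct ToyEncoder{M <: AbstractMatrix}
    encoder_mat::M
end

# Matrix method: each column is one data point.
toy_encode(enc::ToyEncoder, data::AbstractMatrix) = enc.encoder_mat * data

# Vector method: treat the vector as a single column, then return a vector again.
toy_encode(enc::ToyEncoder, data::AbstractVector) = vec(toy_encode(enc, reshape(data, :, 1)))

enc = ToyEncoder([1.0 0.0 0.0; 0.0 1.0 0.0])
encoded_vec = toy_encode(enc, [1.0, 2.0, 3.0])
encoded_mat = toy_encode(enc, [1.0 4.0; 2.0 5.0; 3.0 6.0])
```

Separate methods keep the return type matched to the input type, so a caller passing a vector gets a vector back.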

# Fields
$(TYPEDFIELDS)
"""
mutable struct LikelihoodInformed{FT <: Real} <: PairedDataContainerProcessor
Member commented:

In general, in the package we have never used mutable structs, only immutable ones. That is perhaps annoying, but for consistency could we stick with immutable for now and do a larger change later (sorry)?

I also saw a couple of issues like this, where the Union{Nothing, X} fields do weird things before mutation. Not a big deal, but it happens quite commonly.

Member commented:

I'm not gonna die on this hill, however. If it is a pain to make this immutable then I wouldn't worry too much about it. My guess is that it just means the stored objects become vectors that you push to.
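
A minimal sketch of that pattern, assuming a single matrix that is only known after construction. The struct stays immutable and the field is a vector that starts empty and is pushed to later. `ToyProcessor`, `initialize!` and `is_initialized` are hypothetical names and do not reflect the PR's actual fields.

```julia
# Hypothetical immutable processor; the field layout is illustrative only.
struct ToyProcessor{FT <: Real}
    # Empty until initialized, then holds exactly one matrix.
    encoder_mat::Vector{Matrix{FT}}
end

# Convenience constructor that starts with an empty slot.
ToyProcessor{FT}() where {FT <: Real} = ToyProcessor{FT}(Matrix{FT}[])

is_initialized(p::ToyProcessor) = !isempty(p.encoder_mat)

# The struct binding never changes; only the contents of the vector do.
function initialize!(p::ToyProcessor{FT}, mat::AbstractMatrix) where {FT}
    empty!(p.encoder_mat)
    push!(p.encoder_mat, Matrix{FT}(mat))
    return p
end

p = ToyProcessor{Float64}()
was_initialized_before = is_initialized(p)
initialize!(p, [1 0; 0 1])
```

This also avoids a `Union{Nothing, X}` field, since "not yet set" is represented by an empty vector.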

grad = (samples_out .- mean(samples_out; dims = 2)) / (samples_in .- mean(samples_in; dims = 2))
fill(grad, size(samples_in, 2))
else
@assert li.grad_type == :localsl
Member commented:

Could we have useful throw(ArgumentError(...))s for the user instead of asserts? In our experience it is much better to tell the user how to provide the correct args than to have asserts.
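
A short sketch of what such a check might look like, assuming an option passed as a `Symbol`. The function `check_option` and the accepted values `:option_a` and `:option_b` are placeholders, not the processor's real options.

```julia
# Hypothetical validation helper; the allowed values are placeholders.
function check_option(opt::Symbol; allowed = (:option_a, :option_b))
    if !(opt in allowed)
        # Name the bad value and list the accepted ones so the user can fix the call.
        throw(ArgumentError("Unknown option :$(opt). Expected one of $(allowed)."))
    end
    return opt
end

# Capture the exception to show what a caller would see for a bad value.
err = try
    check_option(:typo)
catch e
    e
end
```

Unlike `@assert`, which Julia's documentation notes may be disabled at some optimization levels, an explicit `throw` always runs and carries a message aimed at the user.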

sortby = (-),
)
else
@assert apply_to == "out" && α ≈ 0 && obs_whitened
Member commented:

Likewise for this assert statement, as above, and for the others below.


num_trials = 1
for trial in 1:num_trials
loaded = load("datafiles/ekp_linlinexp_$(trial).jld2")
Member commented:

I found that loading this in a new session fails; the only time it worked was when I ran both calibrate and emulate_sample in the same session.

Member commented:

I wonder if this is because the CheckpointScheduler must be defined in the session where you load the EKP object

Development

Successfully merging this pull request may close these issues.

  • Add a machine learning tool for user-defined functions
  • Add new utilities for the likelihood-informed DataProcessor

2 participants