Optimize DOI event performance with bulk-preloading by jrhoads · Pull Request #1449 · datacite/lupo

jrhoads · 2026-01-13T21:45:02Z

Purpose

This PR addresses N+1 query performance issues when fetching or importing DOIs that have associated event data (citations, views, downloads, references, etc.). Previously, each relationship type for every DOI in a collection triggered a separate database query.

Approach

The optimization introduces a "bulk-preload" strategy for events:

EventsPreloader Service: A new service that fetches all relevant events for a batch of DOIs in just 1 or 2 queries.
PreloadedEventRelation Concern: A wrapper class that mimics the ActiveRecord::Relation API (supporting pluck, map, select, etc.), allowing existing code to operate on in-memory arrays as if they were database relations.
Model Integration: The Doi model’s association methods (e.g., view_events, citation_events) are overridden to check if preloaded_events is present. If so, they filter the in-memory array; otherwise, they fall back to the standard database scope.
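
The fallback described under Model Integration can be sketched in plain Ruby. This is a minimal illustration, not the PR's code: `StoredEvent`, `FakeEventStore`, and `DoiWithPreload` are hypothetical stand-ins for the ActiveRecord models, the `"views"` relation type is an assumed value, and a `nil` check is assumed to be how "not preloaded" is detected.

```ruby
# Hypothetical stand-ins for the ActiveRecord models (assumptions, not the PR's code).
StoredEvent = Struct.new(:target_doi, :target_relation_type_id, :total, keyword_init: true)

# Simulated events table that counts how often it is "queried".
class FakeEventStore
  attr_reader :query_count

  def initialize(events)
    @events = events
    @query_count = 0
  end

  def where_target(doi, type)
    @query_count += 1
    @events.select { |e| e.target_doi == doi && e.target_relation_type_id == type }
  end
end

class DoiWithPreload
  attr_reader :doi
  attr_accessor :preloaded_events

  def initialize(doi, store)
    @doi = doi
    @store = store
    @preloaded_events = nil # nil means "nothing was preloaded"
  end

  # Filter the in-memory array when it exists; otherwise fall back to a query.
  def view_events
    return @store.where_target(doi, "views") if preloaded_events.nil?

    preloaded_events.select do |e|
      e.target_doi == doi && e.target_relation_type_id == "views"
    end
  end
end

view = StoredEvent.new(target_doi: "10.1234/A", target_relation_type_id: "views", total: 5)
download = StoredEvent.new(target_doi: "10.1234/A", target_relation_type_id: "downloads", total: 2)
store = FakeEventStore.new([view, download])

doi = DoiWithPreload.new("10.1234/A", store)
doi.view_events                        # falls back to the store: one simulated query
doi.preloaded_events = [view, download]
doi.view_events                        # filtered in memory: no further query
```

Code paths that never set `preloaded_events` keep hitting the store, which is the backward-compatibility property the description relies on.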

Key Modifications

app/services/events_preloader.rb: New service to batch-fetch events based on source_doi or target_doi.
app/models/concerns/preloaded_event_relation.rb: New wrapper for in-memory event collections to ensure backward compatibility with relation-based methods.
app/models/doi.rb: Added preloaded_events accessor and logic to switch between preloaded data and real-time database queries.
app/controllers/datacite_dois_controller.rb: Integrated preloading for the :detail view to speed up serialization.
app/models/datacite_doi.rb & app/models/other_doi.rb: Integrated preloading into import_in_bulk to optimize Elasticsearch indexing processes.
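
As a rough illustration of what the wrapper in preloaded_event_relation.rb is described as doing, the sketch below wraps an array and exposes a few relation-style methods. `InMemoryRelation` and `RelationRow` are hypothetical names, and the method bodies are assumptions about reasonable behaviour rather than the PR's implementation.

```ruby
# Minimal sketch of an array wrapper with a relation-like surface (hypothetical).
class InMemoryRelation
  include Enumerable

  def initialize(events)
    @events = Array(events)
  end

  def each(&block)
    @events.each(&block)
  end

  # Like ActiveRecord's pluck: one column gives a flat list, several give rows.
  def pluck(*columns)
    if columns.size == 1
      @events.map { |e| e.public_send(columns.first) }
    else
      @events.map { |e| columns.map { |c| e.public_send(c) } }
    end
  end

  # Return another wrapper so relation-style chaining keeps working.
  def select(&block)
    self.class.new(@events.select(&block))
  end

  def to_a
    @events.dup
  end
end

RelationRow = Struct.new(:source_id, :total)
relation = InMemoryRelation.new([RelationRow.new("crossref", 3), RelationRow.new("datacite-usage", 7)])

relation.pluck(:total)                                  # => [3, 7]
relation.select { |e| e.total > 5 }.pluck(:source_id)   # => ["datacite-usage"]
relation.pluck(:source_id, :total)                      # => [["crossref", 3], ["datacite-usage", 7]]
```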

Important Technical Details

Memory vs. DB: The preloader maps events to DOI objects in memory. This significantly reduces high-latency database round trips but slightly increases memory usage for very large batches (mitigated by processing DOIs in chunks of CHUNK_SIZE = 1000).
Case Sensitivity: The preloader handles DOI matching using .upcase to ensure consistency between database records and memory mapping.
Transparency: If preloaded_events is never initialized, the application behaves exactly as before, ensuring safety for code paths not yet optimized.
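
The chunking and `.upcase` matching described above can be sketched as follows. This is an assumption-laden illustration: `ChunkEvent`, `ChunkDoi`, `ChunkedPreloader`, and the injected `fetcher` callable are hypothetical, and the real service queries the Event model rather than a lambda. Only `CHUNK_SIZE = 1000` and the use of `.upcase` come from the description.

```ruby
ChunkEvent = Struct.new(:id, :source_doi, :target_doi)
ChunkDoi = Struct.new(:doi, :preloaded_events)

class ChunkedPreloader
  CHUNK_SIZE = 1000 # value quoted in the PR description

  attr_reader :query_count

  # fetcher: callable taking an array of upcased DOIs and returning matching
  # events (hypothetical stand-in for the database query).
  def initialize(dois, fetcher, chunk_size: CHUNK_SIZE)
    @fetcher = fetcher
    @chunk_size = chunk_size
    @query_count = 0
    @doi_map = {}
    dois.each do |doi|
      doi.preloaded_events ||= []
      @doi_map[doi.doi.upcase] = doi # normalise case once, when building the map
    end
  end

  def preload!
    events = @doi_map.keys.each_slice(@chunk_size).flat_map do |chunk|
      @query_count += 1
      @fetcher.call(chunk)
    end
    # An event linking DOIs in two different chunks is fetched twice; keep one copy.
    events.uniq(&:id).each { |event| distribute(event) }
    self
  end

  private

  def distribute(event)
    keys = [event.source_doi, event.target_doi].compact.map(&:upcase).uniq
    keys.each do |key|
      doi = @doi_map[key]
      doi.preloaded_events << event if doi
    end
  end
end

all_events = [
  ChunkEvent.new(1, "10.1234/abc", "10.5678/XYZ"),
  ChunkEvent.new(2, "10.9999/other", "10.1234/ABC"),
]
fetcher = lambda do |upcased_dois|
  all_events.select do |e|
    upcased_dois.include?(e.source_doi.upcase) || upcased_dois.include?(e.target_doi.upcase)
  end
end

dois = [ChunkDoi.new("10.1234/ABC"), ChunkDoi.new("10.5678/xyz")]
preloader = ChunkedPreloader.new(dois, fetcher, chunk_size: 1).preload!
dois.map { |d| d.preloaded_events.map(&:id) } # => [[1, 2], [1]]
preloader.query_count                         # => 2
```

With the default chunk size the same two DOIs would be covered by a single fetch. The tiny `chunk_size: 1` is used here only to exercise the cross-chunk case.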

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Reviewer, please remember our guidelines:

Be humble in the language and feedback you give, ask don't tell.
Consider using positive language as opposed to neutral when offering feedback. This is to avoid the negative bias that can occur with neutral language appearing negative.
Offer suggestions on how to improve code e.g. simplification or expanding clarity.
Ensure you give reasons for the changes you are proposing.

Summary by CodeRabbit

Release Notes

Performance
- Optimized event loading to improve performance when displaying related citations, references, and other event data.
New Features
- Added provider ID filtering capability for advanced search and query refinement.

coderabbitai · 2026-02-13T09:46:26Z

📝 Walkthrough

This pull request introduces an event preloading optimization system comprising a new EventsPreloader service, a PreloadedEventRelation wrapper class for in-memory event collections, and model-level integration to cache preloaded events. Updates to DataciteDoi, OtherDoi, and Doi models shift event loading from eager-inclusion to batch preloading within import pipelines. The DataciteDoisController#show conditionally preloads detail data. A provider_id filter is added to ReferenceRepository. Comprehensive test coverage validates preloading behavior, backward compatibility, and performance characteristics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Event Preloading Service `app/services/events_preloader.rb` | New service class that batch-preloads Event records for a set of DOIs in chunks, distributing preloaded events to corresponding DOI objects via preloaded_events array. |
| Event Relation Wrapper `app/models/concerns/preloaded_event_relation.rb` | New concern introducing PreloadedEventRelation to adapt in-memory event arrays to behave like ActiveRecord::Relation with delegation of Enumerable methods (pluck, map, select, sort_by, etc.). |
| DOI Model Updates `app/models/doi.rb`, `app/models/datacite_doi.rb`, `app/models/other_doi.rb` | Added preloaded_events accessor and event-type helper methods (view_events, download_events, citation_events, etc.) that prefer preloaded data when available; updated import_in_bulk to use batch preloading instead of per-association eager loading. |
| Controller Updates `app/controllers/datacite_dois_controller.rb` | Added EventsPreloader invocation in `#show` when detail parameter is enabled to preload related events. |
| Repository Enhancement `app/models/reference_repository.rb` | Added case-insensitive provider_id filter to Elasticsearch query builder in `#filter` method. |
| Test Coverage `spec/models/preloaded_event_relation_spec.rb`, `spec/services/events_preloader_spec.rb`, `spec/models/doi_related_spec.rb`, `spec/requests/repositories_spec.rb` | Comprehensive RSpec suites validating PreloadedEventRelation delegation methods, EventsPreloader batch logic, event preloading in import pipelines, backward compatibility with preloaded_events, and ReferenceRepository index resets. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.13% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Merge Conflict Detection | ⚠️ Warning | ⚠️ Unable to check for merge conflicts: Failed to fetch base branch: From https://github.com/datacite/lupo ! [rejected] master -> master (non-fast-forward) + `1ec7c31...e07b4d9` master -> origin/master (forced update) | |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Optimize DOI event performance with bulk-preloading' directly and concisely summarizes the main objective of the PR: introducing a bulk-preload strategy to optimize event performance for DOIs. |

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@app/controllers/datacite_dois_controller.rb`:
- Around line 452-454: The guard `params[:detail] != false` is always truthy
because params are strings, so remove the misleading conditional and always
preload: call EventsPreloader.new([doi]).preload! unconditionally (keep the
existing comment about preloading) so the intent matches behavior;
alternatively, if you want to preserve conditional behavior, change the check to
string comparison (e.g., params[:detail] != "false" or
ActiveModel::Type::Boolean.new.cast(params[:detail]) ) but the simpler fix is to
drop the `if` and always invoke EventsPreloader.new([doi]).preload!.

In `@app/services/events_preloader.rb`:
- Around line 37-49: When distributing events in the all_events loop, avoid
appending the same event twice when event.source_doi == event.target_doi by
checking for equality before pushing to the same DOI's preloaded_events: look up
source_doi_obj and target_doi_obj from `@doi_map` (used in the current loop), and
if they refer to the same object, append the event only once (otherwise append
to each if present); adjust the logic around source_doi_obj.preloaded_events <<
event and target_doi_obj.preloaded_events << event to perform this dedupe during
distribution (or add a single guard to skip the second append when source and
target DOIs match).

🧹 Nitpick comments (7)

app/models/concerns/preloaded_event_relation.rb (2)
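69-71: inject override doesn't support every call form.

Ruby's Enumerable#inject supports several call forms: inject(:+), inject(0, :+), inject { |sum, x| sum + x }, and inject(0) { |sum, x| sum + x }. This override declares a single optional initial argument and always forwards it to Array#inject. As a result, the block-only form relation.inject { |sum, x| sum + x } passes an explicit nil as the initial value and raises NoMethodError on the first nil + x, and the two-argument form relation.inject(0, :+) raises ArgumentError. The single-symbol form relation.inject(:+) happens to work, because :+ is forwarded as the only argument with no block.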

Currently, inject is only called on the result of pluck (which returns a plain Array), so this isn't actively broken. But it's a latent trap if the API surface grows.
♻️ Suggested fix
-  def inject(initial = nil, &block)
-    @events.inject(initial, &block)
+  def inject(*args, &block)
+    @events.inject(*args, &block)
   end
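
To make the difference between the two signatures concrete, here is a small self-contained comparison. `NarrowInject` and `SplatInject` are hypothetical stand-ins for the wrapper before and after the suggested change, and `outcome` is a helper introduced only for this sketch.

```ruby
# Wrapper with the original signature: one optional positional argument.
class NarrowInject
  def initialize(items)
    @items = items
  end

  def inject(initial = nil, &block)
    @items.inject(initial, &block)
  end
end

# Wrapper with the suggested signature: forward whatever the caller passed.
class SplatInject
  def initialize(items)
    @items = items
  end

  def inject(*args, &block)
    @items.inject(*args, &block)
  end
end

# Returns the block's value, or the exception class if it raises.
def outcome
  yield
rescue StandardError => e
  e.class
end

narrow = NarrowInject.new([1, 2, 3])
splat = SplatInject.new([1, 2, 3])

outcome { narrow.inject(:+) }                    # => 6
outcome { narrow.inject(10) { |s, x| s + x } }   # => 16
outcome { narrow.inject { |s, x| s + x } }       # => NoMethodError (nil is used as the initial value)
outcome { narrow.inject(0, :+) }                 # => ArgumentError (two arguments, one accepted)

outcome { splat.inject { |s, x| s + x } }        # => 6
outcome { splat.inject(0, :+) }                  # => 6
```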
94-96: Remove to_ary — it contradicts the design and is never used.

The to_ary method is never called explicitly in the codebase and no tests cover it. Since wrapper-preserving methods like select, compact, and uniq return new PreloadedEventRelation instances, having to_ary (which enables implicit conversion in array concatenation, Array#flatten, and destructuring assignment) contradicts the design intent. Keep only to_a for explicit conversion; splat and Array() will still work, because both fall back to to_a.
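
The effect of defining `to_ary` can be checked with two toy classes. `ExplicitOnly` and `ImplicitToo` are hypothetical names used only for this sketch; the behaviour shown is standard Ruby implicit-conversion semantics, not something specific to the PR.

```ruby
# Offers only explicit conversion.
class ExplicitOnly
  def initialize(items)
    @items = items
  end

  def to_a
    @items.dup
  end
end

# Additionally opts in to implicit conversion.
class ImplicitToo < ExplicitOnly
  def to_ary
    to_a
  end
end

explicit = ExplicitOnly.new([1, 2])
implicit = ImplicitToo.new([1, 2])

# With to_ary, Ruby silently unwraps the object in several places:
[0] + implicit            # => [0, 1, 2]
[implicit].flatten        # => [1, 2]
first, second = implicit  # first => 1, second => 2

# Without to_ary, those conversions do not happen:
[explicit].flatten        # => [explicit], the wrapper stays intact
# [0] + explicit          # raises TypeError (no implicit conversion into Array)

# Splat and Kernel#Array fall back to to_a, so they work either way:
Array(explicit)           # => [1, 2]
[*explicit]               # => [1, 2]
```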
app/services/events_preloader.rb (1)

51-54: Redundant initialization — already handled in initialize.

Line 15 in initialize already sets doi.preloaded_events ||= [] for every DOI. This block on lines 52-54 is a no-op unless a DOI's preloaded_events was somehow set to nil after initialization, which shouldn't happen in normal flow.
spec/models/preloaded_event_relation_spec.rb (2)
65-71: sort_by test doesn't verify actual reordering.

The events are already in ascending total order ([10, 20, 30]), so sort_by(&:total) produces the same order as the input. This test passes trivially without proving that sorting actually works. Consider sorting in reverse or using a differently-ordered input.
🔧 Suggested improvement
   describe "#sort_by" do
-    it "sorts events" do
-      result = relation.sort_by(&:total)
+    it "sorts events by attribute" do
+      reversed = PreloadedEventRelation.new([event3, event1, event2])
+      result = reversed.sort_by(&:total)
       expect(result).to be_a(PreloadedEventRelation)
       expect(result.to_a.map(&:total)).to eq([10, 20, 30])
     end
   end
133-145: Consider testing first/last/count on an empty relation too.

The Enumerable methods block only tests the happy path. Since PreloadedEventRelation wraps an array and delegates these, it would be good to verify edge behavior (e.g., first on empty returns nil, count returns 0).
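
For reference, the edge behaviour such tests would pin down follows directly from Array when the methods are delegated. The sketch below uses a hypothetical `CountingRelation` with Forwardable; whether the real class delegates this way is an assumption (the commit log only says first, last, and count are delegated to the events array).

```ruby
require "forwardable"

# Hypothetical minimal wrapper that delegates first, last, and count.
class CountingRelation
  extend Forwardable
  def_delegators :@events, :first, :last, :count

  def initialize(events)
    @events = Array(events)
  end
end

empty = CountingRelation.new([])
empty.first     # => nil
empty.last      # => nil
empty.count     # => 0
empty.first(2)  # => []

CountingRelation.new(nil).count  # => 0, because Array(nil) is []
```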
spec/services/events_preloader_spec.rb (1)
89-107: Large batch test creates 1100+ DB records — significant test suite slowdown.

create_list(:doi, EventsPreloader::CHUNK_SIZE + 100, ...) with CHUNK_SIZE = 1000 inserts 1100 DOI records plus associated data. This will be very slow (potentially minutes) and could dominate the test suite runtime.

Consider using build_stubbed_list or mocking the chunking behavior instead of hitting the database, since the test only verifies that no error is raised and preloaded_events is not nil.
♻️ Suggested alternative approach
     context "with large batch" do
       it "chunks large DOI lists" do
-        # Create more than CHUNK_SIZE DOIs
-        large_batch = create_list(:doi, EventsPreloader::CHUNK_SIZE + 100, client: client, aasm_state: "findable")
-
-        # Create events for some of them
-        create(:event_for_crossref, {
-          subj_id: "https://doi.org/#{large_batch.first.doi}",
-          obj_id: "https://doi.org/#{large_batch.last.doi}",
-          relation_type_id: "references",
-        })
-
-        expect {
-          EventsPreloader.new(large_batch).preload!
-        }.not_to raise_error
-
-        expect(large_batch.first.preloaded_events).not_to be_nil
+        # Verify chunking by checking that Event.where is called multiple times
+        large_batch = build_stubbed_list(:doi, EventsPreloader::CHUNK_SIZE + 100, client: client)
+        large_batch.each { |d| allow(d).to receive(:preloaded_events=) }
+        large_batch.each { |d| allow(d).to receive(:preloaded_events).and_return([]) }
+
+        relation = double("relation", to_a: [])
+        expect(Event).to receive(:where).twice.and_return(relation)
+
+        EventsPreloader.new(large_batch).preload!
       end
     end
spec/models/doi_related_spec.rb (1)
42-65: Test doesn't actually verify preloading — it verifies events via a fresh DB query.

fresh_doi1 = DataciteDoi.find(doi1.id) creates a new instance with preloaded_events = nil, so reference_events and citation_events fall back to database queries. This test confirms events exist in the DB but does not validate that EventsPreloader was used during import_in_bulk.

To verify preloading actually occurred, consider spying on EventsPreloader or asserting on the in-batch DOI objects:
♻️ Suggested approach
-        allow(DataciteDoi).to receive(:upload_to_elasticsearch)
-
-        DataciteDoi.import_in_bulk(ids)
-
-        # Verify that events were preloaded (check via a fresh query)
-        fresh_doi1 = DataciteDoi.find(doi1.id)
-        expect(fresh_doi1.reference_events.count).to eq(1)
-        expect(fresh_doi1.citation_events.count).to eq(1)
+        preloader_spy = instance_double(EventsPreloader, preload!: nil)
+        expect(EventsPreloader).to receive(:new).and_return(preloader_spy)
+        expect(preloader_spy).to receive(:preload!).once
+
+        allow(DataciteDoi).to receive(:upload_to_elasticsearch)
+
+        DataciteDoi.import_in_bulk(ids)

coderabbitai · 2026-02-13T09:50:52Z

app/controllers/datacite_dois_controller.rb

+        # Preload events if we are going to show details
+        # This optimizes the serializer which accesses part_events, citation_events, etc.
+        EventsPreloader.new([doi]).preload! if params[:detail] != false


⚠️ Potential issue | 🟡 Minor

Condition params[:detail] != false is always true — params are strings, not booleans.

params[:detail] will be nil, "true", or "false" (a string), never the boolean false. So params[:detail] != false is always truthy, making the guard a no-op. The preloader runs unconditionally.

Since detail is hardcoded to true on line 461, this is functionally correct (preloading is always wanted here), but the conditional is misleading.
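
The string-versus-boolean pitfall is easy to reproduce outside Rails. In the sketch below, `params` is a plain Hash standing in for request parameters, and `detail_requested?` is a hypothetical helper (not part of the PR) that treats a missing parameter as "details wanted", matching how the original guard behaves for nil.

```ruby
# Strings treated as false by this sketch (an assumed list, chosen for illustration).
FALSE_STRINGS = %w[0 f false off no].freeze

# Hypothetical helper: interpret a raw query-string value as a boolean.
def detail_requested?(raw)
  return true if raw.nil? # absent parameter: assume details are wanted

  !FALSE_STRINGS.include?(raw.to_s.strip.downcase)
end

params = { detail: "false" }

params[:detail] != false            # => true, a String never equals the boolean false
params[:detail] != "false"          # => false, string comparison behaves as intended
detail_requested?(params[:detail])  # => false
detail_requested?(nil)              # => true
detail_requested?("true")           # => true
```

In a Rails application, ActiveModel::Type::Boolean.new.cast, mentioned in the review prompt, serves a similar purpose. The helper above only avoids the dependency for the sake of a standalone example.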

🐛 Simplify to unconditional preload or fix the string comparison

- EventsPreloader.new([doi]).preload! if params[:detail] != false
+ # Preload events to optimize serializer access to event-derived fields
+ EventsPreloader.new([doi]).preload!


coderabbitai · 2026-02-13T09:50:52Z

app/services/events_preloader.rb

+    all_events.each do |event|
+      # Add event to source DOI's preloaded_events if it matches
+      if event.source_doi.present?
+        source_doi_obj = @doi_map[event.source_doi.upcase]
+        source_doi_obj.preloaded_events << event if source_doi_obj
+      end
+
+      # Add event to target DOI's preloaded_events if it matches
+      if event.target_doi.present?
+        target_doi_obj = @doi_map[event.target_doi.upcase]
+        target_doi_obj.preloaded_events << event if target_doi_obj
+      end
+    end


⚠️ Potential issue | 🟡 Minor

Duplicate events when source_doi == target_doi.

If an event has source_doi == target_doi (same DOI), the event is appended twice to that DOI's preloaded_events array. Downstream filter methods in doi.rb (e.g., citation_events) would then return the event twice, inflating counts.

🐛 Proposed fix — deduplicate during distribution

 # Group events by DOI and assign to each Doi object
 all_events.each do |event|
   # Add event to source DOI's preloaded_events if it matches
   if event.source_doi.present?
     source_doi_obj = @doi_map[event.source_doi.upcase]
     source_doi_obj.preloaded_events << event if source_doi_obj
   end

   # Add event to target DOI's preloaded_events if it matches
   if event.target_doi.present?
     target_doi_obj = @doi_map[event.target_doi.upcase]
-    target_doi_obj.preloaded_events << event if target_doi_obj
+    if target_doi_obj && target_doi_obj != source_doi_obj
+      target_doi_obj.preloaded_events << event
+    end
   end
 end

Alternatively, if an event legitimately needs to appear in both source-based and target-based filters for the same DOI, the dedup should happen at the filter level in doi.rb instead.
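
A standalone sketch of the distribution-time guard, using hypothetical `LinkEvent` and `LinkDoi` structs and a free-standing `distribute_events` method in place of the service's loop. Plain truthiness checks replace Rails' `present?`, and object identity (`equal?`) is assumed to be an acceptable way to detect "same DOI object".

```ruby
LinkEvent = Struct.new(:id, :source_doi, :target_doi)
LinkDoi = Struct.new(:doi, :preloaded_events)

# Append each event to its source and target DOI, but only once when both
# sides resolve to the same DOI object.
def distribute_events(events, doi_map)
  events.each do |event|
    source = doi_map[event.source_doi.upcase] if event.source_doi
    target = doi_map[event.target_doi.upcase] if event.target_doi

    source.preloaded_events << event if source
    target.preloaded_events << event if target && !target.equal?(source)
  end
end

a = LinkDoi.new("10.1234/A", [])
b = LinkDoi.new("10.1234/B", [])
doi_map = { "10.1234/A" => a, "10.1234/B" => b }

self_link = LinkEvent.new(1, "10.1234/a", "10.1234/A") # source and target are the same DOI
cross_link = LinkEvent.new(2, "10.1234/A", "10.1234/B")

distribute_events([self_link, cross_link], doi_map)
a.preloaded_events.map(&:id) # => [1, 2]
b.preloaded_events.map(&:id) # => [2]
```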

jrhoads force-pushed the events-optimization branch from 17e9f41 to 46a7e33 Compare January 14, 2026 13:21

jrhoads requested a review from codycooperross January 16, 2026 08:47

jrhoads force-pushed the events-optimization branch from 83f6c29 to 99492b7 Compare January 20, 2026 08:35

jrhoads force-pushed the events-optimization branch from 99492b7 to 13f8d15 Compare January 29, 2026 11:41

jrhoads added 9 commits February 13, 2026 10:45

perf: optimize event preloading for DOIs using EventsPreloader service

e68d524

perf: preload DOI events in show action to optimize serialization

c11566f

test: add unit tests for PreloadedEventRelation and EventsPreloader

9bc2b39

test: enhance N+1 testing and event preloading coverage in Doi model …

171d460

…specs

Appease Rubocop

61c09dc

fix: filter preloaded events by DOI and update event tests

dab3ad4

Appease Rubocop

201089c

refactor: delegate first, last, and count methods to events array

123f60b

feat: add provider_id filtering to ReferenceRepository and update tests

e998678

jrhoads force-pushed the events-optimization branch from 13f8d15 to e998678 Compare February 13, 2026 09:46

coderabbitai bot reviewed Feb 13, 2026

jrhoads wants to merge 9 commits into master from events-optimization