Optimize DOI event performance with bulk-preloading #1449

Open
jrhoads wants to merge 9 commits into master from events-optimization

@jrhoads jrhoads commented Jan 13, 2026

Purpose

This PR addresses N+1 query performance issues when fetching or importing DOIs that have associated event data (citations, views, downloads, references, etc.). Previously, each relationship type for every DOI in a collection triggered a separate database query.

Approach

The optimization introduces a "bulk-preload" strategy for events:

  1. EventsPreloader Service: A new service that fetches all relevant events for a batch of DOIs in just 1 or 2 queries.
  2. PreloadedEventRelation Concern: A wrapper class that mimics the ActiveRecord::Relation API (supporting pluck, map, select, etc.), allowing existing code to operate on in-memory arrays as if they were database relations.
  3. Model Integration: The Doi model’s association methods (e.g., view_events, citation_events) are overridden to check if preloaded_events is present. If so, they filter the in-memory array; otherwise, they fall back to the standard database scope.
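
The fallback described in step 3 can be sketched in plain Ruby as follows. This is a minimal illustration, not the PR's code: the `Event` struct, the `"references"` filter, and the `query_reference_events` stand-in for the database scope are assumptions made for the example.

```ruby
# Hypothetical stand-in for an Event record (not the PR's actual model).
Event = Struct.new(:source_doi, :target_doi, :relation_type_id)

class Doi
  attr_accessor :preloaded_events
  attr_reader :doi, :db_queries

  def initialize(doi)
    @doi = doi
    @preloaded_events = nil # nil means "never preloaded"
    @db_queries = 0
  end

  # Prefer the in-memory array when a preloader has populated it;
  # otherwise fall back to the (simulated) database scope.
  def reference_events
    if preloaded_events.nil?
      query_reference_events
    else
      preloaded_events.select do |e|
        e.source_doi.to_s.upcase == doi.upcase && e.relation_type_id == "references"
      end
    end
  end

  private

  # Placeholder for the real association query (assumed); it counts calls
  # so the fallback path is observable.
  def query_reference_events
    @db_queries += 1
    []
  end
end

doi = Doi.new("10.1234/abc")
doi.reference_events # falls back to the simulated query
doi.preloaded_events = [Event.new("10.1234/ABC", "10.5678/XYZ", "references")]
doi.reference_events # filtered in memory, no further query
```

Using `nil` rather than an empty array as the "not preloaded" marker lets a DOI with zero events still skip the database once it has been preloaded.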

Key Modifications

  • app/services/events_preloader.rb: New service to batch-fetch events based on source_doi or target_doi.
  • app/models/concerns/preloaded_event_relation.rb: New wrapper for in-memory event collections to ensure backward compatibility with relation-based methods.
  • app/models/doi.rb: Added preloaded_events accessor and logic to switch between preloaded data and real-time database queries.
  • app/controllers/datacite_dois_controller.rb: Integrated preloading for the :detail view to speed up serialization.
  • app/models/datacite_doi.rb & app/models/other_doi.rb: Integrated preloading into import_in_bulk to optimize Elasticsearch indexing processes.

Important Technical Details

  • Memory vs. DB: The preloader maps events to DOI objects in memory. This significantly reduces the number of high-latency database round trips but slightly increases memory usage for very large batches (mitigated by CHUNK_SIZE = 1000).
  • Case Sensitivity: The preloader handles DOI matching using .upcase to ensure consistency between database records and memory mapping.
  • Transparency: If preloaded_events is never initialized, the application behaves exactly as before, ensuring safety for code paths not yet optimized.
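
The chunking and case-insensitive mapping described above might look roughly like this sketch. The struct names, the `fetch_events` callable (standing in for the batched database query), and the de-duplication by `id` are assumptions for illustration; only the `CHUNK_SIZE` value and the use of `.upcase` come from the description.

```ruby
# Hypothetical stand-ins; the PR's EventsPreloader and models may differ.
EventRow = Struct.new(:id, :source_doi, :target_doi)
DoiRecord = Struct.new(:doi, :preloaded_events)

CHUNK_SIZE = 1000 # value taken from the PR description

# fetch_events is an assumed callable standing in for the batched query:
# it receives an array of upcased DOI strings and returns matching events.
def preload_events(dois, fetch_events, chunk_size: CHUNK_SIZE)
  dois.each { |d| d.preloaded_events ||= [] }
  doi_map = dois.to_h { |d| [d.doi.upcase, d] }

  # One fetch per chunk; an event whose source and target fall into
  # different chunks is fetched twice, so de-duplicate by id.
  all_events = doi_map.keys.each_slice(chunk_size)
                      .flat_map { |chunk| fetch_events.call(chunk) }
                      .uniq(&:id)

  all_events.each do |event|
    keys = [event.source_doi, event.target_doi].compact.map(&:upcase).uniq
    keys.each { |key| doi_map[key]&.preloaded_events&.push(event) }
  end
  dois
end
```

Keying the map by the upcased DOI means a record stored as `10.1/a` still matches an event that refers to `10.1/A`.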

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

  • New feature (non-breaking change which adds functionality)

  • Breaking change (fix or feature that would cause existing functionality to change)

Reviewer, please remember our guidelines:

  • Be humble in the language and feedback you give, ask don't tell.
  • Consider using positive language as opposed to neutral when offering feedback. This is to avoid the negative bias that can occur with neutral language appearing negative.
  • Offer suggestions on how to improve code e.g. simplification or expanding clarity.
  • Ensure you give reasons for the changes you are proposing.

Summary by CodeRabbit

Release Notes

  • Performance

    • Optimized event loading to improve performance when displaying related citations, references, and other event data.
  • New Features

    • Added provider ID filtering capability for advanced search and query refinement.

@jrhoads jrhoads force-pushed the events-optimization branch from 17e9f41 to 46a7e33 Compare January 14, 2026 13:21
@jrhoads jrhoads force-pushed the events-optimization branch from 83f6c29 to 99492b7 Compare January 20, 2026 08:35
@jrhoads jrhoads force-pushed the events-optimization branch from 99492b7 to 13f8d15 Compare January 29, 2026 11:41
@jrhoads jrhoads force-pushed the events-optimization branch from 13f8d15 to e998678 Compare February 13, 2026 09:46
coderabbitai bot commented Feb 13, 2026

Walkthrough

This pull request introduces an event preloading optimization system comprising a new EventsPreloader service, a PreloadedEventRelation wrapper class for in-memory event collections, and model-level integration to cache preloaded events. Updates to DataciteDoi, OtherDoi, and Doi models shift event loading from eager-inclusion to batch preloading within import pipelines. The DataciteDoisController#show conditionally preloads detail data. A provider_id filter is added to ReferenceRepository. Comprehensive test coverage validates preloading behavior, backward compatibility, and performance characteristics.

Changes

Cohort / File(s): Summary

  • Event Preloading Service (app/services/events_preloader.rb): New service class that batch-preloads Event records for a set of DOIs in chunks, distributing preloaded events to corresponding DOI objects via preloaded_events array.
  • Event Relation Wrapper (app/models/concerns/preloaded_event_relation.rb): New concern introducing PreloadedEventRelation to adapt in-memory event arrays to behave like ActiveRecord::Relation with delegation of Enumerable methods (pluck, map, select, sort_by, etc.).
  • DOI Model Updates (app/models/doi.rb, app/models/datacite_doi.rb, app/models/other_doi.rb): Added preloaded_events accessor and event-type helper methods (view_events, download_events, citation_events, etc.) that prefer preloaded data when available; updated import_in_bulk to use batch preloading instead of per-association eager loading.
  • Controller Updates (app/controllers/datacite_dois_controller.rb): Added EventsPreloader invocation in #show when detail parameter is enabled to preload related events.
  • Repository Enhancement (app/models/reference_repository.rb): Added case-insensitive provider_id filter to Elasticsearch query builder in #filter method.
  • Test Coverage (spec/models/preloaded_event_relation_spec.rb, spec/services/events_preloader_spec.rb, spec/models/doi_related_spec.rb, spec/requests/repositories_spec.rb): Comprehensive RSpec suites validating PreloadedEventRelation delegation methods, EventsPreloader batch logic, event preloading in import pipelines, backward compatibility with preloaded_events, and ReferenceRepository index resets.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 28.13%, which is insufficient. The required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Merge Conflict Detection (⚠️ Warning): Unable to check for merge conflicts: Failed to fetch base branch: From https://github.com/datacite/lupo
      ! [rejected] master -> master (non-fast-forward)
      + 1ec7c31...e07b4d9 master -> origin/master (forced update)
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'Optimize DOI event performance with bulk-preloading' directly and concisely summarizes the main objective of the PR: introducing a bulk-preload strategy to optimize event performance for DOIs.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@app/controllers/datacite_dois_controller.rb`:
- Around line 452-454: The guard `params[:detail] != false` is always truthy
because params are strings, so remove the misleading conditional and always
preload: call EventsPreloader.new([doi]).preload! unconditionally (keep the
existing comment about preloading) so the intent matches behavior;
alternatively, if you want to preserve conditional behavior, change the check to
string comparison (e.g., params[:detail] != "false" or
ActiveModel::Type::Boolean.new.cast(params[:detail]) ) but the simpler fix is to
drop the `if` and always invoke EventsPreloader.new([doi]).preload!.

In `@app/services/events_preloader.rb`:
- Around line 37-49: When distributing events in the all_events loop, avoid
appending the same event twice when event.source_doi == event.target_doi by
checking for equality before pushing to the same DOI's preloaded_events: look up
source_doi_obj and target_doi_obj from `@doi_map` (used in the current loop), and
if they refer to the same object, append the event only once (otherwise append
to each if present); adjust the logic around source_doi_obj.preloaded_events <<
event and target_doi_obj.preloaded_events << event to perform this dedupe during
distribution (or add a single guard to skip the second append when source and
target DOIs match).
🧹 Nitpick comments (7)
app/models/concerns/preloaded_event_relation.rb (2)

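69-71: inject override doesn't forward every call form of Enumerable#inject.

Ruby's Enumerable#inject accepts inject(:+) (symbol), inject(0, :+) (initial value plus symbol), inject { |sum, x| sum + x } (block only), and inject(0) { |sum, x| sum + x } (initial value plus block). This override always passes initial and &block to Array#inject, so the block-only form receives an explicit nil as the starting memo (raising NoMethodError on the first nil + x), and the two-argument form raises ArgumentError. The single-symbol form happens to work only because a nil block is treated as no block.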

Currently, inject is only called on the result of pluck (which returns a plain Array), so this isn't actively broken. But it's a latent trap if the API surface grows.

♻️ Suggested fix
-  def inject(initial = nil, &block)
-    @events.inject(initial, &block)
+  def inject(*args, &block)
+    @events.inject(*args, &block)
   end
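
A small plain-Ruby sketch of why forwarding `*args` is the safer signature. The two wrapper classes are hypothetical stand-ins, not the PR's PreloadedEventRelation; they only contrast the narrow signature with the forwarding one.

```ruby
# Hypothetical wrappers for illustration only.
class NarrowInject
  def initialize(items)
    @items = items
  end

  # Mirrors the narrow signature: always passes an explicit first argument.
  def inject(initial = nil, &block)
    @items.inject(initial, &block)
  end
end

class ForwardingInject
  def initialize(items)
    @items = items
  end

  # Forwards whatever the caller supplied, so every call form is preserved.
  def inject(*args, &block)
    @items.inject(*args, &block)
  end
end

totals = ForwardingInject.new([10, 20, 30])
totals.inject(:+)                     # symbol form
totals.inject(0, :+)                  # initial value plus symbol
totals.inject { |sum, x| sum + x }    # block only
totals.inject(5) { |sum, x| sum + x } # initial value plus block

begin
  NarrowInject.new([10, 20, 30]).inject { |sum, x| sum + x }
rescue NoMethodError => e
  # nil was passed as the starting memo, so the block evaluated nil + 10
  narrow_error = e
end
```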

94-96: Remove to_ary — it contradicts the design and is never used.

The to_ary method is never called explicitly in the codebase and no tests cover it. Since wrapper-preserving methods like select, compact, and uniq return new PreloadedEventRelation instances, having to_ary (which enables implicit unwrapping via array concatenation, splat, or Array() calls) contradicts the design intent. Keep only to_a for explicit conversion.
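
To make the implicit-unwrapping concern concrete, here is a hedged plain-Ruby sketch. The two classes are hypothetical and exist only to contrast an object that defines `to_a` alone with one that also defines `to_ary`.

```ruby
# Hypothetical wrappers for illustration only.
class ExplicitOnly
  def initialize(items)
    @items = items
  end

  # Explicit conversion: callers must ask for an Array.
  def to_a
    @items.dup
  end
end

class ImplicitToo < ExplicitOnly
  # Implicit conversion: Ruby may call this on the caller's behalf.
  def to_ary
    to_a
  end
end

combined = [0] + ImplicitToo.new([1, 2])     # wrapper silently unwrapped
first, second = ImplicitToo.new([1, 2])      # wrapper silently destructured

begin
  [0] + ExplicitOnly.new([1, 2])
rescue TypeError => e
  # Without to_ary, Array#+ refuses the implicit conversion
  explicit_error = e
end
```

With only `to_a` defined, accidental unwrapping surfaces as a TypeError instead of quietly discarding the wrapper.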

app/services/events_preloader.rb (1)

51-54: Redundant initialization — already handled in initialize.

Line 15 in initialize already sets doi.preloaded_events ||= [] for every DOI. This block on lines 52-54 is a no-op unless a DOI's preloaded_events was somehow set to nil after initialization, which shouldn't happen in normal flow.

spec/models/preloaded_event_relation_spec.rb (2)

65-71: sort_by test doesn't verify actual reordering.

The events are already in ascending total order ([10, 20, 30]), so sort_by(&:total) produces the same order as the input. This test passes trivially without proving that sorting actually works. Consider sorting in reverse or using a differently-ordered input.

🔧 Suggested improvement
   describe "#sort_by" do
-    it "sorts events" do
-      result = relation.sort_by(&:total)
+    it "sorts events by attribute" do
+      reversed = PreloadedEventRelation.new([event3, event1, event2])
+      result = reversed.sort_by(&:total)
       expect(result).to be_a(PreloadedEventRelation)
       expect(result.to_a.map(&:total)).to eq([10, 20, 30])
     end
   end

133-145: Consider testing first/last/count on an empty relation too.

The Enumerable methods block only tests the happy path. Since PreloadedEventRelation wraps an array and delegates these, it would be good to verify edge behavior (e.g., first on empty returns nil, count returns 0).

spec/services/events_preloader_spec.rb (1)

89-107: Large batch test creates 1100+ DB records — significant test suite slowdown.

create_list(:doi, EventsPreloader::CHUNK_SIZE + 100, ...) with CHUNK_SIZE = 1000 inserts 1100 DOI records plus associated data. This will be very slow (potentially minutes) and could dominate the test suite runtime.

Consider using build_stubbed_list or mocking the chunking behavior instead of hitting the database, since the test only verifies that no error is raised and preloaded_events is not nil.

♻️ Suggested alternative approach
     context "with large batch" do
       it "chunks large DOI lists" do
-        # Create more than CHUNK_SIZE DOIs
-        large_batch = create_list(:doi, EventsPreloader::CHUNK_SIZE + 100, client: client, aasm_state: "findable")
-
-        # Create events for some of them
-        create(:event_for_crossref, {
-          subj_id: "https://doi.org/#{large_batch.first.doi}",
-          obj_id: "https://doi.org/#{large_batch.last.doi}",
-          relation_type_id: "references",
-        })
-
-        expect {
-          EventsPreloader.new(large_batch).preload!
-        }.not_to raise_error
-
-        expect(large_batch.first.preloaded_events).not_to be_nil
+        # Verify chunking by checking that Event.where is called multiple times
+        large_batch = build_stubbed_list(:doi, EventsPreloader::CHUNK_SIZE + 100, client: client)
+        large_batch.each { |d| allow(d).to receive(:preloaded_events=) }
+        large_batch.each { |d| allow(d).to receive(:preloaded_events).and_return([]) }
+
+        relation = double("relation", to_a: [])
+        expect(Event).to receive(:where).twice.and_return(relation)
+
+        EventsPreloader.new(large_batch).preload!
       end
     end
spec/models/doi_related_spec.rb (1)

42-65: Test doesn't actually verify preloading — it verifies events via a fresh DB query.

fresh_doi1 = DataciteDoi.find(doi1.id) creates a new instance with preloaded_events = nil, so reference_events and citation_events fall back to database queries. This test confirms events exist in the DB but does not validate that EventsPreloader was used during import_in_bulk.

To verify preloading actually occurred, consider spying on EventsPreloader or asserting on the in-batch DOI objects:

♻️ Suggested approach
-        allow(DataciteDoi).to receive(:upload_to_elasticsearch)
-
-        DataciteDoi.import_in_bulk(ids)
-
-        # Verify that events were preloaded (check via a fresh query)
-        fresh_doi1 = DataciteDoi.find(doi1.id)
-        expect(fresh_doi1.reference_events.count).to eq(1)
-        expect(fresh_doi1.citation_events.count).to eq(1)
+        preloader_spy = instance_double(EventsPreloader, preload!: nil)
+        expect(EventsPreloader).to receive(:new).and_return(preloader_spy)
+        expect(preloader_spy).to receive(:preload!).once
+
+        allow(DataciteDoi).to receive(:upload_to_elasticsearch)
+
+        DataciteDoi.import_in_bulk(ids)

Comment on lines +452 to +454
# Preload events if we are going to show details
# This optimizes the serializer which accesses part_events, citation_events, etc.
EventsPreloader.new([doi]).preload! if params[:detail] != false

⚠️ Potential issue | 🟡 Minor

Condition params[:detail] != false is always true — params are strings, not booleans.

params[:detail] will be nil, "true", or "false" (a string), never the boolean false. So params[:detail] != false is always truthy, making the guard a no-op. The preloader runs unconditionally.

Since detail is hardcoded to true on line 461, this is functionally correct (preloading is always wanted here), but the conditional is misleading.

🐛 Simplify to unconditional preload or fix the string comparison
-        EventsPreloader.new([doi]).preload! if params[:detail] != false
+        # Preload events to optimize serializer access to event-derived fields
+        EventsPreloader.new([doi]).preload!
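
If the conditional were kept instead, the comparison would need to handle string values. The sketch below is plain Ruby with a hypothetical `detail_requested?` helper and an assumed list of false-like strings; treating a missing parameter as "details wanted" is also an assumption. In a Rails app, ActiveModel::Type::Boolean.new.cast covers similar ground.

```ruby
# Assumed set of strings that should count as "false" for this flag.
FALSE_STRINGS = %w[false 0 f off no].freeze

# Hypothetical helper: interprets a string-valued detail parameter.
def detail_requested?(params)
  value = params[:detail]
  return true if value.nil? # absent parameter: assume details are wanted
  !FALSE_STRINGS.include?(value.to_s.strip.downcase)
end

params = { detail: "false" } # query-string values arrive as strings
params[:detail] != false     # => true: a String never equals the boolean false
detail_requested?(params)    # => false
```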

Comment on lines +37 to +49
all_events.each do |event|
# Add event to source DOI's preloaded_events if it matches
if event.source_doi.present?
source_doi_obj = @doi_map[event.source_doi.upcase]
source_doi_obj.preloaded_events << event if source_doi_obj
end

# Add event to target DOI's preloaded_events if it matches
if event.target_doi.present?
target_doi_obj = @doi_map[event.target_doi.upcase]
target_doi_obj.preloaded_events << event if target_doi_obj
end
end

⚠️ Potential issue | 🟡 Minor

Duplicate events when source_doi == target_doi.

If an event has source_doi == target_doi (same DOI), the event is appended twice to that DOI's preloaded_events array. Downstream filter methods in doi.rb (e.g., citation_events) would then return the event twice, inflating counts.

🐛 Proposed fix — deduplicate during distribution
     # Group events by DOI and assign to each Doi object
     all_events.each do |event|
       # Add event to source DOI's preloaded_events if it matches
       if event.source_doi.present?
         source_doi_obj = @doi_map[event.source_doi.upcase]
         source_doi_obj.preloaded_events << event if source_doi_obj
       end
 
       # Add event to target DOI's preloaded_events if it matches
       if event.target_doi.present?
         target_doi_obj = @doi_map[event.target_doi.upcase]
-        target_doi_obj.preloaded_events << event if target_doi_obj
+        if target_doi_obj && target_doi_obj != source_doi_obj
+          target_doi_obj.preloaded_events << event
+        end
       end
     end

Alternatively, if an event legitimately needs to appear in both source-based and target-based filters for the same DOI, the dedup should happen at the filter level in doi.rb instead.
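
The distribution-level dedupe can be exercised in isolation with a plain-Ruby sketch. The struct names and the `distribute` helper are hypothetical, and the identity-based `uniq` is one possible way to express the guard, not necessarily what the PR should adopt.

```ruby
# Hypothetical stand-ins for Event and Doi records.
LinkEvent = Struct.new(:id, :source_doi, :target_doi)
LinkedDoi = Struct.new(:doi, :preloaded_events)

# Appends each event to its source and target DOI objects, but only once
# when both sides resolve to the same object.
def distribute(events, doi_map)
  events.each do |event|
    source = event.source_doi && doi_map[event.source_doi.upcase]
    target = event.target_doi && doi_map[event.target_doi.upcase]
    # Compare by object identity so a self-referencing event is pushed once.
    [source, target].compact.uniq(&:object_id).each do |doi_obj|
      doi_obj.preloaded_events << event
    end
  end
end

a = LinkedDoi.new("10.1/A", [])
b = LinkedDoi.new("10.1/B", [])
doi_map = { "10.1/A" => a, "10.1/B" => b }

distribute([LinkEvent.new(1, "10.1/a", "10.1/a"),
            LinkEvent.new(2, "10.1/a", "10.1/b")], doi_map)
a.preloaded_events.map(&:id) # => [1, 2]
b.preloaded_events.map(&:id) # => [2]
```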
