Optimize DOI event performance with bulk-preloading #1449
Walkthrough
This pull request introduces an event preloading optimization system comprising a new EventsPreloader service, a PreloadedEventRelation wrapper class for in-memory event collections, and model-level integration to cache preloaded events. Updates to the DataciteDoi, OtherDoi, and Doi models shift event loading from eager inclusion to batch preloading within import pipelines. DataciteDoisController#show conditionally preloads detail data. A provider_id filter is added to ReferenceRepository. Test coverage validates preloading behavior, backward compatibility, and performance characteristics.
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 2 passed, 2 failed (2 warnings)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@app/controllers/datacite_dois_controller.rb`:
- Around line 452-454: The guard `params[:detail] != false` is always truthy
because params are strings, so remove the misleading conditional and always
preload: call EventsPreloader.new([doi]).preload! unconditionally (keep the
existing comment about preloading) so the intent matches behavior;
alternatively, if you want to preserve conditional behavior, change the check to
string comparison (e.g., params[:detail] != "false" or
ActiveModel::Type::Boolean.new.cast(params[:detail]) ) but the simpler fix is to
drop the `if` and always invoke EventsPreloader.new([doi]).preload!.
In `@app/services/events_preloader.rb`:
- Around line 37-49: When distributing events in the all_events loop, avoid
appending the same event twice when event.source_doi == event.target_doi by
checking for equality before pushing to the same DOI's preloaded_events: look up
source_doi_obj and target_doi_obj from `@doi_map` (used in the current loop), and
if they refer to the same object, append the event only once (otherwise append
to each if present); adjust the logic around source_doi_obj.preloaded_events <<
event and target_doi_obj.preloaded_events << event to perform this dedupe during
distribution (or add a single guard to skip the second append when source and
target DOIs match).
🧹 Nitpick comments (7)
app/models/concerns/preloaded_event_relation.rb (2)
69-71: `inject` override doesn't support every calling form.
Ruby's `Enumerable#inject` supports symbol forms (`inject(:+)`, `inject(0, :+)`) and block forms (`inject(0) { |sum, x| sum + x }`, or a block alone). This override forwards only `initial` and `&block` to `Array#inject`. The one-argument symbol form still works, because a `nil` block is simply not passed along, but `relation.inject(0, :+)` raises an `ArgumentError`, and a block-only call forwards `nil` as the initial value, which typically raises a `NoMethodError` on the first iteration.
Currently, `inject` is only called on the result of `pluck` (which returns a plain `Array`), so this isn't actively broken. But it's a latent trap if the API surface grows.
♻️ Suggested fix
- def inject(initial = nil, &block)
-   @events.inject(initial, &block)
+ def inject(*args, &block)
+   @events.inject(*args, &block)
  end
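The difference between the two signatures can be checked in plain Ruby. The sketch below is only an illustration: `NarrowInject` and `SplatInject` are hypothetical stand-ins for the wrapper, modeling nothing but the `inject` forwarding, and the wrapped array holds integers rather than events.

```ruby
# Hypothetical stand-in using the current signature: (initial = nil, &block).
class NarrowInject
  def initialize(events)
    @events = events
  end

  def inject(initial = nil, &block)
    @events.inject(initial, &block)
  end
end

# Hypothetical stand-in using the suggested signature: (*args, &block).
class SplatInject
  def initialize(events)
    @events = events
  end

  def inject(*args, &block)
    @events.inject(*args, &block)
  end
end

narrow = NarrowInject.new([1, 2, 3])
splat  = SplatInject.new([1, 2, 3])

# Forms that both signatures handle.
puts narrow.inject(:+)
puts narrow.inject(0) { |sum, x| sum + x }

# Forms that only the splat signature handles.
puts splat.inject(0, :+)
puts splat.inject { |sum, x| sum + x }
```

With the narrow signature, `inject(0, :+)` fails on argument count and a block-only call starts from `nil`; forwarding `*args` leaves all of that to `Array#inject`.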
94-96: Remove `to_ary`: it contradicts the design and is never used.
The `to_ary` method is never called explicitly in the codebase and no tests cover it. Since wrapper-preserving methods like `select`, `compact`, and `uniq` return new `PreloadedEventRelation` instances, having `to_ary` (which enables implicit unwrapping via array concatenation, splat, or `Array()` calls) contradicts the design intent. Keep only `to_a` for explicit conversion.
app/services/events_preloader.rb (1)
51-54: Redundant initialization, already handled in `initialize`.
Line 15 in `initialize` already sets `doi.preloaded_events ||= []` for every DOI. This block on lines 52-54 is a no-op unless a DOI's `preloaded_events` was somehow set to `nil` after initialization, which shouldn't happen in normal flow.
spec/models/preloaded_event_relation_spec.rb (2)
65-71: `sort_by` test doesn't verify actual reordering.
The events are already in ascending `total` order ([10, 20, 30]), so `sort_by(&:total)` produces the same order as the input. This test passes trivially without proving that sorting actually works. Consider sorting in reverse or using a differently-ordered input.
🔧 Suggested improvement
  describe "#sort_by" do
-   it "sorts events" do
-     result = relation.sort_by(&:total)
+   it "sorts events by attribute" do
+     reversed = PreloadedEventRelation.new([event3, event1, event2])
+     result = reversed.sort_by(&:total)
      expect(result).to be_a(PreloadedEventRelation)
      expect(result.to_a.map(&:total)).to eq([10, 20, 30])
    end
  end
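To see why pre-sorted input makes the assertion weak, the following plain-Ruby sketch compares a deliberately broken "sort" with a real one. `SortEvent`, `broken_sort`, and `real_sort` are hypothetical names used only for illustration and do not appear in the PR.

```ruby
# Hypothetical minimal event with only the attribute the test sorts on.
SortEvent = Struct.new(:total)

# A deliberately broken "sort" that returns its input unchanged.
def broken_sort(events)
  events
end

# The behavior the spec intends to verify.
def real_sort(events)
  events.sort_by(&:total)
end

ASCENDING = [10, 20, 30].map { |t| SortEvent.new(t) }
SHUFFLED  = [30, 10, 20].map { |t| SortEvent.new(t) }

# On ascending input the broken version is indistinguishable from the real one.
p broken_sort(ASCENDING).map(&:total) == real_sort(ASCENDING).map(&:total)

# On shuffled input only the real sort yields ascending totals.
p broken_sort(SHUFFLED).map(&:total)
p real_sort(SHUFFLED).map(&:total)
```

An assertion that cannot fail for a no-op implementation gives no evidence that sorting works, which is the reason for feeding the spec a differently ordered input.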
133-145: Consider testing `first`/`last`/`count` on an empty relation too.
The Enumerable methods block only tests the happy path. Since `PreloadedEventRelation` wraps an array and delegates these, it would be good to verify edge behavior (e.g., `first` on empty returns `nil`, `count` returns `0`).
spec/services/events_preloader_spec.rb (1)
89-107: Large batch test creates 1100+ DB records, a significant test suite slowdown.
`create_list(:doi, EventsPreloader::CHUNK_SIZE + 100, ...)` with `CHUNK_SIZE = 1000` inserts 1100 DOI records plus associated data. This will be very slow (potentially minutes) and could dominate the test suite runtime.
Consider using `build_stubbed_list` or mocking the chunking behavior instead of hitting the database, since the test only verifies that no error is raised and `preloaded_events` is not nil.
♻️ Suggested alternative approach
  context "with large batch" do
    it "chunks large DOI lists" do
-     # Create more than CHUNK_SIZE DOIs
-     large_batch = create_list(:doi, EventsPreloader::CHUNK_SIZE + 100, client: client, aasm_state: "findable")
-
-     # Create events for some of them
-     create(:event_for_crossref, {
-       subj_id: "https://doi.org/#{large_batch.first.doi}",
-       obj_id: "https://doi.org/#{large_batch.last.doi}",
-       relation_type_id: "references",
-     })
-
-     expect {
-       EventsPreloader.new(large_batch).preload!
-     }.not_to raise_error
-
-     expect(large_batch.first.preloaded_events).not_to be_nil
+     # Verify chunking by checking that Event.where is called multiple times
+     large_batch = build_stubbed_list(:doi, EventsPreloader::CHUNK_SIZE + 100, client: client)
+     large_batch.each { |d| allow(d).to receive(:preloaded_events=) }
+     large_batch.each { |d| allow(d).to receive(:preloaded_events).and_return([]) }
+
+     relation = double("relation", to_a: [])
+     expect(Event).to receive(:where).twice.and_return(relation)
+
+     EventsPreloader.new(large_batch).preload!
    end
  end
spec/models/doi_related_spec.rb (1)
42-65: Test doesn't actually verify preloading; it verifies events via a fresh DB query.
`fresh_doi1 = DataciteDoi.find(doi1.id)` creates a new instance with `preloaded_events = nil`, so `reference_events` and `citation_events` fall back to database queries. This test confirms events exist in the DB but does not validate that `EventsPreloader` was used during `import_in_bulk`.
To verify preloading actually occurred, consider spying on `EventsPreloader` or asserting on the in-batch DOI objects:
♻️ Suggested approach
- allow(DataciteDoi).to receive(:upload_to_elasticsearch)
-
- DataciteDoi.import_in_bulk(ids)
-
- # Verify that events were preloaded (check via a fresh query)
- fresh_doi1 = DataciteDoi.find(doi1.id)
- expect(fresh_doi1.reference_events.count).to eq(1)
- expect(fresh_doi1.citation_events.count).to eq(1)
+ preloader_spy = instance_double(EventsPreloader, preload!: nil)
+ expect(EventsPreloader).to receive(:new).and_return(preloader_spy)
+ expect(preloader_spy).to receive(:preload!).once
+
+ allow(DataciteDoi).to receive(:upload_to_elasticsearch)
+
+ DataciteDoi.import_in_bulk(ids)
# Preload events if we are going to show details
# This optimizes the serializer which accesses part_events, citation_events, etc.
EventsPreloader.new([doi]).preload! if params[:detail] != false
Condition `params[:detail] != false` is always true: params are strings, not booleans.
`params[:detail]` will be `nil`, `"true"`, or `"false"` (a string), never the boolean `false`. So `params[:detail] != false` always evaluates to `true`, making the guard a no-op. The preloader runs unconditionally.
Since `detail` is hardcoded to `true` on line 461, this is functionally correct (preloading is always wanted here), but the conditional is misleading.
🐛 Simplify to unconditional preload or fix the string comparison
- EventsPreloader.new([doi]).preload! if params[:detail] != false
+ # Preload events to optimize serializer access to event-derived fields
+ EventsPreloader.new([doi]).preload!
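The string-versus-boolean pitfall can be reproduced without Rails. In the sketch below a plain `Hash` stands in for `params`, and `detail_requested?` is a hypothetical helper with an assumed set of false-like strings; in a Rails app the `ActiveModel::Type::Boolean` cast mentioned in the review would be the more usual tool.

```ruby
# The guard as written in the PR: compares a String (or nil) with the boolean false.
def naive_guard(params)
  params[:detail] != false
end

# Assumption: these strings should count as "false" for this illustration.
FALSE_STRINGS = %w[false 0 f off].freeze

# Hypothetical helper doing an explicit string comparison instead.
def detail_requested?(params)
  value = params[:detail]
  return true if value.nil? # assumption: details are shown by default

  !FALSE_STRINGS.include?(value.to_s.downcase)
end

# Query-string values arrive as strings, so the naive guard never turns off.
p naive_guard({ detail: "false" })
p naive_guard({})

# The string-aware helper distinguishes the cases.
p detail_requested?({ detail: "false" })
p detail_requested?({ detail: "true" })
```

If preloading is always wanted in this action, dropping the condition entirely remains the simpler option, as the review suggests.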
all_events.each do |event|
  # Add event to source DOI's preloaded_events if it matches
  if event.source_doi.present?
    source_doi_obj = @doi_map[event.source_doi.upcase]
    source_doi_obj.preloaded_events << event if source_doi_obj
  end

  # Add event to target DOI's preloaded_events if it matches
  if event.target_doi.present?
    target_doi_obj = @doi_map[event.target_doi.upcase]
    target_doi_obj.preloaded_events << event if target_doi_obj
  end
end
Duplicate events when source_doi == target_doi.
If an event has source_doi == target_doi (same DOI), the event is appended twice to that DOI's preloaded_events array. Downstream filter methods in doi.rb (e.g., citation_events) would then return the event twice, inflating counts.
🐛 Proposed fix: deduplicate during distribution
  # Group events by DOI and assign to each Doi object
  all_events.each do |event|
    # Add event to source DOI's preloaded_events if it matches
    if event.source_doi.present?
      source_doi_obj = @doi_map[event.source_doi.upcase]
      source_doi_obj.preloaded_events << event if source_doi_obj
    end
    # Add event to target DOI's preloaded_events if it matches
    if event.target_doi.present?
      target_doi_obj = @doi_map[event.target_doi.upcase]
-     target_doi_obj.preloaded_events << event if target_doi_obj
+     if target_doi_obj && target_doi_obj != source_doi_obj
+       target_doi_obj.preloaded_events << event
+     end
    end
  end
Alternatively, if an event legitimately needs to appear in both source-based and target-based filters for the same DOI, the dedup should happen at the filter level in doi.rb instead.
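A minimal, self-contained sketch of the guarded distribution is shown below. `DedupEvent`, `DedupDoi`, and `distribute` are hypothetical names, and plain structs replace the ActiveRecord models, so identity is checked with `equal?` rather than the `!=` used in the proposed diff.

```ruby
# Hypothetical stand-ins for the Event and Doi models.
DedupEvent = Struct.new(:source_doi, :target_doi)
DedupDoi   = Struct.new(:doi, :preloaded_events)

# Append each event to its source and target DOI, but only once per DOI object.
def distribute(events, doi_map)
  events.each do |event|
    source = event.source_doi && doi_map[event.source_doi.upcase]
    target = event.target_doi && doi_map[event.target_doi.upcase]

    source.preloaded_events << event if source
    # Skip the second append when source and target are the same object.
    target.preloaded_events << event if target && !target.equal?(source)
  end
end

doi_a = DedupDoi.new("10.1234/a", [])
doi_b = DedupDoi.new("10.1234/b", [])
doi_map = [doi_a, doi_b].to_h { |d| [d.doi.upcase, d] }

self_ref = DedupEvent.new("10.1234/a", "10.1234/A") # same DOI, different case
linking  = DedupEvent.new("10.1234/a", "10.1234/b")

distribute([self_ref, linking], doi_map)

puts doi_a.preloaded_events.size
puts doi_b.preloaded_events.size
```

The self-referencing event lands on `doi_a` once, and the linking event lands on both DOIs, so `doi_a` holds two events and `doi_b` holds one.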
Purpose
This PR addresses N+1 query performance issues when fetching or importing DOIs that have associated event data (citations, views, downloads, references, etc.). Previously, each relationship type for every DOI in a collection triggered a separate database query.
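The query-count difference can be sketched in plain Ruby. Everything here is an assumption made for illustration: `CountingStore` is a hypothetical in-memory stand-in for the events table that counts lookups, and `load_one_by_one` and `load_in_batches` are not the PR's methods.

```ruby
# Hypothetical event with only the two DOI columns.
SketchEvent = Struct.new(:source_doi, :target_doi)

# Hypothetical store; each call to events_for models one database round trip.
class CountingStore
  attr_reader :query_count

  def initialize(events)
    @events = events
    @query_count = 0
  end

  def events_for(dois)
    @query_count += 1
    @events.select { |e| dois.include?(e.source_doi) || dois.include?(e.target_doi) }
  end
end

# N+1 style: one lookup per DOI.
def load_one_by_one(store, dois)
  dois.to_h { |doi| [doi, store.events_for([doi])] }
end

# Batched style: one lookup per slice, then distribute the rows in memory.
def load_in_batches(store, dois, chunk_size: 1000)
  result = dois.to_h { |doi| [doi, []] }
  dois.each_slice(chunk_size) do |slice|
    store.events_for(slice).each do |event|
      [event.source_doi, event.target_doi].uniq.each do |doi|
        # Only fill DOIs from the current slice, so a row that matches
        # two slices is not appended twice.
        result[doi] << event if slice.include?(doi)
      end
    end
  end
  result
end

SKETCH_EVENTS = [
  SketchEvent.new("10.1/A", "10.1/B"),
  SketchEvent.new("10.1/C", "10.1/A"),
].freeze
SKETCH_DOIS = ["10.1/A", "10.1/B", "10.1/C"].freeze

one_by_one = CountingStore.new(SKETCH_EVENTS)
batched    = CountingStore.new(SKETCH_EVENTS)
load_one_by_one(one_by_one, SKETCH_DOIS)
load_in_batches(batched, SKETCH_DOIS)

puts one_by_one.query_count
puts batched.query_count
```

Both loaders return the same per-DOI event lists; the batched one issues a single lookup per slice instead of one per DOI.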
Approach
The optimization introduces a "bulk-preload" strategy for events:
- `EventsPreloader` service: A new service that fetches all relevant events for a batch of DOIs in just 1 or 2 queries.
- `PreloadedEventRelation` concern: A wrapper class that mimics the `ActiveRecord::Relation` API (supporting `pluck`, `map`, `select`, etc.), allowing existing code to operate on in-memory arrays as if they were database relations.
- The `Doi` model's association methods (e.g., `view_events`, `citation_events`) are overridden to check if `preloaded_events` is present. If so, they filter the in-memory array; otherwise, they fall back to the standard database scope.
Key Modifications
- `app/services/events_preloader.rb`: New service to batch-fetch events based on `source_doi` or `target_doi`.
- `app/models/concerns/preloaded_event_relation.rb`: New wrapper for in-memory event collections to ensure backward compatibility with relation-based methods.
- `app/models/doi.rb`: Added `preloaded_events` accessor and logic to switch between preloaded data and real-time database queries.
- `app/controllers/datacite_dois_controller.rb`: Integrated preloading for the `:detail` view to speed up serialization.
- `app/models/datacite_doi.rb` & `app/models/other_doi.rb`: Integrated preloading into `import_in_bulk` to optimize Elasticsearch indexing processes.
Important Technical Details
- Large DOI batches are processed in chunks (`CHUNK_SIZE = 1000`).
- DOI keys are compared using `.upcase` to ensure consistency between database records and memory mapping.
- If `preloaded_events` is never initialized, the application behaves exactly as before, ensuring safety for code paths not yet optimized.
Types of changes
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)