Skip to content

Comments

Create enrichment table and model#1470

Open
wendelfabianchinsamy wants to merge 33 commits intomasterfrom
create-enrichments-table
Open

Create enrichment table and model#1470
wendelfabianchinsamy wants to merge 33 commits intomasterfrom
create-enrichments-table

Conversation

@wendelfabianchinsamy
Copy link
Contributor

@wendelfabianchinsamy wendelfabianchinsamy commented Jan 29, 2026

Purpose

closes: https://github.com/datacite/product-backlog/issues/612

Approach

Open Questions and Pre-Merge TODOs

Learning

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

  • New feature (non-breaking change which adds functionality)

  • Breaking change (fix or feature that would cause existing functionality to change)

Reviewer, please remember our guidelines:

  • Be humble in the language and feedback you give, ask don't tell.
  • Consider using positive language as opposed to neutral when offering feedback. This is to avoid the negative bias that can occur with neutral language appearing negative.
  • Offer suggestions on how to improve code e.g. simplification or expanding clarity.
  • Ensure you give reasons for the changes you are proposing.

Summary by CodeRabbit

  • New Features

    • Public API to list enrichments by DOI or client with cursor-based pagination and navigation links
    • Batch ingestion and background processing of enrichment files with concurrent workers
    • Server-side JSON Schema validation for enrichment records
    • Ability to apply enrichment operations (insert/update/update child/delete child) to records
    • New serializer normalizing enrichment output fields
  • Chores

    • Database migration adding an enrichments table and index
    • Rake task and local environment settings to support ingestion from cloud storage
    • Background queue configuration and local environment updates
  • Tests

    • Added factory to support enrichment testing

@coderabbitai
Copy link

coderabbitai bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds enrichment ingestion: new Enrichment model with JSON Schema and migration; Enrichable concern mixed into Doi; batch ingestion job and Rake task for JSONL from S3; EnrichmentsController#index with cursor pagination, serializer, factory, routes, Shoryuken group, and local env config.

Changes

Cohort / File(s) Summary
Model & Schema
app/models/enrichment.rb, app/models/schemas/enrichment/enrichment.json
New Enrichment ActiveRecord model with JSONSchemer validation, associations to Doi/Client, scopes (by_doi, by_client, by_cursor, order_by_cursor) and helper to build enrichment hashes. Adds a comprehensive, strict JSON Schema for enrichment records.
Database
db/migrate/20260120083752_create_enrichments.rb, db/schema.rb
Migration creating enrichments table (doi, contributors JSON, resources JSON, field, action, original_value, enriched_value) with index on (doi, updated_at, id) desc; schema version bumped and multiple timestamp precision entries normalized.
API Controller & Routes
app/controllers/enrichments_controller.rb, config/routes.rb
New EnrichmentsController#index supports doi (uppercased) or client_id with optional encoded cursor, cursor-based filtering/ordering, PAGE_SIZE = 25, builds self and conditional next links, and renders serialized JSON. Route added for enrichments#index.
Batch Processing & Rake Task
app/jobs/enrichment_batch_process_job.rb, lib/tasks/enrichment.rake
Adds EnrichmentBatchProcessJob to parse JSONL lines, validate DOIs, apply enrichments via Doi#apply_enrichment, save records with per-line DB connection handling and error logging. Rake task streams S3 object, batches by byte size, and enqueues jobs.
Enrichable Concern & Doi
app/models/concerns/enrichable.rb, app/models/doi.rb
New Enrichable concern implements apply_enrichment (actions: insert, update, update_child, delete_child) and enrichment_field mapping; included into Doi.
Serialization & Test Factory
app/serializers/enrichment_serializer.rb, spec/factories/enrichment.rb
New EnrichmentSerializer (JSONAPI::Serializer) with camelLower keys, doi lowercased, and created/updated mappings. Adds FactoryBot :enrichment factory creating an associated doi.
Queue & Local Dev Config
config/shoryuken.yml, docker-compose.localstack.yml, config/application.rb
Adds enrichment_batch_process_job group to Shoryuken, sets ENRICHMENTS_INGESTION_FILES_BUCKET_NAME default in application config, and adds localstack environment variables including ENRICHMENTS_INGESTION_FILES_BUCKET_NAME.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Create enrichment table and model' accurately describes the main changes—a new Enrichment model, database migration, and supporting infrastructure for enrichments.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch create-enrichments-table

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@app/controllers/enrichments_controller.rb`:
- Around line 7-8: The params key is inconsistent: the controller reads
params["client_id"] but later builds a next-link query using ?client-id=; pick
the canonical form and use it everywhere — e.g., if the API uses the hyphenated
convention change the read to params["client-id"] (replace params["client_id"]),
and update any next-link construction that currently emits ?client-id= or
?client_id= so both the incoming read and outgoing URL use the exact same key
(also update any other occurrences in this file like the next-link builder
around the variable doi/client_id to match).
- Line 25: The debug puts call printing SQL should be removed or replaced with a
Rails logger call; replace the current
puts(enrichments.order_by_cursor.limit(PAGE_SIZE).to_sql) with
Rails.logger.debug(enrichments.order_by_cursor.limit(PAGE_SIZE).to_sql) (or
simply remove the line) so raw SQL is not written to stdout in production —
locate the call referencing enrichments, order_by_cursor, and PAGE_SIZE and
update accordingly.
- Around line 49-59: Replace the bare rescues in encode_cursor and decode_cursor
with rescues for specific expected errors (e.g., ArgumentError,
JSON::ParserError and TypeError) so you don't catch system-level exceptions; in
both methods rescue the specific exception classes (capture the exception into a
variable if you want to log or re-raise with context) and then raise
ActionController::BadRequest, ensuring encode_cursor and decode_cursor only
handle decoding/encoding-related failures.

In `@app/jobs/enrichment_batch_process_job.rb`:
- Line 9: Fix the typo in the log prefix string used in the
EnrichmentBatchProcessJob file: change the value of the log_prefix variable from
"EnrichmentBathProcessJob" to "EnrichmentBatchProcessJob" so logging reflects
the correct job name (locate the log_prefix assignment in
app/jobs/enrichment_batch_process_job.rb).
- Around line 60-66: In the "update_child" case, protect against doi[field]
being nil before calling doi[field].each_with_index by either wrapping the
iterable with Array(doi[field]) or adding an early guard (e.g., next unless
doi[field].present? or doi[field].is_a?(Array)); update the when "update_child"
block so the loop uses Array(doi[field]).each_with_index (or skips when nil) to
avoid NoMethodError while preserving the intended update logic.
- Around line 67-75: In the when "delete_child" branch the code mutates
doi[field] while iterating which is unsafe; replace the manual each_with_index
loop with a safe array-removal call to remove the first matching element (e.g.,
call Array#delete on doi[field] with enrichment["original_value"]) while keeping
the doi[field] ||= [] initialization; update the branch handling around the
identifier enrichment["original_value"] and the doi[field] reference
accordingly.

In `@config/routes.rb`:
- Line 198: Add authentication and authorization to the
EnrichmentsController#index: require authentication (e.g., add a before_action
like authenticate_user!) and add a before_action (e.g., ensure_client_access)
that checks params[:client_id] (and any client lookup used by the index action)
to ensure current_user is the client owner or current_user.staff? is true; if
not, return unauthorized. Also decide and enforce whether DOI-based queries
(params[:doi]) should bypass auth — if not, treat DOI queries as protected and
run the same ensure_client_access/authorization; update any references to client
lookup used in index (e.g., client = Client.find_by(id: params[:client_id])) to
use the authorization helper.

In `@lib/tasks/enrichment.rake`:
- Around line 13-14: The current max_batch_bytes value
(ENV.fetch("MAX_BATCH_BYTES", "200000")) only counts raw line bytes and ignores
ActiveJob/JSON envelope overhead, which can push SQS messages over 256KB; change
the code that sets max_batch_bytes to either lower the default (e.g., ~150_000)
or compute a safe budget by measuring the serialized ActiveJob envelope size and
subtracting that overhead from the 256KB limit, and apply the same change to the
other occurrence referenced (lines 46-47); specifically update the logic that
assigns max_batch_bytes so it either uses a reduced default or dynamically
serializes a representative job payload (including job_class, queue name and
args) to determine the actual serialized byte length and ensure the final SQS
message stays under 256KB.

In `@spec/factories/enrichment.rb`:
- Around line 3-7: The enrichment factory is missing required attributes and
incorrectly assigns the DOI object; update the factory :enrichment to set
non-null columns contributors (e.g., an array/hash matching the model shape),
resources, field, and action with sensible defaults, and fix the doi assignment
so it stores the DOI string rather than a Doi object by either declaring an
association like doi_record (to create a Doi and use its doi attribute) or
assigning doi { create(:doi, doi: generate(:unique_doi)).doi }; add a sequence
(e.g., sequence :unique_doi) to generate unique DOI strings instead of the
hardcoded "10.0000/fake.test.doi" to avoid uniqueness violations.
🧹 Nitpick comments (7)
app/models/schemas/enrichment/enrichment.json (1)

110-143: Consider allowing null and boolean types for originalValue / enrichedValue.

Both properties use oneOf restricted to object, string, number, and array. If any DataCite metadata field could hold a boolean or legitimately be null (beyond the absent-when-not-required case), this schema will reject it. If these types are intentionally excluded, no change needed.

db/migrate/20260120083752_create_enrichments.rb (1)

4-18: Migration is clean and index design is solid for cursor-based pagination.

The composite index [doi, updated_at, id] with descending order efficiently supports the pagination pattern. A few optional considerations:

  1. No foreign key constraint on doi: The doi column is a soft reference to the dataset table. This means enrichments can exist for DOIs that have been deleted or never existed. If referential integrity is desired, consider adding a FK. If the loose coupling is intentional (e.g., enrichments may arrive before the DOI is registered), this is fine as-is.

  2. Column name action: While not a Rails reserved word, it shadows ActionController::Base#action method. Since this is a model column and not used in controller context, it works fine, but worth being aware of if you ever delegate or mix concerns.

app/controllers/enrichments_controller.rb (1)

30-37: next_link is always computed even when it won't be used.

When enrichments is empty, next_cursor is nil, yet next_link is still built (resulting in ...&cursor=). This is harmless because Line 43 gates on enrichments.length == PAGE_SIZE, but it's confusing. Consider moving the next_link construction inside the same conditional or making it explicitly nil when there's no cursor.

lib/tasks/enrichment.rake (1)

26-33: Mutating batch_lines in-place after enqueuing is fragile.

perform_later serializes synchronously in standard ActiveJob, so .clear on Line 31 is currently safe. However, this pattern is brittle—any change to adapter behavior or middleware could cause the enqueued job to receive an empty array. Reassigning a new array is safer and clearer.

Proposed fix
     flush = lambda do
       return if batch_lines.empty?

       EnrichmentBatchProcessJob.perform_later(batch_lines)

-      batch_lines.clear
-      batch_bytes = 0
+      batch_lines = []
+      batch_bytes = 0
     end
app/jobs/enrichment_batch_process_job.rb (2)

27-28: Unnecessary string interpolation.

"#{parsed_line["doi"]}" is equivalent to parsed_line["doi"] since it's already a string.

Proposed fix
-        doi: "#{parsed_line["doi"]}",
+        doi: parsed_line["doi"],

52-77: enrich_doi should be private.

This method is an internal implementation detail of the job and is declared after the perform method but without a private access modifier.

Proposed fix
   end

+  private
+
   def enrich_doi(enrichment, doi)
app/models/enrichment.rb (1)

1-2: private does not apply to self. class methods.

def self.enrichment_schemer on Line 47 is declared inside the private block, but Ruby's private keyword only affects instance methods. The class method remains public. If the intent is to restrict access, use private_class_method :enrichment_schemer after the definition.

Proposed fix
     def self.enrichment_schemer
       `@enrichment_schemer` ||= begin
         schema_path = Rails.root.join("app/models/schemas/enrichment/enrichment.json")
         schema = JSON.parse(File.read(schema_path))
         JSONSchemer.schema(schema)
       end
     end
+
+    private_class_method :enrichment_schemer
 end

Also applies to: 25-53

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@app/controllers/enrichments_controller.rb`:
- Around line 17-23: The code currently calls Time.iso8601 on untrusted input
from decode_cursor which can raise ArgumentError for malformed timestamps;
update the cursor handling in the controller (the block using decode_cursor and
Time.iso8601 and subsequently calling enrichments.by_cursor) to validate and
safely parse the timestamp by rescuing ArgumentError (or using a safe parser),
and return or raise a 400/BadRequest response when the decoded "updated_at" is
not a valid ISO8601 string instead of letting it bubble as a 500; ensure the fix
wraps Time.iso8601 (or replaces it with a safe parsing helper) and treats
decode_cursor failures consistently before calling enrichments.by_cursor.

In `@app/jobs/enrichment_batch_process_job.rb`:
- Around line 12-13: Wrap the JSON.parse call inside the Parallel.each block
with a rescue for JSON::ParserError so a single malformed line doesn't abort the
whole batch; in the block that iterates (the Parallel.each do |line| ... end)
catch JSON::ParserError around JSON.parse(line), log the parse error and the
offending line (or its index) via your logger, and skip/next that iteration so
processing continues for other lines instead of re-raising the exception.
- Around line 52-54: The call to enrichment["field"].underscore in enrich_doi
can raise NoMethodError when enrichment["field"] is nil; update enrich_doi to
guard/validate before calling .underscore by checking enrichment["field"] (or
using safe navigation) and handling the nil case (e.g., skip processing, log a
warning, or set a default field) so .underscore is only invoked on a String;
reference enrich_doi and enrichment["field"] when making the change.
🧹 Nitpick comments (2)
app/controllers/enrichments_controller.rb (1)

29-36: next_link is constructed even when next_cursor is nil (empty result set), producing a malformed URL with cursor= at the end.

While Line 42 guards rendering, the next_link variable is always built. This is minor since it's never returned when there's no next page, but it's cleaner to wrap the link construction inside the conditional.

Proposed refactor
-    next_cursor = if enrichments.any?
-      last = enrichments.last
-      encode_cursor(updated_at: last.updated_at.iso8601(6), id: last.id)
-    end
-
-    next_link = doi.present? ?
-      "#{request.original_url.split("?").first}?doi=#{doi}&cursor=#{next_cursor}" :
-      "#{request.original_url.split("?").first}?client-id=#{client_id}&cursor=#{next_cursor}"
+    next_link = if enrichments.length == PAGE_SIZE
+      last = enrichments.last
+      next_cursor = encode_cursor(updated_at: last.updated_at.iso8601(6), id: last.id)
+      base = request.original_url.split("?").first
+      doi.present? ?
+        "#{base}?doi=#{doi}&cursor=#{next_cursor}" :
+        "#{base}?client-id=#{client_id}&cursor=#{next_cursor}"
+    end
app/jobs/enrichment_batch_process_job.rb (1)

52-77: enrich_doi is a public method but is only used internally — should be private.

This avoids accidental external invocation and keeps the job's public surface minimal.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@app/models/doi.rb`:
- Around line 2773-2798: The apply_enrichment method can crash when
enrichment["field"] is nil and when update_child is called on a nil field;
update the method to first safely derive field (e.g. return early if
enrichment["field"] is nil or use safe navigation when calling underscore) and
then for cases that iterate or modify arrays ensure the target is initialized
(use self[field] ||= [] in the "update_child" branch just like
"insert"/"delete_child"), and change the "delete_child" logic to find the index
with Array#index (or similar) and call delete_at(index) instead of deleting
while iterating to avoid mutation-during-iteration bugs; keep the other case
behaviors (insert/update) unchanged.
🧹 Nitpick comments (1)
app/jobs/enrichment_batch_process_job.rb (1)

52-77: Remove commented-out dead code.

The enrich_doi method is fully commented out. This logic has been moved to Doi#apply_enrichment. Remove the dead code to avoid confusion.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@app/jobs/enrichment_batch_process_job.rb`:
- Around line 12-54: The Parallel.each block spawns 10 threads but
ActiveRecord's pool may be smaller, causing ConnectionTimeoutError; wrap the
per-iteration work in ActiveRecord::Base.connection_pool.with_connection (or
ActiveRecord::Base.connection_pool.with_connection do ... end) so each iteration
checks out and returns a DB connection around calls like Doi.find_by,
doi.apply_enrichment/doi.only_validate, and enrichment.save (i.e., the body
inside Parallel.each), and also ensure the pool size in database config matches
or exceeds the thread count (or reduce in_threads from 10) to avoid exhausting
the pool.
🧹 Nitpick comments (3)
app/jobs/enrichment_batch_process_job.rb (2)

33-33: Unnecessary string interpolation.

"#{parsed_line["doi"]}" is equivalent to parsed_line["doi"]. Drop the interpolation.

-        doi: "#{parsed_line["doi"]}",
+        doi: parsed_line["doi"],

42-48: Enrichment mutations to doi are discarded — is this intentional?

doi.apply_enrichment(enrichment) mutates the in-memory doi, then doi.valid? is checked but the DOI is never saved. Only the enrichment record is persisted. If the intent is "validate that applying this enrichment would produce valid metadata, then store the enrichment for later replay," this is fine — but worth a brief comment to clarify for future maintainers.

lib/tasks/enrichment.rake (1)

28-35: batch_lines.clear reuses the same array object — safe only because perform_later serializes synchronously.

This works because ActiveJob serializes arguments before perform_later returns. However, using batch_lines = [] (reassignment) instead of clear would be more defensive and avoid coupling to serialization timing.

Suggested change
     flush = lambda do
       return if batch_lines.empty?

       EnrichmentBatchProcessJob.perform_later(batch_lines, key)

-      batch_lines.clear
-      batch_bytes = 0
+      batch_lines = []
+      batch_bytes = 0
     end

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@app/controllers/enrichments_controller.rb`:
- Around line 40-42: next_link generation produces an empty client-id param when
both doi and client_id are blank and also fails to URL-encode values; change the
code to build the query explicitly: compute base =
request.original_url.split("?").first, then build a params hash like params = {
"cursor" => next_cursor }; add params["doi"] = doi if doi.present? elsif
params["client-id"] = client_id if client_id.present?; finally set next_link =
"#{base}?#{params.to_query}" (or use CGI.escape/URI.encode_www_form_component
for values if you prefer manual encoding). This ensures no empty client-id is
emitted and both doi and client-id are URL-encoded; reference symbols:
next_link, doi, client_id, request.original_url, next_cursor.
🧹 Nitpick comments (1)
app/controllers/enrichments_controller.rb (1)

44-50: Response exposes raw ActiveRecord objects via data: enrichments.

enrichments is an array of Enrichment model instances. Passing them directly to render json: will call .as_json with default settings, potentially exposing all columns (including internal IDs, timestamps, etc.). Consider using a serializer or an explicit .map to control the shape of the response.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@app/jobs/enrichment_batch_process_job.rb`:
- Around line 23-28: Normalize the DOI from parsed_line once (e.g., uppercase
it) before any lookups or persistence so Doi.find_by(doi: parsed_line["doi"],
agency: "datacite") and later persistence use the same normalized value; replace
uses of parsed_line["doi"] in this method with the normalized variable and
ensure the same normalization is applied where a second lookup occurs (the block
around lines 40-42) and rely on the model's Doi#doi= behavior for storage
consistency.
- Around line 50-56: Wrap the call to doi.apply_enrichment(enrichment) in a
begin/rescue so any exception raised by apply_enrichment (e.g., nil array access
in update_child) is caught per DOI and does not abort the whole batch; in the
rescue log the exception class, message and backtrace along with the existing
context (use log_prefix and enrichment.doi) and then next to continue to the
next record; keep existing validation check (doi.valid?) after the protected
call and avoid rescuing Exception (use rescue => e) so only runtime errors are
captured.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
app/jobs/enrichment_batch_process_job.rb (2)

23-30: Guard against nil DOI value for clearer logging.

If parsed_line["doi"] is nil, the log message at line 28 will read "Doi does not exist" (empty DOI). Consider adding an explicit check for a more informative error message.

Proposed improvement
         # We only create enrichments for DOIs that exist and which have an agency of 'datacite'.
         uid = parsed_line["doi"]&.upcase
+
+        if uid.blank?
+          Rails.logger.error("#{log_prefix}: Missing DOI in input line")
+          next
+        end
+
         doi = Doi.find_by(doi: uid, agency: "datacite")

         if doi.blank?
-          Rails.logger.error("#{log_prefix}: Doi #{uid} does not exist")
+          Rails.logger.error("#{log_prefix}: DOI #{uid} does not exist")
           next
         end

41-42: Remove unnecessary string interpolation.

uid is already a string (result of &.upcase), so the interpolation "#{uid}" is redundant.

Proposed fix
         enrichment = Enrichment.new(
-          doi: "#{uid}",
+          doi: uid,
           contributors: parsed_line["contributors"],

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@app/controllers/enrichments_controller.rb`:
- Around line 34-42: The current base_page_enrichments method returns
Enrichment.all when neither doi nor client_id is provided, which exposes the
entire table via the public index action; either restrict access or change the
fallback: if the endpoint should be private, add an authentication/authorization
check (e.g., a before_action like authenticate_user! or verify_api_key) to
EnrichmentsController so index cannot be called anonymously, or if the intent is
to require a filter, change base_page_enrichments to return Enrichment.none (or
raise ActionController::Forbidden) instead of Enrichment.all; update any tests
to reflect the chosen behavior and adjust callers of base_page_enrichments
accordingly.
- Around line 44-54: The rescue in filter_enrichments_with_cursor currently only
catches ArgumentError and KeyError so JSON::ParserError from decode_cursor
bubbles up as a 500; update the rescue to also catch JSON::ParserError (the
exception raised by JSON.parse inside decode_cursor) and re-raise
ActionController::BadRequest("Invalid cursor") so malformed/corrupted cursors
return 400; locate the filter_enrichments_with_cursor method and add
JSON::ParserError to the rescue clause that handles decode_cursor errors.
- Line 8: Remove the .upcase call when reading the client_id param so the value
matches stored lowercase symbols: change the assignment that sets client_id =
params["client_id"]&.upcase to just use the raw param (e.g.,
params["client_id"]) so EnrichmentsController uses the same lowercase format
returned by client.symbol.downcase and will match the Enrichment.by_client
scope's case-sensitive comparison against datacentre.symbol.
🧹 Nitpick comments (1)
app/controllers/enrichments_controller.rb (1)

24-30: Add a serializer for Enrichment to control API response shape.

The enrichment objects are serialized without an explicit as_json override or serializer, exposing all columns including created_at and updated_at timestamps. Unlike other models in the codebase (Client, Contact, Provider, Doi) which use serialization methods, Enrichment relies on the default Rails behavior, coupling the API response directly to the database schema.

…ode error should result in an invalid cursor response
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@app/controllers/enrichments_controller.rb`:
- Around line 24-25: The controller is rendering enrichments directly which
leaks timestamps because Enrichment#to_enrichment_hash is private; either make
that method public (change visibility of to_enrichment_hash) and update the
controller to map enrichments with .map(&:to_enrichment_hash), or implement a
dedicated EnrichmentSerializer (named EnrichmentSerializer) that exposes only
the allowed fields and use it in the controller render, or override
Enrichment#as_json to return the desired hash; pick one approach and update the
corresponding symbol (to_enrichment_hash, EnrichmentSerializer, or as_json) and
the render call in the controller to use the chosen serialization method.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
app/serializers/enrichment_serializer.rb (1)

10-22: :doi is declared twice — remove it from the plain attributes list.

The custom block at lines 20–22 overrides the plain :doi declared at line 11, making the latter dead code. Keeping both is confusing for future readers.

Similarly, :id is listed in attributes (line 10) while also set via set_id :id (line 8). This causes id to appear both as the JSON:API resource identifier and inside the attributes hash, which is unusual per the JSON:API spec (the id member is already top-level). Confirm this duplication is intentional for your API contract.

♻️ Suggested cleanup
   attributes :id,
-             :doi,
              :contributors,
              :resources,
              :field,
              :action,
              :original_value,
              :enriched_value

If :id in attributes is also unintentional:

-  attributes :id,
+  attributes :contributors,
-             :contributors,
-             :resources,
              :resources,

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
app/serializers/enrichment_serializer.rb (1)

10-21: Remove :doi from the plain attributes list — it's overridden by the custom block below.

:doi is declared on line 10 as a plain attribute and then re-declared on lines 19–21 with a custom block that downcases the value. The block wins at serialization time, so the plain declaration is dead code and misleading to readers who may assume it takes effect.

♻️ Proposed fix
- attributes :doi,
-            :contributors,
+ attributes :contributors,
             :resources,
             :field,
             :action,
             :original_value,
             :enriched_value
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/serializers/enrichment_serializer.rb` around lines 10 - 21, Remove the
redundant plain :doi declaration from the attributes list so the custom
attribute block is the sole source of the DOI value; specifically, in the
EnrichmentSerializer remove ":doi" from the attributes declaration (the list
that currently includes :doi, :contributors, :resources, :field, :action,
:original_value, :enriched_value) and keep the existing attribute :doi do
|object| ... end block that downcases the DOI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@app/serializers/enrichment_serializer.rb`:
- Around line 10-21: Remove the redundant plain :doi declaration from the
attributes list so the custom attribute block is the sole source of the DOI
value; specifically, in the EnrichmentSerializer remove ":doi" from the
attributes declaration (the list that currently includes :doi, :contributors,
:resources, :field, :action, :original_value, :enriched_value) and keep the
existing attribute :doi do |object| ... end block that downcases the DOI.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
db/schema.rb (1)

242-254: Enrichments table looks well-structured for cursor pagination.

The composite index on [doi, updated_at, id] with descending order on updated_at and id aligns with the cursor-based pagination in the controller. The column constraints are reasonable.

One minor note: the table uses utf8mb3 charset while newer tables (e.g., events, reference_repositories) use utf8mb4. If enrichment text fields (e.g., enriched_value JSON) could contain 4-byte Unicode characters (emoji, CJK supplementary), consider using utf8mb4 for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/schema.rb` around lines 242 - 254, The schema uses charset "utf8mb3" for
the enrichments table which can drop 4-byte Unicode characters; update the table
to use "utf8mb4" to match other tables and safely store emoji/CJK supplementary
characters by changing the create_table definition for "enrichments" (the
create_table "enrichments" block in schema.rb) to charset: "utf8mb4" and, if
applying via migration, add an ALTER TABLE migration to convert the
table/columns (notably the JSON columns "enriched_value" and "original_value")
to utf8mb4 collation so they support 4-byte characters consistently.
app/jobs/enrichment_batch_process_job.rb (1)

41-50: contributors and resources columns are NOT NULL — missing keys in the JSONL input will fail silently at save.

If a JSONL line doesn't include "contributors" or "resources", the Enrichment.new call sets them to nil. The save on line 60 will then fail due to the NOT NULL constraint (see db/schema.rb lines 244–245), and the error is only logged. This is technically handled, but if most enrichments don't carry these fields, you'll get a high volume of silent failures.

Consider whether these columns should be nullable, or whether they should have defaults (e.g., default: {} or default: [] in the migration), or whether the job should set a default before saving.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/jobs/enrichment_batch_process_job.rb` around lines 41 - 50, The
Enrichment.new call in enrichment_batch_process_job.rb assigns nil to
contributors and resources when those keys are missing, which violates the NOT
NULL DB constraints and causes silent save failures; fix by normalizing missing
keys before creating/saving the record: ensure the job sets defaults (e.g.,
contributors: parsed_line["contributors"] || [] and resources:
parsed_line["resources"] || {} or whichever type matches the schema) or update
the migration/schema to allow NULL or provide column defaults; adjust the code
paths that build the Enrichment (the Enrichment.new invocation and the save
logic) to always supply non-nil values for contributors and resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@app/controllers/enrichments_controller.rb`:
- Around line 57-61: The bare rescue in encode_cursor currently catches all
exceptions (including SystemExit/SignalException); update the rescue in the
encode_cursor method that wraps Base64.urlsafe_encode64(hash.to_json, padding:
false) to rescue StandardError (or a more specific exception) instead of a bare
rescue, so only application-level errors are handled and system-level exceptions
are not swallowed.
- Around line 6-8: The controller's index action reads params["client_id"] but
build_next_link emits client-id=... causing the second-page request to lose the
filter; update code to use a single param key consistently (pick either
"client_id" or "client-id") across index and build_next_link (e.g., change index
to read params["client-id"] or change build_next_link to emit client_id=...),
and ensure values embedded in the generated next-link (the DOI and the client
identifier used by build_next_link) are URL-encoded (use
URI.encode_www_form_component or equivalent) so DOIs with slashes and client IDs
are safe in the query string; reference the index method (where params are read)
and build_next_link (where query params are emitted) when making the edits.

---

Nitpick comments:
In `@app/jobs/enrichment_batch_process_job.rb`:
- Around line 41-50: The Enrichment.new call in enrichment_batch_process_job.rb
assigns nil to contributors and resources when those keys are missing, which
violates the NOT NULL DB constraints and causes silent save failures; fix by
normalizing missing keys before creating/saving the record: ensure the job sets
defaults (e.g., contributors: parsed_line["contributors"] || [] and resources:
parsed_line["resources"] || {} or whichever type matches the schema) or update
the migration/schema to allow NULL or provide column defaults; adjust the code
paths that build the Enrichment (the Enrichment.new invocation and the save
logic) to always supply non-nil values for contributors and resources.

In `@db/schema.rb`:
- Around line 242-254: The schema uses charset "utf8mb3" for the enrichments
table which can drop 4-byte Unicode characters; update the table to use
"utf8mb4" to match other tables and safely store emoji/CJK supplementary
characters by changing the create_table definition for "enrichments" (the
create_table "enrichments" block in schema.rb) to charset: "utf8mb4" and, if
applying via migration, add an ALTER TABLE migration to convert the
table/columns (notably the JSON columns "enriched_value" and "original_value")
to utf8mb4 collation so they support 4-byte characters consistently.

Comment on lines +19 to +33
when "update_child"
self[field].each_with_index do |item, index|
if item == enrichment["original_value"]
self[field][index] = enrichment["enriched_value"]
end
end
when "delete_child"
self[field] ||= []
self[field].each_with_index do |item, index|
if item == enrichment["original_value"]
self[field].delete_at(index)
break
end
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question that doesn't necessarily need to be answered now: should the enrichment be considered valid if the no item is found for update_child and delete_child? At that point, the record wouldn't actually "do" anything, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants