Conversation

@dsolistorres (Contributor) commented on Dec 8, 2025

Closes #33661

This PR addresses performance issues and pagination errors in the Site Copy job by using the Elasticsearch Scroll API for large result sets.

Proposed Changes

  • When copying sites with a large number of contentlets, the Copy Host job hit deep-pagination errors once the offset exceeded Elasticsearch's max_result_window (100,000), and also suffered performance degradation from offset-based pagination over large result sets.
  • The indexSearchScroll method in the ESContentFactoryImpl class was refactored to expose the Elasticsearch Scroll API through a new wrapper interface, ESContentletScroll. The PaginatedContentlets class uses this interface to iterate over results via the Scroll API.
  • SQL queries in HostFactoryImpl were optimized to filter hosts by the structure_inode field of the contentlet table, and to use the ILIKE operator in SQL conditions for case-insensitive matching.
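The scroll-based iteration described above can be sketched as a batch-pulling iterator. This is a minimal, self-contained illustration, not the PR's actual code: the ESContentletScroll interface name comes from the PR, but its shape here and the ScrollingIterator class are assumptions, with a fake in-memory batch source standing in for Elasticsearch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class ScrollDemo {
    /** Hypothetical minimal shape of the new wrapper interface. */
    interface ESContentletScroll {
        List<String> nextBatch(); // an empty list signals the scroll is exhausted
    }

    /** Iterator that pulls batch after batch, the way PaginatedContentlets
     *  could consume the scroll (illustrative, not the PR's actual code). */
    static final class ScrollingIterator implements Iterator<String> {
        private final ESContentletScroll scroll;
        private Iterator<String> current = Collections.emptyIterator();
        private boolean exhausted;

        ScrollingIterator(final ESContentletScroll scroll) { this.scroll = scroll; }

        @Override public boolean hasNext() {
            // Keep fetching batches until one yields items or the scroll ends.
            while (!current.hasNext() && !exhausted) {
                final List<String> batch = scroll.nextBatch();
                exhausted = batch.isEmpty();
                current = batch.iterator();
            }
            return current.hasNext();
        }

        @Override public String next() { return current.next(); }
    }

    public static void main(String[] args) {
        // Fake scroll source: five ids served two at a time.
        final List<String> ids = List.of("a", "b", "c", "d", "e");
        final ESContentletScroll scroll = new ESContentletScroll() {
            private int offset;
            @Override public List<String> nextBatch() {
                final int end = Math.min(offset + 2, ids.size());
                final List<String> batch = new ArrayList<>(ids.subList(offset, end));
                offset = end;
                return batch;
            }
        };
        final StringBuilder sb = new StringBuilder();
        final Iterator<String> it = new ScrollingIterator(scroll);
        while (it.hasNext()) sb.append(it.next());
        System.out.println(sb); // prints abcde
    }
}
```

Unlike offset pagination, each step only needs the scroll cursor from the previous batch, so there is no growing offset to exceed max_result_window.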

Checklist

  • Tests

} catch (final InterruptedException e) {
    Logger.warn(this, "Batch pause was interrupted", e);
    Thread.currentThread().interrupt();
}
Contributor

I think it should always sleep, to avoid starvation. If the minimum is set to 0, we should default to something.
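The reviewer's suggestion could be implemented as a pause helper that clamps the configured value to a floor. This is a hedged sketch only; the class, method names, and the 50 ms default are all hypothetical.

```java
/** Sketch of a batch-pause helper that always sleeps at least a floor value,
 *  so a configured pause of 0 cannot starve other threads (names and the
 *  50 ms default are hypothetical, not from the PR). */
public class PauseDemo {
    /** Floor applied when the configured pause is 0 or negative. */
    static final long DEFAULT_MIN_PAUSE_MS = 50L;

    /** Returns the effective pause, never less than the floor. */
    static long effectivePauseMs(final long configuredMs) {
        return Math.max(configuredMs, DEFAULT_MIN_PAUSE_MS);
    }

    static void pauseBetweenBatches(final long configuredMs) {
        try {
            Thread.sleep(effectivePauseMs(configuredMs));
        } catch (final InterruptedException e) {
            // Restore the interrupt flag so callers can observe it.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(effectivePauseMs(0));   // floor kicks in
        System.out.println(effectivePauseMs(200)); // configured value wins
    }
}
```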

private static final String SELECT_SYSTEM_HOST = "SELECT id FROM identifier WHERE id = '" + Host.SYSTEM_HOST + "' ";

private static final String FROM_JOINED_TABLES = "INNER JOIN identifier i " +
        "ON c.identifier = i.id AND i.asset_subtype = '" + Host.HOST_VELOCITY_VAR_NAME + "' " +
Contributor

Why is this getting removed?
Is it good enough to filter by the structure inode?

Gain in performance?
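The structure_inode plus ILIKE approach the PR describes might look roughly like this. Table and column names here are illustrative guesses modeled on the snippet above, not the PR's exact query.

```java
/** Hedged sketch of filtering hosts directly by the contentlet table's
 *  structure_inode and matching names case-insensitively with ILIKE
 *  (PostgreSQL). Column names besides structure_inode are assumptions. */
public class HostQuerySketch {
    static String findHostByNameQuery(final String hostStructureInode) {
        // Filtering by structure_inode narrows rows to host contentlets
        // before any name comparison; ILIKE avoids LOWER() wrapping.
        return "SELECT c.identifier FROM contentlet c "
             + "WHERE c.structure_inode = '" + hostStructureInode + "' "
             + "AND c.title ILIKE ?";
    }

    public static void main(String[] args) {
        System.out.println(findHostByNameQuery("host-structure-inode"));
    }
}
```

Note that ILIKE is PostgreSQL-specific; a portable variant would compare LOWER(c.title) to a lower-cased parameter instead.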

Contributor

Very nice set of improvements!

sourceContentlets.size(), batchSize, relationshipBatchSize, batchPauseMs));

// Strategy: Process simple content immediately, collect HTML pages for later processing
final List<String> htmlPageInodes = new ArrayList<>();
Contributor

Perhaps a LinkedList is better suited here, or an initialization using the size of dbResults.
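The reviewer's two alternatives for collecting the HTML-page inodes can be sketched side by side. The dbResultCount variable and the inode values are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Sketch of the two collection choices suggested in the review:
 *  a LinkedList (cheap appends, no array copies) versus an ArrayList
 *  pre-sized to the result count (one allocation, fast iteration). */
public class ListChoiceDemo {
    public static void main(String[] args) {
        final int dbResultCount = 1000; // hypothetical size of dbResults

        // Option 1: LinkedList avoids array resizing entirely, but pays
        // per-node overhead and has poor cache locality when iterated.
        final List<String> linked = new LinkedList<>();

        // Option 2: ArrayList sized up front never grows its backing array.
        final List<String> sized = new ArrayList<>(dbResultCount);

        for (int i = 0; i < dbResultCount; i++) {
            linked.add("inode-" + i);
            sized.add("inode-" + i);
        }
        System.out.println(linked.size() + " " + sized.size()); // prints 1000 1000
    }
}
```

Since the list is only appended to and then iterated, the pre-sized ArrayList is usually the better of the two in practice.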

@dsolistorres force-pushed the issue-33661-optimize-copy-host-job branch from 640ce0b to 9ef0091 on December 12, 2025 at 21:17
@dsolistorres force-pushed the issue-33661-optimize-copy-host-job branch from 9ef0091 to 163e018 on December 12, 2025 at 23:34


Development

Successfully merging this pull request may close these issues.

[DEFECT] Copy Host operation suffers severe performance degradation with large content volume

6 participants