Conversation

@rschmitt rschmitt commented Jan 18, 2026

Tika by default uses the first 12,000 bytes of a document for charset detection. This is an extremely computationally intensive process: every byte of the sample is checked against every supported charset, and ngram-based natural-language detection is additionally performed for ISO-8859-1. As a result, the majority of apache-rat's runtime is actually spent on charset detection.

Reducing the sample size to 256 bytes cuts the cost of charset detection by over 95%. On my machine, this single change halves the total runtime of `apache-rat:check`.
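For illustration, here is a minimal sketch of the idea (the class and method names below are mine, not the apache-rat code): read only a small prefix of the stream and hand that sample to Tika's `CharsetDetector`, rather than letting the detector fill its default buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public final class CharsetSniffer {

    /** Sample size for detection; 256 bytes is the value proposed in this PR. */
    private static final int DETECTION_SAMPLE_SIZE = 256;

    private CharsetSniffer() {
    }

    /**
     * Detects the charset of a stream by feeding only a small prefix to Tika.
     * A single read() may return fewer bytes than requested, which is fine
     * here because detection only needs a sample.
     */
    public static Charset sniff(final InputStream stream) throws IOException {
        final byte[] sample = new byte[DETECTION_SAMPLE_SIZE];
        final int read = stream.read(sample, 0, sample.length);
        if (read <= 0) {
            return Charset.defaultCharset();
        }
        final CharsetDetector detector = new CharsetDetector();
        detector.setText(Arrays.copyOf(sample, read));
        final CharsetMatch match = detector.detect();
        return match == null ? Charset.defaultCharset() : Charset.forName(match.getName());
    }
}
```

The trade-off is the usual one for sniffing: a smaller sample gives up a little detection accuracy in exchange for a large speedup, since most text files reveal their encoding within the first few hundred bytes.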

@ottlinger ottlinger changed the title Use smaller text samples for charset detection RAT-533: Use smaller text samples for charset detection Jan 18, 2026
```java
     */
    private static Charset detectCharset(final InputStream stream, final DocumentName documentName) throws IOException, UnsupportedCharsetException {
        CharsetDetector encodingDetector = new CharsetDetector();
        final int bytesForCharsetDetection = 256;
```
Contributor:

Should we extract this as a constant and document it properly?

Contributor:

Or use 256 implicitly in the constructor instead?

@rschmitt (Author):

> Or use 256 implicitly in the constructor instead?

This would have been my preference, but it triggered a Checkstyle failure.
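One way to sidestep the Checkstyle complaint while still passing the size through the `CharsetDetector(int)` constructor the thread refers to would be a documented class-level constant. This is only a sketch; the class and constant names are illustrative, not the committed code.

```java
import org.apache.tika.parser.txt.CharsetDetector;

public final class CharsetDetection {

    /**
     * Number of bytes sampled for charset detection. Named and documented here
     * so the literal 256 appears in exactly one place with an explanation,
     * which also keeps Checkstyle's magic-number check quiet.
     */
    public static final int CHARSET_DETECTION_SAMPLE_BYTES = 256;

    private CharsetDetection() {
    }

    /** Builds a detector whose read buffer is capped at the sample size. */
    public static CharsetDetector newDetector() {
        return new CharsetDetector(CHARSET_DETECTION_SAMPLE_BYTES);
    }
}
```

This would keep the reviewer's suggestion (size passed through the constructor) while also addressing the "extract it as a constant and document it" comment.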


@ottlinger ottlinger requested a review from Claudenw January 18, 2026 21:36
@rschmitt (Author):
I assume this SonarQube failure is unrelated?

Caused by: org.sonar.api.utils.MessageException: Project not found. Please check the 'sonar.projectKey' and 'sonar.organization' properties, the 'SONAR_TOKEN' environment variable, or contact the project administrator to check the permissions of the user the token belongs to
