Conversation

@rschmitt rschmitt commented Jan 18, 2026

Tika by default uses the first 12,000 bytes of a document for charset detection. This is an extremely computationally intensive process: every byte of the sample is checked against every supported charset, and ngram-based natural-language detection is additionally performed for ISO-8859-1. As a result, the majority of apache-rat's runtime is actually spent on charset detection.

Reducing the sample size to 256 bytes cuts the cost of charset detection by over 95%. On my machine, this single change halves the total runtime of `apache-rat:check`.
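For illustration, here is a minimal sketch of the idea (the class and method names below are mine, not the apache-rat code): read only a small prefix of the stream and hand that sample to Tika's `CharsetDetector`, rather than letting the detector fill its default buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public final class CharsetSniffer {

    /** Sample size for detection; 256 bytes is the value proposed in this PR. */
    private static final int DETECTION_SAMPLE_SIZE = 256;

    private CharsetSniffer() {
    }

    /**
     * Detects the charset of a stream by feeding only a small prefix to Tika.
     * A single read() may return fewer bytes than requested, which is fine
     * here because detection only needs a sample.
     */
    public static Charset sniff(final InputStream stream) throws IOException {
        final byte[] sample = new byte[DETECTION_SAMPLE_SIZE];
        final int read = stream.read(sample, 0, sample.length);
        if (read <= 0) {
            return Charset.defaultCharset();
        }
        final CharsetDetector detector = new CharsetDetector();
        detector.setText(Arrays.copyOf(sample, read));
        final CharsetMatch match = detector.detect();
        return match == null ? Charset.defaultCharset() : Charset.forName(match.getName());
    }
}
```

The trade-off is the usual one for sniffing: a smaller sample gives up a little detection accuracy in exchange for a large speedup, since most text files reveal their encoding within the first few hundred bytes.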

@ottlinger ottlinger changed the title Use smaller text samples for charset detection RAT-533: Use smaller text samples for charset detection Jan 18, 2026
```java
     */
    private static Charset detectCharset(final InputStream stream, final DocumentName documentName) throws IOException, UnsupportedCharsetException {
        CharsetDetector encodingDetector = new CharsetDetector();
        final int bytesForCharsetDetection = 256;
```
Contributor:

Should we extract this as a constant and document it properly?

Contributor:

Or use 256 implicitly in the constructor instead?

@rschmitt (Author):

> Or use 256 implicitly in the constructor instead?

This would have been my preference, but it triggered a Checkstyle failure.
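One way to sidestep the Checkstyle complaint while still passing the size through the `CharsetDetector(int)` constructor the thread refers to would be a documented class-level constant. This is only a sketch; the class and constant names are illustrative, not the committed code.

```java
import org.apache.tika.parser.txt.CharsetDetector;

public final class CharsetDetection {

    /**
     * Number of bytes sampled for charset detection. Named and documented here
     * so the literal 256 appears in exactly one place with an explanation,
     * which also keeps Checkstyle's magic-number check quiet.
     */
    public static final int CHARSET_DETECTION_SAMPLE_BYTES = 256;

    private CharsetDetection() {
    }

    /** Builds a detector whose read buffer is capped at the sample size. */
    public static CharsetDetector newDetector() {
        return new CharsetDetector(CHARSET_DETECTION_SAMPLE_BYTES);
    }
}
```

This would keep the reviewer's suggestion (size passed through the constructor) while also addressing the "extract it as a constant and document it" comment.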


@ottlinger ottlinger requested a review from Claudenw January 18, 2026 21:36
@rschmitt (Author):
I assume this SonarQube failure is unrelated?

Caused by: org.sonar.api.utils.MessageException: Project not found. Please check the 'sonar.projectKey' and 'sonar.organization' properties, the 'SONAR_TOKEN' environment variable, or contact the project administrator to check the permissions of the user the token belongs to
