-
Notifications
You must be signed in to change notification settings - Fork 53
RAT-533: Use smaller text samples for charset detection #612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| */ | ||
| private static Charset detectCharset(final InputStream stream, final DocumentName documentName) throws IOException, UnsupportedCharsetException { | ||
| CharsetDetector encodingDetector = new CharsetDetector(); | ||
| final int bytesForCharsetDetection = 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we extract this as a constant and document it properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or use 256 implicitly in the constructor instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or use 256 implicitly in the constructor instead?
This would have been my preference, but it triggered a Checkstyle failure.
| */ | ||
| private static Charset detectCharset(final InputStream stream, final DocumentName documentName) throws IOException, UnsupportedCharsetException { | ||
| CharsetDetector encodingDetector = new CharsetDetector(); | ||
| final int bytesForCharsetDetection = 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or use 256 implicitly in the constructor instead?
Tika by default uses the first 12,000 bytes of a document for charset detection. This is an extremely computationally intensive process that checks every byte of the sample against every supported charset, and which also performs ngram-based natural language detection for ISO-8859-1. As a result, the majority of apache-rat runtime is actually spent performing charset detection. Reducing the sample size to 256 bytes reduces the cost of charset detection by over 95%. On my machine, this single change cuts the total runtime of `apache-rat:check` in half.
|
I assume this SonarQube failure is unrelated? |
Tika by default uses the first 12,000 bytes of a document for charset detection. This is an extremely computationally intensive process that checks every byte of the sample against every supported charset, and which also performs ngram-based natural language detection for ISO-8859-1. As a result, the majority of apache-rat runtime is actually spent performing charset detection.
Reducing the sample size to 256 bytes reduces the cost of charset detection by over 95%. On my machine, this single change cuts the total runtime of
apache-rat:checkin half.