@@ -165,7 +165,8 @@ public static String process(final Document document) throws RatDocumentAnalysis
* @throws UnsupportedCharsetException on unsupported charset.
*/
private static Charset detectCharset(final InputStream stream, final DocumentName documentName) throws IOException, UnsupportedCharsetException {
-CharsetDetector encodingDetector = new CharsetDetector();
+final int bytesForCharsetDetection = 256;
Contributor:
Should we extract this as a constant and document it properly?

Contributor:
Or use 256 implicitly in the constructor instead?

Author:

> Or use 256 implicitly in the constructor instead?

This would have been my preference, but it triggered a Checkstyle failure.

+CharsetDetector encodingDetector = new CharsetDetector(bytesForCharsetDetection);
encodingDetector.setText(stream);
CharsetMatch charsetMatch = encodingDetector.detect();
if (charsetMatch != null) {
@@ -109,8 +109,10 @@ public boolean equals(final Object obj) {
* @throws IOException if this document cannot be read.
*/
public Reader reader() throws IOException {
+final int bytesForCharsetDetection = 256;
+CharsetDetector charsetDetector = new CharsetDetector(bytesForCharsetDetection);
// RAT-494: Tika's CharsetDetector.getReader() may return null if the reader cannot be constructed due to I/O or encoding errors
-Reader result = new CharsetDetector().getReader(TikaProcessor.markSupportedInputStream(inputStream()), getMetaData().getCharset().name());
+Reader result = charsetDetector.getReader(TikaProcessor.markSupportedInputStream(inputStream()), getMetaData().getCharset().name());
if (result == null) {
throw new IOException(String.format("Can not read document `%s`", getName()));
}
3 changes: 3 additions & 0 deletions src/changes/changes.xml
@@ -68,6 +68,9 @@ in order to be properly linked in site reports.
</release>
-->
<release version="1.0.0" date="xxxx-yy-zz" description="Current SNAPSHOT - release to be done">
+<action issue="RAT-533" type="fix" dev="pottlinger" due-to="Ryan Schmitt">
+Reduce the sample size used for charset detection (Tika) from 12000 to 256 bytes to improve I/O performance of RAT scans.
+</action>
<action issue="RAT-531" type="fix" dev="pottlinger" due-to="huangxiaoping">
Fix NPE that license families is null if licenses are defined manually, reported by huangxiaoping from Hudi.
</action>
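On the review thread about `bytesForCharsetDetection`: one way to follow the "extract this as a constant and document it" suggestion is a class-level constant with Javadoc. The thread does not say which Checkstyle check failed, so whether this form would pass is an assumption. The sketch below is illustrative only and uses plain `java.io` as a stand-in for Tika's `CharsetDetector`; the class `SampleLimit`, the constant `BYTES_FOR_CHARSET_DETECTION`, and the method `readSample` are hypothetical names that do not exist in RAT or Tika. It shows the idea behind the change: only a small prefix of the stream is sampled, and the stream is rewound afterwards so the document can still be read from the start.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class SampleLimit {
    /**
     * Hypothetical constant: number of leading bytes sampled for charset detection.
     * The value 256 is taken from the diff above.
     */
    public static final int BYTES_FOR_CHARSET_DETECTION = 256;

    /**
     * Reads at most BYTES_FOR_CHARSET_DETECTION bytes from the stream and then
     * resets it to the position it had on entry. The stream must support mark/reset.
     */
    public static byte[] readSample(final InputStream stream) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(BYTES_FOR_CHARSET_DETECTION);
        byte[] buffer = new byte[BYTES_FOR_CHARSET_DETECTION];
        int total = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (total < buffer.length) {
            int n = stream.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        stream.reset();
        return Arrays.copyOf(buffer, total);
    }

    public static void main(final String[] args) throws IOException {
        byte[] data = new byte[12000];
        Arrays.fill(data, (byte) 'a');
        InputStream in = new ByteArrayInputStream(data);
        byte[] sample = readSample(in);
        System.out.println(sample.length);   // size of the sampled prefix
        System.out.println(in.available());  // bytes still readable after the reset
    }
}
```

A named constant keeps the two call sites in the diff (`detectCharset` and `reader()`) from each declaring their own local copy of the value.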