I came across a batch of docs where Tika's SummaryExtractor hits a RecordFormatException because the heuristic max record size is set to 100,000 in Property->VariantSupport#read(). I need to finish processing these docs, but the largest length I've seen so far is 2,500,000.
A stacktrace would be good, to see exactly where this is hit. Also, we should not just increase such limits as soon as a few documents exceed them; that dilutes the safeguards over time and would make them useless at some point. If some users need higher values, we can add a way to configure or disable the limit, but the default should still stop at some point before memory is exhausted, assuming a certain amount of available memory (a few GB?).
Y, completely agree, Dominik. I'll attach a stacktrace tomorrow.
Here's the stacktrace:

WARN [pool-3-thread-6] 08:45:01,740 org.apache.tika.parser.microsoft.AbstractPOIFSExtractor Ignoring unexpected exception while parsing summary entry DOCUMENTSUMMARYINFORMATION
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 743,564, but the maximum length for this record type is 100,000. If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. You can set a higher override value with IOUtils.setByteArrayMaxOverride()
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:596) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:281) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.util.IOUtils.safelyAllocateCheck(IOUtils.java:560) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:546) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:255) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.hpsf.Property.<init>(Property.java:182) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.hpsf.Section.<init>(Section.java:240) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:493) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:194) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:116) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:97) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:211) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
FYI, this uses CodePageString#getMaxRecordLength(), so CodePageString#setMaxRecordLength() can be used to allow more in case it is really needed. Also, the generic IOUtils.setByteArrayMaxOverride() can already be used to set a higher global value.
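
For illustration, the interplay between a per-record-type limit and a global override can be sketched in plain Java. This is a stand-alone sketch, not POI's code: the class RecordLimitSketch and all of its methods are hypothetical names, and the rule that a positive global override takes precedence over the per-type limit is an assumption made for the example.

```java
// Hypothetical, stand-alone sketch of a guarded allocation with a
// configurable per-record-type limit and an optional global override.
// It does not use or reproduce any Apache POI class.
public class RecordLimitSketch {
    // Default taken from the error message in this thread.
    public static final int DEFAULT_MAX_RECORD_LENGTH = 100_000;

    private static int maxRecordLength = DEFAULT_MAX_RECORD_LENGTH;
    // Values <= 0 mean "no global override" (assumption for this sketch).
    private static int globalOverride = -1;

    public static synchronized void setMaxRecordLength(int length) {
        maxRecordLength = length;
    }

    public static synchronized int getMaxRecordLength() {
        return maxRecordLength;
    }

    public static synchronized void setGlobalOverride(int length) {
        globalOverride = length;
    }

    // Checks the requested length before allocating, so that a corrupt
    // length field cannot trigger a huge allocation.
    public static synchronized byte[] safelyAllocate(long length) {
        int limit = globalOverride > 0 ? globalOverride : maxRecordLength;
        if (length < 0 || length > limit) {
            throw new IllegalStateException(
                "Tried to allocate an array of length " + length
                + ", but the maximum length is " + limit);
        }
        return new byte[(int) length];
    }

    public static void main(String[] args) {
        try {
            safelyAllocate(743_564); // exceeds the default limit
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        setMaxRecordLength(3_000_000); // raise the limit for this record type only
        System.out.println("Allocated " + safelyAllocate(743_564).length + " bytes");
    }
}
```

Raising only the per-type limit keeps the safeguard in place for every other record type, which fits the concern above about not diluting the defaults.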