Bug 68666 - Bump max record size in PropertySet->VariantSupport
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: HPSF
Version: 5.2.3-FINAL
Hardware: PC Linux
Importance: P2 trivial
Target Milestone: ---
Assignee: POI Developers List
Reported: 2024-02-23 18:00 UTC by Tim Allison
Modified: 2024-02-25 19:17 UTC



Description Tim Allison 2024-02-23 18:00:55 UTC
I came across a batch of docs where Tika's SummaryExtractor is hitting a RecordFormatException because the heuristic max record size is set to 100,000 in Property->VariantSupport#read().

I still need to finish processing these docs, but the largest length requested so far is 2,500,000.
Comment 1 Dominik Stadler 2024-02-25 09:36:52 UTC
A stacktrace would be good to see where exactly this is hit. 

Also, we should not raise such limits just because a few documents exceed them; doing so dilutes the safeguards over time and would eventually make them useless.

If some users need higher values, we can add a way to configure or disable the limit, but by default parsing should stop well before memory is exhausted on a machine with a typical amount of memory (a few GB?).
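
The trade-off described in this comment (a conservative default that callers can raise deliberately) can be pictured with a small stand-alone model. This is only a sketch: `AllocationGuard` and its methods are hypothetical names invented for illustration, not POI's implementation, and the 100,000 default is simply the figure quoted in the report.

```java
// Hypothetical model of a configurable allocation safeguard; not POI code.
final class AllocationGuard {
    /** Default cap, set to the 100,000 figure quoted in the report. */
    static final int DEFAULT_MAX_LENGTH = 100_000;

    // A value <= 0 means "no override configured, use the default".
    private static int maxOverride = -1;

    private AllocationGuard() {}

    /** Lets callers that knowingly handle large records raise the cap. */
    static void setMaxOverride(int newMax) {
        maxOverride = newMax;
    }

    /** Returns the cap currently in effect. */
    static int effectiveMax() {
        return maxOverride > 0 ? maxOverride : DEFAULT_MAX_LENGTH;
    }

    /** Allocates the array only if the requested length passes the check. */
    static byte[] safelyAllocate(long length) {
        int max = effectiveMax();
        if (length < 0 || length > max) {
            throw new IllegalStateException(
                "Tried to allocate " + length + " bytes, but the maximum is " + max);
        }
        return new byte[(int) length];
    }
}
```

In this model, any request above 100,000 bytes is rejected until a caller opts in with `AllocationGuard.setMaxOverride(...)`, so the default stays protective for everyone else.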
Comment 2 Tim Allison 2024-02-25 13:09:06 UTC
Y, completely agree, Dominik. I'll attach a stacktrace tomorrow.
Comment 3 Tim Allison 2024-02-25 13:47:59 UTC
Here's the stacktrace:

WARN  [pool-3-thread-6] 08:45:01,740 org.apache.tika.parser.microsoft.AbstractPOIFSExtractor Ignoring unexpected exception while parsing summary entry DOCUMENTSUMMARYINFORMATION
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 743,564, but the maximum length for this record type is 100,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request 
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
	at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:596) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:281) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.util.IOUtils.safelyAllocateCheck(IOUtils.java:560) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:546) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:255) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.hpsf.Property.<init>(Property.java:182) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.hpsf.Section.<init>(Section.java:240) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:493) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:194) ~[tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:116) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:97) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:211) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183) [tika-app-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
Comment 4 Dominik Stadler 2024-02-25 19:17:41 UTC
FYI, this uses CodePageString#getMaxRecordLength(), so CodePageString#setMaxRecordLength() can be used to allow more in case it is really needed. 

Also the generic IOUtils.setByteArrayMaxOverride() can already be used to provide a global higher value.
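
For readers hitting the same exception, the two setters named in this comment could be wrapped roughly as follows. This is a minimal sketch that assumes POI is on the classpath; the `HpsfLimits` class and its method names are hypothetical, and any value passed in (for example 3,000,000, chosen only because it exceeds the 2,500,000 reported above) is an illustration rather than a recommendation.

```java
import org.apache.poi.hpsf.CodePageString;
import org.apache.poi.util.IOUtils;

// Hypothetical helper; only the two POI setters come from the thread.
final class HpsfLimits {
    private HpsfLimits() {}

    /** Raises only the limit read via CodePageString#getMaxRecordLength(). */
    static void allowLargerHpsfRecords(int maxBytes) {
        CodePageString.setMaxRecordLength(maxBytes);
    }

    /** Raises the generic override that IOUtils applies globally. */
    static void allowLargerArraysEverywhere(int maxBytes) {
        IOUtils.setByteArrayMaxOverride(maxBytes);
    }
}
```

Both setters are static, so a call such as `HpsfLimits.allowLargerHpsfRecords(3_000_000)` affects every parse in the same JVM. The narrower `CodePageString` setter is probably the safer first choice, since it leaves the safeguards for other record types untouched.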