Bug 65649 - org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 20785132, but 10000000 is the maximum for this record type
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 4.1.2-FINAL
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2021-10-22 19:49 UTC by redmanmale
Modified: 2021-10-22 20:10 UTC
CC List: 0 users

sample (463.71 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2021-10-22 19:49 UTC, redmanmale

Description redmanmale 2021-10-22 19:49:09 UTC
Created attachment 38076

I've processed 30k documents and found ~60 docx files with a huge record length (~20,000,000 - 50,000,000).

Similar issue for ppt: https://bz.apache.org/bugzilla/show_bug.cgi?id=65639


org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@29b3f521
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:297)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)

<business logic>

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 20785132, but 10000000 is the maximum for this record type.
If the file is not corrupt, please open an issue on bugzilla to request 
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
	at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:630)
	at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:205)
	at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:173)
	at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
	at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
	at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:114)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
	... 11 more
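
The message in the trace describes a length guard: the requested array length is compared against a per-record maximum before anything is allocated, and the limit can be raised with IOUtils.setByteArrayMaxOverride(). Below is a minimal, self-contained sketch of that idea in plain Java. The class LengthGuard and its methods are hypothetical names used only for illustration, it throws IllegalStateException where POI throws RecordFormatException, and the rule that a positive override replaces the per-record maximum is an assumption rather than a copy of POI's source.

```java
/** Hypothetical model of an allocation guard; not POI's actual implementation. */
final class LengthGuard {
    // -1 means "no override". Assumption: a positive value replaces the per-record maximum.
    private static int maxOverride = -1;

    static void setMaxOverride(int value) {
        maxOverride = value;
    }

    /** Rejects the request before any array of that size would be allocated. */
    static void checkLength(long length, int recordMax) {
        int limit = maxOverride > 0 ? maxOverride : recordMax;
        if (length > limit) {
            throw new IllegalStateException("Tried to allocate an array of length " + length
                    + ", but " + limit + " is the maximum for this record type");
        }
    }

    public static void main(String[] args) {
        try {
            // Same numbers as in the reported stack trace.
            checkLength(20_785_132L, 10_000_000);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // Raising the limit above the largest observed length lets the request through.
        setMaxOverride(60_000_000);
        checkLength(20_785_132L, 10_000_000);
        System.out.println("accepted after raising the override");
    }
}
```

With the real library, the corresponding step would be a single call to IOUtils.setByteArrayMaxOverride() made before the document is opened; the value to pass depends on the largest entry you expect, here somewhere above 50,000,000.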
Comment 1 PJ Fanning 2021-10-22 20:01:33 UTC
The sample file loads ok for me (latest POI trunk code).

    void bug65649() throws IOException {
        try (XWPFDocument document = new XWPFDocument(samples.openResourceAsStream("bug65649.docx"))) {
            assertEquals(731, document.getParagraphs().size());
        }
    }

Can you provide more detail about the code you are using that fails? Can you also provide the full stack trace, including every "Caused by" section? You seem to have removed the one with the actual line where the error happens.
Comment 2 PJ Fanning 2021-10-22 20:04:44 UTC
Also note that the latest trunk code supports ZipInputStreamZipEntrySource#setThresholdBytesForTempFiles(int). If it is set, big data is put in a temp file instead of an error being thrown. The threshold is used by org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.
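
The threshold behaviour described in this comment can be pictured roughly as follows: an entry stays in memory while its size is at or below a threshold and is written to a temp file otherwise. This is only a sketch under assumptions. SpillingBuffer, its methods, and the exact comparison are hypothetical and do not mirror the internals of ZipArchiveFakeEntry.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical illustration of spilling large entries to disk; not POI's implementation. */
final class SpillingBuffer {
    private final byte[] inMemory; // set when the entry is kept in memory
    private final Path tempFile;   // set when the entry was spilled to disk

    private SpillingBuffer(byte[] inMemory, Path tempFile) {
        this.inMemory = inMemory;
        this.tempFile = tempFile;
    }

    /** Keeps small entries in memory and copies entries above the threshold to a temp file. */
    static SpillingBuffer read(InputStream in, long declaredSize, int thresholdBytes) {
        try {
            if (declaredSize > thresholdBytes) {
                Path tmp = Files.createTempFile("entry-", ".tmp");
                tmp.toFile().deleteOnExit();
                // createTempFile already created the file, so the copy must replace it.
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                return new SpillingBuffer(null, tmp);
            }
            return new SpillingBuffer(in.readAllBytes(), null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isSpilled() {
        return tempFile != null;
    }

    /** Returns the entry content regardless of where it is stored. */
    byte[] toBytes() {
        try {
            return isSpilled() ? Files.readAllBytes(tempFile) : inMemory.clone();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Removes the temp file, if one was created. */
    void delete() {
        try {
            if (isSpilled()) {
                Files.deleteIfExists(tempFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        SpillingBuffer small = read(new ByteArrayInputStream(data), data.length, 4096);
        SpillingBuffer big = read(new ByteArrayInputStream(data), data.length, 512);
        System.out.println("small spilled: " + small.isSpilled());
        System.out.println("big spilled: " + big.isSpilled());
        big.delete();
    }
}
```

The trade-off in this sketch is memory against disk I/O: a low threshold keeps heap usage bounded for documents with very large entries, at the cost of temp files that have to be cleaned up.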
Comment 3 PJ Fanning 2021-10-22 20:10:25 UTC
Actually, I'm going to close this because the latest POI code already has a fix for this.