Created attachment 38076
sample

I've processed 30k documents and found ~60 docx files with a huge record length (~20,000,000 - 50,000,000).

Similar issue for ppt: https://bz.apache.org/bugzilla/show_bug.cgi?id=65639

Stacktrace:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@29b3f521
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:297)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    <business logic>
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 20785132, but 10000000 is the maximum for this record type. If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:630)
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:205)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:173)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:114)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 11 more
The sample file loads fine for me with the latest POI trunk code.

```
@Test
void bug65649() throws IOException {
    // Open the attached sample through XWPFDocument and check that it parses fully.
    try (XWPFDocument document = new XWPFDocument(samples.openResourceAsStream("bug65649.docx"))) {
        assertEquals(731, document.getParagraphs().size());
    }
}
```

Can you provide more detail about the code you are using that fails? Can you also provide the full stack trace, including all the "Caused by" sections? You seem to have removed the "Caused by" with the actual line where the error happens.
Also note that the latest trunk code supports ZipInputStreamZipEntrySource#setThresholdBytesForTempFiles(int). When that threshold is set, large entries are written to a temp file instead of an error being thrown. The setting is used by org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.
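
To illustrate the general idea behind a temp-file threshold, here is a minimal JDK-only sketch. It is not POI's implementation: the class `SpilledEntry` and its methods are hypothetical names invented for this example, and it only assumes the behavior described above, that data over a configured size goes to a temp file rather than into a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration: keeps small entries on the heap, spills large ones to disk. */
final class SpilledEntry implements AutoCloseable {
    private final byte[] inMemory; // non-null when the data stayed in memory
    private final Path tempFile;   // non-null when the data was written to a temp file

    private SpilledEntry(byte[] inMemory, Path tempFile) {
        this.inMemory = inMemory;
        this.tempFile = tempFile;
    }

    /** Reads the whole stream; thresholdBytes must be below Integer.MAX_VALUE. */
    static SpilledEntry read(InputStream in, int thresholdBytes) throws IOException {
        // Read one byte more than the threshold to learn whether the entry exceeds it.
        byte[] head = in.readNBytes(thresholdBytes + 1);
        if (head.length <= thresholdBytes) {
            return new SpilledEntry(head, null);
        }
        Path tmp = Files.createTempFile("entry-", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            out.write(head);    // bytes already consumed from the stream
            in.transferTo(out); // remainder goes straight to disk
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial temp file behind
            throw e;
        }
        return new SpilledEntry(null, tmp);
    }

    boolean isOnDisk() {
        return tempFile != null;
    }

    long size() throws IOException {
        return inMemory != null ? inMemory.length : Files.size(tempFile);
    }

    InputStream open() throws IOException {
        return inMemory != null ? new ByteArrayInputStream(inMemory) : Files.newInputStream(tempFile);
    }

    @Override
    public void close() throws IOException {
        if (tempFile != null) {
            Files.deleteIfExists(tempFile);
        }
    }

    public static void main(String[] args) throws IOException {
        try (SpilledEntry entry = read(new ByteArrayInputStream(new byte[100]), 16)) {
            System.out.println("on disk: " + entry.isOnDisk() + ", size: " + entry.size());
        }
    }
}
```

The trade-off in a design like this is heap usage against disk I/O: a low threshold keeps memory bounded for unusually large entries, at the cost of temp-file cleanup that the caller has to remember (here via `close()`).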
Actually, I'm going to close this because the latest POI code already has a fix for this.