Bug 65706

Summary: Warning 'Entry [trash]/0000.dat is not valid' for valid trash item while opening .xlsx file
Product: POI Reporter: Bernhard Schuhmann <bernhard.schuhmann>
Component: OPCAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Mac OS X 10.1   
Attachments: Excel file with trash item

Description Bernhard Schuhmann 2021-11-25 10:41:05 UTC
Created attachment 38103 [details]
Excel file with trash item

POI 5.1.0 logs a warning while reading an .xlsx file which contains a trash item, this is the log entry:

2021-11-25 10:00:21,801  WARN [pool-4-thread-1] (ZipPackage.java:340) - Entry [trash]/0000.dat is not valid, so this part won't be add to the package.
org.apache.poi.openxml4j.exceptions.InvalidFormatException: Illegal character in path at index 1: /[trash]/0000.dat
        at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:500) ~[poi-ooxml-5.1.0.jar:5.1.0]
        at org.apache.poi.openxml4j.opc.ZipPackage$EntryTriple.<init>(ZipPackage.java:337) ~[poi-ooxml-5.1.0.jar:5.1.0]
        at org.apache.poi.openxml4j.opc.ZipPackage.lambda$getPartsImpl$0(ZipPackage.java:311) ~[poi-ooxml-5.1.0.jar:5.1.0]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
        at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:314) ~[poi-ooxml-5.1.0.jar:5.1.0]
        at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:740) ~[poi-ooxml-5.1.0.jar:5.1.0]
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:315) ~[poi-ooxml-5.1.0.jar:5.1.0]

According to the spec trash items are valid content of an ooxml file (see https://www.ecma-international.org/publications-and-standards/standards/ecma-376/ part 1):

9.1.5 Trash Items

Trash items represent parts that have been discarded or are no longer in use. Trash items shall not conform to OPC part naming guidelines as defined in ECMA-376-2 and shall not be associated with a content type. All trash items shall follow the naming scheme: [trash]/HHHH.dat where H represents a hexadecimal digit.

[Example: A package has two parts that must be updated in-place but both parts have grown beyond their growth hints. The newer updated parts are added as new ZIP items while the original parts are renamed to:

  [trash]/0000.dat
  [trash]/0001.dat

end example]

In our case I guess SharePoint has added this trash item.

Since it's valid content of the ooxml file it should probably not get logged as a warning. I know warnings are not errors, but still these warnings create noise. We can filter those warnings though - especially since we haven't paid attention to POI logging before you were moving to log4j2.

BTW, there's a typo in the log message - it should read '...so this part won't be added...'.
Comment 1 PJ Fanning 2021-11-25 10:49:15 UTC
We are not going to fix every log statement. Can you try configuring log4j to remove the logs you don't like? This is the whole point of logging frameworks.
Comment 2 PJ Fanning 2021-11-25 10:51:47 UTC
added r1895322
Comment 3 Bernhard Schuhmann 2021-11-25 11:05:14 UTC
Sure. But pls consider rethinking the log level as unnecessary warnings create noise in the log. And a misconfigured filter in the logging framework could swallow warnings that you might actually want to get.

Yes, meaningful logging is less important than getting it running right. But it's still an indication for quality. I personally consider logging an integral part of my software - it's the part ops is using when it's in production.

As said earlier - it is not a big issue for us as we have ignored POI logging before, so we don't loose much if we filter those erroneous warnings out.
Comment 4 Nick Burch 2021-11-25 11:24:34 UTC
Based on the bit of the ECMA spec, it does seem that trash files are allowed (required?) to ignore all of the normal rules on what is a valid part name and what isn't allowed as a part name

Not sure if we need to properly handle them, or just keep quiet about their non-standard naming?
Comment 5 Bernhard Schuhmann 2021-11-25 11:33:12 UTC
The discussion in [1] pointed me to the standard. I read it that these trash items are considered valid content of an ooxml ZIP file and hence I'd assume POI should not warn about it. I'd however be thankful for a log entry on level DEBUG so if something is failing I could enable DEBUG logging and see the warning about the trash item.

Not sure how the excel files we use for testing picked up all these trash items. The attached file contains one, I'm now chasing several other regression tests where excel files have more than one trash item. Haven't figured out who is actually using them - but since for instance SharePoint is creating them (see [1], we store our test files in SharePoint), it hopefully should also be _using_ them - but then it's Microsoft...

[1] https://social.technet.microsoft.com/Forums/en-US/287650b5-293c-48bc-90ec-9e13a61a46a6/office365-word-document-docx-banned-from-mailer-if-you-edit-properties-online-bug-?forum=sharepointgeneral
Comment 6 PJ Fanning 2021-11-27 10:17:58 UTC
I've added r1895370 - code will skip trash parts - this is probably temporary because someone will eventually want to use POI to access trash parts.