Bug 52211

Summary: OpenXML4JRuntimeException when opening xlsx files on mainframe
Product: POI
Reporter: jxz164
Component: XSSF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED
Severity: normal
Priority: P2
Version: 3.8-dev
Target Milestone: ---
Hardware: PC
OS: other
Attachments: Test xlsx file

Description jxz164 2011-11-18 20:57:49 UTC
I am using POI 3.8 beta 5 (my own build from 10/06) on a mainframe to read Excel files. Reading and writing xls files works fine, but I get the following stack trace when reading xlsx files.

Exception in thread "main" org.apache.poi.openxml4j.exceptions.OpenXML4JRuntimeException: Package.init() : this exception should never happen, if you read this message please send a mail to the developers team. : The specified content type 'application/vnd.openxmlformats-package.core-properties+xml' is not compliant with RFC 2616: malformed content type.
	at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:166)
	at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:228)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:67)
	at TestWorkbookFactoryCreate.main(TestWorkbookFactoryCreate.java:16)

Here is the output of "java -version".

java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build pmz31dev-20090707 (SR10 ))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 z/OS s390-31 j9vmmz3123-20090707 (JIT enabled)
J9VM - 20090706_38445_bHdSMr
JIT  - 20090623_1334_r8
GC   - 200906_09)
JCL  - 20090705

Output of "uname -a"
OS/390 ABIZOS08 21.00 03 2818

Test code

import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;

import java.io.FileInputStream;
import java.io.IOException;


public class TestWorkbookFactoryCreate {

  public static void main(String[] args) throws IOException, Exception {
    FileInputStream fileIn = null;

    try
    {
      fileIn = new FileInputStream("utf8.xlsx");
      XSSFWorkbook wb = (XSSFWorkbook) WorkbookFactory.create(fileIn);
      System.out.println("Workbook created");
    } finally {
      if (fileIn != null)
        fileIn.close();
    }
  }

}
Comment 1 Nick Burch 2011-11-18 20:59:30 UTC
Could you please attach the problematic file too?

Also, do you know how the file was generated?
Comment 2 jxz164 2011-11-18 21:13:03 UTC
Created attachment 27970 [details]
Test xlsx file
Comment 3 jxz164 2011-11-18 21:13:58 UTC
Any xlsx file created by Excel 2007 has this problem. I have attached a sample file.
Comment 4 jxz164 2011-11-18 21:20:55 UTC
I did more testing on this on mainframe and figured out that I have to pass the -Dfile.encoding=utf-8 option.

$ java -Dfile.encoding=UTF-8 TestWorkbookFactoryCreate
Workbook created

$ java TestWorkbookFactoryCreate
Exception in thread "main" org.apache.poi.openxml4j.exceptions.OpenXML4JRuntimeException: Package.init() : this exception should never happen, if you read this message please send a mail to the developers team. : The specified content type 'application/vnd.openxmlformats-package.core-properties+xml' is not compliant with RFC 2616: malformed content type.
	at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:166)
	at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:228)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:67)
	at TestWorkbookFactoryCreate.main(TestWorkbookFactoryCreate.java:16)

So passing -Dfile.encoding=UTF-8 solves my problem. The default encoding on the mainframe is EBCDIC, and I have to override it with UTF-8. I filed this as a POI bug because the error message asked me to report it.
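
A minimal sketch for checking which default encoding a JVM actually picked up, with and without the -Dfile.encoding option. The class name ShowDefaultEncoding and the describe() helper are made up for illustration; only System.getProperty and Charset.defaultCharset() are standard JDK calls.

```java
import java.nio.charset.Charset;

class ShowDefaultEncoding {
    // Reports the file.encoding system property and the charset the JVM
    // uses when no encoding is given (e.g. by String.getBytes()).
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as "java ShowDefaultEncoding" and once as "java -Dfile.encoding=UTF-8 ShowDefaultEncoding" should show whether the option changes the default charset on a given platform.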
Comment 5 Nick Burch 2011-11-18 21:28:11 UTC
Hmm, we must have an encoding assumption in the OPC code somewhere then

The odd thing is that that error message is coming from the ContentType class, which does hard code the encoding to US-ASCII, so I'm not sure where the issue is
Comment 6 jxz164 2011-11-21 19:33:06 UTC
I hope to get this working without passing the -Dfile.encoding=UTF-8 option when calling java.
Comment 7 Nick Burch 2011-11-21 21:16:56 UTC
If you're able to, fire up your JVM with remote debugging enabled, and attach a remote debugger (e.g. Eclipse) to it. Then step through the problem code and see if you can work out which value is incorrectly encoded and causing the failure.

(Nothing springs to mind as wrong from looking at the source code, so it's likely something subtle)
Comment 8 Constantin 2012-09-28 08:44:33 UTC
Hello,

We are using the POI API (stable 3.8) on a system whose default encoding is ibm500.
We got the same error when trying to create a Workbook using WorkbookFactory.create(ByteArrayInputStream bais).

We found that the problem lies in the method
org.apache.poi.openxml4j.opc.internal.ContentType.ContentType(String contentType)

In line 139, the following code is called:
contentTypeASCII = new String(contentType.getBytes(), "US-ASCII");

String.getBytes() returns the bytes in the default system encoding (for instance ibm500). Those bytes are then decoded as US-ASCII. This cannot work when the default encoding is not ASCII-compatible.
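
A minimal sketch of that round trip, assuming the JDK in use ships the IBM500 charset. The class and method names are hypothetical, and the platform default is passed in explicitly so the effect can be reproduced on any machine rather than only on one whose default is ibm500.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingDemo {
    // Mimics new String(contentType.getBytes(), "US-ASCII"), with the
    // platform default charset replaced by an explicit parameter.
    static String reDecodeAsAscii(String s, Charset assumedDefault) {
        return new String(s.getBytes(assumedDefault), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String contentType =
                "application/vnd.openxmlformats-package.core-properties+xml";

        // ASCII-compatible default: the string survives unchanged.
        System.out.println(contentType.equals(
                reDecodeAsAscii(contentType, StandardCharsets.US_ASCII))); // true

        // EBCDIC default: the bytes mean something else in US-ASCII.
        System.out.println(contentType.equals(
                reDecodeAsAscii(contentType, Charset.forName("IBM500")))); // false
    }
}
```

With an EBCDIC default the re-decoded string no longer resembles a media type, which would explain why the content type check rejects a perfectly valid value.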

So we wonder why this conversion is done at all.

We deleted the line and put in the following code instead:
contentTypeASCII = contentType;

Afterwards it worked fine.

Regards
Constantin
Comment 9 Yegor Kozlov 2012-10-01 13:20:52 UTC
It is very likely that your hypothesis is correct and this line of code can cause problems.

The problematic piece of code has existed since POI-3.5, when OpenXml4j was contributed to Apache POI.
I guess the intention was to ensure that the string being parsed and validated is in the ASCII encoding.
This "worked" for years, but the conversion does not make sense: if the input argument contains characters above ASCII, they are converted to U+FFFD (the Unicode replacement character) and the subsequent validation against the patternMediaType regex fails.

Consider the following examples:

(a) new ContentType("text/\u007E") 
(b) new ContentType("text/\u0080") 

The first case (a) works because all characters in the input string are in ASCII and the conversion does not change the input string.
The second case (b) fails whether or not the input argument is re-converted to US-ASCII. If you apply your fix (contentTypeASCII=contentType), the regex check at line 146 fails. The current code first converts the input string to "text/\uFFFD" and then the regex fails.
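
A rough sketch of the two cases. SIMPLE_MEDIA_TYPE is a simplified, hypothetical stand-in for the real patternMediaType regex (it only accepts printable ASCII on both sides of the slash), and ISO-8859-1 is assumed as the platform default so that \u0080 becomes a single U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class AsciiConversionDemo {
    // Hypothetical stand-in for patternMediaType, not the POI regex.
    static final Pattern SIMPLE_MEDIA_TYPE =
            Pattern.compile("[\\x21-\\x7E]+/[\\x21-\\x7E]+");

    // The conversion under discussion, with an explicit "default" charset.
    static String reDecode(String s, Charset assumedDefault) {
        return new String(s.getBytes(assumedDefault), StandardCharsets.US_ASCII);
    }

    static boolean isValid(String s) {
        return SIMPLE_MEDIA_TYPE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String a = "text/\u007E";
        String b = "text/\u0080";
        Charset assumed = StandardCharsets.ISO_8859_1;

        // (a) is valid with or without the conversion.
        System.out.println(isValid(a) + " " + isValid(reDecode(a, assumed))); // true true
        // (b) is rejected with or without the conversion.
        System.out.println(isValid(b) + " " + isValid(reDecode(b, assumed))); // false false
    }
}
```

Under these assumptions the conversion never changes the outcome of the validation, so dropping it only removes the dependency on the platform default encoding.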

So I agree that this conversion is redundant and can be removed. The fix is coming soon.

Regards,
Yegor

Comment 10 Yegor Kozlov 2012-10-04 11:53:19 UTC
Should be fixed in r1394001. 

Yegor