Bug 52385

Summary: [REGRESSION] HPSF corrupts output when starting file has unsupported variant props
Product: POI
Reporter: Yegor Kozlov <yegor>
Component: HPSF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: critical    
Priority: P2    
Version: 3.13-dev   
Target Milestone: ---   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 52337, 52538    
Attachments: Diagram of the HeadingPair/DocParts TypedProperty structures

Description Yegor Kozlov 2011-12-25 20:11:36 UTC
[REGRESSION] HPSF 

It looks like we have a regression caused by recent changes in HPSF: an OLE2 file becomes unreadable after it is written if it contains a variant property of an unsupported type. In my testing the problematic variant types were 4126 and 4108. The log warnings are below:

HPSF does not yet support the variant type 4126 (unknown variant type, 000000000000101E).  
HPSF does not yet support the variant type 4108 (unknown variant type, 000000000000100C). 

I was working on some improvements in HSSF and noticed that Excel couldn't open the output file. At first I thought it was my changes, but it turned out that even a simple read-write round trip results in unreadable output:


  HSSFWorkbook wb = new HSSFWorkbook(new FileInputStream(inputFile));

  FileOutputStream os = new FileOutputStream(outputFile);
  wb.write(os);
  os.close();

Try the code above against the following files from our collection of test files and the output will be corrupted.
  

12843-1.xls        34775.xls              45365.xls    ContinueRecordProblem.xls         OddStyleRecord.xls
13224.xls          37684-2.xls            45365-2.xls  ex42570-20305.xls                 RangePtg.xls
14460.xls          41139.xls              46137.xls    ex44921-21902.xls                 testNames.xls
24207.xls          42464-ExpPtg-bad.xls   47034.xls    ex45978-extraLinkTableSheets.xls  XRefCalc.xls
27852.xls          42464-ExpPtg-ok.xls    47847.xls    ex46548-23133.xls                 XRefCalcData.xls
29982.xls          42844.xls              48026.xls    IndexFunctionTestCaseData.xls
30978-deleted.xls  44010-SingleChart.xls  49185.xls    IrrNpvTestCaseData.xls
32822.xls          44010-TwoCharts.xls    50939.xls    MRExtraLines.xls

Excel 2010 shows a warning when opening such files.  

The problem seems to be related to OLE properties and HPSF. If I comment out line 1218 in HSSFWorkbook, then all is fine and Excel is happy to open the output files:

        // Write out our HPFS properties, if we have them
        writeProperties(fs, excepts);

This is a must for 3.8-final. 

Yegor
Comment 1 Niklas Rehfeld 2012-01-03 21:47:30 UTC
I think this is related to (or rather, causes) bug #52337, as the returned structure should be of type VT_VECTOR | VT_VARIANT (0x100C). 

So it seems to me that the problem is in the code that reads the property sets, rather than the writing. 

Nik
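
For reference, the two type codes from the log warnings can be split into a vector flag and a base type. The following is a minimal sketch, not POI code: the class and method names are made up for illustration, and only the constant values (VT_VECTOR = 0x1000, VT_VARIANT = 0x000C, VT_LPSTR = 0x001E) come from the OLE property set specification.

```java
// Hypothetical helper, not part of POI: splits a property type code
// into the VT_VECTOR flag and the base variant type.
final class VariantTypeDecoder {
    static final int VT_VECTOR   = 0x1000; // flag: value is a vector of the base type
    static final int VT_TYPEMASK = 0x0FFF; // low bits hold the base type
    static final int VT_VARIANT  = 0x000C; // 12
    static final int VT_LPSTR    = 0x001E; // 30

    public static String describe(int variantType) {
        boolean isVector = (variantType & VT_VECTOR) != 0;
        int base = variantType & VT_TYPEMASK;
        String baseName;
        switch (base) {
            case VT_VARIANT: baseName = "VT_VARIANT"; break;
            case VT_LPSTR:   baseName = "VT_LPSTR";   break;
            default:         baseName = String.format("0x%04X", base);
        }
        return isVector ? "VT_VECTOR | " + baseName : baseName;
    }

    public static void main(String[] args) {
        System.out.println("4108 -> " + describe(4108)); // 0x100C
        System.out.println("4126 -> " + describe(4126)); // 0x101E
    }
}
```

Under these constants, 4108 decodes to VT_VECTOR | VT_VARIANT and 4126 to VT_VECTOR | VT_LPSTR, so both warnings refer to vector-typed properties.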
Comment 2 Niklas Rehfeld 2012-01-05 02:55:53 UTC
I had a look around the code, the bug seems to be in 

TypedPropertyValue.read(byte[], int)

in the fact that it automatically pads the result, i.e. returns a 'padded' offset. This is bad when reading the Heading Pairs vector (and possibly others) in the DocumentSummaryInformation stream, as they use *unpadded* strings of the type UnalignedLpstr (http://msdn.microsoft.com/en-us/library/dd950621%28v=office.12%29.aspx).
I hope that this is the same bug, and not completely unrelated. 

Nik
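
To illustrate why a padded offset is harmful here, the sketch below packs two length-prefixed strings back to back with no alignment padding and then reads them twice, once advancing by the exact length and once rounding the offset up to a multiple of 4. It is a simplified stand-in, not the POI implementation: the class and method names are hypothetical, the layout (4-byte little-endian byte count followed by a null-terminated ASCII string) is assumed from the UnalignedLpstr description, and the real Heading Pairs vector also interleaves integer values.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of POI.
final class UnalignedStringsDemo {

    // Packs ASCII strings as [int32 cch][bytes][NUL] with no padding between them.
    public static byte[] pack(String... values) {
        int size = 0;
        for (String v : values) {
            size += 4 + v.length() + 1;
        }
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        for (String v : values) {
            buf.putInt(v.length() + 1); // count includes the terminator
            buf.put(v.getBytes(StandardCharsets.US_ASCII));
            buf.put((byte) 0);
        }
        return buf.array();
    }

    // Reads up to 'count' strings; if padTo4 is set, the offset is rounded up
    // to a 4-byte boundary after each string (the behaviour described as buggy).
    public static List<String> read(byte[] data, int count, boolean padTo4) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<String> out = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i < count; i++) {
            if (offset + 4 > data.length) break;
            int cch = buf.getInt(offset);
            offset += 4;
            // A misaligned read yields a nonsense length; stop instead of overrunning.
            if (cch < 1 || cch > data.length - offset) break;
            out.add(new String(data, offset, cch - 1, StandardCharsets.US_ASCII));
            offset += cch;
            if (padTo4) offset = (offset + 3) & ~3;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = pack("Worksheets", "Sheet1");
        System.out.println("unpadded: " + read(data, 2, false));
        System.out.println("padded:   " + read(data, 2, true));
    }
}
```

With this input the first string ends at offset 15. The unpadded reader continues from 15 and recovers both strings, while the padded reader jumps to 16, interprets the wrong four bytes as a length, and gives up after the first string.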
Comment 3 Niklas Rehfeld 2012-01-11 01:39:30 UTC
Created attachment 28134
Diagram of the HeadingPair/DocParts TypedProperty structures

Just thought this might be useful for this bug. It shows some of the structure of the docparts and headingpair properties, which, as far as I have been able to find, are the only ones that use unaligned strings in property sets.

All the info comes straight from MS-OSHARED (and maybe a little bit from MS-OLEPS)

Ignore the green stuff on the left, that was from a project that I'm working on. 

Nik
Comment 4 Yegor Kozlov 2012-02-15 07:53:08 UTC
Your hypothesis seems to be correct. I changed TypedPropertyValue.read(byte[], int) to return the unpadded offset and it fixed the problem. 

The fix has been committed in 1244388

Regards,
Yegor

(In reply to comment #2)
> I had a look around the code, the bug seems to be in 
> 
> TypedPropertyValue.read(byte[], int)
> 
> in the fact that it automatically pads the result, i.e. returns a 'padded'
> offset. This is bad when reading the Heading Pairs vector (and possibly others)
> in the DocumentSummaryInformation stream, as they use *unpadded* strings of the
> type UnalignedLpstr
> (http://msdn.microsoft.com/en-us/library/dd950621%28v=office.12%29.aspx).
> I hope that this is the same bug, and not completely unrelated. 
> 
> Nik