Bug 49020

Summary: "org.xml.sax.SAXParseException: </b> does not close tag <br>." when opening some Excel 2007 files
Product: POI
Reporter: Paul Spencer <paulsp>
Component: XSSF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: regression    
Priority: P1    
Version: 3.6-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Spreadsheet contains one button with a multi-line title

Description Paul Spencer 2010-03-29 21:10:28 UTC
I am getting the exception quoted in the summary when reading some .xlsm files
into POI v3.6 with WorkbookFactory.create(fileInputStream). The files open without
error in Excel 2007 and OpenOffice 2.3. I have other .xlsm files that
work as expected in POI v3.6.

The source of the error is in xl/drawings/vmlDrawing1.vml. A user-created button has a two-line title. The lines are separated by a <br> that is never closed, which is the reason for the exception. Below is an excerpt from vmlDrawing1.vml:

<v:shape id="_x0000_s2060" type="#_x0000_t201" style='position:absolute;
margin-left:554.25pt;margin-top:78.75pt;width:205.5pt;height:50.25pt;
z-index:1;mso-wrap-style:tight' o:button="t" fillcolor="buttonFace [67]"
strokecolor="windowText [64]" o:insetmode="auto">
<v:fill color2="buttonFace [67]" o:detectmouseclick="t"/>
<o:lock v:ext="edit" rotation="t"/>
<v:textbox o:singleclick="f">
 <div style='text-align:center'><font face="Arial" size="280" color="10"><b>Print
 Entire <br>
    Data Set</b></font></div>
</v:textbox>
<x:ClientData ObjectType="Button">
 <x:Anchor>
  10, 8, 5, 1, 13, 66, 7, 36</x:Anchor>
 <x:PrintObject>False</x:PrintObject>
 <x:AutoFill>False</x:AutoFill>
 <x:FmlaMacro>[0]!Module1.print_entire_data_set</x:FmlaMacro>
 <x:TextHAlign>Center</x:TextHAlign>
 <x:TextVAlign>Center</x:TextVAlign>
</x:ClientData>
</v:shape>


FYI: I downgraded POI to 3.5-FINAL and the workbook loaded without errors.
Comment 1 Paul Spencer 2010-03-30 18:00:42 UTC
Created attachment 25214
Spreadsheet contains one button with a multi-line title
Comment 2 Paul Spencer 2010-03-30 18:05:46 UTC
I updated the priority to P1 because the bug prevents the use of version 3.6 and because it is triggered by the normal use of non-XML-compliant HTML tags in the workbook.
Comment 3 Nick Burch 2010-03-31 11:14:53 UTC
The bug is really with Excel here: it has generated a file with invalid XML. The xlsx file format is defined as being made up of XML subparts, and the XML spec is very strict about matching tags.

For the long term, you should report a bug to Microsoft about this. They either need to sanitise the user input and sort out the tags (e.g. <br> becomes <br />), or they need to give up and escape the whole tag contents for the bits where iffy data could get added (e.g. put this textbox within a CDATA section).
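To make the difference concrete, here is a small sketch that checks three variants of the button title with the JDK's built-in SAX parser: the unclosed <br> as Excel wrote it, the self-closed form, and the CDATA-wrapped form. WellFormedCheck, isWellFormed and the <t> wrapper element are hypothetical names for illustration only, and the fragments are simplified from the attached file. The exact exception message depends on the parser in use, so it will not necessarily match the one in the summary.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of POI.
final class WellFormedCheck {

    /** Returns true if the fragment parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Raised for mismatched tags such as <b>...<br>...</b>
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // As written by Excel: <br> is never closed, so </b> does not match.
        System.out.println(isWellFormed("<b>Print Entire <br> Data Set</b>"));   // false
        // Option 1: self-closing tag.
        System.out.println(isWellFormed("<b>Print Entire <br /> Data Set</b>")); // true
        // Option 2: the markup is escaped as character data.
        System.out.println(isWellFormed(
                "<t><![CDATA[<b>Print Entire <br> Data Set</b>]]></t>"));        // true
    }
}
```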

Short term, you could just comment out the code that reads in the vmlDrawing section of the file, and ensure that you don't touch the drawing records.

Medium term, we should get a list of the things that Excel gets wrong, such as <br> (and perhaps others). Then, we need to write an XML input wrapper that cleans these up before they get passed to the XML processor for loading. Something like this is quite nasty, though it's possible some other project out there has already done it, and we can just re-use what they do.
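A cleanup wrapper of the kind described above might look roughly like the following. This is only a sketch under stated assumptions, not POI code: BrFixer, fixUnclosedBr and wrap are hypothetical names, it handles only <br>, it buffers the whole part in memory instead of streaming, and it assumes an ASCII-compatible encoding such as UTF-8. It would also rewrite a <br> that appears inside a comment or CDATA section.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch for illustration; not part of POI.
final class BrFixer {

    /** Rewrites unclosed <br> tags as self-closing; <br/> and <br /> are left alone. */
    static String fixUnclosedBr(String xml) {
        // (?i) keeps the original case of the tag name via the captured group.
        return xml.replaceAll("(?i)<(br)\\s*>", "<$1/>");
    }

    /** Buffers the whole stream, applies the fix, and returns a new stream. */
    static InputStream wrap(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        // ISO-8859-1 maps each byte to exactly one char and back,
        // so bytes outside the ASCII range pass through unchanged.
        String text = new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
        return new ByteArrayInputStream(
                fixUnclosedBr(text).getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

The stream returned by wrap(...) would then be handed to the XML parser in place of the raw part stream.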
Comment 4 Nick Burch 2010-05-05 13:51:23 UTC
EvilUnclosedBRFixingInputStream added in r941399.

It's a terrifying, sick workaround... but it does allow your file to be loaded.

The proper fix is to get Microsoft to make Excel output valid XML, though!