Bug 55292

Summary: XSSFTextBox enhancements
Product: POI Reporter: Darren Roberts <robertsdjguard-poidev>
Component: XSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: sanewell92
Priority: P2    
Version: 3.10-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Patch which implements extended functionality for XSSFTextBox and unit tests for major features
Xlsx test file for the unit tests
Replacement patch which tidies up some code and removes redundant style setting in default shape constructor.
Replacement patch with small mods
Replacement patch which includes bullet properties and associated unit test
testAddBulletParagraphs result
Replacement patch which fixes invalid content in unit test output
Replacement patch which implements getText() auto-numbering according to Excel rules
final (?) patch

Description Darren Roberts 2013-07-22 14:03:53 UTC
XSSFTextBox is currently very limited functionality wise compared to TextBoxes in HSSF and XSLF. For example it currently only allows you to set a single paragraph of text (when multiples should be allowed) and provides no functionality for manipulating that paragraph without manually amending the internal structure.

I am implementing enhancements to address this and will upload patches when available.
Comment 1 Darren Roberts 2013-07-23 13:24:10 UTC
Created attachment 30617 [details]
Patch which implements extended functionality for XSSFTextBox and unit tests for major features

First pass at the enhanced XSSFTextBox functionaility.

Note that the new functionlaity does not currently support bullets or hyperlinks. I will look at adding these later. Unit tests have been implemented in TestXSSFDrawing for the major features (text setting/extraction, font and color setting and retrieval). I will look at implementing additional unit tests in time.
Comment 2 Darren Roberts 2013-07-23 13:25:52 UTC
Created attachment 30618 [details]
Xlsx test file for the unit tests

Lives in test-data/spreadsheet folder.
Comment 3 Tim Allison 2013-07-23 14:39:38 UTC
Darren,
  Thank you very much for submitting this.  If an experienced committer is willing to back me up, I'd like to take this as my first review+commit.  I should have feedback by tomorrow.
  This should solve (https://issues.apache.org/jira/browse/TIKA-1100).
Comment 4 Darren Roberts 2013-07-23 15:06:29 UTC
Great, thanks Tim.

Through further testing in my actual project I have identified a number of other small things that need to be addressed - for example the defaults for a new XSSFSimpleShape set the background to blue which is at odds with Excel (white), and also if you don't explicitly specify the text color then I'm getting it coming through as white by default although I seem to recall seeing this was an issue elsewhere? I'll take a look at them anyway.
Comment 5 Darren Roberts 2013-07-23 20:08:31 UTC
Created attachment 30620 [details]
Replacement patch which tidies up some code and removes redundant style setting in default shape constructor.
Comment 6 Tim Allison 2013-07-26 00:30:56 UTC
I made a few small cosmetic changes.

In XSSFTextParagraph what should this javadoc read:

/**
     * Specifies the particular level text properties that this paragraph will follow.
     * The value for this attribute formats the text according to the corresponding level
     * paragraph properties defined in the SlideMaster.

SlideMaster?
Comment 7 Darren Roberts 2013-07-26 05:32:53 UTC
Ah, a remnant from the presentation side. I think this function could be removed at the moment as I believe from the docs that level only has any relevance when it comes to bullets (to determine the style of bullet to use at a given indentation level), and since this enhancement doesn't support bullet properties yet it's a moot point. 

If you could upload your modified patch so I can apply your changes to my copy, I can probably find a few hours over the weekend to look at incorporating the missing bullet functionality.
Comment 8 Tim Allison 2013-07-26 14:23:28 UTC
Created attachment 30632 [details]
Replacement patch with small mods

Only a few cosmetic changes.  The initial patch was in great shape!

Let me know which you'd prefer:
Plan A: You add back in the bullets.  I review and then allow a few days for the community to review it, and then I commit.
Plan B: We leave until Monday for the community to review the current patch.  I commit that and then you submit a separate issue.

Thank you, again, for contributing this.
Comment 9 Darren Roberts 2013-07-26 16:30:17 UTC
No problem, I needed some of it's capabilities for my own project which was my primary driver!

I'm easy, obviously I don't want to hold up any release. Probably the best thing to do is see how I get on over the weekend - naturally I'll merge your patch with my source so I retain your modifications and then if I get round to finishing the bullets I'll upload a new patch for review.

If not, commit your patch and when the new release is done I'll just have to update and merge my source, generate a new patch and attach it to a new issue for review.
Comment 10 Darren Roberts 2013-07-28 16:14:19 UTC
Created attachment 30637 [details]
Replacement patch which includes bullet properties and associated unit test

This patch has the bullet properties included and a unit test for that functionality. It includes all mods done in attachments 30632.

I have also added a method to XSSFSimpleShape to add a new paragraph from a string (alongside the rich text string it already supported), and updated the XSSFSimpleShape getText method to replicate bullets and autonumbering, along with tabbed indentation for each level. Note that these do not replicate the format of the bullets, the getText method is just a quick way to get all the text in the shape in a semi-representative plain text format. If the developer wants an exact representation they should access the list of paragraphs and manually construct it from them.
Comment 11 Tim Allison 2013-07-29 13:26:02 UTC
Created attachment 30646 [details]
testAddBulletParagraphs result

Excel is throwing an "unreadable content; want to recover" message when I try to open the wb created in testAddBulletParagraphs.  Excel is able to recover, but can you tell what's causing this?  To generate this attachment, I added the following at the end of testAddBulletParagraphs:

 try{
  OutputStream os = new FileOutputStream(new File("C:/data/testbullets.xlsx"));
  wb.write(os);
  os.flush();
  os.close();
 } catch (IOException e){
  e.printStackTrace();
 }
Comment 12 Darren Roberts 2013-07-29 14:38:04 UTC
(In reply to Tim Allison from comment #11)
> Created attachment 30646 [details]
> testAddBulletParagraphs result
> 
> Excel is throwing an "unreadable content; want to recover" message when I
> try to open the wb created in testAddBulletParagraphs.  Excel is able to
> recover, but can you tell what's causing this?  To generate this attachment,
> I added the following at the end of testAddBulletParagraphs:
> 
>  try{
>   OutputStream os = new FileOutputStream(new
> File("C:/data/testbullets.xlsx"));
>   wb.write(os);
>   os.flush();
>   os.close();
>  } catch (IOException e){
>   e.printStackTrace();
>  }

Ah, I have been able to figure out what it is that Excel doesn't like - by default in the code when you set a paragraph as a bullet using setBullet(true) it sets a character to use as the bullet, but Excel doesn't like this being defined if you then set the bullet to use auto-numbering. Remove the buChar from the XML for the auto-number bullets and it loads without issue (almost, the arbitrary size text box actually needs to be bigger).

Interestingly the spec makes no mention of the buAutoNum and buChar elements being exclusive, so presumably it is an Excel-ism. LibreOffice Calc loads without complaint, but doesn't appear to support bullets in the textbox (just displays them as normal paragraphs) so that doesn't help as far as testing goes.

So two options, I can either remove the automatic setting of the buChar (in which case Excel does not display anything, it does not substitutes a value), or set it and remove it if you call setBulletAutoNumbering?
Comment 13 Darren Roberts 2013-07-29 15:11:34 UTC
Created attachment 30647 [details]
Replacement patch which fixes invalid content in unit test output

Added checks to remove the buChar element if a bullet is set as auto-number which should solve the problem of invalid content when opened with Excel. 

Also fixed unit-test TextBox size so it should now fully contain the content.
Comment 14 Tim Allison 2013-07-29 15:26:56 UTC
Would overriding setBullet like this work:

setBullet(boolean flag)
setBullet(ListAutoNumber scheme)
setBullet(ListAutoNumber scheme, int startAt)


(In reply to Darren Roberts from comment #13)
> Created attachment 30647 [details]
> Replacement patch which fixes invalid content in unit test output
> 
> Added checks to remove the buChar element if a bullet is set as auto-number
> which should solve the problem of invalid content when opened with Excel. 
> 
> Also fixed unit-test TextBox size so it should now fully contain the content.
Comment 15 Tim Allison 2013-07-29 17:21:41 UTC
So close.  Thank you for bearing with me.  The numbering in the created xlsx (as interpreted by Excel) differs from the test case:

3. Second bullet (level 1)
1. Third bullet (level 1)

Is there an easy fix?
Comment 16 Darren Roberts 2013-07-29 18:34:00 UTC
Not really as there doesn't appear to be any documentation regarding how Excel treats auto-numbering. The OOXML specification for the startAt attribute only say this:

Specifies the number that starts a given sequence of automatically numbered bullets. When the numbering is alphabetical, the number should map to the appropriate letter. For instance 1 maps to 'a', 2 to 'b' and so on. If the numbers are larger than 26, then multiple letters should be used. For instance 27 should be represented as 'aa' and similarly 53 should be 'aaa'.

Even the MS docs here say nothing more than the specification:
http://msdn.microsoft.com/en-us/library/documentformat.openxml.drawing.autonumberedbullet%28v=office.14%29.aspx

At the weekend I did a file in Excel that had two bullets at level 1 and on the first bullet I set the startAt as 5. This got me 5 and 6 in Excel which is what you'd expect. When I took a look at the drawing.xml file what it actually had was startAt set on both buAutoNum elements as 5. To my mind, and by the above that means both should be 5. That's the way I implemented the simple shape getText method - if it has startAt set then it sets the starting number for the level as that and increments from there for each following bullet. If the second bullet also has startAt as 5 it will use 5.

I wonder if Excel actually uses the startAt as a linking value, so if all the buAutoNum elements at a given level have startAt as say 5 then they are considered as a releated group and so the first element uses 5, and the rest are incremented. 

If anyone can point me to documentation on how Excel calculates the auto number then that would be great, otherwise I think we're in best guess territory.

With regards to your previous reply I did respond but it doesn't appear to have gone through. I'm leaning towards overriding setBullet as it seems cleaner, although it does little to actually simplify the problem.
Comment 17 Tim Allison 2013-07-29 19:24:28 UTC
Ah, ok. Would it make sense to leave out the setBulletAutoNumber functionality for now?  Someone can add that functionality on a later patch? 

I did get your message about overloading setBullet().  I tried the no-thinking process of just replacing setBulletAutoNumber with setBullet(...), and I wasn't getting any numbers out in your test case.
Comment 18 Darren Roberts 2013-07-29 19:44:35 UTC
But what happens when someone loads a file containing numbered bullets in a TextBox? The other way to look at it is as I said in an earlier response the getText method is a simple way to get the basic text out, for anything more complicated the developer would need to parse the paragraphs manually which means they have the headache of working this out for themselves anyway.

I see no reason to exclude setBulletAutoNumber as that caters to the dev's like me who are creating xlsx files with POI more than reading them. 

But since the current missing piece of information is how Excel calculates the numbers, perhaps in the getText method instead of calulating (guessing) what the number should be then we should just output something identifiable like "[autonum] bullet text" or "? bullet text". That way we've informed the dev that there should be a number or something there, but it's up to them how to handle it?

It would be so much easier if the schema just stored the value Excel assigned the bullet, even if it was just a hint... <sigh>
Comment 19 Darren Roberts 2013-07-29 19:59:10 UTC
(In reply to Darren Roberts from comment #18)
> But what happens when someone loads a file containing numbered bullets in a
> TextBox? The other way to look at it is as I said in an earlier response the
> getText method is a simple way to get the basic text out, for anything more
> complicated the developer would need to parse the paragraphs manually which
> means they have the headache of working this out for themselves anyway.
> 
> I see no reason to exclude setBulletAutoNumber as that caters to the dev's
> like me who are creating xlsx files with POI more than reading them. 
> 
> But since the current missing piece of information is how Excel calculates
> the numbers, perhaps in the getText method instead of calulating (guessing)
> what the number should be then we should just output something identifiable
> like "[autonum] bullet text" or "? bullet text". That way we've informed the
> dev that there should be a number or something there, but it's up to them
> how to handle it?
> 
> It would be so much easier if the schema just stored the value Excel
> assigned the bullet, even if it was just a hint... <sigh>

Actually, I've just found something online which gives a little bit more information regarding rules for calculating the numbers which actually makes sense of the two startAt="5" elements I saw. I'll try modifying the getText method to use these rules and see if that gets us any closer to Excel.
Comment 20 Nick Burch 2013-07-29 21:04:59 UTC
(In reply to Darren Roberts from comment #16)
> Even the MS docs here say nothing more than the specification:
> http://msdn.microsoft.com/en-us/library/documentformat.openxml.drawing.
> autonumberedbullet%28v=office.14%29.aspx

If the Microsoft documentation is missing something, the first place is to ask on the relevant Open Specifications Community forum for the document, see <http://www.microsoft.com/openspecifications/en/us/applied-interoperability/support/default.aspx> for the links to the forums

If that doesn't get an answer in a sensible amount of time, let me know and I'll give you the email address of the docs team and you can ask them directly. (They do check the forums though, so you might well get an answer from there) The Interoperability Specification Team at Microsoft are very keen for their documentation to be complete and useful, so they do want to know if there is a problem or noticable gap in the docs they publish!
Comment 21 Darren Roberts 2013-07-30 19:29:52 UTC
Created attachment 30651 [details]
Replacement patch which implements getText() auto-numbering according to Excel rules

I've implemented the auto-numbering rules I found in a document on the openxmldeveloper.org website and the getText() auto-numbering does now tally now with what I see in Excel 2010.

Also expanded XSSFSimpleShape getText() so that auto-numbered bullets will now appear according to their type/scheme - so you will get alpha (a,b,c), arabic (1,2,3) or roman (i, ii, iv) along with the various options for brackets and periods. In the case of the wingdings scheme I use the default bullet character since we getText returns a string rather than a rich text string.

Normal bullets will also use the bullet character if it has been explicitly set in the getText() output. 

I think this is just about as complete as I can make this function (even if the code isn't pretty). I have contemplated a method on the XSSFTextParagraph which would calculate that paragraphs auto-number (for dev's manually parsing the paragraphs) but I think that can wait and be implemented as a seperate patch.
Comment 22 Darren Roberts 2013-07-30 19:30:54 UTC
> I think this is just about as complete as I can make this function (even if
> the code isn't pretty). I have contemplated a method on the
> XSSFTextParagraph which would calculate that paragraphs auto-number (for
> dev's manually parsing the paragraphs) but I think that can wait and be
> implemented as a seperate patch.

Almost forgot, I also implemented the setBullet override.
Comment 23 Tim Allison 2013-07-31 00:36:11 UTC
Created attachment 30652 [details]
final (?) patch

Wow and wow.  This is fantastic.  I made one small change to your code and added one test case.
Comment 24 Tim Allison 2013-07-31 00:38:24 UTC
Committed as of r1508692.  Thank you, Darren!
Comment 25 Dominik Stadler 2015-09-07 20:09:26 UTC
*** Bug 54969 has been marked as a duplicate of this bug. ***