Nick, it seems there is a bug with stylings. I'm implementing TextBox shape and get corrupted files. Description: If change text via TextRun.setText and save the resulting ppt the output has correct structure but when opening it in PowerPoint it looks corrupted. Test case: public void testChangeText() throws Exception { String dirname = System.getProperty("HSLF.testdata.path"); String filename = dirname + "/with_textbox.ppt"; HSLFSlideShow hss = new HSLFSlideShow(filename); SlideShow ppt = new SlideShow(hss); Slide sl = ppt.getSlides()[0]; TextRun[] txt = sl.getTextRuns(); for (int i = 0; i < txt.length; i++) { String text = txt[i].getText(); //everything should remain the same. Existing data structures must be preserved. txt[i].setText(text); } FileOutputStream out = new FileOutputStream("text_poi.ppt"); ppt.write(out); out.close(); //open it and see that it is corrupted } If you compare the original file and the re-saved one you will see thast the differences are only in StyleTextPropAtom. For example, for "Plain Text" string: before after paragraph.length 12 11 characters.length 12 11 for "This is Times New Roman": before after paragraph.length 24 23 characters.length 24 23 That is the length of text in stylings is always text.length while it should be text.length+1. If text consists of multiple RichTextRun object then the total size of all text runs must be rawtext.length+1. Quick fix: Lines 277 and 288 in TextRun.setText should be changed as follows: pCol.updateTextSize(s.length()+1); cCol.updateTextSize(s.length()+1); With this fix everything works right. The re-saved ppt have the same structure as the original file. Regards, Yegor
Created attachment 18012 [details] Code which fixes the bug
Hi Nick, I got why the sum of lengths must be rawtext.length + 1. Extra character controls the format of new text which is appended to the end. For example, consider a string "Hello, World!" with 2 format runs: <b>Hello,</b><i> World!</i> The first part is bold, the second is italic. Total length of the string is 13 so the sum of lengths must be 13 + 1 = 14. Extra character must be included either in <b>Hello,</b> or to <i> World!</i> fragment. In first case if you edit the text in PowerPoint and append something to the end it will be bold, in the second case - italic. Sounds tricky. Isn't it? I think we should keep it simple and automatically add the extra character to the last format run. Any objections? Regards, Yegor
Ah, that's looking better. I was a bit worried about randomly just adding one more onto the last length! I'll probably run some tests on several files, to make sure that adding 1 to the last text run is the right thing to do (it sounds right to me, but you never know with PPT!) and then commit it. Cheers for figuring this out :)
We also need a tweak to the last RichTextRun, to make that 1 longer too I've committed a fix that covers both, along with some tests to ensure we don't break things on re-writes.
(In reply to comment #4) > We also need a tweak to the last RichTextRun, to make that 1 longer too > > I've committed a fix that covers both, along with some tests to ensure we don't > break things on re-writes. Hi Nick, I was writing this post when you commited the fix. It looks like the problem is fixed. Below is an extract from what I was going to post, just to repeat it once more: The sum of length of all text runs must equal to total length of the text plus 1. The last text run always contains an extra character which defines styling of a new text which is appended to the end. This rule works both for character and paragraph styles. I.e. the sum of length of all character styles must be equal to rawtext.length +1 and the sum of length of all paragraph styles must be equal to rawtext.length +1. It makes sense to reflect it in the documentation. Text styling is one of the most weird things in PPT. Even if you look at the working code it's extremely hard to figure out why it has the extra character at the end. Best Regards, Yegor
Ah, I'd been wondering about why there was that extra 1 character on the end. I'm going to need to do another review of the code, to make sure when we build up the RichTextRuns, we're properly taking account of this extra character. When it comes to writing it back out, I want to get my head straight on how it interplays with both the reading in, and paragraph stylings (not all the final paragraphs seem to have this extra char, so some are 1 shorter than the char one). I have a feeling that the current code isn't always putting an extra 1 onto the paragraph styling length, so I want to know if we do need it or not, and then get it consistent. When I've got that all sorted, I can try to write some comments that make sense to others!
OK, the latest re-attack of the RichTextCode should hopefully have this completely correctly sorted. As a bonus, it's also got it documented, and hopefully much easier to see why we're doing it :)