There is a bug in StyleTextPropAtom which prevents me from fixing the issue with slides ordering. How to reproduce: 1. Download http://www.gersteinlab.org/lectures/ppt/A2K-at-Yale-20060422.ppt 2. Try to construct the slide show: SlideShow ppt = new SlideShow(new HSLFSlideShow("A2K-at-Yale-20060422.ppt")); and get the error: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.LinkedList.entry(LinkedList.java:356) at java.util.LinkedList.get(LinkedList.java:299) at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:110) at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:67) at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:126) at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:88) at org.apache.poi.hslf.model.Slide.<init>(Slide.java:66) at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:394) Before stack trace I see the following warnings: No core record found with ID 3 based on PersistPtr lookup No core record found with ID 5 based on PersistPtr lookup No core record found with ID 6 based on PersistPtr lookup No core record found with ID 7 based on PersistPtr lookup No core record found with ID 8 based on PersistPtr lookup No core record found with ID 9 based on PersistPtr lookup No core record found with ID 10 based on PersistPtr lookup No core record found with ID 11 based on PersistPtr lookup No core record found with ID 12 based on PersistPtr lookup No core record found with ID 13 based on PersistPtr lookup No core record found with ID 22 based on PersistPtr lookup No core record found with ID 44 based on PersistPtr lookup No core record found with ID 45 based on PersistPtr lookup Paragraph style ran out before character style did! Short by 327 characters. Calling RichTextRun functions is likely to break things - see Bug #38544 Regards, Yegor Kozlov
Created attachment 18362 [details] Patch with the fix
Created attachment 18364 [details] return slides in the correct order
Created attachment 18365 [details] null-pointer fixed
Nick, (1) I had to comment a line in StyleTextPropAtom.setParentTextSize: // Handle extra 1 char styles at the end if(pos < rawContents.length && textHandled == size) { //size++; //do we need this code? } It results in errors as described above. Do you have examples when this code works properly and we need it? All test cases pass, I can add text to a slide and see it in PPT. (2) Finally I found how to get slides in the correct order: the order of slides is defined by the order in which slides are listed in SlideListWithText. I'm 99% sure it is so. I did lots of tests and this rule seems OK. BTW: There are potential bugs in SlideShow.buildSlidesAndNotes. To get slides and atoms we need only the second SlideListWithText, not all. There can be several SlideListWithText containers: - SlideListWithText[0]: always present. Contains MainMaster and TitleMaster objects. Since PowerPoint 2003 there can be multiple masters and this is the place where their references are stored. - SlideListWithText[1]: Always exists if a ppt contains at least one slide. Stores slides and slide atoms - SlideListWithText[2] (if any): Stores other objects like notes, handouts, etc. Regards, Yegor
> BTW: There are potential bugs in SlideShow.buildSlidesAndNotes. > To get slides and atoms we need only the second SlideListWithText, not all. Is that always the case on files that've been saved multiple times? I have a nasty feeling that the SLWTs get pretty messed up in those cases, and you can get quite a few of them (I'm all for making it simpler, I'm just worried that it might not be quite that simple!) I'm thinking we should probably do a simpler version of SLWTTextListing, to report what it does find in the SlideListWithTexts. We can then run that over a large range of documents, and check our assumptions are correct? Nick
> Is that always the case on files that've been saved multiple times? I have a > nasty feeling that the SLWTs get pretty messed up in those cases, and you can > get quite a few of them No, as far as I know slides are always stored in the second SLWT. If we add a new slide we must add SlidePersistAtom to the second SLWT, not to the last one. Ops, I wrote it and understood that the current code adds new slides to the last SLWT and this is wrong :). I attached a sample ppt which demonstrates the problem. Title Masters are always stored as Slide records, not as MainMaster. Current algorithm recognizes them as slides and this is wrong. Check what SlideShow.getSlides() returns and see extra records. I also prepared a picture which describes what I know about SLWTs in PowerPoint. > (I'm all for making it simpler, I'm just worried that it might not be quite that > simple!) I agree, we can simplify and improve it. We know much more about PPT format than a year ago, it is a chance to use this knowledge and refactor the code. Regards, Yegor
Created attachment 18395 [details] A sample ppt with two masters and two title masters
Created attachment 18396 [details] SlideListWithText hierarchy
Can you cast hslf.dev.SLWTListing over a bunch of your test files? I've been finding that the 2nd SLWT does seem to hold the slide's text, as your research indicates. I've yet to find a document where that doesn't hold true. The first one seems to hold one or more SlidePersistAtoms, but never any other records. The third one (if present) seems to also hold one or more SlidePersistAtoms. If your testing with dev.SLWTListing concurs, then I'll go ahead and make the appropriate changes. (I just want to be sure we've correctly understood before committing!)
The recent batch of changes should've fixed the TextRun bug. I'll keep this bug open until we have the new slide ordering stuff committed though.
Created attachment 18477 [details] sample ppt which does not hold text atoms in the second SLWT
Hi Nick >>Can you cast hslf.dev.SLWTListing over a bunch of your test files? I have a similar utility but it dumps the content in XML. For me it is easier to analyze the structure of records if I work with a tree, not with plain text. >>I've been finding that the 2nd SLWT does seem to hold the slide's text, as your >>research indicates. I've yet to find a document where that doesn't hold true. The 2nd SLWT MAY not contain text, for example, if the slides don't have text placeholders. See the attached sample. Is it what you were looking for? I ran hslf.dev.SLWTListing over a set of ppts and what I saw confirms my suppositions about the order of slides. So I suggest to commit this fix. To tell the truth I don't like how current SlideSHow.buildSlidesAndNotes is organized. Many things are related to each other: getting slides in the correct order, adding a new slide, saving ppt and current code looks a bit clumsy to do all that. It works fine but it is hard to support. In future we are going to deal with master slides, reorder slides, do lots of other tricks and this code is a good candidate to be refactored. Below I will try to summarize what I know about SLWT: (1) The 1st SLWT contains one or more SlidePersistAtoms. I don't know if it can hold other records, so far I didn't see such ppt. Each SlidePersistAtom refers to either a org.apache.poi.hslf.record.MainMaster (not exists yet) or to a org.apache.poi.hslf.record.Slide. org.apache.poi.hslf.record.MainMaster describes slide master, org.apache.poi.hslf.record.Slide describes title master. SlidePersistAtom.slideId is always negative in the 1st SLWT. Example in XML notation: <SlideListWithText> <SlidePersistAtom> <psrReference>1193</psrReference> // either MainMaster or Slide <slideId>-2147483632</slideId> // always negative </SlidePersistAtom> <SlidePersistAtom> <psrReference>1194</psrReference> <slideId>-2147483631</slideId> <Reserved>0</Reserved> </SlidePersistAtom> </SlideListWithText> (2) The 2nd SLWT contains one or more SlidePersistAtoms followed by atoms of text placeholders (if a slide has any). If there is not the 2nd SLWT it means that the presentation does not contain slides, i.e. it is empty. The order of slides is defined by the order of SPA in the second SLWT. Example in XML notation: <SlideListWithText> //the first slide <SlidePersistAtom> //there must be one SPA per each slide <psrReference>993</psrReference> // always org.apache.poi.hslf.record.Slide <slideId>477</slideId> // slideId is positive for slides <numberTexts>1</numberTexts> //SPA knows about the number of subsequent texts </SlidePersistAtom> //below follow the atoms of text placeholders <TextHeaderAtom type="0"/> <TextBytesAtom>Text in a Title Placeholder</TextBytesAtom> <StyleTextPropAtom/> //optional //the second slide <SlidePersistAtom> <psrReference>1191</psrReference> <slideId>533</slideId> </SlidePersistAtom> </SlideListWithText> Note, the actual number of slides is calculated as DocumentAtom.firstSlideNum + idx where idx is the 0-based index of the SPA. You can play with this setting in the PageSetup dialog. (3) The 3rd SLWT contains one or more SlidePersistAtoms wich refer to org.apache.poi.hslf.record.Notes. If the 3rd SLWT is missing it means that the ppt does not contain notes. Example in XML notation: <SlideListWithText> <SlidePersistAtom> <psrReference>994</psrReference> // always org.apache.poi.hslf.record.Notes <slideId>316</slideId> </SlidePersistAtom> <SlidePersistAtom> <psrReference>995</psrReference> <slideId>317</slideId> </SlidePersistAtom> </SlideListWithText> Regards, Yegor
OK, based on your discoveries, I've gone ahead and changed how we add new slides, and how we build the slides+notes Hopefully, I've correctly implemented what you discovered :)
very good. I'm going to add Slide.getTitle() with will return title for a slide. With this method it will be much easier to write test cases to control the order of slides. If PMC doesn't create my account till weekend I will preepare the patch. Yegor
(In reply to comment #14) > I'm going to add Slide.getTitle() with will return title for a slide. With this > method it will be much easier to write test cases to control the order of slides. Sounds good. We should check the type from the TextHeaderAtom, so we can be sure we get the right thing (even if the the first text run isn't the header, for some reason) I already have a slide ordering test, but your idea sounds like a good addition :)