Bug 39547

Summary: HSLF: IndexOutOfBoundsException in TextRun constructor
Product: POI Reporter: Yegor Kozlov <yegor>
Component: HSLFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: other   
Attachments: Patch with the fix
return slides in the correct order
null-pointer fixed
A sample ppt with two masters and two title masters
SlideListWithText hierarchy
sample ppt which does not hold text atoms in the second SLWT

Description Yegor Kozlov 2006-05-10 16:25:06 UTC
There is a bug in StyleTextPropAtom which prevents me from fixing the issue with
slides ordering.

How to reproduce:

1. Download http://www.gersteinlab.org/lectures/ppt/A2K-at-Yale-20060422.ppt

2. Try to construct the slide show:

    SlideShow ppt = new SlideShow(new HSLFSlideShow("A2K-at-Yale-20060422.ppt"));

and get the error:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
	at java.util.LinkedList.entry(LinkedList.java:356)
	at java.util.LinkedList.get(LinkedList.java:299)
	at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:110)
	at org.apache.poi.hslf.model.TextRun.<init>(TextRun.java:67)
	at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:126)
	at org.apache.poi.hslf.model.Sheet.findTextRuns(Sheet.java:88)
	at org.apache.poi.hslf.model.Slide.<init>(Slide.java:66)
	at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:394)


Before stack trace I see the following warnings: 

No core record found with ID 3 based on PersistPtr lookup
No core record found with ID 5 based on PersistPtr lookup
No core record found with ID 6 based on PersistPtr lookup
No core record found with ID 7 based on PersistPtr lookup
No core record found with ID 8 based on PersistPtr lookup
No core record found with ID 9 based on PersistPtr lookup
No core record found with ID 10 based on PersistPtr lookup
No core record found with ID 11 based on PersistPtr lookup
No core record found with ID 12 based on PersistPtr lookup
No core record found with ID 13 based on PersistPtr lookup
No core record found with ID 22 based on PersistPtr lookup
No core record found with ID 44 based on PersistPtr lookup
No core record found with ID 45 based on PersistPtr lookup
Paragraph style ran out before character style did! Short by 327 characters.
Calling RichTextRun functions is likely to break things - see Bug #38544

Regards, Yegor Kozlov
Comment 1 Yegor Kozlov 2006-05-29 15:24:17 UTC
Created attachment 18362 [details]
Patch with the fix
Comment 2 Yegor Kozlov 2006-05-29 15:27:45 UTC
Created attachment 18364 [details]
return slides in the correct order
Comment 3 Yegor Kozlov 2006-05-29 15:29:06 UTC
Created attachment 18365 [details]
null-pointer fixed
Comment 4 Yegor Kozlov 2006-05-29 15:42:45 UTC
Nick,

(1) I had to comment a line in StyleTextPropAtom.setParentTextSize:

                        // Handle extra 1 char styles at the end
                        if(pos < rawContents.length && textHandled == size) {
                                //size++; //do we need this code? 
                        }

It results in errors as described above. Do you have examples when this code
works properly and we need it?
All test cases pass, I can add text to a slide and see it in PPT. 

(2)  Finally I found how to get slides in the correct order: the order of slides
is defined by
the order in which slides are listed in SlideListWithText.  I'm 99% sure it is so.
I did lots of tests and this rule seems OK. 
BTW: There are potential bugs in SlideShow.buildSlidesAndNotes.
To get slides and atoms we need only the second SlideListWithText, not all.

There can be several SlideListWithText containers:
 - SlideListWithText[0]:  always present. Contains MainMaster and TitleMaster
objects.
 Since PowerPoint 2003 there can be multiple masters and this is the place where
their references are stored.
- SlideListWithText[1]:  Always exists if a ppt contains at least one slide. 
Stores slides and slide atoms
- SlideListWithText[2] (if any): Stores other objects like notes, handouts, etc.

Regards, Yegor
Comment 5 Nick Burch 2006-06-01 17:36:35 UTC
> BTW: There are potential bugs in SlideShow.buildSlidesAndNotes.
> To get slides and atoms we need only the second SlideListWithText, not all.

Is that always the case on files that've been saved multiple times? I have a
nasty feeling that the SLWTs get pretty messed up in those cases, and you can
get quite a few of them

(I'm all for making it simpler, I'm just worried that it might not be quite that
simple!)

I'm thinking we should probably do a simpler version of SLWTTextListing, to
report what it does find in the SlideListWithTexts. We can then run that over a
large range of documents, and check our assumptions are correct?

Nick
Comment 6 Yegor Kozlov 2006-06-02 08:14:15 UTC
> Is that always the case on files that've been saved multiple times? I have a
> nasty feeling that the SLWTs get pretty messed up in those cases, and you can
> get quite a few of them

No, as far as I know slides are always stored in the second SLWT. If
we add a new slide we must add SlidePersistAtom to the second SLWT,
not to the last one. Ops, I wrote it and understood that the current
code adds new slides to the last SLWT and this is wrong :).

I attached a sample ppt which demonstrates the problem. Title Masters
are always stored as Slide records, not as MainMaster. Current
algorithm recognizes them as slides and this is wrong.
Check what SlideShow.getSlides() returns and see extra records.

I also prepared a picture which describes what I know about SLWTs in
PowerPoint.

> (I'm all for making it simpler, I'm just worried that it might not be quite that
> simple!)
I agree, we can simplify and improve it. We know much more about PPT
format than a year ago, it is a chance to use this knowledge and refactor the code.

Regards,
Yegor
Comment 7 Yegor Kozlov 2006-06-02 08:15:47 UTC
Created attachment 18395 [details]
A sample ppt with two masters and two title masters
Comment 8 Yegor Kozlov 2006-06-02 08:18:26 UTC
Created attachment 18396 [details]
SlideListWithText hierarchy
Comment 9 Nick Burch 2006-06-12 15:26:16 UTC
Can you cast hslf.dev.SLWTListing over a bunch of your test files?

I've been finding that the 2nd SLWT does seem to hold the slide's text, as your
research indicates. I've yet to find a document where that doesn't hold true.

The first one seems to hold one or more SlidePersistAtoms, but never any other
records.

The third one (if present) seems to also hold one or more SlidePersistAtoms.


If your testing with dev.SLWTListing concurs, then I'll go ahead and make the
appropriate changes. (I just want to be sure we've correctly understood before
committing!)
Comment 10 Nick Burch 2006-06-13 15:01:23 UTC
The recent batch of changes should've fixed the TextRun bug.

I'll keep this bug open until we have the new slide ordering stuff committed though.
Comment 11 Yegor Kozlov 2006-06-16 11:11:57 UTC
Created attachment 18477 [details]
sample ppt which does not hold text atoms in the second SLWT
Comment 12 Yegor Kozlov 2006-06-16 11:12:36 UTC
Hi Nick

>>Can you cast hslf.dev.SLWTListing over a bunch of your test files?

I have a similar utility but it dumps the content in XML.
 For me it is easier to analyze the structure of records if I work with a tree,
not with plain text.

>>I've been finding that the 2nd SLWT does seem to hold the slide's text, as your
>>research indicates. I've yet to find a document where that doesn't hold true.

The 2nd SLWT MAY not contain text, for example, if the slides don't have text
placeholders.
See the attached sample. Is it what you were looking for?

I ran hslf.dev.SLWTListing over a set of ppts and what I saw confirms
my suppositions about the order of slides.
So I suggest to commit this fix. 

To tell the truth I don't like how current SlideSHow.buildSlidesAndNotes is
organized. 
Many things are related to each other: getting slides in the correct order,
adding a new slide, saving ppt
and current code looks a bit clumsy to do all that. It works fine but
it is hard to support. In future we are going to deal with master
slides, reorder slides, do lots of other tricks and this code is a good
candidate to be refactored.

Below I will try to summarize what I know about SLWT:

(1) The 1st SLWT contains one or more SlidePersistAtoms.
I don't know if it can hold other records, so far I didn't see such ppt.
Each SlidePersistAtom refers to either a org.apache.poi.hslf.record.MainMaster
(not exists yet)
 or to a org.apache.poi.hslf.record.Slide. 
org.apache.poi.hslf.record.MainMaster describes slide master,
org.apache.poi.hslf.record.Slide describes title master.

 SlidePersistAtom.slideId is always negative in the 1st SLWT.

 Example in XML notation:

 <SlideListWithText>
   <SlidePersistAtom>
     <psrReference>1193</psrReference>  // either MainMaster or Slide
     <slideId>-2147483632</slideId>     // always negative
   </SlidePersistAtom>
   <SlidePersistAtom>
     <psrReference>1194</psrReference> 
     <slideId>-2147483631</slideId> 
     <Reserved>0</Reserved> 
   </SlidePersistAtom>
 </SlideListWithText>


(2) The 2nd SLWT contains one or more SlidePersistAtoms followed by atoms of
text placeholders (if a slide has any). 
If there is not the 2nd SLWT it means that the presentation does not contain
slides, i.e. it is empty. 

The order of slides is defined by the order of SPA in the second SLWT.

 Example in XML notation:

 <SlideListWithText>

   //the first slide
   <SlidePersistAtom> //there must be one SPA per each slide
     <psrReference>993</psrReference> // always  org.apache.poi.hslf.record.Slide
     <slideId>477</slideId> // slideId is positive for slides
     <numberTexts>1</numberTexts> //SPA knows about the number of subsequent texts
    </SlidePersistAtom>
        
   //below follow the atoms of text placeholders
   <TextHeaderAtom type="0"/> 
   <TextBytesAtom>Text in a Title Placeholder</TextBytesAtom> 
   <StyleTextPropAtom/> //optional

   //the second slide
   <SlidePersistAtom>
     <psrReference>1191</psrReference> 
     <slideId>533</slideId> 
   </SlidePersistAtom>


 </SlideListWithText>

Note, the actual number of slides is calculated as DocumentAtom.firstSlideNum + idx
where idx is the 0-based index of the SPA. You can play with this setting in the
PageSetup dialog.

(3) The 3rd SLWT contains one or more SlidePersistAtoms wich refer to
org.apache.poi.hslf.record.Notes.
If the 3rd SLWT is missing it means that the ppt does not contain notes.

 Example in XML notation:

 <SlideListWithText>
   <SlidePersistAtom>
     <psrReference>994</psrReference>  // always  org.apache.poi.hslf.record.Notes
     <slideId>316</slideId>
   </SlidePersistAtom>
   <SlidePersistAtom>
     <psrReference>995</psrReference>
     <slideId>317</slideId>
   </SlidePersistAtom>
 </SlideListWithText>

Regards, Yegor
Comment 13 Nick Burch 2006-06-27 18:16:53 UTC
OK, based on your discoveries, I've gone ahead and changed how we add new
slides, and how we build the slides+notes

Hopefully, I've correctly implemented what you discovered :)
Comment 14 Yegor Kozlov 2006-06-28 12:30:46 UTC
very good. 

I'm going to add Slide.getTitle() with will return title for a slide. With this
method it will be much easier to write test cases to control the order of slides. 

If PMC doesn't create my account till weekend I will preepare the patch.

Yegor
Comment 15 Nick Burch 2006-06-28 13:07:22 UTC
(In reply to comment #14)
> I'm going to add Slide.getTitle() with will return title for a slide. With this
> method it will be much easier to write test cases to control the order of slides. 

Sounds good. We should check the type from the TextHeaderAtom, so we can be sure
we get the right thing (even if the the first text run isn't the header, for
some reason)

I already have a slide ordering test, but your idea sounds like a good addition :)