Created attachment 27490 [details] Simple PPT with a master slide w/ footer text "Master footer is here" I hit this in Tika https://issues.apache.org/jira/browse/TIKA-712 but it sounds like we need to fix it in POI first. Ideally each slide's master slide would have the elements from its corresponding master slide inlined, but if this is problematic, a fallback would just be to see each master slide once. The attached PPT has a slide whose master slide has "Master footer is here" footer; you can see it rendered when you view this in Office. But when I extract text using Tika, the footer isn't included. This also happens for PPTX; I'll open a separate issue.
Turns out it's actually already supported by HSLF, you just need to set the flag to include master text (off by default)
I think there is still a problem here: with the example PPT I attached, I see boiler-plate text when I run PowerPointExtract (which does set to flag to include master slide text, in its static main method). I see code in HSLF for detecting that a given Shape is a placeholder (MasterSheet.isPlaceholder), so it seems possible we can avoid extracting such text? But I'm not familiar enough with the APIs, eg when Sheet.findTextRuns is invoked for a MasterSlide, how can it get the Shape for each run and then skip its text if it's a placeholder?
MasterSlide.getTextRuns is used in other places and should return all runs including boilerplate ones from placeholders. I think the correct fix would be as follows: The line 224 in PowerPointExtractor.java invokes textRunsToText, but this method can't tell runs from placeholders from normal text. textRunsToText(ret, master.getTextRuns()); It is better to re-write it and iterate over shapes in the master sheet: for(Shape sh : master.getShapes()){ if(sh instanceof TextShape){ if(MasterSheet.isPlaceholder(sh)) { // don't bother about boiler plate text on master sheets continue; } TextShape tsh = (TextShape)sh; String text = tsh.getText(); ret.append(text); if (!text.endsWith("\n")) { ret.append("\n"); } } } Any volunteers to help me with testing? I never worked with text extractors and don't want to occasionally break things. I guess the best would be to apply this fix, build and test from inside Tika. Yegor (In reply to comment #2) > I think there is still a problem here: with the example PPT I > attached, I see boiler-plate text when I run PowerPointExtract (which > does set to flag to include master slide text, in its static main > method). > > I see code in HSLF for detecting that a given Shape is a placeholder > (MasterSheet.isPlaceholder), so it seems possible we can avoid > extracting such text? But I'm not familiar enough with the APIs, eg > when Sheet.findTextRuns is invoked for a MasterSlide, how can it get > the Shape for each run and then skip its text if it's a placeholder? (In reply to comment #2) > I think there is still a problem here: with the example PPT I > attached, I see boiler-plate text when I run PowerPointExtract (which > does set to flag to include master slide text, in its static main > method). > > I see code in HSLF for detecting that a given Shape is a placeholder > (MasterSheet.isPlaceholder), so it seems possible we can avoid > extracting such text? But I'm not familiar enough with the APIs, eg > when Sheet.findTextRuns is invoked for a MasterSlide, how can it get > the Shape for each run and then skip its text if it's a placeholder?
Hi Yegor, That sounds like a great fix! (Iterating over shapes ourselves). I'm happy to test a patch to confirm we're skipping the boiler plate text. Thanks!
In fact you already described exactly what to test, so I'll take your proposed change, test it, and report back! Thanks.
Created attachment 27907 [details] Patch w/ fix, plus test OK the change works well! I see no more placeholder text from the master slide. I worked out a patch (attached), containing the above fix, and fixes to the tests, and some minor improvements to Sheet.java (cutover from Vector -> List<TextRun>). One test failed with this change (TestExtracter.testMasterText), because the master text in the test PPT for that test was in fact placeholder text. So, I replaced that test file (test-data/slideshow/master_text.ppt) with a new PPT, that has non-placeholder text from the master slide, and also added an assert to verify placeholder text doesn't come through. The assert fails on trunk today and passes with the patch, so I think we are good. Thanks Yegor!
Created attachment 27908 [details] PPT for test case (replaces current test-data/slideshow/master_text.ppt).
Hi, I think this patch is ready to go in? It has a patch with a test case... is there a committer who can double check and commit? Thanks!
Sorry for the delay. Patch applied in r1203295 Thanks, Yegor
Awesome, thanks Yegor!