Bug 54771

Summary: Read text from Cover Page, Table of Contents and Bibliography
Product: POI
Reporter: vikas.garg
Component: XWPF
Assignee: POI Developers List <dev>
Severity: enhancement    
Priority: P2    
Version: 3.9-dev   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: rough draft of patch

Description vikas.garg 2013-03-29 19:06:07 UTC
Currently, XWPFWordExtractor.getText() does not read text from the Cover Page, Table of Contents or Bibliography parts of a docx file. Are there any plans to add support for extracting text from these parts? If so, will it be in the next release? Or is there another API available to do this?
Comment 1 Nick Burch 2013-03-30 04:05:44 UTC
Apache Tika might be a better bet: it uses Apache POI internally but pulls out a richer set of text and styling.

Otherwise, please submit a patch to enhance XWPFWordExtractor if it isn't doing everything required!
Comment 2 Tim Allison 2014-06-04 12:17:46 UTC
Vladimir Glina just submitted test docs over on TIKA-1317.  This issue is related to POI-54849, which got most SDTs but apparently didn't capture this case.  I'll try to fix this soon.
Comment 3 Tim Allison 2014-06-11 00:49:00 UTC
Created attachment 31704
rough draft of patch

Rough draft of patch attached.  I need to clean up a few things before I commit (end of the week?).  All feedback welcome.  Thank you!
Comment 4 Nick Burch 2014-06-11 11:38:04 UTC
At first glance the patch looks promising.

Any chance you could also look at updating appendTableText in XWPFWordExtractor with logic similar to that in your updated unit test?
Comment 5 Tim Allison 2014-06-16 19:08:26 UTC
Thank you, Nick!

There's a slight difference between the test's extractSDTs and the way that XWPFWordExtractor works.  The general goal is to return all text recursively from an XWPFSDTCell's content object; this is what the extractor calls.  The test, by contrast, recursively goes through all objects to gather the SDTs themselves, so that we can test both the number of SDTs and the text within them.
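
To make the distinction concrete, here is a minimal sketch of the two traversals. It deliberately does not use POI: Node, Paragraph, Sdt, Table, extractText and findSdts are all hypothetical stand-ins invented for illustration, and the table is flattened to a list of cell contents. The actual patch may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SdtSketch {
    // Hypothetical stand-ins for document body elements; these are NOT POI classes.
    interface Node {}

    static final class Paragraph implements Node {
        final String text;
        Paragraph(String text) { this.text = text; }
    }

    // A structured document tag (content control) wrapping nested content.
    static final class Sdt implements Node {
        final String title;
        final List<Node> content;
        Sdt(String title, Node... content) {
            this.title = title;
            this.content = Arrays.asList(content);
        }
    }

    // A table, flattened to a list of cell contents for brevity.
    static final class Table implements Node {
        final List<Node> cells;
        Table(Node... cells) { this.cells = Arrays.asList(cells); }
    }

    // Extractor-style traversal: append all text found at or beneath a node.
    static void appendText(Node node, StringBuilder out) {
        if (node instanceof Paragraph) {
            out.append(((Paragraph) node).text).append('\n');
        } else if (node instanceof Sdt) {
            for (Node child : ((Sdt) node).content) appendText(child, out);
        } else if (node instanceof Table) {
            for (Node cell : ((Table) node).cells) appendText(cell, out);
        }
    }

    static String extractText(List<Node> body) {
        StringBuilder out = new StringBuilder();
        for (Node node : body) appendText(node, out);
        return out.toString();
    }

    // Test-style traversal: gather every SDT, including nested ones,
    // so a test can check both their count and their text.
    static void collectSdts(Node node, List<Sdt> found) {
        if (node instanceof Sdt) {
            Sdt sdt = (Sdt) node;
            found.add(sdt);
            for (Node child : sdt.content) collectSdts(child, found);
        } else if (node instanceof Table) {
            for (Node cell : ((Table) node).cells) collectSdts(cell, found);
        }
    }

    static List<Sdt> findSdts(List<Node> body) {
        List<Sdt> found = new ArrayList<>();
        for (Node node : body) collectSdts(node, found);
        return found;
    }

    // Made-up sample: a cover-page SDT, a plain paragraph, and a table
    // whose single cell holds an SDT with another SDT nested inside it.
    static List<Node> sampleBody() {
        return Arrays.<Node>asList(
            new Sdt("Cover Page", new Paragraph("Annual Report")),
            new Paragraph("Introduction"),
            new Table(new Sdt("Cell control",
                new Paragraph("Cell text"),
                new Sdt("Nested", new Paragraph("Inner text")))));
    }

    public static void main(String[] args) {
        List<Node> body = sampleBody();
        System.out.print(extractText(body));
        System.out.println(findSdts(body).size() + " SDTs found");
    }
}
```

The first traversal only cares about the text, so it never needs to keep the SDT objects around; the second keeps them so that assertions can be made per SDT.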
Comment 6 Tim Allison 2014-06-16 19:31:47 UTC
Fixed in r1602960.

Thank you, Vikas, for submitting this issue.

Thank you, Vladimir, for submitting test docs on TIKA-1317.

Thank you, Nick, for your review.