Bug 54771 - Read text from Cover Page, Table of Contents and Bibliography
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.9-dev
Hardware: All All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2013-03-29 19:06 UTC by vikas.garg
Modified: 2014-06-16 19:31 UTC

rough draft of patch (27.50 KB, patch)
2014-06-11 00:49 UTC, Tim Allison

Description vikas.garg 2013-03-29 19:06:07 UTC
Currently, XWPFWordExtractor.getText() does not read text from the Cover Page, Table of Contents or Bibliography parts of a docx file. Are there any plans to add support for extracting text from these parts? If so, will it be in the next release? Or is there another API available to do this?
Comment 1 Nick Burch 2013-03-30 04:05:44 UTC
Apache Tika might be a better bet: it uses Apache POI internally but pulls out a richer set of text and styling.

Otherwise, please submit a patch to enhance XWPFWordExtractor if it isn't doing everything required!
Comment 2 Tim Allison 2014-06-04 12:17:46 UTC
Vladimir Glina just submitted test docs over on TIKA-1317.  This issue is related to POI-54849, which got most SDTs but apparently didn't capture this case.  I'll try to fix this soon.
Comment 3 Tim Allison 2014-06-11 00:49:00 UTC
Created attachment 31704: rough draft of patch

Rough draft of patch attached.  I need to clean up a few things before I commit (end of the week?).  All feedback welcome.  Thank you!
Comment 4 Nick Burch 2014-06-11 11:38:04 UTC
At first glance, the patch looks promising.

Any chance you could also look at updating appendTableText in XWPFWordExtractor with logic similar to that in your updated unit test?
Comment 5 Tim Allison 2014-06-16 19:08:26 UTC
Thank you, Nick!

There's a slight difference between the test's extractSDTs and the way that XWPFWordExtractor works. The extractor's goal is to return all text recursively from an XWPFSDTCell's content object, so that is what it calls. The test instead walks all objects recursively to gather the SDTs themselves, so that we can check both the number of SDTs and the text within each one.
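
To make the distinction concrete, here is a rough, self-contained sketch of the two traversals. It does not use the POI classes: Node, textOf and collectSdts are hypothetical stand-ins invented for illustration, and the real patch may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SdtWalkSketch {
    /** Hypothetical node (not a POI class): a leaf carries text, a container or SDT carries children. */
    static final class Node {
        final boolean isSdt;
        final String text;          // non-null only for leaves
        final List<Node> children;  // empty for leaves

        private Node(boolean isSdt, String text, List<Node> children) {
            this.isSdt = isSdt;
            this.text = text;
            this.children = children;
        }

        static Node leaf(String text) { return new Node(false, text, new ArrayList<Node>()); }
        static Node container(Node... kids) { return new Node(false, null, Arrays.asList(kids)); }
        static Node sdt(Node... kids) { return new Node(true, null, Arrays.asList(kids)); }
    }

    /** Extractor-style walk: return all text beneath a node, one line per leaf. */
    static String textOf(Node node) {
        if (node.text != null) {
            return node.text;
        }
        StringBuilder sb = new StringBuilder();
        for (Node child : node.children) {
            String t = textOf(child);
            if (t.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(t);
        }
        return sb.toString();
    }

    /** Test-style walk: visit every node and gather each SDT, including nested ones. */
    static List<Node> collectSdts(List<Node> nodes) {
        List<Node> found = new ArrayList<Node>();
        for (Node n : nodes) {
            if (n.isSdt) {
                found.add(n);
            }
            found.addAll(collectSdts(n.children));
        }
        return found;
    }

    public static void main(String[] args) {
        List<Node> body = Arrays.asList(
            Node.sdt(Node.leaf("Cover Page Title")),
            Node.container(Node.sdt(Node.leaf("Contents"), Node.sdt(Node.leaf("Chapter 1")))),
            Node.leaf("Body paragraph"));

        List<Node> sdts = collectSdts(body);
        System.out.println("SDT count: " + sdts.size());
        for (Node s : sdts) {
            System.out.println(textOf(s).replace('\n', '|'));
        }
    }
}
```

In this toy model the extractor only ever needs textOf, while a test needs collectSdts as well so it can assert on how many SDTs were found and what each one contains.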
Comment 6 Tim Allison 2014-06-16 19:31:47 UTC
Fixed in r1602960.

Thank you, Vikas, for submitting this issue.

Thank you, Vladimir, for submitting test docs on TIKA-1317.

Thank you, Nick, for your review.