|Summary:||Read text from Cover Page, Table of Contents and Bibliography|
|Component:||XWPF||Assignee:||POI Developers List <dev>|
|Attachments:||rough draft of patch|
Description vikas.garg 2013-03-29 19:06:07 UTC
Currently, XWPFWordExtractor.getText() does not read text from the Cover Page, Table of Contents, or Bibliography parts of a docx file. Are there any plans to add support for extracting text from these parts? If so, will it be in the next release? Or is there another API available for this?
Comment 1 Nick Burch 2013-03-30 04:05:44 UTC
Apache Tika might be a better bet: it uses Apache POI internally but pulls out a richer set of text and styling. Otherwise, please submit a patch to enhance XWPFWordExtractor if it isn't doing everything required!
Comment 2 Tim Allison 2014-06-04 12:17:46 UTC
Vladimir Glina just submitted test docs over on TIKA-1317. This issue is related to POI-54849, which got most SDTs but apparently didn't capture this case. I'll try to fix this soon.
Comment 3 Tim Allison 2014-06-11 00:49:00 UTC
Created attachment 31704 (rough draft of patch)
Rough draft of patch attached. I need to clean up a few things before I commit (end of the week?). All feedback welcome. Thank you!
Comment 4 Nick Burch 2014-06-11 11:38:04 UTC
At first glance the patch looks promising. Any chance you could also look at updating appendTableText in XWPFWordExtractor with logic similar to that in your updated unit test?
Comment 5 Tim Allison 2014-06-16 19:08:26 UTC
Thank you, Nick! There's a slight difference between the test's extractSDTs and the way that XWPFDocumentExtractor works. The general goal is to return all text recursively from an XWPFSDTCell's content object, and that is what the extractor calls. The test, by contrast, recursively walks all objects to gather the SDTs themselves, so that we can check both the number of SDTs and the text within them.
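To illustrate the difference between the two traversals, here is a minimal, self-contained sketch. It is not taken from the patch and uses no Apache POI classes: Node, Text, Container, Sdt, textOf and collectSdts are hypothetical stand-ins invented for this example. The first walk returns all text beneath an element, as the extractor is meant to; the second gathers the SDT elements themselves, as the test does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for document body elements; these are NOT Apache POI classes.
class SdtSketch {

    /** Any body element: a run of text or a container of further elements. */
    interface Node {
        List<Node> children();
    }

    /** Leaf holding plain text. */
    static final class Text implements Node {
        final String value;
        Text(String value) { this.value = value; }
        public List<Node> children() { return Collections.emptyList(); }
    }

    /** Generic container, e.g. a paragraph, table, or table cell. */
    static class Container implements Node {
        final List<Node> kids;
        Container(Node... kids) { this.kids = Arrays.asList(kids); }
        public List<Node> children() { return kids; }
    }

    /** Structured document tag, e.g. a cover page or table of contents block. */
    static final class Sdt extends Container {
        Sdt(Node... kids) { super(kids); }
    }

    /** Extractor-style walk: return all text below a node, depth first. */
    static String textOf(Node node) {
        StringBuilder out = new StringBuilder();
        appendText(node, out);
        return out.toString();
    }

    private static void appendText(Node node, StringBuilder out) {
        if (node instanceof Text) {
            out.append(((Text) node).value);
        }
        for (Node child : node.children()) {
            appendText(child, out);
        }
    }

    /** Test-style walk: gather every SDT at any depth, including nested ones. */
    static List<Sdt> collectSdts(Node node) {
        List<Sdt> found = new ArrayList<>();
        collectSdts(node, found);
        return found;
    }

    private static void collectSdts(Node node, List<Sdt> found) {
        if (node instanceof Sdt) {
            found.add((Sdt) node);
        }
        for (Node child : node.children()) {
            collectSdts(child, found);
        }
    }

    public static void main(String[] args) {
        // An SDT nested inside another SDT, followed by ordinary body text.
        Node doc = new Container(
            new Sdt(new Container(new Text("Cover "), new Sdt(new Text("Title")))),
            new Container(new Text(" Body")));
        System.out.println(textOf(doc));             // prints "Cover Title Body"
        System.out.println(collectSdts(doc).size()); // prints 2
    }
}
```

In this sketch the text walk never needs to know which elements are SDTs, whereas the collecting walk keeps a reference to each one so that a test can count them and inspect their text individually.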