Currently, XWPFWordExtractor.getText() does not read text from the Cover Page, Table of Contents, or Bibliography parts of a docx file. Are there any plans to add support for extracting text from these parts? If so, will it be in the next release? Or is there another API available to do this?
Apache Tika might be a better bet: it uses Apache POI internally but pulls out a richer set of text and styling. Otherwise, please submit a patch to enhance XWPFWordExtractor if it isn't doing everything required!
Vladimir Glina just submitted test docs over on TIKA-1317. This issue is related to POI-54849, which got most SDTs but apparently didn't capture this case. I'll try to fix this soon.
Created attachment 31704: rough draft of patch. I need to clean up a few things before I commit (end of the week?). All feedback welcome. Thank you!
At first glance the patch looks promising. Any chance you could also look at updating appendTableText in XWPFWordExtractor with logic similar to that in your updated unit test?
Thank you, Nick! There's a slight difference between the test's extractSDTs and the way that XWPFDocumentExtractor works. The general goal is to return all text recursively from an XWPFSDTCell's content object; this is what the extractor calls. The test recursively goes through all objects to gather the SDTs, so that we can test numbers of SDTs and text within them.
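To make the distinction above concrete, here is a minimal, self-contained sketch of the two kinds of recursive walk. It does not use POI: BodyElement, Paragraph, Sdt and SdtSketch are hypothetical stand-ins invented for illustration, and the real patch may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for document body elements; these are NOT POI classes.
interface BodyElement {}

final class Paragraph implements BodyElement {
    final String text;
    Paragraph(String text) { this.text = text; }
}

// A structured document tag (SDT) wraps further elements, possibly other SDTs.
final class Sdt implements BodyElement {
    final List<BodyElement> content;
    Sdt(List<BodyElement> content) { this.content = content; }
}

final class SdtSketch {
    // Extractor-style walk: return all text, descending into SDT content.
    static String extractText(List<BodyElement> elements) {
        StringBuilder out = new StringBuilder();
        appendText(elements, out);
        return out.toString();
    }

    private static void appendText(List<BodyElement> elements, StringBuilder out) {
        for (BodyElement e : elements) {
            if (e instanceof Paragraph) {
                out.append(((Paragraph) e).text).append('\n');
            } else if (e instanceof Sdt) {
                appendText(((Sdt) e).content, out);
            }
        }
    }

    // Test-style walk: gather the SDT objects themselves so they can be counted
    // and inspected one by one.
    static List<Sdt> collectSdts(List<BodyElement> elements) {
        List<Sdt> found = new ArrayList<>();
        collect(elements, found);
        return found;
    }

    private static void collect(List<BodyElement> elements, List<Sdt> found) {
        for (BodyElement e : elements) {
            if (e instanceof Sdt) {
                Sdt sdt = (Sdt) e;
                found.add(sdt);
                collect(sdt.content, found);
            }
        }
    }
}
```

In this sketch the extractor-style method only cares about the concatenated text, while the test-style method keeps each SDT as a separate object, which is what allows assertions on the number of SDTs and on the text inside each one.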
Fixed in r1602960. Thank you, Vikas, for submitting this issue. Thank you, Vladimir, for submitting test docs on TIKA-1317. Thank you, Nick, for your review.