Created attachment 25358 [details]
Test document which exposes the problem

In the current trunk, two characters separated by a tab character are glued together: the tab is removed. I tried to debug the issue and found the following piece of code in the XWPFParagraph.getText() method:

    XmlObject o = c.getObject();
    if (o instanceof CTText) {
        text.append(((CTText) o).getStringValue());
    }
    if (o instanceof CTPTab) {
        text.append("\t");
    }

This seems to assume that wherever a <w:tab/> element appears in the source document, XMLBeans will return an instance of CTPTab. Unfortunately, in my case it seems to return CTEmptyImpl, which is not a CTPTab.

I tried to read the specs, and section 17.3.1.37 says that there is only one possible parent element for <w:tab>, namely <w:tabs>. In my file, generated with Office 2010 beta, I have:

    <w:p w14:paraId="4EB09767" w14:textId="77777777" w:rsidR="00B3064F" w:rsidRDefault="00B3064F">
      <w:r>
        <w:t>a</w:t>
      </w:r>
      <w:r>
        <w:tab />
        <w:t>b</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack" />
      <w:bookmarkEnd w:id="0" />
    </w:p>

You can see that <w:tab /> is not enclosed within <w:tabs></w:tabs>. This might imply that either Office produces a wrong file, or the OpenXML XSDs are wrong, or there is something wrong with the XMLBeans class generator or with its runtime parser. Could someone with more knowledge of the OpenXML format take a look at this?

This error spoils full-text indexing and seems pretty important for the users of the Aperture Framework. The easiest workaround for me would be to add a third 'if' for CTEmptyImpl and put a space in the output. Superfluous whitespace (almost) never hurts, while gluing words together is bad, but as I said, my knowledge of this topic is limited.
Created attachment 25359 [details]
A test case, to be placed in ooxml/testcases/org/apache/poi/xwpf/extractor
Created attachment 25360 [details]
Patch, with my workaround

This fixes the issue for me, but it introduces some superfluous spaces in other tests, which I consider a small problem. I don't know what consequences it might have in other scenarios; more OpenXML knowledge is needed.
From looking at the XSDs, it seems to me that a tab entry is allowed in two places: within the tabs element, or within a normal paragraph. When it appears within a paragraph, on its own, it is of type CT_Empty. This fits with what XMLBeans is giving us.

cr (carriage return) looks to be quite similar (it's next to tabs in the paragraph definition), so we should probably handle the two in a similar way.

I will take a look at your patch later on.
Fixed in r948199, along with a unit test. We now spot the CTEmpty instances and check what they really are. If one is a tab or a cr, we then append the appropriate character to the internal text representation.
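For readers without the POI sources at hand, here is a rough sketch of the idea behind the fix: look at the element name of each child of a run, rather than relying on its Java type. This is not the code committed in r948199. It uses the JDK's DOM parser as a stand-in for XMLBeans, and the class name RunTextSketch and method extractText are made up for illustration. Mapping cr to a newline is also an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; POI itself works on XMLBeans objects.
class RunTextSketch {
    // WordprocessingML main namespace (the "w:" prefix).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Collects the text of every run, turning <w:tab/> and <w:cr/> into
    // whitespace instead of silently dropping them.
    static String extractText(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(paragraphXml)));

        StringBuilder text = new StringBuilder();
        NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            NodeList children = runs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Skip whitespace text nodes and foreign-namespace elements.
                if (child.getNodeType() != Node.ELEMENT_NODE
                        || !W_NS.equals(child.getNamespaceURI())) {
                    continue;
                }
                String name = child.getLocalName();
                if ("t".equals(name)) {
                    text.append(child.getTextContent());
                } else if ("tab".equals(name)) {
                    text.append('\t');   // empty element, identified by name
                } else if ("cr".equals(name)) {
                    text.append('\n');   // assumption: cr becomes a newline
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Reduced version of the paragraph from the original report.
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>a</w:t></w:r>"
            + "<w:r><w:tab/><w:t>b</w:t></w:r>"
            + "</w:p>";
        System.out.println(extractText(xml).equals("a\tb")); // prints "true"
    }
}
```

Checking the element name is what lets the tab inside a run be told apart from a cr, since both arrive as the same empty type.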