Bug 49189 - XWPFWordExtractor discards <w:tab/> entries.
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.7-dev
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-04-27 04:21 UTC by Antoni Mylka
Modified: 2010-05-25 16:32 UTC



Attachments
Test document which exposes the problem (13.86 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2010-04-27 04:21 UTC, Antoni Mylka
A test case, to be placed in ooxml/testcases/org/apache/poi/xwfp/extractor (1.93 KB, application/octet-stream)
2010-04-27 04:23 UTC, Antoni Mylka
Patch, with my workaround (5.61 KB, application/octet-stream)
2010-04-27 06:08 UTC, Antoni Mylka

Description Antoni Mylka 2010-04-27 04:21:50 UTC
Created attachment 25358 [details]
Test document which exposes the problem

In the current trunk, two characters separated by a tab character are glued together: the tab is removed.

I tried to debug the issue and found the following piece of code in the XWPFParagraph.getText() method:

XmlObject o = c.getObject();
if (o instanceof CTText) {
    text.append(((CTText) o).getStringValue());
}
if (o instanceof CTPTab) {
    text.append("\t");
}

This seems to assume that wherever a <w:tab/> construct appears in the source XML, XMLBeans will return an instance of CTPTab. Unfortunately, in my case it seems to return CTEmptyImpl, which is not a CTPTab.

I tried to read the specs, and section 17.3.1.37 says that the only possible parent element for <w:tab> is <w:tabs>. In my file, generated with Office 2010 beta, I have:

<w:p w14:paraId="4EB09767" w14:textId="77777777" w:rsidR="00B3064F"
	w:rsidRDefault="00B3064F">
	<w:r>
		<w:t>a</w:t>
	</w:r>
	<w:r>
		<w:tab />
		<w:t>b</w:t>
	</w:r>
	<w:bookmarkStart w:id="0" w:name="_GoBack" />
	<w:bookmarkEnd w:id="0" />
</w:p>

You can see that <w:tab /> is not enclosed within <w:tabs></w:tabs>.

This might imply that either Office produces an invalid file, or the OpenXML XSDs are wrong, or there is something wrong with the XMLBeans class generator or its runtime parser.

Could someone with more knowledge of the OpenXML format take a look at this? This error spoils fulltext indexing and seems pretty important for the users of the Aperture Framework.

The easiest workaround for me would be to add a third 'if' for CTEmptyImpl and put a space in the output. Superfluous whitespace (almost) never hurts, while gluing words together is bad, but as I said, my knowledge of this topic is limited.
Comment 1 Antoni Mylka 2010-04-27 04:23:11 UTC
Created attachment 25359 [details]
A test case, to be placed in ooxml/testcases/org/apache/poi/xwfp/extractor
Comment 2 Antoni Mylka 2010-04-27 06:08:34 UTC
Created attachment 25360 [details]
Patch, with my workaround

This fixes the issue for me, but it introduces some superfluous spaces in other tests, which I consider a small problem.

I don't know what consequences it might have in other scenarios - more OpenXML knowledge is needed.
Comment 3 Nick Burch 2010-05-25 13:06:57 UTC
From looking at the XSDs, it seems to me that a tab entry is allowed in two places: within the tabs element, or within a normal paragraph tag.

When within a paragraph, on its own, it's of type CT_Empty. This fits with what XMLBeans is giving us.

cr (carriage return) looks to be quite similar (it's next to tabs in the paragraph definition), so we should probably handle the two in a similar way.

Will take a look at your patch later on.
Comment 4 Nick Burch 2010-05-25 16:32:52 UTC
Fixed in r948199, along with a unit test. 

We now spot the CTEmpty instances and check what they really are. If it's a tab or a cr, we append the appropriate character to the internal text representation.
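
For readers without the POI sources at hand, the idea behind the fix can be sketched with the JDK's own DOM parser instead of XMLBeans. This is not the code from r948199; the class name RunTextSketch and the method paragraphText are hypothetical, and the choice of "\n" for <w:cr/> is an assumption. The point is only that an empty element inside a run has to be identified by its element name, because its type alone does not say whether it is a tab or a carriage return.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical standalone sketch; it does not use POI or XMLBeans.
final class RunTextSketch {
    // WordprocessingML main namespace, normally bound to the "w" prefix.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Collects the text of every run, mapping <w:tab/> to a tab character
    // and <w:cr/> to a newline (the newline is an assumption of this sketch).
    static String paragraphText(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(
            paragraphXml.getBytes(StandardCharsets.UTF_8)));

        StringBuilder text = new StringBuilder();
        NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            for (Node child = runs.item(i).getFirstChild();
                 child != null; child = child.getNextSibling()) {
                // Skip whitespace text nodes and foreign-namespace elements.
                if (child.getNodeType() != Node.ELEMENT_NODE
                        || !W_NS.equals(child.getNamespaceURI())) {
                    continue;
                }
                String name = child.getLocalName();
                if ("t".equals(name)) {
                    text.append(child.getTextContent());
                } else if ("tab".equals(name)) {
                    text.append('\t');
                } else if ("cr".equals(name)) {
                    text.append('\n');
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Reduced version of the paragraph from the bug description.
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>a</w:t></w:r>"
            + "<w:r><w:tab/><w:t>b</w:t></w:r>"
            + "</w:p>";
        // Show the tab as a visible escape sequence when printing.
        System.out.println(paragraphText(xml).replace("\t", "\\t"));
    }
}
```

Run against the reduced paragraph from the description, the two letters come out separated by a tab instead of glued together. Dispatching on the element name, rather than on the Java type of the bean, is what lets one branch serve both tab and cr.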