Bug 58993 - failed to extract the correct paragraph direction for docx documents
Status: RESOLVED WONTFIX
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.13-FINAL
Hardware: All All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
 
Reported: 2016-02-10 21:54 UTC by nan.yu
Modified: 2016-02-11 18:08 UTC

Attachments
docx document contains RTL content (52.68 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2016-02-10 21:54 UTC, nan.yu

Description nan.yu 2016-02-10 21:54:50 UTC
Created attachment 33546
docx document contains RTL content

I use "paragraph.getCTP != null && paragraph.getCTP.isSetPPr && paragraph.getCTP.getPPr.isSetBidi" to get the directional information for paragraphs. When HWPFDocument reads Arabic/Hebrew documents, I would expect isSetBidi to return TRUE for the RTL paragraphs. However, it returns FALSE, which represents the LTR direction.
Comment 2 Tim Allison 2016-02-11 18:08:07 UTC
At least with docx, I don't think this is a fault of POI, and I don't think we can fix it. Once you hit getCTP(), you are out of the hands of POI and into the hands of the XMLBeans-generated schema classes.

If you look at the ECMA OOXML part 1 standard p. 312, you'll see exactly the same underlying xml that is in your docx, where the only indication is the presence of <rtl/> in both the pPr's rPr and each run's rPr. In short, there is no bidi element in the pPr.
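
To make the markup point concrete, here is a minimal sketch that uses only the JDK's DOM parser (no POI). The XML string is an illustrative stand-in for the structure described above, not a copy of the attachment or of the standard's example, and the class and method names are made up for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class RtlMarkupCheck {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical paragraph: <w:rtl/> sits in the pPr's rPr and in the run's rPr,
    // and there is no <w:bidi/> directly under pPr.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
      + "<w:pPr><w:rPr><w:rtl/></w:rPr></w:pPr>"
      + "<w:r><w:rPr><w:rtl/></w:rPr><w:t>text</w:t></w:r>"
      + "</w:p>";

    // Parses the XML with namespace awareness and returns the root element.
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Returns the first direct child in the w: namespace with the given local name, or null.
    static Element child(Element parent, String localName) {
        if (parent == null) {
            return null;
        }
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // True only if <w:bidi> is a direct child of <w:pPr>.
    static boolean hasBidi(Element p) {
        return child(child(p, "pPr"), "bidi") != null;
    }

    // True if <w:rtl> appears in the rPr nested inside <w:pPr>.
    static boolean hasParagraphMarkRtl(Element p) {
        return child(child(child(p, "pPr"), "rPr"), "rtl") != null;
    }

    // True if the first run's rPr contains <w:rtl>.
    static boolean firstRunHasRtl(Element p) {
        return child(child(child(p, "r"), "rPr"), "rtl") != null;
    }

    public static void main(String[] args) {
        Element p = parse(SAMPLE);
        System.out.println("bidi in pPr:      " + hasBidi(p));
        System.out.println("rtl in pPr/rPr:   " + hasParagraphMarkRtl(p));
        System.out.println("rtl in first run: " + firstRunHasRtl(p));
    }
}
```

With markup shaped like this, a check that looks only for bidi under pPr finds nothing, while the two rtl checks succeed.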

While I want your code to work, it looks like the way to get at whether a paragraph is basically rtl or whether a run is rtl _in your document_ (and the test document that I generated as well) is to check for the existence of rtl. 

For a paragraph, if p.getCTP().getPPr().getRPr().getRtl() is null, then the paragraph is probably ltr; if it is not null, then check its value. If its value is null, then the paragraph is basically rtl (a bare <rtl/> with no value counts as on); otherwise, I imagine, follow whatever value it has. Note that getPPr() and getRPr() can themselves be null, so check those first.

For a run, if r.getCTR().getRPr() is null or its getRtl() is null, then that run is ltr. If getRtl() is not null, then you should probably check its value, which may or may not be null.
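
The decision rule in the two paragraphs above can be sketched without POI as a small helper. This is an illustration under assumptions: the names are hypothetical, the attribute value is passed as a plain string rather than as the schema enum, and the accepted "on" spellings (true, 1, on) are what I understand the OOXML on/off type to allow.

```java
import java.util.Locale;

final class OnOffValue {
    // elementPresent: whether the <rtl> element exists at all.
    // val: the raw value of its val attribute, or null if the attribute is absent.
    static boolean isOn(boolean elementPresent, String val) {
        if (!elementPresent) {
            return false;              // no element: property is not set
        }
        if (val == null) {
            return true;               // bare <rtl/>: treated as on
        }
        String v = val.trim().toLowerCase(Locale.ROOT);
        return v.equals("true") || v.equals("1") || v.equals("on");
    }

    // Maps the on/off result to the labels used in this report.
    static String direction(boolean elementPresent, String val) {
        return isOn(elementPresent, val) ? "rtl" : "ltr";
    }

    public static void main(String[] args) {
        System.out.println(direction(false, null));    // element missing
        System.out.println(direction(true, null));     // <rtl/>
        System.out.println(direction(true, "false"));  // <rtl val="false"/>
    }
}
```

The same rule applies to both the paragraph-level and the run-level rtl element; only the path used to reach the element differs.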



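    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;

    // RtlCheck is just a wrapper name for this snippet
    public class RtlCheck {
        public static void main(String[] args) throws Exception {
            File f = new File(args[0]); // path to the .docx, taken here from the first argument
            // try-with-resources closes the stream even if parsing fails
            try (InputStream is = new FileInputStream(f)) {
                XWPFDocument doc = new XWPFDocument(is);
                for (XWPFParagraph p : doc.getParagraphs()) {
                    // pPr and its rPr are both optional, so check each before dereferencing
                    if (p.getCTP() != null && p.getCTP().getPPr() != null
                            && p.getCTP().getPPr().getRPr() != null
                            && p.getCTP().getPPr().getRPr().getRtl() != null) {
                        if (p.getCTP().getPPr().getRPr().getRtl().getVal() == null) {
                            // bare <rtl/> with no value: treat as rtl
                            System.out.println("para: rtl");
                        } else {
                            System.out.println("para: " + p.getCTP().getPPr().getRPr().getRtl().getVal());
                        }
                    } else {
                        System.out.println("para: ltr");
                    }
                    for (XWPFRun r : p.getRuns()) {
                        if (r.getCTR().getRPr() != null && r.getCTR().getRPr().getRtl() != null) {
                            // probably rtl
                            if (r.getCTR().getRPr().getRtl().getVal() == null) {
                                System.out.println("run: rtl");
                            } else {
                                System.out.println("run: " + r.getCTR().getRPr().getRtl().getVal());
                            }
                        } else {
                            System.out.println("run: ltr");
                        }
                    }
                }
            }
        }
    }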