58993 – failed to extract the correct paragraph direction for docx documents

Summary: failed to extract the correct paragraph direction for docx documents

Status:	RESOLVED WONTFIX

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	XWPF
Version:	3.13-FINAL
Hardware:	All All

Importance:	P2 major
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2016-02-10 21:54 UTC by nan.yu
Modified:	2016-02-11 18:08 UTC
CC List:	0 users

Attachments
docx document contains RTL content (52.68 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2016-02-10 21:54 UTC, nan.yu

Description nan.yu 2016-02-10 21:54:50 UTC

Created attachment 33546
docx document contains RTL content

I use "paragraph.getCTP != null && paragraph.getCTP.isSetPPr && paragraph.getCTP.getPPr.isSetBidi" to get the directional information for paragraphs. When XWPFDocument reads Arabic/Hebrew documents, I would expect isSetBidi to return TRUE for the RTL paragraphs. However, it returns FALSE, which I read as the "LTR" direction.

Comment 1 Dominik Stadler 2016-02-11 15:42:48 UTC

Also reported at http://stackoverflow.com/questions/35326966/extract-wrong-paragraph-direction-in-word-using-apache-poi-library

Comment 2 Tim Allison 2016-02-11 18:08:07 UTC

At least with docx, I don't think this is a fault of POI, and I don't think we can fix it. Once you hit getCTP(), you are out of the hands of POI and into the hands of the XMLBeans-generated classes.

If you look at the ECMA OOXML part 1 standard p. 312, you'll see exactly the same underlying xml that is in your docx, where the only indication is the presence of <rtl/> in both the pPr's rPr and each run's rPr. In short, there is no bidi element in the pPr.

While I want your code to work, it looks like the way to get at whether a paragraph is basically rtl or whether a run is rtl _in your document_ (and the test document that I generated as well) is to check for the existence of rtl. 

For a paragraph, if p.getCTP().getPPr().getRPr().getRtl() is null, then the paragraph is probably LTR; if it is not null, then check its value.  If its value is null, then the paragraph is basically RTL (a bare <rtl/> means "on"); otherwise, I imagine, follow whatever value it has.

For a run, if r.getCTR().getRPr().getRtl() is null, then that run is LTR; if it is not null, then you should probably check its value, which may or may not be null.



        InputStream is = new FileInputStream(f);
        XWPFDocument doc = new XWPFDocument(is);
        for (XWPFParagraph p : doc.getParagraphs()) {
            if (p.getCTP() != null && p.getCTP().getPPr() != null) {
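                // getRPr() can itself be null when the paragraph mark has no run properties
                if (p.getCTP().getPPr().getRPr() != null && p.getCTP().getPPr().getRPr().getRtl() != null) {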
                    if (p.getCTP().getPPr().getRPr().getRtl().getVal() == null) {
                        System.out.println("para: rtl");
                    } else {
                        System.out.println("para: " + p.getCTP().getPPr().getRPr().getRtl().getVal());
                    }
                } else {
                    System.out.println("para: ltr");
                }
            }
            for (XWPFRun r : p.getRuns()) {
                if (r.getCTR().getRPr() != null && r.getCTR().getRPr().getRtl() != null) {
                    //probably rtl
                    if (r.getCTR().getRPr().getRtl().getVal() == null) {
                        System.out.println("run: rtl");
                    } else {
                        System.out.println("run: " +r.getCTR().getRPr().getRtl().getVal());
                    }
                } else {
                    System.out.println("run: ltr");
                }
            }
        }
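
The same checks can also be sketched against the raw paragraph XML using only the JDK, which may help when comparing what is actually in word/document.xml with what the XMLBeans getters return. This is a minimal sketch, not POI code: the class RtlCheck and its helper names are hypothetical, it assumes the Transitional WordprocessingML namespace, it treats a missing element as "off" without resolving styles or document defaults, and it accepts either bidi in the pPr or rtl in the pPr's rPr as a hint that the paragraph is RTL, following the heuristic described above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, for illustration only; not part of POI.
public class RtlCheck {
    // Assumption: Transitional namespace (Strict documents use a different URI).
    public static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** First direct child in the w: namespace with this local name, or null. */
    public static Element child(Element parent, String localName) {
        if (parent == null) return null;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** On/off element: absent means off here; present without w:val means on. */
    public static boolean isOn(Element onOff) {
        if (onOff == null) return false;
        if (!onOff.hasAttributeNS(W_NS, "val")) return true;
        String v = onOff.getAttributeNS(W_NS, "val");
        return v.equals("true") || v.equals("1") || v.equals("on");
    }

    /** True if pPr/bidi is on, or pPr/rPr/rtl is on (heuristic, ignores styles). */
    public static boolean paragraphLooksRtl(Element p) {
        Element pPr = child(p, "pPr");
        return isOn(child(pPr, "bidi")) || isOn(child(child(pPr, "rPr"), "rtl"));
    }

    /** Parses a small XML string with namespace awareness and returns its root. */
    public static Element parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        return f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String ns = " xmlns:w=\"" + W_NS + "\"";
        // Bare <w:rtl/> in the paragraph mark's run properties
        System.out.println(paragraphLooksRtl(parse(
                "<w:p" + ns + "><w:pPr><w:rPr><w:rtl/></w:rPr></w:pPr></w:p>")));
        // Explicitly switched off
        System.out.println(paragraphLooksRtl(parse(
                "<w:p" + ns + "><w:pPr><w:rPr><w:rtl w:val=\"0\"/></w:rPr></w:pPr></w:p>")));
    }
}
```

To run this against a real file, the paragraph elements would have to be read from word/document.xml inside the docx (a zip archive); that step is left out here.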