Apache OpenOffice (AOO) Bugzilla – Issue 90800
PDF Import Extension. Hebrew (bidi) text reversed
Last modified: 2017-05-20 10:22:23 UTC
When importing a PDF file containing Hebrew text, the Hebrew text will be imported in reverse order. The letters will appear from left-to-right instead of RTL. Attaching screenshot and sample PDF.
Created attachment 54537 [details] PDF file containing Hebrew text
Created attachment 54539 [details] Screenshot after import
Reproducible.
change component.
ayaniger->wg: I would like to help with this issue. I think that the problem is because of the way the Hebrew chars are stored in the PDF file. In PDF format, each glyph is stored with its co-ordinates, and the x value increases as you go rightward. This means that in a Hebrew RTL line, the “last” (leftmost) logical character is the “first” visual character in the line (lowest x-value). Apparently OOo thinks the text is ordered logically, instead of visually. It then passes the text to the bidi algorithm, which sees Hebrew, and the bidi algorithm incorrectly reverses the Hebrew text. Does this sound right to you? If so, where is the code which determines whether in an imported Draw document, RTL text should be treated as logically or visually ordered?
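A minimal sketch of the ordering problem described above, written in Python rather than the importer's C++. The `Glyph` type and both helper functions are hypothetical and not part of the pdfimport code; the sketch only assumes that each glyph arrives with an x-coordinate and that the whole line is RTL.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float  # horizontal position on the page; grows rightward


def visual_order(glyphs):
    """Concatenate glyphs left to right, as they sit on the page."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))


def logical_order(glyphs, rtl):
    """Return reading order; for an RTL line that is right to left."""
    text = visual_order(glyphs)
    return text[::-1] if rtl else text


# A four-letter Hebrew word (shin, lamed, vav, final mem). It is read
# right to left, so the first logical letter has the largest x value.
glyphs = [
    Glyph("\u05e9", 40.0),  # shin, first logical letter, rightmost
    Glyph("\u05dc", 30.0),  # lamed
    Glyph("\u05d5", 20.0),  # vav
    Glyph("\u05dd", 10.0),  # final mem, last logical letter, leftmost
]
```

If the visually ordered string is handed to a bidi-aware renderer as if it were logical text, the renderer lays it out right to left again, which would produce the reversed display reported here.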
Created attachment 62254 [details] Patch for RTL support in PDF import
I've attached a patch which addresses this issue, and some others relating to RTL support in the PDF importer. Here's a summary of what the patch does:
- During the optimization stage, paragraphs are checked for RTL text. If a paragraph has RTL text, a flag is set, so that the paragraph's XML will include a directive for RTL text direction. At the XML generation stage, all RTL text elements are reversed. The patch uses breakiterators at both of these stages to determine if a text element is RTL.
- If several text elements had an identical graphics context, concatenation of text elements was only occurring from the second one on. The first character was treated as a text element unto itself. The code mistakenly thought that if the Transformation matrix of the second element was (100,0,0,-100), it should not be concatenated with the first element. The patch changes this behavior, so that the second element is concatenated with the first element. This was important for RTL strings, so that the entire string would be reversed, including the first character.
- In RTL documents, the font of a space character is often in an LTR font. This breaks up an RTL phrase into several text objects. Each word gets reversed and written as a separate text element as the XML is generated, but Draw later unifies them into a single text object. The result was that each word looks fine, but the order of words was reversed. The patch fixes this by treating spaces in the optimization stage as if they were in an RTL font, thus concatenating all the words during the optimization stage.
- Adds properties "style:font-family-complex", "style:font-weight-complex", and "style:font-style-complex", since with the current code, the font properties were being ignored in RTL documents.
- Removes some unused code.
- Treats a non-breaking space as a space, as described in issue 101327, only in an additional place in the code. The patch there should be unapplied before applying this patch.
- Adds a few more hints from font names to determine if bold, italic, or regular. Optimized the code which checks the font name.
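The word-order problem in the third item can be illustrated with a short Python sketch. The function names are hypothetical and Latin letters stand in for Hebrew; this is not the patch's code, only a sketch of why merging the runs before reversing matters.

```python
def reverse_each_run(runs):
    """Reverse every run separately, then join in the original order."""
    return "".join(run[::-1] for run in runs)


def reverse_merged(runs):
    """Merge the runs first, then reverse the whole phrase once."""
    return "".join(runs)[::-1]


# Visual (left-to-right) content of a two-word RTL phrase whose logical
# text is "ab cd". On the page it reads "dc ba", and the space sits in
# its own run because it uses a different font.
visual_runs = ["dc", " ", "ba"]
```

Here `reverse_each_run(visual_runs)` gives `"cd ab"` (each word is readable, but the words are in the wrong order), while `reverse_merged(visual_runs)` gives the intended `"ab cd"`.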
Created attachment 62255 [details] Sample file with Hebrew and English text
great work, this should go into the next pdfimport extension CWS
See issue 102002, which is a follow-up to this issue.
I'm attaching a new patch to replace the previous one. The new patch has the features of the old patch, plus the following:
- Adds include files needed for Windows build
- For font recognition, checks for suffixes “PSMT” and “MT”, and removes them
- Fixes reversed parentheses, brackets, etc. for RTL languages
- Checks for RTL paragraph not just before, but also after concatenation
- Streamlines code segment in optimization loop
- Fixes off-by-one bug when checking if text object is RTL
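One plausible reading of the parentheses fix, sketched in Python: when a visually ordered RTL string is reversed into logical order, paired punctuation also has to be swapped for its mirror image. The `MIRROR` table and `visual_to_logical_rtl` are hypothetical, hand-written for illustration; they are not the patch's code and not vcl's GetMirroredChar.

```python
# Illustrative subset of mirrored pairs; a real implementation would
# take these from Unicode's bidi mirroring data.
MIRROR = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def visual_to_logical_rtl(visual):
    """Reverse a visually ordered RTL string and mirror paired punctuation."""
    return "".join(MIRROR.get(ch, ch) for ch in reversed(visual))


# Visual order on the page: "(", bet, alef, ")".
visual = "(\u05d1\u05d0)"
```

Plain reversal of `visual` would yield a string that starts with `)` and ends with `(`; with the mirroring step the parentheses open and close in logical order again.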
Created attachment 63287 [details] Has more fixes - replaces previous patch
*** Issue 105251 has been marked as a duplicate of this issue. ***
applied the second patch in CWS pdfextfix03 - aside from using vcl's GetMirroredChar; we cannot really link vcl in an extension as that would break binary compatibility.
verified in CWS vcl112
integrated in DEV300m83 closing
kaplan says, DEV300m89 (containing the attached patch) still shows the original problem, namely reversed Hebrew strings. Need to investigate this.
changing type and target
Might it be the "#if 0" part in sdext/source/pdfimport/tree/drawtreevisiting.cxx (see patch in lines @@ -80,27 +117,50 @@)? The change set is http://hg.services.openoffice.org/OOO330/rev/bd45002f7b96 Kaplan
Entirely possible. As stated in the comment, we need a service to use the #ifdef'd-out GetMirroredChar. Could you please attach a (preferably small) test PDF that will reproduce the problem? Or would you consider it sufficient if I create my own using Insert->Special Characters with (random) Hebrew characters?
Created attachment 72095 [details] small test pdf in hebrew
@pl: Attached a small test for Hebrew in PDF. Let me know if anything else is needed. I hope this could get this fixed for 3.3...
@pl: actually there is another problematic piece in the code: a check is made for complex script type. If found, the string is reversed. But complex script does not necessarily mean RTL (e.g. Thai).
good point, need to change that check also.
work in progress. @kaplan: the 3.3 release is luckily not the issue here; the extension gets released separately. The fix will work with 3.3 as well as its predecessors or successors.
Instead of getScriptType on the BreakIterator we can use the CharacterClassification which can actually tell us whether the text is RTL or not. I changed that; however, this will probably break at some point where mixed RTL/LTR comes into play.
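The difference between checking the script type and checking the character direction can be sketched in Python. The standard `unicodedata` module stands in here for the CharacterClassification service; `is_rtl_text` is a hypothetical helper, and deciding by the first strongly directional character is an assumption made for this sketch, not necessarily what the fix does.

```python
import unicodedata

# Unicode bidi classes for strong right-to-left characters:
# "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl_text(text):
    """Return True if the first strongly directional character is RTL."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return True
        if bidi_class == "L":
            return False
    return False  # only digits, spaces, or punctuation: treat as LTR


hebrew = "\u05e9\u05dc\u05d5\u05dd"
thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"
mixed = "abc \u05d0\u05d1"
```

Thai is a complex script but its letters have bidi class `L`, so `is_rtl_text(thai)` is false and the string would not be reversed. The `mixed` string is classified as LTR as a whole, which is the kind of mixed RTL/LTR case the comment above expects to cause trouble.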
fixed in CWS pdfextfix04
please verify in CWS pdfextfix04
setting milestone to satisfy EIS (this will not affect the release of the extension)
Verified in CWS.