Issue 90800 - PDF Import Extension. Hebrew (bidi) text reversed
Summary: PDF Import Extension. Hebrew (bidi) text reversed
Status: CLOSED FIXED
Alias: None
Product: Draw
Classification: Application
Component: code (show other issues)
Version: DEV300m76
Hardware: PC All
: P3 Trivial with 8 votes (vote)
Target Milestone: 3.4.1
Assignee: wolframgarten
QA Contact: issues@graphics
URL:
Keywords:
: 105251 (view as issue list)
Depends on:
Blocks:
 
Reported: 2008-06-17 12:57 UTC by hennerdrewes
Modified: 2017-05-20 10:22 UTC (History)
8 users (show)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
PDF file containing Hebrew text (29.41 KB, application/pdf)
2008-06-17 12:59 UTC, hennerdrewes
no flags Details
Screenshot after import (86.56 KB, image/png)
2008-06-17 13:00 UTC, hennerdrewes
no flags Details
Patch for RTL support in PDF import (19.38 KB, text/plain)
2009-05-14 08:05 UTC, alan
no flags Details
Sample file with Hebrew and English text (117.10 KB, application/pdf)
2009-05-14 08:20 UTC, alan
no flags Details
Has more fixes - replaces previous patch (22.07 KB, patch)
2009-06-30 09:34 UTC, alan
no flags Details | Diff
small test pdf in hebrew (52.15 KB, application/pdf)
2010-10-18 18:05 UTC, kaplanlior
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description hennerdrewes 2008-06-17 12:57:45 UTC
When importing a PDF file containing Hebrew text, the Hebrew text will be
imported in reverse order. The letters will appear from left-to-right instead of
RTL.

Attaching screenshot and sample PDF.
Comment 1 hennerdrewes 2008-06-17 12:59:19 UTC
Created attachment 54537 [details]
PDF file containing Hebrew text
Comment 2 hennerdrewes 2008-06-17 13:00:00 UTC
Created attachment 54539 [details]
Screenshot after import
Comment 3 wolframgarten 2008-06-17 13:45:34 UTC
Reproducible.
Comment 4 Martin Hollmichel 2008-10-08 09:51:59 UTC
change component.
Comment 5 alan 2008-11-21 09:35:03 UTC
ayaniger->wg:
I would to help with this issue. 

I think that the problem is because of the way the Hebrew chars are stored in
the PDF file.

In PDF format, each glyph is stored with its co-ordinates, and the x value
increases as you go rightward. This means that in a Hebrew RTL line, the “last”
(leftmost) logical character is the “first” visual character in the line (lowest
x-value). Apparently OOo  thinks the text is ordered logically, instead of
visually. It then passes the text to the bidi algorithm, which sees Hebrew, and
the bidi algorithm incorrectly reverses the Hebrew text.   

Does this sound right to you? If so, where is the code which determines whether
in an imported Draw document,  RTL text should be treated as logically or
visually ordered?
Comment 6 alan 2009-05-14 08:05:52 UTC
Created attachment 62254 [details]
Patch for RTL support in PDF import
Comment 7 alan 2009-05-14 08:09:06 UTC
I've attached a patch which addresses this issue, and some others relating to
RTL support in the PDF importer.
Here's a summary of what the patch does:
- During the optimization stage, paragraphs are checked for RTL text. If a
paragraph has RTL text, a flag is set, so that the paragraph's XML will include
a directive for RTL text direction. At the XML generation stage, all RTL text
elements are reversed. The patch uses breakiterators at both of these stages to
determine if a text element is RTL.
- If several text elements had an identical graphics context, concatenation of
text elements was only occurring from the second one on. The first character was
treated as a text element unto itself. The code mistakenly thought that if the
Transformation matrix of the second element was (100,0,0,-100), it should not be
concatenated with the first element. The patch changes this behavior, so that
the second element is concatenated with the first element. This was important
for RTL strings, so that the entire string would be reversed, including the
first character.
- In RTL documents, the font of a space character is often in an LTR font. This
breaks up an RTL phrase into several text objects. Each word gets reversed and
written as a separate text element as the XML is generated, but Draw later
unifies them into a single text object. The result was that each word looks
fine, but the order of words was reversed. The patch fixes this by treating
spaces in the optimization stage as if they were in an RTL font, thus
concatenating all the words during the optimization stage. 
- Adds properties "style:font-family-complex", "style:font-weight-complex", and
"style:font-style-complex", since with the current code, the font properties
were being ignored in RTL documents.
- Removes some unused code.
- Treats a non-breaking space as a space, as described in issue 101327, only in
an additional place in the code. The patch there should be unapplied before
applying this patch.
- Adds a few more hints from font names to determine if bold, italic, or
regular. Optimized the code which checks the font name. 


Comment 8 alan 2009-05-14 08:20:22 UTC
Created attachment 62255 [details]
Sample file with Hebrew and English text
Comment 9 philipp.lohmann 2009-05-14 08:39:44 UTC
great work, this should go into the next pdfimport extension CWS
Comment 10 alan 2009-05-17 17:39:54 UTC
See issue 102002 , which is a follow-up to this issue.
Comment 11 alan 2009-06-30 09:32:41 UTC
I'm attaching a new patch to replace the previous one. The new patch has the
features of the old patch, plus the following:

- Adds include files needed for Windows build
- For font recognition, checks for suffixes “PSMT”, and “MT”, and removes them
- Fixes reversed parentheses, brackets, etc. for RTL languages
- Checks for RTL paragraph not just before, but also after concatenation
- Streamlines code segment in optimization loop
- Fixes off-by-one bug  when checking if text object is RTL

Comment 12 alan 2009-06-30 09:34:09 UTC
Created attachment 63287 [details]
Has more fixes - replaces previous patch
Comment 13 wolframgarten 2009-10-08 11:41:22 UTC
*** Issue 105251 has been marked as a duplicate of this issue. ***
Comment 14 philipp.lohmann 2010-04-28 14:33:50 UTC
applied the second patch in CWS pdfextfix03 - aside from using vcl's
GetMirroredChar; we cannot really link vcl in an extension as that would break
binary compatibility.
Comment 15 philipp.lohmann 2010-06-09 10:44:39 UTC
verified in CWS vcl112
Comment 16 philipp.lohmann 2010-07-19 15:16:22 UTC
integrated in DEV300m83

closing
Comment 17 philipp.lohmann 2010-10-18 15:02:41 UTC
kaplan says, DEV300m89 (containing the attached patch) still shows the original
problem, namely reversed hebrew strings. Need to investigate this.
Comment 18 philipp.lohmann 2010-10-18 15:08:50 UTC
changing type and target
Comment 19 kaplanlior 2010-10-18 17:22:29 UTC
Might it be the "#if 0" part in 
sdext/source/pdfimport/tree/drawtreevisiting.cxx (see patch in lines @@ -80,27
+117,50 @@)
The change set is http://hg.services.openoffice.org/OOO330/rev/bd45002f7b96

Kaplan

Comment 20 philipp.lohmann 2010-10-18 17:55:51 UTC
Entirely possible. As stated in the comment we need a service to use the out
#ifdef'd GetMirroredChar.

Could you please attach a (preferably small) test PDF that will reproduce the
problem ? Or would you consider it sufficient if I create my own using
Insert->Special Characters with (random) hebrew characters ?
Comment 21 kaplanlior 2010-10-18 18:05:57 UTC
Created attachment 72095 [details]
small test pdf in hebrew
Comment 22 kaplanlior 2010-10-18 18:09:56 UTC
@pl: Attached a small test for Hebrew in PDF. Let me know if anything else is
needed.

I hope this could get this fixed for 3.3...
Comment 23 hennerdrewes 2010-10-18 20:23:38 UTC
@pl: actually there is another problematic piece in the code: 

a check is made for complex script type. If found, the string is reversed. But
complex script does not necessarily mean RTL (e.g. Thai).
Comment 24 philipp.lohmann 2010-10-19 10:06:15 UTC
good point, need to change that check also.
Comment 25 philipp.lohmann 2010-10-20 12:00:57 UTC
work in progress

@kaplan: the 3.3 release is luckily not the issue here; the extension gets
released separately. The fix will work with 3.3 as well as its preecessors or
successors.
Comment 26 philipp.lohmann 2010-10-20 13:31:05 UTC
Instead of getScriptType on the BreakIterator we can use the
CharacterClassification which can actually tell us whether the text is RTL or
not. I changed that, however this will probably break at some point where mixed
RTL/LTR comes into play.
Comment 27 philipp.lohmann 2010-10-20 19:05:27 UTC
fixed in  CWS pdfextfix04
Comment 28 philipp.lohmann 2010-10-20 19:06:11 UTC
please verify in CWS pdfextfix04
Comment 29 philipp.lohmann 2010-10-20 19:16:17 UTC
setting milestone to satisfy EIS (this will not affect the release of the extension)
Comment 30 wolframgarten 2010-10-21 13:32:26 UTC
Verified in CWS.