90800 – PDF Import Extension. Hebrew (bidi) text reversed

Issue 90800 - PDF Import Extension. Hebrew (bidi) text reversed

Summary: PDF Import Extension. Hebrew (bidi) text reversed

Status:	CLOSED FIXED

Alias:	None

Product:	Draw
Classification:	Application
Component:	code (show other issues)
Version:	DEV300m76
Hardware:	PC All

Importance:	P3 Trivial with 8 votes (vote)
Target Milestone:	3.4.1
Assignee:	wolframgarten
QA Contact:	issues@graphics

URL:
Keywords:

Duplicates (1):	105251 (view as issue list)
Depends on:
Blocks:

Reported:	2008-06-17 12:57 UTC by hennerdrewes
Modified:	2017-05-20 10:22 UTC (History)
CC List:	8 users (show)

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
PDF file containing Hebrew text (29.41 KB, application/pdf) 2008-06-17 12:59 UTC, hennerdrewes	no flags	Details
Screenshot after import (86.56 KB, image/png) 2008-06-17 13:00 UTC, hennerdrewes	no flags	Details
Patch for RTL support in PDF import (19.38 KB, text/plain) 2009-05-14 08:05 UTC, alan	no flags	Details
Sample file with Hebrew and English text (117.10 KB, application/pdf) 2009-05-14 08:20 UTC, alan	no flags	Details
Has more fixes - replaces previous patch (22.07 KB, patch) 2009-06-30 09:34 UTC, alan	no flags	Details \| Diff
small test pdf in hebrew (52.15 KB, application/pdf) 2010-10-18 18:05 UTC, kaplanlior	no flags	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description hennerdrewes 2008-06-17 12:57:45 UTC

When importing a PDF file containing Hebrew text, the Hebrew text will be
imported in reverse order. The letters will appear from left-to-right instead of
RTL.

Attaching screenshot and sample PDF.

Comment 1 hennerdrewes 2008-06-17 12:59:19 UTC

Created attachment 54537 [details]
PDF file containing Hebrew text

Comment 2 hennerdrewes 2008-06-17 13:00:00 UTC

Created attachment 54539 [details]
Screenshot after import

Comment 3 wolframgarten 2008-06-17 13:45:34 UTC

Reproducible.

Comment 4 Martin Hollmichel 2008-10-08 09:51:59 UTC

change component.

Comment 5 alan 2008-11-21 09:35:03 UTC

ayaniger->wg:
I would to help with this issue. 

I think that the problem is because of the way the Hebrew chars are stored in
the PDF file.

In PDF format, each glyph is stored with its co-ordinates, and the x value
increases as you go rightward. This means that in a Hebrew RTL line, the “last”
(leftmost) logical character is the “first” visual character in the line (lowest
x-value). Apparently OOo  thinks the text is ordered logically, instead of
visually. It then passes the text to the bidi algorithm, which sees Hebrew, and
the bidi algorithm incorrectly reverses the Hebrew text.   

Does this sound right to you? If so, where is the code which determines whether
in an imported Draw document,  RTL text should be treated as logically or
visually ordered?

Comment 6 alan 2009-05-14 08:05:52 UTC

Created attachment 62254 [details]
Patch for RTL support in PDF import

Comment 7 alan 2009-05-14 08:09:06 UTC

I've attached a patch which addresses this issue, and some others relating to
RTL support in the PDF importer.
Here's a summary of what the patch does:
- During the optimization stage, paragraphs are checked for RTL text. If a
paragraph has RTL text, a flag is set, so that the paragraph's XML will include
a directive for RTL text direction. At the XML generation stage, all RTL text
elements are reversed. The patch uses breakiterators at both of these stages to
determine if a text element is RTL.
- If several text elements had an identical graphics context, concatenation of
text elements was only occurring from the second one on. The first character was
treated as a text element unto itself. The code mistakenly thought that if the
Transformation matrix of the second element was (100,0,0,-100), it should not be
concatenated with the first element. The patch changes this behavior, so that
the second element is concatenated with the first element. This was important
for RTL strings, so that the entire string would be reversed, including the
first character.
- In RTL documents, the font of a space character is often in an LTR font. This
breaks up an RTL phrase into several text objects. Each word gets reversed and
written as a separate text element as the XML is generated, but Draw later
unifies them into a single text object. The result was that each word looks
fine, but the order of words was reversed. The patch fixes this by treating
spaces in the optimization stage as if they were in an RTL font, thus
concatenating all the words during the optimization stage. 
- Adds properties "style:font-family-complex", "style:font-weight-complex", and
"style:font-style-complex", since with the current code, the font properties
were being ignored in RTL documents.
- Removes some unused code.
- Treats a non-breaking space as a space, as described in issue 101327, only in
an additional place in the code. The patch there should be unapplied before
applying this patch.
- Adds a few more hints from font names to determine if bold, italic, or
regular. Optimized the code which checks the font name.

Comment 8 alan 2009-05-14 08:20:22 UTC

Created attachment 62255 [details]
Sample file with Hebrew and English text

Comment 9 philipp.lohmann 2009-05-14 08:39:44 UTC

great work, this should go into the next pdfimport extension CWS

Comment 10 alan 2009-05-17 17:39:54 UTC

See issue 102002 , which is a follow-up to this issue.

Comment 11 alan 2009-06-30 09:32:41 UTC

I'm attaching a new patch to replace the previous one. The new patch has the
features of the old patch, plus the following:

- Adds include files needed for Windows build
- For font recognition, checks for suffixes “PSMT”, and “MT”, and removes them
- Fixes reversed parentheses, brackets, etc. for RTL languages
- Checks for RTL paragraph not just before, but also after concatenation
- Streamlines code segment in optimization loop
- Fixes off-by-one bug  when checking if text object is RTL

Comment 12 alan 2009-06-30 09:34:09 UTC

Created attachment 63287 [details]
Has more fixes - replaces previous patch

Comment 13 wolframgarten 2009-10-08 11:41:22 UTC

*** Issue 105251 has been marked as a duplicate of this issue. ***

Comment 14 philipp.lohmann 2010-04-28 14:33:50 UTC

applied the second patch in CWS pdfextfix03 - aside from using vcl's
GetMirroredChar; we cannot really link vcl in an extension as that would break
binary compatibility.

Comment 15 philipp.lohmann 2010-06-09 10:44:39 UTC

verified in CWS vcl112

Comment 16 philipp.lohmann 2010-07-19 15:16:22 UTC

integrated in DEV300m83

closing

Comment 17 philipp.lohmann 2010-10-18 15:02:41 UTC

kaplan says, DEV300m89 (containing the attached patch) still shows the original
problem, namely reversed hebrew strings. Need to investigate this.

Comment 18 philipp.lohmann 2010-10-18 15:08:50 UTC

changing type and target

Comment 19 kaplanlior 2010-10-18 17:22:29 UTC

Might it be the "#if 0" part in 
sdext/source/pdfimport/tree/drawtreevisiting.cxx (see patch in lines @@ -80,27
+117,50 @@)
The change set is http://hg.services.openoffice.org/OOO330/rev/bd45002f7b96

Kaplan

Comment 20 philipp.lohmann 2010-10-18 17:55:51 UTC

Entirely possible. As stated in the comment we need a service to use the out
#ifdef'd GetMirroredChar.

Could you please attach a (preferably small) test PDF that will reproduce the
problem ? Or would you consider it sufficient if I create my own using
Insert->Special Characters with (random) hebrew characters ?

Comment 21 kaplanlior 2010-10-18 18:05:57 UTC

Created attachment 72095 [details]
small test pdf in hebrew

Comment 22 kaplanlior 2010-10-18 18:09:56 UTC

@pl: Attached a small test for Hebrew in PDF. Let me know if anything else is
needed.

I hope this could get this fixed for 3.3...

Comment 23 hennerdrewes 2010-10-18 20:23:38 UTC

@pl: actually there is another problematic piece in the code: 

a check is made for complex script type. If found, the string is reversed. But
complex script does not necessarily mean RTL (e.g. Thai).

Comment 24 philipp.lohmann 2010-10-19 10:06:15 UTC

good point, need to change that check also.

Comment 25 philipp.lohmann 2010-10-20 12:00:57 UTC

work in progress

@kaplan: the 3.3 release is luckily not the issue here; the extension gets
released separately. The fix will work with 3.3 as well as its preecessors or
successors.

Comment 26 philipp.lohmann 2010-10-20 13:31:05 UTC

Instead of getScriptType on the BreakIterator we can use the
CharacterClassification which can actually tell us whether the text is RTL or
not. I changed that, however this will probably break at some point where mixed
RTL/LTR comes into play.

Comment 27 philipp.lohmann 2010-10-20 19:05:27 UTC

fixed in  CWS pdfextfix04

Comment 28 philipp.lohmann 2010-10-20 19:06:11 UTC

please verify in CWS pdfextfix04

Comment 29 philipp.lohmann 2010-10-20 19:16:17 UTC

setting milestone to satisfy EIS (this will not affect the release of the extension)

Comment 30 wolframgarten 2010-10-21 13:32:26 UTC

Verified in CWS.