Bug 32789

Summary: [PATCH] Arabic Shaping not Supported by FOP
Product: Fop - Now in Jira Reporter: Lucy lue <yuanlue>
Component: page-master/layout    Assignee: fop-dev
Status: CLOSED FIXED    
Severity: normal CC: alaaeldinalex, khyas_muaiyed, yuanlue
Priority: P2    
Version: all   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Support for Arabic PDF rendering using ICU4J
Example fo - one of many used in testing Arabic patch
generated PDF from sample .fo
Better example of FO file that uses Arabic
Here is generated PDF from better example of FO file
arabic patch teaser - sample font metrics, arabic script processing code

Description Lucy lue 2004-12-21 11:24:07 UTC
FOP seems to lack support for multiple languages. 

Arabic is right to left, and there are rules for joining letters. 

We followed the steps listed in 
http://www.javaranch.com/journal/200409/CreatingMultipleLanguagePDFusingApacheFOP.html

However, the Arabic words display only as isolated letters; the words are 
broken, with no joining forms to connect one letter to the next...

Initially we thought it might be a font problem, so we downloaded quite a 
number of Arabic fonts (such as tradbdo.ttf, trado.ttf, nesf2.ttf, 
Nafees_Naskh.ttf, etc.). The font that contains the most glyphs is 
Nafees_Naskh.ttf, with 581 glyphs. However, after rendering the PDF with FOP, 
the words are still broken.
Comment 1 J.Pietschmann 2004-12-21 23:34:49 UTC
There is no BIDI support in FOP 0.20.5, which means not only no support for
fo:bidi-override and right-to-left writing modes but also no detection of
right-to-left scripts. There was a patch for the latter which required Java 1.4
(Java 1.3 doesn't have the necessary infrastructure). If you manage to find it,
you can use Arabic and other right-to-left scripts, but you still have to
remember to swap begin/end margins and similar properties for blocks explicitly.
Comment 2 Pascal Sancho 2010-01-06 00:18:02 UTC
*** Bug 48184 has been marked as a duplicate of this bug. ***
Comment 3 RN 2010-02-05 14:33:18 UTC
Created attachment 24934 [details]
Support for Arabic PDF rendering using ICU4J

This patch uses ICU4J to do form-shaping and BIDI transformation of rendered text.  It is a patch for the FOP trunk.   It does not change the layout manager or the area tree handler or allow a writing-mode other than “lr-tb”.   For this patch to be integrated with FOP, FOP would need to distribute the ICU4J library - icu4j-4_2_1.jar.   It affects both PDF and PCL rendering but has only been tested with PDF rendering.  So far results of testing with PDF rendering have been positive.  The PCL aspect of the patch looks correct given that the PDF aspect works.
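
For readers unfamiliar with ICU4J, the core of the approach looks roughly like the following sketch (an illustration only, not the attached patch itself; it uses the public com.ibm.icu.text.ArabicShaping and com.ibm.icu.text.Bidi classes):

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;

public final class ArabicTextSketch {

    /** Joins Arabic letters into contextual forms and reorders the text for display. */
    public static String shapeAndReorder(String logicalText) throws ArabicShapingException {
        // Replace isolated Arabic letters with contextual presentation forms,
        // working on text that is still in logical (memory) order.
        ArabicShaping shaper = new ArabicShaping(
                ArabicShaping.LETTERS_SHAPE | ArabicShaping.TEXT_DIRECTION_LOGICAL);
        String shaped = shaper.shape(logicalText);

        // Run the Unicode BIDI algorithm and write the text out in visual order.
        Bidi bidi = new Bidi(shaped, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.writeReordered(Bidi.DO_MIRRORING);
    }

    private ArabicTextSketch() { }
}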
Comment 4 J.Pietschmann 2010-02-07 05:12:27 UTC
I'd like to have run-time detection of whether ICU4J (and possibly also BIDI support)
is available, and either fail with a sensible error message or degrade the output
gracefully. Distributing ICU4J with FOP is a bit of a headache, but taking advantage
of it if the user has installed it separately is certainly OK.
Comment 5 Jonathan Levinson 2010-02-07 05:44:28 UTC
Why is distributing ICU4J with FOP a bit of a headache?

Would you use Java reflection to test whether the ICU4J classes this patch uses to support Arabic are available at run time, and would you use reflection to call the ICU4J methods if they are available?
Comment 6 J.Pietschmann 2010-02-07 07:11:11 UTC
(In reply to comment #5)
> Why is distributing ICU4J with FOP a bit of a headache?

The jar is somewhat large (6 MB), and only a small part of it would be used (although
other parts are useful too, like Thai support).
Anyway, bundling a dependency may be good for users who prefer a self-contained
installation, but it is usually a headache for people who want to package FOP with
their products; therefore I'd like to restrict bundled jars to a "minimal guaranteed
feature set."
 
> Would you use Java Relection to test whether the ICU4J classes this patch uses
> to support Arabic are available at run-time
Yes. Testing for a single, typically used class should be sufficient.

> and would you use reflection to
> call the ICU4J methods if they are available?
I don't think this is necessary, although I'm no longer on top of things
when it comes to knowing when a JVM loads classes.
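
Something along these lines should do for the detection (a sketch only; the probed class name is just one example of an ICU4J class):

final class Icu4jDetector {
    /** Returns true if ICU4J appears to be on the classpath. */
    static boolean isIcu4jAvailable() {
        try {
            // Probe for a single, typically used ICU4J class.
            Class.forName("com.ibm.icu.text.ArabicShaping");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    private Icu4jDetector() { }
}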
Comment 7 Vincent Hennebert 2010-02-08 04:13:10 UTC
Hi,

Thanks for your patch. Do you have an example FO file that could be used for testing purpose (even better, with an English translation)?

IIUC, Arabic shaping is about replacing the glyphs for standalone letters with suitable joined or ligature glyphs to build words. Surely that affects character widths, and therefore line-breaking decisions? In the patch, shaping is performed at the rendering stage, so isn't there a danger of getting inconsistent results?

Also, IIUC, Arabic shaping affects glyph selection. How do you make sure that the right glyphs are embedded in the PDF file?

The same piece of code is duplicated in the PCL and PDF painters. The same would probably also need to be done for other painters. This is not desirable.

Finally, what is the impact on performance? It looks like shaping will be applied to all text, even non-Arabic text.

Thanks,
Vincent


Comment 8 Jonathan Levinson 2010-02-08 06:58:15 UTC
Hi Vincent,

I will attach the .fo file I've been using for testing.  I will also attach the generated pdf.  This is from an example our Dubai team gave me for my own testing as I developed the code.

Our Dubai team has been testing with a large variety of Arabic script - but they are using a report creation tool that invokes fop.bat with xsl input so the .fo file isn't part of their output.

I could give them instructions for creating .fo files.

We have found in testing that what is most important is that the BIDI algorithm is applied, so that text (including embedded numerals) is in the right order, and that the form shaping is correct.  You need to know the Arabic alphabet and its rules to assess the output of testing.  We have a team that knows Arabic to do our testing.  They "eyeball" the reports to make sure they are in proper Arabic with text and sub-text in the right order.  Embedded numerals can be in a different order - left-to-right rather than right-to-left.  It isn't clear to me how this process can be automated.

You are right that widths change and this could change line-breaking decisions.  Do you know where in the FOP pipeline, before the rendering stage, the Arabic shaping could go so that it can affect width calculations?

I believe that what ensures the right glyphs are embedded in the PDF file is the nature of the ICU4J algorithm, which transforms the Unicode representation of the string.  The output for our Dubai team is PDFs with embedded fonts and these are working, so ICU4J must have solved the problem in some way; I believe it does so by substituting different Unicode code points.

I don't have performance numbers to give you yet.  If ICU4J implemented its transform algorithm cleverly, there should not be much of a performance impact, since it only needs to transform text in the Arabic Unicode range, and testing whether text is in this range should be quick.
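
For what it's worth, the kind of range test I have in mind is as cheap as this (a sketch only, not the patch code):

final class ArabicDetector {
    /** Quick check: does the string contain any character from the basic Arabic block? */
    static boolean containsArabic(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // The basic Arabic block is U+0600..U+06FF.
            if (c >= '\u0600' && c <= '\u06FF') {
                return true;
            }
        }
        return false;
    }

    private ArabicDetector() { }
}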

Thanks,
Jonathan

Comment 9 Jonathan Levinson 2010-02-08 07:00:33 UTC
Created attachment 24947 [details]
Example fo - one of many used in testing Arabic patch
Comment 10 Jonathan Levinson 2010-02-08 07:01:31 UTC
Created attachment 24948 [details]
generated PDF from sample .fo
Comment 11 Vincent Hennebert 2010-02-11 12:25:40 UTC
Hi Jonathan,

(In reply to comment #8)
> Hi Vincent,
> 
> I will attach the .fo file I've been using for testing.  I will also attach the
> generated pdf.  This is from an example our Dubai team gave me for my own
> testing as I developed the code.

Well... It's a bit light for an example. Just a single word...


> Our Dubai team has been testing with a large variety of Arabic script - but
> they are using a report creation tool that invokes fop.bat with xsl input so
> the .fo file isn't part of their output.
> 
> I could give them instructions for creating .fo files.
> 
> We have found in testing that what is most important is that the BIDI
> algorithm is applied, so that text (including embedded numerals) is in the
> right order, and that the form shaping is correct.  You need to know the
> Arabic alphabet and its rules to assess the output of testing.  We have a team
> that knows Arabic to do our testing.  They "eyeball" the reports to make sure
> they are in proper Arabic with text and sub-text in the right order.  Embedded
> numerals can be in a different order - left-to-right rather than right-to-left.
> It isn't clear to me how this process can be automated.
> 
> You are right that widths change and this could change line-breaking
> decisions.  Do you know where in the FOP pipeline, before the rendering stage,
> the Arabic shaping could go so that it can affect width calculations?

Something needs to be done in the layout engine, possibly also on the FO tree. At least section 5.8 ("Unicode BIDI Processing") of XSL-FO 1.1 deserves a look, as it explains how the Unicode algorithm should be blended into XSL-FO processing. Inline-level content is likely to be affected. It remains to be seen how and when character re-ordering should be done with respect to line breaking.

Also, something might need to be done at the font level. I don't know what ICU4J does, but I suspect it replaces characters from the Arabic range (U+0600–U+06FF) with ones from Arabic Presentation Forms-A (U+FB50–U+FDFF). AFAIU from the Unicode specification this is legacy that may not be supported by every font. I suppose modern fonts (especially OpenType ones) use the standard ligature mechanism to provide contextual glyphs.
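
For instance (code points taken from the Unicode charts; a purely illustrative snippet), the positional variants of a basic letter such as BEH live in the neighbouring Presentation Forms-B block (U+FE70–U+FEFF):

/** Illustration only: ARABIC LETTER BEH and its positional presentation forms. */
final class BehForms {
    static final char BEH          = '\u0628'; // basic Arabic block (U+0600-U+06FF)
    static final char BEH_ISOLATED = '\uFE8F'; // Arabic Presentation Forms-B
    static final char BEH_FINAL    = '\uFE90';
    static final char BEH_INITIAL  = '\uFE91';
    static final char BEH_MEDIAL   = '\uFE92';

    private BehForms() { }
}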


> I believe that what ensures the right glyphs are embedded in the PDF file is
> the nature of the ICU4J algorithm, which transforms the Unicode representation
> of the string.  The output for our Dubai team is PDFs with embedded fonts and
> these are working, so ICU4J must have solved the problem in some way; I
> believe it does so by substituting different Unicode code points.

Actually this is taken care of by the font library called by PDFPainter. I suspect the same is done at the layout stage, with the standalone glyphs. Which would be suboptimal, as both standalone and contextual glyphs would be embedded in the final PDF.


Vincent
Comment 12 Jonathan Levinson 2010-02-12 01:10:43 UTC
Hi Vincent,

Before committing the work I did on Arabic to the trunk, the Apache FOP organization seems to want five things:

1)	Modify ICU4J change to check if classes available and if not don't call them

2)	Provide Apache organization with performance data to assess performance cost of Arabic Shaping classes

3)	Provide Apache organization with better examples of use of Arabic

4)	Move Arabic form shaping and BIDI algorithm to layout manager

5)	Not use ICU4J to do UNICODE transformation but use the standard ligature mechanism to provide contextual glyphs.  This is a request for a complete rewrite of the patch to use a mechanism that isn't known to me currently, but maybe could become known if I had the right pointers.

(4) is highly non-trivial.  I haven’t a clue as to how to do (5).

For (4), could you point me at the source code files in the layout manager that would have to be changed?  Can you give me some pointers as to where this sort of information is processed by the layout manager?  I've read the layout manager code and tried to locate where it processes the widths of characters and what would have to change to support right-to-left printing, but I've been unable to see the forest for the trees.  I have read Knuth's line-breaking algorithm and I think I have a good understanding of what a KnuthElement is - boxes, glue, penalties and the basics of Knuth's algorithm - but I'm having trouble converting this theoretical understanding into a practical understanding of what has to change in the code to make printing go from right to left.

I'm not sure what to do about (5).  Do you have any references, or a pointer to an algorithm that would do more than a Unicode transformation and would select contextual glyphs based on the glyphs in a font?  How do I tell the characteristics of an Arabic character in a font, i.e. whether it is in initial, medial or final position?  I suppose this information varies from font to font.  Where in FOP is font information like this processed, and how do I "tell" a font that I want the Arabic character at Unicode position X, but rendered in its final form?  Does the layout manager actually process the font information about a character?  I suppose it must, in order to know character widths, which are necessary for Knuth's algorithm, but please forgive me, I don't see where this code lives.  FOP has over 11,000 files!

I used ICU4J to avoid having to write a ton of code.  That is why my patch is so small.

I'm not complaining.  I'm hoping I can get some more pointers to what changes need to be made to support Arabic and where the changes have to go.  Even if I'm not the one who eventually does the work, whoever eventually implements right-to-left printing and Arabic support will certainly find our discussion valuable.  I'm sure you'll agree that FOP needs to become truly international at some point. That would really open a new community of users to the benefits of FOP, which are considerable.

In fact, I agree that it is hard to see how there can be a robust solution to the problem of printing Arabic text that simply involves the PDF renderer; theoretically and probably practically the layout manager has to be involved.

I've looked at the FOP SVG rendering code which tries to do Arabic form shaping, and it seems to be just doing Unicode transformations.  It doesn't seem to exploit the font capability you are discussing, namely the ability to display a single Unicode code point in many different forms.  It seems to be doing a simple table lookup that transforms one Unicode code point into another.  So you already have code in FOP, in your SVG renderer, that seems to do the same thing I tried to do using ICU4J.  This doesn't mean the code I wrote using ICU4J is doing the right thing, but it does mean that simply transforming one Unicode code point into another is the simplest first step in solving this difficult problem.

Could we agree that we could live with an ICU4J approach if (1), (2), (3), and (4) were met as conditions, and that (5), a complete rewrite using modern font techniques, could be deferred?  Of course, I'm interested in learning how I could achieve (5); I'm not dismissing it, I'm just looking for a bottom line that would allow FOP to practically meet the needs of rendering Arabic text, even if the result isn't perfect yet.  

Best Regards,
Jonathan

Comment 13 Vincent Hennebert 2010-02-12 11:15:07 UTC
Hi Jonathan,

I'm lacking the knowledge to properly answer all of your questions, but I'll try anyway.

(In reply to comment #12)
> Hi Vincent,
> 
> Before committing the work I did on Arabic to the trunk, the Apache FOP
> organization seems to want five things:
> 
> 1)    Modify ICU4J change to check if classes available and if not don't call
> them

I would leave that aside for now. This can be done in the last refinements, once everything else is in place.


> 2)    Provide Apache organization with performance data to assess performance
> cost of Arabic Shaping classes
> 
> 3)    Provide Apache organization with better examples of use of Arabic
> 
> 4)    Move Arabic form shaping and BIDI algorithm to layout manager
> 
> 5)    Not use ICU4J to do UNICODE transformation but use the standard ligature
> mechanism to provide contextual glyphs.  This is a request for a complete
> rewrite of the patch to use a mechanism that isn't known to me currently, but
> maybe could become known if I had the right pointers.
> 
> (4) is highly non-trivial.  I haven’t a clue as to how to do (5).

I didn't say it was trivial :-)


> For (4), could you point me at the source code files in the layout manager
> that would have to be changed?  Can you give me some pointers as to where this
> sort of information is processed by the layout manager?  I've read the layout
> manager code and tried to locate where it processes the widths of characters
> and what would have to change to support right-to-left printing, but I've been
> unable to see the forest for the trees.  I have read Knuth's line-breaking
> algorithm and I think I have a good understanding of what a KnuthElement is -
> boxes, glue, penalties and the basics of Knuth's algorithm - but I'm having
> trouble converting this theoretical understanding into a practical
> understanding of what has to change in the code to make printing go from right
> to left.

I don't really know myself where to look either. Without talking about the FOP code yet, it must be seen how to do character re-ordering, line breaking and glyph shaping. The three processes probably have an impact on each other. Does a glyph change depending on whether it is at the end of a line or not? How does hyphenation work (apparently it only applies to the Uighur script)? Also, contrary to Western scripts, I think justification is not done by increasing inter-word spaces, but by using wider alternative glyphs.
Obviously, the appropriate sections of the XSL-FO Recommendation need to be studied, as well as the Unicode Standard (in particular, UAX #9 about the Bidirectional Algorithm). And also other resources on the web.

You can use the Wiki to gather your thoughts:
http://wiki.apache.org/xmlgraphics-fop/DeveloperPages


> I'm not sure what to do about (5).  Do you have any references, or a pointer
> to an algorithm that would do more than a Unicode transformation and would
> select contextual glyphs based on the glyphs in a font?  How do I tell the
> characteristics of an Arabic character in a font, i.e. whether it is in
> initial, medial or final position?  I suppose this information varies from
> font to font.  Where in FOP is font information like this processed, and how
> do I "tell" a font that I want the Arabic character at Unicode position X, but
> rendered in its final form?  Does the layout manager actually process the font
> information about a character?  I suppose it must, in order to know character
> widths, which are necessary for Knuth's algorithm, but please forgive me, I
> don't see where this code lives.  FOP has over 11,000 files!

I'm almost sure that the OpenType font format provides the necessary mechanisms to do contextual glyph shaping. But I've never really looked into it. I guess one mechanism or the other will have to be selected depending on which one the font supports.


> I used ICU4J to avoid having to write a ton of code.  That is why my patch is
> so small.

Which is a good idea. We will probably need ICU4J anyway. But there /is/ going to be a lot of code to write, simply because the whole issue is anything but trivial. Some heavy refactoring of the layout code will probably be needed, too.


> I'm not complaining.  I'm hoping I can get some more pointers to what changes
> need to be made to support Arabic and where the changes have to go.  Even if
> I'm not the one who eventually does the work, whoever eventually implements
> right-to-left printing and Arabic support will certainly find our discussion
> valuable.  I'm sure you'll agree that FOP needs to become truly international
> at some point. That would really open a new community of users to the benefits
> of FOP, which are considerable.
> 
> In fact, I agree that it is hard to see how there can be a robust solution to
> the problem of printing Arabic text that simply involves the PDF renderer;
> theoretically and probably practically the layout manager has to be involved.
> 
> I've looked at the FOP SVG rendering code which tries to do Arabic form
> shaping, and it seems to be just doing Unicode transformations.  It doesn't
> seem to exploit the font capability you are discussing, namely the ability to
> display a single Unicode code point in many different forms.  It seems to be
> doing a simple table lookup that transforms one Unicode code point into
> another.  So you already have code in FOP, in your SVG renderer, that seems to
> do the same thing I tried to do using ICU4J.  This doesn't mean the code I
> wrote using ICU4J is doing the right thing, but it does mean that simply
> transforming one Unicode code point into another is the simplest first step in
> solving this difficult problem.
> 
> Could we agree that we could live with an ICU4J approach if (1), (2), (3), and
> (4) were met as conditions, and that (5), a complete rewrite using modern font
> techniques, could be deferred?  Of course, I'm interested in learning how I
> could achieve (5); I'm not dismissing it, I'm just looking for a bottom line
> that would allow FOP to practically meet the needs of rendering Arabic text,
> even if the result isn't perfect yet.

(5) can surely be left aside for now, as long as the code structure allows it to be implemented and plugged in later on, with a transparent switch from one mechanism to the other depending on the font used.
  

> Best Regards,
> Jonathan

HTH,
Vincent
Comment 14 Simon Pepping 2010-02-12 19:48:16 UTC
You touch on many complicated points. I will try to give some hints regarding item 4), moving Arabic form shaping and the BIDI algorithm into the layout manager. Since I worked with this code a while ago, I will go with what I remember.

The relevant layout managers are LineLayoutManager and TextLayoutManager. LineLayoutManager initiates the line-breaking algorithm. The input is the string, which is converted into boxes using font width information. Here the right-to-left text must be presented in boxes in the proper order. In fully Arabic paragraphs this could be right-to-left order, but then the breakpoints must be interpreted as breaks at the left-hand side. In mixed paragraphs left-to-right order may be best, unless left-hand-side breaks are again used. The algorithm itself is basically agnostic about writing order.
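
To make that concrete, here is a sketch (with a hypothetical metrics interface standing in for FOP's font classes, not actual FOP code) of computing a box width from the shaped form of a word rather than from its isolated letter forms:

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;

final class ShapedWidthSketch {
    /** Hypothetical metrics interface standing in for FOP's font classes. */
    interface CharWidths {
        int width(char c); // e.g. width in millipoints
    }

    /** Width of a word measured after shaping, so line breaking sees the true box width. */
    static int shapedWordWidth(String word, CharWidths font) throws ArabicShapingException {
        String shaped = new ArabicShaping(
                ArabicShaping.LETTERS_SHAPE | ArabicShaping.TEXT_DIRECTION_LOGICAL).shape(word);
        int total = 0;
        for (int i = 0; i < shaped.length(); i++) {
            total += font.width(shaped.charAt(i));
        }
        return total;
    }

    private ShapedWidthSketch() { }
}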

In LM.addAreas the areas for the words and lines are created. I suppose they contain the string to be rendered, so that the renderers can insert glyphs.

The line-breaking algorithm needs to know pre-break, post-break and no-break pieces. In Western texts this is done using penalty widths. I think we did something with post-break pieces, but I do not remember precisely. You should find out whether the outcome of the line-breaking algorithm, especially the stretch of a line, contains sufficient information for Arabic rendering.

On further points: if we can use work done by ICU, then by all means let us do that. I am a bit confused about the options to use either ICU or OpenType font info. The latter is more focused on the particular font, I would guess. But if ICU's string transformation does the trick for all fonts, we can use it.

Most of all, let us go with a good solution. Better is the enemy of good. I would rather have a good solution than the best but not yet available solution.
Comment 15 Jonathan Levinson 2010-02-17 14:27:28 UTC
Created attachment 25010 [details]
Better example of FO file that uses Arabic

I will also attach generated PDF.
Comment 16 Jonathan Levinson 2010-02-17 14:28:37 UTC
Created attachment 25011 [details]
Here is generated PDF from better example of FO file
Comment 17 Glenn Adams 2010-04-12 01:00:37 UTC
FYI, I am preparing a candidate patch that will add direct support for Arabic (and other complex scripts). This primarily involves making use of the advanced typographic tables present in TrueType and OpenType fonts (e.g., 'mort', 'morx', 'GSUB' and 'GPOS' tables). Initial support will focus on use of the GSUB table. This patch will not have any external dependencies, i.e., it does not make use of ICU4J.

Regards,
Glenn Adams
Comment 18 Sachin Sharma 2010-04-30 01:42:49 UTC
(In reply to comment #17)

Glenn,
Can you please provide more information on when and how this update will be available?

Regards,
Sachin Sharma.
Comment 19 Glenn Adams 2010-04-30 03:21:13 UTC
basically, the patch will do the following (in summary):

* enhance org.apache.fop.fonts.truetype.TTFFile in order to read the OpenType GSUB and GPOS tables, creating new org.apache.fop.fonts.GlyphTable instances which are added to MultiByteFont instances;

* enhance org.apache.fop.fonts.apps.TTFReader in order to write out XML representation of this new data into the FOP metrics file;

* enhance org.apache.fop.fonts.FontReader to read the new GSUB/GPOS data stored in the FOP metrics file;

* enhance the knuth elements generation in org.apache.fop.layoutmgr.inline.TextLayoutManager, specifically, #processWord, in order to perform substitution processing, which, if the current font supports substitution, causes the font to invoke substitution processing using the new metrics; this substitution process is a multi-stage process starting with a mapping from a sequence of character codes to a sequence of glyph indices, followed by one or more mappings from sequence of glyph indices to sequences of glyph indices, and finally mapping back to a sequence of character codes denoting the final mapped glyphs to be used;

* similarly, if font supports these new metrics, then perform glyph positioning process to produce sequence of [dx,dy] adjustments to apply, the application of which follows a somewhat updated logic to handle both X and Y advancements on a per-resultant-glyph (= per output character) basis;

* implement bidi algorithm specified in XSL-FO 1.1 Section 5.8 "Unicode Bidi Processing", which essentially involves resolving the final inline-progression-direction for each glyph or inline area child of an inline area and each inline child of a line area;

* enhance area generation process to make use of the inline-progression-direction produced by bidi processing in order to reorder areas to satisfy unicode bidi semantics (both explicit and implied);

initially, i am testing against the set of Arabic fonts shipped with Windows 7; but I expect to work with a few other fonts that have GSUB/GPOS tables as well; i am actually doing this work on MacOSX 10.6, so at some point I would hope to add support for the TrueType GX tables known as 'mort' and 'morx' which perform similar processes;

note that these processes (substitution/positioning/etc) allow support for a number of complex scripts, not just arabic script; e.g., the indic scripts, southeast asian, mongolian, tibetan scripts, etc, and also advanced typographic effects on latin, greek, cyrillic, etc., and east asian scripts (e.g., JISX4051) are supported by these processes as well; nevertheless, in order to make use of specific sub-tables of GSUB/GPOS, it is necessary to make use of script specific processing; therefore, I have implemented a mechanism to make use of script information, either supplied by the XSL-FO script property or, by default, scanning the characters to determine their dominant script; I have started by implementing this general mechanism and also specific Arabic and Default script processors; it will then be straightforward to add other script specific support;
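
as a rough illustration (not the actual patch code), the default scan for a dominant script could be as simple as counting which Unicode block the letters fall in:

final class ScriptGuess {
    /** Crude default: true if more than half of the letters fall in the Arabic block. */
    static boolean isDominantlyArabic(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                letters++;
                if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC) {
                    arabic++;
                }
            }
        }
        return letters > 0 && arabic * 2 > letters;
    }

    private ScriptGuess() { }
}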

i don't have a fixed schedule, but I have most of the GSUB code working and tested; i am wrapping up the bidi algorithm work now, and when that is complete I will submit a patch for potential incorporation into the trunk; i am hoping to have that patch done within the next 2 to 4 weeks' time;

as a teaser, i will add an attachment containing several files, one showing an FOP metrics file with the new data (see the <script-extras/> element); the others show the GlyphSubstitutionTable and ArabicScriptProcessor classes, which are not functionally complete, but are sufficiently complete to perform basic Arabic glyphs substitution (but not yet ligature processing);

regarding how it will be used, you will need to:

* possess or have access to a font in the form of a TTF file that contains GSUB/GPOS metrics; if it is to be used with Arabic, then it should contain the GSUB lookup tables for the following features: 'isol', 'init', 'medi', 'fina', and 'liga';

* create the FOP font metrics file for it by using the org.apache.fop.fonts.apps.TTFReader application;

* update your FOP configuration file as needed to refer to the new font and metrics;

* reference the font as usual using XSL-FO properties;

* where necessary, add explicit use of <fo:bidi-override/> in order to override the default Unicode bidi logic; e.g., to override implicit directionality or to create embedding levels; you can also make use of the explicit Unicode bidi control characters, LRO RLO LRE RLE and PDF, but it is better to use explicit markup with <fo:bidi-override/>;

* where necessary, to force or prevent joining behavior when the default would not join or would join, you can use the ZWJ and ZWNJ Unicode controls; however, forced joining must be supported by the font to have an effect, while forced non-joining doesn't depend on the font (though if the font did not support joining of two characters in the first place then ZWNJ would have no visible effect);
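
for reference (an illustration only, not part of the patch), the code points of the bidi and joining controls mentioned in the last two items are:

/** Unicode bidi and joining control characters referenced above (illustration only). */
final class BidiControls {
    static final char LRE  = '\u202A'; // LEFT-TO-RIGHT EMBEDDING
    static final char RLE  = '\u202B'; // RIGHT-TO-LEFT EMBEDDING
    static final char PDF  = '\u202C'; // POP DIRECTIONAL FORMATTING
    static final char LRO  = '\u202D'; // LEFT-TO-RIGHT OVERRIDE
    static final char RLO  = '\u202E'; // RIGHT-TO-LEFT OVERRIDE
    static final char ZWNJ = '\u200C'; // ZERO WIDTH NON-JOINER
    static final char ZWJ  = '\u200D'; // ZERO WIDTH JOINER

    private BidiControls() { }
}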

regards,
glenn

p.s. I have an ICLA on file with the apache office;

(In reply to comment #18)
> (In reply to comment #17)
> > FYI, I am preparing a candidate patch that will add direct support for Arabic
> > (and other complex scripts). This primarily involves making use of the advanced
> > typographic tables present in TrueType and OpenType fonts (e.g., 'mort',
> > 'morx', 'GSUB' and 'GPOS' tables). Initial support will focus on use of the
> > GSUB table. This patch will not have any external dependencies, i.e., it does
> > not make use of ICU4J.
> > 
> > Regards,
> > Glenn Adams
> 
> 
> Glenn,
> Can you please provide more information on this as to when & how will this
> update be available?
> 
> Regards,
> Sachin Sharma.
Comment 20 Glenn Adams 2010-04-30 03:28:52 UTC
Created attachment 25380 [details]
arabic patch teaser - sample font metrics, arabic script processing code
Comment 21 Sachin Sharma 2010-05-12 08:05:50 UTC
(In reply to comment #20)
> Created an attachment (id=25380) [details]
> arabic patch teaser - sample font metrics, arabic script processing code

Glenn,
Thanks for the update.

Regards,
Sachin Sharma.
Comment 22 Dharmesh Rana 2010-05-21 01:20:21 UTC
Hi Glenn,

Thanks for the Arabic patch teaser attachment; this could be very useful for rendering Indic text and the other scripts mentioned in your comments. For the last few days we have been trying hard to render the correct letters for Indic text and other scripts too.

Can you please provide more detailed information on how this patch can be applied to, or adapted for, FOP 0.95?

It would also help if you could provide more information about the changes to the following classes:

1) org.apache.fop.fonts.truetype.TTFFile
2) org.apache.fop.fonts.apps.TTFReader
3) org.apache.fop.fonts.FontReader
4) org.apache.fop.layoutmgr.inline.TextLayoutManager

Regards,
Dharmesh Rana
Comment 23 Dharmesh Rana 2010-06-23 00:43:58 UTC
Hi Glenn,

When is this Arabic patch going to be released?

Regards,
Dharmesh Rana

Comment 24 Glenn Adams 2010-08-02 05:48:48 UTC
See also the patch at the following location, which addresses my prior comments on this thread:

https://issues.apache.org/bugzilla/show_bug.cgi?id=49687

Regards,
Glenn
Comment 25 Glenn Adams 2012-02-27 18:09:38 UTC
Added complex script support (bidi, shaping, etc) at revision 1293736.