Bug 51843

Summary: Surrogate pairs not treated as single unicode codepoint for display purposes
Product: Fop - Now in Jira
Reporter: Glenn Adams <gadams>
Component: general
Assignee: fop-dev
Status: NEW ---
Severity: enhancement
CC: alex.giotis, maloklam, saasha
Priority: P5
Version: trunk
Target Milestone: ---
Hardware: All
OS: All
Attachments: Sample XSL file to generate Chinese character. It use "Mingliu" Chinese fonts
Result PDF file
Sample XML file contains both PUA and non-PUA chinese character
Sample FO file to generate PDF file
Sample FO file to generate PDF
Result PDF file

Description Glenn Adams 2011-09-19 01:35:24 UTC
Unicode code points outside the BMP (Basic Multilingual Plane), i.e., those whose scalar value is greater than 0xFFFF (65535), are encoded as UTF-16 surrogate pairs in Java strings; each such pair should be treated as a single code point for the purpose of mapping to a glyph in a font (that supports extra-BMP mappings);

at present, FOP does not correctly handle this case in simple (non complex script) rendering paths;

furthermore, though some support has been added to handle this in the complex script rendering path, it has not yet been tested, so is not necessarily working there either;
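
To illustrate the distinction drawn in the description, here is a minimal Java sketch of walking a string by code point rather than by UTF-16 code unit. The class and method names are hypothetical and are not taken from the FOP code base; only standard `java.lang` calls are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of FOP.
final class CodePointWalk {

    /** Splits a UTF-16 string into code points, joining each surrogate pair into one value. */
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // reads one or two chars, as needed
            out.add(cp);
            i += Character.charCount(cp);     // advance by 1 (BMP) or 2 (surrogate pair)
        }
        return out;
    }

    public static void main(String[] args) {
        // 'A' followed by one supplementary-plane ideograph
        String text = "A" + new String(Character.toChars(0x2070E));
        System.out.println("UTF-16 code units: " + text.length());
        System.out.println("code points:       " + codePoints(text).size());
    }
}
```

A glyph lookup that indexes the string with `charAt` sees the two halves of the pair as separate, unmappable characters, whereas a lookup driven by a loop like the one above sees a single code point.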
Comment 1 Glenn Adams 2012-04-07 01:45:16 UTC
resetting P2 open bugs to P3 pending further review
Comment 2 Thomas T. 2012-06-08 09:52:57 UTC
request to fix this to support surrogate pairs characters.
Comment 3 Glenn Adams 2012-06-08 13:06:44 UTC
(In reply to comment #2)
> request to fix this to support surrogate pairs characters.

thanks for your request; could you provide additional information:

1. what specific non-BPM characters you would like to use?
2. what specific fonts will you use for these characters?
Comment 4 Glenn Adams 2012-06-08 13:17:31 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > request to fix this to support surrogate pairs characters.
> 
> thanks for your request; could you provide additional information:
> 
> 1. what specific non-BPM characters you would like to use?
> 2. what specific fonts will you use for these characters?

s/BPM/BMP/
Comment 5 Saašha Metsärantala 2012-06-08 14:56:41 UTC
Hello!

Today, the majority of Unicode's characters are outside the BMP. This involves many alphabets and other character sets. Here are links to two of these non-BMP planes:
http://www.unicode.org/roadmaps/smp/ and
http://www.unicode.org/roadmaps/sip/ and more are to come because dozens of alphabets are not part of Unicode, yet.

Today's FOP supports no more than a minority of Unicode's characters and this minority will become proportionally less and less in the future. I consider that there is a need to solve this problem in the long run.

Trying to "solve" the problem for some specific non-BMP characters will lead to this problem coming back again and again ...

I will use FOP *much* more often as soon as it supports non-BMP characters.

Regards!

Saašha,
Comment 6 Glenn Adams 2012-06-08 15:23:14 UTC
(In reply to comment #5)
> Hello!
> 
> Today, the majority of Unicode's characters are outside the BMP. This
> involves many alphabets and other character sets. Here are links to two of
> these non-BMP planes:
> http://www.unicode.org/roadmaps/smp/ and
> http://www.unicode.org/roadmaps/sip/ and more are to come because dozens of
> alphabets are not part of Unicode, yet.
> 
> Today's FOP supports no more than a minority of Unicode's characters and
> this minority will become proportionally less and less in the future. I
> consider that there is a need to solve this problem in the long run.
> 
> Trying to "solve" the problem for some specific non-BMP characters will lead
> to this problem coming back again and again ...
> 
> I will use FOP *much* more often as soon as it supports non-BMP characters.
> 
> Regards!
> 
> Saašha,

I'm sorry Saašha but I do not accept the rationale of your argument. First,
FOP supports the representation of all BMP characters which is the vast
majority of modern usage, >99.994%.

If you cannot demonstrate to me a real, current need to use non-BMP characters
or cannot demonstrate a font that actually supports these character mappings
that you need to use, then I will leave this bug prioritized low (P5).

If you wish to contribute a patch that adds non-BMP support, then the FOP
team would be happy to apply it. In the mean time, you shall have to wait
until this enhancement gets higher in the priority queue, and that will have
to await many other enhancements in my opinion, such as finishing support
for complex scripts, adding full CJK support, etc.
Comment 7 Saašha Metsärantala 2012-06-08 17:50:41 UTC
Hello!

Thanks for your reply! Here are a few clarifications!

> the vast majority of modern usage,
A great deal of software does not support non-BMP characters. I would like to clarify that FOP is not the only one. The fact that non-BMP characters are poorly supported is among the main reasons why non-BMP characters are seldom encoded as such. Instead, workarounds are used. For example, non-BMP characters are often converted to parts of the so-called "private use area" (U+E000 to U+F8FF) before being processed. Sometimes "font tricks" are used, where the glyphs of one alphabet are just copied to a BMP alphabet's place, reminiscent of the early nineties, when Greek and Cyrillic glyphs (among others) often lived in "ASCII" fonts. Sometimes they are replaced by PNGs. All these workarounds cause a lot of confusion, contribute to the "non-visibility" of these alphabets, and make it very difficult to find text written with these character sets.

In other words, the poor support for non-BMP characters is indeed one of the main reasons for their "non-visibility". It is important to avoid misinterpretations here: these characters are both used and useful.

> demonstrate to me a real, current need to use non-BMP characters
To be accepted as part of Unicode, an alphabet or other character set (such as mathematical symbols, etc.) needs to be supported by a VERY active community for a long time. Otherwise, the Unicode consortium does not include this alphabet. The very fact that Unicode includes non-BMP alphabets and other character sets is proof that an active community needs those characters.

On the other hand, the fact that dozens of alphabets are still absent from Unicode should not be misinterpreted as non-usage of these alphabets.

> adding full CJK support,
Thousands of CJK characters live outside the BMP. Full CJK support requires support for non-BMP characters.

> If you wish to contribute a patch that adds non-BMP support,
I plan to try to write some kind of fix this summer.

Regards!

Saašha,
Comment 8 Glenn Adams 2012-06-08 18:03:53 UTC
(In reply to comment #7)
> Hello!
> 
> Thanks for your reply! Here are a few clarifications!
> 
> > the vast majority of modern usage,
> A great deal of software does not support non-BMP characters. I would like
> to clarify that FOP is not the only one. The fact that non-BMP characters are
> poorly supported is among the main reasons why non-BMP characters are seldom
> encoded as such. Instead, workarounds are used. For example, non-BMP
> characters are often converted to parts of the so-called "private use area"
> (U+E000 to U+F8FF) before being processed. Sometimes "font tricks" are
> used, where the glyphs of one alphabet are just copied to a BMP alphabet's
> place, reminiscent of the early nineties, when Greek and Cyrillic glyphs
> (among others) often lived in "ASCII" fonts. Sometimes they are replaced by
> PNGs. All these workarounds cause a lot of confusion, contribute to the
> "non-visibility" of these alphabets, and make it very difficult to find text
> written with these character sets.
> 
> In other words, the poor support for non-BMP characters is indeed one of the
> main reasons for their "non-visibility". It is important to avoid
> misinterpretations here: these characters are both used and useful.
> 
> > demonstrate to me a real, current need to use non-BMP characters
> To be accepted as part of Unicode, an alphabet or other character set (such
> as mathematical symbols, etc.) needs to be supported by a VERY active
> community for a long time. Otherwise, the Unicode consortium does not
> include this alphabet. The very fact that Unicode includes non-BMP alphabets
> and other character sets is proof that an active community needs those
> characters.
> 
> On the other hand, the fact that dozens of alphabets are still absent from
> Unicode should not be misinterpreted as non-usage of these alphabets.
> 
> > adding full CJK support,
> Thousands of CJK characters live outside the BMP. Full CJK support
> requires support for non-BMP characters.
> 
> > If you wish to contribute a patch that adds non-BMP support,
> I plan to try to write some kind of fix this summer.
> 
> Regards!
> 
> Saašha,

again you are giving me general reasons, but not specific ones that drive your immediate needs; i am extremely familiar with Unicode, having been a co-author of Unicode 2.0, a technical director of the Unicode consortium from 93-98, and Unicode's representative to the ISO SC2/WG2 IRG (Ideographic Rapporteur Group), who created the CJK encodings in Unicode;

i want to know specifically what non-BMP characters *you* want to use and what specific fonts *you* will use to print these non-BMP characters; if you can demonstrate a good, real need (as opposed to generalities), then perhaps I will be inclined to give non-BMP support a greater priority; if not, I will continue to assign higher priority to other features that better support non-Roman scripts that use the BMP; regarding CJK and non-BMP, I agree that it is useful to support those characters, however, i'd like to see fonts that are available for these characters first;
Comment 9 ngkit 2012-06-12 03:20:39 UTC
Hello,
I have used the FOP library to generate PDF files for several years. It has been a great library for the task. However, I recently found some "?" characters in the PDF files. I tried to find the root cause: the character codes of the problem characters are not the same as the ones I used previously. According to a Microsoft document (http://www.microsoft.com/en-us/download/details.aspx?id=12080), some of the characters can be represented by either a PUA code or a Unicode 4.1 code. The PUA is just a backward-compatible solution, and it seems PUA support is going to fade out in the near future. So is it possible to give this enhancement a higher priority?
Kit
Comment 10 Glenn Adams 2012-06-12 03:24:58 UTC
(In reply to comment #9)
> Hello,
> I have used the FOP library to generate PDF files for several years. It has
> been a great library for the task. However, I recently found some "?"
> characters in the PDF files. I tried to find the root cause: the character
> codes of the problem characters are not the same as the ones I used
> previously. According to a Microsoft document
> (http://www.microsoft.com/en-us/download/details.aspx?id=12080), some of the
> characters can be represented by either a PUA code or a Unicode 4.1 code. The
> PUA is just a backward-compatible solution, and it seems PUA support is going
> to fade out in the near future. So is it possible to give this enhancement a
> higher priority?
> Kit

I don't understand your comment. You need to provide more details to know if you have a problem or not, and if you do, whether it relates to this bug or not. If you have a problem with a specific input FO file, then attach that file along with the PDF file you obtain when running FOP. Also attach any console output. Once you do these things, I can evaluate whether your problem is legitimate or not and whether it is related or not.
Comment 11 ngkit 2012-06-12 04:20:50 UTC
Hello,
Thanks for your comment, and sorry for my misleading message and poor English.
Here is my problem:
When the XML data file contains a Chinese character whose code point is not in the PUA, "?" is displayed.
Here is the font library information:
http://www.microsoft.com/en-us/download/details.aspx?id=12080
And here is the character I failed to generate:
Unicode code point (hex): 2070E

According to the above URL, the old PUA characters have been moved to non-PUA code point assignments. It seems that Chinese characters in the PUA will not receive any enhancement or support in the future. So is it possible to give this enhancement (support for surrogate pair characters) a higher priority?

Cheers, 
Kit
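
For reference, the code point reported above (2070E hex) lies outside the BMP, so a Java string stores it as two UTF-16 code units. The sketch below shows the standard UTF-16 arithmetic; the class and method names are hypothetical, and in real code `Character.toChars` does the same job.

```java
// Hypothetical helper for illustration only; not part of FOP.
final class SurrogateMath {

    /** Returns {high surrogate, low surrogate} for a supplementary code point. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit offset into the supplementary planes
        char high = (char) (0xD800 + (v >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x2070E);
        // prints "D841 DF0E"
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
    }
}
```

Software that treats each of these two code units as a separate character finds no glyph for either one, which would be consistent with a "?" appearing in the output.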
Comment 12 Glenn Adams 2012-06-12 05:03:06 UTC
(In reply to comment #11)
> Hello,
> Thanks for your comment, and sorry for my misleading message and poor
> English.
> Here is my problem:
> When the XML data file contains a Chinese character whose code point is not
> in the PUA, "?" is displayed.
> Here is the font library information:
> http://www.microsoft.com/en-us/download/details.aspx?id=12080
> And here is the character I failed to generate:
> Unicode code point (hex): 2070E
> 
> According to the above URL, the old PUA characters have been moved to non-PUA
> code point assignments. It seems that Chinese characters in the PUA will not
> receive any enhancement or support in the future. So is it possible to give
> this enhancement (support for surrogate pair characters) a higher priority?
> 
> Cheers, 
> Kit

Irrelevant. Characters encoded using PUA are not interchangeable. Private means Private. In any case, I'll ignore your comment unless and until you provide a sample FO/PDF pair demonstrating a problem.

May I remind you that work on FOP (or any other Apache project) is done on a volunteer or sponsorship basis. If you want the priority placed higher, then either volunteer to do the work or sponsor someone to do the work. I welcome all improvements to FOP and will do my utmost to apply patches quickly, but your request to prioritize a particular feature has no weight unless you do something concrete to assist.

Just as an FYI, my personal priority is to improve support for BMP encoded scripts, and then move on to non-BMP features.

Respectfully, Glenn
Comment 13 Glenn Adams 2012-06-12 05:16:07 UTC
(In reply to comment #11)
> Hello,
> Thanks for your comment, and sorry for my misleading message and poor
> English.
> Here is my problem:
> When the XML data file contains a Chinese character whose code point is not
> in the PUA, "?" is displayed.
> Here is the font library information:
> http://www.microsoft.com/en-us/download/details.aspx?id=12080
> And here is the character I failed to generate:
> Unicode code point (hex): 2070E
> 
> According to the above URL, the old PUA characters have been moved to non-PUA
> code point assignments. It seems that Chinese characters in the PUA will not
> receive any enhancement or support in the future. So is it possible to give
> this enhancement (support for surrogate pair characters) a higher priority?
> 
> Cheers, 
> Kit

i've asked once, and i'll ask again: please provide a minimal input FO file and an output PDF file demonstrating a problem; if you can't or won't do this, i can not do anything to help
Comment 14 ngkit 2012-06-12 07:26:29 UTC
Created attachment 28914 [details]
Sample XSL file to generate Chinese character. It use "Mingliu" Chinese fonts
Comment 15 ngkit 2012-06-12 07:29:11 UTC
Created attachment 28915 [details]
Result PDF file

The XML data file contains characters both from the PUA and from outside the PUA.
Comment 16 ngkit 2012-06-12 07:31:14 UTC
Created attachment 28916 [details]
Sample XML file contains both PUA and non-PUA chinese character
Comment 17 ngkit 2012-06-12 07:33:39 UTC
Hello,
I have uploaded the XML data file, the XSL template file and the result PDF file. Is any other information required?
Cheers,
Kit
Comment 18 Pascal Sancho 2012-06-12 07:42:11 UTC
(In reply to comment #17)
> Hello,
> I have uploaded the XML data file, the XSL template file and the result PDF
> file. Is any other information required?
> Cheers,
> Kit

Hi,
as Glenn said, you should attach the XSL-FO produced by the XML+XSLT transformation; this will be very helpful to reproduce (or not) the issue and identify what causes it.

See bug reporting guidelines at [1] for further info.

[1] http://xmlgraphics.apache.org/fop/bugs.html#issues_new
Comment 19 ngkit 2012-06-12 08:30:12 UTC
Created attachment 28917 [details]
Sample FO file to generate PDF file
Comment 20 ngkit 2012-06-12 08:32:06 UTC
Created attachment 28918 [details]
Sample FO file to generate PDF
Comment 21 ngkit 2012-06-12 08:32:32 UTC
Created attachment 28919 [details]
Result PDF file
Comment 22 ngkit 2012-06-12 09:15:33 UTC
Hello,
Sorry, I uploaded the wrong files before. I have now uploaded the XSL-FO result file and the result PDF file. Is any other information required?
Cheers,
Kit
Comment 23 Jacky 2012-06-13 07:20:42 UTC
Hi all,
 Glad to see the thread is active again, as I have similar concerns about using non-BMP characters. Support for non-BMP characters is very important because there are street names containing characters for which no other character can be substituted.

 If FOP can support surrogate pairs, I'm sure many more developers will benefit, since the generated PDF embeds the font by default, which solves many physical printing problems caused by printer-loaded fonts.

Jacky
Comment 24 Rick 2012-06-14 06:54:28 UTC
Regarding the above problem, we encountered the same issue in our applications.
It looks like a common issue for applications that handle Chinese characters. I hope a fix can be provided soon. Many thanks.

Rick
Comment 25 TY@Taiwan 2012-06-14 11:29:15 UTC
Great to finally find some information about the non-BMP character support issue with FOP. I also want to know whether the problem is due to FOP. The problem is quite annoying if my application is finally to be deployed with FOP in production.

Joining the thread to hear good news.
TY
Comment 26 Glenn Adams 2012-06-14 15:50:18 UTC
(In reply to comment #25)
> Great to finally find some information about the non-BMP character support
> issue with FOP. I also want to know whether the problem is due to FOP. The
> problem is quite annoying if my application is finally to be deployed with
> FOP in production.
> 
> Joining the thread to hear good news.
> TY

don't jump to the conclusion that anything has changed in FOP: it hasn't!

also, keep in mind that adding support for non-BMP characters in FOP is only a part of the solution; the larger part of the solution is outside of the scope of FOP, namely, the availability of OpenType or TrueType fonts that contain a 'cmap' table that satisfies one of the following:

* platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
  format 10.0 (trimmed array)

* platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
  format 12.0 (segmented coverage)

* platform ID 3 (windows), encoding ID 10 (ucs-4),
  format 12.0 (segmented coverage) 

so far, nobody has provided me a link to or a copy of such a font, and, until i have such a font in hand, i'm not going to take any action with respect to this bug
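
One way to check a font against the criteria listed above is to read the 'cmap' encoding records directly from the font file. The following Java sketch does this for a single-font TrueType/OpenType (sfnt) file; it is an illustration only, not FOP's font loader. The class name is hypothetical, font collections (.ttc) are not handled, and no bounds checking is done on malformed files.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of FOP.
final class CmapProbe {
    private static final int TAG_CMAP = 0x636D6170; // the four bytes 'c' 'm' 'a' 'p'

    /** Lists {platformID, encodingID, subtable format} for each cmap encoding record. */
    static List<int[]> cmapRecords(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font);      // big-endian by default, as sfnt requires
        List<int[]> records = new ArrayList<>();
        int numTables = buf.getShort(4) & 0xFFFF;
        for (int i = 0; i < numTables; i++) {
            int dirEntry = 12 + 16 * i;              // 12-byte header, 16-byte directory entries
            if (buf.getInt(dirEntry) != TAG_CMAP) {
                continue;
            }
            int cmap = buf.getInt(dirEntry + 8);     // file offset of the cmap table
            int numSubtables = buf.getShort(cmap + 2) & 0xFFFF;
            for (int j = 0; j < numSubtables; j++) {
                int rec = cmap + 4 + 8 * j;          // 8-byte encoding records
                int platformId = buf.getShort(rec) & 0xFFFF;
                int encodingId = buf.getShort(rec + 2) & 0xFFFF;
                int subtable = cmap + buf.getInt(rec + 4);
                int format = buf.getShort(subtable) & 0xFFFF;
                records.add(new int[] {platformId, encodingId, format});
            }
        }
        return records;
    }

    /** True if any subtable uses format 10 or 12, the formats named in the list above. */
    static boolean hasNonBmpSubtable(byte[] font) {
        for (int[] r : cmapRecords(font)) {
            if (r[2] == 10 || r[2] == 12) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        byte[] font = Files.readAllBytes(Paths.get(args[0]));
        for (int[] r : cmapRecords(font)) {
            System.out.println("platform " + r[0] + ", encoding " + r[1] + ", format " + r[2]);
        }
        System.out.println("format 10/12 subtable present: " + hasNonBmpSubtable(font));
    }
}
```

Run with the path to a .ttf or .otf file as the only argument. The sketch reports the platform and encoding IDs but leaves it to the reader to compare them with the combinations listed above.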
Comment 27 C.C 2012-06-15 07:05:52 UTC
Same problem here! Can you provide any workaround until the bug is fixed? It takes time to find suitable fonts. Anyway, I will keep an eye on the thread.

Cusson
Comment 28 Thomas T. 2012-06-18 10:29:30 UTC
Hi Glenn, 

Sorry, I do not clearly understand which fonts you are requesting. Is there any software or tool
to check whether a font supports the 'cmap' you mentioned?
I tried the Microsoft Font Properties extension tool
http://www.microsoft.com/typography/TrueTypeProperty21.mspx
to check whether I have suitable fonts, but it does not show the cmap properties.
Thanks.

Thomas T.

(In reply to comment #26)
> (In reply to comment #25)
> > Great to finally find some information about the non-BMP character support
> > issue with FOP. I also want to know whether the problem is due to FOP. The
> > problem is quite annoying if my application is finally to be deployed with
> > FOP in production.
> > 
> > Joining the thread to hear good news.
> > TY
> 
> don't jump to the conclusion that anything has changed in FOP: it hasn't!
> 
> also, keep in mind that adding support for non-BMP characters in FOP is only
> a part of the solution; the larger part of the solution is outside of the
> scope of FOP, namely, the availability of OpenType or TrueType fonts that
> contain a 'cmap' table that satisfies one of the following:
> 
> * platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
>   format 10.0 (trimmed array)
> 
> * platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later),
>   format 12.0 (segmented coverage)
> 
> * platform ID 3 (windows), encoding ID 10 (ucs-4),
>   format 12.0 (segmented coverage)
> 
> so far, nobody has provided me a link to or a copy of such a font, and, until
> i have such a font in hand, i'm not going to take any action with respect to
> this bug
Comment 29 Glenn Adams 2012-06-18 14:42:03 UTC
(In reply to comment #28)
> Sorry, I do not clearly understand which fonts you are requesting. Is there
> any software or tool to check whether a font supports the 'cmap' you
> mentioned?
> I tried the Microsoft Font Properties extension tool
> http://www.microsoft.com/typography/TrueTypeProperty21.mspx
> to check whether I have suitable fonts, but it does not show the cmap
> properties.

One option is the 'ttx' tool in the Adobe Font Development Kit for OpenType (AFDKO)
Comment 30 Thomas T. 2012-06-19 04:47:03 UTC
Hi Glenn,
Using the tool you suggested, I found four fonts bundled with Windows 7 that have the following cmap subtables. Are these what you are looking for?
ebrima.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="7012" language="0" nGroups="583">

ebrimabd.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="7012" language="0" nGroups="583">

seguisym.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="1900" language="0" nGroups="157">

simsunb.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="40" language="0" nGroups="2">

Thomas T.

(In reply to comment #29)
> (In reply to comment #28)
> > Sorry, I do not clearly understand which fonts you are requesting. Is there
> > any software or tool to check whether a font supports the 'cmap' you
> > mentioned?
> > I tried the Microsoft Font Properties extension tool
> > http://www.microsoft.com/typography/TrueTypeProperty21.mspx
> > to check whether I have suitable fonts, but it does not show the cmap
> > properties.
> 
> One option is the 'ttx' tool in the Adobe Font Development Kit for OpenType
> (AFDKO)
Comment 31 Thomas T. 2012-06-25 11:28:09 UTC
Hi, 
Do the fonts I suggested help, or do I need to find others?

Thomas T.
Comment 32 Glenn Adams 2012-06-25 13:26:11 UTC
(In reply to comment #31)
> Do the fonts I suggested help, or do I need to find others?

Yes, it will be helpful when I am ready to start working on this bug. I do not have a schedule for when I will start. Thanks for your checking on Win fonts that support non-BMP encodings.
Comment 33 Shepard Lee 2012-07-05 07:23:14 UTC
Hi All, 

I encountered the same issue on my applications using fop 1.0. 
Glad to see the issue is going to be fixed in the coming version.
May I know if this bug will be fixed in version 1.1 only or it will be patched in version 1.0, too?

Shepard
Comment 34 Glenn Adams 2012-07-05 07:29:23 UTC
(In reply to comment #33)
> Hi All, 
> 
> I encountered the same issue on my applications using fop 1.0. 
> Glad to see the issue is going to be fixed in the coming version.
> May I know if this bug will be fixed in version 1.1 only or it will be
> patched in version 1.0, too?

No, this is NOT going to be fixed in the upcoming version. I have made NO statements about when this will be addressed in FOP.

In particular, it will NOT be patched in 1.0 and will NOT be addressed in 1.1. This is a POSSIBLE 1.2 (or later) fix.
Comment 35 Jacky 2012-07-09 03:53:49 UTC
(In reply to comment #34)
> (In reply to comment #33)
> > Hi All, 
> > 
> > I encountered the same issue on my applications using fop 1.0. 
> > Glad to see the issue is going to be fixed in the coming version.
> > May I know if this bug will be fixed in version 1.1 only or it will be
> > patched in version 1.0, too?
> 
> No, this is NOT going to be fixed in the upcoming version. I have made NO
> statements about when this will be addressed in FOP.
> 
> In particular, it will NOT be patched in 1.0 and will NOT be addressed in
> 1.1. This is a POSSIBLE 1.2 (or later) fix.

Hi Adams,
 Nice to see that you have considered this thread. As far as I know, even mainframes have similar issues supporting surrogate pairs. Is there any workaround if we stick to the latest FOP version, or any news on a tentative rollout of v1.2?

Jacky
Comment 36 Glenn Adams 2012-07-09 04:56:33 UTC
(In reply to comment #35)
> Nice to see that you have considered this thread. As far as I know, even
> mainframes have similar issues supporting surrogate pairs. Is there any
> workaround if we stick to the latest FOP version, or any news on a tentative
> rollout of v1.2?

FOP 1.1rc1 was just released, and perhaps one month later 1.1 will be released. After that, I intend to put this work item on my list for possible 1.2 features. There is no schedule for 1.2, but I'd like to do it by the end of this year.
Comment 37 Sameh Ayoub 2012-12-06 18:03:05 UTC
(In reply to comment #36)
> (In reply to comment #35)
> > Nice to see that you have considered this thread. As far as I know, even
> > mainframes have similar issues supporting surrogate pairs. Is there any
> > workaround if we stick to the latest FOP version, or any news on a
> > tentative rollout of v1.2?
> 
> FOP 1.1rc1 was just released, and perhaps one month later 1.1 will be
> released. After that, I intend to put this work item on my list for possible
> 1.2 features. There is no schedule for 1.2, but I'd like to do it by the end
> of this year.

Hi Glenn, 
Thanks for considering adding this feature in FOP 1.2 by the end of this year.

We are using FOP 1.1, and we would like to have this feature as soon as it is added.

So, we are wondering:
- Do you think this will be done in the near future?
- Can the solution be easily patched into FOP 1.1?

Thanks for your coordination.