Unicode code points outside the BMP (Basic Multilingual Plane), i.e., those whose scalar value is greater than 0xFFFF (65535), are encoded as UTF-16 surrogate pairs in Java strings; each such pair should be treated as a single code point for the purpose of mapping to a glyph in a font (one that supports extra-BMP mappings); at present, FOP does not correctly handle this case in the simple (non-complex-script) rendering paths; furthermore, though some support has been added to handle this in the complex script rendering path, it has not yet been tested, so it is not necessarily working there either;
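For illustration only, here is a minimal sketch of what "treating a surrogate pair as a single code point" means when walking a Java string. It is not FOP code; the class and method names are hypothetical, and U+20000 is just an arbitrary non-BMP example.

```java
final class CodePointWalk {
    /** Counts Unicode code points, treating each surrogate pair as one unit. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // reads a full pair if one starts at i
            i += Character.charCount(cp);     // advances by 2 chars for non-BMP, else 1
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "A", then U+20000 (stored as two chars), then "B"
        String s = "A" + new String(Character.toChars(0x20000)) + "B";
        System.out.println(s.length());          // 4 UTF-16 code units
        System.out.println(countCodePoints(s));  // 3 code points
    }
}
```

A loop that indexes with `charAt(i)` and maps each `char` to a glyph would see the two halves of the pair as two separate, unmappable characters, which is the failure described above.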
resetting P2 open bugs to P3 pending further review
request to fix this to support surrogate pairs characters.
(In reply to comment #2) > request to fix this to support surrogate pairs characters. thanks for your request; could you provide additional information: 1. what specific non-BPM characters you would like to use? 2. what specific fonts will you use for these characters?
(In reply to comment #3) > (In reply to comment #2) > > request to fix this to support surrogate pairs characters. > > thanks for your request; could you provide additional information: > > 1. what specific non-BPM characters you would like to use? > 2. what specific fonts will you use for these characters? s/BPM/BMP/
Hello! Today, the majority of Unicode's characters are outside the BMP. This involves many alphabets and other character sets. Here are links to two of these non-BMP planes: http://www.unicode.org/roadmaps/smp/ and http://www.unicode.org/roadmaps/sip/ and more are to come because dozens of alphabets are not part of Unicode, yet. Today's FOP supports no more than a minority of Unicode's characters and this minority will become proportionally less and less in the future. I consider that there is a need to solve this problem in the long run. Trying to "solve" the problem for some specific non-BMP characters will lead to this problem coming back again and again ... I will use FOP *much* more often as soon as it supports non-BMP characters. Regards! Saašha,
(In reply to comment #5) > Hello! > > Today, the majority of Unicode's characters are outside the BMP. This > involves many alphabets and other character sets. Here are links to two of > these non-BMP planes: > http://www.unicode.org/roadmaps/smp/ and > http://www.unicode.org/roadmaps/sip/ and more are to come because dozens of > alphabets are not part of Unicode, yet. > > Today's FOP supports no more than a minority of Unicode's characters and > this minority will become proportionally less and less in the future. I > consider that there is a need to solve this problem in the long run. > > Trying to "solve" the problem for some specific non-BMP characters will lead > to this problem coming back again and again ... > > I will use FOP *much* more often as soon as it supports non-BMP characters. > > Regards! > > Saašha, I'm sorry Saašha but I do not accept the rationale of your argument. First, FOP supports the representation of all BMP characters which is the vast majority of modern usage, >99.994%. If you cannot demonstrate to me a real, current need to use non-BMP characters or cannot demonstrate a font that actually supports these character mappings that you need to use, then I will leave this bug prioritized low (P5). If you wish to contribute a patch that adds non-BMP support, then the FOP team would be happy to apply it. In the mean time, you shall have to wait until this enhancement gets higher in the priority queue, and that will have to await many other enhancements in my opinion, such as finishing support for complex scripts, adding full CJK support, etc.
Hello! Thanks for your reply! Here are a few clarifications!

> the vast majority of modern usage,
A great deal of software does not support non-BMP characters. I would like to clarify that FOP is not the only one. The fact that non-BMP characters are poorly supported is among the main reasons why non-BMP characters are seldom encoded as such. Instead, workarounds are used. For example, non-BMP characters are often converted to parts of the so-called "private use area" (U+E000 to U+F8FF) before being processed. Sometimes "font tricks" are used, where the glyphs of one alphabet are just copied to a BMP alphabet's place -- reminiscent of the early nineties, when Greek and Cyrillic glyphs (among others) often lived in "ASCII" fonts. Sometimes they are replaced by PNGs. All these workarounds cause a great deal of confusion, contribute to the "non-visibility" of these alphabets, and make it very difficult to find text written with these character sets.

In other words, the poor support for non-BMP characters is indeed one of the main reasons for their "non-visibility". It is important to avoid misinterpretations here: these characters are both used and useful.

> demonstrate to me a real, current need to use non-BMP characters
To be accepted as part of Unicode, an alphabet or other character set (such as mathematical symbols, etc.) needs to be supported by a VERY active community over a long time. Otherwise, the Unicode consortium does not include it. The very fact that Unicode includes non-BMP alphabets and other character sets is proof that an active community needs those characters.

On the other hand, the fact that dozens of alphabets are still absent from Unicode should not be misinterpreted as non-usage of these alphabets.

> adding full CJK support,
Thousands of CJK characters live outside the BMP. Full CJK support requires support for non-BMP characters.

> If you wish to contribute a patch that adds non-BMP support,
I plan to try to write some kind of fix this summer.

Regards! Saašha,
(In reply to comment #7) > Hello! > > Thanks for your reply! Here are a few clarifications! > > > the vast majority of modern usage, > Many, many software do not support non-BMP characters. I would like to > clarify that FOP is not the only one. The fact that non-BMP characters are > poorly supported is among the main reasons why non-BMP characters are seldom > encoded as such. Instead, work-arounds are used. For example, non-BMP > characters are often converted to parts of the so called "private use area" > (U+E000 to U+F8FF) before being processed. Sometimes, "font-tricks" are > used, where the glyphs of one alphabet are just copied to a BMP-alphabet's > place -- reminding of the (early) nineties, where greek and cyrillic glyphs > (among others) were often living in "ASCII"-fonts. Sometimes, they are > replaced by PNG's. All these work-arounds contribute to many confusions and > also contribute to the "non-visibility" of these alphabets and to great > difficulties to find text written with these character sets. > > In other words, the poor support for non-BMP characters is indeed one of the > main reasons for their "non-visibility". It is important to avoid > misinterpretations here: these characters are both used and useful. > > > demonstrate to me a real, current need to use non-BMP characters > To be accepted as part of Unicode, an alphabet or other character set (such > as mathematical symbols, etc.) needs to be supported by a VERY active > community during a long time. Otherwise, the Unicode consortium does not > include this alphabet. The very fact that Unicode includes non-BMP alphabets > and other character sets is a proof that an active community needs those > characters. > > On the other hand, the fact that dozens of alphabets are still absent from > Unicode shall not be misinterpreted as a non-usage of these alphabets. > > > adding full CJK support, > Thousands of CJK characters live outside the BMP. 
A full CJK support > requires support for non-BMP characters. > > > If you wish to contribute a patch that adds non-BMP support, > I plan to try to write some kind of fix this summer. > > Regards! > > Saašha, again you are giving me general reasons, but not specific ones that drive your immediate needs; i am extremely familiar with Unicode, having been a co-author of Unicode 2.0, a technical director of the Unicode consortium from 93-98, and Unicode's representative to the ISO SC2/WG2 IRG (Ideographic Rapporteur Group), who created the CJK encodings in Unicode; i want to know specifically what non-BMP characters *you* want to use and what specific fonts *you* will use to print these non-BMP characters; if you can demonstrate a good, real need (as opposed to generalities), then perhaps I will be inclined to give non-BMP support a greater priority; if not, I will continue to assign higher priority to other features that better support non-Roman scripts that use the BMP; regarding CJK and non-BMP, I agree that it is useful to support those characters, however, i'd like to see fonts that are available for these characters first;
Hello, I have used the FOP library to generate PDF files for several years. It has been a great library for the task. However, I recently found some "?" characters in the PDF files. I tried to find the root cause: the code point of the problem character is not the same as the one I used previously. According to a Microsoft document (here is the link: http://www.microsoft.com/en-us/download/details.aspx?id=12080), some of the characters can be represented by either a PUA code point or a Unicode 4.1 code point. PUA is just a backward-compatible solution, and it seems PUA support is going to fade out in the near future. So is it possible to give this enhancement a higher priority? Kit
(In reply to comment #9) > Hello, > I have used FOP library to generate PDF files for a serval years. It was a > great library to perform the task. However, I found some "?" exist in PDF > files recently. I have tried to find the root cause, the problem character > byte code is not same with my previous using one. According to Microsoft > document(here is the link > http://www.microsoft.com/en-us/download/details.aspx?id=12080), some of the > characters can be represented by both PUA or Unicode 4.1 byte code. And PUA > is just a backward compatiable solution. And it seems PUA support is going > to fade out in coming future. So is it possible to put this enhancment to > higher priority? > Kit I don't understand your comment. You need to provide more details to know if you have a problem or not, and if you do, whether it relates to this bug or not. If you have a problem with a specific input FO file, then attach that file along with the PDF file you obtain when running FOP. Also attach any console output. Once you do these things, I can evaluate whether your problem is legitimate or not and whether it is related or not.
Hello, Thanks for your comment, and sorry for my misleading message and poor English. Here is my problem: when the XML data file contains a Chinese character whose code point is not in the PUA, "?" is displayed. Here is the font library information: http://www.microsoft.com/en-us/download/details.aspx?id=12080 And here is the character I failed to generate: Unicode code point (hex): 2070E. According to the above URL, old PUA characters have been moved to non-PUA code point assignments. It seems that Chinese characters in the PUA will not receive any enhancement or support in the future. So is it possible to give this enhancement (support for surrogate pair characters) a higher priority? Cheers, Kit
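For reference, the code point reported above (hex 2070E) lies outside the BMP, so a Java string stores it as a surrogate pair. The following is a minimal sketch of the standard UTF-16 arithmetic; the class and method names are hypothetical and this is not taken from FOP.

```java
final class SurrogateDemo {
    /** Splits a supplementary code point into its UTF-16 high and low surrogates. */
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a non-BMP code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x2070E);
        // prints: U+2070E -> D841 DF0E
        System.out.println(String.format("U+%X -> %04X %04X",
                0x2070E, (int) pair[0], (int) pair[1]));
    }
}
```

The JDK's `Character.toChars(int)` produces the same two `char` values; the manual version is shown only to make the arithmetic visible.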
(In reply to comment #11) > Hello, > Thanks for your comment and sorry for my misleading message and poor > English. > Here is my problem: > When XML data files contains Chinese character with byte code does not exist > in PUA, "?" will be displayed. > And here is the fonts library information > http://www.microsoft.com/en-us/download/details.aspx?id=12080 > And here is the character I failed to generated > Unicde code (Hex):2070E > > According to the above URL, old PUA characters have been moved to non PUA > code point assignment. It seems that Chinese characters in PUA will not have > any enhancement or support in coming future. So is it possible to put this > enhancment (support surrogate pairs characters) to higher priority? > > Cheers, > Kit Irrelevant. Characters encoded using PUA are not interchangeable. Private means Private. In any case, I'll ignore your comment unless and until you provide a sample FO/PDF pair demonstrating a problem. May I remind you that work on FOP (or any other Apache project) is done on a volunteer or sponsorship basis. If you want the priority placed higher, then either volunteer to do the work or sponsor someone to do the work. I welcome all improvements to FOP and will do my utmost to apply patches quickly, but your request to prioritize a particular feature has no weight unless you do something concrete to assist. Just as an FYI, my personal priority is to improve support for BMP encoded scripts, and then move on to non-BMP features. Respectfully, Glenn
(In reply to comment #11) > Hello, > Thanks for your comment and sorry for my misleading message and poor > English. > Here is my problem: > When XML data files contains Chinese character with byte code does not exist > in PUA, "?" will be displayed. > And here is the fonts library information > http://www.microsoft.com/en-us/download/details.aspx?id=12080 > And here is the character I failed to generated > Unicde code (Hex):2070E > > According to the above URL, old PUA characters have been moved to non PUA > code point assignment. It seems that Chinese characters in PUA will not have > any enhancement or support in coming future. So is it possible to put this > enhancment (support surrogate pairs characters) to higher priority? > > Cheers, > Kit i've asked once, and i'll ask again: please provide a minimal input FO file and an output PDF file demonstrating a problem; if you can't or won't do this, i can not do anything to help
Created attachment 28914 [details] Sample XSL file to generate Chinese characters. It uses the "Mingliu" Chinese font
Created attachment 28915 [details] Result PDF file. The XML data file contains characters from both PUA and non-PUA ranges
Created attachment 28916 [details] Sample XML file containing both PUA and non-PUA Chinese characters
Hello, I have uploaded the XML data file, the XSL template file and the result PDF file. Is any other information required? Cheers, Kit
(In reply to comment #17) > Hello, > I have uploaded XML data file, XSL template file and result PDF file. Any > other information require? > Cheers, > Kit Hi, as Glenn said, you should attach the XSL-FO resulting from the XML+XSLT transformation; this will be very helpful to reproduce (or not) the issue and identify what causes it. See the bug reporting guidelines at [1] for further info. [1] http://xmlgraphics.apache.org/fop/bugs.html#issues_new
Created attachment 28917 [details] Sample FO file to generate PDF file
Created attachment 28918 [details] Sample FO file to generate PDF
Created attachment 28919 [details] Result PDF file
Hello, Sorry, I uploaded the wrong files before. I have now uploaded the XSL-FO result file and the result PDF file. Is any other information required? Cheers, Kit
Hi all, Glad to see the thread is active again, as I had similar concerns about using non-BMP characters. Support for non-BMP characters is very important, as there are street names containing characters for which no other characters can be substituted. If FOP can support surrogate pairs, I'm sure many more developers will enjoy it, since the generated PDF embeds the font by default, which solves many physical printing problems caused by printer-loaded fonts. Jacky
Regarding the above problem, we encountered the same issue in our applications. It looks like a common issue for applications that handle Chinese characters. Hoping that a fix can be provided soon. Many thanks. Rick
It is great to have finally found some related information about the non-BMP character support issue with FOP. I would also like to know whether it is due to FOP; the problem is quite annoying if my application is to go ahead and be deployed with FOP in production. Joining the thread to hear good news. TY
(In reply to comment #25)
> Great that finally searched some related information about support non-BMP
> characters issue with FOP, & also wanna to know if it is due to FOP, & that
> problem quite annoying if my APPL should finally go ahead for deploy with
> FOP @production.
>
> Join thread to hear gd news.
> TY

don't jump to the conclusion that anything has changed in FOP: it hasn't! also, keep in mind that adding support for non-BMP characters in FOP is only a part of the solution; the larger part of the solution is outside of the scope of FOP, namely, the availability of OpenType or TrueType fonts that contain a 'cmap' table that satisfies one of the following:

* platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later), format 10.0 (trimmed array)
* platform ID 0 (unicode), encoding ID 3 (unicode 2.0 or later), format 12.0 (segmented coverage)
* platform ID 3 (windows), encoding ID 10 (ucs-4), format 12.0 (segmented coverage)

so far, nobody has provided me a link to or a copy of such a font, and, until i have such a font in hand, i'm not going to take any action with respect to this bug
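For anyone who wants to check a font against the criteria above without extra tooling, here is a rough sketch that lists the (platform ID, encoding ID, format) triple of each 'cmap' subtable in a single-font TrueType/OpenType file. It is only an illustration under stated assumptions: the class and method names are hypothetical, it is not FOP's font loader, it does no bounds or checksum validation, and it does not handle TrueType collections (.ttc). Newer revisions of the OpenType specification also define Unicode-platform encoding ID 4 for the full repertoire, so the sketch keys on the subtable format (10 or 12) rather than on the encoding ID.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class CmapScan {
    private static final int TAG_CMAP = 0x636D6170; // 'c' 'm' 'a' 'p'

    /** Returns {platformID, encodingID, format} for every cmap subtable found. */
    static List<int[]> listSubtables(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font);          // big-endian by default
        List<int[]> result = new ArrayList<>();
        int numTables = buf.getShort(4) & 0xFFFF;        // from the 12-byte header
        for (int i = 0; i < numTables; i++) {
            int rec = 12 + 16 * i;                       // 16-byte table records
            if (buf.getInt(rec) != TAG_CMAP) {
                continue;
            }
            int cmap = buf.getInt(rec + 8);              // offset of the cmap table
            int numEncodings = buf.getShort(cmap + 2) & 0xFFFF;
            for (int j = 0; j < numEncodings; j++) {
                int enc = cmap + 4 + 8 * j;              // 8-byte encoding records
                int platform = buf.getShort(enc) & 0xFFFF;
                int encoding = buf.getShort(enc + 2) & 0xFFFF;
                int subtable = cmap + buf.getInt(enc + 4);
                int format = buf.getShort(subtable) & 0xFFFF;
                result.add(new int[] { platform, encoding, format });
            }
        }
        return result;
    }

    /** True if any subtable uses a 32-bit format (10 or 12). */
    static boolean hasNonBmpSubtable(byte[] font) {
        for (int[] t : listSubtables(font)) {
            if (t[2] == 10 || t[2] == 12) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CmapScan.java <font.ttf>");
            return;
        }
        byte[] font = Files.readAllBytes(Paths.get(args[0]));
        for (int[] t : listSubtables(font)) {
            System.out.println("platformID=" + t[0] + " encodingID=" + t[1] + " format=" + t[2]);
        }
        System.out.println("non-BMP capable subtable present: " + hasNonBmpSubtable(font));
    }
}
```

A format 12 subtable only shows that the font can map non-BMP code points; whether it actually covers the characters you need still has to be checked against its mapping groups.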
Same problem here! Can you provide any workaround before the bug is fixed? You know, it takes time to find a suitable font. Anyway, I will keep an eye on the thread. Cusson
(In reply to comment #26) Hi Glenn, Sorry, I do not clearly understand which fonts you are requesting. Is there any software or tool to check whether a font supports the 'cmap' you mentioned? I tried the Microsoft Font Properties extension tool http://www.microsoft.com/typography/TrueTypeProperty21.mspx to check whether I have suitable fonts, but it does not show the cmap properties. Thanks. Thomas T.
(In reply to comment #28) > Sorry not understand your requested fonts clearly. Is there any > software/tools > to check the fonts supported the 'cmap' you mentioned? > I tried Microsoft Font Properties extension tools > http://www.microsoft.com/typography/TrueTypeProperty21.mspx > to check if i got fonts that suit, but it didn't involve the cmap properties. One option is the 'ttx' tool in the Adobe Font Development Kit for Opentype (AFDKO)
(In reply to comment #29) Hi Glenn, Using the tool you suggested, I found four fonts bundled with Windows 7 that have the following cmap subtables. Are these what you are looking for?

ebrima.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="7012" language="0" nGroups="583">

ebrimabd.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="7012" language="0" nGroups="583">

seguisym.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="1900" language="0" nGroups="157">

simsunb.ttf
<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="40" language="0" nGroups="2">

Thomas T.
Hi, Do the fonts I suggested help? Or do I need to find others? Thomas T.
(In reply to comment #31) > Is my suggested fonts help? Or i need to find another?? Yes, it will be helpful when I am ready to start working on this bug. I do not have a schedule for when I will start. Thanks for your checking on Win fonts that support non-BMP encodings.
Hi All, I encountered the same issue on my applications using fop 1.0. Glad to see the issue is going to be fixed in the coming version. May I know if this bug will be fixed in version 1.1 only or it will be patched in version 1.0, too? Shepard
(In reply to comment #33) > Hi All, > > I encountered the same issue on my applications using fop 1.0. > Glad to see the issue is going to be fixed in the coming version. > May I know if this bug will be fixed in version 1.1 only or it will be > patched in version 1.0, too? No, this is NOT going to be fixed in the upcoming version. I have made NO statements about when this will be addressed in FOP. In particular, it will NOT be patched in 1.0 and will NOT be addressed in 1.1. This is a POSSIBLE 1.2 (or later) fix.
(In reply to comment #34) > (In reply to comment #33) > Hi All, > > I encountered the same issue on my > applications using fop 1.0. > Glad to see the issue is going to be fixed in > the coming version. > May I know if this bug will be fixed in version 1.1 > only or it will be > patched in version 1.0, too? No, this is NOT going to > be fixed in the upcoming version. I have made NO statements about when this > will be addressed in FOP. In particular, it will NOT be patched in 1.0 and > will NOT be addressed in 1.1. This is a POSSIBLE 1.2 (or later) fix. Hi Adams, Nice to see that you have considered this thread. As far as I know, even mainframes have similar issues in supporting surrogate pairs. Is there any workaround if we stick to the latest FOP version, or any news on a tentative rollout of v1.2? Jacky
(In reply to comment #35) > Nice to see you'd considered this thread. As I knew, even mainframe has > similiar issues in using supporting surrogate pairs. Is that any workaround > if stick to latest FOP version, or any news on tenative rollout of v1.2? FOP 1.1rc1 was just released, and 1.1 will perhaps be released a month from now. After that, I intend to put this work item on my list of possible 1.2 features. There is no schedule for 1.2, but I'd like to do it by the end of this year.
(In reply to comment #36) > (In reply to comment #35) > > Nice to see you'd considered this thread. As I knew, even mainframe has > > similiar issues in using supporting surrogate pairs. Is that any workaround > > if stick to latest FOP version, or any news on tenative rollout of v1.2? > > FOP 1.1rc1 was just release, and perhaps one month later 1.1 will be > released. After that, I intend to put this work item on my list for possible > 1.2 features. There is no schedule for 1.2, but I'd like to do it by the end > of this year. Hi Glenn, Thanks for considering adding this feature in FOP 1.2 by the end of this year. We are using FOP 1.1, and we want to have this feature as soon as it is added. So, we are wondering: - Do you think this will be done in the near future? - Can the solution be patched into FOP 1.1 easily? Thanks for your coordination.