Bug 49849 - [PATCH] PDF links do only support ISO encoding
Summary: [PATCH] PDF links do only support ISO encoding
Status: NEEDINFO
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: pdf
Version: all
Hardware: PC All
Importance: P3 normal
Target Milestone: ---
Assignee: fop-dev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-31 05:26 UTC by Max Aster
Modified: 2012-04-24 05:56 UTC

Attachments
test case (758 bytes, application/octet-stream)
2010-08-31 05:32 UTC, Max Aster
patch to utf-8 (637 bytes, patch)
2010-08-31 05:33 UTC, Max Aster

Description Max Aster 2010-08-31 05:26:52 UTC
The current version of FOP (1.0) supports only "ISO-8859-1" encoding for PDF actions such as links.

See PDFDocument.java
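
To illustrate the symptom, here is a minimal sketch of what a fixed ISO-8859-1 encoding step does to characters outside Latin-1. The class name, the method names, and the sample string are hypothetical; they are not taken from FOP or from the attached test case.

```java
import java.nio.charset.StandardCharsets;

public class Latin1LossDemo {
    // Encodes text the way a hard-coded ISO-8859-1 step would (assumption:
    // this stands in for the encoding step the report refers to).
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns true if the text survives an ISO-8859-1 round trip unchanged.
    static boolean survivesLatin1(String text) {
        String decoded = new String(encodeLatin1(text), StandardCharsets.ISO_8859_1);
        return decoded.equals(text);
    }

    public static void main(String[] args) {
        // Hypothetical link target containing Polish letters (z, a, U+017C, U+00F3, U+0142, U+0107).
        String polish = "za\u017C\u00F3\u0142\u0107.pdf";
        System.out.println("plain ASCII survives: " + survivesLatin1("report.pdf"));
        System.out.println("Polish text survives: " + survivesLatin1(polish));
    }
}
```

Characters that ISO-8859-1 cannot represent are replaced with '?' (0x3F) by `String.getBytes`, so the original text cannot be recovered from the encoded bytes.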
Comment 1 Max Aster 2010-08-31 05:32:47 UTC
Created attachment 25963
test case

Test case with some polish characters
Comment 2 Max Aster 2010-08-31 05:33:51 UTC
Created attachment 25964
patch to utf-8

Changes the encoding to UTF-8
Comment 3 Max Aster 2010-08-31 05:37:24 UTC
See patch
Comment 4 Vincent Hennebert 2010-08-31 09:27:21 UTC
Hi,

Thanks for your patch. This bug should remain open until it has actually been committed. Otherwise we will lose track of it.

Vincent
Comment 5 Glenn Adams 2012-04-01 21:08:49 UTC
(In reply to comment #3)
> See patch

A brief look at this patch shows that it simply changes the output encoding used for the PDFDocument.encode() function as follows:

-    public static final String ENCODING = "ISO-8859-1";
+    public static final String ENCODING = "UTF-8";

I believe this is incorrect. PDF files employ three string types:

(1) byte string (unspecified encoding)
(2) ascii string (us-ascii encoding)
(3) text string (either PDFDocEncoding or UTF-16BE)

Since (1) the encode() mechanism is used in a variety of contexts and (2) no explicit use of UTF-8 is made by PDF, it would be incorrect to simply change the output encoding returned by encode().

See ISO/IEC 32000 (2008), Section 7.9.2 for details.
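
As a rough illustration of type (3), the following is a minimal sketch of encoding a text string as either plain bytes or UTF-16BE with a leading byte order mark (0xFE 0xFF). It is not FOP code: the class and method names are hypothetical, and for simplicity it treats only printable ASCII as safe for the single-byte form, on the assumption that this range is encoded identically in PDFDocEncoding. It does not handle literal-string escaping, byte strings, or encryption.

```java
import java.nio.charset.StandardCharsets;

public class PdfTextStringSketch {
    // True if every character is printable ASCII (0x20..0x7E).
    static boolean isPrintableAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                return false;
            }
        }
        return true;
    }

    // Encodes a text string: single-byte form for printable ASCII,
    // otherwise UTF-16BE prefixed with the byte order mark FE FF.
    static byte[] encodeTextString(String text) {
        if (isPrintableAscii(text)) {
            return text.getBytes(StandardCharsets.US_ASCII);
        }
        // UTF_16BE does not emit a BOM itself, so it is added explicitly.
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] encoded = encodeTextString("\u0142"); // Polish letter U+0142
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The point of the sketch is that the choice of encoding depends on which string type a given PDF object expects, which is why a single global change to UTF-8 does not fit any of the three types.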

This patch needs to be reworked to take these details into account. Furthermore, the description of this bug is inadequate, since it does not explain what the actual problem is:

* Is the problem that the text content of the basic-link is not rendered with Polish characters? If so, then it is a font selection problem, not a character encoding problem.

* Or is it related to the character encoding used in the /Filespec dictionary for the link annotation?

In any case, the present patch MUST NOT be applied.
Comment 6 Glenn Adams 2012-04-01 21:09:51 UTC
see comment 5
Comment 7 Glenn Adams 2012-04-07 01:41:47 UTC
resetting P2 open bugs to P3 pending further review
Comment 8 Glenn Adams 2012-04-24 05:56:11 UTC
(In reply to comment #6)
> see comment 5

Max, I am still awaiting your input as requested above. If I see no further input by April 30, I will close this bug due to lack of requested information. Regards, Glenn