49849 – [PATCH] PDF links do only support ISO encoding

Status:	NEEDINFO

Alias:	None

Product:	Fop - Now in Jira
Classification:	Unclassified
Component:	pdf
Version:	all
Hardware:	PC All

Importance:	P3 normal
Target Milestone:	---
Assignee:	fop-dev

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-08-31 05:26 UTC by Max Aster
Modified:	2012-04-24 05:56 UTC
CC List:	0 users

Attachments
test case (758 bytes, application/octet-stream) 2010-08-31 05:32 UTC, Max Aster
patch to utf-8 (637 bytes, patch) 2010-08-31 05:33 UTC, Max Aster

Description Max Aster 2010-08-31 05:26:52 UTC

The current version of FOP (1.0) supports only "ISO-8859-1" encoding for PDF actions such as links.

See PDFDocument.java

Comment 1 Max Aster 2010-08-31 05:32:47 UTC

Created attachment 25963
test case

Test case with some Polish characters

Comment 2 Max Aster 2010-08-31 05:33:51 UTC

Created attachment 25964
patch to utf-8

Changes the encoding to UTF-8

Comment 3 Max Aster 2010-08-31 05:37:24 UTC

See patch

Comment 4 Vincent Hennebert 2010-08-31 09:27:21 UTC

Hi,

Thanks for your patch. This bug should remain open until the patch has actually been committed. Otherwise we will lose track of it.

Vincent

Comment 5 Glenn Adams 2012-04-01 21:08:49 UTC

(In reply to comment #3)
> See patch

A brief look at this patch shows that it simply changes the output encoding used for the PDFDocument.encode() function as follows:

-    public static final String ENCODING = "ISO-8859-1";
+    public static final String ENCODING = "UTF-8";

I believe this is incorrect. PDF files employ three string types:

(1) byte string (unspecified encoding)
(2) ASCII string (US-ASCII encoding)
(3) text string (either PDFDocEncoding or UTF-16BE)

Since (a) the encode() mechanism is used in a variety of contexts and (b) PDF makes no explicit use of UTF-8, it would be incorrect to simply change the output encoding used by encode().

See ISO/IEC 32000 (2008), Section 7.9.2 for details.
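
To illustrate string type (3), here is a minimal Java sketch of how a text string could be serialized. It is not FOP code: the class and method names are hypothetical, and it makes a simplifying assumption that only printable US-ASCII input (where PDFDocEncoding and ASCII coincide) is written in the single-byte form, while everything else is written as UTF-16BE with the leading byte order mark FE FF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of FOP: sketches the PDF "text string" rules.
final class PdfTextStringSketch {

    // True if every char is printable US-ASCII (0x20..0x7E), the range in
    // which PDFDocEncoding and US-ASCII assign the same byte values.
    static boolean isPrintableAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                return false;
            }
        }
        return true;
    }

    // Assumption: anything outside printable ASCII is emitted as UTF-16BE
    // prefixed with the byte order mark FE FF. A fuller implementation would
    // first try the complete PDFDocEncoding table.
    static byte[] encodeTextString(String s) {
        if (isPrintableAscii(s)) {
            return s.getBytes(StandardCharsets.US_ASCII);
        }
        byte[] body = s.getBytes(StandardCharsets.UTF_16BE); // writes no BOM itself
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Renders the bytes as a PDF hexadecimal string object, e.g. <FEFF0142>.
    static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexString(encodeTextString("link")));   // <6C696E6B>
        System.out.println(toHexString(encodeTextString("\u0142"))); // <FEFF0142>
    }
}
```

Note that UTF-8 bytes appear in neither branch, which is why swapping the ENCODING constant alone does not yield a valid text string.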

This patch needs to be reworked to take these details into account. Furthermore, the description of this bug is not adequate: it doesn't explain what the problem actually is:

* Is it that the text content of basic-link is not rendered with Polish characters? If so, then the problem is a font selection problem, not a character encoding problem.

* Is it related to the character encoding used in the /Filespec dictionary for the link annotation?

In any case, the present patch MUST NOT be applied.

Comment 6 Glenn Adams 2012-04-01 21:09:51 UTC

see comment 5

Comment 7 Glenn Adams 2012-04-07 01:41:47 UTC

resetting P2 open bugs to P3 pending further review

Comment 8 Glenn Adams 2012-04-24 05:56:11 UTC

(In reply to comment #6)
> see comment 5

Max, I am still awaiting your input as requested above. If I see no further input by April 30, I will close this bug due to lack of requested information. Regards, Glenn