Bug 52477

Summary: FOP always uses the same prefix for embedded fonts
Product: Fop - Now in Jira
Reporter: quamis <quamis+asf>
Component: pdf
Assignee: fop-dev
Status: REOPENED ---    
Severity: enhancement    
Priority: P5    
Version: all   
Target Milestone: ---   
Hardware: PC   
OS: All   

Description quamis 2012-01-17 16:03:53 UTC
After having some problems with ghostscript while trying to concatenate PDF files generated by FOP, we came to the conclusion that FOP generates the embedded font prefixes by always using the same sequence.

@see http://bugs.ghostscript.com/show_bug.cgi?id=692795 for the initial bug report

According to Ken Sharp from ghostscript, each embedded font should have a unique name that is not repeated across multiple generations. I couldn't find this in the PDF specs, but I got a bit lost trying to find anything in there, so that is not really conclusive.

Basically, it seems that FOP always generates the embedded font prefixes sequentially as EAAAAA, EAAAAB, EAAAAC, etc., when it should generate unique prefixes.
Because it generates the same prefixes in every file, gs (and the PDF viewer) cannot display the required fonts after the files are concatenated. I cannot contribute a patch as I have no knowledge of Java, but I think that the prefix should be based either on the current timestamp plus the current index (easiest), or on the glyphs of the currently embedded font. The latter should be more accurate, but any method will do for now.
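
To make the reported behaviour concrete, here is a minimal sketch of a sequential tag generator that yields the sequence described above. It is not the actual code from PDFFactory.createSubsetFontPrefix(); the class and method names are hypothetical, and the layout (a fixed leading 'E' followed by a five-letter base-26 counter) is only an assumption inferred from the EAAAAA, EAAAAB, EAAAAC sequence.

```java
// Hypothetical sketch, NOT the real FOP implementation.
final class SequentialPrefixSketch {
    private int counter = 0;

    /** Returns the tag for a zero-based subset index, e.g. 0 gives "EAAAAA". */
    static String tagFor(int index) {
        char[] tag = {'E', 'A', 'A', 'A', 'A', 'A'};
        int n = index;
        // Write the index as a base-26 number into the five trailing letters,
        // least significant letter last. Indexes beyond 26^5 - 1 are truncated.
        for (int pos = 5; pos >= 1 && n > 0; pos--) {
            tag[pos] = (char) ('A' + n % 26);
            n /= 26;
        }
        return new String(tag);
    }

    /** Returns the next tag; every new generator starts again at "EAAAAA". */
    String next() {
        return tagFor(counter++);
    }

    public static void main(String[] args) {
        SequentialPrefixSketch documentA = new SequentialPrefixSketch();
        SequentialPrefixSketch documentB = new SequentialPrefixSketch();
        System.out.println(documentA.next() + " " + documentA.next());
        // A second document restarts the sequence, so its first tag
        // is identical to the first tag of documentA.
        System.out.println(documentB.next());
    }
}
```

Under these assumptions the tags are distinct within one document, but any two documents produced by separate runs begin with the same tags, which is the situation described in this report.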

It should be possible to disable this from the command line, so that automated unit tests that compare binary files do not fail because otherwise identical files always contain something different.

I have font embedding enabled, as described at http://xmlgraphics.apache.org/fop/trunk/fonts.html#embedding, and it only embeds the glyphs that are used. The same thing happens, though, if I embed the whole font (using encoding-mode).

I have located the culprit in java\org\apache\fop\pdf\PDFFactory.java, in the function createSubsetFontPrefix(), but as mentioned I'm unable to provide a patch.
I have also found this related issue: http://osdir.com/ml/fop-users-xmlgraphics.apache.org/2009-04/msg00127.html
Comment 1 Mehdi Houshmand 2012-01-17 16:18:54 UTC
Hi,

This isn't a bug; the PDF specification doesn't mandate that the font prefixes are unique outside the scope of the document. The only mandate is:

"The tag consists of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file must have different tags."
From Section 5.5.3 PDF v1.4 Reference.
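
The quoted rule can be expressed as a small check. The sketch below is only an illustration with hypothetical class and method names; it assumes that subset font names take the form TAG+FontName (for example "EAAAAA+Arial"), which is how the same section of the PDF Reference describes subset names, and it verifies only what the quote requires: six uppercase letters, distinct within one file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of FOP.
final class SubsetTagCheck {
    // Six uppercase letters, a plus sign, then the font name proper.
    private static final Pattern TAGGED_NAME = Pattern.compile("([A-Z]{6})\\+.+");

    /** Returns the six-letter tag of a subset font name, or null if it has none. */
    static String tagOf(String fontName) {
        Matcher m = TAGGED_NAME.matcher(fontName);
        return m.matches() ? m.group(1) : null;
    }

    /** True if no two subset font names from ONE document share a tag. */
    static boolean tagsAreDistinct(List<String> fontNamesInDocument) {
        Set<String> seen = new HashSet<String>();
        for (String name : fontNamesInDocument) {
            String tag = tagOf(name);
            if (tag != null && !seen.add(tag)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tagsAreDistinct(Arrays.asList("EAAAAA+Arial", "EAAAAB+Times")));
        System.out.println(tagsAreDistinct(Arrays.asList("EAAAAA+Arial", "EAAAAA+Times")));
    }
}
```

Note that the check is scoped to a single document: two separate files that both contain "EAAAAA+Arial" each pass on their own, and the rule as quoted says nothing about what happens when they are merged.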

As such this isn't a bug. Sorry to be dismissive, but as you said in your post on the ghostscript bug report, making these "unique" doesn't solve the issue since there could likely be clashes since the prefix is only 6 chars.

In my opinion, Ken Sharp is mistaken when he says "If the font has the same name and prefix then it is the same font", that isn't what the PDF specification says (though understandably that's how it could be interpreted). The spec only says that each subset has to be unique within the scope of a document, which is what FOP already does.

Mehdi
Comment 2 quamis 2012-01-18 07:44:52 UTC
(In reply to comment #1)

Yes, he might not be exactly wrong, but this doesn't mean that FOP shouldn't try to be as arbitrary as possible. The algorithm used in the code is completely predictable.
Comment 3 Mehdi Houshmand 2012-01-18 07:57:11 UTC
(In reply to comment #2)
</snip>

There are 2 points to address there:
1) We can't arbitrarily make changes to FOP in order for it to "better" (not even fully!!) support the client systems, in this case ghostscript. The bug is in ghostscript, it should know if it is using a new PDF and change any prefixes accordingly.

2) We use the deterministic trait of the prefixes in our testing framework. The value of having a comprehensive test suite is far greater than making the code change for this scenario.

I understand that none of the above particularly helps you, but we can't very well go changing FOP to accommodate nuanced bugs in ghostscript.

Mehdi
Comment 4 quamis 2012-01-18 08:16:33 UTC
(In reply to comment #3)
> 2) We use the deterministic trait of the prefixes in our testing framework. The
> value of having a comprehensive test suite is far greater than making the code
> change for this scenario.

That's why I was saying that a command-line switch to disable the "randomized" behavior should exist. The change seemed trivial enough.

 
> I understand that none of the above particularly helps you, but we can't very
> well go changing FOP to accommodate nuanced bugs in ghostscript.
> 
> Mehdi

I understand that, but generating the same sequence over and over just seems to be a compromise made for easier automated testing, not for an actual working and tested product.

For now we'll carry on using pdftk, which seems to handle the case of multiple fonts with the same name correctly. It's too bad, though, that one has to use 3 different applications, each with its own quirks, bugs and usage patterns, simply because the standard isn't very clear on a specific issue, when that issue could easily be fixed by either of the 2 applications involved in this chain...
Comment 5 Chris Bowditch 2012-01-18 10:18:02 UTC
I agree that making the prefix unique will make it easier for applications that process the PDF to extract or de-duplicate font resources when merging multiple PDF files. However, the suggestion in this bug report to make the prefix random based on time introduces problems for regression testing PDFs generated by FOP, not just for the FOP project itself, but also for users of FOP who wish to regression test their documents.

We could change FOP to make the prefix dependent on the glyphs in the subset, but that would be a lot of work.
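
As a rough illustration of that idea, the sketch below derives a six-letter tag from a hash of the font name and the set of glyph IDs in the subset. Everything in it is an assumption made for illustration: the class and method names, the choice of SHA-256, the inclusion of the font name in the hash, and the use of integer glyph IDs as input. It is not how FOP works today.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical sketch of a glyph-dependent tag; not part of FOP.
final class GlyphSubsetTagSketch {
    private static final BigInteger TAG_SPACE = BigInteger.valueOf(26).pow(6);

    static String tagFor(String fontName, Collection<Integer> glyphIds) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(fontName.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between the name and the glyph IDs
            // Sort and de-duplicate so the tag does not depend on glyph order.
            for (int id : new TreeSet<Integer>(glyphIds)) {
                md.update((byte) (id >>> 24));
                md.update((byte) (id >>> 16));
                md.update((byte) (id >>> 8));
                md.update((byte) id);
            }
            // Reduce the 256-bit digest to a number below 26^6, then write
            // that number as six base-26 "digits" A..Z.
            long value = new BigInteger(1, md.digest()).mod(TAG_SPACE).longValue();
            char[] tag = new char[6];
            for (int pos = 5; pos >= 0; pos--) {
                tag[pos] = (char) ('A' + (int) (value % 26));
                value /= 26;
            }
            return new String(tag);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tagFor("Arial", Arrays.asList(3, 17, 42)));
    }
}
```

A tag derived this way stays the same from run to run for the same input, so binary regression tests would remain stable. Two different subsets can still map to the same tag, so a writer using this approach would still need to detect and resolve clashes within a single document.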

An alternative approach that will also make it easier for applications to extract or de-duplicate font resources when merging multiple PDFs is to allow FOP to fully embed the font resources in the PDF, rather than creating a subset. I believe this is possible today for a limited use-case, by specifying encoding-mode="single-byte" on the font element within the fop.xconf file. I say "limited" because that only works if no characters outside the ASCII range are required.

Luis Bernardo, a new contributor to the FOP project, is working on a new feature, embedding-mode="full", which will fully embed the font and will work for character ranges outside ASCII. If the font is fully embedded, applications will be able to de-duplicate font resources more readily when merging FOP-generated PDFs.

I've reopened this Bugzilla entry, but as an enhancement request rather than a bug. As Mehdi stated, this isn't a bug, but rather a convenience feature.
Comment 6 Mehdi Houshmand 2012-01-18 10:51:13 UTC
(In reply to comment #5)
</snip>
> An alternative approach that will also make it easier for applications to
> extract or de-duplicate font resources when merging multiple PDFs is to allow
> FOP to fully embed the font resources in the PDF, rather than creating a
> subset. I believe this is possible today for a limited use-case, by specifying
> encoding-mode="single-byte" on the font element within the fop.xconf file. I
> say "limited" because that only works if no characters outside the ASCII range
> are required.

That wouldn't necessarily fix the issue here. Fully embedding a font means that the pseudo-unique prefix isn't used; however, this isn't necessarily a good thing. A parser like ghostscript could, and apparently does, assume that if 2 fonts have the same name (prefix or not) they are the same font. This is an assumption that I've made previously and that has proved manifestly naive. Also, any implementation CANNOT clash within the same document. With the glyph subset idea, there could be a scenario in which 2 fonts with the same glyph subsets produce the same prefix.

We have to be careful what we're supporting here. There is no standardised method to identify a font, since anyone can call any font by any name. I don't agree that making the prefix "more unique" (not sure there is a scale by which something can be measured unique, it's binary, it is or it isn't), would help here, because given time, inevitably you'll get a clash. Then what?

The prefixes are 6 chars long, the guys at Adobe made no indication that they wanted it to be unique in a global sense, only within a document.
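
For a sense of scale, a six-letter tag allows 26^6 = 308,915,776 distinct values. The sketch below is only a back-of-the-envelope estimate using the standard birthday approximation; it assumes tags are drawn uniformly at random, which is neither what FOP does nor something the specification asks for, and the class and method names are hypothetical.

```java
// Hypothetical estimate; assumes uniformly random, independent tags.
final class TagClashEstimate {
    /** Number of distinct six-letter uppercase tags: 26^6. */
    static long tagSpace() {
        long n = 1;
        for (int i = 0; i < 6; i++) {
            n *= 26;
        }
        return n;
    }

    /**
     * Birthday approximation of the chance that at least two of the given
     * number of random tags coincide: 1 - exp(-k(k-1) / (2N)).
     */
    static double clashProbability(long tags) {
        double pairs = (double) tags * (tags - 1) / 2.0;
        return -Math.expm1(-pairs / tagSpace());
    }

    public static void main(String[] args) {
        System.out.println("tag space: " + tagSpace());
        for (long k : new long[] {100, 1_000, 10_000, 100_000}) {
            System.out.println(k + " tags: " + clashProbability(k));
        }
    }
}
```

Under that uniformity assumption, the chance of at least one shared tag approaches one half at somewhere around twenty thousand tags, which illustrates why random tags cannot be treated as globally unique identifiers.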
Comment 7 quamis 2012-01-18 11:27:19 UTC
(In reply to comment #6)

> That wouldn't necessarily fix the issue here. Fully embedding a font means that
> the pseudo-unique prefix isn't used, however this isn't necessarily a good
> thing. A parser like ghostscript, could and apparently does assume that if 2
> fonts have the same name (prefix or not) that they are the same font. This is
> an assumption  that I've made previously and has proved manifestly naive. Also,
> any implementation CANNOT clash within the same document. Using a glyph subset
> idea, there could be a scenario in which the 2 fonts with the same glyph
> subsets produce the same prefix.

But if 2 fonts have the same glyph subsets used within a document, then it wouldn't be necessary to include them twice, so no clash would occur. I think that glyph subsets are a good idea, but I do realize that they would be more complex to implement.

> 
> We have to be careful what we're supporting here. There is no standardised
> method to identify a font, since anyone can call any font by any name. I don't
> agree that making the prefix "more unique" (not sure there is a scale by which
> something can be measured unique, it's binary, it is or it isn't), would help
> here, because given time, inevitably you'll get a clash. Then what?

Because the prefix is 6 chars long, it's inevitable that one would eventually get a clash if enough millions of different fonts were used within the same file. But this is an acceptable limitation.

> The prefixes are 6 chars long, the guys at Adobe made no indication that they
> wanted it to be unique in a global sense, only within a document.

Yes, the Adobe guys probably meant that the prefix should be unique within the same file, and that it would be the PDF reader/writer's job to handle duplicate fonts coming from different files. It makes sense. This is why I think both fop and gs handle this particular case wrongly: they both assumed things about that prefix, and it seems that these assumptions have now been proven wrong.
gs in particular should warn about merging files with embedded fonts, either when merging, or at least in the manual or on a "known-issues" page.
Comment 8 Pascal Sancho 2012-01-18 12:11:52 UTC
Since font files are versioned, how will this be handled when 2 subsets use the same glyphs of the same font, but in different versions?
Subset reduction should take care of that.