Issue 63015

Summary: PDF-Export and Type1-Fonts: Error exporting umlaut
Product: gsl Reporter: martho <martin>
Component: code Assignee: hdu <hdu>
Status: CLOSED FIXED QA Contact:
Severity: Trivial    
Priority: P3 CC: coredump, edv, eric.savary, giuseppe.castagno, hdu, issues, kschenk, mseidel, pescetti, philipp.lohmann, stefan.baltzer, thomas.lendo, vdvo
Version: OOo 2.0.2 Flags: pescetti: 4.1.2_release_blocker+
Target Milestone: 4.x   
Hardware: All   
OS: Windows, all   
Issue Type: DEFECT Latest Confirmation in: 4.2.0-dev
Developer Difficulty: ---
Attachments:
Description Flags
Original dokument with some umlauts
none
Resulting PDF with broken umlauts
none
PDF created with GhostScript which contains the umlauts correctly.
none
another writer file in Swedish. having the same problem
none
the resulting PDF, the Swedish letters with character missing.
none

Description martho 2006-03-10 14:03:43 UTC
Hello!

You can export writer-documents to PDFs with Type1-Fonts in OO 2.0.2 (see issue
62307). But this doesn't seem to work with German umlauts like äöü. I tested it
only in the US English version, but since the umlauts are displayed correctly
in the document they should appear correctly in the PDF as well.

I attached a sample-document and the resulting PDF.

Greetings

Martin
Comment 1 martho 2006-03-10 14:04:24 UTC
Created attachment 34732 [details]
Original dokument with some umlauts
Comment 2 martho 2006-03-10 14:04:49 UTC
Created attachment 34733 [details]
Resulting PDF with broken umlauts
Comment 3 michael.ruess 2006-03-10 14:22:51 UTC
Reassigned to HI.
Comment 4 coredump40 2006-04-12 10:51:33 UTC
Confirmed it with the German version as well (2.0.2)
Comment 5 hdu@apache.org 2006-05-17 12:35:22 UTC
The root cause of the problem is that the font "TheMixExtraBold" doesn't contain
these umlauts. This is rather unfortunate considering that the font seems to be
designed in Germany.

The best workaround for a font that lacks characters you want to use is to
select another font that supports the text. Although OOo contains rescue
mechanisms for situations like this, the chances that a missing character is
replaced by the corresponding character from another font that matches the
style of the selected font 100% are pretty dim, especially when the style of
the selected font is rather unique.

OOo's heuristic for fonts that don't support some characters usually works
quite well for TrueType and OpenType fonts, and also for Type1 fonts on Unix
platforms. OOo currently cannot detect whether a character is supported in a
Type1 font on Windows, so this is the worst-case situation for OOo. Maybe we
can improve the heuristic for Type1 fonts on Windows.
Comment 6 martho 2006-05-17 12:45:55 UTC
Hi hdu, and thanks for your reply.

The font used ("TheMixExtraBold") in the example definitely does contain the
umlauts. I see them in Writer when selecting "TheMixExtraBold" and type them,
and because of the shape it's only this font the umlauts can come from. So IMHO
the umlauts are really inside the font. If OpenOffice-Writer can display those
characters without any problems they should be exported to pdf correctly. 

Unfortunately this font is licensed, so I'm not able to provide a copy of it as a
font-file.
Comment 7 hdu@apache.org 2006-05-17 13:08:54 UTC
> I see them in Writer when selecting "TheMixExtraBold" and type them

ATM is responsible for rasterizing and displaying Type1 fonts on Windows. When ATM
sees that a glyph is empty it seems to do something like glyph fallback. To OOo,
the Windows GDI subsystem looks like a black box. OOo doesn't and shouldn't know what
the different Windows subsystems do under the hood. OOo doesn't implement DMA
access to the hard disks either...

> So IMHO the umlauts are really inside the font.

I know the font, the umlauts are not in there. Have a look with a good font viewer.
Comment 8 martho 2006-05-17 13:56:58 UTC
Thanks again for your reply! I think I learned something new (never thought GDI
would be capable of "simulating" umlauts).
Comment 9 martho 2006-05-17 14:21:34 UTC
I just was able to create a pdf containing TheMixExtraBoldCaps-umlauts with
GhostScript and a Windows PS-Printer. Perhaps this might give you a clue how to
solve the problem.

Comment 10 hdu@apache.org 2006-05-17 14:33:27 UTC
This is interesting. Can you attach the corresponding PDF from ghostscript?

I just analyzed this problem some more with our PDF expert. It looks like the
deeper problem is that the font claims it has StandardEncoding, which doesn't
contain some characters such as umlauts or other symbols, even though the font
has the glyphs. So OOo does embed the font and the correct codes for umlauts
into the PDF, but for some reason they don't come together. If there is a
workaround for this situation it has to be handled by our PDF export.
Reassigning to PL.
Comment 11 martho 2006-05-17 14:39:37 UTC
Created attachment 36547 [details]
PDF created with GhostScript which contains the umlauts correctly.
Comment 12 hdu@apache.org 2006-05-17 15:03:52 UTC
*** Issue 64070 has been marked as a duplicate of this issue. ***
Comment 13 hdu@apache.org 2006-05-17 15:10:11 UTC
Looking at attached PDF which was produced by ghostscript I see that the font is
reencoded/subsetted with a WinAnsiEncoding, which covers the umlauts...
Comment 14 philipp.lohmann 2006-06-15 15:13:49 UTC
target
Comment 15 hakre 2006-07-05 12:09:08 UTC
I tested with some other PostScript font (that has umlaut support) and version 
2.0.3 (win32 platform) and I can only confirm the behaviour that PDF Export 
does not export these umlauts and other language-specific characters: ÄÜÖäüöß

A workaround is to install a pdf-printer like PDFCreator. It does export a 
really nice and much smaller PDF from the same document.
Comment 16 vdvo 2006-07-06 16:34:26 UTC
I have a problem that may or may not be related: I make a document with form
fields and I export it to PDF, and then when filling in the form with Adobe
Reader, any text fields will show black dots instead of some Czech accented
characters. Curiously, some Czech characters work, but some don't.
Is this related? Should I file a new bug for this?
Comment 17 philipp.lohmann 2006-07-06 16:44:03 UTC
vdvo: that is basically issue 42985 (for which there is no solution yet). The
characters that do not work are most certainly those not in the WinAnsiEncoding.
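
A rough way to see why only some Czech characters survive: Python's `cp1252` codec is close to (though not formally the same table as) PDF's WinAnsiEncoding, so it can serve as a stand-in for checking coverage. The helper name `split_by_winansi` is made up for this sketch and is not part of OOo.

```python
def split_by_winansi(text):
    """Split characters into (encodable, not_encodable), using cp1252 as an
    approximation of WinAnsiEncoding (assumption: not the PDF spec table)."""
    ok, missing = [], []
    for ch in text:
        try:
            ch.encode("cp1252")
            ok.append(ch)
        except UnicodeEncodeError:
            missing.append(ch)
    return "".join(ok), "".join(missing)


# Lowercase Czech accented letters, in alphabetical order.
covered, uncovered = split_by_winansi("áčďéěíňóřšťúůýž")
print("covered:  ", covered)
print("uncovered:", uncovered)
```

Under this approximation the German umlauts and ß are all covered, while roughly half of the Czech accented letters are not.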
Comment 18 Giuseppe Castagno (aka beppec56) 2006-08-09 10:42:19 UTC
Created attachment 38358 [details]
another writer file in Swedish. having the same problem
Comment 19 Giuseppe Castagno (aka beppec56) 2006-08-09 10:45:36 UTC
Created attachment 38359 [details]
the resulting PDF, the Swedish letters with character missing.
Comment 20 Giuseppe Castagno (aka beppec56) 2006-08-09 10:50:11 UTC
It appears that the document I received from a user suffers from the same problem.
Unfortunately on Linux FC5 I could reproduce it, because the AGaramond Type1
font used gets converted to Times New Roman and the error disappears.

Strange that AGaramond is missing the umlaut though.

CC myself.
Comment 21 Giuseppe Castagno (aka beppec56) 2006-08-09 10:51:49 UTC
I should have said "...on Linux FC5 I could not reproduce it...
Comment 22 philipp.lohmann 2007-01-11 10:33:51 UTC
*** Issue 73347 has been marked as a duplicate of this issue. ***
Comment 23 philipp.lohmann 2007-01-11 10:35:22 UTC
prio
Comment 24 cbrunet 2007-03-27 14:59:59 UTC
*** Issue 75707 has been marked as a duplicate of this issue. ***
Comment 25 philipp.lohmann 2007-04-23 12:23:39 UTC
pl->hdu: the real problem here is that the PDF code cannot know the real code
for these characters due to the somewhat "simplistic" implementation of
WinSalGraphics::GetFontEncodingVector. This method should output the characters
that are not encoded in the standard encoding together with their Adobe names,
but it doesn't, so the PDF code cannot contain them. However I don't know
whether you have a chance to get those pairs on Windows. Please have a look.
Comment 26 hdu@apache.org 2007-04-23 15:59:24 UTC
@pl: we could cook up a simple parser for the pfb's eexec section...
Comment 27 philipp.lohmann 2007-05-11 10:05:37 UTC
target
Comment 28 ooo 2007-08-03 16:20:47 UTC
retargeted to 2.4
Comment 29 hdu@apache.org 2008-01-24 17:14:04 UTC
Decrypting and parsing the Type1 eexec string isn't implemented yet and this will take a while. A 
workaround for psprint till this happens would be to use the adobe glyph names for them.
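
For reference, the eexec cipher itself is small. The following is a minimal sketch of the decryption step as described in Adobe's Type 1 font format documentation (initial key 55665, multiplier 52845, increment 22719, four leading bytes discarded). It assumes the binary form of the encrypted section, as found in a .pfb segment, not the hex form of a .pfa file; the function names are hypothetical, and parsing the decrypted /CharStrings dictionary is not shown.

```python
EEXEC_KEY = 55665        # initial key for the eexec section
C1, C2 = 52845, 22719    # constants of the Type 1 cipher


def eexec_decrypt(cipher: bytes, key: int = EEXEC_KEY, skip: int = 4) -> bytes:
    """Decrypt a binary eexec section and drop the leading random bytes."""
    r = key
    plain = bytearray()
    for c in cipher:
        plain.append(c ^ (r >> 8))
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(plain[skip:])


def eexec_encrypt(plain: bytes, key: int = EEXEC_KEY,
                  lead: bytes = b"\x00" * 4) -> bytes:
    """Inverse operation, included only to round-trip the sketch.
    Real fonts use random lead bytes; zeros are a simplification here."""
    r = key
    out = bytearray()
    for p in lead + plain:
        c = p ^ (r >> 8)
        out.append(c)
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(out)


sample = b"/odieresis"
print(eexec_decrypt(eexec_encrypt(sample)))
```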
Comment 30 michael.ruess 2009-11-27 11:33:35 UTC
*** Issue 107264 has been marked as a duplicate of this issue. ***
Comment 31 astumpf 2010-01-24 13:13:26 UTC
Please allow me to add my 2cents: You don't have to decode the encrypted part of
the type1 font. I'm not aware of any quality font without glyphs for umlauts,
accented characters, etc. (except Symbol Fonts), so you won't get any helpful
information by decoding the font. 
In fact you have to specify the encoding (the mapping of font glyph names to
characters). In the ghostscript example this is done with the tag "/Encoding
/WinAnsiEncoding" within the Font object. In the broken example this tag is
missing. And as the PDF reader has no idea that the character 246 (ö) should be
represented by the glyph /odieresis, it is shown as a space.
For Windows and ISO8859-1 character sets on Unix systems this encoding should
work fine. If you need Eastern European characters as defined in ISO8859-2, you
will have to specify your own encoding vector, as there are no predefined
vectors in PDF for these character sets.
I hope this helps.
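
To illustrate what "specify your own encoding vector" could look like, here is a small sketch that builds a PDF encoding dictionary with a /Differences array from character codes. The glyph names in the table are the standard Adobe names for these letters (ö is `odieresis`); the table is a hand-picked subset and the function names are invented for this example, not taken from OOo.

```python
# Adobe glyph names for a few Latin-1 codes (hand-picked subset).
GLYPH_NAMES = {
    0xC4: "Adieresis", 0xC5: "Aring", 0xD6: "Odieresis", 0xDC: "Udieresis",
    0xDF: "germandbls",
    0xE4: "adieresis", 0xE5: "aring", 0xF6: "odieresis", 0xFC: "udieresis",
}


def differences_array(codes):
    """Build a /Differences array: each run of consecutive codes is written
    as the first code followed by the glyph names of that run."""
    parts = []
    prev = None
    for code in sorted(codes):
        if prev is None or code != prev + 1:
            parts.append(str(code))
        parts.append("/" + GLYPH_NAMES[code])
        prev = code
    return "[" + " ".join(parts) + "]"


def encoding_dict(codes):
    """Wrap the array in an encoding dictionary object body."""
    return "<</Type/Encoding/Differences" + differences_array(codes) + ">>"


print(encoding_dict([0xE4, 0xF6, 0xFC]))
```

For an ISO8859-2 style vector the same shape applies; only the code-to-name table would differ.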
Comment 32 bengtahlgren 2010-03-11 20:26:09 UTC
I have this problem too for a font that I just bought (Helvetica Neue LT,
windows postscript version).  There is no problem with e.g., Nimbus Sans.  Is
more information needed to solve the issue?

Looking at the PDF file, there is a difference in how Nimbus Sans and Helvetica
Neue are embedded.  For the first, the umlauted characters are embedded
separately in addition to the complete set.  The separate embedding looks like this:

54 0 obj
<</Type/Encoding/Differences[ 0
 /Udieresis /Aring /Adieresis /Odieresis /aring /adieresis /odieresis]>>
endobj

[...]

56 0 obj
<</Type/Font/Subtype/Type1/BaseFont/NimbusSanL-Regu
/Encoding 54 0 R
/ToUnicode 55 0 R
/FirstChar 0
/LastChar 6
/Widths[722 667 667 778 556 556 556  ]
/FontDescriptor 51 0 R>>
endobj

There is no such handling of the umlauted characters for Helvetica Neue.

Comparing the .afm files, Nimbus Sans has, for example:

C -1 ; WX 556 ; N aring ; B 42 -23 535 754 ;

and Helvetica Neue has:

C 229 ; WX 574 ; N aring ; B -7 -14 522 778 ;

If you need more info, I'd be happy to provide!
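
The difference between the two .afm lines can be checked mechanically. In AFM character metrics, `C -1` marks a glyph that has no code in the font's default encoding. The sketch below parses such lines and lists the unencoded glyphs; it assumes integer widths as in the two samples, and the function names are made up for illustration.

```python
def parse_char_metric(line):
    """Parse one AFM character-metric line into code, width and glyph name."""
    fields = {}
    for part in line.split(";"):
        tokens = part.split()
        if tokens:
            fields[tokens[0]] = tokens[1:]
    return {
        "code": int(fields["C"][0]),
        "width": int(fields["WX"][0]),
        "name": fields["N"][0],
    }


def unencoded_glyphs(lines):
    """Names of glyphs that the AFM marks as unencoded (C -1)."""
    return [m["name"] for m in map(parse_char_metric, lines) if m["code"] == -1]


nimbus = "C -1 ; WX 556 ; N aring ; B 42 -23 535 754 ;"
helvetica = "C 229 ; WX 574 ; N aring ; B -7 -14 522 778 ;"
print(unencoded_glyphs([nimbus, helvetica]))
```

So Nimbus Sans declares aring as unencoded, which would fit the separate /Differences embedding seen in its PDF, while Helvetica Neue already assigns it code 229.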
Comment 33 bengtahlgren 2010-03-17 15:46:03 UTC
Some more experimenting based on astumpf's comment:

In the PDF file I manually added:

/Encoding/WinAnsiEncoding/Subtype/Type1

to the font object like this:

17 0 obj
<</Type/Font/Subtype/Type1/BaseFont/HelveticaNeueLT-Italic
/Encoding/WinAnsiEncoding/Subtype/Type1
/ToUnicode 16 0 R
/FirstChar 0 /LastChar 255
/Widths[0 0 0 0 0 222 222 0 222 0 222 222 222 222 0 0
0 0 0 0 0 0 0 0 0 167 519 519 556 222 611 444
278 259 426 556 556 926 630 278 259 259 352 600 278 389 278 333
556 556 556 556 556 556 556 556 556 556 278 278 600 600 600 556
800 667 685 722 704 611 574 759 722 259 519 667 556 870 722 759
648 759 685 648 574 722 611 926 611 611 611 259 333 259 600 500
222 519 593 537 593 537 296 574 556 222 222 481 222 852 556 574
593 593 333 481 315 556 481 759 481 481 444 333 222 333 600 0
556 0 278 556 426 1000 556 556 222 1074 648 259 1074 0 0 0
0 278 278 426 426 500 500 1000 222 990 481 259 907 0 0 611
278 259 556 556 556 556 222 556 222 800 311 463 600 600 800 222
400 600 333 333 222 556 600 278 222 333 344 463 834 834 834 556
667 667 667 667 667 667 926 722 611 611 611 611 259 259 259 259
704 722 759 759 759 759 759 600 759 722 722 722 722 611 648 537
519 519 519 519 519 519 870 537 537 537 537 537 222 222 222 222
574 556 574 574 574 574 574 600 574 556 556 556 556 481 593 481
]
/FontDescriptor 15 0 R>>
endobj

and it worked!!!
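
The manual edit can be sketched as a small text transformation on the font dictionary. This is only an illustration under assumptions: it works on the dictionary text alone, the function name is hypothetical, and it does not check for symbol fonts, for which WinAnsiEncoding would not be appropriate. Inserting bytes also changes the file length, so the offsets in the PDF's xref table become stale unless they are rewritten.

```python
import re


def add_winansi_encoding(font_dict: str) -> str:
    """Insert /Encoding/WinAnsiEncoding after the /BaseFont entry of a Type1
    font dictionary, unless an /Encoding entry is already present."""
    if not re.search(r"/Subtype\s*/Type1\b", font_dict):
        return font_dict
    if "/Encoding" in font_dict:
        return font_dict
    return re.sub(
        r"/BaseFont\s*/[^\s/<>\[\]()]+",
        lambda m: m.group(0) + "\n/Encoding/WinAnsiEncoding",
        font_dict,
        count=1,
    )


src = ("<</Type/Font/Subtype/Type1/BaseFont/HelveticaNeueLT-Italic\n"
       "/ToUnicode 16 0 R\n/FirstChar 0 /LastChar 255>>")
print(add_winansi_encoding(src))
```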
Comment 34 stefan.baltzer 2011-02-14 14:08:47 UTC
Adding CCs.
Comment 35 edv 2014-09-02 14:09:56 UTC
After digging through the PDF documentation, I found a pretty easy solution (for PDF-1.4). My Type 1 font is defined in the PDF file like this:

9 0 obj
<</Type/Font/Subtype/Type1/BaseFont/Syntax /ToUnicode 8 0 R
/FirstChar 0 /LastChar 255
/Widths[ ....

When adding "/Encoding /WinAnsiEncoding" to it so that it becomes

9 0 obj
<</Type/Font/Subtype/Type1/BaseFont/Syntax
/Encoding /WinAnsiEncoding /ToUnicode 8 0 R
/FirstChar 0 /LastChar 255
/Widths[ ...

the umlauts show up. So please can we add this to Type1 Font created PDF files?

According to PDF Reference, Third Edition, version 1.4 linked here http://www.adobe.com/devnet/pdf/pdf_reference_archive.html on Page 317-318 "Entries in a Type 1 font dictionary" it says for /Encoding:

(Optional) A specification of the font’s character encoding, if different from its built-in encoding. The value of Encoding may be either the name of a predefined encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, as described in Appendix D) or an encoding dictionary that specifies differences from the font’s built-in encoding or from a specified predefined encoding (see Section 5.5.5, “Character Encoding”).

So as long as no encoding is set through other constraints "/Encoding /WinAnsiEncoding" could be set. My font itself has "StandardEncoding" as parameter and this means pretty much nothing as described on page 329.

Regards

Martin
Comment 36 edv 2014-09-19 08:05:47 UTC
I found the solution to this. 
In vcl\source\gdi\pdfwriter_impl.cxx at the function emitEmbeddedFont at the Line 3696:
if( !pFont->IsSymbolFont() && pEncoding == 0)
must be changed to:
if( !pFont->IsSymbolFont() )

Reason: Without the pEncoding check, "/Encoding/WinAnsiEncoding\n" is added to the PDF file, which is correct. pEncoding specifies that a ToUnicode stream has to be generated (and it is), and nothing speaks against it because it is only a translation table and doesn't affect the encoding itself. For symbolic fonts WinAnsiEncoding would be wrong because they ship with their own encoding.

I don't want to create a patch and upload this myself because I don't intend to do more bug fixing on OpenOffice and it is too tiny to go through the whole upload process. So could someone else please do this; I don't claim any rights to that code submission.
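
The effect of the proposed one-line change can be written out as a truth table. The following is a Python model of the condition only, not the actual C++ code in pdfwriter_impl.cxx; `has_encoding_vector` stands in for `pEncoding != 0`, and the function name is invented for this sketch.

```python
def emits_winansi(is_symbol_font: bool, has_encoding_vector: bool,
                  patched: bool) -> bool:
    """Model of the condition guarding the /Encoding/WinAnsiEncoding output.
    has_encoding_vector corresponds to pEncoding != 0 in the C++ source."""
    if patched:
        # Proposed: if( !pFont->IsSymbolFont() )
        return not is_symbol_font
    # Original: if( !pFont->IsSymbolFont() && pEncoding == 0 )
    return not is_symbol_font and not has_encoding_vector


for symbol in (False, True):
    for vector in (False, True):
        print(symbol, vector,
              emits_winansi(symbol, vector, patched=False),
              emits_winansi(symbol, vector, patched=True))
```

The only case that changes is a non-symbol font with an encoding vector: the original condition skips the /Encoding entry there, while the proposed one writes it.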
Comment 37 SVN Robot 2014-10-15 09:03:40 UTC
"hdu" committed SVN revision 1631975 into trunk:
#i63015# always default to WinAnsiEncoding for non-symbol PDF-Type1 export
Comment 38 hdu@apache.org 2014-10-15 09:08:18 UTC
Many thanks for debugging into it and pointing out the problematic source line. Sorry that the review took so long.
Comment 39 Kay 2015-09-07 21:46:37 UTC
FINALLY fixed! We should include it in 4.1.2.
Comment 40 Andrea Pescetti 2015-09-23 20:57:39 UTC
Accepted for 4.1.2.
Comment 41 SVN Robot 2015-09-24 23:01:28 UTC
"kschenk" committed SVN revision 1705192 into branches/AOO410:
#i63015# Merged from trunk r 1631975
Comment 42 Andrea Pescetti 2015-10-17 22:21:36 UTC
It would be great if one of the many people who are following this issue could take the time to download OpenOffice 4.1.2-RC2 (German: https://dist.apache.org/repos/dist/dev/openoffice/4.1.2-rc2-r1707648/binaries/de/ ) and comment on whether the bug is now fixed for the upcoming OpenOffice version 4.1.2. Thanks!
Comment 43 Andrea Pescetti 2015-10-18 21:50:12 UTC
Can't verify since (on Linux) the sample document ("Original dokument..." in the Attachments above) already gets converted to PDF correctly with older versions of OpenOffice, such as 4.1.0. Still, it does work with 4.1.2-RC2 too.
Comment 44 edv 2015-10-19 08:20:42 UTC
I just downloaded 4.1.2 Build 9781 Rev.1707648, tested it, and it works. Thanks for fixing it. 
We personally switched the problematic font to an OpenType font.
As a remark for further development: the LOO community has merged the PDF creation process for all platforms into one component, so the fix no longer works there for a Linux constellation, see here:
http://cgit.freedesktop.org/libreoffice/core/commit/?id=297b22bd49ea11a90063ab8503fb83090f351668
Comment 45 Andrea Pescetti 2015-10-19 08:22:54 UTC
@edv: Thanks for the verification! Marking VERIFIED.
Comment 46 Kay 2016-08-30 21:26:11 UTC
Closing.