Issue 59576 - Can't copy text from PDF exported from OOo
Summary: Can't copy text from PDF exported from OOo
Status: CONFIRMED
Alias: None
Product: gsl
Classification: Code
Component: code
Version: OOo 2.0.1
Hardware: PC All
Importance: P3 Trivial
Target Milestone: OOo 3.x
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: needmoreinfo, oooqa
Depends on:
Blocks: 41707 92549
Reported: 2005-12-20 04:01 UTC by pocha
Modified: 2018-01-18 05:47 UTC
CC List: 7 users

See Also:
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
The Thai Writer document. (6.33 KB, application/vnd.sun.xml.writer)
2005-12-20 04:04 UTC, pocha
The Thai PDF document. (11.77 KB, application/pdf)
2005-12-20 04:05 UTC, pocha
Cut and paste from PDF document. (29 bytes, text/plain)
2005-12-20 04:08 UTC, pocha
Save as from PDF document. (41 bytes, text/plain)
2005-12-20 04:10 UTC, pocha
Document created with Distiller. There is no problem. (120.25 KB, application/pdf)
2005-12-21 03:16 UTC, pocha
OpenDocument test file (7.31 KB, application/vnd.sun.xml.writer)
2006-01-21 06:27 UTC, jjc
PDF generated on Linux (14.57 KB, application/pdf)
2006-01-21 06:27 UTC, jjc
PDF generated on Windows (14.54 KB, application/pdf)
2006-01-21 06:28 UTC, jjc
Result of copying from Linux generated PDF with Acrobat Reader and pasting into gedit (130 bytes, text/plain)
2006-01-21 06:58 UTC, jjc

Description pocha 2005-12-20 04:01:14 UTC
1. Create a document in Writer. Type "การติดตั้ง Openoffice.org 2.0".
2. Click the "Export Directly as PDF" button and save the PDF file.
3. Open the PDF file in Adobe Reader 7.0.5.
4. Select all text and copy.
5. Paste into Notepad2. The result is not readable.
6. In Adobe Reader, save the file as text.
7. Open the .txt file in Notepad2.
8. The result is not readable either.
Comment 1 pocha 2005-12-20 04:04:34 UTC
Created attachment 32591 [details]
The Thai Writer document.
Comment 2 pocha 2005-12-20 04:05:33 UTC
Created attachment 32592 [details]
The Thai PDF document.
Comment 3 pocha 2005-12-20 04:08:03 UTC
Created attachment 32593 [details]
Cut and paste from PDF document.
Comment 4 pocha 2005-12-20 04:10:51 UTC
Created attachment 32594 [details]
Save as from PDF document.
Comment 5 jjc 2005-12-20 07:55:18 UTC
Is this an Adobe Reader problem or an OOo problem?  To demonstrate that this is
an OOo problem, I think we need an example of a PDF document with Thai text
that can be successfully copied from using Adobe Reader.  What happens if you
use Acrobat to create the PDF file?
Comment 6 aziem 2005-12-20 14:22:20 UTC
The text pastes correctly into gedit ("GNOME's notepad").  Also, the text pasted
into OpenOffice.org is invisible at first but appears after using Format->Default.

System: OpenOffice.org 2.0.1rc2, Acrobat Reader 7.0.0, Linux 2.6, x86-32

Comment 7 pocha 2005-12-21 03:10:09 UTC
I downloaded another PDF file to test. When I copy text from it and paste it
into Notepad2, there is no problem. Please test the PDF file.
Comment 8 pocha 2005-12-21 03:16:42 UTC
Created attachment 32618 [details]
Document created with Distiller. There is no problem.
Comment 9 markpeak 2005-12-21 03:42:10 UTC
aziem:
Which attachment did you test? #32592?

I can confirm this problem with both Evince and Adobe Reader 7.0 on Ubuntu 5.10.

Comment 10 michael.ruess 2005-12-22 10:42:11 UTC
Reassigned to hi.
Comment 11 jjc 2005-12-22 13:34:43 UTC
Confirmed.
Comment 12 h.ilter 2006-01-02 15:03:54 UTC
Sorry, but is anybody able to save a *.txt file with a Thai font?
Btw. I was not able to paste text from the clipboard into notepad2 that I had
copied from a webpage like http://www.the-thainews.com
Pasting into OOo was OK.
In summary, I don't think that we have a PDF issue here.
Comment 13 lohmaier 2006-01-02 22:04:01 UTC
The result looks like an ordinary encoding problem to me. (Some Thai-encoding
instead of UTF-8)

Cannot reproduce on linux when pasting to gedit (from evince). All works fine
with both attached PDFs.
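
To illustrate what an ordinary encoding mismatch looks like, here is a minimal Python sketch. The sample word is only an illustrative choice, and the codecs used for the "wrong" decoding (cp1252, latin-1) are assumptions about what a misconfigured editor might apply; nothing here is taken from the attached files.

```python
# A Thai sample word (illustrative choice, 10 code points).
thai = "การติดตั้ง"

utf8_bytes = thai.encode("utf-8")    # 3 bytes per Thai code point
tis_bytes = thai.encode("tis_620")   # 1 byte per Thai code point

# Decoding with the wrong codec produces unreadable text.
# cp1252 has undefined bytes, so errors="replace" substitutes U+FFFD for them.
misread_utf8 = utf8_bytes.decode("cp1252", errors="replace")
misread_tis = tis_bytes.decode("latin-1")

print(len(utf8_bytes), len(tis_bytes))
print(misread_utf8)
print(misread_tis)

# Decoding with the matching codec restores the text.
print(tis_bytes.decode("tis_620") == thai)
```

If the pasted result were only an encoding mismatch of this kind, re-reading the bytes with the right codec would restore every character; missing or substituted characters would point to a different cause.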
Comment 14 samphan 2006-01-03 06:33:41 UTC
> Sorry but is anybody able to save a *.txt file with thai font?

Yes, most can. You only have to look at it with the right font/encoding (TIS-620
or UTF-8).

> Btw. I was not able to paste a text from clipboard into notepad2 which I've
> copied from an webpage like http://www.the-thainews.com

I do exactly the same with my notepad2 with the right font configured for Thai.
It works OK. Try Notepad, instead.

> Paste into OOo was ok

It should be.

> In summary I don't think that we have an pdf issue here.

It's an OOo-generated PDF issue.
Thai PDF from Acrobat PDFMaker 6 (Distiller 6) in attachment id 32618 can be
selected and pasted into any program fine.
Text selected and pasted from an OOo-generated Thai PDF will have
missing/converted characters.
I guess this has something to do with how OOo encodes Thai text in PDF.

Please investigate it further.
This lowers the quality of PDFs generated by OOo for, I guess, most CTL scripts.
Comment 15 h.ilter 2006-01-03 16:26:04 UTC
Reopened for further investigation.
Comment 16 h.ilter 2006-01-03 16:28:59 UTC
HI->HDU: I'm not able to reproduce the problem with my windows system. 
Maybe you can.
Comment 17 hdu@apache.org 2006-01-04 15:22:20 UTC
HDU->PL: I think the problem these PDF viewers are having is that for PDF export
we currently just keep unicodes<=0xFF in their place... Is it possible to use
non-unicode encodings for other text and their corresponding subsets?
Comment 18 hdu@apache.org 2006-01-05 07:50:28 UTC
forgot to reassign
Comment 19 philipp.lohmann 2006-01-10 17:08:32 UTC
The problem is not with most characters, but only with composed characters
(those which do not have a bijective unicode <-> glyph mapping). With the
provided PDF I can copy the Thai text easily apart from the composed glyphs,
which would need to result in a Unicode sequence instead of one code. I don't
see how we can do that.

We'd need to output font subsets with different encodings then, but how would
that be possible given that the only thing we have is a SalLayout which knows
only about glyph ids?
Comment 20 jjc 2006-01-21 06:25:31 UTC
The problems are different according to whether you create the PDF with
OpenOffice on Windows or on Linux:

- on Linux, the only problem is with SARA AM (U+0E33)
- on Windows, any character that is not the first in its cluster is lost
Comment 21 jjc 2006-01-21 06:27:09 UTC
Created attachment 33426 [details]
OpenDocument test file
Comment 22 jjc 2006-01-21 06:27:50 UTC
Created attachment 33427 [details]
PDF generated on Linux
Comment 23 jjc 2006-01-21 06:28:25 UTC
Created attachment 33428 [details]
PDF generated on Windows
Comment 24 jjc 2006-01-21 06:56:53 UTC
On Linux, you can see the problems in the first three lines of my test case.
(It's better to use Acrobat Reader to test, rather than evince, since evince has
some bugs.)

There are actually three problems:

a)  The first problem you can see in line 2.  In the cut-and-pasted text, the
single SARA AM (U+0E33) has turned into two SARA AMs.  What happens is that the
ICU layout engine decomposes SARA AM into NIKHAHIT (U+0E4D) and SARA AA (U+0E32).
The glyph to character mapping returned by ICU associates both the NIKHAHIT and
the SARA AA glyphs with the SARA AM character.

b) The second problem you can see in line 1.  The last character on the line,
which is SARA AA in the PDF, has been turned into a SARA AM in the cut-and-pasted
text.  This happens because when the PDF writer implementation sees the SARA AM
character it creates an entry in the font with a glyph SARA AA associated with
Unicode character SARA AM; when it sees the SARA AA character, it reuses the
font entry because it has the same SARA AA glyph, even though this SARA AA glyph
is associated with a SARA AA character.

c) The third problem you can see in line 3.  The MAI THO (U+0E49) in the PDF has
turned into another SARA AM.  In this case the ICU layout engine decomposes the
SARA AM as before, then it swaps the MAI THO and NIKHAHIT glyphs: the three
characters NO NEN, MAI THO, SARA AM are mapped into four glyphs, NO NEN,
NIKHAHIT, MAI THO, SARA AA.  The glyph to character mapping generated by ICU is
[0 2 1 2], in other words it correctly and unambiguously associates the MAI THO
glyph with the MAI THO character.  However, IcuLayoutEngine::operator()
"smooths" this out to [0 2 2 2] as part of its cluster detection heuristics, so
you end up with three SARA AMs.

Note that to make the example in line 3 work properly, when SARA AM is
decomposed, in the PDF the NIKHAHIT glyph should not be associated with
anything, and the SARA AA glyph should be associated with the SARA AM character.
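
The line 3 case and the font-entry reuse can be simulated with a small Python sketch. This is a simplified model, not the OOo or ICU code: `copied_text`, `build_font_entries` and the glyph name `sara_aa_glyph` are hypothetical, and the model assumes a viewer simply emits, for each glyph, the character that glyph is mapped to (-1 meaning no character).

```python
NO_NEN, MAI_THO = "\u0e13", "\u0e49"
SARA_AA, SARA_AM, NIKHAHIT = "\u0e32", "\u0e33", "\u0e4d"

def copied_text(chars, glyph_to_char):
    """Text recovered when every glyph yields the character it maps to."""
    return "".join(chars[i] for i in glyph_to_char if i >= 0)

def names(text):
    """Show code points so the result is readable without a Thai font."""
    return [f"U+{ord(c):04X}" for c in text]

# Line 3: characters NO NEN, MAI THO, SARA AM
# become glyphs NO NEN, NIKHAHIT, MAI THO, SARA AA.
chars = [NO_NEN, MAI_THO, SARA_AM]
icu_map = [0, 2, 1, 2]        # mapping as returned by ICU
smoothed_map = [0, 2, 2, 2]   # after the cluster "smoothing"
proposed_map = [0, -1, 1, 2]  # NIKHAHIT glyph mapped to nothing

print(names(copied_text(chars, icu_map)))       # SARA AM appears twice
print(names(copied_text(chars, smoothed_map)))  # SARA AM appears three times
print(names(copied_text(chars, proposed_map)))  # matches the input

def build_font_entries(glyph_char_pairs):
    """Model of problem b): the first character seen for a glyph wins."""
    to_unicode = {}
    for glyph, char in glyph_char_pairs:
        to_unicode.setdefault(glyph, char)  # later pairs reuse the entry
    return to_unicode

# The SARA AA glyph is first seen as part of a decomposed SARA AM,
# then again for a plain SARA AA character.
entries = build_font_entries([("sara_aa_glyph", SARA_AM),
                              ("sara_aa_glyph", SARA_AA)])
print(names(entries["sara_aa_glyph"]))  # the plain SARA AA copies as SARA AM
```

Under this model only the mapping that leaves the NIKHAHIT glyph unassociated reproduces the original three characters.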



Comment 25 jjc 2006-01-21 06:58:41 UTC
Created attachment 33429 [details]
Result of copying from Linux generated PDF with Acrobat Reader and pasting into gedit
Comment 26 jjc 2006-01-21 07:17:19 UTC
On Windows, the situation is worse. The problem is that for any particular
cluster the Uniscribe ScriptShape function tells you which glyphs are part of
the cluster and which characters are part of the cluster, but it doesn't tell
you which glyph corresponds to which character.  Accordingly,
UniscribeLayout::GetNextGlyphs only generates a glyph to character mapping for
the first glyph in the cluster (the others are mapped to -1).

For very complex CTL scripts, the mapping between glyphs and characters in a
cluster is not very well-defined, but for Thai it's easy (with the exception of
SARA AM).  I think the following algorithm should work for Thai: map the first
glyph in the cluster to the first character in the cluster; then map the last
glyph in the cluster to last character in the cluster, the last but one glyph to
the last but one character and so on, stopping when you get to the first glyph
or first character in the cluster.
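
A minimal Python sketch of the algorithm described above, next to the current first-glyph-only behaviour. The function names are hypothetical and indices are relative to the start of the cluster; this is an illustration, not the UniscribeLayout code.

```python
def first_glyph_only(num_glyphs):
    """Current behaviour as described: only the first glyph gets a character."""
    return [0 if g == 0 else -1 for g in range(num_glyphs)]

def map_cluster(num_glyphs, num_chars):
    """Map glyphs to characters inside one cluster.

    The first glyph maps to the first character; the remaining glyphs are
    paired with characters from the end of the cluster backwards, stopping
    at the first glyph or the first character. Unpaired glyphs get -1.
    """
    glyph_to_char = [-1] * num_glyphs
    if num_glyphs == 0 or num_chars == 0:
        return glyph_to_char
    glyph_to_char[0] = 0
    g, c = num_glyphs - 1, num_chars - 1
    while g > 0 and c > 0:
        glyph_to_char[g] = c
        g -= 1
        c -= 1
    return glyph_to_char

print(first_glyph_only(3))  # [0, -1, -1]
print(map_cluster(3, 3))    # [0, 1, 2]
print(map_cluster(3, 2))    # [0, -1, 1]
print(map_cluster(2, 3))    # [0, 2]
```

When a cluster has more characters than glyphs, the characters in the middle are left without a glyph, so they would still be lost on copy; the sketch does not try to handle that case.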
Comment 27 jjc 2006-01-21 07:25:34 UTC
The general strategy that the PDF writer implementation uses for supporting
recovery of the underlying Unicode text is adding a ToUnicode mapping to the
font.  Although I think this can be made to work (with a bit of hackery) for
Thai, I don't think it will work in the general case for CTL.  PDF 1.5
introduces a feature designed to handle this, which allows you to explicitly
associate a Unicode string with a particular region of the PDF file; see the
ActualText property described in section 10.8.3 of the PDF 1.6 specification.
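
For illustration, a short Python sketch of what such a span could look like in a content stream. The helper name `actual_text_span` and the glyph code `<0001>` are hypothetical; the sketch assumes the usual form of a `/Span` marked-content sequence whose property list carries `ActualText` as a UTF-16BE text string with a leading byte order mark, and it does not reflect how the OOo PDF writer is structured.

```python
def actual_text_span(text, content_ops):
    """Wrap content-stream operators in a /Span carrying ActualText.

    The text is written as a hex string holding UTF-16BE with a BOM,
    which is one of the allowed encodings for PDF text strings.
    """
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{content_ops}\nEMC"

# One SARA AM character drawn by some glyph-showing operator
# (the operand <0001> is a placeholder glyph code).
print(actual_text_span("\u0e33", "<0001> Tj"))
```

The open design question is where each span should begin and end, for example one span per cluster or one per text run.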
Comment 28 philipp.lohmann 2006-01-23 14:17:09 UTC
You're right, that should work. However, one would have to find an algorithm
that decides when to start such a span and when to end it. Also, this could
interfere with the overall document structure (aka tagged PDF). What would you
suggest should start such an ActualText span and what should end it?
Comment 29 hdu@apache.org 2006-01-23 15:15:16 UTC
reassigning to the owner of OOo's PDF export magic
Comment 30 philipp.lohmann 2006-01-30 11:23:14 UTC
target
Comment 31 philipp.lohmann 2006-06-15 15:51:13 UTC
target
Comment 32 philipp.lohmann 2006-10-27 08:50:10 UTC
FYI: As I learned with issue 69645, Acrobat Reader will not even let you select
text which is in an ActualText region (probably because there is no way to map
parts of the selected actual display to the equivalent parts of the ActualText).
So this would not solve the copy/paste problem. Any more thoughts on this?
Comment 33 philipp.lohmann 2008-01-08 17:49:52 UTC
target
Comment 34 Rob Weir 2013-07-30 02:16:28 UTC
Reset assignee on issues not touched by assignee in more than 2000 days.
Comment 35 shreeshrii 2018-01-18 05:47:21 UTC
See related issue https://bz.apache.org/ooo/show_bug.cgi?id=58341