Issue 83194

Summary: Importing HTML without encoding specified - use system locale.
Product: Writer
Reporter: kpalagin <kpalagin>
Component: open-import
Assignee: AOO issues mailing list <issues>
Status: CONFIRMED ---
QA Contact:
Severity: Trivial
Priority: P3
CC: issues, michael.brauer, rail_ooo, www.openoffice.org
Version: 680m235
Target Milestone: ---
Hardware: PC
OS: Windows XP
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
  Description: Zip with sample HTML.
  Flags: none

Description kpalagin 2007-11-01 11:19:19 UTC
Please use the system locale's encoding as the document encoding for HTML docs
with a missing charset specification. Word, IE and Firefox do this; we should
do this too.
I am attaching a sample document.
Comment 1 kpalagin 2007-11-01 11:47:19 UTC
Created attachment 49316 [details]
Zip with sample HTML.
Comment 2 michael.ruess 2007-11-01 13:32:34 UTC
Reassigned to ES.
Comment 3 Mathias_Bauer 2007-11-01 13:48:18 UTC
Michael, it seems that our default encoding for HTML files is set to MS-1252.
The code is in sfx2/source/bastyp/sfxhtml.cxx.

Do you remember the reason for doing so?
Comment 4 michael.brauer 2007-11-01 14:06:00 UTC
Not really. But I assume the following: At least a long time ago, the default
encoding for HTML documents was ISO 8859-1. MS-1252 is a superset of ISO 8859-1
which, under Windows, allows the use of a few additional character codes. So,
by using MS-1252, we covered ISO 8859-1, but also the Windows extension of this
encoding.

The HTML specification actually states the following regarding the default
encoding:

"The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default
character encoding when the "charset" parameter is absent from the
"Content-Type" header field. In practice, this recommendation has proved useless
because some servers don't allow a "charset" parameter to be sent, and others
may not be configured to send the parameter. Therefore, user agents must not
assume any default value for the "charset" parameter."

So, changing this to the system charset seems to be valid. However, we must do
that only if there is no Unicode Byte Order Mark present. If a BOM exists, the
encoding still has to be UTF-8.
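
The rule in the comment above (a BOM wins, otherwise fall back to the system
charset) could be sketched roughly as below. This is a minimal Python
illustration, not the code in sfx2/source/bastyp/sfxhtml.cxx. The function name
pick_html_import_encoding and its parameters are hypothetical, checking a
declared charset before the fallback is an assumption about ordering, and
locale.getpreferredencoding(False) only stands in for the office suite's own
system-encoding lookup.

```python
import codecs
import locale
from typing import Optional


def pick_html_import_encoding(
    data: bytes,
    declared_charset: Optional[str] = None,
    system_encoding: Optional[str] = None,
) -> str:
    """Hypothetical helper: pick a codec name for HTML bytes being imported."""
    # A UTF-8 byte order mark takes precedence over any default.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Assumption: a charset declared by the document is honoured next.
    if declared_charset:
        return codecs.lookup(declared_charset).name
    # Nothing specified: use the system encoding instead of a fixed MS-1252.
    fallback = system_encoding or locale.getpreferredencoding(False)
    return codecs.lookup(fallback).name
```

The system_encoding parameter exists only so the fallback can be exercised
without depending on the machine's locale.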
Comment 5 mhatheoo 2007-11-02 17:49:41 UTC
I would prefer not to decide on a default character set until the background
is clearer:

-- Using the Windows-specific 1252 will not make sense: the 27 characters at
positions 128-159 may not be displayed correctly on Linux systems or may even
have other meanings there (even on Windows systems)
-- Using UTF-8 as default is nonsense
-- Leaving out the tag with the page encoding is nonsense, even if some
providers (which ones?) do not transmit it

So the only useful default is Latin-1, i.e. ISO 8859-1

Martin
Comment 6 mhatheoo 2007-11-02 17:57:11 UTC
oops, maybe I forgot something or did not make the point clear enough:

Using 1252 for HTML is a mistake; not setting the tag for the page encoding is
a mistake too.
You should not, by default, support the faulty HTML versions, even if they
exist quite often.
A better solution would be to ask the user, when the file is opened for
editing, whether a tag should be set and which one.

Martin
Comment 7 Mathias_Bauer 2007-11-02 18:20:05 UTC
That's not enough. The code to load a document must also work without user
interaction, so there *must* be a default treatment of such documents. This
default could be:

- throw an exception, quit loading (very rude IMHO)
- guess encoding from system locale of machine loading the document
- ??? something else

In the case of HTML, using the system locale is more questionable than in the
case of RTF documents, as HTML documents often come from the web and making any
assumptions about the system they were created on is just wild guessing. OK,
you can assume that, because most computers run Windows, most HTML documents
will have been created on Windows. You can further assume that the number of
clueless people not knowing that HTML documents should specify an encoding is
even bigger on Windows, so that overall it's a good guess that a broken HTML
file has been created on a Windows machine. But is that how you want to design
your software? At least debatable, IMHO.

But OTOH: I have no better idea. As Michael pointed out, the official
recommendation is not to assume any defaults. This recommendation explicitly
argues from a Web-based scenario, and there it sounds reasonable. OTOH if we are
talking about local HTML documents, especially documents created by somewhat
broken tools, guessing an encoding IMHO makes sense. So I think we can treat
HTML documents more or less the same way as RTF documents, as in issue 68639.
Comment 8 michael.brauer 2007-11-05 08:37:56 UTC
As for UTF-8: It does make sense to assume that a document is in UTF-8 if it
starts with a byte order mark. It has never been suggested to use it as default
if no byte order mark is present.
As for MS-1252: We are talking about the import. MS-1252 is a superset of ISO
8859-1. If a document contains only characters supported by ISO 8859-1, then one
won't notice a difference between an ISO 8859-1 default and an MS-1252 default,
regardless of the platform one is on. And if it does contain MS-1252-only
characters? Then you won't be able to display them on Linux, regardless of
whether ISO 8859-1 or MS-1252 is the default. So the situation of using MS-1252
as default is definitely not worse than using ISO 8859-1.
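
The superset argument can be checked byte by byte. The following is a small
Python sketch, assuming Python's built-in "cp1252" and "latin-1" codecs are
acceptable stand-ins for MS-1252 and ISO 8859-1; the function name
compare_cp1252_with_latin1 is made up for this illustration. Note that
"latin-1" maps bytes 128-159 to C1 control codes, so the two encodings agree
only outside that range.

```python
def compare_cp1252_with_latin1():
    """Classify every byte value by how cp1252 and latin-1 decode it."""
    same, differs, undefined = [], [], []
    for value in range(256):
        raw = bytes([value])
        latin1_char = raw.decode("latin-1")  # latin-1 defines all 256 bytes
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few byte values unassigned.
            undefined.append(value)
            continue
        if cp1252_char == latin1_char:
            same.append(value)
        else:
            differs.append(value)
    return same, differs, undefined


same, differs, undefined = compare_cp1252_with_latin1()
print(len(same), "bytes decode identically")
print(len(differs), "bytes are printable only in cp1252")
print(len(undefined), "bytes are unassigned in cp1252")
```

Every byte in the differs and undefined lists falls in the 128-159 range, which
is where a document relying on MS-1252-only characters would be affected.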

As for the general issue: Are we aware of any issues with the current
behavior in real life? If not, I would suggest that we keep things as they are.

Otherwise we may take the system encoding as default, but we have to be aware of
the fact that this may break in some situations, too.

Comment 9 eric.savary 2007-11-05 11:36:16 UTC
Reassigning to MBA.
Comment 10 kpalagin 2007-12-05 15:13:16 UTC
So shall we try implementing this RFE for 2.4?
Maybe Rail could code a patch?
(With the hint for sfx2/source/bastyp/sfxhtml.cxx and Rail's recent experience
with RTF, it should be relatively painless for a capable developer like him.)
Comment 11 pmike 2008-02-05 07:08:54 UTC
There is an option in "HTML Compatibility" - "Export", which allows specifying
the output charset for HTML.
How about adding such an option for import? Let the user choose the default
encoding, like any Web browser does.

Also, this issue is not Writer-specific. Calc is affected too. So it's a
framework issue.
Comment 12 rail_ooo 2008-02-05 07:16:38 UTC
We can use the method used in issue 68639 or use gsl_getSystemTextEncoding() to
preset the HTML encoding. I suggest using the first one because almost all
broken docs are generated on Windows. :(

Another way is to "detect" the encoding of the document using libenca (GPL :( )
 or something similar.
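
As a much simpler stand-in for a detection library, one common heuristic is to
try a strict UTF-8 decode first and only then fall back to a legacy code page.
The Python sketch below is not libenca and not the approach of issue 68639; the
name guess_html_encoding and the legacy_fallback parameter are hypothetical.

```python
import codecs


def guess_html_encoding(data: bytes, legacy_fallback: str = "cp1252") -> str:
    """Hypothetical heuristic: prefer UTF-8 when the bytes are valid UTF-8."""
    # A UTF-8 byte order mark settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    try:
        # Strict decoding rejects byte sequences that are not well-formed UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: hand the decision to the configured legacy default.
        return legacy_fallback
    # Pure ASCII also lands here; it decodes the same way under either choice.
    return "utf-8"
```

Text in a single-byte code page such as cp1251 is rarely well-formed UTF-8 by
accident, but short or truncated inputs can still be misjudged, so this is a
guess and not a guarantee.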
Comment 13 serhiy 2010-03-08 10:29:09 UTC
I have to work on XHTML files encoded as UTF-8 and OpenOffice opens them as
ANSI. The files don't have an encoding specified and I would be happy if it let
me choose the encoding. For now they open in ANSI, either cp-1252 or the local
encoding cp-1251.

Windows XP SP3,
OpenOffice 3.2
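
The symptom reported above can be reproduced outside the office suite. This is a
short Python sketch under the assumption that the importer simply decodes the
UTF-8 bytes with an ANSI code page; the helper name misread_utf8 and the sample
word are invented for the illustration.

```python
def misread_utf8(text: str, wrong_encoding: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong code page."""
    return text.encode("utf-8").decode(wrong_encoding)


sample = "Привет"
for code_page in ("cp1251", "cp1252"):
    garbled = misread_utf8(sample, code_page)
    # Each two-byte UTF-8 sequence turns into two unrelated characters.
    print(code_page, repr(garbled), len(sample), "->", len(garbled))

# No bytes are lost for this sample, so reversing the wrong decode restores it.
restored = misread_utf8(sample, "cp1251").encode("cp1251").decode("utf-8")
print(restored == sample)
```

Recovery by reversing the decode works only when every byte maps to a
character in the wrong code page, which is not guaranteed in general.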
Comment 14 Marcus 2017-05-20 11:31:01 UTC
Reset assignee to the default "issues@openoffice.apache.org".