Issue 50670

Summary: HTML Import and Clipboard Paste behaviour
Product: Calc
Reporter: falko.tesch
Component: open-import
Assignee: frank
Status: CLOSED FIXED
QA Contact: issues@sc <issues>
Severity: Trivial
Priority: P3
CC: christian.jansen, douglasph, issues, jc.damgaard, josef, khirano, nesshof, ooo, stp, stx123, yoshimit
Version: OOo 2.0 Beta   
Target Milestone: ---   
Hardware: All   
OS: All   
URL: http://specs.openoffice.org/calc/ease-of-use/HTMLNumberImport.odt
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description                                      Flags
Testcasespecification for CWS CalcHTMLnumbers    none
testdocument in HTML format                      none
Testdocument in Rich Text Format                 none

Description falko.tesch 2005-06-13 15:20:16 UTC
Problem:
Importing or pasting HTML-formatted content into Calc can result in irritating
formatting behaviour on Calc's side.
Some strings may be interpreted as numbers and others as dates, even if the source
was neither.

Explanation:
This is due to an autodetection mechanism that tries to identify the formatting
of the incoming strings by certain characteristics, such as number of
separators, kinds of separators and combination of separators.

Suggested Solution:
To avoid such unwanted automatism, the following is suggested:
HTML-formatted content inserted via import or clipboard will always be
interpreted according to the locale currently set in Calc.
If the user inserts HTML content with a different locale than the one set in
Calc, he needs to change the locale (temporarily) via "Tools - Options - Language
Settings - Document Language" to the source locale to get a correct interpretation
of the inserted values.
The interpretation will follow exactly the same rules as if the user had
entered the strings manually.
Comment 1 falko.tesch 2005-06-13 15:22:13 UTC
*** Issue 39898 has been marked as a duplicate of this issue. ***
Comment 2 falko.tesch 2005-06-13 15:24:05 UTC
*** Issue 39898 has been marked as a duplicate of this issue. ***
Comment 3 falko.tesch 2005-06-13 16:29:26 UTC
FT:
Addendum:
The above suggestion also affects legacy documents that will be altered if
inserted HTML content is dynamically updated on load.
Therefore changing the document behaviour is NOT an option (at least not within
a minor release, 2.0.1)
Therefore this issue cannot be addressed w/o major changes/additions to the UI.
Which still leaves the question how to behave on new and legacy documents in the
future.

Any hacks that were suggested so far (example: making the behaviour depend on
system variables) are, from a User Experience point of view, a complete no-go and
shall not be used.

I'm sorry not to present a viable solution here, but changing a feature for the
benefit of a few compared to data loss for other users cannot be an option.
Comment 4 frank 2005-06-16 13:18:06 UTC
So this has to go its way through the enhancement cycle.
Comment 5 stp 2005-06-16 21:31:29 UTC
Delaying the obvious fix for this DEFECT will only build on what you now think
is a legacy of documents - a legacy of OOo 1.x documents - and thus enlarge the
scope of the problem in the future for hopefully many more documents and
OpenDocument-files, too.

And while you hesitate you also inhibit users of Quattro Pro and Excel from
migrating to StarOffice/OpenOffice.org.

I'm sorry to repeat myself but ignoring a DEFECT for the
benefit of a few locale confused people compared to securing the integrity and
trustworthiness of the spreadsheet cannot be an option.

Therefore you must kill the assumption that pasted content is en_US for the
coming release. Patch available in issue 39898.
Comment 6 stx123 2005-06-24 12:09:41 UTC
Hi 'thing', all,
I don't think that arguing about the issue type (or
whether there are more "locale confused people" but happy people than others)
will help anybody.
We plan to work on a solution which should satisfy your needs and fit in the OOo
2.0 timeframe.
An environment variable should allow specifying that the default behaviour is as
suggested below. A UI and handling for existing documents will be worked on later.
I hope that meets your expectation.
Stefan
Comment 7 ooo 2005-06-29 13:39:15 UTC
As there is no real solution without doing proper UI and/or
configuration and all the full-blown stuff, I offer the following
"hack":

An environment variable, for example OOO_CALC_HTML_IMPORT_LOCALE, will
be used to control the behavior.

- If not set the old behavior remains in effect so that without user
  interaction nothing changes in order to not break already existing
  documents that pull linked data from the web or an intranet (this is
  what Falko referred to as "legacy documents", which IMHO was a bit
  unfortunate wording).

- If set and empty or with content "x-configured", which complies with
  RFC3066, the locale as configured under
  Tools.Options.LanguageSettings.Languages.LocaleSetting will be used.

- If set to an explicit locale, e.g. "en-US" or "da-DK", that locale
  will be used, if available.

- If set to a language only, e.g. "de", a match against some default
  locale of that language will be tried, e.g. "de-DE". Not recommended
  though because it may result in something the user doesn't expect.

- If interpreting the variable's content does not match any of the
  available locale data, the configured locale will be used instead.
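
The five rules above could be sketched roughly as follows. This is only an illustration in Python, not the actual implementation: the function name `resolve_import_locale`, the `AVAILABLE` set and the `LANGUAGE_DEFAULTS` table are hypothetical, and returning `None` is just one way to signal "keep the old behavior".

```python
import os

# Hypothetical data for illustration; real locale data would come from the office suite.
AVAILABLE = {"en-US", "da-DK", "de-DE"}
LANGUAGE_DEFAULTS = {"en": "en-US", "da": "da-DK", "de": "de-DE"}


def resolve_import_locale(env_value, configured, available, language_defaults):
    """Return the locale to use for HTML number import, or None for the old behavior."""
    if env_value is None:
        return None                      # variable not set: nothing changes
    if env_value in ("", "x-configured"):
        return configured                # use the locale configured in the options
    if env_value in available:
        return env_value                 # explicit locale, e.g. "en-US" or "da-DK"
    default = language_defaults.get(env_value)
    if default in available:
        return default                   # language only, e.g. "de" -> "de-DE"
    return configured                    # no match: fall back to the configured locale


# The variable name is the one proposed in this comment.
value = os.environ.get("OOO_CALC_HTML_IMPORT_LOCALE")  # None when not set
print(resolve_import_locale(value, "da-DK", AVAILABLE, LANGUAGE_DEFAULTS))
```

Keeping "not set" distinct from "set but empty" is what lets existing documents with linked data stay untouched unless the user opts in.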


Can everybody live with this?

  Eike
Comment 8 stp 2005-06-29 17:57:57 UTC
First of all please accept my sincere apology. I have tested my patch a bit more
and in e.g. a French locale the en_US number "1,234.56" would not be recognised
correctly with my patch. I now agree that this is unnecessary feature loss for
non en_US and non German locale alike users. I should have specified in the
patch that the en_US assumption should be disabled only in locales using "." as
thousands separator and "," as decimal (e.g. da_DK and de_DE). However, this
discovery contradicts people that say the en_US assumption affects locales using
" " (space) as thousands separator and "," as decimal. Can anyone explain or
elaborate?

Regarding the described solution, I strongly suggest default behavior should
respect the locale setting which would align Calc with spreadsheets like Quattro
Pro and Excel. Yes, that would force some existing users to set an environment
variable but all the new users of SO and OOo will be able to migrate to Calc
without experiencing serious defects like 1.000 being recognised as 1.

I honestly cannot see how you can change existing behavior (ie. assume en_US
when recognising 1.000) to the desired behavior (ie. respect locale when
recognising 1.000) without breaking legacy. The forthcoming major release and
the implementation of a new document standard is an obvious window of
opportunity to correct the misinterpretation of 1.000 = 1 in e.g. German and
Danish locales. The misinterpretation of 1.000 should be fixed before the legacy
expands to OpenDocument files and new users. And consequently before an
inevitable correction would break a much bigger legacy.

The locale setting must be reinstalled as the dominant locale setting. The
sooner the better.

Søren
Comment 9 frank 2005-06-30 11:50:05 UTC
*** Issue 50984 has been marked as a duplicate of this issue. ***
Comment 10 keld 2005-07-06 14:50:09 UTC
I will suggest that the correct solution to this problem is to follow the source
locale. For a discussion of which locale to use I offer the following:

I see two locales that could be obeyed:

1. The locale of the ooo application being used

2. the locale or language of the data being imported.

I will argue that it should be the latter that is used.

Consider that there is a string "1,000" in the document to import, eg an
HTML document. The meaning could be 1000 or 1 depending on the locale that this
document was generated with. But the meaning surely should not change
due to the locale that is used in the importing application, eg calc.

An American and a Dane should get the same number into the calc
application, from the same imported document. The numbers of the
imported tables should stay the same.

In HTML/XML there is an entity that defines the language of the data,
namely the "lang" variable. This defines the value of the decimal and
thousands separator, which are language defined. Eg for Danish it is
defined in the Danish orthography specification "Retskrivningsordbogen" that
comma is used as the decimal separator. Likewise for English it is always the
period that is used.

If no language is specified for the data, then the default should be
used, which for HTML and XML is period for the decimal separator and
comma for the thousands separator. One should not use the locale
information of the importing application, but always use the language
spec for the data.

One could then introduce in the application a setting to override
the data locale, to be used when the data is marked up wrongly.
This should not be a prompt, as that would be unnecessarily cumbersome in
most cases - to be asked this question every time one imports data.

I would welcome a solution that implements this as dependent on an
environment variable, although I believe to follow the language
indication of the data is the obviously correct solution. This is also in
accordance with current internationalization theory in the object oriented world.

Should a solution as the one Eike proposes be implemented, I suggest that one of
the options would be that the source locale be used, eg. by using a string
"x-source-locale" to indicate to use the "lang" markup or other language or
locale specification of the source data. The behaviour on what to do if the lang
does not describe an actual locale could well be the one Eike describes.

Comment 11 keld 2005-07-06 15:01:10 UTC
To Søren Thing

I don't really know how to handle input numbers that use space as the
thousands separator. I think using that is very error-prone, as it is difficult
to differentiate between a space between numbers and a space within a number
that functions as a thousands separator.

I would therefore advise against using space as the thousands separator if the
numbers are used later as input (and you can probably not tell whether this may
be the case).

One way to remedy this could be to use the NO BREAK SPACE instead of space if
this is really wanted. That would lessen the probability of errors.
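
To illustrate the ambiguity, here is a small sketch in Python (not taken from any OOo code; `split_numbers` is a hypothetical helper). It splits a run of text on ordinary spaces and tabs only and treats NO-BREAK SPACE (U+00A0) as a grouping character inside a number.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE


def split_numbers(text):
    """Split on ordinary spaces/tabs only, then drop NBSP group separators."""
    # str.split() without arguments would also split on NBSP,
    # so an explicit pattern is used instead.
    tokens = [t for t in re.split(r"[ \t]+", text.strip(" \t")) if t]
    return [int(t.replace(NBSP, "")) for t in tokens]


# With plain spaces the grouping is lost: four separate tokens.
print(split_numbers("1 000 2 000"))
# With NBSP as the thousands separator the two numbers survive.
print(split_numbers(f"1{NBSP}000 2{NBSP}000"))
```

With plain spaces there is no way to tell "1 000" (one thousand) from the two values 1 and 0; the NBSP variant keeps the digits of each number together.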
Comment 12 ooo 2005-07-08 12:50:12 UTC
Folks,

we can argue about this over and over again, but people commenting on
this should at least read and _understand_ the discussion in the
predecessor issue 39898 and not blow up this issue with the same things
again, like

> 2. the locale or language of the data being imported.

With HTML there is no locale associated with the data.

> In HTML/XML there is an entity that defines the language of the data,
> namely the "lang" variable.

Exactly, the _language_, but not the locale. Only in cases where
a language is clearly assigned to one and only one locale would using this
element be possible. And most times the element isn't even
present. So nothing we could rely on.

> This defines the value of the decimal and thousands separator,

No, it doesn't.

> which are language defined.

No, they aren't. They are defined per _locale_.

> Eg for Danish it is defined in the Danish orthography specification
> "Retskrivningsordbogen" that comma is used as the decimal separator.

Just because Danish is used almost only in Denmark and its regions.

> Likewise for English it is always the period that is used. 

By coincidence. For French, for example, it is a period in France
(fr_FR), but a comma in Switzerland (fr_CH). Same for Spanish, different
separators in many different countries.


Ok, back to something productive: I offered my "solution" in the comment
of Wed Jun 29 05:39:15 -0700 2005. With that hack I will _not_ implement
a different default that would break already existing documents with
links to data that rely on the current behavior _unless_ User Experience
/ Program Management tell me to do so.

If we can't agree we will have to postpone this for OOo2.0.1 or later to
do a full-blown solution with UI and such.

  Eike
Comment 13 keld 2005-07-08 21:49:12 UTC
Just found out that email replies do not make it into bugzilla. So...

Eike is correct that HTML has no locale, but there may be a language associated.

And that language can be associated with a locale by the same way that was
proposed to associate a language/locale of ooo to a system locale.

Eike also wrote that there may be differences between the same language in
different countries, for example in Switzerland they are supposed to use comma as
the thousands separator, while in France they are supposed to use period.

That sounds strange to me, but anyway you can say that your HTML is
suissefrançais by giving it the language code 'fr-ch' - which is a valid RFC
3066 code, and then associate to the locale fr_CH from that.

Anyway, I am just asking to provide one more option, the option to use the
source language as defining the thousands and decimal delimiters.

I can understand if this would not be the default choice in 2.0, but IMHO
there are good theoretical and practical reasons for this to exist, and
even to be the desired default. If we just make it an option, then we
can try it out, and get some experience with it.
Comment 14 ooo 2005-08-05 13:03:13 UTC
Seems we don't come to an agreement here and old arguments are repeated over and
over again. I made my proposal for what I would have implemented for OOo2.0;
retargeting to OOo2.0.1 now.
Comment 15 ooo 2005-08-24 12:58:00 UTC
Hi Falko,

Reassigning ownership for the specification phase of UI with bells and whistles
to you. Though I most certainly doubt that we'd still get this implemented for
2.0.1, probably retargeting to 2.0.2 would be appropriate.

  Eike
Comment 16 falko.tesch 2005-08-26 09:34:31 UTC
FT: OK, I agree, but we must bring this into PP2.
Comment 17 falko.tesch 2005-08-26 09:35:47 UTC
*** Issue 15509 has been marked as a duplicate of this issue. ***
Comment 18 falko.tesch 2005-10-20 20:40:06 UTC
FT: I'm leaving so I will re-assign this issue to requirement default user
Comment 19 stx123 2005-10-23 19:52:04 UTC
I don't think the default user is appropriate for this issue.
Christian, Matthias, would you be able to follow-up?
Comment 20 bigserpent 2005-11-09 06:27:27 UTC
Please, look at the issue
http://qa.openoffice.org/issues/show_bug.cgi?id=51662
because it has the common origin and may be even more common case.
Comment 21 matthias.mueller-prove 2005-11-21 12:27:14 UTC
Here is the plan. We will introduce a new option in Load/Save Options>HTML
Compatibility pane called something like "Import numbers according to locale
setting".
If this checkbox is marked, numbers will be interpreted according to the setting
in Language Settings>Languages>Locale setting.
If this checkbox is not marked, numbers will be interpreted as English.
By default the new option is not marked.

The objective is to offer a way to import (paste or open or insert HTML content)
properly for numbers that are not formatted in US style. "1,000" is 1000 in
English but 1 in German. "1.000" is the opposite case: 1 in English and 1000 in
German. Other languages have different conventions that we hope to address with
the proposed new checkbox. 
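
As a rough sketch of the intended effect (Python for illustration only; the `SEPARATORS` table and the names `parse_number` and `import_number` are assumptions, not the real Calc code), the checkbox simply selects which separators are used to read a string:

```python
import re

# Hypothetical separator table: locale -> (decimal separator, grouping separator).
SEPARATORS = {"en-US": (".", ","), "de-DE": (",", ".")}


def parse_number(text, locale):
    """Parse text with the separators of the given locale; None if it is no number."""
    dec, grp = SEPARATORS[locale]
    pattern = (
        r"[+-]?(\d{1,3}(" + re.escape(grp) + r"\d{3})+|\d+)"  # grouped or plain digits
        r"(" + re.escape(dec) + r"\d+)?"                      # optional fraction
    )
    if not re.fullmatch(pattern, text):
        return None
    return float(text.replace(grp, "").replace(dec, "."))


def import_number(text, use_locale_setting, locale_setting):
    """Checkbox marked: use the configured locale. Not marked: interpret as English."""
    locale = locale_setting if use_locale_setting else "en-US"
    return parse_number(text, locale)


for s in ("1,000", "1.000"):
    print(s, import_number(s, False, "de-DE"), import_number(s, True, "de-DE"))
```

With a German locale setting, "1,000" is read as 1000 when the option is not marked and as 1 when it is marked, and "1.000" the other way round.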

I am currently writing the spec for OOo 2.0.2. Please let me know if this does
not meet your needs. (Maybe you bring up the concerns that I did not mention here.)
Comment 22 matthias.mueller-prove 2005-11-21 12:43:22 UTC
spec URL added
Comment 23 josefg 2005-11-21 15:10:26 UTC
Well, it's just that when you work in an international environment, you will be
copying data from several different locales. Especially when taking down data
from the web. If the source file is unable to provide the information needed for
which locale to use, having to adjust this in the Load/Save Options>HTML
Compatibility pane can be a bit cumbersome. 

I don't claim to have the ideal solution to the problem, but if it would be
possible to build something more accessible in the future, I think that would be
advantageous. My original suggestion was something like an extra option in the
paste special dialog box, but that of course doesn't take into account that the
same problem needs to be solved for imported documents.
Comment 24 keld 2005-11-22 01:13:39 UTC
Please add a possibility to use the locale associated with the language of the
input HTML data. That is, if there is no language specified in the input HTML,
then assume en_US, else assume the locale that can be associated with the
language specified in the HTML headers. This could be a checkbox in the Load/Save
Options>HTML Compatibility pane called something like "Import numbers according
to the language of the input file".
Comment 25 ooo 2005-11-22 10:12:14 UTC
Mostly to Keld:

You cannot determine the locale from the language setting in the source file
(HTML document). As far as I know there is no locale parameter in HTML.

But the insert mechanism could suggest a locale based on the language setting
(if set) because some languages are strongly linked to a locale - the Danish
language is strongly linked to the Danish locale. But other language/locale
links may be looser or may not exist at all.
Comment 26 ooo 2005-11-22 10:15:22 UTC
The insert special dialog could also show possible formats (value) based on the
input as an extra advanced format dialog window.
Comment 27 stp 2005-11-22 23:33:09 UTC
thing->mmp: Under the circumstances that sounds like a good plan for the next
minor release.

Furthermore, will this issue take care of the default behavior in the next major
release? Or do you want me to open a new issue targeted for 3.0 which should per
default handle pasted/imported non-localised numbers according to Calc's locale
just like Excel, Quattro Pro, Gnumeric and 1-2-3?
Comment 28 keld 2005-11-26 22:52:43 UTC
Just to recap: 

One of the problems is that the example guy Herbert in Germany wants to use
both HTML spreadsheets written in German and in English, and dependent on which
language the input is in, the data should be interpreted differently.

The draft spec by mmp will not handle this situation correctly. If the proposed
option is set, the English data will be handled wrongly for Herbert.

I propose one more option, and that is to set the thousands and decimal
separator according to the 'lang' field of the input HTML document. This can be
done via an association of the 'lang' parameter of the input HTML document to a
system locale. Claus says that you generally cannot do this association between
language and locale, but that is pretty easy in my mind, you can just provide a
table for all these relations. I am willing to provide such a table for UNIX, as
I am personally one of the main contributors of locales for UNIX. I don't know
too much on locales for MS Windows, maybe there is not a database of locales
readily available on normal windows systems. Another way forward is to have a
table from language to a pair of (thousands separator, decimal separator) - I
can also construct that table for maybe 100 languages. I can even provide you
with the code for this, given I get the programming language to write this in.
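
A minimal sketch of the second kind of table, in Python for illustration: the entries in `LANG_SEPARATORS` are only the cases mentioned in this issue, and the names `LANG_SEPARATORS`, `DEFAULT_SEPARATORS` and `separators_for_lang` are hypothetical. As noted earlier in this issue, a language-only key cannot express differences between locales that share a language.

```python
# language tag -> (thousands separator, decimal separator); illustrative entries only.
LANG_SEPARATORS = {
    "en": (",", "."),
    "da": (".", ","),
    "de": (".", ","),
}

# Used when the HTML carries no usable 'lang' value (en_US convention).
DEFAULT_SEPARATORS = (",", ".")


def separators_for_lang(lang):
    """Look up separators for an HTML 'lang' value, falling back to the default."""
    if not lang:
        return DEFAULT_SEPARATORS
    tag = lang.strip().lower().replace("_", "-")
    if tag in LANG_SEPARATORS:
        return LANG_SEPARATORS[tag]
    primary = tag.split("-", 1)[0]  # e.g. "de-DE" -> "de"
    return LANG_SEPARATORS.get(primary, DEFAULT_SEPARATORS)


print(separators_for_lang("da"))
print(separators_for_lang("de-DE"))
print(separators_for_lang(None))
```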

My suggestion is thus that there be the option that mmp described, and the
option that I described above. Then we can try them out and the users can pick
the one that suits them best. That is, two strings:

"Import numbers according to locale setting of ooo"
"Import numbers according to language setting of imported file"

And then I have nothing against a further option in the insert special dialog.
Comment 29 matthias.mueller-prove 2005-12-01 13:44:39 UTC
The current (slightly updated) spec proposes a solution for "Herbert". With the
additional option it is at least possible to import the HTML data of any locale
into Calc. Today (OOo 2.0.1) this is not possible at all.
I understand that smarter solutions are desired (and possible) but please keep
in mind that this time we implement a fix for a situation that can also be seen
as a bug.
There has to be a grand solution for OOo 3.0.
Comment 30 frank 2005-12-06 10:03:29 UTC
*** Issue 57168 has been marked as a duplicate of this issue. ***
Comment 31 matthias.mueller-prove 2005-12-29 10:25:48 UTC
new target: OOo 2.0.3 (due to Winter break) -- Have a happy new year!
Comment 32 matthias.mueller-prove 2006-04-03 13:56:19 UTC
missed milestone 2.0.3
retargeted to 2.0.4
Comment 33 ooo 2006-06-07 16:35:40 UTC
Grabbing issue.
Comment 34 ooo 2006-06-07 16:36:32 UTC
Accepted.
Comment 35 ooo 2006-06-09 21:02:22 UTC
In CWS calchtmlnumbers

officecfg/registry/schema/org/openoffice/Office/Common.xcs  1.115.52.1
svx/inc/htmlcfg.hxx  1.4.438.1
svx/source/dialog/opthtml.cxx  1.4.438.1
svx/source/dialog/opthtml.hrc  1.3.438.1
svx/source/dialog/opthtml.hxx  1.3.438.1
svx/source/dialog/opthtml.src  1.5.418.1
svx/source/options/htmlcfg.cxx  1.3.438.1
sc/source/filter/rtf/eeimpars.cxx  1.14.68.1
Comment 36 ooo 2006-06-12 17:12:28 UTC
Reassigning to QA.

re-open issue and reassign to fst@openoffice.org
Comment 37 ooo 2006-06-12 17:12:37 UTC
reassign to fst@openoffice.org
Comment 38 ooo 2006-06-12 17:12:46 UTC
reset resolution to FIXED
Comment 39 frank 2006-06-29 13:38:53 UTC
Created attachment 37417 [details]
Testcasespecification for CWS CalcHTMLnumbers
Comment 40 frank 2006-06-29 13:39:45 UTC
Created attachment 37418 [details]
testdocument in HTML format
Comment 41 frank 2006-06-29 13:41:01 UTC
Created attachment 37419 [details]
Testdocument in Rich Text Format
Comment 42 frank 2006-07-04 15:34:26 UTC
Found integrated in CWS calchtmlnumbers. Testcasespecification and testdocuments
are attached. Checked on Solaris, Linux and Windows.
Comment 43 frank 2006-07-17 13:53:05 UTC
Found integrated on master m176 using Solaris, Linux and Windows build
Comment 44 stp 2006-07-22 10:21:22 UTC
I briefly tried it with m176 and it works great. Sincere thanks Eike, rest of Sun!

Søren