Issue 39898 - Number recognition formats pasted value from HTML page of e.g. 1.000 as 1 despite . (period) being thousands separator according to Calc's locale setting
Summary: Number recognition formats pasted value from HTML page of e.g. 1.000 as 1 des...
Status: CLOSED DUPLICATE of issue 50670
Alias: None
Product: Calc
Classification: Application
Component: formatting (show other issues)
Version: 680m65
Hardware: All All
: P4 Trivial with 38 votes (vote)
Target Milestone: ---
Assignee: falko.tesch
QA Contact: issues@sc
URL: http://specs.openoffice.org/calc/
Keywords:
: 48926 (view as issue list)
Depends on:
Blocks:
 
Reported: 2005-01-04 16:44 UTC by stp
Modified: 2013-08-07 15:12 UTC (History)
7 users (show)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Testcase (933 bytes, text/html)
2005-03-12 07:33 UTC, stp
no flags Details
Testcase (HTML 4.01) (828 bytes, text/html)
2005-03-12 07:57 UTC, stp
no flags Details
Simple but effective patch (519 bytes, patch)
2005-05-26 00:47 UTC, stp
no flags Details | Diff
Testcases (25.61 KB, application/pdf)
2005-05-26 00:48 UTC, stp
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description stp 2005-01-04 16:44:39 UTC
Using a locale with . as thousands seperator, marking, copying and pasting a
value on a HTML page e.g. 1.000 into Calc will verify this bug. The value will
be pasted into a cell as the value 1. 

However, e.g. 1.000.000 will be formatted correctly.

This bug has been spotted in locales with . as the thousands seperator
(including German). Using English locale and pasting e.g. 1,000 where thousands
seperator is , will correctly be pasted as 1000.

I have without any luck tried to find the logic behind this bug in:
http://specs.openoffice.org/calc/ease-of-use/enhance_number_recognition.sxw

Please also fix this bug in 1.1.5 if possible.

Thanks.

Søren
Comment 1 frank 2005-01-05 11:37:03 UTC
Hi Søren,

I copied the 1.000 from the Issue description and pasted it into an german
localized Office. I've got the same result as you. Using paste special and
unformatted text as output format gives me 1000. So the encoding of the webpage
is carried over to calc and interpreted as english 1 with three decimals and
converted to an german 1. So the result is correct.

Sorry for no better reply.

Frank
Comment 2 frank 2005-01-05 11:37:39 UTC
changed Prio from 2 to 4
closed wfm
Comment 3 stp 2005-01-05 15:27:49 UTC
Yes, you can paste as unformatted text, but this bug report is for pasting HTML
formatted text.
Comment 4 frank 2005-01-05 15:53:36 UTC
Hi Falko,

please have a look.

Frank
Comment 5 stp 2005-01-05 15:58:01 UTC
Use this HTML table for bug targetting:
<table>
<tr><td>1.000</td><td>1.000.000</td></tr>
<tr><td>1,000</td><td>1,000,000</td></tr>
</table>

Pasting the table as HTML with english locale results in:
1	1.000.000
1000	1000000

Pasting the table as HTML with german locale results in:
1	1000000
1	1,000,000

All but 1.000 in german locale is recognised correctly.

In conclusion, I don't think number recognition works according to the encoding
of the webpage.

Thanks.
Comment 6 falko.tesch 2005-01-06 14:40:04 UTC
FT: I talked to the responsible engineer and here's why it behaves like it does:
If HTML text is pasted via clipboard numbers are always interpreted in English
and current locale (German for example).
That's why 1.000 and 1,000 both turn to 1. In one case it is following the
German locale of the document (1,000) and in the second case Calc "thinks" its
an English number (1.000 = 1) and print 1.

In ancient times (pre StarOffice 4.0) all HTML numbers from clipboard were
strictly interpreted by the locale of the document (locale/tag of the HTML
source was never taken in account) and 1,000 would become 1 and 1.000 would
become 1000.

It then was changed since most web pages (then) were English number-formatted,
so copy and paste in a lot of cases was creating unwanted results. So the
behaviour was changed to the already described solution.

If you consider this behaviour bnot useful please be so kind state and tery to
collect some votes. Thx.
Comment 7 falko.tesch 2005-01-06 14:40:38 UTC
.
Comment 8 stp 2005-01-06 18:45:42 UTC
I understand that it was then categorised as a feature, but I think this really
is a bug nowadays. 

Honestly, I cannot think of a case where interpreting solely according to Calc's
locale would create problems today. Please give an example so people can
consider cost/benefit for fixing this issue.
Comment 9 simonbr 2005-01-10 23:22:18 UTC
These would be cases where interpretation according to calc locale would go wrong:

http://www.sharebot.net/?code=MRL
http://investdb.theglobeandmail.com/invest/investSQL/gx.price_history?pi_symbol=UALAQ-Q2
http://www.fortbraggfcu.org/ASP/rates.asp

Maybe it would be better if the user were asked how it must be pasted. 
Comment 10 stp 2005-01-11 07:10:55 UTC
Pasting the table from
http://investdb.theglobeandmail.com/invest/investSQL/gx.price_history?pi_symbol=UALAQ-Q2
 in a local using . as thousands seperator (e.g. german) underlines the problem;
some values are correct others are not, but the user is unable to understand the
logic behind this. Next question from the user could be: If Calc behaves without
logic, how can it be trusted to handle my spreadsheets?

As Simon suggests, some kind of user action could be needed. Maybe even an
option to convert , to . (vice versa).
Comment 11 stp 2005-01-12 13:20:10 UTC
The web is multilingual nowadays. Pasted HTML should dictate locale if the
pasted HTML contains the lang tag. Otherwise Calc should set the locale.

To improve number recognition Calc can guess the locale of the pasted HTML if
any value in the the pasted content contains either two periods or two commas
with three digits in between (e.g. 1.000.000). Furthermore, if digits are
separated by both a period and a comma it is almost certain that the last of
these is the decimal. (e.g. 1.000,01). It is a pretty sure shot that the rest of
the contents use the same thousands separator.

Pasting HTML with the value 1.000 and getting 1 when locale uses , as decimal is
a bug is a bug. I am setting the issue type to defect.
Comment 12 ooo 2005-03-08 21:08:23 UTC
I'll chime in for some comments:

Soeren wrote:
> The web is multilingual nowadays. Pasted HTML should dictate locale if
> the pasted HTML contains the lang tag. Otherwise Calc should set the
> locale.

I'd say that HTML documents in 99.8% of all cases don't specify
a locale, in fact I haven't seen any so far. If at all, they will
contain the language, which in case of, for example, "en", "fr" or "es"
is equivalent to no locale. Only for languages where there is not more
than one possible locale it would be sufficient. Furthermore, pages in
English tend to contain no language tag, because "the language of the
web is English", implying an en_US locale. The majority of all these
cases interpreted in the current Calc locale would lead to wrong results
if separators didn't match.

> To improve number recognition Calc can guess the locale of the pasted
> HTML if any value in the the pasted content contains either two
> periods or two commas with three digits in between (e.g. 1.000.000).

There are locales that don't use dots or commas as separators. There are
also locales that use the group separator (AKA thousands separator) not
in groups of three digits each, but in mixed forms; currently not
supported by OOo, but possible. Hindi for example, AFAIR. The only thing
one can say for sure is that a separator appearing more than once can't
be a decimal separator, and more than two different separators don't
form a number. The way to go for each string would be

0. in case the document specifies a locale, use that
1. look if it could be parsed as an en_US number
2. look if it could be parsed as a number according to the current
   locale
3. maybe try to detect another meaningful combination of separators

Calc currently tries a somewhat weaker method of 1. and 2. for each
string and if both apply, en_US is taken. This is why 1.000 in de_DE
locale results in 1. The checks, however, could be refined to stick to
one locale once detected, and better take the group separator into
account. Still, if there are several numbers like 1.000 and 2,000 you
can't say which is what. This is the amount of uncertainty a detection
could not solve, like it is the case for
http://investdb.theglobeandmail.com/invest/investSQL/gx.price_history?pi_symbol=UALAQ-Q2
and only a dialog would help.

> Furthermore, if digits are separated by both a period and a comma it
> is almost certain that the last of these is the decimal. (e.g.
> 1.000,01).

Seconded.

> It is a pretty sure shot that the rest of the contents use the same
> thousands separator. 

Which effectively means parsing the document content twice, at least up
to the point where it is certain that both separators were detected and
previous assumptions didn't match. Ok, not a big deal, but maybe not
nice regarding performance for large documents.

  Eike
Comment 13 stp 2005-03-09 09:57:04 UTC
Hi Eike,

Thanks for your in depth analysis. I agree overall and have a few comments.

Eike wrote:
> Furthermore, pages in English tend to contain no language tag, because "the 
> language of the web is English", implying an en_US locale.

I meant specifying lang="da" for the table - not necessarily the whole page. 

Either way, if Calc would respect the locale set in HTML's lang tag - only if
specified - the user always has a way of dictating the locale. As of now the
locale is sometimes lost in translation in the mix between 1 and 2 without any
way of dictating the locale. That is why I suggest respecting HTML's lang tag -
if specified.

W3 spec: http://www.w3.org/TR/REC-html40/struct/dirlang.html

I will try to analyse how other office suites handle the issue in general.

Søren
Comment 14 jkjel 2005-03-09 10:15:31 UTC
Hi Eike
Thank you for your analysis. We have discissed the issue and has come to the
conclusion that the primary issue is to have the user decide which
numberingformat is to be used. The easiest way to do this is to have the OOo
locale setting to determine the numberingformat of the HTML import. If the user
wants US numebring format of the import the locale should be set to US. If the
user wants Danish og German the locale setting of OOo should be set to Danish or
German.

The current situation is completely confusing since a HTML import with Danish
locale setting uses a mix of US and Danish nummering.

Spreadsheets like Excel and QuattroPro has a dialog box in which the user can
set numberingformats on each HTML import.
Thanks
JensK
Comment 15 ooo 2005-03-09 11:27:54 UTC
Even though you are using OpenOffice.org in Danish you will often get tables 
from URL's in English. 
 
I would suggest that you can select the number format when using "Insert 
special" and not just by changing the locale setting. 
 
And when using the normal "Insert" then use the locale setting. 
Comment 16 ooo 2005-03-09 11:57:53 UTC
Please bear in mind that the issue not only arises when inserting from the
clipboard but  also when opening files, including http/ftp, via the Calc HTML
filter.
Comment 17 ooo 2005-03-09 12:04:48 UTC
Jens,

> Spreadsheets like Excel and QuattroPro has a dialog box in which the user can
> set numberingformats on each HTML import.

Don't know about QuattroPro, but I've never seen it in Excel. Where did you
encounter it?

Eike
Comment 18 ooo 2005-03-09 13:05:38 UTC
Then you could also select the numbering format as an option when importing 
HTML to Calc or External Data from the Insert menu in Calc. 
Comment 19 stp 2005-03-12 07:33:18 UTC
Created attachment 23717 [details]
Testcase
Comment 20 stp 2005-03-12 07:57:39 UTC
Created attachment 23718 [details]
Testcase (HTML 4.01)
Comment 21 stp 2005-03-12 08:05:06 UTC
Excel (locale = da) recognises all 3 tables in the attached testcase as:
1.000	1.000.000
1	1,000,000

Changing the lang tag for the entire page doesn't change Excels number recognition.
Comment 22 ooo 2005-03-12 11:13:32 UTC
In 680m82 in Danish I can't insert data from hyperlink with the first example 
using XHTML 1.0 Strict from thing. 
 
But the second using HTML 4.01 Transitional gives these results: 
 
Table without lang tag  
1   1000000               should have been   1000   1000000 
1 1,000,000 (textstring)  should have been      1 1,000,000 (textstring) 
Table with lang = da  
1   1000000               should have been   1000   1000000 
1 1,000,000 (textstring)  should have been      1 1,000,000 (textstring) 
Table with lang = en  
1   1000000               should have been      1 1.000.000 (textstring) 
1 1,000,000 (textstring)  should have been   1000   1000000 
Comment 23 ooo 2005-03-12 11:19:42 UTC
The first example in XHTML 1.0 Strict works the same way as the second example 
in HTML 4.01 Transitional when opening in Calc. 
Comment 24 stp 2005-03-12 12:36:32 UTC
Excel adheres the locale setting of the spreadsheet exclusively. As far as I can
tell it doesn't evaluate the locale of the source document.

Quattro Pro 12 acts the same.

In the short run I suggest that Calc should follow Calc's locale only to fix the
imminent defect. Further up the road HTML's lang tag could define the locale -
if the lang tag is set for either the table or the whole page. The latter step
can be regarded as an enhancement.

Søren

PS. As Claus points out there is a problem with opening the XHTML testcase, but
that is another issue. Is there an issue filed for this problem?
Comment 25 ooo 2005-03-12 14:33:59 UTC
I have opened issue 44872 "Link to External data can't handle XHTML 1.0 Strict 
tables": 
 
  http://www.openoffice.org/issues/show_bug.cgi?id=44872 
Comment 26 ooo 2005-03-14 11:24:48 UTC
Soeren,

> Excel (locale = da) recognises all 3 tables in the attached testcase as:
> 1.000   1.000.000
> 1       1,000,000

Of which 1 is wrong, and 1,000,000 is probably a string, and not a number.

> In the short run I suggest that Calc should follow Calc's locale only 
> to fix the imminent defect. Further up the road HTML's lang tag could 
> define the locale - if the lang tag is set for either the table or the 
> whole page. The latter step can be regarded as an enhancement.

Please read again what I wrote previously on
Tue Mar 8 13:08:23 -0800 2005

Interpreting according to Calc's locale only is most certainly not what 
you want, it would misinterpret the great majority of web documents,
being in US-English. And the lang tag isn't really helpful, except maybe
for Danish.

  Eike
Comment 27 jcdamgaard 2005-03-14 11:53:03 UTC
In Excel XP and 2003 you can specify whether number recognition should be 
handled by system settings or you can manually specify the thousands and 
decimal seperators.( Tools > Options > International tab)
(This means that if your locale is danish, an import of 1,000,000 will be 
interpreted as text and not as the number 1000000)

The good thing about this solution is that it puts the user in control instead 
of letting Calc try to guess the right format.
Could this be a possible solution for Calc?

JC
Comment 28 jcdamgaard 2005-03-16 13:21:54 UTC
Hi
Just to sum up: When using locales other than english, Calc converts BOTH â€.†
and â€,†to a decimal separator.
It happens not only when copying from html-pages but also from Word, Writer etc.
This is a serious bug as it causes data loss. Therefore i suggest changing to 
prio 2, or at least prio 3.

In 2.0 beta, you can specify if the decimal separator key should be â€Same as 
local setting ( , )†but it doesnt work. It still converts â€.†to a decimal 
separator. (Tools > Options > Language settings > Languages)

N.B. It seems like Issue 38494 describes the same bug!

Thanks!
Comment 29 stp 2005-03-19 12:41:41 UTC
Eike,

> Interpreting according to Calc's locale only is most certainly not what
> you want, it would misinterpret the great majority of web documents,
> being in US-English. And the lang tag isn't really helpful, except maybe
> for Danish.

What I don't want is Calc's current inconsistent locale guessing strategy that
makes pasted HTML untrustworthy. It confuses both existing users and migraters
from Excel.

According to this 3 year old statistics about 50% of the webpages on the
internet is in English. http://www.netz-tipp.de/sprachen.html (the page is in
German...) Today, you can't assume that all HTML is english. 

However, I concur that the lang tag doesn't work wonders. It is just a
suggestive input for an enhancement (and I will file it as such elsewhere).

Søren
Comment 30 stp 2005-03-21 15:25:16 UTC
I agree that issue 38494 is another derivation of the same underlying technical
defect; apparently formatted text (such as HTML) is always pasted in English
locale (ie. . is the decimal).

For example if I simply copy the text "1.000" (one thousand in German, Danish
etc.) from Writer it is recognised as "1" when pasted in Calc.

Please raise priority if you agree that this is a serious bug that can cause
data loss.
Comment 31 polypoly 2005-04-01 20:32:06 UTC
Almost as many in the world are using (,) as thousands separator as thouse using 
(.)
At least 100 million are using (,)
That should be reflected in OOo
Comment 32 stp 2005-04-02 09:42:10 UTC
Here are the bugged locales:
bs_BA
es_CL
da_DK
es_ES
de_DE
el_GR
es_PE
es_UY
eu
fo_FO
fr_BE
lt_LT
nl_BE
nl_NL
hr_HR
ia
id_ID
is_IS
pt_BR
it_IT
pt_PT
km_KH
rw_RW
sh_YU
sl_SI
sr_YU
tr_TR
uk_UA
es_AR
ca_ES
es_CR
de_AT
de_LU
es_PY
gl_ES
Comment 33 Roberto Salomon 2005-05-04 20:28:09 UTC
On 1.1.3 pt_BR the problem also happens. The issue, however, seems to be related
to whether the paste event is sent when the cell is opened or not. If I select
the cell and paste the 1.000 value copied from an HTML page, the number 1 is
displayed. However, if I double-click the cell and then paste, the correct
number, i.e. 1.000 is displayed.
Comment 34 stp 2005-05-26 00:47:00 UTC
Created attachment 26566 [details]
Simple but effective patch
Comment 35 stp 2005-05-26 00:48:15 UTC
Created attachment 26567 [details]
Testcases
Comment 36 stp 2005-05-26 00:53:48 UTC
Dear Falko,

Please find attached patch (attachment 26566 [details]) and the tested testcases
(attachment 26567 [details]).

Requesting patch review.

Søren
Comment 37 ooo 2005-05-26 12:01:55 UTC
Søren,

I don't know why you think that in a 'de' locale the results of the patched
version are ok now. They are for the two cases you marked with a red background,
but introduced are three more date values that haven't been a date before,
namely IDs 4 (1.2), 6 (1.23) and 18 (1.2345). Furthermore 6 results are now
strings that were numbers before, IDs 13 (1,234.5), 21 (1234.5), 22 (123.45), 31
(1.23456), 32 (1,234.56) and 39 (12,345.6). These are the cases why that special
handling was introduced: to be able to read en_US number values. The remaining
date values are yet another story and the entire date recognition must be
straightened out for it.

Again: there is no way to automatically determine from a HTML document reliably
if 1.234 represents 1234 (group separator) or 1.234 (decimal separator).

Eike
Comment 38 stp 2005-05-26 18:34:44 UTC
Tha patch fixes issue 38494 , too.
Comment 39 pavel 2005-05-31 20:25:50 UTC
adding me to CC:
Comment 40 frank 2005-06-09 13:41:23 UTC
*** Issue 48926 has been marked as a duplicate of this issue. ***
Comment 41 bobharvey 2005-06-09 22:08:10 UTC
Moved my vote from issue 48926 .
I am beginning to wonder if there is any need to emphasise the HTML element,
although that does aid the discovery of the originating localle.  I can,
however, concieve of situations where the data in the clipboard could have come
from sources other than a web browser, and hence 'paste special - as numbers'
would be appropriate.

Comment 42 josefg 2005-06-10 13:54:54 UTC
Please note that this problem applies equally to numbers using space as their
thousand separator (which ae pasted as strings rather than numbers).
Comment 43 jcdamgaard 2005-06-13 08:22:56 UTC
The patch posted by Thing seems to fix this bug. After testing it thoroughly 
for some time now, we havent encountered any ´sideeffects'. Number pasting is 
now working like other spreadsheets, and like you would expect it to work. I 
can only recommend using the patch if you are experiencing loss of data caused 
by this bug. (we've installed the patch for several users now, and they all 
report that number recognition works well now!)
Comment 44 falko.tesch 2005-06-13 15:21:59 UTC
FT: This issue will closed due to complexity.
A comprehensive (and short!) summary including a new proposal on a solution is
posted in its f'up issues number 50670.

Regards
Falko
Comment 45 falko.tesch 2005-06-13 15:22:47 UTC
FT: This issue will closed due to complexity.
A comprehensive (and short!) summary including a new proposal on a solution is
posted in its f'up issues number 50670.

Regards
Falko
Comment 46 falko.tesch 2005-06-13 15:24:06 UTC
.

*** This issue has been marked as a duplicate of 50670 ***
Comment 47 falko.tesch 2005-06-13 15:24:24 UTC
.
Comment 48 stp 2005-06-16 21:01:19 UTC
All, now please move your votes to issue 50670 instead if you hope to see
something _done_ about this.