Issue 68639

Summary: use the system locale if document does not provide a character
Product: Writer Reporter: kpalagin <kpalagin>
Component: open-importAssignee: stefan.baltzer
Status: CLOSED FIXED QA Contact: issues@sw <issues>
Severity: Trivial    
Priority: P2 CC: cno, issues, Mathias_Bauer, ooo, rail_ooo
Version: 680m180Keywords: oooqa
Target Milestone: ---   
Hardware: PC   
OS: Windows, all   
Issue Type: PATCH Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description Flags
screenshot
none
problematic rtf file
none
screenshot of worviewer
none
file opened in wordviewer on german version of windows
none
zip of html
none
Prototype
none
Regional options
none
Regional options none

Description kpalagin 2006-08-15 11:45:37 UTC
Writer incorrectly displays cyrillic characters in .rtf file. On 
attached "cyrillic-in-rtf.gif" see how Writer displays (circled in red) and 
other word processor (circled in green). Problematci "bud.rtf" is attached too.
This issue greatly affects our company. Please consider fixing it in 2.0.4.

Thanks a lot.
WBR,
K. Palagin.
Comment 1 kpalagin 2006-08-15 11:46:15 UTC
Created attachment 38525 [details]
screenshot
Comment 2 kpalagin 2006-08-15 11:47:46 UTC
Created attachment 38526 [details]
problematic rtf file
Comment 3 lohmaier 2006-08-15 13:20:06 UTC
invalid. The document is broken.

Even word viewer cannot display the document right.

It misses the fontcharset declaration.

change it so that it does lists the charset and OOo and word-viewer can display it.

$ diff ~/Desktop/Downloads/bud.rtf charset.rtf
2c2
< {\fonttbl{\f1 Arial}{\f2 Code 128}}
---
> {\fonttbl{\f1 \fcharset204 Arial}{\f2 Code 128}}
Comment 4 lohmaier 2006-08-15 13:21:03 UTC
closing issue. Bug in external application that creates these RTF docs, not in OOo.
Comment 5 kpalagin 2006-08-15 14:18:35 UTC
cloph,
thanks for your analisys. I suspected that external app is broken, but can't 
you compensate for other people deficiencies and display cyrillic in readable 
form?
Thanks a lot.
WBR,
K. Palagin.
Comment 6 kpalagin 2006-08-15 14:37:11 UTC
It turns out that latest Word Viewer displays them just fine. Please see 
attached screenshot.
Comment 7 kpalagin 2006-08-15 14:38:04 UTC
Created attachment 38534 [details]
screenshot of worviewer
Comment 8 lohmaier 2006-08-17 18:57:32 UTC
Word viewer probably tries the system's locale setting.

Try this on a german version of Windows and you get wrong results again.
Comment 9 lohmaier 2006-08-17 18:58:15 UTC
Created attachment 38599 [details]
file opened in wordviewer on german version of windows
Comment 10 kpalagin 2006-08-22 22:59:30 UTC
Any chance OO can behave in the same way as wordviewer?

Thanks.
WBR,
K. Palagin.
Comment 11 michael.ruess 2006-08-23 06:03:32 UTC
Well, there is the possibility that you file an "Enhancement" issue which states
something like "use the system locale if document does not provide a character
encoding".
Comment 12 kpalagin 2006-08-23 07:35:09 UTC
(kpalagin - I am reopening this issue as RFE).
Please make OO applications to use the system locale if document does not 
provide a character encoding.

Thanks a lot.
WBR,
K. Palagin.
Comment 13 michael.ruess 2006-08-23 08:07:04 UTC
reassigned to requirements.
Comment 14 kpalagin 2006-10-09 09:56:32 UTC
Any chance of implementing this in 2.1?
Comment 15 kpalagin 2006-12-16 11:19:45 UTC
Maybe in 2.2?
Comment 16 kpalagin 2007-07-01 05:57:37 UTC
Dear developers,
please consider this issue for 2.3.
Thanks a lot for your attention.
Comment 17 cno 2007-09-28 07:51:38 UTC
added myself as cc
Comment 18 Mathias_Bauer 2007-09-28 21:40:42 UTC
It seems that initializing the encoding with the system value before parsing is
enough; in case an encoding is available it will be set in
SvxRTFParser::NextToken() in the "case RTF_DEFLANG:" part.
It needs some testing/checking though but if this is all I see no problem with
fixing it in 2.4.
Thanks for bringing this to my attention on the mailing list. We have just too
much issues like this one to remember them all. :-)
Comment 19 kpalagin 2007-10-30 20:04:01 UTC
Mathias,
shall we do the same for other text docs without encoding specified? For 
example Word6 docs and HTML come to mind. I am attaching sample .html, which 
does not display correctly ecause of missing charset specification.
Comment 20 kpalagin 2007-10-30 20:06:03 UTC
Created attachment 49277 [details]
zip of html
Comment 21 rail_ooo 2007-10-31 01:52:05 UTC
Mathias,

IMO you should set an encoding in SvxRTFParser::ReadFontTable() because
RTF_FCHARSET is not invoked if you open bud.rtf. If you suggest me a "right way"
of getting default charset (not gsl_getSystemTextEncoding() ;) ), I can provide
a patch and some QA.

Set CC.
Comment 22 Mathias_Bauer 2007-10-31 11:31:56 UTC
Thanks, that was my plan also. And this indirectly answers Kyrill's last
question: the fix will be specific for the RTF filter. The "character set"
information is something that is relevant only at the time of import, after that
and in memory all strings are UniCode strings anyway.

So if other filters also don't import documents that are broken wrt. charset
declaration in the expected way it has to be fixed in that filter also.
Comment 23 rail_ooo 2007-10-31 13:55:05 UTC
Mathias,

What do you think about using OOo locale
(Application::GetSettings().GetLocale()) to choose charset for such documents?
This requires some additional effort to prepare suitable conversion table for
each supported locale but gives additional flexibility: you don't need to change
system locale and restart OOo (just re-open the document) to apply changes.
Proposed conversion table can be used at least in ww8 and excel filters.

WBR,
Rail Aliev
Comment 24 Mathias_Bauer 2007-11-01 10:05:03 UTC
As I don't have a russian Windows version: Kyrill, can you tell me what would be
the right encoding for the "bud.rtf" document? Then I can test where I have to
set the encoding.
Comment 25 rail_ooo 2007-11-01 10:09:51 UTC
Mathias,

it should be RTL_TEXTENCODING_MS_1251
Comment 26 rail_ooo 2007-11-01 10:17:26 UTC
Attached is a prototype which uses OOo locale to "guess" needed encoding. It
works for "bud.rtf" if you switch OOo's locale to Russian (no langpack needed)
and reopen the document.
Comment 27 Mathias_Bauer 2007-11-01 10:17:38 UTC
Thanks, that looks fine. At least to me - the characters are looking somewhat
kyrillic now. :-)

I will let one of my russion colleagues check the result.
Comment 28 rail_ooo 2007-11-01 10:18:22 UTC
Created attachment 49311 [details]
Prototype
Comment 29 rail_ooo 2007-11-01 10:30:58 UTC
Mathias, in which CWS? I volunteer for QA. :) 
Comment 30 Mathias_Bauer 2007-11-01 10:47:32 UTC
The CWS is called mba24isues01. The only file I changed for the fix is
svx/source/svrtf/svxrtf.cxx.
Comment 31 Mathias_Bauer 2007-11-01 12:01:05 UTC
forgot to change owner :-)
Comment 32 kpalagin 2007-11-01 12:01:55 UTC
Thank you very much!!!

I have filed http://www.openoffice.org/issues/show_bug.cgi?id=83194 for HTML 
filter.

Please consider that for 2.4 too.
Comment 33 rail_ooo 2007-11-01 13:08:01 UTC
Mathias,
it should work on Windows but not on Unix.

Try to run OOo with en_US.UTF-8, ru_RU.UTF-8 (or other UTF-8 locale) and you get
"wrong format" error and empty file.
If you run with non UTF-8 locale, like ISO-8859-1, KOI8-R, CP1251 you get the
opened but it appears as needed only for CP1251 which is very rare used these days.
I suggest not use osl_getThreadTextEncoding() (which depends on generated libc
locales on GNU/Linux for example) and use proposed
AllSettings::GetDefaultTextEncoding() (with changes of course) for MS-dependent
formats.
Comment 34 Mathias_Bauer 2007-11-01 13:37:21 UTC
Your code looked like a Windows-only solution also (as it uses _MS_...). 
The hack we are talking about only makes sense if Office and tool are running on
the same machine, in all other cases it is just luck.

The obviously broken tool that created the rtf documents seems to be a Windows
app. If it was a Linux app using osl_getThreadTextEncoding would be fine if OOo
on the same machine was used.

Comment 35 rail_ooo 2007-11-01 13:56:24 UTC
> Your code looked like a Windows-only solution also (as it uses _MS_...). 

But it works on Linux and Windows. ;) I tested it on both platforms. It proposed
to be used for MS formats only.

> The hack we are talking about only makes sense if Office and tool are running on
the same machine, in all other cases it is just luck.

You mean osl_getThreadTextEncoding() based solution? ;) 
Yes, it's a luck. My suggestion is don't forget about users "without a luck"
(Unix users) and don't use system locale for MS based formats (RTF, DOC, XLS).
Comment 36 Mathias_Bauer 2007-11-01 14:53:48 UTC
So your assumption is that the broken tools is a tool that runs on Windows and
so we should only consider Windows encodings? You don't want to have a fix for
the opposite case that the tool runs on Linux and use the system encoding on
that machine?

Well, that indeed is supported by your proposal. In that case we can perhaps
build upon the Windows implementation of osl_getTextEncodingFromLocale,
Comment 37 Mathias_Bauer 2007-11-01 16:13:50 UTC
Thinking a bit about your patch: GetDefaultEncoding() in the way you proposed it
should be a private method of the RTFParser. This way to detect an encoding is a
special hack for MS file format based Windows tools. So we shouldn't move that
code to VCL. The proper default encoding should be what you get from
osl_getThreadTextEncoding or osl_getTextEncodingFromLocale. Our hack only is for
this special case. And it failed for the same broken tools located on Linux.

Besides that I would be more confident if we had a clear explanation for which
encoding should be chosen for what encoding. 

On Windows osl_getTextEncodingFromLocale works fine at least for the "uk","tr"
and "ru" locales but I think it should be fine to use the same method for other
locales. But I currently don't know what implementation is used on other platforms. 
Comment 38 rail_ooo 2007-11-05 09:15:26 UTC
Mathias,
I agree with you that osl_getThreadTextEncoding is more generec than usein
cp125x locales. I have a collection of files with wrong or unknown encoding for
RTF, XLS and DOC files and all of them cp125x based files, still no ISO-8859-x
or others.

IMO, the best and flexible solution will be creating a new config item whith
fallback encoding for such documents and set ut to cp125x (depending on locale)
by default. User may change it via Tools-Options.

I used vcl because that function depends on OOo config, but it can be moved
anywhere where it can be shared by sc, sc and maybe binfilter.
Comment 39 kpalagin 2007-11-15 14:14:06 UTC
Thank you very much for this inhancement!
When would this CWS be integrated into master?
Comment 40 Mathias_Bauer 2007-11-15 19:28:02 UTC
I hope that the CWS can be handed over to QA as soon as they have finished their
work on all UI/help relevant CWS that need to be integrated until UI freeze. So
perhaps in a week or so.
Comment 41 Mathias_Bauer 2007-12-15 00:01:25 UTC
@rail: I forgot to add that I accepted your way to detect the locale. For rtf it
makes sense to do it "Windows like". I didn't add this to the Settings class
though, it should stay an RTF specific function.
Thanks for the patch.
Comment 42 Mathias_Bauer 2007-12-15 00:20:10 UTC
As I don't have a russion OS I would like to ask for testing.

Some remarks about doc and xls: these are different filters and should be
treated by different issues. In case the filter developers accepted the same
treatment we can move the currently local function in rtf filter somewhere else.
Comment 43 kpalagin 2007-12-15 06:42:11 UTC
Tried m239 on WinXP english with Russian localle (see attached screenshots) - 
attached "bud.rtf" still does not open in Russian. I guess this issue should 
be reopened then?
Comment 44 kpalagin 2007-12-15 06:43:40 UTC
Created attachment 50331 [details]
Regional options
Comment 45 kpalagin 2007-12-15 06:45:12 UTC
Created attachment 50332 [details]
Regional options
Comment 46 rail_ooo 2007-12-15 08:49:36 UTC
@Kirill: mba24issues01 is not integrated yet
(http://tools.services.openoffice.org/EIS2/cws.ShowCWS?logon=true&Id=5912&Path=SRC680%2Fmba24issues01).

@Mathias: I think we should add other languages in the lang->ms_encoding table
too. What do you think to add such table to i18npool?
Comment 47 Mathias_Bauer 2007-12-15 11:48:23 UTC
It's still not in the master. Sorry, I should have mentioned that. The Child
Workspace should become "Ready for QA" on monday.

@rail: extending the table surely is a good idea. For the moment we can leave it
in svx. This is Windows specific, so I hesitate to put it into i18n. Maybe Eike
can give his opinion about this.

Eike: if we wanted to have a table that maps ISO locales to ISO charsets like
Windows does it - would that be something we can put into i18n or should we
start thinking about a collection of MS specific things (I remember the date
coding, the storage format tricks etc.)?

The idea of such a table would be to treat documents in MS specific formats
better in case the application creating them "forgot" to specify the locale. The
documents work well with MS Office because they fall back to the Windows system
locale. The idea is having this fallback to the Windows locale for OOo on all(!)
platforms also.
Comment 48 daniel.rentz 2007-12-17 10:09:20 UTC
If someone writes me an issue about the XLS filters, please remember to add a
link to this issue ;-)
Comment 49 ooo 2007-12-17 10:47:40 UTC
@mba: I think adding such locale -> text encoding information would be best
suited somewhere near rtl_TextEncodingInfo and some rtl_getTextEncodingFrom...()
function, see sal/inc/rtl/tencinfo.h
'sb' might have some opinion/suggestion/objection.
Comment 50 Mathias_Bauer 2007-12-18 08:10:02 UTC
Stefan, for testing you will need an operation system with a russion locale. The
attached document should show kyrillic letters then.
Comment 51 stefan.baltzer 2007-12-20 16:06:32 UTC
Verified in CWS mba24issues01 on Windows and Linux.

My test scenario:

(1) The intuitive case:
Having set the system Locale to Russian (on Linux, I tried ru_RU), the file
opens fine. 
Note: No native Russian Windows needed for this, tried with English WinXP.

(2) The workaround for non-russian locales 
Having another locale set, the file opens "not good", same as before. 
But then, Tools-Options-Language settings-Languages -> set "Locale setting" to
"Russian", OK
File - Reload -> Voila, readible.
Comment 52 stefan.baltzer 2008-02-15 12:55:05 UTC
OK in OOH680_m7. OK in SRC680_m247. Closed.