Issue 58767 - encoding flaw in dictionary entries - garbled special chars in documents
Status: CLOSED FIXED
Alias: None
Product: App Dev
Classification: Unclassified
Component: api
Version: 3.3.0 or older (OOo)
Hardware: PC All
Importance: P3 Trivial
Target Milestone: ---
Assignee: stefan.baltzer
QA Contact: issues@api
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-12-02 04:00 UTC by ms2
Modified: 2013-02-24 21:08 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
the dictionary list in question (1.35 KB, application/octet-stream)
2006-07-05 01:42 UTC, ms2

Description ms2 2005-12-02 04:00:08 UTC
When dictionary entries are inserted into a Writer document from SBASIC code,
the encoding is not respected: special characters (German umlauts and the like)
are not converted correctly.

This problem vanishes if the locale is set to de_DE.UTF-8 before starting
OO.o.

Tested with:
OO.o 1.1.3-de/FreeBSD
OO.o 1.1.5-de/Windows 98
OO.o 1.1.5-de/Windows 2000
OO.o 2.0 RC1-de/Windows 98

Every combination garbles special characters; only on FreeBSD with the locale
set to de_DE.UTF-8 are the characters shown correctly.

A complete description, a diagnosis, and test code for reproducing the
problem follow below:

From: 	Stephan Bergmann <stephan.bergmann@sun.com>
Reply-To: 	dev@api.openoffice.org
To: 	dev@api.openoffice.org
Subject: 	Re: [api-dev] encoding flaw in dictionary entries
Date: 	Wed, 30 Nov 2005 09:57:54 +0100
Newsgroups: 	openoffice.api.dev

Marc Santhoff wrote:
> On Tuesday, 29.11.2005, 09:56 +0100, Stephan Bergmann wrote:
> 
>>Marc Santhoff wrote:
>>
>>>On Monday, 28.11.2005, 10:29 +0100, Stephan Bergmann wrote:
>>>
>>>
>>>>Marc Santhoff wrote:
>>>>
>>>>
>>>>>Hi,
>>>>>
>>>>>I'm using dictionaries from basic code and noticed a problem. When the
>>>>>search word from a dictionary entry is inserted into a writer doc the
>>>>>encoding is not shown correctly.
>>>>>
>>>>>Try this in a German-localized version:
>>>>>
>>>>>sub encError
>>>>>   dls = createUnoService("com.sun.star.linguistic2.DictionaryList")
>>>>>   dic = dls.getDictionaryByName("soffice.dic")
>>>>>   entries = dic.getEntries()
>>>>>   msgbox entries(16).getDictionaryWord()
>>>>>end sub
>>>>>
>>>>>In a German-language version of OO.o 1.1.x this should read
>>>>>"Bemaßungslinien", but the char "ß" is not converted correctly. The
>>>>>same holds for the German OO.o 2.0 RC1 on Windows.
>>>>>
>>>>>Is this worth filing an issue, or is it pilot error?
>>>>
>>>>It sure sounds like an error (so please file an issue): 
>>>>XDictionaryEntry.getDictionaryWord returns a UNO string, which is 
>>>>Unicode, so no excuse to garble an "ß" (and Basic's msgbox command 
>>>>should also be fully Unicode...).
>>>
>>>
>>>Thanks for replying.
>>>
>>>I only thought I was missing some conversion function or the like
>>>because all umlauts are garbled too. They are shown as two chars in a
>>>writer doc. And from the GUI anything works as expected ...
>>
>>You mean, adding text to a writer doc via some Basic code (where the 
>>text to be added is represented as a literal Basic string) leads to 
>>garbled characters?  That's strange.  Maybe Andreas Bregas knows whether 
>>there is some part of Basic or the Basic IDE that works with 
>>locale-dependent text encodings instead of Unicode?
> 
> 
> Yes, that's what I wanted to say.
> 
> Another test for the German-localized OO.o:
> 
> sub encError2
>       BasicLibraries.LoadLibrary("Tools")
>       dls = createUnoService("com.sun.star.linguistic2.DictionaryList")
>       dic = dls.getDictionaryByName("soffice.dic")
>       entries = dic.getEntries()
>       tmpDoc = CreateNewDocument("swriter")
>       csr = tmpDoc.Text.createTextCursor()
>       tmpDoc.Text.string = entries(16).getDictionaryWord() ' "ß"
>       tEnd = tmpDoc.Text.getEnd()
>       tEnd.String = entries(46).getDictionaryWord() ' "ö"
> end sub
> 
> This does garble the special chars, too.
> 
> Regards,
> Marc

Two things I noticed when trying to reproduce this:

1  You must be using a non-UTF-8 locale (probably 8859-1), check the 
environment variable LANG.  If you set LANG to something like 
"de_DE.UTF-8" the problem should go away.

2  If you modify the Basic script by adding

     tEnd = tmpDoc.Text.getEnd()
     tEnd.String = "äöü"
   end sub

to the end, you see that Basic is not the culprit, as the umlauts show 
up correctly in the writer doc, regardless of LANG setting.

I suspect that the OOo dictionary implementation erroneously uses 
osl_getThreadTextEncoding() (which depends on LANG) to translate the 
(obviously UTF-8 encoded) strings within the dictionary data base to 
Unicode.  Please update the issue (did you already write one?) accordingly.

-Stephan
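
The failure mode suspected in the message above can be reproduced outside OOo. The following is a minimal Python sketch, not code from the OOo sources: it assumes the dictionary stores its words as UTF-8 bytes and that the locale-dependent encoding is ISO-8859-1, as guessed in point 1. The helper name decode_entry is hypothetical.

```python
# Illustration only: decode_entry is a hypothetical helper, not an OOo API.
def decode_entry(raw, encoding):
    """Turn the bytes stored for a dictionary word into a Unicode string."""
    return raw.decode(encoding)

# Assumed on-disk form of the test string from the thread: UTF-8 bytes.
stored = "äöü".encode("utf-8")

# Decoding with a locale-dependent single-byte encoding (assumed ISO-8859-1)
# turns every umlaut into two characters: garbled is "Ã¤Ã¶Ã¼".
garbled = decode_entry(stored, "iso-8859-1")

# Decoding with UTF-8 restores the original three characters.
correct = decode_entry(stored, "utf-8")

print(len(garbled), len(correct))
```

Each umlaut occupies two bytes in UTF-8, which matches the report that the umlauts "are shown as two chars in a writer doc".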
Comment 1 stephan.wunderlich 2005-12-02 08:48:13 UTC
sw->tl: looks like one for you
Comment 2 thomas.lange 2006-06-20 14:25:34 UTC
.
Comment 3 thomas.lange 2006-07-04 15:53:01 UTC
TL->ms2:
Yes, it uses osl_getThreadTextEncoding(), but only for the older versions of
dictionaries.
For the latest two it uses UTF-8 only. This should have been the case since
StarOffice 7, and most probably already since StarOffice 6.
Thus, unless you have a very old dictionary, there should be no problem like this.

I also tried to reproduce your problem by creating a new dictionary with
"äußern" in Win XP and then used it with iso8859-1 and iso8859-15 on Solaris
without any problems.

Can you give more information?
Also if you have that very dictionary you use, please attach it to this issue.
Thanks!
Comment 4 thomas.lange 2006-07-04 15:55:29 UTC
BTW: I was using office versions 680 m173 and 680 m168. 

TL->ms2: Can you check again with a current version?
Comment 5 ms2 2006-07-05 01:39:49 UTC
tl:
I've tried the basic code today on OO.o 2.0.2-de Release (680m5, build 8011 or
9011 ;):

Running on Win98se it's still there, the conversion from UTF-8 fails. The
dictionary I've used is the one from OO.o 2.

The history of this installation should be: tried 2.0 RC1 in parallel to 1.1.5,
then put the 2.0.2 release over it. So I'm not absolutely sure whether the
dictionary is old or new. It's "soffice.dic" and I'll attach it later.

Testing a newer version will take some time...
Comment 6 ms2 2006-07-05 01:42:11 UTC
Created attachment 37515 [details]
the dictionary list in question
Comment 7 thomas.lange 2006-07-06 09:02:18 UTC
The actual problem is that the entries in the dictionary are already converted
to UTF-8, while the version string of the dictionary says "WBSWG2", which means
it claims to be a thread-encoded positive dictionary (as was still used in SO
5.x).
Since SO6 that version string should have been "WBSWG6", which indicates a UTF-8
encoded dictionary, for which the thread encoding is no longer used when reading
or writing the dictionary.
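
The mismatch described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OOo implementation: the function names encoding_for and read_word are hypothetical, only the two version strings named in this comment are handled, and ISO-8859-1 stands in for the thread encoding.

```python
import locale

# Version strings named in this comment; everything else here is assumed.
UTF8_TAG = "WBSWG6"    # UTF-8 encoded dictionary
LEGACY_TAG = "WBSWG2"  # thread-encoded dictionary (SO 5.x style)

def encoding_for(version_tag, thread_encoding=None):
    """Pick the text encoding implied by a dictionary version string."""
    if version_tag == UTF8_TAG:
        return "utf-8"
    if version_tag == LEGACY_TAG:
        # Stand-in for the thread encoding: use the caller's value, else
        # whatever the current locale prefers.
        return thread_encoding or locale.getpreferredencoding(False)
    raise ValueError("unknown dictionary version tag: " + repr(version_tag))

def read_word(raw, version_tag, thread_encoding=None):
    """Decode one stored word according to the version string."""
    return raw.decode(encoding_for(version_tag, thread_encoding))

# The word is stored as UTF-8, but the file claims to be thread encoded.
raw = "äußern".encode("utf-8")
mislabelled = read_word(raw, LEGACY_TAG, "iso-8859-1")  # garbled, 8 chars
relabelled = read_word(raw, UTF8_TAG)                   # "äußern", 6 chars

print(len(mislabelled), len(relabelled))
```

With a UTF-8 thread encoding the legacy branch decodes the same bytes correctly, which is consistent with the problem disappearing under de_DE.UTF-8.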

The problem could probably be solved by just editing that version string in a
(hex) editor.
But to be sure, I wrote a conversion macro that reads the entries and writes new
dictionaries. Be sure to call that macro from an environment with UTF-8 encoding!
And just in case, make backups of the original dictionaries beforehand.

Please change the "aNewDicDirURL" in the macro to point to a suitable
output directory.
=================================

Sub Main

' Get the list of all dictionaries known to the office
xDicList = createUnoService("com.sun.star.linguistic2.DictionaryList")
aDics = xDicList.getDictionaries()
nDics = ubound( aDics ) + 1

' Output directory for the converted dictionaries (adjust before running)
aNewDicDirURL = "file:///c:/NewDics"

for i = 0 to nDics - 1
	xDic = aDics(i)
	aDicName = xDic.getName()
	' Skip the internal "IgnoreAllList" dictionary
	if aDicName <> "IgnoreAllList" then
		aDicLocale = xDic.getLocale()
		eDicType = xDic.getDictionaryType()
		aEntries = xDic.getEntries()
		nEntries = xDic.getCount()
		
		' Create a new dictionary with the same name, locale and type
		aURL = aNewDicDirURL + "/" + aDicName
		xNewDic = xDicList.createDictionary( aDicName, aDicLocale, eDicType, aURL )
		' Copy every entry from the old dictionary into the new one
		for k = 0 to nEntries - 1
			aEntry = aEntries(k)
			xEntry = xDic.getEntry( aEntry.getDictionaryWord() )
			xNewDic.addEntry( xEntry )
		next k
		' Write the new dictionary to the output directory
		xNewDic.storeToURL( aURL, DimArray() )
	endif
next i

End Sub

=================================

When the macro has run successfully and you have exited the running Office,
you just need to copy the newly converted dictionaries to their respective locations
(either user/wordbook or the respective subdirectory in share/wordbook).
After that, things should be fine with all locale settings.

Tl->Ms2: Please check. What still makes me wonder is that the dictionary
versions that we used were also "WBSWG2" but we did not encounter the problem...
Comment 8 thomas.lange 2006-07-06 09:04:25 UTC
TL->VA: What you need to do now is convert all the dictionaries named "sun.dic",
"soffice.dic" and the like from the project extras_full as described above and
have those new versions checked in.
Please take over. Thanks!
Comment 9 Martin Hollmichel 2007-11-09 17:27:51 UTC
set target from 2.x to 3.x according to
http://wiki.services.openoffice.org/wiki/Target_3x
Comment 10 weko 2007-12-19 13:31:30 UTC
VA: Issue will be fixed in connection with Issue 84688.
Comment 11 weko 2007-12-20 15:14:09 UTC
.
Comment 12 thomas.lange 2008-01-14 12:58:31 UTC
TL->VA: You may use the macro below to see all entries of a specific dictionary.

Sub Main

dls = createUnoService("com.sun.star.linguistic2.DictionaryList")
dic = dls.getDictionaryByName("soffice_de.dic")

entries = dic.getEntries()
n = ubound(entries)
msgbox n
aText = ""
for i = 0 to n 
    aText = aText + entries(i).getDictionaryWord() + " "
next i
msgbox aText

End Sub
Comment 13 weko 2008-01-14 13:17:17 UTC
.
Comment 14 weko 2008-01-15 11:20:43 UTC
VA->SBA: Please verify.
Comment 15 stefan.baltzer 2008-01-17 15:03:21 UTC
Verified in CWS extras240
Comment 16 stefan.baltzer 2008-09-08 09:51:45 UTC
Closing issue.
Note: Dictionaries in OOo 3.0 will work as extensions only (See 81365 "Allow for
dictionaries to be provided as extensions")