Apache OpenOffice (AOO) Bugzilla – Issue 58767
encoding flaw in dictionary entries - garbled special chars in documents
Last modified: 2013-02-24 21:08:18 UTC
When dictionary entries are inserted into a Writer document from SBASIC code, the encoding is not respected: special characters (German umlauts and the like) are not converted correctly. The problem vanishes if the locale is set to de_DE.UTF-8 before starting OO.o.

Tested with:
OO.o 1.1.3-de/FreeBSD
OO.o 1.1.5-de/Windows 98
OO.o 1.1.5-de/Windows 2000
OO.o 2.0 RC1-de/Windows 98

Every combination garbles special characters; only on FreeBSD with the locale set to de_DE.UTF-8 are they shown correctly.

A complete description, diagnosis, and test code for reproducing the problem follow below:

From: Stephan Bergmann <stephan.bergmann@sun.com>
Reply-To: dev@api.openoffice.org
To: dev@api.openoffice.org
Subject: Re: [api-dev] encoding flaw in dictionary entries
Date: Wed, 30 Nov 2005 09:57:54 +0100
Newsgroups: openoffice.api.dev

Marc Santhoff wrote:
> On Tuesday, 29.11.2005, 09:56 +0100, Stephan Bergmann wrote:
>
>>Marc Santhoff wrote:
>>
>>>On Monday, 28.11.2005, 10:29 +0100, Stephan Bergmann wrote:
>>>
>>>>Marc Santhoff wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>I'm using dictionaries from Basic code and noticed a problem. When the
>>>>>search word from a dictionary entry is inserted into a Writer doc the
>>>>>encoding is not shown correctly.
>>>>>
>>>>>Try this in a German localized version:
>>>>>
>>>>>sub encError
>>>>>    dls = createUnoService("com.sun.star.linguistic2.DictionaryList")
>>>>>    dic = dls.getDictionaryByName("soffice.dic")
>>>>>    entries = dic.getEntries()
>>>>>    msgbox entries(16).getDictionaryWord()
>>>>>end sub
>>>>>
>>>>>In a German language version of OO.o 1.1.x this should read
>>>>>"Bemaßungslinien" but the char "ß" is not converted correctly. This
>>>>>holds true for the German OO.o2.0-RC1/Windows, too.
>>>>>
>>>>>Is this worth filing an issue or is it pilot error?
>>>>
>>>>It sure sounds like an error (so please file an issue):
>>>>XDictionaryEntry.getDictionaryWord returns a UNO string, which is
>>>>Unicode, so no excuse to garble an "ß" (and Basic's msgbox command
>>>>should also be fully Unicode...).
>>>
>>>
>>>Thanks for replying.
>>>
>>>I only thought I was missing some conversion function or the like
>>>because all umlauts are garbled too. They are shown as two chars in a
>>>Writer doc. And from the GUI everything works as expected ...
>>
>>You mean, adding text to a Writer doc via some Basic code (where the
>>text to be added is represented as a literal Basic string) leads to
>>garbled characters? That's strange. Maybe Andreas Bregas knows whether
>>there is some part of Basic or the Basic IDE that works with
>>locale-dependent text encodings instead of Unicode?
>
>
> Yes, that's what I wanted to say.
>
> Another test for the German localized OO.o:
>
> sub encError2
>     BasicLibraries.LoadLibrary("Tools")
>     dls = createUnoService("com.sun.star.linguistic2.DictionaryList")
>     dic = dls.getDictionaryByName("soffice.dic")
>     entries = dic.getEntries()
>     tmpDoc = CreateNewDocument("swriter")
>     csr = tmpDoc.Text.createTextCursor()
>     tmpDoc.Text.string = entries(16).getDictionaryWord() ' "ß"
>     tEnd = tmpDoc.Text.getEnd()
>     tEnd.String = entries(46).getDictionaryWord() ' "ö"
> end sub
>
> This garbles the special chars, too.
>
> Regards,
> Marc

Two things I noticed when trying to reproduce this:

1 You must be using a non-UTF-8 locale (probably 8859-1); check the environment variable LANG. If you set LANG to something like "de_DE.UTF-8" the problem should go away.

2 If you modify the Basic script by adding

      tEnd = tmpDoc.Text.getEnd()
      tEnd.String = "äöü"
  end sub

to the end, you see that Basic is not the culprit, as the umlauts show up correctly in the Writer doc, regardless of the LANG setting.
I suspect that the OOo dictionary implementation erroneously uses osl_getThreadTextEncoding() (which depends on LANG) to translate the (obviously UTF-8 encoded) strings within the dictionary data base to Unicode. Please update the issue (did you already write one?) accordingly. -Stephan
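The suspected mechanism can be illustrated outside the office with a short Python sketch. It is not the OOo implementation: the helper name `decode_as` is hypothetical, and ISO-8859-1 and cp1252 are only assumed stand-ins for the locale-dependent thread encoding on the Unix and Windows test systems. It shows why each umlaut turns into two characters when UTF-8 bytes are decoded with a single-byte encoding.

```python
# Illustration only, not OOo code: decode the UTF-8 bytes of a word with
# a given encoding, mimicking a reader that trusts the locale encoding.

def decode_as(word: str, assumed_encoding: str) -> str:
    """Encode `word` as UTF-8 (as stored on disk), then decode the bytes
    with `assumed_encoding` (hypothetical helper for this sketch)."""
    stored_bytes = word.encode("utf-8")
    return stored_bytes.decode(assumed_encoding)


# Words taken from the thread above.
for sample in ("Bemaßungslinien", "äöü"):
    wrong = decode_as(sample, "iso-8859-1")  # assumed non-UTF-8 locale
    right = decode_as(sample, "utf-8")       # LANG=de_DE.UTF-8 case
    # Each non-ASCII character occupies two bytes in UTF-8, so the wrong
    # decoding yields one extra character per umlaut or "ß".
    print(repr(wrong), len(wrong), "vs", repr(right), len(right))
```

With a UTF-8 locale the assumed encoding happens to match the stored bytes, which would explain why the problem vanishes when LANG is set to de_DE.UTF-8.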
sw->tl: looks like one for you
.
TL->ms2: Yes, it uses osl_getThreadTextEncoding(), but only for the older dictionary versions. For the latest two it uses UTF-8 only. This should have been the case since StarOffice 7, most probably already since StarOffice 6. Thus, unless you have a very old dictionary, there should be no problem like this. I also tried to reproduce your problem by creating a new dictionary containing "äußern" on Win XP and then using it with iso8859-1 and iso8859-15 on Solaris, without any problems. Can you give more information? Also, if you have the very dictionary you use, please attach it to this issue. Thanks!
BTW: I was using office versions 680 m173 and 680 m168. TL->ms2: Can you check again with a current version?
tl: I tried the Basic code today on OO.o 2.0.2-de Release (680m5, build 8011 or 9011 ;): running on Win98se, the problem is still there; the conversion from UTF-8 fails. The dictionary I used is the one from OO.o2. The history of this installation should be: tried 2.0 RC1 in parallel to 1.1.5, then put the 2.0.2 release over it. So I'm not absolutely sure whether the dictionary is old or new. It's "soffice.dic" and I'll attach it later. Testing a newer version will take some time...
Created attachment 37515: the dictionary list in question
The actual problem is that the entries in the dictionary are already converted to UTF-8, while the version string of the dictionary says "WBSWG2", which means it claims to be a thread-encoded positive dictionary (as was still used in SO 5.x). Since SO6 that version string should have been "WBSWG6", which indicates a UTF-8 encoded dictionary, for which thread encoding is no longer used when reading or writing.

The problem could probably be solved by just editing that version string in a (hex) editor. But to be sure, I wrote a conversion macro that reads the entries and writes new dictionaries. Be sure to call that macro from an environment with UTF-8 encoding! And just in case, make backups of the original dictionaries first. Please change "aNewDicDirURL" in the macro to point to a suitable output directory.

=================================
Sub Main
    xDicList = createUnoService("com.sun.star.linguistic2.DictionaryList")
    aDics = xDicList.getDictionaries()
    nDics = ubound( aDics ) + 1
    ' Output directory for the converted dictionaries; adjust before running
    aNewDicDirURL = "file:///c:/NewDics"
    for i = 0 to nDics - 1
        xDic = aDics(i)
        aDicName = xDic.getName()
        if aDicName <> "IgnoreAllList" then
            aDicLocale = xDic.getLocale()
            eDicType = xDic.getDictionaryType
            aEntries = xDic.getEntries()
            nEntries = xDic.getCount()
            aURL = aNewDicDirURL + "/" + aDicName
            xNewDic = xDicList.createDictionary( aDicName, aDicLocale, eDicType, aURL )
            ' Copy every entry into the newly created dictionary
            for k = 0 to nEntries - 1
                aEntry = aEntries(k)
                xEntry = xDic.getEntry( aEntry.getDictionaryWord() )
                xNewDic.addEntry( xEntry )
            next k
            xNewDic.storeToURL( aURL, DimArray() )
        endif
    next i
End Sub
=================================

When the macro has run successfully and you have exited the running Office, you just need to copy the newly converted dictionaries to their respective locations (either user/wordbook or the respective subdirectory in share/wordbook). After that, things should be fine with all locale settings.

TL->ms2: Please check.
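To make the mismatch concrete, here is a minimal Python sketch of a reader that picks the text encoding from the version string. It is only a model of the behaviour described in this comment, not the actual dictionary code: the function `decode_entry`, its parameters, and the use of ISO-8859-1 as the thread encoding are assumptions for illustration, and only the two version strings named here are handled.

```python
# Hypothetical model of version-dependent decoding; not the OOo source.

UTF8_TAGS = {"WBSWG6"}            # stated above to indicate UTF-8
THREAD_ENCODED_TAGS = {"WBSWG2"}  # stated above to be thread encoded


def decode_entry(raw: bytes, version_tag: str, thread_encoding: str) -> str:
    """Decode one stored entry according to the dictionary's version tag."""
    if version_tag in UTF8_TAGS:
        return raw.decode("utf-8")
    if version_tag in THREAD_ENCODED_TAGS:
        # Result depends on the locale of the running process.
        return raw.decode(thread_encoding)
    raise ValueError("version tag not covered by this sketch: " + version_tag)


# Entry bytes are UTF-8, as in the attached dictionary.
raw = "äußern".encode("utf-8")

mislabeled = decode_entry(raw, "WBSWG2", "iso-8859-1")   # garbled
utf8_locale = decode_entry(raw, "WBSWG2", "utf-8")       # correct by luck
relabeled = decode_entry(raw, "WBSWG6", "iso-8859-1")    # correct everywhere

print(repr(mislabeled), repr(utf8_locale), repr(relabeled))
```

In this model, a dictionary whose tag and byte encoding disagree reads correctly only when the thread encoding happens to be UTF-8, which matches the observation that the problem disappears under de_DE.UTF-8.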
What still makes me wonder is that the dictionary versions that we used were also "WBSWG2" but we did not encounter the problem...
TL->VA: What you need to do now is convert all the dictionaries named "sun.dic", "soffice.dic" and the like from the project extras_full as described above and have those new versions checked in. Please take over. Thanks!
set target from 2.x to 3.x according to http://wiki.services.openoffice.org/wiki/Target_3x
VA: Issue will be fixed in connection with Issue 84688.
TL->VA: You may use the macro below to see all entries of a specific dictionary.

Sub Main
    dls = createUnoService("com.sun.star.linguistic2.DictionaryList")
    dic = dls.getDictionaryByName("soffice_de.dic")
    entries = dic.getEntries()
    n = ubound(entries)
    msgbox n
    aText = ""
    for i = 0 to n
        aText = aText + entries(i).getDictionaryWord() + " "
    next i
    msgbox aText
End Sub
VA->SBA: Please verify.
Verified in CWS extras240
Closing issue. Note: Dictionaries in OOo 3.0 will work as extensions only (See 81365 "Allow for dictionaries to be provided as extensions")