Apache OpenOffice (AOO) Bugzilla – Issue 126863
en_AU.dic has UTF-8 errors
Last modified: 2023-08-18 14:40:54 UTC
Regarding the en_AU.dic extension for Australian spelling: a number of spellings are corrupted. This appears to have been caused by incorrect conversion to or from UTF-8 when words were added or edited in 2008, and the errors persist in the current version of en_AU.dic. I would fix these errors myself, but surely there is a maintainer to contact about this issue? Has it occurred with other dictionaries?

I see two options: delete all entries containing characters that are not Australian English, or change all the bad characters to good ones. Note that the character � indicates an error, not a particular character. In other words, we see variants such as pi�ata (should be piñata) and clich� (should be cliché).

I tracked this down through various versions of en_AU.dic:
http://extensions.services.openoffice.org/en/project/AustralianDictionary

Here is some analysis of version and line numbers of 2 words as they changed over time. The problem is rife in the newest version of en_AU.dic, with at least 211 occurrences of the ¿ character, which indicates a failed conversion. The word cliché, for example, is misrepresented in different ways over time. Note that many words with the � character in en_AU.dic never appeared correctly, whereas cliché was originally correct and was corrupted later.

Version: 2016.03.01 (Newest)
1700: clich�/SM
Version: 2010.03.16
1700: clich�/SM
Version: 2008.11.25
1700: clich�/SM
Version: 2008.10.3
1702: cliché/MS
1703: clich�/SM
Version: 1.0.0
1523: cliché/MS

With reference to files at:
http://extensions.services.openoffice.org/en/project/english-dictionaries-apache-openoffice
http://extensions.services.openoffice.org/en/project/AustralianDictionary
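The point that � marks an error rather than a particular letter can be illustrated with a small Python sketch. It assumes one plausible mechanism that the report does not confirm: the words were stored as ISO-8859-1 (Latin-1) bytes and then decoded as UTF-8 with invalid bytes replaced. The function name `misdecode` is hypothetical.

```python
# Sketch of one plausible corruption path (an assumption, not confirmed by
# the report): Latin-1 bytes decoded as UTF-8, with invalid bytes replaced.
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, displayed as �


def misdecode(word: str) -> str:
    """Encode as ISO-8859-1, then decode as UTF-8, replacing invalid bytes."""
    return word.encode("latin-1").decode("utf-8", errors="replace")


for word in ["cliché", "piñata"]:
    # Both é and ñ collapse to the same U+FFFD, so the original letter
    # cannot be recovered from the corrupted text alone.
    print(word, "->", misdecode(word))
```

Because different accented letters all end up as the same replacement character, a blanket search-and-replace cannot restore them; each entry has to be checked against the intended word.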
@Marco: Please can you have a look? Maybe you can explain and solve the issue. Thanks.
I have opened the .AFF + .DIC and I confirm that there are corrupted words. The person who converted the .DIC to UTF-8 probably did something wrong in the procedure. I am the maintainer of the English dictionaries but I only add words to the British one. The other dictionaries are only packed by me in the monthly OXT. @Tom, is there any chance you could fix the words since you know what to search for? If you do it, please also update the .AFF and README with your name and state that you fixed the issue. Then, ZIP and upload the files here and in the monthly update it will be fixed for everyone. Thanks! :-)
About 200 words need to be fixed. The following search lists all "suspicious" entries, each with its line number in the en_AU.dic file. It is likely that some smart find/replace can fix the file easily.

$ grep -n -v -e "^[a-zA-Z0-9/'-\. \!]*$" en_AU.dic
1104:bour�e 1110:boutonni�re/SM 1626:ch�telaine/SM 1700:clich�/SM 1701:clich�d 1713:cloisonn�/M 1882:Concepci�n/M 2083:coul�e/SM 2202:cr�pe/SM 2468:derri�re/S 2533:diamant� 2604:discoth�que/SM 2647:d�mod� 2721:d�pays�e 2761:D�sseldorf 2762:d'�tre 3010:entrec�te/SM 3011:entrep�t/S 3786:glac�/DGS 4919:kinderg�rtner/SM 5209:litt�rateur/S 5331:macram�/MS 5458:matin�e/S 5461:ma�tre 5652:�migr�/S 5730:m�nage 5985:n�e 6047:n�glig� 6196:Noun�a 6759:pi�on/S 7123:pur�eing 7480:r�gime/SM 7540:r�le/MS 8002:shouldâve 8190:S�o 8208:soign� 8209:soir�e/SM 8731:Tannh�user/M 9119:t�te-b�che 9120:t�te-�-t�te 10250:appliqu�/SMG 10251:appliqu�d 10381:attach�/MS 10808:Bogot�/M 11388:ch�teau/MS 11445:�clat/M 11660:confr�re/MS 11819:cort�ge/SM 11896:cr�che/MS 11940:crudit�s 12044:Dana� 12075:d�colletage/S 12076:d�coupage 12391:divorc� 12392:divorc�e/SM 12472:d�pays� 12477:d�railleur/SM 12505:d�tente/S 12924:expos�/SM 12954:fa�ade/MS 13301:Fran�oise/M 13687:Gr�newald/M 13702:G�teborg/M 14541:jalape�o/S 15118:lyc�e 15228:manqu�/M 15294:mat�riel/MS 15527:m�l�e/SM 15528:m�moire 15663:m�tier/S 15784:na�vety/S 16016:Noum�a/M 16370:pass�/M 16824:premi�re/DMGS 16834:p�res/F 16995:pur�e/DSM 17101:raison d'�tre 17465:r�sum�/S 17491:R�union/M 17801:se�orita/SM 18913:touch� 19393:voil� 19772:abb�/S 19872:adi�s 20700:blas� 21017:caf�/SM 21838:cr�pey 21978:d�but/S 21980:d�collet� 22316:d�nouement 22492:Dvor�k/M 22929:Faberg�/M 23037:fianc�/MS 23275:Fran�ois 23589:gr�ce 23932:H�loise/M 24445:jardini�re/MS 24494:Jos�/M 24703:lam� 24717:�lan/M 24952:Lom�/M 25096:Mallarm�/M 25383:mightâve 25709:naivet�/MS 25745:na�vet�/S 26162:outr� 27135:recherch� 27421:ros� 27443:rou�/SM 27541:s�ance/MS 27776:se�ora/SM 27777:Se�ora/M 28142:soup�on/SM 28900:Tom�/M
29414:vis-�-vis 30136:anim� 30291:ar�te/MS 30341:Asunci�n/M 30549:Bart�k/M 30579:b�che 30854:boucl� 30935:bric-�-brac 31646:comp�re/M 31723:consomm�/S 32070:d�class� 32071:d�class�e 32073:d�cor/MS 32369:d�j� 32406:doppelg�nger 32633:Elys�e/M 32762:Esterh�zy/M 32920:fa�ence/S 33018:fianc�e/MS 33269:frapp� 33422:G�del/M 33484:Gew�rztraminer 33731:habitu�/SM 34305:ing�nue/S 35087:Lumi�re/M 35494:M�nchhausen/M 35719:na�veness 36676:porti�re/SM 36721:pr�cis/dMS 36869:prot�g�/SM 36870:prot�g�e/S 36903:p�t�/M 37244:rep�chage 37552:saut�/GSD 37596:Schr�dinger/M 37754:se�ores 38976:T�rshavn/M 39263:Vel�squez/M 39308:vicu�a/S 39889:aide-m�moire 40658:blowhole 40864:b�te/S 40865:b�tise 41015:canap�/S 41322:�clair/MS 41360:client�le/M 41490:communiqu�/SM 41741:coup�/SM 41840:cro�ton/SM 41853:C�te 41960:d�b�cle/MS 41962:d�butante/MS 41963:d�collet�e 42395:d�shabill�'s 42678:entr�e/S 43040:f�hn 43041:f�hrer/SM 43108:flamb�/DSG 43315:f�te/SM 43385:�gar 43397:gar�on/MS 43668:Gruy�re 44086:howâd 45140:ma�ana/M 45216:man�ge/GDS 45346:M�bius 45604:m�lange 45803:n� 45854:na�ve/Y 45881:neglig�e/SM 46200:ol� 46501:pass�e 46688:pi�ata/S 47063:Proven�al 47489:risqu� 47872:se�or/M 48226:souffl�/SM 49086:t�te 49797:Yaound�/M
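A rough Python counterpart to the search above, as a minimal sketch. It uses a simpler criterion than the grep character class (any character outside ASCII marks an entry as suspicious), so its results may differ slightly from the list shown. The function name `suspicious_entries` is hypothetical.

```python
from typing import Iterable, List, Tuple


def suspicious_entries(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, entry) pairs for entries with non-ASCII characters.

    Line numbers start at 1, matching the output of `grep -n`.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\n")
        if not entry.isascii():  # str.isascii() requires Python 3.7+
            hits.append((number, entry))
    return hits


# Possible usage (assumes en_AU.dic is in the current directory):
# with open("en_AU.dic", encoding="utf-8", errors="replace") as handle:
#     for number, entry in suspicious_entries(handle):
#         print(f"{number}:{entry}")
```

Opening the file with `errors="replace"` means any undecodable bytes show up as U+FFFD and are flagged as well.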
Created attachment 85357: en_AU - accents fixes

Here is the fixed .DIC. I auto-replaced all corrupted characters with an "é" and then had to check the entire .DIC, because over 90% of the resulting é's should have been other accented characters. It seems that every accented character had been corrupted to the same symbol. Please tell me if you find any invalid words with accents.
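Since a blanket replacement with "é" is wrong for most entries, an alternative is a hand-verified lookup table from corrupted stems to corrected ones. The sketch below is not the procedure used for the attachment. The function name `apply_fixes` and the table are hypothetical, and the two sample corrections come from the original report.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, displayed as �


def apply_fixes(entries, fixes):
    """Apply known corrections; collect corrupted entries with no known fix.

    Each entry is "stem" or "stem/FLAGS"; only the stem is corrected, so the
    affix flags after the first "/" are preserved unchanged.
    """
    fixed, unresolved = [], []
    for entry in entries:
        stem, sep, flags = entry.partition("/")
        if REPLACEMENT in stem:
            if stem in fixes:
                stem = fixes[stem]
            else:
                unresolved.append(entry)  # needs a human decision
        fixed.append(stem + sep + flags)
    return fixed, unresolved


# Hypothetical table; a real one would need an entry per corrupted word.
fixes = {"clich\ufffd": "cliché", "pi\ufffdata": "piñata"}
fixed, unresolved = apply_fixes(["clich\ufffd/SM", "pi\ufffdata/S", "voil\ufffd"], fixes)
print(fixed)
print(unresolved)
```

Entries that remain in `unresolved` still contain the replacement character and can be reviewed by hand instead of being guessed.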
Are these fixes in the latest en_AU dictionary? If yes, can we close this issue?
I assume this is fixed, at least I couldn't find these errors in the latest en_AU dictionary... If you disagree, feel free to reopen.
(In reply to Matthias Seidel from comment #6)
> I assume this is fixed, at least I couldn't find these errors in the latest
> en_AU dictionary...
>
> If you disagree, feel free to reopen.

I also checked, and the errors are no longer there. All the words that had corrupted accents are now written without any accents; for example, "cliché/MS" is now "cliche/MDS". That should work just fine.
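The thread states only the outcome (entries now appear without accents), not how it was produced. As an illustration of what such unaccented forms look like, here is a minimal sketch that strips accents from correctly encoded words using Unicode decomposition. The function name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("cliché/MS"))
print(strip_accents("piñata/S"))
```

This only works on text that is still correctly encoded; a word that already contains � has lost the accented letter and cannot be converted this way.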
Hi, thank you for the confirmation. I want to close some old issues here. ;-)