Issue 126863 - en_AU.dic has UTF-8 errors
Summary: en_AU.dic has UTF-8 errors
Status: CONFIRMED
Alias: None
Product: General
Classification: Code
Component: spell checking
Version: 4.2.0-dev
Hardware: All All
Importance: P5 (lowest) Normal
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-09 12:57 UTC by Tom Anderson
Modified: 2016-03-14 10:13 UTC
CC List: 4 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: 4.1.2
Developer Difficulty: ---


Attachments
en_AU - accents fixes (240.38 KB, patch)
2016-03-14 10:13 UTC, marcoagpinto

Description Tom Anderson 2016-03-09 12:57:02 UTC
With regard to the en_AU.dic file in the extension for Australian spelling: a number of spellings are corrupted. This appears to stem from an incorrect conversion to or from UTF-8 while new words were being added or the file was being edited in 2008, and the errors persist in the current version of en_AU.dic. I would fix these errors myself, but surely there is a maintainer to contact about this issue? Has the same thing happened with other dictionaries?

I see two options: delete every entry containing characters that are not used in Australian English, or change the bad characters to the correct ones. Note that the character � signals an error, not a particular character; it stands in for different letters in different words. For example, we see pi�ata (should be piñata) and clich� (should be cliché).
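Because � (U+FFFD, the Unicode replacement character) carries no information about the letter it replaced, a first step is simply to locate the affected entries. The following is a minimal sketch, not the procedure used in this issue; the function name `find_damaged_entries` and the sample entries are illustrative assumptions.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted by decoders for bytes they cannot decode


def find_damaged_entries(lines):
    """Return (line_number, entry) pairs for entries that contain U+FFFD."""
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(lines, start=1)
        if REPLACEMENT in line
    ]


# Toy sample; a real run would iterate over the lines of en_AU.dic instead.
sample = ["cliche/SM\n", "clich\ufffd/SM\n", "pi\ufffdata/S\n"]
damaged = find_damaged_entries(sample)
for number, entry in damaged:
    print(f"{number}: {entry}")
```

To run this against the real file, the lines could be read with `open("en_AU.dic", encoding="utf-8", errors="replace")`, so that any byte sequences that are not valid UTF-8 also surface as U+FFFD rather than raising an error.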

I tracked this down through various versions of en_AU.dic: http://extensions.services.openoffice.org/en/project/AustralianDictionary

Here is some analysis, by version and line number, of 2 words as they changed over time. This problem is rife in the newest version of en_AU.dic, with at least 211 occurrences of the ¿ character, which indicates a failed conversion. The word cliché, for example, is represented in different ways over time. Many words containing the � character never appeared correctly in en_AU.dic; cliché, by contrast, was originally correct and was corrupted later.

Version: 2016.03.01 (Newest)
1700: clich�/SM

Version: 2010.03.16
1700: clich�/SM

Version: 2008.11.25
1700: clich�/SM

Version: 2008.10.3
1702: cliché/MS
1703: clich�/SM

Version: 1.0.0
1523: cliché/MS
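
A version-by-version comparison like the one above can be scripted. This is a sketch under assumptions: `trace_entry` is a hypothetical helper, and the in-memory lists below are toy stand-ins for the real .dic files, so their line numbers and neighbouring entries do not match the real ones.

```python
def trace_entry(versions, stem):
    """Map each version label to the (line_number, entry) pairs starting with stem."""
    history = {}
    for label, lines in versions.items():
        history[label] = [
            (number, line.strip())
            for number, line in enumerate(lines, start=1)
            if line.startswith(stem)
        ]
    return history


# Toy excerpts keyed by version label; only the "clich" entries mirror the issue.
versions = {
    "1.0.0": ["click/SM", "cliché/MS", "client/SM"],
    "2008.10.3": ["click/SM", "cliché/MS", "clich\ufffd/SM", "client/SM"],
    "2008.11.25": ["click/SM", "clich\ufffd/SM", "client/SM"],
}
history = trace_entry(versions, "clich")
for label, hits in history.items():
    print(label, hits)
```

Searching by the ASCII stem ("clich") rather than the full word finds both the intact and the corrupted spelling in the same pass.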

With reference to files at:
http://extensions.services.openoffice.org/en/project/english-dictionaries-apache-openoffice
http://extensions.services.openoffice.org/en/project/AustralianDictionary
Comment 1 Marcus 2016-03-09 19:39:36 UTC
@Marco:
Please can you have a look? Maybe you can explain and solve the issue. Thanks.
Comment 2 marcoagpinto 2016-03-09 22:33:48 UTC
I have opened the .AFF + .DIC and I confirm that there are corrupted words.

The person who converted the .DIC to UTF-8 probably did something wrong in the procedure.

I am the maintainer of the English dictionaries but I only add words to the British one.

For the other dictionaries, I only package them into the monthly OXT.

@Tom, is there any chance you could fix the words since you know what to search for?

If you do it, please also update the .AFF and README with your name and state that you fixed the issue.

Then ZIP the files and upload them here, and the fix will reach everyone in the monthly update.

Thanks!

:-)
Comment 3 Andrea Pescetti 2016-03-13 09:34:22 UTC
About 200 words need to be fixed. The following search lists all "suspicious" entries, each with its line number in the en_AU.dic file. Some smart find/replace can probably fix the file easily.

$ grep -n -v -e "^[a-zA-Z0-9/'-\. \!]*$" en_AU.dic 
1104:bour�e
1110:boutonni�re/SM
1626:ch�telaine/SM
1700:clich�/SM
1701:clich�d
1713:cloisonn�/M
1882:Concepci�n/M
2083:coul�e/SM
2202:cr�pe/SM
2468:derri�re/S
2533:diamant�
2604:discoth�que/SM
2647:d�mod�
2721:d�pays�e
2761:D�sseldorf
2762:d'�tre
3010:entrec�te/SM
3011:entrep�t/S
3786:glac�/DGS
4919:kinderg�rtner/SM
5209:litt�rateur/S
5331:macram�/MS
5458:matin�e/S
5461:ma�tre
5652:�migr�/S
5730:m�nage
5985:n�e
6047:n�glig�
6196:Noun�a
6759:pi�on/S
7123:pur�eing
7480:r�gime/SM
7540:r�le/MS
8002:shouldâve
8190:S�o
8208:soign�
8209:soir�e/SM
8731:Tannh�user/M
9119:t�te-b�che
9120:t�te-�-t�te
10250:appliqu�/SMG
10251:appliqu�d
10381:attach�/MS
10808:Bogot�/M
11388:ch�teau/MS
11445:�clat/M
11660:confr�re/MS
11819:cort�ge/SM
11896:cr�che/MS
11940:crudit�s
12044:Dana�
12075:d�colletage/S
12076:d�coupage
12391:divorc�
12392:divorc�e/SM
12472:d�pays�
12477:d�railleur/SM
12505:d�tente/S
12924:expos�/SM
12954:fa�ade/MS
13301:Fran�oise/M
13687:Gr�newald/M
13702:G�teborg/M
14541:jalape�o/S
15118:lyc�e
15228:manqu�/M
15294:mat�riel/MS
15527:m�l�e/SM
15528:m�moire
15663:m�tier/S
15784:na�vety/S
16016:Noum�a/M
16370:pass�/M
16824:premi�re/DMGS
16834:p�res/F
16995:pur�e/DSM
17101:raison d'�tre
17465:r�sum�/S
17491:R�union/M
17801:se�orita/SM
18913:touch�
19393:voil�
19772:abb�/S
19872:adi�s
20700:blas�
21017:caf�/SM
21838:cr�pey
21978:d�but/S
21980:d�collet�
22316:d�nouement
22492:Dvor�k/M
22929:Faberg�/M
23037:fianc�/MS
23275:Fran�ois
23589:gr�ce
23932:H�loise/M
24445:jardini�re/MS
24494:Jos�/M
24703:lam�
24717:�lan/M
24952:Lom�/M
25096:Mallarm�/M
25383:mightâve
25709:naivet�/MS
25745:na�vet�/S
26162:outr�
27135:recherch�
27421:ros�
27443:rou�/SM
27541:s�ance/MS
27776:se�ora/SM
27777:Se�ora/M
28142:soup�on/SM
28900:Tom�/M
29414:vis-�-vis
30136:anim�
30291:ar�te/MS
30341:Asunci�n/M
30549:Bart�k/M
30579:b�che
30854:boucl�
30935:bric-�-brac
31646:comp�re/M
31723:consomm�/S
32070:d�class�
32071:d�class�e
32073:d�cor/MS
32369:d�j�
32406:doppelg�nger
32633:Elys�e/M
32762:Esterh�zy/M
32920:fa�ence/S
33018:fianc�e/MS
33269:frapp�
33422:G�del/M
33484:Gew�rztraminer
33731:habitu�/SM
34305:ing�nue/S
35087:Lumi�re/M
35494:M�nchhausen/M
35719:na�veness
36676:porti�re/SM
36721:pr�cis/dMS
36869:prot�g�/SM
36870:prot�g�e/S
36903:p�t�/M
37244:rep�chage
37552:saut�/GSD
37596:Schr�dinger/M
37754:se�ores
38976:T�rshavn/M
39263:Vel�squez/M
39308:vicu�a/S
39889:aide-m�moire
40658:blowhole
40864:b�te/S
40865:b�tise
41015:canap�/S
41322:�clair/MS
41360:client�le/M
41490:communiqu�/SM
41741:coup�/SM
41840:cro�ton/SM
41853:C�te
41960:d�b�cle/MS
41962:d�butante/MS
41963:d�collet�e
42395:d�shabill�'s
42678:entr�e/S
43040:f�hn
43041:f�hrer/SM
43108:flamb�/DSG
43315:f�te/SM
43385:�gar
43397:gar�on/MS
43668:Gruy�re
44086:howâd
45140:ma�ana/M
45216:man�ge/GDS
45346:M�bius
45604:m�lange
45803:n�
45854:na�ve/Y
45881:neglig�e/SM
46200:ol�
46501:pass�e
46688:pi�ata/S
47063:Proven�al
47489:risqu�
47872:se�or/M
48226:souffl�/SM
49086:t�te
49797:Yaound�/M
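
For anyone without grep, roughly the same search can be sketched in Python. The allow-list below (ASCII letters, digits, `/`, `'`, `-`, `.`, space and `!`) is an assumption about what the grep pattern was meant to accept, not a byte-for-byte equivalent, and `suspicious_entries` is an illustrative name.

```python
import re

# Characters an entry may contain without being flagged.
# The "-" is placed last so it is taken literally, not as a range.
ALLOWED = re.compile(r"[A-Za-z0-9/'. !-]*")


def suspicious_entries(lines):
    """Return (line_number, entry) pairs containing any character off the allow-list."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\r\n")
        if not ALLOWED.fullmatch(entry):
            flagged.append((number, entry))
    return flagged


# Toy sample; a real run would iterate over the lines of en_AU.dic.
sample = ["raison d'etre", "clich\ufffd/SM", "t\ufffdte-b\ufffdche", "vis-a-vis", "café/SM"]
flagged = suspicious_entries(sample)
for number, entry in flagged:
    print(f"{number}:{entry}")
```

Like the grep, this flags every entry with a non-ASCII character, so correctly accented words such as café/SM are listed too. After a repair, the output is a review list rather than an error list.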
Comment 4 marcoagpinto 2016-03-14 10:13:13 UTC
Created attachment 85357
en_AU - accents fixes

Here is the fixed .DIC.

I auto-replaced all corrupted characters with an "é" and then had to check the entire .DIC, because over 90% of the resulting é's should have been other accented characters.

It seems that the corrupted symbol was the same in every case.

Please tell me if you find any invalid words with accents.
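
Since the same corrupted symbol stands in for many different accented letters, one way to reduce the manual checking is to match each damaged word against a list of correctly spelled words. This is only a sketch of that idea, not how the attached fix was produced. The names `candidates` and `repair_entry` and the small `reference` list are assumptions, and the reference words are assumed to use one code point per accented letter (NFC form).

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD, the symbol shown as � in the corrupted entries


def candidates(damaged_word, reference_words):
    """Return reference words matching damaged_word, reading each U+FFFD
    as exactly one unknown non-ASCII character."""
    parts = [re.escape(part) for part in damaged_word.split(REPLACEMENT)]
    pattern = re.compile(r"[^\x00-\x7f]".join(parts))
    return [word for word in reference_words if pattern.fullmatch(word)]


def repair_entry(entry, reference_words):
    """Return the repaired entry if exactly one candidate matches, else None."""
    word, separator, flags = entry.partition("/")  # keep the affix flags after "/"
    matches = candidates(word, reference_words)
    if len(matches) == 1:
        return matches[0] + separator + flags
    return None  # no match or an ambiguous one: leave for manual review


# Hypothetical reference list; a real one would come from a trusted word list.
reference = ["cliché", "piñata", "déjà", "tête-à-tête"]
print(repair_entry("clich\ufffd/SM", reference))
print(repair_entry("voil\ufffd", reference))
```

Entries that return `None` still need a human decision, which keeps the automatic step from guessing an accent it cannot know.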