Issue 67649

Summary: concordance files + multilingual entries; UTF-8 doesn't work
Product: Writer Reporter: uko_571 <koch>
Component: editing Assignee: AOO issues mailing list <issues>
Status: CONFIRMED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: gerry, issues, jeffooo, petko
Version: OOo 2.0.3   
Target Milestone: ---   
Hardware: PC   
OS: All   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on: 128019    
Issue Blocks:    
Attachments:
Description                   Flags
writer document               none
concordance file sdi utf-8    none

Description uko_571 2006-07-21 10:58:36 UTC
For example: among the words I need for the alphabetical index, a few are in a
different language than my document, e.g. Polish, where diacritic marks are
relevant (Michał Załęcki), but also Russian etc. That's no problem in Writer
and the document itself.
But if I create the concordance file manually and save it in
Unicode encoding (UTF-8), words like the example above aren't indexed.
And if I open the file in the 'Edit concordance file' screen, special characters
are changed and become faulty.
Also, if I create a concordance file entirely within the 'New concordance
file' screen and type the entries there, either a) with the keyboard or b) by
choosing a character from the charset map and inserting it via double click, I
only see correct characters until I click the OK button.
Everything I tried ended up in the same dilemma: terms I wanted to include in
the index aren't indexed.
To me it seems that the feature "using a concordance file" cannot handle UTF-8
in general, or doesn't handle UTF-8 properly for all characters.
Comment 1 uko_571 2006-07-21 11:01:11 UTC
Created attachment 37942 [details]
writer document
Comment 2 uko_571 2006-07-21 11:03:37 UTC
Created attachment 37943 [details]
concordance file sdi utf-8
Comment 3 michael.ruess 2006-07-21 11:29:59 UTC
Reassigned to ES.
Comment 4 grsingleton 2006-07-21 13:09:56 UTC
added myself to cc list
Comment 5 eric.savary 2006-08-28 14:36:10 UTC
I cannot reproduce the problem.
Differentiate:
a) entries in the document
b) entries in the sdi file

I see that some entries from a) are broken.
But all entries of b) are ok.
So something went wrong while generating the index (searching and marking the
entries).

To cure the problem:
- delete all entries in the doc
- reapply the sdi.

Any comments?
Comment 6 grsingleton 2006-08-28 14:47:41 UTC
For the record, your experience is similar to mine. We use a concordance file to
create the index for the User Guide. It is not complete because we lost our
indexer, but no problems such as those described in this issue have shown up.
Very strange, but worth monitoring.
Comment 7 uko_571 2006-08-29 12:22:27 UTC
ES wrote: 
To cure the problem:
- delete all entries in the doc
- reapply the sdi.

This does not solve the problem of matching text with diacritic characters in
OpenOffice against the concordance file.
Have you ever tried to create a concordance file that includes diacritic
characters? (It seems OpenOffice does not support, for example, Latin Extended-A
when creating an alphabetical index from a concordance file.)
Please create a Writer document with only two entries, Böhm and Chałasiński, to
see whether it works. That would be helpful!
Comment 8 eric.savary 2006-08-29 12:44:29 UTC
That's what I already did first using your file (and I still have no problem
creating a new doc and a new concordance file with those 2 names).
We need to find out where the problem is:
- your locales (system, document, OOo)?
- the way you create the sdi:
-> I have no problem using the UI (I copy/pasted the names from the issue into
the sdi table)
-> I have no problem exporting an edited sdi as utf-8 and loading it in the
index dialog.

So what do you do exactly?

Please also describe your system.
Comment 9 uko_571 2006-08-29 15:13:00 UTC
PC Windows XP Professional
Locales: 
System - German (default)
 	additional keyboard input locales:      
       - English-US
       - Polish
       - Russian
OpenOffice 2.0.3 - language: German
Writer document: language German (general). (I also changed the language setting
for the document only and tried English and Polish as well, without success for
the index built from the concordance file.)

For sdi-file I tried:

a) created the file in EmEditor and saved it as UTF-8 with BOM checked (also
tried with BOM unchecked)
b) created the file with the GUI in OpenOffice Writer.
   For special characters in the concordance file, using the "edit file" or
"new" option inside Writer, I
	- typed them in using the Polish keyboard or
	- right click mouse: insert > special characters or
	- copied and pasted them as you did
	Index doesn't work.

One difference between case a and b above:
If I create the concordance file in the OO Writer GUI, German umlauts are okay
(but not Polish diacritics);
if I use the external sdi file, German umlauts also become wrong characters.

If I open the concordance file for editing in the Writer GUI, the encoding of
the incorrectly recognized characters differs between a and b.
Case a - loaded external sdi file, created with EmEditor, I see: Cha?asi?ski and
Böhm;
Case b - reopened sdi file, created with the GUI within OO Writer, I see: Chałasińsk
and Böhm

What could be wrong?
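The two UTF-8 variants from case a) (with and without BOM) can be produced reproducibly without a text editor. The following is a minimal Python sketch, not something taken from the reporter's setup. The semicolon-separated line layout in `ENTRIES` is an assumption about the .sdi format that this issue does not confirm, and the function names are hypothetical.

```python
import codecs

# Assumed .sdi line layout (not confirmed anywhere in this issue):
# search term;alternative entry;1st key;2nd key;match case;word only
ENTRIES = ["Böhm;;;;0;0", "Chałasiński;;;;0;0"]


def encode_concordance(entries, with_bom):
    """Return the entries as UTF-8 bytes, optionally preceded by a BOM."""
    text = "\n".join(entries) + "\n"
    # The "utf-8-sig" codec writes the 3-byte UTF-8 BOM before the text.
    return text.encode("utf-8-sig" if with_bom else "utf-8")


def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)
```

Writing `encode_concordance(ENTRIES, True)` and `encode_concordance(ENTRIES, False)` to two files in binary mode gives one test file per variant, and `has_utf8_bom` can confirm what an editor actually saved.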
Comment 10 eric.savary 2006-10-12 12:49:15 UTC
ES->OS: as tested...
It seems to be a Windows-only problem (no problem on Linux). The concordance
dialog and/or import/export is not Unicode compliant. Though this (surprisingly)
never worked (not in OOo 1.1.5 either), it prevents people writing texts that
are not strictly ASCII from using this function. -> 2.x.
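One way to picture the symptom is a lossy conversion through a legacy Windows code page. The Python sketch below assumes Windows-1252 (`cp1252`) as the code page of a German system. That is an illustrative assumption and not a confirmed diagnosis of the OOo code, and both function names are hypothetical.

```python
def unencodable_chars(term, codepage="cp1252"):
    """Return the characters of term that the code page cannot represent."""
    bad = []
    for ch in term:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


def lossy_roundtrip(term, codepage="cp1252"):
    """Push term through the code page, replacing unmappable characters with '?'."""
    return term.encode(codepage, errors="replace").decode(codepage)
```

Under that assumption, `lossy_roundtrip("Chałasiński")` gives `Cha?asi?ski` while `Böhm` survives unchanged, because Windows-1252 contains `ö` but neither `ł` nor `ń`. With `cp1250` (the Central European code page) both names round-trip intact.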
Comment 11 pesala 2006-10-22 22:18:41 UTC
I also encountered this problem on Windows ME when attempting to cut and paste Latin
Extended-A and Latin Extended Additional characters into the concordance file dialogue.
I can edit the file in OpenOffice and type the extended characters OK, but the changes
are not saved.
Comment 12 Martin Hollmichel 2007-09-10 13:36:13 UTC
move target to 3.x according to http://wiki.services.openoffice.org/wiki/Target_3x
Comment 13 soundspaces 2010-12-04 14:32:47 UTC
I have this problem too; it is very important to solve it! Please!

Thanks a lot
Massimo
Comment 14 soundspaces 2011-01-03 16:17:45 UTC
any news?
Comment 15 Marcus 2017-05-20 11:15:41 UTC
Reset assignee to the default "issues@openoffice.apache.org".
Comment 16 oooforum (fr) 2019-01-29 10:46:16 UTC
*** Issue 128023 has been marked as a duplicate of this issue. ***
Comment 17 jeffooo 2019-01-29 16:19:26 UTC
Hi,

Reproduced with French UI

On Windows, your concordance file must be encoded in ANSI
On Linux, your concordance file must be encoded in UTF-8

It's no fun if the text document is opened on two different OSes...
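As a stopgap for the Windows/Linux split described in this comment, a UTF-8 concordance file could be re-encoded for the Windows side with a small script. This is a hedged Python sketch: "ANSI" is taken to mean the system's legacy code page, `cp1252` is only an assumed default, and the function names are hypothetical.

```python
def utf8_to_ansi(data, codepage="cp1252"):
    """Re-encode UTF-8 concordance bytes for a legacy ("ANSI") code page.

    Encoding is strict, so a term the code page cannot hold raises
    UnicodeEncodeError instead of being written silently as '?'.
    """
    text = data.decode("utf-8-sig")  # also drops a leading BOM if present
    return text.encode(codepage)


def ansi_to_utf8(data, codepage="cp1252"):
    """Re-encode legacy code page bytes as UTF-8 (without BOM)."""
    return data.decode(codepage).encode("utf-8")
```

The limits of this workaround are visible right away: `Böhm` converts to `cp1252`, but `Chałasiński` raises an error there and only fits a code page such as `cp1250`, so a single ANSI file cannot carry entries from several scripts at once.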
Comment 18 Peter 2019-02-01 06:05:12 UTC
I think this should be solved when we overhaul our string implementation. I set this bug to depend on the overhaul bug, because the two are closely related but do not have the same goal.
Maybe it makes sense to fix this before the more complex overhaul?