18024 – Direction of weak characters: A new method for dealing with text direction without using keyboard layout

Status:	CLOSED IRREPRODUCIBLE

Alias:	None

Product:	Internationalization
Classification:	Code
Component:	BiDi
Version:	OOo 1.1 RC2
Hardware:	PC All

Importance:	P3 Trivial with 69 votes
Target Milestone:	---
Assignee:	frank.meies
QA Contact:	issues@l10n

URL:
Keywords:

Duplicates (11):	14590 20688 21887 25548 27618 31149 33854 61016 79777 81501 81662
Depends on:
Blocks:	19012 19848

Reported:	2003-08-08 13:19 UTC by mehlng
Modified:	2013-08-07 15:00 UTC
CC List:	10 users

See Also:
Issue Type:	ENHANCEMENT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
problem file which IME would not solve (102.50 KB, application/octet-stream) 2003-12-03 08:57 UTC, sforbes
LRM/RML macros in wizards/source/tools (1.36 KB, text/plain) 2005-11-13 13:59 UTC, alan
patch to script.xlb (841 bytes, patch) 2005-11-13 14:01 UTC, alan

Description mehlng 2003-08-08 13:19:25 UTC

AFAIK, the main problem that keyboard layout detection is meant to solve is the f(a)
and english_word+. problem. I'll demonstrate:
the expression FUNCTION f(X) should be rendered as
f(X) NOITCNUF
and not
)f(X NOITCNUF
On the other hand, the expression SHALOM mumi. should render the other way around, as:
.mumi MOLAHS
and not:
mumi. MOLAHS
This holds almost always, as you almost never need to place a dot at the
end of an embedded English word.
Solving the problem the way MS Word did (keyboard layout) seems like a bad idea to me:
1) It is unintuitive. Setting things the way you want them, and taking care
of the directionality of each space and punctuation mark (MS Word has Hebrew spaces and English
spaces), is hell on earth.
2) It does not always work as intended, so we won't get much benefit from using it
(it solves the f(x) problem, but when a sentence ends with englishword+dot,
the dot will not automatically jump to the real end of the sentence), so it's not
really much of a gain.
3) It adds extra complexity, since the whole BiDi engine would have to be altered.

I suggest a unique solution based on the already existing LRM/RLM signs.

The solution is very simple: define a macro that causes OOo to insert an
LRM[1] sign automatically in the following condition:
the user writes a word of the form English_Letter* ')'. We should think of more
characters like ')' after which we want to insert the LRM sign, but the idea is simple.
Its advantages:
1) It is almost comparable to MS Word's solution, and could imitate it almost exactly.
2) It is highly customizable, and seems to solve the problem completely in most cases.
3) Even in the unlikely event that the user wants, say, f(x) to
render as:
)f(x MOLAHS
he can do that very easily, with no exchange of characters: a single hit on the
backspace key deletes the LRM and solves the problem.

Please consider my idea. I think it will do only good and save us from an
unnecessary implementation of an MS Word imitation, which is not necessarily
better than my solution.
Please answer me.
CC'd to the openoffice-hebrew mailing list at openoffice.org.il


[1] I'm not sure LRM is the correct sign. The idea is a sign $ that causes
SHALOM f(x)$ to be rendered as:
f(x) MOLAHS
where the sign $ (LRM or RLM) is invisible, of course.
Caps stand for Hebrew characters, as in the usual notation.
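
To see why a trailing mark helps, a minimal Python sketch using only the standard unicodedata module can list the bidirectional classes involved. The closing parenthesis is a neutral character (class ON), so its side is decided by its neighbours; an LRM placed after it is a strong left-to-right character, which leaves the parenthesis between two left-to-right characters. The Hebrew sample string and the helper name bidi_classes are assumptions for illustration, and the sketch only inspects character classes; it does not run the Bidi algorithm itself.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (decimal 8206)
RLM = "\u200f"  # RIGHT-TO-LEFT MARK (decimal 8207)


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Hypothetical sample: a Hebrew word followed by an English expression.
shalom = "\u05e9\u05dc\u05d5\u05dd"
plain = shalom + " f(x)"
marked = plain + LRM

print(bidi_classes("f(x)"))       # ['L', 'ON', 'L', 'ON']
print(bidi_classes(marked)[-2:])  # ['ON', 'L']
```

Without the mark, the final ")" sits between a left-to-right letter and the end of a right-to-left paragraph, which is the situation the report describes.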

Comment 1 frank.meies 2003-08-11 14:35:41 UTC

Hi,

thank you for your suggestions. Your description of the problem is
absolutely correct. Inserting control characters (like LRE, RLE, PDF,
...) is one way to solve this problem, the other way would be to use
character attributes to define the directionality of characters. From
my point of view, it would be easier to use the control characters,
because we'll only have to pass the string (containing the control
characters) to the BiDi algorithm and everything works fine. On the
other hand, using a character attribute would be the 'correct' way to
solve the problem, without entering hidden control characters to the
paragraph.
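
As a rough illustration of the control-character approach mentioned here, the following Python sketch wraps a string in an explicit embedding (LRE or RLE, closed by PDF) before it would be handed to a Bidi implementation. The function names embed and strip_controls are hypothetical and are not OOo or ICU APIs; only the code points themselves come from Unicode.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Wrap text in an explicit embedding; direction is 'L' or 'R'."""
    if direction not in ("L", "R"):
        raise ValueError("direction must be 'L' or 'R'")
    opener = LRE if direction == "L" else RLE
    return opener + text + PDF


def strip_controls(text):
    """Remove the embedding controls again, e.g. before plain-text export."""
    return "".join(ch for ch in text if ch not in (LRE, RLE, PDF))


wrapped = embed("f(x)", "L")
print([hex(ord(ch)) for ch in wrapped])
```

The trade-off raised in the comment is visible even in this toy version: the controls become part of the stored string, so every consumer of the text has to know about them.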

I'll forward this issue to the user experience team, so they should
have a look.

Comment 2 mehlng 2003-08-12 08:49:31 UTC

Hi, thanks for your comment.
Again, and this is the point I'd like to emphasize, I think
character attributes (CA) are NOT a "correct" solution to the problem.
As I see it, there are two ways to define a correct solution.
The first defines a correct solution as one that always works
well and as intended: there are no extreme cases that prevent it
from working, the code is easily portable and extremely maintainable,
etc. By this definition I believe character attributes are NOT the
"correct" solution. Leaving aside the fact that both solutions offer the
same functionality, the CA version is hell on earth to manage. Ever
tried to paste C source into Hebrew MS Office? What about an English
document pasted into Hebrew MS Office? It now has all its dots
reversed, and how can the naive user fix it? (Clue: the solution is to
replace all "Hebrew" spaces with "English" spaces, one of the toughest
missions I've ever seen.) The CA method does not place the
end-of-sentence dot correctly; the LRM/RLM method does. The CA method will not
keep your text layout between OOo and other applications; the LRM/RLM method
does. The CA version, as said before, is very hard and unintuitive to
override; the LRM/RLM method just requires a backspace after the word that needs
overriding. I think that if this is how one defines the "correct" way,
then surely the LRM/RLM method is the better one.
The other definition of a "correct" method is based on standardization and
intuitiveness (and mathematical truthfulness, but that's not really
relevant here). For example, using the filesystem for IPC instead of the
normal IPC tools (KDE DCOP, for instance) might be easier to implement
and maintain, but surely IPC tools are the correct, standard way to
handle inter-process communication. Even by this definition, I believe the
LRM/RLM method should be considered as well. The LRM/RLM method, as opposed to
the CA method, is standardized by Unicode, a system-wide standard. It
is just as intuitive as the CA method (no one thinks about which
language he is writing his punctuation in; this is not a rational way to
handle text). I therefore think that the LRM/RLM method has no
disadvantages compared to the CA method here either.
The bottom line: I can't see any reason to implement complex
libraries that will eventually provide us with nothing more than we
can achieve without this extra complexity.

Please, when discussing with the User Experience team, make sure there
are some Hebrew speakers and, more importantly, USERS. Having a declared
extremely good experience but very poor usability is not the way to go, if
you ask me. Also, if you can allow me to share my view in the
User Experience group's discussion, I'll be very glad. Thanks again.

Comment 3 frank.meies 2003-08-15 07:20:40 UTC

Comment 4 frank.meies 2003-08-15 08:17:05 UTC

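FME: As long as there is no solution for this problem, one can use
simple macros to insert an RLM or LRM at the current cursor position:

' Insert U+200E LEFT-TO-RIGHT MARK (decimal 8206) at the cursor,
' replacing any selected text.
Sub InsertLRM
    xSel = ThisComponent.CurrentController.getSelection()
    xRange = xSel(0)
    xRange.setString(Chr$(8206))
End Sub

' Insert U+200F RIGHT-TO-LEFT MARK (decimal 8207) at the cursor,
' replacing any selected text.
Sub InsertRLM
    xSel = ThisComponent.CurrentController.getSelection()
    xRange = xSel(0)
    xRange.setString(Chr$(8207))
End Sub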

Comment 5 frank.meies 2003-08-15 08:49:26 UTC

*** Issue 14590 has been marked as a duplicate of this issue. ***

Comment 6 frank.meies 2003-08-15 08:53:05 UTC

FME: Added 'Direction of weak characters' to title.

Comment 7 mehlng 2003-08-15 10:36:52 UTC

*** Issue 16247 has been marked as a duplicate of this issue. ***

Comment 8 falko.tesch 2003-08-19 15:31:56 UTC

UE speaking:
Even though I am not a native Hebrew and/or Arabic speaker/writer, I
understand your concern about changing text directions.
We have already discussed various approaches to this issue but couldn't
come to a solution.
Neither MS Word nor other office word processors have a special
implementation for this (at least I couldn't find any).
Can you please help me find a reasonable, user-friendly
UI/function to address this problem? Thx.

Comment 9 mehlng 2003-08-19 15:53:38 UTC

mehlng->ft:
I believe my proposal is pretty concise and is described here. I'll
try to give Python pseudo-code that solves the problem:
=============cut here===========================
# getchar(), text_direction(), paragraph_direction() and
# print_before_input() are placeholders, not real functions.
chars_usually_ending_sentence = [')', ']', '}', '>']
lastchar = None
while (c := getchar()):
    if c == ' ' and text_direction() != paragraph_direction():
        if lastchar in chars_usually_ending_sentence:
            # LRM or RLM, matching the direction of the text just typed
            print_before_input(mark)
    lastchar = c
=============ends here=============================
This should solve the problem almost completely. In addition, a good way
to handle the RLM sign graphically would be nice (i.e., when the cursor is
after an RLM sign, an explanation would appear, and deleting the sign
would automatically delete the character before it).
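
A runnable version of this heuristic might look like the following Python sketch. It assumes that the direction of the current text can be approximated by the last strong character typed, and that the mark to insert matches that direction (LRM after left-to-right text in a right-to-left paragraph, RLM in the opposite case). The names type_text, strong_direction and CLOSERS are hypothetical; this is not how OOo processes keyboard input.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
CLOSERS = {")", "]", "}", ">"}


def strong_direction(ch):
    """Return 'L' or 'R' for strong characters, None for weak/neutral ones."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "L"
    if cls in ("R", "AL"):
        return "R"
    return None


def type_text(keys, paragraph_dir="R"):
    """Simulate typing keys; insert a mark before a space when needed."""
    out = []
    run_dir = paragraph_dir  # direction of the last strong character typed
    for ch in keys:
        if (ch == " " and out and out[-1] in CLOSERS
                and run_dir != paragraph_dir):
            # Pin the closing bracket to the embedded run's direction.
            out.append(LRM if run_dir == "L" else RLM)
        out.append(ch)
        direction = strong_direction(ch)
        if direction:
            run_dir = direction
    return "".join(out)


shalom = "\u05e9\u05dc\u05d5\u05dd"
result = type_text(shalom + " f(x) " + shalom)
print(LRM in result)  # True
```

Like the pseudo-code, the sketch only reacts to a space, so a closing bracket at the very end of the input gets no mark until the user types on.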

Please contact me, or shachar (shemes.biz) or Gilad, or eli Marmor.
I'll be interested to explain this on the phone.
Do contact mehlng@yahoo.com I'm very eager to discuss this issue.

Comment 10 mehlng 2003-08-21 00:28:38 UTC

One last word about LRM handling (which is especially vital if we
intend to add them regularly in OOo):

To avoid confusing the naive user, the RLM *must* be
hard-linked to the character before it: it disappears when that
character is deleted (in any way). Otherwise it will remain unnoticed in
the text and rear its ugly head with plenty of unexplained errors.
The only, HIGHLY UNLIKELY, problem that might arise is if the ')' is moved
to a different place and meant to be used as a Hebrew '(' sign.
A demonstration of the problem that might arise ($ stands for the invisible RLM sign):
Current text:
MOLAHS TIRVI
The user adds English with parentheses, which causes the RLM sign to be
inserted automatically:
english (text)$ MOLAHS TIRVI
The user deletes all the English text except the parentheses:
) MOLAHS TIRVI
The user now continues to write HEBREW in parentheses:
(MIARGOS)$ MOLAHS TIRVI
A problem can now arise.
However, the MS Word approach won't solve this issue either (!): a ')'-LRM is just
like a Hebrew-type parenthesis in
MS Word, so we haven't caused anything that can't happen in MS Word.
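
The "hard-linked" behaviour described here can be sketched in a few lines of Python: deleting a character also removes a direction mark that immediately follows it, so no orphaned mark stays behind. The function name delete_char and the plain-string document model are assumptions for illustration only.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
MARKS = {LRM, RLM}


def delete_char(text, index):
    """Delete text[index]; also delete a direction mark directly after it."""
    end = index + 1
    if end < len(text) and text[end] in MARKS:
        end += 1  # the mark is linked to the deleted character
    return text[:index] + text[end:]


doc = "english (text)" + RLM + " X"
# Delete the closing parenthesis; its mark goes with it.
shorter = delete_char(doc, doc.index(")"))
print(RLM in shorter)  # False
```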

Comment 11 sforbes 2003-10-08 20:31:48 UTC

issue #19848 is related

Comment 12 insount 2003-10-18 15:19:52 UTC

Bug 21019 is NOT a duplicate of this bug. That also discusses handling
of imported/legacy texts, as opposed to text entry.

Comment 13 insount 2003-10-18 15:58:26 UTC

Above comment posted to wrong issue; sorry for the spam.

Comment 14 sforbes 2003-11-02 12:28:04 UTC

*** Issue 21887 has been marked as a duplicate of this issue. ***

Comment 15 sforbes 2003-11-02 12:29:51 UTC

from issue #21887 (marked as dup of this one):
"1. When typing a Hebrew text (direction right to left) ending with an English
word, followed by a Hebrew ":" (on the left of it), and then typing an English
text, the English word ending the Hebrew text jumps left. E.g.:
when typing (from right to left)
   "english word 2"   <  "a hebrew :" <  "english word 1"  < "hebrew"

one gets:
   "english word 1""a hebrew :""english word 2"  < "hebrew"

2. When writing a Hebrew doc (direction right to left) and inserting an English
text starting with a number, the number jumps over to the right side (as if it
were still Hebrew). E.g.:
when typing (right to left)

        "number" > "english text"   (changing to english)  <   "hebrew"

one gets:
                   "english"     "number"     "hebrew"
"

Comment 16 falko.tesch 2003-11-17 15:10:52 UTC

FT: We discussed possible solutions here at Star Office.
This is what we came up with:
- For OO.o running under Windows we will make use of our new feature
for reading out the IME.
Once we detect the IME inputting an RTL language, we will hint the ICU
to determine the correct text direction (RTL in this case) for neutral
and weak characters.
- For OO.o running under Unix systems we cannot change anything yet,
since Unix IMEs do not report their current language setting.
Therefore we must still rely on the existing logic coming from
the ICU.
As soon as there are Unix IMEs that report their language, we will
support them the same way we will for Windows.

We are strongly opposed to implementing _any_ UI to work around the Unix
flaws. Reason:
If we implemented some UI and eventually some, but not all, IMEs
supported language reporting, we would have a redundant (and possibly
competing) system: UI and automatism.
This would confuse the user rather than help him.

Comment 17 frank.meies 2003-11-25 07:58:01 UTC

Comment 18 sforbes 2003-11-26 09:44:46 UTC

two points:

* Are we going to leave Linux, Mac, and Windows users who don't have a version of
Windows that supports IME out in the cold?

* IME will only help for new text, correct? What are we going to do about existing
text?

Comment 19 frank.meies 2003-11-26 09:49:17 UTC

Added Falko to Cc.

Comment 20 sforbes 2003-12-03 08:57:06 UTC

Created attachment 11713
problem file which IME would not solve

Comment 21 sforbes 2003-12-03 08:59:15 UTC

See the file I just attached- compare the highlighted paragraphs with
the original word display.

How would using IME solve the problem of the location changing of the
mathematical/roman characters?

Comment 22 sforbes 2003-12-03 11:00:27 UTC

Dina: tkos input would be welcome on this issue

Comment 23 frank.meies 2003-12-04 08:27:27 UTC

FME->sforbes: As far as I can see from your bugdoc, there are problems
in two different cases:

1. case: For all section numberings (except section 7.6), the
character order is

7 . 1
7 . 2
7 . 3

These sections are correctly visualized in Writer. Section 7.6 has
been entered with the character order

6 7 .

According to the Unicode Bidi Algorithm this is correctly painted as
".67" in Writer. However, Word displays this as "7.6". The reason for
this is that "6" has been entered with the Hebrew IME turned on, and
"7." has been entered with the English IME. Depending on the IME
which is used to insert characters, Word builds some kind of direction
attribute for these characters, which is interpreted during text
formatting.

2. case: The subsections a) b) c)

The input sequence for these was "open parenthesis" before "a". Again,
according to the UBA, this is correctly painted as

a)

in Writer. In Word, these characters have been entered with the
English IME. Therefore they have the attribute LTR and they are
displayed as

(a

So what's the conclusion?

To behave like Word, we

1. need a character attribute, that overrides the directions from the UBA
2. have to set the direction attribute automatically depending on the
current IME.
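
The two requirements above could be modelled roughly as follows. This Python sketch stores text as runs that carry a direction attribute taken from the input method at typing time, independent of what the UBA would resolve. The Run class and type_char function are hypothetical and do not reflect Word's or Writer's real data structures.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    direction: str  # "L" or "R", taken from the input method, not the UBA


def type_char(runs, ch, ime_direction):
    """Append ch to runs, tagged with the current input-method direction."""
    if runs and runs[-1].direction == ime_direction:
        runs[-1].text += ch  # extend the current run
    else:
        runs.append(Run(ch, ime_direction))  # direction changed: new run
    return runs


# The section-number case from the bugdoc: "6" typed with the Hebrew IME,
# then "7." typed with the English IME.
runs = []
type_char(runs, "6", "R")
type_char(runs, "7", "L")
type_char(runs, ".", "L")
print(runs)
```

A renderer that honours these attributes would then treat "7." as a single left-to-right unit, which is the grouping the comment attributes to Word.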

Comment 24 frank.meies 2003-12-18 08:05:43 UTC

Comment 25 caolanm 2004-02-16 14:16:46 UTC

cmc->fme: The property in Word that marks the direction of a character range
is 0x85A. You can see that I make use of it for export in
sw/source/filter/ww8/wrtw8nds.cxx, but not for import. If changes are made in
this area to introduce a direction property for a character range, that's the
piece of import/export magic required from MS Word.

Comment 26 frank.meies 2004-02-17 07:35:43 UTC

*** Issue 25548 has been marked as a duplicate of this issue. ***

Comment 27 mehlng 2004-04-01 20:08:06 UTC

See the solution in #27174, which I now think is more suitable.

Comment 28 frank.meies 2004-04-06 14:43:54 UTC

*** Issue 20688 has been marked as a duplicate of this issue. ***

Comment 29 sforbes 2004-04-19 04:25:54 UTC

*** Issue 27618 has been marked as a duplicate of this issue. ***

Comment 30 sforbes 2004-04-25 23:37:22 UTC

Unicode 4.0.1 has some changes relevant to this bug, esp. the treatment of
hyphen-minus in Hebrew text.
http://www.unicode.org/versions/Unicode4.0.1/

Comment 31 falko.tesch 2004-06-18 12:17:55 UTC

FT: Since this issue is also MS Office import related I vote for doing it "like
Microsoft".

Comment 32 sforbes 2004-07-07 09:59:08 UTC

*** Issue 31149 has been marked as a duplicate of this issue. ***

Comment 33 sforbes 2004-07-07 10:04:14 UTC

An example of the same problem in the opposite situation (Hebrew text in an
English run) can be found in the duplicate issue #31149. I wish I had a better
answer to give a user, as entering RLM is not possible due to issue #13091

Comment 34 sforbes 2004-07-07 10:53:04 UTC

*** Issue 31149 has been marked as a duplicate of this issue. ***

Comment 35 prognathous 2004-07-08 15:47:41 UTC

Instead of RLM, he could type a Hebrew Geresh. Here's how to do it in Windows:

1. Make sure the input language is Hebrew (HE).
2. Hold the left Alt pressed and type 0215 using the alphanumeric keyboard.

Ft said: 
> FT: Since this issue is also MS Office import related I vote for 
> doing it "like Microsoft".

Unicode 4.0.1 defines the use of Hyphen-Minus "like Microsoft" and so does
Mozilla. OO on the other hand doesn't.

Reference: http://bugzilla.mozilla.org/show_bug.cgi?id=73251#c47

Prog.

Comment 36 andreas.martens 2004-07-20 11:28:46 UTC

Because of a shortage of resources we have to retarget this issue to OOo later.

Comment 37 Unknown 2004-07-21 20:42:14 UTC

Please add keywords: ms_interoperability

Comment 38 frank.meies 2004-09-06 10:39:41 UTC

*** Issue 33854 has been marked as a duplicate of this issue. ***

Comment 39 prognathous 2004-09-07 08:40:34 UTC

(In reply to fme, Issue 33854)
> Duplicate of issue 18024. Any character without an explicit direction will 
> cause these problems, since the unicode bidi algorithm cannot determine on 
> which side of the previous word it has to appear.

I don't see how Issue 33854 is a duplicate of this one. The Unicode BiDi
Algorithm doesn't specify how text pasted from the clipboard should be handled.
Microsoft Office doesn't suffer from this problem, it just includes the original
direction with the copied text. By doing so, it doesn't violate the UBA, but it
does provide the behavior users expect.

Prog.

Comment 40 frank.meies 2004-09-07 09:14:24 UTC

[..] I don't see how Issue 33854 is a duplicate of this one. [...]
Let me explain.

[...] The Unicode BiDi
Algorithm doesn't specify how text pasted from the clipboard should be handled.
Microsoft Office doesn't suffer from this problem, it just includes the original
direction with the copied text. By doing so, it doesn't violate the UBA, but it
does provide the behavior users expect. [...]

MS Office has some kind of character attribute, specifying the direction of the
characters. The text, together with the attribute is copied into the clipboard.
We currently do not have this character attribute, therefore a portion of hebrew
text ending with a neutral character will look different in RTL and LTR
environments.

Comment 41 prognathous 2004-09-07 10:10:55 UTC

Perhaps I misinterpreted the title of this issue. After all, "dealing with text
direction without using keyboard layout" isn't the same as "dealing with text
direction without using LRM/RLM".

OO doesn't need the user to manually insert control characters; it can do it
automatically, without having to reinvent the wheel with proprietary character
attributes. Text copied to the clipboard can simply have surrounding control
characters that help retain its original direction, regardless of input
method.
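
One possible reading of this suggestion, sketched in Python: when text is copied, surround it with the implicit mark that matches its first strong character, so that leading and trailing neutral characters keep their side in the target document. Choosing the direction from the first strong character is an assumption made here, and wrap_for_clipboard is a hypothetical name, not an OOo function.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def first_strong_direction(text):
    """Return 'L' or 'R' for the first strong character, or None."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "L"
        if cls in ("R", "AL"):
            return "R"
    return None


def wrap_for_clipboard(text):
    """Surround text with marks matching its first strong direction."""
    direction = first_strong_direction(text)
    if direction is None:
        return text  # only weak/neutral characters: nothing to pin
    mark = LRM if direction == "L" else RLM
    return mark + text + mark


hebrew_with_apostrophe = "\u05e9\u05dc\u05d5\u05dd'"
wrapped = wrap_for_clipboard(hebrew_with_apostrophe)
print(wrapped.startswith(RLM) and wrapped.endswith(RLM))  # True
```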

Prog.

Comment 42 frank.meies 2004-09-07 11:18:55 UTC

We should prefer attributes to control characters. Please have a look at
http://www.unicode.org/unicode/reports/tr20/#Charlist

Comment 43 prognathous 2004-09-07 12:08:48 UTC

I fail to see this suggested in the page that you linked. In fact, LRM/RLM are
perfectly fine:

Code points     Names/Description                         Short Comment
U+200E..U+200F  Implicit directional marks (LRM and RLM)  LRM and RLM are allowed

http://www.unicode.org/unicode/reports/tr20/#Format

Prog.

Comment 44 shmuelh 2004-09-07 12:32:57 UTC

As an "average user" who has suffered from this problem for months, and who is
unable to understand the programming which appears in the various comments on
this issue, is there any workaround that users can use in the meantime? I've
tried to add a space after the apostrophe, or a numeral, or an English letter.
In each case, the apostrophe and whatever followed it was moved to the right of
the word when I moved the word to an English document. The only thing I've found
to be infallible - but it's a real nuisance - is to switch the receiving
document (the English document) into R2L mode, then to Copy, and then to revert
to L2R mode. That is an enormous bother.

Comment 45 frank.meies 2004-09-07 12:51:19 UTC

Automatic insertion of LRM/RLM characters will 'taint' the document. We would
have to deal with these characters during formatting, painting, and cursor
travelling. Using automatically inserted directional attributes is a much
smarter way to solve the problem with the neutral characters. An additional
advantage would be the improved interoperability and compatibility with MS Word
(of course you will still be able to insert the control character manually,
i.e., by using a macro). But since this issue is targeted to 'OOo later' I
cannot invest more time in this right now.

Comment 46 frank.meies 2004-09-07 13:19:46 UTC

FME->shmuelh: Please see my comment from Fri Aug 15 00:17:05 -0700 2003 (Comment 4). You
can insert an LRM or RLM character using these macros (e.g., assign InsertLRM to
F11 and InsertRLM to F12). These macros give you some control over the automatic
character positioning.

Comment 47 prognathous 2004-09-07 13:24:34 UTC

shmuelh, you can work around this problem by inserting hidden RLM or LRM
characters via the numeric keypad.

- Inserting RLM. When you paste a Hebrew_Word+Punctuation into English text,
switch input language to Hebrew, hold the left Alt key down and type 0254 (using
the numeric keypad).
- Inserting LRM. When you paste an English_Word+Punctuation into Hebrew text,
switch input language to Hebrew, hold the left Alt key down and type 0253.

The above instructions assume that you're using Windows.
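
The two Alt codes are consistent with the Windows-1255 (Hebrew) code page, where byte 253 is LRM and byte 254 is RLM. Assuming that Alt+0nnn input is interpreted in the code page of the active input language, which would explain why the instructions ask for the Hebrew input language, a short Python check using the standard cp1255 codec shows the mapping. The helper name alt_code_char is hypothetical.

```python
def alt_code_char(code, codepage="cp1255"):
    """Decode a one-byte Alt+0nnn code in the given Windows code page."""
    return bytes([code]).decode(codepage)


print(hex(ord(alt_code_char(253))))  # 0x200e
print(hex(ord(alt_code_char(254))))  # 0x200f
```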

You can find more information about this subject here:
http://mozilla.org.il/board/viewtopic.php?t=363

Prog.

Comment 48 alan 2005-11-13 13:44:55 UTC

In our Hebrew build of OOo 2.0, we have included fme's macros for inserting
LRM's and RLM's, and linked them to a button on the toolbar, and to hotkeys
Shift-F3 and Shift-F4. Several users have asked if this feature can be included
in the distributed OOo. While a solution to the general problem has been
proposed in Issue 27174, its target milestone is "OOo Later". Until its
implementation, it could be a good idea to include the macros in the distributed
OOo. I'm attaching the macro file that we used,
wizards/source/tools/DirectionMarkers.xba, and a patch to
wizards/source/tools/script.xlb

Comment 49 alan 2005-11-13 13:59:57 UTC

Created attachment 31433
LRM/RML macros in wizards/source/tools

Comment 50 alan 2005-11-13 14:01:29 UTC

Created attachment 31434
patch to script.xlb

Comment 51 alan 2005-11-13 14:12:09 UTC

The macro file which I posted also includes a macro Insert_RTL_Footnote, for
inserting footnotes which are aligned to the right. This macro is not related to
this issue, and it's only there because I forgot to take it out before posting.
Still, it may be useful for RTL users who read the comments on this issue.

Comment 52 alonbl 2005-12-22 14:39:46 UTC

Hello,

Installed version 2.0.1, and as promised by fme@openoffice.org there are
"Left-to-right mark" and "Right-to-left mark" commands. They are not shown in
the menu by default, so you need to customize and create your own Bidi menu and put
these commands inside.

Also, these characters are now invisible (also in Linux), so it is safe to use
them, and it works in Impress, although Bidi rendering there is quite strange.

I think that one final touch should be added: show these characters when the
"Nonprinting Characters" option is on. Or perhaps I just don't know how to do
this.

Thanks!!! This was the last major issue that prevented using openoffice.

Comment 53 frank.meies 2006-12-07 13:25:06 UTC

fme->all: After the implementation of the "insert RLM/LRM" buttons, I think we
should close this issue. There has been a lot of discussion about this issue
(see also mailing list hebrew@openoffice.org.il: "Request: Behaviour on weak
characters in mixed directional environment", dated 2004), all of which ended
without an agreement. Should we
A) implement the Word-like direction character attribute (and set it
automatically depending on the current IME), or
B) implement some heuristics to automatically insert RLM/LRM characters on
certain occasions, or
C) are we just happy with the new toolbar buttons?
Personally I'm happy with the toolbar buttons (one possible enhancement would be
to visualize the RLM/LRM characters as alonbl suggests; please file a
request for enhancement for this if you like). So I declare this one as
worksforme, because by implementing the buttons we offered a way to
manipulate the results from the UBA.
This issue already has 83 votes, but I have no clue what the votes are actually
for: A, B, or C? So if you disagree, please file a new issue, including a
description of what to do. All discussions should go to a public mailing list
(dev@sw.openoffice.org).

Comment 54 frank.meies 2007-01-08 11:04:30 UTC

No objections -> closing issue.

Comment 55 stefan.baltzer 2007-09-24 15:52:29 UTC

*** Issue 81662 has been marked as a duplicate of this issue. ***

Comment 56 stefan.baltzer 2007-09-26 15:31:07 UTC

*** Issue 79777 has been marked as a duplicate of this issue. ***

Comment 57 stefan.baltzer 2007-10-11 16:38:48 UTC

*** Issue 81501 has been marked as a duplicate of this issue. ***

Comment 58 frank.meies 2007-11-18 08:12:07 UTC

*** Issue 61016 has been marked as a duplicate of this issue. ***