Apache OpenOffice (AOO) Bugzilla – Issue 18024
Direction of weak characters: A new method for dealing with text direction without using keyboard layout
Last modified: 2013-08-07 15:00:01 UTC
The main problem that keyboard layout detection is needed to solve, AFAIK, is the f(a) and english_word+dot problem. I'll demonstrate: the expression
FUNCTION f(X)
should render as
f(X) NOITCNUF
and not
)f(X NOITCNUF
On the contrary, the expression
SHALOM mumi.
should render the other way around, as
.mumi MOLAHS
and not
mumi. MOLAHS
This is almost always true, as you almost never need to place a dot at the end of an embedded English word.

Solving the problem the way MS Word did (keyboard layout) seems to me like a bad idea:
1) It is unintuitive: setting things the way you want them, and taking care of the directionality of each space and punctuation mark (MS Word has Hebrew spaces and English spaces), is hell on earth.
2) It does not always work as intended, so we won't get much benefit from it (you'll solve the f(x) problem, but when a sentence ends with englishword+dot, the dot will not automatically jump to the real end of the sentence). So it's not really much of a gain.
3) It adds extra complexity, since the whole BiDi engine would have to be altered.

I suggest a unique solution based on the already existing LRM/RLM signs. The solution is very simple: define a macro that causes OOo to insert an LRM[1] sign automatically under the following condition: the word being written matches English_Letter* ')'. We should think of more characters like ')' after which we want to insert the LRM sign, but the idea is simple. Its advantages:
1) It is almost comparable to MS Word's solution and could imitate it almost exactly.
2) It is highly customizable and seems to actually solve the problem completely in most cases.
3) Even in the unlikely event that the user wants, say, f(x) to render as
)f(x MOLAHS
he can do that very easily, with no exchange of characters: a simple hit on the backspace key deletes the LRM and solves the problem.
Please consider my idea. I think it will do only good and save us from an unnecessary implementation of an MS Word imitation, which is not necessarily better than my solution. Please answer me CC'd to the openoffice-hebrew mailing list at openoffice.org.il
[1] I'm not sure LRM is the correct sign. The idea is a sign $ that will cause SHALOM f(x)$ to be rendered as
f(X) MOLAHS
where the sign $/LRM/RLM is invisible, of course. Caps are Hebrew characters, as in the usual notation.
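To make the distinction in the comment above concrete, here is a small Python sketch that prints the Unicode bidirectional class of the characters involved. It is only an illustration using the standard `unicodedata` module; the helper name `bidi_classes` is made up for this example. It shows that ')' and '.' carry no strong direction of their own, while LRM and RLM are invisible characters that do.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def bidi_classes(text):
    """Return (character name, bidi class) pairs for each character in text."""
    return [(unicodedata.name(ch, "?"), unicodedata.bidirectional(ch)) for ch in text]


# "L" and "R" are strong classes; "ON" (other neutral) and "CS" (common
# separator) take their direction from the surrounding text.
for name, cls in bidi_classes("f(x)." + LRM + RLM + "\u05e9"):
    print(f"{cls:>3}  {name}")
```

Because the closing parenthesis and the dot take their direction from their neighbours, placing an invisible strong character next to them is enough to change which side they end up on.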
Hi, thank you for your suggestions. Your description of the problem is absolutely correct. Inserting control characters (like LRE, RLE, PDF, ...) is one way to solve this problem, the other way would be to use character attributes to define the directionality of characters. From my point of view, it would be easier to use the control characters, because we'll only have to pass the string (containing the control characters) to the BiDi algorithm and everything works fine. On the other hand, using a character attribute would be the 'correct' way to solve the problem, without entering hidden control characters to the paragraph. I'll forward this issue to the user experience team, so they should have a look.
Hi, thanks for your comment. Again, and this is the point I'd like to emphasize, I think character attributes (CA) are NOT a "correct" solution to the problem. As I see it, there are two views of what a correct solution is.

The first defines a correct solution as one that always works well and as intended: there are no extreme cases that can prevent it from working, the code is easily portable and extremely maintainable, etc. If this is the definition, I believe character attributes are NOT the "correct" solution. Leaving aside the fact that both solutions have the same functionality, the CA version is hell on earth to manage. Ever tried to paste C source into Hebrew MS Office? What about an English document pasted into Hebrew MS Office: it now has all its dots reversed, and how can the naive user fix it? (Clue: the solution is to replace all "Hebrew" spaces with "English" spaces, one of the toughest missions I've ever seen.) The CA method does not place the end-of-sentence dot correctly; the RLM method does. The CA method will not keep your text layout between OOo and other applications; the RLM method does. The CA version, as said before, is very hard and unintuitive to override; the LRM method just requires a backspace after the word that needs overriding. I think that if this is how one defines the "correct" way, then surely the LRM method is the better one.

The other definition of a "correct" method is about standardization and intuitiveness (and mathematical truthfulness, but that's not quite relevant here). For example, using the filesystem for IPC instead of the normal IPC tools (KDE DCOP, for instance) might be easier to implement and maintain, but surely IPC tools are the correct, standard way to handle inter-process communication. Even if this is the definition, I believe the LRM method should be considered as well. The LRM method, as opposed to the CA method, is standardized by the system-wide Unicode standard.
It is just as intuitive as the CA method (no one thinks about which language he writes his punctuation in; this is not a rational way to handle text). I therefore think that the RLM method has no disadvantages compared to the CA method here either.

The bottom line is that I can't see any reason to implement complex libraries that will eventually provide us with nothing more than we can achieve without this extra complexity. Please, when discussing this with the User Experience team, make sure there are some Hebrew speakers and, more importantly, USERS. Having a declared extremely good experience but very poor actual use is not the way to go, if you ask me. Also, if you can allow me to share my view in the User Experience group's discussion, I'll be very glad. Thanks again.
.
FME: As long as there is no solution for this problem, one can use simple macros to insert an RLM or LRM at the current cursor position:

sub InsertLRM
    ' 8206 = U+200E, LEFT-TO-RIGHT MARK
    xsel = thiscomponent.currentcontroller.getselection
    xrange = xsel(0)
    xrange.setstring(chr$(8206))
end sub

sub InsertRLM
    ' 8207 = U+200F, RIGHT-TO-LEFT MARK
    xsel = thiscomponent.currentcontroller.getselection
    xrange = xsel(0)
    xrange.setstring(chr$(8207))
end sub
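For readers who want to check the two decimal values used in the macros, the following Python sketch (not part of the macros, and using only the standard `unicodedata` module) confirms that 8206 and 8207 are the LRM and RLM code points. The helpers `describe` and `insert_at` are hypothetical names for this illustration; `insert_at` only mimics, on a plain string, what the macros do at the cursor position.

```python
import unicodedata

LRM = chr(8206)  # same value the InsertLRM macro passes to chr$()
RLM = chr(8207)  # same value the InsertRLM macro passes to chr$()


def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def insert_at(text, cursor, mark):
    """Insert a direction mark into a plain string at a cursor offset."""
    return text[:cursor] + mark + text[cursor:]


print(describe(LRM))
print(describe(RLM))
print(repr(insert_at("f(x) ", 4, LRM)))
```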
*** Issue 14590 has been marked as a duplicate of this issue. ***
FME: Added 'Direction of weak characters' to title.
*** Issue 16247 has been marked as a duplicate of this issue. ***
UE speaking: Even though I am not a native Hebrew and/or Arabic speaker/writer, I understand your concern about changing text directions. We have already discussed various approaches to this issue but couldn't come to a solution. Neither MS Word nor other office word processors have a special implementation for this (at least I couldn't find any). Can you please help me find a reasonable, user-friendly UI/function to address this problem? Thx.
mehlng->ft: I believe my proposition is pretty concise and is described here. I'll try to describe it in Python pseudo-code that solves the problem:
=============cut here===========================
# getchar(), text_direction(), paragraph_direction() and print_before_input()
# stand for editor hooks; they are placeholders, not real functions.
chars_usually_end_sentence = [')', ']', '}', '>']
lastchar = None
while (c := getchar()):
    if c == ' ':
        if text_direction() != paragraph_direction():
            if lastchar in chars_usually_end_sentence:
                print_before_input(RLM_sign)
    lastchar = c
=============ends here=============================
This is supposed to solve the problem almost completely. Besides that, a nice way to handle the RLM sign graphically would be welcome (i.e., when the cursor is after an RLM sign an explanation would appear, and simply deleting it would automatically delete the character before it).
Please contact me, or shachar (shemes.biz) or Gilad, or eli Marmor. I'll be glad to explain this on the phone. Do contact mehlng@yahoo.com; I'm very eager to discuss this issue.
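The pseudo-code above depends on editor hooks that do not exist outside the editor. As a rough, runnable approximation, the sketch below applies the same rule to a stream of typed characters held in a plain Python string. Two things here are assumptions rather than anything stated in the proposal: the direction of the current text is guessed from the last strong character typed, and the inserted mark matches that direction (LRM after an English run in a Hebrew paragraph), whereas the pseudo-code simply names RLM_sign. The function names are hypothetical.

```python
import unicodedata

LRM = "\u200e"
RLM = "\u200f"
CLOSERS = {")", "]", "}", ">"}


def last_strong_direction(text):
    """Return 'L' or 'R' for the last strong character in text, else None."""
    for ch in reversed(text):
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "L"
        if cls in ("R", "AL"):
            return "R"
    return None


def type_text(keystrokes, paragraph_dir="R"):
    """Simulate typing; add a mark after a closer when the run direction
    differs from the paragraph direction and a space is typed."""
    out = []
    for ch in keystrokes:
        if ch == " " and out and out[-1] in CLOSERS:
            run_dir = last_strong_direction("".join(out))
            if run_dir is not None and run_dir != paragraph_dir:
                out.append(LRM if run_dir == "L" else RLM)
        out.append(ch)
    return "".join(out)


# Hebrew word, then "f(x)", then another Hebrew word, in an RTL paragraph.
typed = type_text("\u05e9\u05dc\u05d5\u05dd f(x) \u05e2\u05d5\u05d3")
print(repr(typed))
```

In this sketch a parenthesised Hebrew word typed in a Hebrew paragraph is left untouched, because its last strong character already agrees with the paragraph direction.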
One last word about LRM handling (which is especially vital if we intend to add them regularly in OOo): in order not to confuse the naive user, the RLM *must* be hard-linked to the character before it, so that it disappears when that character is deleted (in any way). Otherwise it will remain unnoticed in the text and rear its ugly head with plenty of unexplained errors. The only HIGHLY UNLIKELY problem that might arise is if the ')' is moved to a different place and meant to be used as a Hebrew '(' sign. A demonstration of the problem that might arise ($ stands for the invisible RLM sign):
Current text:
MOLAHS TIRVI
The user adds English with parentheses, which makes the RLM sign get inserted automatically:
english (text)$ MOLAHS TIRVI
The user deletes all the English text but the two parentheses:
) MOLAHS TIRVI
The user now continues to write HEBREW in parentheses:
(MIARGOS)$ MOLAHS TIRVI
Now a problem can arise. However, the MS Word approach won't solve this issue either (!): a ')'-LRM is just like MS Word's Hebrew-type parenthesis, so we haven't caused anything that can't happen in MS Word!
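The "hard-linked" behaviour asked for above can be sketched on a plain string as follows. This is only an illustration of the idea, not how any editor implements deletion; `delete_char` is a hypothetical helper, and the rule it applies (a mark directly after the deleted character goes away with it) is my reading of the comment.

```python
MARKS = {"\u200e", "\u200f"}  # LRM and RLM


def delete_char(text, index):
    """Delete text[index]; also delete a direction mark that directly follows it."""
    end = index + 1
    if end < len(text) and text[end] in MARKS:
        end += 1  # the mark is tied to the deleted character, so drop it too
    return text[:index] + text[end:]


# Deleting the ')' in "(text)<LRM> abc" removes the invisible mark as well.
print(repr(delete_char("(text)\u200e abc", 5)))
```

Deleting the mark itself (for example with a single backspace, as suggested in the first comment) still works in this sketch, since the mark is then the character at `index`.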
issue #19848 is related
Bug 21019 is NOT a duplicate of this bug. That also discusses handling of imported/legacy texts, as opposed to text entry.
Above comment posted to wrong issue; sorry for the spam.
*** Issue 21887 has been marked as a duplicate of this issue. ***
from issue #21887 (marked as dup of this one):
"1. When typing a Hebrew text (direction right to left) ending with an English word, followed by a Hebrew ":" (on the left of it), and then typing an English text, the English word ending the Hebrew text jumps left. E.g.: when typing (from right to left) "english word 2" < "a hebrew :" < "english word 1" < "hebrew" one gets: "english word 1""a hebrew :""english word 2" < "hebrew"
2. When writing a Hebrew doc (direction right to left) and inserting an English text starting with a number, the number jumps over to the right side (as if it were still Hebrew). E.g.: when typing (right to left) "number" > "english text" (changing to english) < "hebrew" one gets: "english" "number" "hebrew""
FT: We discussed possible solutions here at Star Office. This is what we came up with:
- For OO.o running under Windows we will make use of our new feature that reads out the IME. Once we detect the IME inputting an RTL language, we will hint the ICU to determine the correct text direction (RTL in this case) for neutral and weak characters.
- For OO.o running under Unix systems we cannot change anything yet, since none of the Unix IMEs feed back their current language setting. Therefore we must still rely on the already existing logic coming from the ICU. As soon as there are Unix IMEs that report their language, we will support this the same way we will for Windows.
We strongly oppose implementing _any_ UI to work around the Unix flaws. Reason: if we implemented some UI and eventually some but not all IMEs supported language reporting, we would have a redundant (and possibly competing) system: UI and automatism. This would confuse the user rather than help him.
two points:
* are we going to leave linux, mac and windows users who don't have a version of windows that supports ime out in the cold?
* ime will only help for new text, correct? what are we going to do about existing text?
Added Falko to Cc.
Created attachment 11713 [details] problem file which IME would not solve
See the file I just attached- compare the highlighted paragraphs with the original word display. How would using IME solve the problem of the location changing of the mathematical/roman characters?
Dina: tkos input would be welcome on this issue
FME->sforbes: As far as I can see from your bugdoc, there are problems in two different cases:

Case 1: For all section numberings (except section 7.6), the character order is
7 . 1
7 . 2
7 . 3
These sections are correctly visualized in Writer. Section 7.6 has been entered with the character order
6 7 .
According to the Unicode Bidi Algorithm this is correctly painted as ".67" in Writer. However, Word displays this as "7.6". The reason for this is that "6" has been entered with the Hebrew IME turned on, and "7." has been entered with the English IME. Depending on the IME which is used to insert characters, Word builds some kind of direction attribute for these characters, which is interpreted during the text formatting.

Case 2: The subsections a) b) c). The input sequence for these was "open parenthesis" before "a". Again, according to the UBA, this is correctly painted as
a)
in Writer. In Word, these characters have been entered with the English IME. Therefore they have the attribute LTR and they are displayed as
(a

So what's the conclusion? To behave like Word, we
1. need a character attribute that overrides the directions from the UBA
2. have to set the direction attribute automatically depending on the current IME.
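The two requirements in the conclusion above can be pictured with a small data-structure sketch: each typed character is tagged with the direction of the input language that was active, and neighbouring characters with the same tag are merged into runs. This is a hypothetical illustration only; `Run` and `type_char` are invented names, and nothing here reflects Word's file format or Writer's internals.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    direction: str  # "L" or "R", taken from the input language at typing time


def type_char(runs, ch, input_direction):
    """Append a typed character, extending the last run if the direction matches."""
    if runs and runs[-1].direction == input_direction:
        runs[-1].text += ch
    else:
        runs.append(Run(ch, input_direction))
    return runs


# Section number from case 1: "6" typed with the Hebrew IME, "7." with the English IME.
runs = []
for ch, ime_direction in [("6", "R"), ("7", "L"), (".", "L")]:
    type_char(runs, ch, ime_direction)
print(runs)
```

A formatter that honoured such an attribute would treat the "." as part of the LTR run regardless of what the UBA alone would decide, which is the override behaviour described in point 1.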
cmc->fme: The property Word uses to mark the direction of a character range is 0x85A. You can see that I make use of it for export in sw/source/filter/ww8/wrtw8nds.cxx, but not for import. If changes are made in this area to introduce a direction property for a character range, that's the piece of import/export magic required for msword.
*** Issue 25548 has been marked as a duplicate of this issue. ***
See the solution in #27174, which I now think is more suitable.
*** Issue 20688 has been marked as a duplicate of this issue. ***
*** Issue 27618 has been marked as a duplicate of this issue. ***
Unicode 4.0.1 has some changes relevant to this bug, esp. the treatment of hyphen-minus in Hebrew text. http://www.unicode.org/versions/Unicode4.0.1/
FT: Since this issue is also MS Office import related I vote for doing it "like Microsoft".
*** Issue 31149 has been marked as a duplicate of this issue. ***
An example of the same problem in the opposite situation (Hebrew text in an English run) can be found in the duplicate issue #31149. I wish I had a better answer to give a user, as entering RLM is not possible due to issue #13091
Instead of RLM, he could type a Hebrew Geresh. Here's how to do it in Windows: 1. Make sure the input language is Hebrew (HE). 2. Hold the left Alt pressed and type 0215 using the alphanumeric keyboard. Ft said: > FT: Since this issue is also MS Office import related I vote for > doing it "like Microsoft". Unicode 4.0.1 defines the use of Hyphen-Minus "like Microsoft" and so does Mozilla. OO on the other hand doesn't. Reference: http://bugzilla.mozilla.org/show_bug.cgi?id=73251#c47 Prog.
Because of a shortage of resources we have to retarget this issue to OOo later.
Please add keywords: ms_interoperability
*** Issue 33854 has been marked as a duplicate of this issue. ***
(In reply to fme, Issue 33854) > Duplicate of issue 18024. Any character without an explicit direction will > cause these problems, since the unicode bidi algorithm cannot determine on > which side of the previous word it has to appear. I don't see how Issue 33854 is a duplicate of this one. The Unicode BiDi Algorithm doesn't specify how text pasted from the clipboard should be handled. Microsoft Office doesn't suffer from this problem, it just includes the original direction with the copied text. By doing so, it doesn't violate the UBA, but it does provide the behavior users expect. Prog.
[..] I don't see how Issue 33854 is a duplicate of this one. [...] Let me explain. [...] The Unicode BiDi Algorithm doesn't specify how text pasted from the clipboard should be handled. Microsoft Office doesn't suffer from this problem, it just includes the original direction with the copied text. By doing so, it doesn't violate the UBA, but it does provide the behavior users expect. [...] MS Office has some kind of character attribute, specifying the direction of the characters. The text, together with the attribute is copied into the clipboard. We currently do not have this character attribute, therefore a portion of hebrew text ending with a neutral character will look different in RTL and LTR environments.
Perhaps I misinterpreted the title of this issue. After all, "dealing with text direction without using keyboard layout" isn't the same as "dealing with text direction without using LRM/RLM". OO doesn't need the user to manually insert control characters; it can do it automatically, without having to reinvent the wheel with proprietary character attributes. Text copied to the clipboard can simply have surrounding control characters that would help retain its original direction, regardless of input method. Prog.
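One way to read the clipboard suggestion above is to wrap the copied text in the Unicode embedding controls that were mentioned earlier in this issue (LRE, RLE, PDF). The sketch below does this for a plain string. It is an assumption about what "surrounding control characters" could mean, not a description of any existing clipboard format, and `wrap_with_direction` is a hypothetical name.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_with_direction(text, source_direction):
    """Surround text with embedding controls recording its source direction."""
    if source_direction not in ("L", "R"):
        raise ValueError("source_direction must be 'L' or 'R'")
    opener = RLE if source_direction == "R" else LRE
    return opener + text + PDF


# A Hebrew word ending in a neutral character, copied from an RTL paragraph.
copied = wrap_with_direction("\u05e9\u05dc\u05d5\u05dd!", "R")
print(repr(copied))
```

With the embedding in place, the trailing "!" is resolved inside an RTL context even when the text is pasted into an LTR paragraph, which is the behaviour the comment asks for.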
We should prefer attributes to control characters. Please have a look at http://www.unicode.org/unicode/reports/tr20/#Charlist
I fail to see this suggested in the page that you linked. In fact, LRM/RLM are perfectly fine: Code points Names/Description Short Comment U+200E..U+200F Implicit directional marks (LRM and RLM) LRM and RLM are allowed http://www.unicode.org/unicode/reports/tr20/#Format Prog.
As an "average user" who has suffered from this problem for months, and who is unable to understand the programming which appears in the various comments on this issue, is there any workaround that users can use in the meantime? I've tried to add a space after the apostrophe, or a numeral, or an English letter. In each case, the apostrophe and whatever followed it was moved to the right of the word when I moved the word to an English document. The only thing I've found to be infallible - but it's a real nuisance - is to switch the receiving document (the English document) into R2L mode, then to Copy, and then to revert to L2R mode. That is an enormous bother.
Automatic insertion of LRM/RLM characters will 'taint' the document. We would have to deal with these characters during formatting, painting, and cursor travelling. Using automatically inserted directional attributes is a much smarter way to solve the problem with the neutral characters. An additional advantage would be the improved interoperability and compatibility with MS Word (of course you will still be able to insert the control character manually, i.e., by using a macro). But since this issue is targeted to 'OOo later' I cannot invest more time in this right now.
FME->shmuelh: Please see my comment from Fri Aug 15 00:17:05 -0700 2003. You can insert an LRM or RLM character using these macros (e.g., assign InsertLRM to F11 and InsertRLM to F12). These macros give you some control over the automatic character positioning.
shmuelh, you can work around this problem by inserting hidden RLM or LRM characters via the numeric keypad. - Inserting RLM. When you paste a Hebrew_Word+Punctuation into English text, switch input language to Hebrew, hold the left Alt key down and type 0254 (using the numeric keypad). - Inserting LRM. When you paste an English_Word+Punctuation into Hebrew text, switch input language to Hebrew, hold the left Alt key down and type 0253. The above instructions assume that you're using Windows. You can find more information about this subject here: http://mozilla.org.il/board/viewtopic.php?t=363 Prog.
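The two Alt codes above can be sanity-checked in Python. The sketch assumes that, with Hebrew as the input language, Windows resolves Alt+0nnn through the Hebrew ANSI code page (Windows-1255); that is an assumption about the platform, not something stated in this issue. `alt_code_char` is a hypothetical helper that simply decodes one byte with Python's built-in `cp1255` codec.

```python
def alt_code_char(code, codepage="cp1255"):
    """Decode a single Alt+0nnn byte value using the given code page."""
    return bytes([code]).decode(codepage)


for code in (253, 254):
    ch = alt_code_char(code)
    print(f"Alt+0{code} -> U+{ord(ch):04X}")
```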
In our Hebrew build of OOo 2.0, we have included fme's macros for inserting LRM's and RLM's, and linked them to a button on the toolbar, and to hotkeys Shift-F3 and Shift-F4. Several users have asked if this feature can be included in the distributed OOo. While a solution to the general problem has been proposed in Issue 27174, its target milestone is "OOo Later". Until its implementation, it could be a good idea to include the macros in the distributed OOo. I'm attaching the macro file that we used, wizards/source/tools/DirectionMarkers.xba, and a patch to wizards/source/tools/script.xlb
Created attachment 31433 [details] LRM/RLM macros in wizards/source/tools
Created attachment 31434 [details] patch to script.xlb
The macro file which I posted also includes a macro Insert_RTL_Footnote, for inserting footnotes which are aligned to the right. This macro is not related to this issue, and it's only there because I forgot to take it out before posting. Still, it may be useful for RTL users who read the comments to this issue.
Hello, I installed version 2.0.1 and, as promised by fme@openoffice.org, there are "Left-to-right mark" and "Right-to-left mark" commands. They are not shown in the menu by default, so you need to customize and create your own Bidi menu and put these commands inside it. Also, these characters are now invisible (also in Linux), so it is safe to use them, and it works in Impress, although Bidi rendering there is quite strange. I think that one final touch should be added: show these characters when the "Nonprinting Characters" option is on. Or perhaps I just don't know how to do this. Thanks!!! This was the last major issue that prevented using openoffice.
fme->all: After the implementation of the "insert RLM/LRM" buttons, I think we should close this issue. There has been a lot of discussion about this issue (see also mailing list hebrew@openoffice.org.il: "Request: Behaviour on weak characters in mixed directional environment", dated 2004), and all of it ended without an agreement. Should we
A) implement the Word-like direction character attribute (and set it automatically depending on the current IME), or
B) implement some heuristics to automatically insert RLM/LRM characters on certain occasions, or
C) are we just happy with the new toolbar buttons?
Personally I'm happy with the toolbar buttons (one possible enhancement would be to visualize the RLM/LRM characters as alonbl suggests; please file a request for enhancement for this if you like). So I declare this one as worksforme, because by implementing the buttons we offered a way to manipulate the results from the UBA. This issue already has 83 votes, but I have no clue what the votes are actually for: A, B, or C? So if you disagree, please file a new issue, including a description of what to do. All discussions should go to a public mailing list (dev@sw.openoffice.org).
No objections -> closing issue.
*** Issue 81662 has been marked as a duplicate of this issue. ***
*** Issue 79777 has been marked as a duplicate of this issue. ***
*** Issue 81501 has been marked as a duplicate of this issue. ***
*** Issue 61016 has been marked as a duplicate of this issue. ***