Apache OpenOffice (AOO) Bugzilla – Issue 105623
Brackets are not handled right when a Hebrew word is bracketed in western (Dutch,English) text
Last modified: 2017-05-20 11:15:44 UTC
When I write a text in Dutch (or English, or probably any LTR language) and put a Hebrew word in brackets, the brackets are not handled correctly at the end of a line. When the bracketed Hebrew word (or words) reaches the end of the line and wraps to the next line, the opening bracket does not move with the Hebrew word but stays behind, alone, on the previous line. It should stay with the Hebrew word. The problem does not occur with an English word in a Hebrew text. See also https://bugzilla.novell.com/show_bug.cgi?id=397090
Created attachment 65144 [details] Test document.
@HDU: please have a look.
@od: I suggest trying to keep BiDi runs together when determining the line break position. AFAIK only WriterEngine has this problem; EditEngine already seems to do it properly.
Update to the above: the BiDi runs for the parentheses and the RTL word are different. They still shouldn't be separated. There is probably already logic for this, since it is handled well in the non-BiDi case: the parentheses are kept with their content text.
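The idea in the two comments above can be illustrated with a toy model. This is a minimal Python sketch, not Writer code: `direction` and `break_positions` are hypothetical helpers, and the assumption that the faulty line breaker treats every direction change as a break opportunity is only a guess that the thread has not confirmed. Neutral characters such as parentheses are simply given the paragraph direction, which is a strong simplification of the Unicode BiDi algorithm.

```python
import unicodedata


def direction(ch, paragraph_dir="L"):
    """Rough direction of one character (hypothetical helper).

    Strong RTL classes map to 'R', strong LTR to 'L'; everything else
    (spaces, parentheses, ...) just takes the paragraph direction.
    """
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "R"
    if bidi == "L":
        return "L"
    return paragraph_dir


def break_positions(text, break_at_run_boundaries):
    """Return the indices at which a new line may start.

    A break is always allowed after a space. If break_at_run_boundaries
    is True, a break is also allowed wherever the direction changes
    (the assumed faulty behaviour).
    """
    positions = set()
    for i in range(1, len(text)):
        if text[i - 1] == " ":
            positions.add(i)
        elif (break_at_run_boundaries
              and text[i] != " "
              and direction(text[i - 1]) != direction(text[i])):
            positions.add(i)
    return sorted(positions)


# "see (shalom) here" with the bracketed word written in Hebrew letters.
sample = "see (\u05e9\u05dc\u05d5\u05dd) here"
print(break_positions(sample, break_at_run_boundaries=True))
print(break_positions(sample, break_at_run_boundaries=False))
```

With run boundaries allowed as break points, index 5 (between the opening parenthesis and the first Hebrew letter) becomes a legal break, which matches the reported symptom of the bracket staying behind. Restricting breaks to spaces keeps the parentheses and the word together even though they sit in different runs.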
Actually it would be nice if the parentheses belonged to the RTL run. Issue 89825 introduced a related approach for numerals, issue 16354 for punctuation characters. Especially with parentheses at the border of embedded bidi runs, we sometimes encounter problems. Directionality problems with parentheses can usually be fixed with LRM and RLM characters, but most users don't seem to be aware of this option. Unfortunately, I don't see a straightforward way to assign correct bidi properties to parentheses by context. But maybe it is worth discussing the possibilities and options?
@pmladek: The problem *does* also occur with an LTR word in e.g. Hebrew text, but you need to set the paragraph direction to RTL. What is interesting: the first case (RTL word in LTR paragraph) can be fixed by inserting an RLM before the first and after the second parenthesis. If you apply the same approach to the second case (inserting LRMs), the English word is broken apart when placed at the end of the line.
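The RLM workaround described above can be sketched in a few lines of Python. The helper name `wrap_with_marks` is hypothetical. The code only builds the string with the marks in place; how a particular layout engine then renders or breaks it is not something this sketch can show.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, invisible strong LTR character
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, invisible strong RTL character


def wrap_with_marks(bracketed, mark=RLM):
    """Put a directional mark before the opening and after the closing bracket.

    Hypothetical helper: with a strong RTL character on both sides, the
    neutral parentheses are surrounded by RTL context.
    """
    return f"{mark}{bracketed}{mark}"


hebrew = "(\u05e9\u05dc\u05d5\u05dd)"  # a Hebrew word in parentheses
fixed = wrap_with_marks(hebrew)

# The marks are strong directional characters, unlike the parentheses.
print(unicodedata.bidirectional(RLM), unicodedata.bidirectional(LRM))
print([unicodedata.bidirectional(c) for c in fixed])
```

For the mirrored case (an LTR word in an RTL paragraph) the same helper would be called with `mark=LRM`, although the comment above reports that this variant still breaks the English word apart at the end of the line.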
We should stay as close as possible to the BiDi algorithm (except for issues such as 100737), so changing the brackets' BiDi properties (which influence bracket mirroring) doesn't sound like such a good idea to me. I'll leave it to the expert users to decide on this though. I agree that it might be a good idea to use the same font for the parentheses/brackets/braces etc. as for the contained text (in this case the CTL font for the CTL text). The question of what to do with mixed content or with unbalanced brackets becomes non-trivial.
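For readers who want to see which properties are under discussion, here is a small Python sketch that looks them up in the Unicode Character Database through the standard `unicodedata` module. The helper `bidi_info` is hypothetical and only reads the properties; it does not change them.

```python
import unicodedata


def bidi_info(ch):
    """Return (bidi class, is mirrored) for one character (hypothetical helper)."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))


for ch in ["(", ")", "a", "\u05d0", "\u0627"]:
    print(f"U+{ord(ch):04X}", bidi_info(ch))
```

Parentheses are reported as neutral (`ON`) and mirrored, while Latin, Hebrew and Arabic letters carry the strong classes `L`, `R` and `AL`. This is why the direction of a parenthesis depends on its context.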
@hdu: Generally I agree with you on the subject of changing bidi properties. But in the case of brackets and parentheses, similar problems seem to pop up repeatedly. Therefore I feel the need to rethink the current situation once more. Some of these thoughts don't relate directly to this issue, but I think the broader view will also contribute to the current problem. Paired parentheses contain the notion of opening and closing. This is expressed visually, but only if both parentheses are directionally interpreted in the same way. So we mainly encounter problems in cases where one bracket has an unambiguous bidi context and the second one is on the boundary of bidi runs. I think these cases could be improved by applying the unambiguous context to the pair bracket. The example in this issue is different, because here we have a symmetric situation (only one direction inside the brackets). Directionally speaking, it does not make a difference whether the brackets are assigned to the outer or the inner run. They will swap their places, but the visual result is the same. Typographically, there is a difference: the brackets could belong to the inner or the outer script and would be displayed in the corresponding font. In any case, as you stated before, bracket and enclosed word shouldn't be separated even if they belong to different runs. In the current situation the paragraph direction determines the script type of the parentheses in the latter cases. In these paired situations, the script type could also be determined by the enclosed script or the outer script. Each mode of interpretation could lead to subtle differences in the visual appearance (depending on the fonts chosen). But I think it is also a more general (philosophical) question: Where do the brackets belong? To the outer or to the enclosed?
Added some experts to CC to join a constructive discussion.
> I think these cases could be improved by applying the unambiguous context to the pair bracket.
I agree that the pair should have matched properties.
> Where do the brackets belong? To the outer or to the enclosed?
IMHO brackets/braces/parentheses belong to the outer text: for me they mean something like CALL and RET, so they should belong to the calling context. The same applies to quotation marks.
While we are at it, we should also consider the default direction of the inner text: should it be defined by the outer text, by its "natural" direction, or by a flag of its own (e.g. a "bracket default direction" which defaults to the "paragraph default direction")?
> I agree that the pair should have matched properties.
So here is one detail that could be improved. The more I think of it, the concepts of opening and closing, inner and outer are most valid and need to be considered (and currently they are not!!!)
> IMHO brackets/braces/parentheses belong to the outer text:
Semantically speaking I agree with you. But visually they are placed closer to the enclosed text. Therefore I think there should be at least an option to display them in the same style (font) as the enclosed text. Other opinions on this?
In regular writing I seldom feel the need to force a change of the writing direction. There may be more special cases, but currently I cannot think of any sensible way to improve anything here in an automated manner.
In a few quick tests it looks as if WriterEngine already does something like bracket pairing for roman text. If this is so, I suggest making that code applicable and active for BiDi cases as well. This might be a reasonable first step.
@hdu: Can you be more specific? What kind of bracket pairing is Writer doing?
Add a bracketed word like "(hello)" to a line and experiment with it. Writer will not break the word and the brackets apart, even if it is spelled e.g. "( hello)". I haven't looked at the relevant Writer code though.
But this doesn't seem to be "pairing". If you delete the 2nd bracket, the result is the same. Stranger still: type "(hello )" and the closing bracket doesn't stay with the word. Also peculiar: you can add as many spaces as you want between the opening bracket and the word, and the spaces are treated as if they were hard spaces. It would be interesting to have a look at the code...
The same problems occur with Arabic. See attachment.
Created attachment 65426 [details] The same problems with arabic
Maybe you can look at gedit (source code available). This editor has wonderful handling of Unicode (without the need for configuration), e.g. western, Arabic and Hebrew. Please have a look at the screenshots. Brackets are also handled right.
Created attachment 65544 [details] unicode bracket handling in gedit
Created attachment 65545 [details] unicode bracket handling in gedit
Created attachment 65546 [details] unicode bracket handling in gedit
Created attachment 65547 [details] bracket handling in openoffice 3.1.1
From Mati Allouche:
a) I think that parentheses belong to the encompassing text and not to the text included within. As proof, consider the following logical string (where upper case represents Hebrew letters), displayed in a LTR paragraph:
    eng1 eng2 (HEB3 HEB4) eng5 eng6
If the whole string is displayed on one line, as shown below
    eng1 eng2 (4BEH 3BEH) eng5 eng6
it does not matter if the left parenthesis is an open parenthesis associated with the English text or a closing parenthesis associated with the Hebrew text and subject to symmetric swapping (and reversely for the right parenthesis). But if the string is broken into 2 lines, associating the parentheses with the encompassing text will display as
    eng1 eng2 (3BEH
    4BEH) eng5 eng6
while associating the parentheses with the inner text will display as
    eng1 eng2 3BEH)
    (4BEH eng5 eng6
I think that the first display is the preferred one.
b) The problem does not seem to be related to directionality, but to the algorithm for determining line breaks (it might be that the algorithm considers directional runs boundaries as allowed break points). Why this algorithm behaves differently for LTR and RTL text, at least when parentheses are concerned, is part of the issue.
c) If the problem is not related to directionality, changing the Bidi properties of parentheses is not going to fix it.
d) Changing the Bidi properties of any character to values different from specified by Unicode is a bad idea anyway. I hope that there is no need to justify this statement.
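The displays in item a) can be reproduced with a toy renderer. This is a minimal Python sketch under stated assumptions: it follows the comment's convention that a word containing upper-case letters stands for Hebrew, it assumes an LTR paragraph with single spaces between words, and it is not an implementation of the Unicode BiDi algorithm. The names `resolve_dirs` and `display` and the `brackets` parameter are hypothetical.

```python
MIRROR = {"(": ")", ")": "("}


def resolve_dirs(line, brackets="outer"):
    """Assign 'L' or 'R' to every character of one line.

    brackets="outer": parentheses take the paragraph direction (LTR).
    brackets="inner": parentheses take the direction of their word.
    """
    dirs = []
    for word in line.split(" "):
        word_dir = "R" if any(c.isupper() for c in word) else "L"
        for ch in word:
            if ch in MIRROR and brackets == "outer":
                dirs.append("L")
            else:
                dirs.append(word_dir)
        dirs.append(None)  # placeholder for the space after the word
    dirs.pop()  # the last word has no trailing space
    for i, d in enumerate(dirs):
        if d is None:
            # A space joins an RTL run only if both neighbours are RTL.
            dirs[i] = "R" if dirs[i - 1] == "R" and dirs[i + 1] == "R" else "L"
    return dirs


def display(line, brackets="outer"):
    """Return the visual order of one line in an LTR paragraph."""
    dirs = resolve_dirs(line, brackets)
    out, i = [], 0
    while i < len(line):
        j = i
        while j < len(line) and dirs[j] == dirs[i]:
            j += 1
        run = line[i:j]
        if dirs[i] == "R":
            # RTL runs are reversed and their parentheses mirrored.
            run = "".join(MIRROR.get(c, c) for c in reversed(run))
        out.append(run)
        i = j
    return "".join(out)


for mode in ("outer", "inner"):
    print(mode, "| one line :", display("eng1 eng2 (HEB3 HEB4) eng5 eng6", mode))
    print(mode, "| two lines:", display("eng1 eng2 (HEB3", mode),
          "/", display("HEB4) eng5 eng6", mode))
```

On a single line both modes give the same picture. Once the line is split after HEB3, only the "outer" mode keeps each parenthesis on the side where a reader expects it, which is the point of item a).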
Mati and Alan: Thanks for your expert comments. I agree with a, b and c. These items also confirm that having this issue assigned to the WriterEngine and EditEngine team is correct. For item d I agree in principle but I'd also point to issue 100737.
Due to limited resources, it is not clear whether this issue can be solved for OOo 3.3. To be honest about that, I am adjusting the target.
Issue 112240, which blocked this issue, was fixed. Can someone look at this issue again for 3.4? Thanks.
Reset assignee to the default "issues@openoffice.apache.org".