Issue 92577

Summary: Linebreaking does not work properly with Japanese punctuation
Product: Writer Reporter: larsko <lars>
Component: formatting Assignee: stefan.baltzer
Status: CLOSED FIXED QA Contact: issues@sw <issues>
Severity: Trivial    
Priority: P3 CC: curvirgo, issues, kamataki, karl.hong, masaya.k, ooo, tora3, y-catch
Version: OOo 3.0 Beta 2   
Target Milestone: ---   
Hardware: PC   
OS: All   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description: a sample file; Flags: none
Description: Bugdoc with all given example characters at line ends; Flags: none

Description larsko 2008-08-08 07:41:26 UTC
The line breaking algorithm does not take Japanese punctuation and some other 
characters into account. This causes the line to extend beyond the text border 
before starting a new line.

To reproduce, type some Japanese text just to the end of the line and then add 
a Japanese period (。) to end the sentence. The period character will be in the 
page margin.

The characters I've seen this occur with are period 。, comma 、, Kanji 
repetition character 々, Katakana long vowel character ー, small Katakana 
characters such as ョェュ.

Exact OpenOffice version is openoffice.org-core 1:2.4.1-6, Sun Jul 13 06:36:47 
UTC 2008 (Debian package).
Comment 1 michael.ruess 2008-08-08 09:17:08 UTC
MRU->ES: pls evaluate; maybe this is already fixed in 3.0...
Comment 2 eric.savary 2008-08-08 12:14:10 UTC
Still reproducible in DEV300m29
Comment 3 eric.savary 2008-08-08 12:15:32 UTC
Created attachment 55654 [details]
a sample file
Comment 4 frank.meies 2008-08-08 12:30:36 UTC
fme->larsko: This is a feature called 'hanging punctuation'. If you have the
Asian features enabled (Tools - Options - Language settings - Languages), you
will find a tab page 'Asian typography' containing a setting 'allow hanging
punctuation' in the Format - Paragraph dialog.
Comment 5 frank.meies 2008-08-08 12:30:56 UTC
.
Comment 6 larsko 2008-08-11 02:07:35 UTC
fme: thanks for the pointer. I've read up on hanging punctuation, but this only 
seems to include 。 and 、, not any of the other characters I've seen this 
occur with -- which aren't really punctuation (cf. http://www.w3.org/TR/jlreq/). 
Can you point me to the reference used when deciding that those 
characters should be included in hanging punctuation?
Comment 7 frank.meies 2008-08-11 07:54:44 UTC
fme->larsko: I think it's the characters listed in Tools - Options - Language
Settings - Asian Layout - Not at start of line.
Comment 8 larsko 2008-08-11 08:15:05 UTC
fme: So all those characters are allowed as hanging punctuation? Seems a bit 
much since the w3 recommendation only specifies dot and comma. The Japanese 
wikipedia article on this topic [1] also only specifies real punctuation. 
Should the small Kana etc. really be part of hanging punctuation?

[1] http://ja.wikipedia.org/wiki/ぶら下げ組み
Comment 9 frank.meies 2008-08-11 09:52:45 UTC
fme->larsko: Thanks for the pointer. It looks like making the hanging
punctuation depend on the forbidden characters is not correct.

fme->khong: Please have a look and take over. Looks like Word does not
allow all the not-at-start characters to be hanging punctuation either.

fme->tora: Any input from your side?
Comment 10 frank.meies 2008-08-11 09:53:17 UTC
.
Comment 11 tora3 2008-08-11 20:20:34 UTC
tora->fme: Thank you for giving me a chance to comment. 

Current implementation of OOo has two lists of characters: 
 (1) Not at start of line
 (2) Not at end of line

Theoretical implementation might have three lists of characters:
 (1) Not at start of line
 (2) Not at end of line
 (3) Punctuation 

Current implementation of OOo treats (1) as (3) while Word seems to use three 
lists. In Word, both (1) and (2) can be tweaked, in a similar way to OOo, 
through one of the following:
 - the tab Asian Typography of the menu Tools > Options
 - the button Options in the tab Asian Typography of the menu Format > Paragraph

Word names (1) "Cannot start line:" and (2) "Cannot end line:". 

(3) of Word could be specified somewhere or be hard-coded. I am not sure, 
but the dialog Properties of IME 2003 has a combo box listing the following 
combinations:
 (a) 、。 u3001 and u3002 (widely used for several purposes)
 (b) ,. uFF0C and uFF0E (sometimes used in a thesis, similar paper, book,...)
 (c) 、. u3001 and uFF0E (sometimes can be seen in a book or magazine)
 (d) ,。 uFF0C and u3002 (sometimes can be seen in a book or magazine)

http://www.unicode.org/charts/PDF/U3000.pdf
http://www.unicode.org/charts/PDF/UFF00.pdf

Some punctuation characters described above and some special characters such as 
「 and 」, u300C and u300D can be pushed within the margin. Word offers this 
feature while current OOo does not. A concept of this feature is illustrated 
in http://www.openoffice.org/nonav/issues/showattachment.cgi/18738/concept01.png
attached in the issue 36313. 

http://www.openoffice.org/nonav/issues/showattachment.cgi/18786/Japanese_Justification_0.1.sxw
attached in the issue 36408 does also try to describe the concept, but it has 
not been finished yet. 

In sum, it would be better if OOo had a third list (3) for the punctuation 
characters, which alone can hang beyond the margin, and the first list (1) 
should not be used for the hanging characters. 
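
A minimal sketch of the three-list idea, under loud assumptions: every character is one cell wide, list (2) "not at end of line" is ignored, and the names break_line, HANGING and NOT_AT_START are hypothetical. It is not OOo's break iterator, only an illustration of why (1) and (3) should be separate sets.

```python
# Illustrative only: names and lists are assumptions, not OOo's break iterator.
HANGING = set("\u3001\u3002\uff0c\uff0e")  # list (3): 、 。 , .
# list (1): the hanging characters plus 々 ー ョ ェ ュ 」
NOT_AT_START = HANGING | set("\u3005\u30fc\u30e7\u30a7\u30e5\u300d")


def break_line(text, width, hanging=HANGING):
    """Split off the first line, counting every character as one cell.

    The returned line may exceed `width` by one cell only when that extra
    character is in `hanging`. List (2), "not at end of line", is ignored.
    """
    if len(text) <= width:
        return text, ""
    cut = width
    # The line after a hanging character must not start with a list (1) character.
    next_ok = cut + 1 >= len(text) or text[cut + 1] not in NOT_AT_START
    if text[cut] in hanging and next_ok:
        cut += 1  # let the punctuation hang in the margin
    else:
        # Carry characters to the next line until it no longer starts
        # with a character from list (1).
        while cut > 1 and text[cut] in NOT_AT_START:
            cut -= 1
    return text[:cut], text[cut:]
```

Calling break_line(text, width, hanging=NOT_AT_START) mimics the behaviour reported in this issue: a character such as ー is pushed into the margin instead of being carried to the next line together with the character before it.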

In addition to the third list (3), a new feature could also be incorporated. 
The feature compresses the total width of a line to fit the margin by slightly 
shrinking the space between the characters in the line if the line ends with a 
hanging punctuation character or a combination of one or more hanging characters. 
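
The arithmetic behind such a compression could look like the sketch below. It assumes the overflow is spread evenly over the gaps between characters; the function name gap_adjustment and the even distribution are assumptions for illustration, not a description of what Word or OOo does.

```python
def gap_adjustment(char_widths, line_width):
    """Return how much to remove from each inter-character gap.

    `char_widths` are the advance widths of the characters on the line,
    including the trailing hanging punctuation; `line_width` is the
    distance between the margins. Returns 0.0 when the line already fits
    or when there is no gap to shrink.
    """
    overflow = sum(char_widths) - line_width
    gaps = len(char_widths) - 1
    if overflow <= 0 or gaps <= 0:
        return 0.0
    return overflow / gaps
```

For example, eleven characters of width 10 on a line of width 100 overflow by 10, so each of the ten gaps would shrink by 1.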
Comment 12 frank.meies 2008-08-12 08:37:44 UTC
fme->tora: Thank you for your detailed analysis of this issue. While I agree
that having a third list would be the perfect solution, I'm tempted to ask
whether we can't start with a hard-coded list? Looks like this list is
hard-coded in Word as well. Currently the LineBreakUserOptions which are passed
to the break iterator cannot hold a third list. So introducing a third list means:

1. Changing the UI
2. Changing the API
3. Changing settings.xml in the ODF files

Looks like a lot of work for this not-too-much-requested issue. Another point I'd
like to address is this: Changing the line break algorithm means that existing
documents might change their layout. Can we cope with this or should we
introduce some kind of (hidden) compatibility option so that only new documents
make use of the changed line break behavior?
Comment 13 karl.hong 2008-08-12 19:33:15 UTC
Forbidden rule characters are editable by end users, so they need to be passed
from writer to breakiterator. If the third hanging punctuation list is hidden
from end users, we can keep it in locale data, which will be known only inside
the i18npool module, and no API and UI changes are required.
Comment 14 karl.hong 2008-08-15 21:28:57 UTC
If we don't need the compatibility option fme mentioned, I will implement the third
list in locale data in i18npool in the next release.
Comment 15 tora3 2008-08-18 03:53:11 UTC
tora->khong: I agree with you. 
Comment 16 karl.hong 2008-08-19 05:34:22 UTC
khong->tora, I added a third list 

<LineBreakHangingCharacters>!,.:;?、。!,.:;?</LineBreakHangingCharacters>

in localedata for CJK languages. Please let me know if the list is sufficient.

Fixed in cws i18n45.
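
For readers who want to see what consuming such an element might look like, here is a small sketch. Only the element name LineBreakHangingCharacters comes from this issue; where it sits inside the real locale data file is not shown here, the function load_hanging_characters is hypothetical, and Python's xml.etree.ElementTree stands in for whatever i18npool actually uses.

```python
import xml.etree.ElementTree as ET

# The list from comment 16, written with escapes so the fullwidth
# characters are unambiguous: ! , . : ; ? then U+3001, U+3002, then the
# fullwidth forms of ! , . : ; ?
FRAGMENT = (
    "<LineBreakHangingCharacters>"
    "!,.:;?\u3001\u3002\uff01\uff0c\uff0e\uff1a\uff1b\uff1f"
    "</LineBreakHangingCharacters>"
)


def load_hanging_characters(xml_fragment):
    """Parse a single LineBreakHangingCharacters element into a frozenset."""
    element = ET.fromstring(xml_fragment)
    if element.tag != "LineBreakHangingCharacters":
        raise ValueError("unexpected element: " + element.tag)
    # An empty element has text None; treat that as an empty list.
    return frozenset(element.text or "")


hanging = load_hanging_characters(FRAGMENT)
```

A set makes the per-character lookup during line breaking cheap, and keeping the list in data rather than in the UI matches the point made in comment 13 that no API or UI change is needed.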
Comment 17 tora3 2008-08-19 11:33:37 UTC
tora->khong: Thank you for your implementation.

The list for Japanese might be either (a) or (b). 
(a) <LineBreakHangingCharacters>、。,.</LineBreakHangingCharacters>
(b) <LineBreakHangingCharacters>、。</LineBreakHangingCharacters>

I am asking for comments on the mailing list of the Japanese community and 
will let you know. 
Comment 18 karl.hong 2008-09-02 03:58:37 UTC
ready for QA.
Comment 19 tora3 2008-09-02 07:40:50 UTC
tora->khong: Could you revise the locale data of Japanese?

  <LineBreakHangingCharacters>、。,.</LineBreakHangingCharacters>

Notes:
 - 、 u3001 IDEOGRAPHIC COMMA
 - 。 u3002 IDEOGRAPHIC FULL STOP
 - , uFF0C FULLWIDTH COMMA
 - . uFF0E FULLWIDTH FULL STOP

References: 
 http://www.unicode.org/charts/PDF/U3000.pdf
 http://www.unicode.org/charts/PDF/UFF00.pdf
 http://www.unicode.org/Public/UNIDATA/NamesList.txt

Discussion: 
 http://www.freeml.com/openoffice/11243/latest (Japanese)
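
The four code points in the notes above can be checked against the Unicode character database, for instance with Python's standard unicodedata module. The helper name describe is made up for this sketch.

```python
import unicodedata

# The requested Japanese list, written with escapes to avoid any
# confusion between ASCII and fullwidth punctuation.
HANGING_JA = "\u3001\u3002\uff0c\uff0e"


def describe(chars):
    """Return 'uXXXX NAME' for each character, in the style of the notes."""
    return ["u%04X %s" % (ord(c), unicodedata.name(c)) for c in chars]


for line in describe(HANGING_JA):
    print(line)
```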
Comment 20 stefan.baltzer 2008-09-10 14:16:40 UTC
Added khong on c/c.
Stefan -> Karl: Please note tora's last question and comment. Thank you.
Comment 21 karl.hong 2008-09-10 19:33:51 UTC
khong->tora, yes, that is the list currently implemented in cws i18n45.
Comment 22 tora3 2008-09-18 06:13:15 UTC
tora->khong, thanks a lot.
Comment 23 stefan.baltzer 2008-09-18 10:09:25 UTC
SBA: Verified in CWS i18n45.
I will attach a bugdoc that has all given example characters from the initial
description at line ends. To see, add and remove "i" letters to "shift the text"
of the respective line.
Comment 24 stefan.baltzer 2008-09-18 10:10:37 UTC
Created attachment 56602 [details]
Bugdoc with all given example characters at line ends
Comment 25 stefan.baltzer 2008-10-06 11:31:40 UTC
Correcting target to OOo 3.1. CWS i18n45 is already integrated.
Comment 26 stefan.baltzer 2009-03-06 18:52:33 UTC
OK in OOO310_m3. Closing issue.