Bug 48745 - Hyphenation results don't always equal OpenOffice result even with the same patterns
Status: NEW
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: general
Version: 0.95
Hardware: PC Windows XP
Importance: P3 normal
Target Milestone: ---
Assignee: fop-dev
Reported: 2010-02-15 12:10 UTC by onkelpax-fop
Modified: 2012-04-07 01:51 UTC (History)

Attachments
German hyphenation file (57.41 KB, application/octet-stream)
2010-02-15 12:10 UTC, onkelpax-fop
fix unpacking of hyphenation pattern values (1.29 KB, patch)
2010-02-15 15:17 UTC, Carlos Villegas

Description onkelpax-fop 2010-02-15 12:10:06 UTC
Created attachment 24988
German hyphenation file

As is already known, the hyphenation library has problems with patterns that contain digits such as 7 or 8. I found that HyphenationTree.unpackValues(int) extracts the characters ' and ( for these values instead of the digits. They differ from the correct digits by exactly 16 positions in the ASCII table. The following code change maps these characters back to the right ones:

    protected String unpackValues(int k) {
        StringBuilder buf = new StringBuilder();
        byte v = this.vspace.get(k++);
        while (v != 0) {
            char c = (char)((v >>> 4) - 1 + '0');
            if (!Character.isDigit(c)) {
                /* #21219: Workaround for a bug that sometimes occurs. Just
                 * shift the ASCII position by a correction offset. */
                c += 16;
            }
            buf.append(c);
            c = (char)(v & 0x0f);
            if (c == 0) {
                break;
            }
            c = (char)(c - 1 + '0');
            if (!Character.isDigit(c)) {
                /* #21219: Workaround for a bug that sometimes occurs. Just
                 * shift the ASCII position by a correction offset. */
                c += 16;
            }
            buf.append(c);
            v = this.vspace.get(k++);
        }
        return buf.toString();
    }

But there is another problem, which shows up in languages whose patterns commonly contain these two digits. Please compare the hyphenation result for the German word "Flickenteppich" (pattern: .fli7ck8en7tep7pic8h) with OpenOffice's result. OpenOffice does not generate a hyphenation like "Flick-enteppich", but FOP does, even with the cheap bug fix above. The relevant pattern explicitly prohibits a break at this position in the word (the 8 after "ck"). Other implementations of Liang's algorithm do honor this rule (see http://www.davidashen.net/texhyphj.html or LibHnj, used by OpenOffice).
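
To make the expected behavior concrete, here is a minimal sketch of how Liang's algorithm treats a pattern like the one above. It is an illustration only, not FOP's or LibHnj's implementation: the class LiangSketch and the method hyphenate are hypothetical names, and minimum prefix/suffix lengths are ignored. A digit in a pattern is a level for the position it occupies; where several patterns match, the highest level wins, and only odd levels permit a break.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of Liang-style pattern matching; not FOP code.
class LiangSketch {

    static String hyphenate(String word, List<String> patterns) {
        // Word boundaries are marked with dots, as in the pattern files.
        String w = "." + word.toLowerCase(Locale.ROOT) + ".";
        // levels[i] is the level of the position just before w.charAt(i).
        int[] levels = new int[w.length() + 1];

        for (String p : patterns) {
            // Split the pattern into its letters and the level before/after each letter.
            StringBuilder letters = new StringBuilder();
            List<Integer> vals = new ArrayList<>();
            vals.add(0);
            for (char c : p.toCharArray()) {
                if (Character.isDigit(c)) {
                    vals.set(vals.size() - 1, c - '0');
                } else {
                    letters.append(c);
                    vals.add(0);
                }
            }
            // Apply the pattern wherever its letters occur, keeping the maximum level.
            String key = letters.toString();
            for (int at = w.indexOf(key); at >= 0; at = w.indexOf(key, at + 1)) {
                for (int j = 0; j < vals.size(); j++) {
                    levels[at + j] = Math.max(levels[at + j], vals.get(j));
                }
            }
        }

        // Odd level: break allowed. Even level (including 0): no break.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (i > 0 && levels[i + 1] % 2 == 1) { // +1 skips the leading dot
                out.append('-');
            }
            out.append(word.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hyphenate("Flickenteppich",
                Arrays.asList(".fli7ck8en7tep7pic8h")));
    }
}
```

With only the pattern above loaded, the sketch yields Fli-cken-tep-pich: the 8 between "ck" and "en" is even, so no break is placed after "Flick".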

My questions are: Is this issue known? If so, is there an existing tracker entry for this bug? When will it be fixed?

Best regards
PAX
Comment 1 Carlos Villegas 2010-02-15 15:17:32 UTC
Created attachment 24989
fix unpacking of hyphenation pattern values

Thanks to PAX for pointing out the problem area. Not only unpackValues but also getValues needed a similar fix. The proper fix is to mask the lower 4 bits of the packed value after shifting.
The example mentioned in the report now works, I think.
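
For illustration, the effect of masking can be reproduced outside FOP. The sketch below is not the attached patch. It assumes the packing implied by the unpackValues code in the description: each digit is stored as digit + 1 in one nibble, high nibble first, with a zero byte as terminator. NibbleUnpackSketch, pack and unpack are hypothetical names. Because a Java byte is signed, a high nibble of 8 or more makes the byte negative; it is sign-extended to int before the shift, so bits above the nibble survive unless they are masked off.

```java
// Hypothetical illustration; the layout is inferred from the code quoted in this report.
class NibbleUnpackSketch {

    // Packs a digit string such as "7080" into nibbles (digit + 1, high nibble first).
    static byte[] pack(String digits) {
        int n = digits.length();
        byte[] out = new byte[n / 2 + 1]; // leaves a zero nibble or byte as terminator
        for (int i = 0; i < n; i++) {
            int nibble = digits.charAt(i) - '0' + 1; // 1..10, never 0
            if ((i & 1) == 0) {
                out[i / 2] = (byte) (nibble << 4);
            } else {
                out[i / 2] |= (byte) nibble;
            }
        }
        return out;
    }

    // Unpacks the digits again; 'mask' toggles the "& 0x0f" after the shift.
    static String unpack(byte[] packed, boolean mask) {
        StringBuilder buf = new StringBuilder();
        int k = 0;
        byte v = packed[k++];
        while (v != 0) {
            int high = v >>> 4;   // v is sign-extended to int before shifting
            if (mask) {
                high &= 0x0f;     // keep only the four bits of the nibble
            }
            buf.append((char) (high - 1 + '0'));
            int low = v & 0x0f;
            if (low == 0) {
                break;            // odd number of digits: low nibble is the terminator
            }
            buf.append((char) (low - 1 + '0'));
            v = packed[k++];
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("7080");
        System.out.println(unpack(packed, false)); // without the mask
        System.out.println(unpack(packed, true));  // with the mask
    }
}
```

Under these assumptions, unpack(pack("7080"), false) returns '0(0, which is the symptom from the description, while unpack(pack("7080"), true) returns 7080.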
Comment 2 Glenn Adams 2012-04-07 01:42:36 UTC
resetting P2 open bugs to P3 pending further review