Issue 103308

Summary: HTML import mangles non-BMP unicodes
Product: Writer Reporter: hdu <hdu>
Component: open-import Assignee: AOO issues mailing list <issues>
Status: CONFIRMED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: damjan, issues
Version: OOO310m14   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: DEFECT Latest Confirmation in: 4.2.0-dev
Developer Difficulty: ---
Issue Depends on:    
Issue Blocks: 102943    
Attachments:
Description Flags
bugdoc none

Description hdu@apache.org 2009-07-03 08:30:23 UTC
The HTML filter uses the 16-bit type sal_Unicode for all its text processing needs, so it strips off the 
most significant bits of Unicode code points beyond the Basic Multilingual Plane (BMP). This results in a mangled import.
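
To illustrate the truncation described above, here is a minimal standalone sketch. It is not code from the filter: Char16 is only a stand-in for sal_Unicode, and both function names are hypothetical. It contrasts narrowing a code point to 16 bits with splitting it into a UTF-16 surrogate pair.

```cpp
#include <cstdint>

// Stand-in for the 16-bit sal_Unicode type (assumption for this sketch).
using Char16 = std::uint16_t;

// Hypothetical: what narrowing to a 16-bit type does to a code point.
// Everything above bit 15 is lost, so U+1F600 becomes U+F600.
Char16 truncateToChar16(std::uint32_t codePoint)
{
    return static_cast<Char16>(codePoint & 0xFFFF);
}

// Hypothetical lossless alternative: encode a non-BMP code point as a
// UTF-16 surrogate pair. Returns false if the value is inside the BMP
// or outside the Unicode range, in which case no pair is produced.
bool toSurrogatePair(std::uint32_t codePoint, Char16& high, Char16& low)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        return false;
    const std::uint32_t offset = codePoint - 0x10000;
    high = static_cast<Char16>(0xD800 + (offset >> 10));   // top 10 bits
    low  = static_cast<Char16>(0xDC00 + (offset & 0x3FF)); // low 10 bits
    return true;
}
```

For U+1F600, truncateToChar16 yields 0xF600 (a private-use character), while toSurrogatePair yields the pair 0xD83D 0xDE00, which preserves the original code point.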
Comment 1 hdu@apache.org 2009-07-03 08:33:06 UTC
Created attachment 63344 [details]
bugdoc
Comment 2 hdu@apache.org 2009-07-03 08:38:13 UTC
Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in sw/source/filter/html/parcss1.cxx is 
probably a good starting point.
Comment 3 openoffice 2009-07-03 08:54:58 UTC
set target
Comment 4 Marcus 2017-05-20 11:18:15 UTC
Reset assignee to the default "issues@openoffice.apache.org".
Comment 5 damjan 2023-01-03 09:37:09 UTC
(In reply to hdu@apache.org from comment #2)
> Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in
> sw/source/filter/html/parcss1.cxx is 
> probably a good starting point.

Yes, but that's just CSS parsing. The remainder of the HTML parsing is in main/svtools/source/svhtml/parhtml.cxx, which, sadly like most of our codebase, also operates one Unicode code unit at a time, retrieved from SvParser::GetNextChar().

The function
inline sal_uInt16 GetCharSize() const;
got my hopes up: does it tell us the code point size?

inline sal_uInt16 SvParser::GetCharSize() const
{
    return (RTL_TEXTENCODING_UCS2 == eSrcEnc) ? 2 : 1;
}

No, it just returns the number of bytes per BMP character for the current encoding, a useless statistic.

SvParser does not have any functions for code points. We'd have to add them and change a lot of code - not just HTML parsing - to use them.
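
As a rough idea of what such a code point function might look like, here is a minimal sketch. It is an assumption for illustration only: nextCodePoint is a hypothetical helper, not an existing SvParser member, and it reads from a std::u16string rather than from the parser's input stream.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: read one code point from a sequence of UTF-16
// code units, advancing pos past the units consumed.
// Precondition: pos < units.size().
std::uint32_t nextCodePoint(const std::u16string& units, std::size_t& pos)
{
    const std::uint32_t first = units[pos++];
    // A high surrogate followed by a low surrogate forms one code point.
    if (first >= 0xD800 && first <= 0xDBFF && pos < units.size())
    {
        const std::uint32_t second = units[pos];
        if (second >= 0xDC00 && second <= 0xDFFF)
        {
            ++pos;
            return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
        }
    }
    // BMP character, or an unpaired surrogate passed through unchanged.
    return first;
}
```

Callers would then have to hold the result in a 32-bit type; storing it back into a sal_Unicode would reintroduce the truncation, which is why so much surrounding code would need to change.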