Issue 22579

Summary: incorrect import : HTML page with CKJ characters coded in hexadecimal
Product: General Reporter: lcn <lcn>
Component: code Assignee: AOO issues mailing list <issues>
Status: ACCEPTED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: damjan, issues, lcn
Version: OOo 1.1 Keywords: needmoreinfo
Target Milestone: AOO Later   
Hardware: PC   
OS: Windows 2000   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description Flags
HTML files for test. md5 signature : da3d80ef80f1c010e68755b20aae48a6 none

Description lcn 2003-11-17 20:59:52 UTC
Incorrect import: HTML page with CKJ (Chinese Korean Japanese) characters
coded in hexadecimal.
This problem may affect all the programs of OpenOffice.org: Writer,
Spreadsheet, Presentation, Draw, HTML writer...

CKJ characters can be coded in two ways, for example あ:
decimal &#12354; or hexadecimal &#x3042;.

When importing into Writer or Spreadsheet:
With the decimal code, the character is imported correctly. With the hexadecimal
code, the literal HTML code is imported instead of the character.
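
For reference, the two notations name the same code point, so an importer would be expected to produce the same character for both. A minimal check, using Python's standard `html.unescape` purely as an independent reference decoder (it is not part of OpenOffice.org, and the variable names are illustrative):

```python
import html

decimal_ref = "&#12354;"  # decimal form from the report
hex_ref = "&#x3042;"      # hexadecimal form from the report

# 12354 (decimal) and 0x3042 (hexadecimal) are the same number,
# so both references name the code point U+3042.
same_code_point = (12354 == 0x3042)

# A reference decoder should map both notations to the same character.
decoded_decimal = html.unescape(decimal_ref)
decoded_hex = html.unescape(hex_ref)

print(same_code_point, decoded_decimal == decoded_hex)
```

If the import behaves as described above, only the decimal form would match this reference output.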

I'll post a zip file containing HTML files for testing.
Comment 1 lcn 2003-11-17 21:07:49 UTC
Created attachment 11352 [details]
HTML files for test. md5 signature : da3d80ef80f1c010e68755b20aae48a6
Comment 2 frank 2003-11-18 10:07:14 UTC
Hi,

As this is not only a Calc problem but one of the edit engine, I am
changing the component to framework and reassigning it to the appropriate
developer.

Frank
Comment 3 malte_timmermann 2003-11-19 18:05:43 UTC
Henning...
Comment 4 openoffice 2003-11-24 10:02:26 UTC
accepted
Comment 5 lcn 2003-11-24 20:15:56 UTC
It seems that this affects not only CKJ but all characters (ASCII,
accented, CKJ, ...) coded in hexadecimal in HTML pages.

Comment 6 Marcus 2017-05-20 11:29:38 UTC
Reset assignee to the default "issues@openoffice.apache.org".
Comment 7 damjan 2023-01-04 14:10:57 UTC
(In reply to lcn from comment #5)
> It seems that this affects not only CKJ but all characters (ASCII,
> accented, CKJ, ...) coded in hexadecimal in HTML pages.

All 3 sample documents look the same now, and my tests show that hexadecimally coded ASCII (e.g. &#x5a; for "Z") looks right. Please confirm whether this is still an issue.

I believe the parsing happens in HTMLParser::ScanText() in main/svtools/source/svhtml/parhtml.cxx, and it supports both hex and decimal encoding.
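
To illustrate what "supports both hex and decimal encoding" means in practice, here is a minimal sketch of a numeric character reference decoder. It is not the code in parhtml.cxx; the function name `decode_numeric_refs` and the regular expression are assumptions made for this illustration only.

```python
import re

# Matches "&#x...;" / "&#X...;" (hexadecimal) or "&#...;" (decimal).
# Hypothetical pattern for illustration, not taken from parhtml.cxx.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            code_point = int(hex_digits, 16)  # hexadecimal branch
        else:
            code_point = int(dec_digits, 10)  # decimal branch
        # Leave the reference untouched if it is not a usable code point
        # (zero, a surrogate, or beyond the Unicode range).
        if code_point == 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _NUMERIC_REF.sub(_replace, text)


print(decode_numeric_refs("&#x5a; &#12354; &#x3042;"))
```

A decoder that lacked the hexadecimal branch would leave `&#x3042;` in the text as-is, which matches the symptom originally reported.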