Issue 22579 - incorrect import: HTML page with CJK characters coded in hexadecimal
Summary: incorrect import: HTML page with CJK characters coded in hexadecimal
Status: ACCEPTED
Alias: None
Product: General
Classification: Code
Component: code
Version: OOo 1.1
Hardware: PC Windows 2000
Importance: P3 Trivial with 2 votes
Target Milestone: AOO Later
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: needmoreinfo
Depends on:
Blocks:
 
Reported: 2003-11-17 20:59 UTC by lcn
Modified: 2023-01-04 14:10 UTC
CC List: 3 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
HTML files for test. md5 signature : da3d80ef80f1c010e68755b20aae48a6 (4.54 KB, application/octet-stream)
2003-11-17 21:07 UTC, lcn

Description lcn 2003-11-17 20:59:52 UTC
Incorrect import: HTML page with CJK (Chinese, Japanese, Korean) characters 
coded in hexadecimal.
This problem may affect all OpenOffice.org programs: Writer, 
Spreadsheet, Presentation, Draw, HTML writer...

CJK characters can be coded in two ways. For example, あ can be written as:
decimal &#12354; or hexadecimal &#x3042;

When imported into Writer or Spreadsheet:
With the decimal code, the character is imported correctly. With the hexadecimal 
code, the raw HTML code is imported instead of the character.
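
For reference, both forms are numeric character references for the same code point (U+3042), so an importer would be expected to produce the same character for each. A minimal check, assuming Python's standard `html.unescape` as a stand-in reference decoder (it is not part of OpenOffice.org, and the variable names are only illustrative):

```python
import html

# Both references name the same code point, U+3042 (HIRAGANA LETTER A).
decimal_ref = "&#12354;"   # decimal form: 12354 == 0x3042
hex_ref = "&#x3042;"       # hexadecimal form

decoded_decimal = html.unescape(decimal_ref)
decoded_hex = html.unescape(hex_ref)

# A correct import should treat the two forms identically.
print(decoded_decimal == decoded_hex)
```

The reported behaviour corresponds to the hexadecimal form being left as literal text while the decimal form is decoded.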

I'll post a zip file containing HTML files for testing.
Comment 1 lcn 2003-11-17 21:07:49 UTC
Created attachment 11352
HTML files for test. md5 signature : da3d80ef80f1c010e68755b20aae48a6
Comment 2 frank 2003-11-18 10:07:14 UTC
Hi,

As this is not only a Calc problem but one in the edit engine, I am
changing the component to framework and reassigning it to the appropriate
developer.

Frank
Comment 3 malte_timmermann 2003-11-19 18:05:43 UTC
Henning...
Comment 4 openoffice 2003-11-24 10:02:26 UTC
accepted
Comment 5 lcn 2003-11-24 20:15:56 UTC
It seems that this affects not only CJK but all characters (ASCII, 
accented, CJK, ...) coded in hexadecimal in HTML pages.

Comment 6 Marcus 2017-05-20 11:29:38 UTC
Reset assignee to the default "issues@openoffice.apache.org".
Comment 7 damjan 2023-01-04 14:10:57 UTC
(In reply to lcn from comment #5)
> It seems that this affects not only CJK but all characters (ASCII, 
> accented, CJK, ...) coded in hexadecimal in HTML pages.

All 3 sample documents look the same now, and my tests show that hexadecimally coded ASCII (e.g. &#x5A; for "Z") looks right. Can you confirm whether this is still an issue?

I believe the parsing happens in HTMLParser::ScanText() in main/svtools/source/svhtml/parhtml.cxx, and it supports both hex and decimal encoding.
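
To make "supports both hex and decimal encoding" concrete, here is a rough sketch of the kind of decoding involved. It is written in Python for brevity and is not the code in parhtml.cxx; the function name `decode_numeric_refs` and the regular expression are assumptions made for illustration only.

```python
import re

# "&#" followed by either "x"/"X" plus hex digits, or plain decimal digits,
# and terminated by ";". Named entities such as "&amp;" are not handled here.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace decimal and hexadecimal numeric character references."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        # Pick the base according to which alternative matched.
        if hex_digits is not None:
            code_point = int(hex_digits, 16)
        else:
            code_point = int(dec_digits, 10)
        # Leave references beyond the Unicode range as literal text.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _NUMERIC_REF.sub(replace, text)


print(decode_numeric_refs("&#x5A; &#12354; &#x3042;"))
```

A parser that only implemented the decimal branch would leave `&#x3042;` untouched, which matches the symptom in the original description.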