Issue 22579

Summary:

Incorrect import: HTML page with CJK characters coded in hexadecimal

Product:

General

Reporter:

lcn <lcn>

Component:

code

Assignee:

AOO issues mailing list <issues>

Status:

ACCEPTED ---

QA Contact:

Severity:

Trivial

Priority:

CC:

damjan, issues, lcn

Version:

OOo 1.1

Keywords:

needmoreinfo

Target Milestone:

AOO Later

Hardware:

OS:

Windows 2000

Issue Type:

DEFECT

Latest Confirmation in:

---

Developer Difficulty:

---

Attachments:

Description	Flags
HTML files for test. md5 signature : da3d80ef80f1c010e68755b20aae48a6	none

Description lcn 2003-11-17 20:59:52 UTC

Incorrect import: HTML page with CJK (Chinese, Japanese, Korean) characters
coded in hexadecimal.
This problem may affect all OpenOffice.org programs: Writer,
Spreadsheet, Presentation, Draw, HTML writer...

CJK characters can be coded in two ways. For example, あ can be written as:
decimal &#12354; or hexadecimal &#x3042;

When importing into Writer or Spreadsheet:
With the decimal code, the character is imported correctly. With the
hexadecimal code, the literal HTML code is imported instead of the character.
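
The expected result can be checked outside OpenOffice.org. A minimal sketch, using Python's standard `html.unescape` purely as a reference decoder (it is not part of OpenOffice.org), shows that both forms should yield the same character:

```python
import html

decimal_ref = "&#12354;"   # decimal numeric character reference
hex_ref = "&#x3042;"       # hexadecimal reference to the same code point

decoded_decimal = html.unescape(decimal_ref)
decoded_hex = html.unescape(hex_ref)

# Both references name U+3042 (HIRAGANA LETTER A), so the results should match.
print(decoded_decimal == decoded_hex == "\u3042")
```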

I'll post a zip file containing HTML files for testing.

Comment 1 lcn 2003-11-17 21:07:49 UTC

Created attachment 11352 [details]
HTML files for test. md5 signature : da3d80ef80f1c010e68755b20aae48a6

Comment 2 frank 2003-11-18 10:07:14 UTC

Hi,

As this is not only a Calc problem but one of the edit engine, I am
changing the component to framework and reassigning it to the appropriate
developer.

Frank

Comment 3 malte_timmermann 2003-11-19 18:05:43 UTC

Henning...

Comment 4 openoffice 2003-11-24 10:02:26 UTC

accepted

Comment 5 lcn 2003-11-24 20:15:56 UTC

It seems that this affects not only CJK but all characters (ASCII,
accented, CJK, ...) coded in hexadecimal in HTML pages.

Comment 6 Marcus 2017-05-20 11:29:38 UTC

Reset assignee to the default "issues@openoffice.apache.org".

Comment 7 damjan 2023-01-04 14:10:57 UTC

(In reply to lcn from comment #5)
> Seems that it affects not only CJK but all characters (ASCII, 
> accented, CJK,... ) coded in hexadecimal in HTML pages.

All 3 sample documents look the same now, and my tests show that hexadecimally coded ASCII (e.g. &#x5a; for "Z") looks right. Please confirm whether this is still an issue.

I believe the parsing happens in HTMLParser::ScanText() in main/svtools/source/svhtml/parhtml.cxx, and it supports both hex and decimal encoding.
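
For illustration only, here is a minimal sketch of how a text scanner can accept both forms. It is written in Python and is not the code in parhtml.cxx; the function name `decode_numeric_refs` and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical pattern for this sketch: "&#" followed by either "x"/"X" plus
# hex digits, or plain decimal digits, and terminated by ";".
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace decimal and hexadecimal numeric character references in text."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        # Choose the base from which branch of the pattern matched.
        if hex_digits:
            code_point = int(hex_digits, 16)
        else:
            code_point = int(dec_digits, 10)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the reference untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(replace, text)
```

A decoder that only handled the decimal branch would leave `&#x3042;` in the output as literal text, which would match the symptom in the original description.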