24101 – xml-export failed, wrong character, non UTF-8

Summary: xml-export failed, wrong character, non UTF-8

Status:	CLOSED FIXED

Alias:	None

Product:	Infrastructure
Classification:	Infrastructure
Component:	Bugzilla
Version:	current
Hardware:	All All

Importance:	P2 Trivial
Target Milestone:	---
Assignee:	Unknown
QA Contact:	issues@www

URL:
Keywords:

Depends on:
Blocks:

Reported:	2004-01-05 12:12 UTC by hans_werner67
Modified:	2004-07-06 18:34 UTC
CC List:	3 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Description hans_werner67 2004-01-05 12:12:07 UTC

An invalid character was found in text content. Error processing resource
'http://www.openoffice.org/issues/xml.cgi?id=22604'.

The encoding of an exported character does not conform to UTF-8. Please check it.
Thx

Comment 1 hans_werner67 2004-01-05 12:13:44 UTC

added tbo on cc-list

Comment 2 stefan.baltzer 2004-01-05 16:12:38 UTC

SBA: Just for the record... Happened also with issue 20287.

Comment 3 stx123 2004-01-05 16:25:22 UTC

reassigning to support

Comment 4 hans_werner67 2004-01-13 09:53:46 UTC

Also for #13964#, #23923#, #23294#

Comment 5 bjoern.milcke 2004-01-13 10:37:28 UTC

Same problem with Issues 1489 and 5289

Comment 6 stephan.wunderlich 2004-01-13 10:50:37 UTC

Same problem for the issues 23420, 23415, 23414, 23398, 23301, 23300, and 23295

Comment 7 Unknown 2004-01-30 22:56:20 UTC

fma: could you please elaborate a bit more on the problem? I'm not receiving any
errors while viewing issues exported as xml. thanks

Comment 8 hans_werner67 2004-02-02 09:40:16 UTC

OK, here are some details of the problem, e.g. for 20287:

The export via http://openoffice.org/issues/xml.cgi?id=20287 should contain only
UTF-8 characters, but some are ISO-8859. I think petermüller in the cc-list
is ISO-encoded, so the import fails with a simple XML parser:
Stack for 20287:
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
or use a MS Internet Explorer for the url
http://openoffice.org/issues/xml.cgi?id=20287, it fails at the cc-line.

I believe the switch to the new IssueTracker introduced a problem with
ISO-encoded input. All IssueTracker pages are now served as UTF-8, but the
backend does not validate that incoming data is UTF-8, and the export seems to
write out the database content unchanged. So if non-UTF-8 characters come in,
non-UTF-8 characters come out. That's the problem. Many browsers support UTF-8,
but if a browser is hard-set to an ISO encoding or doesn't support UTF-8, we
will have this problem!
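
The failure mode described above can be reproduced in a few lines. This is a minimal sketch in Python, not the IssueTracker code: the XML fragment and the `cc` element are made up for illustration, and only the name with the umlaut is taken from this issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical export fragment: the header declares UTF-8, but the name was
# stored as ISO-8859-1, so "ü" is the single byte 0xFC.
header = b'<?xml version="1.0" encoding="UTF-8"?>'
body_latin1 = "<cc>petermüller</cc>".encode("iso-8859-1")
body_utf8 = "<cc>petermüller</cc>".encode("utf-8")


def parses(document: bytes) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(parses(header + body_utf8))    # correctly encoded export
print(parses(header + body_latin1))  # raw ISO-8859-1 bytes behind a UTF-8 header
```

The second document is rejected because the byte 0xFC cannot start a valid UTF-8 sequence, which is the same kind of error Xerces reports in the stack trace.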

If you have further questions, feel free to ask

Comment 9 Unknown 2004-02-03 13:54:04 UTC

I was able to see the problem by formatting the issues as XML. I filed
internal issue 25809 for this and will update here once the engineers update
the internal issue.

Thanks,
Priya

Comment 10 Unknown 2004-02-13 11:39:03 UTC

Queried the engineer on this issue; waiting for his response.

Priya

Comment 11 Unknown 2004-02-17 21:24:48 UTC

closing as wontfix. this is behaving as designed. the xml-export is generating
correctly however certain browsers have errors related to displaying the
content. the export was not intended to be used via browsers. browsers are
intended to use IZ querying and use the displayed results. (if you disagree with
this summary, I'd suggest filing a new enhancement request and I'll have it
filed internally against IZ's replacement, scarab).

Comment 12 Martin Hollmichel 2004-02-17 21:39:04 UTC

hmm, If I understand this correctly this worked fine before the upgrade, so I
don't accept this as "worked as designed" but as a regression. please correct me
if I'm totally wrong with this. since our internal workflow is depending on
this, I consider this as a data loss issue for us.

Comment 13 hans_werner67 2004-02-18 07:33:48 UTC

Kenneths, I can't understand why you say: '... the export was not intended to be
used via browsers. browsers are intended to use IZ querying and use the
displayed results. ...'

On 2nd Feb I gave you a stack trace of a Java XML parser failing on the export.
This isn't only a problem of browsers displaying issue content as XML. You are
exporting non-UTF-8 characters, but the header declares the document as UTF-8.
That's the problem. It didn't occur in the old IssueZilla version. In my eyes
this is a clear regression bug, not a feature. Sorry.

I believe you have invalid UTF-8 characters in your database and export them
via the old cgi-script without validating whether they conform to UTF-8. Please
find a solution. Thanks

Comment 14 Unknown 2004-02-25 21:27:36 UTC

the last commentary I had from before I split up the issues is this:

<snip>it is the case that stray iso-8859-1 characters are not properly encoded
in the xml (it's the umlaut in peter's name), but that doesn't affect generation
- only validation.  I can still generate the file.

I think my confusion lay in that our engineers commented on the above problem in
our internal issue, which was mistakenly lumped into a seemingly duplicate issue
for a similar problem for another customer. I have filed a new internal issue
for the engineers to review. (Corrected whiteboard)

Comment 15 Unknown 2004-03-10 23:17:52 UTC

The engineers understand the problem and will be providing a patch to OOo's site
at some point in the future to address this problem. I'll be discussing this on
our weekly call tomorrow and will update this issue afterwards with an updated
comment and hopefully a timeframe estimate.

Comment 16 stx123 2004-03-26 09:33:35 UTC

Frank, this issue should be fixed. Could you please verify?
Thanks, Stefan

Comment 17 Unknown 2004-03-29 14:20:14 UTC

Assigning this to the reporter as per ST.

Comment 18 hans_werner67 2004-03-31 15:02:10 UTC

It works now for the cc-field and issue 20287, but not for an issue with
non-UTF-8 chars in the comment field, e.g. 22604, which has some umlauts in it.
Please extend the solution to the comment fields as well.

Thanks

Comment 19 Unknown 2004-03-31 15:19:42 UTC

fma:

can you please let us know what steps you followed for testing this issue?

Thanks
Priya

Comment 20 stx123 2004-03-31 16:15:47 UTC

I used
$ wget -O- http://www.openoffice.org/issues/xml.cgi?id=22604 | iconv -f utf8 -t utf8
and got an error at
[...]
<long_desc>
   <who>gh</who>
   <issue_when>2003-12-19 03:54:04</issue_when>
   <thetext>Spec:
Third party applications like the testtool have to find the StarOffice. In the
cws networker2 we agreed, that the setup still writes a file named
iconv: illegal input sequence at position 3210
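
The same check can be done without iconv. The following is a minimal sketch in Python that reports where the first invalid UTF-8 sequence starts; the function name and the sample bytes are made up for illustration, and the offset it reports is not guaranteed to match the position printed by iconv.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Made-up sample: valid ASCII followed by an ISO-8859-1 "ä" (byte 0xE4).
prefix = b"<thetext>Spec: "
sample = prefix + "ä".encode("iso-8859-1") + b"</thetext>"

print(first_invalid_utf8_offset(sample))               # offset of the 0xE4 byte
print(first_invalid_utf8_offset("ä".encode("utf-8")))  # None: valid UTF-8
```

To test a real export, the bytes returned by xml.cgi would be passed to the function in place of `sample`.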

Comment 21 Unknown 2004-04-14 22:21:58 UTC

whiteboard update

Comment 22 Unknown 2004-04-15 01:19:46 UTC

I spoke with engineering on this issue. Initially they wanted to fix this in
version 2.6.4.
Since this is not qualified for Solaris, I have told engineering that the 2.6.4
fix is not acceptable.
I have now gone to the release managers to find out which release this will be
fixed in.
They have input this issue to their council meeting for discussion tomorrow to
see how to best address this from here.
I should have another update by the end of the week at the latest.
Eric

Comment 23 Unknown 2004-04-16 00:10:44 UTC

Hi Stefan,

solution proposed:
(Would it be acceptable to replace non UTF-8 characters with a "?") 

Thanks,
Eric
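
The proposed replacement could look roughly like this. It is a sketch in Python under the assumption that the export has the stored bytes at hand before writing them out; the handler name `question_mark` and the function `sanitize_utf8` are hypothetical and are not taken from the actual fix.

```python
import codecs


def _question_mark(err: UnicodeError):
    """Decoding error handler: substitute "?" for each undecodable sequence."""
    if isinstance(err, UnicodeDecodeError):
        return ("?", err.end)  # replacement text, offset at which to resume
    raise err


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("question_mark", _question_mark)


def sanitize_utf8(data: bytes) -> bytes:
    """Return valid UTF-8; bytes that are not valid UTF-8 become "?"."""
    return data.decode("utf-8", errors="question_mark").encode("utf-8")


print(sanitize_utf8("äöü".encode("iso-8859-1")))                       # b'???'
print(sanitize_utf8("äöü".encode("utf-8")) == "äöü".encode("utf-8"))  # True
```

Text that is already valid UTF-8 passes through unchanged, so only the stray ISO-8859-1 bytes are lost.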

Comment 24 hans_werner67 2004-04-16 08:33:25 UTC

Hi Eric,
accepted. This is no more than what the user currently sees in the browser with
UTF-8 encoding, so it should be a solution. Thx.

Comment 25 Unknown 2004-04-16 21:12:34 UTC

I let engineering know that this is an acceptable solution.
I will keep monitoring the internal issue to let you know how the work on this
solution progresses.
Eric

Comment 26 Unknown 2004-04-22 01:15:39 UTC

The internal issue now resides in my inst engineers' queue to have this applied
to the solstage site for testing.
Eric

Comment 27 Unknown 2004-04-28 21:38:57 UTC

The instance set is being tested to make sure there are no incompatibilities
with Solaris.
Awaiting results.
Eric

Comment 28 Unknown 2004-05-06 04:39:06 UTC

This has been tested & rolled to the next instance set rollout.

Eric

Comment 29 Unknown 2004-05-12 00:08:50 UTC

I am trying to get the new instance set applied to stage by tomorrow.
Eric

Comment 30 Unknown 2004-05-12 23:47:41 UTC

Hi There,
I just received confirmation that this instance set has been rolled out onto the
stage box.
Please verify & give feedback concerning this issue.
Thanks,
Eric

Comment 31 hans_werner67 2004-05-13 10:20:17 UTC

Export verified on staging server, it works.
Test issue was 1139. Use a browser, set an encoding other than UTF-8 (e.g.
Western), and enter umlauts like äöü etc. The export gives ??? back. Tested for
summary and description.
Thx. Frank

Comment 32 Unknown 2004-06-08 16:40:08 UTC

This has been confirmed fixed on the stage system.

Comment 33 Unknown 2004-06-10 07:32:28 UTC

Can we set this to resolved, as it's fixed on the stage system?

Comment 34 stx123 2004-06-10 19:49:41 UTC

Following your example I added the entry "verified_on_stage".
I would like to track the issue until it's available on the production site.
If you think we should clarify the issue handling feel free to contact me via
direct mail.

Comment 35 Unknown 2004-07-06 15:52:31 UTC

wget -O- http://www.openoffice.org/issues/xml.cgi?id=22604 | iconv -f utf8 -t utf8
works for me on production.  closing.

Comment 36 stx123 2004-07-06 18:34:24 UTC

verified and closing