Apache OpenOffice (AOO) Bugzilla – Issue 24101
xml-export failed, wrong character, non UTF-8
Last modified: 2004-07-06 18:34:24 UTC
An invalid character was found in text content. Error processing resource 'http://www.openoffice.org/issues/xml.cgi?id=22604'. The encoding of an exported character is not valid UTF-8. Please check it. Thx
added tbo on cc-list
SBA: Just for the record... Happened also with issue 20287.
reassigning to support
Also for #13964#, #23923#, #23294#
Same problem with Issues 1489 and 5289
Same problem for the issues 23420, 23415, 23414, 23398, 23301, 23300, and 23295
fma: could you please elaborate a bit more on the problem? I'm not receiving any errors while viewing issues exported as xml. thanks
OK, here are some details on the problem, e.g. for issue 20287. The export via http://openoffice.org/issues/xml.cgi?id=20287 should contain only UTF-8 characters, but some are ISO-8859. I think the cc-list entry petermüller is ISO-encoded, so the import fails when using a simple XML parser. Stack trace for 20287:
    java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
Alternatively, open the URL http://openoffice.org/issues/xml.cgi?id=20287 in MS Internet Explorer; it fails at the cc line.
I believe the switch to the new IssueTracker introduced a problem with ISO-encoded pages. The new UTF-8 encoding of all IssueTracker pages has led to this problem because the backend has no UTF-8 validation. The export seems to write out the database content as-is, so if non-UTF-8 characters went in, non-UTF-8 characters come out. That's the problem. Many browsers support UTF-8, but if a browser is hard-set to an ISO encoding or doesn't support UTF-8, we will have this problem! If you have further questions, feel free to ask.
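The failure mode described above can be reproduced outside the tracker. The following is a minimal Python sketch, not the code used by IssueTracker or by the reporter: the element name `cc`, the helper `parses_as_xml`, and both sample documents are hypothetical. It builds two small documents that both declare UTF-8 in the header, one with the umlaut encoded as UTF-8 and one with the same text encoded as ISO-8859-1, and hands each to a strict XML parser.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(payload: bytes) -> bool:
    """Return True if a strict XML parser accepts the payload."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False


# Hypothetical export fragments; the element name "cc" is illustrative only.
HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'
utf8_doc = HEADER + "<cc>petermüller</cc>".encode("utf-8")
# Same text, but the umlaut is stored as the single ISO-8859-1 byte 0xFC
# while the header still claims UTF-8.
latin1_doc = HEADER + "<cc>petermüller</cc>".encode("iso-8859-1")

if __name__ == "__main__":
    print("UTF-8 payload accepted:", parses_as_xml(utf8_doc))
    print("ISO-8859-1 payload accepted:", parses_as_xml(latin1_doc))
```

The second document is rejected because its bytes contradict the encoding named in its own XML declaration, which is the same class of error the Xerces stack trace reports.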
Able to see the problem by formatting the issues as XML. Filed internal issue 25809 in this regard and will update here once the engineers update the internal issue. Thanks, Priya
Queried the engineer on this issue and waiting for him to respond on this. Priya
closing as wontfix. this is behaving as designed. the xml-export is generating correctly however certain browsers have errors related to displaying the content. the export was not intended to be used via browsers. browsers are intended to use IZ querying and use the displayed results. (if you disagree with this summary, I'd suggest filing a new enhancement request and I'll have it filed internally against IZ's replacement, scarab).
hmm, If I understand this correctly this worked fine before the upgrade, so I don't accept this as "worked as designed" but as a regression. please correct me if I'm totally wrong with this. since our internal workflow is depending on this, I consider this as a data loss issue for us.
Kenneths, I can't understand why you say: '... the export was not intended to be used via browsers. browsers are intended to use IZ querying and use the displayed results. ...' On 2nd Feb I gave you a stack trace of failed XML parsing by a Java XML parser. This isn't only a browser problem when displaying issue content via XML. You are exporting non-UTF-8 characters, but the header declares the document as UTF-8. That's the problem. This didn't occur in the old IssueZilla version. In my eyes it's a clear regression bug, not a feature. Sorry. I believe you have invalid UTF-8 characters in your database, export them via the old cgi-script, and don't validate whether they are valid UTF-8. Please find a solution. Thanks
the last commentary I had from before I split up the issues is this: <snip>it is the case that stray iso-8859-1 characters are not properly encoded in the xml (it's the umlaut in peter's name), but that doesn't affect generation - only validation. I can still generate the file. I think my confusion lay in that our engineers commented on the above problem in our internal issue which was mistakenly lumped into a seemingly duplicate issue for a similar problem for another customer. I have filed a new internal issue for the engineers to review. (Corrected whiteboard)
The engineers understand the problem and will be providing a patch to OOo's site at some point in the future to address this problem. I'll be discussing this on our weekly call tomorrow and will update this issue afterwards with an updated comment and hopefully a timeframe estimate.
Frank, this issue should be fixed. Could you please verify? Thanks, Stefan
Assigning this to the reporter as per ST.
It works now for the cc field and issue 20287, but not for an issue with non-UTF-8 chars in the comment field, e.g. 22604, which has some umlauts in it. Please extend the solution to the comment fields as well. Thanks
fma: can you please let us know what steps you followed for testing this issue? Thanks Priya
I used
    $ wget -O- http://www.openoffice.org/issues/xml.cgi?id=22604 | iconv -f utf8 -t utf8
and got an error at
    [...] <long_desc> <who>gh</who> <issue_when>2003-12-19 03:54:04</issue_when> <thetext>Spec: Third party applications like the testtool have to find the StarOffice. In the cws networker2 we agreed, that the setup still writes a file named
    iconv: illegal input sequence at position 3210
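The same kind of check as the `iconv -f utf8 -t utf8` round trip can be sketched in Python. This is an illustration only; the function name `first_invalid_utf8_offset` is hypothetical, and the sample bytes stand in for a downloaded export rather than the real content of issue 22604. A strict decode either succeeds or reports the byte offset of the first sequence that is not valid UTF-8.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding: raises on the first bad byte
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    # Stand-in for export data: an ISO-8859-1 "ü" (0xFC) inside ASCII text.
    sample = b"peterm\xfcller"
    print(first_invalid_utf8_offset(sample))
```

To apply this to a real export, the downloaded bytes would be passed in unchanged, for example after reading a saved file in binary mode.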
whiteboard update
I spoke with engineering on this issue. Initially they wanted to fix this in version 2.6.4. Since this is not qualified for Solaris, I have told engineering that the 2.6.4 fix is not acceptable. I have now gone to the release managers to find out which release this will be fixed in. They have input this issue to their council meeting for discussion tomorrow to see how to best address this from here. I should have another update by the end of the week at the latest. Eric
Hi Stefan, solution proposed: (Would it be acceptable to replace non UTF-8 characters with a "?") Thanks, Eric
Hi Eric, accepted. This is nothing more than what the user currently sees in the browser with UTF-8 encoding, so it should be a solution. Thx.
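The agreed approach, replacing every byte sequence that is not valid UTF-8 with a "?", could look roughly like the sketch below. This is an assumption about the general technique and not the patch the engineers applied; the function name `replace_invalid_utf8` is hypothetical. It relies on Python's built-in `errors="replace"` decoding, which substitutes U+FFFD for each invalid sequence, and then maps that marker to "?".

```python
def replace_invalid_utf8(data: bytes) -> bytes:
    """Return UTF-8 bytes with each invalid input sequence replaced by '?'."""
    # errors="replace" turns every undecodable sequence into U+FFFD.
    text = data.decode("utf-8", errors="replace")
    # Caveat of this sketch: a legitimate U+FFFD already present in the
    # input is also turned into "?".
    return text.replace("\ufffd", "?").encode("utf-8")


if __name__ == "__main__":
    # ISO-8859-1 bytes for "äöü", as a browser set to a Western encoding
    # might have submitted them.
    print(replace_invalid_utf8("äöü".encode("iso-8859-1")))
    # Correctly encoded UTF-8 passes through unchanged.
    print(replace_invalid_utf8("äöü".encode("utf-8")))
```

The trade-off is that the original character is lost, but the output always matches the UTF-8 declaration in the XML header, so strict parsers can read it.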
I let engineering know that this is an acceptable solution. I will keep monitoring the internal issue to let you know how the work on this solution progresses. Eric
The internal issue now resides in my inst engineers queue to have this applied to the solstage site for testing. Eric
The instance set is being tested to make sure there are no incompatibilities with Solaris. Awaiting results. Eric
This has been tested & rolled to the next instance set rollout. Eric
I am trying to get the new instance set applied to stage by tomorrow. Eric
Hi There, I just received confirmation that this instance set has been rolled out onto the stage box. Please verify & give feedback concerning this issue. Thanks, Eric
Export verified on the staging server; it works. Test issue was 1139. Using a browser, set an encoding other than UTF-8 (e.g. Western) and enter umlauts like äöü etc. The export gives ??? back. Tested for summary and description. Thx. Frank
This has been confirmed fixed on the stage system.
Can we set this resolved as it's fixed on the stage system?
Following your example I added the entry "verified_on_stage". I would like to track the issue until it's available on the production site. If you think we should clarify the issue handling feel free to contact me via direct mail.
wget -O- http://www.openoffice.org/issues/xml.cgi?id=22604 | iconv -f utf8 -t utf8 works for me on production. closing.
verified and closing