Issue 15799 - MS-Word import fails, FlatXMLWriter produces invalid XML
Summary: MS-Word import fails, FlatXMLWriter produces invalid XML
Status: CLOSED FIXED
Alias: None
Product: Writer
Classification: Application
Component: code
Version: OOo 1.1 Beta2
Hardware: PC Windows XP
Importance: P2 Trivial
Target Milestone: ---
Assignee: michael.ruess
QA Contact: issues@sw
URL: http://www.canberracity.org/X1035.doc
Keywords:
Duplicates: 17493
Depends on:
Blocks:
 
Reported: 2003-06-19 07:45 UTC by chrisbitmead
Modified: 2013-08-07 14:41 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Description chrisbitmead 2003-06-19 07:45:01 UTC
The attached MS-Word document doesn't convert quite right.
Much worse, the XML that FlatXMLWriter produces from it is invalid.

The problem is that the original MS-Word document contains a vertical tab
character (0x0B) on the "To:" line of this fax. In MS-Word, this puts the
following text on the next line.

OpenOffice can't handle this and displays a weird character on the screen
instead of putting the text on the next line.

OK, this is annoying, but what is disastrous is that with FlatXMLWriter the
resulting XML is invalid, because the control character is put directly into
the output and Java XML parsers barf on it. From my reading of
the W3C XML spec (http://www.w3.org/TR/2000/REC-xml-20001006.html#NT-Char)
vertical tab characters are illegal.
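
For illustration, the Char production in the XML 1.0 spec linked above can be checked directly. The following is a minimal Python sketch, not part of OpenOffice or FlatXMLWriter; the function names `is_xml_char` and `find_illegal_chars` and the sample string are made up for this example.

```python
from typing import List, Tuple

# XML 1.0 "Char" production:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def find_illegal_chars(text: str) -> List[Tuple[int, int]]:
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if not is_xml_char(ord(ch))]


if __name__ == "__main__":
    sample = "To:\x0bJohn Smith"  # hypothetical field content containing a VT
    print(find_illegal_chars(sample))  # [(3, 11)]
```

Tab (0x9), line feed (0xA) and carriage return (0xD) are the only characters below 0x20 that pass, so the vertical tab (0xB) is reported.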

This is a big problem for our project at the National Archive, which plans to
use the OpenOffice flat format for long-term storage, because we have a lot of
MS-Word documents that we can't convert.

I think you need a dual resolution of this. Fix the MS-Word import, but more
importantly, put more checking in your FlatXMLFilter to stop it producing
invalid XML EVER!
Comment 1 h.ilter 2003-06-23 15:40:43 UTC
Reassigned to MRU
Comment 2 michael.ruess 2003-06-24 11:18:37 UTC
MRU->CMC: if I understood correctly, by a "vertical tab" he means a
"line break". Would it be possible to import the line break from the
form field into Writer's input field?
BTW: It is possible to have a line break in Writer's Input field.
Comment 3 chrisbitmead 2003-06-24 14:26:29 UTC
>if I understood correctly, by a "vertical tab" he means a
>"line break"

No, I mean the ASCII "vertical tab" character (VT), which is 0x0B
hex, 013 octal, 11 decimal. This is different from both carriage
return and line feed.

Although, in the original MS-Word file it seems to be represented
visually by a new line.
Comment 4 caolanm 2003-06-24 14:32:44 UTC
VT (hex 0x0b, dec 11) is used in Word as a hard line break. In Writer we
use LF (hex 0x0a, dec 10) for that purpose. I'll look into it. We do the
conversion elsewhere in the office, just not inside field results.
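
A rough sketch of the mapping described here, VT (0x0B) to LF (0x0A) inside a field result, might look like the following. This is Python for illustration only and is not the actual Writer import code; `word_breaks_to_writer` is a hypothetical name.

```python
VT = "\x0b"  # Word's hard line break, per the comment above
LF = "\n"    # Writer's hard line break (0x0A)


def word_breaks_to_writer(field_text: str) -> str:
    """Replace every Word hard line break (VT) with a Writer one (LF)."""
    return field_text.replace(VT, LF)


if __name__ == "__main__":
    # Hypothetical field result from the "To:" line of a fax form
    print(repr(word_breaks_to_writer("To:\x0bJohn Smith")))  # 'To:\nJohn Smith'
```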
Comment 5 caolanm 2003-06-25 08:53:04 UTC
Fixed, but a little risky for 1.1. Will make fix available in 2.0
Comment 6 caolanm 2003-08-15 17:36:02 UTC
reopen to reassign
Comment 7 caolanm 2003-08-15 17:36:26 UTC
cmc->mru: Working in limerickfilterteam08
Comment 8 michael.ruess 2003-08-28 11:19:30 UTC
Checked fix with internal CWS filterteam08.
Comment 9 michael.ruess 2003-08-28 11:19:49 UTC
Fix verified. Will be included in OO 2.0.
Comment 10 michael.ruess 2003-10-09 09:11:10 UTC
*** Issue 17493 has been marked as a duplicate of this issue. ***
Comment 11 utomo99 2003-11-18 09:27:24 UTC
Hi,

I found that this issue is Fixed, but the target is OOo 2.0.
Please consider including this in OOo 1.1.1 if possible.
Thank you
Comment 12 michael.ruess 2003-11-18 10:11:23 UTC
No, as CMC pointed out, a bit too risky for OO 1.1.1. Will leave this
as 2.0 fix.
Comment 13 michael.ruess 2004-03-25 17:06:02 UTC
Closed. Works with OO 2.0 snapshot build 680m28.