Summary: | Special character (greater than equal) converts to '(' text in word documents | ||
---|---|---|---|
Product: | POI | Reporter: | teresa.kim |
Component: | HWPF | Assignee: | POI Developers List <dev> |
Status: | NEW --- | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | symbl test example doc document |
Description
teresa.kim
2019-10-07 19:34:46 UTC
Created attachment 36814
symbl test example doc document
Since I want to learn about the non-Excel formats in POI, I am trying to find out what's going on here. Three things so far:

- I can confirm that the first one is rendered as '>=' (as a single character) in Word (at least on Mac).
- The program produces the wrong output.
- As far as I can tell, the error has nothing to do with the converter, because I can see the '(' in the debugger when inspecting the `wordDocument` variable before the converter is even initialised.

I will see if I can find out what's wrong, but no promises (this is my first time looking at the Word code).

When reading the Word file, text pieces are read by converting `byte[]` to String in `buildInitSB()`. I investigated the raw data passed to that method:

- According to the Unicode table, the "greater than or equal" sign has the code 0x2265, which I also see in the debugger right before the "good one" bytes.
- Right before "bad one" there is a 0x0028, which in Unicode is the left parenthesis.

So it seems that the error happens at a very low level when reading the byte stream.

-----

Additional findings: LibreOffice doesn't render the symbol in front of "bad one" at all. Pages displays the correct symbol.

-----

Extracting the file on the command line yields:

    axel@xiaolong tmp % unzip ../symbol_test.doc
    Archive: ../symbol_test.doc
    warning [../symbol_test.doc]: 10574 extra bytes at beginning or within zipfile
      (attempting to process anyway)
      inflating: [Content_Types].xml
      inflating: _rels/.rels
      inflating: theme/theme/themeManager.xml
      inflating: theme/theme/theme1.xml
      inflating: theme/theme/_rels/themeManager.xml.rels

Could it be that the file is corrupt?
Compare with a simple test document:

    axel@xiaolong tmp % unzip ../Test.docx
    Archive: ../Test.docx
      inflating: [Content_Types].xml
      inflating: _rels/.rels
      inflating: word/_rels/document.xml.rels
      inflating: word/document.xml
      inflating: word/theme/theme1.xml
      inflating: word/settings.xml
      inflating: docProps/core.xml
      inflating: word/fontTable.xml
      inflating: word/webSettings.xml
      inflating: word/styles.xml
      inflating: docProps/app.xml

But since Apple Pages renders it correctly and you said that you have multiple such documents, maybe I am missing something.

Anyway, I'm out of this one.

(In reply to Axel Howind from comment #3)

Thanks for looking into this issue.

> -----
>
> Extracting the file on the command line yields:
>
> axel@xiaolong tmp % unzip ../symbol_test.doc
> Archive: ../symbol_test.doc
> warning [../symbol_test.doc]: 10574 extra bytes at beginning or within zipfile
> (attempting to process anyway)
> inflating: [Content_Types].xml
> inflating: _rels/.rels
> inflating: theme/theme/themeManager.xml
> inflating: theme/theme/theme1.xml
> inflating: theme/theme/_rels/themeManager.xml.rels

I think that is because 'symbol_test' is of the 'doc' type, whereas 'Test.docx' is of the 'ooxml docx' type.

> Could it be that the file is corrupt? Compare with a simple test document:
>
> axel@xiaolong tmp % unzip ../Test.docx
> Archive: ../Test.docx
> inflating: [Content_Types].xml
> inflating: _rels/.rels
> inflating: word/_rels/document.xml.rels
> inflating: word/document.xml
> inflating: word/theme/theme1.xml
> inflating: word/settings.xml
> inflating: docProps/core.xml
> inflating: word/fontTable.xml
> inflating: word/webSettings.xml
> inflating: word/styles.xml
> inflating: docProps/app.xml
>
> But since Apple pages renders it correctly and you said that you have
> multiple such documents, maybe I am missing something.
>
> Anyway, I'm out of this one.
Yes, I have many documents, and it is not only the 'greater than or equal' symbol: there are other characters that are converted into '(' as well. I need to identify each of these characters in order to postprocess them.

Unfortunately this seems to be caused somewhere deep in the Microsoft DOC binary format. The text bytes that we read from the document stream in class TextPiece already result in "( bad one", so there is no conversion in Apache POI as far as I can see. Still, LibreOffice can display this correctly, so it seems that some additional information is stored somewhere in the data which Apache POI does not read or interpret yet.

This would need much more knowledge about this format than I can provide, sorry. Hopefully someone else can come up with a clue as to why this happens.

(In reply to Dominik Stadler from comment #5)
> Unfortunately this seems to be caused somewhere deep in the Microsoft DOC
> binary format, the text-bytes that we read from the document-stream in class
> TextPiece already results in "( bad one", so there is no conversion in
> Apache POI as far as I see, but still LibreOffice can display this
> correctly, so it seems there is some additional information stored somewhere
> in the data which Apache POI does not read/interpret yet.
>
> This would need much more knowledge about this format than I can provide,
> sorry, hopefully someone else can come up with a clue why this happens.

Thanks, Dominik. I downloaded LibreOffice and saved the document as HTML. You're right that LibreOffice outputs this correctly. Is there any way to mimic this behaviour in Apache POI?

Unfortunately it seems this information is stored in a way that Apache POI does not support right now, so someone would need to find the time and expertise to dig into the format and the Apache POI code. There is no way to "mimic" this as far as I can see.
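The byte-level observation reported in this thread can be reproduced without POI. The following is a minimal sketch, assuming the text piece in question is stored as 16-bit little-endian (UTF-16LE) text, which is what the 0x2265 and 0x0028 values seen in the debugger suggest. `TextPieceBytesDemo` and its methods are hypothetical helpers for illustration, not Apache POI API, and the byte arrays are hand-written samples rather than data extracted from the attachment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Apache POI.
class TextPieceBytesDemo {

    // Decode raw bytes as UTF-16LE (assumption: the piece is 16-bit text).
    static String decodeUtf16Le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Render every code point of the text as U+XXXX, separated by spaces.
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hand-written samples: 0x2265 and 0x0028, each followed by a space.
        byte[] good = {0x65, 0x22, 0x20, 0x00};
        byte[] bad = {0x28, 0x00, 0x20, 0x00};
        System.out.println(codePoints(decodeUtf16Le(good))); // U+2265 U+0020
        System.out.println(codePoints(decodeUtf16Le(bad)));  // U+0028 U+0020
    }
}
```

If the stored bytes are 0x28 0x00, a faithful decoder can only produce '(', which is consistent with the finding that no conversion happens inside POI's text reading.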
(In reply to Dominik Stadler from comment #7)
> Unfortunately it seems this information is stored in a way that Apache POI
> does not support right now, so it would need someone to find the time and
> expertise to dig into the format and the code of Apache POI, no way to
> "mimic" as far as I see.

Thanks, Dominik, for looking into this issue. I would love to get involved in working on it. I have never contributed before, but I have used the Apache POI API for a while. I have time but no expertise. Is there any way to get help from experts to get started, or are there instructions that someone like me could follow?

Unfortunately it may take some time to get used to the topic, if you can find that time. You may first need to review the official technical documentation from Microsoft at https://msdn.microsoft.com/en-us/library/cc313105%28v=office.12%29.aspx and compare it with the actual code in Apache POI. A starting point would be the constructor of class HWPFDocument and the classes used there to read the binary format. Otherwise, the dev mailing list is a good place to ask questions as you go along.
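On the earlier `unzip` comparison: binary Word files are normally OLE2 compound documents rather than ZIP archives, so `unzip` output is not a reliable corruption check for a `.doc` file. A minimal sketch for telling the two containers apart by their leading signature bytes follows. `ContainerSniffer` is a hypothetical helper, not POI API, and it assumes that inspecting the first eight bytes is enough for this purpose. `readNBytes` requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical illustration class; not part of Apache POI.
class ContainerSniffer {

    // Signature at the start of an OLE2 compound file (used by binary .doc).
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4" (used by .docx).
    static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classify a file by its leading bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java ContainerSniffer.java <file>");
            return;
        }
        byte[] head = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.readNBytes(head, 0, head.length);
        }
        System.out.println(sniff(Arrays.copyOf(head, n)));
    }
}
```

If the attachment reports `OLE2`, the "extra bytes at beginning" warning from `unzip` would be expected rather than a sign of damage, since any ZIP data it found would sit inside the compound file and not at its start.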