Bug 63813 - Special character (greater than equal) converts to '(' text in word documents
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-07 19:34 UTC by teresa.kim
Modified: 2019-10-10 17:30 UTC

Attachments
symbl test example doc document (26.50 KB, application/msword)
2019-10-07 19:35 UTC, teresa.kim

Description teresa.kim 2019-10-07 19:34:46 UTC
Version:

POI 4.1.0

I have documents (both 'doc' and 'docx') that contain a special character for 'greater than or equal to'. When I convert them with 'WordToHtmlConverter', those characters come out as '('.

I tried this with the latest Apache POI release, 4.1.0.


My java code is:


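import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class TestWordtoHtmlConverter {

    public static void main(String[] args) {
        // Close the input stream even if loading or conversion fails.
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);

            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());

            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);

            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);

            // Decode with the same charset the serializer wrote.
            String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
            System.out.println(result);
        } catch (Exception e) {
            // Report failures instead of swallowing them silently.
            e.printStackTrace();
        }
    }
}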

Is there any way I can correctly identify these symbols?


In the sample document, the symbol I am interested in is the one in front of 'bad one'.


Thanks
Comment 1 teresa.kim 2019-10-07 19:35:37 UTC
Created attachment 36814
symbl test example doc document
Comment 2 Axel Howind 2019-10-08 10:41:05 UTC
Since I want to learn about the non-Excel formats in POI, I am trying to find out what's going on here. Three things so far:

 - I can confirm that the first one is rendered as '≥' (a single character) in Word (at least on Mac).
 - The program produces the wrong output.
 - As far as I can tell, the error has nothing to do with the converter, because I can see the '(' in the debugger when inspecting the `wordDocument` variable before the converter is even initialised.

I will see if I can find out what's wrong, but no promises (this is the first time I have looked at the Word code).
Comment 3 Axel Howind 2019-10-08 11:31:33 UTC
When reading the word file, text pieces are read by converting `byte[]` to String in `buildInitSB()`. I investigated the raw data passed to that method:

- According to the Unicode table, the "greater-than or equal to" sign has the code 0x2265, which I also see in the debugger right before the "good one" bytes.

- Right before "bad one" there is a 0x0028, which in Unicode is the left parenthesis.

So it seems that the error happens at a very low level when reading the byte stream.
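
To make the two values concrete, here is a minimal sketch that decodes raw bytes and prints the resulting code points. It assumes the text piece is stored as UTF-16LE (two bytes per character, low byte first); the class `CodePointDump` and its `dump` method are made up for illustration and are not part of POI.

```java
import java.nio.charset.StandardCharsets;

class CodePointDump {

    // Decode raw bytes as UTF-16LE (an assumption about the text piece)
    // and list every code point in U+XXXX notation, separated by spaces.
    static String dump(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_16LE);
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x2265 followed by a space, as seen before "good one".
        byte[] good = {0x65, 0x22, 0x20, 0x00};
        // 0x0028 followed by a space, as seen before "bad one".
        byte[] bad = {0x28, 0x00, 0x20, 0x00};
        System.out.println(dump(good)); // U+2265 U+0020
        System.out.println(dump(bad));  // U+0028 U+0020
    }
}
```

If the bytes in the file really are 0x28 0x00 at that position, no choice of decoder will turn them into U+2265, which is consistent with the problem sitting below the string conversion.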

-----

Additional findings: LibreOffice doesn't render the symbol in front of "bad one" at all. Pages displays the correct symbol.

-----

Extracting the file on the command line yields:

axel@xiaolong tmp % unzip ../symbol_test.doc 
Archive:  ../symbol_test.doc
warning [../symbol_test.doc]:  10574 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: [Content_Types].xml     
  inflating: _rels/.rels             
  inflating: theme/theme/themeManager.xml  
  inflating: theme/theme/theme1.xml  
  inflating: theme/theme/_rels/themeManager.xml.rels  

Could it be that the file is corrupt? Compare with a simple test document:

axel@xiaolong tmp % unzip ../Test.docx 
Archive:  ../Test.docx
  inflating: [Content_Types].xml     
  inflating: _rels/.rels             
  inflating: word/_rels/document.xml.rels  
  inflating: word/document.xml       
  inflating: word/theme/theme1.xml   
  inflating: word/settings.xml       
  inflating: docProps/core.xml       
  inflating: word/fontTable.xml      
  inflating: word/webSettings.xml    
  inflating: word/styles.xml         
  inflating: docProps/app.xml

But since Apple pages renders it correctly and you said that you have multiple such documents, maybe I am missing something.

Anyway, I'm out of this one.
Comment 4 teresa.kim 2019-10-08 12:53:11 UTC
(In reply to Axel Howind from comment #3)
Thanks for looking into this issue.
> 
> -----
> 
> Extracting the file on the command line yields:
> 
> axel@xiaolong tmp % unzip ../symbol_test.doc 
> Archive:  ../symbol_test.doc
> warning [../symbol_test.doc]:  10574 extra bytes at beginning or within
> zipfile
>   (attempting to process anyway)
>   inflating: [Content_Types].xml     
>   inflating: _rels/.rels             
>   inflating: theme/theme/themeManager.xml  
>   inflating: theme/theme/theme1.xml  
>   inflating: theme/theme/_rels/themeManager.xml.rels  
> 

I think that is because 'symbol_test' is a binary 'doc' file, whereas 'Test.docx' is an OOXML 'docx' file.
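
A rough way to tell the two container formats apart without unzipping is to look at the first bytes of the file. The following is a minimal sketch using only the Java standard library; the class `ContainerSniffer` and its `sniff` method are hypothetical helpers, not POI API. It relies on the usual signatures: binary 'doc' files start with the OLE2 compound file header, and 'docx' files start with a ZIP local file header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class ContainerSniffer {

    // OLE2 compound file signature (binary .doc, .xls, .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4" (OOXML .docx and friends).
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // Classify a file by the first bytes of its content.
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] header = new byte[8];
            // A single read is enough for a sketch; it may return fewer bytes.
            int n = in.read(header);
            System.out.println(sniff(Arrays.copyOf(header, Math.max(n, 0))));
        }
    }
}
```

Run against the two files from the listing above, this should help confirm whether 'symbol_test.doc' is an OLE2 file rather than a damaged ZIP package.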

> Could it be that the file is corrupt? Compare with a simple test document:
> 
> axel@xiaolong tmp % unzip ../Test.docx 
> Archive:  ../Test.docx
>   inflating: [Content_Types].xml     
>   inflating: _rels/.rels             
>   inflating: word/_rels/document.xml.rels  
>   inflating: word/document.xml       
>   inflating: word/theme/theme1.xml   
>   inflating: word/settings.xml       
>   inflating: docProps/core.xml       
>   inflating: word/fontTable.xml      
>   inflating: word/webSettings.xml    
>   inflating: word/styles.xml         
>   inflating: docProps/app.xml
> 
> But since Apple pages renders it correctly and you said that you have
> multiple such documents, maybe I am missing something.
> 
> Anyway, I'm out of this one.

Yes, I have many such documents. Besides, it is not only the 'greater than or equal to' symbol: other characters are also converted into '('.
I need to identify each of these characters so that I can postprocess them.
Comment 5 Dominik Stadler 2019-10-08 18:08:32 UTC
Unfortunately this seems to be caused somewhere deep in the Microsoft DOC binary format, the text-bytes that we read from the document-stream in class TextPiece already results in "( bad one", so there is no conversion in Apache POI as far as I see, but still LibreOffice can display this correctly, so it seems there is some additional information stored somewhere in the data which Apache POI does not read/interpret yet. 

This would need much more knowledge about this format than I can provide, sorry, hopefully someone else can come up with a clue why this happens.
Comment 6 teresa.kim 2019-10-08 21:31:55 UTC
(In reply to Dominik Stadler from comment #5)
> Unfortunately this seems to be caused somewhere deep in the Microsoft DOC
> binary format, the text-bytes that we read from the document-stream in class
> TextPiece already results in "( bad one", so there is no conversion in
> Apache POI as far as I see, but still LibreOffice can display this
> correctly, so it seems there is some additional information stored somewhere
> in the data which Apache POI does not read/interpret yet. 
> 
> This would need much more knowledge about this format than I can provide,
> sorry, hopefully someone else can come up with a clue why this happens.

Thanks Dominik
I downloaded LibreOffice and saved the document as HTML. You're right that LibreOffice outputs this correctly. Is there any way to mimic this behaviour in Apache POI?
Comment 7 Dominik Stadler 2019-10-09 19:11:45 UTC
Unfortunately it seems this information is stored in a way that Apache POI does not support right now, so it would need someone to find the time and expertise to dig into the format and the code of Apache POI, no way to "mimic" as far as I see.
Comment 8 teresa.kim 2019-10-10 04:37:09 UTC
(In reply to Dominik Stadler from comment #7)
> Unfortunately it seems this information is stored in a way that Apache POI
> does not support right now, so it would need someone to find the time and
> expertise to dig into the format and the code of Apache POI, no way to
> "mimic" as far as I see.

Thanks Dominik for looking into this issue.
I would love to get involved in working on this issue. I have never done that before, but I have used the Apache POI API for a while. I have time but no expertise. Is there any way to get help from experts to get started, or are there instructions that someone like me could follow?
Comment 9 Dominik Stadler 2019-10-10 17:30:41 UTC
Unfortunately, getting familiar with the topic may take some time, if you can find it.

You may first need to review the official technical documentation from Microsoft at https://msdn.microsoft.com/en-us/library/cc313105%28v=office.12%29.aspx and compare this with the actual code in Apache POI, e.g. the starting point would be the constructor of class HWPFDocument and the classes used there to read the binary format.

Otherwise the dev-mailing list will be a good place for asking questions while you go along.