Bug 57008 - Wrting _x0427_ to a string cell changes the string to some strange UTF-8 character
Summary: Wrting _x0427_ to a string cell changes the string to some strange UTF-8 char...
Status: REOPENED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.10-FINAL
Hardware: PC All
: P2 minor (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
: 57780 60147 64990 (view as bug list)
Depends on:
Blocks: 64990
  Show dependency tree
 
Reported: 2014-09-23 14:09 UTC by Günter Hoffmann
Modified: 2020-12-15 08:26 UTC (History)
3 users (show)



Attachments
Test classes and description of the results (5.50 KB, application/x-7z-compressed)
2015-03-30 13:28 UTC, carlo.dellacqua
Details
Test version of the program that writes 1024 OLE objects across 3 sheets (34.39 KB, text/x-csrc)
2018-01-19 18:35 UTC, Ravi Wallau
Details
Control Characters excel file failed to read the exact value. (10.87 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2019-02-06 15:55 UTC, Vishal Jaldawar
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Günter Hoffmann 2014-09-23 14:09:20 UTC
I've created a Workbook using "new XSSFWorkbook()", created a sheet, a row and a cell. Into this cell I wrote "sometext_x0427_sometext" using "setCellValue". When storing the resulting workbook in a file and then try to open it with Excel, Excel tells me that the file is corrupted. After saying "ok, repair the file" I can see some strange character instead of the "_0x427_". If you send the excel file from Web-Server to IE-Client over a response stream you cannot even open the Excel data.
This seems to be an Excel "feature" and if you prepend this string with "_0x005F" the string stays at it is.
Of course I might change my code and check the strings I want to write whether they contain strings in a certain format and then prepend this "_0x005F". But I think "nobody" knows about this feature and every one is wondering what happened to their data.
So I think the prepending of "_0x005f" should happen in POI.
If a programmer wants to use this "feature" he might call a new POI function i.e. setUseStrangeExcelFeature(true).
Comment 1 Nick Burch 2014-09-23 15:47:55 UTC
First up, can you try with 3.11 beta 2? (And the associated newer xmlbeans runtime)

If that doesn't help, can you see if that happens for all characters in that range, or just 0x0427?
Comment 2 Günter Hoffmann 2014-09-24 09:37:05 UTC
Sorry I entered the wrong POI version. The version I am using is the 3.10.1 (I changed it in this bug)

If I write the string: 

hgh_x0041_**_x0100_*_x0101_*_x0190_*_x0200_*_x0300_*_x0427_*

into a cell the following appears on the screen: 

hghA**Ā*ā*Ɛ*Ȁ*̀*Ч*

So the problem occurs (so it seems) for all strings with the pattern: _xhhhh_
where h is a hex digit and x is exact a lower x.

I will try to use the latest poi version if I find the time. I think this will last some days.

Thanks for your help!
Comment 3 Günter Hoffmann 2014-09-24 15:06:12 UTC
My problem is described on the following URL:
http://social.msdn.microsoft.com/Forums/office/en-US/7580de39-489b-4593-b68d-dc1c33b1a6b3/escaping-encoded-characters-ex-xf8ff-?forum=exceldev
But there seems to be no answer since 2010 :-(
Comment 4 Dominik Stadler 2014-10-13 19:48:13 UTC
I did some searching and found a few places where this is discussed:

* http://cpansearch.perl.org/src/JMCNAMARA/Excel-Writer-XLSX-0.74/lib/Excel/Writer/XLSX/Package/SharedStrings.pm

    # Excel escapes control characters with _xHHHH_ and also escapes any
    # literal strings of that type by encoding the leading underscore. So
    # "\0" -> _x0000_ and "_x0000_" -> _x005F_x0000_.
    # The following substitutions deal with those cases.

    # Escape the escape.
    $string =~ s/(_x[0-9a-fA-F]{4}_)/_x005F$1/g;

    # Convert control character to the _xHHHH_ escape.
    $string =~ s/([\x00-\x08\x0B-\x1F])/sprintf "_x%04X_", ord($1)/eg;

* http://public.vrac.iastate.edu/~charding/HCI574_lecture_notes/lecture24/XlsxWriter-0.5.3/xlsxwriter/sharedstrings.py

        # Excel escapes control characters with _xHHHH_ and also escapes any
        # literal strings of that type by encoding the leading underscore.
        # So "\0" -> _x0000_ and "_x0000_" -> _x005F_x0000_.
        # The following substitutions deal with those cases.

        # Escape the escape.
        string = re.sub('(_x[0-9a-fA-F]{4}_)', r'_x005F\1', string)

        # Convert control character to the _xHHHH_ escape.
        string = re.sub(r'([\x00-\x08\x0B-\x1F])',
                        lambda match: "_x%04X_" %
                        ord(match.group(1)), string)

* https://social.technet.microsoft.com/Forums/sharepoint/en-US/a00949a6-4b2d-4cde-875f-850870b76900/ssrs2012-export-to-excel-excel-found-unreadable-content?forum=sqlreportingservices

So there seems to be a way to escape characters this way and also a way to escape the escape-sequence to allwo to embed this in literal form.
Comment 5 Günter Hoffmann 2014-11-03 15:27:25 UTC
Hi there, 
I wrote a test program and also a small web application using the newest beta version of POI (poi-3.11-beta2-20140822)
and (as expected) the problem is still there.

I´m not familiar with reporting bugs and what responses I can expect from you. So I don´t want to offend anybody but
My Problem is: I have about 20 reports which get their data from a database. Each reports got several sheets and each sheet got 20 and more columns.
So there are many, many statements like "call.setCellValue(dataFromDatabase)".
It ´s rather impractical to add some conversion code to all these calls.
And the fact is that the problem is in POI not in my/our code.
In my opinion POI should offer a method where the programmer can decide whether he wants to have an automatic conversion of these "_xhhhh_" stuff or not.
And the default behavior of setCellValue should be
"I´d like to see the text in Excel as I can see it in the database".
Or in other words: If I enter "_x0123_" in Excel it stays "_x0123_" in Excel, but if I enter "_x0123_ in POI (using setCellValue) it does not stay the same!

So I don´t need a work-around I need a solution.
And this solution should be in POI I think.

So again: I don´t want to bother you (thanks to Dominik for his researches) but I need some "POI responsible" who corrects this mis-behavior in POI and then of course a new POI version some day.
Comment 6 Dominik Stadler 2014-11-04 16:24:29 UTC
As POI-developer time is always scarce, you can greatly increase the chances of getting this fixed if you can provide a simple unit tests that show the problem and a proposed fix, if you are able to come up with one.
Comment 7 carlo.dellacqua 2015-03-30 13:28:35 UTC
Created attachment 32621 [details]
Test classes and description of the results

Test_xNNNN_inCell is a test class which calls setCellValue(...) (for XSSF,SXSSF, HSSF) to write a text string containing the substring "_x0046_";
Utils is a class which contains methods used to escape the previous code;
Test_xNNNN_inCell_HSSF.xls, Test_xNNNN_inCell_XSSF.xlsx, Test_xNNNN_inCell_SXSSF.xlsx are the excel files written by the test class.

Test_xNNNN_inCell_HSSF.xls contains:
row 0, cell 0 _x0046_ without changes
row 0, cell 1 _x005f_x0046_ with changes
that is the right original and escaped text strings.

Test_xNNNN_inCell_XSSF.xlsx contains:
row 0, cell 0 F without changes
row 0, cell 1 _x0046_ with changes
that is the first cell contains 'F' which is the character related to the unicode x0046, which is the text written in the cell,
the second cell contains the expected text "_x0046_", obtained adding the text "_x005F".
Comment 8 Dominik Stadler 2015-03-30 14:51:26 UTC
Thanks, that now made it easier to dig into this, unfortunately it seems this is currently done "by design" in POI to implement the encoding that Excel files define so we write and read files consistently with what Microsoft OOXML spec states.

Escaping is done at XSSFRichtTextString.utfDecode(), however it seems this was put in on purpose, so I am reluctant to "undo" it here as likely others depend on it being in place...

Furthermore as we did not get any related bug-report for a long time and the code is in place since 2010, I don't think many people write such a _xXXXX_ text "by chance" into their files, unfortunately it seems to have hit you here.

Overall I think your best bet currently is to perform the escaping via _0x005F in your code before you populate the cell via some helper function.
Comment 9 Dominik Stadler 2015-04-20 20:40:06 UTC
*** Bug 57780 has been marked as a duplicate of this bug. ***
Comment 10 Dominik Stadler 2016-09-19 13:14:32 UTC
*** Bug 60147 has been marked as a duplicate of this bug. ***
Comment 11 Matthias Gerth 2018-01-12 15:15:42 UTC
I understand that this is decoding is necessary when reading an excel file. But why this decoding is done when simply setting the value of a cell programmatically escapes me.

This behaviour is not mentioned in the API documentation. And there is no escaping function provided.

Since it this function does not work like it is defined in the Cell interface I consider this a bug.
Comment 12 Mark Murphy 2018-01-12 20:16:58 UTC
I was intrigued by this issue so I went into Excel and typed LUS_BO_WP_x24B8_AI into a cell. Excel retained the text as typed. So I looked in the XML and found that _x24B8_ had been changed to _x005F_x24B8_. It seems to me that when a user sets a value, that value should be escaped if it isn't already so that what they set the value to is what they get. I also recognize that some may need to use the special strings to encode Unicode characters, and would want them to appear as such. It seems to me that the default behavior should be to insert the escaped value into the cell, and have a special setRawValue() method for the cases where you want those special values to remain unescaped. Though I could be convinced that we should leave setValue alone and create a setEscapedText() method where a user might want to use these _x####_ values.
Comment 13 Matthias Gerth 2018-01-15 16:02:15 UTC
I've written an escape function as a workaround.
https://stackoverflow.com/questions/48222502/xssfcell-in-apache-poi-encodes-certain-character-sequences-as-unicode-character

So my use case is this: I need to store a string containing "_x24B8_" into an excel file. This is user input and I can not prevent this. The setValue function on XSSFCell has one parameter of type java.lang.String. Java string does not use microsoft encoding to represent unicode character.

So this happens
1. String value = "_x24B8_";
2. String valueEscaped = escape(value); // "_x005F_x24B8_"
3. cell.setValue(valueEscaped) // cell.value is now "_x24B8_"
4. once the file is written is changes back to "_x005F_x24B8_" in the file

I think setValue should not call XSSFRichTextString.utfDecode(). This would prevent this back and forth encoding.
We could also make XSSFRichTextString.utfDecode() a public for people who are using this type of encoding. I would prefer this microsoft encoding terminated within the library since it is specific to the office file format.
Comment 14 Mark Murphy 2018-01-16 17:40:39 UTC
(In reply to Matthias Gerth from comment #13)
> I've written an escape function as a workaround.
> https://stackoverflow.com/questions/48222502/xssfcell-in-apache-poi-encodes-
> certain-character-sequences-as-unicode-character
> 
> So my use case is this: I need to store a string containing "_x24B8_" into
> an excel file. This is user input and I can not prevent this. The setValue
> function on XSSFCell has one parameter of type java.lang.String. Java string
> does not use microsoft encoding to represent unicode character.
> 
> So this happens
> 1. String value = "_x24B8_";
> 2. String valueEscaped = escape(value); // "_x005F_x24B8_"
> 3. cell.setValue(valueEscaped) // cell.value is now "_x24B8_"
> 4. once the file is written is changes back to "_x005F_x24B8_" in the file
> 
> I think setValue should not call XSSFRichTextString.utfDecode(). This would
> prevent this back and forth encoding.
> We could also make XSSFRichTextString.utfDecode() a public for people who
> are using this type of encoding. I would prefer this microsoft encoding
> terminated within the library since it is specific to the office file format.

In my mind, this is how it should work: cell.setValue() should perform a utfEscape() (this does not currently exist) which will add "_x005F" to anything that looks like "_x####_" but not to things that look like "_x005F_x####_". conversely cell.getRichTextString() should perform a utfUnescape() (this does not exist) which will strip the "_x005F" from anything that looks like "_x005F_x####_". But we need to make sure that when reading a file, or processing XML, these Unicode special characters remain unchanged. Section 3.18.96 in Office Open XML 1st Edition Part 4 indicates that Unicode characters that cannot be in an XML document need to be encoded as "_x####_". I was unable to find out just which characters those were, so I don't know about encoding and decoding these characters. Right now it looks like we just decode "_x####_" to the Unicode character on read.

So If I use your example above:
1. cell.setValue("_x24B8_"); // resulting XML should contain "_x005F_x24B8_"
2. cell.getRichTextString(); // should return "_x24B8_"

For a file read in that contains "_x24B8_" in a cell:
3. cell.getRechTextString(): // should return "Ⓒ" - utfDecode() does this correctly.

For a file read in that contains "_x005F_x24B8_" in a cell:
3. cell.getRechTextString(): // should return "_x24B8_"
Comment 15 Matthias Gerth 2018-01-17 13:11:05 UTC
Why should utfEscape() not escape things like "_x005F_x####_"?
If I want a cell that contains the string "_x005F_x####_" it needs to be escaped as "_x005F_x005F_x####_" in the file so we get a cell that contains "_x005F_x####_".
That is the workaround I'm currently running.
Comment 16 Matthias Gerth 2018-01-17 17:20:55 UTC
> So If I use your example above:
> 1. cell.setValue("_x24B8_"); // resulting XML should contain "_x005F_x24B8_"
> 2. cell.getRichTextString(); // should return "_x24B8_"

I agree. But currently cell.getRichTextString() returns "Ⓒ" in this eample and it will show "Ⓒ" in excel.

The other two examples already work correctly.
Comment 17 Mark Murphy 2018-01-17 20:23:02 UTC
(In reply to Matthias Gerth from comment #15)
> Why should utfEscape() not escape things like "_x005F_x####_"?
> If I want a cell that contains the string "_x005F_x####_" it needs to be
> escaped as "_x005F_x005F_x####_" in the file so we get a cell that contains
> "_x005F_x####_".
> That is the workaround I'm currently running.

Agreed, I was just thinking myself into a loop, and then trying to avoid it. We should check Excel to see how it stores various versions of "_x005f_x####", "_x005f_x005f_x####", _x005f_x005f_x005f_x####_", ... there should be a pattern.
Comment 18 Greg Woolsey 2018-01-17 20:45:53 UTC
I always go back to the standards doc when I get going around in circles.  Here's what it says about escaped strings:

22.4.2.4 bstr (Basic String)
This element defines a binary basic string variant type, which can store any valid Unicode character. Unicode characters that cannot be directly represented in XML as defined by the XML 1.0 specification, shall be escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be escaped as _x0008_. end example] To store the literal form of an escape sequence, the initial underscore shall itself be escaped (i.e. stored as _x005F_). [Example: The string literal _x0008_ would be stored as _x005F_x0008_. end example]

The possible values for this element are defined by the W3C XML Schema string datatype.

I think POI should assume it needs to escape Unicode when setting CT* class value strings, and unescape when reading them.  I don't think POI should be attempting to unescape them at any other time than when reading a string value from a CT* class.
Comment 19 Mark Murphy 2018-01-18 12:46:27 UTC
I saw that, and the XML spec isn't terribly helpful except that it does tell which characters are valid in XML. x00-x1F are invalid except for TAB, CR, and LF (x09, x0A, x0D). There are a few others as well that we probably should encode/decode when copying to/from the CT classes. The OOXML specs mention the XML 1.0 spec as the driving source. Here are the valid XML characters according to the XML 1.0 specification:

Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Comment 20 Ravi Wallau 2018-01-19 18:35:11 UTC
Created attachment 35688 [details]
Test version of the program that writes 1024 OLE objects across 3 sheets

Test version of the program that writes 1024 OLE objects across 3 sheets.

This version does not trigger the error - to trigger it, change the number in the constant to 1024 - this will cause 1025 objects to be written and it will trigger the error.
Comment 21 Mark Murphy 2018-01-22 12:51:09 UTC
(In reply to Ravi Wallau from comment #20)
> Created attachment 35688 [details]
> Test version of the program that writes 1024 OLE objects across 3 sheets
> 
> Test version of the program that writes 1024 OLE objects across 3 sheets.
> 
> This version does not trigger the error - to trigger it, change the number
> in the constant to 1024 - this will cause 1025 objects to be written and it
> will trigger the error

Was this attached to the wrong bug?
Comment 22 Vishal Jaldawar 2019-02-06 15:55:50 UTC
Created attachment 36427 [details]
Control Characters excel file failed to read the exact value.

Control Characters excel file failed to read the exact value.
Comment 23 Vishal Jaldawar 2019-02-06 16:06:40 UTC
I have similar issue. Please refer the excel file attached by me.
There are some control characters which are not read correctly by Apache POI.

I have written a custom Sheet Handler extending DefaultHandler.
I read the data and form a xml out of it to store into database. I have used JAXB marshaller. When I unmarshall xml I get 

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1) was found in the element content of the document.

I debugged the code and found the XSSFRichTextString.utfDecode has some decoding logic. but it seems it is not working for some characters.

I am using Apache POI  3.17 version.

Please help.
Comment 24 Andreas Beeker 2020-12-15 08:26:18 UTC
*** Bug 64990 has been marked as a duplicate of this bug. ***