Bug 51519 - XSSFEventBasedExcelExtractor's Japanese xlsx file processing shouldn't extract t elements within rPh elements.
Status: NEW
Product: POI
Classification: Unclassified
Component: XSSF
Hardware: PC All
Importance: P2 normal with 3 votes
Target Milestone: ---
Assigned To: POI Developers List
Depends on:
Reported: 2011-07-17 05:29 UTC by Mamoru Asagami
Modified: 2015-09-25 15:55 UTC (History)

Example files used to reproduce InvocationTargetException (6.30 KB, application/octet-stream)
2011-12-20 16:51 UTC, Michael L.
Example file to reproduce issue (9.97 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2013-12-30 22:19 UTC, Christopher
Patch to ReadOnlySharedStringsTable to address this issue (2.61 KB, patch)
2014-02-07 21:27 UTC, Shaun Kalley

Description Mamoru Asagami 2011-07-17 05:29:03 UTC
rPh holds the phonetic reading (pronunciation) of the cell text.
It is hidden data from the user's point of view and is not accurate.
So it is confusing when the extracted text includes the rPh elements.

This is a sample of sharedStrings.xml.

   <si>
      <t>役割</t>  <!-- "role" in English -->
      <rPh sb="0" eb="2">
         <t>ヤクワリ</t> <!-- Katakana (Japanese phonetic script) reading of 役割 -->
      </rPh>
      <phoneticPr fontId="1" />
   </si>
Comment 1 Nick Burch 2011-07-17 16:08:23 UTC
Maybe we should make it an option? Some people may want that data for their indexing?
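
For illustration, the optional behaviour could look roughly like the sketch below. It is not POI code and not the patch attached to this bug: it uses only the JDK's SAX parser, and the class name PhoneticFilterSketch, the handler, and the includePhonetic flag are hypothetical names chosen for this example. It assumes the shared-strings XML uses unprefixed element names (t, rPh), as in the sample from the description.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collect <t> text, optionally skipping <t> nested in <rPh>.
class PhoneticFilterSketch {

    static class SharedStringHandler extends DefaultHandler {
        private final boolean includePhonetic; // hypothetical option, per Comment 1
        private final StringBuilder text = new StringBuilder();
        private int rPhDepth = 0;   // greater than 0 while inside a phonetic run
        private boolean inT = false;

        SharedStringHandler(boolean includePhonetic) {
            this.includePhonetic = includePhonetic;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default SAXParserFactory is not namespace-aware, so qName carries the tag name.
            if ("rPh".equals(qName)) {
                rPhDepth++;
            } else if ("t".equals(qName)) {
                inT = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("rPh".equals(qName)) {
                rPhDepth--;
            } else if ("t".equals(qName)) {
                inT = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep text only from <t>, and from <t> inside <rPh> only when requested.
            if (inT && (includePhonetic || rPhDepth == 0)) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    static String extract(String xml, boolean includePhonetic) throws Exception {
        SharedStringHandler handler = new SharedStringHandler(includePhonetic);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String si = "<si><t>役割</t><rPh sb=\"0\" eb=\"2\"><t>ヤクワリ</t></rPh>"
                  + "<phoneticPr fontId=\"1\"/></si>";
        System.out.println(extract(si, false)); // 役割
        System.out.println(extract(si, true));  // 役割ヤクワリ
    }
}
```

With the flag off, only the visible text is returned, which matches what Excel shows; with it on, the phonetic reading is appended for anyone who wants it for indexing.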
Comment 2 Michael L. 2011-12-20 16:51:32 UTC
Created attachment 28092 [details]
Example files used to reproduce InvocationTargetException
Comment 3 Michael L. 2011-12-20 16:53:31 UTC
Sorry, my attachment was for Bug #51158.
Comment 4 Christopher 2013-12-30 22:19:11 UTC
Created attachment 31165 [details]
Example file to reproduce issue

This file can be used to reproduce the issue. If you open the file using Excel and then load the file using Apache POI (streaming event model), you can see that extra characters are loaded that are not visible when opened in Excel. 

A good description of the purpose of these extra characters can be found at:

Comment 5 Shaun Kalley 2014-02-07 21:24:48 UTC
I'm also affected by this issue.  If we set a goal of having XSSFEventBasedExcelExtractor produce output that is at parity with XSSFExcelExtractor, then the phonetic text should not be included in the output.  I'm attaching a patch that achieves that goal.
Comment 6 Shaun Kalley 2014-02-07 21:27:08 UTC
Created attachment 31295 [details]
Patch to ReadOnlySharedStringsTable to address this issue
Comment 7 Jai Ganesh 2015-09-25 13:36:21 UTC
The fix is not available in versions 3.12 and 3.13 beta. Will it be available in 3.13? Shall I patch it in my deployment if the license permits?
Comment 8 Javen O'Neal 2015-09-25 15:55:40 UTC
You may be familiar with some or all of the following.

To increase your chances of having your patch accepted into trunk, include unit tests that show the changed behavior is correct (fails on old code, passes on patched code). This is especially important for new public functions.

Unit tests are located in /trunk/src/testcases (Office97 and version-agnostic unit tests) and /trunk/ooxml/src/testcases (Office 2007+ only). If your code change affects both Office 97 and 2007, try to find the base test so you don't need to write your unit test twice.
If you haven't tested your changes, you'll need to install ant, then from the command line run "ant clean test". A successful build indicates no problems. If any test fails, you'll need to look at the JUnit test results in build/test-results and build/ooxml-test-results.

When you want to signal to committers that your patch is tested and ready for inclusion, prepend [PATCH] to the bug title and add PatchAvailable to keywords.

Set up build environment, including ant: https://poi.apache.org/howtobuild.html
Submitting patches: https://poi.apache.org/guidelines.html#SubmittingPatches

POI 3.13 release candidate was made a few days ago, but it's likely your changes with corresponding unit tests can make it into POI 3.14. I expect 3.14 will be released in about 6 months, though the date depends on the status of outstanding features come release time.