Bug 51519 - [PATCH] XSSFEventBasedExcelExtractor's Japanese xlsx file processing shouldn't extract t element within rPh elemtnts.
Summary: [PATCH] XSSFEventBasedExcelExtractor's Japanese xlsx file processing shouldn'...
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.9-FINAL
Hardware: PC All
: P2 normal with 3 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
Keywords: PatchAvailable
: 60481 (view as bug list)
Depends on:
Reported: 2011-07-17 05:29 UTC by Mamoru Asagami
Modified: 2017-10-24 13:04 UTC (History)
3 users (show)

Example files used to reproduce InvocationTargetException (6.30 KB, application/octet-stream)
2011-12-20 16:51 UTC, Michael L.
Example file to reproduce issue (9.97 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2013-12-30 22:19 UTC, Christopher
Patch to ReadOnlySharedStringsTable to address this issue (2.61 KB, patch)
2014-02-07 21:27 UTC, Shaun Kalley
Details | Diff
Initial patch (8.97 KB, patch)
2017-03-07 14:58 UTC, Tim Allison
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mamoru Asagami 2011-07-17 05:29:03 UTC
rPh is show pronunciation of the data text.
It is hidden data from user point of view and not accurate.
So, it is confusing the extraction include rPh elements.

This is a sample of sharedStrings.xml.

   <t>役割</t>  <!-- role  in English --> 
   <rPh sb="0" eb="2">
      <t>ヤクワリ</t> <!-- Japanese phonic symbol called Katakana of role --> 
   <phoneticPr fontId="1" /> 
Comment 1 Nick Burch 2011-07-17 16:08:23 UTC
Maybe we should make it an option? Some people may want that data for their indexing?
Comment 2 Michael L. 2011-12-20 16:51:32 UTC
Created attachment 28092 [details]
Example files used to reproduce InvocationTargetException
Comment 3 Michael L. 2011-12-20 16:53:31 UTC
Sorry, my attachment was for Bug #51158.
Comment 4 Christopher 2013-12-30 22:19:11 UTC
Created attachment 31165 [details]
Example file to reproduce issue

This file can be used to reproduce the issue. If you open the file using Excel and then load the file using Apache POI (streaming event model), you can see that extra characters are loaded that are not visible when opened in Excel. 

A good description of the purpose of these extra characters can be found at:

Comment 5 Shaun Kalley 2014-02-07 21:24:48 UTC
I'm also confronted by this issue.  If we set a goal of having XSSFEventBasedExcel produce output that is at parity with XSSFExcelExtractor, then the phonetic text should not be included in the output.  I'm attaching a patch that achieves that goal.
Comment 6 Shaun Kalley 2014-02-07 21:27:08 UTC
Created attachment 31295 [details]
Patch to ReadOnlySharedStringsTable to address this issue
Comment 7 Jai Ganesh 2015-09-25 13:36:21 UTC
the fix is not available in the version 3.12 and 3.13 beta. Will it be available in 3.13? Shall i patch it in my deployment if the license permits?
Comment 8 Javen O'Neal 2015-09-25 15:55:40 UTC
You may be familiar with some or all of the following.

To increases your chances of having your patch accepted into trunk, include unit tests that show the changed behavior is correct (fails on old code, passes on patched code). This is especially important for new public functions.

Unit tests are located in /trunk/src/testcases (Office97 and version-agnostic unit tests) and /trunk/ooxml/src/testcases (Office 2007+ only). If your code change affects both Office 97 and 2007, try to find the base test so you don't need to write your unit test twice.
If you haven't tested your changes, you'll need to install ant, then from the command line run "ant clean test". Build success indicates no problem. If the test fails, you'll need to look at the output junit test results in build/test-results and build/ooxml-test-results.

When you want to signal to committers that your patch is tested and ready for inclusion, prepend [PATCH] to the bug title and add PatchAvailable to keywords.

Set up build environment, including ant: https://poi.apache.org/howtobuild.html
Submitting patches: https://poi.apache.org/guidelines.html#SubmittingPatches

POI 3.13 release candidate was made a few days ago, but it's likely your changes with corresponding unit tests can make it into POI 3.14. I expect 3.14 will be released in about 6 months, though the date depends on the status of outstanding features come release time.
Comment 9 Daniel Spiegelman 2016-05-19 18:38:07 UTC
Has this been built into any release?
Comment 10 Dominik Stadler 2016-05-19 18:51:42 UTC
As the state is NEW and not "RESOLVED FIXED" it is most probably not fixed yet.
Comment 11 Javen O'Neal 2016-05-20 03:01:54 UTC
The patch from comment has not been applied because it is missing unit tests. As Nick mentioned in comment 1, for backwards compatibility, we need to make rPh elements optionally available to users.
Comment 12 Dominik Stadler 2016-07-17 09:32:58 UTC
Setting to NEEDINFO to indicate that some proper unit-tests would be good to verify the changes also into the future.
Comment 13 Dominik Stadler 2016-12-14 21:23:22 UTC
*** Bug 60481 has been marked as a duplicate of this bug. ***
Comment 14 Tim Allison 2017-03-07 13:53:31 UTC
I think I'll take this one as prep for .xlsb unless there are objections.
Comment 15 Tim Allison 2017-03-07 14:58:19 UTC
Created attachment 34805 [details]
Initial patch

Any objections to adding a XSSFSharedString object that includes both the actual text and the phonetic runs?
Comment 16 Tim Allison 2017-03-08 13:43:02 UTC
Went with a new method rather than a new class to decrease the API surface.

Users can get the regular string at index i, and they can request the phonetic string at index i.

Thank you for sharing a test file!

Comment 17 Tim Allison 2017-03-08 16:29:12 UTC
For the full extraction workflow, we should enable including phonetic runs in the initializer of ReadOnlySharedTables, with default to be the legacy behavior (true).  We should also allow the user to set whether or not to include phonetic runs in the XSSFEventBasedExcelExtractor.
Comment 18 Tim Allison 2017-03-08 16:45:00 UTC