rPh holds the phonetic reading (pronunciation) of the cell text. It is hidden from the user's point of view and may not be accurate, so it is confusing when the extracted text includes rPh elements. This is a sample of sharedStrings.xml:
<si>
  <t>役割</t> <!-- "role" in English -->
  <rPh sb="0" eb="2">
    <t>ヤクワリ</t> <!-- Japanese phonetic reading of "role", written in Katakana -->
  </rPh>
  <phoneticPr fontId="1" />
</si>
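To make the difference concrete, here is a minimal sketch using only the JDK's DOM parser, not POI itself. The class name `SiTextDemo` and the method `text` are hypothetical names made up for this illustration. It collects the `<t>` text of one `<si>` entry with and without the text nested under `<rPh>`. The sample string uses Unicode escapes for the same kanji and Katakana shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Apache POI.
class SiTextDemo {
    // Same <si> entry as the sample: base text (kanji) plus a Katakana reading in <rPh>.
    static final String SAMPLE =
        "<si><t>\u5F79\u5272</t>"
      + "<rPh sb=\"0\" eb=\"2\"><t>\u30E4\u30AF\u30EF\u30EA</t></rPh>"
      + "<phoneticPr fontId=\"1\"/></si>";

    // Concatenates all <t> text in document order; <t> elements whose parent
    // is <rPh> are skipped unless includePhonetic is true.
    static String text(String siXml, boolean includePhonetic) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(siXml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder sb = new StringBuilder();
            NodeList ts = doc.getElementsByTagName("t");
            for (int i = 0; i < ts.getLength(); i++) {
                Node t = ts.item(i);
                boolean inRph = "rPh".equals(t.getParentNode().getNodeName());
                if (!inRph || includePhonetic) {
                    sb.append(t.getTextContent());
                }
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text(SAMPLE, false)); // base text only
        System.out.println(text(SAMPLE, true));  // base text followed by the reading
    }
}
```

With `includePhonetic` set to true the reading is appended directly to the visible text, which is the confusing output described in this report.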
Maybe we should make it an option? Some people may want that data for their indexing?
Created attachment 28092 [details] Example files used to reproduce InvocationTargetException
Sorry, my attachment was for Bug #51158.
Created attachment 31165 [details] Example file to reproduce issue This file can be used to reproduce the issue. If you open the file using Excel and then load the file using Apache POI (streaming event model), you can see that extra characters are loaded that are not visible when opened in Excel. A good description of the purpose of these extra characters can be found at: http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-%E2%80%94-an-unsolved-problem/
I'm also affected by this issue. If we set a goal of having XSSFEventBasedExcelExtractor produce output that is at parity with XSSFExcelExtractor, then the phonetic text should not be included in the output. I'm attaching a patch that achieves that goal.
Created attachment 31295 [details] Patch to ReadOnlySharedStringsTable to address this issue
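For readers without the attachment, the general idea can be sketched as a SAX handler that ignores character data while inside an rPh element. This is not the attached patch and not POI code: the class `SkipPhoneticHandler` and its `visibleText` method are hypothetical, and the sketch assumes a non-namespace-aware parser so that `qName` is the plain tag name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only; not the attached patch.
class SkipPhoneticHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inT;   // currently inside a <t> element
    private int rPhDepth;  // greater than 0 while inside an <rPh> element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("rPh".equals(qName)) {
            rPhDepth++;
        } else if ("t".equals(qName)) {
            inT = true;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("rPh".equals(qName)) {
            rPhDepth--;
        } else if ("t".equals(qName)) {
            inT = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep only text from <t> elements that are not nested in <rPh>.
        if (inT && rPhDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses one <si> fragment and returns the text a user would see in Excel.
    static String visibleText(String siXml) {
        try {
            SkipPhoneticHandler handler = new SkipPhoneticHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(siXml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Tracking a depth counter rather than a single boolean keeps the handler correct even if an rPh element were ever nested, at no extra cost.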
The fix is not available in versions 3.12 and 3.13 beta. Will it be available in 3.13? Should I patch it in my own deployment, if the license permits?
You may be familiar with some or all of the following.

To increase your chances of having your patch accepted into trunk, include unit tests that show the changed behavior is correct (fails on old code, passes on patched code). This is especially important for new public functions. Unit tests are located in /trunk/src/testcases (Office97 and version-agnostic unit tests) and /trunk/ooxml/src/testcases (Office 2007+ only). If your code change affects both Office 97 and 2007, try to find the base test so you don't need to write your unit test twice.

If you haven't tested your changes, you'll need to install ant, then from the command line run "ant clean test". A successful build indicates no problem. If a test fails, you'll need to look at the JUnit test results in build/test-results and build/ooxml-test-results.

When you want to signal to committers that your patch is tested and ready for inclusion, prepend [PATCH] to the bug title and add PatchAvailable to keywords.

Set up build environment, including ant: https://poi.apache.org/howtobuild.html
Submitting patches: https://poi.apache.org/guidelines.html#SubmittingPatches

A POI 3.13 release candidate was made a few days ago, but it's likely your changes with corresponding unit tests can make it into POI 3.14. I expect 3.14 will be released in about 6 months, though the date depends on the status of outstanding features come release time.
Has this been built into any release?
As the state is NEW and not "RESOLVED FIXED" it is most probably not fixed yet.
The patch from comment has not been applied because it is missing unit tests. As Nick mentioned in comment 1, for backwards compatibility, we need to make rPh elements optionally available to users.
Setting to NEEDINFO to indicate that proper unit tests are needed to verify the changes, both now and in the future.
*** Bug 60481 has been marked as a duplicate of this bug. ***
I think I'll take this one as prep for .xlsb unless there are objections.
Created attachment 34805 [details] Initial patch Any objections to adding an XSSFSharedString object that includes both the actual text and the phonetic runs?
Went with a new method rather than a new class to decrease the API surface. Users can get the regular string at index i, and they can request the phonetic string at index i. Thank you for sharing a test file! r1785965
For the full extraction workflow, we should make the inclusion of phonetic runs configurable in the constructor of ReadOnlySharedStringsTable, defaulting to the legacy behavior (true, i.e. phonetic runs are included). We should also allow the user to set whether or not to include phonetic runs in XSSFEventBasedExcelExtractor.
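A rough sketch of the proposed option shape follows. It assumes a simplified stand-in class rather than the real ReadOnlySharedStringsTable: `SimpleSharedStrings`, its `add` method, and the way entries are stored are all hypothetical. The point is only that the no-argument constructor keeps the legacy behavior (phonetic runs included), while callers can opt out explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the POI class.
class SimpleSharedStrings {
    private final boolean includePhoneticRuns;
    // Each entry holds { base text, phonetic text } for one shared string.
    private final List<String[]> entries = new ArrayList<>();

    // Default constructor keeps the legacy behavior: phonetic runs are included.
    SimpleSharedStrings() {
        this(true);
    }

    SimpleSharedStrings(boolean includePhoneticRuns) {
        this.includePhoneticRuns = includePhoneticRuns;
    }

    void add(String base, String phonetic) {
        entries.add(new String[] {base, phonetic});
    }

    // Returns the string at index i, with the phonetic text appended
    // only when the table was created with includePhoneticRuns = true.
    String getEntryAt(int i) {
        String[] entry = entries.get(i);
        return includePhoneticRuns ? entry[0] + entry[1] : entry[0];
    }
}
```

Defaulting to true means existing callers see no change in output, and an extractor-level setter could simply pass its own flag through to this constructor.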
r1786021