Bug 63323 - HwmfText's getText can throw StringIndexOutOfRange on shiftjis encoded text
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: 4.0.x-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-08 14:26 UTC by Tim Allison
Modified: 2019-04-08 19:55 UTC



Description Tim Allison 2019-04-08 14:26:06 UTC
After upgrading Tika to POI 4.1.0-rc3, one of our unit tests for correct encoding handling now fails. Multibyte character encodings need more careful handling than passing stringLength to substring. With such a charset the decoded string can have fewer chars than stringLength, so the end index runs past the end of the string:


    public String getText(Charset charset) throws IOException {
        return (new String(this.rawTextBytes, charset)).substring(0, this.stringLength);
    }
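
A minimal standalone sketch of the failure, not the POI code itself. It assumes stringLength is a byte count and that the JDK in use ships the Shift_JIS charset; the class and method names (ShiftJisLengthDemo, decodeThenCut, cutThenDecode) are made up for illustration. cutThenDecode is only one possible way to avoid the exception and is not necessarily what the eventual fix does.

```java
import java.nio.charset.Charset;

class ShiftJisLengthDemo {
    // Same pattern as getText above: decode every byte, then cut by stringLength.
    static String decodeThenCut(byte[] raw, int stringLength, Charset cs) {
        return new String(raw, cs).substring(0, stringLength);
    }

    // Hypothetical alternative: treat stringLength as a byte count and limit
    // the bytes before decoding, so no char index is ever involved.
    static String cutThenDecode(byte[] raw, int stringLength, Charset cs) {
        int n = Math.min(stringLength, raw.length);
        return new String(raw, 0, n, cs);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // Three kanji written as escapes; each encodes to two bytes in Shift_JIS.
        byte[] raw = "\u65E5\u672C\u8A9E".getBytes(sjis);
        int stringLength = raw.length; // 6 bytes, but only 3 chars once decoded

        System.out.println(cutThenDecode(raw, stringLength, sjis).length());
        try {
            decodeThenCut(raw, stringLength, sjis);
        } catch (StringIndexOutOfBoundsException e) {
            // End index 6 is past the end of a 3-char string.
            System.out.println("decodeThenCut failed: " + e.getMessage());
        }
    }
}
```

For single-byte charsets both methods return the same text, which would explain why only the multibyte test file trips the exception.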

The triggering test file is here:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testWMF_charset.wmf
Comment 1 Tim Allison 2019-04-08 15:07:29 UTC
I _think_ this corresponds to Dominik's regression test findings:

54 ERROR java.lang.StringIndexOutOfBoundsException: String index out of range: *

with this file as an example:

http://people.apache.org/~centic/poi_regression/reports/download.oldindex/br.org.camaradojapao.jp_ppt_sasaki.ppt
Comment 2 Tim Allison 2019-04-08 19:55:38 UTC
r1857135