Bug 47950

Summary: No case insensitivity handling for OLE2 entry names
Product: POI Reporter: Trejkaz (pen name) <trejkaz>
Component: POIFSAssignee: POI Developers List <dev>
Status: NEW ---    
Severity: normal    
Priority: P2    
Version: 3.5-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Windows NT   
Attachments: Initial work on changes for this bug, not fully done, some tests are now failing

Description Trejkaz (pen name) 2009-10-06 15:48:49 UTC
I created some test cases to test case sensitivity in OLE2 files.

    @Test
    public void testPoiCaseInsensitivityInMemory() throws Exception
    {
        POIFSFileSystem fs = new POIFSFileSystem();
        DirectoryEntry dir = fs.getRoot().createDirectory("A");
        dir.createDocument("B", new ByteArrayInputStream(new byte[] { 0, 1, 2, 3, 4, 5 }));

        DirectoryEntry dir2 = (DirectoryEntry) fs.getRoot().getEntry("a");
        DocumentEntry doc2 = (DocumentEntry) dir2.getEntry("b");
        assertArrayEquals("Wrong data read back", new byte[] { 0, 1, 2, 3, 4, 5 },
                          IOUtils.toByteArray(new DocumentInputStream(doc2)));
    }
    
    @Test
    public void testPoiCaseInsensitivityAfterReadingFromStorage() throws Exception
    {
        POIFSFileSystem fs = new POIFSFileSystem();
        DirectoryEntry dir = fs.getRoot().createDirectory("A");
        dir.createDocument("B", new ByteArrayInputStream(new byte[] { 0, 1, 2, 3, 4, 5 }));

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        fs.writeFilesystem(baos);

        POIFSFileSystem fs2 = new POIFSFileSystem(new ByteArrayInputStream(baos.toByteArray()));
        DirectoryEntry dir2 = (DirectoryEntry) fs2.getRoot().getEntry("a");
        DocumentEntry doc2 = (DocumentEntry) dir2.getEntry("b");
        assertArrayEquals("Wrong data read back", new byte[] { 0, 1, 2, 3, 4, 5 },
                          IOUtils.toByteArray(new DocumentInputStream(doc2)));
    }

Both of these fail looking up "a" as it doesn't exist, but the comparison is supposed to be case insensitive according to available documentation.

Specifically, [MS-CFB] has the following to say about how entries in an OLE2 directory should be compared:

(2.6.1 pg 23)

When locating an object in the compound file except for the root storage, the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree.

(2.6.4 "Red-Black Tree" pg 26)

  * For each UTF-16 code point, convert to upper-case with the Unicode Default Case Conversion
    Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> 

  *  Unicode surrogate characters are never upper-cased, since they are represented by two UTF-16
     code points, while the sorting relationship upper-cases a single UTF-16 code point at a time.

  * Lowercase characters defined in a newer, later version of the Unicode standard can be upper-
    cased by an implementation that conforms to that later Unicode standard.

Note <2> goes into further detail on which version of Unicode is used to perform the folding:

(pg 39)

For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
(table omitted for now)
For Windows Vista and Windows Server 2008: The compound files implementation conforms to the Unicode 5.0 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/5.0.0/ucd/CaseFolding.txt) with the following exceptions.
(table omitted for now)


References:

[MS-CFB]: Compound File Binary File Format, Revision 0.01 (Wednesday, June 18, 2008)
Comment 1 Dominik Stadler 2015-08-27 06:25:18 UTC
Created attachment 33042 [details]
Initial work on changes for this bug, not fully done, some tests are now failing

I tried to work on this a bit, but could not get it fully done yet, attached is the intermediate set of changes that i came up with.
Comment 2 Uwe Schindler (ASF) 2015-08-27 06:41:04 UTC
Hi, please use toUpperCase(Locale.ROOT) instead of just toUpperCase()!

The problem is otheriwse that it would fail in e.g. Turkish Locale, too. This applies to the whole POI base, but here it is explicit -> especially through its defined how the whole stuff needs to be uppercases. Unfortunately the unicode version used is depending on the JDK.
Comment 3 Trejkaz (pen name) 2015-08-27 11:38:13 UTC
toUpperCase() is the devil... along with every other method in the JRE which has an overload where you can omit the locale, the file encoding, the time zone, etc. :(

I also wish it were possible in Java to do Unicode operations against a specific version. This is something I encountered very early on when I wanted to make a Lucene tokeniser for grapheme clusters and realised that Java was going to screw me on backwards compatibility so I had to archive off my own copy of the table for all that data.

I guess the same tactic could work here - save a copy of the uppercasing table. Microsoft even said there are special cases, so it is probably the easiest way to know it's correct. Depending on how you store it, the storage cost might not even be terribly bad. ICU uses some kind of trie encoding to store this sort of thing.
Comment 4 Uwe Schindler (ASF) 2015-08-27 13:01:46 UTC
I would as a first step enable forbidden-api checks on POI, too (see https://github.com/policeman-tools/forbidden-apis). In Lucene/Solr/Elasticsearch and also TIKA this is the default since long time. This checker forbids all those methods which depend on local configuration.

Trekjaz: Exactly I would wish that you could give unicode version, too. In ICU you can do this somehow (just don't remember how). We had an issue in Solr already: One test had a list of whitespace chars and that list changed in Unicode 7, as used by coming Java 9! Result was a failing test with preview builds, because we had a character which was no longer treated as whitespace in Java 9.
Comment 5 Nick Burch 2015-08-27 13:24:52 UTC
Forbidden APIs check now available on trunk, just run "ant forbidden-apis-check"

I've also updated the dist task to run this and the rat check automatically

Quite a lot of places we fail the forbidden APIs check though! Patches / fixes very much welcomed :)
Comment 6 Uwe Schindler (ASF) 2015-08-28 06:59:23 UTC
Thanks, quite a lot of violations! Maybe to start with remove the deprecated violations (they are not so urgent). We may also turn on the "jdk-system-out", because a library like POI should never print anything to System.out/err (POI's own logger framework may need to be supressed). This also complains if you have e.printStackTrace() as generated by Eclipse!
Comment 7 Nick Burch 2015-08-28 07:26:32 UTC
(In reply to Uwe Schindler (ASF) from comment #6)
> We may also turn on the "jdk-system-out", because a library like POI
> should never print anything to System.out/err

We do ship a number of runnable debug / dev / information tools in the jars, so that's not quite universally true. They're all in specific packages though, so we could look at enabling it except for those?
Comment 8 Uwe Schindler (ASF) 2015-08-28 22:27:20 UTC
There are several possibilities: You can change the inner fileset of forbiddenapis to exclude some files, yes.

In Lucene we use an Annotation. An example is given in the forbiddenapis JAR file, but you dont want to depend on it in the code. So we defined our own @SuppressForbidden annotation in our own source code. The Ant config just gest the class name.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/util/SuppressForbidden.java

And here it is enabled:

  <!-- applies to both source and test code -->
  <target name="-check-forbidden-all" depends="-init-forbidden-apis,compile-core,compile-test">
    <forbidden-apis internalRuntimeForbidden="true" suppressAnnotation="**.SuppressForbidden" classpathref="forbidden-apis.allclasses.classpath">
      <bundledSignatures name="jdk-unsafe-${javac.target}"/>
      <bundledSignatures name="jdk-deprecated-${javac.target}"/>
      <signaturesFileSet dir="${common.dir}/tools/forbiddenApis">
        <include name="base.txt"/>
        <include name="lucene.txt" if="forbidden-isLucene"/>
      </signaturesFileSet>
      <fileset dir="${build.dir}/classes/java" excludes="${forbidden-base-excludes}"/>
      <fileset dir="${build.dir}/classes/test" excludes="${forbidden-tests-excludes}" erroronmissingdir="false"/>
    </forbidden-apis>
  </target>