I created some test cases to test case sensitivity in OLE2 files. @Test public void testPoiCaseInsensitivityInMemory() throws Exception { POIFSFileSystem fs = new POIFSFileSystem(); DirectoryEntry dir = fs.getRoot().createDirectory("A"); dir.createDocument("B", new ByteArrayInputStream(new byte[] { 0, 1, 2, 3, 4, 5 })); DirectoryEntry dir2 = (DirectoryEntry) fs.getRoot().getEntry("a"); DocumentEntry doc2 = (DocumentEntry) dir2.getEntry("b"); assertArrayEquals("Wrong data read back", new byte[] { 0, 1, 2, 3, 4, 5 }, IOUtils.toByteArray(new DocumentInputStream(doc2))); } @Test public void testPoiCaseInsensitivityAfterReadingFromStorage() throws Exception { POIFSFileSystem fs = new POIFSFileSystem(); DirectoryEntry dir = fs.getRoot().createDirectory("A"); dir.createDocument("B", new ByteArrayInputStream(new byte[] { 0, 1, 2, 3, 4, 5 })); ByteArrayOutputStream baos = new ByteArrayOutputStream(); fs.writeFilesystem(baos); POIFSFileSystem fs2 = new POIFSFileSystem(new ByteArrayInputStream(baos.toByteArray())); DirectoryEntry dir2 = (DirectoryEntry) fs2.getRoot().getEntry("a"); DocumentEntry doc2 = (DocumentEntry) dir2.getEntry("b"); assertArrayEquals("Wrong data read back", new byte[] { 0, 1, 2, 3, 4, 5 }, IOUtils.toByteArray(new DocumentInputStream(doc2))); } Both of these fail looking up "a" as it doesn't exist, but the comparison is supposed to be case insensitive according to available documentation. Specifically, [MS-CFB] has the following to say about how entries in an OLE2 directory should be compared: (2.6.1 pg 23) When locating an object in the compound file except for the root storage, the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree. (2.6.4 "Red-Black Tree" pg 26) * For each UTF-16 code point, convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> * Unicode surrogate characters are never upper-cased, since they are represented by two UTF-16 code points, while the sorting relationship upper-cases a single UTF-16 code point at a time. * Lowercase characters defined in a newer, later version of the Unicode standard can be upper- cased by an implementation that conforms to that later Unicode standard. Note <2> goes into further detail on which version of Unicode is used to perform the folding: (pg 39) For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions. (table omitted for now) For Windows Vista and Windows Server 2008: The compound files implementation conforms to the Unicode 5.0 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/5.0.0/ucd/CaseFolding.txt) with the following exceptions. (table omitted for now) References: [MS-CFB]: Compound File Binary File Format, Revision 0.01 (Wednesday, June 18, 2008)
Created attachment 33042 [details] Initial work on changes for this bug, not fully done, some tests are now failing I tried to work on this a bit, but could not get it fully done yet, attached is the intermediate set of changes that i came up with.
Hi, please use toUpperCase(Locale.ROOT) instead of just toUpperCase()! The problem is otheriwse that it would fail in e.g. Turkish Locale, too. This applies to the whole POI base, but here it is explicit -> especially through its defined how the whole stuff needs to be uppercases. Unfortunately the unicode version used is depending on the JDK.
toUpperCase() is the devil... along with every other method in the JRE which has an overload where you can omit the locale, the file encoding, the time zone, etc. :( I also wish it were possible in Java to do Unicode operations against a specific version. This is something I encountered very early on when I wanted to make a Lucene tokeniser for grapheme clusters and realised that Java was going to screw me on backwards compatibility so I had to archive off my own copy of the table for all that data. I guess the same tactic could work here - save a copy of the uppercasing table. Microsoft even said there are special cases, so it is probably the easiest way to know it's correct. Depending on how you store it, the storage cost might not even be terribly bad. ICU uses some kind of trie encoding to store this sort of thing.
I would as a first step enable forbidden-api checks on POI, too (see https://github.com/policeman-tools/forbidden-apis). In Lucene/Solr/Elasticsearch and also TIKA this is the default since long time. This checker forbids all those methods which depend on local configuration. Trekjaz: Exactly I would wish that you could give unicode version, too. In ICU you can do this somehow (just don't remember how). We had an issue in Solr already: One test had a list of whitespace chars and that list changed in Unicode 7, as used by coming Java 9! Result was a failing test with preview builds, because we had a character which was no longer treated as whitespace in Java 9.
Forbidden APIs check now available on trunk, just run "ant forbidden-apis-check" I've also updated the dist task to run this and the rat check automatically Quite a lot of places we fail the forbidden APIs check though! Patches / fixes very much welcomed :)
Thanks, quite a lot of violations! Maybe to start with remove the deprecated violations (they are not so urgent). We may also turn on the "jdk-system-out", because a library like POI should never print anything to System.out/err (POI's own logger framework may need to be supressed). This also complains if you have e.printStackTrace() as generated by Eclipse!
(In reply to Uwe Schindler (ASF) from comment #6) > We may also turn on the "jdk-system-out", because a library like POI > should never print anything to System.out/err We do ship a number of runnable debug / dev / information tools in the jars, so that's not quite universally true. They're all in specific packages though, so we could look at enabling it except for those?
There are several possibilities: You can change the inner fileset of forbiddenapis to exclude some files, yes. In Lucene we use an Annotation. An example is given in the forbiddenapis JAR file, but you dont want to depend on it in the code. So we defined our own @SuppressForbidden annotation in our own source code. The Ant config just gest the class name. http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/util/SuppressForbidden.java And here it is enabled: <!-- applies to both source and test code --> <target name="-check-forbidden-all" depends="-init-forbidden-apis,compile-core,compile-test"> <forbidden-apis internalRuntimeForbidden="true" suppressAnnotation="**.SuppressForbidden" classpathref="forbidden-apis.allclasses.classpath"> <bundledSignatures name="jdk-unsafe-${javac.target}"/> <bundledSignatures name="jdk-deprecated-${javac.target}"/> <signaturesFileSet dir="${common.dir}/tools/forbiddenApis"> <include name="base.txt"/> <include name="lucene.txt" if="forbidden-isLucene"/> </signaturesFileSet> <fileset dir="${build.dir}/classes/java" excludes="${forbidden-base-excludes}"/> <fileset dir="${build.dir}/classes/test" excludes="${forbidden-tests-excludes}" erroronmissingdir="false"/> </forbidden-apis> </target>
Note: Comment 2 - 8 are about forbidden api usage and not this bug per-se... Forbidden-API check is now fully implemented in POI as of 3.13.
Ross Spencer recently raised an issue on Tika on this topic: https://issues.apache.org/jira/browse/TIKA-4091 I might take a look.
I opened: https://github.com/apache/poi/pull/477 The one challenge is that we do rely on case sensitivity for "Book" vs "BOOK". We use the former as an indicator of the old excel format, and the latter can occur in some regular xls Crystal reports... So there are some less than great workaround to get everything to work. If there's a simpler way of doing this and/or of identifying the old excel formats, that'd be great.
Ugh, ugh, ugh. Ross Johnson. Please forgive me.
Committed the patch to fix this. Will be available with the 5.2.4 release. Thank you, Ross, PJ and Dominik!