Bug 64108

Summary: unsafe pipe character ("|") in Relationship target attribute is not being encoded into a '%7C'
Product: POI Reporter: Richard Costine <rjc>
Component: OPCAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 4.1.1-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: This is a possible patch against REL_4_1_1 to encode unsafe pipe characters and a change to unit tests to show that it works.

Description Richard Costine 2020-01-30 20:44:35 UTC
Created attachment 36989 [details]
This is a possible patch against REL_4_1_1 to encode unsafe pipe characters and a change to unit tests to show that it works.

The pipe character ("|") in a Relationship target attribute is not being encoded into a "%7C" string. The pipe character is not one of the characters considered "valid" by this RFC: 

https://tools.ietf.org/html/rfc3986#section-2

In the 4.1.1 version, a comparison is made to see if "|" (0x7C) is whitespace, or is greater than or equal to 0x80. Since that evaluates "false" for a "|" character, it is deemed "safe" (in PackagingURIHelper) instead of "unsafe". Currently determining it as "safe" prevents it from being converted to "%7C", which would be the correct thing to do. 

The original way we found this was because we use the Tika AutoDetectParser to extract text from various document types (MS-Word docx, or xslx). Tika uses POI OPC to parse those documents in Open Office format, and although it logs the original POI exception, it doesn't fail right away. Instead, it internally replaces the original Relationship element target with the URL "http://invalid.uri" and fails with an exception later in the processing because that fake url is "absolute". As a result, there's no way to report to the user the actual Relationship in the document that contained the target with the "|". 

For what it's worth, this is the original Relationship element that caused parsing to fail:

	<Relationship Id="rId13" Target="#ctl||rId16||cmdAddAction||_x0000_i1029" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"/>

Scanning the "Target" attribute failed at index 4 in that attribute's value, due to the "|" being in that place in the fragment after the "#".

The only way I found the original exception from POI OPC code was to run our code with a debugger attached to it.

Granted, the failure to include the root cause of the exception is an issue with Tika, and probably should be addressed by that group, but the POI OPC code should be considering the pipe character "unsafe" since it is not one of the characters that is specified as valid in RPC3986.

Running the unit test changes in the attached patch (it is against the REL_4_1_1 tag), and removing the change to the code, I got this exception: 

java.net.URISyntaxException: Illegal character in fragment at index 4: #ctl||rId16||cmdAddAction||_x0000_i1029
	at java.base/java.net.URI$Parser.fail(URI.java:2915)
	at java.base/java.net.URI$Parser.checkChars(URI.java:3086)
	at java.base/java.net.URI$Parser.parse(URI.java:3130)
	at java.base/java.net.URI.<init>(URI.java:600)
	at org.apache.poi.openxml4j.opc.PackagingURIHelper.toURI(PackagingURIHelper.java:724)
	at org.apache.poi.openxml4j.opc.TestPackagingURIHelper.testCreateURIFromString(TestPackagingURIHelper.java:128)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at junit.framework.TestCase.runTest(TestCase.java:176)
	at junit.framework.TestCase.runBare(TestCase.java:141)
	at junit.framework.TestResult$1.protect(TestResult.java:122)
	at junit.framework.TestResult.runProtected(TestResult.java:142)
	at junit.framework.TestResult.run(TestResult.java:125)
	at junit.framework.TestCase.run(TestCase.java:129)
	at junit.framework.TestSuite.runTest(TestSuite.java:252)
	at junit.framework.TestSuite.run(TestSuite.java:247)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:86)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
	at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
	at com.sun.proxy.$Proxy5.processTestClass(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:117)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:155)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:137)
	at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:404)
	at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
	at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
	at java.base/java.lang.Thread.run(Thread.java:834)

With the code changes to the class, the altered test passes.

The attached patch changes the method in PackagingURIHelper from this:

    private static boolean isUnsafe(int ch) {
        return ch >= 0x80 || Character.isWhitespace(ch);
    }

To this:

    private static boolean isUnsafe(int ch) {
        return ch >= 0x80 || ch == 0x7C || Character.isWhitespace(ch);
    }

Then, the pipe characters in the URI will be deemed "unsafe" and subsequently converted properly to their hexadecimal '%7C' encodings.
Comment 1 PJ Fanning 2020-01-30 21:08:50 UTC
applied with https://svn.apache.org/repos/asf/poi/trunk@1873384 - thanks for the patch
Comment 2 Richard Costine 2020-01-30 22:59:03 UTC
I'm not sure if adding this could potentially break anybody who is already expecting the "|" there to actually cause a failure. 

I suppose that we could have the code look at a System property - something like:

-Dorg.apache.poi.openxml4.opc.safePipeInURI=true|false

when true it would work like before, and when false it would encode into a "%7C".
Comment 3 Richard Costine 2020-01-30 23:30:04 UTC
Something like this code would do it:


    private static boolean isUnsafe(int ch) {
        boolean safePipeInUri = true; // assume we will fail like before.
        try {
            // set this System property to false to make it not fail
            safePipeInUri = Boolean.parseBoolean(System.getProperty("org.apache.poi.openxml4.opc.safePipeInURI", "false"));
        }
        catch (Throwable t) { } // defaults to true if the property is not readable

        // safe pipe in URI, means that a "|" will fail like before
        return safePipeInUri ? (ch >= 0x80 || Character.isWhitespace(ch))
                : (ch >= 0x80 || ch == 0x7C || Character.isWhitespace(ch));
    }
Comment 4 PJ Fanning 2020-01-31 08:23:40 UTC
We don't typically support system properties.

You've caught us in the middle of a release. If your patch is not the best behaviour for this code and only suits your use case, then we need to remove it. The RFC seems to suggest that we should the existing code had a bug but if this not the case, then I will revert the code.
Comment 5 Richard Costine 2020-01-31 15:26:50 UTC
I think that the original patch I provided should be the correct behavior, since the RFC indicates that the pipe character is not considered "valid" in a url, and would normally be encoded with a "%7C".