Bug 62273 - Add support for alternate URL specification
Summary: Add support for alternate URL specification
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Specification APIs (show other bugs)
Version: 8.5.x-trunk
Hardware: PC All
: P1 enhancement (vote)
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
: 62398 (view as bug list)
Depends on:
Blocks:
 
Reported: 2018-04-09 14:09 UTC by Piotr Suwała
Modified: 2018-05-22 14:12 UTC (History)
2 users (show)



Attachments
HttpParser patch (3.54 KB, patch)
2018-04-11 09:55 UTC, Remy Maucherat
Details | Diff
URL patch (5.41 KB, patch)
2018-04-12 11:03 UTC, Remy Maucherat
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Piotr Suwała 2018-04-09 14:09:30 UTC
Browsers won't escape '{|}' in parameters area.
https://bugzilla.mozilla.org/show_bug.cgi?id=1451347
They state that they don't have to comply.

This link "http://localhost:8090/jira/browse/SSP-15?jql=%20text%20=%20{ksi|delta}" is completely legit for firefox and chrome.

Who should adjust?
Comment 1 Remy Maucherat 2018-04-09 15:09:23 UTC
Reading RFC 3986, the query part of the URL is the same as the rest, I don't understand the answer that it is different. I guess the assertion that the query string is different makes this worth being a separate BZ. Is RFC 3986 now out of date ?

   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   query         = *( pchar / "/" / "?" )

   fragment      = *( pchar / "/" / "?" )

   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="
Comment 2 Julian Reschke 2018-04-09 15:18:49 UTC
(In reply to Remy Maucherat from comment #1)
> Reading RFC 3986, the query part of the URL is the same as the rest, I don't
> understand the answer that it is different. I guess the assertion that the
> query string is different makes this worth being a separate BZ. Is RFC 3986
> now out of date ?
> ...

No, RFC 3986 is the proper spec. The problem is that some browser vendors thought it would be ok to write their own spec.
Comment 3 Mark Thomas 2018-04-09 15:38:21 UTC
No, RFC 3986 is not out of date. It appears that the browsers are working to a different spec: https://url.spec.whatwg.org/#query-state

It is the cookie debacle all over again. Sigh. At least this time there appear to be some efforts to work to a spec - even if it isn't the RFC.

The six characters where there appears to be a difference of opinion between the RFC 3986 and the spec the browsers are working to are:
"\", "^", "`", "{", "|", "}"

I guess we could consider (optionally) relaxing the query string parsing to allow these but I don't like it. We have RFCs for a reason.
Comment 4 Remy Maucherat 2018-04-09 15:50:42 UTC
Ok ! :(

In that other spec, the query string is indeed different from the path, but the path itself may be a bit different from our validation as well.

For path:
The C0 control percent-encode set are the C0 controls and all code points greater than U+007E (~).
The fragment percent-encode set is the C0 control percent-encode set and U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`).
The path percent-encode set is the fragment percent-encode set and U+0023 (#), U+003F (?), U+007B ({), and U+007D (}). 

For the query:
"If byte is less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("), 0x23 (#), 0x3C (<), or 0x3E (>), append byte, percent encoded, to url’s query"

+1 to allow this validation optionally.
Comment 5 Remy Maucherat 2018-04-10 09:27:50 UTC
Reopening as a high priority enhancement.
Comment 6 Piotr Suwała 2018-04-10 13:36:20 UTC
There is tomcat.util.http.parser.HttpParser.requestTargetAllow={|} for 8.5.x but it was dropped for 9.0.x. However it ignores these characters possibly everywhere.

There should be flag that just allows for these characters to be allowed in params query.
Comment 7 Remy Maucherat 2018-04-11 09:55:34 UTC
Created attachment 35856 [details]
HttpParser patch

According to my understanding on that URL spec wording, here's what the HttpParser could look like. Oddities include '\\'. A system property would likely be used to configure.
Comment 8 Mark Thomas 2018-04-12 10:04:32 UTC
From memory, the previous parser was lenient to the extent that it only checked for characters that were not permitted anywhere in the request target rather than checking each part individually.

My initial thoughts were something pluggable on the connector that delegated most of the work to the HttpParser which would almost certainly need some new methods. My thinking hasn't got much beyond that though.

I'm planning on putting together something to map request target part, characters allowed and specification as I want to get a clearer understanding of what is and isn't allowed where.
Comment 9 Remy Maucherat 2018-04-12 11:03:30 UTC
Created attachment 35860 [details]
URL patch

No problem if you have bigger plans, here's the last version of my patch.
Comment 10 Mark Thomas 2018-04-12 14:17:58 UTC
I've been spending some time looking at whatwg.org URL spec. It appears, from https://bugzilla.mozilla.org/show_bug.cgi?id=1451347 that the browsers consider any string that passes the URL parser as valid and that the result of that parsing is passed to the server.

The rules for writing URLs are different but they don;t appear to apply in this case since if they did, everything we expect to be encoded in a query string would be encoded.

I've been through the path parsing defined by the whatwg.org URL spec and I believe it is the same as RFC7230 / RFC3986.

I've also been over the query part of the whatwg.org URL spec and I now believe there are 8 characters we expect to be encoded that the whatwg.org URL spec does not. They are:
"[", "]", "\", "^", "`", "{", "|", "}"

I still need to check the authority part.
Comment 11 Remy Maucherat 2018-04-12 15:27:17 UTC
Given the spec test, I thought the request target was a bit different, allowing unencoded '\\', '|' and '^'.
Same conclusion for the query part.
Comment 12 Mark Thomas 2018-04-12 19:47:58 UTC
That wasn't my reading but it isn't the easiest spec to read. It is much closer to an implementation description than a spec.

I decided to do some testing to see what the current behaviour is. Results here:
https://cwiki.apache.org/confluence/display/TOMCAT/Encoding+and+URIs

It looks like tightening up the path parsing to limit '[' and ']' would cause problems. I don't see where the whatwg.org spec allows this though.

I didn't see the automatic conversion of '\' to '/' when I read the spec either.

The query string behaviour is close to what I expected although IE is worse.

The authority part of the whatwg.org spec is very relaxed for the userinfo compared to RFC 7230 / RFC 3986 but since Tomcat just ignores the userinfo I'm not too concerned about that.

Sigh.

I'm still thinking about the best way to approach this. If we have configurable parser instances they are going to have to be per connector as the URI is parsed before it can be mapped. I'm wondering if it is worth it. Different settings per connector is likely to be unusual but on the other hand it isn't that hard to implement.

One advantage of validating each part separately is that the query string may have to allow some characters we really don't want in the path like '\'.
Comment 13 Mark Thomas 2018-04-27 08:57:42 UTC
Adding extra code to Tomcat to account for specification non-compliance of other components is the wrong solution. The right solution is to open bugs against the non-compliant components. Unfortunately, in this case, those other components are all the major browser vendors and they do not accept that their behaviour is incorrect. I have yet to see a convincing argument as to why the browsers should not implement RFC 7230 and RFC 3986.

Working around the specification non-compliant browser behaviour just encourages vendors to continue to ignore specifications and leads to greater interoperability issues in the long term. However, the alternative is to break lots of applications for lots of users. Therefore, it is with regret that I have implemented this enhancement for all currently supported Tomcat versions.

Fixed in:
- trunk for 9.0.8 onwards
- 8.5.x for 8.5.31 onwards
- 8.0.x for 8.0.52 onwards
- 7.0.x for 7.0.87 onwards
Comment 14 Trevor Robinson 2018-05-17 18:53:15 UTC
This is not acceptable to be put into a bugfix version. We use square brackets for tag filtering on query parameters, causing a lot of our queries to be suddenly bounced back with a 400. Luckily this only occurred to us on our test environment, but if we had built our production at any point since this was released, because we only pin the major and minor versions, we would have had a crisis.

At worst it should be a minor, and considering that it breaks everything (as was already commented in this thread, every major browser does not adhere to this standard. In my own testing, cURL, Lynx, Node and Perl do not either!) it should probably be a major. We should not have to carefully monitor each bugfix version of the myriad software in our stack to see if basic functionality has changed.
Comment 15 Christopher Schultz 2018-05-18 22:09:40 UTC
(In reply to Trevor Robinson from comment #14)
> This is not acceptable to be put into a bugfix version.

Um... this should UN-BREAK things, not break things. I don't think you want to wait until Tomcat 10 for your brackets to start working again, do you?
Comment 16 Mark Thomas 2018-05-22 14:12:58 UTC
*** Bug 62398 has been marked as a duplicate of this bug. ***