Bug 49396

Summary: PATH_INFO normalization, especially relating to void path segments
Product: Apache httpd-2 Reporter: theimp
Component: Core Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED LATER    
Severity: enhancement Keywords: MassUpdate
Priority: P2    
Version: 2.2.15   
Target Milestone: ---   
Hardware: PC   
OS: Linux   

Description theimp 2010-06-06 18:00:09 UTC
The PATH_INFO request variable is treated by httpd as a path, which is normalized by reducing dot segments and void path segments (an empty path segment has traditionally, on UNIX, been treated as synonymous with a dot segment, i.e. /./ ). This is almost always the desired behavior, but is technically incorrect (the variable value itself, not how it is reduced), and can cause problems when a script/module cannot match PATH_INFO against REQUEST_URI. My proposed solution is to add a RAW_PATH_INFO variable, which contains the PATH_INFO portion of the REQUEST_URI as it appears in REQUEST_URI, undecoded and unresolved (i.e. as received on the Request Line).

The rest of this report is my rationale/testing and is probably superfluous and certainly badly edited for brevity, so please feel free to ignore it unless you think you need some background.

The following URL:

/index.html/1/2//3/./4/../5

has a PATH_INFO of:

/1/2/3/5

The removal of the dot segments is correct per RFC 3986, which doesn't recognize PATH_INFO other than as part of a path, and requires that dot segments be normalized irrespective of whether they are path components or opaque tokens (it's hierarchical so it is considered that it doesn't make a difference which type they are).
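
As a rough illustration of the reduction described above, the following Python sketch drops "." segments, resolves ".." segments, and collapses void segments. It is not httpd's actual code: the function name reduce_path is made up for this example, and it assumes the input starts with a slash.

```python
def reduce_path(path):
    """Reduce dot segments and void segments in a slash-separated path.

    Illustrative only: mimics the behavior described in this report,
    not httpd's implementation.
    """
    kept = []
    segments = path.split("/")
    for seg in segments:
        if seg in ("", "."):
            # Void segment ("//") or "." segment: drop it.
            continue
        if seg == "..":
            # ".." removes the previously kept segment, if there is one.
            if kept:
                kept.pop()
            continue
        kept.append(seg)
    reduced = "/" + "/".join(kept)
    # Keep a trailing slash when the input ended with a void or dot segment.
    if kept and segments[-1] in ("", ".", ".."):
        reduced += "/"
    return reduced


print(reduce_path("/1/2//3/./4/../5"))  # prints /1/2/3/5
```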

Note that most clients and/or intervening proxies will remove dot segments as part of their own resolution before they ever send the request to httpd.

So far, this is all correct behavior.

However, in the case of a void path segment (//), there is no normalization procedure defined as per RFC 3986 (or any of the others that deal with the subject - it's almost as if they're deliberately avoiding addressing it…).

So, a URL such as the following:

/index.html/http://example.com/index2.html
                 ^^
would have a PATH_INFO of:

/http:/example.com/index2.html
      ^

And since there are fewer characters in PATH_INFO than there are in the PATH_INFO portion of REQUEST_URI, even after decoding REQUEST_URI, it becomes extremely difficult to examine REQUEST_URI to determine the non-PATH_INFO portion of the path, or the original PATH_INFO.

Now, in this example, the slashes after http: are character data and not path separators, and so they should be encoded as %2F, but there is no way for the client to know to do this because it cannot differentiate between what is the PATH_INFO and what is the path. Only the server knows this, and it only knows it when it decides what script to call. The author of the URL is at fault, but the script has to deal with it anyhow, just like any other invalid data. And while the script might just be able to throw back an HTTP 400 error (or other error of its choice), scripts that need the original URI (for example, for logging) without the PATH_INFO portion can't get it from REQUEST_URI (or anywhere else) even after normalization, because the normal procedure of simply removing (length PATH_INFO) characters from a normalized REQUEST_URI won't work if extra characters have been removed.
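
To make the failure concrete, here is a small Python sketch of the "remove (length PATH_INFO) characters" procedure. The helper name naive_script_part is hypothetical, and the PATH_INFO values are written with a leading slash, as PATH_INFO normally has one.

```python
from urllib.parse import unquote


def naive_script_part(request_uri, path_info):
    """Strip len(path_info) characters from the end of the decoded path."""
    # Drop any query string, then percent-decode the path.
    path = unquote(request_uri.split("?", 1)[0])
    if not path_info:
        return path
    return path[:-len(path_info)]


# No void segments: the lengths agree and the split is right.
print(naive_script_part("/index.html/1/2/3/5", "/1/2/3/5"))
# prints /index.html

# One slash was collapsed, so one character too few is removed.
print(naive_script_part("/index.html/http://example.com/index2.html",
                        "/http:/example.com/index2.html"))
# prints /index.html/
```

The second result keeps a stray trailing slash, and nothing in the result itself tells the script that it is wrong.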

(Not that the default httpd configuration would support such a PATH_INFO if it did have encoded slashes, but if you're expecting to deal with non-filesystem PATH_INFOs, it'd be up to you to know that you'd have AllowEncodedSlashes on.)

The only way that a script can recover the URL sans PATH_INFO is by comparing the end of a decoded REQUEST_URI (as many characters from the right as there are in PATH_INFO) with the PATH_INFO. If they don't match, it must work backwards along REQUEST_URI looking for dot and void segments to add back into PATH_INFO until it matches (with special handling for segments at the very beginning of the PATH_INFO). Only then is what's left of REQUEST_URI the non-PATH_INFO portion of the URL. The script must then apply its own segment resolution to the recovered PATH_INFO, without collapsing void paths, to get the PATH_INFO. (Even this cannot be done if the last character of the script as given in REQUEST_URI is an unencoded period ".", which would be rare and silly, but could happen.)
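
One way to approximate this recovery in Python is sketched below. It is a simplification rather than the exact backward walk described above: it tries each slash in the decoded path, from the left, and accepts the first (longest) tail whose reduction equals PATH_INFO. The names reduce_path and split_request_path are hypothetical, the reduction rules are an assumption about how httpd normalizes, and the ambiguous cases mentioned above stay ambiguous.

```python
from urllib.parse import unquote


def reduce_path(path):
    """Drop void and "." segments and resolve ".." (assumed reduction)."""
    kept = []
    segments = path.split("/")
    for seg in segments:
        if seg in ("", "."):
            continue
        if seg == "..":
            if kept:
                kept.pop()
            continue
        kept.append(seg)
    reduced = "/" + "/".join(kept)
    # Keep a trailing slash when the input ended with a void or dot segment.
    if kept and segments[-1] in ("", ".", ".."):
        reduced += "/"
    return reduced


def split_request_path(request_uri, path_info):
    """Return (script_part, unreduced_path_info), or None if no tail fits."""
    # Drop any query string, then percent-decode the path.
    path = unquote(request_uri.split("?", 1)[0])
    if not path_info:
        return path, ""
    for i, ch in enumerate(path):
        # Only a slash can start the PATH_INFO portion.
        if ch == "/" and reduce_path(path[i:]) == path_info:
            return path[:i], path[i:]
    return None
```

For example, split_request_path("/index.html/http://example.com/index2.html", "/http:/example.com/index2.html") returns ("/index.html", "/http://example.com/index2.html"). Both parts are in decoded form, so this still does not give back the bytes exactly as received.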

Certainly, I would agree that it's dumb to use the PATH_INFO for anything other than true files, as implied by RFC 3875 (you should use the Query string instead). The point is that even if you ARE using PATH_INFO only for normal files, when you do get certain kinds of requests (valid files or not), you can't isolate PATH_INFO from the REQUEST_URI. This realization came from the debugging of deliberately malformed URLs as a robustness test.

Changing the path resolution engine to not reduce void path segments in PATH_INFO means that special code must be written for the resolution of PATH_INFO (and it looks like a whole new subrequest, at least). Also, using a different resolution for PATH_INFO than is used for all other resolutions will probably break almost every existing script in the universe that uses it, if they encounter such a URL. While it is very unfortunate that a fundamental assumption of RFC 3875 is that all URLs implicitly map to files on a filesystem, that is indeed by far the most common use case (or certainly was, at the time).

By far the easiest, most compatible way of dealing with this is to add a variable like RAW_PATH_INFO that doesn't feature path normalization or escape decoding; it's simply lopped off of the end of REQUEST_URI. Anyone who has never cared can continue to not care, and anyone else can easily get what they need.

I'm not really sure of whether this constitutes a bug or a feature request.

Strictly speaking, reducing void path segments is not required by URL-related specs, and is implicitly prohibited (that is, they MAY be significant, and you can't just remove significant data because of assumptions like that they represent a filesystem path). So, technically, the specific behavior of removing void path segments from PATH_INFO is a bug.

On the other hand, it IS the desired behavior; the PATH_INFO is specifically intended to represent a filesystem path. Almost every script/module ever written assumes that it will be a properly-formatted path (especially since RFC 3875 requires that it be unencoded). And the way that it is currently determined makes it very inefficient (or, complex) to fix.

Also, changing the resolving for PATH_INFO to preserve void segments will not entirely solve the discussed problem with it, because dot segments will still, correctly, be removed and the length of the PATH_INFO in the REQUEST_URI will remain as inscrutable as ever for such URLs.

So, adding the above-mentioned RAW_PATH_INFO would defer the argument over whether void path segments are significant, but that's nothing less than a naked feature request. So I classed it as a feature request.

Comment 1 William A. Rowe Jr. 2018-11-07 21:08:47 UTC
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.