Bug 32328

Summary: Make mod_rewrite escaping optional / expose internal map functions
Product: Apache httpd-2
Reporter: Christian Parpart <trapni>
Component: mod_rewrite
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: REOPENED ---
Severity: enhancement
CC: andrey, lowzl, nmix
Priority: P1
Version: 2.5-HEAD
Target Milestone: ---
Hardware: All
OS: All
URL: http://meta.wikimedia.org/wiki/Using_a_very_short_URL#httpd.conf
Attachments: adds the ampescape function
adds the ampescape function

Description Christian Parpart 2004-11-19 22:44:18 UTC
This patch will add ampersand escaping to apache2 as  
recently posted to the dev@httpd list. 
 
example use (from URL above): 
RewriteMap ampescape int:ampescape 
RewriteRule ^/(.*)$ /index.php?title=${ampescape:$1} [L,QSA] 
 
regards, 
Christian Parpart.
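
A minimal sketch of the intended effect, in Python rather than the C of the patch. It assumes the map does nothing more than replace each literal ampersand with %26; the names amp_escape and rewrite are illustrative and are not taken from the attachment.

```python
def amp_escape(value: str) -> str:
    """Replace each literal '&' with its percent-encoded form '%26'."""
    return value.replace("&", "%26")


def rewrite(path: str) -> str:
    """Model of the example rule: ^/(.*)$ -> /index.php?title=${ampescape:$1}."""
    # Assumption: the path has already been unescaped by the server.
    captured = path[1:] if path.startswith("/") else path
    return "/index.php?title=" + amp_escape(captured)


print(rewrite("/AT&T"))  # /index.php?title=AT%26T
```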
Comment 1 Christian Parpart 2004-11-19 22:47:47 UTC
Created attachment 13507 [details]
adds the ampescape function

And here is the patch.
Comment 2 Christian Parpart 2004-11-19 23:06:22 UTC
Created attachment 13508 [details]
adds the ampescape function

* adapted patch to ASF's coding style
* the old patch was against 2.0.52, this patch is against HEAD
Comment 3 Paul Querna 2005-04-06 23:06:49 UTC
Looks good here, and this has been added as a default patch in Gentoo.
Comment 4 André Malo 2005-04-07 05:33:59 UTC
too special, as discussed on dev (long time ago).
So I'm still -1 on it.
Comment 5 Christian Parpart 2005-04-07 05:37:12 UTC
André Malo, maybe you have the time for writing the proposed more-generic 
extension? Because I (in my case) actually don't have it :( 
Comment 6 Mads Toftum 2005-04-07 08:11:15 UTC
I have to agree with André's -1 - we could be adding a bunch of these to handle
all sorts of special cases and that hardly makes sense.
If anything, we need a way to do something like the unix tr, not a special case
"hack".
Comment 7 André Malo 2005-07-20 08:34:58 UTC
Could someone tell me what the problem (?) described at that URL has to do with
the patch? The "obvious" RewriteRule there is just plain wrong:

RewriteRule ^/(.*)\?(.*)$ /index.php?title=$1&$2 [L]

RewriteRules don't match the querystring. Period. There's no known issue about
it. The obvious rule would be:

RewriteRule ^/(.*) /index.php?title=$1 [L,QSA]

What am I missing?
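
The point about the query string can be modelled in a few lines of Python. This is a simplified sketch, not mod_rewrite's implementation: apply_rule and its parameters are made-up names, only $N back-references are handled, and QSA is modelled as plain concatenation with "&".

```python
import re


def apply_rule(path, query, pattern, substitution, qsa=False):
    """Simplified model of a RewriteRule: the pattern sees only the path.

    Returns (new_path, new_query), or None if the pattern does not match.
    """
    match = re.search(pattern, path)
    if match is None:
        return None
    target = substitution
    for index, group in enumerate(match.groups(), start=1):
        target = target.replace(f"${index}", group or "")
    if "?" in target:
        # A query string in the substitution replaces the original one,
        # unless QSA asks for the original to be appended.
        new_path, new_query = target.split("?", 1)
        if qsa and query:
            new_query = new_query + "&" + query
    else:
        new_path, new_query = target, query
    return new_path, new_query


# Request: GET /Foo?action=edit  ->  path "/Foo", query "action=edit"
print(apply_rule("/Foo", "action=edit", r"^/(.*)\?(.*)$", "/index.php?title=$1&$2"))
# None: the path never contains the "?", so the rule cannot match
print(apply_rule("/Foo", "action=edit", r"^/(.*)", "/index.php?title=$1", qsa=True))
# ('/index.php', 'title=Foo&action=edit')
```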
Comment 8 Low Zhen Lin 2005-08-31 11:46:08 UTC
The problem with rewriting /(.*) to /index.php?title=$1 is that a $1 containing &
would not be escaped correctly, even if the user's URL had escaped & as %26.

For example, /AT%26T would be rewritten to /index.php?title=AT&T instead of
/index.php?title=AT%26T - causing title to only contain 'AT' instead of the
expected 'AT&T'. 

I think this patch is important even though it is too special, because & is an
important character in query strings - just as / is a very important character
in path strings - and it is quite possible that this case would occur more often
with other web applications if people made more use of mod_rewrite.
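
The AT%26T case can be reproduced outside Apache with Python's urllib.parse. This is only a model of the behaviour described in this comment: naive_rewrite is a hypothetical name, and it assumes the path is unescaped once before the substitution is built.

```python
from urllib.parse import parse_qs, unquote


def naive_rewrite(raw_path: str) -> str:
    """Model: the path is unescaped, then substituted verbatim into the query."""
    decoded = unquote(raw_path)  # "/AT%26T" -> "/AT&T"
    return "/index.php?title=" + decoded[1:]


url = naive_rewrite("/AT%26T")
query = url.split("?", 1)[1]
print(url)                       # /index.php?title=AT&T
print(parse_qs(query))           # {'title': ['AT']}
print(parse_qs("title=AT%26T"))  # {'title': ['AT&T']}
```

The second line of output shows the truncated title the application ends up with; the third shows what it would receive if the ampersand had stayed escaped.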
Comment 9 Christian Stadler 2005-09-18 03:35:58 UTC
From the latest patch:
unsigned char *copy = (char *)apr_palloc(r->pool, 3 * strlen(key) + 3);

shouldn't that be 
char *copy = (char *)apr_palloc(r->pool, 3 * strlen(key) + 3);
since you're doing a cast to (char *) instead of (unsigned char *) _and_ since the
function returns char * instead of unsigned char * as per its definition?
Comment 10 Christian Parpart 2005-09-18 22:46:36 UTC
Yeah, that makes sense in any case; however, there are more "unsigned" that
might be eliminated then.

Some (longer) time ago, members of the httpd-dev mailing list recommended
writing a MORE GENERIC variant of this patch. I can't remember exactly, but it
should be done anyway in order to get something like this functionality in.

(I'm still not that familiar with this kind of Apache API anyway :(
Comment 11 Steven Wittens 2006-06-21 13:01:06 UTC
The same problem occurs with # (%23) and is even more destructive there:

RewriteRule ^/(.*) /index.php?title=$1&something=else

/Foo%23Bar
will get rewritten to:
/index.php?title=Foo#Bar&something=else

The 'Bar&something=else' is interpreted as a fragment identifier (i.e. page anchor) and ignored on the 
server side. The proposed patch is pretty short-sighted because it only treats one symptom, not the 
cause.
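
A quick way to see the split is to run the rewritten string through a generic URL parser. The sketch below uses Python's urllib.parse.urlsplit; it shows how a standards-following parser reads the string, not what Apache does internally.

```python
from urllib.parse import urlsplit

parts = urlsplit("/index.php?title=Foo#Bar&something=else")
print(parts.path)      # /index.php
print(parts.query)     # title=Foo
print(parts.fragment)  # Bar&something=else
```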

Why does mod_rewrite need to unescape these characters in the first place? Special characters like & 
and # do not mean the same as %26 and %23 within the context of a URL. By unescaping, this 
information is being lost...

At the very least, this unescaping should be optional.

I think you can fix most issues by just using the 'escape' RewriteMap on the substitute, but this is far 
from practical as it needs to be set globally for the entire server. This rules it out for hosted 
environments where usually the most you get is .htaccess. Is there any reason why the built-in map 
functions (toupper, tolower, escape, unescape) still need a very redundant RewriteMap directive?

So I guess the optimal solution would be either to:
- allow you to turn off this automatic unescaping with a RewriteRule flag (or similar) in .htaccess
- or allow you to use the built-in map functions directly without requiring those redundant RewriteMap 
directives
Comment 12 Bob Ionescu 2007-01-25 19:01:41 UTC
(In reply to comment #11)
> Why does mod rewrite need to unescape these characters in the first place?
> Special characters like & and # do not mean the same as %26 and %23 within
> the context of a URL. By unescaping, this information is being lost...

At the very beginning, when the internal request processing starts, Apache
unescapes the URL-path once. This is not done by mod_rewrite; it happens
before mod_rewrite is involved, and I think it is also part of the security
concept.

If you are using your rewrite rules in directory context, you have a filename (a
physical path, e.g. /var/www/abc) while the per-dir prefix is stripped (so
you're matching only against the local path 'abc' if your rules are stored in
/var/www/). How would you map some unescaped URL-path to the file system?
There's no way to make the unescaping process optional for a physical path in
directory context.

URL-path and QueryString have different rules for encoding. The QueryString is
left untouched (by browser [except spaces] and server) while reserved and
special chars in the URL-path must be requested hex-encoded by the client.
Apache unescapes URL-path in order to process the request.

A way to soften this problem would be a map function which encodes all
non-[a-zA-Z0-9/,._-] characters into their %FF hex representation as discussed
above.
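
As a rough sketch of what such a map function could compute, here is the same idea in Python. The name escape_unsafe is hypothetical, the safe set is the character class quoted above, and encoding non-ASCII characters as UTF-8 bytes is an assumption of this sketch.

```python
import re

# Everything outside the class named above is considered unsafe.
_UNSAFE = re.compile(r"[^a-zA-Z0-9/,._-]")


def escape_unsafe(value: str) -> str:
    """Percent-encode every character outside [a-zA-Z0-9/,._-]."""
    return _UNSAFE.sub(
        lambda m: "".join(f"%{byte:02X}" for byte in m.group(0).encode("utf-8")),
        value,
    )


print(escape_unsafe("AT&T"))     # AT%26T
print(escape_unsafe("Foo#Bar"))  # Foo%23Bar
print(escape_unsafe("a b/c.d"))  # a%20b/c.d
```

Because & and # are both outside the safe set, one function of this kind would cover the cases from comments 8 and 11 without a special case per character.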

If you need the unescaped uri with all its consequences, use the ENV
THE_REQUEST, which contains the full untouched request string like
GET /foo%20bar?foo=bar HTTP/1.1

BTW: You can also analyze $_SERVER['REQUEST_URI'] within your php script and set
the variable 'title' there. That would be another workaround for scripts (typo3
is using this method).
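
The application-side workaround could look roughly like this. The sketch is in Python rather than PHP, title_from_request_line is a made-up name, and it assumes the title is the whole path after the leading slash of a request line in the THE_REQUEST form shown above.

```python
from urllib.parse import unquote


def title_from_request_line(request_line: str) -> str:
    """Extract the page title from a raw request line such as THE_REQUEST.

    Assumes the form 'METHOD SP request-target SP HTTP-version'.
    """
    method, target, version = request_line.split(" ", 2)
    raw_path = target.split("?", 1)[0]  # still percent-encoded
    if raw_path.startswith("/"):
        raw_path = raw_path[1:]
    return unquote(raw_path)  # decoded exactly once, inside the application


print(title_from_request_line("GET /AT%26T?action=edit HTTP/1.1"))  # AT&T
print(title_from_request_line("GET /foo%20bar?foo=bar HTTP/1.1"))   # foo bar
```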
Comment 13 Bob Ionescu 2007-01-25 19:03:10 UTC
*** Bug 39739 has been marked as a duplicate of this bug. ***
Comment 14 Mika Lindqvist 2007-03-08 13:11:29 UTC

*** This bug has been marked as a duplicate of 23295 ***
Comment 15 Bob Ionescu 2007-03-09 10:06:06 UTC
This PR is an enhancement request to implement a new internal map function,
which still needs to be written in a more generic way.