50357 – improve matching mechanisms for mime type and encoding

Summary: improve matching mechanisms for mime type and encoding

Status:	NEW

Alias:	None

Product:	Apache httpd-2
Classification:	Unclassified
Component:	mod_mime
Version:	2.5-HEAD
Hardware:	All All

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Apache HTTPD Bugs Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-11-28 16:56 UTC by Christoph Anton Mitterer
Modified:	2010-11-28 19:02 UTC
CC List:	0 users

Description Christoph Anton Mitterer 2010-11-28 16:56:24 UTC

Hi.

As far as I understand, RFC 2616 allows:
- exactly one Content-Type (optionally with params like charset), describing the actual data
- Content-Language, describing the actual data
- one or more Content-Encodings, describing the encodings of the actual data

Thus e.g.:
- x.pdf.html => should be just type text/html
- x.gz.Z => should have _no_ type (or application/octet-stream) and encoding "gzip, compress"
- x.html.gz.Z => should have type text/html and encoding "gzip, compress"
- x.html.Z.gz => should have type text/html and encoding "compress, gzip"

IMHO the following should also be right:
- x.gz.Z.html => should have type text/html and _no_ encoding
- x.Z.gz.html => should have type text/html and _no_ encoding

With the way the Add* directives from mod_mime currently work, even when used together with <Files> and/or <FilesMatch>, it's nearly impossible to implement this correctly, especially when considering crazy things like:
x.Z.gz.html.gz.Z.gz.gz
(which should be IMHO type text/html + encoding "gzip, compress, gzip, gzip")


As far as I can see, things like "x.pdf.html" are already handled correctly, as the most right type takes precedence.
However, currently, with:
AddEncoding compress Z
AddEncoding gzip gz
one would get
"compress, gzip, gzip, compress, gzip, gzip" in the above example, instead of:
"                gzip, compress, gzip, gzip"

So IMHO a possible solution would be that AddEncoding matches only those extensions that come after the last (most right) extension identified as a type extension.
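
To make the difference concrete, here is a minimal Python sketch of both behaviours. It is only a simplified model, not mod_mime's actual code: the tables TYPES and ENCODINGS and both function names are hypothetical stand-ins for AddType/AddEncoding, and extensions are simply lower-cased before lookup.

```python
# Hypothetical extension tables standing in for AddType / AddEncoding.
TYPES = {"html": "text/html", "pdf": "application/pdf"}
ENCODINGS = {"gz": "gzip", "z": "compress"}


def split_extensions(filename):
    """Return the lower-cased extensions of a file name, left to right."""
    return [ext.lower() for ext in filename.split(".")[1:]]


def current_behaviour(filename):
    """Simplified model: every known extension counts, wherever it appears."""
    mime = None
    encodings = []
    for ext in split_extensions(filename):
        if ext in TYPES:
            mime = TYPES[ext]  # the most right type wins
        elif ext in ENCODINGS:
            encodings.append(ENCODINGS[ext])
    return mime, encodings


def proposed_behaviour(filename):
    """Only encodings to the right of the last type extension count."""
    exts = split_extensions(filename)
    type_positions = [i for i, ext in enumerate(exts) if ext in TYPES]
    last_type = type_positions[-1] if type_positions else -1
    mime = TYPES[exts[last_type]] if type_positions else None
    encodings = [ENCODINGS[e] for e in exts[last_type + 1:] if e in ENCODINGS]
    return mime, encodings
```

Under these assumptions, current_behaviour("x.Z.gz.html.gz.Z.gz.gz") collects all six encodings, while proposed_behaviour keeps only the four that follow ".html".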


I'm however not sure how to best handle charset and language with this.
Probably one should simply allow them at any position, so that we'd get the following to match:

name.(lang|charset)*.(type)*.(lang|charset)*.(encoding)*

with * = zero or more

Which should mean:
- type = the most right defined type at allowed positions from above
- charset = the most right defined charset at allowed positions from above
- lang = all langs from the allowed positions above, in that order
- encoding = all encodings from the allowed positions above, in that order
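
The rules above could be sketched in Python roughly as follows. This is an illustration under assumptions, not an implementation proposal for mod_mime: the four tables and the helper names are hypothetical, the tables are assumed to be disjoint, and the groups are consumed greedily from the right so that anything left over is treated as part of "name".

```python
# Hypothetical, mutually disjoint extension tables.
TYPES = {"html": "text/html", "pdf": "application/pdf"}
CHARSETS = {"utf8": "UTF-8", "latin1": "ISO-8859-1"}
LANGS = {"en": "en", "de": "de"}
ENCODINGS = {"gz": "gzip", "z": "compress"}


def take_while(exts, tables):
    """Pop extensions off the right end while they belong to one of the tables."""
    taken = []
    while exts and any(exts[-1] in table for table in tables):
        taken.append(exts.pop())
    taken.reverse()  # restore left-to-right order
    return taken


def resolve(filename):
    """Match name.(lang|charset)*.(type)*.(lang|charset)*.(encoding)*."""
    exts = [ext.lower() for ext in filename.split(".")[1:]]
    # Consume the four groups right to left; the remainder belongs to the name.
    encodings = take_while(exts, [ENCODINGS])
    right = take_while(exts, [LANGS, CHARSETS])
    types = take_while(exts, [TYPES])
    left = take_while(exts, [LANGS, CHARSETS])
    lang_or_charset = left + right
    charsets = [CHARSETS[e] for e in lang_or_charset if e in CHARSETS]
    return {
        "type": TYPES[types[-1]] if types else None,        # most right type
        "charset": charsets[-1] if charsets else None,      # most right charset
        "languages": [LANGS[e] for e in lang_or_charset if e in LANGS],
        "encodings": [ENCODINGS[e] for e in encodings],
    }
```

For example, resolve("x.gz.Z.html") stops consuming at "Z", so "gz" and "Z" end up in the name and no encoding is reported.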


As I already described in #50356, we currently also map files like
".html" (I do mean "^\.html$", exactly that name) to type "text/html", but not files with the name "html", right?

IMHO even the former case, ".html" should _not_ be matched.
So I propose, that the definition of "name" from above is (in regular expression):
^.+$
meaning, any number of characters but at least one.
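
A small Python sketch of what a non-empty "name" would mean in practice. The function split_name and the regular expression are assumptions made for illustration only; the point is that a file called ".html" ends up with the whole string as its name and no extensions at all.

```python
import re

# name: at least one character, as short as possible;
# extensions: zero or more ".something" groups at the end.
NAME_AND_EXTS = re.compile(r"(?P<name>.+?)(?P<exts>(?:\.[^.]+)*)", re.DOTALL)


def split_name(filename):
    """Split a file name into (name, [extensions]); the name is never empty."""
    match = NAME_AND_EXTS.fullmatch(filename)
    if match is None:  # only happens for the empty string
        raise ValueError("empty file name")
    exts = match.group("exts")
    return match.group("name"), exts.split(".")[1:] if exts else []
```

With this definition "x.html.gz" splits into the name "x" plus two extensions, whereas ".html" and "html" both have no extension to match against.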



mod_mime_magic should be adapted as required.


What do you think?

Comment 1 Christoph Anton Mitterer 2010-11-28 17:42:07 UTC

IMHO the best way to implement this is by adding a new directive which allows setting a "format string" that defines how matching is done.
The default string could be one that simply equals the current behaviour of matching anything at any position.

Maybe something like this:
1) ExtensionsMatchFormat "formatstring"

Where formatstring is a PCRE with the following additional special symbols:

%t = _one_ previously defined type
%c = _one_ previously defined charset
%e = _one_ previously defined encoding
%l = _one_ previously defined language
%h = _one_ previously defined handler
%i = _one_ previously defined input filter
%o = _one_ previously defined output filter

Currently this is, as far as I can see, something like this:
(?i)(.%t|.%c|.%e|.%l|.%h|.%i|.%o)*
meaning, any number of extensions of any given type at any position.

Note that this allows to change the separation character ("."), and to make it case-insensitive or not.
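
One way to picture the format string is as a template that gets expanded into an ordinary regular expression. The following Python sketch shows that idea only; ExtensionsMatchFormat is a proposed directive, not an existing one, and the DEFINED table and the expand function are hypothetical. The sketch escapes the separator as "\." and uses Python's re module instead of PCRE.

```python
import re

# Hypothetical extensions "previously defined" via AddType, AddEncoding, etc.
DEFINED = {
    "%t": ["html", "pdf"],
    "%e": ["gz", "Z"],
    "%l": ["en", "de"],
    "%c": ["utf8"],
}


def expand(format_string):
    """Replace each %x symbol with an alternation of its defined extensions."""
    def alternation(match):
        exts = DEFINED.get(match.group(0), [])
        if not exts:
            return "(?!)"  # nothing defined for this symbol: never matches
        return "(?:" + "|".join(re.escape(ext) for ext in exts) + ")"
    return re.sub(r"%[tcelhio]", alternation, format_string)


# The "any extension at any position" format, with escaped dots.
ANY_POSITION = expand(r"(?i)(\.%t|\.%c|\.%e|\.%l|\.%h|\.%i|\.%o)*")
```

Here re.fullmatch(ANY_POSITION, ".html.GZ") succeeds because of the (?i) flag, while ".html.exe" is rejected since "exe" is not defined anywhere.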


2) One needs further Directives which specify which matches are to be used.
Examples for type, charset, encoding, language could be:

TypesMatch formatstring1 formatstring2
CharsetMatch formatstring1 formatstring2
...
where:
formatstring1 determines which of the %t's (extensions-groups) from above to use with formatstring2, with:
\* => Use all
\n => Use the n-th one
e.g. in "(?i)(.%t|.%c|.%e|.%l|.%h|.%i|.%o)*" from above:
TypesMatch "\*" ...
would mean, consider all (valid) %t-groups when matching formatstring2
TypesMatch "\1\3" ...
would mean, consider the (valid) 1st and the (valid) 3rd when matching formatstring2
So ".pdf.gz.html.Z.png.Z.Z.txt" would yield the following:
with "\*": ".pdf.html.png.txt"
with "\1\3" ".pdf.png"
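
The selection step can be sketched in Python like this. It is only an illustration of the "\*" and "\n" selectors described above; TYPE_EXTS and select_groups are hypothetical names, and multi-digit selectors such as "\13" are read as a single number here, which the proposal does not specify.

```python
import re

# Hypothetical set of extensions that were defined as types.
TYPE_EXTS = {"pdf", "html", "png", "txt"}


def select_groups(extensions, selector, known=TYPE_EXTS):
    """Apply a formatstring1-style selector to the matching extension groups."""
    matches = [ext for ext in extensions.split(".") if ext in known]
    if selector == r"\*":
        chosen = matches  # use all matching groups
    else:
        # "\1\3" selects the 1st and 3rd matching group (1-based).
        indices = [int(n) for n in re.findall(r"\\(\d+)", selector)]
        chosen = [matches[i - 1] for i in indices if 1 <= i <= len(matches)]
    return "".join("." + ext for ext in chosen)
```

Called with ".pdf.gz.html.Z.png.Z.Z.txt", the two selectors reproduce the two results listed above.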

formatstring2 decides which of the results from formatstring1 should actually be used in the end, with (again some kind of a PCRE):
"*" => use all
"($U)\.*$" => use the last (most right) extension (I hope my ungreedy PCRE is correct)




So my proposal from comment #1 could look like the following (I hope my PCREs are correct ;) ):
ExtensionsMatchFormat "^.*.+(\.%l|\.%c)*(\.%t)*(\.%l|\.%c)*(\.%e)*$"


TypesMatch "\*" "($U)\.*$"
=> concatenate all %t-groups
=> take the last one from them

CharsetMatch "\*" "($U)\.*$"
=> concatenate all %c-groups
=> take the last one from them


LangMatch "\*" "*"
=> concatenate all %l-groups
=> take all of them

EncodingMatch "\*" "*"
=> concatenate all %e-groups
=> take all of them

Comment 2 Christoph Anton Mitterer 2010-11-28 19:02:42 UTC

This should also make it easier to handle (arguably stupid) cases like
x.tgz.Z, which should become
type: application/x-tar
encoding: gzip, compress
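
A rough Python sketch of how such a case might resolve, assuming that a single extension like "tgz" may carry both a type and an encoding. The EXT table and the function name are hypothetical, and the rule that a type extension discards encodings to its left is taken from the proposal in the description.

```python
# Hypothetical table: extension -> (type or None, encoding or None).
EXT = {
    "tar": ("application/x-tar", None),
    "tgz": ("application/x-tar", "gzip"),
    "gz": (None, "gzip"),
    "z": (None, "compress"),
}


def resolve_type_and_encodings(filename):
    """Resolve type and encodings when one extension can supply both."""
    mime, encodings = None, []
    for ext in (e.lower() for e in filename.split(".")[1:]):
        ext_type, ext_encoding = EXT.get(ext, (None, None))
        if ext_type is not None:
            mime = ext_type
            encodings = []  # encodings left of a type extension are dropped
        if ext_encoding is not None:
            encodings.append(ext_encoding)
    return mime, encodings
```

With this table, "x.tgz.Z" and "x.tar.gz.Z" resolve to the same type and encoding list.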